A Tutorial on Clustering Analysis for Bayesian Graphical Models using the R packages bgms and easybgm
Introduction
This is a short software tutorial accompanying the paper “Stochastic Block Models for Clustering in Psychological Networks.” The paper introduces the Stochastic Block Model as a prior on the network structure of the ordinal Markov random field (cf., Marsman et al., in press). Apart from providing an elegant approach for incorporating the theoretical assumption of clustering directly into the statistical model, the approach comes with the methodological benefit of allowing researchers to infer the number of clusters in the network model given the data at hand, as well as to test hypotheses about clustering.
In this brief tutorial, we first illustrate how this analysis can be done using the R package bgms (Marsman et al., 2023), and then using the user-friendly wrapper package easybgm (Huth et al., 2024). We will start with the package bgms, which includes the core code to perform the analysis. Then, we will demonstrate how the same analysis can be performed using easybgm, which offers a more accessible way to run the same procedure for those without a strong background in R.
bgms
Currently, the version of bgms that includes the SBM prior is not yet available on CRAN. Therefore, we need to use the following code to install the stable GitHub version:
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("MaartenMarsman/bgms")
Now, say we wish to analyze whether the network structure inferred from our empirical dataset exhibits clustering. To do this, we estimate the most representative vector of node allocations given the data, which simply denotes the cluster to which each node is assigned, thereby providing insight into potential clustering in the network. We also wish to test hypotheses about clustering using the cluster Bayes factors. All of this can be achieved using the bgm function in the bgms package.
The first step is to estimate the model using the bgm function, setting the edge_prior argument to "Stochastic-Block". There are also additional arguments relevant to the SBM. One is the concentration parameter of the Dirichlet prior on the cluster allocation probabilities. This parameter controls the expected cluster sizes: higher values lead to more uniform cluster sizes, while lower values promote more uneven, possibly sparse cluster allocations. The default is dirichlet_alpha = 1. Another is the rate parameter (\(\lambda\)) of the truncated Poisson prior, which governs the prior on the number of clusters. This parameter reflects the expected number of clusters before observing the data; by default, lambda = 1, which implies no prior assumption of clustering. In the example code below, we use the default values for these prior hyperparameters. We also leave the default settings for the remaining function arguments. However, we strongly recommend that researchers carefully consider these values when conducting their own analyses. For more details, see, for example, Marsman et al. (2023) and Huth et al. (2024).
library(bgms) # load the bgms package
# load the data
fit <- bgm(dataset, edge_prior = "Stochastic-Block")
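If one wishes to deviate from the defaults, the two SBM hyperparameters discussed above can be passed to bgm directly. A minimal sketch (the values shown are the package defaults, repeated here only to make the arguments explicit):

```r
# Same model with the SBM hyperparameters spelled out explicitly.
# The values shown are the defaults; they are repeated here only to
# illustrate where the arguments go.
fit <- bgm(dataset,
           edge_prior      = "Stochastic-Block",
           dirichlet_alpha = 1,  # concentration of the Dirichlet prior
           lambda          = 1)  # rate of the truncated Poisson prior
```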
First, we check the estimated cluster allocation vector. The output is a vector whose length equals the number of variables (columns) in the dataset. To examine this, we can inspect the allocations vector in the fit object, where the results of our analysis are stored.
"allocations"]] fit[[
We can also examine the estimated posterior probabilities for all possible numbers of clusters. In the software, this ranges from 1 (i.e., no clustering) to the total number of variables in the dataset.
"components"]] fit[[
These posterior probabilities are particularly useful for calculating the cluster Bayes factors, which are defined as follows:
\[\text{BF}_{10} = \underbrace{\frac{p(\mathcal{H}_1\mid \text{data})}{p(\mathcal{H}_0 \mid \text{data})}}_{\text{Posterior Odds}} \bigg/ \underbrace{\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_0)}}_{\text{Prior Odds}}\]
Depending on the hypotheses, we consider two types of Bayes factors:
- The Bayes factor testing the hypothesis of clustering, i.e., \(\mathcal{H}_1: B > 1\), against the null hypothesis of no clustering \(\mathcal{H}_0: B = 1\). For this, we need to calculate the posterior odds of the two hypotheses, which is simply the ratio of the posterior probabilities of the two hypotheses. For \(\mathcal{H}_1: B > 1\), we would need to sum the posterior probabilities of all possible numbers of clusters greater than 1.
H1 <- unname(fit[["components"]][-1, 2]) # posterior probabilities under H1
H0 <- unname(fit[["components"]][1, 2])  # posterior probability under H0
posterior_odds <- sum(H1) / H0
After calculating the posterior odds, we need to divide these by the prior odds in order to obtain the Bayes factor. For this first Bayes factor, the prior odds are defined as:
\[ \frac{(\exp(\lambda) - 1 - \lambda)}{\lambda}.\] Since we leave the default value for the rate parameter \(\lambda\) of the Poisson distribution, we calculate the prior odds as follows:
lambda <- 1
prior_odds <- (exp(lambda) - 1 - lambda) / lambda
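The closed-form expression above follows from the zero-truncated Poisson prior, under which \(p(B = b) \propto \lambda^b / b!\) for \(b \geq 1\). As a sanity check (a sketch, not part of bgms), we can recover the same value numerically by normalizing these prior probabilities over a large range of \(b\):

```r
# Sanity check: recover the prior odds numerically from the
# zero-truncated Poisson prior, p(B = b) = lambda^b / (b! * (exp(lambda) - 1)).
lambda <- 1
b <- 1:100  # large upper bound approximates the infinite support
prior_probs <- lambda^b / (factorial(b) * (exp(lambda) - 1))
prior_odds_numeric <- sum(prior_probs[-1]) / prior_probs[1]  # p(B > 1) / p(B = 1)
prior_odds_closed  <- (exp(lambda) - 1 - lambda) / lambda
all.equal(prior_odds_numeric, prior_odds_closed)  # TRUE
```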
Finally, the Bayes factor is given by:
BF_10 <- posterior_odds / prior_odds
BF_10     # Bayes factor in favor of H1
1 / BF_10 # Bayes factor in favor of H0
- The Bayes factor testing the hypothesis of a specific number of clusters, i.e., \(\mathcal{H}_1: B = b_1\), against \(\mathcal{H}_2: B = b_2\). Say we wish to test the null hypothesis of 1 cluster against the hypothesis that assumes there are two clusters (i.e., \(b_1 = 1\) and \(b_2 = 2\)). For this, we follow similar steps as above.
H1 <- unname(fit[["components"]][1, 2]) # posterior probability under H1 (i.e., B = 1)
H2 <- unname(fit[["components"]][2, 2]) # posterior probability under H2 (i.e., B = 2)
posterior_odds <- H1 / H2
The prior odds for this Bayes factor are given by: \[\lambda^{b_1 - b_2} \cdot \frac{b_2!}{b_1!}\]
prior_odds <- lambda^(1 - 2) * factorial(2) / factorial(1)
Finally, the Bayes factor is given by:
BF_12 <- posterior_odds / prior_odds
BF_12     # Bayes factor in favor of H1
1 / BF_12 # Bayes factor in favor of H2
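The steps above generalize to any pair of cluster numbers. A small hypothetical helper (not part of bgms) that wraps the calculation, assuming fit[["components"]] stores the posterior probabilities for 1 through \(p\) clusters in its second column:

```r
# Hypothetical convenience function: Bayes factor for H1: B = b1 vs H2: B = b2.
# Assumes `fit` is a bgm fit whose "components" element holds the candidate
# numbers of clusters in column 1 and their posterior probabilities in column 2.
cluster_bf <- function(fit, b1, b2, lambda = 1) {
  post <- unname(fit[["components"]][, 2])
  posterior_odds <- post[b1] / post[b2]
  prior_odds <- lambda^(b1 - b2) * factorial(b2) / factorial(b1)
  posterior_odds / prior_odds
}
```

For example, cluster_bf(fit, b1 = 1, b2 = 2) reproduces BF_12 from the code above.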
We would also like to note that if, for any reason, researchers wish to save the samples from the posterior distribution (i.e., when save = TRUE), then in order to obtain the summarized posterior clustering information, they need to use the function summarySBM as follows:
fit <- bgm(Wenchuan,
           edge_prior = "Stochastic-Block",
           save = TRUE)
cluster_summarized <- summarySBM(fit)
easybgm
Everything that we have illustrated using bgms can be done much more easily using easybgm. Currently, the version that contains this analysis is also only available on GitHub and can be installed as follows:
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("sekulovskin/easybgm",
                        ref = "SBM_in_easybgm")
We simply perform the analysis as follows:
library(easybgm)
fit <- easybgm(dataset,
               type = "ordinal",
               edge_prior = "Stochastic-Block")
Now, we do not need to perform any post-processing, dig into the objects containing the results of the analysis, or calculate the Bayes factors "by hand." We can simply use the summary function, and all the relevant output will be displayed.
summary(fit)
After the edge-specific overview (for more information, see Huth et al. (2024)), we see the cluster-specific overview, which includes the posterior probabilities for all possible numbers of clusters, the estimated node membership, as well as the Bayes factors testing the hypothesis of clustering and the null hypothesis of no clustering.
In case you wish to calculate the second type of Bayes factor, similar to the steps above, you can simply use the function clusterBayesfactor, setting the argument type = "simple" and specifying the values for \(b_1\) and \(b_2\).
clusterBayesfactor(fit, type = "simple", b1 = 1, b2 = 2)