# Relational Topic Models


## Modeling Sparsity

In the undirected setting, suppose we have chosen z_ij and z_ji. Without sparsity, the response is drawn as r_ij ~ Bernoulli(\eta_{z_ij, z_ji}). To model sparsity, we instead draw an additional hidden variable y_ij ~ Bernoulli(\eta_{z_ij, z_ji}), and then draw r_ij ~ Bernoulli(\rho) if y_ij = 1 and r_ij ~ \delta_0 (a point mass at 0) otherwise. Here \rho represents how often we expect to observe a link between nodes that are actually linked in the latent space.

We can thus write r_ij ~ y_ij Bernoulli(\rho) + (1 - y_ij) \delta_0. Integrating out y_ij gives p(r_ij = 1 | z_ij, z_ji) = \rho \eta_{z_ij, z_ji} and p(r_ij = 0 | z_ij, z_ji) = (1 - \eta_{z_ij, z_ji}) + (1 - \rho) \eta_{z_ij, z_ji} = 1 - \rho \eta_{z_ij, z_ji}. In other words, r_ij ~ Bernoulli(\rho \eta_{z_ij, z_ji}).
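The marginalization above can be checked numerically. The following is a minimal sketch, not part of the original model code: it simulates the two-stage draw (first y_ij, then r_ij) and compares the empirical link rate to \rho \eta. The function names and the values eta = 0.6 and rho = 0.75 are illustrative assumptions, and eta stands in for \eta_{z_ij, z_ji} with the topic assignments held fixed.

```python
import random


def sample_link(eta, rho, rng):
    """Draw one observed link r_ij via the two-stage sparsity model.

    eta stands in for eta_{z_ij, z_ji} (assignments held fixed);
    rho is the probability that a latent link is actually observed.
    """
    y = rng.random() < eta            # y_ij ~ Bernoulli(eta)
    if not y:
        return 0                      # r_ij ~ delta_0: never observed
    return int(rng.random() < rho)    # r_ij ~ Bernoulli(rho)


def marginal_link_prob(eta, rho):
    """Closed form after integrating out y_ij: p(r_ij = 1) = rho * eta."""
    return rho * eta


def empirical_link_rate(eta, rho, n_samples=200_000, seed=0):
    """Monte Carlo estimate of p(r_ij = 1) under the two-stage draw."""
    rng = random.Random(seed)
    hits = sum(sample_link(eta, rho, rng) for _ in range(n_samples))
    return hits / n_samples


if __name__ == "__main__":
    eta, rho = 0.6, 0.75  # hypothetical values, chosen only for illustration
    print("empirical:  ", empirical_link_rate(eta, rho))
    print("closed form:", marginal_link_prob(eta, rho))
```

The two printed numbers should agree up to Monte Carlo noise, which is what the identity r_ij ~ Bernoulli(\rho \eta_{z_ij, z_ji}) predicts.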

## Choosing the sparsity parameter

On the senate dataset, running spectral clustering for various values of K gives the following:

| K | False positives | False negatives |
|---|---|---|
| 5 | .606 | .058 |
| 10 | .354 | .078 |
| 15 | .126 | .078 |
| 20 | .193 | .094 |
| 25 | .157 | .107 |
| 30 | .135 | .114 |

Even with 30 topics, this would imply that we are not seeing at least around 15% of the true links. Since spectral clustering is likely overfitting in this case, a reasonable compromise across all values of K might be 25%. However, since we would expect the true K to be small for this dataset, 50% might be a better estimate.

--Jcone 18:27, 7 April 2008 (EDT)