# Relational Topic Models

## Latest revision as of 19:08, 7 April 2008

## Modeling Sparsity

In an undirected setting, suppose we have chosen z_ij and z_ji and then draw the response as r_ij ~ Bernoulli(\eta_{z_ij, z_ji}). To model sparsity, we instead draw an additional hidden variable y_ij ~ Bernoulli(\eta_{z_ij, z_ji}), and then draw r_ij ~ Bernoulli(\rho) if y_ij = 1 and r_ij ~ \delta_0 otherwise. Here \rho represents how often we expect to observe a link between nodes that are actually linked in the latent space.
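The two-stage draw described above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_link` and the scalar argument `eta` (standing in for \eta_{z_ij, z_ji} once z_ij and z_ji are fixed) are assumptions made for the example, not part of the model description.

```python
import random


def sample_link(eta, rho, rng):
    """Draw (y, r) for one pair of nodes under the sparsity model.

    eta: probability of a latent link, i.e. \eta_{z_ij, z_ji} for fixed topics
    rho: probability that a latent link is actually observed
    rng: a random.Random instance (hypothetical helper argument)
    """
    # Latent link indicator: y ~ Bernoulli(eta)
    y = 1 if rng.random() < eta else 0
    # Observed link: r ~ Bernoulli(rho) if y = 1, otherwise r = 0 (the \delta_0 case)
    r = 1 if (y == 1 and rng.random() < rho) else 0
    return y, r


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_link(0.4, 0.75, rng) for _ in range(10)]
    print(draws)
```

By construction an observed link can only appear where a latent link exists, so r is never larger than y.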

We can thus write r_ij ~ y_ij Bernoulli(\rho) + (1 - y_ij) \delta_0. To integrate out y_ij, note that r_ij = 1 requires y_ij = 1, so p(r_ij = 1 | z_ij, z_ji) = \rho \eta_{z_ij, z_ji}, and p(r_ij = 0 | z_ij, z_ji) = (1 - \eta_{z_ij, z_ji}) + (1 - \rho) \eta_{z_ij, z_ji} = 1 - \rho \eta_{z_ij, z_ji}. In other words, r_ij ~ Bernoulli(\rho \eta_{z_ij, z_ji}).
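The marginalization over y_ij can be checked numerically by summing over its two values. The sketch below is a sanity check rather than part of any inference code; `marginal_r` is a hypothetical helper name, and `eta` again stands in for \eta_{z_ij, z_ji}.

```python
def marginal_r(eta, rho):
    """Return (p(r = 0), p(r = 1)) after summing out the hidden y."""
    # Prior over the latent link indicator y
    p_y = {0: 1.0 - eta, 1: eta}
    # p(r = 1 | y): zero without a latent link, rho with one
    p_r1_given_y = {0: 0.0, 1: rho}
    p_r1 = sum(p_y[y] * p_r1_given_y[y] for y in (0, 1))
    p_r0 = sum(p_y[y] * (1.0 - p_r1_given_y[y]) for y in (0, 1))
    return p_r0, p_r1


if __name__ == "__main__":
    eta, rho = 0.4, 0.75
    p_r0, p_r1 = marginal_r(eta, rho)
    # Both should agree with the closed forms 1 - rho * eta and rho * eta
    print(p_r0, 1.0 - rho * eta)
    print(p_r1, rho * eta)
```

The enumeration agrees with the closed form for any eta and rho in [0, 1], which is why only the product \rho \eta_{z_ij, z_ji} matters for the likelihood of r_ij.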

## Choosing the sparsity parameter

On the senate dataset, running spectral clustering for various values of K gives the following:

K | False positives | False negatives |
---|---|---|
5 | .606 | .058 |
10 | .354 | .078 |
15 | .126 | .078 |
20 | .193 | .094 |
25 | .157 | .107 |
30 | .135 | .114 |

Even with 30 topics, this would imply that we are failing to observe roughly 15% of true links. Since spectral clustering is likely to be overfitting in this case, a reasonable compromise across all values of K might be 25%. However, since we would expect the true K for this dataset to be small, 50% might be a better estimate.
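Since \rho is the probability of observing a link that exists in the latent space, each of these figures can be turned into a candidate value of \rho by taking one minus the unobserved share. The snippet below does that arithmetic, assuming the percentages are read as the fraction of true links that go unobserved; the helper name and the dictionary labels are made up for the example.

```python
def rho_from_missing_fraction(missing):
    """Convert a fraction of unobserved true links into a value of rho."""
    if not 0.0 <= missing <= 1.0:
        raise ValueError("missing fraction must lie in [0, 1]")
    return 1.0 - missing


# Candidate fractions taken from the discussion of the table (labels are illustrative)
candidates = {
    "K = 30 lower bound": 0.15,
    "compromise across K": 0.25,
    "small true K": 0.50,
}

if __name__ == "__main__":
    for label, missing in candidates.items():
        print(label, rho_from_missing_fraction(missing))
```

Under this reading the three candidates correspond to \rho of about 0.85, 0.75 and 0.5.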

--Jcone 18:27, 7 April 2008 (EDT)