The R and BUGS combination is fine as far as it goes, but it’s slow, hard to debug, and doesn’t scale well. Because I’m going to need to process some big Turker-derived named entity corpora in the next month (more about that later), I thought I’d work on scaling the sampler.

Mitzi was away over the weekend, so I naturally spent my newly found “free time” coding and reading up on stats. While I was ~~procrastinating refactoring feature extraction for CRFs~~ reading a really neat paper (On Smoothing and Inference for Topic Models) from the Irvine paper mill (I also blogged about another of their paper’s on fast LDA sampling), it occurred to me that I could create a fast collapsed sampler for the multinomial form of the hierarchical models of annotation I’ve blogged about.

**Hierarchical Model of Binomial Annotation**

The model’s quite simple, at least in the binomial case. I’ve further simplified here to the case where every annotator labels every item, but the general case is just as easy (modulo indexing):

Variable | Range | Status | Distribution | Description |
---|---|---|---|---|

I | > 0 | input | fixed | number of Items |

J | > 0 | input | fixed | number of annotators |

π | [0,1] | estimated | Beta(1,1) | prevalence of category 1 |

c[i] | {0,1} | estimated | Bern(π) | category for item i |

θ0[j] | [0,1] | estimated | Beta(α0,β0) | specificity of annotator j |

θ1[j] | [0,1] | estimated | Beta(α1,β1) | sensitivity of annotator j |

α0/(α0+β0) | [0,1] | estimated | Beta(1,1) | prior specificity mean |

α0 + β0 | (0,∞) | estimated | Pareto(1.5)^{*} |
prior specificity scale |

α1/(α1+β1) | [0,1] | estimated | Beta(1,1) | prior sensitivity mean |

α1 + β1 | (0,∞) | estimated | Pareto(1.5)^{*} |
prior sensitivity scale |

x[i,j] | {0,1} | input | Bern(c[i,j]==1 ? θ1[j] : 1-θ0[j]) |
annotation of item i by annotator j |

**The Collapsed Sampler**

The basic idea is to sample only the category assignments c[i] in each round of Gibbs sampling. With categories given, it’s easy to compute prevalence, annotator sensitivity and specificity given their conjugate priors.

The only thing we need to sample is c[i], and we can inspect the graphical model for dependencies: the parent π of the c[i], and the parents θ0 and θ1 of the descendants x[i,j] of c[i]. The formula’s straightforwardly derived with Bayes’ rule:

p(c[i]|x, θ0, θ1) ∝ p(c[i]) * Π_{j in 1:J} p(x[i,j] | c[i], θ0[j], θ1[j])

**Moment-Matching Beta Priors**

^{*}The only trick is estimating the priors over the sensitivities and specificities, for which I took the usual expedient of using moment matching. Note that this does not take into account the Pareto prior on scales of the prior specificity and sensitivity (hence the asterisk in the table). In particular, given a set of annotator specificities (and there were 200+ annotators for the named-entity data), we find the beta prior with mean matching the empirical mean and variance matching the empirical variance (requires some algebra).

I’m not too worried about the Pareto scale prior — it’s pretty diffuse. I suppose I could’ve done something with maximum likelihood rather than moment matching (but for all I know, this is the maximum likelihood solution! [update: it’s not the ML estimate; check out Thomas Minka’s paper Estimating a Dirichlet Distribution and references therein.]).

**Initialization**

The inputs are initial values for annotator specificity, annotator sensitivity, and prevalence. These are used to create the first category sample given the above equation, which allows us to define all the other variables for the first sample. Then each epoch just resamples all the categories, then recomputes all the other estimates. This could be made more stochastic by updating all of the variables after each category update, but it converges so fast as is, that it hardly seemed worth the extra coding effort. I made sure to scale for underflow, and that’s about it.

It took about an hour to think about (most of which was working out the moment matching algebra, which I later found in Gelman’s *BDA* book’s appendix), about two hours to code, and about forty-five minutes to write up for this blog entry.

**Speed and Convergence**

It’s very speedy and converges very very quickly compared to the full Gibbs sampler in BUGS. I spent another hour after everything was built and running writing the online sample handler that’d compute means and deviations for all the estimated variables, just like the R/BUGS interface prints out. Having the online mean and variance calculator was just what I needed (more about it later, too), especially as many of the values were very close to each other and I didn’t want to store all of the samples (part of what’s slowing BUGS down).

**Identifiability**

I didn’t run into identifiability problems, but in general, something needs to be done to get rid of the dual solutions (you’d get them here, in fact, if the initial sensitivities and specificities were worse than 0.5).

**Open Question (for me)**

My only remaining question is: why does this work? I don’t understand the theory of collapsed samplers. First, I don’t get nearly the range of possible samples for the priors given that they’re estimated from discrete sets. Second, I don’t apply beta-binomial inference in the main sampling equation — that is, I take the prevalence and annotator sensitivities and specificities as point estimates rather than integrating out their beta-binomial form. Is this kosher?

**Downloading from LingPipe Sandbox**

You can find the code in the LingPipe Sandbox in the project `hierAnno`

(the original R and BUGS code and the data are also in that project). It’s not documented at all yet, but the one Ant task should run out of the box; I’ll probably figure out how to move the application into LingPipe.

[**Update:** The evaluation looks fine for named entities; I’m going to censor the data the same way as in the NAACL submission, and then evaluate again; with all the negative noise, it’s a bit worse than voting as is and the estimates aren’t useful because the specificites so dominate the calculations. For Snow et al.’s RTE data, the collapsed estimator explodes just like EM, with sensitivity scales diverging to infinity; either there’s a bug, the collapsed sampler isn’t stable, or I really do need a prior on those scales!]

[**Update 2:** The censored NE evaluation with collapsed Gibbs sampling and simple posterior evaluation by category sample worked out to have one fewer error than the full sampling model in BUGS (versus the original, albeit noisy, gold standard): collapsed Gibbs 232 errors, full Gibbs 233 errors, voting 242.5 errors (the half is from flipping coins on ties). Voting after throwing out the two really noisy annotators is probably competitive as is. I still don’t know why the RTE data’s blowing out variance.]

July 7, 2009 at 2:51 pm |

fyi i got EM to stabilize on the RTE data; needs the right priors.

July 7, 2009 at 4:07 pm |

The collapsed sampler converges almost instantly with any fixed beta priors for sensitivity and specificity.

What I can’t do with Brendan et al.’s RTE data is get a good estimate of the beta priors if I use moment-matching inside of EM; likelihood diverges with zero-variance beta priors. The “right” thing to do is a better estimate of the beta priors based on another layer of priors; that’s what I did in the paper and in R/BUGS (reparameterized, I used Beta(1,1) on the beta mean (alpha/(alpha+beta)), and Pareto(1.5) on the beta scale (alpha+beta). It’s easy when BUGS does all the work, and I may just need to implement a proper beta estimator with priors.

I’m in the middle of writing up a longer blog post with more details.

July 7, 2009 at 4:57 pm |

ah. right. i meant that. got it.

September 6, 2009 at 11:27 am |

Hi, I’m getting started on statistics, and I was wondering how the Pareto scale prior can be made more and more diffuse. Ane help in this regard is greatly appreciated (some bebliographical reference would be great too). Many thanks in advance.

September 6, 2009 at 1:28 pm |

I just pulled the Pareto prior out of Gelman et al.’s

Bayesian Data Analysis, where they don’t even describe it as such and really only mention it in passing.Like other power-law-type distributions, it gets more diffuse as the exponent is lowered. Below quadratic (exponent 2), it has infinite variance, so that’s pretty darn diffuse.

I just followed the BUGS manual. There’s also Wolfram: Pareto Distribution and Wikipedia: Pareto Distribution.