From: Corey Arnold
Subject: Re: Topics over time questions
Newsgroups: gmane.comp.ai.mallet.devel
Date: Saturday 25th February 2012 16:33:12 UTC

```
Hi Dan and Laura,

Thank you both for your help. Your responses generated a few additional
questions:

1. If I generate random samples from a beta distribution with very small
values of the alpha and beta parameters, I see how I can essentially get
values of 0 and 1 with high probability. However, if I calculate the
probability of a document with a timestamp of 0 or 1, I don't see how I
can get anything other than 0. This is why I, like Dan, introduced a small
epsilon to ensure all timestamps fall within the support of a beta, (0,1).
Laura, perhaps you can point me to the configuration you mentioned on the
Wikipedia page?

2. I have looked at Dirichlet.learnParametersWithHistogram(Object[]
observations). However, I am unsure of how to apply it here. For the beta
distribution for topic z, my observations consist of the timestamps of
words labeled with z. Therefore, I am unsure of what parameters to use to
instantiate a Dirichlet object and what observations to supply to
learnParametersWithHistogram(), as I really don't have a histogram.

Thank you,
Corey

On Fri, Feb 24, 2012 at 11:26 AM, Laura Dietz wrote:
> Hi Dan, Hi Corey,
>
> As timestamps t_{di} are drawn from a Beta, they have to be normalized to
> [0,1] (see Step 2c in the generative process).
> The Beta distribution has some configurations for which the probability of
> sampling values of 0 or 1 is actually fairly high (Wikipedia has some
> examples). The hope is that the parameters are learned to capture whether
> a topic is hot near the ends of your time range.
>
> Method of moments estimators gave me quite a headache in terms of
> stability/robustness. Sometimes I get extreme values, eventually resulting
> in NaN. I switched to one of the other hyperparameter estimation methods
> (e.g. Tom Minka's histogram method), which are also part of MALLET.
>
> Cheers,
> Laura
>
> On 2/24/12 2:15 PM, dan wrote:
>
> On Fri, Feb 24, 2012 at 9:33 AM, Corey Arnold wrote:
>
> I am aware there is no implementation of Topics Over Time (Wang and
> McCallum, 2006) in MALLET, but I thought this may be a good place to
> ask questions about it nonetheless.
>
> 1. The paper does not provide much detail on how document timestamps
> are normalized. My thought was that they are scaled to [0,1], but I am
> then unsure of how to handle documents with 0 and 1 timestamps so that
> they have some probability.
>
> For this, I just chose fixed values some small epsilon away from 0 and 1.
> For example, I set any timestamp equal to 0 to 0.00001 and any timestamp
> equal to 1.0 to 0.99999.
>
> 2. When updating the parameters for the beta distribution using the
> method of moments I get negative values for seemingly reasonable
> average timestamps and variances. Have others run into this? Would
> someone recommend an alternate parameterization?
>
> The method of moments fails in two cases:
> 1) when the variance becomes 0, the method of moments calculation
>    has a division by zero. This is actually fairly common during the early
>    stages of inference, in the case where all of the tokens assigned to a
>    topic end up coming from the same document.
> 2) when the variance is greater than the mean, the MOM produces
>    negative-valued estimates for the shape parameters (which is invalid
>    for the Beta distribution).
>
> --dan
>
> Thank you,
> Corey

--
Corey Arnold, PhD  |  UCLA Medical Imaging Informatics Group  |  310.794.3538
```
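The two workarounds discussed in the thread (clamping timestamps away from 0 and 1, and guarding the method-of-moments update) can be sketched in a few lines. This is only an illustrative sketch in Python using the standard library, not MALLET code and not the paper's reference implementation. The function names, the fallback to Beta(1, 1), and the midpoint used when all timestamps are identical are assumptions made for the example; the epsilon of 0.00001 follows the values dan mentions. The estimator uses the standard moment-matching equations for a Beta distribution, under which the estimates become non-positive once the sample variance reaches m(1 - m) for sample mean m.

```python
import math
from statistics import mean, variance

EPS = 1e-5  # assumed offset, matching the 0.00001 / 0.99999 values in the thread


def normalize_timestamps(raw, eps=EPS):
    """Scale raw timestamps to [0, 1], then clamp them into [eps, 1 - eps]."""
    lo, hi = min(raw), max(raw)
    span = hi - lo
    if span == 0:
        # Assumption: if every timestamp is identical, place them at the midpoint.
        return [0.5 for _ in raw]
    return [min(max((t - lo) / span, eps), 1.0 - eps) for t in raw]


def beta_log_pdf(t, a, b):
    """Log density of Beta(a, b) at t, valid for 0 < t < 1 and a, b > 0."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t)


def beta_method_of_moments(ts, fallback=(1.0, 1.0)):
    """Moment-matched Beta shape parameters, or `fallback` when they are invalid.

    The fallback (a uniform Beta(1, 1)) is a hypothetical choice for this
    sketch; keeping the previous iteration's parameters is another option.
    """
    if len(ts) < 2:
        return fallback
    m = mean(ts)
    v = variance(ts)  # unbiased sample variance (n - 1 denominator)
    if v <= 0.0:
        return fallback  # zero variance would cause a division by zero
    common = m * (1.0 - m) / v - 1.0
    if common <= 0.0:
        return fallback  # v >= m(1 - m): shape estimates would be non-positive
    return m * common, (1.0 - m) * common
```

With these guards, a topic whose tokens all share one timestamp, or whose timestamps sit only at the two ends of the range, keeps a valid (if uninformative) Beta instead of producing NaN or negative shape parameters. It does not address the broader stability concerns Laura raises, for which she suggests the histogram-based estimators.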