From: Corey Arnold <cwarnold <at> ucla.edu>
Subject: Re: Topics over time questions
Newsgroups: gmane.comp.ai.mallet.devel
Date: Saturday 25th February 2012 16:33:12 UTC
Hi Dan and Laura,

Thank you both for your help. Your responses generated a few
additional questions:

1. If I generate random samples from a beta distribution with very
small values of alpha and beta parameters I see how I can essentially
get values of 0 and 1 with high probability. However, if I calculate
the probability of a document with a timestamp of 0 or 1 I don't see
how I can get anything other than 0. This is why I, like Dan,
introduced a small epsilon to ensure all timestamps fall within the
range of support for a beta, (0, 1). Laura, perhaps you can point me to
the configuration you mentioned on the Wikipedia page?

2. I have looked at Dirichlet.learnParametersWithHistogram(Object[]
observations). However, I am unsure of how to apply it here. For the
beta distribution for topic z, my observations consist of the
timestamps of words labeled with z. Therefore, I am unsure of what
parameters to use to instantiate a Dirichlet object and what
observations to supply learnParametersWithHistogram(), as I really
don't have a histogram.
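
To make the endpoint issue in point 1 concrete: at t = 0 the Beta density is 0 when alpha > 1 and unbounded when alpha < 1 (and symmetrically at t = 1 for beta), so the log-density is finite at an endpoint only when the corresponding shape parameter equals 1. The following is a minimal sketch, not taken from MALLET or the paper, of clamping timestamps before evaluating the log-density. The names `clamp_timestamp`, `beta_log_pdf`, and `EPS` are hypothetical, and the epsilon value simply mirrors the 0.00001 example from this thread.

```python
import math

# Assumed offset; mirrors the 0.00001 / 0.99999 example in this thread.
EPS = 1e-5


def clamp_timestamp(t, eps=EPS):
    """Move a normalized timestamp into the open interval (0, 1)."""
    return min(max(t, eps), 1.0 - eps)


def beta_log_pdf(t, alpha, beta):
    """Log-density of Beta(alpha, beta) at t, valid for 0 < t < 1."""
    # Log of the normalizing constant 1 / B(alpha, beta).
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return log_norm + (alpha - 1.0) * math.log(t) + (beta - 1.0) * math.log1p(-t)


# A document stamped exactly at the start of the time range gets a finite score.
score = beta_log_pdf(clamp_timestamp(0.0), 0.5, 0.5)
```

Working in log space avoids underflow when many token-level densities are multiplied; the choice of epsilon still affects how strongly endpoint documents pull on topics with shape parameters below 1.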

Thank you,
Corey

On Fri, Feb 24, 2012 at 11:26 AM, Laura Dietz  wrote:
> Hi Dan, Hi Corey,
>
> As timestamps t_{di} are drawn from a Beta, they have to be normalized to
> [0,1]. (See Step 2c in the generative process.)
> The Beta distribution has some configurations for which the probability of
> sampling values near 0 or 1 is actually fairly high. (Wikipedia has some
> examples.) The hope is that the parameters are learned to capture whether a
> topic is hot near the ends of your time range.
>
> Method of moments estimators gave me quite some headache in terms of
> stability/robustness. Sometimes I get extreme values, eventually resulting
> in NaN. I switched to one of the other hyperparameter estimation methods
> (e.g. Tom Minka's histogram method), which are also part of MALLET.
>
> Cheers,
> Laura
>
>
>
> On 2/24/12 2:15 PM, dan wrote:
>
>
> On Fri, Feb 24, 2012 at 9:33 AM, Corey Arnold  wrote:
>
> I am aware there is no implementation of Topics Over Time (Wang and
> McCallum, 2006) in MALLET, but I thought this may be a good place to
> ask questions about it nonetheless.
>
> 1. The paper does not provide much detail on how document timestamps
> are normalized. My thought was that they are scaled to [0,1], but I am
> then unsure of how to handle documents with 0 and 1 timestamps so that
> they have some probability.
>
> For this, I just moved the boundary values a small epsilon away from 0 and 1.
> For example, I set any timestamp equal to 0 to 0.00001 and any timestamp
> equal to 1.0 to 0.99999.
>
>
>
> 2. When updating the parameters for the beta distribution using the
> method of moments I get negative values for seemingly reasonable
> average timestamps and variances. Have others run into this? Would
> someone recommend an alternate parameterization?
>
> The method of moments fails in two cases:
> 1) when the variance becomes 0 then the method of moments calculation
>     has a division by zero.  This is actually fairly common during the
>     early stages of inference, in the case where all of the tokens
>     assigned to a topic end up coming from the same document.
> 2) when the variance is greater than mean * (1 - mean), the MOM produces
>     negative-valued estimates for the shape parameters (which is invalid
>     for the Beta distribution).
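
The two failure cases above can be guarded against explicitly. The standard method-of-moments estimates for a Beta are alpha = m * c and beta = (1 - m) * c with c = m * (1 - m) / v - 1, where m is the mean and v the variance, so both estimates are positive only when v is below m * (1 - m). The sketch below is one possible way to handle this, not the approach used in the paper or in MALLET: the function name `beta_method_of_moments`, the variance floor `min_var`, and the choice to return `None` so the caller can keep the previous parameters are all assumptions.

```python
def beta_method_of_moments(timestamps, min_var=1e-8):
    """Estimate (alpha, beta) for a Beta from timestamps in (0, 1).

    Returns None when the moments do not yield valid (positive) shape
    parameters, so the caller can keep the previous estimate instead.
    """
    n = len(timestamps)
    if n == 0:
        return None
    mean = sum(timestamps) / n
    var = sum((t - mean) ** 2 for t in timestamps) / n

    # Case 1: zero variance (e.g. all tokens from one document) would divide
    # by zero, so apply a small floor (assumed value).
    var = max(var, min_var)

    # Case 2: variance at or above mean * (1 - mean) gives non-positive shapes.
    if var >= mean * (1.0 - mean):
        return None

    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common
```

With a variance floor, a topic whose tokens share one timestamp gets a very sharply peaked Beta rather than a NaN; a larger `min_var` softens that peak at the cost of biasing the estimate.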
>
>
>
> --dan
>
>
> Thank you,
> Corey



-- 
Corey Arnold, PhD  |  UCLA Medical Imaging Informatics Group  |
 310.794.3538
