From: David Mimno <mimno <at> cs.umass.edu>
Subject: Topic diagnostics
Newsgroups: gmane.comp.ai.mallet.devel
Date: Thursday 12th May 2011 15:12:14 UTC
Hi all,

For those using the HG repository, I've checked in a new
TopicModelDiagnostics class, and I'd appreciate it if anyone would
like to beta test it.

The code defines a number of diagnostic functions that people have
found useful in detecting bad topics, including several from AlSumait
et al. (ECML 2009). These are output in XML format. In addition to
topic-level statistics, the file also includes word-level statistics
for the top N words (defined by --num-top-words) for every topic.

Here's an example, which uses a new, somewhat simplified wrapper for
ParallelTopicModel:

> bin/mallet run cc.mallet.topics.tui.TopicTrainer --input data.sequences --num-topics 32 --diagnostics-file diagnostics.xml
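
If you would rather drive this from Java than from the command line,
something along the following lines should also work. Treat it as a
minimal sketch: the hyperparameters and iteration counts are
placeholder values, and the assumed TopicModelDiagnostics API (a
constructor taking the trained model and a top-word count, plus a
toXML() method) may still change while the class is in beta.

import java.io.File;
import java.io.PrintWriter;

import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicModelDiagnostics;
import cc.mallet.types.InstanceList;

public class RunDiagnostics {
    public static void main(String[] args) throws Exception {
        // Load a serialized InstanceList, e.g. the output of
        // "bin/mallet import-file" or "import-dir".
        InstanceList instances = InstanceList.load(new File("data.sequences"));

        // Train a 32-topic model, matching the command-line example above.
        ParallelTopicModel model = new ParallelTopicModel(32, 5.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(2);
        model.setNumIterations(1000);
        model.estimate();

        // Compute diagnostics over the top 20 words of each topic and
        // write them out as XML, as --diagnostics-file does.
        TopicModelDiagnostics diagnostics = new TopicModelDiagnostics(model, 20);
        PrintWriter out = new PrintWriter("diagnostics.xml");
        out.println(diagnostics.toXML());
        out.close();
    }
}

The MALLET classes and their dependencies need to be on the classpath
when compiling and running this.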

The statistics defined in the file are as follows (a rough sketch of
how a few of them can be computed appears after the list). Suggestions
are welcome!

* Word count (total number of word tokens assigned to the topic).
Small topics are often illogical; large topics are often overly
general.

* Word length (for each word, count the number of characters). Topics
with lots of very short words tend to be problematic. This metric
normalizes word length against the average word length of top words
over all topics, so negative numbers mean short words and positive
numbers mean long words.

* Coherence (probability of words given higher-ranked words). This
metric picks out illogical combinations.

* Distance from uniformity. Higher values indicate more probability
concentrated on a few words; lower values indicate more dispersed
probability.

* Distance from corpus. Higher values indicate more specific topics;
topics with lower values look like what you would get by counting all
the words in the corpus, regardless of topic.

* Effective number of words. The inverse of the sum of squared
probabilities. Higher values indicate less concentration on top words.
This metric is similar to distance from uniformity.

* Token/document difference. Higher values indicate burstiness: one
of the top words appears many times in a small number of documents
(i.e., fewer documents than expected given the token count).

* Documents at rank 1. Vacuous or overly general topics often appear
with small weight in many documents. Out of the documents that contain
a given topic, this metric counts how many have that topic as their
single most common topic. Low numbers indicate possibly uninteresting
topics.
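
To make a few of these definitions more concrete, here is a rough
sketch of the kind of arithmetic involved, written against a plain
array of word probabilities for one topic. Take it as an illustration
rather than the exact code in TopicModelDiagnostics: the use of KL
divergence for the two distance measures and the add-one smoothing in
the coherence estimate are simplifying choices, and the real class may
normalize or restrict to top words differently.

public class TopicStatsSketch {

    // Effective number of words: 1 / sum_w p(w)^2, where p(w) is the
    // probability of word w in the topic. Higher values mean the
    // probability mass is spread over more words.
    public static double effectiveNumberOfWords(double[] p) {
        double sumSquares = 0.0;
        for (double x : p) { sumSquares += x * x; }
        return 1.0 / sumSquares;
    }

    // Distance from uniformity, here taken to be KL(p || uniform).
    // Higher values mean probability is concentrated on a few words.
    public static double distanceFromUniform(double[] p) {
        double uniform = 1.0 / p.length;
        double kl = 0.0;
        for (double x : p) {
            if (x > 0.0) { kl += x * Math.log(x / uniform); }
        }
        return kl;
    }

    // Distance from the corpus word distribution, here taken to be
    // KL(p || corpus). Low values mean the topic looks like raw
    // corpus word counts.
    public static double distanceFromCorpus(double[] p, double[] corpus) {
        double kl = 0.0;
        for (int w = 0; w < p.length; w++) {
            if (p[w] > 0.0 && corpus[w] > 0.0) {
                kl += p[w] * Math.log(p[w] / corpus[w]);
            }
        }
        return kl;
    }

    // Coherence as "probability of words given higher-ranked words",
    // estimated from document co-occurrence among the top N words:
    // sum over pairs of log((D(w_i, w_j) + 1) / D(w_j)), where w_j is
    // ranked above w_i, D(w_j) is the number of documents containing
    // w_j, and D(w_i, w_j) the number containing both. Scores are
    // negative; values closer to zero suggest more coherent topics.
    public static double coherence(int[] docFreq, int[][] coDocFreq) {
        double score = 0.0;
        for (int i = 1; i < docFreq.length; i++) {
            for (int j = 0; j < i; j++) {
                score += Math.log((coDocFreq[i][j] + 1.0) / docFreq[j]);
            }
        }
        return score;
    }
}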

-David