From: David Mimno <mimno <at> cs.umass.edu>
Subject: Re: Is there an easy way to get the number of tokens for each imported document?
Newsgroups: gmane.comp.ai.mallet.devel
Date: Thursday 23rd February 2012 21:01:57 UTC
Hi Dave,

This class:

bin/mallet run cc.mallet.util.DocumentLengths --input [sequences file]

will print the length of each FeatureSequence to standard output.
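
For the P(doc) estimate mentioned in the question below, here is a minimal Python sketch of one way to post-process that output. It assumes (this is not confirmed in this thread) that the command prints one integer token count per line, and it assumes that P(doc) is taken to be proportional to document length. The function names `parse_lengths` and `doc_distribution` are hypothetical and are not part of MALLET.

```python
from typing import Iterable, List


def parse_lengths(lines: Iterable[str]) -> List[int]:
    """Parse token counts, assuming one integer per non-empty line."""
    return [int(line.strip()) for line in lines if line.strip()]


def doc_distribution(lengths: List[int]) -> List[float]:
    """Normalize token counts so they sum to 1.

    Assumption: P(doc) is proportional to the document's token count.
    """
    total = sum(lengths)
    if total == 0:
        raise ValueError("corpus contains no tokens")
    return [n / total for n in lengths]
```

To use it, redirect the command's standard output to a file, open that file in Python, and pass the file object to `parse_lengths`, then pass the result to `doc_distribution`.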

-David

On Thu, Feb 23, 2012 at 11:20 AM, Dave Stallard wrote:
> Hi all,
>
> I'm using import-dir with stopword-removal, etc., to get a set of documents
> to give to train-topics. I'd like to know the number of tokens in each
> of the documents that are fed to train-topics, i.e. the number of tokens
> after tokenization and stopword-removal. Is there an easy way to get this
> info?
>
> The motivation is that I'm trying to estimate a P(doc) distribution from the
> corpus.
>
> thanks,
> Dave