From: Dave Stallard <stallard <at> bbn.com>
Subject: Is there an easy way to get the number of tokens for each imported document?
Newsgroups: gmane.comp.ai.mallet.devel
Date: Thursday 23rd February 2012 16:20:29 UTC
Hi all,

I'm using import-dir with stopword removal, etc., to prepare a set of
documents for train-topics. I'd like to know the number of tokens in
each document that is fed to train-topics, i.e. the number of tokens
remaining after tokenization and stopword removal. Is there an easy way
to get this information?

The motivation is that I'm trying to estimate a P(doc) distribution from 
the corpus.

thanks,
Dave
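
For illustration, one possible reading of the question is sketched below: count the tokens that survive tokenization and stopword removal in each document, then take P(doc) as that document's share of all surviving tokens. This is a minimal sketch and not MALLET's own pipeline. The regex tokenizer, the tiny `STOPWORDS` set, and the function names `count_tokens` and `doc_distribution` are all assumptions made up for this example, so the counts will differ from what import-dir produces unless the tokenization and stopword list are matched to it.

```python
import re

# Hypothetical stopword list for illustration only; it is NOT MALLET's default list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}


def count_tokens(text, stopwords=STOPWORDS):
    """Count tokens left after lowercasing, a simple regex tokenization,
    and stopword removal. The regex is an assumption, not MALLET's tokenizer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token not in stopwords)


def doc_distribution(docs):
    """Return (counts, probs) for a dict mapping document name -> raw text.

    counts[name] is the post-filtering token count n_d, and
    probs[name] is n_d / N, where N is the sum of all n_d.
    """
    counts = {name: count_tokens(text) for name, text in docs.items()}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens left after stopword removal")
    probs = {name: n / total for name, n in counts.items()}
    return counts, probs


if __name__ == "__main__":
    # Toy corpus, invented for this example.
    corpus = {
        "doc1.txt": "The cat sat in the hat",
        "doc2.txt": "A dog",
    }
    counts, probs = doc_distribution(corpus)
    for name in corpus:
        print(name, counts[name], probs[name])
```

Weighting documents by token count is only one way to define P(doc); a uniform 1/D over D documents is another, and which one is appropriate depends on how the distribution will be used.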

