Gmane
From: Giampaolo Tomassoni <g.tomassoni <at> libero.it>
Subject: R: R: YAGI: Yet Another Great Idea
Newsgroups: gmane.mail.spam.spamassassin.general
Date: Tuesday 28th August 2007 18:58:02 UTC
> -----Original Message-----
> From: Theo Van Dinter [mailto:[email protected]]
> 
> On Tue, Aug 28, 2007 at 05:05:24PM +0200, Giampaolo Tomassoni wrote:
> > > TextCat?
> >
> > Does it yield the probability for each language? Or does it just yield
> > a single result (i.e., a language id)?
> >
> > We would need the probability of the text being in any given language.
> > Or, better, something close to an array of probabilities where the
> > indexes are the various languages for which the module can work.
> 
> To be honest, I haven't looked at that plugin in ages, so I don't
> remember exactly what it does.  As I recall, it gives a list of possible
> languages, which means that internally it would have to know
> probabilities.

I had a look at it, and I believe a quick test could be done by modifying
the current code.

The TextCat plugin internally computes the %results hash (in the
"classify" sub), which is keyed on the names of the user-enabled languages.
There, $results{$language} is something like the "hit score" of the text
with respect to the given language. It is probably not that difficult to
modify "classify" so that it also returns the result we're looking for...
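
As a rough illustration of the kind of change meant here, the sketch below
turns a table of per-language scores into an array of probability-like
values. It is written in Python rather than the plugin's Perl, and it is
not the plugin's code. The function name, the sample scores, and the
"temperature" parameter are all hypothetical. It also assumes that a lower
score means a closer match, as with a distance measure; if the scores in
%results go the other way, the sign in the exponent has to be flipped.

```python
import math


def scores_to_probabilities(results, temperature=1.0):
    """Map per-language scores to values in (0, 1] that sum to 1.

    Assumption: `results` maps a language name to a distance-like score,
    where LOWER means a closer match. `temperature` (hypothetical tuning
    knob) controls how sharply the best language dominates.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not results:
        return {}
    best = min(results.values())
    # Subtract the best score first so the exponent never overflows.
    weights = {
        lang: math.exp(-(score - best) / temperature)
        for lang, score in results.items()
    }
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Made-up scores, only to show the shape of the input and output.
results = {"en": 1200.0, "it": 1500.0, "de": 1800.0}
probs = scores_to_probabilities(results, temperature=300.0)
for lang in sorted(probs, key=probs.get, reverse=True):
    print(f"{lang}: {probs[lang]:.3f}")
```

The values produced this way are a normalized ranking, not calibrated
probabilities. Whether they are good enough for scoring purposes would
have to be checked against real mail.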

By the way, the TextCat method is reported as being one of the best in
Kranig's paper.

Does anybody know who the author of the Mail::SpamAssassin::Plugin::TextCat
plugin is? Just in case I have questions to raise.

Giampaolo

> 
> --
> Randomly Selected Tagline:
> "Jab, Jab, Oooh. O(n log n)! Ha! Tail recursion! Thrust! Parry!"
>          - Jim Flanagan
 