From: Boris 'pi' Piwinger <3.14 <at> logic.univie.ac.at>
Subject: Test with different lexers
Newsgroups: gmane.mail.bogofilter.general
Date: 2003-12-02 13:19:23 GMT

Hi!

I have done another test with bogofilter's new lexer and my version
(http://piology.org/bogofilter/lexer_v3.l):

Corpus sizes:

         t    r0    r1    r2    tot
sp   12868  4293  4287  4287  12867
ns   14571  4861  4855  4856  14572

My version:

           cutoff    r0   r1   r2  tot
wo (fn): 0.950000   140  141  118  399
wo (fp): 0.950000     1    2    1    4
wi (fn): 0.978575   150  150  124  424
wi (fp): 0.978575     1    2    0    3
wi (fn): 0.972447   145  148  123  416
wi (fp): 0.972447     1    2    1    4
wi (fn): 0.918270   133  132  116  381
wi (fp): 0.918270     2    2    1    5
wi (fn): 0.664719   105  117   99  321
wi (fp): 0.664719     4    3    3   10

Original version:

           cutoff    r0   r1   r2  tot
wo (fn): 0.950000   148  149  124  421
wo (fp): 0.950000     1    1    2    4
wi (fn): 0.973601   151  158  133  442
wi (fp): 0.973601     1    1    1    3
wi (fn): 0.967104   150  154  128  432
wi (fp): 0.967104     1    1    2    4
wi (fn): 0.948838   148  148  124  420
wi (fp): 0.948838     1    2    2    5
wi (fn): 0.710234   114  120  107  341
wi (fp): 0.710234     4    4    2   10

Over time we have introduced several special rules to deal with
specific problematic messages. My version has removed some of those
(different token front and back, dollar rule, no short tokens, no
numeric tokens, doctype switch, maybe more).

With my mail collection those special treatments don't give
improvements. On the contrary, the simplified version has an advantage
of 5-10% fewer false negatives. While this is too small to really say
it is better, it is good enough to say that it is at least as good.

IIRC it was Tom who gave a strong opinion on why we should really just
let the statistics do the work and not intervene with special rules.

If you want to try it, just replace the lexer file and compile. It
would be great if other people would repeat the test.

pi, who has seen no error in incoming mail for eight days now
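
For anyone repeating the test, the comparison of false negatives can be recomputed from the "tot" column of the two result tables. The following is only a minimal sketch: the variable and function names are made up for illustration, and pairing the rows of the two tables by position is an assumption, since the cutoffs of the "wi" rows differ between the two runs.

```python
# Total false negatives ("tot" column, fn rows) from the two tables, in row order.
# Assumption: rows are paired by position, although the cutoffs are not identical.
simplified_fn = [399, 424, 416, 381, 321]  # "My version"
original_fn = [421, 442, 432, 420, 341]    # "Original version"


def relative_reduction(original: int, simplified: int) -> float:
    """Return the percentage by which `simplified` is lower than `original`."""
    return 100.0 * (original - simplified) / original


reductions = [
    relative_reduction(o, s) for o, s in zip(original_fn, simplified_fn)
]

for o, s, r in zip(original_fn, simplified_fn, reductions):
    print(f"{o:4d} -> {s:4d}  {r:5.1f}% fewer false negatives")
```

With this pairing the reductions come out between roughly 4% and 9%, which is in the same ballpark as the figure quoted in the post. The false positive totals (4, 3, 4, 5, 10) are identical in both tables, so they are left out of the sketch.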