Hi!
As I described to the list, I use a radically simplified
lexer with bogofilter (with great success, BTW). This
essentially declares TOKEN to be:
[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+
Once in a while I have observed that a message would
probably be rated quite differently if host names were
recognized, which fails because . is not allowed in TOKEN.
So I ask: is this a useful feature or not? For this test I
used my radical lexer and a copy which allows . in the
middle of a TOKEN (not as first or last character!). I built
four databases, each trained from 10k ham and 10k spam: one
with training to exhaustion (TTE) and one with full training
for each version of the lexer.
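To make the difference between the two lexers concrete, here
is a rough Python sketch of both token definitions. It is
not the flex source: the names BLANK, CNTRL, PUNCT, CHUNK,
TOKEN_NO_DOT, TOKEN_WITH_DOT and tokens() are made up for
illustration, [:cntrl:] is assumed to mean ASCII 0x00-0x1f
plus 0x7f, and the dot variant is assumed to reject
consecutive dots.

```python
import re

# Characters excluded from TOKEN, transcribed from the character class above.
BLANK = " \t"                                            # [:blank:]
CNTRL = "".join(chr(c) for c in range(0x20)) + "\x7f"    # [:cntrl:] (ASCII assumption)
PUNCT = "<>;&%@|/\\{}^\"*,[]=()+?:#$._!'`~-"

# Escape every character so it is taken literally inside the [...] class.
EXCLUDED = "".join(re.escape(ch) for ch in BLANK + CNTRL + PUNCT)

# One run of allowed characters.
CHUNK = f"[^{EXCLUDED}]+"

# Lexer without dot: a token is a single run.
TOKEN_NO_DOT = re.compile(CHUNK)

# Lexer with dot: runs joined by single dots, so a dot can
# never be the first or last character of a token.
TOKEN_WITH_DOT = re.compile(f"{CHUNK}(?:\\.{CHUNK})*")


def tokens(pattern, text):
    """Return all tokens that the given lexer pattern finds in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "visit mail.example.com now."
    print(tokens(TOKEN_NO_DOT, sample))
    print(tokens(TOKEN_WITH_DOT, sample))
```

With the dot variant a host name such as mail.example.com
stays one token, while the plain lexer splits it into its
labels.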
The first observation is the number of messages used for
training to exhaustion:
without dot: 342 spam / 185 ham
with dot: 351 spam / 185 ham
So it does not seem to make much of a difference here.
More interesting, of course, is how well those trainings
classify messages. Here are the misclassification counts on
a test set of 1544 ham and 4417 spam messages:
           | size kb | fp | fn
TTE        |    1912 |  2 |  82
TTE (dot)  |    1892 |  1 |  98
full       |   13612 |  2 | 182
full (dot) |   14204 |  2 | 193
With so few test messages there is not much to observe.
There is a very, very small indication that . might help in
avoiding fp's, while the number of fn's is somewhat lower
when *not* using dots. This is a surprise: I expected the
dot version to clearly outperform the much simpler lexer. It
does not, so I am going to keep the dot out.
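For anyone who prefers rates to raw counts, the numbers
above can be converted with a few lines of Python. This is
only a convenience sketch: it assumes fp's are counted
against the 1544 ham and fn's against the 4417 spam, and the
names RESULTS and error_rates() are made up here.

```python
HAM_TOTAL = 1544    # ham messages in the test set
SPAM_TOTAL = 4417   # spam messages in the test set

# (fp, fn) counts per training run, copied from the table above.
RESULTS = {
    "TTE": (2, 82),
    "TTE (dot)": (1, 98),
    "full": (2, 182),
    "full (dot)": (2, 193),
}


def error_rates(fp, fn):
    """Return (fp rate over ham, fn rate over spam) as fractions."""
    return fp / HAM_TOTAL, fn / SPAM_TOTAL


if __name__ == "__main__":
    for name, (fp, fn) in RESULTS.items():
        fp_rate, fn_rate = error_rates(fp, fn)
        print(f"{name:<11} fp {fp_rate:.2%}  fn {fn_rate:.2%}")
```

With only one or two fp's per run, the fp rates are far too
coarse to separate the lexers; the fn rates are the only
place where a difference shows up at all.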
With this result in mind it will be interesting to see if IP
numbers are really useful. I'll keep you posted.
pi