Hi!
A few days ago I ran a large test, see:
http://article.gmane.org/gmane.mail.bogofilter.general/5346
David then suggested using the security margin in every method where it
is applicable. So be it: I reran the test, this time with a larger mail
collection.
The rest of the setup remains the same, except that the messages are no
longer split into exactly 10,000, 5,000, or 2,500 messages per file. Please
see my previous test for details, or look at the script at the end.
Method 1) full training
Method 2) forced bogominitrain.pl (all messages in the training set
are classified correctly)
Method 3) one run of randomtrain (should be similar to one run of
bogominitrain.pl)
Method 4) combined training, i.e., full training on first ~10,000,
then correcting errors
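The "security margin" used for training on errors can be sketched as a simple decision rule. This is only a sketch: it assumes that '-o 0.7,0.2' means a spam cutoff of 0.7 and a ham cutoff of 0.2 during training, and the function name needs_training is made up for illustration (it is not part of bogofilter or of my script).

```shell
#!/bin/sh
# Hypothetical helper, not part of bogofilter: decide whether a message
# whose true class is known should be registered, given its score.
# Assumption: training cutoffs 0.7 (spam) and 0.2 (ham), as in '-o 0.7,0.2'.
needs_training() {
    # $1 = true class (spam|ham), $2 = score between 0 and 1
    # Exit status 0 means "train on this message", 1 means "leave it alone".
    awk -v class="$1" -v score="$2" -v sc=0.7 -v hc=0.2 'BEGIN {
        if (class == "spam") exit !(score < sc)   # spam that is not clearly spam
        if (class == "ham")  exit !(score > hc)   # ham that is not clearly ham
        exit 2                                    # unknown class
    }'
}

# A spam scoring 0.6 would pass the normal spam_cutoff of 0.5,
# but it still falls inside the margin and gets trained.
if needs_training spam 0.6; then echo "train as spam"; fi
```

With a margin like this, a message counts as "classified correctly" only if it is well clear of the normal cutoff, so borderline messages are registered as well.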
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is the output of my script, which does what I described above. I
have tidied up the output to make it easier to read:
Building the list ...
This is what we have:
bogofilter version 0.15.7
algorithm = fisher
robx = 0.499000 (4.99e-01)
robs = 0.100000 (1.00e-01)
min_dev = 0.300000 (3.00e-01)
ham_cutoff = 0.000000 (0.00e+00)
spam_cutoff = 0.500000 (5.00e-01)
block_on_subnets = no
strict_check = no
ignore_case = no
header_line_markup = yes
tokenize_html_tags = yes
replace_nonascii_characters = no
ham.0:10961
ham.1:5481
ham.2:2741
ham.3:2741
spam.0:10558
spam.1:5279
spam.2:2640
spam.3:2640
Building wordlists ...
Done with wordlist 1.
End of run #1:
Read 16442 ham mails and 15837 spam mails.
Added 280 ham mails and 546 spam mails to the database.
spam good
.MSG_COUNT 546 280
[...]
End of run #4:
Read 16442 ham mails and 15837 spam mails.
Added 1 ham mail and 2 spam mails to the database.
spam good
.MSG_COUNT 691 333
False negatives: 0
False positives: 0
4 runs needed to close off.
Done with wordlist 2.
spam reg good reg
15837 646 16442 282
Done with wordlist 3.
[Note: randomtrain took many more spam messages.]
Done with wordlist 4.
Training results:
-rw------- 1 3.14 3.14 27635712 Oct 23 16:24 1/wordlist.db
-rw------- 1 3.14 3.14 3563520 Oct 23 17:01 2/wordlist.db
-rw------- 1 3.14 3.14 3293184 Oct 23 18:17 3/wordlist.db
-rw------- 1 3.14 3.14 23068672 Oct 23 18:25 4/wordlist.db
Wordlist 1:
spam good
.MSG_COUNT 15837 16442
Wordlist 2:
spam good
.MSG_COUNT 691 333
Wordlist 3:
spam good
.MSG_COUNT 646 282
Wordlist 4:
spam good
.MSG_COUNT 10612 10966
|fn in|fp in
| 2640| 2741
---------+-----+-----
Method 1 | 46 | 4
Method 2 | 14 | 0
Method 3 | 15 | 1
Method 4 | 48 | 5
Results after corrections:
-rw------- 1 3.14 3.14 29884416 Oct 23 18:29 1/wordlist.db
-rw------- 1 3.14 3.14 3821568 Oct 23 19:08 2/wordlist.db
-rw------- 1 3.14 3.14 3469312 Oct 23 19:34 3/wordlist.db
-rw------- 1 3.14 3.14 23089152 Oct 23 19:38 4/wordlist.db
Wordlist 1:
spam good
.MSG_COUNT 18477 19183
fn: 43
fp: 1
Wordlist 2:
spam good
.MSG_COUNT 778 358
fn: 14
fp: 1
Wordlist 3:
spam good
.MSG_COUNT 699 300
fn: 12
fp: 5
Wordlist 4:
spam good
.MSG_COUNT 10636 10967
fn: 43
fp: 2
|fn in|fp in
| 2640| 2741
---------+-----+-----
Method 1 | 43 | 1
Method 2 | 14 | 1
Method 3 | 12 | 5
Method 4 | 43 | 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compared to the previous test, all methods improved. This is probably
related to the new distribution provided by David. The dramatic change
for method 3) is very surprising, and I cannot explain it. Maybe
something went horribly wrong in the original attempt. You would
expect methods 3) and 4) to work better due to the security margin.
Still, method 3) carries a risk of producing false positives, which is
exactly what method 2) tries to avoid.
In this test the small databases (built purely by training on errors)
perform *much* better than the huge ones. As expected, method 2) is a
bit safer.
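To put the final table into proportion, the counts can be turned into error rates. The following is a small sketch, not part of my test script; the function name rates is made up, and the set sizes are the 2640 spam and 2741 ham messages of the last test set.

```shell
#!/bin/sh
# Convert fn/fp counts into percentages of the test set sizes.
# Input lines: <method number> <fn count> <fp count>
rates() {
    awk -v nspam=2640 -v nham=2741 '{
        printf "Method %d  fn %.2f%%  fp %.2f%%\n", $1, 100*$2/nspam, 100*$3/nham
    }'
}

# Counts taken from the table "Results after corrections".
rates <<EOF
1 43 1
2 14 1
3 12 5
4 43 2
EOF
```

For method 2) this gives about 0.53% false negatives and 0.04% false positives, compared to about 1.63% false negatives for methods 1) and 4).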
Again: of course, all of these results depend on my mail collection, my
settings, the phase of the moon, and what have you. So I hope others
will run the same kind of test to see whether the results are stable.
pi
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is the script that did all the work:
#!/bin/sh
# From the mail archive message-lists are built
echo "Building the list ..."
distrib.sh ham '../ham*'
distrib.sh spam '../spam*'
# Settings and files we work with
echo "This is what we have:"
rm -rf 1 2 3 4
mkdir 1 2 3 4
my-bogofilter -Q
grep -c '^From ' ham.* spam.*
# I declare the games open
echo "Building wordlists ..."
# first 10,000 messages each for methods 1 and 4 (not exactly, but good enough)
my-bogofilter -d 1 -s < spam.0
my-bogofilter -d 1 -n < ham.0
cp 1/wordlist.db 4
# finish method 1 training
my-bogofilter -d 1 -s < spam.1
my-bogofilter -d 1 -n < ham.1
echo "Done with wordlist 1."
# method 2 training
bogominitrain.pl -fn 2 'ham.[01]' 'spam.[01]' '-o 0.7,0.2'
echo "Done with wordlist 2."
# method 3 training (security margins are built into randomtrain.cf)
randomtrain -d 3 -c randomtrain.cf -s spam.0 -s spam.1 -n ham.0 -n ham.1
echo "Done with wordlist 3."
# method 4 training
classify spam.1 -d 4 -o 0.7,0.2
my-bogofilter -d 4 -s < corpus.good
rm -f corpus.*
classify ham.1 -d 4 -o 0.7,0.2
my-bogofilter -d 4 -n < corpus.bad
rm -f corpus.*
echo "Done with wordlist 4."
# training results
echo "Training results:"
ls -l */wordlist.db
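for dir in [1-4]
do
echo "Wordlist $dir:"
my-bogoutil -w "$dir" .MSG_COUNT
# false negatives: spam messages not classified as spam
printf "fn: "
my-bogofilter -d "$dir" -TM < spam.2 | grep -cv '^S'
# false positives: ham messages not classified as ham
printf "fp: "
my-bogofilter -d "$dir" -TM < ham.2 | grep -cv '^H'
done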
# correcting errors
echo "Correcting errors ..."
# method 1
my-bogofilter -d 1 -s < spam.2
my-bogofilter -d 1 -n < ham.2
echo "Done with wordlist 1."
# method 2
bogominitrain.pl -fn 2 'ham.[201]' 'spam.[201]' '-o 0.7,0.2'
echo "Done with wordlist 2."
# method 3
randomtrain -d 3 -c randomtrain.cf -s spam.2 -n ham.2
echo "Done with wordlist 3."
# method 4
classify spam.2 -d 4 -o 0.7,0.2
my-bogofilter -d 4 -s < corpus.good
rm -f corpus.*
classify ham.2 -d 4 -o 0.7,0.2
my-bogofilter -d 4 -n < corpus.bad
rm -f corpus.*
echo "Done with wordlist 4."
# training results
echo "Results after corrections:"
ls -l */wordlist.db
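for dir in [1-4]
do
echo "Wordlist $dir:"
my-bogoutil -w "$dir" .MSG_COUNT
# false negatives: spam messages not classified as spam
printf "fn: "
my-bogofilter -d "$dir" -TM < spam.3 | grep -cv '^S'
# false positives: ham messages not classified as ham
printf "fp: "
my-bogofilter -d "$dir" -TM < ham.3 | grep -cv '^H'
done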