From: Boris 'pi' Piwinger <3.14 <at> logic.univie.ac.at>
Subject: Testing training methods
Newsgroups: gmane.mail.bogofilter.general
Date: 2003-10-14 08:03:12 GMT
Hi!

A while ago David, Greg and I reworked the documentation. The FAQ now
explains several training methods, and I want to test them.
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/bogofilter/bogofilter/doc/bogofilter-faq.html#training

Remember I posted some tests concerning settings of bogominitrain.pl a
few weeks ago:
http://article.gmane.org/gmane.mail.bogofilter.general/4373

As you might have noticed, Greg and I do those tests differently. He
adjusts spam_cutoff so that all runs have the same number of false
positives, and you only need to compare the false negatives. I keep
the whole configuration fixed and look at the output, which might be
harder to read but, at least for me, is closer to use in production.

Let me first describe what I will do. I have 20,000 ham and 20,000
spam messages. Of each I take 15,000 for training and the rest for
testing. The messages are in (ham|spam)[0-3] as shown below. I train
four databases as explained in the FAQ:

Method 1) full training
Method 2) forced bogominitrain.pl (all messages in the training set
          are classified correctly)
Method 3) one run of randomtrain (should be similar to one run of
          bogominitrain.pl)
Method 4) combined training, i.e., full training on first 10,000, then
          correcting errors

Then I'll test with all four databases how the next 2,500 ham and
2,500 spam messages are classified. As you would do normally, the
errors are then corrected in each of the four methods as follows:

Method 1) all messages added
Method 2) retraining as above with now 17,500 messages each
Method 3) randomtrain on the new 2,500 messages only
Method 4) correcting errors on the new 2,500 messages

Finally, I test the last 2,500 messages.
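
The split into (ham|spam)[0-3] is not shown in this post. As a
minimal sketch, assuming each class starts out as one mbox file in
which every message begins with a "From " line, the four parts could
be cut like this. The function name split_mbox and the input file
names are hypothetical; the default boundaries are the 10,000 / 5,000
/ 2,500 / 2,500 split described above.

```sh
#! /bin/sh
# Sketch only: cut one mbox into four consecutive parts by message count.
# split_mbox and its arguments are made up for illustration.
split_mbox() {
  # $1 = input mbox, $2 = output prefix
  # $3 $4 $5 = number of the last message in parts 0, 1 and 2
  #            (defaults: 10000, 15000, 17500; the rest goes to part 3)
  LC_ALL=C awk -v prefix="$2" \
      -v b0="${3:-10000}" -v b1="${4:-15000}" -v b2="${5:-17500}" '
    /^From / { n++ }           # a "From " line starts the next message
    n == 0   { next }          # ignore anything before the first message
    {
      if (n <= b0)      part = 0
      else if (n <= b1) part = 1
      else if (n <= b2) part = 2
      else              part = 3
      print > (prefix part)    # e.g. ham0, ham1, ham2, ham3
    }
  ' "$1"
}
```

With hypothetical input files this would be called as
"split_mbox all-ham.mbox ham" and "split_mbox all-spam.mbox spam".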

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is the output of my script (below), which does what I described.
I have tidied up the output to make it easier to read:

This is what we have:
bogofilter version 0.15.6

algorithm   = fisher
robx        = 0.500000 (5.00e-01)
robs        = 0.100000 (1.00e-01)
min_dev     = 0.300000 (3.00e-01)
ham_cutoff  = 0.000000 (0.00e+00)
spam_cutoff = 0.501000 (5.01e-01)

block_on_subnets  = no
strict_check      = no
ignore_case       = no
header_line_markup = yes
tokenize_html_tags = yes
replace_nonascii_characters = no

ham0:10000
ham1:5000
ham2:2500
ham3:2500
spam0:10000
spam1:5000
spam2:2500
spam3:2500

Building wordlists ...
[...]

Training results:
[I had a mistake here:-(]

Wordlist 1:
                       spam   good
.MSG_COUNT            15000  15000

Wordlist 2:
                       spam   good
.MSG_COUNT              627    308

Wordlist 3:
                       spam   good
.MSG_COUNT              455     45

Wordlist 4:
                       spam   good
.MSG_COUNT            10154  10031

         |fn in|fp in
         | 2500| 2500
---------+-----+-----
Method 1 |  54 |  12
Method 2 |  29 |  17
Method 3 |   2 | 916
Method 4 |  36 |  18

Correcting errors ...
[...]

Results after corrections:
-rw-------    1 3.14     3.14     28258304 Oct 13 17:39 1/wordlist.db
-rw-------    1 3.14     3.14      3682304 Oct 13 18:30 2/wordlist.db
-rw-------    1 3.14     3.14      1679360 Oct 13 19:01 3/wordlist.db
-rw-------    1 3.14     3.14     19357696 Oct 13 19:05 4/wordlist.db

Wordlist 1:
                       spam   good
.MSG_COUNT            17500  17500

Wordlist 2:
                       spam   good
.MSG_COUNT              755    347

Wordlist 3:
                       spam   good
.MSG_COUNT              508     52

Wordlist 4:
                       spam   good
.MSG_COUNT            10190  10050

         |fn in|fp in
         | 2500| 2500
---------+-----+-----
Method 1 |  72 |   3
Method 2 |  43 |   2
Method 3 |   3 | 755
Method 4 |  60 |   9

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When choosing among the training methods, two criteria matter. For
many people one will be their quota, which might make methods 1) and
4) unusable. The other is in everybody's interest, namely the quality
of the decisions in production.
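
To put numbers on the quota point, here is a small sketch that
converts the wordlist.db sizes from the "Results after corrections"
listing into MiB. The byte counts are copied from that listing; the
helper name mib is made up for this example.

```sh
#! /bin/sh
# Sketch: convert the wordlist sizes (bytes, from the ls -l output
# after corrections) into MiB.
mib() {
  # $1 = size in bytes
  LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f", b / 1048576 }'
}

# method:bytes pairs, copied from the listing
for entry in 1:28258304 2:3682304 3:1679360 4:19357696
do
  echo "Method ${entry%%:*}: $(mib "${entry#*:}") MiB"
done
```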

So let's look at the first run. Method 1 had the lowest fp-rate, but
many more fns compared to method 2. Method 2 was better than method 4
in both. Method 3 failed completely; it looks like something is very
wrong.

In the second run we have an overall winner, which is method 2. I'd
also say that method 1 worked better than method 4, with a far better
fp-rate and only a few more fns. Again, method 3 is a complete failure.
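
For easier comparison, the counts from both tables can be turned into
percentages of the 2,500 test messages per class. This is only a
sketch; the counts are copied from the tables above, and the helper
names rate and report are made up for this example.

```sh
#! /bin/sh
# Sketch: express the fn/fp counts from the two result tables as
# percentages of the 2,500 messages tested per class.
rate() {
  # $1 = number of errors, $2 = number of messages tested
  LC_ALL=C awk -v e="$1" -v n="$2" 'BEGIN { printf "%.2f", 100 * e / n }'
}

report() {
  # $1 = label, $2 = fn count, $3 = fp count
  echo "$1: fn $(rate "$2" 2500)%  fp $(rate "$3" 2500)%"
}

echo "First run:"
report "Method 1" 54 12
report "Method 2" 29 17
report "Method 3" 2 916
report "Method 4" 36 18

echo "After corrections:"
report "Method 1" 72 3
report "Method 2" 43 2
report "Method 3" 3 755
report "Method 4" 60 9
```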

So if you need to look at your quota, method 2 clearly is the way to
go. Even if you don't care about that, it performs better than method
4 and is mostly better than method 1.

Of course, all the results depend on my mail collection, my settings,
the phase of the moon, and what have you. So I hope others do the same
kind of testing to see whether those results are stable.

pi

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is the script that did all the work:

#! /bin/sh
echo "This is what we have:"
rm -rf 1 2 3 4
mkdir 1 2 3 4
my-bogofilter -Q
grep -c '^From ' ham* spam*

echo "Building wordlists ..."
# first 10,000 messages each for methods 1 and 4
my-bogofilter -d 1 -s < spam0
my-bogofilter -d 1 -n < ham0
cp 1/wordlist.db 4
# finish method 1 training
my-bogofilter -d 1 -s < spam1
my-bogofilter -d 1 -n < ham1
echo "Done with wordlist 1."
# method 2 training
bogominitrain.pl -fn 2 'ham[01]' 'spam[01]' '-o 0.701,0.201'
echo "Done with wordlist 2."
# method 3 training
randomtrain -d 3 -s spam0 -s spam1 -n ham0 -n ham1
echo "Done with wordlist 3."
# method 4 training
classify spam1 -d 4
my-bogofilter -d 4 -s < corpus.good
rm -f corpus.*
classify ham1 -d 4
my-bogofilter -d 4 -n < corpus.bad
rm -f corpus.*
echo "Done with wordlist 4."

# training results
echo "Training results:"
ls -l */wordlist.db
for dir in [1-4]
do
  echo "Wordlist $dir:"
  my-bogoutil -w $dir .MSG_COUNT
  # fn = spam messages not reported as S, fp = ham messages not reported as H
  printf 'fn: '
  my-bogofilter -d "$dir" -TM < spam2 | grep -cv '^S'
  printf 'fp: '
  my-bogofilter -d "$dir" -TM < ham2 | grep -cv '^H'
done

# correcting errors
echo "Correcting errors ..."
# method 1
my-bogofilter -d 1 -s < spam2
my-bogofilter -d 1 -n < ham2
echo "Done with wordlist 1."
# method 2
bogominitrain.pl -fn 2 'ham[201]' 'spam[201]' '-o 0.701,0.201'
echo "Done with wordlist 2."
# method 3
randomtrain -d 3 -s spam2 -n ham2
echo "Done with wordlist 3."
# method 4
classify spam2 -d 4
my-bogofilter -d 4 -s < corpus.good
rm -f corpus.*
classify ham2 -d 4
my-bogofilter -d 4 -n < corpus.bad
rm -f corpus.*
echo "Done with wordlist 4."

# training results
echo "Results after corrections:"
ls -l */wordlist.db
for dir in [1-4]
do
  echo "Wordlist $dir:"
  my-bogoutil -w $dir .MSG_COUNT
  # fn = spam messages not reported as S, fp = ham messages not reported as H
  printf 'fn: '
  my-bogofilter -d "$dir" -TM < spam3 | grep -cv '^S'
  printf 'fp: '
  my-bogofilter -d "$dir" -TM < ham3 | grep -cv '^H'
done
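
The "classify" helper used for method 4 is not part of this post.
From the way its output is used (corpus.good is registered as spam
after classifying spam, corpus.bad as ham after classifying ham), it
appears to sort the messages of an mbox into corpus.good (judged ham)
and corpus.bad (judged spam). Here is a hypothetical stand-in along
those lines, not the original helper. The name classify_sketch and its
calling convention are assumptions; it relies on the classifier
exiting with status 0 for spam, as bogofilter documents, and it treats
every other status as ham.

```sh
#! /bin/sh
# Hypothetical stand-in for "classify"; NOT the script used above.
# Usage: classify_sketch MBOX CLASSIFIER [ARGS...]
# Each message is piped to the classifier. Exit status 0 is taken to
# mean spam; any other status (ham, but also unsure or an error) is
# treated as ham here, which is a simplification.
classify_sketch() {
  mbox=$1
  shift
  tmp=$(mktemp -d) || return 1
  : > corpus.good
  : > corpus.bad
  # split the mbox into one file per message, numbered in order
  LC_ALL=C awk -v dir="$tmp" '
    /^From / { if (out) close(out); out = sprintf("%s/%06d", dir, ++n) }
    out      { print > out }
  ' "$mbox"
  for msg in "$tmp"/*
  do
    [ -f "$msg" ] || continue
    if "$@" < "$msg" > /dev/null
    then cat "$msg" >> corpus.bad     # judged spam
    else cat "$msg" >> corpus.good    # judged ham
    fi
  done
  rm -rf "$tmp"
}
```

Under these assumptions, the method 4 step for spam1 would read
"classify_sketch spam1 my-bogofilter -d 4", followed by registering
corpus.good as spam as in the script above.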