From: Boris 'pi' Piwinger <3.14 <at> logic.univie.ac.at>
Subject: Re: Testing training methods
Newsgroups: gmane.mail.bogofilter.general
Date: 2003-10-24 08:25:16 GMT
Hi!

Some days ago I did a huge test, see:
http://article.gmane.org/gmane.mail.bogofilter.general/5346

David then suggested using the security margin in all methods where it
is applicable. So be it: I reran the test, this time with a larger
mail collection.

The rest of the setup remains the same, except that the messages are
no longer split into exactly 10,000, 5,000 or 2,500 messages per file.
Please see my previous test for details, or look at the script at the
end.

Method 1) full training
Method 2) forced bogominitrain.pl (all messages in the training set
          are classified correctly)
Method 3) one run of randomtrain (should be similar to one run of
          bogominitrain.pl)
Method 4) combined training, i.e., full training on first ~10,000,
          then correcting errors

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is the output of my script, which does what I described. I have
annotated the output to make it easier to follow:

Building the list ...
This is what we have:
bogofilter version 0.15.7

algorithm   = fisher
robx        = 0.499000 (4.99e-01)
robs        = 0.100000 (1.00e-01)
min_dev     = 0.300000 (3.00e-01)
ham_cutoff  = 0.000000 (0.00e+00)
spam_cutoff = 0.500000 (5.00e-01)

block_on_subnets  = no
strict_check      = no
ignore_case       = no
header_line_markup = yes
tokenize_html_tags = yes
replace_nonascii_characters = no

ham.0:10961
ham.1:5481
ham.2:2741
ham.3:2741
spam.0:10558
spam.1:5279
spam.2:2640
spam.3:2640

Building wordlists ...
Done with wordlist 1.

End of run #1:
Read 16442 ham mails and 15837 spam mails.
Added 280 ham mails and 546 spam mails to the database.
                       spam   good
.MSG_COUNT              546    280
[...]
End of run #4:
Read 16442 ham mails and 15837 spam mails.
Added 1 ham mail and 2 spam mails to the database.
                       spam   good
.MSG_COUNT              691    333

False negatives: 0
False positives: 0

4 runs needed to close off.

Done with wordlist 2.

 spam  reg   good  reg
15837  646  16442  282
Done with wordlist 3.

[Note that randomtrain took many more spam messages.]

Done with wordlist 4.

Training results:
-rw-------    1 3.14     3.14     27635712 Oct 23 16:24 1/wordlist.db
-rw-------    1 3.14     3.14      3563520 Oct 23 17:01 2/wordlist.db
-rw-------    1 3.14     3.14      3293184 Oct 23 18:17 3/wordlist.db
-rw-------    1 3.14     3.14     23068672 Oct 23 18:25 4/wordlist.db

Wordlist 1:
                       spam   good
.MSG_COUNT            15837  16442
Wordlist 2:
                       spam   good
.MSG_COUNT              691    333
Wordlist 3:
                       spam   good
.MSG_COUNT              646    282
Wordlist 4:
                       spam   good
.MSG_COUNT            10612  10966

         |fn in|fp in
         | 2640| 2741
---------+-----+-----
Method 1 |  46 |   4
Method 2 |  14 |   0
Method 3 |  15 |   1
Method 4 |  48 |   5

Results after corrections:
-rw-------    1 3.14     3.14     29884416 Oct 23 18:29 1/wordlist.db
-rw-------    1 3.14     3.14      3821568 Oct 23 19:08 2/wordlist.db
-rw-------    1 3.14     3.14      3469312 Oct 23 19:34 3/wordlist.db
-rw-------    1 3.14     3.14     23089152 Oct 23 19:38 4/wordlist.db

Wordlist 1:
                       spam   good
.MSG_COUNT            18477  19183
fn: 43
fp: 1
Wordlist 2:
                       spam   good
.MSG_COUNT              778    358
fn: 14
fp: 1
Wordlist 3:
                       spam   good
.MSG_COUNT              699    300
fn: 12
fp: 5
Wordlist 4:
                       spam   good
.MSG_COUNT            10636  10967
fn: 43
fp: 2

         |fn in|fp in
         | 2640| 2741
---------+-----+-----
Method 1 |  43 |   1
Method 2 |  14 |   1
Method 3 |  12 |   5
Method 4 |  43 |   2

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compared to the previous test, all methods improved. This is probably
related to the new distribution provided by David. The dramatic change
for method 3) is very surprising, and I cannot explain it. Maybe
something went horribly wrong in the original attempt. You would
expect methods 3) and 4) to work better due to the security margin.

Still, method 3) shows a risk of producing false positives, which is
what method 2) tries to avoid.

In this test the small databases (built purely by training on error)
perform *much* better than the huge ones. As expected, method 2) is a
bit safer.
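
To put "small" and "huge" into numbers, here is a sketch that compares
the wordlist.db sizes after the initial training. The byte counts are
copied from the ls output above; the helper name ratio is made up.

```shell
#!/bin/sh
# ratio A B (hypothetical helper): print A/B with one decimal.
ratio() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", a / b }'
}

# Size in bytes of 1/wordlist.db (full training) after initial training.
full=27635712

# Each entry is "wordlist size-in-bytes", copied from the ls output.
for entry in "2 3563520" "3 3293184" "4 23068672"
do
  set -- $entry
  echo "Wordlist 1 is $(ratio "$full" "$2") times the size of wordlist $1"
done
```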

Again: of course, all the results depend on my mail collection, my
settings, the phase of the moon, and what have you. So I hope others
will do the same kind of testing to see whether these results are
stable.

pi

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is the script that did all the work:

#!/bin/sh
# From the mail archive message-lists are built
echo "Building the list ..."
distrib.sh ham '../ham*'
distrib.sh spam '../spam*'
# Settings and files we work with
echo "This is what we have:"
rm -rf 1 2 3 4
mkdir 1 2 3 4
my-bogofilter -Q
grep -c '^From ' ham.* spam.*
# I declare the games open
echo "Building wordlists ..."
# first 10,000 messages each for methods 1 and 4 (not exactly, but good enough)
my-bogofilter -d 1 -s < spam.0
my-bogofilter -d 1 -n < ham.0
cp 1/wordlist.db 4
# finish method 1 training
my-bogofilter -d 1 -s < spam.1
my-bogofilter -d 1 -n < ham.1
echo "Done with wordlist 1."
# method 2 training
bogominitrain.pl -fn 2 'ham.[01]' 'spam.[01]' '-o 0.7,0.2'
echo "Done with wordlist 2."
# method 3 training (security margins are built into randomtrain.cf)
randomtrain -d 3 -c randomtrain.cf -s spam.0 -s spam.1 -n ham.0 -n ham.1
echo "Done with wordlist 3."
# method 4 training
classify spam.1 -d 4 -o 0.7,0.2
my-bogofilter -d 4 -s < corpus.good
rm -f corpus.*
classify ham.1 -d 4 -o 0.7,0.2
my-bogofilter -d 4 -n < corpus.bad
rm -f corpus.*
echo "Done with wordlist 4."

# training results
echo "Training results:"
ls -l */wordlist.db
for dir in [1-4]
do
  echo "Wordlist $dir:"
  my-bogoutil -w "$dir" .MSG_COUNT
  # fn: spam messages not classified as spam (S)
  printf 'fn: '
  my-bogofilter -d "$dir" -TM < spam.2 | grep -cv '^S'
  # fp: ham messages not classified as ham (H)
  printf 'fp: '
  my-bogofilter -d "$dir" -TM < ham.2 | grep -cv '^H'
done

# correcting errors
echo "Correcting errors ..."
# method 1
my-bogofilter -d 1 -s < spam.2
my-bogofilter -d 1 -n < ham.2
echo "Done with wordlist 1."
# method 2
bogominitrain.pl -fn 2 'ham.[201]' 'spam.[201]' '-o 0.7,0.2'
echo "Done with wordlist 2."
# method 3
randomtrain -d 3 -c randomtrain.cf -s spam.2 -n ham.2
echo "Done with wordlist 3."
# method 4
classify spam.2 -d 4 -o 0.7,0.2
my-bogofilter -d 4 -s < corpus.good
rm -f corpus.*
classify ham.2 -d 4 -o 0.7,0.2
my-bogofilter -d 4 -n < corpus.bad
rm -f corpus.*
echo "Done with wordlist 4."

# training results
echo "Results after corrections:"
ls -l */wordlist.db
for dir in [1-4]
do
  echo "Wordlist $dir:"
  my-bogoutil -w "$dir" .MSG_COUNT
  # fn: spam messages not classified as spam (S)
  printf 'fn: '
  my-bogofilter -d "$dir" -TM < spam.3 | grep -cv '^S'
  # fp: ham messages not classified as ham (H)
  printf 'fp: '
  my-bogofilter -d "$dir" -TM < ham.3 | grep -cv '^H'
done