Gmane
From: Boris 'pi' Piwinger <3.14 <at> logic.univie.ac.at>
Subject: Re: Radical lexers
Newsgroups: gmane.mail.bogofilter.general
Date: 2003-12-10 15:13:01 GMT
[Corrected version]

This is a very short test only. I compare my version (a) of
the lexer (http://piology.org/bogofilter/lexer_v3.l) with a
much stricter version of it (b). TOKEN will effectively be
of the form
[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]=()+?:#$._!'`~-]+

So it no longer matters where in a token a character shows
up: no punctuation is allowed at all (I hope I did not miss
anything). Basically, only letters, digits and characters
outside ASCII are allowed.

And an even more extreme version (c), where tokens are
explicitly: [[:alnum:]]+
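
To make the difference between (b) and (c) concrete, here is a rough sketch of both token patterns as Python regular expressions. It is not the flex lexer itself. It assumes that [:blank:] means space and tab, that [:cntrl:] means the ASCII control characters, and that [[:alnum:]] matches only ASCII letters and digits (as in the C locale). The names tokens_b and tokens_c are made up for this illustration, and the sketch works on Unicode strings rather than on raw bytes.

```python
import re

# (b): one or more characters that are not blank, not control
# characters, and not ASCII punctuation. The excluded set is
# transcribed from the bracket expression quoted above.
TOKEN_B = re.compile(
    r"""[^ \t\x00-\x1f\x7f<>;&%@|/\\{}^"*,\[\]=()+?:#$._!'`~-]+"""
)

# (c): ASCII letters and digits only (assumed C-locale [[:alnum:]]).
TOKEN_C = re.compile(r"[A-Za-z0-9]+")


def tokens_b(text):
    """Return the tokens the stricter pattern (b) would produce."""
    return TOKEN_B.findall(text)


def tokens_c(text):
    """Return the tokens the alphanumeric-only pattern (c) would produce."""
    return TOKEN_C.findall(text)


if __name__ == "__main__":
    sample = "Größe: 42cm, user@example.com"
    print(tokens_b(sample))
    print(tokens_c(sample))
```

Under these assumptions the two patterns agree on plain ASCII text and differ only where non-ASCII characters appear: (b) keeps a word like "Größe" whole, while (c) breaks it at every non-ASCII character.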

Here is what I get:
      wordlist  false neg       false pos
a)    27060k    210/13612       16/15670
b)    26832k    206/13612       17/15670
c)    23332k    210/13612       18/15670

So the size is a surprise. I expected a much smaller
wordlist for b) and an even smaller one for c).
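
For easier comparison, the figures in the table can be turned into percentages. The sketch below only does arithmetic on the numbers reported above. It assumes the denominators are the numbers of spam messages (for false negatives) and non-spam messages (for false positives) in the test set. The names results, rate and summarize are made up for this illustration.

```python
# Figures copied from the table above; wordlist sizes are in kilobytes.
results = {
    "a": {"wordlist_k": 27060, "fn": (210, 13612), "fp": (16, 15670)},
    "b": {"wordlist_k": 26832, "fn": (206, 13612), "fp": (17, 15670)},
    "c": {"wordlist_k": 23332, "fn": (210, 13612), "fp": (18, 15670)},
}


def rate(count, total):
    """Return count/total as a percentage."""
    return 100.0 * count / total


def summarize(results, baseline="a"):
    """Compute error rates and wordlist size relative to the baseline lexer."""
    base_size = results[baseline]["wordlist_k"]
    summary = {}
    for name, r in results.items():
        summary[name] = {
            "fn_pct": rate(*r["fn"]),
            "fp_pct": rate(*r["fp"]),
            "size_vs_base_pct": 100.0 * r["wordlist_k"] / base_size,
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(results).items():
        print(
            f"{name}) false neg {row['fn_pct']:.2f}%  "
            f"false pos {row['fp_pct']:.2f}%  "
            f"wordlist {row['size_vs_base_pct']:.1f}% of a)"
        )
```

On these numbers the wordlist for b) is less than 1% smaller than for a), and the one for c) is about 14% smaller, while the error counts differ by only a handful of messages.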

The result for b) hurts. It says (if it can be confirmed)
that we are doing things that are much too complicated when
defining a token. I really did not expect that lexer to
work. But well, that's how it is.

c) is really mind-blowing. This simply MUST NOT work.

pi