Home
Reading
Searching
Subscribe
Sponsors
Statistics
Posting
Contact
Spam
Lists
Links
About
Hosting
Filtering
Features Download
Marketing
Archives
FAQ
Blog
 
Gmane
From: Tim Starling <t.starling <at> physics.unimelb.edu.au>
Subject: Fast string search extension
Newsgroups: gmane.comp.php.pecl.devel
Date: Tuesday 25th July 2006 16:55:57 UTC (over 10 years ago)
This is my first attempt at a real PHP extension (i.e. one that doesn't use

swig), so constructive criticism would be greatly appreciated.

At Wikimedia, we were finding strtr() too slow for the uses we were putting

it to, such as converting large documents between simplified and
traditional 
chinese. A quick glance at the source code will tell you why, it's an 
elementary algorithm, and a particularly slow implementation at that. It 
would have been doing about 20 hashtable lookups per byte.

I did a bit of research into alternative algorithms, and was getting ready 
to implement my own Aho-Corasick style algorithm, when I discovered that 
there was already one in GNU grep, in an easily reusable form, to support 
the grep -F feature. I copied the relevant files, made a few minor changes 
for memory management, and added a simple PHP interface. The result can be 
seen at:

http://noc.wikimedia.org/~tstarling/fss-0.1.0.tar.gz

FSS stands for fast string search, it seemed like a decent generic name.

The extension uses a Commentz-Walter style algorithm for multiple keywords,

and a Boyer-Moore algorithm for single search terms. I provide both a
search 
interface (like strpos) and a replacement interface (like strtr). Both 
algorithms have a setup phase where they study the search set, and a fast 
search phase, which can be executed on multiple subject strings. The setup 
phase appears to be almost fast enough to use this extension as a generic 
replacement for strtr.

In a realistic test on the motivating application of Chinese language
script 
conversion, my extension outperformed strtr by a factor of 73, if you 
include setup time, and a factor of 245 if you just count the incremental 
time. Times were strtr: 1370ms, FSS setup: 13.2ms, FSS replace 5.5ms, for a

50 KB document.

What I'd like now is if someone could review the code -- just the stuff
that 
was written this decade, presumably the part I stole from grep is pretty 
well tested. I'd also appreciate some pointers on getting it into PECL.

-- Tim Starling
 
CD: 3ms