From: Carl Worth <cworth-4HiWtcSh4w0dnm+yROfE0A <at> public.gmane.org>
Subject: Some Xapian tips and thoughts on rebuilding
Newsgroups: gmane.mail.notmuch.general
Date: Sunday 10th January 2010 17:43:38 UTC
With the recent change to "database format 1" some users might decide to
rebuild their notmuch database. If so, there are some things I've
learned about Xapian that are good to know before you rebuild. Or maybe
what you read below will encourage you to rebuild your notmuch database.

I think all users of notmuch have been discouraged by how slow it is to
change the tags on messages. Many of you have heard of "Xapian defect
#250" that was causing some slowness here. I'm happy to report that with
initial code from Kan-Ru Chen, Richard Boulton has recently committed a
fix for this bug to Xapian upstream, (after rewriting the fix
substantially, extending the fix to multiple backends, and writing
several new Xapian test cases for it).

However, just upgrading your Xapian library won't necessarily give you
any benefit with notmuch. But you can be assured of getting some benefit
if you upgrade both Xapian and notmuch and rebuild your notmuch
database. The gory details are covered below.

Gory details for getting the Xapian #250 fix benefit with flint
---------------------------------------------------------------
Xapian has a notion of multiple backends which store the data in the
database differently. In the 1.0 versions of Xapian, the default backend
is the "flint" backend. This backend stores the document "length" in
every "posting" entry, (where a posting is effectively a link from a
particular "term" to a particular "document" perhaps with positional
information).

The fix for defect #250 is to update as little as possible when we add
a single term (and hence a posting) to a document, or remove one from
it. But if this change also changes the document length, then all
postings for that document will unavoidably need to be updated.

Historically, notmuch hasn't taken any special care with the effect on
"document length" when adding terms for things like tags. The default
treatment is that terms *do* affect document length. But for terms like
tags that don't actually occur in the document content, it makes sense
to record them as having 0 effect on the document length. I recently
fixed notmuch to do so. But you'll have to rebuild your notmuch database
with a recent notmuch in order to get that change.

But if you rebuild, you might want to use chert instead of flint
----------------------------------------------------------------
I mentioned that "flint" is the default backend in the 1.0 releases of
Xapian. In the development versions that you can check out from the
project's svn repository, there's support for a newer backend named
"chert", (expected to be the default in an upcoming release). To get
Xapian to use chert you need to have the following environment variable
set when doing the initial "notmuch new" to build your database:

	XAPIAN_PREFER_CHERT=1

After that, Xapian will see that your database is chert and will know
how to deal with it. (Except that I have seen that upgrading Xapian
from one svn version to another may result in incompatible changes to
the chert format---so a future Xapian may not be able to read a
previously-created chert database. I assume these format changes won't
happen in stable releases of Xapian.)
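
As a minimal sketch of the step above, the variable can be set for just
the one command rather than exported for the whole session. The helper
name build_chert_db and the NOTMUCH override are hypothetical, added
only for illustration; the sketch also assumes notmuch is already
configured and that no database exists yet.

```shell
# Hypothetical helper: run the initial "notmuch new" with the chert
# preference set for that single command only.
# NOTMUCH is an assumed override for the notmuch binary name.
build_chert_db() {
    XAPIAN_PREFER_CHERT=1 "${NOTMUCH:-notmuch}" new
}

# Usage (not run here):
#   build_chert_db
```

Setting the variable as a command prefix keeps it out of the rest of
the shell session, which should be enough since Xapian only needs it
when the database is first created.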

One thing that's nice about chert compared to flint is that it no longer
stores the document length in every posting. This means it's easier to
get the benefit from the Xapian defect #250 fix. It also means that your
database can be much smaller. For my notmuch database, a flint build is
about 7.0GB while a chert build is only 5.0GB---a very nice change.

Compacting your database
------------------------
One final tip. I recently started experimenting with a Xapian feature
for compacting a database. This is available only via a command-line
program, (named xapian-compact in the 1.0 releases and
xapian-compact-1.1 in the current Xapian from svn). This functionality
is not yet available in the Xapian library interface or else I would
probably make notmuch call it after building the database.

If you want to experiment with xapian-compact, you'll want to call it
with a command something like the following:

     xapian-compact-1.1 --no-renumber ~/mail/.notmuch/xapian \
         ~/mail/.notmuch/xapian-compact

The --no-renumber argument is essential with a notmuch database, since
(as of database format version 1), notmuch stores Xapian document IDs
internally within terms. If you forget this, you'll find that all of
your searches will return results that are unable to locate any of the
filenames corresponding to your mail.

After running the above command, you could then move your existing
.notmuch/xapian away and move .notmuch/xapian-compact in its place to
test, and then discard the original .notmuch/xapian if you're happy with
the result.
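
The compact-then-swap procedure described here could be sketched as a
small shell function. This is only an illustration under some
assumptions: the function name compact_notmuch_db, the COMPACT
override, and the xapian.orig backup name are all hypothetical, and the
database is assumed to live under ~/mail/.notmuch as in the command
shown earlier.

```shell
# Hypothetical helper: compact the Xapian database under a .notmuch
# directory, then swap the compacted copy into place.
# COMPACT is an assumed override for the binary name.
compact_notmuch_db() {
    db_dir=${1:-"$HOME/mail/.notmuch"}
    compact=${COMPACT:-xapian-compact-1.1}

    # Refuse to run if a previous attempt left directories behind.
    for path in "$db_dir/xapian-compact" "$db_dir/xapian.orig"; do
        if [ -e "$path" ]; then
            echo "refusing to overwrite $path" >&2
            return 1
        fi
    done

    # --no-renumber keeps the document IDs that notmuch stores in terms.
    "$compact" --no-renumber "$db_dir/xapian" "$db_dir/xapian-compact" \
        || return 1

    # Keep the original until the compacted copy has been tested.
    mv "$db_dir/xapian" "$db_dir/xapian.orig" &&
        mv "$db_dir/xapian-compact" "$db_dir/xapian"
}

# Usage (not run here):
#   compact_notmuch_db ~/mail/.notmuch
```

Once searches against the compacted database look right, the
xapian.orig directory can be removed by hand.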

For me, this compaction took my 5.0GB down to 3.1GB. So my database is
now less than half the size of what I started with under flint, (and
can conceivably be cached entirely within memory on my machine!), which
is quite delightful.
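
To check the same thing on your own database, something like the
following sketch may help. The function names dir_size_kb and
percent_of are hypothetical; the numbers passed at the end are just the
figures quoted in this message.

```shell
# Print the size of a directory in kilobytes (du -sk is POSIX).
dir_size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Print the first size as a percentage of the second, to one decimal.
percent_of() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.1f\n", 100 * new / old }'
}

# Compacted chert (3.1GB) relative to the original flint build (7.0GB).
percent_of 3.1 7.0
```

For the figures above this prints 44.3, which matches "less than half".
With real directories, the two arguments could instead come from
dir_size_kb run on the old and new database paths.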

I hope the above is helpful, (and yes, clearly we need to get this
content out in other ways such as in a README in the source
distribution, and on the website in some form much better than our
current pipermail-based mailing-list archives).

-Carl