The Verity engine that ships with ColdFusion is ideal for free-text searching across file-based content. While working recently on a large file library, we stumbled across a technique for speeding up the indexing and building of Verity file collections.
There are two approaches to indexing files. You can nominate a directory and ask CF to recurse through its files and subdirectories, indexing every file it finds that matches a certain pattern. The alternative is to nominate individual files for indexing.
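As a rough sketch, the two approaches look something like this with `cfindex`. The collection name `fileLibrary`, the paths and the extension list are made up for illustration; substitute your own.

```cfml
<!--- Approach 1: hand CF a directory and let it recurse.
      Collection name, path and extensions are hypothetical. --->
<cfindex collection="fileLibrary"
         action="update"
         type="path"
         key="C:\library\docs"
         extensions=".pdf, .doc, .txt"
         recurse="yes">

<!--- Approach 2: nominate a single file by its absolute path.
      The file name is hypothetical. --->
<cfindex collection="fileLibrary"
         action="update"
         type="file"
         key="C:\library\docs\report.pdf">
```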
We've been working with file libraries where each file has a corresponding record in a database, both to capture additional metadata about the files and to provide a web-based interface to the file system. When indexing this style of content you typically query the database for the file library contents and loop over the recordset, indexing each file individually.
It turns out that some of these database records have lost their corresponding file over time. Although they made up only a very small percentage of the total library, we were surprised to find they had a very significant impact on indexing performance. If CF presents Verity with a file to index that doesn't actually exist, the system can take up to 3 seconds to decide not to do anything and move on (running CF7 in our environment).
I've no idea what CF/Verity is up to at this point, but multiply 3 seconds by a few hundred missing files and suddenly you're waiting longer for it to think than for the rest of the collection build. The workaround is easy: do a fileExists() on the physical file before indexing it, and if it's not there, don't index it.
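Here's a minimal sketch of that workaround, not our production code. The datasource `libraryDSN`, the table `libraryFiles`, its columns, the collection name `fileLibrary` and the log file name are all assumptions for the example, and `filePath` is assumed to hold an absolute path.

```cfml
<!--- Hypothetical datasource, table and column names --->
<cfquery name="qFiles" datasource="libraryDSN">
    SELECT fileID, title, filePath
    FROM libraryFiles
</cfquery>

<cfset missingCount = 0>

<cfloop query="qFiles">
    <cfif fileExists(qFiles.filePath)>
        <!--- Only hand Verity files that are really on disk --->
        <cfindex collection="fileLibrary"
                 action="update"
                 type="file"
                 key="#qFiles.filePath#"
                 title="#qFiles.title#">
    <cfelse>
        <!--- Skip the orphaned record and note it for later clean-up --->
        <cfset missingCount = missingCount + 1>
        <cflog file="verityIndex"
               type="warning"
               text="Skipped missing file: #qFiles.filePath# (record #qFiles.fileID#)">
    </cfif>
</cfloop>

<cfoutput>Skipped #missingCount# missing file(s).</cfoutput>
```

Logging the skipped records is optional, but it gives you a list of orphaned database rows to tidy up afterwards.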
Posted by modius at 08:58 AM

