August 13th, 2010

SQL Server Compact + Full-Text Search?

SQL Server Compact is a great in-process database engine: it’s free, lightweight, and xcopy deployable. However, since one of its selling points is its small footprint, Microsoft excludes from it many features that the full version of SQL Server has. One such feature is full-text search (FTS), which helps you build search engine capabilities into your applications.

So if SQL Compact doesn’t support full-text search, how do you implement it in your SQL Compact application? More specifically, how did we do it in Glo?

Plus, Glo holds textual information not only in relational tables, but also in lots of files external to the database. How does Glo retrieve search results from all of those files?

Glo’s search engine uses Lucene.NET, which is a C# port of the open-source Lucene search engine written by Doug Cutting. Lucene is an information-retrieval API that helps you build and read FTS indexes programmatically. Essentially, it can make any text searchable within your application.

There’s nothing wrong with executing LIKE ‘%term%’ queries on database tables for simple search functionality, although it’s likely to be slow. But in the same way that searching the index in the back of a book is faster than searching every page in the book, searching FTS indexes is faster than searching your entire database. A LIKE ‘%term%’ query that starts with a wildcard will usually force a table scan or at least an index scan, and will take much longer than an index seek.

A few years ago, a man asked me how it’s possible for Google to return so many results so quickly when he searches the Internet. “How can Google search the whole Internet so fast?” I explained to him that executing a search doesn’t search the Internet at all, it instead reads a set of pre-built index files, which, like the index at the back of the book, hold the answers all in one place.

Lucene uses the same principle to get search results. In fact, retrieving search results from FTS indexes is straightforward; getting the proper and latest data into the indexes is the part that takes creativity. I’ll expound on how Glo accomplishes that in a future post.

Props to my co-worker friend Adam Hill, who encouraged me to check Lucene out.

Leave a Reply