Simple search engine in C# with Lucene.NET

Nowadays it is common to see search boxes on websites. Most of them rely on popular external search engines – think Bing, Google etc. – to search the site’s content. This approach is not very sophisticated, and a dedicated developer may prefer to provide his/her own search engine. In this post I will try to briefly present the capabilities of Lucene.

Lucene and Lucene.NET

It is not easy to build a search tool which is more than just a simple SQL query with a couple of LIKE clauses. This is where developers need to find a suitable solution, and one possibility is to use a third-party library. One of the best known is Lucene (http://goo.gl/59W4a) – a full-text search engine library. Its biggest disadvantage for a C# developer is that Lucene is written entirely in Java. Fortunately there is a ported version called Lucene.NET (http://goo.gl/MVNy3). Both Apache Lucene and Lucene.NET are open-source projects available for free download (Lucene.NET also as a NuGet package).

As you can expect, this port is under ongoing development, which can cause occasional problems, but the current version of the core is stable and no major bugs have been reported so far. Thanks to this library you do not need to implement sophisticated search logic in your application or in your SQL queries. You just need to properly include and use Lucene.NET in your application.

Introduction

There are a couple of concepts which need to be introduced before we dig into the code. Lucene works on something called an index, which is a textual representation of the data that the search methods operate on – it comes in two main forms: a file-based index and an in-memory index. Based on that index, your search engine can use the power of Lucene.

Each query returns a set of documents which fulfill your requirements. But it is very important to understand that every file (a document in Lucene’s language) can be a better or worse match for the query. We therefore need a way of scoring the returned values – this is done for you by the library. Each result you receive contains not only info about the documents but also a score for each of them. You can decide what score level is high enough to recognize a document as a valid search result.

Build index

The first step is to create the index for Lucene. This part consists of a couple of steps. Let’s go through them.

1. Create the writer, which will use the analyzer to build the index.

var dir = FSDirectory.Open(new DirectoryInfo(@"C:/test_lucene")); // (1)
var analyzer = new StandardAnalyzer(Version.LUCENE_30); // (2)
var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED); // (3)

In (1) a directory on the C drive is opened – this line is straightforward. In (2) the analyzer is instantiated. In short, the analyzer is a tokenizer, stemmer and stop-word filter. The StandardAnalyzer used here filters the input sequence with StandardFilter (normalizes tokens), LowerCaseFilter (normalizes token text to lower case) and StopFilter (using a list of English stop words). In (3) we create the IndexWriter, which simply builds the index – we can think of this index much like an index in a database. Such an index typically takes 20-30% of the size of the text indexed.
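To make the analyzer’s behaviour more concrete, here is a minimal sketch (not part of the original example) that prints the tokens StandardAnalyzer produces for a sample sentence; the field name "Name" is arbitrary and the attribute API is the one from Lucene.Net 3.0.3:

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

// For "The Quick Brown FOX!" this should print: quick, brown, fox
// ("The" is dropped as a stop word, the rest is lower-cased).
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
TokenStream stream = analyzer.TokenStream("Name", new StringReader("The Quick Brown FOX!"));
ITermAttribute term = stream.AddAttribute<ITermAttribute>();
while (stream.IncrementToken())
{
    Console.WriteLine(term.Term);
}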

2. Add data into the index.

foreach (var sampleData in data)
{
    var doc = new Document();
    doc.Add(new Field("Id", sampleData.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Name", sampleData.Name, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Description", sampleData.Description, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}

This loop iterates over the data enumeration (in this example the data itself is not important, so I have omitted it). As you can see, there is a new concept here – a Document object is created for each element. As the API documentation explains, documents are the unit of search and indexing. Each Document is a set of fields, where every field has a name and a textual value. Each document should (typically) contain one or more stored fields which uniquely identify it.
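For completeness, the loop above assumes some enumerable of simple objects. A hypothetical SampleData class (my own sketch, not from the original post) could look like this:

// A hypothetical data class – just enough to back the indexing loop above.
public class SampleData
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
}

// Example data the loop could iterate over.
var data = new[]
{
    new SampleData { Id = 1, Name = "Lucene.NET", Description = "Full-text search engine library for .NET" },
    new SampleData { Id = 2, Name = "NuGet", Description = "Package manager for the .NET platform" }
};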

The constructor of Field used in the example takes 4 arguments:

    1. The first is the name under which we can later reference the value (this should probably be a constant, for easier re-use and refactoring – see the sketch after this list).
    2. The second is the actual value of the property for this document.
    3. The third determines whether the value should be stored in the index or not.
    4. The fourth and last specifies whether and how the field should be indexed. In the example I have used only two of the five possible states: NOT_ANALYZED and ANALYZED. With the first, the field’s value is indexed without using an Analyzer; with the second, the tokens are indexed by running the field’s value through an Analyzer. More can be found in the API documentation.

For each element of the list a document is created and then added to the index writer.
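As suggested in point 1, the field names are good candidates for constants. A minimal sketch (my own convention, not from the original code):

// Hypothetical constants for the field names used throughout the examples.
public static class LuceneFields
{
    public const string Id = "Id";
    public const string Name = "Name";
    public const string Description = "Description";
}

// Usage inside the indexing loop:
doc.Add(new Field(LuceneFields.Id, sampleData.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));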

3. Close the writer.

writer.Optimize();
writer.Commit();
writer.Dispose();

This fragment of code is self-explanatory – we optimize and commit the index, then dispose of the writer to release the index and make it available to other parts of the app.
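Since IndexWriter implements IDisposable, the same clean-up can also be written with a using block. A sketch of the equivalent, slightly more idiomatic shape:

// Equivalent shape with deterministic disposal; Dispose (like Close in
// the Java original) also commits pending changes on the way out.
using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // ... add documents as shown above ...
    writer.Optimize();
    writer.Commit();
} // writer.Dispose() runs here automatically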

At this point we have created the index for our data. I can imagine building a Lucene index for only part of the data stored in the DB – with the full information remaining in the database. Only the parts relevant to searching are included in Lucene’s index.

Use index

In the previous section we created the index. Now it is time to use it and see the magic of Lucene.

1. First we need to open the index and prepare the analyzer.

var directory = FSDirectory.Open(new DirectoryInfo(@"C:/test_lucene"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

In the first line we open our index and in the second we create the analyzer for it.

2. It is time for the most interesting part of this post – the actual usage of the index, in which we search it with the input text.

var parser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Name", "Description" }, analyzer); // (1)
Query query = parser.Parse(text); // (2)
var searcher = new IndexSearcher(directory, true); // (3)
TopDocs topDocs = searcher.Search(query, 10); // (4)

First (1) we create the query parser – as you can see, in this example I have used a parser for multiple fields, not just one. For a single field you would use QueryParser instead of MultiFieldQueryParser (see the sketch below). The parser is used in (2) to parse the input value (text). In (3) the searcher over the index is created – here we point at the directory where we created the index for our data. In (4) we search for the top 10 results that match the searched text.
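For comparison, the single-field variant might look like this (a sketch searching only the "Name" field; QueryParser lives in the Lucene.Net.QueryParsers namespace):

// Searching a single field: QueryParser takes one field name instead of an array.
var nameParser = new QueryParser(Version.LUCENE_30, "Name", analyzer);
Query nameQuery = nameParser.Parse(text);
TopDocs nameDocs = searcher.Search(nameQuery, 10);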

3. Use the result from the search.

int results = topDocs.ScoreDocs.Length;
Console.WriteLine("Found {0} results", results);

for (int i = 0; i < results; i++)
{
   ScoreDoc scoreDoc = topDocs.ScoreDocs[i];
   float score = scoreDoc.Score;
   int docId = scoreDoc.Doc;
   Document doc = searcher.Doc(docId);

   Console.WriteLine("{0}. score {1}", i + 1, score);
   Console.WriteLine("ID: {0}", doc.Get("id"));
   Console.WriteLine("Text found: {0}\r\n", doc.Get("Name"));
}

In the previous step we obtained topDocs, over which we can iterate to get the data we are interested in. As I have already mentioned, at this point we could take the IDs of the found documents and fetch more information from the database or file system. One interesting part is the Score value (it is important to note that the results are ordered by this value!), which is the document’s score for the query. It is always a number – the higher it is, the better the document satisfies the query.
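As noted in the introduction, you can also decide which scores are good enough to count as a hit. A minimal sketch with an arbitrary cut-off (the 0.5f threshold is my own example value, not something Lucene prescribes):

// Keep only hits above a minimum score (uses System.Linq).
const float minScore = 0.5f; // hypothetical threshold – tune it for your data
var goodHits = topDocs.ScoreDocs.Where(sd => sd.Score >= minScore);

foreach (ScoreDoc hit in goodHits)
{
    Document doc = searcher.Doc(hit.Doc);
    Console.WriteLine("{0} (score {1})", doc.Get("Name"), hit.Score);
}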

Conclusion

It took only around 100 lines of code to create a simple search engine – and that was together with the sample data. Of course, in a real-world scenario there will be more sophisticated logic and more operations for optimizing the index, especially as it grows very big.

There are some important properties to know while working with Lucene. One of the biggest is that the index writer is fully thread-safe, which means that multiple threads can call any of its methods concurrently.

As you can imagine, the index should be prepared while the application is loading. I can also imagine the index being kept in memory and updated whenever new data goes to the DB (see the sketch below). You can decide what part of the data will be included in the text-search source.
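For the in-memory variant, Lucene.NET ships a RAMDirectory (in the Lucene.Net.Store namespace) that can be swapped in for FSDirectory. A minimal sketch:

// An in-memory index: everything is lost when the process exits,
// so it has to be rebuilt at application start-up.
var memoryDir = new RAMDirectory();
using (var memoryWriter = new IndexWriter(memoryDir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // ... add documents exactly as in the file-based example ...
    memoryWriter.Commit();
}
var memorySearcher = new IndexSearcher(memoryDir, true);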

This post was not a full introduction to text-based search. It presented the potential of Lucene and its port to the .NET Framework.

I think playing around with this library can be quite interesting and eye-opening, especially once we understand the sophisticated algorithms behind the scenes.
