Search engine for an online shop in C# with Lucene.NET (1/2)

In two previous posts we saw a basic implementation of a search engine and then the queries which can be built in your application. Both can be improved so that your search engine is more usable and fulfills your requirements. In the next two posts I will try to introduce some more sophisticated techniques which you could use when building your own search engine.

After this post your search engine should have the functionality expected of a modern search engine that fits into your product. I will also try to explain how to integrate the Lucene.NET library into a web-based application whose data is mostly stored in a database.

GOAL: after this post you should know the important parts of the search engine library and some techniques you could use in your app. For the sake of this post I will assume that we are building a search engine for an online shop where the user can search by:

  • name of the product,
  • price range (from ... to ...).

As you can see, this is not a very sophisticated search engine, but based on it you can build much more complex ones.

Filtering the search results by its price

When I discussed the queries which you could build in your application (or just leave to your users) I mentioned the range query. As you know, it is possible to ask Lucene for documents where a certain field falls within a range of values. Unfortunately, this query alone does not fit the goal of a search engine for an online shop. What we would like to achieve is to ask Lucene for some products and then filter them by a price range.

Of course we could do it in our own code on the list returned by Lucene. But is it possible to use Lucene’s power (indexes, cache etc.)? YES! This is where I shall introduce the FilteredQuery class. The definition is rather short: a query that applies a filter to the results of another query.

As you can see, to build a FilteredQuery object you need two parts: a query and a filter. The two previous posts were dedicated to creating a query, so you can use them as a reference. Now we just need a filter.

Filter filter = NumericRangeFilter.NewIntRange("Price", min, max, true, true);

As you can see, the creation method is as simple as it can be. The arguments are explained below:

  1. The name of the field in the document.
  2. Lower bound (can be NULL, which leaves the range open at the bottom).
  3. Upper bound (can be NULL, which leaves the range open at the top).
  4. Boolean value indicating whether to include documents where the field (1) is equal to the lower bound (2).
  5. The same as 4 but for the upper bound (3).

As you can see, our filter is designed to be built on the price of the product. It is important to mention right now that if you would like to use a field for filtering you need to declare it with a special field type.

var doc = new Document();
doc.Add(new Field("Id", sampleData.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED)); // (1)
doc.Add(new Field("Name", sampleData.Name, Field.Store.YES, Field.Index.ANALYZED)); // (2)
doc.Add(new Field("Description", sampleData.Description, Field.Store.YES, Field.Index.ANALYZED)); 
doc.Add(new NumericField("Price").SetIntValue(sampleData.Price)); // (3)

You need to use the NumericField class if you would like to filter on that field. Having that, we can create the filtered query:

var fq = new FilteredQuery(query, filter); // (4)

Building a filtered query is as simple as its definition. What is important to know is that FilteredQuery derives from Query, and because of that you can build very sophisticated queries. First you search for something, then apply a filter, and after that you can wrap the result in another FilteredQuery with another filter (on another field). As you can see, this query can become a very powerful tool for your search engine.
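To see the pieces above working together, here is a minimal end-to-end sketch. It assumes the Lucene.Net 3.0.3 package; the PriceFilterSketch class, the in-memory RAMDirectory and the three sample products are made up purely for illustration and are not part of the original shop code.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Hypothetical helper for illustration only.
public static class PriceFilterSketch
{
    // Adds one product; Price is a NumericField so that it can be filtered.
    static void AddProduct(IndexWriter writer, string name, int price)
    {
        var doc = new Document();
        doc.Add(new Field("Name", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new NumericField("Price").SetIntValue(price));
        writer.AddDocument(doc);
    }

    // Searches the Name field and keeps only hits with min <= Price <= max.
    // A null bound leaves that side of the range open.
    public static List<string> Search(string text, int? min, int? max)
    {
        var dir = new RAMDirectory(); // in-memory index, sample data only
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            AddProduct(writer, "red bike", 100);
            AddProduct(writer, "blue bike", 300);
            AddProduct(writer, "bike helmet", 40);
        }

        Query query = new QueryParser(Version.LUCENE_30, "Name", analyzer).Parse(text);
        Filter filter = NumericRangeFilter.NewIntRange("Price", min, max, true, true);
        var filtered = new FilteredQuery(query, filter);

        var names = new List<string>();
        using (var searcher = new IndexSearcher(dir, true))
        {
            foreach (ScoreDoc hit in searcher.Search(filtered, 10).ScoreDocs)
            {
                names.Add(searcher.Doc(hit.Doc).Get("Name"));
            }
        }
        return names;
    }
}
```

Calling PriceFilterSketch.Search("bike", 50, 150) should leave only the product priced at 100. Because the result of the FilteredQuery constructor is itself a Query, it could be passed to another FilteredQuery with a second filter.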

Promotions – how to promote certain documents?

As you remember (if not, you can check it here), you can boost any field in the document so that it is more important than the rest. You probably already expect that you could do something similar with whole documents, and you would not be wrong.

It is possible to boost any document to make it more relevant. This is done when you create the index. When we declared our doc object earlier, we could just set a property:

doc.Boost = 2;

This value influences the score which Lucene uses to rank documents for your query. Normally the boost factor of each document is equal to 1. For each document the library multiplies the score by the boost value and orders the results accordingly. As you can see, you can promote some documents by making the boost factor greater than 1. By making this value less than 1 you decrease the score and make the document less relevant.

What is very important: once you create the index you cannot read the boost value back (the property will ALWAYS return 1!), but you can be sure that the score is changed by this value.

This factor makes your search more flexible: you could make this value a very big number (and probably add some term to your query programmatically). As a result, the first positions (or the first couple of results) will hold the documents which you would like to show your client as the most relevant and interesting.
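The effect of a document boost can be checked with a small experiment. The sketch below assumes Lucene.Net 3.0.3; the BoostSketch class and the two sample documents (identical text, different boost) are hypothetical and exist only to make the ranking difference visible.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Hypothetical helper for illustration only.
public static class BoostSketch
{
    static Document Product(string id, string name, float boost)
    {
        var doc = new Document();
        doc.Add(new Field("Id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("Name", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.Boost = boost; // has to be set before the document is added to the index
        return doc;
    }

    // Returns the Ids of matching documents, best score first.
    public static List<string> RankedIds(string text)
    {
        var dir = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        using (var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            // Same text in both documents, so only the boost can change the order.
            writer.AddDocument(Product("regular", "search engine", 1f));
            writer.AddDocument(Product("promoted", "search engine", 2f));
        }

        Query query = new QueryParser(Version.LUCENE_30, "Name", analyzer).Parse(text);

        var ids = new List<string>();
        using (var searcher = new IndexSearcher(dir, true))
        {
            foreach (ScoreDoc hit in searcher.Search(query, 10).ScoreDocs)
            {
                ids.Add(searcher.Doc(hit.Doc).Get("Id"));
            }
        }
        return ids;
    }
}
```

Although the promoted document is added second, BoostSketch.RankedIds("search") should list it first, because its boost raises its score above the otherwise identical regular document.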

How to build a search engine on production systems?

Nowadays most data is probably stored in database systems. How could you then use Lucene in such an environment? What I would suggest is to keep the index alongside the DB tables. You can make your index as compact as possible, which matters for performance.

Of course there is a very big challenge: how do you ensure that your database and your search engine index stay consistent? Unfortunately it is up to you and your skills. I will come back to this problem in the next post, where we will dig into some more interesting stuff.

Conclusions

So here it is, your fully operational search engine. But there is still room for improvement. You probably see it already: even when you use all these techniques, your search engine still misses one thing, paging. It is common to show not all results on the result page but just a subset of the big result set. This and some other, more sophisticated topics will be covered in the next post.

Multi-parameter search engine in C# with Lucene.NET

In my previous post (see here) we created a simple search engine implemented in C# with Lucene.NET. That was rather an introduction to this technology. As you can expect, Lucene offers much more than just a simple one-word or multi-word query. It is possible to create your own query through Lucene’s API, but Lucene also provides a rich query language which is parsed (by the Query Parser) from the input string into a Lucene Query. If you find this topic interesting I strongly recommend the documentation which is available online; the implementation can vary from version to version, so it is the best source of knowledge.

This post will cover a couple of the available query techniques:

  • querying specific fields,
  • wildcard searches,
  • range searches,
  • boosting the term (and the document as well!),
  • Boolean operators.

Fields

As presented in the simple search engine, each entry in the index is built from a set of fields. For example, we previously defined our document as:

var doc = new Document();
doc.Add(new Field("Id", sampleData.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Name", sampleData.Name, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("Description", sampleData.Description, Field.Store.YES, Field.Index.ANALYZED));

The document above contains three fields and two of them are analyzed during the indexing procedure. Lucene allows us to query not only a set of fields; it is also possible to query a value in a specific field. Let’s assume that we would like to find the documents whose name contains “searching”.

This could be done by:

Name:searching

If we would like to ask for more than one word we should wrap them in double quotes, as shown below (we try to find a document with “search engine” in the name). This is called a phrase query.

Name:"search engine"

If we had used

Name:search engine

Lucene would look up only “search” in the Name field and “engine” in the default field (or fields) of the parser. Of course it is possible to ask Lucene for words in more than just one field.

Name:search Description:engine

This is directly connected to the Boolean operators, where we can use Boolean logic to build more sophisticated queries.
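One way to check how a query string is interpreted is to parse it and look at the resulting Query object. The sketch below assumes Lucene.Net 3.0.3; the FieldQuerySketch class is hypothetical, and Description is chosen as the default field only for this example.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

// Hypothetical helper for illustration only.
public static class FieldQuerySketch
{
    // Parses the text with "Description" as the default field, so every term
    // without an explicit "Field:" prefix is searched in Description.
    public static Query Parse(string queryText)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var parser = new QueryParser(Version.LUCENE_30, "Description", analyzer);
        return parser.Parse(queryText);
    }
}
```

With this helper, FieldQuerySketch.Parse("Name:\"search engine\"") should come back as a PhraseQuery, while FieldQuerySketch.Parse("Name:search engine") should come back as a BooleanQuery whose ToString() shows the second term under the Description field.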

Wildcards

It is well known that search engines let you use wildcards such as * or ?. Lucene implements them too, so you could search for example for

Name:search*

or

Name:search?

The difference between these two is

  • * is used for multiple character wildcard
  • ? is used for single character wildcard

Range search

It is possible to create a query where the field value will be in a range of values.

Name:[Ada TO Tom] // (1)

Name:{Ada TO Tom} // (2)

Both queries return values from the range. The difference between them lies in the lower and upper bounds: in (1) the documents whose names equal Ada or Tom are included, in (2) they are excluded.

To sum up:

  • [] brackets (square) are used in inclusive range queries,
  • {} brackets (curly) are used in exclusive range queries.
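The wildcard and range syntax from the last two sections can also be expressed directly through the API, without the query parser. The sketch below assumes Lucene.Net 3.0.3; the SyntaxSketch class is hypothetical. The terms are written in lower case on the assumption that the Name field was indexed with StandardAnalyzer, which lowercases tokens.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical helper for illustration only.
public static class SyntaxSketch
{
    // Name:search*  (any number of characters after "search")
    public static Query MultiCharWildcard()
    {
        return new PrefixQuery(new Term("Name", "search"));
    }

    // Name:search?  (exactly one character after "search")
    public static Query SingleCharWildcard()
    {
        return new WildcardQuery(new Term("Name", "search?"));
    }

    // Name:[ada TO tom] when inclusive is true, Name:{ada TO tom} otherwise
    public static Query NameRange(bool inclusive)
    {
        return new TermRangeQuery("Name", "ada", "tom", inclusive, inclusive);
    }
}
```

Calling ToString() on each returned query prints it back in the query language, which is a handy way to compare a hand-built query with the string form used in this post.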

Boost the term

As you know Lucene calculates the relevance level of matching documents based on found terms. It is possible to promote (boost) a term – simply by using the caret (^) symbol with a boost factor at the end of the term which is about to be promoted. Of course, the higher the boost factor is, the more relevant the term will be. As you will see in the next post – it is also possible to boost the document during indexing phase.

Previously we had an example of searching for search engine. Let’s assume that we would like to focus our search more on the search term rather than treat each equally. This is done in the next query.

search^2 engine

The search query above states that “search” is twice as relevant as “engine”. The default value of the boost factor is 1. It should always be positive, but it can be less than 1, for example you could build a query with a factor of 0.1.

Boolean operators

Lucene offers very sophisticated Boolean logic to be used in your queries. The operators are: AND, "+", OR, NOT and "-". NOTE: operators must be in CAPS to be recognized.

The default conjunction operator is OR, which means that if you do not specify any Boolean operator between two terms, Lucene will put OR there. With this operator Lucene finds a matching document if either of the terms exists in the document.

Note that the two queries below are equivalent.

"search engine" search
"search engine" OR search

Boolean operators can be a very powerful tool, especially together with field names and sophisticated grouping.

AND

This operator states that both of the terms should appear in the requested document.

To find a document that contains both “search engine” and “cool” you should use a query like this:

"search engine" AND cool

+

The plus operator means that the term preceded by + must appear in the document.

For example, if you look for a document where “search” must appear and “engine” may appear, you should use:

+search engine

NOT (or !)

On the other side there is the NOT operator, which is the opposite of the plus operator. It means that documents containing the term after NOT will not appear in the result.

"search engine" NOT gear

IMPORTANT – you cannot use a query such as

NOT "search engine"

It will always return zero documents, because NOT can only exclude documents from the results matched by another term.

-

The minus operator prohibits documents in which the term appears. In the classic query parser it has the same effect as NOT, written as a prefix on the term:

"search engine" -gear

Grouping

It is important to be aware that you can prepare nested queries such as this one.

(Name:"search engine" AND cool) OR interesting

Summary

As you can see, Lucene can be queried with quite sophisticated query strings. Well-prepared queries can provide very accurate results for your solution.

It is important to be aware that the query does not always have to be typed in by the user. Sometimes (for multi-field queries) the app should build the query from the input fields of a form and then format it using all the required techniques.
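As a rough illustration of building the query in the app rather than letting the user type it, the sketch below combines two form inputs into one BooleanQuery. It assumes Lucene.Net 3.0.3; the FormQuerySketch class and the two form inputs (name and description) are hypothetical. User input is passed through QueryParser.Escape so that characters such as parentheses are treated as plain text instead of query syntax.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

// Hypothetical helper for illustration only.
public static class FormQuerySketch
{
    // Adds a required clause for one form input; empty inputs are skipped.
    static void AddRequired(BooleanQuery target, string field, string input, Analyzer analyzer)
    {
        if (string.IsNullOrWhiteSpace(input))
        {
            return;
        }
        var parser = new QueryParser(Version.LUCENE_30, field, analyzer);
        // Escape neutralizes operators and brackets typed by the user.
        target.Add(parser.Parse(QueryParser.Escape(input)), Occur.MUST);
    }

    // Builds "+Name:... +Description:..." from the filled-in form fields.
    public static Query Build(string name, string description)
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var query = new BooleanQuery();
        AddRequired(query, "Name", name, analyzer);
        AddRequired(query, "Description", description, analyzer);
        return query;
    }
}
```

Using Occur.MUST makes every filled-in field mandatory, which is an assumption about how the form should behave; Occur.SHOULD would give the OR behaviour described earlier.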

There is still one point missing: numeric values. You could expect that a range query will do the job of building a search engine where only results from a range are returned (for example by price or score). Unfortunately this requires a different way of building the query. Of course this does not mean you should not pay attention to building a query more interesting than just a list of fields (which, as you now know, are connected with the OR operator).

We will build a really cool search engine in the next post: one which can be used, for example, in an online shop. In such cases you not only search for a product by its name but also would like to allow your customer to narrow the results to those whose price is in a specified range.

Simple search engine in C# with Lucene.NET

Nowadays it is common to see search boxes on websites. Most of them use one of the popular search engines, such as Bing or Google, to search the content of the website. This way of providing search is not very sophisticated, and a dedicated developer would like to provide his or her own search engine. In this post I will try to briefly present the capabilities of Lucene.

Lucene and Lucene.NET

It is not easy to build a search tool which is more than just a simple SQL query with a couple of LIKE clauses. This is where developers need to find a suitable solution. One possibility is to use a third-party library. One of the best known is Lucene (http://goo.gl/59W4a), a full-text search engine library. One of the biggest disadvantages for a C# developer is that Lucene is written entirely in Java. Fortunately there is a port called Lucene.NET (http://goo.gl/MVNy3). Apache Lucene as well as Lucene.NET are open source projects available as free downloads (Lucene.NET also as a NuGet package).

As you can expect, this port is under ongoing development and can cause potential problems. The current version of the core is stable and no major bugs have been announced so far. Thanks to this library you do not need to implement sophisticated search logic in your application or in the SQL queries you use. You just need to properly include and use Lucene.NET in your application.

Introduction

There are a couple of aspects which need to be introduced before we dig into the code. Lucene uses something called an index, which is a textual form of the data on which the search methods work. There are two main forms: a file index and a memory index. Based on that, your search engine can use the power of Lucene.

Each query returns a set of data which fulfills your requirements. But it is very important to understand that every file (a document in Lucene’s language) can be a better or worse match for the search. We therefore need a way of scoring the returned values, and this is done for you by the library. Each result you receive contains not only info about the documents but also a score for each of them. You can decide what level of score is enough to accept a document as a search result.

Build index

The first step is to create an index for Lucene. This part consists of a couple of steps. Let’s go through them.

1. Create the Writer which will later use the Analyzer to write the index.

var dir = FSDirectory.Open(new DirectoryInfo(@"C:/test_lucene")); // (1)
var analyzer = new StandardAnalyzer(Version.LUCENE_30); // (2)
var writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED); // (3)

In (1) the directory on the C disk is opened; this line is straightforward. In (2) the analyzer is instantiated. In short, an analyzer is a tokenizer combined with filters such as a stemmer or a stop-word filter. The StandardAnalyzer used here filters the input sequence with StandardFilter (normalizes tokens), LowerCaseFilter (normalizes token text to lower case) and StopFilter (using a list of English stop words); it does not stem. In (3) we create the IndexWriter which simply creates the index. We can think of this index as if it were an index on a database. The index is typically about 20-30% of the size of the indexed text.

2. Add data into the index.

foreach (var sampleData in data)
{
    var doc = new Document();
    doc.Add(new Field("Id", sampleData.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Name", sampleData.Name, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Description", sampleData.Description, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}

This loop goes through the data enumeration (in this example the data itself is not important, so I have omitted it). As you can see there is a new concept: a Document object is created for each element of the enumeration. As you can check in the API documentation, documents are the unit of search and indexing. Each Document is a set of fields, where every field has a name and a textual value. Each document should (typically) contain one or more stored fields which uniquely identify the document.

The constructor of Field used in the example takes 4 arguments:

    1. The first one is the name of the field, which we can reference later (this should probably be a constant to simplify reuse and refactoring).
    2. The second is the actual value of the property for the document.
    3. The third determines whether the value should be stored in the index or not.
    4. The fourth and last one specifies whether and how the field should be indexed. In the example I have used only two possible states (out of 5): NOT_ANALYZED and ANALYZED. With the first, the field’s value is indexed without using an Analyzer. With the second, the tokens are indexed by running the field’s value through an Analyzer. More can be found here.

For each element on the list the document is created and then added to index writer.

3. Close the stream objects.

writer.Optimize();
writer.Commit();
writer.Dispose();

This fragment of code is self-describing: we optimize and commit the index, then dispose of the writer to release the index and make it available to other parts of the app.

In this part of the application we have created the index for our data. I can imagine creating the Lucene index for only some part of the data stored in the DB, while more information remains available in the database. Only the parts that matter for searching are included in the Lucene index.

Use index

In the previous section we have created the index. Now it is time to use it and see the magic of Lucene.

1. Firstly we need to open index and prepare analyzer.

var directory = FSDirectory.Open(new DirectoryInfo(@"C:/test_lucene"));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

In the first line we have opened our index and in second we have created analyzer for the index.

2. It is time to see the most interesting part of this post – the actual usage of the index in which we will search the text with the input text.

var parser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Name", "Description" }, analyzer); // (1)
Query query = parser.Parse(text); // (2)
var searcher = new IndexSearcher(directory, true); // (3)
TopDocs topDocs = searcher.Search(query, 10); // (4)

First (1) we create the query parser. As you can see, in this example I have used a parser for multiple fields, not just one. For a single field you would use QueryParser instead of MultiFieldQueryParser. The parser is used in (2) to parse the input value (text). In (3) the searcher on the index is created; this is where we point to the directory in which we created the index for our data. In (4) we search for the top 10 results which match the searched text.

3. Use the result from the search.

int results = topDocs.ScoreDocs.Length;
Console.WriteLine("Found {0} results", results);

for (int i = 0; i < results; i++)
{
   ScoreDoc scoreDoc = topDocs.ScoreDocs[i];
   float score = scoreDoc.Score;
   int docId = scoreDoc.Doc;
   Document doc = searcher.Doc(docId);

   Console.WriteLine("{0}. score {1}", i + 1, score);
   Console.WriteLine("ID: {0}", doc.Get("Id")); // field names are case-sensitive
   Console.WriteLine("Text found: {0}\r\n", doc.Get("Name"));
}

In the previous step we received topDocs, which we can iterate over to get the data that interests us. As I have already mentioned, at this point we could fetch more info about the found documents from the database or the file system. One interesting part is the Score value (it is important to notice that the results are ordered by this value!), which is the score of the document for the query. It is always a number: the higher it is, the better the document satisfies the query.
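Since every hit carries a score, the app can decide which hits are good enough to show. The sketch below shows one possible rule, not a recommendation from the library: keep only hits whose score is at least a given fraction of the best score. It assumes Lucene.Net 3.0.3; the ScoreCutoffSketch class and the fraction parameter are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Search;

// Hypothetical helper for illustration only.
public static class ScoreCutoffSketch
{
    // Keeps the hits whose score is at least `fraction` of the best score.
    // The cutoff is relative because raw scores are not comparable between queries.
    public static List<ScoreDoc> KeepRelevant(ScoreDoc[] hits, float fraction)
    {
        if (hits.Length == 0)
        {
            return new List<ScoreDoc>();
        }
        float best = hits.Max(h => h.Score);
        return hits.Where(h => h.Score >= best * fraction).ToList();
    }
}
```

It would be called as ScoreCutoffSketch.KeepRelevant(topDocs.ScoreDocs, 0.5f) before the loop that prints the results; the value 0.5 is an arbitrary choice that would need tuning on real data.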

Conclusion

It took only around 100 lines of code to create a simple search engine, and that includes the sample data. Of course in a real-world scenario there will be more sophisticated logic and more operations for optimizing the index, especially when it grows and becomes a very big one.

There are some important assumptions to know while working with Lucene. One of the biggest is that IndexWriter and IndexSearcher instances are thread safe, which means that multiple threads can call their methods concurrently.

As you can imagine, the index should be prepared while the application is loading. I can imagine the index staying in memory and being updated when new data goes to the DB. You can decide which part of the data will be included in the text search source.

This post was not a full introduction to text-based search. It presented the potential of Lucene and its port for the .NET framework.

I think playing around with this library can be quite interesting and eye-opening, especially once we understand the sophisticated algorithms behind the scenes.
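One possible way to keep the index up to date when a row changes in the DB is to replace the document that has the same Id. The sketch below assumes Lucene.Net 3.0.3 and relies on the Id field being indexed as NOT_ANALYZED, as in this post; the IndexSyncSketch class and its Upsert method are hypothetical.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical helper for illustration only.
public static class IndexSyncSketch
{
    // Inserts the product, or replaces the indexed document with the same Id.
    public static void Upsert(IndexWriter writer, int id, string name)
    {
        string key = id.ToString();

        var doc = new Document();
        doc.Add(new Field("Id", key, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("Name", name, Field.Store.YES, Field.Index.ANALYZED));

        // Deletes every document whose Id term equals key, then adds doc.
        writer.UpdateDocument(new Term("Id", key), doc);

        // Committing after every change keeps the sketch simple;
        // a real app would more likely commit in batches.
        writer.Commit();
    }
}
```

A searcher opened after the commit sees exactly one document per Id, however many times Upsert was called for that product.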