In the two previous posts we looked at a basic implementation of a search engine and at the queries you can build in your application. Both can be improved to make your search engine more usable and a better fit for your requirements. In the next two posts I will introduce some more sophisticated techniques that you can use when building your own search engine.
After this post your search engine should have all the functionality required of a modern search engine that fits into your product. I will also try to explain how to integrate the Lucene.NET library into a web-based application whose data is mostly stored in database systems.
GOAL: after this post you should know the important parts of the search engine library and some techniques you can use in your app. For the sake of this post I will assume that we are building a search engine for an online shop where the user can search by:
- name of the product,
- price range (from ... to ...).
As you can see, this is not a very sophisticated search engine, but you can build much more complex ones on top of it.
Filtering the search results by price
When I discussed the queries you can build in your application (or simply leave to your users) I mentioned the range query. As you know, you can ask Lucene for documents in which a certain field falls within a range of values. Unfortunately, on its own this query does not fit the goal of a search engine for an online shop. What we want is to ask Lucene for some products and then filter them by a price range.
Of course we could do that in our own code, on the list returned by Lucene. But is it possible to use Lucene's power (indexes, cache etc.) instead? YES! This is where I shall introduce the FilteredQuery class. Its definition is rather short: a query that applies a filter to the results of another query.
As you can see, to build a FilteredQuery object you need two parts: a query and a filter. The two previous posts were dedicated to creating queries, so you can use them as a reference. Now we just need a filter.
Filter filter = NumericRangeFilter.NewIntRange("Price", min, max, true, true);
As you can see, the factory method is as simple as it can be. Its parameters are, in order:
- The name of the field in the document.
- Lower bound (can be null, which leaves the range open at the bottom).
- Upper bound (can be null, which leaves the range open at the top).
- Boolean value: whether to include documents whose field value is equal to the lower bound.
- Boolean value: whether to include documents whose field value is equal to the upper bound.
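To make the null bounds concrete, here is a minimal sketch of three small helpers around the same factory method. The class and method names (PriceFilters, Between, AtLeast, AtMost) are made up for this illustration, and the code assumes the Lucene.NET 3.0.x API used elsewhere in this post.

```csharp
using Lucene.Net.Search;

// Hypothetical helper class, not part of Lucene.NET.
public static class PriceFilters
{
    // Both bounds given and both inclusive: min <= Price <= max.
    public static Filter Between(int min, int max)
    {
        return NumericRangeFilter.NewIntRange("Price", min, max, true, true);
    }

    // No upper bound: passing null leaves the range open at the top.
    public static Filter AtLeast(int min)
    {
        return NumericRangeFilter.NewIntRange("Price", min, null, true, true);
    }

    // No lower bound: passing null leaves the range open at the bottom.
    public static Filter AtMost(int max)
    {
        return NumericRangeFilter.NewIntRange("Price", null, max, true, true);
    }
}
```

A "from.. to.." form in the shop where the user fills in only one of the two boxes could map directly onto AtLeast or AtMost.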
As you can see, our filter is designed to be built for the price of the product. It is important to mention right now that if you want to use a field in a filter like this one, you need to declare it with a special field type.
var doc = new Document();
doc.Add(new Field("Id", sampleData.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED)); // (1) stored and indexed as a single exact term
doc.Add(new Field("Name", sampleData.Name, Field.Store.YES, Field.Index.ANALYZED)); // (2) stored and analyzed for full-text search
doc.Add(new Field("Description", sampleData.Description, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new NumericField("Price").SetIntValue(sampleData.Price)); // (3) numeric field, required for the range filter
You need to use the NumericField class for any field you want to use with a numeric range filter. With that in place we can create the filtered query:
var fq = new FilteredQuery(query, filter); // (4)
Building a filtered query is as simple as its definition. What is important to know is that FilteredQuery derives from Query, and because of that you can build very sophisticated queries. You can search for something, apply a filter, and then wrap the result in another FilteredQuery to apply a second filter (on another parameter). As you can see, this query can become a very powerful tool for your search engine.
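Putting the pieces together, the following is a rough end-to-end sketch rather than production code. It assumes Lucene.NET 3.0.x, an in-memory RAMDirectory, and three invented sample products; ShopSearch, Product and FindByNameAndPrice are hypothetical names used only for this illustration.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class ShopSearch
{
    // Builds a document with the same fields as in the post (Description omitted for brevity).
    private static Document Product(int id, string name, int price)
    {
        var doc = new Document();
        doc.Add(new Field("Id", id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("Name", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new NumericField("Price").SetIntValue(price));
        return doc;
    }

    // Returns the names of products whose Name contains the word and whose Price is in [min, max].
    public static List<string> FindByNameAndPrice(string word, int? min, int? max)
    {
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Invented sample data, indexed in memory; disposing the writer commits it.
        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(Product(1, "Red bicycle", 250));
            writer.AddDocument(Product(2, "Blue bicycle", 900));
            writer.AddDocument(Product(3, "Bicycle helmet", 40));
        }

        // StandardAnalyzer lowercases terms at index time, so the search term is lowercased too.
        Query query = new TermQuery(new Term("Name", word.ToLowerInvariant()));
        Filter filter = NumericRangeFilter.NewIntRange("Price", min, max, true, true);
        var filtered = new FilteredQuery(query, filter);

        var names = new List<string>();
        using (var searcher = new IndexSearcher(directory, true))
        {
            foreach (var hit in searcher.Search(filtered, 10).ScoreDocs)
            {
                names.Add(searcher.Doc(hit.Doc).Get("Name"));
            }
        }
        return names;
    }
}
```

Calling ShopSearch.FindByNameAndPrice("bicycle", 100, 500) runs the name query first and lets Lucene drop every hit whose Price falls outside the range. In a real application the index would of course be built once and reused instead of being rebuilt on every call.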
Promotions – how to promote certain documents?
As you may remember (if not, you can check it here), you can boost any field in a document to make it more important than the others. You probably already expect that you can do something similar with whole documents, and you would not be wrong.
It is possible to boost any document to make it more relevant. This is done when you build the index. When we declared our doc object earlier, we could simply have set a property:
doc.Boost = 2;
Lucene uses this value when it computes the score that ranks documents for your query. By default the boost factor of every document is 1. For each document the library multiplies the score by the boost value and ranks the results accordingly. As you can see, you can promote a document by making its boost factor greater than 1. Making the value less than 1 decreases the score and makes the document less relevant.
What is very important: once you have created the index you cannot read the boost value back (the property will ALWAYS return 1!), because Lucene folds the boost into the index's normalization data instead of storing it with the document. You can still be sure that the score is affected by it.
This factor lets you tune the search to your needs: you could set it to a very large number (and perhaps programmatically add some term to your query). The documents you want to present to your client as the most relevant and interesting will then appear in the first position, or among the first few results.
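As an illustration, a promotion flag coming from the shop could be turned into a boost while the document is being built. This is only a sketch: the Promotions class, the ToDocument method, the promoted parameter and the factor of 2 are assumptions made for this example, and the code targets the Lucene.NET 3.0.x API.

```csharp
using Lucene.Net.Documents;

// Hypothetical helper class, not part of Lucene.NET.
public static class Promotions
{
    // Builds the product document and boosts it when the product is promoted.
    public static Document ToDocument(int id, string name, int price, bool promoted)
    {
        var doc = new Document();
        doc.Add(new Field("Id", id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("Name", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new NumericField("Price").SetIntValue(price));

        if (promoted)
        {
            // Must be set before the document is added to the index;
            // 2 is an arbitrary example factor, the default is 1.
            doc.Boost = 2f;
        }
        return doc;
    }
}
```

Because the boost is applied at indexing time, changing which products are promoted means re-indexing the affected documents.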
How to build a search engine on production systems?
Nowadays most data is probably stored in database systems. How can you use Lucene in such an environment? What I would suggest is to keep the index alongside the DB tables. Make your index as compact as possible, because this matters for performance.
Of course there is a very big challenge: how do you ensure that your database and your search engine index stay consistent? Unfortunately that is up to you and your skills. I will come back to this problem in the next post, where we will dig into some more interesting stuff.
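One simple starting point, sketched here under my own assumptions and not as a complete answer to the consistency problem, is to touch the index every time the corresponding database row changes, using the Id field as the key. IndexSync, Upsert and Remove are hypothetical names; UpdateDocument, DeleteDocuments and Commit are IndexWriter methods in Lucene.NET 3.0.x.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical helper class, not part of Lucene.NET.
public static class IndexSync
{
    // Call after a product row has been inserted or updated in the database.
    public static void Upsert(IndexWriter writer, int id, string name, int price)
    {
        var doc = new Document();
        doc.Add(new Field("Id", id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("Name", name, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new NumericField("Price").SetIntValue(price));

        // Deletes any document with the same Id term, then adds the new version.
        writer.UpdateDocument(new Term("Id", id.ToString()), doc);
        writer.Commit();
    }

    // Call after a product row has been deleted from the database.
    public static void Remove(IndexWriter writer, int id)
    {
        writer.DeleteDocuments(new Term("Id", id.ToString()));
        writer.Commit();
    }
}
```

This works because the Id field is indexed as NOT_ANALYZED, so the term matches the stored identifier exactly. It does not cover failures between the database write and the index write, which is the hard part of the problem.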
So here it is: your fully operational search engine. There is still room for improvement, though. You have probably noticed that even with all these techniques your search engine is missing one thing: paging. It is common for a result page to show only a subset of a large result set rather than all of it. This and some other, more sophisticated topics will be covered in the next post.