Search engine on a production system

So far we have covered building search engines with the Lucene.NET library. There is still one big point I should post about: all the information we have gathered so far needs to be put together and used to build the search engine for a production system.

As you have probably spotted, one point is still missing. Our search engine can have quite sophisticated query options with some advanced filtering. It can also be connected to the DB (I will write about it here too). But what do you do when you have plenty of results to present? Nowadays a search module on a webpage is expected to present results in pages, for example only 10 results per page. How can we achieve that in Lucene?

Unfortunately, Lucene by itself does NOT help with paging. It only returns the first n best-matching results for your query. The paging logic must be implemented by you.

How many documents to download?

This is one of the biggest decisions you need to make during implementation. You need to tell Lucene how many documents you would like to get, and based on that you can calculate the number of pages for your search results.

Of course, downloading more documents has an overhead. My suggestion is to download all matching documents for the first page so that you can calculate the number of pages. For page n, however, I would download only X*n documents, where X is the number of documents per page and n is the page number. Why not only X? As you probably know, Lucene returns only the top-scoring documents for a query, so the documents for page n are the last X of the first X*n hits.
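The arithmetic behind this can be sketched in a few lines. This is only an illustration: the class and method names (SearchPaging, HitsToRequest, PageCount) are made up for this post and are not part of Lucene.NET, and the sketch assumes 1-based page numbers.

```csharp
using System;

// Hypothetical helper, not part of Lucene.NET: plain paging arithmetic.
public static class SearchPaging
{
    // How many top hits to ask Lucene for so that page `pageNumber`
    // (1-based) is fully covered: X * n.
    public static int HitsToRequest(int pageNumber, int pageSize)
    {
        if (pageNumber < 1) throw new ArgumentOutOfRangeException(nameof(pageNumber));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));
        return checked(pageNumber * pageSize);
    }

    // Number of pages needed to show `totalHits` matching documents.
    public static int PageCount(int totalHits, int pageSize)
    {
        if (totalHits < 0) throw new ArgumentOutOfRangeException(nameof(totalHits));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));
        return (totalHits + pageSize - 1) / pageSize; // ceiling division
    }
}
```

For example, with 10 results per page, page 3 needs the first 30 hits, and 95 matching documents fill 10 pages.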

Paging: how to create a subset of the results?

Let's say that you have received a list of results from the query. How can you create the subset for a certain page? Of course, you could calculate the indexes for your page yourself and create the subset, but this leaves room for errors. Fortunately, there is a NuGet package called PagedList which can be used in this scenario. Installing it is simple:

PM> Install-Package PagedList
Successfully installed 'PagedList 1.15.0.0'.
Successfully added 'PagedList 1.15.0.0' to TestLuceneApplication.

As you might expect, PagedList provides a simple interface for creating and maintaining a paged list. More importantly, the result can then be consumed by the controls used on your webpage.

Even more interesting, the package contains extension methods which can be called on any IEnumerable (or IQueryable) implementation and which create the expected subset.

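using PagedList;

var resultsList = new List<Document>();
// ... fill resultsList with the documents returned by Lucene
var secondPage = resultsList.ToPagedList(2, 10); // page 2, 10 items per page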

The usage of the method is simple. Its signature is self-explanatory:

ToPagedList<T>(this IEnumerable<T> superset, int pageNumber, int pageSize)

This method builds the PagedList object, which contains many very useful properties. Most importantly, you get an indexer over the page you requested, as well as the previous and next page numbers and the total number of pages. Based on that you can build your paging logic. If the superset is an IQueryable implementation, it will be treated as such.
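To make these properties more concrete, here is a hand-rolled equivalent built only on LINQ. It is not PagedList's implementation; SimplePage and its members are hypothetical names chosen for this sketch, and it assumes 1-based page numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a paged list; NOT the PagedList type itself.
public sealed class SimplePage<T>
{
    public IReadOnlyList<T> Items { get; }   // only the requested page
    public int PageNumber { get; }
    public int PageSize { get; }
    public int TotalItemCount { get; }
    public int PageCount { get; }
    public bool HasPreviousPage => PageNumber > 1;
    public bool HasNextPage => PageNumber < PageCount;

    public SimplePage(IEnumerable<T> superset, int pageNumber, int pageSize)
    {
        if (superset == null) throw new ArgumentNullException(nameof(superset));
        if (pageNumber < 1) throw new ArgumentOutOfRangeException(nameof(pageNumber));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        var all = superset.ToList();
        PageNumber = pageNumber;
        PageSize = pageSize;
        TotalItemCount = all.Count;
        PageCount = (all.Count + pageSize - 1) / pageSize; // ceiling division
        // Skip the items of the earlier pages, then take one page worth.
        Items = all.Skip((pageNumber - 1) * pageSize).Take(pageSize).ToList();
    }
}
```

Note that the whole superset has to be passed in just to compute the totals, even though only one page of it ends up in Items.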

As you can see, you have to pass all the objects even though the paged list contains only a subset of them. Programmers often cheat this mechanism, especially when they have to build other objects based on the Documents received from Lucene.

Cheat the PagedList with a subset of the whole TopDocs

Lucene Documents are often the starting point for creating other objects which are presented to users, for example when you use the builder pattern to prepare them. As you might expect, this often causes performance issues: PagedList needs all the elements of the collection, while you present only a subset of them.

How can we gain some performance when we need to pass all the objects to the extension method to get all the correct info on the PagedList? There is a neat way of preparing the data, although it requires some more implementation time from you, dear developer. Basically, you create a superset which contains (n-1)*X empty objects followed by the X elements of your page, which are the fully created objects for your scenario. The image below presents this scenario:

Of course, you need to calculate the correct start index, which is (n-1)*X for page n, but this is basic arithmetic and you should not have any problems with it.
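A minimal sketch of such a padded superset is shown here. The names (PaddedSuperset, Build) are hypothetical, and null is used as the "empty object". The trailing placeholders are an assumption of this sketch rather than something described above: they are added so that the total item count, and therefore the number of pages, stays correct when the total number of hits is known.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for this post; not part of PagedList or Lucene.NET.
public static class PaddedSuperset
{
    // Puts cheap placeholders (null) around the fully built items of one page,
    // so that index-based paging over the result picks exactly those items.
    public static List<T> Build<T>(IEnumerable<T> pageItems, int pageNumber,
                                   int pageSize, int totalHits) where T : class
    {
        if (pageItems == null) throw new ArgumentNullException(nameof(pageItems));
        if (pageNumber < 1) throw new ArgumentOutOfRangeException(nameof(pageNumber));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        var items = pageItems.ToList();
        int leading = (pageNumber - 1) * pageSize; // start index of the page
        // Assumption of this sketch: pad the tail up to totalHits as well.
        int trailing = Math.Max(0, totalHits - leading - items.Count);

        var superset = new List<T>(leading + items.Count + trailing);
        superset.AddRange(Enumerable.Repeat<T>(null, leading));
        superset.AddRange(items);
        superset.AddRange(Enumerable.Repeat<T>(null, trailing));
        return superset;
    }
}
```

The resulting list can then be handed to the paging code with the same page number and page size; only the X real objects on the requested page ever had to be built.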

Use PagedList

Why do we care about creating an object such as PagedList? As I have already mentioned, it gives us the basic paging logic, which can be used on a page, for example to show search results in a user-friendly way. More importantly, there are ASP.NET components which can be built on top of such an object. One of the most interesting is MvcPager, which is released under the Ms-PL license, which means that it is free and open source software.

It supports Ajax paging using jQuery or standard Ajax implementation on ASP.NET. I encourage you to play with MvcPager. I will dig into details about this component in the next post.

Conclusion

As you can see, it is easy to create a simple paged view of your search engine's results. There are still places where you need to be very careful, such as keeping your Lucene index up to date with your database.
