Multi-parameter search engine in C# with Lucene.NET

In my previous post (see here) we created a simple search engine in C# with Lucene.NET. That post was only an introduction to the technology. As you might expect, Lucene offers much more than simple one-word or multi-word queries. You can build your own query through Lucene’s API, but Lucene also provides a rich query language that the Query Parser turns from an input string into a Lucene Query. If you find this topic interesting, I strongly recommend the online documentation: the implementation can vary from version to version, so it is the best source of knowledge.

This post covers a couple of the available query techniques:

  • querying specific fields,
  • wildcard searches,
  • range searches,
  • boosting the term (and the document as well!),
  • Boolean operators.

Fields

As shown in the simple search engine, each entry in the index is built from a set of fields. For example, we previously defined our document as:

var doc = new Document();
doc.Add(new Field("Id", sampleData.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Name", sampleData.Name, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("Description", sampleData.Description, Field.Store.YES, Field.Index.ANALYZED));

The document above contains three fields, and two of them are analyzed during indexing. Lucene lets us query not only the whole set of fields but also a value in one specific field. Let’s assume that we would like to find the documents whose name contains “searching”.

This could be done by:

Name:searching

To search for more than one word in a field, wrap the words in double quotes as shown below (here we look for a document with “search engine” in the name). This is called a phrase query.

Name:"search engine"

If we had used

Name:search engine

Lucene would look for “search” only in the Name field and for “engine” in the default field (or fields) the query parser was configured with. Of course, it is also possible to ask Lucene for words in more than one field.

Name:search Description:engine

This is directly related to the Boolean operators, which let us use Boolean logic to build more sophisticated queries.
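The field queries above can be tried out by running them through the query parser and printing what it produces. This is only a minimal sketch, assuming Lucene.NET 3.0.3 and StandardAnalyzer; the choice of Description as the default field and the class name FieldQueryDemo are my assumptions for illustration, not something taken from the previous post.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

public static class FieldQueryDemo
{
    public static void Main()
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Assumption: "Description" is the default field, so any term without
        // a "Field:" prefix is searched there.
        var parser = new QueryParser(Version.LUCENE_30, "Description", analyzer);

        string[] inputs =
        {
            "Name:searching",
            "Name:\"search engine\"",
            "Name:search engine",
            "Name:search Description:engine"
        };

        foreach (var input in inputs)
        {
            // Parse turns the query string into a Lucene Query object;
            // printing it shows which field each term ended up in.
            Query query = parser.Parse(input);
            Console.WriteLine("{0}  =>  {1}", input, query);
        }
    }
}
```

Printing the parsed query is a cheap way to check how Lucene understood the input before running it against a real index.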

Wildcards

Like most search engines, Lucene supports the wildcards * and ?, so you could search, for example, for

Name:search*

or

Name:search?

The difference between these two is:

  • * matches any sequence of characters (including none),
  • ? matches exactly one character.
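The same wildcard searches can also be built through the API instead of the query language. The sketch below assumes Lucene.NET 3.0.3; the class name WildcardQueryDemo is hypothetical.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class WildcardQueryDemo
{
    public static void Main()
    {
        // Queries built by hand are not analyzed, so the term text should be
        // written the way it was indexed (StandardAnalyzer lowercases terms).
        Query manyChars = new WildcardQuery(new Term("Name", "search*"));
        Query oneChar = new WildcardQuery(new Term("Name", "search?"));

        // A single trailing * is a prefix search, which Lucene also
        // offers as a dedicated PrefixQuery.
        Query prefix = new PrefixQuery(new Term("Name", "search"));

        Console.WriteLine(manyChars);
        Console.WriteLine(oneChar);
        Console.WriteLine(prefix);
    }
}
```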

Range search

It is possible to create a query where the field value will be in a range of values.

Name:[Ada TO Tom] // (1)

Name:{Ada TO Tom} // (2)

Both queries return documents whose Name falls within the range. The difference lies in the lower and upper bounds: in (1) documents whose name equals Ada or Tom are included, while in (2) they are excluded.

To sum up:

  • [] brackets (square) are used in inclusive range queries,
  • {} brackets (curly) are used in exclusive range queries.
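For comparison, here is how the two range queries might look when built directly with TermRangeQuery, where the inclusive and exclusive bounds become two boolean flags. This is a sketch assuming Lucene.NET 3.0.3; the class name RangeQueryDemo is hypothetical, and the bounds are written in lowercase on the assumption that the Name field was indexed with StandardAnalyzer.

```csharp
using System;
using Lucene.Net.Search;

public static class RangeQueryDemo
{
    public static void Main()
    {
        // Arguments: field, lower term, upper term, includeLower, includeUpper.
        // Terms are compared as strings, not as numbers or dates.
        Query inclusive = new TermRangeQuery("Name", "ada", "tom", true, true);
        Query exclusive = new TermRangeQuery("Name", "ada", "tom", false, false);

        Console.WriteLine(inclusive); // corresponds to query (1)
        Console.WriteLine(exclusive); // corresponds to query (2)
    }
}
```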

Boost the term

As you know, Lucene calculates the relevance of matching documents based on the terms found. It is possible to promote (boost) a term simply by appending the caret (^) symbol and a boost factor to the term you want to promote. The higher the boost factor, the more relevant the term will be. As you will see in the next post, it is also possible to boost a document during the indexing phase.

Previously we had an example of searching for search engine. Let’s assume that we would like to put more weight on the term search rather than treat both terms equally. This is done in the next query.

search^2 engine

The query above states that “search” is twice as relevant as “engine”. The default boost factor is 1. The factor should always be positive, but it can be less than 1; for example, you could build a query with a factor of 0.1.
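When the query is built in code rather than parsed, the boost is set on the query object itself. A minimal sketch, assuming Lucene.NET 3.0.3 (where Boost is a property on Query) and assuming both terms are searched in the Description field; the class name BoostDemo is hypothetical.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class BoostDemo
{
    public static void Main()
    {
        // "search" gets a boost factor of 2, "engine" keeps the default of 1.
        var search = new TermQuery(new Term("Description", "search"));
        search.Boost = 2.0f;
        var engine = new TermQuery(new Term("Description", "engine"));

        // SHOULD clauses behave like OR: either term may match,
        // but matches on "search" contribute more to the score.
        var query = new BooleanQuery();
        query.Add(search, Occur.SHOULD);
        query.Add(engine, Occur.SHOULD);

        Console.WriteLine(query);
    }
}
```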

Boolean operators

Lucene offers rich Boolean logic for your queries. The operators are AND, “+”, OR, NOT and “-”. NOTE: the word operators must be written in capital letters to be recognized.

The default conjunction operator is OR, which means that if you do not specify a Boolean operator between two terms, Lucene inserts OR there. With OR, a document matches if either of the terms exists in it.

Note that the two queries below are equivalent.

"search engine" search
"search engine" OR search
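To see the default operator at work, you can parse both forms and print the results, then switch the parser's default to AND. This is a sketch assuming Lucene.NET 3.0.3, where the query parser exposes a DefaultOperator property; Description as the default field and the class name DefaultOperatorDemo are my assumptions.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

public static class DefaultOperatorDemo
{
    public static void Main()
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var parser = new QueryParser(Version.LUCENE_30, "Description", analyzer);

        // With the default settings a missing operator is read as OR.
        Query implicitOr = parser.Parse("\"search engine\" search");
        Query explicitOr = parser.Parse("\"search engine\" OR search");
        Console.WriteLine(implicitOr);
        Console.WriteLine(explicitOr);

        // The default can be changed, after which a missing operator is read as AND.
        parser.DefaultOperator = QueryParser.Operator.AND;
        Console.WriteLine(parser.Parse("\"search engine\" search"));
    }
}
```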

 

Boolean operators can be a very powerful tool, especially together with field names and grouping.

AND

This operator states that both terms must appear in the matching document.

To find a document about search engine and cool, use a query like this:

"search engine" AND cool

+

The plus (required) operator means that the term following + must appear in the document.

For example, if you are looking for a document where search must appear and engine may appear, use:

+search engine

NOT (or !)

On the other side there is the NOT operator, which is the opposite of the plus operator: documents containing the term after NOT will not appear in the results.

"search engine" NOT gear

IMPORTANT: you cannot use NOT with just one term, as in this query:

NOT "search engine"

It will always return zero documents.

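-

The minus (prohibit) operator excludes documents in which the term following it appears, for example "search engine" -gear. In the classic query parser it produces the same kind of prohibited clause as the NOT operator.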

Grouping

Keep in mind that you can group clauses with parentheses to build nested queries such as this one.

(Name:"search engine" AND cool) OR interesting
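The nested query above can also be assembled through the API, which makes the grouping explicit: each pair of parentheses becomes its own BooleanQuery. In this sketch (assuming Lucene.NET 3.0.3) AND and + map to Occur.MUST, OR maps to Occur.SHOULD, and NOT and - map to Occur.MUST_NOT. The class name GroupingDemo and the use of Description for the unprefixed terms are assumptions.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class GroupingDemo
{
    public static void Main()
    {
        // Name:"search engine" is a phrase: both terms, next to each other.
        var phrase = new PhraseQuery();
        phrase.Add(new Term("Name", "search"));
        phrase.Add(new Term("Name", "engine"));

        // (Name:"search engine" AND cool): both clauses are required.
        var inner = new BooleanQuery();
        inner.Add(phrase, Occur.MUST);
        inner.Add(new TermQuery(new Term("Description", "cool")), Occur.MUST);

        // (...) OR interesting: either clause is enough for a match.
        var outer = new BooleanQuery();
        outer.Add(inner, Occur.SHOULD);
        outer.Add(new TermQuery(new Term("Description", "interesting")), Occur.SHOULD);

        Console.WriteLine(outer);
    }
}
```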

Summary

As you can see, Lucene can be queried with quite sophisticated query strings. Well-prepared queries can provide very accurate results for your solution.

Keep in mind that the query string does not always have to be typed in by the user. Sometimes (for multi-field queries) the app should build the query from the input fields of a form and then format it using the appropriate techniques described above.
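One possible way to build a query from form fields is sketched below. It assumes Lucene.NET 3.0.3 and StandardAnalyzer; the SearchForm class, its properties and the FormQueryBuilder helper are hypothetical names invented for this example. Each non-empty input is escaped, parsed against its own field, and added as a required clause.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

// Hypothetical form model; the property names are illustrative only.
public class SearchForm
{
    public string Name { get; set; }
    public string Description { get; set; }
}

public static class FormQueryBuilder
{
    public static Query Build(SearchForm form, Analyzer analyzer)
    {
        var query = new BooleanQuery();
        AddClause(query, "Name", form.Name, analyzer);
        AddClause(query, "Description", form.Description, analyzer);
        return query;
    }

    private static void AddClause(BooleanQuery target, string field, string input, Analyzer analyzer)
    {
        // Skip inputs the user left empty.
        if (string.IsNullOrWhiteSpace(input)) return;

        var parser = new QueryParser(Version.LUCENE_30, field, analyzer);

        // Escape query syntax characters so the input is treated as plain words.
        Query clause = parser.Parse(QueryParser.Escape(input));

        // Every filled-in form field must match (AND semantics between fields).
        target.Add(clause, Occur.MUST);
    }

    public static void Main()
    {
        var form = new SearchForm { Name = "search engine", Description = "cool" };
        Console.WriteLine(Build(form, new StandardAnalyzer(Version.LUCENE_30)));
    }
}
```

Escaping the input is a design choice here: it keeps users from accidentally typing query syntax, at the cost of not letting them use operators inside a form field.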

One piece is still missing: numeric values. You might expect a range query to do the job of building a search engine that returns only results from a numeric range (for example by price or score). Unfortunately, this requires a different way of building the query. Of course, this does not mean you should not make the effort to build a query more interesting than just a list of fields (which, as you now know, are joined with the OR operator).

We will build a really cool search engine in the next post: one that could be used, for example, in an online shop. In such cases you not only search for a product by name but also want to let your customer narrow the results to a specified price range.