Solr: Problems Faced and Approaches Taken
- rakeshnbr
- Mar 2, 2012
- 2 min read
Solr is a search platform from the Apache Lucene project. It is a powerful and fast search engine that supports most common search features: full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. In this post I will walk through my experience with Solr, the problems we ran into, and how we approached them. Before reading on, I would suggest browsing through the Apache Solr tutorial.
Problem 1: Implementing multiple cores
Statement: Our product follows the SaaS model and the service is used by different clients. Each client may need several different types of search.
Solution: Use separate Solr cores for individual clients. Each client can in turn be subdivided into separate cores, one per search type. This can be achieved by adding the cores to your solr.xml file.
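As a rough sketch of what this can look like, here is a solr.xml in the legacy multi-core format that was current for the Solr versions of that time. The core names and instance directories (clientA_products and so on) are made up for illustration; each instanceDir is assumed to contain its own conf/ folder with a schema.xml and solrconfig.xml. Newer Solr releases discover cores through core.properties files instead, so treat this as an example for the older format only.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Legacy multi-core solr.xml; all core names below are hypothetical. -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- Client A has two kinds of search, so it gets two cores. -->
    <core name="clientA_products"  instanceDir="clientA_products" />
    <core name="clientA_documents" instanceDir="clientA_documents" />
    <!-- Client B only needs one. -->
    <core name="clientB_products"  instanceDir="clientB_products" />
  </cores>
</solr>
```

Each core is then reachable under its own path, for example /solr/clientA_products/select, so one client's index and schema stay isolated from another's.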
Problem 2: Delta Import
Statement: The full index is built once; every indexing run after that first execution is known as a delta import. A delta import indexes only the data that has been added or changed since the previous run. If we have sub-entities (i.e., we need to execute a sub-query based on the result of the first query), DIH (the Data Import Handler) opens multiple connections to the database to execute the sub-queries.
Solution: Opening multiple connections was an overhead. To solve this, we used a combination of FileListEntityProcessor and XPathEntityProcessor. The approach was simple: fetch the records on the server side using JDBC, iBATIS, or Hibernate (whichever technology you prefer), process the data and save it as an XML file, and configure the XPath expressions for that file in dataConfig.xml.
This approach gives you the freedom to shape the retrieved data into meaningful chunks and index based on those.
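A dataConfig.xml along these lines might look like the sketch below. The export directory, the records/record element names, and the id and title fields are all assumptions made for illustration; they need to match whatever XML your server-side job actually writes. The optional newerThan attribute is one way to pick up only files written since the last import.

```xml
<dataConfig>
  <!-- Reads from the file system, so no database connection is opened by DIH. -->
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- Outer entity: lists the exported XML files (hypothetical directory). -->
    <entity name="files"
            processor="FileListEntityProcessor"
            baseDir="/data/solr/export"
            fileName=".*\.xml"
            newerThan="${dataimporter.last_index_time}"
            rootEntity="false"
            dataSource="null">
      <!-- Inner entity: one Solr document per record element in each file. -->
      <entity name="record"
              processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}"
              forEach="/records/record"
              stream="true">
        <field column="id"    xpath="/records/record/id" />
        <field column="title" xpath="/records/record/title" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Setting rootEntity="false" on the outer entity tells DIH that the documents come from the inner entity, so each record becomes a document rather than each file.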
Problem 3: Sorting fields with multiple tokens
Statement: Sorting on a field that is split into multiple tokens fails. Solr throws an exception when the number of tokens in the sort field is greater than the number of documents.
Suppose we need to sort on a field X whose values are [ab cd] and [ab ef] for two records. In this scenario Solr throws an exception, since the field holds three tokens (ab, cd, and ef) but there are only two records.
Solution: To overcome this problem we need to make Solr treat the whole value of the sort field as a single token.
Define a new field type that uses KeywordTokenizerFactory and assign this type to the field that has to be sorted. This tokenizer emits the entire field value as one token, and sorting is then done on that token.
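In schema.xml this could be sketched roughly as follows. The names string_sort and X_sort are hypothetical, and the lowercase filter is an optional extra I have assumed so that sorting ignores case; leave it out if you want case-sensitive ordering. The fieldType goes inside the types section and the field and copyField entries alongside your other fields.

```xml
<!-- Fragment of schema.xml; names are illustrative only. -->

<!-- Field type that keeps the whole value as a single token. -->
<fieldType name="string_sort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <!-- Optional: makes the sort case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<!-- Dedicated single-valued sort field, filled from the searchable field X. -->
<field name="X_sort" type="string_sort" indexed="true" stored="false" />
<copyField source="X" dest="X_sort" />
```

With this in place, X can stay tokenized for searching while queries sort with sort=X_sort asc. For the two records above, X_sort holds the single tokens "ab cd" and "ab ef", so there is exactly one token per document.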
Note: I will share some other Solr problems and solutions in my next post. The solutions given here may not be the best ones, so please feel free to share your opinions :)