Full text searching with solr
Use Solr as a ‘search engine’ for applications. It is fast.
A lot faster than MySQL (at least in one slightly-scientific study we’ve undertaken). Further details below….
Like most web application developers, just about everything we write has a MySQL database behind it. MySQL is ubiquitous.
Most sites and applications will have a requirement for some form of text searching. Given most sites are based on MySQL, the normal approach is to start with some sort of SQL ‘LIKE’ query, and then to move to using MySQL’s fulltext index as traffic/load increases.
However at some point, when you’ve optimised the application and database as much as you can for performance and functionality you begin to look for other solutions – hence investigating Solr.
A typical developer checklist of things to do to improve MySQL’s performance might look like :
- Memcache to reduce DB load / cache data ?
- Optimised MySQL queries / indexes / slow query log analysis ?
- Reformed data to reduce database joins ?
- Optimised GeoSpatial queries through boxing ?
- Using MySQL fulltext index to search through textual data ?
- Performed UI changes to ‘persuade’ users to not perform poorly performing queries ?
SQL_CALC_FOUND_ROWS(so there’s no need to do a separate query to find out how many rows there are in total for your result page to work correctly)
- and so on….
However, eventually you reach a limit – probably speed or functionality – with MySQL.
For example, while MySQL’s fulltext search support has a stop-words list (common words to ignore, like ‘the’ when searching) it doesn’t index numbers or (by default) short words and doesn’t support additional features (like those below)
Advantages of Solr
- Faceting (grouping of results by e.g. type/brand/model/category)
- Stemming (realising that “becomes”, “become” and “becoming” are equivalent)
- Field weighting (a match in a name field is more important than a match in address3).
- Phrase proximity (i.e. if you’re looking for Red Lion, then a result of “The Red Lion Inn” is probably better than “The Red Fox and Lion”)
- Spelling suggestions (“Did you mean : …. ?”)
- Autocomplete support for search boxes
- “More Like This” suggestions
- Greater control over indexing
- More scalable
While some of the above can be done using MySQL, not all are available out of the box. For instance with one project, we found it necessary to implement basic stemming to de-pluralise words so a query for ‘McDonalds’ would match ‘McDonald’ and “McDonald’s” (and vice versa). We also explicitly ignored ‘the’, and had to match shorter phrases because some items had short names (e.g. “AA”, “B. O. A. B” etc).
The Disadvantages of Solr
- Requires a Java service running on your server(s) (probably Tomcat)
- Contains a separate copy of the data (it has its own datastore/index) – so needs to be synchronised with the relational database somehow
- Query language is not SQL like
- Learning curve
- Indexing can be relatively slow (while we found rebuilding the index from XML files to be very quick, generating the XML files isn’t).
Some “scientific” performance figures
So we thought we’d better do something slightly scientific while writing this article and investigating Solr.
We started by recording all search queries performed on a live site over a period of about half an hour one morning. The search queries are a mixture of textual searches (e.g name = Inn and address = birmingham) and geospatial searches (e.g. find me premises within 5 km of longitude/latitude coordinates). Our intention was to then replay those queries against our shiny new Solr index and our internal MySQL database, and compare the results.
The dataset is about 400,000 records and contains business names and addresses. MySQL has full text indices on name and address fields and various indices have been added to other fields over time. Over the last 2-3 years we’ve spent considerable time optimising the database and search queries in an effort to improve performance.
Firstly, we created a thin API in front of Solr to provide a drop in replacement for the existing search routine (MySQL). This will allow us to turn on (or off) solr in the future using a simple configuration setting – and/or use MySQL as a fallback, should Solr misbehave.
Next, we constructed a simple test script which extracted the query terms from the above mentioned search log, and ran the same queries against both the MySQL and Solr backends on our internal development servers. We output the total matches found and investigated where there was an obvious discrepancy (e.g. Solr returning no records while MySQL returned 13,000 records). Sometimes the resultset sizes were identical, and sometimes there were slight discrepancies. Generally Solr seems to return more results.
The overall result
Solr was fastest in 6100 searches out of a total of 6300. More importantly (as shaving a millisecond off each search isn’t that significant) is the total time both approaches took – specifically Solr took 350 seconds while MySQL took 2900 seconds.
While these numbers are impressive (Solr takes 12% of the time that MySQL does), they’re not totally accurate – as the “Solr API” we’ve written has to perform a few database operations to re-form the search results to match what the MySQL based API does…. so in reality Solr is faster than it appears.
But where is it faster?
Altering our test a little, to try and see where Solr is faster than MySQL (or vice versa), we tried narrowing the test to only those searches which did not try to do GeoSpatial searches – at which Solr is even quicker (54s vs 956s).
Changing the dataset so it’s only where a search should be using one of MySQL’s fulltext indexes (i.e name/address) we still find Solr is fastest – with a total time of 12s vs 30s for MySQL (unfortunately our result set at this point is only 600 searches). Interestingly, MySQL is at this point faster for 450 of these searches, but slower overall, so there may be scope for reconfiguring Solr to be faster still.
So it seems that Solr is a lot better for GeoSpatial queries (find me hits within X km of latitude/longitude) and while it’s quicker for textual searching, it’s “lead” isn’t as big.
For some searches MySQL can be quicker. Overall, however in this instance, Solr was on average always faster, often by a large margin.
Solr’s response time seems to be more predictable (our timings show it being nearly always faster than 0.2s) and it generally returns the same number of, or slightly more, results than MySQL.
Both indexes/databases have (internal) caching turned on, and should be reasonably well tuned (at least in MySQL’s case – we’ve not enough experience with Solr yet – aside from just a quick Google and bumping up some cache sizes).
The hardware involved wasn’t identical, but was comparable (i.e. the PCs running either service were Linux with multiple CPU cores, >=8Gb of RAM and SATA based disks).
Solr is configured with specified fields (not dynamic fields) within it’s XML schema. Within Solr most of the data is indexed but not stored (we only get an Id back from Solr which we use to retrieve the matching record from the DB and populate the legacy ‘search result data structure’).
It’s quite likely that we could restructure the legacy data within MySQL to try and make it more searchable/performant (we have done so in the past twice). But restructuring the data results in either duplication or there being knock-on changes in the code elsewhere, and if we’re going to cope with a duplication of the data we might as well consider moving outside of MySQL.