KohaCon 11: Solr and Koha
Solr is a very powerful search and indexing engine, like the Zebra system Koha currently uses. It’s based on Apache Lucene, and is 100% free and open source. Solr has several advantages over Zebra.
- There is very little support and a small community behind Zebra, whereas Solr has a very active community
- The configurations in Zebra are hard, and require editing and updating on the command line, along with a restart of the system; Solr can handle dynamic changes to its indexing configuration, so we can use the staff client administration to manage this
- Zebra’s facets are difficult to work out, and fuzzy querying is not very functional. Solr has support for these things in its base install
- Zebra also requires a cron script to update the indexes every minute or two, whereas Solr can be indexed on the fly
- Fuzzy searching: citiienz~ will search “citizen” as well as the misspelled string
- Proximity search: “vision lotus”~10 will search for those two words within 10 words of each other
- Keyword ‘boosting’: koha^2 history will give “koha” a little higher relevance than “history” in search results
- Phonetic search: lve will also search for “love”
- Spelling suggestions! You can get “did you mean” returned when you make a typo
- Synonyms: flower can return results for “blossom” as well, if it’s configured
- many more functions can be implemented by plugins
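The query operators above are plain suffixes on the search string, so they are easy to generate. Here is a minimal sketch that builds the fuzzy, proximity, and boost examples and encodes them as request parameters. The helper names are hypothetical and made up for illustration; `q` and `rows` are standard Solr request parameters, but nothing here is taken from BibLibre's code.

```python
from urllib.parse import urlencode


def fuzzy(term):
    """Fuzzy match: append ~ so near spellings of the term also match."""
    return f"{term}~"


def proximity(phrase, distance):
    """Proximity match: the quoted words must occur within `distance` words."""
    return f'"{phrase}"~{distance}'


def boost(term, factor):
    """Boost: raise the relevance weight of one term with ^factor."""
    return f"{term}^{factor}"


def select_params(query, rows=10):
    """URL-encode the query string for a search request (hypothetical helper)."""
    return urlencode({"q": query, "rows": rows})


print(fuzzy("citiienz"))
print(proximity("vision lotus", 10))
print(select_params(boost("koha", 2) + " history"))
```

The encoded string would be appended to whatever search URL your Solr install exposes; that URL is deliberately left out here because it depends on the installation.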
BibLibre’s Solr work brings the indexing configuration into the Koha administration screens, so libraries can configure their own indexes, whether they’re faceted or sortable, and what data type they are (string, int, date, text, simple text; more can be defined). CCL and RPN search formats are also available, like Koha uses now (ti:title). Plugins can be linked to indexes for further processing. The other admin field defines which MARC fields get indexed where.
There are two kinds of indexes:
- static indexes: configured once before start (for IDs)
- dynamic indexes: configured through admin
And 3 base data types (plus 2 configured by BibLibre):
- Integer: regular numeric. You can search ranges in addition to ‘starts with’ (I assume < and > are also available)
- Date: ISO 8601 format (YYYY-MM-DD HH:MM:SS), also searchable in range.
- Strings: not processed for diacritics, and case sensitive. Used for sorting, facets, and IDs
- Simple text: case and diacritics insensitive, removes dots from acronyms, and can strip stop words. Good for incoming searches. Added by BibLibre.
- Text: has more features (synonyms, stemming, phonetic). It’s set up, but not currently used by any of BibLibre’s libraries. Can return LOTS of results. Better for small data sets, I’d wager.
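To make the difference between "string" and "simple text" concrete, here is a rough sketch of the kind of normalisation the simple text type is described as doing: fold case, strip diacritics, drop the dots from acronyms, and remove stop words. This is my own approximation in plain Python, not BibLibre's analyzer configuration, and the stop-word list is a hypothetical one chosen for the example.

```python
import re
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "of", "the"})


def simple_text(value, stop_words=STOP_WORDS):
    """Normalise a value roughly the way the 'simple text' type is described."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", value)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = unaccented.lower()
    # Collapse dotted acronyms: "u.s.a." becomes "usa".
    undotted = re.sub(
        r"\b(?:[a-z]\.){2,}",
        lambda match: match.group(0).replace(".", ""),
        lowered,
    )
    # Split into word tokens and drop stop words.
    tokens = re.findall(r"[a-z0-9]+", undotted)
    return " ".join(token for token in tokens if token not in stop_words)


print(simple_text("The History of the U.S.A."))  # history usa
print(simple_text("Élève"))  # eleve
```

A string index, by contrast, would store the value untouched, which is why it suits IDs, sorting, and facets but not free-text searching.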
BibLibre also added a couple plugins for indexes:
- DeleteNsbNse: remove NSB (non-sorting beginning) and NSE (non-sorting ending) characters (AKA diacritical markers)
- AuthorAuthorities: retrieves authorities and stores them in biblios; allows link between authorities and biblios at search time.
Other things to note about this work:
- A Z39.50 server has been developed, probably with SimpleServer, so moving to Solr doesn’t lose this Zebra feature (though there may be some compatibility issues; we will have to see at the hackfest)
- Facets can easily include count of hits after the term; provided by Solr, not counted by Koha from Zebra results after the fact. This gives more accurate results, even across search pages.
- After choosing a facet, you have an [x] to remove the chosen facet; multiple facets can be chosen and deselected in different order.
- Solr uses ICU to achieve complete Unicode compliance.
- Nucsoft has tested and finds this to work well, for many languages.
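As a sketch of how the facet behaviour above maps onto a search request: Solr returns counts for each field named in `facet.field` when `facet=true` is set, and each chosen facet can be sent as its own `fq` filter, so removing one is just dropping that pair from the list. The parameter names are standard Solr ones; the field names (`author`, `itype`) and the helper functions are hypothetical and are not taken from the Koha code.

```python
from urllib.parse import urlencode


def facet_params(query, facet_fields, selected=()):
    """Build request parameters asking for facet counts plus chosen filters."""
    params = [("q", query), ("facet", "true")]
    # One facet.field entry per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    # One fq entry per facet value the user has clicked.
    params += [("fq", f'{field}:"{value}"') for field, value in selected]
    return params


def remove_facet(selected, field, value):
    """Drop one chosen facet (the [x] link), keeping the others in order."""
    return [pair for pair in selected if pair != (field, value)]


chosen = [("author", "Walls"), ("itype", "BOOK")]
chosen = remove_facet(chosen, "author", "Walls")
print(urlencode(facet_params("koha", ["author", "itype"], chosen)))
```

Because each facet is an independent filter, the order in which they were selected does not matter when one is removed.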
There are some issues with getting Solr support into Koha. The existing search code is not well written, so the Solr work rewrites Search completely. This means we will need to write a Zebra module in order to maintain support for that engine (which libraries may prefer to keep). In my opinion, this is the major blocker to getting Solr integrated. The big trick will be defining indexes in the staff interface with Zebra… this is hard because they require a static file and a restart of the indexer every time that file is changed.
Another issue is that indexing performance needs a boost: 8 hrs for 1/2 million records. But, while records are being indexed, those that are completed are searchable, which is better than we get with Zebra now. And BibLibre has this down from 16 hrs, so improvements have already been made.
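To put those indexing figures in perspective, here is the back-of-the-envelope arithmetic, assuming "1/2 million" means exactly 500,000 records and that the rate is constant over the run (both are simplifications on my part).

```python
def records_per_second(records, hours):
    """Average indexing throughput over a full run."""
    return records / (hours * 3600)


RECORDS = 500_000  # assumed exact count for "1/2 million records"

current = records_per_second(RECORDS, 8)
previous = records_per_second(RECORDS, 16)

print(f"now: {current:.1f} records/s")
print(f"before: {previous:.1f} records/s")
```

That works out to roughly 17 records per second now, up from about 9 before the improvements.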
All in all, I look forward to testing the Solr work, and getting it ready for the rest of the world to test and make use of. Hopefully we’ll have some time to discuss this in person at the hackfest.
[Originally posted by Ian Walls]