Koha How-To

Elastic Searching

The fundamentals of searching have not changed between Zebra and Elastic, from a user’s perspective. The interface hasn’t changed, and early tester libraries haven’t reported any confusion or questions from their patrons. But there are ways in which your search results will differ between the two search engines. Additionally, Elastic brings us a lot of exciting new toys and tricks to play with.

Before jumping in, I have a few disclaimers:

  • Elastic is a big tool that can do a lot of things. We’re still in the process of making all of those things work in Koha.
  • Implementation of Elastic in Koha is still very much in progress, so expect new features, fixes, and tweaks regularly.
  • As I write this, we have partners testing Elastic in Koha 19.05 and are looking forward to moving some of our Elastic testers to 19.11. As such, this post focuses on Elastic in those two versions of Koha.
  • I’m going to use the public catalog for my screenshots, but all the same functionality exists on both the public and staff catalogs.

Please also see our post about search configuration in Koha using Elastic.

General keyword searching

As with Zebra, the default search in Elastic is a general keyword search. If we do not specify a search index, Koha interprets that as a search in the keyword index.

In Zebra, this meant searching the entire MARC record. In Elastic in 19.05, the keyword index contains only the MARC fields that are included in at least one other index. Generally, this is a helpful thing as it allows us to wholly ignore parts of our records that we don’t care about. In 19.11, we have the new option to specify which of our search indices are included in a keyword search and to make that differ between the staff and public catalogs. That would allow us, for example, to make the 952$x (non-public item note) factor into staff searches but not public searches. Additionally, 19.11 can add an option in advanced search to search the entire MARC record as Zebra did, if you prefer.

Specifying an index

If you want to perform a more specific search, you can tell Koha to use something other than the keyword index. Again, the basic process for this hasn’t changed in Elastic. To search the title index, you can set the search dropdown to “Title.”

Or, if you prefer, you can leave the dropdown alone and specify your index using CCL (Common Command Language).

Saying “title:batman and robin” tells Elastic to look for “batman and robin” in the title index. Your search engine configuration page will give you a list of all your search indices and which MARC fields they include. While the search dropdown has a default set of options, we can add any you like, if there’s something you search often and would prefer not to use CCL for.

Once you’ve specified an index, it will be used for all following terms in your query until you specify something else.

That means “title: batman harley” looks for both “batman” and “harley” in the title.

Whereas “title:batman kw:harley” looks for “batman” in the title and “harley” in the keyword index.

Of course, you could accomplish that same search with “harley title:batman.” Since “harley” comes before we specify the title field, it defaults to keyword.

Note you can mix and match any indices in this way. Searching “title:batman author:dini” gives us records with “batman” in the title and “dini” in the author.
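If it helps to see the "index sticks until you change it" rule spelled out, here is a minimal Python sketch. It is only an illustration of the behavior described above, not how Koha or Elastic actually parse a query, and the function name assign_indexes is something I made up for this example. It also ignores Boolean operators, quotation marks, and parentheses.

```python
def assign_indexes(query, default_index="kw"):
    """Pair each search term with the index it would be searched in.

    Hypothetical helper for illustration only; not part of Koha or Elastic.
    """
    current = default_index
    pairs = []
    for token in query.split():
        if ":" in token:
            prefix, term = token.split(":", 1)
            current = prefix      # the new index sticks for the terms that follow
            if not term:          # "title: batman" leaves the term in the next token
                continue
        else:
            term = token
        pairs.append((current, term))
    return pairs


print(assign_indexes("title: batman harley"))
# [('title', 'batman'), ('title', 'harley')]
print(assign_indexes("title:batman kw:harley"))
# [('title', 'batman'), ('kw', 'harley')]
print(assign_indexes("harley title:batman"))
# [('kw', 'harley'), ('title', 'batman')]
```

The last two calls give the same pairs in a different order, which is why those two searches are equivalent.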

Boolean operators

In Koha, Elastic is configured to assume all our search terms are connected with AND operators. We can specify different Boolean operators just as we did in Zebra, using either the advanced search page or CCL.

In advanced search, set your Boolean dropdown to the operator you want. The screenshot above shows a search that will bring back records with either “batman” or “superman” (or both).

The same search can be done via CCL by typing “batman OR superman.” Your Boolean operator needs to be in all caps. A search for “batman or superman” will look for the word “or.”

Your operators will be applied in the order AND, OR, NOT. This order of operations can result in some unexpected processing. You can use parentheses to force explicit grouping, which can clarify things.

This returns records that contain “wonder woman” and either “batman” or “superman.”

A Boolean operator makes Elastic forget which index you were looking in, so the query “title:batman OR superman” is the same as “title:batman OR kw:superman.” You can correct for this by repeating your index -- “title:batman OR title:superman” -- or by using parentheses -- “title:(batman OR superman).”

Wildcards and truncation

Elastic supports two different wildcard characters.

A question mark stands in for one character. So “batm?n” will match “batman” or “batmen” or any other word you can make by shoving a letter or number between “batm” and “n.” But remember it will look for exactly one character to replace that question mark. That means “bat?man” won’t find “batman” or “batwoman” but would find “bathman.”

An asterisk stands in for zero or more characters. So “bat*man” will find “batman,” “batwoman,” “bathman,” and any other word that starts with “bat” and ends in “man.”

An asterisk can be used at the start or end of a word to truncate. Searching for “bat*” returns all words starting with “bat” and searching for “*man” returns all words ending in “man.”
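To try the two wildcards out away from the catalog, Python's standard fnmatch module happens to give ? and * the same meanings. This is only an analogy: Elastic matches wildcards against the words in its own indexes, and fnmatch also treats square brackets specially, which is not something covered here. The word list and the matching function are made up for the example.

```python
from fnmatch import fnmatchcase

# A made-up word list standing in for words found in catalog records.
words = ["batman", "batmen", "batwoman", "bathman"]


def matching(pattern):
    """Return the sample words that a wildcard pattern matches."""
    return [word for word in words if fnmatchcase(word, pattern)]


print(matching("batm?n"))   # ['batman', 'batmen']
print(matching("bat?man"))  # ['bathman']
print(matching("bat*man"))  # ['batman', 'batwoman', 'bathman']
print(matching("bat*"))     # ['batman', 'batmen', 'batwoman', 'bathman']
print(matching("*man"))     # ['batman', 'batwoman', 'bathman']
```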

Stemming

Zebra has a feature called stemming that’s related to truncating. It’s controlled by the QueryStemming system preference. It does things like returning “enabled” when you search for “enabling.” Elastic in Koha doesn’t currently do this, and the QueryStemming system preference has no effect while you’re using Elastic. However, Elastic does have its own stemming options, and that’s something we expect to explore more fully going forward.

Phrase searching

In Elastic, you can force an exact match using quotation marks.

Searching for “batman superman” with quotation marks only returns records with those two words next to each other in that order. In Zebra, you had to select a special “as phrase” search option from your search dropdown to do this (like “title as phrase” or “subject as phrase”). Those options still exist in Elastic, but all they do is insert quotation marks around your search for you.

Be aware that Elastic will ignore any wildcards within quotation marks, since quotation marks mean you want exact matches only.

Ranges

You can make Elastic search a range of values in several ways. This mostly applies to indices of numeric fields, like date-of-publication, which holds the publication year from the 008 field.

Square brackets like I’ve used here are inclusive, so my search is for anything published in 2010, 2011, or 2012. If you use curly brackets like “{2010 TO 2012}” the range would be exclusive, meaning it would only find things published in 2011. You can even get fancy and mix them up, like “[2010 TO 2012},” which would be inclusive on the low end but exclusive on the high end. Just like your Boolean operators, “TO” needs to be all caps.

You can also use greater-than and less-than symbols for number-based searches. The search above returns everything with a publication date greater than or equal to 2010.

These searches won’t work well with fields that aren’t strictly numeric. If your 245 tags contain a subfield for data like “Volume 6” or “Season 2,” Elastic won’t know how to discard the word and look only at the number. However, this is exactly the sort of functionality folks in the community are currently working on, so that may change!
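The inclusive and exclusive bracket rules can be sketched in a few lines of Python. This is my own illustration of the logic, not Elastic's code, and the names parse_range and years_in_range are invented for the example. The sketch deliberately rejects a lowercase "to" to mirror the all-caps rule.

```python
import re

# Matches things like "[2010 TO 2012]" or "{2010 TO 2012}"; "TO" must be uppercase.
RANGE = re.compile(r"^([\[{])\s*(\d+)\s+TO\s+(\d+)\s*([\]}])$")


def parse_range(text):
    """Turn '[2010 TO 2012}' into (low, high, low_inclusive, high_inclusive)."""
    match = RANGE.match(text.strip())
    if match is None:
        raise ValueError(f"not a range: {text!r}")
    open_bracket, low, high, close_bracket = match.groups()
    return int(low), int(high), open_bracket == "[", close_bracket == "]"


def years_in_range(text, candidates):
    """Return the candidate years that fall inside the range."""
    low, high, low_inclusive, high_inclusive = parse_range(text)
    found = []
    for year in candidates:
        above = year >= low if low_inclusive else year > low
        below = year <= high if high_inclusive else year < high
        if above and below:
            found.append(year)
    return found


years = range(2009, 2014)
print(years_in_range("[2010 TO 2012]", years))  # [2010, 2011, 2012]
print(years_in_range("{2010 TO 2012}", years))  # [2011]
print(years_in_range("[2010 TO 2012}", years))  # [2010, 2011]
```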

Negating and requiring search terms

If you want to make sure your results don’t include a specific term, you can negate it with a minus sign.

A search for “batman -joker” returns records that contain “batman” but not “joker.” You can also make terms required by adding a plus sign, but that’s redundant because we default to connecting all of our terms with AND, which also makes them required.

Fuzzy searching

In Zebra, turning on the QueryFuzzy system preference made all of your searches look for similarly-spelled words. It was sort of ill-defined and unpredictable and we tended to suggest folks not use it. In Elastic, turning QueryFuzzy on doesn’t change your search results on its own, but it gives you the option of making any individual term in your search fuzzy by putting a tilde after it.

A search for “batman azzarelo~” looks for records with “batman” spelled just as you’ve spelled it but “azzarelo” with some spelling variation. How much variation is allowed is based on how long your fuzzy word is: a word six or more characters long allows up to two changes, a word three to five characters long allows one change, and a word just one or two characters long doesn’t allow any changes (so making it fuzzy doesn’t do anything). A change here means moving a letter, replacing a letter, adding a letter, or removing a letter. Since “azzarelo” is eight characters long, it allows up to two changes, which is more than enough to find the correct spelling of this author’s name, “azzarello” (one added letter).
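Here is a rough Python sketch of that counting, in case you want to check how many changes separate two spellings. The function names are mine, and the distance calculation is a textbook "optimal string alignment" routine that I picked because it counts the four kinds of change listed above. Elastic does its real matching internally and far more efficiently, so treat this as a way to reason about results rather than a copy of what Elastic runs.

```python
def allowed_edits(term):
    """Changes allowed for a fuzzy term, following the length rule described above."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def edit_distance(a, b):
    """Count added, removed, replaced, and swapped-adjacent letters between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # remove a letter
                dist[i][j - 1] + 1,         # add a letter
                dist[i - 1][j - 1] + cost,  # replace a letter (or keep it)
            )
            # Two neighboring letters swapped counts as a single change.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def fuzzy_match(term, candidate):
    """True if candidate is within the number of changes allowed for term."""
    return edit_distance(term, candidate) <= allowed_edits(term)


print(allowed_edits("azzarelo"))               # 2
print(edit_distance("azzarelo", "azzarello"))  # 1
print(fuzzy_match("azzarelo", "azzarello"))    # True
```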

Proximity

A proximity search lets you find two words within a certain distance of each other.

To perform a proximity search, put your terms in quotes and then add a tilde and a number. So “batman robin”~1 gives us records in which “batman” and “robin” appear within one word of each other. That would include “batman and robin” or “batman & robin.” Note that when we say words here we’re using the term loosely, basically to mean a group of characters separated from other characters by spaces. So in this context an ampersand is considered a word.

Now, technically the number here isn’t a count of words between our terms. It’s a count of changes needed to make our record match our search (sort of like how fuzziness counted changes to letters). It takes one edit (removing the “and”) to make “batman and robin” match “batman robin.” Following this idea of counting edits, two edits allows us to transpose our words. So “batman robin”~2 would match “robin batman.” And “batman robin”~3 would match “robin and batman.”
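The edit counting above can be approximated for a two-word phrase with a small Python sketch. This is a simplified model that I wrote to reproduce the examples in this post. It is not Elastic's actual algorithm, it only handles two different words, and the name slop_needed is made up.

```python
def slop_needed(first, second, text):
    """Smallest number after the tilde for '"first second"~N' to match text.

    Simplified two-word model; returns None if either word is missing.
    """
    words = text.lower().split()
    best = None
    for i, word_a in enumerate(words):
        if word_a != first:
            continue
        for j, word_b in enumerate(words):
            if word_b != second:
                continue
            if j > i:
                moves = j - i - 1   # in order: count the words in between
            else:
                moves = i - j + 1   # reversed: getting the words past each other costs more
            if best is None or moves < best:
                best = moves
    return best


print(slop_needed("batman", "robin", "batman robin"))      # 0
print(slop_needed("batman", "robin", "batman and robin"))  # 1
print(slop_needed("batman", "robin", "robin batman"))      # 2
print(slop_needed("batman", "robin", "robin and batman"))  # 3
```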

Boosting relevancy

In our Elastic configuration post, we talk about how to define weightings to configure how your search results are ordered. Elastic will also let you use the boost function to give a specific term some extra importance in any given search.

To boost a term in your search, follow it with a caret and a number. A search for “batman robin stephanie^10” returns records with those three words, but makes “stephanie” more important in deciding which order to display your results in. Because the default weighting is 1, you can also use a boost value between 0 and 1 to reduce a term’s importance in your search results.

Escaping punctuation

Many of the search features discussed here use punctuation marks to let Elastic know you’re doing something special. If you want to perform a search that includes one of these punctuation marks, you need to tell Elastic to ignore the punctuation’s special meaning. In coding, this is referred to as escaping the punctuation mark.

To escape a punctuation mark, put a backslash before it. So a search for “title:batman\: year one” searches for “batman: year one” directly without trying to use the colon to do something special. The following punctuation marks need to be escaped if included in your search: +, -, =, &&, ||, >, <, !, (, ), {, }, [, ], ^, ", ~, *, ?, :, \, and /.
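If you build searches in a script, a small helper can add the backslashes for you. The sketch below is my own, with a made-up function name, and simply puts a backslash before each item in the list above. One caveat: as far as I know, Elastic's own documentation says < and > cannot be escaped and have to be removed instead, so treat the handling of those two characters here as an assumption.

```python
import re

# The reserved characters listed above; "&&" and "||" are treated as pairs.
RESERVED = re.compile(r'(&&|\|\||[+\-=><!(){}\[\]^"~*?:\\/])')


def escape_reserved(text):
    """Put a backslash in front of each reserved character or character pair."""
    return RESERVED.sub(r"\\\1", text)


# Escape only the user's text, then add the index prefix afterwards,
# so the colon in "title:" keeps its special meaning.
print("title:" + escape_reserved("batman: year one"))
# title:batman\: year one
```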

Of course, you should usually just be able to leave those punctuation marks out of your search entirely, rather than worrying about escaping them. In my example above, “title: batman year one” without the colon would have found the same title without a problem.