Search analyzers
IBM® Cloudant® for IBM Cloud® Search is the free-text search technology that is built into the IBM Cloudant database that is powered by Apache Lucene.
When you create an IBM Cloudant Search index, you must consider which fields from your documents need to be indexed, and how they are to be indexed.
One aspect of the indexing process is the choice of analyzer. An analyzer is code that can have the following effect:
- Make the search case-insensitive by ensuring the string is lowercase.
- Tokenize the string by breaking a sentence into individual words.
- Stem the words by removing language-specific word endings, for example, farmer becomes farm.
- Remove stop words by ignoring words like a, is, or if, which can make the index smaller and more efficient.
At indexing-time, source data is processed by using the analyzer logic sorts and stores data in the index. At query-time, the search terms are processed by using the same analyzer code before it interrogates the index.
Testing the analyzer
If you want to see the effect of each analyzer, use the IBM Cloudant Search API call that applies to one of the built-in Lucene analyzers with a supplied string.
To look at each analyzer in turn, you can pass the following string to each analyzer to measure the effect:
"My name is Chris Wright-Smith. I live at 21a Front Street, Durham, UK - my email is chris7767@aol.com."
Standard analyzer
The standard
analyzer changes the string in the following ways:
- Removes punctuation.
- Splits words based on spaces and punctuation.
- Removes stop words, including "is" and "at".
- Changes words to use lowercase letters.
- Note how "aol.com" stays intact.
{"tokens":["my","name","josé","wright","smith","i","live","21a","front","street","durham","uk","my","email","jose7767","aol.com"]}
Keyword analyzer
With the keyword
analyzer, the string stays intact. See the following example:
{"tokens":["My name is José Wright-Smith. I live at 21a Front Street, Durham, UK - my email is jose7767@aol.com."]}
Simple analyzer
The simple
analyzer changes the string in the following ways:
- Removes punctuation.
- Splits words based on spaces and punctuation.
- No stop words removed (notice "is" and "at").
- Changes words to use lowercase letters.
- Note how
jose7767
changes tojose
and21a
changes toa
.
{"tokens":["my","name","is","josé","wright","smith","i","live","at","a","front","street","durham","uk","my","email","is","jose","aol","com"]}
Simple ASCII-folding analyzer
The simple_asciifolding
analyzer changes the string in the following ways:
- Removes punctuation.
- Splits words based on spaces and punctuation.
- No stop words removed (notice "is" and "at").
- Changes words to use lowercase letters.
- Converts non-ASCII characters to their closest ASCII equivalent.
- For example,
José
becomesjose
.
{"tokens":["my","name","is","jose","wright","smith","i","live","at","a","front","street","durham","uk","my","email","is","jose","aol","com"]}
Whitespace analyzer
The whitespace
analyzer changes the string in the following ways:
- Removes some punctuation.
- Splits words on spaces.
- No stop words removed (notice "is" and "at").
- Words remain case-sensitive.
- Note how email stays intact.
{"tokens":["My","name","is","José","Wright-Smith.","I","live","at","21a","Front","Street,","Durham,","UK","-","my","email","is","jose7767@aol.com."]}
Classic analyzer
The classic
analyzer changes the string in the following ways:
- Removes punctuation.
- Splits words based on spaces and punctuation.
- Removes stop words (no "is" or "at").
- Changes words to use lowercase letters.
- Note how email stays intact.
{"tokens":["my","name","josé","wright","smith","i","live","21a","front","street","durham","uk","my","email","jose7767@aol.com"]}
English analyzer
The english
analyzer changes the string in the following ways:
- Removes punctuation.
- Splits words based on spaces and punctuation.
- Stems words using the Porter Stemming algorithm (for example,
fishing
becomesfish
). - Removes stop words (no "is" or "at").
- Changes words to use lowercase letters.
{"tokens":["my","name","josé","wright","smith","i","live","21a","front","street","durham","uk","my","email","jose7767","aol.com"]}
Language-specific analyzers make the most changes to the source data. See the following two examples that use the english
analyzer:
The quick brown fox jumped over the lazy dog.
{"tokens":["quick","brown","fox","jump","over","lazi","dog"]}
Four score and seven years ago our fathers brought forth, on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
{"tokens":["four","score","seven","year","ago","our","father","brought","forth","contin","new","nation","conceiv","liberti","dedic","proposit","all","men","creat","equal"]}
Which analyzer must I pick?
It depends on your data. If your data is structured (email addresses, postal codes, names, and so on) in separate fields, then select an analyzer that retains the data you need to search.
Only index the fields that you need. Keeping the index small helps to improve performance.
Consider the common data sources and look at the best analyzer choices.
Names
It's likely that name fields must use an analyzer that doesn't stem words. The whitespace
analyzer keeps the words' case (meaning the search terms must be full, case-sensitive matches) and leaves double-barreled names intact.
If you want to split up double-barreled names, then the standard
analyzer can do the job.
Email addresses
The built-in email
analyzer serves this purpose, which changes everything to lowercase and then behaves like the Keyword analyzer.
Unique ID
Order numbers, payment references, and UUIDs such as "A1324S", "PayPal0000445", and "ABC-1412-BBG" must be kept without any pre-processing, so the keyword
analyzer is preferred.
Country codes
Country codes like "UK" must also use the keyword
analyzer to prevent the removal of "stopwords" that match the country codes, for example, "IN" for India. The keyword
analyzer is case-sensitive.
Text
It is best to process a block of free-form text with a language-specific analyzer, such as the english
analyzer, or in a more general case, the standard
analyzer.