Identifying words to ignore

To ignore meaningless terms during searches, add a list of custom stop words. Stop words are words that are not useful in distinguishing the semantic meaning of the content.

In English, "the", "is", and "and" are examples of stop words.

The stop words that you define are filtered out of queries, which can improve the relevance of natural language query results.

For example, a company has three tiers of service. The documents in one of the collections pertain to only one tier, the Silver tier. You might want to add "silver" to the stop words list because the term doesn't help to distinguish the significance of one document over another, given that all of the documents relate to the Silver service tier. When a customer mentions the Silver tier in a query string, the term is ignored. Other, more significant terms in the query are used to search the data instead.

As another example, suppose that the document collection consists of car accident reports only. You might want to add "car" to the stop words list to prevent mentions of "car" in queries from adding noise to the search.

Discovery automatically applies a list of default stop words for many of the supported languages. These default stop words are applied at both indexing time and query time: they are ignored when content is indexed and they are filtered out of queries. However, stop words that you define are used at query time only. Your list doesn't replace the default list; it augments the default list. You can add stop words, but you cannot remove stop words from the default list.
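
To make the query-time behavior concrete, the following is a minimal sketch, not Discovery's actual implementation. It assumes a naive whitespace tokenizer and a tiny stand-in default list, and the names DEFAULT_STOP_WORDS, effective_stop_words, and filter_query are hypothetical. It shows a custom list that augments the default list, and the combined list being filtered out of a query string.

```python
# Illustrative only: a tiny stand-in for a default stop words list.
DEFAULT_STOP_WORDS = {"a", "an", "the", "is", "and"}


def effective_stop_words(custom_words):
    """Return the default list augmented with custom words (never reduced)."""
    return DEFAULT_STOP_WORDS | {word.lower() for word in custom_words}


def filter_query(query, custom_words=()):
    """Drop stop words from a query, using naive whitespace tokenization."""
    stop_words = effective_stop_words(custom_words)
    kept = [token for token in query.split() if token.lower() not in stop_words]
    return " ".join(kept)


# "is", "the", and the custom word "silver" are removed from the query.
print(filter_query("What is the Silver tier refund policy", ["silver"]))
# prints: What tier refund policy
```

Because the custom words are merged into the default set with a union, passing an empty custom list leaves the default filtering unchanged, which mirrors the "augment, not replace" behavior described above.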

Example custom stop word list:

{
  "stopwords": [
    "a", "an", "the", "ibm", "what", "how", "when", "can", "should", ...
  ]
}

Default stop word lists

You can access the default stop words list for English from the Watson Developer Cloud GitHub repository.

For the following languages, Discovery uses the default stop words list that is defined by Apache Lucene. For more information about what words are included in the list, see the Lucene reference documentation:

These default stop words are documented in TXT format, but if you want to augment the list and submit it for use by Discovery, you must submit a JSON file. To see an example of the syntax of a stop words list file, see the custom English stop words list file.
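
As a rough illustration of moving from TXT to JSON, the sketch below converts a text list into the "stopwords" JSON structure shown in the earlier example. It assumes one word per line and that "#" or "|" starts a comment; check the actual TXT file that you start from, because its conventions might differ. The function name txt_to_stopwords_json and the sample input are hypothetical.

```python
import json


def txt_to_stopwords_json(txt_content, extra_words=()):
    """Convert a one-word-per-line stop words text into the JSON list format."""
    words = []
    for line in txt_content.splitlines():
        # Assumption: "#" or "|" starts a comment that runs to the end of the line.
        for marker in ("#", "|"):
            line = line.split(marker, 1)[0]
        word = line.strip().lower()
        if word and word not in words:
            words.append(word)
    # Append any additional custom words, lowercased and de-duplicated.
    for word in extra_words:
        word = word.strip().lower()
        if word and word not in words:
            words.append(word)
    return json.dumps({"stopwords": words}, ensure_ascii=False, indent=2)


sample = "# sample list\nle | article\nla\n\nles\n"
print(txt_to_stopwords_json(sample, extra_words=["IBM"]))
```

The output can be saved with a .json extension and used as the starting point for a custom list.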

For the remaining supported languages, no default stop words are used. You can specify a stop words list to use at query time for these languages. The list that you submit is not used when data is ingested.

Examples of stop word lists that you might want to apply at query time include:

See supported languages for the list of the languages that are supported by Discovery.

Defining query-time stop words

To define stop words, complete the following steps:

  1. Create a stop words file. The file must be a JSON file with the .json file extension.

    Follow these guidelines:

    • Specify stop words in lowercase.
    • In general, keep your list of stop words under 200 total words. The size limit is one million characters. However, if you specify too many terms, you might negatively affect search accuracy.

    You can use the default English stop words list file, custom_stopwords_en.json, as a starting point when you build a custom stop word list in English.

  2. From the navigation pane, open the Improve and customize page.

  3. Expand Improve relevance from the Improvement tools pane.

  4. Click Stopwords, and then click Upload stopwords for the collection.

    Only one stop words list can be uploaded per collection. The stop words list that you upload augments the default stop words list for your collection; it does not replace the default list.

  5. Click Done.

To disable a custom stop words file and revert to using the default stop words, delete the custom stop words file.
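
The file guidelines from step 1 can be checked before upload with a small script. The following is a sketch under stated assumptions, not an official validator: it assumes that the one-million-character limit applies to the full text of the file, and the function name check_stopwords_file is hypothetical.

```python
import json

MAX_CHARACTERS = 1_000_000   # size limit from the guidelines (assumed to cover the whole file)
RECOMMENDED_MAX_WORDS = 200  # general recommendation, not a hard limit


def check_stopwords_file(text):
    """Return a list of warnings for the text of a stop words JSON file."""
    problems = []
    if len(text) > MAX_CHARACTERS:
        problems.append("file exceeds one million characters")
    words = json.loads(text).get("stopwords", [])
    # The guidelines ask for lowercase stop words.
    uppercase = [w for w in words if w != w.lower()]
    if uppercase:
        problems.append(f"not lowercase: {uppercase}")
    if len(words) >= RECOMMENDED_MAX_WORDS:
        problems.append(f"{len(words)} words; consider keeping the list under 200")
    return problems


print(check_stopwords_file('{"stopwords": ["a", "an", "IBM"]}'))
```

An empty list of warnings means that the file follows the guidelines that this sketch checks; it does not guarantee that the upload succeeds.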