Creating custom annotators

You can create a dictionary, regular expression, or machine learning annotator to generate new facets that can help you to analyze your data.

Before you begin, have the following data ready.

Custom annotator prerequisite data
Annotator type	Description	Data
Dictionary	Assigns facets to terms that match dictionary entries that you define or upload.	You can optionally upload a file of dictionary terms.
Machine learning	Assigns facets to mentions that are recognized by a machine learning model that you upload.	A compressed file of a machine learning model is required.
Regular expression	Assigns facets to text that matches Java regular expression patterns that you define or upload.	You can optionally upload a JSON file that contains regular expression patterns.

To create a custom annotator, complete the following steps:

From the analysis view of your collection, click the Collections link in the breadcrumb to open the Create a collection for your analytics solutions page of the Content Mining application.
To create an annotator, click collection, and then select custom annotator from the list.

Collection menu
Click Create custom annotator.
Name your annotator, and then optionally add a description.
Choose the annotator type, and then click Next.
Follow the on-screen instructions.

For more information about how to configure each annotator type, see one of the following sections:

Dictionary configuration

You can import an existing dictionary by uploading it or you can create a dictionary by adding terms one at a time.

If you plan to import a dictionary, the dictionary terms must be defined in a CSV file. Specify each term and its synonyms in a separate line. Use the following syntax to specify each term:

{term},{synonym},{synonym},...

To add a dictionary, complete the following steps:

Do one of the following things:
- To import the dictionary terms:
  1. Click Import, and then browse for the file with your dictionary terms.
  2. Click Import.
- To define the dictionary terms:
  1. Click Add.
  2. Click Word list to add the dictionary terms.
  3. Click Add, and then add the term in the Base word field and any synonyms that you want to define for the term in the Other words field. Separate multiple synonyms with commas. Click OK.
  4. Repeat the previous step to add more dictionary terms.
  5. After you finish adding dictionary terms, click Basic settings.
Name the dictionary.
If you plan to define terms with a part of speech other than a noun, specify the part of speech.

If the selected language is Chinese, Japanese, Korean, or Hebrew, you can only specify Noun for the part of speech.
Decide how you want to handle case.

When case is ignored, the terms Sat, SAT, and sat are all labeled as occurrences of the Sat dictionary term.

When you deselect the Ignore case checkbox to create a case-sensitive dictionary, the surface form of the term with uppercase match is used. Annotations are added for the term exactly as written and for variations of the term in which the letters are uppercase.

For example, a sat entry in the dictionary results in annotations for sat, Sat, or SAT mentions when they occur in text. For a Sat entry in the dictionary, annotations are added for occurrences of Sat and SAT, but not for sat.
Identify the facet name to use for this dictionary.

The facet name that you specify for the annotator is the facet name that is displayed from the collection search view.

You can create a hierarchical facet by including a period (.) in the facet name. For example, you might create one dictionary with the facet path Food.Vegetables and others with the facet paths Food.Fruits and Food.Proteins. Add more facet groups with more periods. For example, you can add Food.Proteins.Nuts and Food.Proteins.Meats to categorize proteins even further.

Adding a dictionary
If you want documents that are returned for a subfacet to be included when a user filters on the root facet, select Lift up words.

For example, you might enable Lift up words for Food.Fruits and Food.Proteins but not Food.Vegetables. As a result, when a user clicks the Food facet, the returned documents include documents that mention terms included in the Fruits and Meats dictionaries, such as apples and beef.

Dictionary enrichment application

However, a user must click the Food>Vegetables facet explicitly to get documents that mention terms in the Vegetables dictionary, such as lettuce, to be returned.

Subfacets
Repeat previous steps to add more dictionaries.
Click Save.

From the custom annotator page, you can see dictionaries that were created in other projects, including non-Content Mining projects. Dictionaries from other project types show the enrichment name as the annotator name. The Ignore case and Lift up words settings are disabled and the dictionary is named custom dict.

Dictionary limits

Dictionary plan limits
Plan	Number of dictionaries per service instance	Number of base words per dictionary	Number of terms for which suggestions can be generated
Cloud Pak for Data	Unlimited	Unlimited	1,000
Premium	200	10,000	1,000
Enterprise	200	10,000	1,000

Totals include enrichments that you create in this Content Mining project and in other projects in the same service instance.

Machine learning configuration

You can import an existing machine learning model.

To use Discovery to create a model, see Entity extractor.

To import a model, complete the following steps:

Click Select file, and then browse for the machine learning model file.
In the Facet path field, specify the root facet name to use for the model.

The facet name that you specify for the annotator is the facet name that is displayed from the collection search view.
Click Save.

Machine learning model limits

ML model plan limits
Plan	ML models per service instance
Cloud Pak for Data	Unlimited
Premium	10
Enterprise	10

Totals include enrichments that you create in this Content Mining project and in other projects in the same service instance.

Regular expressions configuration

You can import existing patterns by uploading them in a JSON file or you can add patterns.

To add patterns, complete the following steps:

Add the regular expression pattern to the New pattern field, and then click Add.
Specify a name for the pattern, and then identify the facet name to use for this pattern.

The facet name that you specify for the annotator is the facet name that is displayed from the collection search view.

Optional: Specify a facet value. You can specify a value from the options that are described in the table.

Regular expression facet value options
Facet value	Description
`$0`	Displays the matched text as-is.
`$n`	If your regular expression pattern contains groups, you can specify a group number to return the matched text from the pattern group only. For example, if your regular expression consists of 3 groups that define a US phone number pattern, such as `(\d{3})-(\d{3})-(\d{4})`, and you want to return only the area code portion of the phone number, you can specify `$1`. If the matched text is `212-555-1234`, then the facet value is displayed as `212`. Only specify a group as the facet value for patterns that you know will return matches.
`{prefix-text}:$0`	Adds hardcoded text in front of the facet name. You might want to use this option if you want to distinguish facets that are generated by this regular expression from facets that are similar but generated in some other way. For example, `MyRegex:$0` results in a facet named `MyRegex:212-555-1234`.

Click Save.

To import patterns, complete the following steps:

Define the patterns that you want to add in a JSON file.

The pattern definition must use the following syntax:
```
[
  {
    "name": "US Phone number",
    "description": "US mobile phone number",
    "pattern": "(\\d{3})-(\\d{3})-(\\d{4})",
    "facetPath": ".regex.usphonenumber",
    "facetValue": "$0"
  }
]
```
Keep the following notes in mind:
- The patterns must be defined in an array, even if you plan to define only one pattern.
- Escape any backslash (\) characters with a backslash.
- For more information about the facet value options, see the Regular expression facet value options table.
Click Import, and then choose the JSON file where the patterns are defined.
Click Save.

Regular expression limits

Regular expression plan limits
Plan	Regex enrichments per service instance	Regex patterns per service instance
Cloud Pak for Data	Unlimited	Unlimited
Premium	100	50
Enterprise	100	50

Totals include enrichments that you create in this Content Mining project and in other projects in the same service instance.

Applying the annotator

After the annotator is created, you must apply it to your collection.

From the Create a custom annotator for your analytics solutions page of the Content Mining application, click custom annotator, and then select collection from the list.
In the tile for your collection, click the options icon, and choose Edit collection.
Click the Enrichment tab, and then select the annotator that you created.

You might need to scroll to find it.
Click Save, and then confirm the action.

Give the index time to rebuild.

Filtering documents with your facet

Click the collection tile to open your collection in the data analysis page.
Do one of the following things:
- Your custom facets are listed in the Facets view. Scroll and click Load more repeatedly until your facets are displayed.
- Submit an empty search to return all documents. In the Facet analysis pane, select the facet that you created.
- To access your custom facets more quickly, add them to a custom view. Select Custom as the view, and then click Edit. Select one or more facets to add to the view, and then click Save.
  
  Collection menu