Creating custom annotators
You can create a dictionary, regular expression, or machine learning annotator to generate new facets that can help you to analyze your data.
Before you begin, have the following data ready.
Annotator type | Description | Data |
---|---|---|
Dictionary | Assigns facets to terms that match dictionary entries that you define or upload. | You can optionally upload a file of dictionary terms. |
Machine learning | Assigns facets to mentions that are recognized by a machine learning model that you upload. | A compressed file of a machine learning model is required. |
Regular expression | Assigns facets to text that matches Java regular expression patterns that you define or upload. | You can optionally upload a JSON file that contains regular expression patterns. |
To create a custom annotator, complete the following steps:
-
From the analysis view of your collection, click the Collections link in the breadcrumb to open the Create a collection for your analytics solutions page of the Content Mining application.
-
To create an annotator, click collection, and then select custom annotator from the list.
-
Click Create custom annotator.
-
Name your annotator, and then optionally add a description.
-
Choose the annotator type, and then click Next.
-
Follow the on-screen instructions.
For more information about how to configure each annotator type, see the section for that type: Dictionary configuration, Machine learning configuration, or Regular expressions configuration.
Dictionary configuration
You can import an existing dictionary by uploading it or you can create a dictionary by adding terms one at a time.
If you plan to import a dictionary, the dictionary terms must be defined in a CSV file. Specify each term and its synonyms on a separate line. Use the following syntax to specify each term:
{term},{synonym},{synonym},...
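As an illustration of the import format, the following Python sketch writes terms and their synonyms in the `{term},{synonym},...` layout. The function name `dictionary_csv` and the sample terms are hypothetical. The sketch rejects terms that contain commas, on the assumption that the importer's handling of quoted fields is unknown.

```python
import csv
import io


def dictionary_csv(terms: dict[str, list[str]]) -> str:
    """Return CSV text with one line per term: term,synonym,synonym,..."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for term, synonyms in terms.items():
        row = [term, *synonyms]
        for word in row:
            # Assumption: avoid commas inside a word, because it is not
            # documented whether the importer honors CSV quoting.
            if "," in word:
                raise ValueError(f"comma not allowed in dictionary word: {word!r}")
        writer.writerow(row)
    return buffer.getvalue()


sample = dictionary_csv({"apple": ["apples", "pomme"], "beef": ["steak"]})
print(sample)
```

Each key becomes the term and each list entry becomes a synonym on the same line. Save the returned text to a `.csv` file before you import it.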
To add a dictionary, complete the following steps:
-
Do one of the following things:
-
To import the dictionary terms:
- Click Import, and then browse for the file with your dictionary terms.
- Click Import.
-
To define the dictionary terms:
- Click Add.
- Click Word list to add the dictionary terms.
- Click Add, and then add the term in the Base word field and any synonyms that you want to define for the term in the Other words field. Separate multiple synonyms with commas. Click OK.
- Repeat the previous step to add more dictionary terms.
- After you finish adding dictionary terms, click Basic settings.
-
Name the dictionary.
-
If you plan to define terms with a part of speech other than a noun, specify the part of speech.
If the selected language is Chinese, Japanese, Korean, or Hebrew, you can specify only Noun for the part of speech.
-
Decide how you want to handle case.
When case is ignored, the terms `Sat`, `SAT`, and `sat` are all labeled as occurrences of the `Sat` dictionary term.
When you deselect the Ignore case checkbox to create a case-sensitive dictionary, the surface form of the term with uppercase match is used. Annotations are added for the term exactly as written and for variations of the term in which the letters are uppercase.
For example, a `sat` entry in the dictionary results in annotations for `sat`, `Sat`, or `SAT` mentions when they occur in text. For a `Sat` entry in the dictionary, annotations are added for occurrences of `Sat` and `SAT`, but not for `sat`.
-
Identify the facet name to use for this dictionary.
The facet name that you specify for the annotator is the facet name that is displayed from the collection search view.
You can create a hierarchical facet by including a period (.) in the facet name. For example, you might create one dictionary with the facet path `Food.Vegetables` and others with the facet paths `Food.Fruits` and `Food.Proteins`. Add more facet groups with more periods. For example, you can add `Food.Proteins.Nuts` and `Food.Proteins.Meats` to categorize proteins even further.
-
If you want documents that are returned for a subfacet to be included when a user filters on the root facet, select Lift up words.
For example, you might enable Lift up words for `Food.Fruits` and `Food.Proteins` but not `Food.Vegetables`. As a result, when a user clicks the Food facet, the returned documents include documents that mention terms included in the Fruits and Proteins dictionaries, such as apples and beef.
However, a user must click the Food>Vegetables facet explicitly to get documents that mention terms in the Vegetables dictionary, such as lettuce.
-
Repeat previous steps to add more dictionaries.
-
Click Save.
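The Ignore case behavior that is described in the steps above can be sketched in Python as follows. This is only one reading of the documented examples, not the product's implementation: the function name `term_matches` is hypothetical, and the sketch assumes that a lowercase letter in a dictionary entry can match either case while an uppercase letter must match exactly.

```python
def term_matches(entry: str, mention: str, ignore_case: bool) -> bool:
    """Return True if a mention in text would be annotated for a dictionary entry."""
    if ignore_case:
        # Ignore case selected: any casing of the same letters matches.
        return entry.casefold() == mention.casefold()
    if len(entry) != len(mention):
        return False
    for entry_char, mention_char in zip(entry, mention):
        if entry_char == mention_char:
            continue
        # Assumption: the only allowed variation is a lowercase entry
        # letter that appears in uppercase in the text.
        if entry_char.upper() != mention_char:
            return False
    return True


for entry in ("sat", "Sat"):
    hits = [m for m in ("sat", "Sat", "SAT") if term_matches(entry, m, ignore_case=False)]
    print(entry, "->", hits)
```

With a case-sensitive dictionary, the `sat` entry matches all three mentions, while the `Sat` entry matches only `Sat` and `SAT`, which agrees with the examples in the case-handling step.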
From the custom annotator page, you can see dictionaries that were created in other projects, including non-Content Mining projects. Dictionaries from other project types show the enrichment name as the annotator name. The Ignore case and Lift up words settings are disabled and the dictionary is named `custom dict`.
Dictionary limits
Plan | Number of dictionaries per service instance | Number of base words per dictionary | Number of terms for which suggestions can be generated |
---|---|---|---|
Cloud Pak for Data | Unlimited | Unlimited | 1,000 |
Premium | 200 | 10,000 | 1,000 |
Enterprise | 200 | 10,000 | 1,000 |
Totals include enrichments that you create in this Content Mining project and in other projects in the same service instance.
Machine learning configuration
You can import an existing machine learning model.
To use Discovery to create a model, see Entity extractor.
To import a model, complete the following steps:
-
Click Select file, and then browse for the machine learning model file.
-
In the Facet path field, specify the root facet name to use for the model.
The facet name that you specify for the annotator is the facet name that is displayed from the collection search view.
-
Click Save.
Machine learning model limits
Plan | ML models per service instance |
---|---|
Cloud Pak for Data | Unlimited |
Premium | 10 |
Enterprise | 10 |
Totals include enrichments that you create in this Content Mining project and in other projects in the same service instance.
Regular expressions configuration
You can import existing patterns by uploading them in a JSON file, or you can add patterns one at a time.
To add patterns, complete the following steps:
-
Add the regular expression pattern to the New pattern field, and then click Add.
-
Specify a name for the pattern, and then identify the facet name to use for this pattern.
The facet name that you specify for the annotator is the facet name that is displayed from the collection search view.
-
Optional: Specify a facet value. You can specify a value from the options that are described in the table.
Regular expression facet value options
Facet value | Description |
---|---|
`$0` | Displays the matched text as-is. |
`$n` | If your regular expression pattern contains groups, you can specify a group number to return the matched text from the pattern group only. For example, if your regular expression consists of 3 groups that define a US phone number pattern, such as `(\d{3})-(\d{3})-(\d{4})`, and you want to return only the area code portion of the phone number, you can specify `$1`. If the matched text is `212-555-1234`, then the facet value is displayed as `212`. Only specify a group as the facet value for patterns that you know will return matches. |
`{prefix-text}:$0` | Adds hardcoded text in front of the facet name. You might want to use this option if you want to distinguish facets that are generated by this regular expression from facets that are similar but generated in some other way. For example, `MyRegex:$0` results in a facet named `MyRegex:212-555-1234`. |
-
Click Save.
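To see how the facet value options behave, the following Python sketch expands `$0` and `$n` references against a match. It is an approximation only: Python's `re` module stands in for the Java regular expression engine (the phone number pattern uses syntax that both support), and the helper name `facet_value` is hypothetical.

```python
import re
from typing import Optional


def facet_value(pattern: str, template: str, text: str) -> Optional[str]:
    """Expand $0 and $n references in template by using the first match in text."""
    match = re.search(pattern, text)
    if match is None:
        return None  # no match, so no facet value is produced

    def expand(ref: re.Match) -> str:
        # group(0) is the whole match; group(n) is the nth capture group.
        return match.group(int(ref.group(1))) or ""

    return re.sub(r"\$(\d+)", expand, template)


phone = r"(\d{3})-(\d{3})-(\d{4})"
print(facet_value(phone, "$0", "Call 212-555-1234"))          # 212-555-1234
print(facet_value(phone, "$1", "Call 212-555-1234"))          # 212
print(facet_value(phone, "MyRegex:$0", "Call 212-555-1234"))  # MyRegex:212-555-1234
```

A reference to a group number that the pattern does not define raises an error in this sketch, which mirrors the advice to specify a group only for patterns that you know will return matches.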
To import patterns, complete the following steps:
-
Define the patterns that you want to add in a JSON file.
The pattern definition must use the following syntax:
[ { "name": "US Phone number", "description": "US mobile phone number", "pattern": "(\\d{3})-(\\d{3})-(\\d{4})", "facetPath": ".regex.usphonenumber", "facetValue": "$0" } ]
Keep the following notes in mind:
- The patterns must be defined in an array, even if you plan to define only one pattern.
- Escape any backslash (`\`) characters with a backslash.
- For more information about the facet value options, see the Regular expression facet value options table.
-
Click Import, and then choose the JSON file where the patterns are defined.
-
Click Save.
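One way to avoid escaping mistakes in the import file is to generate it with a JSON library. The following Python sketch builds the pattern array from the syntax example above. The variable names are illustrative, and only the keys that appear in that example are used.

```python
import json

# The patterns must be in an array, even when there is only one.
patterns = [
    {
        "name": "US Phone number",
        "description": "US mobile phone number",
        # A raw string holds the pattern with single backslashes.
        "pattern": r"(\d{3})-(\d{3})-(\d{4})",
        "facetPath": ".regex.usphonenumber",
        "facetValue": "$0",
    }
]

# json.dumps escapes each backslash with a second backslash.
json_text = json.dumps(patterns, indent=2)
print(json_text)
```

The printed text contains `(\\d{3})-(\\d{3})-(\\d{4})` for the pattern. Save the text to a file, and then choose that file in the import step.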
Regular expression limits
Plan | Regex enrichments per service instance | Regex patterns per service instance |
---|---|---|
Cloud Pak for Data | Unlimited | Unlimited |
Premium | 100 | 50 |
Enterprise | 100 | 50 |
Totals include enrichments that you create in this Content Mining project and in other projects in the same service instance.
Applying the annotator
After the annotator is created, you must apply it to your collection.
-
From the Create a custom annotator for your analytics solutions page of the Content Mining application, click custom annotator, and then select collection from the list.
-
In the tile for your collection, click the options icon, and choose Edit collection.
-
Click the Enrichment tab, and then select the annotator that you created.
You might need to scroll to find it.
-
Click Save, and then confirm the action.
Give the index time to rebuild.
Filtering documents with your facet
-
Click the collection tile to open your collection in the data analysis page.
-
Do one of the following things:
-
Your custom facets are listed in the Facets view. Scroll and click Load more repeatedly until your facets are displayed.
-
Submit an empty search to return all documents. In the Facet analysis pane, select the facet that you created.
-
To access your custom facets more quickly, add them to a custom view. Select Custom as the view, and then click Edit. Select one or more facets to add to the view, and then click Save.