Creating a rule-based model (Experimental)
This tutorial helps you understand how to create a rule-based model that you can use to find text patterns that you define in documents.
Rule-based models are experimental and are not intended for use in production deployments. Support for the models might be discontinued with short notice.
You will build a model that can find text in documents that matches the pattern month day, year. For example, the model would find the date reference May 1, 2010. Before you define the rule pattern itself, you will create artifacts that will help you build the pattern, including a dictionary class that recognizes month mentions and a regular expression class that recognizes year mentions in text.
Learning objectives
After you complete this tutorial, you will know how to perform the following tasks:
- Create classes
- Add documents for defining rules
- Associate dictionaries with classes
- Define regular expressions to capture sequences of characters
- Define rules
This tutorial should take approximately 30 minutes to finish. If you explore other concepts related to this tutorial, it could take longer to complete.
Before you begin
- You're using a supported browser. For more information, see Browser requirements.
- You successfully completed Getting started with Knowledge Studio, which covers creating a workspace, creating a type system, and adding a dictionary.
- You must have at least one user ID in either the Admin or Project Manager role. For information about user roles, see User roles in Knowledge Studio.
Results
After you create the rule-based model, you can use it in one of the following ways to find text patterns in documents:
- Pre-annotate your documents before you create a machine learning model.
- Deploy or export the model to other Watson services or products.
Lesson 1: Adding a dictionary of months
In this lesson, you will learn how to add a dictionary to a workspace in Knowledge Studio. The dictionary contains terms related to the months of the year.
About this task
In a later lesson, you will define a class based on this dictionary. When you create that class, all terms in this dictionary that are found in documents will be automatically annotated as a mention of the associated class type. For more information about dictionaries, see Adding dictionaries to a workspace.
Procedure
- Download the dictionary-items-month.csv file to your computer. This file contains dictionary terms in CSV format that are suitable for uploading into a Knowledge Studio dictionary.
- Click Assets > Dictionaries.
- Click the Create Dictionary button to add a dictionary.
- In the Name field, type Month dictionary and click Save to create the dictionary. The new dictionary is created and automatically opened for editing.
- In the dictionary pane, click Upload.
- Select the dictionary-items-month.csv file from your computer and click Upload. The terms from the file are imported into the dictionary.
Lesson 2: Adding sample documents
In this lesson, you will learn how to add documents with linguistic patterns that illustrate the types of rules you want to define.
About this task
For more information about adding documents, see Adding documents for defining rules.
Procedure
- Download the documents-new.csv file to your computer. This file contains example documents suitable for uploading.
- Click Rule-based Model > Rules.
- Click the Add a document icon, which is next to the Documents page heading.
- Click the Upload CSV file tab.
- Click to browse for the documents-new.csv file that you downloaded to your computer earlier, and then click Upload. A set of documents is displayed in the main Documents page.
Lesson 3: Creating classes
In this lesson, you will learn how to define classes that you will use when you define a rule.
About this task
For more information about classes, see Rules.
Procedure
- From the Rules page of your workspace, click the Add a class icon next to the Class heading in the right side panel.
- Enter DictMonth as the class name, and then click Add. The new class is displayed in the Class side panel.
Lesson 4: Associating a dictionary with a class
In this lesson, you will learn how to use a dictionary in the rule editor.
Procedure
- Click Rule-based Model > Rules, and then click the Dictionaries tab.
- Select the Month dictionary that you created previously.
- From the Class list, select DictMonth and then click Save. The class is associated with the dictionary.
Results
For documents that are associated with the rule editor, any references to terms in the dictionary are annotated as DictMonth class mentions. You will see proof that these references have been annotated in the next lesson.
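The effect of associating a dictionary with a class can be sketched in Python. This is only an illustration of the idea, not how Knowledge Studio implements dictionary matching: the `MONTH_TERMS` list, the `Mention` type, and the `annotate_with_dictionary` function are hypothetical names, and the matching here is a simple case-sensitive whole-word search.

```python
import re
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int
    end: int
    text: str
    class_name: str


# Hypothetical stand-in for the terms imported into the Month dictionary.
MONTH_TERMS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def annotate_with_dictionary(text: str, terms: List[str], class_name: str) -> List[Mention]:
    """Return one Mention for each whole-word occurrence of a dictionary term."""
    # Try longer terms first so a long term wins over a shorter term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return [Mention(m.start(), m.end(), m.group(), class_name)
            for m in pattern.finditer(text)]


sample = "The report was published in February and revised in May."
mentions = annotate_with_dictionary(sample, MONTH_TERMS, "DictMonth")
for mention in mentions:
    print(mention.class_name, mention.text, mention.start, mention.end)
```

Every term found in the text is labeled with the same class name, which is what lets a later rule refer to the class instead of to one specific month.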
Lesson 5: Finding class annotations in documents
In this lesson, you will learn how to find class annotations in rule editor documents.
Procedure
- Select Rule-based Model > Rules.
- From the Class panel, find the DictMonth class that you defined earlier, and click the Search annotations in documents icon that's next to it. The Find Annotations page is displayed and shows all the documents that contain text references to months.
- Click the Technology - computerworld.com document to view the full document. Notice that the text February is highlighted, which means it was annotated as a mention of the DictMonth class.
Lesson 6: Defining a regular expression
In this lesson, you will learn how to define a regular expression.
About this task
You will define a regular expression that can find year patterns like 2009.
For more information about defining regular expressions, see Defining a rule.
Procedure
- From the Rules page, click the Add a class icon ![The "Add a class" icon](images/wks_tut_dict_add.jpg "The "Add a class" icon") next to Class in the right side panel.
- Enter RegExpYear as the class name, and click Add.
- Click the Regex tab, and then click the Create a regular expression icon next to the Regular Expressions heading.
- Click Add Entry.
- In the Regular Expression field, enter the following expression, which finds years between 1900 and 2099: (?:(?:19|20)[0-9]{2})
- Set Minimum Word Tokens to 1 and Maximum Word Tokens to 1.
- Click Add to save the regular expression entry.
- Enter MyYearExp as the regular expression name, and then, from the Class menu, select the RegExpYear class that you defined earlier.
- Click Save.
  After you save the regular expression, it is automatically applied to the sample documents. Any text strings that follow the pattern that you defined in the regular expression are annotated as mentions of the RegExpYear class.
- To check whether the expression that you defined captures year mentions correctly, search for mentions. Click the Search annotations in documents icon next to the RegExpYear class in the Class panel.
  ![Shows the cursor hovering over the magnifying glass icon next to the "RegExpYear" class in the Class panel of the Rules page.](images/rule-regex-add5.png "Shows the cursor hovering over the magnifying glass icon next to the "RegExpYear" class in the Class panel of the Rules page.")
  The Find Annotations page is displayed. Occurrences of year mentions are highlighted in the sample documents in which they occur.
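To see what the year expression accepts before you rely on it, you can try it outside the tool. The following is a minimal sketch that uses Python's `re` module. It assumes that the expression behaves the same way in Python as in the rule editor, and it approximates the one-token minimum and maximum by testing each token as a whole. The tokenizer and the function names are hypothetical.

```python
import re
from typing import List

# The expression entered in the Regular Expression field in this lesson.
YEAR_PATTERN = re.compile(r"(?:(?:19|20)[0-9]{2})")


def is_year_token(token: str) -> bool:
    """True when the whole token matches the year expression."""
    return YEAR_PATTERN.fullmatch(token) is not None


def find_year_tokens(text: str) -> List[str]:
    """Return the tokens in the text that match the year expression."""
    # Assumption for illustration: split into runs of word characters
    # and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [token for token in tokens if is_year_token(token)]


sample = "Released on February 3, 2009, after the 1999 beta and 12000 downloads."
print(find_year_tokens(sample))
```

Because the whole token must match, a longer number such as 12000 is not treated as a year even though it contains digits that would match on their own.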
Lesson 7: Defining a rule
In this lesson, you will learn how to define a rule.
About this task
You already defined a dictionary-based class for annotating month mentions. You also defined a regular expression that finds numeric values that represent a year. Now, you will define a rule that captures the sequence of a month followed by a number, a comma, and then a year, as in date expressions like September 21, 2016.
For more information about defining rules, see Defining a rule.
Procedure
- Select Rule-based Model > Rules, and open the Technology - computerworld.com document.
- Select the text February 3, 2009 in the document. Make sure you select the comma, too.
  ![Shows the text "February 3, 2009" selected in the document.](images/rule-add1.png "Shows the text "February 3, 2009" selected in the document.")
- Click the Add a rule icon.
  The rule editor shows a depiction of the rule pattern that you identified. The text February 3, 2009 is visible. A solid line that connects the cells in the depiction identifies which cells are currently part of the pattern.
  - The DictMonth class is part of the rule pattern instead of the text February. This selection is preferred because you want the model to find any month that is annotated by the DictMonth class as the first token in the date pattern instead of the text February only.
  - At the end of the rule, the year 2009 is already annotated as being a mention of the RegExpYear class. The RegExpYear class is part of the rule pattern instead of the number 2009. This selection is also preferred because you want the model to find any year that is annotated by the RegExpYear class as the last token in the date pattern instead of the specific text 2009 only.
  The number 3 and the comma (,) after it are shown as the second and third tokens in the pattern. As the pattern is currently specified, the model will find only occurrences of dates that specify the 3rd day of a month. We want the model to find dates that specify any day of the month, so next we will change the feature settings for the day token.
- Above the day 3 cell, click the Text icon to open the feature settings for the token. Currently, the rule is set to match the exact text, 3. Instead, we want it to match any number.
- Change the feature setting to be numeric by selecting Character Type : Numeric, and then clearing the selection, Text : 3.
  ![Shows the user clicking the "Character Type : Number" option as the feature setting for the "3" token.](images/rule-add5.png "Shows the user clicking the "Character Type : Number" option as the feature setting for the "3" token.")
  You changed the definition for the number 3 cell.
  ![Shows the cell that represents the "3" token now has a "Character Type" icon above it to indicate that any numeric value can match that token in the pattern.](images/rule-add6.png "Shows the cell that represents the "3" token now has a "Character Type" icon above it to indicate that any numeric value can match that token in the pattern.")
  The Character Type icon indicates that instead of requiring the number to be equal to 3 exactly, it can be any number.
- Do not change any settings for the comma token.
  We want the third token in the pattern to be a comma, so the current feature setting of Text : , is appropriate. In addition to a feature setting, each token has a repeat setting. The repeat setting specifies how many times the token can be repeated in the text for it to match the pattern. The current repeat setting of Required (Exactly 1) is appropriate.
  ![Shows the repeat setting for the comma token which is set to "Exactly 1".](images/rule-add7.png "Shows the repeat setting for the comma token which is set to "Exactly 1".")
- Assign a class to represent the pattern DictMonth + numeric token + comma + RegExpYear. Notice the four empty cells that represent the four tokens that you selected from the document. To select all the cells, select the first cell, and then press Shift + click each additional cell. Enter RuleDate as the class name, and then click it to create the new class.
  ![Shows that all four cells in the top row have been selected and the span is being defined as the "RuleDate" class.](images/rule-add8.png "Shows that all four cells in the top row have been selected and the span is being defined as the "RuleDate" class.")
- In the Rule name field, enter MyDateRule and click Save. After you save the rule, it is automatically applied to the sample documents. If the Technology - computerworld.com document is still open in the rule editor, you will see that the February 3, 2009 text in the document is now annotated as a mention of the RuleDate class.
  ![Shows text from the "Technology - computerworld.com" document with only the text "February 3, 2009" annotated as a mention of the "RuleDate" class.](images/rule-add10.png "Shows text from the "Technology - computerworld.com" document with only the text "February 3, 2009" annotated as a mention of the "RuleDate" class.")
  You can search for all occurrences of RuleDate class mentions in the sample documents by clicking the Search annotations in documents icon next to the RuleDate class in the Class panel. It is a good practice to check that all dates are captured properly to confirm that you defined the pattern correctly.
  ![Shows the "Find Annotations" page with two documents that contain dates that match the rule pattern you just defined.](images/rule-add11.png "Shows the "Find Annotations" page with two documents that contain dates that match the rule pattern you just defined.")
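The rule defined in this lesson is a sequence of four token conditions, each with a repeat setting of exactly 1. The following Python sketch illustrates that idea only; it is not the rule engine that Knowledge Studio uses. The month set, the tokenizer, and the function names are assumptions made for the example.

```python
import re
from typing import List

# Hypothetical stand-ins for the DictMonth and RegExpYear classes.
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}
YEAR_RE = re.compile(r"(?:19|20)[0-9]{2}")


def tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# One condition per token in the pattern; each must match exactly one token.
RULE_DATE_PATTERN = [
    lambda tok: tok in MONTHS,                        # DictMonth class
    lambda tok: tok.isdigit(),                        # Character Type : Numeric
    lambda tok: tok == ",",                           # Text : ,
    lambda tok: YEAR_RE.fullmatch(tok) is not None,   # RegExpYear class
]


def find_rule_dates(text: str) -> List[List[str]]:
    """Return every four-token window that satisfies the RuleDate pattern."""
    tokens = tokenize(text)
    width = len(RULE_DATE_PATTERN)
    matches = []
    for i in range(len(tokens) - width + 1):
        window = tokens[i:i + width]
        if all(cond(tok) for cond, tok in zip(RULE_DATE_PATTERN, window)):
            matches.append(window)
    return matches


sample = "The update shipped on February 3, 2009 and again on September 21, 2016."
for month, day, comma, year in find_rule_dates(sample):
    print(f"RuleDate: {month} {day}{comma} {year}")
```

Replacing the second condition with `lambda tok: tok == "3"` reproduces the rule as it looked before the feature setting was changed: only dates on the 3rd day of a month would match.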
Lesson 8: Creating a rule-based model
In this lesson, you will learn how to create a rule-based model.
About this task
For more information about creating a rule-based model, see Creating the rule-based model.
Procedure
- Select Rule-based Model > Versions and click the Rule-based model type mapping tab.
- Map the RuleDate class to the DATE entity from the type system.
  - Find the DATE entity, and click Edit.
    ![Shows the user clicking Edit for the "DATE" entity type in the "Rule-based model type mapping" tab.](images/rule-anno2.png "Shows the user clicking Edit for the "DATE" entity type in the "Rule-based model type mapping" tab.")
  - Choose the RuleDate class from the list and click Save.
    ![Shows the user choosing the "RuleDate" class from the list.](images/rule-anno3.png "Shows the user choosing the "RuleDate" class from the list.")
- To pre-annotate document sets or annotation sets with the rule-based model:
  - On the Machine Learning Model > Pre-annotation page, click Run Pre-annotators.
  - Select Rule-based Model, then click Next.
  - Select the document set that you added to the corpus, documents-ml.csv, and click Run.
Attention: Run the rule-based model as a pre-annotator only on documents that were not already annotated by humans.
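Conceptually, the type mapping translates class mentions that the rules produce into entity types from the type system. The following Python sketch shows that translation under one assumption that this tutorial does not state: that mentions of classes without a mapping, such as DictMonth and RegExpYear, do not become entity annotations. All names in the code are hypothetical.

```python
from typing import Dict, List, NamedTuple


class ClassMention(NamedTuple):
    text: str
    class_name: str


class EntityMention(NamedTuple):
    text: str
    entity_type: str


# The mapping configured on the Rule-based model type mapping tab in this lesson.
TYPE_MAPPING: Dict[str, str] = {"RuleDate": "DATE"}


def to_entity_mentions(mentions: List[ClassMention],
                       mapping: Dict[str, str]) -> List[EntityMention]:
    """Convert class mentions to entity mentions, skipping unmapped classes."""
    return [EntityMention(m.text, mapping[m.class_name])
            for m in mentions
            if m.class_name in mapping]


class_mentions = [
    ClassMention("February", "DictMonth"),
    ClassMention("2009", "RegExpYear"),
    ClassMention("February 3, 2009", "RuleDate"),
]
entity_mentions = to_entity_mentions(class_mentions, TYPE_MAPPING)
print(entity_mentions)
```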
Tutorial summary
While learning about Knowledge Studio, you created a rule-based model.
Lessons learned
By completing this tutorial, you learned about the following concepts:
- Classes
- Regular expressions
- Rules