Rules
Create a rule-based model that can recognize patterns in your documents. Use rules to capture patterns that occur in documents and convey information about underlying entity types.
Class overview
When you construct a rule, you use classes to represent types of information. These classes are similar to entity types. So, why don't we just use entity types when defining rules? Because as you build rules, you can define intermediate classes that are used only to build other more complex classes. These intermediate classes are solely utilitarian. They are not useful on their own. Intermediate classes work with other intermediate classes to define a more useful and complete class. An intermediate class is necessary, but not something you expose as part of a type system. To enable the rule-based model to do useful things like pre-annotate documents with entity mentions, you must map the complex classes that you use during rule creation to their equivalent entity types from the type system.
For example, suppose that you want a model that can recognize people's names. To train a machine learning model for this task, you would annotate many different names, written in a variety of formats, in the documents of an annotation set with the PERSON entity type, and then train the model on those annotations. To create a rule-based model for the same task, you define a rule that describes the text patterns used to write people's names. So, you might create a FirstName class and a LastName class and use these intermediate classes to define a FullName class. You might define conditions that determine the placement of the FullName class in relation to common prefixes, such as Dr., and common suffixes, such as Jr. When the rule-based model is used, the FullName class is mapped to the PERSON entity type.
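The following minimal Python sketch illustrates the idea of building a complex class from intermediate classes. It is not how the rule editor is implemented. The word lists, the whitespace tokenization, the `find_full_names` helper, and the choice to include the prefix and suffix in the mention are all assumptions made for illustration.

```python
# Hypothetical dictionaries that back the intermediate classes.
FIRST_NAMES = {"John", "Jane"}   # FirstName
LAST_NAMES = {"Doe", "Smith"}    # LastName
PREFIXES = {"Dr."}
SUFFIXES = {"Jr."}

# Only the complex class is mapped to an entity type.
# FirstName and LastName are deliberately left unmapped.
CLASS_TO_ENTITY = {"FullName": "PERSON"}


def find_full_names(text):
    """Return (mention_text, entity_type) pairs for each FullName match."""
    # Naive tokenization (an assumption): drop commas, split on whitespace.
    tokens = text.replace(",", " ").split()
    mentions = []
    i = 0
    while i < len(tokens) - 1:
        # FullName = FirstName followed by LastName.
        if tokens[i] in FIRST_NAMES and tokens[i + 1] in LAST_NAMES:
            start, end = i, i + 2
            # Extend the match over an adjacent prefix or suffix.
            if start > 0 and tokens[start - 1] in PREFIXES:
                start -= 1
            if end < len(tokens) and tokens[end] in SUFFIXES:
                end += 1
            mention = " ".join(tokens[start:end])
            mentions.append((mention, CLASS_TO_ENTITY["FullName"]))
            i = end
        else:
            i += 1
    return mentions


print(find_full_names("Dr. John Doe, Jr. met Jane Smith"))
```

Because only FullName appears in the mapping, a first name on its own never becomes a PERSON mention in this sketch.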
Another reason to avoid mapping intermediate classes to entity types in your type system is overlap. If you pre-annotate documents with the rule-based model and then add them to the ground truth for training a machine learning model, the rules must not produce overlapping entity mentions. For example, if you map both the intermediate class FirstName and the complex class FullName to the PERSON entity type, then an occurrence of John Doe, Jr. results in overlapping mentions: John is annotated as a PERSON on its own and again as part of the full name.
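To make the overlap concrete, the sketch below represents mentions as character spans and checks whether any two of them overlap. The `(start, end, entity_type)` tuple layout and the `overlapping_pairs` helper are hypothetical and are used only to illustrate the problem.

```python
def overlapping_pairs(mentions):
    """Return pairs of (start, end, entity_type) mentions whose spans overlap.

    Spans are half-open character offsets: start is inclusive, end is exclusive.
    """
    ordered = sorted(mentions)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # b starts at or after a (sorted), so they overlap if b starts before a ends.
            if b[0] < a[1]:
                pairs.append((a, b))
    return pairs


text = "John Doe, Jr."
first_name = (0, 4, "PERSON")          # "John", from the FirstName class
full_name = (0, len(text), "PERSON")   # "John Doe, Jr.", from the FullName class

# Mapping both classes to PERSON yields two mentions that overlap.
print(overlapping_pairs([first_name, full_name]))

# Mapping only FullName yields a single mention and no overlap.
print(overlapping_pairs([full_name]))
```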
Rule editor tools
The rule editor provides some tools that help you to define rules.
- Dictionary
  Add a dictionary and assign it a class name. Any words that match entries in the dictionary are automatically annotated with the assigned class.
- Regular expression
  A regular expression (regex) is a sequence of characters that defines a search pattern. A basic example is [A-Z][a-z]*, which finds capitalized words. [A-Z] matches any capital alphabetical letter (A through Z) and [a-z]* matches any lowercase alphabetical letter (a through z) zero or more times. The asterisk (*) is the character that defines the repeating setting (zero or more times). Consider using a free web-based regex utility to help you determine the right expression to use to capture the pattern you want to find.
  For example, your documents might have several references that are similar to the following phrases:
  35-year-old driver
  16-year-old learner
  The syntax n-year-old x is a pattern that typically represents a person. You can define a regular expression rule to find phrases that match the n-year-old x pattern, and annotate them as PERSON entity mentions.
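Both regular expressions described above can be tried outside the rule editor. The sketch below uses Python's `re` module for illustration; these simple patterns are also valid in java.util.regex.Pattern syntax. The pattern `\d+-year-old [a-z]+` is one assumed reading of n-year-old x (n as digits, x as a single lowercase word), and the sketch matches against raw text, so it does not reproduce the rule editor's tokenization.

```python
import re

# Capitalized word: one capital letter followed by zero or more lowercase letters.
CAPITALIZED = re.compile(r"[A-Z][a-z]*")

# Assumed reading of "n-year-old x": digits, the literal "-year-old", then one word.
AGE_PHRASE = re.compile(r"\d+-year-old [a-z]+")


def capitalized_words(text):
    """Return the whitespace-separated words that fully match CAPITALIZED."""
    return [word for word in text.split() if CAPITALIZED.fullmatch(word)]


def age_phrases(text):
    """Return every phrase in text that matches AGE_PHRASE."""
    return [match.group() for match in AGE_PHRASE.finditer(text)]


print(capitalized_words("The driver met Alice"))
print(age_phrases("A 35-year-old driver passed a 16-year-old learner."))
```

Note that [A-Z][a-z]* also matches a single capital letter, such as the article A, because the lowercase part can repeat zero times.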
Regular expression best practices
Although the regex tool that is included in the rule editor recognizes expressions that follow the java.util.regex.Pattern syntax, not all of that syntax is supported. When using the regex tool, consider the following best practices:
- Rules match only on token boundaries. You can't write rules that match a subsequence of a token. For more information, see Tokenizers.
- Keep your regular expressions simple and clean.
- Consider using dictionaries instead of regular expressions. Although you might think of a clever rule that can capture several expressions, in general, dictionary matching is faster than rules matching. Also, dictionaries are easier to maintain.
- Instead of relying completely on regular expressions to match tokens, it's best to use a combination of dictionaries, regular expressions, and rules. For example, consider the scenario of matching a phone number in a sentence such as My mobile is 123-456. You might be able to write a rule that uses a regular expression to match that sentence. But in this case, the recommended method is to add a dictionary to find words such as mobile and phone, write a simple regular expression that captures possible phone number sequences, and then create a rule to scan for a sequence of patterns, such as dictionary term + text + regex, as shown in the example sentence, My mobile is 123-456.
- Avoid unnecessary use of lookahead and lookbehind constructs such as (?=ABC). In many cases, you can achieve the same result by using a combination of regular expressions and rules.
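The recommended combination of dictionary term + text + regex can be sketched as follows. This is an illustration under stated assumptions, not the rule editor's behavior: tokens come from a simple whitespace split (the product's tokenizer may split text differently), `\d{3}-\d{3}` is an assumed phone pattern that fits the example, and `find_phone_numbers` and its `max_gap` parameter are hypothetical names.

```python
import re

# Dictionary: words that signal a phone number (taken from the example).
PHONE_WORDS = {"mobile", "phone"}

# Simple regex for the number itself (assumed format, fits 123-456).
PHONE_NUMBER = re.compile(r"\d{3}-\d{3}")


def find_phone_numbers(sentence, max_gap=2):
    """Find numbers that follow a dictionary term, allowing up to max_gap filler tokens."""
    tokens = sentence.split()
    found = []
    for i, token in enumerate(tokens):
        if token.lower() not in PHONE_WORDS:
            continue
        # Rule: dictionary term + up to max_gap tokens of text + regex match.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            candidate = tokens[j].rstrip(".,;")
            # fullmatch keeps the regex on whole tokens, not parts of a token.
            if PHONE_NUMBER.fullmatch(candidate):
                found.append(candidate)
                break
    return found


print(find_phone_numbers("My mobile is 123-456."))
print(find_phone_numbers("Order 123-456 shipped."))
```

The context check (a phone word shortly before the number) lives in the rule rather than in the regex, so no lookbehind is needed and the regex stays small.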