Bootstrapping annotation

Simplify the job of the human annotator by pre-annotating the documents in a workspace. A pre-annotator is a Knowledge Studio dictionary, rule-based model, or machine learning model that you can run to find and annotate mentions automatically.

Pre-annotation makes the job of human annotators easier because it covers the straightforward annotations, and gets the job of annotating the documents underway.

The method that you use to pre-annotate documents in no way restricts the ways that you can use the resulting model. For example, just because you use the Natural Language Understanding service to pre-annotate documents does not mean you must deploy the final machine learning model that you build to the Natural Language Understanding service.

Pre-annotation methods

The following pre-annotators are available:

  • Dictionary

    Uses a dictionary of terms that you provide and associate with an entity type to find mentions of that entity type in the documents. This choice is best for fields with unique or specialized terminology. Unlike a machine learning pre-annotator, a dictionary pre-annotator does not analyze the context in which a term is used; it relies on the term being distinct enough to have a clear meaning regardless of context (see the sketch after this list). For example, it is easier to recognize asbestos as a mineral entity type than to determine the entity type of squash, which can refer to a vegetable, a sport, or a verb meaning to crush something.

    Dictionary pre-annotators do not recognize entity subtypes. Human annotators can specify entity subtypes for each pre-annotated mention by working on an annotation task with the pre-annotated document.

  • Machine learning

    Uses a machine learning model to automatically annotate documents. This option is available only if you have already created a machine learning model with Knowledge Studio. If you add a document set, you can run the machine learning annotator that you created previously to pre-annotate the new documents. If the new set of documents is similar to the documents that were used to train the machine learning annotator originally, then this is probably your best choice for pre-annotation.

  • Rule

    Uses a rule-based model to automatically annotate documents. This option is available only if you have already created a rule-based model with Knowledge Studio. If your documents contain common patterns of tokens from which you can derive meaning, this model might be a good choice. If you enable it to do so, the rule-based pre-annotator can also take on some of the function of the dictionary pre-annotator by identifying class types for dictionary terms that it finds in the documents.
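
As a conceptual illustration of how a dictionary pre-annotator behaves (see the dictionary entry in the list above), the following Python sketch matches terms literally, with no analysis of the surrounding context. It is not Knowledge Studio code; the dictionary entries and entity type names are invented.

    # Conceptual sketch only: a dictionary pre-annotator matches terms literally,
    # without analyzing context. Entries and entity type names are invented.
    import re

    dictionary = {"asbestos": "MINERAL", "squash": "VEGETABLE"}

    def dictionary_preannotate(text, dictionary):
        """Return (start, end, entity_type) for every dictionary term found in the text."""
        mentions = []
        for term, entity_type in dictionary.items():
            for match in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                mentions.append((match.start(), match.end(), entity_type))
        return sorted(mentions)

    # "squash" is labeled VEGETABLE even though this sentence is about the sport,
    # which is exactly the kind of ambiguity a human annotator reviews afterward.
    print(dictionary_preannotate("She plays squash near the asbestos mine.", dictionary))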

Alternatively, you can upload documents that are already annotated and use them to start training the machine learning model. Do not run a pre-annotator on annotated documents that you upload; the existing annotations are stripped from the documents and replaced with only the annotations that the pre-annotator produces.

You can run a pre-annotator on documents that were added to the ground truth as part of the current workspace. Annotations that were added to the documents, reviewed, accepted, and promoted to ground truth within the current workspace are not stripped out.

Running multiple pre-annotators

Knowledge Studio allows you to run multiple pre-annotators at once. First, you need to prepare the pre-annotation methods that you want to use. For more information, see the following sections.

Configuring the order of pre-annotators

When multiple pre-annotators are used, the first annotation made to a span of text is saved for the results, even if other pre-annotators attempt to annotate the same span of text later in the order. This doesn't apply to human annotations, which are preserved regardless of pre-annotation order.

For example, consider the example text IBM Watson. If a dictionary that is first in the order labels IBM as an Organization entity type, a machine learning model that is second in the order can't annotate IBM Watson as a Software Brand entity type because that would override the earlier annotation made to IBM.
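
The following Python sketch is a minimal illustration of this first-wins rule. It is not Knowledge Studio code; the two pre-annotators and their spans are invented to mirror the IBM Watson example.

    # Conceptual sketch of the "first pre-annotator wins" rule described above.
    def merge_preannotations(preannotators, text):
        """Apply pre-annotators in order; keep only the first annotation made to a span."""
        accepted = []  # list of (start, end, entity_type)
        for annotate in preannotators:  # earlier in the order runs first
            for start, end, entity_type in annotate(text):
                overlaps = any(start < a_end and end > a_start
                               for a_start, a_end, _ in accepted)
                if not overlaps:  # later pre-annotators cannot override earlier spans
                    accepted.append((start, end, entity_type))
        return sorted(accepted)

    # Hypothetical pre-annotators for the "IBM Watson" example:
    dictionary = lambda text: [(0, 3, "Organization")]      # labels "IBM"
    ml_model   = lambda text: [(0, 10, "Software Brand")]   # wants "IBM Watson"

    print(merge_preannotations([dictionary, ml_model], "IBM Watson"))
    # [(0, 3, 'Organization')] -- the machine learning annotation is dropped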

You can view the current order of pre-annotators in the Order column on the Machine Learning Model > Pre-annotation page. To change the order, complete the following steps.

  1. Click Order Settings.
  2. Click the Move up and Move down arrow buttons to move pre-annotation methods earlier or later in the order.
  3. Click Save.
  4. Double-check the Order column on the Pre-annotation page to make sure that it matches the order that you want.

Run pre-annotators

  1. After your pre-annotation methods are prepared and you have configured the order of your pre-annotators, click Run Pre-annotators.
  2. Select the pre-annotators that you want to use, and then click Next.
  3. If you want to erase existing annotations made by pre-annotators before running the pre-annotator, select Wipe previous pre-annotation results. Human annotations are preserved even if this is selected.
  4. Select the document sets that you want to pre-annotate.
  5. Click Run.

Pre-annotating documents with a dictionary

To help human annotators get started with their annotation tasks, you can create a dictionary and use it to pre-annotate documents that you add to the corpus.

About this task

When a human annotator begins work on documents that were pre-annotated, it is likely that a number of mentions will already be marked with entity types based on the dictionary entries. The human annotator can change or remove the pre-annotated entity types and assign entity types to unannotated mentions. Pre-annotation by a dictionary does not annotate relations or coreferences; those must be annotated by human annotators.

This task shows how to create a dictionary that is editable. If you want to upload and pre-annotate your documents with a read-only dictionary, click the Menu icon next to the Create Dictionary button. Select Upload Dictionary.

Procedure

To create an editable dictionary and pre-annotate documents, follow these steps:

  1. Log in as a Knowledge Studio administrator and select your workspace.

  2. Select the Assets > Dictionaries page.

  3. Click Create Dictionary, enter a name, and then click Save.

  4. From the Entity type list, select an entity type to associate with the dictionary.

    You can also associate an entity type with the dictionary from the Machine Learning Model > Pre-annotation page. Click the overflow menu button in the Dictionaries row in the page, then click Map entity types.

  5. Add entries to the dictionary, or upload a file that contains dictionary terms (see the example CSV after these steps).

  6. Go to the Machine Learning Model > Pre-annotation page.

  7. Click Run Pre-annotators.

  8. Select Dictionaries, and then click Next.

  9. If you want to erase existing annotations made by pre-annotators before running the pre-annotator, select Wipe previous pre-annotation results. Human annotations are preserved even if this is selected.

  10. Select the check box for each document set that you want to pre-annotate and click Run.

    Pre-annotation is applied to individual documents without regard for the various document sets or annotation sets that a document might belong to. A document that overlaps between a selected document set and an unselected document set will be pre-annotated in both document sets.
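
The following sketch shows what a file of dictionary terms for step 5 might look like, assuming the lemma,poscode,surface CSV layout that the Creating dictionaries topic describes (poscode 3 indicates a noun, and multiple surface forms are separated by a vertical bar). The terms are invented; confirm the exact format in that topic before you upload a file.

    lemma,poscode,surface
    asbestos,3,asbestos
    chrysotile,3,"chrysotile|white asbestos"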

Related information:

Creating dictionaries

Getting Started > Adding a dictionary

Pre-annotating documents with the machine learning model

You can use an existing machine learning model to pre-annotate documents that you add to your corpus.

About this task

After 10 to 30 documents are annotated, a machine learning model can be trained on the data. Don't use such a minimally trained model in production. However, you can use the model to pre-annotate documents to help speed up the human annotation of subsequent documents. For example, if you add documents to the corpus after you train a machine learning model, you can use the model to pre-annotate the new document sets. Never run a pre-annotator on documents that have already been annotated by a person; pre-annotation removes human annotations.

Procedure

To use an existing machine learning model to pre-annotate documents:

  1. Log in as a Knowledge Studio administrator and select your workspace.

  2. Go to the Machine Learning Model > Pre-annotation page.

  3. Click Run Pre-annotators.

  4. Select Machine Learning Model, and then click Next.

  5. If you want to erase existing annotations made by pre-annotators before running the pre-annotator, select Wipe previous pre-annotation results. Human annotations are preserved even if this is selected.

  6. Select the check box for each document set that you want to pre-annotate and click Run.

    Pre-annotation is applied to individual documents without regard for the various document sets or annotation sets that a document might belong to. A document that overlaps between a selected document set and an unselected document set will be pre-annotated in both document sets.

Pre-annotating documents with the rule-based model

You can use an existing rule-based model to pre-annotate documents that you add to your corpus.

Procedure

To use the rule-based model to pre-annotate documents, complete the following steps:

  1. Log in as a Knowledge Studio administrator and select your workspace.

  2. Go to the Machine Learning Model > Pre-annotation page.

  3. Click the overflow menu button in the Rule-based Model row in the page, then click Map entity types and classes to map entity types that you defined in the Knowledge Studio type system to one or more rule-based model classes.

    You can also open the mapping page by selecting the Rule-based Model > Versions > Rule-based Model tab.

  4. Click Edit for each entity type you want to map.

    • The drop-down list of the Class Name column is pre-populated with classes that are associated with the rule-based model.
    • You must map at least one entity type to a class.
  5. On the Machine Learning Model > Pre-annotation page, click Run Pre-annotators.

    The Rule-based Model option is not available until you map at least one entity type to a class.

  6. If you want to erase existing annotations made by pre-annotators before running the pre-annotator, select Wipe previous pre-annotation results. Human annotations are preserved even if this is selected.

  7. Select the document sets or annotation sets that you want to pre-annotate.

  8. Click Run.

    Pre-annotation is applied to individual documents without regard for the various document sets that a document might belong to. A document that overlaps between a selected document set and an unselected document set will be pre-annotated in both document sets.

Uploading pre-annotated documents

You can jump-start the training of your model by uploading documents that were pre-annotated through an Unstructured Information Management Architecture (UIMA) analysis engine.

The pre-annotated documents must be in the XMI serialization form of UIMA Common Analysis Structure (UIMA CAS XMI). The .zip file that you upload must include the UIMA TypeSystem descriptor file and a file that maps the UIMA types to entity types in your Knowledge Studio type system.

UIMA CAS XMI is a standard format of Apache UIMA. Guidelines are provided for how to create files in the correct format from analyzed collections in IBM Watson Explorer. If you use another Apache UIMA implementation, adapt these guidelines for your purposes. Regardless of how you create the XMI files, the requirements for creating the type system mapping file and .zip file are the same for everyone.

If you assign the imported documents to human annotators, the documents appear pre-annotated in the ground truth editor and a number of mentions might already be annotated. The human annotator thus has more time to focus on applying the annotation guidelines to unmarked mentions. Alternatively, you can bypass the human annotation step and use the pre-annotated documents to immediately start training and evaluating a machine learning model.

Exporting analyzed documents from Watson Explorer Content Analytics

You can export documents that were crawled and analyzed in IBM Watson Explorer Content Analytics, and upload the analyzed documents as XMI files into a Knowledge Studio workspace.

Procedure

To get analyzed documents from a Watson Explorer Content Analytics collection, follow these steps:

  1. Open the Content Analytics administration console in a web browser.

  2. On the Collections view, expand the collection that you want to export documents from. In the Parse and Index pane, ensure that the parse and index process is running, and then click the arrow icon for Export analyzed document content and metadata.

  3. In the Analyzed document export options area, select Export documents as XML files, select the Enable CAS as XMI format export check box, specify the output path for where the exported data is to be written, and click OK.

  4. Stop and restart the parse and index services for the collection, and then do one of the following steps:

    • If the collection already contains indexed documents in the document cache that you want to use for training the machine learning model, restart a full index build.
    • If the collection does not contain indexed documents that you want to use for training the machine learning model, upload documents, configure at least one crawler to crawl the documents, and start the crawler.
  5. In the Export area, check the status of the export request. The progress indicates how many documents are exported.

  6. Go to the output folder that you specified when you configured export options. When documents are exported as XML files, the output folder name is based on the time stamp when the export occurs. The output folder contains XMI files (*.xmi) and the UIMA TypeSystem descriptor file (exported_typesystem.xml).
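
If you want to check which UIMA type names the export produced before you write the cas2di.tsv mapping file that is described later in this topic, the following Python sketch, which uses only the standard library, lists the type names that are declared in the exported TypeSystem descriptor. Adjust the file path as needed.

    # Sketch: list the UIMA type names declared in the exported TypeSystem
    # descriptor. These fully qualified names go in the left-hand column of
    # the cas2di.tsv mapping file.
    import xml.etree.ElementTree as ET

    NS = {"uima": "http://uima.apache.org/resourceSpecifier"}

    def declared_uima_types(descriptor_path):
        """Return the type names declared in a UIMA TypeSystem descriptor file."""
        root = ET.parse(descriptor_path).getroot()
        return [name.text for name in root.findall(".//uima:typeDescription/uima:name", NS)]

    for type_name in declared_uima_types("exported_typesystem.xml"):
        print(type_name)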

What to do next

You must define a mapping between the UIMA types and Knowledge Studio entity types. You must also create a .zip file that contains all the files that are required to upload the analyzed data into a Knowledge Studio workspace.

Related information:

Exporting crawled or analyzed documents

Output paths for exported documents

Exporting an analyzed collection from Content Analytics Studio

You can export a collection of analyzed documents from Watson Explorer Content Analytics Studio, and upload the analyzed documents as XMI files into a Knowledge Studio workspace.

Procedure

To get analyzed documents from a Content Analytics Studio collection, follow these steps:

  1. Launch Content Analytics Studio and open the Studio project.
  2. Right-click a folder that contains documents that you want to use for training a machine learning model and select Analyze Collection.
  3. Select a UIMA pipeline configuration file.
  4. Go to the Collection Analysis view and click the Save icon. Specify the folder where the saved results are to be written and specify the file name.
  5. Open the folder that you specified. The file extension of the saved file is .annotations.
  6. Copy the .annotations file to your local file system and rename the file extension from .annotations to .zip.
  7. Extract all files from the .zip file. The extracted contents include XMI files (*.xmi), the UIMA TypeSystem descriptor file (TypeSystem.xml), and other files.

What to do next

You must define a mapping between the UIMA types and Knowledge Studio entity types. You must also create a .zip file that contains all of the files that are required to upload the analyzed data into a Knowledge Studio workspace.

Mapping UIMA types to entity types

Before you upload XMI files into a Knowledge Studio workspace, you must define mappings between the UIMA types and Knowledge Studio entity types.

Before you begin

The type system in your Knowledge Studio workspace must include the entity types that you want to map the UIMA types to.

Procedure

To map UIMA types to Knowledge Studio entity types, follow these steps:

  1. Create a file named cas2di.tsv in the folder that contains the UIMA TypeSystem descriptor file, such as exported_typesystem.xml or TypeSystem.xml.

  2. Open the cas2di.tsv file with a text editor. Each line in the file specifies a single mapping. The format of the mapping depends on which annotator's annotations you want to map:

    • You can create mappings by using the basic format:

      UIMA_Type_Name[TAB]WKS_Entity_Type

      The following example defines mappings between UIMA types produced by the Named Entity Recognition annotator in Watson Explorer Content Analytics and entity types defined in a Knowledge Studio type system:

      com.ibm.langware.Organization  ORGANIZATION
      com.ibm.langware.Person  PERSON
      com.ibm.langware.Location  LOCATION
      

      Another example defines a mapping between UIMA types produced by a custom annotator that was created in Watson Explorer Content Analytics Studio and Knowledge Studio entity types:

      com.ibm.Person  PERSON
      com.ibm.Date  DATE
      
    • You can create mappings based on facets that are used in the Pattern Matcher annotator or Dictionary Lookup annotator in Watson Explorer Content Analytics. In text analysis rule files (*.pat), the facet is represented as the category attribute. To define a mapping, use the following syntax:

      com.ibm.takmi.nlp.annotation_type.ContiguousContext:category={FACET_PATH}[TAB]{WKS_ENTITY_TYPE}
      

      The following example, which applies to the Pattern Matcher and Dictionary Lookup annotators, defines a mapping between the category $.mykeyword.product and the Knowledge Studio entity type PRODUCT:

      com.ibm.takmi.nlp.annotation_type.ContiguousContext:category=$.mykeyword.product PRODUCT
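
Because the separator between the two columns must be a TAB character, which is easy to lose when the examples above are copied, one option is to generate the file programmatically. The following Python sketch writes the same mappings that are shown in the first example; it is an optional convenience, not a required step.

    # Sketch: write cas2di.tsv with real TAB separators, one mapping per line.
    mappings = [
        ("com.ibm.langware.Organization", "ORGANIZATION"),
        ("com.ibm.langware.Person", "PERSON"),
        ("com.ibm.langware.Location", "LOCATION"),
    ]

    with open("cas2di.tsv", "w", encoding="utf-8") as f:
        for uima_type, wks_entity_type in mappings:
            f.write(f"{uima_type}\t{wks_entity_type}\n")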
      

What to do next

You must create a .zip file that contains all of the files that are required to upload the analyzed data into a Knowledge Studio workspace.

Related information:

Pattern Matcher annotator

Dictionary Lookup annotator

Named Entity Recognition annotator

Uploading UIMA CAS XMI files into a workspace

To use the pre-annotated documents that you downloaded to train a model, you must create a .zip file that contains all the files required to upload the XMI files, and then upload the .zip file into a Knowledge Studio workspace.

Before you begin

Before you upload the .zip file, ensure that the type system in your Knowledge Studio workspace includes the entity types that you mapped the UIMA types to.

UIMA analysis engines allow annotations to span sentences. In Knowledge Studio, annotations must exist within the boundaries of a single sentence. If the XMI files that you upload include annotations that span sentences, those annotations do not appear in the ground truth editor.

Procedure

To upload pre-annotated documents into a Knowledge Studio workspace, follow these steps:

  1. Create a .zip file that contains all of the files that are required by Knowledge Studio.

    1. Select the folder that contains the XMI files, UIMA type system descriptor file, and cas2di.tsv file, or select all of the files in the folder.

    2. Create a .zip file that includes all of the files. Make sure that the cas2di.tsv file and the UIMA type system descriptor file are stored in the root directory of the .zip file. If these files are stored in a subfolder within the .zip file, Knowledge Studio cannot read them and nothing is imported. (A sketch of building such an archive follows these steps.)

      In Windows, you can right-click and select Send to > Compressed (zipped) folder.

  2. Upload the .zip file into a Knowledge Studio workspace.

    1. Log in as a Knowledge Studio administrator or project manager, open the workspace that you want to add the documents to, and open the Assets > Documents page.
    2. Click Upload Document Sets.
    3. Drag the .zip file that you created or click to locate and select the file.
    4. Select the check box to indicate that the .zip file contains UIMA CAS XMI files.
    5. Click Upload.
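
The following Python sketch shows one way to build the archive for step 1 so that every file is stored at the root of the .zip file. The folder and archive names are examples only.

    # Sketch: build the upload .zip with the XMI files, the UIMA TypeSystem
    # descriptor, and cas2di.tsv all at the root of the archive.
    import zipfile
    from pathlib import Path

    export_folder = Path("exported_documents")   # example folder with the files to upload

    with zipfile.ZipFile("wks_upload.zip", "w", zipfile.ZIP_DEFLATED) as archive:
        for path in export_folder.iterdir():
            if path.is_file():
                # arcname=path.name keeps each file at the root of the .zip,
                # not inside an "exported_documents/" subfolder.
                archive.write(path, arcname=path.name)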