Creating a machine learning model

This tutorial helps you understand the process for building a machine learning model that you can deploy and use with other Watson services.

Learning objectives

After you complete the lessons in this tutorial, you will know how to perform the following tasks:

  • Create document sets
  • Pre-annotate documents
  • Create tasks for human annotators
  • Analyze inter-annotator agreement and adjudicate conflicts in annotated documents
  • Create machine learning models

This tutorial takes approximately 60 minutes to finish. If you explore other concepts that are related to this tutorial, it might take longer to complete.

Before you begin

  • You're using a supported browser. See Browser requirements.

  • You successfully completed Getting started with Knowledge Studio, which covers creating a workspace, creating a type system, and adding a dictionary.

  • You must have at least one user ID in either the Admin or Project Manager role.

    If possible, use multiple user IDs for the machine learning model tasks in this tutorial (one Admin or Project Manager user ID, and at least two Human Annotator user IDs). Using multiple user IDs provides the most realistic simulation of an actual IBM Watson® Knowledge Studio workspace, where a project manager must coordinate and adjudicate annotation that is performed by multiple human annotators. However, if you have access to only a single user ID, you can still simulate most parts of the process.

    For more information about user roles, see User roles in Knowledge Studio.

Results

After completing this tutorial, you will have a custom machine learning model that you can use with other Watson services.

Lesson 1: Adding documents for annotation

In this lesson, you will learn how to add documents to a Knowledge Studio workspace so that human annotators can annotate them.

About this task

For more information about adding documents, see Adding documents to a workspace.

Procedure

  1. Download the documents-new.csv file to your computer. This file contains example documents suitable for uploading.
  2. Within your workspace, click Assets > Documents.
  3. On the Documents page, click Upload Document Sets.
  4. Upload the documents-new.csv file from your computer. The uploaded file is displayed in the table.

Lesson 2: Pre-annotating with a dictionary-based annotator

In this lesson, you will learn how to use a dictionary-based annotator to pre-annotate documents in Knowledge Studio.

About this task

Pre-annotating documents is an optional step. However, it is a worthwhile step because it makes the job of human annotators easier later.

For more information about pre-annotation with dictionaries, see Pre-annotating documents with a dictionary.
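
Conceptually, a dictionary-based pre-annotator scans each document for terms that appear in the dictionary and labels every match with the entity type that the dictionary is mapped to. The following Python sketch is not part of Knowledge Studio; it only illustrates that idea with a hypothetical in-memory dictionary that maps surface forms such as "IBM" to the ORGANIZATION entity type.

```python
import re

# Hypothetical dictionary: surface forms mapped to an entity type.
# In Knowledge Studio, this mapping comes from the dictionary that you
# created and the entity type that you select for it (ORGANIZATION here).
dictionary = {
    "IBM": "ORGANIZATION",
    "International Business Machines": "ORGANIZATION",
}

def pre_annotate(text, dictionary):
    """Return (start, end, surface, entity_type) tuples for each match."""
    annotations = []
    for surface, entity_type in dictionary.items():
        # Match whole words only, so "IBM" does not match inside "IBMer".
        for match in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            annotations.append((match.start(), match.end(), surface, entity_type))
    return sorted(annotations)

sample = "Thomas Watson led IBM, formally International Business Machines."
for start, end, surface, entity_type in pre_annotate(sample, dictionary):
    print(f"{surface!r} [{start}:{end}] -> {entity_type}")
```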

Procedure

  1. Within your workspace, click Assets > Dictionaries.

    The Test dictionary dictionary opens. The Adding a dictionary lesson of the Getting started with Knowledge Studio tutorial shows you how to create this dictionary.

  2. From the Entity type list, select the ORGANIZATION entity type to map it to the Test dictionary dictionary.

    The Creating a type system lesson of the Getting started with Knowledge Studio tutorial shows how to create the type system that contains the ORGANIZATION entity type.

  3. On the Machine Learning Model > Pre-annotation page, click Run Pre-annotators.

  4. Select Dictionary, then click Next.

  5. Select the document set that you created in Lesson 1.

  6. Click Run.

Results

The documents in the selected sets are pre-annotated by using the dictionary that you created. If you like, you can use the dictionary to pre-annotate document sets or annotation sets that you add later.

Lesson 3: Creating an annotation task

In this lesson, you will learn how to create annotation sets and use annotation tasks to track the work of human annotators in Knowledge Studio.

About this task

An annotation set is a subset of documents from an uploaded document set that you assign to a human annotator. The human annotator annotates the documents in the annotation set. To use inter-annotator agreement scores later to compare the annotations that each human annotator adds, you must assign at least two human annotators to different annotation sets. You must also specify that some percentage of the documents overlaps between the sets.

In a realistic scenario, you create as many annotation sets as needed, based on the number of human annotators who are working in the workspace. In this tutorial, you will create two annotation sets. If you do not have access to multiple user IDs, you can assign both annotation sets to the same user.

For more information about annotation sets and annotation tasks, see Creating an annotation task.
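
The overlap setting is easier to picture with a small example. The following sketch is not part of Knowledge Studio; it shows one plausible way to divide a base set of documents into per-annotator annotation sets, where the specified percentage of documents is shared by every set and any remaining documents are divided among the annotators. All names and numbers are illustrative.

```python
import random

def create_annotation_sets(documents, annotators, overlap_percent):
    """Split a base document set into one annotation set per annotator.

    overlap_percent of the documents are shared by every set so that
    inter-annotator agreement can be measured later; the remaining
    documents are divided round-robin among the annotators.
    """
    docs = list(documents)
    random.shuffle(docs)
    overlap_count = round(len(docs) * overlap_percent / 100)
    shared, remainder = docs[:overlap_count], docs[overlap_count:]

    sets = {annotator: list(shared) for annotator in annotators}
    for i, doc in enumerate(remainder):
        sets[annotators[i % len(annotators)]].append(doc)
    return sets

base_set = [f"doc-{n}" for n in range(1, 11)]
sets = create_annotation_sets(base_set, ["Set 1", "Set 2"], overlap_percent=100)
for name, docs in sets.items():
    print(name, len(docs), "documents")  # with 100% overlap, both sets contain all 10
```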

Procedure

  1. Within your workspace, click Machine Learning Model > Annotations.

  2. Click the Annotation Tasks tab, then click Add Task.

  3. Specify the details for the task:

    • In the Task name field, enter Test.
    • In the Deadline field, select a date in the future.
  4. Click Create Annotation Sets.

    The Create Annotation Sets window opens. By default, this window shows the base set, which contains all documents, and fields where you can specify the information for a new annotation set.

  5. Click Add another set and human annotator to add fields for an additional annotation set. You can click this option as many times as you need to create more annotation sets. For this tutorial, you need only two annotation sets.

    A screen capture of the Create Annotation Sets page.

  6. In the Overlap field, specify 100. This value specifies that you want 100 percent of the documents in the base set to be included in all the new annotation sets so they can be annotated by all human annotators.

  7. For each new annotation set, specify the required information.

    • In the Annotator field, select a human annotator user ID to assign to the new annotation set. In a realistic scenario, each annotation set is assigned to a different human annotator.

      If you have only a single administrator ID to use for the tutorial, assign that user to all annotation sets. In a realistic scenario, you would have multiple human annotators, but for the tutorial, the administrator can act as human annotator.

    • In the Set name field, specify a descriptive name for the annotation set. For this tutorial, you can use the names Set 1 and Set 2.

  8. Click Generate.

  9. Click Save.

  10. As human annotators begin annotating documents, you can open tasks to see their progress.

Lesson 4: Annotating documents

In this lesson, you will learn how to use the ground truth editor to annotate documents in Knowledge Studio.

About this task

For more information about human annotation, see Annotation with the ground truth editor.

Procedure

  1. Log in to Knowledge Studio as a user who is assigned to the annotation task that you created in Lesson 3.

    If you have access only to a single administrator ID for this tutorial, you can use that ID to perform human annotation. However, remember that in a realistic scenario, human annotation is performed by different users with the Human Annotator role.

  2. Open the My workspace workspace and click Machine Learning Model > Annotations.

  3. Click the Annotation Tasks tab, then open the Test annotation task you created in Lesson 3.

  4. Click Annotate for one of the assigned annotation sets.

    Depending on how you set up the annotation task, you might have one or more annotation sets assigned to the user ID that you logged in with.

  5. From the list of documents, find the Technology - gmanews.tv document and open it.

    Notice that the term IBM was already annotated with the ORGANIZATION entity type. This annotation was added by the dictionary pre-annotator that was applied in Lesson 2. This pre-annotation is correct, so it does not need to be modified.

    ![This screen capture shows an open document with an existing pre-annotation for "IBM".](images/wks_tut_preannotation.png)

  6. Annotate a mention:

    1. Click the Entity tab.

    2. In the document body, select the text Thomas Watson.

    3. In the list of entity types, click PERSON. The entity type PERSON is applied to the selected mention.

      ![This screen capture shows the "PERSON" entity type applied to the mention "Thomas Watson".](images/wks_tut_annotatemention3.png)

  7. Annotate a relation:

    1. Click the Relation tab.

    2. Select the Thomas Watson and IBM mentions (in that order). To select a mention, click the entity type label above the text.

    3. In the list of relation types, click founderOf. The two mentions are connected with a founderOf relationship.

      ![This screen capture shows two mentions connected by the relation type "founderOf".](images/wks_tut_annotaterelation.png)

  8. From the status menu, select Completed, and then click the Save button.

  9. Click Open document list to return to the list of documents for this task, and then click Submit All Documents to submit the documents for approval.

    In a realistic situation, you would create many more annotations and complete all the documents in the set before submitting.

  10. Close this annotation set, and then open the other annotation set in the Test task.

    Depending on how you set up the annotation tasks and which users you assigned them to, you might need to log in to Knowledge Studio as the user who is assigned to the other annotation set in the annotation task.

  11. Repeat the same annotations in the Technology - gmanews.tv document, except this time, use the employedBy relation instead of the founderOf relation.

    Logging in as another user helps illustrate inter-annotator agreement in the next lesson. However, if you have only one user ID, you can still complete the tutorial and get an understanding of how inter-annotator agreement works.

  12. After you complete the annotations for the second annotation set, click Submit All Documents.

Lesson 5: Analyzing inter-annotator agreement

In this lesson, you will learn how to compare the work of multiple human annotators in Knowledge Studio.

About this task

To determine whether different human annotators are annotating overlapping documents consistently, review the inter-annotator agreement (IAA) scores.

Knowledge Studio calculates IAA scores by examining all overlapping documents in all document sets in the task, regardless of the status of the document sets. The IAA scores show how different human annotators annotated mentions, relations, and coreference chains. It is a good idea to check IAA scores periodically and verify that human annotators are consistent with each other.

In this tutorial, the human annotators submitted all the document sets for approval. If the inter-annotator agreement scores are acceptable, you can approve the document sets. If you reject a document set, it is returned to the human annotator for improvement.

For more information about inter-annotator agreement, see Building the ground truth.
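
Knowledge Studio does not expose its internal IAA calculation in this tutorial, but the idea can be illustrated as pairwise agreement: for each pair of human annotators, count the annotations that both made (same span and same label) and relate that count to what each annotator produced, which yields an F1-style score between 0 and 1. The sketch below is a deliberate simplification under those assumptions, not the product's exact formula; the sample annotations for dave and phil are made up.

```python
def pairwise_agreement(annotations_a, annotations_b):
    """F1-style agreement between two annotators on the same documents.

    Each annotation is a (start, end, label) tuple; an exact match on
    span and label counts as agreement. This is a simplified stand-in
    for the IAA calculation, not the exact Knowledge Studio formula.
    """
    a, b = set(annotations_a), set(annotations_b)
    if not a and not b:
        return None  # nothing annotated: no score (N/A)
    matches = len(a & b)
    precision = matches / len(b) if b else 0.0
    recall = matches / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

dave = [(18, 21, "ORGANIZATION"), (0, 13, "PERSON")]
phil = [(18, 21, "ORGANIZATION")]
print(f"Agreement: {pairwise_agreement(dave, phil):.2f}")  # 0.67
```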

Procedure

  1. Log in to Knowledge Studio as the administrator and select Machine Learning Model > Annotations. Click the Annotation Tasks tab, then click the Test task.

    In the Status column, you can see that the document sets are submitted.

  2. Click Calculate Inter-Annotator Agreement.

  3. View IAA scores for mentions, relations, and coreference chains by clicking the first menu. You can also view agreement by pairs of human annotators or by specific documents. In general, aim for a score of 0.8 out of 1, where 1 means perfect agreement. Because you annotated only two entity types in this tutorial, most of the entity type scores are N/A (not applicable), which means that no information is available to give a score.

    Figure 1. Reviewing inter-annotator scores with users named dave and phil

    This screen capture shows the inter-annotator scores for a task.

  4. After you review the scores, you can decide whether you want to approve or reject document sets that are in the SUBMITTED status. Take one of these actions:

    • If the scores are acceptable for an annotation set, select the check box and click Accept. Documents that do not overlap with other document sets are promoted to ground truth. Documents that do overlap must first be reviewed through adjudication so that conflicts can be resolved. For this tutorial, accept both document sets.
    • If the scores are not acceptable for an annotation set, select the check box and click Reject. The document set needs to be revisited by the human annotator to improve the annotations.

Results

When you evaluated the inter-annotator agreement scores, you saw how different pairs of human annotators annotated the same document. If the inter-annotator agreement score was acceptable, you accepted the document set.

Lesson 6: Adjudicating conflicts in annotated documents

In this lesson, you will learn how to adjudicate conflicts in documents that overlap between document sets in Knowledge Studio.

About this task

When you approve a document set, only the documents that do not overlap with other document sets are promoted to ground truth. If a document is part of the overlap between multiple document sets, you must adjudicate any annotation conflicts before the document can be promoted to ground truth.

For more information about adjudication, see Building the ground truth.
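
You can think of adjudication as comparing the annotations that different human annotators added to the same overlapping document and deciding, for each conflict, which annotation becomes ground truth. The following sketch illustrates conflict detection for relations only; the data structures and function names are hypothetical and are not Knowledge Studio APIs.

```python
def find_relation_conflicts(relations_a, relations_b):
    """Compare two annotators' relations on the same overlapping document.

    Each relation is ((head_start, head_end), (tail_start, tail_end), type).
    A conflict is the same pair of mentions labeled with different relation
    types, for example founderOf by one annotator and employedBy by the other.
    """
    by_pair_a = {(head, tail): rtype for head, tail, rtype in relations_a}
    by_pair_b = {(head, tail): rtype for head, tail, rtype in relations_b}
    conflicts = []
    for pair in by_pair_a.keys() & by_pair_b.keys():
        if by_pair_a[pair] != by_pair_b[pair]:
            conflicts.append((pair, by_pair_a[pair], by_pair_b[pair]))
    return conflicts

thomas_watson, ibm = (0, 13), (18, 21)
set_1 = [(thomas_watson, ibm, "founderOf")]
set_2 = [(thomas_watson, ibm, "employedBy")]
for pair, type_1, type_2 in find_relation_conflicts(set_1, set_2):
    print(f"Conflict on mentions {pair}: {type_1} vs. {type_2}")
```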

Procedure

  1. Log in to Knowledge Studio as the administrator and select Machine Learning Model > Annotations. Click the Annotation Tasks tab, then click the Test task.

  2. Verify that the two document sets are in an approved state.

  3. Click Check Overlapping Documents for Conflicts.

    You can see the overlapping documents that were annotated by more than one human annotator.

  4. Because the tutorial instructed you to create a conflicting relation for the Technology - gmanews.tv document, find that document in the list and click Check for Conflicts.

  5. Select the two conflicting annotation sets and click Check for Conflicts.

    Adjudication mode opens. In adjudication mode, you can view overlapping documents, check for conflicts, and remove or replace annotations before you promote the documents to ground truth.

  6. Select Relation conflicts, accept the founderOf relation, and reject the employedBy relation.

  7. Click Promote to Ground Truth.

    Alternatively, you can promote a document to ground truth by clicking Accept on the Documents page.

Results

After you resolve the annotation conflicts and promote the documents to ground truth, you can use them to train the machine learning model.

Lesson 7: Creating a machine learning model

In this lesson, you will learn how to create a machine learning model in Knowledge Studio.

About this task

When you create a machine learning model, you select the document sets that you want to use to train it. You also specify the percentage of documents that are to be used as training data, test data, and blind data. Only documents that became ground truth through approval or adjudication can be used to train the machine learning model.

For more information about the machine learning model, see Training the machine learning model and Analyzing machine learning model performance.
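
To make the split concrete, the following sketch, which is independent of Knowledge Studio, partitions a list of ground-truth documents into training, test, and blind groups by percentage. The 70/20/10 split and the document names are only examples, not product defaults.

```python
import random

def split_ground_truth(documents, train_pct=70, test_pct=20, blind_pct=10, seed=0):
    """Partition ground-truth documents into training, test, and blind sets.

    The percentages here are illustrative, not Knowledge Studio defaults.
    """
    assert train_pct + test_pct + blind_pct == 100
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_train = round(len(docs) * train_pct / 100)
    n_test = round(len(docs) * test_pct / 100)
    return {
        "training": docs[:n_train],
        "test": docs[n_train:n_train + n_test],
        "blind": docs[n_train + n_test:],
    }

ground_truth = [f"doc-{n}" for n in range(1, 21)]
for name, docs in split_ground_truth(ground_truth).items():
    print(name, len(docs))  # training 14, test 4, blind 2
```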

Procedure

  1. Log in to Knowledge Studio as the administrator.

  2. Click Machine Learning Model > Performance > Train and evaluate.

  3. Select All, and then click Train & Evaluate.

    Training might take more than ten minutes, or even hours, depending on the number of human annotations and the number of words in all the documents.

  4. After the machine learning model is trained, you can export it from the Versions page, or you can view detailed information about its performance by clicking the Detailed Statistics links that are located above each of the graphs on the Performance page.

  5. To view the Training / Test / Blind Sets page, click the Train and evaluate button.

  6. To see the documents that human annotators worked on, click View Ground Truth.

  7. To see the annotations that the trained machine learning model created on that same set of documents, click View Decoding Results.

  8. To view details about the precision, recall, and F1 scores for the machine learning model, open the Performance page.

  9. Click the Detailed Statistics links above each of the graphs. On these Statistics pages, you can view the scores for mentions, relations, and coreference chains by using the radio buttons.

    You can analyze performance by viewing a summary of statistics for entity types, relation types, and coreference chains, or by viewing the statistics in a confusion matrix. To see the matrix, change Summary to Confusion Matrix. The confusion matrix helps you compare the annotations that were added by the machine learning model to the annotations in the ground truth. A simplified sketch of how these metrics are calculated follows this procedure.

    In this tutorial, you annotated documents with only a single dictionary for organizations and added just a few human annotations. Therefore, the scores that you see are 0 or N/A for most entity types except ORGANIZATION. The numbers are low, but that is expected because the model had very little annotated data to learn from.

    Figure 2. Options on the Statistics page for a machine learning model

    This screen capture shows the Statistics page.

  10. Click Versions. On the Versions page, you can take a snapshot of the model and the resources that were used to create it (except for dictionaries and annotation tasks). For example, you might want to take a snapshot before you retrain the model. If the statistics decline the next time you train the model, you can promote the older version and delete the version that performed worse.
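
The following sketch illustrates the metrics that are described in step 9 of this procedure. Precision, recall, and F1 are computed per entity type (and per relation type) by comparing the model's annotations with the ground truth. The counts and the ORGANIZATION label below are made up for illustration; only the formulas themselves are standard.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision, recall, and F1 computed from per-type counts.

    true_positives:  model annotations that match the ground truth
    false_positives: model annotations with no ground-truth counterpart
    false_negatives: ground-truth annotations that the model missed
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up counts for the ORGANIZATION entity type.
p, r, f1 = precision_recall_f1(true_positives=8, false_positives=2, false_negatives=4)
print(f"ORGANIZATION  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
# ORGANIZATION  precision=0.80  recall=0.67  F1=0.73
```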

Results

You created a machine learning model, trained it, and evaluated how well it performed when annotating test data and blind data. By exploring the performance metrics, you can identify ways to improve the accuracy of the machine learning model.

Tutorial summary

You created a machine learning model.

Lessons learned

By completing this tutorial, you learned about the following concepts:

  • Document sets
  • Machine learning models
  • Human annotation tasks
  • Inter-annotator agreement and adjudication