Managing corpora
The customization interface includes the POST /v1/customizations/{customization_id}/corpora/{corpus_name}
method for adding a corpus to a custom language model. For more information, see Add a corpus to the custom language model.
The interface also includes the following methods for listing and deleting corpora for a custom language model.
Listing corpora for a custom language model
The customization interface provides two methods for listing information about the corpora for a custom language model:
- The GET /v1/customizations/{customization_id}/corpora method lists information about all corpora for a custom model.
- The GET /v1/customizations/{customization_id}/corpora/{corpus_name} method lists information about a specified corpus for a custom model.
Both methods return the name of the corpus, the total_words read from the corpus, and, for custom models based on previous-generation models, the number of out_of_vocabulary_words extracted from the corpus. The methods also list the status of the corpus, which reports the state of the service's analysis of the corpus after you request to add it to a custom model. The status has one of the following values:
- analyzed indicates that the service successfully analyzed the corpus. You can train the custom model with data from the corpus, or you can add additional corpora or words to the model.
- being_processed indicates that the service is still analyzing the corpus. The service cannot accept requests to add new corpora or words, or to train the custom model, until its analysis is complete.
- undetermined indicates that the service encountered an error while processing the corpus. The information that is returned for the corpus includes an error message that offers guidance for correcting the error. For example, the corpus might be invalid, or you might have tried to add a corpus with the same name as an existing corpus. You can try to add the corpus again and include the allow_overwrite parameter with the request. You can also delete the corpus and then try adding it again.
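The status values above lend themselves to a simple polling loop. The following Python sketch is a minimal illustration, not part of the service's SDK: the function name wait_for_corpus and the fetch_corpus callable are hypothetical, and fetch_corpus is assumed to return the parsed JSON object from the GET /v1/customizations/{customization_id}/corpora/{corpus_name} method.

```python
import time


def wait_for_corpus(fetch_corpus, interval_seconds=5.0, max_attempts=60, sleep=time.sleep):
    """Poll a corpus until the service finishes analyzing it.

    fetch_corpus is a hypothetical callable (an assumption of this sketch)
    that returns the parsed JSON dict for a single corpus.
    """
    for _ in range(max_attempts):
        corpus = fetch_corpus()
        status = corpus.get("status")
        if status == "analyzed":
            # Analysis succeeded; the model can now be trained or extended.
            return corpus
        if status == "undetermined":
            # Analysis failed; surface the service's error message if present.
            raise RuntimeError(corpus.get("error", "corpus analysis failed"))
        if status != "being_processed":
            raise ValueError(f"unexpected corpus status: {status!r}")
        # Still being processed: wait before asking again.
        sleep(interval_seconds)
    raise TimeoutError("corpus was still being processed after the last attempt")
```

The sleep parameter is injectable so that the loop can be exercised without real delays. The interval and attempt limit are arbitrary defaults, not values recommended by the service.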
List all corpora example
The following example lists all corpora for the custom model with the specified customization ID:
IBM Cloud
curl -X GET -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/corpora"
IBM Cloud Pak for Data
curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/corpora"
Three corpora were added to the custom model. The service successfully analyzed corpus1. It is still analyzing corpus2, and its analysis of corpus3 failed. Because the corpora were added to a custom model that is based on a previous-generation model, the out_of_vocabulary_words field shows the number of OOV words that the service extracted from the first corpus. The field will show the number of OOV words that are extracted from corpus2 when it is successfully analyzed. Analysis of corpus3 failed, so no OOV words were extracted.
{
  "corpora": [
    {
      "name": "corpus1",
      "total_words": 5037,
      "out_of_vocabulary_words": 401,
      "status": "analyzed"
    },
    {
      "name": "corpus2",
      "total_words": 0,
      "out_of_vocabulary_words": 0,
      "status": "being_processed"
    },
    {
      "name": "corpus3",
      "total_words": 0,
      "out_of_vocabulary_words": 0,
      "status": "undetermined",
      "error": "Analysis of corpus 'corpus3.txt' failed. Please try adding the corpus again by setting the 'allow_overwrite' flag to 'true'."
    }
  ]
}
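A response like this one can be summarized in a few lines of client code. The Python sketch below is illustrative only: the function names are hypothetical, and the sample payload repeats the example response with the error text shortened. It groups corpora by status, reports whether any corpus is still being processed (in which case the service cannot yet accept new corpora, words, or training requests), and collects the error messages of failed corpora.

```python
import json

# Sample payload modeled on the example response (error text shortened).
SAMPLE_RESPONSE = """
{
  "corpora": [
    {"name": "corpus1", "total_words": 5037, "out_of_vocabulary_words": 401, "status": "analyzed"},
    {"name": "corpus2", "total_words": 0, "out_of_vocabulary_words": 0, "status": "being_processed"},
    {"name": "corpus3", "total_words": 0, "out_of_vocabulary_words": 0, "status": "undetermined",
     "error": "Analysis of corpus 'corpus3.txt' failed."}
  ]
}
"""


def group_corpora_by_status(response_text):
    """Map each status value to the names of the corpora that report it."""
    grouped = {}
    for corpus in json.loads(response_text).get("corpora", []):
        grouped.setdefault(corpus["status"], []).append(corpus["name"])
    return grouped


def model_is_busy(response_text):
    """Return True if at least one corpus is still being analyzed."""
    return bool(group_corpora_by_status(response_text).get("being_processed"))


def failed_corpora(response_text):
    """Map the name of each failed corpus to the error message returned for it."""
    return {
        corpus["name"]: corpus.get("error", "")
        for corpus in json.loads(response_text).get("corpora", [])
        if corpus["status"] == "undetermined"
    }
```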
List a specific corpus example
The following example returns information about the corpus that is named corpus1 for the custom model with the specified customization ID:
IBM Cloud
curl -X GET -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/corpora/corpus1"
IBM Cloud Pak for Data
curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/corpora/corpus1"
The corpus, which was added to a custom model that is based on a previous-generation model, is fully analyzed and contains 401 OOV words:
{
  "name": "corpus1",
  "total_words": 5037,
  "out_of_vocabulary_words": 401,
  "status": "analyzed"
}
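The curl commands for IBM Cloud and IBM Cloud Pak for Data differ only in how they authenticate. The following Python sketch shows one way to build the equivalent request with the standard library. The helper name build_corpus_request and its parameters are hypothetical. The sketch assumes that curl's -u "apikey:{apikey}" option corresponds to HTTP basic authentication with the user name apikey, and it only constructs the request without sending it.

```python
import base64
import urllib.parse
import urllib.request


def build_corpus_request(base_url, customization_id, corpus_name=None, *,
                         apikey=None, token=None, method="GET"):
    """Build a request for the corpora endpoints without sending it.

    Pass apikey for IBM Cloud (basic authentication) or token for
    IBM Cloud Pak for Data (bearer token). Omit corpus_name to address
    all corpora of the custom model.
    """
    path = "/v1/customizations/" + urllib.parse.quote(customization_id, safe="") + "/corpora"
    if corpus_name is not None:
        # Percent-encode the corpus name so that it is safe as a path segment.
        path += "/" + urllib.parse.quote(corpus_name, safe="")
    request = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    if token is not None:
        request.add_header("Authorization", f"Bearer {token}")
    elif apikey is not None:
        credentials = base64.b64encode(f"apikey:{apikey}".encode("utf-8")).decode("ascii")
        request.add_header("Authorization", f"Basic {credentials}")
    else:
        raise ValueError("provide either apikey or token")
    return request
```

Passing the returned object to urllib.request.urlopen would send the request. The response body could then be decoded with json.loads.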
Deleting a corpus from a custom language model
Use the DELETE /v1/customizations/{customization_id}/corpora/{corpus_name} method to remove an existing corpus from a custom language model.
- If the custom model is based on a large speech model or next-generation model, the service deletes the corpus from the model.
- If the custom model is based on a previous-generation model, the service deletes the corpus from the model and removes OOV words that are associated with the corpus from the custom model's words resource. The service removes an OOV word from a custom model unless:
  - The word was also added by another corpus or by a grammar.
  - The word was modified in some way with the POST /v1/customizations/{customization_id}/words or PUT /v1/customizations/{customization_id}/words/{word_name} method.
Removing a corpus does not affect the custom model until you train the model on its updated data by using the POST /v1/customizations/{customization_id}/train method. If you previously trained the model on the corpus successfully, information extracted from the corpus remains in the model and applies to speech recognition until you retrain the model.
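The rules for removing OOV words from a custom model that is based on a previous-generation model can be expressed as a small predicate. The Python sketch below is only a client-side model of the documented behavior, not the service's implementation. The dictionary keys word, sources, and user_modified are hypothetical names chosen for illustration.

```python
def words_removed_with_corpus(words, corpus_name):
    """Return the OOV words that would be removed along with a corpus.

    Each entry in words is a dict with hypothetical keys:
      word          - the OOV word itself
      sources       - names of the corpora or grammars that added the word
      user_modified - True if the word was changed with the words methods
    """
    removed = []
    for entry in words:
        sources = set(entry["sources"])
        if corpus_name not in sources:
            # The word did not come from this corpus, so it is unaffected.
            continue
        if sources - {corpus_name}:
            # Another corpus or a grammar also added the word, so it is kept.
            continue
        if entry.get("user_modified", False):
            # The word was modified directly, so it is kept.
            continue
        removed.append(entry["word"])
    return removed
```

For example, a word that was added only by corpus1 and never modified is removed when corpus1 is deleted, whereas a word that corpus1 shares with a grammar is kept.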
Delete a corpus example
The following example deletes the corpus that is named corpus3 from the custom model with the specified customization ID:
IBM Cloud
curl -X DELETE -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/corpora/corpus3"
IBM Cloud Pak for Data
curl -X DELETE \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/corpora/corpus3"
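Because deleting a corpus takes effect only after the model is retrained, a client might chain the two calls. The Python sketch below illustrates that sequence under stated assumptions: the send callable is hypothetical and is assumed to issue an authenticated request and return the HTTP status code, and any 2xx status is treated as success. The sketch does not check whether the model is ready to be trained.

```python
def delete_corpus_and_retrain(send, base_url, customization_id, corpus_name):
    """Delete a corpus, then request training on the updated data.

    send is a hypothetical callable send(method, url) that performs an
    authenticated HTTP request and returns the response status code.
    """
    root = f"{base_url.rstrip('/')}/v1/customizations/{customization_id}"
    steps = [
        ("DELETE", f"{root}/corpora/{corpus_name}"),  # remove the corpus
        ("POST", f"{root}/train"),                    # retrain on what remains
    ]
    for method, url in steps:
        status = send(method, url)
        if not 200 <= status < 300:
            # Stop at the first failure so training is not requested
            # after an unsuccessful delete.
            raise RuntimeError(f"{method} {url} failed with HTTP status {status}")
    return [url for _, url in steps]
```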