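Managing corpora

The customization interface includes the POST /v1/customizations/{customization_id}/corpora/{corpus_name} method, which adds a corpus to a custom language model. For more information, see Adding a corpus to a custom language model. The interface also includes the following methods for listing and deleting the corpora of a custom language model.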

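Listing corpora for a custom language model

The customization interface provides two methods for listing information about the corpora of a custom language model:

The GET /v1/customizations/{customization_id}/corpora method lists information about all corpora for a custom model.
The GET /v1/customizations/{customization_id}/corpora/{corpus_name} method lists information about a specified corpus for a custom model.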
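As a minimal sketch, the two listing endpoints differ only in the optional corpus name at the end of the path. The helper below is hypothetical (the name corpora_url and its parameters are not part of the service or any SDK); it only shows how the two URLs could be assembled from the service URL, the customization ID, and an optional corpus name.

```python
from typing import Optional
from urllib.parse import quote


def corpora_url(base_url: str, customization_id: str,
                corpus_name: Optional[str] = None) -> str:
    """Build the URL for listing all corpora, or one corpus if a name is given.

    Hypothetical helper for illustration; only the path layout comes from
    the documented GET methods.
    """
    # Percent-encode path segments so names with spaces or slashes stay valid.
    url = (f"{base_url.rstrip('/')}/v1/customizations/"
           f"{quote(customization_id, safe='')}/corpora")
    if corpus_name is not None:
        url += f"/{quote(corpus_name, safe='')}"
    return url
```

Calling corpora_url(url, customization_id) targets the first method, and passing a third argument such as "corpus1" targets the second.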

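Both methods return the name of the corpus, the total_words read from the corpus, and, for custom models that are based on previous-generation models, the number of out_of_vocabulary_words extracted from the corpus. Both methods also return the status of the corpus. After you ask the service to add a corpus to a custom model, the status tells you how the service's analysis of that corpus is progressing:

analyzed indicates that the service successfully analyzed the corpus. You can train the custom model on the data from the corpus, or you can add more corpora or words to the model.
being_processed indicates that the service is still analyzing the corpus. Until the analysis completes, the service cannot accept requests to add new corpora or words or to train the custom model.
undetermined indicates that the service encountered an error while processing the corpus. The information returned for the corpus includes an error message that offers guidance for correcting the error.

For example, the corpus might be invalid, or you might have tried to add a corpus with the same name as an existing corpus. You can try adding the corpus again and include the allow_overwrite parameter with the request. You can also delete the corpus and then try adding it again.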
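The three status values map naturally onto a small decision function. The sketch below is illustrative only: the function names and the returned labels ("ready", "wait", "fix") are assumptions made for this example, and the input is assumed to be a corpus object shaped like the service's JSON response.

```python
def next_step(corpus: dict) -> str:
    """Map a corpus status to a suggested action (labels are hypothetical)."""
    status = corpus.get("status")
    if status == "analyzed":
        return "ready"   # safe to train or to add more corpora or words
    if status == "being_processed":
        return "wait"    # poll again later; the model cannot take changes yet
    if status == "undetermined":
        return "fix"     # read the error, then re-add with allow_overwrite or delete
    raise ValueError(f"unexpected corpus status: {status!r}")


def model_accepts_changes(corpora: list) -> bool:
    """True when no corpus is still being analyzed.

    Assumption: only being_processed blocks new requests, as the status
    descriptions above state; other conditions are not modeled here.
    """
    return all(c.get("status") != "being_processed" for c in corpora)
```

A caller might poll the listing method and wait until model_accepts_changes returns True before sending a training request.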

List all corpora example

The following example lists all corpora for the custom model with the specified customization ID:

IBM Cloud

curl -X GET -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/corpora"

IBM Cloud Pak for Data IBM Software Hub

curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/corpora"

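Three corpora were added to the custom model. The service successfully analyzed corpus1. It is still analyzing corpus2, and its analysis of corpus3 failed. Because the corpora were added to a custom model that is based on a previous-generation model, the out_of_vocabulary_words field shows the number of OOV words that the service extracted from the first corpus. The field will show the number of OOV words extracted from corpus2 once its analysis completes successfully. Analysis of corpus3 failed, so no OOV words were extracted from it.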

{
  "corpora": [
    {
      "name": "corpus1",
      "total_words": 5037,
      "out_of_vocabulary_words": 401,
      "status": "analyzed"
    },
    {
      "name": "corpus2",
      "total_words": 0,
      "out_of_vocabulary_words": 0,
      "status": "being_processed"
    },
    {
      "name": "corpus3",
      "total_words": 0,
      "out_of_vocabulary_words": 0,
      "status": "undetermined",
      "error": "Analysis of corpus 'corpus3.txt' failed. Please try adding the corpus again by setting the 'allow_overwrite' flag to 'true'."
    }
  ]
}
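A response like the one above can be summarized with a few lines of Python. This is a sketch under the assumption that the response body has already been read into a string; the function names group_by_status and corpus_errors are hypothetical and exist only for this example.

```python
import json
from collections import defaultdict


def group_by_status(response_text: str) -> dict:
    """Return a mapping of status -> list of corpus names, in response order."""
    data = json.loads(response_text)
    groups = defaultdict(list)
    for corpus in data.get("corpora", []):
        groups[corpus["status"]].append(corpus["name"])
    return dict(groups)


def corpus_errors(response_text: str) -> dict:
    """Return a mapping of corpus name -> error message for failed corpora."""
    data = json.loads(response_text)
    return {c["name"]: c["error"]
            for c in data.get("corpora", [])
            if "error" in c}
```

For the sample response, group_by_status places corpus1 under analyzed, corpus2 under being_processed, and corpus3 under undetermined, while corpus_errors returns only the message for corpus3.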

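List a specific corpus example

The following example returns information about the corpus named corpus1 for the custom model with the specified customization ID: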

IBM Cloud

curl -X GET -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/corpora/corpus1"

IBM Cloud Pak for Data IBM Software Hub

curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/corpora/corpus1"

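The corpus, which belongs to a custom model based on a previous-generation model, has been fully analyzed and contains just over 400 OOV words: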

{
  "name": "corpus1",
  "total_words": 5037,
  "out_of_vocabulary_words": 401,
  "status": "analyzed"
}

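Deleting a corpus from a custom language model

Use the DELETE /v1/customizations/{customization_id}/corpora/{corpus_name} method to remove an existing corpus from a custom language model.

If the custom model is based on a large speech model or a next-generation model, the service deletes the corpus from the model.
If the custom model is based on a previous-generation model, the service deletes the corpus from the model and also removes the OOV words associated with the corpus from the custom model's words resource. The service removes an OOV word unless
- The word was also added by another corpus or by a grammar.
- The word was modified in some way with the POST /v1/customizations/{customization_id}/words or PUT /v1/customizations/{customization_id}/words/{word_name} method.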
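The removal rule for previous-generation models can be pictured as a set computation. The sketch below is a simplified model of the rule as described above, not the service's implementation: the data structures (a mapping from each word to the names of the corpora or grammars that added it, and a set of words that were modified through the words methods) are assumptions made for illustration.

```python
def words_removed_on_delete(corpus_name: str,
                            word_sources: dict,
                            modified_words: set) -> set:
    """Return the OOV words that would be dropped when corpus_name is deleted.

    word_sources maps each OOV word to the set of corpus or grammar names
    that added it; modified_words holds words changed via the words methods.
    Both structures are hypothetical stand-ins for the words resource.
    """
    removed = set()
    for word, sources in word_sources.items():
        if corpus_name not in sources:
            continue  # the deleted corpus never contributed this word
        if sources - {corpus_name}:
            continue  # another corpus or a grammar also added the word
        if word in modified_words:
            continue  # the word was modified directly, so it is kept
        removed.add(word)
    return removed
```

For example, a word that only corpus3 contributed and that was never modified is removed, while a word shared with corpus1 survives the deletion.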

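Removing a corpus does not affect the custom model until you train the model on its updated data by using the POST /v1/customizations/{customization_id}/train method. If the model was previously trained successfully on the corpus, the information extracted from the corpus remains in the model and continues to be applied to speech recognition until you retrain the model.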
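Because the deletion takes effect only after retraining, the two calls are naturally issued as a pair. The following sketch builds both requests with Python's standard urllib.request module without sending them. The helper names are hypothetical, and the authentication handling simply mirrors the curl examples in this section (basic authentication with the user name apikey for IBM Cloud, a bearer token otherwise).

```python
import base64
import urllib.request
from urllib.parse import quote


def auth_header(apikey=None, token=None) -> str:
    """Return an Authorization header value matching the curl examples."""
    if token is not None:
        return f"Bearer {token}"
    if apikey is not None:
        creds = base64.b64encode(f"apikey:{apikey}".encode()).decode()
        return f"Basic {creds}"
    raise ValueError("provide either apikey or token")


def delete_then_train_requests(base_url, customization_id, corpus_name,
                               apikey=None, token=None) -> list:
    """Build (but do not send) the DELETE corpus and POST train requests."""
    root = (f"{base_url.rstrip('/')}/v1/customizations/"
            f"{quote(customization_id, safe='')}")
    headers = {"Authorization": auth_header(apikey=apikey, token=token)}
    delete_req = urllib.request.Request(
        f"{root}/corpora/{quote(corpus_name, safe='')}",
        method="DELETE", headers=headers)
    train_req = urllib.request.Request(
        f"{root}/train", method="POST", headers=headers)
    return [delete_req, train_req]
```

Each request could then be sent in order with urllib.request.urlopen. Error handling and any waiting between the two calls are left out of this sketch.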

Delete a corpus example

The following example deletes the corpus named corpus3 from the custom model with the specified customization ID:

IBM Cloud

curl -X DELETE -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/corpora/corpus3"

IBM Cloud Pak for Data IBM Software Hub

curl -X DELETE \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/corpora/corpus3"