Speech transcript enrichment
Use the Speech Transcript Enrichment feature to improve the readability and usability of raw Automatic Speech Recognition (ASR) transcripts. This post-processing service automatically adds punctuation and applies intelligent capitalization to enhance the structure and clarity of spoken content.
The service inserts punctuation marks such as periods, commas, question marks, and exclamation points. It also capitalizes sentence beginnings, proper nouns, acronyms, and brand names based on context. Confidence scoring helps ensure accurate and reliable enrichment.
By using Speech Transcript Enrichment, you can produce clean, professional transcripts that are ready for review, publication, or integration into downstream applications.
Integrate Speech transcript enrichment through the API
To enable enrichment in API calls:
- Add the enrichments=punctuation parameter to the recognition request. For more details, see Update the recognition request.
- Process the enhanced response from the enriched_results object. For more details, see Process the enriched response.
- Access the confidence metrics for quality assessment.
Update the recognition request
To enable enrichment, add the enrichments=punctuation parameter to your recognition request.
Example request:
curl -X POST -u "apikey:<API-KEY>" \
-H "Content-Type: audio/wav" \
--data-binary @data/<your_wav_audio_file>.wav \
"https://api.jp-tok.speech-to-text.watson.cloud.ibm.com/instances/<INSTANCE-ID>/v1/recognize?model=fr-FR_Telephony&enrichments=punctuation"
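The same request could also be issued from Python. The following is a minimal sketch that uses only the standard library and mirrors the curl call above. The helper names build_recognize_url and recognize are hypothetical and are not part of any IBM SDK. The sketch also assumes that the service accepts HTTP basic authentication with the user name apikey, which is what the -u option of curl sends.

```python
import base64
import json
import urllib.request
from urllib.parse import urlencode


def build_recognize_url(service_url: str, model: str,
                        enrichments: str = "punctuation") -> str:
    """Return the /v1/recognize URL with the enrichments parameter set."""
    query = urlencode({"model": model, "enrichments": enrichments})
    return f"{service_url.rstrip('/')}/v1/recognize?{query}"


def recognize(service_url: str, api_key: str, wav_path: str, model: str) -> dict:
    """Hypothetical helper: POST a WAV file and return the parsed JSON response."""
    with open(wav_path, "rb") as audio_file:
        audio = audio_file.read()
    # Basic auth with user "apikey", equivalent to curl -u "apikey:<API-KEY>".
    token = base64.b64encode(f"apikey:{api_key}".encode()).decode()
    request = urllib.request.Request(
        build_recognize_url(service_url, model),
        data=audio,
        headers={
            "Content-Type": "audio/wav",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping URL construction in its own function makes it possible to verify that the enrichments parameter is present without sending any audio.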
Process the enriched response
When enrichment is enabled, the response includes an enriched_results object. This object contains the post-processed transcript with punctuation and capitalization applied, along with metadata for quality and traceability.
Example response:
{
  "speaker_labels": [
    {
      "from": 52.52,
      "to": 52.7,
      "speaker": 0,
      "confidence": 1.0,
      "final": false
    }
  ],
  "enriched_results": {
    "transcript": {
      "text": "Oh là, et bon, il faudrait me payer ma complation, là. Et vous avez ajouté à le fournir, me payer ma complatigne.",
      "timestamp": {
        "from": 0.22,
        "to": 59.68
      }
    },
    "status": "success"
  }
}
For more information about speaker labels and timestamps, see Speaker Labels and Word timestamps.
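A response shaped like the example above could be processed along the following lines. This is a minimal sketch, not an official client: the function name extract_enriched_transcript is hypothetical, the sample dictionary is made-up data that only mirrors the documented field layout, and treating any status other than "success" as a failure is an assumption, because the other possible status values are not listed here.

```python
from typing import Optional


def extract_enriched_transcript(response: dict) -> Optional[dict]:
    """Return the enriched text and its time span, or None if unavailable."""
    enriched = response.get("enriched_results")
    # Assumption: only status == "success" indicates a usable transcript.
    if not enriched or enriched.get("status") != "success":
        return None
    transcript = enriched.get("transcript", {})
    timestamp = transcript.get("timestamp", {})
    return {
        "text": transcript.get("text", ""),
        "from": timestamp.get("from"),
        "to": timestamp.get("to"),
    }


# Made-up sample that follows the same structure as the example response.
sample = {
    "enriched_results": {
        "transcript": {
            "text": "Bonjour, comment allez-vous ?",
            "timestamp": {"from": 0.22, "to": 59.68},
        },
        "status": "success",
    }
}

result = extract_enriched_transcript(sample)
if result is not None:
    print(f'{result["from"]}s to {result["to"]}s: {result["text"]}')
```

Returning None when enriched_results is missing lets the caller fall back to the raw transcript instead of failing.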
Limitations
Speech transcript enrichment has the following limitations:
- Currently, it supports only the following languages:
  - English (US, UK, Australia, India)
  - French (France, Canada)
  - German
  - Italian
  - Portuguese (Brazil, Portugal)
  - Spanish (Spain, Latin America, Argentina, Chile, Colombia, Mexico, Peru)
  - Japanese
- Performs best with conversational speech.
- Requires clear audio for optimal results.
- Processing adds minimal latency (typically <1 second).
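The language restriction above could be checked on the client before enrichment is requested. The following is a rough sketch under two assumptions: that model names begin with a locale prefix, as in the fr-FR_Telephony model used in the example request, and that a check on the two-letter language code is sufficient. It does not verify the regional variant, so it is coarser than the list above. The function name is hypothetical.

```python
# ISO 639-1 codes for the languages in the limitations list.
SUPPORTED_LANGUAGES = {"en", "fr", "de", "it", "pt", "es", "ja"}


def model_language_supported(model: str) -> bool:
    """Coarse check: is the model's language code in the supported set?"""
    # Assumption: the model name starts with "<language>-<region>_".
    language = model.split("-", 1)[0].lower()
    return language in SUPPORTED_LANGUAGES


if model_language_supported("fr-FR_Telephony"):
    print("Enrichment can be requested for this model.")
```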