Translating documents

IBM is announcing the deprecation of the IBM Watson® Language Translator service for IBM Cloud® in all regions. As of 10 June 2023, the Language Translator tile will be removed from the IBM Cloud Platform for new customers; only existing customers will be able to access the product. As of 10 June 2024, the service will reach its End of Support date. As of 10 December 2024, the service will be withdrawn entirely and will no longer be available to any customers.

You can use the IBM Watson® Language Translator service to translate files from one language to another while preserving the original document format. The service supports translation of many file formats, including Microsoft® Office®, Open Office, subtitles, and many other common formats such as HTML, JSON, XML, and Adobe® PDF.

Before you begin

Make sure that you have the following information and meet the following requirements:

  • You need your Language Translator service credentials (apikey and url).
  • The document that you want to translate must not exceed the following size limits:
    • 2 MB for service instances on the Lite plan
    • 20 MB for service instances on the Standard plan
    • 50 MB for service instances on the Advanced plan
    • 150 MB for service instances on the Premium plan
  • The document must be in one of the supported file formats. Use the correct file extension for the format of the document or specify the content type (MIME type) of the format with the request. For more information, see Supported file formats.
  • The source and target languages must be among the List of supported languages.
  • The service correctly translates from and to bidirectional languages that are written left-to-right and right-to-left (for example, Arabic, Hebrew, and Urdu).
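
The size limits above can be checked locally before you upload a file. The following is a minimal sketch and not part of the service tooling: the function names max_upload_bytes and fits_plan are hypothetical, and the sketch assumes that 1 MB means 1,048,576 bytes, which this page does not specify.

```shell
# Hypothetical helper: print the upload limit in bytes for a plan name.
# Assumption: 1 MB = 1024 * 1024 bytes; the documentation does not say
# which definition of MB the service uses.
max_upload_bytes() {
  case "$1" in
    lite)     echo $((2 * 1024 * 1024)) ;;
    standard) echo $((20 * 1024 * 1024)) ;;
    advanced) echo $((50 * 1024 * 1024)) ;;
    premium)  echo $((150 * 1024 * 1024)) ;;
    *)        echo "unknown plan: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: succeed if the file ($1) is within the limit
# for the plan ($2).
fits_plan() {
  [ -f "$1" ] || return 1
  limit=$(max_upload_bytes "$2") || return 1
  # Arithmetic expansion strips the padding that some wc versions print.
  size=$(($(wc -c < "$1")))
  [ "$size" -le "$limit" ]
}
```

For example, fits_plan curriculum.html lite succeeds only if curriculum.html is no larger than the assumed Lite limit.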

This tutorial walks you through translating documents from the command line with curl. You can also use the Watson SDKs to translate documents with a number of programming languages. For more information, see the methods in the API & SDK reference.

Step 1: Submit a document to translate

The following example request submits the file curriculum.html to the service and translates it from English to French. Replace {apikey} and {url} with your service credentials, and replace curriculum.html with a relative path to your file. The source and target parameters specify the languages for the translation.

curl -X POST --user "apikey:{apikey}" \
--form "file=@curriculum.html" \
--form "source=en" \
--form "target=fr" \
"{url}/v3/documents?version=2018-05-01"

To translate a document with a custom model, use the model_id parameter. The following request translates the document with the custom model identified by the model ID 96221b69-8e46-42e4-a3c1-808e17c787ca. The custom model is defined for en-fr translation, so the source and target parameters are not needed.

curl -X POST --user "apikey:{apikey}" \
--form "file=@curriculum.html" \
--form "model_id=96221b69-8e46-42e4-a3c1-808e17c787ca" \
"{url}/v3/documents?version=2018-05-01"

A successful translation request returns a document ID in the response. In the following example, the ID is bae02796-0d28-435c-9115-888359fdde62. The status of processing indicates that the service is translating the document.

{
  "document_id": "bae02796-0d28-435c-9115-888359fdde62",
  "filename": "curriculum.html",
  "model_id": "en-fr",
  "source": "en",
  "target": "fr",
  "status": "processing",
  "created": "2018-10-11T03:31:25"
}
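
Later steps need the document ID from this response, so it can help to capture it in a shell variable. The following sketch uses a hypothetical helper named json_string_field. It relies on simple sed pattern matching and assumes string values without escaped quotes; a dedicated JSON parser is more robust if one is available.

```shell
# Hypothetical helper: print the value of a string field ($1) from a JSON
# response read on standard input. Simplistic sed-based matching; assumes
# the value contains no escaped double quotes.
json_string_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example usage (not run here; replace {apikey} and {url} as in Step 1):
#   document_id=$(curl -s -X POST --user "apikey:{apikey}" \
#     --form "file=@curriculum.html" \
#     --form "source=en" \
#     --form "target=fr" \
#     "{url}/v3/documents?version=2018-05-01" | json_string_field document_id)
```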

Step 2: Check the translation status

After you have submitted a document for translation, you can check the translation status to find out when the translated document is available to download. The following example request checks the translation status of the document with ID bae02796-0d28-435c-9115-888359fdde62. When the status in the response is available, the translated document is ready to download.

curl -X GET --user "apikey:{apikey}" \
"{url}/v3/documents/bae02796-0d28-435c-9115-888359fdde62?version=2018-05-01"

{
  "document_id": "bae02796-0d28-435c-9115-888359fdde62",
  "filename": "curriculum.html",
  "model_id": "en-fr",
  "source": "en",
  "target": "fr",
  "status": "available",
  "created": "2018-10-11T03:31:25",
  "completed": "2018-10-11T03:31:38"
}
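
Because translation takes some time, the status check is usually repeated until the status is available. The following polling loop is a sketch under stated assumptions: the names fetch_status, wait_until_available, APIKEY, URL, and FETCH_STATUS are hypothetical, and the handling of a failed status is an assumption that this tutorial does not describe.

```shell
# Hypothetical helper: request the status JSON for a document ID ($1).
# Assumption: APIKEY and URL hold your service credentials.
fetch_status() {
  curl -s -X GET --user "apikey:${APIKEY}" \
    "${URL}/v3/documents/$1?version=2018-05-01"
}

# Hypothetical helper: poll until the status is "available".
# $1 = document ID, $2 = seconds between checks, $3 = maximum attempts.
# FETCH_STATUS can name another command that prints the status JSON,
# which is useful for trying the loop without calling the service.
wait_until_available() {
  attempt=0
  while [ "$attempt" -lt "$3" ]; do
    status=$("${FETCH_STATUS:-fetch_status}" "$1" |
      sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
      head -n 1)
    case "$status" in
      available) return 0 ;;
      # Assumption: a "failed" status means the translation will not finish.
      failed)    echo "translation failed for document $1" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$2"
  done
  echo "timed out waiting for document $1" >&2
  return 1
}
```

For example, wait_until_available bae02796-0d28-435c-9115-888359fdde62 5 60 checks every 5 seconds and gives up after 60 attempts.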

Step 3: Download the translated document

The following example request saves the translated document with ID bae02796-0d28-435c-9115-888359fdde62 to a file named curriculum-fr.html.

curl -X GET --user "apikey:{apikey}" \
--output "curriculum-fr.html" \
"{url}/v3/documents/bae02796-0d28-435c-9115-888359fdde62/translated_document?version=2018-05-01"

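When you download several translations, a consistent naming scheme such as curriculum-fr.html keeps the files apart. The following sketch derives that name from the original file name and the target language code. The function name translated_name is hypothetical, and the sketch assumes a plain file name with no directory part.

```shell
# Hypothetical helper: derive an output name such as curriculum-fr.html.
# $1 = original file name (no directory part), $2 = target language code.
translated_name() {
  case "$1" in
    *.*) echo "${1%.*}-$2.${1##*.}" ;;   # insert the code before the extension
    *)   echo "$1-$2" ;;                 # no extension: append the code
  esac
}

# Example usage (not run here). The --fail option makes curl exit with an
# error on an HTTP error status instead of treating the error body as output:
#   curl --fail -X GET --user "apikey:{apikey}" \
#     --output "$(translated_name curriculum.html fr)" \
#     "{url}/v3/documents/{document_id}/translated_document?version=2018-05-01"
```
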
Step 4: Translate a previously submitted document

The following example request translates the original English file curriculum.html, which has document ID bae02796-0d28-435c-9115-888359fdde62, to Portuguese.

curl -X POST --user "apikey:{apikey}" \
--form "document_id=bae02796-0d28-435c-9115-888359fdde62" \
--form "source=en" \
--form "target=pt" \
"{url}/v3/documents?version=2018-05-01"

{
  "document_id": "a0ge2746-ad38-7d5c-7025-4cd3g9f451ab"
}

The response contains a new document ID. Repeat Step 2 with the new document ID to check the status of the translation. When the status becomes available, use the new document ID to download the translated file as shown in Step 3.

When translating a previously submitted document, the target language must be different from the target language of the original request when the document was initially submitted.

Step 5: Delete documents

The service automatically deletes original documents and any associated translated documents after they have not been used for a certain period of time. For more information, see Information security.

To delete documents manually, use the Delete document method. In this tutorial, the English file curriculum.html was used in two translations, so two requests are required to delete all copies of the original document.

Delete the original submission of curriculum.html and the French translation by using the document ID for that translation, bae02796-0d28-435c-9115-888359fdde62:

curl -X DELETE --user "apikey:{apikey}" \
"{url}/v3/documents/bae02796-0d28-435c-9115-888359fdde62?version=2018-05-01"

Delete the duplicate of the original curriculum.html file and the Portuguese translation by using the document ID for that translation, a0ge2746-ad38-7d5c-7025-4cd3g9f451ab:

curl -X DELETE --user "apikey:{apikey}" \
"{url}/v3/documents/a0ge2746-ad38-7d5c-7025-4cd3g9f451ab?version=2018-05-01"

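When several translations were requested, the delete requests can be issued in a loop. The following is a minimal sketch: the names delete_documents, APIKEY, URL, and CURL are hypothetical, and the CURL override exists only so that the requests can be previewed without being sent.

```shell
# Hypothetical helper: send one DELETE request per document ID argument.
# Assumption: APIKEY and URL hold your service credentials.
# Set CURL=echo to print the requests instead of sending them.
delete_documents() {
  for id in "$@"; do
    "${CURL:-curl}" -X DELETE --user "apikey:${APIKEY}" \
      "${URL}/v3/documents/${id}?version=2018-05-01" || return 1
  done
}
```

For the two translations in this tutorial, the call would be delete_documents bae02796-0d28-435c-9115-888359fdde62 a0ge2746-ad38-7d5c-7025-4cd3g9f451ab.
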
Supported file formats

The service supports translation of Microsoft Office, Open Office, subtitle, and many other common formats. To translate a file, you must identify the format of the file in one of the following ways:

  • By specifying the appropriate file extension for the format. For example, to translate an HTML file named example.html from English to French, you would use the following command:

    curl -X POST --user "apikey:{apikey}" \
    --form "file=@example.html" \
    --form "source=en" \
    --form "target=fr" \
    "{url}/v3/documents?version=2018-05-01"
    
  • By specifying the content type (MIME type) of the format as the type of the file parameter. For example, to translate an HTML file named just example from English to French, you would use the following request:

    curl -X POST --user "apikey:{apikey}" \
    --form "file=@example;type=text/html" \
    --form "source=en" \
    --form "target=fr" \
    "{url}/v3/documents?version=2018-05-01"
    

The tables in the following sections list the valid file extensions and content types for each supported format. In most cases, specifying the correct file extension is preferred because it can eliminate ambiguity and is simpler. For subtitles, the table makes clear where the file extension or the content type is needed.
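
The two ways of identifying the format differ only in the value of the file parameter. The following sketch builds that value with a hypothetical helper named file_form_value, which appends the type suffix only when a content type is supplied.

```shell
# Hypothetical helper: print the value for curl's --form file parameter.
# $1 = path to the file, $2 = optional content type (MIME type).
file_form_value() {
  if [ -n "${2:-}" ]; then
    echo "file=@$1;type=$2"   # no usable extension: state the content type
  else
    echo "file=@$1"           # rely on the file extension
  fi
}

# Example usage (not run here):
#   curl -X POST --user "apikey:{apikey}" \
#     --form "$(file_form_value example text/html)" \
#     --form "source=en" \
#     --form "target=fr" \
#     "{url}/v3/documents?version=2018-05-01"
```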

All file formats other than Adobe® PDF (Portable Document Format) are generally available. PDF is currently experimental functionality.

Microsoft Office file formats

Table 1 lists the Microsoft Office file formats that the service supports for translation.

Table 1. Microsoft Office file formats
File format           File extension  Content type
Microsoft Excel       .xls            application/vnd.ms-excel
Microsoft Excel       .xlsx           application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Microsoft PowerPoint  .ppt            application/vnd.ms-powerpoint
Microsoft PowerPoint  .pptx           application/vnd.openxmlformats-officedocument.presentationml.presentation
Microsoft Word        .doc            application/msword
Microsoft Word        .docx           application/vnd.openxmlformats-officedocument.wordprocessingml.document

Microsoft, Microsoft Excel, Microsoft Office, Microsoft PowerPoint, and Microsoft Word are trademarks of the Microsoft group of companies.

Open Office file formats

Table 2 lists the Open Office file formats that the service supports for translation.

Table 2. Open Office file formats
File format          File extension  Content type
Open Office Calc     .ods            application/vnd.oasis.opendocument.spreadsheet
Open Office Impress  .odp            application/vnd.oasis.opendocument.presentation
Open Office Writer   .odt            application/vnd.oasis.opendocument.text

Subtitle file formats

Table 3 lists the subtitle (or caption) formats that the service supports for translation. These textual formats contain the transcript of a sound track or video source. The formats provide plain text that is intuitively comprehensible with minimal syntax. They include a list of cues that contain synchronization information for the media source. They can also include metadata that is not intended for display.

The table shows both the file extension and the content type for all subtitle formats. However, in some cases you must specify one or the other. The table makes clear where the file extension or the content type is needed. Also, for the .sub file extension, which is used for multiple formats, the service parses the file to determine the exact subtitle format of the file.

Table 3. Subtitle file formats
File format                                File extension  Content type              Requirement
Apple® iTunes® Timed Text                  .itt            application/xml           The file extension is required
DirectVobSub                               .sub            text/plain                The file extension is required
Distribution Format Exchange Profile       .dxfp           application/ttaf+xml      The file extension is sufficient
Distribution Format Exchange Profile       .xml            application/ttaf+xml      The content type is required
MicroDVD                                   .sub            text/plain                The file extension is required
Source Code Control                        .scc            application/octet-stream  The file extension is required
SubRip                                     .srt            text/srt
SubStation Alpha                           .ssa            text/plain                The file extension is required
SubViewer                                  .sbv            text/sbv
Synchronized Accessible Media Interchange  .sami           application/xml           The file extension is required
Synchronized Accessible Media Interchange  .smi            application/xml           The file extension is required
Time Text Markup Language                  .ttml           application/ttml+xml
VSFilter                                   .sub            text/plain                The file extension is required
WebVTT                                     .vtt            text/vtt

The following information qualifies some of the nuances of subtitle translation:

  • Character encoding - Subtitles can be presented in different character sets. The service supports only UTF-8 input and produces only UTF-8 output. The results maintain the line separation and optional Byte Order Mark (BOM) from the original source.

  • Markup - The service attempts to preserve markup in the translated document, but preservation is not guaranteed. The service can silently remove markup from a particular cue.

  • Names - By convention, speaker names can be marked with -..:, parentheses, or both. The service extracts and translates speaker names separately to guarantee consistent translation.

  • Paragraphs - The text to be translated is grouped into paragraphs. Each paragraph can span multiple cues, but it always consists of a full set of cue lines. In other words, each paragraph consists of one or more cues in their entirety, with each cue contained fully in a single paragraph.

    A single cue can contain one or more lines of text (for example, two short sentences). The service creates paragraph breaks only at cue line boundaries to preserve the count of lines in the cue. For languages with punctuation, a paragraph generally maps to a complete sentence. For languages without punctuation, a paragraph can contain multiple sentences, which can adversely affect the distribution of lines into cues in the translated document.

  • Comments, notes, and titles - For formats that permit these elements, the service preserves the original text and adds translation that is prefixed by language code. Because this information is intended for use by the author, the service maintains the text in both its original and translated forms.
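
Because the service accepts only UTF-8 subtitles, it can be useful to check the encoding of a file before you submit it. The following sketch uses the standard iconv, head, and od utilities. The function names is_utf8 and has_utf8_bom are hypothetical, and the check assumes that iconv exits with an error when it meets an invalid byte sequence.

```shell
# Hypothetical helper: succeed if the file ($1) is valid UTF-8.
# Assumption: iconv fails on an invalid UTF-8 byte sequence.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: succeed if the file ($1) starts with a UTF-8
# byte order mark (bytes EF BB BF).
has_utf8_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

A file in another character set can be converted first, for example with iconv -f ISO-8859-1 -t UTF-8 input.srt > output.srt, where the source encoding and the file names are placeholders.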

Other file formats

Table 4 lists all other file formats that the service supports for translation.

Table 4. Other file formats
File format                          File extension  Content type
Adobe® Portable Document Format [1]  .pdf            application/pdf
Extensible Markup Language           .xml            text/xml
HyperText Markup Language            .htm            text/html
HyperText Markup Language            .html           text/html
HyperText Markup Language            .xhtml          text/html
JavaScript Object Notation [2]       .json           text/json
Plain text                           .txt            text/plain
Rich Text Format                     .rtf            application/rtf

Notes:

  1. For PDF files, translation is experimental functionality, and the quality of PDF translation is still at an alpha level. The translation works best for single-column PDF documents that do not include many tables or images.
  2. For JSON files, values with type string or string array are translated.
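
The extension and content type pairs in Tables 1, 2, and 4 can be expressed as a small lookup, which is handy when a script needs to state the content type explicitly. The following is a sketch only: the function name content_type_for is hypothetical, subtitle formats are left out because several of them share an extension or content type, and .xml is mapped to text/xml as in Table 4.

```shell
# Hypothetical helper: print the content type for a file name ($1), based
# on the extensions in Tables 1, 2, and 4. Subtitle formats are omitted.
# Returns a non-zero status if the extension is missing or not listed.
content_type_for() {
  case "$1" in
    *.*) ext=${1##*.} ;;
    *)   return 1 ;;
  esac
  case "$ext" in
    xls)   echo "application/vnd.ms-excel" ;;
    xlsx)  echo "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" ;;
    ppt)   echo "application/vnd.ms-powerpoint" ;;
    pptx)  echo "application/vnd.openxmlformats-officedocument.presentationml.presentation" ;;
    doc)   echo "application/msword" ;;
    docx)  echo "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ;;
    ods)   echo "application/vnd.oasis.opendocument.spreadsheet" ;;
    odp)   echo "application/vnd.oasis.opendocument.presentation" ;;
    odt)   echo "application/vnd.oasis.opendocument.text" ;;
    pdf)   echo "application/pdf" ;;
    xml)   echo "text/xml" ;;
    htm|html|xhtml) echo "text/html" ;;
    json)  echo "text/json" ;;
    txt)   echo "text/plain" ;;
    rtf)   echo "application/rtf" ;;
    *)     return 1 ;;
  esac
}
```

For example, content_type_for report.docx prints the Microsoft Word content type from Table 1.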