Migrating to Discovery v2

A redesign of the product, Discovery v2, was introduced in November 2019. Discovery v2 offers significant advantages over Discovery v1.

Learn about how to migrate a v1 Discovery service instance to Discovery v2, including how to move data and update your applications.

The major structural differences between Discovery v1 and v2 include:

  • There is no concept of an environment in v2. The deployment details such as size and index capacity are managed for you when you choose the appropriate service plan for your needs. For managed deployments, you can choose a Plus, Enterprise, or Premium plan, for example. For installed deployments, the sizing is managed by the deployment type that you specify when you install the service in Cloud Pak for Data.

  • There is no single configuration object in v2. Control of the enrichments that are applied to documents is managed in the collections and project objects in v2. Other v1 configuration capabilities, such as the ability to customize the conversion step of ingestion, are not available in v2.

  • Greater programmatic support is available for custom enrichments in v2. New enrichment API methods are available that you can use to create enrichments. v2 also introduces document classifier API methods that you can use to train document classifier models programmatically. You can apply these custom enrichments to a collection by using the API.

  • The capabilities of a natural language query search are expanded in v2 to enable the return of the top passages per document and of succinct answers from passages. Other advanced search capabilities are introduced, including table retrieval. The deduplication parameter, continuous relevancy training, and query logging are not available in v2.

  • For more information about feature differences, see the feature comparison table.

  • For more information about detailed API differences, see API version comparison.

Discovery v2 is available for all users of Plus or Enterprise plan instances, or Premium plan instances that were created after 15 July 2020. v2 is also available for IBM Watson® Discovery Cartridge for IBM Cloud Pak® for Data users.

Migration overview

Migrating from Discovery v1 to v2 is a multistep process that you can do independently.

The two versions of the Discovery service have many differences, but you can adopt techniques and utilities that were applied to a v1 instance for use with your new v2 instance.

To migrate from v1 to v2, you must complete the following high-level steps:

  1. Plan the migration.
  2. Transfer your documents.
  3. Update your application to use the v2 API.
  4. Regression test and deploy the updated application.
  5. Delete your v1 plan service instance.

Some steps require you to make programmatic changes by using the API and others involve changes that you can make from the product user interface.

Plan the migration

Get familiar with what's new in v2 and learn about how it differs from v1 before you provision a v2 instance. Your first v2 Plus plan trial instance is available at no charge for 30 days. Learn about and plan for the migration before you provision the instance so that you can get the most from your trial.

When you're ready to start the migration, create a migration schedule that you and your team can follow as you complete the process. Be sure to set up the new v2 service instance and re-create your projects and collections in it before you switch over to using the v2 service and before you delete your v1 instance.

Learn about the Discovery v2 plan options, so you can choose the right plan for your long-term needs. The Plus plan that you use to get started might be sufficient. However, you might choose to use an Enterprise or Premium plan instead. From a Plus plan, you can do an in-place upgrade to an Enterprise plan, but not to a Premium plan.

Plan how to adapt your application

One of the main changes between versions is that Discovery v2 introduces projects. A project consists of one or more collections. The advantage of using projects is that one query can run against many collections at the same time. Each collection can contain documents that you upload or that you crawl from a single data source, such as a website, Microsoft SharePoint, and more.

Things to consider when you adapt your application to use projects:

  • Although the concept of an environment does not exist in v2, data is still organized into collections. In v2, collections are grouped into projects. In most cases, you want to migrate a single v1 collection to a single v2 collection.

    If you want to keep relevancy training information that is applied to a v1 collection, add the collection documents to a single collection in your v2 project.

  • Decide how many collections you want to add to each v2 project. All project types, except Content Mining projects, can contain up to 5 collections. Choose the right type of project for your data.

    To optimize search results, different enrichments and configuration options are applied automatically to collections that are added to different project types. For more information, see the following topics:

  • The Discovery v2 API changed to account for projects and collections, among other enhancements. Some API calls changed to support actions at the project level instead of the collection level, such as submitting a query and running relevancy training. Many other API methods changed and some are not available in v2. For a detailed comparison of the v1 and v2 API methods, see API version comparison.

Picking a service plan

Choose among the Plus, Enterprise, and Premium managed plans or opt for an on-premises installation by purchasing the Discovery Cartridge for IBM Cloud Pak for Data. Review the benefits and limits of each type of plan before you choose one.

The following table shows plan types for managed deployments that are generally similar between v1 and v2.

Similar plans
Current v1 plan | Example v1 data usage | Similar v2 plan
Lite | Not applicable | Plus Trial (no charge for 30 days only)
Advanced (low usage) | 10,000 documents, 10,000 queries per month | Plus
Advanced (high usage) | 100,000 documents, 100,000 queries per month | Enterprise
Premium | Not applicable | Enterprise or Premium

To get information about the current storage, documents, and collections used, click the Environment details icon from the product user interface header.

You cannot do an in-place upgrade from a v1 plan, such as Lite or Advanced, to a v2 plan. You must create a new v2 plan, and then move your data to the new service instance. While you migrate your data from v1 to v2, you will likely have both a v1 and v2 instance deployed at the same time. Consider using the 30-day no charge trial that is available with your first Plus plan instance during this time.

Collecting metrics

Make a note of the following information so you can compare it to your service instance data after the migration:

  • Number of collections

    To get the number of collections in an instance in v1, use the List collections API.

  • Number of documents per collection

    To get the number of documents in a collection in v1, use the Get collection details API.

    GET {url}/v1/environments/{environment_id}/collections/{collection_id}
    

    The API returns information about the status of the documents in the collection, which includes the total number of available documents.

    "document_counts": {
        "available": 34,
        "{other}":"{values...}"
    }
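
If you record these metrics with a script, the count can be read from the parsed response. The following Python sketch assumes that the response body is already parsed into a dictionary; the function name is illustrative and is not part of any SDK.

```python
def available_document_count(collection_details: dict) -> int:
    """Return the number of available documents from a parsed
    v1 Get collection details response (0 if the field is absent)."""
    counts = collection_details.get("document_counts", {})
    return int(counts.get("available", 0))


# Response fragment shaped like the example in this section.
sample = {"document_counts": {"available": 34}}
print(available_document_count(sample))
```

Store the value for each collection so that you can compare it with the v2 numbers after the migration.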
    

Transferring documents from v1 to v2

How you transfer your documents depends on the technique that was used to ingest the documents in v1.

Re-create one collection at a time. If you start multiple ingestion processes at the same time, you can tax the system resources and increase the overall time that it takes for the processing to be completed. You also want to keep an eye out for any informational messages that are generated by the ingestion process. It is easier to troubleshoot an ingestion issue, for example, when you ingest one collection at a time.

Uploaded data

If you used the API to upload documents into Discovery v1, a similar API is available in v2 to upload documents into collections. You must update any workflows that you use to automate the process to account for the new arrangement of projects and collections.

If the original documents that you ingested into Discovery v1 are no longer available, you can use the query API to extract the document text from Discovery v1. You can then add the text to a collection in Discovery v2. For more information, see Recovering documents.

Crawled data

If you crawled data from an external data source in v1, you can continue to crawl data from the same external data source in v2. All of the same data sources are supported.

To use data from an external data source, you must re-create the collections within a v2 project, and configure how the data source is crawled. For more information, see Overview of data sources.

The service needs time and resources to crawl and ingest documents from external data sources. Re-create the connectors one at a time. Factor the time it takes to recrawl the data into your migration plan schedule.

Prebuilt data collections

The following built-in data source collections are not available in v2:

  • Watson Discovery News: This pre-enriched data source is not offered in v2. For more information about an alternative way to get news data, see Using a news service with v2.
  • COVID-19 kit: This pre-built collection was designed to help you fuel a dynamic chatbot that is built with IBM® watsonx™ Assistant and Discovery to answer your customers' questions about COVID-19. In v2, you can build a similar solution. Create a Conversational Search project type with collections that crawl trusted websites for answers to COVID-19 questions.

Ingesting data

To ingest v1 data into a Discovery v2 instance, complete the following steps:

  1. Create a v2 service instance.

  2. Create a project.

  3. Add a collection to the project.

    • Uploaded data:

      From the API, you create a collection and add documents to it with two separate methods. Use the Create a collection method to create the collection. Next, add the same source documents that you added to your v1 collection to the v2 collection. Use the Add document or Update document methods. To assign the same v1 document ID to the document as you add it to the v2 collection, append the document ID to the endpoint. For more information, see Retaining document IDs.

      From the v2 product user interface, upload the same source documents that you added to your v1 collection to the v2 collection.

    • Crawled data: You cannot crawl data from an external data source programmatically in v2. From the product user interface, re-create the connection to the external data source, and then crawl the external data source from scratch.

  4. From the product user interface, you can configure the Discovery v2 collection. For example, you can choose whether to enable optical character recognition. For an external data source, you can set the crawl schedule.

  5. Apply enrichments to your data. You can apply pre-built Natural Language Processing enrichments or custom enrichments that you create.

    In v1, enrichments are associated with the configuration that is generated when you create the environment. In v2, enrichments are associated with the collection configuration. Some enrichments are applied to your collection by default, depending on the type of project used. For more information, see Default project settings. In v2, you can configure the collection to use any subset of available enrichments on the fields of your document.

Retaining document IDs

Document IDs are assigned to the documents that you add to a v2 collection when you upload them from the product user interface or add them by using the Add a document API method.

You might want to retain the IDs of your v1 documents in v2 if you are using processes that depend on these unique identifiers. For example, regression testing for the application might verify that specific documents are returned by checking the document IDs. Relevancy training uses the document IDs to track documents between training runs. These processes are easier to adapt if the document IDs are the same between your v1 and v2 instances. Otherwise, the processes that are used with the Discovery v1 instance must be remapped to the IDs that are assigned to the documents after they are added to the Discovery v2 instance.

If you specified your own document IDs when you added documents to the v1 service instance, you can retain the IDs by using the Update a document method instead of the Add a document method. With the update method, you can assign a document ID to the document as you add it to the v2 collection. For more information, see Update a document.
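
As a rough illustration of the difference between the two methods, the following Python sketch builds the request target for each case. The endpoint paths and the POST verb reflect the v2 API reference as best understood here and should be verified against it; the function name and example values are hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def v2_document_endpoint(url: str, project_id: str, collection_id: str,
                         document_id: Optional[str] = None) -> Tuple[str, str]:
    """Return (HTTP method, endpoint) for adding a document to a v2 collection.

    Without a document_id, the target is the Add a document endpoint and the
    service assigns an ID. With a document_id, the ID is appended to the
    endpoint (the Update a document method) so that the v1 ID is retained.
    """
    base = (f"{url.rstrip('/')}/v2/projects/{project_id}"
            f"/collections/{collection_id}/documents")
    if document_id is None:
        return ("POST", base)
    # Percent-encode the ID so that it is safe to use as a path segment.
    return ("POST", f"{base}/{quote(document_id, safe='')}")


print(v2_document_endpoint("https://example.test", "p1", "c1", "my-v1-id"))
```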

If your data is stored in a JSON file, an array in the original document generates a document ID with a number appended to it. For example, original_id_n. To retain the original document ID without the number suffix, remove the array in the JSON file. Change [ {"name": "value"} ] to {"name": "value"}, for example.
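
If you have many such files, the change can be scripted. The following Python sketch unwraps a top-level array only when it holds exactly one object and leaves any other content unchanged; the function name is illustrative.

```python
import json


def unwrap_single_object_array(raw_json: str) -> str:
    """Turn '[ {...} ]' into '{...}'; return other JSON content unchanged."""
    data = json.loads(raw_json)
    if isinstance(data, list) and len(data) == 1 and isinstance(data[0], dict):
        data = data[0]
    return json.dumps(data)


print(unwrap_single_object_array('[ {"name": "value"} ]'))
```

Arrays with more than one object are left alone because each object would need to be written to its own file, which is outside the scope of this sketch.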

If your v1 documents have system-generated IDs, you can submit an empty search query to retrieve a list of the documents and their IDs. You can then assign the same ID to each document as you add it to your new collection in v2.

Recovering documents

In some cases, the original documents that were ingested into Discovery v1 are no longer available. You can use the Discovery v1 instance to retrieve information from the document. Discovery creates a text copy of each document that it ingests. The copy is text only, so any documents in HTML, PDF, or other nontext formats are converted to a text-only version.

You can recover only the first 10,000 documents in a collection by using this method. For more information about a way to recover more than 10,000 documents, see Recovering more than 10,000 documents from a collection.

To transfer document information from v1 to v2, complete the following steps:

  1. Extract the documents from v1 by using the API to submit an empty query.

    For example:

    GET {url}/v1/environments/{environment_id}/collections/{collection_id}/query?q=

    The API returns the results. The matching_results field specifies the total number of results. The results object returns the matching documents. Each document is returned as a separate JSON object. It returns a maximum of 10 documents by default.

    {
      "matching_results": 34,
      "session_token": "nnn",
      "results": [
        {"{result objects}":"{maximum of 10 by default}"}
      ]
    }
    
  2. You can use the count and offset parameters to page through the query results and save all of the documents.

    For example, to get 100 documents at a time, you can set the count to 100 and offset to 0 and submit the query.

    GET {url}/v1/environments/{environment_id}/collections/{collection_id}/query?q=&count=100&offset=0
    

    Next, you can again set the count to 100, but this time set the offset to 100 to get the next 100 documents.

    GET {url}/v1/environments/{environment_id}/collections/{collection_id}/query?q=&count=100&offset=100
    

    Repeat this process, incrementing the offset by 100 until you retrieve all of the documents.

  3. Prepare the exported documents to be ingested into v2.

    Each resulting JSON file that you get from Discovery v1 contains data that is extracted from the original document, such as text, html, and other fields. If custom metadata was associated with the document when it was uploaded to v1, it is also present in the JSON file. In addition, the file contains several fields that were generated by the v1 analysis. Retain only a subset of this data as part of the document that you add to Discovery v2.

    The following tips can help you decide which fields to keep:

    • Include the text field or any other field with textual content that you want to be able to enrich or search in Discovery v2.
    • Include any custom metadata that is stored in the document. This metadata is typically specific to the application that uses Discovery and is used to filter documents in a search. For example, metadata.customer_id.
    • Do not include enrichments from Discovery v1. For example, enriched_text.entities. Discovery v2 generates its own enrichments.
    • Exclude fields that are generated by Discovery, such as extracted_metadata.publicationdate, which is generated when a document is ingested. The exception is a generated field that your application uses and that contains information that is unique to the v1 version of the document. In that case, rename the field so that it does not get replaced when the document is ingested into Discovery v2. For example, you might want to retain the metadata.parent_document_id information from v1 to understand how subdocuments were originally generated from a single source document.
    • Avoid fields that have reserved field names. For more information, see How fields are handled.
  4. Ingest each edited v1 JSON document into the Discovery v2 instance. The Discovery v1 document ID can be maintained in Discovery v2. For more information about how to retain the document ID, see Retaining document IDs.
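
The field selection in step 3 can be sketched in Python as follows. The list of generated top-level fields is an assumption based on a typical v1 query result, so inspect your own results before you rely on it. The function name and the rename mapping are illustrative.

```python
from typing import Dict, Optional, Tuple

# Assumption: top-level fields that v1 generates and that are not copied as-is.
GENERATED_FIELDS = ("id", "result_metadata", "enriched_text", "extracted_metadata")


def prepare_for_v2(v1_result: dict,
                   rename: Optional[Dict[str, str]] = None) -> Tuple[Optional[str], dict]:
    """Return (v1 document ID, document body to add to v2).

    Fields listed in `rename` are kept under a new name so that v2 does not
    replace them. Generated fields and v1 enrichments are dropped. Everything
    else, such as text and custom metadata, is kept.
    """
    rename = rename or {}
    document = {}
    for field, value in v1_result.items():
        if field in rename:
            document[rename[field]] = value
        elif field in GENERATED_FIELDS or field.startswith("enriched_"):
            continue
        else:
            document[field] = value
    return v1_result.get("id"), document


v1_result = {
    "id": "abc",
    "text": "hello",
    "metadata": {"customer_id": "c-1"},
    "enriched_text": {"entities": []},
    "extracted_metadata": {"publicationdate": "2020-01-01"},
}
print(prepare_for_v2(v1_result, rename={"extracted_metadata": "v1_extracted_metadata"}))
```

The returned ID can then be used to retain the v1 document ID when you add the document to the v2 collection.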

Recovering more than 10,000 documents from a collection

A single query can return a maximum of 10,000 documents. To recover more than 10,000 documents from a collection, you need a way to separate the documents into non-overlapping subgroups, each of which contains fewer than 10,000 documents so that one query can return the whole subgroup. You can then paginate through the results for each subgroup to retrieve the documents.

Pagination for results is restricted to the maximum of 10,000 documents that are returned by the query. Specifically, the combined use of the count and offset pagination parameters cannot exceed 10,000 documents.
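
The following Python sketch shows one way to generate count and offset pairs that stay within this limit. It only computes the parameters and does not call the service. The function name and the default page size of 100 are illustrative.

```python
from typing import Iterator, Tuple

MAX_RESULTS = 10_000  # count + offset cannot exceed this value


def page_parameters(matching_results: int, count: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (count, offset) pairs that cover the reachable results.

    Results beyond the first 10,000 are not reachable by paging alone, so
    the generator stops at the limit even if matching_results is larger.
    """
    if count <= 0:
        raise ValueError("count must be a positive integer")
    reachable = min(matching_results, MAX_RESULTS)
    offset = 0
    while offset < reachable:
        # Shrink the last page so that count + offset stays within bounds.
        yield (min(count, reachable - offset), offset)
        offset += count


for count, offset in page_parameters(250):
    print(f"query?q=&count={count}&offset={offset}")
```

If matching_results for a query is greater than 10,000, paging alone does not reach every document, and the collection must be divided into subgroups first.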

One way to separate the documents into non-overlapping subgroups is to leverage a field that exists in every document and contains a unique value. For example, the SHA-1 field contains a hash of the original source file and is formatted as a hexadecimal string value. You can use the first character of the field as a way of dividing the collection into subgroups. Because SHA-1 contains a hexadecimal value, the first character can have up to 16 possible values (0-9 or a-f). If you filter by first_char_of(SHA-1) == 0, the query returns approximately 1/16 of the entire collection. You can then loop through the remaining 15 values to get the rest of the documents. If one of the subgroups is still too large to be returned in full, you can use the first 2 characters of the SHA-1 field to divide the collection into 256 subgroups instead.
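
As a sketch of this approach, the following Python function generates one filter expression per hexadecimal prefix. The field name extracted_metadata.sha1 and the trailing wildcard syntax are assumptions; check the field name in your own documents and the query language reference before you use the expressions.

```python
from itertools import product
from typing import List

HEX_DIGITS = "0123456789abcdef"


def sha1_prefix_filters(prefix_length: int = 1,
                        field: str = "extracted_metadata.sha1") -> List[str]:
    """Return filter expressions that split a collection by SHA-1 prefix.

    A prefix length of 1 gives 16 subgroups; a length of 2 gives 256.
    The field name and the wildcard syntax are assumptions, not confirmed.
    """
    return [f"{field}:{''.join(chars)}*"
            for chars in product(HEX_DIGITS, repeat=prefix_length)]


print(sha1_prefix_filters()[:3])
```

Each expression can be passed as the filter of an otherwise empty query, and the results of each subgroup can then be paged through separately.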

Transferring relevancy training

Relevancy training that was done in Discovery v1 can be transferred to Discovery v2. Transferring the training works best with a Discovery v2 project that has one collection that contains the same documents from the Discovery v1 collection.

Even if collections were added or documents changed, the relevancy training can be transferred. However, you must update the training to account for the changes.

To transfer relevancy training, complete the following steps:

  1. Load the documents in Discovery v2.

  2. Programmatically download the queries that were used for relevancy training in Discovery v1. For more information, see List training data.

  3. Programmatically re-create the relevancy training data in Discovery v2. Add each training query separately by using the Create a training query method. For more information, see Create a training query.

    Be sure to specify the v2 collection ID. You must also specify the document ID.

    If you did not retain the document IDs between the v1 and v2 collections, then you must find the v2 document ID that corresponds to the v1 document ID that is referenced in the downloaded query example.
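
The conversion in step 3 might look like the following Python sketch. The field names (natural_language_query, filter, examples, document_id, collection_id, relevance) follow the training data structures in the API references as best understood here and should be verified against them. The function name and the id_map argument are illustrative.

```python
from typing import Dict, Optional


def to_v2_training_query(v1_query: dict, v2_collection_id: str,
                         id_map: Optional[Dict[str, str]] = None) -> dict:
    """Build a v2 training query body from one downloaded v1 training query.

    id_map translates v1 document IDs to v2 document IDs. Leave it empty
    if the document IDs were retained during the migration.
    """
    id_map = id_map or {}
    examples = []
    for example in v1_query.get("examples", []):
        v1_id = example["document_id"]
        examples.append({
            "document_id": id_map.get(v1_id, v1_id),
            "collection_id": v2_collection_id,  # required for each example in v2
            "relevance": example["relevance"],
        })
    body = {
        "natural_language_query": v1_query["natural_language_query"],
        "examples": examples,
    }
    if v1_query.get("filter"):
        body["filter"] = v1_query["filter"]
    return body


v1_query = {
    "natural_language_query": "how do I reset my password",
    "examples": [{"document_id": "v1-doc", "relevance": 10}],
}
print(to_v2_training_query(v1_query, "c1", id_map={"v1-doc": "v2-doc"}))
```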

Transferring models

You can reuse some of the models that you created in v1 with your v2 project.

Smart Document Understanding (SDU) models

You can import an SDU model that was built with Discovery v1 into Discovery v2. However, the performance of the model might differ between versions. Compare the results of the v1 SDU model in v2 to verify that the behavior is the same. You cannot edit the imported v1 SDU model. If the imported model can't recognize document elements that it recognized in v1 and that are important to your use case, you must re-create the SDU model in the Discovery v2 product user interface. For more information, see Exporting SDU models in the v1 documentation and importing the SDU model in the v2 documentation.

Machine learning models

You cannot deploy models directly to Discovery v2 service instances from Knowledge Studio. Instead, you must export the machine learning models from Knowledge Studio, and then import them into Discovery. The model must have been exported from Knowledge Studio after 16 July 2020. If you have a model that was exported before that date, you must reexport the model from Knowledge Studio. Only paid Knowledge Studio plans support exporting models.

For information about how to import a model to Discovery v2, see Importing Machine Learning models.

Update your application to use the v2 API

The Watson Developer SDKs support both Discovery v1 and v2.

These instructions assume that your application is using the latest version of the v1 API (version 2019-04-30).

When you port an application that currently uses the Discovery v1 API to use v2, you must plan how to address the following high-level differences between the two versions.

In addition to these high-level changes, review the differences at a per-method level to understand what else you might need to change. For more information, see API version comparison.

  • v2 organizes data by project and collections; there is no concept of an environment. For example, compare the following requests to get a collection:

    v1 Get collection

    GET {url}/v1/environments/{environment_id}/collections/{collection_id}
    

    v2 Get collection

    GET {url}/v2/projects/{project_id}/collections/{collection_id}
    
  • In v1, relevancy training runs on a single collection. In v2, relevancy training runs on a project. The project might contain many collections. If so, relevancy training is applied across all of the collections. For information about how to transfer relevancy training, see Transferring relevancy training.

    For example, compare the following requests that return the status of relevancy training:

    v1 Get collection

    GET {url}/v1/environments/{environment_id}/collections/{collection_id}
    

    v2 Get project

    GET {url}/v2/projects/{project_id}
    
  • Submitting a query is similar between the two versions. In v2, you can query all of the collections in a project or you can limit the query to one or more collections by specifying a collection_ids parameter. For example, compare the following requests to query data:

    v1 Query request

    POST {url}/v1/environments/{environment_id}/collections/{collection_id}/query
    

    Data that is submitted with the request:

    {
      "query": "text:IBM"
    }
    

    v2 Query request

    POST {url}/v2/projects/{project_id}/query
    

    Data that is submitted with the request:

    {
      "collection_ids": [
        "{collection_id_1}",
        "{collection_id_2}"
      ],
      "query": "text:IBM"
    }
    

    You can optionally omit the collection_ids parameter to query across all of the collections in the project.

  • The passages parameter for a query has a new per_document option. When per_document is true, the service ranks the documents by document quality, and then returns the highest-ranked passages per document in a document_passages field for each document entry in the results list of the response. When false, the service ranks the passages from all of the documents by passage quality regardless of the document quality and returns them in a separate passages field in the response.

  • When passages are returned for a query, you can also enable answer finding by setting the find_answers option to true. When find_answers is true, answer objects are returned as part of each passage in the query results. When find_answers and per_document are both set to true, the document search results and the passage search results within each document are reordered by using the answer confidences. The goal of this reordering is to place the best answer as the first answer of the first passage of the first document. Similarly, if the find_answers parameter is set to true and per_document parameter is set to false, then the passage search results are reordered in decreasing order of the highest confidence answer for each document and passage.

  • Both v1 and v2 support custom stop words. However, there are a few differences in how custom stop words are used:

    • There is no default custom stop words list for Japanese collections in v2.
    • When you define custom stop words in v1, your stop words list replaces the existing stop words list. In v2, your list augments the default list. You cannot replace the list, which means you cannot remove stop words that are part of the default list in v2.
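
To illustrate how the query differences in this list fit together, the following Python sketch builds the endpoint and request body for a v2 query. It does not send the request. Nesting per_document and find_answers inside a passages object is an assumption based on the descriptions above; the function name and example values are hypothetical.

```python
from typing import List, Optional, Tuple


def v2_query_request(url: str, project_id: str, query: str,
                     collection_ids: Optional[List[str]] = None,
                     per_document: Optional[bool] = None,
                     find_answers: Optional[bool] = None) -> Tuple[str, dict]:
    """Return (endpoint, JSON body) for a v2 project-level query."""
    endpoint = f"{url.rstrip('/')}/v2/projects/{project_id}/query"
    body: dict = {"query": query}
    # Omit collection_ids to query every collection in the project.
    if collection_ids:
        body["collection_ids"] = list(collection_ids)
    passages = {}
    if per_document is not None:
        passages["per_document"] = per_document
    if find_answers is not None:
        passages["find_answers"] = find_answers
    if passages:
        body["passages"] = passages
    return endpoint, body


endpoint, body = v2_query_request(
    "https://example.test", "p1", "text:IBM",
    collection_ids=["c1", "c2"], per_document=True, find_answers=True,
)
print(endpoint)
print(body)
```

Keeping the request construction in one helper makes it easier to switch an application from the v1 collection-level endpoint to the v2 project-level endpoint in a single place.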

Update how your application handles query results

The way that your application shows query results might need to be updated because of the following differences in the syntax of query results between v1 and v2:

  • At the entity enrichment level, the following information is not supported in v2:

    • Disambiguation
    • Emotion
    • Sentiment

    The Part of Speech enrichment is applied automatically to documents in most project types in v2, but the index fields that are generated by the enrichment are not displayed in the JSON representation of the document.

    Figure 1. Entities data structure differences

  • Instead of the count and relevance values that v1 returns for each entity, v2 returns the entity's mentions.

    Each entry in mentions corresponds to an occurrence of the entity in the document text. In the following example, seven occurrences are found. For each occurrence, a confidence score and the offsets of the mention text are displayed. You can use the offsets to highlight the mention in the document text when the result is displayed in a user interface.

    Figure 2. Entity mentions in Discovery v2

  • The JSON structure of query responses is rearranged slightly in v2.

  • Deduplication information is not included in the v2 query response.

  • In v2, enriched_text is an array instead of an object.

  • In Discovery v2, the Entities v2 enrichment is used. Entity type names in v2 are specified in headline case, instead of all uppercase letters. If you use a query or aggregation that specifies an entity name, you must change the capitalization. For example, change PERSON to Person.

  • Fields from JSON files that are added to a collection are converted differently during ingestion between v1 and v2. If your application manipulates these results, you might need to make adjustments.

    You can specify the normalizations and conversions objects in the Update a collection method of the API to move or merge JSON fields.

    How JSON source fields are handled
    Original JSON field content | v1 representation | v2 representation | Notes
    "field": null | "field": null | N/A | v1 retains the null value. v2 skips the null field altogether.
    "field": "" | "field": "" | N/A | v1 retains the empty text value. v2 skips the empty text field altogether.
    "field": "value2" | "field": "value2" | "field": "value2" | No difference.
    "field": [] | "field": [] | N/A | v1 retains the empty array. v2 skips the field with the empty array altogether.
    "field": [ "value4" ] | "field": [ "value4" ] | "field": "value4" | v1 retains the singleton array. v2 converts the singleton array into the value only; it is not stored as part of an array.
    "field": [ 1, 2, 3 ] | "field": [ 1, 2, 3 ] | "field": [ 1, 2, 3 ] | No difference.
    "field": [ "v6", "v7", "v8" ] | "field": [ "v6", "v7", "v8" ] | "field": [ "v6", "v7", "v8" ] | No difference.

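If your regression tests compare v1 results with v2 results, it can help to model the JSON field handling rules in code. The following Python sketch applies the rules from the JSON source fields table to the top-level fields of a document. It is an approximation for test purposes only, not the service's actual conversion logic, and the names are illustrative.

```python
SKIPPED = object()  # marker for fields that v2 leaves out of the document


def v2_representation(value):
    """Approximate how v2 stores a JSON source field value."""
    if value is None or value == "" or value == []:
        return SKIPPED          # null, empty text, and empty arrays are skipped
    if isinstance(value, list) and len(value) == 1:
        return value[0]         # a singleton array becomes the bare value
    return value                # everything else is unchanged


def expected_v2_document(source: dict) -> dict:
    """Apply the rules to every top-level field of a source document."""
    converted = {}
    for field, value in source.items():
        result = v2_representation(value)
        if result is not SKIPPED:
            converted[field] = result
    return converted


print(expected_v2_document({"a": None, "b": "", "c": ["value4"], "d": [1, 2, 3]}))
```
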
Verifying that your data was migrated successfully

To verify that the migration was successful, compare the following metrics to the metrics that you noted before the migration.

  • Number of collections

    Be sure to re-create all of the collections that you used in v1 and want to keep. With the v2 List collections API method, you can get a list of collections, but you must submit a request per project. You cannot use one call to get the total number of collections per service instance.

  • Number of documents per collection

    For collections with uploaded data, check the number of documents in the collection by sending an empty query with the Query a project API method. Specify the collection ID parameter to limit the results to only documents in one collection. An empty query returns all documents. Therefore, you can get the total number of documents from the matching_results value in the response.

    The number of documents per collection should be close to the number of documents that were stored in the same collection in v1. The numbers might not be the same.

    For crawled data, do not be surprised if the v2 collection has fewer documents. The v1 connectors do not delete documents from a Discovery collection that are deleted from the external data source. Your v2 version of the collection has a fresher crawl of the data as it exists in the external data source today.
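
The comparison can be sketched in Python as follows. The 5% tolerance is an arbitrary illustration of "close to", not a documented threshold, and the function and collection names are hypothetical.

```python
from typing import Dict


def compare_document_counts(v1_counts: Dict[str, int], v2_counts: Dict[str, int],
                            tolerance: float = 0.05) -> Dict[str, str]:
    """Compare per-collection document counts noted before and after migration.

    A collection is reported as 'ok' when the v2 count is within the given
    fraction of the v1 count. The default tolerance is an assumption.
    """
    report = {}
    for name, before in v1_counts.items():
        after = v2_counts.get(name)
        if after is None:
            report[name] = "missing in v2"
        elif abs(after - before) <= tolerance * max(before, 1):
            report[name] = "ok"
        else:
            report[name] = f"check: v1={before}, v2={after}"
    return report


print(compare_document_counts({"faq": 34, "news": 100}, {"faq": 34, "news": 80}))
```

A "check" result is not necessarily a failure. For crawled collections in particular, a lower v2 count can simply reflect documents that were deleted from the external data source.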

Do not expect the search results to be the same for queries that you submit in the v1 and v2 instances.

Using a news service with v2

If you used the Watson Discovery News data source in v1 and want to create a data source with equivalent function in v2, find a news and events data provider service. Look for a service that offers a News API that extracts news articles in JSON format. You can then upload the JSON files to create a News collection in your v2 project.

Delete your v1 service instance

After your data is migrated and your applications are updated to use the new v2 service instance, be sure to delete your v1 service instance. You are charged for the v1 service instance until you delete it. For more information, see Deleting a managed service instance.