Uploading data
You can perform a one-time document upload from your local file system at any time to add data to a project.
You can upload up to 200 files at a time.
To process document sets that are larger than 200 files, you can add them to an external data source and use a data source crawler to upload them. For IBM Cloud Pak for Data deployments, you can use a Local File System data source for this purpose.
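As a planning aid, the following minimal sketch checks how a local folder compares with the 200-file limit by grouping its files into batches of at most 200. It is illustrative only: the function name batch_files and the constant MAX_FILES_PER_UPLOAD are hypothetical, the script does not call any product API, and it only lists files on the local file system.

```python
from pathlib import Path
from typing import Iterator, List

# The per-upload limit described in this topic.
MAX_FILES_PER_UPLOAD = 200


def batch_files(folder: Path, batch_size: int = MAX_FILES_PER_UPLOAD) -> Iterator[List[Path]]:
    """Yield lists of at most batch_size regular files found directly in folder.

    Subfolders are skipped; files are sorted by path so the batches are stable
    from one run to the next.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    files = sorted(p for p in folder.iterdir() if p.is_file())
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]
```

If the function yields more than one batch for a folder, the folder exceeds the limit for a single upload, and a data source crawler is likely the better fit.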
For more information about the maximum size allowed for each file, see Document limits.
Before you upload a CSV file to a Content Mining project, consider adding headers to the source file so that any fields that are generated from the file have meaningful names. Without headers, fields are given generic names, such as column_0, column_1, and so on.
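One way to add a header row to a headerless CSV file before you upload it is sketched here with the Python standard library. The function name add_csv_headers and the example column names are hypothetical, and the sketch assumes a UTF-8 encoded file in which every row has the same number of columns as the first row.

```python
import csv
from pathlib import Path
from typing import Sequence


def add_csv_headers(source: Path, target: Path, headers: Sequence[str]) -> None:
    """Copy source to target with a header row written first.

    Raises ValueError if the source is empty or if the number of headers
    does not match the number of columns in the first data row.
    """
    with source.open(newline="", encoding="utf-8") as src, \
            target.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        first_row = next(reader, None)
        if first_row is None:
            raise ValueError("source file is empty")
        if len(first_row) != len(headers):
            raise ValueError(
                f"expected {len(first_row)} headers, got {len(headers)}"
            )
        writer.writerow(headers)    # new header row
        writer.writerow(first_row)  # first data row, already read above
        writer.writerows(reader)    # remaining data rows
```

For example, add_csv_headers(Path("products.csv"), Path("products_with_headers.csv"), ["product_id", "product_name"]) writes a copy of the file whose first row contains the two column names. The original file is left unchanged.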
To upload data, complete the following steps:
-
Open your project, go to the Manage collections page, and then click New collection.
-
Do the following based on your deployment type:
IBM Cloud Pak for Data
-
Choose Upload data as your data source and then click Next.
Instead of uploading data, you can connect to a different data source, for example by reusing data from a collection or crawling an external data source. For more information, see Reusing data from a collection and Overview of Cloud Pak for Data data sources.
-
Name the collection. If the language of the documents in storage is not English, select the appropriate language. For a list of supported languages, see Language support.
-
Optionally, click More processing settings to expand the menu. You can select the following settings:
-
Set the Apply optical character recognition (OCR) switch to On to enable OCR.
When OCR is enabled and your documents contain images, processing takes longer. For more information, see Optical character recognition.
-
Set the Use stemming instead of lemmatization when indexing switch to On to use stemming instead of lemmatization to normalize words in the index and queries. For more information, see Enabling stemming for uncurated data.
-
Click Next.
-
Upload data by browsing for the files you want to add.
You can drag documents that you want to add to your collection.
For more information about supported file types, see Supported file types.
-
Click Finish.
IBM Cloud
-
Name the collection. If the language of the documents in storage is not English, select the appropriate language. For a list of supported languages, see Language support.
-
Upload data by browsing for the files you want to add.
You can drag documents that you want to add to your collection.
For more information about supported file types, see Supported file types.
Instead of uploading data, you can connect to a different data source, for example by reusing data from a collection or crawling an external data source. To connect to a different data source, click the link next to Need to connect to a data source? For more information, see Reusing data from a collection and Overview of Cloud data sources.
-
Optionally, click More processing settings to expand the menu. You can select the following settings:
-
Set the Apply optical character recognition (OCR) switch to On to enable OCR.
When OCR is enabled and your documents contain images, processing takes longer. For more information, see Optical character recognition.
-
Set the Use stemming instead of lemmatization when indexing switch to On to use stemming instead of lemmatization to normalize words in the index and queries. For more information, see Enabling stemming for uncurated data.
-
Click Finish.
-
The file upload is completed quickly. It takes more time for the data to be processed as it is added to the collection. After the files are uploaded and processed, the Activity page shows the upload results.
Unlike crawled data sources, you cannot schedule regular updates for uploaded files. If you want to add a later version of a file, delete the earlier version of the file, and then upload the latest version.
For information about how to troubleshoot issues that you might encounter when adding documents to a collection, see Troubleshooting ingestion.
For more information about what happens next, see How your data source is processed.