Local File System

Crawl documents that are stored in a local file system.

IBM Cloud Pak for Data IBM Software Hub

This information applies only to installed deployments.

What documents are crawled

Only file types that are supported by Discovery in your file path are crawled; all others are ignored. For more information, see Supported file types.
Only files in the /mnt directory or one of its subdirectories can be accessed by the crawler.
Only files with file extensions that match the file extension filter rules that you specify are crawled. Support for file extension filters was added with the 4.7.0 release.
When a source is recrawled, new documents are added, updated documents are modified to the current version, and deleted documents are deleted from the collection's index.
All Discovery data source connectors are read-only. Regardless of the permissions that are granted to the crawl account, Discovery never writes, updates, or deletes any content in the original data source.
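
The path restriction above can be checked before you enter a path in the product. The following is a minimal sketch, not part of Discovery: the function name `is_under_mnt` is hypothetical, and the check only looks at the path string. It does not test whether the path exists on the crawler pod.

```shell
# Hypothetical pre-flight helper; not provided by Discovery.
# Succeeds when the given path is /mnt itself or lies below /mnt.
is_under_mnt() {
  case "$1" in
    */../*|*/..) return 1 ;;   # reject parent-directory segments that could leave /mnt
    /mnt|/mnt/*) return 0 ;;   # /mnt itself or anything below it
    *) return 1 ;;             # everything else, including names such as /mntx
  esac
}
```

For example, `is_under_mnt /mnt/docs/report.pdf` succeeds, while `is_under_mnt /home/docs` fails.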

Prerequisite steps

Before you connect to the Local File System data source, complete the following step:

Create a persistent volume claim on the crawler pod

The service uses Portworx storage by default. However, if you are using Network File System (NFS) storage, see Prerequisite steps for NFS storage instead.

Creating and mounting a persistent volume claim on the crawler pod

Before you can crawl a local file system, you must create a persistent volume claim and mount it on the crawler pod. You also need to copy the files that you want to crawl to the Discovery cluster that you are working on. If you have multiple Discovery clusters, you must copy the files, along with the crawler-pvc-portworx.yaml file that you create in this task, to each cluster.

Complete the following steps:

Enter the following command to check the storage class name of the Portworx provisioner:

oc get storageclass | grep portworx-gp3-sc

You might see output similar to the following. The header row is shown for reference only; the grep filter prints only the matching row.

NAME             PROVISIONER                    RECLAIMPOLICY  VOLUMEBINDINGMODE  ALLOWVOLUMEEXPANSION  AGE
portworx-gp3-sc  kubernetes.io/portworx-volume  Retain         Immediate          true                  51d

Create a file named crawler-pvc-portworx.yaml to define the persistent volume claim (PVC) with the following content:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: <name-of-portworx-pvc>
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: portworx-gp3-sc

Replace <name-of-portworx-pvc> with the name of your dynamic Portworx persistent volume claim. For example, jdoe-pvc-portworx.
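
If you repeat this task on several clusters, you might generate the file instead of editing it by hand. The following sketch writes the same manifest from a shell function. The function name `write_pvc_manifest` and its argument order are assumptions made for illustration; the default values are taken from the example above.

```shell
# Hypothetical helper: writes a PVC manifest like the one shown above.
# Arguments: <pvc-name> [output-file] [storage-class] [size]
write_pvc_manifest() {
  pvc_name=$1
  out_file=${2:-crawler-pvc-portworx.yaml}
  storage_class=${3:-portworx-gp3-sc}
  size=${4:-10Gi}
  # Refuse to write a manifest without a claim name.
  [ -n "$pvc_name" ] || { echo "usage: write_pvc_manifest <pvc-name>" >&2; return 1; }
  cat > "$out_file" <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: $pvc_name
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: $size
  storageClassName: $storage_class
EOF
}
```

For example, `write_pvc_manifest jdoe-pvc-portworx` creates crawler-pvc-portworx.yaml in the current directory.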

Enter the following command to create the persistent volume claim:

oc create -f crawler-pvc-portworx.yaml

A message is displayed:

persistentvolumeclaim/jdoe-pvc-portworx created
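
Before you mount the claim, you might want to confirm that it is bound to a volume. The following sketch polls the claim's phase with `oc get pvc`. The function names, the retry count, and the 2-second delay are assumptions chosen for illustration, and the commands run against your current project.

```shell
# Prints the phase of a persistent volume claim, for example Pending or Bound.
pvc_phase() {
  oc get pvc "$1" -o jsonpath='{.status.phase}'
}

# Polls until the claim is Bound. Gives up after <attempts> checks
# (default 30, an arbitrary choice), pausing 2 seconds after each one.
wait_for_pvc_bound() {
  pvc_name=$1
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    [ "$(pvc_phase "$pvc_name")" = "Bound" ] && return 0
    attempts=$((attempts - 1))
    sleep 2
  done
  echo "PVC $pvc_name is not Bound" >&2
  return 1
}
```

For example, `wait_for_pvc_bound jdoe-pvc-portworx` returns successfully as soon as the claim reports the Bound phase.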

Enter the following command to mount the persistent volume claim to the crawler pod:
```
oc patch wd wd --type=merge \
--patch='{"spec": {"ingestion": {"crawler": {"mount": {"enabled": true, "persistentVolumeClaimName": "<name-of-portworx-pvc>" } } } } }'
```
Replace <name-of-portworx-pvc> with the name of your dynamic Portworx persistent volume claim. For example, jdoe-pvc-portworx.
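
Because the patch is a JSON string inside shell quotes, a mistyped quote or brace is easy to introduce when you substitute the name by hand. The following sketch builds the same patch from a function argument. The function name `build_mount_patch` is hypothetical, and the name check is deliberately loose: it only rejects empty names and characters other than lowercase letters, digits, hyphens, and periods.

```shell
# Hypothetical helper: prints the merge patch used above for a given PVC name.
build_mount_patch() {
  case "$1" in
    ''|*[!a-z0-9.-]*)
      echo "invalid PVC name: $1" >&2
      return 1 ;;
  esac
  printf '{"spec": {"ingestion": {"crawler": {"mount": {"enabled": true, "persistentVolumeClaimName": "%s"}}}}}\n' "$1"
}
```

The output can then be passed to the command from this step, for example `oc patch wd wd --type=merge --patch="$(build_mount_patch jdoe-pvc-portworx)"`.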
Enter the following command to copy the files that you want to crawl to your dynamic Portworx persistent volume claim.

You only need to run this command one time against one of the existing crawler pods. The persistent volume claim is shared among all crawler and ingestion-api pods. Replace the variables in the command with the appropriate information.
```
oc rsync <path-to-local-file-system-folder> <crawler-pod>:/mnt
```

You mounted the persistent volume claim (PVC) and copied the files that you want to crawl to the PVC.
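
The copy step can be wrapped in a small script, sketched below. Both function names are hypothetical. The sketch also assumes that crawler pods can be recognized by the string "crawler" in their names, so review the list before you pick a pod.

```shell
# Lists pod names in the current project that contain "crawler".
# Assumption: crawler pods are recognizable by name.
list_crawler_pods() {
  oc get pods -o name | grep crawler | sed 's|^pod/||'
}

# Copies a local directory to /mnt on the given pod.
# Arguments: <local-directory> <crawler-pod>
copy_to_crawler() {
  src_dir=$1
  pod=$2
  [ -d "$src_dir" ] || { echo "not a directory: $src_dir" >&2; return 1; }
  [ -n "$pod" ] || { echo "no crawler pod given" >&2; return 1; }
  oc rsync "$src_dir" "$pod:/mnt"
}
```

As with rsync, whether the source path ends with a slash is expected to decide whether the directory itself or only its contents are copied, so check the resulting layout under /mnt before you enter a path in the collection settings.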

Connecting to a local file system data source

From your Discovery project, complete the following steps:

From the navigation pane, choose Manage collections.
Click New collection.
Click Local File System, and then click Next.
Name the collection.
If the language of the documents that you want to crawl is not English, select the appropriate language.

For a list of supported languages, see Language support.
Optional: Change the synchronization schedule.

For more information, see Crawl schedule options.
In the Specify what you want to crawl section, enter the file path that you want to crawl in the Path field, and then click Add.

The file path is case-sensitive. Remember, only files in the /mnt directory or one of its subdirectories can be accessed by the crawler.
Optionally, add more file paths.
If you want to limit the types of files to add to the collection, you can list the file extensions for file types to either include or exclude.

For a list of supported file types, see Supported file types.

Support for this option was added with the 4.7.0 release.
If you want the crawler to extract text from images in documents, expand More processing settings, and set Apply optical character recognition (OCR) to On.

When OCR is enabled and your documents contain images, processing takes longer. For more information, see Optical character recognition.
Click Finish.

The collection is created quickly. It takes more time for the data to be processed as it is added to the collection.

If you want to check the progress, go to the Activity page. From the navigation pane, click Manage collections, and then click to open the collection.

Prerequisite steps for NFS storage

Choose one of the following methods to enable the crawler pod to access the file system:

Configure an external NFS server
Configure dynamic provisioning with an NFS storage class

Configuring an external NFS server

If the local file system files or folders that you want to crawl are stored in an external Network File System (NFS), you can use the external NFS server to create the persistent volume claim.

Create a file named crawler-pv-nfs.yaml with the following content:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: <persistent-volume-name>
  labels:
    pv-name: <persistent-volume-name>
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: <NFS server hostname or IP address>
    path: <Path of NFS exported folder>

Replace both references to <persistent-volume-name> with the name of your persistent volume. For example, jdoe-nfs-pv. Also replace the server and path placeholders with the details of your external NFS server.

Enter the following command to create the persistent volume:
```
oc create -f crawler-pv-nfs.yaml
```
The following message is displayed:
```
persistentvolume/jdoe-nfs-pv created
```

Create a file named crawler-pvc-nfs.yaml with the following content:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: <persistent-volume-claim-name>
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      pv-name: <persistent-volume-name>

Replace the following variables:

<persistent-volume-claim-name>: Specify the name of your persistent volume claim. For example, jdoe-nfs-pvc.
<persistent-volume-name>: Specify the name of your persistent volume. For example, jdoe-nfs-pv.
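
The claim finds the volume through its selector, so the pv-name label in crawler-pv-nfs.yaml and the pv-name selector value in crawler-pvc-nfs.yaml must be identical. The following sketch compares the two values before you create the claim. The function names are hypothetical, and the sketch is a plain text match on the first pv-name line in each file, not a YAML parser.

```shell
# Prints the value of the first "pv-name:" entry in a manifest file.
pv_name_label() {
  awk '$1 == "pv-name:" { print $2; exit }' "$1"
}

# Succeeds when both files carry the same, non-empty pv-name value.
# Arguments: <pv-manifest> <pvc-manifest>
labels_match() {
  pv_label=$(pv_name_label "$1")
  pvc_selector=$(pv_name_label "$2")
  [ -n "$pv_label" ] && [ "$pv_label" = "$pvc_selector" ]
}
```

For example, `labels_match crawler-pv-nfs.yaml crawler-pvc-nfs.yaml || echo "pv-name values differ"`.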

Enter the following command to create the persistent volume claim:
```
oc create -f crawler-pvc-nfs.yaml
```
The following message is displayed:
```
persistentvolumeclaim/jdoe-nfs-pvc created
```
Enter the following command to mount the persistent volume claim to the crawler pod.

This command also mounts the persistent volume claim to all ingestion-api pods. Replace <persistent-volume-claim-name> with the name of your persistent volume claim. For example, jdoe-nfs-pvc.
```
oc patch wd wd --type=merge \
--patch='{"spec": {"ingestion": {"crawler": {"mount": {"enabled": true, "persistentVolumeClaimName": "<persistent-volume-claim-name>" } } } } }'
```
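
To confirm that the patch was applied, you can read the value back from the same resource. The following sketch uses the field path from the patch above. The function names are hypothetical, and the sketch checks only the stored setting, not whether the pods have already picked up the mount.

```shell
# Prints the PVC name that is currently set in the crawler mount settings.
crawler_mount_pvc() {
  oc get wd wd -o jsonpath='{.spec.ingestion.crawler.mount.persistentVolumeClaimName}'
}

# Succeeds when the configured PVC name equals the expected one.
mount_matches() {
  [ "$(crawler_mount_pvc)" = "$1" ]
}
```

For example, `mount_matches jdoe-nfs-pvc && echo "mount is configured"`.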

Configuring dynamic provisioning with an NFS storage class

If you want to crawl your local file system files or folders but you do not want to prepare an extra NFS server to store those files or folders, you can configure dynamic storage by using an NFS storage class.

For more information about storage providers that Discovery supports and for storage comparisons, see Storage considerations.

Before you complete this task, copy the files that you want to crawl to the Discovery cluster that you are working on. If you have multiple Discovery clusters, you must copy the files along with the crawler-pvc-dynamic.yaml file that you create in this task to each cluster.

Complete the following steps:

Enter the following command to check the storage class name of the NFS provisioner:

oc get storageclass

You might see output similar to the following:

NAME        PROVISIONER                                     RECLAIMPOLICY  VOLUMEBINDINGMODE  ALLOWVOLUMEEXPANSION  AGE
nfs-client  cluster.local/innocence-nfs-client-provisioner  Delete         Immediate          true                  177m
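
On a cluster with many storage classes, you can filter this listing by provisioner. The following sketch reads the tabular output from standard input and prints the names of the classes whose provisioner contains a given string. The function name is hypothetical, and the sketch assumes the column layout shown above, with an optional "(default)" marker after the name.

```shell
# Reads "oc get storageclass" output on stdin and prints the names of
# storage classes whose PROVISIONER column contains the given string.
storageclasses_by_provisioner() {
  awk -v pat="$1" '
    NR > 1 {
      # The default class is listed as "<name> (default)", which shifts the columns.
      prov = ($2 == "(default)") ? $3 : $2
      if (index(prov, pat) > 0) print $1
    }'
}
```

For example, `oc get storageclass | storageclasses_by_provisioner nfs` prints nfs-client for the listing above.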

Create a file that is named crawler-pvc-dynamic.yaml and add the following content to it:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: <name-of-dynamic-pvc>
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-client

Replace <name-of-dynamic-pvc> with the name of your dynamic NFS persistent volume claim. For example, jdoe-dynamic-pvc.

Enter the following command to create the persistent volume claim:
```
oc create -f crawler-pvc-dynamic.yaml
```
The following message is displayed:
```
persistentvolumeclaim/jdoe-dynamic-pvc created
```
Enter the following command to mount the persistent volume claim to the crawler pod.

This command also mounts the persistent volume claim to all ingestion-api pods.
```
oc patch wd wd --type=merge \
--patch='{"spec": {"ingestion": {"crawler": {"mount": {"enabled": true, "persistentVolumeClaimName": "<name-of-dynamic-pvc>" } } } } }'
```
Replace <name-of-dynamic-pvc> with the name of the dynamic NFS persistent volume claim that you created in the previous step. For example, jdoe-dynamic-pvc.
Enter the following command to copy the files that you want to crawl to your dynamic NFS persistent volume claim.

You need to run this command only one time against one of the existing crawler pods. The persistent volume claim is shared among all crawler and ingestion-api pods. Replace the variables in the command with the appropriate information.
```
oc rsync <path-to-local-file-system-folder> <crawler-pod>:/mnt
```

You mounted the persistent volume claim (PVC) and copied all of the files that you want to crawl to the PVC.
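
As a quick check after the copy, you can compare what you have locally with what the pod sees under /mnt. The following sketch counts the local files and lists the remote directory. The function names are hypothetical, and the remote listing assumes that the `ls` command is available in the crawler container.

```shell
# Prints the number of regular files below a local directory.
local_file_count() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Lists everything below /mnt on the given pod.
# Assumption: ls exists in the container image.
remote_listing() {
  oc exec "$1" -- ls -R /mnt
}
```

For example, run `local_file_count <path-to-local-file-system-folder>` and `remote_listing <crawler-pod>`, and then compare the results by eye.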