Uploading files to Elasticsearch
A few Elasticsearch features allow indexes to read from files on the file system, so IBM Cloud® Databases for Elasticsearch allows you to upload files to your deployment. The files are stored at a known location, and Elasticsearch is configured so that it is allowed to read files from the location.
Files that are uploaded to your deployment use disk resources, both in the index and on the file system. Make sure that you scale your deployment before uploading files.
Basic process
- You base64-encode the file on the client side.
- The base64 strings are stored as documents in an index named ibm_file_sync in your Elasticsearch deployment.
- You trigger a file sync from the Cloud Databases API.
- All nodes in your Elasticsearch cluster download the file contents from the index, decode the base64, and restore the files on the deployment's disk in the /data/ibm_file_sync/current directory.
- At regular intervals, and on restarts, the files are resynced to ensure that they are present on all nodes.
- Files that are on disk but not in the index are deleted. You can delete files from disk by removing them from the index.
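As a minimal sketch of the last step, removing a file's document from the index could look like the following. It uses the standard Elasticsearch delete-document request in its Elasticsearch 7 URL form (on Elasticsearch 6 the _doc segment would be the files type). The helper names doc_url and delete_file, and the ES_URL variable, are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build the document URL for a file name (Elasticsearch 7 form).
doc_url() {
  local base="$1" name="$2"
  printf '%s/ibm_file_sync/_doc/%s' "$base" "$name"
}

# Remove one file's document from the index.
# ES_URL is a placeholder for the https connection string of your deployment.
delete_file() {
  curl -sS -X DELETE "$(doc_url "$ES_URL" "$1")"
}

# Show the request that would be sent, using a placeholder host.
echo "DELETE $(doc_url 'https://user:password@host:port' 'README1.md')"
```

After the document is gone from the index, the file is removed from disk on the next sync run.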
The index in Elasticsearch is ibm_file_sync.
The location of the files on disk is /data/ibm_file_sync/current.
In Elasticsearch 7, the API removes doc types, so the index mapping and the document URLs differ from Elasticsearch 6. Refer to the Elasticsearch documentation for more details. Keep this in mind when upgrading from earlier versions.
Uploading the files to the index
The structure of the documents in the index is as follows: name is the file name, blob is the base64-encoded file contents, and md5 is an optional hash value over the file contents. The recommended mapping for the index depends on the Elasticsearch version.
For Elasticsearch 6:
curl -X PUT "https://user:password@host:port/ibm_file_sync" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "files": {
      "properties": {
        "name": {
          "type": "text"
        },
        "blob": {
          "type": "binary"
        },
        "md5": {
          "type": "text"
        }
      }
    }
  }
}'
For Elasticsearch 7 (note the removal of the files section):
curl -X PUT "https://user:password@host:port/ibm_file_sync" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "blob": {
        "type": "binary"
      },
      "md5": {
        "type": "text"
      }
    }
  }
}'
The URL is the https connection string from your deployment.
To use the index, encode the file contents as base64. To encode an example file README.md in bash, run ENC=$(base64 -w 0 README.md). Then, build a checksum over the content with HASH=$(md5sum README.md).
On each sync run, the download function compares the hash values. If the values are unchanged since the last sync, no new download is attempted. If any document in the index has no md5 value, all downloads are attempted again.
Next, upload the document to the index. The file name is also supplied in the URL, where it becomes the document ID.
For Elasticsearch 6:
curl -X PUT "https://user:password@host:port/ibm_file_sync/files/README1.md" -H 'Content-Type: application/json' -d'
{
  "name": "README1.md",
  "blob": '"\"$ENC\""',
  "md5": '"\"$HASH\""'
}'
For Elasticsearch 7 (note that only the URL changes):
curl -X PUT "https://user:password@host:port/ibm_file_sync/_doc/README1.md" -H 'Content-Type: application/json' -d'
{
  "name": "README1.md",
  "blob": '"\"$ENC\""',
  "md5": '"\"$HASH\""'
}'
You can verify the uploaded data.
For Elasticsearch 6:
curl "https://user:password@host:port/ibm_file_sync/files/README1.md?pretty"
For Elasticsearch 7:
curl "https://user:password@host:port/ibm_file_sync/_doc/README1.md?pretty"
If everything went smoothly, the returned data looks like this (shortened) example. The "md5" field can contain a file name alongside the hash, because md5sum prints the file name after the digest.
For Elasticsearch 6:
{
  "_index" : "ibm_file_sync",
  "_type" : "files",
  "_id" : "README1.md",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "README1.md",
    "blob" : "IyBF ... KWBgCg==",
    "md5" : "270f60e62d3d37add3702ced7f6969a1 README.md"
  }
}
For Elasticsearch 7:
{
  "_index" : "ibm_file_sync",
  "_id" : "README1.md",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "name" : "README1.md",
    "blob" : "IyBF ... KWBgCg==",
    "md5" : "270f60e62d3d37add3702ced7f6969a1 README.md"
  }
}
Syncing files to disk
After the files are uploaded to the index, they can be synced to the disk. Call the /elasticsearch/file_syncs endpoint from the Cloud Databases API.
curl -X POST \
https://api.{region}.databases.cloud.ibm.com/v4/ibm/deployments/{id}/elasticsearch/file_syncs \
-H 'authorization: Bearer <token>'
The region is the region that your deployment is in, and the id part of the URL is the deployment's CRN, which must be URL-encoded. More information is in the API Reference.
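A CRN contains colons and a slash, so it cannot be pasted into the URL unchanged. The following is a minimal sketch of percent-encoding it in pure bash. The urlencode function is a hypothetical helper written for this example, and the CRN shown is a made-up value for illustration only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Percent-encode every character outside the RFC 3986 unreserved set.
# Assumes ASCII input, which holds for a CRN.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;
    esac
  done
  printf '%s' "$out"
}

# Hypothetical CRN for illustration only; copy the real one from your deployment.
CRN='crn:v1:bluemix:public:databases-for-elasticsearch:us-south:a/0123abcd:4567efgh::'
ENCODED_ID=$(urlencode "$CRN")
echo "$ENCODED_ID"
```

The encoded value then takes the place of {id} in the file_syncs URL.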
The call starts the sync and returns a task so that you can monitor its progress. After the returned task finishes, the contents of the index are present on all the nodes in your cluster.
Any number of files can be uploaded and synced. The contents of the files are not validated. Ensure that they can be processed by Elasticsearch.
Using the files
Elasticsearch features that use files on the file system do so by accepting the path to the file when defining the index. An uploaded file example.txt is at /data/ibm_file_sync/current/example.txt. This list contains examples and is not exhaustive.