
Use machine learning models with Elasticsearch to tag content

Objectives

Databases for Elasticsearch supports machine learning workloads. In this tutorial, you learn how to provision a machine learning model to a Databases for Elasticsearch instance and then use it to extract meaningful additional information from a test data set. Only basic knowledge of terminal commands is required to follow along.

Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving in accuracy. By using statistical methods, algorithms are trained to make classifications or predictions, and to uncover key insights in data mining projects.

These learning algorithms are known as “models”. In this tutorial, we use a pre-built Natural Language Processing (NLP) model, which extracts meaning out of sentences in written (Natural) language. Specifically, we use the distilbert-base-uncased-finetuned-conll03-english model that tries to identify the names of people, locations, and organizations within text.
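For example (an illustration only, not the model's literal output format), given the following sentence, a named entity recognition model of this kind is expected to pick out the people, organizations, and locations it mentions:

Input:    "Rishi Sunak spoke to the BBC in London on Tuesday."
Entities: Rishi Sunak → PER, BBC → ORG, London → LOC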

Many other models specialize in analyzing other forms of data, like text extraction from images, speech conversion to text, or object identification in images. For a full list of Elastic stack supported models, see Compatible third party NLP models. Models can be trained on domain-specific knowledge, but the training of new models is beyond the scope of this tutorial.

This tutorial guides you through the process of:

  • Provisioning a Databases for Elasticsearch instance

  • Uploading a machine learning model

  • Uploading a data set of headlines and summaries from news articles

  • Passing the data set through the NLP model

  • Querying the augmented data to see the model's results.

See the other tutorials in this Elasticsearch machine learning series.

Getting productive

To begin the provisioning process, install the productivity tools that this tutorial relies on: git, Terraform, Python 3, jq, and curl.

Obtain an API key to provision infrastructure to your account

Create an IBM Cloud API key so that Terraform can provision infrastructure into your account. You can create up to 20 API keys.

For security reasons, the API key is only available to be copied or downloaded at the time of creation. If the API key is lost, you must create a new API key.
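If you already use the IBM Cloud CLI and are logged in, a command along the following lines creates a key and saves it to a file; the key name, description, and file name here are just examples:

ibmcloud iam api-key-create nlp-tutorial-key -d "API key for the Elasticsearch NLP tutorial" --file nlp-tutorial-key.json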

Clone the project

Clone the project from the GitHub repository.

git clone https://github.com/IBM/elasticsearch-nlp-ml-tutorial.git
cd elasticsearch-nlp-ml-tutorial/terraform

Install the infrastructure

Provision your Databases for Elasticsearch instance.

  1. On your machine, create a file that is named terraform.tfvars, with the following fields:

    ibmcloud_api_key = "<your_api_key_from_step_1>"
    region = "<your_region>"
    es_password  = "<make_up_a_password>"
    

    The terraform.tfvars file contains values that you might want to keep secret, so it is ignored by the Git repository.

  2. Install the infrastructure with the following command:

    terraform init
    terraform apply --auto-approve
    

    The Terraform script outputs the configuration data that is needed to run the application, so copy it into the scripts folder (a trimmed example of the resulting file is shown after the command):

    terraform output -json >../scripts/config.json
    cd ../scripts
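
    The config.json file uses the standard terraform output -json layout, so each value that the later scripts read sits under a value key. A trimmed, hypothetical example (your host, port, and password will differ):

    {
      "es_host": { "value": "abc123.databases.appdomain.cloud" },
      "es_port": { "value": "31234" },
      "es_password": { "value": "<your_password>" }
    }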
    

Install Eland

Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch. Install it with a command like:

   python3 -m pip install 'eland[pytorch]'
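The upload.sh script in the repository handles the model import for you. For reference, Eland's eland_import_hub_model command is the usual way to push a Hugging Face model into an Elasticsearch cluster; a sketch of such an invocation follows, where the connection string, the Hugging Face namespace, and the choice of flags are assumptions rather than what upload.sh literally runs:

eland_import_hub_model \
  --url "https://admin:<password>@<host>:<port>" \
  --hub-model-id elastic/distilbert-base-uncased-finetuned-conll03-english \
  --task-type ner \
  --start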

Upload and analyze data

This tutorial uses a small data set of 132 articles that are obtained from the BBC News and Guardian websites through their RSS feeds.

These articles have been transformed into a JSON file that is formatted as required by the Elasticsearch bulk API.
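For reference, the bulk format pairs an action line with each document, one JSON object per line. A minimal sketch, using the test index that the tutorial creates and hypothetical field names (the repository's file may name its fields differently):

{ "index": { "_index": "test" } }
{ "id": "001", "headline": "Example headline", "summary": "Example summary text." }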

Run the upload.sh script, which does the following:

  • Uploads the NLP model to Elasticsearch.

  • Creates a data processing pipeline in Elasticsearch that takes any incoming data and analyzes it for meaningful terms (a sketch of such a pipeline follows this list).

  • Uploads the pre-formatted data to Elasticsearch and passes it through the pipeline for analysis.
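In Elasticsearch terms, the pipeline in the second step is an ingest pipeline that contains an inference processor referencing the uploaded model. A minimal sketch, using the $ES connection string that is exported in the next step and hypothetical pipeline and model IDs (the names and processors that upload.sh actually creates may differ):

curl -kX PUT -H "Content-Type: application/json" "$ES/_ingest/pipeline/ner-demo" -d '{
  "processors": [
    {
      "inference": {
        "model_id": "distilbert-base-uncased-finetuned-conll03-english",
        "target_field": "ml"
      }
    }
  ]
}'

The real pipeline is expected to be more elaborate, for example to produce the tags object that is described later; the sketch shows only the core inference step.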

Because this is a demo, the scripts connect to your database without verifying its TLS certificate (the -k flag on the curl commands). For production, use properly secured connections.

To run the script, make sure you are in the scripts directory and use the command:

ES_PASSWORD=`cat config.json | jq -r .es_password.value`
ES_PORT=`cat config.json | jq -r .es_port.value`
ES_HOST=`cat config.json | jq -r .es_host.value`
export ES="https://admin:${ES_PASSWORD}@${ES_HOST}:${ES_PORT}"
./upload.sh
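Optionally, verify the result with two standard Elasticsearch endpoints: the first reports the state of the deployed trained models, and the second counts the documents in the test index (it should report 132):

curl -k "$ES/_ml/trained_models/_stats" | jq .
curl -k "$ES/test/_count" | jq .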

Query your data

You now have an index that is named test that has 132 records. Each of these records contains the news data (article ID, headline and summary) and an additional object called tags. This is where the machine learning model has inserted any people (PER), locations (LOC) or organizations (ORG) that it has found in the text.

Each record also contains an object that is called ml, which reveals more about how the model reached its conclusions. For example, it records the probability that the model assigned to each of the terms it found.

Use this query to retrieve a single record to inspect:

curl -k "$ES/test/_search?size=1" | jq .

The machine learning model has generated valuable information. If you run a news website, for example, you might build pages of tag-based content: when a user visits a page like www.mynewssite.com/tag/rishi-sunak, your system only has to search the index of news articles by that tag to retrieve the articles that mention Rishi Sunak:

curl -kX POST -d@body.json -H "Content-Type: application/json" "$ES/test/_search" | jq .

There is a body.json file in the scripts directory that you can play around with to make different searches.
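The exact query in body.json may differ, but a minimal tag search of the kind just described could look like the following, assuming the tags.PER field name from the earlier sketch:

{
  "query": {
    "match": {
      "tags.PER": "Rishi Sunak"
    }
  }
}

Edit the field or the search term and rerun the curl command above to search for different people, places, or organizations.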

Conclusion

This tutorial shows how to use Databases for Elasticsearch to harness the power of machine learning to generate valuable additional information from your data. We hope you can use it as a springboard to explore other ways to augment and create value from your data.

To stop incurring charges, don't forget to remove all your deployed infrastructure. In your terraform directory, use the command:

terraform destroy