Indexing and querying

The Indexing and querying document is the second best practice document in the series. It covers the following best practices:

  • How to weigh the tradeoffs between emitting data into a view and leaving it out.
  • Why you must never rely on IBM Cloudant Query's ability to query without creating explicit indexes.
  • Why you must limit the number of fields with IBM Cloudant Search (or IBM Cloudant Query indexes of type text).
  • How to manage design documents.
  • Why partitioned queries are faster and cheaper.
  • How to use the primary index as a free search index.

For more information, see Data modeling or IBM Cloudant in practice.

The content in this document was originally written by Stefan Kruger as a Best and worst practice blog post on 21 November 2019.

Understand the tradeoffs in emitting data or not into a view

Because the document that a view row references is always available by using include_docs=true, you can emit only the key, as in the following example, and still allow lookups on indexed_field:

emit(doc.indexed_field, null);

This example has the following advantages and disadvantages:

  • The index is compact. A small index is good because index size contributes to storage costs.
  • The index is robust. Since the index does not store the document, you can access any field without thinking ahead about what to store in the index.
  • The disadvantage is that getting the document back is more costly than the alternative of emitting data into the index itself. First, the database has to look up the requested key in the index and then read the associated document. Also, if you’re reading the whole document, but need only a single field, you’re making the database read and transmit data that you don’t need.

This approach also introduces a potential race condition. The document might change, or be deleted, between the index read and the document read (although this is unlikely in practice).

Emitting data into the index (a so-called “projection” in relational algebra terms) means that you can fine-tune the exact subset of the document that you need. In other words, you don’t need to emit the whole document. Emit a cut-down object that contains only the data your app needs, for example:

emit(doc.indexed_field, {name: doc.name, dob: doc.dob});

The tradeoff is that if you change your mind about which fields to emit, the index must be rebuilt.
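The two approaches can be compared side by side with a small local sketch. It is not how IBM Cloudant runs views: the runMap helper is a hypothetical stand-in for the indexer, emit is passed in as a parameter rather than provided as a global, and the sample document and its biography field are invented for illustration.

```javascript
// Hypothetical sample document; only indexed_field, name, and dob
// come from the emit() examples in this section.
const sampleDoc = {
  _id: "person-001",
  indexed_field: "smith",
  name: "Ann Smith",
  dob: "1980-02-01",
  biography: "A long field that the app rarely needs.",
};

// Local stand-in for the view indexer (an assumption for illustration):
// calls the map function for each document and collects the emitted rows.
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows;
}

// Option 1: emit only the key. Reads need include_docs=true to get the data.
const keyOnlyMap = (doc, emit) => {
  if (doc.indexed_field) emit(doc.indexed_field, null);
};

// Option 2: emit a projection. Reads can be served from the index alone.
const projectionMap = (doc, emit) => {
  if (doc.indexed_field) emit(doc.indexed_field, { name: doc.name, dob: doc.dob });
};

const keyOnlyRows = runMap(keyOnlyMap, [sampleDoc]);
const projectionRows = runMap(projectionMap, [sampleDoc]);

console.log(JSON.stringify(keyOnlyRows));
console.log(JSON.stringify(projectionRows));
```

The key-only rows carry no value, so the index stays small, while the projection rows carry just name and dob and leave the large biography field out of the index.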

IBM Cloudant Query’s JSON indexes use views this way under the hood. IBM Cloudant Query can be a convenient replacement for some types of view queries, but not all. Do take the time to understand when to use one or the other.

  • IBM Cloudant Query docs
  • IBM Cloudant guide to using views
  • Performance implications of using include_docs

Never rely on the default behavior of IBM Cloudant Query’s no-indexing

It’s tempting to rely on IBM Cloudant Query's ability to query without creating explicit indexes. This practice is costly in terms of performance, as every lookup is a full scan of the database rather than an indexed lookup. If your data is small, this full-scan lookup doesn’t matter, but as the data set grows, performance becomes a problem for you, and for the cluster as a whole. It is likely that we will limit this facility soon. The IBM Cloudant Dashboard provides a method for creating indexes in an easy way.
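As an alternative to the Dashboard, an index can also be created with a POST to the database's _index endpoint. The following is a minimal sketch that only builds the request and does not send it. The base URL, the orders database, the index name, and the customer_id and order_date fields are hypothetical placeholders, and authentication is left out.

```javascript
// Builds (but does not send) a request that creates a JSON index.
// baseUrl, dbName, indexName, and the field names are hypothetical placeholders.
function buildCreateIndexRequest(baseUrl, dbName, indexName, fields) {
  return {
    url: `${baseUrl}/${encodeURIComponent(dbName)}/_index`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ index: { fields }, name: indexName, type: "json" }),
    },
  };
}

const createIndexRequest = buildCreateIndexRequest(
  "https://example-account.invalid",
  "orders",
  "by-customer-and-date",
  ["customer_id", "order_date"]
);

// A query whose selector uses the indexed fields, so that the index can serve it.
const ordersQuery = {
  selector: { customer_id: "c42", order_date: { $gte: "2019-01-01" } },
  limit: 25,
};

console.log(createIndexRequest.url);
console.log(createIndexRequest.options.body);
```

To send the request, pass createIndexRequest.url and createIndexRequest.options to an HTTP client such as fetch, with the credentials that your account requires.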

Creating indexes and crafting IBM Cloudant Queries that take advantage of them requires some flair. To identify which index is being used by a particular query, send a POST to the _explain endpoint for the database, with the query as data.
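The _explain call described above might be sketched as follows. The code only builds the request and inspects a response; it sends nothing. The base URL, database name, and query are hypothetical, and the sample response is hand-written for illustration rather than captured from a real service. The sketch assumes that the response carries an index object with name and type fields, and that a name of _all_docs signals a fallback to the primary index.

```javascript
// Builds (but does not send) a POST to the _explain endpoint, with the query as data.
// baseUrl and dbName are hypothetical placeholders; authentication is omitted.
function buildExplainRequest(baseUrl, dbName, query) {
  return {
    url: `${baseUrl}/${encodeURIComponent(dbName)}/_explain`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(query),
    },
  };
}

// Summarizes which index a parsed _explain response reports.
// Assumption: a name of "_all_docs" means no secondary index was chosen.
function describeChosenIndex(explainResponse) {
  const index = explainResponse.index || {};
  return {
    name: index.name,
    type: index.type,
    fallsBackToAllDocs: index.name === "_all_docs",
  };
}

const explainRequest = buildExplainRequest("https://example-account.invalid", "orders", {
  selector: { customer_id: "c42" },
});

// Hand-written sample response, trimmed to the fields used here.
const sampleExplainResponse = {
  dbname: "orders",
  index: { ddoc: "_design/by-customer", name: "by-customer", type: "json" },
};

const chosenIndex = describeChosenIndex(sampleExplainResponse);
console.log(explainRequest.url, chosenIndex);
```

Running a check like this in a test suite is one way to notice when a query silently stops using the index that was built for it.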

For more information, see IBM Cloudant Query docs.

In IBM Cloudant Search (or IBM Cloudant Query indexes of type text), limit the number of fields

IBM Cloudant Search and IBM Cloudant Query indexes of type text (both of which are Apache Lucene under the hood) provide you with a way to index any number of fields into the index. This capability is sometimes abused, occasionally deliberately but mostly by mistake. Plan your indexing to comprise only the fields required by your actual queries. Indexes take up space and can be costly to rebuild if the number of indexed fields is large.

A related question is which fields you store in an IBM Cloudant Search index. Stored fields are retrieved in the query without doing include_docs=true, so the tradeoff is similar to the one described in the Understand the tradeoffs in emitting data or not into a view section. For more information, see IBM Cloudant Search docs.
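A search index function that follows this advice indexes only the fields that queries need and stores only the fields that results must return. The following sketch runs such a function locally. The runIndexFunction helper is a hypothetical stand-in for the search indexer, index is passed in as a parameter rather than provided as a global, and the title, description, and internal_notes fields are invented for illustration.

```javascript
// Local stand-in for the search indexer (an assumption for illustration):
// records every index(name, value, options) call made for one document.
function runIndexFunction(indexFn, doc) {
  const fields = [];
  indexFn(doc, (name, value, options = {}) => {
    fields.push({ name, value, store: options.store === true });
  });
  return fields;
}

// Index two fields only. Store the title so that results can show it
// without include_docs=true; the description is searchable but not stored.
const searchIndexFn = (doc, index) => {
  if (typeof doc.title === "string") index("title", doc.title, { store: true });
  if (typeof doc.description === "string") index("description", doc.description);
};

// Hypothetical document with a field that no query needs.
const productDoc = {
  _id: "product-17",
  title: "Blue kettle",
  description: "A 1.7 litre kettle",
  internal_notes: "Not needed by any query, so it is not indexed.",
};

const indexedFields = runIndexFunction(searchIndexFn, productDoc);
console.log(JSON.stringify(indexedFields));
```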

Design document management requires some flair

As your data set grows and your number of views goes up, sooner or later you need to think about how you organize your views across design documents. A single design document can be used to form a so-called view group: a set of views that belong together by some metric that makes sense for your use case. If your views are static, grouping them gives related queries semantically similar view query URLs. It’s also more performant at index time because the indexer loads each document once and generates multiple indexes from it.

Design documents themselves are read and written by using the same read/write endpoints as any other document. With these endpoints, you can create, inspect, modify, and delete design documents from within your application. However, even small changes to design documents can have significant effects on your database. When you update a design document, all views in it become unavailable until indexing is complete. This lag can be problematic in production. To avoid it, you have to do a crazy design document-swapping dance (see couchmigrate).

In most cases, this process is probably not what you want to have to deal with. As you start out, it is most likely more convenient to have a one-view-per-design document policy.

Also, in case it isn’t obvious, views are code. Views must be subject to the same source code version management processes that you use for the rest of your application code. How to achieve this standard might not be immediately obvious. You could keep the JavaScript snippets under version control and then cut and paste the code into the IBM Cloudant Dashboard to deploy whenever a change occurs. Yes, we all resort to this practice from time to time.

Better ways to do this exist, and this is one reason to use some of the tools that surround the couchapp concept. A couchapp is a self-contained CouchDB web application, a concept that doesn’t see much use nowadays. Several couchapp tools exist to make deploying a couchapp, and crucially its views, easier.

Using a couchapp tool means that you can automate deployment of views as needed, even when not using the couchapp concept itself.
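Even without a couchapp tool, the idea can be sketched in a few lines: keep the map functions in source control as ordinary code and generate the design documents from them. The following example assumes the one-view-per-design-document policy; the view names and the customer_id and status fields are hypothetical, and nothing is sent to a database.

```javascript
// Map functions kept in source control as ordinary code.
// Inside the database, emit is provided by the view server;
// these functions are only serialized here, never called.
const viewSources = {
  by_customer: function (doc) {
    if (doc.customer_id) {
      emit(doc.customer_id, null);
    }
  },
  by_status: function (doc) {
    if (doc.status) {
      emit(doc.status, null);
    }
  },
};

// Builds one design document per view, with the map function stored as a string.
function buildDesignDocs(views) {
  return Object.entries(views).map(([viewName, mapFn]) => ({
    _id: `_design/${viewName}`,
    views: { [viewName]: { map: mapFn.toString() } },
  }));
}

const designDocs = buildDesignDocs(viewSources);
console.log(JSON.stringify(designDocs, null, 2));
```

Each generated object can then be written with the same document endpoints as any other document. When you update an existing design document, include its current _rev, and keep in mind that the update triggers a rebuild of the views it contains.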

Partitioned queries are faster and cheaper

Yes, partitioned queries are faster and cheaper. Opting to create a partitioned database (as opposed to an unpartitioned database) means that IBM Cloudant uses a partition key to decide on which shard each of your documents resides. Documents with the same partition key are on the same database shard. Requests for _all_docs, MapReduce views, IBM Cloudant Query _find queries, and IBM Cloudant Search operations can be directed to a single partition instead of having to interrogate all shards in a “scatter and gather” pattern, which is the case for global queries.

These partitioned queries exercise only one shard of the database, which makes them faster to execute than global queries. For billing purposes, they are classified as “read” requests instead of the more expensive “query” requests, which provides you with more usable capacity from the same IBM Cloudant plan.

Not all data designs lend themselves to a partitioned design, but if your data can be molded into a <partition key>:<document key> pattern, then your application can benefit in terms of performance and cost.
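The difference between a partitioned and a global request is visible in the URL: a partitioned request names the partition key under _partition. The following sketch builds document IDs and request URLs only; the base URL, the readings database, and the sensor and document keys are hypothetical.

```javascript
// Joins the two parts of a partitioned document _id: <partition key>:<document key>.
function partitionedDocId(partitionKey, documentKey) {
  return `${partitionKey}:${documentKey}`;
}

// URL for a request that is directed at a single partition.
// endpoint is, for example, "_all_docs", "_find", or "_design/<ddoc>/_view/<view>".
function partitionQueryUrl(baseUrl, dbName, partitionKey, endpoint) {
  const db = encodeURIComponent(dbName);
  const partition = encodeURIComponent(partitionKey);
  return `${baseUrl}/${db}/_partition/${partition}/${endpoint}`;
}

// URL for the equivalent global request, which has to consult all shards.
function globalQueryUrl(baseUrl, dbName, endpoint) {
  return `${baseUrl}/${encodeURIComponent(dbName)}/${endpoint}`;
}

// Hypothetical values for illustration.
const baseUrl = "https://example-account.invalid";
const readingId = partitionedDocId("sensor-260", "reading-0001");
const partitionFindUrl = partitionQueryUrl(baseUrl, "readings", "sensor-260", "_find");
const globalFindUrl = globalQueryUrl(baseUrl, "readings", "_find");

console.log(readingId);
console.log(partitionFindUrl);
console.log(globalFindUrl);
```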

Treat the primary index as a free search index

A default IBM Cloudant document _id is a 32-character string, encoding 128 bits of random data. The _id attribute is used to construct the database’s primary index, which is used by IBM Cloudant to retrieve documents by _id or ranges of keys when the user supplies a startkey/endkey pair. We can leverage this fact to pack our data into the _id field and use it as a “free” index that can query for ranges of values.

See some examples in the following list:

  • Use time-sortable document IDs so that your documents are sorted into rough date and time order. This sorting makes it easy to retrieve recent additions to the database. For more information, see Time-sortable _ids.
  • Pack searchable data into your _id field, for example, <customerid>~<date>~<orderid> can be used to retrieve data by customer, customer/date, or customer/date/orderid.
  • In a partitioned database, the judicious choice of partition key allows an entire database to be winnowed down to a handful of documents for a known partition key. Make sure that your partitioning schema solves your most common use case.
  • In a partitioned database, you must supply both parts of the _id yourself (auto-generated _ids are not available), so make good use of them. For example, in an IoT application, <sensorid>:<time-sortable-id> allows data to be sorted by sensor and time without a secondary index. Implement this schema with time-boxed databases for best results.
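The <customerid>~<date>~<orderid> pattern above can be turned into startkey/endkey pairs for the primary index. The following sketch builds those parameters and simulates the range locally on a few hypothetical IDs. It assumes the common convention of appending the high character \ufff0 to a prefix to form the end of the range. The simulation uses plain JavaScript string comparison, which only approximates the database's own sort order, so treat it as an illustration for simple ASCII IDs.

```javascript
const SEPARATOR = "~";

// Packs searchable data into an _id: <customerid>~<date>~<orderid>.
function orderDocId(customerId, date, orderId) {
  return [customerId, date, orderId].join(SEPARATOR);
}

// Builds startkey/endkey values that cover every _id starting with the given parts.
// The values are JSON-encoded, as they would be in an _all_docs query string.
function prefixRange(...parts) {
  const prefix = parts.join(SEPARATOR) + SEPARATOR;
  return {
    startkey: JSON.stringify(prefix),
    endkey: JSON.stringify(prefix + "\ufff0"),
  };
}

// Local approximation of a primary index range read (an assumption for illustration).
function simulateRange(ids, range) {
  const start = JSON.parse(range.startkey);
  const end = JSON.parse(range.endkey);
  return ids.filter((id) => id >= start && id <= end).sort();
}

// Hypothetical document IDs.
const orderIds = [
  orderDocId("c42", "2019-11-20", "o1"),
  orderDocId("c42", "2019-11-21", "o2"),
  orderDocId("c43", "2019-11-21", "o3"),
];

const ordersForCustomer = simulateRange(orderIds, prefixRange("c42"));
const ordersForCustomerOnDate = simulateRange(orderIds, prefixRange("c42", "2019-11-21"));

console.log(ordersForCustomer);
console.log(ordersForCustomerOnDate);
```

Because the most significant part of the key comes first, the same _id serves lookups by customer and by customer and date, with no secondary index involved.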