Data modeling

The Data modeling document is the first best practice document in the series. It shows you the following best practices:

What you need to know about your APIs.
How to model your data.
What size documents you must use.
What to avoid.
How to configure your databases.

For more information, see Indexing and querying or IBM Cloudant in practice.

The content in this document was originally written by Stefan Kruger as a Best and worst practice blog post on 21 November 2019.

Understand the API that you are targeting

You can use Java™, Python, Go, Node.js, or some other use-case-specific language or platform. Your language of choice most likely comes with convenient client-side libraries that integrate IBM Cloudant access nicely, following the conventions that you expect for your tools. These libraries are great for programmer efficiency, but they also hide the API from view.

This abstraction is what you want. The whole reason for using a client library is to save yourself repeated, tedious boiler-plating. However, understanding the underlying API is vital when you troubleshoot and report problems. When you report a suspected problem to IBM Cloudant, it helps us help you if you can provide a way for us to reproduce the problem.

This request does not mean cutting and pasting a hefty chunk of your application’s Java™ source verbatim into a support ticket, as we’re probably not able to build it. Also, your client-side code introduces uncertainty as to whether the problem is on your side or our side.

Instead, IBM Cloudant’s support teams usually request the set of API calls that demonstrates the issue, ideally as a set of curl commands that they can run. Adopting this approach to troubleshooting as a rule also makes it easier for you to pinpoint where things are failing. If your code is behaving unexpectedly, try to reproduce the problem by using only direct access to the API.

If you can’t, the problem isn’t with the IBM Cloudant service itself.
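One way to get from application code to a reproducible set of API calls is to render each request as a curl command. The following Python sketch is only an illustration: the account URL, the database name, and the helper name curl_command are hypothetical, and credentials are deliberately left out.

```python
import json
import shlex


def curl_command(method, url, body=None):
    """Render one HTTP request as a curl command line that can be pasted into a ticket."""
    parts = ["curl", "-X", method.upper(), shlex.quote(url)]
    if body is not None:
        # Serialize the body the same way every time so the commands are repeatable.
        parts += [
            "-H", shlex.quote("Content-Type: application/json"),
            "-d", shlex.quote(json.dumps(body, sort_keys=True)),
        ]
    return " ".join(parts)


# Hypothetical account URL and database name, used for illustration only.
base = "https://ACCOUNT.cloudant.example/orders"
steps = [
    curl_command("PUT", base),
    curl_command("POST", base, {"customer_id": 65522389, "order_id": 887865}),
    curl_command("GET", base + "/_all_docs?limit=1"),
]
print("\n".join(steps))
```

Because the output is plain curl, anyone can replay the sequence without building your application.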

If you’re investigating a performance issue, do consult the logs that are provided by IBM Cloud®. If the logs show that your requests are handled quickly by IBM Cloudant, but your application is slow, the root of that problem lies with your client-side application code. See the rule about logging and monitoring.

If you suspect that a problem lies with an officially supported client library, then try to construct a small, self-contained code example that demonstrates the issue. In this self-contained code example, use as few other dependencies as possible. If you’re using Java™, it is helpful to us if you can use a minimal test harness to highlight library issues.

Occasionally, IBM Cloudant receives support tickets that state that “IBM Cloudant is broken because my application is slow” without much in terms of supporting evidence. Nearly always this case can be traced back to issues in the application code on the client side, or misconceptions about how IBM Cloudant works.

Not always, but nearly always.

By understanding the API better, you also gain experience in how IBM Cloudant behaves, especially in terms of performance. If you’re using a client library, you must aim to at least know how to find out which HTTP requests are generated by a specific function call. For more information, see the following websites:

IBM Cloudant API docs
Logging integration
Blog post on logging

Documents must group data that mostly changes together

When you start to model your data, sooner or later, you run into the issue of how your documents should be structured. You know that IBM Cloudant doesn’t enforce any normalization and that it has no transactions of the type you’re used to from, say, Postgres. The temptation can be to cram as much as possible into each document, which would also save on HTTP overhead.

This practice is often a bad idea.

If your model groups information that doesn’t change together, you’re more likely to suffer from update conflicts.

Consider a situation where you have users, each with a set of orders associated with them. One way might be to represent the orders as an array in the user document:

{ // DON'T DO THIS
    "customer_id": 65522389,
    "orders": [ {
      "order_id": 887865,
      "items": [ {
          "item_id": 9982,
          "item_name": "Iron sprocket",
          "cost": 53.0
        }, {
          "item_id": 2932,
          "item_name": "Rubber wedge",
          "cost": 3.0
        }
      ]
    }
  ]
}

To add an order, I need to fetch the complete document, unmarshal the JSON, add the order, marshal the new JSON, and send it back as an update. If I’m the only one doing so, it might work for a while. If the document is being updated concurrently, or being replicated, we are likely to see update conflicts.

Instead, keep orders separate as their own document type, referencing the customer ID. Now the model is immutable. To add an order, I create a new order document in the database, which cannot generate conflicts.

To be able to retrieve all orders for a specific customer, we can employ a view, which we cover later.
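As a minimal sketch of the alternative model, the following Python function splits the nested customer document from the example into one document per order. The function name split_orders and the "type" field are assumptions made for illustration; the other field names come from the example.

```python
def split_orders(customer_doc):
    """Turn one customer document with embedded orders into one document per order."""
    order_docs = []
    for order in customer_doc.get("orders", []):
        order_docs.append({
            "type": "order",  # hypothetical marker so a view can select order documents
            "customer_id": customer_doc["customer_id"],  # reference back to the customer
            "order_id": order["order_id"],
            "items": order["items"],
        })
    return order_docs


nested = {
    "customer_id": 65522389,
    "orders": [{
        "order_id": 887865,
        "items": [
            {"item_id": 9982, "item_name": "Iron sprocket", "cost": 53.0},
            {"item_id": 2932, "item_name": "Rubber wedge", "cost": 3.0},
        ],
    }],
}

order_docs = split_orders(nested)
```

Each resulting document is written once and never updated, so adding an order is a plain create and never a read-modify-write cycle on the customer document.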

Avoid constructs that rely on updates to parts of existing documents, where possible. Bad data models are often hard to change after you’re in production.

The previous pattern can be solved efficiently by using partitioned databases, which are covered in greater detail later.
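In a partitioned database, a document ID has the form partition_key:document_key. The following Python sketch assumes that the customer ID serves as the partition key, so that all orders for one customer land in the same partition. The helper name and the sanity checks are illustrative only and are not a complete statement of the service's validation rules.

```python
def partitioned_id(partition_key, doc_key):
    """Build a document ID of the form 'partition_key:document_key'."""
    partition_key = str(partition_key)
    # Simple sanity checks only: the first colon separates partition from key,
    # so the partition key itself must be non-empty and contain no colon.
    if not partition_key or ":" in partition_key:
        raise ValueError(f"unusable partition key: {partition_key!r}")
    return f"{partition_key}:{doc_key}"


order_id = partitioned_id(65522389, 887865)
print(order_id)
```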

For more information, see the following documentation:

IBM Cloudant guide to data modeling
Database partitions

Keep documents small

IBM Cloudant imposes a max doc size of 1 MB. This limit does not mean that a close-to-1-MB document size is a good idea. On the contrary, if you find you are creating documents that exceed single-digit KB, you probably need to revisit your model. Several things in IBM Cloudant become less performant as documents grow. JSON decoding is costly, for example.
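A rough size check on the client side can flag documents that are drifting away from this guidance. The following Python sketch measures the compact UTF-8 JSON encoding of a document. It assumes that 1 MB means 1,048,576 bytes and uses a hypothetical 10 KB threshold for "exceeds single-digit KB"; the size that the service measures might differ slightly from this estimate.

```python
import json

MAX_DOC_BYTES = 1024 * 1024  # assumption: "1 MB" means 1,048,576 bytes
WARN_DOC_BYTES = 10 * 1024   # hypothetical threshold for "beyond single-digit KB"


def encoded_size(doc):
    """Size in bytes of the document as compact UTF-8 JSON."""
    text = json.dumps(doc, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def size_verdict(doc):
    """Classify a document as 'ok', 'revisit model', or 'too large'."""
    size = encoded_size(doc)
    if size > MAX_DOC_BYTES:
        return "too large"
    if size >= WARN_DOC_BYTES:
        return "revisit model"
    return "ok"


print(size_verdict({"customer_id": 65522389}))
```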

Taken together, the sections Documents must group data that mostly changes together and Keep documents small have a consequence that is worth stressing: a model that relies on updating a document to add data has a hard ceiling of 1 MB, the cut-off for document size. That ceiling isn’t what you want.

Avoid using attachments

IBM Cloudant has support for storing attachments alongside documents, a long-standing feature it inherits from CouchDB. If you use IBM Cloudant as a backend for a web application, you can also store small icons and other static assets such as CSS and JavaScript files with the data.

You must consider a few things before you use attachments in IBM Cloudant today, especially if you’re looking at larger assets such as images and videos:

IBM Cloudant is expensive as a block store.
IBM Cloudant’s internal implementation is not efficient in handling large amounts of binary data.

So, slow and expensive.

IBM Cloudant is acceptable for small assets and occasional use. As a rule, if you need to store binary data alongside IBM Cloudant documents, it’s better to use a separate solution more suited for this purpose. You need to store only the attachment metadata in the IBM Cloudant document. Yes, that means you need to write some extra code to upload the attachment to a suitable block store of your choice. Verify that it succeeded before you store the token or URL to the attachment in the IBM Cloudant document.

With binary data stored elsewhere, your databases are smaller, cheaper, faster, and easier to replicate. For more information, see the following websites:

IBM Cloudant docs on attachments
Detaching IBM Cloudant attachments to Object Storage
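The upload-then-reference flow described earlier in this section can be sketched in Python as follows. Everything named here is an assumption for illustration: the upload callable stands in for whatever object store client you use, and the attachment_refs field and the example URL are hypothetical.

```python
def store_with_external_attachment(doc, name, data, content_type, upload):
    """Upload binary data elsewhere, then return a copy of doc that only references it.

    `upload` is a hypothetical callable (name, data, content_type) -> URL string.
    It is expected to raise, or return a falsy value, when the upload fails.
    """
    url = upload(name, data, content_type)
    if not url:
        # Never record a reference to an asset that was not stored.
        raise RuntimeError(f"upload of {name!r} failed; document left unchanged")
    updated = dict(doc)
    refs = list(updated.get("attachment_refs", []))
    refs.append({
        "name": name,
        "url": url,
        "content_type": content_type,
        "length": len(data),
    })
    updated["attachment_refs"] = refs
    return updated


# Stand-in for a real object store client, for illustration only.
fake_store = {}


def fake_upload(name, data, content_type):
    fake_store[name] = data
    return f"https://objects.example/bucket/{name}"


image = b"\x89PNG..."
doc = store_with_external_attachment(
    {"_id": "product:9982", "item_name": "Iron sprocket"},
    "sprocket.png", image, "image/png", fake_upload,
)
```

The document that goes into the database carries only a few hundred bytes of metadata, regardless of how large the asset is.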

Fewer databases are better than many

If you can, limit the number of databases per IBM Cloudant account to 500 or fewer. While this particular number is not magic (IBM Cloudant can safely handle more), several use cases exist that are adversely affected by large numbers of databases in an account.

The replicator scheduler has a limited number of simultaneous replication jobs that it is prepared to run. As the number of databases grows, the replication latency is likely to increase if you try to replicate everything contained in an account.

The flip side of the same coin is the operational aspect: IBM Cloudant’s operations team relies on replication, too, to move accounts around. By keeping down the number of databases, you help us help you if you need to shift your account from one location to another.

So when must you use a single database and distinguish between different document types by using views, and when must you use multiple databases to model your data? IBM Cloudant can’t federate views across multiple databases. If you have unrelated data that can never be “joined” or queried together, then that data can be a candidate for splitting across multiple databases.

If you have an ever-growing data set (like a log, sensor readings, or other types of time-series), it’s also not a good idea to create a single, ever-growing, massive database. This kind of use case requires time-boxing, which we cover in more detail later.

Avoid the database-per-user anti-pattern like the plague

If you’re building a multi-user service atop IBM Cloudant, it is tempting to let each user store their data in a separate database under the application account. That works well, mostly, if the number of users is small.

Now add the need to derive cross-user analytics. The way that you do that is to replicate all the user databases into a single analytics database. All good. Now suppose that the app suddenly becomes successful, and the number of users grows from 150 to 20,000. You have 20,000 replications just to keep the analytics database current. If you also want to run in an active-active disaster recovery setup, add another 20,000 replications, and the system stops functioning.

Instead, multiplex user data into fewer databases, or shard users into a set of databases or accounts, or both. That way, you do not need to replicate to provide an analytics database, but authentication becomes more complicated as IBM Cloudant provides authentication only at the database level.
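One possible way to shard users into a fixed set of databases is to hash the user ID. The following Python sketch is an illustration under stated assumptions: the shard count of 16, the userdata prefix, and the user_id field are hypothetical choices, not recommendations from IBM Cloudant.

```python
import hashlib


def database_for_user(user_id, shard_count=16, prefix="userdata"):
    """Map a user ID to one of a fixed set of databases, deterministically."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shard_count
    return f"{prefix}_{shard:02d}"


def user_document(user_id, payload):
    """Tag each document with its owner so that one database can hold many users."""
    return {"user_id": user_id, **payload}


db_name = database_for_user("alice@example.com")
doc = user_document("alice@example.com", {"type": "order", "order_id": 887865})
```

With this scheme, 20,000 users occupy 16 databases instead of 20,000. Note that changing shard_count later moves users between databases, so pick the number with growth in mind. Your application must also enforce per-user access itself, because the database no longer does it for you.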

It’s worth stating that the “database-per-user” approach is tempting because IBM Cloudant permissions are “per database”, but it’s not really the users’ fault that this pattern emerged.

Avoid writing custom JavaScript reduce functions

The MapReduce views in IBM Cloudant are awesome. However, with great power comes great responsibility. The map part of a MapReduce view is built incrementally, so shoddy code in the map impacts only indexing time, not query time. The reduce part, unfortunately, runs at query time. IBM Cloudant provides a set of built-in reduce functions that are implemented internally in Erlang. These functions are performant at scale, while your hand-crafted JavaScript reduces are not.

If you find yourself writing reduce functions, stop and consider whether you can reorganize your data so that writing reduce functions isn’t necessary. Or so that you’re able to rely on the built-in reducers.
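To rely on a built-in reducer, name it in the reduce field of the view instead of supplying JavaScript source. The following Python sketch builds such a design document as a plain dictionary and rejects anything that is not a built-in reducer. The design document name, the view name, and the doc.type and doc.total fields are hypothetical; the map function is JavaScript source that is carried as a string.

```python
import json

# Built-in reducer names; anything else is treated as custom JavaScript.
BUILT_IN_REDUCERS = {"_sum", "_count", "_stats", "_approx_count_distinct"}


def view_definition(map_source, reduce_name=None):
    """Build a view definition that uses only a built-in reducer, or none."""
    view = {"map": map_source}
    if reduce_name is not None:
        if reduce_name not in BUILT_IN_REDUCERS:
            raise ValueError(f"not a built-in reducer: {reduce_name!r}")
        view["reduce"] = reduce_name
    return view


design_doc = {
    "_id": "_design/orders",  # hypothetical design document name
    "views": {
        "total_by_customer": view_definition(
            "function (doc) { if (doc.type === 'order') { emit(doc.customer_id, doc.total); } }",
            "_sum",
        )
    },
}

print(json.dumps(design_doc, indent=2))
```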

Views on partitioned databases do not support custom reduces, which is one factor that contributes to the significant query speed-up that such views can offer.

For more information, see IBM Cloudant docs on reduces.

Use time-boxed databases for ever-growing data sets

It’s generally not a good idea to have an ever-growing database in IBM Cloudant. Large databases can be difficult to back up, require “resharding” to maintain good performance as they grow, and suffer from long index build times.

One way of mitigating this problem is to have several smaller databases instead. A common pattern is time-boxed databases: a large data set is split into smaller databases, each representing a time window, for example, a month.

orders_2019_01
orders_2019_02
orders_2019_03

New data is written to this month’s database and queries for historical data can be directed to previous months’ databases. When a month’s data is no longer of interest, it can be archived to Object Storage, the monthly IBM Cloudant database is deleted, and the disk space recovered. For more information, see the following website:

Time-series Data Storage blog
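The monthly naming scheme shown earlier in this section can be derived from a date. The following Python sketch picks the database for a write and lists the databases that a historical query needs to visit. The function names are hypothetical, and the orders prefix follows the example names.

```python
from datetime import date


def database_for(day, prefix="orders"):
    """Name of the monthly database that holds data for the given day."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def databases_between(start, end, prefix="orders"):
    """Names of all monthly databases that cover the period from start to end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1
    return names


print(database_for(date(2019, 1, 15)))
print(databases_between(date(2019, 1, 20), date(2019, 3, 2)))
```

Because the names sort chronologically, it is also easy to spot which databases are old enough to archive and delete.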