Debunking the Myth of Going Schemaless

Once you understand document database fundamentals — that you need a schema, and what good and bad schemas look like — you can choose different options.

Feb 1st, 2022 6:11am by John Page



John Page

John Page is a document database veteran who, after 18 years building full-stack document database technologies for the intelligence community, joined MongoDB. He now builds robots for fun and tests and writes about databases to pay for the robot parts.

Developers everywhere have embraced document databases. But in many cases, it’s for the wrong reasons. Take the current hype around going schemaless — it’s almost too easy to store arbitrary JSON or XML in a document database. But if you expect to filter, modify and retrieve it efficiently, you may be setting yourself up for disappointment. While a document database does allow you to store data without defining what it is, the shape of that data matters if you plan to do more than simply retrieve whole documents by keys. If you’re ignoring schema design and simply storing pre-existing documents, chances are you don’t need a document database, just a simple key-value store.

Document Schema Design Versus Relational Design

Relational databases were designed to give all users a defined, consistent and safe way to interact with data. The way you organized the data had nothing to do with the way it would be used. It couldn’t. After all, who could predict how different users would access the one copy of data that exists as different requirements arose? So normalized schema designs were established where relationships were defined by the data itself rather than how the data would eventually be used. This made data modeling more predictable. Given the same set of data to model, any competent architect creating a schema would come to the same result. But predictability also meant less flexibility.

Document design was created in the 1960s, around the same time as object-oriented programming. But it took decades for computing to evolve to a point where the flexibility of the document model could be appreciated.

With a document database, schema design is based on how the data is accessed rather than the data itself. And it’s the developers who know best how data will be accessed for their applications. It’s impossible to optimize a schema if you don’t have a plan for how users will access the data. Of course, you can just persist the data without a plan for how users will access it. The document model allows that type of flexibility. But you can and should optimize the schema later.

The Benefits of Better Schema

Beginners love document databases because they can persist objects without defining them upfront. With minimal training, almost anyone can build credible applications without understanding document schema design. However, there is a point where knowing how to design your schema correctly allows you to achieve far more, with far less server hardware. With today’s pay-as-you-go cloud pricing, that’s important. A well-designed document schema can increase performance for a given set of hardware by reducing computation, I/O operations and contention between users.

The idea that document databases lack up-front schema enforcement is simply not true. Document databases can enforce schemas just like relational databases. Schemaless design may be common in document databases, but it’s not synonymous with them. A modern document database also has strongly typed data, a rich data manipulation language (DML), multiple compound B-tree-based indexes, ACID transactions and in-database aggregation calculation. And document databases are built on the same kinds of underlying storage engines as Postgres, MySQL and other relational databases.

What really differentiates a document database from relational is the ability to co-locate related data in the atomic unit of storage so a single record is stored contiguously on disk and in memory rather than being broken up into rows and stored independently. Simply put, in a document database, multiple values for an attribute can exist within a single record. If a person has multiple phone numbers, you don’t need a different table to store them. And you don’t need to define individual fields for each number. You can simply have an array of phone numbers or number objects. It’s like embedding rows from one table inside another at the storage layer.

{
  name: "john",
  phones: [
    { type: "cell", number: 4475566218 },
    { type: "cell", number: 4479927716 },
    { type: "voip", number: 17035551234 }
  ]
}
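To make the point about schema enforcement concrete, here is a minimal sketch in plain JavaScript of checking a document like the one above against a declared shape. The names `personSchema` and `validate` are hypothetical and invented for this illustration; they are not a real database API. A real document database applies its own validation rules on the server (MongoDB, for example, supports JSON Schema validators on collections).

```javascript
// Hypothetical schema for the person document: each field maps to a check
// function. This is an illustration only, not a real database API.
const personSchema = {
  name: (v) => typeof v === "string" && v.length > 0,
  phones: (v) =>
    Array.isArray(v) &&
    v.every(
      (p) =>
        p !== null &&
        typeof p === "object" &&
        typeof p.type === "string" &&
        Number.isInteger(p.number)
    ),
};

// Returns the names of fields that are missing or fail their check.
function validate(doc, schema) {
  return Object.keys(schema).filter(
    (field) => !(field in doc) || !schema[field](doc[field])
  );
}

const good = {
  name: "john",
  phones: [
    { type: "cell", number: 4475566218 },
    { type: "voip", number: 17035551234 },
  ],
};
// The phone number here is a string, so the phones check should fail.
const bad = { name: "john", phones: [{ type: "cell", number: "unknown" }] };

console.log(validate(good, personSchema)); // empty array: document accepted
console.log(validate(bad, personSchema)); // lists "phones": document rejected
```

The design choice is the same one a database validator makes: the document stays flexible in what it may contain, but the fields the application depends on are checked at write time.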

This idea of co-locating data to reduce I/O has been a fundamental principle of database implementation for many years. In relational databases, “index-organized storage” is used by a database administrator to co-locate rows that are expected to be accessed at the same time. But this is done after the fact. And, significantly, it doesn’t allow you to co-locate data from different tables to reduce the cost of retrieving complex records.

How Documents Reduce Computation and I/O

When you query a database, you’re filtering out a subset of the data to either retrieve it in its raw form or compute some kind of summarization. If the data you want is in a single row and your query fetches it from a single table, then a relational database or column store will probably be more efficient. On the other hand, as soon as you need to join tables together to perform your query, the additional computational work to look up multiple indexes and merge results negates that advantage. Each additional row you need to access adds I/O operations. I/O happens slowly. And speeding it up costs dearly.

In a document database with a properly designed schema, an entire business record is contained in a single document, so all the data you need for filtering and retrieval is available with minimal computational overhead and in a single I/O operation. This can make finding and retrieving data far faster.

For anything you can’t fit into a single document, you’ll need multiple record types and may need to query for related groups. Fortunately, mature document databases offer one or more join options as well, although they should be used sparingly.
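The difference in record fetches can be sketched with a small in-memory model. Everything here is an assumption made for illustration: the `read` helper, the `Map`-based "tables" and the counter stand in for storage, and one call to `read` stands in for one record fetch. Real I/O costs depend on caching, page layout and the storage engine.

```javascript
// In-memory model only: each call to read() stands in for one record fetch.
let reads = 0;
function read(store, key) {
  reads += 1;
  return store.get(key);
}

// Normalized layout: one person row plus one row per phone number.
const peopleRows = new Map([[1, { id: 1, name: "john" }]]);
const phoneRows = new Map([
  [10, { personId: 1, type: "cell", number: 4475566218 }],
  [11, { personId: 1, type: "cell", number: 4479927716 }],
  [12, { personId: 1, type: "voip", number: 17035551234 }],
]);
// Index from person id to phone row ids, as a foreign-key index would give.
const phonesByPerson = new Map([[1, [10, 11, 12]]]);

function loadPersonNormalized(id) {
  const person = read(peopleRows, id);
  const phoneIds = read(phonesByPerson, id); // index lookup
  const phones = phoneIds.map((phoneId) => {
    const { type, number } = read(phoneRows, phoneId);
    return { type, number };
  });
  return { name: person.name, phones };
}

// Document layout: the phones are embedded in the person record.
const peopleDocs = new Map([
  [
    1,
    {
      name: "john",
      phones: [
        { type: "cell", number: 4475566218 },
        { type: "cell", number: 4479927716 },
        { type: "voip", number: 17035551234 },
      ],
    },
  ],
]);

function loadPersonDocument(id) {
  return read(peopleDocs, id);
}

reads = 0;
const fromRows = loadPersonNormalized(1);
const normalizedReads = reads; // person row + index entry + three phone rows

reads = 0;
const fromDoc = loadPersonDocument(1);
const documentReads = reads; // the single embedded document

console.log({ normalizedReads, documentReads });
```

Both functions return the same person; only the number of fetches needed to assemble it differs, and the gap grows with every additional child row.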

How Documents Reduce Contention

Databases must allow multiple users or processes to edit the same records without overwriting each other’s changes. We can’t have a situation where two users are modifying the same data, and the one writing last overwrites the other’s changes. If you’ve dealt with git conflicts, you know how important it is to resolve these changes. Consistent data needs a better approach than trying to merge conflicting changes. The solution to this in a database is locking, which entails the following:

1. Find a record that matches your criteria for change.
2. Lock it so no one else can modify it.
3. Verify it wasn’t changed between finding and locking.
4. Apply changes.
5. Unlock it.
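The steps above can be sketched in plain JavaScript. This is a single-threaded model under stated assumptions, not how any particular database implements locking: the `item` record, its `locked` flag, and the `findInStock` and `takeOne` functions are all hypothetical names invented for the illustration.

```javascript
// Hypothetical in-memory record; `locked` stands in for a database lock.
const item = { sku: "widget", qty: 1, locked: false };

// Step 1: find a record matching the criteria (here: at least one in stock)
// and remember what it looked like at that moment.
function findInStock(record) {
  return record.qty > 0 ? { sku: record.sku, qty: record.qty } : null;
}

// Steps 2 to 5: lock, verify against the snapshot from step 1, apply, unlock.
function takeOne(record, snapshot) {
  if (record.locked) return false; // someone else holds the lock
  record.locked = true; // 2. lock
  try {
    if (record.qty !== snapshot.qty) return false; // 3. changed since found
    record.qty -= 1; // 4. apply the change
    return true;
  } finally {
    record.locked = false; // 5. unlock, whether or not the change applied
  }
}

// Two processes both find the last item before either one locks it.
const seenByA = findInStock(item);
const seenByB = findInStock(item);

const aSucceeded = takeOne(item, seenByA); // applies: qty goes from 1 to 0
const bSucceeded = takeOne(item, seenByB); // refused: qty no longer matches

console.log({ aSucceeded, bSucceeded, qty: item.qty });
```

The verify step is what stops the second caller: its snapshot says one item is left, the record says none, so the change is refused instead of driving the quantity negative.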

Doing this for each change serializes the changes and maintains data correctness. For example, if two processes simultaneously find the last item in inventory and modify it to place it in a shopping cart, only one should succeed. In this scenario, all modifications to a single record happen inside a single lock, and that lock need only persist for as long as it takes to apply the check and change, which is typically a few microseconds.

In a document database, an edit to a record where only one document needs to be updated at a time is a very low contention operation. An application built this way can sustain a large number of simultaneous users.

This short-lived, low-contention change requires a rich query language capable of performing the relevant update entirely on the server side. If you’re forced to retrieve the record, change it on the client, and then send it back, you need two calls, one to take a long-lived lock and one to release it, or you risk overwriting someone else’s changes. The time between the two database calls from the client is where contention arises, and it can be substantial.

Sending an instruction to set a field to a particular value is simple enough. But a document database needs to support far more logically complex modifications. Imagine you’re modeling a high score table with the top five scores stored as a single document for speed of retrieval. You might have something like the following:

{
  game: "super_kong",
  highscores: [
    { name: "joe", score: 118231 },
    { name: "amy", score: 75651 },
    { name: "chloe", score: 62352 },
    { name: "bryan", score: 54524 },
    { name: "dwayne", score: 41654 }
  ]
}

What you need to be able to do is send a single command to the server to say “if high scores contain a score less than X, then in one operation, add X to the high scores, sort the high scores array by score, and then retain only the top five items.” MongoDB supports this rich edit functionality.

So what happens when multiple discrete items need to be updated simultaneously to update a record? This is all too common with a relational database where an edit to a record means changing more than one row. The solution is an ACID transaction. An ACID transaction locks each item as you modify it. It then unlocks them all when the transaction is committed. This is often after multiple calls to the server, which means the records are effectively locked for a much longer period, including network and client time. This is a typical cause of contention and performance issues in high-throughput relational database loads.

You can wind up in the same scenario using a document database. But a well-designed document schema can prevent this from happening. And, if you absolutely must edit multiple documents, a document database like MongoDB provides the same BEGIN TRANSACTION and COMMIT semantics found in relational ACID transactions. Since it still introduces the same contention issues as with relational databases, it’s better to create a schema that avoids modifying multiple documents, or one that provides a way to perform your edit without breaking database consistency.
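Returning to the high score table, the single conditional edit can be sketched as follows. The runnable part is a plain-JavaScript model of the logic, with `addHighScore` as a hypothetical helper name. The comment shows the rough shape of a MongoDB update that expresses the same idea using the `$push` operator with its `$each`, `$sort` and `$slice` modifiers; the collection name `games` and the new entry are assumptions.

```javascript
// In MongoDB, an update of roughly this shape performs the whole edit on the
// server in one operation (collection name `games` is an assumption):
//
//   db.games.updateOne(
//     { game: "super_kong", "highscores.score": { $lt: 80000 } },
//     { $push: { highscores: {
//         $each: [{ name: "zoe", score: 80000 }],
//         $sort: { score: -1 },
//         $slice: 5 } } })
//
// Below is a plain-JavaScript model of the same logic.
const table = {
  game: "super_kong",
  highscores: [
    { name: "joe", score: 118231 },
    { name: "amy", score: 75651 },
    { name: "chloe", score: 62352 },
    { name: "bryan", score: 54524 },
    { name: "dwayne", score: 41654 },
  ],
};

function addHighScore(doc, entry, limit = 5) {
  // Filter: only proceed if the table holds a score lower than the new one.
  const beatsSomeone = doc.highscores.some((h) => h.score < entry.score);
  if (!beatsSomeone) return false; // nothing matched, document unchanged

  doc.highscores.push(entry); // add the new score
  doc.highscores.sort((x, y) => y.score - x.score); // highest first
  doc.highscores.splice(limit); // retain only the top `limit` entries
  return true;
}

const added = addHighScore(table, { name: "zoe", score: 80000 }); // qualifies
const ignored = addHighScore(table, { name: "max", score: 100 }); // too low

console.log(added, ignored, table.highscores.map((h) => h.name));
```

Because the check and the modification travel together, there is no window between reading the table and writing it back in which another player’s score could be lost.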

Tradeoffs in Document Schema Design

There is no perfect answer to schema design. Document models assume there are more reads than writes. Often many more. Optimizing for read speed over potentially writing more data during an update is a good choice. Unless a system is simply logging or auditing data, it’s safe to assume that every bit of data written will be read at least once and probably more than once.

With document design, it’s OK to denormalize domain tables, or have more than one copy of some items of data. If you have a list of countries that rarely changes, it’s an acceptable tradeoff to store the country name rather than a key to a domain table and deal with the consequences should a country name ever change. For other records, there may well be a definitive copy of the data. For a customer record, there’s typically one document per customer. But some of the fields may be duplicated into other documents at write time for speed of reading. For example, you may decide to duplicate the customer’s name, address and unique identifier into each of their invoice records. This reduces the time it takes to retrieve invoices. Should a customer change their name or address, you can modify those invoice records in place.

Document databases also encourage you to think about using idempotent, retriable operations and always roll forward rather than back in the event of an error. Imagine we need to give everyone a 10% pay raise. We could wrap this in a transaction, but that locks our employee table for however long it takes. And it could be hours. Alternatively, we could just ask the database to increase everyone’s salary and, as part of updating each record, add a new, temporary field, got_payraise. If this fails part of the way through, then it should try again. But this time, it should only give a pay raise if the got_payraise field does not exist, so no one gets two pay raises, and repeat until everyone gets a raise. At this point we can delete all the got_payraise fields.

This model removes nearly all the contention and risk of either not rewarding someone or not knowing if someone received a pay raise. This is where the flexibility of schema design really helps out.
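The pay raise pattern can be sketched in plain JavaScript. This is an in-memory model under stated assumptions, not a database call: the `employees` array, the `giveRaises` and `giveRaisesWithRetry` functions, and the `failAfter` parameter that simulates a crash partway through are all hypothetical. Only the `got_payraise` field name comes from the text.

```javascript
const employees = [
  { name: "ann", salary: 50000 },
  { name: "bob", salary: 60000 },
  { name: "cat", salary: 70000 },
];

// Raise every employee who has not been marked yet. `failAfter` is a
// hypothetical knob that simulates a failure after that many updates.
function giveRaises(staff, failAfter = Infinity) {
  let done = 0;
  for (const e of staff) {
    if ("got_payraise" in e) continue; // already raised on an earlier attempt
    if (done >= failAfter) throw new Error("simulated failure");
    e.salary = (e.salary * 11) / 10; // 10% raise; exact for these integers
    e.got_payraise = true; // marker written together with the raise
    done += 1;
  }
}

// Roll forward: retry until a pass completes, then remove the markers.
function giveRaisesWithRetry(staff, firstFailAfter) {
  let failAfter = firstFailAfter;
  for (;;) {
    try {
      giveRaises(staff, failAfter);
      break;
    } catch (err) {
      failAfter = Infinity; // assume the fault has cleared before retrying
    }
  }
  for (const e of staff) delete e.got_payraise; // clean up temporary field
}

// The first attempt fails after one update; the retry finishes the rest.
giveRaisesWithRetry(employees, 1);

console.log(employees);
```

The operation is safe to repeat because the marker and the raise are written in the same update to the same record, so a retry can tell exactly who has already been paid.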

Putting It into Practice

Unlike a relational database, a document database asks the designer or developer to think about correctness, contention and performance. But in return, it gives you far better performance and control. Unlike a relational world where one person models the data, others build applications with it, and later a DBA attempts to optimize how it works, document databases put the developers front and center in creating a good database. This can require organizational or process adjustments when moving from relational databases. Once you understand the fundamentals — that you need to have a schema and what good and bad schemas look like — then you can choose between different options and understand what’s happening and what’s not happening. Having taught these classes for a number of years, I’ve often heard from developers that they had no clue they could achieve so much through schema design in a document database.