Debunking the Myth of Going Schemaless
Once you understand document database fundamentals — that you need a schema, and what good and bad schemas look like — you can choose different options.
Feb 1st, 2022 6:11am by
MongoDB sponsored this post.
John Page
John Page is a document database veteran who, after 18 years building full-stack document database technologies for the intelligence community, joined MongoDB. He now builds robots for fun and tests and writes about databases to pay for the robot parts.
Document Schema Design Versus Relational Design
Relational databases were designed to give all users a defined, consistent and safe way to interact with data. The way you organized the data had nothing to do with the way it would be used. It couldn’t. After all, who could predict how different users would access the one copy of data that exists as different requirements arose? So normalized schema designs were established where relationships were defined by the data itself rather than how the data would eventually be used. This made data modeling more predictable. Given the same set of data to model, any competent architect creating a schema would come to the same result. But predictability also meant less flexibility.
Document design was created in the 1960s, around the same time as object-oriented programming. But it took decades for computing to evolve to a point where the flexibility of the document model could be appreciated.
With a document database, schema design is based on how the data is accessed rather than the data itself. And it’s the developers who know best how data will be accessed for their applications. It’s impossible to optimize a schema if you don’t have a plan for how users will access the data. Of course, you can just persist the data without a plan for how users will access it. The document model allows that type of flexibility. But you can and should optimize the schema later.
The Benefits of Better Schema
Beginners love document databases because they can persist objects without defining them upfront. With minimal training, almost anyone can build credible applications without understanding document schema design. However, there is a point where knowing how to design your schema correctly allows you to achieve far more, with far less server hardware. With today’s pay-as-you-go cloud pricing, that’s important. Well-designed document schemas can increase performance for a given set of hardware by reducing computation, I/O operations and contention between users.
The idea that document databases lack up-front schema enforcement is simply not true. Document databases can enforce schemas just like relational databases. Schemaless design may be common in document databases, but it’s not synonymous with them. A modern document database also has strongly typed data, a rich data manipulation language (DML), multiple compound B-tree-based indexes, ACID transactions and in-database aggregation calculation. And document databases are built on the same kinds of underlying storage engines as Postgres, MySQL and other relational databases.
What really differentiates a document database from a relational one is the ability to co-locate related data in the atomic unit of storage, so a single record is stored contiguously on disk and in memory rather than being broken up into rows and stored independently. Simply put, in a document database, multiple values for an attribute can exist within a single record. If a person has multiple phone numbers, you don’t need a different table to store them. And you don’t need to define individual fields for each number. You can simply have an array of phone numbers or number objects. It’s like embedding rows from one table inside another at the storage layer.
{
  name: "john",
  phones: [ { type: "cell", number: 4475566218 },
            { type: "cell", number: 4479927716 },
            { type: "voip", number: 17035551234 } ]
}
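As a rough illustration of the schema enforcement point, the sketch below checks a document shaped like the one above against a small set of hand-written rules. Everything here is an assumption made for illustration: `validate_person`, the `ALLOWED_PHONE_TYPES` set and the rules themselves are hypothetical and run in plain Python, whereas a real document database would apply its own validation mechanism on the server.

```python
# Hypothetical rule set: the allowed phone types are an assumption, not from the article.
ALLOWED_PHONE_TYPES = {"cell", "voip"}


def validate_person(doc):
    """Return a list of schema violations for a person document (empty if valid)."""
    errors = []
    if not isinstance(doc.get("name"), str):
        errors.append("name must be a string")

    phones = doc.get("phones")
    if not isinstance(phones, list):
        errors.append("phones must be an array")
        return errors

    # Every element of the embedded array must follow the same shape.
    for i, phone in enumerate(phones):
        if not isinstance(phone, dict):
            errors.append(f"phones[{i}] must be an object")
            continue
        if phone.get("type") not in ALLOWED_PHONE_TYPES:
            errors.append(f"phones[{i}].type is not an allowed phone type")
        if type(phone.get("number")) is not int:
            errors.append(f"phones[{i}].number must be an integer")
    return errors


john = {
    "name": "john",
    "phones": [
        {"type": "cell", "number": 4475566218},
        {"type": "cell", "number": 4479927716},
        {"type": "voip", "number": 17035551234},
    ],
}

bad = {"name": "amy", "phones": [{"type": "fax", "number": "123"}]}

john_errors = validate_person(john)
bad_errors = validate_person(bad)
```

A write path that rejects any document for which `validate_person` returns a non-empty list behaves much like a typed table definition, while still allowing a variable number of embedded phone entries.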
How Documents Reduce Computation and I/O
When you query a database, you’re selecting a subset of the data to either retrieve it in its raw form or compute some kind of summarization. If the data you want is in a single row and your query fetches it from a single table, then a relational database or column store will probably be more efficient. On the other hand, as soon as you need to join tables together to perform your query, the additional computational work to look up multiple indexes and merge results negates that advantage. Each additional row you need to access adds I/O operations. I/O is slow, and speeding it up is expensive.
In a document database with a properly designed schema, an entire business record is contained in a single document, so all the data you need for filtering and retrieval is available with minimal computational overhead and in a single I/O operation. This can make finding and retrieving data far faster. For anything you can’t fit into a single document, you’ll need multiple record types and may need to query for related groups. Fortunately, mature document databases offer one or more join options as well, although they should be used sparingly.
How Documents Reduce Contention
Databases must allow multiple users or processes to edit the same records without overwriting each other’s changes. We can’t have a situation where two users are modifying the same data, and the one writing last overwrites the other’s changes. If you’ve dealt with git conflicts, you know how important it is to resolve these changes. Consistent data needs a better approach than trying to merge conflicting changes. The solution to this in a database is locking, which entails the following:
- Find a record that matches your criteria for change.
- Lock it so no one else can modify it.
- Verify it wasn’t changed between finding and locking.
- Apply changes.
- Unlock it.
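The five steps above can be sketched with an in-memory store. This is only an illustration under stated assumptions: the `Record` class, its `version` counter and the `update_matching` helper are hypothetical names, and a real database performs the equivalent work inside its storage engine rather than in application code.

```python
import threading


class Record:
    """A stored document plus the bookkeeping needed for locking (hypothetical)."""

    def __init__(self, doc):
        self.doc = doc
        self.version = 0              # bumped on every change, used by the verify step
        self.lock = threading.Lock()


def update_matching(records, matches, change):
    """Apply `change` to the first record satisfying `matches`; return True on success."""
    for record in records:
        # 1. Find a record that matches the criteria (read without holding the lock).
        if not matches(record.doc):
            continue
        seen_version = record.version

        # 2. Lock it so no one else can modify it.
        with record.lock:
            # 3. Verify it wasn't changed between finding and locking.
            if record.version != seen_version or not matches(record.doc):
                continue              # leaving the `with` block releases the lock
            # 4. Apply changes.
            change(record.doc)
            record.version += 1
            # 5. Unlock: the `with` block releases the lock as we return.
            return True
    return False


def has_at_least_30(doc):
    return doc["id"] == 2 and doc["balance"] >= 30


def withdraw_30(doc):
    doc["balance"] -= 30


records = [Record({"id": 1, "balance": 100}), Record({"id": 2, "balance": 50})]
first_attempt = update_matching(records, has_at_least_30, withdraw_30)
second_attempt = update_matching(records, has_at_least_30, withdraw_30)
```

The second call finds no record that still satisfies the criteria, so nothing is changed. The longer a lock has to be held, or the more records a change has to lock, the more other writers have to wait.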
{
  game: "super_kong",
  highscores: [ { name: "joe", score: 118231 },
                { name: "amy", score: 75651 },
                { name: "chloe", score: 62352 },
                { name: "bryan", score: 54524 },
                { name: "dwayne", score: 41654 } ]
}
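Because the whole high-score table lives in one document, a change to it only ever touches that one record. The sketch below shows one way such a change might look: add a score, re-sort, and keep the top entries. The `add_score` helper and the cap of five entries are assumptions for illustration, and the code operates on a plain Python dict rather than on a database.

```python
def add_score(game_doc, name, score, keep=5):
    """Insert a score into the embedded list, sorted descending, trimmed to `keep`."""
    entries = game_doc["highscores"] + [{"name": name, "score": score}]
    entries.sort(key=lambda entry: entry["score"], reverse=True)
    # Replacing the whole array is a change to exactly one document.
    game_doc["highscores"] = entries[:keep]
    return game_doc


game = {
    "game": "super_kong",
    "highscores": [
        {"name": "joe", "score": 118231},
        {"name": "amy", "score": 75651},
        {"name": "chloe", "score": 62352},
        {"name": "bryan", "score": 54524},
        {"name": "dwayne", "score": 41654},
    ],
}

add_score(game, "zoe", 80000)   # enters the table; the lowest score drops off
add_score(game, "sam", 100)     # too low to enter; the table is unchanged
names = [entry["name"] for entry in game["highscores"]]
```

In a relational layout the same change would typically mean inserting one row and deleting another, which is more records to lock and verify.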
BEGIN TRANSACTION COMMIT semantics that are found in ACID transactions. Since a transaction that spans multiple documents still introduces the same contention issues as in relational databases, it’s better to create a schema that avoids modifying multiple documents, or one that provides a way to perform your edit without breaking database consistency.
Tradeoffs in Document Schema Design
There is no perfect answer to schema design. Document models assume there are more reads than writes. Often many more. Optimizing for read speed over potentially writing more data during an update is a good choice. Unless a system is simply logging or auditing data, it’s safe to assume that every bit of data written will be read at least once and probably more than once.
With document design, it’s OK to denormalize domain tables, or have more than one copy of some items of data. If you have a list of countries that rarely changes, it’s an acceptable tradeoff to store the country name rather than a key to a domain table and deal with the consequences should a country name ever change.
For other records, there may well be a definitive copy of the data. For a customer record, there’s typically one document per customer. But some of the fields may be duplicated into other documents at write time for speed of reading. For example, you may decide to duplicate the customer’s name, address and unique identifier into each of their invoice records. This reduces the time it takes to retrieve invoices. Should a customer change their name or address, you can modify those invoice records in place.
got_payraise. If this fails part of the way through, then it should try again. But this time, it should only give a pay raise if the got_payraise field does not exist, so no one gets two pay raises, and repeat until everyone gets a raise. At this point we can delete all the got_payraise fields. This model removes nearly all the contention and risk of either not rewarding someone or not knowing if someone received a pay raise. This is where the flexibility of schema design really helps out.
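The retry pattern described above can be sketched as follows. Only the `got_payraise` field name comes from the text; the `salary` field, the function names and the simulated failure are assumptions added for illustration, and the code works on plain Python dicts rather than a database. In a real system, the salary change and the flag would be written together in one single-document update.

```python
def give_raises(employees, percent, fail_after=None):
    """One pass over the employees, skipping anyone already flagged.

    `fail_after` is a hypothetical hook that simulates a crash after that many updates.
    """
    updated = 0
    for emp in employees:
        if "got_payraise" in emp:
            continue                  # already rewarded on an earlier attempt
        if fail_after is not None and updated >= fail_after:
            raise RuntimeError("simulated failure part of the way through")
        # These two writes stand in for one atomic single-document update.
        emp["salary"] = emp["salary"] * (100 + percent) // 100
        emp["got_payraise"] = True
        updated += 1
    return updated


def raise_everyone(employees, percent, fail_points=()):
    """Retry until a pass completes, then remove the marker fields."""
    pending = list(fail_points)
    attempts = 0
    while True:
        attempts += 1
        fail_after = pending.pop(0) if pending else None
        try:
            give_raises(employees, percent, fail_after)
            break
        except RuntimeError:
            continue                  # try again; flagged employees are skipped
    for emp in employees:
        emp.pop("got_payraise", None)
    return attempts


staff = [
    {"name": "joe", "salary": 1000},
    {"name": "amy", "salary": 2000},
    {"name": "chloe", "salary": 3000},
]

# The first attempt fails after one update; the second attempt finishes the job.
attempts = raise_everyone(staff, 10, fail_points=[1])
salaries = [emp["salary"] for emp in staff]
```

Each employee document is changed on its own, so no attempt needs to hold locks across the whole collection, and the marker field makes a repeated attempt safe.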
Putting It into Practice
Unlike a relational database, a document database asks the designer or developer to think about correctness, contention and performance. But in return, it gives you far better performance and control. In the relational world, one person models the data, others build applications with it, and later a DBA attempts to optimize how it works. Document databases instead put the developers front and center in creating a good database. This can require organizational or process adjustments when moving from relational databases.
Once you understand the fundamentals (that you need to have a schema, and what good and bad schemas look like), you can choose between different options and understand what’s happening and what’s not happening. Having taught these classes for a number of years, I’ve often heard from developers that they had no clue they could achieve so much through schema design in a document database.