How We Built the New JSON API for Cassandra and Astra DB

We looked to the Mongoose project for the kinds of data access patterns that Node.js developers need to bring the scale of Cassandra to documents.

Sep 29th, 2023 9:01am by Jeff Carpenter and Aaron Morton

Featued image for: How We Built the New JSON API for Cassandra and Astra DB

Image from Tithi Luadthong on Shutterstock.

Recently, we began to consider how to make Apache Cassandra more accessible to a wider audience of developers, particularly for the largest community of all: Node.js developers. JSON is an important part of that developer ecosystem because of its flexibility. Many Node.js applications use an object document mapper called Mongoose.js that simplifies the process of converting JavaScript objects to and from JSON documents stored in a document database. Mongoose has approximately 2 million downloads a week on npm and 3.7 million public GitHub repositories list it as a dependency. We began looking at the Mongoose project as a representation of the kinds of data-access patterns that Node.js developers need, especially in terms of filtering, projection and updating. This includes features such as:

Inserting and updating complex documents, including nested arrays and subdocuments
Finding, updating, deleting one or more documents based on criteria including
- equality (including for arrays and subdocuments)
- multi-item equality (IN)
- inequality (NOT)
- comparison (<, <=, >, >=)
- existence
- and combinations of those criteria using AND and OR
Sorting and projecting results based on a user-provided list of fields or subdocuments

Most of these data access patterns aren’t supported by Cassandra out of the box. Because it’s a distributed, table-oriented database, developers were traditionally encouraged to denormalize data to support reads and prioritize “no look” upserts. So we began to explore how we could provide an API on top of Cassandra that could implement these access patterns with improved scalability and performance. Our goal became to build the best backend for Mongoose. To accomplish this, we designed an API called the JSON API that is usable by Mongoose.js with only a configuration change or with any other language via the HTTP API. Earlier this month, we announced that the new JSON API is available in DataStax Astra DB vector databases and also can be used as part of the open source Stargate project, a data API gateway, against self-hosted Cassandra clusters. In this article, we’ll explain the details of the JSON API design and describe how it takes advantage of new Cassandra features to yield a rich set of document-oriented functionality for Node.js with demonstrably good performance and scalability.

JSON API Architecture

The key elements of this architecture are shown below. Client applications include the Mongoose JavaScript library along with the stargate-mongoose driver, packages available via npm. The developer just needs to configure the JSON API endpoint and things are ready to go.

In designing the JSON API, we discovered that we could push the vast majority of the querying and filtering logic that Mongoose requires down into the Cassandra nodes themselves with a few key enhancements, especially improvements to the Storage Attached Index (SAI) implementation first introduced in Cassandra 4.0.

Introducing Super Shredding

To understand our design approach for the JSON API, it’s helpful to take a quick look back to set some context. Our first attempt back in 2021 at building a document-style API on top of Cassandra was the Stargate Docs API, based on a “document shredding” approach. While we were able to make some performance optimizations to this API, a key challenge was that the original shredding approach broke each document into components spread across multiple Cassandra rows. Although the resulting schema provided useful flexibility to help implement some of the “exact match” desired filtering operations, more complex filtering required overfetching of documents and filtering in memory. This design also required multiple queries for document insertion, retrieval, update and deletion operations. This hurt performance and added complexity to ensure consistency across multiple rows while those queries were in flight. For the JSON API, we’re using an improved approach known as “super shredding,” which Aaron Morton described in a recent talk. The design of super shredding was developed via a logical thought process to create a performant, scalable solution:

We started with the goal of pushing as much filtering and sorting as possible down to Cassandra, including the combination of filtering terms using logical AND and OR.
To accomplish this pushdown, we took advantage of the “query engine” in SAI that can join the results from filtering on multiple indexes, a feature known as “match streaming.”
To use match streaming, each JSON document must be stored in a single row, as this makes “Row ID” tracking in SAI more efficient.
In order to store each document as a single row, we used sets and maps to store items of variable lengths.

Although the Mongoose.js API provides user-level control over indexing, we decided to build the JSON API to support efficient querying on all fields in a document without the user needing to create indexes. To achieve this, we separated the two concerns of the schema:

Store the document for filtering and sorting. We create SAI indexes by default on all columns containing searchable (filterable) fields of a document.
Store the document for projections, which must honor the field order of the original document. We also need the original document in order to process modifications made with reference to the full document contents.

Separating the two concerns led to optimizing the design for each concern, which created a more robust model.

Super Shredding Table Schema

Consider the following example, which we’ll use to describe the super shredding table schema and how it works. The JSON API supports the concepts of namespaces and collections, which correspond to Cassandra keyspaces and tables, respectively. If a user created a namespace called `purchase_database` and a collection called `products`, the following Cassandra table would be created.

CREATE TABLE purchase_database.products (
   key                 tuple<tinyint,text>,
   tx_id               timeuuid,
   doc_json            text,
   exist_keys          set<text>,
   array_size          map<text, int>,
   array_contains      set<text>,
   query_bool_values   map<text, tinyint>,
   query_dbl_values    map<text, double>,
   query_timestamp_values map<text, timestamp>,
   query_text_values   map<text, text>,
   query_null_values   set<text>
   PRIMARY KEY (key)
);

Let’s look at how these columns are used. Several of them are always populated for every document (row) that is inserted:

`key` — the primary key for the document. This corresponds to the `_id` field in a Mongoose object. The primary key may be any of the supported atomic (non null) JSON data types, such as text, double or timestamp. The tuple for the key encodes the data type and the encoding of the value as a string.
`tx_id` — a time-based (v1) UUID (universally unique identifier) that identifies the last version of the document. We’ll see below how this is used on update transactions to maintain consistency.
`doc_json` – the full serialized text of the JSON document. This is used to materialize the document for projections and updates.
`exist_keys` — a set of the JSON paths in the document that can be queried for existence. This is used to find documents that either have or do not have a specified path.

Other columns are optionally populated based on the contents of the document in order to support various application queries:

`query_bool_values`, `query_dbl_values`, `query_timestamp_values`, `query_text_values` — these map JSON paths to fields in the document that contain atomic values, including queries for array items. The fields can be of the supported JSON basic types, including text, doubles (used for all JSON number values), booleans and timestamps (part of Extended JSON). These are used to support equality and inequality filtering and sorting against fields in the document (except for arrays, as noted below). The `query_text_values` is also used to store hashed values of subdocuments, which supports searching for documents that contain a specified subdocument, for instance finding documents with a specified address.
`query_null_values` — similar to the other `query` columns but represented as a set of keys since there are no actual values to store. This is used when filtering if a field is null or not null according to JSON null values.
`array_size` — a map of JSON paths to fields in the document where the value is a JSON array, the value in the Cassandra Query Language map is the size of the array (including zero). This is used to find documents based on the exact size/length of an array in the document.
`array_contains` — a set used for all equality checks on array elements (regardless of type). Each element includes a JSON path to an atomic value or item in an array field appended with a hash of the value. This is used to find documents that contain atomic values or arrays that have atomic values.

We created SAI indexes on the `exist_keys, query_*` and `array_*` columns to support fast filtering on lookups. We’ll see an example of this below.

Inserting Documents

Next we’ll see what happens when a client application inserts a JSON document. We’ll focus on what happens in the JSON API and how it uses Cassandra. For a view of what the experience is like from the perspective of a client JavaScript application, see the blog post “Build a Text and Image Search App with Astra DB Vector Search, NodeJS, Stargate’s New JSON API, and Stargate-Mongoose.” Let’s assume the client application inserts the following document:

{
  "insertOne": {
    "document": {
      "_id": "1",
      "purchase_type": "Online",
      "customer": {
        "name": "Jim",
        "phone": "123-123-1234",
        "address": {
          "address_line": "1234 Broadway",
          "city": "New York",
          "state": "NY"
        }
      },
      "purchase_date": { "$date": 1695138780 },
      "seller": {
        "name": "Jon",
        "location": "New York"
      },
      "items": [
        {
          "car" : "BMW",
          "color": "Black"
        }, 
        "Extended warranty - 5 years"
      ],
      "amount": 65000,
      "status" : "active",
      "preferred_customer" : true
    }
  }
}

As you can see, this document has some nesting, including arrays and subdocuments, and uses a timestamp to represent the purchase date. This document provides a useful example for examining various types of queries, and in fact this is used in the JSON API Postman collection, which you can use to learn more about JSON API. One thing we should note while we’re talking about document structure: While use of Mongoose does require creation of JSON schema (it is, after all, an Object-Document Mapper), the JSON API itself does not enforce any schema, so you can insert whatever documents you wish if using it directly. Another key design goal of the JSON API is not to leak “Cassandra-isms” into the interface. If you’re a Cassandra user, you know some of these, such as the Partition Key versus the Clustering Keys. Users will not see any CQL or “Cassandra-isms” when using the API or stargate-mongoose. However, curious Cassandra developers will be interested to see the CQL row that was inserted into the purchase table, which looks something like the output below (with some formatting and values omitted for readability):

cassandra@cqlsh:purchase_database> select key, tx_id, doc_json, query_text_values from purchase where key = (1, '1');


@ Row 1
-------------------+--------------------------------------------------
 key               | (1, '1')
 tx_id             | 94e5e560-51bc-11ee-bc96-432052799679
 doc_json          | {"_id":"1","purchase_type":"Online",...}
 query_text_values | {'_id': '1', 
                      'customer.address.address_line': '1234 Broadway', 
                      'customer.address.city': 'New York', 
                      'customer.address.state': 'NY', 
                      'customer.name': 'Jim', 
                      'customer.phone': '123-123-1234', 
                      'items.0.car': 'BMW', 
                      'items.0.color': 'Black', 
                      'items.1': 'Extended warranty - 5 years', 
                      'purchase_type': 'Online', 
                      'seller.location': 'New York', 
                      'seller.name': 'Jon', 
                      'status': 'active',
                      ... }


(1 rows)

We’ll focus on the contents of the `query_text_values` field to demonstrate other aspects of the super shredding design.

Retrieving Documents

Next, let’s look at what happens if the client application queries for documents with a specific city. Here is the JSON API query:

{
  "findOne": {
    "filter": {"customer.address.city" : "New York"}
  }
}

The JSON API takes this query and interprets the requested value for `purchase_date` as a string. Therefore, it performs the following CQL query using the `query_text_values` column:

SELECT key, tx_id, doc_json FROM purchase_database.purchase WHERE query_text_values["customer.address.city"] = "New York"

The document inserted above will match this query. This query works because when the client application created the `purchases` collection, the JSON API created an SAI index on the values of the `query_text_values` column:

CREATE CUSTOM INDEX IF NOT EXISTS purchase_query_text_values ON purchase_database.purchase (entries(query_text_values)) USING 'StorageAttachedIndex';

This is a simple equality query, but the SAI also supports more complex inequality or NOT queries. Make sure to check out the JSON API documentation to see all the supported options.

Updating Documents

The JSON API also supports the rich set of update commands expected by Mongoose.js for partial or full documents, including unsetting fields or removing subdocuments, as well as optionally returning projections of the original or updated document. For example, the following JSON API query could be used to unset the preferred customer field from a document and return the updated document:

{
  "findOneAndUpdate": {
    "filter": {
      "_id": "1"
    },
    "update" : {"$unset" : {"preferred_customer": ""}},
    "options" : {"returnDocument" : "after"}
  }
}

This demonstrates some of the complexity of dealing with JSON types; in this case, a Boolean value can be `true`, `false`, `null` or `unset`, and the app can use `unset`, `false` or null to represent “not a preferred customer.” To implement the requested update, the JSON API must use a read-modify-write pattern, that is, the service pulls the JSON document into memory, updates it, then writes it back to the database. The resulting write looks something like this:

UPDATE purchase_database.purchase SET 
    tx_id = now(), 
    exist_keys = ?, 
    array_size = ?, 
    array_contains = ?, 
    query_bool_values = ?, 
    query_dbl_values = ?, 
    query_text_values = ?, 
    query_null_values = ?, 
    query_timestamp_values = ?, 
    doc_json = ? 
    WHERE key = (1, '1') IF tx_id = ? VALUES ...;

The actual values have been omitted for brevity. Notice the use of the CQL `IF` clause, which checks to make sure that the `tx_id` has the value that was obtained from the initial document read. The `IF` clause implies use of a lightweight transaction (LWT) to ensure consistent updates, in this case, that the document contents have not been changed since it was read. While not all possible update commands strictly require this protection, the most correct design is to execute them all using this read-modify-write pattern. We designed this pattern to be able to take advantage of the new Accord-based transactions coming in Cassandra 5.0 for improved performance.

Performance Results

Experienced Cassandra users might have some questions about the performance implications of some elements of the design. For example:

The use of collections, which may result in tombstones as items are removed.
The use of indexes — but remember SAI is better!
The use of LWTs for all document updates.

While these are valid concerns, it’s important to keep in mind that the performance expectations for a document store are different from those that many of us in the Cassandra community are accustomed to. To validate that our implementation supports performance in keeping with the typical expectations of a document database, we executed benchmark tests of the JSON API running with a DataStax Astra serverless database. The tests were performed using NoSQLBench and Fallout framework using a variety of queries and documents of different size and complexity. You can find the files used to execute these tests on the JSON API GitHub repository. One test consisted of a warmup phase with multiple concurrent inserts, followed by a main phase including seven different operations running in parallel. The test executed each operation 10,000 times at a rate of 25 operations per second, for a combined rate of 175 ops/s. The results are shown in the chart below:

Operation	Median latency	99th percentile latency
Find one by id	7.84 ms	16.25 ms
Find one by property	10.53 ms	18.00 ms
Find multi $eq and $exists	13.62 ms	24.77 ms
Find multi $eq and $exists with $projection	13.27 ms	24.26 ms
Find one, sort (narrow match) and update property	10.80 ms	18.43 ms
Delete one id	14.175 ms	23.78 ms
Find one id and replace	14.11 ms	23.56 ms

As these results show, the JSON API was able to sustain consistent performance under a reasonably aggressive operational load. We’re continuing to work on performance testing and optimization, and look forward to taking advantage of improvements in Cassandra to get even more speed and scalability.

Next Steps

Coming up, we’ll explain how we extended the JSON API to handle vector search, including how that affects the super shredding design. Future enhancements of the JSON API include adoption of additional Cassandra features from 5.0 and beyond. For example, the new Accord feature will enable the JSON API to improve performance of document update queries. In the meantime, we’ll be working on hardening and performance improvements as we look toward an official general availability release in the near future. But you don’t have to wait to start using this exciting new capability. Stargate-Mongoose is now available as an npm package, and we’ve just launched a public preview of the JSON API in DataStax Astra. Please try these new releases and let us know what you think, whether online or in person. We’d love to see you at Aaron’s talk, “Fast and Flexible JSON Retrieval in Cassandra Using SAI,” at Cassandra Summit, held December 12-13 in San Jose, California. For more details, see our JSON API documentation: “JSON API QuickStart with Mongoose” and “Developing with the JSON API.”

Jeff Carpenter has worked as a software engineer and architect in multiple industries and as a developer advocate at DataStax, helping engineers succeed with Apache Cassandra. He's involved in multiple open source projects in the Cassandra and Kubernetes ecosystems including...