How We Built the New JSON API for Cassandra and Astra DB
Recently, we began to consider how to make Apache Cassandra more accessible to a wider audience of developers, particularly for the largest community of all: Node.js developers. JSON is an important part of that developer ecosystem because of its flexibility. Many Node.js applications use an object document mapper called Mongoose.js that simplifies the process of converting JavaScript objects to and from JSON documents stored in a document database. Mongoose has approximately 2 million downloads a week on npm and 3.7 million public GitHub repositories list it as a dependency. We began looking at the Mongoose project as a representation of the kinds of data-access patterns that Node.js developers need, especially in terms of filtering, projection and updating. This includes features such as:
- Inserting and updating complex documents, including nested arrays and subdocuments
- Finding, updating, deleting one or more documents based on criteria including
- equality (including for arrays and subdocuments)
- multi-item equality (IN)
- inequality (NOT)
- comparison (<, <=, >, >=)
- existence
- and combinations of those criteria using AND and OR
- Sorting and projecting results based on a user-provided list of fields or subdocuments
JSON API Architecture
The key elements of this architecture are shown below. Client applications include the Mongoose JavaScript library along with the stargate-mongoose driver, packages available via npm. The developer just needs to configure the JSON API endpoint and things are ready to go.
In designing the JSON API, we discovered that we could push the vast majority of the querying and filtering logic that Mongoose requires down into the Cassandra nodes themselves with a few key enhancements, especially improvements to the Storage Attached Index (SAI) implementation first introduced in Cassandra 4.0.
Introducing Super Shredding
To understand our design approach for the JSON API, it’s helpful to take a quick look back to set some context. Our first attempt back in 2021 at building a document-style API on top of Cassandra was the Stargate Docs API, based on a “document shredding” approach. While we were able to make some performance optimizations to this API, a key challenge was that the original shredding approach broke each document into components spread across multiple Cassandra rows. Although the resulting schema provided useful flexibility to help implement some of the “exact match” desired filtering operations, more complex filtering required overfetching of documents and filtering in memory. This design also required multiple queries for document insertion, retrieval, update and deletion operations. This hurt performance and added complexity to ensure consistency across multiple rows while those queries were in flight. For the JSON API, we’re using an improved approach known as “super shredding,” which Aaron Morton described in a recent talk. The design of super shredding was developed via a logical thought process to create a performant, scalable solution:- We started with the goal of pushing as much filtering and sorting as possible down to Cassandra, including the combination of filtering terms using logical AND and OR.
- To accomplish this pushdown, we took advantage of the “query engine” in SAI that can join the results from filtering on multiple indexes, a feature known as “match streaming.”
- To use match streaming, each JSON document must be stored in a single row, as this makes “Row ID” tracking in SAI more efficient.
- In order to store each document as a single row, we used sets and maps to store items of variable lengths.
- Store the document for filtering and sorting. We create SAI indexes by default on all columns containing searchable (filterable) fields of a document.
- Store the document for projections, which must honor the field order of the original document. We also need the original document in order to process modifications made with reference to the full document contents.
Super Shredding Table Schema
Consider the following example, which we’ll use to describe the super shredding table schema and how it works. The JSON API supports the concepts of namespaces and collections, which correspond to Cassandra keyspaces and tables, respectively. If a user created a namespace called `purchase_database` and a collection called `products`, the following Cassandra table would be created.
CREATE TABLE purchase_database.products (
key tuple<tinyint,text>,
tx_id timeuuid,
doc_json text,
exist_keys set<text>,
array_size map<text, int>,
array_contains set<text>,
query_bool_values map<text, tinyint>,
query_dbl_values map<text, double>,
query_timestamp_values map<text, timestamp>,
query_text_values map<text, text>,
query_null_values set<text>
PRIMARY KEY (key)
);
- `key` — the primary key for the document. This corresponds to the `_id` field in a Mongoose object. The primary key may be any of the supported atomic (non null) JSON data types, such as text, double or timestamp. The tuple for the key encodes the data type and the encoding of the value as a string.
- `tx_id` — a time-based (v1) UUID (universally unique identifier) that identifies the last version of the document. We’ll see below how this is used on update transactions to maintain consistency.
- `doc_json` – the full serialized text of the JSON document. This is used to materialize the document for projections and updates.
- `exist_keys` — a set of the JSON paths in the document that can be queried for existence. This is used to find documents that either have or do not have a specified path.
- `query_bool_values`, `query_dbl_values`, `query_timestamp_values`, `query_text_values` — these map JSON paths to fields in the document that contain atomic values, including queries for array items. The fields can be of the supported JSON basic types, including text, doubles (used for all JSON number values), booleans and timestamps (part of Extended JSON). These are used to support equality and inequality filtering and sorting against fields in the document (except for arrays, as noted below). The `query_text_values` is also used to store hashed values of subdocuments, which supports searching for documents that contain a specified subdocument, for instance finding documents with a specified address.
- `query_null_values` — similar to the other `query` columns but represented as a set of keys since there are no actual values to store. This is used when filtering if a field is null or not null according to JSON null values.
- `array_size` — a map of JSON paths to fields in the document where the value is a JSON array, the value in the Cassandra Query Language map is the size of the array (including zero). This is used to find documents based on the exact size/length of an array in the document.
- `array_contains` — a set used for all equality checks on array elements (regardless of type). Each element includes a JSON path to an atomic value or item in an array field appended with a hash of the value. This is used to find documents that contain atomic values or arrays that have atomic values.
Inserting Documents
Next we’ll see what happens when a client application inserts a JSON document. We’ll focus on what happens in the JSON API and how it uses Cassandra. For a view of what the experience is like from the perspective of a client JavaScript application, see the blog post “Build a Text and Image Search App with Astra DB Vector Search, NodeJS, Stargate’s New JSON API, and Stargate-Mongoose.” Let’s assume the client application inserts the following document:
{
"insertOne": {
"document": {
"_id": "1",
"purchase_type": "Online",
"customer": {
"name": "Jim",
"phone": "123-123-1234",
"address": {
"address_line": "1234 Broadway",
"city": "New York",
"state": "NY"
}
},
"purchase_date": { "$date": 1695138780 },
"seller": {
"name": "Jon",
"location": "New York"
},
"items": [
{
"car" : "BMW",
"color": "Black"
},
"Extended warranty - 5 years"
],
"amount": 65000,
"status" : "active",
"preferred_customer" : true
}
}
}
cassandra@cqlsh:purchase_database> select key, tx_id, doc_json, query_text_values from purchase where key = (1, '1');
@ Row 1
-------------------+--------------------------------------------------
key | (1, '1')
tx_id | 94e5e560-51bc-11ee-bc96-432052799679
doc_json | {"_id":"1","purchase_type":"Online",...}
query_text_values | {'_id': '1',
'customer.address.address_line': '1234 Broadway',
'customer.address.city': 'New York',
'customer.address.state': 'NY',
'customer.name': 'Jim',
'customer.phone': '123-123-1234',
'items.0.car': 'BMW',
'items.0.color': 'Black',
'items.1': 'Extended warranty - 5 years',
'purchase_type': 'Online',
'seller.location': 'New York',
'seller.name': 'Jon',
'status': 'active',
... }
(1 rows)
Retrieving Documents
Next, let’s look at what happens if the client application queries for documents with a specific city. Here is the JSON API query:
{
"findOne": {
"filter": {"customer.address.city" : "New York"}
}
}
SELECT key, tx_id, doc_json FROM purchase_database.purchase WHERE query_text_values["customer.address.city"] = "New York"
The document inserted above will match this query. This query works because when the client application created the `purchases` collection, the JSON API created an SAI index on the values of the `query_text_values` column:
CREATE CUSTOM INDEX IF NOT EXISTS purchase_query_text_values ON purchase_database.purchase (entries(query_text_values)) USING 'StorageAttachedIndex';
This is a simple equality query, but the SAI also supports more complex inequality or NOT queries. Make sure to check out the JSON API documentation to see all the supported options.
Updating Documents
The JSON API also supports the rich set of update commands expected by Mongoose.js for partial or full documents, including unsetting fields or removing subdocuments, as well as optionally returning projections of the original or updated document. For example, the following JSON API query could be used to unset the preferred customer field from a document and return the updated document:
{
"findOneAndUpdate": {
"filter": {
"_id": "1"
},
"update" : {"$unset" : {"preferred_customer": ""}},
"options" : {"returnDocument" : "after"}
}
}
UPDATE purchase_database.purchase SET
tx_id = now(),
exist_keys = ?,
array_size = ?,
array_contains = ?,
query_bool_values = ?,
query_dbl_values = ?,
query_text_values = ?,
query_null_values = ?,
query_timestamp_values = ?,
doc_json = ?
WHERE key = (1, '1') IF tx_id = ? VALUES ...;
Performance Results
Experienced Cassandra users might have some questions about the performance implications of some elements of the design. For example:- The use of collections, which may result in tombstones as items are removed.
- The use of indexes — but remember SAI is better!
- The use of LWTs for all document updates.
| Operation | Median latency | 99th percentile latency |
| Find one by id | 7.84 ms | 16.25 ms |
| Find one by property | 10.53 ms | 18.00 ms |
| Find multi $eq and $exists | 13.62 ms | 24.77 ms |
| Find multi $eq and $exists with $projection | 13.27 ms | 24.26 ms |
| Find one, sort (narrow match) and update property | 10.80 ms | 18.43 ms |
| Delete one id | 14.175 ms | 23.78 ms |
| Find one id and replace | 14.11 ms | 23.56 ms |