4 Common Questions We Hear about Apache Cassandra

These are some of the top things developers want to know about this highly scalable, reliable NoSQL data store.

Sep 14th, 2022 7:00am by Pieter Humphrey


Since it was developed in 2007, Apache Cassandra has built a reputation as a rock-solid, highly scalable, reliable NoSQL data store used by some of the biggest enterprises in the world. But it also takes a certain level of experience and expertise to work with Cassandra. So it’s understandable that there are lots of questions that arise when learning about this open source database. This article covers some top questions that developers ask across a variety of community forums.

What’s the Difference Between Partition, Clustering and Composite Keys In Cassandra?

Understanding how the primary key in wide-column databases is different from relational primary keys is a critical step in learning to wield Cassandra’s power. Wide-column stores like Cassandra use the notion of column families, a database object that contains multiple columns of related data that are used together, similar to traditional relational database tables. Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately.

Put another way, a column family is a key-value pair, where the key is mapped to a value that is a set of columns. To draw an analogy with relational databases, a column family is like a “table,” with each key-value pair being a “row.” For developers, wide-column tables can present themselves as a row-and-column table that is familiar and easy to work with, in code or via APIs. Let’s look at some example code to help bring the concepts to life.

In our example, we’ve got a keyspace and a table with fields like “city,” “last name” and “first name.” The primary key is declared at the bottom of the table definition. All tables in Cassandra, by the way, must include at least one partition key column. In this example, we partition by “city.” Any column that follows the partition key in the primary key is a clustering column. Notice the parentheses around “city”: they mark it as the partition key. The parentheses matter most when your partition key is composite and has more than one column, because they make it clear which columns belong to the partition key and which ones are clustering columns.

The primary key’s main purpose is to ensure that a row is unique. It always contains the partition key, and it may also contain zero or more clustering columns, which control how rows are sorted within a partition. A primary key made up of two or more columns is called “compound,” while a partition key that itself spans two or more columns is called “composite.”

The partition key is used to partition our rows and has one or more columns.
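To make the roles of the two key parts concrete, here is a minimal Python sketch that imitates what they do. It is a simulation rather than Cassandra code. It assumes a table declared with something like PRIMARY KEY ((city), last_name, first_name); the column names, the sample rows and the build_partitions helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows for a table assumed to be declared with
# PRIMARY KEY ((city), last_name, first_name).
ROWS = [
    {"city": "Austin", "last_name": "Smith", "first_name": "Zoe"},
    {"city": "Boston", "last_name": "Jones", "first_name": "Amy"},
    {"city": "Austin", "last_name": "Adams", "first_name": "Bob"},
    {"city": "Austin", "last_name": "Smith", "first_name": "Al"},
]

PARTITION_KEY = ("city",)
CLUSTERING_COLUMNS = ("last_name", "first_name")


def build_partitions(rows):
    """Group rows by partition key and sort each group by clustering columns."""
    grouped = defaultdict(dict)
    for row in rows:
        pk = tuple(row[c] for c in PARTITION_KEY)
        ck = tuple(row[c] for c in CLUSTERING_COLUMNS)
        # Two rows with the same full primary key collapse into one,
        # which is how the primary key keeps rows unique.
        grouped[pk][ck] = row
    # Within a partition, rows are ordered by the clustering columns.
    return {
        pk: [by_ck[ck] for ck in sorted(by_ck)]
        for pk, by_ck in grouped.items()
    }


partitions = build_partitions(ROWS)
austin = [(r["last_name"], r["first_name"]) for r in partitions[("Austin",)]]
print(austin)  # [('Adams', 'Bob'), ('Smith', 'Al'), ('Smith', 'Zoe')]
```

All the “Austin” rows land in one partition, and inside it they come back ordered by last name and then first name.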

How Does Cassandra Find the Node Containing the Data I Want?

Some people seem to think that driver clients just send data to a random node. But your driver picks a node to talk to in a non-random way. This node is called the coordinator node, and it’s typically chosen because it’s closest. Client requests can be sent to any node, and at first they’re sent to the nodes that your driver knows about. But once the driver software connects and understands the topology of your cluster, it might change to a closer coordinator. Check out the open source ecosystem project Stargate to learn how compute and storage can be separated for scalability.

Nodes in an open source Cassandra cluster exchange topology information with each other using the gossip protocol. The gossiper runs every second and ensures that all nodes are kept current with the data from whichever snitch you have configured. The snitch keeps track of which data centers and racks each node belongs to. In this way, the coordinator node also has data about which nodes are responsible for each token range.

You can see this information by running nodetool ring from the command line, although if you’re using virtual nodes or “vnodes,” that’ll be a little trickier to ascertain, as data on all the virtual nodes (256 per node by default in releases before Cassandra 4.0, which lowered the default to 16) will pretty quickly flash by the screen. On K8ssandra.io, this behavior is more Kubernetes-native, and etcd is used instead of the gossip protocol to propagate cluster metadata, as well as safe schema updates.
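The token-range lookup a coordinator performs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cassandra’s implementation: the three node names are hypothetical, each node owns a single token (no vnodes), SHA-256 stands in for Cassandra’s default Murmur3 partitioner, and replicas are simply the next nodes clockwise on the ring, ignoring racks and data centers.

```python
import bisect
import hashlib

TOKEN_SPACE = 2**64  # assumed token space for this sketch
NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

# One token per node, evenly spaced. Each node owns the range that
# ends at its token (a simplification: real clusters often use vnodes).
RING = sorted(
    ((i + 1) * TOKEN_SPACE // len(NODES) - 1, name)
    for i, name in enumerate(NODES)
)
TOKENS = [token for token, _ in RING]


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (SHA-256 as a stand-in for Murmur3)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_of(token: int) -> str:
    """Return the first node whose token is >= the given token, wrapping around."""
    index = bisect.bisect_left(TOKENS, token) % len(RING)
    return RING[index][1]


def replicas_for(partition_key: str, replication_factor: int = 2) -> list:
    """Return the owner plus the next nodes clockwise on the ring."""
    start = bisect.bisect_left(TOKENS, token_for(partition_key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(replication_factor)]


print(replicas_for("Austin"))
```

Because every node learns the same ring layout, any coordinator can run this lookup locally and forward the request to the right replicas without asking the rest of the cluster.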

How Do Secondary Indexes Work in Cassandra?

Indexing is pretty subtle, and it helps to understand the database internals. Take a look at this example code:

SELECT * FROM update_audit
WHERE scopeid = 35
  AND formid = 78005
  AND record_link_id = 9897;

How would this query work internally in Cassandra? Essentially, all the data for the partition with the scope ID equal to 35 and the form ID equal to 78005 would be returned, and then it would be filtered by the record link ID index. Cassandra will look for the record link ID index entry for 9897 and attempt to match its entries against the rows returned where scope ID equals 35 and form ID equals 78005. The intersection of the rows for the partition keys and the index keys will be returned.

You might reasonably ask whether a high-cardinality column like the record link ID would affect query performance. High-cardinality indices essentially create a row for almost every entry in the main table. Performance can be affected because Cassandra is designed for sequential reads for query results, and an index query essentially forces Cassandra to perform random reads. As the cardinality of your index increases, so does the time it takes to find the queried value.

So, would Cassandra touch all the nodes for the above query? No, it should only touch a node that’s responsible for the partition where scope ID equals 35 and form ID equals 78005. Indexes, likewise, are stored locally and only contain entries that are valid for the local node.
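The intersection described above can be imitated with a small Python sketch. It models the data held by a single node and is not how Cassandra stores indexes on disk. It assumes the partition key of update_audit is (scopeid, formid), as the explanation implies; the row_id column, the sample rows and the helper names are hypothetical.

```python
# Hypothetical rows of update_audit held by ONE node, grouped by the
# assumed partition key (scopeid, formid).
LOCAL_PARTITIONS = {
    (35, 78005): [
        {"row_id": 1, "record_link_id": 9897},
        {"row_id": 2, "record_link_id": 1234},
    ],
    (35, 78006): [
        {"row_id": 3, "record_link_id": 9897},
    ],
}


def build_local_index(partitions, column):
    """Map each indexed value to the locally stored rows that contain it."""
    index = {}
    for pk, rows in partitions.items():
        for row in rows:
            index.setdefault(row[column], set()).add((pk, row["row_id"]))
    return index


# The index only covers rows stored on this node.
LOCAL_INDEX = build_local_index(LOCAL_PARTITIONS, "record_link_id")


def query(scopeid, formid, record_link_id):
    """Return rows in the requested partition that also match the index."""
    pk = (scopeid, formid)
    partition_rows = LOCAL_PARTITIONS.get(pk, [])
    index_hits = LOCAL_INDEX.get(record_link_id, set())
    # Intersection of the partition's rows and the index entries.
    return [row for row in partition_rows if (pk, row["row_id"]) in index_hits]


result = query(35, 78005, 9897)
print(result)  # [{'row_id': 1, 'record_link_id': 9897}]
```

The row with record_link_id 9897 in the other partition is indexed too, but it is skipped because the partition key restricts the query to one partition on one node.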

What’s the Difference Between Cassandra and DataStax Astra DB?

Cassandra is an open source NoSQL database that powers the distributed applications that you’re probably using every day, at a massive scale. However, it’s up to you and your team to manage it yourselves. Astra DB, on the other hand, is a serverless database-as-a-service. It’s a fully managed, autoscaling cloud service that is built on Cassandra and runs on a public cloud provider of your choice. With the addition of the open source data API gateway Stargate, both Cassandra and Astra DB serve document, columnar and key-value NoSQL workloads. And with Astra DB, Stargate is automatically set up for you.

Pieter Humphrey is developer product marketing manager at DataStax. He has been in the tech industry for 20+ years, working in development, marketing, sales, and developer relations to advance Java technology in the enterprise, and more recently on the cloud.