4 Common Questions We Hear about Apache Cassandra
These are some of the top things developers want to know about this highly scalable, reliable NoSQL data store.
Sep 14th, 2022 7:00am by
DataStax sponsored this post.
What’s the Difference Between Partition, Clustering and Composite Keys In Cassandra?
Understanding how the primary key in wide-column databases is different from relational primary keys is a critical step in learning to wield Cassandra’s power. Wide-column stores like Cassandra use the notion of column families, a database object that contains multiple columns of related data that are used together, similar to traditional relational database tables. Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately.
Put another way, a column family is a collection of key-value pairs, where each key is mapped to a value that is a set of columns. To draw an analogy with relational databases, a column family is like a “table,” with each key-value pair being a “row.” For developers, wide-column tables can present themselves as a row-and-column table that is familiar and easy to work with, in code or via APIs.
Let’s look at an example table to help bring the concepts to life.
The table definition names a keyspace and some fields like “city,” “last name” and “first name,” with the primary key declared at the bottom. All tables in Cassandra, by the way, must include at least one partition key column. In this example, we’ll partition by “city.”
Anything else that follows is a clustering column. Notice the parentheses around “city”: they indicate that this is the partition key. The parentheses mark out the partition key explicitly, which matters when your partition key is composite and has more than one column. Then it’s clear which columns belong to the partition key and which ones are clustering columns.
The primary key’s main purpose is to ensure that a row is unique. It is made up of the partition key plus zero or more clustering columns, which control how rows are sorted within a partition. A primary key with two or more columns is called “composite” or “compound.”
The partition key has one or more columns and determines which partition a row belongs to, and so where in the cluster it is stored.
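To make the roles of the two key parts concrete, here is a minimal Python sketch. It is a model of the idea, not Cassandra code. It assumes a hypothetical primary key of ((city), last_name, first_name); the column names, the sample data and the helper build_partitions are illustrative only.

```python
from collections import defaultdict

# Assumed layout for illustration: PRIMARY KEY ((city), last_name, first_name).
PARTITION_KEY = ("city",)
CLUSTERING_COLUMNS = ("last_name", "first_name")


def key_of(row, columns):
    """Return the tuple of values a row holds for the given columns."""
    return tuple(row[c] for c in columns)


def build_partitions(rows):
    """Group rows by partition key and sort each group by clustering columns.

    A later row with the same full primary key replaces the earlier one,
    mirroring the way a write to an existing primary key acts as an upsert.
    """
    partitions = defaultdict(dict)
    for row in rows:
        pk = key_of(row, PARTITION_KEY)
        ck = key_of(row, CLUSTERING_COLUMNS)
        partitions[pk][ck] = row
    return {
        pk: [rows_by_ck[ck] for ck in sorted(rows_by_ck)]
        for pk, rows_by_ck in partitions.items()
    }


if __name__ == "__main__":
    sample = [
        {"city": "Austin", "last_name": "Smith", "first_name": "Zoe"},
        {"city": "Austin", "last_name": "Jones", "first_name": "Al"},
        {"city": "Boston", "last_name": "Adams", "first_name": "Sam"},
    ]
    for pk, rows in build_partitions(sample).items():
        print(pk, [(r["last_name"], r["first_name"]) for r in rows])
```

In this model, every row for one city lands in the same partition, and the clustering columns decide the order of rows inside it.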
How Does Cassandra Find the Node Containing the Data I Want?
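As a rough model of the lookup this section describes, the Python sketch below hashes a partition key to a token and finds the node that owns the token range containing it. The Ring class, the node names and the token values are hypothetical. MD5 stands in for Cassandra's default Murmur3 partitioner only because it ships with the Python standard library, and replication and vnode assignment are left out.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Stand-in hash for illustration; this is not the Murmur3 hash that
    Cassandra's default partitioner uses.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    """Hypothetical token ring: each node owns one or more tokens."""

    def __init__(self, node_tokens):
        # node_tokens maps a node name to the list of tokens it owns.
        pairs = sorted(
            (token, node)
            for node, tokens in node_tokens.items()
            for token in tokens
        )
        self._tokens = [token for token, _ in pairs]
        self._owners = [node for _, node in pairs]

    def owner_of_token(self, token: int) -> str:
        # A node owns the range that ends at its token, so pick the first
        # ring token >= the key's token and wrap around past the last one.
        i = bisect.bisect_left(self._tokens, token)
        return self._owners[i % len(self._owners)]

    def owner(self, partition_key: str) -> str:
        return self.owner_of_token(token_for(partition_key))


if __name__ == "__main__":
    ring = Ring({"node-a": [-100, 50], "node-b": [0], "node-c": [100]})
    print(ring.owner_of_token(51))
    print(ring.owner("Austin"))
```

Because every node can compute the same mapping from the shared topology, whichever node acts as coordinator can forward a request to the right owner.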
Some people seem to think that driver clients just send data to a random node. But there’s really a non-random way that your driver picks a node to talk to. This node is called the coordinator node, and it’s typically chosen because it’s closest. Client requests can be sent to any node, and at first they’re sent to the nodes that your driver knows about. But once the driver software connects and understands the topology of your cluster, it might change to a closer coordinator. Check out the open source ecosystem project Stargate to learn how compute and storage can be separated for scalability.
Nodes in an open source Cassandra cluster exchange topology information with each other using the gossip protocol. The gossiper runs every second and ensures that all nodes are kept current with the data from whichever snitch you have configured. The snitch keeps track of which data center and rack each node belongs to. In this way, the coordinator node also has data about which nodes are responsible for each token range.
You can see this information by running nodetool ring from the command line, although if you’re using virtual nodes, or “vnodes,” that’ll be a little trickier to read, as data on all 256 virtual nodes (the default before Cassandra 4.0, which lowered it to 16) will pretty quickly flash by the screen. On K8ssandra.io, this behavior is more Kubernetes-native, and etcd is used instead of the gossip protocol to propagate cluster metadata, as well as safe schema updates.
How Do Secondary Indexes Work in Cassandra?
Indexing is pretty subtle. It helps to understand the database internals. How would this query work internally in Cassandra? Take a look at this example code:
SELECT * FROM update_audit
WHERE scopeID = 35
  AND formid = 78005
  AND record_link_id = 9897;
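The schema of update_audit is not shown, so the following is only a rough Python model under stated assumptions: record_link_id is not part of the partition key and carries a secondary index, and each node indexes only the rows it stores locally. The Node class and the query_by_index helper are hypothetical names, not Cassandra APIs.

```python
from collections import defaultdict


class Node:
    """Hypothetical node that keeps a local index over its own rows only."""

    def __init__(self, name, indexed_column):
        self.name = name
        self.indexed_column = indexed_column
        self.rows = []
        self.index = defaultdict(list)  # indexed value -> rows stored here

    def write(self, row):
        # Store the row and update this node's local index entry for it.
        self.rows.append(row)
        self.index[row[self.indexed_column]].append(row)

    def lookup(self, value):
        # Return local rows whose indexed column equals the given value.
        return list(self.index.get(value, []))


def query_by_index(nodes, value):
    """Ask every node for local index hits and merge the results.

    Returns the matching rows and the number of nodes contacted.
    """
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results, len(nodes)


if __name__ == "__main__":
    nodes = [Node(f"node-{i}", "record_link_id") for i in range(3)]
    nodes[0].write({"scopeid": 35, "formid": 78005, "record_link_id": 9897})
    nodes[2].write({"scopeid": 36, "formid": 78006, "record_link_id": 9897})
    rows, contacted = query_by_index(nodes, 9897)
    print(len(rows), "rows from", contacted, "nodes")
```

With only the indexed column to go on, the coordinator in this model has to ask every node. Restricting the query by partition key as well would let it contact just the replicas for that partition.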
What’s the Difference Between Cassandra and DataStax Astra DB?
Cassandra is an open source NoSQL database that powers the distributed applications that you’re probably using every day, at massive scale. However, it’s up to you and your team to manage it yourselves. Astra DB, on the other hand, is a serverless database-as-a-service. It’s a fully managed, autoscaling cloud service that is built on Cassandra and runs on a public cloud provider of your choice. With the addition of the open source data API gateway Stargate, both Cassandra and Astra DB serve document, columnar and key-value NoSQL workloads. And with Astra DB, Stargate is automatically set up for you.