How We Completed a Massive Kafka and Cassandra Migration

A look at the strategy and process, along with some best practices to make any large-scale, mission-critical Cassandra and Kafka migration go smoother.

Jun 6th, 2024 9:19am by Ben Slater

Image from hansen.matthew.d on Shutterstock

Any data-layer migration calls for careful planning and execution regardless of migration size. That said, we recently completed what may well be the largest Apache Cassandra and Apache Kafka migration ever performed (Guinness World Records doesn’t exactly keep tabs on these … yet).

It is, in my opinion, a particularly interesting use case for achieving a rather complex technical feat without downtime (and using only the fully open source versions of Cassandra and Kafka — no open core here). Below, I’ll share the strategy and process that was used, along with some best practices that will help make any large-scale, mission-critical Cassandra and Kafka migration go smoother.

Managing a Migration of Massive Scale

Let’s set the scene on how big this migration was. This enterprise’s open source Cassandra deployment consisted of 58 clusters and 1,079 nodes, including 17 different node sizes, spread across AWS and Google Cloud Platform (GCP) in six separate cloud provider regions. On the Kafka front, the company used 154 clusters and 1,050 nodes of 21 node sizes, again across the two cloud providers and six regions. As you can imagine, making the move required tremendous time and focus. The timeline called for nine months for preparation, followed by eight months of careful production migrations.

As with any migration, strong project management and governance were critical. If this step wobbles out of the gate, you’re in for trouble later. We assigned specific responsibilities to quite a few key roles aligned with our project management methodology, including an overall program manager, a migration project manager for Cassandra and another for Kafka, technical leads for each and a key product manager. This team was quick to develop close collaboration and clear communication with the enterprise, another proven method for positive project results.

That close contact proved its value across the initial phase of the project, as we worked in sync with the enterprise’s architectural, security and compliance teams to meet their strict requirements in these areas. This meant ensuring that the destination environments for the migration would have intrusion detection, access logging, audit logs, hardened operating systems, and account-level opt-in to automatically configure new clusters with log shipping and other controls. We also enabled the loading process for custom Kafka Connect connectors to use instance roles instead of access keys for Amazon S3 access, and made improvements to the SCIM (System for Cross-Domain Identity Management) API for provisioning single sign-on (SSO) access.

During this preparation phase, we also recognized and acted on opportunities to optimize the architectural fit for migrated clusters. Because the enterprise’s architecture delivered high availability above the Kafka cluster level, we used RF2 (replication factor 2) to support Kafka clusters running in two Availability Zones. We also prepared to optimize costs by taking advantage of the latest AWS and GCP node types.

The Kafka Migration

The “Drain Out” approach is the first thought for Kafka migrations: Simply point Kafka consumers at source and destination clusters, switch producers to send messages to the destination cluster only, wait until all messages are read from the source and voilà. The limitation is that a Drain Out doesn’t preserve message ordering, something essential to many Kafka use cases, including this one.
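
To see why ordering breaks, here is a minimal sketch in plain Python, with no Kafka client involved. The function name and the message numbers are made up for illustration, and strict alternation between clusters is only one possible interleaving. It models a consumer that reads from both clusters while the source still holds unread messages.

```python
from collections import deque


def drain_out_consume(source_backlog, destination_log):
    """Interleave reads from two independent logs, as a consumer polling
    both the source and destination clusters might see them."""
    source = deque(source_backlog)
    destination = deque(destination_log)
    seen = []
    while source or destination:
        # Nothing coordinates the two clusters, so reads simply alternate here.
        if source:
            seen.append(source.popleft())
        if destination:
            seen.append(destination.popleft())
    return seen


# Messages for one key, numbered in the order the producer sent them.
source_backlog = [1, 2, 3]   # sent before the producer switch, not yet consumed
destination_log = [4, 5]     # sent after the producer switch

order = drain_out_consume(source_backlog, destination_log)
print(order)                   # [1, 4, 2, 5, 3]
print(order == sorted(order))  # False: producer order was not preserved
```

A real consumer would interleave less predictably than this, but the point holds: once the same key has unread messages on two clusters, nothing guarantees they are read in the order they were produced.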

MirrorMaker 2 offers another strong option for Kafka migrations. However, its high dependency on the consumer and producer applications meant it wasn’t a fit here.

The “Shared Cluster” approach — operating source and destination clusters as a single cluster — emerged as the best remaining option. We proceeded to create a detailed change plan for each cluster with rollback enablement always in mind. High-level steps began with provisioning the destination cluster, updating configurations to match the source and joining network environments to the source cluster with virtual private cloud peering. We then started Apache ZooKeeper at the destination in observer mode, along with the destination Kafka brokers.

Next, we moved the data using Kafka partition reassignment. This involved increasing the replication factor so that each partition had replicas on both source and destination brokers, swapping the preferred leaders to destination brokers and then decreasing the replication factor to remove the source broker replicas. We finished by reconfiguring the clients to make the destination brokers their initial contact points and then removing the old brokers.
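
The three reassignment phases can be sketched as replica lists for a single partition. The Python below is an illustration rather than the tooling used in the migration: the function name, topic name and broker IDs are all assumptions. Each phase is emitted in the JSON layout that Kafka's partition reassignment tool accepts, where the first broker in a replica list is the preferred leader.

```python
import json


def reassignment_phases(topic, partition, source_replicas, destination_replicas):
    """Return three reassignment documents that move one partition.

    Phase 1 expands the replica set across both sets of brokers,
    phase 2 reorders it so a destination broker is the preferred leader,
    phase 3 shrinks it to destination brokers only.
    """
    expand = list(source_replicas) + list(destination_replicas)
    swap_leader = list(destination_replicas) + list(source_replicas)
    shrink = list(destination_replicas)
    return [
        {
            "version": 1,
            "partitions": [
                {"topic": topic, "partition": partition, "replicas": replicas}
            ],
        }
        for replicas in (expand, swap_leader, shrink)
    ]


# Hypothetical IDs: brokers 1 and 2 are source, 101 and 102 are destination.
for phase in reassignment_phases("orders", 0, [1, 2], [101, 102]):
    print(json.dumps(phase))
```

Each document would be applied and allowed to finish before the next one is submitted. After the second phase, a preferred leader election is typically still needed before leadership actually moves to the destination brokers.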

The source environment came with a few extra wrinkles that we ironed out over the migration. For example, it shared one ZooKeeper instance across multiple clusters, leading us to carefully reconfigure and clean each destination ZooKeeper of data on other clusters. We also extended the destination configuration to support the enterprise’s particular port listener mappings, avoiding major reconfiguration work.

The Cassandra Migration

The most common approach for a zero-downtime Cassandra migration is to add a data center to an existing cluster. We also use and recommend our Instaclustr Minotaur consistent rebuild tool (available on GitHub). This open source solution addresses a weakness of the standard rebuild: when a source cluster is missing data replicas, the rebuild process can copy multiple replicas from the same node, leaving fewer replicas at the destination. Minotaur ensures that the destination cluster has at least as many replicas as the source, so that any repairs needed can be put off until after migration.
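
The replica-count problem is easy to show with a toy model. The sketch below is not Minotaur's implementation and does not use any Cassandra API. The node names and the function are hypothetical, and it only counts how many destination replicas receive a row depending on which source nodes they stream from.

```python
def copies_at_destination(source_has_row, stream_sources):
    """Count destination replicas that end up holding the row.

    source_has_row: maps each source node to whether it holds the row.
    stream_sources: the source node each destination replica streams from.
    """
    return sum(1 for node in stream_sources if source_has_row[node])


# An inconsistent source with three replicas: one of them is missing the row.
source_has_row = {"s1": True, "s2": True, "s3": False}

# Ordinary rebuild: several destination replicas may stream from the same node.
naive = copies_at_destination(source_has_row, ["s3", "s3", "s1"])

# Consistent rebuild: each destination replica streams from a distinct source replica.
consistent = copies_at_destination(source_has_row, ["s1", "s2", "s3"])

print(naive, consistent)  # 1 2
```

In this example the source holds two copies of the row. The ordinary rebuild leaves the destination with one, while the one-to-one mapping carries over both, which is the "at least as many replicas as the source" property described above.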

This approach proved especially valuable when we encountered clusters with high inconsistency. In one case, a cluster required two and a half months of repair after the migration. Another set of clusters regularly dropped tables every two to three hours, which disrupted rebuilds because Cassandra drops temporary data when the schema changes during streaming. We first tried to solve this by manually pausing the table drops during node rebuilds, but found that method unsustainable. In the end, we used our provisioning API to detect node status and automate the pausing of table drops when necessary.
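
The automation amounts to a gate in front of each table drop. Below is a minimal sketch of such a gate, assuming two hypothetical callables: `get_node_statuses` stands in for a provisioning API query and `drop_table` for the job that drops the table. The `"REBUILDING"` status string is also an assumption, since the article does not describe the real API.

```python
import time


def run_drop_when_safe(get_node_statuses, drop_table, poll_seconds=60, sleep=time.sleep):
    """Hold a table drop until no node reports that it is rebuilding.

    get_node_statuses: hypothetical callable returning a list of status strings.
    drop_table: hypothetical callable that performs the drop.
    Returns the number of times the drop was postponed.
    """
    postponed = 0
    while any(status == "REBUILDING" for status in get_node_statuses()):
        postponed += 1
        sleep(poll_seconds)  # wait before asking for node status again
    drop_table()
    return postponed


# Usage with stand-in callables: the first poll sees a rebuild, the second does not.
polls = iter([["RUNNING", "REBUILDING"], ["RUNNING", "RUNNING"]])
dropped = []
count = run_drop_when_safe(
    lambda: next(polls),
    lambda: dropped.append("temp_table"),
    sleep=lambda seconds: None,  # skip real waiting in this example
)
print(count, dropped)  # 1 ['temp_table']
```

A simple poll like this leaves a small window in which a rebuild could start just after the check, so a production version would likely also need the rebuild side to confirm that drops are paused before it begins streaming.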

Big Challenge, Big Success

In the end, the (perhaps) largest Cassandra and Kafka migration ever was completed on schedule and with minimal hiccups. I credit this positive outcome to the close-knit cooperation, thoughtful planning, and strategic best practices employed by all involved, and advise anyone engaging in similarly large and complex migrations to apply these same techniques.

Ben Slater is vice president and general manager at Instaclustr by NetApp, which provides a managed platform around open source data technologies. Prior to Instaclustr, Ben was at Accenture for more than a decade, where he worked on data warehousing,...