Introduction to Rebalancing

Rebalancing is a technique used in distributed computing and database systems to keep data evenly distributed across nodes or shards. When data is partitioned across multiple nodes or shards, it should be placed so that the workload is balanced across all of them.

If data is not distributed evenly, some nodes can become overloaded, leading to poor performance, slower queries, or even downtime.
Rebalancing is therefore important for the performance, availability, and scalability of distributed systems. By keeping data evenly spread across nodes or shards, it helps to prevent hotspots, reduce downtime, and improve the overall performance of the system.

There are several strategies for rebalancing, such as hash-based partitioning, fixed partitioning, and dynamic rebalancing.

Expectations During Rebalancing

While rebalancing a distributed system or database, there are certain expectations that one should keep in mind:

Minimal Data Movement: Rebalancing moves data from one node or shard to another, which is time-consuming and resource-intensive. The amount of data moved should therefore be kept as small as possible to limit the impact on the system's performance and availability.
Availability During Rebalancing: The database should keep accepting reads and writes while rebalancing is in progress. Nodes or shards should not end up with conflicting versions of the data, and there should be no data loss or corruption.
Fair Partitioning: After rebalancing, the workload should be distributed evenly across all nodes or shards, and no node or shard should be overloaded.
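
The first and third expectations can be turned into numbers. The sketch below is one possible way to do that, not a standard definition: the function names moved_fraction and imbalance, and the sample placements, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def moved_fraction(before: dict, after: dict) -> float:
    """Fraction of keys whose owning node changed between two placements."""
    moved = sum(1 for key, node in before.items() if after[key] != node)
    return moved / len(before)


def imbalance(placement: dict) -> float:
    """Load of the busiest node divided by the mean load.

    1.0 means perfectly even. Nodes that own no keys are not counted,
    because they do not appear in the placement at all.
    """
    counts = Counter(placement.values())
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


# Hypothetical placements: key -> node, before and after adding node "C".
before = {"k1": "A", "k2": "A", "k3": "B", "k4": "B"}
after = {"k1": "A", "k2": "C", "k3": "B", "k4": "B"}

print("moved:", moved_fraction(before, after))
print("imbalance before:", imbalance(before))
print("imbalance after:", imbalance(after))
```

A rebalancing strategy does well on these measures when the moved fraction stays small and the imbalance stays close to 1.0.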

Rebalancing Strategies

Below are the main rebalancing strategies:

1. Hash-Based

In this strategy, the hash function used for sharding is also used to determine the new shard for any given piece of data. In its simplest form, a key is assigned to node hash(key) mod N. When a shard becomes overloaded, some of its data can be moved to a new shard based on the hash value of the data.

N: The total number of nodes in the cluster.
Partitioning can start fair but drift over time: Initially, the distribution might be even, but as data grows, imbalance can occur.
Readjusting N is very costly: Changing the number of nodes (scaling up or down) changes the modulo result for almost every key, so the majority of keys have to move in a massive data migration.
Generally not preferred: Because of this high rebalancing cost, simple modulo hashing is often avoided in favor of consistent hashing (discussed previously).
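
The cost of changing N can be checked with a small experiment. This is a minimal sketch under stated assumptions: MD5 is used only as a convenient stable hash (Python's built-in hash() for strings is salted per process), and the key names and node counts are made up for illustration.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash of a key (assumption: MD5 is good enough here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(key: str, n: int) -> int:
    """Simple modulo placement: hash(key) mod N."""
    return stable_hash(key) % n


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys that land on a different node when N changes."""
    moved = sum(1 for k in keys if node_for(k, old_n) != node_for(k, new_n))
    return moved / len(keys)


# Hypothetical workload: 10,000 keys, cluster grows from 4 to 5 nodes.
keys = [f"user-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_n=4, new_n=5)
print(f"{fraction:.0%} of keys moved")
```

Going from 4 to 5 nodes, a key stays put only if its hash gives the same remainder mod 4 and mod 5, which holds for 4 out of every 20 hash values. So roughly 80% of keys move, whereas an ideal scheme would move only about the 20% that belong on the new node.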

2. Fixed Partitions

In this approach, the number of partitions is fixed to begin with and is chosen when the database is set up. Choosing the right number of partitions is difficult if the size of the data is highly variable. If partitions are very large, rebalancing and recovery from node failures become expensive. Conversely, if they are too small, they incur too much overhead.

The total number of partitions is fixed, and each node owns several of them.
When a node is added, it takes over a few partitions from each existing node; when a node is removed, its partitions are handed back to the remaining nodes.
Scalability can be limited by the total number of partitions (since you can't have more nodes than partitions).
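
The points above can be sketched as follows. This is an illustrative model, not how any particular database implements it: the partition count of 12, the node names, and the helper functions are all hypothetical. The key-to-partition mapping never changes; only the partition-to-node assignment does.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 12  # hypothetical; chosen once at setup and never changed


def partition_for(key: str) -> int:
    """Key -> partition. Independent of how many nodes exist."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def initial_assignment(nodes):
    """Spread the fixed set of partitions round-robin over the nodes."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(assignment, new_node):
    """Return a new assignment in which new_node takes whole partitions
    from the currently most-loaded existing nodes, one at a time."""
    result = dict(assignment)
    node_count = len(set(result.values())) + 1
    target = NUM_PARTITIONS // node_count  # fair share for the new node
    for _ in range(target):
        counts = Counter(n for n in result.values() if n != new_node)
        donor = counts.most_common(1)[0][0]
        victim = next(p for p, n in sorted(result.items()) if n == donor)
        result[victim] = new_node
    return result


nodes = ["node-a", "node-b", "node-c"]
before = initial_assignment(nodes)
after = add_node(before, "node-d")
moved = [p for p in before if before[p] != after[p]]
print(f"{len(moved)} of {NUM_PARTITIONS} partitions moved")
```

In this example only 3 of the 12 partitions change owner, all of them to the new node, and every node ends up with 3 partitions. The sketch also shows the scalability limit: a thirteenth node would have no partition left to take.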

3. Dynamic Rebalancing

This approach uses real-time monitoring of the system's performance to determine when and how to rebalance. For example, if a shard starts to experience high load or becomes unresponsive, data can be moved to other shards to maintain optimal performance.

Similar to a fixed number of partitions: Data is still divided into partitions that are assigned to nodes, as in the fixed partition strategy, but the number of partitions is not fixed up front.
Creation on Threshold: New partitions are created within a shard when it reaches a certain threshold (e.g., a size or load limit).
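
Threshold-based creation can be sketched as a partition that splits in two once it grows too large. The example assumes range partitioning over string keys and uses the number of keys as the size threshold; both are assumptions made for illustration, and the class name RangePartitionedTable and the threshold of 4 are hypothetical.

```python
import bisect


class RangePartitionedTable:
    """Toy key store whose partitions split when they exceed max_keys."""

    def __init__(self, max_keys: int):
        self.max_keys = max_keys
        # lower_bounds[i] is the smallest key that partition i may hold.
        self.lower_bounds = [""]
        # Each partition is a sorted list of keys.
        self.partitions = [[]]

    def _index_for(self, key: str) -> int:
        """Find the partition whose key range contains the key."""
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def insert(self, key: str) -> None:
        i = self._index_for(key)
        part = self.partitions[i]
        pos = bisect.bisect_left(part, key)
        if pos < len(part) and part[pos] == key:
            return  # key already stored; nothing to do
        part.insert(pos, key)
        if len(part) > self.max_keys:
            self._split(i)

    def _split(self, i: int) -> None:
        """Split partition i at its median key into two partitions."""
        part = self.partitions[i]
        mid = len(part) // 2
        left, right = part[:mid], part[mid:]
        self.partitions[i] = left
        self.partitions.insert(i + 1, right)
        self.lower_bounds.insert(i + 1, right[0])


table = RangePartitionedTable(max_keys=4)
for key in "abcdefghij":
    table.insert(key)
print(table.partitions)
```

Each new partition produced by a split could then be moved to a less loaded node, which is what lets the number of partitions follow the data volume instead of being chosen in advance.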