Distributing databases with sharding and replication – why and when

By Suvorov Nikolai (nmsuvorov@edu.hse.ru)

Any application must meet various requirements to be useful and valuable. Functional requirements define what an application should do, such as allowing data to be stored, modified, and retrieved, while nonfunctional requirements define properties of the application, such as security, reliability, maintainability, scalability, and availability. Most functional requirements can be satisfied by almost any modern data store, whereas nonfunctional requirements can only be satisfied by choosing the right stores for a given application.

According to [1], the main nonfunctional requirements that influence the choice of data store include:

  • Reliability (the system should continue to work correctly even when faults occur).
  • Scalability (the system should keep performance at an acceptable level even as load increases).
  • Maintainability (the system should be easy to modify and adapt to new use cases).

Every type of database has its own area of appropriate use and its own reasons to be chosen. Distributed databases, in particular, are typically chosen for the following reasons:

  • Scalability – if the data volume, read load, or write load grows beyond what a single machine can handle, the load can be spread across multiple machines.
  • High availability – if an application needs to continue working when one machine goes down, running on multiple machines simultaneously can be the solution.
  • Latency – if the users of an application are spread all over the world, it may be worth placing servers in several locations so that data does not have to travel long distances, thus decreasing response time.

Such distributed systems mainly implement shared-nothing architectures, described in detail in [2], and are usually based on the replication and sharding (partitioning) approaches described in the following sections. The sections cover the approaches themselves and discuss the benefits and potential risks of using them.

Replication means keeping copies of the same data on multiple machines. Replication is mainly used to achieve high availability, sufficient scalability, and lower latency, and it can even allow an application to keep working during a network interruption [1]. This section describes three main methods implementing this approach and shows their advantages and disadvantages.

The most popular approach, according to [1], is single-leader replication: only one node accepts write operations, while all nodes can serve read operations (which significantly increases read throughput compared with a single-machine store). When a new write operation arrives, it is first executed by the leader node, which then sends (synchronously or asynchronously) requests to the follower nodes to update the affected data. With asynchronous replication, reads from followers may be stale, as it takes time for changes to reach all followers (this is only a temporary state, and the effect is known as eventual consistency [3]).
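
To make the division of roles concrete, the following is a minimal Python sketch of single-leader replication under simplifying assumptions; the Leader and Follower classes and their methods are invented for illustration and do not represent any real database's API:

    # A minimal, simplified sketch of single-leader replication (illustrative only).

    class Follower:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            # Followers only apply changes forwarded by the leader.
            self.data[key] = value

    class Leader:
        def __init__(self, followers):
            self.data = {}
            self.followers = followers
            self.log = []  # replication log of accepted writes

        def write(self, key, value):
            # Every write goes through the leader first...
            self.data[key] = value
            self.log.append((key, value))
            # ...and is then forwarded to the followers. Here this is done
            # synchronously for simplicity; with asynchronous replication the
            # forwarding happens in the background, which is why follower
            # reads can be briefly stale.
            for follower in self.followers:
                follower.apply(key, value)

    followers = [Follower(), Follower()]
    leader = Leader(followers)
    leader.write("user:1", "Alice")
    print(followers[0].data)  # reads may be served by any follower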

A more robust approach (with respect to faulty nodes and latency spikes) is multi-leader replication: a client can send write requests to any of several leader nodes, which then forward the updates to each other and to the follower nodes. This method speeds up write requests compared with single-leader replication (the load on each leader decreases, and writes can go to a local datacenter that is closer to the client), but it has one serious drawback: write conflicts can occur if the same data is concurrently modified on two different leaders, and such conflicts must somehow be resolved (most implementations of this type of replication handle conflicts poorly, so the main approach for dealing with conflicts is simply to avoid them [4]).
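
As an illustration of why such conflicts are problematic, the sketch below shows one widely used but lossy resolution strategy, last-write-wins based on timestamps; the function and data layout are assumptions made for this example rather than the behaviour of any particular system:

    # Last-write-wins conflict resolution for multi-leader replication
    # (a simplified illustration; names and data layout are invented here).
    import time

    def resolve_last_write_wins(version_a, version_b):
        """Each version is a (timestamp, value) pair accepted by a different
        leader. The later timestamp wins; the other write is silently
        discarded, which is exactly why avoiding conflicts is preferable."""
        return version_a if version_a[0] >= version_b[0] else version_b

    # Two leaders accept concurrent writes to the same record:
    write_on_leader_1 = (time.time(), "title = 'B'")
    write_on_leader_2 = (time.time(), "title = 'C'")
    winner = resolve_last_write_wins(write_on_leader_1, write_on_leader_2)
    print(winner[1])  # the value that survives; the other update is lost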

Another robust approach is leaderless replication: there are no leader and follower nodes, and clients send each write request to several nodes and read from several nodes in parallel [1]. Read and write operations rely on quorums to be considered successful [5]; such quorums are also used for repairing data through a read repair mechanism: any replica found to hold a stale version is updated with the newest value obtained by the quorum read. However, conflicts can still arise during concurrent writes as well as during read repair; some techniques for resolving them are mentioned in [1], but it is still recommended to avoid such conflicts whenever possible.
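
The quorum idea can be summarised as follows: with n replicas, a write must be acknowledged by w nodes and a read must query r nodes, and if w + r > n, every read overlaps with the most recent successful write. The Python sketch below is a strongly simplified illustration of quorum reads with read repair; the Replica class and the versioning scheme are assumptions made for this example:

    # Simplified quorum reads/writes with read repair (illustrative only).

    class Replica:
        def __init__(self):
            self.store = {}  # key -> (version, value)

    def quorum_write(replicas, key, value, version, w):
        # Send the write to all replicas; in reality some may not respond,
        # and the write succeeds only if at least w acknowledgements arrive.
        acks = 0
        for replica in replicas:
            replica.store[key] = (version, value)
            acks += 1
        return acks >= w

    def quorum_read(replicas, key, r):
        contacted = replicas[:r]  # query r replicas in parallel
        responses = [rep.store.get(key, (0, None)) for rep in contacted]
        latest = max(responses, key=lambda v: v[0])
        # Read repair: push the newest version to any stale replica we saw.
        for rep in contacted:
            if rep.store.get(key, (0, None))[0] < latest[0]:
                rep.store[key] = latest
        return latest[1]

    n, w, r = 3, 2, 2  # w + r > n, so reads and writes always overlap
    replicas = [Replica() for _ in range(n)]
    quorum_write(replicas, "k", "v1", version=1, w=w)
    print(quorum_read(replicas, "k", r=r))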

Although making a copy of data may seem an easy task, many aspects have to be considered, including concurrent writes (mentioned above), read-after-write consistency [6], monotonic reads [7], consistent prefix reads [7], as well as dealing with unavailable nodes and network interruptions. Nevertheless, replication goes a long way towards making a data store robust and able to handle high volumes of reads and writes. It also allows data to be placed geographically close to clients, so that users can interact with it faster.
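
For instance, read-after-write consistency is often provided by routing a user's reads to the leader for a short period after that user's own write, so that they never see a follower that has not yet caught up. The sketch below illustrates this idea; the time window and names are assumptions made for this example, not a standard mechanism of any particular database:

    # One possible way to provide read-after-write consistency on top of
    # asynchronous single-leader replication (illustrative assumptions only).
    import time

    RECENT_WRITE_WINDOW = 1.0  # seconds; an assumed threshold, not a standard value
    last_write_at = {}         # user_id -> timestamp of that user's last write

    def record_write(user_id):
        last_write_at[user_id] = time.time()

    def choose_read_node(user_id, leader, followers):
        # Right after their own write the user reads from the leader, which is
        # guaranteed to have the new value; otherwise any follower will do.
        if time.time() - last_write_at.get(user_id, 0) < RECENT_WRITE_WINDOW:
            return leader
        return followers[0]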

The main goal of sharding (partitioning) is to spread data and query load evenly across multiple machines [1]. Potentially, a large dataset can be distributed across many disks and the query load across many processors. Despite its benefits, sharding requires developers to choose a partitioning scheme for the data carefully, as it directly influences overall performance. In addition, rebalancing techniques must be considered for when new nodes are added or existing nodes are removed. It is also worth mentioning that sharding is usually combined with replication, so that copies of each partition are stored on multiple nodes for fault tolerance.

When sharding is adopted, the first thing to decide is how the data will be partitioned. Several methods can be used for this task, including the following:

  • Assigning a continuous range of keys to each partition – with this approach, data may be distributed unevenly across partitions and ‘hot spots’ can occur [8], but keys are kept in sorted order, which makes range scans easy [1].
  • Using a hash function to determine the shard for a given key – with this approach, the data is distributed evenly and fairly among the shards, but range queries become more complex to execute (for instance, in MongoDB with hashed sharding enabled, any range query is sent to all partitions [9]); a minimal sketch contrasting the two schemes is given after this list.
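
The following Python sketch contrasts the two schemes; the key space, range boundaries, and shard count are arbitrary assumptions made for illustration:

    # Key-range vs. hash partitioning (simplified illustration).
    import bisect
    import hashlib

    # Key-range partitioning: each shard owns a contiguous range of keys, so
    # keys stay sorted and range scans touch few shards, but skewed keys can
    # create hot spots.
    RANGE_BOUNDARIES = ["g", "n", "t"]  # shard 0: keys < "g", shard 1: < "n", ...

    def range_shard(key):
        return bisect.bisect_right(RANGE_BOUNDARIES, key)

    # Hash partitioning: keys are spread evenly, but adjacent keys usually land
    # on different shards, so a range query must be sent to every shard.
    NUM_SHARDS = 4

    def hash_shard(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

    print(range_shard("apple"), range_shard("banana"))  # adjacent keys, same shard
    print(hash_shard("apple"), hash_shard("banana"))    # likely scattered across shards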

As can be seen, the choice of technique depends heavily on the data that is going to be stored and on the queries that are going to be executed against it. Once the storage is in operation, new nodes may be added and existing nodes may be removed, which is why rebalancing techniques also have to be considered. Currently, there are three main methods used for this purpose:

  • Using a fixed number of partitions – many partitions are assigned to each node, and when a new node is added, it ‘steals’ a few partitions from every existing node [1] (a minimal sketch of this strategy is given after this list).
  • Using dynamic partitioning – any number of partitions can be assigned to a single node, and when a partition exceeds a maximum size, it is split into multiple partitions (for instance, this is implemented in Apache HBase, where a single partition is split into two once it exceeds 10 GB [10]).
  • Using partitioning proportional to nodes – a fixed number of partitions per node is configured, so the size of each partition grows proportionally to the dataset size while the number of nodes stays the same, but when a new node is added, the partitions become smaller again [11].
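
As an illustration of the first strategy, the sketch below fixes the key-to-partition mapping once and reassigns only whole partitions when a node joins; the partition count, node names, and rebalancing policy are assumptions made for this example:

    # Rebalancing with a fixed number of partitions (simplified illustration).
    import hashlib

    NUM_PARTITIONS = 12  # chosen once, well above the expected number of nodes

    def partition_of(key):
        # The key -> partition mapping never changes, so rebalancing never
        # rehashes individual keys.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

    def rebalance(assignment, new_node):
        # The new node "steals" whole partitions from the existing nodes until
        # the load is roughly even.
        assignment = dict(assignment)
        nodes = set(assignment.values()) | {new_node}
        target = NUM_PARTITIONS // len(nodes)
        per_node = {}
        for partition, owner in assignment.items():
            per_node.setdefault(owner, []).append(partition)
        stolen = 0
        for owner, partitions in per_node.items():
            while len(partitions) > target and stolen < target:
                assignment[partitions.pop()] = new_node
                stolen += 1
        return assignment

    initial = {p: ["node-a", "node-b", "node-c"][p % 3] for p in range(NUM_PARTITIONS)}
    after = rebalance(initial, "node-d")
    moved = [p for p in range(NUM_PARTITIONS) if initial[p] != after[p]]
    print(f"partitions reassigned to node-d: {moved}")  # only a few partitions move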

Sharding is especially beneficial when an application handles so much data that storing and processing it on a single machine becomes unfeasible. Although it can successfully spread the data and the query load (and even parallelize complex queries across multiple nodes), various aspects need to be considered before implementing it, including an appropriate data schema, the rebalancing approach, request routing [1], etc. Nevertheless, sharding is currently the main approach for dealing with amounts of data that cannot be stored on a single machine, and when it is used correctly, its advantages significantly outweigh its disadvantages – especially when it is combined with replication, which substantially increases the robustness and reliability of the data store.

Despite all the benefits that replication and sharding can provide, in my view, it is still better to use them only when the nonfunctional requirements demand scalability, fault tolerance, or low latency. Using these technologies without a real need will only incur high costs and may even decrease overall performance. In addition, dozens of aspects must be carefully considered to make them beneficial for a given application. Nevertheless, when used appropriately, they can make an application more reliable (thanks to multiple replicas of the same data) and faster (thanks to spreading the query load and the possibility of decreasing latency). Such distributed architectures are becoming feasible even for small companies [1] and may well become far more widespread in the future.

But when should developers decide whether or not to use these technologies? The Agile methodology, which continues to grow in popularity, offers an answer to this question: when a specific corresponding requirement arises. Through Agile sprints, developers acquire more and more knowledge of the domain and of the customer’s requirements, and it is this knowledge, together with continuous interaction with the customer, that will at some point answer the question. In any case, only sufficient knowledge of the domain makes it possible to build a distributed architecture that delivers real benefits.

  1. Kleppmann, M., “Designing Data-Intensive Applications,” Beijing: O'Reilly, 2017.
  2. Stonebraker, M., “The Case for Shared Nothing,” IEEE Database Engineering Bulletin, vol. 9, no. 1, pp. 4-9, 1986.
  3. Vogels, W., “Eventually Consistent,” ACM Queue, vol. 6, no. 6, pp. 14-19, 2008.
  4. Hodges, R., “State of the Art for MySQL Multi-Master Replication,” Percona Live: MySQL Conference & Expo, 2013.
  5. Gifford, D. K., “Weighted Voting for Replicated Data,” Proceedings of the Seventh ACM Symposium on Operating Systems Principles, pp. 150-162, 1979.
  6. Terry, D. B., Demers, A. J., Petersen, K., et al., “Session Guarantees for Weakly Consistent Replicated Data,” Proceedings of the 3rd International Conference on Parallel and Distributed Information Systems, pp. 140-149, 1994.
  7. Terry, D. B., “Replicated Data Consistency Explained Through Baseball,” Microsoft Research, Technical Report MSR-TR-2011-137, 2011.
  8. Lan, I., “App Engine Datastore Tip: Monotonically Increasing Values Are Bad,” URL: https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/, accessed: 14.12.2021.
  9. MongoDB, “New Hash-based Sharding Feature in MongoDB 2.4,” URL: https://www.mongodb.com/blog/post/new-hash-based-sharding-feature-in-mongodb-24, accessed: 14.12.2021.
  10. Soztutar, E., “Apache HBase Region Splitting and Merging,” URL: https://blog.cloudera.com/apache-hbase-region-splitting-and-merging/, accessed: 14.12.2021.
  11. Evans, E., “Rethinking Topology in Cassandra,” ApacheCon Europe, 2012, URL: https://www.slideshare.net/jericevans/virtual-nodes-rethinking-topology-in-cassandra, accessed: 14.12.2021.