Horizontal, vertical, and functional data partitioning
In this tutorial, we will learn about horizontal, vertical, and functional data partitioning.
Why partition data?
- Firstly, improve scalability.
- Secondly, improve performance.
- Thirdly, improve security.
- Next, provide operational flexibility.
- Then, match the data store to the pattern of use.
- Lastly, improve availability.
Designing partitions
There are three typical strategies for partitioning data:
- Firstly, Horizontal partitioning (often called sharding). In this strategy, each partition is a separate data store, but all partitions have the same schema. Here, each partition is known as a shard and holds a specific subset of the data, such as all the orders for a specific set of customers.
- Secondly, Vertical partitioning. In this strategy, each partition holds a subset of the fields for items in the data store. The fields are divided according to their pattern of use.
- Lastly, Functional partitioning. In this strategy, data is aggregated according to how it is used by each bounded context in the system.
Horizontal partitioning (sharding)
Figure 1 shows horizontal partitioning, or sharding. In this example, product inventory data is divided into shards based on the product key. Each shard holds the data for a contiguous range of shard keys (A-G and H-Z), organized alphabetically. Sharding spreads the load over more machines, which reduces contention and improves performance.

The most important factor is the choice of a sharding key. It can be difficult to change the key after the system is in operation. The key must ensure that data is partitioned to spread the workload as evenly as possible across the shards.
The shards don’t have to be the same size. What is more important is to avoid creating “hot” partitions that can affect performance and availability. For example, using the first letter of a customer’s name results in an unbalanced distribution, because some letters are more common than others. Instead, use a hash of a customer identifier to distribute data more evenly across the partitions. If shards are replicated, it might be possible to keep some of the replicas online while others are split, merged, or reconfigured. However, the system might need to limit the operations that can be performed during the reconfiguration.
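As a minimal sketch of this idea, the hypothetical routing function below (the shard names and shard count are illustrative, not tied to any specific system) hashes a customer identifier to pick a shard. A stable hash distributes customers far more evenly than routing on the first letter of a name:

```python
import hashlib

# Hypothetical shard layout: four shards selected by hashing the customer ID.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(customer_id: str) -> str:
    """Map a customer ID to a shard using a stable hash.

    MD5 is used here (not for security, only for even spread) instead of
    Python's built-in hash(), which is randomized between processes.
    """
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the hash is stable, every lookup for the same customer lands on the same shard, which is the property that lets the application route a query directly to the right partition.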
Vertical partitioning
The most common use of vertical partitioning is to reduce the I/O and performance costs of fetching items that are accessed frequently. Figure 2 shows an example of vertical partitioning. In this example, different properties of an item are stored in different partitions. One partition holds data that is accessed more often, including the product name, description, and price. Another partition holds the stock count and the last-ordered date.

In this example, the application regularly queries the product name, description, and price when displaying product details to customers. The stock count and last-ordered date are held in a separate partition because these two items are commonly used together.
Other advantages of vertical partitioning:
- Firstly, relatively slow-moving data (product name, description, and price) can be separated from the more dynamic data (stock level and last-ordered date). Slow-moving data is a good candidate for an application to cache in memory.
- Secondly, sensitive data can be stored in a separate partition with additional security controls.
- Lastly, vertical partitioning can reduce the amount of concurrent access that’s needed.
Within a data store, vertical partitioning operates at the entity level, partially normalizing an entity to break it down from a wide item into a set of narrow items. It is ideally suited for column-oriented data stores such as HBase and Cassandra. If the data in a collection of columns is unlikely to change, you can also consider using column stores in SQL Server.
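The Figure 2 example can be sketched as a simple split function. The field names below mirror the example in the text; duplicating the product ID in both halves (an assumption of this sketch) is what allows the two narrow records to be rejoined when needed:

```python
# Vertical partitioning at the entity level: frequently read "browse" fields
# go to one partition; dynamic inventory fields go to another.
BROWSE_FIELDS = {"name", "description", "price"}      # hot, cache-friendly
INVENTORY_FIELDS = {"stock_count", "last_ordered"}    # changes often

def split_product(product: dict) -> tuple[dict, dict]:
    """Split one wide product record into two narrow records.

    The product ID is copied into both halves so they can be rejoined.
    """
    browse = {"id": product["id"]}
    inventory = {"id": product["id"]}
    for key, value in product.items():
        if key in BROWSE_FIELDS:
            browse[key] = value
        elif key in INVENTORY_FIELDS:
            inventory[key] = value
    return browse, inventory
```

With this split, the common "show product details" query only touches the browse partition, while stock updates only touch the inventory partition.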
Functional partitioning
Functional partitioning is a way to improve isolation and data access performance when it’s possible to identify a bounded context for each distinct business area in an application. Another common use of functional partitioning is to separate read-write data from read-only data. Figure 3 shows an overview of functional partitioning, in which inventory data is separated from customer data.

Designing partitions for scalability
To achieve optimum scalability, it’s critical to analyze the size and workload of each partition and balance them so that data is distributed evenly. However, you must also partition the data so that it does not exceed the scalability limits of a single partition store. When designing partitions for scalability, follow these steps:
- Firstly, analyze the application to understand the data access patterns, such as the size of the result set returned by each query, the frequency of access, the inherent latency, and the server-side compute processing requirements. In many cases, a few major entities will demand most of the processing resources.
- Secondly, use this analysis to determine the current and future scalability targets, such as data size and workload. Then distribute the data across the partitions to meet the scalability target. Further, for horizontal partitioning, choosing the right shard key is important to make sure distribution is even.
- Thirdly, make sure each partition has enough resources to handle the scalability requirements, in terms of data size and throughput. Depending on the data store, there might be a limit on the amount of storage space, processing power, or network bandwidth per partition.
- Lastly, monitor the system to verify that data is distributed as expected and that the partitions can handle the load. Actual usage does not always match what an analysis predicts. If so, it might be possible to rebalance the partitions, or else redesign some parts of the system to gain the required balance.
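The monitoring step above can be sketched as a periodic check of each partition against per-partition limits. The limits and the 80% warning threshold below are illustrative, not taken from any particular data store:

```python
# Flag partitions whose observed size or throughput approaches a limit,
# so they can be considered for rebalancing before they become hot.
MAX_SIZE_GB = 100        # illustrative per-partition storage limit
MAX_OPS_PER_SEC = 10_000 # illustrative per-partition throughput limit
WARN_RATIO = 0.8         # flag anything above 80% of a limit

def partitions_needing_rebalance(stats: dict) -> list:
    """Return the names of partitions above the warning threshold."""
    flagged = []
    for name, s in stats.items():
        if (s["size_gb"] > WARN_RATIO * MAX_SIZE_GB
                or s["ops_per_sec"] > WARN_RATIO * MAX_OPS_PER_SEC):
            flagged.append(name)
    return flagged
```

Checking both dimensions matters because, as the text notes, actual usage does not always match what an analysis predicts: a partition can be small in bytes yet hot in requests per second.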
Designing partitions for query performance
Query performance can often be improved by using smaller data sets and by running parallel queries. Each partition should contain only a small proportion of the entire data set, and this reduction in volume can improve the performance of queries. However, partitioning is not a substitute for designing and configuring a database appropriately. When designing partitions for query performance, follow these steps:
- Firstly, examine the application requirements and performance:
- Use business requirements to determine the critical queries that must always perform quickly.
- Then, monitor the system to identify any queries that perform slowly.
- After that, find which queries are performed most frequently. Even if a single query has a minimal cost, the cumulative resource consumption could be significant.
- Secondly, partition the data that is causing slow performance:
- First, limit the size of each partition so that the query response time is within target.
- However, if you use horizontal partitioning, design the shard key so that the application can easily select the right partition. This prevents the query from having to scan through every partition.
- Next, consider the location of a partition. If possible, try to keep data in partitions that are geographically close to the applications and users that access it.
- Thirdly, if an entity has throughput and query performance requirements, use functional partitioning based on that entity. If this still doesn’t satisfy the requirements, apply horizontal partitioning as well. In most cases, a single partitioning strategy will suffice, but in some cases it is more efficient to combine both strategies.
- Lastly, consider running queries in parallel across partitions to improve performance.
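The last step can be sketched with a thread pool that fans the same query out to every shard and merges the results. The in-memory shard contents and the `query_shard` helper are stand-ins for real per-partition query calls:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory "shards"; in practice each would be a separate data store.
SHARD_DATA = {
    "shard-0": [{"customer": "ann", "total": 40}],
    "shard-1": [{"customer": "bob", "total": 25}],
    "shard-2": [{"customer": "cat", "total": 60}],
}

def query_shard(shard: str, min_total: int) -> list:
    """Run the filter against a single shard (stand-in for a real query)."""
    return [row for row in SHARD_DATA[shard] if row["total"] >= min_total]

def query_all(min_total: int) -> list:
    """Fan the query out to all shards in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARD_DATA)) as pool:
        futures = [pool.submit(query_shard, s, min_total) for s in SHARD_DATA]
        results = []
        for f in futures:
            results.extend(f.result())
    return results
```

Note that this fan-out is only necessary when the query cannot be routed to a single shard via the shard key; queries that can be routed should skip the scatter-gather entirely.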
Designing partitions for availability
Partitioning data can improve the availability of applications by ensuring that the entire dataset does not constitute a single point of failure and that individual subsets of the dataset can be managed independently.
Consider the following factors that affect availability:
How critical the data is to business operations. Identify which data is critical business information, such as transactions, and which data is less critical operational data, such as log files.
- Firstly, consider storing critical data in highly available partitions with an appropriate backup plan.
- Secondly, establish separate management and monitoring procedures for the different datasets.
- Lastly, place data that has the same level of criticality in the same partition so that it can be backed up together at an appropriate frequency. For example, partitions that hold transaction data might need to be backed up more frequently than partitions that hold logging or trace information.
How individual partitions can be managed. Designing partitions to support independent management and maintenance provides several advantages. For example:
- Firstly, if a partition fails, it can be recovered independently without affecting applications that access data in other partitions.
- Secondly, partitioning data by geographical area allows scheduled maintenance tasks to occur at off-peak hours for each location. Ensure that partitions are not so large that any planned maintenance can’t be completed during this period.
Whether to replicate critical data across partitions. This strategy can improve availability and performance, but can also introduce consistency issues. It takes time to synchronize changes with every replica. During this period, different partitions will contain different data values.
Application design considerations
Partitioning adds complexity to the design and development of your system. Consider partitioning as a fundamental part of system design even if the system initially only contains a single partition. If you address partitioning as an afterthought, it will be more challenging because you already have a live system to maintain:
- Firstly, data access logic will need to be modified.
- Secondly, large quantities of existing data may need to be migrated, to distribute it across partitions.
- Lastly, users expect to be able to continue using the system during the migration.
Rebalancing partitions
As a system matures, you might have to adjust the partitioning scheme. For example, individual partitions might start getting a disproportionate volume of traffic and become hot, leading to excessive contention. Or you might have underestimated the volume of data in some partitions, causing some partitions to approach capacity limits.
Some data stores, such as Cosmos DB, can automatically rebalance partitions. In other cases, rebalancing is an administrative task that consists of two stages:
- Firstly, determine a new partitioning strategy.
- Secondly, migrate data from the old partitioning scheme to the new set of partitions.
Depending on the data store, you might be able to migrate data between partitions while they are in use. This is called online migration. If that’s not possible, you might need to make partitions unavailable while the data is relocated (offline migration).
Offline migration
Offline migration is typically simpler because it reduces the chances of contention occurring. Conceptually, offline migration works as follows:
- Firstly, mark the partition offline.
- Secondly, split-merge and move the data to the new partitions.
- Thirdly, verify the data.
- Then, bring the new partitions online.
- Lastly, remove the old partition.
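The five steps above can be sketched as one function over in-memory dicts. The "split" here is a trivial re-hash into new partitions; a real migration would move data between separate stores and fence off traffic while doing so:

```python
import hashlib

def hash_key(key: str) -> int:
    # Stable hash so the same key always maps to the same new partition.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def offline_migrate(old: dict, n_new: int) -> list:
    """Offline-migrate `old` into `n_new` hash-partitioned dicts."""
    # Step 1: mark the partition offline (a real system would reject
    # reads and writes here; this sketch simply stops using `old`).
    # Step 2: split and move the data into the new partitions.
    new_parts = [dict() for _ in range(n_new)]
    for key, value in old.items():
        new_parts[hash_key(key) % n_new][key] = value
    # Step 3: verify that every item landed in exactly one new partition.
    assert sum(len(p) for p in new_parts) == len(old)
    # Steps 4-5: bring the new partitions online and remove the old one.
    old.clear()
    return new_parts
```

The verification step is the one most worth keeping even in a sketch: counting items before retiring the old partition is a cheap guard against silently losing data during the move.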
Online migration
Online migration is more complex to perform but less disruptive. The process is similar to offline migration, except that the original partition is not marked as offline. Depending on the granularity of the migration process (for example, item by item versus shard by shard), the data access code in the client applications might have to handle reading and writing data that’s held in two locations, the original partition and the new partition.
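The dual-location access this paragraph describes can be sketched as a small wrapper (a hypothetical class, not part of any library): reads check the new partition first and fall back to the old one, while writes go to the new partition so migrated data stays current.

```python
class MigratingStore:
    """Routes reads and writes across an old and a new partition
    during an online migration. Both partitions are dicts here;
    real code would wrap two data store clients instead."""

    def __init__(self, old: dict, new: dict):
        self.old = old
        self.new = new

    def read(self, key):
        # Prefer the new partition; fall back to the not-yet-migrated data.
        if key in self.new:
            return self.new[key]
        return self.old.get(key)

    def write(self, key, value):
        # All writes land in the new partition...
        self.new[key] = value
        # ...and any stale copy is dropped so the old partition can
        # eventually be retired once the migration completes.
        self.old.pop(key, None)
```

Once the old partition is empty, the wrapper can be removed and clients can talk to the new partitions directly, completing the migration without an outage.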
Reference: Microsoft Documentation