Scaling and Consistency in Distributed Systems

This post will help you understand key concepts for tackling the challenges that arise when working with distributed systems.

A distributed system is a network of interconnected computers that work together to achieve a common goal. The concepts below are essential for understanding how such systems operate, how they scale to meet demand, and how they keep data consistent across multiple nodes.

Concurrency vs Parallelism

Both refer to how multiple tasks are executed at the same time. Let's discuss them the way we would explain them to a child -

Imagine you have a lot of toys, and you want to play with them all at the same time. You could try to play with them all by yourself, but that might take a very long time, and you might get tired or bored.
So you might ask your friends to help you. If each of your friends picks up a toy and plays with it, then you are all playing together at the same time. This is called parallelism - everyone is doing something at the same time.
But what if there are only a few toys, and you all want to play with them? In that case, you might take turns playing with the toys. You might each play with a toy for a little bit, and then pass it to your friend to play with. This is called concurrency - you are all doing something, but you are taking turns.

By interleaving the execution of tasks (concurrency) or by dividing work into smaller sub-tasks that run simultaneously (parallelism), distributed systems can achieve faster and more efficient processing, leading to improved performance and scalability.
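
To make the distinction concrete, here is a minimal Python sketch (the toy task and the use of threads vs processes are illustrative assumptions, not taken from any particular system): threads in one process interleave their execution (concurrency), while separate processes can each run on their own CPU core at the same time (parallelism).

    import threading
    import multiprocessing
    import time

    def play_with_toy(toy):
        # Pretend to "play" with a toy for a moment.
        time.sleep(0.1)
        print(f"finished playing with {toy}")

    if __name__ == "__main__":
        toys = ["car", "doll", "blocks", "puzzle"]

        # Concurrency: threads within one process take turns / interleave their execution.
        threads = [threading.Thread(target=play_with_toy, args=(t,)) for t in toys]
        for th in threads:
            th.start()
        for th in threads:
            th.join()

        # Parallelism: separate processes can run on different CPU cores simultaneously.
        with multiprocessing.Pool(processes=len(toys)) as pool:
            pool.map(play_with_toy, toys)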

Horizontal vs Vertical scaling

This refers to different approaches to increasing the capacity of a distributed system.

Horizontal scaling means that you scale by adding more machines to your pool of resources. It is often easier to scale dynamically this way, since new machines can simply be added to the existing pool. This technique is generally employed when scaling application nodes or distributed databases.
Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine. It is limited by the capacity of a single machine; scaling often involves downtime and comes with a hard upper limit. This technique is generally employed for applications that require more processing power to handle complex computations, such as scientific simulations or financial modeling.

In general, horizontal scaling is best suited for applications that need to handle a large number of concurrent users or workloads, while vertical scaling is best suited for applications that require more processing power or memory to handle complex computations or larger datasets.
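
As a rough sketch of horizontal scaling (the node names and the round-robin dispatch policy are assumptions made purely for illustration), requests are spread across a pool of nodes, and capacity is added by appending nodes to the pool rather than upgrading a single machine:

    from itertools import cycle

    class NodePool:
        """A toy pool of application nodes behind a round-robin dispatcher."""

        def __init__(self, nodes):
            self.nodes = list(nodes)
            self._rr = cycle(self.nodes)

        def add_node(self, node):
            # Horizontal scaling: grow capacity by adding another machine to the pool.
            self.nodes.append(node)
            self._rr = cycle(self.nodes)

        def dispatch(self, request):
            node = next(self._rr)
            return f"{node} handled {request}"

    pool = NodePool(["app-1", "app-2"])
    print(pool.dispatch("GET /orders"))
    pool.add_node("app-3")          # scale out under load, no downtime for existing nodes
    print(pool.dispatch("GET /users"))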

Partitioning vs Sharding

This refers to splitting a large data set or workload into smaller, more manageable pieces. The technique is generally discussed in the context of database systems.

Horizontal Partitioning splits one or more tables by row, usually within a single instance of a schema and a single database server. It can reduce index size (and thus search effort), provided there is some obvious, robust, implicit way to identify which table a particular row will be found in without first searching the index - the classic example being the 'CustomersEast' and 'CustomersWest' tables, where a customer's zip code already indicates where the row will be found.
Sharding goes beyond this: it divides the problematic table(s) in the same way, but does so across potentially multiple instances. The idea behind sharding is to split a large dataset into smaller subsets known as shards and distribute these shards across different instances. One of the benefits of sharding is that it allows for better scalability and performance in distributed systems. However, sharding also comes with challenges, such as data consistency and coordination between the nodes.

There is also Vertical Partitioning, whereby you split a table into smaller, distinct parts by column. Normalization also involves splitting columns across tables, but vertical partitioning goes beyond that and partitions columns even when they are already normalized.

By splitting the data this way, each node is responsible for a smaller set of data, which lets the system handle higher scale and improves performance by reducing the work each query has to do.
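
Here is a minimal sketch of shard routing (the modulo-hash scheme and the shard names are assumptions for illustration): each row is assigned to a shard based on its key, so every instance is responsible for only a subset of the data.

    import hashlib

    SHARDS = ["shard-0", "shard-1", "shard-2"]  # e.g. separate database instances

    def shard_for(customer_id: str) -> str:
        # Hash the key so rows spread evenly, then map the hash to one of the shards.
        digest = hashlib.md5(customer_id.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Each customer's rows live on exactly one shard.
    for cid in ["cust-1001", "cust-1002", "cust-1003"]:
        print(cid, "->", shard_for(cid))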

ACID vs CAP

This refers to two different approaches to data consistency and availability in distributed systems.

ACID
ACID transactions offer guarantees that absolve the end user of much of the headache of concurrent access to mutable database state.
ACID stands for:

  • Atomic: a transaction is treated as a single action; either all of its operations are completed or none are.

  • Consistent: every transaction leaves the database following its defined rules and restrictions - constraints, triggers, cascades.

  • Isolated: refers to concurrency control; concurrent transactions produce the system state that would be obtained if they were executed serially.

  • Durable: once committed, operations persist and will not be undone, even after a crash.
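
As a minimal illustration of atomicity (using SQLite purely as an example store; the table and amounts are made up), the two updates below either both commit or both roll back:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
    conn.commit()

    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
            conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    except sqlite3.Error:
        # If either update fails, neither is applied (atomicity).
        pass

    print(conn.execute("SELECT * FROM accounts").fetchall())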

CAP

The CAP theorem is a tool for explaining trade-offs in distributed systems.
CAP stands for:

  • Consistent: All replicas of the same data will have the same value across a distributed system.

  • Available: All live nodes in a distributed system can process operations and respond to queries.

  • Partition Tolerant: The system is designed to operate in the face of unplanned network connectivity loss between replicas.

CAP isn’t about what is possible, but rather what isn’t possible.
A more practical way to think about CAP: in the face of network partitions, you can’t have both perfect consistency and 100% availability, so plan accordingly. We can’t ignore partitions - if you don’t have partitions, you don’t have a distributed system.
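
Here is a toy sketch of that trade-off (the class and its flags are illustrative assumptions, not a real protocol): when a replica loses contact with its peers, it must either refuse requests (giving up availability) or serve possibly stale data (giving up consistency).

    class Replica:
        def __init__(self, name, prefer_consistency=True):
            self.name = name
            self.value = None
            self.partitioned = False          # can we reach the other replicas?
            self.prefer_consistency = prefer_consistency

        def read(self):
            if self.partitioned and self.prefer_consistency:
                # CP choice: refuse to answer rather than risk returning stale data.
                raise RuntimeError("unavailable during partition")
            # AP choice: always answer, even if the value may be stale.
            return self.value

    replica = Replica("replica-1", prefer_consistency=True)
    replica.partitioned = True
    try:
        replica.read()
    except RuntimeError as err:
        print(err)   # consistency chosen over availability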

ACID applies to transactions and databases, while CAP applies to distributed systems, which may involve applications in addition to databases.
The main difference is in consistency:

  • ACID consistency is all about database rules enforcement.

  • CAP consistency promises that every replica of the same logical value, spread across nodes in a distributed system, has the same exact value at all times. Due to the speed of light, it may take some non-zero time to replicate values across a cluster.

Fault Tolerance

It is the ability of a distributed system to continue operating even if one or more nodes or components in the system fail. Failures are a much more common occurrence in distributed systems because multiple nodes are involved. Some commonly used techniques to ensure fault tolerance are:

  • Prevent

    • Redundancy: By replicating data or services across multiple nodes or machines, the system can continue to operate even if one or more nodes fail, e.g. database replicas across multiple availability zones (AZs).

    • Load Balancing: By distributing requests or tasks evenly across multiple nodes, the system can continue to operate even if one or more nodes are overwhelmed or fail.

  • Detect

    • It is essential to detect and isolate failures quickly to prevent them from spreading to other nodes or components. Various techniques are used for failure detection, such as heartbeats, which are periodic messages sent between nodes to check if they are still active (see the sketch after this list).
  • Recover

    • When a node fails, the system needs to recover from the failure and ensure that data and services remain available. This is normally achieved via replication and reconfiguration.
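
Here is a minimal heartbeat-style failure detector (the timeout value and node names are illustrative assumptions): each node records the time of its last heartbeat, and any node whose heartbeat is older than the timeout is flagged as failed.

    import time

    HEARTBEAT_TIMEOUT = 3.0   # seconds without a heartbeat before a node is considered failed

    last_seen = {}            # node name -> timestamp of its last heartbeat

    def record_heartbeat(node):
        last_seen[node] = time.monotonic()

    def failed_nodes():
        now = time.monotonic()
        return [n for n, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

    record_heartbeat("node-a")
    record_heartbeat("node-b")
    time.sleep(0.1)
    # In a real system this check would run periodically; here we just call it once.
    print(failed_nodes())     # [] - both nodes reported recently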

In general, fault tolerance is what keeps a distributed system operating even as individual nodes or components inevitably fail.

Thanks for reading up to this point. I hope this post has given you a good understanding of these key concepts and the techniques used to implement them, and that it helps you reason about your decisions in day-to-day office discussions or even in interviews. Distributed systems are a vast universe, and other principles such as consensus algorithms, replication, and message passing could also have been included, but I am concluding here to keep the post concise. I hope to cover these remaining concepts in future posts.