Why NoSQL stands high on scalability?

This blog will tell you the secrets behind inherent ways of scaling NoSQL databases.

In the last post, we roughly touched base on the scalability of RDBMS and NoSQL databases. In this post, however, we will go into the details and basics of how NOSQL databases naturally lend themselves for the high scale.

The scale has to be broken down into its constituents:
Read scaling = handle higher volumes of read operations
Write scaling = handle higher volumes of write operations

We will be mainly looking at the ability to provide support horizontal scaling to allow linearly scaling the system instead of vertical scaling which has its own hardware & cost limits.

Read Scaling

  • ACID-compliant RDBMS databases can scale reads. They are not inherently less efficient than NoSQL databases for reads because the majority of performance bottlenecks are introduced by things like joins and where clause restrictions which you can opt not to use.
    Some of these things are not supported at all with NoSQL or you can choose to opt out. Instead, NoSQL databases use specialized data storage techniques (easier handling of where clause) or suggest practices to keep the data together (avoiding joins) to improve performance.

  • Clustered SQL RDBMS can scale reads by introducing additional nodes in the cluster. The additional nodes are generally read-only replica nodes. This involves replication to make sure that the replica has the same data as of primary.
    To keep the replicas highly consistent, we have to use sync replication which can cause write latencies. And if we choose async replication, the reads won't be consistent until the replica catches up with the primary in case of high write throughput and replication lag.

  • You can also scale reads by choosing the lowermost isolation level (topic in itself for another post) supported by the database. But again, this would cause issues with the consistency of data across transactions.

So, there are constraints to how far read operations can be scaled, but these are imposed by the difficulty of scaling up writes as you introduce more nodes into the cluster.

Write Scaling

As seen above, write scaling is where things get complicated. There are various constraints imposed by the ACID principles for RDBMS which we do not see in NoSQL databases:

  • Atomicity means that a transaction must be complete or fail as a whole, so a lot of bookkeeping must be done behind the scenes to guarantee this.

    • This involves the use of transactions, transaction logs, and/or locking mechanisms to ensure that transactions are treated as indivisible units of work and can be rolled back if something fails.
  • Consistency constraints mean that all nodes in the cluster must be identical in terms of data. If you write to one node, this write must be copied to all other nodes before returning a response to the client. This makes a traditional RDBMS cluster hard to scale. We already talked about replication issues with the multiple read-only replica nodes above.

    • Consistency in ACID talks about enforcing constraints like unique indexes, and foreign keys. But this definition is not required for our discussion of write scaling.
  • Isolation refers to the ability of a transaction to execute independently of other concurrent transactions. This is generally supported by multiple isolation levels. RDBMS uses a variety of techniques to support these isolation levels, including locking mechanisms, timestamping, and/or multi-version concurrency control (MVCC). This works best with a single primary node for writes.

    • When we try to go for multiple primary nodes and transactions across multiple nodes are required, then the problem gets stretched to distributed transaction management (which in itself is a difficult problem to handle and a topic for another post).
  • Durability constraints mean that in order to never lose a write you must ensure that before a response is returned to the client, the write has to be flushed to disk (there are intelligent ways employed by certain databases to simplify this but it is a limitation for most of the databases).

All of the above discussions point towards using a single node for RDBMS for easier use and correspondingly limit RDMBS scaling only to vertical scaling which is not possible beyond a certain point due to hardware & cost constraints.

To scale up write operations or the number of nodes in a cluster beyond a certain point we need to have the ability to relax some of the ACID requirements:

  • Dropping Atomicity lets you shorten the duration for which tables (sets of data) are locked. Example: MongoDB, CouchDB.

  • Dropping Consistency lets you scale up writes across cluster nodes. Examples: Riak, Cassandra.

  • Dropping Durability lets you respond to write commands without flushing to disk. Examples: Memcache, redis.

NoSQL databases typically follow the BASE (Basically Available Soft State) model instead of the ACID model. They give up the A, C, and/or D requirements, and in return, they improve scalability. Some, like Cassandra, let you opt into ACID's guarantees when you need them. The SQL API lacks a mechanism to describe queries where ACID's requirements are relaxed, this is why the BASE databases are all NoSQL.

With the above properties, NoSQL databases allow easier horizontal scaling instead of vertical scaling, which makes me call it more easily scalable than RDBMS.

But one confession before we wrap up this post,
RDBMS are equally scalable and used by multiple organizations out there for a variety of high-scale applications. With newer improvements in the hardware technologies (cheaper & faster RAM & SSDs, big-sized VMs) and expertise in tuning the database configurations, it is possible to scale RDBMS to greater heights. Even when the single node database is not enough, I have seen organizations employing custom sharding logic and avoiding ACID requirements across shards, we can get the best out of the RDBMS databases as well.

I hope this post will help you get some insights into the understanding of how certain properties of NoSQL databases make them naturally scalable. Thanks for reading till the end :)