Optimizing Distributed Data Systems

They say you can’t have it all, and that’s certainly true with distributed data systems. Did I really just write that sentence? “Sorry son, just like distributed data systems, you can’t have your cake and eat it too…”

Alright, clunky intro out of the way, let me explain my point. Pretend you’re an important decision maker or a technical architect and you’ve been tasked with rolling out the database of your organization’s new application/suite/whatever. As you consider the architecture, you’re faced with certain realities. Is it a global application with millions of users? Does it contain up-to-the-second financial information? How fault tolerant does it need to be? Is it a social interface where updates can take a few moments to permeate without causing issues?

In my opinion, the CAP Theorem does a very good job of simplifying some fairly complicated architectural concepts. Both business and technical stakeholders can benefit from an understanding of the basic concepts set forth by the theorem. I will caveat that I’m not sure how long the theorem will hold true as new technologies emerge (see Google’s Spanner). In 3 or 4 years you may be reading a blog post from us stating why the CAP Theorem no longer applies! For now, however, the CAP Theorem is a concept that should be understood before architecting any distributed system, and this blog is for those who are unfamiliar with the concept.

It’s 2017, and the field of database tools on the market is vast. Individual players fill specific niche needs, while the big boys continue to try to offer robust, consistent solutions. There are many dimensions through which any new tool or system should be evaluated, but today’s blog will focus on the CAP Theorem as it relates to distributed data systems.

What is the CAP Theorem?

CAP Theorem is a term coined by computer scientist Eric Brewer in the early 2000s to describe the relationship between Consistency, Availability, and Partition Tolerance in distributed systems. Specifically, Brewer theorized that only 2 of the 3 could be achieved simultaneously. The three elements are described below in (perhaps overly) simplistic terms, along with the three pairings of the “pick two” approach to the CAP Theorem.

CONSISTENCY means that once a write completes, every subsequent read (even from a different node than the one where the write occurred) reflects that update. If Jim from sales updates the contact number for a contact in the CRM system, and Lauren opens that record a moment later, she will see that updated number if the system is CONSISTENT.
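
To make that concrete, here’s a tiny Python sketch of the idea (the Node class, the record IDs, and the phone numbers are all made up for illustration): a write isn’t acknowledged until every replica has applied it, so any read that follows sees the new value.

```python
# Toy sketch of consistency via synchronous replication.
# Everything here (Node, record IDs, phone numbers) is illustrative.

class Node:
    def __init__(self, name):
        self.name = name
        self.records = {}

cluster = [Node("east"), Node("west"), Node("central")]

def write(record_id, value):
    # A consistent system only acknowledges the write once every
    # replica has applied it.
    for node in cluster:
        node.records[record_id] = value
    return "ack"

def read(record_id, node):
    return node.records.get(record_id)

# Jim updates the contact number...
write("contact:42", {"phone": "555-0199"})
# ...and Lauren reads it a moment later from a different node
# and still sees the update.
print(read("contact:42", cluster[2]))  # {'phone': '555-0199'}
```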

AVAILABILITY is simply the capability of a system to accept incoming data or return data upon a request even if one or more nodes are down. If (to continue our previous example) Jim knows the East Coast data center is down but can still pull customer data from one of the redundant nodes (say, from the West Coast), then the system is AVAILABLE.
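
Again, a toy Python sketch (the node names and the up/down flags are purely illustrative): when the preferred node is unreachable, the client simply falls back to a redundant replica and still gets an answer.

```python
# Toy sketch of availability: the East Coast node is down, so the
# request is served from a redundant replica instead.

class Node:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.records = {"contact:42": {"phone": "555-0199"}}

nodes = [Node("east-coast", up=False), Node("west-coast")]

def read(record_id):
    for node in nodes:
        if node.up:                      # skip unreachable nodes
            return node.records.get(record_id)
    raise RuntimeError("no replica available")

print(read("contact:42"))  # served by west-coast even though east-coast is down
```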

PARTITION TOLERANCE means the system keeps operating even when the network splits and nodes can no longer communicate with one another, that is, even if the connection between two nodes in the system is severed.
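
One more toy sketch to make the idea concrete (the two dictionaries standing in for data centers are, of course, a massive simplification): the link between the two sides is severed, yet both keep serving traffic and reconcile once the partition heals.

```python
# Toy sketch of partition tolerance: the link between the two sides is
# severed, yet each side keeps serving reads and writes, and the
# replicas reconcile once the partition heals.

east = {"contact:42": {"phone": "555-0100"}}
west = {"contact:42": {"phone": "555-0100"}}

# The cross-country link goes down, so this East Coast write can't replicate.
east["contact:42"] = {"phone": "555-0199"}

# Both sides still answer requests during the partition...
print(east["contact:42"])  # the new number
print(west["contact:42"])  # the old number, until the partition heals

# ...and when connectivity returns, the replicas sync back up.
west.update(east)
print(west["contact:42"])  # {'phone': '555-0199'}
```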

Again, the idea behind the CAP theorem is that only two of these three can be achieved simultaneously. Let’s take a look at the three “pairings”, what they entail, and a few of the market leaders in each space.

C + A

For a system to be both AVAILABLE and CONSISTENT, you must trade off partition tolerance. In practice, that means the system can’t survive a network split, which is why this paradigm typically lives on a single node or a tightly coupled cluster, and why scaling out is challenging. It is best represented by traditional RDBMS systems such as MySQL, SQL Server, PostgreSQL, etc.
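
As a rough illustration of that world, here’s a minimal example using Python’s built-in sqlite3 module as a stand-in for a traditional single-node RDBMS: every read sees the latest committed write, and the database answers as long as that one node is up, but there is no second node to partition in the first place.

```python
# Minimal sketch of the C+A world: a single-node relational database.
# Consistency is trivial because every read and write hits the same node.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, phone TEXT)")
conn.execute("INSERT INTO contacts (id, phone) VALUES (42, '555-0100')")
conn.commit()

# Jim's update and Lauren's read hit the same node, so she always
# sees the latest committed value.
conn.execute("UPDATE contacts SET phone = '555-0199' WHERE id = 42")
conn.commit()
print(conn.execute("SELECT phone FROM contacts WHERE id = 42").fetchone())
# ('555-0199',)
```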

C + P

A system that is both CONSISTENT and PARTITION TOLERANT trades off AVAILABILITY: when nodes can’t reach one another, some of them will stop answering requests rather than risk serving stale or conflicting data. In exchange, these systems can scale relatively easily. The types of products that fall under this category include Document, Columnar, and Key-Value databases such as MongoDB, HBase, and Redis.
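
Here’s a sketch of what that tradeoff can look like. The quorum logic below is purely illustrative (it’s not MongoDB’s, HBase’s, or Redis’s actual API): when a majority of replicas can’t be reached, the write is refused rather than risk divergent data.

```python
# Toy sketch of the C+P tradeoff: writes must reach a majority of
# replicas; otherwise the system sacrifices availability and refuses
# the write to preserve consistency.

replicas = {"node-a": True, "node-b": False, "node-c": False}  # up/down
QUORUM = len(replicas) // 2 + 1

def write(record_id, value):
    reachable = [name for name, up in replicas.items() if up]
    if len(reachable) < QUORUM:
        # Consistency is preserved at the cost of availability.
        raise RuntimeError("write rejected: quorum not available")
    # (replication to each reachable node would happen here)
    return "ack"

try:
    write("contact:42", {"phone": "555-0199"})
except RuntimeError as err:
    print(err)  # write rejected: quorum not available
```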

P + A

Largely the domain of NoSQL technologies, a system that is PARTITION TOLERANT and AVAILABLE trades off CONSISTENCY to ensure data can always be written and retrieved, though nodes may be temporarily out of sync. Like C+P, products under this category can include Document, Columnar, and Key-Value databases such as Dynamo, Cassandra, and CouchDB.
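
And a final toy sketch of the other side of that coin (the timestamps and last-write-wins merge below are illustrative, not any particular product’s implementation): both sides of a partition keep accepting writes, reads briefly disagree, and the replicas converge once connectivity returns.

```python
# Toy sketch of the P+A tradeoff: both sides of a partition keep
# accepting writes, and the replicas converge later via a simple
# last-write-wins rule.

east = {}
west = {}

def write(node, record_id, value, ts):
    node[record_id] = (ts, value)   # tag each write with a timestamp

# During the partition, the two sides diverge...
write(east, "contact:42", {"phone": "555-0199"}, ts=100)
write(west, "contact:42", {"phone": "555-0123"}, ts=101)

# ...so reads briefly disagree depending on which node you hit.
print(east["contact:42"][1] == west["contact:42"][1])  # False

# When the partition heals, last-write-wins keeps the newer value on both sides.
def merge(a, b):
    for record_id, (ts, value) in b.items():
        if record_id not in a or ts > a[record_id][0]:
            a[record_id] = (ts, value)

merge(east, west)
merge(west, east)
print(east["contact:42"][1])  # {'phone': '555-0123'} on both nodes now
```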

Making the Right Decision

At the end of the day, the theorem still holds true, and the tradeoffs it describes must be weighed carefully when implementing any distributed system.

As mentioned above, this is a high-level, simplistic approach to the CAP Theorem as it relates to distributed databases. Technologies are constantly pushing the bounds of what is possible, and tools like Google’s Spanner are right around the corner from breaking the established rules.