Optimizing Distributed Data Systems

They say you can’t have it all, and that’s certainly true with distributed data systems. Did I really just write that sentence? “Sorry son, just like distributed data systems, you can’t have your cake and eat it too…”

Alright, clunky intro out of the way, let me explain my point. Pretend you’re an important decision maker or a technical architect and you’ve been tasked with rolling out the database of your organization’s new application/suite/whatever. As you consider the architecture, you’re faced with certain realities. Is it a global application with millions of users? Does it contain up-to-the-second financial information? How fault tolerant does it need to be? Is it a social interface where updates can take a few moments to permeate without causing issues?

In my opinion, the CAP Theorem does a very good job of simplifying some fairly complicated architectural concepts. Both business and technical stakeholders can benefit from an understanding of the basic concepts set forth by the theorem. I will caveat that I’m not sure how long the theorem will hold true as new technologies emerge (see Google’s Spanner); in three or four years you may be reading a blog post from us explaining why the CAP Theorem no longer applies! For now, however, the CAP Theorem is a concept that should be understood before architecting any distributed system, and this blog is for those who are unfamiliar with it.

It’s 2017, and the field of database tools on the market is vast. Individual players fill specific niche needs, while the big boys continue to try to offer robust, consistent solutions. There are many dimensions through which any new tool or system should be evaluated, but today’s blog will focus on the CAP Theorem as it relates to distributed data systems.

What is the CAP Theorem?

The CAP Theorem is a term coined by computer scientist Eric Brewer in the early 2000s to describe the relationship between Consistency, Availability, and Partition Tolerance in distributed systems. Specifically, Brewer theorized that only two of the three could be achieved simultaneously. The three elements are described below in (perhaps overly) simplistic terms, along with the three pairings that result from the “pick two” approach to the CAP Theorem.

CONSISTENCY means that once a write completes, every subsequent read (even from a different node than the one where the write occurred) returns the updated data. If Jim from sales updates the contact number for a contact in the CRM system, and Lauren opens that record a moment later, she will see the updated number if the system is CONSISTENT.
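To make that concrete, here is a minimal Python sketch; it is a toy, in-memory model I made up for illustration (not any real database) of synchronous replication: a write is applied to every replica before it is acknowledged, so Lauren’s read returns Jim’s update no matter which node she hits.

```python
# Toy model of a consistent, synchronously replicated store (illustration only).

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class ConsistentStore:
    """Acknowledge a write only after every replica has applied it."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        for replica in self.replicas:
            replica.data[key] = value  # every node sees the update before we return
        return "ok"

    def read(self, key, replica):
        return replica.data.get(key)  # any node returns the latest value


east, west = Replica("east"), Replica("west")
store = ConsistentStore([east, west])
store.write("contact:42:phone", "555-0100")   # Jim's update lands on both nodes
print(store.read("contact:42:phone", west))   # Lauren reads the new number from the other node
```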

AVAILABILITY is simply the capability of a system to accept writes or serve reads even if one or more nodes are down. If (to continue our previous example) Jim knows the East Coast data center is down, but can still pull customer data from one of the redundant nodes (say, from the West Coast), then the system is AVAILABLE.
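Again as a toy sketch (hypothetical node names and data, nothing vendor-specific): an available system keeps answering by falling back to whichever replica is still reachable.

```python
# Toy failover sketch: the read succeeds as long as any replica is up (illustration only).

class NodeDown(Exception):
    pass


class Node:
    def __init__(self, name, data, up=True):
        self.name, self.data, self.up = name, data, up

    def read(self, key):
        if not self.up:
            raise NodeDown(self.name)
        return self.data.get(key)


def available_read(nodes, key):
    """Try each replica in turn; fail only if every node is unreachable."""
    for node in nodes:
        try:
            return node.read(key)
        except NodeDown:
            continue
    raise RuntimeError("no replicas reachable")


east = Node("east", {"contact:42:phone": "555-0100"}, up=False)  # East Coast outage
west = Node("west", {"contact:42:phone": "555-0100"})
print(available_read([east, west], "contact:42:phone"))  # Jim still gets his data
```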

PARTITION TOLERANCE means the system keeps operating even when the connection between two nodes is severed and the network splits into partitions that cannot communicate with each other.

Again, the idea behind the CAP Theorem is that only two of these three can be achieved simultaneously. Let’s take a look at the three “pairings”, what they entail, and a few of the market leaders in each space.
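Before we walk through the pairings, here is one more made-up sketch of why a partition forces the choice: a replica that has lost contact with its peer must either answer anyway (staying available, but possibly returning stale data) or refuse until the partition heals (staying consistent, but unavailable).

```python
# Toy illustration of the consistency-vs-availability choice during a partition.

class Unavailable(Exception):
    pass


class PartitionedReplica:
    def __init__(self, data, prefer="consistency"):
        self.data = data
        self.connected_to_peer = False  # the link to the other node is severed
        self.prefer = prefer

    def read(self, key):
        if self.connected_to_peer:
            return self.data.get(key)      # no partition, no dilemma
        if self.prefer == "availability":
            return self.data.get(key)      # answer now, even if it may be stale
        raise Unavailable("partitioned; refusing a possibly stale read")


replica = PartitionedReplica({"contact:42:phone": "555-0199"}, prefer="availability")
print(replica.read("contact:42:phone"))    # AP behavior: you get an answer, maybe stale

replica.prefer = "consistency"
try:
    replica.read("contact:42:phone")       # CP behavior: an error rather than stale data
except Unavailable as err:
    print(err)
```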

C + A

For a system to be both AVAILABLE and CONSISTENT, you must trade off partition tolerance, which in practice makes horizontal scaling challenging. This paradigm is best represented by traditional RDBMSs such as MySQL, SQL Server, and PostgreSQL.

C + P

A system that is both CONSISTENT and PARTITION TOLERANT trades off AVAILABILITY but can scale relatively easily. The types of products that fall under this category include Document, Columnar, and Key-Value databases such as MongoDB, HBase, and Redis.
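As a rough illustration of how this tradeoff surfaces in practice, here is a sketch using MongoDB’s “majority” read and write concerns via PyMongo (the connection string, database, collection, and fields are placeholders I invented). With majority concerns, a node cut off from the majority of its replica set will block or error on these operations rather than serve possibly stale data: consistency over availability.

```python
# Sketch: favoring consistency with MongoDB majority read/write concerns (PyMongo).
from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # placeholder URI
contacts = client["crm"].get_collection(
    "contacts",
    write_concern=WriteConcern(w="majority"),   # ack only after a majority of nodes apply it
    read_concern=ReadConcern("majority"),       # read only majority-committed data
)

# Jim's update: blocks (or errors) if a majority of the replica set is unreachable.
contacts.update_one({"_id": 42}, {"$set": {"phone": "555-0100"}})

# Lauren's read: returns the majority-committed value, never a rolled-back write.
print(contacts.find_one({"_id": 42}))
```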

P + A

Largely the domain of NoSQL technologies, a system that is PARTITION TOLERANT and AVAILABLE trades off CONSISTENCY to ensure data can always be written and retrieved, though nodes may be out of sync. Like C+P, products under this category can include Document, Columnar, and Key-Value databases such as Dynamo, Cassandra, and CouchDB.
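To show the flip side, here is a sketch of Cassandra’s tunable consistency with the Python cassandra-driver (the keyspace, table, and columns are placeholders). Writing and reading at ConsistencyLevel.ONE keeps the cluster answering through a partition, since any single reachable replica is enough, at the cost that a read may briefly miss a write made on the other side.

```python
# Sketch: favoring availability with Cassandra's tunable consistency (cassandra-driver).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("crm")          # placeholder keyspace

# A single reachable replica is enough to accept the write...
write = SimpleStatement(
    "UPDATE contacts SET phone = %s WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write, ("555-0100", 42))

# ...and a single reachable replica is enough to answer the read,
# even if it has not yet seen the latest write from another node.
read = SimpleStatement(
    "SELECT phone FROM contacts WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(read, (42,)).one())
```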

Making the Right Decision

At the end of the day, the theorem still holds, and its tradeoffs must be weighed when implementing any distributed system.

As mentioned above, this is a high-level, simplistic approach to the CAP Theorem as it relates to distributed databases. Technologies are constantly pushing the bounds of what is possible, and tools like Google’s Spanner are right around the corner from breaking the established rules.

 

 
