
Optimizing Distributed Data Systems


They say you can’t have it all, and that’s certainly true with distributed data systems. Did I really just write that sentence? “Sorry son, just like distributed data systems, you can’t have your cake and eat it too…”

Alright, clunky intro out of the way, let me explain my point. Pretend you’re an important decision maker or a technical architect who has been tasked with rolling out the database for your organization’s new application/suite/whatever. As you consider the architecture, you’re faced with certain realities. Is it a global application with millions of users? Does it contain up-to-the-second financial information? How fault tolerant does it need to be? Is it a social interface where updates can take a few moments to permeate without causing issues?

In my opinion, the CAP Theorem does a very good job of simplifying some fairly complicated architectural concepts. Both business and technical stakeholders can benefit from an understanding of the basic concepts set forth by the theorem. I will caveat that I’m not sure how long the theorem will hold true as new technologies emerge (see Google’s Spanner). In 3 or 4 years you may be reading a blog post from us stating why the CAP Theorem no longer applies! For now, however, the CAP theorem is a concept that should be understood before architecting any distributed system, and this blog is for those who are unfamiliar with the concept.

It’s 2017, and the market for database tools is vast. Individual players fill specific niche needs, while the big boys continue to offer robust, comprehensive solutions. There are many dimensions along which any new tool or system should be evaluated, but today’s blog will focus on the CAP Theorem as it relates to distributed data systems.

What is the CAP Theorem?

The CAP Theorem is a term coined by computer scientist Eric Brewer in the early 2000s to describe the relationship between Consistency, Availability, and Partition Tolerance in distributed systems. Specifically, Brewer theorized that only two of the three could be achieved simultaneously. The three elements are described below in (perhaps overly) simplistic terms, along with the three paradigms of the “pick two” approach to the CAP Theorem.

CONSISTENCY means that once a write completes, every subsequent read (even from a different node than the one where the write occurred) includes that update. If Jim from sales updates the phone number for a contact in the CRM system, and Lauren opens that record a moment later, she will see the updated number if the system is CONSISTENT.
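To make that concrete, here is a minimal sketch in Python (a toy in-memory store; all names are hypothetical, not any particular product’s API). A write is copied to every node before it is acknowledged, so a read from any node afterwards returns the new value:

```python
class Node:
    """One replica holding a simple key-value map."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class ConsistentStore:
    """Toy store that replicates synchronously to every node."""
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        # Acknowledge only after every replica has the new value.
        for node in self.nodes:
            node.data[key] = value

    def read(self, key, node):
        return node.data.get(key)

east, west = Node("east"), Node("west")
store = ConsistentStore([east, west])
store.write("contact:42:phone", "555-0142")                 # Jim's update
assert store.read("contact:42:phone", west) == "555-0142"   # Lauren's read
```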

AVAILABILITY is simply the capability of a system to accept writes or return data on request even if one or more nodes are down. If (to continue our previous example) Jim knows the East Coast data center is down but can still pull customer data from one of the redundant nodes (say, from the West Coast), then the system is AVAILABLE.
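And a sketch of availability in the same toy style (again, hypothetical names, not a real driver API): the read succeeds as long as at least one redundant node is reachable:

```python
class NodeDown(Exception):
    """Raised when a replica is unreachable."""

class Node:
    def __init__(self, name, data, up=True):
        self.name, self.data, self.up = name, data, up

    def read(self, key):
        if not self.up:
            raise NodeDown(self.name)
        return self.data.get(key)

def read_with_failover(key, replicas):
    # Serve the request from the first reachable replica.
    for node in replicas:
        try:
            return node.read(key)
        except NodeDown:
            continue  # East Coast down? Try the West Coast node.
    raise NodeDown("all replicas unreachable")

east = Node("east", {"contact:42:phone": "555-0142"}, up=False)
west = Node("west", {"contact:42:phone": "555-0142"})
print(read_with_failover("contact:42:phone", [east, west]))  # 555-0142
```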

PARTITION TOLERANCE means the system keeps operating even when nodes can no longer all communicate, that is, when the connection between two nodes in the system is severed and messages between them are lost.
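A toy partition looks like this (hypothetical names): the link between two nodes is severed, a replication message is lost, and the replicas silently diverge. A partition-tolerant system must keep working through exactly this situation:

```python
class Link:
    """Models the network connection between two replicas."""
    def __init__(self):
        self.severed = False

    def replicate(self, target, key, value):
        if self.severed:
            return False      # message dropped by the partition
        target[key] = value
        return True

east, west, link = {}, {}, Link()
link.severed = True                                # sever the connection
east["contact:42:phone"] = "555-0142"              # local write still works
ok = link.replicate(west, "contact:42:phone", "555-0142")
print(ok, west)   # False {} -> the two sides are now out of sync
```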

Again, the idea behind the CAP theorem is that only two of these three can be achieved simultaneously. Let’s take a look at the three “pairings”, what they entail, and a few of the market leaders in each space.

C + A

For a system to be both AVAILABLE and CONSISTENT, you must trade off partition tolerance: the system cannot survive a severed connection between nodes, so it typically lives on a single node or a tightly coupled cluster, which makes scaling out challenging. This paradigm is best represented by traditional RDBMS systems such as MySQL, SQL Server, PostgreSQL, etc.
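As a loose illustration (using Python’s built-in SQLite as a stand-in for a single-node relational store), with no partition to tolerate, a committed transaction is immediately visible to every subsequent read, and the database answers for as long as that one node is up:

```python
import sqlite3

# One node, no partitions: consistency and availability come together.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, phone TEXT)")

with conn:  # a transaction: commits atomically, rolls back on error
    conn.execute("INSERT INTO contacts VALUES (42, '555-0142')")

# Every read after the commit sees the write.
print(conn.execute("SELECT phone FROM contacts WHERE id = 42").fetchone())
```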

C + P

A system that is both CONSISTENT and PARTITION TOLERANT trades off AVAILABILITY (some requests may be refused during a partition) but can scale relatively easily. The types of products that fall under this category include Document, Columnar, and Key-Value databases such as MongoDB, HBase, and Redis.
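As a rough sketch of what choosing consistency over availability looks like in practice, MongoDB’s write concerns let you require that a majority of replica-set members acknowledge a write; if a partition leaves the client without a reachable majority, the write times out rather than succeed inconsistently. (The connection string, database, and field names below are placeholders.)

```python
from pymongo import MongoClient
from pymongo.errors import WTimeoutError
from pymongo.write_concern import WriteConcern

# Placeholder URI; assumes a replica set named "rs0".
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
contacts = client.crm.contacts.with_options(
    write_concern=WriteConcern(w="majority", wtimeout=5000)
)

try:
    # Acknowledged only once a majority of nodes have the update.
    contacts.update_one({"_id": 42}, {"$set": {"phone": "555-0142"}})
except WTimeoutError:
    # No reachable majority during a partition: the system chooses
    # consistency over availability and rejects the write.
    print("no majority available; write rejected")
```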

P + A

Largely the domain of NoSQL technologies, a system that is PARTITION TOLERANT and AVAILABLE trades off Consistency to ensure data can always be written and retrieved, though nodes may be out of sync. Like C+P, products under this category can include Document, Columnar, and Key-Value databases such as Dynamo, Cassandra, and CouchDB.
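And the mirror image, availability over consistency: with Cassandra’s tunable consistency set to ONE, any single replica can acknowledge the write, so the cluster stays writable through a partition and replicas reconcile later. (Contact points and schema below are placeholders.)

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholder contact point and keyspace.
session = Cluster(["127.0.0.1"]).connect("crm")

# CL=ONE: one replica's acknowledgment is enough, so writes keep
# succeeding during a partition; out-of-sync replicas converge later.
write = SimpleStatement(
    "UPDATE contacts SET phone = %s WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write, ("555-0142", 42))
```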

Making the Right Decision

At the end of the day, while the theorem holds true, its tradeoffs must be weighed carefully when implementing any distributed system.

As mentioned above, this is a high-level, simplistic approach to the CAP Theorem as it relates to distributed databases. Technologies are constantly pushing the bounds of what is possible, and tools like Google’s Spanner are right around the corner from breaking the established rules.
