
Exploring Data Streaming Technologies


The world is in a constant state of change, so it makes sense to want to examine that change in real time. Being able to process, transform, and make decisions on data as soon as it arrives makes that data far more valuable to users. But that's easier said than done. Consider the real-world metrics industries want to monitor: location data, user activity on websites, biometrics from medical equipment, wind speeds, incoming business transactions, and so on. These are all highly dynamic; new information can arrive at any time and at an incredibly high rate, which means they also generate a massive amount of data. In the past, the available tools and technologies simply hadn't matured enough to make this sort of real-time processing possible at such a large scale. Over the past decade, though, huge strides in data streaming technology have made high-performing streams at enterprise scale more accessible than ever before. In this blog, we'll explore the architecture and strengths of some of the leading solutions for data streaming.

Apache Kafka

Kafka got its start at LinkedIn, where it was initially designed as a pub/sub messaging system, but over the years it has been refined into a fast, fault-tolerant, and scalable distributed platform for data streaming. It runs on a Kafka Cluster, which is made up of nodes called Brokers. Each broker hosts partitions of one or more Topics: ordered, partitioned, immutable logs of all data published to them. Data is published to a topic from a source application through a Producer and picked up by a Consumer subscribed to the topic, allowing the data to be processed by other target applications.

Kafka Cluster

Image Source: https://www.upsolver.com/blog/comparing-apache-kafka-amazon-kinesis 
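
To make the Producer side of that flow concrete, here's a minimal sketch using the confluent-kafka Python client. The broker address, topic name, key, and record contents are placeholders for illustration, not anything prescribed by Kafka itself.

```python
from confluent_kafka import Producer

# Placeholder broker address; point this at your own cluster.
producer = Producer({'bootstrap.servers': 'localhost:9092'})

def on_delivery(err, msg):
    # Invoked once the broker acknowledges (or rejects) the record.
    if err is not None:
        print(f'Delivery failed: {err}')
    else:
        print(f'Record written to {msg.topic()} [partition {msg.partition()}]')

# Records with the same key always land on the same partition,
# which preserves per-key ordering for downstream Consumers.
producer.produce('sensor-readings', key='sensor-42',
                 value='{"temp": 21.5}', callback=on_delivery)
producer.flush()  # block until all outstanding records are delivered
```

A Consumer on the other side subscribes to the same topic and receives these records in partition order, as shown in the replay example later in this post.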

Because Kafka is open source, you have the option of setting up an entire stream in-house if you or your team has the DevOps knowledge to do so. However, for an enterprise-level production implementation, it's probably more realistic to opt for one of the available managed solutions (such as Confluent Cloud or Amazon MSK, or a mixture of the two). Kafka also offers a high level of control over configuration for Producers, Consumers, and the Kafka cluster itself. This cuts both ways: it makes Kafka very flexible, but it also requires a higher degree of familiarity with the system to take full advantage of its capabilities. Tuned optimally, Kafka can achieve very high throughput and very low latency, but you won't get that kind of performance out of the box.
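
As a rough illustration of that configuration surface, the sketch below sets a few common producer-side knobs via the confluent-kafka Python client. The specific values here are assumptions for illustration, not recommendations; the right settings depend entirely on your workload and durability requirements.

```python
from confluent_kafka import Producer

# Illustrative tuning values only; tune against your own workload.
producer = Producer({
    'bootstrap.servers': 'broker1:9092,broker2:9092',
    'acks': 'all',               # wait for all in-sync replicas: safest, slowest
    'linger.ms': 20,             # wait up to 20 ms to batch records together
    'batch.size': 131072,        # larger batches favor throughput over latency
    'compression.type': 'lz4',   # trade CPU for network and disk savings
    'enable.idempotence': True,  # avoid duplicate records on retry
})
```

Even this handful of settings shows the trade-offs involved: batching and compression raise throughput at the cost of per-record latency, while stricter acknowledgment settings trade speed for durability.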

Another strength of Kafka is its fault tolerance and data persistence. Topics aren't just partitioned across multiple brokers; the partitions themselves are also replicated across a configurable number of brokers, so if a "lead" broker goes down, a backup broker is always available. Beyond fault tolerance, this design brings another benefit: because a topic's size isn't limited to the capacity of any one node, Kafka puts no upper limit on data persistence. Most streaming platforms will let you persist records for a week at most, requiring you to move them to a data lake or other external storage if you don't want to lose them. Kafka topics will save data forever if you tell them to, or at least they'll try to until you run out of disk space.
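
Here's a sketch of what that looks like in practice, creating a topic with replicated partitions and unlimited retention through the confluent-kafka admin client. The topic name and counts are illustrative assumptions.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'localhost:9092'})

# Three partitions, each copied to three brokers for fault tolerance;
# retention.ms=-1 tells Kafka to keep records indefinitely
# (until disk space runs out).
topic = NewTopic(
    'sensor-readings',
    num_partitions=3,
    replication_factor=3,
    config={'retention.ms': '-1'},
)
futures = admin.create_topics([topic])
futures['sensor-readings'].result()  # raises an exception if creation failed
```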

This is great because it allows you to handle historical and incoming data the same way, with no limit to how far back you can go. A Consumer can begin by processing past data from a specified offset on the topic, but instead of stopping when it reaches the most recent entry, it will continue to process new data as it comes in. Other streaming platforms can do this as well, but what if, instead of a week, your application's algorithm needs data from two weeks ago? Or a month? Or a year? Normally you'd have to make a separate network call to get that data from wherever you stored it, but with Kafka you can consume it straight from the topic. If long-term data persistence is a must-have for your streaming application, Kafka should be your go-to solution.
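
A minimal sketch of that replay pattern with the confluent-kafka Python client follows. The topic, group id, and handle() function are hypothetical placeholders.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'replay-demo',
    # With no committed offset, start from the oldest retained record,
    # potentially months or years back if retention allows it.
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['sensor-readings'])

# One loop handles both historical and live data: it drains the backlog,
# then keeps polling as new records arrive.
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue  # nothing yet; keep waiting for new data
    if msg.error():
        print(f'Consumer error: {msg.error()}')
        continue
    handle(msg.value())  # hypothetical processing function
```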

Amazon Kinesis

Kinesis, Amazon's in-house streaming platform, handles streaming data much like Kafka does. Like Kafka, Kinesis is pub/sub-based, fault-tolerant, and scalable, and it also uses Producers and Consumers. A Kinesis Stream is analogous to a Kafka topic, and a Kinesis Shard to a Kafka partition. The two platforms have very similar architectures and use cases.

Amazon Kinesis Stream 

Image source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html 

The biggest difference between the two is that Kinesis is offered only as a managed service through AWS. If Kafka is hosted in-house, it's all on you to plan out the computing resources and storage capacity needed to host the cluster, to set it all up properly, and to scale with additional resources if the need arises. Even if you opt for a managed Kafka solution to avoid hosting it yourself, you're still responsible for performance configuration, fault tolerance, recovery, and partitioning. With Kinesis, all the infrastructure is handled on AWS's side, which means deployment takes significantly less time, material, and human cost. For added functionality, Kinesis integrates with other AWS services: S3 for data lake storage, Lambda for event processing, CloudWatch for real-time monitoring and data metrics, and Application Auto Scaling for quick, reactive scaling of the stream. If your use case doesn't require Kafka's fine-grained performance tuning or long-term data persistence, or if you're already leveraging AWS for other projects, Kinesis is a strong choice.
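
For a feel of the API, here's a minimal boto3 sketch that writes to and reads from a Kinesis stream. The stream name, region, and record contents are placeholders, and a production consumer would typically use the Kinesis Client Library or a Lambda trigger rather than polling shards by hand.

```python
import json
import boto3

# Placeholder region; the stream itself is created and scaled
# through AWS rather than by this code.
kinesis = boto3.client('kinesis', region_name='us-east-1')

# Producer side: records with the same partition key go to the same shard,
# mirroring Kafka's key-to-partition mapping.
kinesis.put_record(
    StreamName='sensor-readings',
    Data=json.dumps({'sensor': 'sensor-42', 'temp': 21.5}).encode('utf-8'),
    PartitionKey='sensor-42',
)

# Consumer side: walk a shard with an iterator.
description = kinesis.describe_stream(StreamName='sensor-readings')
shard_id = description['StreamDescription']['Shards'][0]['ShardId']
iterator = kinesis.get_shard_iterator(
    StreamName='sensor-readings',
    ShardId=shard_id,
    ShardIteratorType='TRIM_HORIZON',  # start from the oldest available record
)['ShardIterator']
batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in batch['Records']:
    print(record['Data'])
```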

There are many streaming platforms available beyond these two. Microsoft Azure has Event Hubs, and Google Cloud offers its own Pub/Sub service. Apache Storm performs real-time transformations on incoming data, and Apache Flink can process both bounded and unbounded streams. With so many options, choosing one can seem overwhelming. Ultimately, a strong understanding of your data format, infrastructure, and business use case will help you determine the best fit for the streaming task at hand.

Additional resources if you are interested in learning more about these technologies:

Apache Kafka
Amazon Kinesis
Google Cloud Pub/Sub
Apache Flink
Azure Event Hubs
