Exploring Data Streaming Technologies

By Zach Showalter

The world is in a constant state of change, so naturally it makes sense to want to examine that change in real time. Being able to process, transform, and act on data as soon as it arrives makes that data far more valuable to users. But that's easier said than done. Consider the real-world metrics industries want to monitor: location data, user activity on websites, biometrics from medical equipment, wind speeds, incoming business transactions, and so on. These are all highly dynamic; new information can arrive at any time and at an incredibly high rate, which means they also generate a massive amount of data. In the past, the available tools and technologies simply hadn't matured enough to make this sort of real-time processing possible at such a large scale. Over the past decade, though, huge strides have been made in data streaming technology, making high-performing streams at enterprise scale more accessible than ever before. In this blog, we'll explore the architecture and strengths of some of the leading solutions for data streaming.

Apache Kafka 

Kafka got its start at LinkedIn, where it was initially designed as a pub/sub messaging system, but over the years it has been refined into a fast, fault-tolerant, and scalable distributed platform for data streaming. It runs on a Kafka Cluster, which is made up of nodes called Brokers. Each broker hosts one or more partitions of a Topic: an ordered, partitioned, immutable log of every record published to it. Data is published to a topic from a source application through a Producer and picked up by a Consumer subscribed to that topic, allowing the data to be processed by downstream target applications.

Kafka Cluster

Image Source: https://www.upsolver.com/blog/comparing-apache-kafka-amazon-kinesis 
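To make the Producer/Consumer flow concrete, here's a minimal sketch using the open-source kafka-python client. The broker address and the "wind-speeds" topic are placeholders for illustration, not part of any particular setup.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a record to the "wind-speeds" topic
# (broker address and topic name are illustrative placeholders).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("wind-speeds", b'{"station": "A1", "mph": 12.3}')
producer.flush()  # block until the record has actually been sent

# Consumer: subscribe to the same topic and process records as they arrive.
consumer = KafkaConsumer(
    "wind-speeds",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest retained record
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```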

Because Kafka is open source, you have the option of setting up an entire stream in-house if you or your team have the DevOps knowledge to do so. For an enterprise-level production implementation, however, it's probably more realistic to opt for one of the available managed solutions (Confluent or Amazon MSK, or a mixture of both). Kafka also offers a high level of control over configuration for Producers, Consumers, and the Kafka cluster itself. This can be both good and bad: it makes Kafka very flexible, but it also demands a deeper familiarity with the system to take full advantage of its capabilities. Tuned optimally, Kafka can achieve very high throughput and very low latency, but you won't get that kind of performance out of the box.
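To give a sense of that configuration surface, here's a sketch of producer-side tuning with kafka-python. The specific values are assumptions chosen to illustrate the throughput/latency trade-off, not recommendations.

```python
from kafka import KafkaProducer

# Each keyword maps to a standard producer config (acks, compression.type,
# linger.ms, batch.size, retries); the values below are illustrative only.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",               # wait for all in-sync replicas: durability over latency
    compression_type="gzip",  # trade CPU for network throughput
    linger_ms=20,             # wait up to 20 ms to batch records together
    batch_size=64 * 1024,     # bigger batches mean fewer, larger requests
    retries=5,                # retry transient broker failures
)
```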

Another strength of Kafka is its fault tolerance and data persistence. Topics aren't just partitioned across multiple brokers; the partitions themselves are also replicated across a configurable number of brokers, so if a "lead" broker goes down, a backup broker is always available. This achieves fault tolerance, but it brings another benefit as well. Because a topic's size isn't limited by the capacity of any one node, Kafka puts no upper limit on data persistence. Most streaming platforms will let you persist records for a week at most, requiring you to move them to a data lake or other external storage if you don't want to lose them. Kafka topics will save data forever if you tell them to, or at least they'll try until you run out of disk space.
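Both of these knobs, replication and retention, are set per topic. Here's a minimal sketch using kafka-python's admin client; the topic name and sizing are assumptions for illustration.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions, each replicated to three brokers. Setting
# retention.ms to -1 asks Kafka to keep records indefinitely
# (disk space permitting).
topic = NewTopic(
    name="wind-speeds",
    num_partitions=3,
    replication_factor=3,
    topic_configs={"retention.ms": "-1"},
)
admin.create_topics([topic])
```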

This is great because it allows you to handle historical and incoming data the same way, with no limit to how far back you can go. A Consumer can begin by processing past data from a specified offset on the topic, but instead of stopping when it reaches the most recent entry, it will continue to process new data as it comes in. Other streaming platforms can do this too, but what if your application's algorithm needs data from two weeks ago instead of one? Or a month? Or a year? Normally you'd have to make a separate network call to fetch that data from wherever you stored it, but with Kafka you can consume it straight from the topic. If long-term data persistence is a must-have for your streaming application, Kafka should be your go-to solution.
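In practice, that replay amounts to assigning a partition and seeking to an offset. A sketch with kafka-python, where the partition number and offset are placeholders:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# Manually assign partition 0 of the topic and rewind to an older offset.
partition = TopicPartition("wind-speeds", 0)
consumer.assign([partition])
consumer.seek(partition, 1000)  # replay everything from offset 1000 onward

# The loop drains the historical records, then keeps waiting for new ones.
for message in consumer:
    print(message.offset, message.value)
```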

Amazon Kinesis 

Kinesis, Amazon's in-house streaming platform, covers much of the same ground as Kafka. Like Kafka, it's pub/sub-based, fault-tolerant, and scalable, and it likewise uses Producers and Consumers. A Kinesis Stream is the counterpart of a Kafka topic, and a Kinesis Shard is the counterpart of a Kafka partition. The two platforms have very similar architectures and use cases.

Amazon Kinesis Stream 

Image source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html 
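Here's the same produce/consume flow sketched against Kinesis with boto3; the stream name, shard ID, and region are placeholders. Note how PartitionKey plays the role of a Kafka partition key: it's hashed to route each record to a shard.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer side: the PartitionKey is hashed to choose a shard,
# much like a partition key routes records in Kafka.
kinesis.put_record(
    StreamName="wind-speeds",
    Data=b'{"station": "A1", "mph": 12.3}',
    PartitionKey="station-A1",
)

# Consumer side: fetch an iterator for one shard, then poll for records.
iterator = kinesis.get_shard_iterator(
    StreamName="wind-speeds",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",  # start at the oldest retained record
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    print(record["SequenceNumber"], record["Data"])
```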

The biggest difference between the two is that Kinesis is only offered as a managed service through AWS. If Kafka is hosted in-house, it's all on you to plan out the computing resources and storage capacity needed to host the cluster, to set it all up properly, and to scale with additional resources as the need arises. Even if you opt for a managed Kafka solution to avoid hosting it yourself, you're still responsible for performance configuration, fault tolerance, recovery, and partitioning. With Kinesis, however, all of the infrastructure is handled on AWS's side, which means deployment takes significantly less time, material, and human cost. For added functionality, Kinesis integrates with other AWS services: S3 for data lake storage, Lambda for event processing, CloudWatch for real-time monitoring and metrics, and Application Auto Scaling for quick, reactive scaling of the stream. If your use case doesn't require Kafka's fine-grained performance tuning or long-term data persistence, or if you're already leveraging AWS for other projects, Kinesis is a strong choice.
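As one example of those integrations, a Lambda function can be subscribed to a Kinesis stream through an event source mapping, in which case Kinesis hands the function batches of base64-encoded records. A minimal handler sketch:

```python
import base64
import json

# Sketch of a Lambda handler wired to a Kinesis stream via an event
# source mapping; Kinesis delivers each record's payload base64-encoded.
def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(payload)  # stand-in for real processing logic
```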

There are many streaming platforms available beyond these two. Microsoft Azure has Event Hubs, and Google offers Cloud Pub/Sub. Apache Storm performs real-time transformations on incoming data, and Apache Flink can process both bounded and unbounded streams. With so many options, choosing one can seem overwhelming. Ultimately, a strong understanding of your data format, infrastructure, and business use case will help you determine the best fit for the streaming task at hand.

Additional resources if you are interested in learning more about these technologies:

Apache Kafka
Amazon Kinesis
Google Cloud Pub/Sub
Apache Flink
Azure Event Hubs

About The Author

Zach Showalter is a Consultant on the Data team.