Exploring Data Streaming Technologies

By Zach Showalter

The world is in a constant state of change, so naturally it makes sense to want to examine that change in real time. Being able to process, transform, and act on data as soon as it arrives makes that data far more valuable to users. But that's easier said than done. Consider the real-world metrics industries want to monitor: location data, user activity on websites, biometrics from medical equipment, wind speeds, incoming business transactions, and so on. These are all highly dynamic; new information can arrive at any time and at an incredibly high rate, which means they also generate a massive amount of data. In the past, the available tools and technologies simply hadn't matured enough to make this sort of real-time processing possible at such a large scale. Over the past decade, though, huge strides have been made in data streaming technology, making high-performing streams at enterprise scale more accessible than ever before. In this blog, we'll explore the architecture and strengths of some of the leading solutions for data streaming.

Apache Kafka 

Kafka got its start at LinkedIn, where it was initially designed as a pub/sub messaging system, but over the years it has been refined into a fast, fault-tolerant, and scalable distributed platform for data streaming. It runs on a Kafka Cluster, which is made up of nodes called Brokers. Each broker hosts one or more partitions of a Topic: an ordered, partitioned, immutable log of every record published to it. Data is published to a topic from a source application through a Producer and picked up by a Consumer subscribed to that topic, allowing the data to be processed by downstream target applications.

Kafka Cluster

Image Source: https://www.upsolver.com/blog/comparing-apache-kafka-amazon-kinesis 
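To make the Producer/Consumer flow concrete, here's a minimal sketch using the open-source kafka-python client. The broker address and the "wind-speeds" topic are placeholders for illustration, not part of any particular setup.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a record to the "wind-speeds" topic
# (broker address and topic name are illustrative placeholders).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("wind-speeds", b'{"station": "A1", "mph": 12.3}')
producer.flush()  # block until the record has actually been sent

# Consumer: subscribe to the same topic and process records as they arrive.
consumer = KafkaConsumer(
    "wind-speeds",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest retained record
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```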

Because Kafka is open source, you have the option of setting up an entire stream in-house if you or your team have the DevOps knowledge to do so. For an enterprise-level production implementation, however, it's probably more realistic to opt for one of the available managed solutions (Confluent or Amazon MSK, or a mixture of both). Kafka also offers a high level of control over configuration for Producers, Consumers, and the Kafka cluster itself. This can be both good and bad: it makes Kafka very flexible, but it also demands a deeper familiarity with the system to take full advantage of its capabilities. Tuned optimally, Kafka can achieve very high throughput and very low latency, but you won't get that kind of performance out of the box.
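To give a sense of that configuration surface, here's a sketch of producer-side tuning with kafka-python. The specific values are assumptions chosen to illustrate the throughput/latency trade-off, not recommendations.

```python
from kafka import KafkaProducer

# Each keyword maps to a standard producer config (acks, compression.type,
# linger.ms, batch.size, retries); the values below are illustrative only.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",               # wait for all in-sync replicas: durability over latency
    compression_type="gzip",  # trade CPU for network throughput
    linger_ms=20,             # wait up to 20 ms to batch records together
    batch_size=64 * 1024,     # bigger batches mean fewer, larger requests
    retries=5,                # retry transient broker failures
)
```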

Another strength of Kafka is its fault tolerance and data persistence. Topics aren't just partitioned across multiple brokers; the partitions themselves are also replicated across a configurable number of brokers, so if a "lead" broker goes down, a backup broker is always available. This achieves fault tolerance, but it brings another benefit as well. Because a topic's size isn't limited by the capacity of any one node, Kafka puts no upper limit on data persistence. Most streaming platforms will let you persist records for a week at most, requiring you to move them to a data lake or other external storage if you don't want to lose them. Kafka topics will save data forever if you tell them to, or at least they'll try until you run out of disk space.
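Both of these knobs, replication and retention, are set per topic. Here's a minimal sketch using kafka-python's admin client; the topic name and sizing are assumptions for illustration.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions, each replicated to three brokers. Setting
# retention.ms to -1 asks Kafka to keep records indefinitely
# (disk space permitting).
topic = NewTopic(
    name="wind-speeds",
    num_partitions=3,
    replication_factor=3,
    topic_configs={"retention.ms": "-1"},
)
admin.create_topics([topic])
```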

This is great because it allows you to handle historical and incoming data the same way, with no limit to how far back you can go. A Consumer can begin by processing past data from a specified offset on the topic, but instead of stopping when it reaches the most recent entry, it will continue to process new data as it comes in. Other streaming platforms can do this too, but what if your application's algorithm needs data from two weeks ago instead of one? Or a month? Or a year? Normally you'd have to make a separate network call to fetch that data from wherever you stored it, but with Kafka you can consume it straight from the topic. If long-term data persistence is a must-have for your streaming application, Kafka should be your go-to solution.
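In practice, that replay amounts to assigning a partition and seeking to an offset. A sketch with kafka-python, where the partition number and offset are placeholders:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# Manually assign partition 0 of the topic and rewind to an older offset.
partition = TopicPartition("wind-speeds", 0)
consumer.assign([partition])
consumer.seek(partition, 1000)  # replay everything from offset 1000 onward

# The loop drains the historical records, then keeps waiting for new ones.
for message in consumer:
    print(message.offset, message.value)
```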

Amazon Kinesis 

Kinesis, Amazon's in-house streaming platform, covers much of the same ground as Kafka. Like Kafka, it's pub/sub-based, fault-tolerant, and scalable, and it likewise uses Producers and Consumers. A Kinesis Stream is the counterpart of a Kafka topic, and a Kinesis Shard is the counterpart of a Kafka partition. The two platforms have very similar architectures and use cases.

Amazon Kinesis Stream 

Image source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html 
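Here's the same produce/consume flow sketched against Kinesis with boto3; the stream name, shard ID, and region are placeholders. Note how PartitionKey plays the role of a Kafka partition key: it's hashed to route each record to a shard.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Producer side: the PartitionKey is hashed to choose a shard,
# much like a partition key routes records in Kafka.
kinesis.put_record(
    StreamName="wind-speeds",
    Data=b'{"station": "A1", "mph": 12.3}',
    PartitionKey="station-A1",
)

# Consumer side: fetch an iterator for one shard, then poll for records.
iterator = kinesis.get_shard_iterator(
    StreamName="wind-speeds",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",  # start at the oldest retained record
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    print(record["SequenceNumber"], record["Data"])
```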

The biggest difference between the two is that Kinesis is only offered as a managed service through AWS. If Kafka is hosted in-house, it's all on you to plan out the computing resources and storage capacity needed to host the cluster, to set it all up properly, and to scale with additional resources as the need arises. Even if you opt for a managed Kafka solution to avoid hosting it yourself, you're still responsible for performance configuration, fault tolerance, recovery, and partitioning. With Kinesis, however, all of the infrastructure is handled on AWS's side, which means deployment takes significantly less time, material, and human cost. For added functionality, Kinesis integrates with other AWS services: S3 for data lake storage, Lambda for event processing, CloudWatch for real-time monitoring and metrics, and Application Auto Scaling for quick, reactive scaling of the stream. If your use case doesn't require Kafka's fine-grained performance tuning or long-term data persistence, or if you're already leveraging AWS for other projects, Kinesis is a strong choice.
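As one example of those integrations, a Lambda function can be subscribed to a Kinesis stream through an event source mapping, in which case Kinesis hands the function batches of base64-encoded records. A minimal handler sketch:

```python
import base64
import json

# Sketch of a Lambda handler wired to a Kinesis stream via an event
# source mapping; Kinesis delivers each record's payload base64-encoded.
def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(payload)  # stand-in for real processing logic
```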

There are many streaming platforms available beyond these two. Microsoft Azure has Event Hubs, and Google offers Cloud Pub/Sub. Apache Storm performs real-time transformations on incoming data, and Apache Flink can process both bounded and unbounded streams. With so many options, choosing one can seem overwhelming. Ultimately, a strong understanding of your data format, infrastructure, and business use case will help you determine the best fit for the streaming task at hand.

Additional resources if you are interested in learning more about these technologies:

Apache Kafka
Amazon Kinesis
Google Cloud Pub/Sub
Apache Flink
Azure Event Hubs

About The Author

Zach Showalter is a Consultant on the Data team.