New book review for Making Sense of Stream Processing: The Philosophy Behind Apache Kafka and Scalable Stream Data Platforms, by Martin Kleppmann, O'Reilly, 2016, reposted here:
My first exposure to author Martin Kleppmann was his Strange Loop 2014 talk in St. Louis, Missouri, entitled "Turning the Database Inside Out with Apache Samza". That talk is relevant here because Samza is a distributed stream processing framework developed at LinkedIn that is, in the words of Kleppmann, "a surreptitious attempt to take the database architecture we know, and turn it inside out", using Apache Kafka at its core as a distributed, durable commit log.
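To make the "commit log" idea concrete before going further, here is a minimal sketch of my own (not from the book) of appending an event to a Kafka topic from Java; the broker address, topic name, and event payload are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogAppend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() appends the record to the end of a partition's log;
            // records sharing a key land in the same partition, preserving
            // per-key ordering for downstream consumers.
            producer.send(new ProducerRecord<>("user-events", "user-42",
                    "{\"event\": \"profile_updated\", \"name\": \"Ada\"}"));
        }
    }
}
```

Each record is durably appended to the end of a partition and retained for consumers to replay, which is the property the "database inside out" argument builds on.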
After discussing events and stream processing in Chapter 1, and using logs to build a solid data architecture in Chapter 2, the author discusses integrating databases and Kafka with change data capture (CDC) in Chapter 3, and the "Unix philosophy" of distributed data in Chapter 4, followed by the text version of the Strange Loop 2014 talk in Chapter 5. This freely available text is probably one of the best organized I have seen from O'Reilly of late, although readers will need to keep in mind that the last chapter comprises material from two years ago. Beyond organization and readability, another aspect of this book that other architects like myself will probably appreciate is the large number of diagrams. While many of these diagrams are drawn at a high level, just remember that this presentation covers practical architecture, with no ivory tower nonsense.
Perhaps anticipating, however, that some might view stream processing itself as ivory tower nonsense, the foreword to this book offers the following comments: "Whenever people are excited about an idea or technology, they come up with buzzwords to describe it. Perhaps you have come across some of the following terms, and wondered what they are about: 'stream processing', 'event sourcing', 'CQRS', 'reactive', and 'complex event processing'. Sometimes, such self-important buzzwords are just smoke and mirrors, invented by companies that want to sell you their solutions. But sometimes, they contain a kernel of wisdom that can really help us design better systems."
"In this report, Martin goes in search of the wisdom behind these buzzwords. He discusses how event streams can help make your applications more scalable, more reliable, and more maintainable. People are excited about these ideas because they point to a future of simpler code, better robustness, lower latency, and more flexibility for doing interesting things with data. After reading this report, you'll see the architecture of your own applications in a completely new light. This report focuses on the architecture and design decisions behind stream processing systems. We will take several different perspectives to get a rounded overview of systems that are based on event streams, and draw comparisons to the architecture of databases, Unix, and distributed systems."
If you are interested in this space but unfamiliar with it, and don't have time to read the entire 170-page book, I recommend at least reading the first chapter. The author effectively walks through explanations of some of the buzzwords listed above, because he realizes that some of the confusion related to technologies in this space "seems to arise because similar techniques originated in different communities, and people often seem to stick within their own community rather than looking at what their neighbors are doing."
After this discussion, the author lists the distributed stream processing frameworks Samza, Storm, Spark Streaming, and Flink as alternatives to Kafka Streams, and comments on the interesting design differences (pros and cons) among these tools. Be aware, however, that he points to the Samza documentation for additional detail, after noting that all of these frameworks are mostly concerned with low-level matters, and that work on high-level query languages for stream processing is still in progress.
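To give a feel for what code against one of these frameworks looks like, here is a minimal Kafka Streams sketch of my own (not an example from the book); the topic names and the bot-filtering predicate are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PageViewFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());

        // Build a topology that reads one topic, drops bot traffic, and
        // writes the surviving events to another topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views");
        views.filter((key, value) -> !value.contains("\"agent\":\"bot\""))
             .to("page-views-clean");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

Even this trivial topology shows the low-level, per-event flavor the author describes; the higher-level query languages he mentions aim to lift exactly this kind of plumbing into something more declarative.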
Web developers and architects will likely want to take notice of the second chapter, where Kleppmann walks through the long-term evolution of a case study web application that starts off simply enough in terms of architecture, but then grows with a proliferation of different tools used in combination with one another. "It's not that any particular decision we made along the way was bad. There is no one database or tool that can do everything that our application requires. We use the best tool for the job, and for an application with a variety of features that implies using a variety of tools. Also, as a system grows, you need a way of decomposing it into smaller components in order to keep it manageable. That's what microservices are all about."
"But, if your system becomes a tangled mess of interdependent components, that's not manageable either. Simply having many different storage systems is not a problem in and of itself: if they were all independent from one another, it wouldn't be a big deal. The real trouble here is that many of them end up containing the same data, or related data, but in different form."
Different techniques for making sure the data ends up in all the right places are then discussed, along with the inherent issues of each, followed by what the author considers the simplest solution: store all writes in a fixed order, and apply them in that fixed order to the various places they need to go. The rest of the chapter (25 pages) walks through examples of how logs are used in practice, concluding with a thought experiment: What if the only way to modify data in your service was to append an event to a log? Could all this activity take place through just POST (to append an event) and GET (to read), negating the need for PUT and DELETE?
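To make that solution concrete, here is a rough sketch of my own (not from the book) of a consumer that applies one ordered log of writes to multiple downstream stores; the topic name and the two apply methods are hypothetical stand-ins for real integrations:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderedLogApplier {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "writes-applier");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("writes")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Apply each write, in log order, to every place the
                    // data needs to go.
                    applyToCache(record.key(), record.value());
                    applyToSearchIndex(record.key(), record.value());
                }
            }
        }
    }

    // Hypothetical stand-ins for real cache and search-index integrations.
    static void applyToCache(String key, String value) { /* update cache entry */ }
    static void applyToSearchIndex(String key, String value) { /* update index doc */ }
}
```

Because every downstream system consumes the same writes in the same order, they all converge on consistent views of the data, which is the heart of the chapter's argument.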
The fourth chapter dives into a bit more theory, in the sense that the philosophy behind Kafka is important in assessing the tool but not necessarily needed in practice. Even so, the presentation that the author provides here is worth reviewing, because he walks through the reasons why some "old ideas" from Unix are very different from the design approaches of mainstream databases, and why those ideas deserve more attention today.
It is probably the third chapter, though, that will make some readers take notice, because it walks through a very practical example involving an existing web application and the desire to make use of Kafka without rewriting the application's code. In short, instead of propagating data changes directly to all the different tools mentioned in the earlier web application case study, the example treats the database as the system of record, captures its changes into a Kafka log, and propagates them from there to the other tools. I'm currently extending a proof of concept I created for a recent client to implement this approach.
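For readers who want to picture that pipeline, here is a hedged sketch of my own (not the book's code) of the final hop: a consumer that mirrors a change data capture topic into a search index, using the Kafka log-compaction convention that a record with a null value is a tombstone marking a deleted row. The topic name, change-event format, and index methods are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ChangelogMirror {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "search-index-mirror");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // A CDC tool is assumed to fill this topic from the database's
            // replication log; topic name and event shape are simplified.
            consumer.subscribe(Collections.singletonList("customers-changelog"));
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofMillis(500))) {
                    if (record.value() == null) {
                        deleteFromIndex(record.key()); // tombstone: row deleted
                    } else {
                        upsertIntoIndex(record.key(), record.value());
                    }
                }
            }
        }
    }

    // Hypothetical stand-ins for a real search-index client.
    static void upsertIntoIndex(String key, String row) { /* index the row */ }
    static void deleteFromIndex(String key) { /* remove the document */ }
}
```

The appeal of this design is that the existing application keeps writing to its database exactly as before; every downstream system simply tails the changelog.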