Some of the most interesting projects I worked on at LinkedIn involved building large-scale, real-time pricing and machine learning products. They required crafting fault-tolerant distributed data architectures to support model training, forecasting and dynamic control systems. None of this work involved a traditional relational database. Instead, event streams, derived data and stream processing became the core building blocks.
We built entire multi-billion-dollar products on top of these architectures. The experience gave me a new perspective on how organizations can think about their data infrastructure in terms of data flows.
Logs and Event Streams
Over the past decade, distributed backends have grown considerably. Most products start out with a typical database-centric architecture, then add replication and partitioning to handle scale and growth. Periodic data snapshots are made available through a data warehouse such as HDFS for analytics and offline pre-computations. In some cases, nearline stream processing is introduced, perhaps through a lambda architecture. This evolutionary path is a familiar one, driven primarily by the requirement to handle a high rate of requests per second within the shortest acceptable response time, typically a few milliseconds.
The distributed backend is made possible by log-based data flow between the various replicas and partitions. These log streams capture and transmit database writes; LinkedIn’s Brooklin is an example of such a system. In essence, each message in the stream represents a change event in the application state. With database writes externalized as a stream of logs, we no longer have to poll the database for changes; stream processing lets us react to changes in application state in real time. And since a stream can have multiple listeners, we can run orthogonal processes on a single stream of data.
We don’t have to limit ourselves to firing change events only on database writes. We can also instrument both the application frontend and backend to fire events on various activities. For example, we can fire click and UI interaction events from the frontend client, and we can fire backend events containing metadata that provide deeper insight into the processes within backend services.
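As a rough sketch of what such events might look like, consider a small self-describing envelope shared by frontend and backend events. All field names here are illustrative assumptions, not a standard:

```python
import time
import uuid

def make_event(source, event_type, payload):
    """Wrap an activity in a self-describing event envelope.
    Field names are illustrative, not any standard schema."""
    return {
        "event_id": str(uuid.uuid4()),  # unique id, useful later for deduplication
        "source": source,               # e.g. "web-frontend", "orders-service"
        "type": event_type,             # e.g. "click", "order_created"
        "timestamp": time.time(),
        "payload": payload,             # event-specific data
    }

# A frontend UI interaction event
click = make_event("web-frontend", "click", {"element": "signup-button"})

# A backend change event fired on a database write
write = make_event("orders-service", "order_created",
                   {"order_id": 42, "amount": 19.99})
```

Because both kinds of events share the same envelope, downstream processors can handle them uniformly regardless of where they originated.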
This creates an interesting opportunity to rethink the relationship between applications and the database. If changes in state are available outside the database, then we no longer have to poll the data tables for updates. Instead, we can build processes that listen to a stream of messages and respond to changes in state in real time. Further, each process can in turn emit new messages that trigger other processes, and so on.
In this way, the event streams become what Jay Kreps, creator of Apache Kafka, calls the central nervous system through which all data activities flow.
In essence, this paradigm takes system activities and converts them into externalized logs that can be analyzed, processed and acted upon by other systems. Logs have been around for as long as computing itself, and this paradigm shows that they continue to be important in a distributed data processing world.
It’s also worth noting that a stream-based data flow can not only be processed in real time, but also in batches, simply by collecting the events into a data warehouse. The reverse is not possible: a system that collects data in batches (for example, periodic snapshots of databases into a data warehouse) cannot fulfill real-time computing needs. Stream-based data flow architectures can therefore seamlessly handle both batch and real-time use cases.
Loose Coupling of Systems
Data flow based architecture promotes loose coupling of systems. A stream processor doesn’t need to know the implementation specifics of the system that emits a stream of events, nor does it have to worry about how other processes are handling the same message. (We will discuss the exactly-once processing constraint later.) It only cares about the contents of the message and what it has to do to process it. Further, a single message can have multiple processors handling it.
For example, say a credit card has just been charged by a merchant. When this happens we want to execute a series of tasks: update the account, run a fraud detection model, and send out mobile push and email notifications. More importantly, the timing of these processes matters. Ideally, push notifications should go out as soon as the charge takes place, and fraud detection should run immediately, reversing the charge if the purchase was indeed fraudulent.
In a data flow based architecture, this credit card activity generates a single charge event against the account. Multiple subsystems (one each for fraud detection, accounting and so on) listen on the event stream and process the event as needed. All of this can happen asynchronously, in parallel, and in real time.
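A minimal in-memory sketch of this fan-out, using plain Python in place of a real broker such as Kafka (handler and field names are made up for illustration):

```python
# Minimal in-memory fan-out: one event stream, many independent consumers.
# A real deployment would use a broker such as Kafka; this is only a sketch.

handled = []

def update_account(event):
    handled.append(("account", event["charge_id"]))

def detect_fraud(event):
    handled.append(("fraud_check", event["charge_id"]))

def send_push_notification(event):
    handled.append(("push", event["charge_id"]))

# Each subsystem subscribes to the same stream; none knows about the others.
subscribers = [update_account, detect_fraud, send_push_notification]

def publish(event):
    for handler in subscribers:  # a broker would deliver these in parallel
        handler(event)

publish({"charge_id": "ch_1", "account": "acct_9", "amount": 42.50})
```

Note that adding a new subsystem is just a matter of appending another handler to the subscriber list; the publisher and the other handlers are untouched, which is exactly the loose coupling described above.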
Each of the subsystems can evolve independently of each other. Depending on the structure of the organization managing these systems, each team can focus on doing their part to process the message as needed.
You may have noticed that in the credit card example above, we risk losing causal consistency if, for example, the fraud detection process identifies a fraudulent transaction and tries to reverse it before the account has been updated. A more robust approach would be to run fraud detection first, let it complete, and only then send out notifications and update the account. Alternatively, we could do it in reverse: run the fraud check after the account has been updated, and reverse the transaction in the tiny percentage of cases where it turns out to be fraudulent. Switching between these approaches is simply a matter of rearranging the stream processors and the sequence in which they are triggered.
This composability of processes connected by a messaging system is analogous to the Unix philosophy of pipes. Each process does one small, specific piece of work, and processes are composed into a particular sequence based on context and needs.
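The pipe-like composition above can be sketched as a chain of small stage functions. The stage names and the toy fraud rule are invented for illustration; the point is only that reordering the sequence means reordering a list:

```python
def check_fraud(event):
    # Toy rule standing in for a real fraud model.
    event["fraudulent"] = event["amount"] > 10000
    return event

def update_account(event):
    if not event.get("fraudulent"):
        event["account_updated"] = True
    return event

def notify(event):
    event["notified"] = True
    return event

def pipeline(event, stages):
    """Compose small processors in sequence, Unix-pipe style."""
    for stage in stages:
        event = stage(event)
    return event

# Fraud check first; reordering the list reorders the processing sequence.
result = pipeline({"amount": 50}, [check_fraud, update_account, notify])
```
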
Another great benefit of stream and log based systems is the ability to build solutions with a tight feedback loop. A stream processor is able to produce low-latency results in real time on a continuous basis, which is particularly useful for building control systems.
One such example is typically found in real-time automated bidding systems where the next bid is generated using algorithms designed to optimize for specific goals given the outcomes from the earlier bids.
Unbundling the Database
If you look closely, you will notice that a data flow based architecture resembles the inner workings of a relational database system. Derived data are analogous to materialized views, and stream processors are like triggers and stored procedures. Martin Kleppmann, who discusses this observation in his book Designing Data-Intensive Applications, calls this the “unbundling of the database”. Instead of data being confined to centralized relational tables, it is allowed to flow across the breadth of the organization.
That said, this phenomenon of unbundling and rebundling of technology is not new. The pendulum swing between these two extremes is just a part of the tech cycle.
The upside of this unbundling is that we have the flexibility to compose our data flow architecture from specialized components from various vendors as needed, and to implement new paradigms that are much more scalable than a relational database system. However, we do have to think through various issues related to fault tolerance and other complexities. It will therefore be no surprise when the pendulum eventually swings back toward a rebundling that embraces these new data flow based paradigms in a more integrated fashion. (Apache Flink is an example of a step in that direction.)
Data Integrity and Fault Tolerance
With a distributed architecture, fault tolerance is certainly top of mind. An important consideration for data flow architectures is ensuring atomic commits through exactly-once writes. That is, we want to reach the appropriate state even if the same message is inadvertently delivered twice. This is something we usually have to worry about for non-idempotent operations, for example counter increments.
Implementing exactly-once guarantees across replicas and partitions is a complex topic, but it turns out that stream processing based architectures can handle it quite well.
One approach is to partition the message stream using a hash of a unique field in each message. All messages belonging to a partition are routed to the same stream processor, which maintains a local record of the unique field values it has seen. The processor can then discard duplicates by checking whether it has already encountered a message with the same value of the unique field.
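A minimal sketch of this approach, with an in-memory seen-set standing in for the processor's durable local record (the `txn_id` field and partition count are assumptions for illustration):

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(unique_field):
    """Route a message to a partition by hashing its unique field."""
    digest = hashlib.sha256(unique_field.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

class DedupProcessor:
    """One processor per partition; keeps a local record of seen ids.
    A production system would persist this record, not keep it in memory."""
    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, message):
        key = message["txn_id"]
        if key in self.seen:  # duplicate delivery: discard
            return
        self.seen.add(key)
        self.processed.append(message)

processors = [DedupProcessor() for _ in range(NUM_PARTITIONS)]

def route(message):
    processors[partition_for(message["txn_id"])].handle(message)

route({"txn_id": "t1", "amount": 10})
route({"txn_id": "t1", "amount": 10})  # duplicate, discarded by its processor
```

Because hashing sends every copy of a given `txn_id` to the same partition, a single processor's local record is enough; no cross-processor coordination is needed to detect the duplicate.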
What if processors are down for a prolonged period of time? In theory, all pending messages can be processed in bulk once the processors come back up. However, at a high rate of messages per second, this is not practical. It is therefore common to maintain regular snapshots of the current state as messages are processed. That way, we only need to reprocess the messages received since the last snapshot to catch up to the current state.
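A toy sketch of snapshot-and-replay for a non-idempotent counter (the snapshot interval and field names are made up; real systems such as Flink checkpoint durably rather than in memory):

```python
class CountingProcessor:
    """Non-idempotent counter with periodic snapshots of its state."""
    def __init__(self, snapshot_every=3):
        self.count = 0
        self.offset = 0               # position in the message stream
        self.snapshot_every = snapshot_every
        self.snapshot = (0, 0)        # (offset, count) at last snapshot

    def process(self, _message):
        self.count += 1
        self.offset += 1
        if self.offset % self.snapshot_every == 0:
            self.snapshot = (self.offset, self.count)

    def recover(self, log):
        """After a crash, restore the snapshot and replay only the tail."""
        self.offset, self.count = self.snapshot
        for message in log[self.offset:]:
            self.count += 1
            self.offset += 1

log = [{"n": i} for i in range(10)]
p = CountingProcessor()
for m in log[:7]:
    p.process(m)
# Simulate a crash and recovery: only messages since the last snapshot
# (offset 6) are replayed, then the remaining tail of the log is consumed.
p.recover(log)
```

Restoring to the snapshot before replaying is what keeps the counter correct: messages processed after the snapshot are counted exactly once, even though they are delivered again during recovery.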
Data Quality, Integrity and Security
A consequence of data being available everywhere is that security becomes a bigger challenge. A centralized database makes it easy to set different types of access policies and privileges; we lose that when data flows outside the database. Fortunately, most messaging systems, such as Kafka, have built-in support for Access Control List (ACL) policies. You will, however, often have to configure the security policies for each component in your architecture separately.
It is common to use NoSQL data stores for data derived from stream processing. Derived data tends to be less rigid in its structure, which can be both a good and a bad thing. Less rigid means that we can structure data in the store as needed and are not bound by the structure of the source. For example, we can process and transform the data from a Kafka message into the specific key-value structure we are interested in. The downside is that we can end up with many different datasets whose field definitions aren’t very clear, which can lead to bugs, misuse or field duplication.
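For instance, a transform from a raw change event into a key-value shape might look like the sketch below. The message fields and key format are invented for illustration, and a real pipeline would validate against a schema rather than trust the JSON blindly:

```python
import json

# A raw change event as it might arrive from a message stream.
raw_message = json.dumps({
    "type": "profile_updated",
    "member_id": 1234,
    "fields": {"title": "Engineer", "location": "NYC"},
})

def to_kv(message_bytes):
    """Reshape a change event into the (key, value) form a store expects."""
    event = json.loads(message_bytes)
    key = f"member:{event['member_id']}"
    value = {
        "title": event["fields"]["title"],
        "location": event["fields"]["location"],
    }
    return key, value

key, value = to_kv(raw_message)
# A real processor would now write (key, value) to the derived data store.
```
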
To counter this, it is important to establish a common schema language and serialization framework, such as Avro, across the system. In addition, every schema needs to enforce strong documentation standards to ensure that no field is left ambiguous.
Log based data integration promises to improve data availability across the breadth of an organization. It creates the ability to democratize data access and analysis. With stream processing systems, we are able to work with a continuous flow of data in real time using composable components.
Most organizations will need to implement change data capture in their databases and instrument their applications to generate events in order to create data flow. Data everywhere means that the security and integrity of data become even more important.
Today these data flow architectures require us to bring together various systems from different vendors. In the future there is an opportunity for a more integrated approach. These architectures and associated paradigms are still quite new, so I expect to see a lot of innovation in the years to come.
Further Reading
The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps: This post has become a classic. Jay Kreps broadly explores the fundamentals of logs and data integration paradigms.
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann: One of my favorite books on building large scale data systems. This book is vendor agnostic and looks at the various principles of building data systems at a conceptual level.
Every Company is Becoming Software by Jay Kreps: This post explores the theme of using event streams as a central nervous system in your data infrastructure strategy. My own post was largely inspired by this.