How to Introduce Real-Time Data Predictions with Redpanda

In sectors that handle high volumes of data in real time, Redpanda Data Transforms can prepare data for machine learning on the fly.

Mar 22nd, 2024 8:18am by Christina Lin

Featured image by Robert Anasch on Unsplash.

In the world of machine learning, change is the only constant. The traditional reliance on large, batch-processed data sets is giving way to a more dynamic, real-time approach to data. This evolution is being driven by the understanding that being able to process and analyze data in real time is not just an advantage — it’s a necessity.

This is particularly true in sectors like the food delivery ecosystem, where customer expectations and business needs can switch at the drop of a hat. Here, streaming data engines emerge as key players transforming the landscape of data processing and machine learning.

The Predicament with Batch-Processed Data

Food delivery time prediction has traditionally relied on batch-processed data. This method, while somewhat effective, often leads to stale insights due to the latency between data collection and processing. The data variables typically include the delivery partner’s mode of transport, age, ratings and the crucial metric of distance between the restaurant and delivery location.

Enter Streaming Data: The Real-Time Revolution

In recent years, the food delivery industry experienced a tremendous spike in demand. This surge, partially driven by the pandemic, highlighted the painful limitations of batch-processed data models and underlined the need for real-time data processing. Real-time data processing allows immediate insights and adaptability — key components in an industry driven by time-sensitive customer expectations.

Streaming technologies like Apache Kafka bubbled up to solve the challenges created by the influx of real-time data. Kafka, known for its ability to handle high-throughput data streams, provides the backbone for real-time data ingestion and processing. However, Kafka’s architecture, while robust, often requires additional components for data transformation and processing.

Redpanda is a modern implementation of the Kafka API positioned as a more streamlined alternative to Kafka. It addresses some of Kafka’s complexities by providing a simpler setup and operational experience for developers.

For example, Redpanda Data Transforms is powered by WebAssembly (Wasm) and allows in-place data processing. This means data can be cleaned, transformed and prepared for machine learning models directly within the Redpanda broker, eliminating the need for additional data-processing layers.

Implementing Redpanda in Real-Time Predictive Models

To illustrate Redpanda’s role in machine learning (ML) applications that handle high volumes of data in real time, I’ll continue the example of a food delivery service.

Architecture of how Redpanda fits into a real-time delivery service powered by machine learning (Source: Redpanda)

In the “food delivery time” prediction model, Redpanda’s architecture involves these key components:

Data ingestion: Order and delivery data arrives from various sources and is often raw and unstructured, which presents the first challenge.
Instant data transformation: Once ingested, a custom-built Golang script uses Redpanda’s Wasm feature to process the data on the fly. This includes calculating the missing “distance” metric — a critical feature for this predictive model. This process exemplifies feature engineering in ML, where key data features are developed or transformed to enhance model accuracy. Redpanda’s real-time data transformation efficiency enables immediate and dynamic feature creation and modification.
ML model training with TensorFlow: The transformed data is then fed into an ML model built with TensorFlow. TensorFlow I/O facilitates the consumption of real-time data streams, allowing the model to be continuously updated with fresh data. However, initial training still requires a batch of historical data to establish a baseline.
Model deployment and inference: Once trained, the model is deployed for real-time inference. As new data streams in, the model dynamically adjusts its predictions, providing up-to-date delivery time estimates.
User-facing application: The final component is a user-facing application that uses the model’s predictions to provide customers and delivery partners with accurate, real-time delivery estimates.
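The transform in this article is written in Go, and its source is not shown here. As a rough illustration of the feature-engineering step only, the Python sketch below derives a distance feature from coordinates using the haversine formula. The formula choice, the JSON encoding and every field name (restaurant_lat, delivery_lon, distance_km and so on) are assumptions made for this example, not details taken from the original transform.

```python
import json
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def add_distance(raw: bytes) -> bytes:
    """Decode one JSON record, attach the derived distance, and re-encode it.

    The field names are hypothetical; a real record schema may differ.
    """
    record = json.loads(raw)
    record["distance_km"] = round(
        haversine_km(
            record["restaurant_lat"], record["restaurant_lon"],
            record["delivery_lat"], record["delivery_lon"],
        ),
        3,
    )
    return json.dumps(record).encode("utf-8")
```

A transform running in the broker would apply logic of this shape to each record read from the input topic and write the enriched record to the output topic.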

Set Up the Infrastructure

The following diagram illustrates the setup process, which involves several key steps.

Components of the proposed food delivery service infrastructure. (Source: Redpanda)

1. Simulate Data Streams

A Python script simulates the continuous flow of data, mimicking real-world scenarios of frequent order updates.
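The simulator script itself is not reproduced in this article, so the following is only a minimal sketch of what such a generator might look like. The record fields, value ranges and the send callback are all hypothetical; the callback keeps the example runnable without a broker.

```python
import json
import random
import time
from typing import Callable, Iterable, Iterator


def simulate_orders(count: int, seed: int = 42) -> Iterator[dict]:
    """Yield `count` synthetic order records (all fields are illustrative)."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same stream
    for order_id in range(1, count + 1):
        yield {
            "order_id": order_id,
            "partner_age": rng.randint(18, 60),
            "partner_rating": round(rng.uniform(3.0, 5.0), 1),
            "vehicle": rng.choice(["bicycle", "scooter", "motorcycle"]),
            "restaurant_lat": round(rng.uniform(12.90, 13.00), 5),
            "restaurant_lon": round(rng.uniform(77.50, 77.60), 5),
            "delivery_lat": round(rng.uniform(12.90, 13.00), 5),
            "delivery_lon": round(rng.uniform(77.50, 77.60), 5),
        }


def stream(orders: Iterable[dict], send: Callable[[bytes], None],
           delay_s: float = 0.0) -> int:
    """Serialize each order as JSON and pass it to `send`; return the count sent."""
    sent = 0
    for order in orders:
        send(json.dumps(order).encode("utf-8"))
        sent += 1
        if delay_s:
            time.sleep(delay_s)  # pace the stream like live order updates
    return sent

# Example wiring (assumes the kafka-python package, a broker on
# localhost:9092 and a topic named "orders", none of which come from the article):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   stream(simulate_orders(100), lambda p: producer.send("orders", value=p), 0.5)
```

Because Redpanda implements the Kafka API, any Kafka-compatible producer client could be used for the send callback.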

2. Configure the Cluster

A Redpanda cluster is set up to handle the data streams. This involves configuring the number of brokers and setting up Redpanda Console for monitoring.

3. Deploy Data Transformations

The Golang script for data transformation is deployed using Redpanda’s rpk transform deploy command. This ensures that the data transformation logic is applied uniformly across all broker nodes.

Data is processed in the broker of the partition it is sent to, and the result is written directly into memory. (Source: Redpanda)

Initialize the Redpanda Transforms project with rpk transform init.

Build the transform into a WebAssembly (Wasm) module with rpk transform build.

Deploy the module to the Redpanda cluster with rpk transform deploy. Redpanda distributes the deployed module across all brokers in the cluster, which is vital for load balancing and fault tolerance. Regardless of which broker manages a particular partition or topic, the transform logic is available locally to process the data. This reduces latency and increases efficiency because there’s no need to move data across the network for processing.

4. Train the TensorFlow Model

The TensorFlow model is trained using both historical batch data and real-time data streams, with TensorFlow I/O handling the streaming input. This hybrid approach helps ensure the model benefits from the depth of historical data while staying agile with real-time updates.
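To make the hybrid idea concrete without depending on TensorFlow, here is a toy stand-in: a one-feature linear model (delivery minutes as a function of distance) that is first fitted on a historical batch and then nudged by small streamed batches. The class, its learning rate and the single-feature setup are illustrative assumptions; the model in this article is a TensorFlow model, not this one.

```python
from typing import Sequence


class OnlineLinearModel:
    """Toy model: minutes ~= w * distance_km + b, updated by stochastic gradient descent."""

    def __init__(self, learning_rate: float = 0.01):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, distance_km: float) -> float:
        return self.w * distance_km + self.b

    def partial_fit(self, distances: Sequence[float], minutes: Sequence[float],
                    epochs: int = 1) -> None:
        """Run `epochs` passes over the given samples, updating after each sample."""
        for _ in range(epochs):
            for x, y in zip(distances, minutes):
                error = self.predict(x) - y
                # Gradient step on the squared error for this one sample
                self.w -= self.learning_rate * error * x
                self.b -= self.learning_rate * error


def mean_abs_error(model: OnlineLinearModel, distances: Sequence[float],
                   minutes: Sequence[float]) -> float:
    """Average absolute prediction error over a set of samples."""
    return sum(abs(model.predict(x) - y) for x, y in zip(distances, minutes)) / len(distances)
```

The intended usage is to call partial_fit with many epochs on the historical batch to establish the baseline, and then call it with a single epoch on each small batch that arrives from the stream.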

Wasm assists in preprocessing data into the desired format and prepares it for ML model training. (Source: Redpanda)

To stream data directly from Redpanda topics into a TensorFlow data set, configure the data set to ingest data from the “model data” topic on a Redpanda cluster. The main processing loop handles data in batches: It accumulates messages, and then shuffles and decodes them before using them for training. Subsequently, the model is trained for one epoch with each batch and then saved and exported.
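The loop described above can be sketched in plain Python as follows. This is not the article's TensorFlow code: the JSON message format, the field names and the train_one_epoch callback are assumptions chosen so the example runs without a broker or TensorFlow installed. In a TensorFlow setup, the callback would typically wrap a call such as model.fit with epochs=1, and saving and exporting the model would follow the loop.

```python
import json
import random
from typing import Callable, Iterable, List, Tuple

Features = List[float]


def decode(message: bytes) -> Tuple[Features, float]:
    """Turn one JSON message into (features, label). Field names are hypothetical."""
    record = json.loads(message)
    features = [record["distance_km"], record["partner_rating"]]
    return features, record["delivery_minutes"]


def train_on_stream(
    messages: Iterable[bytes],
    batch_size: int,
    train_one_epoch: Callable[[List[Features], List[float]], None],
    seed: int = 0,
) -> int:
    """Accumulate messages, then shuffle, decode and train once per full batch.

    Returns the number of batches trained. Messages left over in a final
    partial batch are not used in this sketch.
    """
    rng = random.Random(seed)
    buffer: List[bytes] = []
    batches = 0
    for message in messages:
        buffer.append(message)
        if len(buffer) < batch_size:
            continue  # keep accumulating until the batch is full
        rng.shuffle(buffer)
        decoded = [decode(m) for m in buffer]
        features = [f for f, _ in decoded]
        labels = [y for _, y in decoded]
        train_one_epoch(features, labels)  # one epoch per batch
        buffer.clear()
        batches += 1
    return batches
```

Shuffling within each batch keeps the arrival order of messages from biasing the updates, while the fixed batch size bounds how much data is held in memory at once.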

Advantages and Future Applications

Integrating Redpanda in predictive modeling offers several advantages:

Reduced latency: By processing data in real time, the latency between data collection and insight generation is significantly reduced.
Dynamic model updates: The continuous data flow allows the model to adapt and improve over time, leading to more accurate predictions.
Streamlined architecture: Performing data transformations within the broker reduces the need for additional data-processing layers, simplifying the overall architecture.

This approach, while demonstrated through the example of food delivery time prediction, has far-reaching implications. It can be applied to many sectors where real-time data analysis is crucial, such as financial markets, health-care monitoring and smart city management.

Modern streaming-data engines like Redpanda aren’t just transforming the way we handle data — they’re reshaping the future of real-time ML applications. As we continue to explore and innovate, the possibilities are as vast and exciting as the data streams we seek to harness.

Christina Lin is the Director of Developer Advocacy at Redpanda Data where she turns innovative data streaming solutions into easily accessible content for everyone to learn from. She has 20+ years of experience in software development and has worked as...