2024 Streaming Roadmap: Navigating the Real-Time Revolution

Overcoming batch-oriented architectures and embracing the advantages of streaming data are foundational to robust AI deployments.

Feb 21st, 2024 10:00am by Filip Yonov

Generative AI (GenAI) and large language models (LLMs) will reshape how we live, work and do business. As AI enables more natural human-machine interactions, companies leveraging these technologies must prioritize effective data management to gain a real competitive edge.

With research suggesting that generative AI could add trillions of dollars to the global economy, it is no surprise that in 2023 companies further expanded and solidified their AI and data investment strategies, and will continue to do so in the future.

Real-time data streaming is essential for realizing the promise of the AI-first enterprise. Businesses operate in the here and now: to deliver rich, personalized user experiences, AI-centered architectures must process data at scale and with low latency, which in practice means building on streaming technologies such as Apache Kafka and Apache Flink.

Therefore, overcoming batch-oriented architectures and embracing the advantages of streaming data are foundational steps toward robust AI deployments. This evolution, along with the rapid growth of machine learning (ML), is driving major market shifts, as highlighted by Forrester’s recognition of streaming data platforms as an emerging software category in their Q4 2023 report.

In 2024, the focus for those of us in streaming data will not be on AI alone. The Bring Your Own Cloud (BYOC) deployment model offers an efficient mechanism for scaling managed streaming services. Machine learning is coming to real-time environments via apps built on Apache Flink, while open source table formats like Apache Iceberg, Apache Hudi and Apache Paimon are simplifying ETL, positioning Kafka as the ingestion layer for the enterprise. In parallel, data mesh architectures and streaming governance are becoming business requirements, and are set to shape the best practices for moving organizations to natively real-time operations.

Embracing BYOC and Beyond: Flexibility and Cost Control in Streaming

As we move through 2024, the trend toward more accessible streaming data will become more pronounced. The Bring Your Own Cloud (BYOC) model is leading this charge, providing businesses with a cost-effective and flexible way to manage their streaming workloads while maximizing existing cloud commitments. But BYOC is just the beginning: a broader trend is reshaping the streaming landscape, as users demand solutions that are more cost-efficient and work seamlessly across multicloud environments.

A significant aspect of this trend is the separation of compute and storage. This change allows businesses to scale their streaming resources independently, resulting in more efficient utilization and cost savings. In traditional data streaming setups, compute and storage are tightly coupled, leading to inefficiencies and higher costs, especially when dealing with fluctuating workloads. Though some vendors have offered tiered storage for years, the true benefits of Kafka’s Tiered Storage (currently in preview) are yet to be realized at scale.
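To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a single topic with a steady ingress rate; the `TopicProfile` class, the `broker_disk_gb` helper and every number in it are hypothetical and chosen only for illustration. In Apache Kafka's tiered storage, the two retention windows correspond roughly to the `retention.ms` and `local.retention.ms` topic settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TopicProfile:
    """Hypothetical workload description; every figure is illustrative."""
    ingress_mb_per_s: float   # average produce throughput into the topic
    retention_hours: float    # total retention the business needs
    replication_factor: int   # copies kept on broker-attached disks


def broker_disk_gb(profile: TopicProfile,
                   local_retention_hours: Optional[float] = None) -> float:
    """Estimate broker-attached disk (GB) needed for one topic.

    Without tiering, the full retention window sits on broker disks.
    With tiering, only the local retention window does; older segments
    are assumed to live in object storage and are not counted here.
    """
    hours_on_disk = profile.retention_hours
    if local_retention_hours is not None:
        hours_on_disk = min(local_retention_hours, profile.retention_hours)
    megabytes = (profile.ingress_mb_per_s * 3600
                 * hours_on_disk * profile.replication_factor)
    return megabytes / 1000


# Assumed workload: 10 MB/s, 7 days of retention, 3 replicas.
profile = TopicProfile(ingress_mb_per_s=10, retention_hours=168, replication_factor=3)
coupled = broker_disk_gb(profile)                          # everything on brokers
tiered = broker_disk_gb(profile, local_retention_hours=6)  # 6 hours kept locally
print(f"coupled: {coupled:,.0f} GB, tiered: {tiered:,.0f} GB")
```

The sketch ignores object storage fees, request costs and fetch latency for older data, so it shows only how broker disk stops growing with retention once storage is decoupled, not a full cost comparison.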

In 2024, expect BYOC deployment capabilities to be further streamlined and automated. We’ll also witness a true separation of storage and compute, delivering new levels of elasticity and cost savings to streaming data workflows. Some innovative approaches are already emerging that use Amazon S3 directly as the storage layer for Kafka, sidestepping Kafka’s network-hungry design. Coupled with the low-latency object storage of Amazon S3 Express One Zone, this creates a powerful approach to cloud native, decoupled streaming, a concept that deserves a more in-depth exploration in a future blog post.

Open Table Formats — Leading the Real-Time and Batch Unification

I often get asked, “Why not use Kafka for everything?” While I recognize the power of real-time data, the true value of data lies beyond its flow: in its utility, integration and lifecycle management.

Open table formats are reshaping our approach to the data lake, enhancing its lifespan and utility, and laying the groundwork for advanced streaming use cases at scale. Streaming data will become a first-class citizen in the data lake, and streaming will become its default ingestion layer. In 2024, we will witness the first signs of the data utopia: real-time streaming in Kafka and historical data in object storage, yet always query-ready via an open table format like Iceberg, Hudi or Paimon.

Kafka is transcending its role as a transport layer, integrating tightly with cloud object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage) to empower long-term analysis. Projects like Apache Hudi and Apache Paimon, designed for transactional and streaming data lake architectures, position Kafka as a true source of truth for incremental processing. While Iceberg will undoubtedly lead in 2024, interoperability and cross-format compatibility are truly needed — OneTable, promising seamless interaction between major lakehouse formats, is a project to keep an eye on.

The hype around lakehouse formats is justified, but what’s the real-time connection? Streaming data gains strategic value when historical context is easily accessible. Imagine expanding your fraud detection ML algorithm’s attention span from mere minutes to a full year of data!
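As a toy illustration of why the longer attention span matters, the Python sketch below scores the same transaction against two baselines: a few minutes of recent stream data, and that same data combined with a longer history. The `is_suspicious` helper, the z-score rule and all amounts are assumptions made up for this example, not a real fraud model.

```python
from statistics import mean, pstdev


def is_suspicious(amount, history, threshold=3.0):
    """Flag `amount` if it sits more than `threshold` standard
    deviations above the mean of `history` (a simple z-score rule)."""
    if len(history) < 2:
        return False  # not enough context to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return amount > mu
    return (amount - mu) / sigma > threshold


# Illustrative data: a burst of large purchases in the last few minutes
# (as the stream alone would see them), and a long run of typical spending
# (as a table in the lake would hold it).
recent_minutes = [900, 950, 1000]
past_year = [20, 35, 25, 40, 30, 22, 38, 27, 33, 30] * 10

new_amount = 980
short_view = is_suspicious(new_amount, recent_minutes)
long_view = is_suspicious(new_amount, past_year + recent_minutes)
print(f"stream only: {short_view}, stream + history: {long_view}")
```

Judged against the recent burst alone, the new purchase looks ordinary; judged against the longer history, it stands out. That is the practical payoff of keeping historical data query-ready next to the stream.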

Transactional data lake architectures, powered by open table formats and streaming, deliver this powerful combination. Open table formats are a game-changer: By going beyond file formats like Parquet and integrating seamlessly with the ingestion layer, they enable businesses to unify real-time and batch data. This unification lays the groundwork for a genuinely differentiated competitive advantage in AI. This evolution in data management is not merely a procedural update; it is fundamental and will drive data transformation for years to come.

Apache Flink: Accelerating Real-Time Decision-Making

While 2023 saw major players introduce Flink-based managed services, adoption was held back by its perceived complexity and a lack of streamlined tooling. The challenge is that business users don’t work with streaming data directly. However, 2024 promises a major upgrade for Flink, opening it up to broader audiences like data scientists and business analysts. This will likely be led by frameworks like Apache Paimon that combine the power of stream processing with streamlined declarative ETL operations and lakehouse capabilities.

Flink’s rise mirrors the dominance of Apache Spark in batch data processing. Spark defined how businesses approach unstructured data in the lake, powering ML, business intelligence (BI) and reporting for human-centric decision-making. Now, as AI adoption surges, there’s a growing need for continuous processing of data streams to feed evolving AI models.

Flink fills that role, offering instant, on-the-fly computation at scale. This allows businesses to automate decisions based on data that is only milliseconds old. For example, TikTok uses Flink to refine its powerful recommendation engine in real time. Based on a user’s split-second actions (likes, skips, shares), Flink continuously updates recommendations, making the user’s feed steadily more relevant and turning real-time response into a competitive advantage.
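The pattern described here, state kept per user and updated on every event, can be sketched in a few lines of plain Python. This is not Flink code and not how any particular company implements recommendations; the `FeedState` class, the action weights and the decay factor are all hypothetical, and a stream processor would additionally handle partitioning, fault tolerance and scale.

```python
from collections import defaultdict

# Hypothetical signal weights; a real system would learn these.
ACTION_WEIGHTS = {"like": 1.0, "share": 2.0, "skip": -1.0}
DECAY = 0.9  # fraction of the previous score kept on each new event


class FeedState:
    """Per-user category scores, updated one event at a time."""

    def __init__(self):
        # user_id -> category -> score, loosely analogous to keyed state
        self._scores = defaultdict(lambda: defaultdict(float))

    def on_event(self, user_id, category, action):
        scores = self._scores[user_id]
        # Fade older signals so that recent actions dominate.
        for key in scores:
            scores[key] *= DECAY
        # Apply the new signal to the category the user just acted on.
        scores[category] += ACTION_WEIGHTS[action]

    def top_categories(self, user_id, n=3):
        scores = self._scores[user_id]
        return sorted(scores, key=scores.get, reverse=True)[:n]


state = FeedState()
events = [
    ("u1", "cooking", "like"),
    ("u1", "sports", "skip"),
    ("u1", "cooking", "share"),
    ("u1", "travel", "like"),
]
for user_id, category, action in events:
    state.on_event(user_id, category, action)

print(state.top_categories("u1", n=2))
```

Each event changes the ranking immediately, with no batch job in between, which is the property that makes continuous processing attractive for recommendation workloads.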

In an AI-driven world, speed isn’t a luxury; it’s a necessity. Flink lets machines make decisions in real time with unprecedented precision. As businesses seek to deliver hyper-personalized experiences, this shift from human-centric to machine-speed decision-making becomes essential. Flink isn’t just a tool; it’s the engine for a new era of AI-powered, real-time strategy. 2024 will see its adoption soar.

Data Mesh and Stream Governance: From Principles to Imperatives

At Aiven, we empower customers to adopt data mesh principles through robust governance tooling, self-service streaming, fine-grained access controls and our Terraform Provider. In 2024, enterprise investment in stream governance will become critical to ensuring the reliability, agility and availability of real-time data across applications. It’s a multi-faceted discipline: tracing data lineage, guaranteeing accuracy, enriching metadata and cataloging securely — all to make data more accessible and usable at speed and scale.

Implementing governance early translates to faster, more relevant data for business teams. It also reduces the noise from low-value data, minimizing storage costs and potential risks. As the value of governance becomes widely recognized, 2024 will see more companies proactively integrate streaming governance. Historically, only large enterprises could create reusable data assets, but advances in governance software are democratizing this capability.

The “data-as-product” strategy will go mainstream, boosting efficiency and driving innovation throughout the real-time data landscape. Multiple teams can benefit from shared access to the same data when building services and applications, but presenting that data securely, contextually and comprehensively to users who did not originate it is a challenge. The further data travels from its source, the more complex and costly it becomes to provide that context. Starting the governance process at the source is not only more cost-effective but also gives a better understanding of the data’s origin, value and meaning.

The integration of new data governance capabilities within products like cloud data warehouses, databases and other data infrastructure services is positioned to meet these evolving needs.

This means that developers no longer need to construct the infrastructure manually when creating and sharing reusable data products. This will greatly help the adoption of real-time data by the analytics and business layers of companies.

The Streaming Data Revolution

I’m optimistic about the potential of streaming data to transform businesses. At Aiven, we are committed to pushing the boundaries of data streaming technology and fostering a vibrant, open ecosystem. 2024 will see the solidification of streaming data as the indispensable backbone of the modern enterprise, playing a role just as vital as data lakes and warehouses in driving strategic decision-making.

Filip Yonov is the Director of Product Management at Aiven, where he oversees the Streaming Platform. Aiven Streaming Platform offers a comprehensive ecosystem that integrates best-of-breed streaming products like Apache Kafka and Apache Flink, deployed across multiple cloud environments. With...