Hybrid Data Collection from the IoT Edge with MQTT and Kafka

MQTT makes it easy to get data from distributed edge devices. MQTT accounts for connectivity issues, which is critical for distributed devices.

Dec 16th, 2022 1:25pm by Jason Myers

Featued image for: Hybrid Data Collection from the IoT Edge with MQTT and Kafka

Image via Pixabay.

Data collection is one of the biggest challenges when working with databases. If you can’t get data into the database, then nothing else works. While this is true for any use case, one area that can be particularly challenging is IoT. These use cases often involve a large number of systems, sensors and data sources, and they don’t always output data in the same formats or use the same protocols. When it comes to time series data, this challenge becomes even greater because of the sheer number of data sources, and the rates at which they generate data mean you might need to ingest millions of data points every second. Therefore, architecting reliable data ingestion is a critical step for IoT and industrial IoT use cases. A couple of options to consider for that are MQTT and Kafka.

MQTT

MQTT is a popular protocol used in many IoT settings, which relies on a publish/subscribe model. MQTT can generate topics on demand during payload delivery. If a topic already exists, MQTT sends the data to that topic. If the topic doesn’t exist, MQTT creates it. MQTT payloads are really flexible, which means you don’t need to define a strict schema for your data, but it also means that your subscribers need to be able to handle topics that fall outside the norm. In short, MQTT is easy to configure, reliable and extremely flexible.

Kafka

Kafka is an event-streaming platform also based on the publish/subscribe model with the added benefit of data persistence. For IoT use cases, Kafka also offers high throughput and availability, and it integrates well with many third-party systems. While Kafka is similar to MQTT in some key ways, it isn’t a cure-all for IoT architecture. This is because Kafka is built for stable networks that deploy good infrastructure. IoT devices typically run the gamut; a single system may have devices with significant resources and great connectivity, while other devices may have limited footprints and intermittent connectivity. Kafka also doesn’t deploy key data delivery features such as Keep-Alive and Last Will.

Hybrid Data Collection with MQTT and Kafka

The goal here is to create a data pipeline that leverages the benefits of each protocol to help mitigate the drawbacks of the other. If you want to play with the code for this project, you can grab it from the InfluxData Community GitHub repo. In this example we’ll simulate monitoring emergency fuel generators. We want to collect data from the generators, store it in InfluxDB, process and analyze that data, and then send that processed data on for use elsewhere. We could connect MQTT and Kafka directly to the generators, but having already highlighted the pros and cons of each, we want a more durable solution. From an architectural perspective, we can visualize the differences between these two protocols in this simulation.

Instead of connecting everything in this way, a more logical approach would be along the lines of the following:

Here, we collect data directly from our edge devices using a Kafka MQTT proxy, essentially a MQTT broker with Kafka functionality bolted on. This allows our MQTT clients to directly connect to it. Given this, we can amend the diagram above to better reflect what’s going on here.

The MQTT proxy sends the data to the Kafka cluster. We do this to take advantage of Kafka partitions, which provide consistent ordering of records and parallelism for performance purposes. This means we can create separate queues for data from individual or groups of devices. Suppose each generator produces 1,000 records per second and we have a microservice trying to consume those messages and perform some logic. We know the microservice can only handle 500 records a second so our consumer will never be able to achieve consistency with our producer. Using partitions allows us to spin up additional instances of our microservice, as needed, to consume a portion of those records in parallel to one another. This achieves a much higher throughput. We can also use partitions for topic replication. In this case, we can reserve partitions to hold copies of our records. Because these protocols can distribute our topics over many servers, this approach helps to prevent data loss due to outages. If one partition is unreachable, then our consumer will simply take records from the backup partition instead. Getting back to the example, we use the InfluxDB Sync connector to write the data from the Kafka partitions into InfluxDB, where we can use the Flux language to downsample and aggregate the data. Then we send the downsampled data back to Kafka using the InfluxDB Source connector for further distribution and consumption.

More than a Data Store

Let’s unpack that last part a bit. What we’ve got here isn’t just a simple data store. Instead, we’re using Flux to push data analysis work down into the database for greater efficiency.

As you can see in the diagram, we use InfluxDB Sync to write our raw data to a bucket called “`kafka_raw“` in InfluxDB. We’re going to use Flux to get data from the raw bucket, enrich it and write that data to the “`kafka_downsampled“` bucket. Our Flux script looks like this:

import "influxdata/influxdb/tasks"
 
option task = {name: "kafka_downsample", every: 1m, offset: 0s}
 
from(bucket: "kafka")
    |> range(start: tasks.lastSuccess(orTime: -1h))
    |> filter(fn: (r) => r["_measurement"] == "genData")
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> group(columns: ["generatorID"], mode: "by")
    |> last(column: "fuel")
    |> map(fn: (r) => ({r with alarm: if r.fuel < 500 then "refuel" else "no action"}))
    |> set(key: "region", value: "US")
    |> to(
        bucket: "kafka_downsampled",
        tagColumns: ["generatorID", "region"],
        fieldFn: (r) =>
            ({
                "alarm": r.alarm,
                "fuel": r.fuel,
                "lat": r.lat,
                "lon": r.lon,
                "power": r.power,
                "load": r.load,
                "temperature": r.temperature,
            }),
    )

So what does all this mean? Here’s what’s happening.

The range() + tasks.lastSuccess function collects the data from the “`kafka“` bucket since the last time the task ran. It defaults to a static value of the past hour (-1h) if this is the first time the task runs.
Next, we filter the data so we only return values for the measurement genData.
We use Pivot() to shift vertically stored values into a horizontal format. This makes it more like working with SQL databases, and the horizontal format is important for the map function.
We group() by our generatorID column. This separates each generator’s data into its own table.
Last() selects and returns the last row of each table.
We then use map(), along with some conditional logic, to check our current fuel level and to create a new column called alarm. The function fills the alarms value based on the conditional logic.
set() allows us to create a new column and manually fill each row with the same value.
The to() function transfers our data to kafka_downsampled. We provide some mapping logic to define which columns are fields and tags.

To create this task in the InfluxDB UI, click the CREATE TASK button, name your task and set how often you want it to run. Then you can write your task script in the task window.

Benefits

This architecture design accomplishes several things. Leveraging MQTT makes it easy to get data from distributed edge devices. It’s lightweight, so it can go virtually anywhere, and it’s built to handle thousands of connections, so the volume and velocity of IoT data aren’t an issue. Plus, MQTT accounts for connectivity issues, which is critical for distributed devices. In other words, it’s flexible for bridging the initial gap between the physical and digital worlds. Leveraging Kafka allows us to organize that data as it moves through the data pipeline. Kafka provides high availability, so we know that whenever our edge devices have connectivity, they’ll be able to reach Kafka. Partitioning increases data throughput, so we can ensure that all that IoT data actually makes it to the data store, with the help of Kafka’s enterprise-level connectors. InfluxDB enhances the entire data pipeline by providing both data transformation capabilities and storage. To learn more about the code for this example and the journey to creating this hybrid architecture, check out this series.

Jason Myers earned a PhD in modern Irish history from Loyola University Chicago. Since then, he has used the writing skills he developed in his academic work to create content for a range of startup and technology companies. When he's...