Data collection is one of the biggest challenges when working with databases. If you can’t get data into the database, then nothing else works. While this is true for any use case, one area that can be particularly challenging is IoT. These use cases often involve a large number of systems, sensors and data sources, and they don’t always output data in the same formats or use the same protocols.
When it comes to time series data, this challenge becomes even greater: the sheer number of data sources, combined with the rates at which they generate data, means you might need to ingest millions of data points every second. Therefore, architecting reliable data ingestion is a critical step for IoT and industrial IoT use cases. Two options to consider are MQTT and Kafka.
MQTT
MQTT is a popular protocol used in many IoT settings, which relies on a publish/subscribe model. MQTT can generate topics on demand during payload delivery. If a topic already exists, MQTT sends the data to that topic. If the topic doesn’t exist, MQTT creates it.
MQTT payloads are really flexible, which means you don’t need to define a strict schema for your data, but it also means that your subscribers need to be able to handle topics that fall outside the norm. In short, MQTT is easy to configure, reliable and extremely flexible.
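To make the publish/subscribe model and on-demand topics concrete, here is a minimal in-memory sketch. It is not a real MQTT client or broker: the ToyBroker class, the topic names and the payloads are all hypothetical, and it only illustrates that a topic comes into existence the first time something references it and that payloads carry no enforced schema.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[str, bytes], None]


class ToyBroker:
    """In-memory stand-in for an MQTT broker (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # A topic is created the first time it is referenced.
        self.subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver the payload to every subscriber of the topic; return the delivery count."""
        handlers = self.subscribers[topic]  # creates the topic if it is new
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


broker = ToyBroker()
received = []
broker.subscribe("generators/gen1/fuel", lambda t, p: received.append((t, p)))

# Existing topic: the payload reaches the subscriber.
broker.publish("generators/gen1/fuel", b'{"fuel": 72}')
# New topic with a differently shaped payload: accepted, no schema check.
broker.publish("generators/gen2/fuel", b"72 percent")
```

Because nothing validates the second payload, any subscriber that later joins that topic has to cope with whatever shape the data arrives in.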
Kafka
Kafka is an event-streaming platform also based on the publish/subscribe model with the added benefit of data persistence. For IoT use cases, Kafka also offers high throughput and availability, and it integrates well with many third-party systems.
While Kafka is similar to MQTT in some key ways, it isn’t a cure-all for IoT architecture. This is because Kafka is built for stable networks with solid infrastructure. IoT devices typically run the gamut; a single system may have devices with significant resources and great connectivity, while other devices may have limited footprints and intermittent connectivity. Kafka also doesn’t provide key data delivery features such as Keep-Alive and Last Will.
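For readers unfamiliar with those two MQTT features, the sketch below models them in plain Python. In MQTT, a client declares a keep-alive interval and an optional "will" message when it connects; if the broker hears nothing for one and a half times that interval, it treats the client as gone and publishes the will on its behalf. The Session class and check_sessions function are hypothetical names for this toy model, not part of any MQTT library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Session:
    client_id: str
    keep_alive_s: float   # interval the client promised to stay within
    will_topic: str       # where to announce an unexpected disconnect
    will_payload: bytes
    last_seen_s: float    # timestamp of the last packet from the client


def check_sessions(
    sessions: List[Session], now_s: float
) -> Tuple[List[Session], List[Tuple[str, bytes]]]:
    """Split sessions into live ones and the will messages to publish."""
    alive: List[Session] = []
    wills: List[Tuple[str, bytes]] = []
    for s in sessions:
        # The broker allows 1.5x the keep-alive interval before giving up.
        if now_s - s.last_seen_s > 1.5 * s.keep_alive_s:
            wills.append((s.will_topic, s.will_payload))
        else:
            alive.append(s)
    return alive, wills


sessions = [
    Session("gen1", 60, "generators/gen1/status", b"offline", last_seen_s=100),
    Session("gen2", 60, "generators/gen2/status", b"offline", last_seen_s=0),
]
alive, wills = check_sessions(sessions, now_s=100)
```

Here gen2 has been silent for 100 seconds against a 90-second allowance, so its will is published while gen1 stays connected.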
InfluxData is the creator of InfluxDB, the leading time series platform. More than 1,900 customers use InfluxDB to collect, store, and analyze all time series data at any scale. Developers can query and analyze their time-stamped data to predict, respond, and adapt in real-time.
Hybrid Data Collection with MQTT and Kafka
The goal here is to create a data pipeline that leverages the benefits of each protocol to help mitigate the drawbacks of the other. If you want to play with the code for this project, you can grab it from the InfluxData Community GitHub repo.
In this example we’ll simulate monitoring emergency fuel generators. We want to collect data from the generators, store it in InfluxDB, process and analyze that data, and then send that processed data on for use elsewhere.
We could connect MQTT and Kafka directly to the generators, but having already highlighted the pros and cons of each, we want a more durable solution. From an architectural perspective, we can visualize the differences between these two protocols in this simulation.
Instead of connecting everything in this way, a more logical approach would be along the lines of the following:
Here, we collect data directly from our edge devices using a Kafka MQTT proxy, essentially an MQTT broker with Kafka functionality bolted on. This allows our MQTT clients to connect to it directly. Given this, we can amend the diagram above to better reflect what’s going on here.
The MQTT proxy sends the data to the Kafka cluster. We do this to take advantage of Kafka partitions, which provide consistent ordering of records and parallelism for performance purposes. This means we can create separate queues for data from individual or groups of devices.
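The ordering guarantee comes from keyed partitioning: records that share a key always land in the same partition, and each partition preserves arrival order. The sketch below illustrates the idea under stated assumptions. Kafka's default partitioner hashes the key with murmur2; zlib.crc32 is swapped in here as a stdlib stand-in, and the partition count, generator IDs and sequence numbers are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 3  # hypothetical topic size


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition (crc32 stands in for Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# (generator ID, sequence number) pairs in the order they were produced.
records: List[Tuple[str, int]] = [
    ("generator1", 1), ("generator2", 1), ("generator1", 2),
    ("generator2", 2), ("generator1", 3),
]

partitions: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
for generator_id, seq in records:
    # Keying by generator ID keeps each generator's records together, in order.
    partitions[partition_for(generator_id)].append((generator_id, seq))
```

Using the generator ID as the key gives each device (or group of devices that hash together) its own ordered queue, which is what lets consumers work on partitions in parallel without reordering any single generator's data.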
Suppose each generator produces 1,000 records per second and we have a microservice trying to consume those messages and perform some logic. We know the microservice can only handle 500 records a second, so a single consumer will never be able to keep up with our producer. Using partitions allows us to spin up additional instances of our microservice, as needed, each consuming a portion of those records in parallel. This achieves a much higher throughput.
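The arithmetic behind that example can be sketched in a few lines. The function names are hypothetical; the rates are the ones from the paragraph above. Note that within a Kafka consumer group each partition is read by at most one consumer, so the topic needs at least as many partitions as the number of instances computed here.

```python
import math


def consumers_needed(produce_rate: int, consume_rate: int) -> int:
    """Smallest number of parallel consumers that keeps up with the producer."""
    return math.ceil(produce_rate / consume_rate)


def backlog_after(seconds: int, produce_rate: int, consume_rate: int, consumers: int) -> int:
    """Unprocessed records after `seconds`, given records/second rates."""
    deficit_per_second = produce_rate - consume_rate * consumers
    return max(0, deficit_per_second * seconds)


needed = consumers_needed(1000, 500)              # 2 instances
one_minute_lag = backlog_after(60, 1000, 500, 1)  # 30000 records behind
no_lag = backlog_after(60, 1000, 500, needed)     # 0 records behind
```

With one consumer the backlog grows by 500 records every second; with two, consumption matches production.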
Kafka can also replicate partitions. In this case, copies of each partition are kept on other brokers. Because Kafka distributes topics over many servers, this approach helps to prevent data loss due to outages. If the broker hosting a partition becomes unreachable, a replica on another broker takes over and our consumer continues reading from it.
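A toy model of that failover, assuming copies of a partition live on several servers: one copy serves reads and writes, and when its server goes down another copy takes over. The PartitionState class and elect_leader function are hypothetical and heavily simplified; in a real Kafka cluster the new leader is chosen from the in-sync replicas rather than by a simple list scan.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class PartitionState:
    replicas: List[int]  # broker IDs holding a copy, in preference order
    leader: int          # broker currently serving reads and writes


def elect_leader(state: PartitionState, live_brokers: Set[int]) -> Optional[int]:
    """Keep the current leader if it is alive, else promote the first live replica."""
    if state.leader in live_brokers:
        return state.leader
    for broker_id in state.replicas:
        if broker_id in live_brokers:
            return broker_id
    return None  # every copy is unreachable; the partition is offline


state = PartitionState(replicas=[1, 2, 3], leader=1)
healthy = elect_leader(state, live_brokers={1, 2, 3})    # broker 1 stays leader
after_outage = elect_leader(state, live_brokers={2, 3})  # broker 2 takes over
total_outage = elect_leader(state, live_brokers=set())   # None
```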
Getting back to the example, we use the InfluxDB Sink connector to write the data from the Kafka partitions into InfluxDB, where we can use the Flux language to downsample and aggregate the data. Then we send the downsampled data back to Kafka using the InfluxDB Source connector for further distribution and consumption.
More than a Data Store
Let’s unpack that last part a bit. What we’ve got here isn’t just a simple data store. Instead, we’re using Flux to push data analysis work down into the database for greater efficiency.
As you can see in the diagram, we use the InfluxDB Sink connector to write our raw data to a bucket called “kafka_raw” in InfluxDB. We’re going to use Flux to get data from the raw bucket, enrich it and write that data to the “kafka_downsampled” bucket.
Our Flux script looks like this:
So what does all this mean? Here’s what’s happening.
The range() function, combined with tasks.lastSuccess(), collects the data written to the “kafka” bucket since the last time the task ran. It falls back to a static value of the past hour (-1h) if this is the first time the task runs.
Next, we filter the data so we only return values for the measurement genData.
We use pivot() to shift vertically stored values into a horizontal format. This makes it more like working with SQL databases, and the horizontal format is important for the map() function.
We group() by our generatorID column. This separates each generator’s data into its own table.
last() selects and returns the last row of each table.
We then use map(), along with some conditional logic, to check our current fuel level and to create a new column called alarm. The function fills the alarm value based on the conditional logic.
set() allows us to create a new column and manually fill each row with the same value.
The to() function transfers our data to kafka_downsampled. We provide some mapping logic to define which columns are fields and tags.
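For readers who want to trace the data shapes, the steps above can be mimicked in plain Python. This is not the Flux task itself and not the author's code: the field names (fuel, temperature), the low-fuel threshold, the constant added by the set() step and the sample points are all hypothetical, and the final write to the downsampled bucket is left out.

```python
from collections import defaultdict
from typing import Dict, List, Optional

LOW_FUEL = 25  # hypothetical alarm threshold


def range_start(last_success: Optional[int], now: int) -> int:
    """Start of the query window: last successful run, else one hour ago."""
    return last_success if last_success is not None else now - 3600


def downsample(points: List[dict], start: int = 0) -> List[dict]:
    # range() + filter(): keep recent points for the genData measurement.
    rows = [p for p in points if p["time"] >= start and p["measurement"] == "genData"]
    # pivot(): one wide row per (generatorID, time), with fields as columns.
    wide: Dict[tuple, dict] = {}
    for p in rows:
        key = (p["generatorID"], p["time"])
        row = wide.setdefault(key, {"generatorID": p["generatorID"], "time": p["time"]})
        row[p["field"]] = p["value"]
    # group(): one table per generator.
    tables: Dict[str, List[dict]] = defaultdict(list)
    for row in wide.values():
        tables[row["generatorID"]].append(row)
    out = []
    for generator_id in sorted(tables):
        # last(): most recent row of each table.
        last = max(tables[generator_id], key=lambda r: r["time"])
        # map(): derive the alarm column from the fuel level.
        enriched = dict(last, alarm=last["fuel"] < LOW_FUEL)
        # set(): add a column with the same value on every row.
        enriched["source"] = "downsample_task"
        out.append(enriched)
    return out


points = [
    {"time": 1, "measurement": "genData", "generatorID": "generator1", "field": "fuel", "value": 80},
    {"time": 2, "measurement": "genData", "generatorID": "generator1", "field": "fuel", "value": 20},
    {"time": 2, "measurement": "genData", "generatorID": "generator1", "field": "temperature", "value": 65},
    {"time": 1, "measurement": "genData", "generatorID": "generator2", "field": "fuel", "value": 90},
    {"time": 1, "measurement": "other", "generatorID": "generator2", "field": "fuel", "value": 5},
]
result = downsample(points)
```

Each generator ends up with a single row holding its latest readings plus the alarm flag, which is the shape the downsampled bucket receives.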
To create this task in the InfluxDB UI, click the CREATE TASK button, name your task and set how often you want it to run. Then you can write your task script in the task window.
Benefits
This architecture design accomplishes several things. Leveraging MQTT makes it easy to get data from distributed edge devices. It’s lightweight, so it can go virtually anywhere, and it’s built to handle thousands of connections, so the volume and velocity of IoT data aren’t an issue. Plus, MQTT accounts for connectivity issues, which is critical for distributed devices. In other words, it’s flexible for bridging the initial gap between the physical and digital worlds.
Leveraging Kafka allows us to organize that data as it moves through the data pipeline. Kafka provides high availability, so we know that whenever our edge devices have connectivity, they’ll be able to reach Kafka. Partitioning increases data throughput, so we can ensure that all that IoT data actually makes it to the data store, with the help of Kafka’s enterprise-level connectors.
InfluxDB enhances the entire data pipeline by providing both data transformation capabilities and storage.
To learn more about the code for this example and the journey to creating this hybrid architecture, check out this series.