To the uninitiated or unfamiliar, time series data exhibits similar characteristics to relational data, but the two data types have some critical differences. Relational data’s main objective is to maintain an accurate representation of the current state of the world, with respect to its objects and the relationships between them. Time series data tells the story of what’s happening in the world right now.
For example, think about the real-time insights and immediate signal/anomaly detection that DevOps engineers need. You can use the constant stream of observations to detect patterns, filter out noise and surface unexpected behavior that signals a security threat. Time series data makes these insights possible. Sure, time series data can fit into the row/table format, but it’s better suited for a columnar database with the timestamp as its primary key.
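As a rough illustration of anomaly detection over a stream of observations, here is a minimal sketch that flags a point when it sits far from the mean of the preceding few points. The function name, the window size, the threshold and the sample CPU readings are all assumptions made for this example; production systems typically use more robust methods.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(points, window=5, threshold=3.0):
    """Flag points that deviate strongly from the trailing window.

    points: iterable of (timestamp, value) pairs ordered by time.
    Returns the list of (timestamp, value) pairs flagged as anomalous.
    """
    recent = deque(maxlen=window)  # trailing values only; older ones fall off
    anomalies = []
    for ts, value in points:
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip the check for a flat window, where sigma is zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append((ts, value))
        recent.append(value)
    return anomalies


# Hypothetical CPU readings, one per second, with a spike at t=7.
readings = [(0, 41.0), (1, 42.0), (2, 40.0), (3, 41.5), (4, 42.5),
            (5, 41.0), (6, 40.5), (7, 95.0), (8, 41.5)]
spikes = detect_anomalies(readings)
print(spikes)
```

Note that the check only ever reads past points and appends new ones. Nothing is updated in place, which matches the append-only nature of a series.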
Relational Data vs. Time Series Data
As the name implies, relational data is data that illustrates a relationship. The purpose of relational data is to maintain accurate records of objects and their relationships to each other. Relational data is transactional and updated frequently to maintain accuracy.
The purpose of time series data is to provide insight for analysis and summarization. A series is a stream of observations, so by nature the data points are related by source of origin, but the data points are immutable because the past cannot change. While a single point might not be useful, the series as a whole reveals how the source changes over time.
InfluxData is the creator of InfluxDB, the leading time series platform. More than 1,900 customers use InfluxDB to collect, store, and analyze all time series data at any scale. Developers can query and analyze their time-stamped data to predict, respond, and adapt in real-time.
Relational Databases Are Built for Relational Data
It might seem obvious, but relational databases are built for relational data. Time series data characteristics and workloads are very different, so a relational database is a poor fit for them.
Relational databases can’t handle the ingestion speeds of time series at scale. Because this is a problem related to scale, it only surfaces at scale. As a result, a lot of people start using a relational database for time series and end up having to do more work once they reach a scaling inflection point.
For every origin source stored in a relational database, an estimated 10 times more storage space is needed for its associated time series data. Relational databases aren’t built for this type of growth profile, nor are the features of relational databases needed for this type of data.
One example is that time series workloads prioritize low latency between writes and reads over safeguards such as database backups. When a relational database workload reaches the scalability tipping point, write speeds slow down as the database backs up as a safety precaution. The higher latencies impede automated systems’ ability to act immediately on any irregularities.
Another challenge with relational databases is their lack of flexibility because of explicit schema requirements. The database must undergo a labor-intensive migration whenever you need to update the schema. This is a risky undertaking because it is possible to lose or corrupt data no matter how careful developers are during the process.
Time Series Databases Are Built for Time Series Data
InfluxDB is a purpose-built time series database, delivered via cloud, on premises and open source. It is designed to meet the needs of time series data. In terms of scaling, in InfluxData’s internal benchmarking, InfluxDB ingests orders of magnitude more data per second using significantly less CPU and memory than other databases, even those that claim to be tuned for time series.
InfluxDB is “schema on write,” meaning developers can add new dimensions and fields by simply adding them to their writes. There are no change requirements to any production or development databases. This offers flexibility for workloads with changing data shapes.
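To make the idea concrete, here is a minimal sketch that formats points as InfluxDB line protocol, the text format used for writes. The helper name to_line_protocol, the measurement cpu and the tag and field names are hypothetical, and the helper skips the escaping that real keys and values may need. The second point carries a tag and a field that the first one lacks; it is simply written with them, with no migration step in between.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as a line of InfluxDB line protocol.

    Simplified: assumes names and tag values contain no spaces, commas or
    equals signs, so no escaping is performed.
    """
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_items = []
    for key, value in fields.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            field_items.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, int):
            field_items.append(f"{key}={value}i")  # integer fields take an "i" suffix
        elif isinstance(value, float):
            field_items.append(f"{key}={value}")
        else:
            field_items.append(f'{key}="{value}"')  # string fields are double-quoted
    return f"{measurement}{tag_part} {','.join(field_items)} {timestamp_ns}"


# First write: one tag, one field.
first = to_line_protocol(
    "cpu", {"host": "server01"}, {"usage_user": 64.2}, 1700000000000000000
)

# Later write: a new tag (region) and a new field (temp_c) appear in the data.
second = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "us-west"},
    {"usage_user": 61.8, "temp_c": 71},
    1700000001000000000,
)

print(first)
print(second)
```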
Apache Arrow for Time Series
Time series is all about understanding the current picture of the world and offering immediate insight and action. Relational databases can perform basic data manipulation, but they can’t execute advanced calculations and analytics on multiple observations.
Because time series data workloads are so large, they need a database that can work with large datasets easily. Apache Arrow is specifically designed to move large amounts of columnar data. Building a database on Arrow gives developers more options to operate on their data effectively, from advanced analysis with data tools such as pandas to machine learning and artificial intelligence tooling.
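The difference between row-oriented and column-oriented layouts can be sketched in plain Python. This is not the Arrow API; it only illustrates the layout idea, using the standard library's array module as a stand-in for a contiguous, typed column buffer. The column names and values are made up for the example.

```python
from array import array

# Row-oriented: one record per observation, as a relational table stores it.
rows = [
    {"time": 1, "host": "a", "cpu": 40.0},
    {"time": 2, "host": "a", "cpu": 42.0},
    {"time": 3, "host": "b", "cpu": 41.0},
    {"time": 4, "host": "b", "cpu": 45.0},
]

# Column-oriented: one sequence per column; numeric columns are typed and contiguous.
columns = {
    "time": array("q", (r["time"] for r in rows)),  # signed 64-bit integers
    "host": [r["host"] for r in rows],
    "cpu": array("d", (r["cpu"] for r in rows)),    # 64-bit floats
}


def mean_cpu_rows(rows):
    """Touches every record, including the fields it does not need."""
    return sum(r["cpu"] for r in rows) / len(rows)


def mean_cpu_columns(columns):
    """Scans a single contiguous column and ignores the others."""
    cpu = columns["cpu"]
    return sum(cpu) / len(cpu)


print(mean_cpu_rows(rows), mean_cpu_columns(columns))
```

Both functions return the same answer. The columnar version reads only the one column the aggregate needs, which is the access pattern that analytics over long series tends to favor.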
Some may be tempted to simply use Arrow as an external tool alongside a current solution. However, this approach isn’t workable because if the database doesn’t return data in Arrow format right from the source, the production application will struggle to ensure there’s enough memory to work with large datasets. The client code will also lack the compression Arrow provides. Transferring the poorly compressed bytes across the wire increases latencies between the database and the code, which negatively affects overall performance.
Shrinking the Learning Curve
Building InfluxDB on the Apache ecosystem created an opportunity to add SQL support into the time series database. InfluxDB uses DataFusion as its query engine, and DataFusion uses SQL as the query language, meaning anyone who knows SQL can now query time series. There’s no additional language requirement.
To further enhance ease of access, there are already three time series-specific functions in DataFusion. These are all open source, so anyone within the Apache Arrow community can benefit from or contribute to them.
· date_bin() – Creates rows that are time windows of data with an aggregate.
· selector_first(), selector_last() – Provide the first or last row of a table that meet specific criteria.
· time_bucket_gapfill() – Returns windowed data, and if there are windows that lack data it will fill those gaps.
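To show what these operations compute, here is a minimal sketch that emulates them in plain Python. It does not call InfluxDB or DataFusion; the helper names, the integer timestamps and the 10-unit window are assumptions chosen for illustration, and the real functions may differ in details such as argument order and return shape.

```python
from collections import defaultdict


def bin_start(ts, width, origin=0):
    """Start of the fixed-width window containing ts (the idea behind date_bin)."""
    return origin + ((ts - origin) // width) * width


def windowed_mean(points, width):
    """Mean value per window, like grouping by a binned timestamp with an average."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[bin_start(ts, width)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def last_in_window(points, width):
    """Latest (ts, value) per window, in the spirit of a 'last' selector."""
    latest = {}
    for ts, value in points:
        start = bin_start(ts, width)
        if start not in latest or ts >= latest[start][0]:
            latest[start] = (ts, value)
    return dict(sorted(latest.items()))


def gapfill(windows, width, start, stop, fill=None):
    """Emit every window in [start, stop), filling the ones that have no data."""
    return {t: windows.get(t, fill) for t in range(start, stop, width)}


# Hypothetical observations as (timestamp, value); no data falls in window 20-29.
points = [(0, 10.0), (4, 14.0), (12, 20.0), (31, 8.0), (38, 12.0)]

means = windowed_mean(points, 10)
lasts = last_in_window(points, 10)
filled = gapfill(means, 10, 0, 40)
print(means, lasts, filled)
```

In this sample, the window starting at 20 is absent from the windowed means and appears in the gap-filled result with the fill value.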
Conclusion
Time series data has different characteristics, storage requirements and workloads than relational data. Because the data types appear similar, it’s important to be aware of these differences early in the process. The later into production these issues are identified, the harder they are to solve.
Time series data works best with a time series database like InfluxDB to account for low latency at high ingestion rates, the flexibility of schema on write data collection and advanced data analysis. Native SQL support in InfluxDB makes time series data workloads more accessible to SQL users.
You can avoid or fix any of the pitfalls outlined above by simply adding a time series database to your tech stack.