The Architect’s Guide to Data and File FormatsÂ
Data Formats for Microservices
Before diving deep into the data formats for efficient storage and retrieval in a data lake, let’s look at the layouts not so relevant to the data lake. Data formats like Protobuf, Thrift and MessagePack are more relevant to the intercommunication between microservices.Protocol Buffers
Google’s Protocol Buffers (also known as Protobuf) is a language- and platform-agnostic data serialization format used for efficiently encoding structured data for transmission over a network or for storing it in a file. It is designed to be extensible, allowing users to define their own data types and structures, and provides a compact binary representation that can be efficiently transmitted and stored. Protocol Buffers are used in various Google APIs and are supported in many programming languages, including C++, Python and Java. Protocol Buffers are often used in situations where data needs to be transmitted over a network or stored in a compact format, such as in network communication protocols, data storage and data integration.Implementation Overview
gRPC is a modern, open source, high-performance RPC (remote procedure call) framework that can run in any environment. It enables interactions between client and server applications’ methods directly, similar to a method call in object-oriented programming. gRPC is built on top of HTTP/2 as a transport protocol and ProtoBuf framework for encoding request and response messages. gRPC is designed to be efficient, with support for bidirectional streaming and low-latency communication, and can run in any environment. It is used in various Google APIs and is supported in many programming languages, including C++, Python and Java. gRPC is used in a variety of contexts and applications where low latency, high performance and efficient communication are important. It is often used for connecting polyglot systems, such as connecting mobile devices, web clients and backend services. It is also used in microservices architectures to connect services and to build scalable, distributed systems. In addition, gRPC is used in IoT (Internet of Things) applications, real-time messaging and streaming, and for connecting cloud services.
Source: https://developers.google.com/protocol-buffers
Thrift
Another data format like Protobuf is Thrift. Thrift is an interface definition language (IDL) and communication protocol that allows for the development of scalable services that can be used across multiple programming languages. It is similar to other IDLs such as CORBA and Google’s Protocol Buffers, but is designed to be lightweight and easy to use. Thrift uses a code-generation approach, where code is generated for the desired programming language based on the IDL definitions, allowing developers to easily build client and server applications that can communicate with each other using Thrift’s binary communication protocol. Thrift is used in a variety of applications, including distributed systems, microservices and message-oriented middleware.Languages Supported
Apache Thrift supports a wide range of programming languages, including functional languages like Erlang and Haskell. This allows you to define a service in one language, then use Thrift to generate the necessary code to implement the service in a different language, as well as client libraries that can be used to call the service from yet another language. This makes it possible to build interconnected systems that use a variety of programming languages.Implementation Overview
IDL compiler-generated code creates client and server stubs that underneath the hood would interact between the two via native protocols and transport layer, thus enabling RPC between the processes.
Source: https://thrift.apache.org/
MessagePack
MessagePack is a data serialization format that provides a compact binary representation of structured data. It is designed to be more efficient and faster than other serialization formats, such as JSON, by using a binary representation of data instead of a text-based one. MessagePack can be used in a variety of applications, including distributed systems, microservices and data storage. It is supported in many programming languages, including C++, Python and Java, and is often used in situations where data needs to be transmitted over a network or stored in a compact format. In addition to its efficiency and speed, MessagePack is also designed to be extensible, allowing users to define their own data types and structures.Languages Supported
MessagePack supports many programming languages, mainly attributable to its simplicity. See the list of implementations on its portal.Implementation Overview
We at MinIO decided on MessagePack as our serialization format. It keeps extensibility with JSON by allowing adding/removing of keys. The initial implementation is a header followed by a MessagePack object with the following structure:
{
"Versions": [
{
"Type": 0, // Type of version, object with data or delete marker.
"V1Obj": { /* object data converted from previous versions */ },
"V2Obj": {
"VersionID": "", // Version ID for delete marker
"ModTime": "", // Object delete marker modified time
"PartNumbers": 0, // Part Numbers
"PartETags": [], // Part ETags
"MetaSys": {} // Custom metadata fields.
// More metadata
},
"DelObj": {
"VersionID": "", // Version ID for delete marker
"ModTime": "", // Object delete marker modified time
"MetaSys": {} // Delete marker metadata
}
}
]
}
- Signature with version
- A version of Header Data (integer)
- A version of Metadata (Integer)
- Version Count (integer)

The overall structure of xl.meta
Data Formats for Streaming
Avro
Apache Avro is a data serialization system that provides a compact and fast binary representation of data structures. It is designed to be extensible, allowing users to define their own data types and structures, and provides support for both dynamically and statically typed languages. Avro is often used in the Hadoop ecosystem and is used in a variety of applications, including data storage, data exchange and data integration. Avro uses a schema-based approach, in which a schema is defined for the data and is used to serialize and deserialize the data. This allows Avro to support rich data structures and evolution of the data over time. Avro also supports a container file to store persistent data, RPCs and integration with multiple languages. Avro relies on schemas. Avro schema is saved in the file when writing Avro data. An Avro schema is a JSON document that defines the structure of the data that is stored in an Avro file or transmitted in an Avro message. It defines the data types of the fields in the data, as well as the names and order of those fields. It can also include information about the encoding and compression of the data, as well as any metadata that is associated with the data. Avro schemas are used to ensure that data is stored and transmitted in a consistent and predictable format. When data is written to an Avro file or transmitted in an Avro message, the schema is included with the data, so that the recipient of the data knows how to interpret it. The schema approach allows for serialization to be both small and fast, while supporting dynamic scripting languages. Files may be processed later by any program. If the program reading the data expects a different schema, this can be quickly resolved since both schemas are present.Languages Supported
Apache Avro is the leading serialization format for record data and the first choice for streaming data pipelines. It offers excellent schema evolution and has implementations for the JVM (Java, Kotlin, Scala, …), Python, C/C++/C#, PHP, Ruby, Rust, JavaScript and even Perl. To see more about how Avro compares to other systems, check out this helpful comparison in the documentation. Apache Kafka and Confluent Platform have made some special connections for Avro, but it will work with any data format.
Big Data File Formats for Data Lakes
Parquet
Apache Parquet is a columnar storage format for big data processing. It is designed to be efficient for both storage and processing, and it is widely used in the Hadoop ecosystem as a data storage and exchange format. Despite the decline of Hadoop, the format is still relevant and widely used due in part to its continued support from key data processing systems including Apache Spark, Apache Flink and Apache Drill. Parquet stores data in a columnar format, organizing data in a way that is optimized for column-wise operations such as filtering and aggregation. Parquet uses a combination of compression and encoding techniques to store data in a highly efficient manner, and it allows for the creation of data schemas that can be used to enforce data type constraints and enable fast data processing. Parquet is a popular choice for storing and querying large datasets because it enables fast querying and efficient data storage and processing.Implementation Overview
This schema allows for capturing metadata efficiently, enables the evolution of file format and simplifies data storage. Parquet compaction algorithms reduce storage requirements, enable faster retrieval and are supported by many frameworks. There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. Parquet leverages Thrift’s TCompactProtocol for serialization and deserialization of the metadata structures.
Courtesy: parquet.apache.org
ORC
ORC, or Optimized Row Columnar, is a data storage format designed to improve the performance of data processing systems, such as Apache Hive and Apache Pig, by optimizing the storage and retrieval of large volumes of data. ORC stores data in a columnar format, which means data is organized and stored by column rather than by row. This allows for faster querying and analysis of data, as only the relevant columns need to be accessed instead of reading through entire rows of data. ORC also includes features such as compression, predicate pushdown, improved concurrency in using separate RecordReaders for the same files and type inference to further improve performance. Another area of improvement in ORC when compared to RCFile format, specifically in Hadoop deployments, is it dramatically reduces the load on NameNodes. It should be noted that ORC is tuned to the Hadoop architecture and workloads. Given most modern data stacks are moving away from Hadoop, ORC has limited utility in cloud native environments.Implementation Overview
The row data in an ORC file is organized into groups called stripes, which contain a series of rows and metadata about those rows. Each stripe contains a series of rows, and each row is divided into a series of columns. The data for each column is stored separately, which allows ORC to efficiently read only the columns that are needed for a particular query, rather than reading the entire row. ORC also includes metadata about the data, such as data types and compression information, which helps to improve read performance. ORC supports a variety of compression formats, including zlib, LZO and Snappy, which can help to reduce the size of the data on disk and improve read and write performance. It also supports various types of indexes, such as row indexes and bloom filters, which can be used to further improve read performance.
Courtesy: orc.apache.org
Arrow
Relatively new, Apache Arrow is an open source in-memory columnar data format designed to accelerate analytics and data processing tasks. It is a standardized format used to represent and manipulate data in a variety of systems, including data storage systems, data processing frameworks and machine learning libraries. One of the key benefits of Apache Arrow is its ability to efficiently transfer data between different systems and processes. It allows data to be shared and exchanged between systems without the need for serialization or deserialization, which can be time-consuming and resource-intensive. This makes it well-suited for use in distributed and parallel computing environments, where large volumes of data need to be processed and analyzed quickly. In addition to its performance benefits, Apache Arrow also has a rich feature set that includes support for a wide range of data types, support for nested and hierarchical data structures, and integration with a variety of programming languages. The Arrow format is often used in big data and analytics applications, where it can be used to efficiently transfer data between systems and process data in-memory. It is also used in the development of distributed systems and real-time analytics pipelines.Implementation Overview
The challenge most data teams face is that dealing with the internal data format of every database and language by serializing/deserializing without a standard columnar data format leads to inefficiency and performance issues. Another major challenge would be to write standard algorithms for each data format resulting in duplication and bloat.
Courtesy: https://arrow.apache.org/

Courtesy: https://arrow.apache.org/