Schema Evolution in Apache Avro, Protobuf, and JSON Schema
In modern distributed architectures—especially event-driven systems like Kafka or Pulsar—data is the contract. When systems scale and evolve, your data schemas will too. If not managed carefully, schema changes can break consumers, cause data loss, or disrupt analytics pipelines.
This post explores how schema evolution is handled across three common serialization formats: Apache Avro, Google Protobuf, and JSON Schema. We’ll walk through examples, common compatibility strategies, and tools to keep your contracts safe as they evolve.
Why Schema Evolution Matters
Changing data structures in production can break consumers that rely on an older version of the schema. A producer might send new fields or remove existing ones, while a consumer still expects the original version. A schema registry can help prevent this by enforcing compatibility rules and managing schema versions centrally.
Without evolution planning, you risk issues like:
- Consumers failing due to unexpected fields.
- Analytics pipelines producing inconsistent results.
- Downtime during coordinated schema rollouts.
Apache Avro
Avro is widely used in data pipelines due to its compact binary format and support for embedded or externally referenced schemas. Schemas are defined in JSON and describe record structures, fields, and types.
Example schema:
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
Evolving the schema:
Let’s say we want to add an optional email field:
{"name": "email", "type": ["null", "string"], "default": null}
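This change is backward compatible: a consumer using the new schema can still read records written with the old one, because the default supplies a value for the missing email field. It is also forward compatible, since consumers on the old schema simply ignore the extra field.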
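To make the role of the default concrete, here is a minimal Python sketch of how a reader schema is applied to a decoded record. It is a simplified model for flat records only, not the Avro library: `resolve_record` is a hypothetical helper, and real Avro resolution also covers type promotion, aliases, unions, and nested types.

```python
# Simplified model of Avro reader-schema resolution for flat records.
# Assumption: records are already decoded into dicts keyed by field name.

USER_V1 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

USER_V2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve_record(writer_record, reader_schema):
    """Project a decoded record onto the reader's schema, filling defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]
        elif "default" in field:
            # The writer did not know this field: fall back to the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


old_record = {"name": "Ada", "age": 36}
new_record = {"name": "Ada", "age": 36, "email": "ada@example.com"}

# New reader, old data: email is filled in from the default.
print(resolve_record(old_record, USER_V2))
# Old reader, new data: the unknown email field is dropped.
print(resolve_record(new_record, USER_V1))
```

Removing the `default` from the email field in `USER_V2` makes the first call raise, which is the failure the default is there to prevent.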
Compatibility modes in Avro (via Confluent Schema Registry):
| Mode | Description |
|---|---|
| Backward | New schema can read data written by the old one |
| Forward | Old schema can read data written by the new one |
| Full | Ensures both forward and backward compatibility |
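The three modes can be expressed as a small check. The sketch below is a rough approximation and not the registry's actual algorithm: it assumes flat records and only looks at field names and defaults, ignoring type changes. `can_read` and `compatibility` are hypothetical helper names.

```python
# Rough approximation of compatibility modes for flat record schemas.
# Assumption: only added/removed fields and their defaults are considered.

def can_read(reader_schema, writer_schema):
    """True if every reader field is either written or has a default."""
    written = {field["name"] for field in writer_schema["fields"]}
    return all(
        field["name"] in written or "default" in field
        for field in reader_schema["fields"]
    )


def compatibility(old_schema, new_schema):
    """Classify a schema change as FULL, BACKWARD, FORWARD, or NONE."""
    backward = can_read(new_schema, old_schema)  # new reads old data
    forward = can_read(old_schema, new_schema)   # old reads new data
    if backward and forward:
        return "FULL"
    if backward:
        return "BACKWARD"
    if forward:
        return "FORWARD"
    return "NONE"


v1 = {"fields": [{"name": "name", "type": "string"},
                 {"name": "age", "type": "int"}]}

# Adds email with a default.
v2_optional = {"fields": v1["fields"] + [
    {"name": "email", "type": ["null", "string"], "default": None}]}

# Adds email without a default.
v2_required = {"fields": v1["fields"] + [
    {"name": "email", "type": "string"}]}

# Removes age, which has no default in v1.
v2_no_age = {"fields": [{"name": "name", "type": "string"}]}

print(compatibility(v1, v2_optional))  # FULL
print(compatibility(v1, v2_required))  # FORWARD
print(compatibility(v1, v2_no_age))    # BACKWARD
```

Adding a field without a default is only forward compatible, and removing a field that has no default is only backward compatible. This is why optional fields with defaults are the safest change.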
Useful reference: Confluent Schema Compatibility Guide.
Protocol Buffers (Protobuf)
Protobuf uses .proto files to define message schemas. Each field is assigned a unique numeric tag that identifies it on the wire. Fields can be added, deprecated, or removed, but a tag number must never be reused for a different field.
Example:
message User {
  string name = 1;
  int32 age = 2;
  optional string email = 3;
}
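The rule that tag numbers are never reused can be checked mechanically. The following Python sketch assumes each version of a message has been summarized by hand as a mapping from tag number to `(name, type)`. It does not parse .proto files, and `reused_tags` is a hypothetical helper. It is deliberately conservative and flags a plain rename as well, even though a rename alone does not change the binary encoding.

```python
# Detect tag numbers that a candidate schema binds to a different field
# than an earlier version did.
# Assumption: versions are hand-written {tag: (name, type)} summaries.

def reused_tags(previous_versions, candidate):
    """Return sorted tags whose (name, type) differs from their first use."""
    first_use = {}
    for version in previous_versions:
        for tag, field in version.items():
            first_use.setdefault(tag, field)
    return sorted(
        tag for tag, field in candidate.items()
        if tag in first_use and first_use[tag] != field
    )


user_v1 = {1: ("name", "string"), 2: ("age", "int32"), 3: ("email", "string")}
user_v2 = {1: ("name", "string"), 3: ("email", "string")}  # age removed

# Bad: tag 2 comes back with a different meaning.
user_v3_bad = {1: ("name", "string"), 2: ("nickname", "string"),
               3: ("email", "string")}

# Good: the new field takes a fresh tag number.
user_v3_ok = {1: ("name", "string"), 3: ("email", "string"),
              4: ("nickname", "string")}

print(reused_tags([user_v1, user_v2], user_v3_bad))  # [2]
print(reused_tags([user_v1, user_v2], user_v3_ok))   # []
```

Keeping the full history matters here: comparing `user_v3_bad` only against `user_v2` would miss the conflict, because tag 2 is absent from that version.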
Schema evolution in Protobuf:
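- New optional fields can be added without breaking old consumers.
- Parsers skip fields they do not recognize instead of failing, so older consumers can still read newer messages.
- The tag number of a removed field must not be reused; declaring it as reserved in the .proto file lets the compiler enforce this.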
This makes Protobuf ideal for streaming systems and gRPC-based microservices. For a deeper dive into how Protobuf is used in event-based architectures, check out Deliveroo’s engineering post on streaming schema evolution with Protobuf.
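The way old consumers tolerate new fields follows from the wire format: every field is prefixed with a key that carries its tag number and a wire type, so a parser knows how many bytes to skip even when it has never heard of the field. Below is a minimal hand-rolled sketch of that behavior in Python. It is not the protobuf library: `decode_known` is a hypothetical helper, it handles only the varint, fixed-width, and length-delimited wire types, and it returns length-delimited values as raw bytes.

```python
# Minimal Protobuf wire-format reader that skips unknown tag numbers.
# Assumption: a sketch for illustration, not a replacement for generated code.

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_known(buf, known):
    """Decode fields listed in known ({tag: name}); skip everything else."""
    out = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:        # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:      # 64-bit fixed
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:      # length-delimited (strings, bytes, messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:      # 32-bit fixed
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if tag in known:
            out[known[tag]] = value
        # Unknown tags fall through: their bytes were consumed and dropped.
    return out


# User{name: "Ada", age: 36, email: "ada@example.com"} encoded by hand.
email = b"ada@example.com"
payload = (
    bytes([0x0A, 3]) + b"Ada"          # tag 1, wire type 2
    + bytes([0x10, 36])                # tag 2, wire type 0
    + bytes([0x1A, len(email)]) + email  # tag 3, wire type 2
)

old_consumer = {1: "name", 2: "age"}
new_consumer = {1: "name", 2: "age", 3: "email"}

print(decode_known(payload, old_consumer))
print(decode_known(payload, new_consumer))
```

The old consumer decodes name and age and silently passes over tag 3. The same mechanism explains why reusing a tag is dangerous: the parser trusts the tag number, so bytes written for one field would be interpreted as another.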
JSON Schema
JSON Schema is popular in REST APIs and lightweight pub/sub systems. It defines structure, types, required fields, and value constraints for JSON data.
Example:
{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "email": {"type": ["string", "null"]}
  },
  "required": ["name", "age"]
}
Challenges with evolution:
- JSON Schema lacks a native versioning model.
- Optionality and required fields must be managed manually.
- Consumers need logic to handle multiple schema versions.
To evolve safely, it’s common to include a version field in the payload, enabling consumers to switch logic based on version. For in-depth rules, the JSON Schema documentation is a solid place to start.
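A version-switching consumer might look like the Python sketch below. The field name `schema_version`, the handler functions, and the choice to treat unversioned payloads as version 1 are all assumptions made for illustration. Pick whatever convention your producers already follow.

```python
import json

# Dispatch on an explicit version field in the JSON payload.
# Assumption: the field is called "schema_version" and defaults to 1.

def _from_v1(payload):
    # Version 1 had no email field; normalize it to None.
    return {"name": payload["name"], "age": payload["age"], "email": None}


def _from_v2(payload):
    # Version 2 added an optional, nullable email field.
    return {"name": payload["name"], "age": payload["age"],
            "email": payload.get("email")}


HANDLERS = {1: _from_v1, 2: _from_v2}


def parse_user(raw):
    """Parse a JSON user payload into one internal shape, whatever its version."""
    payload = json.loads(raw)
    version = payload.get("schema_version", 1)
    handler = HANDLERS.get(version)
    if handler is None:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return handler(payload)


print(parse_user('{"name": "Ada", "age": 36}'))
print(parse_user(
    '{"schema_version": 2, "name": "Ada", "age": 36, '
    '"email": "ada@example.com"}'
))
```

Each handler maps its version onto the same internal shape, so the rest of the consumer never needs to know which version arrived. Rejecting unknown versions loudly is a design choice. Some teams prefer to fall back to the newest handler they know.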
Comparison Table
| Feature | Avro | Protobuf | JSON Schema |
|---|---|---|---|
| Format | Binary | Binary | Text |
| Schema Definition | JSON | .proto | JSON |
| Evolution Support | Strong | Strong | Weak |
| Self-describing | Yes (optional) | No | No |
| Best for | Data lakes, Kafka | RPC, Streaming | REST APIs |
Tools for Schema Evolution
- Confluent Schema Registry: Supports Avro, Protobuf, JSON Schema with version control and compatibility checks.
  https://docs.confluent.io/platform/current/schema-registry/index.html
- AWS Glue Schema Registry: Works well with Protobuf and Avro in streaming jobs and analytics workflows.
  https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html
- Karapace: Open-source alternative to Confluent’s registry.
  https://github.com/aiven/karapace
Best Practices
| Best Practice | Description |
|---|---|
| Use a Schema Registry | Centralizes schema versions and enforces compatibility rules automatically. |
| Add Fields as Optional or With Defaults | Prevents breaking older consumers when introducing new fields. |
| Avoid Reusing Field Identifiers | Especially important in Protobuf where field numbers must remain unique. |
| Remove Fields Carefully | Only remove fields when you’re certain no consumers depend on them. |
| Document Schema Versions | Maintain a changelog or version field to track schema changes over time. |
| Use Compatibility Modes | Enforce forward, backward, or full compatibility policies (e.g., in Avro). |
| Test Evolution Scenarios | Validate changes in staging with different producer and consumer versions. |
| Version Schemas Explicitly | Embed a version field in JSON payloads to guide deserialization logic. |
| Automate Validation in CI/CD | Integrate schema compatibility checks into your pipeline for safe deploys. |
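As one way to automate validation in CI/CD, a pipeline step can ask the registry whether a candidate schema is compatible with the latest registered version before anything is deployed. The Python sketch below assumes a Confluent-compatible registry that exposes the REST compatibility endpoint. The registry URL `http://localhost:8081` and the subject name `users-value` are placeholders, and both function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Build and send a compatibility check against a schema registry.
# Assumptions: Confluent-style REST API, Avro schema, no authentication.

def build_compatibility_request(registry_url, subject, schema):
    """Create the POST request that tests schema against the latest version."""
    url = "{}/compatibility/subjects/{}/versions/latest".format(
        registry_url.rstrip("/"), urllib.parse.quote(subject, safe="")
    )
    # The registry expects the schema itself as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )


def is_compatible(registry_url, subject, schema):
    """Send the request and return the registry's verdict as a bool."""
    request = build_compatibility_request(registry_url, subject, schema)
    with urllib.request.urlopen(request, timeout=10) as response:
        return bool(json.load(response).get("is_compatible"))


CANDIDATE = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

request = build_compatibility_request(
    "http://localhost:8081", "users-value", CANDIDATE
)
print(request.get_method(), request.full_url)
```

In a pipeline, a step would call `is_compatible(...)` and exit with a non-zero status when it returns `False`, so an incompatible schema fails the build instead of reaching production.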
Further Reading and Videos
- Martin Kleppmann on Schema Evolution
- YouTube: Data Serialization Explained (Protobuf vs Avro)
- Medium: Schema Evolution Patterns in Apache Kafka
Final Thoughts
Schema evolution is an unavoidable reality in growing systems. By adopting tools like schema registries and serialization formats designed with evolution in mind, teams can decouple producers from consumers and roll out changes with confidence.
Whether you pick Avro, Protobuf, or JSON Schema, the key is the same: treat your schema like code—version it, test it, and validate it before deploying.



