Schema Evolution in Apache Avro, Protobuf, and JSON Schema

In modern distributed architectures—especially event-driven systems like Kafka or Pulsar—data is the contract. When systems scale and evolve, your data schemas will too. If not managed carefully, schema changes can break consumers, cause data loss, or disrupt analytics pipelines.

This post explores how schema evolution is handled across three common serialization formats: Apache Avro, Google Protobuf, and JSON Schema. We’ll walk through examples, common compatibility strategies, and tools to keep your contracts safe as they evolve.

Why Schema Evolution Matters

Changing data structures in production can break consumers that rely on an older version of the schema. A producer might send new fields or remove existing ones, while a consumer still expects the original version. A schema registry can help prevent this by enforcing compatibility rules and managing schema versions centrally.

Without evolution planning, you risk issues like:

  • Consumers failing on missing or unexpected fields.
  • Analytics pipelines producing inconsistent results.
  • Downtime during coordinated schema rollouts.

Apache Avro

Avro is widely used in data pipelines due to its compact binary format and support for embedded or externally referenced schemas. Schemas are defined in JSON and describe record structures, fields, and types.

Example schema:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

Evolving the schema:
Let’s say we want to add an optional email field:

{"name": "email", "type": ["null", "string"], "default": null}

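This change is backward compatible: a consumer using the new schema can still read records written with the old one, because the default supplies a value for the missing email field. It is also forward compatible, since a consumer still on the old schema simply ignores the extra field during schema resolution.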
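
To make the role of the default concrete, here is a minimal sketch of Avro-style schema resolution for flat records. It is not the Avro library's implementation: real Avro resolves binary data and also handles type promotion, aliases, and nested types. The function name resolve_record and the sample records are assumptions made for illustration.

```python
# Simplified sketch of Avro-style schema resolution for flat records.
# Assumption: records are already decoded into dicts; real Avro works on
# binary data and also handles type promotion, aliases, and nested types.

USER_V1 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

USER_V2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve_record(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # The writer produced this field: take its value.
            resolved[name] = record[name]
        elif "default" in field:
            # The writer did not know this field: fall back to the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Writer fields the reader does not declare are dropped.
    return resolved


old_record = {"name": "Ada", "age": 36}
new_record = {"name": "Ada", "age": 36, "email": "ada@example.com"}

# New reader, old data: email is filled from the default.
print(resolve_record(old_record, USER_V1, USER_V2))
# Old reader, new data: the extra email field is ignored.
print(resolve_record(new_record, USER_V2, USER_V1))
```

If the email field were declared without a default, the first call would raise instead, which is the failure a compatibility check is meant to catch before deployment.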

Compatibility modes in Avro (via Confluent Schema Registry):

Mode      Description
Backward  New schema can read data written by the old one
Forward   Old schema can read data written by the new one
Full      Ensures both forward and backward compatibility

Useful reference: Confluent Schema Compatibility Guide.
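
The three modes can be approximated with a small check over flat record schemas. This is a rough sketch under simplifying assumptions, not the registry's actual algorithm: it compares field names and requires identical types, ignoring type promotion, aliases, and nested records. The helper names can_read and compatibility are hypothetical.

```python
# Rough sketch of a compatibility check for flat Avro record schemas.
# Assumption: types must match exactly; real checkers also allow
# promotions (e.g. int -> long), aliases, and nested types.

def can_read(reader_schema, writer_schema):
    """True if data written with writer_schema is readable with reader_schema."""
    written = {f["name"]: f["type"] for f in writer_schema["fields"]}
    for field in reader_schema["fields"]:
        if field["name"] in written:
            if written[field["name"]] != field["type"]:
                return False
        elif "default" not in field:
            # The reader needs a value the writer never produced.
            return False
    return True


def compatibility(old_schema, new_schema):
    """Classify a change from old_schema to new_schema."""
    backward = can_read(new_schema, old_schema)  # new reads old data
    forward = can_read(old_schema, new_schema)   # old reads new data
    if backward and forward:
        return "FULL"
    if backward:
        return "BACKWARD"
    if forward:
        return "FORWARD"
    return "NONE"


def record(*fields):
    return {"type": "record", "name": "User", "fields": list(fields)}


NAME = {"name": "name", "type": "string"}
AGE = {"name": "age", "type": "int"}

V1 = record(NAME, AGE)
V2_OPTIONAL_EMAIL = record(
    NAME, AGE, {"name": "email", "type": ["null", "string"], "default": None}
)
V2_REQUIRED_EMAIL = record(NAME, AGE, {"name": "email", "type": "string"})
V2_WITHOUT_AGE = record(NAME)

print(compatibility(V1, V2_OPTIONAL_EMAIL))  # FULL
print(compatibility(V1, V2_REQUIRED_EMAIL))  # FORWARD
print(compatibility(V1, V2_WITHOUT_AGE))     # BACKWARD
```

Adding a field without a default is only forward compatible, and removing a field without a default is only backward compatible, which is why optional fields with defaults are the safest way to evolve a record.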

Protocol Buffers (Protobuf)

Protobuf uses .proto files to define message schemas. Each field is assigned a unique number (its tag) that identifies it in the encoded message. Fields can be added, deprecated, or removed, but a field number must never be reused for a different field.

Example:

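// proto3 syntax; the "optional" label in proto3 needs a reasonably recent protoc.
syntax = "proto3";

message User {
  string name = 1;
  int32 age = 2;
  optional string email = 3;
}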

Schema evolution in Protobuf:

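  • New optional fields can be added without breaking old consumers.
  • Parsers skip field numbers they do not recognize, so old consumers can still read messages that carry new fields.
  • The number of a removed field must never be assigned to a new field; declaring it as reserved lets the compiler enforce this.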

This makes Protobuf ideal for streaming systems and gRPC-based microservices. For a deeper dive into how Protobuf is used in event-based architectures, check out Deliveroo’s engineering post on streaming schema evolution with Protobuf.
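
The way old consumers skip unknown fields follows from the wire format, where every value is prefixed by its field number and wire type. The sketch below hand-rolls just enough of that encoding to show the idea. It is not the protobuf library: it handles only non-negative integers and UTF-8 strings, and it drops unknown fields, whereas real runtimes generally retain them for re-serialization. All function names here are hypothetical.

```python
# Hand-rolled subset of the Protobuf wire format (illustration only).
# Assumptions: only non-negative ints (varint) and UTF-8 strings
# (length-delimited) are supported.

VARINT, LEN = 0, 2  # wire types


def encode_varint(value):
    if value < 0:
        raise ValueError("this sketch only encodes non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    if isinstance(value, int):
        return encode_varint((number << 3) | VARINT) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint((number << 3) | LEN) + encode_varint(len(data)) + data


def decode(buf, known_fields):
    """known_fields maps field number -> name; other numbers are skipped."""
    message, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == LEN:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in known_fields:
            message[known_fields[number]] = value
        # Unknown numbers: bytes were consumed above, the value is dropped.
    return message


payload = (
    encode_field(1, "Ada")
    + encode_field(2, 36)
    + encode_field(3, "ada@example.com")
)

OLD_READER = {1: "name", 2: "age"}
NEW_READER = {1: "name", 2: "age", 3: "email"}

print(decode(payload, OLD_READER))
print(decode(payload, NEW_READER))
```

Because the reader identifies fields only by number, reusing number 3 for a field with a different meaning would make an old consumer silently misinterpret the value, which is the reason removed numbers should stay reserved.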

JSON Schema

JSON Schema is popular in REST APIs and lightweight pub/sub systems. It defines structure, types, required fields, and value constraints for JSON data.

Example:

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "email": {"type": ["string", "null"]}
  },
  "required": ["name", "age"]
}

Challenges with evolution:

  • JSON Schema lacks a native versioning model.
  • Optionality and required fields must be managed manually.
  • Consumers need logic to handle multiple schema versions.

To evolve safely, it’s common to include a version field in the payload, enabling consumers to switch logic based on version. For in-depth rules, the JSON Schema documentation is a solid place to start.
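
A version field of this kind might be consumed as in the sketch below. The field name schema_version, the rule that payloads without it count as version 1, and the handler functions are all assumptions made for illustration, not part of JSON Schema itself.

```python
import json

# Sketch of version-based dispatch for JSON payloads.
# Assumptions: the payload carries a hypothetical "schema_version" field,
# and payloads without it are treated as version 1.


def _from_v1(doc):
    # Version 1 had no email; normalize to the current internal shape.
    return {"name": doc["name"], "age": doc["age"], "email": None}


def _from_v2(doc):
    # Version 2 added an optional, nullable email.
    return {"name": doc["name"], "age": doc["age"], "email": doc.get("email")}


HANDLERS = {1: _from_v1, 2: _from_v2}


def parse_user(raw):
    doc = json.loads(raw)
    version = doc.get("schema_version", 1)
    try:
        handler = HANDLERS[version]
    except KeyError:
        raise ValueError(f"unsupported schema_version: {version!r}") from None
    return handler(doc)


print(parse_user('{"name": "Ada", "age": 36}'))
print(parse_user(
    '{"schema_version": 2, "name": "Ada", "age": 36, "email": "ada@example.com"}'
))
```

Normalizing every version to one internal shape keeps the version-specific logic at the edge of the consumer, so the rest of the code never branches on the version.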

Comparison Table

Feature            Avro               Protobuf        JSON Schema
Format             Binary             Binary          Text
Schema Definition  JSON               .proto          JSON
Evolution Support  Strong             Strong          Weak
Self-describing    Yes (optional)     No              No
Best for           Data lakes, Kafka  RPC, Streaming  REST APIs

Best Practices

Best Practice                            Description
Use a Schema Registry                    Centralizes schema versions and enforces compatibility rules automatically.
Add Fields as Optional or With Defaults  Prevents breaking older consumers when introducing new fields.
Avoid Reusing Field Identifiers          Especially important in Protobuf where field numbers must remain unique.
Remove Fields Carefully                  Only remove fields when you’re certain no consumers depend on them.
Document Schema Versions                 Maintain a changelog or version field to track schema changes over time.
Use Compatibility Modes                  Enforce forward, backward, or full compatibility policies (e.g., in Avro).
Test Evolution Scenarios                 Validate changes in staging with different producer and consumer versions.
Version Schemas Explicitly               Embed a version field in JSON payloads to guide deserialization logic.
Automate Validation in CI/CD             Integrate schema compatibility checks into your pipeline for safe deploys.
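
As one illustration of an automated check, the sketch below flags Protobuf field numbers that were removed without being reserved or that changed type between two schema versions. It assumes the field definitions have already been extracted into plain dicts; the function name find_tag_violations and that input format are hypothetical, and treating every type change as a violation is deliberately strict.

```python
# Sketch of a field-number check that a CI job could run on each change.
# Assumptions: schemas are given as {field_number: (name, type)} dicts and
# reserved numbers as a set; extracting these from .proto files is out of
# scope here. Any type change is flagged, which is stricter than Protobuf's
# actual wire-compatibility rules.

def find_tag_violations(old_fields, new_fields, reserved=frozenset()):
    problems = []
    for number, (old_name, old_type) in sorted(old_fields.items()):
        if number not in new_fields:
            if number not in reserved:
                problems.append(
                    f"field {number} ({old_name}) removed without being reserved"
                )
            continue
        _, new_type = new_fields[number]
        if new_type != old_type:
            problems.append(
                f"field {number} changed type: {old_type} -> {new_type}"
            )
    for number in sorted(set(new_fields) & set(reserved)):
        problems.append(f"field {number} is reserved but still in use")
    return problems


V1 = {1: ("name", "string"), 2: ("age", "int32"), 3: ("email", "string")}

# email removed and its number reserved: no violations.
V2_OK = {1: ("name", "string"), 2: ("age", "int32")}
print(find_tag_violations(V1, V2_OK, reserved={3}))

# Number 3 handed to a different field: flagged.
V2_REUSED = {1: ("name", "string"), 2: ("age", "int32"), 3: ("signup_ts", "int64")}
print(find_tag_violations(V1, V2_REUSED))
```

A pipeline step could fail the build whenever the returned list is non-empty, so an unsafe change never reaches a shared environment.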

Final Thoughts

Schema evolution is an unavoidable reality in growing systems. By adopting tools like schema registries and serialization formats designed with evolution in mind, teams can decouple producers from consumers and roll out changes with confidence.

Whether you pick Avro, Protobuf, or JSON Schema, the key is the same: treat your schema like code—version it, test it, and validate it before deploying.

Eleftheria Drosopoulou

Eleftheria is an experienced business analyst with a robust background in the computer software industry. Proficient in computer software training, digital marketing, HTML scripting, and Microsoft Office, she brings a wealth of technical skills to the table. She also loves writing articles on various tech subjects, showcasing a talent for translating complex concepts into accessible content.