When event-driven systems grow past a handful of services, the biggest failures usually are not infrastructure failures. They are contract failures.
A producer adds a field and a consumer crashes.
A team renames an enum value and downstream processing silently misclassifies events.
A “minor change” ships without coordination and turns into a production incident.
In this post, I will walk through how I design a contract-first event-driven architecture on AWS with a focus on:
- Event versioning strategies
- Schema registry usage
- Consumer tolerance patterns
- Breaking vs non-breaking changes
- Governance for event contracts
I will also include an end-to-end walkthrough, implementation discussion, architecture, and code examples that show how I typically structure this in practice.
This pattern is especially useful when I want:
- multiple teams publishing and consuming events
- safe independent deployments
- compatibility checks in CI/CD
- replayable operations
- and a clear change-management process around event contracts
Why contract-first matters in event-driven systems
I like event-driven architectures because they reduce direct coupling at runtime. But they can easily create hidden coupling at the data contract level.
A queue, bus, or topic only decouples transport. It does not automatically decouple:
- field names
- field types
- nullability
- enum values
- semantic meaning
- version expectations
That is why I treat the event contract as a product interface, not just a JSON blob.
A contract-first approach means:
- I define the event schema before (or alongside) producer code
- I validate changes in pull requests and CI
- I classify changes as breaking or non-breaking
- I enforce compatibility policy before deployment
- Consumers are built to be tolerant where appropriate
What I mean by “contract-first” on AWS
On AWS, I usually use Amazon EventBridge as the routing layer for domain and integration events. For contract visibility and developer ergonomics, I use EventBridge Schemas (registry/discovery/code bindings) and a Git-based contract repository as the source of truth.
EventBridge Schemas supports custom schemas, inferred schemas, and code bindings, and supports both OpenAPI 3 and JSONSchema Draft4 formats. (docs.aws.amazon.com)
A key implementation detail that is easy to miss: for contract-first systems, I do producer-side validation before publishing to the bus. AWS explicitly recommends JSON Schema for client-side validation so events conform to the schema. (docs.aws.amazon.com)
That means I think about contract enforcement in two layers:
- Design-time / CI-time enforcement (compatibility and governance)
- Runtime enforcement (producer validation, consumer critical-field validation)
Architecture Overview
At a high level, I split the solution into four concerns:
- Contract governance (Git + PRs + compatibility checks)
- CI/CD publication (schema artifacts + code bindings + service deployment)
- Runtime event transport (EventBridge bus + rules + consumers)
- Operational controls (archive/replay, observability, version adoption metrics)
The guiding principle is simple:
- Git repo is the source of truth
- Schema registry is the discovery/distribution layer
- EventBridge is the routing layer
- Producer validation is the enforcement point
- Consumers are tolerant readers, not brittle mirror parsers
End-to-End Walkthrough
This is the end-to-end flow I use in a contract-first setup.
1) Define the event contract in a versioned repository
I keep contracts in a dedicated repo (or a clearly separated folder in a platform repo), with one folder per domain event and explicit versions.
Example structure:
```
contracts/
  orders/
    order-created/
      v1/
        schema.json
        examples/
          valid-minimal.json
          valid-full.json
          invalid-missing-id.json
        metadata.yaml
      v2/
        schema.json
        migration-notes.md
```
I usually store:
- the schema (`schema.json`)
- example payloads (valid and invalid)
- contract metadata (owner, lifecycle, SLA, compatibility policy)
- migration notes (for majors)
This keeps the contract reviewable and testable before code changes are merged.
2) Open a PR and run compatibility checks in CI
When a producer team proposes a schema change, the CI pipeline:
- lints the schema
- validates example payloads
- compares the proposed schema against the last released version
- classifies the change as breaking or non-breaking
- blocks deployment if it violates policy
This is the point where I want failure to happen.
It is much cheaper to fail a PR than to fail a consumer at runtime.
3) Publish the schema artifact and optional code bindings
After the contract PR is approved, the pipeline publishes the schema to EventBridge Schemas (or updates the schema artifact in the registry workflow).
EventBridge Schemas can store custom schemas and generate code bindings for supported languages, which can help teams bootstrap producers/consumers faster. (docs.aws.amazon.com)
I still keep Git as the source of truth. The registry is a distribution and discovery aid, not my governance system.
4) Producer validates events before publishing to EventBridge
At runtime, the producer constructs an event envelope and validates the detail payload against the contract schema (and optionally validates the envelope as well).
I do not rely on “the bus will catch it.” I validate before PutEvents.
This is especially important in multi-team environments where one bad deploy can affect many consumers.
5) EventBridge routes events to consumers
Once published, the event goes to an EventBridge custom bus and is routed by rules to targets such as:
- Lambda
- SQS (then Lambda workers)
- Step Functions
- EventBridge Pipes targets
- other buses/accounts (depending on architecture)
I keep routing concerns separate from schema governance concerns. The bus routes. The contracts define compatibility.
6) Consumers apply tolerant-reader patterns
Consumers should not parse the full event contract unless they truly need every field.
Instead, I design consumers to:
- read only the fields they need
- ignore unknown fields
- use safe defaults where appropriate
- validate critical fields they depend on
- gracefully handle unsupported versions
This is what lets independent deployments actually work in practice.
7) Archive and replay for recovery and backfills
For operational resilience, I often enable EventBridge archive and replay for important event buses.
EventBridge archives can filter by event pattern and later replay events back to the same source event bus (not a different bus). EventBridge also annotates replayed events with a replay-name field, which is useful for observability and preventing accidental re-archiving loops. (docs.aws.amazon.com)
There are replay caveats worth accounting for in design:
- replayed events are not guaranteed to be replayed in original ingestion order
- there can be delay before recently received events are available in the archive
- replay targets are on the source bus (you select rules on that bus) (docs.aws.amazon.com)
That means my consumers should be:
- idempotent
- order-tolerant where possible
- replay-aware (for metrics and side effects)
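As a concrete sketch of the idempotency requirement, here is a minimal guard keyed on the event's stable `eventId`. The in-memory set is a stand-in for a durable store; in practice I would back this with something like a conditional write to a DynamoDB table (that storage choice is an assumption, not shown here):

```python
# Minimal idempotency guard keyed on the event's stable eventId.
# The in-memory set is a stand-in for a durable store such as a
# DynamoDB conditional put; swap it out for real deployments.

_processed_ids: set = set()

def process_once(record: dict, handler) -> bool:
    """Invoke `handler` at most once per eventId; return True if it ran."""
    event_id = record["detail"]["eventId"]
    if event_id in _processed_ids:
        # Duplicate delivery or replayed event: skip side effects
        return False
    handler(record["detail"])
    _processed_ids.add(event_id)
    return True
```

The important property is that retries and replays hit the "already seen" branch instead of repeating side effects.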
Event Envelope and Contract Shape
I strongly prefer a stable envelope and versioned detail payload.
A practical EventBridge event envelope looks like this:
```json
{
  "source": "com.acme.orders",
  "detail-type": "OrderCreated.v1",
  "time": "2026-02-25T10:42:00Z",
  "detail": {
    "eventId": "evt_01J...",
    "schemaVersion": "1.2.0",
    "orderId": "ord_123",
    "customerId": "cus_789",
    "amount": 149.95,
    "currency": "AUD",
    "createdAt": "2026-02-25T10:41:59Z"
  }
}
```
Why I separate detail-type major version and schemaVersion
I often use a hybrid strategy:
- `detail-type` includes the major version for routing and coarse compatibility (`OrderCreated.v1`, `OrderCreated.v2`)
- `detail.schemaVersion` carries the full semantic version (`1.2.0`) for visibility, telemetry, and debugging
This gives me:
- simple EventBridge rule routing by major version
- clearer operational visibility into actual schema rollout
- room for non-breaking evolution within a major
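To make the hybrid strategy tangible, here is a small hypothetical helper a consumer might use to split a versioned `detail-type` into its event name and major version (the naming convention `Name.vN` follows the examples above):

```python
import re

# Hypothetical helper: split a versioned detail-type such as
# "OrderCreated.v2" into its event name and major version.
_DETAIL_TYPE_RE = re.compile(r"^(?P<name>[A-Za-z][A-Za-z0-9]*)\.v(?P<major>\d+)$")

def parse_detail_type(detail_type: str) -> tuple:
    match = _DETAIL_TYPE_RE.match(detail_type)
    if match is None:
        raise ValueError(f"Unversioned or malformed detail-type: {detail_type}")
    return match.group("name"), int(match.group("major"))
```

Rejecting malformed values loudly here is deliberate: an unversioned `detail-type` is a contract violation, not something to guess around.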
Event versioning strategies
There is no single universal versioning strategy. I choose based on blast radius, team maturity, and consumer tolerance.
Strategy 1: Major version in event type (my default)
Example:
- `OrderCreated.v1`
- `OrderCreated.v2`
When I use it
- multiple consumers across teams
- strict backward compatibility boundaries
- need clear routing and migration windows
Pros
- easy routing and coexistence
- explicit migration path
- lower ambiguity in logs/metrics
Cons
- can create duplicate rules/targets during migration
- more operational overhead during dual support
Strategy 2: Single event type + schemaVersion field only
Example:
- `detail-type = "OrderCreated"`
- `detail.schemaVersion = "1.3.0"`
When I use it
- fewer consumers
- strong tolerant-reader discipline
- changes are mostly additive
Pros
- simpler routing
- fewer EventBridge rules
Cons
- consumers must inspect payload version
- easier to accidentally ship breaking changes under the same event type
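When I do use this strategy, consumers need an explicit dispatch step on `detail.schemaVersion`. A minimal sketch (handler registry and names are illustrative, not a fixed pattern):

```python
# Sketch: dispatching on detail.schemaVersion when a single detail-type
# carries all versions. The handlers dict maps major version -> callable.

def dispatch_by_major(detail: dict, handlers: dict):
    """Route a payload to the handler registered for its major version."""
    version = detail.get("schemaVersion", "1.0.0")
    major = int(version.split(".")[0])
    handler = handlers.get(major)
    if handler is None:
        raise ValueError(f"No handler for schemaVersion major {major}")
    return handler(detail)
```

Failing on an unregistered major is the safe default; silently processing an unknown major is exactly how breaking changes slip through under this strategy.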
Strategy 3: Parallel events for semantic shifts
Sometimes a change is not just a new version. It is a new concept.
Example:
- `OrderCreated`
- `OrderSubmitted`
- `OrderAccepted`
If semantics change, I prefer a new event name over “versioning my way out” of domain ambiguity.
This is often cleaner than endlessly evolving one overloaded event.
Breaking vs non-breaking changes
This is where teams frequently get burned, because “non-breaking” is contextual.
Usually non-breaking (with tolerant consumers)
- Adding a new optional field
- Adding metadata consumers can ignore
- Widening field length limits (if consumers do not assume old max)
- Adding a new event type (without changing existing ones)
Often breaking
- Renaming a field
- Removing a field
- Changing a field's type (`number` -> `string`)
- Making an optional field required
- Changing date format or timestamp semantics
- Reusing the same field name with a different meaning
Context-dependent (treat carefully)
- Adding a new enum value: non-breaking only if consumers tolerate unknown enum values.
- Making a field nullable: can break consumers that assume presence/non-null.
- Reordering array semantics: can break consumers that rely on element order.
My rule is:
If a consumer written against the previous contract can fail or silently misbehave, I treat it as breaking.
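A tiny illustration of why the same additive change can be breaking or non-breaking depending on the consumer. The status values here are made up for the example; the point is the two parsing styles:

```python
# Two consumers written against the same previous contract, which had a
# closed status enum of {"PENDING", "PAID"}. The producer then adds
# "REFUNDED" -- an "additive" change.
KNOWN_STATUSES = {"PENDING", "PAID"}

def strict_status(detail: dict) -> str:
    # Brittle: treats the enum as closed forever
    status = detail["status"]
    if status not in KNOWN_STATUSES:
        raise ValueError(f"Unknown status: {status}")
    return status

def tolerant_status(detail: dict) -> str:
    # Tolerant: unknown values degrade to a fallback bucket
    status = detail.get("status")
    return status if status in KNOWN_STATUSES else "UNKNOWN"
```

The strict consumer fails on the new value; the tolerant one degrades safely. Same producer change, opposite outcomes, which is why I classify by consumer impact rather than by the shape of the diff.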
Schema registry usage on AWS
What I use EventBridge Schemas for
I use EventBridge Schemas for:
- schema discovery (especially in dev/staging)
- storing custom event schemas
- helping teams find contracts
- generating code bindings for faster adoption
EventBridge Schemas supports creating/uploading schemas and inferring schemas from events on an event bus, and supports both OpenAPI 3 and JSONSchema Draft4. (docs.aws.amazon.com)
What I do not use it for (by itself)
I do not treat the registry as sufficient governance.
A registry can tell me “what exists.” It does not automatically enforce:
- compatibility policy
- deprecation timelines
- ownership approvals
- rollout coordination
- consumer migration commitments
That is why I pair it with Git + CI + governance workflow.
Practical recommendation
- Dev/staging: schema discovery can help identify what is actually being emitted
- Production: publish vetted schemas from CI, avoid “registry discovered it so it must be okay” thinking
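For the "publish vetted schemas from CI" step, the boto3 `schemas` client is enough. A hedged sketch (the registry name is an assumption, and boto3 is imported lazily so the function can be exercised with a fake client; subsequent revisions of an existing schema would go through `update_schema` instead):

```python
import json

# Sketch of the CI publish step using the EventBridge Schemas API.
# The registry name "domain-events" is an assumption for illustration.

def publish_schema(registry_name: str, schema_name: str, schema: dict, client=None):
    if client is None:
        import boto3  # lazy import so tests can inject a fake client
        client = boto3.client("schemas")
    return client.create_schema(
        RegistryName=registry_name,
        SchemaName=schema_name,
        Type="JSONSchemaDraft4",
        Content=json.dumps(schema),
    )
```

Keeping this call in the pipeline (rather than letting discovery infer schemas in production) is what makes the registry reflect vetted contracts only.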
Consumer tolerance patterns (the part that protects independent deployments)
Consumer tolerance is what turns contract-first from process overhead into deployment freedom.
1) Tolerant reader pattern
Consumers parse only fields they need and ignore extras.
Bad approach:
- deserialize the entire payload into a strict model and fail on unknown fields
Better approach:
- read subset fields needed for the business action
- validate only those fields and critical invariants
2) Defensive enum handling
If I consume an enum like status, I do not assume I know every future value.
I usually implement:
- known value handling
- fallback bucket (`UNKNOWN`)
- metrics/alerts for unseen values
This avoids outages caused by additive enum expansion.
3) Defaulting and null tolerance (with business rules)
I default only where the business semantics are safe.
Examples:
- safe default for optional metadata field: yes
- safe default for money amount: no
- safe default for event timestamp: usually no
The goal is to avoid brittle parsing without masking real data quality issues.
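A minimal sketch of that split, using field names from the `OrderCreated` example. The actual decision of which fields are defaultable is the business judgment; the code just enforces it:

```python
# Defaults only for fields whose business semantics make a default safe.
# The split between the two groups is the decision that matters.
SAFE_DEFAULTS = {"metadata": {}, "couponCode": None}   # defaultable
NO_DEFAULT = {"orderId", "amount", "createdAt"}        # must be present

def read_with_defaults(detail: dict) -> dict:
    missing = NO_DEFAULT - detail.keys()
    if missing:
        # Missing money/identity fields are data-quality failures, not gaps to paper over
        raise ValueError(f"Missing critical fields: {sorted(missing)}")
    return {**SAFE_DEFAULTS, **detail}
```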
4) Version-aware adapters (upcasters/downcasters)
When migrations are active, I sometimes introduce a small adapter layer:
- upcaster: converts a `v1` payload to the internal canonical model expected by newer consumer logic
- downcaster (rarer): emits compatibility events for legacy consumers during transition
This is often cleaner than embedding version branching everywhere in business code.
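A minimal upcaster sketch: it maps a `v1` payload onto a canonical internal model. The canonical shape (the nested `money` object) is hypothetical, chosen only to show that the adapter owns the translation:

```python
# Sketch of an upcaster: v1 payload -> canonical internal model.
# The canonical field names ("money", etc.) are hypothetical.

def upcast_order_created_v1(detail_v1: dict) -> dict:
    return {
        "eventId": detail_v1["eventId"],
        "orderId": detail_v1["orderId"],
        "money": {
            "amount": float(detail_v1["amount"]),
            "currency": detail_v1.get("currency", "UNKNOWN"),
        },
    }
```

Business logic then only ever sees the canonical model, so version branching stays confined to the adapter layer.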
5) Idempotent processing for replay and retries
Replays and retries are normal in event-driven systems. Consumers should be idempotent based on a stable event key (eventId, domain aggregate ID + version, etc.).
This matters even more when I enable EventBridge archive/replay. EventBridge replay behavior is operationally powerful, but consumers still need idempotent side effects. (docs.aws.amazon.com)
Code: Contract schema (JSON Schema Draft4 style)
Below is a simplified contract for OrderCreated.v1. I am using JSON Schema because it fits runtime validation well, and AWS documentation explicitly recommends JSON Schema for client-side validation in this scenario. (docs.aws.amazon.com)
```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "OrderCreated.v1",
  "type": "object",
  "additionalProperties": true,
  "required": [
    "eventId",
    "schemaVersion",
    "orderId",
    "customerId",
    "amount",
    "currency",
    "createdAt"
  ],
  "properties": {
    "eventId": {
      "type": "string",
      "minLength": 1
    },
    "schemaVersion": {
      "type": "string",
      "pattern": "^1\\.\\d+\\.\\d+$"
    },
    "orderId": {
      "type": "string",
      "minLength": 1
    },
    "customerId": {
      "type": "string",
      "minLength": 1
    },
    "amount": {
      "type": "number",
      "minimum": 0
    },
    "currency": {
      "type": "string",
      "enum": ["AUD", "USD", "EUR"]
    },
    "createdAt": {
      "type": "string",
      "format": "date-time"
    },
    "couponCode": {
      "type": ["string", "null"]
    },
    "metadata": {
      "type": "object"
    }
  }
}
```
Notes on this schema
- I intentionally allow `additionalProperties: true` to support tolerant evolution within a major version.
- I keep the regex on `schemaVersion` aligned with the major (`1.x.x`).
- I model optional fields explicitly and avoid making everything required just because it exists today.
Code: Producer-side validation and publish to EventBridge (Python)
This example validates detail against the JSON Schema before publishing to EventBridge.
```python
import json
import os
from datetime import datetime, timezone
from uuid import uuid4

import boto3
from jsonschema import Draft4Validator, FormatChecker

events = boto3.client("events")

EVENT_BUS_NAME = os.environ["EVENT_BUS_NAME"]
SCHEMA_PATH = os.environ.get("SCHEMA_PATH", "schemas/order-created-v1.json")

with open(SCHEMA_PATH, "r", encoding="utf-8") as f:
    ORDER_CREATED_V1_SCHEMA = json.load(f)

validator = Draft4Validator(ORDER_CREATED_V1_SCHEMA, format_checker=FormatChecker())


class ContractValidationError(Exception):
    pass


def validate_detail(detail: dict) -> None:
    # e.path is a deque, which does not support ordering; compare as lists
    errors = sorted(validator.iter_errors(detail), key=lambda e: list(e.path))
    if errors:
        formatted = []
        for e in errors:
            path = ".".join(str(p) for p in e.path) or "<root>"
            formatted.append(f"{path}: {e.message}")
        raise ContractValidationError("; ".join(formatted))


def publish_order_created(order: dict) -> dict:
    detail = {
        "eventId": f"evt_{uuid4().hex}",
        "schemaVersion": "1.0.0",
        "orderId": order["order_id"],
        "customerId": order["customer_id"],
        "amount": float(order["amount"]),
        "currency": order["currency"],
        "createdAt": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "metadata": {
            "channel": order.get("channel"),
            "traceId": order.get("trace_id"),
        },
    }

    # Remove null metadata values to keep payloads clean
    detail["metadata"] = {k: v for k, v in detail["metadata"].items() if v is not None}

    validate_detail(detail)

    envelope = {
        "Source": "com.acme.orders",
        "DetailType": "OrderCreated.v1",
        "EventBusName": EVENT_BUS_NAME,
        "Time": datetime.now(timezone.utc),
        "Detail": json.dumps(detail),
    }

    response = events.put_events(Entries=[envelope])

    # Basic publish result handling
    if response.get("FailedEntryCount", 0) > 0:
        failed = [e for e in response.get("Entries", []) if "ErrorCode" in e]
        raise RuntimeError(f"PutEvents failed: {failed}")

    return response
```
Why I validate on the producer
Because contract-first only works if I enforce the contract before the event leaves the service boundary.
The registry helps discovery. CI helps governance.
Producer validation prevents runtime contract drift.
Code: Simplified compatibility check (CI gate)
In production, I usually use a stronger compatibility checker (or a custom policy engine), but here is a clear example of how I classify a subset of schema changes in CI.
```python
import json
from typing import Any, Dict, List


def load_schema(path: str) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def classify_change(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    findings: List[str] = []

    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    old_required = set(old.get("required", []))
    new_required = set(new.get("required", []))

    old_keys = set(old_props.keys())
    new_keys = set(new_props.keys())

    removed_fields = old_keys - new_keys
    added_fields = new_keys - old_keys

    if removed_fields:
        findings.append(f"BREAKING: removed fields {sorted(removed_fields)}")

    # Newly required fields can break old producers/consumers
    added_required = new_required - old_required
    if added_required:
        findings.append(f"BREAKING: newly required fields {sorted(added_required)}")

    for field in sorted(old_keys & new_keys):
        old_type = old_props[field].get("type")
        new_type = new_props[field].get("type")
        if old_type != new_type:
            findings.append(
                f"BREAKING: field '{field}' type changed from {old_type} to {new_type}"
            )

        # Enum changes are context-dependent; flag for review
        old_enum = old_props[field].get("enum")
        new_enum = new_props[field].get("enum")
        if old_enum is not None or new_enum is not None:
            if old_enum != new_enum:
                findings.append(
                    f"REVIEW: field '{field}' enum changed from {old_enum} to {new_enum}"
                )

    # Additive optional fields are usually non-breaking
    optional_additions = [f for f in added_fields if f not in new_required]
    if optional_additions:
        findings.append(f"NON_BREAKING: optional fields added {sorted(optional_additions)}")

    if not findings:
        findings.append("NO_CONTRACT_DIFF_DETECTED")

    return findings


if __name__ == "__main__":
    old_schema = load_schema("contracts/orders/order-created/v1/schema.json")
    new_schema = load_schema("contracts/orders/order-created/v1-next/schema.json")
    for line in classify_change(old_schema, new_schema):
        print(line)
```
Important note
This script is intentionally simplified. Real compatibility checks should also evaluate:
- nullability changes
- numeric range tightening
- string length tightening
- nested object/array changes
- semantic changes (which tooling cannot detect reliably)
That is why I combine automated checks with contract review governance.
Code: Consumer tolerant-reader Lambda (Python)
This consumer reads only a subset of fields and handles unknown enum values safely.
```python
import json
from typing import Any, Dict

KNOWN_CURRENCIES = {"AUD", "USD", "EUR"}


def parse_event(record: Dict[str, Any]) -> Dict[str, Any]:
    """
    EventBridge Lambda target event shape (single event invocation).
    We intentionally read only what we need.
    """
    detail_type = record.get("detail-type", "")
    detail = record.get("detail", {})

    # Envelope guardrails
    if not detail_type.startswith("OrderCreated.v"):
        raise ValueError(f"Unsupported detail-type: {detail_type}")

    # Critical field validation (subset)
    order_id = detail.get("orderId")
    amount = detail.get("amount")
    currency = detail.get("currency", "UNKNOWN")
    event_id = detail.get("eventId")

    if not event_id:
        raise ValueError("Missing eventId")
    if not order_id:
        raise ValueError("Missing orderId")
    if amount is None:
        raise ValueError("Missing amount")

    try:
        amount = float(amount)
    except (TypeError, ValueError):
        raise ValueError("Invalid amount")

    # Tolerant enum handling
    if currency not in KNOWN_CURRENCIES:
        currency = "UNKNOWN"

    return {
        "eventId": event_id,
        "orderId": order_id,
        "amount": amount,
        "currency": currency,
    }


def lambda_handler(event, context):
    parsed = parse_event(event)
    # Idempotency key should usually be eventId (persist/check externally)
    # business processing here...
    print(json.dumps({"message": "processed", **parsed}))
    return {"statusCode": 200, "processedEventId": parsed["eventId"]}
```
What this consumer demonstrates
- it does not assume full schema lockstep
- it validates only critical fields
- it tolerates additive changes
- it degrades safely on unknown enum values
This pattern dramatically reduces breakage from non-breaking producer evolutions.
Code: AWS CDK snippet (EventBridge bus, archive, and rule)
This is a compact example of how I might wire the bus and a rule in CDK (TypeScript). Archive is optional but useful for recovery and replay workflows.
```typescript
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import * as lambda from "aws-cdk-lib/aws-lambda";

export class ContractsEdaStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const bus = new events.EventBus(this, "DomainBus", {
      eventBusName: "domain-events"
    });

    const consumerFn = new lambda.Function(this, "OrderConsumerFn", {
      runtime: lambda.Runtime.PYTHON_3_12,
      handler: "app.lambda_handler",
      code: lambda.Code.fromAsset("lambda/order-consumer")
    });

    new events.Rule(this, "OrderCreatedV1Rule", {
      eventBus: bus,
      eventPattern: {
        source: ["com.acme.orders"],
        detailType: ["OrderCreated.v1"]
      },
      targets: [new targets.LambdaFunction(consumerFn)]
    });

    // Optional: event archive for replay and recovery
    new events.CfnArchive(this, "DomainBusArchive", {
      archiveName: "domain-events-archive",
      sourceArn: bus.eventBusArn,
      description: "Archive selected domain events for replay",
      // Optional filter pattern
      eventPattern: {
        source: ["com.acme.orders"]
      },
      retentionDays: 30
    });
  }
}
```
Implementation discussion (what makes this hold up in production)
This is the part I care about most. The architecture is the easy part. Operating it safely is the real work.
1) Make one layer the source of truth
I strongly recommend choosing one authoritative source for contracts.
My preference:
- Git contracts repo = source of truth
- EventBridge Schemas = discovery/distribution
- generated code = convenience artifact (never hand-edited)
If teams edit schemas ad hoc in multiple places, drift becomes inevitable.
2) Separate compatibility policy by scope
Not every event needs the same compatibility rigor.
I usually define contract classes, for example:
- Internal team-local events
  - faster iteration
  - smaller deprecation windows
- Cross-team domain events
  - strict review
  - compatibility gates
  - longer deprecation windows
- External/public integration events
  - strongest governance
  - formal versioning and migration docs
This prevents over-governing small internal signals while protecting high-blast-radius contracts.
3) Decide your “major version rollout” playbook in advance
When a breaking change is truly necessary, I do not improvise. I use a defined rollout pattern.
Typical playbook:
- Introduce `v2` alongside `v1`
- Dual-publish (or route from an adapter) for a migration window
- Track consumer adoption
- Freeze new dependencies on `v1`
- Announce deprecation date
- Remove `v1` after the agreed window
This is much more reliable than “we changed it, please update soon.”
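The dual-publish step can be sketched as a small helper that builds both `PutEvents` entries from the new payload. The `downcast` callable (v2 shape back to v1 shape) and bus name are assumptions for illustration:

```python
import json

# Sketch: PutEvents entries that dual-publish v2 plus a downcast v1
# during a migration window. `downcast` and the bus name are assumed.

def dual_publish_entries(detail_v2: dict, downcast, bus_name: str) -> list:
    return [
        {
            "Source": "com.acme.orders",
            "DetailType": "OrderCreated.v2",
            "EventBusName": bus_name,
            "Detail": json.dumps(detail_v2),
        },
        {
            "Source": "com.acme.orders",
            "DetailType": "OrderCreated.v1",
            "EventBusName": bus_name,
            "Detail": json.dumps(downcast(detail_v2)),
        },
    ]
```

Dropping the second entry (and the downcaster) at the end of the migration window is then a one-line change.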
4) Govern semantics, not just structure
Schema validation catches structural drift. It does not catch semantic drift.
Example:
- `amount` still exists and is a number
- but the team changed its meaning from "gross amount" to "net amount"
JSON Schema will happily validate that.
To reduce semantic drift, I add:
- clear field descriptions
- example payloads
- domain glossary references
- explicit units/timezone/currency semantics
- contract review by both producer and consumer owners
5) Be careful with enums
Enums are a common source of accidental breakage.
What I do:
- treat enum additions as review-required
- require consumers to implement fallback handling for non-critical enums
- document whether enum is “closed” or “extensible”
This avoids the classic “we only added one value” outage.
6) Use archive/replay intentionally, not casually
EventBridge archive/replay is powerful for:
- recovery after consumer bugs
- onboarding new consumers
- backfilling state after fixes
But it changes operational assumptions:
- replay can be delayed
- replay order may differ from original arrival order
- replayed events are marked with `replay-name` metadata
- replay is tied to the source bus (docs.aws.amazon.com)
So I design consumers to:
- be idempotent
- avoid unsafe side effects on duplicate/replay
- optionally detect replayed events for observability paths
7) Observe version adoption as a first-class metric
I like to emit and dashboard:
- events published by `detail-type`
- events published by `schemaVersion`
- validation failures by producer
- consumer parse failures by version
- unknown enum value rates
- replayed event counts (`replay-name` present)
This gives me a factual view of migration readiness instead of relying on team status updates.
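One low-overhead way to emit these is a CloudWatch Embedded Metric Format log line per observed event, which avoids `PutMetricData` calls on the hot path. A sketch (the namespace and metric names are assumptions):

```python
import json
import time

# Sketch: version-adoption metrics as a CloudWatch Embedded Metric
# Format (EMF) log line. Namespace and metric names are assumptions.

def version_metric_line(record: dict) -> str:
    payload = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "EventContracts",
                "Dimensions": [["DetailType", "SchemaVersion"]],
                "Metrics": [{"Name": "EventsObserved", "Unit": "Count"}],
            }],
        },
        "DetailType": record.get("detail-type", "unknown"),
        "SchemaVersion": record.get("detail", {}).get("schemaVersion", "unknown"),
        "Replayed": "replay-name" in record,  # EventBridge stamps replays
        "EventsObserved": 1,
    }
    return json.dumps(payload)
```

Printed from a Lambda consumer, CloudWatch turns each line into a metric sliced by detail-type and schema version, which is exactly the migration-readiness view described above.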
Governance for event contracts
Contract governance does not need to be bureaucratic, but it does need to be explicit.
Minimum governance I recommend
Contract ownership
Every contract should have:
- producer owner
- platform owner (optional but useful)
- primary consumer group(s) for review
Pull request rules
I typically require:
- schema diff summary
- compatibility classification
- migration impact statement
- updated examples
- deprecation notes (if applicable)
CODEOWNERS / mandatory reviews
At minimum:
- producer team review
- platform or architecture review for breaking changes
- affected consumer review (for major changes)
Versioning policy
Document:
- what counts as patch/minor/major
- what fields are stable
- deprecation window length
- dual-publish expectations
Lifecycle states
I label contracts like:
- `draft`
- `active`
- `deprecated`
- `retired`
This avoids ambiguity around old but still discoverable schemas.
A practical contract metadata file (optional but very useful)
I often pair each schema with metadata like this:
```yaml
name: OrderCreated
majorVersion: 1
status: active
owners:
  producerTeam: orders-platform
  platformTeam: eventing-platform
compatibilityPolicy:
  mode: backward-compatible-within-major
  enumAdditionsRequireReview: true
deprecation:
  minimumNoticeDays: 90
observability:
  metricsTag: orders.order_created
examples:
  - examples/valid-minimal.json
  - examples/valid-full.json
```
This gives CI and reviewers policy context that plain JSON Schema does not express.
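As a sketch of how CI can consume that context, here is a small policy gate that combines the metadata (loaded from `metadata.yaml` with e.g. PyYAML in a real pipeline) with the `BREAKING`/`REVIEW`/`NON_BREAKING` findings produced by the compatibility classifier earlier in this post:

```python
# Sketch: turn metadata policy + compatibility findings into a CI
# verdict. `findings` uses the BREAKING/REVIEW/NON_BREAKING prefixes
# from the classifier shown earlier; the policy keys mirror the YAML.

def gate(metadata: dict, findings: list) -> list:
    """Return the findings that should block the merge."""
    policy = metadata.get("compatibilityPolicy", {})
    mode = policy.get("mode", "")
    review_enums = policy.get("enumAdditionsRequireReview", False)

    blockers = []
    for finding in findings:
        if finding.startswith("BREAKING") and mode == "backward-compatible-within-major":
            blockers.append(finding)
        elif finding.startswith("REVIEW") and review_enums:
            blockers.append(finding)
    return blockers
```

A non-empty return value fails the pipeline, which puts the failure where it belongs: in the PR, not in production.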
Common mistakes I see (and how I avoid them)
“We have a schema registry, so we are contract-first”
Not necessarily.
A registry improves discoverability. Contract-first requires:
- versioned source of truth
- compatibility policy
- validation enforcement
- governance workflow
“Non-breaking means no consumer work”
Also not necessarily.
Even additive changes can require:
- monitoring updates
- analytics model adjustments
- new enum handling
- data warehouse schema evolution
“Consumers should validate the full schema too”
Usually not a good idea.
Consumers should validate:
- the envelope/version they support
- critical fields they depend on
- business invariants they enforce
Over-validating the full payload makes consumers brittle and defeats decoupling.
“We can do breaking changes quickly if we notify everyone”
This works until it does not.
I prefer explicit versioning and migration windows over coordination by chat message.
Closing thoughts
The best event-driven architectures are not just asynchronous. They are intentionally evolvable.
For me, contract-first design is how I make that happen:
- schemas as interfaces
- compatibility checks before deployment
- producer-side validation
- tolerant consumers
- governance that scales with team count
- replay-aware operations
If I were implementing this from scratch on AWS today, I would start with:
- EventBridge custom bus
- Git-based contract repo (JSON Schema)
- CI compatibility checks
- Producer validation before `PutEvents`
- Tolerant-reader consumer template
- Optional EventBridge archive/replay for critical event domains
- Version adoption dashboards
That gives a strong foundation without overcomplicating the first iteration.
References
- Amazon EventBridge Schemas user guide (schemas, custom/inferred schemas, code bindings, supported formats) (docs.aws.amazon.com)
- Creating an event schema in Amazon EventBridge (JSON Schema/OpenAPI support, client-side validation recommendation) (docs.aws.amazon.com)
- Generating code bindings for EventBridge schemas (supported languages and workflow) (docs.aws.amazon.com)
- Archiving and replaying events in Amazon EventBridge (archive/replay behavior, source bus replay, replay metadata, ordering/delay considerations) (docs.aws.amazon.com)
- Amazon EventBridge API Reference: `PutEvents` (API semantics and request shape) (docs.aws.amazon.com)
- JSON Schema specification (for runtime validation patterns)
- AsyncAPI specification (optional contract documentation model for event APIs)
Corresponding Mermaid code
```mermaid
flowchart TB
  %% Contract-First Event-Driven Architecture on AWS (Schemas, Validation, Compatibility)
  classDef svc fill:#EEF2FF,stroke:#4F46E5,stroke-width:1px,color:#1E1B4B;
  classDef data fill:#ECFDF5,stroke:#059669,stroke-width:1px,color:#064E3B;
  classDef gov fill:#FFF7ED,stroke:#EA580C,stroke-width:1px,color:#7C2D12;
  classDef ci fill:#FCE7F3,stroke:#DB2777,stroke-width:1px,color:#831843;
  classDef consumer fill:#EFF6FF,stroke:#2563EB,stroke-width:1px,color:#1E3A8A;

  subgraph Dev["Contract-First Governance (Git)"]
    A1["AsyncAPI / JSON Schema repo<br/>versioned contracts"]:::gov
    A2["CODEOWNERS + PR review<br/>producer/consumer approval"]:::gov
    A3["Compatibility checks<br/>(non-breaking vs breaking)"]:::gov
    A4["Contract changelog + deprecation policy"]:::gov
  end

  subgraph CI["CI/CD Pipeline"]
    B1["Lint schema + examples"]:::ci
    B2["Run compatibility test<br/>against previous versions"]:::ci
    B3["Publish schema artifact<br/>(EventBridge Schemas / package)"]:::ci
    B4["Deploy producer + consumer"]:::ci
  end

  subgraph Prod["AWS Runtime"]
    C1["Producer service<br/>(App / Lambda / ECS)"]:::svc
    C2["Producer-side validation<br/>JSON Schema validator"]:::svc
    C3["EventBridge Custom Bus"]:::svc
    C4["EventBridge Schemas<br/>Registry / discovery / code bindings"]:::data
    C5["Archive (optional)"]:::data
    C6["Replay (optional)"]:::svc

    subgraph Routing["Fan-out"]
      D1["Rule A -> Lambda Consumer"]:::consumer
      D2["Rule B -> SQS queue -> Lambda"]:::consumer
      D3["Rule C -> EventBridge Pipe / Step Functions"]:::consumer
    end

    E1["Consumer tolerance layer<br/>ignore unknowns, defaults, subset parsing"]:::consumer
    E2["Consumer-side validation<br/>critical fields only"]:::consumer
    E3["Business processing"]:::consumer
    E4["DLQ / error handling"]:::consumer
  end

  subgraph Ops["Observability & Governance Runtime"]
    F1["Contract metrics<br/>version adoption / failures"]:::gov
    F2["CloudWatch Logs / Metrics / Alarms"]:::gov
    F3["Schema review board / release gates"]:::gov
  end

  A1 --> B1 --> B2 --> B3 --> B4
  A2 --> B2
  A3 --> B2
  A4 --> B4
  B4 --> C1
  C1 --> C2 -->|valid event| C3
  C2 -->|invalid event| F2
  C3 -. schema discovery .-> C4
  C3 --> D1
  C3 --> D2
  C3 --> D3
  C3 -. optional archive .-> C5
  C6 --> C3
  D1 --> E1
  D2 --> E1
  D3 --> E1
  E1 --> E2
  E2 -->|pass| E3
  E2 -->|fail| E4
  C1 -. emits version metric .-> F1
  E2 -. validation errors .-> F1
  C3 -. bus metrics .-> F2
  E4 -. alarms .-> F2
  F3 --> A2
```
