The incomparability problem
Here is a question that has no clean answer.
How do you know whether the architecture you chose was the right one?
Not right in the sense of working — most systems work, eventually, after enough effort. Right in the sense of optimal. Right in the sense that the complexity you introduced was warranted by the problem you were solving, and that a simpler approach would have cost more rather than less.
The honest answer, in most cases, is that you cannot know. Because the alternative was never built.
This is not a gap in the data. It is the mechanism of the problem. Most systems are built only once. There is no second system built with different assumptions, run for five years, and compared on total cost of ownership, ease of change, and operational stability. The counterfactual does not exist. Therefore the cost of the wrong choice — if it was the wrong choice — is permanently invisible.
None of this is to say that distributed systems cannot work. Many organisations have made them function, sometimes at considerable scale — usually through exceptional engineering discipline, strong platform investment, and genuine operational maturity. The question is different: how much of the total effort, over years, went into managing the consequences of the distribution itself, rather than advancing the domain? And would a simpler boundary choice have delivered more value with less sustained overhead? The counterfactual remains hard to prove, which is precisely why we need sharper prospective indicators.
And here is what makes the problem genuinely difficult: the entire industry tends to converge on the same patterns at the same time. When every team uses a similar stack, incurs similar coordination overhead, and grows to a similar size — those costs stop being visible as costs. They become the definition of what software costs. Normal and wasteful become indistinguishable.
So the question sharpens. If we cannot compare architectures retrospectively, is there anything we can measure prospectively — before five years have passed — that gives us a leading indicator of whether we are building something appropriately simple, or something unnecessarily complex?
There is. And it comes from an unlikely place.
The warehouse and the system boundary
Consider an order fulfilment operation. An order arrives. A picker walks to the rack holding the product, picks it, and places it on the assembly line. Routine.
Now consider what happens when that order is cancelled.
If the picker has not yet left the rack, cancellation is a system operation. One record updated. The state change is contained. The cost is negligible and the outcome is certain.
If the picker is already walking the floor — part in hand, mid-transit — the picture changes entirely. The picker must be located and reached. The instruction must be communicated and confirmed. The picker turns around, returns the part, re-shelves it in the correct position, and logs the return. The assembly line must be told the part is not coming and adjust accordingly. Each of those steps can fail. Each failure requires its own recovery. If the picker has already placed the part on the line, someone else must retrieve it, the line has already reacted to its arrival, and the cleanup compounds further.
The correction costs more than the original action. Not marginally more — multiplicatively more. More people, more coordination, more opportunity for secondary failure, and a system left in a state requiring verification before it can be trusted again.
This is the principle that makes architectural cost measurable before a system is built:
As long as domain actions happen within a single system boundary, the cost of failure is a rollback. The moment actions propagate outside that boundary, the cost of failure becomes coordination.
This is not a preference. It is a structural property of distributed systems, and it applies regardless of how well the coordination is engineered. You can manage the cost with better tooling. You cannot eliminate it. It is inherent to the boundary crossing.
The warehouse makes this visible in a way that software obscures. In the warehouse, you can see the picker walking. You can see the empty rack. You can see the stalled line. The cost of the part in transit is physically apparent. In software, the equivalent states — the uncommitted saga step, the unacknowledged event, the stalled compensating transaction — are invisible unless you built dedicated instrumentation to see them. The cost is identical. The visibility is not. That invisibility is precisely why the cost became acceptable.
The well-run warehouse minimises the time parts spend in transit, because parts in transit are the expensive state. The leading indicator of a well-designed system is the same: how much of the domain work happens within a single rollback boundary, and how much crosses outside it?
Rollbackability — the degree to which a failed action can be fully undone by the system without external coordination — is a concrete, prospective benchmark for simplicity. If you are designing a system and the failure path requires coordinating compensation across multiple services, you have already committed to a significant and permanent cost. The question is whether the benefit justified it.
In most cases, that question was never asked.
A concrete example: order creation
Take a canonical domain flow: an order is created, inventory is reserved, an invoice is generated, a shipment is planned. Four concepts. One business action. It either succeeds completely or it does not happen.
In a monolith with a well-modelled domain, this is the entirety of the orchestration:
```java
@Transactional
public OrderConfirmation createOrder(OrderRequest request) {
    // One business action, one transaction: all four writes commit together or not at all.
    Order order = new Order(request);
    Inventory.reserve(order);
    Invoice invoice = new Invoice(order);
    Shipment shipment = new Shipment(order);
    return OrderConfirmation.of(order, invoice, shipment);
}
```
The database transaction is the system boundary. If anything fails, nothing happened. The domain concepts — Order, Inventory, Invoice, Shipment — do the work. The technology serves them. Rollbackability is total. The failure path costs nothing beyond the failed attempt itself.
This example is deliberately straightforward — but the principle holds as domain complexity increases. In fact, the more complex the domain, the more important it becomes that the infrastructure does not add noise. A complex financial workflow with regulatory holds is hard enough to reason about correctly without the additional burden of distributed coordination, partial failure states, and eventual consistency layered on top of it.
Now split those four concepts across four services. The business requirement has not changed by a single word. What changes is everything else.
The infrastructure required before writing a line of business logic
A message broker. Services cannot call each other synchronously if you want any resilience. Kafka or RabbitMQ: a three-node production cluster, topic design, schema registry, retention policies, consumer group monitoring, and a local development environment every developer must run and maintain.
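To make that surface area concrete: a minimal sketch of the producer-side plumbing that replaces a plain method call, assuming Apache Kafka's Java client and a hypothetical orders.created topic. Every event type needs an equivalent, plus the consumer side, plus the cluster underneath.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderEventPublisher {
    private final KafkaProducer<String, String> producer;

    public OrderEventPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: wait for the full in-sync replica set, or risk losing the event.
        props.put("acks", "all");
        props.put("enable.idempotence", "true");
        this.producer = new KafkaProducer<>(props);
    }

    public void publishOrderCreated(String orderId, String payloadJson) {
        // Keyed by orderId so all events for one order land on one partition, in order.
        producer.send(new ProducerRecord<>("orders.created", orderId, payloadJson));
    }
}
```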
Saga infrastructure. There is no transaction. Coordination must be made durable — if the orchestrator crashes mid-flow, it must resume from the correct step. This means a saga framework (Axon, Temporal, AWS Step Functions — each a substantial system with its own operational model and learning curve) or a hand-rolled saga state table with step tracking and a crash recovery process. Either way, there is now a fifth service whose entire existence is accidental complexity. It owns no domain concept. It exists solely because the transaction boundary was removed.
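For the hand-rolled variant, here is a minimal sketch of the durable step tracking, assuming PostgreSQL via JDBC and a hypothetical saga_state table. Note that none of it touches a domain concept; it exists only to survive a crash between two remote calls.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class SagaStateStore {
    public enum Step { STARTED, INVENTORY_RESERVED, INVOICE_CREATED, SHIPMENT_PLANNED, COMPLETE }

    private final DataSource ds;

    public SagaStateStore(DataSource ds) { this.ds = ds; }

    // The step must be persisted BEFORE the next remote call is made, so that a
    // crash between the two can be detected and resumed (or compensated) on restart.
    public void advance(String sagaId, Step step) throws SQLException {
        try (Connection c = ds.getConnection();
             PreparedStatement ps = c.prepareStatement(
                 "INSERT INTO saga_state (saga_id, current_step, updated_at) VALUES (?, ?, now()) "
                 + "ON CONFLICT (saga_id) DO UPDATE SET current_step = EXCLUDED.current_step, updated_at = now()")) {
            ps.setString(1, sagaId);
            ps.setString(2, step.name());
            ps.executeUpdate();
        }
    }
}
```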
Distributed tracing. Four services produce four independent log streams with no shared identity unless you build one. Jaeger or Zipkin for the trace infrastructure. Every service propagates a correlation ID in HTTP headers, event envelopes, and log output. A log aggregation stack on top, because reconstructing an incident across four separate log streams without tooling is not a debugging workflow — it is an archaeology project.
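The propagation half of that work, sketched as a servlet filter, assuming the Jakarta Servlet API and SLF4J's MDC; the header name X-Correlation-Id is illustrative. Each service needs this on the inbound side, a mirror image on every outbound call, and the same again in event envelopes.

```java
import java.io.IOException;
import java.util.UUID;
import jakarta.servlet.*;
import jakarta.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

public class CorrelationIdFilter implements Filter {
    private static final String HEADER = "X-Correlation-Id"; // illustrative header name

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String id = ((HttpServletRequest) req).getHeader(HEADER);
        if (id == null) id = UUID.randomUUID().toString(); // first hop mints the ID
        MDC.put("correlationId", id); // every log line in this request now carries it
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("correlationId"); // avoid leaking the ID across pooled threads
        }
    }
}
```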
Idempotency handling — in every service. Message brokers guarantee at-least-once delivery. The same event will arrive twice. Every consumer must handle this without creating two invoices or two shipments. An idempotency key strategy per event type. A deduplication store — typically a processed-events table — checked on every inbound message. This is not a framework you install. It is code you write, in every service, correctly, and maintain forever.
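A minimal sketch of that deduplication check, assuming PostgreSQL and a hypothetical processed_events table with a unique event_id. The subtlety that makes it hard to get right: the insert must share a local transaction with the handler's own writes.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class EventDeduplicator {
    // Returns true if this event has not been processed before. The connection
    // must carry the same transaction as the handler's writes: if the handler
    // commits but the dedup insert does not (or vice versa), a redelivery will
    // create the second invoice this code exists to prevent.
    public boolean firstDelivery(Connection txn, String eventId) throws SQLException {
        try (PreparedStatement ps = txn.prepareStatement(
                "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT (event_id) DO NOTHING")) {
            ps.setString(1, eventId);
            return ps.executeUpdate() == 1; // 0 rows inserted means duplicate delivery
        }
    }
}
```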
Compensating transactions — per failure path. The rollback equivalent. Designed, coded, tested, and maintained per service per failure scenario. For four services the paths are: inventory fails — cancel order; invoice fails — release inventory, cancel order; shipping fails — void invoice, release inventory, cancel order. Each compensation is a domain operation that must exist, be reachable, be idempotent, and be tested both in isolation and in combination. The compensation burden grows as O(n²) with the number of services: a failure at step k requires undoing the k - 1 steps already committed, so n services imply n(n-1)/2 compensating actions in total.
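What those three paths look like as code, in a minimal sketch with hypothetical client interfaces standing in for the services. Every method named here is a domain operation that must be written, made idempotent, and tested per path; each remote call can itself fail.

```java
public class OrderSagaCompensator {
    public enum FailedStep { INVENTORY, INVOICE, SHIPPING }

    // Hypothetical clients; in reality each is an HTTP or messaging integration.
    interface OrderClient     { void cancel(String orderId); }
    interface InventoryClient { void release(String orderId); }
    interface InvoiceClient   { void voidInvoice(String orderId); }

    private final OrderClient orders;
    private final InventoryClient inventory;
    private final InvoiceClient invoices;

    public OrderSagaCompensator(OrderClient o, InventoryClient inv, InvoiceClient invc) {
        this.orders = o; this.inventory = inv; this.invoices = invc;
    }

    // Compensations run in reverse order of the steps already committed, and
    // each remote call below can itself fail, requiring retry state of its own.
    public void compensate(FailedStep failedAt, String orderId) {
        switch (failedAt) {
            case SHIPPING -> {
                invoices.voidInvoice(orderId);
                inventory.release(orderId);
                orders.cancel(orderId);
            }
            case INVOICE -> {
                inventory.release(orderId);
                orders.cancel(orderId);
            }
            case INVENTORY -> orders.cancel(orderId);
        }
    }
}
```

Multiply this by every multi-step business action in the system, and keep it in step with the services it undoes, forever.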
API contracts and versioning. In a monolith, a method signature change is a compiler error caught before deployment. Across services it is a potential production incident. OpenAPI specifications or event schemas in the schema registry. A versioning strategy for deploying new service versions while old ones are still running. Consumer-driven contract tests — an entirely new test layer that did not exist before.
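For contrast with a compiler-checked method signature, a minimal sketch of a versioned event envelope; the field names are illustrative. The event ID, the correlation ID, and the version field are all overhead the monolith never needed.

```java
// A versioned event contract (sketch; field names illustrative). Renaming a
// field here is not a compiler error in the consumer; it is a runtime failure
// in another team's service, discovered in production if the contract tests miss it.
public record OrderCreatedEvent(
        String eventId,        // consumed by the deduplication store
        String correlationId,  // consumed by the tracing stack
        int schemaVersion,     // consumers branch on this for as long as old events exist
        String orderId,
        String customerId) {
}
```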
Per-service operational overhead — multiplied by four. Each service needs its own CI/CD pipeline, its own database (shared databases between services defeat the architectural purpose), its own health checks, its own deployment configuration, its own secret management, and its own database migration strategy.
None of this is business logic. All of it requires expertise to operate correctly. In practice it means a platform or infrastructure team to own the broker and deployment infrastructure, application developers who understand distributed systems failure modes rather than just domain logic, and an ongoing operational load that scales with the number of services — not with the complexity of the domain.
The cost, made visible
The following table makes the prospective cost explicit — before the first line of business logic is written, and before five years have passed.
| Concern | Monolith | Microservices | What the split actually costs |
|---|---|---|---|
| Atomicity and failure | |||
| Rollback on failure | Database transaction. One word. | Saga pattern. Hundreds of lines. | Design, code, and test a compensating action per service per failure path. O(n²) compensating actions for n services. |
| Partial failure state | Impossible. Transaction is atomic. | Permanent possibility. Must be designed around. | Order exists, invoice does not. Every consumer of your data now reasons about completeness. Forever. |
| Consistency | Immediate. Guaranteed. | Eventual. A property you live with. | Not solvable with better tooling. A structural consequence of the boundary choice. |
| Infrastructure before business logic | |||
| Message broker | None. | Kafka or RabbitMQ. 3-node cluster. | Topic design, schema registry, retention policy, consumer group monitoring, local dev setup. |
| Saga / orchestration | None. | Axon / Temporal / hand-rolled plus a fifth service. | Durable saga state, crash recovery, step tracking. An entire service that owns zero domain concepts. |
| Distributed tracing | One stack trace. | Jaeger / Zipkin plus correlation IDs everywhere. | Every service propagates trace IDs in headers, event envelopes, and log output. Log aggregation stack on top. |
| Idempotency | N/A. In-process calls are never redelivered. | Required in every service. Always. | Deduplication store per service. Idempotency key strategy per event. Written, maintained, tested forever. |
| API contracts | Compiler. Free. | OpenAPI / schema registry plus versioning strategy. | Consumer-driven contract tests. A breaking change is a production incident. Another test layer that did not exist. |
| Per-service operational overhead | |||
| CI/CD pipelines | 1 | 4+ | Independent versioning, deployment windows, rollback strategies. Coordination overhead on every release. |
| Databases | 1 | 4+ | Independent migration strategies per service. Schema changes coordinated across deployment boundaries. |
| Local dev environment | One process. | 4+ services plus broker plus docker-compose. | Onboarding measured in days, not hours. Partial environments produce integration bugs that only appear in the full stack. |
| Debuggability and sustainability | |||
| Debug a production failure | One stack trace. One log stream. | Reconstruct a timeline across 4+ log streams. | Clock skew between services. Correlation IDs that were not propagated. Broker lag that shifted event order. |
| Bug surface | Domain complexity only. | Domain multiplied by accidental complexity. | Each async handoff is a new class of timing bug. Compensating paths run rarely, are tested inadequately, and fail in production. |
| Codebase legibility | Domain is the code. | Domain distributed across event schemas and API contracts. | "What does order creation actually do?" has no single answer. The behaviour is implicit in subscriptions across four codebases. |
| Maintenance cost over time | Proportional to domain complexity. | Domain plus accidental complexity. | Accidental complexity does not reduce over time. Services accumulate. Contracts fossilise. Framework versions break. Teams leave. |
| Scaling | |||
| Unit of scale | The atomic action. Run more instances. | Individual steps — which are not the bottleneck. | Invoice creation and shipment planning are simple writes. They are not traffic hotspots. The decomposition solves a problem that does not exist. |
| Infrastructure to scale | Load balancer plus N identical instances. | Everything above, multiplied. | All the saga, broker, and tracing infrastructure exists solely to reconstruct what the database transaction provided for free. |
The scaling argument that is rarely examined closely
The case for microservices typically rests on scalability. You can scale the parts that need scaling independently, rather than scaling everything together.
This sounds rational until you ask what actually needs scaling.
In an order creation flow, the bottleneck is almost never the invoice logic or the shipment record creation. These are simple writes that happen once per order. The thing that needs scaling is the number of concurrent orders being created — the atomic action as a whole.
Scaling the atomic action requires a load balancer and N identical instances of one deployed artefact. Each instance connects to one database. The database handles concurrent transactions reliably, as it has for decades. The infrastructure cost is a fraction of the distributed alternative. The operational complexity is a fraction. The failure surface is a fraction.
A well-modelled core domain is not large. This is not an aspiration — it is what remains when accidental complexity is removed. The essential logic of order-to-shipment fits comfortably in one process, understood by one team. What makes codebases large is not the domain. It is frameworks imposing their structure on domain code, duplication caused by unclear boundaries, accidental complexity accreting around poor models, and boilerplate generated by architectural patterns that do not fit the problem.
Strip those out and the core is small, fast to deploy, cheap to run, and trivially scalable as a unit.
The industry asked "how do we scale the parts?" before asking whether the parts needed to be separate. It then built an entire ecosystem of frameworks, patterns, and operational infrastructure to answer the first question — all solving a decomposition problem that, in most cases, did not need to exist.
When distribution is the right answer — and when the arguments do not hold
Distribution has genuine use cases. They are narrower than the industry's adoption rate suggests, and several of the most commonly cited justifications do not survive close examination.
Physical and regulatory constraints
The standard argument: if data must live in a specific jurisdiction for regulatory reasons, you need a distributed architecture.
The better answer: replicate the full domain logic into that regulatory cell. The atomic action stays atomic. The cell — with its own deployment, its own database, its own complete stack — is the unit of distribution. What you do not do is split the domain action across a jurisdictional boundary, routing parts of it between regions. That creates the coordination cost of distribution without the isolation that justified it. The constraint is geographic. The solution is geographic deployment of the whole, not decomposition of the parts.
Independent scaling profiles
The standard argument: if one component needs more scale than others, separating it avoids scaling everything unnecessarily.
The better answer: the cost of splitting a single component out of an otherwise coherent domain action is large, fixed, and permanent — as the table above makes clear. The question is not only "does this component need more scale?" but "does the benefit of isolating its scale exceed the full coordination cost of the split?" In most cases it does not, because the component that appears to need independent scaling is rarely the actual bottleneck under measurement, and because scaling the whole is cheaper than the industry assumes. If there is no compelling reason not to scale everything, scale everything. Simplicity requires a reason to abandon it, not a reason to adopt it.
Organisational boundaries
The standard argument: Conway's Law — systems tend to mirror the communication structures of the organisations that build them. If teams are separated, align the architecture accordingly.
Conway's Law is a useful observation in retrospect. It describes what tends to happen when architecture is not deliberately managed. It is not a prescription, and it should never be used as one. Using it as a justification for a service boundary is encoding organisational structure permanently into the system — and paying the technical cost of that boundary in every sprint, by every developer, for the lifetime of the product.
The cost of an artificially introduced service boundary compounds over years. The cost of reorganising a team is paid once. The engineering should define the ideal architecture with as few compromises as possible. The organisation should be arranged to serve that architecture, not the other way around. This pays dividends — perhaps not in year one, but reliably by year five, and every year thereafter. Teams that succeed with microservices often do so despite the architecture, through heroic platform investment and operational discipline. The patterns can be made to work. The deeper question is whether they were the right starting point for the domain in front of them.
Genuinely independent domain concepts
This is the one case where distribution has a legitimate technical argument — and even here, the bar should be high.
Domain concepts are genuinely independent when they have no transactional relationship with each other. Not merely different in name or ownership, but different in the sense that one completing or failing has no bearing on the integrity of the other. A recommendation engine and a payment processor are genuinely independent. An order and its invoice are not.
The strongest version of this argument comes from systems with a fundamentally asymmetric workload — a platform where reads vastly outnumber writes, where the read path has no transactional requirement, and where the scale difference between the two is large and proven. A social platform where the overwhelming majority of requests are reads with no transactional requirements is a system where isolating the read path separates two genuinely different kinds of work with different resource profiles and different failure tolerances.
But this is a workload argument supported by measurement, not an architectural principle applied by default. It applies to a small fraction of the systems that have adopted microservices, and it should be reached by evidence, not anticipated in advance.
Three tests before splitting a boundary
The rollback test. If this action fails halfway through, what does recovery cost? If the answer is a database rollback, the action belongs inside a single boundary. If the answer is a coordinated sequence of compensating calls across multiple services, each of which can itself fail, ask whether that coordination cost was consciously accepted — or simply inherited from a pattern that was never examined.
The scaling test. Which specific step in this action is the measured bottleneck under current or near-term load? Not the theoretical bottleneck. The step that is demonstrably the constraint today, under real conditions. If the answer is none of them individually, the action does not need decomposition. It needs more instances of the whole.
The standup test. In the daily standup, what language does the team use? If the items are about services, pipelines, brokers, schemas, and migrations — the team is working on accidental complexity. If the items are about domain concepts — what an order means, who owns a responsibility, what a rule actually requires — the team is working on the right problems. You do not need a cost model to apply this test. You need one conversation.
Measuring it in a system you already have
If these tests apply prospectively, they also apply to systems already in production. A short audit reveals more than any architecture review.
Count the sagas. How many business capabilities require a saga or orchestrator to complete? Each one is a boundary crossing that converted a rollback into a coordination problem. The number tells you how much of the domain is currently in transit.
Measure the standup ratio. Over two weeks, track how many standup items are about infrastructure, services, pipelines, and schemas versus domain concepts, rules, and business questions. The ratio is a direct reading of how much of the team's daily energy is absorbed by accidental complexity.
Trace a failure end to end. Pick a recent production incident. Count the number of log streams, services, and correlation IDs required to reconstruct what happened. That reconstruction cost — in time, in tooling, in expertise — is paid on every incident. It is the maintenance tax of the boundary choices made at design time.
Apply the migration heuristic. A well-modelled monolith can be split later, when measurement proves a specific boundary is warranted. A distributed system can rarely be reassembled cheaply once the boundaries have fossilised into contracts, event schemas, and separate team ownership. Optionality has value. The simpler starting point preserves it. The complex starting point spends it immediately, in exchange for flexibility that may never be needed.
First principles
There is nothing novel in the argument this article makes. It is an application of principles that engineering has held for as long as engineering has existed.
Minimise the moving parts. Every component that can fail will eventually fail. Every interface between components is a surface for misunderstanding, for version drift, for timing errors that only appear under conditions nobody anticipated. The system with fewer moving parts is not the primitive system — it is the disciplined one.
Solve the problem in front of you. The system that is over-engineered for scale it has not reached, for distribution it does not need, for independence that its domain does not have — that system is not prepared for the future. It is burdened by it. It is paying, today and every day, for problems it may never have.
Prefer reversibility. The decision that can be undone when it proves wrong is worth more than the decision that cannot, regardless of how confident you are at the time. A monolith that can be split later, when the evidence demands it, is a better starting point than a distributed system that cannot be reassembled after the evidence proves the split was premature.
Measure before you commit. The incomparability problem — the fact that the alternative architecture was never built, so its cost can never be directly compared — cannot be fully solved. But its worst effects can be mitigated by demanding evidence before committing to complexity: evidence of the scaling requirement, evidence of the domain independence, evidence that the coordination cost is worth the benefit it buys.
The software industry has a habit of adopting solutions before fully understanding the problems they were designed to solve, and then normalising the cost of those solutions until the cost becomes invisible. The distributed systems patterns that dominate today were developed by organisations with genuine physical distribution requirements, at a scale that a small fraction of systems ever reach. They solved real problems. They are also expensive, complex, and failure-prone in ways that compound over time and rarely appear on the original architectural diagram.
The question to ask, before any architectural decision, is not "how do others solve this?" It is "what does this problem actually require?" Start from first principles. Follow the cost. Build the simplest thing that genuinely solves the problem in front of you. Treat every boundary crossing — every point where a database rollback becomes a distributed coordination problem — as a commitment with a known, permanent price tag.
Because it will cost exactly that. Invisibly, continuously, and for as long as the system runs.