Manish Giri

Your Codebase Is Clean. Your Operations Are a Mess. Here's Why That Matters.

Most engineers I know care deeply about code quality. Clean architecture, good test coverage, sensible abstractions. They'll spend three hours in a PR review debating naming conventions.

Those same engineers will tolerate absolutely chaotic operational processes without blinking.

Deployments that take 45 minutes and require two people to babysit. Incident response that depends on whoever happens to be online. Onboarding processes that live inside the head of one senior engineer who is, of course, currently on vacation.

I've seen this pattern everywhere. And I think it comes down to a blind spot that the developer community has never really addressed: we treat operational excellence as a business problem, not an engineering problem.

It's both. And ignoring the business side is costing us.


What Operational Excellence Actually Means for Engineers

Operational excellence is one of those phrases that sounds like something a VP says in an all-hands meeting. But strip away the business jargon and what you get is basically: doing the right things, consistently, with as little waste as possible, and improving over time.

Sound familiar? It should. It's exactly what we try to do with software.

The Lean principle of eliminating waste maps almost perfectly onto the engineering instinct to remove unnecessary complexity. Six Sigma's obsession with reducing defects is just a fancier version of the instinct that makes us write tests. The continuous improvement loop in most OpEx frameworks is basically agile retrospectives, taken seriously.

The difference is that in software, we've built culture and tooling around these ideas. We have CI/CD pipelines, observability stacks, blameless postmortems, SLOs. We've operationalized the improvement loop.
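
To make "operationalized" concrete: an SLO turns reliability into an error budget you can compute and act on. A minimal sketch of that calculation, with invented traffic numbers:

```python
# A minimal sketch of an SLO error-budget check -- all numbers invented.

SLO_TARGET = 0.999  # 99.9% of requests should succeed this period

total_requests = 4_200_000   # hypothetical monthly traffic
failed_requests = 3_600      # hypothetical failures so far

error_budget = total_requests * (1 - SLO_TARGET)  # failures we can afford
budget_spent = failed_requests / error_budget     # fraction of budget used

print(f"error budget: {error_budget:.0f} failed requests")
print(f"budget consumed: {budget_spent:.1%}")

# This is where the loop gets operationalized: budget nearly spent means
# the team trades feature work for reliability work, by prior agreement.
if budget_spent > 0.75:
    print("budget nearly exhausted: prioritize reliability work")
```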

Outside of software, in the broader organization, most of that infrastructure doesn't exist. And when engineering teams interact with those parts of the business, things get messy fast.


The Hidden Cost of Operational Chaos

Let me be specific about where this actually hurts engineering teams.

Context switching from broken processes. When a business process upstream of your team is poorly designed, you absorb the chaos. Customer support tickets arrive with incomplete information because the intake form is bad. Finance requests reports in Excel because nobody ever built a proper data pipeline. Sales promises features based on a roadmap that engineering hasn't seen. Every one of these is an operational failure that lands in your backlog as rework.

Toil that never gets fixed. Google's SRE culture has a concept of toil: repetitive, manual, automatable work that doesn't improve the system over time. Engineering teams are good at identifying toil in their own workflows. They're much worse at identifying it in cross-functional processes, where the toil is distributed across teams and nobody owns fixing it.
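
One way to surface cross-functional toil is to inventory it the way you'd inventory any other cost. A rough sketch; the tasks and hours are invented:

```python
# A rough sketch of a toil inventory -- tasks and hours are invented.
# Each entry: (task, hours per week, number of teams it touches)
toil_log = [
    ("manually triage support tickets",       6, 2),
    ("copy deploy status into a spreadsheet", 2, 3),
    ("re-run the flaky onboarding checklist", 4, 1),
]

for task, hours_per_week, teams in toil_log:
    # Cross-functional toil is the easiest to miss: the cost is real,
    # but it's spread thin enough that no single team feels ownership.
    print(f"{task}: ~{hours_per_week * 52} hrs/yr across {teams} team(s)")
```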

Invisible technical debt in process. We talk constantly about technical debt in code. We almost never talk about process debt: the accumulated weight of workflows that made sense two years ago but now slow everything down. Process debt compounds just like technical debt, and it's harder to refactor because you can't just run a test suite to check if you broke something.


Why Engineers Avoid This Problem

Honestly? Because it's uncomfortable.

Fixing a broken deployment pipeline is satisfying. You can measure the before and after. You own the system. You can ship the fix in a PR.

Fixing a broken cross-functional process means talking to people who don't think the way you do, navigating organizational politics, and accepting that progress will be slower and messier than you'd like. It means influencing without authority. It means sitting in meetings.

Most engineers would rather rewrite a service in Rust.

But here's the thing: the engineers who move into principal engineer, staff engineer, and engineering leadership roles are almost always the ones who figured this out. At some point, the ceiling on your technical impact is determined by the quality of the processes around your technical work.

You can build a perfect system. If the process that feeds it with requirements, operates it in production, and responds to its failures is broken, the system will underperform. Always.


What Good Actually Looks Like

I want to be concrete about this because "operational excellence" can feel abstract.

Good looks like: An on-call rotation where every alert is actionable, the runbooks are current, and the person on call knows what they're supposed to do. Not because everyone memorized it, but because the process was designed carefully and is maintained deliberately.

Good looks like: A deployment process where the happy path is fully automated, the failure modes are understood and handled gracefully, and the rollback procedure has actually been tested. Not just documented (there's a sketch of this below).

Good looks like: A project intake process where engineering gets involved early enough to give useful input, not three days before a deadline when the requirements are already locked.

None of this is technically complex. All of it requires deliberate process design, measurement, and continuous improvement. Which is to say, it requires treating operations the same way we treat software.
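
Here's the promised sketch of the deployment example. The release and health-check functions are stand-ins for real tooling; the point is that rollback is an ordinary code path, so it can be tested like one:

```python
# A toy sketch of a deploy whose rollback path is exercised, not just
# documented. `release` and `health_check` stand in for real tooling.

def release(version: str) -> None:
    print(f"releasing {version}")

def health_check(version: str) -> bool:
    # Stand-in: a real check would probe endpoints, error rates, etc.
    return not version.endswith("-bad")

def deploy(new: str, previous: str) -> str:
    """Deploy `new`; roll back to `previous` if health checks fail."""
    release(new)
    if health_check(new):
        return new
    release(previous)  # rollback is an ordinary code path...
    return previous

# ...so it can be tested like one:
assert deploy("v42", "v41") == "v42"      # happy path
assert deploy("v43-bad", "v42") == "v42"  # failure triggers rollback
print("rollback path tested")
```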


The Measurement Problem

One reason operations stay broken is that we don't measure them.

We measure everything in our systems. Request latency, error rates, throughput, CPU utilization. We have dashboards for all of it. We set alerts, we track trends, we do capacity planning.

We almost never apply the same rigor to our operational processes. How long does it actually take to onboard a new engineer, end to end? What percentage of incidents are caused by the same class of problem recurring? How much engineering time is spent on rework caused by requirements changing late in a project?

If you don't measure it, you can't improve it. And if you don't improve it, it quietly gets worse as the organization grows.

The good news is that measurement doesn't have to be complex. Starting with cycle time, defect rate, and lead time for change gives you enough signal to identify where the biggest problems are. From there, you can prioritize what to fix.
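
As a sketch of how little tooling that first pass needs, here's a cycle-time and rework-rate calculation over ticket data. The records are invented; real data would come from your tracker's export or API:

```python
# A rough sketch of process metrics from ticket data -- records invented.
from datetime import date
from statistics import median

# (opened, closed, caused_rework)
tickets = [
    (date(2024, 3, 1), date(2024, 3, 8),  False),
    (date(2024, 3, 2), date(2024, 3, 20), True),
    (date(2024, 3, 5), date(2024, 3, 9),  False),
    (date(2024, 3, 7), date(2024, 3, 30), True),
]

cycle_times = [(closed - opened).days for opened, closed, _ in tickets]
rework_rate = sum(rework for _, _, rework in tickets) / len(tickets)

print(f"median cycle time: {median(cycle_times)} days")
print(f"rework rate: {rework_rate:.0%}")
```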


A Broader Point About Recognition

There's a related problem in the operational excellence world that I think is worth mentioning, because it affects how knowledge gets shared.

The teams and organizations doing genuinely excellent operational work are almost never the ones you hear about. Conference talks go to companies with good marketing teams. Awards programs tend to favor large enterprises with dedicated submissions staff. The small manufacturer that figured out a brilliant solution to a chronic quality problem, or the engineering team at a mid-size company that designed a genuinely elegant incident response process, just quietly does the work and moves on.

This is changing. There are newer platforms, like World Opex, that are trying to surface this kind of work more broadly, with a focus on recognizing results regardless of company size or methodology. The idea of rating rather than ranking (so multiple organizations can be recognized on their own merits rather than competing against each other) seems like the right direction.

More visibility into what excellent operations actually look like, from organizations of all sizes and sectors, is good for everyone. It's how the field learns.


Where to Start If Your Processes Are a Mess

If you've read this far and are thinking about your own team or organization, here's a practical starting point.

Pick one process that causes the most pain. Not the most complex one, the most painful one. The thing that makes people groan when it comes up. That's your highest-leverage target.

Map it as it actually exists, not as it's supposed to work. Talk to everyone involved. Draw out every step, every handoff, every place where things get stuck or go wrong. The gap between the official process and the real process is usually where most of the waste lives.

Measure the current state. How long does it take? How often does it fail? What does failure look like? You need a baseline before you can claim improvement.

Make one change, measure the effect, repeat. Not a big redesign. One change. See what happens. Adjust. This is the improvement loop, applied to process (there's a sketch after these steps).

Write it down. The most fragile operational processes are the ones that live in someone's head. Documentation is not glamorous. It is incredibly high leverage.
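
And the sketch promised in step four: capture a baseline, make one change, compare. The numbers are invented:

```python
# A sketch of the one-change-at-a-time loop -- all numbers invented.
baseline = [31, 28, 35, 30]  # e.g. onboarding time in days, before
after = [24, 22, 27]         # same measurement, after one change

def avg(xs: list[int]) -> float:
    return sum(xs) / len(xs)

print(f"baseline: {avg(baseline):.1f} days, after: {avg(after):.1f} days")
print(f"delta: {avg(after) - avg(baseline):+.1f} days")
# Keep the change, adjust it, or revert it -- then repeat with the next one.
```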


Final Thought

The best engineers I've worked with had a quality in common that I didn't appreciate early in my career: they cared as much about how work happened as about what got built.

Not just their own work. The whole system. The processes, the handoffs, the feedback loops, the places where information got lost or delayed. They treated operational problems with the same intellectual seriousness they brought to technical problems.

That's not a soft skill. It's an engineering mindset applied to a wider system.

And it's probably the highest-leverage thing most engineers could do with their next six months.
