The Happy Path Is a Lie We Tell Ourselves
Three engineers sit in a design review for a new temperature monitoring system. The architecture is clean: sensors report to edge controllers, controllers aggregate to a gateway, gateway publishes to the cloud. Redundancy at every layer. Graceful degradation clearly marked on the diagram.
Someone asks: "What happens if the gateway loses connectivity?"
"It buffers locally for up to 72 hours," comes the answer. "More than enough for any reasonable outage."
Everyone nods. The design is approved.
Two years later, a cellular provider pushes a configuration update that breaks connectivity for five days. The gateway's flash memory fills in eighteen hours. When connectivity returns, 96 hours of critical temperature data is gone. The system worked exactly as designed; it just wasn't designed for this.
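The gap between "72 hours" and "eighteen hours" isn't mysterious; it's arithmetic resting on assumptions. Here is a back-of-envelope sketch in Python, with sensor counts, sample rates, and record sizes invented purely for illustration (the real incident's numbers aren't given), showing how a nominally 72-hour buffer shrinks to a fraction of that when one assumption about record size turns out to be wrong.

    # Illustrative buffer-sizing arithmetic. Every number below is an assumption
    # invented for this sketch, not taken from the incident described above.
    FLASH_BUDGET_BYTES = 512 * 1024 * 1024   # flash reserved for buffering
    SENSOR_COUNT = 200                        # sensors reporting through the gateway
    SAMPLE_INTERVAL_S = 10                    # one reading per sensor every 10 seconds

    def buffer_hours(record_bytes: int) -> float:
        """Hours of data the flash budget can hold at a given record size."""
        bytes_per_hour = SENSOR_COUNT * (3600 / SAMPLE_INTERVAL_S) * record_bytes
        return FLASH_BUDGET_BYTES / bytes_per_hour

    # The design review's claim, assuming a compact ~100-byte binary record:
    print(f"{buffer_hours(100):.0f} hours")   # ~75 hours, close to "72 hours"

    # The field reality, if records ship as verbose JSON plus retry metadata:
    print(f"{buffer_hours(400):.0f} hours")   # ~19 hours, gone long before day five

The point isn't these particular numbers. It's that the headline figure in the review was only as good as an unstated assumption about record size, and nobody in the room was asked to defend that assumption.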
The happy path is seductive because it lets us move forward. But it's also a lie we tell ourselves, and design reviews are where that lie gets institutionalized.
Why Design Reviews Reward Optimism
Design reviews are supposed to find problems. In practice, they often do the opposite: they create consensus around the assumption that problems won't happen.
Here's how it works:
You present a design. You show the nominal flow. You mark the error paths. You explain the redundancies. And then the questions come, but they come in a specific form: "What if X fails?" You answer with your contingency for X. "What if Y happens?" You show your handling of Y.
Each question asked and answered creates the illusion that you've covered the space of possible failures. What's harder to see is the space of failures that weren't asked about: not because they're impossible, but because they're hard to imagine from a conference room.
The cellular configuration update that breaks connectivity for five days isn't a question anyone asks, because connectivity outages are supposed to last hours, not days. The sensor that doesn't fail cleanly but drifts slowly enough to stay within acceptance bounds for months isn't on the checklist, because sensors are supposed to either work or fail obviously. The operator who learns to bypass an interlock because it trips too often during valid operations isn't in the room, because we're reviewing technical design, not operational reality.
Design reviews optimize for demonstrable coverage of known failure modes. They reward the ability to show that you've thought about things. But "thinking about things" is different from designing for them, and both are different from experiencing them.
This creates a subtle bias: the designs that pass review most easily are the ones that look most robust on paper. Clean boundaries. Clear error handling. Documented assumptions. The designs that acknowledge fundamental uncertainties ("we don't know how operators will actually use this," "we can't predict how this will interact with the legacy system at scale") feel incomplete. They sound like excuses.
So we learn to present certainty. We learn to have answers. And slowly, the happy path becomes the only path we're willing to defend.
How Organizations Institutionalize Blind Spots
Individual engineers know the happy path is optimistic. But organizations have a way of turning individual caution into collective blindness.
Consider what happens after that design review. The design is approved. It goes into a requirements document. The requirements become tasks. The tasks become sprints. And at each translation, something is lost.
The subtle caveat ("this assumes network partitions are transient") becomes "handles network failures." The hedge ("buffering capacity should be sufficient for typical outages") becomes "72-hour buffer." The uncertainty ("we'll need to monitor how this behaves under load") becomes a checkbox: "performance tested."
This isn't malice. It's how organizations create actionable plans from ambiguous reality. But in the process, assumptions get promoted to facts, and facts get encoded into architecture.
Worse, once something is designed and built, it becomes expensive to question. Not just financially expensive; politically expensive. The team that built it has invested in it. Managers have reported progress on it. Customers have been promised it. To say "we need to rethink this" is to say all that investment might have been misdirected.
So instead, we add compensating controls. We build monitoring. We write runbooks. We train operators. Each addition reinforces the original design rather than questioning it. We're not asking "should this system exist in this form?" We're asking "how do we make this system work?"
I've sat in meetings where everyone in the room privately knew a system was brittle, but no one said it directly because the system was already in production and replacing it would be a six-month project. Instead, we talked about "hardening" and "resilience improvements," language that suggested we were making something robust rather than patching something fundamentally fragile.
The organization's immune system had learned to reject the observation that the system was designed wrong, because accepting that observation would require acknowledging that a lot of other decisions were also wrong.
Most Failures Are Designed
Here's an uncomfortable truth: most production failures aren't accidents. They're not the result of bugs that slipped through testing or edge cases that no one thought of.
They're the inevitable outcome of decisions made under constraints.
That temperature monitoring system that lost 96 hours of data? The decision to use flash memory with limited write endurance was made to hit a cost target. The decision to buffer for 72 hours was made based on historical uptime data from a different cellular provider. The decision not to implement hierarchical buffering (edge → gateway → cloud) was made to keep the architecture simple and shippable.
None of those were wrong decisions in isolation. Given the constraints (budget, schedule, what was known at the time), they were defensible. Reasonable, even.
But they were decisions, not accidents. And decisions have consequences that aren't always visible until they compound.
This is what I mean when I say most failures are designed. Not that anyone set out to build a fragile system, but that fragility is often the natural result of optimizing for other things: speed, cost, simplicity, familiarity.
The difference between a bug and a decision is that bugs can be fixed. Decisions are encoded into the architecture. They become load-bearing assumptions. You can't fix them without rethinking the system.
When an incident review concludes "the system worked as designed, but we didn't anticipate this scenario," what it's really saying is: "we made tradeoffs, and this is what we traded away."
The question is whether we're honest about what we're trading. Most of the time, we're not, because being honest would make it harder to get the design approved.
Why Test, Staging, and Simulation Always Mislead
Every environment before production is a curated experience.
Test environments use clean data. Staging uses a subset of production scale. Simulations use models that abstract away complexity. Even load testing is fundamentally artificial: you generate the load, you choose when to apply it, you know what you're testing for.
This isn't a criticism of testing. Testing is essential. But it's also fundamentally limited, and we consistently underestimate how limited it is.
Here's what test environments can't show you:
They can't show you emergent behavior. That interaction between the VFD noise and the CAN bus that only manifests when both systems are under load and the ambient temperature is above 30°C? Your test bench runs at 22°C with one system at a time.
They can't show you operational reality. The operator who learned that if you cycle the power on the HMI in a specific sequence, you can temporarily bypass an error condition that otherwise requires a maintenance window? They discovered that in production, under pressure, when the line manager was demanding a workaround.
They can't show you the full dependency graph. You tested integration with the ERP system. You didn't test integration with the ERP system while the network team is doing maintenance and traffic is rerouting through a secondary path with higher latency and the backup MES is handling requests because the primary is being patched.
They can't show you time. That sensor calibration drift that takes six months to become problematic? Your acceptance test runs for six hours.
They can't show you organizational dynamics. Who gets called when something ambiguous happens at 2 AM? Who has the authority to make the call to shut down the line? Who actually knows where the documentation is? None of that exists in staging.
Production is the only environment where all the variables are real, all at once, without anyone curating the experience.
What Production Pressure Reveals That Design Never Does
There's a specific kind of knowledge that only emerges under production pressure, and it has nothing to do with technical design.
It's the knowledge of what actually matters.
In design, everything matters equally. Every requirement is important. Every failure mode is worth handling. Every dependency is documented. But in production, under time pressure, with costs accumulating, you discover very quickly what's truly critical and what was just conceptually important.
You discover that the "critical" alert that fires three times a day gets ignored, while the informal message in Slack from the night shift operator gets immediate attention because everyone knows that operator only speaks up when something is genuinely wrong.
You discover that the official escalation path (submit a ticket, wait for triage, get assigned to the on-call engineer) is too slow, and there's an unofficial path where certain people have certain phone numbers and that's what actually gets used when things are breaking.
You discover that the monitoring dashboard everyone insisted on building never gets looked at during incidents, but everyone opens the same three SSH sessions to check the same three log files because that's where the useful information actually lives.
You discover that the runbook that took weeks to write is useless because it assumes you have time to read it, and in production you don't, so people fall back on intuition, pattern matching, and educated guesses.
None of this is visible during design because design happens in an environment where time is less expensive and mistakes are reversible. Production pressure doesn't just reveal technical problems. It reveals organizational truth.
It shows you which abstractions hold up and which ones collapse. It shows you where authority actually lies versus where the org chart says it lies. It shows you which skills matter, and they're often not the ones that got people promoted.
What This Means for How We Design
I'm not arguing that we should abandon design reviews or testing or staging environments. They serve a purpose. The problem is that we treat them as validation rather than exploration.
We treat passing a design review as evidence that the design is good. It's not. It's evidence that the design is defensible given what we know now, in this room, without production pressure.
We treat staging as a high-fidelity simulation of production. It's not. It's a curated environment that shares some properties with production and systematically excludes others.
If we were honest about this, we'd design differently. We'd spend less energy trying to anticipate every failure mode and more energy building systems that can be understood and modified when inevitably surprising failures occur.
We'd document not just what the system does, but what we assumed about the environment it operates in. Not as a CYA exercise, but as a genuine artifact that helps the next person understand what we were thinking, and where we were probably wrong.
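One lightweight way to do that, sketched below in Python with entries invented for the temperature-monitoring example (none of these claims or signals come from a real system): keep each environmental assumption next to the code as data, with the basis for believing it and the production signal that would show it no longer holds.

    # A minimal sketch of recording environmental assumptions as an artifact
    # that ships with the system. The specific entries are invented examples.
    from dataclasses import dataclass

    @dataclass
    class Assumption:
        claim: str    # what we believed about the environment at design time
        basis: str    # why we believed it
        signal: str   # what to watch in production to see whether it still holds

    ASSUMPTIONS = [
        Assumption(
            claim="Connectivity outages last hours, not days",
            basis="Historical uptime data from the previous cellular provider",
            signal="Age of the oldest unsent record on the gateway",
        ),
        Assumption(
            claim="72 hours of local buffering covers any realistic outage",
            basis="Estimated record size of roughly 100 bytes per reading",
            signal="Actual bytes written to the buffer per hour vs. the estimate",
        ),
    ]

    def report() -> None:
        """Print the assumptions so they live with the code, not in a slide deck."""
        for a in ASSUMPTIONS:
            print(f"- {a.claim}\n  basis: {a.basis}\n  watch: {a.signal}")

    if __name__ == "__main__":
        report()

The format matters less than the habit: when one of those signals fires in production, the artifact points straight at the assumption that just stopped being true.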
We'd treat the first six months in production not as "stabilization" but as "learning what we actually built," because that's what it is.
The happy path isn't useless. It's necessary to ship anything at all. But it's a lie we tell ourselves to make progress in the face of uncertainty.
The question is whether we remember it's a lieâor whether we start believing it.
Because the system will tell the truth eventually.
It always does.
