The First Incident Changes Everything
The pager goes off at 2:47 AM. You're the firmware lead for a new motor control system that's been in production for six weeks. It's been running fine. No issues. And then: "Line 3 emergency stop. Controllers unresponsive. Production halted."
You're on a call by 2:52. The plant manager is already there, in person, at the facility, at 3 AM. The line is still down. Every minute costs money you've only thought about abstractly until now. The operators are watching. Maintenance is standing by. And everyone is looking at your system.
Not metaphorically looking. Actually standing in front of the cabinet, pointing at LEDs, asking what they mean.
You realize, with a clarity that's almost physical, that you don't know. You know what they're supposed to mean. You wrote the documentation. But right now, with the system in an unknown state, you're not sure. You're running mental simulations, trying to map LED patterns to states, wondering if the watchdog triggered or if it's a CAN bus fault or if something else entirely happened.
The plant manager asks: "How long until we're back up?"
This is the moment. Not the moment you become an experienced engineer. The moment you realize you weren't one yet.
When Ownership Becomes Real
Before your first real incident, ownership is theoretical. You own the firmware, sure. Your name is on the commit history. You wrote the architecture doc. You presented at the design review.
But that's ownership as authorship. Ownership as credit.
Real ownership is different. Real ownership is: when this breaks, you get called. When it misbehaves, you explain why. When someone needs to make a decision about whether to push forward or shut down, they're looking at you for information that might not exist yet.
The shift happens fast. One moment you're proud of what you built. The next moment you're responsible for it in a way that makes pride seem quaint.
I remember my first significant incident. A PLC I'd written firmware for started dropping Modbus requests unpredictably. Not often, maybe one in ten thousand. Enough to be noticeable in production but rare enough that we'd never caught it in testing. The system had error handling. It would retry. But the retries added latency, and that latency cascaded into other systems, and suddenly a water treatment facility was dealing with tank levels that were drifting outside normal operating ranges.
No one was angry. Everyone was professional. But I could feel the weight of it: this wasn't my code anymore. It was their water treatment system. Their compliance requirements. Their operators who had to explain to management why levels were fluctuating.
The code was the same. The system was the same. But what it meant was different.
How Teams Behave Under Public Failure
Here's something they don't teach you: teams change during incidents.
The usual dynamics are suspended. The person who never speaks up in meetings suddenly has crucial information. The senior engineer who usually drives decisions steps back and lets the person closest to the problem lead. Or sometimes the opposite happens: someone who's normally collaborative becomes territorial, defensive.
You learn who stays calm and who doesn't. Who asks clarifying questions and who jumps to conclusions. Who admits when they don't know something and who deflects. Who thinks about the system and who thinks about blame.
This isn't about good people versus bad people. It's about how pressure reveals what people actually optimize for when the stakes are real.
I've watched a junior technician with three months on the job diagnose a problem faster than anyone else because they'd been the one doing the daily rounds and noticed a pattern no one else had seen. I've watched a principal engineer who designed the original system defer to an operator who knew which breaker "acted funny" in a way no documentation captured.
I've also watched teams spiral. Someone suggests a theory, and instead of testing it, everyone starts building on it. Someone makes a change without announcing it, and now two people are modifying the same system with different assumptions. Someone gets frustrated and starts blaming the vendor, the previous team, the budget constraints: anything except the current problem.
The best incident response I've seen wasn't the fastest. It was the most deliberate. Someone (it happened to be a controls engineer who'd been with the company for fifteen years) said: "We don't understand what state we're in. Before we change anything, let's gather data." And then, critically: "Who's writing this down?"
That changed everything. Not because documentation mattered in the moment, but because it forced everyone to externalize their mental model. When you have to say out loud what you think is happening and someone writes it down, you think differently. You catch your own assumptions.
What Broke Wasn't the System
After the incident, there's always a postmortem. And almost always, the postmortem focuses on the technical failure.
In that motor control incident at 3 AM, the technical root cause was clear: a race condition in the watchdog handling code. Under specific timing conditions, the watchdog could time out while a critical section was locked, and the recovery path assumed a clean state that didn't exist. Textbook bug. We fixed it in forty lines of code.
But that wasn't what broke.
What broke was that the firmware had been written with assumptions about the execution environment that were true on the test bench and false in production. The test bench had consistent loop timing. Production had variable timing because of interrupt patterns we hadn't characterized. We'd tested the watchdog recovery path, but we'd tested it by triggering the watchdog deliberately, not by letting it trigger from timing variance.
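To make the shape of that bug concrete, here's a minimal sketch in C. It is not the production code; the names (motor_state_t, wdt_recovery) and the two-field state are invented for illustration. The point is just the shape: a watchdog timeout landing between two coupled writes hands the recovery path a state it was never designed to see.

```c
/* Minimal, hypothetical sketch of the race; not the production firmware.
 * motor_state_t, control_loop_step, and wdt_recovery are invented names. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t target_rpm;
    uint16_t ramp_step;    /* derived from target_rpm; must stay in sync with it */
    bool     consistent;   /* true only when both fields agree */
} motor_state_t;

static motor_state_t g_state;   /* assume this survives the watchdog recovery */

static void apply_outputs(const motor_state_t *s) { (void)s; /* drive the motor */ }

/* Control loop: updates two coupled fields inside what was, in the real code,
 * a locked critical section. On the bench this always finished well inside the
 * watchdog window; in production, interrupt load occasionally stretched it
 * past the timeout. */
void control_loop_step(uint16_t new_rpm)
{
    g_state.consistent = false;
    g_state.target_rpm = new_rpm;
    /* --- a watchdog timeout landing here leaves g_state half-written --- */
    g_state.ramp_step  = new_rpm / 16u;
    g_state.consistent = true;
    apply_outputs(&g_state);
}

/* Recovery path after the watchdog fires. The bug: it trusted g_state
 * unconditionally. A fix is to validate (or rebuild) the state before
 * resuming and fall back to a genuinely safe default otherwise. */
void wdt_recovery(void)
{
    apply_outputs(&g_state);   /* may resume from a half-written state */
}
```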
Okay, but that still sounds technical. Go deeper.
Why didn't we characterize the interrupt patterns in production? Because commissioning was scheduled tight, and characterization takes time, and we had twelve other systems to bring up.
Why was commissioning scheduled tight? Because the project was already over budget, and the plant needed the new line operational before the next quarter.
Why was the project over budget? Partly because the scope grew during implementation, partly because integration with legacy systems took longer than estimated, partly because we'd optimized the bid to win the contract.
Who decided that the commissioning schedule was more important than thorough characterization? No one, explicitly. It was a collective drift. A series of small decisions that each made sense locally but accumulated into a gap where critical testing didn't happen.
Here's what actually broke: the communication path between the firmware team and the commissioning team. We didn't have a clear handoff protocol for "things that need to be validated in production that we can't validate in the lab." That's not a technical problem. That's an organizational design problem.
And beneath that: the incentive structure. Commissioning engineers were measured on schedule adherence. Firmware engineers were measured on feature completion. No one was measured on "probability we understand the system's actual behavior in production."
So we didn't do it. Not because anyone was negligent. Because the organization wasn't designed to value it.
Why Technical Postmortems Often Miss the Point
Most postmortems I've read follow a pattern:
- Incident summary: What happened and when
- Impact: Who was affected and how
- Root cause: The technical failure
- Contributing factors: Other technical issues
- Action items: Technical fixes
This format is comfortable because it suggests the problem was technical and therefore fixable. Change the code, add a check, improve the test suite. Done.
But look at what's missing:
- Why did the design review not catch this failure mode?
- What information existed that wasn't shared?
- Which tradeoffs were made consciously versus unconsciously?
- What incentives led to those tradeoffs?
- How did the team's understanding of the system differ from reality?
These questions are harder because the answers implicate decisions, not just code. They make people uncomfortable because they suggest the incident wasn't just bad luck or an obscure bug; it was predictable given how the system was designed and how the organization operates.
I once sat in a postmortem for a system that failed because of insufficient input validation. The root cause was listed as "missing bounds check." The action item was "add bounds checking to all inputs."
But the actual story was different. The bounds checking had been in the original design. It was removed during optimization because it added 2ms to the control loop latency, and the spec required sub-10ms response time. The decision to remove it was documented in a code review but not escalated. The person who reviewed it assumed someone else would catch it in integration testing. The integration testing was abbreviated because commissioning was behind schedule.
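For a sense of scale, here's roughly what that kind of check looks like; the names and limits below are hypothetical, and the 2ms presumably came from validating many inputs per cycle, not from one comparison. The code isn't the interesting part. The decision to delete it, and who was supposed to notice, is.

```c
/* Illustrative only: the kind of validation that got optimized away.
 * setpoint_cmd_t, SETPOINT_MIN/MAX, and the limits are invented names/values. */
#include <stdbool.h>
#include <stdint.h>

#define SETPOINT_MIN  0
#define SETPOINT_MAX  5000   /* whatever the actuator can physically tolerate */

typedef struct {
    int32_t setpoint;
    int32_t rate_limit;
} setpoint_cmd_t;

static bool cmd_is_valid(const setpoint_cmd_t *cmd)
{
    if (cmd->setpoint < SETPOINT_MIN || cmd->setpoint > SETPOINT_MAX) {
        return false;
    }
    return cmd->rate_limit > 0;
}

void handle_setpoint(const setpoint_cmd_t *cmd)
{
    if (!cmd_is_valid(cmd)) {
        return;   /* reject and keep the last known-good command */
    }
    /* ... apply the command to the control loop ... */
}
```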
The missing bounds check was the mechanism of failure. But what failed was a web of assumptions about who was responsible for verifying what, under what time pressure, with what communication overhead.
Fixing the bounds check addressed the symptom. It didn't address why a known-critical validation got removed without proper scrutiny.
"The System Worked Exactly as Designed"
This phrase appears in postmortems as a conclusion. It's meant to close the investigation: there was no malfunction, just an unanticipated scenario.
But it's the most dangerous phrase in incident analysis because it's almost always true, and almost always misunderstood.
Yes, the system worked as designed. The problem is that "the design" includes far more than you documented.
It includes the implicit assumptions you made about the operating environment. The tradeoffs you accepted to meet the schedule. The risks you deprioritized because they seemed unlikely. The monitoring you didn't implement because it wasn't in scope. The operational knowledge that exists only in people's heads.
When the motor controllers became unresponsive at 2:47 AM, they worked exactly as designed: the watchdog detected a timeout, triggered the recovery path, and entered a safe state. The recovery path was designed to reset the system to a known-good state. It did that.
What it didn't do was account for the scenario where the "known-good state" wasn't actually safe because other systems were depending on outputs that were now frozen at their last value. That wasn't a bug in the recovery logic. That was a gap in the system-level design.
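A sketch of the difference, with invented names (enter_safe_state, io_*): the first function is roughly what the recovery path did, the second is what a system-level notion of "safe" would have required. Neither is the real code; the contrast is the point.

```c
#include <stdint.h>

typedef enum { MODE_RUN, MODE_SAFE } sys_mode_t;

static sys_mode_t g_mode = MODE_RUN;

static void io_hold_outputs(void)             { /* outputs freeze at their last value */ }
static void io_force_outputs_to_default(void) { /* drive outputs to defined safe levels */ }
static void bus_broadcast_fault(void)         { /* tell dependent systems we're down */ }

/* What the recovery path effectively did: locally correct, globally ambiguous.
 * Dependent systems kept reading outputs frozen at stale values. */
void enter_safe_state_v1(void)
{
    g_mode = MODE_SAFE;
    io_hold_outputs();
}

/* What the system-level design needed: make the failure visible and the
 * outputs unambiguous, so downstream systems stop acting on stale data. */
void enter_safe_state_v2(void)
{
    g_mode = MODE_SAFE;
    io_force_outputs_to_default();
    bus_broadcast_fault();   /* e.g., a status frame others can time out on */
}
```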
But more than that: it was a gap in how we thought about the system. We'd designed the controller firmware in isolation, with defined interfaces. We'd assumed those interfaces were sufficient to capture the dependencies. They weren't. The actual dependencies included timing relationships, implicit state machines in other systems, and operator expectations that weren't written anywhere.
The system worked exactly as designed. The design was incomplete.
The Difference Between Intent and Reality
There's a moment during every significant incident where you realize: we built what we said we'd build. We just didn't build what was needed.
Not because we were incompetent. Because reality is richer than specification.
The spec said "handle watchdog timeout." It didn't say "handle watchdog timeout in a way that doesn't leave dependent systems in an ambiguous state." That seemed implied. It wasn't.
The spec said "buffer sensor data during communication loss." It didn't say "buffer data in a way that's meaningful when different sensors lose communication at different times." We assumed synchronization. The assumption was wrong.
The spec said "provide HMI feedback for fault conditions." It didn't say "provide feedback that operators can interpret correctly at 3 AM when multiple faults are active simultaneously." We tested each fault individually. They don't happen individually in production.
Intent and reality diverge in the space between what you specify and what you assume. And you don't discover what you assumed until it's tested in an environment that doesn't share your assumptions.
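Take the sensor-buffering example above: the gap between the assumption and the reality fits in two data structures. This is a sketch with invented names, not the actual firmware, but it shows how an unstated assumption (one shared index means one shared timeline) becomes load-bearing.

```c
#include <stdint.h>

/* What we implicitly assumed: all channels sample and drop out together,
 * so index i of every array refers to the same instant. */
typedef struct {
    int16_t  temp[64];
    int16_t  pressure[64];
    uint16_t head;           /* one shared write index for every channel */
} synced_buffer_t;

/* What production needed: channels lose communication independently, so each
 * sample has to carry its own capture time to stay interpretable later. */
typedef struct {
    uint32_t t_ms;           /* capture time, not arrival time */
    int16_t  value;
    uint8_t  channel;
    uint8_t  valid;
} sample_t;

typedef struct {
    sample_t ring[256];
    uint16_t head;
} timestamped_buffer_t;
```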
What Changes After
The first incident changes everything because it changes what "working" means.
Before: working meant passing tests, meeting specifications, behaving correctly under expected conditions.
After: working means remaining understandable and recoverable when conditions aren't expected. When you don't have complete information. When the people responding aren't the people who built it. When time is expensive.
You start designing differently. Not necessarily better; sometimes the right design is still the simple one. But you design with the knowledge that the system will eventually teach you what you got wrong, and the question is whether you've made it possible to learn that lesson without catastrophic cost.
You write different documentation. Not more documentation; more documentation is often worse. But documentation that captures why, not just what. Documentation that names the assumptions. Documentation that future-you, woken up at 3 AM and trying to understand what state the system is in, will actually use.
You test differently. Not just testing the happy path and the error paths. Testing the transitions. Testing what happens when error paths stack. Testing what happens when recovery takes longer than expected. Testing what happens when the test itself is wrong.
And you think about ownership differently. Not as credit, but as responsibility. Not as "I built this," but as "I'm accountable for what this does in an environment I don't control."
That motor control system got fixed. The race condition was patched. We added instrumentation to characterize interrupt timing in production. We updated commissioning procedures to include the checks we'd skipped.
But more than that: I stopped being surprised when systems behaved in ways I hadn't anticipated. I started assuming they would. And I started designing for that assumption.
Because the first incident taught me something no design review ever could: the system you build is not the system that runs in production.
Production adds variables you didn't model. Pressures you didn't simulate. Dependencies you didn't document. And people who need answers you don't have yet.
The question isn't whether you'll face that gap. The question is whether you've designed for it.
Most of the time, you haven't.
And that's okay, as long as you remember it.
The first incident won't be your last. But if you learn from it, the next one will be different.
Not easier. Just different.
And maybe, slowly, you'll build systems that fail in ways that are less surprising, less catastrophic, and more recoverable.
That's not mastery. That's just experience.
And it starts with that first call at 2:47 AM.
