
How to handle an Incident in Production

Published on Fri Feb 09

Learn PAAF: Protect, Analyze, Alert, Fix. From saving lives to debugging code, it's the universal process for handling emergencies effectively.

Protéger (Protect)

Examiner (Analyze)

Alerter (Alert)

Soigner (Fix)

Behind this French 🇫🇷 acronym lies the process to follow when you have to rescue someone. It's one of the first things you learn when you take a first aid course with the Croix-Rouge Française. In English 🇬🇧, it's less impactful because the acronym doesn't spell a vegetable :). It would be something like Protect, Analyze, Alert, Fix (PAAF!).

Interestingly, we follow the exact same process when we have to handle an emergency in our running applications or infrastructure.

Before digging into the process, how do we become aware of the problem?

If we have proper alerting and monitoring, it can come from the alerting system, like PagerDuty. Otherwise, it can come from another human: yourself, someone within the company, or worse, a customer via social media. That's why we need to act fast!
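To make this concrete, here is a minimal sketch (not from the original post) of how a monitoring check could raise an alert through PagerDuty's Events API v2. The routing key, the "checkout-api" source name, and the summary text are placeholder assumptions.

// Hedged sketch: push an alert to PagerDuty's Events API v2 from a monitoring check.
// PAGERDUTY_ROUTING_KEY and the "checkout-api" source are hypothetical placeholders.
async function triggerPagerDutyAlert(summary: string): Promise<void> {
  await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      routing_key: process.env.PAGERDUTY_ROUTING_KEY, // integration key of the service
      event_action: "trigger",
      payload: {
        summary,                 // e.g. "5xx rate above 5% on checkout-api"
        source: "checkout-api",  // hypothetical service name
        severity: "critical",
      },
    }),
  });
}

// Example: called by a health check when the error rate crosses a threshold.
// triggerPagerDutyAlert("5xx rate above 5% for 10 minutes on checkout-api");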

An Incident in Production

"Protéger"

That's the first thing to do. Depending on the problem, it can take multiple forms. In extreme cases, we would completely "unplug" the machine to avoid any further harm. That's what OpenAI had to do when they realized their system was leaking users' info. In other cases, we can simply disable the "thing" that triggered the problem, for instance the last feature we released.

Naturally, to be able to disable a feature, we need to have designed the system beforehand by properly encapsulating features behind Feature Flags. There are tons of SaaS solutions on the market: Flagsmith, LaunchDarkly, Unleash, DevCycle, Optimizely, FeatBit. The list is not exhaustive, and in some simple cases we don't even need a specific tool. If the feature flags are server-side only, relying on Environment Variables is totally fine, as in the sketch below.
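Here is a minimal sketch of that last case, assuming a Node.js service; FEATURE_NEW_CHECKOUT and the two checkout paths are hypothetical names, not from the original post.

// Server-side kill switch backed by an environment variable.
function isEnabled(flag: string): boolean {
  // Read at runtime so the flag can be flipped without touching the code.
  return (process.env[`FEATURE_${flag}`] ?? "false").toLowerCase() === "true";
}

function checkout(cartId: string): string {
  if (!isEnabled("NEW_CHECKOUT")) {
    return `legacy checkout for ${cartId}`; // known-good path
  }
  return `new checkout for ${cartId}`;
}

// "Protecting" then means setting FEATURE_NEW_CHECKOUT=false and restarting/redeploying.
console.log(checkout("cart-42"));

The point of the design is that disabling the faulty feature becomes a configuration change, not a code revert, which buys time for the rest of the process.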

"Examiner"

Once we've made sure the problem cannot be triggered again, it's time to investigate the underlying causes. Here as well, they can have multiple origins: Server Issue, DNS Problem, DDoS Attack, Expired Domain, Coding Error or Bug, Security Breach, Bandwidth Limit Exceeded, Software Update, Traffic Surge, Power Outage, Natural Disaster, Scheduled Maintenance, Human Error... And the list goes on and on.

First, we can try to reproduce what has been reported, if applicable. In any case, do not assume the reporter did something wrong. Even though users can be "dumb" sometimes, if something has been reported, assume that something is wrong within the system.

Of course, the real cause(s) may not be obvious at first sight, so it might take time. One of the biggest mistakes here is to focus on the logs, or anything else, and start digging too deep into the problem... without communicating.

"Alerter"

"OK, no one has seen the problem, let's try to fix it quickly and we'll be fine."

We must admit that we've all tried this, but in most cases it's a really bad idea. Once we have protected the system and have an idea of the causes of the problem, it's really important to communicate. First, internally, generally by creating a dedicated Slack channel. Adding only a few people to the loop helps avoid being distracted by too many messages. Only the principal stakeholders should be in the War Room, otherwise the discussion will go in all directions. Focus is important here.
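Opening that channel can even be scripted. Below is a hedged sketch using Slack's Web API through the @slack/web-api Node client; the token, incident id, and message text are illustrative assumptions, not the actual setup described in the post.

import { WebClient } from "@slack/web-api";

// Assumes SLACK_BOT_TOKEN is a bot token allowed to create channels, invite users, and post.
const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

async function openWarRoom(incidentId: string, stakeholderIds: string[]): Promise<string> {
  // One dedicated channel per incident keeps the discussion focused.
  const created = await slack.conversations.create({ name: `incident-${incidentId}` });
  const channelId = created.channel!.id!;

  // Invite only the principal stakeholders (Slack user IDs, comma-separated).
  await slack.conversations.invite({ channel: channelId, users: stakeholderIds.join(",") });

  await slack.chat.postMessage({
    channel: channelId,
    text: `Incident ${incidentId}: feature disabled, investigation in progress.`,
  });
  return channelId;
}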

Once the alert has been raised internally, we also need to communicate with customers if they're facing the issue. People tend to be more understanding when they know what's going on. When the subway is stopped in the middle of a tunnel, don't we feel better when the driver tells us there is a short delay but the train will restart pretty soon? If the company is structured for it, this is not our job, though: it's the PO/PM or someone more customer-facing who communicates, in a more human way (mostly on social media).

"Soigner"

And of course, the last step, but the most important: fixing the problem!

There are as many solutions as there are issues. Multiply that by the number of possible implementations within each company, and that gives a lot of possible solutions. So of course, it's our experience and skills that will lead us to the right one.

In any case, if it's a coding error and a hotfix needs to be written and deployed to production, we always need to follow the regular process.

PR ➡️ Review ➡️ Staging ➡️ Prod

It's really important not to skip these steps. Rushing into the solution could cause more harm; there's a difference between moving fast and acting in haste.

Final Words

Once everything is back to normal, we can breathe and re-enable the disabled features, if any. It's also important to write a Post Mortem with an action plan to prevent the problem from happening again.

At RebootX, we truly believe that problems cannot be avoided, but we can have great tooling to solve them. For now, we can act on servers and monitor Grafana. Coming next, we'll integrate Feature Flag services to let us "Protect" very quickly, so we have more time to act on what counts.

As usual, feel free to contact us if you have any needs and/or ideas.

Chafik H'nini

Want to give it a try? Get the app for free on the stores.

Download on the App Store
Download on the Play Store