Most OT organizations have incident response procedures buried in a wiki or a shared drive that no one reads during a crisis. The runbooks exist, but they aren't maintained, don't reflect current systems, and assume knowledge that the on-call team doesn't have. When an incident happens, people improvise, and improvisation is where mistakes happen: mistakes that could cost you a facility shutdown or worse.
An effective OT runbook library is different from a set of IT incident response playbooks. Your runbooks must account for the physics of your process: in a plant, you can't simply kill a process and restart it the way IT restarts a service. They must be written for operators and engineers with OT expertise, not security analysts. They must be current, tested quarterly, and accessible from the war room the moment you need them.
Structure That Works
We recommend organizing runbooks by scenario, not by tools or teams. Each runbook covers a specific incident type: "Suspicious PLC Configuration Change," "Unexpected Historian Data Modification," "Unauthorized Remote Access to Engineering Network." For each scenario, the runbook should include: detection indicators (what triggered the alarm), initial assessment questions (is this a real threat or a false positive), containment options (what can we do without stopping the process), and escalation criteria (when to shut down or call the plant manager).
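One way to keep every runbook aligned with the four sections listed above is to check drafts mechanically. The following is a minimal Python sketch, not a prescribed tool: the `Runbook` class, its field names, and the `missing_sections` helper are hypothetical names chosen for illustration, and the sample entries are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical record for one scenario-based runbook."""
    scenario: str
    detection_indicators: list[str] = field(default_factory=list)
    assessment_questions: list[str] = field(default_factory=list)
    containment_options: list[str] = field(default_factory=list)
    escalation_criteria: list[str] = field(default_factory=list)


# The four sections every runbook is expected to contain.
REQUIRED_SECTIONS = (
    "detection_indicators",
    "assessment_questions",
    "containment_options",
    "escalation_criteria",
)


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of required sections that are still empty."""
    return [name for name in REQUIRED_SECTIONS if not getattr(runbook, name)]


# Illustrative draft with two of the four sections filled in.
draft = Runbook(
    scenario="Suspicious PLC Configuration Change",
    detection_indicators=["Controller logic checksum differs from baseline"],
    assessment_questions=["Was a maintenance change scheduled for this controller?"],
)
print(missing_sections(draft))
# → ['containment_options', 'escalation_criteria']
```

A check like this could run whenever a runbook is edited, so an incomplete draft is caught before it reaches the war room.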
Each runbook should be one to three pages. Anything longer won't be read during an incident. Include decision trees, not prose. Include phone numbers and escalation paths. Include step-by-step isolation procedures with safety checkpoints. Include a list of evidence to preserve and tools needed. A runbook that says "isolate the suspect network" is useless. A runbook that says "isolate the North Building PLC VLAN by disconnecting managed switch port 47 on the industrial firewall, then verify that the HMI still displays current pressure readings from the Level 2 sensor" is actionable.
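To illustrate what "decision trees, not prose" can look like when a runbook is stored in a structured form, here is a small Python sketch. The `Node` class, the `walk` function, and the questions and actions in the sample tree are all hypothetical; a real tree would come from your own process and safety constraints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One step in a decision tree: a yes/no question or a final action."""
    text: str
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

    def __post_init__(self) -> None:
        # A question needs both branches; an action has neither.
        if (self.yes is None) != (self.no is None):
            raise ValueError(f"Question needs both branches: {self.text}")

    @property
    def is_action(self) -> bool:
        return self.yes is None and self.no is None


def walk(node: Node, answers: list[bool]) -> str:
    """Follow yes/no answers from the root and return the action reached."""
    remaining = iter(answers)
    while not node.is_action:
        try:
            answer = next(remaining)
        except StopIteration:
            raise ValueError(f"No answer supplied for: {node.text}") from None
        node = node.yes if answer else node.no
    return node.text


# Illustrative tree for a suspicious PLC configuration change.
tree = Node(
    "Was the configuration change scheduled?",
    yes=Node("Log as false positive and close"),
    no=Node(
        "Can the controller be isolated without stopping the process?",
        yes=Node("Isolate the controller VLAN and verify HMI readings"),
        no=Node("Escalate to the plant manager"),
    ),
)
print(walk(tree, [False, True]))
# → Isolate the controller VLAN and verify HMI readings
```

Every path ends in a concrete action, which is the property that makes a tree faster to follow under pressure than a paragraph.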
Keeping Runbooks Current
- Quarterly reviews: Every quarter, assign each runbook to someone on your team for review and testing. Has the network topology changed? Has the firmware on this controller been upgraded? Update the runbook and test the procedures on lab equipment.
- Incident post-mortems update runbooks: After every security incident, even a false positive, the team should review and update the relevant runbook. What did we learn? What would have helped? Bake those lessons into the procedure before the next incident.
- Tabletop exercises validate runbooks: Twice a year, run a tabletop scenario. Give the team a mock incident description and have them walk through the relevant runbook. If they get stuck or a procedure doesn't make sense, fix it now.
- Version control your runbooks: Treat them like code. Use git or a shared drive with clear version numbers. Mark who approved the current version and when. If an incident happens, you need to know which version of which runbook you followed.
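The review and versioning practices above are easier to enforce when each runbook carries its own metadata. The sketch below shows one possible shape in Python. The `RunbookVersion` fields, the 92-day interval standing in for "one quarter," and the sample names and dates are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Roughly one quarter; an assumed threshold, adjust to your review cycle.
REVIEW_INTERVAL = timedelta(days=92)


@dataclass(frozen=True)
class RunbookVersion:
    """Hypothetical metadata recorded with each approved runbook version."""
    title: str
    version: str
    approved_by: str
    approved_on: date
    last_tested: date


def overdue(runbooks: list[RunbookVersion], today: date) -> list[str]:
    """Return titles of runbooks not tested within the review interval."""
    return [rb.title for rb in runbooks if today - rb.last_tested > REVIEW_INTERVAL]


# Invented sample entries.
library = [
    RunbookVersion(
        title="Suspicious PLC Configuration Change",
        version="1.4",
        approved_by="Controls Engineering Lead",
        approved_on=date(2024, 5, 20),
        last_tested=date(2024, 5, 15),
    ),
    RunbookVersion(
        title="Unexpected Historian Data Modification",
        version="2.0",
        approved_by="OT Security Manager",
        approved_on=date(2024, 1, 12),
        last_tested=date(2024, 1, 10),
    ),
]
print(overdue(library, today=date(2024, 7, 1)))
# → ['Unexpected Historian Data Modification']
```

Stored alongside the runbook in git, a record like this also answers the post-incident question of which version was followed and who approved it.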
Integration with SOAR and War Rooms
Your SOAR platform should reference and link to runbooks. When an alert fires, the playbook should pull up the relevant runbook in your war room tool. Print copies of critical runbooks and keep them in your 24/7 operations center. Digital access is good; hardcopy backup is essential if your systems are compromised.
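How an alert is linked to a runbook depends on your SOAR product, so the following Python sketch deliberately avoids any vendor API. It only shows the lookup idea: the alert type names, file paths, `RUNBOOK_INDEX`, `FALLBACK_RUNBOOK`, and `runbook_for` are all hypothetical.

```python
# Hypothetical mapping from alert type to the runbook that covers it.
RUNBOOK_INDEX = {
    "plc_config_change": "runbooks/suspicious-plc-configuration-change.md",
    "historian_data_modified": "runbooks/unexpected-historian-data-modification.md",
    "unauthorized_remote_access": "runbooks/unauthorized-remote-access-engineering-network.md",
}

# Assumed catch-all so an unmapped alert still lands on a procedure.
FALLBACK_RUNBOOK = "runbooks/general-ot-triage.md"


def runbook_for(alert_type: str) -> str:
    """Return the runbook path for an alert type, or the fallback."""
    return RUNBOOK_INDEX.get(alert_type.strip().lower(), FALLBACK_RUNBOOK)


print(runbook_for("PLC_Config_Change"))
# → runbooks/suspicious-plc-configuration-change.md
print(runbook_for("unknown_alert"))
# → runbooks/general-ot-triage.md
```

Keeping the index in the same repository as the runbooks means a renamed or retired runbook can be caught in review rather than during an incident.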
Building a runbook library requires deep understanding of your facility's process, your control systems, and your operational constraints. Many organizations get this wrong. We've helped utilities, water systems, and manufacturers create runbook libraries that are specific, testable, and actually followed during incidents. Reach out to develop a runbook strategy for your operation.
This article was written by the Cascadia OT Security practice, which advises Pacific Northwest data centers and manufacturers on industrial cybersecurity. For engagement inquiries, reach our practice team.