Wednesday, April 06, 2011

Black Swans and Complex System Failure

Black Swan theory (Wikipedia) tells us among other things that people tend to underestimate the probability of extremely rare events.

A corollary of this theory that is of particular interest to architects and complex system engineers concerns the design of fail-safe mechanisms. Nuclear power and oil extraction are examples of environmentally critical operations; they are therefore subject to detailed risk assessment, and designed with multiple fail-safe mechanisms. And yet both the oil spill last year in the Gulf of Mexico and the partial meltdown in Japanese nuclear reactors following the recent tsunami involved the simultaneous failure of multiple fail-safe mechanisms. Obviously that's not supposed to happen.

Simultaneous failure of supposedly independent mechanisms is a Black Swan event.
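
To see why the independence assumption matters so much, here is a minimal sketch in Python. It is not taken from any real risk assessment: the failure probabilities, the number of mechanisms and the simple "common cause" model (one rare shared event, such as a tsunami, that disables every mechanism at once) are all assumptions chosen purely for illustration.

```python
def p_all_fail_independent(p: float, n: int) -> float:
    """Probability that all n mechanisms fail, if failures are truly independent."""
    return p ** n


def p_all_fail_common_cause(p: float, n: int, q: float) -> float:
    """Probability that all n mechanisms fail when a shared event can disable them all.

    Assumed model: with probability q a common-cause event occurs and every
    mechanism fails together; otherwise each fails independently with
    probability p.
    """
    return q + (1.0 - q) * p ** n


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    p = 0.01    # chance that any single fail-safe mechanism fails
    n = 3       # number of supposedly independent mechanisms
    q = 0.0001  # chance of a shared event that defeats all of them

    independent = p_all_fail_independent(p, n)
    with_common_cause = p_all_fail_common_cause(p, n, q)

    print(f"assuming independence: {independent:.2e}")
    print(f"with a common cause:   {with_common_cause:.2e}")
    print(f"underestimated by a factor of about {with_common_cause / independent:.0f}")
```

With these made-up numbers, a shared cause that is a hundred times rarer than any single failure still dominates the result, so a calculation that assumes independence understates the risk of total failure by roughly two orders of magnitude.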

Update (August 2011)

A recent study by Oxford University and McKinsey has blamed rare but high-impact problems, dubbed "black swans", for the increasingly common phenomenon of large IT projects whose costs spiral out of control. The study finds this phenomenon to be three times as common in IT as in other domains [BBC News, 26 August 2011]. See my post on Black Swan Blindness.

Update (October 2011)

Reviewing a couple of recent books about BP and the oil spill in the Gulf of Mexico, Mattathias Schwartz makes a number of relevant points.

When crucial pieces of our infrastructure fail, they do so gracelessly, without much warning and in ways that are difficult to anticipate. ... The failure to grasp the possibility of system-wide failure might be one in an accelerating series, bookended by the 2008 financial crisis and the Fukushima nuclear meltdown last spring.
One reason for the oil and gas industry’s quick comeback in the US was the successful packaging of the blowout as a ‘black swan’, an event of such low probability that it couldn’t have been anticipated. This certainly helped excuse the fact that no one – not BP, Chevron, Exxon or Shell – had a working plan for plugging a blowout as deep as Macondo.
BP ... claimed, in its own report on the blowout, that the event had eight causes, of which BP was partly responsible for one. The president’s commission concluded that the disaster had nine causes, and that BP was responsible for six or seven. And yet BP stands by what it said at the start. 

The size of the system and the complexity of the data make it possible to argue for a maddeningly wide range of positions, especially when it comes to vague legal notions like ‘negligence’ or ‘responsibility’. Both concepts hinge on proving that one linear narrative is the right one. 

Mattathias Schwartz, LRB 6 October 2011 
  • Spills and Spin: The Inside Story of BP by Tom Bergin 
  • A Hole at the Bottom of the Sea: The Race to Kill the BP Oil Gusher by Joel Achenbach


Roger Sessions said...

Interesting post, as always, Richard.

There is also an emotional component to how we estimate the probability of extremely rare events. All of the examples here are of events charged with negative emotion, things we don't want to believe can happen.

But when we estimate the probability of an extremely rare event that we do want to happen, we overestimate the probability of the event occurring. This is what keeps casinos in business.

In the IT world, I am interested in complex systems failures. We attempt to build highly complex systems even though statistics show us that such systems have a low probability of success. But we want to believe we can build these systems, so we ignore the statistics.

On the face of it, this approach seems insane. We use the same methods that have consistently failed in the past and hope for a different outcome.

What terminology would you use to differentiate the negatively charged rare events that we underestimate from the positively charged rare events that we overestimate?

FredF said...

Richard, a long time ago in another life I was a Chem Eng working for ICI in South Africa, making somewhat hazardous chemicals, explosives and such. After some disasters such as Flixborough in the UK, an ICI engineer called Trevor Kletz developed a methodology called Hazard and Operability Study (HAZOP) for processes (then referring to chemical processes) to tackle exactly the issue you address: how to formally review and then deal with the incidence of problems that "may" represent risks to personnel and equipment. I have since applied it, in a modified version, to business processes and IT landscape design. Our mutual friend Chris Bird and I continue to discuss its relevance to design today.
As a result, I'm afraid I don't buy the concept of "Black Swan" in Complex System Failures. While HAZOP is not totally risk free (cos you ALWAYS get to a probability of occurrence --> impact and then cost trade-off), it beats the hell out of the alternative of "hoping" it'll all work as designed, and then being surprised by "rare events".

Wikipedia and many other sources have good reviews of HAZOP. Last I heard Trevor was at A&M University in Texas, and had written an interesting report on BP's Texas City chemical incident.