
Monday, May 23, 2022

Risk Algebra

In this post, I want to explore some important synergies between architectural thinking and risk management. 


The first point is that if we want to have an enterprise-wide understanding of risk, then it helps to have an enterprise-wide view of how the business is configured to deliver against its strategy. Enterprise architecture should provide a unified set of answers to the following questions.

  • What capabilities deliver what business outcomes?
  • Delivering what services to what customers?
  • What information, processes and resources to support these?
  • What organizations, systems, technologies and external partnerships to support these?
  • Who is accountable for what?
  • And how is all of this monitored, controlled and governed?

Enterprise architecture should also provide an understanding of the dependencies between these, and which ones are business-critical or time-critical. For example, there may be some components of the business that are important in the long-term, but could easily be unavailable for a few weeks before anyone really noticed. But there are other components (people, systems, processes) where any failure would have an immediate impact on the business and its customers, so major issues have to be fixed urgently to maintain business continuity. For some critical elements of the business, appropriate contingency plans and backup arrangements will need to be in place.

Risk assessment can then look systematically across this landscape, reviewing the risks associated with assets of various kinds, activities of various kinds (processes, projects, etc), as well as other intangibles (motivation, brand image, reputation). Risk assessment can also review how risks are shared between the organization and its business partners, both officially (as embedded in contractual agreements) and in actual practice.

 

Architects have an important concept, which should also be of great interest to enterprise risk management - the idea of a single point of failure (SPOF). When this exists, it is often the result of a poor design, or an over-zealous attempt to standardize and strip out complexity. But sometimes this is the result of what I call Creeping Business Dependency - in other words, not noticing that we have become increasingly reliant on something outside our control.
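This kind of check can even be mechanized. Below is a minimal sketch - the components and dependencies are invented, and it assumes the networkx library is available - that treats a dependency map as a graph and flags articulation points (nodes whose removal disconnects the rest) as candidate single points of failure.

```python
# Minimal sketch: the components and dependencies below are invented,
# and the networkx library is assumed to be installed.
import networkx as nx

# Undirected view of "X depends on Y" relationships
G = nx.Graph()
G.add_edges_from([
    ("online ordering",   "payment gateway"),
    ("online ordering",   "product catalogue"),
    ("store tills",       "payment gateway"),
    ("store tills",       "central database"),
    ("payment gateway",   "acquiring bank A"),
    ("payment gateway",   "acquiring bank B"),
    ("product catalogue", "central database"),
    ("reporting",         "central database"),
])

# An articulation point is a node whose removal disconnects the graph -
# in dependency terms, a candidate single point of failure.
spofs = sorted(nx.articulation_points(G))
print("Candidate SPOFs:", spofs)
# ['central database', 'payment gateway'] - everything funnels through them
```

A real enterprise model would be directed, layered and weighted by criticality, but even a crude view like this is usually enough to start a useful conversation.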


There are also important questions of scale and aggregation. Some years ago, I did some risk management consultancy for a large supermarket chain. One of the topics we were looking at was fridge failure. Obviously the supermarket had thousands and thousands of fridges, some in the customer-facing parts of the stores, some at the back, and some in the warehouses.

Fridges fail all the time, and there is a constant process of inspecting, maintaining and replacing fridges. So a single fridge failing is not regarded as a business risk. But if several thousand fridges were all to fail at the same time, presumably for the same reason, that would cause a significant disruption to the business.

So this raised some interesting questions. Could we define a cut-off point? How many fridges would have to fail before we found ourselves outside business-as-usual territory? What kind of management signals or dashboard could be put in place to get early warning of such problems, or to trigger a switch to a "safety" mode of operation?

Obviously these questions aren't only relevant to fridges, but can apply to any category of resource, including people. During the pandemic, some organizations had similar issues in relation to staff absences.
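One simple way of framing the cut-off question is statistical: treat business-as-usual failures as a baseline rate, and flag anything far outside it. The sketch below is purely illustrative - the baseline and thresholds are invented numbers, not the supermarket's.

```python
# Illustrative only: the baseline rate and thresholds are invented numbers.
from math import exp, factorial

BASELINE_FAILURES_PER_DAY = 12   # long-run average under business as usual

def poisson_tail(k, lam):
    """P(X >= k) for a Poisson(lam) variable."""
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

def classify(failures_today, lam=BASELINE_FAILURES_PER_DAY):
    p = poisson_tail(failures_today, lam)
    if p < 0.001:   # wildly outside normal variation - common cause suspected
        return "switch to safety mode"
    if p < 0.05:    # unusual, worth a closer look
        return "early warning"
    return "business as usual"

for n in (10, 20, 35):
    print(n, "failures:", classify(n))
```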


Aggregation is also relevant when we look beyond a single firm to the whole ecosystem. Suppose we have a market or ecosystem with n players, and the risk carried by each player is R(n). Then what is the aggregate risk of the market or ecosystem as a whole? 

If we assume complete independence between the risks of each player, then we may assume that there is a significant probability of a few players failing, but a very small probability of a large number of players failing - the so-called Black Swan event. Unfortunately, the assumption of independence can be flawed, as we have seen in financial markets, where there may be a tangled knot of interdependence between players. Regulators often think they can regulate markets by imposing rules on individual players. And while this might sometimes work, it is easy to see why it doesn’t always work. In some cases, the regulator draws a line in the sand (for example defining a minimum capital ratio) and then checks that nobody crosses the line. But then if everyone trades as close as possible to this line, how much capacity does the market as a whole have for absorbing unexpected shocks?
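A toy simulation (with invented parameters) makes the point vividly: if players only fail independently, mass failure is vanishingly rare; add a common shock that everyone is exposed to, and the tail fattens dramatically.

```python
# Toy model with invented numbers: compare purely independent failures
# with failures driven partly by a shared, market-wide shock.
import random

N_PLAYERS, TRIALS = 100, 10_000
P_IDIOSYNCRATIC = 0.02     # chance a player fails on its own
P_SHOCK = 0.01             # chance of a market-wide shock in any period
P_FAIL_GIVEN_SHOCK = 0.40  # chance each player fails if the shock hits

def failures(correlated: bool) -> int:
    shock = correlated and random.random() < P_SHOCK
    p = P_FAIL_GIVEN_SHOCK if shock else P_IDIOSYNCRATIC
    return sum(random.random() < p for _ in range(N_PLAYERS))

for label, correlated in (("independent", False), ("with common shock", True)):
    mass = sum(failures(correlated) >= 20 for _ in range(TRIALS))
    print(f"{label}: P(20+ of {N_PLAYERS} players fail) ~ {mass / TRIALS:.4f}")
```

Under independence the estimate is effectively zero; with the common shock it sits close to the probability of the shock itself.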

 

Both here and in the fridge example, there is a question of standardization versus diversity. On the one hand, it's a lot simpler for the supermarket if all the fridges are the same type, with a common set of spare parts. But on the other hand, having more than one type of fridge helps to mitigate the risk of them all failing at the same time. It also gives some space for experimentation, thus addressing the longer term risk of getting stuck with an out-of-date fridge estate. The fridge example also highlights the importance of redundancy - in other words, having spare fridges. 
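A back-of-envelope calculation (with invented numbers) shows why diversity pays against common-mode failure, even though it complicates spares and maintenance.

```python
# Back-of-envelope, invented numbers: annual chance that a common-mode
# defect (bad firmware update, faulty compressor batch) hits a given model.
p = 0.02

one_model  = p       # whole estate exposed to the same defect
two_models = p * p   # lose everything only if both models are hit
print(f"whole estate down: one model {one_model:.4f}, two models {two_models:.6f}")
print(f"at most half the estate down (two models): {2 * p * (1 - p):.4f}")
```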

So there are some important trade-offs here between pure economic optimization and a more balanced approach to enterprise risk.

Monday, April 22, 2019

When the Single Version of Truth Kills People

@Greg_Travis has written an article on the Boeing 737 Max Disaster, which @jjn1 describes as "one of the best pieces of technical writing I’ve seen in ages". He explains why normal airplane design includes redundant sensors.

"There are two sets of angle-of-attack sensors and two sets of pitot tubes, one set on either side of the fuselage. Normal usage is to have the set on the pilot’s side feed the instruments on the pilot’s side and the set on the copilot’s side feed the instruments on the copilot’s side. That gives a state of natural redundancy in instrumentation that can be easily cross-checked by either pilot. If the copilot thinks his airspeed indicator is acting up, he can look over to the pilot’s airspeed indicator and see if it agrees. If not, both pilot and copilot engage in a bit of triage to determine which instrument is profane and which is sacred."

and redundant processors, to guard against a Single Point of Failure (SPOF).

"On the 737, Boeing not only included the requisite redundancy in instrumentation and sensors, it also included redundant flight computers—one on the pilot’s side, the other on the copilot’s side. The flight computers do a lot of things, but their main job is to fly the plane when commanded to do so and to make sure the human pilots don’t do anything wrong when they’re flying it. The latter is called 'envelope protection'."

But ...

"In the 737 Max, only one of the flight management computers is active at a time—either the pilot’s computer or the copilot’s computer. And the active computer takes inputs only from the sensors on its own side of the aircraft."

As a result of this design error, 346 people are dead. Travis doesn't pull his punches.

"It is astounding that no one who wrote the MCAS software for the 737 Max seems even to have raised the possibility of using multiple inputs, including the opposite angle-of-attack sensor, in the computer’s determination of an impending stall. As a lifetime member of the software development fraternity, I don’t know what toxic combination of inexperience, hubris, or lack of cultural understanding led to this mistake."

He may not know what led to this specific mistake, but he can certainly see some of the systemic issues that made this mistake possible. Among other things, the widespread idea that software provides a cheaper and quicker fix than getting the hardware right, together with what he calls cultural laziness.

"Less thought is now given to getting a design correct and simple up front because it’s so easy to fix what you didn’t get right later."

Agile, huh?
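To illustrate the kind of cross-check Travis is describing - and this is a toy sketch with invented names and thresholds, emphatically not Boeing's MCAS code - a stall-protection function might refuse to act when the two sides of the aircraft disagree.

```python
# Toy illustration only: invented names and thresholds, nothing to do
# with the real MCAS implementation.
STALL_AOA_DEGREES = 14.0   # hypothetical stall threshold
MAX_DISAGREEMENT = 5.0     # hypothetical cross-check tolerance

def stall_protection_should_engage(aoa_left: float, aoa_right: float) -> bool:
    """Engage only if both angle-of-attack sensors agree there is a problem."""
    if abs(aoa_left - aoa_right) > MAX_DISAGREEMENT:
        # Sensors disagree: don't trust either one on its own.
        # Flag the discrepancy to the crew instead of silently acting on it.
        return False
    return min(aoa_left, aoa_right) > STALL_AOA_DEGREES

# A failed sensor reading 74 degrees while the other reads 6 is ignored:
print(stall_protection_should_engage(74.0, 6.0))   # False
print(stall_protection_should_engage(15.5, 16.0))  # True
```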


Update: CNN finds an unnamed Boeing spokesman to defend the design.

"Single sources of data are considered acceptable in such cases by our industry".

OMG, does that mean that there are more examples of SSOT elsewhere in the Boeing design!?




How a Single Point of Failure (SPOF) in the MCAS software could have caused the Boeing 737 Max crash in Ethiopia (DMD Solutions, 5 April 2019) - provides a simple explanation of Fault Tree Analysis (FTA) as a technique to identify SPOF.

Mike Baker and Dominic Gates, Lack of redundancies on Boeing 737 MAX system baffles some involved in developing the jet (Seattle Times 26 March 2019)

Curt Devine and Drew Griffin, Boeing relied on single sensor for 737 Max that had been flagged 216 times to FAA (CNN, 1 May 2019) HT @marcusjenkins

George Leopold, Boeing 737 Max: Another Instance of 'Go Fever'? (29 March 2019)

Mary Poppendieck, What If Your Team Wrote the Code for the 737 MCAS System? (4 April 2019) HT @CharlesTBetz with reply from @jpaulreed

Gregory Travis, How the Boeing 737 Max Disaster Looks to a Software Developer (IEEE Spectrum, 18 April 2019) HT @jjn1 @ruskin147

And see my other posts on the Single Source of Truth.


Updated  2 May 2019

Saturday, March 10, 2018

Fail Fast - Why did the Chicken cross the road?

A commonly accepted principle of architecture and engineering is to avoid a single point of failure (SPOF). A single depot for a chain of over 850 fast food restaurants could be risky, as KFC was warned when it announced that it was switching its logistics from Bidvest to a partnership with DHL and QSL, to be served out of a single depot in Rugby. We may imagine that the primary motivation for KFC was cost-saving, although the announcement was dressed up in management speak - "re-writing the rule book" and "setting a new benchmark".

The new system went live on 14th February 2018. The changeover did not go well: by the weekend, over three quarters of the stores were closed. Rugby is a great location for a warehouse - except when there is a major incident on a nearby motorway. (Who knew that could happen?)

After a couple of weeks of disruption, as well as engaging warehouse-as-a-service startup Stowga for non-food items, KFC announced that it was resuming its relationship with Bidvest. According to some reports, Burger King also flirted with DHL some years ago before returning to Bidvest. History repeating itself.

However, the problems faced by KFC cannot be attributed solely to the decision to supply the whole UK mainland from Rugby. A just-in-time supply chain needs contingency planning - covering business continuity and disaster recovery. (Good analysis by Richard Priday, who tweets as @InsomniacSteel.)



KFC revolutionizes UK foodservice supply chain with DHL and QSL appointment (DHL Press Release, 11 Oct 2017)

Andrew Don, KFC admits chicken waste as cost of DHL failure grows (The Grocer, 23 Feb 2018)

Andrea Felsted, Supply chains: Look for the single point of failure (FT 2 May 2011)

Adam Leyland, KFC supply chain fiasco is Heathrow's Terminal 5 all over again (The Grocer, 23 Feb 2018)

Charlie Pool (CEO of Stowga), Warehousing on-demand saves KFC (Retail Technology 26 February 2018)

Richard Priday, The inside story of the great KFC chicken shortage of 2018 (Wired 21 February 2018) How KFC ended the great chicken crisis by taking care of its mops (Wired 2 March 2018) The KFC chicken crisis is finally over: it's (sort of) ditched DHL (Wired 8 March 2018)

Carol Ryan, Stuffed KFC only has itself to blame (Reuters, 20 February 2018)

Su-San Sit, KFC was 'warned DHL would fail' (Supply Management, 20 February 2018)

Matthew Weaver, Most KFCs in UK remain closed because of chicken shortage (Guardian 19 Feb 2018) KFC was warned about switching UK delivery contractor, union says (Guardian 20 Feb 2018)

Zoe Wood, KFC returns to original supplier after chicken shortage fiasco (Guardian 8 March 2018)

Wikipedia: Single Point of Failure

Related posts: Fail Fast - Burger Robotics (March 2018)

Thursday, March 09, 2017

Inspector Sands to Platform Nine and Three Quarters

Last week was not a good one for the platform business. Uber continues to receive bad publicity on multiple fronts, as noted in my post on Uber's Defeat Device and Denial of Service (March 2017). And on Tuesday, a fat-fingered system admin at AWS managed to take out a significant chunk of the largest platform on the planet, seriously degrading online retail in the Northern Virginia (US-EAST-1) Region. According to one estimate, performance at over half of the top internet retailers was hit by 20 percent or more, and some websites were completely down.

What have we learned from this? Yahoo Finance tells us not to worry.
"The good news: Amazon has addressed the issue, and is working to ensure nothing similar happens again. ... Let’s just hope ... that Amazon doesn’t experience any further issues in the near future."

Other commentators are not so optimistic. For Computer Weekly, this incident
"highlights the risk of running critical systems in the public cloud. Even the most sophisticated cloud IT infrastructure is not infallible."

So perhaps one lesson is not to trust platforms. Or at least not to practice wilful blindness when your chosen platform or cloud provider represents a single point of failure.

One of the myths of cloud, according to Aidan Finn,
"is that you get disaster recovery by default from your cloud vendor (such as Microsoft and Amazon). Everything in the cloud is a utility, and every utility has a price. If you want it, you need to pay for it and deploy it, and this includes a scenario in which a data center burns down and you need to recover. If you didn’t design in and deploy a disaster recovery solution, you’re as cooked as the servers in the smoky data center."

Interestingly, Amazon itself was relatively unaffected by Tuesday's problem. This may have been because they split their deployment across multiple geographical zones. However, as Brian Guy points out, there are significant costs involved in multi-region deployment, as well as data protection issues. He also notes that this question is not (yet) addressed by Amazon's architectural guidelines for AWS users, known as the Well-Architected Framework.

Amazon recently added another pillar to the Well-Architected Framework, namely operational excellence. This includes such practices as performing operations with code: in other words, automating operations as much as possible. Did someone say Fat Finger?
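Automation cuts both ways, of course. The outage was reportedly triggered by a command that removed far more capacity than intended, and one benefit of operations-as-code is that you can build guard-rails against exactly that. Here is a minimal sketch - the names and limits are invented, and this is not AWS tooling.

```python
# Illustrative only: names and limits are invented, this is not AWS tooling.
MAX_REMOVAL_FRACTION = 0.05   # never take out more than 5% of a fleet at once

def plan_capacity_removal(fleet: list[str], requested: list[str]) -> list[str]:
    """Refuse a removal request whose blast radius looks like a typo."""
    if len(requested) > MAX_REMOVAL_FRACTION * len(fleet):
        raise ValueError(
            f"Refusing to remove {len(requested)} of {len(fleet)} servers; "
            "split the change into smaller steps."
        )
    return [server for server in fleet if server not in requested]

fleet = [f"index-host-{i}" for i in range(200)]
print(len(plan_capacity_removal(fleet, fleet[:5])))   # fine: 195 servers remain
try:
    plan_capacity_removal(fleet, fleet[:80])          # typo-sized request
except ValueError as error:
    print(error)
```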




Abel Avram, The AWS Well-Architected Framework Adds Operational Excellence (InfoQ, 25 Nov 2016)

Julie Bort, The massive AWS outage hurt 54 of the top 100 internet retailers — but not Amazon (Business Insider, 1 March 2017)

Aidan Finn, How to Avoid an AWS-Style Outage in Azure (Petri, 6 March 2017)

Brian Guy, Analysis: Rethinking cloud architecture after the outage of Amazon Web Services (GeekWire, 5 March 2017)

Daniel Howley, Why you should still trust Amazon Web Services even though it took down the internet (Yahoo Finance, 6 March 2017)

Chris Mellor, Tuesday's AWS S3-izure exposes Amazon-sized internet bottleneck (The Register, 1 March 2017)

Shaun Nichols, Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command (The Register, 2 March 2017)

Cliff Saran, AWS outage shows vulnerability of cloud disaster recovery (Computer Weekly, 6 March 2017)

Friday, September 02, 2016

Single Point of Failure (Comms)

Large business-critical systems can be brought down by power failure. My previous post looked at Airlines. This time we turn our attention to Telecommunications.




Obviously a power cut is not the only possible cause of business problems. Another single point of failure could be a single rogue employee.




Gavin Clarke, Telecity's engineers to spend SECOND night fixing web hub power outage (The Register, 18 November 2015)


Related Post: Single Point of Failure (Airlines) (August 2016)

Monday, August 08, 2016

Single Point of Failure (Airlines)

Large business-critical systems can be brought down by power failure. Who knew?

In July 2016, Southwest Airlines suffered a major disruption to service, which lasted several days. It blamed the failure on "lingering disruptions following performance issues across multiple technology systems", apparently triggered by a power outage.
In August 2016 it was Delta's turn.

Then there were major problems at British Airways (Sept 2016) and United (Oct 2016).



The concept of "single point of failure" is widely known and understood. And the airline industry is rightly obsessed by safety. They wouldn't fly a plane without backup power for all systems. So what idiot runs a whole company without backup power?

We might speculate what degree of complacency or technical debt can account for this pattern of adverse incidents. I haven't worked with any of these organizations myself. However, my guess is that some people within the organization were aware of the vulnerability, but this awareness somehow didn't penetrate the management hierarchy. (In terms of orgintelligence, a short-sighted board of directors becomes the single point of failure!) I'm also guessing it's not quite as simple and straightforward as the press reports and public statements imply, but that's no excuse. Management is paid (among other things) to manage complexity. (Hopefully with the help of system architects.)

If you are the boss of one of the many airlines not mentioned in this post, you might want to schedule a conversation with a system architect. Just a suggestion.


American Airlines Gradually Restores Service After Yesterday's Power Outage (PR Newswire, 15 August 2003)

British Airways computer outage causes flight delays (Guardian, 6 Sept 2016)

Delta: ‘Large-scale cancellations’ after crippling power outage (CNN Wire, 8 August 2016)

Gatwick Airport Christmas Eve chaos a 'wake-up call' (BBC News, 11 April 2014)

Simon Calder, Dozens of flights worldwide delayed by computer systems meltdown (Independent, 14 October 2016)

Jon Cox, Ask the Captain: Do vital functions on planes have backup power? (USA Today, 6 May 2013)

Jad Mouawad, American Airlines Resumes Flights After a Computer Problem (New York Times, 16 April 2013)

Marni Pyke, Southwest Airlines apologizes for delays as it rebounds from outage (Daily Herald, 20 July 2016)

Alexandra Zaslow, Outdated Technology Likely Culprit in Southwest Airlines Outage (NBC News, Oct 12 2015)


Related posts: Single Point of Failure (Comms) (September 2016), The Cruel World of Paper (September 2016), When the Single Version of Truth Kills People (April 2019)


Updated 14 October 2016. Link added 26 April 2019

Tuesday, March 08, 2011

Creeping Business Dependency

People are slowly waking up to the fact that we have introduced yet another single point of failure into our business ecosystem. It seems that businesses have gradually made themselves dependent on Global Positioning Systems (GPS) and satellite navigation (satnav). So we are now starting to hear doom-and-gloom stories about the dire economic consequences of any interruption to the service, which could apparently be caused by anything from cyberterrorism (Daily Mail 8 March 2011) to solar flares (Daily Mail 21 Sept 2010).

Those with long memories may recall the millennium bug scare, which postulated that widespread computer error might result in total economic collapse when the date went from 99 to 00. Many companies took the opportunity to carry out a long overdue inventory of their software programs, and decommissioned a fair amount of obsolete code, as well as reviewing their disaster recovery procedures; even though the scare was probably exaggerated, some useful work was done. (I myself picked up some contract work in this area, so I can't complain.)

The Royal Academy of Engineering has just issued a report on Global Navigation Space Systems, which takes a more balanced view of the subject than the Daily Mail, but still warns of the danger of over-reliance on satellite navigation [Report (pdf), Press Release].

Chairman of the RAoE working group, Dr Martyn Thomas, told the BBC:
"We're not saying that the sky is about to fall in; we're not saying there's a calamity around the corner. What we're saying is that there is a growing interdependence between systems that people think are backing each other up. And it might well be that if a number these systems fail simultaneously, it will cause commercial damage or just conceivably loss of life. This is wholly avoidable." [BBC News 8 March 2011]

Maybe this does sound pretty speculative (as @martinjmurray complains). Nonetheless it may be a good idea for any business that has gradually become dependent on this or any other technology to check out the possible risks.

From an architectural point of view, what I find most interesting about this situation is the tendency for critical business dependencies (and the associated risks) to emerge, as a particular technology migrates unobtrusively from marginal use to core business use.

Another example of a creeping business dependency is the extent to which Google has now inserted itself into the relationship between any business and its customers. If a business offends Google in some way, and consequently disappears from Google search, this will have serious business consequences. (BMW disappeared from Google for three days in 2006 - see my post BMW Search Requests). And yet it's still rare to see Google shown as a business-critical service partner in business architecture or business process diagrams.

If we think of an architecture in terms of a set of dependencies, we can distinguish between a centrally planned architecture, in which the dependencies and their implications are understood from the outset, and an emergent de facto architecture, in which unanticipated dependencies and risks can be created by a quantity of uncontrolled activity. In a planned world, all innovation must be controlled to prevent emergent risk; in an evolving world, innovation (such as the use of Google or GPS) can be encouraged provided that there is a robust mechanism to detect and manage emerging risks.


Related posts: BMW Search Requests (Feb 2006), Cloud and Continuity of Supply Risk (March 2013)

Sunday, October 03, 2010

Single cause of failure

In May 2010, there was a mysterious plunge in US share prices. At the time an SEC investigation found that no single cause was to blame [BBC News 12 May 2010]. Following further investigation, the SEC now believes that the "flash crash" did indeed have a single cause, which the SEC has traced to an algorithmic trade of around $4 billion, executed by a single trader's computer program [BBC News 1 October 2010]. The authorities have since introduced "circuit breakers", which may help to mitigate the effects of such algorithmic trades in future, and there are hints that further measures could be on the way.
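For readers unfamiliar with the mechanism, the general idea is simply to halt trading in a security whose price moves too far too fast. The sketch below uses invented thresholds and is not the SEC's actual rule set.

```python
# Generic illustration with invented thresholds - not the SEC's actual rules.
from collections import deque

class CircuitBreaker:
    """Halt trading if the price falls more than `band` within `window` ticks."""
    def __init__(self, band: float = 0.10, window: int = 300):
        self.band = band
        self.recent = deque(maxlen=window)   # rolling window of recent prices

    def on_price(self, price: float) -> bool:
        """Return True if trading should be halted at this tick."""
        self.recent.append(price)
        reference = max(self.recent)
        return price < reference * (1 - self.band)

breaker = CircuitBreaker()
for p in [40.0, 39.8, 39.5, 38.0, 35.0]:     # a sudden slide of over 10%
    if breaker.on_price(p):
        print(f"halt trading at {p}")        # fires on the 35.0 tick
```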

I haven't looked at the detail of these circuit breakers, but my architectural instincts make me uncomfortable with the idea of bolting on an additional mechanism into an already complicated system, in order to mitigate a single point of failure. Surely these circuit breakers will themselves be subject to perverse consequences, as well as being anticipated (and perhaps even deliberately triggered) by ever more sophisticated algorithms.

One of the key principles of distributed systems architecture is the avoidance of a single point of failure. Technical architects tend to focus on technical failure, although security experts often remind them of the equal dangers of socially engineered points of failure in technical systems. Meanwhile, enterprise architects need to pay attention to the possible failure modes of the business and its ecosystem, from a business and sociotechnical perspective.

In the case of the "flash crash", the key question for market regulators and market players is about the resilience and intelligence of the market in the face of certain classes of activity. Although the finger of blame is now pointing to a piece of software, the architectural question here is not software architecture but market architecture - regarding the market as a complex sociotechnical system in which pieces of software interact with other social and economic actors. Architects should beware of thinking that "single point of failure" is merely a technical question.