Showing posts with label failure. Show all posts

Monday, April 22, 2019

When the Single Version of Truth Kills People

@Greg_Travis has written an article on the Boeing 737 Max Disaster, which @jjn1 describes as "one of the best pieces of technical writing I’ve seen in ages". He explains why normal airplane design includes redundant sensors.

"There are two sets of angle-of-attack sensors and two sets of pitot tubes, one set on either side of the fuselage. Normal usage is to have the set on the pilot’s side feed the instruments on the pilot’s side and the set on the copilot’s side feed the instruments on the copilot’s side. That gives a state of natural redundancy in instrumentation that can be easily cross-checked by either pilot. If the copilot thinks his airspeed indicator is acting up, he can look over to the pilot’s airspeed indicator and see if it agrees. If not, both pilot and copilot engage in a bit of triage to determine which instrument is profane and which is sacred."

and redundant processors, to guard against a Single Point of Failure (SPOF).

"On the 737, Boeing not only included the requisite redundancy in instrumentation and sensors, it also included redundant flight computers—one on the pilot’s side, the other on the copilot’s side. The flight computers do a lot of things, but their main job is to fly the plane when commanded to do so and to make sure the human pilots don’t do anything wrong when they’re flying it. The latter is called 'envelope protection'."

But ...

"In the 737 Max, only one of the flight management computers is active at a time—either the pilot’s computer or the copilot’s computer. And the active computer takes inputs only from the sensors on its own side of the aircraft."

As a result of this design error, 346 people are dead. Travis doesn't pull his punches.

"It is astounding that no one who wrote the MCAS software for the 737 Max seems even to have raised the possibility of using multiple inputs, including the opposite angle-of-attack sensor, in the computer’s determination of an impending stall. As a lifetime member of the software development fraternity, I don’t know what toxic combination of inexperience, hubris, or lack of cultural understanding led to this mistake."
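The multiple-input approach Travis describes can be sketched in a few lines. This is a purely hypothetical illustration, not Boeing's MCAS code: the function name and the disagreement threshold are my own assumptions. The point is simply that with two sensors, automation can detect disagreement and stand down rather than act on one possibly faulty reading.

```python
# Hypothetical sketch of a dual-sensor cross-check - NOT actual avionics code.
DISAGREE_THRESHOLD_DEG = 5.5  # assumed tolerance between the two vanes

def aoa_reading(left_deg, right_deg):
    """Return (value, valid) for an angle-of-attack estimate.

    If the two vanes disagree beyond the threshold, declare the reading
    invalid so that automation can disengage, instead of trimming the
    aircraft on the basis of a single possibly-faulty sensor.
    """
    if abs(left_deg - right_deg) > DISAGREE_THRESHOLD_DEG:
        return None, False  # disagreement: inhibit automatic action
    return (left_deg + right_deg) / 2, True  # average of agreeing sensors

value, valid = aoa_reading(4.0, 74.5)  # one vane stuck at a wild value
assert not valid  # automation should stand down, not push the nose over
```

With only one sensor feeding the active computer, there is nothing to compare against, so this check is impossible by construction.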

He may not know what led to this specific mistake, but he can certainly see some of the systemic issues that made this mistake possible. Among other things, the widespread idea that software provides a cheaper and quicker fix than getting the hardware right, together with what he calls cultural laziness.

"Less thought is now given to getting a design correct and simple up front because it’s so easy to fix what you didn’t get right later."

Agile, huh?


Update: CNN finds an unnamed Boeing spokesman to defend the design.

"Single sources of data are considered acceptable in such cases by our industry".

OMG, does that mean that there are more examples of SSOT elsewhere in the Boeing design!?




How a Single Point of Failure (SPOF) in the MCAS software could have caused the Boeing 737 Max crash in Ethiopia (DMD Solutions, 5 April 2019) - provides a simple explanation of Fault Tree Analysis (FTA) as a technique to identify SPOF.

Mike Baker and Dominic Gates, Lack of redundancies on Boeing 737 MAX system baffles some involved in developing the jet (Seattle Times 26 March 2019)

Curt Devine and Drew Griffin, Boeing relied on single sensor for 737 Max that had been flagged 216 times to FAA (CNN, 1 May 2019) HT @marcusjenkins

George Leopold, Boeing 737 Max: Another Instance of 'Go Fever'? (29 March 2019)

Mary Poppendieck, What If Your Team Wrote the Code for the 737 MCAS System? (4 April 2019) HT @CharlesTBetz with reply from @jpaulreed

Gregory Travis, How the Boeing 737 Max Disaster Looks to a Software Developer (IEEE Spectrum, 18 April 2019) HT @jjn1 @ruskin147

And see my other posts on the Single Source of Truth.


Updated 2 May 2019

Saturday, April 28, 2018

Be the Change

Anyone fancy a job as Head of Infrastructure? Here is the job description, posted to LinkedIn earlier this week.

We're responsible for "IT Change", including the end to end architecture, deployment and maintenance of IT infrastructure technologies across [organization]. We’re the first technical point of contact for people in [organization] who want to speak to the CIO function. We take business requirements and architect solutions, then work with [group IT] to input the solution into our data centres.
We provide direction, thought leadership, guidance and subject matter expertise on our IT estate to make sure we get the maximum value from our investment in our IT. We do this by defining our IT strategy and aligning it with Group IT, producing technology roadmaps and identifying and recommending IT solution opportunities, supporting business initiatives and ideas, and documenting and managing our architecture assets.
The Head of Infrastructure is a key leadership role in the CIO and critical to the delivery of both customer and partner facing technology. Working closely with our technology supplier, group IT, CISO and Service Management teams, this leader will be accountable for the end to end analysis, design, build, test and implementation of: Platforms and Middleware, Network and Communications, Cloud Services, Data Warehouse and End User Services.
https://www.linkedin.com/jobs/view/633114556/

The job description contains a number of key words and phrases that architects should be comfortable with - direction, strategy, alignment, thought leadership, roadmaps, architecture assets.

But perhaps the first clue that there may be something amiss with this position is the fact that "IT Change" is in quotes. (As if to say that in IT, nothing really changes.)

The Register has contacted the person who is (according to LinkedIn) currently holding this position. Is he moving on, moving up? Could this vacancy be connected in any way with recent IT difficulties facing the organization? (No answer reported. Curious.)

The recent IT difficulties facing this particular organization have come to the attention of politicians and the media. After the chair of the Treasury Select Committee described the situation as having "all the hallmarks of an IT meltdown", the word "meltdown" is now the descriptor of choice for journalists covering the story.

But help is at hand: IBM has kindly volunteered to help sort out the mess. So we can guess what "working closely with our technology supplier" might look like.




Karl Flinders, TSB IT meltdown has the makings of an epic (Computer Weekly, 25 April 2018)

Samuel Gibbs, Warning signs for TSB's IT meltdown were clear a year ago – insider (The Guardian, 28 April 2018)

Kat Hall, Newsworthy Brit bank TSB is looking for a head of infrastructure (The Register, 27 April 2018)

Stuart Sumner, TSB brings in IBM in attempt to resolve IT crisis (Computing, 26 April 2018)

Saturday, March 10, 2018

Fail Fast - Why did the Chicken cross the road?

A commonly accepted principle of architecture and engineering is to avoid a single point of failure (SPOF). A single depot for a chain of over 850 fast food restaurants could be risky, as KFC was warned when it announced that it was switching its logistics from Bidvest to a partnership with DHL and QSL, to be served out of a single depot in Rugby. We may imagine that the primary motivation for KFC was cost-saving, although the announcement was dressed up in management speak - "re-writing the rule book" and "setting a new benchmark".

The new system went live on 14th February 2018. The changeover did not go well: by the weekend, over three quarters of the stores were closed. Rugby is a great location for a warehouse - except when there is a major incident on a nearby motorway. (Who knew that could happen?)

After a couple of weeks of disruption, as well as engaging warehouse-as-a-service startup Stowga for non-food items, KFC announced that it was resuming its relationship with Bidvest. According to some reports, Burger King also flirted with DHL some years ago before returning to Bidvest. History repeating itself.

However, the problems faced by KFC cannot be attributed solely to the decision to supply the whole UK mainland from Rugby. A just-in-time supply chain needs contingency planning - covering business continuity and disaster recovery. (Good analysis by Richard Priday, who tweets as @InsomniacSteel.)



KFC revolutionizes UK foodservice supply chain with DHL and QSL appointment (DHL Press Release, 11 Oct 2017)

Andrew Don, KFC admits chicken waste as cost of DHL failure grows (The Grocer, 23 Feb 2018)

Andrea Felsted, Supply chains: Look for the single point of failure (FT 2 May 2011)

Adam Leyland, KFC supply chain fiasco is Heathrow's Terminal 5 all over again (The Grocer, 23 Feb 2018)

Charlie Pool (CEO of Stowga), Warehousing on-demand saves KFC (Retail Technology 26 February 2018)

Richard Priday, The inside story of the great KFC chicken shortage of 2018 (Wired, 21 February 2018); How KFC ended the great chicken crisis by taking care of its mops (Wired, 2 March 2018); The KFC chicken crisis is finally over: it's (sort of) ditched DHL (Wired, 8 March 2018)

Carol Ryan, Stuffed KFC only has itself to blame (Reuters, 20 February 2018)

Su-San Sit, KFC was 'warned DHL would fail' (Supply Management, 20 February 2018)

Matthew Weaver, Most KFCs in UK remain closed because of chicken shortage (Guardian, 19 Feb 2018); KFC was warned about switching UK delivery contractor, union says (Guardian, 20 Feb 2018)

Zoe Wood, KFC returns to original supplier after chicken shortage fiasco (Guardian 8 March 2018)

Wikipedia: Single Point of Failure

Related posts: Fail Fast - Burger Robotics (March 2018)

Tuesday, September 06, 2016

The Cruel World of Paper

Some airlines have blamed recent service disruptions on power cuts. In contrast, recent disruptions to service at BA have been blamed on a system upgrade in the check-in department. Among other things, it appears that BA staff could not access their computers to see which passengers had gone through security.




The cruel world of paper remains a constant threat in some sectors, particularly healthcare. In the same week of February 2017, those afflicted included a hospital in Melbourne and a hospital trust in Tyneside. Healthcare also seems particularly vulnerable to virus and ransomware attacks.

In contrast, when media giant WPP was hit by a ransomware attack in June 2017, although some creatives turned to "good old-fashioned pen and paper", it seems that other departments simply went out, bought some Macs, and went to work in local coffee shops. Oh, cruel world.



Related posts

Single Point of Failure (Airlines) (August 2016)



Chris Baraniuk, BA apologises for check-in delays at Heathrow and Gatwick (BBC News, 18 July 2016)

Kate Gibbons, Cyberattack sent marketing giant back to pen and paper (Times, 7 December 2017) (paywall)

Jeremy Lee, Safety first: Lessons from the cyberattack on WPP (Campaign, 7 July 2017)

Kate McDonald, EMR crash causes Austin Health to revert to paper temporarily (Pulse+IT, 17 February 2017)

Laura Stevens, Failure of core network at Northumbria downs IT systems (Digital Health, 23 February 2017)

Evan Sweeney, Buffalo hospital returns to pen and paper after a virus shuts down IT systems (12 April 2017)

British Airways passengers delayed by computer glitch (BBC News, 6 September 2016)

Who was hit by the Cyber attack? (BBC News, 13 May 2017)




Post updated and extended 28 December 2017

Friday, September 02, 2016

Single Point of Failure (Comms)

Large business-critical systems can be brought down by power failure. My previous post looked at Airlines. This time we turn our attention to Telecommunications.




Obviously a power cut is not the only possible cause of business problems. Another single point of failure could be a single rogue employee.




Gavin Clarke, Telecity's engineers to spend SECOND night fixing web hub power outage (The Register, 18 November 2015)


Related Post: Single Point of Failure (Airlines) (August 2016)

Monday, August 08, 2016

Single Point of Failure (Airlines)

Large business-critical systems can be brought down by power failure. Who knew?

In July 2016, Southwest Airlines suffered a major disruption to service, which lasted several days. It blamed the failure on "lingering disruptions following performance issues across multiple technology systems", apparently triggered by a power outage.
In August 2016 it was Delta's turn.

Then there were major problems at British Airways (Sept 2016) and United (Oct 2016).



The concept of "single point of failure" is widely known and understood. And the airline industry is rightly obsessed by safety. They wouldn't fly a plane without backup power for all systems. So what idiot runs a whole company without backup power?

We might speculate about the degree of complacency or technical debt that accounts for this pattern of adverse incidents. I haven't worked with any of these organizations myself. However, my guess is that some people within the organization were aware of the vulnerability, but somehow this awareness didn't penetrate the management hierarchy. (In terms of orgintelligence, a short-sighted board of directors becomes the single point of failure!) I'm also guessing it's not quite as simple and straightforward as the press reports and public statements imply, but that's no excuse. Management is paid (among other things) to manage complexity. (Hopefully with the help of system architects.)

If you are the boss of one of the many airlines not mentioned in this post, you might want to schedule a conversation with a system architect. Just a suggestion.


American Airlines Gradually Restores Service After Yesterday's Power Outage (PR Newswire, 15 August 2003)

British Airways computer outage causes flight delays (Guardian, 6 Sept 2016)

Delta: ‘Large-scale cancellations’ after crippling power outage (CNN Wire, 8 August 2016)

Gatwick Airport Christmas Eve chaos a 'wake-up call' (BBC News, 11 April 2014)

Simon Calder, Dozens of flights worldwide delayed by computer systems meltdown (Independent, 14 October 2016)

Jon Cox, Ask the Captain: Do vital functions on planes have backup power? (USA Today, 6 May 2013)

Jad Mouawad, American Airlines Resumes Flights After a Computer Problem (New York Times, 16 April 2013)

Marni Pyke, Southwest Airlines apologizes for delays as it rebounds from outage (Daily Herald, 20 July 2016)

Alexandra Zaslow, Outdated Technology Likely Culprit in Southwest Airlines Outage (NBC News, Oct 12 2015)


Related posts: Single Point of Failure (Comms) (September 2016), The Cruel World of Paper (September 2016), When the Single Version of Truth Kills People (April 2019)


Updated 14 October 2016. Link added 26 April 2019

Friday, January 27, 2012

On the misuse of general principles

#entarch There is a common fallacy among enterprise architects that radical structural and behavioural change can and should be driven by a few simple and powerful ideas. Alas, the public sector is strewn with the disastrous consequences of this fallacy.

We can find countless examples from the National Health Service (NHS) in the UK. For Steve Harrison, Honorary Professor of Social Policy at the University of Manchester, the idea that NHS reorganisations can be triggered by a few general ideas is one of the Seven Fallacies of English Health Policy. He points out that high levels of abstraction (beloved by academics and architects alike) do not allow proper assessment of the plausibility of claims about benefits of reorganisation and how the system will work. (HT @mellojonny)

Where do health reorganization principles come from? I asked a popular search engine, and was led to a paper called Basic Principles of Information Technology Organization in Health Care Institutions (JAMIA 1997). (I suppose from the high search ranking of this paper that it is a widely used source for such principles.) The paper concludes that all organizations MUST have certain characteristics, based on a single case study where these characteristics seemed to be beneficial; in other words, arguing from the particular to the general. (I'm sure there must be some more rigorous studies, but they don't seem to get as good search rankings for some reason.)

But many of the principles that govern sweeping architectural reforms of the public sector aren't even derived by thinly based generalization from such observed vignettes, but are derived from purely abstract concepts such as "choice" and "competition" and "justice", to which each may attach his or her own politically motivated interpretation.

This leads to several levels of failure - not only failure of execution and planning (because the generalized principles are not sufficiently refined to provide realistic and coherent solutions to complex practical problems) but also failure of intention (because a vague but upbeat set of principles helps to conceal the fact that the underlying vision remains woolly).

Thursday, November 16, 2006

Reliability and Availability

One of the pleasures of being an industry analyst is that you get to read a lot of vendor material.

Yesterday I came across the following statement in a white paper by Jonathan Purdy of Tangosol on Data Grids and SOA.
"As Business Services are integrated into increasingly complex work-flows, the added dependencies decrease availability. If a Business Process depends on a number of services, the availability of the process is actually the product of the weaknesses of all the composed services. For example, if a Business Process depends on six services, each of which achieves 99% uptime, then the Business Process itself will have a maximum of 94% uptime, corresponding to more than 500 hours of unplanned downtime each year."
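The arithmetic in the quoted passage is easy to check: when a process requires all of its component services to work, its availability is the product of the individual availabilities. A minimal sketch (function names are mine):

```python
# Availability of a chain of services that must ALL work is the
# product of their individual availabilities.
HOURS_PER_YEAR = 24 * 365

def series_availability(availabilities):
    """Availability of a process that fails if any component fails."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

process = series_availability([0.99] * 6)  # six services at 99% uptime
downtime = (1 - process) * HOURS_PER_YEAR
print(f"uptime {process:.1%}, downtime {downtime:.0f} hours/year")
# -> uptime 94.1%, downtime 513 hours/year, as the white paper claims
```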
This might be true under certain circumstances, but it depends on how the business process is designed, and the degree of coupling within the process. A typical design objective is to compose services in a way that doesn't multiply the dependencies in this way. One of the principles of distributed systems has always been to avoid single points of failure, and this principle is surely inherited by SOA. When used intelligently, loose coupling and asynchrony should make a business more robust.

It is sometimes possible to orchestrate services in such a way that the reliability of the whole is greater than the reliability of the parts. Firstly, there may be underlying services that are not required for every transaction, so the reliability of these underlying services only partially impacts the reliability of the process. Secondly, there may be services that provide multiple or alternative provision of a given capability - alternative process paths can be defined to make the process more fault-tolerant.
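The arithmetic for alternative paths is the mirror image of the series case: if any one of several independent alternatives suffices, the capability is lost only when all of them fail. Again a sketch under my own naming, not anything from the white paper:

```python
# With n independent alternative providers of a capability, where any
# one is enough, the capability fails only when ALL alternatives fail.
def parallel_availability(availabilities):
    """Availability when any one of several alternatives suffices."""
    prob_all_fail = 1.0
    for a in availabilities:
        prob_all_fail *= (1 - a)
    return 1 - prob_all_fail

# Two alternative 99%-available paths to the same capability:
print(f"{parallel_availability([0.99, 0.99]):.2%}")  # -> 99.99%
```

So the same six 99% services can yield a process much better than 94% uptime, if the composition provides alternative paths rather than a single chain of dependencies.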

That's not going to make the problem of reliability and availability go away of course, and there are undoubtedly some useful things that vendors such as Tangosol can offer in the physical implementation of SOA. And Purdy is right to warn his readers of the risks associated with complexity.

But what this raises for me is the general difficulty of reasoning about non-functional requirements. Do you add them, do you multiply them, do you average them, or do you need to perform a more complicated bit of algebra?



Tuesday, September 26, 2006

Lost Bags

I am sure I don't want to compete with Redmonk analyst Stephen O'Grady in the travel disruption stakes ...

... but the airlines have managed to lose my bag on three consecutive flights home from the USA. Once from Las Vegas via Los Angeles, and twice from St Louis via Chicago.

Is this a record? And what's it got to do with the service-oriented business?

On two of the three occasions, the problem was a tight connection. Well it wasn't a tight connection when I checked in, but when I got to the gate for the first leg I discovered that the flight was seriously delayed. Helpful ground staff managed to get me onto a different flight on both occasions, in one case as the doors were closing, but of course the bag was already checked in for the delayed flight. So hardly a surprise that the bag got left behind.

Trying to outwit the airlines and their propensity for delay, I booked an extremely long connection at Chicago so that there was no chance of missing my connection. But what about my bag joining me on the same flight? Perhaps having too much time for a process leaves room for a different class of error.

Apart from the slight uncertainty about whether and when, it's quite nice not to have to carry a heavy bag home from the airport, and have someone deliver it to your door instead. I'm getting quite used to the procedure.

But I got some interesting glimpses of what happens when a complicated process across multiple organizations goes wrong.

1. Collaboration failures (and consequent mistrust) between airlines. When there is an alliance between two airlines, it's always the other airline's fault when something goes wrong. An employee of airline A couldn't change the status of my ticket on airline B's system; an employee of airline B rolled his eyes when I said that I had flown with airline A for the first part of my journey.

2. Supply chain visibility and trust. When I got onto one plane, I asked whether my bag was on the same flight. The ground staff looked up the tag on the computer and assured me it was. At the other end, a customer service rep said this information might have been unreliable. (Perhaps the ground staff had lied to me in order to get me onto the plane without a fuss - the lost bag would then be someone else's problem.)

3. The baggage recovery system works after a fashion, but it's highly inefficient. It surely can't be economic to have a delivery van to take bags to passengers.

4. Airline schedules are generally designed around hubs. So there are lots of connecting flights. But connections get delayed, and bags get lost. This must surely affect the overall economics of the hub-and-spoke model of air travel.


I might add some more notes later ...

Friday, April 28, 2006

Bathroom Interoperability

There is an urban myth that Americans in Europe always complain about the showers and the plumbing. My fellow Brit Phil Wainewright, currently on tour around the US West Coast, has just posted a complaint about Hotel Sink Stoppers on his SaaS blog.

Phil says this is Off-Topic. But I thought his travails with the sink stoppers were actually (there's a British word for you) rather relevant to SOA and SaaS.

Firstly, Phil's complaint involves an escalation of service failure. The sink stoppers fail. And as a result, the sink fails to provide the service that Phil requires - a sinkful of hot water. Which means that his preferred method of shaving fails.

What kind of failure is this? There are three possibilities.

Firstly, it could be a design failure (error of execution). The sink stopper doesn't do what it was supposed to do, period.

Secondly, it could be an architectural failure (error of planning). The sink stopper itself is fine, it just doesn't work very well in combination with this particular sink.

Thirdly, it could be a strategic failure (error of intention). The sink stopper wasn't ever intended to hold water, merely to stop your wedding ring going down the pipes.

Or maybe it isn't a failure at all - merely a clever security device to stop people leaving the taps on and flooding the room downstairs.

From the supply-side perspective (namely the hotel), Phil's inability to shave the way he likes probably ranks lower in the hotel's scheme of things than Phil's inability to cause a flood. So even if it were aware of the problem, the hotel would probably choose to degrade the quality of service experienced by Phil, rather than incur the theoretical risk of a complete service outage elsewhere.

Ultimately, this is an example of asymmetric demand. The hotel simply doesn't recognize Phil's service requirement - there is a value deficit.

Pass the SOAP.

[Update]
See further comments by Vinnie M with reply from Phil W.
