
Monday, May 23, 2022

Risk Algebra

In this post, I want to explore some important synergies between architectural thinking and risk management. 


The first point is that if we want to have an enterprise-wide understanding of risk, then it helps to have an enterprise-wide view of how the business is configured to deliver against its strategy. Enterprise architecture should provide a unified set of answers to the following questions.

  • What capabilities delivering what business outcomes?
  • Delivering what services to what customers?
  • What information, process and resources to support these?
  • What organizations, systems, technologies and external partnerships to support these?
  • Who is accountable for what?
  • And how is all of this monitored, controlled and governed?

Enterprise architecture should also provide an understanding of the dependencies between these, and which ones are business-critical or time-critical. For example, there may be some components of the business that are important in the long-term, but could easily be unavailable for a few weeks before anyone really noticed. But there are other components (people, systems, processes) where any failure would have an immediate impact on the business and its customers, so major issues have to be fixed urgently to maintain business continuity. For some critical elements of the business, appropriate contingency plans and backup arrangements will need to be in place.

Risk assessment can then look systematically across this landscape, reviewing the risks associated with assets of various kinds, activities of various kinds (processes, projects, etc), as well as other intangibles (motivation, brand image, reputation). Risk assessment can also review how risks are shared between the organization and its business partners, both officially (as embedded in contractual agreements) and in actual practice.

 

Architects have an important concept, which should also be of great interest to enterprise risk management - the idea of a single point of failure (SPOF). When this exists, it is often the result of a poor design, or an over-zealous attempt to standardize and strip out complexity. But sometimes this is the result of what I call Creeping Business Dependency - in other words, not noticing that we have become increasingly reliant on something outside our control.
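To make this concrete, here is a minimal sketch (the component names are invented, and a real enterprise model would be far richer) of how single points of failure can be surfaced mechanically from a dependency model: a component is a SPOF if removing it disconnects what remains.

```python
# Toy dependency model with invented component names. A component is a
# single point of failure if removing it disconnects the remaining landscape.

edges = {
    ("web_shop", "payment_gateway"),
    ("web_shop", "product_catalogue"),
    ("web_shop", "warehouse_system"),
    ("payment_gateway", "core_banking_partner"),
    ("product_catalogue", "warehouse_system"),
}
nodes = {n for edge in edges for n in edge}

def connected(nodes, edges):
    """Is every node reachable from an arbitrary starting node?"""
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in edges:
            if current == a and b not in seen:
                seen.add(b)
                stack.append(b)
            elif current == b and a not in seen:
                seen.add(a)
                stack.append(a)
    return seen == nodes

def single_points_of_failure(nodes, edges):
    """Brute force: remove each component in turn and test connectivity."""
    return [
        node for node in nodes
        if not connected(nodes - {node}, {e for e in edges if node not in e})
    ]

print(sorted(single_points_of_failure(nodes, edges)))
# ['payment_gateway', 'web_shop'] - the creeping dependencies to worry about
```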


There are also important questions of scale and aggregation. Some years ago, I did some risk management consultancy for a large supermarket chain. One of the topics we were looking at was fridge failure. Obviously the supermarket had thousands and thousands of fridges, some in the customer-facing parts of the stores, some at the back, and some in the warehouses.

Fridges fail all the time, and there is a constant process of inspecting, maintaining and replacing fridges. So a single fridge failing is not regarded as a business risk. But if several thousand fridges were all to fail at the same time, presumably for the same reason, that would cause a significant disruption to the business.

So this raised some interesting questions. Could we define a cut-off point? How many fridges would have to fail before we found ourselves outside business-as-usual territory? What kind of management signals or dashboard could be put in place to get early warning of such problems, or to trigger a switch to a "safety" mode of operation?
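One rough way of putting a number on the cut-off, assuming independent failures and an invented fleet size and failure rate, is to treat the daily failure count as Poisson and alert beyond an agreed percentile.

```python
# Rough illustration only: fleet size and failure rate are invented. If
# individual failures are independent, the daily failure count is roughly
# Poisson, and a cut-off can be set at an agreed percentile; a day that
# breaches it suggests a common cause rather than business as usual.

import math

fleet_size = 10_000          # fridges across stores and warehouses (assumed)
annual_failure_rate = 0.05   # 5% of fridges fail in a given year (assumed)
expected_per_day = fleet_size * annual_failure_rate / 365

def poisson_cutoff(mean, quantile=0.999):
    """Smallest k such that P(failures <= k) >= quantile under a Poisson model."""
    cumulative, term, k = 0.0, math.exp(-mean), 0
    while True:
        cumulative += term
        if cumulative >= quantile:
            return k
        k += 1
        term *= mean / k

print(f"Expected failures per day: {expected_per_day:.1f}")
print(f"Outside business-as-usual: more than {poisson_cutoff(expected_per_day)} failures in a day")
```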

Obviously these questions aren't only relevant to fridges, but can apply to any category of resource, including people. During the pandemic, some organizations had similar issues in relation to staff absences.


Aggregation is also relevant when we look beyond a single firm to the whole ecosystem. Suppose we have a market or ecosystem with n players, and the risk carried by each player is R(n). Then what is the aggregate risk of the market or ecosystem as a whole? 

If we assume complete independence between the risks of each player, then we may assume that there is a significant probability of a few players failing, but a very small probability of a large number of players failing - the so-called Black Swan event. Unfortunately, the assumption of independence can be flawed, as we have seen in financial markets, where there may be a tangled knot of interdependence between players. Regulators often think they can regulate markets by imposing rules on individual players. And while this might sometimes work, it is easy to see why it doesn’t always work. In some cases, the regulator draws a line in the sand (for example defining a minimum capital ratio) and then checks that nobody crosses the line. But then if everyone trades as close as possible to this line, how much capacity does the market as a whole have for absorbing unexpected shocks?
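A crude simulation, with invented numbers and no pretence of modelling a real market, shows how much difference the independence assumption makes to the probability of widespread failure.

```python
# Invented numbers, not a market model: the point is only that a common shock
# changes the tail. With independent risks, "more than half the players fail"
# is vanishingly unlikely; with a shared shock, it happens whenever the shock does.

import random

random.seed(1)
n_players = 50
p_idiosyncratic = 0.05       # each player's stand-alone failure probability (assumed)
p_common_shock = 0.02        # probability of a market-wide shock (assumed)
p_fail_given_shock = 0.8     # failure probability once the shock hits (assumed)
trials = 100_000

def probability_of_widespread_failure(correlated):
    widespread = 0
    for _ in range(trials):
        shock = correlated and random.random() < p_common_shock
        p = p_fail_given_shock if shock else p_idiosyncratic
        failures = sum(random.random() < p for _ in range(n_players))
        if failures > n_players // 2:
            widespread += 1
    return widespread / trials

print("Independent risks:", probability_of_widespread_failure(correlated=False))
print("Common shock:     ", probability_of_widespread_failure(correlated=True))
```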

 

Both here and in the fridge example, there is a question of standardization versus diversity. On the one hand, it's a lot simpler for the supermarket if all the fridges are the same type, with a common set of spare parts. But on the other hand, having more than one type of fridge helps to mitigate the risk of them all failing at the same time. It also gives some space for experimentation, thus addressing the longer term risk of getting stuck with an out-of-date fridge estate. The fridge example also highlights the importance of redundancy - in other words, having spare fridges. 

So there are some important trade-offs here between pure economic optimization and a more balanced approach to enterprise risk.

Monday, April 22, 2019

When the Single Version of Truth Kills People

@Greg_Travis has written an article on the Boeing 737 Max Disaster, which @jjn1 describes as "one of the best pieces of technical writing I’ve seen in ages". He explains why normal airplane design includes redundant sensors.

"There are two sets of angle-of-attack sensors and two sets of pitot tubes, one set on either side of the fuselage. Normal usage is to have the set on the pilot’s side feed the instruments on the pilot’s side and the set on the copilot’s side feed the instruments on the copilot’s side. That gives a state of natural redundancy in instrumentation that can be easily cross-checked by either pilot. If the copilot thinks his airspeed indicator is acting up, he can look over to the pilot’s airspeed indicator and see if it agrees. If not, both pilot and copilot engage in a bit of triage to determine which instrument is profane and which is sacred."

and redundant processors, to guard against a Single Point of Failure (SPOF).

"On the 737, Boeing not only included the requisite redundancy in instrumentation and sensors, it also included redundant flight computers—one on the pilot’s side, the other on the copilot’s side. The flight computers do a lot of things, but their main job is to fly the plane when commanded to do so and to make sure the human pilots don’t do anything wrong when they’re flying it. The latter is called 'envelope protection'."

But ...

"In the 737 Max, only one of the flight management computers is active at a time—either the pilot’s computer or the copilot’s computer. And the active computer takes inputs only from the sensors on its own side of the aircraft."

As a result of this design error, 346 people are dead. Travis doesn't pull his punches.

"It is astounding that no one who wrote the MCAS software for the 737 Max seems even to have raised the possibility of using multiple inputs, including the opposite angle-of-attack sensor, in the computer’s determination of an impending stall. As a lifetime member of the software development fraternity, I don’t know what toxic combination of inexperience, hubris, or lack of cultural understanding led to this mistake."

He may not know what led to this specific mistake, but he can certainly see some of the systemic issues that made it possible: among other things, the widespread idea that software provides a cheaper and quicker fix than getting the hardware right, and what he calls cultural laziness.

"Less thought is now given to getting a design correct and simple up front because it’s so easy to fix what you didn’t get right later."

Agile, huh?


Update: CNN finds an unnamed Boeing spokesman to defend the design.

"Single sources of data are considered acceptable in such cases by our industry".

OMG, does that mean that there are more examples of SSOT elsewhere in the Boeing design!?




How a Single Point of Failure (SPOF) in the MCAS software could have caused the Boeing 737 Max crash in Ethiopia (DMD Solutions, 5 April 2019) - provides a simple explanation of Fault Tree Analysis (FTA) as a technique to identify SPOF.

Mike Baker and Dominic Gates, Lack of redundancies on Boeing 737 MAX system baffles some involved in developing the jet (Seattle Times 26 March 2019)

Curt Devine and Drew Griffin, Boeing relied on single sensor for 737 Max that had been flagged 216 times to FAA (CNN, 1 May 2019) HT @marcusjenkins

George Leopold, Boeing 737 Max: Another Instance of 'Go Fever'? (29 March 2019)

Mary Poppendieck, What If Your Team Wrote the Code for the 737 MCAS System? (4 April 2019) HT @CharlesTBetz with reply from @jpaulreed

Gregory Travis, How the Boeing 737 Max Disaster Looks to a Software Developer (IEEE Spectrum, 18 April 2019) HT @jjn1 @ruskin147

And see my other posts on the Single Source of Truth.


Updated  2 May 2019

Saturday, October 30, 2010

Organizations as Brains

This post is based on Chapter 3 of Gareth Morgan's classic book Images of Organization (Sage 1986), which opens with the following question: "Is it possible to design organizations so that they have the capacity to be as flexible, resilient, and inventive as the functioning of a brain?"

To start with, Morgan makes two important distinctions. The first distinction is between two different notions of rationality, and the second involves two contrasting uses of the "brain" metaphor.

Mechanistic or bureaucratic organizations rely on what Morgan (following Karl Mannheim) calls "instrumental rationality", where people are valued for their ability to fit in and contribute to the efficient operation of a predetermined structure. Morgan contrasts this with "substantial rationality", where elements of organization are able to question the appropriateness of what they are doing and to modify their action to take account of new situations. Morgan states that the human brain possesses higher degrees of substantial rationality than any man-made system. (pp78-79)

Morgan also observes a common trend to use the term "brain" metaphorically to refer to a centralized planning or management function within an organization, the brain "of" the firm. Instead, Morgan wants to talk about brain-like capabilities distributed throughout the organization, the brain "as" the firm. Using the brain metaphor in this way leads to two important ideas. Firstly, that organizations are information processing systems, potentially capable of learning to learn. And secondly, that organizations may be holographic systems, in the sense that any part represents and can stand in for the whole. (p 80)

The first of these two ideas, organizations as information processing systems, goes back to the work of James March and Herbert Simon in the 1940s and 1950s. Simon's theory of decision-making leads us to understand organizations as kinds of institutionalized brains that fragment, routinize and bound the decision-making process in order to make it manageable. (p 81) According to this theory, the  organization chart does not merely define a structure of work activity, it also creates a structure of attention, interpretation and decision-making. (p 81) Later organization design theorists such as Jay Galbraith showed how this kind of decision-making structure coped with uncertainty and information overload, either by reducing the need for information or by increasing the capacity to process information. (pp 82-83)

Nowadays, of course, much of this information processing capacity is provided by man-made systems. Writing in the mid 1980s, Morgan could already see the emergence of the virtual organization, embedded not in human activity but in computer networks. If it wasn't already, the organization-as-brain is now indisputably a sociotechnical system. The really big question, Morgan asks, is whether such organizations will also become more intelligent. (p84)

The problem here is that man-made systems (bureaucratic as well as automatic) tend towards instrumental rationality rather than substantial rationality. Such systems can produce goal-directed behaviour under four conditions, which together amount to the simple feedback loop sketched below. (p87)
  1. The capacity to sense, monitor and scan significant aspects of their environment
  2. The ability to relate this information to the operating norms that guide system behaviour
  3. The ability to detect significant deviations from these norms
  4. The ability to initiate corrective action when discrepancies are detected.
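A deliberately crude caricature of that loop (mine, not Morgan's formulation):

```python
# The four conditions as one pass of a feedback loop. The operating norm
# itself is fixed and never questioned - which is what makes it single-loop.

def single_loop_controller(sense, act, norm, tolerance=1.0):
    observed = sense()                # 1. sense and monitor the environment
    deviation = observed - norm       # 2. relate the information to the operating norm
    if abs(deviation) > tolerance:    # 3. detect a significant deviation
        act(-deviation)               # 4. initiate corrective action
    return deviation

temperature = 26.0
adjustments = []
single_loop_controller(sense=lambda: temperature, act=adjustments.append, norm=21.0)
print(adjustments)   # [-5.0]: drive the system back towards the (unquestioned) norm
```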
But this is merely single-loop learning, whereas true learning-to-learn calls for double-loop learning.  Morgan identifies three factors that inhibit double-loop learning. (pp89-90)
  1. Division of responsibilities causes a fragmentation of knowledge and attention.
  2. Bureaucratic accountability and asymmetric information produce ethical problems such as deception. (This is a form of the principal-agent problem.) 
  3. Organizations also suffer from various forms of collective self-deception, resulting in a gap between "espoused theory" and "theory-in-use".
and he goes on to identify four design principles that may facilitate double-loop learning. (pp 91-95)
  1. Encourage and value openness and reflectivity. Accept error and uncertainty.
  2. Recognize the importance of exploring different viewpoints. 
  3. Avoid imposing structures of action. Allow intelligence and direction to emerge.
  4. Create organizational structures and principles that help implement these principles.
The flexible, self-organizing capacities of a brain depend on four further design principles, which help to instantiate the notion of the "holographic" organization. (pp 98-103)

  1. Redundancy of function - each individual or team has a broader range of knowledge and skills than is required for the immediate task-at-hand, thus building flexibility into the organization.
  2. Requisite variety - the internal diversity must match the challenges posed by the environment. All elements of an organization should embody critical dimensions of the environment.
  3. Minimal critical specification - allow each system to find its own form.
  4. Learning to learn - use autonomous intelligence and emergent connectivity to find novel and progressive solutions to complex problems.
In conclusion, innovative organizations must be designed as learning systems that place primary emphasis on being open to enquiry and self-criticism. The innovative attitudes and abilities of the whole must be enfolded in the parts. (p 105) Morgan identifies two major obstacles to implementing this ideal.
  1. The realities of power and control. (p 108)
  2. The inertia stemming from existing assumptions and beliefs. (p 109)
Morgan says he favours the brain metaphor because of the fundamental challenge it presents to the bureaucratic mode of organization. (pp 382-3) Writing in the mid 1980s, Morgan noted that computing facilities were often used to increase centralization, and to reinforce bureaucratic principles and top-down hierarchical control, and expressed a hope that this was a consequence of the limited vision of system designers rather than a necessary consequence of the new technologies. "The principles of cybernetics, organizational learning, and holographic self-organization provide valuable guidelines regarding the direction [technology] change might take." (p 108) A quarter of a century later, let's hope we're finally starting to move in the right direction.

Monday, March 17, 2008

Information Algebra

I get more information from two newspapers than from one - but not twice as much information. So how much more, exactly? That depends how much difference there is between the two newspapers. 

Even if two newspapers report the same general facts, they typically report different details, and they may have different sources. To the extent that there are differences in style and detail between the two newspapers, this typically reinforces my confidence in the overall story because it indicates that the journalists are not merely reusing a common source (such as a company press release). 
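For a toy answer to "how much more, exactly", we can treat each newspaper's report as a random variable and measure information in bits. The joint probabilities below are invented, but the arithmetic shows the general point: the second paper adds less than a full paper's worth whenever the two overlap.

```python
# Toy figures: each paper reports one of two versions of the story, and the
# papers are heavily (but not perfectly) correlated. The information in the
# pair is H(X,Y) = H(X) + H(Y) - I(X;Y), so the second paper adds less than
# a full paper's worth whenever the two overlap.

from math import log2

joint = {                        # invented joint distribution over (paper X, paper Y)
    ("version A", "version A"): 0.45,
    ("version A", "version B"): 0.05,
    ("version B", "version A"): 0.05,
    ("version B", "version B"): 0.45,
}

def entropy(distribution):
    return -sum(p * log2(p) for p in distribution.values() if p > 0)

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint)
print(f"One paper:        {h_x:.2f} bits")
print(f"Both papers:      {h_xy:.2f} bits (not {h_x + h_y:.2f})")
print(f"Overlap, I(X;Y):  {h_x + h_y - h_xy:.2f} bits")
```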

In the real world, we are accustomed to the fact that information and intelligence need double-checking and corroboration. And yet in the computer world, there is a widespread belief that it is always a good thing to have a single source of information - that repeated messages are not only unnecessary but wasteful. Data cleansing wipes out difference in the name of consistency and standardization, leaving the resulting information flat and attenuated. A single source of information ("single source of truth") sometimes means a single source of failure - never a good idea in an open distributed system.

Writing about this in an SOA context - when three heads are better than one - Steve Jones describes this as redundancy, and points out the potential value of redundancy to increase reliability. He quotes Lewis Carroll (as Andrew Clarke points out, it was actually the Bellman, in The Hunting of the Snark): "What I tell you three times is true."
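Read as simple redundancy, the Bellman's rule is easy to quantify. If each of three independent sources is right with probability p (the 0.8 below is just an example figure), believing what at least two of them agree on beats believing any one of them - provided the sources really are independent and individually more often right than wrong.

```python
# Majority voting over three independent sources, each right with probability p.
# Only helps when p > 0.5 and the sources are genuinely independent.

from math import comb

def majority_of_three(p):
    return sum(comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3))

p = 0.8
print(f"One source:       {p:.3f}")
print(f"Two out of three: {majority_of_three(p):.3f}")   # 0.896
```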

The same quote can be found at the head of Chapter 3 of Gregory Bateson's Mind and Nature, available online as Multiple Versions of the World. This expands on Bateson's earlier slogan "Two descriptions are better than one".

Bateson himself used the word "redundancy", but it is not a simple redundancy that can be plucked out without a second thought. Thinking about the consequences of adding and subtracting redundancy is a hard problem - Paulo Rocchi calls it calculus, but I prefer to call it algebra.

Monday, November 13, 2006

Service-oriented security 2

Form Follows Function.

In a recent post, Bruce Schneier makes some interesting points about the relationship between Architecture and Security [via Confused of Calcutta].
  • "Security concerns have always influenced architecture."

  • "The problem is that architecture tends toward permanence, while security threats change much faster. Something that seemed a good idea when a building was designed might make little sense a century -- or even a decade -- later. But by then it's hard to undo those architectural decisions."

  • "It's dangerously shortsighted to make architectural decisions based on the threat of the moment without regard to the long-term consequences of those decisions."
End-to-End Process.

In a separate post on Voting Technology and Security, Bruce Schneier describes the steps in ensuring that the result of an election properly represents the intentions of the voters.

"Even in normal operations, each step can introduce errors. Voting accuracy, therefore, is a matter of 1) minimizing the number of steps, and 2) increasing the reliability of each step."

Whether this is strictly true depends on the architecture of the process - whether it is a simple linear process with no redundancy or latency, or whether there is deliberate redundancy built in to provide security of the whole over and above the security of each step. Bruce himself advocates a paper audit trail, which can be used retrospectively if required to verify the accuracy of the electronic voting machines.
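A back-of-envelope comparison, with invented reliability figures, illustrates the point: optimizing each step of a linear chain only gets you so far, whereas an independent audit trail protects the whole.

```python
# Invented reliability figures, for illustration only: a purely linear chain
# versus the same chain backed by an independent paper audit trail that can
# catch and correct a failure in the electronic path (assuming the audit
# fails independently of the chain).

def chain_reliability(step_reliabilities):
    """Probability that every step of a linear process works."""
    result = 1.0
    for r in step_reliabilities:
        result *= r
    return result

electronic_chain = [0.999, 0.995, 0.99, 0.998]   # e.g. register, capture, count, report
paper_audit = 0.97                               # chance the audit catches a chain failure

linear = chain_reliability(electronic_chain)
with_audit = 1 - (1 - linear) * (1 - paper_audit)

print(f"Linear chain only:         {linear:.4f}")
print(f"Chain + independent audit: {with_audit:.4f}")
```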

Shearing Layers.

Security management doesn't necessarily operate on the same timescale as other elements of architecture. Our approach to service-oriented security - indeed, to SOA generally - is based on the notion of a layered architecture, in which each layer has a different rate of change. (This is based on the Shearing Layers principle, now known as the Pace Layering principle.) Thus the security layer is decoupled from the core business layer, and also from the user experience layer.

Previous Posts: Adaption and Adaptability, Business IT Alignment 2, Service-Oriented Security

Tuesday, April 11, 2006

Loose Coupling 2

Is loose coupling a defining characteristic of the service-based business and/or a core principle of SOA? ZDnet's analysts have been producing a set of IT commandments (13 at the last count), and Joe McKendrick's latest contribution is Thou Shalt Loosely Couple.

Joe quotes John Hagel's definition of loose coupling, which refers to reduced interdependencies between modules or components, and consequently reduced interoperability risk. Hagel clearly intends this definition to apply to dependencies between business units, not just technical artefacts. I think this is fine as far as it goes, but it is not precise enough for my taste.

In his post The Developer's View of SOA: Just adding complexity?, Ron Ten-Hove (Sun Microsystems) defines loose coupling in terms of knowledge - "the minimization of the "knowledge" a service consumer has of a service provider it is using, and vice versa".

(It is surely possible to link these defining notions - interoperability and interdependency, risk and knowledge - at a deep level, but I'm not going to attempt it right now.)
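To make the knowledge-based definition concrete, here is a small caricature (my own illustration, with invented class names, not Ron Ten-Hove's code): the tightly coupled consumer knows the provider's concrete class, its method names and even its internal data structure, while the loosely coupled consumer knows only a narrow contract.

```python
# Tight versus loose coupling as a question of how much the consumer "knows".

from typing import Protocol

class LegacyBillingSystem:
    def __init__(self):
        self.internal_ledger = {}

    def post_to_ledger(self, key, amount):
        self.internal_ledger[key] = self.internal_ledger.get(key, 0.0) + amount

def pay_invoice_tightly(billing: LegacyBillingSystem, invoice: str, amount: float):
    # Tight coupling: depends on the concrete class and reaches into its internals.
    billing.post_to_ledger(invoice, amount)
    print(billing.internal_ledger[invoice])

class PaymentService(Protocol):
    def pay(self, invoice: str, amount: float) -> None: ...

def pay_invoice_loosely(service: PaymentService, invoice: str, amount: float):
    # Loose coupling: all the consumer knows is one narrow operation.
    service.pay(invoice, amount)

class BillingAdapter:
    """Adapts the legacy system to the narrow contract."""
    def __init__(self, legacy: LegacyBillingSystem):
        self._legacy = legacy

    def pay(self, invoice: str, amount: float) -> None:
        self._legacy.post_to_ledger(invoice, amount)

pay_invoice_loosely(BillingAdapter(LegacyBillingSystem()), "invoice-7", 25.0)
```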

I want to have a notion of loose coupling that applies to sociotechnical systems of systems - and therefore needs to cover organizational interdependencies and interoperability as well as technical. I have previously proposed a definition of Loose Coupling based on Karl Weick's classic paper on loosely coupled organizations.

The trouble with proclaiming the wonders of loose coupling is that it sounds as if tight coupling was just a consequence of stupid design and/or stupid technology. It fails to acknowledge that there are sometimes legitimate reasons for tight coupling.

Ron Ten-Hove puts forward a more sophisticated argument for loose coupling. He acknowledges the advantages of what he calls a mixed service model, namely that "it allows for creation of a component model that combines close-coupling and loose-coupling in a uniform fashion". But he also talks about the disadvantages of this model, in terms of reduced SOA benefits and increased developer complexity, at least with the current technology.

Loose coupling is great, but it is not a free lunch. It is not simply a bottom-up consequence of the right design on the right platform. Sometimes loose coupling requires a top-down forcing-apart. I think the correct word for this top-down forcing-apart is deconfliction, although when I use this word it causes some of my colleagues to shudder in mock horror.

Deconfliction is a word used in military circles, to refer to the active principle of making one unit independent of another, and this will often include the provision of redundant supplies and resources, or a tolerance of reduced utilization of some central resources. Deconfliction is a top-down design choice.

Deconfliction is an explicit acceptance of the costs of loose coupling, as well as the benefits. Sometimes the deconflicted solution is not the most efficient in terms of economies of scale, but it is the most effective in terms of flexibility and interoperability. This is the kind of trade-off that military planners are constantly addressing.


Sometimes coupling is itself a consequence of scale. At low volumes, a system may be able to operate effectively in asynchronous mode. At high volumes, the same system may have to switch to a more synchronous mode. If an airport gets two incoming flights per hour, then the utilization of the runway is extremely low and planes hardly ever need to wait. But if the airport gets two incoming flights per minute, then the runway becomes a scarce resource demanding tight scheduling, and planes are regularly forced to wait for a take-off or landing slot. Systems can become more complex simply as a consequence of a change in scale.

(See my earlier comments on the relationship between scale and enterprise architecture: Lightweight Enterprise.)
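Rough numbers make the airport example vivid. Using the textbook M/M/1 waiting-time formula and an assumed runway occupancy of about 25 seconds per movement (an illustrative figure, not airport data):

```python
# Rough numbers for the airport example, using the textbook M/M/1 result
# W = 1/(mu - lambda). The 25-second runway occupancy per movement is an
# assumed figure purely for illustration.

def mean_minutes_from_arrival_to_clear(arrivals_per_hour, runway_seconds=25):
    mu = 3600 / runway_seconds           # movements per hour the runway can serve
    if arrivals_per_hour >= mu:
        return float("inf")              # demand exceeds capacity: the queue grows without bound
    return 60 / (mu - arrivals_per_hour)

for arrivals in (2, 120):                # two per hour versus two per minute
    w = mean_minutes_from_arrival_to_clear(arrivals)
    print(f"{arrivals:>3} arrivals/hour: about {w:.1f} minutes per aircraft, including queueing")
```

At two flights an hour the time is essentially just the runway occupancy; at two a minute, most of the 2.5 minutes is spent waiting, and the whole operation has to be tightly scheduled to keep even that.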

In technical systems, loose coupling carries an overhead - not just an operational overhead, but a design and governance overhead. Small-grained services may give you greater decoupling, but only if you have the management capability to coordinate them effectively. In sociotechnical systems, fragmentation may impair the effectiveness of the whole, unless there is appropriate collaboration.

In summary, I don't see loose coupling as a principle of SOA. I prefer to think of it as a design choice. I think it's great that SOA technology gives us better choices, but I want these choices to be taken intelligently rather than according to some fixed rules. SOA entails just-enough loose coupling with just-enough coordination. What is important is getting the balance right.