Friday, June 29, 2012

Where were the architects at RBS?

#entarch Some interesting architectural implications of the recent embarrassing failure of banking systems at RBS-NatWest Bank, which has caused financial stress and distress for millions of customers.

A banking software expert quoted in the Guardian offered an interesting architectural analogy.

"Banking systems are like a huge game of Jenga [the tower game played with interlaced blocks of wood]. Two unrelated transactions might not look related now, but 500,000 transactions from now they might have a huge relation. So everything needs to be processed in order."

This analogy suggests that the problem is one of architectural knowledge and governance. This is always a problem for any large and complex enterprise, but outsourcing typically amplifies such problems. From the press reports, it seems that the implementation of the RBS-NatWest application architecture has been delegated to relatively inexperienced staff in India with little knowledge of the RBS-NatWest business.

The finger of blame is being pointed at CA-7, which I understand to be a middleware product responsible for the orchestration of complex batch runs. As recently as February, there were job adverts in India urgently seeking people with CA-7 experience for the RBS contract.
"Distributes or centralize job submission, management and monitoring as you choose and simplify job management by automating as much as possible and provides a simple-to-use interface to manage your environment. CA 7® Workload Automation is a mainframe-hosted, fully-integrated workload automation engine that coordinates and executes job schedules and event triggers across the enterprise."

http://www.ca.com/us/products/detail/ca-7-workload-automation.aspx


The Guardian continues:
"It seems whoever made the update to CA-7 managed to delete or corrupt the files which hold the schedule for the overnight jobs, so they did not run, or ran incorrectly."

ComputerWorld quotes an RBS spokesman:
"The focus right now is on fixing the problem, which was triggered during a software system upgrade."
And the BBC's Robert Peston adds:

"the software update that went so badly wrong last Tuesday night was fairly quickly identified and patched by Royal Bank; it is the absence of a contingency plan to deal with the knock-ons from the initial computer failure that many will see as deeply troubling"

I presume that CA-7 expertise involves the ability to create and maintain these control files. But these control files essentially contain executable metadata that describe how the applications must be joined up, which must ultimately be based on a rigorous view of the application architecture - in other words, a model of the application layer.
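To make the idea of "executable metadata" concrete, here is a minimal sketch in Python. It is not CA-7's file format or API, which I have not seen; the job names and the dependency table are invented purely for illustration. The point is only that a run order can be derived from a declared model of which jobs depend on which.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical overnight jobs (names are assumptions, not RBS's real batch).
# Each job maps to the set of jobs that must complete before it may start.
DEPENDS_ON = {
    "post_transactions": {"load_payments"},
    "calculate_interest": {"post_transactions"},
    "update_balances": {"post_transactions", "calculate_interest"},
    "produce_statements": {"update_balances"},
}


def run_order(depends_on):
    """Return one valid execution order for the declared dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = run_order(DEPENDS_ON)
print(order)
```

If the dependency table is lost or corrupted, nothing in the scheduler itself can reconstruct the correct order; that knowledge lives in the model.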

In my discussion of business capabilities, I have always said that the most troublesome capabilities (and the ones overlooked by most business analysts) are the coordination capabilities, and these are the ones that need the most care when outsourcing. The RBS-NatWest incident illustrates this point.

@davidsprott uses the incident to illustrate the need for application modernization. But was the problem in the core application systems, or was it in the platform layer?  

To the extent that application coordination is being managed via CA-7, it looks suspiciously as if the model of the application layer was embedded in the platform layer, and managed as if it was merely technical infrastructure. This suggests a fundamental architectural flaw in RBS systems - a failure to maintain a clean separation of concerns between the application layer and the platform layer.
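One way to picture a cleaner separation, offered as a sketch under my own assumptions rather than a description of anything RBS or CA-7 actually does: the application layer owns the dependency model, and any schedule held in the platform layer is checked against that model before it is allowed to run. The function and job names below are hypothetical.

```python
# Application-layer model (hypothetical): job -> jobs that must run first.
APPLICATION_MODEL = {
    "post_transactions": {"load_payments"},
    "update_balances": {"post_transactions"},
}


def check_schedule(schedule, depends_on):
    """Check a platform-layer schedule (an ordered list of job names)
    against the application-layer dependency model.

    Returns a list of problem descriptions; an empty list means the
    schedule contains every modelled job and respects every dependency.
    """
    problems = []
    position = {job: i for i, job in enumerate(schedule)}

    # Every job mentioned anywhere in the model must appear in the schedule.
    required = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    for job in sorted(required - set(position)):
        problems.append(f"missing job: {job}")

    # No job may be scheduled ahead of something it depends on.
    for job, deps in sorted(depends_on.items()):
        for dep in sorted(deps):
            if job in position and dep in position and position[dep] > position[job]:
                problems.append(f"{job} is scheduled before {dep}")
    return problems


good = check_schedule(
    ["load_payments", "post_transactions", "update_balances"], APPLICATION_MODEL
)
bad = check_schedule(["post_transactions", "load_payments"], APPLICATION_MODEL)
print(good)
print(bad)
```

With this arrangement, a damaged schedule file is detected by comparison with a model held elsewhere, instead of the schedule file being the only copy of the model.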

This is one of the reasons why enterprise architecture is important. With clean separation and robust interfaces between the architectural layers (business, application, platform), we can carry out modernization, innovation and continuous change in each layer separately. This follows the principle of pace layering, based on the notion that each layer has a different characteristic rate of change. Without clean separation between layers, the layers shear apart, resulting in misalignment and system failure. And as @davidsprott points out, service enabling has exactly this (layer separation) outcome.

Conclusions
  • It's risky outsourcing the core systems unless the architecture is clearly understood and controlled.
  • Good outsourcing requires a good service architecture, which may include business, app and/or platform services.
  • Modernization requires good architecture.
  • In complex systems of systems, coordination is a core business capability. Outsource with extreme caution.




Charles Arthur, How NatWest's IT meltdown developed (Guardian 25 June 2012)

Anh Nguyen, CA 'helps' RBS resolve tech problem that led to massive outage (ComputerWorld 25 June 2012)

Robert Peston, Is outsourcing the cause of RBS debacle? (BBC News 25 June 2012)

David Sprott, RBS Crash - Management Prefer Offshoring to Modernization? (25 June 2012)

See also Architecture as Jenga (September 2012)
