
Friday, July 10, 2020

Three Types of Data - Bronze, Gold and Mercury

In this post, I'm going to look at three types of data, and the implications for data management. For the purposes of this story, I'm going to associate these types with three contrasting metals: bronze, gold and mercury. (Update: fourth type added - scroll down for details.)


Bronze

The first type of data represents something that happened at a particular time. For example, transaction data: this customer made this purchase of this product on this date. This delivery was received, this contract was signed, this machine was installed, this notification was sent.

Once this kind of data is correctly recorded, it should never change. Even if an error is detected in a transaction record, the usual procedure is to add two more transaction records - one to reverse out the incorrect values, and one to reenter the correct values. 
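
To make this append-only pattern concrete, here is a minimal sketch in Python (with invented field names and values, not any particular system) of correcting a mis-keyed transaction by appending a reversal and a replacement rather than editing the original record.

    from datetime import date

    # An append-only list of transaction records (Bronze data).
    # Each record is written once and never modified.
    ledger = [
        {"txn_id": 1, "date": date(2020, 7, 1), "customer": "C042",
         "product": "P17", "amount": 95.00},   # mis-keyed: should have been 59.00
    ]

    def correct_transaction(ledger, txn_id, corrected):
        """Correct an erroneous transaction by appending two new records:
        one reversing out the original values, one re-entering the correct ones."""
        original = next(t for t in ledger if t["txn_id"] == txn_id)
        next_id = max(t["txn_id"] for t in ledger) + 1
        reversal = {**original, "txn_id": next_id, "amount": -original["amount"]}
        replacement = {**original, "txn_id": next_id + 1, **corrected}
        ledger.extend([reversal, replacement])

    correct_transaction(ledger, 1, {"amount": 59.00})
    print(sum(t["amount"] for t in ledger))   # 59.0 - the original record is untouched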

For many organizations, this represents by far the largest portion of data by volume. The main data management challenges tend to be focused on the implications of this - how much to collect, where to store it, how to move it around, how soon can it be deleted or archived.


Gold

The second type of data represents current reality. This kind of data must be promptly and efficiently updated to reflect real-world changes. For example, the customer changes address, or an employee moves to a different department. Although the changes themselves may be registered as Bronze Data, and we may sometimes want to check what the position was at a given time in the past (or is expected to be at a given time in the future), what we usually want to know is where the customer now resides, or where Sam now works.

Some of these updates can be regarded as simple facts, based on observations or reports (a customer tells us her address). Some updates are derived from other data, using calculation or inference rules. And some updates are based on decisions - for example, the price of this product shall be X. 

And not all of these updates can be trusted. If you receive an email from a supplier requesting payment to a different bank account, you probably want to check that this email is genuine before updating the supplier record.

Gold data typically involves much smaller volumes than Bronze, but it is much more critical to the business if you get it wrong.


Mercury

Finally, we have data with a degree of uncertainty, including estimates and forecasts. This data is fluid: it can move around for no apparent reason. It can be subjective, or based on unreliable or partial sources. Nevertheless, it can be a rich source of insight and intelligence.

This category also includes projected and speculative data. For example, we might be interested in developing a fictional "what if" scenario - what if we opened x more stores, what if we changed the price of this product to y?

For some reason, an estimate that is generated by an algorithm or mathematical model is sometimes taken more seriously than an estimate pulled out of the air by a subject matter expert. However, as Cathy O'Neil reminds us, algorithms are themselves merely opinions embedded in code.

If you aren't sure whether to trust an estimate, you can scrutinize the estimation process. For example, you might suspect that the subject matter expert provides more optimistic estimates after lunch. Or you could just get a second opinion. Two independent but similar opinions might give you more confidence than one extremely precise but potentially flawed opinion.

As well as estimates and forecasts, Mercury data may include assessments of various kinds. For example, we may want to know a customer's level of satisfaction with our products and services. Opinion surveys provide some relevant data points, but what about the customers who don't complete these surveys? And what if we pick up different opinions from different individuals within a large customer organization? In any case, these opinions change over time, and we may be able to correlate these shifts in opinion with specific good or bad events.

Thus Mercury data tend to be more complex than Bronze or Gold data, and can often be interpreted in different ways.


Update: Glass

@tonyjoyce suggests a fourth type.


This is a great insight. If you are not careful, you will end up with pieces of broken glass in your data. While this kind of data may be necessary, it is fragile: it has to be treated with due care, and can't just be chucked around like bronze or gold.

Single Version of Truth (SVOT)

Bronze and Gold data usually need to be reliable and consistent. If two data stores have different addresses for the same customer, this could indicate any of the following errors.
  • The data in one of the data stores is incorrect or out-of-date. 
  • It’s not the same customer after all. 
  • It’s not the same address. For example, one is the billing address and the other is the delivery address.
For the purposes of data integrity and interoperability, we need to eliminate such errors. We then have a single version of the truth (SVOT), possibly taken from a single source of truth (SSOT).
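
As a rough illustration (hypothetical data stores and field names), a reconciliation check might surface such discrepancies for investigation rather than silently picking a winner.

    # Hypothetical extracts from two data stores, keyed by customer id.
    billing_system = {"C042": "12 High Street, Leeds"}
    delivery_system = {"C042": "Unit 7, Canal Wharf, Leeds"}

    def find_address_discrepancies(store_a, store_b):
        """Return customers present in both stores whose addresses differ.
        Each discrepancy still needs interpretation: stale data, a false
        match, or two legitimately different kinds of address."""
        return [
            (cust, store_a[cust], store_b[cust])
            for cust in store_a.keys() & store_b.keys()
            if store_a[cust] != store_b[cust]
        ]

    for cust, a, b in find_address_discrepancies(billing_system, delivery_system):
        print(f"Check {cust}: {a!r} vs {b!r}")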

Facts and derivations may be accurate or inaccurate. In the case of a simple fact, inaccuracy may be attributed to various causes, including translation errors, carelessness or dishonesty. Calculations may be inaccurate either because the input data are inaccurate or incomplete, or because there is an error in the derivation rule itself. (However, the derived data can sometimes be more accurate or useful, especially if random errors and variations are smoothed out.)

For decisions however, it doesn’t make sense to talk about accuracy / inaccuracy, except in very limited cases. Obviously if someone decides the price of an item shall be x pounds, but this is incorrectly entered into the system as x pence, this is going to cause problems. But even if x pence is the wrong price, arguably it is what the price is until someone fixes it.


Plural Version of Truth (PVOT) 

But as I've pointed out in several previous posts, the Single Version of Truth (SVOT) or Single Source of Truth (SSOT) isn't appropriate for all types of data. Particularly not Mercury Data. When making sense of complex situations, having alternative views provides diversity and richness of interpretation.

Analytical systems may be able to compare alternative data values from different sources. For example, two forecasting models might produce different estimates of the expected revenue from a given product. Intelligent use of these estimates doesn’t entail choosing one and ignoring the other. It means understanding why they are different, and taking appropriate action.
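
A minimal sketch of that idea, with entirely made-up forecast figures: keep both estimates, and flag the products where the models diverge beyond some tolerance.

    # Hypothetical revenue forecasts (in thousands) from two independent models.
    forecast_model_a = {"P17": 120.0, "P18": 85.0}
    forecast_model_b = {"P17": 125.0, "P18": 40.0}

    def compare_forecasts(a, b, tolerance=0.15):
        """Keep both estimates; flag products where the models diverge
        by more than the tolerance, as candidates for investigation."""
        report = {}
        for product in a.keys() & b.keys():
            spread = abs(a[product] - b[product]) / max(a[product], b[product])
            report[product] = "investigate" if spread > tolerance else "consistent"
        return report

    print(compare_forecasts(forecast_model_a, forecast_model_b))
    # e.g. {'P17': 'consistent', 'P18': 'investigate'}  (order may vary)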

Or what about conflicting assessments? If we are picking up a very high satisfaction score from some parts of the customer organization, and a low satisfaction score from other parts of the same organization, we shouldn't simply average them out. The difference between these two scores could be telling us something important, and might reveal an opportunity to engage differently with the two parts of the customer organization.

And for some kinds of Mercury Data, it doesn't even make sense to ask whether they are accurate or inaccurate. Someone may postulate x more stores, but this doesn't imply that x is true or even likely; it is merely speculative. And this speculative status is inherited by any forecasts or other calculations based on x. (Just look at the discourse around COVID data for topical examples.)


Master Data Management (MDM)

The purpose of Master Data Management is not just to provide a single source of data for Gold Data - sometimes called the Golden Record - but to provide a single location for updates. A properly functioning MDM solution will execute these updates consistently and efficiently, and ensure all consumers of the data (whether human or software) are using the updated version.
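
Here is a toy sketch of that arrangement (a hypothetical hub, not any particular MDM product): one place accepts updates to the golden record and pushes the new version to every subscribed consumer.

    class MasterDataHub:
        def __init__(self):
            self.golden_records = {}      # e.g. customer id -> current attributes
            self.subscribers = []         # downstream consumers (callables)

        def subscribe(self, callback):
            self.subscribers.append(callback)

        def update(self, record_id, **changes):
            record = self.golden_records.setdefault(record_id, {})
            record.update(changes)
            for notify in self.subscribers:
                notify(record_id, dict(record))   # every consumer sees the same version

    hub = MasterDataHub()
    hub.subscribe(lambda rid, rec: print("billing sees", rid, rec))
    hub.subscribe(lambda rid, rec: print("delivery sees", rid, rec))
    hub.update("C042", address="Unit 7, Canal Wharf, Leeds")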

There is an important connection to draw out between master data management and trust.

In order to trust Bronze Data, we simply need some assurance that it is correctly recorded and can never be changed. (“The moving finger writes …”) In some contexts, a central authority may be able to provide this assurance. In systems with no central authority, Blockchain can guarantee that a data item has not been changed, although Blockchain alone cannot guarantee that it was correctly recorded in the first place.
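
For illustration only, a toy hash chain shows the tamper-evidence part of that claim; it says nothing about whether the first record was right to begin with.

    import hashlib, json

    def chain(records):
        """Link each record to the hash of everything before it.
        Changing any earlier record changes every later hash."""
        prev, hashes = "", []
        for record in records:
            digest = hashlib.sha256(
                (prev + json.dumps(record, sort_keys=True)).encode()).hexdigest()
            hashes.append(digest)
            prev = digest
        return hashes

    records = [{"txn": 1, "amount": 95.0}, {"txn": 2, "amount": 59.0}]
    original_hashes = chain(records)

    records[0]["amount"] = 950.0                 # someone tampers with history
    print(chain(records) == original_hashes)     # False - the change is detectable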

For Gold Data, trustworthiness is more complicated, as there will need to be an ongoing series of automatic and manual updates. Master data management will provide the necessary sociotechnical superstructure to manage and control these updates. For example, what are the controls on updating a supplier's bank account details?

There will always be requirements for data integrity between Bronze Data and Gold Data. Firstly, there will typically be references from Bronze Data to Gold Data. For example, a transaction record may reference a specific customer purchasing a specific product. And secondly, there may be attributes of the Gold Data that are updated as a result of each transaction. For example, the stock levels of a product will be affected by sales of that product.
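
A minimal sketch of both kinds of integrity, using invented product and transaction structures: the transaction must reference an existing master record, and the derived stock attribute on the master record is adjusted as each transaction is recorded.

    # Gold data: current product master, including a derived stock level.
    products = {"P17": {"description": "Widget", "stock": 100}}

    # Bronze data: an append-only list of sales transactions.
    sales = []

    def record_sale(product_id, quantity):
        """Append a transaction (Bronze) that references the product master (Gold),
        and adjust the derived stock level on the master record."""
        if product_id not in products:
            raise ValueError(f"unknown product {product_id}")   # referential integrity
        sales.append({"product": product_id, "quantity": quantity})
        products[product_id]["stock"] -= quantity               # Gold attribute updated per transaction

    record_sale("P17", 3)
    print(products["P17"]["stock"])   # 97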

However, as we've seen, the data management challenges of Bronze Data are not the same as the challenges for Gold Data. And the challenges of Mercury Data are different again. So it is better to focus your MDM efforts exclusively on Gold Data. (And avoid splinters of Glass.)



Post prompted by a discussion on Linked-In with Robert Daniels-Dwyer, Steve Fisher and Steve Lenny. https://www.linkedin.com/posts/danielsdwyer_dataarchitecture-datamanagement-enterprisearchitecture-activity-6673873980866228224-Mu1O

Updated 15 July 2020 following suggestion by Tony Joyce.

Tuesday, December 10, 2019

Is there a Single Version of Truth about Stents?

Clinical trials are supposed to generate reliable data to support healthcare decisions and policies at several levels. Regulators use the data to control the marketing and use of medicines and healthcare products. Clinical practice guidelines are produced by healthcare organizations (from the WHO downwards) as well as professional bodies. Clinicians apply and interpret these guidelines for individual patients, as well as prescribing medicines, products and procedures, both on-label and off-label.

Given the importance of these decisions and policies for patients, there are some critical issues concerning the quality of clinical trial data, and the ability of clinicians, researchers, regulators and others to make sense of these data. Obviously there are significant commercial interests involved, and some players may be motivated to be selective about the publication of trial data. Hence the AllTrials campaign for clinical trial transparency.

But there is a more subtle issue, to do with the way the data are collected, coded and reported. The BBC has recently uncovered an example that is both fascinating and troubling. It concerns a clinical trial comparing the use of stents with heart bypass surgery. The trial was carried out in 2016, funded by a major manufacturer of stents, and published in a prestigious medical journal. According to the article, the two alternatives were equally effective in protecting against future heart attacks.

But this is where the controversy begins. Researchers disagree about the best way of measuring heart attacks, and the authors of the article used a particular definition. Other researchers prefer the so-called Universal Definition, or more precisely the Fourth Universal Definition (there having been three previous attempts). Some experts believe that if you use the Universal Definition instead of the definition used in the article, the results are much more one-sided: stents may be the right solution for many patients, but are not always as good as surgery.

Different professional bodies interpret matters differently. The European Association for Cardio-thoracic Surgery (EACTS) told the BBC that this raised serious concerns about the current guidelines based on the 2016 trial, while the European Society of Cardiology stands by these guidelines. The BBC also notes the potential conflicts of interests of researchers, many of whom had declared financial relationships with stent manufacturers.

I want to draw a more general lesson from this story, which is about the much-vaunted Single Version of Truth (SVOT). By limiting the clinical trial data to a single definition of heart attack, some of the richness and complexity of the data are lost or obscured. For some purposes at least, it would seem appropriate to make multiple versions of the truth available, so that they can be properly analysed and interpreted. SVOT not always a good thing, then.

See my previous blogposts on Single Source of Truth.



Deborah Cohen and Ed Brown, Surgeons withdraw support for heart disease advice (BBC Newsnight, 9 December 2019) See also https://www.youtube.com/watch?v=_vGfJKMbpp8

Debabrata Mukherjee, Fourth Universal Definition of Myocardial Infarction (American College of Cardiology, 25 Aug 2018)

See also Off-Label (March 2005), Is there a Single Version of Truth about Statins? (April 2019), Ethics of Transparency and Concealment (October 2019)

Monday, April 22, 2019

When the Single Version of Truth Kills People

@Greg_Travis has written an article on the Boeing 737 Max Disaster, which @jjn1 describes as "one of the best pieces of technical writing I’ve seen in ages". He explains why normal airplane design includes redundant sensors.

"There are two sets of angle-of-attack sensors and two sets of pitot tubes, one set on either side of the fuselage. Normal usage is to have the set on the pilot’s side feed the instruments on the pilot’s side and the set on the copilot’s side feed the instruments on the copilot’s side. That gives a state of natural redundancy in instrumentation that can be easily cross-checked by either pilot. If the copilot thinks his airspeed indicator is acting up, he can look over to the pilot’s airspeed indicator and see if it agrees. If not, both pilot and copilot engage in a bit of triage to determine which instrument is profane and which is sacred."

and redundant processors, to guard against a Single Point of Failure (SPOF).

"On the 737, Boeing not only included the requisite redundancy in instrumentation and sensors, it also included redundant flight computers—one on the pilot’s side, the other on the copilot’s side. The flight computers do a lot of things, but their main job is to fly the plane when commanded to do so and to make sure the human pilots don’t do anything wrong when they’re flying it. The latter is called 'envelope protection'."

But ...

"In the 737 Max, only one of the flight management computers is active at a time—either the pilot’s computer or the copilot’s computer. And the active computer takes inputs only from the sensors on its own side of the aircraft."

As a result of this design error, 346 people are dead. Travis doesn't pull his punches.

"It is astounding that no one who wrote the MCAS software for the 737 Max seems even to have raised the possibility of using multiple inputs, including the opposite angle-of-attack sensor, in the computer’s determination of an impending stall. As a lifetime member of the software development fraternity, I don’t know what toxic combination of inexperience, hubris, or lack of cultural understanding led to this mistake."

He may not know what led to this specific mistake, but he can certainly see some of the systemic issues that made this mistake possible. Among other things, the widespread idea that software provides a cheaper and quicker fix than getting the hardware right, together with what he calls cultural laziness.

"Less thought is now given to getting a design correct and simple up front because it’s so easy to fix what you didn’t get right later."

Agile, huh?
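
To make the design point concrete, here is a rough sketch (certainly not Boeing's actual code, and with invented threshold values) contrasting a single-sensor check with the kind of cross-check Travis describes.

    def stall_warning_single(left):
        """Single-source design: trust one sensor, so one faulty reading
        can trigger the protection logic."""
        return left > 15.0                         # hypothetical stall threshold in degrees

    def stall_warning_cross_checked(left, right, max_disagreement=5.0):
        """Redundant design: if the two sensors disagree badly, the reading is
        treated as unreliable and the decision is handed back to the crew."""
        if abs(left - right) > max_disagreement:
            return "disagree"                      # alert the crew, do not auto-trim
        return (left + right) / 2 > 15.0           # hypothetical stall threshold

    print(stall_warning_single(74.5))              # True: acts on an implausible reading
    print(stall_warning_cross_checked(74.5, 4.2))  # flags the disagreement instead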


Update: CNN finds an unnamed Boeing spokesman to defend the design.

"Single sources of data are considered acceptable in such cases by our industry".

OMG, does that mean that there are more examples of SSOT elsewhere in the Boeing design!?




How a Single Point of Failure (SPOF) in the MCAS software could have caused the Boeing 737 Max crash in Ethiopia (DMD Solutions, 5 April 2019) - provides a simple explanation of Fault Tree Analysis (FTA) as a technique to identify SPOF.

Mike Baker and Dominic Gates, Lack of redundancies on Boeing 737 MAX system baffles some involved in developing the jet (Seattle Times 26 March 2019)

Curt Devine and Drew Griffin, Boeing relied on single sensor for 737 Max that had been flagged 216 times to FAA (CNN, 1 May 2019) HT @marcusjenkins

George Leopold, Boeing 737 Max: Another Instance of ‘Go Fever’? (29 March 2019)

Mary Poppendieck, What If Your Team Wrote the Code for the 737 MCAS System? (4 April 2019) HT @CharlesTBetz with reply from @jpaulreed

Gregory Travis, How the Boeing 737 Max Disaster Looks to a Software Developer (IEEE Spectrum, 18 April 2019) HT @jjn1 @ruskin147

And see my other posts on the Single Source of Truth.


Updated  2 May 2019

Tuesday, April 16, 2019

Is there a Single Version of Truth about Statins?

@bengoldacre provides some useful commentary on a BBC news item about statins. In particular, he notes a detail from the original research paper that didn't make it into the BBC news item - namely the remarkable lack of agreement between GPs and hospitals as to whether a given patient had experienced a cardiovascular event.

This is not a new observation: it was analysed in a 2013 paper by Emily Herrett and others. Dr Goldacre advised a previous Health Minister that "different data sources within the NHS were wildly discrepant wrt to the question of something as simple as whether a patient had had a heart attack". The minister asked which source was right - in other words, asking for a single source of truth. But the point is that there isn't one.

Data quality issues can be traced to a number of causes. While some of the issues may be caused by administrative or technical errors and omissions, others are caused by the way the data are recorded in the first place. This is why the comparison of health data between different countries is often misleading - because despite international efforts to standardize classification, different healthcare regimes still code things differently. And despite the huge amounts of NHS money thrown at IT projects to standardize medical records (as documented by @tonyrcollins), the fact remains that primary and secondary healthcare view the patient completely differently.

See my previous blogposts on Single Source of Truth


Tony Collins, Another NPfIT IT scandal in the making? (Campaign4Change, 9 February 2016)

Emily Herrett et al, Completeness and diagnostic validity of recording acute myocardial infarction events in primary care, hospital care, disease registry, and national mortality records: cohort study (BMJ 21 May 2013)

Michelle Roberts, Statins 'don't work well for one in two people' (BBC News, 15 April 2019)

Benoît Salanave et al, Classification differences and maternal mortality: a European study (International Journal of Epidemiology 28, 1999) pp 64–69

Tuesday, November 06, 2018

Big Data and Organizational Intelligence

Ten years ago, the editor of Wired Magazine published an article claiming the end of theory. With enough data, the numbers speak for themselves.

The idea that data (or facts) speak for themselves, with no need for interpretation or analysis, is a common trope. It is sometimes associated with a legal doctrine known as Res Ipsa Loquitur - the thing speaks for itself. However this legal doctrine isn't about truth but about responsibility: if a surgeon leaves a scalpel inside the patient, this fact alone is enough to establish the surgeon's negligence.

Legal doctrine aside, perhaps the world speaks for itself. The world, someone once asserted, is all that is the case, the totality of facts not of things. Paradoxically, big data often means very large quantities of very small (atomic) data.

But data, however big, does not provide a reliable source of objective truth. This is one of the six myths of big data identified by Kate Crawford, who points out that data and data sets are not objective; they are creations of human design. In other words, we don't just build models from data, we also use models to obtain data. This is linked to Piaget's account of how children learn to make sense of the world in terms of assimilation and accommodation. (Piaget called this Genetic Epistemology.)

Data also cannot provide explanation or understanding. Data can reveal correlation but not causation, which is one of the reasons why we need models. As Kate Crawford also observes, we get a much richer sense of the world when we ask people the why and the how, not just the how many. And Bernard Stiegler links the end of theory glorified by Anderson with a loss of reason (2019, p8).

In the traditional world of data management, there is much emphasis on the single source of truth. Michael Brodie (who knows a thing or two about databases), while acknowledging the importance of this doctrine for transaction systems such as banking, argues that it is not appropriate everywhere. In science, as in life, understanding of a phenomenon may be enriched by observing the phenomenon from multiple perspectives (models). ... Database products do not support multiple models, i.e., the reality of science and life in general. One approach Brodie talks about to address this difficulty is ensemble modelling: running several different analytical models and comparing or aggregating the results. (I referred to this idea in my post on the Shelf-Life of Algorithms).
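
A minimal sketch of the ensemble idea, using two deliberately naive and entirely hypothetical models of the same quantity: keep both estimates and examine the spread, rather than electing one of them as the truth.

    import statistics

    def model_trend(history):
        """Extrapolate the most recent change."""
        return history[-1] + (history[-1] - history[-2])

    def model_average(history):
        """Assume reversion to the historical mean."""
        return statistics.mean(history)

    history = [100, 104, 110, 123]
    estimates = {"trend": model_trend(history), "mean-reversion": model_average(history)}

    # Ensemble use: keep both, compare them, and report the spread
    # rather than silently choosing a single version of the truth.
    print(estimates)
    print("spread:", max(estimates.values()) - min(estimates.values()))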

Along with the illusion that what the data tells you is true, we can identify two further illusions: that what the data tells you is important, and that what the data doesn't tell you is not important. These are not just illusions of big data of course - any monitoring system or dashboard can foster them. The panopticon affects not only the watched but also the watcher.

From the perspective of organizational intelligence, the important point is that data collection, sensemaking, decision-making, learning and memory form a recursive loop - each inextricably based on the others. An organization only perceives what it wants to perceive, and this depends on the conceptual models it already has - whether these are explicitly articulated or unconsciously embedded in the culture. Which is why real diversity - in other words, genuine difference of perspective, not just bureaucratic profiling - is so important, because it provides the organizational counterpart to the ensemble modelling mentioned above.


https://xkcd.com/552/


Each day seems like a natural fact
And what we think changes how we act



Chris Anderson, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (Wired, 23 June 2008)

Michael L Brodie, Why understanding of truth is important in Data Science? (KD Nuggets, January 2018)

Kate Crawford, The Hidden Biases in Big Data (HBR, 1 April 2013)

Kate Crawford, The Anxiety of Big Data (New Inquiry, 30 May 2014)

Bruno Gransche, The Oracle of Big Data – Prophecies without Prophets (International Review of Information Ethics, Vol. 24, May 2016)

Kevin Kelly, The Google Way of Science (The Technium, 28 June 2008)

Thomas McMullan, What does the panopticon mean in the age of digital surveillance? (Guardian, 23 July 2015)

Evelyn Ruppert, Engin Isin and Didier Bigo, Data politics (Big Data and Society, July–December 2017: 1–7)

Ian Steadman, Big Data and the Death of the Theorist (Wired, 25 January 2013)

Bernard Stiegler, The Age of Disruption: Technology and Madness in Computational Capitalism (English translation, Polity Press 2019)

Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)


Related posts

Information Algebra (March 2008), How Dashboards Work (November 2009), Conceptual Modelling - Why Theory (November 2011), Co-Production of Data and Knowledge (November 2012), Real Criticism - The Subject Supposed to Know (January 2013), The Purpose of Diversity (December 2014), The Shelf-Life of Algorithms (October 2016), The Transparency of Algorithms (October 2016), Algorithms and Governmentality (July 2019), Mapping out the entire world of objects (July 2020)


Wikipedia: Ensemble Learning, Genetic Epistemology, Panopticism, Res ipsa loquitur (the thing speaks for itself)

Stanford Encyclopedia of Philosophy: Kant and Hume on Causality


For more on Organizational Intelligence, please read my eBook.
https://leanpub.com/orgintelligence/

Monday, December 13, 2010

Can Single Source of Truth work?

@tonyrcollins asks if any healthcare IT system can provide a Single Source of Truth (SSOT)? In his blog (13 December 2010), he discusses a press release claiming that an electronic healthcare record system from Cerner Millennium Solutions is a "single source of truth", citing the Children’s Cancer Hospital Egypt 57357 (CCHE) as a success story (via Egyptian Chamber).

My first observation is that even if we take this success story at face value, it doesn't tell us much about the possibilities of SSOT in an environment such as the UK NHS that is several orders of magnitude more complicated/complex. I'm guessing the Children’s Cancer Hospital Egypt 57357 (CCHE) doesn't have as many different types of "truth" to manage as the NHS:
  • one type of patient (children)
  • one type of condition (cancer)
  • a single building
My second observation is that if a closed organization has a single source of truth, it will never discover flaws in any of these truths. If a child is given the wrong medication, for whatever reason, we can only detect the error and prevent its recurrence by finding a second source of truth. The reason SSOT has not been successfully implemented in the UK is not just because it wouldn't work (after all, lots of things are implemented that don't work) but because there are too many people who know it wouldn't work and are sufficiently powerful to resist it.

My third observation is that a single source of truth may be a bureaucratic fantasy, but responsible doctors will always strive to get best-truth rather than sole-truth. People in bureaucratic organizations don't always stick to the formal channels, and often have alternative ways of finding out what they need to know. So perhaps the Egyptian doctors at CCHE have managed to preserve alternative sources of information, and the "single source of truth" is merely a bureaucratic illusion.

See my previous post What's Wrong with the Single Source of Truth?

Wednesday, April 28, 2010

Quality and Responsibility

One of the key challenges with shared data and shared services is the question of data quality. Who is responsible for mistakes?

@tonyrcollins raises a specific example - who's responsible for mistakes in summary care records?

"NHS Connecting for Health suggests that responsibility for mistakes lies with the person making the incorrect entry into a patient's medical records. But the legal responsibility appears to lie with the Data Controller who, in the case of Summary Care Records, is the Secretary of State for Health."

From an organizational design point of view, it is usually best to place responsibility for mistakes along with the power and expertise to prevent or correct mistakes. But that in turn calls for an analysis of the root causes of mistakes. If all mistakes can be regarded as random incidents of carelessness or incompetence on the part of the person making the incorrect entry, then clearly the responsibility lies there. But if mistakes are endemic across the system, then the root cause may well be carelessness or incompetence in the system requirements and design, and so the ultimate responsibility rightly lies with the Secretary of State for Health.

Part of the problem here is that the Summary Care Record (SCR) is supposed to be a Single Source of Truth (SSOT), and I have already indicated What's Wrong with the Single Version of Truth (SVOT). Furthermore, it is intended to be used in Accident and Emergency, to support decisions that may be safety-critical or even life-critical. Therefore to design a system that is vulnerable to random incidents of carelessness or incompetence is itself careless and incompetent.

What general lessons can we learn from this example, for shared services and SOA? The first lesson is for design: data quality must be rigorously designed-in, rather than merely relying on validation filters at the data entry stage, and then building downstream functionality that uses the data uncritically. (This is a question for the design of the whole sociotechnical system, not just the software architecture.) And the second lesson is for governance: make sure that stakeholders understand and accept the distribution of risk and responsibility and reward BEFORE spending billions of taxpayers' money on something that won't work.

Friday, March 19, 2010

What's Wrong with the Single Version of Truth

As @tonyrcollins reports, a confidential report currently in preparation will reveal serious flaws in the massively expensive NHS Summary Care Records (SCR) database (Computer Weekly, March 2010). Well knock me down with a superbug, whoever would have guessed this might happen?

"The final report may conclude that the success of SCRs will depend on whether the NHS, Connecting for Health and the Department of Health can bridge the deep cultural and institutional divides that have so far characterised the NPfIT. It may also ask whether the government founded the SCR on an unrealistic assumption: that the centralised database could ever be a single source of truth."

There are several reasons to be ambivalent about the twin principles Single Version of Truth (SVOT) and Single Source of Truth (SSOT), and this kind of massive failure must worry even the most fervent advocates of these principles.

Don't get me wrong, I have served my time in countless projects trying to reduce the proliferation and fragmentation of data and information in large organizations, and I am well aware of the technical costs and business risks associated with data duplication. However, I have some serious concerns about the dogmatic way these principles are often interpreted and implemented, especially when this dogmatism results (as seems to be the case here) in a costly and embarrassing failure.

The first problem is that Single-Truth only works if you have absolute confidence in the quality of the data. In the SCR example, there is evidence that doctors simply don't trust the new system - and with good reason. There are errors and omissions in the summary records, and doctors prefer to double-check details of medications and allergies, rather than take the risk of relying on a single source.

The technical answer to this data quality problem is to implement rigorous data validation and cleansing routines, to make sure that the records are complete and accurate. But this would create more work for the GP practices uploading the data. Officials at the Department of Health fear that setting the standards of data quality too high would kill the scheme altogether. (And even the most rigorous quality standards would only reduce the number of errors, could never eliminate them altogether.)

There is a fundamental conflict of interest here between the providers of data and the consumers - even though these may be the same people - and between quality and quantity. If you measure the success of the scheme in terms of the number of records uploaded, then you are obviously going to get quantity at the expense of quality.

So the pusillanimous way out is to build a database with imperfect data, and defer the quality problem until later. That's what people have always done, and will continue to do, and the poor quality data will never ever get fixed.

The second problem is that even if perfectly complete and accurate data are possible, the validation and data cleansing step generally introduces some latency into the process, especially if you are operating a post-before-processing system (particularly relevant to environments such as military and healthcare where, for some strange reason, matters of life-and-death seem to take precedence over getting the paperwork right). So there is a design trade-off between two dimensions of quality - timeliness and accuracy. See my post on Joined-Up Healthcare.

The third problem is complexity. Data cleansing generally works by comparing each record with a fixed schema, which defines the expected structure and rules (metadata) to which each record must conform, so that any information that doesn't fit into this fixed schema will be barred or adjusted. Thus the richness of information will be attenuated, and useful and meaningful information may be filtered out. (See Jon Udell's piece on Object Data and the Procrustean Bed from March 2000. See also my presentation on SOA for Data Management.)
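
A toy example of that Procrustean effect, with invented field names: validating against a fixed schema quietly discards anything the schema didn't anticipate.

    # The fixed schema the cleansing routine expects.
    SCHEMA = {"patient_id", "medication", "dose_mg"}

    def cleanse(record):
        """Keep only the fields the schema knows about; anything else is dropped.
        The record now 'fits', but the information has been attenuated."""
        return {k: v for k, v in record.items() if k in SCHEMA}

    incoming = {
        "patient_id": "X123",
        "medication": "aspirin",
        "dose_mg": 75,
        "clinician_note": "patient reports previous adverse reaction",  # doesn't fit the schema
    }
    print(cleanse(incoming))   # the clinician's note has silently disappeared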

The final problem is that a single source of information represents a single source of failure. If something is really important, it is better to have two independent sources of information or intelligence, as I pointed out in my piece on Information Algebra. This follows Bateson's slogan that "two descriptions are better than one". Doctors using the SCR database appear to understand this aspect of real-world information better than the database designers.

It may be a very good idea to build an information service that provides improved access to patient information, for those who need this information. But if this information service is designed and implemented according to some simplistic dogma, then it isn't going to work properly.


Update. The Health Secretary has announced that NHS regulation will be based on a single version of the truth.

"in the future the chief inspector will ensure that there is a single version of the truth about how their hospitals are performing, not just on finance and targets, but on a single assessment that fully reflects what matters to patients"

Roger Taylor, Jeremy Hunt's dangerous belief in a single 'truth' about hospitals (Guardian 26 March 2013)



Updated 28 March 2013

Tuesday, November 11, 2008

Post Before Processing

In Talk to the Hand, Saul Caganoff describes his experience of errors when entering his timesheet data into one of those time-recording systems many of us have to use. He goes on to draw some general lessons about error-handling in business process management (BPM). In Saul's account, this might sometimes necessitate suspending a business rule.

My own view of the problem starts further back - I think it stems from an incorrect conceptual model. Why should your perfectly reasonable data get labelled as erroneous or invalid just because it is inconsistent with your project manager's data? This happens in a lot of old bureaucratic systems because they are designed on the implicit (hierarchical, top-down) assumption that the manager (or systems designer) is always right and the worker (or data entry clerk) is always the one who gets things wrong. It's also easier for the computer system to reject the new data items than to go back and question items (such as reference data) that have already been accepted into the database.

I prefer to label such inconsistencies as anomalies, because that doesn't imply anyone in particular being at fault.

It would be crazy to have a business rule saying that anomalies are not allowed. Anomalies happen. What makes sense is to have a business rule saying how and when anomalies are recognized (i.e. what counts as an anomaly) and resolved (i.e. what options are available to whom).

Then you never have to suspend the rule. It is just a different, more intelligent kind of rule.
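
As a sketch of that kind of rule (invented field names, not any particular timesheet product): accept the entry, record the inconsistency as an anomaly, and leave it open for resolution.

    timesheet_entries = []
    anomalies = []

    def submit_entry(worker_hours, manager_expected_hours, entry):
        """Accept the entry regardless; if it is inconsistent with the manager's
        figures, record an anomaly to be resolved by either party later."""
        timesheet_entries.append(entry)
        if worker_hours != manager_expected_hours:
            anomalies.append({
                "entry": entry,
                "reason": f"worker reported {worker_hours}h, manager expected {manager_expected_hours}h",
                "status": "open",   # resolved later by discussion, not by rejection
            })

    submit_entry(9, 7.5, {"worker": "SC", "project": "BPM-42", "hours": 9})
    print(len(timesheet_entries), len(anomalies))   # 1 1 - the data is in, the question is logged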

One of my earliest experiences of systems analysis was designing order processing and book-keeping systems. When I visited the accounts department, I saw people with desks stacked with piles of paper. It turned out that these stacks were the transactions that the old computer system wouldn't accept, so the accounts clerks had developed a secondary manual system for keeping track of all these invalid transactions until they could be corrected and entered.

According to the original system designer, the book-keeping process had been successfully automated. But what had been automated was over 90% of the transactions, representing less than 20% of the time and effort. So I said, why don't we build a computer system that supports the work that the accounts clerks actually do? Let them put all these dodgy transactions into the database and then sort them out later.

But I was very junior and didn't know how things were done. And of course the accounts clerks had even less status than I did. The high priests who commanded the database didn't want mere users putting dodgy data in, so it didn't happen.


Many years later, I came across the concept of Post Before Processing, especially in military or medical systems. If you are trying to load or unload an airplane in a hostile environment, or trying to save the life of a patient, you are not going to devote much time or effort to getting the paperwork correct. So all sorts of incomplete and inaccurate data get shoved quickly into the computer, and then sorted out later. These systems are designed on the principle that it is better to have some data, however incomplete or inaccurate, than none at all. This was a key element of the DoD Net-Centric Data Strategy (2003).

The Post Before Processing paradigm also applies to intelligence. For example, here is a US Department of Defense ruling on the sharing of intelligence data.
In the past, intelligence producers and others have held information pending greater completeness and further interpretative processing by analysts. This approach denies users the opportunity to apply their own context to data, interpret it, and act early on to clarify and/or respond. Information producers, particularly those at large central facilities, cannot know even a small percentage of potential users' knowledge (some of which may exceed that held by a center) or circumstances (some of which may be dangerous in the extreme). Accordingly, it should be the policy of DoD organizations to publish data assets at the first possible moment after acquiring them, and to follow-up initial publications with amplification as available. (Net-Centric Enterprise Services Technical Guide)


See also

Saul Caganoff, Talk to the Hand (11 November 2008), Progressive Data Constraints (21 November 2008)

Jeff Jonas, Introducing the concept of network-centric warfare and post before processing (21 January 2006), The Next Generation of Network-Centric Warfare: Process at Posting or Post at Processing (Same thing) (31 January 2007)

Related Post: Progressive Design Constraints (November 2008)

Monday, March 17, 2008

Information Algebra

I get more information from two newspapers than from one - but not twice as much information. So how much more, exactly? That depends how much difference there is between the two newspapers. 

Even if two newspapers report the same general facts, they typically report different details, and they may have different sources. To the extent that there are differences in style and detail between the two newspapers, this typically reinforces my confidence in the overall story because it indicates that the journalists are not merely reusing a common source (such as a company press release). 
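
In information-theoretic terms (a sketch, not a rigorous treatment), the combined information from two sources X and Y is

    H(X,Y) = H(X) + H(Y|X) ≤ H(X) + H(Y)

so the second newspaper adds exactly the information it carries beyond the first, H(Y|X). If it merely reprints the same press release, H(Y|X) is close to zero and I learn almost nothing new; the more the two papers differ, the closer the total gets to the sum of the two.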

In the real world, we are accustomed to the fact that information and intelligence need double-checking and corroboration. And yet in the computer world, there is a widespread belief that it is always a good thing to have a single source of information - that repeated messages are not only unnecessary but wasteful. Data cleansing wipes out difference in the name of consistency and standardization, leaving the resulting information flat and attenuated. A single source of information ("single source of truth") sometimes means a single source of failure - never a good idea in an open distributed system.

Writing about this in an SOA context - when three heads are better than one - Steve Jones describes this as redundancy, and points out the potential value of redundancy to increase reliability. He quotes Lewis Carroll (as Andrew Clarke points out, it was actually the Bellman): "What I tell you three times is true."

The same quote can be found at the head of Chapter 3 of Gregory Bateson's Mind and Nature, available online as Multiple Versions of the World. This expands on Bateson's earlier slogan "Two descriptions are better than one".

Bateson himself used the word "redundancy", but it is not a simple redundancy that can be plucked out without a second thought. Thinking about the consequences of adding and subtracting redundancy is a hard problem - Paulo Rocchi calls it calculus, but I prefer to call it algebra.