Showing posts with label data quality.

Monday, March 06, 2023

Trusting the Schema

A long time ago, I did some work for a client that had an out-of-date and inflexible billing system. The software would send invoices and monthly statements to the customers, who were then expected to remit payment to clear the balance on their account.

The business had recently introduced a new direct debit system. Customers who had signed a direct debit mandate no longer needed to send payments.

But faced with the challenge of introducing this change into an old and inflexible software system, the accounts department came up with an ingenious and elaborate workaround. The address on the customer record was changed to the address of the internal accounts department. The computer system would print and mail the statement, but instead of going straight to the customer it arrived back at the accounts department. The accounts clerk used a rubber stamp PAID BY DIRECT DEBIT, and would then mail the statement to the real customer address, which was stored in the Notes field on the customer record.

Although this may be an extreme example, there are several important lessons that follow from this story.

Firstly, business can't always wait for software systems to be redeveloped, and can often show high levels of ingenuity in bypassing the constraints imposed by an unimaginative design.

Secondly, the users were able to take advantage of a Notes field that had been deliberately underdetermined to allow for future expansion.

Furthermore, users may find clever ways of using and extending a system that were not considered by the original designers of the system. So there is a divergence between technology-as-designed and technology-in-use.

Now let's think about what happens when the IT people finally get around to replacing the old billing system. They will want to migrate customer data into the new system. But if they simply follow the official documentation of the legacy system (schema etc), there will be lots of data quality problems.

And by documentation, I don't just mean human-generated material but also schemas automatically extracted from program code and data stores. Just because a field is called CUSTADDR doesn't mean we can guess what it actually contains.


Here's another example of an underdetermined data element, which I presented at a DAMA conference in 2008: SOA Brings New Opportunities to Data Management.

In this example, we have a sales system containing a Business Type called SALES PROSPECT. But the content of the sales system depends on the way it is used - the way SALES PROSPECT is interpreted by different sales teams.

  • Sales Executive 1 records only the primary decision-maker in the prospective organization. The decision-maker’s assistant is recorded as extra information in the NOTES field. 
  • Sales Executive 2 records the assistant as a separate instance of SALES PROSPECT. There is a cross-reference between the assistant and the boss.

Now both Sales Executives can use the system perfectly well - in isolation. But we get interoperability problems under various conditions.

  • When we want to compare data between executives
  • When we want to reuse the data for other purposes
  • When we want to migrate to a new sales system 

(And problems like these can occur with packaged software and software as a service just as easily as with bespoke software.)
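
To make the divergence concrete, here is a rough sketch in Python. The records and field names are invented for illustration; the point is simply that the same boss-and-assistant relationship ends up in two structurally different shapes, so a naive comparison or migration that trusts the schema alone will miscount people.

    # Hypothetical records, invented for illustration.

    # Sales Executive 1: only the decision-maker is a SALES PROSPECT;
    # the assistant is buried as free text in the NOTES field.
    exec1_prospects = [
        {"id": 101, "name": "A. Boss", "org": "Acme",
         "notes": "Assistant: B. Helper, screens all calls"},
    ]

    # Sales Executive 2: the assistant is a separate SALES PROSPECT,
    # cross-referenced to the boss.
    exec2_prospects = [
        {"id": 201, "name": "C. Boss", "org": "Betacorp",
         "notes": "", "reports_to": None},
        {"id": 202, "name": "D. Helper", "org": "Betacorp",
         "notes": "", "reports_to": 201},
    ]

    # Trusting the schema alone, every row is a decision-maker, so the
    # assistant hidden in NOTES is invisible and the counts don't compare.
    print("Exec 1 prospects:", len(exec1_prospects))  # 1 row, 2 people
    print("Exec 2 prospects:", len(exec2_prospects))  # 2 rows, 2 people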


 

So how did this mess happen? Obviously the original designer / implementer never thought about assistants, or never had the time to implement or document them properly. Is that so unusual? 

And this again shows the persistent ingenuity of users - finding ways to enrich the data - to get the system to do more than the original designers had anticipated. 

 

And there are various other complications. Sometimes not all the data in a system was created there, some of it was brought in from an even earlier system with a significantly different schema. And sometimes there are major data quality issues, perhaps linked to a post before processing paradigm.

 

Both data migration and data integration are plagued by such issues. Since the data content diverges from the designed schemas, you can't rely on the schemas of the source data; you have to inspect the actual data content. Or undertake a massive data reconstruction exercise, often misleadingly labelled "data cleansing".
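
Here is a minimal sketch of what inspecting the actual content might look like, using the billing example above. The field names and patterns are hypothetical, and a real profiling exercise would be far more systematic, but the principle is the same: look at what the data says, not what the schema says.

    import re

    # Invented legacy customer rows, echoing the billing example above.
    legacy_customers = [
        {"custaddr": "Accounts Dept, Internal Mail Stop 4",
         "notes": "PAID BY DIRECT DEBIT. Real address: 12 High St, Anytown"},
        {"custaddr": "34 Station Rd, Othertown", "notes": ""},
    ]

    # Profile the content rather than trusting the CUSTADDR label:
    # flag records where the real address appears to live in NOTES.
    address_in_notes = re.compile(r"real address:\s*(.+)", re.IGNORECASE)

    for row in legacy_customers:
        match = address_in_notes.search(row["notes"])
        if match:
            print("Schema says CUSTADDR =", row["custaddr"])
            print("Content says the customer lives at", match.group(1))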


There are several tools nowadays that can automatically populate your data dictionary or data catalogue from the physical schemas in your data store. This can be really useful, provided you understand the limitations of what this is telling you. So there are a few important questions to ask before you trust the physical schema as a complete and accurate picture of the actual contents of your legacy data store.

  • Was all the data created here, or was some of it mapped or translated from elsewhere? 
  • Is the business using the system in ways that were not anticipated by the original designers of the system? 
  • What does the business do when something is more complex than the system was designed for, or when it needs to capture additional parties or other details?
  • Are classification types and categories used consistently across the business? For example, if some records are marked as "external partner" does this always mean the same thing? 
  • Do all stakeholders have the same view on data quality - what "good data" looks like?
  • And more generally, is there (and has there been through the history of the system) a consistent understanding across the business as to what the data elements mean and how to use them?


Related posts: Post Before Processing (November 2008), Ecosystem SOA 2 (June 2010), Technology in Use (March 2023)


Tuesday, December 10, 2019

Is there a Single Version of Truth about Stents?

Clinical trials are supposed to generate reliable data to support healthcare decisions and policies at several levels. Regulators use the data to control the marketing and use of medicines and healthcare products. Clinical practice guidelines are produced by healthcare organizations (from the WHO downwards) as well as professional bodies. Clinicians apply and interpret these guidelines for individual patients, as well as prescribing medicines, products and procedures, both on-label and off-label.

Given the importance of these decisions and policies for patients, there are some critical issues concerning the quality of clinical trial data, and the ability of clinicians, researchers, regulators and others to make sense of these data. Obviously there are significant commercial interests involved, and some players may be motivated to be selective about the publication of trial data. Hence the AllTrials campaign for clinical trial transparency.

But there is a more subtle issue, to do with the way the data are collected, coded and reported. The BBC has recently uncovered an example that is both fascinating and troubling. It concerns a clinical trial comparing the use of stents with heart bypass surgery. The trial was carried out in 2016, funded by a major manufacturer of stents, and published in a prestigious medical journal. According to the article, the two alternatives were equally effective in protecting against future heart attacks.

But this is where the controversy begins. Researchers disagree about the best way of measuring heart attacks, and the authors of the article used a particular definition. Other researchers prefer the so-called Universal Definition, or more precisely the Fourth Universal Definition (there having been three previous attempts). Some experts believe that if you use the Universal Definition instead of the definition used in the article, the results are much more one-sided: stents may be the right solution for many patients, but are not always as good as surgery.

Different professional bodies interpret matters differently. The European Association for Cardio-thoracic Surgery (EACTS) told the BBC that this raised serious concerns about the current guidelines based on the 2016 trial, while the European Society of Cardiology stands by these guidelines. The BBC also notes the potential conflicts of interests of researchers, many of whom had declared financial relationships with stent manufacturers.

I want to draw a more general lesson from this story, which is about the much-vaunted Single Version of Truth (SVOT). By limiting the clinical trial data to a single definition of heart attack, some of the richness and complexity of the data are lost or obscured. For some purposes at least, it would seem appropriate to make multiple versions of the truth available, so that they can be properly analysed and interpreted. SVOT not always a good thing, then.
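
In data terms, one way to keep that richness available is to record each outcome against every candidate definition, with the definition itself as visible metadata. The sketch below is purely illustrative - invented records, nothing to do with the actual trial data - but it shows the structural idea.

    # Invented records: each event is coded under both definitions,
    # so the choice of definition stays visible instead of being baked in.
    events = [
        {"patient": "P-001", "arm": "stent",
         "meets_definition": {"protocol_2016": True, "universal_4th": False}},
        {"patient": "P-002", "arm": "surgery",
         "meets_definition": {"protocol_2016": True, "universal_4th": True}},
    ]

    def count_events(events, arm, definition):
        """Count events in one trial arm under a chosen definition."""
        return sum(1 for e in events
                   if e["arm"] == arm and e["meets_definition"][definition])

    # The same data can now be analysed under either definition.
    print(count_events(events, "stent", "protocol_2016"))
    print(count_events(events, "stent", "universal_4th"))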

See my previous blogposts on Single Source of Truth.



Deborah Cohen and Ed Brown, Surgeons withdraw support for heart disease advice (BBC Newsnight, 9 December 2019) See also https://www.youtube.com/watch?v=_vGfJKMbpp8

Debabrata Mukherjee, Fourth Universal Definition of Myocardial Infarction (American College of Cardiology, 25 Aug 2018)

See also Off-Label (March 2005), Is there a Single Version of Truth about Statins? (April 2019), Ethics of Transparency and Concealment (October 2019)

Wednesday, December 04, 2019

Data Strategy - Assurance

This is one of a series of posts looking at the four key dimensions of data and information that must be addressed in a data strategy - reach, richness, agility and assurance.



In previous posts, I looked at Reach (the range of data sources and destinations), Richness (the complexity of data) and Agility (the speed and flexibility of response to new opportunities and changing requirements). Assurance is about Trust.

In 2002, Microsoft launched its Trustworthy Computing Initiative, which covered security, privacy, reliability and business integrity. If we look specifically at data, this means two things.
  1. Trustworthy data - the data are reliable and accurate.
  2. Trustworthy data management - the processor is a reliable and responsible custodian of the data, especially in regard to privacy and security.
Let's start by looking at trustworthy data. To understand why this is important (both in general and specifically to your organization), we can look at the behaviours that emerge in its absence. One very common symptom is the proliferation of local information. If decision-makers and customer-facing staff across the organization don't trust the corporate databases to be complete, up-to-date or sufficiently detailed, they will build private spreadsheets, to give them what they hope will be a closer version of the truth.

This is of course a data assurance nightmare - the data are out of control, and it may be easier for hackers to get the data out than it is for legitimate users. And good luck handling any data subject access request!

But in most organizations, you can't eliminate this behaviour simply by telling people they mustn't. If your data strategy is to address this issue properly, you need to look at the causes of the behaviour and understand what level of reliability and accessibility you have to give people before they will be willing to rely on your version of the truth rather than theirs.

DalleMule and Davenport have distinguished two types of data strategy, which they call offensive and defensive. Offensive strategies are primarily concerned with exploiting data for competitive advantage, while defensive strategies are primarily concerned with data governance, privacy and security, and regulatory compliance.

As a rough approximation then, assurance can provide a defensive counterbalance to the offensive opportunities offered by reach, richness and agility. But it's never quite as simple as that. A defensive data quality regime might install strict data validation, to prevent incomplete or inconsistent data from reaching the database. In contrast, an offensive data quality regime might install strict labelling, with provenance data and confidence ratings, to allow incomplete records to be properly managed, enriched if possible, and appropriately used. This is the basis for the NetCentric strategy of Post Before Processing.
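
Here is a rough sketch of the contrast, with a hypothetical customer record. The defensive regime turns the incomplete record away at the door; the offensive regime lets it in, but labelled with provenance and a confidence rating so it can be managed and enriched later.

    REQUIRED_FIELDS = ["name", "address", "bank_account"]

    def defensive_intake(record):
        """Strict validation: incomplete records never reach the database."""
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            raise ValueError(f"rejected, missing fields: {missing}")
        return record

    def offensive_intake(record, source):
        """Accept the record, but label it so it can be used appropriately."""
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        record["provenance"] = source
        record["confidence"] = "low" if missing else "high"
        record["needs_enrichment"] = missing
        return record

    partial = {"name": "J. Smith", "address": None, "bank_account": None}
    print(offensive_intake(dict(partial), source="web enquiry form"))
    # defensive_intake(partial) would raise ValueError instead.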

Because of course there isn't a single view of data quality. If you want to process a single financial transaction, you obviously need to have a complete, correct and confirmed set of bank details. But if you want aggregated information about upcoming financial transactions, you don't want any large transactions to be omitted from the total because of a few missing attributes. And if you are trying to learn something about your customers by running a survey, it's probably not a good idea to limit yourself to those customers who had the patience and loyalty to answer all the questions.
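
A small sketch of the same point, with invented figures: the identical dataset fails one fitness-for-purpose test and passes another.

    transactions = [
        {"amount": 250000, "bank_details": None},        # incomplete but large
        {"amount": 120, "bank_details": "GB00EXAMPLE"},
    ]

    # For paying a single transaction, completeness is non-negotiable.
    payable_now = [t for t in transactions if t["bank_details"]]

    # For forecasting upcoming cash flow, dropping the incomplete record
    # would be far worse than tolerating the missing attribute.
    forecast_total = sum(t["amount"] for t in transactions)

    print(len(payable_now), "payable now; forecast total:", forecast_total)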

Besides data quality, your data strategy will need to have a convincing story about privacy and security. This may include certification (e.g. ISO 27001) as well as regulation (GDPR etc.) You will need to have proper processes in place for identifying risks, and ensuring that relevant data projects follow privacy-by-design and security-by-design principles. You may also need to look at the commercial and contractual relationships governing data sharing with other organizations.

All of this should add up to establishing trust in your data management - reassuring data subjects, business partners, regulators and other stakeholders that the data are in safe hands. And hopefully this means they will be happy for you to take your offensive data strategy up to the next level.

Next post: Developing Data Strategy



Leandro DalleMule and Thomas H. Davenport, What’s Your Data Strategy? (HBR, May–June 2017)

Richard Veryard, Microsoft's Trustworthy Computing (CBDI Journal, March 2003)

Wikipedia: Trustworthy Computing

Tuesday, April 16, 2019

Is there a Single Version of Truth about Statins?

@bengoldacre provides some useful commentary on a BBC news item about statins. In particular, he notes a detail from the original research paper that didn't make it into the BBC news item - namely the remarkable lack of agreement between GPs and hospitals as to whether a given patient had experienced a cardiovascular event.

This is not a new observation: it was analysed in a 2013 paper by Emily Herrett and others. Dr Goldacre advised a previous Health Minister that "different data sources within the NHS were wildly discrepant wrt to the question of something as simple as whether a patient had had a heart attack". The minister asked which source was right - in other words, asking for a single source of truth. But the point is that there isn't one.

Data quality issues can be traced to a number of causes. While some of the issues may be caused by administrative or technical errors and omissions, others are caused by the way the data are recorded in the first place. This is why the comparison of health data between different countries is often misleading - because despite international efforts to standardize classification, different healthcare regimes still code things differently. And despite the huge amounts of NHS money thrown at IT projects to standardize medical records (as documented by @tonyrcollins), the fact remains that primary and secondary healthcare view the patient completely differently.

See my previous blogposts on Single Source of Truth


Tony Collins, Another NPfIT IT scandal in the making? (Campaign4Change, 9 February 2016)

Emily Herrett et al, Completeness and diagnostic validity of recording acute myocardial infarction events in primary care, hospital care, disease registry, and national mortality records: cohort study (BMJ 21 May 2013)

Michelle Roberts, Statins 'don't work well for one in two people' (BBC News, 15 April 2019)

Benoît Salanave et al, Classification differences and maternal mortality: a European study (International Journal of Epidemiology 28, 1999) pp 64–69

Thursday, March 16, 2017

From Dodgy Data to Dodgy Policy - Mrs May's Immigration Targets

The TotalData™ value chain is about the flow from raw data to business decisions (including evidence-based policy decisions).

In this post, I want to talk about an interesting example of a flawed data-driven policy. The UK Prime Minister, Theresa May, is determined to reduce the number of international students visiting the UK. This conflicts with the advice she is getting from nearly everyone, including her own ministers.

As @Skapinker explains in the Financial Times, there are a number of mis-steps in this case.
  • Distorted data collection. Mrs May's policy is supported by raw data indicating the number of students that return to their country of origin. These are estimated measurements, based on daytime and evening surveys taken at UK airports. Therefore students travelling on late-night flights to such countries as China, Nigeria, Hong Kong, Saudi Arabia and Singapore are systematically excluded from the data.
  • Disputed data definition. Most British people do not regard international students as immigrants. But as May stubbornly repeated to a parliamentary committee in December 2016, she insists on using an international definition of migration, which includes any students that stay for more than 12 months.
  • Conflating measurement with target. Mrs May told the committee that "the target figures are calculated from the overall migration figures, and students are in the overall migration figures because it is an international definition of migration". But as Yvette Cooper pointed out "The figures are different from the target. ... You choose what to target."
  • Refusal to correct baseline. Sometimes the easiest way to achieve a goal is to move the goalposts. Some people are quick to use this tactic, while others instinctively resist change. Mrs May is in the latter camp, and appears to regard any adjustment of the baseline as backsliding and morally suspect.
If you work with enterprise data, you may recognize these anti-patterns.




David Runciman, Do your homework (London Review of Books Vol. 39 No. 6, 16 March 2017)

Michael Skapinker, Theresa May’s clampdown on international students is a mystery (Financial Times, 15 March 2017)

International students and the net migration target: Should students be taken out? (Migration Observatory, 25 Jun 2015)

Oral evidence: The Prime Minister (House of Commons HC 833, 20 December 2016) 


TotalData™ is a trademark of Reply Ltd. All rights reserved

Wednesday, April 28, 2010

Quality and Responsibility

One of the key challenges with shared data and shared services is the question of data quality. Who is responsible for mistakes?

@tonyrcollins raises a specific example - who's responsible for mistakes in summary care records?

"NHS Connecting for Health suggests that responsibility for mistakes lies with the person making the incorrect entry into a patient's medical records. But the legal responsibility appears to lie with the Data Controller who, in the case of Summary Care Records, is the Secretary of State for Health."

From an organizational design point of view, it is usually best to place responsibility for mistakes along with the power and expertise to prevent or correct mistakes. But that in turn calls for an analysis of the root causes of mistakes. If all mistakes can be regarded as random incidents of carelessness or incompetence on the part of the person making the incorrect entry, then clearly the responsibility lies there. But if mistakes are endemic across the system, then the root cause may well be carelessness or incompetence in the system requirements and design, and so the ultimate responsibility rightly lies with the Secretary of State for Health.

Part of the problem here is that the Summary Care Record (SCR) is supposed to be a Single Source of Truth (SSOT), and I have already indicated What's Wrong with the Single Version of Truth (SVOT). Furthermore, it is intended to be used in Accident and Emergency, to support decisions that may be safety-critical or even life-critical. Therefore to design a system that is vulnerable to random incidents of carelessness or incompetence is itself careless and incompetent.

What general lessons can we learn from this example, for shared services and SOA? The first lesson is for design: data quality must be rigorously designed-in, rather than merely relying on validation filters at the data entry stage, and then building downstream functionality that uses the data uncritically. (This is a question for the design of the whole sociotechnical system, not just the software architecture.) And the second lesson is for governance: make sure that stakeholders understand and accept the distribution of risk and responsibility and reward BEFORE spending billions of taxpayers' money on something that won't work.

Friday, March 19, 2010

What's Wrong with the Single Version of Truth

As @tonyrcollins reports, a confidential report currently in preparation on the NHS Summary Care Records (SCR) database will reveal serious flaws in a massively expensive database (Computer Weekly, March 2010). Well knock me down with a superbug, whoever would have guessed this might happen?

"The final report may conclude that the success of SCRs will depend on whether the NHS, Connecting for Health and the Department of Health can bridge the deep cultural and institutional divides that have so far characterised the NPfIT. It may also ask whether the government founded the SCR on an unrealistic assumption: that the centralised database could ever be a single source of truth."

There are several reasons to be ambivalent about the twin principles Single Version of Truth (SVOT) and Single Source of Truth (SSOT), and this kind of massive failure must worry even the most fervent advocates of these principles.

Don't get me wrong, I have served my time in countless projects trying to reduce the proliferation and fragmentation of data and information in large organizations, and I am well aware of the technical costs and business risks associated with data duplication. However, I have some serious concerns about the dogmatic way these principles are often interpreted and implemented, especially when this dogmatism results (as seems to be the case here) in a costly and embarrassing failure.

The first problem is that Single-Truth only works if you have absolute confidence in the quality of the data. In the SCR example, there is evidence that doctors simply don't trust the new system - and with good reason. There are errors and omissions in the summary records, and doctors prefer to double-check details of medications and allergies, rather than take the risk of relying on a single source.

The technical answer to this data quality problem is to implement rigorous data validation and cleansing routines, to make sure that the records are complete and accurate. But this would create more work for the GP practices uploading the data. Officials at the Department of Health fear that setting the standards of data quality too high would kill the scheme altogether. (And even the most rigorous quality standards would only reduce the number of errors; they could never eliminate them altogether.)

There is a fundamental conflict of interest here between the providers of data and the consumers - even though these may be the same people - and between quality and quantity. If you measure the success of the scheme in terms of the number of records uploaded, then you are obviously going to get quantity at the expense of quality.

So the pusillanimous way out is to build a database with imperfect data, and defer the quality problem until later. That's what people have always done, and will continue to do, and the poor quality data will never ever get fixed.

The second problem is that even if perfectly complete and accurate data are possible, the validation and data cleansing step generally introduces some latency into the process, especially if you are operating a post-before-processing system (particularly relevant to environments such as military and healthcare where, for some strange reason, matters of life-and-death seem to take precedence over getting the paperwork right). So there is a design trade-off between two dimensions of quality - timeliness and accuracy. See my post on Joined-Up Healthcare.

The third problem is complexity. Data cleansing generally works by comparing each record with a fixed schema, which defines the expected structure and rules (metadata) to which each record must conform, so that any information that doesn't fit into this fixed schema will be barred or adjusted. Thus the richness of information will be attenuated, and useful and meaningful information may be filtered out. (See Jon Udell's piece on Object Data and the Procrustean Bed from March 2000. See also my presentation on SOA for Data Management.)
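
A minimal sketch of that attenuation, with an invented record and schema: whatever doesn't fit the fixed schema is quietly trimmed away, even if it is precisely the detail a clinician would want to see.

    SCHEMA_FIELDS = {"nhs_number", "medications", "allergies"}

    incoming = {
        "nhs_number": "9999999999",            # invented
        "medications": ["warfarin"],
        "allergies": ["penicillin"],
        "gp_free_text": "Self-medicates with a herbal remedy known to interact with warfarin",
    }

    # Procrustean cleansing: keep only what the fixed schema expects.
    cleansed = {k: v for k, v in incoming.items() if k in SCHEMA_FIELDS}
    attenuated = {k: v for k, v in incoming.items() if k not in SCHEMA_FIELDS}

    print("stored:", cleansed)
    print("filtered out:", attenuated)  # the richness that never reaches the record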

The final problem is that a single source of information represents a single source of failure. If something is really important, it is better to have two independent sources of information or intelligence, as I pointed out in my piece on Information Algebra. This follows Bateson's slogan that "two descriptions are better than one". Doctors using the SCR database appear to understand this aspect of real-world information better than the database designers.

It may be a very good idea to build an information service that provides improved access to patient information, for those who need this information. But if this information service is designed and implemented according to some simplistic dogma, then it isn't going to work properly.


Update. The Health Secretary has announced that NHS regulation will be based on a single version of the truth.

"in the future the chief inspector will ensure that there is a single version of the truth about how their hospitals are performing, not just on finance and targets, but on a single assessment that fully reflects what matters to patients"

Roger Taylor, Jeremy Hunt's dangerous belief in a single 'truth' about hospitals (Guardian 26 March 2013)



Updated 28 March 2013

Tuesday, November 11, 2008

Post Before Processing

In Talk to the Hand, Saul Caganoff describes his experience of errors when entering his timesheet data into one of those time-recording systems many of us have to use. He goes on to draw some general lessons about error-handling in business process management (BPM). In Saul's account, this might sometimes necessitate suspending a business rule.

My own view of the problem starts further back - I think it stems from an incorrect conceptual model. Why should your perfectly reasonable data get labelled as error or invalid just because it is inconsistent with your project manager's data? This happens in a lot of old bureaucratic systems because they are designed on the implicit (hierarchical, top-down) assumption that the manager (or systems designer) is always right and the worker (or data entry clerk) is always the one that gets things wrong. It's also easier for the computer system to reject the new data items, rather than go back and question items (such as reference data) that have already been accepted into the database.

I prefer to label such inconsistencies as anomalies, because that doesn't imply anyone in particular being at fault.

It would be crazy to have a business rule saying that anomalies are not allowed. Anomalies happen. What makes sense is to have a business rule saying how and when anomalies are recognized (i.e. what counts as an anomaly) and resolved (i.e. what options are available to whom).

Then you never have to suspend the rule. It is just a different, more intelligent kind of rule.
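
Here is a sketch of what such a rule might look like, using a hypothetical timesheet example: the mismatch is recorded as an anomaly with defined resolution options, and nobody's data gets rejected as an "error".

    def reconcile_timesheet(worker_hours, manager_budget_hours):
        """Recognize an anomaly without deciding in advance who is wrong."""
        entry = {"worker_hours": worker_hours,
                 "manager_budget_hours": manager_budget_hours,
                 "status": "accepted"}
        if worker_hours > manager_budget_hours:
            # Neither party's data is rejected; the mismatch is flagged
            # and routed to whoever has the authority to resolve it.
            entry["status"] = "anomaly"
            entry["resolution_options"] = ["manager extends budget",
                                           "worker reallocates hours",
                                           "escalate to project office"]
        return entry

    print(reconcile_timesheet(worker_hours=42, manager_budget_hours=40))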

One of my earliest experiences of systems analysis was designing order processing and book-keeping systems. When I visited the accounts department, I saw people with desks stacked with piles of paper. It turned out that these stacks were the transactions that the old computer system wouldn't accept, so the accounts clerks had developed a secondary manual system for keeping track of all these invalid transactions until they could be corrected and entered.

According to the original system designer, the book-keeping process had been successfully automated. But what had been automated was over 90% of the transactions - and less than 20% of the time and effort. So I said, why don't we build a computer system that supports the work the accounts clerks actually do? Let them put all these dodgy transactions into the database and then sort them out later.

But I was very junior and didn't know how things were done. And of course the accounts clerks had even less status than I did. The high priests who commanded the database didn't want mere users putting dodgy data in, so it didn't happen.


Many years later, I came across the concept of Post Before Processing, especially in military or medical systems. If you are trying to load or unload an airplane in a hostile environment, or trying to save the life of a patient, you are not going to devote much time or effort to getting the paperwork correct. So all sorts of incomplete and inaccurate data get shoved quickly into the computer, and then sorted out later. These systems are designed on the principle that it is better to have some data, however incomplete or inaccurate, than none at all. This was a key element of the DoD Net-Centric Data Strategy (2003).

The Post Before Processing paradigm also applies to intelligence. For example, here is a US Department of Defense ruling on the sharing of intelligence data.
In the past, intelligence producers and others have held information pending greater completeness and further interpretative processing by analysts. This approach denies users the opportunity to apply their own context to data, interpret it, and act early on to clarify and/or respond. Information producers, particularly those at large central facilities, cannot know even a small percentage of potential users' knowledge (some of which may exceed that held by a center) or circumstances (some of which may be dangerous in the extreme). Accordingly, it should be the policy of DoD organizations to publish data assets at the first possible moment after acquiring them, and to follow-up initial publications with amplification as available. (Net-Centric Enterprise Services Technical Guide)


See also

Saul Caganoff, Talk to the Hand (11 November 2008), Progressive Data Constraints (21 November 2008)

Jeff Jonas, Introducing the concept of network-centric warfare and post before processing (21 January 2006), The Next Generation of Network-Centric Warfare: Process at Posting or Post at Processing (Same thing) (31 January 2007)

Related Post: Progressive Design Constraints (November 2008)

Monday, January 15, 2007

Information Sharing and Joined-Up Services 2

My colleague David Sprott has just posted a critique (Big Brother Database Dinosaur) of the latest UK Government proposals [Note 1] for putting citizen data into a large central database.

As many commentators have pointed out [Note 2], a large central database of this kind would have to be built to extremely high standards of data quality and data protection. Given the recent history of public sector IT, it is hard to be confident that such standards would be achieved or maintained. There is also the question of liability and possible compensation - for example if a citizen suffered financial or other loss as a result of incorrect data.

But in any case, as David points out from an SOA perspective, the proposal is architecturally unsound and technologically obsolescent. Robin Wilton (Sun Microsystems) comes to a similar conclusion from the perspective of federated identity.

Government ministers are busily backtracking on the "Big Brother" elements of the proposal [Note 3], but the policy paper confirms some of the details [Note 4].

David's comments refer mainly to the proposed consolidation of citizen information across various public sector agencies within the UK. But there is another information-sharing problem in the news at present - the fact that the UK criminal records database does not include tens of thousands of crimes committed by UK citizens in other countries. [Note 5]

Part of the difficulty seems to be in verifying the identity of these records. Information sharing requires some level of interoperability, and this includes minimum standards of identification. There are some serious issues here, including semantics, which can never be resolved merely by collecting large amounts of data into one place.

The problem of information sharing within one country is really no different from the problems of information sharing between countries. But at least in the latter case there is nobody saying we can solve all the problems by building a single international database. At least I hope not.

As I said on this blog in 2003 [Note 6], we need to innovate new mechanisms to manage information sharing. This is one of the opportunities and challenges for SOA in delivering joined-up services in a proper manner. Then centralization becomes irrelevant.

Note 1: BBC News January 14th 2007

Note 2: Fish & Chip Papers: Government uber-databases.

Note 3: BBC News January 15th 2007. See also Fish & Chip Papers: Data sharing does not a Big Brother make.

Note 4: Daily Telegraph Microchips for mentally ill planned in shake-up.

Note 5: According to ACPO, some 27,500 case files were left in desk files at the Home Office instead of being properly examined and entered into the criminal records database. [BBC News]

Note 6: See my post from 2003 on Information Sharing and Joined-Up Services.

Sunday, October 23, 2005

Data Provenance

In my previous post on Information Sharing, I discussed some of the problems of information sharing in a loosely-coupled world, with special reference to the security services. There is further discussion of the social and political aspects of this at IntoTheMachine and Demanding Change. In this blog, I am continuing to focus on the information management aspects.

On Friday, I had a briefing about an EU research project on data provenance, led by IBM. This project is currently focused on the creation and storage of provenance data (in other words data that describe the provenance of other data - what we used to call an audit trail). The initial problem they are addressing is the fact that provenance data (audit trails and logs) are all over the place - incomplete and application-specific - and this makes it extremely hard to trace provenance across complex enterprise systems. The proposed solution is the integration of provenance data from heterogeneous sources to create one or more provenance stores. These provenance stores may then be made available for interrogation and analysis in a federated network or grid, subject to important questions of trust and security.

In art history, provenance means being able to trace a continuous history of a painting back to the original artist - for example proving the version of the Mona Lisa currently in the Louvre is the authentic work of Leonardo da Vinci. As it happens, we don't have a completely watertight provenance for the Mona Lisa, as it was stolen by an Italian artist in 1911, and remained absent from the Louvre until 1913. Most art lovers assume that the painting that was returned to the Louvre is genuine, but there is a gap in the audit trail in which an excellent forgery might possibly have been committed. [See The Day The Mona Lisa was Stolen. I learned about this story from Darien Leader's book Stealing the Mona Lisa.] Provenance may also involve an audit trail of other events in the painting's history, such as details of any restoration or repair.

In information systems, provenance can be understood as a form of instrumentation of the business process - instrumentation and context that allows the source and reliability of information to be validated and verified. (Quality wonks will know that there is a subtle distinction between validation and verification: both are potentially important for data provenance; and I may come back to this point at a later date.) Context data are used for many purposes besides provenance, and so provenance may involve a repurposing of instrumentation (data collection) that is already carried out for some other purpose, such as business activity monitoring (BAM). Interrogation of provenance is at the level of the business process, and IBM is talking about possible future standards for provenance-aware business processes.

Provenance-awareness is an important enabler of compliance with various regulations and practices, including Basel, Sarbanes-Oxley, and HIPAA. If a person or organization is to be held accountable for something, then this typically includes being accountable for the source and reliability of relevant information. Thus provenance must be seen as an aspect of governance.

Provenance is also an important issue in complex B2B environments, where organizations are collaborating under imperfect trust. From a service-oriented point of view, I think what is most interesting about data provenance is not the collection and storage of provenance data, but the interrogation and use. This means we don't just want provenance-aware business processes (supported by provenance-aware application systems) but also provenance-aware objects and services. Some objects (especially documents, but possibly also physical objects with suitable RFID encoding) may contain and deliver their own provenance data, in some untamperable form. Web services may carry some security coding that allows the consumer to trust the declared provenance of the service and its data content. There are some important questions about composition and decomposition - how do we reason architecturally about the relationship between provenance at the process/application level and provenance at the service/object level?
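
As a very rough sketch of a provenance-aware object - not the EU project's architecture, just an illustration of the idea - each record could carry its own chain of provenance entries, hash-linked so that tampering with the history becomes detectable.

    import hashlib
    import json

    def add_provenance(record, actor, action):
        """Append a hash-chained provenance entry to the record itself."""
        chain = record.setdefault("provenance", [])
        entry = {"actor": actor, "action": action,
                 "prev": chain[-1]["hash"] if chain else ""}
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        chain.append(entry)
        return record

    doc = {"content": "discharge summary"}
    add_provenance(doc, actor="gp_system", action="created")
    add_provenance(doc, actor="hospital_service", action="amended medication list")

    # A consumer can walk and verify the chain before trusting the content.
    for entry in doc["provenance"]:
        print(entry["actor"], entry["action"], entry["hash"][:12])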

We also need an analysis and design methodology for provenance. How do you determine how much provenance data will be adequate to satisfy a given compliance requirement in a given context (surely standards cannot be expected to provide a complete answer to this) and how do you compose a sufficiently provenance-aware business process either from provenance-aware services, or from normal services plus some additional provenancing services. Conversely, in the design of services to be consumed outside the enterprise, there are important analysis and design questions about the amount of provenance data you are prepared to expose to your service consumers. The EU project includes methodology as one of its deliverables, due sometime next year. In the meantime, IBM hopes that people will start to implement the provenance architecture, as described on the project website, and provide practical input.


Updated 30 April 2013

Tuesday, December 02, 2003

Three Responses to Inconsistency

 Inconsistency may prompt a technocratic, moral or commercial response.

Suppose a man is found to have inconsistent identity or characteristics -- let's say one computer believes him to be married, while another believes him to be single.

One possible explanation for this is that the inconsistency refers to two different men -- perhaps father and son with the same name at the same address. A technocratic resolution is based on a simple EITHER/OR.  Either there are two men with two different identities, or there is one man and at least some of the computer data relating to him are incorrect or out-of-date. The inconsistency is resolved and removed by fixing the identifier and/or correcting the data.

In contrast to this, a moral response to this contradiction might be to imagine some lax moral behaviour on the part of the man himself, or elsewhere in the system (social or technical). A married man who sometimes claims to be unmarried (or permits people to believe him unmarried) might be on morally dangerous ground. A simple-minded moralist might be tempted to make instant moral judgements, based on the first signs of inconsistency. A fair and proper assessment of character would require more careful consideration and discretion, but this could still lead to some form of moral judgement.

A commercial (and morally cynical) response to detecting this inconsistency could be to highlight the man as a possible consumer of certain products and services -- from flowers and chocolates to legal representation.

The point here is that inconsistency sometimes reveals something interesting and important -- and there may be valid alternatives to the technocratic preference for correcting and smoothing the data. Professional auditors are taught that apparently trivial inconsistencies may yield clues to serious fraud. Psychologists (including some marketing experts) suggest that one's deepest desires may often be marked by apparent ambivalence or oscillation.  An approach to data management or personal identity that suppresses or erases inconsistency for the sake of convenience and seamless communication is systematically incapable of recognizing many of the subtleties of human behaviour.

Inconsistency can also reveal possible clashes of worldview between business partners or other stakeholders. One firm identifies a man as having two children, another firm identifies him as having no children. When we investigate this inconsistency, we find that he does indeed have two children, but they are grown-up and no longer live with him. Two different organizations have two different reasons for asking what is apparently the same question. But the difference in content reveals a deeper difference in meaning. 

 

 

Originally posted at http://www.users.globalnet.co.uk/~rxv/demcha/contradiction.htm