Showing posts with label infomgt.

Wednesday, March 04, 2020

Economic Value of Data

How far can general principles of asset management be applied to data? In this post, I'm going to look at some of the challenges of putting monetary or non-monetary value on your data assets.

Why might we want to do this? There are several reasons why people might be interested in the value of data.
  • Establish internal or external benchmarks
  • Set measurable targets and track progress
  • Identify underutilized assets
  • Prioritization and resource allocation
  • Threat modelling and risk assessment (especially in relation to confidentiality, privacy, security)
Non-monetary benchmarks may be good enough if all we want to do is compare values - for example, this parcel of data is worth a lot more than that parcel, this process/practice is more efficient/effective than that one, this initiative/transformation has added significant value, and so on.

But for some purposes, it is better to express the value in financial terms. Especially for the following:
  • Cost-benefit analysis – e.g. calculate return on investment
  • Asset valuation – estimate the (intangible) value of the data inventory – e.g. relevant for flotation or acquisition
  • Exchange value – calculate pricing and profitability for traded data items

There are (at least) five entirely different ways to put a monetary value on any asset.
  • Historical Cost The total cost of the labour and other resources required to produce and maintain an item. 
  • Replacement Cost The total cost of the labour and other resources that would be required to replace an item. 
  • Liability Cost The potential damages or penalties if the item is lost or misused. (This may include regulatory action, reputational damage, or commercial advantage to your competitors, and may bear no relation to any other measure of value.) 
  • Utility Value The economic benefits that may be received by an actor from using or consuming the item. 
  • Market Value The exchange price of an item at a given point in time. The amount that must be paid to purchase the item, or the amount that could be obtained by selling the item. 
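As a back-of-envelope sketch, the five valuations of a single parcel of data could be held side by side, since the interesting signal is often how far they diverge. All figures below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class DataAssetValuation:
    """Five independent monetary valuations of one data asset.
    The figures used below are illustrative, not from any real inventory."""
    historical_cost: float   # cost of labour and resources to produce and maintain
    replacement_cost: float  # what it would cost to replace the item today
    liability_cost: float    # potential damages or penalties if lost or misused
    utility_value: float     # economic benefit from using or consuming it
    market_value: float      # current exchange price

    def spread(self) -> float:
        """Gap between highest and lowest valuation - a rough indicator
        of how contested the asset's value is."""
        values = [self.historical_cost, self.replacement_cost,
                  self.liability_cost, self.utility_value, self.market_value]
        return max(values) - min(values)

customer_history = DataAssetValuation(
    historical_cost=120_000, replacement_cost=200_000,
    liability_cost=1_500_000, utility_value=350_000, market_value=80_000)
```

Note how the liability cost can dwarf every other measure, which is exactly the point made above: it may bear no relation to any other measure of value.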

But there are some real difficulties in doing any of this for data. None of these difficulties are unique to data, but I can't think of any other asset class that has all of these difficulties multiplied together to the same extent.

  • Data is an intangible asset. There are established ways of valuing intangible assets, but these are always somewhat more complicated than valuing tangible assets.
  • Data is often produced as a side-effect of some other activity. So the cost of its production may already be accounted for elsewhere, or is a very small fraction of a much larger cost.
  • Data is a reusable asset. You may be able to get repeated (although possibly diminishing) benefit from the same data.
  • Data is an infinitely reproducible asset. You can sell or share the same data many times, while continuing to use it yourself. 
  • Some data loses its value very quickly. If I’m walking past a restaurant, this information has value to the restaurant. Ten minutes later I'm five blocks away, and the information is useless. And even before this point, suppose there are three restaurants and they all have access to the information that I am hungry and nearby. As soon as one of these restaurants manages to convert this information, its value to the remaining restaurants becomes zero or even negative. 
  • Data combines in a non-linear fashion. Value (X+Y) is not always equal to Value (X) + Value (Y). Even within more tangible asset classes, we can find the concepts of Assemblage and Plottage. For data, one version of this non-linearity is the phenomenon of information energy described by Michael Saylor of MicroStrategy. And for statisticians, there is also Simpson’s Paradox.
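Simpson's Paradox is easy to demonstrate. This sketch uses the much-cited kidney-stone treatment data (Charig et al, 1986): treatment A has the better success rate in both subgroups, yet looks worse once the subgroups are merged.

```python
# (successes, cases) per treatment, split by stone size
small_stones = {"A": (81, 87), "B": (234, 270)}
large_stones = {"A": (192, 263), "B": (55, 80)}

def rate(successes, cases):
    return successes / cases

def combined(groups, treatment):
    """Success rate after merging the subgroups."""
    s = sum(g[treatment][0] for g in groups)
    n = sum(g[treatment][1] for g in groups)
    return rate(s, n)

groups = [small_stones, large_stones]

# A wins within each subgroup...
assert rate(*small_stones["A"]) > rate(*small_stones["B"])
assert rate(*large_stones["A"]) > rate(*large_stones["B"])
# ...but loses in aggregate.
assert combined(groups, "A") < combined(groups, "B")
```

So the value of a parcel of data can depend critically on how it is partitioned, not just on its content.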


The production costs of data can be estimated in various ways. One approach is to divide up the total ICT expenditure, estimating roughly what proportion of the whole to allocate to this or that parcel of data. This generally only works for fairly large parcels - for example, this percentage to customer transactions, that percentage to transport and logistics, and so on. Another approach is to work out the marginal or incremental cost: this is commonly preferred when considering new data systems, or decommissioning old ones. We can compare the effort consumed in different data domains, or count the number of transformation steps from raw data to actionable intelligence.
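The proportional-allocation approach can be sketched in a few lines. The parcel names and percentages here are invented for illustration:

```python
def allocate_ict_spend(total_spend: float, proportions: dict) -> dict:
    """Split a total ICT budget across large data parcels by estimated share."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("allocation percentages must sum to 100%")
    return {parcel: total_spend * share for parcel, share in proportions.items()}

costs = allocate_ict_spend(10_000_000, {
    "customer transactions": 0.40,
    "transport and logistics": 0.25,
    "product and inventory": 0.20,
    "everything else": 0.15,
})
```

The crudeness is deliberate: the method only pretends to be accurate at the level of large parcels, which is why it breaks down for anything fine-grained.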

As for the value of the data, there are again many different approaches. Ideally, we should look at the use-value or performance value of the data - what contribution does it make to a specific decision or process, or what aggregate contribution does it make to a given set of decisions and processes. 
  • This can be based on subjective assessments of relevance and usefulness, perhaps weighted by the importance of the decisions or processes where the data are used. See Bill Schmarzo's blogpost for a worked example.
  • Or it may be based on objective comparisons of results with and without the data in question - making a measurable difference to some key performance indicator (KPI). In some cases, the KPI may be directly translated into a financial value. 
However, comparing performance fairly and objectively may only be possible for organizations that are already at a reasonable level of data management maturity.

In the absence of this kind of metric, we can look instead at the intrinsic value of the data, independently of its potential or actual use. This could be based on a weighted formula involving such quality characteristics as accuracy, alignment, completeness, enrichment, reliability, shelf-life, timeliness, uniqueness, usability. (Gartner has published a formula that uses a subset of these factors.)
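A minimal sketch of such a weighted formula, using an arbitrary subset of the characteristics and arbitrary weights (this is not the Gartner formula, just an illustration of the shape such a formula takes):

```python
# Arbitrary weights over a subset of the quality characteristics listed above.
QUALITY_WEIGHTS = {
    "accuracy": 0.25, "completeness": 0.20, "timeliness": 0.20,
    "uniqueness": 0.15, "usability": 0.20,
}

def intrinsic_score(scores: dict) -> float:
    """Weighted quality score in [0, 1] for one parcel of data.
    Each characteristic is scored 0-1; missing characteristics score zero."""
    return sum(QUALITY_WEIGHTS[k] * scores.get(k, 0.0)
               for k in QUALITY_WEIGHTS)

parcel = {"accuracy": 0.9, "completeness": 0.8, "timeliness": 0.6,
          "uniqueness": 0.5, "usability": 0.7}
score = intrinsic_score(parcel)
```

The weights are where all the judgement hides, of course: the formula makes the valuation repeatable, not objective.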

Arguably there should be a depreciation element to this calculation. Last year's data is not worth as much as this year's data, and the accuracy of last year's data may not be so critical, but the data is still worth something.
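One possible way to model that depreciation element is exponential decay with a configurable half-life, so that old data loses value steadily but never quite reaches zero. The two-year half-life below is an arbitrary assumption:

```python
import math

def depreciated_value(original_value: float, age_years: float,
                      half_life_years: float = 2.0) -> float:
    """Exponential decay: value halves every half_life_years,
    but never drops all the way to zero."""
    return original_value * math.exp(-math.log(2) * age_years / half_life_years)
```

Different data domains would need very different half-lives: the restaurant example above has a half-life of minutes, while reference data may barely decay at all.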

An intrinsic measure of this kind could be used to evaluate parcels of data at different points in the data-to-information process. For example, showing the increase of enrichment and usability from 1. to 2. and from 2. to 3., and therefore giving a measure of the added-value produced by the data engineering team that does this for us.
    1. Source systems
    2. Data Lake – cleansed, consolidated, enriched and accessible to people with SQL skills
    3. Data Visualization Tool – accessible to people without SQL skills
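Scoring the same parcel at each of these three points (with invented scores) gives a simple measure of the value added between stages:

```python
# Intrinsic-value scores for one parcel at each stage of the
# data-to-information process. Stage names follow the list above;
# the scores are invented for illustration.
stage_scores = {
    "source systems": 0.35,
    "data lake": 0.60,
    "data visualization tool": 0.80,
}

def added_value(scores: dict) -> list:
    """Delta in score between consecutive stages - a rough measure
    of the value added by the team responsible for each transition."""
    stages = list(scores.items())
    return [(a[0], b[0], round(b[1] - a[1], 2))
            for a, b in zip(stages, stages[1:])]

deltas = added_value(stage_scores)
```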

If any of my readers know of any useful formulas or methods for valuing data that I haven't mentioned here, please drop a link in the comments.



Heather Pemberton Levy, Why and How to Value Your Information as an Asset (Gartner, 3 September 2015)

Bill Schmarzo, Determining the Economic Value of Data (Dell, 14 June 2016)

Wikipedia: Simpson's Paradox, Value of Information

Related posts: Information Algebra (March 2008), Does Big Data Release Information Energy? (April 2014), Assemblage and Plottage (January 2020)

Saturday, April 20, 2013

From information architecture to evidence-based practice

@bengoldacre has produced a report for the UK Department for Education, suggesting some lessons that education can learn from medicine, and calling for a coherent “information architecture” that supports evidence based practice. Dr Goldacre notes that in the highest performing education systems, such as Singapore, “it is almost impossible to rise up the career ladder of teaching, without also doing some work on research in education.”

Here are some of his key recommendations. Clearly these recommendations would be relevant to many other corporate environments, especially those where there is strong demand for innovation, performance and value-for-money.

  • a simple infrastructure that supports evidence-based practice
  • teachers should be empowered to participate in research
  • the results of research should be disseminated more efficiently
  • resources on research should be available to teachers, enabling them to be critical and thoughtful consumers of evidence
  • barriers between teachers and researchers should be removed
  • teachers should be driving the research agenda, by identifying questions that need to be answered.

Clearly it is not enough merely to create an information architecture or knowledge infrastructure. The challenge is to make sure they are aligned with an inquiring culture.

to be continued ...


Ben Goldacre, Teachers! What would evidence based practice look like? (Bad Science, March 2013)

Friday, March 01, 2013

Knowledge and Memory

Once upon a time, people thought of an information model as defining the structure of the stuff you want to remember. Nowadays, this definition is too restrictive: it might possibly be adequate for a system/database designer, but is not adequate for an architect. 

Here are three examples where my knowing and using information is not dependent on my remembering it.

1. I recently helped an SME with its PCI DSS submission. One of the critical points is that they avoid a lot of trouble by NEVER storing their customers' credit card details, merely transmitting these details directly to Barclaycard using a secure device. Clearly the credit card details are stored in various places elsewhere, but our systems don't store them anywhere, not even in cache.

2. When I did my driving test, many years ago, the Highway Code had a table of stopping distances that you were supposed to remember. I decided it would be easier to remember the formula and calculate as required. Obviously that's a design decision.
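The rule of thumb commonly used to generate that Highway Code table is one foot of thinking distance per mph, plus the square of the speed divided by twenty for braking distance (in feet):

```python
def stopping_distance_feet(speed_mph: float) -> float:
    """Highway Code rule of thumb: thinking distance is one foot per mph,
    braking distance is speed squared divided by twenty (feet)."""
    thinking = speed_mph
    braking = speed_mph ** 2 / 20
    return thinking + braking
```

Which rather proves the design point: a one-line formula is easier to carry around than a memorized table.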

3. My computer remembers frequently used email addresses. But if I want to get back in touch with someone, I will use LinkedIn or Google to find his current email address rather than use the one on my computer, which may be out-of-date.

The system/database developer and the architect may use the same patterns for defining an entity or entity type, but they have different agendas, and therefore different success criteria. The architect may be less rigorous about some aspects, and needs to be more rigorous about some other aspects.

The system/database developer may see an entity as "something you need to remember etc.". An architect sees an entity as "something you need to exchange information about". The information model establishes a consistent basis for information exchange between people, systems, services, processes and organizations. I can only talk with you about customers, share your customer data, use your customer-related services, and so on, if either (a) you and I have the same understanding of CUSTOMER or (b) there is a mapping between your notion of CUSTOMER and mine. We can call this semantic interoperability.
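Option (b) can be sketched as an explicit field mapping between your CUSTOMER schema and mine. The field names are invented for illustration:

```python
# Mapping from your notion of CUSTOMER to mine, so we can exchange
# records without sharing a schema.
YOUR_TO_MY_CUSTOMER = {
    "cust_no": "customer_id",
    "cust_name": "full_name",
    "tel": "phone",
}

def translate_customer(your_record: dict) -> dict:
    """Re-key a record from your CUSTOMER schema into mine,
    dropping fields my schema has no place for."""
    return {mine: your_record[yours]
            for yours, mine in YOUR_TO_MY_CUSTOMER.items()
            if yours in your_record}

mine = translate_customer({"cust_no": "C042", "cust_name": "Ada Lovelace",
                           "tel": "0123", "internal_flag": True})
```

The mapping table is the architecturally significant artefact here: it makes the correspondence between the two notions of CUSTOMER explicit and maintainable.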

This idea of semantic interoperability underlies the Open Group vision of Boundaryless Information Flow.

If you have a private list of customers on your laptop, then you can identify them in any adhoc manner you choose. But if you want to share the list with other people, the identity criteria become important.

So which entity types is the architect most interested in? Primarily the ones that are referenced in the interfaces between systems and services. There are lots of other things that you might wish to remember, monitor and/or direct, but if they can be encapsulated inside some service or component they have no architectural significance. For example, the business needs to know the prevailing VAT rate; so you build a common VAT routine with the VAT rate hidden inside.
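The VAT example can be sketched in a few lines: the rate is private to one routine, so no other component needs to know it. (20% is the current UK standard rate; the point is that only this module would change if it changes.)

```python
_VAT_RATE = 0.20  # hidden inside this routine - architecturally insignificant

def add_vat(net_amount: float) -> float:
    """Return the gross amount including VAT at the prevailing rate."""
    return round(net_amount * (1 + _VAT_RATE), 2)
```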

Monday, February 25, 2013

Who owns data management strategy?

@joel_schectman exposes an apparent divergence of opinion among #Gartner analysts - whether the CEO or the CIO should be in charge of data management strategy.


@ted_friedman says that taking out IT as the gatekeeper of centrally stored data can promote “better fact based decision making across the organization”.

Merv Adrian (@merv) says that bypassing the CIO can have unintended side effects, such as risks to privacy and the quality of the analysis.


Merv explains further: “If you don’t have to go through a procurement process and IT, you’re a lot freer to do what you want. But all of that carefully constructed governance is completely undermined, you can be drawing incorrect conclusions, and exposing risks to privacy because they are doing things IT hasn’t vetted.”

Merv's concern about quality also applies to the widespread and often uncontrolled use of spreadsheets and other end-user tools. For example, we can find @JamesYKwak and @alexhern discussing whether we can blame Microsoft Excel for the $9bn losses at JPMorgan.

What exactly do we mean by data management strategy? Joel says it includes how to best utilize customer information to leverage growth. Most CIOs seem to think their responsibility for data finishes when they deliver data and information to the user's device. They seem uninterested in how these users actually use the data, and whether better or faster data genuinely improve decisions and policies, and produce better business outcomes.

In other words, the CIO doesn't operate as a Chief Information Officer but as a Chief Information Systems and Technology Officer.

True information strategy includes a closed feedback and learning loop, so that the use of the information can be monitored. Are these expensively collected and elaborately processed data analytics actually influencing decisions, or are the users mostly ignoring them?



Alex Hern, Is Excel the most dangerous piece of software in the world? (New Statesman Feb 2013)

James Kwak, The Importance of Excel (Baseline Scenario Feb 2013)

Joel Schectman, Democratizing Data Analysis Has Risk (WSJ Feb 2013)


Updated 20 February 2016

Sunday, November 11, 2012

Co-Production of Data and Knowledge

Following Russell Ackoff, systems thinkers like to equate wisdom with systems thinking. As Nikhil Sharma points out,

"the choice between Information and Knowledge is based on what the particular profession believes to be manageable".

When this is described as a hierarchy, this is essentially a status ranking. Wisdom (which is what I happen to have) is clearly superior to mere knowledge (which is what the rest of you might have, if you're lucky).


Here's an analogy for the so-called hierarchy of Data, Information, Knowledge and Wisdom (DIKW).

  • Data = Flour
  • Information = Bread
  • Knowledge = A Recipe for Bread-and-Butter Pudding
  • Wisdom = Only Eating A Small Portion

Note that Information isn't made solely from Data, Knowledge isn't made solely from Information, and Wisdom isn't made solely from Knowledge. See also my post on the Wisdom of the Tomato.



That's enough analogies. Let me now explain what I think is wrong with this so-called hierarchy.

Firstly, the term hierarchy seems to imply that there are three similar relationships.
  • between Data and Information
  • between Information and Knowledge
  • and between Knowledge and Wisdom
 as well as implying some logical or chronological sequence
  • Data before Information
  • Information before Knowledge
  • Knowledge before Wisdom
while the pyramid shape implies some quantitative relationships
  • Much more data than information
  • Much more information than knowledge
  • Tiny amounts of wisdom


But my objection to DIKW is not just that it isn't a valid hierarchy or pyramid, but it isn't even a valid schema. It encourages people to regard Data-Information-Knowledge-Wisdom as a fairly rigid classification scheme, and to enter into debates as to whether something counts as information or knowledge. For example, people often argue that something only counts as knowledge if it is in someone's head. I regard these debates as unhelpful and unproductive.

A number of writers attack the hierarchical DIKW schema, and propose alternative ways of configuring the four elements. For example, Dave Snowden correctly notes that knowledge is the means by which we create information out of data. Meanwhile Tom Graves suggests we regard DIKW not as layers but as distinct dimensions in a concept-space.


But merely rearranging DIKW fails to address the most fundamental difficulty of DIKW, which is a naive epistemology that has been discredited since the Enlightenment. You don't simply build knowledge out of data. Knowledge develops through Judgement (Kant), Circular Epistemology and Dialectic (Hegel), Assimilation and Accommodation (Piaget), Conjecture and Refutation (Popper), Proof and Refutation (Lakatos), Languaging and Orientation (Maturana), and/or Mind (Bateson).

These thinkers share two things: firstly the rejection of the Aristotelian idea of one-way traffic from data to knowledge, and secondly an insistence that data must be framed by knowledge. Thus we may validate knowledge by appealing to empirical evidence (data), but we only pick up data in the first place in accordance with our preconceptions and observation practices (knowledge). Meanwhile John Durham Peters suggests that knowledge is not the gathering but the throwing away of information (Marvellous Clouds, p 318).

Among other things, this explains why organizations struggle to accommodate (and respond effectively to) weak signals, and why they persistently fail to connect the dots.

And if architects and engineers persist in trying to build information systems and knowledge management systems according to the DIKW schema, they will continue to fall short of supporting organizational intelligence properly.




For a longer and more thorough critique, see Ivo Velitchkov, Do We Still Worship The Knowledge Pyramid (May 2017)

Many other critiques are available ...

Gene Bellinger, Durval Castro and Anthony Mills, Data, Information, Knowledge, and Wisdom (Systems Thinking Wiki, 2004)

Tom Graves, Rethinking the DIKW Hierarchy (Nov 2012)

Patrick Lambe, From Data With Love (Feb 2010)

Nikhil Sharma, The Origins of the DIKW Hierarchy (23 Feb 2005)

Kathy Sierra, Moving up the wisdom hierarchy (23 April 2006)

Dave Snowden, Sense-making and Path-finding (March 2007)

Gordon Vala-Webb, The DIKW Pyramid Must Die (KM World, Oct 2012) - as reported by V Mary Abraham

David Weinberger, The Problem with the Data-Information-Knowledge-Wisdom Hierarchy (HBR, 2 February 2010)

DIKW Model (KM4dev Wiki)


Related posts: Connecting the Dots (January 2010), Too Much Information (April 2010), Seeing is not observing (November 2012), Big Data and Organizational Intelligence (November 2018), An Alternative to the DIKW Pyramid (February 2020)



Updated 8 December 2012
More links added 01 March 2020
Also merging in some material originally written in May 2006.

Wednesday, October 31, 2012

Architecture and the Imagination

"Thinking about the future is a form of unreality." Leif Frenzel, Lost time, sedimentation, and the future as a form of unreality (March 2011)

An architect looks at a valley and imagines a viaduct. She then describes this imaginary viaduct in great detail. As a result of her imagination, and the efforts of many engineers and other workers, when we visit the valley ten years later we too can see the viaduct, now fully realized in graffiti-daubed concrete.

Similarly, much of the work of enterprise and solution architects refers to things that don't exist yet. Most obviously, this applies to systems that haven't been built yet. But it can also apply to business concepts that haven't been "realized" yet.

For example, before Apple launched the iPhone, it must already have had a reasonably well-elaborated concept of IPHONE-USER, and it would have ensured that this concept was adequately supported by a combination of existing and new systems. (Of course, the concept of IPHONE-USER may well be a specialization of the concept of CUSTOMER, but there is a lot of new conceptual matter to accommodate.)

The concept of IPHONE-USER is essentially an exercise in imagination. However this imagination can be grounded by various practical experiments - for example, test engineers creating artificial or proxy instances of IPHONE-USER to make sure everything works properly.

Some 1980s methodologies, including Information Engineering, preached that a conceptual or business information model was in some sense timeless, and that Information Strategy could essentially be reduced to Information Systems Strategy. (This agenda was of course promoted by companies who wanted to sell information systems.)

I now think it is perfectly reasonable for the conceptual model of the enterprise to evolve over time, as the business starts to conceive of previously unconceivable things. Here are some well-worn examples.

  • The introduction of loyalty cards into retail, allowing retailers to recognize their customers as "the same again", and therefore replacing the concept of a customer-per-visit with the concept of a customer with continuity over time.  
  • The ability to track activity at ever-finer levels of granularity. For example, monitoring every click on your website, or watching your customers navigating the store. (Did she pick up the lemons from the display next to the fish or from the display next to the gin?)
  • Exposing a wealth of associations between products and customers. For example, Amazon's development of the "people-who-bought-this-also-bought-that" pattern.

So as I see it, Information Strategy includes imagining new things for the business to pay attention to. This is a lot broader and more interesting than Information Systems Strategy, and is just one of the areas where the architect needs to use some imagination.



Updated 30 March 2013

Monday, December 13, 2010

Can Single Source of Truth work?

@tonyrcollins asks if any healthcare IT system can provide a Single Source of Truth (SSOT)? In his blog (13 December 2010), he discusses a press release claiming that an electronic healthcare record system from Cerner Millennium Solutions is a "single source of truth", citing the Children’s Cancer Hospital Egypt 57357 (CCHE) as a success story (via Egyptian Chamber).

My first observation is that even if we take this success story at face value, it doesn't tell us much about the possibilities of SSOT in an environment such as the UK NHS that is several orders of magnitude more complicated/complex. I'm guessing the Children’s Cancer Hospital Egypt 57357 (CCHE) doesn't have as many different types of "truth" to manage as the NHS.
  • one type of patient (children)
  • one type of condition (cancer)
  • a single building
My second observation is that if a closed organization has a single source of truth, it will never discover flaws in any of these truths. If a child is given the wrong medication, for whatever reason, we can only detect the error and prevent its recurrence by finding a second source of truth. The reason SSOT has not been successfully implemented in the UK is not just because it wouldn't work (after all, lots of things are implemented that don't work) but because there are too many people who know it wouldn't work and are sufficiently powerful to resist it.

My third observation is that single-source of truth may be a bureaucratic fantasy, but responsible doctors will always strive to get best-truth rather than sole-truth. People in bureaucratic organizations don't always stick to the formal channels, and often have alternative ways of finding out what they need to know. So perhaps the Egyptian doctors at CCHE have managed to preserve alternative sources of information, and the "single source of truth" is merely a bureaucratic illusion.

See my previous post What's Wrong with the Single Source of Truth?

Friday, March 19, 2010

What's Wrong with the Single Version of Truth

As @tonyrcollins reports, a confidential report currently in preparation on the NHS Summary Care Records (SCR) database will reveal serious flaws in this massively expensive system (Computer Weekly, March 2010). Well knock me down with a superbug, whoever would have guessed this might happen?

"The final report may conclude that the success of SCRs will depend on whether the NHS, Connecting for Health and the Department of Health can bridge the deep cultural and institutional divides that have so far characterised the NPfIT. It may also ask whether the government founded the SCR on an unrealistic assumption: that the centralised database could ever be a single source of truth."

There are several reasons to be ambivalent about the twin principles Single Version of Truth (SVOT) and Single Source of Truth (SSOT), and this kind of massive failure must worry even the most fervent advocates of these principles.

Don't get me wrong, I have served my time in countless projects trying to reduce the proliferation and fragmentation of data and information in large organizations, and I am well aware of the technical costs and business risks associated with data duplication. However, I have some serious concerns about the dogmatic way these principles are often interpreted and implemented, especially when this dogmatism results (as seems to be the case here) in a costly and embarrassing failure.

The first problem is that Single-Truth only works if you have absolute confidence in the quality of the data. In the SCR example, there is evidence that doctors simply don't trust the new system - and with good reason. There are errors and omissions in the summary records, and doctors prefer to double-check details of medications and allergies, rather than take the risk of relying on a single source.

The technical answer to this data quality problem is to implement rigorous data validation and cleansing routines, to make sure that the records are complete and accurate. But this would create more work for the GP practices uploading the data. Officials at the Department of Health fear that setting the standards of data quality too high would kill the scheme altogether. (And even the most rigorous quality standards would only reduce the number of errors; they could never eliminate them altogether.)

There is a fundamental conflict of interest here between the providers of data and the consumers - even though these may be the same people - and between quality and quantity. If you measure the success of the scheme in terms of the number of records uploaded, then you are obviously going to get quantity at the expense of quality.

So the pusillanimous way out is to build a database with imperfect data, and defer the quality problem until later. That's what people have always done, and will continue to do, and the poor quality data will never ever get fixed.

The second problem is that even if perfectly complete and accurate data are possible, the validation and data cleansing step generally introduces some latency into the process, especially if you are operating a post-before-processing system (particularly relevant to environments such as military and healthcare where, for some strange reason, matters of life-and-death seem to take precedence over getting the paperwork right). So there is a design trade-off between two dimensions of quality - timeliness and accuracy. See my post on Joined-Up Healthcare.

The third problem is complexity. Data cleansing generally works by comparing each record with a fixed schema, which defines the expected structure and rules (metadata) to which each record must conform, so that any information that doesn't fit into this fixed schema will be barred or adjusted. Thus the richness of information will be attenuated, and useful and meaningful information may be filtered out. (See Jon Udell's piece on Object Data and the Procrustean Bed from March 2000. See also my presentation on SOA for Data Management.)
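The Procrustean attenuation can be sketched as follows: cleansing against a fixed schema (invented here) silently discards anything that doesn't fit, such as a free-text note:

```python
# A fixed schema defining the expected fields and types. Anything
# outside it - however meaningful - will be filtered out by cleansing.
PATIENT_SCHEMA = {"patient_id": str, "medication": str, "allergy": str}

def cleanse(record: dict) -> dict:
    """Keep only schema fields that have the expected type;
    everything else is barred."""
    return {field: value for field, value in record.items()
            if field in PATIENT_SCHEMA
            and isinstance(value, PATIENT_SCHEMA[field])}

raw = {"patient_id": "P1", "medication": "warfarin",
       "allergy": "penicillin",
       "gp_free_text_note": "patient anxious about dosage changes"}
clean = cleanse(raw)  # the free-text note does not survive
```

The record that emerges is tidier and easier to process, but the GP's note about the anxious patient is exactly the kind of rich information that gets attenuated.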

The final problem is that a single source of information represents a single source of failure. If something is really important, it is better to have two independent sources of information or intelligence, as I pointed out in my piece on Information Algebra. This follows Bateson's slogan that "two descriptions are better than one". Doctors using the SCR database appear to understand this aspect of real-world information better than the database designers.

It may be a very good idea to build an information service that provides improved access to patient information, for those who need this information. But if this information service is designed and implemented according to some simplistic dogma, then it isn't going to work properly.


Update. The Health Secretary has announced that NHS regulation will be based on a single version of the truth.

"in the future the chief inspector will ensure that there is a single version of the truth about how their hospitals are performing, not just on finance and targets, but on a single assessment that fully reflects what matters to patients"

Roger Taylor, Jeremy Hunt's dangerous belief in a single 'truth' about hospitals (Guardian 26 March 2013)



Updated 28 March 2013

Thursday, June 18, 2009

Deconstructing The Grammar of Business

@JohnIMM (John Owens) trots out a familiar piece of advice about data modelling today.

"Want to know what data entities your business needs? Start with the nouns in the business function names."


Starting with the nouns is a very old procedure. I can remember sitting through courses where the first exercise was to underline the nouns in a textual description of some business process. So when I started teaching data modelling, I decided to make this procedure more interesting. I took an extract from George Orwell's essay on Hop-Picking, and got the students to underline the nouns. Then we worked out what these nouns actually signified. For example, some of them were numbers and units of measure, some of them were instances, and some of them were reifications. (I'll explain shortly what I mean by reification.) Only a minority of the nouns in this passage passed muster as data entities. Another feature of the extract was that it used a lot of relatively unfamiliar terms - few of us had experience measuring things in bushels, for example - and I was able to show how this analytical technique provided a way of getting into the unfamiliar terminology of a new business area. I included this example in my first book, Pragmatic Data Analysis, published in 1984 and long out of print.

One problem with using this procedure in a training class is that it gives a false impression of what modelling is all about. Modelling is not about translating a clear written description into a clear diagrammatic structure; in the real world you don't have George Orwell doing your observation and writing up your interview notes for you.

Now let me come on to the problem of reification. The Zachman camp has started to use this word (in my view incorrectly) as a synonym of realisation - in other words, the translation and transformation of Ideas into Reality. (They claim this notion can be traced back to the ancient Greeks, but they do not provide any references to support this claim. As far as I am aware, this is a mediaeval notion; it can for example be found in the work of the Arab philosopher ibn Arabi, who talks about entification in apparently this sense.) However, modern philosophers of language use the word "reification" to refer to the elevation of abstract ideas (such as qualities) to Thingness. One of the earliest critics of reification was Ockham, who objected to the mediaeval habit of multiplying abstract ideas and reified universals; his principle of simplicity is now known as Ockham's Razor.

In our time, Quine showed how apparently innocent concepts often contained hidden reification, and my own approach to information modelling has been strongly influenced by Quine. For example, I am wary of taking "customer" as a simple concept, and prefer to deconstruct it into a bundle of bits of intentionality and behaviour and other stuff. (See my post on Customer Orientation.) As for business concepts like "competitor" or "prospect", I generally regard these as reifications resulting from business intelligence.

Reification tends to obscure the construction processes - tempting us to fall into the fallacy of regarding the reifications as if they directly reflected some real world entities. (See my posts on Responding to Uncertainty 1 and 2.) So I like to talk about ratification as a counterbalance to reification - making the construction process explicit.

Of course, John Owens is right insofar as the grammar of the data model should match the grammar of the process model. And of course for service-oriented modelling, the grammar of the capabilities must match that of the core business services. But what is the grammar of the business itself? Merely going along with the existing nouns and verbs may leave us short of discovering the deep structural patterns. 

 

Update May 2024. The distinction I'm making here between reification and its opposite, which I've called ratification, can be compared to Simondon's distinction between ontology and ontogenesis, so I shall need to write more about that. Meanwhile, I now acknowledge the possibility that some notion of reification might be found among the Neoplatonists but that's several hundred years after Plato himself.


Related posts: Reification and Ratification (November 2003), Business Concepts and Business Types (May 2009), Business Rule Concepts (December 2009), The Topography of Enterprise Architecture (September 2011), Conceptual Modelling - Why Theory (November 2011), From AS-IS to TO-BE (October 2012), BankSpeak (May 2015), Mapping out the entire world of objects (July 2020)

Saturday, September 13, 2008

SOA Example - Total Asset Visibility

One of the potential applications of service-oriented architecture (SOA) is something called Total Asset Visibility. As we shall see, there are slightly different interpretations as to what this phrase actually means: what kind of visibility over what kind of assets; and does total refer to the assets (some visibility of all assets) or to the visibility (complete visibility of some assets)? However, SOA seems to be relevant to any of these (overlapping) meanings.

Supply Chain - Materiel

Update (20 July 2009):

The annual processes for verifying the location of certain fixed assets have revealed a significant increase in the levels of discrepancies being reported. In the case of the BOWMAN secure communications system (currently being used by Service personnel in Iraq and Afghanistan), some £155 million worth of BOWMAN assets reported in the accounts could not be fully accounted for, although the MOD estimates that a significant proportion of these are under repair. “At this time of high operational demand, it is more important than ever for the Ministry of Defence to have accurate records of where its assets are, and how much stock it has.” [National Audit Office, BBC News]

Our People Are Our Greatest Assets (Stalin)

I think what he actually said was something like "Our Cadres Are Our Greatest Wealth", but it comes to the same thing, doesn't it?

But perhaps Total Asset Visibility isn't just about materiel, but about people as well. In a post called Big Brother USA: Surveillance via "Tagging, Tracking and Locating", Laurel Federbush refers to the possibility of implanting RFID chips into American soldiers, allowing not only their location to be tracked but also their physical and mental state. Federbush refers to something called the Soldier Status Monitoring Project: this is presumably the same as Warfighter Physiological Status Monitoring, which was planned as long ago as 1997 [Army Science and Technology Master Plan (ASTMP 1997)]; current research now addresses predictive modeling.



Friday, August 01, 2008

Faithful representation

Systems people (including some SOA people and CEP people and BPM people) sometimes talk as if a system was supposed to be a faithful representation of the real world.

This mindset leads to a number of curious behaviours.

Firstly, ignoring the potential differences between the real world and its system representation, treating them as if they were one and the same thing. For example, people talking about "Customer Relationship Management" when they really mean "Management of Database Records Inaccurately and Incompletely Describing Customers". Or referring to any kind of system objects as "Business Objects". Or equating a system workflow with "The Business Process".

Secondly, asserting the primacy of some system ontology because "That's How the Real World Is Structured". For example, the real world is made up of "objects" or "processes" or "associations", therefore our system models ought to be made of the same things.

Thirdly, getting uptight about any detected differences between the real world and the system world, because there ought not to be any differences. Rigid data schemas and obsessive data cleansing, to make sure that the system always contains only a single version of the truth.

Fourthly, confusing the stability of the system world with the stability of the real world. The basic notion of "Customer" doesn't change (hum), so the basic schema of "Customer Data" shouldn't change either. (To eliminate this confusion you may need two separate information models - one of the "real world" and one of the system representation of the real world. There's an infinite regress there if you're not careful, but we won't go there right now.)

In the Complex Event world, Tim Bass and Opher Etzion have picked up on a simple situation model of complex events, in which events (including derived, composite and complex events) represent the "situation". [Correction: Tim's "simple model" differs from Opher's in some important respects. See his later post The Secret Sauce is the Situation Models, with my comment.] This is fine as a first approximation, but what neither Opher nor Tim mentions is something I regard as one of the more interesting complexities of event processing, namely that events sometimes lie, or at least fail to tell the whole truth. So our understanding of the situation is mediated through unreliable information, including unreliable events. (This is something that has troubled philosophers for centuries.)

From a system point of view, there is sometimes little to choose between unreliable information and basic uncertainty. If we are going to use complex event processing for fraud detection or anything like that, it would make sense to build a system that treated some class of incoming events with a certain amount of suspicion. You've "lost" your expensive camera have you Mr Customer? You've "found" weapons of mass destruction in Iraq have you Mr Vice-President?

One approach to unreliable input is some kind of quarantine and filtering. Dodgy events are recorded and analyzed, and then if they pass some test of plausibility and coherence they are accepted into the system. But this approach can produce some strange effects and anomalies. (This makes me think of perimeter security, as critiqued by the Jericho Forum. I guess we could call this approach "perimeter epistemology". The related phenomenon of Publication Bias refers to the distortion resulting from analysing data that pass some publication criterion while ignoring data that fail this criterion.)
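The quarantine-and-filtering approach can be sketched in a few lines of Python. Everything here is invented for illustration (the Event fields, the trust score, and the rule that a claim is released from quarantine once a second, independent source makes it); a real system would need far richer plausibility tests, and would still be subject to the perimeter-epistemology anomalies described above:

```python
from dataclasses import dataclass

@dataclass
class Event:
    source: str   # who reported it
    claim: str    # what they reported
    trust: float  # assumed reliability of the source, 0.0 to 1.0

class QuarantineFilter:
    """Hold low-trust events in quarantine until independently corroborated."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.quarantine = []  # dodgy events awaiting corroboration
        self.accepted = []    # events admitted into the system

    def submit(self, event):
        if event.trust >= self.threshold:
            self.accepted.append(event)
            return
        # A dodgy event is accepted only if a *different* source
        # has already made the same claim.
        corroborated = any(
            e.claim == event.claim and e.source != event.source
            for e in self.quarantine + self.accepted
        )
        if corroborated:
            self.accepted.append(event)
        else:
            self.quarantine.append(event)
```

So Mr Customer's "lost" camera sits in quarantine until, say, a police report repeats the claim; a high-trust sensor reading goes straight through.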

In some cases, we are going to have to unpack the simple homogeneous notion of "The Situation" into a complex situation awareness, where a situation is constructed from a pile of unreliable fragments. Tim has strong roots in the Net-Centric world, and I'm sure he could say more about this than me if he chose.

Monday, June 23, 2008

Homogeneous Business Vocabulary?

Nick Malik asks whether a common vocabulary is a blessing or curse.

Who benefits from a common vocabulary - whose agenda is it? IT architects tend to like a common vocabulary because it means they can more easily use the same systems and data stores; bureaucrats tend to like a common vocabulary because it means they can impose the same kinds of procedures and performance targets.

Let's look at the bureaucratic agenda first. In the old days, the bus company had passengers, hospitals had patients, prisons had prisoners, and universities had students. Now they are all supposed to be regarded as "customers", and driven by the same things: "choice" and "customer satisfaction". This reframing has had mixed results - perhaps a few beneficial effects in places, but also some damaging or absurd consequences.


Meanwhile, IT organizations want to deploy similar solutions across a range of business domains. Here are a few examples picked at random.


Meanwhile, IT architects are often lumpers rather than splitters, and so they like to produce information models with a relatively small number of highly generalized objects like PARTY, which mean absolutely nothing to a real business person.

So in some enterprises, and especially in the public sector, the IT architects may be aligned with the central bureaucrats against the line-of-business. Maybe sometimes there really is a good reason for the diversity of business vocabulary, not just idiot managers being obstinate.

With a stratified Service-Oriented Architecture, it becomes possible to get the best of both worlds - building some highly generic services in one layer, which support a range of different specialized and context-specific ontologies in the layer above. So it becomes possible to accommodate a broader range of requirements without imposing a common vocabulary. Of course this raises some complexity issues, which many IT architects would prefer not to have to deal with. For more on these complexity issues, see the Asymmetric Design blog.
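A minimal sketch of this stratification, with invented class and method names: one highly generic PARTY service in the lower layer, and domain services above it that keep their own vocabularies (patients, passengers) while delegating storage downwards. Neither the hospital nor the bus company ever has to say "party":

```python
class PartyService:
    """Lower layer: one highly generalized PARTY store."""

    def __init__(self):
        self._parties = {}
        self._next_id = 1

    def register(self, attributes):
        party_id = self._next_id
        self._next_id += 1
        self._parties[party_id] = dict(attributes)
        return party_id

    def get(self, party_id):
        return self._parties[party_id]

class PatientService:
    """Upper layer: the hospital's own vocabulary."""

    def __init__(self, parties):
        self._parties = parties

    def admit_patient(self, name, ward):
        return self._parties.register(
            {"name": name, "role": "patient", "ward": ward})

class PassengerService:
    """Upper layer: the bus company's own vocabulary."""

    def __init__(self, parties):
        self._parties = parties

    def board_passenger(self, name, route):
        return self._parties.register(
            {"name": name, "role": "passenger", "route": route})
```

The complexity the IT architect must now manage is the mapping between layers, rather than a single imposed vocabulary.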

Monday, March 17, 2008

Information Algebra

I get more information from two newspapers than from one - but not twice as much information. So how much more, exactly? That depends how much difference there is between the two newspapers. 

Even if two newspapers report the same general facts, they typically report different details, and they may have different sources. To the extent that there are differences in style and detail between the two newspapers, this typically reinforces my confidence in the overall story because it indicates that the journalists are not merely reusing a common source (such as a company press release). 

In the real world, we are accustomed to the fact that information and intelligence need double-checking and corroboration. And yet in the computer world, there is a widespread belief that it is always a good thing to have a single source of information - that repeated messages are not only unnecessary but wasteful. Data cleansing wipes out difference in the name of consistency and standardization, leaving the resulting information flat and attenuated. A single source of information ("single source of truth") sometimes means a single source of failure - never a good idea in an open distributed system.

Writing about this in an SOA context - when three heads are better than one - Steve Jones describes this as redundancy, and points out the potential value of redundancy to increase reliability. He quotes Lewis Carroll (as Andrew Clarke points out, it was actually the Bellman): "What I tell you three times is true."

The same quote can be found at the head of Chapter 3 of Gregory Bateson's Mind and Nature, available online as Multiple Versions of the World. This expands on Bateson's earlier slogan "Two descriptions are better than one".

Bateson himself used the word "redundancy", but it is not a simple redundancy that can be plucked out without a second thought. Thinking about the consequences of adding and subtracting redundancy is a hard problem - Paulo Rocchi calls it calculus, but I prefer to call it algebra.
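As a toy illustration of why two sources yield more than one but less than twice as much, treat each newspaper's report as a set of facts (the facts themselves are invented). The union measures the combined information; the intersection measures corroboration; a real information measure (entropy, say) would be subtler, but the inequality survives:

```python
paper_a = {"merger announced", "ceo resigns", "shares fall 5%"}
paper_b = {"merger announced", "shares fall 5%", "union objects"}

combined = paper_a | paper_b      # union: total distinct facts reported
corroborated = paper_a & paper_b  # intersection: independently confirmed facts

# Two papers give more information than one...
assert len(combined) > len(paper_a)
# ...but less than twice as much, because of the overlap.
assert len(combined) < len(paper_a) + len(paper_b)
```

The interesting algebra starts when the sources are not independent: if both papers merely reprint the same press release, the intersection tells you nothing about reliability.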

Wednesday, March 05, 2008

SOA Brings New Opportunities to Data Management (San Diego March 2008)

If you live in San Diego, or you're going to be attending the DAMA international conference later this month, please drop me a line. I shall be delivering a 3-hour SOA management briefing on Sunday 16th, and a 1-hour presentation on Tuesday 18th called SOA brings new opportunities to data management.





I look forward to meeting some of you there. If you tell me the secret password ("wombat") I'll buy you a drink, and if you can tell me the significance of this password to data management I'll buy you two drinks.

Monday, January 15, 2007

Information Sharing and Joined-Up Services 2

My colleague David Sprott has just posted a critique (Big Brother Database Dinosaur) of the latest UK Government proposals [Note 1] for putting citizen data into a large central database.

As many commentators have pointed out [Note 2], a large central database of this kind would have to be built to extremely high standards of data quality and data protection. Given the recent history of public sector IT, it is hard to be confident that such standards would be achieved or maintained. There is also the question of liability and possible compensation - for example if a citizen suffered financial or other loss as a result of incorrect data.

But in any case, as David points out from an SOA perspective, the proposal is architecturally unsound and technologically obsolescent. Robin Wilton (Sun Microsystems) comes to a similar conclusion from the perspective of federated identity.

Government ministers are busily backtracking on the "Big Brother" elements of the proposal [Note 3], but the policy paper confirms some of the details [Note 4].

David's comments refer mainly to the proposed consolidation of citizen information across various public sector agencies within the UK. But there is another information-sharing problem in the news at present - the fact that the UK criminal records database does not include tens of thousands of crimes committed by UK citizens in other countries. [Note 5]

Part of the difficulty seems to be in verifying the identity of these records. Information sharing requires some level of interoperability, and this includes minimum standards of identification. There are some serious issues here, including semantics, which can never be resolved merely by collecting large amounts of data into one place.

The problem of information sharing within one country is really no different from the problems of information sharing between countries. But at least in the latter case there is nobody saying we can solve all the problems by building a single international database. At least I hope not.

As I said on this blog in 2003 [Note 6], we need to innovate new mechanisms to manage information sharing. This is one of the opportunities and challenges for SOA in delivering joined-up services in a proper manner. Then centralization becomes irrelevant.

Note 1: BBC News January 14th 2007

Note 2: Fish & Chip Papers: Government uber-databases,

Note 3:
BBC News January 15th 2007. See also Fish & Chip Papers: Data sharing does not a Big Brother make.

Note 4: Daily Telegraph Microchips for mentally ill planned in shake-up.

Note 5: According to ACPO, some 27,500 case files were left in desk files at the Home Office instead of being properly examined and entered into the criminal records database. [BBC News]

Note 6: See my post from 2003 on Information Sharing and Joined-Up Services.

Friday, March 10, 2006

Information Use

Having just read Scribe's sarcastic post about the UK students charged with possessing information likely to be useful to a terrorist [source BBC News] ...
"how about bus timetables?"
... I was primed to misread Bruce Schneier's post about Data Mining for Terrorists. I thought this was going to be about bad people using good tools for bad purposes. It turned out it was about good people using ~~bad~~ inappropriate tools for good purposes.

But it set me thinking about the potential use of good information by bad people. Or by good people for misguided purposes. Or whatever.

In an open distributed world, just as we have problems with controlling the source of information (see my previous posts on Information Sharing and Data Provenance), we have problems with controlling the destination.

This is obviously important for security and privacy. See my notes on Information Leakage, Inference and Inference Control. But it has wider implications. What is the effect of someone reusing my information (or my services, which often comes to the same thing) in complex mashed-up ways I cannot predict or control?


(Apologies to Feedburner and Feedblitz subscribers who got yesterday's pictures twice. I think I need to change my splicing settings. This seems to be a minor example of a similar thing - interference between two useful services producing unexpected effects. Except of course this one is under my control - or would have been if I had paid attention to it.)




Sunday, October 23, 2005

Data Provenance

In my previous post on Information Sharing, I discussed some of the problems of information sharing in a loosely-coupled world, with special reference to the security services. There is further discussion of the social and political aspects of this at IntoTheMachine and Demanding Change. In this blog, I am continuing to focus on the information management aspects.

On Friday, I had a briefing about an EU research project on data provenance, led by IBM. This project is currently focused on the creation and storage of provenance data (in other words data that describe the provenance of other data - what we used to call an audit trail). The initial problem they are addressing is the fact that provenance data (audit trails and logs) are all over the place - incomplete and application-specific - and this makes it extremely hard to trace provenance across complex enterprise systems. The proposed solution is the integration of provenance data from heterogeneous sources to create one or more provenance stores. These provenance stores may then be made available for interrogation and analysis in a federated network or grid, subject to important questions of trust and security.

In art history, provenance means being able to trace a continuous history of a painting back to the original artist - for example proving the version of the Mona Lisa currently in the Louvre is the authentic work of Leonardo da Vinci. As it happens, we don't have a completely watertight provenance for the Mona Lisa, as it was stolen by an Italian artist in 1911, and remained absent from the Louvre until 1913. Most art lovers assume that the painting that was returned to the Louvre is genuine, but there is a gap in the audit trail in which an excellent forgery might possibly have been committed. [See The Day The Mona Lisa was Stolen. I learned about this story from Darien Leader's book Stealing the Mona Lisa.] Provenance may also involve an audit trail of other events in the painting's history, such as details of any restoration or repair.

In information systems, provenance can be understood as a form of instrumentation of the business process - instrumentation and context that allows the source and reliability of information to be validated and verified. (Quality wonks will know that there is a subtle distinction between validation and verification: both are potentially important for data provenance; and I may come back to this point at a later date.) Context data are used for many purposes besides provenance, and so provenance may involve a repurposing of instrumentation (data collection) that is already carried out for some other purpose, such as business activity monitoring (BAM). Interrogation of provenance is at the level of the business process, and IBM is talking about possible future standards for provenance-aware business processes.

Provenance-awareness is an important enabler for compliance with various regulations and practices, including Basel, Sarbanes-Oxley, and HIPAA. If a person or organization is to be held accountable for something, then this typically includes being accountable for the source and reliability of relevant information. Thus provenance must be seen as an aspect of governance.

Provenance is also an important issue in complex B2B environments, where organizations are collaborating under imperfect trust. From a service-oriented point of view, I think what is most interesting about data provenance is not the collection and storage of provenance data, but the interrogation and use. This means we don't just want provenance-aware business processes (supported by provenance-aware application systems) but also provenance-aware objects and services. Some objects (especially documents, but possibly also physical objects with suitable RFID encoding) may contain and deliver their own provenance data, in some untamperable form. Web services may carry some security coding that allows the consumer to trust the declared provenance of the service and its data content. There are some important questions about composition and decomposition - how do we reason architecturally about the relationship between provenance at the process/application level and provenance at the service/object level?
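One way an object might "contain and deliver its own provenance data, in some untamperable form" is a hash chain over its audit trail, where each entry's digest covers the previous one. This is only a sketch under my own assumptions (the class and field names are invented, and a real design would also need digital signatures and key management, since anyone who can alter an entry can recompute the chain):

```python
import hashlib
import json

def _digest(record, prev_hash):
    """Hash a provenance record together with the previous entry's hash."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

class ProvenancedObject:
    """A data object carrying its own tamper-evident provenance trail."""

    def __init__(self, content, source):
        self.content = content
        self.trail = []
        self._append({"action": "created", "actor": source})

    def _append(self, record):
        prev = self.trail[-1]["hash"] if self.trail else ""
        self.trail.append({**record, "hash": _digest(record, prev)})

    def record(self, action, actor):
        """Add an event (e.g. restoration, repair) to the provenance trail."""
        self._append({"action": action, "actor": actor})

    def verify(self):
        """Recompute the chain; any edited entry breaks every later hash."""
        prev = ""
        for entry in self.trail:
            record = {k: v for k, v in entry.items() if k != "hash"}
            if entry["hash"] != _digest(record, prev):
                return False
            prev = entry["hash"]
        return True
```

On this pattern, the Mona Lisa's trail would fail verification at 1911: an entry cannot quietly be rewritten without breaking everything downstream.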

We also need an analysis and design methodology for provenance. How do you determine how much provenance data will be adequate to satisfy a given compliance requirement in a given context (surely standards cannot be expected to provide a complete answer to this) and how do you compose a sufficiently provenance-aware business process either from provenance-aware services, or from normal services plus some additional provenancing services. Conversely, in the design of services to be consumed outside the enterprise, there are important analysis and design questions about the amount of provenance data you are prepared to expose to your service consumers. The EU project includes methodology as one of its deliverables, due sometime next year. In the meantime, IBM hopes that people will start to implement the provenance architecture, as described on the project website, and provide practical input.


Updated 30 April 2013

Thursday, October 20, 2005

Information Sharing

A fascinating statement (pdf) to the House of Lords today by MI5 chief Dame Manningham-Buller (profile), which is already being interpreted as a convoluted excuse for torture (news report).

Here is a brief summary of her argument.

• Asymmetric threat (Al Qaeda) calls for an orchestrated response ("enhanced international co-operation"). This in turn calls for information exchange to "maximize the flow of intelligence" to and from foreign agencies.
• This cooperation and exchange depend not just on mutual awareness (of specific requirements) but also motivation. Manningham-Buller is clearly unwilling to take any action that might reduce this motivation.
• Suppliers of information generally obscure the provenance of information, as a result of various local policies (such as the protection of sources) and practices (such as the use of liaison officers with tightly defined briefs as information channels).
• Users of information generally seek to discover as much context as possible, because this helps both with assessing its reliability, and with interpreting it properly.
• However, the availability of context is limited by the policies and practices of the information supplier, as well as an unwillingness on the part of the user to put pressure on the supplier that might damage the supplier's motivation to co-operate in future.
• Information that is obtained from particular contexts (individuals in detention) has to be interpreted with great care. "Where the Agencies are not aware of the circumstances in which the intelligence was obtained, it is likely to be more difficult to assess its reliability." (In other words, we know that torture doesn't always produce the truth.)
• "The Agencies will often not know the location or details of detention." (In other words, foreign agencies don't tell us whether detainees have been tortured, and we are too polite to ask.)
• On at least two occasions we have obtained accurate and useful information without inquiring too closely about its provenance. (Therefore this justifies our policy of not demanding provenance.)

Manningham-Buller's statement raises some important political and ethical questions about torture and our cooperation with countries where torture is tolerated. There are also some important questions about the extent to which evidence with dubious provenance can be used in legal proceedings.

However, what I want to do here is to draw out some more general points about information sharing in a loosely coupled world. The intelligence situation described by Manningham-Buller provides an example of an extremely common pattern of conflict of interest, whereby suppliers of information are seeking to attenuate the information for various reasons, and the users of information are seeking to amplify the information in various ways. I have spoken to senior Army officers in the past about the difficulties of sharing sensitive information with other forces that are supposedly on the same side. But similar questions arise in commercial collaborations as well - how much information can be shared with partners, suppliers or customers, and how can we systematically analyse "need-to-know" questions?

Organizations are struggling to collaborate effectively in situations of imperfect trust. In an open distributed world, data provenance is a key issue. We need better tools for understanding and negotiating these complex information ecosystems. Watch this space.


Friday, May 06, 2005

Repurposing Data and Services

Does it make sense to talk about reusing services, or should we talk instead about repurposing?

The word repurpose is largely being pushed from the data/metadata side, especially the XML/XSL crowd. XML is certainly relevant to technical reformatting and interoperability, but may also support data being put to new uses.
Meanwhile, some examples of data repurposing look like old-fashioned data sharing. Look at this abstract, which (when you strip away the fashionable technology such as intelligent agents) is just finding new uses for existing data.
  • Using Intelligent Agents to Repurpose Administrative Data ... (Jan 2004) (abstract)
The word also makes the bandwagon-jumping antics of some product vendors explicit. For example, following 9/11, Siebel repurposed its CRM software to deal with Homeland Security. (Government Computer News, Sept 2002) What's terrorism got to do with customer relationship management, I hear you ask. Well it has, in the same sense that an FBI agent might say "He's a tricky customer."

XML is very good for this kind of repurposing, because it operates at a level of semantic vagueness where it doesn't really matter whether "customer" means "customer" or "terrorist". To my mind this is both a strength and a weakness of XML. It seems to me that if we want to promote the repurposing of services, we need to explain how to design services that can operate with a calculated lack of semantic specificity, with weak preconditions. (But strong postconditions.)
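Here is a hypothetical sketch of what weak preconditions with strong postconditions might look like in practice (the function and field names are my own invention). The service accepts whatever name field the calling domain happens to use, whether it thinks in customers, suspects, or plain names, but always returns a result of one fixed, guaranteed shape:

```python
def lookup_party(record):
    """Weak precondition: any dict-like record with some kind of name field,
    whatever the calling domain's vocabulary (CRM, law enforcement, ...).
    Strong postcondition: always a dict with exactly the keys
    'name' and 'matched', with the name normalized."""
    name = (record.get("customer_name")
            or record.get("subject_name")
            or record.get("name"))
    if name is None:
        result = {"name": None, "matched": False}
    else:
        result = {"name": name.strip().title(), "matched": True}
    # Enforce the postcondition before returning.
    assert set(result) == {"name", "matched"}
    return result
```

The deliberate semantic vagueness lives entirely on the input side; consumers can rely absolutely on the shape of the output.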


See also Reuse or Repurpose (May 2005)

Friday, February 18, 2005

SOA and database

A traditional view of data is as a structured collection of attributes. Each attribute provides information: in Bateson's phrase, it is a difference that makes a difference. For example, let's suppose the attribute CUSTOMER CREDIT LIMIT controls the outcome of some business transaction - whether the customer is permitted to place an order, or has to pay upfront. (The difference in the attribute value causes a difference in the transaction outcome.) We are accustomed to having such attribute values stored in a massive database somewhere, and passed around in XML packets. But the attribute value is itself the result of some calculation, judgement or negotiation. In a genuinely real-time enterprise, this calculation, judgement or negotiation would be done in real-time, with zero latency. Of course, there are organizational as well as technical reasons why we don't generally do this.

So the things in the database are simply those items that it happens to be more convenient to store than (re)discover. At any point in the future, some items may shift from storage to real-time discovery. The content of the data layer should not be fixed for all time, but we should expect it to change as the costs of real-time discovery (both in performance and in design/development effort) are reduced.
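The stored-versus-rediscovered tradeoff might be sketched like this (the names and the toy credit-limit rule are invented): a resolver consults the store when a value happens to be held there, and otherwise performs the calculation in real time. Moving an attribute between the two regimes is then a configuration decision, not a schema change:

```python
class AttributeResolver:
    """Resolve an attribute either from a store or by real-time (re)computation.
    Which attributes live in the store is a changeable economic decision,
    not a fixed property of the data layer."""

    def __init__(self, store, compute):
        self.store = store      # dict: (entity, attribute) -> stored value
        self.compute = compute  # dict: attribute -> real-time discovery function

    def get(self, entity, attribute):
        if (entity, attribute) in self.store:
            return self.store[(entity, attribute)]
        return self.compute[attribute](entity)

def credit_limit(customer):
    # Stand-in for a real-time calculation, judgement or negotiation.
    return 1000 if customer.startswith("new:") else 5000
```

As the cost of real-time discovery falls, entries simply migrate out of the `store` dict, and callers of `get` never notice.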

Thus SOA erodes the traditional database data layer. This is NOT just removing data replication (as Lawrence Wilkes advocated in December 2000), but removing data storage altogether. At the present technological state-of-the-art (SOTA), this is only going to happen in isolated areas.

In his ServerSide article The Fallacy of the Data Layer, Rocky Lhotka imagines the wholesale deconstruction of the data layer, and the impact on the application, although he doesn't go as far as I am suggesting. Then when he is attacked in the subsequent discussion for his apparent SOA enthusiasm, he sidesteps and says he was being sarcastic. But while there may be considerable scepticism about the current viability of his remarks, it seems perfectly reasonable to present this as a hypothetical future scenario, if and when certain technological and organizational conditions are met.