Friday, March 27, 2020

Data Strategy - More on Agility

Continuing my exploration of the four dimensions of Data Strategy. In this post, I bring together some earlier themes, including Pace Layering and Trimodal.

The first point to emphasize is that there are many elements to your overall data strategy, and these don't all work at the same tempo. Data-driven design methodologies such as Information Engineering (especially the James Martin version) were based on the premise that the data model was more permanent than the process model, but it turns out that this is only true for certain categories of data.

So one of the critical requirements for your data strategy is to manage both the slow-moving stable elements and the fast-moving agile elements. This calls for a layered approach, where each layer has a different rate of change, known as pace-layering.

The concept of pace-layering was introduced by Stewart Brand. In 1994, he wrote a brilliant and controversial book about architecture, How Buildings Learn, which among other things contained a theory about evolutionary change in complex systems based on earlier work by the architect Frank Duffy. Although Brand originally referred to the theory as Shearing Layers, by the time of his 1999 book, The Clock of the Long Now, he had switched to calling it Pace Layering. If there is a difference between the two, Shearing Layers is primarily a descriptive theory about how change happens in complex systems, while Pace Layering is primarily an architectural principle for the design of resilient systems-of-systems.

In 2006, I was working as a software industry analyst, specializing in Service-Oriented Architecture (SOA). Microsoft invited me to Las Vegas to participate in a workshop with other industry analysts, where (among other things) I drew the following layered picture.

SPARK Workshop Day 2

Here's how I now draw the same picture for data strategy. It also includes a rough mapping to the Trimodal approach.

Giles Slinger and Rupert Morrison, Will Organization Design Be Affected By Big Data? (J Org Design Vol 3 No 3, 2014)

Wikipedia: Information Engineering, Shearing Layers 

Related Posts: Layering Principles (March 2005), SPARK 2 - Innovation or Trust (March 2006), Beyond Bimodal (May 2016), Data Strategy - Agility (December 2019)

Wednesday, March 04, 2020

Economic Value of Data

How far can general principles of asset management be applied to data? In this post, I'm going to look at some of the challenges of putting monetary or non-monetary value on your data assets.

Why might we want to do this? There are several reasons why people might be interested in the value of data.
  • Establishing internal or external benchmarks
  • Setting measurable targets and tracking progress
  • Identifying underutilized assets
  • Prioritizing and allocating resources
  • Modelling threats and assessing risks (especially in relation to confidentiality, privacy, security)
Non-monetary benchmarks may be good enough if all we want to do is compare values - for example, this parcel of data is worth a lot more than that parcel, this process/practice is more efficient/effective than that one, this initiative/transformation has added significant value, and so on.

But for some purposes, it is better to express the value in financial terms. Especially for the following:
  • Cost-benefit analysis – e.g. calculate return on investment
  • Asset valuation – estimate the (intangible) value of the data inventory – e.g. relevant for flotation or acquisition
  • Exchange value – calculate pricing and profitability for traded data items

There are (at least) five entirely different ways to put a monetary value on any asset.
  • Historical Cost – the total cost of the labour and other resources required to produce and maintain an item.
  • Replacement Cost – the total cost of the labour and other resources that would be required to replace an item.
  • Liability Cost – the potential damages or penalties if the item is lost or misused. (This may include regulatory action, reputational damage, or commercial advantage to your competitors, and may bear no relation to any other measure of value.)
  • Utility Value – the economic benefits that may be received by an actor from using or consuming the item.
  • Market Value – the exchange price of an item at a given point in time: the amount that must be paid to purchase the item, or the amount that could be obtained by selling the item.

But there are some real difficulties in doing any of this for data. None of these difficulties are unique to data, but I can't think of any other asset class that has all of these difficulties multiplied together to the same extent.

  • Data is an intangible asset. There are established ways of valuing intangible assets, but these are always somewhat more complicated than valuing tangible assets.
  • Data is often produced as a side-effect of some other activity. So the cost of its production may already be accounted for elsewhere, or is a very small fraction of a much larger cost.
  • Data is a reusable asset. You may be able to get repeated (although possibly diminishing) benefit from the same data.
  • Data is an infinitely reproducible asset. You can sell or share the same data many times, while continuing to use it yourself. 
  • Some data loses its value very quickly. If I’m walking past a restaurant, this information has value to the restaurant. Ten minutes later I'm five blocks away, and the information is useless. And even before this point, suppose there are three restaurants and they all have access to the information that I am hungry and nearby. As soon as one of these restaurants manages to convert this information, its value to the remaining restaurants becomes zero or even negative. 
  • Data combines in a non-linear fashion. Value (X+Y) is not always equal to Value (X) + Value (Y). Even within more tangible asset classes, we can find the concepts of Assemblage and Plottage. For data, one version of this non-linearity is the phenomenon of information energy described by Michael Saylor of MicroStrategy. And for statisticians, there is also Simpson’s Paradox.

The production costs of data can be estimated in various ways. One approach is to divide up the total ICT expenditure, estimating roughly what proportion of the whole to allocate to this or that parcel of data. This generally only works for fairly large parcels - for example, this percentage to customer transactions, that percentage to transport and logistics, and so on. Another approach is to work out the marginal or incremental cost: this is commonly preferred when considering new data systems, or decommissioning old ones. We can also compare the effort consumed in different data domains, or count the number of transformation steps from raw data to actionable intelligence.
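As a rough sketch of the first, top-down approach, the allocation might look like the following. All figures and percentage shares here are invented for illustration.

```python
# Top-down allocation of total ICT spend to large parcels of data.
# All figures and shares are illustrative assumptions.

total_ict_spend = 10_000_000  # annual ICT expenditure

# Rough proportion of the whole attributed to each parcel
allocation = {
    "customer transactions": 0.30,
    "transport and logistics": 0.20,
    "everything else": 0.50,
}

parcel_costs = {parcel: total_ict_spend * share
                for parcel, share in allocation.items()}

for parcel, cost in parcel_costs.items():
    print(f"{parcel}: {cost:,.0f}")
```

The marginal-cost approach would instead compare total spend with and without the parcel in question, which is usually only practical for new or decommissioned systems.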

As for the value of the data, there are again many different approaches. Ideally, we should look at the use-value or performance value of the data - what contribution does it make to a specific decision or process, or what aggregate contribution does it make to a given set of decisions and processes. 
  • This can be based on subjective assessments of relevance and usefulness, perhaps weighted by the importance of the decisions or processes where the data are used. See Bill Schmarzo's blogpost for a worked example.
  • Or it may be based on objective comparisons of results with and without the data in question - making a measurable difference to some key performance indicator (KPI). In some cases, the KPI may be directly translated into a financial value. 
However, comparing performance fairly and objectively may only be possible for organizations that are already at a reasonable level of data management maturity.
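A minimal sketch of the subjective scoring approach might look like this. The decision names, importance weights and relevance scores are all invented; see Schmarzo's post for a fuller worked example.

```python
# Use-value of a data parcel as a weighted sum of its contribution
# to the decisions that consume it. Names, weights and scores are
# illustrative assumptions, not a standard formula.

decisions = [
    # (decision, importance weight, relevance of this data 0..1)
    ("campaign targeting",   0.5, 0.8),
    ("credit scoring",       0.3, 0.4),
    ("stock replenishment",  0.2, 0.1),
]

use_value_score = sum(weight * relevance
                      for _, weight, relevance in decisions)
print(round(use_value_score, 2))  # dimensionless score, not money
```

The objective variant would replace the relevance scores with measured KPI differences between with-data and without-data performance, which is where the maturity requirement bites.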

In the absence of this kind of metric, we can look instead at the intrinsic value of the data, independently of its potential or actual use. This could be based on a weighted formula involving such quality characteristics as accuracy, alignment, completeness, enrichment, reliability, shelf-life, timeliness, uniqueness, usability. (Gartner has published a formula that uses a subset of these factors.)

Arguably there should be a depreciation element to this calculation. Last year's data is not worth as much as this year's data, and the accuracy of last year's data may not be so critical, but the data is still worth something.
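Pulling the quality-weighted formula and the depreciation element together into a sketch: the characteristics, weights and decay rate below are arbitrary illustrations, not Gartner's published formula.

```python
import math

# Intrinsic value as a weighted average of quality characteristics,
# discounted by the age of the data. Weights and decay rate are
# illustrative assumptions.

weights = {"accuracy": 0.3, "completeness": 0.25,
           "timeliness": 0.25, "uniqueness": 0.2}

def intrinsic_value(scores, age_years=0.0, decay_rate=0.5):
    """scores maps quality characteristic -> 0..1; older data is discounted."""
    quality = sum(weights[k] * scores.get(k, 0.0) for k in weights)
    return quality * math.exp(-decay_rate * age_years)

scores = {"accuracy": 0.9, "completeness": 0.8,
          "timeliness": 1.0, "uniqueness": 0.6}
this_year = intrinsic_value(scores, age_years=0)
last_year = intrinsic_value(scores, age_years=1)
# Last year's data is still worth something, but less.
```

An exponential decay is just one choice here; a slower or stepped depreciation schedule might better suit data whose accuracy matters less as it ages.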

An intrinsic measure of this kind could be used to evaluate parcels of data at different points in the data-to-information process. For example, showing the increase in enrichment and usability from stage 1 to stage 2, and from stage 2 to stage 3, and therefore giving a measure of the added value produced by the data engineering team that does this for us.
    1. Source systems
    2. Data Lake – cleansed, consolidated, enriched and accessible to people with SQL skills
    3. Data Visualization Tool – accessible to people without SQL skills
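Under this kind of intrinsic scoring, the added value of each engineering step is just the difference between successive stages. The scores below are invented for illustration.

```python
# Intrinsic-value score of the same parcel of data at each stage of
# the data-to-information process. Scores are illustrative assumptions.

stage_scores = {
    "1. source systems": 0.40,
    "2. data lake": 0.65,            # cleansed, consolidated, enriched
    "3. visualization tool": 0.80,   # accessible without SQL skills
}

scores = list(stage_scores.values())
added_value = [round(b - a, 2) for a, b in zip(scores, scores[1:])]
print(added_value)  # value added by each step in the pipeline
```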

If any of my readers know of any useful formulas or methods for valuing data that I haven't mentioned here, please drop a link in the comments.

Heather Pemberton Levy, Why and How to Value Your Information as an Asset (Gartner, 3 September 2015)

Bill Schmarzo, Determining the Economic Value of Data (Dell, 14 June 2016)

Wikipedia: Simpson's Paradox, Value of Information

Related posts: Information Algebra (March 2008), Does Big Data Release Information Energy? (April 2014), Assemblage and Plottage (January 2020)

Sunday, February 23, 2020

An Alternative to the DIKW Pyramid

My 2012 post on the Co-Production of Data and Knowledge offered a critique of the #DIKW pyramid. When challenged recently to propose an alternative schema, I drew something quickly on the wall, against a past-present-future timeline. Here is a cleaned-up version.

Data is always given from the past – even if only a fraction of a second into the past.

We use our (accumulated) knowledge (or memory) to convert data into information – telling us what is going on right now. Without prior knowledge, we would be unable to do this. As Dave Snowden puts it, knowledge is the means by which we create information out of data. 

We then use this information to make various kinds of judgement about the future. In his book The Art of Judgment, Vickers identifies three types: we predict what will happen if we do nothing, we work out how to achieve what we want to happen, and we put these into an ethical frame.
Intelligence is about the smooth flow towards judgement, as well as effective feedback and learning back into the creation of new knowledge, or the revision/reinforcement of old knowledge.

And finally, wisdom is about maintaining a good balance between all of these elements - respecting data and knowledge without being trapped by them.

What the schema above doesn't show are the feedback and learning loops. Dave Snowden invokes the OODA loop, but a more elaborate schema would include many nested loops - double-loop learning and so on - which would make the diagram a lot more complex.
And although the schema roughly indicates the relationship between the various concepts, what it doesn't show is the fuzzy boundary between them. I'm really not interested in discussing the exact criteria by which the content of a document can be classified as data or information or knowledge or whatever.

Dave Snowden, Sense-making and Path-finding (March 2007)

Geoffrey Vickers, The Art of Judgment: A Study of Policy-Making (1965)

Related posts: Wisdom of the Tomato (March 2011), Co-Production of Data and Knowledge (November 2012)

Saturday, February 15, 2020

The Dashboard Never Lies

The lie detector (aka #polygraph) is back in the news. The name polygraph is based on the fact that the device can record and display several things at once. Like a dashboard.

In the 1920s, a young American physiologist named John Larson devised a version for detecting liars, which measured blood pressure, respiration, pulse rate and skin conductivity. Larson called his invention, which he took with him to the police, a cardiopneumo psychogram, but polygraph later became the standard term. To this day, there is no reliable evidence that polygraphs actually work, but the great British public will no doubt be reassured by official PR that makes our masters sound like the heroes of an FBI crime series.

Over a hundred years ago, G.K. Chesterton wrote a short story exposing the fallacy of relying on such a machine. Even if the measurements are accurate, they can easily be misinterpreted.

There's a disadvantage in a stick pointing straight. The other end of the stick always points the opposite way. It depends whether you get hold of the stick by the right end.

There are of course many ways in which the data displayed on the dashboard can be wrong - from incorrect and incomplete data to muddled or misleading calculations. But even if we discount these errors, there may be many ways in which the user of the dashboard can get the wrong end of the stick.

As I've pointed out before, along with the illusion that what the data tells you is true, there are two further illusions: that what the data tells you is important, and that what the data doesn't tell you is not important.

No machine can lie, nor can it tell the truth.

G.K. Chesterton, The Mistake of the Machine (1914)

Hannah Devlin, Polygraph’s revival may be about truth rather than lies (The Guardian, 21 January 2020)

Megan Garber, The Lie Detector in the Age of Alternative Facts (Atlantic, 29 March 2018)

Steven Poole, Is the word 'polygraph' hiding a bare-faced lie? (The Guardian, 23 January 2020)

Related posts: Memory and the Law (June 2008), How Dashboards Work (November 2009), Big Data and Organizational Intelligence (November 2018)

Dark Data

At @imperialcollege this week to hear Professor David Hand talk about his new book on Dark Data.

Some people define dark data as unanalysed data, data you have but are not able to use, and this is the definition that can be found on Wikipedia. The earliest reference I can find to dark data in this sense is a Gartner blogpost from 2012.

In a couple of talks I gave in 2015, I used the term Dark Data in a much broader sense - to include the data you simply don't have. Both talks included the following diagram.

Here's an example of this idea. A supermarket may know that I sometimes buy beer at the weekends. This information is derived from its own transaction data, identifying me through my use of a loyalty card. But what about the weekends when I don't buy beer from that supermarket? Perhaps I am buying beer from a rival supermarket, or drinking beer at friends' houses, or having a dry weekend. If they knew this, it might help them sell me more beer in future. Or sell me something else for those dry weekends.

Obviously the supermarket doesn't have access to its competitors' transaction data. But it does know when its competitors are doing special promotions on beer. And there may be some clues about my activity from social media or other sources.

The important thing to remember is that the supermarket rarely has a complete picture of the customer's purchases, let alone what is going on elsewhere in the customer's life. So it is trying to extract useful insights from incomplete data, enriched in any way possible by big data.
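The supermarket's side of this can be sketched as a simple gap analysis over its own transaction data: the weekends with no beer purchase are the dark data worth investigating. All the weeks and records below are invented for illustration.

```python
# Find the "dark" weekends: weeks in which a loyalty-card customer
# bought no beer from us. All data here is invented for illustration.

weekends = ["W01", "W02", "W03", "W04", "W05"]
beer_purchases = {"W01", "W03"}          # our own transaction data
competitor_promotions = {"W02", "W05"}   # an external signal we do have

dark_weekends = [w for w in weekends if w not in beer_purchases]

# For each gap, one possible (unverifiable) explanation:
explained = {w: ("competitor promotion?" if w in competitor_promotions
                 else "unknown - drinking elsewhere, or a dry weekend")
             for w in dark_weekends}
print(explained)
```

The point of the sketch is that the gaps themselves are visible in the supermarket's own data; what fills them is not, and external clues can only suggest hypotheses, not answers.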

Professor Hand's book is about data you don't have - perhaps data you wish you had, or hoped to have, or thought you had, but nevertheless data you don't have. He argues that the missing data are at least as important as the data you do have. So this is the same sense that I was using in 2015.

Hand describes and illustrates many different manifestations of dark data, and talks about a range of statistical techniques for drawing valid conclusions from incomplete data and for overcoming potential bias. He also talks about the possible benefits of dark data - for example, hiding some attributes to improve the quality and reliability of the attributes that are exposed. A good example of this would be double-blind testing in clinical trials, which involves hiding which subjects are receiving which treatment, because revealing this information might influence and distort the results.

Can big data solve the challenges posed by dark data? In my example, we might be able to extract some useful clues from big data. But although these clues might lead to new avenues to investigate, or hypotheses that could be tested further, the clues themselves may be unreliable indicators. The important thing is to be mindful of the limits of your visible data.

David J Hand, Dark Data: Why what you don't know matters (Princeton 2020). See also his presentation at Imperial College, 10 February 2020

Richard Veryard, Boundaryless Customer Engagement (Open Group, October 2015), Real-Time Personalization (Unicom December 2015)

Andrew White, Dark Data is like that furniture you have in that Dark Cupboard (Gartner, 11 July 2012)

Wikipedia: Dark Data

Related post: Big Data and Organizational Intelligence (November 2018)  

Wednesday, January 01, 2020

Assemblage and Plottage

John Reilly of @RealTown explains the terms Assemblage and Plottage.
Assemblage is the process of joining several parcels to form a larger parcel; the resulting increase in value is called plottage.

If we apply these definitions to real estate, which appears to be Mr Reilly's primary domain of expertise, the term parcel refers to parcels of land or other property. He explains why combining parcels increases the total value.

However, Mr Reilly posted these definitions in a blog entitled The Data Advocate, in which he and his colleagues promote the use of data in the real estate business. So we might reasonably use the same terms in the data domain as well. Joining several parcels of data to form a larger parcel (assemblage) is widely recognized as a way of increasing the total value of the data.

While calculation of plottage in the real estate business can be grounded in observations of exchange value or use value, calculation of plottage in the data domain may be rather more difficult. Among other things, we may note that there is much greater diversity in the range of potential uses for a large parcel of data than for a large parcel of land, and that a large parcel of data can often be used for multiple purposes simultaneously.

Nevertheless, even in the absence of accurate monetary estimates of data plottage, the concept of data plottage could be useful for data strategy and management. We should at least be able to argue that some course of action generates greater levels of plottage than some other course of action.
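Even without accurate monetary estimates, the structure of the comparison is simple. The valuation function below is a placeholder assumption, standing in for whichever valuation approach you prefer.

```python
# Plottage: the increase in value from assembling parcels of data.
# value() is a placeholder; in practice it could be any of the
# valuation approaches discussed under Economic Value of Data.

def value(parcels: frozenset) -> float:
    # Invented super-additive valuation for illustration:
    # each parcel is worth 1 alone, plus a bonus for each pair joined.
    n = len(parcels)
    return n + 0.5 * n * (n - 1)

parts = [frozenset({"sales"}), frozenset({"weather"}), frozenset({"footfall"})]
assembled = frozenset().union(*parts)

plottage = value(assembled) - sum(value(p) for p in parts)
print(plottage)  # positive: the whole is worth more than the sum of the parts
```

Comparing the plottage of two candidate assemblages, rather than estimating either in absolute money terms, is the kind of argument suggested above.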

By the way, although the idea that the whole is greater than the sum of its parts is commonly attributed to Aristotle, @sentantiq argues that this attribution is incorrect.

John Reilly, Assemblage vs Plottage (The Data Advocate, 10 July 2014)

Sententiae Antiquae, No, Aristotle Didn’t Write A Whole is Greater Than the Sum of Its Parts (6 July 2018)

Related post: Economic Value of Data (March 2020)

Tuesday, December 10, 2019

Is there a Single Version of Truth about Stents?

Clinical trials are supposed to generate reliable data to support healthcare decisions and policies at several levels. Regulators use the data to control the marketing and use of medicines and healthcare products. Clinical practice guidelines are produced by healthcare organizations (from the WHO downwards) as well as professional bodies. Clinicians apply and interpret these guidelines for individual patients, as well as prescribing medicines, products and procedures, both on-label and off-label.

Given the importance of these decisions and policies for patients, there are some critical issues concerning the quality of clinical trial data, and the ability of clinicians, researchers, regulators and others to make sense of these data. Obviously there are significant commercial interests involved, and some players may be motivated to be selective about the publication of trial data. Hence the AllTrials campaign for clinical trial transparency.

But there is a more subtle issue, to do with the way the data are collected, coded and reported. The BBC has recently uncovered an example that is both fascinating and troubling. It concerns a clinical trial comparing the use of stents with heart bypass surgery. The trial was carried out in 2016, funded by a major manufacturer of stents, and published in a prestigious medical journal. According to the article, the two alternatives were equally effective in protecting against future heart attacks.

But this is where the controversy begins. Researchers disagree about the best way of measuring heart attacks, and the authors of the article used a particular definition. Other researchers prefer the so-called Universal Definition, or more precisely the Fourth Universal Definition (there having been three previous attempts). Some experts believe that if you use the Universal Definition instead of the definition used in the article, the results are much more one-sided: stents may be the right solution for many patients, but are not always as good as surgery.

Different professional bodies interpret matters differently. The European Association for Cardio-thoracic Surgery (EACTS) told the BBC that this raised serious concerns about the current guidelines based on the 2016 trial, while the European Society of Cardiology stands by these guidelines. The BBC also notes the potential conflicts of interests of researchers, many of whom had declared financial relationships with stent manufacturers.

I want to draw a more general lesson from this story, which is about the much-vaunted Single Version of Truth (SVOT). By limiting the clinical trial data to a single definition of heart attack, some of the richness and complexity of the data are lost or obscured. For some purposes at least, it would seem appropriate to make multiple versions of the truth available, so that they can be properly analysed and interpreted. SVOT not always a good thing, then.

See my previous blogposts on Single Source of Truth.

Deborah Cohen and Ed Brown, Surgeons withdraw support for heart disease advice (BBC Newsnight, 9 December 2019)

Debabrata Mukherjee, Fourth Universal Definition of Myocardial Infarction (American College of Cardiology, 25 Aug 2018)

Related posts: Off-Label (March 2005), Is there a Single Version of Truth about Statins? (April 2019), Ethics of Transparency and Concealment (October 2019)