Saturday, February 15, 2020

The Dashboard Never Lies

The lie detector (aka #polygraph) is back in the news. The name polygraph is based on the fact that the device can record and display several things at once. Like a dashboard.

In the 1920s, a young American physiologist named John Larson devised a version for detecting liars, which measured blood pressure, respiration, pulse rate and skin conductivity. Larson called his invention, which he took with him to the police, a cardiopneumo psychogram, but polygraph later became the standard term. To this day, there is no reliable evidence that polygraphs actually work, but the great British public will no doubt be reassured by official PR that makes our masters sound like the heroes of an FBI crime series.
Poole

Over a hundred years ago, G.K. Chesterton wrote a short story exposing the fallacy of relying on such a machine. Even if the measurements are accurate, they can easily be misinterpreted.

There's a disadvantage in a stick pointing straight. The other end of the stick always points the opposite way. It depends whether you get hold of the stick by the right end.
Chesterton

There are of course many ways in which the data displayed on the dashboard can be wrong - from incorrect and incomplete data to muddled or misleading calculations. But even if we discount these errors, there may be many ways in which the user of the dashboard can get the wrong end of the stick.

As I've pointed out before, along with the illusion that what the data tells you is true, there are two further illusions: that what the data tells you is important, and that what the data doesn't tell you is not important.

No machine can lie, nor can it tell the truth.
Chesterton



G.K. Chesterton, The Mistake of the Machine (1914)

Hannah Devlin, Polygraph’s revival may be about truth rather than lies (The Guardian, 21 January 2020)

Megan Garber, The Lie Detector in the Age of Alternative Facts (Atlantic, 29 March 2018)

Steven Poole, Is the word 'polygraph' hiding a bare-faced lie? (The Guardian, 23 January 2020)

Related posts: How Dashboards Work (November 2009), Big Data and Organizational Intelligence (November 2018)

Dark Data

At @imperialcollege this week to hear Professor David Hand talk about his new book on Dark Data.

Some people define dark data as unanalysed data - data you have but are not able to use - and this is the definition that can be found on Wikipedia. The earliest reference I can find to dark data in this sense is a Gartner blogpost from 2012.

In a couple of talks I gave in 2015, I used the term Dark Data in a much broader sense - to include the data you simply don't have. Both talks included the following diagram.




Here's an example of this idea. A supermarket may know that I sometimes buy beer at the weekends. This information is derived from its own transaction data, identifying me through my use of a loyalty card. But what about the weekends when I don't buy beer from that supermarket? Perhaps I am buying beer from a rival supermarket, or drinking beer at friends' houses, or having a dry weekend. If they knew this, it might help them sell me more beer in future. Or sell me something else for those dry weekends.

Obviously the supermarket doesn't have access to its competitors' transaction data. But it does know when its competitors are doing special promotions on beer. And there may be some clues about my activity from social media or other sources.

The important thing to remember is that the supermarket rarely has a complete picture of the customer's purchases, let alone what is going on elsewhere in the customer's life. So it is trying to extract useful insights from incomplete data, enriched in any way possible by big data.
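
To make this concrete, here is a rough sketch in Python (the dates and purchases are invented). The point is that the dark data only becomes visible when you enumerate the weekends that could have produced a transaction, rather than analysing only the transactions you happen to have.

```python
from datetime import date, timedelta

# Hypothetical data: the supermarket only holds transaction records for the
# weekends when I actually bought beer there.
observed_purchases = {date(2020, 1, 4), date(2020, 1, 18), date(2020, 2, 1)}

def saturdays(start, end):
    """Yield every Saturday from start to end inclusive."""
    day = start + timedelta(days=(5 - start.weekday()) % 7)
    while day <= end:
        yield day
        day += timedelta(days=7)

# Enumerating the weekends that could have produced a transaction makes the
# dark data visible: each missing weekend is an unanswered question (a rival
# store? a friend's house? a dry weekend?), not evidence of no purchase.
all_weekends = list(saturdays(date(2020, 1, 1), date(2020, 2, 15)))
dark_weekends = [d for d in all_weekends if d not in observed_purchases]

print(f"{len(observed_purchases)} observed purchases, "
      f"{len(dark_weekends)} weekends with no data at all")
```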

Professor Hand's book is about data you don't have - perhaps data you wish you had, or hoped to have, or thought you had, but nevertheless data you don't have. He argues that the missing data are at least as important as the data you do have. So this is the same sense that I was using in 2015.

Hand describes and illustrates many different manifestations of dark data, and talks about a range of statistical techniques for drawing valid conclusions from incomplete data and for overcoming potential bias. He also talks about the possible benefits of dark data - for example, hiding some attributes to improve the quality and reliability of the attributes that are exposed. A good example of this would be double-blind testing in clinical trials, which involves hiding which subjects are receiving which treatment, because revealing this information might influence and distort the results.

Can big data solve the challenges posed by dark data? In my example, we might be able to extract some useful clues from big data. But although these clues might lead to new avenues to investigate, or hypotheses that could be tested further, the clues themselves may be unreliable indicators. The important thing is to be mindful of the limits of your visible data.




David J Hand, Dark Data: Why what you don't know matters (Princeton 2020). See also his presentation at Imperial College, 10 February 2020 https://www.youtube.com/watch?v=R3IO5SDVmuk

Richard Veryard, Boundaryless Customer Engagement (Open Group, October 2015), Real-Time Personalization (Unicom, December 2015)

Andrew White, Dark Data is like that furniture you have in that Dark Cupboard (Gartner, 11 July 2012)

Wikipedia: Dark Data

Related post: Big Data and Organizational Intelligence (November 2018)  

Wednesday, January 01, 2020

Assemblage and Plottage

John Reilly of @RealTown explains the terms Assemblage and Plottage.
Assemblage is the process of joining several parcels to form a larger parcel; the resulting increase in value is called plottage.

In real estate, which appears to be Mr Reilly's primary domain of expertise, the term parcel refers to parcels of land or other property. He explains why combining parcels increases the total value.

However, Mr Reilly posted these definitions in a blog entitled The Data Advocate, in which he and his colleagues promote the use of data in the real estate business. So we might reasonably use the same terms in the data domain as well. Joining several parcels of data to form a larger parcel (assemblage) is widely recognized as a way of increasing the total value of the data.

While calculation of plottage in the real estate business can be grounded in observations of exchange value or use value, calculation of plottage in the data domain may be rather more difficult. Among other things, we may note that there is much greater diversity in the range of potential uses for a large parcel of data than for a large parcel of land, and that a large parcel of data can often be used for multiple purposes simultaneously.

Nevertheless, even in the absence of accurate monetary estimates of data plottage, the concept of data plottage could be useful for data strategy and management. We should at least be able to argue that some course of action generates greater levels of plottage than some other course of action.



By the way, although the idea that the whole is greater than the sum of its parts is commonly attributed to Aristotle, @sentantiq argues that this attribution is incorrect.



John Reilly, Assemblage vs Plottage (The Data Advocate, 10 July 2014)

Sententiae Antiquae, No, Aristotle Didn’t Write A Whole is Greater Than the Sum of Its Parts (6 July 2018)

Tuesday, December 10, 2019

Is there a Single Version of Truth about Stents?

Clinical trials are supposed to generate reliable data to support healthcare decisions and policies at several levels. Regulators use the data to control the marketing and use of medicines and healthcare products. Clinical practice guidelines are produced by healthcare organizations (from the WHO downwards) as well as professional bodies. Clinicians apply and interpret these guidelines for individual patients, as well as prescribing medicines, products and procedures, both on-label and off-label.

Given the importance of these decisions and policies for patients, there are some critical issues concerning the quality of clinical trial data, and the ability of clinicians, researchers, regulators and others to make sense of these data. Obviously there are significant commercial interests involved, and some players may be motivated to be selective about the publication of trial data. Hence the AllTrials campaign for clinical trial transparency.

But there is a more subtle issue, to do with the way the data are collected, coded and reported. The BBC has recently uncovered an example that is both fascinating and troubling. It concerns a clinical trial comparing the use of stents with heart bypass surgery. The trial was carried out in 2016, funded by a major manufacturer of stents, and published in a prestigious medical journal. According to the article, the two alternatives were equally effective in protecting against future heart attacks.

But this is where the controversy begins. Researchers disagree about the best way of measuring heart attacks, and the authors of the article used a particular definition. Other researchers prefer the so-called Universal Definition, or more precisely the Fourth Universal Definition (there having been three previous attempts). Some experts believe that if you use the Universal Definition instead of the definition used in the article, the results are much more one-sided: stents may be the right solution for many patients, but are not always as good as surgery.

Different professional bodies interpret matters differently. The European Association for Cardio-thoracic Surgery (EACTS) told the BBC that this raised serious concerns about the current guidelines based on the 2016 trial, while the European Society of Cardiology stands by these guidelines. The BBC also notes the potential conflicts of interests of researchers, many of whom had declared financial relationships with stent manufacturers.

I want to draw a more general lesson from this story, which is about the much-vaunted Single Version of Truth (SVOT). By limiting the clinical trial data to a single definition of heart attack, some of the richness and complexity of the data are lost or obscured. For some purposes at least, it would seem appropriate to make multiple versions of the truth available, so that they can be properly analysed and interpreted. SVOT not always a good thing, then.
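
To illustrate the point, here is a purely hypothetical sketch (the field names, thresholds and patients are all invented, and greatly simplified compared with any real clinical definition). If trial events are stored with their underlying measurements rather than pre-coded against a single definition, the outcome can be recomputed under competing definitions, and any disagreement becomes visible rather than hidden.

```python
# Invented, greatly simplified data and thresholds - for illustration only.
events = [
    {"patient": "A", "arm": "stent",   "troponin_ratio": 12, "symptoms": True},
    {"patient": "B", "arm": "surgery", "troponin_ratio": 40, "symptoms": False},
    {"patient": "C", "arm": "stent",   "troponin_ratio": 55, "symptoms": True},
]

def protocol_definition(e):
    # stands in for a trial's own (enzyme-only) definition
    return e["troponin_ratio"] >= 35

def universal_definition(e):
    # stands in for a looser definition combining enzymes and symptoms
    return e["troponin_ratio"] >= 10 and e["symptoms"]

# Because the raw measurements are retained, the event counts per arm
# can be recomputed under either definition.
for name, rule in [("protocol", protocol_definition), ("universal", universal_definition)]:
    counts = {}
    for e in events:
        if rule(e):
            counts[e["arm"]] = counts.get(e["arm"], 0) + 1
    print(name, counts)
```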

See my previous blogposts on Single Source of Truth.



Deborah Cohen and Ed Brown, Surgeons withdraw support for heart disease advice (BBC Newsnight, 9 December 2019). See also https://www.youtube.com/watch?v=_vGfJKMbpp8

Debabrata Mukherjee, Fourth Universal Definition of Myocardial Infarction (American College of Cardiology, 25 Aug 2018)

See also Off-Label (March 2005), Is there a Single Version of Truth about Statins? (April 2019), Ethics of Transparency and Concealment (October 2019)

Saturday, December 07, 2019

Developing Data Strategy

The concepts of net-centricity, information superiority and power to the edge emerged out of the US defence community about twenty years ago, thanks to some thought leadership from the Command and Control Research Program (CCRP). One of the routes of these ideas into the civilian world was through a company called Groove Networks, which was acquired by Microsoft in 2005 along with its founder, Ray Ozzie. The Software Engineering Institute (SEI) provided another route. And from the mid 2000s onwards, a few people were researching and writing on edge strategies, including Philip Boxer, John Hagel and myself.

Information superiority is based on the idea that the ability to collect, process, and disseminate an uninterrupted flow of information will give you operational and strategic advantage. The advantage comes not only from the quantity and quality of information at your disposal, but also from processing this information faster than your competitors and/or fast enough for your customers. TIBCO used to call this the Two-Second Advantage.

And by processing, I'm not just talking about moving terabytes around or running up large bills from your cloud provider. I'm talking about enterprise-wide human-in-the-loop organizational intelligence: sense-making (situation awareness, model-building), decision-making (evidence-based policy), rapid feedback (adaptive response and anticipation), organizational learning (knowledge and culture). For example, the OODA loop. That's my vision of a truly data-driven organization.
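
For readers who prefer something more concrete, here is a minimal and purely illustrative sketch of such a loop; the stages and the toy numbers are my own invention, and the point is the continuous cycle of sensing, sense-making, deciding and acting, not any particular implementation.

```python
from typing import Callable

def ooda_loop(observe: Callable[[], dict],
              orient: Callable[[dict], dict],
              decide: Callable[[dict], str],
              act: Callable[[str], None],
              cycles: int = 3) -> None:
    """Run a few cycles of observe -> orient -> decide -> act."""
    for _ in range(cycles):
        observations = observe()          # collect raw signals
        situation = orient(observations)  # sense-making: build a model of the situation
        action = decide(situation)        # decision-making: choose a response
        act(action)                       # adaptive response, feeding the next cycle

# Toy usage: each stage is a stub standing in for real sensing and analytics.
ooda_loop(
    observe=lambda: {"orders": 120, "complaints": 7},
    orient=lambda obs: {"complaint_rate": obs["complaints"] / obs["orders"]},
    decide=lambda sit: "investigate" if sit["complaint_rate"] > 0.05 else "carry on",
    act=lambda action: print("action:", action),
)
```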

There are four dimensions of information superiority which need to be addressed in a data strategy: reach, richness, agility and assurance. I have discussed each of these dimensions in a separate post:
  1. Data Strategy - Reach
  2. Data Strategy - Richness
  3. Data Strategy - Agility
  4. Data Strategy - Assurance


Philip Boxer, Asymmetric Leadership: Power to the Edge

Leandro DalleMule and Thomas H. Davenport, What’s Your Data Strategy? (HBR, May–June 2017) 

John Hagel III and John Seely Brown, The Agile Dance of Architectures – Reframing IT Enabled Business Opportunities (Working Paper 2003)

Vivek Ranadivé and Kevin Maney, The Two-Second Advantage: How We Succeed by Anticipating the Future--Just Enough (Crown Books 2011). Ranadivé was the founder and former CEO of TIBCO.

Richard Veryard, Building Organizational Intelligence (LeanPub 2012)

Richard Veryard, Information Superiority and Customer Centricity (Cutter Business Technology Journal, 9 March 2017) (registration required)

Wikipedia: CCRP, OODA Loop, Power to the Edge

Related posts: Microsoft and Groove (March 2005), Power to the Edge (December 2005), Two-Second Advantage (May 2010), Enterprise OODA (April 2012), Reach Richness Agility and Assurance (August 2017)

Wednesday, December 04, 2019

Data Strategy - Assurance

This is one of a series of posts looking at the four key dimensions of data and information that must be addressed in a data strategy - reach, richness, agility and assurance.



In previous posts, I looked at Reach (the range of data sources and destinations), Richness (the complexity of data) and Agility (the speed and flexibility of response to new opportunities and changing requirements). Assurance is about Trust.

In 2002, Microsoft launched its Trustworthy Computing Initiative, which covered security, privacy, reliability and business integrity. If we look specifically at data, this means two things.
  1. Trustworthy data - the data are reliable and accurate.
  2. Trustworthy data management - the processor is a reliable and responsible custodian of the data, especially in regard to privacy and security.
Let's start by looking at trustworthy data. To understand why this is important (both in general and specifically to your organization), we can look at the behaviours that emerge in its absence. One very common symptom is the proliferation of local information. If decision-makers and customer-facing staff across the organization don't trust the corporate databases to be complete, up-to-date or sufficiently detailed, they will build private spreadsheets, to give them what they hope will be a closer version of the truth.

This is of course a data assurance nightmare - the data are out of control, and it may be easier for hackers to get the data out than it is for legitimate users. And good luck handling any data subject access request!

But in most organizations, you can't eliminate this behaviour simply by telling people they mustn't. If your data strategy is to address this issue properly, you need to look at the causes of the behaviour and understand what level of reliability and accessibility you have to give people before they will be willing to rely on your version of the truth rather than theirs.

DalleMule and Davenport have distinguished two types of data strategy, which they call offensive and defensive. Offensive strategies are primarily concerned with exploiting data for competitive advantage, while defensive strategies are primarily concerned with data governance, privacy and security, and regulatory compliance.

As a rough approximation then, assurance can provide a defensive counterbalance to the offensive opportunities offered by reach, richness and agility. But it's never quite as simple as that. A defensive data quality regime might install strict data validation, to prevent incomplete or inconsistent data from reaching the database. In contrast, an offensive data quality regime might install strict labelling, with provenance data and confidence ratings, to allow incomplete records to be properly managed, enriched if possible, and appropriately used. This is the basis for the NetCentric strategy of Post Before Processing.
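
Here is a rough sketch of the contrast, with invented field names. The defensive regime rejects an incomplete record at the gate; the offensive regime accepts it but labels it with provenance and a (crude, illustrative) confidence score so it can be managed, enriched and used appropriately later.

```python
record = {"customer_id": "C123", "email": None, "source": "webform", "country": "UK"}
required = ("customer_id", "email", "country")

# Defensive regime: strict validation at the gate - incomplete or inconsistent
# records never reach the database.
def validate_or_reject(rec):
    missing = [f for f in required if not rec.get(f)]
    if missing:
        raise ValueError(f"rejected: missing {missing}")
    return rec

# Offensive regime ("post before processing"): accept the record, but label it
# with provenance and a confidence rating so it can be managed, enriched and
# used appropriately later.
def label_and_accept(rec):
    missing = [f for f in required if not rec.get(f)]
    return {
        **rec,
        "provenance": rec.get("source", "unknown"),
        "confidence": 1 - 0.25 * len(missing),  # crude illustrative score
        "missing_fields": missing,
    }

print(label_and_accept(record))
```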

Because of course there isn't a single view of data quality. If you want to process a single financial transaction, you obviously need to have a complete, correct and confirmed set of bank details. But if you want aggregated information about upcoming financial transactions, you don't want any large transactions to be omitted from the total because of a few missing attributes. And if you are trying to learn something about your customers by running a survey, it's probably not a good idea to limit yourself to those customers who had the patience and loyalty to answer all the questions.
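
A small worked example may help, with invented numbers. The same incomplete records call for different handling depending on the purpose: payment execution needs complete records, while an aggregate forecast should not silently drop a large transaction just because one attribute is missing.

```python
# Invented numbers, for illustration only.
transactions = [
    {"amount": 250_000, "bank_details": "confirmed", "due": "2020-03-01"},
    {"amount": 900_000, "bank_details": None,        "due": "2020-03-02"},
    {"amount": 12_000,  "bank_details": "confirmed", "due": None},
]

# To execute an individual payment you need a complete, confirmed record...
payable_now = [t for t in transactions if t["bank_details"] and t["due"]]

# ...but an aggregate forecast should not silently omit the 900,000 transaction
# just because one attribute is missing.
total_of_complete_records = sum(t["amount"] for t in payable_now)  # understates exposure
total_of_known_amounts = sum(t["amount"] for t in transactions)    # counts every known amount

print(f"complete records only: {total_of_complete_records:,}")
print(f"all known amounts:     {total_of_known_amounts:,}")
```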

Besides data quality, your data strategy will need to have a convincing story about privacy and security. This may include certification (e.g. ISO 27001) as well as regulation (GDPR etc.) You will need to have proper processes in place for identifying risks, and ensuring that relevant data projects follow privacy-by-design and security-by-design principles. You may also need to look at the commercial and contractual relationships governing data sharing with other organizations.

All of this should add up to establishing trust in your data management - reassuring data subjects, business partners, regulators and other stakeholders that the data are in safe hands. And hopefully this means they will be happy for you to take your offensive data strategy up to the next level.

Next post: Developing Data Strategy



Leandro DalleMule and Thomas H. Davenport, What’s Your Data Strategy? (HBR, May–June 2017)

Richard Veryard, Microsoft's Trustworthy Computing (CBDI Journal, March 2003)

Wikipedia: Trustworthy Computing

Tuesday, December 03, 2019

Data Strategy - Agility

This is one of a series of posts looking at the four key dimensions of data and information that must be addressed in a data strategy - reach, richness, agility and assurance.



In previous posts, I looked at Reach, which is about the range of data sources and destinations, and Richness, which is about the complexity of data. Now let me turn to Agility - the speed and flexibility of response to new opportunities and changing requirements.

Not surprisingly, lots of people are talking about data agility, including some who want to persuade you that their products and technologies will help you to achieve it. Here are a few of them.
Data agility is when your data can move at the speed of your business. For companies to achieve true data agility, they need to be able to access the data they need, when and where they need it. Pinckney
Collecting first-party data across the customer lifecycle at speed and scale. Jones
Keep up with an explosion of data. ... For many enterprises, their ability to collect data has surpassed their ability to organize it quickly enough for analysis and action. Scott
How quickly and efficiently you can turn data into accurate insights. Tuchen
But before we look at technological solutions for data agility, we need to understand the requirements. The first thing is to empower, enable and encourage people and teams to operate at a good tempo when working with data and intelligence, with fast feedback and learning loops.

Under a trimodal approach, for example, pioneers are expected to operate at a faster tempo, setting up quick experiments, so they should not be put under the same kind of governance as settlers and town planners. Data scientists often operate in pioneer mode, experimenting with algorithms that might turn out to help the business, but often don't. Obviously that doesn't mean zero governance, but appropriate governance. People need to understand what kinds of risk-taking are accepted or even encouraged, and what should be avoided. In some organizations, this will mean a shift in culture.

Beyond trimodal, there is a push towards self-service ("citizen") data and intelligence. This means encouraging and enabling active participation from people who are not doing this on a full-time basis, and may have lower levels of specialist knowledge and skill.

Besides knowledge and skills, there are other important enablers that people need to work with data. They need to be able to navigate and interpret, and this calls for meaningful metadata, such as data dictionaries and catalogues. They also need proper tools and platforms. Above all, they need an awareness of what is possible, and how it might be useful.
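
As a sketch of what meaningful metadata might look like in practice, here is a minimal, invented catalogue entry. Real data catalogues carry much more than this, but even this much helps a self-service user decide whether a dataset is fit for their purpose.

```python
# All names and values invented.
catalogue_entry = {
    "dataset": "customer_orders",
    "description": "One row per order line, sourced from the order management system",
    "owner": "sales-operations",
    "refreshed": "daily at 06:00",
    "fields": {
        "order_id":   {"type": "string",  "meaning": "unique order reference"},
        "order_date": {"type": "date",    "meaning": "date the order was placed"},
        "net_value":  {"type": "decimal", "meaning": "order value excluding tax, in GBP"},
    },
    "known_issues": ["net_value unreliable before the 2018 migration"],
}

# Even a simple lookup like this helps a self-service analyst interpret a field
# before using it.
print(catalogue_entry["fields"]["net_value"]["meaning"])
```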

Meanwhile, enabling people to work quickly and effectively with data is not just about giving them relevant information, along with decent tools and training. It's also about removing the obstacles.

Obstacles? What obstacles?

In most large organizations, there is some degree of duplication and fragmentation of data across enterprise systems. There are many reasons why this happens, and the effects may be felt in various areas of the business, degrading the performance and efficiency of various business functions, as well as compromising the quality and consistency of management information. System interoperability may be inadequate, resulting in complicated workflows and error-prone operations.

But perhaps the most important effect is to inhibit innovation. Any new IT initiative will need either to plug into the available data stores or to create new ones. If this is to be done without adding further to the technical debt, the data engineering (including integration and migration) can often be more laborious than building the new functionality the business wants.

Depending on whom you talk to, this challenge can be framed in various ways - data engineering, data integration and integrity, data quality, master data management. The MDM vendors will suggest one approach, the iPaaS vendors will suggest another approach, and so on. Before you get lured along a particular path, it might be as well to understand what your requirements actually are, and how these fit into your overall data strategy.

And of course your data strategy needs to allow for future growth and discovery. It's no good implementing a single source of truth or a universal API to meet your current view of CUSTOMER or PRODUCT, unless this solution is capable of evolving as your data requirements evolve, with ever-increasing reach and richness. As I've often discussed on this blog before, one approach to building in flexibility is to use appropriate architectural patterns, such as loose coupling and layering, which should give you some level of protection against future variation and changing requirements, and such patterns should probably feature somewhere in your data strategy.
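
To show the kind of thing I mean by loose coupling and layering, here is a small illustrative sketch (all names invented): consumers depend on a stable, business-facing interface rather than on today's physical customer schema, so the underlying store can evolve without breaking every caller.

```python
from typing import Protocol

class CustomerView(Protocol):
    """Stable, business-facing interface that consumers depend on."""
    def contact_email(self, customer_id: str) -> str: ...

class LegacyCrmAdapter:
    """Maps the stable interface onto today's physical CRM columns."""
    def __init__(self, crm_rows: dict):
        self._rows = crm_rows

    def contact_email(self, customer_id: str) -> str:
        return self._rows[customer_id]["EMAIL_ADDR"]  # today's column name

def send_offer(customers: CustomerView, customer_id: str) -> None:
    # This caller survives a CRM migration unchanged; only the adapter changes.
    print("sending offer to", customers.contact_email(customer_id))

send_offer(LegacyCrmAdapter({"C123": {"EMAIL_ADDR": "alice@example.com"}}), "C123")
```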

Next post: Assurance


Richard Jones, Agility and Data: The Heart of a Digital Experience Strategy (WayIn, 22 November 2018)

Tom Pinckney, What's Data Agility Anyway (Braze Magazine, 25 March 2019)

Jim Scott, Why Data Agility is a Key Driver of Big Data Technology Development (24 March 2015)

Mike Tuchen, Do You Have the Data Agility Your Business Needs? (Talend, 14 June 2017)

Related posts: Enterprise OODA (April 2012), Beyond Trimodal: Citizens and Tourists (November 2019)