Friday, April 09, 2021

Near Miss

A serious aviation incident in the news today. A plane took off from Birmingham last year with insufficient fuel, because the weight of the passengers was incorrectly estimated. This is being described as an IT error.

As Cathy O'Neil's maxim reminds us, algorithms are opinions embedded in code. The opinion in this case was the assumption that the prefix Miss referred to a female child. According to the official report, published this week, this is how the prefix is used in the country where the system was programmed.

On this particular flight, 38 adult women were classified as Miss, so the algorithm estimated the weight of each as 35 kg instead of 69 kg.
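The scale of the error follows directly from those figures. Here is a back-of-envelope sketch in Python, using only the numbers quoted in this post. The names are mine, and the total in the official report may differ because it accounts for other details of the load calculation.

```python
# Figures quoted in the post; all names are illustrative, not from the airline's system.
MISCLASSIFIED_PASSENGERS = 38   # adult women recorded as "Miss"
APPLIED_WEIGHT_KG = 35          # child weight the algorithm applied
CORRECT_WEIGHT_KG = 69          # adult female weight that should have applied


def underestimate_kg(count: int, applied_kg: int, correct_kg: int) -> int:
    """Total weight missing from the estimate for `count` misclassified passengers."""
    return count * (correct_kg - applied_kg)


shortfall = underestimate_kg(
    MISCLASSIFIED_PASSENGERS, APPLIED_WEIGHT_KG, CORRECT_WEIGHT_KG
)
print(shortfall)  # 1292
```

On these figures the estimate is out by 34 kg per passenger, or well over a tonne in total.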

The calculation error was apparently compounded by several human factors.

  • A smaller discrepancy had been spotted and corrected on a previous flight. 
  • The pilot noticed that there seemed to be an unusually high number of children on the flight, but took no action because the pandemic had disrupted normal expectations of passenger numbers.
  • The software was being upgraded, but the status of the fix at the time of the flight was unclear. There were other system-wide changes being implemented at the same time, which may have complicated the fix.
  • Guidance to ground staff to double-check the classification of female passengers was not properly communicated and followed, possibly due to weekend shift patterns.

As Dan Nguyen points out, there have been previous incidents resulting from incorrect assumptions about passenger weight. But I think we need to distinguish between factual errors (what is the average weight of an adult passenger) and classification errors (what exactly does the Miss prefix signify).

There is an important lesson for data management here. You may have a business glossary or data dictionary that defines an attribute called Prefix and provides a list of permitted values. But if different people (different parts of your organization, different external parties) understand and use these values to mean different things, there is still scope for semantic confusion unless you make the meanings explicit.
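To illustrate what making the meanings explicit might look like, here is a minimal sketch in Python. Everything in it is hypothetical: the glossary entries, the field names, and the age threshold are my assumptions for illustration, not taken from the report or from any real system. The point is that each permitted value states what it does and does not imply, so that a consumer cannot silently infer age from a prefix that says nothing about age.

```python
from dataclasses import dataclass
from typing import Optional

CHILD_AGE_LIMIT = 12  # assumed threshold, for illustration only


@dataclass(frozen=True)
class PrefixDefinition:
    value: str
    meaning: str               # the agreed, explicit meaning
    sex: str                   # "female" or "male"
    age_group: Optional[str]   # "adult", "child", or None if the prefix implies nothing about age


# Hypothetical glossary entries; a real one would be agreed with every party using the data.
GLOSSARY = {
    "Mr": PrefixDefinition("Mr", "adult male", "male", "adult"),
    "Mrs": PrefixDefinition("Mrs", "married adult female", "female", "adult"),
    "Miss": PrefixDefinition("Miss", "female of any age", "female", None),
}


def classify(prefix: str, age_years: Optional[int] = None) -> str:
    """Return a weight category, refusing to guess when the prefix is ambiguous."""
    definition = GLOSSARY[prefix]
    age_group = definition.age_group
    if age_group is None:
        if age_years is None:
            raise ValueError(f"Prefix {prefix!r} does not imply an age group; age is required")
        age_group = "child" if age_years < CHILD_AGE_LIMIT else "adult"
    if age_group == "child":
        return "child"
    return f"adult_{definition.sex}"


print(classify("Miss", age_years=30))  # adult_female
print(classify("Miss", age_years=8))   # child
```

Calling classify("Miss") without an age raises an error rather than defaulting to either interpretation, which is the behaviour you want when two parties disagree about what a value means.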

AAIB Bulletin 4/2021 (April 2021)

Tui plane in ‘serious incident’ after every ‘Miss’ on board was assigned child’s weight (Guardian, 9 April 2021)

For further discussion and related examples, see Dan Nguyen's Twitter thread

Friday, January 01, 2021

Does Big Data Drive Netflix Content?

One of the factors driving Netflix's success is undoubtedly its recommendation engine, based on an algorithm called CineMatch. I discussed this in my earlier post Rhyme or Reason (June 2017).

But that's not the only way Netflix uses data. According to several pundits (Bikker, Dans, Delger, FrameYourTV, Selerity), Netflix also uses big data to create content. However, it's not always clear to what extent these assertions are based on inside information rather than just intelligent speculation.

According to Enrique Dans:

The latest Netflix series is not being made because a producer had a divine inspiration or a moment of lucidity, but because a data model says it will work.

Craig Delger's example looks pretty tame - analysing the intersection between existing content to position new content.

The data collected by Netflix indicated there was a strong interest for a remake of the BBC miniseries House of Cards. These viewers also enjoyed movies by Kevin Spacey, and those directed by David Fincher. Netflix determined that the overlap of these three areas would make House of Cards a successful entry into original programming.
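In data terms, the overlap Delger describes amounts to little more than a set intersection. Here is a toy sketch in Python, with invented viewer IDs standing in for whatever Netflix actually holds. It is not a description of Netflix's real method.

```python
# Invented viewer IDs, for illustration only.
liked_original_miniseries = {"v1", "v2", "v3", "v4", "v5"}
watched_kevin_spacey = {"v2", "v4", "v5", "v6"}
watched_david_fincher = {"v2", "v4", "v7"}

# Viewers who fall into all three groups: the presumed core audience.
core_audience = liked_original_miniseries & watched_kevin_spacey & watched_david_fincher

print(sorted(core_audience))  # ['v2', 'v4']
```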

This is the kind of thing risk-averse producers have always done, and although data analytics might enable Netflix to do this a bit more efficiently, it doesn’t seem to represent a massive technological innovation. Thomas Davenport and Jeanne Harris discuss some more advanced use of data in the second edition of their book Competing on Analytics.

Netflix ... has used analytics to predict whether a TV show will be a hit with audiences before it is produced. Netflix has employed analytics to increase the likelihood of its success. It has used attribute analysis, which it developed for its movie recommendation system, to predict whether customers would like a series, and has identified as many as seventy thousand attributes of movies and TV shows, some of which it drew on for the decision whether to create it.

One of the advantages of a content delivery platform is that you can track the consumption of your content. Amazon used the Kindle to monitor how many chapters people actually read, at what times of day, and where and when they get bored. Games platforms (Nintendo, PlayStation, Xbox) can track how far people get with the games, where they get stuck, and where they might need some TLC or DLC. Similarly, Netflix knows where you pause or give up, and which scenes you rewind to watch again. Netflix can also experiment with alternative trailers for the same content.

In theory, this kind of information can then be used not just by Netflix to decide where to invest, but also by content producers to produce more engaging content. But it's difficult to get clear evidence how much influence this actually has on content creation.

How much other (big) data does Netflix actually collect about its consumers? Craig Delger assumes they operate much like most other data-hungry companies.

Netflix user account data provides verified personal information (sex, age, location), as well as preferences (viewing history, bookmarks, Facebook likes).

However, in a 2019 interview (reported by @dadehayes), Ted Sarandos denied this.

We don’t collect your data. I don’t know how old you are when you join Netflix. I don’t know if you’re black or white. We know your credit card, but that’s just for payment and all that stuff is anonymized.

Sarandos, who is Chief Content Officer at Netflix, also downplayed the role that data (big or otherwise) played in driving content.

Picking content and working with the creative community is a very human function. The data doesn’t help you on anything in that process. It does help you size the investment. … Sometimes we’re wrong on both ends of that, even with this great data. I really think it’s 70, 80% art and 20, 30% science.

But perhaps that's what you'd expect him to say, given that Netflix has always tried to attract content producers with the promise of complete creative freedom. Amazon Studios has made similar claims. See report by Roberto Baldwin.

While there may be conflicting narratives about the difference data makes to content creation, there are some observations that seem relevant if inconclusive.

Firstly, the long tail argument. The original business model for Amazon and Netflix was based on having a vast catalogue, in which most of the entries are of practically no interest to anyone, because the cost of adding something to the catalogue was trivial. Even if the tail doesn't actually contribute as much revenue as the early proponents of the long tail theory suggested, it helps to mitigate uncertainty and risk - not knowing in advance which are going to be hits.

But this effect is countered by the trend towards vertical integration. Amazon and Netflix have gone from distribution to producing their own content, while Disney has moved into streaming. This encourages (but doesn't prove) the hypothesis that there may be some data synergies as well as commercial synergies.

And finally, there is an apparent preference for conventional non-disruptive content, as noted by Alex Shephard, which is pretty much what we would expect from a data-driven approach.

Netflix is content to replicate television as we know it—and the results are deliberately less than spectacular.


Roberto Baldwin, With House of Cards, Netflix Bets on Creative Freedom (Wired, 1 February 2013)

Yannick Bikker, How Netflix Uses Big Data to Build Mountains of Money (7 July 2020)

Enrique Dans, How Analytics Has Given Netflix The Edge Over Hollywood (Forbes, 27 May 2018), Netflix: Big Data And Playing A Long Game Is Proving A Winning Strategy (Forbes, 15 January 2020)

Thomas Davenport and Jeanne Harris, Competing on Analytics (Second edition 2017) - see extract here

FrameYourTV, How Netflix uses Big Data to Drive Success via Inside BigData (20 January 2018) 

Daniel G. Goldstein and Dominique C. Goldstein, Profiting from the Long Tail (Harvard Business Review, June 2006)

Dade Hayes, Netflix’s Ted Sarandos Weighs In On Streaming Wars, Agency Production, Big Tech Breakups, M+A Outlook (Deadline, 22 June 2019)

Selerity, How Netflix used big data and analytics to generate billions (5 April 2019)

Alex Shephard, What Netflix’s Obama Deal Says About the Future of Streaming (New Republic 23 May 2018)

Related posts: Competing on Analytics (May 2010), Rhyme or Reason - the Logic of Netflix (June 2017)

Thursday, November 26, 2020

Assembling your Data Strategy - walk in the way of insight

How many pillars (or components or building blocks) does your data strategy need? I found lots of different answers, from random bloggers to the UK Government.


An anonymous blogger who writes under the pen-name Beautiful Data identifies three pillars of data strategy.

  • Data Management - managing data as an asset
  • Data Democratization - putting data into the hands of the business
  • Data Monetization - driving direct and indirect business benefit


SnapAnalytics identifies People, Process, Data and Technology as its four pillars, and that's a popular approach for many things.

For a different approach, we have four pillars of data strategy from Aleksander Velkoski, the Director of Data Science at the National Association of Realtors.

  • Data Literacy
  • Data Acquisition and Governance
  • Knowledge Mining
  • Business Implementation

Olga Lagunova, Chief Data Analytics Officer at Pitney Bowes, identifies four pillars that are roughly similar.

  • Business Outcome - knowing what you want to achieve
  • Mature Data Ecosystem - including data sourcing and data governance
  • Data Science - practices and organization
  • Culture that values data-driven decision

In his conversation with her, Anthony Scriffignano, Chief Data Scientist at Dun & Bradstreet, replies that "we have many of those same elements". Perhaps because he is in the business of selling data, Anthony looks at data strategy from two directions, which broadly correspond to Olga's first two pillars.

  • Customer-centric - addressing customer needs, solving ever more complex business problems
  • Data-centric - data supply chain, including sourcing, quality assurance and governance

The UK National Data Strategy also has four pillars.

  • Data Foundations
  • Data Skills
  • Data Availability
  • Responsible Data


A white paper from SAS defines five essential components of a data strategy - Identify, Store, Provision, Process and Govern. But a component isn't a pillar. So the editors of Ingenium magazine have turned these into five pillars - Identify, Store, Provision, Integrate and Govern.

(The SAS paper talks a lot about integration, so the Ingenium modification of the SAS list seems fair.)


For six pillars, we can turn to Cynozure, a UK-based data and analytics strategy consultancy.

  • Vision and Value
  • People and Culture
  • Operating Model
  • Technology and Architecture
  • Data Governance
  • Roadmap

Cynozure has also published seven building blocks.

  • Data Vision
  • Data Sources
  • Data Governance and Management
  • Data Analysis
  • Data Team
  • Tech Stack
  • Measuring Success


At last we get to the magic number seven, thanks to @EvanLevy.

  • The Questions (aka Problems) - the more valuable your question, the more valuable analytics is to the company
  • Technical Implementation - he argues that the most valuable datasets require high levels of customization
  • The Users - access and control (this links to the Data Democratization pillar mentioned above)
  • Data Storage and Structure - including data retention
  • Data Security - risk and compliance
  • Personally Identifiable Information (PII) - privacy
  • Visualization and Analysis Needs - flexibility and timeliness


Lawrence of Arabia's autobiography was entitled Seven Pillars of Wisdom, and this is of course a reference to the Bible. 

Wisdom has built her house; she has set up its seven pillars. ...
Leave your simple ways and you will live; walk in the way of insight.

Proverbs 9:1, 9:6


Maybe it doesn't matter how many pillars your data strategy has, as long as it gets you walking in the way of insight. (Whatever that means.)

Obviously not everyone is using the pillar metaphor in the same way - there is presumably some difference between a foundation, a pillar and a building block - but there is a lot of commonality here as well, with a widely shared emphasis on business value and people, as well as a few interesting outliers. 

While most of the sources listed in this blogpost are fairly brief, the UK National Data Strategy contains a lot of detail. While it deserves credit for the attention devoted to ethics and accountability in the Responsible Data pillar, it is not yet clear to me how it addresses some of the other concerns mentioned in this blogpost. I plan to post a more thorough review in a separate blogpost.



"Beautiful Data", Three Pillars of a Data Strategy (19 Sept ??)

Cynozure, Building A Data Strategy For Business Success (Cynozure, 29 May 2019)

Jason Foster, The Six Pillars of a Data Strategy (Cynozure via YouTube, 19 April 2019)

Ingenium, The 5 Pillars of a Data Strategy (Ingenium Magazine, 24 August 2017)

Evan Levy, 7 Pillars of Data Strategy (HighFive, 1 March 2018)

SAS, The 5 Essential Components of a Data Strategy (SAS 2018)

Anthony Scriffignano and Olga Lagunova, Data Strategy - Key Pillars That Define Success (Dun & Bradstreet via YouTube, 29 March 2018)

UK Government, UK National Data Strategy (Department for Digital, Culture, Media and Sport, 9 September 2020)

Aleksander Velkoski, The Four Pillars of Data and Analytics Strategy (Business Quick, 24 August 2020)

Tuesday, August 04, 2020

Data by Design

If your #datastrategy involves collecting and harvesting more data, then it makes sense to check this requirement at an early stage of a new project or other initiative, rather than adding data collection as an afterthought.

For requirements such as security and privacy, the not-as-afterthought heuristic is well established in the practices of security-by-design and privacy-by-design. I have also spent some time thinking and writing about technology ethics, under the heading of responsibility-by-design. In my October 2018 post on Responsibility by Design, I suggested that all of these could be regarded as instances of a general pattern of X-by-design, outlining What, Why, When, For Whom, Who, How and How Much for a given concern X.

In this post, I want to look at three instances of the X-by-design pattern that could support your data strategy:

  • data collection by design
  • data quality by design
  • data governance by design

Data Collection by Design

Here's a common scenario. Some engineers in your organization have set up a new product or service or system or resource. This is now fully operational, and appears to be working properly. However, the system is not properly instrumented.

Thought should always be given to the self instrumentation of the prime equipment, i.e. design for test from the outset. (Kev Judge)

In the past, it was common for a system to be instrumented during the test phase, with data collection switched off for performance reasons once the tests were completed.

If there is concern that the self instrumentation can add unacceptable processing overheads then why not introduce a system of removing the self instrumentation before delivery? (Kev Judge)

Instrumentation matters not just for operational testing and monitoring but also for business intelligence. And for IBM, this is an essential component of digital advantage:

Digitally reinvented electronics organizations pursue new approaches to products, processes and ecosystem participation. They design products with attention toward the types of information they need to collect to design the right customer experiences. (IBM)

The point here is that a new system or service needs to have data collection designed in from the start, rather than tacked on later.
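As a rough illustration of what designed-in instrumentation can look like at the code level, here is a minimal Python sketch. The decorator, the in-memory event list and the place_order function are all hypothetical names of my own. A real system would write to a log or telemetry store, and the switch reflects the option of turning instrumentation off where overheads are a concern.

```python
import functools
import time

EVENTS = []                      # in-memory sink; stands in for a real telemetry store
INSTRUMENTATION_ENABLED = True   # can be switched off if overheads are unacceptable


def instrumented(func):
    """Record the name and duration of every call to the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not INSTRUMENTATION_ENABLED:
            return func(*args, **kwargs)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            EVENTS.append({
                "operation": func.__name__,
                "seconds": time.perf_counter() - start,
            })
    return wrapper


@instrumented
def place_order(item: str, quantity: int) -> dict:
    """Hypothetical business operation."""
    return {"item": item, "quantity": quantity}


place_order("widget", 3)
print(EVENTS[0]["operation"])  # place_order
```

Because the instrumentation is part of the design, the same events can serve operational monitoring and business intelligence without any later retrofit.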

Data Quality by Design

The next pitfall I want to talk about is when a new system or service is developed, the data migration / integration is done in a big rush towards the end of the project, and then - surprise, surprise - the data quality isn't good enough.

This is particularly relevant when data is being repurposed. During the pandemic, there was a suggestion of using Bluetooth connection strength as a proxy for the distance between two phones, and therefore an indicator of the distance between the owners of the phones. Although this data might have been adequate for statistical analysis, it was not good enough to justify putting a person into quarantine.
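To see why signal strength is a shaky proxy for an individual measurement, consider a sketch based on the textbook log-distance path loss model. This is my own illustration, not the method used by any particular contact tracing app, and the reference power and path loss exponent below are assumed values that vary in practice with the phone, its orientation and the surroundings.

```python
def estimate_distance_m(rssi_dbm: float,
                        rssi_at_1m_dbm: float = -59.0,   # assumed calibration value
                        path_loss_exponent: float = 2.0  # assumed; free-space value
                        ) -> float:
    """Estimate distance from received signal strength (log-distance path loss model)."""
    return 10 ** ((rssi_at_1m_dbm - rssi_dbm) / (10 * path_loss_exponent))


# Under these assumptions, a 6 dB weaker reading roughly doubles the estimated distance.
for rssi in (-59, -65, -71):
    print(rssi, round(estimate_distance_m(rssi), 2))
```

A few decibels of variation, which a pocket or a handbag can easily introduce, moves the estimate from one side of a two-metre threshold to the other. Averaged over a population this noise may wash out, but for a single person it does not.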

Data Governance by Design

Finally, there is the question of the sociotechnical organization and processes needed to manage and support the data - not only data quality but all other aspects of data governance.

The pitfall here is to believe you can sort out the IT plumbing first, leaving the necessary governance and controls to be added in later. 

Scott Burnett, Reza Firouzbakht, Cristene Gonzalez-Wertz and Anthony Marshall, Using Data by Design (IBM Institute for Business Value, 2018)

Kev Judge, Self Instrumentation and S.I. (undated, circa 2007)

Monday, August 03, 2020

A Cybernetics View of Data-Driven

Cybernetics helps us understand dynamic systems that are driven by a particular type of data. Here are some examples:

  • Many economists see markets as essentially driven by price data.
  • On the Internet (especially social media) we can see systems that are essentially driven by click data.
  • In stan culture, hardcore fans gang up on critics who fail to give the latest album a perfect score, so behaviour is driven by review scores.

In a recent interview with Alice Pearson of CRASSH, Professor Will Davies explains the process as follows:

For Hayek, the advantage of the market was that it was a space in which stimulus and response could be in a constant state of interactivity: that prices send out information to people, which they respond to either in the form of consumer decisions or investment decisions or new entrepreneurial strategies.

Davies argued that this is now managed on screens, with traders on Wall Street and elsewhere constantly interacting with (as he says) flashing numbers that are rising and falling.

The way in which the market is visualized to people, the way it presents itself to people, the extent to which it is visible on a single control panel, is absolutely crucial to someone's ability to play the market effectively.

Davies attributes to cybernetics a particular vision of human agency: to think of human beings as black boxes which respond to stimuli in particular ways that can be potentially predicted and controlled. (In market trading, this thought leads naturally to replacing human beings with algorithmic trading.)

Davies then sees this cybernetic vision encapsulated in the British government approach to the COVID-19 pandemic.

What you see now with this idea of Stay Alert ... is a vision of an agent or human being who is constantly responsive and constantly adaptable to their environment, and will alter their behaviour depending on what types of cues are coming in from one moment to the next. ... The ideological vision being presented is of a society in which the rules of everyday conduct are going to be constantly tweaked in response to different types of data, different things that are appearing on the control panels at the Joint Biosecurity Centre.

The word alert originally comes from an Italian military term all'erta - to the watch. So the slogan Stay Alert implies a visual idea of agency. But as Alice Pearson pointed out, that which is supposed to be the focus of our alertness is invisible. And it is not just the virus itself that is invisible, but (given the frequency of asymptomatic carriers) which people are infectious and should be avoided.

So what visual or other signals is the Government expecting us to be alert to? If we can't watch out for symptoms, perhaps we are expected instead to watch out for significant shifts in the data - ambiguous clues about the effectiveness of masks or the necessity of quarantine. Or perhaps significant shifts in the rules.

Most of us only see a small fraction of the available data - Stafford Beer's term for this is attenuation, and Alice Pearson referred to hyper-attenuation. So we seem to be faced with a choice. On the one hand, there is a shifting set of rules based on the official interpretation of the data - assuming that the powers-that-be have a richer set of data than we do, and a more sophisticated set of tools for managing the data. On the other hand, there is an increasingly strident set of activists encouraging people to rebel against the official rules, essentially setting up a rival set of norms in which for example mask-wearing is seen as a sign of capitulation to a socialist regime run by Bill Gates, or whatever.

Later in the interview, and also in his New Statesman article, Davies talks about a shifting notion of rules, from a binding contract to mere behavioural nudges.

Rules morph into algorithms, ever-more complex sets of instructions, built around an if/then logic. By collecting more and more data, and running more and more behavioural tests, it should in principle be possible to steer behaviour in the desired direction. ... The government has stumbled into a sort of clumsy algorithmic mentality. ... There is a logic driving all this, but it is one only comprehensible to the data analyst and modeller, while seeming deeply weird to the rest of us. ... To the algorithmic mind, there is no such thing as rule-breaking, only unpredicted behaviour.

One of the things that differentiates the British government from more accomplished practitioners of data-driven biopower (such as Facebook and WeChat) is the apparent lack of fast and effective feedback loops. If what the British government is practising counts as cybernetics at all, it seems to be a very primitive and broken version of first-order cybernetics.

When Norbert Wiener introduced the term cybernetics over seventy years ago, describing thinking as a kind of information processing and people as information processing organisms, this was a long way from simple behaviourism. Instead, he emphasized learning and creativity, and insisted on the liberty of each human being to develop in his freedom the full measure of the human possibilities embodied in him.

In a talk on the entanglements of bodies and technologies, Lucy Suchman draws on an article by Geoff Bowker to describe the universal aspirations of cybernetics.

Cyberneticians declared a new age in which Darwin's placement of man as one among the animals would now be followed by cybernetics' placement of man as one among the machines.

However, as Suchman reminds us:

Norbert Wiener himself paid very careful attention to questions of labour, and actually cautioned against the too-broad application of models that were designed in relation to physical or computational systems to the social world.

Even if sometimes seeming outnumbered, there have always been some within the cybernetics community who are concerned about epistemology and ethics. Hence second-order (or even third-order) cybernetics.

Ben Beaumont-Thomas, Hardcore pop fans are abusing critics – and putting acclaim before art (The Guardian, 3 August 2020)

Geoffrey Bowker, How to be universal: some cybernetic strategies, 1943-1970 (Social Studies of Science 23, 1993) pp 107-127
Philip Boxer & Vincent Kenny, The economy of discourses - a third-order cybernetics (Human Systems Management, 9/4 January 1990) pp 205-224

Will Davies, Coronavirus and the Rise of Rule-Breakers (New Statesman, 8 July 2020)

Lucy Suchman, Restoring Information’s Body: Remediations at the Human-Machine Interface (Medea, 20 October 2011) Recording via YouTube
Norbert Wiener, The Human Use of Human Beings (1950, 1954)

Stanford Encyclopedia of Philosophy: A cybernetic view of human nature

Wednesday, July 29, 2020

Information Advantage (not necessarily) in Air and Space

Some good down-to-earth points from #ASPC20, the Air and Space Power Conference held by @airpowerassn earlier this month. Although the material was aimed at a defence audience, much of the discussion is equally relevant to civilian and commercial organizations interested in information superiority (US) or information advantage (UK).

Professor Dame Angela McLean, who is the Chief Scientific Advisor to the MOD, defined information advantage thus:

The credible advantage gained through the continuous, decisive and resilient employment of information and information systems. It involves exploiting information of all kinds to improve every aspect of operations: understanding, decision-making, execution, assessment and resilience.

She noted the temptation for the strategy to jump straight to technology (technology push); the correct approach is to set out ambitious, enduring capability outcomes (capability pull), although this may be harder to communicate. Nevertheless, technology push may make sense in those areas where technologies could contribute to multiple outcomes.

She also insisted that it was not enough just to have good information, it was also necessary to use this information effectively, and she called for cultural change to drive improved evidence-based decision-making. (This chimes with what I've been arguing myself, including the need for intelligence to be actioned, not just actionable.)

In his discussion of multi-domain integration, General Sir Patrick Sanders reinforced some of the same points.
  • Superiority in information (is) critical to success
  • We are not able to capitalise on the vast amounts of data our platforms can deliver us, as they are not able to share, swap or integrate data at a speed that generates tempo and advantage
  • (we need) Faster and better decision making, rooted in deeper understanding from all sources and aided by data analytics and supporting technologies

See my previous post on Developing Data Strategy (December 2019) 

Professor Dame Angela McLean, Orienting Defence Research to anticipate and react to the challenges of a future information-dominated operational environment (Video)

General Sir Patrick Sanders, Cohering Joint Forces to deliver Multi Domain Integration (Air and Space Power Conference, 15 July 2020) (Video, Official Transcript)

For the full programme, see

Wednesday, July 22, 2020

Encouraging Data Innovation

@BCSDMSG and @DAMAUK ran an online conference last month, entitled Delivering Value Through Data. Videos are now available on YouTube.

The conference opened with a very interesting presentation by Peter Thomas (Prudential Regulation Authority, part of the Bank of England). Some key takeaways:

The Bank of England is a fairly old-fashioned institution. The data programme was as much a cultural shift as a technology shift, and this was reflected by a change in the language – from data management to data innovation.

Challenges: improve the cadence of situation awareness, sense-making and decision-making.

One of Peter's challenges was to wean the business off Excel. The idea was to get data straight into Tableau, bypassing Excel. Peter referred to this as straight-through processing, and said this was the biggest bang for the buck.

Given the nature of his organization, the link between data governance and decision governance is particularly important. Peter described making governance more effective/efficient by reducing the number of separate governance bodies, and outlined a stepwise approach for persuading people in the business to accept data ownership:
  1. You are responsible for your decisions
  2. You are responsible for your interpretation of the data used in your decisions
  3. You are responsible for your requests and requirements for data.
Some decisions need to be taken very quickly, in crisis management mode. (This is a particular characteristic of a regulatory organization, but also relevant to anyone dealing with COVID-19.) If people can cut through the procrastination in such situations, this should create a precedent for doing things more quickly in Business-As-Usual mode.

Finally, Peter reported some tension between two camps – those who want data and decision management to be managed according to strict rules, and those who want the freedom to experiment. Enterprise-wide innovation needs to find a way to reconcile these camps.

Plenty more insights in the video, including the Q&A at the end - well worth watching.

Peter Thomas, Encouraging Data Innovation (BCS via YouTube, 15 June 2020)