
Friday, January 01, 2021

Does Big Data Drive Netflix Content?

One thing that contributes to the success of Netflix is its recommendation engine, originally based on an algorithm called CineMatch. I discussed this in my earlier post Rhyme or Reason (June 2017).

But that's not the only way Netflix uses data. According to several pundits (Bikker, Dans, Delger, FrameYourTV, Selerity), Netflix also uses big data to create content. However, it's not always clear to what extent these assertions are based on inside information rather than just intelligent speculation.

According to Enrique Dans
The latest Netflix series is not being made because a producer had a divine inspiration or a moment of lucidity, but because a data model says it will work.
Craig Delger's example looks pretty tame - analysing the overlap between audiences for existing content in order to position new content.

The data collected by Netflix indicated there was a strong interest for a remake of the BBC miniseries House of Cards. These viewers also enjoyed movies by Kevin Spacey, and those directed by David Fincher. Netflix determined that the overlap of these three areas would make House of Cards a successful entry into original programming.
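
Purely as an illustration of what that overlap analysis amounts to - the viewer identifiers below are invented, not Netflix data - the "core audience" for the proposed show is simply the intersection of three segments:

```python
# Purely illustrative: the viewer identifiers below are invented, not Netflix data.
# The "core audience" for the proposed show is the intersection of three segments.

watched_uk_house_of_cards = {"u01", "u02", "u03", "u05", "u08"}
watched_kevin_spacey_films = {"u02", "u03", "u04", "u05", "u09"}
watched_david_fincher_films = {"u02", "u03", "u05", "u07"}

core_audience = (watched_uk_house_of_cards
                 & watched_kevin_spacey_films
                 & watched_david_fincher_films)

print(f"Core overlap: {len(core_audience)} viewers -> {sorted(core_audience)}")
```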

This is the kind of thing risk-averse producers have always done, and although data analytics might enable Netflix to do this a bit more efficiently, it doesn’t seem to represent a massive technological innovation. Thomas Davenport and Jeanne Harris discuss some more advanced use of data in the second edition of their book Competing on Analytics.

Netflix ... has used analytics to predict whether a TV show will be a hit with audiences. ... It has used attribute analysis ... to predict whether customers would like a series, and has identified as many as seventy thousand attributes of movies and TV shows, some of which it drew on for the decision whether to create it.
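
Davenport and Harris don't publish the details, but attribute analysis of this kind can be sketched as tagging titles with attributes and scoring a proposed show against a segment's viewing history. Everything below (attribute names, viewers, scoring rule) is invented for illustration, not Netflix's actual method:

```python
# Minimal sketch of attribute-based scoring (attributes, viewers and weights invented).
# Each title a viewer watched is represented by its attributes; a proposed show is
# scored by how often its attributes already appear in the segment's viewing history.

from collections import Counter

viewing_history = {
    "alice": ["political-drama", "antihero-lead", "uk-remake"],
    "bob":   ["political-drama", "director-fincher", "antihero-lead"],
    "carol": ["romcom", "ensemble-cast"],
}

proposed_show = {"political-drama", "antihero-lead", "director-fincher"}

# Count how often each attribute appears across the segment's history.
attribute_counts = Counter(a for attrs in viewing_history.values() for a in attrs)

# Score: total historical "pull" of the proposed show's attributes,
# normalised by the number of viewers in the segment.
score = sum(attribute_counts[a] for a in proposed_show) / len(viewing_history)
print(f"Predicted appeal score: {score:.2f}")
```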

One of the advantages of a content delivery platform is that you can track the consumption of your content. Amazon used the Kindle to monitor how many chapters people actually read, at what times of day, where and when they get bored. Games platforms (Nintendo, PlayStation, X-Box) can track how far people get with the games, where they get stuck, and where they might need some TLC or DLC. So Netflix knows where you pause or give up, which scenes you rewind to watch again. Netflix can also experiment with alternative trailers for the same content.
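
To make this concrete, here is a hypothetical sketch of the kind of playback events such a platform might log, and how abandonment points could be summarised. The field names and records are invented for illustration, not anyone's actual telemetry:

```python
# Hypothetical playback events (field names and values invented for illustration):
# each record says what a viewer did at which point in an episode.
from collections import defaultdict

events = [
    {"user": "u1", "title": "show-a", "episode": 1, "action": "pause",   "minute": 12},
    {"user": "u1", "title": "show-a", "episode": 1, "action": "abandon", "minute": 31},
    {"user": "u2", "title": "show-a", "episode": 1, "action": "rewind",  "minute": 18},
    {"user": "u2", "title": "show-a", "episode": 2, "action": "abandon", "minute": 5},
]

# Where do viewers give up? Bucket abandonment points into 10-minute bands.
abandon_bands = defaultdict(int)
for e in events:
    if e["action"] == "abandon":
        abandon_bands[(e["episode"], e["minute"] // 10 * 10)] += 1

for (episode, band), count in sorted(abandon_bands.items()):
    print(f"Episode {episode}, minutes {band}-{band + 9}: {count} abandonments")
```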

In theory, this kind of information can then be used not just by Netflix to decide where to invest, but also by content producers to produce more engaging content. But it's difficult to get clear evidence how much influence this actually has on content creation.

How much other (big) data does Netflix actually collect about its consumers? Craig Delger assumes it operates much like most other data-hungry companies.

Netflix user account data provides verified personal information (sex, age, location), as well as preferences (viewing history, bookmarks, Facebook likes).

However, in a 2019 interview (reported by @dadehayes), Ted Sarandos denied this.

We don’t collect your data. I don’t know how old you are when you join Netflix. I don’t know if you’re black or white. We know your credit card, but that’s just for payment and all that stuff is anonymized.

Sarandos, who is Chief Content Officer at Netflix, also downplayed the role that data (big or otherwise) played in driving content.

Picking content and working with the creative community is a very human function. The data doesn’t help you on anything in that process. It does help you size the investment. … Sometimes we’re wrong on both ends of that, even with this great data. I really think it’s 70, 80% art and 20, 30% science.

But perhaps that's what you'd expect him to say, given that Netflix has always tried to attract content producers with the promise of complete creative freedom. Amazon Studios has made similar claims. See report by Roberto Baldwin.

While there may be conflicting narratives about the difference data makes to content creation, there are some observations that seem relevant if inconclusive.

Firstly, the long tail argument. The original business model for Amazon and Netflix was based on having a vast catalogue: because the cost of adding an item to the catalogue was trivial, it didn't matter that most of the entries were of practically no interest to anyone. Even if the tail doesn't actually contribute as much revenue as the early proponents of the long tail theory suggested, it helps to mitigate uncertainty and risk - not knowing in advance which titles are going to be hits.

But this effect is countered by the trend towards vertical integration. Amazon and Netflix have gone from distribution to producing their own content, while Disney has moved into streaming. This encourages (but doesn't prove) the hypothesis that there may be some data synergies as well as commercial synergies.

And finally, an apparent preference for conventional non-disruptive content, as noted by Alex Shephard, which is pretty much what we would expect from a data-driven approach.

Netflix is content to replicate television as we know it—and the results are deliberately less than spectacular.

Update (June 2023)

I have been reading a detailed analysis in Ed Finn's book, What Algorithms Want (2017).

Finn's answer to my question about data-driven content is no, at least not directly. Although Netflix had used data to commission new content as well as to recommend existing content (Finn's example was House of Cards), it had apparently left the content itself to the producers, and then used data and algorithms to promote it.

After making the initial decision to invest in House of Cards, Netflix was using algorithms to micromanage distribution, not production. (Finn, p 99)

Obviously that doesn't say anything about what Netflix has been doing more recently, but Finn seems to have been looking at the same examples as the other pundits I referenced above.


Roberto Baldwin, With House of Cards, Netflix Bets on Creative Freedom (Wired, 1 February 2013)

Yannick Bikker, How Netflix Uses Big Data to Build Mountains of Money (7 July 2020)

Enrique Dans, How Analytics Has Given Netflix The Edge Over Hollywood (Forbes, 27 May 2018), Netflix: Big Data And Playing A Long Game Is Proving A Winning Strategy (Forbes, 15 January 2020)

Thomas Davenport and Jeanne Harris, Competing on Analytics (Second edition 2017) - see extract here https://www.huffpost.com/entry/how-netflix-uses-analytics-to-thrive_b_5a297879e4b053b5525db82b

Ed Finn, What Algorithms Want: Imagination in the Age of Computing (MIT Press, 2017)

FrameYourTV, How Netflix uses Big Data to Drive Success via Inside BigData (20 January 2018) 

Daniel G. Goldstein and Dominique C. Goldstein, Profiting from the Long Tail (Harvard Business Review, June 2006)

Dade Hayes, Netflix’s Ted Sarandos Weighs In On Streaming Wars, Agency Production, Big Tech Breakups, M+A Outlook (Deadline, 22 June 2019)

Alexis C. Madrigal, How Netflix Reverse-Engineered Hollywood (Atlantic, 2 January 2014)

Selerity, How Netflix used big data and analytics to generate billions (5 April 2019)

Alex Shephard, What Netflix’s Obama Deal Says About the Future of Streaming (New Republic 23 May 2018)

Related posts: Competing on Analytics (May 2010), Rhyme or Reason - the Logic of Netflix (June 2017)

Saturday, February 15, 2020

Dark Data

At @imperialcollege this week to hear Professor David Hand talk about his new book on Dark Data.

Some people define dark data as unanalysed data, data you have but are not able to use, and this is the definition that can be found on Wikipedia. The earliest reference I can find to dark data in this sense is a Gartner blogpost from 2012.

In a couple of talks I gave in 2015, I used the term Dark Data in a much broader sense - to include the data you simply don't have. Both talks included the following diagram.




Here's an example of this idea. A supermarket may know that I sometimes buy beer at the weekends. This information is derived from its own transaction data, identifying me through my use of a loyalty card. But what about the weekends when I don't buy beer from that supermarket? Perhaps I am buying beer from a rival supermarket, or drinking beer at friends' houses, or having a dry weekend. If they knew this, it might help them sell me more beer in future. Or sell me something else for those dry weekends.

Obviously the supermarket doesn't have access to its competitors' transaction data. But it does know when its competitors are doing special promotions on beer. And there may be some clues about my activity from social media or other sources.

The important thing to remember is that the supermarket rarely has a complete picture of the customer's purchases, let alone what is going on elsewhere in the customer's life. So it is trying to extract useful insights from incomplete data, enriched in any way possible by big data.
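
Here is a minimal sketch of the supermarket example, with invented dates and loyalty-card transactions: the weekends with no recorded beer purchase are the dark data, and the transaction data alone cannot say why.

```python
# Sketch of the supermarket example above: dates and loyalty-card transactions are
# invented. Weekends with no recorded beer purchase are the "dark data" - the absence
# could mean a rival store, a friend's house, or a dry weekend; the data can't say which.
from datetime import date, timedelta

beer_purchase_dates = {date(2020, 1, 4), date(2020, 1, 18), date(2020, 2, 1)}

start, end = date(2020, 1, 1), date(2020, 2, 29)
all_days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
saturdays = [d for d in all_days if d.weekday() == 5]   # Saturday = 5

dark_weekends = [d for d in saturdays if d not in beer_purchase_dates]
print(f"{len(dark_weekends)} of {len(saturdays)} weekends have no beer purchase recorded")
```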

Professor Hand's book is about data you don't have - perhaps data you wish you had, or hoped to have, or thought you had, but nevertheless data you don't have. He argues that the missing data are at least as important as the data you do have. So this is the same sense that I was using in 2015.

Hand describes and illustrates many different manifestations of dark data, and talks about a range of statistical techniques for drawing valid conclusions from incomplete data and for overcoming potential bias. He also talks about the possible benefits of dark data - for example, hiding some attributes to improve the quality and reliability of the attributes that are exposed. A good example of this would be double-blind testing in clinical trials, which involves hiding which subjects are receiving which treatment, because revealing this information might influence and distort the results.
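
As a generic illustration of the bias problem (not an example taken from Hand's book): if values go unrecorded precisely because they are high, then simply averaging the observed values understates the truth.

```python
# Generic illustration, not from Hand's book: when values are missing *because* they
# are high (not missing at random), averaging only the observed values is biased.
import random

random.seed(1)
true_values = [random.gauss(100, 20) for _ in range(10_000)]

# Suppose high values are much more likely to go unrecorded.
observed = [v for v in true_values if not (v > 110 and random.random() < 0.7)]

print(f"True mean:     {sum(true_values) / len(true_values):.1f}")
print(f"Observed mean: {sum(observed) / len(observed):.1f}  <- biased low")
```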

Can big data solve the challenges posed by dark data? In my example, we might be able to extract some useful clues from big data. But although these clues might lead to new avenues to investigate, or hypotheses that could be tested further, the clues themselves may be unreliable indicators. The important thing is to be mindful of the limits of your visible data.




David J Hand, Dark Data: Why what you don't know matters (Princeton 2020). See also his presentation at Imperial College, 10 February 2020 https://www.youtube.com/watch?v=R3IO5SDVmuk

Richard Veryard, Boundaryless Customer Engagement (Open Group, October 2015), Real-Time Personalization (Unicom December 2015)

Andrew White, Dark Data is like that furniture you have in that Dark Cupboard (Gartner, 11 July 2012)

Wikipedia: Dark Data

Related posts: Big Data and Organizational Intelligence (November 2018), Dark Data and the US Election (November 2020)

Sunday, December 01, 2019

Data Strategy - Reach

This is one of a series of posts looking at the four key dimensions of Data and Information that must be addressed in a data strategy - reach, richness, agility and assurance.



Data strategy nowadays is dominated by the concept of big data, whatever that means. Every year our notions of bigness are being stretched further. So instead of trying to define big, let me talk about reach.

Firstly, this means reaching into more sources of data. Instead of just collecting data about the immediate transactions, enterprises now expect to have visibility up and down the supply chain, as well as visibility into the world of the customers and end-consumers. Data and information can be obtained from other organizations in your ecosystem, as well as picked up from external sources such as social media. And the technologies for monitoring (telemetrics, internet of things) and surveillance (face recognition, tracking, etc) are getting cheaper, and may be accurate enough for some purposes.

Obviously there are some ethical as well as commercial issues here. I'll come back to these.

Reach also means reaching more destinations. In a data-driven business, data and information need to get to where they can be useful, both inside the organization and across the ecosystem, to drive capabilities and processes, to support sense-making (also known as situation awareness), policy and decision-making, and intelligent action, as well as organizational learning. These are the elements of what I call organizational intelligence. Self-service (citizen) data and intelligence tools, available to casual as well as dedicated users, improve reach; and the tool vendors have their own reasons for encouraging this trend.

In many organizations, there is a cultural divide between the specialists in Head Office and the people at the edge of the organization. If an organization is serious about being customer-centric, it needs to make sure that relevant and up-to-date information and insight reaches those dealing with awkward customers and other immediate business challenges. This is the power-to-the-edge strategy.

Information and insight may also have value outside your organization - for example to your customers and suppliers, or other parties. Organizations may charge for access to this kind of information and insight (direct monetization), may bundle it with other products and services (indirect monetization), or may distribute it freely for the sake of wider ecosystem benefits.

And obviously there will be some data and intelligence that must not be shared, for security or other reasons. Many organizations will adopt a defensive data strategy, protecting all information unless there is a strong reason for sharing; others may adopt a more offensive data strategy, seeking competitive advantage from sharing and monetization except for those items that have been specifically classified as private or confidential.
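
As a rough sketch of how that defensive/offensive distinction might be operationalised - the policy names and data items below are invented - each strategy amounts to a default sharing rule plus per-item overrides from a classification exercise:

```python
# Hypothetical sketch (policy names and data items invented): each strategy is a
# default sharing rule plus per-item overrides from a data classification exercise.

POLICY = {
    "defensive": {"default": "deny",  "overrides": {"product-catalogue": "allow"}},
    "offensive": {"default": "allow", "overrides": {"customer-pii": "deny",
                                                    "pricing-model": "deny"}},
}

def may_share(strategy: str, data_item: str) -> bool:
    policy = POLICY[strategy]
    return policy["overrides"].get(data_item, policy["default"]) == "allow"

print(may_share("defensive", "usage-statistics"))   # False - protect unless listed
print(may_share("offensive", "usage-statistics"))   # True  - share unless classified
print(may_share("offensive", "customer-pii"))       # False - specifically classified
```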

How are your suppliers and partners thinking about these issues? To what extent are they motivated or obliged to share data with you, or to protect the data that you share with them? I've seen examples where organizations lack visibility of their own assets, because they have outsourced the maintenance of these assets to an external company, and the external company fails to provide sufficiently detailed or accurate information. (When implementing your data strategy, make sure your contractual agreements cover your information sharing requirements.)

Data protection introduces further requirements. Under GDPR, data controllers are supposed to inform data subjects how far their personal data will reach, although many of the privacy notices I've seen have been so vague and generic that they don't significantly constrain the data controller's ability to share personal data. Meanwhile, GDPR Article 28 specifies some of the aspects of data sharing that should be covered in contractual agreements between data controllers and data processors. But compliance with GDPR or other regulations doesn't fully address ethical concerns about the collection, sharing and use of personal data. So an ethical data strategy should be based on what the organization thinks is fair to data subjects, not merely what it can get away with.

There are various specific issues that may motivate an organization to improve the reach of data as part of its data strategy. For example:
  • Critical data belongs to third parties
  • Critical business decisions lacking robust data
  • I know the data is in there, but I can't get it out.
  • Lack of transparency – I can see the result, but I don’t know how it has been calculated.
  • Analytic insight narrowly controlled by a small group of experts – not easily available to general management
  • Data and/or insight would be worth a lot to our customers, if only we had a way of getting it to them.
In summary, your data strategy needs to explain how you are going to get data and intelligence
  • From a wide range of sources
  • Into a full range of business processes at all touchpoints
  • Delivered to the edge – where your organization engages with your customers


Next post: Richness

Related posts

Power to the Edge (December 2005)
Reach, Richness, Agility and Assurance (August 2017)
Setting off towards the data-driven business (August 2019)
Beyond Trimodal - Citizens and Tourists (November 2019)

Tuesday, November 06, 2018

Big Data and Organizational Intelligence

Ten years ago, the editor of Wired Magazine published an article claiming the end of theory: with enough data, the numbers speak for themselves.

The idea that data (or facts) speak for themselves, with no need for interpretation or analysis, is a common trope. It is sometimes associated with a legal doctrine known as Res Ipsa Loquitur - the thing speaks for itself. However this legal doctrine isn't about truth but about responsibility: if a surgeon leaves a scalpel inside the patient, this fact alone is enough to establish the surgeon's negligence.

Legal doctrine aside, perhaps the world speaks for itself. The world, someone once asserted, is all that is the case, the totality of facts not of things. Paradoxically, big data often means very large quantities of very small (atomic) data.

But data, however big, does not provide a reliable source of objective truth. This is one of the six myths of big data identified by Kate Crawford, who points out, data and data sets are not objective; they are creations of human design. In other words, we don't just build models from data, we also use models to obtain data. This is linked to Piaget's account of how children learn to make sense of the world in terms of assimilation and accommodation. (Piaget called this Genetic Epistemology.)

Data also cannot provide explanation or understanding. Data can reveal correlation but not causation. Which is one of the reasons why we need models. As Kate Crawford also observes, we get a much richer sense of the world when we ask people the why and the how not just the how many. And Bernard Stiegler links the end of theory glorified by Anderson with a loss of reason (2019, p8).
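
A toy simulation makes the correlation/causation point: a hidden common cause drives two variables that then correlate strongly, although neither causes the other. The scenario and numbers below are invented.

```python
# Generic illustration of correlation without causation: a hidden common cause
# (hot weather) drives both ice-cream sales and sunburn, which then correlate
# even though neither causes the other. All numbers are simulated.
import random

random.seed(0)
weather = [random.gauss(20, 5) for _ in range(1000)]           # hidden common cause
ice_cream = [w * 2.0 + random.gauss(0, 2) for w in weather]    # driven by weather
sunburn = [w * 0.5 + random.gauss(0, 1) for w in weather]      # also driven by weather

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

print(f"corr(ice_cream, sunburn) = {corr(ice_cream, sunburn):.2f}")  # high, but no causal link
```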

In the traditional world of data management, there is much emphasis on the single source of truth. Michael Brodie (who knows a thing or two about databases), while acknowledging the importance of this doctrine for transaction systems such as banking, argues that it is not appropriate everywhere. In science, as in life, understanding of a phenomenon may be enriched by observing the phenomenon from multiple perspectives (models). ... Database products do not support multiple models, i.e., the reality of science and life in general. One approach Brodie talks about to address this difficulty is ensemble modelling: running several different analytical models and comparing or aggregating the results. (I referred to this idea in my post on the Shelf-Life of Algorithms).
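
Here is a minimal sketch of ensemble modelling in the sense Brodie describes - several different models run over the same input, with their outputs compared and aggregated. The three models are invented stand-ins for real analytical models:

```python
# Minimal sketch of ensemble modelling: run several different analytical models over
# the same input, then compare and aggregate their outputs. Models are invented stand-ins.

def model_a(x: float) -> float:
    return 2.0 * x + 1.0        # e.g. a simple linear model

def model_b(x: float) -> float:
    return 1.8 * x + 3.0        # a slightly different fit

def model_c(x: float) -> float:
    return 1.5 * x              # a more conservative model

models = [model_a, model_b, model_c]

def ensemble(x: float) -> tuple[float, float]:
    predictions = [m(x) for m in models]
    average = sum(predictions) / len(predictions)   # aggregate the results
    spread = max(predictions) - min(predictions)    # compare: how much the models disagree
    return average, spread

print(ensemble(10.0))   # (19.0, 6.0)
```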

Along with the illusion that what the data tells you is true, we can identify two further illusions: that what the data tells you is important, and that what the data doesn't tell you is not important. These are not just illusions of big data of course - any monitoring system or dashboard can foster them. The panopticon affects not only the watched but also the watcher.

From the perspective of organizational intelligence, the important point is that data collection, sensemaking, decision-making, learning and memory form a recursive loop - each inextricably based on the others. An organization only perceives what it wants to perceive, and this depends on the conceptual models it already has - whether these are explicitly articulated or unconsciously embedded in the culture. Which is why real diversity - in other words, genuine difference of perspective, not just bureaucratic profiling - is so important, because it provides the organizational counterpart to the ensemble modelling mentioned above.


https://xkcd.com/552/


Each day seems like a natural fact
And what we think changes how we act



Chris Anderson, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (Wired, 23 June 2008)

Michael L Brodie, Why understanding of truth is important in Data Science? (KD Nuggets, January 2018)

Kate Crawford, The Hidden Biases in Big Data (HBR, 1 April 2013)

Kate Crawford, The Anxiety of Big Data (New Inquiry, 30 May 2014)

Bruno Gransche, The Oracle of Big Data – Prophecies without Prophets (International Review of Information Ethics, Vol. 24, May 2016)

Kevin Kelly, The Google Way of Science (The Technium, 28 June 2008)

Thomas McMullan, What does the panopticon mean in the age of digital surveillance? (Guardian, 23 July 2015)

Evelyn Ruppert, Engin Isin and Didier Bigo, Data politics (Big Data and Society, July–December 2017: 1–7)

Ian Steadman, Big Data and the Death of the Theorist (Wired, 25 January 2013)

Bernard Stiegler, The Age of Disruption: Technology and Madness in Computational Capitalism (English translation, Polity Press 2019)

Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)


Related posts

Information Algebra (March 2008), How Dashboards Work (November 2009), Conceptual Modelling - Why Theory (November 2011), Co-Production of Data and Knowledge (November 2012), Real Criticism - The Subject Supposed to Know (January 2013), The Purpose of Diversity (December 2014), The Shelf-Life of Algorithms (October 2016), The Transparency of Algorithms (October 2016), Algorithms and Governmentality (July 2019), Mapping out the entire world of objects (July 2020)


Wikipedia: Ensemble Learning, Genetic Epistemology, Panopticism, Res ipsa loquitur (the thing speaks for itself)

Stanford Encyclopedia of Philosophy: Kant and Hume on Causality


For more on Organizational Intelligence, please read my eBook.
https://leanpub.com/orgintelligence/