Ten years ago, the editor of Wired Magazine published an article claiming the end of theory. "With enough data, the numbers speak for themselves."
The idea that data (or facts) speak for themselves, with no need for interpretation or analysis, is a common trope. It is sometimes associated with a legal doctrine known as Res Ipsa Loquitur - the thing speaks for itself. However this legal doctrine isn't about truth but about responsibility: if a surgeon leaves a scalpel inside the patient, this fact alone is enough to establish the surgeon's negligence.
Or even the world speaks for itself. The world, someone once asserted, is all that is the case, the totality of facts not of things. Paradoxically, big data often means very large quantities of very small (atomic) data.
But data, however big, does not provide a reliable source of objective truth. This is one of the six myths of big data identified by Kate Crawford, who points out, "data and data sets are not objective; they are creations of human design". In other words, we don't just build models from data, we also use models to obtain data. This is linked to Piaget's account of how children learn to make sense of the world in terms of assimilation and accommodation. (Piaget called this Genetic Epistemology.)
Data also cannot provide explanation or understanding. Data can reveal correlation but not causation. Which is one of the reasons why we need models. As Kate Crawford also observes, "we get a much richer sense of the world when we ask people the why and the how not just the how many".
In the traditional world of data management, there is much emphasis on the single source of truth. Michael Brodie (who knows a thing or two about databases), while acknowledging the importance of this doctrine for transaction systems such as banking, argues that it is not appropriate everywhere. "In science, as in life, understanding of a phenomenon may be enriched by observing the phenomenon from multiple perspectives (models). ... Database products do not support multiple models, i.e., the reality of science and life in general.". One approach Brodie talks about to address this difficulty is ensemble modelling: running several different analytical models and comparing or aggregating the results. (I referred to this idea in my post on the Shelf-Life of Algorithms).
Along with the illusion that what the data tells you is true, we can identify two further illusions: that what the data tells you is important, and that what the data doesn't tell you is not important. These are not just illusions of big data of course - any monitoring system or dashboard can foster them. The panopticon affects not only the watched but also the watcher.
From the perspective of organizational intelligence, the important point is that data collection, sensemaking, decision-making, learning and memory form a recursive loop - each inextricably based on the others. An organization only perceives what it wants to perceive, and this depends on the conceptual models it already has - whether these are explicitly articulated or unconsciously embedded in the culture. Which is why real diversity - in other words, genuine difference of perspective, not just bureaucratic profiling - is so important, because it provides the organizational counterpart to the ensemble modelling mentioned above.
Chris Anderson, The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (Wired, 23 June 2008)
Michael L Brodie, Why understanding of truth is important in Data Science? (KD Nuggets, January 2018)
Kate Crawford, The Hidden Biases in Big Data (HBR, 1 April 2013)
Kate Crawford, The Anxiety of Big Data (New Inquiry, 30 May 2014)
Bruno Gransche, The Oracle of Big Data – Prophecies without Prophets (International Review of Information Ethics, Vol. 24, May 2016)
Thomas McMullan, What does the panopticon mean in the age of digital surveillance? (Guardian, 23 July 2015)
Evelyn Ruppert, Engin Isin and Didier Bigo, Data politics (Big Data and Society, July–December 2017: 1–7)
Ian Steadman, Big Data and the Death of the Theorist (Wired, 25 January 2013)
Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)
Information Algebra (March 2008)
How Dashboards Work (November 2009)
Co-Production of Data and Knowledge (November 2012)
Real Criticism - The Subject Supposed to Know (January 2013)
The Purpose of Diversity (December 2014)
The Shelf-Life of Algorithms (October 2016)
The Transparency of Algorithms (October 2016)
Wikipedia: Ensemble Learning, Genetic Epistemology, Panopticism, Res ipsa loquitur (the thing speaks for itself)
Stanford Encyclopedia of Philosophy: Kant and Hume on Causality
For more on Organizational Intelligence, please read my eBook.