
Friday, July 10, 2020

Three Types of Data - Bronze, Gold and Mercury

In this post, I'm going to look at three types of data, and the implications for data management. For the purposes of this story, I'm going to associate these types with three contrasting metals: bronze, gold and mercury. (Update: fourth type added - scroll down for details.)


Bronze

The first type of data represents something that happened at a particular time. For example, transaction data: this customer made this purchase of this product on this date. This delivery was received, this contract was signed, this machine was installed, this notification was sent.

Once this kind of data is correctly recorded, it should never change. Even if an error is detected in a transaction record, the usual procedure is to add two more transaction records: one to reverse out the incorrect values, and one to re-enter the correct values.
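As a rough illustration of this append-only correction pattern (a hypothetical sketch; the record structure and values are invented, not taken from any particular system):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: once written, a record never changes
class Transaction:
    customer_id: str
    product_id: str
    amount: float
    recorded_on: date

ledger = [Transaction("C042", "P17", 99.0, date(2020, 7, 1))]

# Suppose the amount should have been 9.90. Instead of editing the record,
# append a reversal and a corrected entry.
wrong = ledger[0]
ledger.append(Transaction(wrong.customer_id, wrong.product_id, -wrong.amount, date(2020, 7, 10)))
ledger.append(Transaction(wrong.customer_id, wrong.product_id, 9.90, date(2020, 7, 10)))

# The net position is now correct, and the full history is preserved.
print(sum(t.amount for t in ledger))  # 9.9
```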

For many organizations, this represents by far the largest portion of data by volume. The main data management challenges tend to be focused on the implications of this - how much to collect, where to store it, how to move it around, how soon can it be deleted or archived.


Gold

The second type of data represents current reality. This kind of data must be promptly and efficiently updated to reflect real-world changes: the customer changes address, an employee moves to a different department. The changes themselves may be registered as Bronze Data, and we may sometimes want to check what the position was at a given time in the past, or is expected to be at a given time in the future. But what we usually want to know is where the customer now resides, or where Sam now works.

Some of these updates can be regarded as simple facts, based on observations or reports (a customer tells us her address). Some updates are derived from other data, using calculation or inference rules. And some updates are based on decisions - for example, the price of this product shall be X. 

And not all of these updates can be trusted. If you receive an email from a supplier requesting payment to a different bank account, you probably want to check that this email is genuine before updating the supplier record.

Gold Data typically involves much smaller volumes than Bronze, but it is much more critical to the business if you get it wrong.


Mercury

Finally, we have data with a degree of uncertainty, including estimates and forecasts. This data is fluid, it can move around for no apparent reason. It can be subjective or based on unreliable or partial sources. Nevertheless, it can be a rich source of insight and intelligence.

This category also includes projected and speculative data. For example, we might be interested in developing a fictional "what if" scenario - what if we opened x more stores, what if we changed the price of this product to y?

For some reason, an estimate that is generated by an algorithm or mathematical model is sometimes taken more seriously than an estimate pulled out of the air by a subject matter expert. However, as Cathy O'Neil reminds us, algorithms are themselves merely opinions embedded in code.

If you aren't sure whether to trust an estimate, you can scrutinize the estimation process. For example, you might suspect that the subject matter expert provides more optimistic estimates after lunch. Or you could just get a second opinion. Two independent but similar opinions might give you more confidence than one extremely precise but potentially flawed opinion.

As well as estimates and forecasts, Mercury data may include assessments of various kinds. For example, we may want to know a customer's level of satisfaction with our products and services. Opinion surveys provide some relevant data points, but what about the customers who don't complete these surveys? And what if we pick up different opinions from different individuals within a large customer organization? In any case, these opinions change over time, and we may be able to correlate these shifts in opinion with specific good or bad events.

Thus Mercury data tend to be more complex than Bronze or Gold data, and can often be interpreted in different ways.


Update: Glass

@tonyjoyce suggests a fourth type.


This is a great insight. If you are not careful, you will end up with pieces of broken glass in your data. While this kind of data may be necessary, it is fragile and has to be treated with due care, and can't just be chucked around like bronze or gold.

Single Version of Truth (SVOT)

Bronze and Gold data usually need to be reliable and consistent. If two data stores have different addresses for the same customer, this could indicate any of the following errors.
  • The data in one of the data stores is incorrect or out-of-date. 
  • It’s not the same customer after all. 
  • It’s not the same address. For example, one is the billing address and the other is the delivery address.
For the purposes of data integrity and interoperability, we need to eliminate such errors. We then have a single version of the truth (SVOT), possibly taken from a single source of truth (SSOT).
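By way of illustration only (the field names and matching rules here are hypothetical assumptions, not from the post), a simple reconciliation check might distinguish a genuine conflict from a billing/delivery distinction:

```python
def reconcile_addresses(record_a: dict, record_b: dict) -> str:
    """Classify an apparent address mismatch between two data stores.
    Assumes each record has 'customer_id', 'address' and 'address_type' fields."""
    if record_a["customer_id"] != record_b["customer_id"]:
        return "different customers - no conflict"
    if record_a["address_type"] != record_b["address_type"]:
        return "different address types (e.g. billing vs delivery) - no conflict"
    if record_a["address"] == record_b["address"]:
        return "consistent"
    return "conflict - one record is incorrect or out of date"

crm_record = {"customer_id": "C042", "address_type": "billing", "address": "1 High St"}
logistics_record = {"customer_id": "C042", "address_type": "delivery", "address": "2 Low Rd"}
print(reconcile_addresses(crm_record, logistics_record))
```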

Facts and derivations may be accurate or inaccurate. In the case of a simple fact, inaccuracy may be attributed to various causes, including translation errors, carelessness or dishonesty. Calculations may be inaccurate either because the input data are inaccurate or incomplete, or because there is an error in the derivation rule itself. (However, the derived data can sometimes be more accurate or useful, especially if random errors and variations are smoothed out.)

For decisions however, it doesn’t make sense to talk about accuracy / inaccuracy, except in very limited cases. Obviously if someone decides the price of an item shall be x pounds, but this is incorrectly entered into the system as x pence, this is going to cause problems. But even if x pence is the wrong price, arguably it is what the price is until someone fixes it.


Plural Version of Truth (PVOT) 

But as I've pointed out in several previous posts, the Single Version of Truth (SVOT) or Single Source of Truth (SSOT) isn't appropriate for all types of data. Particularly not Mercury Data. When making sense of complex situations, having alternative views provides diversity and richness of interpretation.

Analytical systems may be able to compare alternative data values from different sources. For example, two forecasting models might produce different estimates of the expected revenue from a given product. Intelligent use of these estimates doesn’t entail choosing one and ignoring the other. It means understanding why they are different, and taking appropriate action.
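For instance (a hedged sketch with made-up numbers and an arbitrary tolerance), rather than discarding one estimate, we might keep both and flag where they diverge enough to need investigation:

```python
def compare_forecasts(model_a: float, model_b: float, tolerance: float = 0.10) -> dict:
    """Compare two revenue forecasts rather than picking one and ignoring the other."""
    midpoint = (model_a + model_b) / 2
    divergence = abs(model_a - model_b) / midpoint
    return {
        "model_a": model_a,
        "model_b": model_b,
        "divergence": round(divergence, 3),
        "action": "investigate why the models differ" if divergence > tolerance
                  else "estimates broadly agree",
    }

print(compare_forecasts(1_200_000, 950_000))    # large divergence: investigate
print(compare_forecasts(1_200_000, 1_150_000))  # close enough to act on
```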

Or what about conflicting assessments? If we are picking up a very high satisfaction score from some parts of the customer organization, and a low satisfaction score from other parts of the same organization, we shouldn't simply average them out. The difference between these two scores could be telling us something important; it might be revealing an opportunity to engage differently with the two parts of the customer organization.

And for some kinds of Mercury Data, it doesn't even make sense to ask whether they are accurate or inaccurate. Someone may postulate x more stores, but this doesn't imply that x is true, or even likely; it is merely speculative. And this speculative status is inherited by any forecasts or other calculations based on x. (Just look at the discourse around COVID data for topical examples.)


Master Data Management (MDM)

The purpose of Master Data Management is not just to provide a single source of data for Gold Data - sometimes called the Golden Record - but to provide a single location for updates. A properly functioning MDM solution will execute these updates consistently and efficiently, and ensure all consumers of the data (whether human or software) are using the updated version.
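A minimal sketch of that idea (assuming a simple in-process publish/subscribe mechanism; a real MDM product would do far more, and these class and method names are invented):

```python
class MasterDataHub:
    """Single location for updates to a golden record, with notification of consumers."""

    def __init__(self):
        self._golden_records = {}   # key -> current attributes
        self._subscribers = []      # callables notified on every update

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def update(self, key: str, **attributes):
        record = self._golden_records.setdefault(key, {})
        record.update(attributes)
        for notify in self._subscribers:   # push the change to every consumer
            notify(key, dict(record))

hub = MasterDataHub()
hub.subscribe(lambda key, record: print(f"billing system sees {key}: {record}"))
hub.subscribe(lambda key, record: print(f"CRM sees {key}: {record}"))
hub.update("customer:C042", address="1 High St, Newtown")
```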

There is an important connection to draw out between master data management and trust.

In order to trust Bronze Data, we simply need some assurance that it is correctly recorded and can never be changed. (“The moving finger writes …”) In some contexts, a central authority may be able to provide this assurance. In systems with no central authority, Blockchain can guarantee that a data item has not been changed, although Blockchain alone cannot guarantee that it was correctly recorded in the first place.
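As a toy illustration of the tamper-evidence part (not a real blockchain, and it says nothing about whether the data was correct when first recorded), each record can carry a hash of its predecessor, so any retrospective edit is detectable:

```python
import hashlib
import json

def append_block(chain: list, record: dict) -> None:
    previous_hash = chain[-1]["hash"] if chain else "genesis"
    payload = json.dumps(record, sort_keys=True) + previous_hash
    chain.append({"record": record, "prev": previous_hash,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify(chain: list) -> bool:
    """Recompute every hash; any retrospective edit breaks the chain."""
    previous_hash = "genesis"
    for block in chain:
        payload = json.dumps(block["record"], sort_keys=True) + previous_hash
        if block["prev"] != previous_hash or \
           block["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        previous_hash = block["hash"]
    return True

chain = []
append_block(chain, {"customer": "C042", "product": "P17", "amount": 99.0})
append_block(chain, {"customer": "C099", "product": "P03", "amount": 15.0})
print(verify(chain))          # True
chain[0]["record"]["amount"] = 9.0
print(verify(chain))          # False - the change is detectable
```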

For Gold Data, trustworthiness is more complicated, as there will need to be an ongoing series of automatic and manual updates. Master data management will provide the necessary sociotechnical superstructure to manage and control these updates. For example, what are the controls on updating a supplier's bank account details?
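To make the bank account example concrete (a hypothetical sketch of one possible control, not a prescription), a change to payment details might require independent confirmation before it takes effect:

```python
def update_bank_details(supplier: dict, new_account: str,
                        confirmed_by_phone: bool, approved_by_second_person: bool) -> dict:
    """Apply a change to supplier bank details only if the agreed controls are met."""
    if not confirmed_by_phone:
        raise ValueError("change not confirmed with the supplier via a known contact")
    if not approved_by_second_person:
        raise ValueError("change requires a second approver (four-eyes principle)")
    updated = dict(supplier)           # never mutate the master record directly
    updated["bank_account"] = new_account
    return updated

supplier = {"id": "S123", "name": "Acme Ltd", "bank_account": "11-22-33 12345678"}
updated = update_bank_details(supplier, "44-55-66 87654321",
                              confirmed_by_phone=True, approved_by_second_person=True)
print(updated["bank_account"])
```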

There will always be requirements for data integrity between Bronze Data and Gold Data. Firstly, there will typically be references from Bronze Data to Gold Data. For example, a transaction record may reference a specific customer purchasing a specific product. And secondly, there may be attributes of the Gold Data that are updated as a result of each transaction. For example, the stock levels of a product will be affected by sales of that product.
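To make those two kinds of dependency concrete (a simplified sketch; the data structures are invented), a Bronze transaction references Gold records, and posting it also adjusts a Gold attribute:

```python
# Gold data: current state, keyed by identifier
products = {"P17": {"name": "maple syrup", "stock_level": 100}}
customers = {"C042": {"name": "A. Customer"}}

# Bronze data: an append-only list of events referencing Gold identifiers
sales = []

def record_sale(customer_id: str, product_id: str, quantity: int) -> None:
    # referential integrity: the Bronze record must point at existing Gold records
    assert customer_id in customers and product_id in products
    sales.append({"customer": customer_id, "product": product_id, "quantity": quantity})
    # the Gold attribute is updated as a result of the Bronze transaction
    products[product_id]["stock_level"] -= quantity

record_sale("C042", "P17", 3)
print(products["P17"]["stock_level"])   # 97
```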

However, as we've seen, the data management challenges of Bronze Data are not the same as the challenges for Gold Data. And the challenges of Mercury Data are different again. So it is better to focus your MDM efforts exclusively on Gold Data. (And avoid splinters of Glass.)



Post prompted by a discussion on LinkedIn with Robert Daniels-Dwyer, Steve Fisher and Steve Lenny. https://www.linkedin.com/posts/danielsdwyer_dataarchitecture-datamanagement-enterprisearchitecture-activity-6673873980866228224-Mu1O

Updated 15 July 2020 following suggestion by Tony Joyce.

Tuesday, December 03, 2019

Data Strategy - Agility

This is one of a series of posts looking at the four key dimensions of data and information that must be addressed in a data strategy - reach, richness, agility and assurance.



In previous posts, I looked at Reach, which is about the range of data sources and destinations, and Richness, which is about the complexity of data. Now let me turn to Agility - the speed and flexibility of response to new opportunities and changing requirements.

Not surprisingly, lots of people are talking about data agility, including some who want to persuade you that their products and technologies will help you to achieve it. Here are a few of them.
Data agility is when your data can move at the speed of your business. For companies to achieve true data agility, they need to be able to access the data they need, when and where they need it. (Pinckney)
Collecting first-party data across the customer lifecycle at speed and scale. (Jones)
Keep up with an explosion of data. ... For many enterprises, their ability to collect data has surpassed their ability to organize it quickly enough for analysis and action. (Scott)
How quickly and efficiently you can turn data into accurate insights. (Tuchen)
But before we look at technological solutions for data agility, we need to understand the requirements. The first thing is to empower, enable and encourage people and teams to operate at a good tempo when working with data and intelligence, with fast feedback and learning loops.

Under a trimodal approach, for example, pioneers are expected to operate at a faster tempo, setting up quick experiments, so they should not be put under the same kind of governance as settlers and town planners. Data scientists often operate in pioneer mode, experimenting with algorithms that might turn out to help the business, but often don't. Obviously that doesn't mean zero governance, but appropriate governance. People need to understand what kinds of risk-taking are accepted or even encouraged, and what should be avoided. In some organizations, this will mean a shift in culture.

Beyond trimodal, there is a push towards self-service ("citizen") data and intelligence. This means encouraging and enabling active participation from people who are not doing this on a full-time basis, and may have lower levels of specialist knowledge and skill.

Besides knowledge and skills, there are other important enablers that people need to work with data. They need to be able to navigate and interpret, and this calls for meaningful metadata, such as data dictionaries and catalogues. They also need proper tools and platforms. Above all, they need an awareness of what is possible, and how it might be useful.

Meanwhile, enabling people to work quickly and effectively with data is not just about giving them relevant information, along with decent tools and training. It's also about removing the obstacles.

Obstacles? What obstacles?

In most large organizations, there is some degree of duplication and fragmentation of data across enterprise systems. There are many reasons why this happens, and the effects may be felt in various areas of the business, degrading the performance and efficiency of various business functions, as well as compromising the quality and consistency of management information. System interoperability may be inadequate, resulting in complicated workflows and error-prone operations.

But perhaps the most important effect is on inhibiting innovation. Any new IT initiative will need either to plug into the available data stores or create new ones. If this is to be done without adding further to technical debt, then the data engineering (including integration and migration) can often be more laborious than building the new functionality the business wants.

Depending on whom you talk to, this challenge can be framed in various ways - data engineering, data integration and integrity, data quality, master data management. The MDM vendors will suggest one approach, the iPaaS vendors will suggest another approach, and so on. Before you get lured along a particular path, it might be as well to understand what your requirements actually are, and how these fit into your overall data strategy.

And of course your data strategy needs to allow for future growth and discovery. It's no good implementing a single source of truth or a universal API to meet your current view of CUSTOMER or PRODUCT, unless this solution is capable of evolving as your data requirements evolve, with ever-increasing reach and richness. As I've often discussed on this blog, one approach to building in flexibility is to use appropriate architectural patterns, such as loose coupling and layering. These should give you some level of protection against future variation and changing requirements, and should probably feature somewhere in your data strategy.
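For example (a deliberately simplified sketch with hypothetical class names, not a recommended design), a thin access layer can insulate consumers from changes in where and how customer data is actually held:

```python
from abc import ABC, abstractmethod

class CustomerRepository(ABC):
    """Consumers depend on this interface, not on any particular data store."""
    @abstractmethod
    def get_customer(self, customer_id: str) -> dict: ...

class LegacyCrmRepository(CustomerRepository):
    def get_customer(self, customer_id: str) -> dict:
        # in reality: a call to the legacy CRM
        return {"id": customer_id, "source": "legacy CRM"}

class MasterDataRepository(CustomerRepository):
    def get_customer(self, customer_id: str) -> dict:
        # in reality: a call to a newer master data service
        return {"id": customer_id, "source": "master data service"}

def send_welcome_email(repo: CustomerRepository, customer_id: str) -> None:
    customer = repo.get_customer(customer_id)
    print(f"emailing customer {customer['id']} (data from {customer['source']})")

# The consuming code is unchanged when the underlying source is swapped.
send_welcome_email(LegacyCrmRepository(), "C042")
send_welcome_email(MasterDataRepository(), "C042")
```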

Next post - Assurance


Richard Jones, Agility and Data: The Heart of a Digital Experience Strategy (WayIn, 22 November 2018)

Tom Pinckney, What's Data Agility Anyway (Braze Magazine, 25 March 2019)

Jim Scott, Why Data Agility is a Key Driver of Big Data Technology Development (24 March 2015)

Mike Tuchen, Do You Have the Data Agility Your Business Needs? (Talend, 14 June 2017)

Related posts: Enterprise OODA (April 2012), Beyond Trimodal: Citizens and Tourists (November 2019)

Saturday, August 03, 2019

Towards the Data-Driven Business

If we want to build a data-driven business, we need to appreciate the various roles that data and intelligence can play in the business - whether improving a single business service, capability or process, or improving the business as a whole. The examples in this post are mainly from retail, but a similar approach can easily be applied to other sectors.


Sense-Making and Decision Support

The traditional role of analytics and business intelligence is helping the business interpret and respond to what is going on.

Once upon a time, business intelligence always operated with some delay. Data had to be loaded from the operational systems into the data warehouse before they could be processed and analysed. I remember working with systems that generated management information based on yesterday's data, or even last month's data. Of course, such systems don't exist any more (!?), because people expect real-time insight, based on streamed data.

Management information systems are supposed to support individual and collective decision-making. People often talk about actionable intelligence, but of course it doesn't create any value for the business until it is actioned. Creating a fancy report or dashboard isn't the real goal, it's just a means to an end.

Analytics can also be used to calculate complicated chains of effects on a what-if basis. For example, if we change the price of this product by this much, what effect is this predicted to have on the demand for other products, what are the possible responses from our competitors, how does the overall change in customer spending affect supply chain logistics, do we need to rearrange the shelf displays, and so on. How sensitive is Y to changes in X, and what is the optimal level of Z?

Analytics can also be used to support large-scale optimization - for example, solving complicated scheduling problems.

 
Automated Action

Increasingly, we are looking at the direct actioning of intelligence, possibly in real-time. The intelligence drives automated decisions within operational business processes, often without a human-in-the-loop, where human supervision and control may be remote or retrospective. A good example of this is dynamic retail pricing, where an algorithm adjusts the prices of goods and services according to some model of supply and demand. In some cases, optimized plans and schedules can be implemented without a human in the loop.
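A toy example of that kind of automated decision (the pricing rule, thresholds and guard rails here are invented for illustration, not a recommended model):

```python
def dynamic_price(base_price: float, stock_level: int, recent_daily_sales: float) -> float:
    """Adjust price according to a crude supply/demand signal, within guard rails."""
    if stock_level <= 0:
        return base_price                     # nothing to sell; leave the price alone
    days_of_cover = stock_level / max(recent_daily_sales, 0.1)
    if days_of_cover < 3:                     # scarce: nudge the price up
        adjustment = 1.10
    elif days_of_cover > 30:                  # overstocked: nudge the price down
        adjustment = 0.90
    else:
        adjustment = 1.0
    # guard rails keep the algorithm's decisions within humanly approved bounds
    return round(min(max(base_price * adjustment, base_price * 0.8), base_price * 1.2), 2)

print(dynamic_price(base_price=10.00, stock_level=20, recent_daily_sales=15))   # 11.0
print(dynamic_price(base_price=10.00, stock_level=500, recent_daily_sales=5))   # 9.0
```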

So the data doesn't just flow from the operational systems into the data warehouse, but there is a control flow back into the operational systems. We can call this closed loop intelligence.

(If it takes too much time to process the data and generate the action, the action may no longer be appropriate. A few years ago, one of my clients wanted to use transaction data from the data warehouse to generate emails to customers - but with their existing architecture there would have been a 48 hour delay from the transaction to the email, so we needed to find a way to bypass this.)


Managing Complexity

If you have millions of customers buying hundreds of thousands of products, you need ways of aggregating the data in order to manage the business effectively. Customers can be grouped into segments, products can be grouped into categories, and many organizations use these groupings as a basis for dividing responsibilities between individuals and teams. However, these groupings are typically inflexible and sometimes seem perverse.

For example, in a large supermarket, after failing to find maple syrup next to the honey as I expected, I was told I should find it next to the custard. There may well be a logical reason for this grouping, but this logic was not apparent to me as a customer.

But the fact that maple syrup is in the same product category as custard doesn't just affect the shelf layout, it may also mean that it is automatically included in decisions affecting the custard category and excluded from decisions affecting the honey category. For example, pricing and promotion decisions.

A data-driven business is able to group things dynamically, based on affinity or association, allowing simple and powerful decisions to be made for each dynamic group at the right level of aggregation.

Automation can then be used to cascade the action to all affected products, making the necessary price, logistical and other adjustments for each product. This means that a broad plan can be quickly and consistently implemented across thousands of products.
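A simplified sketch of that cascade (the affinity rule, product data and discount are all invented for illustration):

```python
products = [
    {"sku": "P01", "name": "maple syrup", "price": 4.50, "bought_with": {"pancake mix", "honey"}},
    {"sku": "P02", "name": "honey",       "price": 3.80, "bought_with": {"maple syrup", "tea"}},
    {"sku": "P03", "name": "custard",     "price": 1.20, "bought_with": {"tinned fruit"}},
]

def dynamic_group(products: list, anchor: str) -> list:
    """Group products by purchase affinity with an anchor product, not by fixed category."""
    return [p for p in products
            if p["name"] == anchor or anchor in p["bought_with"]]

def apply_promotion(group: list, discount: float) -> None:
    """Cascade one pricing decision across every product in the dynamic group."""
    for product in group:
        product["price"] = round(product["price"] * (1 - discount), 2)

breakfast_group = dynamic_group(products, "maple syrup")
apply_promotion(breakfast_group, discount=0.10)
print([(p["name"], p["price"]) for p in products])
```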


Experimentation and Learning

In a data-driven business, every activity is designed for learning as well as doing. Feedback is used in the cybernetic sense - collecting and interpreting data to control and refine business rules and algorithms.

In a dynamic world, it is necessary to experiment constantly. A supermarket or online business is a permanent laboratory for testing the behaviour of its customers. For example, in A/B testing, alternatives are presented to different customers on different occasions to test which one gets the better response. As I mentioned in an earlier post, Netflix declares itself "addicted" to the methodology of A/B testing.

In a simple controlled experiment, you change one variable and leave everything else the same. But in a complex business world, everything is changing. So you need advanced statistics and machine learning, not only to interpret the data, but also to design experiments that will produce useful data.
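As a very basic illustration (a two-proportion comparison with made-up numbers; real experimentation platforms use more sophisticated designs than this):

```python
from math import sqrt

def ab_test(conversions_a: int, visitors_a: int,
            conversions_b: int, visitors_b: int) -> dict:
    """Compare conversion rates of two variants using a simple two-proportion z-test."""
    rate_a, rate_b = conversions_a / visitors_a, conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / standard_error
    return {"rate_a": round(rate_a, 4), "rate_b": round(rate_b, 4),
            "z_score": round(z, 2),
            "conclusion": "B looks better" if z > 1.96 else "no clear winner yet"}

print(ab_test(conversions_a=120, visitors_a=2400, conversions_b=165, visitors_b=2400))
```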


Managing Organization

A traditional command-and-control organization likes to keep the intelligence and insight in the head office, close to top management. An intelligent organization on the other hand likes to mobilize the intelligence and insight of all its people, and encourage (some) local flexibility (while maintaining global consistency). With advanced data and intelligence tools, power can be driven to the edge of the organization, allowing for different models of delegation and collaboration. For example, retail management may feel able to give greater autonomy to store managers, but only if the systems provide faster feedback and more effective support. 


Transparency

Related to the previous point, data and intelligence can provide clarity and governance to the business, and to a range of other stakeholders. This has ethical as well as regulatory implications.

Among other things, transparent data and intelligence reveal their provenance and derivation. (This isn't the same thing as explanation, but it probably helps.)




Obviously most organizations already have many of the pieces of this, but there are typically major challenges with legacy systems and data - especially master data management. Moving onto the cloud, and adopting advanced integration and robotic automation tools may help with some of these challenges, but it is clearly not the whole story.

Some organizations may be lopsided or disconnected in their use of data and intelligence. They may have very sophisticated analytic systems in some areas, while other areas are comparatively neglected. There can be a tendency to over-value the data and insight you've already got, instead of thinking about the data and insight that you ought to have.

Making an organization more data-driven doesn't always entail a large transformation programme, but it does require a clarity of vision and pragmatic joined-up thinking.


Related posts: Rhyme or Reason: The Logic of Netflix (June 2017), Setting off towards the Data-Driven Business (August 2019)


Updated 13 September 2019

Tuesday, November 11, 2008

Post Before Processing

In Talk to the Hand, Saul Caganoff describes his experiences of errors when entering his timesheet data into one of those time-recording systems many of us have to use. He goes on to draw some general lessons about error-handling in business process management (BPM). In Saul's account, this might sometimes necessitate suspending a business rule.

My own view of the problem starts further back - I think it stems from an incorrect conceptual model. Why should your perfectly reasonable data get labelled as error or invalid just because it is inconsistent with your project manager's data? This happens in a lot of old bureaucratic systems because they are designed on the implicit (hierarchical, top-down) assumption that the manager (or systems designer) is always right and the worker (or data entry clerk) is always the one that gets things wrong. It's also easier for the computer system to reject the new data items, rather than go back and question items (such as reference data) that have already been accepted into the database.

I prefer to label such inconsistencies as anomalies, because that doesn't imply anyone in particular being at fault.

It would be crazy to have a business rule saying that anomalies are not allowed. Anomalies happen. What makes sense is to have a business rule saying how and when anomalies are recognized (i.e. what counts as an anomaly) and resolved (i.e. what options are available to whom).

Then you never have to suspend the rule. It is just a different, more intelligent kind of rule.
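A rough sketch of the difference (the timesheet fields and reference data here are hypothetical): instead of rejecting an inconsistent entry, the system accepts it and flags the anomaly for later resolution.

```python
def post_timesheet_entry(entries: list, anomalies: list,
                         entry: dict, reference_projects: set) -> None:
    """Post-before-processing: always accept the entry; flag inconsistencies as anomalies."""
    entries.append(entry)                      # never reject the worker's data
    if entry["project"] not in reference_projects:
        anomalies.append({
            "entry": entry,
            "issue": "project code not in reference data",
            "resolution_options": ["correct the entry", "update the reference data"],
        })

entries, anomalies = [], []
post_timesheet_entry(entries, anomalies,
                     {"worker": "Sam", "project": "PRJ-999", "hours": 7.5},
                     reference_projects={"PRJ-001", "PRJ-002"})
print(len(entries), len(anomalies))   # 1 1 - the data is in, and the anomaly is tracked
```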

One of my earliest experiences of systems analysis was designing order processing and book-keeping systems. When I visited the accounts department, I saw people with desks stacked with piles of paper. It turned out that these stacks were the transactions that the old computer system wouldn't accept, so the accounts clerks had developed a secondary manual system for keeping track of all these invalid transactions until they could be corrected and entered.

According to the original system designer, the book-keeping process had been successfully automated. But what had been automated was over 90% of the transactions, representing less than 20% of the time and effort. So I said: why don't we build a computer system that supports the work the accounts clerks actually do? Let them put all these dodgy transactions into the database and sort them out later.

But I was very junior and didn't know how things were done. And of course the accounts clerks had even less status than I did. The high priests who commanded the database didn't want mere users putting dodgy data in, so it didn't happen.


Many years later, I came across the concept of Post Before Processing, especially in military or medical systems. If you are trying to load or unload an airplane in a hostile environment, or trying to save the life of a patient, you are not going to devote much time or effort to getting the paperwork correct. So all sorts of incomplete and inaccurate data get shoved quickly into the computer, and then sorted out later. These systems are designed on the principle that it is better to have some data, however incomplete or inaccurate, than none at all. This was a key element of the DoD Net-Centric Data Strategy (2003).

The Post Before Processing paradigm also applies to intelligence. For example, here is a US Department of Defense ruling on the sharing of intelligence data.
In the past, intelligence producers and others have held information pending greater completeness and further interpretative processing by analysts. This approach denies users the opportunity to apply their own context to data, interpret it, and act early on to clarify and/or respond. Information producers, particularly those at large central facilities, cannot know even a small percentage of potential users' knowledge (some of which may exceed that held by a center) or circumstances (some of which may be dangerous in the extreme). Accordingly, it should be the policy of DoD organizations to publish data assets at the first possible moment after acquiring them, and to follow-up initial publications with amplification as available. (Net-Centric Enterprise Services Technical Guide)


See also

Saul Caganoff, Talk to the Hand (11 November 2008), Progressive Data Constraints (21 November 2008)

Jeff Jonas, Introducing the concept of network-centric warfare and post before processing (21 January 2006), The Next Generation of Network-Centric Warfare: Process at Posting or Post at Processing (Same thing) (31 January 2007)

Related Post: Progressive Design Constraints (November 2008)