When it comes to data management and data governance, “data discovery” has historically been a vague term. Is it simply the ability to connect and create an inventory of an enterprise’s data assets — or is there more involved in the process?
As an analyst at Gartner for 14 years, I took hundreds of calls on data discovery. When a client would ask about doing data discovery, I would inevitably ask, “What for?” Why was the enterprise interested in doing discovery in the first place? Some would indicate that they wanted to create an inventory to see what data assets they had — and this was always a key indication that the project would not go very far. Why?
Data Inventory vs Data Catalog
Let’s start with what will happen to the discovery results — where will they go? I often used to get the question about the difference between a data inventory and a data catalog, which led me to create an analogy from my college days, back when the internet was a few years from its embryonic phase.
At the time, “research” consisted of going to the library and looking up information in actual hard copy books and reference material. A data inventory would be akin to a straightforward, complete list of all of the books in the library. And that’s better than nothing, but when you’re talking about hundreds of thousands to millions of books, this approach is impractical and of little value.
Many libraries use the Dewey Decimal System, which classifies books by subject so that all books about, say, finance are shelved in one area and can be easily “discovered” by consulting a card catalog. Today’s data catalog follows the same idea in electronic form.
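To make the analogy concrete, here is a minimal Python sketch of the difference: the inventory is a flat list of assets, and the catalog is the same list organized by a subject classification. The asset names and subject labels are hypothetical, chosen only for illustration, and a real catalog would carry far richer metadata.

```python
from collections import defaultdict

# Hypothetical asset records; names and subject labels are illustrative only.
inventory = [
    {"name": "q3_revenue.xlsx", "subject": "finance"},
    {"name": "campaign_leads.csv", "subject": "marketing"},
    {"name": "loan_ledger.db", "subject": "finance"},
]


def build_catalog(assets):
    """Group asset names by subject, like shelving books by classification."""
    catalog = defaultdict(list)
    for asset in assets:
        catalog[asset["subject"]].append(asset["name"])
    return dict(catalog)


catalog = build_catalog(inventory)

# The inventory can only be scanned end to end; the catalog answers
# "what do we have about finance?" with a single lookup.
print(catalog["finance"])
```

The inventory and the catalog hold the same assets. What the catalog adds is the classification that makes them findable.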
Discovery Starts with the Business Need
A data inventory in itself has little use except to show a disparate list of assets. Organizations need to know more about their data assets than simply what they have if those assets are to be of any value. This initiative does not start with querying the data but at the complete opposite end of the organization — in business initiatives. Why discover and inventory data if the effort does not tie to business goals and, once applied, help achieve them?
As an industry analyst, when asked about inventorying the data, I would immediately ask to what business KPI the initiative was tied. I would get a variety of responses, but very seldom was one of them a tangible goal or KPI that answered, for example, one of the following questions:
- How did inventorying data specifically help sales meet quarterly numbers or marketing meet campaign goals?
- How did the inventory help with fulfilling privacy data subject access requests (DSARs)?
- How did the inventory better protect the organization?
Once we begin to understand why we are doing the discovery and what the business requires, we can start to ask important questions of the data itself, like:
- How is data interrelated?
- How is the data used?
- Where did it originate?
A Deep Set of Capabilities Beyond Inventory
In order to understand the answers to these questions, we need to:
- move beyond simply doing an inventory as part of discovery and start digging deeper into the metadata, including inferred and implied metadata
- understand data labels and tags, how the data is classified, and how it should be classified with additional tags
- correlate data so that we can see all of the data elements that are tied to one entity — such as a single individual — so we can either effectively market to that person or holistically protect their privacy
- cluster data to find, for example, all of the mortgage documents in file shares or customer account numbers in various databases throughout the organization
- cover all enterprise data: on average, roughly 20% is structured and 80% is unstructured
- build a modern data catalog as part of active data management within the overall process
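As a rough illustration of two of the capabilities above, tagging and correlation, the following Python sketch infers an "email" tag from field values and then groups rows from different sources under the individual they refer to. The source names, field names, and the simple email pattern are all assumptions made for this example. Real discovery tooling would rely on many more classifiers and on more robust identity matching.

```python
import re
from collections import defaultdict

# Deliberately simple pattern, assumed for illustration; not a full email validator.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Hypothetical sources: note that the field holding the email is named
# differently in each one, so the tag has to be inferred from the value.
sources = {
    "crm.contacts": [{"email": "ana@example.com", "city": "Lisbon"}],
    "billing.accounts": [{"contact": "Ana@Example.com", "account_no": "AC-1001"}],
    "support.tickets": [{"reporter": "bo@example.com", "topic": "login"}],
}


def tag_value(value):
    """Return a classification tag inferred from the value itself."""
    return "email" if EMAIL_RE.fullmatch(value) else "untagged"


def correlate_by_email(sources):
    """Map each normalized email to the (source, row) pairs that mention it."""
    entities = defaultdict(list)
    for source, rows in sources.items():
        for row in rows:
            for value in row.values():
                if tag_value(value) == "email":
                    entities[value.lower()].append((source, row))
    return dict(entities)


entities = correlate_by_email(sources)

# Every source holding data about one individual, useful for either
# marketing to that person or answering a privacy request.
for source, _row in entities["ana@example.com"]:
    print(source)
```

Keying on the inferred tag rather than the column name is what lets the sketch link `crm.contacts` and `billing.accounts` even though they label the same data differently.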
So data discovery is more than simply inventorying the data and then hoping the organization finds a use for it. Data discovery starts at the other end of the organization — in the business, in marketing and sales KPIs, in privacy and security initiatives — as something that can be tied to established KPIs, goals, or initiatives.
Once we know what we want to discover the data for, then we can go beyond just finding it to turning it into real value that can be measured.