Data is at the heart of the modern digital business. It defines how users engage and interact with a business. Understanding and analyzing customer content is, of course, not a new pursuit. The whole field of Big Data is a response to the need to better understand and anticipate customer behaviors by understanding the digital footprints they leave behind. However, with rapidly proliferating applications and digital touchpoints, companies are facing unprecedented data sprawl, making analysis harder while simultaneously complicating privacy and data protection.

A Picture Is Worth A Thousand Words

You can’t fully know your customer unless you know their data, but getting visibility into diffuse and constantly spreading Personal Information (PI) can sometimes feel impossible. Traditional approaches to building 360° views of customers required assembling data warehouses that were complex to manage and ultimately incomplete. Every new data lake for customer information was merely a weak facsimile of the data it aimed to represent, one that provided only a partial picture and made navigating the lake feel more like wading through a swamp.

Modern data governance tools aim to partially fill the void by giving organizations a clearer picture of what they have, wherever it resides. However, they are limited by relying on incomplete or even inaccurate surveys for finding and managing data sprawl. While human memory can translate beautifully into a piece of art, it is not a reliable input for deriving science from data. A painted picture will never offer the objective realism of a photograph, and even the most descriptive words can never accurately visualize the reality of the customer PI organizations collect and process.

Of Data Lakes & Data Swamps

The volume of data companies collect on their customers today is large and still growing. Identity data, however, has unique characteristics that make it possible to visualize without the need for yet another data lake, or data swamp, depending on your perspective. When the Google founders first tried to simplify the navigation of something as big as the Internet, their default was not to create a more searchable facsimile. Instead, they focused their efforts on building a smart index that mapped the sprawling relationships among the hyperlinks that define the World Wide Web.

When Facebook came onto the scene, it similarly realized that the secret to tackling the performance, scale and context challenges of mapping billions of intertwining human relationships was to elevate the concept of a social graph, which articulated the content and context of who and what interacted on its platform. Certainly data stores, warehouses and lakes still have their place in aggregating and analyzing data, but the essence of visualizing the social relationships was the social graph, just as Google’s link-based index, ranked by PageRank, had been for navigating through the Internet’s seeming disorder.

The lessons learned by Google and Facebook raise the question of why mapping the most vital asset an organization manages, its customer data, should be any different. Just like the Web and the social graph, personal data in organizations and enterprises is connected by relationships: data belongs to a specific data subject, data is stored in a given country, data is accessed by a common application, and so on. Traditional PI discovery tools miss all of this nuance because they simply look for anything that resembles a Social Security number or a credit card number. However, the relationship context is essential to understanding PI, protecting it and assuring privacy compliance in the era of regulations like GDPR that require complete knowledge of a person’s data.
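To make the idea of relationship context concrete, here is a minimal sketch in Python of indexing PI findings by the three relationships named above. The sample records, field names and helper functions are hypothetical and chosen only for illustration; this is not how any particular product works.

```python
from collections import defaultdict

# Hypothetical PI findings. Each one records a value plus the relationships
# mentioned in the text: data subject, country of storage, accessing application.
findings = [
    {"value": "4111111111111111", "kind": "credit_card",
     "subject": "alice", "country": "DE", "application": "billing"},
    {"value": "alice@example.com", "kind": "email",
     "subject": "alice", "country": "US", "application": "crm"},
    {"value": "bob@example.com", "kind": "email",
     "subject": "bob", "country": "US", "application": "crm"},
]


def build_index(items):
    """Index every finding under each of its relationships."""
    index = defaultdict(list)
    for item in items:
        for relation in ("subject", "country", "application"):
            index[(relation, item[relation])].append(item)
    return index


index = build_index(findings)


def related(relation, key):
    """Return all findings that share the given relationship value."""
    return index.get((relation, key), [])


# Everything known about one person, regardless of where it is stored.
alice_data = related("subject", "alice")
# Everything stored in one country, regardless of whose data it is.
us_data = related("country", "US")
```

A pattern-only scan would report three unrelated matches; the index lets the same findings be answered as "all of Alice's data" or "all data stored in the US" without copying the underlying data into a new store.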

Being Aware By Being There

Understanding customer data requires an effective way to visually map its distribution, motion and connectedness. Being there means “being” aware. This is vital for efforts around data governance. Where this really hits home, however, is when it comes to data protection and privacy compliance.

Past efforts at data protection were unsuccessful because they operated without context, often producing unacceptable errors. Knowing data risk requires context, which entails more information than just whether a number is 16 digits long. Data protection also requires an ability to de-identify the data in a way that preserves analytic value to the organization while still protecting the privacy of the person to whom the data belongs.
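As a rough illustration of both points, the Python sketch below accepts a 16-digit match only when a checksum and the surrounding field name agree, and then de-identifies a value with a keyed hash so that equal inputs still map to equal tokens. The hint list, function names and the choice of HMAC-SHA256 are assumptions made for this example; keyed hashing is pseudonymization rather than full anonymization, and a real deployment would need proper key management.

```python
import hashlib
import hmac
import re

CARD_PATTERN = re.compile(r"\b\d{16}\b")
# Hypothetical context hints; a real tool would use far richer signals.
CARD_HINTS = ("card", "visa", "mastercard", "payment")


def luhn_valid(digits):
    """Standard Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def likely_card_numbers(field_name, text):
    """Return 16-digit matches only when field context and checksum agree."""
    if not any(hint in field_name.lower() for hint in CARD_HINTS):
        return []
    return [m for m in CARD_PATTERN.findall(text) if luhn_valid(m)]


def pseudonymize(value, key):
    """Replace a value with a keyed hash. Equal inputs give equal tokens,
    so counts and joins on the token still work without exposing the value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# The same digits are treated differently depending on where they were found.
in_card_field = likely_card_numbers("card_number", "4111111111111111")
in_order_field = likely_card_numbers("order_id", "4111111111111111")
token = pseudonymize("4111111111111111", b"example-secret-key")
```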

Similarly, modern privacy regulations like the EU GDPR mandate a whole set of protections that are impossible to attain with just a data warehouse approach, or by using a PCI-era regular-expression-based tool for discovering sensitive information. Compliance requires context around data, such as residency, purpose of use, retention requirements, consent, lineage and, of course, affiliation with a specific person. Without the ability to understand and visualize this context and these relationships, it will be impossible to comply with requirements around consent, retention or the right to be forgotten.
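One way to picture this kind of context is as metadata carried alongside each discovered piece of PI. The Python sketch below is illustrative only: the record fields, sample data and query helpers are hypothetical, and the checks are simplified stand-ins for what a regulation actually requires.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PIRecord:
    """Hypothetical context attached to one discovered piece of PI."""
    subject_id: str      # the person the data belongs to
    source: str          # where the data was found (lineage)
    residency: str       # country where the data is stored
    purpose: str         # purpose of use
    collected_on: date
    retention_days: int
    consent_given: bool


def records_for_erasure(records, subject_id):
    """Everything tied to one person, e.g. for a right-to-be-forgotten request."""
    return [r for r in records if r.subject_id == subject_id]


def retention_expired(records, today):
    """Records held longer than their stated retention period."""
    return [r for r in records
            if r.collected_on + timedelta(days=r.retention_days) < today]


def missing_consent(records, purpose):
    """Records used for a purpose without recorded consent."""
    return [r for r in records if r.purpose == purpose and not r.consent_given]


records = [
    PIRecord("alice", "crm.contacts", "DE", "marketing", date(2020, 1, 1), 365, False),
    PIRecord("alice", "billing.cards", "DE", "payment", date(2024, 1, 1), 3650, True),
    PIRecord("bob", "crm.contacts", "US", "marketing", date(2024, 6, 1), 365, True),
]

to_erase = records_for_erasure(records, "alice")
expired = retention_expired(records, date(2024, 7, 1))
no_consent = missing_consent(records, "marketing")
```

None of these questions can be answered by a pattern match alone, because the answer depends on who the data belongs to, why it was collected and how long it has been kept.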

As organizations try to make sense of identity data across hundreds of petabytes, traditional approaches to discovery and visualization will break down. Solutions like BigID aim to rethink how Big Identity Data is discovered and visualized without adding new data management or security complexity.