Correlation Vs Classification: Reimagining Data Discovery in the Age of GDPR

March 6, 2018

With the advent of the privacy era and the looming General Data Protection Regulation (GDPR), organizations are starting to realize that relying on classification for data discovery has hit its limit. GDPR, and privacy more broadly, is about whose data you have, not just what data you have. Privacy is centered on identity; it’s about people. Classification-based discovery tools have no identity context, and therefore can’t address critical privacy challenges like data subject rights, or finding personal information (PI) beyond traditional PII categories. To meet the privacy challenges of 2018, it doesn’t make sense to depend on PCI-era technology developed to find highly structured data patterns. New problems require new approaches, and privacy-centered discovery requires identity-centered correlation.

From Content to Context

Classification-centric data discovery emerged decades ago to help organizations categorize data by type and to meet then-emerging compliance requirements like PCI and HIPAA. These tools rely on pattern matching to categorize data. Invariably, the classification is rooted in some variation on regular expressions, which sort similarly patterned data into categories. Most modern security tools that have discovery components (such as DLP, DRM, and DAM) are based on this kind of pattern recognition mechanism.

However, traditional classification has inherent weaknesses that become more pronounced when used for privacy use cases like GDPR. First, classification-based approaches lack accuracy. For well-structured data, like payment card information, classification-based tools can work well. However, they cannot distinguish between different data types that are similar in appearance. (For example, in the United States, Social Security numbers and ZIP+4 postal codes are both 9-digit numbers, and it’s not uncommon to store both without delimiters, such as dashes.) Classification becomes far less accurate where data types have fewer distinguishing characteristics or don’t follow well-defined patterns.
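The Social Security number and ZIP+4 ambiguity is easy to reproduce. The following is a minimal sketch, not taken from any particular product: the two patterns and the `classify` helper are hypothetical, and the sample values are made up. It shows that once delimiters are dropped, pattern matching alone returns both labels for the same value.

```python
import re

# Hypothetical patterns: each accepts the delimited form or nine bare digits.
PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}|\d{9}"),
    "zip_plus_4": re.compile(r"\d{5}-\d{4}|\d{9}"),
}


def classify(value: str) -> list[str]:
    """Return every label whose pattern matches the whole value."""
    return sorted(
        label for label, pattern in PATTERNS.items() if pattern.fullmatch(value)
    )


# With delimiters the patterns disagree, so the label is unambiguous.
print(classify("123-45-6789"))  # ['ssn']
print(classify("12345-6789"))   # ['zip_plus_4']

# Without delimiters both patterns match: content alone cannot decide.
print(classify("123456789"))    # ['ssn', 'zip_plus_4']
```

Tightening either pattern does not help here, because the two undelimited forms are identical strings of digits. Only surrounding context, such as the column they sit in or the person they sit next to, can separate them.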

Moreover, classification-based tools can’t distinguish what is personal from what is not. Europe defines “personal” very broadly. Simply put, data is personal if it appears in the context of a specific individual. Pattern matching alone can’t connect general data to a particular person or identity. It lacks the contextual awareness to determine that a pronoun or an IP address belongs to a given individual. Pattern-matching tools can match data to a type, but not to an identity.

Most importantly, classification can’t help answer questions regarding data subject rights. GDPR is fundamentally a regulation that enshrines data rights for individuals. EU citizens’ rights to access, port, erase, and rectify their data are being strengthened above and beyond the definitions and requirements of the 1995 Data Protection Directive (95/46/EC). For organizations, that means having to account for every individual’s data. Classification-based data discovery tools can’t provide identity context. That’s why privacy requires a new approach to data discovery rooted in identity correlation, and not just classification: privacy is about understanding the identity context of data along with its content.

Getting Smart on Identity Intelligence

BigID has taken a fundamentally different approach to data discovery, rooted in intelligent identity correlation. Privacy is about people, and to find people data you need to understand people context. To understand people or identity context, BigID applies machine learning to a customer’s existing data sets. This approach uses existing enterprise data to determine what personal information looks like in a given enterprise, and how that personal data is connected to an identity.

With BigID, the organization’s training data (or “seed” data) can be spread across different data sources, and any number of datasets can be used to bootstrap discovery. None have to be complete or comprehensive. These data sources are used to understand basic identifiers, relationships, and distributions. BigID then leverages properties of this data to contextualize various information in other data stores. Without requiring the deployment of software agents, and having been granted only “read” access, BigID can scan across any number of structured, unstructured and semi-structured repositories, mainframes, cloud environments, Big Data warehouses, and applications in order to find personal data and correlate it automatically to an identity.
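To make the idea of seed-driven correlation concrete, here is a deliberately simplified sketch. It is not BigID’s implementation, which the text describes as machine learning over identifiers, relationships, and distributions; it uses plain exact matching instead. The `SEED` and `ROWS` data, the source names, and the `correlate` function are all hypothetical. The point it illustrates is that a value with no personal "pattern", such as an IP address, becomes personal once it sits in the same record as a known identifier.

```python
from collections import defaultdict

# Hypothetical seed data: identity id -> identifiers already known for that person.
SEED = {
    "person-1": {"alice@example.com", "555-0100"},
    "person-2": {"bob@example.com"},
}

# Hypothetical rows read from other data stores, as (source name, record) pairs.
ROWS = [
    ("crm.orders", {"email": "alice@example.com", "ip_address": "203.0.113.7"}),
    ("web.logs", {"ip_address": "198.51.100.2"}),
]


def build_lookup(seed):
    """Invert the seed data so each known identifier points to its identity."""
    return {value: identity for identity, values in seed.items() for value in values}


def correlate(rows, seed):
    """Attribute every field of a row to the identity whose identifier appears in it."""
    lookup = build_lookup(seed)
    found = defaultdict(set)
    for source, row in rows:
        owners = {lookup[value] for value in row.values() if value in lookup}
        if len(owners) != 1:
            continue  # no match, or an ambiguous match: leave the row uncorrelated
        owner = owners.pop()
        for field, value in row.items():
            found[owner].add((source, field, value))
    return dict(found)


for identity, items in correlate(ROWS, SEED).items():
    print(identity, sorted(items))
```

In this sketch the IP address in `crm.orders` is attributed to `person-1` because of the email beside it, while the unaccompanied IP address in `web.logs` stays uncorrelated.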

When BigID finds unknown personal data (i.e. “dark data”) that it has not previously encountered, the BigID ML automatically correlates this data to an identity based on parameters like uniqueness, proximity, frequency, etc. This process continues with scans of each additional data source to build ever richer graphs of each person’s data holdings. Importantly, no personal data is ever copied to the BigID software running within the company’s environment. The BigID software only retains a hashed graph representation of each person’s data that can be used for subsequent search and data subject rights reporting.
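The text does not say how the hashed representation is built, so the following is only one plausible sketch of the general idea: keep where each value lives plus a keyed digest of it, never the value itself. The choice of HMAC-SHA-256, the `KEY` constant, and the function names are assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret key; a real deployment would manage this outside the code.
KEY = b"demo-key-not-for-production"


def digest(value: str, key: bytes) -> str:
    """Keyed SHA-256 digest, so the raw value is never stored in the index."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def index_findings(findings, key):
    """Build identity -> list of {source, field, value_hash} entries."""
    index = {}
    for identity, source, field, value in findings:
        index.setdefault(identity, []).append(
            {"source": source, "field": field, "value_hash": digest(value, key)}
        )
    return index


def locate(index, identity, value, key):
    """Return the (source, field) locations where this identity holds this value."""
    target = digest(value, key)
    return [
        (entry["source"], entry["field"])
        for entry in index.get(identity, [])
        if hmac.compare_digest(entry["value_hash"], target)
    ]


index = index_findings(
    [("person-1", "crm.orders", "email", "alice@example.com")], KEY
)
print(locate(index, "person-1", "alice@example.com", KEY))  # [('crm.orders', 'email')]
```

An index of this shape can answer "where is this person’s data?" for subject rights reporting while holding only locations and digests. Note that keyed hashing is pseudonymization rather than anonymization, so the index and its key still need protecting.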

Crucially, BigID doesn’t stop at correlation. Once data element relationships are understood, the data is then classified and automatically cataloged using the organization’s own data glossary. This means BigID’s customers don’t have to choose between correlation and classification. They get both, plus a comprehensive way to catalog data for easier parsing and analysis. However, unlike older classification-only tools, the resulting data maps include a full data inventory for each individual – essential for satisfying data subject rights requirements, such as right-to-be-forgotten in GDPR.

PI, Not Your Father’s PII

Starting with smart correlation prior to classification gives organizations a critical advantage in solving privacy use cases. Personal rights to access, portability, rectification, and erasure now become straightforward to operationalize. Data can be easily organized by residency to analyze cross-border flows and detect sovereignty violations. Existing consent logs can be correlated to data subjects in order to provide a comprehensive view of consent across applications by person. Access logs can be cross-referenced to user data activity in order to provide a fine-grained view of usage for any individual’s data. User profiles can be compared across different data stores in order to detect anomalies and possible account fraud.

Performing classification after correlation also provides a number of unique operational advantages. Classification-based tools tend to be optimized for specific data stores, such as structured, unstructured, or Big Data repositories, but not all of them. With BigID, scans can be performed across a wide range of data stores, including relational databases, file shares, Big Data environments, data warehouses, document repositories, ERP applications, NoSQL stores, SaaS, IaaS, and more, providing for the first time a true cross-platform customer data view.

Correlation does not require duplicating data or creating data warehouses, so companies can have a centralized view of an individual’s identifying information without centralizing the data. Correlation makes it easier to find PI (personal information), and not just PII, because discovery is based on context as well as content. And since the BigID correlation engine doesn’t preemptively try to match data types, it can correlate data across any language. Correlation can even surface relationships between encrypted and unencrypted data, helping to locate pseudo-identifiable data, which is also important for GDPR.

The Three C’s: Correlation, Classification, Cataloging

Correlation-centric search is not entirely new, although it is novel to data discovery. Internet search engines take a similar approach, using a hyperlink relevancy algorithm to index the internet efficiently for easier navigation. Social networks also leverage relationship graphs to help navigate connections among individuals. Applying similar approaches to indexing data brings a number of advantages, ranging from scale to data independence. But perhaps most importantly, BigID’s patent-pending identity-centric data discovery helps organizations address privacy use cases like those introduced by GDPR. Now companies can find PI and not just PII. They can address data subject rights, like right-to-be-forgotten. They can answer data sovereignty, residency, breach and consent questions more easily. And with BigID, they don’t have to settle for pattern-based classification alone. They can still classify data. They can catalog data. And for the first time, they can correlate data.

Author

Dimitri Sirota