Rethinking Data Classification

It’s rare to spot a flip phone these days when smartphones are practically ubiquitous. Yet, in the realm of data security, where precision and context are critical, too many are still using flip phone technology to discover, understand and classify personal data.

In the flip phone era, classification was intended as a means to an end. By determining where sensitive data was stored through endless Regular Expression tweaking and comparing the raw counts of matched PII, enterprises could – in theory – move from a panicked scramble to a focus on data sources with the greatest security and compliance risk.
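To make the legacy approach concrete, here is a minimal sketch of count-based pattern matching. The pattern, the source names, and the sample records are all hypothetical, chosen only for illustration; they are not taken from any particular product.

```python
import re
from collections import Counter

# Hypothetical pattern: digits grouped like a US Social Security number.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def count_matches(sources):
    """Return the number of pattern hits per data source name."""
    counts = Counter()
    for name, records in sources.items():
        for text in records:
            counts[name] += len(SSN_LIKE.findall(text))
    return counts

# Invented sample data: two sources with free-text records.
sources = {
    "hr_share": ["Employee SSN 123-45-6789", "Badge 555-12-3456 issued"],
    "orders_db": ["Order ref 987-65-4321 shipped"],
}

# Rank sources by raw match count, highest first.
ranked = count_matches(sources).most_common()
```

The ranking says where SSN-shaped strings are concentrated, but not whether a given hit is really a Social Security number (the badge and order references match too) or whose data it is.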

But now classification needs to function as an integral component for data management, data security, and data privacy outcomes. In the privacy era, identity correlation is integral to the value, relevance, and accuracy of classification.

Modern Classification: Context and Correlation

Without the context derived from identity correlation, enterprises are in the dark about what personal data they store and process – and by extension what privacy, security and compliance risks they face. Understanding the relationship between data values and correlated identities even before classification is applied avoids one of the fundamental shortcomings of legacy classification: it gets stuck on data values that look the same and has no mechanism to tell them apart.

Without a data catalog that is built using identity correlation, discovery and indexing across data sources, enterprises are stuck in the past, even as the volume and interconnectedness of their personal data proliferate. By incorporating cataloging features with classification, enterprises gain another layer of context through integrating personal data understanding with metadata analysis.

Just as classification that looks at data in isolation is losing relevance, classification itself should be seen as part of a broader approach that integrates classification, correlation, and cataloging.

Privacy Changes the Game

Personal data – as defined by new privacy mandates like the EU GDPR and the California Consumer Privacy Act – is sensitive based on whether it is associated with a person. A prime example is location data (especially apt in the smartphone era). Location data is not unique to any person, but it does become personal based on its association with a person.

Without that personal context, legacy classification can’t tell you anything about what is personal data – even if the technology can scan across more than a single type of data source or aggregate across data silos.

In the years since the first wave of data breaches and PCI-DSS requirements drove adoption of classification via pattern matching, there have been efforts to reduce the number of false positives and, more recently, to use machine learning to automate and refine resource-intensive RegEx training.

These new iterations of the same approach are still telling you the same thing: providing data counts, not data accounting, and performing coarse classification at the folder level, not granular accounting at the data value level.

Enterprises used to only have to worry about credit cards and Social Security numbers. Now they have to identify all personal data, even data that is personal only because of its association with a person. That’s a big identity security problem.

There’s Hope on the Horizon

Fortunately, there’s now a better approach that is designed for modern data environments. Just as smartphones don’t just have voice and text, modern data classification incorporates legacy methodologies like Regular Expressions as one arrow in the quiver.

But rather than pattern-matching classification being the first and only step, the approach starts with the data values themselves. It establishes whether the data is uniquely identifiable, measures the degree of correlation with other data values, determines who or what the data is associated with, and only then applies classification.
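The ordering of those steps can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any specific product: `known_identifiers` is a hypothetical lookup of values already tied to a person, the 0.9 uniqueness threshold is arbitrary, association is judged for a whole table at once, and the sample rows are invented.

```python
def uniqueness(values):
    """Share of distinct values in a column; 1.0 means every value is unique."""
    return len(set(values)) / len(values) if values else 0.0

def resolve_owner(row, known_identifiers):
    """Return the person a row correlates to, or None if no field matches."""
    for value in row.values():
        if value in known_identifiers:
            return known_identifiers[value]
    return None

def classify_columns(rows, known_identifiers, unique_threshold=0.9):
    """Label columns only after uniqueness and association are established."""
    owners = [resolve_owner(row, known_identifiers) for row in rows]
    # Simplification: the table counts as associated if any row has an owner.
    associated = any(owner is not None for owner in owners)
    labels = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if not associated:
            labels[column] = "not personal"
        elif uniqueness(values) >= unique_threshold:
            labels[column] = "personal identifier"
        else:
            labels[column] = "personal attribute"
    return labels

# Hypothetical identity lookup and two invented tables.
known_identifiers = {
    "alice@example.com": "person-1",
    "bob@example.com": "person-2",
    "carol@example.com": "person-3",
}
customers = [
    {"email": "alice@example.com", "location": "37.77,-122.42"},
    {"email": "bob@example.com", "location": "37.77,-122.42"},
    {"email": "carol@example.com", "location": "40.71,-74.00"},
]
sensors = [
    {"sensor_id": "s-1", "location": "37.77,-122.42"},
    {"sensor_id": "s-2", "location": "40.71,-74.00"},
]

customer_labels = classify_columns(customers, known_identifiers)
sensor_labels = classify_columns(sensors, known_identifiers)
```

In this sketch the same coordinates are labeled a personal attribute in the customer table and not personal in the sensor table, because the label follows the association with a person rather than the shape of the value.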

The classification outcomes can also be modified based on interaction with the underlying machine learning model or through integration with a business glossary. These interactions are incorporated into the machine learning models to iterate on accuracy improvements.

The outcome is a dynamic and comprehensive inventory and map of all personal data across enterprise environments that can be sliced, diced and interpreted through classification to frame decisions and processes – rather than a representation of which folders are flashing the most red based on an informed guess.

If data is personal based on the association with an individual, then classification should be driven by that association, not the limitations of the technology tool.

Classification For the Privacy Era

A data-first approach follows a multi-step process that can incorporate and extend established methodologies. It also lays the foundation for integrating machine learning tools that establish relationships, such as neural networks or random tree classifiers, as well as natural language processing.

Looking at the data in its totality delivers greater accuracy and the ability to discover dark personal data. To achieve this outcome, the approach needs to have multiple components.

• Breadth of Coverage Across the Enterprise: Unstructured, structured, semi-structured, cloud and apps (‘legacy’ like SAP, and SaaS like Salesforce)
• Correlation and Machine Learning to Establish Data Relationships
• Generate Granular Insights – Folder, file, and data object discovery and classification
• Extend Regular Expressions Through Enrichment
• No ‘Black Box’: Supervised Learning, Model Interaction, and Business Glossary Integration
• Advanced Unstructured Data Intelligence: Neural Network-based entity extraction and resolution for “dark data” in unstructured data sources