Machine Learning, Artificial Intelligence, and now Deep Learning have become so overused that they may as well be synonymous with pixie dust and magical incantations. But approaches built on these techniques are quietly changing how organizations face their most pressing challenge: understanding and protecting data.

That challenge can be stated succinctly: how do you best achieve data knowledge in order to better steward and safeguard information?

Knowing your data is essential to protecting your data because you can’t protect what you don’t understand. Data knowledge is also critical for deriving insight, risk profiles, and value from your data. Traditional, manual approaches to gathering intelligence on what personal data is collected and processed depended on interviews and surveys. Neither interviews nor surveys are reliable, accurate, or scalable, especially in the Big Data era. Machine-based approaches to interrogating data stores hold the promise of greater detail, reliability, and precision in data knowledge, especially if the right steps are taken to align the model with data inputs.

But sorting, organizing, and making sense of petabytes of digital detritus is no easy task. Automating rule-based classification techniques helps, but it still falls short in understanding data context. That’s where advanced machine learning and related approaches provide a roadmap for better mapping and understanding of personal data, the bedrock of effective personal data protection and privacy.

People & Privacy

Ironically, when it comes to privacy problems, the goal of advanced ML is not necessarily to be more like a human in terms of data processing. AI is typically associated with making sense of input like text, interactions, and images through constant iteration and feedback, in order to automate actions and behavior that are indistinguishable from a human’s. For data protection and privacy requirements, the opposite holds true: people are poor at judging what data resides where, determining how the data is connected to other data, tracking data usage and flows, and evaluating data risk.

For privacy, the overarching aim is to analyze data based on relationships, and not just similarity, in ways that humans cannot. ML and Deep Learning provide a set of approaches that can be applied to specific data challenges and used to build a sustainable model for privacy and data protection problems that depend on context, mapping of relationships, and data flows.

No single technique is a silver bullet by itself. However, companies can build and sustain a complete data privacy picture by combining machine learning components in ways that are “fit for purpose”: random tree classifiers to improve accuracy, correlation, and reasoning; probability thresholds to assess data relationships; clustering for predictive sampling and comparison analysis of personal data distribution; and neural networks for entity extraction and resolution, as well as confidence scoring to balance precision and recall.
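To make the precision and recall trade-off concrete, here is a minimal Python sketch of how a confidence threshold changes both measures. The scored findings, the function name, and the thresholds are all invented for illustration; this is not a description of any particular product's scoring.

```python
from typing import List, Tuple

# Hypothetical findings: (model confidence, whether the value truly is personal data).
# The numbers are made up purely to illustrate the trade-off.
FINDINGS: List[Tuple[float, bool]] = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]


def precision_recall(findings: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Treat every finding scored at or above the threshold as a predicted positive."""
    tp = sum(1 for score, label in findings if score >= threshold and label)
    fp = sum(1 for score, label in findings if score >= threshold and not label)
    fn = sum(1 for score, label in findings if score < threshold and label)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A higher threshold yields fewer false positives but misses more true findings.
for threshold in (0.25, 0.50, 0.85):
    p, r = precision_recall(FINDINGS, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

With this toy data, raising the threshold trades recall for precision, which is why being able to adjust the threshold matters when noise and false positives are the concern.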

Man Vs Machine

For problems of privacy and personal data protection, human efforts prove inadequate for finding, classifying, or sorting personal information. Privacy relies first and foremost on understanding what personal information an organization collects, and how that information is processed and used. This requires an accurate inventory of personal data. Interviews and surveys can only build inventories based on recollections, not actual data records. A machine is more capable than man when it comes to examining data records inside other machines.

Having a machine build a data inventory requires an ability to look across any data source and to classify that data by type, person, residency, and application, whether that data resides inside a database, a file share, a Big Data warehouse, or a cloud service. The resulting inventory can then be organized around different pivots to better understand the data’s context.
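As a rough illustration of organizing one inventory around different pivots, consider the following Python sketch. The `Finding` record, its field names, and the sample entries are hypothetical and chosen only to mirror the dimensions named above (type, person, residency, application).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Finding:
    """One discovered data value; every field name here is an illustrative assumption."""
    source: str       # e.g. a database, file share, or cloud service
    data_type: str    # e.g. "email" or "phone"
    person_id: str    # the individual the value was attributed to
    residency: str    # country code of that individual
    application: str  # application that collected the value


def pivot(findings: List[Finding], key: str) -> Dict[str, List[Finding]]:
    """Group the same inventory by any one attribute of a finding."""
    grouped: Dict[str, List[Finding]] = defaultdict(list)
    for finding in findings:
        grouped[getattr(finding, key)].append(finding)
    return dict(grouped)


# Made-up inventory entries.
INVENTORY = [
    Finding("crm_db", "email", "person-1", "DE", "crm"),
    Finding("file_share", "phone", "person-1", "DE", "support"),
    Finding("cloud_bucket", "email", "person-2", "US", "marketing"),
]

by_person = pivot(INVENTORY, "person_id")
by_type = pivot(INVENTORY, "data_type")
print({person: len(items) for person, items in by_person.items()})
```

The point of the sketch is that the records are collected once and only the grouping key changes, so the same inventory can answer questions by person, by data type, or by residency.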

Addressing the data knowledge challenge calls for approaches that can establish the degree of correlation between widely distributed data values, graph the relationships between highly correlated values through reasoning, and apply machine learning models to improve classification accuracy.
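One simple way to picture correlating distributed values and graphing the result is sketched below in Python. It uses Jaccard overlap between column value sets as a stand-in correlation measure, which is an assumption made for illustration and not a method described in this article. The column names, values, and threshold are likewise hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical columns from different data sources, with made-up values.
COLUMNS: Dict[str, Set[str]] = {
    "crm.email": {"ann@example.com", "bob@example.com", "cy@example.com"},
    "billing.contact": {"ann@example.com", "bob@example.com", "dee@example.com"},
    "logs.session_id": {"s-001", "s-002", "s-003"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of distinct values that two columns have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlation_edges(columns: Dict[str, Set[str]], threshold: float) -> List[Tuple[str, str, float]]:
    """Return graph edges between columns whose overlap meets the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(columns), 2):
        score = jaccard(columns[name_a], columns[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


EDGES = correlation_edges(COLUMNS, threshold=0.4)
print(EDGES)
```

In this toy data only the two columns that share email addresses become connected, while the unrelated session identifiers stay isolated. A real system would need a far more robust measure than raw set overlap.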

Early attempts at these kinds of machine-driven efforts to analyze data and organize it into a comprehensive inventory relied on off-the-shelf indexing and pattern-matching technologies. Tools like Elasticsearch provide simple ways to index terabytes of data and match similar-looking information using various ML algorithms.

While a step forward, these early attempts at a data inventory have irreconcilable flaws. In the course of trying to solve the problem of data input, they create new issues. Using an external warehouse for analysis is impractical given the volume of data most organizations house, since it requires copying vast amounts of sensitive information to a secondary store. It also carries the enormous infrastructure costs necessary to power the indexing. Moreover, it creates a severe security problem by centralizing sensitive data in one place.

However, the issues are not just related to the steps necessary to perform the indexing. The value of the findings is also limited. Even a full index will help classify data by type, but not by person. Foundationally, privacy requires people context; it requires understanding what data is personal, and to whom it belongs. What makes data personal is that it is contextually associated with an individual; that is, it is by definition about, or by, that person.

Naive ML classification algorithms that match patterns can help resolve two similar-looking entities, but they can’t in and of themselves show correlation to a person in order to determine whether the data constitutes personal information. That requires a different kind of ML and deep learning not available in off-the-shelf tools like Elasticsearch.
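The gap between classifying by type and classifying by person can be shown with a small Python sketch. The pattern matcher below recognizes that a value looks like an email address, but only a lookup against known identities says whose it is. The identity index, the identifiers, and the exact-match lookup are simplifying assumptions; the lookup stands in for the correlation step and is not itself machine learning.

```python
import re
from typing import Dict, Optional

# Deliberately simple email pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical index of identifiers already known to belong to a person.
IDENTITY_INDEX: Dict[str, str] = {
    "ann@example.com": "person-1",
    "+1-555-0100": "person-1",
}


def classify_type(value: str) -> Optional[str]:
    """Pattern matching answers 'what kind of data is this?'."""
    return "email" if EMAIL_PATTERN.fullmatch(value) else None


def correlate_person(value: str) -> Optional[str]:
    """Correlation answers 'whose data is this?' (exact lookup as a stand-in)."""
    return IDENTITY_INDEX.get(value.strip().lower())


for value in ("noreply@example.com", "Ann@Example.com", "+1-555-0100"):
    print(value, classify_type(value), correlate_person(value))
```

Here a shared mailbox address is typed as an email yet belongs to no one, while a phone number that the pattern does not recognize is still attributed to a person. Type and ownership are separate questions.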

The Big Idea of BigID ML


Privacy and personal data protection begin with effective data intelligence that can understand what personal data an organization collects, to whom it belongs, and how it is being used. Sometimes this is characterized as a data inventory plus a record of data processing, but it goes beyond that. It requires an ability to find, classify, correlate, catalog, and even track data as it is captured and processed in a company. None of these tasks is easy to begin with, and they are made even harder by the complexity and diversity of where and how companies collect data on people across their mobile, web, and IoT applications.

Iterating on one piece of the puzzle improves aspects of the problem, but still leaves the broader issue of data knowledge by person or entity unresolved. Furthermore, any black box approach that doesn’t allow for interaction with the confidence scoring, or for refining correlation methodologies and classification accuracy, will never cope with the complexities of enterprise data estates.

Understanding the interrelationships between discovered data and attributes with a high degree of accuracy and confidence, in the context of whose data it is and without the burden of unnecessary noise and false positives, requires purpose-built Machine Learning. Data discovery, classification, identity correlation, and privacy-specific requirements like consent checking each rely on different techniques, training models, reasoning, and input weighting. However, these elements need to fit into a cohesive model that can respond to new machine or human input in order to deliver living and breathing data privacy protection.