Classification’s never been an easy thing: even Aristotle struggled with it. When he tried to divide organisms into two groups (plants and animals) and then each of those into three (the former got herbs, shrubs, and trees; the latter land, air, and water), it wasn’t enough. If a bird was sorted as an animal of the air, what about our lovely penguin friends and other birds that don’t fly?
In the natural world, things got a bit more accurate in the 18th century, when Linnaeus revolutionized classification with kingdom, phylum, class, order, etc. – the taxonomies we all learned in biology class. But even those definitions get hazy when biologists start to account for relationships among organisms – birds, crocodiles, and dinosaurs are all related, after all, yet they sit in very different classes.
In the world of data? Things get even more complex.
Classification is the key to understanding your data – and ultimately getting your data to work for you: it’s critical to being able to reduce risk, make strategic decisions, sustain compliance, accelerate governance, retain (or reduce) the right data, manage data privacy, and protect your data in the first place.
Traditional data classification falls short: data isn’t categorized and labeled consistently, it lacks context, it’s noisy, and it’s unreliable.
You can manually tag, label, and categorize your data – but that takes time, it’s error prone, and you’re not able to understand the relationships between data points. Is it part of a bigger set? Part of an identity? Is it regulated data?
You’ve got your basic regular-expression-based classification – essentially matching data that follows a specific pattern: a 10-digit number that starts with 312 might mean a phone number with a Chicago area code. But what if it’s an account number instead?
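To make the ambiguity concrete, here is a minimal Python sketch of pattern-only classification. The pattern and the function name are illustrative assumptions, not any product’s actual rule: it assumes a phone number written as the 312 area code followed by seven more digits, with optional separators.

```python
import re

# Hypothetical pattern (an assumption for illustration): the 312 area code
# followed by seven more digits, with optional '-', '.', or space separators.
CHICAGO_PHONE = re.compile(r"\b312[-. ]?\d{3}[-. ]?\d{4}\b")


def looks_like_chicago_phone(value: str) -> bool:
    """Return True if the text contains something shaped like a 312 number."""
    return CHICAGO_PHONE.search(value) is not None


if __name__ == "__main__":
    for sample in ["312-555-0142", "Account 3125550142", "773-555-0142"]:
        print(sample, looks_like_chicago_phone(sample))
```

The first two samples both match, even though the second is labeled as an account number: the pattern sees only the shape of the digits, not what they mean.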
Without context, it’s difficult to classify data correctly. If you try to classify the word Brooklyn in a set of data, for example, how do you know if that Brooklyn is referring to the borough of New York City, the suburb of Melbourne, Australia or the first name of a specific person? How do you know if that particular instance of Brooklyn is public or private or restricted data?
What does modern data classification need to account for?
In today’s data sprawl, accurate, scalable data classification is paramount. Organizations need to take a layered approach to build a foundation that lets the business get more from its data – whether that’s for analysis and strategic business purposes or for driving data security and compliance.
These days, modern data classification needs to address:
- Accuracy: If it’s noisy, if there are too many false positives, you might as well start from scratch. Modern data classification needs to be accurate so that it can be used for everything from data validation to policy enforcement.
- Patterns and relationships: Understanding a single point in time is one thing – getting the big picture is another altogether. It’s critical now to understand how data is related, how it’s connected: is it all part of the same set of intellectual property? Does it all relate to the same individual?
- Context: Adding context makes all the difference – with context, you’ll be able to tell if it’s Brooklyn the borough or Brooklyn the first name. You’ll be able to label regulated data in the right way, automatically apply policies, and reduce noise and friction.
- Customization: Every organization’s data is different: it’s got a different setup, different meaning, different priorities. Data classification needs to be customizable to the data itself – and needs to be able to learn from custom data sets in order to bring meaningful value.
And you can’t do this manually anymore – not at the rate that data grows, nor at the rate that the definition of “sensitive data” evolves. You can’t just take the same old techniques and wrap them in new packaging: you need to layer tried-and-true data classification with cutting-edge ML and NLP to get data classification that’s going to work with today’s data – classification that’s built for today’s challenges across use, storage, type, and more.
Modern data classification goes beyond simply assigning a level of sensitivity to data, or categorizing it by attribute, by type, by content. It combines these techniques with ML-augmented context, applies confidence scoring, integrates policy libraries, and extends across all data silos so that classification at scale is truly the foundation for any successful data initiative.
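As a rough illustration of layering context and confidence scoring on top of a pattern match, consider the Python sketch below. Everything in it is an assumption made for the example: the cue-word lists, the score adjustments, and the 0.7 threshold are hand-picked stand-ins for what an ML- or NLP-based system would learn from data.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical pattern: 312 area code plus seven digits, optional separators.
PHONE = re.compile(r"\b312[-. ]?\d{3}[-. ]?\d{4}\b")

# Hypothetical context cues; a real system would learn these, not hard-code them.
PHONE_CUES = {"phone", "call", "tel", "mobile"}
ACCOUNT_CUES = {"account", "acct", "invoice"}

CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff for acting on a label


@dataclass
class Finding:
    value: str
    label: str
    confidence: float


def classify(text: str) -> List[Finding]:
    """Score each pattern match using the words that surround it."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    findings = []
    for match in PHONE.finditer(text):
        score = 0.5  # a pattern match alone is ambiguous
        if words & PHONE_CUES:
            score += 0.4  # surrounding words suggest a phone number
        if words & ACCOUNT_CUES:
            score -= 0.4  # surrounding words suggest something else
        score = round(min(max(score, 0.0), 1.0), 2)
        findings.append(Finding(match.group(), "phone_number", score))
    return findings


if __name__ == "__main__":
    for text in ["Call me at 312-555-0142", "Account 3125550142 overdue"]:
        for f in classify(text):
            action = "label" if f.confidence >= CONFIDENCE_THRESHOLD else "review"
            print(f.value, f.label, f.confidence, action)
```

The same digits score high when the surrounding text mentions a call and low when it mentions an account, so a policy can act automatically on the first and route the second for review.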