Amazon S3 object storage has become a popular foundation for storing unstructured documents and mixed file types at elastic scale. However, like any wide and deep data lake, it creates unique data security challenges and risks that require different mechanisms to address.

Identifying Sensitive Data in S3

Measuring data risk in S3 begins with accurately identifying the data inside S3 buckets. By the nature of object storage, almost any kind of data can land in a bucket, whether structured, unstructured, or some combination, which makes analyzing that variety complex. As an open-ended data lake that can scale without a cap, S3 also creates a volume and velocity challenge for data analysis.
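As a minimal sketch of that inventory problem, the snippet below uses boto3 (assuming it is installed and AWS credentials are configured) to page through a bucket and tally objects by file extension, which makes visible how mixed the contents of a single bucket can be. The bucket name is a placeholder.

```python
# Sketch: tally object types in a bucket to see how mixed its contents are.
# Assumes boto3 and configured AWS credentials; the bucket name is a placeholder.
from collections import Counter
import boto3

s3 = boto3.client("s3")

def tally_object_types(bucket: str) -> Counter:
    counts = Counter()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            # Use the file extension as a rough proxy for data type.
            key = obj["Key"]
            counts[key.rsplit(".", 1)[-1].lower() if "." in key else "(none)"] += 1
    return counts

print(tally_object_types("example-data-lake-bucket"))
```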

For the purpose of understanding data risk, it is important to know which crown jewels are stored in, and entering, S3. This high-value, high-risk data can take different forms: regulated data such as GLBA, PHI, PCI, or NPI; privacy-relevant personal data; credentials and secrets; or any customer-defined crown jewel such as intellectual property, client data, proprietary recipes, or anything in between.

To accurately inventory sensitive, critical, and regulated data in S3, metadata scans are typically insufficient: metadata is not always labeled accurately, nor does it capture the full range of possible content. Instead, there needs to be a way to scan data content at S3 scale. There also needs to be some means to customize the definition of what counts as sensitive, critical, or regulated. For instance, GPS data could be GDPR personal data for some organizations and not others; secrets like passwords and privileged credentials could be personal data in some contexts but not others. A one-size-fits-all approach to classifying data will not work for the majority of enterprises.
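A minimal sketch of content-level scanning with customizable definitions might look like the following: a handful of regular-expression classifiers applied to the bytes of each object rather than to its metadata. The patterns, bucket, and key names are illustrative assumptions, not production-grade classifiers.

```python
# Sketch: content-level classification with customizable patterns.
# Patterns, bucket, and key are illustrative placeholders, not production classifiers.
import re
import boto3

s3 = boto3.client("s3")

# Each organization can tune what counts as sensitive, critical, or regulated.
CLASSIFIERS = {
    "email": re.compile(rb"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(rb"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_secret_hint": re.compile(rb"aws_secret_access_key", re.IGNORECASE),
}

def classify_object(bucket: str, key: str) -> list[str]:
    # Read the object's content and return the names of every matching classifier.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return [name for name, pattern in CLASSIFIERS.items() if pattern.search(body)]

# Example: findings = classify_object("example-data-lake-bucket", "exports/customers.csv")
```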

Lastly, building a persistent inventory or catalog is vital to putting findings to work for privacy, security, and data governance use cases while keeping remediation manageable. Stateless solutions that only send alerts demand immediate action and create unmanageable noise for a Security Operations Center. Embedding findings in a dynamically updated inventory and catalog makes it easier to delegate and track remediation while leveraging the same findings for common privacy use cases like DSARs and governance use cases like metadata cataloging and search.
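As a rough illustration of a stateful inventory (as opposed to fire-and-forget alerts), the sketch below upserts classification findings into a local SQLite catalog keyed by bucket and object key. The schema and names are made up for the example; a real catalog would live in a shared, governed store.

```python
# Sketch: persist classification findings in a simple catalog instead of one-off alerts.
# The SQLite schema and names are illustrative only.
import sqlite3

def init_catalog(path: str = "s3_catalog.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS findings (
               bucket TEXT, key TEXT, classification TEXT, last_seen TEXT,
               PRIMARY KEY (bucket, key, classification))"""
    )
    return conn

def upsert_finding(conn, bucket: str, key: str, classification: str) -> None:
    # Re-scans update the same row, so the catalog stays current instead of piling up alerts.
    conn.execute(
        "INSERT OR REPLACE INTO findings VALUES (?, ?, ?, datetime('now'))",
        (bucket, key, classification),
    )
    conn.commit()
```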

Measuring S3 Data Risk

Data risk can originate from many sources. Understanding the location and volume of sensitive, critical, or regulated data is in itself helpful for understanding risk. This heat map of risky data is a starting point for understanding broader insider or outsider exfiltration risk by data sensitivity, criticality (like passwords), and governing regulation.
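One simple way to picture that heat map is to aggregate cataloged findings by bucket and classification, as in the sketch below, which builds on the illustrative SQLite catalog above.

```python
# Sketch: aggregate findings per bucket to highlight where risky data concentrates.
# Builds on the illustrative SQLite catalog sketched earlier.
def risk_heat_map(conn) -> None:
    rows = conn.execute(
        """SELECT bucket, classification, COUNT(*) AS hits
           FROM findings GROUP BY bucket, classification
           ORDER BY hits DESC"""
    ).fetchall()
    for bucket, classification, hits in rows:
        print(f"{bucket:40s} {classification:20s} {hits}")
```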

Password Protect S3 Bucket Files

Closely related to data risk is access risk. This could stem from an employee having excessive access to sensitive data, or from outsiders having inadvertent access to sensitive data owing to misconfiguration. For both insiders and outsiders, assessing access risk starts with identifying any open access on buckets or folders; for S3, that means finding buckets left publicly accessible rather than properly protected. For insiders, it also means understanding which individuals have excessive privileges on buckets containing sensitive data. This S3 problem is similar to the least-privilege problem in file folder analysis: knowing which data is open and which employees are over-privileged.
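To show what that first step of access-risk analysis can look like in practice, the sketch below uses standard boto3 calls to check whether a bucket's policy leaves it public and whether a public access block is in place. Error handling is minimal and the output is illustrative.

```python
# Sketch: flag buckets with open (public) access using standard S3 APIs.
# Minimal error handling for brevity; output is illustrative.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def is_bucket_open(bucket: str) -> bool:
    try:
        status = s3.get_bucket_policy_status(Bucket=bucket)
        if status["PolicyStatus"]["IsPublic"]:
            return True
    except ClientError:
        pass  # No bucket policy attached.
    try:
        block = s3.get_public_access_block(Bucket=bucket)["PublicAccessBlockConfiguration"]
        return not all(block.values())  # Any disabled setting leaves a gap.
    except ClientError:
        return True  # No public access block configured at all.

for b in s3.list_buckets()["Buckets"]:
    if is_bucket_open(b["Name"]):
        print("Open access risk:", b["Name"])
```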

Separate from access, location and cross-border transfer represent newer kinds of privacy risk. Emerging state and country privacy regulations often come with commensurate residency and cross-border transfer restrictions. Residency in turn requires knowing where data is located for sovereignty purposes, as well as the citizenship of the data subject to whom the data belongs. Flagging residency or cross-border violations is an important marker of risk.
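A minimal residency check might compare each bucket's region against an allowed list, as sketched below. The allowed regions are an example assumption for an EU-only policy, not a regulatory recommendation.

```python
# Sketch: flag buckets stored outside an allowed set of regions (residency check).
# The allowed-region list is an example assumption, not a regulatory recommendation.
import boto3

s3 = boto3.client("s3")
ALLOWED_REGIONS = {"eu-west-1", "eu-central-1"}  # e.g., an EU-only residency policy

def bucket_region(bucket: str) -> str:
    loc = s3.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    return loc or "us-east-1"  # S3 reports us-east-1 as None

for b in s3.list_buckets()["Buckets"]:
    region = bucket_region(b["Name"])
    if region not in ALLOWED_REGIONS:
        print(f"Residency flag: {b['Name']} is in {region}")
```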

Avoid S3 Data Exfiltration Risk

Data duplication and redundancy are another source of risk that can be easily reduced. In most instances duplicate data, whether structured or unstructured, represents additional attack surface. Being able to flag where data is duplicated gives organizations an opportunity to shrink their data footprint and the associated exfiltration risk. For S3 buckets, which can hold structured, semi-structured, and unstructured data alike, this is an easy form of incident prevention and a way to reduce storage cost as well.
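One lightweight way to spot duplicates, sketched below, is to group objects by size and ETag. Because the ETag equals an MD5 of the content only for single-part, non-KMS-encrypted uploads, this is a first-pass heuristic rather than a definitive duplicate finder.

```python
# Sketch: group objects by (size, ETag) to flag likely duplicates.
# ETag matches the content MD5 only for single-part, non-KMS uploads,
# so treat this as a first-pass heuristic.
from collections import defaultdict
import boto3

s3 = boto3.client("s3")

def likely_duplicates(bucket: str) -> dict:
    groups = defaultdict(list)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            groups[(obj["Size"], obj["ETag"])].append(obj["Key"])
    return {sig: keys for sig, keys in groups.items() if len(keys) > 1}
```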

S3 Encryption

Identifying data that is not encrypted is also a quick win against risk. Not all data is encrypted everywhere, but being able to enumerate instances where sensitive, critical, or regulated data sits in the clear – and where it shouldn’t be – can help organizations avoid impactful breaches.
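The sketch below checks whether a bucket has a default server-side encryption configuration and whether individual objects were stored without SSE, using standard boto3 calls. The bucket name is again a placeholder.

```python
# Sketch: enumerate unencrypted data using standard S3 APIs.
# Bucket name is a placeholder; output is illustrative.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def bucket_has_default_encryption(bucket: str) -> bool:
    try:
        s3.get_bucket_encryption(Bucket=bucket)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            return False
        raise

def unencrypted_objects(bucket: str):
    # Yields keys of objects stored without any server-side encryption header.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            head = s3.head_object(Bucket=bucket, Key=obj["Key"])
            if "ServerSideEncryption" not in head:
                yield obj["Key"]
```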

Remediating and Reducing S3 Data Risk

Once sensitive data and the associated data risk are identified, the question organizations ask themselves is: now what? The answer usually takes the form of remediation or access restrictions that block insider and outsider access to the data.

Remediation involves delegating a set of actions to a data owner, either manually or automatically, so they can remediate the data. This can take the form of actions like encryption, masking, deletion, or archiving, all of which are preventative steps.
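As an example of what an automated remediation action might look like, the snippet below re-encrypts an object in place with SSE-KMS and archives another copy by changing its storage class to Glacier. Bucket, key, and KMS key identifiers are hypothetical placeholders.

```python
# Sketch: two example remediation actions, re-encryption and archiving.
# Bucket, key, and KMS key ID are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

def reencrypt_in_place(bucket: str, key: str, kms_key_id: str) -> None:
    # Copying an object onto itself with new encryption settings rewrites it under SSE-KMS.
    s3.copy_object(
        Bucket=bucket, Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        ServerSideEncryption="aws:kms", SSEKMSKeyId=kms_key_id,
        MetadataDirective="COPY",
    )

def archive_object(bucket: str, key: str) -> None:
    # The same copy-onto-itself trick, but moving the object to a Glacier storage class.
    s3.copy_object(
        Bucket=bucket, Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        StorageClass="GLACIER",
        MetadataDirective="COPY",
    )
```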

Alternatively, open and over-privileged access can be dynamically controlled using tools like BigID and its native controls for AWS.

Protecting S3 Data with BigID

BigID offers the full data risk management life cycle for sensitive data: identifying data, flagging associated risk from configuration, location, or policy violations, and providing built-in mechanisms to reduce that risk. Get a 1:1 demo.