Best Practices for Data Scanning
Data Scanning Strategy – Why It’s Important
When evaluating a data discovery and visibility solution for your enterprise, a critical factor to consider is its ability to apply the right scanning strategy for your specific needs. Scanning is the first step in building an accurate index of your critical data assets across the entire enterprise, and it must be both efficient and comprehensive. Modern data visibility platforms are expected to do a lot: help you with core discovery across dozens of systems, support your AI and general data security use cases, provide strong DSPM, DPM, and DLP capabilities to help you manage your data, and help you stay compliant. The scanning engine underneath all of these capabilities must therefore be robust, accurate, and insightful.
The scanning engine should also be flexible and highly customizable. Enterprises are not all created equal, and a "one size fits all" scanning approach will likely overload your systems and network, deliver suboptimal scan performance, and incur unnecessary expenses. A great scanning engine is one that supports the scanning strategy that fits your specific needs.
Common Data Scanning Stages
Generally speaking, there are four stages of scanning that a company can follow on its data discovery and lifecycle journey. Each stage should ideally focus on the specific requirements raised by specific stakeholders:
Survey Scan
Broad discovery to identify general areas of concern. For data governance stakeholders, this stage should focus on metadata (e.g., file ownership and access levels for unstructured data, and high-level classification for structured data). For security stakeholders, the focus is a quick assessment that surveys the overall landscape.
Comparative Prioritization
A configurable sample scan to identify the type and magnitude of sensitive data stored in the corporation's systems, and to calculate comparative density across sources in order to prioritize the next steps (a short sketch of this prioritization follows the four stages below).
Full Scan
Get exact counts and a full data map for selected databases and buckets, and use this comprehensive index to initiate and track a remediation plan.
Maintenance
Recurring but infrequent rescans to identify new databases/buckets or changed schemas/files, and to decide on additional remediation as required.
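To make "comparative density" concrete, here is a minimal Python sketch that ranks data sources by the fraction of sampled objects containing sensitive data. The source names and counts are invented for illustration:

```python
# Hypothetical illustration: ranking data sources by sensitive-data density
# from sample-scan results. Source names and counts are made up.

from dataclasses import dataclass

@dataclass
class SampleResult:
    source: str              # data source name
    objects_sampled: int     # how many files/rows the sample scan inspected
    objects_with_hits: int   # how many of those contained sensitive data

def density(r: SampleResult) -> float:
    """Fraction of sampled objects that contained sensitive data."""
    return r.objects_with_hits / r.objects_sampled if r.objects_sampled else 0.0

results = [
    SampleResult("hr-file-share", 2_000, 640),
    SampleResult("marketing-bucket", 2_000, 35),
    SampleResult("finance-db", 2_000, 1_120),
]

# Scan the densest sources first; they yield the most findings per scan hour.
for r in sorted(results, key=density, reverse=True):
    print(f"{r.source}: {density(r):.1%} of sampled objects contain sensitive data")
```

Sources at the top of this ranking yield the most findings per hour of full scanning, which makes them natural candidates for the earliest remediation phases.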
Some commercial discovery solutions only do assessment scanning, while others focus solely on metadata or on sampling. Some can run full scans, but only on one data source at a time. It is important to understand which stages are critical for your business realities, and to choose your solution accordingly. Remember that in most cases a single stage will not be enough, and your solution will ultimately be expected to fully support two, three, or all four stages.
BigID Data Scan Types
Designed from the ground up for flexibility, scale, and cloud readiness, BigID features different types of scans to support different scenarios and use cases. These scan types can be further customized to support any environment, and can be combined into a larger strategy:
- Full scan: identifies all sensitive data; can be configured to run in full, sampling, or differential mode (the sketch after this list illustrates the idea behind differential mode).
- Assessment scan: quick survey of the data, uses sampling with configurable thresholds.
- Metadata scan: scans object metadata but not the content.
- Lineage scan: finds relationships between objects.
- Hyperscan: ML-based scan for optimized scanning of large data sources.
- Labeling: scans objects and applies labels that can trigger rule-based actions.
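Differential mode deserves a brief illustration. The following generic Python sketch shows the idea behind it: keep a cheap fingerprint of every object from the previous scan, and rescan only what is new or changed. This is a conceptual sketch only, not BigID's implementation or API:

```python
# Generic sketch of the idea behind a differential scan: keep a fingerprint
# of each object from the previous run and rescan only what changed.

import os

def fingerprint(path: str) -> tuple[float, int]:
    """Cheap change indicator: modification time and size."""
    st = os.stat(path)
    return (st.st_mtime, st.st_size)

def differential_targets(paths: list[str], previous: dict[str, tuple]) -> list[str]:
    """Return only new or modified files; unchanged files are skipped."""
    targets = []
    for p in paths:
        fp = fingerprint(p)
        if previous.get(p) != fp:
            targets.append(p)
        previous[p] = fp  # update the baseline for the next run
    return targets
```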
With so many choices, BigID offers the right scan type for every scanning stage, and can implement the scanning strategy most appropriate for your business today and in the future.
Real-World Use Case
To further illustrate the importance of a good scanning strategy, consider this real-life example: a large retailer is preparing for a comprehensive merger-triggered security audit, and must eliminate open access to all files on their dozens of file shares. With BigID deployed in their environment and connected to all unstructured data sources, the following scanning strategy is a good starting point:
Stage 1 – Survey
Run a BigID metadata scan to identify problematic file shares (those with file ownership and permissions that are open too wide).
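As a rough illustration of what a metadata-only survey looks like on a POSIX file share, the following Python sketch flags files whose mode bits grant access to everyone, without ever reading file content. The share path is hypothetical, and a real metadata scan would also consider ACLs and ownership:

```python
# Minimal sketch of a metadata-only permission survey: flag files readable
# or writable by "other" without opening them. Share path is hypothetical.

import os
import stat

def overexposed_files(root: str):
    """Yield (path, owner_uid) for files readable or writable by 'other'."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_mode & (stat.S_IROTH | stat.S_IWOTH):
                yield path, st.st_uid

for path, uid in overexposed_files("/mnt/shares/hr"):
    print(f"open too wide: {path} (owner uid {uid})")
```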
Stage 2 – Comparative Prioritization
Run a sample scan on these file shares, with the relevant classifiers turned on, to identify exactly how much sensitive data, and of which types, is stored within these over-exposed files.
Based on the results of this scan, the company decides on three cleanup phases, with the first phase targeting the files containing the most sensitive data (Social Security numbers and credit card numbers).
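To show what such a sample scan involves, here is a simplified Python sketch with two toy classifiers: a regular expression for US Social Security numbers, and a regex plus Luhn checksum for credit card numbers. Production classifiers are far more sophisticated, but the sampling-and-counting structure is the same:

```python
# Illustrative sketch of a sample content scan with two toy classifiers.

import random
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to cut false positives on card-like numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def classify(text: str) -> dict[str, int]:
    """Count SSN and credit-card matches in a piece of text."""
    cards = [c for c in CARD.findall(text) if luhn_ok(re.sub(r"\D", "", c))]
    return {"ssn": len(SSN.findall(text)), "credit_card": len(cards)}

def sample_scan(paths: list[str], sample_size: int = 100) -> dict[str, int]:
    """Classify a random sample of files instead of the whole share."""
    totals = {"ssn": 0, "credit_card": 0}
    for path in random.sample(paths, min(sample_size, len(paths))):
        with open(path, errors="ignore") as f:
            for kind, n in classify(f.read()).items():
                totals[kind] += n
    return totals
```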
Stage 3 – Full Scan
For each cleanup phase, the company runs Stage 3, a full scan, to get the complete list of files to address, and uses BigID's access intelligence and remediation apps to implement the end-to-end correction and auditing workflow: each file is removed, or edited to restrict its permissions and strip the unnecessary sensitive information, as needed.
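A simplified sketch of the per-file remediation step might look like the following; both functions are illustrative stand-ins for what a remediation app automates at scale, and the SSN pattern is redeclared here so the sketch is self-contained:

```python
# Simplified sketch of per-file remediation: restrict a flagged file's
# permissions to owner-only, or redact sensitive matches in place.

import os
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def restrict_permissions(path: str) -> None:
    """Remove group/other access; only the owner can read and write."""
    os.chmod(path, 0o600)

def redact_ssns(path: str) -> int:
    """Replace SSNs with a masked value; return the number of redactions."""
    with open(path, errors="ignore") as f:
        text = f.read()
    text, n = SSN.subn("***-**-****", text)
    if n:
        with open(path, "w") as f:
            f.write(text)
    return n
```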
After all cleanup phases are complete and all known problematic files are fixed, the company moves to the steady-state Stage 4 – maintenance.
Stage 4 – Maintenance
Once a month, a scheduled sample scan revisits all unstructured data sources to identify new or changed files whose permissions are open too wide.
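In code terms, the maintenance pass combines the two earlier sketches: revisit only files created or modified since the last scan, and re-check their permissions. A hypothetical sketch:

```python
# Sketch of the monthly maintenance pass: re-check only files created or
# modified since the last scan. Share path and cutoff are hypothetical.

import os
import stat
import time

def changed_and_open(root: str, last_scan: float):
    """Yield files modified after `last_scan` that are open to 'other'."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_mtime > last_scan and st.st_mode & (stat.S_IROTH | stat.S_IWOTH):
                yield path

one_month_ago = time.time() - 30 * 24 * 3600  # stand-in for the last scan time
for path in changed_and_open("/mnt/shares/hr", one_month_ago):
    print("needs review:", path)
```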
Data Scanning Best Practices
Remember that scanning by itself, as important as it is, is still a means, not the goal. Scanning just for the sake of scanning may produce a basic data inventory, but it will likely not produce enough value to justify the investment. There is usually little or no value in constant scanning if you are not acting on the findings. And remember that your scanning strategy should be determined by your overall data strategy, not the other way around.