Best Practices for Data Scanning
Data Scanning Strategy – Why It’s Important
When evaluating a data discovery and visibility solution for your enterprise, a critical factor to consider is its ability to apply the right scanning strategy for your specific needs. Scanning is the first step in building an accurate index of your critical data assets across the entire enterprise, and it must be both efficient and comprehensive. Modern data visibility platforms are expected to do a lot: help you with core discovery across dozens of systems, support your AI and general data security use cases, provide strong DSPM, DPM, and DLP capabilities to help you manage your data, and help you stay compliant. The scanning engine underneath all of these capabilities must therefore be robust, accurate, and insightful.
The scanning engine should also be flexible and highly customizable. Enterprises are not all created equal, and a "one size fits all" scanning approach will likely overload your systems and network, deliver suboptimal scan performance, and incur unnecessary expenses. A great scanning engine is one that supports the scanning strategy that fits your specific needs.
Common Data Scanning Stages
Generally speaking, there are four stages of scanning that a company can follow on its data discovery and lifecycle journey. Each stage should ideally focus on the specific requirements raised by specific stakeholders:
Survey Scan
Broad discovery to identify general areas of concern. For data governance stakeholders, this stage should focus on metadata (e.g., file ownership and access levels for unstructured data, and high-level classification for structured data). For security stakeholders, the focus is a quick assessment that surveys the overall landscape.
Comparative Prioritization
A configurable sample scan to identify the type and magnitude of sensitive data stored in the corporation's systems, and to calculate comparative density across sources in order to prioritize the next steps (a short sketch of this prioritization follows the four stages below).
Full Scan
Get exact counts and a full data map for selected databases and buckets, and use this comprehensive index to initiate and track a remediation plan.
Maintenance
Recurring but infrequent rescans to identify new databases/buckets or changed schemas/files, and to decide on additional remediation as required.
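To make "comparative density" concrete, here is a minimal Python sketch that ranks data sources by the fraction of sampled objects containing sensitive data. The source names and counts are invented for illustration:

```python
# Hypothetical illustration: ranking data sources by sensitive-data density
# from sample-scan results. Source names and counts are made up.

from dataclasses import dataclass

@dataclass
class SampleResult:
    source: str              # data source name
    objects_sampled: int     # how many files/rows the sample scan inspected
    objects_with_hits: int   # how many of those contained sensitive data

def density(r: SampleResult) -> float:
    """Fraction of sampled objects that contained sensitive data."""
    return r.objects_with_hits / r.objects_sampled if r.objects_sampled else 0.0

results = [
    SampleResult("hr-file-share", 2_000, 640),
    SampleResult("marketing-bucket", 2_000, 35),
    SampleResult("finance-db", 2_000, 1_120),
]

# Scan the densest sources first; they yield the most findings per scan hour.
for r in sorted(results, key=density, reverse=True):
    print(f"{r.source}: {density(r):.1%} of sampled objects contain sensitive data")
```

Sources at the top of this ranking yield the most findings per hour of full scanning, which makes them natural candidates for the earliest remediation phases.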
Some commercial discovery solutions only do assessment scanning, while others focus solely on metadata or on sampling. Some can run full scans, but only on one data source at a time. It is important to understand which stages are critical for your business realities, and to choose your solution accordingly. Remember that in most cases a single stage will not be enough, and your solution will ultimately be expected to fully support two, three, or all four stages.
BigID Data Scan Types
Designed from the ground up for flexibility, scale, and cloud readiness, BigID features different types of scans to support different scenarios and use cases. These scan types can be further customized to support any environment, and can be combined into a larger strategy:
- Full scan: identifies all sensitive data; can be configured to run in full, sampling, or differential mode (the sketch after this list illustrates the idea behind differential mode).
- Assessment scan: quick survey of the data, uses sampling with configurable thresholds.
- Metadata scan: scans object metadata but not the content.
- Lineage scan: finds relationships between objects.
- Hyperscan: ML-based scan for optimized scanning of large data sources.
- Labeling: scans objects and applies labels that can trigger rule-based actions.
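Differential mode deserves a brief illustration. The following generic Python sketch shows the idea behind it: keep a cheap fingerprint of every object from the previous scan, and rescan only what is new or changed. This is a conceptual sketch only, not BigID's implementation or API:

```python
# Generic sketch of the idea behind a differential scan: keep a fingerprint
# of each object from the previous run and rescan only what changed.

import os

def fingerprint(path: str) -> tuple[float, int]:
    """Cheap change indicator: modification time and size."""
    st = os.stat(path)
    return (st.st_mtime, st.st_size)

def differential_targets(paths: list[str], previous: dict[str, tuple]) -> list[str]:
    """Return only new or modified files; unchanged files are skipped."""
    targets = []
    for p in paths:
        fp = fingerprint(p)
        if previous.get(p) != fp:
            targets.append(p)
        previous[p] = fp  # update the baseline for the next run
    return targets
```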
With so many choices, BigID offers the right scan type for every scanning stage, and can implement the scanning strategy most appropriate for your business today and in the future.
Real-World Use Case
To further illustrate the importance of a good scanning strategy, consider this real-life example: a large retailer is preparing for a comprehensive merger-triggered security audit, and must eliminate open access to all files on their dozens of file shares. With BigID deployed in their environment and connected to all unstructured data sources, the following scanning strategy is a good starting point:
Stage 1 – Survey
Run a BigID metadata scan to identify problematic file shares (those with file ownership and permissions that are open too wide).
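As a rough illustration of what a metadata-only survey looks like on a POSIX file share, the following Python sketch flags files whose mode bits grant access to everyone, without ever reading file content. The share path is hypothetical, and a real metadata scan would also consider ACLs and ownership:

```python
# Minimal sketch of a metadata-only permission survey: flag files readable
# or writable by "other" without opening them. Share path is hypothetical.

import os
import stat

def overexposed_files(root: str):
    """Yield (path, owner_uid) for files readable or writable by 'other'."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_mode & (stat.S_IROTH | stat.S_IWOTH):
                yield path, st.st_uid

for path, uid in overexposed_files("/mnt/shares/hr"):
    print(f"open too wide: {path} (owner uid {uid})")
```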
Stage 2 – Comparative Prioritization
Run a sample scan on these file shares, with the relevant classifiers turned on, to identify exactly how much sensitive data, and of which types, is stored within these over-exposed files.
Based on the results of this scan, the company decides on three cleanup phases, with the first phase targeting the files containing the most sensitive data (Social Security numbers and credit card numbers).
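To show what such a sample scan involves, here is a simplified Python sketch with two toy classifiers: a regular expression for US Social Security numbers, and a regex plus Luhn checksum for credit card numbers. Production classifiers are far more sophisticated, but the sampling-and-counting structure is the same:

```python
# Illustrative sketch of a sample content scan with two toy classifiers.

import random
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to cut false positives on card-like numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def classify(text: str) -> dict[str, int]:
    """Count SSN and credit-card matches in a piece of text."""
    cards = [c for c in CARD.findall(text) if luhn_ok(re.sub(r"\D", "", c))]
    return {"ssn": len(SSN.findall(text)), "credit_card": len(cards)}

def sample_scan(paths: list[str], sample_size: int = 100) -> dict[str, int]:
    """Classify a random sample of files instead of the whole share."""
    totals = {"ssn": 0, "credit_card": 0}
    for path in random.sample(paths, min(sample_size, len(paths))):
        with open(path, errors="ignore") as f:
            for kind, n in classify(f.read()).items():
                totals[kind] += n
    return totals
```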
Stage 3 – Full Scan
For each cleanup phase, the company runs Stage 3, a full scan, to get the complete list of files to address, and uses BigID's access intelligence and remediation apps to implement the end-to-end correction and auditing workflow: each file is removed, or edited to restrict its permissions and strip the unnecessary sensitive information, as needed.
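A simplified sketch of the per-file remediation step might look like the following; both functions are illustrative stand-ins for what a remediation app automates at scale, and the SSN pattern is redeclared here so the sketch is self-contained:

```python
# Simplified sketch of per-file remediation: restrict a flagged file's
# permissions to owner-only, or redact sensitive matches in place.

import os
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def restrict_permissions(path: str) -> None:
    """Remove group/other access; only the owner can read and write."""
    os.chmod(path, 0o600)

def redact_ssns(path: str) -> int:
    """Replace SSNs with a masked value; return the number of redactions."""
    with open(path, errors="ignore") as f:
        text = f.read()
    text, n = SSN.subn("***-**-****", text)
    if n:
        with open(path, "w") as f:
            f.write(text)
    return n
```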
After all cleanup phases are complete and all known problematic files are fixed, the company moves to the steady-state Stage 4 – maintenance.
Stage 4 – Maintenance
Once a month, a scheduled sample scan revisits all unstructured data sources to identify new or changed files whose permissions are open too wide.
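In code terms, the maintenance pass combines the two earlier sketches: revisit only files created or modified since the last scan, and re-check their permissions. A hypothetical sketch:

```python
# Sketch of the monthly maintenance pass: re-check only files created or
# modified since the last scan. Share path and cutoff are hypothetical.

import os
import stat
import time

def changed_and_open(root: str, last_scan: float):
    """Yield files modified after `last_scan` that are open to 'other'."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_mtime > last_scan and st.st_mode & (stat.S_IROTH | stat.S_IWOTH):
                yield path

one_month_ago = time.time() - 30 * 24 * 3600  # stand-in for the last scan time
for path in changed_and_open("/mnt/shares/hr", one_month_ago):
    print("needs review:", path)
```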
Data Scanning Best Practices
Remember that scanning by itself, as important as it is, is still a means, not the goal. Scanning just for the sake of scanning may produce a basic data inventory, but it will likely not produce enough value to justify the investment. There is usually little or no value in constant scanning if you are not acting on the findings. And remember that your scanning strategy should be determined by your overall data strategy, not the other way around.