Case Study

GDPR compliance enforcement in a banking development environment

Ready-made modules to plug into your custom project

Data Discovery

Detect any kind of information in your data with auto-tagging. Specify custom data domains using regular expressions or dictionaries, or plug in your own algorithm.
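For illustration, a custom data domain can be thought of as a regular expression and/or a dictionary applied to a sample of column values. The sketch below is a hypothetical Python example of the idea, not the Esplores API:

```python
import re

# Hypothetical custom data domain: a regex pattern plus an optional
# dictionary of known values, applied to a sample of column values.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
FIRST_NAMES = {"anna", "luca", "maria", "paolo"}  # stub dictionary

def match_ratio(values, pattern=None, dictionary=None):
    """Return the fraction of sampled values matching the domain rule."""
    hits = 0
    for v in values:
        text = str(v).strip().lower()
        if pattern and pattern.match(text):
            hits += 1
        elif dictionary and text in dictionary:
            hits += 1
    return hits / len(values) if values else 0.0

sample = ["anna", "mark@example.com", "x9", "LUCA"]
print(match_ratio(sample, pattern=EMAIL_PATTERN))   # 0.25 -> email domain
print(match_ratio(sample, dictionary=FIRST_NAMES))  # 0.5  -> first-name domain
```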

Masking on demand

Mask personal information using predefined algorithms or specify a custom one.

Supply your own data discovery to the masking process. Fine-tune it at any time and re-run the masking whenever you need it.
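As an example of what a custom algorithm might look like, the sketch below shows a deterministic, format-preserving substitution: the same input always yields the same masked output, which keeps repeated runs consistent. It is an illustration only, not the Esplores implementation:

```python
import hashlib

def mask_preserving_format(value: str, secret: str = "rotate-me") -> str:
    """Deterministically replace letters and digits while keeping the
    original lengths, separators and case pattern (illustrative only)."""
    digest = hashlib.sha256((secret + value).encode()).hexdigest()
    out, i = [], 0
    for ch in value:
        if ch.isdigit():
            out.append(str(int(digest[i % len(digest)], 16) % 10))
            i += 1
        elif ch.isalpha():
            repl = chr(ord("a") + int(digest[i % len(digest)], 16) % 26)
            out.append(repl.upper() if ch.isupper() else repl)
            i += 1
        else:
            out.append(ch)  # keep separators such as '-' or '@'
    return "".join(out)

print(mask_preserving_format("Mario Rossi"))  # same shape, different content
```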

Subsetting

Perform data alignment or find specific cases to replicate in your testing environment using the subsetting module.

Apply data masking algorithms on the fly, without anyone accessing personal information in the process.
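Conceptually, on-the-fly masking means the masking function sits between the read from the source and the write to the target, so unmasked rows are never persisted in the test environment. A minimal, hypothetical sketch:

```python
def subset_with_masking(source_rows, predicate, mask):
    """Yield only the rows matching the subset predicate, masking
    personal fields in transit so cleartext never reaches the target."""
    for row in source_rows:
        if predicate(row):
            yield mask(row)

# Toy usage: replicate only 'premium' customers, masking their names.
rows = [{"id": 1, "name": "Anna", "tier": "premium"},
        {"id": 2, "name": "Luca", "tier": "basic"}]
masked = list(subset_with_masking(
    rows,
    predicate=lambda r: r["tier"] == "premium",
    mask=lambda r: {**r, "name": "XXXX"}))
print(masked)  # [{'id': 1, 'name': 'XXXX', 'tier': 'premium'}]
```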

Fake data generation

Use realistic data in your tests, entirely skipping real data containing personal information. Tune the generation, reproduce the same generated data, or create new batches whenever you need them.
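Reproducible generation typically comes down to seeding the random source: the same seed re-creates the same batch, a new seed yields a fresh one. A minimal sketch using only Python's standard library (the names and fields are made up):

```python
import random

FIRST = ["Anna", "Luca", "Maria", "Paolo"]
LAST = ["Bianchi", "Ferrari", "Rossi", "Verdi"]

def generate_batch(n: int, seed: int):
    """Generate n fake customers; the same seed reproduces the same batch."""
    rng = random.Random(seed)
    return [{"name": rng.choice(FIRST),
             "surname": rng.choice(LAST),
             "account_suffix": f"{rng.randrange(10**10):010d}"}
            for _ in range(n)]

assert generate_batch(3, seed=42) == generate_batch(3, seed=42)  # reproducible
assert generate_batch(3, seed=42) != generate_batch(3, seed=43)  # new batch
```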

Securing development environments from sensitive information

Our banking customer adopted the Esplores platform, using three of its features for GDPR compliance enforcement:

  1. Data discovery of all personal information in a structured database
  2. Masking of all personal information in a development environment, so that unauthorized users cannot see personal information
  3. Subsetting new data into the development environment with contextual masking.

Scope and perimeter

The first phase was to define which kinds of information the privacy department classified as personal and therefore in need of masking.

Then the application developers, in collaboration with the DBAs, defined the actual perimeter: the databases and schemas to scan for personal information.

Performance and tuning tests

Esplores allows customers to prepare PoCs or small tests on real environments, to estimate the required resources and time spans and to tune the default configurations.

Three tests were performed. Since they were interdependent, each test was run only after the preceding real activity had completed: the data masking test was executed only after data discovery and validation of the personal information fields, and the subsetting test only after at least one database had been masked.

i. Testing data discovery

Data discovery was tested by simply configuring read-only access to the business continuity servers of the production environment and running the default discovery on a subset of the perimeter.

After the results were analyzed, assumptions were made about the required resources, and inclusion and exclusion rules were applied.

ii. Testing data masking

Masking requires a test environment that can be refreshed or deleted.

The first test was to clone some database tables and mask some personal data with default parameters.

After the first tuning, a test was run on a database that was due to be refreshed with original data, so that a real execution took place on a subset of the actual perimeter.

After the test, assumptions were made about the required resources, the exclusions, and the application tests to prepare for the entire perimeter.

iii. Testing subsetting

Subsetting required setting up read access to the production environment and write access to the test environment. After checking permissions, a simple one-row test was run to verify that everything was set up correctly.


Data discovery and validation

Data was identified using the Esplores Data Discovery module, although for some databases the customer had already run another data discovery solution.

Esplores can complement other solutions in two ways:

  • Support the DBAs and application developers in validating other data discoveries or double-checking them
  • Simply import the validated information on personal data without checks (not recommended, but useful in case of deadlines or historical, i.e. well-known, databases)

Validation can be done using the Esplores user interface or through exports and imports.

a. Comparison of data discovery

The customer had run a data discovery with a competitor tool that returned a list of fields with their assigned personal information domains. That discovery had two problems:

  • The information was not up to date
  • Some personal information was not found, not because of new fields, but because of the data sampling strategy or because there wasn’t enough data to sample

Esplores used a custom randomization algorithm to download data from the database, so that the sample contained both old and new records and the chance of retrieving null records was minimized. We also used custom dictionaries of names and surnames, and deep learning algorithms to identify personal information.
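The actual randomization algorithm is proprietary to Esplores; the generic query below merely illustrates the intent of randomizing row order and skipping null values so the sample spans both old and new records:

```python
# Hypothetical illustration only: randomize row order and skip nulls.
SAMPLE_QUERY = """
SELECT {column}
FROM {table}
WHERE {column} IS NOT NULL
ORDER BY RANDOM()  -- generic SQL; some engines use RAND() or TABLESAMPLE
LIMIT {sample_size}
"""

def build_sample_query(table: str, column: str, sample_size: int) -> str:
    """Build a randomized, null-free sampling query for one column."""
    return SAMPLE_QUERY.format(table=table, column=column,
                               sample_size=sample_size)

print(build_sample_query("CUSTOMERS", "EMAIL", 1_000_000))
```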

After retrieving a sample of a given percentage (x%) of the database, with at least 1,000,000 records, the data discovery algorithms were run and the results were submitted to the application development representatives and data officers as an export containing:

  • The list of fields with personal information
  • The list of domains for each field, each with a confidence figure (if a field matched multiple domains, the one with the highest confidence was taken as the most probable)
  • A validation class assigned using custom thresholds (NO MEANING, TO CHECK, AUTO-VALIDATED); see the sketch after this list
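The thresholds themselves are tuned per engagement; the values below are invented purely for illustration:

```python
def validation_class(confidence: float,
                     auto_threshold: float = 0.9,
                     noise_threshold: float = 0.2) -> str:
    """Map a domain confidence figure to a validation class.
    Thresholds are hypothetical; in practice they are tuned per customer."""
    if confidence >= auto_threshold:
        return "AUTO-VALIDATED"
    if confidence <= noise_threshold:
        return "NO MEANING"
    return "TO CHECK"

print(validation_class(0.95))  # AUTO-VALIDATED
print(validation_class(0.50))  # TO CHECK
print(validation_class(0.05))  # NO MEANING
```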

A comparison table between the old and new discovery results was generated, highlighting discrepancies and missing personal information.

b. Validation

Validation required consulting all the developer team leaders to check both the accuracy of the discovery and the technical feasibility of masking each field: some fields, although containing personal information, had to be masked with special rules (e.g. ignoring some values or applying a string format to the content) or even excluded from the masking process, because masking them would break the application or render it unusable (e.g. user names cannot be changed, because each user has to log in with their actual user name).
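Such per-field decisions could be recorded declaratively during validation. The structure, field names and rules below are hypothetical, chosen only to show the kinds of exceptions described above:

```python
# Hypothetical per-field rules collected during validation review.
FIELD_RULES = {
    "CUSTOMERS.TAX_CODE": {"action": "mask"},
    "CUSTOMERS.NOTES":    {"action": "mask", "ignore_values": ["N/A", ""]},
    "ACCOUNTS.HOLDER":    {"action": "mask", "format": "upper"},
    "USERS.USERNAME":     {"action": "exclude",  # masking would break login
                           "reason": "application requires real user names"},
}

def apply_rule(field: str, value: str, mask) -> str:
    """Apply the validated masking rule for one field value."""
    rule = FIELD_RULES.get(field, {"action": "mask"})
    if rule["action"] == "exclude" or value in rule.get("ignore_values", []):
        return value  # leave untouched
    masked = mask(value)
    return masked.upper() if rule.get("format") == "upper" else masked

print(apply_rule("USERS.USERNAME", "mrossi", str.swapcase))       # mrossi (excluded)
print(apply_rule("ACCOUNTS.HOLDER", "Mario", lambda v: "xxxxx"))  # XXXXX
```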


Masking of existing personal information

Prior to the actual masking, the validated discovery was configured in the Esplores Data Masking module, and performance and tuning tests were performed on databases within the project perimeter.


Masking used the following workflow:

  • Apply SQL statements to disable constraints and prepare the indexes and keys needed to query all the tables in the perimeter
  • Download only the personal information to a local filesystem, using a high-performance file format
  • Mask the information locally, using parallel processing
  • Update the databases with the masked data
  • Run SQL post-activities and cleanup
  • Run application tests (performed by the application developers or by the UAT and testing teams)

Optionally, the data saved on the filesystem can be kept for a limited period of time, making it possible to run checks on the masking or to roll back selected records in order to pass the application tests.
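A highly simplified sketch of that workflow, with the database interactions reduced to placeholder calls (the `db` object and its methods are hypothetical, not the Esplores module):

```python
from concurrent.futures import ProcessPoolExecutor

def run_masking_workflow(db, tables, mask, workers=8):
    """Illustrative pipeline: extract personal columns, mask locally in
    parallel, write back, then clean up. `db` is a placeholder interface."""
    db.execute_pre_sql(tables)                  # disable constraints, prep keys
    extracts = [db.download_personal_columns(t) for t in tables]  # local files
    # `mask` must be a top-level (picklable) function for process pools.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        masked = list(pool.map(mask, extracts))  # parallel local masking
    for table, data in zip(tables, masked):
        db.update_with_masked(table, data)       # write masked data back
    db.execute_post_sql(tables)                  # cleanup, re-enable constraints
```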

Fake data generation

To avoid using real data, the customer asked to prepare a fake data generator to populate the test and development environments.

We defined the set of personal information to generate: names, surnames and other personal fields.

The fake data generator can be run on demand and used to feed the subsetting phase.
