Masking Data Through Collaboration
With data breaches and privacy losses so common, data masking matters more than ever. It is particularly crucial for organizations that routinely copy sensitive production data to non-production environments for:
- Application development and testing
- Personnel training
- Business analytics modeling
- Sharing data with consultants or other third parties
Organizations from different industry verticals may copy or move around sensitive information, such as:
- Media: Personally identifiable information (PII)
- Healthcare: Protected health information (HIPAA PHI)
- E-commerce: Payment card information (subject to PCI-DSS regulation)
- Technology: Intellectual property (subject to)
Having copies of sensitive data floating around increases the risk of data leakage, which may lead to privacy loss through insider negligence or an external breach, or to non-compliance with GDPR and other regulations.
While data privacy is important, regulation and enforcement vary globally:
- In the US, privacy is not a formal right, but some forms of information exchange are regulated (e.g. HIPAA)
- In Canada, privacy is a right and heavily regulated (PIPEDA)
- In the EU, the right to protection of personal data is well established
- In China, personal data processing is heavily controlled (MIIT)
Essentially, data leakage can occur whenever organizations move data around. This includes:
- Sharing data via email or social media
- Checking in data to version control systems
- Leaving production database backdoors open
- Copying or moving data during ETL operations
Data masking types
Data masking helps reduce the unnecessary spread of protected data inside the organization while maintaining the usability of the data. There are several types of data masking:
Static Data Masking
- Here, a copy of production data is used to create a masked copy of the database, which is then replicated to non-production environments (dev, test, and training). Masking happens as the database is copied. Ideally, masking happens at the persistence level so that no one can retrieve unmasked data from the masked copy of the database.
Dynamic Data Masking
- In this scenario, data is masked as it’s being requested by anyone who lacks sufficient privileges. Exposed data never leaves the production database; rather, the contents are jumbled in responses to queries. Only authorized users can access the raw data.
On-the-fly Data Masking
- As in Dynamic Data Masking, this masking occurs on demand. An ETL process runs and masks the data in the memory of a database application. This method is useful for agile firms that are focused on continuous delivery.
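To make the dynamic case concrete, here is a minimal Python sketch, not a production design. It assumes a hypothetical set of sensitive column names (SENSITIVE_COLUMNS) and a hypothetical can_view_raw flag standing in for a real privilege check. The stored row is never modified; only the response to an unprivileged caller is masked.

```python
from typing import Dict, Set

# Hypothetical set of columns treated as sensitive in this sketch.
SENSITIVE_COLUMNS: Set[str] = {"ssn", "email"}


def mask_value(value: str) -> str:
    """Replace every letter and digit with 'x', keeping separators."""
    return "".join("x" if ch.isalnum() else ch for ch in value)


def read_row(row: Dict[str, str], can_view_raw: bool) -> Dict[str, str]:
    """Return the row as stored for privileged callers, masked otherwise."""
    if can_view_raw:
        return dict(row)
    return {
        col: mask_value(val) if col in SENSITIVE_COLUMNS else val
        for col, val in row.items()
    }


stored = {"name": "Ada", "ssn": "123-45-6789", "email": "ada@example.com"}
masked_view = read_row(stored, can_view_raw=False)  # what most users see
raw_view = read_row(stored, can_view_raw=True)      # what authorized users see
```

In a real deployment the privilege check and the masking rule would live in the database or a proxy in front of it, rather than in application code.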
Data masking techniques
Organizations can use numerous techniques to mask critical data. These techniques include the following:
- Averaging: Individual numeric values are replaced by a value derived by averaging some portion of them.
- Character Scrambling: Characters are jumbled into a random order so the original content isn’t revealed.
- Encryption: An encryption algorithm renders the data unreadable, and authorized users need a key to access it. This is the most complex and most secure type of data masking.
- Nulling Out (or Reduction): Data appears as a series of nulls to anyone who isn’t authorized to access it, such as returning xxx-xx-xxxx for SSN.
- Number and Data Variance: Data is given in a range (min, max) that the actual number falls into, not the actual exact number.
- Shuffling: One data set is used in place of another. The data in an individual column is shuffled vertically in a randomized fashion: Within the same column, a field from one row is inserted into another row.
- String Composite: The system generates random strings along a pattern. It’s designed for strings that follow a pattern in order to be valid. Typically, this is done via a regex-like interface.
- Substitution: Like the shuffling technique, this one mimics the look and feel of real data without compromising privacy. Data such as names, dates, and identifiers are replaced with fake values that follow the original formats.
- Tokenization: This method substitutes data elements with random placeholder values. Tokens are non-reversible because the token bears no logical relationship with the original value.
- Transposition: One value, or a portion of a string, is swapped with another. It typically uses a mathematical function that moves existing data around in a consistent pattern.
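A few of these techniques can be sketched in a handful of lines of Python. The sketch covers averaging, character scrambling, nulling out, variance, and shuffling. All function names, the bucket width, and the sample values are illustrative assumptions, and real masking tools apply these ideas at the database or pipeline level rather than on in-memory lists.

```python
import random
from typing import List, Tuple


def average_mask(values: List[float]) -> List[float]:
    """Averaging: replace every value with the mean of the group."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def scramble_characters(text: str, rng: random.Random) -> str:
    """Character scrambling: reorder the characters of a string at random."""
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)


def null_out(text: str, placeholder: str = "x") -> str:
    """Nulling out: hide letters and digits, keep separators such as '-'."""
    return "".join(placeholder if ch.isalnum() else ch for ch in text)


def to_range(value: int, width: int) -> Tuple[int, int]:
    """Variance: report the (min, max) bucket that contains the value."""
    low = (value // width) * width
    return (low, low + width - 1)


def shuffle_column(column: List[str], rng: random.Random) -> List[str]:
    """Shuffling: reorder one column's values across rows."""
    shuffled = list(column)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(7)  # fixed seed so the sketch is repeatable
print(average_mask([50000.0, 60000.0, 70000.0]))
print(scramble_characters("Jane Doe", rng))
print(null_out("123-45-6789"))
print(to_range(37, 10))
print(shuffle_column(["Ann", "Bob", "Cy"], rng))
```

Averaging and shuffling keep aggregate properties of a column intact, which is why they suit analytics copies. Nulling out and scrambling trade that usability for simplicity.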
Collaborative data masking in Lore IO
Source system data is extracted and deployed into a cloud staging area where Lore IO picks it up in order to manage all necessary transformations and to publish transformed data views that customers materialize in their data warehouses.
To enable customers to protect sensitive data, Lore IO supports a number of data masking techniques, such as encryption, shuffling, substitution, and nulling out. Lore IO is compatible with several data platforms that support encryption of data at rest. This provides baseline protection of customer data.
On top of that, hashing can be done directly in Lore IO using Lore IO expressions that incorporate hashing functions like md5. Clean Up Rules can be applied to any table column.
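As a rough illustration of what hash-based masking of a column value does, here is a Python sketch that uses the standard library's hashlib. This is not Lore IO expression syntax; the md5_mask helper and its optional salt parameter are assumptions made for the example.

```python
import hashlib


def md5_mask(value: str, salt: str = "") -> str:
    """Return the hex MD5 digest of salt + value, encoded as UTF-8."""
    return hashlib.md5((salt + value).encode("utf-8")).hexdigest()


# The same input always yields the same 32-character digest, so masked
# columns can still be used as join keys.
masked_email = md5_mask("ada@example.com")
print(masked_email)
```

Because the digest is deterministic, low-entropy values such as phone numbers can be recovered by hashing every candidate and comparing. A secret salt, as hinted at by the hypothetical salt parameter, makes that harder.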
Substitution and shuffling, which are particularly effective techniques for masking data while maintaining important characteristics of the data, can be implemented within Lore IO. For instance, shuffling can be achieved by using Lore IO’s Lookup feature as follows:
- Define a Virtual Column on the unmasked Table using Lore IO’s Record Number function.
- Define a second Virtual Column on the unmasked Table using an Expression of the form ceil(random(SEED) * NUM_REC) % NUM_REC, where SEED is a seed value for the random number generator and NUM_REC is the number of records in the Table.
- Define a third Virtual Column using Lore IO’s Lookup function to perform a self-join using the above two Virtual Columns as keys.
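The three steps above can be mimicked outside Lore IO with a short Python sketch. It rests on several assumptions: record numbers are 0-based, random(SEED) behaves like a seeded generator that returns a float in [0, 1) for each row, and the Lookup acts like a dictionary lookup keyed by record number. The function name shuffle_by_lookup is hypothetical.

```python
import math
import random
from typing import Dict, List


def shuffle_by_lookup(
    rows: List[Dict[str, str]], column: str, seed: int
) -> List[Dict[str, str]]:
    """Replace `column` in each row with the value from a randomly chosen row."""
    num_rec = len(rows)
    rng = random.Random(seed)
    # Step 1: a record number for every row (assumed 0-based here).
    record_numbers = list(range(num_rec))
    # Step 2: a random key per row, mirroring ceil(random(SEED) * NUM_REC) % NUM_REC.
    random_keys = [math.ceil(rng.random() * num_rec) % num_rec for _ in rows]
    # Step 3: self-join, looking up the row whose record number equals the key.
    by_record_number = {n: rows[n][column] for n in record_numbers}
    return [
        {**row, column: by_record_number[key]}
        for row, key in zip(rows, random_keys)
    ]


people = [
    {"id": "1", "name": "Ann"},
    {"id": "2", "name": "Bob"},
    {"id": "3", "name": "Cy"},
]
masked_people = shuffle_by_lookup(people, "name", seed=42)
```

Under these assumptions each row draws its key independently, so a source value may appear more than once in the masked column and another may not appear at all. A strict permutation would need a different key-generation step.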
Regardless of the technique used, data masking is carried out collaboratively in Lore IO. Masking rules can be stored in a universal data layer for future reuse. Since Lore IO supports collaborative transformation, masking can be done independently from other ETL steps, by different team members. For instance, one stakeholder can configure masking while other contributors are responsible for source-to-target data mapping or for transforming data according to business rules.
Roles and permissions within the Lore IO platform enable customers to obfuscate data from some of the stakeholders who perform data preparation. Unmasked columns and/or rows can be hidden from team members who know how to manage the metadata but who lack sufficient credentials for the underlying data. This capability enables them to contribute to the transformation project without compromising data security or privacy.