Understanding the Limitations of Conventional Anonymization Techniques
Data anonymization is the data breach and data leakage problem nobody talks about.
When news of breach and leak incidents reaches the public domain, the state of the data at the time of the incident is rarely seen as important beyond whether fields containing financial data were encrypted to meet the requirements of regulations such as PCI-DSS. These events happen for many reasons, from security failings and misconfiguration to human error, but the data itself is almost always treated as a passive victim of incompetence or bad intentions rather than a potential root cause of the problem.
In fact, in a growing number of customer data loss incidents a critical underlying factor is that the spilled data could or should have been anonymized but, for various reasons, wasn’t. Sometimes that is because nobody treated anonymization as a priority; sometimes the process was attempted but went wrong.
Very occasionally we get to see this in action. A revealing example was a report from earlier this year that a database containing 1.9 million unique email addresses, full names, phone numbers, IP addresses, and hashed passwords had been stolen from Dutch e-ticketing company Ticketcounter. The database quickly appeared for sale on a hacker forum, after which the criminals demanded the company pay them $337,000 in Bitcoin not to leak the data.
Just another data breach? Technically, it turned out this was a leak that led to a breach. In a supreme irony, Ticketcounter’s CEO Sjoerd Bakker bravely admitted to a news site that the database had been exposed by accident after being copied to an unsecured Azure staging server to test the process of anonymization. This was a company trying to do the right thing and yet that process went awry, ending in a worse outcome.
Not all breaches are caused by process failures in DataOps, but it’s intriguing to ponder how common this problem might be without that being widely acknowledged. In theory, such an exposure shouldn’t even be possible if the data has been competently anonymized before being moved to a cloud server. Unfortunately, when it comes to data anonymization, theory and practice are not always the same thing.
For society, this should matter: once data such as personally identifiable information (PII) has been leaked or stolen, it can never be unleaked or un-stolen. It’s like a switch with no off setting. That PII becomes public forever and has privacy repercussions that last a lifetime.
What is Anonymized Data?
Google’s definition is a good place to start:
“Anonymization is a data processing technique that removes or modifies personally identifiable information (PII); it results in anonymized data that cannot be associated with any one individual.”
Traditional Data Anonymization Techniques
With the end objective of reaching an acceptable standard of differential privacy, this can be done using a variety of techniques: noise addition (perturbing values with random noise), generalization (coarsening or removing data values), masking (hiding or encrypting data values), and pseudonymization (replacing real identifiers with artificial values). The objective is always to make it impossible to connect PII data to an individual.
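To make these techniques concrete, here is a minimal sketch in Python (using pandas and NumPy, with made-up column names and parameters) of what each transformation might look like when applied to a toy customer table:

```python
# Toy illustration of four common anonymization techniques.
# Column names and parameters here are made up for demonstration only.
import hashlib

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":   ["Alice Jones", "Bob Smith"],
    "email":  ["alice@example.com", "bob@example.com"],
    "age":    [34, 58],
    "salary": [52000, 61000],
})

rng = np.random.default_rng(seed=0)
anonymized = pd.DataFrame(index=df.index)

# Noise addition: perturb numeric values with random noise.
anonymized["salary"] = (df["salary"] + rng.normal(0, 1000, size=len(df))).round()

# Generalization: coarsen precise values into broad bands.
anonymized["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 70],
                                labels=["<30", "30-50", "50-70"])

# Masking: hide part of a value (here, everything before the email domain).
anonymized["email"] = df["email"].str.replace(r"^[^@]+", "****", regex=True)

# Pseudonymization: replace a direct identifier with a consistent artificial token.
anonymized["name"] = df["name"].map(
    lambda n: "user_" + hashlib.sha256(n.encode()).hexdigest()[:8])

print(anonymized)
```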
The drawback of traditional anonymization is that, no matter which technique is used, there is a chance it can be reversed using analytical techniques. The bottom line is that while de-identification hides data, the real data is still there. This isn’t hypothetical: several widely cited real-world cases have come to light in which data had been anonymized and yet re-identification of individuals was still possible, putting compliance with the EU’s GDPR or the US’s California Consumer Privacy Act in peril, to name only two.
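The classic failure mode is a linkage attack: quasi-identifiers left in a “de-identified” table can be joined against a publicly available dataset to restore identities. The toy example below, with entirely fabricated data, shows the mechanics:

```python
# Fabricated example of a linkage attack: no direct identifiers are shared,
# yet joining on quasi-identifiers re-attaches names to a sensitive attribute.
import pandas as pd

deidentified = pd.DataFrame({
    "zip":        ["30305", "30306"],
    "birth_year": [1985, 1972],
    "gender":     ["F", "M"],
    "diagnosis":  ["asthma", "diabetes"],   # the sensitive attribute
})

public_register = pd.DataFrame({
    "name":       ["Jane Roe", "John Doe"],
    "zip":        ["30305", "30306"],
    "birth_year": [1985, 1972],
    "gender":     ["F", "M"],
})

reidentified = deidentified.merge(public_register,
                                  on=["zip", "birth_year", "gender"])
print(reidentified[["name", "diagnosis"]])
```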
Privacy-Preserving Synthetic Data
In theory, developers could simply test applications using randomly generated mock data (‘mocking’), but this doesn’t always accurately model application behavior in the real world.
This brings us to the emerging answer to the data anonymization puzzle: privacy-preserving synthetic data. The principle is to generate data that models the characteristics of the original data (good for development) in a way that makes it impossible to re-identify individuals (good for privacy). Unlike conventional techniques, there is no one-to-one mapping between the original data and the anonymized data; each synthetic data point is generated entirely ‘out of thin air’.
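As a rough intuition only, and not a description of how any particular product works, model-based synthesis fits a statistical model to the original data and then samples entirely new rows from that model:

```python
# Toy model-based synthesis: estimate a simple statistical model from the
# original data, then sample brand-new rows from it. Real generators are far
# richer than a multivariate normal, but the principle is the same.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# "Original" data with a built-in correlation between age and income.
original = pd.DataFrame({"age": rng.normal(45, 12, 1000)})
original["income"] = 800 * original["age"] + rng.normal(0, 5000, 1000)

# Fit: the model here is just the mean vector and covariance matrix.
mean = original.mean().to_numpy()
cov = original.cov().to_numpy()

# Sample: no synthetic row maps back to any specific real row,
# but the joint statistics are preserved.
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=1000),
                         columns=original.columns)

print(original.corr().round(2))
print(synthetic.corr().round(2))   # correlations closely match the original
```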
It’s an appealing theory but also a challenging one: the data must have the statistical properties of the original data without revealing that data. How can someone using synthetic data be sure that re-identification isn’t possible? And if it isn’t, how can they be sure the synthetic data is still meaningful?
The answer, in the case of the Synthesized platform, arrives in the shape of privacy assurance reports, which statistically validate that the synthetic data does not allow re-identification.
If there’s a catch, it’s that not all techniques for generating synthetic data are equal. Depending on the method, synthetic data can be susceptible to linkage or attribute disclosure attacks. In Synthesized’s case, this is handled by filtering or disabling sensitive data attributes in records, or by conditionally generating data that has a low risk of being used in a linkage attack and revealing too much information about an individual.
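One generic way to think about this kind of mitigation (an illustrative sketch only, not a description of Synthesized’s implementation) is to suppress rows whose quasi-identifier combinations are rare enough to single out an individual:

```python
# Generic, illustrative mitigation: suppress rows whose quasi-identifier
# combination is so rare it could single out an individual in a linkage attack.
# This is a sketch of the idea, not Synthesized's actual implementation.
import pandas as pd


def suppress_rare_combinations(df: pd.DataFrame,
                               quasi_identifiers: list,
                               k: int = 5) -> pd.DataFrame:
    """Keep only rows whose quasi-identifier combination occurs at least k times."""
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[group_sizes >= k].reset_index(drop=True)


# Hypothetical usage on a synthetic table with these column names:
# safe = suppress_rare_combinations(synthetic_df, ["zip", "birth_year", "gender"], k=5)
```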
In addition, the Synthesized platform can incorporate (ε, δ)-differential privacy into the process of building a statistical model of the original data, providing a mathematical guarantee of individual privacy. This gives organizations the ability to customize the level of privacy depending on their use case and the required utility-privacy tradeoff.
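For intuition, the classical Gaussian mechanism shows how an (ε, δ) guarantee can be attached to a single statistic before it is used to build a model. The clipping range, sensitivity bound, and parameter values below are illustrative assumptions, and production platforms use considerably more sophisticated machinery:

```python
# Classical Gaussian mechanism: calibrate noise to the query's sensitivity and to
# (epsilon, delta) before releasing a statistic. Values and clipping range below
# are illustrative assumptions; this noise bound is valid for epsilon <= 1.
import numpy as np


def gaussian_mechanism(true_value, sensitivity, epsilon, delta, rng):
    """Release a statistic with an (epsilon, delta)-differential privacy guarantee."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return true_value + rng.normal(0, sigma)


rng = np.random.default_rng(seed=0)
incomes = np.clip(rng.lognormal(10, 0.5, 10_000), 0, 200_000)

# After clipping to [0, 200_000], the mean of n values has sensitivity 200_000 / n.
sensitivity = 200_000 / len(incomes)
private_mean = gaussian_mechanism(incomes.mean(), sensitivity,
                                  epsilon=0.5, delta=1e-5, rng=rng)

print(round(incomes.mean(), 2), round(private_mean, 2))
# Smaller epsilon means more noise: stronger privacy, lower utility.
```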
Conclusion
The ability to remove or obscure identifiers underpins the whole big data economy. Without it, large numbers of organizations would not be able to sell data, share it, report it for compliance or, as in the Ticketcounter incident mentioned above, use it as test data in application development. Data isn’t just a passive output of the data economy; it is the fuel for it too.
Synthetic data is becoming an essential tool for achieving this, not only to avoid breaches but also to address the demands of regulation and compliance. But as the continued toll of data breaches and leaks demonstrates, the industry is stuck on a more basic problem: recognizing that it has a problem at all.