Before we dive into how we measure the value of the data, let’s start with what Synthesized data is. At Synthesized we pride ourselves on enabling enterprises to work securely with sensitive data. Unlike typical data masking and data anonymization tools, which can be attacked by modern techniques and drastically reduce data quality, our DataOps platform automatically models complex interactions and hidden features in datasets to generate high-quality data products at any scale while maintaining data utility. This data looks and behaves like the original data, but consists of entirely new data points, leading to faster, more accurate training of models. Our unique approach learns the complex statistical relationships in the data, enabling automatic generation of new, realistic samples at any volume while preserving the quality and performance of the original set. This process of synthesizing data offers a robust solution for both data privacy and data utility.
This post expands on our philosophy for evaluating this Synthesized data and why it makes us confident in the strength of our AI system.
At the heart of this problem lies the question: given two datasets with the same schema (one real, one Synthesized), how can we tell whether they match each other well? At Synthesized we have a key philosophy that underpins how we approach this question:
Make as Few Assumptions as Possible
If your sensitive data is complex and multi-faceted enough to require a machine learning solution like ours, then you can’t simplify your measurement step down just to make your life easier. The complexity of the evaluations needs to match the complexity of the data.
This means we:
- Never rely on a single metric when assessing synthesized data.
- Use metrics that are as general as possible.
- Provide great visualisations of our data for human inspection and, importantly, reassurance.
- Measure ‘use case’ performance where possible when synthesizing data.
Let's look at the credit dataset taken from Kaggle, a simple dataset that still provides enough complexity for us to elaborate on these details
Example
Here’s our dataset: 11 columns of financial information, including recent delinquencies, monthly income and the number of open credit lines. We’ve put this through the Synthesized platform.
You can see the original data in pink and the Synthesized data in blue:
So far the data looks good! But let’s take a cautious approach. There’s likely some complexity hiding underneath that these plots don’t capture, which is why we’ve invested a lot of energy in comprehensive measurements that surface any issues when synthesizing data.
Here’s a set of metrics that measure the distance of each column in isolation:
- Earth Mover’s Distance (EMD) is used for columns with a small number of unique values. This discrete metric makes few assumptions and takes into account the distance between values, which some other discrete measures of distance (like histogram similarity) do not.
- Kolmogorov-Smirnov Distance (KSD) tends to be used for continuous inputs. It is the statistic behind a widely used non-parametric statistical test. A minimal sketch of both metrics follows this list.
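To make this concrete, here is a minimal sketch of how these two per-column distances could be computed with pandas and scipy. This is an illustration only, not the Synthesized platform’s implementation: it assumes numeric columns and uses an arbitrary threshold of 20 unique values to decide when a column counts as discrete.

```python
# Illustrative sketch only (not the Synthesized platform's code).
# Assumptions: both DataFrames share the same schema, the columns compared
# here are numeric, and the discrete/continuous threshold is arbitrary.
import pandas as pd
from scipy.stats import ks_2samp, wasserstein_distance

def column_distances(real: pd.DataFrame, synth: pd.DataFrame,
                     discrete_threshold: int = 20) -> pd.DataFrame:
    """Compare each column of the real and synthesized data in isolation."""
    rows = []
    for col in real.columns.intersection(synth.columns):
        a = real[col].dropna().to_numpy()
        b = synth[col].dropna().to_numpy()
        if real[col].nunique() <= discrete_threshold:
            # Few unique values: Earth Mover's Distance accounts for how far
            # apart the values are, not just whether the histograms overlap.
            rows.append({"column": col, "metric": "EMD",
                         "distance": wasserstein_distance(a, b)})
        else:
            # Continuous values: the KS statistic is the largest gap between
            # the two empirical cumulative distribution functions.
            rows.append({"column": col, "metric": "KSD",
                         "distance": ks_2samp(a, b).statistic})
    return pd.DataFrame(rows)
```

In both cases a value of 0 means the two distributions are indistinguishable under that metric, and larger values flag columns that deserve a closer look.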
It’s also really important that the interactions between columns are preserved. Below we plot one measure of this for continuous values, the Kendall Tau correlation:
Common metrics we use to investigate interactions are:
- The Kendall-Tau Rank Correlation Coefficient is a non-parametric measurement of ordered data that detects when columns exhibit associations.
- Cramér's V is a statistic between 0 and 1 that similarly captures the association between columns. We tend to use it in cases where columns have few unique values.
- McFadden’s pseudo-R² measures how much better a logistic regression predicts a categorical variable when a continuous variable is used as the predictor. It ranges from 0 to 1, where a high value indicates a strong association. We refer to this as the Categorical Logistic Regression Correlation (a sketch of these interaction metrics follows this list).
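As a hedged illustration of these three association measures, here is one way they could be implemented with pandas, scipy and scikit-learn. The column names and model choices are assumptions for the example, not the platform’s own code.

```python
# Illustrative sketch only; column names and model choices are assumptions.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, kendalltau
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def kendall_tau(df: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Kendall's tau rank correlation between two ordered columns."""
    tau, _ = kendalltau(df[col_a], df[col_b])
    return tau

def cramers_v(df: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Cramér's V (0 = no association, 1 = perfect) for two categorical columns."""
    table = pd.crosstab(df[col_a], df[col_b])
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape[0] - 1, table.shape[1] - 1)
    return float(np.sqrt(chi2 / (n * min_dim)))

def categorical_logistic_correlation(df: pd.DataFrame,
                                     continuous_col: str,
                                     categorical_col: str) -> float:
    """McFadden's pseudo-R^2 of a logistic regression predicting the
    categorical column from the continuous one."""
    X = df[[continuous_col]].to_numpy()
    y = df[categorical_col].to_numpy()
    proba = LogisticRegression().fit(X, y).predict_proba(X)
    ll_model = -log_loss(y, proba, normalize=False)
    # Null model: predict the observed class frequencies for every row.
    freqs = pd.Series(y).value_counts(normalize=True).sort_index().to_numpy()
    ll_null = -log_loss(y, np.tile(freqs, (len(y), 1)), normalize=False)
    return 1.0 - ll_model / ll_null
```

Running each of these on both the original and the Synthesized dataset and comparing the resulting values pair by pair (for example, via their absolute differences) shows whether the interaction structure has been preserved.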
All of the ‘non-parametric’ methods here demonstrate how similar the structure of the interactions is in our Synthesized data. Testing for dependence between columns is difficult, so using a wide range of metrics like these gives us a great idea of the similarity. These are only a subset of the methods we use to evaluate data: machine learning blurs the lines between many fields of study, and we pick through each of them for important ways to detect differences between datasets.
The results for this dataset look great, but is this enough to say that our Synthesized data will work for any use case?
After synthesizing multiple datasets across a wide range of domains to validate our approach, we have the experience to say: yes! Our metrics give us great insight into performance across many use cases, which is why we’re so excited for Synthesized to empower businesses.
Depending on the particular use case, we can extract even more information about the quality of the Synthesized data. Let’s explore machine learning modelling.
Machine Learning Modelling
A common use case for data is training a machine learning model. Can we use this to generate more information about the Synthesized data? Of course! Having a concrete use case gives us another target to hit. Let’s model the chance of Delinquency using the rest of the data. Can we maintain the same performance using only the Synthesized data?
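One common way to run this comparison is a “train on synthetic, test on real” experiment: train the same model on the original data and on the Synthesized data, then score both on a held-out slice of real data. The sketch below is an assumption-laden illustration (scikit-learn, a gradient-boosting classifier, numeric features, and “Delinquency” as the hypothetical binary target), not a description of how the platform does it.

```python
# Illustrative "train on synthetic, test on real" comparison.
# Assumptions: numeric features, a binary target column named "Delinquency",
# and a gradient-boosting classifier that tolerates missing values.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def compare_downstream_performance(real: pd.DataFrame, synth: pd.DataFrame,
                                   target: str = "Delinquency") -> dict:
    train_real, test_real = train_test_split(real, test_size=0.3, random_state=0)
    X_test, y_test = test_real.drop(columns=target), test_real[target]

    scores = {}
    for name, train_df in [("trained on real", train_real),
                           ("trained on synthesized", synth)]:
        model = HistGradientBoostingClassifier(random_state=0)
        model.fit(train_df.drop(columns=target), train_df[target])
        # Both models are scored on the *real* held-out data: the question is
        # whether a model trained on synthesized data transfers to reality.
        scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return scores
```

If the two ROC AUC scores are close, the Synthesized data supports this use case as well as the original data does.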
Information like this is great: our solution is meant to be applied to real scenarios, not just perform well in some abstract statistical sense! One of the great benefits of the Synthesized platform is that we can correct imbalances in the dataset. For example, we don’t have many examples of delinquent customers in the original data (less than 10% of the rows). However, many machine learning methods perform better when this imbalance is addressed, so using the Synthesized platform we can reweight the dataset so that 30% of customers have experienced delinquency:
A machine learning practitioner is going to be very happy to use the Synthesized data here!
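To make the reweighting idea concrete, here is a purely illustrative resampling sketch. The Synthesized platform generates new rows at the requested class balance rather than resampling existing ones; “Delinquency” and the row count are hypothetical.

```python
# Purely illustrative: express a 30% delinquency target as a resampling step.
# The Synthesized platform instead *generates* new rows at the requested
# balance; "Delinquency" is a hypothetical binary column name.
import pandas as pd

def rebalance(df: pd.DataFrame, target: str = "Delinquency",
              positive_frac: float = 0.30, n_rows: int = 100_000,
              seed: int = 0) -> pd.DataFrame:
    n_pos = int(n_rows * positive_frac)
    pos = df[df[target] == 1].sample(n_pos, replace=True, random_state=seed)
    neg = df[df[target] == 0].sample(n_rows - n_pos, replace=True, random_state=seed)
    # Shuffle so the positive and negative rows are interleaved.
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed).reset_index(drop=True)
```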
Conclusion
Data is difficult, and we need sophisticated tools to understand it, never mind generate it! In many cases, errors can occur in your pipeline without you even realising. At Synthesized we take these problems very seriously and work hard to understand them. We want to help our partners solve bigger problems, not waste time worrying about data accuracy! That is why we are passionate about thoroughly evaluating our product.
This post is just scratching the surface in terms of how we approach data evaluation on our platform. Our philosophy gives us confidence in our product and we’re just getting started with helping businesses solve big problems.
Relevant Resources:
- The Promise of Synthetic Data
- Three Common Misconceptions about Synthetic and Anonymised Data
- Usefulness of Accurate Synthesized Data for Enterprise Data Science and Business Intelligence
FAQs
What exactly is synthesized data and how does it differ from masked or anonymized data?
Synthesized data is artificial data that mirrors the statistical properties of real data, making it a privacy-safe alternative for testing and development. Unlike masking or anonymization, which alter original data and risk information loss or re-identification, synthesized data is created from scratch. This ensures that no sensitive information is exposed, while still allowing for realistic simulations and analysis.
Why is the Synthesized platform's approach to evaluating the quality of synthesized data unique?
Synthesized uses a comprehensive and cautious approach to data evaluation, employing a diverse set of metrics and visual aids. This multifaceted approach to evaluating synthesized data goes beyond simplistic assessments, ensuring that the synthesized data not only matches the original data statistically but also performs reliably in real-world use cases. It's a commitment to thoroughness and accuracy that sets Synthesized apart.
What kind of metrics does Synthesized use to measure the quality of synthesized data?
Synthesized employs a variety of statistical metrics to measure the quality of synthesized data. For discrete data with few unique values, Earth Mover's Distance (EMD) is used. For continuous data, the Kolmogorov-Smirnov Distance (KSD) is preferred. To assess relationships between columns, the Kendall-Tau Rank Correlation Coefficient is used, and Cramér's V is used for columns with few unique values. Synthesized also uses McFadden's pseudo-R² (the Categorical Logistic Regression Correlation) to measure associations between continuous and categorical columns.
How can synthesized data be applied for machine learning?
Synthesized data serves as a privacy-preserving substitute for training machine learning models, offering the same statistical properties as real data without the risk of exposing sensitive information. It can also be modified to correct imbalances in the original data, potentially leading to improved model performance.
What are the advantages of using synthesized data for machine learning compared to original data?
Synthesized data protects sensitive information while maintaining the statistical properties of real data, ensuring privacy in machine learning applications. It can also be adapted to resolve imbalances in the original data, enhancing model performance. Furthermore, synthesized data can be generated in any desired quantity, offering flexibility and scalability for machine learning projects.
Can synthesized data be used for purposes other than machine learning?
Synthesized data can be used for software testing, providing realistic simulations without risking actual user data. In data analysis, it can fill in missing values or augment existing datasets while preserving privacy. Additionally, synthesized data is valuable for research, allowing for realistic experiments without the ethical concerns associated with using real, potentially sensitive data.
How does the process of synthesizing data help businesses?
Synthesizing data allows businesses to harness the full potential of their data assets without jeopardizing sensitive information. This accelerates innovation, improves decision-making, and enables the development of better products and services in a privacy-compliant manner. By using synthesized data, businesses can gain valuable insights and unlock new opportunities while upholding ethical and legal standards.