Research
April 30, 2021

Solving Data Imbalance with Synthetic Data

Will a customer purchase this product? Is this transaction fraudulent? Is this a picture of a cat or a dog? These are high-value business problems (well, maybe not the last one) that can be solved with the appropriate machine learning techniques. However, as always, data is king: without access to high-quality, balanced data, efforts to answer these questions will be fruitless.

Most real-world datasets are highly skewed and show bias towards a particular outcome, category or segment - especially those related to the detection of rare events.

For example, consider the problem of predicting whether a credit card transaction is fraudulent. Fortunately for the lenders, the overwhelming majority of purchases are legitimate. Unfortunately for the data scientist, their dataset of transactions will contain only a faint signal of fraudulent activity; predicting fraud is a classification task with high data imbalance.

Common Pitfalls in Applying Machine Learning Techniques for Imbalanced Classification

When applying machine learning techniques for imbalanced classification, one may encounter a number of pitfalls: some models are unsuitable, model explainability may suffer and unwanted biases may be propagated.

When large dataset imbalances are present, models can achieve apparently stellar performance just by predicting everything as the majority outcome. For example, with a dataset that contains 99% legitimate transactions, such a model would have an accuracy of 99%. Great, right? Unfortunately not!

Once the data imbalance is taken into account, this figure becomes much less impressive: the real business value lies in correctly identifying the transactions that are fraudulent. (For the curious, more appropriate metrics to look at in this case are precision and recall.)
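A toy calculation makes the accuracy trap concrete. The counts below are illustrative, not taken from the article's dataset: a model that labels all 10,000 transactions as legitimate when 100 of them are fraudulent.

```python
# Hypothetical confusion-matrix counts for a "predict everything
# legitimate" model: 10,000 transactions, 100 of them fraudulent.
tp, fp, fn, tn = 0, 0, 100, 9_900

# Accuracy looks stellar...
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.99

# ...but precision and recall for the fraud class tell the real story.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy, precision, recall)  # 0.99 0.0 0.0
```

Despite 99% accuracy, the model finds exactly zero fraudulent transactions.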

Traditional Dataset Rebalancing Techniques

There is a range of well-studied and utilised techniques that aim to solve the problem of class imbalance, and these fall into two categories:

  1. Sampling-based methods, which aim to augment and reshape the underlying data;
  2. Model-based methods, which directly constrain how a model can learn from the data.

Sampling-based techniques aim to ‘rebalance’ the data, ensuring there is an equal representation of each outcome. The simplest approach is to randomly undersample the majority outcome, or oversample the minority. The drawbacks here are clear: there is either a reduction in the training size or a duplication of records, leading to a reduction in data variability that can result in model overfitting.
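A minimal sketch of these two baseline approaches, using plain Python on a toy labelled dataset (all counts here are illustrative):

```python
import random

random.seed(0)

# Toy dataset of (id, label) pairs: 95 legitimate (0), 5 fraudulent (1).
data = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]
majority = [row for row in data if row[1] == 0]
minority = [row for row in data if row[1] == 1]

# Random undersampling: shrink the majority down to the minority's size.
# Drawback: throws away 90 of the 95 majority records.
undersampled = random.sample(majority, len(minority)) + minority

# Random oversampling: duplicate minority records up to the majority's size.
# Drawback: the 5 fraud records are repeated ~19 times each.
oversampled = majority + random.choices(minority, k=len(majority))

print(len(undersampled), len(oversampled))  # 10 190
```

Both resampled sets are perfectly balanced, but the first has lost most of the data and the second contains no new information about fraud.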

More advanced techniques rely on creating new data-points for the minority outcome to achieve a balanced distribution of classes. SMOTE (Synthetic Minority Oversampling Technique) is one such method, available in open-source projects such as imbalanced-learn.

However, SMOTE is not based on a statistical understanding of the data, and it is problematic with non-continuous variables and high-dimensional datasets. For complex datasets, SMOTE often does not provide an advantage over random sampling of the original data.
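To see why SMOTE can struggle, it helps to know what it actually does: it creates new minority points by interpolating between an existing minority sample and one of its minority-class neighbours. The sketch below is a stripped-down illustration of that core idea only, not imbalanced-learn's implementation (which, among other things, randomises the choice among the k nearest neighbours):

```python
import math
import random

random.seed(1)

def smote_like(minority, n_new):
    """Generate n_new points, each on the line segment between a random
    minority sample and its nearest minority-class neighbour."""
    new_points = []
    for _ in range(n_new):
        a = random.choice(minority)
        # Nearest neighbour of `a` among the other minority points.
        b = min((p for p in minority if p is not a),
                key=lambda p: math.dist(a, p))
        t = random.random()  # interpolation factor in [0, 1)
        new_points.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return new_points

minority = [(1.0, 1.0), (1.2, 0.9), (2.0, 2.1)]
synthetic = smote_like(minority, 4)
print(synthetic)
```

The interpolation explains the pitfalls mentioned above: averaging two category codes of a non-continuous variable produces a meaningless value, and in high dimensions the straight line between two points may pass through regions that look nothing like real fraud.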

Scalable Rebalancing Solutions

Alternatively, synthetic data produced with the Synthesized platform can be used to rebalance, and we believe this is the most scalable and powerful technique. With a deep understanding of the data imbalance, our approach can go beyond simple rebalancing of individual classes, and enables tweaking and reshaping of arbitrary groups within a dataset. This allows users to generate a range of custom scenarios for testing and development purposes.

Additionally, the Synthesized platform provides a powerful all-in-one solution for common data science tasks, e.g. data-augmentation and missing value imputation, all whilst being privacy preserving by design.

The Test: Rebalancing with Synthetic Data versus Original Dataset

To demonstrate rebalancing with synthetic data, we apply this method to the Kaggle credit card fraud detection dataset. It contains anonymised credit card transaction details, of which approximately 99.8% are legitimate and the remaining 0.2% are fraudulent; an extreme, but realistic, data imbalance for this type of problem.

To predict the fraudulent transactions, we train separate logistic regression classifiers on:

  • the original dataset,
  • a balanced dataset created using SMOTE, and
  • a balanced synthetic dataset created with the Synthesized platform.

These trained models are then evaluated and compared on an unseen sample of the original imbalanced dataset.

Before getting to the results, it is interesting to understand what our synthetic fraud examples look like.

One method to achieve this is to visualise a 2-dimensional representation of the Synthesized dataset using a UMAP embedding. With this we can identify how distinct fraudulent and non-fraudulent transactions are, and whether there is any significant clustering.

Visualisation of a 2-dimensional representation of the Synthesized dataset using a UMAP embedding

Each red point is a completely new synthetic example of fraud that the Synthesized platform has been able to generate from only a small sample of real fraud examples. The fact that the two clusters are separate indicates that there is a distinct difference between them. The smaller clusters indicate that there is a variety of synthetic examples, and they aren't all duplicated data points with the same characteristics.

So, on to the results...

How Well Can We Predict Fraud with Our Three Datasets?

Unfortunately, there is no obvious metric to use, as the ideal choice depends on the costs to the business of missing fraud or incorrectly flagging legitimate transactions. However, we can look at the area under the ROC curve (ROC-AUC), together with a confusion matrix, to understand how well the model can find and correctly predict fraudulent activity, shown below.
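For readers who want these metrics made concrete, both can be computed from first principles. The sketch below uses the rank (Mann-Whitney) formulation of ROC-AUC; the labels and scores are illustrative only, not the article's model outputs:

```python
def roc_auc(y_true, scores):
    """ROC-AUC via the rank formulation: the probability that a randomly
    chosen positive is scored higher than a randomly chosen negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) counts for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Illustrative ground truth and model scores (0 = legitimate, 1 = fraud).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
y_pred = [int(s >= 0.5) for s in scores]

print(roc_auc(y_true, scores))      # 0.9583...
print(confusion(y_true, y_pred))    # (3, 1, 1, 5)
```

Note how the confusion matrix exposes the single missed fraud case and the single false alarm that a headline accuracy figure would hide.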

Fraud Prediction Comparison in 3 Datasets

Key Takeaways of the Analysis

  • It is clear that on the original imbalanced dataset, the model can only correctly find approx. 60% of the fraudulent transactions; it is biased towards classifying everything as non-fraudulent and has a ROC-AUC score of 0.93.
  • With SMOTE we see a significant improvement, with the model able to identify almost 80% of the fraud cases, and a ROC-AUC of 0.96.
  • Looking at our balanced Synthesized dataset, we obtain even better results: all cases of fraud in the test data have been successfully identified, and the ROC-AUC increases to 0.99! One possible reason for this is the larger variety of fraud cases that can be generated using synthetic data.

However, you may notice that this comes at the cost of an increased false-positive rate (legitimate transactions incorrectly classified as fraud). This reflects the inherent trade-off between the precision and recall of a classifier, and is a well-understood phenomenon with resampling techniques.

Conclusion

To summarise, data imbalance is a problem that affects most real-world datasets, and must be handled correctly when training predictive models. Synthesized offers a powerful solution for data scientists to rebalance their datasets with high quality synthetic data that may produce significantly better results than conventional techniques.

In addition, the Synthesized platform can solve a number of common problems for data scientists, all in the same solution, with privacy-preservation by design.

FAQs

What are the initial steps to identify a data imbalance in a dataset?

Before addressing a data imbalance, it is critical to first identify whether one exists. This can usually be done by analysing the distribution of classes within the dataset. A significant discrepancy in the number of instances for each class typically indicates a data imbalance. Tools like data visualisation software can help highlight these imbalances by providing clear, graphical representations of data distribution.
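As a concrete starting point, a quick class count is often enough to reveal the imbalance. The labels and counts below are hypothetical:

```python
from collections import Counter

# Hypothetical label column from a transactions dataset.
labels = ["legit"] * 9_970 + ["fraud"] * 30

counts = Counter(labels)
ratio = max(counts.values()) / min(counts.values())

print(counts)                          # Counter({'legit': 9970, 'fraud': 30})
print(f"imbalance ratio ~ {ratio:.0f}:1")
```

A ratio of hundreds to one, as here, is a strong signal that accuracy alone will be misleading and rebalancing should be considered.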

Can data imbalance affect machine learning models in regression tasks?

While data imbalance is commonly discussed in the context of classification problems, it can also affect regression tasks. In regression, imbalances might not pertain to class distribution but can occur in the form of outliers or skewed distribution of key variables. These imbalances can lead the model to develop biases towards more frequently occurring values, which can skew predictions and affect the model's overall accuracy.

Are there industry-specific concerns when dealing with data imbalance?

Yes, data imbalance can have different implications depending on the industry. For instance, in healthcare, an imbalance might lead to less accurate predictions for rare diseases, which can be critical. In finance, as discussed in the article, data imbalance can hinder fraud detection. Each industry must approach data imbalance with tailored strategies that consider the stakes of misclassification or prediction errors.

How does data imbalance impact the interpretability of machine learning models?

Data imbalance can significantly complicate the interpretability of machine learning models. When models are trained on imbalanced data, they might develop complex decision rules that overly favour the majority class, making it hard to discern how decisions are being made for the minority class. This lack of clarity can be problematic, especially in sectors like finance or healthcare where understanding the rationale behind every decision is crucial.

What are some emerging techniques to handle data imbalance that weren't mentioned in the article?

Beyond traditional methods like SMOTE or synthetic data generation, emerging techniques for handling data imbalance include advanced algorithms like Adaptive Synthetic Sampling (ADASYN) and Cluster-based Over Sampling (CBO). These methods are designed to generate synthetic samples in a more nuanced manner by considering the data's underlying structure.
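ADASYN's distinguishing idea is that minority points surrounded by many majority-class neighbours are "harder" to learn, so they should receive proportionally more of the synthetic samples. The sketch below illustrates only that weighting step on toy 2-D points; it is a simplified illustration, not the full algorithm or imbalanced-learn's implementation:

```python
import math

def adasyn_weights(minority, majority, k=3):
    """For each minority point, compute the fraction of its k nearest
    neighbours that belong to the majority class, then normalise so the
    weights sum to 1. Harder points get a larger share of new samples."""
    weights = []
    for m in minority:
        # k nearest neighbours among all points, excluding m itself
        # (m sorts first, at distance zero).
        neighbours = sorted(minority + majority,
                            key=lambda p: math.dist(m, p))[1:k + 1]
        weights.append(sum(1 for n in neighbours if n in majority) / k)
    total = sum(weights)
    if total == 0:  # no majority neighbours at all: fall back to uniform
        return [1 / len(minority)] * len(minority)
    return [w / total for w in weights]

# Toy example: one isolated minority point and one deep in majority territory.
minority = [(0.0, 0.0), (5.0, 5.0)]
majority = [(4.5, 5.0), (5.5, 5.0), (4.8, 4.6)]

weights = adasyn_weights(minority, majority)
print(weights)  # [0.4, 0.6]
```

The point at (5.0, 5.0), surrounded entirely by majority neighbours, receives the larger share of synthetic samples, which is exactly the behaviour that distinguishes ADASYN from SMOTE's uniform treatment of minority points.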