Improve performance of ML models using Synthesized SDK

Customers realize millions in cost savings by improving model performance by up to 15%.
Case study

Labelled high-quality data generation for fraud detection

Challenge

A financial institution in Latin America that was rapidly expanding its operations lacked the labelled transactional data needed to develop new product offerings at the speed its users demanded, which led to low predictive performance of customer-facing models.
Transactional bank fraud in particular is a notoriously difficult and complex problem. The performance of fraud detection and AML models is only as good as the quality and amount of training and test data, yet data acquisition and provisioning are often slow or impossible because the records are highly controlled.

Solution

Synthesized SDK automatically extracted the "deep" statistical properties of the available fraud data to create highly representative new examples of fraud for training and testing the fraud detection system and for augmenting the existing data. The SDK learned to generate 5 million complex fraud data records in less than 10 minutes.

Impact

  • Increased model performance - 5 times fewer errors on the augmented fraud data
  • Increased developer productivity & speed-to-market - minutes to generate additional labelled synthetic fraud data
  • Lowered data acquisition costs
ML Models

Try the Synthesized SDK difference

Performance increase

A banking customer improved model performance across its entire fraud model portfolio by ~4-15%.

Cost savings

A retail banking customer realized $5M in cost savings with a 2% performance increase in a single fraud model.

Time savings

Customers eliminate 2-4 months from model delivery cycles with fast synthetic data generation.

Reduce the noise and amplify the signal in your training data

Synthesized makes it easy to reshape and rebalance training data to amplify the fraud signal, which is critical to improving model performance.
With data bootstrapping, Synthesized delivers remarkable statistical accuracy across every dimension of your data.
Models retrained with Synthesized simply perform better.

Automatically improve the quality of your training data

Missing data is another curse of model training.
Synthesized data imputation instantly replaces missing values with synthetic values learned from the patterns of the existing data.
Improving training data quality means better model performance.
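
As a point of comparison, the simplest form of imputation fills each gap with a column-level summary. The sketch below does this in plain Python, using the median for numeric columns and the most common value for categorical ones. It is a deliberately naive baseline, not the learned, pattern-based imputation described above; the column names, the toy rows, and the `impute_simple` helper are hypothetical.

```python
from statistics import median, mode


def impute_simple(rows, numeric_cols, categorical_cols):
    """Fill None values: median for numeric columns, mode for categorical."""
    fills = {}
    for col in numeric_cols:
        fills[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fills[col] = mode(r[col] for r in rows if r[col] is not None)
    # Build new dicts so the input rows are left unchanged.
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in rows
    ]


# Toy transactions with two missing values (illustrative only).
rows = [
    {"amount": 12.5, "category": "food"},
    {"amount": None, "category": "travel"},
    {"amount": 40.0, "category": None},
    {"amount": 18.0, "category": "food"},
]
filled = impute_simple(rows, ["amount"], ["category"])
print(filled)
```

A summary-based fill like this ignores relationships between columns, which is the gap that imputation learned from the patterns in the data is meant to close.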

Generate any volume of synthetic training data

Most companies simply don’t have enough real data to train their models effectively.
Synthesized allows you to generate any volume of high-quality synthetic data in minutes.

See Synthesized SDK in action

Transcript
Hello.

Today we're going to be diving into an example of how Synthesized can be used to improve the performance of your fraud detection models.

In this example, we have a transactions dataset with 8 columns. You can see fraud is the binary target on the left.

Let’s see how good a simple model is on the existing dataset. We'll use age, gender, category, and amount as explanatory variables to try to predict fraudulent transactions.

So we get an ROC AUC of about 88%. Not bad! But we can do better by adding in some synthetic data.
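
For reference, ROC AUC can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The sketch below computes it directly from that definition in plain Python. The labels and scores are made-up illustrative values, not data from the demo, and `roc_auc` is a hypothetical helper name.

```python
from itertools import product


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (fraud, non-fraud) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative values only: 1 = fraud, 0 = legitimate.
labels = [0, 0, 1, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.90, 0.05]
auc = roc_auc(labels, scores)
print(f"ROC AUC: {auc:.3f}")
```

The pairwise loop is quadratic in the number of rows, so it is only meant to make the definition concrete; a library implementation is the practical choice for real datasets.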

First, we'll need to extract some metadata, build a generative model, and train it. But this is easy with Synthesized. The training process itself doesn't take long either. Here we have 8 columns and about 20 thousand rows, and on a 4-core CPU it's going to take about 3-5 minutes.

Once the generative model is trained we can use it to *upsample* the number of fraudulent transactions in our training dataset and thereby amplify the signal of fraud in the dataset. Fraud datasets are typically very imbalanced with a weak signal. Synthesized can be used to highlight this signal and improve model performance.

It's finished training now. Let's use a Conditional Sampler to generate a dataset with the amount of fraud rebalanced to be 50:50.
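
The conditional sampling step draws new synthetic rows from a trained generative model. As a much simpler stand-in that needs no SDK, the sketch below rebalances a toy dataset to 50:50 by random oversampling with replacement, which duplicates existing fraud rows rather than generating new ones. The column names, the toy rows, and the `oversample_to_balance` helper are all assumptions made for illustration.

```python
import random


def oversample_to_balance(rows, target="fraud", seed=0):
    """Return a new shuffled list with equal counts of target == 0 and 1.

    The minority class is topped up by sampling its rows with replacement.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if r[target] == 1]
    neg = [r for r in rows if r[target] == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy data: 2% fraud, loosely mirroring an imbalanced transactions table.
rng = random.Random(42)
rows = [
    {"amount": round(rng.uniform(1, 500), 2), "fraud": 1 if i < 20 else 0}
    for i in range(1000)
]
balanced = oversample_to_balance(rows)
fraud_share = sum(r["fraud"] for r in balanced) / len(balanced)
print(len(balanced), fraud_share)
```

Whichever rebalancing method is used, held-out original rows should be set aside before rebalancing so that duplicated or synthetic rows never leak into the evaluation set.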

Now that we've created the new dataset, let's validate what it looks like compared to the original. We can do that with the Assessor class. Let’s save that figure and have a look.

As you can see, the fraud in the new dataset has been upsampled to a 50:50 split.

Now we can reevaluate our model, comparing its performance when trained on the synthetic dataset with its performance when trained on the original dataset. In both cases we evaluate on some held-out original data.

We've improved the performance here from 88% to 95%, an absolute difference of 7 percentage points. And it only took about 5 minutes to do!
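
The 88% and 95% figures can be compared in more than one way. The short calculation below uses only the two numbers quoted in the walkthrough and shows the absolute gain in percentage points, the relative gain, and the reduction in the remaining gap to a perfect score of 1.0. The variable names are illustrative.

```python
baseline_auc = 0.88   # model trained on the original data
augmented_auc = 0.95  # model trained on the rebalanced data

gain = augmented_auc - baseline_auc

absolute_gain_pp = gain * 100                       # percentage points
relative_gain_pct = gain / baseline_auc * 100       # relative to the baseline
gap_reduction_pct = gain / (1 - baseline_auc) * 100  # share of the gap to 1.0 closed

print(f"absolute gain: {absolute_gain_pp:.1f} percentage points")
print(f"relative gain: {relative_gain_pct:.1f}%")
print(f"reduction in gap to perfect AUC: {gap_reduction_pct:.1f}%")
```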

This has been a walkthrough of just one of the ways Synthesized can help you extract the most out of your data. Thank you for listening.

Additional Resources

Learn more about AI and data bias

Synthetic data in machine learning: what, why, how?
Solving data imbalance with synthetic data
A guide to data augmentation and data rebalancing