Improve performance of ML models

Case study

Labelled high-quality data generation for fraud detection

Challenge

A rapidly expanding financial institution operating in Latin America lacked the labelled transactional data it needed to develop new product offerings at the speed its users demanded, which led to low predictive performance in customer-facing models.

Transactional bank fraud in particular is a notoriously difficult and complex problem to address. Fraud detection and anti-money-laundering (AML) models are only as good as the quality and quantity of their training and test data, and data acquisition and provisioning are often slow or impossible because the records are highly controlled.

Solution

The Synthesized SDK automatically extracted the "deep" statistical properties of the available fraud data to create highly representative new examples of fraud, both for training and testing the fraud detection system and for augmenting the existing data. The SDK learned to generate 5 million complex fraud data records in less than 10 minutes.

Impact

Increased model performance: 5 times fewer errors on the augmented fraud data
Increased developer productivity and speed to market: additional labelled synthetic fraud data generated in minutes
Lowered data acquisition costs

Reduce the noise and amplify the signal in your training data

Synthesized makes it easy to reshape and rebalance training data to amplify the fraud signal, which is critical to improving model performance.

With data bootstrapping, Synthesized delivers remarkable statistical accuracy across every dimension of your data.

Models retrained with Synthesized simply perform better.

Generate any volume of synthetic training data

Most companies simply don’t have enough real data to train their models effectively.

Synthesized allows you to generate any volume of high-quality synthetic data in minutes.

See Synthesized SDK in action

Transcript

Hello.

Today we're going to be diving into an example of how Synthesized can be used to improve the performance of your fraud detection models.

In this example, we have a transactions dataset with 8 columns. You can see fraud is the binary target on the left.

Let’s see how good a simple model is on the existing dataset. We'll use age, gender, category, and amount as explanatory variables to try to predict fraudulent transactions.

So we get an ROC AUC of about 88%. Not bad! But we can do better by adding in some synthetic data.
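For readers who want to see what the ROC AUC figure measures, here is a minimal sketch in plain Python. It is not the code used in the demo, and the labels and scores are made-up toy values. It relies on the standard interpretation of ROC AUC as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, with ties counted as half.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = fraud, 0 = legitimate, with invented model scores.
labels = [0, 0, 1, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.70, 0.50]
auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for illustration. On a dataset of the size mentioned in the walkthrough, a rank-based implementation from an established library would be the practical choice.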

First, we'll need to extract some metadata, build a generative model and train it. But this is easy with Synthesized. The training process itself doesn't take long either. Here we have 8 columns and about 20 thousand rows, and on a 4-core CPU it's going to take about 3 to 5 minutes.

Once the generative model is trained we can use it to *upsample* the number of fraudulent transactions in our training dataset and thereby amplify the signal of fraud in the dataset. Fraud datasets are typically very imbalanced with a weak signal. Synthesized can be used to highlight this signal and improve model performance.

It's finished training now. Let's use a Conditional Sampler to generate a dataset with the amount of fraud rebalanced to be 50:50.
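To make the 50:50 target concrete, the sketch below rebalances a toy dataset in plain Python. It is a deliberately simpler stand-in, not the Synthesized Conditional Sampler: it uses random oversampling, which copies existing minority rows, whereas the walkthrough generates new synthetic rows from the trained generative model. The function name `oversample_to_balance`, the column names, and the toy rows are all assumptions made for illustration.

```python
import random


def oversample_to_balance(rows, target_key="fraud", seed=0):
    """Duplicate minority-class rows at random until both classes are equal in size."""
    rng = random.Random(seed)
    fraud = [r for r in rows if r[target_key] == 1]
    legit = [r for r in rows if r[target_key] == 0]
    if not fraud or not legit:
        raise ValueError("both classes must be present")
    minority, majority = (fraud, legit) if len(fraud) < len(legit) else (legit, fraud)
    # Sample with replacement to fill the gap between the two class sizes.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 2 fraudulent rows among 20 transactions.
rows = [{"amount": 10.0 + i, "fraud": 1 if i < 2 else 0} for i in range(20)]
balanced = oversample_to_balance(rows)
fraud_share = sum(r["fraud"] for r in balanced) / len(balanced)
print(len(balanced), fraud_share)
```

Because plain oversampling only repeats rows it has already seen, it adds no new variation to the fraud class. That limitation is the reason a generative approach is used in the walkthrough.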

Now that we've created the new dataset, let's validate what it looks like compared to the original. We can do that with the Assessor class. Let’s save that figure and have a look.

As you can see, the fraud in the new dataset has been upsampled to a 50:50 split.

Now we can reevaluate our model, comparing its performance when trained on the synthetic dataset with its performance when trained on the original dataset, with both evaluated on some held-out original data.

We've improved the ROC AUC here from 88% to 95%, an absolute difference of 7 percentage points. And it only took about 5 minutes to do!
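The comparison is only fair if the held-out rows are set aside before any rebalancing, so that both models are scored on the same untouched original data. The following is a minimal sketch of that protocol in plain Python, assuming a hypothetical `stratified_holdout` helper and invented toy rows. It is not taken from the demo code.

```python
import random


def stratified_holdout(rows, target_key="fraud", test_fraction=0.25, seed=0):
    """Split rows into train and test, keeping the class ratio the same in both."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [r for r in rows if r[target_key] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy data: 8 fraudulent rows among 80 transactions.
rows = [{"id": i, "amount": 5.0 * i, "fraud": 1 if i % 10 == 0 else 0} for i in range(80)]
train, test = stratified_holdout(rows)

# Any rebalancing or synthetic augmentation would be applied to `train` only.
# `test` stays original data, so no augmented row can leak into the evaluation.
train_ids = {r["id"] for r in train}
test_ids = {r["id"] for r in test}
print(len(train), len(test), train_ids.isdisjoint(test_ids))
```

Splitting per class keeps a few fraudulent rows in the test set even when fraud is rare, which a purely random split on a small dataset might not guarantee.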

This has been a walkthrough of just one of the ways Synthesized can help you extract the most out of your data. Thank you for listening.

Additional Resources

Synthetic Data in Machine Learning: What, Why, How?
