Fraud Detection Dataset

Every year there are millions of credit card fraud victims, and the costs for the credit card issuer can be huge. A fast and reliable fraud detection system can drastically improve the financial performance of a credit card business, and such a system relies heavily on historical data to understand how fraud works and to prevent it.

Dataset

This fraud detection dataset (from the Kaggle dataset repository) contains historical data on 594,643 transactions for 4,112 different users. The target variable (fraud) flags fraudulent payments. The 7 other columns contain a time step identifier, personal information about the payer (an identifier, age and gender) and specifics about the transaction (merchant, category and amount).

Use case

The objective is to train an ML model that outputs a fraud probability for each transaction in a subsample of the dataset. The prediction system has to catch as much fraud as possible without being too strict, since customer satisfaction suffers if non-fraudulent transactions are blocked too often. ROC AUC is therefore a good metric: it summarizes the trade-off between the true positive rate (fraud caught) and the false positive rate (legitimate transactions flagged) across all decision thresholds.
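To make the metric concrete, here is a minimal sketch of ROC AUC computed as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the dataset. The pairwise loop is for illustration only; on the full dataset a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison: the fraction of (fraud, legitimate)
    pairs in which the fraud gets the higher score. Ties count as half."""
    fraud_scores = [s for y, s in zip(labels, scores) if y == 1]
    legit_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not fraud_scores or not legit_scores:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for f in fraud_scores:
        for g in legit_scores:
            if f > g:
                wins += 1.0
            elif f == g:
                wins += 0.5
    return wins / (len(fraud_scores) * len(legit_scores))


# Hypothetical toy data: 1 = fraud, 0 = legitimate.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]  # model's fraud probabilities
auc = roc_auc(labels, scores)
print(f"ROC AUC: {auc:.3f}")  # 7 of the 8 pairs are ranked correctly: 0.875
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, independent of where the blocking threshold is eventually set.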

Given the temporal dimension of this fraud detection dataset, it can also be treated as a time-series problem to exploit the temporal relationship between samples.
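If the temporal ordering is exploited, the evaluation split should respect it too, so that the model is never tested on transactions that precede its training data. The sketch below shows one way to do that. The column name `step` for the time step identifier, the helper `time_ordered_split` and the toy rows are assumptions made for illustration.

```python
def time_ordered_split(rows, step_key="step", train_fraction=0.8):
    """Split rows so that every training row precedes every test row in time."""
    steps = sorted({row[step_key] for row in rows})
    # Number of distinct time steps assigned to training (at least one).
    n_train_steps = max(1, int(len(steps) * train_fraction))
    train_steps = set(steps[:n_train_steps])
    train = [row for row in rows if row[step_key] in train_steps]
    test = [row for row in rows if row[step_key] not in train_steps]
    return train, test


# Hypothetical toy data: two transactions per time step, steps 0 to 4.
rows = [
    {"step": s, "amount": 10.0 * (s + 1) + i, "fraud": 0}
    for s in range(5)
    for i in range(2)
]
train, test = time_ordered_split(rows)
print(len(train), len(test))
```

Splitting on whole time steps rather than on individual rows keeps all transactions from the same step on the same side of the cutoff.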

Data problems and Synthesized solutions

Although this dataset can make a huge difference to the credit card issuer's performance, it has some problems that complicate its use. Luckily, Synthesized can solve these problems in a fast and intuitive way.

Privacy. This dataset contains personal information about users, making it difficult to work with and share. With Synthesized we can generate a synthetic dataset that preserves statistical information (95% utility across multiple ML tasks compared to the original data) in under 10 minutes, while removing all risk of non-compliance with data regulations such as GDPR, HIPAA and CCPA.
Imbalanced Dataset. Only 7,200 of the 594,643 payments in this dataset (about 1.2%) are fraudulent; the other 587,443 are legitimate. This imbalance may heavily reduce the performance of the model if not treated carefully. With Synthesized's Data Manipulation tool we can manipulate the output distribution of this column and generate a balanced dataset, which can improve final model performance. Read more about the benefits of data rebalancing.
Fairness and Biases. AI models can unintentionally (and potentially illegally) discriminate against certain sensitive groups of people if the underlying training data is biased. In this case, features such as age and gender should not be used as discriminative features under current legislation in the US, UK and EU. Synthesized can help assess how biased a dataset is, find where the biases are and flag them to the user. Read more about discrimination by AI.
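The size of the class imbalance can be checked directly from the counts given in this document (594,643 transactions, of which 7,200 are fraudulent). The sketch below also derives inverse-frequency class weights, a generic heuristic for imbalanced classification. It is shown only as a point of comparison and is not Synthesized's Data Manipulation tool. The variable names are illustrative.

```python
total_transactions = 594_643  # from the dataset description
fraud_count = 7_200
legit_count = total_transactions - fraud_count

fraud_rate = fraud_count / total_transactions
print(f"legitimate: {legit_count:,}  fraud rate: {fraud_rate:.2%}")

# Inverse-frequency ("balanced") weights: n_samples / (n_classes * class_count).
# Each class then contributes the same total weight to the training loss.
n_classes = 2
class_weights = {
    0: total_transactions / (n_classes * legit_count),  # legitimate
    1: total_transactions / (n_classes * fraud_count),  # fraud
}
print(class_weights)
```

With these counts a fraudulent example is weighted roughly 80 times more heavily than a legitimate one, which shows how rare the positive class is.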

References

This fraud detection dataset is publicly available in the "Synthetic data from a financial payment system" Kaggle dataset repository.