Revisiting traditional machine learning after a three-year hiatus felt like dusting off an old book, and diving into fraud detection with a hands-on project was the perfect way to reconnect.
Rediscovering Traditional Machine Learning
It’s been over three years since I last worked with traditional machine learning, and honestly, I was a bit nervous about how much I’d forgotten. The world of AI has exploded with neural networks and large language models, but there’s something grounding about returning to the classics: algorithms like Random Forest that powered so many real-world applications before deep learning took center stage. To shake off the rust, I decided to tackle a hands-on credit card fraud detection exercise using a well-known Kaggle dataset. It felt like a safe yet challenging way to refresh the fundamentals while applying them to a problem that matters: keeping financial systems secure.
The dataset itself is a bit of a mystery box. It’s packed with numerical features transformed through PCA (Principal Component Analysis), which means the original transaction details, like location or merchant, are hidden behind abstract labels like V1, V2, and so on. The only untransformed features are ‘Time’ and ‘Amount,’ giving me a narrow window into the data’s context. This setup simplifies feature engineering in a way, but it also forces you to trust the data as it is; no overthinking allowed. With 284,315 legitimate transactions and only 492 frauds (a tiny 0.172% of the total), the dataset’s extreme imbalance was the first hurdle. It’s a classic problem in fraud detection: the thing you’re trying to catch is rare, but missing it can be costly.
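For anyone who wants to poke at the numbers themselves, here’s a minimal sketch of how the imbalance shows up right after loading the data. It assumes the standard creditcard.csv export from Kaggle and an 80/20 split; the split ratio is an illustrative choice, not a magic number.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The Kaggle "Credit Card Fraud Detection" CSV; 'Class' is 1 for fraud, 0 otherwise.
df = pd.read_csv("creditcard.csv")

print(df["Class"].value_counts())                # 284,315 legitimate vs 492 frauds
print(f"Fraud share: {df['Class'].mean():.3%}")  # roughly 0.17%

X = df.drop(columns="Class")
y = df["Class"]

# Stratify so the rare fraud cases land in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```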
Building with Random Forest
For this hands-on, I chose a Random Forest Classifier, a reliable workhorse in the machine learning world. Random Forest builds multiple decision trees and combines their predictions, making it great for spotting complex patterns in data like transaction amounts or timing. Its strength lies in handling non-linear relationships and resisting overfitting, which is perfect for a dataset with abstract PCA features. Plus, it gives you feature importance scores, which can hint at which variables (even cryptic ones like V12, V14 or V17) are driving fraud predictions.
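As a rough sketch of what that looks like in scikit-learn (the hyperparameters here are illustrative defaults, not tuned values from my notebook, and the training in this post actually happens on the resampled data shown later):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings; n_estimators and friends are worth tuning properly.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)  # X_train / y_train from the split above

# Which PCA components carry the most signal?
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))  # V14, V17, V12 tend to rank high
```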

I trained the model on data balanced and refined with SMOTETomek (more on that below), and it felt satisfying to see the pieces come together. The process wasn’t perfect: I only spent half a day on it, so I didn’t explore other algorithms like logistic regression or XGBoost, but Random Forest was a solid starting point.
Tackling the Class Imbalance Challenge
Class imbalance is the kind of issue that can make or break a model. If you train a model on this dataset without addressing the imbalance, it’ll likely just predict “not fraud” for everything and still score a deceptively high accuracy—because 99.8% of the transactions are legitimate. But that’s useless when the whole point is to catch those rare frauds.
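To make the trap concrete, a throwaway baseline (not something I’d ever ship) that always predicts the majority class still looks impressive on accuracy:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Always answer "not fraud" and accuracy still lands around 99.8%.
always_legit = DummyClassifier(strategy="most_frequent")
always_legit.fit(X_train, y_train)

preds = always_legit.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")  # ~0.998
print(f"Frauds flagged: {int(preds.sum())}")             # 0
```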
To handle the imbalance, we can consider two approaches: upsampling the minority class (frauds) or downsampling the majority class (legitimate transactions). Upsampling adds more fraud examples, either by duplicating them or creating synthetic ones, while downsampling reduces the number of legitimate transactions. Both have trade-offs. Downsampling throws away data, which feels wasteful when you’re already limited. Upsampling, on the other hand, risks overfitting if you just copy the same fraud cases repeatedly. I opted for a hybrid approach using SMOTETomek, which combines SMOTE (Synthetic Minority Oversampling Technique) for generating new, realistic fraud samples with Tomek Links to remove overlapping or noisy samples from both classes. This method creates a more balanced and cleaner dataset, improving the model’s ability to learn meaningful patterns without overfitting.
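In code, the resampling step is short. One detail worth calling out: only the training split gets resampled, so the test set keeps its natural imbalance. This is a sketch using imbalanced-learn’s SMOTETomek with default settings rather than anything carefully tuned.

```python
import pandas as pd
from imblearn.combine import SMOTETomek

# Resample only the training data; the test set stays untouched.
resampler = SMOTETomek(random_state=42)
X_train_res, y_train_res = resampler.fit_resample(X_train, y_train)

print(pd.Series(y_train_res).value_counts())  # the two classes are now roughly balanced

# Refit the Random Forest on the balanced, cleaned training data.
rf.fit(X_train_res, y_train_res)
```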

Standard metrics like accuracy can lie in imbalanced-dataset scenarios, so I followed the dataset owner’s advice to focus on the Area Under the Precision-Recall Curve (AUPRC). Unlike accuracy, AUPRC zooms in on the minority class (frauds), measuring how well the model balances precision (how many flagged transactions are actually fraudulent) and recall (how many frauds it catches). It’s a more honest way to evaluate performance when the stakes are high.
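Computing it is nearly a one-liner once you have predicted fraud probabilities. The sketch below uses scikit-learn’s average_precision_score as the AUPRC summary and keeps the full curve around in case you want to pick an operating threshold.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Score the untouched, still-imbalanced test set.
fraud_scores = rf.predict_proba(X_test)[:, 1]

# average_precision_score summarizes the precision-recall curve into one number.
print(f"AUPRC: {average_precision_score(y_test, fraud_scores):.3f}")

# The full curve shows the precision/recall trade-off across decision thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, fraud_scores)
```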

Lessons from a Half-Day Dive
This quick hands-on reminded me how much I enjoy the problem-solving side of machine learning. Wrestling with class imbalance, choosing the right metrics, and tweaking the model felt like reconnecting with an old friend. I could’ve spent more time experimenting with other algorithms or fine-tuning hyperparameters, but even this short revisit was enough to spark my curiosity again. Fraud detection is a high-stakes puzzle: banks lose billions to fraud, and catching it in real time can protect customers and institutions alike. Knowing that my half-day effort ties into such a critical application made it all the more meaningful.