The Humbling Reality of AI Energy Costs: What I Learned in My Master's Thesis

There's nothing like building your own AI system to discover how much easier it is to hype these technologies than to deploy them responsibly.

I spent the last several months of my academic life obsessed with a deceptively simple question: Could large language models like GPT-4 and Claude detect credit card fraud as well as traditional algorithms? And if so, what would it cost us – not just in dollars, but in watts, joules and carbon?

When Assumptions Meet Reality

I entered this project with what seemed like a reasonable hypothesis: Large language models would excel at fraud detection. After all, these are the systems that can write poetry, explain quantum physics, and generate functioning code. They appear to understand nuance and context in ways traditional algorithms never could.

"If LLMs can bluff their way through complex essays, literature reviews of master theses, and technical discussions," I thought, "surely they could spot suspicious patterns in credit card transactions."

I was wrong. Spectacularly wrong.

Not only did LLMs perform significantly worse than traditional algorithms at the actual task of fraud detection, but they consumed energy at rates that made my jaw drop. It was a sobering reminder that impressive capabilities in one domain don't necessarily translate to others – and that there's often a vast gap between theoretical potential and practical application.

Building a Stream Processing Architecture From Scratch

Credit card fraud happens in real time, with transactions streaming in by the thousands per second. So I designed and built an event-driven stream processing system from the ground up, with scalability as the primary design concern.

The architecture I created is based on technologies that you'd find in actual financial institutions, with some key components:

Apache Kafka served as my transaction ingestion system – think of it as a high-speed conveyor belt moving data through the system. I chose it specifically for its partition design that allows easy horizontal scaling to handle massive transaction volumes.
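To make that concrete, here is a minimal sketch of the ingestion idea, assuming the kafka-python client and illustrative topic and field names: keying each event by card ID keeps one card's transactions on one partition, which is what makes horizontal scaling straightforward.

```python
# Sketch of keyed ingestion: same key -> same partition -> per-card ordering.
# Topic name, field names, and the kafka-python client are illustrative choices.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_transaction(tx: dict) -> None:
    # Keying by card ID lets consumers scale out while keeping one card's
    # history together on a single partition.
    producer.send("transactions", key=tx["card_id"], value=tx)

publish_transaction({"card_id": "4111-xxxx", "amount": 42.50, "merchant": "coffee_bar"})
producer.flush()
```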

Redis functioned as the state management layer, storing transaction histories that could be rapidly accessed to provide context for each new purchase. Its flexible retention windows were perfect for maintaining just the right amount of historical data without wasting resources.
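A small sketch of that state layer, assuming redis-py and illustrative key names: a capped per-card list with an expiry gives exactly this kind of bounded retention window.

```python
# Sketch of the Redis state layer: keep a short, capped history per card and
# let Redis expire it automatically. Key names and limits are illustrative.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def remember_transaction(card_id: str, tx: dict, max_items: int = 50, ttl_s: int = 7 * 24 * 3600) -> None:
    key = f"tx_history:{card_id}"
    r.lpush(key, json.dumps(tx))      # newest first
    r.ltrim(key, 0, max_items - 1)    # cap the window so memory stays bounded
    r.expire(key, ttl_s)              # retention window

def recent_transactions(card_id: str, n: int = 10) -> list[dict]:
    return [json.loads(x) for x in r.lrange(f"tx_history:{card_id}", 0, n - 1)]
```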

The heart of the system, FastStream, processed these transactions through multiple stages: from initial ingestion to feature engineering, prompt generation, LLM analysis, and finally performance monitoring. I connected all these components into a cohesive pipeline, with each stage transforming the data to prepare it for the next.
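In FastStream terms, each stage is just a subscriber that publishes to the next topic. Here is a stripped-down sketch of two of those stages; the topic names and handler bodies are illustrative, not my actual pipeline.

```python
# Minimal FastStream sketch of a staged pipeline: raw transactions come in,
# features and prompts are built, and the enriched record moves to the next stage.
from faststream import FastStream
from faststream.kafka import KafkaBroker

broker = KafkaBroker("localhost:9092")
app = FastStream(broker)

@broker.subscriber("transactions")
@broker.publisher("prompts")
async def build_prompt(tx: dict) -> dict:
    # Feature engineering + prompt generation would happen here.
    tx["prompt"] = f"Is a {tx['amount']} EUR charge at {tx['merchant']} suspicious?"
    return tx

@broker.subscriber("prompts")
@broker.publisher("decisions")
async def score_transaction(enriched: dict) -> dict:
    # Placeholder for the LLM call; the real system queries the model here.
    enriched["fraud_score"] = 0.0
    return enriched
```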

The most challenging aspect was designing the prompt generation layer – translating numerical transaction data into natural language that LLMs could understand. After extensive experimentation, I developed a structured approach based on chain-of-thought reasoning that synthesized transaction details, customer profiles, and historical patterns.
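A simplified sketch of what that translation looks like; the field names and the exact wording of the reasoning steps are illustrative rather than my final prompt template.

```python
# Sketch of turning numeric features into a structured, chain-of-thought style
# prompt. Field names and wording are illustrative assumptions.
def build_fraud_prompt(tx: dict, profile: dict, history: list[dict]) -> str:
    recent = "\n".join(
        f"- {h['timestamp']}: {h['amount']} EUR at {h['merchant']}" for h in history
    )
    return (
        "You are a fraud analyst. Reason step by step, then answer FRAUD or LEGITIMATE.\n\n"
        f"Customer profile: home country {profile['country']}, "
        f"typical spend {profile['avg_amount']} EUR.\n"
        f"Recent transactions:\n{recent}\n\n"
        f"New transaction: {tx['amount']} EUR at {tx['merchant']} "
        f"in {tx['country']} at {tx['timestamp']}.\n"
        "1) Compare the amount to the customer's typical spend.\n"
        "2) Check whether the merchant and country fit recent behaviour.\n"
        "3) Give your verdict and a one-sentence justification."
    )
```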

For energy measurement, I implemented a monitoring framework based on EnergyMeter, tapping directly into both Intel's RAPL framework for CPU power and Nvidia's Management Library for GPU measurements. This required bare-metal deployment rather than cloud infrastructure, as virtualized environments block access to these low-level metrics.
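For the curious, the measurement idea boils down to something like the sketch below (not the EnergyMeter code itself): read the cumulative package energy counter that RAPL exposes via sysfs, and integrate GPU power samples from NVML over time. The sysfs path and sampling interval shown are illustrative.

```python
# Rough sketch of CPU + GPU energy measurement on bare metal.
import time
import pynvml

RAPL_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0 counter, microjoules

def rapl_energy_uj() -> int:
    with open(RAPL_FILE) as f:
        return int(f.read())

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

start_cpu = rapl_energy_uj()
gpu_joules, t_prev = 0.0, time.time()
for _ in range(10):                 # sample GPU power while the workload runs
    time.sleep(0.5)
    now = time.time()
    watts = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000.0  # NVML reports milliwatts
    gpu_joules += watts * (now - t_prev)
    t_prev = now

cpu_joules = (rapl_energy_uj() - start_cpu) / 1e6  # counter wraparound ignored in this sketch
print(f"CPU: {cpu_joules:.1f} J, GPU: {gpu_joules:.1f} J")
pynvml.nvmlShutdown()
```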

The Shocking Energy Gap

Let me be blunt: In my experiments, large language models consumed up to 400 times more energy than traditional machine learning algorithms to process the same credit card transactions.

That's not a typo. Four hundred times.

While traditional models sipped a modest 2.7 kilojoules of energy (roughly what it takes to lift a one-kilogram bag of apples 275 meters into the air), LLMs guzzled between 117.9 and 1064.2 kilojoules – enough energy to charge your smartphone from dead to 100% multiple times over.

When I projected these needs to Capital One's scale – about 1,700 transactions per second – the numbers became almost comical:

A traditional fraud detection system would need about 1.2 kilowatts (roughly a hair dryer). The LLM-based system? Up to 1,022.9 kilowatts – enough to power a small neighborhood.

In the Netherlands, where I live, that's the difference between an €800 annual electric bill and a staggering €700,000.
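If you want to sanity-check the projection yourself, it is simple arithmetic: per-transaction energy times throughput gives continuous power, and power times hours times an electricity price gives the yearly bill. The per-transaction energies and the €/kWh tariff below are illustrative assumptions, chosen to be roughly consistent with the figures above.

```python
# Back-of-the-envelope projection from per-transaction energy to power and cost.
TPS = 1_700                    # Capital One-scale throughput
HOURS_PER_YEAR = 24 * 365
PRICE_EUR_PER_KWH = 0.08       # illustrative tariff, not an official rate

def project(joules_per_tx: float) -> tuple[float, float]:
    kw = joules_per_tx * TPS / 1_000          # J/s -> kW
    kwh_per_year = kw * HOURS_PER_YEAR
    return kw, kwh_per_year * PRICE_EUR_PER_KWH

for label, j_per_tx in [("traditional", 0.7), ("LLM (upper bound)", 600.0)]:  # illustrative per-transaction energies
    kw, eur = project(j_per_tx)
    print(f"{label}: {kw:,.1f} kW, ~EUR {eur:,.0f}/year")
```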

I couldn't help but wonder: Is this really the future we're rushing toward?

The Fairness Paradox

Yet here's where it gets complicated. These energy-hungry behemoths did show one surprising advantage: fairness.

When examining how different models treated transactions across gender and age groups, LLMs maintained remarkably balanced outcomes. Their "disparate impact" values – a key fairness metric – consistently hovered between 0.95 and 1.05 (with 1.0 representing perfect equality).

Traditional algorithms, despite their energy efficiency, showed troubling variations ranging from 0.4 to 2.0, suggesting significant bias against certain groups.
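For those unfamiliar with the metric: disparate impact is just the ratio of positive-outcome rates between two groups, so a value of 1.0 means both groups get flagged at the same rate. A minimal sketch, with made-up illustrative labels:

```python
# Disparate impact: ratio of flag rates between an unprivileged and a privileged group.
import numpy as np

def disparate_impact(y_pred: np.ndarray, group: np.ndarray, unprivileged, privileged) -> float:
    rate_unpriv = y_pred[group == unprivileged].mean()
    rate_priv = y_pred[group == privileged].mean()
    return rate_unpriv / rate_priv

y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 0])        # 1 = flagged as fraud
gender = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])
print(disparate_impact(y_pred, gender, "F", "M"))  # 2.0: one group flagged twice as often
```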

I found myself in an ethical quandary. What's more important: dramatically reducing energy consumption or improving algorithmic fairness? Is there a sustainable way to achieve both?

Performance Reality Check

The decision became clearer when I examined actual fraud detection performance. Despite consuming vastly more resources, the best LLM achieved only 69% accuracy with an F1-score of 0.025. Meanwhile, a simple Random Forest algorithm reached 99% accuracy with an F1-score of 0.46.

In other words, the traditional approach was both dramatically more energy efficient AND better at actually detecting fraud. My initial assumption – that LLMs' general intelligence would translate to superior fraud detection – collapsed completely in the face of empirical evidence.
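For context, the traditional baseline is almost embarrassingly simple to set up. Something along these lines, assuming tabular features and scikit-learn (the dataset path and column names are illustrative, not my exact setup):

```python
# Sketch of a Random Forest fraud baseline with the metrics reported above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")              # illustrative dataset path
X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# With heavy class imbalance, accuracy alone is flattering; F1 on the fraud
# class is the more honest number, which is why both are reported above.
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 (fraud class):", f1_score(y_test, y_pred))
```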

Adding to these concerns, LLMs produced hallucinations (false information) at rates between 13.6% and 53.4%. Imagine explaining to your CEO why your fraud system falsely accused customers because it "hallucinated" suspicious patterns.

Finding a Middle Path

Despite these sobering findings, I'm not suggesting we abandon large language models entirely. My architectural work points to a promising compromise: a hybrid system where traditional algorithms handle continuous processing, while LLMs are selectively deployed for high-uncertainty cases.

In this design, the stream processing infrastructure I built could direct most transactions through efficient traditional pipelines, reserving the more expensive LLM path for cases needing deeper analysis or explanation. This maintains reasonable energy consumption while leveraging LLMs' strengths in interpreting complex patterns and generating human-readable justifications.
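Roughly, the routing logic could look like the sketch below, where the thresholds and the llm_client helper are assumptions for illustration: everything scored confidently stays on the cheap path, and only the uncertain band pays the LLM's energy price.

```python
# Sketch of hybrid routing: cheap model scores everything, LLM handles only
# the uncertain band. Thresholds and llm_client.analyze are hypothetical.
UNCERTAIN_LOW, UNCERTAIN_HIGH = 0.3, 0.7

def route(tx: dict, rf_model, llm_client) -> dict:
    p_fraud = rf_model.predict_proba([tx["features"]])[0][1]
    if UNCERTAIN_LOW <= p_fraud <= UNCERTAIN_HIGH:
        # Rare, expensive path: ask the LLM for a verdict plus an explanation.
        verdict, explanation = llm_client.analyze(tx)
        return {"fraud": verdict, "why": explanation, "path": "llm"}
    # Common, cheap path: trust the traditional model.
    return {"fraud": p_fraud > 0.5, "why": None, "path": "traditional"}
```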

As someone entering the professional world at a time of unprecedented AI hype, I find this perspective both grounding and liberating. It reminds me that our job isn't to use the flashiest tools, but to solve real problems in responsible, sustainable ways.

Who knows – maybe the true AI revolution won't come from building ever-larger models, but from learning when and how to deploy them with wisdom, restraint, and an eye toward their full societal costs. The architecture I've developed could be one small step in that direction.

And perhaps the most valuable lesson is this: Question your assumptions. Test them rigorously. Be prepared to be wrong. In a field moving as quickly as AI, our intuitions often lag behind reality – and sometimes, the most enlightening discoveries come from discovering just how wrong we were.