Human-in-the-Loop AI in Finance - Why the Robots Still Need Us (and Why Your Org Chart Is About to Flip)
(Note: All client data, figures, and circumstances have been anonymized to protect confidentiality.)
Over the last few years, I’ve been fortunate to sit in the front row of the AI revolution—from working on Google’s largest AI partnerships and shaping AI strategy at the Cloud Deals Desk, to consulting with robotic process automation (RPA) platforms at Effectus and SaaS pioneers back at Deloitte.
But this year has been different.
We’ve hit a tipping point. AI isn’t just hype anymore; it’s being used daily by consumers, executives, and teams in ways we couldn’t have imagined just 12 months ago. So I went deeper: through strategic advisory work with early-stage founders, interviews on and off my podcast, invite-only leadership breakfasts, and hands-on projects supporting GTM modernization and finance transformation across industries.
This piece is a synthesis of what I’ve learned, where AI actually works today, how companies are applying it in finance, and what’s coming next.
TL;DR: The Future of Finance Has a New Org Chart
Chatbots and LLMs are killing staff work, but they’re going to supercharge demand (and compensation) for cross-functional hybrid experts. Think: CPAs fluent in AI, technical accounting, and Python.
In complex domains like finance, hallucinations aren’t just inconvenient, they’re dangerous. One of my clients nearly dropped pricing across the board based on a hallucinated 30% discount benchmark. Millions could’ve been lost from one confidently wrong number.
You already have AI in your org. You just might not know it yet. If you don’t have governance, controls, or policies around its use, you need to. Fast. Between the EU AI Act and NIST's AI Risk Management Framework, the compliance wave is building and it's looking a lot like Sarbanes-Oxley 2.0.
Best-in-class teams are running what’s called “AI Orchestration”, using multiple models like ChatGPT, Gemini, and Claude for each critical task, comparing answers, and escalating disagreements through a human-in-the-loop (HITL) process. That means: (1) Multi-model prompts, (2) Flagged conflicts trigger expert review, (3) Oversight is tracked and statistically sampled for quality assurance, and (4) Talent models are shifting: spreadsheet jockeys out, hybrid senior reviewers in.
This piece breaks down:
How we got here
What this looks like inside real companies
What’s required to implement it well
And why the future of finance won't be model-driven—it’ll be model-curated.
You can stop here if you're just looking for the high-level view, or scroll on for the real stories, case studies, and actionable playbooks from the field.
1 The surprise nobody saw coming
Back in 1999, they said: “The Internet will kill the middleman.” The reality? It birthed entire industries: UX designers, digital ad brokers, SEO consultants.
In 2010, the hype was: “The cloud will eliminate IT.” The reality? DevOps, FinOps, and SREs became standard line items on every tech payroll.
Now it’s generative AI’s turn.
Yes, LLMs are vaporizing staff work at astonishing speed. But the vacuum they leave behind doesn’t stay empty; it demands deep domain expertise paired with technical fluency. In finance, that means there will be new unicorns: CPAs and CFAs who can prompt, parse, and pressure-test AI outputs, all while understanding the evolving regulatory landscape. And they won't come cheap.
Fewer spreadsheets ≠ fewer people. Just fewer people who are easy to hire.
Three uncomfortable truths are driving this shift:
Headcounts may shrink at the bottom, but pay curves steepen at the top.
Training budgets beat layoffs, because good people now need to be great in order for your team to remain competitive.
Regulators will follow the hype, just like they did in the post-Enron, post-IPO, and post-SPAC booms.
Everything that follows in this piece builds on those three forces: workforce compression, expertise inflation, and the quiet arrival of AI compliance as the next SOX moment.
2 The coffee-stained origin story
Picture this: a half-lit kitchen, a cold espresso on the counter, and a SaaS CFO squinting into a 7:00 a.m. Zoom. He holds up a rainbow-ink printout like it’s evidence in a trial.
“Three chatbots swear the market discount average is thirty percent,” he says. “Our CRO wants to slash pricing by noon. Can I trust this?”
I pause, sip my own coffee, and ask the question every finance leader should keep taped to their monitor:
Who double-checked the bot?
Thirty-six hours and several 10-Q footnotes later, we find the culprit: one outlier competitor running a recession-era “buy-down” promo that skewed the results. The real median discount? Thirty-two percent.
Two margin points may sound small, but on this client’s P&L, that was nearly 25% of net income. One hallucinated benchmark, blindly followed, almost triggered a misfire big enough to crater the quarter.
That moment shoved me head-first into Human-in-the-Loop (HITL) design, where humans aren't replaced by AI, but become the critical governors of it.
A large language model is the smartest intern alive: omnivorous reader, zero sick days, lightning drafts. It’s also the intern who bluffs when stumped and fabricates citations with a straight face. Benchmarks peg hallucination at 1–2% for everyday questions but ≈10% for finance-numeric prompts: lethal odds for an earnings deck.
Even if your team is beta-testing a single GPT chat, the blind-spot risk starts on day one. Below is a playbook of everything I wish I (and the CFO) had known at that moment.
3 HITL Systems in a Nutshell: A family road trip, one wrong exit, five AI lessons
It starts like any other family road trip. You’re headed to Yosemite, coffee in hand, minivan loaded to the ceiling. Somewhere near Modesto, your GPS chirps its command with smug certainty: “Take Exit 203.”
You glance up. Exit 203 is lined with fresh orange cones and flashing signs: Ramp Closed. Something’s off. But the voice sounds so sure, like it always does.
In the passenger seat, Grandpa leans forward, unfolding his creased AAA paper map with the authority of someone who’s seen real roads. “That ramp washed out last week,” he mutters, finger tracing Highway 120 like it’s 1987. “Take the next one.” He’s your fine-tuned model: expert on this specific domain, calibrated from experience, but working with stale data. Useful, but not enough.
Meanwhile, the backseat’s descending into chaos. Your daughter’s tablet is running Waze; it also says Exit 203. Your son’s tablet votes for Exit 205. The phone wedged between them insists you turn around entirely. Welcome to AI orchestration: multiple models, competing answers, no context.
And then comes your HITL moment, Human-in-the-Loop. You ignore the machines and trust your own eyes. You sail past 203, take 205 instead, loop around, and rejoin the highway. You lose twelve minutes, but save your suspension and maybe your whole day.
Mom casually taps out a group text: “Took 205 detour, arriving 2 p.m.” That’s your audit trail. Who made the decision, when, why. Just in case the grandparents ask later.
The takeaway? Smart systems are helpful. Competing systems are noisy.
Real-world judgment still matters.
You still need someone in the driver’s seat, someone who can recognize a hallucinated “Exit 203” for what it is, ignore the noise, and make the right call when it counts.
And that, in a nutshell, is how HITL workflows in finance work. LLMs can suggest a path. Other models might argue for different ones. But the human, the one who sees the cones, reads the road, and owns the decision, is still responsible.
Especially when the exit leads to your P&L.
4 Goldilocks and the AI Bears: Best Practices in Action
Once upon a quarter, we asked three of the top AI models a deceptively simple question: “What’s the average first-year SaaS discount for companies with $50–250M in ARR, as of Q1 2025?”
GPT replied softly: 29.7%
Claude chimed in: 30.1%
Gemini blurted out: 45.4%
Just like the porridge in the fairy tale, one answer was too cold, one too hot, and one... maybe just right? But here’s the thing: LLMs aren’t Goldilocks. They don’t know which bed is too soft or which number is wildly off base. You do. Or at least, someone on your team needs to.
In this case, the company’s internal rulebook saved the day: If model results diverge by more than eight percentage points, escalate to a human.
The baton passed to a senior analyst. She pulled fresh 10-Ks and sales call transcripts. Turns out Gemini had over-weighted a one-off “buy-down” recession promo. The true median? 32%.
That just-right answer preserved $2.4 million in margin and prevented a needless pricing reset.
This is the essence of HITL (Human-in-the-Loop) orchestration:
Use multiple models—each has strengths and biases
Set triggers for disagreement (like thresholds or flags)
Design escalation paths to subject matter experts
Log, document, and learn from every override
Even the big players are doing it. UBS now produces 5,000+ AI-generated research videos a year, but none go live without a human analyst’s review. Because unlike fairy tales, in finance, a bad guess doesn’t end in porridge. It ends in SEC fines or missed revenue.
So yes, use the AI bears. But always make sure someone’s playing Goldilocks.
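If you want to see what that rulebook looks like in code, here’s a minimal sketch of the divergence-and-escalate rule. It assumes you’ve already asked each model the same question and parsed a numeric answer out of each reply; the eight-point threshold and the queue name are illustrative, not prescriptive.

```python
# Minimal sketch: accept the median when models agree, escalate when they don't.
from statistics import median

DIVERGENCE_THRESHOLD_PTS = 8.0  # rulebook: a wider spread than this goes to a human


def reconcile(answers: dict[str, float]) -> dict:
    """Auto-accept consistent answers; flag divergent ones for expert review."""
    spread = max(answers.values()) - min(answers.values())
    if spread > DIVERGENCE_THRESHOLD_PTS:
        return {
            "status": "escalated",                    # a senior analyst picks this up
            "answers": answers,                       # full trail for the reviewer
            "reason": f"models diverged by {spread:.1f} pts",
            "queue": "senior_analyst_review",         # hypothetical queue name
        }
    return {"status": "auto_accepted", "value": median(answers.values()), "answers": answers}


# The Goldilocks numbers from this section: Gemini's outlier trips the rule.
print(reconcile({"gpt": 29.7, "claude": 30.1, "gemini": 45.4}))
```

In production you’d write the returned record to your audit log (more on that below) rather than printing it, and randomly sample the auto-accepted answers for quality assurance.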
5 Data Security: Choose Your Lane Before Legal Chooses for You
Picture this: a well-meaning associate uploads a Stock Purchase Agreement to an open AI playground. Just testing the model, just looking for a clause summary. No big deal.
Except it might be. Depending on your jurisdiction, you may have just violated the California Privacy Rights Act, waived attorney-client privilege, and shared sensitive M&A details with a vendor whose terms you never reviewed.
AI is fast. Legal exposure is faster.
Before your GC or auditor comes knocking, pick a security lane, and make sure your whole team knows it. Here are the two prevailing paths:
Zero-Data Retention (ZDR): Platforms like OpenAI Enterprise offer settings that wipe prompts the moment the output is delivered. Nothing stored. Nothing trained. Great for lightweight use cases with high sensitivity.
Bring Your Own Keys (BYOK): For heavier workloads or regulated data, companies host models in a private cloud and encrypt data at rest and in transit, with customer-controlled keys. This gives you full custody over your inputs, outputs, and embeddings.
But here’s the shared truth for both lanes:
You need immutable audit logs. If your system can’t answer who used what, when, and why, it’s not ready for regulatory scrutiny.
Take a page from Amazon’s playbook. S3 Object Lock turns a standard cloud bucket into write-once, read-many (WORM) storage. No edits. No deletes. Just clean, timestamped, audit-grade evidence.
So what are S3 Object Lock and WORM storage?
Think of Amazon’s S3 Object Lock as a digital file cabinet where documents can be added but never edited or deleted. Once you save something, say, a model output or a log of who prompted what, it’s frozen in time. This type of setup is called WORM storage: Write Once, Read Many.
In plain English: You can store a record once, and you (or auditors) can read it forever, but no one can tamper with it, not even admins.
This is the same type of logging used in capital markets, healthcare, and regulated SaaS. If your AI systems are generating content or decisions with financial impact, WORM-style storage isn’t overkill, it’s table stakes.
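For the technically curious, here’s a minimal sketch of what WORM-style prompt logging can look like with boto3 and S3 Object Lock. It assumes the bucket was already created with Object Lock enabled; the bucket name, retention window, and record fields are made up for illustration.

```python
# Minimal sketch: write each prompt/output pair as an immutable, timestamped record.
import json
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "finance-ai-audit-log"  # hypothetical bucket created with Object Lock enabled


def log_interaction(user: str, agent: str, prompt: str, output: str) -> str:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "agent": agent,
        "prompt": prompt,
        "output": output,
    }
    key = f"prompts/{record['timestamp']}_{user}.json"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(record).encode(),
        ObjectLockMode="COMPLIANCE",  # no edits, no deletes, not even by admins
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=7 * 365),
    )
    return key  # hand this back to the workflow so every record stays traceable
```

The retention period is a policy call for your legal team, not a technical one; seven years here is just a placeholder.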
So before you ask ChatGPT to summarize your NDA or pricing model, make sure you’ve asked your legal team:
Where does this data go?
Can we prove who accessed it?
What happens if we’re audited?
In the AI era:
Your security posture is only as strong as your worst prompt.
Pick your lane, lock your logs, and make sure you’re not the cautionary tale.
6 Fine-tune or RAG? A four-to-one price lesson
You’ve probably heard the terms “fine-tuned model” and “RAG” tossed around like everyone in the room has a PhD in machine learning. Let’s break it down, without the jargon.
The Two Common Approaches to Custom AI
Fine-Tuning is like raising your own AI baby from scratch. You retrain the model with your data so it “remembers” your language, your rules, your patterns. The benefit? A model that speaks your company’s dialect natively. The downside? It’s expensive, slow to update, and once it’s trained, it’s set in stone, until you train it again.
RAG (Retrieval-Augmented Generation) is more like giving a smart intern a cheat sheet. The LLM stays general-purpose, but before it answers your question, it quickly looks up the latest, most relevant documents (contracts, manuals, filings) and references them in real time, kind of like using Google during an open-book test. Think of RAG as your “notebook LLM”: instead of guessing from memory, the model looks up an answer in your own files first, then responds.
Why Fine-Tuning Sounds Cool But Costs a Fortune
Training a fine-tuned model means reserving expensive GPUs, tuning parameters, and testing outputs… often for months. It’s not uncommon to spend 4× more on infrastructure and dev time in year one compared to a RAG system.
And here’s the real kicker: fine-tuned models get outdated fast. If your contract template changes, your promo rules shift, or a new regulation rolls in, you’ve got to re-train the whole thing, or live with outdated outputs.
With RAG, you just update the source documents and carry on. It’s like swapping out the cheat sheet.
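Here’s a stripped-down sketch of that cheat-sheet pattern, assuming the OpenAI Python SDK for both embeddings and chat; the documents and model names are toy examples, and a real deployment would add chunking, a proper vector store, and citation checks.

```python
# Toy RAG loop: embed your documents once, retrieve the most relevant ones per
# question, and hand them to the model as context.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

docs = [
    "Standard first-year discount is capped at 15% per the 2025 pricing policy.",
    "The Q1 2025 promo allowed a one-time 30% buy-down for renewals only.",
    "Deferred revenue is recognized ratably over the contract term.",
]


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


doc_vecs = embed(docs)  # "swapping the cheat sheet" = re-running this on updated docs


def answer(question: str, k: int = 2) -> str:
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(sims)[-k:])  # top-k matches
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from the provided context and cite it."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content


print(answer("What discount can we offer a new customer in year one?"))
```

Notice that keeping the model current is a document update, not a retraining run.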
When Fine-Tuning Makes Sense
There are edge cases. If you're building a real-time trading copilot or on-prem AI for high-security use cases where every millisecond or data byte matters, fine-tuning is still the best route. But if you're in the finance, legal, or ops world and looking to boost productivity without nuking your AWS budget, RAG gives you 80% of the value at 20% of the cost, with way more flexibility.
TL;DR
If you’re AI-curious and looking to scale responsibly:
Start with RAG. It’s faster, cheaper, and more agile.
Track where it fails. Over time, you’ll see if fine-tuning is really worth it.
Resist the urge to overbuild too early. LLMs are evolving fast—today’s state-of-the-art can look quaint in six months.
And remember: just because a demo was impressive doesn’t mean it’s production-ready. Start lean. Learn fast. Optimize later.
7 Multi-Agent Mayhem: Why Guardrails Alone Won’t Save You
Let’s fast-forward to what’s already quietly happening inside AI-forward finance teams.
You don’t just have one AI assistant anymore, you have a small fleet of them. Maybe one is tuned for internal reporting, another for external disclosures, a third for board decks, and a fourth for contract summaries. These are multi-agent systems: a group of specialized bots working in parallel, each with its own skills and sandboxed data access.
Now imagine this:
Your CFO’s assistant bot can read sensitive Q2 forecasts, acquisition models, and draft earnings guidance.
Your analyst team’s bot can help slice churn by cohort but can’t see anything marked “CFO eyes only.”
Your sales ops bot pulls discounting trends, but it’s blind to deferred revenue waterfalls and legal memos.
A rogue prompt or careless config, and suddenly an intern is asking a model, “What’s our real net retention for Q3?”, and getting a material, nonpublic answer before IR sees it.
Welcome to the next risk frontier in finance AI: prompt leakage, access drift, and information asymmetry inside your own four walls.
It’s not just what the model knows; it’s who’s allowed to ask it. You’ll need (a minimal sketch follows this list):
Agent-specific access controls: Which bot knows what? Can it write back to systems, or just read?
User-tier segmentation: Is this user cleared for this data? What happens when they change roles?
Prompt logs with alerts: Who asked what, when, and what did the model say? Where was human review applied?
Drift detection and review cadences: Has the bot started suggesting weird things? Flag it like a fraud alert.
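Here’s the promised sketch of the first two controls, with the prompt log folded in. The agent names, data scopes, and the stubbed model call are hypothetical; the point is that the permission check and the log entry both happen before the model ever answers.

```python
# Minimal sketch: scope-checked agents, cleared users, and a prompt log.
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    data_scopes: set[str]  # what this bot is allowed to read


@dataclass
class User:
    name: str
    clearance: set[str]    # what this person is cleared to ask about


audit_log: list[dict] = []  # in production: immutable WORM storage, as above


def run_model(agent: Agent, prompt: str) -> str:
    return f"[{agent.name}] draft answer to: {prompt}"  # stub for the real API call


def ask(user: User, agent: Agent, scope: str, prompt: str) -> str:
    allowed = scope in agent.data_scopes and scope in user.clearance
    audit_log.append({"user": user.name, "agent": agent.name, "scope": scope,
                      "prompt": prompt, "allowed": allowed})
    if not allowed:
        return "Blocked: escalate to the data owner."  # flag it, don't silently answer
    return run_model(agent, prompt)


# The sales ops bot can discuss discount trends, but net retention stays off
# limits, and the intern isn't cleared for it anyway.
sales_bot = Agent("sales_ops", {"discount_trends"})
intern = User("intern", {"discount_trends"})
print(ask(intern, sales_bot, "discount_trends", "Summarize Q3 discounting."))
print(ask(intern, sales_bot, "net_retention", "What's our real net retention for Q3?"))
```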
Think of it like parenting four kids with different chores. One empties the dishwasher. One knows the WiFi password. One holds the spare key. If you don’t set boundaries, or worse, forget who’s supposed to do what, someone’s going to crash the car or spill something on the investor call.
The true cost of safety isn’t just $0.002 per call. It’s about designing for control now, so you don’t have to explain a material leak to your auditors, or worse, to your board, later.
8 A loop you can brag about
Let’s say you’ve bought in. You’ve got a few bots. You’ve stopped them from hallucinating about Exit 203. Now what?
This is the loop, the operating system, that high-performing finance teams are quietly standing up behind the scenes. No hype. Just a system you can brag about to your auditors and your board.
Here’s what it looks like when done right:
Tiered Access = Smarter Trust
Let bots roam free on low-risk tasks: FAQ decks, calendar pulls, invoice sorting.
For higher-stakes outputs (board materials, earnings commentary, customer pricing), require human sign-off before anything leaves draft mode.
Think of it like a bouncer line at a nightclub. Everyone can show up, but only a few get past the velvet rope without ID.
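In code, the velvet rope can be as simple as a release gate that refuses to ship certain output types without a named approver; the risk tiers below are illustrative.

```python
# Minimal sketch: low-risk outputs ship automatically, high-risk ones wait for sign-off.
HIGH_RISK = {"board_materials", "earnings_commentary", "customer_pricing"}


def release(output_type: str, draft: str, approver: str | None = None) -> dict:
    if output_type in HIGH_RISK and approver is None:
        return {"status": "held_in_draft", "reason": "human sign-off required"}
    return {"status": "released", "content": draft, "approved_by": approver or "auto"}


# Invoice sorting sails through; earnings commentary waits for a reviewer.
print(release("invoice_sorting", "Sorted 42 invoices."))
print(release("earnings_commentary", "Q3 margin expanded..."))
print(release("earnings_commentary", "Q3 margin expanded...", approver="senior_reviewer"))
```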
Escalation Rules Even the Intern Can Follow
If three models return answers that vary by >8 percentage points, or if there’s no reliable citation, flag it. Escalate to a senior analyst. And put it in writing in your review procedures: check all citations.
This is your finance-friendly HITL protocol: Human-in-the-Loop by design, not by accident.
Reviewer SLAs Matter
One-hour SLA for flagged items, 7 a.m.–7 p.m. local time, especially for anything hitting the C-suite or the boardroom.
If your team is global, stagger reviewers. If you’re lean, set auto-notifications and carve buffer time into final reviews.
The goal? No “it slipped through” excuses during earnings week.
Logs That Don’t Wiggle
If it’s material, log it immutably.
Think of this like a black box on a plane. When something goes wrong, you don’t want to realize someone deleted the chat log.
RAG Refreshes = Adaptive Intelligence
If you’re using Retrieval-Augmented Generation (RAG), set a standing rule: refresh your document database weekly during volatile markets or major reporting cycles.
Don’t let GPT cite last quarter’s ARR in the middle of a merger.
Track KPIs Like It’s a Real Workflow
You can’t improve what you don’t measure. Here’s the current gold standard (with a quick tracking sketch after the list):
Hallucination rate < 2% (flag via random audits)
Reviewer turnaround < 30 minutes
Cost per validated output trending downward over time
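As promised, here’s what “flag via random audits” can look like in code. The 5% sample rate and the record shape are assumptions; swap in whatever your own QA policy specifies.

```python
# Minimal sketch: sample auto-accepted outputs for human review and compute the KPI.
import random

AUDIT_SAMPLE_RATE = 0.05  # send roughly 5% of auto-accepted outputs to a reviewer


def select_for_audit(output_ids: list[str], seed: int = 0) -> list[str]:
    """Pick a reproducible random sample of outputs for human review."""
    rng = random.Random(seed)
    return [oid for oid in output_ids if rng.random() < AUDIT_SAMPLE_RATE]


def hallucination_rate(audited: list[dict]) -> float:
    """Share of audited outputs a reviewer marked as factually wrong."""
    if not audited:
        return 0.0
    return sum(1 for item in audited if item["hallucinated"]) / len(audited)


# Example: 2 bad answers out of 40 audited is 5%, above the <2% target, so dig in.
sample = [{"hallucinated": i < 2} for i in range(40)]
print(f"Hallucination rate: {hallucination_rate(sample):.1%}")
```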
If you can build a close checklist, you can build a HITL loop. And once you have one, you can audit it, optimize it, and teach your team to run it at scale.
This isn’t about replacing your people. It’s about letting the best ones spend less time fixing bad math, and more time steering the ship.
9 The Talent Shuffle You Can’t Ignore
The finance org chart is shifting under your feet. Yesterday’s pyramid was simple: hire junior staff by the dozen, throw spreadsheets at them, and wait for a few to rise. Promotions meant managing people. The partners watched margins from above.
But now?
LLMs just torched 80% of the pyramid’s foundation. No more armies of analysts formatting decks at midnight or reconciling spreadsheets line-by-line. That layer is vanishing. Fast.
In its place? A leaner, meaner structure: hybrid seniors. We're talking about:
CPAs who can debug Python
Controllers who can tweak LangChain prompts
FP&A leads who understand embeddings and can spot when a chatbot hallucinates a margin trend
These are the new stars of modern finance. They don’t manage bots, they outthink them. They don’t need to memorize GAAP footnotes, they design systems to surface the right ones in seconds.
Promotion now equals technical fluency, not headcount. Leaders won’t just manage people; they’ll curate decisions from the AI mess and translate them into business impact.
What Finance Leaders Should Do Now
This isn’t a future trend. It’s already happening. Here’s what the sharpest orgs are doing to keep up:
Fund serious up-skilling. Launch internal academies. Think: LangChain for auditors. SQL for controllers. Prompt engineering for staff accountants. Upskill your team now, or spend double replacing them later.
Rewrite job specs. Stop asking for “Excel wizardry.” Start screening for:
RAG familiarity
Vector store experience
Prompt testing and model evaluation. Yes, even for mid-level roles.
Partner with universities. The VLOOKUP generation is graduating into a different world. Co-create modern accounting and finance curriculum. Offer guest lectures. Sponsor pilot programs. If higher ed won’t evolve fast enough, help it.
Re-grade your pay bands. Hybrid talent costs more. But they also replace multiple FTEs and catch errors no one else can see. Don’t compress them into old bands. Redesign your comp strategy around value, not tenure.
Consultants Will Bridge the Gap...Temporarily
If you need this capability now (and most companies do), your best bet is short-term: AI-fluent finance consultants who’ve been living in the RAG/fine-tune/HITL trenches.
But here’s the catch:
Anyone fluent in both FASB nuance and embeddings is beachfront property.
They’re scarce, expensive, and getting booked out fast. If you plan to make a move in Q4, you should have started yesterday.
10 Europe’s ticking clock—SOX déjà vu
If this feels like Sarbanes-Oxley déjà vu, it is. The EU AI Act is not just about consumer protection or Big Tech, it’s coming for your finance stack. And fast.
Key Dates:
February 2025 – Core obligations kick in: risk classification, transparency, and basic documentation.
August 2025 – Governance requirements extend to foundation models (like GPT-4 and Claude), including oversight of output risk and training transparency.
August 2027 – Full enforcement for high-risk use cases, which include financial forecasting, credit scoring, audit, and internal controls.
This timeline will feel eerily familiar to anyone who lived through SOX. But this time, it’s model governance, not spreadsheet governance.
Translation: "High-risk AI" = SOX-level documentation.
And it’s not just Europe.
In the U.S.:
Regulators are aligning with the NIST AI Risk Management Framework (RMF), which already sets standards for internal control, explainability, and data lineage.
Internal audit teams at forward-looking companies are adding "AI model governance" to their charters, especially in public-company environments.
Expect draft guidance from ESMA (securities) and EBA (banking) by 2026, with downstream effects for any multinational finance team.
If you use AI to support investor disclosures, automate forecasting, or set pricing, this applies to you.
The smart play? Start documenting now: inputs, outputs, overrides, escalation paths, data permissions, and audit logs. When enforcement lands, you won’t be scrambling.
This isn’t just a policy wave, it’s a compliance clock. And it’s already ticking.
11 Looking beyond the freeway offramp
By 2026, your finance stack will feel less like a spreadsheet suite and more like a smart-home hub.
Each prompt becomes a door sensor.
Every agent is a chore chart: delegated, tracked, and escalating when tasks break.
And your LLMs are like thermostats: fine-tuned, self-adjusting, and logging every degree of change.
All of it beams status updates to a single pane of glass. The tech will hum along quietly in the background.
But just like the house, someone still has to lock the door, set the alarm, and wake up when the dashboard pings at 3 a.m. LLMs won’t eliminate judgment; they’ll amplify the consequences of bad judgment.
And remember:
Fewer spreadsheets ≠ fewer people. Just scarcer, sharper, and more expensive ones.
Final Invitation
If human-in-the-loop is still a bullet point in your board deck instead of a fully mapped process, it’s time. One hallucinated figure in an earnings deck can cost more than a pilot program, or a raise for your tech-fluent consultants and leaders.
Letting your team use AI without processes, controls, or qualified oversight is like handing a toddler a Sharpie and turning them loose in your boss’s living room. You might not see the damage right away, but when you do, it’s on you.
If you have questions, want to pressure-test your own playbook, or just want to swap notes—feel free to DM me or drop a comment below.
— Devon
(Feel free to forward this to anyone who still thinks “AI” stands for “Automated IFRS.”)