AI & Sherpa · 24 April 2026 · 14 min read

ChatGPT, Claude, Gemini, Grok, Perplexity, Llama — and Sherpa: why asset managers need the right LLM for each task

Antony Bergeot-Lair
Technical Co-founder, Datafabric
[Hero graphic: the major LLMs (GPT-5.4, Opus 4.6, Gemini 3.1, Grok 4.20, Sonar Pro) orchestrated by Sherpa by Datafabric]
Contents
  1. The conversation every asset manager is having
  2. Six models, six different strengths
  3. Benchmark comparison: all six models head-to-head
  4. Why there is no single “best” LLM
  5. The five things no generic LLM can do for asset managers
  6. The right model for the right task
  7. How Sherpa solves this
  8. Where the industry is heading

The Conversation Every Asset Manager Is Having

If you run an asset management firm in 2026, you’ve had some version of this conversation in the last six months.

Someone on your team raises the idea of “using AI.” The room nods. One person has been experimenting with ChatGPT. Another prefers Claude for research. Someone in IT says Google Gemini has better integration with your workspace. The analyst swears by Perplexity for sourced research. Someone just read that Grok can pull real-time data from X. An engineer suggests self-hosting Meta’s Llama so client data never leaves your infrastructure. And the compliance officer wants to know where the data goes with any of them.

And nobody can agree on which one your firm should standardise on.

Here’s what we’ve learned after two years of building AI specifically for asset managers: the debate over which LLM is “best” is the wrong debate entirely. Each of the major models is genuinely excellent at something. The real question is how you orchestrate the right model for each task — within a governed, domain-aware platform that actually connects to your operational data.

The LLM Landscape: Six Models, Six Different Strengths

The six major LLMs your firm is likely evaluating — OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, xAI’s Grok, Perplexity, and Meta’s Llama — are all remarkable technology. But they are not interchangeable. Each lab has invested heavily in different strengths, and the 2026 benchmarks make those differences clear.

Let’s give credit where it’s due.

ChatGPT
OpenAI · GPT-5.4
Best at: Reasoning & Versatility
The most versatile general-purpose model. Excels at structured reasoning (o-series), computer use, and has the broadest ecosystem of plugins and integrations. The go-to for brainstorming, drafting, and conversational tasks.
Claude
Anthropic · Claude Opus 4.6
Best at: Coding & Nuanced Writing
Tops coding benchmarks (SWE-bench, Arena code). Produces the most natural, nuanced long-form prose. 1M-token context window handles entire codebases. Constitutional AI approach makes it the most cautious and trustworthy for enterprise use.
Gemini
Google · Gemini 3.1 Pro
Best at: Multimodal & Scientific Reasoning
Leads on abstract reasoning (ARC-AGI-2 77.1%), scientific benchmarks (GPQA 94.3%), and multimodal capabilities. Fastest inference at the lowest cost. 1M-token context window. Deep integration with Google Workspace.
Grok
xAI · Grok 4.20
Best at: Real-Time Data & Maths
Dominates mathematical reasoning (AIME 93.3%, MATH-500 99%). Native integration with X/Twitter for real-time trend data and sentiment analysis. Multi-agent debate architecture and the fastest deep search capability.
Perplexity
Perplexity · Sonar Reasoning Pro
Best at: Sourced Research & Factuality
Every answer includes numbered citations linked to source material. Tops factuality benchmarks (SimpleQA 0.858). Model Council runs queries through multiple frontier LLMs simultaneously. Deep Research mode for comprehensive, sourced analysis.
Llama
Meta · Llama 4 Maverick
Best at: Open-Source & Cost Efficiency
The leading open-weight model family. Llama 4 Maverick matches frontier closed models on key benchmarks while offering full self-hosting and fine-tuning flexibility. Massive 1M-token context window. Zero per-token API cost when self-hosted — ideal for high-volume enterprise workloads where data sovereignty matters.

Benchmark Comparison: All Six Models Head-to-Head

We continuously benchmark LLMs across the dimensions that matter for asset management workflows. Here’s where things stand as of April 2026.

| Benchmark | What It Measures | ChatGPT (GPT-5.4) | Claude (Opus 4.6) | Gemini (3.1 Pro) | Grok (4.20) | Perplexity (Sonar Pro) | Llama (4 Maverick) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | Broad knowledge | 91.8% (Best) | 89.5% (Strong) | 89.8% (Strong) | ~86% (Good) | 75.5% (Good) | 88.2% (Strong) |
| SWE-bench Verified | Real-world coding | 78.2% (Strong) | 80.8% (Best) | 78.8% (Strong) | ~74% (Good) | n/a | ~72% (Good) |
| GPQA Diamond | Scientific reasoning | 92.8% (Strong) | 91.3% (Strong) | 94.3% (Best) | 88.0% (Good) | n/a | 85.2% (Strong) |
| Chatbot Arena Elo | Human preference | Top-5 (Strong) | 1504 (Best) | 1493 (Strong) | 1491 (Strong) | n/a | 1467 (Good) |
| AIME / MATH-500 | Mathematical reasoning | Strong | Strong | Strong | 93.3% / 99% (Best) | Strong | Strong |
| SimpleQA | Factuality (verifiable accuracy) | Strong | Strong | Strong | Good | 0.858 (Best) | Strong |
| Real-time data access | Live information | Via plugins (Good) | Limited (Good) | Google Search (Strong) | Native X/Twitter (Best) | Native web search (Best) | None |
| Long-form writing | Prose quality & nuance | Good | Excellent (Best) | Strong | Good | Good | Good |
| Multimodal input | Image, video, audio | Strong | Strong | Excellent (Best) | Strong | Limited (Good) | Strong |
| Inference cost | Price-performance | Moderate (Strong) | High (Good) | Lowest (Best) | Moderate (Strong) | Low (Best) | Free if self-hosted (Best) |
| Context window | Input capacity | 128K (Good) | 1M (Best) | 1M (Best) | 1M (Best) | 128K (Good) | 1M (Best) |

Sources: LM Arena, Artificial Analysis, MindStudio Benchmarks. Scores reflect published results as of April 2026; benchmarks evolve rapidly.

Why There Is No Single “Best” LLM

Look at that table. No single model wins every row. And that’s the point.

ChatGPT excels at structured reasoning and agentic computer use. It’s the model you want for complex multi-step logic, tool orchestration, and tasks that require interacting with systems. OpenAI has earned this — they’ve invested heavily in the o-series reasoning models and their agent capabilities are genuinely ahead.

Claude produces the most natural, nuanced writing and is the strongest at real-world coding. Anthropic’s Constitutional AI approach also makes it the most cautious about hallucination and the most transparent about uncertainty — qualities that matter enormously in regulated industries. When you need a 40-page client report that reads like it was written by a senior analyst, or you need code that handles edge cases correctly, Claude is the right choice.

Gemini leads on scientific reasoning, multimodal input, and raw speed. Google’s investment in Gemini’s reasoning capabilities has paid off — it tops GPQA at 94.3% and its ARC-AGI-2 performance on abstract reasoning is best in class. When you need to process a mix of PDFs, images, and spreadsheets at speed, Gemini delivers.

Grok has quietly become the mathematical reasoning powerhouse. xAI’s Grok 4.20 scores 93.3% on AIME and 99% on MATH-500 — numbers that no other model matches. Its native integration with X/Twitter gives it real-time access to market sentiment, breaking news, and trending topics. When your research team needs live sentiment data or your quant team needs complex calculations, Grok is the right tool.

Perplexity takes a fundamentally different approach: every answer comes with numbered, linked citations to source material. Its SimpleQA factuality score of 0.858 outperforms every other model on verifiable accuracy. The Model Council feature runs queries through multiple frontier LLMs simultaneously and shows side-by-side comparisons. When you need sourced, defensible research — the kind you can put in front of a compliance officer or a board — Perplexity is purpose-built for it.

Llama is Meta’s open-weight alternative — and it’s closing the gap fast. Llama 4 Maverick matches frontier closed models on broad knowledge and scientific reasoning benchmarks, with a 1M-token context window. The real differentiator is control: because it can be self-hosted and fine-tuned, firms with strict data sovereignty requirements or high-volume inference workloads can run it at effectively zero marginal cost. It won’t win on real-time data or writing quality, but for batch processing, internal tooling, and privacy-sensitive workflows, it’s a compelling option.

Each of these models represents billions of dollars of research and genuinely world-class engineering. The mistake isn’t choosing any of them. The mistake is choosing just one.

The Five Things No Generic LLM Can Do for Asset Managers

Whether you use ChatGPT, Claude, Gemini, Grok, Perplexity, Llama, or all six, every generic LLM shares the same five fundamental gaps when deployed in an asset management context.

1. Access your live operational data

No generic LLM connects to your custodian, fund administrator, registry, CRM, or market data feeds. They operate on whatever text you paste into them. Someone still has to manually extract, clean, and format data before AI can help. That’s not automation — that’s adding a step.
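
To make the gap concrete, here is a minimal sketch of what “connected” looks like: live figures pulled straight from a custodian API and injected into the model’s context, instead of pasted in by hand. The endpoint, field names, and functions here are hypothetical, not a real custodian integration.

```python
import requests

# Hypothetical custodian endpoint -- illustrative only. A real integration
# depends on the custodian's actual API and authentication scheme.
CUSTODIAN_API = "https://api.example-custodian.com/v1"

def fetch_live_valuation(portfolio_id: str, api_key: str) -> dict:
    """Pull today's valuation directly from the custodian."""
    resp = requests.get(
        f"{CUSTODIAN_API}/portfolios/{portfolio_id}/valuation",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def build_prompt(portfolio_id: str, api_key: str) -> str:
    """Assemble the model's context from live data, not pasted text."""
    valuation = fetch_live_valuation(portfolio_id, api_key)
    return (
        f"Summarise the current position for portfolio {portfolio_id} "
        f"using this live valuation data: {valuation}"
    )
```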

2. Understand asset management terminology

Every asset manager uses different names for the same data. “FUA,” “AUM,” “Assets Under Management” — spelled three different ways across your own systems. Generic models don’t know that your CRM’s “total portfolio value” is the same as your custodian’s “FUA.” Without a domain-specific ontology layer, AI gives you confident-sounding wrong answers.
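
A thin ontology layer is enough to illustrate the idea. In this sketch the synonym sets and canonical field names are invented for illustration; a production ontology would be far larger and maintained per client.

```python
# Minimal sketch of a domain ontology layer. The synonym sets and the
# canonical field names below are illustrative assumptions.
ONTOLOGY = {
    "funds_under_administration": {
        "FUA", "AUM", "Assets Under Management", "total portfolio value",
    },
    "management_fee_bps": {"mgmt fee", "management fee", "MER (bps)"},
}

# Invert to a lookup: any source system's label -> one canonical field.
CANONICAL = {
    synonym.lower(): field
    for field, synonyms in ONTOLOGY.items()
    for synonym in synonyms
}

def normalise(label: str) -> str:
    """Map a source system's field name onto the canonical ontology, so
    'FUA' from the custodian and 'total portfolio value' from the CRM
    resolve to the same concept before the model ever sees them."""
    return CANONICAL.get(label.strip().lower(), "UNMAPPED:" + label)

assert normalise("FUA") == normalise("Total Portfolio Value")
```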

3. Respect your compliance rules

This gap matters more than ever in 2026. ASIC’s REP 798 “Beware the gap” found that financial services licensees are deploying AI faster than their governance frameworks can keep up — creating real risk of consumer harm. From December 2026, Privacy Act amendments require disclosure of automated decision-making, with penalties up to $50M.


Generic LLMs lack data residency controls, audit trails, role-based access, and compliance guardrails. It doesn’t matter how good the model is if your compliance team can’t sign off on it.
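
As a rough illustration of what those guardrails mean in practice, here is a toy wrapper that refuses tasks outside a user’s role and persists an audit record before any model call. The roles, task names, and log format are assumptions, not a real compliance framework.

```python
import hashlib
import json
import time
import uuid

# Toy role-based access map: which roles may run which tasks.
ALLOWED_ROLES = {
    "generate_client_report": {"analyst", "portfolio_manager"},
    "query_fund_data": {"analyst", "portfolio_manager", "operations"},
}

def guarded_call(user: str, role: str, task: str, prompt: str, model: str) -> None:
    """Enforce role-based access, then persist an audit record
    before the request is allowed to reach any model."""
    if role not in ALLOWED_ROLES.get(task, set()):
        raise PermissionError(f"Role '{role}' may not run task '{task}'")
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "user": user,
        "role": role,
        "task": task,
        "model": model,
        # Store a digest rather than raw client data in the log.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
    }
    with open("audit_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    # ...only now route the prompt to the selected model.
```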

4. Take action — not just answer questions

A generic LLM can answer questions, but it can’t generate a compliant client report from live data, flag a compliance breach before it becomes a regulatory incident, plan an optimal trip across 12 client meetings, or produce a board-ready summary from real-time FUA figures. The model is the brain. But without integrations, a data layer, and action capabilities, it’s a brain floating in empty space.

5. Operate within your existing infrastructure

Asset managers run on complex stacks of custodians, fund admins, registries, and CRMs built up over years. A purpose-built AI platform should sit on top of existing systems, not replace them. We’ve deployed across firms managing $3B to $116B in AUM; in every case, the platform connected to existing systems via APIs. Average time from kickoff to live: four weeks.

The Right Model for the Right Task

Here’s the insight that changed how we build AI for asset managers: the right answer isn’t picking one LLM and hoping it covers everything. It’s having the right LLM for each specific task.

Consider the range of AI tasks in a typical asset management firm:

  1. Drafting long-form client reports that demand nuanced prose
  2. Extracting data from PDFs, images, and spreadsheets
  3. Monitoring live market sentiment and breaking news
  4. Producing sourced, citation-backed research for compliance review
  5. Orchestrating multi-step workflows across internal systems
  6. Running high-volume batch processing on privacy-sensitive data

No single model is the best choice for all six of those tasks. And that’s before you factor in cost: using the most expensive frontier model for a simple email summary is throwing money away.

How Sherpa Solves This

Sherpa is Datafabric’s AI assistant, purpose-built for asset managers. But unlike any single LLM, Sherpa is designed as an orchestration layer that brings together the best models for each task.

How Sherpa works: the right LLM for every task, automatically. Sherpa benchmarks, selects, and monitors the best model for each workflow:

  1. Benchmark all major LLMs against your tasks
  2. Route each task to the best-fit model
  3. Connect to your live operational data
  4. Monitor quality, cost, and compliance

Here’s what this looks like in practice:

We offer all the key LLMs. Sherpa isn’t locked to a single provider. We integrate ChatGPT, Claude, Gemini, Grok, Perplexity, Llama, and other models as they prove their worth. When a new model launches or an existing one improves, we test it and add it to the rotation.

We benchmark them continuously. Every model is evaluated against the specific tasks that matter for asset managers — report generation, data extraction, compliance checking, research, and more. Not generic benchmarks. Your benchmarks, on your data, for your workflows.
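
In outline, that evaluation loop is simple. This sketch assumes a `call_model` dispatcher and a firm-specific `score` function; neither is Sherpa’s actual implementation.

```python
from statistics import mean

def evaluate(models, tasks, call_model, score):
    """Score each candidate model on the firm's own tasks and data.

    `tasks` is a list of {"prompt": ..., "expected": ...} examples;
    `score` compares a model's output against the expected result.
    """
    leaderboard = {}
    for model in models:
        scores = [
            score(task["expected"], call_model(model, task["prompt"]))
            for task in tasks
        ]
        leaderboard[model] = mean(scores)
    return leaderboard

# Example: rank three models on report-generation tasks.
# evaluate(["gpt-5.4", "claude-opus-4.6", "gemini-3.1-pro"],
#          report_tasks, call_model=api_dispatch, score=rubric_score)
```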

We set up the right model for the right task. Client report? Routed to the model that writes best. Document processing? Routed to the fastest multimodal model. Complex multi-step workflow? Routed to the strongest reasoning model. This happens automatically — your team doesn’t need to think about which model to use.
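
Conceptually, the routing step reduces to a lookup from task type to current best-fit model. The table below is a static illustration with placeholder model identifiers; in Sherpa the mapping is produced and refreshed by the benchmarking step above.

```python
# Illustrative routing table. Identifiers are placeholders, and the real
# mapping is regenerated from continuous benchmark results.
ROUTING_TABLE = {
    "client_report": "claude-opus-4.6",         # strongest long-form prose
    "document_extraction": "gemini-3.1-pro",    # fast multimodal input
    "sourced_research": "sonar-reasoning-pro",  # numbered citations
    "live_sentiment": "grok-4.20",              # native real-time data
    "multi_step_workflow": "gpt-5.4",           # structured reasoning
    "batch_processing": "llama-4-maverick",     # self-hosted, lowest cost
}

DEFAULT_MODEL = "gpt-5.4"

def route(task_type: str) -> str:
    """Return the current best-fit model for a task type."""
    return ROUTING_TABLE.get(task_type, DEFAULT_MODEL)
```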

We monitor everything. Every query is logged. Every response is traceable. Quality scores are tracked over time. If a model’s performance degrades or a better option becomes available, Sherpa adapts. Cost is tracked per task, per model, so you know exactly what you’re paying for.
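
Per-task, per-model tracking can be as simple as accumulating spend and quality scores against each pairing, as in this toy sketch (prices and quality scores are placeholders):

```python
from collections import defaultdict

# Toy usage tracker: spend and quality per (task, model) pair.
usage = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0, "scores": []})

def record(task: str, model: str, tokens: int,
           usd_per_1k_tokens: float, quality: float) -> None:
    """Accumulate cost and quality so a degrading or overpriced
    model can be rotated out of the task's routing."""
    entry = usage[(task, model)]
    entry["tokens"] += tokens
    entry["cost_usd"] += tokens / 1000 * usd_per_1k_tokens
    entry["scores"].append(quality)
```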

And critically, Sherpa provides the five capabilities that no generic LLM offers on its own: live data connectivity across your custodian, fund admin, CRM, and internal systems; a domain-specific ontology that understands asset management terminology; full compliance governance with audit trails, data residency, and role-based access; the ability to take action (generate reports, flag breaches, plan trips); and integration with your existing infrastructure without replacement.

The models are the brains. Sherpa is the nervous system that connects them to your business.

Where the Industry Is Heading

The shift from “pick one LLM” to “orchestrate the right LLM for each task” is happening across financial services. Microsoft’s 2026 financial services outlook identifies domain specialisation as one of the five key predictors of AI success. NVIDIA’s latest survey shows firms doubling down on industry-specific AI investment. Regulators — including ASIC and APRA — are making it clear that “we used an AI tool” is not an acceptable governance position.

The LLM wars will continue. New models will launch. Benchmarks will shift. Some of the numbers in this article will be outdated within months.

But the principle won’t change: the right answer for asset managers is not the best model. It’s the best model for each task, connected to your data, governed by your compliance rules, and monitored continuously.

That’s what Sherpa does. And that’s the difference between experimenting with AI and actually deploying it.

See Sherpa in action

Book a 30-minute demo and we’ll show you how Sherpa routes the right LLM to the right task — connected to live asset management data, with full governance built in.
