AI & Sherpa · 24 April 2026 · 14 min read

ChatGPT, Claude, Gemini, Grok, Perplexity, Llama — and Sherpa: why asset managers need the right LLM for each task

Antony Bergeot-Lair
Technical Co-founder, Datafabric
[Hero graphic: the major LLMs (GPT-5.4, Opus 4.6, Gemini 3.1, Grok 4.20, Sonar Pro) orchestrated by Sherpa by Datafabric]
Contents
  1. The conversation every asset manager is having
  2. Six models, six different strengths
  3. Benchmark comparison: all six models head-to-head
  4. Why there is no single “best” LLM
  5. The five things no generic LLM can do for asset managers
  6. The right model for the right task
  7. How Sherpa solves this
  8. Where the industry is heading

The Conversation Every Asset Manager Is Having

If you run an asset management firm in 2026, you’ve had some version of this conversation in the last six months.

Someone on your team raises the idea of “using AI.” The room nods. One person has been experimenting with ChatGPT. Another prefers Claude for research. Someone in IT says Google Gemini has better integration with your workspace. The analyst swears by Perplexity for sourced research. Someone just read that Grok can pull real-time data from X. An engineer suggests self-hosting Meta’s Llama so client data never leaves your infrastructure. And the compliance officer wants to know where the data goes with any of them.

And nobody can agree on which one your firm should standardise on.

Here’s what we’ve learned after two years of building AI specifically for asset managers: the debate over which LLM is “best” is the wrong debate entirely. Each of the major models is genuinely excellent at something. The real question is how you orchestrate the right model for each task — within a governed, domain-aware platform that actually connects to your operational data.

The LLM Landscape: Six Models, Six Different Strengths

The six major LLMs your firm is likely evaluating — OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, xAI’s Grok, Perplexity, and Meta’s Llama — are all remarkable technology. But they are not interchangeable. Each lab has invested heavily in different strengths, and the 2026 benchmarks make those differences clear.

Let’s give credit where it’s due.

ChatGPT
OpenAI · GPT-5.4
Best at: Reasoning & Versatility
The most versatile general-purpose model. Excels at structured reasoning (o-series), computer use, and has the broadest ecosystem of plugins and integrations. The go-to for brainstorming, drafting, and conversational tasks.
Claude
Anthropic · Claude Opus 4.6
Best at: Coding & Nuanced Writing
Tops coding benchmarks (SWE-bench, Arena code). Produces the most natural, nuanced long-form prose. 1M-token context window handles entire codebases. Constitutional AI approach makes it the most cautious and trustworthy for enterprise use.
Gemini
Google · Gemini 3.1 Pro
Best at: Multimodal & Scientific Reasoning
Leads on abstract reasoning (ARC-AGI-2 77.1%), scientific benchmarks (GPQA 94.3%), and multimodal capabilities. Fastest inference at the lowest cost. 1M-token context window. Deep integration with Google Workspace.
Grok
xAI · Grok 4.20
Best at: Real-Time Data & Maths
Dominates mathematical reasoning (AIME 93.3%, MATH-500 99%). Native integration with X/Twitter for real-time trend data and sentiment analysis. Multi-agent debate architecture and the fastest deep search capability.
Perplexity
Perplexity · Sonar Reasoning Pro
Best at: Sourced Research & Factuality
Every answer includes numbered citations linked to source material. Tops factuality benchmarks (SimpleQA 0.858). Model Council runs queries through multiple frontier LLMs simultaneously. Deep Research mode for comprehensive, sourced analysis.
Llama
Meta · Llama 4 Maverick
Best at: Open-Source & Cost Efficiency
The leading open-weight model family. Llama 4 Maverick matches frontier closed models on key benchmarks while offering full self-hosting and fine-tuning flexibility. Massive 1M-token context window. Zero per-token API cost when self-hosted — ideal for high-volume enterprise workloads where data sovereignty matters.

Benchmark Comparison: All Six Models Head-to-Head

We continuously benchmark LLMs across the dimensions that matter for asset management workflows. Here’s where things stand as of April 2026.

| Benchmark | What It Measures | ChatGPT (GPT-5.4) | Claude (Opus 4.6) | Gemini (3.1 Pro) | Grok (4.20) | Perplexity (Sonar Pro) | Llama (4 Maverick) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | Broad knowledge | 91.8% (Best) | 89.5% (Strong) | 89.8% (Strong) | ~86% (Good) | 75.5% (Good) | 88.2% (Strong) |
| SWE-bench Verified | Real-world coding | 78.2% (Strong) | 80.8% (Best) | 78.8% (Strong) | ~74% (Good) | n/a | ~72% (Good) |
| GPQA Diamond | Scientific reasoning | 92.8% (Strong) | 91.3% (Strong) | 94.3% (Best) | 88.0% (Good) | n/a | 85.2% (Strong) |
| Chatbot Arena Elo | Human preference | Top-5 (Strong) | 1504 (Best) | 1493 (Strong) | 1491 (Strong) | n/a | 1467 (Good) |
| AIME / MATH-500 | Mathematical reasoning | Strong | Strong | Strong | 93.3% / 99% (Best) | Strong | Strong |
| SimpleQA | Factuality (verifiable accuracy) | Strong | Strong | Strong | Good | 0.858 (Best) | Strong |
| Real-time data access | Live information | Via plugins (Good) | Limited (Good) | Google Search (Strong) | Native X/Twitter (Best) | Native web search (Best) | None |
| Long-form writing | Prose quality & nuance | Good | Excellent (Best) | Strong | Good | Good | Good |
| Multimodal input | Image, video, audio | Strong | Strong | Excellent (Best) | Strong | Limited (Good) | Strong |
| Inference cost | Price-performance | Moderate (Strong) | High (Good) | Lowest (Best) | Moderate (Strong) | Low (Best) | Free if self-hosted (Best) |
| Context window | Input capacity | 128K (Good) | 1M (Best) | 1M (Best) | 1M (Best) | 128K (Good) | 1M (Best) |

Sources: LM Arena, Artificial Analysis, MindStudio Benchmarks. Scores reflect published results as of April 2026; benchmarks evolve rapidly.

Why There Is No Single “Best” LLM

Look at that table. No single model wins every row. And that’s the point.

ChatGPT excels at structured reasoning and agentic computer use. It’s the model you want for complex multi-step logic, tool orchestration, and tasks that require interacting with systems. OpenAI has earned this — they’ve invested heavily in the o-series reasoning models and their agent capabilities are genuinely ahead.

Claude produces the most natural, nuanced writing and is the strongest at real-world coding. Anthropic’s Constitutional AI approach also makes it the most cautious about hallucination and the most transparent about uncertainty — qualities that matter enormously in regulated industries. When you need a 40-page client report that reads like it was written by a senior analyst, or you need code that handles edge cases correctly, Claude is the right choice.

Gemini leads on scientific reasoning, multimodal input, and raw speed. Google’s investment in Gemini’s reasoning capabilities has paid off — it tops GPQA at 94.3% and its ARC-AGI-2 performance on abstract reasoning is best in class. When you need to process a mix of PDFs, images, and spreadsheets at speed, Gemini delivers.

Grok has quietly become the mathematical reasoning powerhouse. xAI’s Grok 4.20 scores 93.3% on AIME and 99% on MATH-500 — numbers that no other model matches. Its native integration with X/Twitter gives it real-time access to market sentiment, breaking news, and trending topics. When your research team needs live sentiment data or your quant team needs complex calculations, Grok is the right tool.

Perplexity takes a fundamentally different approach: every answer comes with numbered, linked citations to source material. Its SimpleQA factuality score of 0.858 outperforms every other model on verifiable accuracy. The Model Council feature runs queries through multiple frontier LLMs simultaneously and shows side-by-side comparisons. When you need sourced, defensible research — the kind you can put in front of a compliance officer or a board — Perplexity is purpose-built for it.

Llama is Meta’s open-weight alternative — and it’s closing the gap fast. Llama 4 Maverick matches frontier closed models on broad knowledge and scientific reasoning benchmarks, with a 1M-token context window. The real differentiator is control: because it can be self-hosted and fine-tuned, firms with strict data sovereignty requirements or high-volume inference workloads can run it at effectively zero marginal cost. It won’t win on real-time data or writing quality, but for batch processing, internal tooling, and privacy-sensitive workflows, it’s a compelling option.

Each of these models represents billions of dollars of research and genuinely world-class engineering. The mistake isn’t choosing any of them. The mistake is choosing just one.

The Five Things No Generic LLM Can Do for Asset Managers

Whether you use ChatGPT, Claude, Gemini, Grok, Perplexity, Llama, or all six, every generic LLM shares the same five fundamental gaps when deployed in an asset management context.

1. Access your live operational data

No generic LLM connects to your custodian, fund administrator, registry, CRM, or market data feeds. They operate on whatever text you paste into them. Someone still has to manually extract, clean, and format data before AI can help. That’s not automation — that’s adding a step.
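
To make the gap concrete, here is a minimal sketch of what “connected” looks like: live figures pulled straight from a custodian API and injected into the model’s context, instead of pasted in by hand. The endpoint, field names, and functions here are hypothetical, not a real custodian integration.

```python
import requests

# Hypothetical custodian endpoint -- illustrative only. A real integration
# depends on the custodian's actual API and authentication scheme.
CUSTODIAN_API = "https://api.example-custodian.com/v1"

def fetch_live_valuation(portfolio_id: str, api_key: str) -> dict:
    """Pull today's valuation directly from the custodian."""
    resp = requests.get(
        f"{CUSTODIAN_API}/portfolios/{portfolio_id}/valuation",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def build_prompt(portfolio_id: str, api_key: str) -> str:
    """Assemble the model's context from live data, not pasted text."""
    valuation = fetch_live_valuation(portfolio_id, api_key)
    return (
        f"Summarise the current position for portfolio {portfolio_id} "
        f"using this live valuation data: {valuation}"
    )
```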

2. Understand asset management terminology

Every asset manager uses different names for the same data. “FUA,” “AUM,” “Assets Under Management” — spelled three different ways across your own systems. Generic models don’t know that your CRM’s “total portfolio value” is the same as your custodian’s “FUA.” Without a domain-specific ontology layer, AI gives you confident-sounding wrong answers.
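
A thin ontology layer is enough to illustrate the idea. In this sketch the synonym sets and canonical field names are invented for illustration; a production ontology would be far larger and maintained per client.

```python
# Minimal sketch of a domain ontology layer. The synonym sets and the
# canonical field names below are illustrative assumptions.
ONTOLOGY = {
    "funds_under_administration": {
        "FUA", "AUM", "Assets Under Management", "total portfolio value",
    },
    "management_fee_bps": {"mgmt fee", "management fee", "MER (bps)"},
}

# Invert to a lookup: any source system's label -> one canonical field.
CANONICAL = {
    synonym.lower(): field
    for field, synonyms in ONTOLOGY.items()
    for synonym in synonyms
}

def normalise(label: str) -> str:
    """Map a source system's field name onto the canonical ontology, so
    'FUA' from the custodian and 'total portfolio value' from the CRM
    resolve to the same concept before the model ever sees them."""
    return CANONICAL.get(label.strip().lower(), "UNMAPPED:" + label)

assert normalise("FUA") == normalise("Total Portfolio Value")
```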

3. Respect your compliance rules

This gap matters more than ever in 2026. ASIC’s REP 798 “Beware the gap” found that financial services licensees are deploying AI faster than their governance frameworks can keep up — creating real risk of consumer harm. From December 2026, Privacy Act amendments require disclosure of automated decision-making, with penalties up to $50M.


Generic LLMs lack data residency controls, audit trails, role-based access, and compliance guardrails. It doesn’t matter how good the model is if your compliance team can’t sign off on it.
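
As a rough illustration of what those guardrails mean in practice, here is a toy wrapper that refuses tasks outside a user’s role and persists an audit record before any model call. The roles, task names, and log format are assumptions, not a real compliance framework.

```python
import hashlib
import json
import time
import uuid

# Toy role-based access map: which roles may run which tasks.
ALLOWED_ROLES = {
    "generate_client_report": {"analyst", "portfolio_manager"},
    "query_fund_data": {"analyst", "portfolio_manager", "operations"},
}

def guarded_call(user: str, role: str, task: str, prompt: str, model: str) -> None:
    """Enforce role-based access, then persist an audit record
    before the request is allowed to reach any model."""
    if role not in ALLOWED_ROLES.get(task, set()):
        raise PermissionError(f"Role '{role}' may not run task '{task}'")
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "user": user,
        "role": role,
        "task": task,
        "model": model,
        # Store a digest rather than raw client data in the log.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
    }
    with open("audit_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    # ...only now route the prompt to the selected model.
```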

4. Take action — not just answer questions

A generic LLM can answer questions, but it can’t generate a compliant client report from live data, flag a compliance breach before it becomes a regulatory incident, plan an optimal trip across 12 client meetings, or produce a board-ready summary from real-time FUA figures. The model is the brain. But without integrations, a data layer, and action capabilities, it’s a brain floating in empty space.

5. Operate within your existing infrastructure

Asset managers run on complex stacks of custodians, fund admins, registries, and CRMs built up over years. A purpose-built AI platform should sit on top of existing systems, not replace them. We’ve deployed across firms managing $3B to $116B in AUM; in every case, the platform connected to existing systems via APIs. Average time from kickoff to live: four weeks.

The Right Model for the Right Task

Here’s the insight that changed how we build AI for asset managers: the right answer isn’t picking one LLM and hoping it covers everything. It’s having the right LLM for each specific task.

Consider the range of AI tasks in a typical asset management firm:

  1. Drafting long-form client reports that demand nuanced prose
  2. Extracting data from PDFs, images, and spreadsheets
  3. Monitoring live market sentiment and breaking news
  4. Producing sourced, citation-backed research for compliance review
  5. Orchestrating multi-step workflows across internal systems
  6. Running high-volume batch processing on privacy-sensitive data

No single model is the best choice for all six of those tasks. And that’s before you factor in cost: using the most expensive frontier model for a simple email summary is throwing money away.

How Sherpa Solves This

Sherpa is Datafabric’s AI assistant, purpose-built for asset managers. But unlike any single LLM, Sherpa is designed as an orchestration layer that brings together the best models for each task.

How Sherpa works: the right LLM for every task, automatically. Sherpa benchmarks, selects, and monitors the best model for each workflow:

  1. Benchmark all major LLMs against your tasks
  2. Route each task to the best-fit model
  3. Connect to your live operational data
  4. Monitor quality, cost, and compliance

Here’s what this looks like in practice:

We offer all the key LLMs. Sherpa isn’t locked to a single provider. We integrate ChatGPT, Claude, Gemini, Grok, Perplexity, Llama, and other models as they prove their worth. When a new model launches or an existing one improves, we test it and add it to the rotation.

We benchmark them continuously. Every model is evaluated against the specific tasks that matter for asset managers — report generation, data extraction, compliance checking, research, and more. Not generic benchmarks. Your benchmarks, on your data, for your workflows.
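
In outline, that evaluation loop is simple. This sketch assumes a `call_model` dispatcher and a firm-specific `score` function; neither is Sherpa’s actual implementation.

```python
from statistics import mean

def evaluate(models, tasks, call_model, score):
    """Score each candidate model on the firm's own tasks and data.

    `tasks` is a list of {"prompt": ..., "expected": ...} examples;
    `score` compares a model's output against the expected result.
    """
    leaderboard = {}
    for model in models:
        scores = [
            score(task["expected"], call_model(model, task["prompt"]))
            for task in tasks
        ]
        leaderboard[model] = mean(scores)
    return leaderboard

# Example: rank three models on report-generation tasks.
# evaluate(["gpt-5.4", "claude-opus-4.6", "gemini-3.1-pro"],
#          report_tasks, call_model=api_dispatch, score=rubric_score)
```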

We set up the right model for the right task. Client report? Routed to the model that writes best. Document processing? Routed to the fastest multimodal model. Complex multi-step workflow? Routed to the strongest reasoning model. This happens automatically — your team doesn’t need to think about which model to use.
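
Conceptually, the routing step reduces to a lookup from task type to current best-fit model. The table below is a static illustration with placeholder model identifiers; in Sherpa the mapping is produced and refreshed by the benchmarking step above.

```python
# Illustrative routing table. Identifiers are placeholders, and the real
# mapping is regenerated from continuous benchmark results.
ROUTING_TABLE = {
    "client_report": "claude-opus-4.6",         # strongest long-form prose
    "document_extraction": "gemini-3.1-pro",    # fast multimodal input
    "sourced_research": "sonar-reasoning-pro",  # numbered citations
    "live_sentiment": "grok-4.20",              # native real-time data
    "multi_step_workflow": "gpt-5.4",           # structured reasoning
    "batch_processing": "llama-4-maverick",     # self-hosted, lowest cost
}

DEFAULT_MODEL = "gpt-5.4"

def route(task_type: str) -> str:
    """Return the current best-fit model for a task type."""
    return ROUTING_TABLE.get(task_type, DEFAULT_MODEL)
```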

We monitor everything. Every query is logged. Every response is traceable. Quality scores are tracked over time. If a model’s performance degrades or a better option becomes available, Sherpa adapts. Cost is tracked per task, per model, so you know exactly what you’re paying for.
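
Per-task, per-model tracking can be as simple as accumulating spend and quality scores against each pairing, as in this toy sketch (prices and quality scores are placeholders):

```python
from collections import defaultdict

# Toy usage tracker: spend and quality per (task, model) pair.
usage = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0, "scores": []})

def record(task: str, model: str, tokens: int,
           usd_per_1k_tokens: float, quality: float) -> None:
    """Accumulate cost and quality so a degrading or overpriced
    model can be rotated out of the task's routing."""
    entry = usage[(task, model)]
    entry["tokens"] += tokens
    entry["cost_usd"] += tokens / 1000 * usd_per_1k_tokens
    entry["scores"].append(quality)
```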

And critically, Sherpa provides the five capabilities that no generic LLM offers on its own: live data connectivity across your custodian, fund admin, CRM, and internal systems; a domain-specific ontology that understands asset management terminology; full compliance governance with audit trails, data residency, and role-based access; the ability to take action (generate reports, flag breaches, plan trips); and integration with your existing infrastructure without replacement.

The models are the brains. Sherpa is the nervous system that connects them to your business.

Where the Industry Is Heading

The shift from “pick one LLM” to “orchestrate the right LLM for each task” is happening across financial services. Microsoft’s 2026 financial services outlook identifies domain specialisation as one of the five key predictors of AI success. NVIDIA’s latest survey shows firms doubling down on industry-specific AI investment. Regulators — including ASIC and APRA — are making it clear that “we used an AI tool” is not an acceptable governance position.

The LLM wars will continue. New models will launch. Benchmarks will shift. Some of the numbers in this article will be outdated within months.

But the principle won’t change: the right answer for asset managers is not the best model. It’s the best model for each task, connected to your data, governed by your compliance rules, and monitored continuously.

That’s what Sherpa does. And that’s the difference between experimenting with AI and actually deploying it.

See Sherpa in action

Book a 30-minute demo and we’ll show you how Sherpa routes the right LLM to the right task — connected to live asset management data, with full governance built in.
