Microsoft Is Turning Copilot Into a Multi-Model AI Tool

TOOLS · FEATURED

Saad Amjad

4/4/2026 · 3 min read

Microsoft did something genuinely interesting this week. Instead of just adding another AI feature to Copilot, they fundamentally changed how the tool works under the hood. And it says a lot about where AI products are heading.

Here's what happened. Microsoft launched a feature called Critique inside its Microsoft 365 Copilot Researcher agent. The setup is pretty clever: GPT drafts a response to a research query, and then Claude independently reviews it for accuracy, completeness, and citation quality before the answer ever reaches the user [GeekWire]. Two models, working in sequence. One writes, the other checks.
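Microsoft hasn't published Critique's internals, but the pattern it describes, one model drafting and a second model reviewing before anything reaches the user, can be sketched in a few lines. Everything below is a stand-in: the model functions and the review fields are assumptions for illustration, not Copilot's actual API.

```python
# Hedged sketch of a draft-then-critique pipeline. The two model
# functions are stubs standing in for GPT (drafter) and Claude
# (reviewer); a real system would call the respective model APIs.

def draft_model(query: str) -> str:
    # Stand-in for the drafting model.
    return f"Draft answer to: {query}"

def critique_model(query: str, draft: str) -> dict:
    # Stand-in for the reviewer, which would check accuracy,
    # completeness, and citation quality. Here it approves as-is.
    return {"approved": True, "notes": [], "revised": draft}

def answer_with_critique(query: str) -> str:
    draft = draft_model(query)
    review = critique_model(query, draft)
    # Only the reviewed (possibly revised) answer reaches the user.
    return review["revised"] if review["approved"] else draft

print(answer_with_critique("What drove Q3 cloud revenue?"))
```

The key design point is the sequencing: the reviewer sees both the query and the draft, and the user only ever sees the post-review output.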

That's a notably different approach to AI product design.

And the results are backing it up. Microsoft claims this multi-model setup scored 57.4 on the DRACO benchmark, a 13.8% improvement, putting it ahead of standalone deep-research tools from OpenAI, Google, Perplexity, and Anthropic [WinBuzzer]. Those are big numbers, though it's worth noting that no independent party has verified them yet.

So why does this matter beyond the benchmarks?

The interesting part here isn't really the accuracy improvement. It's the philosophy behind the decision.

For years, the dominant approach for AI products was simple: pick one model, build around it, ship. OpenAI had GPT, Anthropic had Claude, Google had Gemini. The big platforms mostly picked one and called it a day. Microsoft was no different. Copilot was basically an OpenAI product with a Microsoft wrapper around it.

That's now changing. Microsoft's bet is that the real differentiation isn't in the model itself. It's at the context and orchestration layer [Constellation Research]. In other words: whoever owns the workflow, the data, and the routing logic wins. The model is just one part of that.

That shift has real implications for how enterprise AI tools get built going forward.

There's more going on here too.

Alongside Critique, Microsoft also launched a Model Council feature. Council runs Anthropic and OpenAI models simultaneously, with each producing a standalone report and a judge model evaluating where the responses agree and diverge [Constellation Research]. Users can literally compare how GPT and Claude approach the same research question, side by side, before deciding which output to use.
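The council pattern described above, two independent reports plus a judge that flags agreement and divergence, reduces to a simple structure. This is a hypothetical sketch with invented stub models and a claim-level comparison; Microsoft hasn't documented how its judge model actually scores the reports.

```python
# Hedged sketch of the "council" pattern: two models each produce a
# standalone report, then a judge marks where they agree and diverge.
# The model functions and report shape are illustrative stand-ins.

def model_a(query: str) -> dict:
    return {"claims": {"x": 1, "y": 2}}

def model_b(query: str) -> dict:
    return {"claims": {"x": 1, "y": 3}}

def judge(report_a: dict, report_b: dict) -> dict:
    agree, diverge = {}, {}
    # Compare every claim made by either model.
    for key in report_a["claims"].keys() | report_b["claims"].keys():
        a = report_a["claims"].get(key)
        b = report_b["claims"].get(key)
        (agree if a == b else diverge)[key] = (a, b)
    return {"agree": agree, "diverge": diverge}

verdict = judge(model_a("q"), model_b("q"))
print(verdict)
```

The side-by-side comparison is the product feature: instead of hiding the disagreement, the judge surfaces it so the user can decide which output to trust.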

On top of that, Microsoft launched Copilot Cowork through the Frontier early access program, a tool for handling long-running, multi-step workflows autonomously across apps like Excel, Outlook, and SharePoint, built on Anthropic's Cowork technology [Techzine Global]. Users describe an outcome they want, and the system figures out the steps and executes them without needing a human to prompt every action. That's a meaningful step up from Copilot's previous single-shot output style.
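The shift from single-shot prompting to outcome-driven execution can be sketched as a plan-then-execute loop. Again, this is a toy illustration under stated assumptions: the planner and the app actions are invented stand-ins, and the real Cowork system would use a model to decompose goals into actions across Excel, Outlook, and SharePoint.

```python
# Hedged sketch of an outcome-driven agent loop in the spirit of
# Copilot Cowork: the user states a goal once, a planner breaks it
# into steps, and each step executes without further prompting.
# plan() and execute() are illustrative stubs, not real APIs.

def plan(goal: str) -> list:
    # Stand-in planner; a real one would be model-driven.
    return [
        f"gather data for {goal!r}",
        f"summarize {goal!r}",
        f"email report on {goal!r}",
    ]

def execute(step: str) -> str:
    # Stand-in executor for a single app-level action.
    return f"done: {step}"

def cowork(goal: str) -> list:
    # No human prompt between steps; the loop runs to completion.
    return [execute(step) for step in plan(goal)]

for result in cowork("Q3 sales review"):
    print(result)
```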

Why Microsoft really needs this to work

There's an honest business reason behind all of this. Only 3.3% of Microsoft's 450 million commercial Microsoft 365 users actually pay for Copilot right now. That's about 15 million paid seats [WinBuzzer]. For a product Microsoft has been pushing hard for over a year, that's a slower adoption rate than the company would like.

More compelling features, especially ones that feel genuinely useful rather than just impressive in demos, are exactly what Microsoft needs to convert the other 96.7%.

The multi-model approach is a smart bet here because it strengthens the quality argument. When you can show that two models checking each other's work produce better research output, that's a tangible reason for an enterprise buyer to upgrade their subscription.

The bigger picture

What Microsoft is doing here is worth keeping an eye on, because other platforms will likely follow: analysts expect hyperscalers like Amazon and Google Cloud to adopt the multi-model approach as they expand their own model catalogs [Constellation Research].

We're moving away from "which model do you use?" as the key question. The next generation of serious AI tools will be built around orchestration: which models do you combine, at which step, for which task?

Microsoft is making a strong early bet on that answer. And for once, Copilot has something genuinely different to show for it.