
Are Artificial Intelligence Agents Helpers to Doctors, or Are They New Health Professionals?
Only 27% of healthcare workers’ time is spent on direct patient care; the rest goes to paperwork and admin tasks. AI agents promise to change this balance. Where do we stand today?
🌟 Why Did We Write This?
The time consumed by electronic records and administrative work keeps growing, fueling physician burnout. Is there a reliable way to change the picture? 🤔
The leading candidate: AI agents. Beyond chatbots, these systems can handle record management, lab interpretation, and workflow integration. Will they remain assistants that ease workload, or evolve into a “new healthcare professional”? 👩‍⚕️🤖
To explore this, we review results from MedAgentBench, a Stanford platform. Published in NEJM AI on August 14, 2025 (DOI: 10.1056/AIdbp2500144), it is the first large-scale benchmark of AI agents in realistic EHR scenarios. Below we summarize its scope and model performance. 🚀
Why Is This Important?
Studies show that only 27% of healthcare workers’ time goes to direct patient care, while paperwork, EHR entry, and administrative duties take up the rest. This imbalance drives burnout; AI agents aim to free clinical time by taking over routine tasks.
From Chatbots to Agents
Unlike chatbots, which only answer questions, agents can:
- Interpret complex instructions and plan actions,
- Integrate information from multiple sources,
- Interact with EHRs via standard APIs,
- Execute step by step and present summaries to physicians.
Example: Beyond answering “What is the treatment for pneumonia?”, an agent can factor in allergies, antibiogram data, drug interactions, and risk scores to prepare a personalized plan as a draft order for the physician to review.
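As a rough sketch of that last step, a draft order could be expressed as a FHIR R4 `MedicationRequest` whose `status` is `draft`, so nothing becomes active until a physician signs off. The helper name, patient ID, and placeholder drug text below are hypothetical and are not taken from MedAgentBench.

```python
import json

def build_draft_medication_request(patient_id, drug_text, dose_text):
    """Return a FHIR R4 MedicationRequest that is only a draft proposal."""
    return {
        "resourceType": "MedicationRequest",
        "status": "draft",        # not active until a clinician approves it
        "intent": "proposal",     # a suggestion, not an authorized order
        "subject": {"reference": f"Patient/{patient_id}"},
        "medicationCodeableConcept": {"text": drug_text},
        "dosageInstruction": [{"text": dose_text}],
    }

# Hypothetical values, for illustration only.
order = build_draft_medication_request(
    "example-123", "example antibiotic", "example dosing text"
)
print(json.dumps(order, indent=2))
```

Keeping the agent's output at `draft` and `proposal` leaves the final decision with the physician, which matches the assistant role described here.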
📊 MedAgentBench: The First Medical AI Agent Benchmark
MedAgentBench, developed at Stanford, evaluates agents across realistic EHR workflows. It comprises 300 tasks and 100 patient profiles in a FHIR-compliant environment.
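In a FHIR-compliant environment, a query task typically comes down to a REST search. Here is a minimal sketch of how an agent might build one, assuming a hypothetical server address and patient ID. The LOINC code 2345-7 (glucose in serum or plasma) is used purely as an illustration.

```python
from urllib.parse import urlencode

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical server address

def observation_search_url(patient_id, loinc_code, count=1):
    """Build a FHIR search URL for a patient's most recent observations."""
    params = {
        "patient": patient_id,
        "code": f"http://loinc.org|{loinc_code}",  # system|code token
        "_sort": "-date",                          # newest first
        "_count": count,                           # page size
    }
    return f"{FHIR_BASE}/Observation?{urlencode(params)}"

url = observation_search_url("example-123", "2345-7")
print(url)
```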
📈 Model Success Rates
SR stands for success rate, reported overall and separately for query tasks and action tasks.
| Model | Overall SR (%) | Query SR (%) | Action SR (%) |
|---|---|---|---|
| Claude 3.5 Sonnet v2 | 69.67 | 85.33 | 54.00 |
| GPT-4o | 64.00 | 72.00 | 56.00 |
| DeepSeek-V3 | 62.67 | 70.67 | 54.67 |
| Gemini 1.5 Pro | 62.00 | 52.67 | 71.33 |
| GPT-4o mini | 56.33 | 59.33 | 53.33 |
| Qwen2.5 (72B) | 51.33 | 38.67 | 64.00 |
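
Each overall figure in the table equals the simple mean of the query and action rates. That is what one would expect if the 300 tasks were split evenly between the two types, though this is an inference from the numbers rather than something stated here. A quick check:

```python
# (overall, query, action) success rates in percent, copied from the table.
rows = {
    "Claude 3.5 Sonnet v2": (69.67, 85.33, 54.00),
    "GPT-4o": (64.00, 72.00, 56.00),
    "DeepSeek-V3": (62.67, 70.67, 54.67),
    "Gemini 1.5 Pro": (62.00, 52.67, 71.33),
    "GPT-4o mini": (56.33, 59.33, 53.33),
    "Qwen2.5 (72B)": (51.33, 38.67, 64.00),
}

def gap_from_mean(overall, query, action):
    """Absolute difference between overall SR and the query/action mean."""
    return abs((query + action) / 2 - overall)

max_gap = max(gap_from_mean(*values) for values in rows.values())
print(f"largest gap: {max_gap:.3f} percentage points")
```

The gap stays within rounding error for every model, so a high overall score can hide a weak side: the top-ranked model leads on queries but not on actions.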
⚠️ Common Errors
- Not following required output format (e.g., returning text instead of numbers),
- Invalid/incorrect API calls (payload or syntax errors),
- Incomplete understanding of clinical context.
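The first error type can be caught mechanically before an answer is scored or shown to a clinician. This is a minimal sketch that assumes a hypothetical convention in which the agent must return a JSON list of numbers. MedAgentBench's actual output format may differ.

```python
import json

def parse_numeric_answer(raw):
    """Parse an agent answer that must be a JSON list of numbers."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"answer is not valid JSON: {raw!r}") from exc
    if not isinstance(value, list):
        raise ValueError("answer must be a JSON list")
    numbers = []
    for item in value:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            raise ValueError(f"non-numeric item: {item!r}")
        numbers.append(float(item))
    return numbers

print(parse_numeric_answer("[7.2, 140]"))
```

A free-text reply such as `The value is 7.2` raises `ValueError` here instead of being silently accepted.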
👩‍⚕️ Can They Replace Doctors?
Agents are not ready to replace physicians yet, but they can already act as assistants for paperwork, order entry, and simple queries, which frees clinical time. With better reliability and standards, they may become a new category of healthcare professional.
🚀 Looking Ahead
- Improved reproducibility and reliability,
- Richer datasets including clinical notes and team collaboration,
- Clear ethical, safety, and regulatory frameworks.
🔗 Source & Reference
MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents. NEJM AI (Published: August 14, 2025). DOI: 10.1056/AIdbp2500144. GitHub: stanfordmlgroup/MedAgentBench



