The AfriMed-QA project has released a study that exposes a blind spot in how medical AI systems for Africa have been evaluated.
Developed by researchers from Georgia Tech and Google, and winner of the Best Social Impact Paper Award at ACL 2025, AfriMed-QA is a dataset that pulls together roughly 15,000 clinically diverse questions from 621 contributors across more than 60 medical schools in 12 countries, spanning 32 medical specialties. The researchers then tested 30 different AI models against this real-world medical data. What they found changes the conversation about AI in healthcare.
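The headline numbers hint at the evaluation mechanics: thousands of questions, many models, scores broken down by specialty. As a rough illustration only, here is a minimal sketch of specialty-level accuracy scoring over multiple-choice items; the record layout and the stand-in model function are assumptions for demonstration, not the project's actual schema or harness.

```python
from dataclasses import dataclass


@dataclass
class MCQItem:
    question: str
    options: list[str]   # answer choices
    answer_index: int    # index of the correct option
    specialty: str       # e.g., "Infectious Disease"


def stub_model(question: str, options: list[str]) -> int:
    """Stand-in for a real model call; always picks the first option."""
    return 0


def evaluate(items: list[MCQItem], model=stub_model) -> dict[str, float]:
    """Return overall accuracy plus a per-specialty breakdown."""
    correct = 0
    by_specialty: dict[str, list[int]] = {}
    for item in items:
        hit = int(model(item.question, item.options) == item.answer_index)
        correct += hit
        by_specialty.setdefault(item.specialty, []).append(hit)
    report = {"overall": correct / len(items)}
    for spec, hits in by_specialty.items():
        report[spec] = sum(hits) / len(hits)
    return report


if __name__ == "__main__":
    sample = [
        MCQItem(
            "First-line treatment for uncomplicated malaria?",
            ["Artemisinin-based combination therapy", "Aspirin", "Insulin", "Warfarin"],
            0,
            "Infectious Disease",
        ),
    ]
    print(evaluate(sample))
```

The per-specialty breakdown is the point: an aggregate score can look healthy while individual specialties, or disease areas, crater.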
Answers for Africa
For years, AI systems have been acing the United States Medical Licensing Examination, which created an aura of medical competence. Then came the reality check. When those same models were tested on African medical scenarios, performance fell off a cliff. Findings from two months ago showed that widely used benchmarks "underrepresent African disease burdens," a gap that gives researchers and clinicians a false sense of security.
The gaps are not small; they are systematic. Diseases affecting millions barely appear in training or evaluation data. Breakthrough research from this summer reported that "sickle cell disease [is] absent in three [global benchmark] sets," despite its massive footprint. Malaria, HIV, and tuberculosis, conditions that dominate care in many regions, show minimal representation in existing benchmarks. That is not a rounding error.
It gets worse. Evidence from six months ago found that "only 5% of evaluations used real patient data" in medical AI research. We have been running driving simulators, then declaring the cars road-ready. Real patients do not live in simulations.
What this means
The ripple effects extend far beyond a single region or disease area. Comprehensive analysis from last month flagged “systemic weaknesses across the entire landscape of medical benchmarks,” including a disconnect from clinical practice, data contamination, safety neglect, and shaky validation.
One result from AfriMed-QA stands out. The findings show that “baseline general models outperform and generalize better than biomedical models of similar size.” In plain terms, the specialized medical models many teams have been building can underperform general-purpose models that were never designed for healthcare.
There is another twist. When consumers and clinicians compared AI responses with doctor-provided answers, they "consistently rated [frontier AI models] to be more complete, informative, and relevant" than the human clinicians' answers. But those ratings skew toward scenarios with plenty of training data. The newly exposed gaps, the ones tied to underrepresented diseases and settings, were not the focus of those favorable evaluations. Different test, different outcome.
Trustworthy AI in healthcare
A reset is underway. The dataset and evaluation code are open source, along with a public leaderboard that tracks performance across diverse scenarios. If you want to see how models do outside tidy exam questions, the scoreboard is now visible.
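For anyone who wants to inspect the data directly, a minimal sketch follows, assuming the dataset is published on the Hugging Face Hub. The dataset ID, split name, and column names here are assumptions to verify against the project's own documentation.

```python
# pip install datasets
from collections import Counter

from datasets import load_dataset

# Hypothetical Hub ID; check the AfriMed-QA project page for the real
# identifier, available splits, and schema before relying on this.
DATASET_ID = "intronhealth/afrimedqa_v2"

ds = load_dataset(DATASET_ID, split="train")
print(ds)      # column names and row count
print(ds[0])   # one record's fields

# Tally records per question type (e.g., MCQ vs. short answer vs.
# consumer queries), assuming a 'question_type' column exists.
print(Counter(ds["question_type"]))
```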
The research community is moving fast. Research revealed earlier this year that teams created additional datasets with "11,000+ manually and LLM-generated personas representing a broad array of tropical and infectious diseases." The methods, the same work notes, "can be scaled to other locales where digitized benchmarks may not currently be available."
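To make that scaling claim concrete: persona-style items multiply combinatorially from a few locale-specific axes before any model is involved. The sketch below is a toy illustration of that idea, not the cited teams' pipeline; every list and prompt template here is invented for demonstration.

```python
import itertools
import json

# Illustrative template expansion only; the cited work combines manual
# and LLM-generated personas, and its actual pipeline is not shown here.
diseases = ["malaria", "sickle cell disease", "tuberculosis"]
patients = ["an 8-year-old child", "a 34-year-old adult", "a 70-year-old elder"]
settings = ["a rural clinic", "an urban teaching hospital"]

personas = [
    {
        "disease": disease,
        "prompt": (
            f"{patient.capitalize()} presents at {setting} with symptoms "
            f"consistent with {disease}. Describe the likely presentation "
            f"and first-line management."
        ),
    }
    for disease, patient, setting in itertools.product(diseases, patients, settings)
]

print(json.dumps(personas[0], indent=2))
print(f"{len(personas)} prompts from "
      f"{len(diseases)} x {len(patients)} x {len(settings)} combinations")
```

Swap in local disease burdens, demographics, and care settings, and the same cross-product yields a locale-specific evaluation set at low cost.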
Most crucially, the next phase acknowledges how medicine actually works, across languages and modalities. Industry leaders confirmed that “efforts are underway to expand beyond English-only text-based question answering to include non-English languages and multimodal datasets,” since “medicine is inherently multilingual and multimodal.”
This is not just about fixing AI for underrepresented populations. It is about building medical AI that reflects the world as it is, clinic to clinic, language to language. The AfriMed-QA research is a bid to reshape what trustworthy AI in healthcare must look like.