A landmark study by Penda Health and OpenAI found that an artificial intelligence “safety net” called AI Consult reduced diagnostic errors by 16% and treatment errors by 13% across nearly 40,000 patient visits at 15 primary care clinics in Nairobi.
The system, powered by GPT-4o and adapted to Kenyan clinical guidelines, acted as a co-pilot during consultations. It continuously monitored patient interactions and flagged potential errors with color-coded alerts: green for no issues, yellow for warnings, and red for critical safety concerns.
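The article describes only the alert tiers, not how they are surfaced in software. As a rough illustration of the idea, a decision-support layer might collapse a model's findings for a visit into one of those color codes along the following lines; the data structures and triage logic here are hypothetical, not Penda's or OpenAI's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class AlertLevel(Enum):
    GREEN = "no issues detected"
    YELLOW = "warning: review suggested"
    RED = "critical safety concern: action required"


@dataclass
class Finding:
    """One concern the model raises about a consultation (hypothetical schema)."""
    category: str          # e.g. "history-taking", "diagnosis", "treatment"
    description: str
    safety_critical: bool  # True if acting on the error could directly harm the patient


def triage(findings: list[Finding]) -> AlertLevel:
    """Collapse all findings for a single visit into one color-coded alert."""
    if not findings:
        return AlertLevel.GREEN
    if any(f.safety_critical for f in findings):
        return AlertLevel.RED
    return AlertLevel.YELLOW


# Example: a contraindicated prescription escalates the visit to a red alert.
visit_findings = [
    Finding("treatment",
            "Prescribed drug is contraindicated by a documented allergy",
            safety_critical=True),
]
print(triage(visit_findings))  # AlertLevel.RED
```

Whatever the real mechanics, the alert was advisory: acting on it remained the clinician's decision, which is why ignored alerts became the study's central concern.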
Independent physician reviewers confirmed the AI significantly reduced errors in history-taking, investigations, diagnosis, and treatment. Yet two patient deaths occurred during the study, both deemed potentially preventable if AI alerts had been followed. In one case, multiple red alerts were ignored.
Researchers stressed that the most difficult challenges were not technological but human. Penda managers had to coach clinicians, track ignored alerts, and invest heavily in change management. Initially, over 35% of critical safety warnings went unheeded. Clinicians using AI also spent more time per patient — a median of 16.43 minutes compared with 13.01 minutes in the control group — but still made fewer errors, suggesting a “quality-time tradeoff.”
The study also underscored the dangers of bias in health data. Instead of training the tool on historical patient records, which risk codifying systemic inequities, Penda grounded it in evidence-based clinical guidelines. This approach offers a template for more equitable AI design.
Despite reduced errors, the study found no statistically significant difference in patient-reported outcomes between groups — a reminder that fewer process errors do not always translate into healthier patients.
Researchers concluded that safe AI in health care requires three pillars: a capable model, clinically aligned implementation, and active deployment. They called for regulators such as the U.S. Food and Drug Administration to require an “implementation and ethics playbook” that includes equity assessments, workflow integration plans, and post-market monitoring.
The findings have implications far beyond Kenya. As U.S. health systems such as Mayo Clinic and Geisinger invest heavily in AI, the Nairobi study serves as a cautionary tale: the hardest part of AI in medicine is not building the algorithm but ensuring it is used correctly in real-world practice.
Javaid Iqbal Sofi is a doctoral researcher at Virginia Tech specializing in artificial intelligence and health care.