A team at the University of Cape Town has developed a new artificial intelligence language model trained specifically on South Africa’s 11 official written languages — addressing a long-standing gap that has left millions of speakers underserved by mainstream AI tools.
The research will be presented at the Language Resources and Evaluation Conference in Mallorca, Spain, this month. It introduces two interconnected contributions: MzansiText, a curated multilingual dataset covering the 11 official written languages, and MzansiLM, a language model trained on that dataset from scratch. The work was led by Anri Lombard and Jan Buys from UCT’s Department of Computer Science, alongside Francois Meyer and a wider group of collaborators.
The paper arrives at a moment when AI language tools have become part of daily life for millions worldwide — but the experience differs sharply for speakers of most South African languages. Asking a popular AI assistant a question in isiNdebele or Sepedi typically returns poor, inconsistent or incorrect responses. The researchers say the reason comes down to data.
“In language modelling, languages are considered low resource, primarily because there are much fewer and smaller textual datasets available in these languages for training language models,” Buys said. “Our dataset, MzansiText, is still small compared to data available for high-resource languages such as English and major European and Asian languages, but larger than previous datasets for South African languages.”
Nine of South Africa’s 11 official written languages fall into the low-resource category. Languages such as isiZulu and isiXhosa have received some attention from the global research community, but others, including isiNdebele and Sepedi, have been largely overlooked. MzansiLM is believed to be the first publicly available decoder-only language model to explicitly target all 11.
“There has been real progress in language modelling for African languages, including some South African ones like isiXhosa and isiZulu,” Meyer said. “But most existing models only cover a subset of languages. With MzansiLM, we wanted to build a single model focused specifically on South Africa that covers all 11 official written languages, including those that are often left out.”
For Lombard, a master’s student in computer science, the project grew out of a recurring question in his research. “I came into this work through my master’s research, which looks at how different language-model architectures perform for low-resource languages, since that is still a relatively underexplored area,” he said. “One thing that stood out to me is that publicly available models tended to cover only a subset of the South African languages we care about. MzansiLM was meant to provide a small decoder-only baseline that future work can compare against and build on.”
The model is small by the standards of today’s commercial AI systems, with 125 million parameters. But in the team’s tests it performed competitively on specific tasks, outperforming much larger open-source models on benchmarks in several South African languages. On isiXhosa text generation, MzansiLM produced results that competed with encoder-decoder models more than 10 times its size.
The researchers stress that MzansiLM is not a consumer-facing chatbot like ChatGPT or Claude. It is a base model — a foundation that developers and researchers can fine-tune for specific applications. “In practice, that means developers could build tools for specific use cases; for example, summarising information or annotating raw data, in South African languages,” Meyer said. “Adapting MzansiLM for a limited use case might be more effective and affordable than relying on proprietary large language models, if you want users to be able to interact with a system in their home language.”
The team’s findings also offer insight into a broader question: why even powerful commercial AI systems still struggle with non-English languages. “Our findings show that the model can work well when fine-tuned for specific tasks but is not yet able to work well for general-purpose user interaction or instruction following, due to the limited training data,” Buys said. “This helps to explain why even larger language models don’t yet work as well when used in languages other than English.”
The team is clear that MzansiLM is a step rather than a destination. Closing the gap between South African languages and the capabilities now available in English will require sustained collective effort. “A lot of the progress we were able to make depends on earlier open research from the African Natural Language Processing research community, so continuing that openness is essential,” Lombard said. “We still need better and broader data sources, stronger benchmarks, and the kind of shared datasets, models, code, and results that make it possible for others to reproduce and extend the work.”
Meyer echoed the point. “The research community plays an important role here by working openly, sharing datasets, models, and findings so others can build on them. That kind of openness is often what leads to progress, especially compared to proprietary systems where much of the data and methodology isn’t accessible.”
The UCT team has made both MzansiText and MzansiLM publicly available. The paper, “MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages,” is available on arXiv.





