Contributed by The African Languages Lab
African languages make up nearly one-third of all languages worldwide. Yet, of the more than 2,000 languages spoken across the continent, only 49 are available on translation platforms like Google Translate. Even worse, a stunning 88% of African languages are “severely underrepresented” or “completely ignored” in computational linguistics (Joshi et al., 2020).
Artificial Intelligence (AI) offers a chance to protect underrepresented languages, but guidance and safeguards are critical. Without them, large language models (LLMs) risk reinforcing institutional languages and accelerating the decline of others. The consequences are dire: 40% of languages globally are at risk of extinction, hundreds of which are spoken in Africa (UNESCO, 2022).
The African Languages Lab (All Lab) is a youth-led collaborative committed to preserving African languages by documenting, digitizing, translating, and empowering them through advanced AI and natural language processing (NLP) systems. Together with partners like Smartling, we’re making substantial strides in addressing the digital divide for African languages. Here’s how.
Linguistic diversity is one of the greatest assets of the African continent, but it also presents monumental challenges. Many communities, especially smaller ones, speak unique languages that aren’t well documented. These “low-resource” languages lack the datasets needed for computational use, making machine translation (MT), speech processing, automated transcription, and other NLP applications difficult, if not impossible.
The challenge is pervasive: fewer than 5% of African languages have significant digital resources (Association for Computational Linguistics, 2019). It’s clear that we need to better document these languages, but doing so is no small task.
At the African Languages Lab, we’re using AI and NLP systems to digitize, translate, and preserve African languages to create positive outcomes for people across the continent. Our four-pillar approach currently supports 40 languages, from widely spoken Bantu languages to lesser-known Khoisan languages, representing diverse cultures, regions, and linguistic families.
Countries that integrate local languages in education and digital content tend to have higher literacy rates and stronger cultural retention.
Executing our four pillars requires the right technology and collaborative partners. As such, we’ve formed a strategic partnership with Smartling, a leader in translation and localization technology. This partnership enables us to leverage Smartling’s cutting-edge tools for language translation, management, and contextual accuracy, transforming the way low-resource languages are documented and shared digitally.
Here’s how technology is driving our progress in African language digitization and translation.
For many African languages, centralized language data is lacking. We collect and standardize data from various sources, leveraging Python scripts to clean, standardize, and convert the data into a common format with the goal of creating a centralized corpus for broad use. Consolidating and refining language data ensures consistency and accessibility—ultimately empowering communities to create educational resources, translation tools, and digital content.
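To give a sense of what this kind of cleanup can look like, here is a minimal sketch in Python. It is an illustration under our own assumptions, not the Lab's production pipeline: the function names, the record fields (`lang`, `source`, `text`), and the choice of JSON Lines as the common format are all hypothetical. It applies Unicode NFC normalization (which matters for languages written with tone marks and other diacritics), collapses stray whitespace, drops empty and duplicate lines, and writes every sentence into one shared schema.

```python
import json
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def standardize_records(raw_records, language_code, source):
    """Convert raw sentences into de-duplicated records with a shared schema.

    The schema (lang, source, text) is a hypothetical example format.
    """
    seen = set()
    records = []
    for raw in raw_records:
        text = normalize_text(raw)
        if not text or text in seen:
            continue  # skip empty lines and exact duplicates
        seen.add(text)
        records.append({"lang": language_code, "source": source, "text": text})
    return records


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Illustrative input: the same Swahili greeting with inconsistent spacing,
# an empty line, and a second sentence.
raw = ["  Habari   ya asubuhi ", "Habari ya asubuhi", "", "Asante sana"]
corpus = standardize_records(raw, language_code="swa", source="example")
print(to_jsonl(corpus))
```

Keeping one record per line makes it straightforward to merge files from different sources into a single corpus, since each line can be validated and de-duplicated on its own.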
The African Languages Lab has gathered over 400GB of speech and text data for 40 African low-resource languages, advancing their documentation and digital availability.
As mentioned previously, incomplete data is a critical gap for language preservation that can be difficult to fill in some African communities. Our innovative data collection app, All Voices, allows institutions, communities, and native speakers to document and digitize their local language. Contributors can record speech for 40 African languages, supporting our collective need to capture data for low-resource languages.
In the future, All Voices will bridge communication gaps in communities and make local languages accessible to all. It will also translate between African languages and popular languages like English and French. With seamless and accurate translation across a wide variety of languages, All Voices aims to foster deeper cultural exchange, while also contributing to a growing dataset of low-resource language data.
Linguistic data aggregation and organization, along with community availability, are critical to our work at the All Lab. Smartling plays a vital role in our entire data management process, from collection to storage to translation. With Smartling, we can upload, organize, and store data from multiple projects in a secure, centralized system.
Smartling’s API enables us to not only share our data broadly across multiple platforms, but also make updates in real time—ensuring that every member of our community has access to the most accurate and complete digital corpus.
We’ve relied on Smartling’s translation memory, AI-powered translations, and skilled translators to support consistent and accurate content across different African languages. Our resulting structured and accessible language repository is essential for expanding digital accessibility and preservation efforts across Africa’s linguistic diversity.
Our work at the All Lab—supported by the above technologies—generates structured African linguistic datasets, which play a critical role in digitizing low-resource languages. These datasets are instrumental in developing new machine translation, speech recognition, and language preservation tools. Ultimately, our data helps advance African linguistic research and supports the development of more accurate and culturally relevant language models.
We also make our datasets available through open-access platforms like Hugging Face. Our work fosters community-based AI development and encourages greater investment in African language technologies.
At the African Languages Lab, we’ve made substantial progress in addressing the digital divide for African languages through data collection, aggregation, standardization, crowdsourcing, and model development and deployment. We’re proud of our growing corpus of linguistic data (about half a terabyte in size), our advanced translation tools, and our successful expansion of access to language resources.
To date, we’ve collected over 400GB of speech and text datasets for 40 African low-resource languages, supporting their documentation and technological advancement. Through partnerships with academic institutions like the UCLA MARS Lab and industry leaders such as Smartling, we’re harnessing cutting-edge research and technology to drive our mission forward. We’re also actively raising awareness about the African language landscape through seminars, conferences, and technical papers.
As we look to the future, we'll work to preserve more low-resource African languages beyond our current 40. We also aim to broaden the availability of our datasets and tools, and we are committed to driving further innovation in machine translation, language preservation, and AI-driven linguistic research across Africa. Together, we will ensure that Africa's linguistic heritage not only survives but thrives in the digital age.