How Indigenous Technologists Are Rewiring AI to Save Endangered Languages
Faced with a digital landscape that favors English, Indigenous communities are building custom AI models and wearable robots to revitalize their native languages on their own terms.
By Factlen Editorial Team
- Indigenous Technologists: Argue that language AI must be built on a foundation of data sovereignty, ensuring communities control their cultural knowledge rather than surrendering it to tech giants.
- Academic AI Researchers: Focus on developing new model architectures and transfer-learning techniques that can accurately process languages with very small amounts of training data.
- Digital Rights Advocates: Highlight the structural inequalities of the global language data gap, warning that AI will exacerbate digital exclusion if it remains English-centric.
What's not represented
- Elders and traditional knowledge keepers who may be skeptical of digitizing sacred or nuanced oral histories.
- Public school administrators tasked with integrating these experimental AI tools into formal language curricula.
Why this matters
Roughly 40 percent of the world's 7,000 languages are at risk of disappearing, taking centuries of cultural and scientific knowledge with them. By forcing AI to work for low-resource languages, technologists are proving that the digital age does not have to be an extinction event for global diversity.
Key points
- Mainstream AI models struggle with non-dominant languages due to a lack of digitized training data, known as the global language data gap.
- Indigenous technologists are building custom AI tools, like the Anishinaabemowin-speaking 'Skobot', to teach endangered languages to youth.
- New Zealand's Te Hiku Media crowdsourced 300 hours of speech to build a highly accurate Māori speech recognition model.
- Communities are using 'Kaitiakitanga' licenses to maintain data sovereignty, preventing tech giants from extracting their cultural knowledge.
- Researchers are developing 'transfer learning' techniques to train AI on low-resource languages using significantly less data.
For decades, the internet has operated as a linguistic homogenizer. With more than half of all websites written in English, the digital age has inadvertently accelerated the decline of minority languages. According to the United Nations, roughly 40 percent of the world's 7,000 languages are currently at risk of extinction, with one Indigenous language lost every two weeks. But a new generation of Indigenous technologists is flipping the script, transforming artificial intelligence from a tool of cultural erasure into an engine for language revitalization.[1][2][3][5]
One of the most visible examples of this shift sits on the shoulders of children. Danielle Boyer, a 24-year-old Anishinaabe roboticist, recently designed the "Skobot," a small, brightly colored wearable robot shaped somewhat like a parrot. Equipped with an internally developed AI model, the motion-activated toy converses fluently in Anishinaabemowin, the endangered language of the Anishinaabe nation in North America. When a child asks the robot how to say a specific word, the AI interprets the audio and responds in real time, simulating the kind of natural, immersive conversation that is often missing from modern digital environments.[2]
Innovations like the Skobot are necessary because the broader AI industry has largely left the global majority behind. A 2025 paper from the Stanford Institute for Human-Centered Artificial Intelligence highlighted a structural disparity known as the "global language data gap." Mainstream large language models (LLMs) developed by major tech firms rely heavily on publicly available text scraped from the internet. Because the web is overwhelmingly English-centric, these models perform exceptionally well in dominant languages but far less reliably in most others.[1][3][4]

In the field of artificial intelligence, languages outside this dominant cluster are termed "low-resource." This designation has nothing to do with the number of native speakers a language has; even widely spoken languages like Urdu fall into this category. Instead, "low-resource" refers strictly to a scarcity of machine-readable, digitized, and annotated text available to train algorithms. Without this massive corpus of data, standard AI transcription and translation tools struggle with cultural nuance, introduce inherent biases, and frequently hallucinate incorrect grammar.[3][4][6]
To bridge this gap, Indigenous communities are taking data collection into their own hands, refusing to wait for Silicon Valley to notice them. In New Zealand, Te Hiku Media, an iwi-led (tribal) broadcasting organization, recognized that te reo Māori needed a robust digital presence to survive. Rather than relying on existing tech giants, they launched "Kōrero Māori," a massive crowdsourcing initiative designed to build a custom automatic speech recognition (ASR) model from scratch.[1][5]
The response from the Māori community was unprecedented. In just ten days, over 2,500 individuals signed up to read more than 200,000 phrases, generating over 300 hours of highly accurate, labeled speech data. Using the open-source NVIDIA NeMo toolkit and advanced tensor core GPUs, Te Hiku Media trained a speech-to-text model that now transcribes te reo Māori with 92 percent accuracy. It can even seamlessly transcribe bilingual speech, switching between English and te reo with an 82 percent accuracy rate.[5]
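The article does not say how the 92 percent and 82 percent figures were measured. One common convention for speech recognition is word accuracy, defined as one minus the word error rate (WER), where WER is the word-level edit distance between a human reference transcript and the model's output, divided by the number of reference words. The sketch below illustrates that convention only; the function names and the sample phrase are illustrative assumptions, not Te Hiku Media's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """1 - WER, floored at zero because WER can exceed 1.0."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# Illustrative pair: the hypothesis drops one of four reference words.
print(word_accuracy("kia ora e hoa", "kia ora hoa"))
```

Under this convention, a 92 percent figure would correspond to roughly eight word-level errors per 100 reference words, though the published number may rest on a different metric.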
The success of Te Hiku Media is not just a technical triumph; it is a blueprint for Indigenous data sovereignty. Historically, marginalized communities have seen their cultural artifacts and knowledge extracted and monetized by outside entities. To prevent this, Te Hiku Media collected its data under a strict "Kaitiakitanga" (guardianship) license. This legal and cultural framework ensures that the data, and the AI models built from it, remain under Māori control and are used exclusively for the benefit of the Māori people.[1][5][6]

This insistence on sovereignty is reshaping how AI research is conducted globally. At the prestigious NeurIPS AI conference, recent workshops have centered entirely on building LLM architectures tailored to low-resource linguistic features through ethical, community-centered dataset collection. Researchers are moving away from brute-force data scraping and toward "transfer learning"—a technique where an AI model applies the underlying structural knowledge it learned from a high-resource language to a low-resource one, drastically reducing the amount of native data required.[6]
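To make the idea of transfer learning concrete, here is a deliberately tiny sketch in plain Python. It is a toy under stated assumptions, not how any model in this story was built: the "languages" are synthetic two-number inputs, and every name in it (`encoder`, `target_head`, `make_data`) is hypothetical. A shared feature layer is first trained on a plentiful "high-resource" task, then frozen, and only a small new output layer is fitted on ten "low-resource" examples.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(encoder, x):
    """Shared feature layer: one tanh unit per weight row."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in encoder]


def predict(encoder, head, x):
    h = encode(encoder, x)
    return sigmoid(sum(w * hi for w, hi in zip(head, h)))


def train(encoder, head, data, epochs=20, lr=0.1, freeze_encoder=False):
    """Plain SGD on log loss; updates the weight lists in place."""
    for _ in range(epochs):
        for x, y in data:
            h = encode(encoder, x)
            err = sigmoid(sum(w * hi for w, hi in zip(head, h))) - y
            if not freeze_encoder:
                for j, row in enumerate(encoder):
                    grad_h = err * head[j] * (1.0 - h[j] ** 2)
                    for i in range(len(row)):
                        row[i] -= lr * grad_h * x[i]
            for j in range(len(head)):
                head[j] -= lr * err * h[j]


def make_data(rng, n, threshold):
    """Toy stand-in for a language task: label is 1 when x0 + x1 > threshold."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 1.0 if x[0] + x[1] > threshold else 0.0))
    return data


def accuracy(encoder, head, data):
    hits = sum((predict(encoder, head, x) > 0.5) == (y == 1.0) for x, y in data)
    return hits / len(data)


rng = random.Random(0)
source_data = make_data(rng, 500, threshold=0.0)  # plentiful "high-resource" task
target_data = make_data(rng, 10, threshold=0.3)   # scarce "low-resource" task

# 1. Pretrain the encoder and a source head on the plentiful data.
encoder = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
source_head = [rng.uniform(-0.5, 0.5) for _ in range(4)]
train(encoder, source_head, source_data)

# 2. Transfer: freeze the encoder and fit only a fresh head on the target data.
pretrained_snapshot = [row[:] for row in encoder]
target_head = [0.0] * 4
train(encoder, target_head, target_data, epochs=50, freeze_encoder=True)

print("target accuracy:", accuracy(encoder, target_head, target_data))
```

The design choice that matters is `freeze_encoder=True`: the ten target examples only have to fit four numbers in the new head, because the structure learned from the larger task is reused rather than relearned.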
Institutions like Mila, the Quebec AI Institute, are pushing this further through their First Languages AI Reality (FLAIR) initiative. Developing an ASR model for a new language typically requires hundreds of hours of pristine audio. FLAIR is pioneering foundational research to create custom voice models for endangered languages using a fraction of that data. These lightweight models can then power voice-controlled technology, audio transcription, and immersive virtual reality experiences for Indigenous youth.[8]
The shift from static preservation to dynamic interaction is critical for intergenerational transfer. Younger speakers engage primarily through smartphones and interactive media, making traditional, static digital archives less effective. To address this, New Zealand-based software company Kiwa Digital partnered with Amazon Web Services to launch CultureQ, a generative AI platform. By embedding conversational AI into cultural archives, users can ask questions and hear the language spoken aloud, turning historical texts into living dialogues.[7]

Despite these breakthroughs, significant technical and ethical uncertainties remain. AI models, by their nature, recognize patterns and calculate probabilities; they do not "understand" culture. Linguists warn that general-purpose AI can inadvertently simplify or misrepresent Indigenous knowledge, stripping away the deep contextual tones and morphologies that give these languages their meaning. There is a persistent fear that synthetic data generation—using AI to create artificial training text—could slowly dilute the authenticity of the language over time.[3][4][6]
To mitigate these risks, the consensus among researchers and community leaders is that AI must remain a "human-in-the-loop" technology. In newsrooms and classrooms experimenting with these tools, "hybrid translation"—where AI outputs are rigorously reviewed by native speakers before publication—is becoming the gold standard. The goal is not to replace human teachers or elders, but to give them infinitely scalable tools to amplify their reach.[3][7]
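In software terms, a "human-in-the-loop" rule of this kind amounts to a review gate: machine output is held as a draft and nothing is released until a named reviewer has approved or corrected it. The sketch below is one hypothetical way to model that gate; the class and function names, and the placeholder strings, are assumptions for illustration and do not describe any newsroom's or community's actual system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Draft:
    source_text: str
    machine_output: str
    status: Status = Status.PENDING
    final_text: Optional[str] = None
    reviewer: Optional[str] = None


def review(draft: Draft, reviewer: str, approve: bool,
           corrected_text: Optional[str] = None) -> Draft:
    """Record a native speaker's decision; a correction overrides the AI text."""
    draft.reviewer = reviewer
    if approve:
        draft.status = Status.APPROVED
        draft.final_text = corrected_text or draft.machine_output
    else:
        draft.status = Status.REJECTED
        draft.final_text = None
    return draft


def publishable(drafts: List[Draft]) -> List[str]:
    """Only reviewed and approved text ever leaves the queue."""
    return [d.final_text for d in drafts if d.status is Status.APPROVED]


# Placeholder strings stand in for real source text and translations.
queue = [
    Draft("source A", "machine translation A"),
    Draft("source B", "machine translation B"),
    Draft("source C", "machine translation C"),
]
review(queue[0], reviewer="speaker-1", approve=True,
       corrected_text="corrected translation A")
review(queue[1], reviewer="speaker-1", approve=False)
# queue[2] is never reviewed, so it stays pending and is not published.

print(publishable(queue))
```

The default status is `PENDING`, so an unreviewed draft can never be published by omission; release requires an explicit human decision.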

The implications of this work extend far beyond linguistics. Studies have shown that a strong connection to linguistic heritage correlates with tangible public health benefits in Indigenous communities, including lower rates of teen suicide, diabetes, and excessive alcohol consumption. Language is the vessel for identity, and preserving it has a profound stabilizing effect on a community's social fabric.[2][6]
By forcing cutting-edge technology to adapt to their needs, Indigenous technologists are proving that the future of AI does not have to be a monolith. From wearable robots in Michigan to sovereign data centers in New Zealand, these initiatives demonstrate that with the right ethical frameworks and community leadership, artificial intelligence can be harnessed to protect the very diversity it once threatened to erase.[1][2][5]
How we got here
2013
Te Hiku Media convenes with community elders to form a strategy for sharing Māori content in the digital era.
2024
Te Hiku Media's CEO is recognized on the TIME100 AI list for pioneering Indigenous data sovereignty in machine learning.
2025
Stanford HAI publishes a paper detailing how mainstream large language models fail users in the global majority.
2025
The NeurIPS conference hosts dedicated workshops on centering low-resource languages in the age of LLMs.
Viewpoints in depth
Indigenous Technologists' View
Emphasizes that language revitalization must be paired with strict data sovereignty to protect cultural heritage.
For Indigenous developers, the rush to build multilingual AI models by massive tech corporations represents a new form of digital colonialism. They argue that simply scraping the internet for native languages extracts cultural knowledge without compensating or empowering the communities it belongs to. By building their own models under frameworks like the Kaitiakitanga license, these technologists ensure that the tools serve the community first. They view AI not just as a translation engine, but as a sovereign digital asset that can foster intergenerational connection on their own terms.
Academic AI Researchers' View
Focuses on the technical challenge of rewiring AI architectures to learn efficiently from scarce data.
Computer scientists and linguists at institutions like Mila and Stanford are tackling the 'global language data gap' from an architectural standpoint. Standard LLMs are data-hungry: models with billions of parameters are typically trained on enormous volumes of text. Because low-resource languages will never have the same volume of digitized text as English, researchers are turning to techniques like transfer learning and synthetic data generation. Their goal is to create lightweight, highly adaptable models that can grasp complex morphologies and tonal nuances without needing massive, brute-force datasets, thereby democratizing access to AI technology.
What we don't know
- Whether synthetic data generation will eventually dilute the authentic nuances and idioms of endangered languages.
- How quickly these custom, community-led AI models can scale to cover the thousands of other low-resource languages currently at risk.
Key terms
- Low-Resource Language: A language that has very little digitized text or annotated data available online, making it difficult to train standard artificial intelligence models.
- Automatic Speech Recognition (ASR): Technology that allows a computer to identify and process human voice inputs, converting spoken language into written text.
- Data Sovereignty: The concept that a community or nation has the right to control the collection, ownership, and application of its own data.
- Kaitiakitanga: A Māori concept of guardianship and protection, used in this context as a licensing framework to protect Indigenous data from commercial exploitation.
- Transfer Learning: A machine learning technique where an AI model applies knowledge gained from a data-rich task (like English translation) to help it learn a data-poor task (like translating an endangered language).
Frequently asked
What makes a language 'low-resource' in AI?
A low-resource language is one that lacks a large volume of digitized, machine-readable text and audio data on the internet, which is necessary to train standard AI models. It does not necessarily mean the language has few human speakers.
How did Te Hiku Media build its Māori AI model?
Te Hiku Media launched a crowdsourcing campaign in which over 2,500 people signed up in just ten days and recorded more than 300 hours of labeled speech. They used this data to train a custom speech recognition model that transcribes te reo Māori with 92 percent accuracy.
What is data sovereignty?
Data sovereignty is the principle that data is subject to the laws and governance structures of the nation or community it comes from. For Indigenous groups, it ensures their cultural knowledge and language data cannot be exploited or monetized by outside tech companies.
Can AI perfectly translate cultural nuances?
Not currently. AI models recognize statistical patterns rather than truly understanding culture, meaning they can sometimes simplify or misrepresent deep linguistic nuances. Researchers recommend keeping human experts in the loop to review AI outputs.
Sources
[1] Factlen Editorial Team. Synthesis by Factlen editorial team.
[2] Smithsonian Magazine (Indigenous Technologists). How a 24-Year-Old Roboticist is Preserving Indigenous Languages.
[3] Nieman Journalism Lab (Academic AI Researchers). Studies on AI transcription and translation in journalism reveal “low-resource” language gap.
[4] Global Voices (Digital Rights Advocates). Lost in translation: How AI models impact low-resource language communities.
[5] NVIDIA (Indigenous Technologists). Māori Speech AI Model Helps Preserve and Promote New Zealand Indigenous Language.
[6] NeurIPS (Academic AI Researchers). Centering Low-Resource Languages and Cultures in the Age of Large Language Models.
[7] Amazon Web Services (Indigenous Technologists). A GenAI Approach to Revitalizing Indigenous Language for the Digital Age.
[8] Mila (Academic AI Researchers). First Languages AI Reality (FLAIR).










