Meta has introduced an artificial intelligence (AI) language model that is not another ChatGPT-style chatbot. Its Massively Multilingual Speech (MMS) project can identify over 4,000 spoken languages and supports speech-to-text and text-to-speech in more than 1,100 languages. Meta is open-sourcing MMS and inviting researchers to build on its framework and contribute to preserving language diversity. “Through this work, we hope to make a small contribution to preserve the incredible language diversity of the world,” Meta said.
Training speech recognition and text-to-speech models typically requires thousands of hours of audio with corresponding transcription labels, which machine learning algorithms need in order to categorize and “understand” data. However, for languages that aren’t widely used in industrialized nations, many of which are at risk of extinction, such data do not exist.
Meta took an unusual approach to sourcing audio data: it used recordings of translated religious texts. These translations, widely studied in text-based language translation research, have publicly accessible audio recordings in many languages. By incorporating these unlabeled recordings, Meta’s researchers expanded the model’s language range to over 4,000.
Despite the religious nature of the audio content, Meta says the model does not exhibit a bias towards religious language. It also says the model shows no male bias, even though most of the religious recordings were read by male speakers. Meta attributes this to its use of a connectionist temporal classification (CTC) approach.
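To give a sense of what a CTC approach involves, here is a minimal sketch of greedy CTC decoding in Python. It is not Meta's implementation. The function name, the blank symbol, the tiny vocabulary, and the per-frame scores are all hypothetical and chosen for illustration. The idea is that the model scores every symbol for each short audio frame, and the decoder keeps the best symbol per frame, merges consecutive repeats, and drops a special "blank" symbol, so the output stays tied to what is in the audio.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems reserve a vocabulary index for it


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding: best symbol per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring symbol for every audio frame.
    best_per_frame = [
        vocab[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # Collapse runs of the same symbol into a single symbol.
    collapsed = [symbol for symbol, _ in groupby(best_per_frame)]
    # Remove blanks, which only mark "no new character here".
    return "".join(symbol for symbol in collapsed if symbol != blank)


# Illustrative scores for six frames over a four-symbol vocabulary.
vocab = [BLANK, "a", "c", "t"]
frames = [
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.10, 0.10, 0.70, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.10, 0.10],  # a
    [0.10, 0.70, 0.10, 0.10],  # a (repeat, merged)
    [0.10, 0.10, 0.10, 0.70],  # t
]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)  # cat
```

Because a blank between two identical symbols keeps them separate, the same scheme can still produce doubled letters such as the "ll" in "hello".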
After aligning the data, Meta used wav2vec 2.0, its self-supervised speech representation learning model, which can train on unlabeled data. Combining the novel data sources with a self-supervised speech model produced strong results: according to Meta, the models outperform existing ones and cover ten times as many languages.
Meta acknowledges that the models aren’t perfect, admitting a risk of mistranscribing select words or phrases, which could potentially result in offensive or inaccurate language. They emphasize the necessity of cross-community collaboration for responsible AI development.
In open-sourcing the MMS project, Meta envisions an inclusive technological landscape that preserves language diversity.
They imagine a world where assistive technology, text-to-speech, and even VR/AR tech enable everyone to communicate and learn in their native languages.