

Facebook AI Research (FAIR) open-sourced Cross-Lingual Speech Recognition (XSLR), a multilingual speech recognition AI model. XSLR is trained on 53 languages and outperforms existing systems when evaluated on common benchmarks.

The model architecture and related experiments were described in a paper published on arXiv. XSLR is built on the wav2vec architecture and uses transfer learning to improve performance on "low-resource" languages. The system is pre-trained on three public datasets containing 53 languages, and when evaluated on the CommonVoice and BABEL benchmarks it outperforms existing baselines. The system can also learn languages not seen during pre-training, outperforming monolingual models specifically trained on those languages. According to the researchers:

"Our goal...is to enable few-shot learning for languages that are actually low-resource, leveraging unsupervised data from higher-resource languages."
Training a deep-learning model requires a large dataset of labeled examples; for speech recognition, this would mean audio data with corresponding text transcripts. Acquiring such a dataset can be challenging for non-European languages, often termed low-resource languages because of the lack of readily available data. In this situation, researchers turn to transfer learning: fine-tuning models that have been pre-trained on a large publicly available dataset. This strategy has been applied by Facebook and others for neural-machine translation, using popular natural-language Transformer models such as BERT.

FAIR published the original wav2vec deep-learning model for automated speech recognition (ASR) in 2019 and the updated wav2vec 2.0 model in 2020. The model uses a convolutional neural-network (CNN) feature encoder to convert audio into latent speech representations, which are quantized and then fed into a Transformer; the Transformer converts sequences of speech representations into text. For the pre-training phase, a certain percentage of the latent representations are masked, and the network learns to predict the masked values; this is analogous to the masked language model training used in BERT. XSLR uses the same architecture as wav2vec 2.0.
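To make the masked pre-training idea concrete, the following is a toy sketch, not FAIR's actual implementation: a small CNN encoder produces latent frames from raw audio, a random subset of frames is replaced by a learned mask embedding, and a Transformer is trained to reconstruct those frames. The real wav2vec 2.0 quantizes the latents and trains with a contrastive objective rather than the simple regression loss used here, and every size below is arbitrary.

```python
# Toy PyTorch sketch of masked pre-training for speech: CNN feature encoder ->
# latent frames -> mask a fraction of frames -> Transformer predicts them.
# Simplified for illustration only; not the wav2vec 2.0 / XSLR training code.
import torch
import torch.nn as nn

class ToyMaskedSpeechModel(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        # CNN feature encoder: raw waveform -> latent speech representations
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )
        self.mask_embedding = nn.Parameter(torch.randn(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.project = nn.Linear(dim, dim)

    def forward(self, waveform, mask_prob=0.15):
        # waveform: (batch, samples)
        latents = self.feature_encoder(waveform.unsqueeze(1)).transpose(1, 2)  # (B, T, dim)
        targets = latents.detach()
        # Randomly choose which latent frames to hide from the Transformer.
        mask = torch.rand(latents.shape[:2], device=latents.device) < mask_prob
        masked_input = torch.where(mask.unsqueeze(-1), self.mask_embedding, latents)
        predictions = self.project(self.transformer(masked_input))
        # Loss only on the masked positions, analogous to BERT's masked-LM objective.
        return ((predictions - targets) ** 2)[mask].mean()

model = ToyMaskedSpeechModel()
loss = model(torch.randn(2, 16000))  # two fake one-second clips at 16 kHz
loss.backward()
print(float(loss))
```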

XSLR is pre-trained using multilingual batches of audio data drawn from three datasets: CommonVoice, a corpus of read speech; BABEL, a corpus of telephone conversations; and Multilingual LibriSpeech (MLS), a corpus of audiobooks. The full dataset contains over 56k hours of speech in 53 languages.
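Because these corpora are very unevenly sized across languages, the researchers adjust how often each language is sampled when forming multilingual batches. The sketch below illustrates one common way such language-balanced sampling can work, where an exponent below 1 up-weights low-resource languages; the hours-per-language figures and the exponent are invented for the example and are not the actual corpus statistics.

```python
# Hedged sketch of language-balanced batch sampling for multilingual
# pre-training. The hours-per-language numbers are hypothetical; alpha < 1
# upsamples low-resource languages relative to their share of the data.
import random

hours = {"en": 5000, "es": 800, "it": 200, "sw": 20}  # hypothetical corpus sizes
alpha = 0.5

weights = {lang: h ** alpha for lang, h in hours.items()}
total = sum(weights.values())
probs = {lang: w / total for lang, w in weights.items()}

def sample_language_batch(batch_size=8):
    """Pick a language for each utterance slot in a batch according to probs."""
    langs = list(probs)
    return random.choices(langs, weights=[probs[l] for l in langs], k=batch_size)

print(probs)                    # sampling distribution over languages
print(sample_language_batch())  # e.g. ['en', 'it', 'es', 'en', 'sw', ...]
```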

The team trained several models of varying size; the largest contained 24 Transformer blocks of dimension 1,024 with 16 attention heads. The fine-tuned model is evaluated against held-out datasets from CommonVoice and BABEL. On low-resource languages, even those used only in fine-tuning but not in pre-training, the large XSLR model outperforms baseline models. Low-resource languages especially benefit from pre-training on related languages; for example, performance on Italian improves when additional Spanish-language data is included in pre-training. The researchers also noted that XSLR performs worse than baselines on high-resource languages due to interference, or sharing of model capacity across languages. In response to a Twitter question about fine-tuning the model, lead author Alexis Conneau noted that this interference can be mitigated by increasing the model capacity and adjusting the sampling of languages during pre-training.
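For readers who want to try the model, the snippet below is a minimal transcription sketch using the Hugging Face transformers port of wav2vec 2.0. It assumes a checkpoint that has already been fine-tuned with a CTC head for a particular language; the model identifier and audio file name are illustrative placeholders and should be replaced with an actual fine-tuned checkpoint from the model hub.

```python
# Minimal inference sketch with Hugging Face transformers, assuming a
# CTC-fine-tuned wav2vec 2.0 / XSLR checkpoint. MODEL_ID is an assumed
# example name, not a verified official checkpoint.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "facebook/wav2vec2-large-xlsr-53-spanish"  # assumed example checkpoint

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load an audio clip and resample to the 16 kHz rate the model expects.
waveform, sample_rate = torchaudio.load("example.wav")  # placeholder file
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token at each time step.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```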
