A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning - Laboratoire Informatique de l'Université du Maine
Conference paper, Year: 2020

A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

Sameer Khurana
  • Role: Author
  • PersonId: 1075440
Wei-Ning Hsu
  • Role: Author
Ricard Marxer
James Glass
  • Role: Author

Abstract

Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation in which the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose the Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black-box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large-scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extraction methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we find that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labelled training examples.
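
To make the phrase "Gaussian state-space model with non-linear emission and transition functions" concrete, here is a minimal sketch of ancestral sampling from such a model. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`mlp`, `init_mlp`, `sample_dmm`), the small tanh networks, the layer sizes, and the fixed isotropic noise scales are all hypothetical, and the convolutional inference network and variational training described in the abstract are not shown.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with tanh hidden units (stand-in for a deep network)."""
    return np.tanh(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, d_in, d_hidden, d_out):
    """Random weights for the toy MLP; sizes are illustrative assumptions."""
    return (rng.normal(0.0, 0.5, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0.0, 0.5, (d_hidden, d_out)), np.zeros(d_out))

def sample_dmm(T, d_z=4, d_x=8, d_hidden=16, seed=0):
    """Sample z_t ~ N(f(z_{t-1}), s_z^2 I) and x_t ~ N(g(z_t), s_x^2 I) for t = 1..T."""
    rng = np.random.default_rng(seed)
    trans = init_mlp(rng, d_z, d_hidden, d_z)   # transition function f
    emit = init_mlp(rng, d_z, d_hidden, d_x)    # emission function g
    sigma_z, sigma_x = 0.1, 0.1                 # assumed fixed noise scales
    z = np.zeros((T, d_z))
    x = np.zeros((T, d_x))
    z_prev = np.zeros(d_z)                      # assumed initial state
    for t in range(T):
        # Markov transition: the next latent state depends only on the previous one.
        z[t] = mlp(z_prev, *trans) + sigma_z * rng.normal(size=d_z)
        # Emission: the observed frame depends only on the current latent state.
        x[t] = mlp(z[t], *emit) + sigma_x * rng.normal(size=d_x)
        z_prev = z[t]
    return z, x
```

Calling `z, x = sample_dmm(T=100)` returns a latent trajectory of shape `(100, 4)` and an observation sequence of shape `(100, 8)`. In a representation-learning setting the direction is reversed: an inference network maps observed speech frames to a distribution over the latent states, and those latent states serve as the features.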
Main file
convDMM_arxiv.pdf (1023.33 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02912029, version 1 (05-08-2020)

Identifiers

  • HAL Id: hal-02912029, version 1

Cite

Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Łańcucki, et al. A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning. Interspeech 2020, Oct 2020, Shanghai, China. ⟨hal-02912029⟩
215 views
67 downloads
