We describe Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without utilizing any intermediate discrete representation. The network is composed of an encoder, spectrogram and phoneme decoders, followed by a vocoder to synthesize a time-domain waveform. We demonstrate that this model can be trained to normalize speech from any speaker regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from a deaf speaker, resulting in significant improvements in intelligibility and naturalness, measured via a speech recognizer and listening tests. Finally, demonstrating the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task.

Index Terms: speech normalization, voice conversion, atypical speech, speech synthesis, sequence-to-sequence model.