FELLE

Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

Abstract. To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality.

Contents

Model Overview
Zero-Shot Text-to-Speech for Cross-Sentence Task
Zero-Shot Text-to-Speech for Continuation Task
FELLE with Different Parameter Configurations
Ethics Statement

Model Overview

Figure. Overview of FELLE, an autoregressive mel-spectrogram model that generates personalized speech from text and acoustic prompts. At each timestep, the framework uses the distribution of the previous mel-spectrogram frame as a prior and, conditioned on the output of the language model, applies a coarse-to-fine flow-matching module to produce refined spectral features.
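The idea in the caption, a prior built from the previous frame followed by a flow-matching step per token, can be illustrated with a toy sketch. This is not the FELLE implementation: the Gaussian form of the prior, the `sigma` value, and the `toy_velocity` function (a stand-in for the learned velocity network, which here simply flows toward the conditioning vector) are all assumptions made for illustration, and the coarse-to-fine hierarchy is omitted.

```python
import random

def sample_prior(prev_frame, sigma, rng):
    # Assumed prior: a Gaussian centered on the previous frame
    # (instead of the standard N(0, I) prior of plain flow matching).
    return [m + sigma * rng.gauss(0.0, 1.0) for m in prev_frame]

def toy_velocity(x, t, cond):
    # Hypothetical stand-in for the learned velocity network.
    # For a straight-line path ending at `cond`, the velocity is (cond - x) / (1 - t).
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]

def generate_frame(prev_frame, cond, nfe, sigma, rng):
    # Start from the previous-frame prior and integrate with Euler steps.
    x = sample_prior(prev_frame, sigma, rng)
    dt = 1.0 / nfe
    for k in range(nfe):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def generate(conds, first_frame, nfe=4, sigma=0.5, seed=0):
    # Autoregressive loop: each generated frame seeds the prior of the next one.
    rng = random.Random(seed)
    frames, prev = [], first_frame
    for cond in conds:
        prev = generate_frame(prev, cond, nfe, sigma, rng)
        frames.append(prev)
    return frames

if __name__ == "__main__":
    # Two toy 2-dimensional "frames"; in a real system `conds` would be
    # language-model outputs, not the target frames themselves.
    print(generate([[1.0, 2.0], [3.0, 4.0]], first_frame=[0.0, 0.0]))
```

In this toy setting the flow lands on the conditioning vector by construction; the point is only to show where the previous frame enters the sampling loop.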

Zero-Shot Text-to-Speech for Cross-Sentence Task

Samples are from the LibriSpeech dataset.

English Text	Speaker Prompt	MELLE	FELLE
for a long time he had wished to explore the beautiful land of oz in which they lived
he shall not leave you day or night whether you are working or playing or sleeping
john taylor who had supported her through college was interested in cotton
soft heart he said gently to her then to thorkel well let him go thorkel
and lay me down in thy cold bed and leave my shining lot
there is no class and no country that has yielded so abjectly before the pressure of physical want as to deny themselves all gratification of this higher or spiritual need
as to his age and also the name of his master jacob's statement varied somewhat from the advertisement
horse sense a degree of wisdom that keeps one from betting on the races
the stop at queenstown the tedious passage up the mersey were things that he noted dimly through his growing impatience
then he rushed down stairs into the courtyard shouting loudly for his soldiers and threatening to patch everybody in his dominions if the sailorman was not recaptured

Zero-Shot Text-to-Speech for Continuation Task

Samples are from the LibriSpeech dataset.

Note: The first 3 seconds of each speaker prompt audio are used as the reference prompt for synthesis.
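As a minimal sketch of the prompt construction described in the note, the helper below cuts the first 3 seconds from a waveform given as a list of samples. The function name `split_prompt` is hypothetical, and the 16 kHz rate in the usage example is the LibriSpeech sampling rate rather than something stated on this page.

```python
def split_prompt(samples, sample_rate, prompt_seconds=3.0):
    # Number of samples covering the prompt duration, clipped to the clip length.
    n = min(int(round(prompt_seconds * sample_rate)), len(samples))
    # The first part is the reference prompt; the rest is what the model continues.
    return samples[:n], samples[n:]

if __name__ == "__main__":
    waveform = [0.0] * 80000  # 5 seconds of silence at 16 kHz
    prompt, remainder = split_prompt(waveform, sample_rate=16000)
    print(len(prompt), len(remainder))
```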

English Text	Speaker Prompt	MELLE	FELLE
milligram roughly 128000 of an ounce
i get tired of seeing men and horses going up and down up and down
i will show you what a good job i did and she went to a tall cupboard and threw open the doors
the utility of consumption as an evidence of wealth is to be classed as a derivative growth
but philip is honest and he has talent enough if he will stop scribbling to make his way
i made her for only 20 oars because i thought few men would follow me for i was young 15 years old
they they excite me in some way and i i can not bear them you must excuse me
it sounded dull it sounded strange and all the more so because of his main condition which was

FELLE with Different Parameter Configurations

Samples are from the LibriSpeech dataset. We compare FELLE's performance under different NFE (number of function evaluations) and CFG (classifier-free guidance) scale settings.
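To make the two knobs concrete, here is a small sketch of how a guidance scale and a step count typically enter a flow-matching sampler. It is an illustration under assumptions, not FELLE's code: the guidance rule `v_uncond + scale * (v_cond - v_uncond)` is one common convention (some systems use `(1 + scale) * v_cond - scale * v_uncond` instead), NFE is treated as the number of Euler steps, and the constant velocities in the usage example are made up.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    # Assumed guidance convention: scale = 1.0 returns the conditional velocity,
    # larger values push further away from the unconditional prediction.
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_sample(x0, velocity_fn, nfe):
    # Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 in `nfe` Euler steps.
    x, dt = list(x0), 1.0 / nfe
    for k in range(nfe):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    # Toy conditional and unconditional velocity fields (constants, for illustration).
    def guided(scale):
        return lambda x, t: cfg_velocity([1.0], [0.5], scale)

    for nfe, scale in [(3, 1.6), (6, 1.6), (3, 1.0), (3, 2.2)]:
        print(nfe, scale, euler_sample([0.0], guided(scale), nfe))
```

More steps reduce the integration error of the solver at a higher compute cost, while the guidance scale trades adherence to the conditioning against diversity; the samples in the table below let these settings be compared by ear.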

English Text	NFE=3, CFG=1.6	NFE=6, CFG=1.6	NFE=3, CFG=1.0	NFE=3, CFG=2.2
forthwith all ran to the opening of the tent to see what might be amiss but master will who peeped out first needed no more than one glance
also a popular contrivance whereby love making may be suspended but not stopped during the picnic season
then as if satisfied of their safety the scout left his position and slowly entered the place
positively heroic added cresswell avoiding his sister's eyes
we want you to help us publish some leading work of luther's for the general american market will you do it
but the memory of their exploits has passed away owing to the lapse of time and the extinction of the actors
surely it must be because we are in danger of loving each other too well of losing sight of the creator in idolatry of the creature
i have not had a chance yet to tell you what a jolly little place i think this is

Ethics Statement

FELLE is purely a research project. FELLE can synthesize speech that maintains speaker identity and could be used for education, entertainment, journalism, self-authored content, accessibility features, interactive voice response systems, translation, chatbots, and so on. While FELLE can speak in a voice like that of the voice talent, the similarity and naturalness depend on the length and quality of the speech prompt, the background noise, and other factors. The model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized-speech detection model.

This page is for research demonstration purposes only.