Scaling Rich Style-Prompted Text-to-Speech Datasets

Abstract

We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 282 hours of human-labelled data (PSC-Base) and 2450 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. ParaSpeechCaps and our trained models will be open-sourced.

Below are two interactive demos showcasing our work. The Style Prompted TTS Experiments Demo compares the outputs of different models. The ParaSpeechCaps Dataset Demo presents examples from both our human-annotated and automatically annotated datasets. For easy visualization, we underline all rich style tags in the style prompts.

Style Prompted TTS Experiments Demo

Listen to our generated speech samples demonstrating various styles and expressions. We provide both cherry-picked examples showcasing our best results and randomly picked examples. Each row presents the outputs of the different models for the same style prompt and transcription.

Cherry-Picked

Transcription | Style Prompt | 🔊 Ours (Scaled) | 🔊 Ours (Base) | +LTTSP,Exp,EARS | +LTTSR | Parler-TTS | Ground Truth*

Transcription: That's my brother. I do agree, though, it wasn't very well-groomed.
Style Prompt: A man speaks with a booming, medium-pitched voice in a clear environment, delivering his words at a measured speed.

Transcription: reveal my true intentions in different ways. That's why the Street King Project and SMS
Style Prompt: A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.

Transcription: the Grand Slam tennis game has sort of taken over our set that's sort of all the way
Style Prompt: In a clear environment, a male speaker delivers his words hesitantly with a measured pace.

Transcription: you know you want to see how far you can push everything and as an artist
Style Prompt: A low-pitched, guttural male voice speaks slowly in a clear environment.

Transcription: most important but the reaction is very similar throughout the world it's really very very similar
Style Prompt: A man speaks with a measured pace in a clear environment, displaying a distinct British accent.

Transcription: about God and the people him come from is more Christian, you know. We always
Style Prompt: A male speaker's voice is clear and delivered at a measured pace in a quiet environment. His speech carries a distinct Jamaican accent.

Transcription: Was that your landlord?
Style Prompt: In a clear environment, a male voice speaks with a sad tone.

Transcription: I mean, to be fair, I did see a UFO, so, you know.
Style Prompt: A man speaks with a measured pace in a clear environment, his voice carrying a sleepy tone.

Transcription: Yes, that's what they said. I don't know what you're getting done. What are you getting done? Oh, okay. Yeah.
Style Prompt: A frightened woman speaks with a clear and distinct voice.

Transcription: Oh wow, this music is fantastic. You play so well. I could just sit here.
Style Prompt: A woman speaks slowly in a clear environment, her voice filled with awe.

Transcription: this is just way too overwhelming. I literally don't know how I'm going to get any of this done on time. I feel so overwhelmed right now. No one is helping me. Everyone's ignoring my calls and my emails. I don't know what I'm supposed to do right now.
Style Prompt: A woman speaks with a high-pitched voice in a clear environment, conveying a sense of anxiety.

Transcription: What is wrong with him, Chad?
Style Prompt: A female speaker's high-pitched voice is clear and carries over a laughing, unobstructed environment.

Transcription: The fruit piece, the still lifes, you mean.
Style Prompt: In a clear environment, a man speaks in a whispered tone.

Transcription: Ari had to somehow be subservient to Lloyd that would be unbelievable like if Lloyd was the guy who was like running Time Warner you know what I mean like
Style Prompt: A male speaker with a husky, low-pitched voice delivers clear speech in a quiet environment.

Transcription: You know, Joe Bow, hockey mom from Wasilla, if I have an idea that would perhaps make
Style Prompt: A female speaker's voice is clear and expressed at a measured pace, but carries a high-pitched, nasal tone, recorded in a quiet environment.

Randomly-Picked

Transcription | Style Prompt | 🔊 Ours (Scaled) | 🔊 Ours (Base) | +LTTSP,Exp,EARS | +LTTSR | Parler-TTS | Ground Truth*

Transcription: and I felt I had to stand up and fight for what I believed was justice I never stopped to think whether I am a man or a woman or whether what I am doing is right I felt I had to do this I did it I did whatever I thought was right I have lived my life according to
Style Prompt: In a clear environment, a woman speaks with a measured speed and a medium-pitched, authoritative voice.

Transcription: I thought he looked familiar. He calls me up. He goes, hi, it's Kurt Ball time. I figured Kurt Ball time to band lead. I didn't know.
Style Prompt: In a clear environment, a man speaks with a nasal tone and a high-pitched voice.

Transcription: if you want because I didn't want to do it at all so he talked me into that actually I I
Style Prompt: A female speaker delivers her words at a measured pace in a clear environment, yet her voice exhibits occasional vocal-fry.

Transcription: You have to be objective and look at yourself that somehow because you're part of pop culture
Style Prompt: A male speaker's voice exhibits vocal fry in a clear environment, delivering his words at a measured speed.

Transcription: He's incredibly impressed on a number of levels. First of all, his life plays out like a Greek tragedy.
Style Prompt: A man speaks with a deep, husky voice in a clear environment.

Transcription: I have never compromised when it comes to protecting the state's interests or the
Style Prompt: A female speaker with a medium-pitched voice and an Indian accent delivers her speech at a measured pace in a clear environment.

Transcription: where it was meant to be placed which was of course with a great dose of irony so I
Style Prompt: In a clear environment, a woman speaks with a measured speed and an Australian accent.

Transcription: I never even know who Marcus Garvey is, who Ellis Lassie is, or who any black man is.
Style Prompt: A male speaker with a Jamaican accent delivers his words in a clear environment, maintaining a measured pace and a medium-pitched voice.

Transcription: All the while her skin smoothes, blemishes fade, and wrinkles flatten against tightening skin.
Style Prompt: A man speaks slowly in a clear environment, his words sounding confused.

Transcription: Wanna see "The Color Purple" on Broadway?
Style Prompt: A male speaker's voice is high-pitched and clear, but his speech is confused and delivered at a measured speed in a quiet environment.

Transcription: Yes, "The King's Speech" is popular.
Style Prompt: A female speaks with a clear and measured pace, conveying a sense of sadness in her voice.

Transcription: Okay, um, you, okay, sure, you can grab one of those, yep, you don't have to leave, uh-huh, okay.
Style Prompt: In a clear environment, a woman speaks with a passive tone and at a slow speaking speed.

Transcription: If you tilt the glass, whoa!
Style Prompt: A woman speaks in a clear and loud voice in an environment with no discernible background noise.

Transcription: It didn't work. It's not working. It's not enough. I...
Style Prompt: A female speaker's voice is clear and her speech is delivered at a measured pace in a quiet environment. Her tone, however, conveys fear.

Transcription: Oh, it hurts. I can't believe you moved that table there. I think I broke my toe. Oh, I might have to go to the ER. I think it's broken. I know they, what can they do for a broken toe?
Style Prompt: A high-pitched female voice speaks with evident pain in a clear environment.

* While all models are prompted with a 'clear environment' tag and tend to adhere to it, ground truth audio comes from our in-the-wild PSC-Base dataset, which may contain background noise.

ParaSpeechCaps Dataset Demo

Listen to randomly-sampled examples from our ParaSpeechCaps dataset for a subset of our 59 supported style tags, with both the human-annotated PSC-Base and the automatically-annotated PSC-Scaled available.

ParaSpeechCaps-Base

Audio | Transcription | Rich Tags | Basic Tags | Style Prompt

Transcription: that for your motivation i mean what keeps you going when you know you have you know family back
Rich Tags: american, crisp, flowing, shrill
Basic Tags: female, high-pitched, measured speed, slightly clean environment
Style Prompt: A female speaker's voice is flowing and high-pitched, with a shrill tone, displaying a crisp and measured speed. Recorded in a slightly clean environment. (American accent)

Transcription: So it was a lot of fun. I loved it and I'm happy the song came out well. Plus it's a lot of pressure, like the bar is set very high.
Rich Tags: american, authoritative, booming, deep, husky
Basic Tags: low-pitched, male, measured speed, slightly noisy environment
Style Prompt: In a slightly noisy environment, a deep and booming voice of an American male with a husky and authoritative tone speaks at a measured speed, conveying a low-pitched and confident demeanor.

Transcription: the emergence of Lily was set against the backdrop of this extraordinary moment of
Rich Tags: authoritative, british, deep, raspy
Basic Tags: low-pitched, male, measured speed, very clean environment
Style Prompt: A British male speaks with an authoritative tone, his deep, raspy voice having a low pitch and a measured speed in a very clean environment.

Transcription: think by getting to choose my artist I really hope it's just someone that it can't be someone that's just an amazing singer because there's a lot of amazing singers it has to be someone that's what's their purpose what do they want to bring to the industry because right now there's so many there's people that are willing to put on the
Rich Tags: american, flowing, raspy, slurred
Basic Tags: fast speed, female, medium-pitched, slightly noisy environment
Style Prompt: A female speaker with a raspy, American accent delivers her words in a medium-pitched, flowing tone, speaking fast in a slightly noisy environment. Her speech is slightly slurred.

Transcription: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors.
Rich Tags: american, enunciated, hesitant, loud, monotonous, punctuated
Basic Tags: female, high-pitched, measured speed, slightly clean environment
Style Prompt: A female speaker with a high-pitched voice delivers enunciated and punctuated words in an American accent. Her speech is loud and produced in a slightly clean environment. The measured speed of her speech is monotonous, yet she exhibits hesitancy.

Transcription: two years, of course, the world's media is known about that, but that's been part of the history of that place forever, which is why we place that story in the heart of our
Rich Tags: flowing, scottish, silky
Basic Tags: female, measured speed, medium-pitched, noisy environment
Style Prompt: A Scottish female speaks with a flowing, silky voice at a measured speed, recorded in a medium-noisy environment. Her pitch is of a medium height.

Transcription: It's just that somehow for many many years round robins were the only attractive events around and but recently a couple of attractive Swiss events have come up so and you can see already that many participants open to playing. So it's not like I had some fundamental objection, it's just that's the way it
Rich Tags: flowing, hesitant, indian, pitchy, slurred
Basic Tags: male, measured speed, medium-pitched, noisy environment
Style Prompt: A male speaker with a medium-pitched voice delivers a flowing Indian accent, yet his speech is hesitant and slightly slurred. The recording takes place in a noisy environment, which adds pitchiness to his measured, yet inconsistent speed.

Transcription: Short term memory loss at your age?
Rich Tags: american, authoritative, confused, flowing, nasal, singsong
Basic Tags: environment balanced in clarity, high-pitched, male, measured speed
Style Prompt: A male speaker delivers authoritative, measured statements in a singsong, high-pitched voice, displaying confusion at times. The environment is balanced in clarity, allowing for a flowing American accent with a distinct nasal quality.

Transcription: This is some major consp- this is some major conspiracy thing, because literally it's that shit is- that shit is following you wherever you go. You might not notice it, because you're so used to the smell, but that shit is around on every hike, it is around at every watering hole, it is at every campsite, it is right next to your snack bag, it is everywhere.
Rich Tags: american, angry, animated, flowing, singsong
Basic Tags: high-pitched, male, measured speed, slightly clean environment
Style Prompt: A male speaker delivers an animated, flowing, and singsong speech with a high-pitched voice, expressing anger. The recording takes place in a slightly clean environment, and the speaker maintains a measured speed throughout. This American accent is evident in his speech.

Transcription: oh i just stubbed my toe on the side of the bed oh my gosh i am in severe pain oh my god i hope this does not turn red and swollen
Rich Tags: american, enunciated, hesitant, monotonous, pained, punctuated
Basic Tags: female, high-pitched, slightly clean environment, slow speed
Style Prompt: A female speaker's voice is high-pitched and monotonous, with enunciated and punctuated words in a slightly clean environment. Her speech is hesitant, pained, and delivered at a slow speed, with an American accent.

Transcription: I just love how you can play guitar. You're so impressive. I admire your abilities so much.
Rich Tags: american, awed, crisp, deep, pitchy
Basic Tags: high-pitched, male, slightly clean environment, slow speed
Style Prompt: A male speaker with a deep, slow speech delivery expresses awe in a crisp, slightly clean environment. His voice has a high-pitched, pitchy quality, characteristic of an American accent.

Transcription: Mmm that chocolate fudge lava cake looks devine. I want that car so badly. I can't wait to see you again.
Rich Tags: american, desirous, flowing, hesitant, nasal
Basic Tags: environment balanced in clarity, female, high-pitched, slow speed
Style Prompt: A female speaker's voice is flowing yet hesitant, displaying a nasal, high-pitched tone. Her speech is delivered at a slow speed in an American accent, ensuring a balanced environment for clear enunciation. The desirous nature of her emotion is evident.

Transcription: And Michael's like, yeah, yeah, yeah, yeah.
Rich Tags: american, animated, flowing, laughing, singsong
Basic Tags: clean environment, high-pitched, male, measured speed
Style Prompt: A male speaker delivers a measured, high-pitched speech with animated laughter, his American accent carrying a flowing, singsong quality in a clean environment.

Transcription: Okay, gotcha, gotcha. So four o'clock actually means like six o'clock then, you know what I mean?
Rich Tags: american, flowing, singsong, whispered
Basic Tags: high-pitched, male, measured speed, slightly noisy environment
Style Prompt: In a slightly noisy environment, a male American voice delivers a flowing, singsong speech with a measured speed and a high-pitched whisper.

ParaSpeechCaps-Scaled

Audio | Transcription | Rich Tags | Basic Tags | Style Prompt

Transcription: That's wonderful. So I guess you're looking for a ring.
Rich Tags: american, enthusiastic, loud, shrill
Basic Tags: female, high-pitched, measured speed, very clean environment
Style Prompt: A female speaker's voice is enthusiastic and high-pitched, delivered at a measured speed in a very clean environment. The tone is loud and can come across as shrill.

Transcription: Does that thing look good? It does, doesn't it? It's done. Amazing!
Rich Tags: admiring, american, confident, enthusiastic, happy, husky, raspy
Basic Tags: high-pitched, male, slow speed, very noisy environment
Style Prompt: A male speaker with a husky, raspy voice delivers happy and admiring remarks at a slow speed in a very noisy American environment. His speech is enthusiastic and confident, with occasional high-pitched inflections.

Transcription: The seed was sown long ago and it flowers beautifully.
Rich Tags: american, authoritative, booming, calm, deep, guttural
Basic Tags: clean environment, low-pitched, male, measured speed
Style Prompt: A low-pitched, authoritative male voice with a guttural American accent dominates the clean environment, delivering each word with a measured, calm, and deep boom.

Transcription: Let's get to it. Hi friends, happy, you might be expecting me to say Friday.
Rich Tags: american, enthusiastic, flowing, punctuated
Basic Tags: clean environment, female, high-pitched, measured speed
Style Prompt: A female speaker's voice is enthusiastic and flows smoothly, recorded in a clean American environment. Her pitch is high, and she speaks with a punctuated, measured speed.

Transcription: When you helped someone, describe a time when you helped someone. Maybe you were working in a team at university or school or in your job and you helped one of the other team members. Use some of your vocabulary from the describe a person topic again when we were talking about teamwork and working hard.
Rich Tags: authoritative, deep, loud, scottish
Basic Tags: low-pitched, male, measured speed, slightly clean environment
Style Prompt: A deep-voiced man speaks authoritatively with a Scottish accent in a slightly clean environment. His speech is measured and delivered at a loud volume with a low pitch.

Transcription: So a few active friends of mine for their birthdays and for their anniversary parties or something, we did a few, uh, events for them, the decor and planning for their events. It turned out to be really successful.
Rich Tags: shrill, indian
Basic Tags: environment balanced in clarity, female, high-pitched, measured speed
Style Prompt: A female speaker with an Indian accent delivers a measured speech in a clear environment, her shrill and high-pitched voice adding intensity.

Transcription: Of whatever thing, baggage about me being either gay or dra- I don't know. Is either the drag the gay the something? Who knows? I don't know. But uhm, yeaah.
Rich Tags: confused
Basic Tags: clean environment, high-pitched, male, slow speed
Style Prompt: A male speaker's voice is slow and confused, recorded in a clean environment with a high-pitched tone.

Transcription: Who's standing in the way of progress with stupidness and arrogance and dumb shit? Is that guy?
Rich Tags: american, angry, authoritative, deep, enunciated, flowing
Basic Tags: male, measured speed, medium-pitched, very clean environment
Style Prompt: A male speaker with an American accent delivers his words in a measured and authoritative tone, with a medium-pitched, deep voice that flows smoothly in a very clean environment. His speech is enunciated clearly, expressing a sense of anger.

Transcription: I lay down beside my old man when they carried the stretcher into the hospital room and hung onto the stretcher and cried and cried, and he looked so white and gone and so awfully dead. I couldn't help feeling that if my old man was dead, maybe they didn't need to have shot Guilford. His hoof might have got well. I don't know. I love my old man so much.
Rich Tags: pained
Basic Tags: male, measured speed, medium-pitched, very clean environment
Style Prompt: A male voice, medium-pitched, speaks with a measured speed, conveying pain in his tone, recorded in a very clean environment.

Transcription: In your eyes, I am the most loved and loving, everyone's best friend. Perfect daughter, the perfect mother, the perfect wife, a beautiful person to know. And when I see myself reflected in your eyes, I see someone ten times the person I'll ever be. I see you.
Rich Tags: awed
Basic Tags: clean environment, female, high-pitched, slow speed
Style Prompt: A high-pitched female voice speaks with awe in a clean environment at a slow speed.

Transcription: Maxine Dupree's graduation in the Alpha Academy. When this graphic popped up, I cheered. Same thing. I was genuinely so thrilled. I wrote, in fact, look, recap of Maxine from last week. She's graduating tonight!
Rich Tags: admiring, desirous, enthusiastic
Basic Tags: environment balanced in clarity, high-pitched, male, measured speed
Style Prompt: A male speaker delivers a desirous and admiring tone in a high-pitched voice, his speech exhibiting enthusiastic and measured speed in a clear and balanced environment.
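
For readers who want to work with rows like the ones in the dataset demo, the following is a minimal Python sketch of how one row (transcription, rich tags, basic tags, style prompt) might be held in memory, together with a helper that finds which rich tags appear verbatim in the style prompt, which is roughly what the underlining on this page conveys. The class name `StyleCaptionRow`, its field names, and the helper `tags_mentioned` are illustrative assumptions, not the released dataset schema or our annotation code. The example row is copied from the ParaSpeechCaps-Base examples above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleCaptionRow:
    """One demo-table row. Field names are hypothetical, not the dataset schema."""
    transcription: str
    rich_tags: tuple[str, ...]
    basic_tags: tuple[str, ...]
    style_prompt: str


def tags_mentioned(row: StyleCaptionRow) -> list[str]:
    """Return the rich tags that occur as whole words in the style prompt.

    Matching is literal and case-insensitive, so a paraphrased tag
    (e.g. "pained" rendered as "pain") is not found.
    """
    found = []
    for tag in row.rich_tags:
        pattern = r"\b" + re.escape(tag) + r"\b"
        if re.search(pattern, row.style_prompt, flags=re.IGNORECASE):
            found.append(tag)
    return found


row = StyleCaptionRow(
    transcription="the emergence of Lily was set against the backdrop of this extraordinary moment of",
    rich_tags=("authoritative", "british", "deep", "raspy"),
    basic_tags=("low-pitched", "male", "measured speed", "very clean environment"),
    style_prompt=(
        "A British male speaks with an authoritative tone, his deep, raspy voice "
        "having a low pitch and a measured speed in a very clean environment."
    ),
)
print(tags_mentioned(row))  # ['authoritative', 'british', 'deep', 'raspy']
```

Because the style prompts often paraphrase a tag rather than repeat it, literal matching of this kind should be treated as a lower bound on which tags a prompt expresses.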