A Study on the Efficacy of Model Pre-training in Developing Neural Text-to-Speech System

Abstract

In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers' data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand better why and how model pre-training can positively contribute to TTS system performance. It is postulated that the pre-training process plays a critical role in learning text-related variation in speech, while further training with the target speaker's data aims to capture the speaker-related variation. Different test sets are created with varying degrees of similarity to target speaker data in terms of text content. Experiments show that leveraging a speaker-independent TTS trained on speech data with diverse text content can improve the target speaker TTS on domain-mismatched text. We also attempt to reduce the amount of pre-training data for a new text domain and improve the data and computational efficiency. It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.

Part 1. 24-hour LJSpeech Data as Target Speaker Data

These examples are randomly sampled from the MOS and CMOS evaluation set for Table 1 in the paper.
TTS without pre-training: The system is trained only on the 24-hour target speaker data, i.e., without pre-training on any other data.
TTS with pre-training: The system is obtained by fine-tuning a speaker-independent model, pre-trained on the 960-hour LibriSpeech corpus, with the 24-hour target speaker data.
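The two-stage recipe above (pre-train on large non-target data, then fine-tune on a small target set) can be illustrated with a deliberately tiny sketch. This is not the paper's training code: the one-parameter model, datasets, and function names below are all invented for illustration; the point is only that initializing from pre-trained weights starts the target-data training much closer to a good solution than training from scratch.

```python
import random

def sgd(data, w0, lr=0.01, epochs=10):
    # Plain SGD on squared error for a one-parameter linear model y = w * x
    w = w0
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

random.seed(0)
# Large "non-target" dataset (true slope 2.0) vs. a tiny "target" dataset (true slope 2.2)
pretrain_data = [(i / 10, 2.0 * (i / 10) + random.gauss(0, 0.1)) for i in range(1, 100)]
target_data = [(x, 2.2 * x + random.gauss(0, 0.1)) for x in (0.5, 1.0, 1.5)]

w_pre = sgd(pretrain_data, w0=0.0, epochs=50)  # stage 1: pre-train on non-target data
w_ft = sgd(target_data, w0=w_pre)              # stage 2: fine-tune from pre-trained weight
w_scratch = sgd(target_data, w0=0.0)           # baseline: target data only, random init
```

With the same small budget of target data and updates, the fine-tuned weight lands much nearer the target value than the from-scratch baseline, mirroring the comparison the listening tests below are probing.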

TTS comparison on Test set T-SIM

TTS without pre-training | TTS with pre-training
1. It is clear that the note was written while the Oswalds were living in Dallas before they moved to New Orleans in the spring of nineteen sixty-three.
2. The numbers soon increased, however, and by eighteen eleven had again risen to six hundred twenty-nine; and Mr. Neild was told that there had been at one time
3. On the fourteenth June, eighteen hundred, there were one hundred ninety-nine debtors and two hundred eighty-nine felons in the prison.
4. Marina Oswald appeared before the Commission again on June eleven, nineteen sixty-four.
5. and twelve, nineteen sixty-three, and most probably on either March nine or March ten.

TTS comparison on Test set T-DIFF

TTS without pre-training | TTS with pre-training
1. I have missed four fits and had but five and have recovered so much strength as made me venture to meet your letter on Wednesday a mile from home.
2. It wasn't I who said that, said the girl smiling but that's so anyhow, and then she sighed.
3. Mister King thought so too, and he beamed at Phronsie, so you did he cried now that's fine I wish you'd write me a letter some time.
4. Perusal said the pawnbroker, that's the way to pronounce it.
5. Remember me as a man who disregarded priceless love such as yours to go and make himself a proud position among fools and knaves; Indeed, that's what it comes to.

TTS comparison on Test set T-RAN

TTS without pre-training | TTS with pre-training
1. As I have said, there is the very serious doubt whether your father would accept money from you when you are my wife.
2. They knew what it was without a word. Missus Sterling clasped her hands and bowed her head.
3. Young Fitzooth had been commanded to his mother's chamber so soon as he had come out from his converse with the squire.
4. The light of the lamps seemed to grow dim and darkness to tarnish the face of the bride herself.
5. Every year at a certain day of a certain month, he went away to a distant city to collect money on an account.

Part 2. 1.5-hour LJSpeech Data as Target Speaker Data

These examples are randomly sampled from the MOS and CMOS evaluation set for Table 3 in the paper.
TTS without pre-training: The system is trained only on the 1.5-hour target speaker data, i.e., without pre-training on any other data.
TTS with pre-training: The system is obtained by fine-tuning a speaker-independent model, pre-trained on the 960-hour LibriSpeech corpus, with the 1.5-hour target speaker data.

TTS comparison on Test set T-SIM

TTS without pre-training | TTS with pre-training
1. During the first interrogation on November twenty-two, Fritz asked Oswald to account for himself at the time the President was shot.
2. When Marina Oswald testified before the Commission on February three to six, nineteen sixty-four.
3. would be consistent with the period when the Oswalds were living on Neely Street since the apartment was rented on March three, nineteen sixty-three.
4. the Dallas Police Department forwarded it on December two, nineteen sixty-three.
5. and testified that a few days before her husband's departure from Dallas to New Orleans on April twenty-four, nineteen sixty-three.

TTS comparison on Test set T-DIFF

TTS without pre-training | TTS with pre-training
1. But in time, the end of it all came, and Wabi went back to the princess's mother to Minnetaki and to his forests.
2. This was her dream as nearly as she could recall it when she came to herself after waking from it with a cry.
3. It was locked from the inside, and we had to burn it down with a torch; that's where they are.
4. And I have no one ready to whom I can give up the archives of the government.
5. I called him into the bathroom and I closed the door and I wanted to prevent him and then I started to cry.

TTS comparison on Test set T-RAN

TTS without pre-training | TTS with pre-training
1. As I have said, there is the very serious doubt whether your father would accept money from you when you are my wife.
2. They knew what it was without a word. Missus Sterling clasped her hands and bowed her head.
3. Young Fitzooth had been commanded to his mother's chamber so soon as he had come out from his converse with the squire.
4. The light of the lamps seemed to grow dim and darkness to tarnish the face of the bride herself.
5. Every year at a certain day of a certain month, he went away to a distant city to collect money on an account.

Part 3. Pre-training Data Reduction

These examples are randomly sampled from the MOS and CMOS evaluation set for Table 4 in the paper.
All four target speaker TTS systems are fine-tuned from pre-trained speaker-independent TTS models. The only difference among these systems is the pre-training data:
Random: 40,000 utterance-text pairs randomly sampled from LibriSpeech.
Full: the full 960-hour LibriSpeech.
Perplexity-based: 40,000 utterance-text pairs sampled from LibriSpeech with the method described in subsection 4.1 of the paper.
BERT-based: 40,000 utterance-text pairs selected from LibriSpeech using the method described in subsection 4.2 of the paper.
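As a rough illustration of perplexity-based selection (the paper's actual procedure is the one in subsection 4.1; this sketch substitutes a Laplace-smoothed unigram language model, and the function names are invented), one can score every corpus transcript under a language model built from target-domain text and keep the k utterances with the lowest perplexity, i.e., those whose text the target-domain model finds most probable:

```python
import math
from collections import Counter

def unigram_perplexity(sentence, counts, total, vocab):
    # Perplexity of one sentence under a Laplace-smoothed unigram LM
    words = sentence.lower().split()
    logp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-logp / max(len(words), 1))

def select_subset(corpus_texts, target_texts, k):
    # Build a unigram LM on target-domain text, then keep the k corpus
    # utterances whose transcripts score lowest in perplexity under it
    counts = Counter(w for t in target_texts for w in t.lower().split())
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 reserves mass for unseen words
    scored = sorted(corpus_texts,
                    key=lambda t: unigram_perplexity(t, counts, total, vocab))
    return scored[:k]
```

Transcripts whose vocabulary overlaps the target domain are ranked ahead of out-of-domain ones, which is the intuition behind shrinking the 960-hour corpus to a domain-matched 40,000-pair subset.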

TTS comparison on a test set from the novel-books domain

Random | Full | BERT-based | Perplexity-based
1. Much later, when a friend of his was preparing an edition of all his Latin works, he remarked to his home circle if I had my way about it, they would republish only those of my books which have doctrine my Galatians, for instance.
2. Another favourite present at this time among Buddhists is a cage of living birds to be borne to the grave and released thereon.
3. With them, the body has worn out the soul, the senses have burned up the heart, dissipation has blunted the feelings.
4. I even kissed her when she asked me to, and it sent a shiver all down my back.
5. They have known me much longer but never honour me with any familiarity though hardly a day passes without my bringing them food.