A Study on the Efficacy of Model Pre-training in Developing Neural Text-to-Speech System

Abstract

In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers' data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand better why and how model pre-training can positively contribute to TTS system performance. It is postulated that the pre-training process plays a critical role in learning text-related variation in speech, while further training with the target speaker's data aims to capture the speaker-related variation. Different test sets are created with varying degrees of similarity to target speaker data in terms of text content. Experiments show that leveraging a speaker-independent TTS trained on speech data with diverse text content can improve the target speaker TTS on domain-mismatched text. We also attempt to reduce the amount of pre-training data for a new text domain and improve the data and computational efficiency. It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.

Part 1. 24-hour LJSpeech Data as Target Speaker Data

These examples are randomly sampled from the MOS and CMOS evaluation set for Table 1 in the paper.
TTS without pre-training: The system is trained only on the 24-hour target speaker data, i.e., without pre-training on any other data.
TTS with pre-training: The system is obtained by fine-tuning a speaker-independent model, pre-trained on the 960-hour LibriSpeech corpus, with the 24-hour target speaker data.
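The two-stage recipe above (pre-train on large non-target data, then fine-tune on a small target set) can be illustrated with a deliberately tiny sketch. This is not the paper's training code: the one-parameter model, datasets, and function names below are all invented for illustration; the point is only that initializing from pre-trained weights starts the target-data training much closer to a good solution than training from scratch.

```python
import random

def sgd(data, w0, lr=0.01, epochs=10):
    # Plain SGD on squared error for a one-parameter linear model y = w * x
    w = w0
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

random.seed(0)
# Large "non-target" dataset (true slope 2.0) vs. a tiny "target" dataset (true slope 2.2)
pretrain_data = [(i / 10, 2.0 * (i / 10) + random.gauss(0, 0.1)) for i in range(1, 100)]
target_data = [(x, 2.2 * x + random.gauss(0, 0.1)) for x in (0.5, 1.0, 1.5)]

w_pre = sgd(pretrain_data, w0=0.0, epochs=50)  # stage 1: pre-train on non-target data
w_ft = sgd(target_data, w0=w_pre)              # stage 2: fine-tune from pre-trained weight
w_scratch = sgd(target_data, w0=0.0)           # baseline: target data only, random init
```

With the same small budget of target data and updates, the fine-tuned weight lands much nearer the target value than the from-scratch baseline, mirroring the comparison the listening tests below are probing.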

TTS comparison on Test set T-SIM

TTS without pre-training | TTS with pre-training
1. It is clear that the note was written while the Oswalds were living in Dallas before they moved to New Orleans in the spring of nineteen sixty-three.
2. The numbers soon increased, however, and by eighteen eleven had again risen to six hundred twenty-nine; and Mr. Neild was told that there had been at one time
3. On the fourteenth June, eighteen hundred, there were one hundred ninety-nine debtors and two hundred eighty-nine felons in the prison.
4. Marina Oswald appeared before the Commission again on June eleven, nineteen sixty-four.
5. and twelve, nineteen sixty-three, and most probably on either March nine or March ten.

TTS comparison on Test set T-DIFF

TTS without pre-training | TTS with pre-training
1. I have missed four fits and had but five and have recovered so much strength as made me venture to meet your letter on Wednesday a mile from home.
2. It wasn't I who said that, said the girl smiling but that's so anyhow, and then she sighed.
3. Mister King thought so too, and he beamed at Phronsie, so you did he cried now that's fine I wish you'd write me a letter some time.
4. Perusal said the pawnbroker, that's the way to pronounce it.
5. Remember me as a man who disregarded priceless love such as yours to go and make himself a proud position among fools and knaves; Indeed, that's what it comes to.

TTS comparison on Test set T-RAN

TTS without pre-training | TTS with pre-training
1. As I have said, there is the very serious doubt whether your father would accept money from you when you are my wife.
2. They knew what it was without a word. Missus Sterling clasped her hands and bowed her head.
3. Young Fitzooth had been commanded to his mother's chamber so soon as he had come out from his converse with the squire.
4. The light of the lamps seemed to grow dim and darkness to tarnish the face of the bride herself.
5. Every year at a certain day of a certain month, he went away to a distant city to collect money on an account.

Part 2. 1.5-hour LJSpeech Data as Target Speaker Data

These examples are randomly sampled from the MOS and CMOS evaluation set for Table 3 in the paper.
TTS without pre-training: The system is trained only on the 1.5-hour target speaker data, i.e., without pre-training on any other data.
TTS with pre-training: The system is obtained by fine-tuning a speaker-independent model, pre-trained on the 960-hour LibriSpeech corpus, with the 1.5-hour target speaker data.

TTS comparison on Test set T-SIM

TTS without pre-training | TTS with pre-training
1. During the first interrogation on November twenty-two, Fritz asked Oswald to account for himself at the time the President was shot.
2. When Marina Oswald testified before the Commission on February three to six, nineteen sixty-four.
3. would be consistent with the period when the Oswalds were living on Neely Street since the apartment was rented on March three, nineteen sixty-three.
4. the Dallas Police Department forwarded it on December two, nineteen sixty-three.
5. and testified that a few days before her husband's departure from Dallas to New Orleans on April twenty-four, nineteen sixty-three.

TTS comparison on Test set T-DIFF

TTS without pre-training | TTS with pre-training
1. But in time, the end of it all came, and Wabi went back to the princess's mother to Minnetaki and to his forests.
2. This was her dream as nearly as she could recall it when she came to herself after waking from it with a cry.
3. It was locked from the inside, and we had to burn it down with a torch; that's where they are.
4. And I have no one ready to whom I can give up the archives of the government.
5. I called him into the bathroom and I closed the door and I wanted to prevent him and then I started to cry.

TTS comparison on Test set T-RAN

TTS without pre-training | TTS with pre-training
1. As I have said, there is the very serious doubt whether your father would accept money from you when you are my wife.
2. They knew what it was without a word. Missus Sterling clasped her hands and bowed her head.
3. Young Fitzooth had been commanded to his mother's chamber so soon as he had come out from his converse with the squire.
4. The light of the lamps seemed to grow dim and darkness to tarnish the face of the bride herself.
5. Every year at a certain day of a certain month, he went away to a distant city to collect money on an account.

Part 3. Pre-training Data Reduction

These examples are randomly sampled from the MOS and CMOS evaluation set for Table 4 in the paper.
All four target speaker TTS systems are fine-tuned from pre-trained speaker-independent TTS models. The only difference among these systems is the pre-training data:
Random: 40,000 utterance-text pairs randomly sampled from LibriSpeech.
Full: the full 960-hour LibriSpeech.
Perplexity-based: 40,000 utterance-text pairs sampled from LibriSpeech with the method described in subsection 4.1 of the paper.
BERT-based: 40,000 utterance-text pairs selected from LibriSpeech using the method described in subsection 4.2 of the paper.
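As a rough illustration of perplexity-based selection (the paper's actual procedure is the one in subsection 4.1; this sketch substitutes a Laplace-smoothed unigram language model, and the function names are invented), one can score every corpus transcript under a language model built from target-domain text and keep the k utterances with the lowest perplexity, i.e., those whose text the target-domain model finds most probable:

```python
import math
from collections import Counter

def unigram_perplexity(sentence, counts, total, vocab):
    # Perplexity of one sentence under a Laplace-smoothed unigram LM
    words = sentence.lower().split()
    logp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-logp / max(len(words), 1))

def select_subset(corpus_texts, target_texts, k):
    # Build a unigram LM on target-domain text, then keep the k corpus
    # utterances whose transcripts score lowest in perplexity under it
    counts = Counter(w for t in target_texts for w in t.lower().split())
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 reserves mass for unseen words
    scored = sorted(corpus_texts,
                    key=lambda t: unigram_perplexity(t, counts, total, vocab))
    return scored[:k]
```

Transcripts whose vocabulary overlaps the target domain are ranked ahead of out-of-domain ones, which is the intuition behind shrinking the 960-hour corpus to a domain-matched 40,000-pair subset.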

TTS comparison on a test set from the novel-books domain

Random | Full | BERT-based | Perplexity-based
1. Much later, when a friend of his was preparing an edition of all his Latin works, he remarked to his home circle if I had my way about it, they would republish only those of my books which have doctrine my Galatians, for instance.
2. Another favourite present at this time among Buddhists is a cage of living birds to be borne to the grave and released thereon.
3. With them, the body has worn out the soul, the senses have burned up the heart, dissipation has blunted the feelings.
4. I even kissed her when she asked me to, and it sent a shiver all down my back.
5. They have known me much longer but never honour me with any familiarity though hardly a day passes without my bringing them food.