What would Alan Watts say if he were here with us today?
(a practical guide to speech synthesis and text generation)
I found myself at a crossroads at the end of 2022: I needed a new project to have fun and learn something new but wasn’t sure which direction to take. After speaking with a couple of smart people, an idea began to take shape. If you think about skills you would like to develop and things that interest you, what might be at the center of your Venn diagram?
For me, it was the intersection of NLP and philosophy. I decided to gain some exposure to speech synthesis (Tacotron2 and Waveglow), which I hadn’t worked with before, and also leverage text generation (aitextgen), thus fine-tuning 2–3 separate models that would think and speak like one of my favorite philosophers of all time.
As Alan Watts put it, “the only way to make sense out of change is to plunge into it, move with it, and join the dance”, and so the dance began.
Material and Tech Specs
I had over 50 hours of Alan Watts’ lectures, videos, and audiobooks, plus a small amount of written material. Equipped with a 16-core CPU and an Nvidia GeForce RTX 2070 with 8GB of VRAM, I was aiming for at least half-decent results in a reasonable timeframe (or at least, that’s what I thought before noticing smoke coming out of my computer; but hey, getting a new motherboard and replacing the CPU with a newer model only boosted the speed of the experiment).
I had to deal with a number of environment setup issues, and I covered some of them here if it helps. Feel free to ignore the “Poor alignment” section altogether because it is missing the most important garbage-in-garbage-out cause, which you’ll read about below. (And yes, as you may have noticed, I’m using Windows).
Model Overview
Let’s quickly get on the same page as to what we’re dealing with here.
- Tacotron2 (original paper) is an end-to-end neural text-to-speech (TTS) system that takes a sequence of characters as input and outputs a sequence of mel-spectrogram frames, which represent the acoustic features of the speech. The architecture includes an encoder network that converts the text into a high-level hidden representation, an attention mechanism that aligns the input characters with the spectrogram frames being generated, and a decoder network that predicts those mel-spectrogram frames one step at a time. By training the model on a specific person’s speech data, it learns the unique patterns and nuances of that person’s style and generates spectrograms that reflect it: intonation, rhythm, and emphasis on certain words. The mel-spectrograms can be converted into raw audio with a simple vocoder, but the result tends to fall short in naturalness, clarity, and overall fidelity, and that’s why we need Waveglow.
- Waveglow (original paper) is a flow-based generative model that takes the mel-spectrograms generated by Tacotron2 and produces natural-sounding synthesized speech. It models the probability distribution of audio waveforms with a deep neural network, which enables it to capture the complex structure of speech signals, and it is conditioned on those mel-spectrograms, which guide the generation of the waveform and help ensure that it preserves important aspects of the speaker’s voice, such as pitch, tone, and accent, making it sound natural and realistic.
- Aitextgen is a Python tool for text generation built on top of the popular Hugging Face Transformers library; it uses OpenAI’s GPT-2 architecture for natural language generation tasks. I was exhausted by the end of my struggle with Tacotron2/Waveglow, so working with a more straightforward, simple tool like aitextgen was the long-awaited reprieve.
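To make the pipeline concrete before diving into preprocessing, here is a minimal sketch of how the two speech models chain together at inference time. It is based on the pre-trained checkpoints NVIDIA publishes on torch.hub rather than the repo’s own inference.ipynb that I actually used, and it assumes a CUDA-capable GPU; treat it as an illustration of the data flow (text → Tacotron2 → mel-spectrogram → Waveglow → waveform), not my exact setup.
# Minimal sketch, not my exact setup: text -> Tacotron2 (mel) -> Waveglow (waveform).
# Uses NVIDIA's torch.hub entry points and assumes a CUDA GPU is available.
import torch
from scipy.io.wavfile import write

tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp32').to('cuda').eval()
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

text = "The only way to make sense out of change is to plunge into it."
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # characters -> mel-spectrogram frames
    audio = waveglow.infer(mel)                      # mel-spectrogram frames -> raw waveform

write("sample.wav", 22050, audio[0].data.cpu().numpy())  # both models work at 22,050 Hz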
Preprocessing for Text Generation
Pretty straightforward here: convert ebooks to text and clean it up afterward. Here’s a helpful write-up if you are working with books. As mentioned, I was using aitextgen for generating text (more on that below); the exact preprocessing steps will, of course, depend on the model you are planning to leverage.
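For a sense of what “clean” meant in practice: strip hard line breaks, hyphenation, and page-number debris so the model sees plain flowing prose. A tiny, purely illustrative sketch (the regexes and file names are hypothetical, not what I actually shipped):
# Illustrative cleanup for an ebook-to-text dump before feeding it to aitextgen.
import re

def clean_book_text(raw: str) -> str:
    text = re.sub(r"-\n(\w)", r"\1", raw)                        # re-join words hyphenated across line breaks
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)  # drop bare page numbers
    text = re.sub(r"[ \t]+", " ", text)                          # collapse stray spacing
    text = re.sub(r"\n{3,}", "\n\n", text)                       # collapse runs of blank lines
    return text.strip()

with open("alan_watts_raw.txt", encoding="utf-8") as f:          # hypothetical file names
    cleaned = clean_book_text(f.read())
with open("alan_watts_clean.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)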
Preprocessing for Speech Synthesis
This is where things get convoluted, partly because I was utterly unfamiliar with speech synthesis and partly because of the objective complexity of the endeavor. Two high-level overviews helped me approach this unknown terrain: jaimeleal’s short guide and a YouTube guide on training a custom voice model; that’s on top of the official model repos, internet surfing, and Q&A with ChatGPT (I’m also planning to integrate the GPT-4 beta with a Telegram bot later down the line).
The key thing to understand is that training Tacotron2 requires pairs of audio clips and their corresponding transcripts, with the mapping defined in a separate filelist. Waveglow only requires the audio to be trained, so not much hassle there.
I tried to keep steps chronological in my repo if you decide to reference it, and here are the main highlights:
1. Bring all audio to a common denominator, i.e. convert mp4, mp3, avi, etc. to wav. If you have multiple cores at your disposal, Python’s multiprocessing (or a Celery task queue paired with a Redis broker) comes to the rescue; see the first sketch after this list.
2. Reduce noise. The overall quality of Alan Watts’ lectures was on the low end, and sometimes noise reduction only exacerbated the issue. After many launch attempts I can’t remember whether I kept it in the final run, but it may be worth trying.
3. Remove silent parts if applicable. This may not be worth doing before you split your audio files into smaller segments.
4. Split audio. It’s embarrassing to admit how long it took me to realize that the shorter the segments, the better the model trains and the faster it converges. I started with 60-second segments, then retrained on 15-second ones, and finally settled on 4–7 seconds. Do not repeat my mistakes. :)
5. Transcribe. Speaking of embarrassing mistakes: I tried the free Autosub library, and after seeing a word error rate under 30% on a couple of files I tested, I launched the training process, only to hear my AI-Alan moo and hiss during inference. After some uneventful but time-consuming debugging dives, it finally occurred to me to eyeball the transcripts, and it all became clear: garbage in, garbage out. I then spent a few dollars transcribing the audio via the Google Speech-to-Text API (second sketch after this list), which let my AI-Alan’s speech graduate from the animal kingdom to sapiens. (It cost me under $78 to transcribe over 50 hours of audio and took less than a couple of hours, with nearly flawless transcripts; compare that to almost 10 hours of processing by Autosub distributed via multiprocessing over 16 cores, with barely intelligible transcription in the end.)
6. Create reference files. This step mostly boils down to removing transcripts that have no matching audio (and vice versa), plus setting up the filelists that both the Tacotron2 and Waveglow models require. Your Tacotron2 audio_text_test_filelist.txt or metadata.csv consists of path_to_file|transcript pairs, like “…/1_1_The_Four_Noble_Truths.wav|you realized that is impossible to do because the motive”. Waveglow’s file is just a list of audio files to train on, no transcripts required.
7. Download pre-trained models published in the official repos for Tacotron2 as well as Waveglow (links are in the corresponding repo README).
8. And finally, get your mind broken trying to figure out the hyperparameters that make it all work. Full credit to the Issues opened in the Tacotron2 repo, ChatGPT, and Stack Overflow for helping me figure out what is what. I made some notes for future travelers here.
It all made sense to my tired brain at the time, but do let me know if you spot inconsistencies. I’ll talk more about parameters below and why, after tons of experimentation, I ended up going mostly with defaults.
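For steps 1 and 4 above, here is the kind of pydub + multiprocessing sketch I have in mind. It assumes ffmpeg is installed, and the folder names, 22,050 Hz mono target, and silence thresholds are illustrative; you will need to tune them against your own recordings and your Tacotron2 hparams.
# Sketch for steps 1 and 4: convert everything to wav, then cut it into 4-7 second clips.
# Assumes ffmpeg is on PATH; folder names and thresholds are illustrative.
from multiprocessing import Pool
from pathlib import Path
from pydub import AudioSegment
from pydub.silence import split_on_silence

SRC_DIR, OUT_DIR = Path("raw_audio"), Path("wavs")

def convert_and_split(path: Path) -> None:
    audio = AudioSegment.from_file(str(path)).set_frame_rate(22050).set_channels(1)
    chunks = split_on_silence(audio,
                              min_silence_len=400,   # ms of silence that counts as a pause (tune)
                              silence_thresh=-40,    # dBFS threshold (tune)
                              keep_silence=150)
    for i, chunk in enumerate(chunks):
        if 4_000 <= len(chunk) <= 7_000:             # pydub lengths are in milliseconds
            chunk.export(str(OUT_DIR / f"{path.stem}_{i:04d}.wav"), format="wav")

if __name__ == "__main__":
    OUT_DIR.mkdir(exist_ok=True)
    with Pool() as pool:                             # one worker per CPU core by default
        pool.map(convert_and_split, sorted(SRC_DIR.glob("*.*")))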
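And for steps 5 and 6, a hedged sketch of transcribing the short clips with the Google Cloud Speech-to-Text client and writing the pipe-separated filelist Tacotron2 expects. Authentication, quotas, and pricing are per Google’s docs; the class and field names below follow the v1 client library, so double-check them against the current API, and the output file name is just a placeholder.
# Sketch for steps 5 and 6: transcribe short clips, then build "path|transcript" pairs.
# Requires `pip install google-cloud-speech` and GOOGLE_APPLICATION_CREDENTIALS to be set.
from pathlib import Path
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=22050,
    language_code="en-US",
)

lines = []
for wav in sorted(Path("wavs").glob("*.wav")):               # the 4-7 s clips from the previous step
    audio = speech.RecognitionAudio(content=wav.read_bytes())
    response = client.recognize(config=config, audio=audio)  # sync API is fine for sub-minute clips
    transcript = " ".join(r.alternatives[0].transcript for r in response.results).strip()
    if transcript:                                           # drop clips the API couldn't transcribe
        lines.append(f"{wav.as_posix()}|{transcript}")

# Tacotron2's filelist is one "path|transcript" pair per line; Waveglow's is just the paths.
Path("audio_text_train_filelist.txt").write_text("\n".join(lines), encoding="utf-8")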
Speech Synthesis: Tacotron2
The number of hours I spent debugging why the alignment wouldn’t form, before realizing that all I had to do was shorten the audio clips and get the transcripts right, is too painful to admit.
With my modest 8GB GPU and clips of 4–7 seconds, I was able to squeeze in a batch size of 32, starting with a learning rate of 1e-3 and finishing at 1e-7 towards the end. After fixing the main issues with the audio and transcripts, I no longer had to spend time fine-tuning Tacotron2’s hyperparameters; everything finally fell into place.
I trained for about 500 epochs, which took a (very) long time considering a single GPU and 44,151 audio clips per epoch, but patient turtles win the race 🐢. I could’ve switched over to the cloud as I did for text generation or audio transcribing, but after setting up all of the environments locally, it felt unbearable to transition.
I didn’t say much about the setup (traumatic memories; let sleeping dogs lie), but you do obviously need to clone the repos and install all of the dependencies, incl. PyTorch with CUDA. Make sure your GPU is CUDA-compatible; here’s a MUST-read on Stack Overflow to confirm that both the GPU and the drivers support the CUDA version you’ve installed (or are about to install).
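A quick sanity check I would run before anything else; these are standard PyTorch calls, nothing specific to the Tacotron2 or Waveglow repos:
# Quick CUDA sanity check before training; plain PyTorch, nothing repo-specific.
import torch

print(torch.cuda.is_available())   # True means PyTorch can actually see a CUDA device
print(torch.version.cuda)          # the CUDA version this PyTorch build was compiled against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 2070"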
I ended up using the Anaconda terminal, and would typically have 3 sessions open:
- 1 for training,
- 1 for checking logs in Tensorboard,
- and 1 with a Jupyter notebook for occasional inference checks via inference.ipynb.
So, after pre-processing step #8 above, where you update your hyperparameters to point to the correct file paths, you’ll likely want to warm-start from the Tacotron2 pre-trained model you downloaded in step #7. For those unaware, warm-starting initializes the encoder and decoder networks with pre-trained weights from another model, which helps the model learn more quickly and effectively from the available training data, leading to faster convergence.
# warm start from a pretrained model
conda activate pytorch3
python E:/tacotron2/train.py --output_directory E:/tacotron2/checkpoints --log_directory E:/tacotron2/logdir --max-split-size-mb 8000 -c tacotron2_statedict.pt --warm_start
# resume from a checkpoint (session #1)
python E:/tacotron2/train.py --output_directory E:/tacotron2/checkpoints --log_directory E:/tacotron2/logdir --max-split-size-mb 8000 -c checkpoints/checkpoint_50000
# launch Tensorboard (session #2)
conda activate pytorch3
tensorboard --logdir="E:/tacotron2/logdir/" # launch the board
# debug if Tensorboard is empty
tensorboard --inspect --logdir="E:/tacotron2/logdir/" # check if it's seeing the event files
# run inference (session #3)
conda activate pytorch3
cd /d e:/tacotron2
jupyter notebook # inference.ipynb
I was mostly interested in making sure that the training loss (how well the model is able to fit the training data) and validation loss (how well the model is performing against the validation set) are both steadily trending downward, and kept an eye on the ever-present alignment.
In TensorBoard, the x-axis for these two charts represents the number of training steps, and the y-axis represents the actual loss value.
As a general rule of thumb, a reasonable expected decrease in validation loss for Tacotron2 training is around 0.1 to 0.2 per epoch (which typically consists of several thousand steps). When either starts plateauing, I would decrease the learning rate and restart from the latest checkpoint (I started with 1e-3 and ended on 1e-7). I would also run periodic inference using different checkpoints to check synthesis quality.
Note that you may run into situations where one loss is decreasing, e.g. training loss dropping, while validation loss stays flat or even increases; this might indicate that the model is overfitting and not generalizing well to the validation data. Hence it’s important to monitor both the training and validation losses and consider techniques such as early stopping, regularization, or increasing the amount of data to prevent overfitting.
Training time. You can see in the repo that the suggested number of epochs to train for is 500, meaning the model is expected to go over the entire training dataset 500 times. Say you have 10,000 clips and can go as high as 32 for your batch size: the overall run you might plan for is (500 epochs x 10,000 clips) / 32 = 156,250 steps. Generally, it is recommended to train Tacotron2 until it converges and the validation/training loss stops improving significantly.
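The same back-of-the-envelope arithmetic as a throwaway helper, in case you want to sanity-check your own plan (the numbers are just the example above):
# Back-of-the-envelope step count for a training plan; mirrors the arithmetic above.
def planned_steps(num_clips: int, batch_size: int, epochs: int) -> int:
    return epochs * num_clips // batch_size

print(planned_steps(num_clips=10_000, batch_size=32, epochs=500))  # 156250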
Alignment. A fun fact I learned: the stable, persistent alignment diagonal may not form at all during training. While the alignment diagonal can be an indication of good model performance, it is not a guarantee of solid synthesis quality. It’s expected to form during the early stages of training when the model is still learning to align each input frame with its corresponding output frame. However, as the model learns to align frames that are not necessarily adjacent, the diagonal attention pattern may dissolve, and the attention mechanism would become more distributed and flexible. This is generally expected to happen in later stages of training when the model has been trained with enough data and given enough time to learn more complex alignments.
Therefore, you do not need to restart the training if the attention diagonal breaks and dissolves. Instead, it is more important to monitor the training/validation loss and the convergence of the model, and stop training when there is no significant improvement in performance after a reasonable amount of time.
Quite often I would see no diagonal during training, but get it during inference.
I remember seeing this gif showing incremental alignment formation, but to me it felt like an impossible standard as my diagonal would form and dissolve, then form again, which was something I’d expect to happen during training for reasons mentioned above.
Eventually, I stopped at 621K steps. I had 44,151 audio clips and used a batch size of 32, so one full epoch took about 1,380 steps; dividing 621K by 1,380 lands at roughly 450 completed epochs. The quality of my MVP was good enough to stop, and stop I did.
Speech Synthesis: Waveglow
Now it was Waveglow’s turn! You can configure params for Waveglow in its configuration file. I mostly played around with the learning rate and sigma parameters (landing on 1e-4, dropping to 1e-8 towards the end, plus a sigma of 0.666–1), leaving everything else be. Training Waveglow demands more processing power, so I was only able to fit a batch size of 6. As you can see from the config, the suggested number of epochs is 100,000… there was no way I would stick to that many considering the goal was a simple proof of concept, so I let it run (gradually decreasing the learning rate, as with Tacotron2) until about 754K steps, or ~102 epochs.
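For reference, the knobs I touched live under the train_config section of config.json. Here is a hedged sketch of overriding them from Python instead of hand-editing; the key names match my copy of the repo’s config, but verify them against yours before launching train.py (commands just below):
# Sketch: tweak Waveglow's config.json programmatically; verify the key names against your copy.
import json

with open("E:/tacotron2/waveglow/config.json") as f:
    config = json.load(f)

config["train_config"]["learning_rate"] = 1e-4   # dropped toward 1e-8 over the course of training
config["train_config"]["sigma"] = 1.0            # I experimented in the 0.666-1 range
config["train_config"]["batch_size"] = 6         # all my 8GB GPU could fit
config["train_config"]["epochs"] = 100000        # repo default; I stopped far earlier (~102 epochs)

with open("E:/tacotron2/waveglow/config.json", "w") as f:
    json.dump(config, f, indent=4)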
# train
conda activate waveglow3 # I had one conda env set up for Tacotron2 and one for Waveglow; as the "3" suggests, the first two attempts were not as successful
cd /d e:/tacotron2/waveglow
python E:/tacotron2/waveglow/train.py -c E:/tacotron2/waveglow/config.json
# tensorboard
tensorboard --logdir="E:/tacotron2/waveglow/checkpoints/logs"
# inference
conda activate pytorch3
cd /d e:/tacotron2
jupyter notebook # inference.ipynb
The problem I had with Waveglow was that if I interrupted training and later resumed from a checkpoint, it would start logging from 0 (unlike Tacotron2, where you get a cumulative iteration count with the corresponding historical trends in Tensorboard). I’m not sure if there is a way around it, and it didn’t bug me much, except that it was hard to see the historical trend for training loss and keep track of completed epochs.
Text Generation: aitextgen
Towards the end of this endeavor I was inclined to keep it simple and leave sophistication for a future project, so I went with aitextgen, which took me under 15 minutes to set up and start generating results. I’ve published some helpful links as part of the Colab and included some snippets of the output that do sound like something Alan Watts might have said. I would generate a sentence, then feed it back in to generate more.
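For the curious, the whole loop fits in a handful of lines. This is a sketch from memory rather than a verbatim copy of my Colab, so the exact arguments and defaults should be checked against aitextgen’s docs; the corpus file name is the hypothetical one from the preprocessing sketch above.
# Sketch: fine-tune GPT-2 (124M) on the cleaned corpus with aitextgen, then sample from it.
# Arguments are illustrative; check aitextgen's docs for current defaults.
from aitextgen import aitextgen

ai = aitextgen(tf_gpt2="124M", to_gpu=True)
ai.train("alan_watts_clean.txt",    # the cleaned corpus from the preprocessing step
         num_steps=3000,            # illustrative; tune to taste
         generate_every=500,        # print periodic samples to eyeball progress
         save_every=500)

# Generate a sentence, then feed it back in as the next prompt to keep the thread going.
prompt = "The real secret of life"
for _ in range(3):
    prompt = ai.generate_one(prompt=prompt, max_length=120, temperature=0.9)
    print(prompt, "\n")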
Results
The question is, does the voice sound at least a little bit like his as well? A reminder that the results below come from Tacotron2 trained for ~450 epochs and Waveglow for ~100 epochs. I did not have any clean audiobooks available for training; it was mostly recorded live lectures, so the original quality is questionable to begin with. (I know, excuses, excuses…)
The modern world feels transformed into a marvelous degree, into a single bioelectronic body.
In many years, it is going to be absolutely incomprehensible for most people that they are not one with the universe.
What is happening now is a game. That is, the whole world is playing.
Conclusion
In the words of the one and only Alan Watts, “This is the real secret of life — to be completely engaged with what you are doing in the here and now. And instead of calling it work, realize it is play.” Considering this was my first time working with speech synthesis and transformers in general, and with these specific models in particular, I pronounce the game a success and the play fully enjoyed. On to more tech toys and endeavors!