Azure Custom Neural Voice: Tips and Tricks

Custom Neural Voice (CNV) is a part of Azure Cognitive Services that lets you create a customized, synthetic voice for your applications. It's a text-to-speech feature, which enables you to build a highly natural-sounding voice for your brand or characters by providing human speech samples as training data.

I recently worked on a project that involved generating a custom voice, and there are certain features and hidden issues that are not covered in the official documentation. This is why I would like to share some tips and tricks in this article.

As the theoretical part is pretty well documented, the advice in this post is mostly based on my personal experience. Hopefully you will find it useful. Here we go!

Audio Recording

First, you need to prepare a well-balanced script. Remember that providing the right proportion of question, exclamation and statement sentences is more crucial than making the training set as close to the target domain as possible. To sum up, a good dataset is composed of:

Statement sentences: 70-80%
Questions: 10-20%, with an equal number of rising and falling tunes (rising intonation is typical of yes/no questions, whereas a falling tune is very common in wh-questions)
Exclamation sentences: 10-20%
Short words/phrases: 10%

You can refer to this repository to compose your dataset. Statement sentences start with 00, questions with 01 and exclamations with 02.
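To check the balance of a draft script, a small sketch like the one below can help. It assumes that each utterance has an ID whose first two characters follow the 00/01/02 convention above; the sample IDs and the function name are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Assumed mapping from ID prefix to sentence type (the 00/01/02 convention above).
PREFIX_TO_TYPE = {"00": "statement", "01": "question", "02": "exclamation"}


def type_shares(utterance_ids):
    """Return the percentage of each sentence type, rounded to one decimal."""
    counts = Counter(PREFIX_TO_TYPE.get(uid[:2], "other") for uid in utterance_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


# Hypothetical IDs, for illustration only.
sample_ids = ["0000000001"] * 7 + ["0100000001"] * 2 + ["0200000001"]
print(type_shares(sample_ids))
```

Comparing the printed shares with the target ranges shows quickly whether the script needs more questions or exclamations before the recording session.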

Second, it is very useful to have a monitor in the recording room, but if you don't have one, you can print out three copies of your script: one for yourself, one for the artist (we call them the voice talent) and one for the sound engineer (if it is not you). The most convenient format is a Word table with three columns: number, utterance and status (to mark the processed phrases). Don't forget to shade alternate rows or columns in the table, as it makes navigation easier.
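If the script lives in a plain text file, the three-column table can be generated rather than typed by hand. The sketch below is only one possible approach: it writes CSV (which Word or Excel can open and convert into a table) instead of a Word document, and the function name and sample utterances are hypothetical.

```python
import csv
import io


def write_script_table(utterances, out):
    """Write (number, utterance, status) rows; status is left blank for marking."""
    writer = csv.writer(out)
    writer.writerow(["number", "utterance", "status"])
    for number, text in enumerate(utterances, start=1):
        writer.writerow([number, text, ""])


# Illustration with an in-memory buffer; pass an open file to save the table instead.
buffer = io.StringIO()
write_script_table(["Hello there.", "How are you?"], buffer)
print(buffer.getvalue())
```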

Finally, I suggest recording all the utterances in a single take, with regular pauses, rather than saving the recording in parts. I've tried both, and can assure you that there's no gain from multiple exports (0-100, 100-200, etc.). Moreover, it makes the session longer, which matters even more if your voice talent has a very busy schedule. If there are errors during the recording, don't cut them out during the session; note the timestamp somewhere and remove the faulty part afterwards, at the pre-processing stage. This is another reason to make a single large export: with these notes you will be able to locate the errors easily. Make long pauses between the utterances (at least 3 seconds), as this will allow you to cut them easily at the processing stage.
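The long pauses also make it possible to locate utterance boundaries programmatically. The following is a minimal sketch, not the workflow used in this article: it assumes the recording is already loaded as a sequence of mono PCM amplitudes, and the amplitude threshold of 500 is an arbitrary value that would need tuning for a real room.

```python
def find_utterances(samples, rate, threshold=500, min_pause_s=3.0):
    """Return (start, end) sample indices of segments separated by long pauses.

    samples: sequence of signed PCM amplitudes (mono).
    A pause is a run of at least min_pause_s seconds where |amplitude| <= threshold.
    """
    min_pause = int(min_pause_s * rate)
    segments = []
    start = None      # first loud sample of the current segment
    last_loud = None  # most recent loud sample seen
    for i, value in enumerate(samples):
        if abs(value) > threshold:
            if start is None:
                start = i
            elif i - last_loud > min_pause:
                # The gap since the last loud sample was a long pause: close the segment.
                segments.append((start, last_loud + 1))
                start = i
            last_loud = i
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments
```

Short silences inside an utterance (shorter than the minimum pause) do not split it, which is why a pause of at least 3 seconds between utterances is convenient.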

Sound editing software

There are several possible solutions, such as Adobe Audition or Audacity. I suggest the second one, and not because it's free while the first one is paid. Audacity offers limited functionality, which is great in our case, as we only need to select an utterance, export it and cut it out. Minimalism is the key to success. Moreover, it's easier to navigate the tracks, and you can minimize all the unnecessary toolboxes.

Finally, the File menu provides commands for creating, opening and saving Audacity projects and for importing and exporting audio files. For instance, the export function has no keyboard shortcut assigned by default, so you can easily create one to export your selection. This is great, as it dramatically speeds up processing. I've tried both tools: with Audacity I finished the work within 2 working days, compared with 4 for the same amount of data in Adobe Audition.

Price

Here are my project details:

Model type: Neural V5.2022.05
Engine version: 2023.01.16.0
Training hours: 30.48
Data size: 440 utterances
Price: $1584.27

The price may vary depending on the engine version and the number of training hours, but at least you have a sample.
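To compare this sample with your own project, it can help to reduce it to unit costs. The snippet below simply divides the figures listed above; it is an effective rate derived from this one project, not an official price.

```python
# Figures from the project details above.
training_hours = 30.48
utterances = 440
price_usd = 1584.27

print(f"Per training hour: ${price_usd / training_hours:.2f}")
print(f"Per utterance:     ${price_usd / utterances:.2f}")
```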

Intake form

As you probably know, access is only granted after you fill in the intake form. Before providing all the project info, please refer to Microsoft's Responsible AI Standards; it will help you adjust the description and the scenario.

Audio Preparation

The process is quite straightforward. Create a Notepad file with all the utterances and their IDs. Select the utterances one by one, export each one, save it using its ID as the file name, and delete it from the Notepad file. Choose the optimal zoom level beforehand, and don't zoom in or out during the work: you will get used to the timeline scale and will add the required 100-200 ms of silence more easily.
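For those who prefer to script this step, here is a rough sketch of the same bookkeeping. It assumes you already have the (start, end) sample positions of each utterance in the long recording, however they were obtained, and it widens each one by 150 ms of the surrounding recorded pause (a value picked from the 100-200 ms range). The function names are hypothetical, and the actual audio export is left out.

```python
def widen_segment(start, end, rate, total_samples, pad_ms=150):
    """Extend a (start, end) sample range by pad_ms on each side, clamped to the recording."""
    pad = rate * pad_ms // 1000
    return max(0, start - pad), min(total_samples, end + pad)


def export_plan(segments, utterance_ids, rate, total_samples, pad_ms=150):
    """Pair each segment with its ID and return (file_name, start, end) tuples."""
    if len(segments) != len(utterance_ids):
        raise ValueError("segment count does not match ID count")
    return [
        (f"{uid}.wav", *widen_segment(s, e, rate, total_samples, pad_ms))
        for uid, (s, e) in zip(utterance_ids, segments)
    ]
```

The length check catches the most common mistake early: a missed or duplicated utterance that would shift every following ID by one.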

Automatic Suspend

Endpoint hosting may be expensive, so some companies prefer to keep the endpoint up and running during working hours only. Instead of doing this manually, you may want to automate it. During my first project I considered creating a Power Automate job that would click the suspend button, but fortunately there is now a suspend/resume operation available through the REST API. For instance, you can create a time-triggered Azure Function, and it will save you at least 30% of the hosting cost. Note that although it is possible to integrate the custom voice endpoint into your virtual networks, the API to suspend/resume voice models does not support private endpoints.
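The logic of such a scheduled job could look roughly like the sketch below. Treat it as an illustration under assumptions: the URL pattern and the Ocp-Apim-Subscription-Key header reflect my understanding of the Custom Voice endpoint API and should be checked against the current REST reference, while the 08:00-18:00 weekday window, the region, and the function names are hypothetical. The Azure Functions timer binding itself is not shown.

```python
import urllib.request
from datetime import datetime

# Assumption: URL pattern of the endpoint management API; verify it in the official docs.
URL_TEMPLATE = (
    "https://{region}.customvoice.api.speech.microsoft.com"
    "/api/texttospeech/v3.0/endpoints/{endpoint_id}/{action}"
)


def desired_action(now, start_hour=8, end_hour=18):
    """Return 'resume' during working hours on weekdays, otherwise 'suspend'."""
    working = now.weekday() < 5 and start_hour <= now.hour < end_hour
    return "resume" if working else "suspend"


def build_request(region, endpoint_id, key, action):
    """Build (but do not send) the POST request for the given action."""
    url = URL_TEMPLATE.format(region=region, endpoint_id=endpoint_id, action=action)
    return urllib.request.Request(
        url,
        data=b"",  # empty body
        method="POST",
        headers={"Ocp-Apim-Subscription-Key": key},
    )


# Inside the timer-triggered function you would send the request, for example:
# request = build_request("westeurope", "<endpoint-id>", "<speech-key>",
#                         desired_action(datetime.now()))
# urllib.request.urlopen(request)
```

Keeping the decision and the request construction in separate functions makes the schedule easy to test without touching the real endpoint.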

Azure Custom Neural Voice is a great service that allows you to create a high-quality custom text-to-speech voice. Unlike the services that offer to clone anybody's voice from only 3 seconds of audio, this solution is meant to be deployed to production, which means that the service generates studio-quality audio and the overall solution is implemented in accordance with ethical standards. In this article I've shared some practical tips that will probably save you a couple of hours or even days. I hope you found it useful.