Azure Text to Speech with Python without SDK

Alibek Jakupov
Nov 1, 2022
5 min read

Did you know that your applications, tools, or devices can turn text into human-sounding synthetic speech by using the Azure text-to-speech technology? Speech synthesis is a different name for the text-to-speech functionality. Use prebuilt neural voices that are humanlike right out of the box, or develop a custom neural voice that is specific to your brand or product.

The neural text-to-speech engine in the Azure Speech service has recently been overhauled. Because the engine relies on deep neural networks, its synthetic voices are nearly indistinguishable from recordings of people. Precise word articulation also significantly reduces listening fatigue when users interact with AI systems.

Prosody refers to the stress and intonation patterns used in spoken language. Conventional text-to-speech systems split prosody into separate linguistic analysis and acoustic prediction steps, each governed by its own model, which can lead to buzzy, muffled speech synthesis. Neural voices predict prosody and synthesize the voice together, which gives a more fluid and natural-sounding result.

If you have a look at the quickstart tutorial provided by Microsoft, you will notice that it suggests using the Python SDK. It is always great to use the SDK, as it allows you to start using the functionality right away without having to deal with the lower-level details.

But as the "Clean Code" author, Robert Martin aka Uncle Bob, suggests, using external SDKs and libraries adds a certain vulnerability to your code, as you're now dependent on the external package. Certainly, one can claim that even if you write your own wrapper around the API, you're still dependent on the text-to-speech service itself. However, at least it removes one additional layer of unnecessary dependence. Moreover, it makes debugging much easier, as you won't have to look for the error in external code.

That doesn't mean the SDK is useless. My point is that if you are using only a limited set of the service's features, you are better off creating your own wrapper, which you can test thoroughly and which is fully under your control (and your responsibility, of course). So if one day there are minor or major changes to the SDK, your code won't crash (at least not because of the SDK).

Sometimes, if you want to rapidly test a cognitive service but don't have time to learn an external library, it may be useful to have a code snippet at hand that does the job for you. This is why, in this short article, we'll see how to implement simple text-to-speech generation. No deep stuff, just a useful code snippet to copy and paste into your project. Up we go!

Prerequisites

Azure subscription
Create a Speech resource in the Azure portal.
Get the resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.

Step 1: Get the URL

Everything you need is in the API reference. In our case we already know the voice we want to use in our project, so we can go directly to the speech generation part. The URL pattern is quite easy to understand:

https://<your-region>.tts.speech.microsoft.com/cognitiveservices/v1

The region is available in the Azure portal, just under the API keys.
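As a minimal sketch, the pattern above can be wrapped in a tiny helper. The function name `build_tts_url` and the example region `westeurope` are my own illustrative choices, not something the API prescribes.

```python
def build_tts_url(region: str) -> str:
    """Return the text-to-speech endpoint for an Azure region (hypothetical helper)."""
    # Hostnames are case-insensitive, so normalising the region is harmless.
    region = region.strip().lower()
    if not region:
        raise ValueError("region must not be empty")
    return f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"


# "westeurope" is only an example; use the region shown in your portal.
print(build_tts_url("westeurope"))
```

Failing early on an empty region avoids a confusing DNS error later, when the request is actually sent.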

Step 2: Configure the headers

Here are the headers you need to provide:

Ocp-Apim-Subscription-Key: The API key obtained from the Azure portal (a fake example appears in the sample headers below).
Content-Type: Specifies the content type for the provided text. Accepted value: application/ssml+xml.
X-Microsoft-OutputFormat: Specifies the audio output format.

Here's the list of available output formats:

amr-wb-16000hz
audio-16khz-16bit-32kbps-mono-opus
audio-16khz-32kbitrate-mono-mp3
audio-16khz-64kbitrate-mono-mp3
audio-16khz-128kbitrate-mono-mp3
audio-24khz-16bit-24kbps-mono-opus
audio-24khz-16bit-48kbps-mono-opus
audio-24khz-48kbitrate-mono-mp3
audio-24khz-96kbitrate-mono-mp3
audio-24khz-160kbitrate-mono-mp3
audio-48khz-96kbitrate-mono-mp3
audio-48khz-192kbitrate-mono-mp3
ogg-16khz-16bit-mono-opus
ogg-24khz-16bit-mono-opus
ogg-48khz-16bit-mono-opus
raw-8khz-8bit-mono-alaw
raw-8khz-8bit-mono-mulaw
raw-8khz-16bit-mono-pcm
raw-16khz-16bit-mono-pcm
raw-16khz-16bit-mono-truesilk
raw-22050hz-16bit-mono-pcm
raw-24khz-16bit-mono-pcm
raw-24khz-16bit-mono-truesilk
raw-44100hz-16bit-mono-pcm
raw-48khz-16bit-mono-pcm
webm-16khz-16bit-mono-opus
webm-24khz-16bit-24kbps-mono-opus
webm-24khz-16bit-mono-opus

If you don't have any system limitations, you can always choose the highest frequency with the highest bit rate, e.g. audio-48khz-192kbitrate-mono-mp3. Honestly, I'm not very familiar with the domain, so this may be completely wrong. The intuition behind it is quite naive: the higher, the better.
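Whichever format you pick, the file you save should carry a matching extension. The helper below is a rough sketch that guesses one from the format name; the function name `extension_for` and the mapping itself are my own naming conventions (assumptions), not something returned by the service.

```python
def extension_for(output_format: str) -> str:
    """Guess a file extension from an X-Microsoft-OutputFormat value (hypothetical helper)."""
    fmt = output_format.lower()
    if fmt.endswith("-mp3"):
        return ".mp3"
    if fmt.startswith("ogg-"):
        return ".ogg"
    if fmt.startswith("webm-"):
        return ".webm"
    if fmt.startswith("riff-"):
        # Assumption: RIFF/WAV formats, which are not in the list above.
        return ".wav"
    if fmt.startswith("amr-wb"):
        # Assumption: .awb is a common extension for AMR-WB audio.
        return ".awb"
    if fmt.startswith("raw-"):
        # Headerless audio: players need the sample rate and encoding
        # supplied separately.
        return ".raw"
    if fmt.endswith("-opus"):
        return ".opus"
    raise ValueError(f"unrecognised output format: {output_format}")


print(extension_for("audio-48khz-192kbitrate-mono-mp3"))
```

The container prefixes (ogg, webm) are checked before the generic opus suffix, because those format names also end in `-opus`.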

Here are the sample headers (with a fake subscription key):

Ocp-Apim-Subscription-Key: 2e5827c90e3f5b7e09758387419g7bd0
Content-Type: application/ssml+xml
X-Microsoft-OutputFormat: audio-48khz-192kbitrate-mono-mp3
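The same three headers can be assembled in Python. This is only a sketch: `build_headers` and `redact` are hypothetical helper names, and the key used in the example is the fake one from above.

```python
def build_headers(subscription_key: str,
                  output_format: str = "audio-48khz-192kbitrate-mono-mp3") -> dict:
    """Build the three request headers described above (hypothetical helper)."""
    if not subscription_key:
        raise ValueError("subscription key must not be empty")
    return {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
    }


def redact(headers: dict) -> dict:
    """Return a copy of the headers that is safe to print or log."""
    safe = dict(headers)
    if "Ocp-Apim-Subscription-Key" in safe:
        safe["Ocp-Apim-Subscription-Key"] = "***"
    return safe


# The key below is the fake example from the article, not a real secret.
headers = build_headers("2e5827c90e3f5b7e09758387419g7bd0")
print(redact(headers))
```

Printing the redacted copy rather than the real headers keeps the key out of logs and screenshots while you debug.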

Step 3: Prepare the body

Azure text-to-speech can synthesize raw text, but you can also provide the text in SSML format. I would suggest using SSML (the Content-Type header above expects it anyway), as it allows you to do the following:

Choose a voice for text-to-speech
Use multiple voices
Adjust speaking styles, Style degree and Role
Adjust speaking languages
Add or remove a break or pause
Add silence
Specify paragraphs and sentences
Use phonemes to improve pronunciation
Use custom lexicon to improve pronunciation
Add background audio
Adjust emphasis
etc.

It should be mentioned, however, that sophisticated features such as adjusting speaking languages or adding emphasis seem to work only with English neural voices (Jenny, for example). I tried them with the French voices and could not detect any difference.

Tip: if you want to rapidly generate an SSML document, you can simply go to the official website, select the desired voice, pitch and prosody and click on the SSML tab.

Here's the sample body:

<speak
	xmlns="http://www.w3.org/2001/10/synthesis"
	xmlns:mstts="http://www.w3.org/2001/mstts"
	xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US">
	<voice name="fr-FR-AlainNeural">
		<prosody rate="0%" pitch="0%">Vous pouvez remplacer ce texte par le texte de votre choix. Vous pouvez écrire dans cette zone de texte ou coller votre propre texte ici. Essayez différentes langues et voix. Modifiez la vitesse et le ton de la voix. Vous pouvez même adapter le langage SSML (Speech Synthesis Markup Language) pour contrôler la prononciation des différentes sections du texte. Cliquez sur SSML ci-dessus pour essayer! Profitez de la synthèse vocale!</prosody>
	</voice>
</speak>
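If the text comes from user input rather than from a hand-written file, it may be safer to generate the SSML in code so that characters such as `&` and `<` are escaped. The sketch below is one possible way to do it with the standard library; `build_ssml` is a hypothetical helper, and its defaults simply mirror the sample body above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str,
               voice: str = "fr-FR-AlainNeural",
               lang: str = "en-US",
               rate: str = "0%",
               pitch: str = "0%") -> str:
    """Wrap plain text in a minimal SSML document (hypothetical helper)."""
    # quoteattr() adds the surrounding quotes and escapes the value;
    # escape() protects &, < and > inside the spoken text.
    return (
        '<speak xmlns="http://www.w3.org/2001/10/synthesis" '
        f'version="1.0" xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>'
        f'{escape(text)}'
        '</prosody></voice></speak>'
    )


ssml = build_ssml("Tom & Jerry")
ET.fromstring(ssml)  # raises ParseError if the document is not well-formed
print(ssml)
```

Parsing the result once with `ElementTree` is a cheap sanity check that the escaping worked before anything is sent over the network.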

Step 4: Pretest the configs

Send a test request with the Thunder Client extension for VS Code or with Postman to check that the URL, headers, and body work.
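Before sending anything, it can also help to run a quick local check on the SSML body. The sketch below only verifies that the document is well-formed XML with a `<speak>` root and a `<voice>` child; `check_ssml` is a hypothetical helper, and passing it does not guarantee that the service will accept the document.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def check_ssml(ssml: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != f"{{{SSML_NS}}}speak":
        problems.append("root element should be <speak> in the SSML namespace")
    if root.find(f"{{{SSML_NS}}}voice") is None:
        problems.append("no <voice> element directly under <speak>")
    return problems


sample = (
    '<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" '
    'xml:lang="en-US"><voice name="fr-FR-AlainNeural">Bonjour</voice></speak>'
)
print(check_ssml(sample))
```

A check like this catches typos such as an unclosed tag locally, so the manual test in Thunder Client or Postman only has to confirm the key, region, and output format.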

Step 5: Write the code

First, create two environment variables, SPEECH_KEY and SPEECH_REGION.

import os

subscription_key = os.environ['SPEECH_KEY']
region = os.environ['SPEECH_REGION']
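A bare `os.environ[...]` lookup raises a `KeyError` that only names one variable at a time. As an optional sketch, a small wrapper can report everything that is missing in one go; `read_speech_config` is a hypothetical helper name.

```python
import os


def read_speech_config(env=os.environ):
    """Return (key, region) or raise one error listing every missing variable."""
    names = ("SPEECH_KEY", "SPEECH_REGION")
    # Treat unset and empty variables the same way.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return env["SPEECH_KEY"], env["SPEECH_REGION"]


# A plain dict stands in for the real environment in this example.
key, region = read_speech_config({"SPEECH_KEY": "fake-key", "SPEECH_REGION": "westeurope"})
print(region)
```

Accepting the environment as a parameter makes the function easy to test without touching the real process environment.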

Then provide the url and headers:

url = "https://{}.tts.speech.microsoft.com/cognitiveservices/v1".format(region)
headers = {
    "Ocp-Apim-Subscription-Key": subscription_key,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "audio-48khz-192kbitrate-mono-mp3"
}

Create an SSML file and save it somewhere on your drive. Then read the contents of the file into a single string. Important: don't forget to encode it to UTF-8, especially if you're using non-English characters.

# Read the SSML document as text, decoding it explicitly as UTF-8
with open("./sample/path/my_pitch.ssml", encoding="utf-8") as file:
    ssml = file.read()

# Send the body as UTF-8 bytes so non-ASCII characters survive
ssml = ssml.encode('utf-8')

We are now ready to read the contents of our response and save it to a local file:

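import requests

response = requests.post(url=url, data=ssml, headers=headers)
# Raise on 4xx/5xx instead of saving an error message as audio
response.raise_for_status()

with open("output.mp3", "wb") as f:
    f.write(response.content)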
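When the request fails, the status code usually hints at what went wrong. The sketch below maps a few standard HTTP status codes to likely causes in this setup; `explain_status` is a hypothetical helper, and the suggested causes are my own guesses based on general HTTP semantics rather than an official list.

```python
# Standard HTTP meanings, paired with guesses about the likely cause here.
HINTS = {
    400: "Bad request: check the SSML body and the required headers.",
    401: "Unauthorized: check the subscription key and the region in the URL.",
    415: "Unsupported media type: check the Content-Type header.",
    429: "Too many requests: slow down and retry later.",
}


def explain_status(status_code: int) -> str:
    """Return a short, human-readable hint for an HTTP status code."""
    if 200 <= status_code < 300:
        return "OK"
    return HINTS.get(status_code, f"Unexpected HTTP status {status_code}")


print(explain_status(401))
```

With `requests`, the value to pass in is `response.status_code`.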

Don't forget that the output format is defined by the X-Microsoft-OutputFormat header, so the file is not necessarily an .mp3. With another format it could be a .wav file, for example, and the file name should change accordingly.

The full code looks like this:

import os
import requests


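def get_voice():
    # Read the credentials from the environment
    subscription_key = os.environ['SPEECH_KEY']
    region = os.environ['SPEECH_REGION']

    url = "https://{}.tts.speech.microsoft.com/cognitiveservices/v1".format(region)
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-48khz-192kbitrate-mono-mp3"
    }

    # Read the SSML document as text, decoding it explicitly as UTF-8
    with open("./sample/path/my_pitch.ssml", encoding="utf-8") as file:
        ssml = file.read()

    # Send the body as UTF-8 bytes so non-ASCII characters survive
    ssml = ssml.encode('utf-8')

    response = requests.post(url=url, data=ssml, headers=headers)
    # Raise on 4xx/5xx instead of saving an error message as audio
    response.raise_for_status()

    # The extension should match the X-Microsoft-OutputFormat header
    with open("output.mp3", "wb") as f:
        f.write(response.content)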

The main goal of this short article was to provide you with a code snippet for easily testing the Azure text-to-speech service without the SDK, with the help of an SSML file. Some aspects could be improved; for instance, we could have generated a Bearer token instead of passing the Ocp-Apim-Subscription-Key directly in the header. But hopefully it was useful and allowed you to test the service rapidly before integrating it into your project.
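For the curious, here is a rough sketch of the Bearer token variant. To my understanding, the Speech service issues short-lived tokens from an `issueToken` endpoint in exchange for the subscription key; please verify the endpoint and the token lifetime against the official documentation. The helper names are hypothetical, and I use the standard library's `urllib` here only to keep the sketch free of dependencies.

```python
import urllib.request


def token_url(region: str) -> str:
    """Token endpoint for a region (assumption based on my reading of the docs)."""
    return f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"


def fetch_token(subscription_key: str, region: str, timeout: float = 10.0) -> str:
    """Exchange the subscription key for an access token (performs a network call)."""
    request = urllib.request.Request(
        token_url(region),
        data=b"",  # empty body; the key travels in the header
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def bearer_headers(token: str,
                   output_format: str = "audio-48khz-192kbitrate-mono-mp3") -> dict:
    """Headers for the synthesis request, using the token instead of the key."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
    }


# No network call here: the token is a placeholder for illustration.
print(bearer_headers("placeholder-token")["Authorization"])
```

The benefit is that the long-lived key is sent only to the token endpoint, while the synthesis requests carry a token that expires on its own.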