Table of contents
1. Introduction
2. Basics of Text-to-Speech
2.1. Speech synthesis
2.2. Voices
2.2.1. WaveNet voices
2.3. Speech Synthesis Markup Language (SSML) support
3. Prerequisites
3.1. Set up your Google Cloud Platform project
3.2. Create a new service account
3.3. Create a JSON key for your service account
3.4. Set your authentication environment variable
3.5. Disable the Text-to-Speech API
4. Create audio from text by using client libraries
4.1. Install the client library
4.2. Create audio data
5. Create audio from text by using the command line
5.1. Synthesize audio from text
6. Use device profiles for generated audio
6.1. Specifying an audio profile to use
7. Create voice audio files
7.1. Convert text to synthetic voice audio
7.2. Convert SSML to synthetic voice audio
8. List all supported voices
9. Decode base64-encoded audio content
10. Specify a regional endpoint
11. Frequently Asked Questions
11.1. What is speech-to-text accommodation?
11.2. What data does text-to-speech use?
11.3. Which algorithm is used in text-to-speech?
12. Conclusion
Last Updated: Mar 27, 2024

Overview of Text-to-Speech in GCP

Author: Nagendra

Introduction

Text-to-speech technology enables programmers to produce synthetic, human-sounding speech that can be played back. You can power your applications with the audio data files you produce with Text-to-Speech, and you can also utilise them to enhance media such as films and audio recordings.
This blog describes Text-to-Speech in GCP: how to create voice audio files, how to decode base64-encoded audio content, and how to generate audio from text using both client libraries and the command line.

Without further ado, let's get into the basics of Text-to-Speech.

Basics of Text-to-Speech

Text-to-speech is ideal for any application that plays audio of human speech to users. It enables you to produce speech output from any combination of strings, words, and sentences.
Imagine you have a voice assistant app that gives users natural-language feedback as playable audio files. Your app might perform an action and then respond to the user with human-like speech.

Speech synthesis

Synthesis is the conversion of text input into audio data, and synthetic speech is the result of synthesis. Input for Text-to-Speech can be either raw text or data in SSML format (discussed below). You can use the API's synthesis endpoint to produce a new audio file.
The speech synthesis process produces raw audio data as a base64-encoded string. Before an application can play it, the base64-encoded string must be decoded into an audio file. Most platforms and operating systems have tools for converting base64 text into playable media files.

Voices

Text-to-Speech creates raw audio data of natural, human-sounding speech. In other words, it produces audio that resembles a person talking. When you send a synthesis request to Text-to-Speech, you must specify a voice that "speaks" the words.
Text-to-Speech offers a variety of voices to choose from. The voices vary in terms of accent, language, and gender (for some languages). Let's look into the details of WaveNet voices.

WaveNet voices

In addition to the standard synthetic voices, Text-to-Speech offers high-quality WaveNet-generated voices. Users perceive WaveNet voices as warmer and more human-like than other synthetic voices.

The main distinction of a WaveNet voice lies in the WaveNet model used to create it. WaveNet models were trained on raw audio recordings of real people speaking. As a result, these models produce synthetic speech with stress and intonation on syllables, phonemes, and words that are closer to human speech.

Speech Synthesis Markup Language (SSML) support

You can improve the synthetic speech that Text-to-Speech produces by marking up the text with Speech Synthesis Markup Language (SSML). SSML lets you add pauses, acronym pronunciations, and other extra detail to the audio data produced by Text-to-Speech.
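
For illustration, here is a small, hypothetical SSML document of the kind you could submit instead of plain text; the break and say-as elements used below are standard SSML features:

Code:

<speak>
  Here is a pause. <break time="500ms"/>
  The acronym <say-as interpret-as="characters">SSML</say-as> is spelled out letter by letter.
</speak>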

Let's look at the prerequisites of using Text-to-Speech.

Prerequisites

You must enable the API in the Google Cloud Platform Console before you can use Text-to-Speech. Complete the following steps before using the Text-to-Speech API:

Set up your Google Cloud Platform project

  • Log in to the console.
     
  • Visit the project selection page. You have the option of selecting an existing project or starting a new one. 
     
  • When you create a new project, you will be asked to connect a billing account to it. If you're using an existing project, make sure billing is enabled.
     
  • Once you've chosen a project and connected it to a billing account, you can activate the Text-to-Speech API. Enter "speech" into the Search products and resources box at the top of the page, then pick the Cloud Text-to-Speech API from the list of results.
     
  • Select the TRY THIS API option to test Text-to-Speech without attaching it to your project, or click ENABLE to make the Text-to-Speech API available for use with your project.
     
  • You must now connect the Text-to-Speech API to one or more service accounts. On the Text-to-Speech API page's left side, click the Credentials link.
     

If you don't already have a service account, create one by following the guidelines in the next section.

Create a new service account

If your project doesn't currently have a service account, create one; Text-to-Speech cannot be used without it. The following steps create a new service account:

  • Click on Create service account.
     
  • Enter a unique name for the new service account in the Service account name box. The Service account ID box is filled in automatically. If you intend to link several service accounts to your project, it's advisable to also fill in the optional Service account description box with a succinct description of the account. Then click CREATE AND CONTINUE.
     
  • In the role drop-down, scroll to Basic and select it, then pick a role for this service account from the choices that appear in the right-hand column. Choose CONTINUE.
     
  • The final step gives you the option to grant other entities (people, Google groups, and so forth) access to your service account. If you don't need to grant any more access, you can click DONE without providing any information.
     
  • The service account is now visible on the Service Accounts page. You can modify the service account's permissions, add or create new keys, and grant access at any time.

Create a JSON key for your service account

The newly created service account is displayed on the service accounts page. Next, create a private key associated with that account. You must use this private key to authenticate when you send requests to Text-to-Speech. If you decide not to create a key right away, you can do so at any time via the IAM & Admin -> Service Accounts menu item in the main navigation menu.

Follow these steps:

  • Click on the service account and choose KEYS to create a key. Click ADD KEY -> Create new key.
     
  • A new key in your preferred format is downloaded automatically. Make a note of the file path and keep this file in a secure location. At the start of each new Text-to-Speech session, you must point the GOOGLE_APPLICATION_CREDENTIALS environment variable at this file; this is one of the crucial steps in authenticating Text-to-Speech requests. The key's unique ID is shown next to the service account's name.

Set your authentication environment variable

To set GOOGLE_APPLICATION_CREDENTIALS, you need a service account connected to your project and access to that service account's JSON key.

  • Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to provide authentication credentials to your application code. This variable only applies to the current shell session. If you want it to apply to future shell sessions, set the variable in your shell startup file, such as ~/.bashrc or ~/.profile.

Command:

export GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"
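
Replace KEY_PATH with the location of your downloaded JSON key. For example (the path below is purely hypothetical):

Command:

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/keys/my-tts-project-key.json"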

Disable the Text-to-Speech API

Follow the steps to disable the Text-to-Speech API:

  • Go to your Google Cloud Platform dashboard.
     
  • Click the Go to APIs overview link in the APIs box to disable the Text-to-Speech API. At the top of the page, click Text-to-Speech API.
     
  • Choose the DISABLE API option.
     

Let's look at the details of creating audio from text by using client libraries.

Create audio from text by using client libraries

This section will guide you through the process of requesting audio from text utilising client libraries and Text-to-Speech.

Install the client library

The following command is used to install the client library:

Command:

go get cloud.google.com/go/texttospeech/apiv1
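
If you're working inside a Go module (the usual setup for modern Go), you would typically initialise the module before fetching the dependency; the module path below is just a placeholder:

Command:

go mod init example.com/tts-demo
go get cloud.google.com/go/texttospeech/apiv1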

Create audio data

You can now produce an audio file of artificial human speech using Text-to-Speech. To send a synthesis request to the Text-to-Speech API, use the following code.

Code:

package main
// Import the packages the example needs.
import (
        "context"
        "fmt"
        "io/ioutil"
        "log"
        // Text-to-Speech client library and generated request/response types.
        texttospeech "cloud.google.com/go/texttospeech/apiv1"
        texttospeechpb "google.golang.org/genproto/googleapis/cloud/texttospeech/v1"
)


func main() {
        // Instantiates a client.
        ctx := context.Background()


        client, err := texttospeech.NewClient(ctx)
        if err != nil {
                log.Fatal(err)
        }
        defer client.Close()


        // Perform the text-to-speech request on the text input with the selected
        // voice parameters and audio file type.
        req := texttospeechpb.SynthesizeSpeechRequest{
                // Set the text input to be synthesized.
                Input: &texttospeechpb.SynthesisInput{
                        InputSource: &texttospeechpb.SynthesisInput_Text{Text: "Hello, World!"},
                },
                // Build the voice request, select the language code ("en-US"), and the SSML
                // voice gender ("neutral").
                Voice: &texttospeechpb.VoiceSelectionParams{
                        LanguageCode: "en-US",
                        SsmlGender: texttospeechpb.SsmlVoiceGender_NEUTRAL,
                },
                // Select the type of audio file to return.
                AudioConfig: &texttospeechpb.AudioConfig{
                        AudioEncoding: texttospeechpb.AudioEncoding_MP3,
                },
        }


        resp, err := client.SynthesizeSpeech(ctx, &req)
        if err != nil {
                log.Fatal(err)
        }


        // The resp's AudioContent is binary.
        filename := "output.mp3"
        err = ioutil.WriteFile(filename, resp.AudioContent, 0644)
        if err != nil {
                log.Fatal(err)
        }
        fmt.Printf("Audio content written to file: %v\n", filename)
}

This will create the first request to Text-to-Speech.
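
Assuming GOOGLE_APPLICATION_CREDENTIALS is set as described in the prerequisites and the code above is saved as main.go inside a module, you can run it like this (the filename is a placeholder):

Command:

go run main.go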

Let's look at the details of creating audio from text by using the command line.

Create audio from text by using the command line

This section guides you through the process of requesting audio from text using the command line and Text-to-Speech.

Synthesize audio from text

You can convert text to audio by sending an HTTP POST request to the https://texttospeech.googleapis.com/v1/text:synthesize endpoint. In your POST body, specify the voice to synthesize with in the voice configuration section, the text to synthesize in the input section's text field, and the type of audio to create in the audioConfig section.

  • To create audio from text, run the REST request below on the command line. The command uses the gcloud auth application-default print-access-token command to obtain an authorization token for the request.

Command:

POST https://texttospeech.googleapis.com/v1/text:synthesize

Request JSON body:

{
  "input":{
    "text":"Test data"
  },
  "voice":{
    "languageCode":"en-GB",
    "name":"en-GB-Standard-A",
    "ssmlGender":"FEMALE"
  },
  "audioConfig":{
    "audioEncoding":"MP3"
  }
}
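
One common way to send this request, assuming you save the JSON body above to a file named request.json (a placeholder filename), is with curl:

Command:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://texttospeech.googleapis.com/v1/text:synthesize"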

You should receive a JSON response resembling the following:

JSON Response:

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}
  • The JSON result for the REST command contains the base64-encoded synthetic audio in the audioContent field. Copy the contents of the audioContent field into a new file called synthesize-output-base64.txt. The new file will contain only the long base64 string, similar to the value shown above.
     
  • Decode the data in the synthesize-output-base64.txt file into a new file called synthesized-audio.mp3, using the command shown after this list.
     
  • Play the audio in synthesized-audio.mp3 on a device or in an audio program. You can also open the synthesized-audio.mp3 file in the Chrome browser by navigating to its location, for example, file://my_file_path/synthesized-audio.mp3.
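
On Linux, for example, you can decode the file like this (on macOS the decode flag may be -D, depending on the base64 version installed):

Command:

base64 -d synthesize-output-base64.txt > synthesized-audio.mp3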
     

Let's dive into using the device profiles for generated audio.

Use device profiles for generated audio 

Text-to-Speech's synthetic speech can be tailored for playback on various types of hardware. For instance, if your app is predominantly used on smaller, so-called "wearable" devices, you can use the Text-to-Speech API to generate synthetic speech that is optimised specifically for smaller speakers.
You can also apply several device profiles to the same synthetic speech. The Text-to-Speech API applies device profiles to the audio in the order they appear in the request to the text:synthesize endpoint. The same profile shouldn't be specified more than once, because doing so can lead to unfavorable outcomes.
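A few of the audio profile IDs that the API documentation lists include wearable-class-device, handset-class-device, headphone-class-device, small-bluetooth-speaker-class-device, telephony-class-application, and large-home-entertainment-class-device; check the current documentation for the complete, up-to-date list.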

Specifying an audio profile to use

To specify the audio profile to use, set the effectsProfileId field of the speech synthesis request.

Code:

import (
        "context"
        "fmt"
        "io"
        "io/ioutil"

        texttospeech "cloud.google.com/go/texttospeech/apiv1"
        texttospeechpb "google.golang.org/genproto/googleapis/cloud/texttospeech/v1"
)



func audioProfile(w io.Writer, text string, outputFile string) error {
        // text := "hello"
        // outputFile := "out.mp3"


        ctx := context.Background()


        client, err := texttospeech.NewClient(ctx)
        if err != nil {
                return fmt.Errorf("NewClient: %v", err)
        }
        defer client.Close()


        req := &texttospeechpb.SynthesizeSpeechRequest{
                Input: &texttospeechpb.SynthesisInput{
                        InputSource: &texttospeechpb.SynthesisInput_Text{Text: text},
                },
                Voice: &texttospeechpb.VoiceSelectionParams{LanguageCode: "en-US"},
                AudioConfig: &texttospeechpb.AudioConfig{
                        AudioEncoding: texttospeechpb.AudioEncoding_MP3,
                        EffectsProfileId: []string{"telephony-class-application"},
                },
        }


        resp, err := client.SynthesizeSpeech(ctx, req)
        if err != nil {
                return fmt.Errorf("SynthesizeSpeech: %v", err)
        }


        if err = ioutil.WriteFile(outputFile, resp.AudioContent, 0644); err != nil {
                return err
        }


        fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)


        return nil
}
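
A hypothetical call to this helper from a main function (with os and log imported; the text and output filename are placeholders) might look like this:

Code:

func main() {
        if err := audioProfile(os.Stdout, "Hello from a telephony profile", "telephony.mp3"); err != nil {
                log.Fatal(err)
        }
}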

Let's look at the details of creating voice audio files.

Create voice audio files 

With Text-to-Speech, you can turn phrases and sentences into base64-encoded audio data that mimics real human speech. You can then transform the audio data into a playable audio file, such as an MP3, by decoding the base64 data. The Text-to-Speech API accepts input as either plain text or Speech Synthesis Markup Language (SSML).

Convert text to synthetic voice audio

The code sample that follows shows how to turn a string into audio data.

Code:

import (
        "context"
        "fmt"
        "io"
        "io/ioutil"

        texttospeech "cloud.google.com/go/texttospeech/apiv1"
        texttospeechpb "google.golang.org/genproto/googleapis/cloud/texttospeech/v1"
)

// SynthesizeText synthesizes plain text and saves the output to outputFile.
func SynthesizeText(w io.Writer, text, outputFile string) error {
        ctx := context.Background()

        client, err := texttospeech.NewClient(ctx)
        if err != nil {
                return err
        }
        defer client.Close()


        req := texttospeechpb.SynthesizeSpeechRequest{
                Input: &texttospeechpb.SynthesisInput{
                        InputSource: &texttospeechpb.SynthesisInput_Text{Text: text},
                },
                // Note: the voice can also be specified by name.
                // Names of voices can be retrieved with client.ListVoices().
                Voice: &texttospeechpb.VoiceSelectionParams{
                        LanguageCode: "en-US",
                        SsmlGender: texttospeechpb.SsmlVoiceGender_FEMALE,
                },
                AudioConfig: &texttospeechpb.AudioConfig{
                        AudioEncoding: texttospeechpb.AudioEncoding_MP3,
                },
        }


        resp, err := client.SynthesizeSpeech(ctx, &req)
        if err != nil {
                return err
        }


        err = ioutil.WriteFile(outputFile, resp.AudioContent, 0644)
        if err != nil {
                return err
        }
        fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
        return nil
}
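
As a usage sketch, you might invoke the function like this (the text and output filename are placeholders, and os and log are assumed to be imported):

Code:

if err := SynthesizeText(os.Stdout, "Hello, Ninjas!", "text-output.mp3"); err != nil {
        log.Fatal(err)
}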

Convert SSML to synthetic voice audio

You can get audio that sounds more like human speech by including SSML in your request for audio synthesis. In particular, SSML allows you to have finer-grained control over how the audio output reflects speech pauses or how dates, times, acronyms, and abbreviations are pronounced.

Code:

// SynthesizeSSML synthesizes ssml and saves the output to outputFile.
// (It assumes the same imports as SynthesizeText above.)
//
// ssml must be well-formed according to:
//
// https://www.w3.org/TR/speech-synthesis/
//
// Example: <speak>Hello there.</speak>
func SynthesizeSSML(w io.Writer, ssml, outputFile string) error {
        ctx := context.Background()

        client, err := texttospeech.NewClient(ctx)
        if err != nil {
                return err
        }
        defer client.Close()


        req := texttospeechpb.SynthesizeSpeechRequest{
                Input: &texttospeechpb.SynthesisInput{
                        InputSource: &texttospeechpb.SynthesisInput_Ssml{Ssml: ssml},
                },
                // Note: the voice can also be specified by name.
                // Names of voices can be retrieved with client.ListVoices().
                Voice: &texttospeechpb.VoiceSelectionParams{
                        LanguageCode: "en-US",
                        SsmlGender: texttospeechpb.SsmlVoiceGender_FEMALE,
                },
                AudioConfig: &texttospeechpb.AudioConfig{
                        AudioEncoding: texttospeechpb.AudioEncoding_MP3,
                },
        }

        resp, err := client.SynthesizeSpeech(ctx, &req)
        if err != nil {
                return err
        }
        err = ioutil.WriteFile(outputFile, resp.AudioContent, 0644)
        if err != nil {
                return err
        }
        fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
        return nil
}
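
As with plain text, a hypothetical call can pass a small SSML document directly (the output filename is a placeholder):

Code:

ssml := `<speak>Hello <break time="300ms"/> there.</speak>`
if err := SynthesizeSSML(os.Stdout, ssml, "ssml-output.mp3"); err != nil {
        log.Fatal(err)
}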

List all supported voices 

You can obtain a comprehensive list of all the supported voices from the API's voices:list endpoint. You can also find the complete list of available voices on the Supported Voices page.

The code snippet below lists the voices available in the Text-to-Speech API for speech synthesis.

Code:

// ListVoices lists the available text-to-speech voices.
// (It assumes the same imports as the earlier examples.)
func ListVoices(w io.Writer) error {
        ctx := context.Background()
        client, err := texttospeech.NewClient(ctx)
        if err != nil {
                return err
        }
        defer client.Close()
        // Performs the list voices request.
        resp, err := client.ListVoices(ctx, &texttospeechpb.ListVoicesRequest{})
        if err != nil {
                return err
        }
        for _, voice := range resp.Voices {
                // Display the voice's name. Example: tpc-vocoded
                fmt.Fprintf(w, "Name: %v\n", voice.Name)
                // Display the supported language codes for this voice. Example: "en-US"
                for _, languageCode := range voice.LanguageCodes {
                        fmt.Fprintf(w, " Supported language: %v\n", languageCode)
                }
                // Display the SSML Voice Gender.
                fmt.Fprintf(w, " SSML Voice Gender: %v\n", voice.SsmlGender.String())
                // Display the natural sample rate hertz for this voice. Example: 24000
                fmt.Fprintf(w, " Natural Sample Rate Hertz: %v\n",
                        voice.NaturalSampleRateHertz)
        }
        return nil
}
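
Running ListVoices prints entries like the following (illustrative only; the actual set of voices changes over time):

Output:

Name: en-US-Wavenet-A
 Supported language: en-US
 SSML Voice Gender: MALE
 Natural Sample Rate Hertz: 24000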

Let's dive into the details of decoding base64-encoded audio content.

Decode base64-encoded audio content

Audio data is binary data. A gRPC response carries the binary audio directly, but REST responses use JSON, and because JSON is a text format that does not natively support binary data, Text-to-Speech returns the audio as a base64-encoded string. You must convert the response's base64-encoded text data to binary before a device can play it.

The audioContent field in JSON responses from Text-to-Speech contains the base64-encoded audio content.

Code:

{
  "audioContent": "//NExAARqoIIAAhEuWAAAGNmBGMY4EBcxvABAXBPmPIAF//yAuh9Tn5CEap3/o..."
}

You can convert the base64-encoded content into an audio file with the following steps:

  • Create a text file containing only the base64-encoded content.
     
  • Decode that text file using the base64 command-line tool's decode flag:

Command:

base64 -d SOURCE_BASE64_TEXT_FILE > DESTINATION_AUDIO_FILE
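
If you'd rather decode the content in Go instead of from the shell, the standard library's encoding/base64 package can do it. Here is a minimal sketch that assumes the text file created in the first step above:

Code:

package main

import (
        "encoding/base64"
        "log"
        "os"
        "strings"
)

func main() {
        // Read the file that holds only the base64 string from the JSON response.
        raw, err := os.ReadFile("synthesize-output-base64.txt")
        if err != nil {
                log.Fatal(err)
        }
        // Trim any trailing newline before decoding.
        data, err := base64.StdEncoding.DecodeString(strings.TrimSpace(string(raw)))
        if err != nil {
                log.Fatal(err)
        }
        // Write the decoded bytes as a playable MP3 file.
        if err := os.WriteFile("synthesized-audio.mp3", data, 0644); err != nil {
                log.Fatal(err)
        }
}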

Specify a regional endpoint 

Text-to-Speech provides regional API endpoints for the US and the EU. If you pick a regional endpoint, your data in transit and at rest remains within the continental territory of Europe or the USA. Specifying an endpoint is crucial if the location of your data must be restricted to meet regional regulatory requirements. Functionally, the API behaves in the same manner either way.

The commands listed below set a regional endpoint:

  • EU

Command:

gcloud config set api_endpoint_overrides/texttospeech https://eu-texttospeech.googleapis.com/

  • US

Command:

gcloud config set api_endpoint_overrides/texttospeech https://us-texttospeech.googleapis.com/
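
If you're using the Go client library rather than gcloud, you can point the client at a regional endpoint with a client option; a minimal sketch, mirroring the EU endpoint above:

Code:

import (
        "context"

        texttospeech "cloud.google.com/go/texttospeech/apiv1"
        "google.golang.org/api/option"
)

func newEUClient(ctx context.Context) (*texttospeech.Client, error) {
        // Override the default endpoint so requests stay on EU infrastructure.
        return texttospeech.NewClient(ctx, option.WithEndpoint("eu-texttospeech.googleapis.com:443"))
}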

 

Frequently Asked Questions

What is speech-to-text accommodation?

A speech-to-text (STT) accommodation enables users to convert their spoken words into written text or commands electronically by using speech-to-text software.

What data does text-to-speech use?

Text-to-Speech transforms input from text or Speech Synthesis Markup Language (SSML) into audio data such as MP3 or LINEAR16.

Which algorithm is used in text-to-speech?

The ML system learns the relationship between phonemes and sounds, giving them precise intonations. The technology then produces the vocal sound using a sound-wave generator, which loads the frequency characteristics of the phrases learned from the acoustic model.

Conclusion

In this article, we have extensively discussed the details of Text-to-Speech in GCP along with the details of creating audio from text by using client libraries and command line, creating voice audio files, and decoding base64-encoded audio content.

We hope that this blog has helped you enhance your knowledge regarding Text-to-Speech in GCP, and if you would like to learn more, check out our articles on Google Cloud Certification. You can refer to our guided paths on the Coding Ninjas Studio platform to learn more about DSA, DBMS, Competitive Programming, Python, Java, JavaScript, etc. To practice and improve yourself for interviews, you can also check out Top 100 SQL problems, Interview experience, Coding interview questions, and the Ultimate guide path for interviews. Do upvote our blog to help other ninjas grow. Happy Coding!!
