Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. Transcribe short audio files
2.1. Perform synchronous speech recognition on a local file
2.2. Perform synchronous speech recognition on a remote file
3. Transcribe long audio files
3.1. Transcribe long audio files using a Google Cloud Storage file
3.2. Upload your transcription results to a Cloud Storage bucket
4. Transcribe audio from streaming input
4.1. Perform streaming speech recognition on a local file
4.2. Perform streaming speech recognition on an audio stream
5. Send a recognition request with model adaptation
5.1. Code sample
6. Enable word-level confidence
6.1. Word-level confidence
6.2. Enable word-level confidence in a request
6.2.1. Using a local file
6.2.2. Using a remote file
7. Detect different speakers in an audio recording
7.1. Speaker diarization
7.2. Enable speaker diarization in a request
7.2.1. Using a local file
7.2.2. Use a Cloud Storage bucket
8. Automatically detect language
8.1. Multiple language recognition
8.2. Enable language recognition in audio transcription requests
8.2.1. Use a local file
8.2.2. Use a remote file
9. Frequently Asked Questions
9.1. How does Google Speech-to-Text API work?
9.2. What is Google Cloud speech?
9.3. How do you measure speech recognition accuracy?
10. Conclusion
Last Updated: Mar 27, 2024

Overview of Speech-to-Text in GCP

Author: Sanjana Yadav
Introduction

Google Cloud Speech-to-Text uses speech recognition to convert audio files to text.

Audio can be transcribed through the Google Cloud Platform console or by calling the Speech-to-Text API, for example from Cloud Functions.

Speech-to-Text can be used to offer voice control and search, add real-time subtitles to streaming content, and improve the user experience by incorporating interactive voice response (IVR) into your apps.

Let us learn more about Speech-to-Text in detail.

Transcribe short audio files

Perform synchronous speech recognition on a local file

Here's an example of synchronous speech recognition performed on a local audio file:

def transcribe_file(speech_file):
    """Transcribe the given audio file."""
    from google.cloud import speech
    import io

    client = speech.SpeechClient()


    with io.open(speech_file, "rb") as audio_file:
        content = audio_file.read()


    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )


    response = client.recognize(config=config, audio=audio)


    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))

Perform synchronous speech recognition on a remote file

For your convenience, the Speech-to-Text API can perform synchronous speech recognition directly on an audio file stored in Google Cloud Storage, without requiring you to transmit the file's contents in the body of your request.

Here's an example of synchronous speech recognition on a file from Cloud Storage:

def transcribe_gcs(gcs_uri):
    """Transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech


    client = speech.SpeechClient()


    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code="en-US",
    )


    response = client.recognize(config=config, audio=audio)


    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))

Transcribe long audio files

Audio content from a local file can be sent directly to Speech-to-Text for asynchronous processing. However, local files are limited to 60 seconds of audio; attempting to transcribe a longer local file results in an error. To transcribe audio longer than 60 seconds with asynchronous speech recognition, the data must be stored in a Google Cloud Storage bucket.
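For raw LINEAR16 audio, the 60-second limit can be checked locally before choosing between a direct request and a Cloud Storage upload. The helper below is an illustrative sketch; the function names are ours, not part of the API.

```python
# Rough duration estimate for raw LINEAR16 (16-bit PCM) audio: each sample
# is 2 bytes, so duration = bytes / (sample_rate * bytes_per_sample).
def linear16_duration_seconds(num_bytes, sample_rate_hertz=16000, bytes_per_sample=2):
    return num_bytes / (sample_rate_hertz * bytes_per_sample)


def needs_gcs_upload(num_bytes, limit_seconds=60, sample_rate_hertz=16000):
    """True if the audio exceeds the 60-second local-file limit."""
    return linear16_duration_seconds(num_bytes, sample_rate_hertz) > limit_seconds
```

At 16 kHz, 60 seconds of LINEAR16 audio is 16000 * 2 * 60 = 1,920,000 bytes; anything larger should be staged in a Cloud Storage bucket first.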

Transcribe long audio files using a Google Cloud Storage file

The raw audio input for the long-running transcription process is stored in a Cloud Storage bucket in these examples. 

def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech


    client = speech.SpeechClient()


    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code="en-US",
    )


    operation = client.long_running_recognize(config=config, audio=audio)


    print("Waiting for operation to complete...")
    response = operation.result(timeout=90)


    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))
        print("Confidence: {}".format(result.alternatives[0].confidence))

Upload your transcription results to a Cloud Storage bucket

Speech-to-Text allows you to upload your long-running recognition results directly to a Cloud Storage bucket. When combined with Cloud Storage Triggers, these uploads can fire notifications that invoke Cloud Functions, eliminating the need to poll Speech-to-Text for recognition results.

Provide the optional TranscriptOutputConfig parameter in your long-running recognition request to have the output uploaded to a Cloud Storage bucket.

message TranscriptOutputConfig {


    oneof output_type {
      // Specifies a Cloud Storage URI for the recognition results. Must be
      // specified in the format: `gs://bucket_name/object_name`
      string gcs_uri = 1;
    }
  }
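Expressed as the request dict you would pass to the Python client (snake_case field names; the `gs://` URIs are placeholders for your own bucket and object names), the output configuration sits alongside `config` and `audio`:

```python
# Request-body sketch for a long-running recognition call with results
# written to Cloud Storage. This only illustrates the field layout; it
# does not send a request.
request_body = {
    "config": {
        "encoding": "FLAC",
        "sample_rate_hertz": 16000,
        "language_code": "en-US",
    },
    "audio": {"uri": "gs://bucket_name/audio_object"},
    "output_config": {"gcs_uri": "gs://bucket_name/transcript_object"},
}
```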

Transcribe audio from streaming input

Perform streaming speech recognition on a local file

An example of performing streaming speech recognition on a local audio file is shown below. All streaming requests sent to the API are limited to 10 MB. This limit applies both to the initial StreamingRecognize request and to the size of each individual message in the stream; exceeding it results in an error.

def transcribe_streaming(stream_file):
    """Streams transcription of the given audio file."""
    import io
    from google.cloud import speech


    client = speech.SpeechClient()


    with io.open(stream_file, "rb") as audio_file:
        content = audio_file.read()


    # In practice, stream should be a generator yielding chunks of audio data.
    stream = [content]


    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream
    )


    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )


    streaming_config = speech.StreamingRecognitionConfig(config=config)


    # streaming_recognize returns a generator.
    responses = client.streaming_recognize(
        config=streaming_config,
        requests=requests,
    )


    for response in responses:
        # Once the transcription has settled, the first result will contain the
        # is_final result. The other results will be for subsequent portions of
        # the audio.
        for result in response.results:
            print("Finished: {}".format(result.is_final))
            print("Stability: {}".format(result.stability))
            alternatives = result.alternatives
            # The alternatives are ordered from most likely to least.
            for alternative in alternatives:
                print("Confidence: {}".format(alternative.confidence))
                print(u"Transcript: {}".format(alternative.transcript))
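In the sample above, the whole file is wrapped in a single-element list for simplicity. In a real stream, the audio should be delivered in smaller pieces so each StreamingRecognizeRequest stays well under the 10 MB limit. A minimal chunking generator (an illustration; the chunk size is an arbitrary choice) might look like:

```python
def chunked(content, chunk_size=32 * 1024):
    """Yield successive chunk_size slices of a bytes object."""
    for start in range(0, len(content), chunk_size):
        yield content[start:start + chunk_size]
```

Replacing `stream = [content]` with `stream = chunked(content)` feeds the same bytes through the request generator in bounded pieces.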

Perform streaming speech recognition on an audio stream

Speech-to-Text can also recognize audio that is streaming in real-time.

Here's an example of streaming speech recognition applied to an audio stream received from a microphone:
 

from __future__ import division

import re
import sys

from google.cloud import speech

import pyaudio
from six.moves import queue

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100ms


class MicrophoneStream(object):
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            # The API currently only supports 1-channel (mono) audio
            # https://goo.gl/z757pE
            channels=1,
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )

        self.closed = False

        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream, into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b"".join(data)


def listen_print_loop(responses):
    """Iterates through server responses and prints them.

    The responses passed is a generator that will block until a response
    is provided by the server.

    Each response may contain multiple results, and each result may contain
    multiple alternatives; for details, see https://goo.gl/tjCPAU.  Here we
    print only the transcription for the top alternative of the top result.

    In this case, responses are provided for interim results as well. If the
    response is an interim one, print a line feed at the end of it, to allow
    the next result to overwrite it, until the response is a final one. For the
    final one, print a newline to preserve the finalized transcription.
    """
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue

        # The `results` list is consecutive. For streaming, we only care about
        # the first result being considered, since once it's `is_final`, it
        # moves on to considering the next utterance.
        result = response.results[0]
        if not result.alternatives:
            continue

        # Display the transcription of the top alternative.
        transcript = result.alternatives[0].transcript

        # Display interim results, but with a carriage return at the end of the
        # line, so subsequent lines will overwrite them.
        #
        # If the previous result was longer than this one, we need to print
        # some extra spaces to overwrite the previous result
        overwrite_chars = " " * (num_chars_printed - len(transcript))

        if not result.is_final:
            sys.stdout.write(transcript + overwrite_chars + "\r")
            sys.stdout.flush()

            num_chars_printed = len(transcript)

        else:
            print(transcript + overwrite_chars)

            # Exit recognition if any of the transcribed phrases could be
            # one of our keywords.
            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                print("Exiting..")
                break

            num_chars_printed = 0


def main():
    # See http://g.co/cloud/speech/docs/languages
    # for a list of supported languages.
    language_code = "en-US"  # a BCP-47 language tag

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code=language_code,
    )

    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=True
    )

    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        requests = (
            speech.StreamingRecognizeRequest(audio_content=content)
            for content in audio_generator
        )

        responses = client.streaming_recognize(streaming_config, requests)

        # Now, put the transcription responses to use.
        listen_print_loop(responses)


if __name__ == "__main__":
    main()

Send a recognition request with model adaptation

Model adaptation can help you increase the accuracy of your Speech-to-Text transcription results. The model adaptation feature lets you specify words and/or phrases that Speech-to-Text should recognize more frequently in your audio data than it otherwise would. Model adaptation is particularly useful for improving transcription accuracy in the following scenarios:

  • Your audio includes words or phrases that are likely to be repeated.
  • Your audio is likely to contain terms that are uncommon (such as proper names) or do not exist in common usage.
  • Your audio contains noise or is somehow distorted.

Code sample

Speech Adaptation is one of the optional Speech-to-Text features you can use to tailor transcription results to your requirements.

The code example below demonstrates how to improve transcription accuracy using a SpeechAdaptation resource: PhraseSet, CustomClass, and model adaptation boost. To use a PhraseSet or CustomClass in future requests, make a note of its resource name, which is returned in the response when you create the resource.

from google.cloud import speech_v1p1beta1 as speech

def transcribe_with_model_adaptation(
    project_id, location, storage_uri, custom_class_id, phrase_set_id
):


    """
    Create`PhraseSet` and `CustomClasses` to create custom lists of similar
    items that are likely to occur in your input data.
    """


    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()


    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"


    # Create the custom class resource
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {
                "items": [
                    {"value": "sushido"},
                    {"value": "altura"},
                    {"value": "taneda"},
                ]
            },
        }
    )
    custom_class_name = (
        f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
    )
    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 10,
                "phrases": [
                    {"value": f"Visit restaurants like ${{{custom_class_name}}}"}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name
    # The next section shows how to use the newly created custom
    # class and phrase set to send a transcription request with speech adaptation


    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(phrase_set_references=[phrase_set_name])


    # speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
    )


    # The name of the audio file to transcribe
    # storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]


    audio = speech.RecognitionAudio(uri=storage_uri)


    # Create the speech client
    speech_client = speech.SpeechClient()


    response = speech_client.recognize(config=config, audio=audio)


    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))

Enable word-level confidence

You can instruct Speech-to-Text to include an accuracy, or confidence, rating for individual words in a transcription.

Word-level confidence

When Speech-to-Text transcribes an audio clip, it also measures the accuracy of the response. The response from Speech-to-Text states the confidence level for the full transcription request as a value between 0.0 and 1.0. The sample below shows an example of the confidence value returned by Speech-to-Text.

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.96748614
        }
      ]
    }
  ]
}

In addition to the confidence level for the full transcription, Speech-to-Text can provide the confidence level of individual words within it. As the following example shows, the response then includes WordInfo entries in the transcription, indicating the confidence level for particular words.

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98360395,
          "words": [
            {
              "startTime": "0s",
              "endTime": "0.300s",
              "word": "how",
              "confidence": SOME NUMBER
            },
            ...
          ]
        }
      ]
    }
  ]
}
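Once word-level confidence is enabled, the per-word values can be post-processed, for example to flag words that may need manual review. The helper below is illustrative (the 0.8 threshold is our choice, not an API default) and operates on the `words` entries from a parsed JSON response:

```python
def low_confidence_words(words, threshold=0.8):
    """Return the words whose per-word confidence falls below the threshold."""
    return [w["word"] for w in words if w["confidence"] < threshold]
```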

Enable word-level confidence in a request

The following code samples show how to enable word-level confidence in a Speech-to-Text transcription request, using local and remote files.

Using a local file

/**
 * Transcribe a local audio file with word level confidence
 *
 * @param fileName the path to the local audio file
 */
public static void transcribeWordLevelConfidence(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);


  try (SpeechClient speechClient = SpeechClient.create()) {
    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();
    // Configure request to enable word level confidence
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setSampleRateHertz(16000)
            .setLanguageCode("en-US")
            .setEnableWordConfidence(true)
            .build();
    // Perform the transcription request
    RecognizeResponse recognizeResponse = speechClient.recognize(config, recognitionAudio);


    // Print out the results
    for (SpeechRecognitionResult result : recognizeResponse.getResultsList()) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternatives(0);
      System.out.format("Transcript : %s\n", alternative.getTranscript());
      System.out.format(
          "First Word and Confidence : %s %s \n",
          alternative.getWords(0).getWord(), alternative.getWords(0).getConfidence());
    }
  }
}

Using a remote file

/**
 * Transcribe a remote audio file with word level confidence
 *
 * @param gcsUri path to the remote audio file
 */
public static void transcribeWordLevelConfidenceGcs(String gcsUri) throws Exception {
  try (SpeechClient speechClient = SpeechClient.create()) {


    // Configure request to enable word level confidence
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.FLAC)
            .setSampleRateHertz(44100)
            .setLanguageCode("en-US")
            .setEnableWordConfidence(true)
            .build();


    // Set the remote path for the audio file
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();


    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speechClient.longRunningRecognizeAsync(config, audio);


    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }
    // Just print the first result here.
    SpeechRecognitionResult result = response.get().getResultsList().get(0);


    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    // Print out the result
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
    System.out.format(
        "First Word and Confidence : %s %s \n",
        alternative.getWords(0).getWord(), alternative.getWords(0).getConfidence());
  }
}

Detect different speakers in an audio recording

Audio data often includes more than one person speaking. For example, audio from a phone call typically contains the voices of two or more people, and the transcript of the call should indicate who speaks when.

Speaker diarization

Speech-to-Text can recognize multiple voices in the same audio clip. When you send an audio transcription request to Speech-to-Text, you can include a parameter instructing it to identify the different speakers in the audio sample. This feature, known as speaker diarization, detects when speakers change and labels each voice it identifies with a number.

When you enable speaker diarization in your transcription request, Speech-to-Text attempts to distinguish the different voices in the audio sample. Each word in the transcription output is tagged with a number assigned to an individual speaker; words spoken by the same speaker carry the same number. A transcription can include as many speakers as Speech-to-Text can recognize in the audio sample.

When you use speaker diarization, Speech-to-Text produces a running aggregate of all the transcription results: each result includes the words from the preceding one, so the words array in the final result provides the complete, diarized transcription.
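The Java samples in this section group the final result's words by speaker; the same logic can be sketched in a few lines of Python. This illustrative helper takes (word, speaker_tag) pairs, as extracted from the words array, and emits one line per speaker turn:

```python
def group_by_speaker(tagged_words):
    """Collapse (word, speaker_tag) pairs into one line per speaker turn."""
    lines = []
    current_tag = None
    for word, tag in tagged_words:
        if tag == current_tag:
            # Same speaker as the previous word: extend the current line.
            lines[-1] += " " + word
        else:
            # Speaker changed: start a new line for the new speaker.
            lines.append(f"Speaker {tag}: {word}")
            current_tag = tag
    return "\n".join(lines)
```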

Enable speaker diarization in a request

To enable speaker diarization, set the enableSpeakerDiarization field in the request's SpeakerDiarizationConfig to true. To improve your transcription results, also set the diarizationSpeakerCount field in SpeakerDiarizationConfig to the number of speakers present in the audio clip. If you do not provide diarizationSpeakerCount, Speech-to-Text uses a default value.

Speech-to-Text supports speaker diarization for all speech recognition methods: speech:recognize, speech:longrunningrecognize, and streaming.
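The Java samples below build this configuration with the client library. As a request-configuration sketch (snake_case field names as used by the Python client; the values are placeholders), the diarization fields look like:

```python
# Recognition configuration sketch with speaker diarization enabled,
# fixed to exactly two speakers. Illustration only; nothing is sent.
recognition_config = {
    "encoding": "LINEAR16",
    "sample_rate_hertz": 8000,
    "language_code": "en-US",
    "diarization_config": {
        "enable_speaker_diarization": True,
        "min_speaker_count": 2,
        "max_speaker_count": 2,
    },
}
```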

Using a local file

The following code snippet shows how to activate speaker diarization in a transcription request to Speech-to-Text using a local file.

/**
 * Transcribe the given audio file using speaker diarization.
 *
 * @param fileName the path to an audio file.
 */
public static void transcribeDiarization(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);


  try (SpeechClient speechClient = SpeechClient.create()) {
    // Get the contents of the local audio file
    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();


    SpeakerDiarizationConfig speakerDiarizationConfig =
        SpeakerDiarizationConfig.newBuilder()
            .setEnableSpeakerDiarization(true)
            .setMinSpeakerCount(2)
            .setMaxSpeakerCount(2)
            .build();


    // Configure request to enable Speaker diarization
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(8000)
            .setDiarizationConfig(speakerDiarizationConfig)
            .build();


    // Perform the transcription request
    RecognizeResponse recognizeResponse = speechClient.recognize(config, recognitionAudio);


    // Speaker Tags are only included in the last result object, which has only one alternative.
    SpeechRecognitionAlternative alternative =
        recognizeResponse.getResults(recognizeResponse.getResultsCount() - 1).getAlternatives(0);


    // The alternative is made up of WordInfo objects that contain the speaker_tag.
    WordInfo wordInfo = alternative.getWords(0);
    int currentSpeakerTag = wordInfo.getSpeakerTag();


    // For each word, get all the words associated with one speaker, once the speaker changes,
    // add a new line with the new speaker and their spoken words.
    StringBuilder speakerWords =
        new StringBuilder(
            String.format("Speaker %d: %s", wordInfo.getSpeakerTag(), wordInfo.getWord()));


    for (int i = 1; i < alternative.getWordsCount(); i++) {
      wordInfo = alternative.getWords(i);
      if (currentSpeakerTag == wordInfo.getSpeakerTag()) {
        speakerWords.append(" ");
        speakerWords.append(wordInfo.getWord());
      } else {
        speakerWords.append(
            String.format("\nSpeaker %d: %s", wordInfo.getSpeakerTag(), wordInfo.getWord()));
        currentSpeakerTag = wordInfo.getSpeakerTag();
      }
    }


    System.out.println(speakerWords.toString());
  }
}

Use a Cloud Storage bucket

The code below shows how to enable speaker diarization in a transcription request to Speech-to-Text by utilizing a Google Cloud Storage file.

/**
 * Transcribe a remote audio file using speaker diarization.
 *
 * @param gcsUri the path to an audio file.
 */
public static void transcribeDiarizationGcs(String gcsUri) throws Exception {
  try (SpeechClient speechClient = SpeechClient.create()) {
    SpeakerDiarizationConfig speakerDiarizationConfig =
        SpeakerDiarizationConfig.newBuilder()
            .setEnableSpeakerDiarization(true)
            .setMinSpeakerCount(2)
            .setMaxSpeakerCount(2)
            .build();


    // Configure request to enable Speaker diarization
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(8000)
            .setDiarizationConfig(speakerDiarizationConfig)
            .build();


    // Set the remote path for the audio file
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();


    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speechClient.longRunningRecognizeAsync(config, audio);


    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }


    // Speaker Tags are only included in the last result object, which has only one alternative.
    LongRunningRecognizeResponse longRunningRecognizeResponse = response.get();
    SpeechRecognitionAlternative alternative =
        longRunningRecognizeResponse
            .getResults(longRunningRecognizeResponse.getResultsCount() - 1)
            .getAlternatives(0);


    // The alternative is made up of WordInfo objects that contain the speaker_tag.
    WordInfo wordInfo = alternative.getWords(0);
    int currentSpeakerTag = wordInfo.getSpeakerTag();


    // For each word, get all the words associated with one speaker, once the speaker changes,
    // add a new line with the new speaker and their spoken words.
    StringBuilder speakerWords =
        new StringBuilder(
            String.format("Speaker %d: %s", wordInfo.getSpeakerTag(), wordInfo.getWord()));


    for (int i = 1; i < alternative.getWordsCount(); i++) {
      wordInfo = alternative.getWords(i);
      if (currentSpeakerTag == wordInfo.getSpeakerTag()) {
        speakerWords.append(" ");
        speakerWords.append(wordInfo.getWord());
      } else {
        speakerWords.append(
            String.format("\nSpeaker %d: %s", wordInfo.getSpeakerTag(), wordInfo.getWord()));
        currentSpeakerTag = wordInfo.getSpeakerTag();
      }
    }


    System.out.println(speakerWords.toString());
  }
}

Automatically detect language

In some cases, you do not know which language your audio recordings contain. For example, if you offer your service, app, or product in a country with multiple official languages, you may receive audio input from users speaking different languages, which makes it difficult to specify a single language code for transcription requests.

Multiple language recognition

With Speech-to-Text, you can specify a set of alternative languages that your audio data may contain. When you send an audio transcription request, you can provide a list of additional languages that might be present in the audio. If you include a language list in your request, Speech-to-Text attempts to transcribe the audio in the language from your list that best matches the sample, and labels the transcription results with that predicted language code.

This capability is best suited to apps that need to transcribe short statements, such as voice commands or searches. In addition to your primary language, you can list up to three alternative languages supported by Speech-to-Text (for four languages in total).

Even when you supply alternative languages for your speech transcription request, you must still provide a primary language code in the languageCode field. Also, keep the number of languages you request to a minimum: the fewer alternative language codes you specify, the more reliably Speech-to-Text selects the correct one. Specifying only one language gives the best results.

Enable language recognition in audio transcription requests

To indicate alternative languages in your audio transcription, set the alternativeLanguageCodes field in the RecognitionConfig parameters of the request to a list of language codes. Speech-to-Text supports alternative language codes for all speech recognition methods, including speech:recognize, speech:longrunningrecognize, and streaming recognition.

Use a local file

/**
 * Transcribe a local audio file with multi-language recognition
 *
 * @param fileName the path to the audio file
 */
public static void transcribeMultiLanguage(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  // Get the contents of the local audio file
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speechClient = SpeechClient.create()) {
    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();
    ArrayList<String> languageList = new ArrayList<>();
    languageList.add("es-ES");
    languageList.add("en-US");

    // Configure request to enable multiple languages
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setSampleRateHertz(16000)
            .setLanguageCode("ja-JP")
            .addAllAlternativeLanguageCodes(languageList)
            .build();

    // Perform the transcription request
    RecognizeResponse recognizeResponse = speechClient.recognize(config, recognitionAudio);

    // Print out the results
    for (SpeechRecognitionResult result : recognizeResponse.getResultsList()) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternatives(0);
      System.out.format("Transcript : %s\n\n", alternative.getTranscript());
    }
  }
}

Use a remote file

/**
 * Transcribe a remote audio file with multi-language recognition
 *
 * @param gcsUri the path to the remote audio file
 */
public static void transcribeMultiLanguageGcs(String gcsUri) throws Exception {
  try (SpeechClient speechClient = SpeechClient.create()) {
    ArrayList<String> languageList = new ArrayList<>();
    languageList.add("es-ES");
    languageList.add("en-US");

    // Configure request to enable multiple languages
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setSampleRateHertz(16000)
            .setLanguageCode("ja-JP")
            .addAllAlternativeLanguageCodes(languageList)
            .build();

    // Set the remote path for the audio file
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speechClient.longRunningRecognizeAsync(config, audio);

    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    for (SpeechRecognitionResult result : response.get().getResultsList()) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);

      // Print out the result
      System.out.printf("Transcript : %s\n\n", alternative.getTranscript());
    }
  }
}

Frequently Asked Questions

How does Google Speech-to-Text API work?

A synchronous recognition request is the most basic way to recognize speech audio data with the Speech-to-Text API. Synchronous requests can handle up to one minute of speech audio, and Speech-to-Text returns a response only after all of the audio has been processed and recognized.
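Because of the one-minute limit, it helps to estimate an audio clip's duration before choosing between the synchronous and long-running endpoints. The sketch below is a rough, self-contained illustration (the helper names are my own, not part of the client library); for raw LINEAR16 audio, duration is simply bytes divided by bytes-per-second:

```java
public class SyncLimitDemo {
  // For raw LINEAR16 (16-bit PCM) audio:
  // duration (s) = numBytes / (sampleRateHertz * 2 bytes per sample * channels)
  public static double durationSeconds(long numBytes, int sampleRateHertz, int channels) {
    return (double) numBytes / ((long) sampleRateHertz * 2L * channels);
  }

  // Synchronous recognize handles roughly one minute of audio;
  // longer clips should go through longRunningRecognize instead.
  public static boolean useSynchronousRecognize(long numBytes, int sampleRateHertz, int channels) {
    return durationSeconds(numBytes, sampleRateHertz, channels) <= 60.0;
  }

  public static void main(String[] args) {
    // 30 s of 16 kHz mono LINEAR16 audio = 16000 * 2 * 30 = 960000 bytes
    System.out.println(useSynchronousRecognize(960_000, 16000, 1));   // prints true
    // 120 s of the same format exceeds the synchronous limit
    System.out.println(useSynchronousRecognize(3_840_000, 16000, 1)); // prints false
  }
}
```

A check like this can decide at runtime whether to call recognize or longRunningRecognizeAsync, as shown in the earlier examples.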

What is Google Cloud speech?

The Google Cloud Speech API lets developers convert audio to text by applying powerful neural network models through a simple API. The API recognizes over 80 languages and variants, so it can serve a worldwide user base.

How do you measure speech recognition accuracy?

The industry standard for measuring model accuracy is the word error rate (WER). WER counts the errors made during recognition (substituted, deleted, and inserted words) and divides that count by the total number of words (N) in the human-labeled reference transcript.
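For illustration, WER can be computed as a word-level Levenshtein (edit) distance between the reference and the recognized transcript, divided by the reference length. This is a minimal, self-contained sketch (the class and method names are my own, not a standard API):

```java
public class WerDemo {
  // Word error rate: (substitutions + deletions + insertions) / N,
  // where N is the word count of the human-labeled reference transcript.
  public static double wer(String reference, String hypothesis) {
    String[] ref = reference.trim().split("\\s+");
    String[] hyp = hypothesis.trim().split("\\s+");
    // d[i][j] = edit distance between first i reference words and first j hypothesis words
    int[][] d = new int[ref.length + 1][hyp.length + 1];
    for (int i = 0; i <= ref.length; i++) d[i][0] = i; // all deletions
    for (int j = 0; j <= hyp.length; j++) d[0][j] = j; // all insertions
    for (int i = 1; i <= ref.length; i++) {
      for (int j = 1; j <= hyp.length; j++) {
        int sub = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
        d[i][j] = Math.min(d[i - 1][j - 1] + sub,          // match or substitution
            Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));   // deletion or insertion
      }
    }
    return (double) d[ref.length][hyp.length] / ref.length;
  }

  public static void main(String[] args) {
    // One substitution ("quik" for "quick") out of 4 reference words -> WER = 0.25
    System.out.println(wer("the quick brown fox", "the quik brown fox")); // prints 0.25
  }
}
```

A WER of 0 means a perfect transcript; values above 1 are possible when the recognizer inserts many extra words.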

Conclusion

In this article, we have extensively discussed Speech-to-Text in GCP. Our discussion mainly focused on how to use synchronous, asynchronous, and streaming speech recognition to convert an audio recording to text, model adaptation, word-level confidence, and detection of different languages and speakers in an audio file.

We hope this blog has helped you enhance your Google Cloud knowledge. To learn more about Google Cloud concepts, refer to our article All about GCP Certifications: Google Cloud Platform | Coding Ninjas Blog.

You can also consider our Online Coding Courses such as the Machine Learning Course to give your career an edge over others.

Do upvote our blog to help other ninjas grow. Happy Coding!

