Introduction
The simplest way to recognize speech audio data is through a synchronous recognition request to the Speech-to-Text API. Speech-to-Text can process up to one minute of speech audio sent as a synchronous request, and it processes and recognizes all of the audio before returning a response.
A synchronous request is blocking: Speech-to-Text must return a response before it moves on to the next request. Speech-to-Text typically processes audio faster than real time, handling 30 seconds of audio in about 15 seconds on average. If the audio quality is poor, the recognition request can take considerably longer.
Speech requests
There are three primary ways to perform speech recognition with Speech-to-Text.
Synchronous recognition (REST and gRPC) sends audio data to the Speech-to-Text API, performs recognition on that data, and returns the results after all the audio has been processed. Synchronous recognition requests are limited to audio files of one minute or less.
Asynchronous recognition (REST and gRPC) sends audio data to the Speech-to-Text API and starts a Long Running Operation. You can periodically poll this operation for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.
Streaming recognition (gRPC only) recognizes audio data provided over a gRPC bi-directional stream. Streaming requests are designed for real-time recognition tasks, such as capturing live audio from a microphone. Streaming recognition provides interim results while the audio is being captured, so that results can appear, for example, while a user is still speaking.
Synchronous Speech Recognition Requests
A synchronous Speech-to-Text API request consists of two components: a speech recognition configuration and the audio data. All synchronous recognition requests to the Speech-to-Text API must include a recognition config field (of type RecognitionConfig). A RecognitionConfig contains the following sub-fields:
encoding - (required) specifies the encoding scheme of the supplied audio (of type AudioEncoding). For FLAC and WAV files, where the encoding is included in the file header, the encoding field is optional.
sampleRateHertz - (required) specifies the sample rate (in hertz) of the supplied audio. For FLAC and WAV files, where the sample rate is included in the file header, the sampleRateHertz field is optional.
languageCode - (required) specifies the language plus region or locale to use for speech recognition of the supplied audio. The language code must be a BCP-47 identifier.
maxAlternatives - (optional, defaults to 1) indicates the number of alternative transcriptions to provide in the response. By default, the Speech-to-Text API provides one primary transcription.
profanityFilter - (optional) indicates whether to filter out profane words or phrases. Filtered words retain their first letter, with an asterisk in place of each remaining character.
speechContext - (optional) contains additional contextual information for processing the audio. A context contains the following sub-field: phrases - a list of words and phrases that provide hints to the speech recognition task.
Audio is supplied to Speech-to-Text through the audio parameter of type RecognitionAudio. The audio field contains either of the following sub-fields: content contains the audio to evaluate, embedded directly in the request; uri contains a URI pointing to the audio content.
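Putting these fields together, a minimal sketch of a synchronous REST request body might look like the following. The audio bytes here are placeholder data standing in for real LINEAR16 samples, not actual speech:

```python
import base64
import json

# Placeholder bytes standing in for real LINEAR16 audio samples.
audio_bytes = b"\x00\x01" * 8

# Body of a synchronous recognition request: a config plus embedded audio.
request_body = {
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "maxAlternatives": 1,
    },
    "audio": {
        # For REST requests, embedded audio must be Base64-encoded.
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    },
}

print(json.dumps(request_body, indent=2))
```

Using uri instead of content would simply replace the audio sub-field with a link to the audio, for example a Cloud Storage location.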
Sample Rates
The sample rate of your audio must match that of the referenced audio content or stream, and you declare it in the sampleRateHertz field of the request configuration. Speech-to-Text supports sample rates between 8000 Hz and 48000 Hz. For a FLAC or WAV file, you can declare the sample rate in the file header instead of using the sampleRateHertz field. A FLAC file must contain the sample rate in its FLAC header in order to be submitted to the Speech-to-Text API.
If you have a choice when encoding the source material, capture audio at a sample rate of 16000 Hz. Lower values may reduce speech recognition accuracy, while higher values have little to no appreciable effect on recognition quality.
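For WAV input, the sample rate can come from the file header rather than the sampleRateHertz field. A small sketch with Python's standard wave module, writing a 16000 Hz file in memory and reading the rate back from its header:

```python
import io
import wave

# Write a short mono, 16-bit, 16000 Hz WAV file into memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # recommended sample rate
    w.writeframes(b"\x00\x00" * 1600)  # 0.1 s of silence

# Read the sample rate back from the header, as Speech-to-Text would.
buf.seek(0)
with wave.open(buf, "rb") as r:
    rate = r.getframerate()

print(rate)  # 16000
```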
Model Selection
Speech-to-Text can use one of several machine learning models to transcribe your audio file. Google trained these speech recognition models on specific audio formats and sources.
You can improve results by telling Speech-to-Text the original audio source when requesting a transcription. The Speech-to-Text API can then process your audio files using a machine learning model trained to recognize speech audio from that particular source type.
To select a model, include the model field in the RecognitionConfig object for your request, naming the speech recognition model you want to use.
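For example, a config selecting a model trained on telephony audio might look like this; phone_call is one of the publicly documented model tags, and the other field values are illustrative:

```python
# RecognitionConfig with an explicit model selection.
config = {
    "encoding": "LINEAR16",
    "sampleRateHertz": 8000,     # telephony audio is commonly 8 kHz
    "languageCode": "en-US",
    "model": "phone_call",       # model trained on phone-call audio
}

print(config["model"])
```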
Embedded Audio Content
Embedded audio is included in a speech recognition request when a content parameter is passed in the request's audio field. For gRPC requests, embedded audio must be sent as binary data compatible with Proto3 serialization. For REST requests, embedded audio must be Base64-encoded so that it is compatible with JSON serialization.
Speech-to-Text API responses
Depending on the length of the supplied audio, a synchronous Speech-to-Text API response may take some time to return results. Once the audio is processed, the API returns a response like the following:
{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.98267895,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}
Select Alternatives
A successful synchronous recognition response may include one or more alternatives for each result (if the request's maxAlternatives value is greater than 1). An alternative is included in the response only if Speech-to-Text determines it has a sufficient confidence value. The first alternative listed in a response is always the best (most likely) one.
Setting maxAlternatives higher than 1 does not guarantee or imply that multiple alternatives will be returned. In general, multiple alternatives are more appropriate for providing real-time options to users receiving results from a streaming recognition request.
Handling transcriptions
Each alternative in the response includes a transcript of the recognized text. When provided with sequential alternatives, you should concatenate these transcripts.
The following Python function iterates over a list of results and concatenates the transcripts. Note that we always take the first, or zeroth, alternative.
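A minimal sketch of such a helper, using plain dicts shaped like the JSON response shown earlier (the client library exposes the same fields as attributes):

```python
def concatenate_transcripts(results):
    """Join the top (zeroth) alternative of each result into one transcript."""
    transcript = ""
    for result in results:
        # The first alternative is always the best (most likely) one.
        transcript += result["alternatives"][0]["transcript"]
    return transcript

# Two consecutive results, as a longer recording might produce.
results = [
    {"alternatives": [{"transcript": "how old is ", "confidence": 0.98}]},
    {"alternatives": [{"transcript": "the Brooklyn Bridge", "confidence": 0.95}]},
]

print(concatenate_transcripts(results))  # how old is the Brooklyn Bridge
```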
The confidence value is an estimate between 0.0 and 1.0, computed by aggregating the "likelihood" values assigned to each word in the audio. A higher number indicates a greater estimated likelihood that the individual words were recognized correctly. This field is typically provided only for the top hypothesis, and only for results where is_final=true. For example, you might use the confidence value to decide whether to show alternative results to the user or to ask the user for confirmation.
Asynchronous Requests and Responses
An asynchronous Speech-to-Text API request to the LongRunningRecognize method has the same form as a synchronous request. However, instead of returning a response, the asynchronous request starts a Long Running Operation (of type Operation) and returns this operation to the caller immediately. Asynchronous speech recognition works with audio of up to 480 minutes in length.
Below is an example of a common operation response:
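A representative Operation response might look like the following; the operation name, progress, and timestamps are illustrative values:

```json
{
  "name": "7644248139997422355",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 34,
    "startTime": "2022-09-04T23:26:29.579144Z",
    "lastUpdateTime": "2022-09-04T23:26:39.826903Z"
  }
}
```

Once the operation's done field is true, its response field carries the same results structure as a synchronous recognition response.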
Streaming Speech Recognition Requests
Streaming Speech-to-Text API recognition captures and recognizes audio in real time from a bi-directional stream. Your application can send audio on the request stream while simultaneously receiving interim and final recognition results on the response stream. Interim results represent the current recognition for a section of audio, while the final recognition result represents the last, best guess for that section.
Streaming requests
The first StreamingRecognizeRequest must contain a configuration of type StreamingRecognitionConfig without any accompanying audio. Subsequent StreamingRecognizeRequests sent over the same stream then consist of consecutive frames of raw audio bytes.
The following fields make up a StreamingRecognitionConfig:
config - (required, of type RecognitionConfig) contains configuration information for the audio, identical to that shown in synchronous and asynchronous requests.
singleUtterance - (optional, defaults to false) indicates whether this request should end automatically once speech is no longer detected. When processing voice commands, set singleUtterance to true.
interimResults - (optional, defaults to false) indicates that this stream request should return interim results that may be refined later (after processing more audio).
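The config-first ordering of a streaming session can be sketched as a request generator, with plain dicts standing in for StreamingRecognizeRequest messages:

```python
def request_stream(config, audio_chunks):
    """Yield a config-only first request, then one request per audio chunk."""
    # First request: streaming config, no audio.
    yield {"streamingConfig": {"config": config, "interimResults": True}}
    # Subsequent requests: raw audio bytes only.
    for chunk in audio_chunks:
        yield {"audioContent": chunk}

config = {"encoding": "LINEAR16", "sampleRateHertz": 16000, "languageCode": "en-US"}
chunks = [b"\x00\x01", b"\x02\x03", b"\x04\x05"]
requests = list(request_stream(config, chunks))

print(len(requests))  # 4: one config request plus three audio requests
```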
Streaming responses
Streaming speech recognition results are returned in a series of responses of type StreamingRecognitionResponse. Such a response contains the following fields:
speechEventType contains events of type SpeechEventType; the value of these events indicates when a single utterance has been determined to be complete. results contains a list of results, each with the following sub-fields:
alternatives contains a list of alternative transcriptions.
isFinal indicates whether the results in this list entry are interim or final.
stability indicates the volatility of the results obtained so far, with 0.0 indicating complete instability and 1.0 indicating complete stability.
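Handling these fields can be sketched as follows, with plain dicts standing in for StreamingRecognitionResponse messages and fabricated transcripts as input:

```python
def collect_final_transcripts(responses):
    """Print interim hypotheses; return the list of final transcripts."""
    finals = []
    for response in responses:
        for result in response.get("results", []):
            top = result["alternatives"][0]["transcript"]
            if result.get("isFinal"):
                finals.append(top)
            else:
                # Interim result: stability hints how likely it is to change.
                print("interim (stability %.1f): %s"
                      % (result.get("stability", 0.0), top))
    return finals

# Fabricated responses: one interim guess, then the final result.
responses = [
    {"results": [{"alternatives": [{"transcript": "how old"}],
                  "stability": 0.1}]},
    {"results": [{"alternatives": [{"transcript": "how old is the Brooklyn Bridge"}],
                  "isFinal": True}]},
]

finals = collect_final_transcripts(responses)
```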
Audio formats vs encodings
An audio encoding is not the same as an audio format. A well-known file format such as WAV defines the header structure of an audio file but is not itself an audio encoding. FLAC, however, can cause confusion because it is both a file format and an encoding. To be submitted to the Speech-to-Text API, a FLAC file must include the sample rate in its FLAC header. FLAC is the only encoding that requires a header; all other audio encodings specify headerless audio data. When the name FLAC is used in the Speech-to-Text API, the codec is always meant; we'll use the phrase "a .FLAC file" to refer to the FLAC file format.
The Speech-to-Text API supports numerous encodings, including MP3, FLAC, LINEAR16, MULAW, AMR, AMR_WB, and OGG_OPUS, among others.
Need of encoding
Audio consists of waveforms made up of the superposition of waves of various frequencies and amplitudes. To represent these waveforms in digital media, they must be sampled at a rate high enough to (at least) capture the highest frequency of the sounds you want to reproduce, and stored at a bit depth sufficient to represent the waveforms' amplitudes accurately.
A sound processing device's ability to reproduce frequencies is known as its frequency response, and its ability to produce appropriate loudness and softness is known as its dynamic range. Together, these are referred to as a sound device's fidelity. At its most basic, encoding is a method of reconstructing sound using these two fundamental principles, and of storing and transferring it efficiently.
Sampling rates and Bit depths
Sound is an analog waveform. A segment of digital audio approximates this analog wave by sampling its amplitude quickly enough to capture the wave's intrinsic frequencies. A digital audio segment's sample rate specifies the number of samples to take from the audio source per second; a high sample rate improves the accuracy with which digital audio represents high frequencies.
Bit depth affects the dynamic range of an audio sample. A higher bit depth lets you represent amplitudes more precisely and also improves the signal-to-noise ratio of the audio sample. CD audio is produced with a bit depth of 16 bits.
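The relationship between bit depth and dynamic range can be made concrete: n bits distinguish 2**n amplitude levels, giving linear PCM a theoretical dynamic range of 20*log10(2**n) decibels. A quick sketch:

```python
import math

def dynamic_range_db(bit_depth):
    """Theoretical dynamic range of linear PCM audio, in decibels."""
    return 20 * math.log10(2 ** bit_depth)

cd_range = dynamic_range_db(16)   # CD audio uses 16-bit samples
print(round(cd_range, 1))         # about 96.3 dB
```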
Compressed and Uncompressed audio
Audio data is frequently compressed to make it easier to store and transmit. Audio can be encoded with lossless or lossy compression. Lossless compression can be decompressed to restore the digital data to its original form. Lossy compression, which is parameterized to specify how much data the compression approach is allowed to discard, inevitably destroys some of the information during compression and decompression.
Linear PCM, the standard used on CDs, corresponds to the Speech-to-Text API's LINEAR16 encoding (the name indicates that the amplitude response is linearly uniform across the samples). It produces an uncompressed stream of bytes corresponding directly to the audio data, at a bit depth of 16 bits. Linear PCM (LINEAR16) is an example of uncompressed audio.
Lossless and Lossy compression
Lossless compression compresses digital audio data through intricate reorganization of the recorded data without degrading the quality of the original digital sample. No information is lost when the data is decompressed back to its original digital form. FLAC and LINEAR16 are two lossless encodings that the Speech-to-Text API supports.
Lossy compression, on the other hand, reduces or removes certain kinds of information from the audio data as it is compressed. The Speech-to-Text API supports several lossy formats, but you should avoid them if you control the source audio, because the data loss may impair recognition accuracy. The MP3 codec is a popular example of lossy encoding.
Data logging
As part of its ongoing effort to improve its products, Google uses customer data to further develop them. Speech-to-Text does not log customer audio or transcripts by default. To help Speech-to-Text better meet your needs, you can opt in to the data logging program. Through this program, Google can use customer data to refine its speech recognition technology and so improve the quality of Speech-to-Text. Let us look at data logging in detail in this section of the blog.
Data privacy and security
When you sign up for the program, Google does not log all of your information: Google only uses data provided to Speech-to-Text on projects where data logging is enabled, and it only uses the data you provide to such projects as necessary to deliver the service. All data you upload to a project while data logging is enabled remains entirely your property; however, the models built using those data remain the property of Google. Data gathered through data logging is accessible only to a select group of authorized Google employees and contractors, and Google does not target products, services, or advertisements at your users or clients.
Improve transcription results with model adaptation
You can use the model adaptation feature to make Speech-to-Text recognize certain words or phrases more frequently than it otherwise would. Model adaptation is particularly helpful in the following use cases:
Improving the transcription accuracy of words and phrases that recur frequently in your audio data.
Expanding the vocabulary of words that Speech-to-Text recognizes. Although Speech-to-Text has an enormous vocabulary, it may not cover words that are rare in general usage.
Improving transcription accuracy when the supplied audio is noisy or unclear.
Improve recognition using classes
Classes represent common concepts that occur in natural language, such as currency amounts and calendar dates. A class lets you improve transcription accuracy for long passages that share a concept but do not always contain identical words or phrases.
This is where class tokens come in. To use a class in model adaptation, include a class token in the phrases field of a PhraseSet resource. Classes can be used as standalone items in the phrases array or embedded as tokens in longer, multi-word phrases.
There is also another option: custom classes. A custom class is a class made up of your own list of related items or values. To use a custom class, create a CustomClass resource and add each item as a ClassItem. Custom classes work the same way as the pre-built class tokens, and both are acceptable in a phrase.
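A sketch of the request bodies involved; the class id bread-types and the item values are assumptions for illustration:

```python
# Body for creating a CustomClass resource from a list of related items.
custom_class_body = {
    "items": [
        {"value": "sourdough"},
        {"value": "rye"},
        {"value": "baguette"},
    ]
}

# A PhraseSet phrase can then embed the custom class by its id,
# just like a pre-built class token.
phrases = [{"value": "a loaf of ${bread-types}", "boost": 5}]

print(phrases[0]["value"])
```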
Fine-tune transcription results using a boost
By default, model adaptation has a relatively small effect, particularly for single-word phrases. The model adaptation boost feature lets you increase the recognition model bias by assigning more weight to some phrases than to others.
Boost basics
With boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when choosing a possible transcription for the words in your audio data. The higher the value, the more likely Speech-to-Text is to choose that word or phrase over the alternatives. When you assign a boost value to a multi-word phrase, the boost applies only to the phrase as a whole.
Set boost values
A boost value must be a float greater than 0; the practical upper limit is 20. For best results, experiment by adjusting your boost values up or down until you get accurate transcription results.
Higher boost values can reduce false negatives, which are cases where a word or phrase occurred in the audio but was not correctly recognized by Speech-to-Text.
Improve transcription using a PhraseSet
Step 1: Create a PhraseSet.
Step 2: Get the PhraseSet.
Step 3: Add the desired phrases to the PhraseSet, giving each a boost value of 10.
Step 4: Recognize the audio, this time using model adaptation with the PhraseSet you constructed.
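The steps above can be sketched as request payloads. The project, phrase-set id, and phrase values are illustrative; $OOV_CLASS_DIGIT_SEQUENCE is a pre-built class token:

```python
# Steps 1-3: body for a PhraseSet whose phrases each carry a boost of 10.
phrase_set_body = {
    "phrases": [
        {"value": "fare", "boost": 10},
        {"value": "$OOV_CLASS_DIGIT_SEQUENCE", "boost": 10},
    ]
}

# Step 4: recognition config referencing the PhraseSet by resource name.
config = {
    "languageCode": "en-US",
    "adaptation": {
        "phraseSetReferences": [
            "projects/my-project/locations/global/phraseSets/my-phrase-set"
        ]
    },
}

print(config["adaptation"]["phraseSetReferences"][0])
```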
Introduction to Latest Models
The Speech-to-Text API provides access to two new model tags, which you can use by specifying the model field. Because they are built to give you access to Google's latest speech technology and machine learning research, these models can deliver higher accuracy for speech recognition than other available models. However, the "latest" models do not yet offer all of the features supported by the existing models.
Model Identifiers
There are two variations of the most recent models:
The latest_short model is designed for brief utterances only a few seconds long. It is useful for capturing commands and other single-shot directed speech use cases.
The latest_long model is for long-form content of any kind, including media, conversation, and spontaneous speech. It can be used in place of the default model.
Model Technology and Pricing
The latest models aim to give Google Cloud customers access to Google's newest speech technology. They are currently based on Google's Conformer speech model technology, though this may change in the future. The latest_long and latest_short models are advertised as "Standard" and priced similarly to the default and command_and_search models.
Model Updates and Languages
The latest models are based on rapidly evolving machine learning technology. As a result, Google may change or refresh them more often than its other models. These updates may introduce new features or minor adjustments to latency or accuracy.
The latest models are offered in more than 20 languages and 50 locale variants.
Feature Support and Limitations
The latest models do not currently support the following features:
Confidence scores - the API returns a value, but it is not a true confidence score.
Speech adaptation (biasing) - only the en-US latest_short model supports biasing.
Diarization - neither of the latest models supports speaker diarization.
Model Service Level Agreement
The Speech-to-Text API's latest models are considered a Generally Available feature. As such, the functionality they support is part of the v1 API and is covered by the same Service Level Agreement and other protections as other generally available features and products.
Frequently Asked Questions
How can we safeguard data during cloud transportation?
To keep data secure in transit, ensure that it is encrypted as it moves from point A to point B in the cloud, and check that the encryption key used with the data you provide does not leak.
What do cloud computing system integrators do?
A cloud is made up of numerous complex components. A system integrator provides the cloud strategy that enables, among other things, the design of the cloud and the integration of its various components to produce a hybrid or private cloud network.
Can you name a few well-known open-source cloud computing platforms?
Some of the major open-source cloud systems are the following: Cloud Foundry, KVM, Docker, OpenStack, and Apache Mesos.
Conclusion
To conclude this blog: we first discussed speech requests, synchronous speech recognition requests, sample rates, model selection, embedded audio content, and related topics. We also covered audio formats and encodings, the need for encoding, data logging, improving transcription results with model adaptation, and an introduction to the latest models.