Table of contents
1. Introduction
2. Translating streaming audio into text
  2.1. Setting Up the Project
  2.2. Translate speech
    2.2.1. Translating speech from an audio file
    2.2.2. Translating speech from a microphone
3. Media Translation basics
  3.1. Speech translation requests
4. Best Practices for Media Translation
  4.1. Audio pre-processing
  4.2. Request configuration
  4.3. Frame Size
5. Introduction to audio encoding
6. Frequently Asked Questions
  6.1. What is media translation?
  6.2. What is the purpose of speech translation?
  6.3. What is LINEAR16?
7. Conclusion
Last Updated: Mar 27, 2024

Media Translation

Introduction

The Media Translation API provides real-time speech translation directly from audio data to your content and applications. The API, which uses Google's machine learning technology, improves accuracy and simplifies integration while equipping you with a comprehensive range of features to further optimize your translation results. Improve user experience with low-latency streaming translation and swiftly scale with simple internationalization.

The Media Translation API improves translation accuracy by integrating the audio-to-text and translation models into a single optimized pipeline, and it removes the friction of coordinating several separate API requests. Make a single API call, and Media Translation handles the rest.

Translating streaming audio into text

Media Translation is the process of converting an audio file or a stream of speech into text in another language.

Setting Up the Project

To use Media Translation, you must first create a Google Cloud project and enable the Media Translation API for that project.

  1. If you're new to Google Cloud, create an account to evaluate how its products perform in real-world scenarios.
     
  2. Select or create a Google Cloud project on the project selector page of the Google Cloud console.
     
  3. Check that billing for your Cloud project is enabled.
     
  4. Turn on the Media Translation API.
     
  5. Set up a service account:

    1. Navigate to the Create service account page in the console.
       
    2. Choose your project.
       
    3. Enter a name in the Service account name field. Based on this name, the console populates the Service account ID field.
      Enter a description in the Service account description field, for example, "Service account for quickstart".
       
    4. Continue by clicking Create.
       
    5. Grant your service account the following role(s) to give it access to your project: Project > Owner.
      Choose the role from the Select a role list.
      To add more roles, click Add another role and select each additional role.
       
    6. Click the Continue button.
       
    7. To finish creating the service account, click Done.
      Keep your browser window open. It will be useful in the following step.

       
  6. Make a key for your service account:

    1. Select the email address associated with the service account you created in the console.
       
    2. Select Keys.
       
    3. Click Add key, followed by Create new key.
       
    4. Click the Create button. Your computer receives a JSON key file.
       
    5. Close the window.

       
  7. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the JSON file that contains your service account key. This variable applies only to the current shell session, so if you open a new session, you must set it again.

    1. For Linux or macOS:
      Replace KEY_PATH with the path of the JSON file that contains your service account key.

      export GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"
       
    2. For Windows:
      Replace KEY_PATH with the path of the JSON file that contains your service account key.

      For PowerShell:
      $env:GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"

      For command prompt:
      set GOOGLE_APPLICATION_CREDENTIALS=KEY_PATH

       
  8. Install and launch the Google Cloud CLI.
     
  9. Install the client library for the language you want to use.
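     For example, for the Node.js samples in this article, the client library can be installed with npm (this is the same package imported in the code below):

     npm install @google-cloud/media-translation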

Translate speech

The code examples below show how to translate speech from a file with up to five minutes of audio or a live microphone.

Translating speech from an audio file

const fs = require('fs');

// Imports the Cloud Media Translation client library
const {
  SpeechTranslationServiceClient,
} = require('@google-cloud/media-translation');

// Creates a client
const client = new SpeechTranslationServiceClient();

async function translateFromFile() {
  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const encoding = 'Encoding of the audio file, e.g. linear16';
  // const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
  // const targetLanguage = 'BCP-47 target language code, e.g. es-ES';
  // const sourceLanguage = 'BCP-47 source language code, e.g. en-US';

  const config = {
    audioConfig: {
      audioEncoding: encoding,
      sourceLanguageCode: sourceLanguage,
      targetLanguageCode: targetLanguage,
    },
    singleUtterance: true,
  };

  // first request must simply contain a streaming config and no data
  const initialRequest = {
    streamingConfig: config,
    audioContent: null,
  };

  const readStream = fs.createReadStream(filename, {
    highWaterMark: 4096,
    encoding: 'base64',
  });

  const chunks = [];
  readStream
    .on('data', chunk => {
      const request = {
        streamingConfig: config,
        audioContent: chunk.toString(),
      };
      chunks.push(request);
    })
    .on('close', () => {
      // Config-only requests should be first in the stream of requests
      stream.write(initialRequest);
      for (let i = 0; i < chunks.length; i++) {
        stream.write(chunks[i]);
      }
      stream.end();
    });

  const stream = client.streamingTranslateSpeech().on('data', response => {
    const {result} = response;
    if (result.textTranslationResult.isFinal) {
      console.log(
        `\nFinal translation: ${result.textTranslationResult.translation}`
      );
      console.log(`Final recognition result: ${result.recognitionResult}`);
    } else {
      console.log(
        `\nPartial translation: ${result.textTranslationResult.translation}`
      );
      console.log(`Partial recognition result: ${result.recognitionResult}`);
    }
  });
}

translateFromFile();

Translating speech from a microphone

// Allow user input from terminal
const readline = require('readline');

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
});

function doTranslationLoop() {
  rl.question("Press any key to translate or 'q' to quit: ", answer => {
    if (answer.toLowerCase() === 'q') {
      rl.close();
    } else {
      translateFromMicrophone();
    }
  });
}

// Imports the node-record-lpcm16 module, used to capture audio from the microphone
const recorder = require('node-record-lpcm16');

// Imports the Cloud Media Translation client library
const {
  SpeechTranslationServiceClient,
} = require('@google-cloud/media-translation');

// Creates a client
const client = new SpeechTranslationServiceClient();

function translateFromMicrophone() {
  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  //const encoding = 'linear16';
  //const sampleRateHertz = 16000;
  //const sourceLanguage = 'Language to translate from, as BCP-47 locale';
  //const targetLanguage = 'Language to translate to, as BCP-47 locale';
  console.log('Begin speaking ...');

  const config = {
    audioConfig: {
      audioEncoding: encoding,
      sourceLanguageCode: sourceLanguage,
      targetLanguageCode: targetLanguage,
    },
    singleUtterance: true,
  };

  // first request must simply contain a streaming configuration and no data
  const initialRequest = {
    streamingConfig: config,
    audioContent: null,
  };

  let currentTranslation = '';
  let currentRecognition = '';
  // Create a recognize stream
  const stream = client
    .streamingTranslateSpeech()
    .on('error', e => {
      if (e.code && e.code === 4) {
        console.log('Streaming translation reached its deadline.');
      } else {
        console.log(e);
      }
    })
    .on('data', response => {
      const {result, speechEventType} = response;
      if (speechEventType === 'END_OF_SINGLE_UTTERANCE') {
        console.log(`\nFinal translation: ${currentTranslation}`);
        console.log(`Final recognition result: ${currentRecognition}`);

        stream.destroy();
        recording.stop();
      } else {
        currentTranslation = result.textTranslationResult.translation;
        currentRecognition = result.recognitionResult;
        console.log(`\nPartial translation: ${currentTranslation}`);
        console.log(`Partial recognition result: ${currentRecognition}`);
      }
    });

  let isFirst = true;
  // Start a recording and transmit microphone data to the Media Translation API
  const recording = recorder.record({
    sampleRateHertz: sampleRateHertz,
    threshold: 0, //silence threshold
    recordProgram: 'rec',
    silence: '5.0', //seconds of silence before ending
  });
  recording
    .stream()
    .on('data', chunk => {
      if (isFirst) {
        stream.write(initialRequest);
        isFirst = false;
      }
      const request = {
        streamingConfig: config,
        audioContent: chunk.toString('base64'),
      };
      if (!stream.destroyed) {
        stream.write(request);
      }
    })
    .on('close', () => {
      doTranslationLoop();
    });
}

doTranslationLoop();

Media Translation basics 

In this section, we cover the types of requests you can make to Media Translation, how to construct those requests, and how to handle their responses.

Speech translation requests

So far, Media Translation has only one method for performing speech translation:

  • Streaming Translation (gRPC only) converts audio data within a bi-directional gRPC stream. Streaming requests are intended for use in real-time translation, such as capturing live audio from a microphone. Streaming translation produces interim results while audio is being captured, allowing results to appear while a user is still speaking, for example. Streaming translation requests are limited to audio files with a duration of 5 minutes or less.
     

Each request contains either configuration parameters or audio data; the first request in a stream carries only the configuration (see the sketch below).
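As a rough sketch based on the code samples above, each message written to the stream is shaped like a StreamingTranslateSpeechRequest: the first carries only the streaming configuration, and every later one repeats the configuration and adds a base64-encoded audio chunk (audioChunk below is a placeholder for a Buffer of raw audio):

// First request: streaming configuration only, no audio data.
const configRequest = {
  streamingConfig: {
    audioConfig: {
      audioEncoding: 'linear16',
      sourceLanguageCode: 'en-US',
      targetLanguageCode: 'es-ES',
    },
    singleUtterance: true,
  },
  audioContent: null,
};

// Subsequent requests: the same configuration plus a base64-encoded audio chunk.
const audioRequest = {
  streamingConfig: configRequest.streamingConfig,
  audioContent: audioChunk.toString('base64'),
};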

Best Practices for Media Translation

The Media Translation API is most effective when the data sent to the service falls within the parameters listed below. 

  • Capture audio at a sampling rate of 16,000 Hz or higher. If that is not possible, set sample_rate_hertz to match the native sample rate of the audio source instead of re-sampling.
     
  • To record and transmit audio, use a lossless codec. It is best to use FLAC or LINEAR16.
     
  • For low streaming response latency, use the LINEAR16 codec.
     
  • Place the microphone as close to the speaker as possible, especially if there is background noise.
     
  • For better results with noisy background audio, use an enhanced model.
     
  • Specify the source language code in the "language-region" format (for example, en-US), and the target language code without a region (except zh-CN and zh-TW).

Audio pre-processing

It is best to provide audio that is as clean as possible by using a high-quality, well-positioned microphone. However, applying noise-reduction signal processing to the audio before sending it to the service usually reduces recognition accuracy; the recognition service is designed to handle noisy audio.

To achieve the best results:

  • Place the microphone as close to the person speaking as possible, especially if there is background noise.
     
  • Audio clipping should be avoided.
     
  • The use of automatic gain control (AGC) is not recommended.
     
  • Any noise reduction processing should be turned off.
     
  • Listen to some audio samples; they should sound clear and free of distortion or unexpected noise.

Request configuration

Ensure that the audio data you send with your request to the Media Translation API is accurately described. A TranslateSpeechConfig that specifies the correct sample_rate_hertz, audio_encoding, source_language_code, and target_language_code yields the most accurate translation and correct billing (a sketch is shown below).
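For example, a fully specified configuration for 16-bit PCM audio recorded at 16,000 Hz might look like the following in the Node.js client, where the proto fields above appear in camelCase; treat this as an illustrative sketch rather than the only valid configuration:

const translateSpeechConfig = {
  audioEncoding: 'linear16',   // audio_encoding: raw 16-bit PCM
  sampleRateHertz: 16000,      // sample_rate_hertz: the source's native rate
  sourceLanguageCode: 'en-US', // source_language_code: "language-region" format
  targetLanguageCode: 'es',    // target_language_code: language only (except zh-CN / zh-TW)
};

const streamingConfig = {
  audioConfig: translateSpeechConfig,
  singleUtterance: true,
};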

Frame Size

Streaming recognition processes live audio captured from a microphone or other audio source. The audio stream is divided into frames and delivered in a series of StreamingTranslateSpeechRequest messages. Frames of any size are acceptable; larger frames are more efficient, but they introduce latency. A frame size of 100 milliseconds is recommended as a good compromise between latency and efficiency (see the sketch below).
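For example, for 16-bit (2-byte) LINEAR16 mono audio at 16,000 Hz, a 100 millisecond frame works out to 3,200 bytes. A minimal sketch, reusing fs, filename, and config from the file example above (where the 4,096-byte highWaterMark was simply another reasonable choice):

const sampleRateHertz = 16000;
const bytesPerSample = 2; // 16-bit LINEAR16, mono
const frameBytes = (sampleRateHertz * bytesPerSample) / 10; // 3200 bytes per 100 ms

// Read the audio file in ~100 ms frames and base64-encode each chunk for the request.
const readStream = fs.createReadStream(filename, {highWaterMark: frameBytes});
readStream.on('data', chunk => {
  const request = {
    streamingConfig: config,
    audioContent: chunk.toString('base64'),
  };
  // write the request to the translation stream here
});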

Introduction to audio encoding

The Media Translation API supports a variety of audio encodings. Lossless codecs such as LINEAR16 and FLAC produce the best results, and LINEAR16 is preferred when low streaming latency matters (see Best Practices for Media Translation above).

Frequently Asked Questions

What is media translation?

Media translation converts speech in an audio file or audio stream into translated text in another language. The Media Translation API does this with a single API call by integrating the speech recognition and translation models, so you do not need to chain several separate API requests.
 

What is the purpose of speech translation?

Speech translation technology helps people communicate across language barriers. As a result, it has enormous significance for research, intercultural communication, and international trade.
 

What is LINEAR16?

Linear PCM (LINEAR16) is uncompressed audio in which the digital data is stored exactly as it was sampled. For example, when reading a one-channel stream of bytes encoded as linear PCM, you can take every 16 bits (2 bytes) as a new waveform amplitude value (see the sketch below).
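As a minimal sketch in Node.js, assuming 16-bit little-endian mono samples in a Buffer (the format the microphone example above captures), the amplitude values can be decoded like this:

// Decode a LINEAR16 buffer into an array of 16-bit amplitude values.
function decodeLinear16(buffer) {
  const samples = [];
  for (let offset = 0; offset + 2 <= buffer.length; offset += 2) {
    samples.push(buffer.readInt16LE(offset)); // one amplitude value per 2 bytes
  }
  return samples;
}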

Conclusion

In this article, we explored Google Cloud media translation and also learned about the translation of streaming audio into text, its practices, and audio encoding.

If this blog helped you get an overview of the Media Translation API and you would like to learn more, check out our articles Cloud Computing, Cloud Computing Technologies, Cloud Computing Infrastructure, and Overview of a log-based metric.

Refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Data Structures and Algorithms, Competitive Programming, JavaScript, System Design, and many more! You can check out the mock exam series and participate in the contests on Coding Ninjas Studio to test your coding proficiency. If you have just started your learning process and are looking for questions asked by tech giants like Amazon, Microsoft, and Uber, you should look at the problems, interview experiences, and interview bundle for placement preparations.

Nevertheless, you may consider our paid courses to give your career an edge over others!

Happy Learning!
