Table of contents
How AI Training Works
A Typical Machine Learning Application
Distributed Training Structure
Distributed Training Strategies
Packaging the Application
Submitting the Training Job
Job ID
Scale Tiers
Hyperparameter Tuning
Regions and Zones
Using job-dir as a Common Output Directory
Runtime Version
Input Data
Output Data
Building Training Jobs that are Resilient to VM Restarts
Training with GPUs
Frequently Asked Questions
What is the AI platform in GCP?
What is Google Compute Engine?
Why is a virtualisation platform required to construct a cloud?
What other techniques are there for logging into the Google Compute Engine API?
Last Updated: Mar 27, 2024

Overview of AI Platform Training



Google has been one of the most prominent names in the AI (Artificial Intelligence) space. Every day, millions of users rely on Google features that use AI technology in the background. With the services included in the Google AI Platform, we can create, train, and deploy our machine learning models in the cloud. AI Platform Notebooks lets us use Jupyter notebooks on a GCP (Google Cloud Platform) virtual machine. AI Platform Training provides access to reliable computing resources for model training. AI Platform Prediction makes a trained model available as a service.

How AI Training Works

AI Platform Training runs our training job on computing resources in the cloud. We can train a built-in algorithm (beta) on our dataset without writing a training application. If the built-in algorithms do not suit our needs, we can write our own training application to run on AI Platform Training.

Here is an outline of how to use our training app:

  1. We write a Python application that trains our model, and we build it the same way we would if it were going to run locally in our development environment.
  2. We collect our training and verification data and store it in a location that AI Platform Training can access. This usually means putting it in Cloud Storage, Cloud Bigtable, or another Google Cloud storage service associated with the same Google Cloud project that we use for AI Platform Training.
  3. When our application is ready to run, we must package it and upload it to a Cloud Storage bucket that our project can access. This is done automatically when we use the Google Cloud CLI (Command Line Interface) to run a training job.
  4. The AI Platform Training service sets up the resources for our job. Based on our job configuration, it allocates one or more virtual machines (known as training instances). Each training instance is set up by:
    a) Applying the standard machine image for the AI Platform Training version that our job uses.
    b) Loading the application package and installing it with pip.
    c) Installing any additional packages that we specify as dependencies.
  5. The training service runs our application, passing through any command-line arguments we specify when we create the training job. We can get information about the running job in the following ways:
    a) Through Cloud Logging.
    b) By requesting job details or running log streaming with the gcloud command-line tool.
    c) By programmatically making status requests to the training service.
  6. When our training job succeeds or encounters an unrecoverable error, AI Platform Training halts all job processes and cleans up the resources.

A Typical Machine Learning Application

The AI Platform Training service is designed to have as little impact on our application as possible. This frees us up to concentrate on our model code.

Most machine learning applications:

  • Provide a way to obtain training and evaluation data.
  • Process data instances.
  • Use evaluation data to test the model's accuracy, that is, how often it predicts the correct value.
  • Provide a way to output checkpoints at intervals during the process to capture a snapshot of the model's progress (for TensorFlow training applications).
  • Provide a way to export the trained model when the application finishes.
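The responsibilities listed above can be sketched as a tiny, framework-free training application. This is only an illustration under stated assumptions: the toy dataset (y = 2x + 1), the function names, and the JSON file layout are all hypothetical, and a real application would use a framework such as TensorFlow and write to Cloud Storage rather than a local directory.

```python
import json
import os
import tempfile


def load_data():
    """Return (training, evaluation) examples for the toy target y = 2x + 1."""
    examples = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(20)]
    return examples[:15], examples[15:]


def train(train_set, steps=2000, learning_rate=0.1, checkpoint_every=500,
          checkpoint_dir=None):
    """Fit y = w*x + b by gradient descent, writing checkpoints at intervals."""
    w, b = 0.0, 0.0
    n = len(train_set)
    for step in range(1, steps + 1):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train_set) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train_set) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        if checkpoint_dir and step % checkpoint_every == 0:
            path = os.path.join(checkpoint_dir, f"checkpoint-{step}.json")
            with open(path, "w") as f:
                json.dump({"step": step, "w": w, "b": b}, f)
    return w, b


def evaluate(model, eval_set):
    """Mean squared error of the model on held-out evaluation data."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in eval_set) / len(eval_set)


def export(model, output_dir):
    """Save the trained parameters and return the path of the exported file."""
    path = os.path.join(output_dir, "model.json")
    with open(path, "w") as f:
        json.dump({"w": model[0], "b": model[1]}, f)
    return path


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()  # stand-in for the job's output directory
    train_set, eval_set = load_data()
    model = train(train_set, checkpoint_dir=out_dir)
    print("evaluation MSE:", evaluate(model, eval_set))
    print("exported to:", export(model, out_dir))
```

Each function maps to one bullet: obtaining data, processing it during training, evaluating accuracy, checkpointing, and exporting the result.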

Distributed Training Structure

When we use AI Platform Training to run a distributed TensorFlow job, we specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types we specify. A running job on a given node is called a replica. In accordance with the distributed TensorFlow model, each replica in the training cluster is given a single role or task in distributed training:

  • Master: Exactly one replica is designated the master worker. This task manages the others and reports the status of the job as a whole. The training service runs until our job succeeds or encounters an unrecoverable error. In distributed training, the status of the master replica signals the overall job status. If we are running a single-process job, the sole replica is the master for the job.
  • Workers: One or more replicas may be designated as workers. These replicas do their portion of the work as we designate in our job configuration.
  • Parameter servers: One or more replicas may be designated as parameter servers. These replicas coordinate the shared model state between the workers.

Distributed Training Strategies

For training a model with numerous nodes, there are three fundamental approaches:

  • Synchronous updates with data-parallel training.
  • Asynchronous updates with data-parallel training.
  • Model-parallel training.

The data-parallel strategy is a good starting point for applying distributed training to our custom model because it can be used regardless of the model structure. In data-parallel training, all worker nodes share the whole model. Each node calculates gradient vectors independently from its own part of the training dataset, in the same manner as mini-batch processing. The calculated gradient vectors are collected on the parameter server node, and the model parameters are updated with the total sum of the gradient vectors. If we distribute 10,000 batches among ten worker nodes, each node works on roughly 1,000 batches.
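The idea of summing per-worker gradients can be simulated in a single process. The sketch below is a simplified illustration, not how AI Platform Training or TensorFlow implements it: the one-weight model y = w*x, the helper names, and the sharding scheme are assumptions chosen to keep the example short.

```python
def local_gradient(w, shard):
    """Gradient of the summed squared error for y = w*x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard)


def split_into_shards(examples, num_workers):
    """Give each simulated worker an (almost) equal slice of the data."""
    return [examples[i::num_workers] for i in range(num_workers)]


def parameter_server_step(w, shards, learning_rate):
    """Collect one gradient per worker, sum them, and apply a single update."""
    gradients = [local_gradient(w, shard) for shard in shards]  # worker side
    total = sum(gradients)                                       # server side
    num_examples = sum(len(shard) for shard in shards)
    return w - learning_rate * total / num_examples


# Toy data whose true weight is 3, split across five simulated workers.
examples = [(float(x), 3.0 * x) for x in range(1, 11)]
shards = split_into_shards(examples, num_workers=5)

w = 0.0
for _ in range(200):
    w = parameter_server_step(w, shards, learning_rate=0.01)
print("learned weight:", round(w, 4))
```

Because the total gradient is just the sum of the per-shard gradients, the update is the same as if one machine had processed the whole dataset; the workers only divide the computation.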

Data-parallel training can use either synchronous or asynchronous updates. With asynchronous updates, the parameter server applies each gradient vector independently, right after receiving it from one of the worker nodes. With synchronous updates, it waits for the gradient vectors from all worker nodes before updating the model parameters.

Please remember that the preceding section only applies to TensorFlow and custom container training.


Packaging the Application

We must package our program and its dependencies before we can run it on AI Platform Training. The package must then be uploaded to a Cloud Storage bucket that our Google Cloud project can access.

Much of the procedure is automated via the Google Cloud CLI. We can upload our application package and submit our training job using gcloud ai-platform jobs submit training.
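As a rough sketch, the submission command can be assembled in Python before it is handed to a shell or to subprocess. The flags used here (--package-path, --module-name, --job-dir, --region, --runtime-version, --python-version) are options of gcloud ai-platform jobs submit training, but every value below (job name, bucket, module, versions) is a hypothetical placeholder, and the helper function itself is only illustrative. Nothing is executed.

```python
import shlex


def build_submit_command(job_id, package_path, module_name, job_dir, region,
                         runtime_version, python_version, user_args=()):
    """Return the gcloud invocation as an argument list (nothing is run)."""
    command = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_id,
        "--package-path", package_path,    # local directory with our trainer
        "--module-name", module_name,      # module that starts training
        "--job-dir", job_dir,              # Cloud Storage output location
        "--region", region,
        "--runtime-version", runtime_version,
        "--python-version", python_version,
    ]
    if user_args:
        # Everything after a bare "--" is passed through to our application.
        command.append("--")
        command.extend(user_args)
    return command


# Placeholder values for illustration only.
command = build_submit_command(
    job_id="census_20240327_093000",
    package_path="trainer/",
    module_name="trainer.task",
    job_dir="gs://example-bucket/census/output",
    region="us-central1",
    runtime_version="2.11",
    python_version="3.7",
    user_args=["--epochs", "10"],
)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if it is later passed to subprocess.run.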

Submitting the Training Job

AI Platform Training provides model training as an asynchronous (batch) service. We can submit a training job either by running gcloud ai-platform jobs submit training from the command line or by sending a request to the API.

Job ID

We must name our training job in accordance with these guidelines:

  • It must be unique within our Google Cloud project.
  • It may contain only mixed-case letters, numbers, and underscores.
  • It must begin with a letter.
  • It cannot be longer than 128 characters.

We are free to use any job name convention we wish. If we don't run many jobs, the name we choose may not matter much. If we run a lot of jobs, we may need to find our job ID in large lists, so it is a good idea to make our job IDs easy to tell apart.

A common technique is to define a base name for all jobs associated with a given model and then append a date/time string. This convention makes it easy to sort lists of jobs by name because all jobs for a model are then grouped together in ascending order.
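The naming rules and the base-name-plus-timestamp convention can be sketched as follows. The helper names and the timestamp format are assumptions made for illustration; uniqueness within the project cannot be checked locally, so the sketch only covers the character, first-letter, and length rules.

```python
import re
from datetime import datetime, timezone

# A letter first, then up to 127 letters, digits, or underscores (128 total).
JOB_ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def is_valid_job_id(job_id):
    """Check the character, first-letter, and length rules for a job ID."""
    return JOB_ID_PATTERN.fullmatch(job_id) is not None


def make_job_id(base_name, now=None):
    """Append a sortable UTC date/time string to a per-model base name."""
    now = now or datetime.now(timezone.utc)
    job_id = f"{base_name}_{now.strftime('%Y%m%d_%H%M%S')}"
    if not is_valid_job_id(job_id):
        raise ValueError(f"invalid job ID: {job_id!r}")
    return job_id


print(make_job_id("census"))
```

Putting the year first in the timestamp means an alphabetical sort of job names is also a chronological sort.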

Scale Tiers

When we run a training job on AI Platform Training, we must specify the number and types of machines we need. To make this easier, we can pick from a set of predefined cluster specifications called scale tiers. Alternatively, we can choose a custom tier and specify the machine types ourselves.

To specify a scale tier, we add it to the TrainingInput object in our job configuration. If we submit our training job with the gcloud command, we can use the same identifiers.
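A minimal sketch of the scale-tier portion of a TrainingInput object, written as a Python dictionary. The field names (scaleTier, masterType, workerType, workerCount, parameterServerType, parameterServerCount) and tier identifiers follow the AI Platform Training API, but the helper function is hypothetical and the machine type names in the usage example are illustrative values that should be checked against the currently supported list.

```python
PREDEFINED_TIERS = {"BASIC", "STANDARD_1", "PREMIUM_1", "BASIC_GPU", "BASIC_TPU"}


def training_input(scale_tier, master_type=None, worker_type=None,
                   worker_count=0, parameter_server_type=None,
                   parameter_server_count=0):
    """Build the scale-tier fields of a TrainingInput-style dictionary."""
    if scale_tier in PREDEFINED_TIERS:
        # Predefined tiers fix the cluster shape; nothing else is needed.
        return {"scaleTier": scale_tier}
    if scale_tier != "CUSTOM":
        raise ValueError(f"unknown scale tier: {scale_tier!r}")
    if master_type is None:
        raise ValueError("the CUSTOM tier needs a master machine type")

    config = {"scaleTier": "CUSTOM", "masterType": master_type}
    if worker_count > 0:
        config["workerType"] = worker_type
        config["workerCount"] = worker_count
    if parameter_server_count > 0:
        config["parameterServerType"] = parameter_server_type
        config["parameterServerCount"] = parameter_server_count
    return config


# Illustrative machine types; verify them before using in a real job.
print(training_input("STANDARD_1"))
print(training_input("CUSTOM", master_type="n1-standard-8",
                     worker_type="n1-standard-8", worker_count=4,
                     parameter_server_type="n1-standard-4",
                     parameter_server_count=2))
```

With a predefined tier, only the tier name is sent; the machine type fields matter only when the tier is CUSTOM.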

Hyperparameter Tuning

If we want to use hyperparameter tuning, we must include configuration details when we create our training job. Before doing so, it is worth reading a conceptual overview of hyperparameter tuning and a guide on how to use it.

Regions and Zones

Google Cloud uses regions, subdivided into zones, to define the geographic location of physical computing resources. When we run a training job on AI Platform Training, we specify the region that we want it to run in.

If we store our training dataset on Cloud Storage, we should run our training job in the same region as the Cloud Storage bucket that holds our training data. If we run the job in a different region from our data bucket, it may take longer.
Read the region guide to determine which regions are available for AI Platform Training services such as model training and online/batch prediction.

Using job-dir as a Common Output Directory

When we configure a job, we can specify an output directory for it. When we submit the job, AI Platform Training does the following:

  • Validates the directory so that we can fix any problems before the job runs.
  • Passes the path to our application as a command-line argument named --job-dir.

We must account for the --job-dir argument in our application. We capture the argument value when we parse our other parameters and use it when saving our application's output.
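A minimal sketch of capturing the job directory argument with argparse alongside other parameters. The --epochs option, the bucket path, and the output_path helper are hypothetical and only stand in for whatever an application actually needs.

```python
import argparse


def parse_args(argv=None):
    """Parse the job directory passed by the service plus our own options."""
    parser = argparse.ArgumentParser(description="Example training entry point")
    parser.add_argument("--job-dir", required=True,
                        help="Directory (often a gs:// path) for job output")
    parser.add_argument("--epochs", type=int, default=10,
                        help="Hypothetical application-specific option")
    return parser.parse_args(argv)


def output_path(job_dir, filename):
    """Join with '/' explicitly, since Cloud Storage paths are not OS paths."""
    return job_dir.rstrip("/") + "/" + filename


# Explicit argument list for demonstration; a real job would read sys.argv.
args = parse_args(["--job-dir", "gs://example-bucket/census/output/",
                   "--epochs", "3"])
print(output_path(args.job_dir, "model.json"))
```

argparse exposes --job-dir as args.job_dir, so the value can be passed directly to whatever code writes checkpoints and the exported model.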

Runtime Version

If we want to train with one of AI Platform Training's hosted machine learning frameworks, we specify a supported AI Platform Training runtime version for our training job. The runtime version dictates the versions of TensorFlow, scikit-learn, XGBoost, and other Python packages that are installed on our allocated training instances. We should specify a version that gives us the functionality we need. If we run the training job locally as well as in the cloud, we should make sure the local and cloud jobs use the same runtime version.

Input Data

The data that we use in our training job must obey the following rules to run on AI Platform Training:

  • The data must be in a format that we can read and feed to our training code.
  • The data must be in a location that our code can access. This typically means storing it with one of Google Cloud's storage or big data services.

Output Data

It is common for applications to output data, including checkpoints during training and a saved model when training completes. We can output other data as our application needs. It is easiest to save our output files to a Cloud Storage bucket in the same Google Cloud project as our training job.

Building Training Jobs that are Resilient to VM Restarts

Google Cloud VMs occasionally restart. To make sure our training job is resilient to these restarts, we should save model checkpoints regularly and configure the job to restore the most recent checkpoint.

We typically save model checkpoints in the Cloud Storage path that we specify with the --job-dir argument in the gcloud ai-platform jobs submit training command.

The TensorFlow Estimator API implements checkpoint functionality for us. If our model is wrapped in an Estimator, we do not need to worry about restart events on our VMs.

If we cannot wrap our model in a TensorFlow Estimator, we should write checkpoint saving and restoring functionality into our training code. The tf.train module from TensorFlow provides helpful resources for this.
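The save-and-restore pattern itself can be shown without any framework. The sketch below is not the tf.train API; it is a standard-library illustration in which the "model" is just a running total, the file naming scheme is an assumption, and checkpoints go to a local directory rather than Cloud Storage.

```python
import json
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"checkpoint-(\d+)\.json")


def save_checkpoint(directory, step, state):
    """Write via a temp file so a restart never sees a half-written file."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, f"checkpoint-{step}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)


def latest_checkpoint(directory):
    """Return (step, state) of the newest checkpoint, or (0, None) if none."""
    if not os.path.isdir(directory):
        return 0, None
    matches = [CHECKPOINT_RE.fullmatch(name) for name in os.listdir(directory)]
    steps = [int(m.group(1)) for m in matches if m]
    if not steps:
        return 0, None
    with open(os.path.join(directory, f"checkpoint-{max(steps)}.json")) as f:
        data = json.load(f)
    return data["step"], data["state"]


def run_training(directory, total_steps, checkpoint_every=10, stop_after=None):
    """Toy loop that resumes from the most recent checkpoint if one exists."""
    step, state = latest_checkpoint(directory)
    total = state["total"] if state else 0
    while step < total_steps:
        step += 1
        total += step  # stand-in for one real training step
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(directory, step, {"total": total})
        if stop_after is not None and step >= stop_after:
            return step, total  # simulate the VM restarting at this point
    return step, total


checkpoint_dir = tempfile.mkdtemp()
run_training(checkpoint_dir, total_steps=100, stop_after=35)  # interrupted run
print(run_training(checkpoint_dir, total_steps=100))          # resumed run
```

The interrupted run loses only the steps after its last checkpoint; the second call restores that checkpoint and finishes with the same result an uninterrupted run would produce.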

Training with GPUs

We can run our training jobs on AI Platform Training with graphics processing units (GPUs). GPUs are designed to perform mathematically intensive operations at high speed. They can be more effective at running certain operations on tensor data than adding another machine with one or more CPU cores.

The AI Platform Training service does not provide any special interface for working with GPUs. We can specify GPU-enabled machines to run our job, and the service allocates them for us. In a TensorFlow training job, for instance, we can assign TensorFlow Ops to GPUs in our code. When we specify a machine type with GPU access for a task type, every instance assigned to that task type is configured identically, and the service runs a single replica of our code on each machine.

If we train with a different machine learning framework using a custom container, that framework may provide a different interface for working with GPUs.

Not all models benefit from running on GPUs. We recommend GPUs for large, complex models that have many mathematical operations. Even then, we should test the benefit of GPU support by running a small sample of our data through training.

Note that the preceding section only applies to custom containers and TensorFlow training.

Frequently Asked Questions

What is the AI platform in GCP?

The AI Platform training service lets us train models using a wide range of customisation options. We can choose from many machine types to power our training jobs, enable distributed training, use hyperparameter tuning, and accelerate with GPUs and TPUs.

What is Google Compute Engine?

Google Compute Engine is the core component of the Google Cloud Platform. It is an IaaS (Infrastructure as a Service) offering that provides flexible, self-managed Linux- and Windows-based virtual machines hosted on Google's infrastructure. The virtual machines run on KVM and can use both local and durable storage options.

Why is a virtualisation platform required to construct a cloud?

Virtualisation lets us create virtual versions of storage, operating systems, applications, networks, and so on. With the right virtualisation, we can augment our existing infrastructure, since multiple applications and operating systems can run on the servers that are already installed.

What other techniques are there for logging into the Google Compute Engine API?

The Google Compute Engine API can be authenticated in several ways: using client libraries, using OAuth 2.0, or using an access token directly.


AI is a concrete set of capabilities that unlocks revenue growth and cost reductions, not a solution in search of a problem. AI applications are found across all industries and in a wide range of business processes because AI can bring larger data sets into analyses, recognise concepts and patterns in data more accurately than rules-based systems, and enable human-to-machine interaction. Every industry can use it to open up new opportunities and increase efficiency. In this article, we learned about AI Platform Training. You can also read about Cloud Computing and explore our courses on Data Science and machine learning. Do not forget to check out more blogs on GCP.

Explore Coding Ninjas Studio to find more exciting stuff. Happy Coding!
