Code360 powered by Coding Ninjas
Last Updated: Mar 27, 2024

Training Workflow in AI Platform with GCP


AI Platform Training in Google Cloud hosts several machine learning frameworks, such as TensorFlow, scikit-learn, and XGBoost. There are various options to configure and customise your development environment, and after training you can deploy the trained model to AI Platform Prediction. Before a training application can run on AI Platform Training, its code must be uploaded to the Cloud. Packaging the training application, running it, and monitoring it together constitute the training workflow in Google Cloud.

Packaging an Application

For training applications with AI Platform Training, one must upload their code and any dependencies into a Cloud Storage bucket accessible by the Google Cloud project. Let’s see the different ways to package an application to the Cloud.

Using gcloud for Packaging

Using the gcloud CLI is the simplest way to package a training application and upload it, along with its dependencies, to the Cloud. A single command packages the application, uploads it, and submits the training job: “gcloud ai-platform jobs submit training”.

It's helpful to define the configuration values as shell variables in the CLI:


gcloud ai-platform jobs submit training $JOB_NAME \
    --staging-bucket=$STAGING_BUCKET \
    --job-dir=$JOB_DIR \
    --package-path=$PACKAGE_PATH \
    --module-name=$MODULE_NAME \
    --region=$REGION \
    -- \
    --user_first_arg=first_arg_value

In this command:

  • JOB_NAME is a name for the training job.
  • STAGING_BUCKET is the Cloud Storage bucket where the packaged application is staged.
  • JOB_DIR is the URI of a Cloud Storage directory where the training job will save its output.
  • PACKAGE_PATH is the path to the package’s directory in the local environment.
  • MODULE_NAME is the full name of the training module.
  • REGION is the region where the training job runs.
  • Everything after the bare -- is passed to the training application as user arguments.
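The same command can also be assembled programmatically, which makes the split between job flags and user arguments explicit. The following is a minimal sketch, not an official helper: the function name build_submit_command and every example value are hypothetical, and only the gcloud flags already shown above are used.

```python
import shlex


def build_submit_command(job_name, staging_bucket, job_dir, package_path,
                         module_name, region, user_args=None):
    """Return the argv list for `gcloud ai-platform jobs submit training`."""
    command = [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        f"--staging-bucket={staging_bucket}",
        f"--job-dir={job_dir}",
        f"--package-path={package_path}",
        f"--module-name={module_name}",
        f"--region={region}",
    ]
    if user_args:
        # A bare "--" separates gcloud's own flags from the arguments
        # that are handed to the training application.
        command.append("--")
        command.extend(user_args)
    return command


if __name__ == "__main__":
    # All values below are placeholders chosen for illustration.
    argv = build_submit_command(
        job_name="my_first_job",
        staging_bucket="gs://my-bucket",
        job_dir="gs://my-bucket/output",
        package_path="./trainer",
        module_name="trainer.task",
        region="us-central1",
        user_args=["--user_first_arg=first_arg_value"],
    )
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(argv))
```

The list can be passed to subprocess.run on a machine where the gcloud CLI is installed and authenticated; printing it first is a cheap way to review the job before submitting it.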

Working with dependencies

Dependencies are packages that are imported into the code. The application may have any number of dependencies to make it work. A training application runs on training instances that have many Python packages previously installed. A user may need to add two types of dependencies:

  • Standard dependencies - The common Python packages available on PyPI.
  • Custom packages developed by the user or those internal to an organisation.

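Standard dependencies can be declared in a setup.py script in the training application's root directory. AI Platform Training uses pip to install the package, together with the dependencies it lists, on the training instances. In the example below, the package name and version are placeholders.

from setuptools import find_packages
from setuptools import setup

# Standard PyPI dependencies that pip installs on the training instances.
REQUIRED_PACKAGES = ['some_PyPI_package>=1.0']

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    description='The training application packages.')

Run the following command to build the package as a source distribution:

python setup.py sdist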

Custom dependencies can also be specified for the training application. Users pass the URI of each dependency's package as part of the job configuration. All custom dependencies have to be stored in Cloud Storage.

In the gcloud CLI, users can specify the dependencies on their local machine and Cloud Storage as part of the “gcloud ai-platform jobs submit training” command. The --packages flag includes the dependencies in a comma-separated list.
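As a small illustration of the comma-separated format, the sketch below joins a mix of local and Cloud Storage package paths into one --packages flag. The helper name packages_flag and the example paths are hypothetical.

```python
def packages_flag(dependencies):
    """Join dependency package paths into a single --packages flag."""
    # Drop empty entries and stray whitespace before joining.
    cleaned = [path.strip() for path in dependencies if path.strip()]
    if not cleaned:
        raise ValueError("at least one package path is required")
    return "--packages=" + ",".join(cleaned)


if __name__ == "__main__":
    # One local package and one package already in Cloud Storage
    # (both paths are placeholders).
    print(packages_flag(["dep1.tar.gz", "gs://my-bucket/deps/dep2.tar.gz"]))
```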

Uploading Existing Packages

Users can upload previously built packages with the Cloud CLI. These could be uploaded from the local system or Cloud Storage. In the “gcloud ai-platform jobs submit training” command:

  • Set the --packages flag to the path of the packaged application.
  • Set the --module-name flag to the name of the application's main module, using the package's namespace dot notation.

gcloud ai-platform jobs submit training $JOB_NAME \
    --staging-bucket $PACKAGE_STAGING_PATH \
    --job-dir $JOB_DIR \
    --packages trainer-0.0.1.tar.gz \
    --module-name $MODULE_NAME \
    --region us-central1 \
    -- \
    --user_first_arg=first_arg_value

Packages can also be uploaded manually by using the gsutil tool:

gsutil cp /local/path/to/package1.tar.gz  gs://bucket/path/

Running a Training Job

Model training on AI Platform runs as an asynchronous (batch) service. The training job can be configured and submitted by running “gcloud ai-platform jobs submit training” from the CLI or by sending a request to the API.

Configuring the job

The first step in configuring the job is to gather the job configuration data. The following properties define the job:

  1. Job name (jobId): the name for the job.
  2. Cluster configuration: the type of processing cluster to run the job on.
  3. Disk configuration: the configuration of the boot disk for each training VM.
  4. Training application package: the package staged in a Cloud Storage location.
  5. Module name: the name of the main module in the package.
  6. Region (region): the region where the job runs.
  7. Job directory (jobDir): the path to a Cloud Storage location where the job output is stored.
  8. Runtime version: the AI Platform Training runtime version to use for the job.
  9. Python version: the Python version to use for the job.
  10. Maximum wait time: how long the job may remain in the QUEUED and PREPARING states.
  11. Service account: the email address of the service account used when running the training application.

The configuration details can be formatted using command-line flags or in a separate YAML file representing the Job resource. 
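To show how these properties fit together, here is a minimal sketch that collects them into a dictionary shaped like the Job resource and prints it as JSON. The field names (trainingInput, packageUris, pythonModule, runtimeVersion, pythonVersion, scheduling.maxWaitTime, serviceAccount) reflect my understanding of the AI Platform Training REST reference and should be checked against it; the function name and all example values are hypothetical.

```python
import json


def build_job_resource(job_id, package_uris, python_module, region, job_dir,
                       runtime_version, python_version, scale_tier="BASIC",
                       max_wait_seconds=None, service_account=None):
    """Collect the job properties into a Job-shaped dictionary."""
    training_input = {
        "scaleTier": scale_tier,              # cluster configuration
        "packageUris": list(package_uris),    # staged application package(s)
        "pythonModule": python_module,        # main module in the package
        "region": region,
        "jobDir": job_dir,
        "runtimeVersion": runtime_version,
        "pythonVersion": python_version,
    }
    if max_wait_seconds is not None:
        # Assumption: durations are written as a number of seconds plus "s".
        training_input["scheduling"] = {"maxWaitTime": f"{max_wait_seconds}s"}
    if service_account:
        training_input["serviceAccount"] = service_account
    return {"jobId": job_id, "trainingInput": training_input}


if __name__ == "__main__":
    # Every value here is a placeholder.
    job = build_job_resource(
        job_id="my_first_job",
        package_uris=["gs://my-bucket/staging/trainer-0.0.1.tar.gz"],
        python_module="trainer.task",
        region="us-central1",
        job_dir="gs://my-bucket/output",
        runtime_version="2.11",
        python_version="3.7",
        max_wait_seconds=3600,
    )
    print(json.dumps(job, indent=2))
```

The trainingInput part of this structure is what a separate YAML configuration file would hold when the job name itself is given on the command line.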

After formatting the configuration details, the job can be submitted. Two kinds of flags are specified at this stage:

  1. Job configuration parameters to set up resources in the Cloud and deploy the application on each node in the processing cluster.
  2. User arguments or application parameters.

Specifying machine types or scale tiers 

Specifying the number and types of machines required to run the AI Platform training job is essential. Users can pick from a set of predefined cluster specifications known as scale tiers to make the process easier. There is also an option to choose a custom tier and specify the machine types.

Scale Tiers

Google optimises the configuration of the scale tiers for different jobs over time, based on feedback and the availability of cloud resources. Each scale tier is defined by its suitability for specific kinds of jobs. More advanced tiers allocate more machines to the cluster, each with more powerful specifications.

[Figure: VM Instance Types]

Machine types for the custom scale tier

Using a custom scale tier gives finer control over the processing cluster used to train the model. The configuration is specified in the TrainingInput object of the job configuration. To run a job with a single worker, specify a machine type for the master only. Here's an example of the config.yaml file:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
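A custom cluster with more than one machine adds worker and parameter-server settings next to masterType. The sketch below builds such a configuration and rejects inconsistent input. It assumes the TrainingInput field names workerType, workerCount, parameterServerType and parameterServerCount; the function name and the machine counts are hypothetical, and n1-highcpu-16 is reused from the example above.

```python
import json


def custom_training_input(master_type, worker_type=None, worker_count=0,
                          parameter_server_type=None,
                          parameter_server_count=0):
    """Build a trainingInput mapping for the CUSTOM scale tier."""
    if not master_type:
        raise ValueError("masterType is required for the CUSTOM scale tier")
    if worker_count > 0 and not worker_type:
        raise ValueError("workerType is required when workerCount > 0")
    if parameter_server_count > 0 and not parameter_server_type:
        raise ValueError(
            "parameterServerType is required when parameterServerCount > 0")

    training_input = {"scaleTier": "CUSTOM", "masterType": master_type}
    if worker_count > 0:
        training_input["workerType"] = worker_type
        training_input["workerCount"] = worker_count
    if parameter_server_count > 0:
        training_input["parameterServerType"] = parameter_server_type
        training_input["parameterServerCount"] = parameter_server_count
    return {"trainingInput": training_input}


if __name__ == "__main__":
    # One master, four workers and two parameter servers (placeholder sizes).
    config = custom_training_input(
        master_type="n1-highcpu-16",
        worker_type="n1-highcpu-16",
        worker_count=4,
        parameter_server_type="n1-highcpu-16",
        parameter_server_count=2,
    )
    print(json.dumps(config, indent=2))
```

Calling custom_training_input("n1-highcpu-16") with no other arguments reproduces the single-worker configuration shown in config.yaml.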

Following are the predefined Compute Engine machine type identifiers for training jobs.

[Figure: Compute Engine machine identifiers]

Monitoring training jobs 

How long a training job takes depends on the size of the data set and the complexity of the code. Users can check the status of their job, monitor its resource consumption, and save summaries for later visualisation in TensorBoard.

The AI Platform Training Jobs page in the Google Cloud console lists the jobs. Jobs can be filtered by Type, JobID, State, and job creation time. Choose a job to open its Job details page, where the job status appears at the top of the report.

The Job details page also shows resource utilisation charts for the training job. These cover the job's aggregate CPU or GPU utilisation and its memory utilisation, as well as its network usage, measured in bytes per second.
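Job status can also be checked from a script. The sketch below reads the JSON description of a job and decides whether the job has finished. It assumes the description comes from “gcloud ai-platform jobs describe” run with --format=json and that it carries a state field with values such as QUEUED, PREPARING, RUNNING, SUCCEEDED, FAILED and CANCELLED; the function names and the sample description are hypothetical.

```python
import json

# States after which the job will not change any further (assumed set).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def job_state(describe_output):
    """Extract the state from the JSON description of a job."""
    job = json.loads(describe_output)
    return job.get("state", "STATE_UNSPECIFIED")


def is_finished(state):
    """Return True once the job has reached a terminal state."""
    return state in TERMINAL_STATES


if __name__ == "__main__":
    # Hand-written sample standing in for real command output.
    sample = '{"jobId": "my_first_job", "state": "RUNNING"}'
    state = job_state(sample)
    print(state, "finished" if is_finished(state) else "still in progress")
```

A polling loop would call the describe command at intervals and stop as soon as is_finished returns True.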

Frequently Asked Questions

What is TensorBoard?

TensorBoard is TensorFlow’s visualisation toolkit that helps track and visualise metrics such as loss and accuracy. Users can visualise the model graph and view histograms of weights, biases, or other tensors as they vary with time.

What are the different machine types for custom scale tiers?

For the custom scale tier, machine types are set through the following fields: masterType for the master worker, workerType for the workers, parameterServerType for parameter servers and evaluatorType for evaluators.

How does the API create a job?

The API exposes a REST method that creates a training or a batch prediction job. Its request body contains an instance of a job, and a successful response body contains the newly created instance of the job.


This blog discusses the training workflow in Google Cloud’s AI Platform Training. It explains the process of packaging applications to the cloud, running jobs and monitoring them on GCP.

