Introduction
Google has been one of the main newsmakers in the AI (Artificial Intelligence) space. Millions of users rely every day on Google features powered by AI behind the scenes. With the services included in the Google AI Platform, we can create, train, and deploy our machine learning models in the cloud. AI Platform Notebooks lets us run Jupyter notebooks on a GCP (Google Cloud Platform) virtual machine. AI Platform Training provides reliable computing resources for model training, and AI Platform Prediction lets us serve a trained model as a service.
How AI Training Works
AI Platform Training runs the training job on cloud computing resources. We can train a built-in algorithm (beta) on our dataset without writing any training application. If the built-in algorithms do not suit our needs, we can write our own training application to run on AI Platform Training.
Here is an outline of how to use our training app:
- We write a Python application that builds and trains our model, the same way we would run it locally in our development environment (a minimal sketch of such an application follows this outline).
- We gather our training and verification data and store it in a location that AI Platform Training can access. This usually means storing the data in Cloud Storage, Cloud Bigtable, or another Google Cloud storage service in the same Google Cloud project as AI Platform Training.
- When our application is ready, we package it and upload it to a Cloud Storage bucket that our project can access. This happens automatically when we use the Google Cloud CLI (Command Line Interface) to submit a training job.
- The AI Platform Training service allocates the computing resources for our job. Based on our job configuration, it sets up one or more virtual machines (called training instances). Each training instance is set up by:
a) Applying the standard machine image for the AI Platform Training version our job requires.
b) Loading our application package and installing it with pip.
c) Installing any additional packages specified as dependencies.
- The training service runs our application, passing in any command-line arguments we specify when we create the training job.
- We can get information about our running job in several ways:
a) By using Cloud Logging.
b) By requesting job details or streaming logs with the gcloud command-line tool.
c) By programmatically making status requests to the training service.
- When our training job succeeds or hits an unrecoverable error, AI Platform Training stops all job processes and cleans up the resources.
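Here is a minimal sketch of what such a training application might look like. The flag names (--train-data, --job-dir), the tf.keras model, and the synthetic data are placeholders of our own choosing, not anything AI Platform Training prescribes:

```python
# trainer/task.py -- hypothetical entry point for a custom training application.
# AI Platform Training simply forwards whatever command-line arguments we define
# when we submit the job; the flags below are illustrative.
import argparse
import os

import numpy as np
import tensorflow as tf


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-data", default=None,
                        help="Location of the training data (e.g. a gs:// path).")
    parser.add_argument("--job-dir", default="output",
                        help="Directory to export the trained model to.")
    parser.add_argument("--epochs", type=int, default=5)
    return parser.parse_args()


def load_data(path):
    # A real job would read from the path given in --train-data (typically
    # Cloud Storage); synthetic data keeps this sketch self-contained.
    x = np.random.rand(1000, 10).astype("float32")
    y = (x.sum(axis=1) > 5.0).astype("float32")
    return x, y


def main():
    args = parse_args()
    x, y = load_data(args.train_data)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x, y, epochs=args.epochs, batch_size=32)

    # Export the trained model under the job directory.
    model.save(os.path.join(args.job_dir, "model.keras"))


if __name__ == "__main__":
    main()
```

Once the application is packaged, submitting it with the gcloud CLI (gcloud ai-platform jobs submit training) typically stages the package in Cloud Storage for us and passes our flags through to the running job.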
A Typical Machine Learning Application
The AI Platform Training service is designed to have as little impact on our application as possible. This frees us up to concentrate on our model code.
Most machine learning applications:
- Provide a way to obtain training and evaluation data.
- Process individual data instances.
- Use evaluation data to test the model's accuracy, that is, how often it predicts the correct value.
- For TensorFlow training applications, provide a way to write checkpoints at intervals to capture a snapshot of the model's progress.
- Provide a way to export the trained model when the application finishes (these stages are illustrated in the sketch below).
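The sketch below condenses these stages into one tf.keras script. The synthetic dataset, file names, and hyperparameters are illustrative placeholders, not anything supplied by AI Platform Training:

```python
# A self-contained sketch of the typical stages of a training application.
import numpy as np
import tensorflow as tf

# 1. Obtain training and evaluation data (synthetic here).
x = np.random.rand(2000, 20).astype("float32")
y = (x.mean(axis=1) > 0.5).astype("float32")
x_train, y_train = x[:1600], y[:1600]
x_eval, y_eval = x[1600:], y[1600:]

# 2. Build and train the model, writing a checkpoint after each epoch as a
#    snapshot of its progress.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="ckpt-{epoch:02d}.weights.h5",
    save_weights_only=True)
model.fit(x_train, y_train, epochs=3, batch_size=64, callbacks=[checkpoint_cb])

# 3. Use evaluation data to measure how often the model predicts the correct value.
loss, accuracy = model.evaluate(x_eval, y_eval)
print(f"evaluation accuracy: {accuracy:.3f}")

# 4. Export the trained model when training finishes.
model.save("exported_model.keras")
```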
Distributed Training Structure
When we use AI Platform Training to run a distributed TensorFlow job, we configure several machines (nodes) in a training cluster. The training service allocates the resources for the machine types we select. Our running job on a given node is called a replica. In line with the distributed TensorFlow model, each replica in the training cluster is assigned a single role:
- Master: Exactly one replica is designated the master worker. This replica manages the others and reports the status of the job as a whole. The training service runs until our job succeeds or hits an unrecoverable error, and in distributed training the status of the master replica signals the overall job status.
- If we run a single-process job, the sole replica is the master for that job.
- Workers: One or more replicas may be designated as workers. These replicas do their share of the work as specified in our job configuration.
- Parameter servers: One or more replicas may be designated as parameter servers. These replicas hold the model state shared by the workers (the sketch after this list shows how a replica can discover which role it has been given).
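On AI Platform Training, distributed TensorFlow replicas receive their role through the TF_CONFIG environment variable. Below is a minimal sketch of reading it; the fallback cluster description is only a made-up illustration of the expected shape:

```python
# Sketch: read the TF_CONFIG environment variable that AI Platform Training sets
# on each replica of a distributed TensorFlow job to tell it its role.
import json
import os

tf_config = json.loads(os.environ.get("TF_CONFIG", json.dumps({
    "cluster": {
        "master": ["master-host:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "master", "index": 0},
})))

role = tf_config["task"]["type"]    # "master", "worker", or "ps"
index = tf_config["task"]["index"]  # which replica of that type this is

if role == "master":
    print("Master replica: coordinates the job and reports overall status.")
elif role == "worker":
    print(f"Worker {index}: does its share of the training work.")
elif role == "ps":
    print(f"Parameter server {index}: holds the shared model state.")
```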
Distributed Training Strategies
There are three fundamental approaches to training a model across multiple nodes:
- Synchronous updates with data-parallel training.
- Asynchronous updates with data-parallel training.
- Model-parallel training.
The data-parallel strategy is a useful starting point for applying distributed training to a custom model, because it can be used regardless of the model's topology. In data-parallel training, every worker node holds the entire model. As in mini-batch processing, each node independently calculates gradient vectors from its own subset of the training dataset. The resulting gradient vectors are sent to the parameter server node, and the model parameters are updated with their combined sum. If we divide 10,000 batches among ten worker nodes, each node handles roughly 1,000 batches.
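To make the data-parallel idea concrete, here is a toy NumPy sketch of synchronous updates (a simplified simulation, not the actual AI Platform machinery): ten simulated workers each compute a gradient on their own shard of the data, and a simulated parameter server combines those gradients before updating the shared parameters:

```python
# Toy synchronous data-parallel training on a linear model with plain NumPy.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                       # shared model parameters
x = rng.normal(size=(10_000, 5))      # full training dataset
y = x @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10_000)

num_workers = 10
shards = np.array_split(np.arange(len(x)), num_workers)  # ~1,000 examples each

def shard_gradient(w, idx):
    """Gradient of mean squared error on one worker's shard."""
    err = x[idx] @ w - y[idx]
    return 2 * x[idx].T @ err / len(idx)

learning_rate = 0.1
for step in range(100):
    # Each worker computes its gradient independently on its own shard.
    grads = [shard_gradient(w, idx) for idx in shards]
    # Synchronous update: the parameter server waits for all workers, then
    # applies the combined (averaged) gradient in one step.
    w -= learning_rate * np.mean(grads, axis=0)

print("learned weights:", np.round(w, 2))
```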
Data-parallel training can use either synchronous or asynchronous updates. With asynchronous updates, the parameter server applies each gradient vector independently, as soon as it receives it from one of the worker nodes.
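For contrast, here is the same toy setup with asynchronous updates: the simulated parameter server applies each worker's gradient immediately as it arrives, so later gradients may be computed against parameters that have already moved:

```python
# Toy asynchronous data-parallel training: gradients are applied as they arrive.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)
x = rng.normal(size=(10_000, 5))
y = x @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10_000)

shards = np.array_split(np.arange(len(x)), 10)
learning_rate = 0.05

def shard_gradient(w, idx):
    err = x[idx] @ w - y[idx]
    return 2 * x[idx].T @ err / len(idx)

for step in range(100):
    for idx in shards:                 # workers report in arbitrary order
        grad = shard_gradient(w, idx)  # computed against the current (possibly stale) w
        w -= learning_rate * grad      # applied immediately, without waiting for others

print("learned weights:", np.round(w, 2))
```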

Please remember that the preceding section only applies to TensorFlow and custom container training.





