Table of contents
1. Introduction
2. Introducing Deep Learning Containers
  2.1. Choose a container and develop locally
  2.2. Create derivative containers and deploy to Cloud AI Platform Notebooks and GKE
  2.3. Deploying Deep Learning Containers on GKE with NVIDIA GPUs
3. Get started with a local deep learning container
  3.1. Prerequisite
  3.2. Create your container
4. Deep Learning Containers overview
5. Choose a container image
  5.1. Included dependencies
  5.2. TensorFlow Enterprise container images
  5.3. Experimental images
  5.4. Listing all available versions
    5.4.1. Using locally
6. Train in a container using Google Kubernetes Engine
  6.1. Prerequisite
  6.2. Open your command line tool
    6.2.1. Use Google Cloud Shell
    6.2.2. Use command-line tools locally
  6.3. Create a GKE cluster
  6.4. Create the Dockerfile
  6.5. Build and upload the container image
  6.6. Deploy your application
7. Create a derivative container
  7.1. Prerequisite
  7.2. Process
8. Create the initial Dockerfile and run modification commands
  8.1. Build and push the container image
9. Frequently Asked Questions
  9.1. What is a deep learning container?
  9.2. Does GCP support Docker containers?
  9.3. What is a GCP container engine?
10. Conclusion
Last Updated: Mar 27, 2024

Deep Learning Containers

Author Nagendra

Introduction

It's easy to underestimate how long it takes to launch a machine learning project. Environment setup can be frustrating and time-consuming, keeping you from what you actually want to do: iterating on and refining your model. All too frequently, these projects require you to manage the compatibility and complexity of an ever-evolving software stack.
In this article, we will look at Deep Learning Containers, local deep learning containers, training a container using Google Kubernetes Engine, and derivative containers, so you can skip this setup and start working on your project right away.

Without further ado, let's get started.

Introducing Deep Learning Containers

You can start using Deep Learning Containers right away because they are pre-packaged, performance-optimised, and compatibility-tested. Productionizing your workflow requires not only creating the code or artifacts you intend to deploy but also preserving a consistent execution environment to ensure reproducibility and accuracy. If your development strategy combines local prototyping with a variety of cloud technologies, it can be challenging to ensure that all relevant dependencies are packaged appropriately and available to every runtime. Deep Learning Containers address this difficulty by offering a uniform environment for testing and deploying your application across GCP products and services, such as Cloud AI Platform Notebooks and Google Kubernetes Engine (GKE), making it simple to scale in the cloud or move between on-premises and cloud environments.

Choose a container and develop locally

Every Deep Learning Container includes a preconfigured Jupyter environment, so each one can be used as a prototyping environment right away. First, make sure the gcloud utility is installed and configured. Then choose the container you want to use. The following command lists all containers hosted under gcr.io/deeplearning-platform-release:

Command:

gcloud container images list --repository="gcr.io/deeplearning-platform-release"


Each container provides a Python 3 environment consistent with the corresponding Deep Learning VM, including conda, the chosen data science framework, the NVIDIA stack for GPU images (CUDA, cuDNN, NCCL), and a variety of other supporting packages and tools. The initial release consists of TensorFlow 1.13, TensorFlow 2.0, PyTorch, and R containers, with parity across all Deep Learning VM types as the goal.
Except for the base containers, container names follow the format <framework>-<cpu|gpu>.<framework version>. Suppose you want to prototype with TensorFlow on the CPU only. The following command starts the TensorFlow Deep Learning Container in detached mode, binds the Jupyter server to port 8080 on the local machine, and mounts /path/to/local/dir to /home in the container.

Command:

docker run -d -p 8080:8080 -v /path/to/local/dir:/home \
  gcr.io/deeplearning-platform-release/tf-cpu.1-13


Then visit localhost:8080 to access the running JupyterLab instance. Make sure to develop in /home, because any files outside it will be deleted when the container is stopped.
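
When you're done, you can stop the container from another terminal using standard Docker commands. A minimal sketch (CONTAINER_ID is whatever docker ps reports for the image):

Command:

docker ps                  # find the ID of the running Deep Learning Container
docker stop CONTAINER_ID   # stop it; files outside /home are discarded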

To use the GPU-enabled containers, you must have a CUDA 10 compatible GPU, the corresponding driver, and nvidia-docker installed. You can then issue a command akin to the following.

Command:

docker run --runtime=nvidia -d -p 8080:8080 -v /path/to/local/dir:/home \
  gcr.io/deeplearning-platform-release/tf-gpu.1-13

Create derivative containers and deploy to Cloud AI Platform Notebooks and GKE

Eventually you'll need a machine with more horsepower than your local machine can provide, but you may have local data and packages that must be placed in the environment first. You can add your local files to a Deep Learning Container, customise it, and then deploy the result to a Cloud AI Platform Notebooks instance or to GKE.
Consider a scenario in which your PyTorch workflow uses a local Python package called "mypackage". Create a file named Dockerfile in the directory that contains mypackage:

Code:

FROM gcr.io/deeplearning-platform-release/pytorch-gpu
COPY mypackage /mypackage
RUN pip install /mypackage


This straightforward Dockerfile copies the package files into the environment and installs the package. You can add additional RUN pip/conda commands, but CMD and ENTRYPOINT should not be changed, because they are already configured for AI Platform Notebooks. Build this container and upload it to Google Container Registry.

Code:

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=pytorch_custom_container
export IMAGE_TAG=$(date +%Y%m%d_%H%M%S)
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile -t $IMAGE_URI ./
gcloud auth configure-docker
docker push $IMAGE_URI


Afterward, use the gcloud CLI to create an AI Platform Notebooks instance (custom container UI support coming soon). Change the instance type and accelerator fields as necessary to suit your workload.

Code:

export IMAGE_FAMILY="common-container" 
export ZONE="us-central1-b"
export INSTANCE_NAME="custom-container-notebook"
export INSTANCE_TYPE="n1-standard-8"
export ACCELERATOR="type=nvidia-tesla-t4,count=2"
gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project="deeplearning-platform-release" \
        --maintenance-policy=TERMINATE \
        --accelerator=$ACCELERATOR \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=100GB \
        --scopes=https://www.googleapis.com/auth/cloud-platform \
        --metadata="install-nvidia-driver=True,proxy-mode=project_editors,container=$IMAGE_URI"


It will take some time to set up the image. If the container loaded properly, the proxy-url metadata field will contain a link to JupyterLab, and the instance will show as ready in the AI Platform > Notebooks UI in the Cloud Console. You can also retrieve the link directly by querying the instance metadata.

Command:

gcloud compute instances describe "${INSTANCE_NAME}" \
  --format='value[](metadata.items.proxy-url)'


You can access your JupyterLab instance at this URL.

Deploying Deep Learning Containers on GKE with NVIDIA GPUs

You can also use GKE to develop with your Deep Learning Containers. After setting up your GKE cluster with GPUs in accordance with the user guide, all that remains is to specify the container image in your Kubernetes pod spec. The following spec creates a pod with one GPU from tf-gpu and an attached GCE persistent disk:

Code:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dlc-persistent-volume-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: dlc-tf
spec:
  containers:
  - name: dlc-tf
    image: gcr.io/deeplearning-platform-release/tf-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
      - name: dlc-volume
        mountPath: /home
  volumes:
  - name: dlc-volume
    persistentVolumeClaim:
      claimName: dlc-persistent-volume-claim


The following commands deploy the pod and connect to it:

Command:

kubectl apply -f ./pod.yaml
kubectl port-forward pods/dlc-tf 8080:8080


You can reach your running JupyterLab instance at localhost:8080 once the pod has been fully deployed.
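
To check whether the pod has finished deploying, you can watch its status with a standard kubectl command:

Command:

kubectl get pod dlc-tf --watch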

Let's get into the details of the local deep learning container.

Get started with a local deep learning container

Let's begin by discussing the prerequisite for the local deep learning containers.

Prerequisite

Follow these steps to create a Google Cloud account, enable the necessary APIs, and install and activate the necessary software.

  • Go to the Manage resources page in the Google Cloud Console and choose Create a project there.
     
  • Install the gcloud CLI.
     
  • Install Docker.
     
  • To run Docker without using sudo on a Linux-based operating system like Ubuntu or Debian, add your account to the docker group:
    Command: 
sudo usermod -a -G docker ${USER}
  • After adding yourself to the docker group, your system might need to be restarted.
     
  • Run Docker. To confirm that Docker is up and running, run the following Docker command, which prints the current date and time:

Command:

docker run busybox date

 

  • Use gcloud as Docker's credential helper:

Command:

gcloud auth configure-docker
  • Install nvidia-docker if you wish to run the container locally on a GPU.
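
To verify that the GPU runtime works end to end, a common smoke test is to run nvidia-smi inside a CUDA base container. This is a sketch: it assumes the nvidia/cuda:10.0-base image tag is still available on Docker Hub.

Command:

docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi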

Create your container

Create your container by following these steps.

  • To see a list of available containers:

Command:

gcloud container images list \
  --repository="gcr.io/deeplearning-platform-release"


To help you pick the container you want, you may wish to visit Choosing a container.

  • If you don't need a GPU-enabled container, enter the following code example. Substitute the name of the container you want to use for tf-cpu.1-13.

Command:

docker run -d -p 8080:8080 -v /path/to/local/dir:/home/jupyter \
	gcr.io/deeplearning-platform-release/tf-cpu.1-13
  • If you wish to use a GPU, enter the following code example. Substitute the name of the container you want to use for tf-gpu.1-13.

Command:

docker run --runtime=nvidia -d -p 8080:8080 -v /path/to/local/dir:/home/jupyter \
  gcr.io/deeplearning-platform-release/tf-gpu.1-13

This command launches the container in detached mode, maps port 8080 in the container to port 8080 on your local machine, and mounts the local directory /path/to/local/dir to /home/jupyter in the container. You can access the JupyterLab server running in the container at http://localhost:8080.
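
If the page doesn't load, you can check that JupyterLab started correctly by inspecting the container's logs. A minimal sketch using standard Docker commands (CONTAINER_ID is whatever docker ps reports):

Command:

docker ps                    # note the running container's ID
docker logs CONTAINER_ID     # JupyterLab's startup messages appear here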

Deep Learning Containers overview 

Deep Learning Containers are a set of Docker containers that come pre-installed with essential data science frameworks, libraries, and tools. These containers give you consistent, performance-optimised environments, so you can prototype and deploy workflows quickly.

Let's look into the details of choosing a container image.

Choose a container image 

Each container image includes the NVIDIA stack for GPU images (CUDA, cuDNN, NCCL2), Conda, the chosen data science framework (such as PyTorch or TensorFlow), a Python 3 environment, and a number of additional supporting packages and tools.

Included dependencies

Lists of the Python dependencies included in each release are available on Cloud Storage at the following link:

Link:

gs://deeplearning-platform-release/installed-dependencies/containers/RELEASE_MILESTONE


Substitute your release milestone for RELEASE_MILESTONE.
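
For example, assuming you have gsutil installed, you can browse the bucket to see which release milestones are published (a sketch; the exact bucket layout may vary):

Command:

gsutil ls gs://deeplearning-platform-release/installed-dependencies/containers/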

TensorFlow Enterprise container images

TensorFlow Enterprise container images provide a Google Cloud optimised distribution of TensorFlow, and some versions of this distribution also come with Long Term Version Support.

Experimental images

The table of image families shows which Deep Learning Containers image families are experimental. Experimental images may not be refreshed with each new framework version, but they are supported on a best-effort basis.

Listing all available versions

If you require a specific framework or CUDA version, you can search the full set of available container images. To list all Deep Learning Containers images that are currently available, use the following command in Cloud Shell or in your preferred terminal with the Google Cloud CLI.

Command:

gcloud container images list --repository="gcr.io/deeplearning-platform-release"
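
To see every published version (tag) of a particular image family, you can list its tags. A sketch using the tf-gpu family as an example:

Command:

gcloud container images list-tags gcr.io/deeplearning-platform-release/tf-gpu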

Using locally

You can pull Deep Learning Containers images and run them on your local machine.
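
For instance, pulling the CPU-only TensorFlow image used earlier in this article looks like this:

Command:

docker pull gcr.io/deeplearning-platform-release/tf-cpu.1-13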

Let's dive into training in a container using Google Kubernetes Engine.

Train in a container using Google Kubernetes Engine 

Let's begin by discussing the prerequisite for the setup.

Prerequisite

Make sure you have completed the following steps before you start.

  • Complete the setup procedures outlined in the Before you Start section of Using a local deep learning container.
     
  • Check to see if your Google Cloud project has billing enabled.
     
  • Enable the Container Registry, Compute Engine, and Google Kubernetes Engine APIs.

Open your command line tool

This guide can be followed either in Google Cloud Shell or locally with command-line tools. The command-line tools used in this tutorial—gcloud, docker, and kubectl—are preinstalled in Google Cloud Shell, so using Cloud Shell means you don't have to install them on your workstation.

Use Google Cloud Shell

Follow these steps to use Google Cloud Shell:

  • Navigate to the Google Cloud console.
     
  • At the top of the console window, click the Activate Cloud Shell button.

Use command-line tools locally

If you want to follow this guide on your local machine, you must install the following utility:
Install the Kubernetes command-line tool using the gcloud CLI. kubectl is used to communicate with Kubernetes, the cluster orchestration system that Deep Learning Containers clusters run on:

Command:

gcloud components install kubectl


If you completed the getting-started steps, the Google Cloud CLI and Docker are already installed.

Create a GKE cluster

To create a two-node cluster named pytorch-training-cluster in GKE, enter the following command:

Command:

gcloud container clusters create pytorch-training-cluster \
    --num-nodes=2 \
    --zone=us-west1-b \
    --accelerator="type=nvidia-tesla-p100,count=1" \
    --machine-type="n1-highmem-2" \
    --scopes="gke-default,storage-rw"

The cluster's creation could take several minutes.
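
Because the cluster's nodes have GPUs attached, NVIDIA's device drivers must be installed on the nodes before pods can use them. Per Google's GKE documentation, this is typically done by applying a Google-maintained DaemonSet; check the current GKE docs for the up-to-date manifest URL, as this one may change:

Command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml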

You might also use an existing cluster in your Google Cloud project rather than creating a new one. If you do, you might need to run the following command to make sure the kubectl command-line tool has the proper credentials to access your cluster:

Command:

gcloud container clusters get-credentials YOUR_EXISTING_CLUSTER
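
Either way, you can confirm that kubectl can reach the cluster by listing its nodes:

Command:

kubectl get nodes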

Create the Dockerfile

A container image can be created in a variety of ways. These instructions demonstrate how to create one that can execute a Python script named trainer.py.

To see the list of available container images, run:

Command:

gcloud container images list \
  --repository="gcr.io/deeplearning-platform-release"

The example below shows how to add a Python script named trainer.py to a specific PyTorch deep learning container image:

  • Write the following commands into a file called Dockerfile. This step assumes that you have machine learning model training code in a directory called model-training-code, with trainer.py as the directory's main Python module. The container will be removed after the job finishes, so your training script should be set up to write its output to Cloud Storage.
    Code:
FROM gcr.io/deeplearning-platform-release/pytorch-gpu
COPY model-training-code /train
CMD ["python", "/train/trainer.py"]

Build and upload the container image

Use the following commands to create and upload the container image to Container Registry:

Command: 

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=pytorch_custom_container
export IMAGE_TAG=$(date +%Y%m%d_%H%M%S)
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile -t $IMAGE_URI ./
docker push $IMAGE_URI

Deploy your application

Place the following in a file called pod.yaml, substituting IMAGE_URI with the URI of your image.

Code:

apiVersion: v1
kind: Pod
metadata:
  name: gke-training-pod
spec:
  containers:
  - name: my-custom-container
    image: IMAGE_URI
    resources:
      limits:
        nvidia.com/gpu: 1


Run the following command to deploy your application using the kubectl command-line tool:

Command:

kubectl apply -f ./pod.yaml


Run the following command to monitor the status of the pod:

Command:

kubectl describe pod gke-training-pod
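
Once the pod is running, you can follow your training script's output with kubectl logs:

Command:

kubectl logs -f gke-training-pod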


Let's look into the details of the derivative container.

Create a derivative container 

Let's begin by discussing the prerequisite for the setup.

Prerequisite

Make sure you have completed the following steps before you start.

  • Complete the setup procedures outlined in the Before you Start section of Using a local deep learning container.
     
  • Check to see if your Google Cloud project has billing enabled.
     
  • Enable the Container Registry API.

Process

You'll use a process similar to the following to generate a derivative container:

  • Create the initial Dockerfile and run modification commands. Start from one of the available Deep Learning Containers image types, then customise the container image using conda, pip, or Jupyter commands.
     
  • Build the container image and push it to a location that your Compute Engine service account can access.

Let's look into the details of creating the initial Dockerfile and running modification commands.

Create the initial Dockerfile and run modification commands

To choose a Deep Learning Containers image type and make a minor change to the container image, use the commands below. This example demonstrates how to start from the most recent TensorFlow image and modify it with a custom TensorFlow wheel. The example that follows requires that tensorflow.whl be present in the same working directory as your Dockerfile. The Dockerfile should contain the following commands:

Code:

FROM gcr.io/deeplearning-platform-release/tf-gpu:latest
# Copy the custom wheel from the local file system into the container
COPY tensorflow.whl /tensorflow.whl
# Replace the preinstalled TensorFlow with the custom build
RUN pip uninstall -y tensorflow && \
    pip install /tensorflow.whl

Build and push the container image

The commands below build the container image and push it to Container Registry, where your Google Compute Engine service account can access it.

Command:

export PROJECT=$(gcloud config list project --format "value(core.project)")
docker build . -f Dockerfile -t "gcr.io/${PROJECT}/tf-custom:latest"
docker push "gcr.io/${PROJECT}/tf-custom:latest"
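
As a quick sanity check, you can run the pushed image locally before deploying it elsewhere. This sketch assumes that starting JupyterLab without a GPU is acceptable for verifying that the image works:

Command:

docker run -d -p 8080:8080 "gcr.io/${PROJECT}/tf-custom:latest"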


Frequently Asked Questions

What is a deep learning container?

Deep Learning Containers are a collection of Docker containers that come pre-installed with essential data science frameworks, libraries, and tools. These containers give you consistent, performance-optimised environments that can speed up prototyping and deployment of your workflows.

Does GCP support Docker containers?

Yes, GCP supports Docker containers. Google Cloud Platform offers numerous technologies for building and running Docker containers, such as managed Kubernetes and serverless container execution.

What is a GCP container engine?

Google Kubernetes Engine (GKE) is a management and orchestration system for Docker containers and container clusters running on Google's public cloud services. GKE is built on Kubernetes, Google's open-source container management system.

Conclusion

In this article, we have extensively discussed Deep Learning Containers, including local deep learning containers, training a container using Google Kubernetes Engine, and derivative containers.

We hope that this blog has helped you enhance your knowledge of Deep Learning Containers. If you would like to learn more, check out our articles on Google Cloud Certification. You can refer to our guided paths on the Coding Ninjas Studio platform to learn more about DSA, DBMS, Competitive Programming, Python, Java, JavaScript, etc. To practice and improve yourself for interviews, you can also check out Top 100 SQL problems, Interview experience, Coding interview questions, and the Ultimate guide path for interviews. Do upvote our blog to help other ninjas grow. Happy Coding!!
