Introduction
It's easy to underestimate how long it takes to get a machine learning project up and running. The setup can be frustrating and time-consuming, and it keeps you from what you actually want to be doing: iterating on and refining your model. All too frequently, these projects require you to manage the compatibility and complexity of an ever-evolving software stack.
In this article, we will look at Deep Learning Containers: how to develop with them locally, how to build derivative containers, and how to train and deploy them on Cloud AI Platform Notebooks and Google Kubernetes Engine, so you can skip this setup and start working on your project right away.
Without further ado, let's get started.
Introducing Deep Learning Containers
Deep Learning Containers are pre-packaged, performance-optimized, and compatibility-tested, so you can start using them right away. Productionizing your workflow requires not only creating the code or artifacts you intend to deploy, but also preserving a consistent execution environment to guarantee reproducibility and accuracy. If your development strategy combines local prototyping with a variety of cloud tools, it can be challenging to ensure that all relevant dependencies are packaged correctly and available to every runtime. Deep Learning Containers address this difficulty by offering a consistent environment for testing and deploying your application across GCP products and services, such as Cloud AI Platform Notebooks and Google Kubernetes Engine (GKE), making it simple to scale in the cloud or move between on-premises and cloud environments.
Choose a container and develop locally
Every Deep Learning Container includes a preconfigured Jupyter environment, so each one can be used as a prototyping environment straight away. Before anything else, make sure the gcloud utility is installed and configured. Next, choose the container you want to use. The following command lists all containers hosted under gcr.io/deeplearning-platform-release.
Command:
gcloud container images list --repository="gcr.io/deeplearning-platform-release"
Each container provides a Python 3 environment consistent with the corresponding Deep Learning VM, including conda, the selected data science framework, the NVIDIA stack for GPU images (CUDA, cuDNN, NCCL), and a variety of other supporting packages and tools. The initial release consists of TensorFlow 1.13, TensorFlow 2.0, PyTorch, and R containers, with the goal of reaching parity with all Deep Learning VM types.
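Once you've picked an image from the listing above, you can optionally pull it ahead of time so the first run doesn't have to wait for the download. For example, for the TensorFlow 1.13 CPU image used in the rest of this section:
Command:
docker pull gcr.io/deeplearning-platform-release/tf-cpu.1-13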
With the exception of the base containers, container names follow the format <framework>-<cpu/gpu>.<framework version>. Suppose you want to prototype with TensorFlow on the CPU only. The following command starts the TensorFlow Deep Learning Container in detached mode, binds the Jupyter server to port 8080 on the local machine, and mounts /path/to/local/dir to /home inside the container.
Command:
docker run -d -p 8080:8080 -v /path/to/local/dir:/home \
gcr.io/deeplearning-platform-release/tf-cpu.1-13
Then, go to localhost:8080 to access the running JupyterLab instance. Make sure to do your development in /home, as any files outside it will be lost when the container is stopped.
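Because /home is backed by the mounted host directory, work saved there survives the container itself. As a quick, optional sanity check (assuming the container was started with the command above), you can stop the container and confirm your notebooks are still on the host:
Command:
# Find the running Deep Learning Container and stop it
docker ps
docker stop <container-id>
# Files saved under /home in the container remain in the mounted host directory
ls /path/to/local/dir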
To use the GPU-enabled containers, you must have a CUDA 10 compatible GPU, the corresponding driver, and nvidia-docker installed. Then you can run a similar command:
Command:
docker run --runtime=nvidia -d -p 8080:8080 -v /path/to/local/dir:/home \
gcr.io/deeplearning-platform-release/tf-gpu.1-13
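Before prototyping, it can be worth confirming that the container actually sees your GPU. A minimal check, assuming the driver and nvidia-docker are set up as described above and nvidia-smi is available inside the container (it typically is for the GPU images), is:
Command:
# Run nvidia-smi inside the GPU image and exit; the GPU and driver should be listed
docker run --runtime=nvidia --rm \
gcr.io/deeplearning-platform-release/tf-gpu.1-13 nvidia-smi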
Create derivative containers and deploy to Cloud AI Platform Notebooks and GKE
Eventually you'll need more horsepower than your local machine can provide, but you may have local data and packages that need to be placed in the environment first. Deep Learning Containers can be extended with your local files, customised, and then deployed to a Cloud AI Platform Notebooks instance or to GKE.
Consider a scenario in which your PyTorch workflow depends on a local Python package called "mypackage". In the directory above mypackage, create a Dockerfile:
Code:
FROM gcr.io/deeplearning-platform-release/pytorch-gpu
COPY mypackage /mypackage
RUN pip install /mypackage
This straightforward Dockerfile copies the package files into the environment and installs the package. You can add more RUN pip/conda commands, but you should not change CMD or ENTRYPOINT, as they are already configured for AI Platform Notebooks. Build this container, then push it to Google Container Registry.
Code:
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=pytorch_custom_container
export IMAGE_TAG=$(date +%Y%m%d_%H%M%S)
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile -t $IMAGE_URI ./
gcloud auth configure-docker
docker push $IMAGE_URI
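Before creating the Notebooks instance, you can optionally confirm that the image and its tag actually landed in Container Registry (this reuses the environment variables exported above):
Command:
gcloud container images list-tags gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME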
Next, use the gcloud CLI to create an AI Platform Notebooks instance (UI support for custom containers is coming soon). Adjust the instance type and accelerator fields as needed for your workload.
Code:
export IMAGE_FAMILY="common-container"
export ZONE="us-central1-b"
export INSTANCE_NAME="custom-container-notebook"
export INSTANCE_TYPE="n1-standard-8"
export ACCELERATOR="type=nvidia-tesla-t4,count=2"
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project="deeplearning-platform-release" \
--maintenance-policy=TERMINATE \
--accelerator=$ACCELERATOR \
--machine-type=$INSTANCE_TYPE \
--boot-disk-size=100GB \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--metadata="install-nvidia-driver=True,proxy-mode=project_editors,container=$IMAGE_URI"
Setting up the image will take some time. If the container was loaded properly, the proxy-url metadata field will contain a link to JupyterLab, and the instance will appear as ready in the AI Platform > Notebooks UI in the Cloud Console. You can also query the link directly from the instance metadata.
Command:
gcloud compute instances describe "${INSTANCE_NAME}" \
--format='value[](metadata.items.proxy-url)'
You can access your JupyterLab instance at this URL.
Deploying Deep Learning Containers on GKE with NVIDIA GPUs
You can also develop on your Deep Learning Containers using GKE. After setting up your GKE cluster with GPUs according to the user guide, all that's left is to specify the container image in your Kubernetes pod spec. The following spec creates a pod with one GPU from tf-gpu and an attached GCE persistent disk:
Code:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dlc-persistent-volume-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: dlc-tf
spec:
  containers:
    - name: dlc-tf
      image: gcr.io/deeplearning-platform-release/tf-gpu
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: dlc-volume
          mountPath: /home
  volumes:
    - name: dlc-volume
      persistentVolumeClaim:
        claimName: dlc-persistent-volume-claim
Save the spec as pod.yaml. The following commands deploy it and connect you to your instance:
Command:
kubectl apply -f ./pod.yaml
kubectl port-forward pods/dlc-tf 8080:8080
You can reach your running JupyterLab instance at localhost:8080 once the pod has been fully deployed.
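If the JupyterLab instance doesn't come up, a quick way to see whether the pod is still pulling the image or failed to schedule a GPU is to inspect its status and events (standard kubectl commands, using the pod name from the spec above):
Command:
kubectl get pod dlc-tf
kubectl describe pod dlc-tf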
Next, let's get into the details of local deep learning containers.