Table of contents
1. Introduction
2. Set up a Datalab VM instance
2.1. Steps to set up and open Cloud Datalab
2.2. Clean up
3. Working with notebooks
3.1. Source control
3.2. Copying notebooks from the Cloud Datalab VM
3.3. Cloud Datalab backup
3.4. Restoring backups
3.5. Working with data
4. Using Cloud Datalab in a team environment
4.1. Create instances for each team member
4.2. Use the automatically created git repository for sharing notebooks
5. Adding Python libraries to a Cloud Datalab instance
6. Starting Datalab on a Chromebook
6.1. Launch Datalab from Cloud Shell
7. Choosing a VM machine type
7.1. Considerations when choosing a VM machine type
7.2. Choosing a machine type
8. Managing the lifecycle of a Cloud Datalab instance
8.1. Prerequisites
8.2. Creating an instance
8.3. Connecting to an instance
8.4. Stopping an instance
8.5. Updating the Cloud Datalab VM without deleting the notebooks disk
8.6. Deleting an instance and the notebooks disk
8.7. Reducing the usage of compute resources
9. Frequently asked questions
9.1. What is Cloud Dataflow used for?
9.2. What are Dataflow and Dataproc in GCP?
9.3. What is Dataflow equivalent in AWS?
9.4. Is Dataproc fully managed?
10. Conclusion
Last Updated: Mar 27, 2024

Datalab

Author Komal Shaw

Introduction

Cloud Datalab is a powerful interactive tool for exploring, analyzing, transforming, and visualizing data, and for building machine learning models, on the Google Cloud Platform. Because it runs on Google Compute Engine and connects quickly to numerous cloud services, you can concentrate on your data science tasks.


Set up a Datalab VM instance

This section shows how to use the datalab command-line tool to set up and open Google Cloud Datalab.

Steps to set up and open Cloud Datalab

From a terminal window on your local machine:

  1. Update your gcloud components: gcloud components update
  2. Install the datalab component: gcloud components install datalab
  3. Create a Cloud Datalab VM instance: datalab create datalab-instance-name
  4. Open the Cloud Datalab home page in your browser:
    http://localhost:8081

Clean up

Clean up to avoid incurring charges to your Google Cloud account for the resources used on this page.
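For example, to delete the instance created above together with its notebooks disk (the datalab delete command is covered in more detail later in this article):

datalab delete --delete-disk datalab-instance-name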

Working with notebooks

Source control

The first time you run datalab create VM-instance-name, a datalab-notebooks Cloud Source Repository is added to the project. This serves as a remote repository for the /content/datalab/notebooks git repository inside the Docker container running on your Cloud Datalab VM instance.

Using ungit in your browser - The Cloud Datalab container includes ungit, a web-based git client, which allows you to make commits to the Cloud Datalab VM repo and push notebooks to the cloud remote repo from the Cloud Datalab browser UI.

Using git from the command line - Instead of using ungit from the Cloud Datalab UI for source control (see Using ungit in your browser), you can SSH into the Cloud Datalab VM and run git from a terminal running in your VM or from Cloud Shell.
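For example, a minimal sketch of this workflow; the zone, instance name, and container ID below are placeholders:

gcloud compute ssh --zone us-central1-a instance-name
sudo docker ps                          # find the ID of the Cloud Datalab container
sudo docker exec -it container-id bash  # open a shell inside the container
cd /content/datalab/notebooks           # the git repository containing your notebooks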

Copying notebooks from the Cloud Datalab VM

You can use the gcloud compute scp command to copy files out of your Cloud Datalab VM instance. For example, to copy the contents of the datalab/notebooks directory on your Cloud Datalab VM to an instance-name-notebooks directory on your local machine, run a command like the one below, substituting instance-name with the name of your Cloud Datalab VM.
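A sketch of such a command; the source path assumes the default location of the notebooks directory on the VM's persistent disk:

gcloud compute scp --recurse \
  datalab@instance-name:/mnt/disks/datalab-pd/content/datalab/notebooks \
  instance-name-notebooks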

Cloud Datalab backup

To avoid unintentional loss of user content if a VM disk fails or is deleted, Cloud Datalab instances periodically back up user content to a Google Cloud Storage bucket in the user's project. By default, the backup tool operates on the root of the attached disk, where all user content in a Cloud Datalab instance is stored. The backup job runs every ten minutes, creates a zip archive of the entire disk, compares it to the archive from the last backup, and uploads the new archive if the contents differ and enough time has passed since the last backup. Cloud Datalab uploads the backup files to Google Cloud Storage.


Cloud Datalab retains the last 10 hourly backups, 7 daily backups, and 20 weekly backups, and deletes older backup files to preserve space. Backups can be turned off by passing the --no-backups flag when creating a Cloud Datalab instance with the datalab create command.
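For example:

datalab create --no-backups instance-name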

Restoring backups

To restore a backup, choose the backup file in Google Cloud Storage by its VM zone, VM name, notebook directory, and human-readable timestamp.
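For example, you can list the available backup files with gsutil; the bucket and path below are placeholders, so check your project's Cloud Storage browser for the actual backup location:

gsutil ls -r gs://your-backup-bucket/datalab-backups/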

Working with data

Cloud Datalab can access data located in any of the following places:

Google Cloud Storage: The datalab.storage APIs enable programmatic access to Cloud Storage's files and folders.

BigQuery: The tables and views can be queried using SQL and datalab.bigquery APIs.

Local file system on the persistent disk: You can add new files or copy existing ones to the file system of the persistent disk attached to your Cloud Datalab VM.
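As an illustration, here is a minimal sketch of accessing BigQuery and Cloud Storage from a notebook cell. Exact method names differ between the older datalab and newer google.datalab packages; this sketch follows the google.datalab flavor, and the bucket name is a placeholder:

import google.datalab.bigquery as bq
import google.datalab.storage as storage

# Run a SQL query against a BigQuery public table and load the
# result into a Pandas DataFrame.
df = bq.Query(
    'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current` LIMIT 10'
).execute().result().to_dataframe()
print(df.head())

# List the objects in a Cloud Storage bucket (placeholder bucket name).
bucket = storage.Bucket('my-bucket')
for obj in bucket.objects():
    print(obj.key)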

Using Cloud Datalab in a team environment

Create instances for each team member

Cloud Datalab instances are single-user environments, so each team member needs their own instance. Having more than one Cloud Datalab user per instance is not supported, even though the standard access rules for Google Compute Engine VMs still apply (project editors, for instance, can SSH into the VM).

There are two ways to create VM instances for team members:

  1. A project owner creates instances for other team members - A project owner can create a Cloud Datalab instance for each team member by using the datalab create command. To do this, the project owner must pass an additional --for-user flag specifying the email address of the Cloud Datalab user, as in the example after this list.
     
  2. Each team member creates their own instance - If each team member is a project editor, they can create their own Cloud Datalab instances.
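For example, a project owner could run the following command for a team member (the email address and instance name are placeholders):

datalab create --for-user mary@example.com datalab-mary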

Use the automatically created git repository for sharing notebooks

When datalab create is executed for the first time in a project, it creates a Cloud Source Repository named datalab-notebooks. You can browse this repository from the Repositories page of the Google Cloud console.
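Team members can also clone the shared repository outside of Datalab using the gcloud CLI, for example:

gcloud source repos clone datalab-notebooks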

Adding Python libraries to a Cloud Datalab instance

A number of libraries are included in Cloud Datalab. The included libraries support common data analysis, transformation, and visualization scenarios.

You can add additional Python libraries using one of the following mechanisms:

  1. Install with conda from a notebook cell: !conda install -y lib-name
     
  2. Install with pip from a notebook cell: !pip install lib-name
     
  3. Add the install command to the instance's startup script so it runs every time the instance starts:
    %%bash
    echo "conda install -y lib-name" >> /content/datalab/.config/startup.sh
    cat /content/datalab/.config/startup.sh
     
  4. Inherit from the Cloud Datalab Docker container using a Docker customization mechanism. This option is much more heavyweight than the others, but it provides maximum flexibility for those who intend to significantly customize the container for use by a team or organization:
    FROM datalab
    ...
    RUN pip install lib-name
    ...

Starting Datalab on a Chromebook


Launch Datalab from Cloud Shell

  1. Start a Cloud Shell session in a Google Cloud Platform project.
     
  2. In the Cloud Shell session window, run the following command to create a Datalab VM instance. Be sure to choose a unique name for the instance: it must begin with a lowercase letter, followed by up to 62 lowercase letters, numbers, or hyphens, and it cannot end with a hyphen.
    datalab create instance-name
     
  3. Click the Web preview button and then select Change port -> Port 8081 to open your browser to the Datalab home page.

Choosing a VM machine type


Considerations when choosing a VM machine type

You can choose a Google Compute Engine machine type when creating a Datalab VM instance. By default, Datalab uses the n1-standard-1 machine type. You can choose a different machine type to meet your data analysis needs, depending on your performance and cost requirements.

Here are a few key considerations for selecting a machine type:

  1. Each notebook uses a Python kernel to run code in its own process. For example, if you have N notebooks open, there are at least N processes corresponding to those notebooks.
     
  2. Each kernel is single-threaded. Unless you are running multiple notebooks at the same time, multiple cores may not provide significant benefits.
     
  3. You may benefit significantly by selecting a machine with additional memory depending on your usage pattern and the amount of data processed.
     
  4. Execution is cumulative: running three Cloud Datalab notebook cells in a row results in the accumulation of the corresponding state, including the memory allocated for the data structures used in those cells.
     
  5. Processing large amounts of data in memory (for example, using Pandas DataFrames) causes proportional memory allocation. When you finish running a notebook, you can stop its session by clicking the Running Sessions icon in the top bar (you may need to resize the browser window to see the icon) and shutting down the session.
     
  6. Cloud Datalab utilizes a disk-based swap file to provide overhead for additional memory requirements, but relying on the swap file is likely to slow down processing. It's best to estimate memory needs, then pick a machine type with at least the estimated amount of memory.
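One practical way to apply these considerations is to measure how much memory your data actually occupies before picking a machine type. A minimal sketch you could run in a notebook cell; the CSV path is a placeholder:

import pandas as pd

# Load a sample of your data and report its in-memory footprint in MiB.
df = pd.read_csv('data.csv')  # placeholder path for your dataset
print(df.memory_usage(deep=True).sum() / 1024**2, 'MiB')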

Choosing a machine type

You choose a machine type for your Cloud Datalab VM instance when you create the instance. For example:

datalab create --machine-type n1-highmem-2 instance-name

Managing the lifecycle of a Cloud Datalab instance

Cloud Datalab runs on a Google Compute Engine virtual machine with a persistent disk attached to it for storing notebooks. Cloud Datalab VMs are connected to a dedicated network in the project, named datalab-network. By default, this network restricts incoming connections to SSH only.

Prerequisites

Before managing Cloud Datalab instances, make sure you have:
  1. Installed the gcloud CLI, including the datalab component
  2. Authenticated with the Google Cloud CLI
  3. Configured the Google Cloud CLI to use your selected project and zone
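For example, you can complete the last two prerequisites as follows (the project ID and zone are placeholders):

gcloud auth login
gcloud config set project my-project-id
gcloud config set compute/zone us-central1-a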

Creating an instance

  1. You create a Cloud Datalab instance using the datalab create command.
    datalab create instance-name
     
  2. There are several command-line options available with this command. To list all available options, run:
    datalab create --help
     
  3. By default, the datalab create command connects to the newly created instance. To create the instance without connecting to it, pass the --no-connect flag:
    datalab create --no-connect instance-name

Connecting to an instance

The datalab tool can build a persistent SSH tunnel to your Cloud Datalab instance, letting you connect to it from a local browser as if it were running on your own computer.

To create this connection, use the datalab connect command:

datalab connect instance-name

Stopping an instance

When you want to stop using Cloud Datalab, run the following command to stop your Cloud Datalab instance and save money.

datalab stop instance-name

Updating the Cloud Datalab VM without deleting the notebooks disk

To update to a new Cloud Datalab version, or to change VM parameters such as the machine type or the service account, you can delete and recreate the Cloud Datalab VM without losing the notebooks stored on the persistent disk:

datalab delete --keep-disk instance-name
datalab create instance-name

Deleting an instance and the notebooks disk

By default, the datalab delete command does not delete the persistent disk that stores your notebooks. This makes it easy to replace the VM and prevents accidental data loss.

If you want to delete both the virtual machine and the attached persistent disk, add the --delete-disk flag to the command:

datalab delete --delete-disk instance-name

Reducing the usage of compute resources

Google Compute Engine VMs are billed whether or not you are actively using them. You can lower your Cloud Datalab costs by stopping the instance while it is not in use. While the VM instance itself is stopped, you are not charged for the VM, but you are still charged for the resources attached to it, such as the persistent disk and any static external IP address.

The next time you need to use your stopped instance, run datalab connect instance-name; the datalab tool restarts the instance before attempting to connect to it.

To stop all charges related to a Cloud Datalab instance, delete both the virtual machine and the attached persistent disk by running the datalab delete command with the --delete-disk flag.


Frequently asked questions

What is Cloud Dataflow used for?

Dataflow is a managed service for executing a wide variety of data processing patterns.

What are Dataflow and Dataproc in GCP?

Dataproc is a Google Cloud service for running Apache Spark and Hadoop workloads, including data science and ML use cases. In comparison, Dataflow is a service for both batch and stream processing of data.

What is Dataflow equivalent in AWS?

The closest equivalents to Dataflow in AWS are Amazon Elastic MapReduce (EMR) and AWS Batch.

Is Dataproc fully managed?

Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.

Conclusion

In this article, we have extensively discussed Datalab. We hope this blog has helped you enhance your knowledge regarding Datalab.

If you want to learn more, check out our articles on Introduction to Cloud Monitoring, Overview of Log Based Metric, and Cloud Logging in GCP.

Check out this problem - Smallest Distinct Window.

Refer to our guided paths on Coding Ninjas Studio to learn more about DSA, Competitive Programming, JavaScript, System Design, etc.

Enroll in our courses and refer to the mock test and problems available.

Take a look at the interview experiences and interview bundle for placement preparations.

Do upvote our blog to help other ninjas grow.

Happy Coding!
