Cloud Datalab is a powerful interactive tool for exploring, analyzing, transforming, and visualizing data and for building machine learning models on Google Cloud Platform. It runs on Google Compute Engine and connects easily to multiple cloud services, so you can focus on your data science work.
Set up a Datalab VM instance
This section shows how to use the datalab command-line tool to set up and open Google Cloud Datalab.
When you are done, clean up the resources you created so that your Google Cloud account is not charged for them.
Working with notebooks
Source control
The first time you run datalab create VM-instance-name, a Cloud Source Repository named datalab-notebooks is added to the project. It serves as the remote for the /content/datalab/notebooks git repository inside the Docker container running in your Cloud Datalab VM instance.
Using ungit in your browser - The Cloud Datalab container includes ungit, a web-based git client, which allows you to make commits to the Cloud Datalab VM repo and push notebooks to the cloud remote repo from the Cloud Datalab browser UI.
Using git from the command line - Instead of using ungit from the Cloud Datalab UI for source control (see Using ungit in your browser), you can SSH into the Cloud Datalab VM and run git from a terminal running in your VM or from Cloud Shell.
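As an illustration of the command-line flow, here is a minimal sketch; the zone, instance name, notebook file, and on-disk repository path are assumptions for illustration:

gcloud compute ssh instance-name --zone us-central1-a
# On the VM, the notebooks repository lives on the attached persistent disk
cd /mnt/disks/datalab-pd/content/datalab/notebooks
git add my-notebook.ipynb
git commit -m "Add analysis notebook"
git push origin master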
Copying notebooks from the Cloud Datalab VM
You can use the gcloud compute scp command to copy files out of your Cloud Datalab VM instance. For example, to copy the contents of the datalab/notebooks directory in your Cloud Datalab VM to the instance-name-notebooks directory on your local machine, run the following command, substituting instance-name with the name of your Cloud Datalab VM.
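A sketch of the command, assuming the notebooks are stored under /mnt/disks/datalab-pd/content on the VM's persistent disk:

# Recursively copy the notebooks directory from the VM to the local machine
gcloud compute scp --recurse \
    datalab@instance-name:/mnt/disks/datalab-pd/content/datalab/notebooks \
    instance-name-notebooks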
Cloud Datalab backup
To avoid unintentional loss of user content if a VM disk fails or is deleted, Cloud Datalab instances periodically back up user content to a Google Cloud Storage bucket in the user's project. By default, the backup utility operates on the root of the attached disk, where all user content in a Cloud Datalab instance is stored. The backup job runs every ten minutes, creates a zip archive of the entire disk, compares it to the archive from the last backup, and uploads the new archive if the contents have changed and enough time has passed since the last backup. Cloud Datalab uploads the backup files to Google Cloud Storage.
Cloud Datalab retains the last 10 hourly backups, 7 daily backups, and 20 weekly backups, and deletes older backup files to save space. You can disable backups by passing the --no-backups flag when creating a Cloud Datalab instance with the datalab create command.
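For example:

datalab create --no-backups instance-name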
Restoring backups
Backup files in Google Cloud Storage are identified by the VM zone, VM name, backed-up directory, and a human-readable timestamp, which you can use to choose the backup file to restore.
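A minimal sketch of browsing and fetching a backup with gsutil; the bucket name and object path here are assumptions for illustration, not the exact layout Cloud Datalab uses:

# List the backup archives stored in the project's backup bucket
gsutil ls gs://my-project-bucket/datalab-backups/
# Copy the chosen archive to the local machine and unpack it
gsutil cp gs://my-project-bucket/datalab-backups/backup-archive.zip .
unzip backup-archive.zip -d restored-notebooks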
Working with data
Cloud Datalab can access data located in any of the following places:
Google Cloud Storage: The datalab.storage APIs provide programmatic access to files and folders in Cloud Storage.
BigQuery: Tables and views can be queried using SQL and the datalab.bigquery APIs (see the sketch after this list).
Local file system on the persistent disk: You can create new files or copy existing ones onto the file system of the persistent disk attached to your Cloud Datalab VM.
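As an illustration of the BigQuery path, here is a minimal sketch assuming the google.datalab.bigquery Python module available in Datalab notebooks; the sample table is a BigQuery public dataset:

import google.datalab.bigquery as bq

# Define a SQL query against a BigQuery public dataset
query = bq.Query(
    'SELECT word, word_count '
    'FROM `bigquery-public-data.samples.shakespeare` LIMIT 5')

# Execute the query and convert the results to a Pandas DataFrame
df = query.execute().result().to_dataframe()
print(df)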
Using Cloud Datalab in a team environment
Create instances for each team member
Cloud Datalab instances are single-user environments, so each team member needs their own instance. Having more than one Cloud Datalab user per instance is not supported, even though the standard access rules for Google Compute Engine VMs still apply (project editors, for example, can SSH into the VM).
There are two ways to create VM instances for team members:
A project owner creates instances for other team members - A project owner can create a Cloud Datalab instance for each team member by using the datalab create command. To do this, the project owner must pass in an additional --for-user flag specifying the email address of the Cloud Datalab user (see the example after this list).
Each team member creates their own instance - If each team member is a project editor, they can create their own Cloud Datalab instances.
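A sketch of the first approach, using the --for-user flag; the email address and instance name are placeholders:

datalab create --for-user teammate@example.com instance-name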
Use the automatically created git repository for sharing notebooks
When datalab create is executed for the first time in a project, a Cloud Source Repository named datalab-notebooks is created. You can browse this repository from the Repositories page of the Google Cloud console.
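Team members can also clone the shared repository to work with it locally; a minimal sketch using the gcloud source repos command:

# Clone the project's shared notebooks repository to the current directory
gcloud source repos clone datalab-notebooks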
Adding Python libraries to a Cloud Datalab instance
Cloud Datalab ships with a number of libraries that cover common data analysis, transformation, and visualization scenarios.
You can add additional Python libraries using one of the following three mechanisms:
Option 3: Inherit from the Cloud Datalab Docker container using Docker's customization mechanism. This option is much more heavyweight than the other options listed above, but it provides maximum flexibility for those who intend to significantly customize the container for use by a team or organization. The Dockerfile takes the following form:

FROM datalab
...
RUN pip install lib-name
...
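After building the custom image and pushing it to a container registry, you can point a new instance at it. A minimal sketch, assuming the datalab create command's --image-name flag and a hypothetical image path:

docker build -t gcr.io/my-project/my-datalab .
docker push gcr.io/my-project/my-datalab
datalab create --image-name gcr.io/my-project/my-datalab instance-name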
Starting Datalab on a Chromebook
Launch Datalab from Cloud Shell
Start a Cloud Shell session in your Google Cloud Platform project.
In the Cloud Shell session window, run the following command to create a Datalab VM instance. Be sure to choose a unique name for the instance: it must begin with a lowercase letter followed by up to 62 lowercase letters, numbers, or hyphens, and it cannot end with a hyphen.

datalab create instance-name
Click the Web preview button and then select Change port -> Port 8081 to open the Datalab home page in your browser.
Choosing a VM machine type
Considerations when choosing a VM machine type
You can choose a Google Compute Engine machine type when creating a Datalab VM instance; the default is n1-standard-1. Depending on your performance and cost requirements, you can choose a different machine type to meet your data analysis needs.
Here are a few key considerations for selecting a machine type:
Each notebook uses a Python kernel to run code in its own process. For example, if you have N notebooks open, there are at least N processes corresponding to those notebooks.
Each kernel is single-threaded. Unless you are running multiple notebooks at the same time, multiple cores may not provide significant benefits.
You may benefit significantly by selecting a machine with additional memory depending on your usage pattern and the amount of data processed.
Execution is cumulative: running three Cloud Datalab notebook cells in a row results in the accumulation of the corresponding state, including memory allocated for the data structures used in those cells.
Processing large amounts of data in memory (for example, using Pandas DataFrames) causes proportional memory allocation. When you finish running a notebook, you can stop the session by clicking the Running Sessions icon in the top bar (you may need to resize the browser window to see the icon) and shutting down the session.
Cloud Datalab utilizes a disk-based swap file to provide overhead for additional memory requirements, but relying on the swap file is likely to slow down processing. It's best to estimate memory needs, then pick a machine type with at least the estimated amount of memory.
Choosing a machine type
You choose a machine type for your Cloud Datalab VM instance when you create the instance.
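A sketch using the create command's --machine-type flag to select a larger machine; the instance name and machine type are placeholders:

datalab create --machine-type n1-highmem-8 instance-name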
Managing the lifecycle of a Cloud Datalab instance
Cloud Datalab runs on a Google Compute Engine virtual machine with an attached persistent disk that stores your notebooks. Cloud Datalab VMs are connected to a dedicated network in the project named datalab-network. By default, this network restricts incoming connections to SSH.
Prerequisites
Installed the gcloud CLI, including the datalab component
Authenticated with the Google Cloud CLI
Configured the Google Cloud CLI to use your selected project and zone
Creating an instance
You create a Cloud Datalab instance using the datalab create command:

datalab create instance-name
There are several command-line options available with this command. To list them all, run:

datalab create --help
By default, the datalab create command connects to the newly created instance. Pass the --no-connect flag to create the instance without connecting to it:

datalab create --no-connect instance-name
Connecting to an instance
The datalab tool can create a persistent SSH tunnel to your instance, letting you connect to your Cloud Datalab instance from a local browser as if it were running on your own computer.
To create this connection, use the datalab connect command:
datalab connect instance-name
Stopping an instance
When you are done using Cloud Datalab, run the following command to stop your instance and avoid unnecessary charges:
datalab stop instance-name
Updating the Cloud Datalab VM without deleting the notebooks disk
To update to a new Cloud Datalab version or to change VM parameters such as the machine type or the service account, you can delete and recreate the Cloud Datalab VM without losing the notebooks stored on the persistent disk.
By default, the datalab delete command does not delete the persistent disk that stores your notebooks. This makes changing the VM simple and prevents accidental data loss.
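For example, a sketch of the update workflow using the delete and create commands described in this article; this assumes the recreated instance reuses the same name and zone so that the existing notebooks disk is reattached:

datalab delete instance-name
datalab create instance-name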
Add the --delete-disk flag to the command if you want to delete both the virtual machine and the attached persistent disk:
datalab delete --delete-disk instance-name
Reducing the usage of compute resources
Google Compute Engine VMs can be expensive: you are billed for the time your Cloud Datalab instance is running, whether or not you are actively using it. You can lower your Cloud Datalab VM costs by stopping the instance while it is not in use. While the VM instance itself is stopped, you are not charged for the VM, but you are still charged for resources attached to it, such as the persistent disk and any static external IP address.
The next time you need to use your stopped instance, run datalab connect instance-name; the datalab tool restarts the instance before connecting to it.
To stop paying any fees related to a Cloud Datalab instance, run the datalab delete command with the --delete-disk flag to delete both the virtual machine and the attached persistent disk.
Dataflow is a managed service for executing a wide variety of data processing patterns.
What are Dataflow and Dataproc in GCP?
Dataproc is a Google Cloud service for running Apache Spark and Hadoop workloads, often used for data science and ML. Dataflow, in comparison, is a service for batch and stream processing of data.
What is Dataflow equivalent in AWS?
The closest AWS equivalents to Dataflow are Amazon Elastic MapReduce (EMR) and AWS Batch.
Is Dataproc fully managed?
Yes. Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.
Conclusion
In this article, we have extensively discussed Datalab. We hope this blog has helped you enhance your knowledge regarding Datalab.