Table of contents
1. Introduction
2. Quickstart
3. Create a lake
   3.1. Create a Dataplex Lake
4. Managing a lake
   4.1. Access control
   4.2. Viewing a Lake
   4.3. Updating a Lake
   4.4. Deleting a lake
5. Discover data
   5.1. Discovery configuration
   5.2. View discovered tables and filesets
   5.3. Discovery actions
   5.4. Resolve Discovery actions
   5.5. Other Actions
6. Cloud Dataplex API
   6.1. Service: dataplex.googleapis.com
   6.2. REST Resource: v1.projects.locations
   6.3. REST Resource: v1.projects.locations.lakes
   6.4. REST Resource: v1.projects.locations.lakes.actions
   6.5. REST Resource: v1.projects.locations.lakes.zones.assets.actions
   6.6. REST Resource: v1.projects.locations.lakes.zones.entities
7. Cloud Dataplex API
   7.1. google.cloud.dataplex.v1.ContentService
   7.2. google.cloud.dataplex.v1.DataplexService
   7.3. google.cloud.dataplex.v1.MetadataService
   7.4. google.cloud.location.Locations
   7.5. google.iam.v1.IAMPolicy
8. Frequently asked questions
   8.1. What is Cloud Dataflow used for?
   8.2. How do I specify table names?
   8.3. What is the Dataflow equivalent in AWS?
   8.4. Is Dataproc fully managed?
9. Conclusion
Last Updated: Mar 27, 2024

Dataplex

Author Komal Shaw

Introduction

Dataplex is an intelligent data fabric that offers a mechanism to securely make your data available to a range of analytics and data science tools while centrally managing, monitoring, and governing your data across data lakes, data warehouses, and data marts.

You may use Dataplex to logically group your Cloud Storage and BigQuery data into lakes and zones, automate data management and governance across that data, and enable large-scale analytics.

Quickstart

First, create a lake using the Google Cloud console:

  1. Go to Dataplex in the console.
  2. Navigate to the Manage view.
  3. Click Create.
  4. Enter a Display name.
  5. The lake ID is automatically generated for you.
  6. Specify the Region in which to create the lake.
  7. Click Create.
  8. Add a zone to your lake.
  9. Attach an asset; data is attached as assets to data zones within a Dataplex lake.
  10. After you create your lake, zones, and assets, you can start using your lake.
  11. Finally, clean up the resources used on this page to avoid incurring charges to your Google Cloud account.

Create a lake

Here we will learn how to create a Dataplex lake using the Google Cloud console, the gcloud CLI, or the lakes.create API method.

  1. Make sure you have been granted the predefined role roles/dataplex.admin or roles/dataplex.editor so that you can create and manage your lake.
  2. Create a metastore (a Dataproc Metastore service) to associate with the lake.
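As a sketch of what the lakes.create call looks like on the wire, the helper below builds the request URL and JSON body from the REST method listed later in this article; the project, region, and lake IDs are placeholder values, and the lakeId query parameter follows the usual pattern for Google Cloud create methods:

```python
import json

API_ROOT = "https://dataplex.googleapis.com/v1"

def create_lake_request(project: str, location: str, lake_id: str, display_name: str):
    """Build the URL and body for POST /v1/{parent=projects/*/locations/*}/lakes."""
    parent = f"projects/{project}/locations/{location}"
    url = f"{API_ROOT}/{parent}/lakes?lakeId={lake_id}"
    body = json.dumps({"displayName": display_name})
    return url, body

# Placeholder identifiers; POST the body to the URL with any authenticated client.
url, body = create_lake_request("my-project", "us-central1", "my-lake", "My Lake")
print(url)
```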

Create a Dataplex Lake

  1. Go to Dataplex in the console.
  2. Navigate to the Manage view.
  3. Click Create.
  4. Enter a Display name.
  5. The Lake ID will be automatically generated for you.
  6. Specify the Region and click Create.

Managing a lake

Before you can manage a lake, you have to create one.

Access control

To update or delete a lake, you need IAM roles with the dataplex.lakes.update and dataplex.lakes.delete IAM permissions, respectively. To grant update and delete permissions, use the roles/dataplex.editor and roles/dataplex.admin roles for Dataplex.
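As a purely illustrative sketch (not a complete IAM model), the operation-to-permission mapping can be checked in code; the permission names dataplex.lakes.update and dataplex.lakes.delete are the assumed requirements here:

```python
# Assumed permission required for each lake operation; in practice these are
# included in roles such as roles/dataplex.editor and roles/dataplex.admin.
REQUIRED_PERMISSION = {
    "update": "dataplex.lakes.update",
    "delete": "dataplex.lakes.delete",
}

def can_perform(operation: str, granted: set) -> bool:
    """Return True if the caller's granted IAM permissions allow the operation."""
    return REQUIRED_PERMISSION[operation] in granted

print(can_perform("update", {"dataplex.lakes.update"}))  # True
```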

Viewing a Lake

You can view your Dataplex lake in the console by clicking its name on the Dataplex page.

Updating a Lake

You can update a lake's details on the Edit lake page in the Google Cloud console or by using the Dataplex API method lakes.patch.

Deleting a lake

A lake can be deleted using either the Dataplex API method lakes.delete or the Delete button on the lake's page in the Google Cloud console.
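Both operations address the lake by its full resource name; a minimal sketch of the corresponding lakes.patch and lakes.delete request URLs (the updateMask query parameter is the standard way to name the fields being changed, and the lake name is a placeholder):

```python
API_ROOT = "https://dataplex.googleapis.com/v1"

def patch_lake_url(name: str, update_mask: str) -> str:
    """PATCH /v1/{lake.name=projects/*/locations/*/lakes/*}?updateMask=..."""
    return f"{API_ROOT}/{name}?updateMask={update_mask}"

def delete_lake_url(name: str) -> str:
    """DELETE /v1/{name=projects/*/locations/*/lakes/*}"""
    return f"{API_ROOT}/{name}"

lake = "projects/my-project/locations/us-central1/lakes/my-lake"
print(patch_lake_url(lake, "displayName"))
print(delete_lake_url(lake))
```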

Discover data

Here we will learn how to enable and use Dataplex Discovery. Discovery scans the data in a data lake, extracts its metadata, and registers it with Dataproc Metastore, BigQuery, and Data Catalog for analysis, search, and exploration.

Discovery configuration

By default, Discovery is enabled when you create a new zone or asset. Discovery can be turned off at the zone or asset level, and when you create a zone or an asset you can choose whether it inherits the zone-level Discovery settings or overrides them at the asset level.

View discovered tables and filesets

You can search for discovered tables and filesets in the Dataplex Discover view in the console.

Discovery actions

Discovery raises the following admin actions whenever data-related issues are detected during scans.

Resolve Discovery actions

Subsequent Discovery scans verify the data associated with an action. Once the problem that triggered the action is fixed, the next scheduled Discovery scan resolves the action automatically.

Other Actions

Missing resource: A matching dataset or bucket for an existing asset cannot be located.

Unauthorized resource: Dataplex lacks the necessary authorizations to perform discovery on the bucket or dataset it manages or to apply security policies to it.

Issues with security policy propagation: Security policies that were provided for a specific lake, zone, or asset could not be correctly propagated to the underlying buckets or datasets due to a number of problems. This kind of action could be raised at the lake, zone, and asset levels while all other actions are at the asset level.


Cloud Dataplex API

Service: dataplex.googleapis.com

We advise using the client libraries supplied by Google to call this service. If your application needs to call this service using your own libraries, use the following information when making API calls.

REST Resource: v1.projects.locations

get

GET /v1/{name=projects/*/locations/*}

Gets information about a location.

list

GET /v1/{name=projects/*}/locations

Lists information about the supported locations for this service.
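In the method listings that follow, a pattern such as {name=projects/*/locations/*} means the name parameter must match that path template, with each * standing for exactly one path segment. A small illustrative matcher for this convention:

```python
import re

def matches_template(value: str, template: str) -> bool:
    """Return True if a resource name matches a path template such as
    'projects/*/locations/*', where each '*' matches one path segment."""
    regex = re.escape(template).replace(r"\*", "[^/]+")
    return re.fullmatch(regex, value) is not None

print(matches_template("projects/my-project/locations/us-central1",
                       "projects/*/locations/*"))  # True
print(matches_template("projects/my-project",
                       "projects/*/locations/*"))  # False
```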

REST Resource: v1.projects.locations.lakes

create

POST /v1/{parent=projects/*/locations/*}/lakes

Creates a lake resource.

get

GET /v1/{name=projects/*/locations/*/lakes/*}

Retrieves a lake resource.

delete

DELETE /v1/{name=projects/*/locations/*/lakes/*}

Deletes a lake resource.

list

GET /v1/{parent=projects/*/locations/*}/lakes

Lists lake resources in a project and location.

getIamPolicy

GET /v1/{resource=projects/*/locations/*/lakes/*}:getIamPolicy

Gets the access control policy for a resource.

setIamPolicy

POST /v1/{resource=projects/*/locations/*/lakes/*}:setIamPolicy

Sets the access control policy on the specified resource.

patch

PATCH /v1/{lake.name=projects/*/locations/*/lakes/*}

Updates a lake resource.

testIamPermissions

POST /v1/{resource=projects/*/locations/*/lakes/*}:testIamPermissions

Returns permissions that a caller has on the specified resource.
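The lakes methods above can be summarized as an HTTP verb plus a URL suffix on the resource or parent path. The helper below is an illustrative sketch, not a client library; it simply expands a method name and a resource path into a concrete request line:

```python
API_ROOT = "https://dataplex.googleapis.com/v1"

# (HTTP verb, suffix appended to the resource or parent path) per method.
LAKE_METHODS = {
    "create":             ("POST",   "/lakes"),  # path = parent
    "list":               ("GET",    "/lakes"),  # path = parent
    "get":                ("GET",    ""),        # path = lake name
    "delete":             ("DELETE", ""),        # path = lake name
    "patch":              ("PATCH",  ""),        # path = lake name
    "getIamPolicy":       ("GET",    ":getIamPolicy"),
    "setIamPolicy":       ("POST",   ":setIamPolicy"),
    "testIamPermissions": ("POST",   ":testIamPermissions"),
}

def lake_request(method: str, path: str):
    """Return the (verb, url) pair for a lakes REST method."""
    verb, suffix = LAKE_METHODS[method]
    return verb, f"{API_ROOT}/{path}{suffix}"

print(lake_request("get", "projects/p/locations/us-central1/lakes/my-lake"))
```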

REST Resource: v1.projects.locations.lakes.actions

list

GET /v1/{parent=projects/*/locations/*/lakes/*}/actions

Lists action resources in a lake.

REST Resource: v1.projects.locations.lakes.zones.assets.actions

list

GET /v1/{parent=projects/*/locations/*/lakes/*/zones/*/assets/*}/actions

Lists action resources in an asset.

REST Resource: v1.projects.locations.lakes.zones.entities

create

POST /v1/{parent=projects/*/locations/*/lakes/*/zones/*}/entities

Create a metadata entity.

delete

DELETE /v1/{name=projects/*/locations/*/lakes/*/zones/*/entities/*}

Delete a metadata entity.

update

PUT /v1/{entity.name=projects/*/locations/*/lakes/*/zones/*/entities/*}

Update a metadata entity.

list

GET /v1/{parent=projects/*/locations/*/lakes/*/zones/*}/entities

List metadata entities in a zone.

get

GET /v1/{name=projects/*/locations/*/lakes/*/zones/*/entities/*}

Get a metadata entity.
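As with lakes, entity methods address either the zone (as parent) or the entity's full resource name. A hedged sketch that builds the list and get URLs from placeholder identifiers:

```python
API_ROOT = "https://dataplex.googleapis.com/v1"

def list_entities_url(project: str, location: str, lake: str, zone: str) -> str:
    """GET /v1/{parent=projects/*/locations/*/lakes/*/zones/*}/entities"""
    parent = f"projects/{project}/locations/{location}/lakes/{lake}/zones/{zone}"
    return f"{API_ROOT}/{parent}/entities"

def get_entity_url(project: str, location: str, lake: str, zone: str, entity: str) -> str:
    """GET /v1/{name=projects/*/locations/*/lakes/*/zones/*/entities/*}"""
    return f"{list_entities_url(project, location, lake, zone)}/{entity}"

print(list_entities_url("my-project", "us-central1", "my-lake", "raw-zone"))
```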

 

Cloud Dataplex API

The service name dataplex.googleapis.com is needed to create RPC client stubs.

google.cloud.dataplex.v1.ContentService

CreateContent Creates a content resource.
GetContent Gets a content resource.
DeleteContent Deletes a content resource.
ListContent Lists content resources.
SetIamPolicy Sets the access control policy on the specified contentitem resource.
UpdateContent Updates a content resource.
GetIamPolicy Gets the access control policy for a contentitem resource.
TestIamPermissions Returns the permissions that a caller has on a resource.

google.cloud.dataplex.v1.DataplexService

CancelJob Cancels jobs running for the task resource.
CreateAsset Creates an asset resource.
CreateEnvironment Creates an environment resource.
CreateLake Creates a lake resource.
CreateTask Creates a task resource within a lake.
CreateZone Creates a zone resource within a lake.
DeleteAsset Deletes an asset resource.
DeleteEnvironment Deletes an environment resource.
DeleteLake Deletes a lake resource.
DeleteTask Deletes a task resource.
DeleteZone Deletes a zone resource.
GetAsset Retrieves an asset resource.
GetEnvironment Retrieves an environment resource.
GetJob Retrieves a job resource.
GetLake Retrieves a lake resource.
ListAssetActions Lists action resources in an asset.
ListAssets Lists asset resources in a zone.
ListEnvironments Lists environments under the given lake.
ListJobs Lists jobs under the given task.
UpdateAsset Updates an asset resource.
UpdateEnvironment Updates an environment resource.

google.cloud.dataplex.v1.MetadataService

CreateEntity Create a metadata entity.
CreatePartition Create a metadata partition.
DeleteEntity Delete a metadata entity.
DeletePartition Delete a metadata partition.
GetEntity Get a metadata entity.
ListEntities List metadata entities in a zone.
ListPartitions List metadata partitions of an entity.
UpdateEntity Update a metadata entity.

google.cloud.location.Locations

GetLocation Gets information about a location.
ListLocations Lists information about the supported locations for this service.

google.iam.v1.IAMPolicy

GetIamPolicy Gets the access control policy for a resource.
SetIamPolicy Sets the access control policy on the specified resource.
TestIamPermissions Returns permissions that a caller has on the specified resource.

 

Frequently asked questions

What is Cloud Dataflow used for?

Dataflow is a managed service for executing a wide variety of data processing patterns.

How do I specify table names?

You can specify table names by using the metadata API.

What is the Dataflow equivalent in AWS?

Dataflow is roughly equivalent to Amazon Elastic MapReduce (EMR) or AWS Batch.

Is Dataproc fully managed?

Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.

Conclusion

In this article, we have extensively discussed Dataplex. We hope this blog has helped you enhance your knowledge regarding Dataplex.

If you want to learn more, check out our articles on Introduction to Cloud Monitoring, Overview of Log-based Metrics, and Cloud Logging in GCP.

Refer to our guided paths on Coding Ninjas Studio to learn more about DSA, Competitive Programming, JavaScript, System Design, etc.

Enroll in our courses and refer to the mock test and problems available.

Take a look at the interview experiences and interview bundle for placement preparations.

Do upvote our blog to help other ninjas grow.

Happy Coding!
