Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. Cloud Data Fusion
   2.1. Creating an Instance
   2.2. Creating a Private Instance
      2.2.1. Creating a Private Instance in a VPC Network
      2.2.2. Creating a Private Instance in a Shared VPC Network
   2.3. Setting Up VPC Network Peering
   2.4. Granting Service Account User Permissions
   2.5. Using JDBC Drivers
      2.5.1. Viewing a JDBC Driver
      2.5.2. Deleting a JDBC Driver
   2.6. Using VPC Controls with Cloud Data Fusion
      2.6.1. Restricting the Control Plane Surfaces
      2.6.2. Restricting the Data Plane Surfaces
   2.7. Viewing Pipeline Logs
      2.7.1. Filtering Logs
      2.7.2. Downloading Logs
   2.8. Access Controls using IAM
      2.8.1. Grant Roles
      2.8.2. Required Permissions
      2.8.3. Cloud Data Fusion Roles
      2.8.4. Cloud Data Fusion API Permissions
      2.8.5. Permissions for Common Tasks
3. Frequently Asked Questions
   3.1. Mention some key features of Cloud Data Fusion.
   3.2. How does Cloud Data Fusion manage real-time data integration?
   3.3. Name some use cases of Cloud Data Fusion.
   3.4. What do you mean by Hybrid Enablement?
4. Conclusion
Last Updated: Mar 27, 2024

Cloud Data Fusion

Author: Rupal Saluja

Introduction

In April 2008, Google announced App Engine, a platform for developing and hosting web applications in Google-managed data centers. It was Google's first cloud computing service. Since 2011, Google has announced multiple other cloud services that run on its platform. GCP is a part of Google Cloud, which also includes Google Workspace, enterprise versions of its APIs, and enterprise versions of Android and Chrome.
 


Cloud Data Fusion is one such cloud computing service that has been doing wonders since its launch. This blog provides you with the knowledge required to understand Cloud Data Fusion better and, at the same time, deals with every key aspect related to it.

Cloud Data Fusion

Cloud Data Fusion is a fully managed, cloud-native data integration service used to build and manage data pipelines. It offers an innovative drag-and-drop interface, pre-built connectors, a self-service model, and code-free data integration, removing the need for deep technical expertise at every step.


It offers a low cost of pipeline ownership, making it more reliable and more scalable. Features such as end-to-end data lineage, integrated metadata, cloud-native security, and data protection services make it more secure and assure customers of its data governance capabilities.

Before you begin the whole process, take care of the points below.

  1. Make sure you have a Google Cloud account and have created a project.
  2. If it is not already enabled, enable the Cloud Data Fusion API.
  3. Create a Cloud Data Fusion instance using the Create Instance option.
  4. Navigate to the Cloud Data Fusion instance using the Actions option on the Instances page.
  5. Deploy a sample pipeline, execute it, and view the results.
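If you prefer the command line, the first two steps above can be sketched with the gcloud CLI. This is a sketch, not the only way: it assumes you have the Google Cloud SDK installed and are authenticated, and my-project is a placeholder project ID.

```shell
# Point the gcloud CLI at your project (my-project is a placeholder).
gcloud config set project my-project

# Enable the Cloud Data Fusion API for the project.
gcloud services enable datafusion.googleapis.com
```

The remaining steps (creating an instance, deploying a sample pipeline) are covered in the sections that follow.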

 

Now, your account is ready to get into the whole process. Follow the blog to learn about the major key aspects of Cloud Data Fusion.

Creating an Instance

To create an instance, follow the steps below.

  1. As soon as you enable the API, an Instances page appears in the console. You will use that page to create and manage instances.
  2. Click the Create Instance option and give the instance a name.
  3. Add a description and specify the region.
  4. Specify the version and edition of Cloud Data Fusion you prefer.
  5. Optionally, choose the Add Accelerators option. There are also a few advanced options you can set as per your requirements; these are optional as well.
  6. Finally, click the Create button. Instance creation can take up to 30 minutes to complete.
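The same steps can be sketched from the command line. This is a minimal sketch, assuming the gcloud CLI is installed and authenticated; my-instance and us-central1 are placeholder values.

```shell
# Create a Cloud Data Fusion instance in the chosen region.
# As in the console flow, creation can take up to 30 minutes.
gcloud beta data-fusion instances create my-instance \
    --location=us-central1 \
    --edition=basic
```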

Creating a Private Instance

To create a private instance, you need to set up your VPC network first. To do so, enable Private Google Access for the network and then allocate an IP range as per the requirements of the instance you want to create. Your account is then ready to create a private instance. There are two ways to create one; we discuss both ways ahead in this blog.
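The IP range allocation mentioned above can be sketched as follows. This is an illustrative sketch: my-network and my-datafusion-range are placeholder names, and the /22 prefix length reflects the range size Cloud Data Fusion instances typically require.

```shell
# Allocate an internal IP range on the VPC network for the
# private instance to use via VPC peering.
gcloud compute addresses create my-datafusion-range \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length=22 \
    --network=my-network
```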

Creating a Private Instance in a VPC Network

Creating a private instance in a VPC network is very similar to creating an ordinary instance. The single difference is that, whereas before you chose advanced options only as your requirements dictated, here you must open the Advanced Options section and make sure the Private IP option is enabled.

Creating a Private Instance in a Shared VPC Network

The steps to create a private instance in a shared VPC Network are mentioned below for your reference.

Using the commands below, you have to export certain variables.

export PROJECT=PROJECT_ID
export LOCATION=REGION
export DATA_FUSION_API_NAME=datafusion.googleapis.com

 

Now, to create a Cloud Data Fusion Instance using the REST API, submit the following request.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -X POST \
  "https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT/locations/$LOCATION/instances?instanceId=instance_id" \
  -d '{
    "description": "Private CDF instance created through REST.",
    "type": "ENTERPRISE",
    "privateInstance": true,
    "networkConfig": {
      "network": "projects/shared_vpc_host_project_id/global/networks/network",
      "ipAllocation": "ip_range"
    }
  }'

Setting Up VPC Network Peering

VPC Network Peering establishes a network connection between your VPC network and Cloud Data Fusion, so that resources can be accessed through private IP addresses. To set up fully functional VPC Network Peering, first set up an external connection as part of your preparation. Once that is done, find your tenant project ID. Then follow the steps below to complete the setup.

  1. In the Console window, open the VPC Network Peering page.
  2. Select the Create Peering Connection option and click Continue.
  3. Enter a name for your connection.
  4. Under Your VPC Network, select the network in which you created the instance.
  5. Under Peered VPC Network, select the In another project option.
  6. Enter your tenant project ID under the Project ID heading.
  7. Under the VPC Network Name heading, enter the instance region and instance ID.
  8. Under the Exchange Custom Routes option, select Export Custom Routes and click Create.

 

You can set up IAM permissions or create a firewall rule as per the requirements of your Instance, that’s optional.
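The peering steps above can also be sketched with the gcloud CLI. This is a sketch under stated assumptions: my-network, TENANT_PROJECT_ID, and the peer network name are placeholders you must replace with the values gathered during preparation.

```shell
# Create a peering from your VPC network to the tenant project's
# network, exporting custom routes as in step 8 above.
gcloud compute networks peerings create datafusion-peering \
    --network=my-network \
    --peer-project=TENANT_PROJECT_ID \
    --peer-network=PEER_NETWORK_NAME \
    --export-custom-routes
```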

Granting Service Account User Permissions

As part of your preparation, you need your service account name accessible at any time. To get it, go to the Identity and Access Management page in the Console window, select the project from the project selector, and copy the necessary details from the information shown.

Once you are done with the pre-preparation part, follow the steps below to proceed further.

  1. From the Console window, select the project whose service account you want to locate for your Dataproc cluster, and click Open.
  2. Click the email address of your Dataproc service account and open the Permissions tab. You will see a list of principals that have already been granted roles on the service account.
  3. Click Grant Access.
  4. Paste the Cloud Data Fusion service account name you copied previously into the New principals field.
  5. Select the Service Account User role and click Save.

 

You are now done granting Service Account User permissions. You can alter the grant later if you want.
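The grant above can be sketched from the command line as well. Both service account emails below are placeholders for illustration: the Dataproc service account being modified and the Cloud Data Fusion service account being granted access.

```shell
# Grant the Cloud Data Fusion service account the Service Account User
# role on the Dataproc service account.
gcloud iam service-accounts add-iam-policy-binding \
    dataproc-sa@my-project.iam.gserviceaccount.com \
    --member="serviceAccount:datafusion-sa@tenant-project.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"
```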

Using JDBC Drivers

As part of your preparation, access the Cloud Data Fusion graphical interface from the Instances page. Then upload a JDBC driver so that it can be used later in the process.


The steps to use JDBC Drivers are mentioned below.

  1. First, go to the Cloud Data Fusion Pipeline Studio.
  2. In the left navigation pane, select Source or Sink and click the source or sink of your choice. A rectangle representing it appears on the Studio canvas.
  3. Click the Properties of the source or sink and open the Reference Name field.
  4. In the Reference Name field, enter the name of the driver you uploaded previously.
  5. Fill out the rest of the fields of the Configuration tab and select Upload.

 

Viewing a JDBC Driver

The JDBC drivers you uploaded previously appear as artifacts in the Cloud Data Fusion Center. To view a JDBC driver you uploaded, access the Cloud Data Fusion graphical interface first, then select the Artifacts option from the Filter By dropdown menu. An artifact card appears containing the information about the JDBC driver you wanted to view.

Deleting a JDBC Driver

To delete a JDBC driver you uploaded, access the Cloud Data Fusion graphical interface first, then select the Artifacts option from the Filter By dropdown menu. On the artifact card that appears, click the Trash icon, and your artifact will be deleted.

Using VPC Controls with Cloud Data Fusion

Using VPC Controls with Cloud Data Fusion, you can restrict the Cloud Data Fusion API surfaces. There are two types of Cloud Data Fusion surfaces that can be restricted using VPC Controls, namely the Control Plane surface and the Data Plane surface. Both are discussed below.

Restricting the Control Plane Surfaces

You can restrict the Control Plane surfaces by setting up private connectivity to Google APIs and services using datafusion.googleapis.com. For detailed knowledge, you can refer to our blog on

Restricting the Data Plane Surfaces

To restrict the Data Plane Surfaces, you can refer to the steps below.

  1. First, create a new private zone using Cloud DNS. Set the Zone type to Private, the Zone name to datafusiongoogleusercontentcom, and the DNS name to datafusion.googleusercontent.com, and select the private network you chose when you created the private instance.
  2. In the DNS Zone section, you will find Zone details fields named NS and SOA records. Use the ADD RECORD SET option to add the records you want to restrict per Data Plane surface standards.
  3. Click Add Item each time you want to add new records.
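Step 1 above can be sketched with the gcloud CLI. This is a sketch: my-network is a placeholder for the private network you selected when creating the instance, while the zone and DNS names come from the step itself.

```shell
# Create the private Cloud DNS zone described in step 1, visible
# only to the chosen VPC network.
gcloud dns managed-zones create datafusiongoogleusercontentcom \
    --description="Private zone for Cloud Data Fusion data plane" \
    --dns-name=datafusion.googleusercontent.com. \
    --visibility=private \
    --networks=my-network
```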

Viewing Pipeline Logs


As part of your preparation, enable Cloud Logging using the Logging and Monitoring option on the Instances page. Get your unique pipeline RunID from the Pipeline Studio using the Table tab. Then follow the steps below to view pipeline logs.

  1. Open the Logs Explorer page under the Cloud Logging section of the Console window.
  2. In the Cloud Dataproc Cluster field of the Filters menu, paste the unique RunID and click Open.
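The same lookup can be sketched from the command line with gcloud's logging reader. This is an illustrative sketch: RUN_ID is a placeholder for the pipeline RunID you copied, and the exact filter expression may need adjusting for your log entries.

```shell
# Read up to 50 log entries from Dataproc clusters that mention
# the pipeline's RunID (RUN_ID is a placeholder).
gcloud logging read \
    'resource.type="cloud_dataproc_cluster" AND "RUN_ID"' \
    --limit=50
```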

 

Filtering Logs

Use the Filters menus to filter your logs. These menus can filter logs at various severity levels or by components such as datafusion-pipeline-logs.

Downloading Logs

To download logs, click the Download Logs option on the Instances page of the Console window.

Access Controls using IAM

Under this module, we will discuss the various Access Controls that are made available to us using IAM. This module covers Grant Roles, Required Permissions, Cloud Data Fusion roles, Cloud Data Fusion API permissions, and a few permissions for common tasks.

Grant Roles

Users are granted roles at the project level using the Google Cloud Console, the Resource Manager API, or the Google Cloud CLI. For a detailed explanation of these grant roles, you can refer to our blog on Cloud Life Sciences (beta).
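A project-level role grant via the Google Cloud CLI can be sketched as follows. This is a sketch: my-project and the user email are placeholders, and roles/datafusion.admin is one of the Cloud Data Fusion roles discussed below.

```shell
# Grant a user the Cloud Data Fusion Admin role at the project level.
gcloud projects add-iam-policy-binding my-project \
    --member="user:jane@example.com" \
    --role="roles/datafusion.admin"
```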

Required Permissions

The following table describes the permissions required to run Cloud Data Fusion. These permissions are granted automatically as soon as you enable the Cloud Data Fusion API.

Table 1

Cloud Data Fusion Roles

The table below contains the roles offered by Cloud Data Fusion. The lowest-level resource at which these roles can be granted is a project.

Table 2

 

Cloud Data Fusion API Permissions

The following table shows the permissions required to run the Cloud Data Fusion API.

Table 3

Permissions for Common Tasks

The table below provides you with the information required to perform some common tasks.

Table 4

 


Frequently Asked Questions

Mention some key features of Cloud Data Fusion.

Some key features of Cloud Data Fusion are listed below.

  • Data Integration through Collaboration and Standardization.
  • Integration with Google’s industry-leading Big Data tools.
  • Open core, code-free integration.
  • Hybrid and Multi-cloud Integration.

How does Cloud Data Fusion manage real-time data integration?

Cloud Data Fusion's Data Replication feature lets you replicate transactional and operational databases such as SQL Server, Oracle, and MySQL directly into BigQuery with just a few clicks. Continuous analytics, feasibility assessment, performance, and health monitoring are some of the other real-time facilities Cloud Data Fusion provides.

Name some use cases of Cloud Data Fusion.

Some prominent use cases of Cloud Data Fusion are mentioned below.

  • Modernized and more secure data lakes.
  • More Agile data warehouses.
  • Unified Analytical environment.
  • Collaborative Data Engineering.

What do you mean by Hybrid Enablement?

Cloud Data Fusion, being built on open-source technology, is flexible and portable, which lets it serve as a standardized data integration solution for various hybrid and multi-cloud applications. This notion is known as Hybrid Enablement.

Conclusion

In a nutshell, we understood what Cloud Data Fusion is and learned about creating ordinary and private instances, setting up VPC Network Peering, granting Service Account User permissions, using JDBC drivers, using VPC Controls with Cloud Data Fusion, and viewing pipeline logs. We also saw some standards related to access controls using IAM.

We hope the above discussion helped you understand Cloud Data Fusion in clearer terms and can serve as a reference whenever needed. If you want to see the differences between AWS and GCP, see our GCP vs AWS comparison blog. If you are preparing for a GCP certification, pay attention to our GCP Certifications blog. For a crystal-clear understanding of cloud computing, you can refer to our blogs on Cloud Computing Architecture, AWS Cloud Computing, Cloud Computing Infrastructure, and Cloud Delivery Models by clicking the respective links.

Visit our website to read more such blogs. Make sure you enroll in the courses we provide, take mock tests, and solve the available problems and interview puzzles. You can also check out our interview experiences and interview bundle for placement preparation. Do upvote our blog to help fellow ninjas grow.

Happy Coding!
