Table of contents
1. Introduction
2. Data Catalog
2.1. Tags and Tag Templates
2.1.1. 🍁 Tags
2.1.2. 🍁 Tag Templates
2.2. How to Tag a BigQuery Table using Data Catalog
2.2.1. 🌻 Creating a Template and Attaching it
2.2.2. 🌻 Deletion
2.3. Searching Data Assets with Data Catalog
2.4. Viewing Data Assets with Data Catalog
2.5. Creating Custom Data Catalog Entries
3. Surfacing Files from Cloud Storage
4. Frequently Asked Questions
4.1. What is Search Scope?
4.2. Define Date-sharded tables.
4.3. What are fields?
4.4. How can you star your favorite entries?
5. Conclusion
Last Updated: Mar 27, 2024
Data Catalog

Author Rupal Saluja

Introduction

Google Cloud Platform, abbreviated as GCP, is a collection of cloud computing services that run on the same internal infrastructure Google uses for its end-user products. Data Catalog, the subject of this blog, is one such service.

End-user products are those used directly by consumers, for example, Gmail, Google Search, YouTube, and Google Drive. GCP offers several categories of products, such as Storage and Databases, Networking, and Big Data.

Data Catalog

Data Catalog is a fully managed, scalable metadata management service within Dataplex. Dataplex is a data fabric that unifies distributed data and automates data management. Many organizations now recognize the importance of informed decision-making and treat their organized data as data assets. Data Catalog helps organizations manage these data assets, search for insightful data, understand it, and put it to use.

With Data Catalog, clients gain a unified view of their data, access to technical and business metadata, and efficient data management capabilities. Its three prominent functions are searching for data entries, tagging data entries, and providing column-level security for BigQuery tables.

Tags and Tag Templates

Handling a large number of data entries is difficult, and it becomes harder when different groups within the same organization use those entries for different needs. Often, each group ends up creating its own set of data entries and metadata describing the same data, which duplicates effort and leaves information incomplete. Data Catalog addresses this with tags, which let organizations create, search, and manage data entries and their metadata.

The two key Data Catalog concepts are Tags and Tag Templates. We will discuss both in the sections below.

🍁 Tags

Tags in Data Catalog work like tags anywhere else: they provide context. You attach custom metadata fields to a data entry, and those fields serve as its tags. Tags give meaningful context to anyone who wants to use that asset. There are two types of tags, Public and Private, which differ in how access to them is controlled.

🌻 Private Tags

Private Tags come with strict access controls. You can search for and view these tags, and the data entries associated with them, only if you have the required permissions on both the data entries and the private tag template.

🌻 Public Tags

Public Tags come with less strict access controls. Any user with the required view permissions on a data entry can view all the public tags associated with it, which makes these tags easier to search for and view.
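
The difference between the two tag types can be sketched as a small visibility check. This is a toy model of the rules described above, not the Data Catalog API or its IAM model: the class name and the two permission names are hypothetical and exist only for illustration.

```java
import java.util.Set;

// Toy model of public vs. private tag visibility; not the Data Catalog API.
final class TagVisibilitySketch {

  // Hypothetical permission names used only for this illustration.
  enum Permission { VIEW_ENTRY, VIEW_PRIVATE_TEMPLATE }

  // A public tag is visible to anyone who can view the data entry.
  // A private tag additionally requires view access on its tag template.
  static boolean canViewTag(boolean templateIsPublic, Set<Permission> granted) {
    if (!granted.contains(Permission.VIEW_ENTRY)) {
      return false;
    }
    return templateIsPublic || granted.contains(Permission.VIEW_PRIVATE_TEMPLATE);
  }

  public static void main(String[] args) {
    Set<Permission> entryOnly = Set.of(Permission.VIEW_ENTRY);
    Set<Permission> both = Set.of(Permission.VIEW_ENTRY, Permission.VIEW_PRIVATE_TEMPLATE);

    System.out.println(canViewTag(true, entryOnly));  // public tag, entry access only
    System.out.println(canViewTag(false, entryOnly)); // private tag, entry access only
    System.out.println(canViewTag(false, both));      // private tag, both permissions
  }
}
```

With entry access alone, only the public tag is visible; the private tag becomes visible once the template permission is also granted.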

🍁 Tag Templates

You need one or more tag templates before you can start tagging data. A tag template is a collection of metadata key-value pairs known as fields. It can be public or private, and public is the default when you create one. Having a set of tag templates is like having a schema for your metadata.

To help users get started, Data Catalog provides a gallery of sample tag templates that illustrate common tagging use cases. To use the template gallery, go to the Tag Templates page and click Create tag template. The gallery is displayed as part of the Create template page.
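
To make the "template as schema" idea concrete, here is a minimal in-memory sketch. It is not the Data Catalog client library: the class, its methods, and the field names (source, num_rows, has_pii) are all hypothetical, and Java classes stand in for the real field types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy in-memory model of a tag template acting as a schema for tags.
final class TagTemplateSketch {

  // Field id -> expected Java type of the value (a stand-in for real field types).
  private final Map<String, Class<?>> fields = new LinkedHashMap<>();

  // Declares a field on the template and returns the template for chaining.
  TagTemplateSketch field(String id, Class<?> type) {
    fields.put(id, type);
    return this;
  }

  // A tag conforms when every key is a declared field and every value has the declared type.
  boolean accepts(Map<String, Object> tagValues) {
    for (Map.Entry<String, Object> e : tagValues.entrySet()) {
      Class<?> expected = fields.get(e.getKey());
      if (expected == null || !expected.isInstance(e.getValue())) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    TagTemplateSketch template = new TagTemplateSketch()
        .field("source", String.class)
        .field("num_rows", Double.class)
        .field("has_pii", Boolean.class);

    // Conforms: both keys are declared and the value types match.
    System.out.println(template.accepts(Map.of("source", "BigQuery", "has_pii", false)));
    // Rejected: "owner" is not a field of this template.
    System.out.println(template.accepts(Map.of("owner", "data-team")));
  }
}
```

The point of the sketch is only that a tag cannot carry arbitrary keys: the template decides which fields exist and what kind of value each one holds.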

How to Tag a BigQuery Table using Data Catalog

Before you start, make sure a project is set up. To set one up, follow the steps below.

  1. Create a Google Cloud account, and then create a Google Cloud project from the Console page.
  2. Enable the Data Catalog and BigQuery APIs, then install and initialize the Google Cloud CLI.
  3. Once the project is ready, add a public data entry to it using the Explorer section of the BigQuery page.
  4. Create a dataset using the Actions icon in the Explorer panel.
  5. Copy a publicly accessible table to that dataset using the Copy table pane in the Explorer panel.
     

You are done setting up a project. Now, we will proceed with the steps necessary to tag a BigQuery Table.

🌻 Creating a Template and Attaching it

To create a template and attach it, follow the steps listed below.

  1. Open the Dataplex Tag Templates page.
  2. Create a tag template, fill in the necessary details, and click Create.
  3. Go to the Dataplex Search page and search for your dataset.
  4. The results show the dataset and the table. Click the table.
  5. On the page that opens, attach tags using the Attach Tags panel and click Save.
     

You can also add an overview to the same table using the Add overview option.

🌻 Deletion

You can delete a Tag Template, a dataset, or even the complete project as per your requirements.

To delete a Tag Template, go to the Templates page in the Data Catalog window. Next to the template you want to remove (Demo Tag Template in this example), click Actions and delete the template.

To delete a dataset, open the BigQuery page. In the Explorer panel, search for the dataset, click Actions, and delete it.

To delete the complete project, you need to open the Manage Resources page. From the project list, select the project and then delete it.

Searching Data Assets with Data Catalog

You can search for data assets in several ways, whichever is most convenient: using the Console with filters, Java, Node.js, Python, or REST and the command line.

Any of these methods will find your data assets. Here, we will see the implementation in Java.

import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.DataCatalogClient.SearchCatalogPagedResponse;
import com.google.cloud.datacatalog.v1.SearchCatalogRequest;
import com.google.cloud.datacatalog.v1.SearchCatalogRequest.Scope;
import com.google.cloud.datacatalog.v1.SearchCatalogResult;
import java.io.IOException;

// Sample to search catalog
public class SearchAssets
{
  public static void main(String[] args) throws IOException
  {
    String projectId = "my-project-id";
    String query = "type=dataset";
    searchCatalog(projectId, query);
  }
  public static void searchCatalog(String projectId, String query) throws IOException
  {
    // Create a scope object setting search boundaries to the given organization.
    // Scope scope = Scope.newBuilder().addIncludeOrgIds(orgId).build();
    // Alternatively, search using project scopes.
    Scope scope = Scope.newBuilder().addIncludeProjectIds(projectId).build();
    // Initialize the client used to send requests. It only needs to be created once and can be
    // reused for multiple requests. The try-with-resources statement below calls close() on it
    // automatically, which cleans up any remaining background resources.
    try (DataCatalogClient dataCatalogClient = DataCatalogClient.create())
    {
      // Search the catalog.
      SearchCatalogRequest searchCatalogRequest = SearchCatalogRequest.newBuilder().setScope(scope).setQuery(query).build();
      SearchCatalogPagedResponse response = dataCatalogClient.searchCatalog(searchCatalogRequest);
      System.out.println("Search results:");
      for (SearchCatalogResult result : response.iterateAll())
      {
        System.out.println(result);
      }
    }
  }
}
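
The sample above passes a single predicate, "type=dataset", as its query. Richer queries are usually built from several predicates, and the helper below is a minimal sketch of assembling such a string. It assumes the search syntax treats `qualifier=value` as an exact match, `qualifier:value` as a substring match, and whitespace between predicates as a logical AND; the class and method names are hypothetical, and the qualifiers should be checked against the search syntax reference before use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that assembles a search query string from predicates.
final class SearchQuerySketch {

  private final List<String> terms = new ArrayList<>();

  // Adds an exact-match predicate, for example type=table.
  SearchQuerySketch exact(String qualifier, String value) {
    terms.add(qualifier + "=" + value);
    return this;
  }

  // Adds a substring-match predicate, for example name:trips.
  SearchQuerySketch contains(String qualifier, String value) {
    terms.add(qualifier + ":" + value);
    return this;
  }

  // Joins the predicates with spaces (assumed to mean logical AND).
  String build() {
    return String.join(" ", terms);
  }

  public static void main(String[] args) {
    String query = new SearchQuerySketch()
        .exact("type", "table")
        .contains("name", "trips")
        .build();
    System.out.println(query); // prints: type=table name:trips
  }
}
```

The resulting string could then be passed as the `query` argument of `searchCatalog` in the sample above.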

Viewing Data Assets with Data Catalog

Data Catalog can also be used to view table details within the Cloud Console. You need to follow the steps below to view any table.

  1. Open the Dataplex Search page and in the search box, type the name of the dataset whose table you want to view.
  2. Click on the table. The BigQuery table details page opens.

The table details include Tags, Schema and Column Tags, and other details.

Creating Custom Data Catalog Entries

You can use the implementation in Java below to create custom data catalog entries.

import com.google.cloud.datacatalog.v1.ColumnSchema;
import com.google.cloud.datacatalog.v1.CreateEntryGroupRequest;
import com.google.cloud.datacatalog.v1.CreateEntryRequest;
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Entry;
import com.google.cloud.datacatalog.v1.EntryGroup;
import com.google.cloud.datacatalog.v1.LocationName;
import com.google.cloud.datacatalog.v1.Schema;
import java.io.IOException;

// Sample to create custom entry
public class CreateEntry
{
  public static void main(String[] args) throws IOException
  {
    String projectId = "my-project";
    String entryGroupId = "onprem_entry_group";
    String entryId = "onprem_entry_id";
    createCustomEntry(projectId, entryGroupId, entryId);
  }
  public static void createCustomEntry(String projectId, String entryGroupId, String entryId) throws IOException
  {
    // Currently, Data Catalog stores metadata in the us-central1 region.
    String location = "us-central1";
    // Initialize the client used to send requests. It only needs to be created once and can be
    // reused for multiple requests. The try-with-resources statement below calls close() on it
    // automatically, which cleans up any remaining background resources.
    try (DataCatalogClient dataCatalogClient = DataCatalogClient.create())
    {
      // Construct the EntryGroup for the EntryGroup request.
      EntryGroup entryGroup = EntryGroup.newBuilder()
          .setDisplayName("My awesome Entry Group")
          .setDescription("This Entry Group represents an external system")
          .build();

      // Construct the EntryGroup request to be sent by the client.
      CreateEntryGroupRequest entryGroupRequest = CreateEntryGroupRequest.newBuilder()
          .setParent(LocationName.of(projectId, location).toString())
          .setEntryGroupId(entryGroupId)
          .setEntryGroup(entryGroup)
          .build();

      // Use the client to send the API request.
      EntryGroup createdEntryGroup = dataCatalogClient.createEntryGroup(entryGroupRequest);

      // Construct the Entry for the Entry request.
      Entry entry = Entry.newBuilder()
          .setUserSpecifiedSystem("onprem_data_system")
          .setUserSpecifiedType("onprem_data_asset")
          .setDisplayName("My awesome data asset")
          .setDescription("This data asset is managed by an external system.")
          .setLinkedResource("//my-onprem-server.com/dataAssets/my-awesome-data-asset")
          .setSchema(Schema.newBuilder()
              .addColumns(ColumnSchema.newBuilder().setColumn("first_column").setDescription("This column consists of ....").setMode("NULLABLE").setType("DOUBLE").build())
              .addColumns(ColumnSchema.newBuilder().setColumn("second_column").setDescription("This column consists of ....").setMode("REQUIRED").setType("STRING").build())
              .build())
          .build();

      // Construct the Entry request to be sent by the client.
      CreateEntryRequest entryRequest = CreateEntryRequest.newBuilder()
          .setParent(createdEntryGroup.getName())
          .setEntryId(entryId)
          .setEntry(entry)
          .build();

      // Use the client to send the API request.
      Entry createdEntry = dataCatalogClient.createEntry(entryRequest);
      System.out.printf("Custom entry created with name: %s", createdEntry.getName());
    }
  }
}
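
The sample builds its parent path with `LocationName.of(projectId, location).toString()` and prints the created entry's name at the end. To show what those strings look like, here is a plain-Java sketch that reproduces the hierarchical resource name format (project, location, entry group, entry) without the client library. The class and method names are hypothetical; in real code, prefer the library's own name helpers.

```java
// Plain-Java illustration of the hierarchical resource names used in the sample above.
final class ResourceNameSketch {

  // Parent of an entry group: projects/{project}/locations/{location}
  static String locationName(String projectId, String location) {
    return String.format("projects/%s/locations/%s", projectId, location);
  }

  // Parent of an entry: .../entryGroups/{entryGroupId}
  static String entryGroupName(String projectId, String location, String entryGroupId) {
    return locationName(projectId, location) + "/entryGroups/" + entryGroupId;
  }

  // Full name of an entry: .../entries/{entryId}
  static String entryName(
      String projectId, String location, String entryGroupId, String entryId) {
    return entryGroupName(projectId, location, entryGroupId) + "/entries/" + entryId;
  }

  public static void main(String[] args) {
    // Uses the same placeholder ids as the sample above.
    System.out.println(
        entryName("my-project", "us-central1", "onprem_entry_group", "onprem_entry_id"));
  }
}
```

Reading the name from right to left gives the containment chain: the entry lives in an entry group, which lives in a location of a project.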

Surfacing Files from Cloud Storage

You can surface files from Cloud Storage in several ways, whichever is most convenient: using the Console, gcloud, Java, Node.js, Python, or REST and the command line.

You can use any of these methods to surface files from Cloud Storage. Here, we will see the implementation in Java.

import com.google.cloud.datacatalog.v1.ColumnSchema;
import com.google.cloud.datacatalog.v1.CreateEntryRequest;
import com.google.cloud.datacatalog.v1.DataCatalogClient;
import com.google.cloud.datacatalog.v1.Entry;
import com.google.cloud.datacatalog.v1.EntryGroupName;
import com.google.cloud.datacatalog.v1.EntryType;
import com.google.cloud.datacatalog.v1.GcsFilesetSpec;
import com.google.cloud.datacatalog.v1.Schema;
import java.io.IOException;


// Sample to create file set entry
public class CreateEntry
{
  public static void main(String[] args) throws IOException
  {
    String project = "my-project-id";
    String entryGroupId = "fileset_entry_group";
    String entryId = "fileset_entry_id";
    createFilesetEntry(project, entryGroupId, entryId);
  }

  // Create Fileset Entry.
  public static void createFilesetEntry(String project, String entryGroupId, String entryId) throws IOException
  {
    String location = "us-central1";
    // Initialize the client used to send requests. It only needs to be created once and can be
    // reused for multiple requests. The try-with-resources statement below calls close() on it
    // automatically, which cleans up any remaining background resources.
    try (DataCatalogClient dataCatalogClient = DataCatalogClient.create())
    {
      // Construct the Entry for the Entry request.
      Entry entry = Entry.newBuilder()
          .setDisplayName("My Fileset")
          .setDescription("This fileset consists of ....")
          .setGcsFilesetSpec(GcsFilesetSpec.newBuilder().addFilePatterns("gs://cloud-samples-data/*").build())
          .setSchema(Schema.newBuilder()
              .addColumns(ColumnSchema.newBuilder().setColumn("first_name").setDescription("First name").setMode("REQUIRED").setType("STRING").build())
              .addColumns(ColumnSchema.newBuilder().setColumn("last_name").setDescription("Last name").setMode("REQUIRED").setType("STRING").build())
              .addColumns(ColumnSchema.newBuilder().setColumn("addresses").setDescription("Addresses").setMode("REPEATED").setType("RECORD")
                  .addSubcolumns(ColumnSchema.newBuilder().setColumn("city").setDescription("City").setMode("NULLABLE").setType("STRING").build())
                  .addSubcolumns(ColumnSchema.newBuilder().setColumn("state").setDescription("State").setMode("NULLABLE").setType("STRING").build())
                  .build())
              .build())
          .setType(EntryType.FILESET)
          .build();

      // Construct the Entry request to be sent by the client.
      CreateEntryRequest entryRequest = CreateEntryRequest.newBuilder()
          .setParent(EntryGroupName.of(project, location, entryGroupId).toString())
          .setEntryId(entryId)
          .setEntry(entry)
          .build();

      // Using the client to send API request.
      Entry entryCreated = dataCatalogClient.createEntry(entryRequest);
      System.out.printf("Entry created with name: %s", entryCreated.getName());
    }
  }
}
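
The fileset in the sample is defined by the file pattern "gs://cloud-samples-data/*". To give a feel for which objects such a pattern selects, the sketch below translates a pattern into a regular expression. It rests on assumptions that should be verified against the file pattern documentation: `*` is taken to match within a single directory level, `**` to match across levels, and `?` to match one character. Bracket ranges are not handled, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Approximate matcher for fileset-style file patterns (assumed wildcard semantics).
final class FilePatternSketch {

  // Translates a file pattern into a regular expression, character by character.
  static Pattern toRegex(String filePattern) {
    StringBuilder regex = new StringBuilder();
    for (int i = 0; i < filePattern.length(); i++) {
      char c = filePattern.charAt(i);
      if (c == '*') {
        boolean doubleStar =
            i + 1 < filePattern.length() && filePattern.charAt(i + 1) == '*';
        if (doubleStar) {
          regex.append(".*");    // "**" may cross directory separators
          i++;                   // skip the second '*'
        } else {
          regex.append("[^/]*"); // "*" stays within one directory level
        }
      } else if (c == '?') {
        regex.append("[^/]");    // "?" matches exactly one non-separator character
      } else {
        regex.append(Pattern.quote(String.valueOf(c))); // literal character
      }
    }
    return Pattern.compile(regex.toString());
  }

  // Returns true when the whole object URI matches the file pattern.
  static boolean matches(String filePattern, String objectUri) {
    return toRegex(filePattern).matcher(objectUri).matches();
  }

  public static void main(String[] args) {
    String pattern = "gs://cloud-samples-data/*";
    System.out.println(matches(pattern, "gs://cloud-samples-data/data.csv"));
    System.out.println(matches(pattern, "gs://cloud-samples-data/dir/data.csv"));
  }
}
```

Under these assumptions the first object matches and the second does not, because a single `*` does not descend into subdirectories.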

Frequently Asked Questions

What is Search Scope?

Search Scope determines which search results are displayed based on the permissions a user has. This means different users can see different results for the same query. Data Catalog search results are scoped according to your role.

Define Date-sharded tables.

Date-sharded tables appear as a single logical entry in Data Catalog. This entry has the same schema as the table shards and gets its access level from the dataset it belongs to. Individual date-sharded tables are never visible in a Data Catalog search.
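
The grouping can be illustrated with a small sketch that collapses shard names into logical entries. It assumes the common shard naming convention of a shared prefix followed by an underscore and an eight-digit date (prefix_YYYYMMDD). Using the bare prefix as the logical entry's name is a choice made for this illustration, and the table names are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy grouping of shard tables into logical entries; not the Data Catalog API.
final class DateShardSketch {

  // Assumed shard naming convention: <prefix>_YYYYMMDD
  private static final Pattern SHARD = Pattern.compile("^(.+)_(\\d{8})$");

  // Maps each logical entry name to the number of physical tables it represents.
  static Map<String, Integer> logicalEntries(List<String> tableNames) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String name : tableNames) {
      Matcher m = SHARD.matcher(name);
      // Shards collapse onto their prefix; ordinary tables keep their own name.
      String logicalName = m.matches() ? m.group(1) : name;
      counts.merge(logicalName, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> tables = List.of("events_20240101", "events_20240102", "customers");
    System.out.println(logicalEntries(tables)); // prints: {customers=1, events=2}
  }
}
```

Three physical tables become two entries: the two `events` shards are represented once, which mirrors why individual shards do not show up in search results.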

What are fields? 

Fields are stored in the template as an ordered set. A field's order represents its importance relative to the other fields. Fields are optional unless you specify them as required.
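
A short sketch can show how ordering and the required flag might be used together. It assumes that a higher order value means a more important field and that a field is optional unless explicitly marked as required; the class, its members, and the field ids are hypothetical and do not come from the Data Catalog client library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Toy model of template fields with an order and a required flag.
final class TemplateFieldSketch {

  final String id;
  final int order;        // assumption: higher value = more important
  final boolean required; // false unless explicitly marked required

  TemplateFieldSketch(String id, int order, boolean required) {
    this.id = id;
    this.order = order;
    this.required = required;
  }

  // Returns field ids sorted from most to least important.
  static List<String> byImportance(List<TemplateFieldSketch> fields) {
    List<TemplateFieldSketch> sorted = new ArrayList<>(fields);
    sorted.sort(Comparator.comparingInt((TemplateFieldSketch f) -> f.order).reversed());
    List<String> ids = new ArrayList<>();
    for (TemplateFieldSketch f : sorted) {
      ids.add(f.id);
    }
    return ids;
  }

  // Returns the ids of required fields that a tag did not provide.
  static List<String> missingRequired(List<TemplateFieldSketch> fields, Set<String> providedIds) {
    List<String> missing = new ArrayList<>();
    for (TemplateFieldSketch f : fields) {
      if (f.required && !providedIds.contains(f.id)) {
        missing.add(f.id);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    List<TemplateFieldSketch> fields = List.of(
        new TemplateFieldSketch("source", 2, true),
        new TemplateFieldSketch("num_rows", 1, false),
        new TemplateFieldSketch("has_pii", 3, true));

    System.out.println(byImportance(fields));                     // [has_pii, source, num_rows]
    System.out.println(missingRequired(fields, Set.of("source"))); // [has_pii]
  }
}
```

A tag that supplies only `source` would be incomplete here, since `has_pii` is marked required, while omitting the optional `num_rows` is fine.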

How can you star your favorite entries?

To star your favorite entries so they are easy to find later, search for assets on the Dataplex Search page, then click the star icon next to an entry to bookmark it.

Conclusion

In a nutshell, we understood what Data Catalog is, learned about Tags and Tag Templates, and saw how to tag a BigQuery table, search and view data assets, and create custom Data Catalog entries. We also saw how to surface files from Cloud Storage.

We hope the above discussion helped you understand Data Catalog more clearly and serves as a reference whenever needed. For a comparison between AWS and GCP, see our GCP vs AWS comparison blog. If you are preparing for a GCP certification, see our GCP Certifications blog. For a clear understanding of cloud computing, you can refer to our blogs on Cloud Computing Architecture, AWS Cloud Computing, Cloud Computing Infrastructure, and Cloud Delivery Models.

Visit our website to read more such blogs. You can also enroll in our courses, take mock tests, and solve the available problems and interview puzzles. For placement preparation, look at our interview experiences and interview bundle. Do upvote our blog to help fellow ninjas grow.

Happy Coding!
