Table of contents
1. Introduction
2. Import the HBase data into Bigtable using Dataflow
2.1. Snappy compressed tables
3. Validate the imported data in Bigtable
3.1. Route writes to Bigtable
4. Replicate from HBase to Bigtable
4.1. Use cases
5. Migrate to Bigtable
6. Set up and configure the replication library
6.1. Configure authentication
6.2. Create a destination instance and tables
6.3. Set the config properties
6.4. Install the replication library
7. Add a Bigtable peer
8. Upload your CSV
8.1. Remove and store the headers
8.2. Prepare your Cloud Bigtable table for data import
9. Run the Dataflow job
9.1. Clone the repository
9.2. Start the Dataflow job
9.3. Monitor your job
10. Verify your data was inserted
11. Frequently Asked Questions
11.1. Which feature is supported by Cloud Bigtable?
11.2. What is the difference between BigQuery and Bigtable?
11.3. Why did Google create Bigtable?
11.4. What language is Bigtable in?
11.5. Is Bigtable column based?
12. Conclusion
Last Updated: Mar 27, 2024

Advanced Concepts of Cloud Bigtable with Migration Concepts

Author Muskan Sharma

Introduction

When working with large amounts of data, we need to store it in tables. But what do you do when a table has billions of rows and thousands of columns? That is where Cloud Bigtable comes in.

You can store terabytes or even petabytes of data in Cloud Bigtable, a large table containing billions of rows and thousands of columns.

So in this blog, you'll learn the Advanced Concepts of Cloud Bigtable with Migration Concepts.

Let’s dive into the topic to explore more. 


Import the HBase data into Bigtable using Dataflow

Once you have a destination table ready for your data, you can import and validate it.

Uncompressed tables

Run the command below for each table you want to migrate if your HBase tables are not compressed:

java -jar $IMPORT_JAR importsnapshot \
    --runner=DataflowRunner \
    --project=$PROJECT_ID \
    --bigtableInstanceId=$INSTANCE_ID \
    --bigtableTableId=TABLE_NAME \
    --hbaseSnapshotSourceDir=$MIGRATION_DESTINATION_DIRECTORY/data \
    --snapshotName=SNAPSHOT_NAME \
    --stagingLocation=$MIGRATION_DESTINATION_DIRECTORY/staging \
    --tempLocation=$MIGRATION_DESTINATION_DIRECTORY/temp \
    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
    --region=$REGION

After the command restores the HBase snapshot to your Cloud Storage bucket, the tool starts the import job. Depending on the size of the snapshot, the restore process can take several minutes to complete.

The following advice should be kept in mind when importing:

  • Be sure to set maxNumWorkers to speed up the data-loading process. This setting makes it more likely that the import job has enough compute power to finish in a reasonable amount of time without using up too many of the Bigtable instance's resources.
  • If the Bigtable instance is not being used for another workload, set maxNumWorkers to three times the number of nodes in your Bigtable instance.
  • If you are importing data from HBase and using the instance for another workload at the same time, reduce the value of maxNumWorkers accordingly.
  • Utilize the standard worker type.
  • Keep an eye on the CPU use of the Bigtable instance throughout the import. If the Bigtable instance's overall CPU use is too high, you might need to add more nodes. The cluster may take up to 20 minutes to start delivering the performance advantages of the extra nodes.

See Monitoring a Bigtable instance for more details on monitoring the Bigtable instance.
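If you do need to add nodes during the import, one way is to resize the cluster with gcloud. This is a sketch; the cluster ID, instance ID, and node count are placeholders for your own values:

gcloud bigtable clusters update CLUSTER_ID \
    --instance=INSTANCE_ID \
    --num-nodes=NEW_NODE_COUNT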

Snappy compressed tables

If your HBase tables are compressed with Snappy, add the following flag to the import command:

        --enableSnappy=true
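For reference, here is a sketch of the full import command with the Snappy flag appended, using the same placeholders as the uncompressed example above:

java -jar $IMPORT_JAR importsnapshot \
    --runner=DataflowRunner \
    --project=$PROJECT_ID \
    --bigtableInstanceId=$INSTANCE_ID \
    --bigtableTableId=TABLE_NAME \
    --hbaseSnapshotSourceDir=$MIGRATION_DESTINATION_DIRECTORY/data \
    --snapshotName=SNAPSHOT_NAME \
    --stagingLocation=$MIGRATION_DESTINATION_DIRECTORY/staging \
    --tempLocation=$MIGRATION_DESTINATION_DIRECTORY/temp \
    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
    --region=$REGION \
    --enableSnappy=true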

Validate the imported data in Bigtable

You must execute the sync-table job in order to verify the imported data. The sync-table job computes hashes for Bigtable row ranges and compares them to the hashtable output you previously computed.

Run the following commands in the command shell to launch the sync-table job:

 java -jar $IMPORT_JAR sync-table  \
    --runner=dataflow \
    --project=$PROJECT_ID \
    --bigtableInstanceId=$INSTANCE_ID \
    --bigtableTableId=TABLE_NAME \
    --outputPrefix=$MIGRATION_DESTINATION_DIRECTORY/sync-table/output-TABLE_NAME-$(date +"%s") \
    --stagingLocation=$MIGRATION_DESTINATION_DIRECTORY/sync-table/staging \
    --hashTableOutputDir=$MIGRATION_DESTINATION_DIRECTORY/hashtable/TABLE_NAME \
    --tempLocation=$MIGRATION_DESTINATION_DIRECTORY/sync-table/dataflow-test/temp \
    --region=$REGION
 

When the sync-table job is finished, open the Dataflow Job details page and review the Custom counters section. If the import operation successfully imported all of the data, the ranges matched counter has a value and the ranges not matched counter is 0.

If ranges not matched shows a value, open the Logs page, select Worker Logs, and filter by Mismatch on range. The machine-readable output of these logs is stored in Cloud Storage at the output destination that you specify in the sync-table outputPrefix option.
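To inspect that machine-readable output, you can list and read the files under the prefix with gsutil. This is a sketch that simply reuses the outputPrefix value from the sync-table command above:

gsutil ls "$MIGRATION_DESTINATION_DIRECTORY/sync-table/output-TABLE_NAME-*"
gsutil cat "$MIGRATION_DESTINATION_DIRECTORY/sync-table/output-TABLE_NAME-*/*"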

Route writes to Bigtable.

Once you have verified the data for each table in the cluster, you can configure your applications to send all of their traffic to Bigtable, and then retire the HBase instance.

On your HBase instance, you can remove the snapshots once the migration is finished.

Replicate from HBase to Bigtable

The open-source Cloud Bigtable HBase Client for Java includes the Cloud Bigtable HBase replication module. Using the HBase replication service, the replication library enables asynchronous data replication from an HBase cluster to a Bigtable instance. Go to the GitHub repository to look over the README and the source code.

Use cases

● Online migration to Bigtable - You can migrate from HBase to Bigtable with no downtime by using the Cloud Bigtable HBase replication library in conjunction with an offline migration of your existing HBase data.

● Replication of your HBase data to an offsite Bigtable instance can help you retrieve your data in the event of an emergency.

● Using a single Bigtable instance that automatically manages replication among its clusters, you can centralize datasets by using the library to replicate data from HBase clusters located in various places.

● Replicate to a Bigtable instance with clusters outside of your present HBase locations to expand your HBase footprint.

Migrate to Bigtable

You may switch to Bigtable without interrupting your application, thanks to the Bigtable HBase replication library.


The process for online migration from HBase to Bigtable can be summarised as follows. For further information, see the README.

Follow the setup and configuration instructions before you start.

  1. On your HBase cluster, enable replication.
  2. Add a peer for a Bigtable replication endpoint.
  3. Disable the Bigtable peer. From this point on, the HBase cluster buffers new writes to HBase (see the HBase shell sketch after this list).
  4. Once buffering of new writes has begun, follow the offline migration guide to move a snapshot of your existing HBase data.
  5. When the offline migration is finished, re-enable the Bigtable peer so that the buffered writes are drained and replayed.
  6. After the buffer has drained, restart your application so that it sends requests to Bigtable.
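Steps 3 and 5 can be performed from the HBase shell with the standard replication peer commands. This is a sketch; PEER_ID_NUMBER is the same ID you use when adding the peer later in this guide:

# Step 3: pause replication to Bigtable; HBase buffers new writes for the peer
disable_peer 'PEER_ID_NUMBER'

# Step 5: resume replication after the offline migration; buffered writes are replayed
enable_peer 'PEER_ID_NUMBER'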

Set up and configure the replication library

Before you can use the Bigtable HBase replication library, you must complete the tasks in this section.

Configure authentication

Follow the instructions in Creating a service account to ensure the replication library has authorization to write to Bigtable.
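For example, here is a minimal sketch of creating a service account, granting it write access to Bigtable, and downloading a JSON key with gcloud. The account name, role, and key path are illustrative, not prescribed by the replication library:

gcloud iam service-accounts create hbase-replication

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:hbase-replication@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigtable.user"

gcloud iam service-accounts keys create /path/to/key.json \
    --iam-account="hbase-replication@PROJECT_ID.iam.gserviceaccount.com"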

The next step is to include the following in your hbase-site.xml file for the entire HBase cluster.

<property>
    <name>google.bigtable.auth.json.keyfile</name>
    <value>JSON_FILE_PATH</value>
    <description>
        Service account JSON file to connect to Cloud Bigtable
    </description>
</property>

Create a destination instance and tables

To replicate from HBase to Bigtable, you must first create a Bigtable instance. A Bigtable instance can contain one or more clusters, all of which act as primary clusters. The HBase replication library sends requests to the nearest cluster of the Bigtable instance, and the data is then replicated to the other clusters.

Your Bigtable destination table must have the same name and column families as your HBase table. To create a table with the same schema as your HBase table using the Bigtable Schema Translation tool, see Create destination tables for instructions. The steps are the same whether you are replicating or importing your data.
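If you prefer to create the destination table by hand rather than with the Schema Translation tool, you can do it with cbt. This is a sketch assuming a hypothetical HBase table named my-hbase-table with column families cf1 and cf2:

cbt -project PROJECT_ID -instance INSTANCE_ID createtable my-hbase-table "families=cf1,cf2"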

Set the config properties

For the complete HBase cluster, add the following to your hbase-site.xml.

<property>
    <name>google.bigtable.project.id</name>
    <value>PROJECT_ID</value>
    <description>
        Cloud Bigtable project ID
    </description>
</property>
<property>
    <name>google.bigtable.instance.id</name>
    <value>INSTANCE_ID</value>
    <description>
        Cloud Bigtable instance ID
    </description>
</property>
<property>
    <name>google.bigtable.app_profile.id</name>
    <value>APP_PROFILE_ID</value>
    <description>
        Cloud Bigtable app profile ID
    </description>
</property>

Install the replication library

To use the Bigtable HBase replication library, you must install it on every server in your HBase cluster. Use the replication library version that matches the version of HBase you are running (1.x or 2.x).

Run the following command on each HBase server to download the replication library:

wget BIGTABLE_HBASE_REPLICATION_URL
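The downloaded jar then needs to be visible to HBase. One common way is to add it to the HBase classpath in hbase-env.sh; this is a sketch, and the jar name and path are placeholders that depend on your HBase version and distribution:

# In conf/hbase-env.sh on every HBase server
export HBASE_CLASSPATH="$HBASE_CLASSPATH:/path/to/bigtable-hbase-replication.jar"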

Add a Bigtable peer

You need to add a Bigtable endpoint as a replication peer in order to replicate from HBase to Bigtable.

  1. To ensure that the replication library is loaded, restart the HBase servers.
  2. Launch the HBase shell and execute the following.
add_peer PEER_ID_NUMBER, ENDPOINT_CLASSNAME =>
'com.google.cloud.bigtable.hbaseHBASE_VERSION_NUMBER_x.replication.HbaseToCloudBigtableReplicationEndpoint'
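Note that HBase only ships edits for column families whose replication scope is enabled. If you have not already enabled it, you can do so in the HBase shell; this is a sketch with placeholder table and column family names, and depending on your HBase version you may need to disable the table before altering it:

alter 'TABLE_NAME', {NAME => 'COLUMN_FAMILY_NAME', REPLICATION_SCOPE => 1}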

Upload your CSV

You can use the provided example or a CSV file of your own.

Remove and store the headers

The data import method used in this tutorial does not handle headers automatically. Before uploading your file, make a copy of the comma-separated list of headers, and remove the header row from the CSV if you don't want it imported into your table.
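For example, here is a quick way to capture the header row and strip it from the file before uploading, assuming a local file named sample.csv:

# Save the header row for later use as the headers value
head -n 1 sample.csv > headers.txt

# Write a copy of the file without the header row
tail -n +2 sample.csv > sample-noheader.csv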

Prepare your Cloud Bigtable table for data import.

To create a Cloud Bigtable instance and install the Cloud Bigtable command-line tool, follow the instructions in the cbt quickstart. You can use an existing instance if you prefer.

1. Set up a table:

cbt createtable my-table

2. In your table, create the csv column family:

cbt createfamily my-table csv

The Dataflow job inserts data into the csv column family.

3. Verify that the table and column family were created:

cbt ls my-table

The following should appear as the output:

Family Name  GC Policy
-----------  ---------
csv          [default]

Run the Dataflow job

Dataflow is a fully managed, serverless service for both stream (real-time) and batch (historical) data transformation and enrichment. In this tutorial, Dataflow is used as a quick way to process the CSV in parallel and perform large-scale writes to the table. Additionally, costs stay low because you only pay for what you use.

Clone the repository

Clone the following repository, then change to the directory containing the code for this tutorial:

git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git
cd cloud-bigtable-examples/java/dataflow-connector-examples

Start the Dataflow job

mvn package exec:exec -DCsvImport -Dbigtable.projectID=YOUR_PROJECT_ID -Dbigtable.instanceID=YOUR_INSTANCE_ID \
-DinputFile="YOUR_FILE" -Dbigtable.table="YOUR_TABLE_ID" -Dheaders="YOUR_HEADERS"

An example command looks like this:

mvn package exec:exec -DCsvImport -Dbigtable.projectID=YOUR_PROJECT_ID -Dbigtable.instanceID=YOUR_INSTANCE_ID \
-DinputFile="gs://YOUR_BUCKET/sample.csv" -Dbigtable.table="my-table" -Dheaders="rowkey,a,b"
 

The row key is always the first column.

If you encounter an error message that reads "Unable to retrieve application default credentials," you may need to set up application default credentials as described here. If you set up a custom service account, assign it the appropriate roles; for testing purposes, use Bigtable Administrator, Dataflow Admin, and Storage Admin.
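One simple way to provide application default credentials on a development machine is with gcloud; this is a sketch, and in production you would typically rely on an attached service account instead:

gcloud auth application-default login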

Monitor your job

Check the status of the newly created job in the Dataflow console to see whether it has any errors.

Verify your data was inserted

Run the following command to view the data for your Cloud Bigtable table's first five rows (ordered lexicographically by row key), and then confirm that the output corresponds to the data in the CSV file:

cbt read my-table count=5

The output should show the first five rows of the table, with values that match your CSV data.

Frequently Asked Questions

Which feature is supported by Cloud Bigtable?

Bigtable enables high read and write throughput at low latency for quick access to massive volumes of data and is perfect for storing very large amounts of data in a key-value store.

What is the difference between BigQuery and Bigtable?

Bigtable is a wide-column NoSQL database designed for high read and write volumes. For vast amounts of structured relational data, on the other hand, BigQuery functions as an enterprise data warehouse.

Why did Google create Bigtable?

Bigtable was created to enable applications needing tremendous scalability; the technology was meant to be utilized with petabytes of data from the beginning.

What language is Bigtable in?

Bigtable client libraries are available in several languages, including Go, Python, Java, C++, and Ruby.

Is the Bigtable column based?

Bigtable is a row-oriented database: all of the data for a single row is stored together, organized first by column family and then by column.

Conclusion

This blog has extensively discussed the Advanced Concepts of Cloud Bigtable with Migration Concepts, Migrating Data from HBase to Cloud Bigtable, and Setting up and configuring the replication library.

We hope this blog has helped you learn about the migration concepts in Cloud Bigtable. If you want to learn more, check out the excellent content on the Coding Ninjas Website:

Overview of cloud Bigtable
Overview of cloud billing concepts

Refer to our guided paths on the Coding Ninjas Studio platform to learn more about DSA, DBMS, Competitive Programming, Python, Java, JavaScript, etc.

Refer to the links: problems, top 100 SQL problems, resources, and mock tests to enhance your knowledge.

For placement preparations, visit interview experiences and interview bundle.


Do upvote our blog to help other ninjas grow. Happy Coding!
