When working with large data, we have to store it in tables. But what do you do when a table grows to billions of rows and thousands of columns? For that, we have Cloud Bigtable.
Cloud Bigtable lets you store terabytes or even petabytes of data in a single large table containing billions of rows and thousands of columns.
In this blog, you'll learn the Advanced Concepts of Cloud Bigtable along with Migration Concepts.
Let’s dive into the topic to explore more.
Import the HBase data into Bigtable using Dataflow
Once you have a table ready to receive your data, you can import and validate it.
Uncompressed tables
If your HBase tables are not compressed, run the command below for each table that you want to migrate:
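The command is provided by the Bigtable HBase migration tooling. The sketch below is illustrative: the environment variables are placeholders for your own values, and flag names can vary between tool versions, so verify them against the official migration guide.
# Import job sketch; verify flag names against the migration guide for your tool version.
java -jar bigtable-hbase-1.x-tools-$TOOLS_VERSION-jar-with-dependencies.jar importsnapshot \
    --runner=DataflowRunner \
    --project=$PROJECT_ID \
    --bigtableInstanceId=$INSTANCE_ID \
    --bigtableTableName=$TABLE_NAME \
    --hbaseSnapshotSourceDir=$MIGRATION_SOURCE_DIRECTORY/data \
    --snapshotName=$SNAPSHOT_NAME \
    --stagingLocation=$MIGRATION_DESTINATION_DIRECTORY/staging \
    --tempLocation=$MIGRATION_DESTINATION_DIRECTORY/temp \
    --maxNumWorkers=$MAX_NUM_WORKERS \
    --region=$REGION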
The command restores the HBase snapshot to your Cloud Storage bucket and then starts the import job. Depending on the size of the snapshot, the restore process may take several minutes to complete.
Keep the following advice in mind when importing:
● Be sure to set maxNumWorkers to speed up data loading. This setting makes it more likely that the import job has enough computing power to finish in a reasonable amount of time without consuming too many of the Bigtable instance's resources.
● If the Bigtable instance is not being used for another workload, set maxNumWorkers to the number of nodes in your Bigtable instance multiplied by 3.
● If you are importing data from HBase and using the instance for another workload at the same time, reduce maxNumWorkers accordingly.
● Use the standard worker type.
Monitor the CPU usage of the Bigtable instance throughout the import. If the instance's overall CPU usage is too high, you might need to add more nodes. It can take up to 20 minutes for the cluster to start delivering the performance benefits of the extra nodes.
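For example, assuming a cluster named my-cluster in an instance named my-instance (both illustrative names), you can add nodes with the gcloud CLI roughly as follows:
# Scale the cluster serving the import to six nodes (names are placeholders).
gcloud bigtable clusters update my-cluster \
    --instance=my-instance \
    --num-nodes=6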
See Monitoring a Bigtable instance for more details on monitoring the Bigtable instance.
Snappy compressed tables
If your HBase tables are compressed with Snappy, add the following flag to the import command:
--enableSnappy=true
Validate the imported data in Bigtable
You must execute the sync-table job in order to verify the imported data. The sync-table job computes hashes for Bigtable row ranges and compares them to the hashtable output you previously computed.
Run the following commands in the command shell to launch the sync-table job:
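The exact flags depend on the version of the migration tooling, so treat the following as an illustrative sketch and check the official migration guide for the authoritative invocation; the directory, project, and instance values are placeholders.
# sync-table job sketch; verify flag names against the migration guide for your tool version.
java -jar bigtable-hbase-1.x-tools-$TOOLS_VERSION-jar-with-dependencies.jar sync-table \
    --sourceHashDir=$MIGRATION_DESTINATION_DIRECTORY/hashtable \
    --targetBigtableProject=$PROJECT_ID \
    --targetBigtableInstance=$INSTANCE_ID \
    --targetBigtableTable=$TABLE_NAME \
    --outputPrefix=$MIGRATION_DESTINATION_DIRECTORY/sync-table-output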
When the sync-table job is finished, open the Dataflow Job details page and review the Custom counters section. If the import job successfully imported all of the data, ranges_matched has a value and ranges_not_matched is 0.
If ranges_not_matched shows a value, open the Logs page, choose Worker Logs, and filter by Mismatch on range. The machine-readable output of these logs is stored in Cloud Storage at the output destination that you specify in the sync-table outputPrefix option.
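Assuming you passed something like gs://my-bucket/sync-table-output as the outputPrefix (an illustrative value), you can inspect that output with gsutil:
# List and read the machine-readable mismatch output (bucket and prefix are placeholders).
gsutil ls gs://my-bucket/sync-table-output*
gsutil cat "gs://my-bucket/sync-table-output*"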
Route writes to Bigtable
You can set up your applications to send all of their traffic to Bigtable once you have verified the data for each table in the cluster, at which point you can deprecate the HBase instance.
On your HBase instance, you can remove the snapshots once the migration is finished.
Replicate from HBase to Bigtable
The open-source Cloud Bigtable HBase Client for Java includes the Cloud Bigtable HBase replication module. Using the HBase replication service, the replication library enables asynchronous data replication from an HBase cluster to a Bigtable instance. Go to the GitHub repository to look over the README and the source code.
Use cases
● Online migration to Bigtable - You can migrate from HBase to Bigtable with no downtime by using the Cloud Bigtable HBase replication library in conjunction with an offline migration of your existing HBase data.
● Disaster recovery - Replicating your HBase data to an offsite Bigtable instance can help you recover your data in the event of an emergency.
● Centralized datasets - You can centralize datasets by using the library to replicate data from HBase clusters located in various places to a single Bigtable instance, which automatically manages replication among its clusters.
● Footprint expansion - Replicate to a Bigtable instance with clusters outside of your present HBase locations to expand your HBase footprint.
Migrate to Bigtable
You may switch to Bigtable without interrupting your application, thanks to the Bigtable HBase replication library.
The process for online migration from HBase to Bigtable can be summarised as follows. For further information, see the README.
1. Before you start, follow the setup and configuration instructions.
2. Enable replication on your HBase cluster (a shell sketch covering this step and the peer steps follows this list).
3. Add a peer for the Bigtable replication endpoint.
4. Disable the Bigtable peer. From this point on, the HBase cluster buffers new writes for that peer.
5. Once buffering of new writes has begun, follow the offline migration guide to move a snapshot of your existing HBase data.
6. When the offline migration is finished, re-enable the Bigtable peer so that the buffered writes are drained and replayed to Bigtable.
7. After the buffer has drained, restart your application so that it submits its requests to Bigtable.
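As a rough sketch of steps 2, 4, and 6, assuming a table named my-table with column family cf and a peer ID of '1' (all illustrative names), the HBase shell commands look something like this; adding the peer itself is covered in the Add a Bigtable peer section below.
# Step 2: enable replication for each column family you want to replicate.
alter 'my-table', {NAME => 'cf', REPLICATION_SCOPE => 1}
# Step 4: disable the Bigtable peer so HBase buffers new writes destined for it.
disable_peer '1'
# Step 5: run the offline snapshot migration while the peer is disabled.
# Step 6: re-enable the peer so the buffered writes are replayed to Bigtable.
enable_peer '1'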
Set up and configure the replication library
The tasks in this section must be completed before you can use the Bigtable HBase replication library.
Configure authentication
Follow the instructions in Creating a service account to ensure the replication library has authorization to write to Bigtable.
The next step is to include the following in your hbase-site.xml file for the entire HBase cluster.
<property>
  <name>google.bigtable.auth.json.keyfile</name>
  <value>JSON_FILE_PATH</value>
  <description>
    Service account JSON file to connect to Cloud Bigtable
  </description>
</property>
Create a destination instance and tables
To replicate from HBase to Bigtable, you must first create a Bigtable instance. An instance can contain a single cluster or multiple clusters that act as primary clusters. The HBase replication service sends requests to the nearest cluster in the Bigtable instance, and the data is then replicated to the other clusters.
Your Bigtable destination table's name and column families must match those of your HBase table. To create a table with the same schema as your HBase table using the Bigtable Schema Translation tool, see Create destination tables for instructions. The steps are the same whether you are replicating or importing your data.
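If your schema is simple, you can also create the destination table manually with the cbt tool instead of using the Schema Translation tool. For example, assuming an HBase table named my-table with column families cf1 and cf2 (illustrative names):
# Create a destination table whose column families mirror the HBase table.
cbt createtable my-table families=cf1,cf2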
Set the config properties
For the complete HBase cluster, add the following to your hbase-site.xml.
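At minimum, the replication library needs to know which Bigtable project and instance to write to. A minimal sketch of those properties looks like this, with the values as placeholders:
<property>
  <name>google.bigtable.project.id</name>
  <value>PROJECT_ID</value>
</property>
<property>
  <name>google.bigtable.instance.id</name>
  <value>INSTANCE_ID</value>
</property>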
To use the Bigtable HBase replication library, you must install it on each HBase cluster server. Use the replication library version that matches the version of HBase you are running (1.x or 2.x).
Run the following command in your shell to download the replication library:
wget BIGTABLE_HBASE_REPLICATION_URL
Add a Bigtable peer
You need to add a Bigtable endpoint as a replication peer in order to replicate from HBase to Bigtable.
To ensure that the replication library is loaded, restart the HBase servers.
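In the HBase shell, the peer is added with a replication endpoint class provided by the replication library. The class name below follows the library's naming pattern but may differ between the 1.x and 2.x versions, so treat it as an assumption and confirm it in the README; the peer ID is illustrative.
# Add Bigtable as replication peer '1' (verify the endpoint class name in the README).
add_peer '1', ENDPOINT_CLASSNAME => 'com.google.cloud.bigtable.hbase1_x.replication.HbaseToCloudBigtableReplicationEndpoint'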
Import a CSV file into Bigtable
The remaining steps show how to import a CSV file into a Bigtable table using Dataflow. You can use the provided example file or a CSV file of your own.
Remove and store the headers
The data import technique described in this tutorial does not handle headers automatically. Before uploading your file, make a copy of the comma-separated list of headers, and remove the header row from the CSV if you don't want it imported into your table.
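For example, assuming your file is named sample.csv (an illustrative name), you can save the header row and strip it with standard shell tools:
# Save the comma-separated header list, then drop the header row from the CSV.
head -n 1 sample.csv > headers.txt
tail -n +2 sample.csv > sample-no-headers.csv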
Prepare your Cloud Bigtable table for data import
To create a Cloud Bigtable instance and install the cbt command-line tool, follow the instructions in the cbt quickstart. If you prefer, you can use an existing instance.
1. Set up a table:
cbt createtable my-table
2. In your table, create the csv column family:
cbt createfamily my-table csv
The Dataflow job inserts data into the csv column family.
3. Verify that the table and column family were created:
cbt ls my-table
The following should appear as the output:
Family Name GC Policy
----------- ---------
csv [default]
Run the Dataflow job
Dataflow is a fully managed, serverless service for transforming and enriching data in both stream (real-time) and batch (historical) modes. In this tutorial, Dataflow is used as a quick way to process the CSV in parallel and perform large-scale writes to the table. Costs also stay low because you only pay for what you use.
Clone the repository
Clone the following repository and change to the directory containing the code for this tutorial:
git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git
cd cloud-bigtable-examples/java/dataflow-connector-examples
If you encounter an error message that reads "Unable to retrieve application default credentials," you may need to set up application default credentials as described here. If you set up a custom service account, assign it the appropriate roles; for testing purposes, you can use Bigtable Administrator, Dataflow Admin, and Storage Admin.
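The job itself is launched with Maven from the dataflow-connector-examples directory. The exact profile and property names are defined by that example's README and may change, so treat the following as an illustrative sketch; the project, instance, bucket, and header values are placeholders.
# Launch the CsvImport Dataflow pipeline (verify property names against the example's README).
mvn package exec:exec -DCsvImport \
    -Dbigtable.projectID=YOUR_PROJECT_ID \
    -Dbigtable.instanceID=YOUR_INSTANCE_ID \
    -Dbigtable.table=my-table \
    -DinputFile=gs://YOUR_BUCKET/sample-no-headers.csv \
    -Dheaders="rowkey,col1,col2"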
Monitor your job
After the job launches, check its status in the Dataflow console to see whether it reports any errors.
Verify your data was inserted
Run the following command to view the data for your Cloud Bigtable table's first five rows (ordered lexicographically by row key), and then confirm that the output corresponds to the data in the CSV file:
cbt read my-table count=5
Expect a result resembling the following:
Frequently Asked Questions
Which features does Cloud Bigtable support?
Bigtable enables high read and write throughput at low latency for quick access to massive volumes of data and is perfect for storing very large amounts of data in a key-value store.
What is the difference between BigQuery and Bigtable?
Bigtable is a wide-column NoSQL database designed for high read and write volumes. For vast amounts of structured relational data, on the other hand, BigQuery functions as an enterprise data warehouse.
Why did Google create Bigtable?
Bigtable was created to enable applications needing tremendous scalability; the technology was meant to be utilized with petabytes of data from the beginning.
What language is Bigtable in?
Bigtable provides client libraries in several languages, including Go, Python, Java, C++, and Ruby.
Is the Bigtable column based?
Bigtable is a row-oriented database, so all of the data for a single row is stored together, organized first by column family and then by column.
Conclusion
This blog has extensively discussed the Advanced Concepts of Cloud Bigtable with Migration Concepts, Migrating Data from HBase to Cloud Bigtable, and Setting up and configuring the replication library.
We hope this blog has helped you learn about the Migration Concepts in Cloud Bigtable. If you want to learn more, check out the excellent content on the Coding Ninjas Website: