Table of contents
1. Introduction
2. Garbage Collection Tuning
3. Memstore-Local Allocation Buffer
4. HBase Compression
4.1. Available HBase Codecs
4.2. Verifying Installation
4.3. Enabling Compression
5. HBase Configuration
6. Load Tests in HBase Performance Tuning
6.1. HBase Performance Evaluation
6.2. YCSB (Yahoo! Cloud Serving Benchmark)
7. FAQs
8. Key Takeaways
Last Updated: Mar 27, 2024

HBase Performance Tuning

Author Sanjana Yadav

Introduction

HBase is a distributed database and a critical component of the Hadoop architecture, so it is worth improving its performance as far as is practical.

HBase exposes many tuning knobs, and knowing how to set them is what lets a cluster run smoothly on autopilot. It is important to understand not only each knob but also how it interacts with the others. To make HBase perform optimally, several settings must be tuned to your use case and workload.

Garbage Collection Tuning

One of the lower-level settings to adjust for the region server processes is the set of garbage collection parameters. The master is not a concern here, because data does not pass through it and it does not handle heavy loads.

For HBase performance tuning, these garbage collection parameters therefore only need to be added to the HBase region servers.

Memstore-Local Allocation Buffer

To address heap fragmentation caused by heavy churn in a region server's memstore instances, HBase 0.90 introduced a mechanism known as Memstore-Local Allocation Buffers (MSLAB).

MSLABs are fixed-size buffers that hold KeyValue instances of varying sizes. When a buffer cannot fully accommodate a newly inserted KeyValue, it is considered full, and a new buffer of the same fixed size is allocated.
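The allocation rule described above can be sketched with a small Python model. This is only an illustration: the class name MemstoreLAB, the 100-byte buffer size, and the choice to reject oversized KeyValues are assumptions made for the example and do not reflect HBase's actual implementation.

```python
class MemstoreLAB:
    """Toy model of a memstore-local allocation buffer (not HBase's real class)."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size   # fixed size of every buffer, in bytes
        self.buffers = [[]]              # each buffer holds (offset, length) slots
        self.used = 0                    # bytes used in the current buffer

    def allocate(self, kv_size):
        """Place a KeyValue of kv_size bytes; return (buffer_index, offset)."""
        if kv_size > self.buffer_size:
            # Simplification for this sketch: oversized KeyValues are rejected.
            raise ValueError("KeyValue larger than one buffer")
        # The current buffer counts as full once the new KeyValue does not fit.
        if self.used + kv_size > self.buffer_size:
            self.buffers.append([])
            self.used = 0
        offset = self.used
        self.buffers[-1].append((offset, kv_size))
        self.used += kv_size
        return len(self.buffers) - 1, offset


lab = MemstoreLAB(buffer_size=100)
placements = [lab.allocate(size) for size in (40, 50, 30, 70, 20)]
print(placements)        # [(0, 0), (0, 40), (1, 0), (1, 30), (2, 0)]
print(len(lab.buffers))  # 3
```

Because every buffer has the same size, memory is released in uniform chunks when a memstore is flushed, which is what limits fragmentation of the heap.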

HBase Compression

HBase supports a variety of compression algorithms, and compression is enabled at the column-family level.

In most use cases, compression improves overall performance, because the CPU overhead of compressing and decompressing data is lower than the cost of reading more data from disk.
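To get a feel for how much less data compression leaves to read, the following Python sketch compresses some made-up, repetitive row data with zlib (DEFLATE, the algorithm behind gzip). The sample rows are hypothetical and the script does not touch HBase; actual ratios depend entirely on your data.

```python
import zlib

# Hypothetical sample: repetitive rows, as is common for structured cell values.
rows = [f"row-{i:06d},status=OK,region=eu-west".encode() for i in range(10_000)]
raw = b"\n".join(rows)

compressed = zlib.compress(raw, 6)   # level 6 is zlib's usual default trade-off
ratio = len(compressed) / len(raw)

print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {len(compressed)}")
print(f"ratio:            {ratio:.2f}")

# Decompression must give back exactly the original bytes.
restored = zlib.decompress(compressed)
```

The smaller the ratio, the fewer bytes have to come off disk for the same rows, which is where the performance gain comes from.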

Available HBase Codecs

HBase offers a fixed set of supported compression algorithms to choose from. They differ in compression ratio, CPU cost, and installation requirements.

Verifying Installation

After installing a supported compression library, we should check that the installation was successful. HBase provides several mechanisms for doing so.

HBase Compression test tool
HBase includes a tool for checking whether compression is set up correctly. It is invoked as follows:

./bin/hbase org.apache.hadoop.hbase.util.CompressionTest

Run without arguments, it prints its usage information:

$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest
Usage: CompressionTest <path> none|gz|lzo|snappy

Enabling Compression

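Enabling compression requires the JNI and native compression libraries to be installed. Once they are in place, a compression format can be specified for a column family when the table is created in the HBase shell: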

hbase(main):001:0> create 'testtable', { NAME => 'colfam1', COMPRESSION => 'GZ' }
0 row(s) in 1.1920 seconds
hbase(main):012:0> describe 'testtable'
DESCRIPTION ENABLED
{NAME => 'testtable', FAMILIES => [{NAME => 'colfam1', true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS
=> '3', COMPRESSION => 'GZ', TTL => '2147483647', BLOCKSIZE
=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0400 seconds

The describe shell command is used here to read back the schema of the newly created table, which shows that compression is set to GZ (gzip). For existing tables, use the alter command to enable, change, or disable the compression format.

To disable compression for a given column family, change its compression format to NONE.

Load Balancing

  • The master includes a built-in feature known as the balancer. By default, the balancer runs every five minutes, and its interval is configured with the hbase.balancer.period property.
  • When it runs, it tries to equalize the number of regions assigned to each region server, so that every server is within one region of the average number per server.
  • The call first computes a new assignment plan, which describes which regions should be moved where. The regions are then moved by iteratively invoking the administrative API's unassign() function.
  • The balancer also has an upper limit on how long it may run. This limit is configured with the hbase.balancer.max.balancing property and defaults to half of the balancer period, or two and a half minutes.
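The goal stated in the list above, every server within one region of the average, can be illustrated with a short Python sketch. The greedy strategy, the function names, and the server names rs1 to rs3 are assumptions made for the example; this is not the algorithm HBase's balancer actually uses.

```python
import math

BALANCER_PERIOD_MS = 5 * 60 * 1000           # default period: five minutes
MAX_BALANCING_MS = BALANCER_PERIOD_MS // 2   # default limit: half the period


def balance_bounds(regions_per_server):
    """Return the (floor, ceiling) region counts that count as balanced."""
    average = sum(regions_per_server.values()) / len(regions_per_server)
    return math.floor(average), math.ceil(average)


def plan_moves(regions_per_server):
    """Greedy sketch: move regions from the busiest to the idlest server."""
    low, high = balance_bounds(regions_per_server)
    load = dict(regions_per_server)
    moves = []
    while True:
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        # Stop once every server is within one region of the average.
        if load[busiest] <= high and load[idlest] >= low:
            break
        load[busiest] -= 1
        load[idlest] += 1
        moves.append((busiest, idlest))
    return moves, load


moves, final_load = plan_moves({"rs1": 10, "rs2": 2, "rs3": 3})
print(len(moves))        # 5
print(final_load)        # {'rs1': 5, 'rs2': 5, 'rs3': 5}
print(MAX_BALANCING_MS)  # 150000
```

Each entry in moves corresponds to one region being unassigned from the first server so that it can be reassigned to the second.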

Merging Regions

Regions far more commonly split automatically as data is added to the associated table, but from time to time we may need to merge regions instead.

For example, after removing a large amount of data we may wish to reduce the number of regions hosted by each server. HBase ships with a tool that merges two adjacent regions, as long as the cluster is not online.

Running the command-line tool without arguments prints its usage information:

$ ./bin/hbase org.apache.hadoop.hbase.util.Merge
Usage: bin/hbase merge <table-name> <region-1> <region-2>

 

Client API: Best Practices
A few adjustments are worth considering to get the best performance when reading or writing data from a client through the API.

Disable auto-flush
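When performing a large number of put operations, set the auto-flush feature of HTable to false by calling setAutoFlush(false). The puts are then collected in a client-side write buffer and sent in batches once the buffer fills up, rather than one at a time.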
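The effect of buffering can be shown with a toy Python model that counts simulated round trips. The BufferedWriter class, the 1000-byte buffer limit, and the idea that one flush equals one round trip are all simplifying assumptions for illustration; this is not the HBase client API.

```python
class BufferedWriter:
    """Toy client-side write buffer; not the HBase client API."""

    def __init__(self, auto_flush, buffer_limit_bytes):
        self.auto_flush = auto_flush
        self.buffer_limit_bytes = buffer_limit_bytes
        self.buffer = []
        self.buffered_bytes = 0
        self.rpc_count = 0          # number of simulated round trips

    def put(self, row, value):
        self.buffer.append((row, value))
        self.buffered_bytes += len(row) + len(value)
        # With auto-flush on, every put is sent immediately.
        if self.auto_flush or self.buffered_bytes >= self.buffer_limit_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.rpc_count += 1     # assumed: one round trip carries the batch
            self.buffer.clear()
            self.buffered_bytes = 0


def count_rpcs(auto_flush):
    writer = BufferedWriter(auto_flush, buffer_limit_bytes=1000)
    for i in range(1000):
        # 8-byte row key plus 92-byte value: 100 bytes per put
        writer.put(f"row-{i:04d}".encode(), b"x" * 92)
    writer.flush()                  # send whatever is still buffered
    return writer.rpc_count


print(count_rpcs(auto_flush=True))   # 1000
print(count_rpcs(auto_flush=False))  # 100
```

The final explicit flush matters in practice as well: with auto-flush disabled, buffered puts are not sent until the buffer fills or the client flushes.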

Limit scan scope
When using a scan to process a large number of rows, be mindful of which attributes you select, and request only the column families and columns you actually need.

Close ResultScanners
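Always close ResultScanner instances once you are done with them. This may not improve performance directly, but it helps avoid performance problems, because an unclosed scanner keeps resources tied up on the region server.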

Block cache usage
Using the setCacheBlocks() method, we can control whether Scan instances use the region server's block cache.

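Two further practices to consider:

  1. Optimal loading of row keys
  2. Turn off WAL on Puts (at the cost of durability: writes not yet flushed to disk are lost if a region server fails)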

HBase Configuration

Many configuration parameters are available in HBase to help us fine-tune our HBase Cluster setup.

  1. Decrease ZooKeeper timeout.
  2. Increase handlers.
  3. Increase heap settings.
  4. Enable data compression.
  5. Increase region size.
  6. Adjust block cache size.
  7. Adjust memstore limits.
  8. Increase blocking store files.
  9. Increase block multiplier.
  10. Decrease maximum logfiles.

Load Tests in HBase Performance Tuning

After deploying our cluster, it is advisable to run HBase performance tests to confirm that it is operational. The results also give us a baseline to refer back to when we change the cluster's configuration or the schemas of our tables.

A burn-in of the cluster shows how much we can get out of it, but it is not a substitute for a test with the load expected from our use case.

HBase Performance Evaluation

HBase comes with its own tool for performance testing, called Performance Evaluation (PE). Running it without any command-line parameters prints its main usage information:

$ ./bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation
Usage: java org.apache.hadoop.hbase.PerformanceEvaluation \
[--miniCluster] [--nomapred] [--rows=ROWS] <command> <nclients>

To run a single evaluation client:

$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 1

YCSB (Yahoo! Cloud Serving Benchmark)

YCSB is a suite of tools for running comparable workloads against different storage systems. Although it was designed to compare systems, it is also useful for an HBase cluster burn-in or performance test.

YCSB installation

Since YCSB is only available from an online repository, we have to build a binary version ourselves. First, clone the repository:

$ git clone http://github.com/brianfrankcooper/YCSB.git
Initialized empty Git repository in /private/tmp/YCSB/.git/
...
Resolving deltas: 100% (475/475), done.

This creates a YCSB directory in the current path. The next step is to change into the newly created directory, copy the libraries needed for HBase, and compile the code:

$ cd YCSB/
$ cp $HBASE_HOME/hbase*.jar db/hbase/lib/
$ cp $HBASE_HOME/lib/*.jar db/hbase/lib/
$ ant
Buildfile: /private/tmp/YCSB/build.xml
...
makejar:
[jar] Building jar: /private/tmp/YCSB/build/ycsb.jar
BUILD SUCCESSFUL
Total time: 1 second
$ ant dbcompile-hbase
...
BUILD SUCCESSFUL
Total time: 1 second

This leaves us with an executable JAR file in the build directory.

Although YCSB can rarely reproduce the exact workload seen in production, it is still useful for testing a varied set of loads on your cluster. Use the included workloads, or develop your own, to emulate cases that are bound to read operations, write operations, or both.

FAQs

  1. How do I make HBase scan faster?
    Well-designed row keys are the most effective way to improve scan performance. HBase stores rows sorted by row key internally, and a scan can be given a start row and a stop row. It is therefore important to design row keys around the most frequently used search criteria.
     
  2. What is an HBase balancer?
    The HBase load balancer runs on the master and distributes regions across region servers. It also ensures, where possible, that region replicas are not co-hosted on the same region servers or racks. It is distinct from the HDFS balancer, which tries to distribute HDFS blocks evenly among DataNodes. HBase relies on compactions to restore data locality after a region split or failure.
     
  3. What is HBase compaction?
    Apache HBase is a distributed data store based on a log-structured merge tree, so read speed is best when there is only one file per store (column family). Since writes keep producing new files, HBase merges HFiles to reduce the maximum number of disk seeks required for a read. This is referred to as compaction.
     
  4. How do I scan an HBase table?
    Use the scan command in the HBase shell with the table name. It accepts optional arguments such as TIMERANGE, FILTER, TIMESTAMP, LIMIT, MAXLENGTH, COLUMNS, CACHE, STARTROW, and STOPROW to restrict what is returned.
     
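The point about row keys and scans in the first FAQ can be illustrated with a small Python sketch. The key layout <user>#<date> and the sample keys are hypothetical, and a sorted Python list stands in for HBase's sorted storage; the sketch only shows why a start row and stop row let a scan skip everything outside the range.

```python
import bisect

# Hypothetical row keys of the form <user>#<date>, kept sorted as HBase does.
row_keys = sorted([
    "alice#2024-01-03", "alice#2024-02-11", "bob#2024-01-07",
    "bob#2024-03-01", "carol#2024-01-15", "carol#2024-02-20",
])


def range_scan(keys, start_row, stop_row):
    """Return keys in [start_row, stop_row), like a scan with start and stop rows."""
    lo = bisect.bisect_left(keys, start_row)
    hi = bisect.bisect_left(keys, stop_row)
    return keys[lo:hi]


# "$" is the character right after "#", so "bob$" sorts after every "bob#..." key.
bob_rows = range_scan(row_keys, "bob#", "bob$")
print(bob_rows)  # ['bob#2024-01-07', 'bob#2024-03-01']
```

Because the most common criterion (the user) leads the key, all matching rows sit next to each other and the scan never has to look at rows for other users.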

Key Takeaways

  • Cheers if you reached here! In this blog, we covered strategies for optimizing performance in an HBase system.
  • We discussed garbage collection tuning, HBase scan performance optimization, and HBase read performance tuning.
  • We also looked at load tests for HBase performance tuning.

That said, learning never ceases, and there is always more to learn. So, keep learning and keep growing, ninjas!
