Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. Performance of Bigtable
2.1. Bigtable capacity
2.2. Replication
3. Monitoring Bigtable
3.1. CPU Utilisation and Disk Usage
3.2. Instance and Cluster Overview
4. Key Visualiser
4.1. Exploring Heatmaps
4.2. Heatmap Patterns
4.3. Diagnostics
5. Frequently Asked Questions
5.1. What is a latency-sensitive application?
5.2. What are the different Warning metrics of Bigtable?
5.3. What key points to consider while testing a Bigtable application’s performance?
6. Conclusion
Last Updated: Mar 27, 2024

Monitoring and Troubleshooting in Cloud Bigtable

Author Yashesvinee V

Introduction

Bigtable provides NoSQL database storage for several Google applications with different workloads, from throughput-oriented batch-processing jobs to serving latency-sensitive data to end users. Its fundamental design goals are broad applicability, scalability, high performance, and availability. To achieve this, Bigtable organises data in tables whose rows are spread across an underlying distributed file system. It is fast, seamlessly scalable and easy to integrate. Let us see how to monitor Bigtable and assess its performance under different conditions.

Performance of Bigtable

Bigtable delivers high performance and scales seamlessly. Over time, it optimises performance by balancing the amount of data stored on each node and by distributing reads and writes evenly across the cluster’s nodes. The approximate throughput of a Bigtable node depends on the type of storage used by the cluster.

  • SSD (Solid State Drive): each node can read or write up to 10,000 rows per second and scan up to 220 MB/s.
     
  • HDD (Hard Disk Drive): each node can read up to 500 rows per second, write up to 10,000 rows per second, and scan up to 180 MB/s.
     

Replication generally has a positive effect on Bigtable’s performance. However, performance can degrade because of several factors:

  • Reading many row keys or ranges in a single read request.
     
  • Rows that contain large amounts of data or many cells.
     
  • Improper design of table schema. 
     
  • Network connection issues.
     
  • Using an out-of-date client library for the application. 
     
  • Insufficient nodes in the cluster.

Bigtable capacity

Performance scales linearly as more nodes are added to a cluster. The capacity of a Bigtable cluster should be chosen by weighing two trade-offs: throughput against latency, and storage usage against performance. Bigtable offers optimal latency when the CPU load of a cluster is below 70%. For latency-sensitive applications, it is recommended to plan at least twice the capacity needed for the application's maximum queries per second. This allows the cluster to run at less than 50% CPU load, offering low latency to front-end services.
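
The capacity guidance above can be turned into simple arithmetic. The sketch below is only an illustration: it assumes throughput scales linearly with node count and reuses the approximate per-node figures quoted earlier in this article. The function name and its parameters are hypothetical and are not part of any Bigtable API.

```python
import math

# Approximate rows per second that one node can serve, taken from the
# figures quoted in this article (not measured values).
PER_NODE_ROWS_PER_SEC = {
    "ssd": {"read": 10_000, "write": 10_000},
    "hdd": {"read": 500, "write": 10_000},
}


def nodes_for_peak_qps(peak_qps, storage="ssd", operation="read", headroom=2.0):
    """Estimate the node count needed to serve peak_qps with spare capacity.

    headroom=2.0 follows the "plan at least twice the capacity" advice,
    which keeps CPU load around 50% or lower if scaling is linear (assumption).
    """
    if peak_qps < 0 or headroom <= 0:
        raise ValueError("peak_qps must be >= 0 and headroom must be > 0")
    per_node = PER_NODE_ROWS_PER_SEC[storage][operation]
    return max(1, math.ceil(peak_qps * headroom / per_node))


print(nodes_for_peak_qps(45_000))                # SSD reads -> 9
print(nodes_for_peak_qps(1_200, storage="hdd"))  # HDD reads -> 5
```

Real workloads rarely match the headline per-node numbers, so an estimate like this is a starting point to be checked against the CPU metrics described later.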

Bigtable optimises storage by distributing data across all the cluster nodes as the data volume increases. The storage usage per node is calculated by dividing the cluster's storage utilisation by the number of nodes in the cluster. For latency-sensitive applications, storage utilisation per node should be kept at 60% or below. As the storage per node increases, more background work is required, so workloads can experience higher query latency and lower throughput even when the cluster has enough nodes to meet its overall CPU needs.
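
The per-node storage calculation described above can be sketched as follows. The per-node capacity is passed in as a parameter because it depends on the storage type; the 5 TiB value in the example is a hypothetical figure chosen for illustration, and the function names are not part of any Bigtable API.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def storage_per_node(cluster_bytes, node_count):
    """Storage usage per node = cluster storage utilisation / number of nodes."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return cluster_bytes / node_count


def within_latency_target(cluster_bytes, node_count, node_capacity_bytes, limit=0.60):
    """Return (utilisation fraction, ok) measured against the 60% guideline."""
    used = storage_per_node(cluster_bytes, node_count) / node_capacity_bytes
    return used, used <= limit


# Hypothetical cluster: 10 TiB of data on 4 nodes, assuming 5 TiB capacity per node.
fraction, ok = within_latency_target(10 * TIB, 4, 5 * TIB)
print(f"{fraction:.0%} of per-node capacity used, within target: {ok}")
```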

Replication

Cloud Bigtable can replicate data across multiple regions and zones to increase availability and durability. Replication also makes it possible to isolate serving applications from batch reads and to store data closer to users around the world. Replication increases read throughput but does not increase write throughput. Without replication, the write throughput of an instance doubles when the number of nodes in its cluster doubles. With replication, however, each piece of data is written twice: once when the write is received, and again when it is replicated to the other cluster.
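
A deliberately simplified model makes the effect on write throughput concrete. It assumes that every cluster must apply every write (either directly or through replication), so the smallest cluster limits the instance, and it ignores replication overhead. The function name and the 10,000 writes per node default are illustrative assumptions.

```python
from typing import Sequence


def instance_write_capacity(nodes_per_cluster: Sequence[int], per_node_writes: int = 10_000) -> int:
    """Rough upper bound on application writes per second for an instance.

    Simplified model: every cluster has to apply every write, once directly
    or once through replication, so the smallest cluster is the limit.
    """
    if not nodes_per_cluster or min(nodes_per_cluster) <= 0:
        raise ValueError("each cluster needs at least one node")
    return min(nodes_per_cluster) * per_node_writes


single = instance_write_capacity([3])         # one 3-node cluster
doubled = instance_write_capacity([6])        # the same cluster with twice the nodes
replicated = instance_write_capacity([3, 3])  # two replicated 3-node clusters

print(single, doubled, replicated)  # 30000 60000 30000
```

Doubling the nodes in a single cluster doubles the estimate, while adding a second replicated cluster of the same size leaves it unchanged.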

Monitoring Bigtable

Monitoring gives a detailed view of the usage of Bigtable. This can be done from the following places in the console.

  • Bigtable cluster overview
  • Bigtable table overview
  • Key Visualizer
  • Bigtable monitoring
  • Bigtable instance overview
  • Google Cloud's operations suite 
  • Cloud Monitoring
     

Bigtable can also be monitored programmatically using the Cloud Monitoring API. Users can track performance over time by breaking down metrics by resource and charting them for a chosen period. Cloud Monitoring imports usage metrics from Bigtable, and these can be viewed in the Metrics Explorer on the Resources page under Monitoring.
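
Queries against the Cloud Monitoring API select time series with a filter string. The sketch below only builds such a string and does not call the API. The metric type names and the `instance` and `cluster` resource labels are assumptions that should be checked against the Cloud Monitoring metrics list for Bigtable, and `build_filter` is a hypothetical helper.

```python
# Metric type names are assumptions; verify them against the Cloud
# Monitoring metrics list for Bigtable before relying on them.
BIGTABLE_METRICS = {
    "cpu_load": "bigtable.googleapis.com/cluster/cpu_load",
    "cpu_load_hottest_node": "bigtable.googleapis.com/cluster/cpu_load_hottest_node",
    "storage_utilization": "bigtable.googleapis.com/cluster/storage_utilization",
}


def build_filter(metric_key, instance_id=None, cluster_id=None):
    """Build a Cloud Monitoring filter string for one Bigtable metric."""
    clauses = [f'metric.type = "{BIGTABLE_METRICS[metric_key]}"']
    if instance_id:
        # Resource label names are assumed, not confirmed by this article.
        clauses.append(f'resource.labels.instance = "{instance_id}"')
    if cluster_id:
        clauses.append(f'resource.labels.cluster = "{cluster_id}"')
    return " AND ".join(clauses)


print(build_filter("cpu_load", instance_id="my-instance"))
```

A string like this would typically be passed as the filter of a time-series listing request, together with the project name and a time interval.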

CPU Utilisation and Disk Usage

Nodes in a cluster perform various operations such as reads, writes and administrative tasks, all of which require CPU resources. The metrics for CPU utilisation are:

  • Average CPU Utilisation of all the cluster nodes.
     
  • CPU Utilisation of the hottest node: The hottest node is the busiest node in the cluster. Which node is the hottest can change frequently.
     
  • CPU Utilisation by method, table and app profile. 
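
The difference between the average and the hottest-node metric is easiest to see with numbers. Bigtable computes these metrics itself; the sketch below only illustrates the idea on hypothetical per-node readings, and `cpu_summary` is not a Bigtable function.

```python
from typing import Dict, Tuple


def cpu_summary(per_node_cpu: Dict[str, float]) -> Tuple[float, str, float]:
    """Return (average CPU, name of hottest node, CPU of hottest node)."""
    if not per_node_cpu:
        raise ValueError("at least one node reading is required")
    average = sum(per_node_cpu.values()) / len(per_node_cpu)
    hottest = max(per_node_cpu, key=per_node_cpu.get)
    return average, hottest, per_node_cpu[hottest]


# Hypothetical readings, expressed as fractions of full CPU load.
average, hottest, hottest_cpu = cpu_summary({"node-a": 0.35, "node-b": 0.90, "node-c": 0.40})
print(f"average={average:.2f}, hottest={hottest} at {hottest_cpu:.2f}")
```

Here the average looks healthy, yet one node is close to saturation, which is the kind of imbalance that the hottest-node metric is meant to reveal.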
     

Bigtable measures disk usage in binary units: binary gigabytes, also known as gibibytes (GiB), where 1 GiB is 2^30 bytes. Storage metrics reflect the data on disk as of the last time they were computed. The metrics used are:

  • Storage Utilisation in bytes and as a percentage of the storage capacity used.
     
  • Disk Load is reported only for HDD clusters. It shows the percentage of the maximum possible HDD bandwidth that the cluster is using.
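
Because the storage metrics use binary units, a byte count converts to a smaller number of GiB than of decimal gigabytes. A minimal sketch of the conversion, using hypothetical helper names:

```python
GIB = 2 ** 30  # bytes in one gibibyte (binary gigabyte)


def bytes_to_gib(byte_count):
    """Convert a byte count to gibibytes."""
    return byte_count / GIB


def storage_percent(used_bytes, capacity_bytes):
    """Storage utilisation as a percentage of capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * used_bytes / capacity_bytes


# 500 decimal gigabytes (500 * 10**9 bytes) expressed in GiB.
print(f"{bytes_to_gib(500 * 10**9):.2f} GiB")  # 465.66 GiB
```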

Instance and Cluster Overview

The Instance overview page displays the metrics of every cluster in real-time. Some of the key metrics are:

  • Average CPU Utilisation
     
  • Rows read
     
  • Rows written
     
  • Read throughput - It shows the amount of response data sent per second.
     
  • Replication latency for input
     
  • Replication latency for output
     
  • System error rate - It displays the percentage of requests that failed on the Bigtable server side.
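
The system error rate is a plain percentage. As a small illustration with made-up request counts (the function name is hypothetical):

```python
def system_error_rate(failed_server_side, total_requests):
    """Percentage of requests that failed on the server side."""
    if total_requests == 0:
        return 0.0  # no traffic, so nothing failed
    return 100.0 * failed_server_side / total_requests


print(system_error_rate(25, 10_000))  # 0.25
```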
     

The Cluster overview page helps users analyse the present and past status of each cluster. It shows:

  • The number of nodes currently in use.
     
  • The Maximum node count target defines the maximum limit of nodes for autoscaling.
     
  • The Minimum node count target defines the minimum limit of nodes for autoscaling.
     
  • The recommended number of nodes for the CPU and storage targets.
     
  • CPU Utilisation
     
  • Storage Utilisation
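
One way to read the node-count fields together is sketched below. It assumes that the cluster needs enough nodes to satisfy both the CPU and the storage recommendation, and that the result is then clamped to the configured minimum and maximum. This is an illustrative reading of the overview page, not a description of Bigtable's actual autoscaling algorithm, and the function name is hypothetical.

```python
def target_node_count(recommended_for_cpu, recommended_for_storage, min_nodes, max_nodes):
    """Pick a node count that meets both recommendations within the limits."""
    if min_nodes > max_nodes:
        raise ValueError("min_nodes cannot exceed max_nodes")
    wanted = max(recommended_for_cpu, recommended_for_storage)
    return max(min_nodes, min(wanted, max_nodes))


print(target_node_count(7, 4, min_nodes=3, max_nodes=10))   # 7
print(target_node_count(14, 4, min_nodes=3, max_nodes=10))  # 10, capped by the maximum
```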

Key Visualiser

Usage patterns in Bigtable can be monitored with Key Visualiser, a tool that helps users analyse and diagnose Bigtable. The visual reports generated by Key Visualiser give detailed insights into usage patterns that may be difficult to analyse otherwise. They can be used to improve existing schema designs and troubleshoot performance issues. Key Visualiser does not display every metric that affects the performance of Bigtable. Hence, additional troubleshooting alongside Key Visualiser scans is needed to identify the causes of performance issues.

Key Visualiser scans consist of a heatmap with aggregate values along each axis. Heatmaps show the access patterns of groups of row keys over time. The x-axis represents time, and the y-axis represents the row keys. Low metric values for a group of row keys are said to be “cold” and are shown in dark colours. High values are “hot” and appear in bright colours. Such visual patterns make it possible to diagnose problems at a glance.
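
The structure of such a heatmap can be sketched as a grid of counters, with one row per bucket of row keys and one column per time slice. The code below is a simplified illustration that buckets keys by their sorted rank; it is not how Key Visualiser itself builds its buckets, and all names are hypothetical.

```python
def build_heatmap(events, key_buckets, time_bucket_seconds, n_time_buckets):
    """Count accesses per (key bucket, time bucket).

    events is a list of (row_key, timestamp_seconds) pairs.
    """
    keys = sorted({key for key, _ in events})
    # Assign each distinct key to a bucket according to its sorted position.
    bucket_of = {key: i * key_buckets // len(keys) for i, key in enumerate(keys)}
    grid = [[0] * n_time_buckets for _ in range(key_buckets)]
    for key, timestamp in events:
        column = int(timestamp // time_bucket_seconds)
        if 0 <= column < n_time_buckets:
            grid[bucket_of[key]][column] += 1
    return grid


events = [("a", 0), ("a", 5), ("b", 12), ("c", 14), ("d", 25)]
for row in build_heatmap(events, key_buckets=2, time_bucket_seconds=10, n_time_buckets=3):
    print(row)
```

Each printed row corresponds to one band of the y-axis, and larger counts would be drawn in brighter colours.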

Exploring Heatmaps

Figure: A Key Visualiser heatmap (Source: Google Cloud)

The figure above shows a heatmap in Key Visualiser. Issues identified by Key Visualiser are displayed above the heatmap as diagnostic messages. High and low values reflect how heavily a particular resource is used. When warning and performance metrics appear in bright colours, Key Visualiser has detected a potential problem.

Colours in the heatmap can be adjusted using the +/- buttons on either side of the Adjust Brightness option. Increasing the brightness decreases the range of values that each colour represents, and decreasing it widens that range.

Users can use Rectangular Zoom to enlarge a particular area of the heatmap for a closer, more detailed look. This helps in spotting issues within a specific period.

Row keys often represent a hierarchy of values, for example an identifier followed by a timestamp. Users can drill down into the data of a heatmap using a common prefix shared by a group of row keys. Specific row-key hierarchies can be selected from the left side of the heatmap. The key prefix for all the row keys at that level is also displayed.
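
Drilling down by prefix amounts to grouping row keys by their leading components. The sketch below assumes row keys whose parts are separated by a `#` character; the delimiter, the sample keys and the function name are all hypothetical.

```python
from collections import defaultdict


def group_by_prefix(row_keys, depth, delimiter="#"):
    """Group row keys by their first `depth` delimiter-separated parts."""
    groups = defaultdict(list)
    for key in row_keys:
        prefix = delimiter.join(key.split(delimiter)[:depth])
        groups[prefix].append(key)
    return dict(groups)


keys = ["user1#2024-03-01", "user1#2024-03-02", "user2#2024-03-01"]
print(group_by_prefix(keys, depth=1))
```

With `depth=1` the keys collapse into one group per identifier; a larger depth separates them further, much like expanding a level in the hierarchy on the left of the heatmap.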

Details about a metric are shown as a tooltip when the cursor moves over the heatmap. Tooltips can be pinned by clicking on the heatmap. The ops metric gives an overview of the usage pattern for a table. Users can switch metrics by choosing one from the Metric drop-down list above the heatmap.

Heatmap Patterns

Five main heatmap patterns are frequently spotted.

  1. Diagonal line: sequential reads and writes. The diagonal implies that contiguous key ranges are accessed in sequential order.
     
  2. Fine-grained texture: evenly distributed reads and writes. This texture indicates an effective usage pattern.
     
  3. Alternating bands of dark and light colours: the key ranges are accessed only during specific periods rather than continuously.
     
  4. Abrupt change from dark to light colour: a sudden increase in rows being added or accessed during a specific period.
     
  5. Horizontal lines of light and dark colours: hot key ranges, which usually appear while performing larger reads and writes.
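
The last pattern, a bright horizontal line, can also be detected numerically when the heatmap is available as a grid of values. The sketch below flags key buckets whose average activity is several times the overall average. The threshold factor and the function name are arbitrary illustrative choices, not something Key Visualiser exposes.

```python
def hot_key_buckets(grid, factor=3.0):
    """Return indexes of rows whose mean is at least `factor` times the overall mean."""
    if not grid or not grid[0]:
        return []
    row_means = [sum(row) / len(row) for row in grid]
    overall = sum(row_means) / len(row_means)
    if overall == 0:
        return []  # no activity at all, so nothing is hot
    return [i for i, mean in enumerate(row_means) if mean >= factor * overall]


# Rows are key buckets, columns are time slices (hypothetical counts).
grid = [
    [1, 1, 1, 1],
    [9, 10, 11, 10],  # consistently busy: would show as a bright horizontal line
    [1, 0, 1, 1],
    [0, 1, 1, 0],
]
print(hot_key_buckets(grid))  # [1]
```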

Diagnostics

Diagnostic messages help identify issues in the performance data shown in a Key Visualiser scan. A message can carry a Warning or Danger symbol to flag the rows causing the problem. Some of the diagnostic messages are:

  • High read pressure
     
  • High write pressure
     
  • Large rows - It indicates that some rows in the table contain more than 256 MB of data.
     
  • No data scanned - This implies no performance data for the table.
     
  • Keyspace not to scale - If a table contains a small number of rows, the Key Visualizer cannot evenly distribute the row keys into buckets.
     
  • Not all details are shown in the tooltip.
     
  • Values per row are approximate.
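
The large-rows check is a simple threshold comparison. The sketch below applies it to a hypothetical mapping of row keys to sizes, and it assumes that 256 MB means 256 × 2^20 bytes; both the helper and that interpretation are illustrative assumptions.

```python
# Assumes "MB" is binary (2**20 bytes); adjust if a decimal limit is intended.
LARGE_ROW_LIMIT_BYTES = 256 * 1024 * 1024


def find_large_rows(row_sizes, limit=LARGE_ROW_LIMIT_BYTES):
    """Return the row keys whose stored size exceeds the limit, sorted by key."""
    return sorted(key for key, size in row_sizes.items() if size > limit)


# Hypothetical row sizes in bytes.
sizes = {"user1": 10 * 1024 * 1024, "user2": 300 * 1024 * 1024}
print(find_large_rows(sizes))  # ['user2']
```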

Frequently Asked Questions

What is a latency-sensitive application?

Latency is the time interval between the occurrence of an event and its handling. Reduced latency improves performance. A latency-sensitive application is one that must respond quickly, such as an application serving data to end users, so even small delays noticeably degrade its usefulness.

What are the different Warning metrics of Bigtable?

The Read pressure index and Write pressure index indicate how much CPU utilisation and latency a row key or key range incurs for reads and writes, respectively. Large rows indicates that the data in a row exceeds 256 MB.

What key points to consider while testing a Bigtable application’s performance?

Test with enough data and note the storage utilisation per node. Run a pre-test for several minutes before running the main test. Run the main test for at least 10 minutes so that it exercises all of the data thoroughly.

Conclusion

This blog discussed the monitoring and troubleshooting of Cloud Bigtable applications. It explained the factors that affect the performance of Bigtable and how to monitor them. It also gave an overview of Key Visualiser and its features.
