Introduction
A distributed storage system is an architecture in which data is divided amongst several physical servers, and frequently across more than one data centre. It typically takes the shape of a cluster of storage units, with a mechanism for synchronising and coordinating data across the cluster's nodes. Bigtable is a NoSQL database offered by Google Cloud that is fully managed, scalable, and capable of handling heavy operational and analytical workloads. It is optimised for applications that need high throughput, low latency, and performance at scale.
Cloud Bigtable
Google's Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, which makes it possible to store terabytes or even petabytes of data. Each row is indexed by a single value known as the row key. Bigtable is an excellent choice for storing very large amounts of single-keyed data with low latency, and because it supports high read and write throughput at low latency, it is an ideal data source for MapReduce-style processing.
Key Features
Google's Cloud Bigtable has several features that make it stand out and make it highly useful.
High throughput at low latency: Bigtable is the best option for storing enormous volumes of data in a key-value store since it enables high read and write throughput at low latency for quick access to enormous amounts of data.
Cluster resizing without downtime: Easily scale from a few thousand to millions of reads and writes per second. Cluster nodes can be added or removed without interrupting traffic, so a cluster can be resized with no downtime.
Flexible, automated replication to optimise any workload: We can control high availability and workload isolation by writing data only once; Bigtable automatically replicates it where needed, with eventual consistency.
Advantages
Let us take a look at several of the different advantages of Bigtable.
Swift and effective: Bigtable delivers low latency, on the order of single-digit milliseconds, and has the capacity to absorb enormous write throughput.
Seamless scaling: Bigtable is a very scalable database as well. Bigtable's capacity to scale lets us store up to petabytes of data. Dynamically modifying throughput is possible by adding or removing Bigtable nodes.
Management and Reliability: Bigtable is a trustworthy, fully managed database. With a 99.999 per cent Service Level Agreement (SLA), Bigtable joins Cloud Firestore and Cloud Spanner. For live serving applications, replication also delivers high availability and workload isolation.
Simple and integrated: Other Google Cloud big data products like BigQuery, Dataflow and Dataproc work well with Bigtable.
Database Challenges
When selecting a database, several factors must be considered, especially when working with significant amounts of data. Suppose we are creating a music app that suggests songs to users based on other music they enjoy. Our app might first suggest songs according to the genres each user listens to.
Workloads like this, which combine large volumes of data with fast reads and writes, pose new difficulties for developers.
BigTable Data Model
Bigtable is a NoSQL database that stores data as key-value pairs arranged in rows and columns. Each row represents a single entity, and each column holds an individual value for that entity. For organisational purposes, related columns can be grouped into column families. Each intersection of a row and a column can contain multiple cells, each holding a different version of the data at a specific timestamp. Bigtable tables are sparse: if a column is not used in a particular row, it takes up no space.
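To make the model concrete, here is a small sketch, using nested Python literals, of how one hypothetical row could be addressed. The row key, column families, qualifiers, and values are invented purely for illustration.

```python
# One hypothetical row, shown as nested literals:
# row key -> column family -> column qualifier -> {timestamp: cell value}.
row = {
    "user#1234": {                                   # row key
        "profile": {                                 # column family
            "genre": {
                1700000000000: b"jazz",              # newer version (cell)
                1690000000000: b"rock",              # older version (cell)
            },
            # A "country" column is simply absent for this row; because
            # the table is sparse, it costs no storage.
        },
        "stats": {                                   # another column family
            "play_count": {1700000000000: b"42"},
        },
    }
}
```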
Writes
The Cloud Bigtable Data API and client libraries enable us to write data to the tables programmatically. For each "write," Bigtable returns a response or acknowledgement.
Bigtable client libraries have a built-in "smart retries" feature that handles temporary unavailability seamlessly.
Types of Writes and When to Use
We will have a look at the types of Writes and when to use them.
Simple Writes
With a MutateRow request that includes the table name, the ID of the app profile to be used, a row key, and up to one hundred thousand (100,000) mutations for that row, we can write a single row to Bigtable. A single-row write is atomic. When making multiple changes to a single row, use this write type.
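As a rough sketch, a simple write with the Python client library (google-cloud-bigtable) might look like the following; the project, instance, table, family, and column names are placeholders rather than values from this article.

```python
from google.cloud import bigtable

# Placeholder identifiers.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# A DirectRow collects mutations for one row key; commit() sends a
# single MutateRow request, which Bigtable applies atomically.
row = table.direct_row("user#1234")
row.set_cell("profile", "genre", b"jazz")
row.set_cell("stats", "play_count", b"42")
row.commit()
```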
When to Avoid Using Simple Writes
For the following use cases, simple writes are not the best way to write data:
We are creating a batch of data with contiguous row keys. Because a contiguous batch can be applied in a single backend call, we should use "batch writes" instead of consecutive "simple writes" in this case.
We require high throughput (rows per second or bytes per second) but do not need low latency. In this case, "batch writes" would be faster.
Increments and Appends
Submit a ReadModifyWriteRow request to append data to an existing value or to increment an existing numeric value. Bigtable treats an empty or non-existent value as zero when incrementing. Each request includes the name of the column family, a column qualifier, and either the value to append or the amount to increment by. Requests are atomic, and if they fail for any reason, they are not retried.
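A minimal sketch with the Python client, assuming the same placeholder table as in the earlier write example; the column names are invented.

```python
from google.cloud import bigtable

table = bigtable.Client(project="my-project").instance("my-instance").table("my-table")

# An AppendRow collects read-modify-write rules; commit() sends one
# atomic ReadModifyWriteRow request and returns the modified cells.
rmw = table.append_row("user#1234")
rmw.increment_cell_value("stats", "play_count", 1)        # missing value is treated as 0
rmw.append_cell_value("profile", "recent_genres", b",jazz")
result = rmw.commit()
```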
When to Avoid Using Increments and Appends
In the following cases, we should not send ReadModifyWriteRow requests:
We are using a multi-cluster routing app profile.
We use multiple single-cluster app profiles and send writes that could conflict with data written to the same row and column through other clusters.
We depend on the client libraries' smart retries feature; increments and appends are not retryable.
We are writing a lot of data and need the writes to finish quickly. A read-modify-write request is slower than a simple write request, so at scale this type of write is not always the best approach.
Conditional Writes
Submit a CheckAndMutateRow request if we want to check a row for a condition and then write data to that row based on the result. A row key and a row filter are included in this type of request. A row filter is a set of rules that we use to validate existing data. Only when certain conditions, as determined by the filter, are met are mutations committed to specific columns in the row. This process of checking and writing is completed in a single atomic action.
A filter request must contain one or both of the following mutations:
True mutations: mutations that are applied if the filter returns at least one cell.
False mutations: mutations that are applied if the filter returns nothing.
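As a sketch, a conditional write with the Python client could look like this; the filter simply checks whether the row already has a cell in a hypothetical profile:genre column, and all identifiers are placeholders.

```python
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

table = bigtable.Client(project="my-project").instance("my-instance").table("my-table")

# Filter: match cells in the "profile" family whose qualifier is "genre".
condition = row_filters.RowFilterChain(filters=[
    row_filters.FamilyNameRegexFilter("profile"),
    row_filters.ColumnQualifierRegexFilter(b"genre"),
])

cond_row = table.conditional_row("user#1234", filter_=condition)
# state=True mutations apply when the filter matches; state=False when it does not.
cond_row.set_cell("stats", "genre_known", b"1", state=True)
cond_row.set_cell("stats", "genre_known", b"0", state=False)
matched = cond_row.commit()   # one atomic CheckAndMutateRow request
```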
Batch Writes
A MutateRows request allows us to write multiple rows in a single call. MutateRows requests can have up to 100,000 entries, each of which is applied atomically. A "batch write", for example, could include any of the following permutations:
One hundred thousand entries, each with one mutation.
One entry with 100,000 mutations.
One thousand entries, each with 100 mutations.
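A sketch of a batch write with the Python client: mutate_rows sends one MutateRows call, each entry is atomic on its own, and the returned statuses tell us which entries failed. Identifiers are placeholders.

```python
from google.cloud import bigtable

table = bigtable.Client(project="my-project").instance("my-instance").table("my-table")

rows = []
for i in range(3):
    row = table.direct_row(f"user#{i:04d}")            # contiguous row keys
    row.set_cell("stats", "play_count", b"0")
    rows.append(row)

# One MutateRows request; each row's mutations are applied atomically,
# but different rows can succeed or fail independently.
statuses = table.mutate_rows(rows)
for status in statuses:
    if status.code != 0:                               # non-zero gRPC status code
        print("entry failed:", status.message)
```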
Consistency When Using Replication
Bigtable is eventually consistent when an instance has multiple clusters, i.e. when it is using replication. By routing requests to the same cluster, we can achieve read-your-writes consistency. With a single-cluster instance, data can be read back immediately after it is written. Consistency tokens can be used to check that all of the data has been replicated.
Conflict Resolution
Each cell value in a Bigtable table is uniquely identified by the four-tuple of row key, column family, column qualifier, and timestamp. When two writes with the same four-tuple are sent to different clusters, Bigtable automatically resolves the conflict using an internal last-write-wins algorithm based on server-side time. Bigtable's "last write wins" implementation is deterministic.
Schema Design
This section covers Cloud Bigtable schema design. We should be familiar with the Bigtable overview above before reading it.
General Concepts
Bigtable schema design differs from relational database schema design. A schema in Bigtable is a blueprint or model of a table that describes the structure of its components, such as row keys, column families, and columns.
Best Practices
A good schema yields superior performance and scalability; a poorly designed schema can result in a system that performs badly. Every use case is unique and calls for its own design, but the best practices below apply to the vast majority of use cases, with some exceptions. The following sections describe best practices for schema design, beginning at the table level and working down to the row key level:
Tables
Store datasets with similar schemas in the same table rather than in separate tables. Other database systems might lead us to split data across multiple tables depending on the subject and the number of columns, but in Bigtable, keeping all of the data in one table is usually preferable.
Column Families
When a row has multiple values that are related to one another, it is best to group the columns containing those values into the same column family. Group data as closely as possible to avoid having to create complex filters so that we only get the information we need.
Column family names are included in every request, so choose names that are short but meaningful. Put columns with different data retention requirements into different column families, because garbage collection policies are defined at the column family level rather than the column level.
Columns
Bigtable tables are compact, and there is no penalty for not using a column in a row. If no row exceeds the maximum limit of 256 MB per row, we can have millions of columns in a table.
This best practice is influenced by a number of factors, including:
Bigtable requires time to process each cell in a row.
Each cell adds a small amount of overhead to the data stored in the table and sent over the network. For example, if we are storing 1,024 bytes of data, storing it in a single cell is far more space-efficient than spreading it across 1,024 cells of one byte each.
If the dataset has more columns per row than Bigtable can handle efficiently, consider storing it as a protobuf in a single column.
Rows
No more than 100 MB of data should be stored in a single row. Rows that exceed this limit may have a negative impact on read performance. Maintain all of an entity's information in one row to avoid inconsistencies. This practice ensures that the data is not left incomplete if a portion of a "write request" fails or needs to be resent.
Cells
No more than 10 Megabytes of data should be stored in a single cell. A cell is the data stored with a unique timestamp for a given row and column. The garbage collection policy we specify for the column family that contains that column governs the number of cells retained in that column.
Row Keys
Create the row keys based on the queries we will be using to retrieve data. Bigtable performs best when row keys are well-designed. The most effective Bigtable queries use one of the following methods to retrieve data:
The row key
A row key prefix
A range of rows defined by starting and ending row keys
A row key must be no larger than 4 KB in size. Long row keys consume more memory and storage, lengthening the time it takes to receive responses from the Bigtable servers.
When the row key contains multiple values, it is critical that we clearly understand how we will use the data. A delimiter, such as a slash, colon, or hash symbol, is usually used to separate row key segments. The row key prefix is the first segment or set of contiguous segments. Storing related data in contiguous rows allows us to access it as a range of rows. Suppose, for example, that we store readings from a fleet of devices using row keys of the form device_type#device_id#record_time (this format is purely illustrative).
With such a row key design, we can retrieve data with just one request for:
A type of device
A device type and device ID combination
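As a sketch, assuming the illustrative device_type#device_id#record_time format above, row keys could be built like this.

```python
# Segments are joined with "#", so rows for the same device type, and then
# the same device, sort next to each other and share a common prefix.
def make_row_key(device_type: str, device_id: str, record_time: str) -> bytes:
    return f"{device_type}#{device_id}#{record_time}".encode("utf-8")

key = make_row_key("thermostat", "device-0042", "2023-05-01T12:00:00Z")
# b"thermostat#device-0042#2023-05-01T12:00:00Z"
```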
When possible, use human-readable string values in the row keys. This practice makes it easier to troubleshoot Bigtable issues with the Key Visualiser tool.
Reads
Bigtable read requests return the contents of the requested rows in key order, which means they are given back in the order in which they were stored. Any writes that have returned a response can be read.
Bigtable read requests are divided into two categories:
Single Row Reads: The row key can be used to request a single row.
Scans: The most common way to read Bigtable data is through scans. By specifying a row key prefix, or beginning and ending row keys, we can read a range of contiguous rows, or several ranges of rows, from Bigtable.
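Sketches of both read types with the Python client, reusing the illustrative device row keys from the schema section; the column family and qualifier are placeholders.

```python
from google.cloud import bigtable

table = bigtable.Client(project="my-project").instance("my-instance").table("my-table")

# Single-row read by exact row key.
row = table.read_row(b"thermostat#device-0042#2023-05-01T12:00:00Z")
if row is not None:
    cell = row.cells["measurements"][b"temperature"][0]   # most recent cell first
    print(cell.value, cell.timestamp)

# Scan a contiguous range of rows: all readings for one device, using its
# key prefix as the start key and the prefix with its last byte
# incremented as the end key.
prefix = b"thermostat#device-0042#"
end_key = prefix[:-1] + bytes([prefix[-1] + 1])
for row in table.read_rows(start_key=prefix, end_key=end_key):
    print(row.row_key)
```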
Garbage Collection
The automatic removal of expired and obsolete data from Bigtable tables is known as garbage collection. Garbage collection happens continuously in the background, on a schedule that is not affected by the amount of data that needs to be deleted, and it can take up to a week before eligible data is actually removed.
Benefits
The following are some of the advantages of garbage collection policies:
Reduce row size - Large rows degrade performance. The recommended maximum size for a row is 100 MB, and the hard limit is 256 MB.
Reduce costs - We are charged for storing expired or obsolete data until it is removed during compaction. Garbage collection policies can be set programmatically or with the cbt tool, and they are defined at the column family level.
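Programmatically, garbage collection rules in the Python client are attached to a column family when it is created or updated; the family name and rule values below are arbitrary examples.

```python
import datetime
from google.cloud import bigtable
from google.cloud.bigtable import column_family

table = bigtable.Client(project="my-project", admin=True).instance("my-instance").table("my-table")

# Union rule: a cell is eligible for garbage collection if it is beyond
# the 2 most recent versions OR if it is older than 5 days.
rule = column_family.GCRuleUnion(rules=[
    column_family.MaxVersionsGCRule(2),
    column_family.MaxAgeGCRule(datetime.timedelta(days=5)),
])

cf = table.column_family("stats", gc_rule=rule)
cf.create()    # use cf.update() to change the policy on an existing family
```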
Simulate Cell Level TTL
Garbage collection policies in Cloud Bigtable are configured at the column family level; a cell-level garbage collection policy cannot be specified. However, we can combine the column family's garbage collection settings with carefully chosen cell timestamps to simulate a time-to-live (TTL) policy at the cell level.
One Second Expiration
Create the column family with an age-based garbage collection limit of one second. Then, whenever we write data, set the cell's timestamp to the time we want the value to expire. During compaction, Bigtable deletes any cell whose timestamp is at least one second in the past. This method allows us to give cells within the same column family different expiration times.
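A sketch of this pattern with the Python client, assuming a hypothetical "expiring" column family whose age limit is one second; all identifiers are placeholders.

```python
import datetime
from google.cloud import bigtable
from google.cloud.bigtable import column_family

table = bigtable.Client(project="my-project", admin=True).instance("my-instance").table("my-table")

# One-time setup: the column family's age-based GC limit is one second.
table.column_family(
    "expiring",
    gc_rule=column_family.MaxAgeGCRule(datetime.timedelta(seconds=1)),
).create()

# Write the cell with its timestamp set to the desired expiry time rather
# than the current time; once that moment is more than one second in the
# past, the cell becomes eligible for garbage collection.
expires_at = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=24)
row = table.direct_row("user#1234")
row.set_cell("expiring", "session_token", b"abc123", timestamp=expires_at)
row.commit()
```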
Default expiration
This method creates the column family with its garbage collection age limit set to the default TTL. If we want a particular value to expire later than the default, we set its timestamp to a time in the future relative to when the data is written.
Writes without a custom timestamp simply use the default TTL, so this method is safe to use on an existing table.
Garbage Collection for Sequential Numbers in Timestamps
For reasons unrelated to garbage collection, we may want to assign sequential numbers to a cell's timestamp property rather than an actual date and time.
Number of Versions
If timestamps are sequence numbers, the garbage collection policy should be based on the number of versions; that is, we specify how many cells to keep. An age-based garbage collection policy is unsafe when sequential numbers are used instead of real timestamps, because age-based policies remove data based on the timestamp value.
Keep Most Recent Value
We can use filters in all Cloud Bigtable client libraries to read the latest value, or cell, at a given row and column. We may never need to read older versions of the data in some cases. To avoid paying for older data that we no longer require, use the strategy on this page to delete it.
Timestamp of Zero
If we only need to read the most recent value in a column family's columns, we can set the timestamp to zero (the Unix epoch, 1970-01-01 00:00:00 UTC) whenever we write data to that column family. New writes then immediately hide old ones, and reads always return a single value for each column.
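Two complementary sketches with the Python client: writing at timestamp zero so each write replaces the previous cell, and reading only the newest cell per column with a limiting filter. Identifiers are placeholders.

```python
import datetime
from google.cloud import bigtable
from google.cloud.bigtable.row_filters import CellsColumnLimitFilter

table = bigtable.Client(project="my-project").instance("my-instance").table("my-table")

# Write every value at timestamp zero (the Unix epoch); each new write has
# the same (row key, family, qualifier, timestamp) and so replaces the old cell.
epoch = datetime.datetime.fromtimestamp(0, tz=datetime.timezone.utc)
row = table.direct_row("user#1234")
row.set_cell("profile", "genre", b"jazz", timestamp=epoch)
row.commit()

# Alternatively, keep older versions but read only the most recent cell
# in each column by limiting the number of cells returned per column.
latest = table.read_row("user#1234", filter_=CellsColumnLimitFilter(1))
```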
Frequently Asked Questions
What is a column family in Bigtable?
Column families are defined when a table is initially created. A column family can contain one or more named columns. Typically, all of the data in a column family is of the same type, and Bigtable usually compresses the columns in a column family together.
What is the difference between Bigtable and Firestore?
Google has used Cloud Bigtable for more than ten years, putting it to the test. This database powers important programmes like Google Analytics and Gmail. Cloud Firestore, on the other hand, is described as a "NoSQL database intended for global apps."
How is replication handled in Bigtable?
By copying data across numerous zones in a region or across many regions, Bigtable replication helps boost the data's availability and durability. Replication facilitates workload isolation by leveraging application profiles to route various requests to separate clusters.
Conclusion
Google's Cloud Bigtable is a highly scalable and dependable database that can manage massive data workloads. In this article, we examined how it stacks up against other Google Cloud databases, and we also read about its applications, distinguishing qualities, and data model. You can learn more about Cloud Computing and find our courses on Data Science and Machine Learning. To find out whether Bigtable will meet your requirements, read more of our articles on Cloud Bigtable.