Table of contents
1. Introduction
2. Exporting Table Data
   2.1. Limitations of Export
   2.2. Required Permissions
   2.3. Exporting data stored in BigQuery
3. Loading Data
   3.1. Batch Loading
   3.2. Streaming
   3.3. Data Integration Method
4. Streaming API
   4.1. Before you start
   4.2. Required Permissions
   4.3. Streaming Data into BigQuery
   4.4. Streaming Data Availability
   4.5. Best Effort De-duplication
   4.6. Disabling best effort de-duplication
5. External Data Sources
6. Frequently Asked Questions
   6.1. What are the features of Google BigQuery?
   6.2. What is the architecture of BigQuery?
   6.3. What language is BigQuery written in?
7. Conclusion
Last Updated: Mar 27, 2024

Advanced Concepts of BigQuery

Author: Sagar Mishra

Introduction

BigQuery is Google's serverless, highly scalable enterprise data warehouse in the cloud. It lets firms store and query their data, and you can analyze huge datasets with BigQuery ML to understand your data better.

Because it runs on Google's infrastructure, BigQuery can process data very quickly, and since machine learning is embedded in the technology, it can help you sort and make sense of your data.

In this article, we will learn the advanced concepts of BigQuery. This is the final part of the BigQuery series. For the first and second parts, you can visit Introduction to Big Query and Basic Concepts of Big Query.

Exporting Table Data


This section will show how to extract data from a BigQuery table and export it to Cloud Storage. We can use the EXPORT DATA statement to export the result of any query.

Limitations of Export


Export has a few limitations, and a user should review the list below before exporting data.

  • A single export job cannot export data from more than one table.

  • Nested and repeated data cannot be exported in CSV format; Avro, JSON, and Parquet exports all support nested and repeated data.

  • If you export data via the Google Cloud console, you can only select GZIP as the compression type.

  • Unless you use the EXPORT DATA statement and add an ORDER BY clause to the query, the order of the exported table data is not guaranteed.

  • INT64 values are encoded as JSON strings when you export data in JSON format, so that 64-bit precision is preserved when other systems read the data.

Required Permissions


A user needs certain IAM permissions and roles to export data from a BigQuery table.

  1. Permission to run an export job. Any one of the following roles grants it:
    1. roles/bigquery.user
    2. roles/bigquery.jobUser
    3. roles/bigquery.admin

  2. Permission to read the data in the BigQuery table being exported. Any one of the following roles grants it:
    1. roles/bigquery.dataViewer
    2. roles/bigquery.dataOwner
    3. roles/bigquery.admin
    4. roles/bigquery.dataEditor

  3. Permission to write the exported data to the Cloud Storage bucket. The first two entries are permissions; the last two are roles that include them:
    1. storage.objects.create
    2. storage.objects.delete
    3. roles/storage.objectAdmin
    4. roles/storage.admin

Exporting data stored in BigQuery


A user can export the data in three different ways:

  1. Submitting an extract job via the client libraries or API (a minimal Python sketch follows this list).
  2. Using the Google Cloud console.
  3. Using the bq extract command of the bq command-line tool.
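
For the first option, here is a minimal sketch of submitting an extract job with the Python client library. The project, dataset, table, and bucket names are placeholders you would replace with your own.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers -- replace with your own table and bucket.
table_id = "your-project.your_dataset.your_table"
destination_uri = "gs://your-bucket/exports/your_table-*.csv"

# Submit the extract job and wait for it to finish.
extract_job = client.extract_table(
    table_id,
    destination_uri,
    location="US",  # Must match the location of the source table.
)
extract_job.result()  # Waits for the export to complete.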

You can perform the export from any supported language, such as Go, Java, or Python, or directly in SQL. In this article, we will export the data using SQL.

We will use the EXPORT DATA statement to perform the export. Follow the steps below.

Step 1: Go to the BigQuery page in the Google Cloud console.

Step 2: Paste the following statement into the query editor.

EXPORT DATA
  OPTIONS (
    uri = 'gs://bucket/folder/*.csv',
    format = 'CSV',
    overwrite = true,
    header = true,
    field_delimiter = ';')
AS (
  SELECT field1, field2
  FROM mydataset.table1
  ORDER BY field1
);

 

Step 3: Click on the Run button. 

Loading Data

In this section, we will learn how to load data in BigQuery.

Batch Loading


Batch loading is the process of loading a set of source data into a BigQuery table in a single operation. The options for batch loading in BigQuery are as follows; a minimal load-job sketch appears after the list.

  • Load jobs.
  • SQL.
  • BigQuery Data Transfer Service.
  • BigQuery Storage Write API.
  • Other managed services.
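
As an illustration of the first option, here is a minimal sketch of a load job using the Python client library, assuming a CSV file in Cloud Storage; the table and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers -- replace with your own table and source file.
table_id = "your-project.your_dataset.your_table"
uri = "gs://your-bucket/data/records.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # Skip the CSV header row.
    autodetect=True,      # Let BigQuery infer the schema.
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # Waits for the load job to complete.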

Streaming


Streaming is the process of continuously sending smaller batches of data in real time, so the data becomes available for querying almost immediately. The options for streaming include:

  • Dataflow.
  • Storage Write API.
  • BigQuery connector for SAP.

Data Integration Method

A user should consider the factors mentioned below before choosing a data integration method.

  • Data Source.
  • Reliability of solutions.
  • Latency.
  • Data Ingestion Format.
  • Slowly changing vs. fast-changing data.

Streaming API


In this section, we will use the legacy tabledata.insertAll method. If you are working on a new project, using the BigQuery Storage Write API is recommended instead, because it is cheaper and has more robust features.

Before you start

  1. Ensure that you have write access to your target table, and that the table exists before you begin streaming data into it.

  2. Check the quota policy for streaming data.

  3. Verify that billing is enabled for your project.

  4. Grant the required IAM roles.

Required Permissions

You need a few IAM permissions to stream data into BigQuery; each of the roles listed at the end includes all of them. The list is as follows.

  • bigquery.tables.create
  • bigquery.datasets.get
  • bigquery.tables.get
  • bigquery.tables.updateData
  • roles/bigquery.dataEditor
  • roles/bigquery.dataOwner
  • roles/bigquery.admin

Streaming Data into BigQuery

Follow the below example to stream data into BigQuery.

from google.cloud import bigquery


client = bigquery.Client()

# TODO(developer): Replace with the full ID of the table to stream into.
table_id = "your-project.your_dataset.your_table"


rows_to_insert = [
    {"full_name": "Phred Phlyntstone", "age": 32},
    {"full_name": "Wylma Phlyntstone", "age": 29},
]


errors = client.insert_rows_json(table_id, rows_to_insert)  # Make an API request.
if errors == []:
    print("New rows have been added.")
else:
    print("Encountered errors while inserting rows: {}".format(errors))

 

When you insert rows, you do not have to populate the insertId field. The example that follows shows how to stream data without sending an insertId for each row.

from google.cloud import bigquery

client = bigquery.Client()

# TODO(developer): Replace with the full ID of the table to stream into.
table_id = "your-project.your_dataset.your_table"

rows_to_insert = [
    {"full_name": "Phred Phlyntstone", "age": 32},
    {"full_name": "Wylma Phlyntstone", "age": 29},
]

# Passing row_ids=[None] * len(rows_to_insert) skips sending an insertId,
# which disables best-effort de-duplication.
errors = client.insert_rows_json(
    table_id, rows_to_insert, row_ids=[None] * len(rows_to_insert)
)  # Make an API request.
if errors == []:
    print("New rows have been added.")
else:
    print("Encountered errors while inserting rows: {}".format(errors))

Streaming Data Availability

In this section, we will go over some points about streaming data availability.

  • After BigQuery successfully accepts a tabledata.insertAll request, the data is available right away for real-time analysis using Google Standard SQL queries.

  • Rows recently streamed into an ingestion-time partitioned table temporarily have a NULL value in the _PARTITIONTIME pseudo-column.

  • BigQuery assigns the final non-NULL value of _PARTITIONTIME in the background (the query sketch after this list shows how to find rows that are still waiting).

  • This usually happens within a few minutes, but in rare cases it can take up to 90 minutes.
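
For ingestion-time partitioned tables, one way to see whether streamed rows are still waiting for their final _PARTITIONTIME value is to count the rows where the pseudo-column is NULL. The sketch below assumes a hypothetical table name.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table -- replace with your own ingestion-time partitioned table.
query = """
    SELECT COUNT(*) AS rows_awaiting_partition
    FROM `your-project.your_dataset.your_table`
    WHERE _PARTITIONTIME IS NULL
"""

for row in client.query(query).result():
    print("Rows still waiting for a partition value:", row.rows_awaiting_partition)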

Best Effort De-duplication

When you supply an insertId for an inserted row, BigQuery uses it to de-duplicate multiple occurrences of that row on a best-effort basis, for example when inserts are retried. The system assumes that rows sent with identical insertIds are also identical; if two different rows carry the same insertId, it is nondeterministic which row BigQuery keeps. A sketch of supplying explicit insertIds follows.
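
To opt in to this best-effort de-duplication from the Python client, you can pass your own IDs through the row_ids argument of insert_rows_json, which sets the insertId for each row. A minimal sketch, with a placeholder table and arbitrary example IDs:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers -- replace with your own table and row IDs.
table_id = "your-project.your_dataset.your_table"

rows_to_insert = [
    {"full_name": "Phred Phlyntstone", "age": 32},
    {"full_name": "Wylma Phlyntstone", "age": 29},
]

# Explicit row_ids become the insertId of each row, so a retried request
# with the same IDs is de-duplicated on a best-effort basis.
errors = client.insert_rows_json(
    table_id, rows_to_insert, row_ids=["row-1", "row-2"]
)
print(errors or "New rows have been added.")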

Disabling best effort de-duplication

You can disable this feature by not supplying an insertId for each inserted row, as in the second streaming example above. You can also remove duplicate rows manually by following the steps below.

Step 1: Add an insertId column to your table schema and include the insertId value in the data for each row.

Step 2: After streaming has stopped, run the following query to check for duplicates:

#standardSQL
SELECT
  MAX(count) FROM(
  SELECT
    ID_COLUMN,
    count(*) as count
  FROM
    `TABLE_NAME`
  GROUP BY
    ID_COLUMN)

 

Step 3: Run a query to remove the duplicates. Specify a destination table, allow large results, and disable result flattening; a minimal sketch of such a query follows.
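
As an illustration, here is a minimal Python sketch of Step 3 using Google Standard SQL. It keeps one row per ID value and writes the result to a destination table; the table names and ID_COLUMN are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers -- replace with your own tables and ID column.
source_table = "your-project.your_dataset.your_table"
destination_table = "your-project.your_dataset.your_table_deduped"

# Keep one row per ID_COLUMN value.
query = f"""
    SELECT * EXCEPT(row_number)
    FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY ID_COLUMN) AS row_number
      FROM `{source_table}`
    )
    WHERE row_number = 1
"""

job_config = bigquery.QueryJobConfig(
    destination=destination_table,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(query, job_config=job_config).result()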

External Data Sources

An external data source is one that you can query directly from BigQuery even though the data is not stored in BigQuery. Only a few external sources are supported, including Cloud Bigtable, Cloud SQL, Cloud Spanner, Cloud Storage, and Google Drive. A sketch of defining an external table over files in Cloud Storage follows.
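
For example, CSV files sitting in Cloud Storage can be queried in place by defining an external table over them. Below is a minimal sketch using the Python client library; the dataset, table, and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers -- replace with your own dataset and bucket.
table_id = "your-project.your_dataset.external_sales"

# Describe the CSV files in Cloud Storage as an external data source.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://your-bucket/sales/*.csv"]
external_config.autodetect = True  # Infer the schema from the files.

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table)  # The data itself stays in Cloud Storage.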

Frequently Asked Questions

What are the features of Google BigQuery?

The features of BigQuery include:

  • Multicloud aspects.
  • Inbuilt ML Integration.
  • Foundation for BI.
  • Geospatial Analysis.
  • Automated Data Transfer.

What is the architecture of BigQuery?

BigQuery stores data in a columnar structure called Capacitor. Because each field, or column of the table, is stored in a separate Capacitor file, BigQuery can achieve high compression ratios and high scan throughput.

What language is BigQuery written in?

BigQuery supports the Google Standard SQL dialect, although a legacy SQL dialect is also offered. Google Standard SQL offers the widest set of capabilities and is the best choice if you are new to BigQuery. For example, it is the only one of the two dialects that supports features such as DDL and DML statements. A small example follows.
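
As a small illustration of DDL and DML in Google Standard SQL, here is a sketch that creates a table and inserts a row through the Python client; the dataset and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset and table -- replace with your own.
ddl = """
    CREATE TABLE IF NOT EXISTS your_dataset.demo (full_name STRING, age INT64)
"""
dml = """
    INSERT INTO your_dataset.demo (full_name, age) VALUES ('Phred Phlyntstone', 32)
"""

client.query(ddl).result()  # DDL: create the table.
client.query(dml).result()  # DML: insert a row.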

Conclusion

We have discussed the advanced concepts of BigQuery. We learned how to export table data, load data, use the streaming API, and work with external data sources.

We hope this blog has helped you enhance your knowledge of the advanced concepts of BigQuery. If you want to learn more, check out the many other articles on our platform, Coding Ninjas Studio.

If you have just started your learning process and are looking for questions from tech giants like Amazon, Microsoft, and Uber, you should look at the problems, interview experiences, and interview bundles for placement preparation.

You may also consider our paid courses to give your career an edge over others!

Happy Learning!
