Introduction
Did you know that, alongside a suite of management tools, Google Cloud Platform (GCP) provides a range of modular cloud services covering data storage, computing, data analytics, and machine learning?
Google Cloud Platform offers Infrastructure as a Service, Platform as a Service, and several serverless computing environments. Google also offers many other types of products, such as API platforms, IoT services, identity and security, and management tools.

Cloud Life Sciences (beta)
Formerly known as Google Genomics, Cloud Life Sciences enables the life sciences community to store, manage, and process biomedical data at scale. It is cost-effective and is supported by a growing partner ecosystem, which has made it quite popular. Its availability to institutions, support for leading workflow engines, growing ecosystem, and attention to information security and compliance have further broadened its reach.

Before starting, make sure you have set up a project. To set up a project, follow the steps below; a minimal command-line sketch of the same setup is shown after the list.
- Create a Google Cloud account and then, from the Console, create a Google Cloud project.
- Ensure that billing and the Cloud Life Sciences API are enabled.
- Familiarize yourself with the default service accounts and roles; adding or removing them can lead to errors that are difficult to troubleshoot.
- Download, install, and initialize the Google Cloud CLI.
- Finally, create credentials using the Create Credentials option in the Console so that you can access the APIs without issues.
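This sketch assumes the Google Cloud CLI is already installed; my-lifesciences-project is just a placeholder project ID:

gcloud projects create my-lifesciences-project      # placeholder project ID
gcloud config set project my-lifesciences-project
gcloud services enable lifesciences.googleapis.com  # requires billing to be enabled on the project
gcloud auth application-default login               # creates local credentials for calling the APIs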
What’s new in Cloud Life Sciences (beta)
Cloud Life Sciences (beta) is a regionalized service, which means you can choose the location that matches where your data needs to be processed and stored. This differs from Google Genomics, which was a global service and could not be run in a specific location. Once you select the location where the Cloud Life Sciences API runs, the metadata for each operation is stored in that location.
To learn more about how to make requests and specify a location, review the changes below.
🍁 Changes in REST and RPC paths
All paths now use lifesciences.googleapis.com instead of genomics.googleapis.com, and you must specify a Google Cloud location, which was not required before.
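For example, the request path for running a pipeline changes roughly as follows (PROJECT_ID and LOCATION are placeholders):

Before: POST https://genomics.googleapis.com/v2alpha1/pipelines:run
After: POST https://lifesciences.googleapis.com/v2beta/projects/PROJECT_ID/locations/LOCATION/pipelines:run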
🍁 Changes in Google Cloud CLI
All Cloud Life Sciences gcloud CLI commands now use gcloud beta lifesciences instead of gcloud alpha genomics, and a single machine type flag is used instead of separate cpu and memory flags.
For Example: gcloud beta lifesciences operations describe OPERATION_ID
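As a rough illustration of the flag change, a minimal pipeline run might look like the following; the image, command, region, and machine type below are placeholder choices, not required values:

gcloud beta lifesciences pipelines run \
  --regions us-central1 \
  --docker-image ubuntu \
  --command-line 'echo "hello world"' \
  --machine-type n1-standard-2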
🍁 Changes in IAM
The namespace in Identity and Access Management (IAM) roles and permissions has changed from genomics to lifesciences. The role has changed from roles/genomics.pipelinesRunner to roles/lifesciences.workflowsRunner, and the permission has changed from genomics.pipelines.run to lifesciences.workflows.run.
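For instance, the new role can be granted with the usual IAM binding command (the project ID and user email are placeholders):

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='user:someone@example.com' \
  --role='roles/lifesciences.workflowsRunner'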
🍁 Changes in Migration Requests and Responses
Migrating requests and responses mainly involves renaming fields and changing how some fields are structured.
🔥 In the Action section, the name field changed to containerName.
🔥 In the Event section, event details are now stored in a oneof field rather than encoded in a protobuf Any.
🔥 In the Network section, the name field changed to network.
🔥 In the Resources section, the projectId field is no longer accepted; the project ID is instead detected from the request URL.
Running Interval JOINs with BigQuery
BigQuery can be used to run JOIN queries on variants, selecting the variants whose positions overlap genomic region intervals.

Querying an inline table
You can run the following query in the New Query field; it defines the genomic intervals of interest in an inline table and JOINs the variants against them.
#standardSQL
WITH
-- Retrieve the variants in this cohort, flattening by alternate bases and counting affected alleles.
variants AS (
SELECT
REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt,
(SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
(SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
-- Define an inline table that uses five rows selected from silver-wall-555.TuteTable.hg19.
intervals AS (
SELECT * FROM UNNEST ([
STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64>
('PRCC', '1', 156736274, 156771607, 156636274, 156871607),
('NTRK1', '1', 156785541, 156852640, 156685541, 156952640),
('PAX8', '2', 113972574, 114037496, 113872574, 114137496),
('FHIT', '3', 59734036, 61238131, 59634036, 61338131),
('PPARG', '3', 12328349, 12476853, 12228349, 12576853)
])),
-- JOIN the variants with the genomic intervals overlapping the genes of interest.
-- The JOIN criteria is complicated. With standard SQL you can use complex JOIN predicates, including arbitrary expressions.
gene_variants AS (
SELECT
reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles
FROM
variants
INNER JOIN
intervals ON
variants.reference_name = intervals.Chr
AND intervals.region_start <= variants.start_position
AND intervals.region_end >= variants.end_position )
-- And finally JOIN the variants in the regions of interest with annotations for rare variants.
SELECT DISTINCT
Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles
FROM
`silver-wall-555.TuteTable.hg19` AS annots
INNER JOIN
gene_variants AS vars
ON
vars.reference_name = annots.Chr
AND vars.start_position = annots.Start
AND vars.reference_bases = annots.Ref
AND vars.alt = annots.Alt
WHERE
-- Retrieve annotations for rare variants only.
PopFreqMax <= 0.01
ORDER BY
Chr,
Start;
Querying a materialized table
You can run the following query in the New Query field; it reads the genomic intervals from a materialized table (here, genomics.myIntervalTable) and JOINs the variants against them.
#standardSQL
WITH
-- Retrieve the variants in this cohort, flattening by alternate bases and counting affected alleles.
variants AS (
SELECT
REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt,
(SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
(SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
FROM
`bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
-- JOIN the variants with the genomic intervals overlapping the genes of interest.
-- The JOIN criteria is complicated. With standard SQL, you can use complex JOIN predicates, including arbitrary expressions.
gene_variants AS (
SELECT
reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles
FROM
variants
INNER JOIN
`genomics.myIntervalTable` AS intervals ON
variants.reference_name = intervals.Chr
AND intervals.region_start <= variants.start_position
AND intervals.region_end >= variants.end_position )
-- And finally, JOIN the variants in the regions of interest with annotations for rare variants.
SELECT DISTINCT
Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles
FROM
`silver-wall-555.TuteTable.hg19` AS annots
INNER JOIN
gene_variants AS vars
ON
vars.reference_name = annots.Chr
AND vars.start_position = annots.Start
AND vars.reference_bases = annots.Ref
AND vars.alt = annots.Alt
WHERE
-- Retrieve annotations for rare variants only.
PopFreqMax <= 0.01
ORDER BY
Chr,
Start;
Variant Transforms Tool
Variant Transforms is an open-source tool, based on Apache Beam, that is used with Cloud Life Sciences. It can transform and load hundreds of thousands of files, millions of samples, and billions of records. The Variant Transforms preprocessor can validate VCF files and identify inconsistencies before loading. The typical workflow is to store raw VCF files in Cloud Storage, use the tool to load them from Cloud Storage into BigQuery, and then analyze the variants with BigQuery.
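As a rough sketch, loading VCF files from Cloud Storage into BigQuery with the tool's Docker image looks something like the following; the bucket, dataset, and table names are placeholders, and the image and script paths should be checked against the current Variant Transforms release:

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  /opt/gcp_variant_transforms/bin/vcf_to_bq \
  --project PROJECT_ID \
  --input_pattern gs://BUCKET/vcfs/*.vcf \
  --output_table PROJECT_ID:DATASET.variants \
  --temp_location gs://BUCKET/temp \
  --job_name vcf-to-bigquery \
  --runner DataflowRunner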
Understanding the BigQuery Variants Schema

🍁 Genomics Nomenclature
- Sample: the DNA collected from and processed for a single individual.
- Reference name: the name of the reference segment of DNA.
- Variant: a region of the genome that differs from the reference genome.
- Non-variant segment: a portion of the genome that matches the reference genome.
- Call: an identified occurrence of a variant or a non-variant segment.
- INFO fields: optional fields that may or may not accompany variant, non-variant-segment, and call records.
🍁 BigQuery Terms
- Simple fields: contain single, simple data elements.
- Nested fields: contain complex, structured data elements (records).
- Repeated fields: contain lists of data elements (arrays).
🍁 Variant Table Structure
An example schema of the variant table is given below for your reference.

Because BigQuery limits each row to 100 MB, records are split automatically when a row approaches that limit.
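Once Variant Transforms has loaded your data, you can inspect the generated schema with the bq CLI; the project, dataset, and table names below are placeholders:

bq show --schema --format=prettyjson PROJECT_ID:DATASET.variants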
Sample Pipelines
🍁 Run GATK Best Practices
To run a pipeline using the GATK Best Practices, follow the steps below; a sketch of the key commands appears after this list.
- Make sure you have a Google Cloud account.
- Create a project if you have not already done so, and enable billing.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
- Install and initialize the Google Cloud CLI, the required gcloud components, and git.
- Create a Cloud Storage bucket using the gsutil mb command.
- Download the example files and run the pipeline using the sample data.
To run the pipeline, create two environment variables: one pointing to the Broad pipeline files and one pointing to the Cloud Storage bucket and folder for the output. Then change into the pipeline directory and run the pipeline.
- Use the gcloud beta lifesciences operations describe command to track the status of the pipeline.
- The operations describe command returns done: true when the pipeline has finished.
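Here is a minimal sketch of the surrounding commands; the bucket name and local path are placeholders, the environment variable names are assumptions rather than the exact names used by the example, and the pipeline-run command itself depends on the example files you downloaded:

# Create a Cloud Storage bucket for the pipeline output (placeholder name).
gsutil mb gs://my-gatk-output-bucket

# Environment variables pointing at the Broad pipeline files and the output folder (assumed names).
export GATK_GOOGLE_DIR=~/broad-pipeline                   # assumed local path to the downloaded pipeline files
export GATK_OUTPUT_DIR=gs://my-gatk-output-bucket/output

# Track a running pipeline; OPERATION_ID is returned when the pipeline is submitted.
gcloud beta lifesciences operations describe OPERATION_ID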
🍁 Sentieon DNAseq Pipeline
- Make sure you have a Google Cloud account.
- Create a project if you have not already done so, and enable billing.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
- Install and initialize the Google Cloud CLI, gcloud components, and git.
- Set up the local environment, install prerequisites, and download the pipeline script.
- Understand the input format and run the pipeline.
To run the pipeline, you provide a JSON configuration file describing the input FASTQ files, the reference files, and the output bucket. A recommended configuration is shown below for reference; after saving it, you can launch the pipeline as sketched after the configuration.
{
  "FQ1": "gs://my-bucket/sample1_1.fastq.gz",
  "FQ2": "gs://my-bucket/sample1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "PREEMPTIBLE_TRIES": "2",
  "NONPREEMPTIBLE_TRY": true,
  "STREAM_INPUT": "True",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "EMAIL": "EMAIL_ADDRESS"
}
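Here is a hedged sketch of launching the pipeline once the configuration is saved; the repository URL, script path, and file name are assumptions based on the Sentieon example repository and may differ from the current release:

# Download the pipeline script and install its prerequisites (repository URL is an assumption).
git clone https://github.com/sentieon/sentieon-google-genomics.git
cd sentieon-google-genomics
pip install -r requirements.txt

# Save the JSON configuration above as my_sample.json, then launch the pipeline.
python runner/sentieon_runner.py my_sample.json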
🍁 Run DeepVariant
To run DeepVariant, follow the steps mentioned below.
- Make sure you have a Google Cloud account.
- Create a project if you have not already done so, and enable billing.
- Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
- Install and initialize the Google Cloud CLI.
- Create a Compute Engine instance from the VM Instances page.
- To start DeepVariant, run the following command; the environment variables it references are described after the command.
sudo docker run \
  -v "${DATA_DIR}":"/input" \
  -v "${OUTPUT_DIR}":"/output" \
  gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="/input/${REF}" \
  --reads="/input/${BAM}" \
  --output_vcf="/output/${OUTPUT_VCF}" \
  --output_gvcf="/output/${OUTPUT_GVCF}" \
  --regions chr20 \
  --num_shards=$(nproc) \
  --intermediate_results_dir /output/intermediate_results_dir
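The command above assumes that several environment variables have already been exported on the instance. A minimal sketch with placeholder values (the file names and version are examples only):

export BIN_VERSION=1.0.0               # DeepVariant release to pull (placeholder version)
export DATA_DIR=~/deepvariant-input    # directory holding the reference and BAM files
export OUTPUT_DIR=~/deepvariant-output # directory for the output VCF/gVCF
export REF=reference.fasta             # reference FASTA inside DATA_DIR (placeholder name)
export BAM=sample.bam                  # aligned reads inside DATA_DIR (placeholder name)
export OUTPUT_VCF=output.vcf.gz
export OUTPUT_GVCF=output.g.vcf.gz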