Table of contents
1. Introduction
2. Cloud Life Sciences (beta)
2.1. What’s new in Cloud Life Sciences (beta)
2.1.1. 🍁 Changes in REST and RPC paths
2.1.2. 🍁 Changes in Google Cloud CLI
2.1.3. 🍁 Changes in IAM
2.1.4. 🍁 Changes in Migration Requests and Responses
2.2. Running Interval JOINs with BigQuery
2.2.1. Querying an inline table
2.2.2. Querying a Materialized Table
2.3. Variant Transforms Tool
2.4. Understanding the BigQuery Variants Schema
2.4.1. 🍁 Genomics Nomenclature
2.4.2. 🍁 BigQuery Terms
2.4.3. 🍁 Variant Table Structure
2.5. Sample Pipelines
2.5.1. 🍁 Run GATK Best Practices
2.5.2. 🍁 Sentieon DNAseq Pipeline
2.6. Run DeepVariant
3. Frequently Asked Questions
3.1. Name some versions of Sentieon DNAseq available.
3.2. What are read groups?
3.3. Define preemptible instances.
3.4. Tell me something about Duplicate Marking.
4. Conclusion
Last Updated: Mar 27, 2024

Cloud Life Sciences (beta)

Author Rupal Saluja

Introduction

Did you know that, alongside a set of management tools, Google Cloud Platform (GCP) provides several modular cloud services, including data storage, computing, data analytics, and machine learning?

Google Cloud Platform provides Infrastructure as a Service, Platform as a Service, and a few serverless computing environments. Google also offers several other types of products, such as API platforms, IoT services, identity and security, and management tools.

 

Google Cloud Platform

Cloud Life Sciences (beta)

Formerly known as Google Genomics, Cloud Life Sciences enables the life sciences community to store, manage, and process biomedical data at scale. It is cost-effective and supported by a growing partner ecosystem, which has made it quite popular. Its availability to institutions, support for leading workflow engines, and information security and compliance have extended its reach.

Cloud Life Sciences

Before starting the whole process, make sure to set up a project by following the steps below.

  1. Create a Google Cloud account, then create a Google Cloud project from the Console page.
  2. Ensure that billing and the Cloud Life Sciences API are enabled.
  3. Familiarize yourself with the service accounts and default roles involved; removing or modifying them can lead to errors that are hard to troubleshoot.
  4. Download, install, and initialize the Google Cloud CLI.
  5. Finally, download credentials using the Create Credentials option on the Console page so that you can access the APIs without problems.
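The setup steps above can be sketched with the gcloud CLI. PROJECT_ID below is a placeholder; substitute your own project name:

```shell
# Authenticate and create a project (PROJECT_ID is a placeholder).
gcloud auth login
gcloud projects create PROJECT_ID
gcloud config set project PROJECT_ID

# Enable the Cloud Life Sciences API for the project.
gcloud services enable lifesciences.googleapis.com
```

Billing must still be linked to the project through the Console before the API can be used.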

What’s new in Cloud Life Sciences (beta)

Cloud Life Sciences (beta) is a regionalized service, which means you can choose the location where your data is processed and stored. This differs from Google Genomics, which was a global service that could not be restricted to specific locations. Once you select the location where the Cloud Life Sciences API runs, the metadata for each operation is stored in that location.

To learn more about how to make requests and specify the location, pay attention to the information below.

🍁 Changes in REST and RPC paths

All paths now use lifesciences.googleapis.com instead of genomics.googleapis.com, and you must specify a Google Cloud location, which was not necessary before.

For Example:

GET https://lifesciences.googleapis.com/v2beta/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID
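A request like the one above can be issued from the command line. This is a sketch, with PROJECT_ID and OPERATION_ID as placeholders and us-central1 as an example location:

```shell
# Fetch operation metadata from the Cloud Life Sciences v2beta API.
# PROJECT_ID and OPERATION_ID are placeholders; us-central1 is one example location.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://lifesciences.googleapis.com/v2beta/projects/PROJECT_ID/locations/us-central1/operations/OPERATION_ID"
```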

🍁 Changes in Google Cloud CLI

All Cloud Life Sciences gcloud CLI commands now use gcloud beta lifesciences instead of gcloud alpha genomics. A machine type flag is now used in place of the separate CPU and memory flags.

For Example: gcloud beta lifesciences operations describe OPERATION_ID

🍁 Changes in IAM

The namespace in Identity and Access Management (IAM) roles and permissions has changed from genomics to lifesciences. The role roles/genomics.pipelineRunner has been renamed roles/lifesciences.workflowsRunner, and the permission genomics.pipelines.run has been renamed lifesciences.workflows.run.

🍁 Changes in Migration Requests and Responses

The migration of requests and responses mainly involves renamed fields and changes to field structure.

  🔥 In the Action section, the name field changed to containerName.

  🔥 In the Event section, event details are now stored in a oneof field rather than in a protobuf field.

  🔥 In the Network section, the name field changed to network.

  🔥 The Resources section no longer takes a projectId field; the project ID is instead derived from the request URL by the operation itself.
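As a hypothetical illustration (not official migration code), the field renames above can be applied to v2alpha1-style request dictionaries like this:

```python
def migrate_action(action: dict) -> dict:
    """Rename the v2alpha1 Action 'name' field to 'containerName'.

    Illustrative only: covers just the renames described above.
    """
    migrated = dict(action)
    if "name" in migrated:
        migrated["containerName"] = migrated.pop("name")
    return migrated


def migrate_network(network: dict) -> dict:
    """Rename the Network 'name' field to 'network'."""
    migrated = dict(network)
    if "name" in migrated:
        migrated["network"] = migrated.pop("name")
    return migrated


def migrate_resources(resources: dict) -> dict:
    """Drop projectId: in v2beta it is taken from the request URL."""
    migrated = dict(resources)
    migrated.pop("projectId", None)
    return migrated
```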

Running Interval JOINs with BigQuery

BigQuery can be used to run JOIN queries on variants whose data is described by genomic region intervals, or overlaps.

Running interval joins

Querying an inline table

You can run the following query in the New Query field to query an inline table.

#standardSQL
WITH
  -- Retrieve the variants in this cohort, flattening by alternate bases and counting affected alleles.
  variants AS (
  SELECT
    REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt,
    (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
    (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
  FROM
    `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
    UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
    
  -- Define an inline table that uses five rows selected from silver-wall-555.TuteTable.hg19.
  intervals AS (
    SELECT * FROM UNNEST ([
    STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64>
    ('PRCC', '1', 156736274, 156771607, 156636274, 156871607),
    ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640),
    ('PAX8', '2', 113972574, 114037496, 113872574, 114137496),
    ('FHIT', '3', 59734036, 61238131, 59634036, 61338131),
    ('PPARG', '3', 12328349, 12476853, 12228349, 12576853)
  ])),
  
  -- JOIN the variants with the genomic intervals overlapping the genes of interest.
  -- The JOIN criteria is complicated. With standard SQL you can use complex JOIN predicates, including arbitrary expressions.
  gene_variants AS (
  SELECT
    reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles
  FROM
    variants
  INNER JOIN
    intervals ON
    variants.reference_name = intervals.Chr
    AND intervals.region_start <= variants.start_position
    AND intervals.region_end >= variants.end_position )

  -- And finally JOIN the variants in the regions of interest with annotations for rare variants.
SELECT DISTINCT
  Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles
FROM
  `silver-wall-555.TuteTable.hg19` AS annots
INNER JOIN
  gene_variants AS vars
ON
  vars.reference_name = annots.Chr
  AND vars.start_position = annots.Start
  AND vars.reference_bases = annots.Ref
  AND vars.alt = annots.Alt
WHERE
  -- Retrieve annotations for rare variants only.
  PopFreqMax <= 0.01
ORDER BY
  Chr,
  Start;
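The JOIN predicate above keeps a variant when its coordinates fall entirely inside a gene's flanking region. The same test can be expressed as a small helper (a hypothetical function, shown here only to make the predicate concrete):

```python
def variant_in_region(variant_start: int, variant_end: int,
                      region_start: int, region_end: int) -> bool:
    """Mirror the SQL JOIN predicate:
    region_start <= start_position AND region_end >= end_position."""
    return region_start <= variant_start and region_end >= variant_end


# Example: a variant at the start of PRCC falls inside its flanking region
# (156636274..156871607, from the inline table above).
print(variant_in_region(156736274, 156736275, 156636274, 156871607))
```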

 

Querying a Materialized Table

You can run the following query in the New Query field to query a materialized table.

#standardSQL
WITH
  -- Retrieve the variants in this cohort, flattening by alternate bases and counting affected alleles.
  variants AS (
  SELECT
    REPLACE(reference_name, 'chr', '') as reference_name, start_position, end_position, reference_bases, alternate_bases.alt AS alt,
    (SELECT COUNTIF(gt = alt_offset+1) FROM v.call call, call.genotype gt) AS num_variant_alleles,
    (SELECT COUNTIF(gt >= 0) FROM v.call call, call.genotype gt) AS total_num_alleles
  FROM
    `bigquery-public-data.human_genome_variants.platinum_genomes_deepvariant_variants_20180823` v,
    UNNEST(v.alternate_bases) alternate_bases WITH OFFSET alt_offset ),
 
  -- JOIN the variants with the genomic intervals overlapping the genes of interest.
  -- The JOIN criteria is complicated.  With standard SQL, you can use complex JOIN predicates, including arbitrary expressions.
  gene_variants AS (
  SELECT
    reference_name, start_position, reference_bases, alt, num_variant_alleles, total_num_alleles
  FROM
    variants
  INNER JOIN
    `genomics.myIntervalTable` AS intervals ON
    variants.reference_name = intervals.Chr
    AND intervals.region_start <= variants.start_position
    AND intervals.region_end >= variants.end_position )

  -- And finally, JOIN the variants in the regions of interest with annotations for rare variants.
SELECT DISTINCT
  Chr, annots.Start AS Start, Ref, annots.Alt, Func, Gene, PopFreqMax, ExonicFunc, num_variant_alleles, total_num_alleles
FROM
  `silver-wall-555.TuteTable.hg19` AS annots
INNER JOIN
  gene_variants AS vars
ON
  vars.reference_name = annots.Chr
  AND vars.start_position = annots.Start
  AND vars.reference_bases = annots.Ref
  AND vars.alt = annots.Alt
WHERE
  -- Retrieve annotations for rare variants only.
  PopFreqMax <= 0.01
ORDER BY
  Chr,
  Start;
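The query above assumes an interval table named genomics.myIntervalTable already exists. One hypothetical way to materialize it, using the same five intervals as the inline version (and assuming a dataset named genomics has been created), is a CREATE TABLE AS SELECT statement:

```sql
#standardSQL
-- Illustrative: materialize the five intervals used earlier into genomics.myIntervalTable.
CREATE TABLE `genomics.myIntervalTable` AS
SELECT * FROM UNNEST ([
  STRUCT<Gene STRING, Chr STRING, gene_start INT64, gene_end INT64, region_start INT64, region_end INT64>
  ('PRCC', '1', 156736274, 156771607, 156636274, 156871607),
  ('NTRK1', '1', 156785541, 156852640, 156685541, 156952640),
  ('PAX8', '2', 113972574, 114037496, 113872574, 114137496),
  ('FHIT', '3', 59734036, 61238131, 59634036, 61338131),
  ('PPARG', '3', 12328349, 12476853, 12228349, 12576853)
]);
```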

Variant Transforms Tool

Variant Transforms is an open-source tool, based on Apache Beam, that is used with Cloud Life Sciences. It can transform and load hundreds of thousands of files, millions of samples, and billions of records. The Variant Transforms preprocessor validates VCF files and identifies inconsistencies. The typical workflow stores raw VCF files in Cloud Storage and uses the tool to load them from Cloud Storage into BigQuery, where the variants can then be analyzed.
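A run of the tool might look like the following sketch. The image path and vcf_to_bq subcommand are taken from the tool's public documentation and may change; the bucket, project, and dataset names are placeholders:

```shell
# Illustrative: load VCF files from Cloud Storage into a BigQuery table
# using the Variant Transforms Docker image.
docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  vcf_to_bq \
  --input_pattern gs://my-bucket/vcfs/*.vcf \
  --output_table my-project:my_dataset.my_variants_table \
  --project my-project \
  --temp_location gs://my-bucket/temp
```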

Understanding the BigQuery Variants Schema

Formation of BigQuery Table

🍁 Genomics Nomenclature

  • Sample- The DNA collected and processed for a single person.
  • Reference name- The name of the reference segment of the DNA.
  • Variant- A region of the genome that differs from the reference genome.
  • Non-variant segment- A portion of the genome that is the same as the reference genome.
  • Call- An identified occurrence of a variant or non-variant segment.
  • INFO fields- Optional fields that may accompany variant, non-variant, and call records.

 

🍁 BigQuery Terms

  • Simple fields- Contain individual data elements.
  • Nested fields- Contain complex data elements.
  • Repeated fields- Contain repeated data elements.

 

🍁 Variant Table Structure

An example schema of the variant table is given below for your reference.
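As a rough illustration of the simple, nested, and repeated field kinds, a fragment of a variants-style BigQuery schema might look like the following. The field names echo those used in the queries earlier in this article, but this is an illustrative fragment, not the exact schema:

```json
[
  {"name": "reference_name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "start_position", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "end_position", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "alternate_bases", "type": "RECORD", "mode": "REPEATED",
   "fields": [
     {"name": "alt", "type": "STRING", "mode": "NULLABLE"}
   ]},
  {"name": "call", "type": "RECORD", "mode": "REPEATED",
   "fields": [
     {"name": "name", "type": "STRING", "mode": "NULLABLE"},
     {"name": "genotype", "type": "INTEGER", "mode": "REPEATED"}
   ]}
]
```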

Because BigQuery has a limit of 100 MB per row, records are split automatically when a row approaches this limit.

Sample Pipelines

🍁 Run GATK Best Practices

To run a pipeline using GATK Best practices, you need to follow the steps mentioned below.

  1. Make sure you have a Google Cloud account.
  2. Create a project if you have not already, and enable billing.
  3. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
  4. Install and initialize the Google Cloud CLI, the required gcloud components, and git.
  5. Create a Cloud Storage bucket using the gsutil mb command.
  6. Download the example files and run the pipeline using the sample data.

 

To run the pipeline, create two environment variables: one pointing to the Broad pipeline files and one pointing to the Cloud Storage bucket and folder for the output. Then change into the directory from which you will run the pipeline.

  1. Use the gcloud beta lifesciences operations describe command to track the status of the pipeline.
  2. The operations describe command returns done: true once the pipeline has finished.
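Status tracking can be sketched as follows (OPERATION_ID is a placeholder; the --format flag is a standard gcloud option for extracting a single field):

```shell
# Inspect a running Cloud Life Sciences pipeline.
gcloud beta lifesciences operations describe OPERATION_ID

# Print only the completion flag; this outputs "True" once the pipeline finishes.
gcloud beta lifesciences operations describe OPERATION_ID \
  --format='value(done)'
```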

 

🍁 Sentieon DNAseq Pipeline

  1. Make sure you have a Google Cloud account.
  2. Create a project if you have not already, and enable billing.
  3. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
  4. Install and initialize the Google Cloud CLI, the required gcloud components, and git.
  5. Set up the local environment, install the prerequisites, and download the pipeline script.
  6. Understand the input format and run the pipeline.

 


A recommended configuration is shown below for your reference.

{
  "FQ1": "gs://my-bucket/sample1_1.fastq.gz",
  "FQ2": "gs://my-bucket/sample1_2.fastq.gz",
  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
  "OUTPUT_BUCKET": "gs://BUCKET",
  "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
  "PREEMPTIBLE_TRIES": "2",
  "NONPREEMPTIBLE_TRY": true,
  "STREAM_INPUT": "True",
  "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
  "PROJECT_ID": "PROJECT_ID",
  "EMAIL": "EMAIL_ADDRESS"
}
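Before submitting the pipeline, a quick local sanity check of the JSON configuration can catch missing keys. This is a hypothetical helper; the required-key list is an assumption drawn from the sample configuration above, not from the pipeline's own validation:

```python
import json

# Keys taken from the sample configuration above; treat this list as illustrative.
REQUIRED_KEYS = {"FQ1", "FQ2", "REF", "OUTPUT_BUCKET", "ZONES", "PROJECT_ID"}


def missing_keys(config_text: str) -> set:
    """Return the required keys absent from a pipeline config JSON string."""
    config = json.loads(config_text)
    return REQUIRED_KEYS - config.keys()
```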

Run DeepVariant

To run DeepVariant, you need to follow the steps mentioned below.

  1. Make sure you have a Google Cloud account.
  2. Create a project if you have not already, and enable billing.
  3. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
  4. Install and initialize the Google Cloud CLI.
  5. Create a Compute Engine instance from the VM Instances page.
  6. Run the following command to start DeepVariant.

 

sudo docker run \
    -v "${DATA_DIR}":"/input" \
    -v "${OUTPUT_DIR}:/output" \
    gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}"  \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref="/input/${REF}" \
    --reads="/input/${BAM}" \
    --output_vcf=/output/${OUTPUT_VCF} \
    --output_gvcf=/output/${OUTPUT_GVCF} \
    --regions chr20 \
    --num_shards=$(nproc) \
    --intermediate_results_dir /output/intermediate_results_dir
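The command above assumes several environment variables are already set. A minimal sketch follows; the version number, directories, and file names are placeholders, not recommendations, so substitute a current DeepVariant release and your own input files:

```shell
# Placeholder values: replace with a current DeepVariant release and your own files.
BIN_VERSION="1.0.0"
DATA_DIR="${HOME}/deepvariant/input"
OUTPUT_DIR="${HOME}/deepvariant/output"
REF="ref.fa"
BAM="reads.bam"
OUTPUT_VCF="output.vcf.gz"
OUTPUT_GVCF="output.g.vcf.gz"

# Create the input and output directories mounted into the container.
mkdir -p "${DATA_DIR}" "${OUTPUT_DIR}"
```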

Frequently Asked Questions

Name some versions of Sentieon DNAseq available.

Some available versions of Sentieon DNAseq are:

  • 201711.01
  • 201711.02
  • 201711.03
  • 201711.04
  • 201711.05
  • 201808
  • 201808.01
  • 201808.03
  • 201808.05
  • 201808.06
  • 201808.07
  • 201808.08

 

What are read groups?

Read groups contain sample metadata. They are added when FASTQ files are aligned with a reference genome.

Define preemptible instances.

Preemptible instances can be enabled in the pipeline using the PREEMPTIBLE_TRIES JSON key, which sets the number of times the pipeline is attempted using preemptible instances.

Tell me something about Duplicate Marking.

By default, the pipeline removes duplicate reads from the BAM files. This behavior can be changed by setting the DEDUP JSON key as shown below.

"DEDUP": "markdup"

Conclusion

In a nutshell, we covered what Cloud Life Sciences (beta) is, what’s new in it, how to run interval JOINs with BigQuery, the Variant Transforms tool, and the BigQuery variants schema. We also walked through some sample pipelines and learned how to run DeepVariant on Cloud Life Sciences (beta).

We hope the above discussion helped you understand Cloud Life Sciences (beta) in clearer terms and can serve as a future reference whenever needed. If you want to see a comparison between AWS and GCP, see our GCP vs AWS comparison blog. If you are planning to get a GCP certification, check out our GCP Certifications blog. For a crystal-clear understanding of cloud computing, you can refer to our blogs on Cloud Computing Architecture, AWS Cloud Computing, Cloud Computing Infrastructure, and Cloud Delivery Models by clicking on the respective links.

Visit our website to read more such blogs. Make sure you enroll in our courses, take mock tests, solve the available problems, and try our interview puzzles. You can also explore our interview preparation material, including interview experiences and an interview bundle for placement preparation. Do upvote our blog to help fellow ninjas grow.

Happy Coding!
