Table of contents
1. Introduction
2. Create a Dataflow pipeline using Java
2.1. Step-by-step guidance
2.2. Get the pipeline code
2.3. Run the pipeline locally
2.4. Run the pipeline on the Dataflow service
2.5. View your results
3. Create a streaming pipeline using a Dataflow template
3.1. Create a BigQuery dataset and table
3.2. Run the pipeline
3.3. View your results
4. Pipeline fundamentals for the Apache Beam SDKs
5. Deploying a pipeline
5.1. Pipeline lifecycle: from pipeline code to Dataflow job
5.2. Execution Graph
5.3. Parallelization and distribution
5.4. Fusion optimization
6. Setting pipeline options
6.1. Setting pipeline options programmatically
6.2. Accessing the pipeline options object
6.3. Launching on Dataflow
6.4. Launching locally
6.5. Creating custom pipeline options
7. Developing with Apache Beam notebooks
8. Get started with Google-provided templates
8.1. WordCount
9. Stop a running pipeline
9.1. Cancel a job
9.2. Drain a job
9.3. Force cancel a job
10. Pipeline troubleshooting and debugging
10.1. Determine the cause of a pipeline failure
10.2. Detect graph or pipeline construction errors
10.3. Detect errors in Dataflow job validation
11. Frequently asked questions
11.1. What is Cloud Dataflow used for?
11.2. What are Dataflow and Dataproc in GCP?
11.3. What is Dataflow equivalent in AWS?
11.4. Is Dataproc fully managed?
12. Conclusion
Last Updated: Mar 27, 2024

Dataflow

Author Komal Shaw

Introduction

In the modern world, data can be interpreted and extracted in many different ways. To address this need, Google Cloud Platform (GCP) offers three key solutions for data processing and warehousing: Dataproc, Dataflow, and Dataprep. Together they provide a range of ETL capabilities that meet a variety of client demands.

Google Cloud Dataflow

Dataflow supports both batch and stream processing of data. It creates a new pipeline for data processing and provisions or removes compute resources as needed.

Dataflow's main goal is to simplify working with Big Data. It combines the programming framework and the execution framework to achieve parallelization. In Dataflow, cluster resources do not sit idle; the cluster is instead continuously monitored and re-optimized.

Create a Dataflow pipeline using Java

Step-by-step guidance

  1. Choose or create a Google Cloud project from the project selector page in the Google Cloud console.
     
  2. Make sure that billing is enabled for your Cloud project.
     
  3. Enable the APIs for Dataflow, Compute Engine, Cloud Logging, BigQuery, Cloud Pub/Sub, Cloud Datastore, Cloud Storage, Google Cloud Storage JSON, and Cloud Resource Manager.
     
  4. Create a service account.
    In the console, go to the Create service account page.
    Select your project.
    Enter a name in the Service account name field. Based on this name, the console fills in the Service account ID field.
    Enter a description in the Service account description field. For example: Service account for quickstart.
    Click Create and continue.
    To provide access to your project, grant your service account the following role: Project > Owner.
    Choose a role from the list under "Select a role."
    Click Continue.
    Click Done to finish creating the service account.
     
  5. Create a service account key:
    In the console, click the email address for the service account that you created.
    Click Keys.
    Click Add key, and then click Create new key.
    Click Create. A JSON key file is downloaded to your computer.
    Click Close.
     
  6. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file that contains your service account key. This variable applies only to your current shell session, so if you open a new session, you must set the variable again.
     
  7. Create a Cloud Storage bucket.

Get the pipeline code

The Apache Beam SDK is an open-source programming framework for building data processing pipelines. You define a pipeline with an Apache Beam program and then choose a runner, such as Dataflow, to execute it.
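
For orientation, the following is a minimal sketch, in Java, of what such a Beam program can look like. It is not the exact WordCount example shipped with the SDK: the input path points at the public Shakespeare sample used by the Beam examples, and the output path is a placeholder you would replace with your own bucket.

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
  public static void main(String[] args) {
    // Build pipeline options from command-line arguments (for example, --runner=DataflowRunner).
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
        // Split each line into individual words.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply(Filter.by((String word) -> !word.isEmpty()))
        // Count the occurrences of each unique word.
        .apply(Count.perElement())
        // Format each word/count pair as a line of text.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()))
        // Placeholder output location; replace with your own bucket.
        .apply("WriteCounts", TextIO.write().to("gs://YOUR_BUCKET/counts"));

    p.run().waitUntilFinish();
  }
}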

Run the pipeline locally

In your shell or terminal, run the WordCount pipeline locally from your word-count-beam directory.

The output files are written to the word-count-beam directory and have the prefix counts. They contain the unique words from the input text and the number of occurrences of each word.

Run the pipeline on the Dataflow service

In your shell or terminal, build and run the WordCount pipeline on the Dataflow service from your word-count-beam directory.

Replace the following:

  1. PROJECT_ID: your Cloud project ID
  2. BUCKET_NAME: the name of your Cloud Storage bucket
  3. REGION: a Dataflow regional endpoint, like us-central1

View your results

  1. In the console, go to the Dataflow Jobs page.
  2. In the console, go to the Cloud Storage Browser page.
  3. Click the storage bucket that you created.
  4. The Bucket details page shows the output files and staging files that your Dataflow job created.

Create a streaming pipeline using a Dataflow template

This quickstart demonstrates how to build a streaming pipeline using a Dataflow template offered by Google. This quickstart specifically uses the Pub/Sub Topic to BigQuery template as an illustration.

The Pub/Sub Topic to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Pub/Sub topic and writes them to a BigQuery table.
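
For context, the following sketch uses the Pub/Sub Java client library to publish one JSON-formatted message that such a pipeline could consume. The project ID, topic ID, and JSON fields are placeholders; the JSON keys must match the column names of the destination BigQuery table.

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishJsonMessage {
  public static void main(String[] args) throws Exception {
    // Placeholder project and topic IDs; replace with your own.
    TopicName topic = TopicName.of("my-project-id", "my-topic");
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      // The JSON keys should match the columns of the target BigQuery table.
      String json = "{\"name\": \"Alice\", \"score\": 42}";
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8(json))
          .build();
      // publish() returns a future; get() blocks until the server assigns a message ID.
      String messageId = publisher.publish(message).get();
      System.out.println("Published message " + messageId);
    } finally {
      publisher.shutdown();
    }
  }
}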

Create a BigQuery dataset and table

Create a BigQuery dataset and table with the appropriate schema for your Pub/Sub topic using the console.
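
The console is the simplest way to do this. If you prefer code, the following sketch uses the BigQuery Java client library to create a dataset and a table whose schema matches the sample JSON message above; the dataset, table, and field names are placeholders.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.DatasetInfo;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateDatasetAndTable {
  public static void main(String[] args) {
    // Uses Application Default Credentials and the default project.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder dataset name; replace with your own.
    bigquery.create(DatasetInfo.newBuilder("my_dataset").build());

    // The table schema must match the JSON messages published to the Pub/Sub topic.
    Schema schema = Schema.of(
        Field.of("name", StandardSQLTypeName.STRING),
        Field.of("score", StandardSQLTypeName.INT64));
    bigquery.create(TableInfo.of(
        TableId.of("my_dataset", "my_table"),
        StandardTableDefinition.of(schema)));
  }
}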

Run the pipeline

Run a streaming pipeline using the Google-provided Pub/Sub Topic to BigQuery template. The pipeline gets incoming data from the input topic.

View your results

Then view the data written to your BigQuery table in real time.

Pipeline fundamentals for the Apache Beam SDKs

On the Apache Beam website, you can find documentation on:

  1. How to design your pipeline: demonstrates how to plan your pipeline, choose the appropriate data transforms, and determine your input and output methods.
  2. How to create your pipeline: explains how to use the classes in the Beam SDKs and the steps needed to construct a pipeline.
  3. How to test your pipeline: presents best practices for testing your pipelines (see the sketch after this list).
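
As a small illustration of that testing guidance, the sketch below uses the Beam testing utilities TestPipeline and PAssert with JUnit 4 to check a word-count transform on the direct runner; the class and test names are illustrative.

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class WordCountTest {
  // TestPipeline runs the pipeline locally inside the test.
  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  public void countsWords() {
    PCollection<KV<String, Long>> counts =
        p.apply(Create.of("hi", "there", "hi"))
            .apply(Count.perElement());

    // Assert on the full contents of the output PCollection.
    PAssert.that(counts).containsInAnyOrder(KV.of("hi", 2L), KV.of("there", 1L));

    p.run().waitUntilFinish();
  }
}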

Deploying a pipeline

After you construct and test your Apache Beam pipeline, you can use the Dataflow managed service to deploy and execute it. Once on the Dataflow service, your pipeline code becomes a Dataflow job.

Pipeline lifecycle: from pipeline code to Dataflow job

When you run your Dataflow pipeline, Dataflow generates an execution graph from the code that constructs your Pipeline object, including all of the transforms and their associated processing functions (such as DoFns). This stage, referred to as graph construction time, runs locally on the computer where the pipeline program executes.
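
A rough sketch of that split, assuming the usual Beam imports and placeholder Cloud Storage paths: the transforms applied in your main method only build the graph, and no data is processed until run() hands the graph to the chosen runner.

// Graph construction time: applying transforms only builds the execution graph.
Pipeline p = Pipeline.create(options);
p.apply("ReadLines", TextIO.read().from("gs://YOUR_BUCKET/input.txt"))
    .apply("ToUpper", MapElements.into(TypeDescriptors.strings())
        .via((String line) -> line.toUpperCase()))
    .apply("WriteLines", TextIO.write().to("gs://YOUR_BUCKET/output"));

// Execution time: run() submits the graph to the runner (for example, Dataflow),
// and the processing code inside the transforms executes on the workers.
p.run().waitUntilFinish();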

Execution Graph

Based on the transforms and data you used when constructing your Pipeline object, Dataflow builds a graph of steps that represents your pipeline, known as the execution graph.

Parallelization and distribution

The Dataflow service automatically parallelizes and distributes your pipeline's processing logic among the workers you have allotted to run your job. For example, for your ParDo transforms, Dataflow automatically distributes your processing code (represented by DoFns) to multiple workers so that it runs in parallel. Dataflow uses the abstractions of the programming model to express parallel processing functions.
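
As a concrete illustration, the DoFn below splits lines of text into words. Because each element is processed independently, Dataflow can ship this code to many workers and run it in parallel. Here, lines is assumed to be a PCollection<String>, and imports are omitted.

// Dataflow distributes this DoFn to workers and invokes it on elements in parallel.
PCollection<String> words = lines.apply("ExtractWords",
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(@Element String line, OutputReceiver<String> out) {
        for (String word : line.split("[^\\p{L}]+")) {
          if (!word.isEmpty()) {
            out.output(word);
          }
        }
      }
    }));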

Fusion optimization

After validating the JSON representation of your pipeline's execution graph, the Dataflow service may modify the graph. These optimizations can include fusing multiple steps or transforms into single steps. Fusion means the Dataflow service does not have to materialize every intermediate PCollection in your pipeline, which can be costly in terms of memory and processing overhead.

Setting pipeline options

The pipeline runs separately from your Apache Beam program. Your Apache Beam program constructs the pipeline for deferred execution: it builds up a series of steps that any supported Apache Beam runner can execute. Compatible runners include the Dataflow runner on Google Cloud and the direct runner, which executes the pipeline directly in a local environment.

Setting pipeline options programmatically

You can set pipeline options programmatically by creating and modifying a PipelineOptions object.

In Java, construct a PipelineOptions object using the method PipelineOptionsFactory.fromArgs, as in the sketch below.
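
A minimal sketch, assuming args is the String[] passed to your program's main method:

// Build options from command-line arguments such as --project=my-project-id;
// withValidation() checks for required arguments and validates their values.
PipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().create();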

Accessing the pipeline options object

In Java, you can access PipelineOptions inside any ParDo's DoFn instance by using the method ProcessContext.getPipelineOptions, as in the sketch below.
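
A minimal sketch, assuming input is a PCollection<String> and MyOptions is the custom options interface shown later in this article:

PCollection<String> output = input.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Read a pipeline option at execution time, on the worker.
        MyOptions myOptions = c.getPipelineOptions().as(MyOptions.class);
        c.output(c.element() + " " + myOptions.getMyCustomOption());
      }
    }));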

Launching on Dataflow

Setting pipeline options programmatically

In Java:

// Create and set your PipelineOptions.
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
// For cloud execution, set the Google Cloud project, staging location,
// and set DataflowRunner.
options.setProject("my-project-id");
options.setStagingLocation("gs://my-bucket/binaries");
options.setRunner(DataflowRunner.class);

// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);

 

Using pipeline options from the command line

// Set your PipelineOptions to the specified command-line options
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);

// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);

Launching locally

In Java:

// Create and set our Pipeline Options.
PipelineOptions options = PipelineOptionsFactory.create();
// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);

Creating custom pipeline options

In Java:

public interface MyOptions extends PipelineOptions {
  String getMyCustomOption();
  void setMyCustomOption(String myCustomOption);
}
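
To make the custom option settable from the command line (for example, --myCustomOption=someValue), the usual pattern is to register the interface and then parse the arguments into it. A minimal sketch:

public static void main(String[] args) {
  // Register the interface so --help can describe the custom option.
  PipelineOptionsFactory.register(MyOptions.class);

  // --myCustomOption=someValue on the command line populates the option.
  MyOptions options =
      PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);

  Pipeline p = Pipeline.create(options);
  // ... apply transforms and run the pipeline ...
}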

Developing with Apache Beam notebooks

Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and work with individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through Vertex AI Workbench user-managed notebooks, a service that hosts notebook virtual machines pre-installed with the latest data science and machine learning frameworks.
 

Get started with Google-provided templates

Google provides a set of open source Dataflow templates.

WordCount

The WordCount template is a batch pipeline that reads text from Cloud Storage, tokenizes the text lines into individual words, and performs a frequency count on each of the words.

Stop a running pipeline

You cannot delete a Dataflow job; you can only stop it.

To stop a Dataflow job, you can use the Google Cloud console, Cloud Shell, a local terminal with the Google Cloud CLI installed, or the Dataflow REST API.

Cancel a job

The following actions occur when you cancel a job:

  1. The Dataflow service halts all data ingestion and data processing.
  2. The Dataflow service begins cleaning up the Google Cloud resources that are attached to your job.
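
For jobs launched from Java code, cancellation can also be requested programmatically through the PipelineResult handle returned by run(). A minimal sketch, with imports omitted:

// run() returns a PipelineResult handle for the submitted Dataflow job.
PipelineResult result = p.run();

// ... later, request cancellation: ingestion and processing stop, and cleanup begins.
try {
  result.cancel();
} catch (IOException e) {
  // Cancellation can fail, for example if the job has already finished.
  e.printStackTrace();
}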

Drain a job

When you drain a job, the Dataflow service finishes your job in its current state. If you want to prevent data loss as you bring down your streaming pipelines, the best option is to drain your job.

Force cancel a job

When you force cancel a job, the Dataflow service stops the job immediately, but it might leak any VMs that the Dataflow job created. You must attempt a regular cancel at least 30 minutes before force canceling a job.

Pipeline troubleshooting and debugging

Dataflow provides real-time feedback about your job, and there is a basic set of steps you can use to check error messages, logs, and conditions such as stalled job progress.

Determine the cause of a pipeline failure

Graph or pipeline construction errors: These errors happen when Dataflow has trouble constructing the pipeline's step-by-step graph, as specified by your Apache Beam pipeline.

Errors in job validation: Any pipeline job you start is validated by the Dataflow service. Your job may not be successfully created or run if there are validation errors. Issues with your project's permissions or the Cloud Storage bucket for your Google Cloud project can result in validation issues.

Exceptions in worker code: These issues happen when user-provided code, such as the DoFn instances of a ParDo transform, which Dataflow distributes to parallel workers, has flaws or defects.

Errors caused by transient failures in other Google Cloud services: A short outage or other issue with one of the Google Cloud services, such as Compute Engine or Cloud Storage, which Dataflow depends on, could cause your pipeline to crash.

Detect graph or pipeline construction errors

A graph or pipeline construction error can occur while Dataflow builds the execution graph for your pipeline from the code in your Dataflow program. During graph construction, Dataflow checks for illegal operations.

Keep in mind that if Dataflow detects an error during graph construction, no job is created on the Dataflow service, so no feedback appears in the Dataflow monitoring interface.

Detect errors in Dataflow job validation

Once the Dataflow service has received your pipeline's graph, the service attempts to validate your job. This validation includes the following:

Making sure that the service can access the Cloud Storage buckets associated with your job for file staging and temporary output.

Verifying that your Google Cloud project has the necessary permissions.

Making sure that input and output sources, such as files, are accessible to the service.

Frequently asked questions

What is Cloud Dataflow used for?

Dataflow is a managed service for executing a wide variety of data processing patterns.

What are Dataflow and Dataproc in GCP?

Dataproc is a Google Cloud service for running Apache Spark and Apache Hadoop workloads, commonly used for data science and ML. In comparison, Dataflow supports both batch and stream processing of data.

What is Dataflow equivalent in AWS?

Dataflow is roughly equivalent to Amazon Elastic MapReduce (EMR) or AWS Batch.

Is Dataproc fully managed?

Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.

Conclusion

In this article, we have extensively discussed Dataflow. We hope this blog has helped you enhance your knowledge regarding Dataflow.

If you want to learn more, check out our articles on Introduction to Cloud Monitoring, Overview of log-based metrics, and Cloud Logging in GCP.

Refer to our guided paths on Coding Ninjas Studio to learn more about DSA, Competitive Programming, JavaScript, System Design, etc.

Enroll in our courses and refer to the mock test and problems available.

Take a look at the interview experiences and interview bundle for placement preparations.

Do upvote our blog to help other ninjas grow.

Happy Coding!
