Data can be interpreted and extracted in many ways in the modern world. To address this need, Google Cloud Platform (GCP) offers three key solutions for data processing and warehousing: Dataproc, Dataflow, and Dataprep, which together provide a wide range of ETL capabilities for different demands.
Dataflow supports both batch and stream processing of data. It creates a new pipeline for processing data and provisions or removes compute resources as needed.
Dataflow's main goal is to simplify Big Data processing. It combines the programming model and the execution framework to achieve parallelization. In Dataflow, cluster resources are not left sitting idle; instead, the cluster is continuously monitored and resized.
Create a Dataflow pipeline using Java
Step-by-step guidance
Choose or create a Google Cloud project from the project selector page in the Google Cloud console.
Make sure that billing is enabled for your Cloud project.
Enable the APIs for Dataflow, Compute Engine, Cloud Logging, BigQuery, Cloud Pub/Sub, Cloud Datastore, Cloud Storage, Google Cloud Storage JSON, and Cloud Resource Manager.
Create a service account. In the console, go to the Create service account page. Select your project. Enter a name in the Service account name field; based on this name, the console fills in the Service account ID field. Enter a description in the Service account description field, for example, "Service account for quickstart". Click Create and continue. To grant access to your project, give your service account the Project > Owner role: in the Select a role list, choose Project > Owner. Click Continue. Click Done to finish creating the service account.
Create a service account key: In the console, click the email address for the service account that you created. Click Keys. Click Add key, and then click Create new key. Click Create. A JSON key file is downloaded to your computer. Click Close.
Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file that contains your service account key. This variable applies only to your current shell session, so if you open a new session, you must set it again.
Create a Cloud Storage bucket.
Get the pipeline code
The Apache Beam SDK is an open-source programming framework for building data processing pipelines. You define a pipeline with an Apache Beam program and then choose a runner, such as Dataflow, to execute it.
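As a rough illustration rather than the quickstart's exact code, a minimal Beam pipeline in Java creates a Pipeline from options, applies transforms, and runs it. The input path below is Beam's public Shakespeare sample and the output prefix is a placeholder:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MinimalPipeline {
  public static void main(String[] args) {
    // Options come from the command line, so the same program can target the
    // direct runner locally or the Dataflow runner on Google Cloud.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Read lines of text and write them back out; a real pipeline would add
    // transforms between the read and the write.
    p.apply("ReadLines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
     .apply("WriteLines", TextIO.write().to("output"));

    p.run().waitUntilFinish();
  }
}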
Run the pipeline locally
In your shell or terminal, run the WordCount pipeline locally from your word-count-beam directory.
The output files are written to the word-count-beam directory and have the prefix counts. They contain the unique words from the input text and the number of occurrences of each word.
Run the pipeline on the Dataflow service
In your shell or terminal, build and run the WordCount pipeline on the Dataflow service from your word-count-beam directory.
Replace the following (the Java sketch after this list shows roughly how these values are used):
PROJECT_ID: your Cloud project ID
BUCKET_NAME: the name of your Cloud Storage bucket
REGION: a Dataflow regional endpoint, like us-central1
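The quickstart passes these values as command-line arguments when launching the job. As a hedged Java sketch, they map onto the Dataflow pipeline options roughly as follows; the staging and temp paths are illustrative choices:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowRunConfig {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    options.setProject("PROJECT_ID");                        // your Cloud project ID
    options.setRegion("REGION");                             // for example, us-central1
    options.setStagingLocation("gs://BUCKET_NAME/staging");  // staged binaries
    options.setGcpTempLocation("gs://BUCKET_NAME/temp");     // temporary files
    options.setRunner(DataflowRunner.class);

    // Pass these options to Pipeline.create(options), as in the later examples.
  }
}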
View your results
In the console, go to the Dataflow Jobs page.
In the console, go to the Cloud Storage Browser page.
Click the storage bucket that you created.
The Bucket details page shows the output files and staging files that your Dataflow job created.
Create a streaming pipeline using a Dataflow template
This quickstart demonstrates how to build a streaming pipeline using a Dataflow template offered by Google; specifically, it uses the Pub/Sub Topic to BigQuery template as an illustration.
A streaming pipeline called the Pub/Sub Topic to BigQuery template receives JSON-formatted messages from a Pub/Sub topic and publishes them to a BigQuery table.
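The template itself is provided by Google, but the same pattern can be sketched in Beam Java roughly as follows. The topic path, table spec, single "message" column, and the JSON-parsing step are all illustrative assumptions, not the template's actual source:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubSubToBigQuerySketch {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline p = Pipeline.create(options);

    p.apply("ReadMessages", PubsubIO.readStrings()
            .fromTopic("projects/PROJECT_ID/topics/TOPIC"))
     // Placeholder parsing step: convert each JSON message into a TableRow.
     .apply("JsonToTableRow", ParDo.of(new DoFn<String, TableRow>() {
        @ProcessElement
        public void processElement(@Element String json, OutputReceiver<TableRow> out) {
          // Real parsing (for example with a JSON library) goes here; this
          // sketch simply stores the raw message in a single column.
          out.output(new TableRow().set("message", json));
        }
     }))
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("PROJECT_ID:DATASET.TABLE")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}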
Create a BigQuery dataset and table
Create a BigQuery dataset and table with the appropriate schema for your Pub/Sub topic using the console.
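The quickstart creates these in the console. As a hedged alternative sketch using the google-cloud-bigquery Java client library, dataset and table creation could look like this; the dataset name, table name, and columns are illustrative and must match the fields your Pub/Sub messages actually carry:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.DatasetInfo;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateDatasetAndTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Create the dataset that will hold the output table.
    bigquery.create(DatasetInfo.newBuilder("taxirides").build());

    // The schema must mirror the JSON fields in the Pub/Sub messages; these
    // two columns are placeholders for illustration.
    Schema schema = Schema.of(
        Field.of("ride_id", StandardSQLTypeName.STRING),
        Field.of("timestamp", StandardSQLTypeName.TIMESTAMP));

    TableId tableId = TableId.of("taxirides", "realtime");
    bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
  }
}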
Run the pipeline
Run a streaming pipeline using the Google-provided Pub/Sub Topic to BigQuery template. The pipeline gets incoming data from the input topic.
View your results
Then view the data that the pipeline writes to your BigQuery table as it runs.
Pipeline fundamentals for the Apache Beam SDKs
On the Apache Beam website, you can find documentation on:
How to design your pipeline: It demonstrates how to create your pipeline, select the appropriate data transforms, and choose your input and output strategies.
How to create your pipeline: It outlines how to use the classes in the Beam SDKs and the procedures needed to construct a pipeline.
How to test your pipeline: presents best practices for testing your pipelines (a short example follows this list).
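As a minimal sketch of that testing approach, assuming JUnit 4 and Beam's testing utilities, a unit test can run a transform over in-memory data created with Create.of and assert on the output with PAssert:

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class WordCountTest {
  // TestPipeline is used as a JUnit rule so it can verify that the
  // pipeline defined in the test is actually run.
  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  public void countsWords() {
    PCollection<KV<String, Long>> counts =
        p.apply(Create.of("hi", "there", "hi"))
         .apply(Count.perElement());

    // PAssert checks the contents of a PCollection when the pipeline runs.
    PAssert.that(counts).containsInAnyOrder(KV.of("hi", 2L), KV.of("there", 1L));

    p.run().waitUntilFinish();
  }
}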
Deploying a pipeline
After you construct and test your Apache Beam pipeline, you can use the Dataflow managed service to deploy and execute it. Once on the Dataflow service, your pipeline code becomes a Dataflow job.
Pipeline lifecycle: from pipeline code to Dataflow job
When you run your Dataflow pipeline, Dataflow generates an execution graph from the code that constructs your Pipeline object, including all of the transforms and their associated processing functions (such as DoFns). This phase is called graph construction time and runs locally on the machine where the pipeline program executes.
Execution Graph
Based on the transforms and values you used to construct your Pipeline object, Dataflow builds a graph of steps that represents your pipeline: the execution graph.
Parallelization and distribution
The Dataflow service automatically parallelizes and distributes the processing logic in your pipeline across the workers you have allotted to perform your job. For example, your ParDo transforms cause Dataflow to distribute your processing code (represented by DoFns) to multiple workers so that it runs in parallel. Dataflow uses the abstractions of the programming model to express parallel processing functions.
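As a hedged sketch of that idea, the DoFn below holds the per-element logic that the service is free to run on many workers at once; the transform itself (upper-casing strings) is an arbitrary example:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ParallelUpperCase {
  // Per-element processing logic. Dataflow serializes this DoFn, ships it to
  // the workers assigned to the job, and invokes it on elements in parallel.
  static class ToUpperCaseFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String word, OutputReceiver<String> out) {
      out.output(word.toUpperCase());
    }
  }

  // Applying the DoFn with ParDo says "run this function over every element";
  // how the work is split across workers is left entirely to the runner.
  static PCollection<String> toUpperCase(PCollection<String> words) {
    return words.apply("ToUpperCase", ParDo.of(new ToUpperCaseFn()));
  }
}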
Fusion optimization
After validating the JSON representation of your pipeline's execution graph, the Dataflow service may modify the graph. These optimizations can include fusing multiple steps or transforms into a single step. Fusion means the service does not have to materialize every intermediate PCollection in your pipeline, which can be costly in memory and processing overhead.
Setting pipeline options
Your Apache Beam program builds a pipeline for deferred execution; the pipeline then runs separately from the program that constructed it. In other words, the program creates a specification of processing steps that any supported Apache Beam runner can execute. Compatible runners include the Dataflow runner on Google Cloud and the direct runner, which executes the pipeline directly in a local environment.
Setting pipeline options programmatically
You can set pipeline options programmatically by creating and modifying a PipelineOptions object.
In Java, construct a PipelineOptions object using the method PipelineOptionsFactory.fromArgs.
Accessing the pipeline options object
In Java, you can access PipelineOptions inside any ParDo's DoFn instance by using the method ProcessContext.getPipelineOptions.
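A minimal sketch of that access pattern, reading only the standard job name option, might look like this:

import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;

// Reads the pipeline options from inside a DoFn via the ProcessContext.
class LogJobNameFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    PipelineOptions options = c.getPipelineOptions();
    // Custom options can be read by converting with options.as(MyOptions.class).
    c.output(options.getJobName() + ": " + c.element());
  }
}

The snippet that follows, by contrast, creates and configures a PipelineOptions object for cloud execution with the Dataflow runner.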
// Create and set your PipelineOptions.
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
// For cloud execution, set the Google Cloud project, staging location,
// and set DataflowRunner.
options.setProject("my-project-id");
options.setStagingLocation("gs://my-bucket/binaries");
options.setRunner(DataflowRunner.class);
// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);
Using pipeline options from the command line
// Set your PipelineOptions to the specified command-line options
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);
Launching locally
In Java:
// Create and set our Pipeline Options.
PipelineOptions options = PipelineOptionsFactory.create();
// If no runner is set, the pipeline runs on the DirectRunner, executing locally.
// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);
Using the Apache Beam interactive runner with JupyterLab notebooks lets you iteratively develop pipelines, inspect your pipeline graph, and work with individual PCollections in a read-eval-print-loop (REPL) workflow. These Apache Beam notebooks are made available through Vertex AI Workbench user-managed notebooks, a service that hosts notebook virtual machines pre-installed with the latest data science and machine learning frameworks.
Get started with Google-provided templates
Google provides a set of open source Dataflow templates.
WordCount
The WordCount template is a batch pipeline that reads text from Cloud Storage, tokenizes the text lines into individual words, and performs a frequency count on each of the words.
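A rough Beam Java sketch of that batch pipeline (not the template's actual source; the bucket paths are placeholders) looks like this:

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadText", TextIO.read().from("gs://BUCKET_NAME/input.txt"))
     // Tokenize each line into words by splitting on non-letter characters.
     .apply("ExtractWords", FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
     .apply("RemoveEmptyWords", Filter.by((String word) -> !word.isEmpty()))
     // Count how many times each word occurs.
     .apply("CountWords", Count.perElement())
     // Format each (word, count) pair as a line of text.
     .apply("FormatResults", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("WriteCounts", TextIO.write().to("gs://BUCKET_NAME/counts"));

    p.run().waitUntilFinish();
  }
}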
Stop a running pipeline
You cannot delete a Dataflow job; you can only stop it.
To stop a Dataflow job, you can use either the Google Cloud console, Cloud Shell, a local terminal installed with the Google Cloud CLI, or the Dataflow REST API.
Cancel a job
The following actions occur when you cancel a job:
The Dataflow service halts all data ingestion and data processing.
The Dataflow service begins cleaning up the Google Cloud resources that are attached to your job.
Drain a job
When you drain a job, the Dataflow service finishes your job in its current state. If you want to prevent data loss as you bring down your streaming pipelines, the best option is to drain your job.
Force cancel a job
When you force cancel a job, the Dataflow service stops the job immediately and leaks any VMs that the Dataflow job created. You must attempt a regular cancel at least 30 minutes before force canceling a job.
Pipeline troubleshooting and debugging
Dataflow provides real-time feedback about your job, and there is a basic set of steps you can follow to check the error messages and logs, and to detect conditions such as a stall in your job's progress.
Determine the cause of a pipeline failure
Graph or pipeline construction errors: These errors happen when Dataflow has trouble constructing the pipeline's step-by-step graph, as specified by your Apache Beam pipeline.
Errors in job validation: The Dataflow service validates every pipeline job you launch. If validation fails, your job may not be created or run successfully. Validation errors can involve problems with your Google Cloud project's permissions or with its Cloud Storage bucket.
Exceptions in worker code: These errors occur when the user-provided code that Dataflow distributes to parallel workers, such as the DoFn instances of a ParDo transform, contains bugs or defects.
Errors caused by transient failures in other Google Cloud services: Your pipeline might fail because of a temporary outage or other problem in a Google Cloud service that Dataflow depends on, such as Compute Engine or Cloud Storage.
Detect graph or pipeline construction errors
A graph construction error can occur while Dataflow builds the execution graph for your pipeline from the code in your Dataflow program. During graph construction, Dataflow checks for illegal operations.
Keep in mind that if Dataflow detects an error during graph construction, no job is created on the Dataflow service, so no feedback appears in the Dataflow monitoring interface.
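As a hedged example of the kind of illegal operation caught at graph construction time (one of several checks Beam performs), grouping an unbounded Pub/Sub source in the default global window without a trigger raises an error in the local program, before any job reaches the service. The topic path and the keying step are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ConstructionErrorExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(PubsubIO.readStrings().fromTopic("projects/PROJECT_ID/topics/TOPIC"))
     .apply(MapElements
         .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
         .via((String msg) -> KV.of(msg, msg)))
     // Grouping an unbounded, globally windowed PCollection without a trigger
     // is rejected while the graph is built, so the failure shows up in the
     // local program rather than in the Dataflow monitoring interface.
     .apply(GroupByKey.create());

    p.run();
  }
}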
Detect errors in Dataflow job validation
Once the Dataflow service has received your pipeline's graph, the service attempts to validate your job. This validation includes the following:
Making sure that the service can access the Cloud Storage buckets associated with your job for file staging and temporary output.
Verifying that your Google Cloud project has the necessary permissions.
Making sure that input and output sources, such as files, are accessible to the service.
Frequently asked questions
What is Cloud Dataflow used for?
Dataflow is a managed service for executing a wide variety of data processing patterns.
What are Dataflow and Dataproc in GCP?
Dataproc is Google Cloud's managed service for running Apache Spark and Hadoop workloads, including data science and ML jobs. Dataflow, in comparison, is a managed service for batch and stream processing of data.
What is Dataflow equivalent in AWS?
Dataflow is roughly equivalent to Amazon Elastic MapReduce (EMR) or AWS Batch.
Is Dataproc fully managed?
Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.
Conclusion
In this article, we have extensively discussed Dataflow. We hope this blog has helped you enhance your knowledge regarding Dataflow.