Cloud Dataflow Overview Google Professional Data Engineer GCP
Google Cloud Dataflow is
- a managed data transformation service
- a unified data processing model for both bounded and unbounded datasets
- a serverless platform
- you write code in the form of pipelines and submit them to Cloud Dataflow for execution
- offers autoscaling of workers and dynamic rebalancing of work across those workers
- provides the open Apache Beam programming model as a managed service
- processes data in multiple ways, such as
- batch operations
- extract-transform-load (ETL) patterns
- continuous, streaming computation.
- pipelines operate on data in terms of collections, using the abstraction PCollection.
- each PCollection is a distributed set of homogeneous data within the pipeline.
- a PCollection can represent bounded data (e.g. a CSV file in Cloud Storage) or an unbounded data source (e.g. a Cloud Pub/Sub topic).
- a PCollection is immutable; each transform produces a new PCollection rather than modifying its input.
- each element in a PCollection has an associated timestamp, assigned either by the data's source or explicitly by the pipeline.