Cloud Dataflow - Google Professional Data Engineer (GCP)
- A fully managed service
- Executes data processing pipelines
- Uses the Apache Beam SDK
- Supports both batch and streaming pipelines (a minimal sketch of submitting a pipeline follows below)
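A minimal sketch, assuming placeholder project, region, and bucket names, of how a Beam pipeline written in Python is handed to the Dataflow runner; swapping in DirectRunner runs the same code locally:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and bucket names; use runner="DirectRunner"
# to execute the same pipeline locally instead of on the Dataflow service.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadLines" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Uppercase" >> beam.Map(str.upper)
        | "WriteLines" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )
```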
Pipeline
- It manages data in stages, similar to a factory assembly line
- Tasks include
- Transforming data
- Aggregating data and computing results
- Enriching data.
- Moving data.
- Dataflow views all data as streaming (batch data is treated as a bounded stream)
- It breaks each dataflow into smaller units of work that can be distributed across workers
- Pipelines are written in Java or Python using the Beam SDK
- Jobs are executed in parallel
- The same code is used for streaming and batch
- A pipeline is a directed graph of steps
- The source/sink can be a filesystem, Cloud Storage (GCS), BigQuery, or Pub/Sub
- The runner can be a local machine (DirectRunner) or Dataflow in the cloud (DataflowRunner)
- Output data written can be sharded or unsharded.
- Inputs and outputs are PCollections. A PCollection does not need to fit in memory and can be unbounded.
- Each transform is given a name (see the sketch after this list)
- Pipelines read from a source and write to a sink
- Common tasks include
- Convert incoming data to a common format.
- Prepare data for analysis and visualization.
- Migrate between databases.
- Share data processing logic across web apps, batch jobs, and APIs.
- Power data ingestion and integration tools.
- Consume large XML, CSV, and fixed-width files.
- Replace batch jobs with real-time processing.
- Pipeline components
- Data nodes – locations of input and output data
- Activities – definitions of the work to perform
- Preconditions
- Resources
- Actions – steps that are triggered when specified conditions are met
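A sketch of the pipeline ideas above: a directed graph of named transforms, with PCollections flowing from a source to a sink, converting rows to a common format and aggregating them. The bucket paths, CSV layout, and field positions are assumptions for illustration only:

```python
import apache_beam as beam

# DirectRunner is used by default; pass PipelineOptions to target Dataflow instead.
with beam.Pipeline() as p:
    # Each step is a named transform; each edge between steps is a PCollection,
    # which does not need to fit in memory.
    records = (
        p
        | "ReadCsv" >> beam.io.ReadFromText("gs://my-bucket/orders/*.csv",
                                            skip_header_lines=1)
        # Convert incoming rows to a common format (a dict with typed fields).
        | "SplitRow" >> beam.Map(lambda line: line.split(","))
        | "ToRecord" >> beam.Map(lambda f: {"customer": f[0], "amount": float(f[1])})
    )

    totals = (
        records
        # Aggregate: total order amount per customer.
        | "KeyByCustomer" >> beam.Map(lambda r: (r["customer"], r["amount"]))
        | "SumPerCustomer" >> beam.CombinePerKey(sum)
        | "FormatCsv" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]:.2f}")
    )

    # Output is sharded across several files by default; num_shards=1 would
    # force a single, unsharded file.
    totals | "WriteTotals" >> beam.io.WriteToText("gs://my-bucket/output/customer_totals")
```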
Types
Types of pipelines are
- Batch
- useful for processing large volumes of data at a regular interval
- no real-time processing needed
- Real-time
- used to process data in real time
- typically processes data from a streaming source (see the streaming sketch after this list)
- Cloud native
- optimized to work with cloud-based data, such as data in Amazon S3 buckets
- the tools themselves are hosted in the cloud
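A sketch of a real-time (streaming) pipeline built with the same Beam SDK, reading from a Pub/Sub topic and counting events per fixed one-minute window; the topic names and comma-separated message format are assumptions:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# streaming=True marks the pipeline as unbounded; the rest of the code
# follows the same style as a batch pipeline.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Messages arrive as bytes from the (placeholder) Pub/Sub topic.
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-gcp-project/topics/events")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        # Group the unbounded stream into fixed one-minute windows.
        | "Window1Min" >> beam.WindowInto(window.FixedWindows(60))
        # Assume the first comma-separated field is the event type.
        | "KeyByType" >> beam.Map(lambda e: (e.split(",")[0], 1))
        | "CountPerType" >> beam.CombinePerKey(sum)
        | "Encode" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}".encode("utf-8"))
        # Publish per-window counts back to another (placeholder) topic.
        | "PublishCounts" >> beam.io.WriteToPubSub(topic="projects/my-gcp-project/topics/event-counts")
    )
```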