Key Concepts: Google Professional Data Engineer (GCP)
Pipelines
- A pipeline encapsulates the entire process of reading input data, transforming that data, and writing output data.
- The input source and output sink can be the same or of different types
- Apache Beam programs start by constructing a Pipeline object
- and then use that object as the basis for creating the pipeline’s datasets.
- Each pipeline represents a single, repeatable job.
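A minimal sketch of these steps using the Beam Python SDK; the in-memory input values are placeholders, and printing stands in for a real output sink.

```python
import apache_beam as beam

# The Pipeline object encapsulates every step; exiting the `with` block runs the job.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadInput" >> beam.Create(["hello", "beam"])   # input data
        | "Transform" >> beam.Map(str.upper)              # processing step
        | "WriteOutput" >> beam.Map(print)                # stand-in for an output sink
    )
```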
PCollection
- It represents a distributed, multi-element dataset
- acts as the pipeline’s data.
- Apache Beam transforms use PCollection objects as inputs and outputs for each step in the pipeline.
- A PCollection can hold a dataset of a fixed size or an unbounded dataset from a continuously updating data source.
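A small sketch (Python SDK assumed) showing a bounded PCollection created from an in-memory list; each transform consumes one PCollection and produces a new one.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # A bounded PCollection created from an in-memory list.
    words = pipeline | beam.Create(["apple", "banana", "cherry"])
    # Each transform consumes a PCollection and produces a new PCollection.
    lengths = words | beam.Map(len)
    lengths | beam.Map(print)
```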
Transforms
- Represent a processing operation that transforms data.
- takes one or more PCollections as input, performs a specified operation, and produces one or more PCollections as output.
- can perform many kinds of processing operations (see the sketch below)
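A sketch, assuming the Python SDK, chaining two different kinds of transforms on sample data: an element-wise filter and a global combine.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    nums = p | beam.Create([1, 2, 3, 4, 5])
    evens = nums | "KeepEvens" >> beam.Filter(lambda x: x % 2 == 0)   # element-wise transform
    total = evens | "SumEvens" >> beam.CombineGlobally(sum)           # aggregating transform
    total | beam.Map(print)                                           # prints 6
```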
ParDo
- It is the core parallel processing operation in the Apache Beam SDKs.
- It invokes a user-specified function on each of the elements of the input PCollection.
- ParDo collects the zero or more output elements into an output PCollection.
- The ParDo transform processes elements independently and possibly in parallel.
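A possible ParDo in the Python SDK: the user-specified DoFn is invoked on each input element and may yield zero or more output elements.

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    """User-specified function applied to each element; yields zero or more outputs."""
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["the quick brown fox", ""])   # the empty line yields zero outputs
        | beam.ParDo(SplitWords())
        | beam.Map(print)
    )
```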
Pipeline I/O
- let you read data into your pipeline and write output data from your pipeline.
- consists of a source and a sink.
- You can also write a custom I/O connector when the built-in connectors do not cover your source or sink (built-in text I/O is sketched below).
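A sketch using the built-in text connectors as source and sink; the Cloud Storage paths are hypothetical placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "ReadSource" >> beam.io.ReadFromText("gs://my-bucket/input*.txt")          # source
        | "LineLengths" >> beam.Map(len)
        | "WriteSink" >> beam.io.WriteToText("gs://my-bucket/output/line-lengths")   # sink
    )
```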
Aggregation
- process of computing some value from multiple input elements.
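For example, a per-key aggregation in the Python SDK on sample data:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([("apples", 3), ("pears", 2), ("apples", 5)])
    # Combine all values that share a key into a single aggregated value.
    totals = sales | beam.CombinePerKey(sum)   # ('apples', 8), ('pears', 2)
    totals | beam.Map(print)
```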
Side input
- Can be a static value such as a constant.
- Can also be a list or map. If the side input is a PCollection, first convert it to a list or map view and pass that view as the side input.
- In Java, call ParDo.withSideInputs with the list or map view; in Python, pass the view as an extra argument to the transform (see the sketch below).
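A Python sketch of a side input: a single-value PCollection is materialized with AsSingleton (AsList and AsDict work the same way) and passed as an extra keyword argument to the per-element function.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "Words" >> beam.Create(["a", "bb", "ccc"])
    # A single-value PCollection used as a side input.
    max_len = words | beam.Map(len) | beam.CombineGlobally(max)

    (
        words
        | beam.Filter(lambda word, longest: len(word) == longest,
                      longest=beam.pvalue.AsSingleton(max_len))
        | beam.Map(print)   # prints 'ccc'
    )
```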
MapReduce
- Map operates on elements in parallel; reduce aggregates values based on a key.
- ParDo acts on one item at a time, similar to the map operation in MapReduce, and should not rely on state or history across elements. Useful for filtering and mapping.
- In Python, a 1:1 map is done with Map and a non-1:1 map with FlatMap; in Java, both are done with ParDo (see the sketch below).
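A classic word-count sketch in the Python SDK showing the map and reduce roles:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create(["to be or not to be"])
        | beam.FlatMap(str.split)            # non-1:1 "map": one line -> many words
        | beam.Map(lambda w: (w, 1))         # 1:1 "map": word -> (word, 1)
        | beam.CombinePerKey(sum)            # "reduce": aggregate counts per key
    )
    counts | beam.Map(print)
```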
User-defined functions (UDFs)
- Apache Beam allows executing user-defined code to configure a transform.
- For ParDo, user-defined code specifies the operation to apply to every element.
- UDFs can be written in a different language than the language of the runner.
Runner
- the software that accepts a pipeline and executes it.
- Runners are translators or adapters to massively parallel big-data processing systems; the runner is selected through pipeline options (see the sketch below).
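A sketch of selecting a runner through pipeline options in the Python SDK; the commented Dataflow options (project, region, temp_location) are hypothetical placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline locally; DataflowRunner submits the job
# to Google Cloud Dataflow (the commented values below are placeholders).
options = PipelineOptions(
    runner="DirectRunner",
    # runner="DataflowRunner",
    # project="my-project",
    # region="us-central1",
    # temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)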
Event time
- The time a data event occurs, as opposed to the time the element is processed.
- It is determined by the timestamp on the data element itself.
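A Python sketch of attaching the event time carried on each element as its Beam timestamp; the Unix-second values are sample data.

```python
import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

with beam.Pipeline() as p:
    (
        p
        # Each element carries its own event time as Unix seconds (sample values).
        | beam.Create([("click", 1609459200), ("click", 1609459260)])
        # Attach that event time to the element as its Beam timestamp.
        | beam.Map(lambda e: TimestampedValue(e[0], e[1]))
        | beam.Map(print)
    )
```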
Windowing
- enables grouping operations over unbounded collections
- divides the collection into finite windows according to the timestamps of the individual elements (see the sketch below)
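A sketch of fixed 60-second windows in the Python SDK, using sample timestamped key-value pairs:

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    counts = (
        p
        # (key, event-time seconds) pairs; sample values only.
        | beam.Create([("user1", 10), ("user1", 70), ("user2", 75)])
        | beam.Map(lambda e: TimestampedValue(e, e[1]))
        | beam.WindowInto(FixedWindows(60))          # 60-second fixed windows
        | beam.Map(lambda e: (e[0], 1))
        | beam.CombinePerKey(sum)                    # counted per key, per window
    )
    counts | beam.Map(print)
```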
Watermarks
- Apache Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline.
Trigger
- Triggers determine when to emit aggregated results as data arrives.
- For bounded data, results are emitted after all of the input has been processed.
- For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed (see the sketch below).
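A sketch of a trigger configuration in the Python SDK; the window size, early-firing interval, and allowed lateness values are illustrative only.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

# Emit an early (speculative) result every 30 seconds of processing time,
# the main result when the watermark passes the end of the 60-second window,
# and late results for elements arriving within the allowed lateness.
windowed = beam.WindowInto(
    FixedWindows(60),
    trigger=AfterWatermark(
        early=AfterProcessingTime(30),
        late=AfterProcessingTime(0)),
    accumulation_mode=AccumulationMode.ACCUMULATING,
    allowed_lateness=600)

# Usage: windowed_events = events | windowed
```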