Key Concepts: Google Professional Data Engineer (GCP)
Pipelines
- A pipeline encapsulates the entire process of reading input data, transforming that data, and writing output data.
- The input source and output sink can be the same or of different types
- Apache Beam programs start by constructing a Pipeline object
- and then use that object as the basis for creating the pipeline’s datasets.
- Each pipeline represents a single, repeatable job.
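A minimal sketch of these steps using the Beam Python SDK; the in-memory input values are placeholders, and printing stands in for a real output sink.

```python
import apache_beam as beam

# The Pipeline object encapsulates every step; exiting the `with` block runs the job.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadInput" >> beam.Create(["hello", "beam"])   # input data
        | "Transform" >> beam.Map(str.upper)              # processing step
        | "WriteOutput" >> beam.Map(print)                # stand-in for an output sink
    )
```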
PCollection
- It represents a distributed, multi-element dataset
- acts as the pipeline’s data.
- Apache Beam transforms use PCollection objects as inputs and outputs for each step in the pipeline.
- A PCollection can hold a dataset of a fixed size or an unbounded dataset from a continuously updating data source.
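A small sketch (Python SDK assumed) showing a bounded PCollection created from an in-memory list; each transform consumes one PCollection and produces a new one.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # A bounded PCollection created from an in-memory list.
    words = pipeline | beam.Create(["apple", "banana", "cherry"])
    # Each transform consumes a PCollection and produces a new PCollection.
    lengths = words | beam.Map(len)
    lengths | beam.Map(print)
```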
Transforms
- Represent a processing operation that transforms data.
- takes one or more PCollections as input, performs a specified operation, and produces one or more PCollections as output.
- can perform many kinds of processing operations (see the sketch below)
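A sketch, assuming the Python SDK, chaining two different kinds of transforms on sample data: an element-wise filter and a global combine.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    nums = p | beam.Create([1, 2, 3, 4, 5])
    evens = nums | "KeepEvens" >> beam.Filter(lambda x: x % 2 == 0)   # element-wise transform
    total = evens | "SumEvens" >> beam.CombineGlobally(sum)           # aggregating transform
    total | beam.Map(print)                                           # prints 6
```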
ParDo
- It is the core parallel processing operation in the Apache Beam SDKs.
- It invokes a user-specified function on each of the elements of the input PCollection.
- ParDo collects the zero or more output elements into an output PCollection.
- The ParDo transform processes elements independently and possibly in parallel.
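A possible ParDo in the Python SDK: the user-specified DoFn is invoked on each input element and may yield zero or more output elements.

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    """User-specified function applied to each element; yields zero or more outputs."""
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["the quick brown fox", ""])   # the empty line yields zero outputs
        | beam.ParDo(SplitWords())
        | beam.Map(print)
    )
```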
Pipeline I/O
- let you read data into your pipeline and write output data from your pipeline.
- consists of a source and a sink.
- You can also write a custom I/O connector when the built-in connectors do not cover your source or sink (built-in text I/O is sketched below).
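A sketch using the built-in text connectors as source and sink; the Cloud Storage paths are hypothetical placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "ReadSource" >> beam.io.ReadFromText("gs://my-bucket/input*.txt")          # source
        | "LineLengths" >> beam.Map(len)
        | "WriteSink" >> beam.io.WriteToText("gs://my-bucket/output/line-lengths")   # sink
    )
```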
Aggregation
- process of computing some value from multiple input elements.
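For example, a per-key aggregation in the Python SDK on sample data:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([("apples", 3), ("pears", 2), ("apples", 5)])
    # Combine all values that share a key into a single aggregated value.
    totals = sales | beam.CombinePerKey(sum)   # ('apples', 8), ('pears', 2)
    totals | beam.Map(print)
```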
Side input
- Can be a static value such as a constant.
- Can also be a list or map. If the side input is a PCollection, first convert it to a list or map view and pass that view as the side input.
- In Java, call ParDo.withSideInputs with the list or map view; in Python, pass the view as an extra argument to the transform (see the sketch below).
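A Python sketch of a side input: a single-value PCollection is materialized with AsSingleton (AsList and AsDict work the same way) and passed as an extra keyword argument to the per-element function.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | "Words" >> beam.Create(["a", "bb", "ccc"])
    # A single-value PCollection used as a side input.
    max_len = words | beam.Map(len) | beam.CombineGlobally(max)

    (
        words
        | beam.Filter(lambda word, longest: len(word) == longest,
                      longest=beam.pvalue.AsSingleton(max_len))
        | beam.Map(print)   # prints 'ccc'
    )
```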
MapReduce
- Map operates on elements in parallel; reduce aggregates values based on a key.
- ParDo acts on one item at a time, similar to the map operation in MapReduce, and should not rely on state or history across elements. Useful for filtering and mapping.
- In Python, a 1:1 map is done with Map and a non-1:1 map with FlatMap; in Java, both are done with ParDo (see the sketch below).
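A classic word-count sketch in the Python SDK showing the map and reduce roles:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create(["to be or not to be"])
        | beam.FlatMap(str.split)            # non-1:1 "map": one line -> many words
        | beam.Map(lambda w: (w, 1))         # 1:1 "map": word -> (word, 1)
        | beam.CombinePerKey(sum)            # "reduce": aggregate counts per key
    )
    counts | beam.Map(print)
```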
User-defined functions (UDFs)
- Apache Beam allows executing user-defined code to configure a transform.
- For ParDo, user-defined code specifies the operation to apply to every element.
- UDFs can be written in a different language than the language of the runner.
Runner
- the software that accepts a pipeline and executes it.
- Runners are translators or adapters to massively parallel big-data processing systems; the runner is selected through pipeline options (see the sketch below).
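A sketch of selecting a runner through pipeline options in the Python SDK; the commented Dataflow options (project, region, temp_location) are hypothetical placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline locally; DataflowRunner submits the job
# to Google Cloud Dataflow (the commented values below are placeholders).
options = PipelineOptions(
    runner="DirectRunner",
    # runner="DataflowRunner",
    # project="my-project",
    # region="us-central1",
    # temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)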
Event time
- The time a data event occurs, as opposed to the time the element is processed.
- It is determined by the timestamp on the data element itself.
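A Python sketch of attaching the event time carried on each element as its Beam timestamp; the Unix-second values are sample data.

```python
import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

with beam.Pipeline() as p:
    (
        p
        # Each element carries its own event time as Unix seconds (sample values).
        | beam.Create([("click", 1609459200), ("click", 1609459260)])
        # Attach that event time to the element as its Beam timestamp.
        | beam.Map(lambda e: TimestampedValue(e[0], e[1]))
        | beam.Map(print)
    )
```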
Windowing
- enables grouping operations over unbounded collections
- divides the collection into finite windows according to the timestamps of the individual elements (see the sketch below)
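A sketch of fixed 60-second windows in the Python SDK, using sample timestamped key-value pairs:

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    counts = (
        p
        # (key, event-time seconds) pairs; sample values only.
        | beam.Create([("user1", 10), ("user1", 70), ("user2", 75)])
        | beam.Map(lambda e: TimestampedValue(e, e[1]))
        | beam.WindowInto(FixedWindows(60))          # 60-second fixed windows
        | beam.Map(lambda e: (e[0], 1))
        | beam.CombinePerKey(sum)                    # counted per key, per window
    )
    counts | beam.Map(print)
```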
Watermarks
- Apache Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline.
Trigger
- Triggers determine when to emit aggregated results as data arrives.
- For bounded data, results are emitted after all of the input has been processed.
- For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed (see the sketch below).
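A sketch of a trigger configuration in the Python SDK; the window size, early-firing interval, and allowed lateness values are illustrative only.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

# Emit an early (speculative) result every 30 seconds of processing time,
# the main result when the watermark passes the end of the 60-second window,
# and late results for elements arriving within the allowed lateness.
windowed = beam.WindowInto(
    FixedWindows(60),
    trigger=AfterWatermark(
        early=AfterProcessingTime(30),
        late=AfterProcessingTime(0)),
    accumulation_mode=AccumulationMode.ACCUMULATING,
    allowed_lateness=600)

# Usage: windowed_events = events | windowed
```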