Dataprep Overview Google Professional Data Engineer GCP
- Cloud Dataprep is an intelligent data service
- used to visually explore, clean, and prepare data that is not yet ready for analysis.
Flow
- a container for holding one or more datasets, associated recipes and other objects.
- is a means of packaging Cloud Dataprep objects for the following actions:
- Creating relationships between datasets, their recipes, and other datasets.
- Sharing with other users
- Copying
- Execution of pre-configured jobs
- Creating references between recipes and external flows
Imported Dataset
- a reference to the original data
- the data does not exist within the platform.
- can be a reference to a file, multiple files, database table, or other type of data.
- is a pointer to a source of data.
- An imported dataset can be referenced in recipes.
- Imported datasets are created through the Import Data Page.
Recipe
- a user-defined sequence of steps to transform a dataset.
- A recipe object is created from an imported dataset or another recipe.
- A recipe can be created from another recipe to chain recipes together.
- Recipes are interpreted by Cloud Dataprep by TRIFACTA INC. and turned into commands that can be executed against data.
- When initially created, a recipe contains no steps.
- Recipes are augmented and modified using the various visual tools in the Transformer Page.
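The recipe ideas above (an ordered list of steps, empty at creation, built from an imported dataset or another recipe) can be illustrated with a small plain-Python model. This is only a sketch and not Dataprep's API or recipe language: the `Recipe` class and its `add_step` and `run` methods are hypothetical names, and rows are modeled as dictionaries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


@dataclass
class Recipe:
    """Hypothetical model: an ordered list of steps over a source."""
    source: Union[List[Row], "Recipe"]  # imported data or an upstream recipe
    steps: List[Step] = field(default_factory=list)  # empty when created

    def add_step(self, step: Step) -> "Recipe":
        self.steps.append(step)
        return self

    def run(self) -> List[Row]:
        # Resolve the source first so a chained recipe sees upstream changes.
        if isinstance(self.source, Recipe):
            rows = self.source.run()
        else:
            rows = list(self.source)
        for step in self.steps:
            rows = step(rows)
        return rows


# Stand-in for an imported dataset (the real one is only a pointer to data).
imported = [{"name": " Ada ", "age": "36"}, {"name": "Bob", "age": ""}]

clean = Recipe(imported)
clean.add_step(lambda rows: [{**r, "name": r["name"].strip()} for r in rows])

# A recipe created from another recipe: the two are chained together.
with_age = Recipe(clean)
with_age.add_step(lambda rows: [r for r in rows if r["age"]])

print(with_age.run())
```

Running the downstream recipe applies the upstream steps first, which is the behavior chaining is meant to provide; the source rows themselves are left untouched.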
In a flow, the following objects are associated with each recipe
- Outputs
- References
Outputs
- contain one or more publishing destinations, which define the output format, location, and other publishing options
- these options are applied to the results generated from a job run on the recipe.
References
- A reference lets you use the output of the recipe’s steps in another dataset.
- When you select a recipe’s reference object, you can add it to another flow.
- A reference dataset is a read-only version of the output data generated from the execution of a recipe’s steps.
Samples
- A sample is a subset of the entire dataset.
- For smaller datasets, the sample may be the entire dataset.
- As you build or modify a recipe, the results of each modification are immediately reflected in the sampled data.
- You can generate additional samples.
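The sampling behavior above can be sketched in a few lines of plain Python. This is an illustration only: `first_rows_sample` is a hypothetical helper that takes rows from the start of the data, and it does not model the sampling methods Dataprep actually offers.

```python
from typing import Dict, List

Row = Dict[str, int]


def first_rows_sample(rows: List[Row], limit: int) -> List[Row]:
    """Hypothetical helper: return at most `limit` rows from the start."""
    return rows[:limit]


dataset = [{"id": i, "value": i * 10} for i in range(5)]

small = first_rows_sample(dataset, limit=3)    # a subset of the dataset
whole = first_rows_sample(dataset, limit=100)  # small dataset: sample is everything

# A recipe step previewed on the sample only, not on the full dataset.
preview = [{**r, "value": r["value"] + 1} for r in small]

print(len(small), len(whole), preview[0])
```

Working against the sample keeps each preview cheap; the full dataset is only processed when a job is run.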
Macros
- can create reusable sequences of steps that can be parameterized for use in other recipes.
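As a rough analogy, a macro behaves like a function that returns a reusable list of steps for whatever parameters it is given. The sketch below is plain Python, not Dataprep's macro feature; `trim_and_fill` and `apply_steps` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def trim_and_fill(column: str, default: str) -> List[Step]:
    """Hypothetical macro: reusable steps parameterized by column and default."""
    def trim(rows: List[Row]) -> List[Row]:
        return [{**r, column: r[column].strip()} for r in rows]

    def fill(rows: List[Row]) -> List[Row]:
        return [{**r, column: r[column] or default} for r in rows]

    return [trim, fill]


def apply_steps(rows: List[Row], steps: List[Step]) -> List[Row]:
    for step in steps:
        rows = step(rows)
    return rows


customers = [{"city": " Oslo "}, {"city": "  "}]
orders = [{"status": ""}, {"status": "shipped "}]

# The same macro, parameterized differently for two recipes.
clean_customers = apply_steps(customers, trim_and_fill("city", "unknown"))
clean_orders = apply_steps(orders, trim_and_fill("status", "pending"))

print(clean_customers)
print(clean_orders)
```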
Run Jobs
A job may be composed of one or more of the following job types:
- Transform job: executes the recipe steps that you defined against the sample(s), generating the transformed results across the entire dataset.
- Profile job: optionally generates a visual profile of the results of the transform job.
- When a job completes, you can review the resulting data and identify data that still needs fixing.
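The split between the two job types can be sketched as follows. This is a plain-Python illustration under simplifying assumptions, not how Dataprep executes jobs: `transform_job`, `profile_job`, and `run_job` are hypothetical names, and the "profile" here is just a count of valid and missing values per column rather than a visual profile.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]
Profile = Dict[str, Dict[str, int]]


def transform_job(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply every recipe step to the full dataset, not just a sample."""
    for step in steps:
        rows = step(rows)
    return rows


def profile_job(rows: List[Row]) -> Profile:
    """Count valid and missing values per column in the results."""
    profile: Profile = {}
    for row in rows:
        for column, value in row.items():
            stats = profile.setdefault(column, {"valid": 0, "missing": 0})
            stats["missing" if value == "" else "valid"] += 1
    return profile


def run_job(rows: List[Row], steps: List[Step],
            with_profile: bool = False) -> Tuple[List[Row], Optional[Profile]]:
    results = transform_job(rows, steps)
    # The profile job is optional and runs on the transform results.
    profile = profile_job(results) if with_profile else None
    return results, profile


def lowercase_email(rows: List[Row]) -> List[Row]:
    return [{**r, "email": r["email"].lower()} for r in rows]


full_dataset = [{"email": "A@X.COM"}, {"email": ""}, {"email": "b@y.org"}]
results, profile = run_job(full_dataset, [lowercase_email], with_profile=True)

print(results)
print(profile)
```

The missing count in the profile is the kind of signal that points to data still needing a fix after the job completes.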
Schedules
- Associate a schedule with a flow.
- A schedule is a combination of one or more triggers and the scheduled outputs.
- A flow can have only one schedule associated with it.
- A trigger is a scheduled time of execution.
- A schedule can have multiple triggers associated with it.
- A recipe can have only one scheduled destination.
- Each recipe in a flow can have a scheduled destination.
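The cardinality rules above (one schedule per flow, many triggers per schedule, one scheduled destination per recipe) can be captured in a small data model. This is a hypothetical sketch in plain Python, not a Dataprep API: the `Flow` and `Schedule` classes, the trigger strings, and the `gs://example-bucket/...` paths are all placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Schedule:
    triggers: List[str] = field(default_factory=list)           # scheduled times
    destinations: Dict[str, str] = field(default_factory=dict)  # recipe -> output

    def add_trigger(self, when: str) -> None:
        self.triggers.append(when)  # a schedule may hold several triggers

    def set_destination(self, recipe: str, location: str) -> None:
        if recipe in self.destinations:
            raise ValueError(f"{recipe} already has a scheduled destination")
        self.destinations[recipe] = location


@dataclass
class Flow:
    name: str
    schedule: Optional[Schedule] = None

    def attach_schedule(self, schedule: Schedule) -> None:
        if self.schedule is not None:
            raise ValueError("a flow can have only one schedule")
        self.schedule = schedule


flow = Flow("sales-cleanup")
nightly = Schedule()
nightly.add_trigger("daily 02:00")
nightly.add_trigger("weekly Mon 06:00")
# Each recipe in the flow gets its own single scheduled destination.
nightly.set_destination("Recipe 1", "gs://example-bucket/recipe1/")
nightly.set_destination("Recipe 2", "gs://example-bucket/recipe2/")
flow.attach_schedule(nightly)

print(flow)
```

Attaching a second schedule to the same flow, or a second scheduled destination to the same recipe, raises a `ValueError` in this model.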
Example
Type | Datasets | Description |
Standard job execution | Recipe 1/Job 1 | Results of the job are used to create a new imported dataset (I-Dataset 2). |
Create dataset from generated results | Recipe 2/Job 2 | Recipe 2 is created off of I-Dataset 2 and then modified. A job has been specified for it, but the results of the job are unused. |
Chaining datasets | Recipe 3/Job 3 | Recipe 3 is chained off of Recipe 2. The results of running jobs off of Recipe 2 include all of the upstream changes as specified in I-Dataset 1/Recipe 1 and I-Dataset 2/Recipe 2. |
Reference dataset | Recipe 4/Job 4 | I-Dataset 4 is created as a reference off of Recipe 3. It can have its own recipe, job, destinations, and results. |
Workflow
Basic Workflow
- Review object overview
- Import data
- Profile data
- Build transform recipes
- Run job
- Export results