Configure Dataproc Cluster and Submit Job
Configuration terms
Cluster Region:
- specify the global region or a distinct region for the cluster
- the global region is a special multi-region endpoint that can deploy instances into any user-specified Compute Engine zone
- distinct regions (such as us-east1 or europe-west1) isolate cluster resources in the user-specified region
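For illustration, a minimal sketch of creating a cluster in a specific region with the gcloud CLI; the cluster name my-cluster, the region, and the zone are placeholder values:

```
# Create a Dataproc cluster pinned to a specific region and zone
# (cluster name, region, and zone are placeholders)
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --zone=us-central1-a
```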
Compute Engine Virtual Machine instances (VMs)
- a cluster consists of master and worker VMs
- the VMs require full internal IP networking access to each other
- the default network available when creating a cluster helps ensure this access
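As a sketch, the network can also be set explicitly at cluster creation; this assumes the project still has the auto-mode default network:

```
# Place master and worker VMs on the default network so they have
# full internal IP access to each other (names are placeholders)
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --network=default
```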
Labels –
- apply user labels to cluster and job resources to group related resources and operations for later filtering and listing
- associate labels when the resource is created, i.e. at cluster creation or job submission (see the sketch after this list)
- a label is propagated to operations performed on the labeled resource
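A minimal sketch of attaching labels at cluster creation and filtering on them later; all names and label values are placeholders:

```
# Label a cluster at creation time
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --labels=env=dev,team=analytics

# List only the clusters carrying a given label
gcloud dataproc clusters list \
    --region=us-central1 \
    --filter='labels.env = dev'
```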
Cluster Update / Delete
- update a cluster via the Dataproc API, the gcloud CLI, or the Configuration tab of the Cluster details page in the Google Cloud Console
- the following can be updated (see the sketch after this list):
  - the number of standard worker nodes in a cluster
  - the number of secondary worker nodes in a cluster
  - whether to use graceful decommissioning, which shuts down a worker only after its in-progress work is complete
  - adding or deleting cluster labels
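A minimal sketch of these updates with the gcloud CLI; the cluster name, worker counts, timeout, and label values are placeholders:

```
# Resize primary and secondary workers, letting in-progress work
# finish (up to 1 hour) before a worker is decommissioned
gcloud dataproc clusters update my-cluster \
    --region=us-central1 \
    --num-workers=4 \
    --num-secondary-workers=2 \
    --graceful-decommission-timeout=1h

# Add or change labels on an existing cluster
gcloud dataproc clusters update my-cluster \
    --region=us-central1 \
    --update-labels=env=prod
```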
Deleting a cluster –
- delete a cluster via any of the following:
  - the Dataproc API
  - the gcloud CLI
  - the Google Cloud Console
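For example, deleting a cluster with the gcloud CLI (name and region are placeholders):

```
# Delete the cluster; gcloud asks for confirmation unless --quiet is passed
gcloud dataproc clusters delete my-cluster \
    --region=us-central1
```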
Submit a job
- submit a job to an existing cluster via the Dataproc API, the gcloud CLI, or the Google Cloud Console (see the sketch below)
- you can also SSH into the cluster's master instance and then run a job directly
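A minimal sketch of submitting a Spark job with the gcloud CLI, using the SparkPi example jar shipped on Dataproc images; the cluster name and region are placeholders:

```
# Submit a Spark job to an existing cluster; everything after -- is
# passed as arguments to the job itself
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```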
Log and Monitor
Job and cluster logs can be viewed, searched, filtered, and archived in Cloud Logging (see the sketch below).
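As a sketch, recent Dataproc cluster log entries can be read from Cloud Logging with gcloud; the filter shown is one plausible form:

```
# Read the ten most recent Dataproc cluster log entries
gcloud logging read 'resource.type="cloud_dataproc_cluster"' \
    --limit=10
```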