Dataproc Overview Google Professional Data Engineer GCP
- A managed Spark and Hadoop service
- Use for batch processing, querying, streaming, and machine learning.
- Automation helps you
  - create clusters quickly
  - manage them easily
  - save money by turning clusters off when you don’t need them
- Use cases:
  - Move Hadoop and Spark clusters to the cloud
  - Data science on Dataproc
- Apache Hadoop ecosystem components are automatically installed on the cluster.
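- Initialization actions run custom scripts on each node when the cluster is created.
- Clusters can be provisioned with a custom image, which gives faster cluster startup times than installing the same software through initialization actions.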
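- Dataproc manages the addition and deletion of preemptible nodes.
  - At least one regular (non-preemptible) worker is needed; a cluster cannot consist of preemptible workers only.
  - Additional workers can be preemptible.
  - Preemptible worker nodes do not store HDFS data; they provide processing only.
  - Preemptible workers use the same machine configuration as the regular worker nodes.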
- Web interface TCP ports: 8088 (Hadoop YARN ResourceManager), 9870 (HDFS NameNode), and 8080 (Datalab).
- Access Dataproc
  - through the REST API
  - using the Cloud SDK
  - using the Dataproc UI
  - through the Cloud Client Libraries
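
To make the REST API route and the worker rules above more concrete, here is a minimal sketch that only builds the URL and JSON body for a cluster creation request; it does not send anything. The function name, the project, region, and cluster names, and the default machine type are all illustrative assumptions. The field names (`masterConfig`, `workerConfig`, `secondaryWorkerConfig`, `preemptibility`) follow the Dataproc v1 REST cluster resource as best understood here, so check them against the current API reference before relying on them.

```python
import json

# Base endpoint of the Dataproc v1 REST API.
DATAPROC_ENDPOINT = "https://dataproc.googleapis.com/v1"


def build_create_cluster_request(project_id, region, cluster_name,
                                 num_workers=2, num_preemptible=0,
                                 machine_type="n1-standard-4"):
    """Return (url, body) for a cluster creation call (hypothetical helper).

    Nothing is sent over the network; a real call would also need an
    OAuth 2.0 access token in the Authorization header.
    """
    # Mirror the rule from the notes: preemptible workers cannot exist
    # without at least one regular worker.
    if num_preemptible > 0 and num_workers < 1:
        raise ValueError("preemptible workers require at least one regular worker")

    url = f"{DATAPROC_ENDPOINT}/projects/{project_id}/regions/{region}/clusters"
    body = {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "masterConfig": {"numInstances": 1, "machineTypeUri": machine_type},
            "workerConfig": {"numInstances": num_workers,
                             "machineTypeUri": machine_type},
        },
    }
    if num_preemptible > 0:
        # No machine type is set here: preemptible (secondary) workers
        # take the same configuration as the regular workers.
        body["config"]["secondaryWorkerConfig"] = {
            "numInstances": num_preemptible,
            "preemptibility": "PREEMPTIBLE",
        }
    return url, body


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    url, body = build_create_cluster_request(
        "my-project", "us-central1", "demo-cluster", num_preemptible=2)
    print("POST", url)
    print(json.dumps(body, indent=2))
```

The same body could be passed to a Cloud Client Library or sent with any HTTP client; keeping the construction separate from the network call makes the worker rule easy to check in isolation.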