Choosing data processing technologies
In this article, we will learn about data processing technologies and when to choose each one.
Data processing
Dataproc and Dataflow offer autoscaling options for your data pipelines and data processing. These options allow your pipelines to access more computing resources based on the processing load.
Recommendations
- Firstly, use Google Cloud Load Balancers to provide a global endpoint.
- Secondly, use managed instance groups with Compute Engine to automatically scale.
- Thirdly, use the cluster autoscaler in GKE to automatically scale the cluster.
- Then, use App Engine to autoscale your Platform-as-a-Service (PaaS) application.
- Lastly, use Cloud Run or Cloud Functions to autoscale your function or microservice.
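As a concrete illustration of the managed instance group recommendation, an autoscaling policy can be attached from the gcloud CLI. The group name, region, and thresholds below are placeholder values, not from the original text:

```shell
# Attach an autoscaling policy to an existing managed instance group.
# "my-mig" and the region are placeholders -- substitute your own values.
gcloud compute instance-groups managed set-autoscaling my-mig \
    --region=us-central1 \
    --min-num-replicas=2 \
    --max-num-replicas=10 \
    --target-cpu-utilization=0.6 \
    --cool-down-period=90
```

With this policy in place, Compute Engine adds instances when average CPU utilization exceeds 60% and removes them as load drops, within the replica bounds given.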
Dataproc
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Moreover, Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them.
Further, when compared to traditional, on-premises products and competing cloud services, Dataproc has a number of unique advantages for clusters of three to hundreds of nodes:
- Firstly, Low cost. Dataproc is priced at only 1 cent per virtual CPU in your cluster per hour, on top of the other Cloud Platform resources you use. In addition to this low price, Dataproc clusters can include preemptible instances with lower compute prices, further reducing your costs.
- Secondly, Super fast. Without Dataproc, it can take from five to 30 minutes to create Spark and Hadoop clusters on-premises or through IaaS providers. By comparison, Dataproc clusters start, scale, and shut down in about 90 seconds or less on average.
- Thirdly, Integrated. Dataproc has built-in integration with other Google Cloud Platform services, including BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging, and Cloud Monitoring.
- Next, Managed. Use Spark and Hadoop clusters without the assistance of an administrator or special software. Moreover, you can easily interact with clusters and Spark or Hadoop jobs through the Google Cloud Console, the Cloud SDK, or the Dataproc REST API.
- Lastly, Simple and familiar. You don’t need to learn new tools or APIs to use Dataproc, making it easy to move existing projects into Dataproc without redevelopment.
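The cost and speed points above can be seen in a single cluster-creation command. The cluster name, region, and sizes below are placeholder values: the secondary workers use cheaper preemptible capacity, and `--max-idle` turns the cluster off automatically when it is not needed.

```shell
# Create a small Dataproc cluster with cheaper preemptible secondary
# workers, and auto-delete it after 30 minutes of idleness.
# "demo-cluster" and the region are placeholders -- substitute your own.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-secondary-workers=2 \
    --secondary-worker-type=preemptible \
    --max-idle=30m
```

The same cluster can then be used from the Cloud Console, the Cloud SDK (for example `gcloud dataproc jobs submit spark`), or the Dataproc REST API.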
Dataflow
Dataflow is a managed service for executing a wide variety of data processing patterns. It lets you deploy batch and streaming data processing pipelines without managing the underlying infrastructure, and it provides service features such as autoscaling of worker resources. Further, the Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipeline with an Apache Beam program and then run it on the Dataflow service.
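The Beam model composes a pipeline from transforms such as a source, a FlatMap, and a counting combiner. As a rough plain-Python sketch of that batch word-count shape (illustration only, not the Beam SDK itself):

```python
from collections import Counter

def flat_map(elements, fn):
    # FlatMap-style transform: fn returns zero or more outputs per element.
    for element in elements:
        yield from fn(element)

def count_per_element(elements):
    # Combiner-style transform: count occurrences of each element.
    return Counter(elements)

# "Read" step: a tiny in-memory batch source.
lines = ["to be or not to be"]
words = flat_map(lines, str.split)
counts = count_per_element(words)
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

With the actual SDK, the same shape is written by chaining `beam.Create(...)`, `beam.FlatMap(str.split)`, and `beam.combiners.Count.PerElement()` inside a `beam.Pipeline`, and the identical program runs locally or on the Dataflow service depending on the runner you select.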
Reference: Google Documentation, Doc 1, Doc 2