Continuous integration / continuous deployment
In this tutorial, you learn how to set up a continuous integration/continuous deployment (CI/CD) pipeline for data processing by implementing CI/CD methods with managed products on Google Cloud. Data scientists and analysts can adapt CI/CD practices to help ensure the quality, maintainability, and adaptability of their data processes and workflows. The methods that you can apply are as follows:
- Version control of source code.
- Automatic building, testing, and deployment of apps.
- Environment isolation and separation from production.
- Replicable procedures for environment setup.
The CI/CD pipeline
At a high level, the CI/CD pipeline consists of the following steps:
- Cloud Build packages the WordCount sample into a self-running Java Archive (JAR) file using the Maven builder. The Maven builder is a container with Maven installed in it. When a build step is configured to use the Maven builder, Maven runs the tasks.
- Cloud Build uploads the JAR file to Cloud Storage.
- Cloud Build runs unit tests on the data-processing workflow code and deploys the workflow code to Cloud Composer.
- Cloud Composer picks up the JAR file and runs the data-processing job on Dataflow.
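For orientation, the packaging, upload, and deployment steps correspond roughly to the following commands. This is an illustrative sketch only: the Maven project location, the JAR path, and the DAG file name are assumptions, not values taken from this tutorial.
# Package the WordCount sample into a self-running JAR (Maven builder step).
mvn clean package
# Upload the JAR to the Cloud Storage source bucket; the UUID suffix mirrors
# the JAR naming shown later in this tutorial.
gsutil cp target/*.jar \
    gs://$DATAFLOW_JAR_BUCKET_TEST/dataflow_deployment_$(uuidgen).jar
# Deploy the DAG definition to the Cloud Composer environment.
gcloud composer environments storage dags import \
    --environment $COMPOSER_ENV_NAME \
    --location $COMPOSER_REGION \
    --source workflow-dag/data-pipeline.py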
The data-processing workflow
The instructions for how Cloud Composer runs the data-processing workflow are defined in a Directed Acyclic Graph (DAG) written in Python. In the DAG, all the steps of the data-processing workflow are defined together with the dependencies between them. The CI/CD pipeline automatically deploys the DAG definition from Cloud Source Repositories to Cloud Composer in each build. This process ensures that Cloud Composer is always up to date with the latest workflow definition without the need for human intervention.
The data-processing workflow consists of the following steps:
- Run the WordCount data process in Dataflow.
- Download the output files from the WordCount process. The WordCount process outputs three files:
  - download_result_1
  - download_result_2
  - download_result_3
- Download the reference file, called download_ref_string.
- Verify the result against the reference file. This integration test aggregates all three results and compares the aggregated results with the reference file.
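As a rough illustration of the last step, the comparison can be reproduced locally with standard shell tools. This is only a sketch, not the tutorial's actual implementation (which runs inside the DAG); it assumes the result and reference files have already been downloaded to the working directory under the names listed above.
# Aggregate the three WordCount result files and compare them with the reference file.
cat download_result_1 download_result_2 download_result_3 | sort > aggregated_results.txt
sort download_ref_string > expected_results.txt
diff aggregated_results.txt expected_results.txt && echo "Results match the reference file"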
Creating Cloud Build pipelines
In this section, you create the build pipelines that build, deploy, and test the data-processing workflow.
Grant access to Cloud Build service account
Cloud Build deploys Cloud Composer DAGs and triggers workflows. To enable this, you grant additional access to the Cloud Build service account.
- Firstly, in Cloud Shell, add the composer.admin role to the Cloud Build service account so the Cloud Build job can set Airflow variables in Cloud Composer:
gcloud projects add-iam-policy-binding $GCP_PROJECT_ID \
--member=serviceAccount:[email protected] \
--role=roles/composer.admin
- Secondly, add the composer.worker role to the Cloud Build service account so the Cloud Build job can trigger the data workflow in Cloud Composer:
gcloud projects add-iam-policy-binding $GCP_PROJECT_ID \
--member=serviceAccount:[email protected] \
--role=roles/composer.worker
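Optionally, you can confirm that both role bindings are in place by inspecting the project's IAM policy. The filter below matches any binding whose member is the Cloud Build service account:
gcloud projects get-iam-policy $GCP_PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:cloudbuild.gserviceaccount.com" \
    --format="table(bindings.role)"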
Verify the build and test pipeline
After you submit the build file, verify the build steps.
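If you haven't submitted the build yet, a submission from the root of the source repository typically looks like the following. The configuration file name and the substitution variable shown here are assumptions for illustration; use the names defined in your own build configuration:
gcloud builds submit --config=build_deploy_test.yaml \
    --substitutions=_DATAFLOW_JAR_BUCKET_TEST=$DATAFLOW_JAR_BUCKET_TEST .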
- Firstly, in the Cloud Console, go to the Build History page to view a list of all past and currently running builds.
- Secondly, click the build that is currently running.
- Thirdly, on the Build details page, verify that the build steps match the previously described steps. When the build finishes, the Status field on this page says Build successful.
- Then, in Cloud Shell, verify that the WordCount sample JAR file was copied into the correct bucket:
gsutil ls gs://$DATAFLOW_JAR_BUCKET_TEST/dataflow_deployment*.jar
The output is similar to the following:
gs://…-composer-dataflow-source-test/dataflow_deployment_e88be61e-50a6-4aa0-beac-38d75871757e.jar
- After that, get the URL to your Cloud Composer web interface.
gcloud composer environments describe $COMPOSER_ENV_NAME \
--location $COMPOSER_REGION \
--format="get(config.airflowUri)"
- Next, use the URL from the previous step to go to the Cloud Composer UI to verify a successful DAG run. If the Dag Runs column doesn’t display any information, wait a few minutes and reload the page.
- To verify that the data-processing workflow DAG test_word_count is deployed and is in running mode, hold the pointer over the light-green circle below DAG Runs and verify that it says Running.
- Next, to see the running data-processing workflow as a graph, click the light-green circle. Then on the Dag Runs page, click Dag Id: test_word_count.
- Lastly, reload the Graph View page to update the state of the current DAG run. It usually takes three to five minutes for the workflow to finish. To verify that the DAG run finishes successfully, hold the pointer over each task and verify that the tooltip says State: success. The last task, named do_comparison, is the integration test that verifies the process output against the reference file.
Test the trigger
To test the trigger, you add a new word to the test input file and make the corresponding adjustment to the test reference file. You then verify that the build pipeline is triggered by a commit push to Cloud Source Repositories and that the data-processing workflow runs correctly with the updated test files.
- Firstly, in Cloud Shell, add a test word at the end of the test file:
echo "testword" >> ~/$SOURCE_CODE_REPO/workflow-dag/support-files/input.txt
- Secondly, update the test result reference file, ref.txt, to match the changes done in the test input file:
echo "testword: 1" >> ~/$SOURCE_CODE_REPO/workflow-dag/support-files/ref.txt
- Thirdly, commit and push the changes to Cloud Source Repositories, for example:
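# The remote name (origin) and the commit message are illustrative; adjust
# them to match your repository configuration.
cd ~/$SOURCE_CODE_REPO
git add workflow-dag/support-files/input.txt workflow-dag/support-files/ref.txt
git commit -m "Add testword to the test input and reference files"
git push origin master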
- Fourthly, in the Cloud Console, go to the Build History page.
- Next, verify that a new build is triggered by the previous push to the master branch: for the currently running build, the Trigger column says Push to master branch.
- After that, in Cloud Shell, get the URL for your Cloud Composer web interface:
gcloud composer environments describe $COMPOSER_ENV_NAME \
--location $COMPOSER_REGION --format="get(config.airflowUri)"
- After the build finishes, go to the URL from the previous command to verify that the test_word_count DAG is running.
- Now, wait until the DAG run finishes, which is indicated when the light-green circle in the DAG runs column goes away. It usually takes three to five minutes for the process to finish.
- Then, in Cloud Shell, download the test result files:
mkdir ~/result-download
cd ~/result-download
gsutil cp gs://$RESULT_BUCKET_TEST/output* .
- Lastly, verify that the new word is in one of the result files:
grep testword output*
The output is similar to the following:
output-00000-of-00003:testword: 1
Reference: Google Documentation