Launching Dataproc Jobs with Cloud Composer
In this lab you'll use Google Cloud Composer to automate the transform and load steps of an ETL data pipeline. The pipeline will create a Dataproc cluster, perform transformations on extracted data (via a Dataproc PySpark job), and then upload the results to BigQuery. You'll then trigger this pipeline using either:
- An HTTP POST request to a Cloud Composer endpoint
- A recurring schedule (similar to a cron job)
Cloud Composer workflows are composed of DAGs (Directed Acyclic Graphs). You will create your own DAG, working through both design considerations and implementation details, to ensure that your prototype meets the requirements.
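For orientation, here is a minimal sketch of what an Airflow DAG looks like. The DAG id, schedule, and tasks are placeholders for illustration, not the pipeline you'll build in this lab:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal two-task DAG: `extract` runs first, then `transform`.
# The dag_id, schedule, and commands are illustrative placeholders.
with DAG(
    dag_id="example_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # or None to run only when triggered
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")

    # The >> operator declares an edge in the graph: extract runs before transform.
    extract >> transform
```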
What you will build
You're going to build an Apache Airflow DAG (sketched after this list) that will:
- Begin running when triggered by an HTTP POST request from an on-premises system
- Spin up a Dataproc cluster
- Run a PySpark job on the cluster
- Tear down the cluster when the job completes
- Upload the PySpark output to BigQuery
- Remove any remaining intermediate files
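A sketch of such a DAG, assuming Airflow 2 with the apache-airflow-providers-google package; every project, bucket, cluster, and table name below is a placeholder you would replace with your own values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.utils.trigger_rule import TriggerRule

# All identifiers below are illustrative placeholders.
PROJECT_ID = "my-project"
REGION = "us-central1"
CLUSTER_NAME = "etl-cluster"
BUCKET = "my-etl-bucket"

with DAG(
    dag_id="dataproc_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # run only when triggered, e.g. by a POST request
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    )

    run_pyspark = DataprocSubmitJobOperator(
        task_id="run_pyspark",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/transform.py"},
        },
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,  # tear down even if the job fails
    )

    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket=BUCKET,
        source_objects=["output/part-*"],
        destination_project_dataset_table=f"{PROJECT_ID}.my_dataset.results",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    cleanup = GCSDeleteObjectsOperator(
        task_id="cleanup",
        bucket_name=BUCKET,
        prefix="output/",
    )

    create_cluster >> run_pyspark >> delete_cluster
    run_pyspark >> load_to_bq >> cleanup
```

Note the design choice on `delete_cluster`: `trigger_rule=TriggerRule.ALL_DONE` ensures the cluster is torn down even when the PySpark job fails, so you don't pay for an idle cluster after a failed run.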
Create a Cloud Storage bucket
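As a sketch, the bucket can be created with the google-cloud-storage client library; the project, bucket name, and location here are placeholders:

```python
from google.cloud import storage

# Placeholders: use your own project id and a globally unique bucket name.
client = storage.Client(project="my-project")
bucket = client.create_bucket("my-etl-bucket", location="US")
print(f"Created bucket {bucket.name}")
```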
Export the data
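Assuming the source data lives in a BigQuery table (the table and destination below are placeholders; your lab may provide the extract differently), the export to Cloud Storage can be sketched with the BigQuery client:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Placeholder source table and destination URI.
# The wildcard lets BigQuery shard large exports across multiple files.
extract_job = client.extract_table(
    "my-project.my_dataset.source_table",
    "gs://my-etl-bucket/input/data-*.csv",
)
extract_job.result()  # block until the export job finishes
```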
Create a Cloud Composer environment
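Environments are usually created from the Cloud Console or with gcloud. As a programmatic sketch only, assuming the google-cloud-orchestration-airflow client library and its flattened create_environment signature (an assumption; check the library's current API), it might look like:

```python
from google.cloud.orchestration.airflow import service_v1

client = service_v1.EnvironmentsClient()

# Placeholder project, region, and environment name.
parent = "projects/my-project/locations/us-central1"
environment = service_v1.Environment(
    name=f"{parent}/environments/my-composer-env",
)

# create_environment returns a long-running operation; result() waits for it.
# Environment creation typically takes many minutes.
operation = client.create_environment(parent=parent, environment=environment)
operation.result()
```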
Upload the DAG to Cloud Storage
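A sketch of uploading the DAG file with the google-cloud-storage client. Composer watches the dags/ folder of the environment bucket it created for you; the bucket name below is a placeholder for the one shown on your environment's details page:

```python
from google.cloud import storage

client = storage.Client(project="my-project")

# Placeholder: replace with your environment's dags bucket.
bucket = client.bucket("us-central1-my-composer-env-bucket")
blob = bucket.blob("dags/dataproc_etl.py")
blob.upload_from_filename("dataproc_etl.py")
```

The same upload can be done from the command line with `gcloud composer environments storage dags import`.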
Trigger the DAG
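A sketch of the POST trigger, assuming a Composer 2 environment (Airflow 2's stable REST API) and Application Default Credentials; the web server URL and DAG id are placeholders:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Application Default Credentials with the cloud-platform scope.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

# Placeholder: copy the Airflow web server URL from your environment's details page.
web_server = "https://example-dot-us-central1.composer.googleusercontent.com"

# POST to Airflow's stable REST API to start a new DAG run.
response = session.post(
    f"{web_server}/api/v1/dags/dataproc_etl/dagRuns",
    json={"conf": {}},  # optional run-time parameters for the DAG
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```

For the recurring-schedule alternative, you would instead set `schedule_interval` on the DAG itself (for example, a cron string such as `"0 * * * *"`) rather than `None`.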