Bayes Classification with Cloud Datalab, Spark, and Pig on Google Cloud Dataproc

1 hour 30 minutes 7 Credits


Google Cloud Self-Paced Labs



In this lab, you will learn how to deploy a Google Cloud Dataproc cluster with Google Cloud Datalab pre-installed. You will then use Spark to quantize a dataset, which improves the accuracy of the data model over the single-variable approaches used in the earlier labs.
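To make the idea of quantization concrete, here is a minimal sketch in plain Python rather than Spark. It assumes the model uses two variables, flight distance and departure delay, and that a flight counts as on time when its arrival delay is under 15 minutes; these choices, the function names, and the sample records are all illustrative assumptions and are not taken from the lab notebook. Each variable is cut into equal-count bins, and the fraction of on-time flights is then estimated separately for every pair of bins.

```python
from bisect import bisect_right
from collections import defaultdict


def quantile_thresholds(values, n_bins):
    """Return n_bins - 1 cut points that split the values into roughly equal-count bins."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def bin_index(value, thresholds):
    """Map a value to the index of the bin it falls into."""
    return bisect_right(thresholds, value)


def on_time_fraction_by_bin(flights, dist_cuts, delay_cuts, max_arr_delay=15):
    """Estimate the on-time fraction for every (distance bin, delay bin) pair."""
    counts = defaultdict(lambda: [0, 0])  # bin pair -> [on-time count, total count]
    for distance, dep_delay, arr_delay in flights:
        key = (bin_index(distance, dist_cuts), bin_index(dep_delay, delay_cuts))
        counts[key][1] += 1
        if arr_delay < max_arr_delay:
            counts[key][0] += 1
    return {key: on_time / total for key, (on_time, total) in counts.items()}


# Made-up records for illustration only:
# (distance in miles, departure delay in minutes, arrival delay in minutes)
flights = [
    (300, -2, -5), (320, 5, 3), (350, 20, 25), (400, 30, 40),
    (1200, -1, -10), (1300, 10, 2), (1500, 25, 12), (1600, 40, 35),
]

dist_cuts = quantile_thresholds([f[0] for f in flights], 2)
delay_cuts = quantile_thresholds([f[1] for f in flights], 2)
table = on_time_fraction_by_bin(flights, dist_cuts, delay_cuts)

for key in sorted(table):
    print(key, table[key])
```

On a real dataset the same two steps, computing cut points and then aggregating per bin, would typically be expressed as Spark DataFrame operations so that they scale across the cluster.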

The data is stored in Google BigQuery and the analysis will be performed using Google Cloud Datalab running in Google Cloud Dataproc.

The base dataset provides historical information about domestic flights in the United States, retrieved from the US Bureau of Transportation Statistics website. This dataset can be used to demonstrate a wide range of data science concepts and techniques, and it is used in all of the other labs in the Data Science on the Google Cloud Platform and Data Science on Google Cloud Platform: Machine Learning quests.

Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way.

Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud Platform. It runs on Google Compute Engine and connects to multiple cloud services such as Google BigQuery, Cloud SQL or simple text data stored on Google Cloud Storage so you can focus on your data science tasks.

Google BigQuery is a RESTful web service that enables interactive analysis of very large datasets, working in conjunction with Google Cloud Storage.


  • Create a Cloud Dataproc cluster running Cloud Datalab.

  • Create a training data model using Spark on Cloud Datalab.

  • Evaluate a data model using Cloud Datalab.

  • Perform bulk data analysis using Apache Pig.



  • Look for a Cloud Dataproc cluster called ch6cluster.

  • Check that a new Jupyter Notebook has been created and used to clone the GitHub repository for the lab.

  • Check that a copy of the quantization notebook has been created.

  • Query Cloud Dataproc for a successful Apache Pig job and confirm that the output data was saved to Cloud Storage.