Machine Learning with Spark on Google Cloud Dataproc
In this lab, you will learn how to implement logistic regression with a machine learning library for Apache Spark, running on a Google Cloud Dataproc cluster, to develop a model for data from a multivariable dataset.
Google Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way. Cloud Dataproc integrates easily with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics, and machine learning.
Apache Spark is an analytics engine for large-scale data processing. Logistic regression is available as a module in Apache Spark's machine learning library, MLlib. Spark MLlib, also called Spark ML, includes implementations of most standard machine learning algorithms, such as k-means clustering, random forests, alternating least squares, decision trees, and support vector machines. Spark can run on a Hadoop cluster, such as one managed by Google Cloud Dataproc, to process very large datasets in parallel.
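To make concrete what a logistic regression model computes, here is a minimal sketch in plain Python that does not use Spark. It fits weights by batch gradient descent on a tiny, made-up dataset. The function names, the two features (departure delay and taxi-out time, in tens of minutes), and the data values are all hypothetical and chosen only for illustration. Spark MLlib's implementation uses different optimizers and distributes the work across the cluster, so treat this as an illustration of the idea rather than of the library's method.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, intercept, x):
    """Probability that example x belongs to class 1 (here: 'on time')."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + intercept)

def train(features, labels, learning_rate=0.1, epochs=5000):
    """Fit weights and an intercept by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    intercept = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Gradient of the log loss for one example is (prediction - label) * x.
            error = predict_proba(weights, intercept, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        intercept -= learning_rate * grad_b / n
    return weights, intercept

# Hypothetical toy data: [departure delay, taxi-out time], both in tens of minutes.
features = [[-0.5, 1.0], [0.0, 1.2], [0.5, 0.9], [1.0, 1.5],
            [3.0, 1.4], [4.0, 2.0], [5.0, 1.8], [6.0, 2.5]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = arrived on time, 0 = arrived late

weights, intercept = train(features, labels)
print("P(on time | small delay):", predict_proba(weights, intercept, [0.0, 1.0]))
print("P(on time | large delay):", predict_proba(weights, intercept, [5.0, 2.0]))
```

The model outputs a probability for each example. A decision threshold (0.5 is a common default) then turns that probability into a predicted class.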
The base dataset provides historical information about internal flights in the United States, retrieved from the US Bureau of Transportation Statistics website. This dataset can be used to demonstrate a wide range of data science concepts and techniques, and it is used in all of the other labs in the Data Science on the Google Cloud Platform and Data Science on Google Cloud Platform: Machine Learning quests. In this lab, the data is provided for you as a set of CSV-formatted text files.
In this lab, you will learn how to:
Prepare the Spark interactive shell on a Google Cloud Dataproc cluster.
Create a training dataset for machine learning using Spark.
Develop a logistic regression machine learning model using Spark.
Evaluate the predictive behavior of a machine learning model using Spark on Google Cloud Datalab.
Check that the Spark ML model files have been saved to Cloud Storage
Check that a new Jupyter Notebook has been created and that a clone of the git repository for the lab exists on the master node
Check that the notebook has been updated to reference /flights/tzcorr/all_flights-00004