Machine Learning with Spark on Google Cloud Dataproc
In this lab you will learn how to implement logistic regression using a machine learning library for Apache Spark running on a Google Cloud Dataproc cluster to develop a model for data from a multivariable dataset.
Google Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way. Cloud Dataproc easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics and machine learning
Apache Spark is an analytics engine for large scale data processing. Logistic regression is available as a module in Apache Spark's machine learning library, MLlib. Spark MLlib, also called Spark ML, includes implementations for most standard machine learning algorithms such as k-means clustering, random forests, alternating least squares, k-means clustering, decision trees, support vector machines, etc. Spark can run on a Hadoop cluster, like Google Cloud Dataproc, in order to process very large datasets in parallel.
The base data set that is used provides historical information about internal flights in the United States retrieved from the US Bureau of Transport Statistics website. This data set can be used to demonstrate a wide range of data science concepts and techniques and is used in all of the other labs in the Data Science on the Google Cloud Platform and Data Science on Google Cloud Platform: Machine Learning quests. In this lab the data is provided for you as a set of CSV formatted text files.
In this lab, you will learn how to:
Prepare the Spark interactive shell on a Google Cloud Dataproc cluster.
Create a training dataset for machine learning using Spark.
Develop a logistic regression machine learning model using Spark.
Evaluate the predictive behavior of a machine learning model using Spark on Google Cloud Datalab
加入 Qwiklabs 即可阅读本实验的剩余内容…以及更多精彩内容！
- 获取对“Google Cloud Console”的临时访问权限。
- 200 多项实验，从入门级实验到高级实验，应有尽有。
Check that the Spark ML model files have been saved to Cloud Storage