
Provisioning and Using a Managed Hadoop/Spark Cluster with Cloud Dataproc (Command Line)

40 minutes 5 Credits

GSP036

Google Cloud Self-Paced Labs

Overview

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

Without Cloud Dataproc, creating Spark and Hadoop clusters on-premises or through IaaS providers can take anywhere from five to 30 minutes. By comparison, Cloud Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less on average. This means you can spend less time waiting for clusters and more hands-on time working with your data.

This lab is adapted from https://cloud.google.com/dataproc/overview

What you'll learn

  • How to create a managed Cloud Dataproc cluster (with Apache Spark pre-installed)

  • How to submit a Spark job

  • How to resize a cluster

  • How to SSH into the master node of a Dataproc cluster

  • How to use gcloud to examine clusters, jobs, and firewall rules
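
The SSH and inspection tasks above can be sketched with the commands below. The cluster name `example-cluster`, the region, and the zone are placeholders for this sketch, not values prescribed by the lab; substitute your own.

```shell
# SSH into the master node. Dataproc names the master <cluster-name>-m
# by convention; the zone here is a placeholder.
gcloud compute ssh example-cluster-m --zone=us-central1-a

# Examine clusters, jobs, and firewall rules in the current project.
gcloud dataproc clusters list --region=us-central1
gcloud dataproc jobs list --region=us-central1
gcloud compute firewall-rules list
```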

Create a Cloud Dataproc cluster

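A cluster can be provisioned with a single command. As a minimal sketch, assuming a cluster named `example-cluster` in the `us-central1` region (both placeholders):

```shell
# Create a Dataproc cluster with default settings (one master, two workers).
# Apache Spark comes pre-installed on every Dataproc cluster.
gcloud dataproc clusters create example-cluster --region=us-central1
```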

Submit a Spark job to your cluster

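One way to exercise the cluster is to submit the SparkPi example that ships with Dataproc's Spark installation; the sketch below assumes the placeholder cluster name and region from the previous step.

```shell
# Submit the SparkPi example job. Arguments after the bare "--" are
# passed to the job itself (here, the number of tasks to run).
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```

The job's driver output, including the estimated value of Pi, is streamed back to your terminal.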

Resize the cluster

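Worker count can be changed without recreating the cluster. A sketch, again assuming the placeholder cluster name and region:

```shell
# Scale the cluster out to four workers, then back down to two.
gcloud dataproc clusters update example-cluster --region=us-central1 --num-workers=4
gcloud dataproc clusters update example-cluster --region=us-central1 --num-workers=2
```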
