Exploring Google Ngrams with Amazon EMR and Hive
SPL-DD-300-ANGNGR Version 4.0.11
© 2021 Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. All trademarks are the property of their owners.
Corrections, feedback, or other questions? Contact us at AWS Training and Certification.
Overview
In this lab, you will use Amazon EMR to analyze Ngrams from Google Books. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. For example, consider this sentence:
Some 2-grams from this sentence are "the sun", "in the" and "sets in". A sample 3-gram is "sets in the" and a sample 4-gram is "rises in the east".
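As a minimal sketch of how n-grams are extracted, the Python snippet below slides a window of n words across a sentence. The sentence it uses is an assumption, chosen only to be consistent with the samples above, and `ngrams` is a hypothetical helper name that is not part of the lab.

```python
def ngrams(text, n):
    """Return every contiguous run of n words in text, in order."""
    words = text.lower().split()
    # A window of size n fits at len(words) - n + 1 starting positions.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Assumed example sentence (illustrative, not taken from the lab).
sentence = "The sun rises in the east and sets in the west"

bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
fourgrams = ngrams(sentence, 4)

print(bigrams)
print(trigrams)
print(fourgrams)
```

Splitting on whitespace and lowercasing is the simplest possible tokenization; a real corpus would also need punctuation handling.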
N-grams are used to predict the probability of certain words appearing in a sequence. This can be useful for providing typing suggestions on web pages and mobile phones.
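To illustrate the prediction idea, here is a small sketch that counts 2-grams in a toy corpus and suggests the most frequent next word along with its estimated probability. The corpus text and the function names `train_bigram_model` and `suggest_next` are assumptions made for illustration; production typing suggestions rely on far larger corpora and more sophisticated models.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def suggest_next(model, word):
    """Return (most likely next word, estimated probability), or None if the word is unseen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    total = sum(counts.values())
    best, count = counts.most_common(1)[0]
    return best, count / total


# Toy corpus (assumed for illustration only).
corpus = "the sun rises in the east and the sun sets in the west"
model = train_bigram_model(corpus)

print(suggest_next(model, "the"))
print(suggest_next(model, "in"))
```

The probability is simply the count of a 2-gram divided by the count of all 2-grams that start with the same word.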
The steps in this lab closely follow what a data scientist does when analyzing a new dataset: loading the data, examining its attributes, and writing SQL to analyze it. You will run SQL against publicly available Ngrams data stored in Amazon S3.
Objectives
After completing this lab, you will be able to:
- Create an Amazon EMR cluster running Hive
- Use Hive statements to create tables from Google Ngram input data stored in Amazon S3
- Run Hive queries to drill down into and analyze the data
Start Lab
- At the top of your screen, launch your lab by choosing Start Lab
This starts provisioning your lab resources and displays the estimated provisioning time. You must wait for your resources to be provisioned before continuing.
If you are prompted for a token, use the one distributed to you (or credits you have purchased).
- Open your lab by choosing Open Console
This automatically logs you in to the AWS Management Console.
Do not change the Region unless instructed.
Common Login Errors
Error: Federated login credentials
If you see this message:
- Close the browser tab to return to your initial lab window
- Wait a few seconds
- Choose Open Console again
You should now be able to access the AWS Management Console.
Error: You must first log out
If you see the message, You must first log out before logging into a different AWS account:
- Choose click here
- Close your browser tab to return to your initial lab window
- Choose Open Console again