EMR File System Client-side Encryption Using AWS KMS-managed Keys
SPL-148 - Version 1.1.4
© 2019 Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited.
Errors or corrections? Email us at email@example.com.
Other questions? Contact us at https://aws.amazon.com/contact-us/aws-training/
In this lab, you will enable client-side at-rest encryption with an AWS KMS-managed key for data stored in Amazon S3 through the EMR File System (EMRFS). Using Amazon EMR, you will create a security configuration that encrypts objects written to S3 with client-side encryption using the AWS KMS-managed key you specify, and decrypts objects with the same key that was used to encrypt them. This lets you more easily use frameworks such as Apache Spark, Apache Tez, and Apache Hadoop MapReduce on Amazon EMR to run big data analytics, stream processing, machine learning, and ETL workloads on confidential data.
This lab will demonstrate how to:
- Create an Amazon S3 bucket
- Create a key using AWS KMS
- Create a security configuration in Amazon EMR to enable client-side encryption using an AWS KMS-managed key
- Launch an Amazon EMR (Elastic MapReduce) cluster using the AWS Management Console
- Read and write objects from and to S3 using the EMR File System (EMRFS)
- View EMR output data directly from Amazon S3
Technical knowledge prerequisites
To successfully complete this lab, you should be familiar with the basics of Hadoop and the Hadoop Distributed File System (HDFS).
You should also be familiar with basic Linux server administration and comfortable using the Linux command-line tools. You also need to be able to connect to a Linux instance, e.g. using SSH (from an OS X or Linux client) or PuTTY (from a Windows client).
Other AWS services
AWS services other than those needed for this lab are disabled by IAM policy during your access time in this lab. In addition, the capabilities of the services used in this lab are limited to what's required by the lab and in some cases are even further limited as an intentional aspect of the lab design. You should expect errors when accessing other services or performing actions beyond those provided in this lab guide.
What is Amazon EMR?
Amazon EMR is a web service that provides a managed Hadoop framework, making it easy, fast, and cost-effective to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark and Presto in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. Amazon EMR securely and reliably handles big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
What is EMRFS?
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
What is AWS KMS?
AWS Key Management Service (KMS) is a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data, and uses Hardware Security Modules (HSMs) to protect the security of your keys. AWS Key Management Service is integrated with several other AWS services to help you protect the data you store with these services. AWS Key Management Service is also integrated with AWS CloudTrail to provide you with logs of all key usage to help meet your regulatory and compliance needs.
What is Amazon S3?
Amazon Simple Storage Service (Amazon S3) provides developers and IT teams with secure, durable, highly scalable cloud storage. Amazon S3 is easy-to-use object storage, with a simple web service interface to store and retrieve any amount of data from anywhere on the web. With Amazon S3, you pay only for the storage you actually use. On the AWS cloud, Amazon S3 is a good candidate for a data lake implementation that stores large-scale data for big data analytics using Amazon EMR.
What is a security configuration in Amazon EMR?
You can use a security configuration to encrypt data at rest, data in transit, or both. Each security configuration is stored in Amazon EMR rather than in cluster configuration objects, so you can easily reuse a configuration to specify encryption settings whenever a cluster is created.
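To make the idea concrete, here is a minimal Python sketch of what a security configuration for this lab's scenario might look like as a JSON document: at-rest encryption enabled, in-transit encryption disabled, and S3 encryption set to client-side encryption with a KMS key (CSE-KMS). The function name and the key ARN are hypothetical placeholders and do not come from the lab; the lab itself creates the configuration through the console.

```python
import json


def build_cse_kms_security_configuration(kms_key_arn: str) -> str:
    """Return a JSON security configuration that enables only at-rest
    encryption, using S3 client-side encryption with a KMS key (CSE-KMS)."""
    config = {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "CSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                }
            },
        }
    }
    return json.dumps(config, indent=2)


# Hypothetical placeholder ARN; substitute the ARN of the key you create.
EXAMPLE_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

security_configuration_json = build_cse_kms_security_configuration(EXAMPLE_KEY_ARN)
print(security_configuration_json)
```

If you were scripting this instead of using the console, a string like this could be passed as the `SecurityConfiguration` argument of the boto3 EMR client's `create_security_configuration` call, together with a `Name` of your choosing. Because the configuration is stored by name, the same one can be referenced each time a cluster is launched.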
How do you connect to the EMR master node using SSH?
Secure Shell (SSH) is a network protocol you can use to create a secure connection to a remote computer. After you make a connection, the terminal on your local computer behaves as if it is running on the remote computer. When you use SSH with AWS, you are connecting to an EC2 instance, which is a virtual server running in the cloud. When working with Amazon EMR, the most common use of SSH is to connect to the EC2 instance that is acting as the master node of the cluster. You can issue Linux commands on the master node, run applications such as Hive and Pig interactively, browse directories, read log files, and so on. You can also create a tunnel in your SSH connection to view the web interfaces hosted on the master node.
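As a rough illustration of the two uses described above (an interactive session and a tunnel), the following Python sketch assembles the corresponding `ssh` command lines. It assumes the default `hadoop` login user on the master node; the key file name, host name, and local port are hypothetical placeholders, and the helper function is not part of the lab.

```python
import shlex
from typing import List, Optional


def build_ssh_command(key_file: str, master_public_dns: str,
                      tunnel_port: Optional[int] = None) -> List[str]:
    """Build the argument list for an ssh connection to the EMR master node.

    When tunnel_port is given, the command opens a dynamic (SOCKS) port
    forward instead of an interactive shell.
    """
    command = ["ssh", "-i", key_file]
    if tunnel_port is not None:
        # -N: do not run a remote command; -D: dynamic port forwarding
        command += ["-N", "-D", str(tunnel_port)]
    command.append(f"hadoop@{master_public_dns}")
    return command


# Hypothetical placeholder values for illustration only.
KEY_FILE = "lab-key.pem"
MASTER_DNS = "ec2-203-0-113-10.compute-1.amazonaws.com"

print(shlex.join(build_ssh_command(KEY_FILE, MASTER_DNS)))
print(shlex.join(build_ssh_command(KEY_FILE, MASTER_DNS, tunnel_port=8157)))
```

The argument lists can be pasted into a terminal as printed, or handed to `subprocess.run` if you prefer to launch the connection from a script. Windows users connecting with PuTTY would supply the same host, user, and key through its own settings instead.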
Notice the lab properties below the lab title:
- setup - The estimated time to set up the lab environment
- access - The time the lab will run before automatically shutting down
- completion - The estimated time the lab should take to complete
- At the top of your screen, launch your lab by clicking
If you are prompted for a token, use the one distributed to you (or credits you have purchased).
A status bar shows the progress of the lab environment creation process. The AWS Management Console is accessible during lab resource creation, but your AWS resources may not be fully available until the process is complete.
- Open your lab by clicking
This will automatically log you into the AWS Management Console.
Please do not change the Region unless instructed.
Common login errors
Error: Federated login credentials
If you see this message:
- Close the browser tab to return to your initial lab window
- Wait a few seconds
- Click again
You should now be able to access the AWS Management Console.
Error: You must first log out
If you see the message, You must first log out before logging into a different AWS account:
- Click the "click here" link
- Close your browser tab to return to your initial Qwiklabs window
- Click again