Advanced Amazon Redshift: Table Layout and Schema Design
SPL-121 - Version 1.2.2
© 2019 Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited.
Errors or corrections? Email us at email@example.com.
Other questions? Contact us at https://aws.amazon.com/contact-us/aws-training/
In this lab, you will take a close look at different types of table layout and schema design. You will create tables using various methods for data compression and distribution, and analyze which methods work best, including incorporating Amazon Redshift recommendations. You will conclude the lab by building five different versions of the same table, and analyzing how the differences impact performance.
By the end of this lab, you will be able to:
- Create Redshift tables and load data from Amazon S3
- Perform detailed analysis of a table's layout and design, including taking advantage of Redshift's recommendations and automatic efficiencies
- Set parameters in CREATE table and scripts to optimize table layout and design, based on how you plan to use the table (what types of data you will load, what types of queries you will run)
- Similarly, set parameters in COPY scripts when loading data for optimal performance
Technical Knowledge Prerequisites
To successfully complete this lab, you should be familiar with basic Redshift concepts.
Other AWS Services
Other AWS Services than the ones needed for this lab are disabled by IAM policy during your access time in this lab. In addition, the capabilities of the services used in this lab are limited to what's required by the lab and in some cases are even further limited as an intentional aspect of the lab design. Expect errors when accessing other services or performing actions beyond those provided in this lab guide.
What is Amazon Redshift?
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth of the cost of traditional data warehousing solutions. Typically, you will see 3x compression, reducing your costs to $333 per uncompressed terabyte per year.
Amazon Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and parallelizing and distributing queries across multiple notes. With automation for most of the common administrative tasks associated with provisioning, configuring, monitoring, backing- up, and securing a data warehouse, Amazon Redshift is even easier to use.
This lab guide explains basic concepts of AWS in a step by step fashion. However, it can only give a brief overview of Redshift concepts. For further information, see the official Amazon Web Services Documentation for Redshift at https://aws.amazon.com/documentation/redshift/. For pricing details, see https://aws.amazon.com/redshift/pricing/.
Compression is a column-level operation that reduces the size of data when it is stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.
By default, Amazon Redshift stores data in its raw, uncompressed format. You can apply a compression type, or encoding, to the columns in a table manually when you create the table, or you can use the COPY command to analyze and apply compression automatically. For details about applying automatic compression, see Loading Tables with Automatic Compression.
For more details, see http://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html.
Distribution and Storage
When you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table's distribution style. When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
For more details, see http://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html.
Join Qwiklabs to read the rest of this lab...and more!
- Get temporary access to the Amazon Web Services Console.
- Over 200 labs from beginner to advanced levels.
- Bite-sized so you can learn at your own pace.