
Building a Data Warehouse on AWS

[Diagram: Data -> Collect -> Store (Amazon S3) -> Process -> Analyze (Amazon Redshift) -> Visualize -> Answers]

@Lynn Langit
AWS Marketplace
Enterprise software store for business users who need simplified procurement

• 2000+ product listings
  • to browse, test and buy software
• 1-click deployment
  • to launch, in multiple regions around the world
• Pay-as-you-go pricing
  • to use on demand

[Categories shown: Business Intelligence, Advanced Analytics, Data Enablement]
Building a Data Warehouse on AWS

[Diagram: Data -> Collect -> Store (Amazon S3) -> Process -> Analyze (Amazon Redshift) -> Visualize -> Answers]

Move data into Redshift from S3 for analysis

AWS Marketplace Partners: Matillion, Yellowfin
Setup
Our Scenario and Source Files
“In this scenario we will use Matillion ETL for Redshift to prepare two separate data sources ready for analysis.
The sample data is US airport flight information from 1995 -> 2008. Every flight to or from a US airport (and whether it left on time or not) is included.
The second data set is weather data, taken from NOAA, including the daily weather readings for each US Airport.”

File Types
-- Text - .csv
-- Compressed - .gz

File Categories
Details / Events
-- Flights
-- Weather
Metadata
-- Airports
-- Carriers
Loading data from S3 into Redshift
Using Matillion ETL for Redshift
• Create Instance (AMI/EC2) of Matillion/AWS Marketplace
• Connect Matillion to Redshift
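
Loading from S3 into Redshift is commonly done with Redshift's COPY command. The sketch below builds such a statement for the gzip-compressed CSV files described in the scenario. It is illustrative only: the function name, table name, bucket path, and IAM role ARN are hypothetical placeholders, not values from this deck, and it does not claim to show what Matillion generates internally.

```python
def build_copy_statement(table, s3_path, iam_role, gzip=True, ignore_header=1):
    """Build a Redshift COPY statement for CSV files stored in S3.

    All argument values are supplied by the caller; nothing here
    connects to AWS or runs the statement.
    """
    options = ["CSV"]
    if gzip:
        # GZIP tells COPY that the source files are .gz compressed.
        options.append("GZIP")
    if ignore_header:
        # IGNOREHEADER skips the given number of header rows per file.
        options.append(f"IGNOREHEADER {ignore_header}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        + " ".join(options)
        + ";"
    )


if __name__ == "__main__":
    # Hypothetical table, bucket, and role, for illustration only.
    print(
        build_copy_statement(
            "flights",
            "s3://example-bucket/flights/",
            "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
        )
    )
```

The resulting statement would be run against the cluster through any SQL client; the S3 prefix can point at a folder of files so that the load is split across slices.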
Loading Data in Redshift
Table distribution styles
• Distribution Key: Same key to same location
• All: All data on every node
• Even: Round robin distribution

[Diagram: for each of the three styles, rows placed on the slices of Node 1 and Node 2]
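
The three distribution styles can be illustrated with a toy simulation. This is a minimal sketch, not Redshift's implementation: the `distribute` function is hypothetical, CRC32 stands in for whatever hash Redshift actually uses, and the "ALL" case copies rows to every slice for simplicity, whereas the slide describes one full copy per node.

```python
import zlib


def distribute(rows, style, num_slices, key=None):
    """Assign rows to slices under a simplified model of each style."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        if style == "ALL":
            # Every location receives a full copy of the data.
            for s in slices:
                s.append(row)
        elif style == "EVEN":
            # Round robin: row i goes to slice i mod num_slices.
            slices[i % num_slices].append(row)
        elif style == "KEY":
            # Same key value always hashes to the same slice.
            # CRC32 is an assumption used only for this illustration.
            h = zlib.crc32(str(row[key]).encode("utf-8"))
            slices[h % num_slices].append(row)
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return slices


if __name__ == "__main__":
    # Hypothetical sample rows.
    rows = [
        {"carrier": "AA", "flight": 1},
        {"carrier": "UA", "flight": 2},
        {"carrier": "AA", "flight": 3},
        {"carrier": "DL", "flight": 4},
    ]
    for style in ("KEY", "ALL", "EVEN"):
        result = distribute(rows, style, num_slices=4, key="carrier")
        print(style, [len(s) for s in result])
```

Key distribution keeps rows that join on the same value together, which avoids moving data between nodes at query time, at the cost of possible skew when a few key values dominate.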


Sort Keys
• Single Column - [ SORTKEY ( date ) ]
  • Queries that use 1st column (i.e. date) as primary filter
• Compound - [ COMPOUND SORTKEY ( date, region, country ) ]
  • Queries that use 1st column as primary filter, then other columns
• Interleaved - [ INTERLEAVED SORTKEY ( date, region, country ) ]
  • Queries that use different columns in filter
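
The sort key options above can be sketched as a small helper that emits the corresponding table DDL. This is an illustrative sketch only: the helper names, the table name, and the column definitions (including `flight_date` in place of `date`) are hypothetical and not taken from the deck.

```python
def build_sortkey_clause(columns, style="COMPOUND"):
    """Return the SORTKEY clause for a CREATE TABLE statement."""
    style = style.upper()
    if style not in ("COMPOUND", "INTERLEAVED"):
        raise ValueError(f"unknown sort key style: {style}")
    if not columns:
        raise ValueError("at least one sort column is required")
    cols = ", ".join(columns)
    if len(columns) == 1:
        # With a single column the style keyword is left out.
        return f"SORTKEY ({cols})"
    return f"{style} SORTKEY ({cols})"


def build_create_table(table, column_defs, sort_columns, style="COMPOUND"):
    """Build a CREATE TABLE statement with a table-level sort key."""
    cols = ",\n".join(f"    {name} {type_}" for name, type_ in column_defs)
    clause = build_sortkey_clause(sort_columns, style)
    return f"CREATE TABLE {table} (\n{cols}\n)\n{clause};"


if __name__ == "__main__":
    # Hypothetical schema, for illustration only.
    columns = [
        ("flight_date", "DATE"),
        ("region", "VARCHAR(32)"),
        ("country", "VARCHAR(32)"),
    ]
    print(build_create_table("flights", columns,
                             ["flight_date", "region", "country"],
                             style="INTERLEAVED"))
```

A compound key suits queries that always filter on the leading column first; an interleaved key gives each listed column equal weight, which helps when different queries filter on different columns.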
Time Series Data – Vacuum Operation

[Diagram: a table holds a Sorted Region and an Unsorted Region; vacuum sorts the Unsorted Region, then merges it into the Sorted Region]

Append in Sort Key Order
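
The sort-then-merge idea behind the vacuum diagram, and the reason appending in sort key order helps, can be shown with a toy model. This is a sketch under simplifying assumptions: both function names are hypothetical, rows are reduced to their sort key values, and the real VACUUM works on storage blocks and also reclaims space from deleted rows.

```python
import heapq


def vacuum(sorted_region, unsorted_region):
    """Sort the unsorted region, then merge it into the sorted region."""
    newly_sorted = sorted(unsorted_region)
    # heapq.merge combines two already-sorted sequences in order.
    return list(heapq.merge(sorted_region, newly_sorted))


def append_needs_merge(sorted_region, new_rows):
    """Return True if new rows would interleave with existing rows.

    When time series data arrives in sort key order, every new key is
    at or past the current maximum, so no interleaving merge is needed.
    """
    if not sorted_region or not new_rows:
        return False
    return min(new_rows) < sorted_region[-1]


if __name__ == "__main__":
    # Hypothetical date keys as ISO strings, which sort chronologically.
    existing = ["2008-01-01", "2008-01-02", "2008-01-03"]
    late = ["2008-01-02", "2007-12-31"]
    in_order = ["2008-01-04", "2008-01-05"]
    print(vacuum(existing, late))
    print(append_needs_merge(existing, late))
    print(append_needs_merge(existing, in_order))
```

Loading each new day of flight or weather data after the previous one keeps appended rows beyond the existing sorted range, so the expensive merge step has little or nothing to do.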
Visualizing with Yellowfin
Automate – https://github.com/lynnlangit/AWSDataWarehouse
