
Data Warehousing

Chapter 29
Special thanks to: Liem Tran, Robert Turan, and Miguel Delgado
Outline
I. Introduction to Data Warehousing
I. Definition of Data Warehouse
II. Data Warehouse Development
III. Benefits and Advantages
IV. Database vs Data Warehouse
V. Datamart
II. Data Warehouse Structure
I. Star & Snowflake Schema
II. Fact Table & Dimension Table
III. OLTP
IV. OLAP
V. ROLAP
VI. MOLAP
VII. HOLAP
III. Real Life Implementation of Data Warehouse
I. Data Warehouse Providers
II. What Data Warehouse Do Big Companies Use?
III. Job Opportunities
What is a data warehouse?

“A Data Warehouse (DW) is simply a consolidation of data from a variety of sources that is designed to support strategic and tactical decision making. Its main purpose is to provide a coherent picture of the business at a point in time.”
– data-warehouses.net
What led to DW development?
• Complex logic was required to join multiple tables, making it difficult to get the desired information.
• Inefficient queries had to be run after normal operating hours so as not to disrupt other transactions.
• Increased number of databases within a single business.
• More databases = inconsistent and fragmented data!
• Lack of useful information needed for decision making.
Benefits and advantages
• Queries can be performed without impacting support systems (transactional and operational systems).
• Data is organized to support complex questions.
• Faster and more efficient retrieval of information gives the user a better experience.
• A single version of the truth for the data.
• Data is persistent.
• Data can be integrated from multiple sources.
Database vs Data Warehouse
Database:
• Current data
• Online Transactional Processing (OLTP)
• Detailed information
• Relational view (columns and rows)

Data warehouse:
• Historical data
• Online Analytical Processing (OLAP)
• Summarized information
• Multidimensional view (layers of columns and rows)
Data Marts
• Subsets of a data warehouse.
• Based on star or snowflake schema.
• Designed to meet the needs of a specific group in an
organization.
• Easier access to data that is needed.
• Shorter implementation time.
• <100 GB of storage
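As a rough illustration only (the mart_retail_sales, dimStore, and Division names below are assumed, not taken from the slides), a data mart can be populated by copying just the slice of warehouse data one group needs:

-- Hypothetical: build a sales data mart holding only the retail division's rows
CREATE TABLE mart_retail_sales AS
SELECT f.DateKey, f.ProductKey, f.StoreKey, f.SalesAmount
FROM factSales AS f
JOIN dimStore AS s ON s.StoreKey = f.StoreKey
WHERE s.Division = 'Retail';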
ODS (Operational Data Store) – a type of database often used as an interim area for a data warehouse.

Staging area (landing zone) – an intermediate storage area used for data processing during the extract, transform and load (ETL) process. (Wikipedia)
ETL
• Extract – Data is extracted from one or more data sources.
• Validation rules are applied
• Transform – Data is processed to ensure consistency.
• E.g. Date formatting
• Load – Process of loading the formatted data into the data
warehouse.
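A minimal sketch of the transform-and-load step in SQL, assuming a stg_sales staging table and its columns (hypothetical names; the extract step is taken to have already copied the raw source rows into it):

-- Transform + Load: normalize the date format and apply a simple validation rule
INSERT INTO factSales (DateKey, ProductKey, SalesAmount)
SELECT CAST(REPLACE(s.sale_date, '/', '') AS INT),  -- '2010/07/04' becomes 20100704
       s.product_key,
       s.sales_amount
FROM stg_sales AS s
WHERE s.sales_amount IS NOT NULL;                   -- validation: drop rows with no amount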
Hardware Considerations
•Configure I/O for bandwidth not capacity
Disk drives may have enough space to store data,
but I/O can suffer.
•Stripe data files
Files are distributed across multiple disks
Highest possible I/O bandwidth
•Use redundancy
Protects against hardware failure
•Plan for growth
Schemas

Star vs Snowflake
Star Schema
• Simplest and most commonly used design

• Consists of a fact table that points to dimension tables

• Dimensions are not normalized

• PKs of the dimension tables make up the composite key in the fact table

• High performance due to the simple design and few joins

• Recommended for large or small data warehouses & data marts (see the SQL sketch below)
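A minimal sketch of a star schema in SQL, reusing the dimDate and factSales names from the sample query later in this deck (dimProduct and the individual columns are assumed for illustration):

-- Dimension tables: denormalized, one primary key each
CREATE TABLE dimDate (
    DateKey          INT PRIMARY KEY,   -- e.g. 20100704
    Year             INT,
    CalendarYearWeek INT
);

CREATE TABLE dimProduct (
    ProductKey  INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Category    VARCHAR(50)             -- stored in-line, not normalized
);

-- Fact table: FKs to the dimensions plus measurable attributes;
-- the dimension PKs together form the composite key
CREATE TABLE factSales (
    DateKey     INT REFERENCES dimDate (DateKey),
    ProductKey  INT REFERENCES dimProduct (ProductKey),
    SalesAmount DECIMAL(10,2),
    PRIMARY KEY (DateKey, ProductKey)
);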
Fact & Dimension Tables

Example Star Schema

Example Star VS Snowflake Schema


Snowflake Schema
• More complex version of the Star schema

• Dimension tables are normalized

• More complex queries due to more joins

• Longer query execution times

• If dimension tables hold a high volume of data, the snowflake design can save space

• Recommended for smaller data warehouses (sketched below)
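Sketched against the star-schema tables above, snowflaking means normalizing a dimension out into further tables (names again assumed for illustration):

-- The product category is moved into its own table...
CREATE TABLE dimCategory (
    CategoryKey  INT PRIMARY KEY,
    CategoryName VARCHAR(50)
);

-- ...so dimProduct now holds a FK instead of the category text,
-- saving space but adding one more join to every category query
CREATE TABLE dimProduct (
    ProductKey  INT PRIMARY KEY,
    ProductName VARCHAR(100),
    CategoryKey INT REFERENCES dimCategory (CategoryKey)
);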
Snowflake Schema
What exactly is a fact table?

• Contains keys for dimension tables and measurable attributes

• Table is usually narrow due to few attributes

• Table has many rows due to high volume of measurable data

• Measurable attributes are usually numeric values that provide data useful
for business and data analysts

• For example:
• the fact table for a retail store will contain key attributes, and the fact attributes can be Sales, Price, Time, and City.
Dimension Tables
• Each table describes a specific dimension

• Tables contain many attributes

• Not too many rows but can grow large in some cases

• Usually rows contain textual values (descriptions of objects)

• Denormalized in Star schema

• Normalized in Snowflake schema


Sample Query
-- Total sales per calendar week for 2010
SELECT CalendarYearWeek, SUM(SalesAmount) AS WeeklySales
FROM factSales AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
WHERE d.Year = 2010
GROUP BY CalendarYearWeek;

Results for weekly sales in 2010


OLTP – Online Transaction Processing
• Purpose is to update data efficiently

• Uses simple queries for speed

• Uses little space

• Functions on a normalized database schema

• Tuning is important to maximize performance
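A sketch of the kind of work OLTP systems are tuned for: short transactions that read and write a handful of rows by key (the orders and inventory tables are assumed for illustration; transaction syntax varies slightly by DBMS):

BEGIN TRANSACTION;

-- record a new order
INSERT INTO orders (order_id, customer_id, order_date)
VALUES (1001, 42, '2010-07-04');

-- adjust stock for the item sold
UPDATE inventory
SET quantity_on_hand = quantity_on_hand - 1
WHERE product_id = 7;

COMMIT;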


OLAP – Online Analytical Processing

• Main purpose is to analyze data

• Uses historical data

• Different methods (ROLAP, MOLAP, HOLAP)

• Software sits between client and DBMS

• The OLAP tool understands the database organization and contains special functions to analyze it
ROLAP – Relational Online Analytical Processing
• Performs dynamic analysis of data stored in a relational database, rather
than on a multidimensional database

• Requires more processing time and disk space to perform some of the tasks that multidimensional databases are designed for

• Supports larger user groups and larger amounts of data
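For instance, a ROLAP tool typically translates an analysis request into aggregate SQL run directly against the relational star schema (GROUP BY ROLLUP is supported by most major DBMSs, though syntax varies slightly; the table names follow the sample query earlier):

-- Weekly totals, yearly subtotals, and a grand total in a single pass
SELECT d.Year,
       d.CalendarYearWeek,
       SUM(f.SalesAmount) AS TotalSales
FROM factSales AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
GROUP BY ROLLUP (d.Year, d.CalendarYearWeek);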


MOLAP – Multidimensional Online Analytical Processing
• A version of OLAP that indexes directly into a multidimensional database

• Processes data in a multidimensional array (all combinations of data are reflected, stored in a cell that can be accessed directly)

• Faster than other methods (due to indexing)


HOLAP – Hybrid Online Analytical Processing
• Combination of ROLAP and MOLAP

• Stores data in both relational and multidimensional databases

• Uses whichever database is better for the processing goal at hand

• The relational database (RDB) is used for data-heavy processing, the multidimensional database (MDDB) for more speculative processing
Data Warehouse Providers
Teradata
• Founded in 1979

• Teradata Active Enterprise Data Warehouse is the platform that runs the Teradata Database, with added data
management tools and data mining software.

• The data warehouse differentiates between “hot” and “cold” data – meaning that the warehouse puts data that is not often used in a slower storage section. As of October 2010, Teradata uses Xeon 5600 processors for the server nodes.
• Frequently used data – hot data

• Teradata Database 13.10 was announced in 2010 as the company’s database software for storing and processing
data.

• Teradata Database 14 was sold as the upgrade to 13.10 in 2011 and runs multiple data warehouse workloads at the
same time. It includes column-store analyses.

• Teradata Integrated Analytics is a set of tools for data analysis that resides inside the data warehouse.
Teradata https://www.teradata.com/
Industries they serve https://www.teradata.com/Solutions

Teradata Customers https://www.teradata.com/Customers?ViewAll=ViewAll


Oracle
• A multinational computer technology corporation, headquartered in Redwood Shores, California

• database software and technology, cloud engineered systems and enterprise software products—particularly its own
brands of database management systems

• In 2015 Oracle was the second-largest software maker by revenue, after Microsoft

• Oracle 18c Database is the industry standard for high-performance, scalable, optimized data warehousing.

• The company’s specialized platform for the data warehousing side is the Oracle Exadata Machine.

• There are an estimated 390,000 Oracle DBMS customers worldwide, and about 4,000 Exadata data warehousing
appliances have been sold.
Database Data Warehousing Guide
Oracle Warehouse in the Cloud
Amazon
• An American electronic commerce and cloud computing company, founded on July 5, 1994, by Jeff Bezos and based in Seattle, Washington

• AWS (Amazon Web Services) was officially launched in 2006

• The shift of data storage and warehousing to the cloud over the last several years has been momentous, and Amazon has been a market leader in that shift.

• Amazon offers a whole ecosystem of data storage tools and resources that complement its cloud services platform.

• For example, there is Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse cloud solution; AWS Data
Pipeline, a web service designed for transporting data between existing AWS data services; and Elastic MapReduce, which
provides an easily managed Hadoop solution on top of the AWS services platform.

• Big AWS customers: Airbnb, Adobe Systems, Johnson & Johnson, GE, BMW, Canon, NASA,…

• AWS pulls in about $11 billion in revenue each year for Amazon.
• Three engineers from Google, Yahoo and Facebook (Christophe Bisciglia, Amr Awadallah and Jeff Hammerbacher,
respectively) joined with a former Oracle executive (Mike Olson) to form Cloudera in 2008

• Cloudera has emerged in recent years as a major enterprise provider of Hadoop-based data storage and processing
solutions.

• Cloudera offers an Enterprise Data Hub (EDH) as its variant of the operational data store / data warehouse.

• The EDH is Cloudera’s proprietary framework for the “information-driven enterprise” and focuses on “batch
processing, interactive SQL, enterprise search, and advanced analytics—together with the robust security, governance,
data protection, and management that enterprises require.”

• Cloudera’s data warehouse is based on CDH, Cloudera’s distribution of Apache Hadoop and the most widely used Hadoop distribution.

• Cloudera customers: Mastercard, Deloitte,..


• MarkLogic is a Silicon Valley-based private software firm founded in 2001 that offers an enterprise NoSQL database
platform.

• In 2013 MarkLogic released a new semantics platform which provides the capability of storing billions of RDF triples that can be queried with SPARQL (a semantic query language for RDF).

• Customers: Aetna, NBC, BBC, Boeing, U.S. Navy, and U.S. Army.
What Data Warehouse Do Big Companies Use?

Teradata Customers:
• Apple: operates a multiple-petabyte Teradata system and was Teradata’s “fastest-ever customer to a petabyte.”

• Walmart: Teradata’s first-ever terabyte-scale database in 1992.

• eBay: Its primary data warehouse is over 9.2 petabytes; its “singularity system” that stores web clicks and other “big” data is more than 40 petabytes.
What Data Warehouse Do Big Companies Use?

Teradata Customers:
• Harrah’s (Caesars Entertainment): understands how much money particular gamblers can afford to lose in a day before they won’t come back the next day.

• Disney: new bracelet tickets equipped with GPS and NFC that track everything visitors do
while inside Disney’s amusement parks.

• Continental Airlines: to keep its customers happy, began assessing them by lifetime value and began making alternative arrangements for them as soon as the airline realized flights would be delayed.
What Data Warehouse Do Big Companies Use?

Other Big Companies:


• Microsoft: uses its own Azure Data Warehouse.

• Amazon: uses its own Amazon Redshift.

• IBM: uses its own IBM Data Warehouse.

• Facebook: Hive
Data Warehouse Developer

• The Data Warehouse Developer is responsible for the successful delivery of business intelligence information to the
entire organization and is experienced in BI development and implementations, data architecture and data
warehousing.

• Additional responsibilities include:

A. Creating star schema data models, performing ETLs and validating results with business representatives

B. Supporting implemented BI solutions by: monitoring and tuning queries and data loads, addressing user
questions concerning data integrity, monitoring performance and communicating functional and technical issues.
Data Warehouse Developer
Data Warehouse Analyst

• A data warehouse analyst collects, analyzes, mines and helps the business leverage the information stored in data
warehouses.

• Professionals in this role research and recommend technology solutions related to data storage, reporting, importing
and other business concerns; they also define the user interfaces for managing the interaction between data.

• A data warehouse analyst is often expected to collaborate with business intelligence analysts and developers to
translate data requirements into logical data models.
Data Warehouse Analyst
ETL Developer
• Write scripts to extract data from the source systems

• Analyze data

• Familiar with SQL, Java, XML and several data warehouse architecture techniques such as EDW, ODS, DM,
ROLAP and MOLAP.

• Help the production support team to resolve ETL batch failures

• Perform data analysis as per the project requirements

• Write scripts to execute ETL programs

• Create design documents / test cases.

• Create data models.


ETL Developer

https://www.indeed.com/jobs?q=etl%20developer&l
Thank You
