Lecture 1 Practical Analytics - Introduction, Data Sources, Data Modeling, Data Warehouse v3


Practical Analytics
What is data analytics?
 Simply speaking, data analytics is the process that takes us from data to decision

 Data  Information  Knowledge  Wisdom 


Decision e.g., audit investigation
What is data analytics?
 Data analytics is a process that involves:
a) Identifying the problem (e.g., accounts receivable: are we billing correctly? how much is outstanding?)
b) Gathering relevant data, which frequently are not in a usable form
c) Cleaning up the data to make them usable (data are dirty: garbage in, garbage out)
d) Loading them into storage models (different data models)
e) Manipulating them to discover information that leads to actionable insights (the job of a data miner)
f) Making decisions based on those insights (e.g., discovering misconduct, fraud, or mistakes)
The relationship among analytics (e.g., regression)
Analytics: part of data science
(insert figure 1‐3)
(Visualization: how to present the data visually so that users can properly understand the results.)
(Data are stored in storage media.)
Global Information Storage Capacity
How about now and the future?
 Source: Forbes.com ("175 Zettabytes By 2025," by Tom Coughlin)
One zettabyte is approximately equal to a thousand exabytes, a billion terabytes, or a trillion gigabytes.
Global Information Storage Capacity
How about now and the future?
 Source: Forbes.com ("175 Zettabytes By 2025," by Tom Coughlin)
(More people are using the cloud; more data are stored in the enterprise, given concerns about security and backup, and less on consumer devices such as smartphones, which cannot hold that much data.)
Global Information Storage Capacity
How about now and the future?
 Source: Forbes.com ("175 Zettabytes By 2025," by Tom Coughlin)
(About a quarter of the data is generated in China; Asia‐Pacific (Japan, Australia), the US, which generates less than China, and the rest of the world account for the remainder.)
Why study data analytics?
 Study by AMA (American Management Association):
 99% of business leaders indicated that analytics will be 
important to their organizations either at the time of the 
survey (2013) or in the five‐year period immediately 
succeeding it.
 PwC's 2015 white paper also indicated that 85% of CEOs at the surveyed companies placed a high value on data analytics and data mining, and that they targeted data analytics as their top IT spending priority.
Applications of Data Analytics
 Accounting
 Retail
 Manufacturing (logistics, reducing costs)
 Marketing
 Medicine
 Etc.
Data Fundamentals
Data Acquisition
Ways a client may provide data (clients are concerned about confidential information):
1. The client gives you access to the system
2. You are given access to the databases
3. You are provided with data files or sets to work on

 Data provisioning: the process of providing users and systems with access to data.
 Includes maintaining security authorizations to limit access to only those data which the user or system is officially permitted to view.
 Replication: data from a source system are replicated or copied → the source data remain intact.
1. There are different ways of granting access to data (for security reasons).
2. The original data may be corrupted during use → better to copy the source and work on the copy file; the original data remain intact (e.g., in Excel, once you save you cannot go back).
With big data today, Excel's size limits are a problem.
ACL provides a log file (a log of what was done to the file, and by whom) → important when working in a team, to know who did what (Excel cannot do this) → an advantage of ACL for data analytics.
Structured and Unstructured Data
 Structured data: computer‐readable and usable. Data sourced from databases, spreadsheets, flat files and other systems. (Delimited files, e.g., .csv, use a comma to separate data records.)
 Have metadata → data about data, which add context, meaning and purpose to the data. (Data types: numeric, text/character such as an ID, date.)
(Structured data can still be dirty, e.g., through bad data‐type definitions → the data still need cleaning.)
 Structured does not mean ready to be used → may contain errors, redundancies, and omissions.
 Unstructured data: do not conform to data models and associated metadata → no metadata, no fixed pattern; the data columns must be defined before a computer can properly read (process) them.
Structured and Unstructured Data
(Metadata: e.g., a PDF is unstructured data but can still be imported.)
Three most common types of structured data
 Spreadsheets: data stored in rows and columns
 Flat files: data in text format also stored in rows and 
columns. ASCII, comma separated values (.csv), and 
other delimited files are flat files.
 Often used to transfer data from one location to 
another
 Utilized to consolidate or synchronize data
 Databases: organized collections of data that enable 
users to access, manage, and update the data. 
.csv is the most popular file type → reasons delimited files are popular (disadvantages of Excel):
1. Big data cannot be stored in one Excel file, so the data may be split up → troublesome to combine (errors)
2. Excel files are much larger (formulas, multiple worksheets)
Flat file example: commas separate name, email, and phone → a .csv file, as in the sketch below.
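A minimal sketch of reading such a flat file with Python's standard csv module (the file name contacts.csv and its columns are hypothetical):

```python
import csv

# contacts.csv (hypothetical) holds one record per line, e.g.:
# name,email,phone
# Ann Lee,ann@example.com,555-0101
with open("contacts.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)  # the first row supplies the column names
    for row in reader:
        # each record arrives as a dict keyed by column name
        print(row["name"], row["email"], row["phone"])
```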
(Relational databases were invented by an IBM scientist, E. F. Codd.)

 The most popular database model is the relational database → create tables and relationships.
 Primary keys: uniquely and universally ("double u") identify each instance of an entity or relationship set. E.g., every student has a student ID → a good primary key.
 Foreign keys: other tables' primary keys referenced in the table.
(Assume one student takes only one course.)


Three Types of Anomalies

Tables that have not been normalized are associated with 
three types of problems:

 Insertion Anomaly:  A new item cannot be added to the 
table until at least one entity uses a particular attribute item.
 Deletion Anomaly: If an attribute item used by only one 
entity is deleted, all information about that attribute item is 
lost.
 Update Anomaly: A modification on an attribute must be 
made in each of the rows in which the attribute appears.
 Anomalies can be corrected by creating relational tables.

Three Types of Anomalies

Economic order quantity (when to place the order and the quantity to order):
1. Insertion anomaly: potential suppliers cannot be recorded because they have no part number yet.
2. Deletion anomaly: deleting one attribute of supplier Buell → its other attributes may be deleted with it.
3. Update anomaly: duplication is troublesome (e.g., a supplier such as Alibaba appears in many rows → its phone number must be updated multiple times).
→ Normalization is needed (divide the tables).
Three Types of Anomalies

Combined keys are unique (primary key + foreign key). Inventory and suppliers are many‐to‐many → another table should be created (covered later).
1. No more update problem.
2. The insertion problem is eliminated.
3. Deletion: deleting the information about supplier Buell leaves the inventory data unchanged.
→ None of the three problems remain → third normal form (3NF).
Normalization
 Normalization → eliminates those three anomalies.
 Normalization is the process of decomposing a 
database table into more tables.
 Business systems (transactional systems) are typically 
normalized up to the third normal form (3NF)
 Normalized tables are generated with referential integrity constraints between primary key and foreign key pairs → each foreign key must have a corresponding value in a primary key in another table. For instance, if Customer ID is the foreign key in a Sales Order table, then the primary key in the Customer table must also be Customer ID, as in the sketch below.
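A minimal sketch of that Customer/Sales Order pairing in SQLite, showing the referential integrity constraint doing its job (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL)""")
conn.execute("""CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL)""")

conn.execute("INSERT INTO customer VALUES (1, 'Ann Lee')")
conn.execute("INSERT INTO sales_order VALUES (100, 1, 250.0)")  # OK: customer 1 exists

try:
    # Violates referential integrity: there is no customer 99 in the customer table.
    conn.execute("INSERT INTO sales_order VALUES (101, 99, 80.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # rejected: FOREIGN KEY constraint failed
```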
Relational Databases & ER Diagram
 Details of the subjects will be further discussed in 
Lecture 2.
Processing unstructured data
 Tagged Data: XML and XBRL
 eXtensible Markup Language (XML): used to describe 
data to both humans and computers; tagging or coding 
data in documents, so that they can be read by both 
people and computers.
 HTML (Hyper Text Markup Language) → used to tag data so that browsers can display the data as a web page
 XML → used to create metadata about data so that data can be understood by computers for further processing and structuring.
 eXtensible Business Reporting Language (XBRL): developed by accounting professionals for reporting.
 Example of XBRL: labels the data → provides metadata in the file (e.g., tagging a figure as "sales"), even though the source document (e.g., a PDF of financial statements) is unstructured.
 Example of Inline XBRL (iXBRL), which comes from XML: HTML defines where the data appear on the page, while XML provides the metadata so the data can be processed further. Advantage: the metadata travel with the data, so users do not have to re‐enter them (as XBRL, the data are already structured). A sketch follows.
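A minimal sketch of how tagging supplies metadata that a program can act on; the element names below are illustrative, not real XBRL taxonomy tags:

```python
import xml.etree.ElementTree as ET

# The tags describe what each value means, so a program can pick data out reliably.
doc = """<report currency="USD">
  <sales contextRef="FY2023">1200000</sales>
  <netIncome contextRef="FY2023">150000</netIncome>
</report>"""

root = ET.fromstring(doc)
sales = root.find("sales")
print(sales.text, root.get("currency"), sales.get("contextRef"))
# -> 1200000 USD FY2023
```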
Processing unstructured data
 Image Recognition (still needs to be improved)

 Artificial Intelligence (AI)
Young Woman with a Letter and a Messenger in an Interior (1670) is an oil‐on‐canvas painting by the Dutch painter Pieter de Hooch; it is an example of Dutch Golden Age painting and is part of the collection of the Rijksmuseum.
Tim Cook: "I always thought I knew when the iPhone was invented, but now I'm not so sure."
Data Sources
 Transactional Systems (OLTP) (an ERP system can cover the first three sources)
 Informational Systems (OLAP)
 Legacy Systems
 Web Services (e.g., competitors' sites)
 Crawlers and Info Agents (Search Engines) (e.g., Google, Baidu)
 Social Media
 Sensors (collect data, e.g., temperature)
Data Sources
 Business data from transactional systems
 Transactional systems range from manual paper‐based systems to any number of computerized systems (in the past, manual paper; today, transactions come from ERP).
 Computerized systems may be as simple as an Excel spreadsheet or as complex as an ERP system.
 OLTP systems: information systems that typically facilitate and manage transaction‐oriented applications.
An ERP has two components (OLTP and OLAP); e.g., SAP, Oracle, PeopleSoft.

[Figure: ERP System and the Business Enterprise. A data warehouse and legacy systems feed the ERP system, which consists of On‐Line Analytical Processing (OLAP), bolt‐on applications (industry‐specific functions), and core On‐Line Transaction Processing (OLTP) functions: Sales & Distribution, Business Planning, Shop Floor Control, and Logistics, all sharing an operational database (customers, production, vendor, inventory, etc.) and linking customers with suppliers. Different modules take care of different functions.]
Characteristics of Transactional Systems (some companies run two systems to make sure one is available at all times)

 Availability: systems that process transactions should 
be available close to 100% of the time.
 Level of details: the data of transactional systems 
should be available at full detail so that each 
transaction as well as its content, creator, date and 
details  are available at all times.
 Updatable: business transactions are created, updated, or changed and deleted quite frequently.
 Speed: the ability to process large quantities of transactions is critical. (Transactions can be updated quickly, in real time; e.g., what used to be monitored and responded to from the mainframe is now real time.)
 Current: store only recent data, a year or two.
(Older data are stored in the data warehouse; current data stay in the operational database.)
(An ERP has different modules to take care of different functions. E.g., MRP handled only materials → later added labor, overhead, and the original inventory management → now called ERP, for the whole enterprise.)

Characteristics of Transactional Systems
 Operational: OLTP supports the organization’s 
business functions
 Concurrent: OLTP systems are accessed by many users at the same time → concurrency management is needed for concurrency control.
 Support requirements of business process
 Small uniform transactions: most transactions in an 
OLTP system are small and uniform.
 Optimized for storage: for efficiency and performance, transactions should be written quickly to the database. Quick storage → relational database.
 Data are functionally oriented, stored in the operational database. Advantages of the relational database: 1. no duplication; 2. data can be shared with other functions (tables and relationships make it easy to share data).
What is ERP?
(Different modules take care of different functions but can share data; ERP can also integrate with outsiders, e.g., suppliers.)
 Those activities supported by multi‐module application software that helps a company manage the important parts of its business in an integrated fashion.
 Key features include:
 Smooth and seamless flow of information across organizational boundaries
 Standardized environment with a shared database independent of applications, and integrated applications
Problems with Non-ERP Systems
(systems customized for one company only)
 In-house design limits connectivity outside the company
 Tendency toward separate ISs within the firm
 Lack of integration limits communication within the company
 Strategic decision-making not supported
 Long-term maintenance costs high
 Limits ability to engage in process reengineering
Traditional IS Model:
Closed Database Architecture
 Similar in concept to flat-file approach
 data remains the property of the application
 fragmentation limits communications
 Existence of numerous distinct and
independent databases
 redundancy and anomaly problems
 Paper-based
 requires multiple entry of data
 status of information unknown at key points
The closed architecture does not share one database →
1. Data are slow to share within the company: they cannot be shared electronically (e.g., paper to paper: print out a document, then re-input it into the next system).
2. Data are difficult to share outside the company.
How ERP differs (compare Figure 11-1 with Figure 11-2):
1. Instead of three databases → one database.
2. Share data with suppliers: one-directional data flow becomes two-directional (when both sides use ERP), e.g., procurement to supplier: when inventory falls below the EOQ, an order is placed automatically → better supply chain management.
3. Share data with customers → customer relationship management.
[Figure 11-1: Traditional Information System with Closed Database Architecture. Separate Customer (order entry), Manufacturing, and Procurement systems sit between customers and suppliers, passing products, materials, orders, and purchases along. Each keeps its own database: Customer (sales, accounts receivable), Manufacturing (production scheduling, shipping), Procurement (vendor, accounts payable, inventory).]
[Figure 11-2: ERP System and the Business Enterprise, repeated: a data warehouse and legacy systems feed the ERP, which comprises OLAP, bolt-on applications, and core OLTP functions over the operational database, linking customers and suppliers.]
Two Main ERP Applications
Online Transaction Processing (OLTP)
aka Core Applications
 support the day-to-day operational activities of
the business
 support mission-critical tasks through simple
queries of operational databases
 Consists of large numbers of relatively simple transactions, such as updating accounting records (e.g., customer tables).
 Simple: only a few records are actually retrieved or updated in a single transaction, as the sketch below shows.
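A minimal sketch of such a small, uniform transaction, using SQLite as a stand-in for the operational database (the account table and amounts are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500.0), (2, 0.0)])

# One business transaction touches just two rows, and either both updates
# commit or neither does (atomicity).
with conn:  # commits on success, rolls back on an exception
    conn.execute("UPDATE account SET balance = balance - 100 WHERE account_id = 1")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE account_id = 2")
print(conn.execute("SELECT * FROM account").fetchall())  # [(1, 400.0), (2, 100.0)]
```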
Two Main ERP Applications
Online Transaction Processing (OLTP)
cont’d
 Typically would include (but is not limited to):
 Sales and Distribution (order entry and delivery
scheduling),
 Business Planning (forecasting demand, planning
product production, and detailed routing information)
 Production Planning (Manufacturing Resource
Planning, MRP)
 Shop Floor Control (detailed production scheduling, dispatching, and job costing activities)
 Logistics modules (assuring timely delivery to the customer)
SAP modules, from the SAP tutorial web site
Two Main ERP Applications
Online Analytical Processing (OLAP)
(OLAP has access to old data in the data warehouse and to real-time data in the operational database → used for both long-term and short-term decision making.)

 decision support tool for management-critical tasks


through analytical investigation of complex data
associations
 supplies management with “real-time” information and
permits timely decisions to improve performance and
achieve competitive advantage
 includes decision support, modeling, information
retrieval, ad-hoc reporting/analysis, and what-if
analysis
E.g., using real-time data: with POS (point-of-sale) data, respond quickly to what is happening now, e.g., see which product is popular and produce more of it.
Two Main ERP Applications
Online Analytical Processing (OLAP) cont’d
 Access large amounts of data (e.g. several years of
sales data)
 Analyze the relationships between many types of business elements such as sales, products, geographic regions, and marketing channels (slicing and dicing, regression, clustering).

 Involve aggregated data such as sales volumes,


budgeted dollars, and dollars spent.
 Compare aggregated data over hierarchical time
periods.
 Present data in different perspectives.

Two Main ERP Applications
Online Analytical Processing (OLAP) cont’d
 Involve complex calculations between data elements
such as expected profit as a function of sales revenue
for each type of sales channel.
 Respond quickly to users' requests so that they can pursue an analytical thought process without being stymied (prevented) by system delays.
OLAP
 Supports management-critical tasks through
analytical investigation of complex data
associations captured in data warehouses:
 Consolidation is the aggregation or roll-up
of data.
 Drill-down allows the user to see data in
selective increasing levels of detail.
 Slicing and Dicing enables the user to
examine data from different viewpoints often
performed along a time axis to depict trends
and patterns.
(Slicing cuts the data into smaller pieces that are easier for the computer and the user to digest; dicing produces even smaller pieces.)
OLAP
[Figure 11-2a: an OLAP data cube]
OLAP
[Figure 11-2b: dicing → the details of the data]
OLAP
[Consolidation: roll the data up so the city dimension no longer matters]
OLAP
 A projection onto the xy plane (Quarter × Product) yields the sales over all cities.
 A projection onto the xz plane (Quarter × City) yields the sales over all products.
 A projection onto the yz plane (Product × City) yields the sales over all four quarters.
OLAP
 A projection onto the x-axis (Quarter) yields the sales of all products in all cities.
 A projection onto the y-axis (Product) yields the sales over all quarters in all cities.
 A projection onto the z-axis (City) yields the sales over all quarters of all products.
 Collapsing the cube into the cell next to the origin (summing over all 3 dimensions) yields the total sales figure for all quarters, products, and cities. A pandas sketch of these projections follows.
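A minimal pandas sketch of these projections on a toy Quarter × Product × City cube (the data values are made up):

```python
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "product": ["A",  "B",  "A",  "B"],
    "city":    ["NYC", "LA", "NYC", "LA"],
    "amount":  [100,   80,   120,   90],
})

# Projection onto the xy plane (Quarter x Product): sums over all cities.
print(sales.pivot_table(index="quarter", columns="product",
                        values="amount", aggfunc="sum"))

# Projection onto the x-axis (Quarter): sums over all products and cities.
print(sales.groupby("quarter")["amount"].sum())

# Collapsing the cube to a single cell: total sales over all dimensions.
print(sales["amount"].sum())
```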
OLTP vs OLAP

 OLTPs are transaction processing systems


with many individual transactions that
affect a few records in a variety of
connected tables per request.
 OLAPs (informational systems) are research databases where user requests access very large amounts of data to analyze relationships, aggregate data, or segment data with complex algorithms.
OLTP vs OLAP

 Level of Detail (e.g., sales: before, we only knew the journal entry → now we can link to other tables, e.g., use the invoice number to link to the inventory table and see which inventory was sold)

 OLTP: detailed data down to individual


transaction, i.e., its content, creator, date, and
change logs.
 OLAP: summarized data by customer, product,
or store location etc. 
 Detailed data are often not necessary for analysis.
 Reduces the system's storage requirements for fiscal and security reasons.
 Allows for quicker queries.
(OLAP data are not normalized to third normal form because we want quick access to the data; a fully relational design takes too long to query since it has too many tables.)
OLTP vs OLAP
 Periodic volatile data

 OLTP: real-time data → updatable, i.e., created, updated, or changed and deleted frequently.
 OLAP: periodic data extracts → past data, may not be volatile.

 Requirements not always known


 OLAP: database design has to be broad enough
to handle a variety of queries since the queries
are generated from decision-making standpoints.
Decisions change according to situations.
(With queries, the user needs some knowledge about the data, e.g., for regression; with pattern discovery, e.g., finding that someone who buys A will also buy B, the user does not need prior knowledge of the data.)
OLTP vs OLAP
 Managerial Requirements
 OLTP: queries focus on operations
 OLAP: managerial (and strategic) in nature, e.g., adding or dropping product lines.
 Optimized for access
 OLTP: data storage structure is optimized for writing (inserts, updates, and deletes). Speed → the ability to process large quantities of transactions is crucial.
 OLAP: optimized for access, i.e., for quick reading; often referred to as read-optimized.
OLTP vs OLAP
 Historical Data
 OLTP: current data → only recent data, a year or two.
 OLAP: historical data → creates a clear picture of past business events and trends to make optimal decisions.
(OLAP is not only for historical data; it also has access to current data, but decision making uses historical data more.)
 Data may be integrated → consolidation
 OLAP: data are fully integrated → analyses can be performed across departments, functional areas, time, regions, product lines, and other business entities.
OLTP vs OLAP

 Availability
 OLTP: Should be available close to 100% of the
time.
(Segregation of duties:)
 Not all users should have access to all of the data in the system. Authentication and authorization to access data are required.
 Concurrent
 OLTP: accessed by many users at the same time.
OLTP vs OLAP

[Figure 11-2, repeated: ERP System and the Business Enterprise.]
ERP System Configurations: Databases and Bolt-Ons
(discussed later)
 Database Configuration: the operational database uses the relational model, while the data warehouse uses a star schema (different databases for different configurations)
 selection of database tables in the thousands
 setting the switches in the system
 Bolt-on Software (e.g., a manager can cancel an order at any time → redirect the truck driver to another city; this works with SAP through an interface, the bolt-on software)
 third-party vendors provide specialized-functionality software (Domino's Pizza)
 Supply-Chain Management (SCM) links vendors, carriers, third-party logistics companies, and information systems providers
(E.g., a firm may install only certain modules, such as the standardized financial module, and keep legacy systems for the rest → the legacy systems' data are shared with the bolt-on applications and the ERP.)
SAP modules, from the SAP tutorial web site
What is a Data Warehouse?
 A relational or multi-dimensional database that may
consume hundreds of gigabytes or even terabytes of
disk storage
 The data is normally extracted periodically from operational
database or from a public information service.
 A database constructed for quick searching, retrieval,
ad-hoc queries, and ease of use
 An ERP system could exist without having a data
warehouse. The trend, however, is that organizations
that are serious about competitive advantage deploy
both. The recommended data architecture for an ERP
implementation includes separate operational and data
warehouse databases
(Does CRM still need a data warehouse? Companies are concerned about data security → customers get access only to the data warehouse. Past data, data about competitors, and data about countries → used to formulate strategy.)
Data analytics on the ERP operational database
 ERPs typically utilize relational databases to facilitate data sharing for transaction processing.
 Data analytics of ERP data, examples:
 Big chain retailers could use ERP data to analyze the sales of each customer by product, store, region, time of day, and date → change store hours, promote products, etc.
 Manufacturing companies optimize their logistics by 
analyzing data regarding the costs of and the time 
required for shipments to customers and from vendors.
Middle Platform (中台)
[Diagram: an Operation Center and a Data Center sit between the front end and the ERP]
 Why a middle platform?
 ERP packaged software → slower response to customers' needs (because requests have to go through different modules). The middle platform processes transactions first and lets the ERP deal with them later, speeding up the process.
 Short term: provide the front end with quicker responses through the operation center and data center.
 Long term: De-ERP → rearrange the ERP modules to support the middle platform and front end better (the ERP is still needed for processing).
Reference: https://www.infoq.cn/article/wCZV6X5uujxDXFP0Eub9
Data Sources (review)
 Transactional Systems (OLTP)
 Informational Systems (OLAP)
 Legacy Systems
 May include obsolete or aged hardware, outdated software and older 
operating systems.
 Could be customized software applications or in‐house developed 
applications.
 Special interfaces may be needed to retrieve data
 Web Services
 Crawlers and Info Agents (Search Engines)
 Alexa, owned by Amazon → provides hits to websites
 Social Media
 Facebook, Twitter, etc. → identify consumer buying habits, likes and dislikes, and potential upcoming purchases
 Sensors
 Gathered from devices like heating units, vehicles, health monitors, etc.
Data Collecting
Identifying the source data is the first step. Next, 
the data collection methods:
 Sampling
 Calibrating and Scaling
 Continuous monitoring and Embedded 
Audit Module
 Data from Feedback Mechanisms
 Intelligent Control Agents
Sampling
 The act of extracting only certain data values from a dataset → a subset of the data. (Examining everything is time-consuming and costly.)
 Pull a data sample to study consumer purchasing behaviors, possible election results, etc.
 Random sampling is crucial → selected data points are representative of the entire population.
 Sampling is appropriate when:
 Each data point is representative of the entire set
 The source data set is too large for the planned analysis (the problem: fraud may be present, but time and resources are limited)
 The application specifically calls for a data sample → some accounting and regulatory compliance audits. A pandas sketch of random sampling follows.
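A minimal sketch of pulling a random sample with pandas (the dataset and sample size are hypothetical):

```python
import pandas as pd

transactions = pd.DataFrame({
    "txn_id": range(1, 10001),
    "amount": [round(50 + (i % 97) * 3.7, 2) for i in range(10000)],
})

# random_state makes the draw reproducible, which matters for audit workpapers.
sample = transactions.sample(n=200, random_state=42)
print(len(sample), sample["amount"].mean())
```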
Calibrating and Scaling
 Data calibration: establishing a relationship between a data point and a unit of measure that has been formally defined. (E.g., different accounting standards in different countries → adjust for FIFO for a fair comparison; convert currencies.)
 A comparison is made between an observation and a standard.
 Source data may be standardized → scaled to lie between a minimum and maximum value, or transformed to be more normally distributed (e.g., a learning curve is exponential). A sketch of both scalings follows.
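A minimal sketch of the two scalings just mentioned, min-max scaling and z-score standardization, on a toy series:

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 25.0, 40.0])

# Min-max scaling: values now lie between 0 and 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, standard deviation 1 (more normally shaped).
z_score = (x - x.mean()) / x.std()

print(min_max.tolist())
print(z_score.round(2).tolist())
```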
Continuous Monitoring and Embedded Audit Modules (EAM)
 All or most business transactions need to be examined in order to uncover fraudulent activities.
 Traditionally, identifying fraud using samples of transactional data is very problematic because fraudulent transactions usually occur in a small number of instances → statistically less likely to be included in the test sample.
 SOX made it necessary for businesses to continuously monitor their information systems for irregularities that might reflect error or fraud.
 Data analytics → auditors need to examine all of the data within a company's database (identify fraudulent activities, then investigate).
Continuous Monitoring and Embedded Audit Modules – continued
 In essence, auditors need to be data miners, exploring vast 
quantities of data and carefully reviewing all of the data in 
the organization’s systems to identify
 Trends
 Outliers
 Other anomalies
 Embedded Audit Module (EAM) → a system within a system (like an ERP) → analyzes transactional data to identify abnormalities, like unusual usage, or provides the electronic audit trail needed to verify transactions within the system.
(EAMs identify possible fraud and abnormal transactions: e.g., a bank texts or calls you when your card is used somewhere you do not usually go; e.g., a buyer purchasing from unusual suppliers. A sketch of such a rule follows.)
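A minimal sketch of one rule an EAM might run, flagging large purchases from suppliers a buyer rarely uses; the thresholds, field names, and data are all hypothetical:

```python
import pandas as pd

purchases = pd.DataFrame({
    "buyer":    ["ann", "ann", "ann", "ann", "bob"],
    "supplier": ["S1",  "S1",  "S1",  "S9",  "S2"],
    "amount":   [500,   520,   480,   9000,  300],
})

# Count how often each buyer has used each supplier.
freq = purchases.groupby(["buyer", "supplier"])["amount"].transform("count")

# Flag first-time buyer/supplier pairs with unusually large amounts.
flagged = purchases[(freq == 1) & (purchases["amount"] > 1000)]
print(flagged)  # ann's 9000 purchase from the rarely used supplier S9
```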
Data Gathering
 Sampling
 Calibrating and Scaling
 Continuous Monitoring and Embedded Audit Modules
 Data from Feedback Mechanisms (IT audit)
 The IT department analyzes system performance logs → abnormal traffic, usage trends, systems liable to fail (security breaches and illegal transactions)
 EAMs → exception reporting
 Intelligent Control Agents (ICA)
 Work autonomously with distributed systems to control or run a system (e.g., watching traffic, or unauthorized transactions by date).
Data Staging
 Data staging (e.g., using ACL to prepare the data):
 The process whereby data are organized and prepared for analysis → ensures that data are consistent, relevant, and free of ambiguities (well defined).
 To stage data, we need to understand the characteristics of and the relationships among the data items (as an auditor, you need to understand the firm, e.g., via SWOT analysis).
 The definition of the data and their relationships is called data modeling.
*ETL (Extraction, Transformation and Loading) (sketched below)
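A minimal ETL sketch in pandas, assuming a hypothetical sales_export.csv source and a warehouse.db target:

```python
import sqlite3
import pandas as pd

# Extraction: pull raw data from a source system (here, a CSV export).
raw = pd.read_csv("sales_export.csv")           # hypothetical source file

# Transformation: clean and harmonize so the data are consistent and unambiguous.
raw["amount"] = raw["amount"].fillna(0.0)       # resolve missing values
raw["region"] = raw["region"].str.upper()       # harmonize inconsistent coding

# Loading: write the staged data into the analytic store.
conn = sqlite3.connect("warehouse.db")          # hypothetical target database
raw.to_sql("staged_sales", conn, if_exists="replace", index=False)
```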
(Normalized → divide tables; denormalized → combine/merge tables. OLTP accesses small amounts of data, while OLAP accesses large amounts → speed is important, and a relational database is slow because tables have to be joined.)
Data Modelling: Database Structure
 OLAP systems often use denormalized (flat‐file) 
database structures
 The primary reason to denormalize a database is performance → faster query speeds.
(A relational database holds detailed transactions, e.g., the items on invoices.)
 Normalized database → many tables → a JOIN must be performed to combine tables → longer processing time needed → more costly
 So, to speed up queries → denormalized databases → fewer tables, and JOINs take less time to perform
Data Modelling: Database Structure
Denormalized databases
 Read-only → nonvolatile, unchanging (historical)
 Often called a Data Warehouse → provides the persistent (permanent) storage of summarized, harmonized, cleansed, and consolidated data, often from multiple sources (internal information plus external information about competitors, governments, etc.).
Multidimensional Modelling: Star Schema
[Figure: a star schema with a central fact table whose composite key gathers the primary keys of the surrounding dimension tables. The dimension tables are denormalized (read-only, so although they carry the three anomaly issues, such as duplication, they are not updated).]
Multidimensional Modelling: Star Schema
The most common model for storing data in a cube for analytics; used for data warehouses for many years.
 Pros
 The star-like layout → easy to understand and implement
 Only a single level of JOIN per query (a relational design must join many levels → slower to join and retrieve)
 Cons (the data files are larger, among other issues, but these matter less because the data do not change)
 Duplication of dimensional data
 With alphanumeric primary and foreign keys, the query joins are slower and performance suffers
 Multiple languages not supported
 Time-dependent changes, or historization, not supported (you cannot follow up on which commands were executed → you do not know what has been done to or changed in the data)
 Data hierarchies not handled well
Multidimensional Modelling: Snowflake Schema
The snowflake schema is an enhancement of the star schema that aims to overcome the drawbacks of the popular schema.
 The dimension tables of the star schema are divided into additional tables, thereby adding branching to the star.
(The dimension tables are normalized, but not to 3NF, because that would create too many tables and slow data retrieval.)
Star vs Snowflake Schema
Reference: Emil Drkušić, database designer and developer, financial analyst
(Which one is better for OLAP?)
Star vs Snowflake Schema
The First Difference: Normalization
 Snowflake schemas will use less space to store dimension tables, because a normalized database produces far fewer redundant records.
 Denormalized models increase the chances of data integrity problems (the three anomaly issues). These issues will complicate future modifications and maintenance as well. (In a data warehouse the data do not change, so this is not much of an issue.)
Star vs Snowflake Schema
The Second Difference: Query Complexity
 A snowflake schema query is more complex, because the dimension tables are normalized and we need to dig deeper to get detailed data. For example, a JOIN will be needed for every level inside the same dimension.
 In the star schema, we only JOIN the fact table with those dimension tables we need (a single JOIN level, using the primary keys), as the sketch below contrasts.
 Basically, a query run against a snowflake-schema data mart will execute more slowly.
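A minimal sketch of the contrast in SQL (the fact and dimension table names are illustrative): the star query joins each dimension once, while the snowflake query needs an extra join inside the product dimension to reach the category.

```python
# Illustrative queries only; the fact and dimension tables are hypothetical.

# Star schema: one JOIN level, straight from the fact table to each dimension.
star_query = """
SELECT d.year, p.category, SUM(f.amount)
FROM   fact_sales f
JOIN   dim_date    d ON f.date_key    = d.date_key
JOIN   dim_product p ON f.product_key = p.product_key
GROUP  BY d.year, p.category
"""

# Snowflake schema: the product dimension is itself normalized, so reaching
# the category takes an extra JOIN level inside the same dimension.
snowflake_query = """
SELECT d.year, c.category_name, SUM(f.amount)
FROM   fact_sales f
JOIN   dim_date     d ON f.date_key     = d.date_key
JOIN   dim_product  p ON f.product_key  = p.product_key
JOIN   dim_category c ON p.category_key = c.category_key
GROUP  BY d.year, c.category_name
"""
```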
Star vs Snowflake Schema
Which one to use? (Think about storage space.)
Use the snowflake schema (needs less space; no duplication):
 In the data warehouse. As the warehouse is the data center for the company, we can save a lot of space this way.
 When dimension tables require a significant amount of storage space.
Use the star schema:
 In data marts. Data marts are subsets of data taken out of the central data warehouse and created for different projects. Saving storage space is not a priority.
(Data can be stored by column or by row; e.g., if you only need to read the customer column, a column store touches far less data.)
Columnar Database – a recent development
Three main benefits of storing data in columns (see the sketch below):
1. Higher read efficiency
2. Better compression than row-based relational databases
3. Higher sorting and indexing efficiency
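A minimal plain-Python sketch of the two layouts, showing why reading one attribute is cheaper in a column store:

```python
# Row-oriented layout: each record is stored together; reading one attribute
# still touches every record.
rows = [
    {"id": 1, "customer": "Ann", "amount": 100},
    {"id": 2, "customer": "Bob", "amount": 250},
]
total = sum(r["amount"] for r in rows)

# Column-oriented layout: each attribute is stored contiguously; reading,
# sorting, or compressing one column never touches the others.
columns = {
    "id":       [1, 2],
    "customer": ["Ann", "Bob"],
    "amount":   [100, 250],
}
total = sum(columns["amount"])   # scans only the amount column
```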
