Lecture 1 Practical Analytics - Introduction, Data Sources, Data Modeling, Data Warehouse v3
Analytics
What is data analytics?
Simply speaking, data analytics is the process that takes us from data to decision:
b) Gathering relevant data, which frequently are not in a usable form
c) Cleaning up the data to make them usable (data are dirty: garbage in, garbage out)
d) Loading them into storage (different data models)
e) Manipulating them to discover information that leads to actionable insights (data mining)
f) Making decisions based on those insights, e.g., discovering misconduct, fraud, or mistakes
The relationship among analytics
e.g., regression
Analytics: part of data science
(insert figure 1‐3)
Applications: marketing, medicine, etc.
Data Fundamentals
1. The client gives you the system (the client is concerned about confidential information).
Data provisioning: the process of providing users and systems with access to data.
Includes maintaining security authorizations to limit access to only those data which the user or system is officially permitted to view.
Replication: data from a source system are replicated or copied; the source data remain intact.
1. Different ways of granting access to data (for security reasons).
2. The original data may be corrupted after use, so it is better to copy the source and work on the copy; the original data remain intact (e.g., in Excel, after saving you cannot go back).
Have metadata, i.e., data about data, adding context, meaning, and purpose to data. Data types: numeric, text/character (e.g., an ID), date.
Structured data can still be dirty (e.g., wrong data-type definitions), so the data must be cleaned.
Structured does not mean ready to be used: the data may contain errors, redundancies, and omissions.
Unstructured data: do not conform to data models and associated metadata; they have no metadata and follow no pattern, so the data columns must be defined before a computer can properly read (process) them.
Structured and Unstructured Data
The most popular model of databases is the relational database: create tables and relationships.
Primary keys: uniquely and universally ("double u") identify each instance of an entity or relationship set, e.g., every student has a student ID, which makes a good primary key.
Foreign keys: other tables' primary keys referenced in the table.
Tables that have not been normalized are associated with three types of problems:
Insertion Anomaly: a new item cannot be added to the table until at least one entity uses a particular attribute item.
Deletion Anomaly: if an attribute item used by only one entity is deleted, all information about that attribute item is lost.
Update Anomaly: a modification to an attribute must be made in each of the rows in which the attribute appears.
Anomalies can be corrected by creating relational tables.
Three Types of Anomalies
1. Insertion anomaly: the part number cannot be recorded for potential suppliers.
2. Deletion anomaly: deleting one attribute of Buell may delete its other attributes as well.
3. Update anomaly: duplication is troublesome (e.g., for a company like Alibaba): one supplier's phone number must be updated multiple times.
—> need normalization (divide the tables)
Three Types of Anomalies
Combined keys are unique (primary key + foreign key). Inventory and Suppliers have a many-to-many relationship, so another table should be created (covered later).
1. The update problem is gone.
2. The insertion problem is eliminated.
3. Deletion: deleting the information of supplier Buell leaves the inventory data unchanged.
—> none of the three problems remain —> 3NF
Normalization
Normalization eliminates those three anomalies.
Normalization is the process of decomposing a database table into more tables.
Business systems (transactional systems) are typically normalized up to the third normal form (3NF).
Normalized tables are generated with referential integrity constraints between primary key and foreign key pairs: each foreign key must have a corresponding value in a primary key in another table.
For instance, if Customer ID is the foreign key in a Sales Order table,
then the primary key in the Customer table must also be Customer ID.
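The referential integrity constraint described above can be sketched in a few lines of Python; the table contents and field names are illustrative, not from the lecture:

```python
# Hypothetical sketch: checking referential integrity between a Sales Order
# table (foreign key: customer ID) and a Customer table (primary key:
# customer ID). All names and values are made up for illustration.

customers = {101: "Acme Ltd", 102: "Buell Inc"}           # PK -> attributes
sales_orders = [
    {"order_id": 1, "customer_id": 101, "amount": 250.0},
    {"order_id": 2, "customer_id": 102, "amount": 90.0},
    {"order_id": 3, "customer_id": 999, "amount": 40.0},  # orphan foreign key
]

def orphan_orders(orders, customer_table):
    """Return orders whose foreign key has no matching primary key."""
    return [o for o in orders if o["customer_id"] not in customer_table]

violations = orphan_orders(sales_orders, customers)
print(violations)  # the order referencing customer 999 violates integrity
```

A real database enforces this with a FOREIGN KEY constraint; the check above only illustrates what that constraint guarantees.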
Relational Databases & ER Diagram
Details of the subjects will be further discussed in
Lecture 2.
Processing unstructured data
Tagged Data: XML and XBRL
eXtensible Markup Language (XML): used to describe data to both humans and computers by tagging or coding data in documents, so that they can be read by both people and computers.
HTML (HyperText Markup Language) is used to tag data so that browsers can display the data as a web page.
XML is used to create metadata about data so that the data can be understood by computers for further processing and structuring.
eXtensible Business Reporting Language (XBRL): developed by accounting professionals for reporting.
Example of XBRL: labels the data, providing metadata to the file (e.g., a Sales figure), even though the source documents are unstructured (e.g., a PDF of financial statements).
Example of Inline XBRL (iXBRL): comes from XML; the HTML defines the locations while the XML provides more information, so the document can be processed.
Advantage: provides metadata for the data, so the data do not have to be re-entered (because XBRL is structured).
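A minimal sketch of how tagging makes data machine-readable; the tag names below are simplified stand-ins rather than real XBRL taxonomy elements:

```python
# Sketch: once facts are tagged with metadata (name, unit, period),
# a computer can extract and process them without manual re-entry.
# The <fact> tags here are illustrative, not actual XBRL markup.
import xml.etree.ElementTree as ET

doc = """
<report>
  <fact name="Sales" unit="USD" period="FY2023">500000</fact>
  <fact name="NetIncome" unit="USD" period="FY2023">75000</fact>
</report>
"""

root = ET.fromstring(doc)
# Build a dict of tagged values: the metadata tells us what each number means.
facts = {f.get("name"): float(f.text) for f in root.iter("fact")}
print(facts["Sales"])  # the tagged Sales value, now machine-readable
```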
Processing unstructured data
Image Recognition: still needs to be improved
Artificial Intelligence (AI)
Young Woman with a Letter and a Messenger in an Interior (1670) is an oil-on-canvas painting by the Dutch painter Pieter de Hooch; it is an example of Dutch Golden Age painting and is part of the collection of the Rijksmuseum.
Tim Cook: "I always thought I knew when the iPhone was invented, but now I'm not so sure."
Data Sources
Transactional Systems (OLTP): e.g., ERP systems
Informational Systems (OLAP)
Legacy Systems
Web Services: e.g., competitors' data
Social Media
Sensors: collect data, e.g., temperature
Data Sources
Business data from transactional systems
Transactional systems range from manual paper-based systems to any number of computerized systems; today the data come from ERP transactions, whereas in the past they came from manual paper records.
Computerized systems may be as simple as an Excel spreadsheet or as complex as an ERP.
OLTP systems: information systems that typically facilitate and manage transaction-oriented applications.
An ERP has two components (OLAP and OLTP); examples include SAP, Oracle, and PeopleSoft.
[Figure: ERP system architecture. The ERP system connects legacy data warehouse systems, customers, and suppliers. It comprises On-Line Analytical Processing (OLAP) and bolt-on applications (industry-specific functions) on top of the core functions, On-Line Transaction Processing (OLTP): Sales & Distribution, Business Planning, Shop Floor Control, and Logistics. Modules take care of different functions. All core functions share an operational database (customers, production, vendor, inventory, etc.).]
Characteristics of Transactional Systems (some companies run two systems in parallel to make sure the system is available all the time)
Availability: systems that process transactions should be available close to 100% of the time.
Level of detail: the data of transactional systems should be available at full detail, so that each transaction, as well as its content, creator, date, and details, is available at all times.
Updatable: business transactions are created, updated, changed, and deleted quite frequently.
Speed: the ability to process large quantities of transactions is critical; data can be updated in real time, quickly.
(Monitoring and response used to run on the mainframe; now they happen in real time.)
Current: store only recent data, a year or two. Older data are stored in the data warehouse; current data stay in the operational database.
An ERP has different modules to take care of different functions. For example, MRP covered only materials; once labor, overhead, and the original inventory management were added, it became ERP, covering the whole enterprise.
Characteristics of Transactional Systems
Operational: OLTP supports the organization's business functions.
Concurrent: OLTP systems are accessed by many users at the same time, so concurrency management is needed for concurrency control.
Supports the requirements of the business process.
Small uniform transactions: most transactions in an OLTP system are small and uniform.
Optimized for storage: for efficiency and performance, transactions should be written quickly to the database; quick storage in a relational database.
Data are functionally oriented (no duplication) and stored in an operational database, a kind of relational database.
Traditional IS Model:
Closed Database Architecture
Similar in concept to the flat-file approach:
data remain the property of the application
fragmentation limits communications
Existence of numerous distinct and independent databases:
redundancy and anomaly problems
Paper-based:
requires multiple entry of data
status of information unknown at key points
Because applications do not share one database:
1. It is slow to share data within the company; data cannot be shared electronically (e.g., paper to paper: print out a document, then re-enter it into the other system).
2. It is difficult to share data outside the company.
What an ERP changes (Business Enterprise, Figure 11-1):
1. Instead of three databases, there is one database.
2. Share data with suppliers: data flowing in one direction changes to data flowing in both directions (when both parties use ERP), e.g., procurement to supplier: when inventory falls below the EOQ, an order is placed automatically, giving better supply chain management.
3. Share data with customers: customer relationship management.
[Figure 11-1: Business Enterprise. Customers place orders with the Order Entry and Distribution System (orders, products); the Manufacturing System handles materials; the Procurement System handles purchases from suppliers.]
Two Main ERP Applications
Online Transaction Processing (OLTP)
aka Core Applications
supports the day-to-day operational activities of the business
supports mission-critical tasks through simple queries of operational databases
consists of large numbers of relatively simple transactions, such as updating accounting records (e.g., customer tables)
Two Main ERP Applications
Online Analytical Processing (OLAP)
Accesses large amounts of data (e.g., several years of sales data).
Analyzes the relationships between many types of business elements such as sales, products, geographic regions, and marketing channels (slice and dice, regression, clustering).
Two Main ERP Applications
Online Analytical Processing (OLAP) cont’d
Involves complex calculations between data elements, such as expected profit as a function of sales revenue for each type of sales channel.
Responds quickly to users' requests so that they can pursue an analytical thought process without being stymied by system delay.
OLAP
Supports management-critical tasks through analytical investigation of complex data associations captured in data warehouses:
Consolidation is the aggregation or roll-up of data.
Drill-down allows the user to see data at selectively increasing levels of detail.
Slicing and Dicing enables the user to examine data from different viewpoints, often performed along a time axis to depict trends and patterns.
Slicing cuts the data into smaller pieces that are easier to digest by computer and user; dicing cuts them smaller still.
OLAP
(Figure 11-2a)
(Figure 11-2b)
OLAP
A projection onto the x-axis (Quarter) yields the sales of all products in all cities.
A projection onto the y-axis (Product) yields the sales over all quarters in all cities.
A projection onto the z-axis (City) yields the sales over all quarters of all products.
Collapsing the cube into the cell next to the origin (summing over all three dimensions) yields the total sales figure for all quarters, products, and cities.
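The projections above can be sketched in plain Python; the cube contents (quarters, products, cities, sales figures) are made up for illustration:

```python
# Sketch of the three cube projections. The data cube is a dict keyed by
# (quarter, product, city) -> sales; all values are illustrative.
from collections import defaultdict

cube = {
    ("Q1", "Widget", "Boston"): 100, ("Q1", "Gadget", "Boston"): 80,
    ("Q2", "Widget", "Boston"): 120, ("Q2", "Widget", "Dallas"): 90,
}

def project(cube, axis):
    """Sum over the other two dimensions (axis: 0=quarter, 1=product, 2=city)."""
    totals = defaultdict(int)
    for key, sales in cube.items():
        totals[key[axis]] += sales
    return dict(totals)

print(project(cube, 0))  # projection onto Quarter: sales per quarter
print(project(cube, 1))  # projection onto Product: sales per product
print(sum(cube.values()))  # collapsing all three dimensions: total sales
```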
OLTP vs OLAP
Level of Detail: e.g., for sales, previously only the journal entry was known; now information can be linked to other tables (the invoice number links to the inventory table, showing the inventory sold).
OLTP vs OLAP
Managerial Requirements
OLTP: queries focus on operations.
OLAP: managerial (and strategic) in nature, e.g., adding or dropping product lines.
Optimized for access
OLTP: the data storage structure is optimized for writing (inserts, updates, and deletes); speed, the ability to process large quantities of transactions, is crucial.
OLAP: optimized for access, i.e., for quick reading; often referred to as read-optimized.
OLTP vs OLAP
Historical Data
OLTP: current data only; recent, a year or two of data.
OLAP: historical data create a clear picture of past business events and trends to support optimal decisions.
(OLAP is not only for historical data; it also has access to current data, but decision-making uses historical data more.)
Availability
OLTP: should be available close to 100% of the time.
(Keeping the two systems separate also supports segregation of duties.)
ERP System Business Enterprise
(Figure 11-2)
ERP System Configurations:
Databases and Bolt-Ons
(covered later)
ERPs typically utilize relational databases to facilitate data sharing for transaction processing.
Examples of data analytics on ERP data:
Big chain retailers could use ERP data to analyze the sales of each customer by product, store, region, time of day, and date, informing changes of store hours, promotion of products, etc.
Manufacturing companies optimize their logistics by analyzing data regarding the costs of, and the time required for, shipments to customers and from vendors.
Middle Platform (中台 )
Reference: https://www.infoq.cn/article/wCZV6X5uujxDXFP0Eub9
Data Sources (review)
Transactional Systems (OLTP)
Informational Systems (OLAP)
Legacy Systems
May include obsolete or aged hardware, outdated software, and older operating systems.
Could be customized software applications or in-house developed applications.
Special interfaces may be needed to retrieve data.
Web Services
Crawlers and Info Agents (Search Engines)
Alexa, owned by Amazon, provides traffic data (hits) for websites.
Social Media
Facebook, Twitter, etc. identify consumer buying habits, likes and
dislikes, and potential upcoming purchases
Sensors
Gathered from devices like heating units, vehicles, health monitors, etc.
Data Collecting
Identifying the source data is the first step. Next come the data collection methods:
Sampling
Calibrating and Scaling
Continuous monitoring and Embedded
Audit Module
Data from Feedback Mechanisms
Intelligent Control Agents
Sampling
The act of extracting only certain data values from a dataset, i.e., a subset of the data. (Examining everything is time-consuming and costly.)
Pull a data sample to study consumer purchasing behaviors, possible election results, etc.
Random sampling is crucial: the selected data points must be representative of the entire population.
Sampling is appropriate when:
each data point is representative of the entire set (problematic for fraud detection, but time and resources are limited);
the source data set is too large for the planned analysis;
the application specifically calls for a data sample, e.g., some accounting and regulatory compliance audits.
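A minimal random-sampling sketch in Python; the population of transaction IDs and the sample size are illustrative:

```python
# Sketch: drawing a simple random sample so that the selected data points
# are representative of the population. Values are made up for illustration.
import random

random.seed(42)                      # fixed seed, only to make the example reproducible
population = list(range(1, 10001))   # e.g., 10,000 transaction IDs
sample = random.sample(population, k=100)  # random sample without replacement

print(len(sample))       # 100 data points drawn
print(len(set(sample)))  # no duplicates: also 100
```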
Calibrating and Scaling
Data calibration: establishing a relationship between a data point and a unit of measure that has been formally defined (e.g., different accounting standards in different countries, where an adjustment for FIFO allows a fair comparison, or currency exchange rates).
A comparison is made between an observation and a standard.
Source data may be standardized: scaled to lie between a minimum and maximum value, or transformed to be more normally distributed (e.g., a learning curve is exponential).
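Min-max scaling, one of the standardizations mentioned above, can be sketched as follows; the sales figures are illustrative:

```python
# Sketch of min-max scaling: mapping source data onto the [0, 1] interval.
def min_max_scale(values):
    """Rescale values so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

sales = [120.0, 340.0, 90.0, 510.0]  # illustrative figures
scaled = min_max_scale(sales)
print(scaled)  # smallest value maps to 0.0, largest to 1.0
```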
Continuous monitoring and embedded audit modules (EAM)
All or most business transactions need to be examined in order to uncover fraudulent activities.
Traditionally, identifying fraud using samples of transactional data is very problematic, because fraudulent transactions usually occur in a small number of instances and are statistically less likely to be included in the test sample.
SOX made it necessary for businesses to continuously monitor their information systems for irregularities that might reflect error or fraud.
Data analytics: auditors need to examine all of the data within a company's database to identify and investigate fraudulent activities.
Continuous monitoring and embedded audit modules (continued)
In essence, auditors need to be data miners, exploring vast quantities of data and carefully reviewing all of the data in the organization's systems to identify:
Trends
Outliers
Other anomalies
An Embedded Audit Module (EAM), a system within a system such as an ERP, analyzes transactional data to identify abnormalities like unusual usage, or provides the electronic audit trail needed to verify transactions within the system.
It identifies possible fraud and abnormal transactions, e.g., a bank sends a message or calls you when your card is used at a place you do not usually visit, or a company buys from an unusual supplier.
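A hypothetical EAM-style check, flagging purchases from suppliers outside an approved list (analogous to a bank flagging card use in an unusual location); supplier names and amounts are made up:

```python
# Sketch of an embedded-audit-module rule: flag any purchase whose supplier
# is not on the approved list. All names and values are illustrative.
approved_suppliers = {"Acme", "Buell", "Crane"}

purchases = [
    {"po": 1001, "supplier": "Acme",   "amount": 500},
    {"po": 1002, "supplier": "Zenith", "amount": 750},  # unusual supplier
]

def flag_abnormal(transactions, approved):
    """Return transactions from suppliers outside the approved set."""
    return [t for t in transactions if t["supplier"] not in approved]

alerts = flag_abnormal(purchases, approved_suppliers)
print([a["po"] for a in alerts])  # purchase orders to investigate
```

A production EAM would run such rules continuously inside the ERP; this sketch only shows the logic of one rule.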
Data Gathering
Sampling
Calibrating and Scaling
Continuous Monitoring and Embedded Audit Modules
Data from Feedback Mechanisms (IT audit)
The IT department analyzes system performance logs for abnormal traffic, usage trends, and components liable to fail, as well as security breaches and illegal transactions.
EAMs: exception reporting
Intelligent Control Agents (ICA)
Work autonomously with distributed systems to control or run a system, e.g., watching for abnormal traffic or unauthorized transactions by date.
Data Staging
Data staging (e.g., using ACL to prepare the data):
The process whereby data are organized and prepared for analysis; it ensures that data are consistent, relevant, well defined, and free of ambiguities.
To stage data, we need to understand the characteristics of, and the relationships among, the data items (as an auditor, you need to understand the firm, e.g., via a SWOT analysis).
The definition of the data and their relationships is called data modeling.
*ETL (Extraction, Transformation and Loading)
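The ETL steps can be sketched as a tiny pipeline; the field names and records are illustrative:

```python
# Minimal ETL sketch: extract raw records, transform (clean and type-cast),
# and load into a staging structure. Data and field names are made up.
raw_rows = [
    {"id": "1", "amount": " 100.50 ", "region": "east"},
    {"id": "2", "amount": "bad",      "region": "WEST"},  # dirty record
]

def transform(row):
    """Cast types and standardize case; return None for unusable rows."""
    try:
        return {"id": int(row["id"]),
                "amount": float(row["amount"].strip()),
                "region": row["region"].strip().upper()}
    except ValueError:
        return None  # garbage in, garbage out: drop what cannot be cleaned

# Load: keep only the rows that survived the transform step.
staged = [t for r in raw_rows if (t := transform(r)) is not None]
print(staged)  # only the clean, consistent record is loaded
```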
Drawbacks of the Star Schema (its dimension tables are not normalized into separate tables):
Time-dependent changes (historization) are not supported.
Data hierarchies are not handled well.
Multidimensional Modelling:
Snowflake Schema
The Snowflake Schema is an enhancement to the Star Schema to overcome the drawbacks of that popular schema.
The dimension tables of the star schema are divided into additional tables, thereby creating branching from the star.
(The dimension tables are normalized, but not all the way to 3NF, because that would create too many tables and make data retrieval slower.)
Star vs Snowflake Schema
Reference: Emil Drkušić, database designer and developer, financial analyst
Which one is better for doing OLAP?
Star vs Snowflake Schema
The First Difference: Normalization
Snowflake schemas use less space to store dimension tables, because a normalized database produces far fewer redundant records.
Denormalized models increase the chances of data integrity problems (the three anomaly issues) and complicate future modifications and maintenance.
(In a data warehouse the data do not change, so this is not much of an issue.)
Star vs Snowflake Schema
The Second Difference: Query Complexity
A snowflake schema query is more complex: because the dimension tables are normalized, you must dig deeper to get detailed data. For example, a JOIN is needed for every level inside the same dimension.
In the star schema, we only JOIN the fact table with the dimension tables we need (a single-level join via the primary key).
Basically, a query run against snowflake schema data in a data mart will execute more slowly.
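The difference in query complexity can be illustrated with SQLite from Python; the tables, the one snowflaked dimension level, and the data are made up for the sketch:

```python
# Sketch: a star-style query joins the fact table to one dimension table,
# while a snowflake-style query needs an extra JOIN for each normalized
# level. Table names and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales(product_id INT, amount REAL);
CREATE TABLE dim_product(product_id INT, name TEXT, category_id INT);
CREATE TABLE dim_category(category_id INT, category TEXT);  -- snowflaked level
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 70.0);
INSERT INTO dim_product VALUES (1, 'Widget', 10), (2, 'Gadget', 20);
INSERT INTO dim_category VALUES (10, 'Hardware'), (20, 'Electronics');
""")

# Star-style query: a single JOIN from the fact table to the dimension.
star = con.execute("""
    SELECT p.name, SUM(f.amount) FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name ORDER BY p.name""").fetchall()

# Snowflake-style query: a second JOIN to reach the category level.
snow = con.execute("""
    SELECT c.category, SUM(f.amount) FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_category c ON p.category_id = c.category_id
    GROUP BY c.category ORDER BY c.category""").fetchall()

print(star)  # sales by product: one JOIN
print(snow)  # sales by category: two JOINs
```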
Star vs Snowflake Schema
Which one to use (a question of storage space):
Use the snowflake schema:
in a data warehouse; as the warehouse is the data center for the company, we can save a lot of space this way (no duplication);
when dimension tables require a significant amount of storage space.
Use the star schema:
in data marts; data marts are subsets of data taken out of the central data warehouse and created for different projects, where saving storage space is not a priority.
(Data can be stored by row or by column; storing by column means that reading, e.g., only the customer data touches less storage.)
Columnar Database – recent development
Three main benefits of storing data in columns:
1. Have higher read efficiency
2. Compress better than row‐based relational
databases
3. Have higher sorting and indexing efficiency
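The read-efficiency benefit can be sketched by contrasting the two layouts in Python; the records are illustrative:

```python
# Sketch: the same data in a row-oriented and a column-oriented layout.
# Reading a single attribute from the columnar layout touches only that
# column's values, which is the source of the read-efficiency benefit.
rows = [
    {"customer": "Acme",  "region": "East", "sales": 100},
    {"customer": "Buell", "region": "West", "sales": 200},
]

# Row store: one complete record after another.
row_store = rows

# Column store: one list per attribute.
col_store = {k: [r[k] for r in rows] for k in rows[0]}

# To read just the sales column, a row store scans every full record...
sales_from_rows = [r["sales"] for r in row_store]
# ...while a column store reads one contiguous list directly.
sales_from_cols = col_store["sales"]

print(sales_from_rows == sales_from_cols)  # True: same data, different layout
```

Columnar runs of similar values (e.g., a region column full of "East") also compress well, which is the second benefit listed above.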