Lecture 1 Practical Analytics - Introduction, Data Sources, Data Modeling, Data Warehouse v3
Analytics
What is data analytics?
Simply speaking, data analytics is the process that takes us from data to decision:
b) Gathering relevant data, which frequently are not in a usable form
c) Cleaning up the data to make them usable (data are dirty: garbage in, garbage out)
d) Loading them into storage (different data models)
e) Manipulating them to discover information that leads to actionable insights (data mining)
f) Making decisions based on those insights, e.g., discovering misconduct, fraud, or mistakes
The relationship among analytics
e.g., regression
Analytics: part of data science
(insert figure 1‐3)
Applications: marketing, medicine, etc.
Data Fundamentals
1. The client gives you the system (the client is concerned about confidential information).
Data provisioning: the process of providing users and systems with access to data.
Includes maintaining security authorizations to limit access to only those data which the user or system is officially permitted to view.
Replication: data from a source system are replicated or copied; the source data remain intact.
1. Different ways of granting access to data (for security reasons).
2. The original data may be corrupted after use, so it is better to copy the source and work on the copy; the original data remain intact (e.g., in Excel, after saving you cannot go back).
Have metadata, i.e., data about data, adding context, meaning, and purpose to data. Data types: numeric, text/character (e.g., an ID), date.
Structured data can still be dirty (e.g., wrong data-type definitions), so the data must be cleaned.
Structured does not mean ready to be used: the data may contain errors, redundancies, and omissions.
Unstructured data: do not conform to data models and associated metadata; they have no metadata and follow no pattern, so the data columns must be defined before a computer can properly read (process) them.
Structured and Unstructured Data
The most popular model of databases is the relational database: create tables and relationships.
Primary keys: uniquely and universally ("double u") identify each instance of an entity or relationship set, e.g., every student has a student ID, which makes a good primary key.
Foreign keys: other tables' primary keys referenced in the table.
Tables that have not been normalized are associated with three types of problems:
Insertion Anomaly: a new item cannot be added to the table until at least one entity uses a particular attribute item.
Deletion Anomaly: if an attribute item used by only one entity is deleted, all information about that attribute item is lost.
Update Anomaly: a modification to an attribute must be made in each of the rows in which the attribute appears.
Anomalies can be corrected by creating relational tables.
Three Types of Anomalies
1. Insertion anomaly: the part number cannot be recorded for potential suppliers.
2. Deletion anomaly: deleting one attribute of Buell may delete its other attributes as well.
3. Update anomaly: duplication is troublesome (e.g., for a company like Alibaba): one supplier's phone number must be updated multiple times.
—> need normalization (divide the tables)
Three Types of Anomalies
Combined keys are unique (primary key + foreign key). Inventory and Suppliers have a many-to-many relationship, so another table should be created (covered later).
1. The update problem is gone.
2. The insertion problem is eliminated.
3. Deletion: deleting the information of supplier Buell leaves the inventory data unchanged.
—> none of the three problems remain —> 3NF
Normalization
Normalization eliminates those three anomalies.
Normalization is the process of decomposing a database table into more tables.
Business systems (transactional systems) are typically normalized up to the third normal form (3NF).
Normalized tables are generated with referential integrity constraints between primary key and foreign key pairs: each foreign key must have a corresponding value in a primary key in another table.
For instance, if Customer ID is the foreign key in a Sales Order table,
then the primary key in the Customer table must also be Customer ID.
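The referential integrity constraint described above can be sketched in a few lines of Python; the table contents and field names are illustrative, not from the lecture:

```python
# Hypothetical sketch: checking referential integrity between a Sales Order
# table (foreign key: customer ID) and a Customer table (primary key:
# customer ID). All names and values are made up for illustration.

customers = {101: "Acme Ltd", 102: "Buell Inc"}           # PK -> attributes
sales_orders = [
    {"order_id": 1, "customer_id": 101, "amount": 250.0},
    {"order_id": 2, "customer_id": 102, "amount": 90.0},
    {"order_id": 3, "customer_id": 999, "amount": 40.0},  # orphan foreign key
]

def orphan_orders(orders, customer_table):
    """Return orders whose foreign key has no matching primary key."""
    return [o for o in orders if o["customer_id"] not in customer_table]

violations = orphan_orders(sales_orders, customers)
print(violations)  # the order referencing customer 999 violates integrity
```

A real database enforces this with a FOREIGN KEY constraint; the check above only illustrates what that constraint guarantees.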
Relational Databases & ER Diagram
Details of the subjects will be further discussed in
Lecture 2.
Processing unstructured data
Tagged Data: XML and XBRL
eXtensible Markup Language (XML): used to describe data to both humans and computers by tagging or coding data in documents, so that they can be read by both people and computers.
HTML (HyperText Markup Language) is used to tag data so that browsers can display the data as a web page.
XML is used to create metadata about data so that the data can be understood by computers for further processing and structuring.
eXtensible Business Reporting Language (XBRL): developed by accounting professionals for reporting.
Example of XBRL: labels the data, providing metadata to the file (e.g., a Sales figure), even though the source documents are unstructured (e.g., a PDF of financial statements).
Example of Inline XBRL (iXBRL): comes from XML; the HTML defines the locations while the XML provides more information, so the document can be processed.
Advantage: provides metadata for the data, so the data do not have to be re-entered (because XBRL is structured).
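A minimal sketch of how tagging makes data machine-readable; the tag names below are simplified stand-ins rather than real XBRL taxonomy elements:

```python
# Sketch: once facts are tagged with metadata (name, unit, period),
# a computer can extract and process them without manual re-entry.
# The <fact> tags here are illustrative, not actual XBRL markup.
import xml.etree.ElementTree as ET

doc = """
<report>
  <fact name="Sales" unit="USD" period="FY2023">500000</fact>
  <fact name="NetIncome" unit="USD" period="FY2023">75000</fact>
</report>
"""

root = ET.fromstring(doc)
# Build a dict of tagged values: the metadata tells us what each number means.
facts = {f.get("name"): float(f.text) for f in root.iter("fact")}
print(facts["Sales"])  # the tagged Sales value, now machine-readable
```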
Processing unstructured data
Image Recognition: still needs to be improved
Artificial Intelligence (AI)
Young Woman with a Letter and a Messenger in an Interior (1670) is an oil-on-canvas painting by the Dutch painter Pieter de Hooch; it is an example of Dutch Golden Age painting and is part of the collection of the Rijksmuseum.
Tim Cook: "I always thought I knew when the iPhone was invented, but now I'm not so sure."
Data Sources
Transactional Systems (OLTP): e.g., ERP systems
Informational Systems (OLAP)
Legacy Systems
Web Services: e.g., competitors' data
Social Media
Sensors: collect data, e.g., temperature
Data Sources
Business data from transactional systems
Transactional systems range from manual paper-based systems to any number of computerized systems; today the data come from ERP transactions, whereas in the past they came from manual paper records.
Computerized systems may be as simple as an Excel spreadsheet or as complex as an ERP.
OLTP systems: information systems that typically facilitate and manage transaction-oriented applications.
An ERP has two components (OLAP and OLTP); examples include SAP, Oracle, and PeopleSoft.
[Figure: ERP system architecture. The ERP system connects legacy data warehouse systems, customers, and suppliers. It comprises On-Line Analytical Processing (OLAP) and bolt-on applications (industry-specific functions) on top of the core functions, On-Line Transaction Processing (OLTP): Sales & Distribution, Business Planning, Shop Floor Control, and Logistics. Modules take care of different functions. All core functions share an operational database (customers, production, vendor, inventory, etc.).]
Characteristics of Transactional Systems (some companies run two systems in parallel to make sure the system is available all the time)
Availability: systems that process transactions should be available close to 100% of the time.
Level of detail: the data of transactional systems should be available at full detail, so that each transaction, as well as its content, creator, date, and details, is available at all times.
Updatable: business transactions are created, updated, changed, and deleted quite frequently.
Speed: the ability to process large quantities of transactions is critical; data can be updated in real time, quickly.
(Monitoring and response used to run on the mainframe; now they happen in real time.)
Current: store only recent data, a year or two. Older data are stored in the data warehouse; current data stay in the operational database.
An ERP has different modules to take care of different functions. For example, MRP covered only materials; once labor, overhead, and the original inventory management were added, it became ERP, covering the whole enterprise.
Characteristics of Transactional Systems
Operational: OLTP supports the organization's business functions.
Concurrent: OLTP systems are accessed by many users at the same time, so concurrency management is needed for concurrency control.
Supports the requirements of the business process.
Small uniform transactions: most transactions in an OLTP system are small and uniform.
Optimized for storage: for efficiency and performance, transactions should be written quickly to the database; quick storage in a relational database.
Data are functionally oriented (no duplication) and stored in an operational database, a kind of relational database.
Traditional IS Model:
Closed Database Architecture
Similar in concept to the flat-file approach:
data remain the property of the application
fragmentation limits communications
Existence of numerous distinct and independent databases:
redundancy and anomaly problems
Paper-based:
requires multiple entry of data
status of information unknown at key points
Because applications do not share one database:
1. It is slow to share data within the company; data cannot be shared electronically (e.g., paper to paper: print out a document, then re-enter it into the other system).
2. It is difficult to share data outside the company.
What an ERP changes (Business Enterprise, Figure 11-1):
1. Instead of three databases, there is one database.
2. Share data with suppliers: data flowing in one direction changes to data flowing in both directions (when both parties use ERP), e.g., procurement to supplier: when inventory falls below the EOQ, an order is placed automatically, giving better supply chain management.
3. Share data with customers: customer relationship management.
[Figure 11-1: Business Enterprise. Customers place orders with the Order Entry and Distribution System (orders, products); the Manufacturing System handles materials; the Procurement System handles purchases from suppliers.]
Two Main ERP Applications
Online Transaction Processing (OLTP)
aka Core Applications
supports the day-to-day operational activities of the business
supports mission-critical tasks through simple queries of operational databases
consists of large numbers of relatively simple transactions, such as updating accounting records (e.g., customer tables)
Two Main ERP Applications
Online Analytical Processing (OLAP)
Accesses large amounts of data (e.g., several years of sales data).
Analyzes the relationships between many types of business elements such as sales, products, geographic regions, and marketing channels (slice and dice, regression, clustering).
Two Main ERP Applications
Online Analytical Processing (OLAP) cont’d
Involves complex calculations between data elements, such as expected profit as a function of sales revenue for each type of sales channel.
Responds quickly to users' requests so that they can pursue an analytical thought process without being stymied by system delay.
OLAP
Supports management-critical tasks through analytical investigation of complex data associations captured in data warehouses:
Consolidation is the aggregation or roll-up of data.
Drill-down allows the user to see data at selectively increasing levels of detail.
Slicing and Dicing enables the user to examine data from different viewpoints, often performed along a time axis to depict trends and patterns.
Slicing cuts the data into smaller pieces that are easier to digest by computer and user; dicing cuts them smaller still.
OLAP
(Figure 11-2a)
(Figure 11-2b)
OLAP
A projection onto the x-axis (Quarter) yields the sales of all products in all cities.
A projection onto the y-axis (Product) yields the sales over all quarters in all cities.
A projection onto the z-axis (City) yields the sales over all quarters of all products.
Collapsing the cube into the cell next to the origin (summing over all three dimensions) yields the total sales figure for all quarters, products, and cities.
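The projections above can be sketched in plain Python; the cube contents (quarters, products, cities, sales figures) are made up for illustration:

```python
# Sketch of the three cube projections. The data cube is a dict keyed by
# (quarter, product, city) -> sales; all values are illustrative.
from collections import defaultdict

cube = {
    ("Q1", "Widget", "Boston"): 100, ("Q1", "Gadget", "Boston"): 80,
    ("Q2", "Widget", "Boston"): 120, ("Q2", "Widget", "Dallas"): 90,
}

def project(cube, axis):
    """Sum over the other two dimensions (axis: 0=quarter, 1=product, 2=city)."""
    totals = defaultdict(int)
    for key, sales in cube.items():
        totals[key[axis]] += sales
    return dict(totals)

print(project(cube, 0))  # projection onto Quarter: sales per quarter
print(project(cube, 1))  # projection onto Product: sales per product
print(sum(cube.values()))  # collapsing all three dimensions: total sales
```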
OLTP vs OLAP
Level of Detail: e.g., for sales, previously only the journal entry was known; now information can be linked to other tables (the invoice number links to the inventory table, showing the inventory sold).
OLTP vs OLAP
Managerial Requirements
OLTP: queries focus on operations.
OLAP: managerial (and strategic) in nature, e.g., adding or dropping product lines.
Optimized for access
OLTP: the data storage structure is optimized for writing (inserts, updates, and deletes); speed, the ability to process large quantities of transactions, is crucial.
OLAP: optimized for access, i.e., for quick reading; often referred to as read-optimized.
OLTP vs OLAP
Historical Data
OLTP: current data only; recent, a year or two of data.
OLAP: historical data create a clear picture of past business events and trends to support optimal decisions.
(OLAP is not only for historical data; it also has access to current data, but decision-making uses historical data more.)
Availability
OLTP: should be available close to 100% of the time.
(Keeping the two systems separate also supports segregation of duties.)
ERP System Business Enterprise
(Figure 11-2)
ERP System Configurations:
Databases and Bolt-Ons
(covered later)
ERPs typically utilize relational databases to facilitate data sharing for transaction processing.
Examples of data analytics on ERP data:
Big chain retailers could use ERP data to analyze the sales of each customer by product, store, region, time of day, and date, informing changes of store hours, promotion of products, etc.
Manufacturing companies optimize their logistics by analyzing data regarding the costs of, and the time required for, shipments to customers and from vendors.
Middle Platform (中台 )
Reference: https://www.infoq.cn/article/wCZV6X5uujxDXFP0Eub9
Data Sources (review)
Transactional Systems (OLTP)
Informational Systems (OLAP)
Legacy Systems
May include obsolete or aged hardware, outdated software, and older operating systems.
Could be customized software applications or in-house developed applications.
Special interfaces may be needed to retrieve data.
Web Services
Crawlers and Info Agents (Search Engines)
Alexa, owned by Amazon, provides traffic data (hits) for websites.
Social Media
Facebook, Twitter, etc. identify consumer buying habits, likes and
dislikes, and potential upcoming purchases
Sensors
Gathered from devices like heating units, vehicles, health monitors, etc.
Data Collecting
Identifying the source data is the first step. Next come the data collection methods:
Sampling
Calibrating and Scaling
Continuous monitoring and Embedded
Audit Module
Data from Feedback Mechanisms
Intelligent Control Agents
Sampling
The act of extracting only certain data values from a dataset, i.e., a subset of the data. (Examining everything is time-consuming and costly.)
Pull a data sample to study consumer purchasing behaviors, possible election results, etc.
Random sampling is crucial: the selected data points must be representative of the entire population.
Sampling is appropriate when:
each data point is representative of the entire set (problematic for fraud detection, but time and resources are limited);
the source data set is too large for the planned analysis;
the application specifically calls for a data sample, e.g., some accounting and regulatory compliance audits.
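A minimal random-sampling sketch in Python; the population of transaction IDs and the sample size are illustrative:

```python
# Sketch: drawing a simple random sample so that the selected data points
# are representative of the population. Values are made up for illustration.
import random

random.seed(42)                      # fixed seed, only to make the example reproducible
population = list(range(1, 10001))   # e.g., 10,000 transaction IDs
sample = random.sample(population, k=100)  # random sample without replacement

print(len(sample))       # 100 data points drawn
print(len(set(sample)))  # no duplicates: also 100
```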
Calibrating and Scaling
Data calibration: establishing a relationship between a data point and a unit of measure that has been formally defined (e.g., different accounting standards in different countries, where an adjustment for FIFO allows a fair comparison, or currency exchange rates).
A comparison is made between an observation and a standard.
Source data may be standardized: scaled to lie between a minimum and maximum value, or transformed to be more normally distributed (e.g., a learning curve is exponential).
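Min-max scaling, one of the standardizations mentioned above, can be sketched as follows; the sales figures are illustrative:

```python
# Sketch of min-max scaling: mapping source data onto the [0, 1] interval.
def min_max_scale(values):
    """Rescale values so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

sales = [120.0, 340.0, 90.0, 510.0]  # illustrative figures
scaled = min_max_scale(sales)
print(scaled)  # smallest value maps to 0.0, largest to 1.0
```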
Continuous monitoring and embedded audit modules (EAM)
All or most business transactions need to be examined in order to uncover fraudulent activities.
Traditionally, identifying fraud using samples of transactional data is very problematic, because fraudulent transactions usually occur in a small number of instances and are statistically less likely to be included in the test sample.
SOX made it necessary for businesses to continuously monitor their information systems for irregularities that might reflect error or fraud.
Data analytics: auditors need to examine all of the data within a company's database to identify and investigate fraudulent activities.
Continuous monitoring and embedded audit modules (continued)
In essence, auditors need to be data miners, exploring vast quantities of data and carefully reviewing all of the data in the organization's systems to identify:
Trends
Outliers
Other anomalies
An Embedded Audit Module (EAM), a system within a system such as an ERP, analyzes transactional data to identify abnormalities like unusual usage, or provides the electronic audit trail needed to verify transactions within the system.
It identifies possible fraud and abnormal transactions, e.g., a bank sends a message or calls you when your card is used at a place you do not usually visit, or a company buys from an unusual supplier.
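A hypothetical EAM-style check, flagging purchases from suppliers outside an approved list (analogous to a bank flagging card use in an unusual location); supplier names and amounts are made up:

```python
# Sketch of an embedded-audit-module rule: flag any purchase whose supplier
# is not on the approved list. All names and values are illustrative.
approved_suppliers = {"Acme", "Buell", "Crane"}

purchases = [
    {"po": 1001, "supplier": "Acme",   "amount": 500},
    {"po": 1002, "supplier": "Zenith", "amount": 750},  # unusual supplier
]

def flag_abnormal(transactions, approved):
    """Return transactions from suppliers outside the approved set."""
    return [t for t in transactions if t["supplier"] not in approved]

alerts = flag_abnormal(purchases, approved_suppliers)
print([a["po"] for a in alerts])  # purchase orders to investigate
```

A production EAM would run such rules continuously inside the ERP; this sketch only shows the logic of one rule.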
Data Gathering
Sampling
Calibrating and Scaling
Continuous Monitoring and Embedded Audit Modules
Data from Feedback Mechanisms (IT audit)
The IT department analyzes system performance logs for abnormal traffic, usage trends, and components liable to fail, as well as security breaches and illegal transactions.
EAMs: exception reporting
Intelligent Control Agents (ICA)
Work autonomously with distributed systems to control or run a system, e.g., watching for abnormal traffic or unauthorized transactions by date.
Data Staging
Data staging (e.g., using ACL to prepare the data):
The process whereby data are organized and prepared for analysis; it ensures that data are consistent, relevant, well defined, and free of ambiguities.
To stage data, we need to understand the characteristics of, and the relationships among, the data items (as an auditor, you need to understand the firm, e.g., via a SWOT analysis).
The definition of the data and their relationships is called data modeling.
*ETL (Extraction, Transformation and Loading)
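The ETL steps can be sketched as a tiny pipeline; the field names and records are illustrative:

```python
# Minimal ETL sketch: extract raw records, transform (clean and type-cast),
# and load into a staging structure. Data and field names are made up.
raw_rows = [
    {"id": "1", "amount": " 100.50 ", "region": "east"},
    {"id": "2", "amount": "bad",      "region": "WEST"},  # dirty record
]

def transform(row):
    """Cast types and standardize case; return None for unusable rows."""
    try:
        return {"id": int(row["id"]),
                "amount": float(row["amount"].strip()),
                "region": row["region"].strip().upper()}
    except ValueError:
        return None  # garbage in, garbage out: drop what cannot be cleaned

# Load: keep only the rows that survived the transform step.
staged = [t for r in raw_rows if (t := transform(r)) is not None]
print(staged)  # only the clean, consistent record is loaded
```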
Drawbacks of the Star Schema (its dimension tables are not normalized into separate tables):
Time-dependent changes (historization) are not supported.
Data hierarchies are not handled well.
Multidimensional Modelling:
Snowflake Schema
The Snowflake Schema is an enhancement to the Star Schema to overcome the drawbacks of that popular schema.
The dimension tables of the star schema are divided into additional tables, thereby creating branching from the star.
(The dimension tables are normalized, but not all the way to 3NF, because that would create too many tables and make data retrieval slower.)
Star vs Snowflake Schema
Reference: Emil Drkušić, database designer and developer, financial analyst
Which one is better for doing OLAP?
Star vs Snowflake Schema
The First Difference: Normalization
Snowflake schemas use less space to store dimension tables, because a normalized database produces far fewer redundant records.
Denormalized models increase the chances of data integrity problems (the three anomaly issues) and complicate future modifications and maintenance.
(In a data warehouse the data do not change, so this is not much of an issue.)
Star vs Snowflake Schema
The Second Difference: Query Complexity
A snowflake schema query is more complex: because the dimension tables are normalized, you must dig deeper to get detailed data. For example, a JOIN is needed for every level inside the same dimension.
In the star schema, we only JOIN the fact table with the dimension tables we need (a single-level join via the primary key).
Basically, a query run against snowflake schema data in a data mart will execute more slowly.
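The difference in query complexity can be illustrated with SQLite from Python; the tables, the one snowflaked dimension level, and the data are made up for the sketch:

```python
# Sketch: a star-style query joins the fact table to one dimension table,
# while a snowflake-style query needs an extra JOIN for each normalized
# level. Table names and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_sales(product_id INT, amount REAL);
CREATE TABLE dim_product(product_id INT, name TEXT, category_id INT);
CREATE TABLE dim_category(category_id INT, category TEXT);  -- snowflaked level
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 70.0);
INSERT INTO dim_product VALUES (1, 'Widget', 10), (2, 'Gadget', 20);
INSERT INTO dim_category VALUES (10, 'Hardware'), (20, 'Electronics');
""")

# Star-style query: a single JOIN from the fact table to the dimension.
star = con.execute("""
    SELECT p.name, SUM(f.amount) FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name ORDER BY p.name""").fetchall()

# Snowflake-style query: a second JOIN to reach the category level.
snow = con.execute("""
    SELECT c.category, SUM(f.amount) FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_category c ON p.category_id = c.category_id
    GROUP BY c.category ORDER BY c.category""").fetchall()

print(star)  # sales by product: one JOIN
print(snow)  # sales by category: two JOINs
```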
Star vs Snowflake Schema
Which one to use (a question of storage space):
Use the snowflake schema:
in a data warehouse; as the warehouse is the data center for the company, we can save a lot of space this way (no duplication);
when dimension tables require a significant amount of storage space.
Use the star schema:
in data marts; data marts are subsets of data taken out of the central data warehouse and created for different projects, where saving storage space is not a priority.
(Data can be stored by row or by column; storing by column means that reading, e.g., only the customer data touches less storage.)
Columnar Database – recent development
Three main benefits of storing data in columns:
1. Have higher read efficiency
2. Compress better than row‐based relational
databases
3. Have higher sorting and indexing efficiency
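The read-efficiency benefit can be sketched by contrasting the two layouts in Python; the records are illustrative:

```python
# Sketch: the same data in a row-oriented and a column-oriented layout.
# Reading a single attribute from the columnar layout touches only that
# column's values, which is the source of the read-efficiency benefit.
rows = [
    {"customer": "Acme",  "region": "East", "sales": 100},
    {"customer": "Buell", "region": "West", "sales": 200},
]

# Row store: one complete record after another.
row_store = rows

# Column store: one list per attribute.
col_store = {k: [r[k] for r in rows] for k in rows[0]}

# To read just the sales column, a row store scans every full record...
sales_from_rows = [r["sales"] for r in row_store]
# ...while a column store reads one contiguous list directly.
sales_from_cols = col_store["sales"]

print(sales_from_rows == sales_from_cols)  # True: same data, different layout
```

Columnar runs of similar values (e.g., a region column full of "East") also compress well, which is the second benefit listed above.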