Summary - BIA
➔ Transformed organizations are three times more likely to outperform their peers (LaValle et al., 2010)
Slides
Role of a business analyst:
Uses business intelligence tools and applications to understand and improve business conditions and
business processes > Descriptive analytics
Involved in:
• Business development
• Identification of business needs and opportunities
• Business model and systems analysis
• Process design
• Interpretation of business rules and developing system requirements
Other roles:
• Data architects
• Data engineers
Knowledge requirements for advanced analytics:
• Modeling → Data scientist: Uses advanced algorithms and interactive exploration tools to
uncover non-obvious patterns in data
• Business domain → Business analyst: Uses BI tools and applications to understand business
conditions and drive business processes
• Data:
Zooming in on data:
“Data are that which exists prior to argument or interpretation that converts them to facts, evidence
and information” (Rosenberg, 2013)
Big data can be used to improve and innovate the business model.
Improving the business model:
• New data: Armed with new data, these companies can advance to generate insight. (e.g.
sensors, social media for understanding behaviour)
• New insight: new big data approaches and techniques, ranging from high-end statistics and
models to colourful visualizations of the output (e.g. outlier detection using big data
techniques)
• New action: As companies become well-armed with big data and proficient at making
insights based on that data, they act differently – often faster and more wisely. (e.g. outlier
detection, using big data techniques)
Big data facilitates improvements to business models across industries. The most effective
improvements result from creating well-articulated strategies that are informed by data and then
honed and shaped accordingly.
'The New Patterns of Innovation', Parmar et al. (2014) – article 3
Patterns in creating value from data and analytics:
1. Augmenting products to generate data: e.g. car sensors combined with a driving app make it possible to pay more for insurance if you drive too hard.
2. Digitizing assets: e.g. Spotify and Netflix – you no longer buy a CD (a physical asset); the content is made available to customers once it is digitized. Similarly, blueprints can be digitized to produce a 3D model from them. You can digitize existing products as-is, or digitize them to improve the design > ex ante and ex post.
3. Combining data within and across industries: e.g. the challenge in healthcare is to coordinate different activities that no single person or organization oversees; integrating the data sources of the different organizations improves/optimizes the care for people. Another example: a Google Maps-style company specialized in traffic jams.
Patterns 4 & 5 always build on top of the first three.
4. Trading data: sell data to others – for example, to parties that want to digitize the CDs.
5. Codifying a capability: determine which capability we are talking about
> put it in a context to turn it into value.
'Big data for big business? A taxonomy of data-driven business models used by start-
up firms', Hartmann et al. (2014) – article 4
A taxonomy of business models used by start-up firms:
Exam preparation
Multiple-choice questions (focus: knowledge)
> In the first lecture we discussed the three analytics capability levels that were identified by LaValle
et al. (2010). Which of the following levels is not mentioned by LaValle et al. (2010)?
> In the first lecture, we talked about different patterns of creating value from data as identified by
Parmar et al. (2014). What is the key difference between the pattern of digitizing physical assets and
codifying a capability?
> In the first lecture, we talked about the paper of Woerner & Wixom (2015). In this paper, the
authors argue that big data can be used to improve as well as to innovate the business model. What
is an example of the use of big data to innovate the business model?
Open questions (focus: application)
a) What would be the next capability level Confucius should aim for, and what is the role of data
analytics at that level? [4 points]
b) Give two recommendations management can follow to yield a higher pay-off on data analytics in
the organization. [2 points]
c) Explain how both recommendations could be implemented in this organization [4 points]
Lecture 2: Data input
T1: Information requirements
Information requirements:
Relevant question: what information do executives need?
• Executive work activities (Watson & Frolick, 1993):
o Diverse, brief and fragmented
o Verbal communications preferred (soft info)
o More unstructured, non‐routine and long‐range in nature
than other managerial work (Mintzberg)
o Network building, building cooperative relationships
(internal/external)
An EIS (executive information system) continues to evolve over time in response to:
• Competitor actions
• Changing customer preferences
• Government regulations
• Industry developments
• Technological opportunities, etc.
Examples (of critical success factors, CSFs):
• Image in financial markets
• Technological reputation with customers
• New market success
• Company morale
How to measure CSF’s > KPI’s
Key Performance Indicators (KPIs) are used to measure (quantify) CSFs
Examples:
CSF → KPI
• Image in financial markets → Price/earnings ratio
• Technological reputation with customers → Orders/bid ratio
• New market success → Change in market share
• Company morale → Employee satisfaction score
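For illustration, a minimal sketch (in Python, with made-up figures that are not from the lecture) of how such KPIs are simply small calculations on operational or market data that quantify a CSF:

```python
# Hypothetical figures only – the point is that each KPI is a simple ratio.
financials = {"share_price": 42.0, "earnings_per_share": 3.5}
sales_funnel = {"orders_won": 18, "bids_submitted": 60}

price_earnings_ratio = financials["share_price"] / financials["earnings_per_share"]
orders_bid_ratio = sales_funnel["orders_won"] / sales_funnel["bids_submitted"]

print(f"P/E ratio (image in financial markets): {price_earnings_ratio:.1f}")
print(f"Orders/bid ratio (technological reputation): {orders_bid_ratio:.0%}")
```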
• Performance measures are the foundation of a useful EIS
From strategy to reports:
• Use of CSF’s and KPI’s enables measurement, and thus control, of strategic objectives.
• Performance measures (KPI’s) that measure the execution of the strategy and the creation of
value must be included
T 2: Data quality
What is it?
Data quality refers to how fit data are for their use or purpose. High-quality data contain a lot of
information that you can integrate and summarize, and abstract very relevant information from.
Typical dimensions: accurate, accessible, relevant and trustworthy (you should be able to check
whether the data are truthful), recent (timely), and complete (no missing values).
'Data Quality (DQ) in Context', Strong et al. (1997) – article 1
Data quality in context
• Focus on intrinsic DQ problems fails to solve complex organizational problems > consider DQ
beyond intrinsic view
• Focus not only on stored data (data custodians), but also on production (data producers)
and utilization (data consumers)
• High‐quality data: fit for use by data consumers
Table: Data quality dimensions used in the broader conceptualization of data quality argued by the
article
T 3: Data governance
Defining (data) governance:
Governance is more focused on the strategic level, whereas management focuses on the tactical and
operational levels.
Data governance:
Data governance is the exercise of authority and control over the management of data assets.
• E.g. data architecture, data quality, data storage & operations, data security and many more
'One Size Does Not Fit All – A Contingency Approach to Data Governance', Weber et al. (2009)
– article 2
Data governance model:
Consists of the following three components (only the following two were mentioned in the lectures):
• Data quality roles:
o Executive sponsor: Provides sponsorship, strategic direction, funding, advocacy and
oversight for DQM (Data Quality Management)
o Data quality board: Defines the data governance framework for the whole
enterprise and controls its implementation
o Chief steward: Puts the board’s decisions into practice, enforces the adoption of
standards, helps establish DQ metrics and targets
o Business data steward: Details corporate-wide DQ standards and policies for his/her
area of responsibility from a business perspective
o Technical data steward: Provides standardized data element definitions and formats,
profiles and explains source system details and data flows between systems.
• The assignment of responsibilities: The abbreviations “R”, “A”, “C”, and “I” fill the cells of the
matrix to depict the kind of responsibility a role has for a specific DQM activity or decision.
o Responsible (“R”). This role is responsible for executing a particular DQM activity.
▪ Only one “R” is allowed per row, that is, only one role is ultimately
responsible for executing an activity
o Accountable (“A”). This role is ultimately accountable for authorizing a decision
regarding a particular DQM activity.
o Consulted (“C”). This role may or must be consulted to provide input and support for
a DQM activity or decision before it is completed.
o Informed (“I”). This role may or must be informed of the completion or output of a
decision or activity.
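For illustration, a minimal sketch (in Python) of how such an assignment matrix could be represented and checked; the activities and role assignments below are hypothetical, not taken from Weber et al. (2009):

```python
# Hypothetical RACI matrix: each DQM activity maps roles to
# R(esponsible), A(ccountable), C(onsulted) or I(nformed).
raci = {
    "Define DQ metrics": {"Chief steward": "R", "Data quality board": "A",
                          "Business data steward": "C", "Technical data steward": "I"},
    "Profile source systems": {"Technical data steward": "R", "Chief steward": "A",
                               "Business data steward": "C"},
}

# Check the rule from the lecture: only one role is responsible ("R") per row.
for activity, assignments in raci.items():
    responsible = [role for role, code in assignments.items() if code == "R"]
    assert len(responsible) == 1, f"{activity}: exactly one 'R' expected"
    print(f"{activity}: executed by {responsible[0]}")
```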
The contingency factors determine the fit between the design of the data governance model and the
success of DQM within the organization.
'Big Data for All: Privacy and User Control in the Age of Analytics', Tene & Polonetsky (2013) –
article 3
Big data: big benefits:
• Healthcare: e.g. it is not possible for the FDA to check the interactions of every single medicine
on the market; by using big data it is possible to find harmful interactions between
medicines. Another example: Google can forecast a flu epidemic.
• Mobile: Mobile devices–always on, location aware, and with multiple sensors including
cameras, microphones, movement sensors, GPS, and Wi-Fi capabilities have revolutionized
the collection of data in the public sphere and enabled innovative data harvesting and use.
• Smart-grid: The smart grid is designed to allow electricity service providers, users, and other
third parties to monitor and control electricity use. Utilities view the smart grid as a way to
precisely locate power outages or other problems, including cyber-attacks or natural
disasters, so that technicians can be dispatched to mitigate problems.
• Traffic management: e.g. governments around the world are establishing electronic toll
pricing systems, which determine differentiated payments based on mobility and congestion
charges. These systems apply varying prices to drivers based on their differing use of
vehicles and roads.
• Retail: It was Wal-Mart’s inventory management system (“Retail Link”) which pioneered the
age of big data by enabling suppliers to see the exact number of their products on every shelf
of every store at each precise moment in time.
• Payments: Another major arena for valuable big data use is fraud detection in the payment
card industry.
• Online: the most oft-cited example of the potential of big data analytics lies within the
massive data silos maintained by the online tech giants: Google, Facebook, Microsoft, Apple,
and Amazon.
• The ethics of analytics: Where should the red line be drawn when it comes to big data
analysis? Moreover, who should benefit from access to big data? Could ethical scientific
research be conducted without disclosing to the general public the data used to reach the
results?
• Chilling effect: “a surveillance society,” a psychologically oppressive world in which
individuals are cowed to conforming behaviour by the state’s potential panoptic gaze.
Exam preparations
Multiple‐choice questions (focus: knowledge)
• The essence of the paper Data Quality in Context of Strong et al. (1997) can best be
described as:
• In the Data governance contingency model of Weber et al. (2009; One Size Does Not Fit All –
A Contingency Approach to Data Governance), what is considered a contingency factor?
• What is, according to the paper of Tene & Polonetsky (2013; Big Data for All: Privacy and
User Control in the Age of Analytics), a promising way to deal with privacy and user control in
the Big Data age?
Lecture 3: Data architecture
T1: Data
What is data?
Data are that which exists prior to argument or interpretation that converts them to facts, evidence
and information (Rosenberg, 2013)
Data types:
• Form (qualitative and quantitative)
• Structure (structured, semi-structured, unstructured)
o Structures: tabular, network and hierarchical
o Relational databases: two tables with matching columns, these tables can be
combined
• Source (capture, derived, exhaust, transient)
• Producer (primary, secondary, tertiary)
• Type (indexical, attribute, metadata)
Structured query language (SQL): A high-level, declarative language for data access and
manipulation. Allows asking the database human-like questions (queries). Widely used for simple
functional reporting.
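For illustration, a minimal sketch of such a declarative query, run through Python's built-in sqlite3 module; the sales table and its columns are made up for this example:

```python
import sqlite3

# In-memory database with a made-up sales table, just to show the idea of
# asking a human-like question ("total revenue per region") declaratively.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("North", "Lipstick", 120.0), ("North", "Mascara", 80.0),
    ("South", "Lipstick", 200.0),
])

# The query states WHAT we want, not HOW to compute it (declarative).
for region, total in con.execute(
        "SELECT region, SUM(revenue) FROM sales GROUP BY region"):
    print(region, total)
```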
T2: Operational databases
Operational databases:
'An Overview of Business Intelligence Technology', Chaudhuri et al. (2011) – article 1
Typical business intelligence architecture:
• Front-end applications: There are several popular front-end applications through which
users perform BI tasks: spreadsheets, enterprise portals for searching, performance
management applications that enable decision makers to track key performance indicators
of the business using visual dashboards, tools that allow users to pose ad hoc queries,
viewers for data mining models, and so on. Rapid, ad hoc visualization of data can enable
dynamic exploration of patterns, outliers and help uncover relevant facts for BI.
Structured query language (SQL) is a programming language for storing and processing information in a relational database. A relational
database stores information in tabular form, with rows and columns representing different data attributes and the various relationships between
the data values. You can use SQL statements to store, update, remove, search, and retrieve information from the database. You can also use
SQL to maintain and optimize database performance.
T3: Data warehouses
'Data Warehouses, Business Intelligence Systems, and Big Data', Kroenke et al. (2018) – article
2
Setting the stage: Business analysts need large datasets available for analysis by business intelligence
(BI) applications. BI systems such as online analytical processing (OLAP) and data warehouses are
used.
• Business intelligence (BI) systems are information systems that assist managers and other
professionals in the analysis of current and past activities and in the prediction of future
events.
• Operational systems—such as sales, purchasing, and inventory-control systems— support
primary business activities. They are also known as transactional systems or online
transaction processing (OLTP) systems because they record the ongoing stream of business
transactions.
• BI systems fall into two broad categories: reporting systems and data mining applications.
o Reporting systems sort, filter, group, and make elementary calculations on
operational data.
o Data mining applications, in contrast, perform sophisticated analyses on data,
analyses that usually involve complex statistical and mathematical processing.
Data warehouse: A data warehouse is a database system that has data, programs, and personnel
that specialize in the preparation of data for BI processing. Data are read from operational databases
by the extract, transform, and load (ETL) system. The ETL system then cleans and prepares the data
for BI processing. This can be a complex process.
Data mart: A data mart is a collection of data that is smaller than the data warehouse that addresses
a specific component or functional area of the business.
Dimensional database: The data warehouse databases are built in a design called a dimensional
database that is designed for efficient data queries and analysis. A dimensional database is used to
store historical data rather than just the current data stored in an operational database.
• A dimension within a dimensional database is a column or set of columns that describes
some aspect of the enterprise
• Because dimensional databases are used for analysis of historical data, they must be
designed to handle data that change over time. In order to track such changes, a dimensional
database must have a date dimension or time dimension as well.
OLAP: Online analytical processing (OLAP) provides the ability to sum, count, average, and perform
other simple arithmetic operations on groups of data. OLAP systems produce OLAP reports. An OLAP
report is also called an OLAP cube. This is a reference to the dimensional data model. OLAP uses the
dimensional database model discussed earlier in this chapter, so it is not surprising to learn that an
OLAP report has measures and dimensions. A measure is a dimensional model fact—the data item of
interest that is to be summed or averaged or otherwise processed in the OLAP report. A dimension,
as you have already learned, is an attribute or a characteristic of a measure.
The main idea behind data warehouses
Data warehouses:
From ad-hoc activities to a structured approach
Online Analytical Processing (OLAP): computational approach, BI system or mid-tier server for
answering multi-dimensional analytical queries, using interactive real-time interfaces.
Multi-dimensional analytical queries (MDA): questions drawing on several data domains / several
dimensions (e.g. sales by region, by product, by salesperson and by time – 4 dimensions)
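A minimal sketch (assuming the pandas library and made-up sales data) of what such a multi-dimensional query looks like in practice: one measure (revenue) aggregated over several dimensions:

```python
import pandas as pd

# Toy sales data with three dimensions (region, product, quarter) and one
# measure (revenue) – the kind of data an MDA query slices and aggregates.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["Lipstick", "Mascara", "Lipstick", "Lipstick", "Mascara"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "revenue": [120, 80, 200, 150, 90],
})

# "Sales by region and product, per quarter" – a small OLAP-style cube view.
cube = sales.pivot_table(index=["region", "product"], columns="quarter",
                         values="revenue", aggfunc="sum", fill_value=0)
print(cube)
```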
Characteristics of data warehouses:
(figure; key characteristics shown: retention ['Behoud'] and volatility ['Wisselvalligheid'])
Multidimensional data model: decision cube
• Decision cube: The equivalent of a pivot table for a data warehouse. You may think of it as a
multi-dimensional table. Allows summarizing and analyzing the facts pertaining to a specific
subject along different dimensions. (e.g., sales cube, inventory cube)
• Facts or measures: What is tracked about the subject of the cube. A continuous additive
quantity (always numeric) that can be defined for all possible intersections of the
dimensions. (e.g. sales, revenue, units sold)
• Dimensions: Tag what is tracked about the subject of the cube. A dimension acts as an index
for identifying values within a cube. Each represents a single perspective on the data. If all
dimensions have a single member selected, then a single cell is defined. Dimensions are
structured in hierarchies and consist of categorical variables. (e.g. time (year, quarter,
month, week, day, hour), product type (line, series, model), and store of sale (region,
country, city, branch))
Multidimensional data model: star schema: The Star Schema is the way we represent the data
model pertaining to decision cubes. It is the ERD for a cube. The star schema consists of one or more
fact tables referencing a number of dimension tables.
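A minimal sketch of a star schema for a hypothetical sales cube, again using Python's sqlite3; the table and column names are illustrative assumptions, not a prescribed design:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Dimension tables: each describes one perspective on the facts.
con.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INT, quarter INT, month INT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, line TEXT, series TEXT, model TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, region TEXT, country TEXT, city TEXT);

-- Fact table: one row per intersection of the dimensions, with numeric measures.
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units_sold INTEGER,
    revenue    REAL
);
""")
print("Star schema tables:", [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
```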
• Benefits of data warehouses (Watson et al., 2002)
o Greatest benefits from DW apps occur when used to redesign business processes
and support strategic business objectives
o Can be a critical enabler for strategic change
• The main merits of data warehouses become their drawback in the age of big data
o Purpose-driven, rigid schemas that are not appropriate for ad-hoc analysis
o Pre-processing incurs high integration costs and means not all data can get in
Challenge: Big data is getting bigger and bigger (Kroenke et al., 2018)
• Big data is large
o From 33 zettabytes (trillion GB) in 2018 to a predicted 175 zettabytes in 2025 (IDC,
2018)
• Big data is fast
o 1.7 MB of data is created per second per person every day by 2020 (Forbes, 2015)
• Big data is highly dimensional
o Both in terms of records, but also in terms of attributes
• Big data is difficult to exploit
o 0.5% of all data is analysed/used
Distributed computing:
Approaches to deal with big data: distributed computing
• Based on the split-apply-combine (sac) paradigm
o E.g., calculating the average grade per exam opportunity
• sac analogy for big data: map-reduce, implemented in e.g. Apache Hadoop
o E.g. calculating the frequencies of all words on the internet
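A minimal sketch of the split-apply-combine idea behind map-reduce, in plain Python rather than Hadoop; the two "documents" below stand in for a much larger, distributed corpus:

```python
from collections import Counter
from functools import reduce

documents = ["big data is big", "data lakes store big data"]  # tiny stand-in corpus

# Split/map: each document is processed independently (in Hadoop, on many nodes).
mapped = [Counter(doc.split()) for doc in documents]

# Combine/reduce: the partial counts are merged into one overall frequency table.
word_counts = reduce(lambda a, b: a + b, mapped, Counter())
print(word_counts.most_common(3))   # e.g. [('big', 3), ('data', 3), ('is', 1)]
```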
T5: Data lakes
‘Using knowledge management to create a Data Hub and leverage the usage of a Data Lake’,
Ferreira et al., (2018) – article 3
Storing big data: data lakes
• Dropping storage costs allow for retaining giant amounts of data until the use case is found
(Ferreira et al., 2018)
o Deferred processing and modelling has become a possibility these days
• Business analysts are only one of the beneficiaries of corporate data these days. Data
scientists and deep learning machines have entered the stage as well
o The data pipeline must cater to the needs of all these users
o “tidy data” rather than structured data has become more important to match
today’s analytics needs
Preventing a data swamp:
The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.
Otherwise, one only creates a data swamp (Ferreira et al., 2018)
Data flow:
Exam preparations
Multiple-choice questions (focus: knowledge)
In the paper 'An Overview of Business Intelligence Technology', Chaudhuri et al. (2011) discuss a
typical architecture for supporting BI within enterprises.
What do we call the back-end components that prepare data for business intelligence tasks?
A. Data mining, text analytic engines
B. Extract Transform Load tools
C. Online Analytical Processing servers
D. Operational databases
In the whitepaper ‘The enterprise data lake: Better integration and deeper analytics’, Stein &
Morrison (2014) introduce the data lake as an answer to the massive growth in data available to
organizations.
Open questions (focus: application)
Ledoitte is a global professional services firm with competences in management and IT consulting. As
an information systems consultant in Ledoitte, you are trying to sell the idea of a business
intelligence system to Rich Tals, a beauty and cosmetics brand based in Amsterdam, running a
sizeable network of retail shops.
A. How are operational databases (i.e. OLTP) and business intelligence systems (i.e. OLAP) used in an
organization? Focus on the differences in use that justify maintaining two separate systems. [4
points]
B. Why is a database redesign required when moving from online transaction processing (OLTP) to a
data warehouse (OLAP)? [2 points]
C. What are the characteristic differences between the data in an operational database system and
the data in a data warehouse? [4 points]
Lecture 4: Data analytics
T1: Basic concepts
Knowledge discovery in databases
The knowledge discovery in databases (KDD) process (Fayyad et al., 1996), and in particular data
mining and machine learning techniques, helps in making sense of large volumes of data
Data mining
“Data mining is the application of specific algorithms for extracting patterns from data” (Fayyad,
1996, p. 37)
Different typologies:
Different analytic methods depending on:
• Data: Non-dependent vs dependent data
• Task: Descriptive vs predictive
• Relations: Variables vs observations
• Algorithm: e.g., Regression vs Classification vs Clustering
• Learning: Supervised vs Unsupervised Learning
Task: Description vs prediction:
Descriptive / Explorative Data Mining
• Learn about and understand the data
• e.g. Identify and describe groups of customers with similar purchasing behaviour (London’s
position in global corporate network)
Relations:
• Between the attributes (classification)
• Between the observations (clustering)
Algorithm:
• Regression: Regression is a function that maps a
data item to a prediction variable
• Data clustering: Given a set of observations, each having a set of attributes, and a similarity
measure among them, find clusters such that observations in one cluster are more similar
to each other, and observations in different clusters are less similar to each other.
• Data classification: Find a model for class attribute as a function of the values of other
attributes. Objective: Previously unseen observations (test set) should be assigned a class as
accurately as possible.
• Association rule discovery: Given a set of records, each of which contain some number of
items from a given collection. Capture the co-occurrence of items
• Collaborative filtering
Learning:
In the case of Machine Learning
• Supervised: Machine learning task of inferring a function from labelled training data
(classification)
o You give to the computer some pairs of inputs/ outputs, so in the future when new
inputs are presented you have an intelligent output. Requires a training and a test
set. (i.e., inferring a function from labelled training data, e.g. classification)
• Unsupervised: Machine learning algorithm used to draw inferences from datasets consisting
of input data without labeled responses (clustering)
o You let the computer learn from the data itself without showing what is the expected
output. (i.e., drawing inferences from data without labelled responses, e.g.
clustering)
T2: Supervised learning
Supervised learning concept:
The task at hand is learning a function that predicts an output given some inputs, based on example
input-output pairs
• Output can be the result of an unsupervised learning project
• Requires a training and test set
• E.g. predicting sales: Given data on past sales (combined with other relevant data), can we
predict future sales?
Example (from the lecture): a classifier is run on data with debt and income as inputs and tries to
predict the label for the grey (unlabelled) area, i.e. it finds a way of classifying the observations for
which you do not have a label yet. What the model learns from the training set is applied to the test
set and can reveal patterns there, so it is important that what is learned carries over to the test set.
'Classification Models', Wendler & Gröttrup (2016) – article
Classification algorithms deal with the problem of assigning a category to each input variable vector.
Classification models are dedicated to categorizing samples into exactly one category.
The procedure for building a classification model: The original dataset is split into two independent
subsets, the training and the test set. The training data is used to build the classifier, which is then
applied to the test set for evaluation. Using a separate dataset is the most common way to measure
the goodness of the classifier’s fit to the data and its ability to predict. This process is called cross
validation.
• Often, some model parameters have to be predefined. To find the optimal parameter, a third
independent dataset is used, the validation set.
General idea of a classification model: When training a classification model, the classification
algorithm inspects the training data and tries to find regularities in data records with the same target
value and differences between data records of different target values.
• In the simplest case, the algorithm converts these findings into a set of rules, such that the
target classes are characterized in the best possible way through these “if ... then ...”
statements.
Classification algorithms: The right choice of classifier strongly depends on the data type.
• Linear: linear methods try to separate the different classes with linear functions
• Nonlinear: nonlinear classifiers can construct more complex scoring and separating functions
• Rule-based: the rule-based models search the input data for structures and commonalities
without transforming the data. These models generate “if ... then ...” clauses on the raw data
itself.
o Decision trees (Workshop 4)
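For illustration, a minimal sketch of such a rule-based classifier using scikit-learn; the synthetic "debt"/"income" data and parameter choices are assumptions for this example, not taken from the workshop:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic training data standing in for e.g. "debt" and "income" attributes.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# Rule-based classifier: the fitted tree is a set of "if ... then ..." statements.
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)
print(export_text(tree, feature_names=["debt", "income"]))
print("Test accuracy:", tree.score(X_test, y_test))
```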
Yes or no decision: four possible events can occur
• True positive (TP). The true value is “yes” and the classifier predicts “yes”. A patient has
cancer and cancer is diagnosed.
• True negative (TN). The true value is “no” and the classifier predicts “no”. A patient is
healthy and no cancer is diagnosed.
• False positive (FP). The true value is “no” and the classifier predicts “yes”. A patient is
healthy but cancer is diagnosed.
• False negative (FN). The true value is “yes” and the classifier predicts “no”. A patient has
cancer but no cancer is diagnosed.
→ Unfortunately, a perfect classifier with no misclassification is pretty rare. It is almost impossible to
find an optimal classifier.
Slides
Classification:
Evaluating classification models:
Evaluation:
• We use the confusion matrix for assessing model quality to assess the extent to which the
model confuses the outcome classes (i.e. mislabelling one class as the other)
• Note: predicted and actual might be switched! Also note: what is considered “positive” and
“negative” often is determined alphabetically
• Quality measure: Accuracy = (TP + TN) / sample size
o Can be calculated for both training and test data. Accuracy tends to be lower for test
data
• 100% accuracy is not a goal
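A minimal sketch of computing the confusion matrix and accuracy with scikit-learn on toy diagnoses; note how the label order is fixed explicitly, since what counts as "positive" is otherwise often determined alphabetically:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy example: actual vs. predicted diagnoses ("yes" = cancer).
actual    = ["yes", "yes", "no", "no", "no", "yes", "no", "no"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "no", "no"]

# Rows = actual, columns = predicted; "yes" is forced to be the positive class.
cm = confusion_matrix(actual, predicted, labels=["yes", "no"])
tp, fn, fp, tn = cm.ravel()
print(cm)                                    # [[TP FN], [FP TN]]
print("Accuracy:", (tp + tn) / len(actual))  # (TP + TN) / sample size
print(accuracy_score(actual, predicted))     # same result via sklearn
```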
Supervised learning:
Under- and overfitting:
Image recognition:
T3: Unsupervised learning
‘Cluster Analysis’, Wendler & Gröttrup (2016) – article
• A cluster analysis is used to identify groups of objects that are “similar”.
• The term “cluster analysis” stands for a set of different algorithms for finding subgroups in a
dataset. Measuring the similarity or the dissimilarity/distance of objects is the basis of all
cluster algorithms. In general, we call such measures “proximity measures”.
Proximity measures: are used to identify objects that belong to the same subgroup in a cluster
analysis. They can be divided into two groups: similarity and dissimilarity measures. Nominal
variables are recoded into a set of binary variables, before similarity measures are used. Dissimilarity
measures are mostly distance-based. Different approaches/metrics exist to measure the distance
between objects described by metrical variables.
• Hierarchical clustering:
o The agglomerative algorithms measure the distance between all objects. In the next
step, objects that are close are assigned to one subgroup. In a recursive procedure,
the algorithms now calculate the distances between the more or less large
subgroups and merge them stepwise by their distance.
o The divisive algorithms assign all objects to the same cluster. This cluster is then
divided step-by-step, so that in the end homogeneous subgroups are produced.
• Partitional clustering: The first step in a partitioning clustering method is to assign each
object to an initial cluster. Then a quality index for each cluster will be calculated. By
reassigning the objects to other clusters, the overall quality of the classification should now
be improved. After checking all possible cluster configurations, by reassigning the elements
to any other cluster, the algorithm will end when no improvement to the quality index is
possible.
o Monothetic algorithm: Algorithms where only one variable assigns objects to the
cluster
o Polythetic algorithm: If more than one variable is used
Use cases:
• Marketing: With cluster analysis, statisticians can identify customer subgroups.
• Banking: In the case of a new enquiry for a loan, the bank is able to predict the risk, based on
the data of the firm.
• Medicine: Based on data, a risk evaluation with respect to the carcinogenic qualities of
certain substances can be performed.
• Education: Identifying groups of students with special needs.
Slides
Unsupervised learning:
• The task at hand is detecting similarities in the data
o Between variables (columns)
▪ Principal component analysis (PCA)
o Between observations (rows)
▪ Clustering
o Recommender systems
▪ Association rule discovery
▪ User-based collaborative filtering
▪ Item-based collaborative filtering
▪ Text prediction
Clustering:
K-means clustering:
Provides insight into the clusters; details of the K-means algorithm: choose k centroids, assign each
observation to the nearest centroid, recompute the centroids, and repeat until the assignments stabilize.
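A minimal sketch of K-means with scikit-learn on toy two-dimensional data; the number of clusters (k = 2) and the observations are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D observations (e.g. customers described by two features).
X = np.array([[1, 2], [1, 4], [2, 3],      # one group
              [8, 8], [9, 9], [8, 10]])    # another group

# K-means: assign points to the nearest centroid, recompute centroids, repeat.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels:   ", km.labels_)
print("Cluster centroids:", km.cluster_centers_)
```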
Clustering: multi-dimensionality
• Why does point 7 not belong to the blue cluster?
• Representing multi-dimensional data in a 2-dimensional plane can be deceiving!
o (As is representing 2-dimensional data in a multidimensional plane, but more about
that in the lecture on data visualization)
o Compared to the other points, point 7 and 8 have a distinct score on another feature
o An analogy would be looking at two buildings right from above. Although both might
look similar, one could be a skyscraper, whereas the other could be a parking garage
Data mining task: recommendation
• Recommender engines (REs) (Shmueli et al., 2016)
Generating candidate rules: Association rules provide information of this type in the form of "if-then"
statements. These rules are computed from the data; unlike the if-then rules of logic, association
rules are probabilistic in nature. We use the term antecedent to describe the IF part, and consequent
to describe the THEN part. In association analysis, the antecedent and consequent are sets of items
(called item sets) that are disjoint (do not have any items in common).
• Consider only combinations that occur with higher frequency in the database. These are
called frequent item sets. Determining what qualifies as a frequent itemset is related to the
concept of support. The support of a rule is simply the number of transactions that include
both the antecedent and consequent item sets.
• Apriori algorithm: The key idea of the algorithm is to begin by generating frequent itemsets
with just one item (one-item sets) and to recursively generate frequent itemsets with two
items, then with three items, and so on, until we have generated frequent itemsets of all
sizes.
• To measure the strength of association implied by a rule, we use the measures of confidence
and lift ratio
Two principles can guide us in assessing rules for possible spuriousness due to chance effects:
• The more records the rule is based on, the more solid is the conclusion.
• The more distinct are the rules we consider seriously (perhaps consolidating multiple rules
that deal with the same items), the more likely it is that at least some will be based on
chance sampling results.
Collaborative filtering: In collaborative filtering, the goal is to provide personalized recommendations
that leverage user-level information. User-based collaborative filtering starts with a user, then finds
users who have purchased a similar set of items or ranked items in similar fashion, and makes a
recommendation to the initial user based on what the similar users purchase or like.
• The recommender engine provides personalized recommendations to a user based on the
user's information as well as on similar users' information. Information means behaviors
indicative of preference, such as purchase, ratings, and clicking.
• Collaborative filtering requires availability of all item-user information. Specifically, for each
item-user combination, we should have some measure of the user's preference for that item.
User-based collaborative filtering – “People like you”: One approach to generating personalized
recommendations for a user using collaborative filtering is based on finding users with similar
preferences, and recommending items that they liked but the user hasn't purchased. The algorithm
has two steps:
• Find users who are most similar to the user of interest (neighbors). This is done by comparing
the preference of our user to the preferences of other users. → Calculated with the help of
correlation and cosine similarity
• Considering only the items that the user has not yet purchased, recommend the ones that
are most preferred by the user's neighbors.
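A minimal sketch of these two steps on a toy ratings matrix, using cosine similarity as the proximity measure; the users, items and ratings are made up:

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items, 0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],   # target user
    [4, 5, 3, 1],   # similar tastes
    [1, 0, 5, 4],   # different tastes
], dtype=float)
items = ["red", "white", "green", "blue"]

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = ratings[0]
# Step 1: find the most similar other user ("neighbour").
sims = [cosine(target, ratings[i]) for i in range(1, len(ratings))]
neighbour = ratings[1 + int(np.argmax(sims))]

# Step 2: among items the target has not rated, recommend the neighbour's favourite.
candidates = [(items[j], neighbour[j]) for j in range(len(items)) if target[j] == 0]
print(max(candidates, key=lambda c: c[1]))   # ('green', 3.0)
```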
Item-based collaborative filtering: When the number of users is much larger than the number of
items, it is computationally cheaper (and faster) to find similar items rather than similar users.
Specifically, when a user expresses interest in a particular item, the item based collaborative filtering
algorithm has two steps:
• Find the items that were co-rated, or co-purchased, (by any user) with the item of interest.
• Recommend the most popular or correlated item(s) among the similar items.
Slides:
Association rule discovery:
REs: Association rule discovery (Shmueli et al., 2016)
• “Frequently bought together”
o Association rule
▪ if red (antecedent), then white (consequent)
▪ if red and white (antecedent), then green (consequent)
▪ if red (antecedent), then white and green (consequent)
• The Apriori algorithm is much faster (than exhaustively generating and counting all possible item sets)
• Uses the concept of support: (number of transactions that include an item set / total number
of transactions) * 100%
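A minimal sketch that computes support, confidence and lift for the rule "if red then white" on a handful of made-up transactions:

```python
# Toy transaction data: each set is one shopping basket.
transactions = [
    {"red", "white"}, {"red", "white", "green"}, {"red"},
    {"white", "green"}, {"red", "white"},
]
n = len(transactions)

def support(itemset):
    # Fraction of baskets containing the whole item set.
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"red"}, {"white"}
rule_support = support(antecedent | consequent)    # support of the rule
confidence = rule_support / support(antecedent)    # P(white | red)
lift = confidence / support(consequent)            # vs. the baseline P(white)

print(f"support={rule_support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```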
Collaborative filtering:
• Maintain a database of many users’ ratings of a variety of items
• identifying relevant items for a specific user from the very large set of items ("filtering") by
• considering preferences of many users ("collaboration")
Cautionary tale:
Machine learning is no magic bullet. It offers a set of tools and methodologies
• You need to know how to utilize them
• Can be disastrous if not used properly
• Does not replace skilled business analysts! It requires guidance and output validation (see
Yudkowsky (2008) for an example)
Exam preparations:
Multiple-choice questions (focus: knowledge)
In their book chapters, Wendler & Gröttup discuss the machine learning techniques of classification
and clustering.
In the book chapter ‘Association Rules and Collaborative Filtering’, Shmueli et al. (2016) discuss
several techniques that can be used to recommend items to users.
a. Will you use a supervised or an unsupervised method to do the customer segmentation? Why? [3
points]
b. What type of algorithm will you use, how is the algorithm implemented, what kind of data does it
take, and what output does it yield? [5 points]
c. How do you find the most lucrative customer segments? [2 points]
Lecture 5: Organizing for BIA
T1: BIA Maturity
Maturity:
Definition maturity: How developed an object is
• It differs between objects, evolves over time and can be modelled
Maturity models
• In IT: Capability Maturity Model (CMM) for Software
The CMM defines different maturity levels; as maturity increases, so do the capabilities. Increased
maturity leads to better business performance > competitive advantage.
'A business analytics capability framework', Cosic et al. (2015) – article 1
Business analytics capability framework:
Business analytics (BA) capabilities can potentially provide value and lead to better organizational
performance. This paper develops a business analytics capability framework (BACF) that specifies,
defines and ranks the capabilities that constitute an organizational BA initiative.
Definition BA capability: the ability to utilize resources to perform a BA task, based on the interaction
between IT assets and other firm resources.
T2: Outsourcing
BI&A: Make or buy it?
• Make: in house
• Buy: outsourcing
'Should You Outsource Analytics?', Fogarty & Bell (2014) – article 2
Should you outsource analytics?
Fogarty & Bell (2014): Two types of organizations:
1. ‘Analytically challenged’:
• See it as a quick and easy way to access analytic capability/skills
• Generally do not worry about IP, like to collaborate in this area
2. ‘Analytically superior’:
• See analytics as important ‘core competence’ leading to competitive advantage
• Will be more hesitant to outsource analytics: what about IP?
• Possibly outsourcing of ‘basic’ analytics functionalities (BI?) to free up internal analysts
• More might be achieved…
Slides
Service‐oriented DSS (Demirkan & Delen, 2013)
The rise of service-oriented businesses is associated with the demand for outsourcing analytics.
Service oriented decision support systems (DSS):
T3: BIA success
'An Empirical Investigation of the Factors Affecting Data Warehousing Success', Wixom & Watson (2001) – article 3
• ‘Success’ is popular topic in IS research
• DWH: “a specially prepared repository of data created to support decision making” (p. 18)
• Combines databases across an entire organization (versus data mart): IT infrastructure
project
• Three dimensions of system success were selected as being the most appropriate for this study:
o Data quality: the focus is on the data stored in the warehouse
o System quality: the focus is on the system itself
o Perceived net benefits: a system displaying high data quality and system quality can lead to net benefits for various stakeholders
• Three facets of warehousing implementation success were identified:
o Success with organizational issues: accepted into the organization and integrated into work
o Success with project issues: requires highly skilled, well-managed teams who can overcome issues that arise during the implementation project
o Success with technical issues: the technical complexity of data warehousing is high
• Seven implementation factors were included in the research model because of their
potential importance to data warehousing success: management support, champion (a
champion actively supports and promotes the project and provides information, material
resources, and political support), resources, user participation, team skills, source systems
(the quality of an organization's existing data can have a profound effect on systems
initiatives, and companies that improve data management realize significant benefits),
and development technology (the hardware, software, methods, and programs used in
completing a project)
• Results: The expectation was that all the arrows in figure 26 would show a positively
correlated relation. The results are shown in figure 27, as you can see, some relations are
supported and some are not supported (NS).
Slides:
Relativity of success:
Depends on who and when you ask
• “Who?”: Management success, project success, user success, correspondence success and system
success
• “When?”: Chartering phase, project phase, shakedown phase and onward & upward phase
Data driven culture: The capability to aggregate, analyse, and use data to make informed decisions
that lead to action and generate real business value. (Both technical and human capabilities)
Data to knowledge to results model:
1. Context: Prerequisites of success in this process → The Strategic, Skill, Organizational &
cultural, Technological and data related factors that must be present for an analytical effort
to succeed → Continually refined and affected by other elements
2. Transformation: where the data is actually analysed and used to support a business decision
3. Outcomes: Changes that result from the analysis and decision making
o Behaviours
o Processes and programs
o Financial conditions
Fast data: an extreme version of big data “the fast nephew of big data”
• The V’s: Volume, Velocity, Variety, Veracity, Variability and Value
• Why is fast data important?
o Reach a competitive advantage
o Source of value
o Increasing customer expectations
o Rapidly changing organizational environment
o Translate org. strategy into business rules
▪ Be able to adapt those every moment
• Culture
o Ensure trust in available data
o Let employees practice and learn with FD decisions
o Give employees autonomy to respond based on FD
o Be prepared for rapid changes based on FD
• Skills and experience
o Knowledge of & experience with:
▪ Technology: systems and software
▪ Algorithms and pattern meaning
▪ The data
▪ The organization and its strategy
▪ Communication: be able to convey the data and patterns found
Exam preparations:
Multiple‐choice questions (focus: knowledge)
Of which capability from the business analytics capability framework of Cosic et al. (2015; A business
analytics capability framework) is the following a description?
Following the paper Should you outsource analytics? (Fogarty & Bell, 2014), what is true about
‘analytically challenged’ organizations? They...
Consider the following propositions in the context of the paper An empirical investigation of the
factors affecting data warehousing success (Wixom & Watson, 2001).
a. Proposition I: A high level of data quality will be associated with a high level of perceived net
benefits
b. Proposition II: A strong champion presence is associated with a high level of organizational
implementation success
a. Provide an answer to this question [6 points]. In your answer, indicate which three types of
implementation success can be distinguished [4 points] according to Wixom & Watson in their
article: An empirical investigation of the factors affecting data warehousing success (2001)
Lecture 6: Data visualization & reporting
T1: Setting the stage
'Graphics Lies, Misleading Visuals', Alberto Cairo (2015) – article 1
• Developments in BI&A took place in tandem with those in the field of information
visualization
• Visualizations can (un)-intentionally be misleading. Both creators (encoders) and readers
(decoders) have a role in this
Patterson proposes to develop a new kind of journalism education. He calls it “knowledge based
journalism.” It combines deep subject-area expertise with a good understanding of how to acquire
and evaluate information (research methods).
Slides:
T2: Heuristics
Heuristics:
• The field of data visualization is broad and currently lacks adequate theoretical foundations
(Chen, 2010)
• Developing effective visualisations is not an art, but a craft based on embracing certain
principles and heuristics derived from experience and scientific inquiry (Cairo, 2014)
Five key (and tightly interrelated) qualities of effective visualisations (three are most important)
(Cairo, 2014):
1. Truthful: Getting the information as right as possible and displaying it as accurately as
possible. Example pitfall: comparing two different timeframes without knowing whether the
numbers are adjusted for inflation, and with a couple of years missing. Don't manipulate
your data to come to another conclusion.
• Always ask yourself the question: compared to what, to whom, to when, to where?
o e.g. 2013 headline “About 28% of journalism grads wish they’d chosen another
field”. Increase depth by making comparisons with previous years and other grads.
Increase breadth by (e.g.) including annual wages
• Report not only the mean, but also min, max, standard deviations
• Use equal bin sizes
• Make sure that the number of information-carrying (variable) dimensions depicted does not
exceed the number of dimensions in the data.
• Some graphics add a 3D effect although there are no three dimensions in the data (e.g.
barrels drawn in 3D while the underlying variables, price and year, are only two-dimensional).
2. Functional: what is it that I want to show – choose graphic forms according to the task(s) you
wish to enable. A graphic can mislead when it uses a non-zero baseline or represents the
data differently than the task requires. Make deliberate choices on how to present data
(what is the nature of the data and how can it be plotted effectively?) (Nussbaumer Knaflic,
2015). Instead of a pie chart, it is often better to use a table, for example.
• Ask yourself the question what task(s) the graphic should enable. Use logical and meaningful
baselines.
• E.g. displaying change
• The only worse design than a pie chart is several of them (Tufte, 1983).
• Use logical and meaningful baselines (e.g. a zero baseline).
3. Beautiful: ‘Beauty is not a property of objects, but a measure of the emotional experience
those objects may unleash’ (Cairo, 2014).
• Avoid chart junk by maximizing the data-ink ratio. This ratio is defined as the ratio
between data-ink and the total ink used to print the graphic (see the sketch after this list).
• Avoid unintentional optical art
4. Insightful
5. Enlightening
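To make the baseline and data-ink heuristics concrete, a minimal matplotlib sketch (with made-up monthly sales) that keeps a zero baseline, states what is measured, and strips non-data ink such as redundant spines and gridlines:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]   # toy data, for illustration only

fig, ax = plt.subplots()
ax.bar(months, sales, color="grey")
ax.set_ylim(0, max(sales) * 1.1)        # logical zero baseline, no truncation
ax.set_title("Monthly sales (units)")   # say what is measured and its unit

# Maximize the data-ink ratio: drop ink that carries no data.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
ax.grid(False)
plt.show()
```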
T3: Principles
'Information visualization', Chen (2010) – article 2
Principles:
• Gestalt principles – existing graphs and how to improve them
• Suggestions for improvement on exam!!
• Preattentive attribute
1. Gestalt principle of proximity
We tend to think of objects that are physically close together as belonging to part of a group
o Use proximity to guide (force) the reader to read the graph in a particular way
• Example: table design.
• Use pre-attentive attributes sparingly. The goal is to reduce cognitive load. "It is easy to spot
a hawk in a sky full of pigeons. As the variety of birds increases, however, that hawk becomes
harder and harder to pick out” (Ware, 2004).
• Only works if you want to emphasize one or two attributes.
• Hue = colour
T4: Example
• Not only display but tell a story
• Visualization steps (Nussbaumer Knaflic, 2015)
1. Understand the context
2. Choose an appropriate visual display
3. Eliminate clutter
4. Focus attention where you want it
5. Think like a designer
6. Tell a story
Exam preparations
Multiple-choice questions (focus: knowledge)
In the paper 'Information visualization', Chen (2010) introduces some of the fundamental concepts in
information visualization. One of these concepts is Gestalt principles. What is the Gestalt principle
of proximity?
a. We perceive objects that are physically close together as belonging to part of a group
b. We perceive objects that are physically enclosed together as belonging to part of a group
c. Whenever we can, we perceive a set of individual elements as a single, recognizable shape
d. When we look at objects, our eyes seek the smoothest paths and naturally create continuity in
what we see
In the book chapter 'Graphics Lies, Misleading Visuals', Alberto Cairo (2015) explains some of the
possible problems that might occur in data visualizations.
When the number of dimensions that are used to visualise data in a graph exceeds the number of
actual data dimensions, one can say that this graph has a:
In the last lecture, we talked about heuristics for, and principles underlying effective visualisations.
a. List two aspects of this visualisation that you think could be improved. Make sure each aspect
refers to a different feature of effective visualisations, as discussed during the lecture. [2 points]
b. For each aspect, explain why you think this aspect could be improved. [4 points]
c. For each aspect, explain how you would improve it so that it results in a more effective
visualisation. [4 points]