
Business Intelligence and Analytics

Summary lectures & articles 2023

Lecture 1: Setting the stage


What is business intelligence?
1. A broad category of: Applications, technologies and processes
2. That aims at: Gathering, sorting, accessing and analysing data
3. With the purpose of: Helping business users make better decisions

‘Analytics: The New Path to Value’ – article 1


‘Analytics: The New Path to Value' (LaValle et al., 2010)

What is meant by analytics?


Four levels of analytics:
• Descriptive analytics: similar to traditional business intelligence; describes what has occurred / how we did. What happened?
• Diagnostic analytics: why did it happen?
• Predictive analytics: tries to predict what might occur; the decision is still made by a human decision maker. What will happen?
• Prescriptive analytics: here you let analytics define your strategy; decision making can be done by an algorithm/AI, but it is defined by humans. What should occur? How can we make it happen?

Where are organizations now?


3 levels of analytics capability:
• Aspirational: These organizations are farthest from achieving their desired analytical goals.
Often focusing on efficiency or automation of existing processes and searching for ways to
cut costs. Have few of the necessary building blocks (people, process or tools) to collect,
understand, incorporate or act on analytic insights.
• Experienced: Having gained some analytic experience (often with efficiencies in the aspirational
phase) and looking to go beyond cost management. The more experienced they are, the better
they collect, understand, incorporate or act on analytic insights, and the better they are able to
optimize their organizations.
• Transformed: Have substantial experience using analytics across a broad range of functions.
Analytics is a competitive differentiator, and these organizations are less focused on cutting costs.
They already apply analytics in organizing people, processes and tools to optimize and differentiate.

➔ Transformed organizations are three times more likely to outperform their peers

Recommendations for implementing BI&A:


• Within each opportunity, start with questions, not data
o Business need should be clear
• Focus on the biggest and highest value opportunities
o Be a vocal supporter
o Strong, committed sponsorship
o Link incentives and compensation to desired behaviours
• Embed insights to drive actions and deliver value
o Alignment between the business and IT Strategy
o Ask to see what analytics went into decisions
o A strong data-infrastructure + right analytical tools
• Keep existing capabilities while adding new ones
o Recognise that some people can’t or won’t adjust
o Stress that outdated methods must be discontinued
o Strong analytical people in an appropriate organizational structure
• Use an information agenda to plan for the future

Slides
Role of a business analyst:
Uses business intelligence tools and applications to understand and improve business conditions and
business processes > Descriptive analytics
Involved in:
• Business development
• Identification of business needs and opportunities
• Business model and systems analysis
• Process design
• Interpretation of business rules and developing system requirements

Role of a data scientist:


Uses advanced algorithms and interactive exploration tools to uncover non-obvious patterns in data
> Predictive and prescriptive analytics
Can be involved in (Donoho, 2017):
• Data exploration and preparation
• Data representation and transformation
• Computing with data (lecture 4)
• Data modelling (lecture 4)
• Data visualization and representation (lecture 6)
• Science about data science
Usually has a multidisciplinary background

Other roles:
• Data architects
• Data engineers

Knowledge requirements for advanced analytics:
• Modeling → Data scientist: Uses advanced algorithms and interactive exploration tools to
uncover non-obvious patterns in data
• Business domain → Business analyst: Uses BI tools and applications to understand business
conditions and drive business processes
• Data:

Zooming in on data:
“Data are that which exists prior to argument or interpretation that converts them to facts, evidence
and information” (Rosenberg, 2013)

'Big data: extending the business strategy toolbox'– article 2


'Big data: extending the business strategy toolbox', Woerner & Wixom (2015)

Big data can be used to improve and innovate the business model.
Improving the business model:
• New data: Armed with new data, these companies can advance to generate insight. (e.g.
sensors, social media for understanding behaviour)
• New insight: New big data approaches and techniques that ranged from high-end statistics
and models to colourful visualizations of the output. (e.g. outlier detection, using big data
techniques)
• New action: As companies become well-armed with big data and proficient at making
insights based on that data, they act differently – often faster and more wisely. (e.g. outlier
detection, using big data techniques)
Big data facilitates improvements to business models across industries. The most effective
improvements result from creating well-articulated strategies that are informed by data and then
honed and shaped accordingly.

Innovate the business model:


• Data monetization: the act of exchanging information-based products and services for legal
tender or something of perceived equivalent value. There are fundamentally information
companies with a main focus on information-based products. Companies that sell tangible
products are increasingly engaging in information-based products and services as well.
o 3 forms of data monetization:
▪ Wrapping: refers to wrapping information around other core products and
services. (enriching core products with data)
▪ Selling: occurs when companies receive money or some form of legal tender
in exchange for information offerings. (offering information for sale)
▪ Bartering: occurs when companies choose to trade information in return for
new tools, services, or special deals. (trading information)
• Digital transformation: digital transformation occurs when companies leverage digitization
to move into completely new industries or to create new ones. (becoming active in new
industries)

Characteristics of new data: 4 Vs (Volume, Velocity, Variety, Veracity)

'The New Patterns of Innovation', Parmar et al. (2014) – article 3
Patterns in creating value from data and analytics:

1. Augmenting products to generate data: e.g. sensors in cars combined with a driving app, so you pay
more for insurance if you drive too aggressively.
2. Digitizing physical assets: e.g. Spotify and Netflix; you no longer buy a CD (physical asset) because
the content is made available to customers once it is digitized. Making blueprints digital so a 3D
model can be built from them. You can digitize existing products, or digitize in order to improve
the design (ex ante and ex post).
3. Combining data within and across industries: e.g. the challenge in healthcare is to coordinate
different activities; no single person or organization does this, so integrating the data sources of the
different organizations improves/optimizes the care for people. Google Maps combining data with a
company specialized in traffic information.
4. Trading (selling) data to others, for example to parties that want to digitize the CDs.
5. Codifying a capability: determine which capability you are talking about and put it in a context to
make it valuable.
Patterns 4 and 5 always build on top of the first three.

'Big data for big business? A taxonomy of data-driven business models used by start-
up firms', Hartmann et al. (2014) – article 4
A taxonomy of business models used by start-up firms:

Exam preparation
Multiple-choice questions (focus: knowledge)
> In the first lecture we discussed the three analytics capability levels that were identified by LaValle
et al. (2010). Which of the following levels is not mentioned by LaValle et al. (2010)?

> In the first lecture, we talked about different patterns of creating value from data as identified by
Parmar et al. (2014). What is the key difference between the pattern of digitizing physical assets and
codifying a capability?

> In the first lecture, we talked about the paper of Woerner & Wixom (2015). In this paper, the
authors argue that big data can be used to improve as well as to innovate the business model. What
is an example of the use of big data to innovate the business model?

Open-ended questions (focus: bridging theory and practice)


You are hired as a consultant by Confucius Inc., an organization that - according to the analytics
capabilities framework by LaValle et al. (2010) - can be classified as 'Experienced'. One of the business
challenges that this organization faces is that its employees do not always understand how to
leverage analytics to create business value. You are asked to advise Confucius Inc. on how to address
this issue. Drawing on the insights of the LaValle et al. (2010) paper, address the following for
Confucius Inc.:

a) What would be the next capability level Confucius should aim for, and what is the role of data
analytics at that level? [4 points]
b) Give two recommendations management can follow to yield a higher pay-off on data analytics in
the organization. [2 points]
c) Explain how both recommendations could be implemented in this organization [4 points]

Lecture 2: Data input
T1: Information requirements
Information requirements:
Relevant question: what information do executives need?
• Executive work activities (Watson & Frolick, 1993):
o Diverse, brief and fragmented
o Verbal communications preferred (soft info)
o More unstructured, non‐routine and long‐range in nature
than other managerial work (Mintzberg)
o Network building, building cooperative relationships
(internal/external)

Executive Information Systems (EIS)


Key questions in DSS (Decision support System) / EIS development:
• Which information needs should be covered? What can and
can’t we do with this system?
• What kind of analyses should the system support?
“A major developmental problem is determining what information
to include in the system” (Watson & Frolick, p. 255)
• Organizational‐level information requirements > specify a
portfolio of apps and databases, what will we cover, what
will each specific system cover.
• Application‐level information requirements > define the needs of a specific application, such
as an executive information system.
Metaphor: managerial cockpit > a lot of keys, levers and meters everywhere; very complex but relevant
information. The executive is the pilot of the airplane (the organization): what is our plan, are we going
up or are we going down, and are we performing according to the plan?

Determining information requirements:


Four generic strategies for identifying information requirements (Davis, 1982):
1. Asking > requires that the executives are available and able to articulate their needs
(but this is often very hard to do).
2. Deriving from an existing IS > requires that a system is already in place; moreover, these executive
information systems or new systems often differ from the existing system.
3. Synthesizing from characteristics of the utilizing system > what is the utilizing system, the
organization or the user? Typically done through the critical success factor method, but this also
requires considerable executive involvement, and that is tricky.
4. Discovering from experimentation with an evolving IS > requires that the system is already in
place, which means this is only useful for ongoing development.

Two phases in the lifetime of an EIS:


1. Initial phase (initial roll-out of the system): you launch the new system.
2. Ongoing phase (modify, add new screens/analyses/reports): here the fourth strategy could be
applicable.

EIS continues to evolve over time in response to:
• Competitor actions
• Changing customer preferences
• Government regulations
• Industry developments
• Technological opportunities, etc.

16 methods for determining information requirements (Watson & Frolick)


1. Discussions with executives
2. EIS planning meetings
3. Examinations of computer-generated information
4. Discussions with support personnel
5. Volunteered information
6. Examination of other organizations’ EIS
7. Examinations of non-computer-generated information
8. Critical success factors sessions
9. Participation in strategic planning sessions
10. Strategic business objectives method
11. Attendance at meetings
12. Information systems teams working in isolation
13. Examination of the strategic plan
14. Tracking executive activity
15. Software tracking of EIS usage
16. Formal change requests
Point 8: critical success factors sessions: the few key areas where 'things must go right' for the
organization's wellbeing/survival.
Which of the 16 methods is the best?
• Application of specific method for determining information requirements dependent on
phase, but each has pros and cons > Use multiple methods to triangulate information needs
• Development of BI&A is journey rather than destination

Critical success factors


“The limited number of areas in which satisfactory results will ensure successful competitive
performance for the individual, department or organization. Critical success factors are the few key
areas where 'things must go right' for the business to flourish and for the manager's goals to be
attained.” (Bullen & Rockart, 1981)

Examples:
• Image in financial markets
• Technological reputation with customers
• New market success
• Company morale

How to measure CSFs > KPIs
Key Performance Indicators (KPIs) are used to measure (quantify) CSFs
Examples (CSF > KPI):
• Image in financial markets > Price/earnings ratio
• Technological reputation with customers > Orders/bid ratio
• New market success > Change in market share
• Company morale > Employee satisfaction score
• Performance measures are the foundation of a useful EIS
From strategy to reports:
• Use of CSFs and KPIs enables measurement, and thus control, of strategic objectives.
• Performance measures (KPIs) that measure the execution of the strategy and the creation of
value must be included

From mission to information needs:

T 2: Data quality
What is it?
Data quality is about the fitness of data for its use or purpose: data from which you can integrate,
summarize and abstract relevant information. The data should be accurate, accessible, relevant and
trustworthy; you should be able to check whether the data is truthful; and it should be recent and
complete, with no missing values.

Data quality dimensions


6 dimensions:
1. Completeness
2. Uniqueness
3. Timeliness
4. Validity
5. Accuracy
6. Consistency

Other quality dimensions?


Data may be perfectly complete, unique, valid, accurate and timely.
However, if the data items are in English and the users do not
understand English, the data will be useless!
• Usability of the data
• Timing issues with the data (beyond timeliness itself)
• Flexibility of the data
• Confidence in the data
• Value of the data
• Legal aspects of the data
• In line with corporate image/message?

Effects of low quality data (Redman, 1998)


Typical issues:
• Inaccurate data: 1‐5% of data fields contain errors
• Inconsistencies across databases
• Unavailable data necessary for certain operations or decisions
Typical impacts:
• At operational level: lowered customer and employee satisfaction, increased cost
• At tactical level: poorer decision making, more difficult to implement data warehouses, more
difficult to reengineer, increased organizational mistrust
• At strategic level: more difficult to set and execute strategy, contribute to issues of data
ownership, compromise ability to align organizations, divert management attention > hard to
steer the cockpit because you don’t have the right information on the dashboard.
E.g. if the customer relations department has address X and the logistics department has address Y
because a move was not updated > operational impact: customers don't get their packages.

'Data Quality (DQ) in Context', Strong et al. (1997) – article 1
Data quality in context
• Focus on intrinsic DQ problems fails to solve complex organizational problems > consider DQ
beyond intrinsic view
• Focus not only on stored data (data custodians), but also on production (data producers)
and utilization (data consumers)
• High‐quality data: fit for use by data consumers

Table: Data quality dimensions used in the broader conceptualization of data quality argued by the
article

Common patterns and sequences of dimensions during DQ projects:


• Intrinsic DQ pattern: mismatches among sources of the same data are a common cause of
intrinsic DQ concerns (subpattern 1). Judgment or subjectivity in the data production process
is another common cause (subpattern 2). Figure 1
• Accessibility DQ pattern: accessibility DQ problems were characterized by underlying
concerns about technical accessibility (subpatterns 1-2), data-representation issues
interpreted by data consumers as accessibility problems (subpatterns 3-4), and data-volume
issues interpreted as accessibility problems (subpattern 5). Figure 2
• Contextual DQ pattern: Three underlying causes for data consumers’ complaints that
available data does not support their tasks: missing (incomplete) data, inadequately defined
or measured data, data that could not be appropriately aggregated. Figure 3


Moral of data quality:


For any BI&A project:
• Select the relevant (DQ) dimensions
• Define a threshold level

T 3: Data governance
Defining (data) governance:
Governance is more focused on the strategic level, whereas management focuses on the tactical and
operational level.

Data governance:
Data governance is the exercise of authority and control over the management of data assets.
• I.e. Data architecture, data quality, data storage & operations, data security and many more

'One Size Does Not Fit All – A Contingency Approach to Data Governance', Weber et al. (2009)
– article 2
Data governance model:
Consists of the following three objects (only the following two were mentioned in the lectures):
• Data quality roles:
o Executive sponsor: Provides sponsorship, strategic direction, funding, advocacy and
oversight for DQM (Data Quality Management)
o Data quality board: Defines the data governance framework for the whole
enterprise and controls its implementation
o Chief steward: Puts the board’s decisions into practice, enforces the adoption of
standards, helps establish DQ metrics and targets
o Business data steward: Details corporate-wide DQ standards and policies for his/her
area of responsibility from a business perspective
o Technical data steward: Provides standardized data element definitions and formats,
profiles and explains source system details and data flows between systems.

• The assignment of responsibilities: The abbreviations “R”, “A”, “C”, and “I” fill the cells of the
matrix to depict the kind of responsibility a role has for a specific DQM activity or decision.
o Responsible (“R”). This role is responsible for executing a particular DQM activity.
▪ Only one “R” is allowed per row, that is, only one role is ultimately
responsible for executing an activity
o Accountable (“A”). This role is ultimately accountable for authorizing a decision
regarding a particular DQM activity.
o Consulted (“C”). This role may or must be consulted to provide input and support for
a DQM activity or decision before it is completed.
o Informed (“I”). This role may or must be informed of the completion or output of a
decision or activity.
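To make the idea concrete, a minimal sketch of how such an assignment matrix could be represented and checked in code; the activities and the R/A/C/I values chosen here are hypothetical examples, not taken from Weber et al. (2009):

```python
# A hypothetical RACI matrix for two DQM activities (illustrative only).
# Keys: DQM activities; nested keys: data quality roles; values: R/A/C/I.
raci = {
    "Define DQ metrics": {
        "Executive sponsor": "I",
        "Data quality board": "A",
        "Chief steward": "R",
        "Business data steward": "C",
        "Technical data steward": "C",
    },
    "Standardize data element definitions": {
        "Executive sponsor": "I",
        "Data quality board": "A",
        "Chief steward": "C",
        "Business data steward": "C",
        "Technical data steward": "R",
    },
}

# Sanity check: only one role may be responsible ("R") per activity (row).
for activity, roles in raci.items():
    responsible = [role for role, code in roles.items() if code == "R"]
    assert len(responsible) == 1, f"{activity}: exactly one 'R' expected"
    print(f"{activity}: responsible role is {responsible[0]}")
```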

Data governance contingency model:


Where do we place the DQM activities in the organization, and how are they coordinated to achieve DQM
success? Contingency means 'dependent on': the variables at the top of the model are the contingency
factors that make it a contingency model.

The contingency factors determine the fit between the design of the data governance model and the
success of DQM within the organization.

E.g. data governance at the VU:


Two prime streams of data > student data (names, degrees, address, financial info, grades) and
scientific research.
• Research data
o Data mgmt. policy
• Student data
o Student analytics? Predicts study success.
➢ What is the best financial thing to do, what is the most ethical thing to do and what is the
smartest thing to do? E.g. focus on the students who, according to the predictions, will or
will not make it.

T 4: Data privacy and ethics


Privacy (& ethics) & BI&A
How does the concept of privacy change in the age of BI&A?
➢ Track people according to facial recognition.

'Big Data for All: Privacy and User Control in the Age of Analytics', Tene & Polonetsky (2013) –
article 3
Big data: big benefits:
• Healthcare: e.g. it is not possible for the FDA to check the correlations of every single medicine
on the market. By using big data, it is possible to find harmful correlations between
medicines. E.g. Google can forecast a flu epidemic.
• Mobile: Mobile devices–always on, location aware, and with multiple sensors including
cameras, microphones, movement sensors, GPS, and Wi-Fi capabilities have revolutionized
the collection of data in the public sphere and enabled innovative data harvesting and use.
• Smart-grid: The smart grid is designed to allow electricity service providers, users, and other
third parties to monitor and control electricity use. Utilities view the smart grid as a way to
precisely locate power outages or other problems, including cyber-attacks or natural
disasters, so that technicians can be dispatched to mitigate problems.
• Traffic management: e.g. governments around the world are establishing electronic toll
pricing systems, which determine differentiated payments based on mobility and congestion
charges. These systems apply varying prices to drivers based on their differing use of
vehicles and roads.
• Retail: It was Wal-Mart’s inventory management system (“Retail Link”) which pioneered the
age of big data by enabling suppliers to see the exact number of their products on every shelf
of every store at each precise moment in time.
• Payments: Another major arena for valuable big data use is fraud detection in the payment
card industry.
• Online: the most oft-cited example of the potential of big data analytics lies within the
massive data silos maintained by the online tech giants: Google, Facebook, Microsoft, Apple,
and Amazon.

Big data: big concerns:


The harvesting of large sets of personal data and the use of state of the art analytics implicate
growing privacy concerns. Protecting privacy will become harder as information is multiplied and
shared ever more widely among multiple parties around the world.
• Incremental effect: this incremental effect will lead to a “database of ruin,” chewing away,
byte by byte, on an individual's privacy until his or her profile is completely exposed. Every
click you make reveals more about your identity, and based on your identity you'll see certain
results. E.g. everyone has their own personal Netflix page based on their viewing history
(online identity).
• Automated decision making: The relegation of decisions about an individual’s life to
automated processes based on algorithms and artificial intelligence raises concerns about
discrimination, self-determination, and the narrowing of choice.
• Predictive analysis: Big data may facilitate predictive analysis with stark implications for
individuals susceptible to disease, crime, or other socially stigmatizing characteristics or
behaviours. This can be helpful, for example forecasting when a disaster is going to happen. But
it can also be creepy, e.g. a supermarket that predicts how likely it is that customers are pregnant.
• Lack of access and exclusion: An additional concern raised by big data is that it tilts an
already uneven scale in favour of organizations and against individuals. The big benefits of
big data, the argument goes, accrue to government and big business, not to individuals—and
they often come at individuals’ expense.

• The ethics of analytics: Where should the red line be drawn when it comes to big data
analysis? Moreover, who should benefit from access to big data? Could ethical scientific
research be conducted without disclosing to the general public the data used to reach the
results?
• Chilling effect: “a surveillance society,” a psychologically oppressive world in which
individuals are cowed to conforming behaviour by the state’s potential panoptic gaze.

De-identification & data minimization:


• De-identification: prevent a person's identity from being connected with data
• Through e.g.:
o Data masking: eliminate the personal details
o Pseudonymization: using a different name, e.g. authors publishing under a pseudonym
o Encryption: more towards the faculty of science
o Aggregation: combining data into one big pile, e.g. all students together
• Data minimization: only collect data after permission and for a specific purpose & delete the
data when we don't need it anymore
• However, Big Data (collecting as much data as possible to find new patterns) makes
de-identification harder and conflicts with data minimization!

The legal framework: challenges & solutions:


• It is hard for law/policy to keep up with technological developments/possibilities
• De‐identification as protective measure rather than solution
• Give consumers access to their data and “share the wealth”
• Transparency over data collection and analytics
➢ Tension between this transparency and the company's 'secret sauce'

Exam preparations
Multiple‐choice questions (focus: knowledge)
• The essence of the paper Data Quality in Context of Strong et al. (1997) can best be
described as:
• In the Data governance contingency model of Weber et al. (2009; One Size Does Not Fit All –
A Contingency Approach to Data Governance), what is considered a contingency factor?
• What is, according to the paper of Tene & Polonetsky (2013; Big Data for All: Privacy and
User Control in the Age of Analytics), a promising way to deal with privacy and user control in
the Big Data age?

Open‐ended questions (focus: bridging theory and practice)


As Lenovo plans to work more with external data, ethical and privacy‐related issues become more
prominent as well. Based on the work of Tene & Polonetsky (article Big Data for All: Privacy and User
Control in the Age of Analytics, 2013):
A) Explain why the traditional measures de‐identification & data minimization do not fit with the Big
Data (business) model [4 points]
B) Following the same paper, provide Lenovo with an advice on what they should do to ensure
ethical and privacy‐aware use of external (e.g., customer) data [6 points]

Lecture 3: Data architecture
T1: Data
What is data?
Data are that which exists prior to argument or interpretation that converts them to facts, evidence
and information (Rosenberg, 2013)

The knowledge pyramid

Data types:
• Form (qualitative and quantitative)
• Structure (structured, semi-structured, unstructured)
o Structures: tabular, network and hierarchical
o Relational databases: two tables with matching columns, these tables can be
combined
• Source (capture, derived, exhaust, transient)
• Producer (primary, secondary, tertiary)
• Type (indexical, attribute, metadata)

Structured data explained:


• Unstructured – videos/texts
• Semi-structured – spreadsheets
• Structured data: is related, a way of knowing how to treat each piece of data. Spreadsheets
and databases.
• Difference of use: computers need it structured.
• Semi-structured: spreadsheets, and by relating them you can structure them.

Structured query language (SQL): A high-level, declarative language for data access and
manipulation. Allows asking the database human-like questions (queries). Widely used for simple
functional reporting.

Data can be stored in:


• Operational databases: support transaction processing (also called OLTP, Online Transaction Processing, databases)
• Data Warehouses: support business intelligence
• Data Lakes: support various data types for analysis and exploration

T2: Operational databases
Operational databases:
'An Overview of Business Intelligence Technology', Chaudhuri et al. (2011) – article 1
Typical business intelligence architecture:

Business intelligence architecture:


• Data sources: The data over which BI tasks are performed often comes from different
sources— typically from multiple operational databases across departments within the
organization, as well as external vendors.
• Data movement, streaming engines: back-end technologies for preparing the data for BI are
collectively referred to as Extract-Transform-Load (ETL) tools. Increasingly there is a need to
support BI tasks in near real time, that is, make business decisions based on the operational
data itself. Specialized engines referred to as Complex Event Processing (CEP) engines have
emerged to support such scenarios.
• Data warehouse servers: The data over which BI tasks are performed is typically loaded into
a repository called the data warehouse that is managed by one or more data warehouse
servers. A popular choice of engines for storing and querying warehouse data is relational
database management systems (RDBMS). As more data is born digital, there is increasing
desire to architect low-cost data platforms that can support much larger data volume than
that traditionally handled by RDBMSs. Driven by this goal, engines based on the MapReduce
paradigm are now being targeted for enterprise analytics.
• Mid-tier servers: Provide specialized functionality for different BI scenarios.
o Online analytic processing (OLAP) servers efficiently expose the multidimensional
view of data to applications or users and enable the common BI operations such as
filtering, aggregation, drill-down and pivoting.
o Reporting servers enable definition, efficient execution and rendering of reports.
o Enterprise search engines support the keyword search paradigm over text and
structured data in the warehouse.
o Data mining engines enable in-depth analysis of data that goes well beyond what is
offered by OLAP or reporting servers, and provides the ability to build predictive
models
o Text analytic engines can analyse large amounts of text data

• Front-end applications: There are several popular front-end applications through which
users perform BI tasks: spreadsheets, enterprise portals for searching, performance
management applications that enable decision makers to track key performance indicators
of the business using visual dashboards, tools that allow users to pose ad hoc queries,
viewers for data mining models, and so on. Rapid, ad hoc visualization of data can enable
dynamic exploration of patterns, outliers and help uncover relevant facts for BI.

Characteristics of operational databases (or Online Transaction Processing Systems, OLTPs):

[Figure: table of OLTP characteristics, including data retention ('behoud') and volatility ('wisselvalligheid')]

Operational databases often are relational databases

Actual customers and orders tables


Entity-relationship diagram (ERD)

Databases can be accessed using the structured query language (SQL)


• A high-level, declarative language for data access and manipulation
o Allows asking human-like questions (queries) to databases
• Widely used for simple functional reporting
o Amount of sales this month
o The number of current female employees
o The age composition of the workforce

Structured query language (SQL) is a programming language for storing and processing information in a relational database. A relational
database stores information in tabular form, with rows and columns representing different data attributes and the various relationships between
the data values. You can use SQL statements to store, update, remove, search, and retrieve information from the database. You can also use
SQL to maintain and optimize database performance.
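As an illustration of this kind of simple functional reporting, a minimal sketch using Python's built-in sqlite3 module; the orders table and its columns are made up for the example:

```python
import sqlite3

# In-memory toy database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2023-05-02", 100.0), (2, "2023-05-15", 250.0), (3, "2023-04-28", 80.0)],
)

# Declarative, human-like question: total sales in May 2023.
query = """
    SELECT SUM(amount) AS total_sales
    FROM orders
    WHERE order_date BETWEEN '2023-05-01' AND '2023-05-31'
"""
total_sales = conn.execute(query).fetchone()[0]
print(total_sales)  # 350.0
conn.close()
```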

T3: Data warehouses
'Data Warehouses, Business Intelligence Systems, and Big Data', Kroenke et al. (2018) – article
2
Setting the stage: Business analysts need large datasets available for analysis by business intelligence
(BI) applications. BI systems such as online analytical processing (OLAP) and data warehouses are
used.
• Business intelligence (BI) systems are information systems that assist managers and other
professionals in the analysis of current and past activities and in the prediction of future
events.
• Operational systems—such as sales, purchasing, and inventory-control systems— support
primary business activities. They are also known as transactional systems or online
transaction processing (OLTP) systems because they record the ongoing stream of business
transactions.
• BI systems fall into two broad categories: reporting systems and data mining applications.
o Reporting systems sort, filter, group, and make elementary calculations on
operational data.
o Data mining applications, in contrast, perform sophisticated analyses on data,
analyses that usually involve complex statistical and mathematical processing.

Data warehouse: A data warehouse is a database system that has data, programs, and personnel
that specialize in the preparation of data for BI processing. Data are read from operational databases
by the extract, transform, and load (ETL) system. The ETL system then cleans and prepares the data
for BI processing. This can be a complex process.

Data mart: A data mart is a collection of data that is smaller than the data warehouse that addresses
a specific component or functional area of the business.

Dimensional database: The data warehouse databases are built in a design called a dimensional
database that is designed for efficient data queries and analysis. A dimensional database is used to
store historical data rather than just the current data stored in an operational database.
• A dimension within a dimensional database is a column or set of columns that describes
some aspect of the enterprise
• Because dimensional databases are used for analysis of historical data, they must be
designed to handle data that change over time. In order to track such changes, a dimensional
database must have a date dimension or time dimension as well.

OLAP: Online analytical processing (OLAP) provides the ability to sum, count, average, and perform
other simple arithmetic operations on groups of data. OLAP systems produce OLAP reports. An OLAP
report is also called an OLAP cube. This is a reference to the dimensional data model. OLAP uses the
dimensional database model discussed earlier in this chapter, so it is not surprising to learn that an
OLAP report has measures and dimensions. A measure is a dimensional model fact—the data item of
interest that is to be summed or averaged or otherwise processed in the OLAP report. A dimension,
as you have already learned, is an attribute or a characteristic of a measure.

The main idea behind data warehouse

Data warehouses:
From ad-hoc activities to a structured approach

Key concepts explained:


Data warehouses are (multi-)dimensional databases

Online Analytical Processing (OLAP): computational approach, BI system or mid-tier server for
answering multi-dimensional analytical queries, using interactive real-time interfaces.

Multi-dimensional analytical queries (MDAs): questions drawing on several data domains/
dimensions (e.g. sales by region, by product, by salesperson and by time: four dimensions)

Data warehouses definition:


• A collection of integrated, subject-oriented databases designed to support DSS (decision
support system) function, where each unit of data is relevant to some moment of time. The
data warehouse contains atomic and lightly summarized data.
• Data warehouses integrate pre-processed, well-structured problem-specific data. Being a
tool for efficiently handling multi-dimensional analytical queries (MDAs), they reduce
reporting time and costs.
• A data warehouse can be compared with a distributor in a supply chain (focus on data). Data
marts are like a retail store in that supply chain (focus on a specific component of a
business). So, data warehouses are used in large organizations (Kroenke et al., 2018).

Characteristics of data warehouses:

Operational databases vs. data warehouses:

[Figure: comparison table, including data retention ('behoud') and volatility ('wisselvalligheid')]

Information life cycle of the firm (Kroenke et al., 2018):

Multidimensional data model: decision cube
• Decision cube: The equivalent of a pivot table for a data warehouse. You may think of it as a
multi-dimensional table. Allows summarizing and analyzing the facts pertaining to a specific
subject along different dimensions. (e.g., sales cube, inventory cube)
• Facts or measures: What is tracked about the subject of the cube. A continuous additive
quantity (always numeric) that can be defined for all possible intersections of the
dimensions. (e.g. sales, revenue, units sold)
• Dimensions: Tag what is tracked about the subject of the cube. A dimension acts as an index
for identifying values within a cube. Each represents a single perspective on the data. If all
dimensions have a single member selected, then a single cell is defined. Dimensions are
structured in hierarchies and consist of categorical variables. (e.g. time (year, quarter,
month, week, day, hour), product type (line, series, model), and store of sale (region,
country, city, branch))

Facts versus dimensions


Identifying facts from dimensions for the sales cube

Multidimensional data model: star schema: The Star Schema is the way we represent the data
model pertaining to decision cubes. It is the ERD for a cube. The star schema consists of one or more
fact tables referencing a number of dimension tables.

• Benefits of data warehouses (Watson et al., 2002)
o Greatest benefits from DW apps occur when used to redesign business processes
and support strategic business objectives
o Can be a critical enabler for strategic change
• The main merits of data warehouses become their drawback in the age of big data
o Purpose-driven, rigid schemas that are not appropriate for ad-hoc analysis
o Pre-processing incurs high integration costs and means not all data can get in

T4: Big data


Data sources of modern organizations (Kroenke et al., 2018)
• Historical transactional data
• Market research & off-the-shelf databases
• Social networks
• Sensor-based
• Digital traces

Challenge: Big data is getting bigger and bigger (Kroenke et al., 2018)
• Big data is large
o From 33 zettabytes (trillion GB) in 2018 to a predicted 175 zettabytes in 2025 (IDC,
2018)
• Big data is fast
o 1.7 MB of data is created per second per person every day by 2020 (Forbes, 2015)
• Big data is highly dimensional
o Both in terms of records, but also in terms of attributes
• Big data is difficult to exploit
o 0.5% of all data is analysed/used

Distributed computing:
Approaches to deal with big data: distributed computing
• Based on the split-apply-combine (sac) paradigm
o E.g., calculating the average grade per exam opportunity (see the pandas sketch after this list)

• sac analogy for big data: map-reduce, implemented in e.g. Apache Hadoop
o E.g. calculating the frequencies of all words on the internet
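A minimal pandas sketch of the split-apply-combine example above (average grade per exam opportunity), using a made-up grades table:

```python
import pandas as pd

grades = pd.DataFrame({
    "student": ["s1", "s2", "s3", "s1", "s2"],
    "exam_opportunity": [1, 1, 1, 2, 2],
    "grade": [5.5, 7.0, 8.0, 6.5, 7.5],
})

# Split by exam opportunity, apply the mean to each group, combine the results.
avg_per_opportunity = grades.groupby("exam_opportunity")["grade"].mean()
print(avg_per_opportunity)
```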

Map reduce process:


The data themselves can be analysed using the MapReduce process. Because Big Data involves
extremely large datasets, it is difficult for one computer to process all of the data by itself. Therefore,
a set of clustered computers is used using a distributed processing system.
• The MapReduce process is used to break a large analytical task into smaller tasks, assign
(map) each smaller task to a separate computer in the cluster, gather the results of each of
those tasks, and combine (reduce) them into the final product of the original task. The term
Map refers to the work done on each individual computer, and the term Reduce refers to the
combining of the individual results into the final result.
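A single-machine sketch of the MapReduce idea for word counting; in a real cluster the map and reduce steps would run on different machines (e.g. via Apache Hadoop), but the logic is the same:

```python
from collections import defaultdict

documents = ["big data is big", "data lakes store big data"]

# Map: each (simulated) worker emits (word, 1) pairs for its document.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/Reduce: group the pairs by word and sum the counts.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # {'big': 3, 'data': 3, 'is': 1, 'lakes': 1, 'store': 1}
```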

Approaches to deal with big data:


Distributed computing
• Up to 100 times faster: Apache Spark (runs in RAM compared to HDD for map-reduce)
• Narrow vs. wide transformation becomes a more important consideration
• Complements:
o Hadoop (distributed file system)
o Mesos (resource management)
o Cassandra (operational systems)
o Openstack (cloud operations)
o MySQL (databases)
• Eco-system for big-data tools
o Streaming
o SQL
o Machine learning
o Graphs
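For illustration, the same word-count logic expressed as a Spark job via its Python API; this is a sketch that assumes pyspark is installed and a local Spark session can be started:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.parallelize(["big data is big", "data lakes store big data"])

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.collect())
spark.stop()
```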

T5: Data lakes
‘Using knowledge management to create a Data Hub and leverage the usage of a Data Lake’,
Ferreira et al., (2018) – article 3
Storing big data: data lakes
• Dropping storage costs allow for retaining giant amounts of data until the use case is found
(Ferreira et al., 2018)
o Deferred processing and modelling has become a possibility these days
• Business analysts are only one of the beneficiaries of corporate data these days. Data
scientists and deep learning machines have entered the stage as well
o The data pipeline must cater to the needs of all these users
o “tidy data” rather than structured data has become more important to match
today’s analytics needs

A data lake is:


A centralized repository containing virtually inexhaustible amounts of raw (or minimally curated)
data that is readily made available anytime to anyone authorized to perform analytical activities
(Chessell et al., 2014)
A repository for large quantities and varieties of data, both structured and unstructured. Four
criteria:
• Large in size and low in costs: Data lakes are big. They can be an order of magnitude less
expensive on a per-terabyte basis to set up and maintain than data warehouses.
• Fidelity: Hadoop data lakes preserve data in its original form and capture changes to data
and contextual semantics throughout the data lifecycle.
• Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require
up-front data models.
• Ease of accessibility: Accessibility is easy in the data lake, which is one benefit of preserving
the data in its original form.

Preventing a data swamp:
The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.
Otherwise, one only creates a data swamp (Ferreira et al., 2018)

Need for (knowledge) management (Ferreira et al., 2018)


• Metadata (what’s the data in the lake about)
• Data hub (navigating the data lake)

Data flow:

Exam preparations
Multiple-choice questions (focus: knowledge)
In the paper 'An Overview of Business Intelligence Technology', Chaudhuri et al. (2011) discuss a
typical architecture for supporting BI within enterprises.

What do we call the back-end components that prepare data for business intelligence tasks?
A. Data mining, text analytic engines
B. Extract Transform Load tools
C. Online Analytical Processing servers
D. Operational databases

In the whitepaper ‘The enterprise data lake: Better integration and deeper analytics’, Stein &
Morrison (2014) introduce the data lake as an answer to the massive growth in data available to
organizations.

What are the main characteristics of a data lake?

A. Analysis-oriented data, pre-transformed data, and cross-functional structure


B. Large data quantity, data heterogeneity and distributed architecture
C. Monolithic, silo-based, using containers for resource isolation and abstraction
D. Size and low cost, fidelity, ease of accessibility and late binding

Open questions (focus: application)
Ledoitte is a global professional services firm with competences in management and IT consulting. As
an information systems consultant in Ledoitte, you are trying to sell the idea of a business
intelligence system to Rich Tals, a beauty and cosmetics brand based in Amsterdam, running a
sizeable network of retail shops.

Explain the following to the board of directors:

A. How are operational databases (i.e. OLTP) and business intelligence systems (i.e. OLAP) used in an
organization? Focus on the differences in use that justify maintaining two separate systems. [4
points]

B. Why is a database redesign required when moving from online transaction processing (OLTP) to a
data warehouse (OLAP)? [2 points]

C. What are the characteristic differences between the data in an operational database system and
the data in a data warehouse? [4 points]

Lecture 4: Data analytics
T1: Basic concepts
Knowledge discovery in databases
The knowledge discovery in databases (KDD) process (Fayyad et al. 1996), and more in particular data
mining and machine learning techniques help in making sense of large volumes of data

Data mining
“Data mining is the application of specific algorithms for extracting patterns from data” (Fayyad,
1996, p. 37)

Data mining process


• Business understanding
o Understand project objectives and data mining problem identification
• Data understanding
o Capture, understand, explore the data for quality issues
• Data preparation
o Clean and merge data, derive attributes, etc.
• Modelling
o Select the machine learning technique, build the model
• Evaluation
o Evaluate the results and approve model
• Deployment
o Put model into practice, monitor and maintain

Different typologies:
Different analytic methods are used depending on:
• Data: Non-dependent vs dependent data
• Task: Descriptive vs predictive
• Relations: Variables vs observations
• Algorithm: e.g., Regression vs Classification vs Clustering
• Learning: Supervised vs Unsupervised Learning

Data: Non-dependent vs dependent:


• Nondependency-Oriented Data
o Numeric: Continuous
o Ordinal: Ordered discrete
o Categorical: Unordered discrete
o Binary: A special case of ordinal or categorical
• Dependency-Oriented Data
o Temporal (e.g. Time Series)
o Spatial (e.g. Maps, Graphics, etc.)
o Sequences (e.g. Genome)
o Graphs (e.g. Networks, Social Networks, etc.)
Dependency-oriented data require different and often more complex analytic methods.

Task: Description vs prediction:
Descriptive / Explorative Data Mining
• Learn about and understand the data
• e.g. Identify and describe groups of customers with similar purchasing behaviour (London’s
position in global corporate network)

Predictive Data Mining


• Build models in order to predict unknown values of interest
• e.g. Whether customer X will switch to another energy supplier (Catalog order prediction: A
model that predicts how much the customer will spend on the next catalog order, given a
customer’s characteristics)

Relations:
• Between the attributes (classification)
• Between the observations (clustering)

Algorithm:
• Regression: Regression is a function that maps a
data item to a prediction variable
• Data clustering: Given a set of observations, each having a set of attributes, and a similarity
measure among them, find clusters such that observations in one cluster are more similar
to each other, and observations in different clusters are less similar to each other.
• Data classification: Find a model for class attribute as a function of the values of other
attributes. Objective: Previously unseen observations (test set) should be assigned a class as
accurately as possible.
• Association rule discovery: Given a set of records, each of which contains some number of
items from a given collection, capture the co-occurrence of items
• Collaborative filtering

Learning:
In the case of Machine Learning
• Supervised: Machine learning task of inferring a function from labelled training data
(classification)
o You give to the computer some pairs of inputs/ outputs, so in the future when new
inputs are presented you have an intelligent output. Requires a training and a test
set. (i.e., inferring a function from labelled training data, e.g. classification)
• Unsupervised: Machine learning algorithm used to draw inferences from datasets consisting
of input data without labeled responses (clustering)
o You let the computer learn from the data itself without showing what is the expected
output. (i.e., drawing inferences from data without labelled responses, e.g.
clustering)

Machine learning workflows:

T2: Supervised learning
Supervised learning concept:
The task at hand is learning a function that predicts an output given some inputs, based on example
input-output pairs
• Output can be the result of an unsupervised learning project
• Requires a training and test set
• E.g. predicting sales: Given data on past sales (combined with other relevant data), can we
predict future sales?

Data mining task: regression:


Regression is a function that maps a data item to a prediction variable

Data mining task: classification


• Given a collection of observations (training set)
o Each record contains a set of attributes
o One of the attributes denotes the class (label); you try to predict that class or label
• Find a model for the class attribute as a function of the values of the other attributes
• Objective: previously unseen observations (test set) should be assigned a class as accurately
as possible

Example (see figure): run on the same data, with debt and income as inputs, the model tries to predict
the grey area, i.e. find a way of classifying the observations for which you do not yet have a label. What
your model learns from the training set is applied to the test set and can reveal patterns in the test set.

'Classification Models', Wendler & Gröttrup (2016) – article
Classification algorithms deal with the problem of assigning a category to each input variable vector.
Classification models are dedicated to categorizing samples into exactly one category.

The procedure for building a classification model: The original dataset is split into two independent
subsets, the training and the test set. The training data is used to build the classifier, which is then
applied to the test set for evaluation. Using a separate dataset is the most common way to measure
the goodness of the classifier’s fit to the data and its ability to predict. This process is called cross
validation.
• Often, some model parameters have to be predefined. To find the optimal parameter, a third
independent dataset is used, the validation set.
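A minimal scikit-learn sketch of splitting a dataset into training, validation and test sets; the 60/20/20 proportions and the iris data are just an example:

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# First split off 20% as the test set, then split the remainder
# into training (60% of total) and validation (20% of total) sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```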

General idea of a classification model: When training a classification model, the classification
algorithm inspects the training data and tries to find regularities in data records with the same target
value and differences between data records of different target values.
• In the simplest case, the algorithm converts these findings into a set of rules, such that the
target classes are characterized in the best possible way through these “if ... then ...”
statements.

Classification algorithms: The right choice of classifier strongly depends on the data type.
• Linear: linear methods try to separate the different classes with linear functions
• Nonlinear: nonlinear classifiers can construct more complex scoring and separating functions
• Rule-based: the rule-based models search the input data for structures and commonalities
without transforming the data. These models generate “if ... then ...” clauses on the raw data
itself.
o Decision trees (Workshop 4)

Classification vs. clustering:


• Clustering: A clustering algorithm is an unsupervised learning method. This means that the
data points are not labeled, hence, there is no target variable needed. Clustering is therefore
used to group the data points, based on their similarities and dissimilarities. The purpose is
to find new patterns and structures within the data.
• Classification: Classification has a different purpose. When using a classifier, the data points
are labeled. Due to this labeled training data, classification models belong to the supervised
learning algorithms. The model learns from some training data and then predicts the
outcome of the response variable based on this training data.

Yes or no decision: four possible events can occur
• True positive (TP). The true value is “yes” and the classifier predicts “yes”. A patient has
cancer and cancer is diagnosed.
• True negative (TN). The true value is “no” and the classifier predicts “no”. A patient is
healthy and no cancer is diagnosed.
• False positive (FP). The true value is “no” and the classifier predicts “yes”. A patient is
healthy but cancer is diagnosed.
• False negative (FN). The true value is “yes” and the classifier predicts “no”. A patient has
cancer but no cancer is diagnosed.
→ Unfortunately, a perfect classifier with no misclassification is pretty rare. It is almost impossible to
find an optimal classifier.

Slides
Classification:

Building a decision tree:


‘Divide and conquer algorithm’ (Wendler & Gröttrup, 2016)
• Start with all variables in one group
• Find the variable/split that best separates the outcome
• Divide the data into two groups (“leaves”) on that split (“node”)
• Within each split, find the best variable / split that separates the outcome
• Continue until the groups are too small or sufficiently “pure”
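A minimal scikit-learn sketch of this divide-and-conquer idea with a decision tree classifier (the iris data is used purely as an example; the workshop uses SPSS Modeler instead):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth limits how often the data is split, keeping the leaves from becoming too small.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(export_text(tree))           # the learned "if ... then ..." rules
print(tree.score(X_test, y_test))  # accuracy on unseen test data
```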

Evaluating classification models:
Evaluation:
• We use the confusion matrix for assessing model quality to assess the extent to which the
model confuses the outcome classes (i.e. mislabelling one class as the other)

• Note: predicted and actual might be switched! Also note: what is considered “positive” and
“negative” often is determined alphabetically
• Quality measure: Accuracy = (TP + TN) / sample size (see the sketch below)
o Can be calculated for both training and test data. Accuracy tends to be lower for test
data
• 100% accuracy is not a goal
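A small sketch of the confusion matrix and accuracy calculation with made-up predictions; note that scikit-learn also orders the labels alphabetically by default:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

actual    = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes", "no", "no", "yes"]

# Rows = actual, columns = predicted; label order is alphabetical ("no", "yes").
cm = confusion_matrix(actual, predicted)
print(cm)

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / len(actual)
print(accuracy, accuracy_score(actual, predicted))  # both 0.75
```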

Supervised learning:
Under- and overfitting:

Image recognition:

T3: Unsupervised learning
‘Cluster Analysis’, Wendler & Gröttrup (2016) – article
• A cluster analysis is used to identify groups of objects that are “similar”.
• The term “cluster analysis” stands for a set of different algorithms for finding subgroups in a
dataset. Measuring the similarity or the dissimilarity/distance of objects is the basis of all
cluster algorithms. In general, we call such measures “proximity measures”.

Proximity measures: are used to identify objects that belong to the same subgroup in a cluster
analysis. They can be divided into two groups: similarity and dissimilarity measures. Nominal
variables are recoded into a set of binary variables, before similarity measures are used. Dissimilarity
measures are mostly distance-based. Different approaches/metrics exist to measure the distance
between objects described by metrical variables.

• Hierarchical clustering:
o The agglomerative algorithms measure the distance between all objects. In the next
step, objects that are close are assigned to one subgroup. In a recursive procedure,
the algorithms now calculate the distances between the more or less large
subgroups and merge them stepwise by their distance.
o The divisive algorithms assign all objects to the same cluster. This cluster is then
divided step-by-step, so that in the end homogeneous subgroups are produced.
• Partitional clustering: The first step in a partitioning clustering method is to assign each
object to an initial cluster. Then a quality index for each cluster will be calculated. By
reassigning the objects to other clusters, the overall quality of the classification should now
be improved. After checking all possible cluster configurations, by reassigning the elements
to any other cluster, the algorithm will end when no improvement to the quality index is
possible.
o Monothetic algorithm: Algorithms where only one variable assigns objects to the
cluster
o Polythetic algorithm: If more than one variable is used

Clustering methods in SPSS modeler:


• K-means: As a first step, K-Means determines a cluster center within the data. Then each
object is assigned to the cluster center with the smallest distance. The cluster centers are
recalculated and the clusters are optimized, by rearranging some objects. (Workshop 5)

Use cases:
• Marketing: With cluster analysis, statisticians can identify customer subgroups.
• Banking: In the case of a new enquiry for a loan, the bank is able to predict the risk, based on
the data of the firm.
• Medicine: Based on data, a risk evaluation with respect to the carcinogenic qualities of
certain substances can be performed.
• Education: Identifying groups of students with special needs.

Slides
Unsupervised learning:
• The task at hand is detecting similarities in the data
o Between variables (columns)
▪ Principal component analysis (PCA)
o Between observations (rows)
▪ Clustering
o Recommender systems
▪ Association rule discovery
▪ User-based collaborative filtering
▪ Item-based collaborative filtering
▪ Text prediction

Between variables: PCA


• Principal component analysis (Wendler & Gröttrup, 2016)
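A minimal scikit-learn sketch of PCA for summarizing the relations between variables (columns); the iris data is only an example:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize the variables, then project the 4 columns onto 2 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

print(pca.explained_variance_ratio_)  # share of variance captured per component
print(pca.transform(X_scaled)[:3])    # first 3 observations in the new coordinates
```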

Between observations example: customer segmentation


Given data on customer characteristics, can we distinguish between groups of customers?

Between observations: clustering


Given a set of observations, each having a set of attributes, and a similarity measure among them,
finds clusters such that
• Observations in one cluster are more similar to each other
• Observations in different clusters are less similar to each other

Clustering:

K-means clustering:
Provide insight in the clusters & details of the K-means.

[Figure: K-means steps - starting centroids > assign each point to the closest centroid > recalculate centroids >
reassign values > update centroids (repeated until stable)]
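A minimal scikit-learn sketch of K-means on hypothetical two-dimensional customer data, following the steps above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual spend, number of orders].
X = np.array([[100, 2], [120, 3], [110, 2],
              [900, 25], [950, 30], [880, 28]])

# K-means repeatedly assigns points to the closest centroid and recalculates the centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment per observation
print(kmeans.cluster_centers_)  # final centroids
```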

Clustering: multi-dimensionality
• Why does point 7 not belong to the blue cluster?
• Representing multi-dimensional data in a 2-dimensional plane can be deceptive!
o (As is representing 2-dimensional data in a multidimensional plane, but more about
that in the lecture on data visualization)
o Compared to the other points, point 7 and 8 have a distinct score on another feature
o An analogy would be looking at two buildings right from above. Although both might
look similar, one could be a skyscraper, whereas the other could be a parking garage

Data mining task: recommendation
• Recommender engines (REs) (Shmueli et al., 2016)

'Association Rules and Collaborative Filtering', Shmueli et al. (2016) – article


Association rules: In association rules, the goal is to identify item clusters in transaction-type
databases. Association rule discovery in marketing is termed "market basket analysis" and is aimed at
discovering which groups of products tend to be purchased together.

Generating candidate rules: Association rules provide information of this type in the form of "if-then"
statements. These rules are computed from the data; unlike the if-then rules of logic, association
rules are probabilistic in nature. We use the term antecedent to describe the IF part, and consequent
to describe the THEN part. In association analysis, the antecedent and consequent are sets of items
(called item sets) that are disjoint (do not have any items in common).
• Consider only combinations that occur with higher frequency in the database. These are
called frequent item sets. Determining what qualifies as a frequent itemset is related to the
concept of support. The support of a rule is simply the number of transactions that include
both the antecedent and consequent item sets.
• Apriori algorithm: The key idea of the algorithm is to begin by generating frequent itemsets
with just one item (one-item sets) and to recursively generate frequent itemsets with two
items, then with three items, and so on, until we have generated frequent itemsets of all
sizes.
• To measure the strength of association implied by a rule, we use the measures of confidence
and lift ratio (a toy computation follows below)
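
As an illustration (my own toy example, not from Shmueli et al.), a computation of support, confidence, and lift for the candidate rule "if red, then white" on a made-up transaction database. Support is expressed here as a share of transactions; the chapter counts transactions, which differs only by a scaling factor.

```python
# Toy transaction database (market baskets)
transactions = [
    {"red", "white"},
    {"red", "white", "green"},
    {"red"},
    {"white", "green"},
    {"red", "white"},
]
n = len(transactions)

antecedent = {"red"}    # IF part
consequent = {"white"}  # THEN part

# Support: share of transactions containing both antecedent and consequent
support_both = sum((antecedent | consequent) <= t for t in transactions) / n

# Confidence: support of (antecedent and consequent) / support of antecedent
support_ante = sum(antecedent <= t for t in transactions) / n
confidence = support_both / support_ante

# Lift: confidence / benchmark confidence (support of the consequent alone)
support_cons = sum(consequent <= t for t in transactions) / n
lift = confidence / support_cons

print(round(support_both, 2), round(confidence, 2), round(lift, 2))  # 0.6 0.75 0.94
```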

The process of rule selection:


• The first stage consists of finding all "frequent" item sets, i.e. those item sets that have the
requisite support.
• In the second stage, we generate, from the frequent item sets, association rules that meet a
confidence requirement.

Two principles can guide us in assessing rules for possible spuriousness due to chance effects:
• The more records the rule is based on, the more solid the conclusion.
• The more distinct rules we consider seriously (perhaps consolidating multiple rules that deal
with the same items), the more likely it is that at least some will be based on chance
sampling results.

Collaborative filtering: In collaborative filtering, the goal is to provide personalized recommendations
that leverage user-level information. User-based collaborative filtering starts with a user, then finds
users who have purchased a similar set of items or ranked items in similar fashion, and makes a
recommendation to the initial user based on what the similar users purchase or like.
• The recommender engine provides personalized recommendations to a user based on the
user's information as well as on similar users' information. Information means behaviors
indicative of preference, such as purchase, ratings, and clicking.
• Collaborative filtering requires availability of all item-user information. Specifically, for each
item-user combination, we should have some measure of the user's preference for that item.

User-based collaborative filtering, "People like you": One approach to generating personalized
recommendations for a user using collaborative filtering is based on finding users with similar
preferences and recommending items that they liked but the user hasn't purchased. The algorithm
has two steps (a code sketch follows the list):
• Find users who are most similar to the user of interest (neighbors). This is done by comparing
the preference of our user to the preferences of other users. → Calculated with the help of
correlation and cosine similarity
• Considering only the items that the user has not yet purchased, recommend the ones that
are most preferred by the user's neighbors.
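
A minimal sketch of the two steps on an assumed small ratings matrix, using cosine similarity to find the nearest neighbour (illustration only; a 0 means the user has not rated or purchased the item).

```python
import numpy as np

# Rows = users, columns = items; 0 means "no preference expressed"
R = np.array([
    [5, 4, 0, 1],  # user 0: the user of interest
    [4, 5, 3, 0],  # user 1
    [1, 0, 5, 4],  # user 2
])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0

# Step 1: find the user most similar to the user of interest (the neighbour)
sims = [(other, cosine(R[target], R[other]))
        for other in range(len(R)) if other != target]
neighbour = max(sims, key=lambda s: s[1])[0]

# Step 2: among the items the target has not yet rated, recommend the one
# the neighbour prefers most
unrated = np.where(R[target] == 0)[0]
recommendation = unrated[np.argmax(R[neighbour, unrated])]
print(neighbour, recommendation)  # here: neighbour 1, recommend item 2
```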

Item-based collaborative filtering: When the number of users is much larger than the number of
items, it is computationally cheaper (and faster) to find similar items rather than similar users.
Specifically, when a user expresses interest in a particular item, the item-based collaborative filtering
algorithm has two steps (sketched in code below):
• Find the items that were co-rated, or co-purchased, (by any user) with the item of interest.
• Recommend the most popular or correlated item(s) among the similar items.
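
A corresponding sketch of the item-based variant on the same assumed ratings matrix: item–item similarity is computed only over users who co-rated both items, and the most similar co-rated item is recommended (illustration only).

```python
import numpy as np

# Rows = users, columns = items; 0 means "no preference expressed"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 3, 0],
    [1, 0, 5, 4],
])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

item_of_interest = 0

# Step 1: for every other item, keep only the users who rated both items
# (co-rated) and compute the item-item similarity on those ratings
candidates = []
for item in range(R.shape[1]):
    if item == item_of_interest:
        continue
    co_rated = (R[:, item_of_interest] > 0) & (R[:, item] > 0)
    if co_rated.any():
        candidates.append((item, cosine(R[co_rated, item_of_interest],
                                        R[co_rated, item])))

# Step 2: recommend the most similar (most correlated) co-rated item
best_item = max(candidates, key=lambda c: c[1])[0]
print(best_item)  # here: item 1
```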

Collaborative filtering vs. association rules:


• Association rules look for frequent item combinations and will provide recommendations
only for those items. In contrast, collaborative filtering provides personalized
recommendations for every item, thereby catering to users with unusual taste. In this sense,
collaborative filtering is useful for capturing the "long tail" of user preferences, while
association rules look for the "head."
• Because association rules produce generic, impersonal rules, they can be used for setting
common strategies such as product placement in a store or sequencing of diagnostic tests in
hospitals. In contrast, collaborative filtering generates user-specific recommendations and is
therefore a tool designed for personalization.
• In association rules, the antecedent and consequent can each include one or more items. In
contrast, in collaborative filtering similarity is measured between pairs of items or pairs of
users

Slides:
Association rule discovery:
REs: Association rule discovery (Shmueli et al., 2016)
• “Frequently bought together”
o Association rule
▪ if red (antecedent), then white (consequent)
▪ if red and white (antecedent), then green (consequent)
▪ if red (antecedent), then white and green (consequent)
• The Apriori algorithm is much faster than exhaustively enumerating all possible item sets
• Uses the concept of support: (number of transactions that include an item set / total number
of transactions) * 100%

Collaborative filtering:
• Maintain a database of many users’ ratings of a variety of items
• identifying relevant items for a specific user from the very large set of items ("filtering") by
• considering preferences of many users ("collaboration")

REs: User-based collaborative filtering (Shmueli et al., 2016)


• “Customers that bought this product, also bought…”
• Steps:
o 1. Find users who are most similar to the user of interest (“neighbours”) by
comparing preferences
o 2. Recommend the items most preferred by the neighbours that the user has not yet
purchased
o > indicates an expressed preference (e.g. a numerical rating, a purchase, a “like”, or a
click)
• “You also might be interested in…”
o Approach: If a user indicates a preference for an item, find the item that is co-rated
most often and suggest that item
o > indicates a shared preference (e.g. co-rated or co-purchased)

Cautionary tale:
Machine learning is no magic bullet. It offers a set of tools and methodologies
• You need to know how to utilize them
• Can be disastrous if not used properly
• Does not replace skilled business analysts! It requires guidance and output validation (see
Yudkowsky (2008) for an example)

Advanced algorithms: Advanced challenges


• The obvious patterns are not always the most relevant ones
• The “barrier of meaning” of intelligent systems
• Algorithms are often black-boxed
• Potential unexpected implications

Exam preparations:
Multiple-choice questions (focus: knowledge)

In their book chapters, Wendler & Gröttrup discuss the machine learning techniques of classification
and clustering.

What is the basis of all clustering algorithms?


a. A scoring function
b. Assigning a unique category to each input variable vector
c. Measuring the similarity or dissimilarity of objects
d. Multivariate logistic regression

In the book chapter ‘Association Rules and Collaborative Filtering’, Shmueli et al. (2016) discuss
several techniques that can be used to recommend items to users.

In the context of collaborative filtering, what is meant by a “cold start”?

a. It can not be used as-is, i.e. as a self-contained system


b. It can not be used as long as there is a substantial amount of data available
c. It can not be used for new users or new items
d. It takes several iterations of learning until it returns significant results

Open questions (focus: application)


Lenovo is a Chinese multinational technology company with headquarters in Beijing, China,
and Morrisville, North Carolina. It designs, develops, manufactures and sells personal computers,
tablet computers, smartphones, workstations, servers, electronic storage devices, IT management
software, and smart televisions. Since 2013, Lenovo is the world's largest personal computer vendor
by unit sales. However, with the gradual shift towards mobile and cloud computing in recent years,
the market for personal computers has lost its growth momentum. This has created concerns about
the strategic positioning of the company for its executives, given that personal computing represents
Lenovo’s largest market. Lenovo asked you to do a market analysis based on the recent sales of the
company, with the objective to identify the different customer segments as well as their evolution
over the years. The company also is particularly interested in identifying the most lucrative of the
customer segments in order to adapt its product ranges. You have access to a large collection of
customer data, including the customer information forms that the new customers fill in when they
register for an extended warranty (about two-thirds of the new customers), or when they opt-in for
Lenovo’s online services (less than half of the customers).

a. Will you use a supervised or an unsupervised method to do the customer segmentation? Why? [3
points]
b. What type of algorithm will you use, how is the algorithm implemented, what kind of data does it
take, and what output does it yield? [5 points]
c. How do you find the most lucrative customer segments? [2 points]

Lecture 5: Organizing for BIA
T1: BIA Maturity
Maturity:
Definition maturity: How developed an object is
• It differs between objects, evolves over time and can be modelled

Maturity models
• In IT: Capability Maturity Model (CMM) for Software

• Different levels of maturity


• Multidimensional maturity
• Alignment

• Many different applications


• Benchmarking & roadmapping
Note: more mature is not always better! Whether higher maturity pays off depends on the dimension
considered and on how you define the maturity level.

How does this maturity apply to BI&A?

Maturity levels range from descriptive analytics to ‘shaping the future’ (figure).

The model defines different maturity levels, and the capabilities increase with each level. Increased
maturity is associated with better business performance and, ultimately, competitive advantage.

'A business analytics capability framework', Cosic et al. (2015) – article 1
Business analytics capability framework:

Business analytics (BA) capabilities can potentially provide value and lead to better organizational
performance. This paper develops a business analytics capability framework (BACF) that specifies,
defines and ranks the capabilities that constitute an organizational BA initiative.

Definition BA capability: the ability to utilize resources to perform a BA task, based on the interaction
between IT assets and other firm resources.

There are 16 BA capabilities grouped in the following categories:


• Governance: Mechanism for managing the use of BA resources and the assignment of
decision rights and accountabilities to align BA initiatives with organizational objectives
• Culture: Tacit and explicit organizational norms, values and behavioral patterns that form
over time and lead to systematic ways of gathering, analyzing and disseminating data
• People: Individuals who use BA as part of their job function
• Technology: Development and use of hardware, software and data within BA activities.

T2: Outsourcing
BI&A: Make or buy it?
• Make: in house
• Buy: outsourcing

Outsourcing matrix (Dornier, 1998)


• Use this for certain activities
• E.g. Coffee cups > low strategic importance, but it
is necessary, so outsource the production of the
coffee cups

'Should You Outsource Analytics?', Fogarty & Bell (2014) – article 2
Should you outsource analytics?
Fogarty & Bell (2014): Two types of organizations:
1. ‘Analytically challenged’:
• See it as a quick and easy way to access analytic capability/skills
• Generally do not worry about IP, like to collaborate in this area
2. ‘Analytically superior’:
• See analytics as important ‘core competence’ leading to competitive advantage
• Will be more hesitant to outsource analytics: what about IP?
• Possibly outsourcing of ‘basic’ analytics functionalities (BI?) to free up internal analysts
• More might be achieved…

Slides
Service‐oriented DSS (Demirkan & Delen, 2013)
The rise of service-oriented businesses is associated with the demand for outsourcing analytics.
Service oriented decision support systems (DSS):

• Advances in IT make service-oriented thinking possible


• Service orientation entails reusability, substitutability, extensibility, scalability,
customizability, composability, reliability, …
• Loose coupling between systems (and business processes) enables service-oriented thinking
➔ The rise of these service-oriented offerings has an impact on organizations: e.g. new
organizational units, restructuring of business processes, etc.

T3: BIA success
'An Empirical Investigation of the Factors Affecting Data Warehousing Success' – article 3
• ‘Success’ is popular topic in IS research
• DWH: “a specially prepared repository of data created to support decision making” (p. 18)
• Combines databases across an entire organization (versus data mart): IT infrastructure
project

Research model for data warehouse success:

• Three dimensions of system success were selected as being the most appropriate for this
study: data quality (the focus is on the data stored in the warehouse), system quality (the
focus is on the system itself), and perceived net benefits (a system displaying high data
quality and system quality can lead to net benefits for various stakeholders).
• Three facets of warehousing implementation success were identified: success with
organizational issues (the system is accepted in the organization and integrated into the
work), success with project issues (these require highly skilled, well-managed teams who can
overcome issues that arise during the implementation project), and success with technical
issues (the technical complexity of data warehousing is high).
• Seven implementation factors were included in the research model because of their
potential importance to data warehousing success: management support, champion (a
champion actively supports and promotes the project and provides information, material
resources, and political support), resources, user participation, team skills, source systems
(the quality of an organization's existing data can have a profound effect on systems
initiatives, and companies that improve data management realize significant benefits), and
development technology (the hardware, software, methods, and programs used in
completing a project).
• Results: The expectation was that all the arrows in figure 26 would show a positively
correlated relation. The results are shown in figure 27: as you can see, some relations are
supported and some are not (NS).

Slides:
Relativity of success:
Depends on who and when you ask
• “Who?”: Management success, project success, user success, correspondence success and system
success
• “When?”: Chartering phase, project phase, shakedown phase and onward & upward phase

Business Intelligence Success

T4: Data-driven culture


Data to knowledge to results:
Davenport et al. (2001): Most companies are not succeeding in turning data into knowledge and then
results
• Why? Neglecting the most important step in the data transformation process – the human
realm of:
o analysing and
o interpreting data, and then
o acting on the insights
• Primarily: Technology focus (tools, infrastructure)
• Ignoring the organizational, cultural, and strategic changes needed to leverage their
investments
➢ We need: Data driven culture

Data driven culture: The capability to aggregate, analyse, and use data to make informed decisions
that lead to action and generate real business value. (Both technical and human capabilities)
Data to knowledge to results model:

1. Context: Prerequisites of success in this process → The Strategic, Skill, Organizational &
cultural, Technological and data related factors that must be present for an analytical effort
to succeed → Continually refined and affected by other elements
2. Transformation: where the data is actually analysed and used to support a business decision
3. Outcomes: Changes that result from the analysis and decision making
o Behaviours
o Processes and programs
o Financial conditions

Fast data: an extreme version of big data “the fast nephew of big data”
• The V’s: Volume, Velocity, Variety, Veracity, Variability and Value
• Why is fast data important?
o Reach a competitive advantage
o Source of value
o Increasing customer expectations
o Rapidly changing organizational environment

A successful fast data organization is built on 4 pillars:


• Technology:
o Analyse and process as events
o Process incoming Fast data directly, by means of the two techniques splitting and
filtering
o Maintain two separate databases one for historical data & one for Fast data
o Combine these for event recognition
• Strategy:
o Ensure a clear strategy for required data
o Define what ‘real‐time’ means
o Determine the required length for ‘windows’ of incoming stream data

o Translate org. strategy into business rules
▪ Be able to adapt those every moment
• Culture
o Ensure trust in available data
o Let employees practice and learn with FD decisions
o Give employees autonomy to respond based on FD
o Be prepared for rapid changes based on FD
• Skills and experience
o Knowledge of & experience with:
▪ Technology: systems and software
▪ Algorithms and pattern meaning
▪ The data
▪ The organization and its strategy
▪ Communication: be able to convey the data and patterns found

Exam preparations:
Multiple‐choice questions (focus: knowledge)
Of which capability from the business analytics capability framework of Cosic et al. (2015; A business
analytics capability framework) is the following a description?

Following the paper Should you outsource analytics? (Fogarty & Bell, 2014), what is true about
‘analytically challenged’ organizations? They...

Consider the following propositions in the context of the paper An empirical investigation of the
factors affecting data warehousing success (Wixom & Watson, 2001).

a. Proposition I: A high level of data quality will be associated with a high level of perceived net
benefits
b. Proposition II: A strong champion presence is associated with a high level of organizational
implementation success

Open questions (focus: application)


The management team of Lenovo is interested when the implementation of the new system /
technology that will enable them to integrate social media data in their decision making can be
considered ‘a success’ and poses you (a BI&A consultant) this question.

a. Provide an answer to this question [6 points]. In your answer, indicate which three types of
implementation success can be distinguished [4 points] according to Wixom & Watson in their
article: An empirical investigation of the factors affecting data warehousing success (2001)

Lecture 6: Data visualization & reporting
T1: Setting the stage
'Graphics Lies, Misleading Visuals', Alberto Cairo (2015) – article 1
• Developments in BI&A took place in tandem with those in the field of information
visualization
• Visualizations can (un)-intentionally be misleading. Both creators (encoders) and readers
(decoders) have a role in this

Inaccuracy of graphs is based on three strategies:


• Hiding relevant data to highlight what benefits us
• Displaying too much data to obscure reality
• Using graphic forms in inappropriate ways

Patterson proposes to develop a new kind of journalism education. He calls it “knowledge based
journalism.” It combines deep subject-area expertise with a good understanding of how to acquire
and evaluate information (research methods).

Slides:
T2: Heuristics
Heuristics:
• The field of data visualization is broad and currently lacks adequate theoretical foundations
(Chen, 2010)
• Developing effective visualisations is not an art, but a craft based on embracing certain
principles and heuristics derived from experience and scientific inquiry (Cairo, 2014)

Five key (and tightly interrelated) qualities of effective visualisations (three are most important)
(Cairo, 2014):
1. Truthful: Getting the information as right as possible and displaying it as right as possible.
Example pitfall: a graph comparing two different timeframes where it is unclear whether the
numbers are adjusted for inflation and where a couple of years are missing. Don't manipulate
your data to reach a different conclusion.
• Always ask yourself the question: compared to what, to whom, to when, to where?
o e.g. 2013 headline “About 28% of journalism grads wish they’d chosen another
field”. Increase depth by making comparisons with previous years and other grads.
Increase breadth by (e.g.) including annual wages
• Report not only the mean, but also min, max, standard deviations
• Use equal bin sizes
• Make sure that the number of information-carrying (variable) dimensions depicted does not
exceed the number of dimensions in the data.
• Some graphs add a 3D effect even though the data has no third dimension (e.g. barrels
drawn in three dimensions, while price and year are only two dimensions).

2. Functional: what is it that I want to show – choose graphic forms according to the task(s) you
wish to enable. Graphs can mislead by using a non-zero baseline or by representing the data in
an unsuitable form. Make deliberate choices on how to present the data, based on the nature of
the data and how it can be plotted effectively (Nussbaumer Knaflic, 2015). A pie chart is often a
poor choice; a table can be a better alternative.
• Ask yourself what task(s) the graphic should enable (e.g. displaying change).
• The only worse design than a pie chart is several of them (Tufte, 1983).
• Use logical and meaningful baselines (typically a zero baseline).

3. Beautiful: ‘Beauty is not a property of objects, but a measure of the emotional experience
those objects may unleash’ (Cairo, 2014).
• Avoid chart junk by maximizing the data-ink ratio, defined as the ratio between data-ink and
the total ink used to print the graphic (see the sketch below).
• Avoid unintentional optical art
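
A minimal matplotlib sketch (an assumed example, not from Cairo) of raising the data-ink ratio by stripping non-data ink such as unnecessary spines, gridlines and tick marks.

```python
import matplotlib.pyplot as plt

years = [2018, 2019, 2020, 2021, 2022]
sales = [12, 15, 14, 18, 21]

fig, ax = plt.subplots()
ax.plot(years, sales, marker="o")

# Remove non-data ink ("chart junk") to raise the data-ink ratio
ax.spines["top"].set_visible(False)    # drop the chart border at the top...
ax.spines["right"].set_visible(False)  # ...and on the right
ax.grid(False)                         # no gridlines
ax.tick_params(length=0)               # no tick marks, keep the labels

ax.set_title("Sales per year")
plt.show()
```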

4. Insightful
5. Enlightening

T3: Principles
'Information visualization', Chen (2010) – article 2
Principles:
• Gestalt principles – existing graphs and how to improve them
• Suggestions for improvement on exam!!
• Pre-attentive attributes

• Gestalt principles of visual perception (Chen, 2010)


o Define how people interact with and create order out of visual stimuli – how do we
make sense of the world around us.
• Can be used to identify unnecessary elements and ease the processing of our visual
communications. With that, we can reduce cognitive load.

1. Gestalt principle of proximity
We tend to think of objects that are physically close together as belonging to part of a group
o Can be used to guide the reader to read the content in a certain way/order
• Example: table design.

2. Gestalt principle of similarity


Objects that are of similar colour, shape, size or orientation are perceived as related or belonging to
part of a group
• Example: graph

3. Gestalt principle of enclosure


• We think of objects that are physically enclosed together as belonging to a part of a group
• Even if the enclosing line is not continuous, we still perceive the enclosed objects as a connected group.

4. Gestalt principle of closure


We like things to be simple and to fit in the constructs we have in our heads. Whenever we can, we
perceive a set of individual elements as a single, recognizable shape.
• Example: removal of chart borders and gridlines in a graph

5. Gestalt principle of continuity


When looking at objects, our eyes seek the smoothest paths and naturally create continuity in what
we see
• Example: removal of axis lines in a graph or plot

6. Gestalt principle of figure/ground


The human brain will distinguish between the objects it considers to be in the foreground of an
image (the figure, or focal point) and the background (the area on which the figures rest)
• E.g. the chart on gun deaths after Florida’s ‘stand your ground’ law (which allows shooting
someone who enters your property and refuses to leave): the y-axis is plotted the other way
around (inverted), so the shaded area is read as the figure and an increase in deaths is easily
misread as a decrease

'An Economist's Guide to Visualizing Data', Schwabish (2014) – article 3


• Pre-attentive attributes (Schwabish, 2014)
o Visual properties that we notice without making a conscious effort
o Iconic (i.e. super fast) memory is tuned to a set of pre-attentive attributes
o Can be used to make your audience see what we want them to see before they even
know they are seeing it
• Examples:
o Size
o Colour
o Spatial position
o Orientation
o Shape
o Line length / width

• Use pre-attentive attributes sparingly. The goal is to reduce cognitive load. "It is easy to spot
a hawk in a sky full of pigeons. As the variety of birds increases, however, that hawk becomes
harder and harder to pick out” (Ware, 2004).
• Only works if you want to emphasize one or two attributes.
• Hue = colour

• Gestalt: how we make sense of the world around us


• Pre-attentive attributes: where we focus our attention

T4: Example
• Not only display but tell a story
• Visualization steps (Nussbaumer Knaflic, 2015)
1. Understand the context
2. Choose an appropriate visual display
3. Eliminate clutter
4. Focus attention where you want it
5. Think like a designer
6. Tell a story

Exam preparations
Multiple-choice questions (focus: knowledge)
In the paper 'Information visualization', Chen (2010) introduces some of the fundamental concepts in
information visualization. One of these concepts is the set of Gestalt principles. What is the Gestalt principle
of proximity?

a. We perceive objects that are physically close together as belonging to part of a group
b. We perceive objects that are physically enclosed together as belonging to part of a group
c. Whenever we can, we perceive a set of individual elements as a single, recognizable shape
d. When we look at objects, our eyes seek the smoothest paths and naturally create continuity in
what we see

In the book chapter 'Graphics Lies, Misleading Visuals', Alberto Cairo (2015) explains some of the
possible problems that might occur in data visualizations.

When the number of dimensions that are used to visualise data in a graph exceeds the number of
actual data dimensions, one can say that this graph has a:

a. Data-ink ratio smaller than 1


b. Data-ink ratio that equals 1
c. Lie factor that equals 1
d. Lie factor larger than 1

Open questions (focus: application)


During the COVID-19 crisis, many data visualizations have been published to give more insight into
key indicators. Consider the graph below. The graph was published by the Texas Medical Institute. Its
aim is to communicate the seven-day moving average of new hospital admissions in Texas.

In the last lecture, we talked about heuristics for, and principles underlying effective visualisations.

a. List two aspects of this visualisation that you think could be improved. Make sure each aspect
refers to a different feature of effective visualisations, as discussed during the lecture. [2 points]
b. For each aspect, explain why you think this aspect could be improved. [4 points]
c. For each aspect, explain how you would improve it so that it results in a more effective
visualisation. [4 points]

