
Ordinal data allows distances FALSE dm1 slide 79 no distance defined for ordinal

MapReduce consists in Map/Shift&Sort/Reduce phase FALSE Big data slide 45 Map/Shuffle&Sort/Reduce


KNN with even value of K FALSE dm2 slide 110 K should be odd
IMPALA allows UPDATE & DELETE TRUE Impala Doc Documentation shows UPDATE and DELETE commands
IMPALA has symmetric node architecture TRUE Big data slide 60 Fully symmetric design helps with fault-tolerance and
load-balancing
Drill Down brings from a detailed to an aggregated view FALSE Data Warehousing Part 1 slide 52 Drill-down goes from a coarser level of aggregation to a finer (more detailed) level
Something about types of clustering… DM3 slides 10-16 Types of clusters: well-separated clusters, center-based clusters, contiguous clusters, density-based clusters, property or conceptual clusters, clusters described by an objective function
Data Mining goal success criteria cannot be subjective FALSE Dm1 slide 50 For subjective data mining success criteria, the persons making the judgement have to be identified
Hadoop is fault tolerant TRUE Big data slide 37 Fault-tolerance is one of the three Hadoop key
characteristics
Fayyad KDD focuses on Data Mining aspects TRUE DM1/20 5-step KDD process: selection, preprocessing,
transformation, data mining, interpretation/evaluation
Fayyad KDD focuses on Deployment FALSE DM1/20 see above
Hadoop is for large streaming reads TRUE Big Data/43 HDFS Characteristics: Large streaming reads, not good
at low latency!
A DWH is a subject-oriented, integrated, time-variant, nonvolatile collection of data TRUE Intro/29 „A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process.“ Inmon (1996)
TDWI working definition of the DWH focuses on data integration and analysis and not on business decision support FALSE Intro/8 “The processes, technologies, and tools needed to turn data into information, information into knowledge, and knowledge into plans that drive profitable business action. Business intelligence encompasses data warehousing, business analytic tools and content/knowledge management”
In a mediator integration approach everybody talks directly to everybody else FALSE Intro/23 and following Mediator: Virtual warehouse – turns a user query into a sequence of source queries and assembles the results of these queries into an “aggregate” result
ETL only needed for integrated DWH FALSE
Single Linkage identifies contiguous clusters well TRUE MIN for hierarchical clustering is less susceptible to
breaking larger clusters apart as it depends on the
closest links from the clusters. However, this makes it
susceptible to outliers
Complete Linkage identifies globular clusters well TRUE MAX for hierarchical clustering is biased towards
globular clusters as it finds the furthest away point of
the two clusters
Change of a single data point can change Decision Trees dramatically TRUE
A carefully pruned Decision tree usually has lower validation set error than an un-pruned one TRUE Pruning reduces overfitting, so in theory the validation set error should generally be lower
Model for deployment is chosen by lowest training set error FALSE
Cross-validation is used for a more robust estimate for error TRUE
K-means is extremely robust against outliers in the data FALSE
Carefully pruned decision trees usually show higher precision on the training data than un-pruned decision trees FALSE Unpruned decision trees are prone to overfitting, which results in better training metrics
Lazy Learning is not recommended when there is high drift in data space, leading to changing decision boundaries FALSE That's the whole point of using lazy learning, as decision boundaries are based on the most up-to-date data
The knn classifier using Euclidean distance is computationally more expensive at the model building stage than a Decision Tree using simple error counts as splitting criterion. FALSE knn spends zero resources at the model building stage
Ordinal data allows distances to be computed between data points FALSE
Random sampling of time series data for classifier training may lead to an overestimation of model performance TRUE
1-to-N coding (one-hot encoding) reduces the dimensionality of the feature space FALSE
CRISP-DM: Business Success Criteria are ideally specified as subjective measures and Data Mining Success Criteria should be specified as objective measures FALSE Success criteria should always be objective measures.
Zero-mean unit variance normalization is highly sensitive to outliers in the data TRUE
Lazy learning is more time-efficient at classification stage FALSE Lazy learners have to 'train' during the classification
stage
According to the Data Warehousing Institute (TDWI) working definition, Business Intelligence encompasses analytic tools TRUE It encompasses warehousing, tools, and content management to turn data into information, information into knowledge, and knowledge into plans
In Hadoop, applications are typically written in high-level code such as Java TRUE
In Hadoop processing is coordinated through MapReduce. TRUE
In context of the DWH reference architecture, the Metadata Component stores operational metadata, extraction and transformation metadata, and end-user metadata. TRUE
The Staging Area in the DWH reference architecture is a database that stores a single data extract of a source database. FALSE That would be the Landing Area. The Staging Area stores matching data extracts from multiple Landing Areas
[DWH] In a typical Lambda architecture, queries can be answered by merging results from a batch and real-time views. TRUE
Data silos hold data for individual sets of applications or organizational units. TRUE
Big advantages of a snowflake schema include that the schema becomes more intuitive and browsing through the content is easy FALSE Less intuitive than the star schema
In DWH, the concept "warehouse" supports bi-directional data flows between related data sources. FALSE Uni-directional
In context of DWH analytics, predictive analytics focus on investigating past effects to capture relevant information. FALSE That's descriptive analytics. Predictive analytics focuses on actionable insights
The soft margin parameter of SVM controls the error on the training set. TRUE
In the DWH reference architecture, the Staging Area is a database that stores a single data extract of a source database. FALSE
Age of business intelligence starts at 2010. FALSE Age of Business Intelligence from 1995
The processes, technologies, and tools needed to turn data into information, information into knowledge, and knowledge into plans that drive profitable business action. TRUE TDWI working definition of Business Intelligence
Approaches for Information Integration - Mediator means everybody talks directly to everyone else. FALSE That’s federation
Approaches for Integration - Federation connects multiple (heterogeneous) data sources. FALSE That’s warehouse
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data. TRUE
You chose a model for deployment that has minimal training error. FALSE
Hive allows real-time queries and has low-latency. FALSE Hive is batch-oriented and queries have high latency
The Kappa architecture is more complicated than the Lambda architecture. FALSE It’s simplified
For data from a DWH you do not need data preparation and data exploration. FALSE Data from a DWH still needs task-specific preparation and exploration
F-score is a weighted score of Precision and Recall. FALSE Harmonic mean
ETL extraction monitoring strategies are Trigger-based, replication-based, timestamp-based, snapshot-based. TRUE And also log-based
Even a fully-grown decision tree can have impure leaf nodes. TRUE e.g. when identical feature vectors have different class labels
An advantage of Master-Slave Replication is high read-performance. TRUE Lots of slaves to read from; the read may or may
not be accurate though

Open questions
Data Warehousing
 Describe OLAP and OLTP and their characteristics

"On-line analytical processing" / "online transaction processing"

OLTP - e.g. take an order/process a claim/make a shipment etc.

- (very fast obv)


- consistent DB, up-to-date access
- avoid redundancies -> normalized schemas
- optimized for many short and small transactions (single row updates or inserts)
- precision: usually exact
- freshness of data: up-to-date data because of serializability

OLAP - analyzing the data describing the transactions, so we turn raw data into strategic info, e.g. management information systems (MIS), statistical databases, decision support systems (DSS)

- (not so fast)
- large results set
- no updates/inserts immediately, only periodically (daily, weekly)
- precision: sampling, statistical summaries etc.
- freshness of data: historization because of the reproducibility of analyses

 Describe the ETL process

ETL is the process of extracting, transforming and loading the data while storing intermediate results in order to preserve independence from other systems, especially the sources.
Extraction -> extract data from the source systems into the Landing Area
Transformation -> perform individual tasks like cleaning, standardization, and combining data from different sources. The transformation stage has several layers: filtering, harmonization, aggregation and enrichment.
Loading -> two distinct groups of tasks (a minimal sketch of the whole flow follows after this list)

 initial load - moves large volumes of data into data warehouse storage
 then feeding incremental data revisions on an ongoing basis (periodic refresh, e.g. monthly or yearly)
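A minimal sketch of the three steps in Python/pandas, assuming two hypothetical CSV sources, a SQLite target and an illustrative currency conversion (all file, table and column names are made up):

import pandas as pd
import sqlite3

# Extract: pull raw data from the (hypothetical) source systems into the landing area
orders_eu = pd.read_csv("orders_eu.csv")   # assumed columns: order_id, amount_eur, country
orders_us = pd.read_csv("orders_us.csv")   # assumed columns: order_id, amount_usd, country

# Transform: clean, standardize (harmonize units) and combine the sources
orders_us["amount_eur"] = orders_us["amount_usd"] * 0.9        # illustrative fixed exchange rate
orders = pd.concat([orders_eu[["order_id", "amount_eur", "country"]],
                    orders_us[["order_id", "amount_eur", "country"]]])
orders = orders.dropna().drop_duplicates(subset="order_id")    # simple filtering step

# Load: write the integrated extract into the warehouse table (initial or incremental load)
with sqlite3.connect("dwh.db") as con:
    orders.to_sql("fact_orders", con, if_exists="append", index=False)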

 Heterogeneity challenges integration


- Technical heterogeneity – different systems
- Structural and syntactic heterogeneity – encodings (like ISO for countries/currencies), different units, synonyms, homonyms
- Semantic heterogeneity - interpretation
 DW vs Lake

▪ A data lake stores unstructured and structured data at any scale; the DWH usually contains clean, historical, transformed data that has been integrated from many sources
▪ Lakes can handle large amounts of data, they are more flexible
▪ Single store of enterprise data with a flat architecture
▪ Schema and data requirements are not defined until the data is queried
▪ ELT rather than ETL
▪ Scalability and flexibility
▪ Allows new kinds of analyses and insights
▪ Promises low long-term cost of ownership
▪ Data in lakes are accessible via SQL-on-Hadoop or Spark
▪ BUT it can turn into a data swamp: no structure, less quality and value, and open questions about governance, trust and privacy

 Inmon Kimball pros/cons, differences

Inmon – Top-down approach – centralized DW first, then Data Marts for specific requirements of user groups, goes from generic -> specific

Pros: really covers the data present in the organisation, good architecture, single central storage, which means centralized rules and control, and quick results if implemented in iterations.

Cons: takes a long time to build, risk of failure, needs people from different domains with cross-functional skills, high outlay without a proof of concept

Kimball – Bottom-up approach – one Data Mart per business process first, and then DWH, enterprise-wide cohesion through data bus, goes from
specific -> generic
Pros: faster and easier implementation, lower risk of failure, incremental, allows the project team to learn and grow, can be used as a proof of concept

Cons: each data mart has its own narrow view of the data, no centralized view at first and the data is redundant in every data mart. The data can be
inconsistent.

 Name three requirements of operation systems using OLTP and describe each of them briefly

– Optimize for many short and “small” transactions: point queries, single-row updates and/or inserts

– Access to up-to-date, consistent DB

– Avoid (uncontrolled) redundancies → use normalized schema

 Describe the 3 Steps of the MapReduce process flow

MapReduce is a model of distributed programming for parallel data processing (really painful if you’ve taken the DIC course). It consists of 3 steps (see the word-count sketch after this list):

1. Map – data is divided into smaller chunks and processed one element at a time by a user-defined function, which emits key/value pairs

2. Shuffle&Sort – shuffle the data, sort by key and group all values with the same key

3. Reduce – process the list of values of each group with another user-defined function and emit the result
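The classic word-count example, sketched here as a plain-Python simulation of the three phases (no Hadoop involved, just to make explicit what each phase emits):

from collections import defaultdict

lines = ["big data is big", "map reduce is a model"]

# Map: process one element (line) at a time with a user-defined function, emit (key, value) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & Sort: sort the pairs and group all values belonging to the same key
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: apply another user-defined function to each group's value list, emit the result
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'a': 1, 'big': 2, 'data': 1, 'is': 2, 'map': 1, 'model': 1, 'reduce': 1}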

 FASMI

Acronym summarizes the five desired properties of an OLAP system:


1. Fast: regular queries within 5 sec, complex ones within 20 sec
2. Analysis: intuitive analysis and arbitrary calculations
3. Shared: effective access control in a multi-user environment
4. Multidimensional: on a conceptual level, data is presented as a multidimensional view (irrespective of the actual implementation of the underlying data
structure)
5. Information: Scalability and acceptable response time for large data

 Explain main differences between Star schema and Snowflake schema.

In the star schema the fact table is connected to the dimension tables by foreign keys. The dimension tables are not normalized. This design allows for more efficient queries but increases storage requirements, and it is easily understood. In the snowflake schema the dimension tables are normalized, meaning they are broken down into multiple related tables. This representation is less intuitive and queries become more complex due to the additional joins (as a result query performance is degraded), but it reduces data duplication, so it saves storage space, and normalized structures are easier to update and maintain.
 Explain the three types of analytics.

Businesses need analytics to support their decisions; they can't rely on intuition alone.

We have three different type of analytics:


1) descriptive -> display relevant data to the decisions aka *What happened*
2) predictive -> discover actionable insights and present them -> *What will happen*
3) prescriptive -> computes best decisions and recommends them to the decision maker

 Describe three operations in MOLAP

Slicing - filter conditions in one or more dimensions (WHERE)- e.g., sales by product group and region for a given year

Dicing - query of a (sub-)cube by picking specific dimensions and dimension values, adding / removing / exchanging dimensions

Drill-down - navigate among the levels of data, going from a coarser level of aggregation to a finer (more detailed) level (add column to GROUP BY,
ROLLUP operation in SQL)

Roll-up - navigate among the levels of data, going from no aggregation or a finer level of aggregation to a coarser level (remove a column from GROUP BY; see the pandas sketch after this list)

Ranking - top performers, top risks etc.
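A small pandas sketch of slicing, roll-up and drill-down on a made-up sales cube (all column names and values are illustrative):

import pandas as pd

sales = pd.DataFrame({
    "year":    [2017, 2017, 2018, 2018],
    "region":  ["EU", "US", "EU", "US"],
    "revenue": [100, 150, 120, 130],
})

# Slicing: filter condition on one dimension (WHERE year = 2018)
slice_2018 = sales[sales["year"] == 2018]

# Roll-up: go to a coarser aggregation level (total revenue per year only)
rollup = sales.groupby("year")["revenue"].sum()

# Drill-down: go to a finer level (add the region dimension to the grouping)
drilldown = sales.groupby(["year", "region"])["revenue"].sum()
print(slice_2018, rollup, drilldown, sep="\n\n")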

 Agile business intelligence

Developing BI in an agile way, i.e. in short cycles (iterations) between the User, the System and the Builder, with overlapping phases and adaptive planning. Based on user stories rather than technical specifications and JEDUF (Just Enough Design Up Front). The objective is to deliver value much faster and to be flexible to change. An advantage is a lower risk of failure. Some problems with this: some architecture is still necessary, acceptance of the solution, management across different projects, DevOps can't keep up with the pace of changes, and it is not necessarily cheaper.

 Definition of DWH and three examples

A DWH is a copy of transaction data specifically structured for query and analysis – according to Kimball. It deals with dispersed data silos since it is centralized; it collects, historizes and integrates data from many sources, provides data to data marts, and reorganizes the data in a subject-oriented manner. It is usually read-only, but can also be virtualized (a virtual warehouse / mediator answers queries directly against the sources instead of keeping a physical copy). Its goal is to support management decision-making.

Examples:
- Single integrated DWH – feeding all data marts, master data can be viewed as another source DB, typically one landing area per
source DB
- Multiple data warehouses for different functions – highly problematic but not uncommon
- Single stand-alone data mart – ok if no need for integration
- Several independent stand-alone data marts – “quick and dirty”
- A conformed constellation of data marts – also called bus architecture, a nice solution; conforming means that the data marts share common dimensions, there is no explicit DWH, and the mapping to dimensions happens through Master Data
 Explain MapReduce – see the three steps of the MapReduce process flow above.
 Describe metadata component in a DWH

Metadata is data about data in the DWH. It is necessary for using the DWH (e.g. which data is available), for building it (source systems, mappings, transformation rules) and for administering it (complexity and size of the DWH). The three types are:

 Operational metadata
 Extraction and transformation metadata
 End-user metadata

 Solutions for information integration heterogeneity

1. Federation: Everybody talks directly to everyone else (Point-to-Point Integration). Issue: n applications / data stores → up to n^2 connections

2. Warehouse: Sources are translated from their local schema to a global schema and copied to a central DB. Issue: usually only unidirectional data flows
supported

3. Mediator: Virtual warehouse – turns a user query into a sequence of source queries and assembles the results of these queries into an “aggregate” result.
Issue: complex architecture, potentially slow, difficult to maintain

Data Mining

 What is a lazy learner? Example? When to use?

Lazy learners don't require a training phase at all; generalization happens at query time. An example of a lazy learner is k-Nearest Neighbours. It is most useful for online learning, where the data is continuously updated with new entries, e.g. recommender systems.

 K-means algorithm, discuss robustness w.r.t. initialization. Explain the problem with initial centroid selection.

K-means is a partitional clustering approach: each cluster is associated with a centroid and each point is assigned to the cluster with the closest centroid. The number of clusters (k) must be specified beforehand. We select k initial centroids, then assign points to the closest centroid to form clusters, and after each assignment we recompute the centroid of each cluster. The initial centroids are often chosen randomly, but k-means is sensitive to this choice and the produced clusters vary between runs. If there are K 'real' clusters, the chance of selecting one initial centroid from each cluster is small. Sometimes the initial centroids re-adjust themselves in the 'right' way, and sometimes they don't.

Solutions -> ensembles: multiple k-means runs performed with different initial centroids, final clusters based on a majority vote. Or use hierarchical clustering first to determine the initial centroids. Or select more than k initial centroids and then pick among these the ones that are most widely separated. Other options are postprocessing or k-means++; the final clustering can then be chosen as the one with the lowest sum of squared distances between the data points and their corresponding centroid.

K-means also has problems when clusters differ in sizes or densities, when shapes are non-globular (i.e. not ball-like), and when the data contains outliers.
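A minimal NumPy sketch of the sensitivity to initialization and of the simple remedy mentioned above: run k-means several times with different random initial centroids and keep the run with the lowest sum of squared distances (the data and parameters are made up):

import numpy as np

def kmeans(X, k, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assign every point to the cluster with the closest centroid
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # recompute each centroid as the mean of its points (keep the old one if a cluster is empty)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = ((X - centroids[labels]) ** 2).sum()                  # sum of squared distances
    return labels, centroids, sse

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

runs = [kmeans(X, k=3, seed=s) for s in range(10)]        # several different initializations
labels, centroids, sse = min(runs, key=lambda r: r[2])    # keep the run with the lowest SSE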

 1-n encoding and Binning (Explain, when to use, example) Describe (a) when it is applied, (b) how it is applied, (c) give an example

1-n encoding = one hot encoding – coding the categorical variables as numerical variables by creating binary columns for each unique value, with value
1 indicating the presence of the category and 0 indicating its absence. It's applied for nominal and ordinal values (but for ordinal values we lose the ordering).
Example of nominal: eyes [blue, green, brown] -> eyes [{1, 0, 0}, {0, 1, 0}, {0, 0, 1}]

Binning (bucketing) – transforming a continuous output into a categorical one in order to use a classification method instead. It is basically a sub-division into discrete bins (very low, low, average, high, very high values of e.g. prices or income). Before applying binning, define a suitable method of creating the bins (evenly spaced, natural boundaries such as age groups, cluster-based) and the granularity of the bins (how many groups the values are divided into).
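Both techniques in a short pandas sketch (the column names, bin boundaries and labels are made up for illustration):

import pandas as pd

df = pd.DataFrame({"eyes": ["blue", "green", "brown", "blue"],
                   "income": [12_000, 35_000, 58_000, 140_000]})

# 1-to-N / one-hot encoding: one binary column per distinct category value
one_hot = pd.get_dummies(df["eyes"], prefix="eyes")

# Binning: evenly spaced bins vs. hand-picked "natural" boundaries
df["income_equal_width"] = pd.cut(df["income"], bins=4)                      # evenly spaced
df["income_natural"] = pd.cut(df["income"],
                              bins=[0, 20_000, 60_000, float("inf")],
                              labels=["low", "average", "high"])             # domain boundaries
print(pd.concat([df, one_hot], axis=1))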

 KNN

K Nearest Neighbours algo – classify inputs based on the k closest training examples; k is chosen by the user and has a high influence on the result (it should be odd because of majority voting, a good value depends on the data, larger k reduces the effect of noise but makes boundaries less distinct). Distance is defined by some metric (e.g. Euclidean distance, Manhattan distance). To classify a data sample, identify its k nearest neighbours in the training set given the distance metric and determine the majority class. Simple, a lazy learner (no training/model building, not limited to linear separation), normalization is important. It becomes computationally expensive with many items to classify, degrades with high dimensionality, and requires large amounts of memory.
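A short scikit-learn sketch on toy data (the odd k=5 and the Euclidean metric are just example choices); note the scaling step, since kNN is distance-based:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy labels

scaler = StandardScaler().fit(X)               # normalization matters for distance-based methods
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")   # k should be odd
knn.fit(scaler.transform(X), y)                # "training" only stores the data (lazy learner)

X_new = rng.normal(size=(3, 2))
print(knn.predict(scaler.transform(X_new)))    # majority vote among the 5 nearest neighbours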

 Describe and differences: a)define business criteria b) define business goals c) data mining success criteria d) data mining goals

Business success criteria – specific guidelines or standards that a company uses to measure its performance. These criteria may include financial metrics
such as revenue, profit, and return on investment, as well as non-financial metrics such as customer satisfaction, employee engagement, and market
share. Business criteria are used to evaluate the effectiveness of a company's strategies and operations, and to identify areas that need improvement.

Business goals - specific objectives that a company or organization aims to achieve. These goals may be short-term or long-term, and may be aligned
with the company's overall mission and vision. Business goals are used to guide decision-making and resource allocation, and to measure progress
towards achieving desired outcomes.

Data mining success criteria - specific guidelines or standards that are used to evaluate the effectiveness of a data mining project. These criteria may
include the accuracy and completeness of the data, the quality and interpretability of the results, the timeliness of the deliverables, and the overall value of
the insights generated. Data mining success criteria are used to ensure that the project is meeting its objectives and delivering value to the organization.
Data mining goals - specific objectives that a data mining project aims to achieve. Translation of business questions to data mining goals. These goals
may include identifying patterns and relationships in the data, developing predictive models, and providing insights that can inform business decisions.
Data mining goals are used to guide the design and execution of the project, and to measure progress towards achieving desired outcomes.

 Name 4 types of attributes used in data mining, make an example, describe the characteristics and the allowed mathematical operations

- Nominal aka categorical – distinct labels from a defined vocabulary, like class labels but also for attributes (e.g. music genres – jazz,
rock, techno; eye color – blue, green, brown). They can be numeric – zip codes. Math: only equality
- Ordinal – distinct labels from a defined vocabulary, numeric or strings. Impose an order on discrete categories, but no distance is defined. (e.g. temperature – cold, cool, mild, hot, very hot; grades – 1, 2, 3, 4, 5). Math: ordering, no additions, no subtractions
- Interval – ordered elements with fixed distance in-between, discrete or continuous values. (e.g. time – years like 2011, 2018; levels of
pain). Math: ordering, distance (subtractions), no additions (can't add year 2011 and 2018 or levels of pain)
- Ratio – continuous values, zero-point defined, usually represented as real numbers (e.g. TFIDF, images, measurements, audio as features extracted from a spectrogram). Can't be used as class labels. Math: all operations allowed. (See the small pandas illustration after this list.)
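A small pandas illustration with toy values: an unordered categorical only supports equality, an ordered categorical additionally supports ordering comparisons (but still no arithmetic), and a plain numeric column allows differences:

import pandas as pd

eyes = pd.Series(["blue", "brown", "green"], dtype="category")        # nominal attribute
print(eyes == "blue")        # equality is the only meaningful operation
# eyes > "blue" would raise a TypeError: unordered categoricals only compare equality

temp = pd.Series(["cold", "hot", "mild"],
                 dtype=pd.CategoricalDtype(["cold", "cool", "mild", "hot"], ordered=True))
print(temp > "cool")         # ordinal attribute: ordering is defined, but no distances

years = pd.Series([2011, 2018])                                       # interval attribute
print(years.diff())          # the 7-year difference is meaningful, 2011 + 2018 is not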

 Describe Single Linkage and Complete Linkage, how the algorithms work and what are the characteristics

Single Linkage – hierarchical clustering method that starts by treating each data point as its own cluster; at each step it merges the two closest clusters, where the distance between two clusters is the minimum distance between any two points in the two clusters. So the similarity of two clusters is based on the two most similar (closest) points in the different clusters. It stops when all data points are in a single cluster. The result is a dendrogram that can be cut at any level. Limitations -> sensitive to noise and outliers. Strengths -> can handle non-elliptical (non-globular) shapes.

Complete Linkage – uses maximum distance between any two points instead of minimum as in Single Linkage. So similarity of two clusters is based on
the two least similar (most distant) points in the different clusters. Limitations -> tends to break large clusters, biased towards globular clusters. Strengths
-> less susceptible to noise and outliers.
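Both criteria in a short SciPy sketch on toy data ('single' = MIN linkage, 'complete' = MAX linkage):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in ([0, 0], [3, 0], [1.5, 3])])

# Single linkage: cluster distance = minimum pairwise distance (handles contiguous, non-globular shapes)
Z_single = linkage(X, method="single")
# Complete linkage: cluster distance = maximum pairwise distance (biased towards globular clusters)
Z_complete = linkage(X, method="complete")

# Cut each dendrogram so that 3 flat clusters remain
labels_single = fcluster(Z_single, t=3, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=3, criterion="maxclust")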

 Define precision, recall, micro/macro, explain differences

Precision – how happy we are with what we've got: P = TP/(TP+FP)

Recall – how much more we could have had: R = TP/(TP+FN)

F-score – harmonic mean of precision and recall: F1 = 2*P*R/(P + R). The general form is F_beta = (1 + beta^2)*P*R/(beta^2*P + R); beta is usually set to 1 (the F1 score). With beta < 1 more weight is given to precision, with beta > 1 more weight is given to recall.

Micro-averaging pools TP/FP/FN over all classes and then computes the metric once (so frequent classes dominate), while macro-averaging computes the metric per class and averages the per-class values (so all classes are weighted equally).
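A quick check of these definitions with scikit-learn on a made-up multi-class prediction:

from sklearn.metrics import precision_recall_fscore_support

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "a"]

# Macro: compute precision/recall/F1 per class, then average them (all classes count equally)
print(precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0))
# Micro: pool TP/FP/FN over all classes first, then compute the metrics once
print(precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0))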

 Two different scaling approaches. Characteristics and benefits

Z-score aka standard scaling aka zero-mean unit-variance – used to measure distances independently of the measurement unit. How -> subtract the mean, divide by the standard deviation: z_ij = (x_ij – mean(x_j)) / std(x_j). It preserves the shape of the (e.g. Gaussian) distribution of the data and works best for data that roughly follows a Gaussian distribution.

MinMax – scale variables to the same fixed range, usually between 0 and 1, but any range can be chosen. How -> subtract the minimum of each variable, divide by the value range, and multiply by the new range (if different from [0, 1]): z_ij = (x_ij – min(x_j)) / (max(x_j) – min(x_j)). It preserves the shape of the data distribution, but it is sensitive to outliers, since a single extreme value determines the range.
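Both approaches as a per-column NumPy sketch on a made-up matrix (height in cm, weight in kg):

import numpy as np

X = np.array([[170.0, 60.0],
              [180.0, 75.0],
              [165.0, 55.0],
              [190.0, 90.0]])

# Zero-mean unit-variance (z-score): subtract the column mean, divide by the column std
z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max: subtract the column minimum, divide by the column range -> values in [0, 1]
mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(z.round(2), mm.round(2), sep="\n\n")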

 Describe training/validation/test data

The holdout method divides the data instances into three groups with different purposes. With the train set, which is the biggest subset, we train the model; with the validation set we measure performance in order to select the model / tune hyperparameters (not to estimate the generalization error); with the test set we estimate the generalization error. The test set is the final unseen subset and should only be used at the very end.

Example split: 70% training, 15% validation, 15% test.
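The 70/15/15 split can be produced with two calls to scikit-learn's train_test_split (a sketch; random shuffling like this is fine for i.i.d. data, but not for time series, as noted in the true/false part above):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100) % 2

# First split off 30% as a hold-out, then split that half-and-half into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 70 15 15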
