BI - MC V1.4
Open questions
Data Warehousing
Describe OLAP and OLTP and their characteristics
OLAP (Online Analytical Processing) - analyzing the data that describes the transactions, i.e. turning raw data into strategic information, e.g. management information systems (MIS), statistical databases, decision support systems (DSS)
- response times are less critical than in OLTP (queries may run longer)
- large result sets
- no immediate updates/inserts, data is loaded only periodically (daily, weekly)
- precision: sampling, statistical summaries etc. are often sufficient
- freshness of data: historization, because analyses must be reproducible
ETL is the process of extracting, transforming and loading data while storing intermediate results in order to stay independent of other systems, especially the sources.
Extraction -> extract data from the various data sources into the Landing Area
Transformation -> perform individual tasks such as cleaning, standardization, and combining data from different sources. The transformation stage has several layers: filtering, harmonization, aggregation and enrichment.
Loading -> two distinct groups of tasks:
initial load - moves large volumes of data into data warehouse storage
incremental load - feeds data revisions on an ongoing, periodic basis (e.g. monthly or yearly refresh)
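A minimal sketch of the three stages with pandas (the file names, columns and target path are just assumed for illustration, not from the lecture):

```python
import pandas as pd

# Extract: copy raw data from the source systems into a landing area.
orders = pd.read_csv("landing/orders_source_a.csv")
customers = pd.read_csv("landing/customers_source_b.csv")

# Transform: clean, standardize and combine data from different sources.
orders["order_date"] = pd.to_datetime(orders["order_date"])          # standardization
orders = orders.dropna(subset=["customer_id"])                       # filtering / cleaning
enriched = orders.merge(customers, on="customer_id", how="left")     # harmonization / enrichment
daily = enriched.groupby(["order_date", "country"], as_index=False)["amount"].sum()  # aggregation

# Load: the initial load writes the full history; later runs append periodic increments.
daily.to_csv("dwh/fact_daily_sales.csv", index=False)
```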
▪ A data lake stores unstructured and structured data at any scale; a DWH usually contains clean, historized, transformed data that has been integrated from many sources
▪ Lakes can handle large amounts of data, they are more flexible
▪ Single store of enterprise data with a flat architecture
▪ Schema and data requirements are not defined until the data is queried
▪ ELT rather than ETL
▪ Scalability and flexibility
▪ Allows new kinds of analyses and insights
▪ Promises low long-term cost of ownership
▪ Data in lakes are accessible via SQL-on-Hadoop or Spark
▪ BUT can degenerate into a data swamp: no structure, less quality and value, and open questions around governance, trust and privacy
Inmon – Top-down approach – centralized DW first, then Data Marts for specific requirements of user groups, goes from generic -> specific
Pros: reflects all of the data present in the organisation, sound architecture, single central storage, which means centralized rules and control, and quick results if implemented in iterations.
Cons: takes a long time to build, risk of failure, needs people from different domains with cross-functional skills, high outlay without a proof of concept
Kimball – Bottom-up approach – one Data Mart per business process first, and then DWH, enterprise-wide cohesion through data bus, goes from
specific -> generic
Pros: faster and easier implementation, lower risk of failure, incremental, allows the project team to learn and grow, can be used as a proof of concept
Cons: each data mart has its own narrow view of the data, there is no centralized view at first, and data is redundant across data marts. The data can become inconsistent.
Name three requirements of operational systems using OLTP and describe each of them briefly
– Optimized for many short and “small” transactions: point queries, single-row updates and/or inserts
MapReduce is a distributed programming model for parallel data processing (covered in depth in the DIC course). It consists of 3 steps:
1. Map – data is divided into smaller chunks and processed one element at a time by a user-defined function, which emits key-value pairs
2. Shuffle & Sort – shuffle the data and sort by key, grouping all values with the same key
3. Reduce – process the list of values of each group with another user-defined function and emit the result
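A word-count sketch of the three steps in plain Python (a single-process illustration of the model, not an actual distributed implementation):

```python
from collections import defaultdict

documents = ["big data is big", "map reduce processes big data"]

# 1. Map: process one element at a time, emit (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# 2. Shuffle & Sort: group all emitted values by key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# 3. Reduce: apply a user-defined function to each group, emit the result.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)   # e.g. {'big': 3, 'data': 2, ...}
```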
FASMI – Fast Analysis of Shared Multidimensional Information (the set of characteristics an OLAP system should fulfil)
In a star schema the fact table is connected to the dimension tables by foreign keys. The dimension tables are not normalized. This design allows for more efficient queries but increases storage requirements, and it is easy to understand. In the snowflake schema the dimension tables are normalized, meaning they are broken down into multiple related tables. This representation is less intuitive and queries become more complex due to the additional joins (degrading query performance), but it reduces data duplication, so it saves storage space, and normalized structures are easier to update and maintain.
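A tiny star-schema illustration with Python's built-in sqlite3 (table and column names are invented for the example): one fact table joined to a denormalized dimension. A snowflake schema would need an extra join, e.g. to a separate product_group table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                          product_name TEXT, product_group TEXT);
CREATE TABLE fact_sales  (product_id INTEGER REFERENCES dim_product(product_id),
                          region TEXT, year INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Laptop', 'IT'), (2, 'Desk', 'Furniture');
INSERT INTO fact_sales  VALUES (1, 'EU', 2023, 1200.0), (2, 'EU', 2023, 300.0),
                               (1, 'US', 2023, 900.0);
""")

# In a star schema one join per dimension is enough.
rows = con.execute("""
    SELECT d.product_group, f.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.product_group, f.region
""").fetchall()
print(rows)
```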
Explain the three types of analytics.
Descriptive analytics (what happened), predictive analytics (what is likely to happen) and prescriptive analytics (what should be done about it).
Slicing - filter conditions in one or more dimensions (WHERE), e.g., sales by product group and region for a given year
Dicing - query of a (sub-)cube by picking specific dimensions and dimension values, adding / removing / exchanging dimensions
Drill-down - navigate among the levels of data, going from a coarser level of aggregation to a finer (more detailed) level (add a column to GROUP BY; SQL's ROLLUP operation produces the aggregates for several levels at once)
Roll-up - navigate among the levels of data, going from no aggregation or a finer level of aggregation to a coarser level (remove a column from GROUP BY)
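A pandas analogy for these operations (the column names are assumed; in SQL they correspond to changing the WHERE clause and the GROUP BY columns):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["A", "A", "B", "B"],
    "amount":  [100, 150, 120, 180],
})

slice_ = sales[sales["year"] == 2023]                                 # slicing: filter a dimension (WHERE)
dice   = sales[(sales["year"] == 2023) & (sales["region"] == "EU")]   # dicing: pick a sub-cube
rollup = sales.groupby("year")["amount"].sum()                        # roll-up: coarser aggregation level
drill  = sales.groupby(["year", "region"])["amount"].sum()            # drill-down: add a dimension/level
```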
Developing BI in an agile way means working in short cycles (iterations) between the User, the System and the Builder, with overlapping phases and adaptive planning. It is based on user stories rather than technical specifications, and on JEDUF (Just Enough Design Up Front). The objective is to deliver value much faster and stay flexible to change; an advantage is a lower risk of failure. Some problems with this: some architecture is still necessary, acceptance of the solution, management across different projects, DevOps can't keep up with the pace of changes, and it is not necessarily cheaper.
A DWH is a copy of transaction data specifically structured for query and analysis (Kimball's definition). It deals with dispersed data silos since it is centralized; it collects, historizes and integrates data from many sources, provides data to the data marts, and reorganizes the data in a subject-oriented manner. It is usually read-only, but it can also be virtualized (see the mediator architecture below: user queries are translated into source queries instead of physically copying the data). Its goal is to support management decision-making.
Examples:
- Single integrated DWH – feeding all data marts, master data can be viewed as another source DB, typically one landing area per
source DB
- Multiple data warehouses for different functions – highly problematic but not uncommon
- Single stand-alone data mart – ok if no need for integration
- Several independent stand-alone data marts – “quick and dirty”
- A conformed constellation of data marts – also called bus architecture, a good solution; conformed means that the marts share common dimensions, there is no explicit DWH, and mapping to the dimensions happens through Master Data
Explain MapReduce
Describe the metadata component in a DWH
Metadata is data about the data in the DWH, necessary for using the DWH. It describes e.g. the data available in the DWH, the source systems, mappings and transformation rules, and it helps administer the complexity and size of the DWH. The three types are:
Operational metadata
Extraction and transformation metadata
End-user metadata
1. Federation: Everybody talks directly to everyone else (Point-to-Point Integration). Issue: n applications / data stores → up to n^2 connections
2. Warehouse: Sources are translated from their local schema to a global schema and copied to a central DB. Issue: usually only unidirectional data flows
supported
3. Mediator: Virtual warehouse – turns a user query into a sequence of source queries and assembles the results of these queries into an “aggregate” result.
Issue: complex architecture, potentially slow, difficult to maintain
Data Mining
Lazy learners do not build a model during a training phase; generalization happens only when a query arrives. An example of a lazy learner is k Nearest Neighbours. It is most useful for online learning, where the data is continuously updated with new entries, e.g. recommender systems.
K-means algorithm, discuss robustness w.r.t. initialization. Explain the problem with initial centroid selection.
K-means is a partitional clustering approach: each cluster is associated with a centroid and each point is assigned to the cluster with the closest centroid. The number of clusters (k) must be specified beforehand. We select k initial centroids and then assign each point to its closest centroid to form clusters; after each assignment step we recompute the centroid of each cluster and repeat until the assignments stop changing. The initial centroids are often chosen randomly, but k-means is sensitive to this choice and the produced clusters vary between runs. If there are K ‘real’ clusters, the chance of randomly selecting one centroid from each of them is small. Sometimes the initial centroids re-adjust themselves the ‘right’ way, and sometimes they don’t.
Solutions -> ensembles: multiple k-means runs with different initial centroids, final clusters based on majority vote. Or use hierarchical clustering first to determine the initial centroids. Or select more than k initial centroids and then keep those that are most widely separated. Other options are postprocessing or k-means++ seeding, where the final clustering can be chosen based on the lowest sum of squared distances between the data points and their corresponding centroids.
K-means also has problems when clusters differ in size or density, when cluster shapes are non-globular (i.e. not ball-shaped), and when the data contains outliers.
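A sketch of the usual mitigation: run k-means several times and keep the run with the lowest sum of squared distances; scikit-learn does this via n_init and also supports k-means++ seeding (the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic clusters in 2D.
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

# 10 runs with different k-means++ initializations; the best run is kept.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)          # sum of squared distances of points to their centroid (SSE)
print(km.cluster_centers_)  # centroids of the best run
```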
1-n encoding and Binning – describe (a) when it is applied, (b) how it is applied, (c) give an example
1-n encoding = one-hot encoding – encoding categorical variables as numerical variables by creating a binary column for each unique value, with 1 indicating the presence of the category and 0 indicating its absence. It is applied to nominal and ordinal values (but for ordinal values the ordering is lost).
Example (nominal): eyes [blue, green, brown] -> eyes [{1, 0, 0}, {0, 1, 0}, {0, 0, 1}]
Binning (bucketing) – transforming a continuous variable into a categorical one, e.g. so that a classification method can be used instead of regression. It is a sub-division into discrete bins (very low, low, average, high, very high values of e.g. prices or income). Before applying binning, define a suitable method of creating the bins (evenly spaced, natural boundaries such as age groups, cluster-based) and the granularity of the bins (how many groups the values are divided into).
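A small pandas sketch of both (the eye-colour and income values are made up):

```python
import pandas as pd

df = pd.DataFrame({"eyes": ["blue", "green", "brown"],
                   "income": [18_000, 42_000, 95_000]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["eyes"], prefix="eyes")

# Binning: three evenly spaced bins over the income range.
df["income_bin"] = pd.cut(df["income"], bins=3, labels=["low", "average", "high"])
# pd.qcut(df["income"], q=3) would instead create bins with (roughly) equal counts.
```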
KNN
K Nearest Neighbours algorithm – classifies inputs based on the k closest training examples; k is chosen by the user and has a high influence on the result (it should be odd because of majority voting, a good value depends on the data, a larger k reduces the effect of noise but makes class boundaries less distinct). Distance is defined by some metric (e.g. Euclidean distance, Manhattan distance). To classify a data sample, identify its k nearest neighbours in the training set under the chosen distance metric and determine the majority class among them. Simple, a lazy learner (no training/model building, not limited to linear separation); normalization is important. It becomes computationally expensive with many items to classify, degrades with high dimensionality, and requires large amounts of memory.
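A minimal k-NN classifier in NumPy on toy data (a library version would be sklearn.neighbors.KNeighborsClassifier):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point to every training sample.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["a", "a", "b", "b"])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))   # -> "b"
```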
Describe and differentiate: (a) business success criteria, (b) business goals, (c) data mining success criteria, (d) data mining goals
Business success criteria – specific guidelines or standards that a company uses to measure its performance. These criteria may include financial metrics
such as revenue, profit, and return on investment, as well as non-financial metrics such as customer satisfaction, employee engagement, and market
share. Business criteria are used to evaluate the effectiveness of a company's strategies and operations, and to identify areas that need improvement.
Business goals - specific objectives that a company or organization aims to achieve. These goals may be short-term or long-term, and may be aligned
with the company's overall mission and vision. Business goals are used to guide decision-making and resource allocation, and to measure progress
towards achieving desired outcomes.
Data mining success criteria - specific guidelines or standards that are used to evaluate the effectiveness of a data mining project. These criteria may
include the accuracy and completeness of the data, the quality and interpretability of the results, the timeliness of the deliverables, and the overall value of
the insights generated. Data mining success criteria are used to ensure that the project is meeting its objectives and delivering value to the organization.
Data mining goals - specific objectives that a data mining project aims to achieve. Translation of business questions to data mining goals. These goals
may include identifying patterns and relationships in the data, developing predictive models, and providing insights that can inform business decisions.
Data mining goals are used to guide the design and execution of the project, and to measure progress towards achieving desired outcomes.
Name 4 types of attributes used in data mining, give an example, describe the characteristics and the allowed mathematical operations
- Nominal aka categorical – distinct labels from a defined vocabulary, used as class labels but also as attributes (e.g. music genres – jazz, rock, techno; eye colour – blue, green, brown). They can also be numeric – e.g. zip codes. Math: only equality checks.
- Ordinal – distinct labels from a defined vocabulary, numeric or strings. They impose an order on discrete categories, but no distance is defined (e.g. temperature – cold, cool, mild, hot, very hot; grades – 1, 2, 3, 4, 5). Math: ordering, no additions, no subtractions.
- Interval – ordered elements with a fixed distance in between, discrete or continuous values (e.g. time – years like 2011, 2018; levels of pain). Math: ordering and distance (subtraction), but no addition (you can't add the years 2011 and 2018, or levels of pain).
- Ratio – continuous values with a defined zero point, usually represented as real numbers (e.g. TF-IDF, images, measurements, audio features extracted from a spectrogram). Not usable as class labels. Math: all operations allowed.
Describe Single Linkage and Complete Linkage, how the algorithms work and what are the characteristics
Single Linkage – hierarchical clustering method that starts by treating each data point as its own cluster and at each step it merges two closest clusters
together based on the minimum distance between any two points in the two clusters. So similarity of two clusters is based on the two most similar
(closest) points in the different clusters. It stops when all the data points are in a single cluster. Results in dendrogram that can be cut at any point.
Limitations -> sensitive to noise and outliers. Strengths -> can handle non-elliptical shapes (aka non-globular).
Complete Linkage – uses maximum distance between any two points instead of minimum as in Single Linkage. So similarity of two clusters is based on
the two least similar (most distant) points in the different clusters. Limitations -> tends to break large clusters, biased towards globular clusters. Strengths
-> less susceptible to noise and outliers.
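A SciPy sketch of both linkage criteria on toy data (purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of points in 2D.
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

Z_single   = linkage(X, method="single")    # merge on the closest pair of points
Z_complete = linkage(X, method="complete")  # merge on the most distant pair of points

# "Cut" each dendrogram so that two clusters remain.
labels_single   = fcluster(Z_single, t=2, criterion="maxclust")
labels_complete = fcluster(Z_complete, t=2, criterion="maxclust")
```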
F-score – harmonic mean of precision and recall: F1 = 2PR / (P + R). Beta is usually set to 1, giving the F1 score. When beta < 1 -> more weight on precision, beta > 1 -> more weight on recall. The general version: F_beta = (1 + beta^2) · PR / (beta^2 · P + R)
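The formula as a plain function, just to check the numbers:

```python
def f_beta(precision, recall, beta=1.0):
    # General F-score: (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(0.8, 0.6))            # F1 = 2PR / (P + R) ≈ 0.686
print(f_beta(0.8, 0.6, beta=2.0))  # beta > 1 weights recall more, ≈ 0.632
```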
Z-score aka standard scaling aka zero-mean unit-variance – used to measure distances independently of the measurement unit. How -> subtract the mean, divide by the standard deviation: z_ij = (x_ij – mean(x_j)) / std(x_j). It preserves the shape of a Gaussian distribution and generally works best for data following such a distribution.
MinMax – scales variables to the same fixed range, usually between 0 and 1, but any range can be chosen. How -> subtract the minimum of each variable, divide by the value range, multiply by the new range (if different from [0, 1]): z_ij = (x_ij – min(x_j)) / (max(x_j) – min(x_j)). It preserves the shape of the data distribution, but it is sensitive to outliers, since a single extreme value stretches the whole range.
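Both scalings in NumPy, applied column-wise to toy data:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

z_scaled  = (X - X.mean(axis=0)) / X.std(axis=0)                   # zero mean, unit variance per column
mm_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # each column scaled to [0, 1]
```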
The holdout method divides the data instances into three groups with different purposes. With the training set, the biggest subset, we train the model; with the validation set we measure performance in order to select the model / tune hyperparameters (not to estimate the generalization error); with the test set we estimate the generalization error. The test set is the final, unseen subset and should only be used at the very end.
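A holdout split sketch on shuffled indices; the 60/20/20 proportions are an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
idx = rng.permutation(n)   # shuffle the instance indices

# Split into roughly 60% train, 20% validation, 20% test.
train_idx, val_idx, test_idx = np.split(idx, [int(0.6 * n), int(0.8 * n)])

# Train on train_idx, tune hyperparameters on val_idx,
# and touch test_idx only once, at the very end, for the generalization estimate.
```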