DS Midterm Review - Lecture Notes
Use of high-performance computing to scale up to the size of big data (producing models with machine learning algorithms is more time consuming than applying them)
- parallelization: i.e., supercomputers; used to speed up a single implementation
- distributed/cluster: used to divide the data between computers running the same algorithm
Lecture 2:
Data Mining:
Actually using the planned DM tools
Descriptive DM: concise description (summarization) of data to reveal interesting, general
properties of the data
Predictive DM: constructs models via inference (learning) from training data that can be used to
make predictions on new test data
- training and test datasets are made, then model is made with chosen DM tool, then model
(knowledge) is verified with appropriate test measures and procedures
The KD model is by nature iterative and interactive, with feedback loops and steps that are revised and repeated
Challenges of KD:
1) Many different data sources and types
2) Large variety of distinct DM tasks
3) Big data and scalability
Case study to demonstrate: Cystic Fibrosis with Denver's Children's Hospital and University of
Colorado
Lecture 3:
Lecture 4:
VOLUME:
Huge volumes of data, but only some data mining algorithms can scale to highly dimensional data
- different methods require different runtimes to process the same data, which we must consider
We consider: number of objects, number of features, and number of values a feature assumes
- the ability of an algorithm to cope with highly dimensional inputs is described by its asymptotic complexity
- the 'order of magnitude' of the total number of operations, which translates into the amount of runtime
- describes the growth rate of runtime as the size of each dimension increases
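A quick illustrative sketch in plain Python (hypothetical toy workloads) of what these growth rates mean in practice: when n grows 10x, an O(n) pass takes ~10x longer, while an O(n^2) pass takes ~100x longer:

```python
import time

def linear_pass(data):
    # O(n): touch each object once (e.g., computing one feature's mean)
    return sum(data) / len(data)

def pairwise_pass(data):
    # O(n^2): compare every pair of objects (e.g., a distance search)
    return sum(abs(x - y) for x in data for y in data)

for n in (500, 5_000):
    data = list(range(n))
    t0 = time.perf_counter(); linear_pass(data)
    t1 = time.perf_counter(); pairwise_pass(data)
    t2 = time.perf_counter()
    print(f"n={n}: O(n) {t1 - t0:.5f}s, O(n^2) {t2 - t1:.3f}s")
```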
Hadoop → open source software platform to store and process big data across many servers
- Hadoop Distributed File System (HDFS) = manages data stored across multiple linked computers
- MapReduce = framework for running parallel computations (see the sketch after this list)
- Yet Another Resource Negotiator (YARN) = manages and schedules computing resources
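A minimal sketch of the MapReduce idea in plain Python (not the actual Hadoop API; the map/shuffle/reduce functions here are illustrative stand-ins):

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # map phase: turn one input record into (key, value) pairs
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # shuffle phase: group all values by key across mappers
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # reduce phase: aggregate the values collected for one key
    return key, sum(values)

lines = ["big data big compute", "data mining on big data"]
mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 3, 'data': 3, 'compute': 1, 'mining': 1, 'on': 1}
```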
Incomplete:
- available data doesn't have enough info to discover new knowledge
- can be caused by insufficient number of objects/coverage, missing values, etc
1) Identify the problem
2) Take measures to mitigate it (e.g., collect and record additional targeted data)
Redundant:
- two or more identical objects or two or more features that are strongly correlated
- need to check that redundancy isn't an indicator of relevant frequency
Can be removed by feature selection algorithms, or manually before application of the DM algorithm (improves speed)
Irrelevant data can be omitted if it is insignificant to the analysis (e.g., a patient's name when analyzing a health condition)
Missing Values:
- may result from manual data entry, incorrect measurements, equipment errors, etc
Noise:
Noise in data is defined as a value that is a random error or variance in a measured feature
- can cause significant problems in DMKD process
- noise can be reduced by using constraints on features to detect anomalies
Remove by:
- manual inspection while using predefined constraints
- binning: sort values of the noisy feature and substitute values with the means or medians of predefined bins (see the sketch after this list)
- clustering: group similar objects and remove or change values that fall outside of these clusters
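A minimal sketch of binning, assuming equal-frequency bins and smoothing by bin means (the bin count is an illustrative parameter):

```python
import numpy as np

def smooth_by_bin_means(values, n_bins=3):
    # sort values into equal-frequency bins, then replace each value
    # with the mean of its bin to dampen noise
    order = np.argsort(values)
    smoothed = np.empty(len(values), dtype=float)
    for chunk in np.array_split(order, n_bins):
        smoothed[chunk] = values[chunk].mean()
    return smoothed

noisy = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26])
print(smooth_by_bin_means(noisy))
# bins [4,8,9], [15,21,21], [24,25,26] -> smoothed to 7, 19, 25
```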
Lecture 5:
Missing values are common in real-life datasets - usually recorded as NULL, *, ?, or left blank
Single Imputation:
Mean imputation = uses the mean of complete values of a feature that contains missing data to
fill in the missing values
- for symbolic features, mode is used instead
- imputes missing values for each feature separately (calculated per column)
- can be conditional or unconditional
- conditional = imputes mean value that depends on the values of a selected feature (ie,
mean calculated for sick vs healthy patients)
- low asymptotic complexity O(n)
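A minimal sketch of mean imputation with pandas (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["sick", "healthy", "sick", "healthy", "sick"],
    "age":    [25.0, None, 40.0, 33.0, None],
})

# unconditional: one mean per column, computed from the complete values
unconditional = df["age"].fillna(df["age"].mean())

# conditional: the imputed mean depends on a selected feature
# (here, separate means for sick vs healthy patients)
conditional = df.groupby("status")["age"].transform(lambda s: s.fillna(s.mean()))

# for a symbolic feature, the mode would be used instead:
# df["status"].fillna(df["status"].mode()[0])
print(pd.DataFrame({"unconditional": unconditional, "conditional": conditional}))
```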
Hot deck imputation = uses complete values from the most similar (hot) object to the object with
missing values, measured based on a given distance function
- if most similar record also contains missing values, then the second closest object is found and
used
- repeated until all missing values are successfully imputed
- can be conditional or unconditional
- high asymptotic complexity O(n^2), because each incomplete object must be compared against all other objects
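A minimal sketch of hot deck imputation with NumPy, simplified so that only fully complete objects serve as donors (which sidesteps the "second closest" fallback); the distance function is assumed to be Euclidean over the shared features:

```python
import numpy as np

def hot_deck_impute(X):
    # for each row with missing values (NaN), copy the missing entries
    # from the nearest complete row, compared on the features both share
    X = X.astype(float).copy()
    donors = X[~np.isnan(X).any(axis=1)]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        known = ~np.isnan(X[i])
        dists = np.linalg.norm(donors[:, known] - X[i, known], axis=1)
        X[i, ~known] = donors[np.argmin(dists)][~known]  # most similar (hot) object
    return X

X = np.array([[1.0, 2.0, 3.0],
              [9.0, 8.0, 7.0],
              [1.2, np.nan, 2.9]])
print(hot_deck_impute(X))  # row 2 borrows its missing value from row 0
```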
Other imputation methods can be used at the cost of higher computational complexity
- they usually compute imputed values based on info from the complete portion of the entire
dataset
- other methods may have better quality imputed values since they use more data (but at the
price of complexity)
mean imputation uses information from one feature (column)
hot deck imputation uses information from one object (row)
Lecture 6:
Data Warehouse - subject oriented, integrated, time-variant and nonvolatile collection of data in
support of decision making
● database is maintained separately
● focuses on modeling and analysis, not daily operations and transactions
● integrates multiple heterogeneous data sources
● data cleaning and integration techniques used to ensure consistency
● has longer time range than in operational systems
● updates cannot occur in a data warehouse directly
● can only load and read data
Data Cube:
● lattice of cuboids forms a data cube
● apex cuboid is the topmost 0-D cuboid (highest level of summarization)
● base cuboid is the n-D base cube (lowest level of summarization)
● the dimensionality of a cuboid corresponds to the number of upward connections it has in the lattice (e.g., for dimensions time, item, and location, the lattice contains 2^3 = 8 cuboids)
3-D Cuboid → holds the data for one 'producer'
4-D Cuboid → multiple 3-D cuboids together, covering multiple producers
Types of Schema:
Star → fact table in the middle with connections to set of dimension tables
Snowflake → some dimension tables are normalized into sets of smaller tables
Galaxy → multiple fact tables share dimension tables
Concept Hierarchy: sequence of mappings from very specific low-level concepts to more general high-level ones
- useful to know in order to perform OLAP summarizations
OLAP Commands:
Roll Up → navigates to lower levels of detail (more specific to more general)
e.g., from production by month, roll up to production by quarter
Drill Down → navigates to higher levels of detail (more general to more specific)
e.g., from all regions, drill down to USA only
Slice → provides a cut through the cube to focus on one specific perspective
e.g., production only in Richmond, VA
Pivot → rotates the cube to change perspective
e.g., from "time × item" to "time × location"
Dice → provides one cell from the cube (the smallest, most specific slice)
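A minimal sketch of roll up and slice on a flat table with pandas (a stand-in for a real OLAP engine; the table and column names are hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "month":    ["2024-01", "2024-02", "2024-04", "2024-05"],
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Richmond, VA", "Denver, CO", "Richmond, VA", "Denver, CO"],
    "units":    [120, 80, 150, 90],
})

# roll up: from production by month to production by quarter
print(sales.groupby("quarter")["units"].sum())  # Q1: 200, Q2: 240

# slice: fix one dimension to focus on a specific perspective
print(sales[sales["location"] == "Richmond, VA"]["units"].sum())  # 270
```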
Implementation of OLAP:
Since data warehouses store high volumes of data, efficiency is important. To improve
performance of queries in OLAP…
● materialization
● indexing (bitmap or join)
Materialization:
Full materialization = fastest query response, but heavy pre-computation and very large storage needs
● usually unrealistic and too expensive
No materialization = slowest query response, always needs dynamic query evaluation, but lowest storage needs
● very slow response times create the need for at least some materialization
Partial materialization = balances response time and required storage space
Bitmap Indexing:
● index performed on chosen columns
○ each value in column is represented by a bit vector
● Join and aggregation operations reduced to bit arithmetic
● works best for low cardinality domains
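A minimal sketch of the bitmap idea in Python (illustrative only): each distinct value of a low-cardinality column gets one bit vector, and a conjunctive query reduces to bitwise AND:

```python
region = ["East", "West", "East", "North", "West", "East"]
status = ["new",  "new",  "old",  "new",   "old",  "new"]

def bitmap(column):
    # one bit vector per distinct value: bit i is set iff row i holds that value
    return {v: sum(1 << i for i, x in enumerate(column) if x == v)
            for v in set(column)}

regions, statuses = bitmap(region), bitmap(status)

# WHERE region = 'East' AND status = 'new' becomes bit arithmetic
hits = regions["East"] & statuses["new"]
print([i for i in range(len(region)) if hits >> i & 1])  # rows [0, 5]
```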
Join Indexing:
● traditional indices map the value of an attribute to a list of record IDs
● join indices used to register the joinable rows of two relations
Lecture 7:
Case Study: find genetic sequence of proteins that interact with DNA and RNA (nucleic acids)
● cost of sequencing a genome significantly decreased (more doable to explore genomes
and have many more to explore)
● as of Sept 2022, there are 239 million sequenced proteins
● 3% of all proteins interact with DNA, but we only know 45k of them (5.5% of 810,000
possible that we could know right now)
Approach: use historical data to build a predictive model, test it, and if the model has good
predictive performance, we can use it to predict new data
Historical data: proteins known to interact with RNA and DNA compared to those that do not
Predictive model: takes a protein sequence as input and outputs whether or not it can interact with DNA or RNA
Predictive performance: use the model to predict a set of proteins for which we know the
outcomes and compare the predicted with the true/known outcomes
Lecture 8:
Top DS software:
1) Rapidminer
2) Excel
3) Anaconda
4) Tensorflow
Top DS languages:
1) Python
2) R
3) SQL
4) Java
RapidMiner (formerly YALE) can do: classification, clustering, regression, associations, text mining, outlier detection, data visualization, model visualization
Lecture 9:
Simple word matching doesn't always work because of different semantic meanings
- NL is structured but query/documents are not
- query/documents lack context
We cannot provide *one* best answer, so we give multiple possibly good answers and leave the selection up to the user
To do this, we need to check stored documents to find potential relevance to the query
● requires a high-quality, fast-to-compute measure of relevance
● find documents 'close' to the right answer, and use heuristics that measure 'how close' to sort the answers
Query function:
- refine user query by stemming and removing stop-words
- linguistic analysis of semantics not performed to maintain efficiency (otherwise would take too
long)
Evaluating Outputs:
relevancy = precision
● ability to retrieve top-ranked documents that are mostly relevant
Precision = # of relevant documents retrieved by the query / total # of retrieved docs
Recall = # of relevant documents retrieved by the query / total # of relevant docs in the database
TRADEOFF: a tradeoff between precision and recall exists (high in one usually means low in the other)
Ideal: high precision and high recall
We can't always know the total number of relevant documents, so we must estimate it by:
- sampling and performing relevance judgments
- applying different algorithms to the same DB for the same query (and comparing results)
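A minimal sketch of computing both measures, assuming the set of truly relevant documents is known (or has been estimated as above):

```python
def precision_recall(retrieved, relevant):
    # retrieved: doc ids returned by the query
    # relevant: doc ids known or estimated to be relevant
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# query returns 4 docs, 3 of which are among the 6 truly relevant ones
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```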
Term frequency is normalized by dividing by the frequency of the most common term in the document:
tf_ij = f_ij / max_i(f_ij)
where f_ij is the frequency of term i in document j
Highest weight is assigned to terms that occur frequently in the document but rarely in the rest
of the database
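A minimal sketch of this weighting in Python (normalized tf multiplied by an inverse document frequency; the exact idf variant used here, log(N/df), is an assumption):

```python
import math
from collections import Counter

docs = ["big data mining", "data warehouse data cube", "text mining of big data"]
tokenized = [d.split() for d in docs]
N = len(docs)

def weight(term, doc_tokens):
    freq = Counter(doc_tokens)
    tf = freq[term] / max(freq.values())           # tf_ij = f_ij / max_i(f_ij)
    df = sum(term in toks for toks in tokenized)   # documents containing the term
    return tf * (math.log(N / df) if df else 0.0)  # rare in the database -> high weight

print(weight("data", tokenized[1]))       # appears in every doc -> weight 0.0
print(weight("warehouse", tokenized[1]))  # appears only here -> high weight
```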
MATHEMATICALLY:
D1 = 2T1 + 6T2 + 5T3 → squared weights 4, 36, 25 (sum = 65)
D2 = 5T1 + 5T2 + 2T3 → squared weights 25, 25, 4 (sum = 54)
Q1 = 0T1 + 0T2 + 2T3 → squared weights 0, 0, 4 (sum = 4)
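Those squared weights are the building blocks of the cosine similarity between each document vector and the query vector; a minimal sketch, assuming the standard cosine measure:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

D1, D2, Q1 = [2, 6, 5], [5, 5, 2], [0, 0, 2]

# sim(D1, Q1) = 10 / (sqrt(65) * sqrt(4)) ≈ 0.62
# sim(D2, Q1) =  4 / (sqrt(54) * sqrt(4)) ≈ 0.27  -> rank D1 above D2
print(cosine(D1, Q1), cosine(D2, Q1))
```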