DS Midterm Review - Lecture Notes


Lecture 1:

Data Science - interdisciplinary field that aims to extract knowledge products from data

Average data set analyzed by Data Scientists: 1.1-10GB

Five V's of big data:


1. Volume - amount of data
2. Velocity - differentiate between static data and dynamic data that changes over time
(streaming)
3. Variety - data exists in many forms (ie structured, unstructured, text, multimedia)
4. Veracity - data may be in doubt or uncertain due to data inconsistency and incompleteness
5. Value - the data you analyze should bring value

Databases - used to store, access, maintain and explore data


Data Warehouses - offer tools to explore data

Data mining - process of discovering knowledge products from the data


Challenges: hundreds of algorithms are available and some may be better suited than others; in addition:
1) methods must be capable of scaling up to data size
2) lack of automated data preprocessing procedures (data cleaning)

Use of high performance computing to scale up to the size of big data (production of models by
machine learning algorithms is more time consuming than application of models)
- parallelization: aka supercomputers. Used to speed up implementation
- distributed/cluster: used to divide up the data between computers running the same algorithm

Lecture 2:

Knowledge Discovery (KD):


"Non Trivial process of identifiying valid, novel, potentially useful and ultimately
understandable patterns from large collections of data."
patterns = knowledge products

6 step KD process model:


1) Understanding the problem domain
2) Understanding the data
3) Preparation of the data
4) Data mining
5) Evaluating the discovered knowledge
6) Using the discovered knowledge
Understanding the problem domain:
Work with domain experts to define problem, project goals, success criteria, key people & their
availability, current solutions to the problem, domain specific terminology, restrictions/limitations
- initial selection of potential DM tools here

Understanding the data:


Collecting sample data and deciding which data will be needed, ranking importance of data
attributes, verification of usefulness of data, checking for completeness, redundancy, missing
values, etc

Preparation of the data:


Selecting and sampling of data, correlation and significance tests with data cleaning
- steps to improve completeness, remove/correct noise, impute/remove missing data
- results of this step make the new data records that meet input requirements for planned DM
tools
- key step in the KD process
- may take ½ of the entire project effort to do this

Data Mining:
Actually using the planned DM tools
Descriptive DM: concise description (summarization) of data to reveal interesting, general
properties of the data
Predictive DM: constructs models via inference (learning) from training data that can be used to
make predictions on new test data
- training and test datasets are made, then model is made with chosen DM tool, then model
(knowledge) is verified with appropriate test measures and procedures

Evaluating discovered knowledge:


Understand the results, checking if the models are novel and interesting, interpreting the results
by domain experts, checking impact of discovered knowledge
- only the best models are retained
- may revisit KD process to identify alternative actions to improve the results

Using discovered knowledge:


Planning where and how it will be used
- deployment of knowledge product and development of a business plan for it
- plan is made to monitor implementation
- entire project is documented

KD model by nature is iterative and interactive with feedback loops and steps that are revised
and repeated

Time breakdown (on avg)


20% business objective determination
60% data preparation
- can't be automated, necessary data may not be available, errors in data or
redundancy/inconsistency
20% data mining, analysis and knowledge assimilation

Challenges of KD:
1) Many different data sources and types
2) Large variety of distinct DM tasks
3) Big data and scalability

Case study to demonstrate: Cystic Fibrosis with Denver's Children's Hospital and University of
Colorado

Lecture 3:

Success of DS projects relies on the input data

A single unit of data is a value of a feature/attribute


- objects are described by features that are combined to form datasets that are stored in
databases or data warehouses

Features = attributes = dimensions = descriptors = variables


objects = examples = instances = data points = records = individuals

Two types of values: numerical and symbolic


Numerical → values expressed by numbers (ie real numbers, prime numbers, integers, etc)
Symbolic → describe qualitative concepts (ie colors, size, etc)

Discrete → the total number of distinct values is relatively small


Binary → (ie 1 or 0) only has two distinct values
Continuous → total number of values is very large (infinite) and covers a specific interval (range)
Nominal → implies there is no natural ordering between its values
Ordinal → implies some ordering exists

Objects represent entities:


Multivariate = described by many features
Univariate = described by a single feature

Flat files - tables with rows and columns


- used to store data in text file format
- often generated from data stored in spreadsheets or databases

Datasets - main repositories of flat file datasets


Data type most often analyzed by data scientists: table data then text

Data Storage options:


1) Relational databases
- consists of a set of tables
- tables contain attributes (columns) and tuples (records/rows)
- each table has a unique name and each tuple has a key identifier (key)
- includes an entity-relationship (ER) data model that defines a set of entities and their
relationships
2) Data Warehouses
- Main purpose is data analysis and exploration
- organized around a set of subjects of interest to the user
- analysis performed to provide info in historical context
- data/queries rely on aggregated info over a number of relevant dimensions (age, time)
- uses a multidimensional database structure where each dimension corresponds to an
attribute or set of attributes selected by user to be included
- each value corresponds to aggregated measures
3) Transactional databases
- consists of records that represent transactions
- includes unique identifier and set of items that make up the transaction
- stores a variable number of unsorted items, whereas a dataset stores values for a fixed-size
vector of features

Database Management System (DBMS)


Consists of the database that stores the data and a set of programs to access the database
- defines structure (schema) of the database
- stores, updates and accesses in concurrent and distributed ways
- ensures security and consistency of the data
Uses SQL (structured query language) to access database
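
A minimal Python sketch of accessing a relational database with SQL, using the built-in sqlite3 module (the 'patients' table and its columns are made-up examples, not from the lecture):

import sqlite3

# In-memory database just for the demo
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Define the schema (structure) and insert a few tuples (rows)
cur.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER, diagnosis TEXT)")
cur.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                [(1, 34, "healthy"), (2, 61, "sick"), (3, 47, "sick")])
conn.commit()

# SQL query returning an aggregate value (data retrieval, not data mining)
cur.execute("SELECT diagnosis, AVG(age) FROM patients GROUP BY diagnosis")
print(cur.fetchall())   # e.g. [('healthy', 34.0), ('sick', 54.0)]
conn.close()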

Why use database systems?


Data source may be too big for flat files
May need to access many different datasets
Need to add, update, delete data by many people in different locations
Avoids redundant information

Data/info retrieval or finding aggregate values is NOT data mining


Data mining uses models/trends/patterns to generate knowledge products (more complex)

Lecture 4:

Issues related to data quantity and quality: volume and veracity

VOLUME:
Huge volumes of data exist, but only some data mining algorithms can scale to highly dimensional
data
- different methods require differing runtimes to process the data
We must consider: the number of objects, the number of features, and the number of values a
feature assumes
- the ability of an algorithm to cope with highly dimensional inputs is described as asymptotic
complexity
- 'order of magnitude' of total number of operations that translates into amount of runtime
- describes growth rate of runtime as size of each dimension increases

O(n) < O(n*log(n)) < O(n^2) < O(n^3)

Two techniques to improve scalability:


1) Speed up the algorithm (via heuristics, optimization and parallelization)
Heuristics = processing of data is simplified (ie only rules of maximal length n)
Optimization = using efficient data structures to manipulate data
Parallelization = distribute processing into several processors that work in parallel and speed up
computations

2) Partitioning (via reducing dimensionality and/or sequential or parallel processing)


Reducing = data is sampled and only a subset of objects or features is used
- useful when redundant objects exist or similar/irrelevant data
Sequential = beneficial when complexity is worse than linear
- uses additional overhead to merge results from subsets
Parallel = processing using Hadoop

Hadoop → open source software platform to store and process big data across many servers
- Hadoop Distributed File System (HDFS) = manages data storage over multiple linked computers
- MapReduce = framework for running parallel computations
- Yet Another Resource Negotiator (YARN) = manages and schedules computing resources
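
A minimal pure-Python sketch of the MapReduce idea (word count); this only illustrates the programming model, real Hadoop jobs are distributed across machines via HDFS and YARN:

from collections import defaultdict

documents = ["big data needs big tools", "hadoop stores big data"]

# Map phase: each document independently emits (key, value) pairs,
# so this step can run in parallel on different machines
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the values for each key
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'stores': 1}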

To reduce dimensionality, we can change sampling to reduce the amount of data


- the size of the sample shouldn't negatively impact completeness
- Usually sampling is done at random using stratified sampling (see the sketch below)
● this is done by dividing the data into subsets and choosing the same fraction of data
from each. This eliminates bias (over/under sampling)
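
A minimal pandas sketch of stratified sampling (the column name 'class' and the toy data are made up; groupby.sample needs a reasonably recent pandas):

import pandas as pd

df = pd.DataFrame({
    "feature": range(10),
    "class":   ["A"] * 8 + ["B"] * 2,   # unbalanced toy data
})

# Draw 50% of the objects from each class (stratum)
sample = df.groupby("class").sample(frac=0.5, random_state=0)
print(sample["class"].value_counts())   # A: 4, B: 1 -- class proportions preserved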
Reducing dimensionality for unbalanced data needs more attention:
- you may need to actually skew sampling to balance data in some scenarios
ie undersampling large/majority, or oversampling small/minority classes
- over/under sampling is not effective when data for minority classes is incomplete - use SMOTE

SMOTE = Synthetic Minority Over Sampling technique


Rational oversampling that creates synthetic samples for minority classes (sketched below):
1) Derive vectors from each minority class object to its k nearest neighbors
2) Multiply the vectors by a random number between 0 and 1 to compute synthetic objects
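
A minimal NumPy sketch of the SMOTE idea (not the reference implementation; libraries such as imbalanced-learn provide a production version):

import numpy as np

def smote_sketch(minority, k=2, n_new=5, seed=0):
    """Create n_new synthetic objects from the minority class."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))               # pick a minority object
        x = minority[i]
        # find its k nearest minority neighbors (index 0 of argsort is the object itself)
        d = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        neighbor = minority[rng.choice(neighbors)]
        # step 1: vector from the object to the neighbor
        # step 2: multiply by a random number in [0, 1] to compute a synthetic object
        synthetic.append(x + rng.random() * (neighbor - x))
    return np.array(synthetic)

minority_class = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
print(smote_sketch(minority_class))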
VERACITY:
Includes incompleteness, redundancy, missing values, noise.

Incomplete:
- available data doesn't have enough info to discover new knowledge
- can be caused by insufficient number of objects/coverage, missing values, etc
1) Identify the problem
2) Take measures to mitigate it (ie collect and record additional targeted data)

Redundant:
- two or more identical objects or two or more features that are strongly correlated
- need to check that redundancy isn't an indicator of relevant frequency
Can be removed by feature selection algorithms before applying the DM tools (improves speed)
Irrelevant data can be omitted if it is insignificant to analysis (ie name for health condition)

Missing Values:
- may be caused by result of manual data entry, incorrect measurements, equipment errors, etc

Noise:
Noise in data is defined as a value that is a random error or variance in a measured feature
- can cause significant problems in DMKD process
- noise can be reduced by using constraints on features to detect anomalies
Remove by:
- manual inspection while using predefined constraints
- binning: sort values of noisy features and substitute values with means or medians of
predefined bins
- clustering: group similar objects and remove or change values outside of these clusters
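
A minimal pandas sketch of noise reduction by binning (toy values): the feature is split into equal-width bins and each value is replaced by the mean of its bin:

import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.cut(values, bins=3)                       # 3 equal-width bins
smoothed = values.groupby(bins).transform("mean")   # replace each value with its bin mean
print(pd.DataFrame({"raw": values, "smoothed": smoothed}))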

Lecture 5:

Missing values are common in real life datasets - usually NULL, *, ? or left blank

Can be a result of:


- human error
- equipment error
- attrition: patients improve or die
- skip pattern: certain tests aren't always done for every situation
- nonresponse: some patients refuse to give info
- deleted noisy values

Three types of missing data:


1) Missing completely at random (rare)
2) Missing at random (occur in specific parts of data, randomness cannot be verified
computationally)
3) Missing not at random

Missing data can be handled by two approaches:


- deletion of missing data
- imputation (filling-in) of missing data
- missing not at random cannot be deleted or imputed

Deletion of Missing Values:


- Objects (list-wise deletion) or features (variable deletion) with missing values are discarded
- decreases info content of the data
- should only be done when the removed items are not needed for analysis
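
A minimal pandas sketch of both deletion strategies (feature names are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 40, 31],
                   "income": [50, 60, np.nan, 55]})

listwise = df.dropna(axis=0)   # list-wise deletion: drop objects (rows) with missing values
variable = df.dropna(axis=1)   # variable deletion: drop features (columns) with missing values
print(listwise.shape, variable.shape)   # (2, 2) (4, 0) -- info content clearly decreases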

Imputation of Missing Values:


Two methods:
1) Single imputation - impute a single value for each missing value (ie mean, hot deck,
regression)
2) Multiple imputation - compute several likelihood ordered choices for imputing the missing
values. One of the corresponding datasets is selected.

Single Imputation:
Mean imputation = uses the mean of complete values of a feature that contains missing data to
fill in the missing values
- for symbolic features, mode is used instead
- imputes missing values for each feature separately (calculated per column)
- can be conditional or unconditional
- conditional = imputes mean value that depends on the values of a selected feature (ie,
mean calculated for sick vs healthy patients)
- low asymptotic complexity O(n)
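
A minimal pandas sketch of unconditional and conditional mean imputation (the 'status' and 'temp' columns are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"status": ["sick", "sick", "healthy", "healthy"],
                   "temp":   [39.0, np.nan, 36.5, np.nan]})

# Unconditional: one mean computed from all complete values of the feature
unconditional = df["temp"].fillna(df["temp"].mean())

# Conditional: a separate mean per value of a selected feature (here 'status')
conditional = df.groupby("status")["temp"].transform(lambda s: s.fillna(s.mean()))

print(unconditional.tolist())   # [39.0, 37.75, 36.5, 37.75]
print(conditional.tolist())     # [39.0, 39.0, 36.5, 36.5]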

Hot deck imputation = uses complete values from the most similar (hot) object to the object with
missing values, measured based on a given distance function
- if most similar record also contains missing values, then the second closest object is found and
used
- repeated until all missing values are successfully imputed
- can be conditional or unconditional
- high asymptotic complexity O(n^2) because it uses more data
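
A minimal NumPy sketch of the hot deck idea, simplified to a single missing value: the donor is the most similar complete object under Euclidean distance over the observed features:

import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [1.1, 2.1, np.nan],   # object with a missing value
                 [5.0, 6.0, 7.0]])

i, j = 1, 2                                         # position of the missing value
complete = ~np.isnan(data).any(axis=1)              # objects with no missing values
observed = ~np.isnan(data[i])                       # features we can compare on

# Distance from the incomplete object to every complete ("hot") candidate
dists = np.linalg.norm(data[complete][:, observed] - data[i, observed], axis=1)
donor = data[complete][np.argmin(dists)]            # most similar complete object
data[i, j] = donor[j]                               # copy its value
print(data[i])                                      # [1.1 2.1 3. ]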

Other imputation methods can be used at the cost of higher computational complexity
- they usually compute imputed values based on info from the complete portion of the entire
dataset
- other methods may have better quality imputed values since they use more data (but at the
price of complexity)
mean imputation uses one feature
hot deck imputation uses one object

Lecture 6:

Data Warehouse - subject oriented, integrated, time-variant and nonvolatile collection of data in
support of decision making
● database is maintained separately
● focuses on modeling and analysis, not daily operations and transactions
● integrates multiple heterogeneous data sources
● data cleaning and integration techniques used to ensure consistency
● has longer time range than in operational systems
● updates cannot occur in a data warehouse directly
● can only load and read data

DBMS uses OLTP (online transaction processing)


Data Warehouses used OLAP (online analytical processing)
● OLAP allows complex queries and multidimensional views with GROUP BY and
aggregators

Data Warehouses can be made of many data marts


● Data marts are subsets of the warehouse covering the part of the company that has value to
a smaller group of users (ie sales, marketing)

Three layer architecture:


TOP TIER: front-end tools. shows results of OLAP
MIDDLE TIER: OLAP server, fast querying here
BOTTOM TIER: data warehouse server

Metadata repository: stores data defining warehouse objects


- Provides parameters and info for middle and top tier applications
● Description of the structure of the warehouse
● operational metadata (currency of data, monitoring info)
● system performance data
● info about mapping from operational database
● summarization algorithms
● business data (terms, definitions, ownership,etc)

In a multidimensional data model…


Data is viewed in the form of a data cube
- dimensions describe the underlying data
ie item descr(name,type) and time(day, week, month, quarter, year)
- data is stored in a fact table which contains actual data, often summarized

Data Cube:
● lattice of cuboids forms a data cube
● apex cuboid is the topmost 0-D cuboid (highest level of summarization)
● base cuboid is the n-D base cube
● the number of dimensions (#-D) of a cuboid can be read from the number of upward connections present in the lattice

2-D data model = relational table

3-D cuboid = data cube for one 'producer'

4-D cuboid = multiple 3-D cuboids together for multiple producers

Types of Schema:
Star → fact table in the middle with connections to set of dimension tables
Snowflake → some dimension tables are normalized into sets of smaller tables
Galaxy → multiple fact tables share dimension tables
Concept Hierarchy: sequence of mapping from a very specific low-level to more general
high-level
- useful to know to perform OLAP summarizations

OLAP Commands:
Roll Up → navigates to lower levels of detail (more specific to more general)
ie from production by month, roll up to production by quarter
Drill Down → navigates to higher levels of detail (more general to more specific)
ie all regions, drill down to USA only
Slice → provides cuts through cube to focus on specific perspectives
ie. Production only in Richmond, VA
Pivot → rotates cube to change perspective
ie from "time item" to "time location"
Dice → provides one cell from a cube (smallest most specific slice)
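
A minimal pandas sketch that mimics these operations on a toy fact table (not an actual OLAP server; the dimension names are made up):

import pandas as pd

fact = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "month":    ["Jan", "Feb", "Apr", "May"],
    "location": ["Richmond", "Boston", "Richmond", "Boston"],
    "units":    [100, 80, 120, 90],
})

by_month = fact.groupby(["quarter", "month"])["units"].sum()     # detailed view
roll_up  = fact.groupby("quarter")["units"].sum()                # month -> quarter (more general)
drill    = by_month                                              # quarter -> month (more specific)
slice_   = fact[fact["location"] == "Richmond"]                  # cut on one dimension value
dice     = fact[(fact["location"] == "Richmond") & (fact["quarter"] == "Q1")]  # fix several dimensions -> a single cell here
print(roll_up)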

Implementation of OLAP:

Possible Server Architectures:


● Relational OLAP (ROLAP)
● Multidimensional OLAP (MOLAP)
● Hybrid OLAP (HOLAP)

Since data warehouses store high volumes of data, efficiency is important. To improve
performance of queries in OLAP…
● materialization
● indexing (bitmap or join)

Materializing:
Full Materializing = fastest query response but heavy pre-computing and very large store needs.
● usually unrealistic and too expensive
No Materialization = slowest query response, always needs dynamic query evaluation, but less
storage needs
● very slow response time causes need for some materialization
Partial = balances response time and required storage space

Bitmap Indexing:
● index performed on chosen columns
○ each value in column is represented by a bit vector
● Join and aggregation operations reduced to bit arithmetic
● works best for low cardinality domains
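
A minimal NumPy sketch of the bitmap index idea on made-up columns: each distinct value of a low-cardinality column gets its own bit vector, so selections reduce to bit arithmetic:

import numpy as np

region = np.array(["east", "west", "east", "north", "west"])
status = np.array(["open", "open", "closed", "open", "closed"])

# One bit vector per distinct value of each indexed column
bitmap_region = {v: region == v for v in np.unique(region)}
bitmap_status = {v: status == v for v in np.unique(status)}

# "region = east AND status = open" becomes a bitwise AND of two vectors
hits = bitmap_region["east"] & bitmap_status["open"]
print(np.nonzero(hits)[0])   # [0] -- record IDs that satisfy the query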

Join Indexing:
● traditional indices map the value of an attribute to a list of record IDs
● join indices used to register the joinable rows of two relations

Lecture 7:

Case Study: find genetic sequence of proteins that interact with DNA and RNA (nucleic acids)
● cost of sequencing a genome significantly decreased (more doable to explore genomes
and have many more to explore)
● as of Sept 2022, there are 239 million sequenced proteins
● 3% of all proteins interact with DNA, but we only know 45k of them (5.5% of 810,000
possible that we could know right now)

Approach: use historical data to build a predictive model, test it, and if the model has good
predictive performance, we can use it to predict new data

Historical data: proteins known to interact with RNA and DNA compared to those that do not
Predictive model: takes protein sequence as input and output whether or not it can interact with
DNA or RNA
Predictive performance: use the model to predict a set of proteins for which we know the
outcomes and compare the predicted with the true/known outcomes

Lecture 8:

Top DS software:
1) Rapidminer
2) Excel
3) Anaconda
4) Tensorflow

Top DS languages:
1) Python
2) R
3) SQL
4) Java

On avg, 6.1 tools/languages used per voter

Rapidminer (YALE) can do.. classification, clustering, regression, associations, text mining,
outlier detection, data visualization, model visualization

Lecture 9:

Information retrieval - finding material (usually documents) of an unstructured nature (usually
text) that satisfies an information need from within large collections (database)
Database - collection of documents
Term - semantic unit, a word, phrase
Query - request for documents that concern a particular topic of interest, usually very short

Simple word matching doesn't always work because of different semantic meanings
- NL is structured but query/documents are not
- query/documents lack context

We cannot provide *one* best answer, so we give multiple possibly good answers and leave
the selection up to the user

To do this, we need to check stored documents to find potential relevance to the query
● requires high quality and fast to compute measure of relevance
● find documents 'close' to the right answer and heuristics to measure 'how close' to sort
answers

Search Database function:


- make inverted index of significant terms from data
- terms exclude determiners, conjunctions, stop-words
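
A minimal Python sketch of an inverted index on made-up documents; each significant term maps to the IDs of the documents that contain it, with a tiny stop-word list excluded:

from collections import defaultdict

documents = {0: "the cat sat on the mat", 1: "a dog and a cat", 2: "the dog barked"}
stop_words = {"the", "a", "on", "and"}   # determiners, conjunctions, stop-words

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        if term not in stop_words:
            inverted_index[term].add(doc_id)

print(inverted_index["cat"])   # {0, 1} -- documents potentially relevant to the query term 'cat'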

Query function:
- refine user query by stemming and removing stop-words
- linguistic analysis of semantics not performed to maintain efficiency (otherwise would take too
long)

Evaluating Outputs:
relevancy = precision
● ability to retrieve top ranked documents that are mostly relevant
Precision = # relevant documents found from query / total # of retrieved docs

did we miss any relevant documents = recall


● ability to search and find all relevant items
Recall = # relevant documents found from query / total # of relevant docs

TRADEOFF: between precision and recall exists (high in one, means low in other usually)
Ideal: High precision and recall
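
A minimal Python sketch of the two measures on made-up document IDs:

retrieved = {1, 2, 3, 4, 5}      # documents returned for the query
relevant  = {2, 3, 9, 10}        # documents that are actually relevant

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)   # 2 / 5 = 0.4
recall    = len(true_positives) / len(relevant)    # 2 / 4 = 0.5
print(precision, recall)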

We can't always know the total number of relevant documents, we must estimate this by:
- sampling and performing relevant judgment
- applying different algorithms to same DB for the same query (and compare results)

To measure text similarity:


use a vector space model - used to provide similarity measure between doc and query
- fast to compute and scales well
Vector Space Model:
Vocabulary = all distinct terms after preprocessing documents (what's left after removing the insignificant terms)
● Contains t index terms
Each term = i
Each document or query = j
Each term i in document/query j is given a real-valued weight = (w)ij

Documents are expressed as t-dimensional vectors


dj = (w1j, w2j, …, wtj)

ie Query = Q1 = 0T1 + 0T2 + 2T3 where 2 is the weight of term 3


Doc 1 = D1 = 2T1 + 6T2 + 5T3
Doc 2 = D2 = 5T1 + 5T2 + 2T3
Note: D1 has a higher weight for T3 than D2, and Q1 is only looking for T3! The higher weights of D2
for T1 and/or T2 don't matter because Q1 has 0 weight for those terms

A collection of n documents is represented as vectors in a term-document matrix


0 = term doesn't exist in the document

Frequency of a term: more frequent = more important


(f)ij = frequency of term i in document j

frequency is normalized by dividing by the frequency of the most common term in the document:
(tf)ij = (f)ij / max_i((f)ij)

Inverse document frequency = indicates the term's discriminative power


- aka terms in many documents are less indicative for a specific topic
(df)i = document frequency of term i = aka number of documents containing term i
(idf)i = inverse document frequency of term i
(idf)i = log2(N/(df)i) Where N = total # of documents in the database

Highest weight is assigned to terms that occur frequently in the document but rarely in the rest
of the database
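
A minimal Python sketch of tf-idf weighting following the formulas above, on made-up documents:

import math
from collections import Counter

documents = [
    "data mining finds patterns in data",
    "databases store data",
    "text mining handles text documents",
]
N = len(documents)
tokenized = [doc.split() for doc in documents]

# Document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(term, doc_tokens):
    counts = Counter(doc_tokens)
    tf = counts[term] / max(counts.values())   # normalized term frequency
    idf = math.log2(N / df[term])              # discriminative power of the term
    return tf * idf

print(tfidf("data", tokenized[0]))       # ~0.58: frequent in the doc but common in the database
print(tfidf("patterns", tokenized[0]))   # ~0.79: less frequent in the doc but rare in the database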

Text Similarity Measure:


Quantifies the similarity between two vectors (doc to doc or doc to query)
● used to rank documents
● can be calculated using the cosine of the angle between two vectors

MATHEMATICALLY:
D1 = 2T1 + 6T2 + 5T3
D2 = 5T1 + 5T2 + 2T3
Q1 = 0T1 + 0T2 + 2T3

cos(D1, Q1) = (2*0 + 6*0 + 5*2) / sqrt((4+36+25)*(0+0+4)) = 10 / sqrt(65*4) = 0.62
cos(D2, Q1) = (5*0 + 5*0 + 2*2) / sqrt((25+25+4)*(0+0+4)) = 4 / sqrt(54*4) = 0.27

Comparing: 0.27 < 0.62

Therefore document 1 is roughly twice as similar to the query as document 2
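
A minimal NumPy sketch reproducing the worked example:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

D1 = np.array([2, 6, 5])
D2 = np.array([5, 5, 2])
Q1 = np.array([0, 0, 2])

print(round(cosine(D1, Q1), 2))   # 0.62
print(round(cosine(D2, Q1), 2))   # 0.27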
