
DCPP Notes

1) Data > Information > Knowledge > Wisdom


a. Data – Raw values collected from various sources
b. Information = Data + Meaning
c. Knowledge = Information + Ability to interpret information
d. Wisdom = Knowledge + Actionable insights to derive inference
2) Data Collection: Gathering and measuring information on targeted variables. Enables one to
answer relevant questions and evaluate outcomes. Broad data types
a. Qualitative – Non-statistical; categorized based on properties, attributes, labels. Investigative and open-ended. Generated through text, documents, audio-video recordings, transcripts and focus groups, notes, etc.
b. Quantitative – Statistical, concise, and closed-ended; answers questions like ‘how much’/‘how many’. Generated through tests, experiments, surveys, market research, metrics.
3) Data Types – Structured, Unstructured, Semi-Structured
a. Structured – Mostly quantitative, organized in rows and columns.
i. Characteristics – Predefined data models, easy to search, text or number based, shows what is happening. Resides in RDBMSs and data warehouses. Examples: dates, phone numbers, Social Security numbers, customer names, transactional information
b. Unstructured – Mostly free-flowing text
i. Characteristics – No predefined data models, difficult to search; text, PDF, images, video; shows the why. Resides in applications, data warehouses and data lakes, stored in various formats. Examples – documents, emails, messages, conversation transcripts, image files, open-ended survey answers
c. Semi-Structured: Easier to store and comprehend than unstructured data
i. Characteristics – Loosely organized, with a meta-level structure that can contain unstructured data (HTML/JSON/XML). Resides in RDBMSs and tagged text formats, stored in abstracts and figures. Examples – server logs, tweets organized by hashtags, email sorted into folders (inbox, sent, draft)
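A minimal sketch (in Python) of reading one semi-structured record: the tweet-like fields are made up, but they show how a meta-level structure (named fields, hashtags) can wrap free, unstructured text.

import json

# Hypothetical tweet-like record: named fields wrap free text
raw = '{"user": "alice", "text": "Loving the weather! #sunny", "hashtags": ["sunny"]}'

record = json.loads(raw)      # parse the JSON string into a Python dict
print(record["user"])         # structured: a named field
print(record["text"])         # unstructured: free-flowing text inside a field
print(record["hashtags"])     # meta-level structure used to organize the record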
4) Data Collection Methods:
a. Web scraping: Extracting data from webpages and storing in a database.
i. Automated web scraping uses a bot or a web crawler; the Scrapy and BeautifulSoup libraries are commonly used in Python.
1. Fetch a page: download the raw rendered HTML content
2. Extract information: parse the HTML and extract content such as text, images, and links (see the sketch below)
ii. A popular option is the Boilerpipe library:
1. Parses the HTML content and retrieves only the main text without HTML tags, e.g. ArticleExtractor prunes away meta information in news articles
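A minimal fetch-then-extract sketch with the requests and BeautifulSoup libraries mentioned above; the URL is a placeholder and the selectors are generic, so a real scraper would adapt them to the target site.

import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"        # placeholder URL

# Step 1: fetch the page (download the raw HTML content)
response = requests.get(url, timeout=10)
response.raise_for_status()

# Step 2: extract information (parse the HTML, pull out text and links)
soup = BeautifulSoup(response.text, "html.parser")
title = soup.title.string if soup.title else ""
links = [a["href"] for a in soup.find_all("a", href=True)]
main_text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

print(title, len(links), len(main_text))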
b. Web Crawling: systematically browsing the web in order to index it
i. Seed URLs – the initial pages from which the crawl starts
ii. Crawl frontier – the queue of URLs discovered but not yet visited
iii. Web archiving – the crawler copies and saves information as it goes
iv. Focused crawling – restrict the crawl to specific types of websites and webpages
c. Use of specific APIs: Content providers make certain data available through their application programming interfaces.
i. Methodology – query an API endpoint with a set of parameters to extract the required data, e.g. Google Trends, Wikipedia page traffic, Twitter API (see the sketch after this list)
ii. APIs typically provide data in a semi-structured format such as JSON or XML
1. Marketstack API – for real-time intraday time-series market data
2. Twitter API – for collecting Tweets with various fields
3. Facebook Graph API – interact with Facebook's social graph
4. News API – to locate articles and news headlines
5. AccuWeather API – for information on weather conditions and indices
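A hedged sketch of the query-an-endpoint-with-parameters pattern; the endpoint, parameter names, and key below are illustrative only, since each real API (Marketstack, Twitter, News API, etc.) defines its own URL, authentication, and fields.

import requests

# Illustrative endpoint and parameters; real APIs define their own.
endpoint = "https://api.example.com/v1/data"
params = {"symbols": "AAPL", "date_from": "2024-01-01", "access_key": "YOUR_KEY"}

response = requests.get(endpoint, params=params, timeout=10)
response.raise_for_status()

payload = response.json()                 # typically semi-structured JSON
for item in payload.get("data", []):      # field names depend on the provider
    print(item)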
5) Structured Data collection and Pre-processing
a. Types of data: discrete, continuous, categorical (i.e. ordinal, nominal)
i. Values on a numerical scale:
1. Continuous: any value within an interval [x, y]
2. Discrete: only integer values
3. Categorical:
a. Ordinal – various classes/categories with an explicit ordering
b. Nominal – categories with no implied order
ii. Specialized data structures: arrays, lists, dictionaries – specialized objects that hold a collection of other, more primitive data values
b. Types of datasets: rectangular, non-rectangular
i. Rectangular – a spreadsheet or a data frame: rows represent records, columns represent attributes, and all items in a single column have the same type of data (the same feature). Indexable, like relational databases: store a unique referent to every row/record (like a row number), with multi-level indexing for faster access (see the data-frame sketch after this list)
ii. Non-rectangular – spatial data structures, graph data structures, user-defined data structures
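A small sketch of a rectangular dataset as a pandas DataFrame; the values are made up, but rows act as records, columns as typed attributes, and the index as the unique referent to each row.

import pandas as pd

# Rows are records, columns are attributes; the values are made up.
df = pd.DataFrame({
    "customer": ["Asha", "Ben", "Chen"],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-03-20"]),
    "amount": [120.5, 89.0, 230.0],
})
df.index.name = "row_id"       # unique referent to every row/record

print(df.dtypes)               # every column holds a single type (same feature)
print(df.loc[1])               # indexed access to a single record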
c. Data Quality Assessment and Pre-processing
i. Data Quality Assessments
1. Data Inconsistencies
a. Mismatch in data types: Integer and float (1 and 1.0),
Encoding schemes (ASCII and UTF-8)
b. Inconsistent data-array dimensions – arrays that differ from the expected array length
c. Inconsistent use of identifiers – e.g. ‘woman’ vs ‘Female’, ‘temperature’ vs ‘temp’
2. Cleaning - inconsistencies, noise, duplicates, missing
a. Data Deletion: prune out erroneous records
b. Data Editing: replace with inferable values
i. Missing data – ignore and prune out such records when the dataset is sufficiently large or when multiple significant attributes are missing
ii. Fill in the missing attributes: integrate available external data sources; evaluate whether the value is inferable given the context; use a dummy value; use the attribute mean; use the most probable attribute value (mode)
iii. Noise – duplicates or random noise introduced while recording data. Dealing with noise is largely data specific (noise in images, sound clips, particular sensor readings); universal principles for dealing with noisy data include binning or bucketing, smoothing, clustering, and dimensionality reduction
iv. Binning/Bucketing:
1. Replace a sortable attribute value with the interval it falls in, e.g. replace fractional temperature readings with a temperature range
v. Smoothing: fit a regression function onto a particular data feature; the complexity of the regression function and the number of parameters are design choices
1. Various algorithms: moving average (simple/exponential), random walks, add-1 smoothing, back-off models (see the pandas sketch after this list)
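A hedged pandas sketch of the cleaning steps above: mean imputation for a missing value, binning fractional readings into ranges, and a simple moving average for smoothing. The column names and bin edges are illustrative.

import pandas as pd

# Illustrative sensor-style readings with one missing value.
df = pd.DataFrame({"temperature": [21.4, None, 22.1, 35.9, 21.8, 22.0]})

# Data editing: fill the missing attribute with the attribute mean.
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

# Binning/bucketing: replace fractional readings with a temperature range.
df["temp_range"] = pd.cut(df["temperature"], bins=[0, 20, 25, 40],
                          labels=["low", "normal", "high"])

# Smoothing: simple moving average over a window of three readings.
df["temp_smoothed"] = df["temperature"].rolling(window=3, min_periods=1).mean()

print(df)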
3. Integration - data from different representations, sources put
together, conflicts resolved. Combine data from multiple
heterogeneous data sources into a coherent data store
a. Aggregation
i. Approaches towards data integration / coupling
mechanism:
1. Schema Integration- Integrate metadata
from sources, Entity-identification, and
conflation
2. Redundancy detection - Attribute that is
inferable mathematically from other
attributes, Detect potential redundancies
through correlation analysis
3. Resolution of data-value conflicts - Attribute
values for same real-world entity may differ
among the available sources, Solution:
Majority voting or pick the most reliable
sources
ii. Tight coupling – data combined from different sources into a single physical location via ETL (Extract, Transform, Load)
iii. Loose Coupling:
1. no unified physical location created
2. data remains only in actual source
databases
3. a query for a record is transformed into a form the source database understands
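A small sketch, with made-up tables, of tight-coupling style integration in pandas: entities from two sources are matched on a shared identifier, and correlation analysis flags an attribute that is redundant with another.

import pandas as pd

# Two illustrative sources describing the same customers.
sales = pd.DataFrame({"customer_id": [1, 2, 3], "revenue": [100, 250, 175]})
crm = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["A", "B", "A"],
                    "revenue_usd": [100, 250, 175]})

# Schema integration / entity identification: match on the shared identifier.
combined = sales.merge(crm, on="customer_id", how="inner")

# Redundancy detection: perfectly correlated numeric attributes are candidates
# for removal (revenue and revenue_usd carry the same information here).
print(combined[["revenue", "revenue_usd"]].corr())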
4. Sampling - Sample as a proxy for the entire population
5. Transformation - Discretization, normalization schemes, redundancy
removal.
a. Morph dataset into a format best suited for the
downstream application
i. Normalization: scale datapoints into a given range, say [-1, 1] (see the scaling sketch after this block)
1. Z-Scale
2. Min-max scaling
3. Decimal Scaling
ii. Feature selection and extraction - Curse of
dimensionality
1. Reduce the number of columns or attributes
(features)
2. Feature selection – select only a subset of features
3. Feature extraction – create and utilize freshly tailored features, e.g. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA)
iii. Discretization
iv. Concept Hierarchy Generation or Generalization
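A short scikit-learn sketch of the scaling and feature-extraction ideas above; the data is random and the feature count arbitrary, so this only illustrates the API pattern rather than a specific dataset.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 5) * 50                                   # arbitrary data

X_minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # min-max scaling to [-1, 1]
X_zscore = StandardScaler().fit_transform(X)                      # z-score scaling

# Feature extraction: project the 5 original features onto 2 principal components.
X_pca = PCA(n_components=2).fit_transform(X_zscore)
print(X_minmax.shape, X_zscore.shape, X_pca.shape)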
6. Reduction – Reduce quantity of data while preserving quality. Key
idea: condensed description minimizing information loss
a. Summarize
b. Data Cube Reduction
c. Data String
ii. Data Pre-processing: Key Steps
1. Data gathering methods are loosely controlled
2. Transform raw data to well-formed datasets
3. Ensure data-quality for further downstream tasks that utilize the
dataset
4. Tailor the collected data into format(s) suitable for tasks in question
a. Cleaning & Cleansing
b. Integration
c. Reduction & transformation
d. Collection & Text Pre-processing: Text (or any other unstructured data) processing is more difficult; hence, pre-processing for unstructured data is more important. Data collection from unstructured sources is more error prone, and textual data is noisy, especially when collected from social media. Use basic NLP tools to pre-process and remove noise, and learn to live with the remaining errors.
i. Extracting Data
ii. Text Pre-processing
1. Basic –
a. Linguistically motivated, but basic implementations
b. Tokenizing - Using NLTK’s word_tokenize as a blackbox
c. Stop words – identifying stop words: typically the high-frequency function words that appear between content words, or words identified based on character counts in strings
d. Spelling variations
i. Method to correct – minimum edit distance to the nearest word in a dictionary. Quantifying dissimilarity between two strings: the minimum number of operations required to transform one string into the other. Set of operations (Levenshtein, 1966); a small implementation sketch follows this item:
1. Insertion ~ uv → uxv
2. Deletion ~ uxv → uv
3. Substitution ~ uxv → uyv
ii. Bing Spell Check suggestions
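A minimal dynamic-programming implementation of the minimum edit distance described above, counting insertions, deletions, and substitutions.

def edit_distance(s, t):
    # dp[i][j] = minimum edits needed to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                  # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                                  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

print(edit_distance("recieve", "receive"))            # 2 single-character edits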
e. Word stemming
i. Stemming: commonly used in information retrieval (IR) to conflate morphological variants
1. Typical stemmer consists of collection of
rules and/or dictionaries
a. The simplest stemmer is “suffix s”
b. The Porter stemmer is a collection of rules, while KSTEM uses lists of words plus rules for inflectional and derivational morphology
c. Similar approach can be used in
many languages
d. Some languages are difficult –
Indian Languages, Finnish, Arabic
etc
2. Small improvements in effectiveness and
significant usability benefit
ii. Lemmatization: maps forms of a word to its lemma based on its intended meaning. Lemmatization requires deep linguistic knowledge and is expensive; it is more involved than stemming
1. Identifies the intended part of speech and disambiguates the word sense based on the surrounding context (sentence, discourse)
2. ‘better’ stems to ‘better’ but lemmatizes to ‘good’
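A short NLTK sketch of the basic pipeline above (tokenizing, stop-word removal, stemming, lemmatization); it assumes the ‘punkt’, ‘stopwords’, and ‘wordnet’ resources have already been downloaded with nltk.download().

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Assumes nltk.download('punkt'), nltk.download('stopwords'),
# and nltk.download('wordnet') have been run beforehand.
text = "The runners were running better races than expected."

tokens = word_tokenize(text.lower())                              # tokenizing as a blackbox
stops = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stops]   # drop stop words

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in content])                         # rule-based suffix stripping

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))                   # -> 'run'
print(lemmatizer.lemmatize("better", pos="a"))                    # -> 'good' (given the POS)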

2. Advanced
a. Linguistically motivated, with more complex implementations
b. Phrase/name identification
i. Goal is to use phrases as text units
ii. Statistical approach: find all pairs of adjacent words (‘bigrams’). The explosion of elements makes this infeasible, and it also adds a lot of nonsense phrases (see the sketch after the sentence-segmentation item below)
iii. NLP Approach
1. Run of words
2. Sentence parsing
3. Statistical models
c. Sentence segmentation: split text into sentences at specific punctuation marks like the period, question mark, and exclamation mark
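A brief NLTK sketch of sentence segmentation and the statistical bigram approach to phrase identification; the text is made up, and it assumes the ‘punkt’ tokenizer models are available.

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Assumes nltk.download('punkt'); the text is illustrative.
text = ("Machine learning is everywhere. Deep learning models need data! "
        "Is data collection the hard part?")

# Sentence segmentation at periods, question marks, and exclamation marks.
print(sent_tokenize(text))

# Statistical phrase identification: score adjacent word pairs (bigrams)
# and keep only the strongest collocations instead of every possible pair.
tokens = [w.lower() for w in word_tokenize(text) if w.isalpha()]
finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 5))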
d. Code-mixing and code-switching :
i. Code-switching: shifting from one linguistic code (a language or dialect) to another, depending on the social context or conversational setting
ii. Code-mixing: placing or mixing of various linguistic
units from two different grammatical systems within
the same sentence and speech context
iii. Code-switching is inter-sentential, while code-mixing is intra-sentential
iv. Language in Online Social Media ~ extensive use of
transliteration
v. Rise of Hinglish, Singlish, etc.
e. Word sense disambiguation
f. Lexical acquisition
g. Parts-of-speech tagging
h. Synonym expansion
