Unit 1 - Data Science Fundamentals


Data Science

Definition of Data Science

• Data Science is the application of computational and statistical techniques to gain insight into a real-world problem expressed using data.
• Computational because it involves algorithmic methods written in code.
• Statistical because inferences based on statistics help us build the predictions that we make.
What is Data?
• The word data comes from the Latin word “datum”
which means “a piece of information”. It is used to
describe things by assigning a value to them.
• Dictionary definition: Facts and statistics collected together for reference or analysis.
• Definition in data science: Data are the values of
qualitative or quantitative variables belonging to a
set of items.
• Variables are the measurements or characteristics of the items.
Why do we use Data?
• To solve a real world problem
• To tell a story
• To find patterns
How do we collect data?
• Primary Data: Data collected by first-hand experience/research.
  Sources: Experiments, Surveys, Interviews, Questionnaires
• Secondary Data: Data that already exists/has been published.
  Sources: Books/newspapers, Web information, Government reports/published census, Research articles
Raw Data
• It is also called source data or atomic data.
• It is unprocessed data which is hard to parse or analyze.
• The processing of raw data may have to be carried out more than once, so a record of the processing steps should be maintained.
Typical Attributes of Raw Data
• No software is run on the data.
• No manipulation of any of the numbers in the data has taken place.
• No data is removed from the data set.
• The data set is not summarized in any way.
Examples of Raw Data
• Binary file generated by a measurement machine.
• Unformatted Excel file.
• JSON from the Twitter API.
• Hand-entered numbers (readings) that you collected.
Processed Data
• It is data which is ready for analysis.
• Processing can include merging, subsetting, transforming, etc.
• Depending on the field of work, there can be standards of processing.
• It is important that all processing steps are recorded.
Expected Attributes of Processed data
• Each variable to be measured should be in one column.
• Each different observation of that variable should be in
a different row.
• There should be one table for each kind of variable.
• If you end up with multiple tables, they should include a column that allows them to be linked. This is important for merging the datasets.
Minor Attributes:
• Include a row at the top of each file with variable names.
• Assign easily understandable variable names.
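As an illustration, here is a minimal pandas sketch (hypothetical data) of these attributes: one variable per column, one observation per row, and a shared id column that lets separate tables be linked and merged.

import pandas as pd

# One measured variable per column, one observation per row.
measurements = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "weight_kg": [61.5, 72.0, 68.3],
})
# A second table of a different kind of variable, sharing the id column.
subjects = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "city": ["Pune", "Mumbai", "Nagpur"],
})

# The common "subject_id" column allows the two tables to be linked/merged.
merged = measurements.merge(subjects, on="subject_id")
print(merged)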
Qualitative data
• Non-numerical; uses words and descriptions; generally not used for comparisons.
Types of qualitative data
• Ordinal Data:
Categorical variables whose levels have an order, e.g. a customer satisfaction survey for a service can be very satisfactory, satisfactory, neutral, unsatisfactory or very unsatisfactory; speed can be low, medium or high.

• Nominal Data:
Categories with no inherent order, e.g. which chocolate do you like, dark or white; gender (female or male), etc.
Quantitative data
• Refers to numbers; can be measured or ranked.
• Example: (figure omitted; source: mathematics_monster.com)
Types of Quantitative Data
• Discrete data
- Can be counted
- Involves whole numbers
- e.g. number of children in your family

• Continuous data
- Numerical data
- Can take any value within a certain range
- e.g. temperature
Case Study: Google Transparency Report

Country     Cr_req   Cr_comply   Ud_req   Ud_comply   Hemisphere   Hdi
Argentina       21         100      134          32   Southern     Very high
Australia       10          40      361          73   Southern     Very high
Belgium          6         100       90          67   Northern     Very high
Brazil         224          67      703          82   Southern     High
USA             92          63     5950          93   Northern     Very high
Description of variables
• Country: Identifier variable, indicating the name of the country for which data are gathered.
• Cr_req: Number of content removal requests made by the respective country. It is a discrete numerical variable.
• Cr_comply: Percentage of content removal requests that Google has complied with. It is a continuous numerical variable.
• Ud_req: Number of user data requests made by the country as part of criminal investigations. It is a discrete numerical variable.
• Ud_comply: Percentage of user data requests that Google has complied with. It is a continuous numerical variable.
• Hemisphere: Whether the country is in the Southern or Northern Hemisphere. It is a nominal categorical variable.
• Hdi: Human Development Index. It combines indicators of life expectancy, educational attainment and income, and is released by the United Nations (UN). It is an ordinal categorical variable.
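One plausible way to encode these variable types in pandas is sketched below, using the rows from the table above; the dtype choices (plain integers/floats for the numerical variables, unordered and ordered categoricals for Hemisphere and Hdi) are one reasonable convention, not the only one.

import pandas as pd

df = pd.DataFrame({
    "country": ["Argentina", "Australia", "Belgium", "Brazil", "USA"],
    "cr_req": [21, 10, 6, 224, 92],                  # discrete numerical
    "cr_comply": [100.0, 40.0, 100.0, 67.0, 63.0],   # continuous numerical (%)
    "ud_req": [134, 361, 90, 703, 5950],             # discrete numerical
    "ud_comply": [32.0, 73.0, 67.0, 82.0, 93.0],     # continuous numerical (%)
    "hemisphere": pd.Categorical(                     # nominal categorical
        ["Southern", "Southern", "Northern", "Southern", "Northern"]),
    "hdi": pd.Categorical(                            # ordinal categorical
        ["Very high", "Very high", "Very high", "High", "Very high"],
        categories=["Low", "Medium", "High", "Very high"], ordered=True),
})
print(df.dtypes)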
Code Book or Metadata
It should contain the following information about the variables:
• Units: whether the unit of a column is, for example, Rs in lacs or in thousands.
• Summary choices: whether the mean or the median is used.
• Information about the source of the data: which database, structured survey, etc. was used.
• A valid link to the database should be given.
• For a structured survey: the population, whether the design was observational or experimental, how samples were selected, any confounding variables or information biases, and the mathematical formulation should be mentioned in the code book.
• Example: ..\Codebook-Example.txt
Case Study
• “Growth in a Time of Debt”, Reinhart C. and Rogoff K., American Economic Review: Papers and Proceedings, Vol. 100.
• The main finding is that across both advanced countries and emerging markets, high debt/GDP levels (90 percent and above) are associated with notably lower growth outcomes.
• Another economist, Thomas Herndon, got hold of the raw Excel file and metadata and showed that selective exclusion of available data and unconventional weighting of summary statistics led to serious errors.
• Hence the published representation of the relationship between public debt and GDP growth is inaccurate.
• “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff”, Herndon T., Ash M., Pollin R., PERI Working Paper Series, No. 32, April 2013.
• This case study highlights the importance of metadata in data processing and, more importantly, of ethics in data science.
Data Science Pipeline
1. Data collection
• All available datasets are gathered from structured/clearly defined data sources (relational databases) and unstructured data sources (emails, social media chats, audio/video and mobile data).
• There are various ways of collecting data, such as the following (a short sketch is given at the end of this slide):
✓ Web scraping (extracting data from websites)
✓ Querying databases (requesting data from databases)
✓ Questionnaires and surveys
✓ Reading from Excel sheets and other documents
• The quality of the collected data determines the quality of the developed solution; the solution is only as good as the data quality.
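A minimal sketch of two of these collection methods in Python, using the requests and pandas libraries; the URL and file name below are placeholders, not real sources.

import pandas as pd
import requests

# Web scraping: fetch the raw HTML of a page (placeholder URL); in practice
# the HTML would then be parsed, e.g. with BeautifulSoup.
page = requests.get("https://example.com")
html = page.text

# Reading from an Excel sheet (placeholder file name; needs the openpyxl package).
survey = pd.read_excel("survey_responses.xlsx")
print(len(html), survey.shape)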
Challenges in extracting the data:
• Real data rarely comes in an easy-to-use format.
• Sometimes data is available in the form of free text. Such data may be interpretable by a human but is a challenge to parse automatically, e.g. a doctor's prescription, which contains the doctor's name, registration number, etc.
• Another challenge is that the data may be well organized but in a format that is difficult to analyze.
• Sometimes data is provided in two different formats that must be combined for joint processing.
Example: MySQL and MongoDB are popular free databases. The data is extracted into a usable format such as CSV, JSON, etc. Hence knowledge of different file formats is also necessary.
File Formats
1. CSV: Comma-Separated Values
2. XML: Extensible Markup Language
3. JSON: JavaScript Object Notation
4. MySQL format
5. HTML: HyperText Markup Language
6. HDF: Hierarchical Data Format
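A hedged sketch of how several of these formats could be loaded with pandas; all file names below are placeholders, and some readers need optional dependencies (e.g. lxml for XML/HTML, PyTables for HDF5).

import pandas as pd

df_csv  = pd.read_csv("data.csv")                 # 1. CSV
df_xml  = pd.read_xml("data.xml")                 # 2. XML (pandas >= 1.3)
df_json = pd.read_json("data.json")               # 3. JSON
# 4. MySQL: pd.read_sql("SELECT * FROM some_table", connection) given a DB connection
tables  = pd.read_html("page.html")               # 5. HTML (returns a list of tables)
df_hdf  = pd.read_hdf("data.h5", key="dataset")   # 6. HDF5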
2. Data Processing
• Time-consuming and laborious.
• This can include:
• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration
  - Integration of multiple databases, data cubes or files.
• Data transformation
  - Normalization and aggregation.
• Data reduction
  - Obtains a reduced representation of the data in volume that produces the same or similar analytical results.
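A minimal pandas sketch (hypothetical data) of the integration, transformation and reduction steps above; cleaning is covered in more detail below.

import pandas as pd

sales  = pd.DataFrame({"store": ["A", "A", "B"], "revenue": [120.0, 80.0, 200.0]})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Pune", "Mumbai"]})

# Data integration: combine the two sources on a common key.
combined = sales.merge(stores, on="store")

# Data transformation: min-max normalization of the revenue column.
rev = combined["revenue"]
combined["revenue_norm"] = (rev - rev.min()) / (rev.max() - rev.min())

# Data reduction: aggregate down to one summary row per store.
reduced = combined.groupby("store", as_index=False)["revenue"].sum()
print(reduced)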
Data cleaning
• It is defined as the process of ensuring the correctness, consistency and usability of the data.
• Data in the real world is dirty. The extracted data may contain missing values, irrelevant features or duplicate values. This may happen due to a faulty instrument, human or computer error, transmission errors, etc.
• Some types of incorrect (dirty) data:
➢ Incomplete/missing data: lacking attribute values, or lacking certain attributes of interest,
e.g. Occupation = “ ”
➢ Noisy data: containing noise, errors or outliers,
e.g. Salary = “−10”
➢ Inconsistent data: containing discrepancies in codes or names,
e.g.
1. Age = “42”, Birthday = “03/07/2010”
2. A rating that was “1, 2, 3” is now “A, B, C”
3. Discrepancies between duplicate records
➢ Intentional (e.g. disguised missing data),
e.g. Jan. 1 recorded as everyone’s birthday
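A small sketch of how such dirty records might be detected with pandas, on a hypothetical dataset that mirrors the examples above (the reference year 2024 used to cross-check age against birthday is an assumption).

import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", "", "teacher"],
    "salary": [55000, -10, 48000],
    "age": [42, 30, 25],
    "birthday": ["03/07/2010", "15/01/1994", "20/05/1999"],
})

missing = df["occupation"].replace("", pd.NA).isna()        # incomplete/missing value
noisy = df["salary"] < 0                                     # impossible (noisy) value
birth_year = pd.to_datetime(df["birthday"], dayfirst=True).dt.year
inconsistent = (2024 - birth_year) != df["age"]              # age vs. birthday mismatch

print(df[missing | noisy | inconsistent])                    # rows flagged for cleaning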
Data Quality Criteria
1. Validity: The degree to which the data conform to defined business rules or constraints, e.g. dates should fall within a typical range, certain columns cannot be empty, etc.
2. Accuracy: The degree to which the data are close to the true values, e.g. a street address is given in a valid format but is not true, i.e. the address does not exist.
3. Completeness: The degree to which all required data is known.
4. Consistency: The degree to which the data is consistent. Inconsistency occurs when two values in the data set contradict each other, e.g. a boy aged 10 years is recorded as a senior citizen.
5. Uniformity: The degree to which the data is specified using the same units, e.g. amounts recorded in different countries’ currencies without conversion.
“Garbage in, garbage out” (GIGO) is a fundamental principle of computing.
• Data cleaning steps may involve:
✓ Selecting variables (columns) and filtering observations (rows)
✓ Mutating the data (creating, recoding and transforming variables)
✓ Summarizing the data to reduce multiple values to a single value, such as the mean
✓ Using techniques such as feature selection and data imputation
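A short pandas sketch (hypothetical data) of these cleaning steps: selecting, filtering, mutating, summarizing and a simple imputation.

import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [25, None, 40, 35],
    "income": [30000, 42000, None, 51000],
    "notes": ["ok", "ok", "check", "ok"],
})

selected = df[["age", "income"]]                    # selecting variables (columns)
filtered = selected[selected["age"] > 30]           # filtering observations (rows)
mutated = selected.assign(                          # mutating: create a transformed variable
    income_thousands=selected["income"] / 1000)
summary = mutated["income"].mean()                  # summarizing to a single value
imputed = mutated.fillna(mutated.mean(numeric_only=True))   # simple data imputation

print(filtered.shape, round(summary, 1), imputed.isna().sum().sum())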
3. Data Exploration/Visualization

• This is the quickest and most powerful technique to


understand new and existing information in the data.
• Different types of visualization and statistical
testing techniques are used. Initially we try to reveal
the underlying features of a dataset like different
distributions, correlations or other visible patterns.
This process is also called exploratory data
analysis (EDA).
• Visualizations uncover outliers and data errors which
the data scientist needs to take care about.
• Revealed patterns can inspire hypothesis
about the underlying processes or
modelling techniques to be tested.
• This process can also help in feature
extraction step that is used to identify
and test significant variables. Extracting
features that are most important for a
given problem will always result in a
relevant and more accurate machine
learning model .
• Case Study: Anscombe’s quartet
• https://en.wikipedia.org/wiki/Anscombe%27s_quartet
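A short EDA sketch of this case study, using the copy of Anscombe’s quartet bundled with seaborn (load_dataset fetches the sample data on first use): the four datasets share nearly identical summary statistics yet look completely different when plotted, which is exactly why visual exploration matters.

import seaborn as sns
import matplotlib.pyplot as plt

anscombe = sns.load_dataset("anscombe")                    # columns: dataset, x, y
print(anscombe.groupby("dataset")[["x", "y"]].mean())      # near-identical means

# One regression panel per dataset reveals the very different structures.
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()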
4. Data Analysis/Machine Learning
• The objective of this step is to do in-depth analytics, mainly the creation of relevant machine learning models or algorithms for prediction.
• After developing a model, we measure its performance by evaluating/testing it. This involves multiple cycles of evaluation and optimization.
• The accuracy of the algorithm can be improved by training it with fresh data, minimizing losses, etc.
• Sometimes we may have to test multiple models for their performance, error rate, etc. and select the optimal model as per the requirement.
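A hedged scikit-learn sketch of this step: train two candidate models on the bundled iris data, evaluate each on held-out data, and keep the better one; the choice of models and of a 70/30 split is illustrative only.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Compare candidate models on the held-out test set and report accuracy.
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(acc, 3))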
5. Decision
• The objective of this step is to first identify the business insight and correlate it with a decision.
• Interpreting the data is largely about communicating the decision to the interested parties.
• This can involve domain experts in correlating the decisions with business problems.
• Domain experts help in visualizing the decisions along the business dimensions, which also assists in communicating the facts to non-technical audiences.
6. Feedback
• After the decision phase, which may involve domain experts, we deploy the model to get feedback from the model’s users (customers).
• These users provide feedback about the model so that you can refine it further, evaluate it and deploy it again. The process repeats until you have a final model.
• It is also important to revisit and update the model on a periodic basis, depending on the frequency of new data generation.
• The more new data that is received, the more frequent the feedback and model updates should be.
