Data Science 1


DEPARTMENT OF ARTIFICIAL INTELLIGENCE & MACHINE LEARNING

INTRODUCTION TO DATA SCIENCE

LECTURE NOTES – UNIT 1

B. TECH
II YEAR – II SEM (Sec-A & B)
Academic Year 2022-23

Prepared & compiled by

DR.G. ARUN SAMPAUL THOMAS,


ASSOCIATE PROFESSOR & HOD, DEPARTMENT OF AI&ML
J.B.I.E.T
Bhaskar Nagar, Yenkapally(V), Moinabad(M),

Ranga Reddy(D), Hyderabad – 500 075, Telangana, India.


J. B. Institute of Engineering and Technology (UGC Autonomous)
AY 2020-21 onwards    B. Tech: AI & ML    II Year – II Sem
Course Code: J22D3    INTRODUCTION TO DATA SCIENCE
L T P D: 2 0 0 0    Credits: 2

Pre-requisite:
Database Management Systems, Data Structures

Course Objectives:
This course will enable students to:
• Know about the fundamental concepts and technologies of Data Science.
• Explore the various Data collection and storage methods.
• Understand the Data Analysis, statistics, and various machine learning algorithms.
• Investigate the visualization of data and apply coding techniques to data for
securing the data.
• Study the applications of Data Science, technologies for visualization, and handling of
variables using Python.

UNIT-I - Introduction to Data Science


Introduction to core concepts and technologies: Introduction, Terminology, Data science
Process, data science toolkit, Types of data, Example applications

UNIT-II - Data collection and management:


Introduction, Sources of data, Data collection and APIs, Exploring and fixing data. Data storage
and management, using multiple data sources.

UNIT-III - Data analysis:


Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and
distributions, Variance, Distribution properties and arithmetic, Samples/CLT. Basic machine
learning algorithms, Linear regression, SVM, Naive Bayes.

UNIT-IV - Data visualization:


Introduction, Types of data visualization, Data for visualization:
Data types, Data encodings, Retinal variables, mapping variables to encodings, Visual
encodings.

UNIT-V - Practices and Case Studies in Data Science:


Applications of Data Science, Technologies for visualization, Recent trends in various data
collection and analysis techniques, various visualization techniques, application development
methods used in data science. Demonstrate some case studies like Marketing, Finance, HR,
Manufacturing, Healthcare etc

Textbooks:
1. Cathy O’Neil, Rachel Schutt, Doing Data Science, Straight Talk from the Frontline. O’Reilly,
2013.
2. Jure Leskovec, Anand Rajaraman, Jeffrey Ullman, Mining of Massive Datasets. v 2.1,
Cambridge University Press, 2014.
Reference Books:
1. Joel Grus, “Data Science from Scratch”, O'Reilly, 2015.
2. Gupta, S.C. and Kapoor, V.K.: “Fundamentals of Mathematical Statistics”, Sultan
Chand & Sons, New Delhi, 11th Ed, 2002.
3. Hastie, Trevor, et al. “The Elements of Statistical Learning”, Springer, 2009.
4. Wes McKinney, “Python for Data Analysis”, O'Reilly Media, 2012.

Course Outcomes:
The student will be able to
• Identify the basic concepts of data science and identify the types of data.
• Analyse how to collect, manage, explore, and store data.
• Implement the basic measures of central tendency and classify data using SVM and
Naive Bayes.
• Interpret the visualization of data and apply coding techniques to data for securing the
data.
• Analyse the various concepts of data science and handle simple
applications of data science using Python.

WEBSITE REFERENCES FOR SELF LEARNING


1. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
2. https://www.rstudio.com/online-learning/
INTRODUCTION TO DATA SCIENCE

UNIT– I
• DATA SCIENCE BASICS
• Intro to DS

DR. G. ARUN SAMPAUL THOMAS


Associate Professor & HOD – Department of AI&ML
J.B. Institute of Engineering and Technology
Hyderabad, Telangana
arunsam.infotech@gmail.com arunthomas.ai_ml@jbiet.edu.in
WHAT IS DATA SCIENCE?
…solving problems with data…

scientific, social, or business problem → understand problem → collect, clean & format data → use data to create solution

…sounds cool!
What makes a good data scientist?
WHAT IS DATA SCIENCE?
…which step is most challenging?

• use data to create solution: data analysis or machine learning (or both)
WHAT IS DATA ANALYSIS?
…using data to discover useful information…

• data: anything you can measure or record
• statistics: summarize (and visualize) main characteristics of the data
• algorithms: apply algorithms to find patterns in the data
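As a rough illustration of the "statistics" bullet, the sketch below summarizes a small list of numbers with Python's standard `statistics` module. The scores are hypothetical values invented for this example, not data from the course.

```python
import statistics

# Hypothetical sample: exam scores for ten students (illustrative values only)
scores = [52, 61, 67, 70, 70, 74, 78, 81, 85, 92]

# Summarize the main characteristics of the data
summary = {
    "count": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),   # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

A handful of numbers like these is often the first thing to look at before applying any algorithm to the data.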
WHAT IS MACHINE LEARNING?
…creating and using models that learn from data…

• data: anything you can measure or record
• model: specification of a (mathematical) relationship between different variables
• evaluation: how well does the model work?
WHAT IS MACHINE LEARNING?
• Traditional CS: data + program → output
• Machine Learning: data + output → program, which then maps new data to output
WHAT IS MACHINE LEARNING?
…creating and using models that learn from data…

Examples
• Identifying zip code from handwritten digits
• Detecting communities in social networks
• Predicting the traffic volume at rush hour
• Detecting fraudulent credit card transactions
• Determining the location of distribution centers based on customers’ residence
[DSFS] p 3-13
LEARNING FROM DATA
• Regression
• Classification
• Clustering
WHAT IS MACHINE LEARNING?

…creating and using models that learn from data…

• come up with predictions
• extract knowledge/insights
→ unsupervised learning/data mining

[PDSH] p 332-342
[DSFS] p 141-142
ACTIVITY 1
…creating and using models that learn from data…

Categorize these Examples
• Identifying zip code from handwritten digits
• Detecting communities in social networks
• Predicting the traffic volume at rush hour
• Detecting fraudulent credit card transactions
• Determining the location of distribution centers based on customers’ residence
MACHINE LEARNING WORKFLOW
• training phase, test phase, evaluation phase

[Diagram: data → model → output; output and ground truth data → performance measure]

→ let’s have a closer look at the data we are using
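The three phases can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the data are made up, the 70/30 split is arbitrary, and the "model" is a 1-nearest-neighbour rule chosen because it needs no library.

```python
import random

# Hypothetical labelled data: (hours_studied, passed) pairs, made up for illustration
data = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 1),
        (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]

rng = random.Random(0)          # fixed seed so the split is repeatable
shuffled = data[:]
rng.shuffle(shuffled)

# Training phase: for 1-nearest-neighbour, "training" is just storing the points
split = int(0.7 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]

def predict(x, training_points):
    """Copy the label of the closest training point."""
    nearest = min(training_points, key=lambda point: abs(point[0] - x))
    return nearest[1]

# Test phase: run the model on data it has not seen
outputs = [predict(x, train) for x, _ in test]
ground_truth = [label for _, label in test]

# Evaluation phase: compare the outputs with the ground truth
accuracy = sum(o == t for o, t in zip(outputs, ground_truth)) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Accuracy is only one possible performance measure; the right choice depends on the problem.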
ACTIVITY 2
• Example: Census Data

• training data and test data


DATA
• Notation:
  • D: all observed data
  • X: all features
  • y: observations
  • ☐TE: test
  • ☐TR: training
  • ŷ: predictions
• Helper Notation:
  • n: number of data points
  • d: number of features
  • m: number of training points
  • ☐1,…,i,…,n: indices for data points
  • ☐1,…,j,…,d: indices for features

• What data structure to use?
  • set, list, or array?
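One possible way to hold this notation in Python is sketched below, using plain lists. The numbers and the variable names `X_TR`, `X_TE`, `y_TR`, `y_TE` are assumptions made for this example.

```python
# Hypothetical data set D with n = 4 data points and d = 2 features
X = [
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
    [5.9, 3.2],
]
y = [0, 0, 1, 1]            # one observation per data point

n = len(X)                  # number of data points
d = len(X[0])               # number of features
m = 3                       # number of training points (chosen arbitrarily here)

X_TR, y_TR = X[:m], y[:m]   # training part
X_TE, y_TE = X[m:], y[m:]   # test part

print(n, d, m)
print(X[1][0])              # feature j = 0 of data point i = 1
```

A list (or an array) keeps the order of the data points, so row i of X lines up with entry i of y. A set would lose both the order and any duplicate rows, which is why it is usually a poor fit here.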
SUMMARY & READING
[Venn diagram: math & statistics, hacking skills, expertise]

• Data Science is about data, models, and evaluation
• Data Science can solve a wide variety of problems –
once we have the right data and model!
INTRODUCTION TO DATA SCIENCE

UNIT– I
• Terminologies
• DS Process
• Data Scientist Process
Basic Terminologies

• Data
  • It can be generated, collected, or retrieved.
• Simulation
• Similarity Measures
• Data Structures
• Algorithms
• Data: raw facts with no meaning attached.
• Information: what is learned from facts.
• Knowledge: practical understanding of a subject.
• Understanding: the ability to absorb knowledge and learn to reason.
• Wisdom: the quality of having experience and good judgment; the ability to
think and foresee.
• Validity: ways to confirm truth.
• Cross-sectional data: data observed without a time dimension.
• Temporal data: data observed over time (time series).
• Spatial data: data that considers location, e.g. coordinate determination on touch-screen phones.
• Temporal cum spatial data (GIS): data that considers change with the passage of time, e.g.
population density.

• Measurements of Scales
There are 4 scales of measurement
• Nominal: classifies data into categories, e.g. male/female.
• Ordinal: orders data; values can be numerical or non-numerical, e.g. time of
day (dawn, morning, noon, afternoon, evening, night).
• Interval: differences between values are meaningful but there is no natural zero, e.g. temperature.
• Ratio: differences and ratios are both meaningful and there is a natural zero, e.g. weight, height, number of children.
Why DS Now?

• We have massive amounts of data about many aspects of our lives. What people
might not know is that the “datafication” of our offline behavior has started as well.
• On the Internet, this means Amazon recommendation systems, friend
recommendations on Facebook, film and music recommendations, and so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and
assessments coming out of places like Knewton and Khan Academy.
• In government, this means policies based on data.
Datafication

• In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they
discuss the concept of datafication.
They define datafication as a process of “taking all aspects of
life and turning them into data.”

• They follow up their definition in the article with a line that speaks volumes
about their perspective:
“Once we datafy things, we can transform their purpose and
turn the information into new forms of value.”
Datafication
Examples:
• We quantify friendships with “likes”.
• Google’s augmented-reality glasses datafy the gaze.
• Twitter datafies stray thoughts.
• LinkedIn datafies professional networks.
• When we “like” someone or something online, we are intending to be
datafied.
• When we browse the Web, we are unintentionally datafied through cookies.
• When we walk around in a store, or even on the street, we are being
datafied via sensors, cameras, or Google glasses.
• The spectrum runs from taking part in a social media experiment
to all-out surveillance and stalking.

But it’s all datafication.


Data Science Process
A Data Scientist’s Role in This Process
The growth in data scientist job postings on Indeed, from December 2016 to December 2018
OK, So What Is a Data Scientist, Really?
Perhaps the most concrete approach is to define data science by its usage.
• In Academia
• An academic data scientist is a scientist, trained in anything from social science to
biology, who works with large amounts of data and must grapple with
computational problems posed by the structure, size, messiness, complexity,
and nature of the data, while simultaneously solving a real-world
problem.
• In Industry
More generally, a data scientist is someone who knows
• how to design experiments,
• how to carry out the process of collecting, cleaning, and munging data
(skills that are also necessary for understanding biases in the data and for
debugging logging output from code),
• how to do exploratory data analysis, which combines visualization and data sense,
• how to find patterns and build models and algorithms,
• how to use analyses for decision making.
What Is a Data Scientist
Data Engineers are the data professionals who prepare the “big data” infrastructure to be
analyzed by Data Scientists.

A Data Analyst is someone who merely curates meaningful insights from data.

A data scientist is a professional with the capabilities to gather large amounts of
data to analyze and synthesize the information into actionable plans for companies
and other organizations.
INTRODUCTION TO DATA SCIENCE

UNIT– I
• Data Science Toolkits
• DS Techniques

Data Science Tools
- R
- Python
- Tableau
- Spark with ML
- Hadoop (Pig and Hive)
- SAS
- SQL
Data Science with R
A popular language in Data Science
What Is R
https://www.r-project.org/about.html
R is an integrated suite of software facilities for data manipulation, calculation
and graphical display. It includes
● an effective data handling and storage facility,
● a suite of operators for calculations on arrays, in particular matrices,
● a large, coherent, integrated collection of intermediate tools for data
analysis,
● graphical facilities for data analysis and display either on-screen
or on hardcopy, and
● a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input and
output facilities.
Install R
https://cran.r-project.org/bin/windows/base/
Install RStudio
https://www.rstudio.com/products/rstudio/download/
Statistical Software Landscape
SAS
Matlab
Python (Pandas)
JMP
IBM SPSS
EViews
R
Julia
Clojure
Octave
Using R with other software
https://rforanalytics.wordpress.com/useful-links-for-r/using-r-from-other-software/

Tableau http://www.tableausoftware.com/new-features/r-integration

Qlik http://qliksolutions.ru/qlikview/add-ons/r-connector-eng/

Oracle R http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html

Rapid Miner https://rapid-i.com/content/view/202/206/lang,en/#r

JMP http://blogs.sas.com/jmp/index.php?/archives/298-JMP-Into-R!.html

SAS/IML http://www.sas.com/technologies/analytics/statistics/iml/index.html

Teradata http://developer.teradata.com/applications/articles/in-database-analytics-with-teradata-r

Pentaho http://bigdatatechworld.blogspot.in/2013/10/integration-of-rweka-with-pentaho-data.html

IBM SPSS https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=ibm-analytics&S_PKG=ov18855&S_TACT=M161003W&dynform=127&lang=en_US

TIBCO TERR
http://spotfire.tibco.com/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr
Some Advantages of R
open source
free
large number of algorithms and packages esp for statistics
flexible
very good for data visualization
superb community
rapidly growing
can be used with other software
Some Disadvantages of R
in memory (RAM) usage
steep learning curve
some IT departments frown on open source
verbose documentation
tech support
evolving ecosystem for corporates
Solutions for Disadvantages of R
in memory (RAM) usage → specialized packages, in-database computing
steep learning curve → TRAINING !!!
some IT departments frown on open source → TRAINING and education!
verbose documentation → CRAN View, R Documentation
tech support → expanding pool of resources
evolving ecosystem for corporates → getting better with MS et al






http://www.sas.com/en_in/software/university-edition/download-software.html




Python
What is Python
Python is a widely used general-purpose, high-level programming language

Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would
be possible in languages such as C++ or Java.

Python is used widely

https://www.python.org/about/success/
Object Oriented Programming (OOPS)
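Objects are the things a computer program consists of, such as variables, expressions, functions or modules.

name = "ajay"                  # a variable holding a string

print(name)                    # a function call

# import printer               # a module ("printer" is a placeholder name, not a standard module)

print("Hi I am %s" % name)     # an expression that formats the variable into a string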

Object-oriented programming (OOP) is a programming paradigm based on the concept of "objects", which are data structures that contain
data, in the form of fields, often known as attributes; and code, in the form of procedures, often known as methods.
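A minimal sketch of these ideas in Python follows. The `Student` class, its fields, and the marks are hypothetical and exist only to show attributes and methods side by side.

```python
class Student:
    """A hypothetical class used only for illustration."""

    def __init__(self, name, marks):
        # Fields (attributes): the data the object holds
        self.name = name
        self.marks = marks

    def average(self):
        # Method: a procedure that works on the object's own data
        return sum(self.marks) / len(self.marks)


s = Student("ajay", [70, 80, 90])
print(s.name)
print(s.average())
```

Here `s` is one object; creating another `Student` gives a second object with its own copy of the fields but the same methods.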

Dynamic programming language is a term used in computer science to describe a class of high-level programming languages which, at
runtime, execute many common programming behaviors that static programming languages perform during compilation.

The term "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower level language (e.g.,
assembly language or machine code).
All together now
PIG http://www.slideshare.net/Mathias-Herberts/hadoop-pig-syntax-card
HDFS https://github.com/michiard/CLOUDS-LAB/blob/master/C-S.md
R http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Python https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+Pandas,+SciPy,+NumPy+Cheat+Sheet.pdf
Python http://www.astro.up.pt/~sousasag/Python_For_Astronomers/Python_qr.pdf
Java http://introcs.cs.princeton.edu/java/11cheatsheet/
Linux http://www.linuxstall.com/linux-command-line-tips-that-every-linux-user-should-know/
SQL http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
Git http://overapi.com/static/cs/git-cheat-sheet.pdf
R
R provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, …) and graphical
techniques, and is highly extensible.

Python
http://python-history.blogspot.in/ and https://www.python.org/
SAS
http://www.sas.com/en_in/home.html
Data Science Techniques
- Machine Learning
- Regression
- Logistic Regression
- K Means Clustering
- Association Analysis
- Decision Trees
- Text Mining
- Social Network Analysis
- Time Series Forecasting
- LTV and RFM Analysis
- Pareto Analysis
What is an algorithm

● a process or set of rules to be followed in calculations or other
problem-solving operations, especially by a computer.

● a self-contained step-by-step set of operations to be performed

● a procedure or formula for solving a problem, based on conducting a
sequence of specified actions

● a procedure for solving a mathematical problem (as of finding the greatest
common divisor) in a finite number of steps that frequently involves
repetition of an operation; broadly: a step-by-step procedure for solving a
problem or accomplishing some end especially by a computer.
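The greatest common divisor mentioned in the last definition makes a convenient worked example. The sketch below uses Euclid's algorithm, which finishes in a finite number of steps by repeating one operation (taking a remainder).

```python
def gcd(a, b):
    """Greatest common divisor by Euclid's algorithm."""
    while b != 0:
        a, b = b, a % b     # the repeated operation
    return a

print(gcd(48, 18))
```

For 48 and 18 the pairs visited are (48, 18), (18, 12), (12, 6), (6, 0), so the result is 6.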
Machine Learning

Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning
system could be trained on email messages to learn to distinguish between spam and non-spam messages.

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a
set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a
desired output value (also called the supervisory signal).

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a
training set of correctly identified observations is available.

In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the
examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes
unsupervised learning from supervised learning.

The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories
based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional
vector space).
CRAN VIEW Machine Learning
Machine Learning in Python
Classification
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a
new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership
is known.
The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features,
etc.
These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type),
ordinal (e.g. "large", "medium" or "small"),
integer-valued (e.g. the number of occurrences of a particular word in an email) or
real-valued (e.g. a measurement of blood pressure).

Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups
(e.g. less than 5, between 5 and 10, or greater than 10).
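The discretization step can be sketched as a small function. The group names follow the example in the text; how the boundary values 5 and 10 are treated is an assumption made here (both are placed in the middle group).

```python
def discretize(value):
    """Map a number to one of three groups (boundary handling is an assumption)."""
    if value < 5:
        return "less than 5"
    if value <= 10:
        return "between 5 and 10"
    return "greater than 10"

# Hypothetical real-valued and integer-valued measurements
measurements = [2.5, 5, 7.1, 10, 12.8]
groups = [discretize(v) for v in measurements]
print(groups)
```

After this step every observation carries a categorical label, which is the form such algorithms expect.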
Regression

Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for
modeling and analyzing several variables, when the focus is on the relationship between
a dependent variable and one or more independent variables.

More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable')
changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent
variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the
focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent
variables.
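For the simplest case, one independent variable and a straight-line relationship, the estimate can be computed directly with ordinary least squares. The sketch below is a minimal pure-Python version; the data are hypothetical and lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted line gives the estimated average value of the dependent variable for each value of the independent variable.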
kNN
Support Vector Machines

http://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf
Association Rules

http://en.wikipedia.org/wiki/Association_rule_learning
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between
products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.
For example, a rule such as {onions, potatoes} ⇒ {hamburger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes
together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing
activities such as promotional pricing or product placements.
In addition to the above example from market basket analysis, association rules are employed today in many application areas
including Web usage mining, intrusion detection, continuous production, and bioinformatics. As opposed to sequence mining,
association rule learning typically does not consider the order of items either within a transaction or across transactions.

Concepts: Support, Confidence, Lift


In R
apriori() in arules package
In Python
http://orange.biolab.si/docs/latest/reference/rst/Orange.associate/
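Support, confidence, and lift can be computed by hand on a toy data set. The sketch below is not the `apriori()` function from arules; it only evaluates one rule, and the five transactions are made up for illustration.

```python
# Hypothetical point-of-sale transactions (made up for illustration)
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"burger", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

antecedent = {"onions", "potatoes"}
consequent = {"burger"}

# Support: how often the whole rule occurs
rule_support = support(antecedent | consequent)
# Confidence: how often the consequent appears when the antecedent does
confidence = rule_support / support(antecedent)
# Lift: confidence relative to how common the consequent is anyway
lift = confidence / support(consequent)

print(round(rule_support, 3), round(confidence, 3), round(lift, 3))
```

A lift above 1 suggests the items occur together more often than they would if they were independent.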
Gradient Descent

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent,
one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

http://econometricsense.blogspot.in/2011/11/gradient-descent-in-r.html

Start at some x value, use derivative at that value to tell
us which way to move, and repeat. Gradient descent.

http://www.cs.colostate.edu/%7Eanderson/cs545/Lectures/week6day2/week6day2.pdf
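The "start, use the derivative, repeat" recipe can be sketched for a function of one variable. The function f(x) = (x - 3)^2, the step size, and the number of steps are assumptions chosen for this example.

```python
def gradient_descent(derivative, x_start, learning_rate=0.1, num_steps=100):
    """Repeatedly step against the derivative, starting from x_start."""
    x = x_start
    for _ in range(num_steps):
        # Step size is proportional to the negative of the derivative
        x = x - learning_rate * derivative(x)
    return x

# f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x_start=0.0)
print(round(x_min, 4))
```

If the learning rate is too large the steps overshoot and can diverge; if it is too small, many more steps are needed.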
Gradient Descent

https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
A standard approach to solving this type of problem is to define an error function (also
called a cost function) that measures how “good” a given line is.

initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
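A minimal sketch of how these starting values might be used is given below, with mean squared error as the error function. The data points and the learning rate are assumptions (the text gives neither), and the helper names are hypothetical.

```python
def mean_squared_error(b, m, points):
    """Error function: average squared vertical distance from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)

def step_gradient(b, m, points, learning_rate):
    """One gradient descent step on the mean squared error."""
    n = len(points)
    grad_b = sum(-2 * (y - (m * x + b)) for x, y in points) / n
    grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in points) / n
    return b - learning_rate * grad_b, m - learning_rate * grad_m

# Hypothetical points on the line y = 2x + 1
points = [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11)]

b, m = 0.0, 0.0             # initial y-intercept and slope guesses
num_iterations = 1000
learning_rate = 0.05        # assumed value; the text does not give one

for _ in range(num_iterations):
    b, m = step_gradient(b, m, points, learning_rate)

print(round(m, 3), round(b, 3), mean_squared_error(b, m, points))
```

Each iteration moves the line a little in the direction that reduces the error, so the slope and intercept drift toward the best-fitting values.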
Decision Trees

http://select.cs.cmu.edu/class/10701-F09/recitations/recitation4_decision_tree.pdf
Decision Trees

http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf
Random Forest

Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of
the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the
classification having the most votes (over all the trees in the forest).
Each tree is grown as follows:
1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This
sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out
of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
In the original paper on random forests, it was shown that the forest error rate depends on two things:
● The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
● The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of
the individual trees decreases the forest error rate.

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
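The sampling and voting steps described above can be sketched without growing any trees. The code below covers only step 1 (bootstrap sample), step 2 (random feature subset at a node), and the final vote; the tree predictions passed to the vote are hypothetical stand-ins, and all function names are invented for this example.

```python
import random
from collections import Counter

rng = random.Random(42)     # fixed seed for repeatability

def bootstrap_sample(cases):
    """Step 1: draw N cases at random, with replacement, from N cases."""
    return [rng.choice(cases) for _ in cases]

def random_feature_subset(num_features, m):
    """Step 2: pick m of the M input variables to consider at a node."""
    return rng.sample(range(num_features), m)

def forest_vote(tree_predictions):
    """The forest chooses the class with the most votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

cases = list(range(10))                 # stand-ins for 10 training cases
sample = bootstrap_sample(cases)
features = random_feature_subset(num_features=8, m=3)
winner = forest_vote(["spam", "ham", "spam", "spam", "ham"])

print(sample)
print(features)
print(winner)
```

Because sampling is with replacement, some cases usually appear more than once in a sample while others are left out.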
Bagging

Bagging, aka bootstrap aggregation, is a relatively simple way to increase the
power of a predictive statistical model by taking multiple random samples (with
replacement) from your training data set, and using each of these samples to
construct a separate model and separate predictions for your test set. These
predictions are then averaged to create a, hopefully more accurate, final
prediction value.

http://www.vikparuchuri.com/blog/build-your-own-bagging-function-in-r/
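A small sketch of the idea, under stated assumptions: the base model is a least-squares line, the training data are made-up points scattered around y = 2x, and 25 bootstrap models are averaged. None of these choices come from the text.

```python
import random

def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:                        # all x equal: fall back to a flat line
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bagged_predict(train, x_new, num_models=25, seed=0):
    """Average the predictions of models fitted to bootstrap samples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(num_models):
        sample = [rng.choice(train) for _ in train]    # with replacement
        slope, intercept = fit_line(sample)
        predictions.append(slope * x_new + intercept)
    return sum(predictions) / len(predictions)

# Hypothetical noisy training data scattered around y = 2x
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.2)]
prediction = bagged_predict(train, x_new=7)
print(round(prediction, 2))
```

Averaging tends to help most when the individual models are unstable, which is why bagging is usually paired with decision trees rather than straight lines.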
Boosting

Boosting is one of several classic methods for creating ensemble models,
along with bagging, random forests, and so forth. Boosting means that each
tree is dependent on prior trees, and learns by fitting the residual of the trees
that preceded it. Thus, boosting in a decision tree ensemble tends to improve
accuracy with some small risk of less coverage.
XGBoost is a library designed and optimized for boosting trees algorithms.
XGBoost is used in more than half of the winning solutions in machine learning
challenges hosted at Kaggle.

http://xgboost.readthedocs.io/en/latest/model.html#
And http://dmlc.ml/rstats/2016/03/10/xgboost.html
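The phrase "fitting the residual of the trees that preceded it" can be sketched with one-split trees (stumps) on one-dimensional data. This is a toy illustration, not XGBoost: the data, the number of trees, and the learning rate are assumptions, and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump: (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, num_trees=50, learning_rate=0.3):
    """Each new stump is fitted to the residual left by the stumps before it."""
    predictions = [0.0] * len(xs)
    stumps = []
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return stumps, predictions

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]    # hypothetical step-shaped data

stumps, fitted = boost(xs, ys)
mse_start = sum(y ** 2 for y in ys) / len(ys)
mse_end = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(ys)
print(mse_start, mse_end)
```

Unlike bagging, the trees here cannot be built independently, because each one depends on the errors left by the ones before it.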
Top 10 Data Analytics Tools 2020
(Currently in use with Various Organizations)

https://www.youtube.com/watch?v=P-bKqfKhqR8
Top 10 Data Science Tools For 2022
Data Science Tools and Libraries

https://www.youtube.com/watch?v=zVBcmTkJqpo
INTRODUCTION TO DATA SCIENCE

UNIT– I
• Types of Data
• DS Applications & Use Cases

Data All Around

• Lots of data is being collected and warehoused
– Scientific Experiments
– Internet of Things
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social Network
– ……many more!
What To Do With These Data?

• Aggregation and Statistics
– Data warehousing and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
• Data Driven
– Predictive Analytics
– Deep Learning
Statistical and Critical Thinking
Analyzing Data: Potential Pitfalls
• Misleading Conclusions
When forming a conclusion based on a statistical analysis, we should make statements that are clear
even to those who have no understanding of statistics and its terminology.
• Sample Data Reported Instead of Measured
When collecting data from people, it is better to take measurements yourself instead of asking
subjects to report results.
• Loaded Questions
If survey questions are not worded carefully, the results of a study can be misleading.
• Order of Questions
Sometimes survey questions are unintentionally loaded by the order of the items being considered.
• Nonresponse
A nonresponse occurs when someone either refuses to respond or is unavailable.
• Percentages
Some studies cite misleading percentages. Note that 100% of some quantity is all of it, but if there
are references made to percentages that exceed 100%, such references are often not justified.
Types of Data, Key Concept

A major use of statistics is to collect and use sample data to make conclusions
about populations.

Parameter & Statistic

• Parameter
a numerical measurement describing some
characteristic of a population
• Statistic
a numerical measurement describing some
characteristic of a sample
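The difference between the two can be sketched with synthetic data. The population below is simulated (1,000 heights drawn from a normal distribution), and the sample size of 50 is an arbitrary choice for this example.

```python
import random
import statistics

rng = random.Random(1)      # fixed seed so the sample is repeatable

# Hypothetical population: heights in cm of 1,000 people (synthetic values)
population = [rng.gauss(170, 10) for _ in range(1000)]

# Parameter: describes the whole population
parameter = statistics.mean(population)

# Statistic: describes only the sample drawn from it
sample = rng.sample(population, 50)
statistic = statistics.mean(sample)

print(round(parameter, 2), round(statistic, 2))
```

In practice the population is rarely available in full, so the statistic is used to draw a conclusion about the unknown parameter.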

Types of Data

Quantitative Data & Categorical Data

• Quantitative (or numerical) data
consists of numbers representing counts or measurements.
Example: The weights of supermodels
Example: The ages of respondents

• Categorical (or qualitative or attribute) data
consists of names or labels (not numbers that represent counts or measurements).
Example: The gender (male/female) of professional athletes
Example: Shirt numbers on professional athletes’ uniforms, which are substitutes for names
Types of Data, Quantitative Data

Discrete & Continuous types:

• Discrete data
result when the data values are quantitative and the number of values is
finite, or “countable.”
Example: The number of tosses of a coin before getting tails

• Continuous (numerical) data
result from infinitely many possible quantitative values, where the
collection of values is not countable.
Example: The lengths of distances from 0 cm to 12 cm
Types of Data, Quantitative Data

Data
• Qualitative: Categorical
• Quantitative: Numerical, can be ranked
  • Discrete: Countable (5, 29, 8000, etc.)
  • Continuous: Can be decimals (2.59, 312.1, etc.)
Types of Data, Levels of Measurement:
Another way of classifying data: 4 levels of measurement: nominal, ordinal, interval, and ratio.

• Nominal level of measurement
characterized by data that consist of names, labels, or categories only, and
the data cannot be arranged in some order (such as low to high).
Example: Survey responses of yes, no, and undecided
• Ordinal level of measurement
involves data that can be arranged in some order, but differences (obtained
by subtraction) between data values either cannot be determined or are
meaningless.
Example: Course grades A, B, C, D, or F
• Interval level of measurement
involves data that can be arranged in order, and the differences between
data values can be found and are meaningful. However, there is no
natural zero starting point at which none of the quantity is present.
Example: Years 1000, 2000, 1776, and 1492
• Ratio level of measurement
data can be arranged in order, differences can be found and are
meaningful, and there is a natural zero starting point (where zero indicates
that none of the quantity is present). Differences and ratios are both
meaningful.
Example: Class times of 50 minutes and 100 minutes

In short:
• Nominal - categories only (names)
• Ordinal - categories with some order (nominal, plus can be ranked)
• Interval - differences but no natural zero point (ordinal, plus intervals are consistent)
• Ratio - differences and a natural zero point (interval, plus ratios are consistent, true zero)
Types of Data, Levels of Measurement:
Example 1:

Determine the measurement level.

Variable          Nominal   Ordinal   Interval   Ratio   Level
Hair Color        Yes       No                           Nominal
Zip Code          Yes       No                           Nominal
Letter Grade      Yes       Yes       No                 Ordinal
ACT Score         Yes       Yes       Yes        No      Interval
Height            Yes       Yes       Yes        Yes     Ratio
Age               Yes       Yes       Yes        Yes     Ratio
Temperature (F)   Yes       Yes       Yes        No      Interval
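The table above follows a simple decision rule: a variable's level is the highest one whose requirement it meets. A minimal Python sketch of that rule is given below. The function name and the three yes/no flags are hypothetical names introduced for illustration, and the flag values for each variable are read off the table rows.

```python
def measurement_level(can_be_ordered, differences_meaningful, has_natural_zero):
    """Return the level of measurement implied by three yes/no properties.

    Each level adds one requirement on top of the previous one, so the
    checks run from the weakest level (nominal) to the strongest (ratio).
    """
    if not can_be_ordered:
        return "Nominal"
    if not differences_meaningful:
        return "Ordinal"
    if not has_natural_zero:
        return "Interval"
    return "Ratio"


# (can_be_ordered, differences_meaningful, has_natural_zero) per variable,
# taken from the Example 1 table.
variables = {
    "Hair Color": (False, False, False),
    "Zip Code": (False, False, False),
    "Letter Grade": (True, False, False),
    "ACT Score": (True, True, False),
    "Height": (True, True, True),
    "Age": (True, True, True),
    "Temperature (F)": (True, True, False),
}

levels = {name: measurement_level(*flags) for name, flags in variables.items()}

for name, level in levels.items():
    print(f"{name}: {level}")
```

Zip Code is a useful reminder that a variable written with digits is not necessarily quantitative: the digits are labels, so it stays at the nominal level.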

Example 2:

Example 3:

Parameter or Statistic?

Statistic

Parameter

Example 4:

Discrete or Continuous?

Continuous

Discrete

Example 5:
Determine the measurement level.

Nominal

Ratio

Ordinal

Interval

Example 6:
Determine the measurement level & what’s wrong with the conclusion?

Structured vs Unstructured

https://www.youtube.com/watch?v=WBU7sW1jy2o
Big Data & Data Science

• “… the sexy job in the next 10 years will be statisticians,” Hal Varian, Google Chief Economist
• The U.S. will need 140,000-190,000 predictive analysts and 1.5 million
managers/analysts by 2018 (McKinsey Global Institute, June 2011).
• New Data Science institutes being created or repurposed – NYU, Columbia, Washington, UCB,...
• New degree programs, courses, boot-camps:
– e.g., at Berkeley: Stats, I-School, CS, Astronomy…
– One proposal (elsewhere) for an MS in “Big Data Science”
– Plans for Data Science Stream at AUST
– RDA-CODATA School of Research Data Science
Data Science Vs Analysis Vs Software Delivery

Tools
  Traditional Analysis: SAS, R, Excel, SQL, in-house tools
  Traditional Software Delivery: Java, source control, Linux, continuous integration, unit
    testing, bug reports and project management
  Data Science: R, Java, scientific Python libraries, Excel, SQL, Hadoop, Hive, Pig, Mahout and
    other machine learning libraries, GitHub for source control and issue management

Analytical Methods
  Traditional Analysis: Regressions, classifications, measuring prediction accuracy and
    coverage/error, sampling
  Traditional Software Delivery: N/A
  Data Science: Classification, clustering, similarity detection, recommenders, unsupervised
    and supervised learning, small- and large-scale computations, measuring prediction
    accuracy and coverage/error

Team Structure
  Traditional Analysis: Statisticians, Mathematicians, Scientists
  Traditional Software Delivery: Developers, Project Managers, Systems Engineers
  Data Science: Mathematicians, Statisticians, Scientists, Developers, Systems Engineers

Time Frame
  Traditional Analysis: Either:
    • Usually on-going research and discovery within a team in the organization
    Or:
    • Specific project to determine answers
  Traditional Software Delivery: Regular software release cycle, continuous delivery, etc.
  Data Science: Either:
    • Discovery/learning phase leading to product development
    Or:
    • On-going research and product invention/improvement
Contrast: Scientific Computing

[Figure: an image is passed to a general-purpose classifier, which labels it "Supernova" or "Not". Credit: Nugent group / C3 LBL]

Scientific Modeling                                   Data-Driven Approach
Physics-based models                                  General inference engine replaces model
Problem-structured                                    Structure not related to problem
Mostly deterministic, precise                         Statistical models handle true randomness and un-modeled complexity
Run on supercomputer or high-end computing cluster    Run on cheaper computer clusters (EC2)
Contrast: Machine Learning

Machine Learning                                               Data Science
Develop new (individual) models                                Explore many models, build and tune hybrids
Prove mathematical properties of models                        Understand empirical properties of models
Improve/validate on a few, relatively clean, small datasets    Develop/use tools that can handle massive datasets
Publish a paper                                                Take action!
Contrast: Data Engineering

                      Data Science                         Data Engineering
Approach              Scientific (Exploration)             Engineering (Development)
Problems              Unbounded                            Bounded
Path to Solution      Iterative, exploratory, nonlinear    Mostly linear
Education             More is better (PhDs common)         BS and/or self-trained
Presentation Skills   Important                            Not as important
Research Experience   Important                            Not as important
Programming Skills    Not as important                     Important
Data Skills           Important                            Important
Data Science Applications

Three application areas: Business, Health Care, Urban Living

Summary
  Business: From car design to insurance to pizza delivery, businesses are using data
    science to optimize their operations and better meet their customers’ expectations.
  Health Care: Tomorrow’s healthcare may look more efficient thanks to things like
    electronic health records. It also may look a lot more effective. Reduced readmissions,
    better care, and earlier detection are on the horizon.
  Urban Living: For the first time in human history, more people live in cities than in
    suburban or rural areas. An emerging field called “urban informatics” combines data
    science with the unique challenges facing the world’s growing cities.

What is happening?
  Business:
    • Two-Way Street for the Ford Focus Electric Car
    • Better Fraud Detection Boosts Customer Satisfaction
    • E-Commerce Insights: Domino’s Secret Sauce
  Health Care:
    • Reducing Hospital Readmissions
    • Better Point-of-Care Decisions
  Urban Living:
    • Taking on Megacity Traffic
    • Fighting Crime with Data ("predictive policing")

What is possible
  Business: Using Social Data to Select Successful Retail Locations
  Health Care: Medical Exams by Bathroom Mirrors
  Urban Living: Instrumenting cities
Data Science: Case Study
Cancer Research
• Cancer is an incredibly complex disease; a single tumor can have
more than 100 billion cells, and each cell can acquire mutations
individually. The disease is always changing, evolving, and adapting.
• Employ the power of big data analytics and high-performance
computing.
• Leverage sophisticated pattern-recognition and machine learning algorithms to
identify patterns that are potentially linked to cancer.
• Requires a huge amount of data processing and pattern recognition.

Data Science: Case Study
Health Care

• Stanford Medicine, Google team up to harness power of
data science for health care
• Stanford Medicine will use the power, security and scale of
Google Cloud Platform to support precision health and
more efficient patient care.
• Analyzing genetic data
• Focusing on precision health
• Data as the engine that drives research

http://med.stanford.edu/news/all-news/2016/08/stanford-medicine-google-team-up-to-harness-power-of-data-science.html
Data Science: Case Study
Elections
• The Obama campaigns in 2008 and 2012 are credited with the
successful use of social media and data mining.
• Micro-targeting in 2012
– http://www.theatlantic.com/politics/archive/2012/04/the-creepiness-factor-how-obama-and-romney-are-getting-to-know-you/255499/
– http://www.mediabizbloggers.com/group-m/How-Data-and-Micro-Targeting-Won-the-2012-Election-for-Obama---Antony-Young-Mindshare-North-America.html
• Micro-profiles were built from multiple sources and accessed by apps, with data
updated in real time based on door-to-door visits; media buys, e-mails and
Facebook messages were highly targeted.
• 1 million people installed the Obama Facebook app that gave
access to info on “friends”.
Data Science: Case Study
Internet of Things (IoT)
• The Internet of Things (IoT) is rapidly growing. It is predicted that more than 25 billion devices
will be connected by 2020.

• The IoT will soon produce a massive volume and variety of data at
unprecedented velocity. If "Big Data" is the product of the IoT, "Data Science" is its
soul.
Data Science: Case Study
Customer Analytics

Case Study - How Recommender Systems Work
(Netflix/Amazon)

https://www.youtube.com/watch?v=n3RKsY2H-NE
