
COURSE CODE COURSE TITLE L T P C

1152CS239 Introduction to Data Science 2 0 2 3


A. Preamble
This course provides an introduction to data science and highlights its importance in business
decision making. It provides an overview of commonly used data science tools along with
spreadsheets, relational databases, statistics, and programming assignments to lay the foundation
for data science applications.

B. Prerequisite Course
1150MA201-Applied Statistics

C. Course Objectives
This course enables learners to:
● Identify general statistical techniques for data analysis
● Determine how to summarize data to present information using descriptive statistics
● Use a correlation chart to determine the strength of a correlation

D. Course Outcomes
Upon the successful completion of the course, students will be able to:
CO No. Course Outcome K-Level

CO1 Implement the life cycle of the data science process for building real-world applications K3

CO2 Use the different types of data and variables K3

CO3 Demonstrate statistical analysis of data using regression techniques K3

CO4 Apply data wrangling concepts to convert and map the raw data and to make the data ready for analysis K3

CO5 Build a model for a given real-time application K3

Knowledge Level (Based on revised Bloom’s Taxonomy)


K1-Remember K2-Understand K3-Apply K4-Analyze K5-Evaluate K6-Create

E. Correlation of COs with Programme Outcomes and Programme Specific Outcomes:


COs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1 3 3 3 2
CO2 2 3 2 3 1 1 2 2
CO3 2 3 2 3 1 1 2 2
CO4 3 3 2 3 1 1 3 2
CO5 3 3 2 3 1 1 3 2
3- High; 2-Medium; 1-Low

F. Course Contents

Unit 1 Introduction 6 Hours


Introduction to Data Science – Evolution of Data Science – Data Science Roles – Life cycle of
Data Science Project – Data Science: Benefits and uses – Facets of data – Data Science Process:
Overview – Defining research goals – Retrieving data – Data preparation – Exploratory Data
Analysis – Descriptive – Diagnostic – Predictive – Prescriptive – Feature engineering – Algorithm
selection – Build the model – Algorithm tuning – Presenting findings and building applications

Unit 2 Describing Data 6 Hours


Types of Data – Qualitative – Quantitative – Categorical – Nominal – Ordinal – Numerical – Discrete
– Continuous – Interval – Ratio – Types of Variables – Univariate analysis – Bivariate analysis –
Multivariate analysis-Describing Data with Tables and Graphs –Measures of variability -
Describing Data with Averages – Describing Variability – Normal Distributions and Standard (z)
Scores
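
As an illustration of the standard (z) score topic in this unit, the following is a minimal Python sketch using only the standard library. The sample values are hypothetical, and the population standard deviation is assumed; a sample standard deviation (statistics.stdev) would give slightly different z scores.

```python
import statistics

# Hypothetical sample of scores (illustrative values only, not course data).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Measure of central tendency and measure of variability.
mean = statistics.mean(scores)
std_dev = statistics.pstdev(scores)  # population standard deviation (an assumption)

# A z score expresses each observation as its distance from the mean,
# measured in standard-deviation units.
z_scores = [(x - mean) / std_dev for x in scores]

for x, z in zip(scores, z_scores):
    print(f"score={x}  z={z:+.2f}")
```

A z score of 0 marks an observation equal to the mean; positive and negative values lie above and below it respectively.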

Unit 3 Describing Relationships 6 Hours


Covariance - Correlation –positive – negative –simple – partial – multiple - linear – non-linear –
Pearson correlation - spearman ranking- Scatter plots – box plot – cross table - Histogram –
correlation coefficient for quantitative data –computational formula for correlation coefficient –
Regression –regression line – best fit line –least squares regression line – Multiple Regression–
ridge – lasso regression- Polynomial Regression -Regression Assumptions- Standard error of
estimate – Mean Square Error –root mean square error -Mean Absolute error/deviation–mean
absolute percentage error – interpretation of r² – adjusted R square – multiple regression equations
–regression towards the mean
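
The correlation and regression topics in this unit can be sketched in a few lines of Python. The example below is a minimal illustration on hypothetical paired data: it computes the Pearson correlation coefficient, the least squares regression line, and three of the listed error measures. All variable names and data values are assumptions made for the example.

```python
import math

# Hypothetical paired observations (illustrative values only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and sum of cross-products around the means.
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation coefficient.
r = sp_xy / math.sqrt(ss_x * ss_y)

# Least squares regression line: y_hat = intercept + slope * x.
slope = sp_xy / ss_x
intercept = mean_y - slope * mean_x

# Error measures computed from the residuals of the fitted line.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
mse = sum(e ** 2 for e in residuals) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in residuals) / n

# For simple linear regression, r squared is the proportion of
# variance in y explained by the line.
r_squared = r ** 2

print(f"r={r:.3f}  slope={slope:.2f}  intercept={intercept:.2f}")
print(f"MSE={mse:.2f}  RMSE={rmse:.3f}  MAE={mae:.2f}  r^2={r_squared:.2f}")
```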

Unit 4 Data Wrangling 6 Hours


Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, boolean
logic – fancy indexing – Slicing of arrays –concatenate – reshape –broadcasting – transpose -
structured arrays – Multidimensional –numpy universal functions - Pandas series – data frame –
describe –modifying datatypes - duplicates- drop- windowing operations- date time- time series
-Data manipulation with Pandas – data indexing and selection – operating on data – missing data
– Hierarchical indexing – combining datasets – aggregation and grouping –merge –sorting -
pivot tables.
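
Several of the NumPy topics listed in this unit (reshape, aggregation, broadcasting, boolean masks, fancy indexing, transpose, concatenate) can be seen together in one short sketch. The array contents are arbitrary and chosen only for illustration; NumPy is assumed to be installed.

```python
import numpy as np

# A 3x4 array holding the integers 0..11.
data = np.arange(12).reshape(3, 4)

# Aggregation: mean of each column (axis=0 collapses the rows).
col_means = data.mean(axis=0)

# Broadcasting: the shape-(4,) vector is stretched across all 3 rows.
centered = data - col_means

# Comparison and boolean mask: select the elements greater than 6.
mask = data > 6
large_values = data[mask]

# Fancy indexing: pick rows 0 and 2 with a list of indices.
picked_rows = data[[0, 2]]

# Transpose and concatenate.
transposed = data.T
stacked = np.concatenate([data, data], axis=0)

print(centered)
print(large_values)
print(picked_rows)
print(transposed.shape, stacked.shape)
```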

Unit 5 Model Development 6 Hours

Binary Classification – Multiclass Classification – Multi target – Decision Tree – Gini /
information gain – root node – branch node – leaf node – interpretation of decision tree – Pruning
– Model Evaluation metrics – Accuracy – confusion matrix – Precision – recall – AUC-ROC curve
– Residual Plot – Distribution Plot – degree of polynomial – Bias – Variance – underfitting –
overfitting – low bias and low variance – Generalised model – Pipelines – Measures for
In-sample Evaluation – Prediction and Decision Making – Imbalanced classification –
undersampling – oversampling – SMOTE – Hyperparameter tuning – Best parameter – Cross validation
– Grid and Randomised search cross validation – deployment – monitoring – MLOps
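
To make the model evaluation metrics in this unit concrete, the sketch below derives a confusion matrix, accuracy, precision, and recall by hand for a binary classifier. The label lists are hypothetical, and the matrix layout (rows are actual classes, columns are predicted classes) is an assumption chosen for the example.

```python
# Hypothetical true and predicted labels (1 = positive class, 0 = negative class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Layout assumed here: rows = actual class, columns = predicted class.
confusion_matrix = [[tn, fp],
                    [fn, tp]]

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found

print(confusion_matrix)
print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On imbalanced data, accuracy alone can be misleading, which is why precision and recall are listed alongside it.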

Total: 30 Hours

Laboratory Experiments Total : 30 Hours


Part – I
Task 1 : Exploration of Python Modules, Data types and Functions.
Exploratory Data Analysis
Task 2: Write programs to perform descriptive statistics, subset of dataset.
Task 3: Write programs to perform exploratory data analysis: variance, standard deviation,
summarization, distribution and statistical inference.
Task 4: Write programs to find the data distributions using box and scatter plot, outliers using
plot on sample dataset
Task 5: Write programs to plot the data using an X-Y graph, bar chart, histogram, pie chart and
other plotting techniques on a sample dataset
Task 6: Find the correlation matrix, covariance, plot the correlation plot on dataset and
visualize giving an overview of relationships among data on sample dataset.
Model Building, Evaluation and Visualization
Task 7: Write a program to build and evaluate a regression model for a sample dataset.
Apply multiple regression if the data have a continuous independent variable.
Task 8: Consider a Dataset and perform the following tasks
a) Write a program to identify the column(s) of a given DataFrame which have at least
one missing value.
b) Write a program to count the number of missing values in each column of a given
DataFrame.
c) Write a program to find and replace the missing values in a given DataFrame which
do not have any valuable information.
d) Write a program to drop the rows where at least one element is missing in a given
DataFrame.
e) Write a program to drop the columns where at least one element is missing in a given
DataFrame.
f) Write a program to drop the rows where all elements are missing in a given
DataFrame
Task 9: Write a program to build and evaluate a Decision Tree model for a sample dataset.
Choose a classifier for a binary classification problem. Evaluate the performance of the classifier.
Task 10: Write a program to build and evaluate a Decision Tree model for a sample
dataset. Choose a classifier for a multiclass classification problem. Evaluate the performance of
the classifier.
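
As a starting point for the missing-value exercises in Task 8, the following sketch shows one possible pandas approach to items (a) to (f). The DataFrame is hypothetical, and replacing missing values with the column mean in item (c) is only one assumed strategy; the task leaves the replacement rule open.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame; row 2 is missing in every column.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, 5.0, np.nan, 7.0],
    "c": [10.0, 20.0, np.nan, 40.0],
})

# (a) Columns that have at least one missing value.
cols_with_missing = df.columns[df.isna().any()].tolist()

# (b) Number of missing values in each column.
missing_counts = df.isna().sum()

# (c) Replace missing values; the column mean is an assumed choice.
filled = df.fillna(df.mean())

# (d) Drop rows where at least one element is missing.
rows_complete = df.dropna()

# (e) Drop columns where at least one element is missing.
cols_complete = df.dropna(axis=1)

# (f) Drop rows where all elements are missing.
rows_not_empty = df.dropna(how="all")

print(cols_with_missing)
print(missing_counts)
print(rows_complete)
print(rows_not_empty)
```

In this sample every column contains a missing value, so item (e) leaves no columns at all; this shows why dropping columns is often too aggressive.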
Part – II
Use Case 1: Use the built-in air quality dataset, which contains daily air quality measurements
in New York, May to September 1973. Create a histogram by using appropriate arguments for
the following statements.
a. Assigning names, using the air quality data set.
b. Change colors of the Histogram
c. Remove Axis and Add labels to Histogram
d. Change Axis limits of a Histogram
e. Create a Histogram with density and Add Density curve to the histogram
Use Case 2: Create a dataset or table [“Smart Phone”] in an Excel sheet that stores the mobile
information [price, company name, model, sale percent] of five different companies. Store at
least 20 rows. Write the scripts and find the output for the following information.
a. Maximum price of the mobile of each company
b. Minimum price of mobile of each company
c. Average price of mobile of each company
d. Total Price of mobile of each company
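
The four summaries requested in Use Case 2 map onto a single grouped aggregation. The sketch below assumes pandas and uses a small in-memory table with hypothetical company names, models, and prices. In the actual exercise the table would hold at least 20 rows and would typically be loaded from the Excel sheet (for example with pd.read_excel, which needs an Excel engine such as openpyxl installed).

```python
import pandas as pd

# Hypothetical rows standing in for the "Smart Phone" sheet.
phones = pd.DataFrame({
    "company": ["A", "A", "B", "B", "C"],
    "model": ["A1", "A2", "B1", "B2", "C1"],
    "price": [10000, 20000, 15000, 25000, 30000],
    "sale_percent": [10, 15, 5, 20, 12],
})

# One row per company with the maximum, minimum, average, and total price.
summary = phones.groupby("company")["price"].agg(["max", "min", "mean", "sum"])

print(summary)
```
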
Use Case 3: The Tooth Growth data are from a study which examined the growth of teeth in
guinea pigs (n=10) in response to three dose levels of Vitamin C (0.5, 1, and 2 mg), which was
administered using two delivery methods (orange juice or ascorbic acid). Data from the Tooth
Growth study are available as an R dataset, and information about this study can be found by
using R help.
a. How many rows are there in Tooth Growth?
b. What are the mean and standard deviation of tooth length?
c. Which treatment is the best in terms of tooth growth? Derive the findings based on the
correlation between dosage and length for both supplements.
Use Case 4: Predict the sales of any product by performing the following:
a. Data collection from any source
b. Data cleaning
c. Model Building, Evaluation and Visualization
Use Case 5: Perform the Customer Churn analysis for a banking application with the following:
a. Data collection from any source
b. Data cleaning
c. Model Building, Evaluation and Visualization

H. Learning Resources
i. Text Books:
1. Davy Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”,
Manning Publications, 2016. (Unit I)
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications,
2017. (Units II and III)
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016. (Units IV and V)

ii. Reference Books:


1. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press,
2014.
iii. Online References:
1. “Introduction to Data Science”, April 11, 2021. Accessed: April 22, 2021. [Online].
Available: https://rafalab.github.io/dsbook/
2. “Introducing Data Science”, 2016. Accessed: April 20, 2021. [Online]. Available:
http://bedford-computing.co.uk/learning/wp-content/uploads/2016/09/introducing-data-science-machine-learning-python.pdf
