
Part of the <mtcars> data set is given below.

>mtcars
mpg cyl disp hp gear carb
Mazda RX4 21.0 6 160.0 110 4 4
Mazda RX4 Wag 21.0 6 160.0 110 4 4
Datsun 710 22.8 4 108.0 93 4 1
Hornet 4 Drive 21.4 6 258.0 110 3 1
Hornet Sportabout 18.7 8 360.0 175 3 2
Valiant 18.1 6 225.0 105 3 1
Duster 360 14.3 8 360.0 245 3 4
Formulate 10 questions to describe the distributions in the data. Write code using the
ggplot library for each of the visualizations.
Sol:
1. Which car has the highest mpg? (car names are row names in mtcars, so expose them with rownames())
ggplot(data = mtcars, aes(x = rownames(mtcars), y = mpg)) + geom_bar(stat = "identity")
2. Which car has the highest horsepower?
ggplot(data = mtcars, aes(x = rownames(mtcars), y = hp)) + geom_bar(stat = "identity")
3. What is the average mpg of cars with 4 cylinders? (cyl is numeric, so convert it to a factor for a boxplot)
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
4. What is the average horsepower of cars with 6 cylinders?
ggplot(data = mtcars, aes(x = factor(cyl), y = hp)) + geom_boxplot()
5. What is the range of mpg for cars with 3 gears?
ggplot(data = mtcars, aes(x = factor(gear), y = mpg)) + geom_boxplot()
6. What is the range of horsepower for cars with 4 gears?
ggplot(data = mtcars, aes(x = factor(gear), y = hp)) + geom_boxplot()
7. What is the distribution of mpg for cars with 5 carburetors? (a histogram takes a single variable, so filter the rows first)
ggplot(data = subset(mtcars, carb == 5), aes(x = mpg)) + geom_histogram()
8. What is the distribution of horsepower for cars with 6 carburetors?
ggplot(data = subset(mtcars, carb == 6), aes(x = hp)) + geom_histogram()
9. What is the correlation between mpg and cylinders?
ggplot(data = mtcars, aes(x = mpg, y = cyl)) + geom_point()
10. What is the correlation between horsepower and gears?
ggplot(data = mtcars, aes(x = hp, y = gear)) + geom_point()

Write the code to get the plot as shown


Also write code to get three variants of the plot.

Sol:
library(tidyverse)
iris %>%
ggplot(aes(x = Species, y = Sepal.Length)) +
geom_boxplot(aes(color = Species)) +
geom_jitter()
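The question also asks for three variants of the plot. One possible set, assuming the tidyverse is loaded: a violin plot, a flipped boxplot, and faceted histograms of the same variable.

```r
library(tidyverse)

# Variant 1: violin plot instead of boxplot
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_violin(aes(fill = Species))

# Variant 2: horizontal boxplot
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(color = Species)) +
  coord_flip()

# Variant 3: faceted histograms of Sepal.Length, one panel per species
iris %>%
  ggplot(aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(bins = 15) +
  facet_wrap(~ Species)
```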

Question:
Write the block of code to reproduce the following 7x7 matrix. The output of your
code should be the matrix as given.
The main diagonal runs from 3 down to 0 and back up to 3 in sequence.
The two diagonals adjacent to it consist entirely of 1s.
The rest of the matrix elements are zero.
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 3 1 0 0 0 0 0
[2,] 1 2 1 0 0 0 0
[3,] 0 1 1 1 0 0 0
[4,] 0 0 1 0 1 0 0
[5,] 0 0 0 1 1 1 0
[6,] 0 0 0 0 1 2 1
[7,] 0 0 0 0 0 1 3
What is the matrix called?

Sol.:
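One way to build the matrix in base R (a sketch; any construction that produces the same output is acceptable):

```r
# Build the 7x7 banded matrix: main diagonal 3,2,1,0,1,2,3,
# the two adjacent diagonals all 1, everything else 0.
n <- 7
m <- diag(abs(1:n - 4))            # abs(1:7 - 4) gives 3 2 1 0 1 2 3
m[abs(row(m) - col(m)) == 1] <- 1  # entries one step off the diagonal become 1
m
```

A matrix whose nonzero entries lie only on the main diagonal and the two adjacent diagonals is called a tridiagonal matrix.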

Question:
A dataset “states” with its structure is given below. Frost is the minimum number of
days below freezing point in a particular state. The murder rate in a state depends
on other variables as shown in the code output below.
> str(states)
'data.frame':50 obs. of 5 variables:
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ Population: num 3615 365 2212 2110 21198 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
To predict the murder rate a multiple regression model is fitted todata. Summary of
the model is shown below.
> fit = lm(Murder~., data = states)
> summary(fit)
Call:
lm(formula = Murder ~ ., data = states)
Residuals:
Min 1Q Median 3Q Max
-4.7960 -1.6495 -0.0811 1.4815 7.6210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.235e+00 3.866e+00 0.319 0.7510
Population 2.237e-04 9.052e-05 2.471 0.0173
Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05
Income 6.442e-05 6.837e-04 0.094 0.9253
Frost 5.813e-04 1.005e-02 0.058 0.9541
---
Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
i. Name the predictor variable/variables in the above regression model.
ii. Which among these variables is the most significant?
iii. If illiteracy increased by 1%, what is the impact on murder?
iv. What is the F-statistic, and how does it help to judge the quality of the model?
Sol:
i. Population, Illiteracy, Income, Frost
ii. Illiteracy is the most significant predictor: its p-value (2.19e-05) is the
smallest and far below 0.05.
iii. The Illiteracy coefficient is 4.143, so a 1-percentage-point increase in
illiteracy is associated with an increase of about 4.14 in the murder rate,
holding the other variables constant.
iv. The F-statistic is the ratio of the variance explained by the model to the
unexplained variance; a large F with a small p-value indicates that at least one
predictor is useful. Here F = 14.73 on 4 and 45 df with p-value 9.133e-08, so
the model as a whole is significant.
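The reported p-value can be reproduced from the F-statistic with base R's F distribution:

```r
# Tail probability of F = 14.73 on 4 and 45 degrees of freedom,
# matching the p-value printed in the summary.
p_value <- pf(14.73, df1 = 4, df2 = 45, lower.tail = FALSE)
p_value  # about 9.1e-08, far below 0.05, so the model is significant
```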
Question:
The scatter plot for certain data with the accompanying code is shown below. The
names used are self-explanatory.
> advertise %>%
+ ggplot(aes(x= Radio, y= Sales)) + geom_point() +
+ geom_smooth() +
+ labs(x= "spendOnRadioAdvertise", y= "Revenuegenerated")

A. What change in the code makes the trend line linear and removes the shadow
(confidence band) along the line?

The Sales average is 14.02. A simple linear regression is fitted to the data and the
code for the same is given below.
> lm(Sales ~ Radio, data = advertise)
Call:
lm(formula = Sales ~ Radio, data = advertise)
Coefficients:
(Intercept) Radio
9.3116 0.2025
B. Predict the increase in sales if an additional $100 is spent on Radio
advertisement.
C. The residual sum square is 3619 and total sum square is 5418. What is its
intuitive interpretation?
D. Calculate the R-squared statistic. Explain the meaning of the result.
E. Does the regression model predict better than the baseline mode?

Sol:
A.
advertise %>%
ggplot(aes(x = Radio, y = Sales)) +
geom_point() +
geom_smooth(se = FALSE, method = lm) +
labs(x = "spendOnRadioAdvertise", y = "Revenuegenerated")

B. The increase depends on the slope alone: ΔY = slope × ΔX = 0.2025 × 100 = 20.25
additional units of sales. (Y = 9.3116 + 0.2025 × X gives the predicted level of
sales at a given spend, not the increase.)

C. RSS = 3619 is the variation in Sales left unexplained by the model (the sum of
squared differences between observed and predicted values); TSS = 5418 is the
total variation of Sales around its mean. Their difference, 5418 − 3619 = 1799,
is the variation explained by Radio spend, so the model removes about a third
of the baseline error.

D. R-squared = 1 − RSS/TSS = 1 − 3619/5418 ≈ 0.332. About 33% of the variance in Sales is explained by Radio spend.

E. Yes. The baseline model predicts the mean sales (14.02) for every observation,
and its squared error is the TSS (5418). The regression's error (RSS = 3619) is
smaller, so the model predicts better than the baseline.
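The arithmetic in parts B-D can be verified in R:

```r
# Check the arithmetic: RSS is the residual (unexplained) sum of squares,
# TSS the total sum of squares around the mean.
slope <- 0.2025
increase <- slope * 100        # extra sales from $100 more radio spend
rss <- 3619
tss <- 5418
r_squared <- 1 - rss / tss     # fraction of variance explained

increase    # 20.25
r_squared   # about 0.332
```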

Question
A dataset named <advertise> consists of four variables. Part of the dataset is shown
in the table below.
> advertise %>% head(10)
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75.0 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1.0 4.8
10 199.8 2.6 21.2 10.6
Sales: represent the unit sales
TV/ Radio/ Newspaper : represent the spend in these media
As an analytics consultant we need to suggest some solution to increase sales for a
company.
We ran simple and multiple regressions to find out the sales dynamics w.r.t. only
radio spending as well as spending over all media. Regression models are given
below.
> summary(lmRadio)
Call:
lm(formula = Sales ~ Radio, data = advertise)
Residuals:
Min 1Q Median 3Q Max
-15.7305 -2.1324 0.7707 2.7775 8.1810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.31164 0.56290 16.542 <2e-16 ***
Radio 0.20250 0.02041 9.921 <2e-16 ***
We also run the multiple regression as given below.
> summary(lmTotal)
Call:
lm(formula = Sales ~ TV + Radio + Newspaper, data = advertise)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
Radio 0.188530 0.008611 21.893 <2e-16 ***
Newspaper -0.001037 0.005871 -0.177 0.86
A. What do you observe in both models w.r.t. radio spend for advertisement?

B.Are there any inconsistencies in the model which may explain the opposing
behaviour?

C. The correlation matrix is given below. Can you interpret the reason for the
different coefficients of Newspaper's impact on Sales?

> round(cor(advertise),4)
TV Radio Newspaper Sales
TV 1.0000 0.0548 0.0566 0.7822
Radio 0.0548 1.0000 0.3541 0.5762
Newspaper 0.0566 0.3541 1.0000 0.2283
Sales 0.7822 0.5762 0.2283 1.0000
Sol:
A. Radio spend is a highly significant predictor in both models (p < 2e-16 in
each). Its coefficient drops slightly, from 0.2025 in the simple model to
0.1885 in the multiple model, while its t-value rises from 9.921 to 21.893
because TV and Newspaper absorb much of the residual variance.

B. Yes. The intercept falls from 9.312 to 2.939 when TV and Newspaper are added,
because in the simple model the intercept soaks up the average effect of the
omitted media. More striking, Newspaper's coefficient is slightly negative and
insignificant (p = 0.86) even though its simple correlation with Sales is
positive; this opposing behaviour points at omitted-variable effects.

C. Newspaper is correlated with Radio (0.3541): markets that spend heavily on
newspapers tend also to spend heavily on radio. Marginally, Newspaper therefore
appears associated with Sales (correlation 0.2283), but once Radio is in the
model Newspaper adds nothing, and its coefficient collapses to roughly zero
(−0.001, p = 0.86). Note that 0.2283 is a correlation, not a percentage effect
of newspaper spend on sales.
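A toy simulation (not the advertise data; all numbers here are invented) shows how a variable with no direct effect can inherit a marginal correlation with the outcome through a correlated predictor:

```r
# Toy data: newspaper spend has NO direct effect on sales,
# but is correlated with radio spend, which does drive sales.
set.seed(1)
n <- 500
radio     <- runif(n, 0, 50)
newspaper <- 0.5 * radio + rnorm(n, sd = 5)  # correlated with radio
sales     <- 9 + 0.2 * radio + rnorm(n)      # driven by radio only

cor(newspaper, sales)                             # clearly positive
coef(lm(sales ~ radio + newspaper))["newspaper"]  # near zero once radio is included
```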

Question
As an analyst you are assigned a task to classify a set of emails into spam and non-
spam groups. The dataset consists of 1000 emails with 400 spam and the rest non-spam
emails. As a starting point, data are divided into training and test sets. You applied
the logistic regression model and used the prediction algorithm on the test data set.
The confusion matrix is obtained as given below.

FALSE TRUE
non-spam 267 17
spam 21 163
a) What is the accuracy of the null model? [1]
b) How accurate is your classifier (Hint: calculate accuracy)?
c) Interpret the accuracy and decide if the model is acceptable.
d) Explain precision in a maximum of two lines. [1]
e) How much precision is observed by the classification model?
f) What is recall in this case? (fraction of the spam class detected by the classifier) [1]

Sol:
(a) The null model always predicts the majority class (non-spam). The dataset has
600 non-spam emails out of 1000, so the null model's accuracy = 600/1000 = 0.60,
i.e. 60%, with no classifier at all.

(b) Accuracy of the Classifier = (TP + TN)/(TP+FP+FN+TN)

= (267 + 163)/(267+17+21+163)

= 0.9188

Accuracy of the Model = 0.9188 ~= 91.88%

(c) As the Accuracy of the Classifier = 91.88%, which is very high and close to
100%, hence we can conclude that the model has high predictive power to
classify emails as spam or not. Model is acceptable.

(d) Precision is the fraction of items the classifier labels as positive (spam)
that really are positive; it measures how trustworthy a positive prediction is.

The mathematical formula for Precision = TP/ (TP + FP)

(e) Precision = 163/(163+17) = 0.9055

(f) Recall can be calculated as:

Recall = TP/Actual Positives

Recall = 163/(163+21) = 0.8858
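The metrics above can be recomputed from the confusion matrix in a few lines of base R:

```r
# Cells read off the confusion matrix (spam = positive class).
tn <- 267; fp <- 17   # non-spam row
fn <- 21;  tp <- 163  # spam row

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
# accuracy 0.9188, precision 0.9056, recall 0.8859
```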

Question:

A data analyst consultant provides consulting services to a bank which gives loans
to its customers. The loans data set contains a binary dependent variable
not.fully.paid (value ‘1’ indicates default). To predict this dependent variable,
the rest of the variables in the data set are used. The “loans” data with its
summary is given below.
> str(loans)
'data.frame':9578 obs. of 7 variables:
$ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
$ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3..
$ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
$ installment : num 829 228 367 162 103 ...
$ log.annual.inc: num 11.4 11.1 10.4 11.4 11.3 ...
$ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
$ not.fully.paid: int 0 0 0 0 0 0 1 1 0 0 ...
Pub.rec represents the borrower's number of derogatory public records. The rest of
the variables are self-explanatory.
> table(loans$not.fully.paid)
   0    1
8045 1533
> summary(loans)
credit.policy purpose int.rate installment
Min. :0.000 all_other :2331 Min. :0.0600
1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
Median :1.000 debt_consolidation:3957 Median :0.1221
Mean :0.805 educational : 343 Mean :0.1226
3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
Max. :1.000 major_purchase : 437 Max. :0.2164
small_business : 619
log.annual.inc pub.rec not.fully.paid
Min. : 7.548 Min. :0.0000 Min. :0.0000
1st Qu.:10.558 1st Qu.:0.0000 1st Qu.:0.0000
Median :10.928 Median :0.0000 Median :0.0000
Mean :10.932 Mean :0.0621 Mean :0.1601
3rd Qu.:11.290 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :14.528 Max. :5.0000 Max. :1.0000
NA's :4 NA's :29
The range of values in installment variable is from 15.67 to 940.1. All
data are available in the variable.
With reference to above information answer the following.
i) What percentage of loans are not fully paid?
ii) Since the bank needs a predicted risk for all borrowers, what decision
should you take for the log.annual.inc and pub.rec variables?
iii) Which type of regression do you need to fit to predict the dependent
variable in the question?
iv) After developing the model, the confusion matrix is obtained
as given below. Calculate the accuracy of the model.
> test$prediction=predict(mod,newdata = test, type =
"response")
Warning message:
contrasts dropped from factor purpose
> table(test$not.fully.paid, test$prediction >0.5)
FALSE TRUE
0 2406 7
1 454 6
> table(test$not.fully.paid)
   0    1
2413  460
Calculate the baseline model accuracy. Compare with your model accuracy.

Sol:

i. Percentage of loans not fully paid = 1533/(8045+1533) = 1533/9578 ≈ 0.16, i.e.
about 16%.

ii. Both variables contain missing values (log.annual.inc has 4 NA's, pub.rec has
29). Since the bank needs a predicted risk for every borrower, those rows cannot
simply be dropped; the missing values should be imputed (e.g. with the variable's
median) so that every borrower gets a prediction.

iii. Logistic regression, because the dependent variable not.fully.paid is binary
(0/1).

iv. Accuracy of the model = (2406+6)/(2406+7+454+6) = 2412/2873 ≈ 0.8395.

v. Baseline model accuracy (always predict "fully paid") = 2413/(2413+460)
= 2413/2873 ≈ 0.8399. The baseline is marginally higher, so at the 0.5 cutoff the
model does not beat the baseline on raw accuracy.
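The arithmetic for parts (i), (iv) and (v) can be checked in R:

```r
# Numbers read off the tables in the question.
pct_not_paid <- 1533 / (8045 + 1533)           # share of loans not fully paid
model_acc    <- (2406 + 6) / (2406 + 7 + 454 + 6)
baseline_acc <- 2413 / (2413 + 460)            # always predict "fully paid"

round(c(pct_not_paid, model_acc, baseline_acc), 4)
# 0.1601 0.8395 0.8399 -- the model barely differs from the baseline
```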

Question:
Answer the following questions citing proper example of your choice.
(a) Explain with example for each: Where we should use, boxplot, histogram, bar
chart, stacked bar chart and density plot
(b) Write four assumptions in a linear regression model development.

(c) What is a sigmoid curve, and where this is used?

Sol:
A. Bar Chart
A bar chart compares a measure across a categorical dimension. Comparing the
height of each bar gives a more intuitive perception than looking at the table
alone. A bar chart is very similar to a histogram; the fundamental difference is
that the x-axis of a bar chart is a categorical attribute instead of the numeric
intervals of a histogram.

Histogram
A histogram looks very similar to a bar chart because it is also composed of bars.
However, instead of comparing categorical data, it breaks a numeric variable into
interval groups and shows the frequency of data falling into each group. It is
commonly used to gain insights about customers, e.g. Pinterest uses histograms to
show the age distribution of an audience. A histogram is good at revealing the
pattern of data distribution over a numeric spectrum.

Boxplot
We use box plots in descriptive data analysis, indicating whether a distribution
is skewed and whether there are potentially unusual observations (outliers) in the
data set. Box plots are also very useful when large numbers of observations are
involved and when two or more data sets are being compared.

Stacked Bar Chart


A stacked bar chart is used when we need to break down a primary category into a
secondary category. It is very similar to the bar chart described earlier:
horizontally it compares the performance of each market, while vertically it
further demonstrates the composition of each segment within the market.

Density Plot
Density plots (aka Kernel Density Plots or Density Trace Graph) are used to
observe a variable's distribution in a dataset.
This chart is a smoothed version of the histogram and is used in the same
concept. It uses a kernel density estimate to show the variable's probability
density function, allowing for smoother distributions by smoothing out the
noise. Thus, the plots are smooth across bins and are not affected by the
number of bins created, creating a more defined distribution shape. The peaks
of a density plot help display where values are concentrated over the interval.
An advantage density plots have over histograms is that they’re better at
determining the distribution shape because they’re not affected by the number of
bins used.

B. Four assumptions in linear regression model development:

Linearity: examine the scatter diagram (should appear linear) and the residual
plot (should appear random).

Normality of errors: view a histogram of standardised residuals; regression is
fairly robust to departures from normality.

Homoscedasticity: the assumption of homoscedasticity (meaning “same variance”) is
central to linear regression models. It describes a situation in which the error
term (that is, the “noise” or random disturbance in the relationship between the
independent variables and the dependent variable) has the same variance across
all values of the independent variables.

Independence of errors: successive observations should not be related. This is
important when the independent variable is time (check with the Durbin-Watson
statistic).

C. A sigmoid is an S-shaped curve; an S-shaped population curve, for example,
represents logistic growth. The lower arm of the S forms as a small population
grows exponentially, and the upper arm forms as the population nears its carrying
capacity and its growth rate slows.

The main reason we use the sigmoid function in modelling is that its output lies
between 0 and 1. It is therefore especially used for models where we have to
predict a probability as an output: since the probability of anything exists only
between 0 and 1, the sigmoid is the right choice. This is exactly how logistic
regression turns a linear model into a classifier.
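A minimal sketch of the sigmoid in R:

```r
# The logistic sigmoid maps any real number into (0, 1).
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(0)            # 0.5, the midpoint of the S
sigmoid(c(-10, 10))   # close to 0 and close to 1 at the extremes

curve(sigmoid, from = -6, to = 6)  # draws the S-shaped curve
```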

Question:
Answer the following questions citing proper example of your choice.
(a) Discuss the importance of variance and bias in the case of simple regression
analysis. How does the variance change when we add more variables when fitting the
linear model?

(b) If we add more variables in a linear regression model to predict the target
variable the R-squared value increases. Thus, to capture more variance in data a
data scientist should add variables depending on availability of data. Discuss the
conjecture in the above statement.

(c) Explain Heteroskedasticity with example


SOL:
A. Bias term in linear regression

For any given phenomenon, the bias term we include in our equations represents
the tendency of the data to have a distribution centered about a value that is
offset from an origin; in a way, the data is biased towards that offset.
In terms of linear regression, variance is a measure of how far observed values
differ from the average of predicted values, i.e., their difference from the
predicted value mean. The goal is to have a value that is low.
Adding independent variables to a multiple linear regression model will always
increase the amount of explained variance in the dependent variable (typically
expressed as R²). Therefore, adding too many independent variables without any
theoretical justification may result in an over-fitted model.

B. The conjecture is flawed. R-squared never decreases when a variable is added,
even if the variable is pure noise, so a rising R-squared by itself does not
justify adding predictors. Adding variables merely because data are available
invites over-fitting: the model captures sample-specific noise and predicts new
data poorly. Adjusted R-squared, which penalises extra variables, or out-of-sample
validation should guide variable selection instead.
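A small demonstration of the conjecture in (b), using the built-in mtcars data and a deliberately meaningless predictor: R-squared still rises, which is why adjusted R-squared or out-of-sample checks should drive variable selection.

```r
# Add a pure-noise predictor to a simple model: R-squared can only go up,
# even though the new variable carries no information about mpg.
set.seed(42)
noise <- rnorm(nrow(mtcars))

fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + noise, data = mtcars)

summary(fit1)$r.squared < summary(fit2)$r.squared  # TRUE: R^2 rose anyway
summary(fit1)$adj.r.squared                        # adjusted R^2 penalises the
summary(fit2)$adj.r.squared                        # useless extra variable
```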

C. Heteroskedasticity refers to situations where the variance of the residuals is
unequal over a range of measured values. When running a regression analysis,
heteroskedasticity results in an unequal scatter of the residuals.

One common example of heteroskedasticity is the relationship between food
expenditures and income. For those with lower incomes, food expenditures are
often restricted based on their budget; as incomes increase, people tend to spend
more on food as they have more options and fewer budget restrictions, so the
spread of expenditures widens with income.

Question

Answer the following questions citing proper example of your choice.


(a) Explain the basics of OLS regression.
(b) How is a generalized linear model derived from a simple linear
model to solve classification problems?
(c) Discuss all tools that can be used to analyze the relationship
between
(i) two quantitative variables
(ii) quantitative vs qualitative variable
(iii) two qualitative variables

Sol:

A. Ordinary Least Squares regression (OLS) is a common technique for estimating
the coefficients of linear regression equations, which describe the relationship
between one or more independent quantitative variables and a dependent variable
(simple or multiple linear regression). It chooses the coefficients that minimise
the sum of squared residuals.

B. A GLM can be used to construct models for regression and classification
problems by using the type of distribution which best describes the data or
labels given for training the model. Linear regression is a special case of the
GLM in which the output labels are continuous values and therefore assumed
Gaussian; for a binary classification problem, a Bernoulli distribution with a
logit link yields logistic regression.

C.
i. Two quantitative variables: scatter plot (and the correlation coefficient)
ii. Quantitative vs qualitative variable: t-tests (single sample, paired sample,
independent samples), side-by-side boxplots
iii. Two qualitative variables: chi-square test on a contingency table
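One example of each tool in base R, using built-in datasets (mtcars's am, the transmission type, stands in as a qualitative variable):

```r
# (i) two quantitative variables: scatter plot and correlation
plot(mtcars$wt, mtcars$mpg)
cor(mtcars$wt, mtcars$mpg)         # strong negative correlation

# (ii) quantitative vs qualitative: t-test of mpg by transmission type
t.test(mpg ~ am, data = mtcars)

# (iii) two qualitative variables: chi-square test on a contingency table
chisq.test(table(mtcars$cyl, mtcars$am))
```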

Question
Part of the mpg dataset is shown below.
> mpg
# A tibble: 234 x 11
manufacturer cyl trans drv
1 audi 4 auto(l5) f
2 audi 4 manual(m5) f
3 audi 4 manual(m6) f
4 audi 4 auto(av) f
5 audi 6 auto(l5) f
6 audi 6 manual(m5) f
7 audi 6 auto(av) f
8 audi 4 manual(m5) 4
9 audi 4 auto(l5) 4
10 audi 4 manual(m6) 4
# ... with 224 more rows
i) The mpg dataset is a tibble. What does that represent?
ii) What is the difference between a tibble and a data frame?
iii) Write code in R using the ggplot library to get the above plot. Show code for
package installation, library loading and plotting.
iv) The x-axis shows factor(cyl). In case we supply only <cyl> as the x-axis
variable, what should the plot outcome be? Write in one line.
v) Write the code to remove the legend from the above plot.

vi) Can you represent the data in some other bar chart form? Write the code for the
same.

Sol:

1. A tibble is the tidyverse's modern version of a data frame (class tbl_df). The
header "# A tibble: 234 x 11" says the data is stored in that form, with 234 rows
and 11 columns.

2. There are two main differences in the usage of a data frame vs a tibble:
printing, and subsetting. Tibbles have a refined print method that shows only
the first 10 rows, and  the columns that fit on screen. This makes it much
easier to work with large data.

3. install.packages("ggplot2")
library(tidyverse)
mpg %>%
ggplot(aes(x = factor(cyl))) +
geom_bar(aes(fill = trans), position = "dodge", show.legend = TRUE)
4.
mpg %>%
ggplot(aes(x = cyl)) +
geom_bar(aes(fill = trans), position = "dodge", show.legend = TRUE)

Without factor(), cyl is treated as a continuous numeric variable, so the bars sit
on a continuous x-axis with gaps at the cylinder counts that do not occur (e.g. 5
and 7), instead of on a discrete categorical axis.

5.
mpg %>%
ggplot(aes(x=factor(cyl))) +
geom_bar(aes(fill= trans),position = "dodge",show.legend = FALSE)
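For part (vi), two plausible variants (assuming the tidyverse is loaded): a stacked version and a proportional (filled) version of the same chart.

```r
library(tidyverse)

# Variant: stacked bars instead of dodged
mpg %>%
  ggplot(aes(x = factor(cyl))) +
  geom_bar(aes(fill = trans), position = "stack")

# Variant: proportions of each transmission within a cylinder group
mpg %>%
  ggplot(aes(x = factor(cyl))) +
  geom_bar(aes(fill = trans), position = "fill")
```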

Question:
Write one line regarding the output of each of the functions given below. Let <df>
be any dataframe.

tail(df,2)
df[, c(1,3)]
table(df$<any variable of df>)
hist(df[, 3]) #column 3 is a factor variable
tapply(iris[,1], iris[,5], mean)

Write the use of the following functions with an example line of code
i. everything()
ii. c()
iii. theme()
iv. geom_smooth()
v. par(mfrow())
vi. pivot_longer()
vii. mutate()
viii. summarise()
ix. facet_wrap()
x. sapply()
Sol:

tail(df,2) : It will give the last two rows of the dataframe as output.

df[,c(1,3)] : An empty index before the comma means "all rows", and c(1,3) selects
columns 1 and 3, so this returns all rows of columns 1 and 3 only.

table(df$<any variable of df>) : The main objective of the table function in R is
creating a frequency table. This counts how many times each value of the chosen
variable occurs.

hist(df[,3]) : Attempts to draw a histogram of column 3. Since column 3 is a factor
variable, the call fails with an error ('x' must be numeric); it would need a
numeric vector, e.g. hist(as.numeric(df[,3])), or barplot(table(df[,3])) for a
factor.

tapply(iris[,1], iris[,5], mean) : Output will be the mean of column 1
(Sepal.Length) computed separately for each level of column 5 (Species).
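The indexing calls above can be checked directly on the built-in iris data frame (standing in for the generic <df>):

```r
# iris stands in for the generic df from the question.
tail(iris, 2)                        # last two rows
head(iris[, c(1, 3)])                # all rows, columns 1 and 3 (first few shown)
table(iris$Species)                  # frequency table: 50 of each species
tapply(iris[, 1], iris[, 5], mean)   # mean Sepal.Length for each Species
```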

Use of functions:

everything(): everything() selects all variable. It is also useful in combination


with other tidyselect operators.

c(): The c function in R programming stands for 'combine'. It combines its
arguments into a vector (or list), e.g. v <- c(1, 3) gives the numeric vector
1 3. Such vectors are what you pass when indexing a data frame:

To extract rows, use df[c(1, 3), ]

To extract columns, use df[, c(1, 3)]

To extract rows and columns together, use df[c(1, 2), c(1, 3)]

theme(): Themes can be used to give plots a consistent customized look. Modify a
single plot's theme using theme(), e.g. + theme(legend.position = "none").

geom_smooth(): adds a smoothed conditional mean / regression line to a ggplot.
Key arguments: method, se, color, size and linetype.

par(mfrow = ...): The par() function sets graphical parameters for base plots.
The mfrow parameter splits the screen into several panels, and subsequent charts
are drawn in the panels. You provide a vector of length 2 to mfrow: number of
rows and number of columns, e.g. par(mfrow = c(2, 2)).

pivot_longer(): pivot_longer() makes datasets longer by increasing the


number of rows and decreasing the number of columns.

mutate(): the mutate function is used to create a new variable from a data set.
In order to use the function, we need to install the dplyr package, which is an
add-on to R that includes a host of cool functions for selecting, filtering,
grouping, and arranging data.

summarise(): summarise() creates a new data frame. It will have one (or


more) rows for each combination of grouping variables; if there are no
grouping variables, the output will have a single row summarising all
observations in the input. It will contain one column for each grouping
variable and one column for each of the summary statistics that you have
specified.

facet_wrap(): facet_wrap() makes a long ribbon of panels (generated by any


number of variables) and wraps it into 2d. This is useful if you have a single
variable with many levels and want to arrange the plots in a more space
efficient manner.

sapply(): sapply() takes a list, vector or data frame as input and gives output as
a vector or matrix where possible. It does the same job as lapply() but simplifies
the output instead of always returning a list.

Note: to understand sapply(), first understand lapply().

lapply() is useful for performing operations on list objects and returns a list
object of the same length as the original set. Each element of the returned list
is the result of applying FUN to the corresponding element of the input. lapply()
takes a list, vector or data frame as input and gives output as a list.
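A minimal illustration of the lapply/sapply difference:

```r
# lapply always returns a list; sapply simplifies to a vector when it can.
x <- list(a = 1:3, b = 4:6)

lapply(x, sum)   # list: $a is 6, $b is 15
sapply(x, sum)   # named vector: a = 6, b = 15
```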
