Dar Solved Ans
> mtcars
mpg cyl disp hp gear carb
Mazda RX4 21.0 6 160.0 110 4 4
Mazda RX4 Wag 21.0 6 160.0 110 4 4
Datsun 710 22.8 4 108.0 93 4 1
Hornet 4 Drive 21.4 6 258.0 110 3 1
Hornet Sportabout 18.7 8 360.0 175 3 2
Valiant 18.1 6 225.0 105 3 1
Duster 360 14.3 8 360.0 245 3 4
Formulate 10 questions to describe the distributions in the data. Write code using
the ggplot library for each one of the visualizations.
Sol:
library(ggplot2)  # car names are the row names of mtcars, not a column
1. Which car has the highest mpg?
ggplot(data = mtcars, aes(x = reorder(rownames(mtcars), mpg), y = mpg)) +
  geom_col() + coord_flip() + labs(x = "car")
2. Which car has the highest horsepower?
ggplot(data = mtcars, aes(x = reorder(rownames(mtcars), hp), y = hp)) +
  geom_col() + coord_flip() + labs(x = "car")
3. What is the average mpg of cars with 4 cylinders?
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4)  # cross marks the mean
4. What is the average horsepower of cars with 6 cylinders?
ggplot(data = mtcars, aes(x = factor(cyl), y = hp)) + geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4)  # cross marks the mean
5. What is the range of mpg for cars with 3 gears?
ggplot(data = mtcars, aes(x = factor(gear), y = mpg)) + geom_boxplot()
6. What is the range of horsepower for cars with 4 gears?
ggplot(data = mtcars, aes(x = factor(gear), y = hp)) + geom_boxplot()
7. What is the distribution of mpg for each carburetor count? (No car in mtcars has 5 carburetors.)
ggplot(data = mtcars, aes(x = mpg)) + geom_histogram(bins = 10) + facet_wrap(~ carb)
8. What is the distribution of horsepower for each carburetor count?
ggplot(data = mtcars, aes(x = hp)) + geom_histogram(bins = 10) + facet_wrap(~ carb)
9. What is the correlation between mpg and cylinders?
ggplot(data = mtcars, aes(x = cyl, y = mpg)) + geom_point()
10. What is the correlation between horsepower and gears?
ggplot(data = mtcars, aes(x = gear, y = hp)) + geom_point()
Sol:
iris%>%
ggplot(aes(x= Species, y=Sepal.Length)) +
geom_boxplot(aes(color= Species)) +
geom_jitter()
Question:
Write the block of code to reproduce the following 7x7 matrix. The output of your
code should be the matrix as given.
The main diagonal ranges from 3 to 0 to 3 in the sequence.
The other two consecutive diagonals consist of 1 as elements.
Rest of the matrix elements are zero.
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    3    1    0    0    0    0    0
[2,]    1    2    1    0    0    0    0
[3,]    0    1    1    1    0    0    0
[4,]    0    0    1    0    1    0    0
[5,]    0    0    0    1    1    1    0
[6,]    0    0    0    0    1    2    1
[7,]    0    0    0    0    0    1    3
What is the matrix called?
Sol.:
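One possible way to build the matrix is sketched below. It is not the only approach, and the names n and m are illustrative. The sketch fills the main diagonal with abs(3:-3) and sets every entry exactly one step off the diagonal to 1. A matrix whose non-zero entries lie only on the main diagonal and the two diagonals next to it is generally called a tridiagonal matrix (this one is also symmetric).

```r
n <- 7
m <- matrix(0, nrow = n, ncol = n)

# Main diagonal: 3 2 1 0 1 2 3
diag(m) <- abs(3:-3)

# Entries whose row and column index differ by exactly 1
# form the two diagonals next to the main one
m[abs(row(m) - col(m)) == 1] <- 1

m
```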
Question:
A dataset “states” with its structure is given below. Frost is the minimum number of
days below freezing point in a particular state. The Murder rate in a state depends
on other variables as shown in the code output below.
> str(states)
'data.frame': 50 obs. of 5 variables:
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ Population: num 3615 365 2212 2110 21198 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
To predict the murder rate, a multiple regression model is fitted to the data. A summary
of the model is shown below.
> fit = lm(Murder~., data = states)
> summary(fit)
Call:
lm(formula = Murder ~ ., data = states)
Residuals:
Min 1Q Median 3Q Max
-4.7960 -1.6495 -0.0811 1.4815 7.6210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.235e+00 3.866e+00 0.319 0.7510
Population 2.237e-04 9.052e-05 2.471 0.0173
Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05
Income 6.442e-05 6.837e-04 0.094 0.9253
Frost 5.813e-04 1.005e-02 0.058 0.9541
---
Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
i. Name the predictor variable(s) in the above regression model.
ii. Which among these variables is the most significant?
iii. If illiteracy increased by 1%, what is the impact on murder?
iv. What is the F-statistic, and how does it help to judge the quality of the model?
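Sol:
i. Population, Illiteracy, Income and Frost.
ii. Illiteracy is the most significant: it has the smallest p-value (2.19e-05) and
the largest t value (4.738). Population is also significant at the 0.05 level
(p = 0.0173); Income and Frost are not.
iii. Holding the other predictors fixed, a one-unit (1 percentage point) increase in
Illiteracy is associated with an increase of about 4.14 in the predicted murder rate.
iv. The F-statistic is the ratio of the variance explained by the model to the
unexplained (residual) variance. It tests the null hypothesis that all slope
coefficients are zero. Here F = 14.73 on 4 and 45 degrees of freedom with a
p-value of 9.133e-08, far below 0.05, so the model as a whole is significant.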
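As a rough check, the F-statistic can be recomputed from the R-squared printed in the summary, and its p-value read from the F distribution. The sketch below uses only numbers shown in the output (R-squared 0.567, 4 predictors, 50 observations); the variable names are illustrative, and small differences from the printed values come from rounding.

```r
r2 <- 0.567   # Multiple R-squared from summary(fit)
k  <- 4       # number of predictors
n  <- 50      # number of observations

# F = (explained variance per predictor) / (residual variance per residual df)
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail probability of the F distribution with (4, 45) degrees of freedom
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

f_stat
p_value
```

A p-value this far below 0.05 means the null hypothesis that all slopes are zero is rejected.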
Question:
The scatter plot for certain data with the accompanying code is shown below. The
names used are self-explanatory.
> advertise %>%
+ ggplot(aes(x= Radio, y= Sales)) + geom_point() +
+ geom_smooth() +
+ labs(x= "spendOnRadioAdvertise", y= "Revenuegenerated")
A. What change in the code makes the trend line linear and removes the shaded
band along the line?
The Sales average is 14.02. A simple linear regression is fitted to the data and the
code for the same is given below.
> lm(Sales ~ Radio, data = advertise)
Call:
lm(formula = Sales ~ Radio, data = advertise)
Coefficients:
(Intercept) Radio
9.3116 0.2025
B. Predict the increase in sales if an additional $100 is spent on Radio
advertisement.
C. The residual sum of squares is 3619 and the total sum of squares is 5418. What is
its intuitive interpretation?
D. Calculate the R-squared statistic. Explain the meaning of the result.
E. Does the regression model predict better than the baseline model?
Sol:
A.
advertise %>%
ggplot(aes(x = Radio, y = Sales)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "SpendOnRadioAdvertise", y = "RevenueGenerated")
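C. Residual (unexplained) sum of squares, RSS = 3619
Total sum of squares, TSS = 5418
TSS = explained sum of squares + RSS
Explained sum of squares = TSS - RSS
= 5418 - 3619 = 1799
Each residual is the difference between an observed value and the value predicted
by the model. Of the total variation of Sales around its mean (14.02), the
regression on Radio explains 1799 and leaves 3619 unexplained.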
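Parts B, D and E can be worked through with a few lines of arithmetic. The sketch below takes 3619 as the residual sum of squares, as the question states, and assumes Radio is measured in the same units as the extra $100. The variable names are illustrative.

```r
slope <- 0.2025   # Radio coefficient from lm(Sales ~ Radio)
rss   <- 3619     # residual sum of squares of the regression
tss   <- 5418     # total sum of squares around the mean of Sales

# B. Predicted change in Sales for 100 more units of Radio spend
increase <- slope * 100

# D. R-squared: share of the variation in Sales explained by Radio
r_squared <- 1 - rss / tss

# E. The baseline model predicts the mean (14.02) for every observation,
#    so its residual sum of squares equals the total sum of squares
better_than_baseline <- rss < tss

increase
round(r_squared, 3)
better_than_baseline
```

Under these assumptions the predicted increase is 20.25 sales units, and R-squared is about 0.33: Radio spend explains roughly one third of the variation in Sales. The regression beats the baseline because it leaves less unexplained variation.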
Question
A dataset named <advertise> consists of four variables. Part of the dataset is shown in
the table below.
> advertise %>% head(10)
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9
6 8.7 48.9 75.0 7.2
7 57.5 32.8 23.5 11.8
8 120.2 19.6 11.6 13.2
9 8.6 2.1 1.0 4.8
10 199.8 2.6 21.2 10.6
Sales: represents the unit sales
TV/ Radio/ Newspaper: represent the spend in these media
As an analytics consultant we need to suggest some solution to increase sales for a
company.
We ran simple and multiple regressions to find out the sales dynamics w.r.t. only
radio spending as well as spending over all media. The regression models are given
below.
> summary(lmRadio)
Call:
lm(formula = Sales ~ Radio, data = advertise)
Residuals:
Min 1Q Median 3Q Max
-15.7305 -2.1324 0.7707 2.7775 8.1810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.31164 0.56290 16.542 <2e-16 ***
Radio 0.20250 0.02041 9.921 <2e-16 ***
We also run the multiple regression as given below.
> summary(lmTotal)
Call:
lm(formula = Sales ~ TV + Radio + Newspaper, data = advertise)
Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
Radio 0.188530 0.008611 21.893 <2e-16 ***
Newspaper -0.001037 0.005871 -0.177 0.86
A. What do you observe in both the models w.r.t. radio spend for advertisement?
B. Are there any inconsistencies in the model which may explain the opposing
behaviour?
C. The correlation matrix is given below. Can you interpret the reason for different
coefficients of newspaper impact on sales?
> round(cor(advertise),4)
TV Radio Newspaper Sales
TV 1.0000 0.0548 0.0566 0.7822
Radio 0.0548 1.0000 0.3541 0.5762
Newspaper 0.0566 0.3541 1.0000 0.2283
Sales 0.7822 0.5762 0.2283 1.0000
Sol:
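A. In both models the Radio coefficient is positive and highly significant
(p < 2e-16), and its size is similar: 0.2025 in the 1st model and 0.1885 in the
2nd. Its standard error falls from 0.0204 to 0.0086 once TV and Newspaper are
included, so the t-value rises from 9.921 to 21.893. The effect of radio spend
is therefore estimated more precisely in the 2nd model.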
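Each t-value in the two summaries is the estimate divided by its standard error, which can be checked from the printed output. The sketch below uses the Radio rows of both summaries; the variable names are illustrative.

```r
# Radio row of summary(lmRadio)
t_simple <- 0.20250 / 0.02041

# Radio row of summary(lmTotal)
t_multiple <- 0.188530 / 0.008611

round(c(simple = t_simple, multiple = t_multiple), 2)
```

The estimates are close, so most of the jump in the t-value comes from the smaller standard error in the multiple regression.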
Question
As an analyst you are assigned a task to classify a set of emails into spam and non-
spam groups. The dataset consists of 1000 emails with 400 spam and the rest non-spam
emails. As a starting point, the data are divided into training and test sets. You applied
the logistic regression model and used the prediction algorithm on the test data set. The
confusion matrix is obtained as given below.
         FALSE TRUE
non-spam   267   17
spam        21  163
a) What is the accuracy of the null model? [1]
b) How accurate is your classifier (Hint: calculate accuracy)?
c) Interpret the accuracy and take a decision if the model is acceptable.
d) Explain precision in maximum two lines. [1]
e) How much precision is observed by the classification model?
f) What is recall in this case (fraction of spam class detected by classifier)? [1]
Sol:
(a) The null model always predicts the majority class, non-spam. On the full
dataset that gives 600/1000 = 0.60; on the test set it gives
(267+17)/(267+17+21+163) = 284/468 = 0.6068, i.e. about 60.7%.
(b) Accuracy = (267 + 163)/(267+17+21+163)
= 430/468
= 0.9188
(c) The accuracy of the classifier is 91.88%, well above the null model's roughly
60.7%, so the model has high predictive power to classify emails as spam or
not. The model is acceptable.
(d) Precision is the fraction of emails predicted as spam that really are spam,
TP/(TP + FP). It penalises false alarms, i.e. non-spam emails flagged as spam.
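All of the metrics asked for in parts (a) to (f) can be computed directly from the confusion matrix. The sketch below assumes that rows are the actual classes and that the TRUE column means "predicted spam"; the object names are illustrative.

```r
# Rows: actual class; columns: predicted spam (TRUE) or not (FALSE)
cm <- matrix(c(267,  17,
                21, 163),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("non-spam", "spam"),
                             predicted = c("FALSE", "TRUE")))

tn <- cm["non-spam", "FALSE"]   # non-spam correctly passed
fp <- cm["non-spam", "TRUE"]    # non-spam wrongly flagged as spam
fn <- cm["spam", "FALSE"]       # spam that was missed
tp <- cm["spam", "TRUE"]        # spam correctly flagged

null_accuracy <- max(rowSums(cm)) / sum(cm)  # always predict the majority class
accuracy      <- (tp + tn) / sum(cm)
precision     <- tp / (tp + fp)
recall        <- tp / (tp + fn)

round(c(null = null_accuracy, accuracy = accuracy,
        precision = precision, recall = recall), 4)
```

Under that assumption, precision is 163/180, about 0.9056, and recall is 163/184, about 0.8859.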
Question:
Answer the following questions citing proper example of your choice.
(a) Explain, with an example for each, where we should use a boxplot, histogram, bar
chart, stacked bar chart and density plot.
(b) Write four assumptions in a linear regression model development.
Sol:
A. Bar Chart
A bar chart compares a measure across the levels of a categorical variable.
Comparing the heights of the bars is more intuitive than reading the table
alone. A bar chart looks similar to a histogram. The fundamental difference
is that the x-axis of a bar chart is a categorical attribute, whereas the
x-axis of a histogram is divided into numeric intervals.
Histogram
A histogram looks similar to a bar chart because it is also composed of bars.
However, instead of comparing categories, it breaks a numeric variable into
intervals (bins) and shows how many observations fall into each interval.
It is commonly used to gain insights about customers, e.g. Pinterest uses
histograms to show the age distribution of an audience. A histogram is good
at revealing the shape of a distribution along a numeric scale.
Boxplot
We use box plots in descriptive data analysis to show whether a
distribution is skewed and whether there are potential unusual observations
(outliers) in the data set. Box plots are also very useful when large numbers
of observations are involved and when two or more data sets are being compared.
Density Plot
Density plots (also called kernel density plots or density trace graphs) are
used to observe a variable's distribution in a dataset.
The chart is a smoothed version of the histogram and is used for the same
purpose. It uses a kernel density estimate of the variable's probability
density function, which smooths out the noise. The peaks of a density plot
show where values are concentrated over the interval.
An advantage density plots have over histograms is that the shape they show
does not depend on the number of bins chosen, although it does depend on the
smoothing bandwidth.
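The chart types described above, together with the stacked bar chart named in the question, can be sketched with ggplot2 as follows. The built-in mtcars data and the choice of variables are illustrative assumptions, not part of the original answer.

```r
library(ggplot2)

# Bar chart: number of cars in each cylinder category
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()

# Stacked bar chart: the same counts, split by number of gears
p_stacked <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar()

# Histogram: frequency of a numeric variable within bins
p_hist <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# Boxplot: compare the spread and outliers of mpg across groups
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# Density plot: smoothed estimate of the distribution of mpg
p_density <- ggplot(mtcars, aes(x = mpg)) + geom_density()
```

Printing any of these objects, for example p_box, draws the plot.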
Question:
Answer the following questions citing proper example of your choice.
(a) Discuss the importance of variance and bias in the case of simple regression
analysis. How does the variance change when we add more variables for fitting the
linear model?
(b) If we add more variables in a linear regression model to predict the target
variable, the R-squared value increases. Thus, to capture more variance in data, a
data scientist should add variables depending on availability of data. Discuss the
conjecture in the above statement.
Sol:
A. Bias is the systematic error of a model: the gap between its average
prediction and the true value. It arises when the model is too simple for
the data, for example when a straight line is fitted to a curved
relationship.
Variance measures how much the fitted model, and hence its predictions,
would change if it were estimated on a different sample. The goal is to
keep both low. A simple regression with one predictor tends to have higher
bias and lower variance. Adding more variables generally lowers bias but
raises variance, because more coefficients must be estimated from the same
data.
B. Adding independent variables to a multiple linear regression model never
decreases the amount of explained variance in the dependent variable
(typically expressed as R²), even when the added variables are unrelated to
the target. A rising R² is therefore not evidence of a better model: adding
too many independent variables without any theoretical justification may
result in an over-fit model that predicts new data poorly. Adjusted R²,
which penalises extra predictors, or performance on held-out data is a
better guide.
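The behaviour of R² when a useless variable is added can be illustrated with a small experiment. The sketch below is an assumption-laden example rather than part of the original answer: it uses the built-in mtcars data, and the column noise is pure random numbers with no relation to mpg.

```r
set.seed(1)

cars <- mtcars
cars$noise <- rnorm(nrow(cars))   # hypothetical predictor unrelated to mpg

fit_base <- lm(mpg ~ wt, data = cars)
fit_more <- lm(mpg ~ wt + noise, data = cars)

r2 <- c(base = summary(fit_base)$r.squared,
        more = summary(fit_more)$r.squared)
adj_r2 <- c(base = summary(fit_base)$adj.r.squared,
            more = summary(fit_more)$adj.r.squared)

r2       # R-squared cannot go down when a variable is added
adj_r2   # adjusted R-squared can go down, because it penalises the extra term
```

R² of the larger model is at least as high as that of the smaller one, which is why it should not be used alone to decide whether a variable belongs in the model.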
Question
Part of the mpg dataset is shown below.
> mpg
# A tibble: 234 x 11
   manufacturer   cyl trans      drv
 1 audi             4 auto(l5)   f
 2 audi             4 manual(m5) f
 3 audi             4 manual(m6) f
 4 audi             4 auto(av)   f
 5 audi             6 auto(l5)   f
 6 audi             6 manual(m5) f
 7 audi             6 auto(av)   f
 8 audi             4 manual(m5) 4
 9 audi             4 auto(l5)   4
10 audi             4 manual(m6) 4
# ... with 224 more rows
i) The mpg dataset is a tibble. What does that represent?
ii) What is the difference between a tibble and a data frame?
iii) Write code in R using the ggplot library to get the above plot. Show code for
package installation, library and plotting.
iv) In the x-axis it shows factor(cyl). In case we supply only <cyl> as the x-axis
variable, what should be the plot outcome? Write in one line.
v) Write the code to remove the legend from the above plot.
vi) Can you represent the data in some other bar chart form? Write the code for the
same.
Sol:
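1. A tibble is the tidyverse's modern version of a data frame (class tbl_df,
provided by the tibble package).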
2. There are two main differences in the usage of a data frame vs a tibble:
printing and subsetting. Tibbles have a refined print method that shows only
the first 10 rows and the columns that fit on screen, which makes it much
easier to work with large data. When subsetting, a tibble never does partial
matching of column names, and [ always returns another tibble.
3. install.packages("tidyverse")  # installs ggplot2, dplyr and the %>% pipe
library(tidyverse)
mpg
mpg %>%
ggplot(aes(x = factor(cyl))) +
geom_bar(aes(fill = trans), position = "dodge", show.legend = TRUE)
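4.
mpg %>%
ggplot(aes(x = cyl)) +
geom_bar(aes(fill = trans), position = "dodge", show.legend = TRUE)
factor(cyl) treats cyl as a discrete variable, so only the cylinder counts
present in the data (4, 5, 6, 8) appear, as evenly spaced categories. With
plain cyl the x-axis is continuous, so the bars sit at their numeric positions
and empty space is left where no value occurs (for example at 7).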
5.
mpg %>%
ggplot(aes(x=factor(cyl))) +
geom_bar(aes(fill= trans),position = "dodge",show.legend = FALSE)
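For part vi, two other bar chart forms are a stacked bar chart and a proportional (100%) stacked bar chart. The sketch below changes only the position argument of the code above; the object names are illustrative, and the mpg data comes from the ggplot2 package.

```r
library(ggplot2)

# Stacked bars: transmission types piled on top of each other per cylinder count
p_stack <- ggplot(mpg, aes(x = factor(cyl), fill = trans)) +
  geom_bar(position = "stack")

# Proportional bars: each bar scaled to 1, showing the share of each transmission
p_fill <- ggplot(mpg, aes(x = factor(cyl), fill = trans)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
```

The stacked form makes the totals per cylinder count easy to read, while the proportional form makes the mix of transmissions easier to compare across groups.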
Question:
Write one line regarding the output of each of the functions given below. Let <df>
be any data frame.
tail(df,2)
df[, c(1,3)]
table(df$<any variable of df>)
hist(df[, 3]) #column 3 is a factor variable
tapply(iris[,1], iris[,5], mean)
Write the use of the following functions with an example line of code
i. everything()
ii. c()
iii. theme()
iv. geom_smooth()
v. par(mfrow())
vi. pivot_longer()
vii. mutate()
viii. summarise()
ix. facet_wrap()
x. sapply()
Sol:
df[, c(1,3)]:
The empty row index keeps every row, so df[, c(1,3)] returns all rows of df but
only columns 1 and 3.
Use of functions:
mutate(): the mutate function creates new variables (columns) in a data frame, or
modifies existing ones, usually computed from existing columns, e.g.
mutate(iris, ratio = Sepal.Length / Sepal.Width). To use it we need to install
and load the dplyr package, an add-on to R that also provides functions for
selecting, filtering, grouping and arranging data.
sapply(): the sapply() function takes a list, vector or data frame as input, applies
a function to each element and simplifies the result to a vector or matrix
where possible, e.g. sapply(iris[, 1:4], mean). It does the same job as
lapply(), which always returns a list of the same length as the input.
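The one-line answers for the first part of the question can be checked with the sketch below. It uses the built-in iris data in place of <df>, which is an assumption; in iris the factor column is column 5, not column 3, so the hist() call is adapted accordingly.

```r
df <- iris

last_two <- tail(df, 2)          # the last 2 rows of df
cols_1_3 <- df[, c(1, 3)]        # all rows, columns 1 and 3 only
species_counts <- table(df$Species)   # frequency of each value of the variable

# hist() needs numeric input, so calling it on a factor column stops with an error;
# tryCatch() captures the error message instead of halting the script
hist_result <- tryCatch(hist(df[, 5]),
                        error = function(e) conditionMessage(e))

# Mean of column 1 (Sepal.Length) within each level of column 5 (Species)
group_means <- tapply(iris[, 1], iris[, 5], mean)

# sapply() applies mean to each numeric column and returns a named vector
col_means <- sapply(iris[, 1:4], mean)
```

For a factor column, barplot(table(df[, 5])) is one common way to get a chart of the counts instead of hist().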