
FACULTY OF MANAGEMENT STUDIES

UNIVERSITY OF DELHI
MBA (Executive) II Year Semester Examination, April 2022
MBA EX 9805: Predictive Analytics and Big Data
Time: 3 hours Maximum marks: 70
Answer any FOUR questions [17.5 x 4]

1.(a) What are the different stages of the Data Science life cycle? Describe the activities
performed in each stage and their utility. (Limit your answer to 300 words)

(b) The following is a section of output obtained during a multiple regression analysis.

Interpret the results presented there, especially the highlighted (boldfaced and underlined)
text.

```{r}
# Multiple Linear Regression
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data = mydata)
summary(lm.fit)
```

Call:
lm(formula = Sales ~ TV + Radio + Newspaper, data = mydata)

Residuals:
Min 1Q Median 3Q Max
-8.8277 -0.8908 0.2418 1.1893 2.8292

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 <2e-16 ***
TV 0.045765 0.001395 32.809 <2e-16 ***
Radio 0.188530 0.008611 21.893 <2e-16 ***
Newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
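
As a point of reference, the highlighted quantities can be reproduced directly from the fitted model. A minimal sketch, assuming the same `lm.fit` and `mydata` (with Sales, TV, Radio, and Newspaper columns) as above:

```{r}
# Sketch: recovering the highlighted quantities from the fitted model
coefs <- summary(lm.fit)$coefficients
# Each t value is the Estimate divided by its Std. Error, e.g. for TV:
coefs["TV", "Estimate"] / coefs["TV", "Std. Error"]   # ~32.8, as printed
summary(lm.fit)$r.squared    # 0.8972: share of Sales variance explained
confint(lm.fit)              # 95% CIs; Newspaper's interval spans zero,
                             # consistent with its p-value of 0.86
```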

2.(a) What is association rule mining? How is it used for developing recommender systems?
(Limit your answer to 200 words)
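
For orientation, a minimal sketch of association rule mining in R with the arules package (the package, dataset, and thresholds are illustrative assumptions, not part of the question):

```{r}
# Illustrative sketch: mining association rules with the arules package
library(arules)
data(Groceries)   # market-basket transactions shipped with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
# Rules with high lift suggest "customers who bought X also buy Y"
# recommendations of the kind used in recommender systems
inspect(head(sort(rules, by = "lift"), 3))
```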

(b) The following is a section of output obtained from a predictive analytics tool. Explain
the results presented there, especially the highlighted (boldfaced and underlined) text.

```{r}
model1 <- glm( diabetes ~ ., data = train.data, family = binomial)
# model with all covariates
summary(model1)
```

Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.7037 -0.6530 -0.3794 0.6352 2.5264

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.491e+00 1.570e+00 -6.045 1.49e-09 ***
pregnant 1.546e-01 7.271e-02 2.127 0.03344 *
glucose 3.783e-02 7.150e-03 5.291 1.22e-07 ***
pressure -3.663e-04 1.639e-02 -0.022 0.98217
triceps 2.170e-02 2.171e-02 1.000 0.31739
insulin -2.174e-05 1.773e-03 -0.012 0.99021
mass 4.788e-02 3.602e-02 1.329 0.18378
pedigree 1.817e+00 5.953e-01 3.053 0.00227 **
age 1.221e-02 2.369e-02 0.515 0.60623
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 308.67 on 238 degrees of freedom
Residual deviance: 212.37 on 230 degrees of freedom
AIC: 230.37

Number of Fisher Scoring iterations: 5
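
Since the model is a logit, the highlighted coefficients are easiest to read as odds ratios. A short sketch, assuming the fitted `model1` above:

```{r}
# Sketch: reading the logit coefficients as odds ratios
exp(coef(model1))["glucose"]    # ~1.04: each unit of glucose raises the
                                # odds of diabetes by about 4%
exp(coef(model1))["pedigree"]   # ~6.2: a one-unit rise in pedigree
                                # multiplies the odds roughly six-fold
# Fit: the covariates cut the deviance from 308.67 to 212.37
with(model1, null.deviance - deviance)   # 96.3
```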

```{r}
# Cell proportions of the test-set confusion matrix at a 0.5 cut-off
# (the %>% pipe comes from magrittr/dplyr)
list(
  model1 = table(test.data$diabetes, test.predicted.m1 > 0.5) %>%
    prop.table() %>% round(3)
)
```

$model1

FALSE TRUE
neg 0.601 0.092
pos 0.118 0.190
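
The overall test-set accuracy at the 0.5 cut-off is the sum of the two diagonal proportions above:

```{r}
# Accuracy from the diagonal of the proportion table
0.601 + 0.190   # = 0.791, i.e. ~79% of test cases correctly classified
```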

```{r}
# Model 1 AUC; prediction() and performance() come from the ROCR package
prediction(test.predicted.m1, test.data$diabetes) %>%
  performance(measure = "auc") %>%
  .@y.values
```

[[1]]
[1] 0.8250703
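
The AUC of about 0.825 is the area under the ROC curve. A sketch of how that curve could be drawn from the same ROCR objects (the plotting step is illustrative, not part of the printed output):

```{r}
# Sketch: plotting the ROC curve whose area is the 0.825 reported above
library(ROCR)   # provides prediction() and performance()
roc <- performance(prediction(test.predicted.m1, test.data$diabetes),
                   measure = "tpr", x.measure = "fpr")
plot(roc)
abline(a = 0, b = 1, lty = 2)   # chance line (AUC = 0.5) for comparison
```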

3. (a) What do you understand by ‘Text mining’ and ‘Text analytics’? Describe the steps in a text
mining process. (Limit your answer to 200 words).
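
For orientation, a minimal sketch of a typical text-mining pipeline with the tm package (the package choice and the toy documents are assumptions, not part of the question):

```{r}
# Illustrative sketch: basic text-mining steps with the tm package
library(tm)
docs <- c("Claim for bodily injury after a collision",
          "Minor vehicle damage, no injury reported")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))       # normalise case
corpus <- tm_map(corpus, removePunctuation)                  # clean tokens
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
dtm <- DocumentTermMatrix(corpus)   # structure text for downstream analytics
inspect(dtm)
```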

(b) Read the case “Insurance Group Strengthens Risk Management with Text Mining Solution”
printed at the end of this document. Answer the following questions on the case.

(i) How can text analytics and mining be used to keep up with changing business needs of
insurance companies?

(ii) What were the challenges, the proposed solution, and the obtained results?

(iii) Can you think of other uses of text analytics and text mining for insurance companies?

4. Write short notes on the following (each in approximately 200 words):

(i) Genesis of Big Data applications and current trends

(ii) Big Data processing technologies and methods

(iii) Creating strategic value from Big Data analytics

5. A classification tree was applied to a dataset. A section of the output from the analysis is
presented below, divided into nine sections. Describe what is being done in each of these
nine sections and define the terms printed in boldfaced and underlined text.

What conclusion can you draw from this analysis?

Loading the data as a dataframe


library(ISLR)
mydata <- Carseats

Check all variables


str(mydata)

## 'data.frame': 400 obs. of 12 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ High : Factor w/ 2 levels "No","Yes": 2 2 2 1 1 2 1 2 1 1 ...
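
Note that `Carseats` itself has 11 variables; the `High` factor shown above must have been derived from `Sales` beforehand. A sketch of the usual construction (an assumption following the ISLR textbook convention, not shown in the paper):

```{r}
# Assumed preprocessing step: High flags stores with Sales above 8 (thousand)
mydata$High <- factor(ifelse(mydata$Sales <= 8, "No", "Yes"))
```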

Section 1. Split data into training (70%) and validation (30%)
dt <- sort(sample(nrow(mydata), nrow(mydata) * 0.7))
train.data <- mydata[dt, ]
test.data <- mydata[-dt, ]
# Check the number of rows in the training data set
nrow(train.data)

## [1] 280
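
Because `sample()` draws a different random subset on each run, the numbers below would change if the code were re-run. A reproducible variant would fix the seed first (a sketch; the seed value is arbitrary and not part of the printed output):

```{r}
# Sketch: fixing the RNG seed makes the 70/30 split reproducible
set.seed(1)
dt <- sort(sample(nrow(mydata), nrow(mydata) * 0.7))
```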

Section 2. Fit the decision tree model


library(rpart)   # provides rpart() and rpart.control()
mytree <- rpart(High ~ . - Sales, data = train.data, method = "class",
                control = rpart.control(minsplit = 20, minbucket = 7,
                                        maxdepth = 10, usesurrogate = 2,
                                        xval = 10))
mytree

## n= 280
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 280 118 No (0.57857143 0.42142857)
## 2) ShelveLoc=Bad,Medium 221 67 No (0.69683258 0.30316742)
## 4) Price>=106.5 146 26 No (0.82191781 0.17808219)
## 8) Advertising< 13.5 123 16 No (0.86991870 0.13008130) *
## 9) Advertising>=13.5 23 10 No (0.56521739 0.43478261)
## 18) Age>=55 11 2 No (0.81818182 0.18181818) *
## 19) Age< 55 12 4 Yes (0.33333333 0.66666667) *
## 5) Price< 106.5 75 34 Yes (0.45333333 0.54666667)
## 10) Age>=68.5 15 1 No (0.93333333 0.06666667) *
## 11) Age< 68.5 60 20 Yes (0.33333333 0.66666667)
## 22) Income< 59.5 17 6 No (0.64705882 0.35294118) *
## 23) Income>=59.5 43 9 Yes (0.20930233 0.79069767) *
## 3) ShelveLoc=Good 59 8 Yes (0.13559322 0.86440678)
## 6) Price>=150 7 2 No (0.71428571 0.28571429) *
## 7) Price< 150 52 3 Yes (0.05769231 0.94230769) *

Section 3. Plot the tree

# Plot the tree and label its splits
plot(mytree)
text(mytree)

Section 4. Draw the confusion matrix on TRAIN data
library(caret)

## Loading required package: lattice

pred.default.train <- predict(mytree, train.data, type = "class")
confusionMatrix(pred.default.train, train.data$High)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 146 27
## Yes 16 91
##
## Accuracy : 0.8464
## 95% CI : (0.7988, 0.8866)
## No Information Rate : 0.5786
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.681
##
## Mcnemar's Test P-Value : 0.1273
##
## Sensitivity : 0.9012
## Specificity : 0.7712
## Pos Pred Value : 0.8439
## Neg Pred Value : 0.8505
## Prevalence : 0.5786
## Detection Rate : 0.5214
## Detection Prevalence : 0.6179
## Balanced Accuracy : 0.8362
##
## 'Positive' Class : No
##
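
The reported training accuracy follows directly from the matrix cells:

```{r}
# Accuracy = correctly classified / total training cases
(146 + 91) / 280   # = 0.8464, the Accuracy line above
```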

Section 5. Draw the confusion matrix on TEST data


pred.default.test <- predict(mytree, test.data, type = "class")
confusionMatrix(pred.default.test,test.data$High)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 54 19
## Yes 20 27
##
## Accuracy : 0.675
## 95% CI : (0.5835, 0.7577)
## No Information Rate : 0.6167
## P-Value [Acc > NIR] : 0.1104
##
## Kappa : 0.3154
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7297
## Specificity : 0.5870
## Pos Pred Value : 0.7397
## Neg Pred Value : 0.5745
## Prevalence : 0.6167
## Detection Rate : 0.4500
## Detection Prevalence : 0.6083
## Balanced Accuracy : 0.6583
##
## 'Positive' Class : No
##

Section 6. Tabulate the performance of the tree for different values of the Complexity Parameter (CP)
printcp(mytree)

##
## Classification tree:
## rpart(formula = High ~ . - Sales, data = train.data, method = "class",
## control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10,
## usesurrogate = 2, xval = 10))
##
## Variables actually used in tree construction:
## [1] Advertising Age Income Price ShelveLoc
##
## Root node error: 118/280 = 0.42143

##
## n= 280
##
## CP nsplit rel error xerror xstd
## 1 0.364407 0 1.00000 1.00000 0.070022
## 2 0.084746 1 0.63559 0.63559 0.062798
## 3 0.042373 3 0.46610 0.56780 0.060502
## 4 0.025424 4 0.42373 0.46610 0.056339
## 5 0.016949 5 0.39831 0.52542 0.058879
## 6 0.010000 7 0.36441 0.60169 0.061694

bestcp <- mytree$cptable[which.min(mytree$cptable[,"xerror"]),"CP"]
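
Here `which.min()` picks the row of the CP table with the smallest cross-validated error (`xerror`); in the table above that is row 4, so:

```{r}
bestcp   # 0.025424, the CP with the minimum xerror of 0.46610
```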

Section 7. Prune the tree using the best CP


pruned <- prune(mytree, cp = bestcp)

Section 8. Plot the pruned tree


library(rpart.plot)   # provides prp()
prp(pruned, faclen = 0, cex = 0.8, extra = 1)

Section 9. Draw the confusion matrix and calculate the accuracy
conf.matrix <- table(train.data$High, predict(pruned,type="class"))
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ":")
colnames(conf.matrix) <- paste("Pred", colnames(conf.matrix), sep = ":")
print(conf.matrix)

##
## Pred:No Pred:Yes
## Actual:No 145 17
## Actual:Yes 33 85
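
The pruned tree's training accuracy again follows from the diagonal:

```{r}
# Accuracy of the pruned tree on the training data
(145 + 85) / 280   # ~0.821, slightly below the unpruned tree's 0.846
```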

pred.pruned.test <- predict(pruned, test.data, type = "class")
confusionMatrix(pred.pruned.test, test.data$High)

## Confusion Matrix and Statistics


##
## Reference
## Prediction No Yes
## No 53 24
## Yes 21 22
##
## Accuracy : 0.625
## 95% CI : (0.532, 0.7117)
## No Information Rate : 0.6167
## P-Value [Acc > NIR] : 0.4655
##
## Kappa : 0.1969
##
## Mcnemar's Test P-Value : 0.7656
##
## Sensitivity : 0.7162
## Specificity : 0.4783
## Pos Pred Value : 0.6883
## Neg Pred Value : 0.5116
## Prevalence : 0.6167
## Detection Rate : 0.4417
## Detection Prevalence : 0.6417
## Balanced Accuracy : 0.5972
##
## 'Positive' Class : No
##

CASE
Insurance Group Strengthens Risk Management with Text Mining Solution

When asked for the biggest challenge facing the Czech automobile insurance industry,
Peter Jedlička, PhD, doesn’t hesitate. “Bodily injury claims are growing
disproportionately compared with vehicle damage claims,” says Jedlička, team leader of
actuarial services for the Czech Insurers’ Bureau (CIB). CIB is a professional organization
of insurance companies in the Czech Republic that handles uninsured, international, and
untraced claims for what’s known as motor third-party liability. “Bodily injury damages
now represent about 45% of the claims made against our members, and that proportion
will continue to increase because of recent legislative changes.”

One of the difficulties that bodily injury claims pose for insurers is that the extent of an
injury is not always predictable in the immediate aftermath of a vehicle accident. Injuries
that were not at first obvious may become acute later, and apparently minor injuries can
turn into chronic conditions. The earlier that insurance companies can accurately
estimate their liability for medical damages, the more precisely they can manage their risk
and consolidate their resources. However, because the needed information is contained
in unstructured documents such as accident reports and witness statements, it is
extremely time consuming for individual employees to perform the needed analysis.

To expand and automate the analysis of unstructured accident reports, witness
statements, and claim narratives, CIB deployed a data analysis solution based on Dell
Statistica Data Miner and the Statistica Text Miner extension. Statistica Data Miner
offers a set of intuitive, user-friendly tools that are accessible even to nonanalysts.

The solution reads and writes data from virtually all standard file formats and offers
strong, sophisticated data cleaning tools. It also supports even novice users with query
wizards, called Data Mining Recipes, that help them arrive at the answers they need more
quickly.

With the Statistica Text Miner extension, users have access to extraction and selection
tools that can be used to index, classify, and cluster information from large collections of
unstructured text data, such as the narratives of insurance claims. In addition to using
the Statistica solution to make predictions about future medical damage claims, CIB can
also use it to find patterns that indicate attempted fraud or to identify needed road safety
improvements.

Improves Accuracy of Liability Estimates

Jedlička expects the Statistica solution to greatly improve the ability of CIB to predict the
total medical claims that might arise from a given accident. “The Statistica solution’s data
mining and text mining capabilities are already helping us expose additional risk
characteristics, thus making it possible to predict serious medical claims in earlier stages
of the investigation,” he says. “With the Statistica solution, we can make much more
accurate estimates of total damages and plan accordingly.”

Expands Service Offerings to Members

Jedlička is also pleased that the Statistica solution helps CIB offer additional services to
its member companies. “We are in a data-driven business,” he says. “With Statistica, we
can provide our members with detailed analyses of claims and market trends. Statistica
also helps us provide even stronger recommendations concerning claims reserves.”

Intuitive for Business Users

The intuitive Statistica tools are accessible to even nontechnical users. “The outputs of
our Statistica analyses are easy to understand for business users,” says Jedlička. “Our
business users also find that the analysis results are in line with their own experience and
recommendations, so they readily see the value in the Statistica solution.”

