9805 MBAex PredAnalBigDataMar22
UNIVERSITY OF DELHI
MBA-Executive II-year Semester examination, April 2022
MBA EX 9805: Predictive Analytics and Big Data
Time: 3 hours Maximum marks: 70
Answer any FOUR questions [17.5 x 4]
1.(a) What are the different stages of the data science life cycle? Describe the activities
carried out in each stage and their utility. (Limit your answer to 300 words)
(b) The following figure is a section of output obtained during multiple regression analysis.
Interpret the results presented there, especially those highlighted (boldfaced and underlined
text).
```{r}
# Multiple Linear Regression
lm.fit <- lm(Sales ~ TV + Radio + Newspaper, data = mydata)
summary(lm.fit)
```
Call:
lm(formula = Sales ~ TV + Radio + Newspaper, data = mydata)
Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
Radio        0.188530   0.008611  21.893   <2e-16 ***
Newspaper   -0.001037   0.005871  -0.177     0.86
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
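As a hedged illustration of how the columns of the coefficient table relate to each other, each t value is the estimate divided by its standard error. The sketch below re-enters the printed figures by hand (the vector names `estimate` and `std_error` are illustrative and not part of the original analysis) and recomputes the t values. Small differences from the printout are expected because the displayed digits are rounded.

```r
# Values copied from the printed coefficient table (rounded as displayed)
estimate  <- c(Intercept = 2.938889, TV = 0.045765,
               Radio = 0.188530, Newspaper = -0.001037)
std_error <- c(Intercept = 0.311908, TV = 0.001395,
               Radio = 0.008611, Newspaper = 0.005871)

# t value = estimate / standard error
t_value <- estimate / std_error
print(round(t_value, 2))

# Rule of thumb (an approximation): |t| above about 2 corresponds to
# significance at roughly the 5% level for a sample of this kind
significant <- names(t_value)[abs(t_value) > 2]
print(significant)
```

Under this rule of thumb, Newspaper is the only predictor that does not stand out, which agrees with its p-value of 0.86 in the printed output.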
2.(a) What is association rule mining? How is it used for developing recommender systems?
(Limit your answer to 200 words)
(b) The following figure is a section of output obtained from application of a predictive analytics
tool. Explain the results presented there, especially those highlighted (boldfaced and underlined
text).
```{r}
# Logistic regression model with all covariates
model1 <- glm(diabetes ~ ., data = train.data, family = binomial)
summary(model1)
```
Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7037  -0.6530  -0.3794   0.6352   2.5264
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.491e+00  1.570e+00  -6.045 1.49e-09 ***
pregnant     1.546e-01  7.271e-02   2.127  0.03344 *
glucose      3.783e-02  7.150e-03   5.291 1.22e-07 ***
pressure    -3.663e-04  1.639e-02  -0.022  0.98217
triceps      2.170e-02  2.171e-02   1.000  0.31739
insulin     -2.174e-05  1.773e-03  -0.012  0.99021
mass         4.788e-02  3.602e-02   1.329  0.18378
pedigree     1.817e+00  5.953e-01   3.053  0.00227 **
age          1.221e-02  2.369e-02   0.515  0.60623
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
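Logistic regression coefficients are on the log-odds scale, so one common way to read them is to exponentiate them into odds ratios. The following is a minimal sketch that uses two coefficients copied by hand from the printed output; the variable names are illustrative assumptions and the original model object is not required.

```r
# Log-odds coefficients copied from the printed output
coef_glucose  <- 3.783e-02
coef_pedigree <- 1.817e+00

# exp(coefficient) is the multiplicative change in the odds of diabetes
# for a one-unit increase in the predictor, holding the others fixed
odds_ratio_glucose  <- exp(coef_glucose)
odds_ratio_pedigree <- exp(coef_pedigree)
print(round(c(glucose = odds_ratio_glucose,
              pedigree = odds_ratio_pedigree), 3))
```

An odds ratio just above 1 for glucose means each additional unit of glucose raises the odds by a few percent, while the much larger value for pedigree reflects the wider effect of a one-unit change on that scale.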
```{r}
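library(magrittr)  # provides the %>% pipe
# Confusion matrix on the test data as proportions, using a 0.5 cutoff
list(
  model1 = table(test.data$diabetes, test.predicted.m1 > 0.5) %>%
    prop.table() %>% round(3)
)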
```
$model1
      FALSE  TRUE
neg   0.601 0.092
pos   0.118 0.190
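The proportions in this table can be turned into the usual classification metrics. The sketch below assumes that rows are the actual classes and that the TRUE column means "predicted positive at the 0.5 cutoff"; the object name `conf_prop` is illustrative. Because the printed proportions are rounded to three decimals, the results are approximate.

```r
# Proportions copied from the printed table
# (rows = actual class, columns = predicted probability > 0.5)
conf_prop <- matrix(c(0.601, 0.092,
                      0.118, 0.190),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("neg", "pos"),
                                    predicted_pos = c("FALSE", "TRUE")))

# Accuracy: share of cases on the diagonal (correctly classified)
accuracy <- conf_prop["neg", "FALSE"] + conf_prop["pos", "TRUE"]

# Sensitivity: correctly flagged positives among all actual positives
sensitivity <- conf_prop["pos", "TRUE"] / sum(conf_prop["pos", ])

# Specificity: correctly cleared negatives among all actual negatives
specificity <- conf_prop["neg", "FALSE"] / sum(conf_prop["neg", ])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```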
```{r}
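library(ROCR)  # provides prediction() and performance()
# Model 1: area under the ROC curve (AUC) on the test data
prediction(test.predicted.m1, test.data$diabetes) %>%
  performance(measure = "auc") %>%
  .@y.values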
```
[[1]]
[1] 0.8250703
3. (a) What do you understand by ‘Text mining’ and ‘Text analytics’? Describe the steps in a text
mining process. (Limit your answer to 200 words).
(b) Read the case “Insurance Group Strengthens Risk Management with Text Mining Solution”
printed at the end of this document. Answer the following questions on the case.
(i) How can text analytics and mining be used to keep up with changing business needs of
insurance companies?
(ii) What were the challenges, the proposed solution, and the obtained results?
(iii) Can you think of other uses of text analytics and text mining for insurance companies?
4. Write short notes on the following (Each in approximately 200 words)
5. A classification tree was fitted to a dataset. A section of the output from the analysis is
presented below. It is divided into nine sections. Describe what is being done in each of the
nine sections, and define the terms printed in boldface and underlined.
Section 1. Split data into training (70%) and validation (30%)
dt <- sort(sample(nrow(mydata), nrow(mydata) * 0.7))
train.data <- mydata[dt, ]
test.data <- mydata[-dt, ]
# Check number of rows in training data set
nrow(train.data)
## [1] 280
## n= 280
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
##  1) root 280 118 No (0.57857143 0.42142857)
##    2) ShelveLoc=Bad,Medium 221 67 No (0.69683258 0.30316742)
##      4) Price>=106.5 146 26 No (0.82191781 0.17808219)
##        8) Advertising< 13.5 123 16 No (0.86991870 0.13008130) *
##        9) Advertising>=13.5 23 10 No (0.56521739 0.43478261)
##         18) Age>=55 11 2 No (0.81818182 0.18181818) *
##         19) Age< 55 12 4 Yes (0.33333333 0.66666667) *
##      5) Price< 106.5 75 34 Yes (0.45333333 0.54666667)
##       10) Age>=68.5 15 1 No (0.93333333 0.06666667) *
##       11) Age< 68.5 60 20 Yes (0.33333333 0.66666667)
##         22) Income< 59.5 17 6 No (0.64705882 0.35294118) *
##         23) Income>=59.5 43 9 Yes (0.20930233 0.79069767) *
##    3) ShelveLoc=Good 59 8 Yes (0.13559322 0.86440678)
##      6) Price>=150 7 2 No (0.71428571 0.28571429) *
##      7) Price< 150 52 3 Yes (0.05769231 0.94230769) *
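One way to read the printed tree is as a set of nested if/else rules. The sketch below re-encodes the splits by hand as a plain R function; the name `predict_high` is hypothetical and this is not how rpart itself stores or applies the tree. It also sums the "loss" column over the eight terminal nodes to show how the leaf-level errors add up.

```r
# Hand-coded version of the printed splits (illustrative only)
predict_high <- function(ShelveLoc, Price, Advertising, Age, Income) {
  if (ShelveLoc == "Good") {
    # Node 3: split on Price at 150 (terminal nodes 6 and 7)
    if (Price >= 150) "No" else "Yes"
  } else if (Price >= 106.5) {
    # Node 4: low advertising ends in node 8; otherwise split on Age (18, 19)
    if (Advertising < 13.5 || Age >= 55) "No" else "Yes"
  } else {
    # Node 5: Age >= 68.5 ends in node 10; otherwise split on Income (22, 23)
    if (Age >= 68.5 || Income < 59.5) "No" else "Yes"
  }
}

# Example: medium shelf location, price 100, age 40, income 70 ends in node 23
predict_high("Medium", Price = 100, Advertising = 5, Age = 40, Income = 70)

# Misclassified training cases at the eight terminal nodes ("loss" column),
# in the order the starred nodes are printed: 8, 18, 19, 10, 22, 23, 6, 7
leaf_loss <- c(16, 2, 4, 1, 6, 9, 2, 3)
# Expressed relative to the 118 errors made at the root node
sum(leaf_loss) / 118
```

The example call returns "Yes", since that case falls into terminal node 23, where about 79% of the training cases are Yes.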
Section 4. Draw the confusion matrix on TRAIN data
library(caret)
Section 6. Tabulate the performance of trees for different values of Complexity Parameter (CP)
printcp(mytree)
##
## Classification tree:
## rpart(formula = High ~ . - Sales, data = train.data, method = "class",
## control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10,
## usesurrogate = 2, xval = 10))
##
## Variables actually used in tree construction:
## [1] Advertising Age Income Price ShelveLoc
##
## Root node error: 118/280 = 0.42143
##
## n= 280
##
## CP nsplit rel error xerror xstd
## 1 0.364407 0 1.00000 1.00000 0.070022
## 2 0.084746 1 0.63559 0.63559 0.062798
## 3 0.042373 3 0.46610 0.56780 0.060502
## 4 0.025424 4 0.42373 0.46610 0.056339
## 5 0.016949 5 0.39831 0.52542 0.058879
## 6 0.010000 7 0.36441 0.60169 0.061694
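A CP table like this is commonly used to choose how far to prune the tree. The sketch below re-enters the printed table by hand (the data frame name `cp_table` is illustrative) and applies two common selection rules: the row with the smallest cross-validated error, and the "one standard error" rule, which picks the smallest tree whose xerror is within one xstd of that minimum. It also converts the relative xerror into an absolute error rate using the root node error of 118/280.

```r
# CP table copied from the printed printcp() output
cp_table <- data.frame(
  CP        = c(0.364407, 0.084746, 0.042373, 0.025424, 0.016949, 0.010000),
  nsplit    = c(0, 1, 3, 4, 5, 7),
  rel_error = c(1.00000, 0.63559, 0.46610, 0.42373, 0.39831, 0.36441),
  xerror    = c(1.00000, 0.63559, 0.56780, 0.46610, 0.52542, 0.60169),
  xstd      = c(0.070022, 0.062798, 0.060502, 0.056339, 0.058879, 0.061694)
)

# Rule 1: row with the smallest cross-validated error
best <- which.min(cp_table$xerror)

# Rule 2 (one standard error rule): smallest tree within one xstd of the minimum
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

# Absolute cross-validated error rate = xerror * root node error
cv_error <- cp_table$xerror[best] * 118 / 280

print(cp_table[best, ])
print(cp_table[one_se, ])
print(round(cv_error, 4))
```

With rpart, a CP value chosen this way would typically be passed to `prune()` on the fitted tree, for example `prune(mytree, cp = cp_table$CP[best])`.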
Section 9. Draw the confusion matrix and calculate the accuracy
conf.matrix <- table(train.data$High, predict(pruned,type="class"))
rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ":")
colnames(conf.matrix) <- paste("Pred", colnames(conf.matrix), sep = ":")
print(conf.matrix)
##
##              Pred:No Pred:Yes
##   Actual:No      145       17
##   Actual:Yes      33       85
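The accuracy asked for in this section follows directly from the confusion matrix. The sketch below re-enters the printed counts by hand (the object name `conf_counts` is illustrative) and computes accuracy, error rate, sensitivity and specificity on the training data, treating Yes as the positive class.

```r
# Counts copied from the printed confusion matrix (training data)
conf_counts <- matrix(c(145, 17,
                         33, 85),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("Actual:No", "Actual:Yes"),
                                      c("Pred:No", "Pred:Yes")))

# Accuracy: correctly classified cases (diagonal) over all cases
accuracy   <- sum(diag(conf_counts)) / sum(conf_counts)
error_rate <- 1 - accuracy

# Sensitivity and specificity with "Yes" as the positive class
sensitivity <- conf_counts["Actual:Yes", "Pred:Yes"] / sum(conf_counts["Actual:Yes", ])
specificity <- conf_counts["Actual:No", "Pred:No"] / sum(conf_counts["Actual:No", ])

# Misclassified cases relative to the 118 errors at the root node
rel_error <- (sum(conf_counts) - sum(diag(conf_counts))) / 118

print(round(c(accuracy = accuracy, error_rate = error_rate,
              sensitivity = sensitivity, specificity = specificity,
              rel_error = rel_error), 4))
```

The 50 misclassified cases divided by the 118 root node errors give a relative error that matches the rel error printed for the four-split tree in the CP table, which suggests the pruned tree has four splits.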
CASE
Insurance Group Strengthens Risk Management with Text Mining Solution
When asked for the biggest challenge facing the Czech automobile insurance industry,
Peter Jedlička, PhD, doesn’t hesitate. “Bodily injury claims are growing
disproportionately compared with vehicle damage claims,” says Jedlička, team leader of
actuarial services for the Czech Insurers’ Bureau (CIB). CIB is a professional organization
of insurance companies in the Czech Republic that handles uninsured, international, and
untraced claims for what’s known as motor third-party liability. “Bodily injury damages
now represent about 45% of the claims made against our members, and that proportion
will continue to increase because of recent legislative changes.”
One of the difficulties that bodily injury claims pose for insurers is that the extent of an
injury is not always predictable in the immediate aftermath of a vehicle accident. Injuries
that were not at first obvious may become acute later, and apparently minor injuries can
turn into chronic conditions. The earlier that insurance companies can accurately
estimate their liability for medical damages, the more precisely they can manage their risk
and consolidate their resources. However, because the needed information is contained
in unstructured documents such as accident reports and witness statements, it is
extremely time-consuming for individual employees to perform the needed analysis.
The solution reads and writes data from virtually all standard file formats and offers
strong, sophisticated data cleaning tools. It also supports even novice users with query
wizards, called Data Mining Recipes, that help them arrive at the answers they need more
quickly.
With the Statistica Text Miner extension, users have access to extraction and selection
tools that can be used to index, classify, and cluster information from large collections of
unstructured text data, such as the narratives of insurance claims. In addition to using
the Statistica solution to make predictions about future medical damage claims, CIB can
also use it to find patterns that indicate attempted fraud or to identify needed road safety
improvements.
Jedlička expects the Statistica solution to greatly improve the ability of CIB to predict the
total medical claims that might arise from a given accident. “The Statistica solution’s data
mining and text mining capabilities are already helping us expose additional risk
characteristics, thus making it possible to predict serious medical claims in earlier stages
of the investigation,” he says. “With the Statistica solution, we can make much more
accurate estimates of total damages and plan accordingly.”
Jedlička is also pleased that the Statistica solution helps CIB offer additional services to
its member companies. “We are in a data-driven business,” he says. “With Statistica, we
can provide our members with detailed analyses of claims and market trends. Statistica
also helps us provide even stronger recommendations concerning claims reserves.”
The intuitive Statistica tools are accessible even to nontechnical users. “The outputs of
our Statistica analyses are easy to understand for business users,” says Jedlička. “Our
business users also find that the analysis results are in line with their own experience and
recommendations, so they readily see the value in the Statistica solution.”