BA Test
Team 16
Question 1:
T16: Text report on the document sent yesterday, covering a bar chart, a frequency count for the 9th
column, and observations from the word cloud.
Dmining.csv
Data allocated: 9th column- Covid 19 treatments that you have confronted
Introduction
The project carried out below applies data mining to text descriptives, followed by a sentiment analysis
of those descriptives based on a term-document matrix and corpus term frequencies.
Descriptive text is text that states what a person, thing, or situation is like. Its purpose is
to describe and reveal the characteristics of the query in a descriptive analysis process.
Data mining finds valuable information hidden in large volumes of data. Data mining is the
analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The
computer is responsible for finding the patterns by identifying the underlying rules and features in the
data.
Descriptive data mining tasks usually find patterns that describe the data and derive new,
significant information from the available data set. Such techniques help formulate underlying
interpretations in the form of a sentiment analysis, drawing out meaningful conclusions and subjective
information, and deducing the polarity of the statements at a basic level.
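Polarity deduction at this basic level can be sketched in base R. The positive/negative word lists and the sample sentence below are hypothetical illustrations, not taken from the survey data:

```r
# Minimal polarity sketch in base R -- the word lists and sentence
# are made up for illustration, not drawn from the Dmining data.
positive <- c("good", "swift", "quick", "immunity")
negative <- c("bad", "slow", "confronted")

polarity <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  sum(words %in% positive) - sum(words %in% negative)  # positive minus negative hits
}

polarity("results should be quick and immunity is good")  # returns 3
```

A positive score suggests overall positive sentiment, a negative score the opposite; dedicated packages refine this idea with curated lexicons.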
Dataset Description
Observations in the Dataset
● In the dataset, the column assigned to us is the 9th column, i.e. Covid 19 treatments that you have
confronted.
● The dataset shows 101 observations for 10 variables, meaning there are 100 observations for each
variable and 1 observation is accounted for by the “query” header.
● The dataset lists all responses by students, each either a single word or a sentence, for
example: zero, not many, texting should be swift and results should be quick, etc.
Full Syntax-
library(tm)            # Corpus(), VectorSource(), TermDocumentMatrix()
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
dmining = Dmining[,9]
dmining
TextDoc = Corpus(VectorSource(dmining))
TextDoc
# Build a term-document matrix
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
TextDoc_dtm
dtm_m <- as.matrix(TextDoc_dtm)
dtm_m
# Sort by decreasing value of frequency
dtm_v <- sort(rowSums(dtm_m), decreasing = TRUE)
dtm_v
dtm_d <- data.frame(word = names(dtm_v), freq = dtm_v)
dtm_d
# Display the top 9 most frequent words
head(dtm_d, 9)
# Plot the most frequent words
barplot(dtm_d[1:9,]$freq, las = 2, names.arg = dtm_d[1:9,]$word,
        col = "turquoise", main = "Top 9 most frequent words",
        ylab = "Word frequencies")
# Generate word cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 4,
          max.words = 100, random.order = FALSE, rot.per = 0.40,
          colors = brewer.pal(8, "Dark2"))
> TextDoc= Corpus(VectorSource(dmining))
> TextDoc
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 101
Analysis- Corpus is one of the four main terms used in data mining; it means a collection of
documents (natural-language texts). Here we have 101 observations, of which the first,
“Covid 19 treatments that you have confronted”, is the query we are working on. The
remaining observations are the documents from which the most frequently used words are
extracted to build the word cloud.
Analysis- Here, the most frequent words have been picked up; prepositions and other function words
contribute little to our understanding. The terms mask, immunity, and treatment, however, indicate what
people think about the Covid treatments they have confronted.
Analysis- The word cloud is a visual representation of the most frequent terms in the documents/sentences
entered by people when recording their perception of Covid treatment.
● From the word cloud it can thus be concluded that the most frequent words are treatment, none,
mask, home, immunity and confronted.
● Given these frequencies, it can be concluded that most respondents did not confront any Covid
19 positive cases.
● It can also be interpreted that, with good immunity, Covid 19 confrontation will be lower.
● Also, better usage of masks and staying home can prevent one from being exposed to Covid
situations.
● Lastly, it can be said that treatments for Covid 19 are carried out as per symptoms these days.
● So, people associate Covid treatment with mask, treatment, drinking, immunity, home, etc.
Question 2:
Worked on the Cardata dataset; the observations received were noted and duly interpreted.
Introduction:
The process carried out below is purely based on visualization of data. Visualization is the conversion of
data into a visual or tabular format so that the characteristics of the data and the relationships among data
items or attributes can be analysed or reported. It is an appealing technique for data exploration.
Visualization can be used to explore patterns or trends, and it is mostly used with large data
sets.
The data set, Cardata includes various particulars related to car sales over a period of time. The various
aspects of the cars are also given. In this report we have interpreted and tried to derive valuable insights
from the data set and various functions of R Studio have been used to derive the same.
Dataset description:
Syntax
summary(Cardata)
Syntax:
library(dplyr)  # provides %>% and count()
Cardata%>%count(Make)
Make n
1 Acura 252
2 Alfa Romeo 5
3 Aston Martin 93
4 Audi 328
5 Bentley 74
6 BMW 334
7 Bugatti 3
8 Buick 196
9 Cadillac 397
10 Chevrolet 1123
11 Chrysler 187
12 Dodge 626
13 Ferrari 69
14 FIAT 62
15 Ford 881
16 Genesis 3
17 GMC 515
18 Honda 449
19 HUMMER 17
20 Hyundai 303
21 Infiniti 330
22 Kia 231
23 Lamborghini 52
24 Land Rover 143
25 Lexus 202
26 Lincoln 164
27 Lotus 29
28 Maserati 58
29 Maybach 16
30 Mazda 423
31 McLaren 5
32 Mercedes-Benz 353
33 Mitsubishi 213
34 Nissan 558
35 Oldsmobile 150
36 Plymouth 82
37 Pontiac 186
38 Porsche 136
39 Rolls-Royce 31
40 Saab 111
41 Scion 60
42 Spyker 3
43 Subaru 256
44 Suzuki 351
45 Tesla 18
46 Toyota 746
47 Volkswagen 809
48 Volvo 281
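A count table like the one above can also be visualised as a bar chart. The sketch below uses a made-up toy data frame rather than Cardata, and assumes the dplyr package is installed:

```r
library(dplyr)  # assumed installed, as in the report's count() calls

# Toy data frame standing in for Cardata -- values are made up
toy <- data.frame(Make = c("Ford", "Ford", "BMW", "Audi", "Ford", "BMW"))

counts <- toy %>% count(Make, sort = TRUE)
counts  # Ford 3, BMW 2, Audi 1

# Bar chart of counts per make
barplot(counts$n, names.arg = counts$Make, col = "steelblue",
        main = "Counts by Make (toy data)")
```

On the real data the same two lines, applied to Cardata, would plot the make frequencies directly instead of only tabulating them.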
Syntax:
cor(Cardata$highway.MPG, Cardata$city.mpg)
0.8868295
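For intuition about the value 0.8868295: cor() returns the Pearson correlation, which approaches 1 as the two vectors become more linearly related. A toy illustration with made-up vectors (not the Cardata columns):

```r
# Toy illustration of cor(); the vectors are made up, not from Cardata
x <- c(10, 20, 30, 40, 50)
y <- c(12, 19, 33, 38, 52)       # roughly linear in x
cor(x, y)                        # close to 1: strong positive correlation

cor(c(1, 2, 3), c(2, 4, 6))      # returns 1 for a perfect linear relation
```

A value of about 0.89 therefore indicates that highway and city mileage rise and fall together almost linearly.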
Syntax:
Cardata%>%count(Vehicle.Size)
Vehicle.Size n
1 Compact 4764
2 Large 2777
3 Midsize 4373
Syntax:
Cardata%>%count(Vehicle.Style)
Vehicle.Style n
1 2dr Hatchback 506
2 2dr SUV 138
3 4dr Hatchback 702
4 4dr SUV 2488
5 Cargo Minivan 71
6 Cargo Van 95
7 Convertible 793
8 Convertible SUV 29
9 Coupe 1211
10 Crew Cab Pickup 681
11 Extended Cab Pickup 623
12 Passenger Minivan 417
13 Passenger Van 128
14 Regular Cab Pickup 392
15 Sedan 3048
16 Wagon 592
Syntax:
Cardata%>%count(Engine.Fuel.Type)
Engine.Fuel.Type n
1 diesel 154
2 electric 66
3 flex-fuel (premium unleaded recommended/E85) 26
4 flex-fuel (premium unleaded required/E85) 54
5 flex-fuel (unleaded/E85) 899
6 flex-fuel (unleaded/natural gas) 6
7 natural gas 2
8 premium unleaded (recommended) 1523
9 premium unleaded (required) 2009
10 regular unleaded 7175
Histogram:
Syntax:
hist(Cardata$Engine.Cylinders)
# Colored histogram with a specified number of bins
hist(Cardata$Engine.Cylinders, breaks=6, col = "pink")
Result:
Histogram with normal curve:
Syntax:
names(Cardata)
x=Cardata$MSRP
h=hist(x,col="green", xlab="MSRP",
main="Histogram with Normal Curve")
xfit=seq(min(x),max(x))
yfit=dnorm(xfit,mean=mean(x),sd=sd(x))
yfit=yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
Result:
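The key step in the code above is the rescaling yfit*diff(h$mids[1:2])*length(x): dnorm() returns a probability density, and multiplying by the bin width and the sample size converts it to expected counts per bin, so the curve sits on the histogram's count scale. A self-contained sketch on simulated data (rnorm, not Cardata):

```r
# Histogram with a scaled normal-curve overlay, on simulated data
set.seed(1)
x <- rnorm(500, mean = 50, sd = 10)
h <- hist(x, col = "grey", main = "Histogram with Normal Curve")
xfit <- seq(min(x), max(x), length.out = 200)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)  # density -> expected counts per bin
lines(xfit, yfit, col = "red", lwd = 2)
```

Without the rescaling, the density curve (whose area is 1) would be invisible next to bars holding hundreds of observations.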
Syntax:
x=Cardata$highway.MPG
h=hist(x,col="light blue", xlab="highway.mpg",
main="Histogram with Normal Curve")
xfit=seq(min(x),max(x))
yfit=dnorm(xfit,mean=mean(x),sd=sd(x))
yfit=yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
Result:
Syntax:
d=density(Cardata$MSRP)
plot(d, main="Kernel Density of MSRP")
polygon(d, col="green", border="blue")
Result:
Syntax:
d=density(Cardata$Year)
plot(d, main="Kernel Density of year")
polygon(d, col="pink", border="blue")
Result:
Observations:
The above figure indicates that most of the cars were sold around the year 2015; the distribution has a
long tail toward the earlier years, i.e. it is left-skewed.
Observations:
● The dataset spans 1990 to 2017, so a 27 year timeframe is covered; the data on cars
have been taken along with their various attributes, as well as people's preferences
regarding various aspects of the cars.
● The Sedan vehicle style is the most popular in the sample, so people as a whole can be
said to be inclined towards vehicles that carry at least 4 people; the 4dr SUV being the
2nd highest only confirms this.
● The sample does not seem to prioritise an eco-friendly approach; rather, cost
effectiveness drives the choice of engine fuel type, as regular unleaded occurs far
more often than the other choices.
● The sample prefers compact cars, which have the maximum occurrences (4764), followed
closely by midsize cars (4373).
● City and highway mileage are highly and positively correlated (r ≈ 0.89), so mileage
does not vary much between highways and cities.
● The sample appears to be drawn largely from the middle-income population, as Chevrolet
has the largest count whereas premium sports marques like Bugatti and Spyker have
the least.
● With reference to mileage, the dataset shows a very wide variety, as there are various
other variables to consider.
● From the above histogram, it can be observed that engine horsepower has its maximum
frequency in the 200-400 range.