BA Test
Team 16
Question 1:
T16: Text report on the document sent yesterday, covering a bar chart, a frequency count for the 9th
column, and observations from the word cloud.
Dmining.csv
Data allocated: 9th column- Covid 19 treatments that you have confronted
Introduction
The project carried out below applies data mining to text descriptives, followed by a sentiment analysis
of those descriptives based on a term-document matrix and corpus term frequencies.
Descriptive text is text that states what a person, thing, or situation is like. Its purpose is
to describe and reveal the characteristics of the query in a descriptive analysis process.
Data mining finds valuable information hidden in large volumes of data. Data mining is the
analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The
computer is responsible for finding the patterns by identifying the underlying rules and features in the
data.
Descriptive data mining tasks usually find patterns that describe the data and derive new,
significant information from the available data set. Such techniques help formulate underlying
interpretations in the form of a sentiment analysis, drawing out meaningful conclusions and subjective
information, and deducing the polarity of the statements at a basic level.
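Polarity deduction at this basic level can be sketched in base R. The positive/negative word lists and the sample sentence below are hypothetical illustrations, not taken from the survey data:

```r
# Minimal polarity sketch in base R -- the word lists and sentence
# are made up for illustration, not drawn from the Dmining data.
positive <- c("good", "swift", "quick", "immunity")
negative <- c("bad", "slow", "confronted")

polarity <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  sum(words %in% positive) - sum(words %in% negative)  # positive minus negative hits
}

polarity("results should be quick and immunity is good")  # returns 3
```

A positive score suggests overall positive sentiment, a negative score the opposite; dedicated packages refine this idea with curated lexicons.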
Dataset Description
Observations in the Dataset
● In the dataset, the column assigned to us is the 9th column, i.e. Covid 19 treatments that you have
confronted.
● The dataset shows 101 observations for 10 variables, meaning there are 100 observations for each
variable and 1 observation is accounted for by the “query” header.
● The dataset lists all responses by students, each either a single word or a sentence, for
example: zero, not many, texting should be swift and results should be quick, etc.
Full Syntax-
library(tm)            # Corpus(), VectorSource(), TermDocumentMatrix()
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
dmining = Dmining[,9]
dmining
TextDoc = Corpus(VectorSource(dmining))
TextDoc
# Build a term-document matrix
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
TextDoc_dtm
dtm_m <- as.matrix(TextDoc_dtm)
dtm_m
# Sort by decreasing value of frequency
dtm_v <- sort(rowSums(dtm_m), decreasing = TRUE)
dtm_v
dtm_d <- data.frame(word = names(dtm_v), freq = dtm_v)
dtm_d
# Display the top 9 most frequent words
head(dtm_d, 9)
# Plot the most frequent words
barplot(dtm_d[1:9,]$freq, las = 2, names.arg = dtm_d[1:9,]$word,
        col = "turquoise", main = "Top 9 most frequent words",
        ylab = "Word frequencies")
# Generate word cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 4,
          max.words = 100, random.order = FALSE, rot.per = 0.40,
          colors = brewer.pal(8, "Dark2"))
> TextDoc= Corpus(VectorSource(dmining))
> TextDoc
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 101
Analysis- Corpus is one of the four main terms used in data mining; it means a collection of
documents (natural-language texts). Here we have 101 observations, of which the first,
“Covid 19 treatments that you have confronted”, is the query we are working on. The
remaining observations are the documents from which the most frequently used words are
extracted to build the word cloud.
Analysis- Here, the most frequent words have been picked up; prepositions and other function words
contribute little to our understanding. The terms mask, immunity, and treatment, however, indicate what
people think about the Covid treatments they have confronted.
Analysis- The word cloud is a visual representation of the most frequent terms in the documents/sentences
entered by people when recording their perception of Covid treatment.
● From the word cloud it can thus be concluded that the most frequent words are treatment, none,
mask, home, immunity and confronted.
● Given these frequencies, it can be concluded that most respondents did not confront any Covid
19 positive cases.
● It can also be interpreted that, with good immunity, Covid 19 confrontation will be lower.
● Also, better usage of masks and staying home can prevent one from being exposed to Covid
situations.
● Lastly, it can be said that treatments for Covid 19 are carried out as per symptoms these days.
● So, people associate Covid treatment with mask, treatment, drinking, immunity, home, etc.
Question 2:
Worked on the Cardata dataset; the observations received were noted and duly interpreted.
Introduction:
The process carried out below is purely based on visualization of data. Visualization is the conversion of
data into a visual or tabular format so that the characteristics of the data and the relationships among data
items or attributes can be analysed or reported. It is an appealing technique for data exploration.
Visualization can be used to explore patterns or trends, and it is mostly used with large data
sets.
The data set, Cardata includes various particulars related to car sales over a period of time. The various
aspects of the cars are also given. In this report we have interpreted and tried to derive valuable insights
from the data set and various functions of R Studio have been used to derive the same.
Dataset description:
Syntax
summary(Cardata)
Syntax:
library(dplyr)  # provides %>% and count()
Cardata%>%count(Make)
Make n
1 Acura 252
2 Alfa Romeo 5
3 Aston Martin 93
4 Audi 328
5 Bentley 74
6 BMW 334
7 Bugatti 3
8 Buick 196
9 Cadillac 397
10 Chevrolet 1123
11 Chrysler 187
12 Dodge 626
13 Ferrari 69
14 FIAT 62
15 Ford 881
16 Genesis 3
17 GMC 515
18 Honda 449
19 HUMMER 17
20 Hyundai 303
21 Infiniti 330
22 Kia 231
23 Lamborghini 52
24 Land Rover 143
25 Lexus 202
26 Lincoln 164
27 Lotus 29
28 Maserati 58
29 Maybach 16
30 Mazda 423
31 McLaren 5
32 Mercedes-Benz 353
33 Mitsubishi 213
34 Nissan 558
35 Oldsmobile 150
36 Plymouth 82
37 Pontiac 186
38 Porsche 136
39 Rolls-Royce 31
40 Saab 111
41 Scion 60
42 Spyker 3
43 Subaru 256
44 Suzuki 351
45 Tesla 18
46 Toyota 746
47 Volkswagen 809
48 Volvo 281
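A count table like the one above can also be visualised as a bar chart. The sketch below uses a made-up toy data frame rather than Cardata, and assumes the dplyr package is installed:

```r
library(dplyr)  # assumed installed, as in the report's count() calls

# Toy data frame standing in for Cardata -- values are made up
toy <- data.frame(Make = c("Ford", "Ford", "BMW", "Audi", "Ford", "BMW"))

counts <- toy %>% count(Make, sort = TRUE)
counts  # Ford 3, BMW 2, Audi 1

# Bar chart of counts per make
barplot(counts$n, names.arg = counts$Make, col = "steelblue",
        main = "Counts by Make (toy data)")
```

On the real data the same two lines, applied to Cardata, would plot the make frequencies directly instead of only tabulating them.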
Syntax:
cor(Cardata$highway.MPG, Cardata$city.mpg)
0.8868295
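For intuition about the value 0.8868295: cor() returns the Pearson correlation, which approaches 1 as the two vectors become more linearly related. A toy illustration with made-up vectors (not the Cardata columns):

```r
# Toy illustration of cor(); the vectors are made up, not from Cardata
x <- c(10, 20, 30, 40, 50)
y <- c(12, 19, 33, 38, 52)       # roughly linear in x
cor(x, y)                        # close to 1: strong positive correlation

cor(c(1, 2, 3), c(2, 4, 6))      # returns 1 for a perfect linear relation
```

A value of about 0.89 therefore indicates that highway and city mileage rise and fall together almost linearly.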
Syntax:
Cardata%>%count(Vehicle.Size)
Vehicle.Size n
1 Compact 4764
2 Large 2777
3 Midsize 4373
Syntax:
Cardata%>%count(Vehicle.Style)
Vehicle.Style n
1 2dr Hatchback 506
2 2dr SUV 138
3 4dr Hatchback 702
4 4dr SUV 2488
5 Cargo Minivan 71
6 Cargo Van 95
7 Convertible 793
8 Convertible SUV 29
9 Coupe 1211
10 Crew Cab Pickup 681
11 Extended Cab Pickup 623
12 Passenger Minivan 417
13 Passenger Van 128
14 Regular Cab Pickup 392
15 Sedan 3048
16 Wagon 592
Syntax:
Cardata%>%count(Engine.Fuel.Type)
Engine.Fuel.Type n
1 diesel 154
2 electric 66
3 flex-fuel (premium unleaded recommended/E85) 26
4 flex-fuel (premium unleaded required/E85) 54
5 flex-fuel (unleaded/E85) 899
6 flex-fuel (unleaded/natural gas) 6
7 natural gas 2
8 premium unleaded (recommended) 1523
9 premium unleaded (required) 2009
10 regular unleaded 7175
Histogram:
Syntax:
hist(Cardata$Engine.Cylinders)
# Colored histogram with a specified number of bins
hist(Cardata$Engine.Cylinders, breaks=6, col = "pink")
Result:
Histogram with normal curve:
Syntax:
names(Cardata)
x=Cardata$MSRP
h=hist(x,col="green", xlab="MSRP",
main="Histogram with Normal Curve")
xfit=seq(min(x),max(x))
yfit=dnorm(xfit,mean=mean(x),sd=sd(x))
yfit=yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
Result:
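The key step in the code above is the rescaling yfit*diff(h$mids[1:2])*length(x): dnorm() returns a probability density, and multiplying by the bin width and the sample size converts it to expected counts per bin, so the curve sits on the histogram's count scale. A self-contained sketch on simulated data (rnorm, not Cardata):

```r
# Histogram with a scaled normal-curve overlay, on simulated data
set.seed(1)
x <- rnorm(500, mean = 50, sd = 10)
h <- hist(x, col = "grey", main = "Histogram with Normal Curve")
xfit <- seq(min(x), max(x), length.out = 200)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)  # density -> expected counts per bin
lines(xfit, yfit, col = "red", lwd = 2)
```

Without the rescaling, the density curve (whose area is 1) would be invisible next to bars holding hundreds of observations.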
Syntax:
x=Cardata$highway.MPG
h=hist(x,col="light blue", xlab="highway.mpg",
main="Histogram with Normal Curve")
xfit=seq(min(x),max(x))
yfit=dnorm(xfit,mean=mean(x),sd=sd(x))
yfit=yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
Result:
Syntax:
d=density(Cardata$MSRP)
plot(d, main="Kernel Density of MSRP")
polygon(d, col="green", border="blue")
Result:
Syntax:
d=density(Cardata$Year)
plot(d, main="Kernel Density of year")
polygon(d, col="pink", border="blue")
Result:
Observations:
The above figure indicates that most of the cars were sold around the year 2015; the distribution has a
long tail toward the earlier years, i.e. it is left-skewed.
Observations:
● The dataset spans 1990 to 2017, so a 27 year timeframe is covered; the data on cars
have been taken along with their various attributes, as well as people's preferences
regarding various aspects of the cars.
● The Sedan vehicle style is the most popular in the sample, so people as a whole can be
said to be inclined towards vehicles that carry at least 4 people; the 4dr SUV being the
2nd highest only confirms this.
● The sample does not seem to prioritise an eco-friendly approach; rather, cost
effectiveness drives the choice of engine fuel type, as regular unleaded occurs far
more often than the other choices.
● The sample prefers compact cars, which have the maximum occurrences (4764), followed
closely by midsize cars (4373).
● City and highway mileage are highly and positively correlated (r ≈ 0.89), so mileage
does not vary much between highways and cities.
● The sample appears to be drawn largely from the middle-income population, as Chevrolet
has the largest count whereas premium sports marques like Bugatti and Spyker have
the least.
● With reference to mileage, the dataset shows a very wide variety, as there are various
other variables to consider.
● From the above histogram, it can be observed that engine horsepower has its maximum
frequency in the 200-400 range.