
Objective

The dataset used is for Mexico and shows the country's CO2 emission rate from
1990 to 2018.

About the Data


The analysis was done to understand the carbon footprint of Mexico over the years. The
data available were from 1990 to 2018, and the analysis was done to forecast whether
emissions will increase or decrease. The dataset consists of two variables: Year and CO2
emission.

Model Used
The ARIMA (Autoregressive Integrated Moving Average) model is a statistical analysis
model that employs time series data to forecast future trends or to better understand the
current data set. These models use autocorrelations and moving averages over residual
errors in the data to forecast future values.

Data Collection:
Data were collected from Kaggle, an open platform for datasets.

Code:
Mexico = Mexico_data
plot(Mexico)
# Log-transform the emission values to stabilise the variance
Mexico$CO2emissions = log(Mexico$CO2emissions)
# Convert to an annual time series starting in 1990
Mexico_ts = ts(Mexico$CO2emissions, frequency = 1, start = c(1990, 1))
plot(Mexico_ts)
#decom = decompose(Mexico_ts)  # not applicable: annual data have no seasonal component
library(forecast)
# Select the best-fitting ARIMA model automatically
model = auto.arima(Mexico_ts)
model
acf(model$residuals, main = "Correlogram")
pacf(model$residuals, main = "Partial Correlogram")
# Forecast the next 5 years
FC = forecast(model, 5)
FC
library(ggplot2)
autoplot(FC)
accuracy(FC)
# Back-transform the point forecasts to the original scale
fc = as.data.frame(FC)
ds1 = exp(fc$`Point Forecast`)
ts(ds1, frequency = 1, start = c(2019, 1))
ds1
year = c(2019, 2020, 2021, 2022, 2023)
CO2 = c(3.699542, 3.655186, 3.611362, 3.568064, 3.525284)
FP = data.frame(year, CO2)
FP

Fig: CO2 emission from 1990 to 2018

● The AIC value was -98.93; the lower the AIC, the better the model.
● Sharp rises and falls in CO2 emissions can be seen.
Fig. Correlogram
Findings:
The bands of the correlogram lie between 0.4 and 0.8. After the first difference the data
became stationary; only minor changes were observed.

Fig. Partial Correlogram

Findings:
The observed and expected values fall within the confidence bands of the residuals.
Fig. Forecast from Arima

Forecasts from ARIMA(0,2,2) (no mean term):

where p = 0 (autoregressive order),
d = 2 (degree of differencing),
q = 2 (moving-average order).

● Though there was a significant rise during a later period, the variance of the CO2
emissions was stable.
● A steady drop in emissions can be noted.

The forecasted values do not show a significant change in emissions from 2019 to 2023.
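The reported order can also be fitted explicitly rather than selected by auto.arima. As a sketch (using a made-up declining annual series in place of the Mexico data, which is not reproduced here):

```r
library(forecast)
# Made-up annual series standing in for the log-transformed CO2 data
y = ts(log(seq(100, 72, length.out = 29)), start = 1990, frequency = 1)
# Fit the ARIMA(0,2,2) order reported above explicitly
fit = Arima(y, order = c(0, 2, 2))
summary(fit)
# Five-year forecast, as in the report
forecast(fit, h = 5)
```

Fixing the order this way is useful for checking how sensitive the forecasts are to the automatically selected model.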
Linear Regression
Objective
The data describe lung capacity across different age groups and genders. Another
important factor is whether the individual smoked, which had an impact on lung capacity.

Data
The data consist of the variables lung capacity, age, height, gender, smoke and Caesarean.

Model Used:
A linear regression model was used to examine the impact on lung capacity and how
significant a role the other variables played in its deterioration.

Code
getwd()
library(MASS)
library(ggplot2)
x = Lungcapacity_dataset
dim(x)
names(x)
str(x)
summary(x)
set.seed(1)
# Split the data: 80% train, 20% test
row.number = sample(1:nrow(x), 0.8 * nrow(x))
train = x[row.number, ]
test = x[-row.number, ]
# Explore the distribution of the response
ggplot(train, aes(LungCap)) + geom_density(fill = "blue")
# Model 1: default model with all predictors
model1 = lm(LungCap ~ ., data = train)
summary(model1)
par(mfrow = c(2, 2))  # 2x2 grid for the four diagnostic plots
par(bg = "white")
plot(model1)
# Model 2: remove the least significant feature (Caesarean) and refit
model2 = update(model1, ~ . - Caesarean)
summary(model2)
# Predict on the test set with model 2
pred1 = predict(model2, newdata = test)
pred1

Fig: Plot of Lung Capacity

● The density is highest when the lung capacity is between 5 and 10 CC.
● The plot checks the distribution of the LungCap response variable.
● The closer the curve is to a symmetric bell shape, the closer the response is to a
normal distribution.
Findings:
● The model shows that Caesarean birth does not have a very significant impact on an
individual's lung capacity, whereas age, height, smoking and gender do play a role.

Residuals vs Fitted plot: the red trend line is almost at zero except at the start and end,
so the average error is close to zero.
Normal Q-Q plot: this plot indicates whether the residuals are normally distributed; we
check whether the dots fall on the line. Almost all the plotted dots lie on the red line
except at the ends, indicating that the residuals are approximately normally distributed.

Scale-Location plot: this is used to display how the residuals are spread and to indicate
whether they have equal variance (homoscedasticity).
Residuals vs Leverage plot: this plot is useful for identifying influential observations.
Points outside the dashed lines are influential, i.e. they have a strong effect on the
fitted model.
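Beyond the diagnostic plots, test-set predictions can be scored numerically. A minimal sketch, using a made-up data frame with hypothetical columns Age and LungCap in place of the actual Lungcapacity_dataset:

```r
set.seed(1)
# Made-up stand-in data: LungCap driven by Age plus noise (illustration only)
df = data.frame(Age = runif(100, 5, 80))
df$LungCap = 2 + 0.05 * df$Age + rnorm(100, sd = 0.5)
# Same 80/20 split as in the report
idx = sample(1:nrow(df), 0.8 * nrow(df))
train = df[idx, ]
test = df[-idx, ]
fit = lm(LungCap ~ Age, data = train)
pred = predict(fit, newdata = test)
# Root mean squared error on the held-out test set
rmse = sqrt(mean((test$LungCap - pred)^2))
rmse
```

A lower RMSE on the held-out test set indicates better out-of-sample predictive accuracy.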

Linear Programming Problem


Objective
The objective of Linear Programming Problem is to look for a feasible solution when
maximized or minimized when it is subjected to various constraints.
About the Data
It is based on the following problem statement:
A store sells two types of toys, A and B. The store owner pays $8 and $14 for each unit of
toy A and toy B respectively. One unit of toy A yields a profit of $2 while a unit of toy B
yields a profit of $3. The store owner estimates that no more than 2000 toys will be sold
every month, and he does not plan to invest more than $20,000 in the inventory of these
toys. How many units of each type of toy should be stocked in order to maximize his
monthly total profit?
Solution:
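Before running the solver, the optimum can be checked by hand: with x units of toy A and y units of toy B, the objective is to maximise 2x + 3y subject to x + y ≤ 2000 and 8x + 14y ≤ 20000, and the maximum of a linear objective lies at a corner point of the feasible region. A sketch of that check:

```r
# Profit function for the toy problem: 2x + 3y
profit = function(x, y) 2 * x + 3 * y
# Corner points of the region x + y <= 2000, 8x + 14y <= 20000, x, y >= 0
corners = rbind(c(0, 0),
                c(2000, 0),            # sales line meets the x-axis
                c(0, 20000 / 14),      # investment line meets the y-axis
                c(4000 / 3, 2000 / 3)) # intersection of the two lines
# Profit at each corner; the maximum is 14000/3, about 4666.67
apply(corners, 1, function(p) profit(p[1], p[2]))
```

The largest value occurs at the intersection point (4000/3, 2000/3), which is the solution the solver below should reproduce.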

Code:
library(lpSolve)
# Objective: maximise profit 2x + 3y (x = units of toy A, y = units of toy B)
obj = c(2, 3)
# Constraints: x + y <= 2000 (sales limit), 8x + 14y <= 20000 (investment limit),
# and the non-negativity conditions x >= 0, y >= 0
cm = matrix(c(1, 1,
              8, 14,
              1, 0,
              0, 1), nrow = 4, byrow = TRUE)
cd = c("<=", "<=", ">=", ">=")
cr = c(2000, 20000, 0, 0)
opt = lp(direction = "max", obj, cm, cd, cr)
summary(opt)
opt$objective    # objective coefficients
opt$objval       # maximum profit
opt$solution     # optimal units of toys A and B
opt$constraints  # constraint matrix

Findings:
The solver stocks about 1333 units of toy A and 667 units of toy B; the resulting maximum
monthly profit is about $4667 (the value 4666.67 returned in opt$objval is the profit, not
the number of units).

Logistic Regression
Objective
The data used are from a marketing agency. They will help predict whether a user will
click on an advertisement or not.

About the data


The dataset has 10 features:

'Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Ad Topic Line',
'City', 'Male', 'Country', 'Timestamp' and 'Clicked on Ad'.

'Clicked on Ad' is the categorical target feature, which has two possible values: 0 (the
user didn't click) and 1 (the user clicked).

Model
A logistic regression model has been used. Logistic regression is a classification
algorithm used to predict the likelihood of a categorical dependent variable. The
dependent variable in logistic regression is a binary variable with data coded as 1 (yes,
true, normal, success, etc.) or 0 (no, false, abnormal, failure, etc.).
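The model produces this likelihood by mapping a linear combination of the predictors to a probability through the logistic (sigmoid) function. A minimal sketch, with made-up coefficients b0 and b1 that are not taken from the fitted model:

```r
# Logistic (sigmoid) function mapping log-odds to a probability in (0, 1)
sigmoid = function(z) 1 / (1 + exp(-z))
# Hypothetical intercept and slope, for illustration only
b0 = -1.2
b1 = 0.8
x = 2.0
# Predicted probability that the user clicks (y = 1)
p = sigmoid(b0 + b1 * x)
p  # about 0.599
```

This is the same mapping that predict(..., type = "response") applies to the fitted linear predictor in the code below.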

Data collection
The source of the data is Kaggle, an open platform for datasets.

Code:
names(Ad_click_Prediction)
str(Ad_click_Prediction)
dim(Ad_click_Prediction)
# Convert the numeric dependent variable to a factor
Ad_click_Prediction$ClickedonAd = as.factor(Ad_click_Prediction$ClickedonAd)
str(Ad_click_Prediction)
# Model 1: logistic regression on the full data with all predictors
model1 = glm(ClickedonAd ~ ., data = Ad_click_Prediction, family = "binomial")
summary(model1)
set.seed(123)
# Split the data: 80% train, 20% test
part = sample(1:nrow(Ad_click_Prediction), 0.8 * nrow(Ad_click_Prediction))
train = Ad_click_Prediction[part, ]
test = Ad_click_Prediction[-part, ]
dim(train)
dim(test)
# Model 2: logistic regression on the train data with the significant predictors
model2 = glm(ClickedonAd ~ DailyTimeSpentonSite + Age + AreaIncome + DailyInternetUsage,
             data = train, family = "binomial")
summary(model2)
# Predicted probabilities on the test set
p1 = predict(model2, test, type = "response")
head(p1)
# Classify with a 0.5 probability threshold
predict1 = ifelse(p1 > 0.5, 1, 0)
predict1
# Confusion matrix: predicted vs actual classes
table1 = table(predicted = predict1, actual = test$ClickedonAd)
table1
plot(p1, test$ClickedonAd)

The confusion matrix:
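From the confusion matrix, overall accuracy is the share of predictions on the main diagonal (correct classifications) out of all test cases. A sketch, using a hypothetical 2x2 matrix of the same shape rather than the actual counts from the report:

```r
# Hypothetical confusion matrix (predicted x actual), for illustration only
table1 = matrix(c(95, 7,
                  5, 93), nrow = 2, byrow = TRUE,
                dimnames = list(predicted = c(0, 1), actual = c(0, 1)))
# Accuracy = correct predictions / all predictions
accuracy = sum(diag(table1)) / sum(table1)
accuracy  # (95 + 93) / 200 = 0.94
```

The same expression applied to the table1 produced by the code above gives the model's test-set accuracy.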
