
MIT 302 – STATISTICAL COMPUTING 11
TUTORIAL 3: Advanced Statistical Modelling

1 Overview
Advanced Statistical Modelling in R takes your data analysis skills to the next level by
introducing you to powerful techniques for modelling complex relationships and extracting
meaningful insights from your data. This course provides a comprehensive overview of
advanced statistical modelling concepts and their practical application using the R
programming language.
The course begins with an introduction to the importance of statistical modelling in data
analysis. You will gain an understanding of how advanced modelling techniques can unravel
complex patterns and uncover hidden relationships within your data.
One of the key topics covered is linear regression, building upon your existing knowledge. You
will dive deeper into multiple linear regression, exploring how to model relationships between
multiple predictors and a continuous outcome variable. You will learn how to assess model
assumptions and diagnostics, as well as how to handle interactions and non-linear relationships.
Another crucial area of focus is generalized linear models (GLMs). You will discover the
versatility of GLMs in modelling a variety of response variables, including binary, count, and
categorical outcomes. Through hands-on exercises, you will gain proficiency in fitting GLMs,
interpreting model output, and assessing model fit.
The course also explores advanced topics such as mixed-effects models, time series analysis,
and machine learning algorithms. You will learn how to handle correlated and nested data using
mixed-effects models, analyse temporal patterns and forecast future values with time series
models, and harness the predictive power of machine learning algorithms for complex
modelling tasks.
Throughout the course, you will work extensively with real-world datasets, applying the
advanced modelling techniques using R. You will gain practical experience in implementing
these models, interpreting results, and effectively communicating your findings.
By the end of the course, you will have a solid foundation in advanced statistical modelling
techniques in R, enabling you to tackle complex data analysis problems and derive meaningful
insights from your data. Whether you're working in academia, industry, or research, this course
equips you with the skills to make informed decisions and drive impactful outcomes through
advanced statistical modelling in R.

2 Linear regression models


Linear regression is a common and powerful statistical technique used to model the relationship
between a dependent variable and one or more independent variables. In R, you can easily
perform linear regression analysis using built-in datasets and various functions available in the
base R package. Let's explore the details of linear regression models in R, using examples,
code, and interpretation of the results. We will also utilize built-in datasets in R for illustration.
2.1 Simple Linear Regression
Simple linear regression models the relationship between a dependent variable and a single
independent variable. The goal is to find the best-fitting straight line that minimizes the sum of
squared residuals.
Example: Using the built-in "mtcars" dataset in R, let's perform a simple linear regression
analysis to predict fuel consumption (mpg) based on the number of cylinders (cyl).
# Load the mtcars dataset
data(mtcars)

# Perform simple linear regression
lm_model <- lm(mpg ~ cyl, data = mtcars)

# Print the summary of the regression model
summary(lm_model)
In this example, we load the "mtcars" dataset and use the lm()
function to fit a linear regression
model. Here, "mpg" is the dependent variable, and "cyl" is the independent variable.
The summary() function provides an overview of the regression model, including the estimated
coefficients, standard errors, t-values, p-values, and goodness-of-fit measures (such as R-
squared and adjusted R-squared).
Interpretation: The output of the summary() function provides valuable insights into the
relationship between the dependent variable and the independent variable. Specifically, you
can interpret the coefficient estimate for "cyl" as the average change in "mpg" for each unit
increase in "cyl". The t-value and p-value associated with the coefficient estimate indicate the
statistical significance of the relationship. Additionally, the R-squared value represents the
proportion of variance explained by the model.
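The estimated coefficients can also be extracted directly rather than read off the summary() output. A brief sketch, assuming the lm_model object fitted above is still in the workspace:
# Extract the intercept and the slope for cyl
coef(lm_model)

# 95% confidence intervals for both coefficients
confint(lm_model)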
2.2 Multiple Linear Regression
Multiple linear regression extends the concept of simple linear regression to include multiple
independent variables. It allows for modelling the relationship between a dependent variable
and multiple predictors.
Example: Let's perform a multiple linear regression analysis using the built-in "mtcars" dataset
to predict fuel consumption (mpg) based on the number of cylinders (cyl), horsepower (hp),
and weight (wt).
# Perform multiple linear regression
lm_model <- lm(mpg ~ cyl + hp + wt, data = mtcars)

# Print the summary of the regression model
summary(lm_model)
In this example, we use the lm() function with multiple independent variables (cyl, hp, and wt)
to fit a multiple linear regression model. The summary() function provides detailed information
about the model's coefficients, standard errors, t-values, and p-values, along with goodness-of-
fit measures.
Interpretation: The summary() output for multiple linear regression provides interpretation for
each independent variable. For example, the coefficient estimates for "cyl", "hp", and "wt"
represent the average change in "mpg" associated with a one-unit increase in each respective
variable, while holding other variables constant. The t-values and p-values indicate the
statistical significance of each independent variable's contribution to the model.
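Because the simple and multiple regressions above are nested models, they can also be compared formally with an F-test. A minimal sketch; the object names lm_simple and lm_full are introduced here purely for illustration:
# Fit the simple and the multiple regression models under separate names
lm_simple <- lm(mpg ~ cyl, data = mtcars)
lm_full <- lm(mpg ~ cyl + hp + wt, data = mtcars)

# F-test of whether hp and wt jointly improve on the simple model
anova(lm_simple, lm_full)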
2.3 Evaluating Model Assumptions
It is important to assess the assumptions of linear regression models, such as linearity,
independence, constant variance (homoscedasticity), and normality of residuals. Violations of
these assumptions can affect the validity of the model.
Example: Let's assess the assumptions of our simple linear regression model using a scatter
plot of the residuals and a normality plot.
# Obtain the residuals from the linear regression model
residuals <- residuals(lm_model)

# Scatter plot of residuals vs. fitted values
plot(lm_model$fitted.values, residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs. Fitted Values")

# Normality plot of residuals
qqnorm(residuals)
qqline(residuals)
In this example, we extract the residuals from the linear regression model using
the residuals() function. We then create a scatter plot of the residuals against the fitted values
to check for heteroscedasticity (non-constant variance). Additionally, we create a normality
plot (Q-Q plot) of the residuals using the qqnorm() and qqline() functions to assess their
normal distribution.
Interpretation: The scatter plot of residuals against the fitted values helps identify any patterns
or trends in the residuals. Ideally, the plot should exhibit random dispersion around the
horizontal line, indicating constant variance. Deviations from randomness may indicate
heteroscedasticity. The normality plot (Q-Q plot) provides a visual assessment of whether the
residuals follow a normal distribution. Ideally, the points on the plot should lie close to the
diagonal reference line, indicating normality. Departures from the diagonal line suggest
deviations from normality.
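The same checks can also be produced with R's built-in diagnostic plots for fitted lm objects, which add a scale-location plot and a residuals-versus-leverage plot to the two plots constructed above. A brief sketch, assuming the lm_model object fitted earlier:
# Arrange the four standard diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(lm_model)

# Restore the default single-plot layout
par(mfrow = c(1, 1))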
Linear regression models in R, supported by built-in datasets, allow for a comprehensive
analysis of the relationship between variables. By fitting regression models, interpreting the
coefficient estimates and statistical significance, and evaluating model assumptions, you can
gain insights into the variables' effects and the overall model's validity. Remember to interpret
the results within the context of your specific dataset and research question.

3 Generalized linear models


Generalized Linear Models (GLMs) are an extension of linear regression models that allow for
the analysis of non-normal response variables, such as binary, count, or categorical data. In R,
you can perform GLM analysis using the glm() function, which is a versatile tool for fitting
GLMs. Let's explore the details of generalized linear models in R, including examples, code,
and interpretation of the results. We will also utilize built-in datasets in R for illustration.
3.1 Logistic Regression
Logistic regression is a widely used statistical modelling technique for predicting binary or
categorical outcomes. It is particularly useful when the dependent variable is dichotomous,
meaning it only has two possible outcomes. Logistic regression models the relationship
between the independent variables and the probability of the outcome occurring.
3.1.1 Components of Logistic Regression
• Dependent Variable: The dependent variable in logistic regression is a binary or categorical
variable. It represents the outcome or event of interest. For example, it could be whether a
customer will churn (1) or not churn (0) in a telecommunications dataset.
• Independent Variables: These are the predictor variables that are used to predict the
probability of the outcome. They can be continuous or categorical variables. For example,
the independent variables could include customer demographics, usage patterns, and
service subscriptions.
3.1.2 Logistic Regression Formula
The logistic regression model estimates the probability of the dependent variable (Y) being 1
(success) given the values of the independent variables (X). The logistic regression model uses
the logistic function, also known as the sigmoid function, to transform a linear combination of
the independent variables into a probability between 0 and 1. The formula for logistic
regression is:
𝑃(𝑌 = 1 ∣ 𝑋) = 1 / (1 + 𝑒^(−(𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝)))

Where:
• 𝑃(𝑌 = 1 ∣ 𝑋) is the probability of the dependent variable being 1 given the values of
the independent variables.
• 𝛽0 , 𝛽1 , 𝛽2 , … , 𝛽𝑝 are the coefficients or parameters estimated by the model.
• 𝑋1 , 𝑋2 , … , 𝑋𝑝 are the values of the independent variables.
3.1.3 Logistic Regression Implementation in R
R provides the "glm" function (generalized linear model) for implementing logistic regression.
Here's an example code snippet to illustrate the implementation:
# Load the dataset
data <- read.csv("customer_data.csv")

# Fit a logistic regression model
model <- glm(churn ~ age + gender + usage, data = data, family = binomial)

# Print the model summary
summary(model)

# Interpretation:
# The model summary provides information about the estimated coefficients and their significance.
# Coefficients with small p-values are considered statistically significant.

# Predict probabilities for new data
new_data <- data.frame(age = c(30, 40, 50), gender = c("Male", "Female", "Male"),
                       usage = c(500, 1000, 1500))
probabilities <- predict(model, newdata = new_data, type = "response")

# Print the predicted probabilities
print(probabilities)

# Interpretation:
# The predicted probabilities represent the probability of churn (1) for the new data points
# based on the fitted logistic regression model.
In this example, the "customer_data.csv" file contains a dataset with the dependent variable
"churn" indicating whether a customer churned (1) or not (0), and independent variables such
as "age," "gender," and "usage." The logistic regression model is fitted using the "glm"
function, specifying the formula, dataset, and the family as "binomial" to indicate logistic
regression. The model summary provides information about the estimated coefficients and their
significance. Coefficients with small p-values are considered statistically significant, indicating
a significant relationship between the independent variable and the probability of the outcome.
The "predict" function is used to predict probabilities for new data provided in the "new_data"
dataframe.
Interpretation of the logistic regression model involves examining the estimated coefficients
and their significance. Positive coefficients indicate a positive relationship with the probability
of the outcome, while negative coefficients indicate a negative relationship. The magnitude of
the coefficients reflects the strength of the relationship.
By using logistic regression in R, you can analyse and predict binary outcomes based on the
relationship between independent variables and the probability of the outcome occurring.
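Because the coefficients are reported on the log-odds scale, it is often convenient to express them as odds ratios. A minimal sketch, assuming the churn model fitted above:
# Exponentiate the coefficients to obtain odds ratios
odds_ratios <- exp(coef(model))
print(odds_ratios)

# Approximate (Wald) 95% confidence intervals on the odds-ratio scale
exp(confint.default(model))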
3.2 Poisson Regression
Poisson regression is a statistical modelling technique used to analyze count data, where the
dependent variable represents the number of occurrences of an event within a fixed interval. It
is commonly used when the dependent variable follows a Poisson distribution. Poisson
regression models the relationship between the independent variables and the expected counts
of the event.
3.2.1 Components of Poisson Regression
• Dependent Variable: The dependent variable in Poisson regression is a count variable
representing the number of occurrences of an event. For example, it could be the
number of accidents at a particular intersection in a day.
• Independent Variables: These are the predictor variables that are used to explain the
variation in the count variable. They can be continuous or categorical variables. For
example, the independent variables could include variables such as the time of day,
weather conditions, and road type.
3.2.2 Poisson Regression Formula
The Poisson regression model estimates the expected counts of the event (Y) given the values
of the independent variables (X). The model assumes that the expected counts follow a Poisson
distribution. The formula for Poisson regression is:
𝐸(𝑌 ∣ 𝑋) = 𝑒^(𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝)
Where:
• 𝐸(𝑌 ∣ 𝑋) is the expected count of the event given the values of the independent
variables.
• 𝛽0 , 𝛽1 , 𝛽2 , … , 𝛽𝑝 are the coefficients or parameters estimated by the model.
• 𝑋1 , 𝑋2 , … , 𝑋𝑝 are the values of the independent variables.

3.2.3 Poisson Regression Implementation in R


R provides the "glm" function (generalized linear model) for implementing Poisson regression.
Here's an example code snippet to illustrate the implementation:
# Load the dataset
data <- read.csv("accident_data.csv")

# Fit a Poisson regression model
model <- glm(accidents ~ time_of_day + weather_condition + road_type, data = data,
             family = poisson)

# Print the model summary
summary(model)

# Interpretation:
# The model summary provides information about the estimated coefficients and their significance.
# Coefficients with small p-values are considered statistically significant.

# Predict expected counts for new data
new_data <- data.frame(time_of_day = c("Morning", "Evening"),
                       weather_condition = c("Clear", "Rainy"),
                       road_type = c("Highway", "City"))
expected_counts <- predict(model, newdata = new_data, type = "response")

# Print the predicted expected counts
print(expected_counts)

# Interpretation:
# The predicted expected counts represent the expected counts of accidents for the new data
# points based on the fitted Poisson regression model.
In this example, the "accident_data.csv" file contains a dataset with the dependent variable
"accidents" representing the number of accidents, and independent variables such as
"time_of_day," "weather_condition," and "road_type." The Poisson regression model is fitted
using the "glm" function, specifying the formula, dataset, and the family as "poisson" to
indicate Poisson regression. The model summary provides information about the estimated
coefficients and their significance. Coefficients with small p-values are considered statistically
significant, indicating a significant relationship between the independent variable and the
expected counts of the event. The "predict" function is used to predict the expected counts for
new data provided in the "new_data" dataframe.
Interpretation of the Poisson regression model involves examining the estimated coefficients
and their significance. Positive coefficients indicate a positive relationship with the expected
counts of the event, while negative coefficients indicate a negative relationship. The magnitude
of the coefficients reflects the strength of the relationship.
By using Poisson regression in R, you can analyze count data and understand the relationship
between independent variables and the expected counts of an event.
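Because the Poisson model is specified on the log scale, exponentiated coefficients can be read as multiplicative effects (rate ratios) on the expected count. A minimal sketch, assuming the accident model fitted above:
# Exponentiate the coefficients to obtain rate ratios
exp(coef(model))

# Informal overdispersion check: a ratio of residual deviance to residual degrees of
# freedom well above 1 suggests the Poisson assumption may be too restrictive
summary(model)$deviance / summary(model)$df.residual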
3.3 Multinomial Logistic Regression
Multinomial logistic regression is a statistical modelling technique used to analyse categorical
dependent variables with more than two categories. It allows us to model the relationship
between multiple independent variables and the probabilities of each category of the dependent
variable. Multinomial logistic regression is an extension of binary logistic regression, which is
used when the dependent variable has only two categories. Let's explore the components,
formula, code, and interpretation of multinomial logistic regression in R:
3.3.1 Components of Multinomial Logistic Regression
• Dependent Variable: The dependent variable in multinomial logistic regression is a
categorical variable with more than two categories. For example, it could be a variable
representing flower species, such as "setosa," "versicolor," and "virginica."
• Independent Variables: These are the predictor variables that are used to explain the
variation in the categories of the dependent variable. They can be continuous or
categorical variables. For example, the independent variables could include flower
measurements such as sepal length, sepal width, petal length, and petal width.
3.3.2 Multinomial Logistic Regression Formula
Multinomial logistic regression models the relationship between the independent variables and
the probabilities of each category of the dependent variable. It uses the softmax function to
estimate the probabilities for each category. The formula for multinomial logistic regression is:

𝑃(𝑌 = 𝑗 | 𝑋) = 𝑒^(𝛽0𝑗 + 𝛽1𝑗𝑋1 + 𝛽2𝑗𝑋2 + ⋯ + 𝛽𝑝𝑗𝑋𝑝) / (1 + ∑_{𝑘=1}^{𝐽−1} 𝑒^(𝛽0𝑘 + 𝛽1𝑘𝑋1 + 𝛽2𝑘𝑋2 + ⋯ + 𝛽𝑝𝑘𝑋𝑝)), for 𝑗 = 1, …, 𝐽 − 1

Where:
• 𝑃(𝑌 = 𝑗|𝑋) is the probability of category j given the values of the independent
variables.
• 𝛽0𝑗, 𝛽1𝑗, 𝛽2𝑗, …, 𝛽𝑝𝑗 are the coefficients or parameters estimated for category j by
the model.
• 𝑋1 , 𝑋2 , … , 𝑋𝑝 are the values of the independent variables.
• J represents the total number of categories.
3.3.3 Multinomial Logistic Regression Implementation in R
R provides the "multinom" function from the "nnet" package for implementing multinomial
logistic regression. Here's an example code snippet to illustrate the implementation using the
iris dataset:
# Load the required library
library(nnet)

# Load the iris dataset
data(iris)

# Fit a multinomial logistic regression model
model <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = iris)

# Print the model summary
summary(model)

# Interpretation:
# The model summary reports the estimated coefficients and their standard errors;
# approximate p-values can be computed from Wald z statistics (see below).

# Predict the probabilities for new data
new_data <- data.frame(Sepal.Length = c(5.1, 6.2), Sepal.Width = c(3.5, 2.9),
                       Petal.Length = c(1.4, 4.5), Petal.Width = c(0.2, 1.5))
probabilities <- predict(model, newdata = new_data, type = "probs")

# Print the predicted probabilities
print(probabilities)

# Interpretation:
# The predicted probabilities represent the probabilities of each category of the dependent
# variable for the new data points based on the fitted multinomial logistic regression model.
In this example, we use the "iris" dataset, which is a built-in dataset in R that contains
measurements of flower species. The dependent variable is "Species," which represents the
flower species, and the independent variables are "Sepal.Length," "Sepal.Width,"
"Petal.Length," and "Petal.Width." The multinomial logistic regression model is fitted using
the "multinom" function, specifying the formula and the dataset. The model summary provides
information about the estimated coefficients and their significance. Coefficients with small p-
values are considered statistically significant, indicating a significant relationship between the
independent variables and the probabilities of each category of the dependent variable. The
"predict" function is used to predict the probabilities for new data provided in the "new_data"
dataframe.
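Since summary() for a multinom fit reports coefficients and standard errors rather than p-values, a common approach is to compute Wald z statistics by hand. A minimal sketch, assuming the model fitted above:
# Wald z statistics and approximate two-sided p-values for the multinom fit
model_summary <- summary(model)
z <- model_summary$coefficients / model_summary$standard.errors
p_values <- 2 * (1 - pnorm(abs(z)))
print(p_values)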
Interpretation of the multinomial logistic regression model involves examining the estimated
coefficients and their significance for each category of the dependent variable. Positive
coefficients indicate a positive relationship with the corresponding category, while negative
coefficients indicate a negative relationship. The magnitude of the coefficients reflects the
strength of the relationship.
By using multinomial logistic regression in R, you can analyse categorical dependent variables
with more than two categories, model the relationship between independent variables and the
probabilities of each category, and make predictions for new observations. The example using
the iris dataset demonstrates how to implement multinomial logistic regression in R, interpret
the model summary, and predict probabilities for new data points.

4 Time Series Analysis


Time series analysis is a statistical technique used to analyse and forecast data that is collected
over a series of equally spaced time intervals. R provides a comprehensive set of packages and
functions for time series analysis.
4.1 Loading and Exploring Time Series Data
R provides several built-in datasets for time series analysis, such as "AirPassengers" and
"EuStockMarkets". Let's start by loading and exploring the "AirPassengers" dataset, which
contains monthly airline passenger numbers from 1949 to 1960.
# Load the "AirPassengers" dataset
data(AirPassengers)

# View the first few rows of the dataset


head(AirPassengers)

# Plot the time series


plot(AirPassengers, main = "Airline Passengers", xlab = "Year", ylab = "Pas
senger Count")
In this example, we use the data() function to load the "AirPassengers" dataset. We then use
the head() function to view the first few rows of the dataset, giving you a glimpse of the
structure and format of the time series data. Finally, we plot the time series using
the plot() function, which provides a visual representation of the data over time.
4.2 Decomposition of Time Series
Time series data often exhibit trends, seasonal patterns, and random fluctuations.
Decomposition is a technique used to separate these components. Let's decompose the
"AirPassengers" time series into its trend, seasonal, and random components.
# Decompose the time series
decomposition <- decompose(AirPassengers)

# Plot the decomposed components
plot(decomposition)
In this example, we use the decompose() function to decompose the "AirPassengers" time
series into its trend, seasonal, and random components. The resulting decomposition object
contains these components, which can be plotted using the plot() function.
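The separated components can also be inspected directly, since decompose() returns them as named elements of the resulting object. A brief sketch:
# Access the individual components of the decomposition
head(decomposition$trend)     # smoothed long-run trend
head(decomposition$seasonal)  # repeating monthly pattern
head(decomposition$random)    # remainder after removing trend and seasonality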
4.3 ARIMA (Autoregressive Integrated Moving Average)
ARIMA is a widely used time series analysis technique for modelling and forecasting data
with a temporal component. It combines autoregressive (AR), differencing (I), and moving
average (MA) components to capture different patterns and dependencies in the data. ARIMA
models are particularly useful when the data exhibits trends, seasonality, or autocorrelation.
4.3.1 Components of ARIMA
• Autoregressive (AR) Component: The AR component represents the relationship
between the current observation and a certain number of lagged (past) observations. It
captures the linear dependency of the current value on its own past values.
• Differencing (I) Component: The differencing component is used to make the time
series stationary. Stationarity implies that the mean, variance, and autocovariance of the
series remain constant over time. Differencing involves taking the difference between
consecutive observations, which helps remove trends or seasonality.
• Moving Average (MA) Component: The MA component represents the dependency
between the current observation and a certain number of lagged forecast errors. It
captures the impact of past errors on the current value.
4.3.2 ARIMA Formula
The ARIMA(p, d, q) model consists of three parameters: p, d, and q, representing the order of
the AR, I, and MA components, respectively. The formula for ARIMA is as follows:
𝑋𝑡 = 𝑐 + 𝜙1𝑋𝑡−1 + 𝜙2𝑋𝑡−2 + ⋯ + 𝜙𝑝𝑋𝑡−𝑝 + 𝜃1𝜖𝑡−1 + 𝜃2𝜖𝑡−2 + ⋯ + 𝜃𝑞𝜖𝑡−𝑞 + 𝜖𝑡

Where:
• 𝑋𝑡 is the time series at time t.
• c is a constant term.
• 𝜙1 , 𝜙2 , … , 𝜙𝑝 are the AR coefficients.
• 𝜖𝑡 represents the error term at time t.
• 𝜃1 , 𝜃2 , … , 𝜃𝑞 are the MA coefficients.
• p, d, and q represent the order of the AR, I, and MA components, respectively.
4.3.3 ARIMA Implementation in R
R provides the "forecast" package, which includes the "arima" function for implementing
ARIMA models. Here's an example code snippet to illustrate the implementation using the iris
dataset:
To provide a more appropriate example, let's consider the "AirPassengers" dataset available in
R, which contains the monthly number of airline passengers from 1949 to 1960. Here's an
updated code snippet using the "AirPassengers" dataset for ARIMA modelling :
# Load the required library
library(forecast)

# Load the AirPassengers dataset
data(AirPassengers)

# Convert the dataset to a time series object
ts_data <- ts(AirPassengers, frequency = 12)

# Plot the original time series
plot(ts_data, main = "AirPassengers Time Series")

# Fit an ARIMA model
arima_model <- auto.arima(ts_data)

# Print the model summary
print(arima_model)

# Interpretation:
# The model summary provides information about the estimated coefficients and their significance.
# Coefficients with small p-values are considered statistically significant.

# Forecast future values
forecast_values <- forecast(arima_model, h = 12)

# Print the forecasted values
print(forecast_values)

# Interpretation:
# The forecasted values represent the predicted future values based on the fitted ARIMA model.
# The "h" parameter specifies the number of future periods to forecast.

# Plot the time series with forecasted values
plot(forecast_values, main = "ARIMA Forecast")

# Interpretation:
# The plot shows the original time series and the forecasted values for future periods.
In this example, we load the "AirPassengers" dataset, which contains the monthly
number of airline passengers from 1949 to 1960. We convert the dataset into a time series object
using the "ts" function, specifying the frequency as 12 since the data is monthly. We then plot
the original time series to visualize the data.
Next, we fit an ARIMA model to the time series data using the "auto.arima" function from the
"forecast" package. The "auto.arima" function automatically selects the best ARIMA model
based on various criteria. The model summary provides information about the estimated
coefficients and their significance.
We then use the "forecast" function to forecast future values based on the fitted ARIMA model.
The "h" parameter specifies the number of future periods to forecast, in this case, 12 months.
The forecasted values represent the predicted future values based on the ARIMA model.
Finally, we use the "plot" function to visualize the time series data along with the forecasted
values.
Interpretation of the ARIMA model involves examining the estimated coefficients, their
significance, and the forecasted values. Positive AR coefficients indicate a positive relationship
with past observations, while negative MA coefficients indicate a dependency on past forecast
errors. The forecasted values provide insights into the future behaviour of the time series,
helping with forecasting and decision-making.
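Before relying on the forecasts, it is good practice to check that the residuals of the fitted model resemble white noise. A minimal sketch, assuming the arima_model object fitted above and the "forecast" package loaded:
# Residual diagnostics: time plot, ACF, and a Ljung-Box test of the residuals
checkresiduals(arima_model)

# In-sample accuracy measures (e.g., RMSE and MAE) for the fitted model
accuracy(arima_model)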

5 Nonlinear Regression Models


Nonlinear regression models are used when the relationship between the response variable and
the predictor variables cannot be adequately described by a linear equation. Instead, nonlinear
regression models allow for more flexible and complex relationships between the variables.
5.1 Polynomial Regression
Polynomial regression is a form of nonlinear regression where the relationship between the
response variable and the predictor variable(s) is modelled using polynomial functions. In
polynomial regression, the predictor variable(s) are raised to different powers to capture
polynomial effects.
The general form of a polynomial regression model with a single predictor variable is:
𝑦 = 𝛽₀ + 𝛽₁𝑥 + 𝛽₂𝑥² + ⋯ + 𝛽ₙ𝑥ⁿ + 𝜀
where:
• y is the response variable.
• x is the predictor variable.
• 𝛽₀, 𝛽₁, 𝛽₂, . . . , 𝛽ₙ are the coefficients of the polynomial terms.
• 𝜀 is the error term.
In R, polynomial regression can be performed using the "lm" function.
# Create example data
x <- 1:10
y <- c(3, 5, 6, 9, 10, 12, 13, 14, 15, 16)

# Fit a polynomial regression model of degree 2
model <- lm(y ~ poly(x, degree = 2))

# Print the model summary
summary(model)

# Predict the response for new values of x
new_x <- 11:15
predicted_y <- predict(model, newdata = data.frame(x = new_x))
In this example, we create sample data with the predictor variable "x" and the response variable
"y." We then fit a polynomial regression model of degree 2 using the "lm" function, where the
predictor variable is transformed using the "poly" function with the specified degree. The
summary of the model provides information about the estimated coefficients and their
significance.
To make predictions for new values of "x," we use the "predict" function, specifying the new
values as a data frame in the "newdata" argument.
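Note that poly() uses orthogonal polynomials by default, so the reported coefficients do not correspond one-to-one with 𝛽₀, 𝛽₁, and 𝛽₂ in the formula above, although the fitted values and predictions are identical. If directly interpretable coefficients are wanted, raw polynomial terms can be requested; a brief sketch:
# Fit the same degree-2 model with raw (non-orthogonal) polynomial terms so that
# the coefficients map directly onto beta0, beta1, and beta2
model_raw <- lm(y ~ poly(x, degree = 2, raw = TRUE))
summary(model_raw)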
5.2 Splines
Splines are another approach to modelling nonlinear relationships in regression. They involve
dividing the predictor variable range into smaller segments and fitting piecewise polynomials
within those segments. The points where the segments join are called knots.
There are different types of splines, such as cubic splines and natural splines. Cubic splines use
cubic polynomials, while natural splines additionally constrain the fitted curve to be linear
beyond the boundary knots.
In R, splines can be implemented using the "splines" package.
# Load the required library
library(splines)

# Create example data
x <- 1:10
y <- c(3, 5, 6, 9, 10, 12, 13, 14, 15, 16)

# Fit a natural cubic spline
model <- lm(y ~ ns(x, df = 3))

# Print the model summary
summary(model)

# Predict the response for new values of x
new_x <- 11:15
predicted_y <- predict(model, newdata = data.frame(x = new_x))
In this example, we load the "splines" package and create sample data with the predictor
variable "x" and the response variable "y." We then fit a cubic spline using the "ns" function,
specifying the number of degrees of freedom (df) as 3. The summary of the model provides
information about the estimated coefficients and their significance.
To make predictions for new values of "x," we use the "predict" function similarly to
polynomial regression.
It's worth noting that there are other types of splines and additional parameters that can be
adjusted to control the smoothness and flexibility of the fitted curve.
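For example, the "splines" package also provides bs() for ordinary B-spline bases, where the polynomial degree and the degrees of freedom can both be chosen. A minimal sketch using the same example data:
# Fit a cubic B-spline basis (degree 3 is the default for bs()) with extra flexibility
model_bs <- lm(y ~ bs(x, df = 5))
summary(model_bs)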

6 Introduction to Bayesian statistics


Bayesian statistics is a branch of statistics that provides a framework for updating beliefs or
knowledge about unknown quantities based on new data. It involves the use of prior
knowledge, data, and Bayes' theorem to make statistical inferences.
6.1 Bayes' Theorem
Bayes' theorem can be stated as follows:
𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴) × 𝑃(𝐴) / 𝑃(𝐵)
where:
• P(A|B) is the posterior probability of hypothesis A given the observed data B.
• P(B|A) is the likelihood, which represents the probability of observing the data B given
the hypothesis A.
• P(A) is the prior probability of hypothesis A.
• P(B) is the probability of observing the data B.
In Bayesian statistics, we start with a prior belief (prior probability) about a parameter or
hypothesis, and then update it using observed data to obtain a posterior belief (posterior
probability).
6.2 Bayesian Inference
Bayesian inference involves three main steps: specifying the prior, calculating the likelihood,
and obtaining the posterior.
6.2.1 Specifying the Prior
The prior represents our initial belief or knowledge about the parameter or hypothesis before
observing any data. It is typically specified as a probability distribution that reflects our
uncertainty. Commonly used priors include uniform, normal, and beta distributions.
6.2.2 Calculating the Likelihood
The likelihood represents the probability of observing the data given different values of the
parameter or hypothesis. It is obtained from the statistical model assumed for the data. The
likelihood is often calculated using probability density functions (PDFs) or probability mass
functions (PMFs).
6.2.3 Obtaining the Posterior
The posterior is the updated belief about the parameter or hypothesis after considering the
observed data. It is obtained by combining the prior and the likelihood using Bayes' theorem.
The posterior distribution provides information about the uncertainty in the parameter or
hypothesis after observing the data.
In R, Bayesian inference can be performed using packages like "rjags," "Stan," or "brms,"
which provide convenient functions for fitting Bayesian models. Here's an example using the
"rjags" package to fit a Bayesian linear regression model:
# Load the required library
library(rjags)

# Create example data
x <- 1:10
y <- c(3, 5, 6, 9, 10, 12, 13, 14, 15, 16)

# Specify the Bayesian model in JAGS syntax
model_code <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- beta0 + beta1 * x[i]
  }
  beta0 ~ dnorm(0, 1e-6)
  beta1 ~ dnorm(0, 1e-6)
  tau ~ dgamma(0.001, 0.001)
}
"

# Prepare the data for JAGS
data <- list(x = x, y = y, N = length(x))

# Specify the parameters to monitor
parameters <- c("beta0", "beta1", "tau")

# Fit the Bayesian model using JAGS
model <- jags.model(textConnection(model_code), data = data, n.chains = 3)
samples <- coda.samples(model, variable.names = parameters, n.iter = 10000)

# Print the summary of posterior distributions
summary(samples)
In this example, we create sample data with the predictor variable "x" and the response variable
"y." We then specify the Bayesian linear regression model in JAGS syntax. The model assumes
a normal distribution for the response variable, with mean mu[i] defined by the linear
regression equation.
We then prepare the data and specify the parameters to monitor. The JAGS model is fitted using
the "jags.model" function, and posterior samples are obtained using the "coda.samples"
function. Finally, we print the summary of the posterior distributions to examine the estimated
parameter values and their uncertainty.
It's worth noting that Bayesian statistics allows for more flexible modelling and inference by
incorporating prior knowledge, handling complex models, and providing a distribution of
credible values for the parameters.
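Before trusting the posterior summaries, it is also common to check that the MCMC chains have converged. A minimal sketch using the coda package (attached together with rjags), assuming the samples object from above:
# Trace and density plots for each monitored parameter
plot(samples)

# Gelman-Rubin diagnostic: potential scale reduction factors close to 1
# indicate that the three chains have converged to the same distribution
gelman.diag(samples)

# Effective sample sizes for the monitored parameters
effectiveSize(samples)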

7 Data
7.1 Customer_data
The following code creates the "customer_data.csv" file with randomly generated customer
data, including the age, gender, usage, and churn variables.
# Load the required library
library(dplyr)

# Set the seed for reproducibility
set.seed(123)

# Create the customer data
customer_data <- tibble(
  age = round(runif(100, min = 18, max = 65)),  # Generate random ages between 18 and 65
  gender = sample(c("Male", "Female"), size = 100, replace = TRUE),  # Generate random gender
  usage = round(rnorm(100, mean = 1000, sd = 200))  # Generate random usage data
)

# Generate the churn variable based on age, gender, and usage
customer_data <- customer_data %>%
  mutate(churn = ifelse(age > 50 | (gender == "Male" & usage < 800), 1, 0))

# Save the customer data as CSV
write.csv(customer_data, file = "customer_data.csv", row.names = FALSE)
In this code, we use the dplyr library to create a tibble (a data frame) named customer_data.
We generate random data for the age variable using the runif function, which generates
random numbers between 18 and 65. The gender variable is randomly assigned either "Male"
or "Female" using the sample function. The usage variable is generated using
the rnorm function to generate normally distributed random numbers with a mean of 1000 and
a standard deviation of 200.
Next, we create the churn variable using the mutate function. We define churn based on certain
conditions: if the age is greater than 50 or if the gender is "Male" and the usage is less than
800, then churn is set to 1; otherwise, churn is set to 0.
Finally, we use the write.csv function to save the customer_data tibble as a CSV file named
"customer_data.csv" without including row names.
7.2 accident_data
The "accident_data.csv" file has randomly generated accident data, including the time of day,
weather condition, road type, and accident counts.
# Load the required library
library(dplyr)

# Set the seed for reproducibility
set.seed(123)

# Create the accident data
accident_data <- tibble(
  time_of_day = sample(c("Morning", "Afternoon", "Evening", "Night"), size = 100, replace = TRUE),  # Generate random time of day
  weather_condition = sample(c("Clear", "Rainy", "Snowy"), size = 100, replace = TRUE),  # Generate random weather conditions
  road_type = sample(c("Highway", "City"), size = 100, replace = TRUE),  # Generate random road types
  accidents = rpois(100, lambda = 5)  # Generate random accident counts with a mean of 5
)

# Save the accident data as CSV
write.csv(accident_data, file = "accident_data.csv", row.names = FALSE)
In this code, we use the dplyr library to create a tibble (a data frame) named accident_data.
We generate random data for the time_of_day variable using the sample function to randomly
select "Morning," "Afternoon," "Evening," or "Night." Similarly, we generate random data for
the weather_condition variable, randomly selecting "Clear," "Rainy," or "Snowy."
The road_type variable is generated by randomly selecting "Highway" or "City" using
the sample function.
Next, we create the accidents variable using the rpois function to generate random accident
counts. We specify a Poisson distribution with a lambda (mean) value of 5, indicating that the
average number of accidents is 5.
Finally, we use the write.csv function to save the accident_data tibble as a CSV file named
"accident_data.csv" without including row names.
