x22209077 Prajwal Shashidhara TABA Report
Prajwal Shashidhara
Statistics For Data Analysis
National College Of Ireland
Dublin, Ireland
x22209077@student.ncirl.ie
Abstract—Time Series Analysis: Weather forecasting is important because knowing what the weather is likely to be ahead of time helps us plan. Our project concerns mean Convective Boundary Layer (CBL) pressure data collected at Dublin Airport in Ireland. The report presents a comprehensive analysis of the mean CBL pressure (hPa) time series obtained from historical weather records. The objective was to conduct a thorough time-series analysis employing different modelling techniques to forecast the mean CBL pressure. Various models, including Simple Time Series Models, Exponential Smoothing, and ARIMA/SARIMA, were applied and evaluated for forecasting performance.

Logistic Regression: This report investigates a data set encompassing various characteristics of individuals, with the primary goal of predicting the likelihood of individuals having cardiac conditions. Key factors such as age, weight, gender, and fitness scores are analyzed to understand their potential correlation with the presence or absence of cardiac issues. The study comprises several essential steps, including exploring the data to understand its patterns and constructing a predictive model using logistic regression techniques. Utilizing statistical methods and visualizations, this exploration aims to provide a clear understanding of how each factor might influence the likelihood of cardiac issues.

Index Terms—Time series analysis, Weather Forecasting, Seasonal Decomposition, Stationarity, Exponential Smoothing, ACF, PACF, ARIMA/SARIMA, Cardiac Conditions, Logistic Regression, Maximum Likelihood Estimation, Confusion Matrix, Ridge Model, Lasso Model, Accuracy, AUC-ROC

I. INTRODUCTION

A-Time series analysis: Weather forecasting is very important in everyday life because it helps people plan what to wear and how to prepare; in our case the data cover temperature, atmospheric pressure, and wind speed. The dataset, reported by Met Eireann, spans several decades and focuses on a specific weather station, Dublin Airport. It contains observations from 1st January 1942 to 31st October 2023, providing a rich repository of climatic insights. The main aim is to explore, analyze, and model the time series data to develop robust forecasting methods. This involves the application of various statistical techniques such as ARIMA/SARIMA models, Exponential Smoothing, and Simple Time Series Models to capture and predict weather trends accurately. Through this analysis, we aim to uncover seasonal variations, detect underlying patterns, and evaluate different forecasting models' performance. Ultimately, the project aims to contribute to enhanced weather prediction capabilities, enabling better preparedness and informed decision-making in response to changing weather conditions.

B-Logistic Regression: Cardiovascular diseases are a major health problem globally, affecting people across a wide range of populations. They include a variety of heart- and blood-vessel-related conditions that often pose significant health risks, so addressing and understanding the factors contributing to them is essential for effective prevention and early detection. Logistic regression, a statistical technique, is used in this study to create a predictive model. The primary goal is to identify possible connections between individuals' demographics, health status, and susceptibility to cardiac conditions. By exploring various factors such as age, weight, gender, and fitness scores, this investigation endeavours to uncover patterns or correlations that might exist between these attributes and the likelihood of individuals having cardiac issues. The goal is a reliable predictive model that could assist in identifying individuals at risk of, or more prone to developing, cardiac conditions based on readily available information.

II. METHODOLOGY

Fig. 1. Time series Methodology: Methodology for Time Series Analysis on weather data
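The evaluation step of the methodology above (fit simple baseline models, then compare their forecasts by RMSE on a hold-out split) can be sketched as follows. This is a minimal illustration on a synthetic stand-in series, not the report's actual data or code; all numbers are placeholders.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def naive_forecast(history, horizon):
    """Naive model: repeat the last observed value for each future step."""
    return np.full(horizon, history[-1], dtype=float)

def moving_average_forecast(history, horizon, window=2):
    """Forecast every future step with the mean of the last `window` values."""
    return np.full(horizon, np.mean(history[-window:]), dtype=float)

# Synthetic stand-in for a pressure series (random-walk-like, placeholder values).
rng = np.random.default_rng(0)
series = 1013 + np.cumsum(rng.normal(0, 0.5, 200))

train, test = series[:180], series[180:]
for name, fc in [("naive", naive_forecast(train, len(test))),
                 ("2-pt MA", moving_average_forecast(train, len(test)))]:
    print(name, round(rmse(test, fc), 3))
```

The same split-and-score loop extends directly to the Exponential Smoothing and ARIMA/SARIMA models discussed later.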
Fig. 2. Logistic Regression Methodology: Methodology for Logistic Regression on Cardiac Conditions

III. DEFINITION AND TERMINOLOGY

Logistic Regression
The logistic function, or S-shaped curve, acts like a magical machine that takes any number and squeezes it between 0 and 1. Imagine it as a "squisher" that never quite reaches 0 or 1 but stays very close. This tool is handy for estimating probabilities, such as whether your favorite team will win a game. It is also used in nature to describe how animal or plant populations grow: they eventually stop growing because there is only so much space and there are only so many resources available for them to thrive.

The sigmoid function in logistic regression is like a magical converter: you give it any number, and it crunches it down to a value between 0 and 1. It is a bit like squishing play dough: no matter how big or small the number you put in, the sigmoid always squashes it into this special range. This function is useful because it helps predict the chance of something happening or not happening, like whether your favourite team will win a game.

Fig. 3. Sigmoid Curve

Accuracy: Think of accuracy like a score in a game: out of all the games a team played, it tells you how many they won.
Precision: Imagine a chef making burgers and trying not to waste any ingredients. Precision is like how accurately the chef uses the right ingredients without wasting any, making the perfect burger patties.
F1 Score: Think of the F1 Score as a grade that combines how well a student does on a written test and a presentation: a balanced score that considers both performances equally.
AUC-ROC: Imagine two friends studying for a test. AUC-ROC measures how well they can separate the topics they have mastered from the ones they are still confused about, i.e., how good they are at distinguishing between what they know and what they don't.
Confusion Matrix: Imagine a scoreboard showing how many questions a teacher marked right and wrong while grading papers for two different subjects. It counts how many correct and incorrect answers were given for each subject.

IV. EXPLORATORY DATA ANALYSIS

A-Time series Analysis

A. Data set and Variable Details
The dataset comprises a comprehensive collection of daily weather data dating back to 1st January 1942, providing valuable insights into various meteorological parameters. Each row corresponds to a specific date and contains measurements related to different weather aspects. These measurements include the mean CBL pressure (CBL) in hectopascals, representing average pressure, and measurements depicting water evaporation rates. This dataset offers a comprehensive record of daily weather conditions; such analysis can aid in understanding long-term weather patterns, identifying trends, and developing predictive models for future weather forecasts.

B. Descriptive Statistics (Summary of the Data)
The CBL (Mean CBL Pressure) column tells us about the average air pressure. It shows the average (mean) pressure, how much the pressure typically varied (standard deviation), and the lowest and highest pressures recorded. This helps us understand how the air pressure varied over time.

C. Checking Missing Values
We examine the data to ensure there are no blank or missing entries. If the missing-value counts returned by the check are all zero, we conclude that there are no missing values in our dataset.

D. Data Decomposition
In time series analysis, data decomposition means breaking a dataset down into its main components so that we can better understand its patterns. The data is divided into three major parts: the trend, seasonal variations, and irregularities (noise). The long-term direction of the data is shown by the trend component, which indicates whether values are consistently increasing or decreasing over time. Recurring patterns that happen regularly, such as daily, weekly, or yearly cycles, are captured by seasonal variations. Random fluctuations that are not explained by trend or seasonal patterns form the remaining (residual) element. Additive decomposition, which sums the components, and multiplicative decomposition, which multiplies them, are the two common methods of decomposition.
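The additive decomposition just described can be sketched by hand. The minimal numpy version below (centered moving-average trend, per-position seasonal means, residual as what is left over) is an illustrative simplification, not the exact routine used for the report's figures.

```python
import numpy as np

def additive_decompose(x, period):
    """Split a series into trend + seasonal + residual (additive model)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Trend: moving average over one full seasonal period (edges are rough).
    kernel = np.ones(period) / period
    trend = np.convolve(x, kernel, mode="same")
    # Seasonal: average detrended value at each position in the cycle.
    detrended = x - trend
    seasonal_means = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal_means -= seasonal_means.mean()  # centre so the seasonal part sums to ~0
    seasonal = np.tile(seasonal_means, n // period + 1)[:n]
    # Residual: whatever trend and seasonality do not explain.
    residual = x - trend - seasonal
    return trend, seasonal, residual

# Toy series: linear trend plus a repeating cycle of period 4.
t = np.arange(40, dtype=float)
x = 0.5 * t + np.array([2.0, 0.0, -2.0, 0.0])[np.arange(40) % 4]
trend, seasonal, residual = additive_decompose(x, period=4)
```

In practice a library routine such as statsmodels' seasonal_decompose would be used instead; the point here is only that the three components add back up to the original series.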
By analysing these components, forecasters build more accurate prediction models by identifying trends, recognising recurring patterns, and understanding irregularities. We have plotted the decomposed data: in Fig. 4 we can see an upward trend in the trend component from 2021 onwards for the CBL pressure, and clear seasonality in the seasonal component with a yearly cycle.

Fig. 4. Seasonal Decompose

E. Stationarity
Stationarity in a time series means that its behaviour stays more or less the same over time. When data is stationary, it does not show big shifts or evolving patterns: it stays at a similar average level, with consistent ups and downs and no overall trend. This property is important because it lets us apply forecasting tools more reliably. Detecting stationarity involves inspecting the data visually or using statistical tests to check whether things stay steady. To make data stationary, we can transform it, for example by differencing. Overall, having stationary data helps us predict future trends better.

The Dickey-Fuller Test is a statistical test used to check whether a time series is stationary. It evaluates the presence of a unit root in the data, which, if present, indicates non-stationarity. The core idea is to assess whether the data shows a consistent pattern over time or exhibits a trend that might affect predictions.
• Null Hypothesis: the data has a unit root and is non-stationary.
• Alternate Hypothesis: the data does not have a unit root and is stationary.
In our results, the p-value is <= 0.05 and the test statistic (-12.49) is below the critical values, so the series is stationary.

Fig. 5. Dickey-Fuller Test

B-Logistic Regression
1) Data set and Variable Details: The dataset comprises information on 100 individuals, each characterized by several attributes: age, weight, gender, fitness score, and cardiac condition. Age, recorded in years, ranges from 30 to 74. Weight, likely measured in kilograms, varies widely, from 50 to 115.42 kg. Gender is represented categorically as male or female, reflecting a mix of both genders within the data set. Fitness scores, reflecting individuals' fitness levels, span a diverse spectrum, ranging notably between 27.35 and 62.5. The crucial variable, cardiac condition, indicates the presence or absence of cardiac issues among individuals, categorized as present or absent. Notably, individuals within the data set have differing fitness levels and weights, potentially suggesting diverse health profiles among the participants.

2) Descriptive Statistics (Summary of the Data): Descriptive statistics help us understand what the information in our data set looks like. They show us the average value for different attributes (like age or weight), how much the numbers differ from that average (spread), and how often certain values appear. Descriptive statistics are like a summary that gives us a quick idea of the main characteristics of our data, helping us understand and explore it better. We can look at the mean, median, mode, variance, standard deviation, etc. The main goal here is to explore the data and learn from it.

3) Treatment of Outliers: A box plot visualization was used to detect outliers within the data set. A box plot displays the distribution of numerical data through quartiles, highlighting any potential outliers or extreme values. In Fig. 6 we can see that some outliers are present in the age column; the Interquartile Range (IQR) method was used to detect potential outliers within the 'age' column.
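The IQR rule applied to the 'age' column can be sketched as follows; the sample ages below are made up for illustration and are not the study's data.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical ages; 120 falls well above the upper fence.
ages = [30, 34, 41, 45, 47, 52, 55, 60, 63, 74, 120]
print(iqr_outliers(ages))
```

Points flagged this way are exactly those drawn as individual markers beyond the whiskers of a standard box plot.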
5) Standard Scaler: A standard scaler is a tool used to make sure all the numbers in the data are on a similar scale. It transforms each feature so that its average becomes 0 and its spread (standard deviation) becomes 1. This helps the model work better by preventing any one feature from having too much impact just because of its magnitude, and it helps the model's calculations run smoothly.

6) Encoding Categorical Data: Encoding categorical data means converting non-numeric information (like categories or labels) into numbers that a computer can understand. This helps machine learning models handle and analyze this type of data, because many models work better with numbers than with words or categories. There are various methods to encode categorical data, such as Label Encoding (assigning a unique number to each category) or One-Hot Encoding (creating a new column for each category and marking it with 1s and 0s).

V. DATA VISUALIZATION

Data visualization is like storytelling with pictures made from numbers. Instead of staring at a list of numbers, we create colorful, easy-to-understand pictures or diagrams that tell a story, like drawing a map or making a graph to show what is happening with the information. Just as a picture book helps you understand a story better than words alone, data visualization helps us understand data by showing it in a way that is simple and easy to grasp.

Fig. 9. Cardiac Condition by Gender

In Fig. 9, we can see that there are 63 people marked as 'male' and 37 people marked as 'female'. This tells us that there are more individuals labeled 'male' than 'female' in our group. It's like saying there are more boys than girls in a class.

Fig. 10. Histogram for Age, Weight and Fitness Score

Fig. 11. Cardiac Condition by Age, Weight, and Fitness Score
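The two preprocessing steps described in 5) and 6) can be sketched without any ML library. This is a hedged, minimal version (in practice sklearn's StandardScaler and OneHotEncoder would typically be used); the weights and genders below are placeholder values.

```python
import numpy as np

def standard_scale(column):
    """Shift a feature to mean 0 and scale it to standard deviation 1."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()  # population std (ddof=0)

def one_hot(labels):
    """Map each category to its own 0/1 indicator column."""
    categories = sorted(set(labels))
    return {c: [1 if lab == c else 0 for lab in labels] for c in categories}

weights = [50.0, 80.0, 115.42]          # hypothetical weights in kg
genders = ["male", "female", "male"]    # hypothetical gender labels

scaled = standard_scale(weights)
encoded = one_hot(genders)
```

After these steps, every feature contributes on a comparable scale and the categorical gender column is usable by a logistic regression model.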
Fig. 13. Linear Regression Model Forecast

There are various types of Exponential Smoothing methods:
1) Simple Exponential Smoothing: Simple Exponential Smoothing is a fundamental method for forecasting time series that exhibit a stable mean without trend or seasonality. It employs a single smoothing parameter to produce predictions from past observations, assigning exponentially decreasing weights to past observations, with more weight given to recent data.
2) Double Exponential Smoothing (Holt's Method): Holt's Method, an extension of Simple Exponential Smoothing, incorporates trend information in addition to the level of the data, making it suitable for time series with a trend but no seasonality. It employs two equations, one for the level and another for the trend, to model how the data evolves over time.
3) Triple Exponential Smoothing (Holt-Winters Method): Triple Exponential Smoothing, an extension of Double Exponential Smoothing, incorporates a seasonal component in addition to the level and trend components, making it suitable for time series exhibiting both trend and seasonality. It considers level, trend, and seasonal indices to capture variations across different periods.

Fig. 15. Exponential Smoothing

Fig. 17. Model Comparison Plots

In summary, while some methods like the 2-Point Moving Average showed better predictive accuracy (lower RMSE), others like the Naive Model and Exponential Smoothing resulted in higher prediction errors on this dataset.

C. ARIMA/SARIMA
ARIMA (AutoRegressive Integrated Moving Average) and SARIMA (Seasonal ARIMA) models are mathematical tools used to predict future values based on patterns found in time series data. In simpler terms, they are like clever detectives that study how things behaved in the past to make smart guesses about what might happen in the future. ARIMA breaks down into three parts: AR (AutoRegressive), which looks at how past values affect future ones; I (Integrated), which differences the data to make it more predictable; and MA (Moving Average), which predicts future values based on past prediction errors. SARIMA includes all these components but additionally handles data showing seasonal patterns, like temperature changes across seasons. These models are valuable because they can capture complex data patterns and make predictions even when the data changes over time, helping forecasters anticipate future trends or behaviors based on historical data.
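To make the AR and I pieces concrete, here is a toy sketch (not the report's fitted model): difference the series once (the "I" step), then estimate an AR(1) coefficient on the differences by least squares (the "AR" step). A real analysis would use statsmodels' ARIMA/SARIMAX, which also handles the MA and seasonal terms.

```python
import numpy as np

def fit_ar1_on_differences(series):
    """Difference once, then estimate y_t = phi * y_{t-1} by least squares."""
    y = np.diff(np.asarray(series, dtype=float))  # the "I" (integrated) step
    prev, curr = y[:-1], y[1:]
    phi = float(np.dot(prev, curr) / np.dot(prev, prev))  # OLS slope, no intercept
    return phi

# Synthetic series whose first differences follow an AR(1) with phi = 0.6.
rng = np.random.default_rng(1)
d = np.zeros(2000)
for t in range(1, 2000):
    d[t] = 0.6 * d[t - 1] + rng.normal()
series = np.cumsum(d)  # integrate, so the model must difference it back
phi_hat = fit_ar1_on_differences(series)
```

With enough data the estimated coefficient recovers the value used to generate the differences, which is the sense in which ARIMA "learns from past behaviour".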