
A project report titled

“Exploratory Data Analysis on Student Performance Dataset”


Submitted in partial fulfilment of the curriculum for the award of the degree of
B.E in

INFORMATION SCIENCE AND ENGINEERING

Submitted by,

ADITHYA 01JST21IS003

DARSHAN KUMAR P 01JST21IS013

SMRITHI P ULLAL 01JST21IS052

VINAY K 01JST22UIS411

Submitted to,

Ms. LAVANYA M S

ASSISTANT PROFESSOR

DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING


ABSTRACT

This report analyzes the academic performance of students using a dataset containing information on
their math, reading, and writing scores, as well as demographic factors such as gender, race/ethnicity,
parental level of education, lunch type, and test preparation course completion. The primary objective
is to investigate the factors influencing students' math scores and determine if there are significant
differences in mean scores based on various demographic attributes.

The analysis employs statistical methods including ANOVA (Analysis of Variance) to compare mean
math scores across different demographic groups. Additionally, descriptive statistics such as mean,
median, mode, and standard deviation are calculated to understand the distribution of scores.
Hypothesis testing is conducted to assess the significance of observed differences.

Results indicate that there is a significant difference in mean math scores among different demographic
groups, particularly concerning gender and parental level of education. The critical region for rejection
of the null hypothesis is identified, providing insight into factors influencing academic achievement.

The findings of this study have implications for educational policy and practice, highlighting the
importance of addressing disparities in academic performance based on demographic factors. Further
research may explore additional variables and their impact on student outcomes, leading to targeted
interventions to support student success.

TABLE OF CONTENTS
1. INTRODUCTION
1.1 Overview
1.2 Dataset Overview
2. METHODOLOGIES
2.1 Implementation Overview
2.2 Software Requirements
2.3 Univariate Analysis of Student Performance Dataset
2.3.1 Introduction
2.3.2 Numerical Data
2.3.2.1 Mean, Median, Mode, and Standard Deviation
2.3.2.2 Histogram
2.3.2.3 Kernel Density Plot
2.3.2.4 Box plot
2.3.3 Categorical data
2.3.3.1 Pie chart
2.3.3.2 Donut chart
2.3.3.3 Bar graph
2.3.3.4 Count plot
2.4 Bivariate Analysis of Student Performance Dataset
2.4.1 Introduction
2.4.2 Scatter plot (Numerical-Numerical)
2.4.3 Bar plot (Numerical-Categorical)
2.4.4 Distplot (Numerical-Categorical)
3. ANOVA and Hypothesis analysis
4. CONCLUSION
5. REFERENCES

1. INTRODUCTION

1.1 Overview

The dataset encompasses a diverse range of variables, including gender, race/ethnicity, parental level of
education, lunch type, and test preparation course completion. Each of these factors may contribute to variations
in students' academic achievement. By exploring these variables and their relationships with academic
performance, we aim to gain insights into the dynamics shaping educational outcomes.

The primary objective of this analysis is to investigate the influence of demographic factors on students' math
scores and discern any significant differences in mean scores across various demographic groups. To achieve
this, we employ statistical methods such as ANOVA (Analysis of Variance), hypothesis testing, and descriptive
statistics. These methodologies allow us to uncover patterns, identify disparities, and assess the significance of
observed differences in academic performance.

By elucidating the factors associated with students' math scores, this study contributes to a deeper understanding
of educational equity and access. The findings hold implications for educational policy and practice, informing
interventions aimed at addressing disparities and promoting student success. Through this analysis, we endeavor
to shed light on the multifaceted nature of academic achievement and pave the way for evidence-based
approaches to enhance educational outcomes for all students.

1.2 Dataset Overview

This dataset includes the following fields:

1. gender: Indicates whether the student is male or female. This attribute is used to analyze
potential gender-based performance differences in various academic subjects.

2. race/ethnicity: Categorical variable representing the student's racial or ethnic group
(e.g., group A, group B, group C). It helps in studying demographic impacts on
educational performance.

3. parental level of education: Highest education level attained by the student's parents.
It is used to study the influence of parental education on student performance.

4. lunch: Type of lunch the student receives, either standard or free/reduced. This variable
serves as a proxy for socioeconomic status and can be used to examine its relationship
with academic achievement.

5. test preparation course: Indicates whether the student completed a test preparation
course. This is useful for evaluating the effectiveness of such courses on improving
student scores.

6. math score: Numerical score representing the student's performance in mathematics.
Scores range from 0 to 100, providing a quantitative measure of mathematical
proficiency.

7. reading score: Numerical score indicating the student's reading ability, ranging from 0
to 100. It assesses literacy and comprehension skills.

8. writing score: Numerical score reflecting the student's writing proficiency, with scores
between 0 and 100. This measure evaluates grammar, structure, and content quality in
written responses.

Fig. Sample rows from the student performance dataset
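
As a minimal sketch of how a file with the eight fields above might be loaded and inspected with Pandas, the example below reads a few illustrative rows from an in-memory string. The rows are hypothetical stand-ins and not the report's data; with the actual file, pd.read_csv would be given the path to the downloaded CSV instead.

```python
import io

import pandas as pd

# Hypothetical rows laid out with the same eight columns described above.
# With the real file, replace io.StringIO(SAMPLE_CSV) by the CSV path.
SAMPLE_CSV = """gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
female,group B,bachelor's degree,standard,none,72,72,74
female,group C,some college,standard,completed,69,90,88
male,group A,associate's degree,free/reduced,none,47,57,44
male,group C,some college,standard,none,76,78,75
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # text dtype for the categorical fields, integer dtype for scores
print(df.isna().sum())  # count of missing values in each column
```

Checking the shape, column types, and missing values first helps confirm that the categorical and numerical fields were parsed as intended before any statistics are computed.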


2. METHODOLOGIES

2.1 Implementation Overview

The student performance dataset is explored through univariate and bivariate analysis in Jupyter
Notebook, using the dependencies listed in Section 2.2. Univariate analysis examines each variable
in isolation, while bivariate analysis examines interactions between pairs of variables such as
gender, parental level of education, test preparation course, math score, and reading score.
Together, these analyses reveal the patterns, trends, and correlations needed to understand student
performance. The resulting insights into scores and skills can guide parents and educators in
deciding how to support a student's improvement.

2.2 Software Requirements

The project was implemented using Jupyter Notebook with the dependencies needed for univariate
and bivariate analysis. The software requirements for this implementation are as follows:

1. Jupyter Notebook

Jupyter Notebook is an open-source web application for creating and sharing
documents containing live code, equations, visualizations, and text. It supports multiple
programming languages and is widely used in data analysis, research, and education for its
interactive and collaborative features.

2. Python (3.6.9 or higher):

Python is a versatile, high-level programming language known for its simplicity and
readability. It supports multiple programming paradigms and is widely used in web
development, data analysis, artificial intelligence, scientific computing, and automation.
Python's extensive standard library and large ecosystem of third-party packages make it a
popular choice for diverse applications.

3. Matplotlib (3.1.2 or higher):

Matplotlib is a Python library for creating high-quality visualizations like line plots,
bar charts, and scatter plots. It's widely used in scientific research, data analysis, and
visualization tasks.

4. Pandas (1.1.0 or higher):

Pandas is a Python library for data manipulation and analysis, offering versatile data
structures and tools for reading, writing, and analyzing data efficiently, commonly used in data
science and research.

The project analyzes the student performance dataset using univariate and bivariate analysis
techniques, together with data visualization to communicate complex information in a concise and
understandable manner. Exploratory analysis is commonly divided into three types: univariate,
bivariate, and multivariate analysis. Univariate analysis is the most straightforward: it examines
one variable at a time using descriptive statistics such as mean, median, mode, standard deviation,
and range. Bivariate analysis studies the relationship between two variables using correlation
analysis, scatter plots, and other statistical methods. Its main goal is to establish whether the
two variables are connected and to understand the strength and direction of that connection.

2.3 Univariate Analysis of Student Performance Dataset

2.3.1 Introduction

Univariate analysis is a fundamental aspect of data analysis that deals with the study of individual
variables in isolation, without considering the relationship between them. This type of analysis is
commonly used in statistical analysis to understand the distribution, central tendency, and
variability of individual variables. In univariate analysis, the focus is on summarizing and
describing the data in a single variable. This is done through various statistical measures, such as
mean, median, mode, standard deviation, and range. These measures provide insights into the
distribution of the data and help to identify patterns and trends within the data. Univariate analysis
is an important step in the data analysis process, as it provides a foundation for understanding the
data before moving on to more complex analysis methods. By understanding the individual
variables in the dataset, analysts can gain a deeper understanding of the data and make more
informed decisions. Additionally, univariate analysis can help to identify outliers, missing values,
and other issues in the data that may impact the accuracy of the analysis.

2.3.2 Numerical Data

2.3.2.1 Mean, Median, Mode, and Standard Deviation

1. Mean: Add up all values and divide by the total number of values.
Formula: Mean = (Sum of all values) / (Total number of values)

2. Median: Arrange values in ascending order and find the middle value (or average of two
middle values).
Formula: Median = (Middle value or average of two middle values)

3. Mode: Identify the most frequently occurring value(s) in the dataset.

4. Standard Deviation: Calculate the mean, then find the squared differences between each
value and the mean, calculate their mean, and take the square root.
Formula: Standard Deviation = √[Σ(xi - μ)² / N], where xi represents each value, μ is the
mean, and N is the total number of values.

fig. measure of mean, median, mode, standard deviation
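
The four measures above can be sketched with Python's standard statistics module. The scores below are hypothetical values chosen for illustration, not taken from the dataset. The formula given above divides by N, which corresponds to pstdev (population standard deviation); stdev, which divides by N - 1, is shown for comparison because it is the default in many tools, including Pandas.

```python
import statistics

# Hypothetical math scores used only to illustrate the calculations.
math_scores = [72, 69, 90, 47, 76, 71, 88, 40, 64, 72]

mean = statistics.mean(math_scores)        # sum of values / number of values
median = statistics.median(math_scores)    # average of the two middle values here
mode = statistics.mode(math_scores)        # most frequently occurring value
pop_sd = statistics.pstdev(math_scores)    # sqrt(sum((x - mean)^2) / N)
sample_sd = statistics.stdev(math_scores)  # sqrt(sum((x - mean)^2) / (N - 1))

print(f"mean={mean}, median={median}, mode={mode}")
print(f"population SD={pop_sd:.2f}, sample SD={sample_sd:.2f}")
```

The sample standard deviation is always slightly larger than the population one for the same data, so it is worth stating which of the two a reported value refers to.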

2.3.2.2 Histogram

A histogram is a graphical representation of the distribution of numerical data. It consists of a series
of adjacent rectangles, or bins, where each bin represents a specific range of values, and the height
of the bin corresponds to the frequency (or count) of data points within that range.

Fig. Histogram of Math scores
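
A histogram of this kind might be drawn with Matplotlib as in the sketch below. The scores and the bin width of 10 are assumptions made for illustration; the report does not state which bins were used. The Agg backend line only lets the sketch run without a display and can be dropped inside Jupyter Notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical math scores used only for illustration.
math_scores = [72, 69, 90, 47, 76, 71, 88, 40, 64, 72]

# Assumed bins: 0-10, 10-20, ..., 90-100 (the last bin includes 100).
bin_edges = list(range(0, 101, 10))

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(math_scores, bins=bin_edges, edgecolor="black")
ax.set_xlabel("Math score")
ax.set_ylabel("Number of students")
ax.set_title("Histogram of math scores (illustrative data)")

print(counts)  # frequency of scores falling in each bin
```

The returned counts array holds the height of each bar, so the same call can be used to tabulate frequencies as well as to draw them.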


2.3.2.3 Kernel Density Plot

A kernel density estimate (KDE) plot is a data visualization technique used to estimate the probability density
function of a continuous random variable. It places a smooth kernel, typically a Gaussian curve, on each data
point and averages them, producing a continuous curve that can be thought of as a smoothed histogram of the
underlying distribution. KDEs provide insights into the shape and distribution of data, allowing for the
identification of peaks, troughs, and overall patterns. They are particularly useful for understanding the
density of observations in a dataset and are commonly employed in statistical analysis, machine learning, and
exploratory data analysis.

fig. Kernel Density plot of Reading score distribution
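
To make the idea of a kernel density estimate concrete, the sketch below computes a Gaussian KDE by hand with NumPy. The function name gaussian_kde_sketch, the reading scores, and the bandwidth of 8 are all hypothetical choices for illustration; plotting libraries normally select the bandwidth automatically.

```python
import numpy as np


def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Hypothetical reading scores used only for illustration.
reading_scores = [72, 90, 95, 57, 78, 83, 95, 43, 64, 60]

grid = np.linspace(0, 140, 701)  # evaluation points, spaced 0.2 apart
density = gaussian_kde_sketch(reading_scores, grid, bandwidth=8.0)

# A density should integrate to about 1 over a wide enough range.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under curve = {area:.4f}")
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one gives a smoother curve that may hide genuine peaks.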

2.3.2.4 Box plot

A box plot, or box-and-whisker plot, visually summarizes the distribution of a dataset by depicting its quartiles
(Q1, Q2, Q3). Q1 represents the 25th percentile, Q2 is the median (50th percentile), and Q3 is the 75th percentile.
The box spans from Q1 to Q3, and the "whiskers" extend to the most extreme values that lie within 1.5 times the
interquartile range (IQR = Q3 - Q1) of the box. Points beyond the whiskers are drawn individually as outliers.
This concise representation aids in understanding the central tendency, spread, and presence of outliers in the
data.

fig. Box plot for Writing score with outliers

Fig. Box plot for Writing score without outliers
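
The quantities behind a box plot can be computed directly, as in this sketch using NumPy. The writing scores are hypothetical and include one deliberately low value so that an outlier appears; the 1.5 x IQR rule is the common convention described above.

```python
import numpy as np

# Hypothetical writing scores; the value 10 is included to produce an outlier.
writing_scores = np.array([74, 88, 93, 64, 75, 78, 92, 69, 67, 10])

q1, q2, q3 = np.percentile(writing_scores, [25, 50, 75])
iqr = q3 - q1

# Fences: values outside this interval are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (writing_scores < lower_fence) | (writing_scores > upper_fence)
outliers = writing_scores[is_outlier]
without_outliers = writing_scores[~is_outlier]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

When plotting with Matplotlib, passing showfliers=False to boxplot hides the outlier markers, which is one way to obtain a plot without outliers while leaving the data itself unchanged.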


2.3.3 Categorical data

Categorical data represents categories or groups rather than numerical values. The categories
may be nominal, with no inherent order (for example, gender or lunch type), or ordinal, with a
natural ranking (for example, parental level of education).

Analyzing categorical data often involves counting the frequency of each category, identifying the
mode (most common category), and possibly comparing distributions between different groups.
Common methods for analyzing categorical data include frequency tables, bar charts, and
contingency tables (also known as cross-tabulations). These methods provide insights into the
distribution and relationships between different categories, which can be valuable for making
decisions and drawing conclusions in various fields such as marketing, sociology, and healthcare.

2.3.3.1 Pie chart


A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions.
The size of each slice is proportional to the share of the whole that its category represents.

fig. Pie chart to visualize the Lunch distribution


2.3.3.2 Donut chart

A donut chart is a circular statistical graphic, similar to a pie chart, but with a hole in the center. It
presents data in a segmented format, with each segment representing a proportion of the whole. The
size of each segment corresponds to its percentage of the total. Donut charts are effective for illustrating
the distribution of categorical data and making comparisons between different categories. They are
commonly used in business presentations, reports, and data visualization to convey relative proportions
and trends in a visually appealing manner.

fig. Donut Chart for Test Preparation
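
One way to draw a donut chart with Matplotlib is to give the wedges of a pie chart a width smaller than the radius, as sketched below. The category values are hypothetical, and the width of 0.4 is an arbitrary styling choice; the report does not say how its chart was produced.

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical test preparation values used only for illustration.
test_prep = ["none", "completed", "none", "none", "completed", "none"]

counts = Counter(test_prep)
labels = list(counts)
sizes = [counts[label] for label in labels]

fig, ax = plt.subplots()
# A wedge width below 1 leaves a hole in the middle, turning the pie into a donut.
wedges = ax.pie(
    sizes,
    labels=labels,
    autopct="%1.1f%%",
    wedgeprops={"width": 0.4},
)[0]
ax.set_title("Test preparation course (illustrative data)")

print(dict(counts))
```

Omitting the wedgeprops argument produces an ordinary pie chart from the same counts.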


2.3.3.3 Bar Graph

A bar graph is a visual representation of data using rectangular bars, where the length or height of
each bar corresponds to the frequency or value of a category. It is commonly used to display and
compare discrete categories or groups. Bar graphs are effective in illustrating differences between
categories, trends over time, or comparisons across different groups. They provide a clear and intuitive
way to convey information and are widely utilized in fields such as statistics, economics, business, and
social sciences for data analysis and presentation.

fig. bar graph for Parents level of education


2.3.3.4 Count plot (Horizontal bar graph)

A count plot is a type of bar graph that displays the frequency of observations within categorical
data. Each bar represents the count of occurrences for a specific category. It provides a visual
summary of the distribution of categorical variables, aiding in the analysis of dataset proportions
and identifying patterns or imbalances. Count plots are particularly useful for exploring the
frequency of categories and comparing their occurrence across different groups or variables,
making them valuable tools in data visualization, exploratory data analysis, and statistical
inference.

fig. Count plot graph for Ethnicity Distribution
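
A count plot can be approximated with Pandas and Matplotlib by counting each category and drawing horizontal bars, as in the sketch below. The group labels are hypothetical sample values, and this is only one possible way to produce such a plot.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical race/ethnicity values used only for illustration.
ethnicity = pd.Series(
    ["group B", "group C", "group A", "group C", "group D", "group C", "group B"],
    name="race/ethnicity",
)

# Frequency of each category, ordered alphabetically by group label.
counts = ethnicity.value_counts().sort_index()

fig, ax = plt.subplots()
ax.barh(list(counts.index), counts.values)
ax.set_xlabel("Number of students")
ax.set_ylabel("Race/ethnicity")
ax.set_title("Count of students per group (illustrative data)")

print(counts)
```

Sorting the counts by label keeps the bar order stable between runs; sorting by value instead would highlight the largest and smallest groups.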


2.4 Bivariate Analysis of Student Performance Dataset

2.4.1 Introduction

Bivariate analysis is the simplest (two-variable) special case of multivariate analysis. It is a form
of quantitative (statistical) analysis of two variables, often denoted X and Y, carried out to
determine the empirical relationship between them. It can be used to test simple hypotheses of
association and to determine how much easier it becomes to predict the value of one variable once
the other is known. It contrasts with univariate analysis, which involves only one variable, and it
can be either descriptive or inferential.

The relationship between two variables can be positive or negative. A positive correlation indicates
that as the value of one variable increases, the value of the other also increases. A negative
correlation indicates that as the value of one variable increases, the value of the other decreases.
The strength of the correlation is measured by the correlation coefficient, which ranges from -1 to 1:
values close to 1 indicate a strong positive correlation, values close to -1 indicate a strong
negative correlation, and values near 0 indicate little or no linear relationship.

The appropriate method depends on the types of the variables. When the dependent variable is
categorical, such as the preferred brand of cereal, and the independent variable is, for example,
age, probit or logit regression (or multinomial probit or multinomial logit) can be used. If both
variables are ordinal, meaning they are ranked in a sequence as first, second, and so on, a rank
correlation coefficient can be computed. If only the dependent variable is ordinal, ordered probit or
ordered logit can be used. If the dependent variable is continuous, such as a temperature scale or an
income scale, simple regression can be used.

Graphical methods such as scatter plots, box plots, and mosaic plots can be used to represent
bivariate data. These graphs are part of descriptive statistics and are used to explore the
relationship between the two variables and how strong it is. Bivariate analysis can also help
determine whether groups differ with respect to a variable and point toward possible causes of those
differences.

2.4.2 Scatter plot (Numerical-Numerical)

A scatter plot is a visual representation of the relationship between two variables, where each
data point is plotted as a dot on a graph. It aids in identifying patterns, trends, and correlations
within the data, allowing for insights into the nature of the relationship between the variables.
This graphical tool is widely used across disciplines such as statistics, scientific research,
economics, and data analysis to visualize and analyze data sets efficiently and effectively.

fig. scatter plot of math score and reading score
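
A scatter plot is often read together with the Pearson correlation coefficient. The sketch below computes the coefficient from its definition and draws the corresponding scatter plot; the paired scores and the helper name pearson_r are hypothetical and used only for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical paired scores (one pair per student) used only for illustration.
math_scores = np.array([72, 69, 90, 47, 76], dtype=float)
reading_scores = np.array([72, 90, 95, 57, 78], dtype=float)


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return float((x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum()))


r = pearson_r(math_scores, reading_scores)

fig, ax = plt.subplots()
ax.scatter(math_scores, reading_scores)
ax.set_xlabel("Math score")
ax.set_ylabel("Reading score")
ax.set_title(f"Math vs reading score (illustrative data, r = {r:.2f})")

print(f"Pearson r = {r:.3f}")
```

The same value can be obtained with np.corrcoef(math_scores, reading_scores)[0, 1]; writing the formula out simply shows where the number comes from.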


2.4.3 Bar plot (Numerical-Categorical)

A bar plot is a graphical representation of categorical data using rectangular bars. The
length or height of each bar corresponds to the frequency, proportion, or another aggregate
measure (such as the mean of a numerical variable) of the data within each category.

Fig. bar plot of gender vs writing score
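
A numerical-categorical bar plot typically shows one aggregate value per category. The sketch below computes the mean writing score per gender with a Pandas groupby and plots it; the rows are hypothetical, and using the mean as the aggregate is an assumption, since the report does not state which measure its plot shows.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows used only for illustration.
df = pd.DataFrame(
    {
        "gender": ["female", "female", "female", "male", "male", "male"],
        "writing score": [74, 88, 93, 44, 75, 79],
    }
)

# One aggregate value (here the mean) per category.
mean_writing = df.groupby("gender")["writing score"].mean()

fig, ax = plt.subplots()
ax.bar(list(mean_writing.index), mean_writing.values)
ax.set_xlabel("Gender")
ax.set_ylabel("Mean writing score")
ax.set_title("Mean writing score by gender (illustrative data)")

print(mean_writing)
```

Replacing mean() with median() or count() changes the aggregate without changing the rest of the plot.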


2.4.4 Distplot (Numerical-Categorical)

A distplot in Seaborn offers a comprehensive view of the distribution of a univariate dataset.
(The distplot function was deprecated in Seaborn 0.11 in favor of displot and histplot.)
It displays the frequency distribution of data points using a histogram, where the data is
divided into bins, and the height of each bar represents the frequency of data points within
that bin. Additionally, it overlays a smoothed curve known as a kernel density estimate
(KDE) plot, which provides a continuous estimate of the underlying probability density
function. This visualization aids in understanding key characteristics of the data, such as its
central tendency (where most values cluster), spread (how dispersed the values are),
skewness (whether the distribution is symmetric or skewed to one side), and the presence of
any outliers or multiple modes. Drawing one distribution per category, such as test
preparation course, allows the groups to be compared.

fig. Distplot of writing score by test preparation course


3. ANOVA and Hypothesis analysis

ANOVA

ANOVA compares the means of two or more groups (most often three or more) to detect whether
they are statistically different. It measures variation between group means against variation
within groups. If the calculated F-statistic surpasses the critical value at a chosen
significance level (e.g., 0.05), the null hypothesis of no differences between groups is
rejected, indicating that at least one group mean differs from the others. ANOVA is valuable
for analyzing categorical factors and continuous outcomes, aiding researchers in discerning
genuine effects from random variation. Post-hoc tests can identify specific group differences
if ANOVA reveals significance.
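
The comparison of between-group and within-group variation can be sketched in plain Python. The function name one_way_anova_f and the three groups of scores are hypothetical; the calculation follows the standard one-way ANOVA definition and is not necessarily the code used in this project.

```python
def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df)."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical math scores for three groups, used only for illustration.
groups = [[60, 65, 70], [70, 75, 80], [80, 85, 90]]

f_stat, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F value means the group means are spread out much more than the scatter inside each group would explain.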

Hypothesis Analysis

Hypothesis analysis entails testing a hypothesis about a population parameter using statistical
methods. It involves formulating a null hypothesis (H0) representing the status quo and an
alternative hypothesis (Ha) suggesting what's being tested. A significance level (α) is chosen
to determine the probability threshold for rejecting the null hypothesis. Data is collected, and
an appropriate statistical test is chosen based on the type of data and hypotheses. A test
statistic is calculated from the data, then compared to a critical value from a relevant
statistical distribution to make a decision about rejecting or failing to reject the null
hypothesis. Conclusions are drawn based on this decision, considering the context of the
problem and interpreting results in light of the hypotheses.
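
The decision step described above can be sketched with SciPy, which is an assumption here because it is not among the listed software requirements. The groups and the significance level of 0.05 are hypothetical; f_oneway computes the F statistic and p-value, and f.ppf gives the critical value that bounds the rejection region.

```python
from scipy import stats

# Hypothetical math scores for three groups, used only for illustration.
groups = [[60, 65, 70], [70, 75, 80], [80, 85, 90]]
alpha = 0.05  # chosen significance level

# H0: all group means are equal. Ha: at least one group mean differs.
f_stat, p_value = stats.f_oneway(*groups)

df_between = len(groups) - 1
df_within = sum(len(group) for group in groups) - len(groups)

# Critical value: the point beyond which H0 is rejected at level alpha.
critical_value = stats.f.ppf(1 - alpha, dfn=df_between, dfd=df_within)

reject_null = bool(f_stat > critical_value)

print(f"F = {f_stat:.2f}, critical value = {critical_value:.2f}, p = {p_value:.4f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Comparing the F statistic with the critical value and comparing the p-value with alpha are equivalent decision rules, so both lead to the same conclusion.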

fig. ANOVA results


Fig. Distribution curve showing the acceptance and rejection regions

4. CONCLUSION

After thoroughly analyzing the "StudentsPerformance" dataset, several key insights emerge.
The dataset comprises various demographic and performance-related attributes for students,
allowing us to explore relationships and trends among these variables.

Firstly, gender differences in academic performance are notable. Male students tend to score
higher on average in mathematics, while female students generally perform better in reading
and writing. This aligns with common educational research findings that suggest gender-based
strengths in different subjects.

Socio-economic factors also play a significant role in student performance. Students who
receive test preparation courses tend to score higher across all subjects, indicating the positive
impact of additional academic support. Furthermore, parental education levels correlate
strongly with student scores, suggesting that higher parental education can lead to better student
performance, possibly due to a more supportive home learning environment.

The analysis also highlights the influence of race/ethnicity on academic outcomes. Certain
ethnic groups consistently outperform others, which may reflect broader socio-economic
disparities and access to educational resources.

Visualizing the data through histograms and box plots has been instrumental in identifying
these trends, providing a clear picture of distribution patterns and relationships. Overall, the
analysis underscores the multifaceted nature of student performance, influenced by gender,
socio-economic background, parental education, and ethnicity. These insights can inform
targeted educational interventions and policies aimed at reducing disparities and improving
outcomes for all students.

In summary, our project underscores the pivotal role of data visualization in unlocking insights
from student performance datasets, empowering educators and other stakeholders to make informed
decisions and improve educational outcomes.

5. REFERENCES

1. Main dataset used: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?resource=download

2. Jupyter Notebook. Retrieved from https://jupyter.org/ (Last accessed: April 26, 2024)

3. Python Software Foundation. Python. Retrieved from https://www.python.org/ (Last accessed: April 26, 2024)

4. Matplotlib. Retrieved from https://matplotlib.org/ (Last accessed: April 26, 2024)

5. Pandas. Retrieved from https://pandas.pydata.org/ (Last accessed: April 26, 2024)
