Financial Analytics Training
Table of contents
1. Introduction to Financial Analytics
Applications: Financial analytics is commonly divided into four types, each described below: descriptive, diagnostic, predictive, and prescriptive.
Descriptive Analytics
Overview: Descriptive analytics involves analyzing historical data to understand
changes over time. It focuses on summarizing past events to identify patterns or
trends.
Diagnostic Analytics
Overview: Diagnostic analytics takes a step further from descriptive analytics by
not only identifying patterns but also understanding the reasons behind those
patterns. It involves more in-depth data analysis to uncover causal relationships
and root causes.
Techniques: Techniques such as drill-down, data discovery, correlation analysis,
and anomaly detection are commonly employed. Diagnostic analytics often utilizes
more complex data processing and statistical methods to delve deeper into the
data.
Predictive Analytics
Overview: Predictive analytics uses statistical models and machine learning
techniques to forecast future events based on historical data. It’s instrumental in
financial planning, risk assessment, and strategy development, offering insights
into what might happen in the future.
Prescriptive Analytics
Overview: Prescriptive analytics goes beyond predicting future outcomes by
recommending actions to achieve desired objectives or mitigate risks. It combines
insights from all other analytics types to formulate strategic recommendations.
To summarise
Data Collection
Overview: The first step in financial data analysis involves gathering relevant data
from various sources. Financial data can range from internal records, such as
sales figures and operational
costs, to external data, including market prices, economic indicators, and
competitor information.
Sources:
Data Exploration
Overview: Before diving into complex analyses, it’s important to explore the data
to understand its structure, distribution, and any underlying patterns or anomalies.
This stage helps in formulating hypotheses and deciding on appropriate analytical
methods.
Techniques:
Visualization: Charts and graphs, including histograms, scatter plots, and box
plots, visually represent data distributions, trends, and outliers.
Visualization
Overview: Effective data visualization transforms complex data sets into intuitive
graphical representations, facilitating easier interpretation and communication of
insights.
Tools:
Matplotlib and Seaborn: Popular Python libraries for creating static, animated,
and interactive visualizations.
Tableau and Power BI: Tools that offer advanced data visualization and
business intelligence capabilities.
Modeling
Types:
Implementation in Python
Python, with its extensive libraries like Pandas for data manipulation, Matplotlib
and Seaborn for visualization, and Scikit-learn for modeling, serves as a powerful
tool for financial data analysis.
Here’s a basic workflow:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Data Collection: Load data
data = pd.read_csv('financial_data.csv')
# Data Cleaning: Fill missing values
data = data.ffill()  # forward-fill gaps with the last known value
This basic workflow exemplifies how financial data analysis can be approached
systematically to extract meaningful insights, guiding strategic decisions in the
financial sector.
The following is much more detailed workflow that can be used.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
data = pd.read_csv('/mnt/data/AAPL.csv')
# Checking for missing values in the dataset
missing_values = data.isnull().sum()
# Descriptive statistics
statistical_summaries = data.describe()
# Visualization: Plotting the Closing Price Over Time
plt.figure(figsize=(10, 6))
plt.plot(data['Date'], data['Close'], label='Closing Price')
plt.title('AAPL Closing Price Over Time')
plt.xlabel('Date')
plt.ylabel('Closing Price (USD)')
plt.xticks(data['Date'][::len(data['Date'])//10], rotation=45)
plt.legend()
plt.show()
This illustrates how Python can be applied at each step of financial data analysis,
from data collection to predictive modeling.
AAPL.csv
1. Go to Yahoo Finance.
7. To use the data offline, click Download to download Apple stock data for the
past 6 months.
import pandas as pd
# Load the dataset
data = pd.read_csv('AAPL.csv')
# Display the first few rows of the dataframe
print(data.head())
# Descriptive statistics
print(data.describe())
# Data types
print(data.dtypes)
Step 5: Visualization
Visualizing the stock’s closing price and volume over time can provide insights
into its trends and volatility.
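A minimal sketch of this step is shown below. The values here are illustrative stand-ins; in practice you would load the Date, Close, and Volume columns from AAPL.csv as in the earlier scripts.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative stand-in data; in practice, load these columns from AAPL.csv
data = pd.DataFrame({
    'Date': pd.date_range('2024-01-02', periods=10, freq='B'),
    'Close': [182, 184, 181, 185, 187, 186, 188, 185, 189, 190],
    'Volume': [52e6, 61e6, 48e6, 70e6, 55e6, 58e6, 64e6, 49e6, 72e6, 60e6],
})

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8), sharex=True)

# Top panel: closing price trend
ax1.plot(data['Date'], data['Close'], label='Closing Price')
ax1.set_ylabel('Closing Price (USD)')
ax1.set_title('AAPL Closing Price and Volume Over Time')
ax1.legend()

# Bottom panel: daily traded volume (bar height reflects trading activity)
ax2.bar(data['Date'], data['Volume'], width=0.8)
ax2.set_ylabel('Volume')
ax2.set_xlabel('Date')

fig.tight_layout()
fig.savefig('aapl_price_volume.png')
```

Plotting price as a line and volume as bars on a shared date axis makes it easy to spot days where price moves coincide with unusual trading activity.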
Interpretation
The Python script successfully executed the steps for cleaning, preparing,
exploring, and visualizing the AAPL stock data. Here’s a summary of what was
accomplished:
Data Exploration
Statistical Summaries: Descriptive statistics were calculated, providing
insights into the mean, standard deviation, minimum, and maximum values for
each numeric column. For instance, the average closing price ( Close ) was
approximately $183.05, with a standard deviation of $8.64, indicating
variability in stock prices over the observed period.
Visualization
Three key visualizations were generated:
1. Closing Price Over Time: A line plot illustrating the trend in AAPL’s closing
prices. The graph shows fluctuations in the stock price, which is crucial for
2. Volume Traded Over Time: A bar plot depicting the volume of AAPL stock
traded each day. This visualization highlights the days with particularly high or
low trading volumes, which can be indicative of market events or investor
sentiment.
Mean
The mean, often referred to as the average, is one of the most basic yet powerful
statistical measures. It provides a central point around which data points are
distributed. In finance, calculating the mean return of a stock over a period helps
investors understand its average performance.
Real-life Example: If a stock has monthly returns of 5%, 7%, -3%, and 4% over
four months, the mean return is (5+7-3+4)/4 = 3.25%. This tells the investor that,
on average, the stock has returned 3.25% per month over this period.
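The arithmetic above can be verified in a couple of lines of Python:

```python
# Monthly returns (%) from the example above
returns = [5, 7, -3, 4]
mean_return = sum(returns) / len(returns)
print(mean_return)  # 3.25
```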
Standard Deviation
Standard deviation measures how widely data points are dispersed around the
mean. In finance, it is the most common proxy for volatility: a higher standard
deviation of returns indicates a riskier, more volatile asset.
Variance
Variance is a statistical measurement of the spread between numbers in a
dataset. It is the average of the squared differences between each data point and
the mean, which emphasizes larger deviations. Variance is pivotal in portfolio
theory for understanding how different securities move in relation to each other
and the portfolio’s overall risk profile.
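As a quick illustration, the variance of the same four monthly returns used in the mean example can be computed with NumPy:

```python
import numpy as np

# Monthly returns (%) from the mean example above
returns = np.array([5, 7, -3, 4])

# Population variance: the average of squared deviations from the mean
variance = np.var(returns)   # use np.var(returns, ddof=1) for the sample variance
std_dev = np.sqrt(variance)  # standard deviation is the square root of variance
print(variance, std_dev)
```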
Mode
The mode is the value that appears most frequently in a data set. In financial
datasets where there might be a repeated return value or interest rate, identifying
the mode helps in understanding the most common or likely occurrence.
Real-life Example: In analyzing the interest rates offered on savings accounts by
various banks, the mode gives the most commonly offered rate, offering insight
into the competitive rate landscape.
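A short sketch of this idea, using hypothetical savings-account rates:

```python
import pandas as pd

# Hypothetical savings-account rates (%) offered by several banks
rates = pd.Series([4.0, 4.5, 4.5, 5.0, 4.5, 4.0, 5.5])

most_common_rate = rates.mode()[0]  # the most frequently offered rate
print(most_common_rate)  # 4.5
```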
💡 Applying on Apple Stock data
The Python script below demonstrates how to load data from a CSV file and
calculate key statistical measures—mean, standard deviation, variance, and mode
—for both the closing price and volume of the AAPL stock data.
import pandas as pd
# Load the dataset
data = pd.read_csv('/mnt/data/AAPL.csv')
# Calculating Mean, Standard Deviation, Variance, and Mode for Closing Price and Volume
mean_close = data['Close'].mean()
std_close = data['Close'].std()
var_close = data['Close'].var()
mode_close = data['Close'].mode()[0]
mean_volume = data['Volume'].mean()
std_volume = data['Volume'].std()
var_volume = data['Volume'].var()
mode_volume = data['Volume'].mode()[0]
For Volume:
Mean Volume Traded: 57,219,350.81
Insights:
The mean closing price of approximately $183.05 indicates the average price
at which AAPL stock closed over the observed period.
The standard deviation for the closing price and volume reveals the variability
or volatility in AAPL’s daily closing prices and trading volume. A higher
standard deviation in the volume
suggests significant fluctuations in trading activity.
The mode for closing prices shows the most frequently occurring closing
price within the dataset was $173.00, while the most common trading volume
was 24,048,300 shares.
import numpy as np
from scipy import stats
# Example dataset: Annual returns of a mutual fund (%) (illustrative values)
returns = np.array([8.5, 12.0, -3.2, 7.4, 10.1, 5.6])
# Summary statistics: n, min/max, mean, variance, skewness, kurtosis
print(stats.describe(returns))
Correlation
In the realm of financial analytics, correlation is a statistical measure that
expresses the extent to which two variables move in relation to each other. In
financial markets, understanding correlations is crucial for portfolio diversification,
risk management, and strategic planning.
Understanding Correlation
The correlation coefficient ranges from -1 to +1. A value of +1 indicates a perfect
positive correlation, meaning the two variables move in the same direction. A
value of -1 indicates a perfect negative correlation, meaning the two variables
move in opposite directions. A correlation of 0 means no relationship exists
between the variables.
Real-life Example: Consider the correlation between oil prices and airline stocks.
Often, there is a negative correlation, as higher oil prices may lead to increased
fuel costs for airlines, potentially reducing their stock prices due to squeezed
profit margins.
import pandas as pd
# Sample data: Oil prices and Airline Stock Prices
data = {
'Oil Prices': [60, 70, 65, 80, 75, 72],
'Airline Stock Prices': [30, 28, 29, 26, 27, 28]
}
df = pd.DataFrame(data)
# Calculating Correlation
correlation_matrix = df.corr()
print(correlation_matrix)
This script would output a matrix showing the correlation coefficients between oil
prices and airline stock prices, helping investors understand the relationship
between these variables.
Application in Finance
Understanding correlation helps in constructing a diversified portfolio. By
combining assets with low or negative correlations, investors can reduce portfolio
volatility and risk. For instance, during an economic downturn, consumer staples
tend to be less negatively impacted compared to technology stocks. Knowing
these correlations enables strategic asset allocation.
Additionally, correlation analysis is vital in risk management.
Regression
Regression analysis is a powerful statistical method used in financial analytics to
understand the relationship between an independent variable (or variables) and a
dependent variable. It predicts the dependent variable based on the values of the
independent variable(s).
Y = β0 + β1X + ϵ
where:
Y is the dependent variable (e.g., the future price),
X is the independent variable (e.g., the historical price),
β0 is the y-intercept,
β1 is the slope of the line, and
ϵ is the error term.
This model allows us to predict future stock prices based on historical data. The
slope indicates how much we expect the future price to change for a one-unit
change in the historical price.
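As a minimal sketch, this model can be fit with scikit-learn. The prices below are hypothetical, purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: yesterday's closing price (X) vs. today's closing price (Y)
X = np.array([[100], [102], [101], [105], [107], [110]])  # historical prices
y = np.array([102, 101, 105, 107, 110, 112])              # next-day prices

model = LinearRegression().fit(X, y)
print(f"Intercept (beta_0): {model.intercept_:.2f}")
print(f"Slope (beta_1): {model.coef_[0]:.2f}")

# Predict the next price from the latest observed price
next_price = model.predict(np.array([[112]]))
print(next_price)
```

The fitted slope plays the role of β1: it tells us how much the predicted price changes for a one-unit change in the historical price.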
Applications in Finance
Regression analysis is extensively used in finance for risk management, asset
pricing, and forecasting future trends. It provides a quantitative framework to
make informed decisions based on historical data patterns.
Understanding correlation and regression enables finance professionals to
decipher complex market relationships and make predictions about future
financial performance. These tools are indispensable in the financial analyst’s
toolkit, offering a solid foundation for analytical reasoning and strategic planning
in the financial domain.
Moving Average
The Moving Average (MA) is a widely used technique in financial analytics to
smooth out short-term fluctuations and highlight longer-term trends in data. It’s
particularly prevalent in technical analysis for securities trading.
A linear regression model predicting the closing price based on the trading
volume, with extracted slope, intercept, and R-squared values indicating the
model’s fit.
Visualizations that include a line graph of the closing prices and the 30-day
moving average over time, as well as a scatter plot illustrating the regression
analysis between volume and closing price.
The Python script has successfully calculated the correlation, conducted a simple
linear regression, and computed a 30-day moving average for the AAPL stock
data.
The negative slope indicates that as the volume increases, the closing
price is expected to decrease slightly. This aligns with the negative
correlation observed.
Intercept: 189.845
The intercept suggests that if the volume were zero, the predicted closing
price would be approximately $189.85. However, in real-world scenarios, a
volume of zero is not practical.
R-squared: 0.059
An R-squared of 0.059 means that trading volume explains only about 6% of
the variation in the closing price, so volume alone is a weak predictor.
import pandas as pd
# Sample stock price data
data = {'Price': [22, 24, 25, 26, 28, 29, 27, 26, 28, 30]}
df = pd.DataFrame(data)
# Calculate a 3-day simple moving average
df['SMA_3'] = df['Price'].rolling(window=3).mean()
print(df)
This script calculates a 3-day simple moving average of the stock price, helping
investors identify the trend direction.
Application in Finance
Moving averages are instrumental in finance for various purposes, including
identifying trend direction, smoothing out short-term price noise, and generating
trading signals when short-term and long-term averages cross.
Returning to the regression analysis, the slope can be computed for two variable
pairs in the AAPL data:
1. Close vs. Open: The slope will indicate how much the closing price changes
on average from the opening price.
2. Volume vs. Close: The slope will show the relationship between the trading
volume and the closing price, indicating how volume changes affect the
closing price.
Let’s write the Python code to calculate the slope for these pairs and interpret the
results:
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load the dataset
data = pd.read_csv('/mnt/data/AAPL.csv')
# Initialize the linear regression model
model = LinearRegression()
# Close vs. Open
X_close_open = data[['Open']]
y_close_open = data['Close']
model.fit(X_close_open, y_close_open)
slope_close_open = model.coef_[0]
# Volume vs. Close
X_volume_close = data[['Volume']]
y_volume_close = data['Close']
model.fit(X_volume_close, y_volume_close)
slope_volume_close = model.coef_[0]
print(f"Slope of Close vs. Open: {slope_close_open}")
print(f"Slope of Volume vs. Close: {slope_volume_close}")
Slope of Close vs. Open: This slope indicates how the closing price of AAPL
stock tends to change from the opening price throughout a trading day. A
positive slope close to 1 suggests that the closing price usually ends up being
higher than the opening price, indicating an overall positive trading day. A
slope significantly different from 1 could indicate volatility or a regular shift in
price from the open.
Slope of Volume vs. Close: This slope tells us how the trading volume is
related to the closing price. A positive slope suggests that higher trading
volumes are associated with higher closing prices, which could imply
increased buying interest or bullish sentiment. Conversely, a negative slope
would suggest that higher volumes are associated with lower closing prices,
possibly indicating selling pressure or bearish sentiment.
R Square (R²)
R² values range from 0 to 1 and indicate how well the independent variable(s)
explain the variability in the dependent variable. A higher R² value means a better
fit and suggests that the model explains a significant portion of the observed
variance.
Real-life Example: In portfolio management, R² can measure how well the returns
of a portfolio are explained by the returns of a benchmark index. A high R²
indicates the portfolio’s performance closely aligns with the index.
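A brief sketch of this idea, using hypothetical monthly returns for a portfolio and a benchmark index:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly returns (%): benchmark index vs. a portfolio
index_returns = np.array([[1.0], [2.0], [-1.0], [0.5], [3.0], [-0.5]])
portfolio_returns = np.array([1.2, 2.1, -0.8, 0.6, 2.9, -0.4])

# Regress portfolio returns on index returns; score() returns R-squared
model = LinearRegression().fit(index_returns, portfolio_returns)
r_squared = model.score(index_returns, portfolio_returns)
print(f"R-squared: {r_squared:.3f}")
```

With these illustrative numbers the portfolio tracks the index closely, so R² comes out high; a portfolio with returns unrelated to the index would score near 0.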
Understanding the slope, intercept, and R² provides valuable insights into the
nature of the relationship between variables in financial datasets. These metrics
are essential for developing predictive models that are both accurate and
interpretable.
Kurtosis
In the financial domain, understanding the distribution of returns is crucial for risk
management and investment strategy. Kurtosis is a statistical measure that
describes the shape of a distribution’s tails in relation to its overall shape,
providing insights into the probability of extreme returns.
Understanding Kurtosis
Kurtosis quantifies the tails’ heaviness of a probability distribution compared to
the normal distribution. It helps in identifying the risk of outliers that could
significantly impact an
investment’s performance.
Types of Kurtosis:
import pandas as pd
# Sample data: Daily returns of a stock
returns = pd.Series([0.01, 0.02, 0.03, -0.01, -0.02, -0.05, 0.04, 0.06, -0.03])
# Calculating kurtosis
kurt = returns.kurtosis()
print(f"Kurtosis: {kurt}")
This calculation helps in understanding the tail risk of the stock’s returns. A high
kurtosis value indicates the need for caution, as the investment may experience
extreme returns more frequently than anticipated.
Application in Finance
Kurtosis is integral to financial modeling and risk analysis, offering insights beyond
standard deviation and variance. It’s particularly relevant in:
Option Pricing: Models that account for high kurtosis can more accurately
price options, reflecting the higher risk of extreme movements.
The following script first imports the necessary libraries ( pandas for data
manipulation and scipy.stats for statistical functions), loads the AAPL stock data
from a CSV file, calculates the kurtosis for the ‘Close’, ‘Volume’, ‘Open’, ‘High’, and
‘Low’ columns using the kurtosis function with Fisher’s definition (which adjusts
the calculation so that the kurtosis of a normal distribution is 0), and then prints
out the kurtosis values for these columns.
import pandas as pd
from scipy.stats import kurtosis
# Load the dataset
data = pd.read_csv('/mnt/data/AAPL.csv')
# Calculating Kurtosis for the specified columns
# Fisher's definition is used here, where the kurtosis of a normal distribution is 0
kurtosis_close = kurtosis(data['Close'], fisher=True)
kurtosis_volume = kurtosis(data['Volume'], fisher=True)
kurtosis_open = kurtosis(data['Open'], fisher=True)
kurtosis_high = kurtosis(data['High'], fisher=True)
kurtosis_low = kurtosis(data['Low'], fisher=True)
# Printing the kurtosis results
print(f"Kurtosis of Close: {kurtosis_close}")
print(f"Kurtosis of Volume: {kurtosis_volume}")
print(f"Kurtosis of Open: {kurtosis_open}")
print(f"Kurtosis of High: {kurtosis_high}")
print(f"Kurtosis of Low: {kurtosis_low}")
The kurtosis values provide insights into the tail heaviness of the distributions for
these financial metrics, indicating the presence of outliers or extreme values in the
dataset.
The kurtosis calculations for different columns in the AAPL.csv dataset have yielded
the following results:
Kurtosis of Volume: The kurtosis value for the Volume is significantly positive,
indicating a leptokurtic distribution. Leptokurtic distributions have fatter tails
and a sharper peak than a
normal distribution. This implies that there are more extreme values in trading
volumes, which could be due to sporadic days of unusually high or low trading
activity. High kurtosis in trading volume can signify major market events or
announcements affecting investor behavior and stock liquidity.
These kurtosis values provide insights into the behavior of Apple’s stock (AAPL).
The price metrics (Close, Open, High, Low) showing platykurtic distributions
suggest that AAPL’s daily price changes tend to be less extreme, indicating steady
trading without many outliers. On the other hand, the leptokurtic distribution of
trading volume points towards periods of significant trading activity spikes, which
could be associated with specific news releases, earnings announcements, or
other market-moving events.
This script calculates the z-scores for the company’s financial ratios, facilitating a
standardized comparison to industry averages or benchmarks.
To calculate the Z-scores for the ‘Close’, ‘Volume’, ‘Open’, ‘High’, and ‘Low’
columns of the AAPL stock data and interpret the standard normal distribution of
the ‘Close’ column, you can use the following Python script.
This script includes the calculation of Z-scores and then provides a basic
statistical summary for the ‘Close’ column Z-scores:
import pandas as pd
from scipy.stats import zscore
# Load the dataset
data = pd.read_csv('/mnt/data/AAPL.csv')
# Calculate Z-scores for each column and add them as new columns
for col in ['Close', 'Volume', 'Open', 'High', 'Low']:
    data[f'{col}_zscore'] = zscore(data[col])
print(data.head())
# Mean close to 0 and std close to 1 confirm standardization of 'Close'
print(data['Close_zscore'].mean(), data['Close_zscore'].std())
This script starts by loading the AAPL.csv dataset, then calculates the Z-scores for
the ‘Close’, ‘Volume’, ‘Open’, ‘High’, and ‘Low’ columns using the zscore function
from scipy.stats . The calculated Z-scores are added as new columns to the
dataframe. Afterward, it prints out the first few rows of the dataframe to verify the
addition of Z-score columns. Finally, it calculates and prints the mean and
standard deviation of the Z-scores for the ‘Close’ column, providing a basic
interpretation of the standard normal distribution transformation applied to the
‘Close’ prices.
The mean of the Z-scores being close to 0 and the standard deviation being close
to 1 for the ‘Close’ column confirms that the data has been successfully
standardized. This standardization facilitates further analyses that require or
assume data to follow a normal distribution.
The calculation of Z-scores (Standard Normal Distribution) for key columns like
‘Close’, ‘Volume’, ‘Open’, ‘High’, and ‘Low’ in the AAPL stock data has been
performed. As an example, we’ve provided statistics for the Z-scores of the
‘Close’ column:
Close: The Z-scores for the ‘Close’ column indicate that the closing prices
have been standardized, with values representing how many standard
deviations each closing price is from the mean closing price. A Z-score close
to 0 suggests that the closing price is near the average, while a high absolute
Z-score indicates a price far from the average.
Volume, Open, High, Low: Similarly, calculating Z-scores for these columns
standardizes their values, allowing for analysis and comparison on a common
scale. For example, analyzing the Z-scores of ‘Volume’ can highlight days with
unusually high or low
trading activity.
To fully implement the calculation and interpretation of Z-scores for all the
mentioned columns, you can use the initial setup for calculating Z-scores ( zscore
function) for each column as
demonstrated.
Application in Finance
The standard normal distribution and related concepts like z-scores are critical
for:
Grasping the standard normal distribution and its applications empowers financial
analysts to make more informed, data-driven decisions, utilizing a common
statistical framework for evaluating risk, performance, and probabilities.
T Distribution Test
When we dive into the world of statistics, especially in financial analytics,
understanding different types of data distributions and tests is crucial. One such
important concept is the T Distribution Test, often used when dealing with small
sample sizes or when the population standard deviation is unknown.
What is T Distribution?
The T Distribution, also known as Student’s T Distribution, is a type of probability
distribution that is symmetric and bell-shaped, like the normal distribution, but
with heavier tails. These heavier tails indicate a higher probability of values far
from the mean, which is particularly useful when dealing with smaller sample
sizes (typically less than 30).
import numpy as np
from scipy import stats
# Sample of monthly returns (%) from the new investment strategy
monthly_returns = np.array([6, 7, 5, 7, 6, 5, 8, 4, 7, 6, 5, 9])
# Market average return
market_average = 5
# Perform a one-sample t-test
t_stat, p_value = stats.ttest_1samp(monthly_returns, market_average)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
In this example, the T-statistic measures how far the sample mean deviates from
the market average in units of standard error. The P-value tells us the probability
of observing such extreme results if the market average is the true average return
of the investment strategy. A low
P-value (typically < 0.05) suggests that the investment strategy’s performance is
significantly different from the market average.
import pandas as pd
from scipy.stats import ttest_1samp
# Load the AAPL stock dataset
data = pd.read_csv('/mnt/data/AAPL.csv')
# Define hypothetical means for the test
hypothetical_mean_close = 150  # Hypothetical mean for closing price
hypothetical_mean_volume = 100000000  # Hypothetical mean for volume
# Perform a one-sample t-test for the Closing Price against the hypothetical mean
t_stat_close, p_value_close = ttest_1samp(data['Close'], hypothetical_mean_close)
# Perform a one-sample t-test for the Volume against the hypothetical mean
t_stat_volume, p_value_volume = ttest_1samp(data['Volume'], hypothetical_mean_volume)
# Print the results
print(f"Close: T-statistic = {t_stat_close}, P-value = {p_value_close}")
print(f"Volume: T-statistic = {t_stat_volume}, P-value = {p_value_volume}")
This script provides a straightforward way to test whether the mean closing price
and volume of AAPL stock significantly differ from predefined hypothetical means.
It leverages the ttest_1samp function from scipy.stats to conduct the analysis and
prints the results, which include both the t-statistics and p-values for each test.
These outcomes help to determine if there are statistically significant differences
between the sample means (derived from the dataset) and the hypothetical
means, offering valuable insights into the stock’s performance and trading activity.
T-statistic: Indicates how many standard deviations the sample mean is from
the hypothetical mean. A higher absolute value indicates a greater difference.
P-value: Indicates the probability of observing the data (or more extreme) if
the null hypothesis (no difference) is true. A low P-value (typically < 0.05)
suggests that the observed data is unlikely under the null hypothesis, leading
to its rejection.
For Volume
Similarly, a low P-value for volume would indicate a significant difference from
the hypothetical mean volume, suggesting that AAPL’s trading activity is
unusually high or low compared to expected levels.
Closing Price
T-statistic: 42.62
Volume
T-statistic: -27.04
Interpretation
Closing Price
The T-statistic of 42.62 is significantly high, and the extremely low P-value
indicates that we can reject the null hypothesis. This result suggests that the
mean closing price of AAPL stock significantly differs from the hypothetical
mean of 150, and given the positive T-statistic, it is significantly higher than
the hypothetical mean.
Volume
The T-statistic of -27.04, combined with a very low P-value, also leads to the
rejection of the null hypothesis, indicating that the mean trading volume
significantly differs from the hypothetical mean of 100,000,000. The negative
T-statistic indicates that the actual mean
trading volume is significantly lower than the hypothetical mean.
Overall Insight
These t-test results suggest substantial deviations from the hypothetical means
for both the closing price and volume of AAPL stock.
Z Test
In financial analytics, the Z Test is a statistical method used to determine whether
there is a significant difference between the mean of a sample and the population
mean, based on the sample size and standard deviation. This test is particularly
useful when the sample size is large (n > 30) and the population standard
deviation is known, allowing analysts to make inferences about the population
based on sample data.
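The Z-test statistic takes the familiar standardization form:

```latex
Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
```

where x̄ is the sample mean, μ the population mean, σ the known population standard deviation, and n the sample size.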
The script below:
1. Loads the AAPL dataset.
2. Calculates the sample mean for the Close and Volume columns.
3. Defines hypothetical population means and standard deviations.
4. Performs the Z-test using the formula and interprets the results.
Since performing a Z-test requires the population standard deviation (or variance)
and this information is typically not available for real-world data like stock prices,
we’ll proceed with hypothetical values for demonstration purposes.
import pandas as pd
import numpy as np
from scipy.stats import norm
# Load the dataset
data = pd.read_csv('/mnt/data/AAPL.csv')
# Hypothetical population means and standard deviations
population_mean_close = 150  # Hypothetical population mean for 'Close'
population_std_close = 20  # Hypothetical population standard deviation for 'Close'
population_mean_volume = 100000000  # Hypothetical population mean for 'Volume'
population_std_volume = 15000000  # Hypothetical population standard deviation for 'Volume'
# Calculate sample means
sample_mean_close = data['Close'].mean()
sample_mean_volume = data['Volume'].mean()
# Calculate the size of the sample
n_close = len(data['Close'])
n_volume = len(data['Volume'])
# Perform Z-test (Close)
z_close = (sample_mean_close - population_mean_close) / (population_std_close / np.sqrt(n_close))
p_close = 2 * (1 - norm.cdf(abs(z_close)))  # two-tailed p-value
# Perform Z-test (Volume)
z_volume = (sample_mean_volume - population_mean_volume) / (population_std_volume / np.sqrt(n_volume))
p_volume = 2 * (1 - norm.cdf(abs(z_volume)))  # two-tailed p-value
print(f"Close: Z = {z_close}, P-value = {p_close}")
print(f"Volume: Z = {z_volume}, P-value = {p_volume}")
These Z-test results help in understanding how the observed stock data compare
to broader market expectations or historical benchmarks.
However, it’s essential to remember that the choice of population means and
standard deviations significantly affects the test’s outcome and should be based
on realistic and justifiable assumptions.
Given the hypothetical results from the Z-test for the AAPL.csv dataset:
Close: Z-score = 5.0, P-value = 0.0001
Volume: Z-score = -3.5, P-value = 0.0004
Interpretation:
Close
The Z-score of 5.0 indicates that the sample mean closing price is significantly
higher than the hypothetical population mean. The positive Z-score suggests
that AAPL’s closing prices, on average, are above the benchmark.
The very low P-value (0.0001) strongly suggests that the difference between
the sample mean and the hypothetical population mean is statistically
significant. This means we have strong evidence to reject the null hypothesis
that the sample mean is equal to the population mean.
Volume
The Z-score of -3.5 for volume implies that the sample mean volume is
significantly lower than the hypothetical population mean. The negative Z-
score indicates that AAPL’s trading volume, on average, is below the
benchmark.
Similar to the closing price, the low P-value (0.0004) for volume indicates that
the difference is statistically significant, providing strong evidence to reject the
null hypothesis in favor of the alternative hypothesis that there’s a significant
difference between the sample and population means.
Overall Insight:
These hypothetical Z-test results suggest that AAPL’s stock had significantly
higher closing prices than expected, based on the hypothetical population mean.
Conversely, the trading volume was significantly lower than the hypothetical
average, indicating less trading activity than might have been anticipated. These
insights could be valuable for investors or analysts looking to evaluate AAPL’s
stock performance relative to market expectations or historical
benchmarks.
Interpretation and Application
The Chi2 Statistic measures how much the observed frequencies deviate from
the expected frequencies, with a higher value indicating a greater deviation. The
P-value determines the significance of the association; a low P-value (typically <
0.05) suggests a significant relationship between the variables.
In the example, if the P-value is below 0.05, the bank can conclude that product
preference is associated with income bracket, guiding targeted marketing efforts.
Understanding and applying tests like the Z Test and Chi-Square Test enables
financial analysts and researchers to draw meaningful conclusions from data,
informing investment decisions, marketing strategies, and risk assessments.
Given the extensive explanation already provided for the key statistical concepts
used in financial analytics, including the Z Test and Chi-Square Test, let’s proceed
with additional important statistical measures and tests commonly applied in the
field.
💡 On Apple data
The Chi-square test is commonly used to examine the independence between two
categorical variables or to determine the goodness of fit between observed
frequencies and expected frequencies in one categorical variable with several
levels or categories. For stock market data like that in AAPL.csv , which primarily
consists of numerical and continuous data (e.g., opening price, closing price,
volume), applying a Chi-square test directly is not straightforward without
categorization or discretization of data.
However, one approach could be to categorize continuous variables (like
‘Volume’) into bins (e.g., High, Medium, Low) based on defined thresholds and
then perform a Chi-square test for independence between two such categorized
variables or a goodness-of-fit test to see if the
distribution of a single categorized variable matches expected frequencies.
import pandas as pd
from scipy.stats import chisquare
# Load the dataset
data = pd.read_csv('/mnt/data/AAPL.csv')
# Categorizing 'Volume' into 'High', 'Medium', 'Low' based on quantiles
data['Volume Category'] = pd.qcut(data['Volume'], 3, labels=['Low', 'Medium', 'High'])
# Test whether the observed distribution of 'Volume Category' matches
# an expected distribution spread equally across 'Low', 'Medium', 'High'
observed_frequencies = data['Volume Category'].value_counts().sort_index()
expected_frequencies = [len(data) / 3] * 3  # Equal distribution
# Chi-square goodness-of-fit test (chisquare, rather than chi2_contingency,
# is the appropriate function for a one-variable goodness-of-fit test)
chi_stat, p_value = chisquare(observed_frequencies, f_exp=expected_frequencies)
print(f"Chi-square Statistic: {chi_stat}, P-value: {p_value}")
Interpretation:
Chi-square Statistic: A high value might indicate that the observed
frequencies of ‘Volume Category’ deviate significantly from the expected
frequencies, suggesting that the distribution across categories is not equal.