Unit 2 Statistics and DA
1. Correlation:
Correlation measures the strength and direction of a linear relationship between two
continuous variables. The correlation coefficient, often denoted by r, ranges from -1
to 1. A positive value indicates a positive correlation, a negative value indicates a
negative correlation, and a value close to zero suggests little or no correlation.
2. Covariance:
Covariance measures how two variables change together. Like correlation, it
indicates the direction of the relationship (positive or negative), but its magnitude is
not standardized. Covariance can be calculated between any two variables, whether
they are continuous or discrete.
3. Regression Analysis:
4. Contingency Tables:
Contingency tables are used when analyzing the relationship between two
categorical variables. They show the frequency distribution of the joint occurrence of
values for these variables, allowing you to assess if there is an association or
dependency between them.
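As a minimal sketch of how such a table is tallied, the snippet below counts joint occurrences using only Python's standard library. The records and category labels are made up purely for illustration.

```python
from collections import Counter

# Hypothetical raw records: (gender, preferred option). Illustrative only.
records = [
    ("Male", "A"), ("Female", "B"), ("Male", "A"),
    ("Female", "A"), ("Male", "B"), ("Female", "B"),
]

# Count how often each (row category, column category) pair occurs.
joint = Counter(records)

rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Contingency table as a nested dict: table[row][col] -> joint frequency.
table = {r: {c: joint[(r, c)] for c in cols} for r in rows}

# Marginal totals for each row and column.
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

for r in rows:
    print(r, [table[r][c] for c in cols], "total:", row_totals[r])
print("column totals:", col_totals)
```

The row and column totals computed here are the same marginals that a chi-square test uses to derive expected frequencies.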
5. Chi-Square Test:
Description: The chi-square test assesses the association between two categorical variables
by comparing observed and expected frequencies in a contingency table.
Covariance and correlation coefficient are two statistical measures that describe the relationship
between two random variables. Both are used to quantify the degree to which two variables change
together.
1. Covariance:
Covariance measures the extent to which two variables change in relation to each
other.
It can take any value between negative infinity and positive infinity.
A positive covariance indicates that as one variable increases, the other variable
tends to increase as well.
A negative covariance indicates that as one variable increases, the other variable
tends to decrease.
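The formula for the (sample) covariance between two variables X and Y is given by:
$\mathrm{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
For a whole population, the sum is divided by n instead of n - 1.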
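2. Correlation Coefficient:
It ranges from -1 to 1.
The formula for the correlation coefficient (Pearson correlation coefficient) is given by:
$r = \frac{\mathrm{Cov}(X, Y)}{s_X \, s_Y}$
where $s_X$ and $s_Y$ are the standard deviations of X and Y.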
In summary, while covariance measures the direction of the linear relationship between two
variables (positive or negative), correlation coefficient provides a standardized measure of both the
direction and strength of the linear relationship. Correlation is often preferred because it is
dimensionless and allows for easier comparison between different pairs of variables.
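The difference in scale can be seen in a small sketch. The height and weight values below are invented for illustration, and the helper names `sample_cov` and `pearson_r` are hypothetical. Changing the unit of one variable rescales the covariance but leaves the correlation unchanged.

```python
from math import sqrt
from statistics import mean


def sample_cov(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return sample_cov(xs, ys) / sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


# Illustrative (made-up) measurements.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [50, 58, 65, 72, 85]
heights_cm = [h * 100 for h in heights_m]  # same data, different unit

cov_m = sample_cov(heights_m, weights_kg)
cov_cm = sample_cov(heights_cm, weights_kg)
r_m = pearson_r(heights_m, weights_kg)
r_cm = pearson_r(heights_cm, weights_kg)

print(f"covariance (m):  {cov_m:.3f}")
print(f"covariance (cm): {cov_cm:.3f}")   # 100 times larger
print(f"correlation (m):  {r_m:.4f}")
print(f"correlation (cm): {r_cm:.4f}")    # unchanged
```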
Chi-Square
The chi-square (χ²) statistic is commonly used in statistics to assess the goodness of fit of a
distribution to a set of observed data. It is also used to test independence in
contingency tables. However, it is not directly used to measure skewness or kurtosis.
In a goodness of fit test, the chi-square statistic is used to compare the observed
distribution of data with the expected distribution. The formula for the chi-square
goodness of fit statistic is:
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
where $O_i$ is the observed frequency, $E_i$ is the expected frequency, and the sum is taken over all
categories or bins.
This test helps to assess whether the observed data follows a particular theoretical
distribution.
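A minimal goodness-of-fit sketch follows. The die-roll counts are invented for illustration. The critical value 11.070 is the standard table entry for 5 degrees of freedom at the 5% level.

```python
# Hypothetical counts from 60 rolls of a die (illustrative data only).
observed = [8, 12, 9, 11, 10, 10]

# Under the hypothesis of a fair die, every face is expected equally often.
n_rolls = sum(observed)
expected = [n_rolls / len(observed)] * len(observed)

# Chi-square statistic: sum over categories of (O - E)^2 / E.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom for goodness of fit: number of categories minus 1.
dof = len(observed) - 1

# Critical value for dof = 5 at the 5% significance level (from tables).
CRITICAL_005_DF5 = 11.070

print(f"chi-square = {chi_sq:.3f}, dof = {dof}")
if chi_sq > CRITICAL_005_DF5:
    print("Reject the hypothesis that the die is fair.")
else:
    print("No evidence against a fair die at the 5% level.")
```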
The chi-square test is used to examine the independence between two categorical
variables. The test statistic is calculated based on the differences between the
observed and expected frequencies in a contingency table.
The formula for the chi-square test of independence is similar to the goodness of fit
test, with the expected frequencies being calculated under the assumption of
independence between the variables.
1. Skewness:
Flat peak.
Fewer values concentrated around the mean but still more than normal distribution.
Lighter tails.
2. Leptokurtic distribution (kurtosis > 3, excess kurtosis > 0): sharp peak, heavy tails
3. Platykurtic distribution (kurtosis < 3, excess kurtosis < 0): flat peak, light tails
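The thresholds above can be checked with a small moment-based sketch. The two sample lists are made up, the function names are hypothetical, and the tolerance around 3 is an arbitrary choice for illustration.

```python
from statistics import mean


def kurtosis(values):
    """Moment-based kurtosis m4 / m2**2; a normal distribution gives 3."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2


def classify(values, tol=0.1):
    """Label a sample by its excess kurtosis (kurtosis - 3).

    `tol` is an arbitrary band around zero, chosen only for this sketch.
    """
    excess = kurtosis(values) - 3
    if excess > tol:
        return "leptokurtic"
    if excess < -tol:
        return "platykurtic"
    return "close to normal (kurtosis near 3)"


# Illustrative samples (not real data).
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]      # evenly spread values
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]    # mass at the centre, two extremes

for name, sample in [("flat", flat), ("peaked", peaked)]:
    print(name, round(kurtosis(sample), 3), classify(sample))
```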
import seaborn as sns

# Example dataset
diamonds = sns.load_dataset("diamonds")
diamond_prices = diamonds["price"]

mean_price = diamond_prices.mean()
median_price = diamond_prices.median()
std = diamond_prices.std()

# Pearson's second skewness coefficient: 3 * (mean - median) / standard deviation
skewness = 3 * (mean_price - median_price) / std

>>> print(
f"The Pearson's second skewness score of diamond prices distribution is {skewness:.5f}"
)
The Pearson's second skewness score of diamond prices distribution is 1.15189
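Another formula highly influenced by the works of Karl Pearson is the moment-based
formula to estimate skewness. It is more reliable and given as follows:
$\text{skewness} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}}$
Here:
n represents the number of values in a distribution
x_i denotes each data point
$\bar{x}$ is the mean of the data points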
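A minimal sketch of the moment-based calculation is shown below. It assumes the plain 1/n moments, which is also what scipy.stats.skew computes by default. The function name and sample lists are illustrative only.

```python
from statistics import mean


def moment_skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using 1/n central moments."""
    n = len(values)
    x_bar = mean(values)
    m2 = sum((x - x_bar) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - x_bar) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Illustrative samples (not real data).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]   # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]   # mirror image

print(moment_skewness(symmetric))      # zero for a symmetric sample
print(moment_skewness(right_skewed))   # positive
print(moment_skewness(left_skewed))    # negative
```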
If you don’t want to calculate skewness manually, you can use built-in methods
from pandas or scipy:
# Pandas version
diamond_prices.skew()
1.618395283383529
# SciPy version
from scipy.stats import skew
skew(diamond_prices)
1.6183502776053016
Explanation
• The code uses two different libraries, pandas and SciPy, to calculate the skewness of the same data.
• Skewness is a measure of the asymmetry of the probability distribution of a real-valued
random variable.
• The first part of the code calls the pandas 'skew' method on the 'diamond_prices' Series.
• The output, 1.618395283383529, is positive, which indicates a right-skewed distribution.
• The second part of the code imports the 'skew' function from the 'scipy.stats' module.
• It then uses this function to calculate the skewness of the same 'diamond_prices' data.
• The output, 1.6183502776053016, is slightly different because pandas applies a sample-size bias
correction, while scipy.stats.skew does not by default (bias=True).
A Box and Whisker Plot (or simply Box Plot) is a graphical representation of the distribution of a
dataset. It provides a visual summary of the central tendency, spread, and skewness of the data.
Here are the key components of a Box Plot:
1. Box:
The box represents the interquartile range (IQR), which is the range between the first
quartile (Q1) and the third quartile (Q3). It spans the middle 50% of the data.
The length of the box indicates the spread or variability of the middle 50% of the
data.
2. Whiskers:
The whiskers extend from the box to the minimum and maximum values within a
specified range, commonly 1.5 × IQR beyond Q1 and Q3.
3. Outliers:
Individual data points that fall outside the whiskers are considered outliers and are
plotted individually.
Example Problem
Using Box Plots to Compare Distributions:
Box Plots are useful for comparing the distributions of different datasets or groups. Here's how you
can interpret and use Box Plots for comparisons:
1. Central Tendency:
Compare the medians of different Box Plots to assess the central tendency.
If one median is higher than another, it suggests a higher central tendency in the
corresponding dataset.
2. Spread:
Compare the lengths of the boxes to assess the spread or variability of the data.
A longer box indicates greater variability in the middle 50% of the data.
3. Skewness:
Observe the symmetry of the boxes. A skewed distribution may have one whisker
longer than the other.
4. Outliers:
Outliers can provide insights into extreme values or data points that deviate
significantly from the norm.
1. Histogram:
2. Scatter Plot:
Used to display the relationship between two continuous variables. Each point on
the plot represents an observation with coordinates (x, y).
3. Line Chart:
Shows the relationship between two variables with continuous data points
connected by lines. It is often used to depict trends over time.
4. Bar Chart:
Represents categorical data with rectangular bars. The length of each bar
corresponds to the frequency or proportion of observations in each category.
5. Pie Chart:
Displays the proportion of a whole that each category represents. It is suitable for
categorical data where the parts contribute to the whole.
Choosing the appropriate graph depends on the nature of the data and the specific goals of the
analysis. Different graphs highlight different aspects of the data distribution and relationships.
Problem:
1. Covariance (8 Marks):
X   1   2   3   4   5
Y   2   4   5   4   5
a. Calculate the mean of X and Y.
b. Compute the covariance between X and Y.
c. Interpret the sign of the covariance in the context of the relationship between X and Y.
d. Discuss the limitations of covariance as a measure of the relationship.
2. Correlation Coefficient:
a. Calculate the correlation coefficient between X and Y.
b. Interpret the magnitude and sign of the correlation coefficient.
c. Compare and contrast the correlation coefficient with covariance, highlighting advantages and
disadvantages.
d. Discuss the implications of a correlation coefficient close to +1.
Answer:
1. Covariance (8 Marks):
This problem assesses the understanding and application of covariance and correlation coefficient,
covering calculations, interpretation, and comparing the two measures. It also delves into the
limitations and implications of these statistical measures.
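The calculations for the X and Y values given in the problem can be sketched as follows. Both the sample (n - 1) and population (n) covariance are shown, since the problem does not say which is expected.

```python
from math import sqrt
from statistics import mean

# Data from the problem statement.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

# a. Means
mean_x, mean_y = mean(X), mean(Y)            # 3 and 4

# Sums of products and squares of deviations from the means
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # 6
sum_xx = sum((x - mean_x) ** 2 for x in X)                       # 10
sum_yy = sum((y - mean_y) ** 2 for y in Y)                       # 6

# b. Covariance
cov_sample = sum_xy / (n - 1)                # 1.5
cov_population = sum_xy / n                  # 1.2

# Pearson correlation coefficient (the n - 1 factors cancel)
r = sum_xy / sqrt(sum_xx * sum_yy)           # about 0.7746

print(f"mean X = {mean_x}, mean Y = {mean_y}")
print(f"sample covariance = {cov_sample}, population covariance = {cov_population}")
print(f"correlation r = {r:.4f}")
```

The positive covariance says that Y tends to rise as X rises. The correlation of about 0.77 adds that this linear relationship is fairly strong, which the covariance alone cannot convey because its size depends on the units of X and Y.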
Problem:
Question (16 Marks):
1. Chi-Square (4 Marks):
              Male   Female
Candidate A   30     20
Candidate B   25     35
Candidate C   15     25
Apply the Chi-Square test to determine if there is a significant association between gender
and voting preference.
2. Measures of Distribution (Skewness and Kurtosis) (4 Marks):
You are given a dataset representing the daily returns of a stock over the past year.
Calculate the skewness and kurtosis of the dataset and interpret the results.
3. Box and Whisker Plots:
Consider two classes (Class A and Class B) with the following exam scores: Class A
(78, 85, 88, 90, 92) and Class B (75, 80, 85, 92, 95). Create box and whisker plots for
each class and compare the distributions.
4. Graphs:
Using the dataset of monthly sales for a retail business, create a histogram to
represent the distribution of sales. Additionally, create a scatter plot to explore the
relationship between advertising spending and monthly sales.
Answer:
1. Chi-Square (4 Marks):
The Chi-Square statistic is $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$,
where $O_i$ is the observed frequency, $E_i$ is the expected frequency, and the sum is taken over all
categories or bins.
Applying this to the given data, we can calculate the Chi-Square statistic and
compare it to the critical value to determine if there is a significant association.
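A sketch of this calculation for the gender and candidate table is given below, using only the standard library. The 5% significance level is an assumption, since the problem does not state one. For exactly 2 degrees of freedom, the chi-square p-value has the closed form exp(-x / 2), which is used here in place of a statistics library.

```python
from math import exp

# Observed counts from the problem: rows are Candidates A, B, C;
# columns are Male, Female.
observed = [
    [30, 20],
    [25, 35],
    [15, 25],
]

row_totals = [sum(row) for row in observed]          # 50, 60, 40
col_totals = [sum(col) for col in zip(*observed)]    # 70, 80
grand_total = sum(row_totals)                        # 150

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1) = 2.
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid only because dof == 2: P(chi-square > x) = exp(-x / 2).
p_value = exp(-chi_sq / 2)

# Critical value for dof = 2 at the 5% level (from tables).
CRITICAL_005_DF2 = 5.991

print(f"chi-square = {chi_sq:.4f}, dof = {dof}, p = {p_value:.4f}")
print("significant at 5%:", chi_sq > CRITICAL_005_DF2)
```

With these counts the statistic is about 5.52, just below the critical value of 5.991, so at the 5% level the data do not show a significant association between gender and voting preference.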
For each class, calculate the quartiles, interquartile range (IQR), and plot the box and
whisker plots. Compare the positions of medians, spread of data, and identify any
outliers.
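The quartiles for the two classes can be sketched as below. This assumes the median-of-halves method, in which the median is excluded from both halves when the count is odd. Other quartile conventions give slightly different values. The function name is hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return min, Q1, median, Q3, max using the median-of-halves method."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = values[half + 1:] if n % 2 else values[half:]
    return {
        "min": values[0],
        "q1": median(lower),
        "median": median(values),
        "q3": median(upper),
        "max": values[-1],
    }


# Exam scores from the problem statement.
class_a = [78, 85, 88, 90, 92]
class_b = [75, 80, 85, 92, 95]

summary_a = five_number_summary(class_a)
summary_b = five_number_summary(class_b)
iqr_a = summary_a["q3"] - summary_a["q1"]
iqr_b = summary_b["q3"] - summary_b["q1"]

print("Class A:", summary_a, "IQR:", iqr_a)
print("Class B:", summary_b, "IQR:", iqr_b)
```

Class A has the higher median (88 against 85) and the narrower box (IQR 9.5 against 16), so its scores are both higher at the centre and more tightly grouped.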
Conclusion:
This comprehensive problem requires knowledge and application of various statistical concepts, their
calculations, and interpretation. It assesses the ability to use statistical tools for analyzing
relationships, distribution characteristics, and graphical representations in real-world scenarios.
Problem:
Question (16 Marks):
A company conducted a survey to investigate the relationship between job satisfaction and
department in a large organization. The data is summarized in the contingency table below:
HR Department        45   20   15
Finance Department   30   25   25
IT Department        40   15   20
Clearly state the null hypothesis (H0) and the alternative hypothesis (Ha).
2. Interpretation (6 Marks):
Reflect on any limitations of the Chi-Square test for this type of analysis.
Answer:
Based on the calculated Chi-Square statistic and critical value, determine whether
the association is statistically significant.
Interpret the findings in terms of how job satisfaction is associated with different
departments.
c. Reflect on limitations:
Chi-Square tests assume independence of observations, so discuss any factors that
might affect the results, such as the sample size or potential confounding variables.
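A sketch of the test for the department table follows. Here H0 is that job satisfaction is independent of department and Ha is that the two are associated. The column labels are not given in the table, so the three columns are treated as unnamed satisfaction levels. The 5% significance level is also an assumption.

```python
# Observed counts from the table. The three columns are unnamed
# satisfaction levels (their labels are not given in the source).
observed = {
    "HR": [45, 20, 15],
    "Finance": [30, 25, 25],
    "IT": [40, 15, 20],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]            # 80, 80, 75
col_totals = [sum(c) for c in zip(*rows)]      # 115, 60, 60
grand_total = sum(row_totals)                  # 235

# Expected counts under H0 (independence).
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(rows, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom: (3 - 1) * (3 - 1) = 4.
dof = (len(rows) - 1) * (len(col_totals) - 1)

# Critical value for dof = 4 at the 5% level (from tables).
CRITICAL_005_DF4 = 9.488
significant = chi_sq > CRITICAL_005_DF4

# Common rule of thumb: every expected count should be at least 5.
min_expected = min(min(r) for r in expected)

print(f"chi-square = {chi_sq:.3f}, dof = {dof}, significant at 5%: {significant}")
print(f"smallest expected count = {min_expected:.2f}")
```

The statistic is about 7.74, below the critical value of 9.488, so H0 is not rejected at the 5% level. The smallest expected count is about 19, so the usual minimum-count condition is met for this table.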
Conclusion:
This problem evaluates the ability to apply the Chi-Square test, interpret results, and reflect on the
practical implications and limitations of the analysis in a real-world scenario.
A Box and Whisker Plot is a visual representation of the five-number summary. It is also
called a Box Plot. It consists of a rectangular “box” and two “whiskers.” A Box and
Whisker Plot contains the following parts:
Box: The box in the plot spans from the first quartile (Q1) to the third quartile (Q3). This box
contains the middle 50% of the data and represents the interquartile range (IQR). The width
of the box provides insights into the data’s spread.
Whiskers: The whiskers extend from the minimum value to Q1 and from Q3 to the maximum
value. They signify the range of the data, excluding potential outliers. The whiskers can vary
in length, indicating the data’s skewness or symmetry.
Median Line: A line within the box represents the median (Q2). It divides the data into two
halves, revealing the central tendency.
Outliers: Individual data points lying beyond the whiskers are considered outliers and are
often plotted as individual points.
1. Visualising Data Distribution: Box plots are excellent tools for gaining a visual
understanding of the distribution of a dataset. They give a quick overview of the central tendency,
spread, and shape of the data distribution, helping to identify whether the
data is symmetric, skewed, or contains outliers.
2. Comparing Distributions: Box plots are useful for comparing the distributions of different
datasets side by side. This is especially valuable when you need to compare the
characteristics of different groups, populations, or categories. For instance, comparing the test scores of
students from different schools or regions, or analysing the performance of different products or
treatments in a study.
3. Assessing Skewness: By looking at the relative lengths and positions of the box and whiskers,
one can assess the skewness of the data. A longer tail on one side of the
box suggests skewness in that direction.
4. Data Exploration: Box plots can serve as starting tools for data exploration. They
give a compact summary of a dataset’s key characteristics, helping to decide on appropriate data
analysis techniques or transformations.
5. Statistical Analysis: Box plots are frequently used alongside statistical tests and
analyses. They can help visualise the distribution of data before conducting
hypothesis testing or comparing the means of different groups.
6. Quality Control: In manufacturing and quality control processes, box plots are used to monitor
variations in product specifications and identify defects or deviations from quality standards.
They help recognise when a process is operating within acceptable limits or when it needs
adjustment.
7. Decision-Making: Box plots provide decision-makers with a clear and intuitive way of assessing
data characteristics. They are used in business, finance, and healthcare to make informed
decisions based on data summaries.
8. Risk Assessment: In fields like finance and insurance, box plots can be used to visualise the risk
associated with different investments or insurance policies. They help stakeholders understand the
potential variability in returns or losses.
9. Public Health and Epidemiology: Box plots are used to visualise and compare health-related
data. For example, the distribution of disease rates among different regions or
demographic groups.
10. Environmental Sciences: Box plots can be applied to analyse environmental data, such as air
quality measurements or water pollution levels, and to assess variations across time or locations.
Box and Whisker Plots are particularly useful in the following situations:
1. Comparing Scores: When the performance of students from different classes or schools needs to be
compared, a box plot can help assess the distribution of test scores in each group
and show whether one group outperforms the others.
2. Analysing Employee Salaries: When examining the salaries of employees in an
organisation, one can use box plots to compare the salary distributions across
departments or job roles, helping to identify disparities or outliers.
3. Evaluating Product Quality: In manufacturing, to monitor the quality of a product, one can
make box plots of measurements taken across different production runs. This reveals variations and
whether the product meets quality standards.
4. Detecting Anomalies in Financial Data: When examining financial data, such as stock
returns, one can use box plots to identify outlier trading days or unusual price
movements, which may indicate significant events or errors in the data.
5. Comparing Patient Recovery Times: In healthcare, one can use box plots to compare
the recovery times of patients who receive different treatments or surgeries. This can
help determine which treatment approach is more effective.
6. Assessing Marketing Campaigns: Marketers can use box plots to evaluate the impact of different
advertising campaigns by comparing metrics such as click-through rates or conversion rates across
campaign variations.
7. Monitoring Air Quality: Environmental scientists and agencies use box plots to visualise air quality
data, comparing pollutant concentrations across different monitoring stations or regions.
8. Assessing Investment Portfolios: Financial analysts can use box plots to compare the
distributions of returns for different investment portfolios, helping investors
understand risk and return trade-offs.
9. Comparing Housing Prices: Real estate professionals can use box plots to compare the prices of
houses in different neighbourhoods or cities, giving insight into housing market variations.
10. Analysing Crime Rates: Law enforcement agencies can use box plots to compare
crime rates across areas or over time, helping to allocate resources and prioritise interventions.
The following steps are involved in making a Box and Whisker Plot:
1. Order the Data: Arrange the data in ascending order.
2. Calculate Quartiles: Find the first quartile (Q1), third quartile (Q3), and median (Q2) of
the given data.
3. Determine Whiskers: Find the minimum and maximum values, excluding outliers.
4. Plot the Box and Whiskers: Draw a box from Q1 to Q3, a line inside the box at Q2, and whiskers
from the minimum to Q1 and from Q3 to the maximum.
5. Identify Outliers: Plot any data points outside the whiskers as individual points.
Example:
Suppose we have a dataset representing the test scores of a group of students: Data (test scores): 78,
85, 90, 92, 95, 96, 97, 98, 99, 100, 105, 110, 120.
Solution:
Dataset: 78, 85, 90, 92, 95, 96, 97, 98, 99, 100, 105, 110, 120
-Q2 (the median) is the middle (7th) value of the 13 sorted values = 97
-Q1 (the first quartile) is the median of the lower half of the data (78, 85, 90, 92, 95, 96) = 91
-Q3 (the third quartile) is the median of the upper half of the data: (98, 99, 100, 105, 110, 120) =
102.5
To find the whiskers, calculate the minimum and maximum values within the dataset, excluding
potential outliers.
Any data points that fall outside the whiskers are considered outliers. If the whiskers are drawn to
the minimum (78) and maximum (120), no points are flagged. Under the common 1.5 × IQR rule,
however, the IQR is 11.5 and the fences are 73.75 and 119.75, so 120 falls just above the upper
fence and would be plotted as an outlier, with the upper whisker ending at 110. This Box and Whisker
Plot gives a visual summary of the test scores, showing the median (Q2) at 97 and the interquartile
range (IQR) from Q1 to Q3 (91 to 102.5). It effectively shows the central tendency, spread, and
distribution of the dataset.
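The worked example can be checked with a short sketch. It assumes the 1.5 × IQR convention for the whisker limits, which the example itself does not state. Under that assumption, 120 lies just above the upper fence of 119.75 and is flagged. If the whiskers are simply drawn to the minimum and maximum, no point is flagged.

```python
from statistics import median

# Test scores from the example.
scores = [78, 85, 90, 92, 95, 96, 97, 98, 99, 100, 105, 110, 120]

values = sorted(scores)
n = len(values)
half = n // 2

q2 = median(values)                                   # 97
lower = values[:half]                                 # six values below the median
upper = values[half + 1:] if n % 2 else values[half:]  # six values above the median
q1 = median(lower)                                    # 91.0
q3 = median(upper)                                    # 102.5
iqr = q3 - q1                                         # 11.5

# Fences under the assumed 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr                          # 73.75
upper_fence = q3 + 1.5 * iqr                          # 119.75

# Whiskers end at the most extreme values still inside the fences.
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)  # 78 and 110
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence})")
print(f"whiskers = ({whisker_low}, {whisker_high}), outliers = {outliers}")
```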
Internal Systems:
1. Enterprise Resource Planning (ERP) Systems:
ERP systems integrate internal business processes, including accounting, human resources,
and inventory management.
These systems provide a centralized source of data for various departments within an
organization.
2. Customer Relationship Management (CRM) Systems:
Valuable for businesses to understand customer behavior and improve customer satisfaction.
3. Operational Databases:
These databases store transactional data generated during daily operations, such as sales,
purchases, and inventory movements.
4. Data Warehouses:
Data warehouses consolidate and organize data from various sources for reporting and
analysis.
They enable organizations to have a unified view of their data for strategic decision-making.
5. In-House Applications:
Custom-built applications specific to an organization's needs can generate and store data.
External Systems:
1. External APIs (Application Programming Interfaces):
Many organizations offer APIs that allow external systems to access their data.
This can include data from social media platforms, financial institutions, or weather services.
2. Cloud-Based Services:
Cloud platforms provide services where data can be stored and accessed remotely.
Services like Amazon S3, Google Cloud Storage, and Microsoft Azure offer scalable and
flexible data storage solutions.
3. Open Data Sources:
Governments and organizations often make datasets publicly available for research and
analysis.
Examples include data.gov, World Bank datasets, and various scientific research databases.
4. Web Scraping:
External data acquisition can include information from competitor websites, news articles, or
any publicly available online content.
5. Partner and Vendor Data:
Data acquired from external vendors, suppliers, or business partners can provide valuable
insights.
This may include market trends, industry reports, or collaborative research data.
6. Social Media Data:
Social media data, including user interactions, sentiment analysis, and trending topics, can be
acquired for marketing and brand analysis.
APIs from platforms like Twitter, Facebook, and Instagram provide access to their data.
7. Sensor Data:
For industries like manufacturing or IoT (Internet of Things), sensor data from external
devices is crucial.
This can include temperature sensors, GPS data, or other telemetry data.
In the context of data acquisition, organizations often employ a combination of internal and external
data sources to create a comprehensive and diverse dataset for analysis and decision-making. The
integration of data from different sources is a key aspect of building a robust data ecosystem within
an organization.
1. Definition:
Web APIs are sets of rules and protocols that allow different software applications to
communicate with each other.
They enable the exchange of data and functionalities between different systems over the
internet.
RESTful APIs: Representational State Transfer APIs are widely used for their simplicity and
scalability.
SOAP APIs: Simple Object Access Protocol APIs use XML as a format for data exchange.
JSON-RPC and XML-RPC APIs: These allow remote procedure calls using JSON or XML.
Twitter API: Provides access to Twitter's data, allowing developers to retrieve tweets, user
information, and trends.
GitHub API: Allows developers to access information about repositories, issues, and users on
GitHub.
Rate limits are imposed to prevent abuse and ensure fair usage of the API.
Social media platforms like Facebook, Instagram, and LinkedIn provide APIs for accessing
user data, posts, and engagement metrics.
1. Definition:
Open data refers to data that is freely available, accessible, and can be used, modified, and
shared by anyone.
Governments, organizations, and institutions often release data openly for public use.
3. International Organizations:
International organizations like the World Bank and the United Nations release open datasets
covering global development indicators, economic data, and demographic information.
Databases like PubMed, arXiv, and Kaggle datasets are popular sources for researchers and
data scientists.
5. Non-Profit Organizations:
6. OpenStreetMap:
OpenStreetMap provides open and collaborative mapping data that can be used for various
applications, including geographic information systems (GIS).
Fosters innovation as developers, researchers, and businesses can leverage diverse datasets.
8. Challenges:
In summary, Web APIs and Open Data Sources play crucial roles in data acquisition, offering a wealth
of information for diverse applications, from business analytics to research and development.
Integrating data from these sources enriches the depth and breadth of datasets available for analysis
and decision-making.
To interact with Web APIs and utilize Open Data Sources for data acquisition in programs, you
typically use programming languages like Python or R. Below are examples using Python with
libraries such as requests for API requests and data retrieval and pandas for data manipulation. Keep
in mind that you might need API keys for some services.
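As a dependency-free sketch of the request pattern, the snippet below uses the standard library's urllib and json in place of requests and pandas, so it runs without installing anything. The endpoint URL, the parameter names, the "results" key, and the Bearer-token header are all hypothetical. A real service documents its own URL, parameters, response layout, and authentication scheme.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; replace with the URL documented by a real service.
BASE_URL = "https://api.example.com/v1/observations"


def build_request(base_url, params, api_key=None):
    """Build a GET request with query parameters and an optional API key."""
    url = f"{base_url}?{urlencode(params)}"
    headers = {"Accept": "application/json"}
    if api_key:
        # Bearer tokens are one common scheme; this is an assumption here.
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body (needs network access)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_rows(payload, fields):
    """Flatten a payload of the assumed form {"results": [...]} into rows."""
    return [[record.get(f) for f in fields] for record in payload.get("results", [])]


# Offline demonstration with a made-up payload, so no network call is needed.
sample_payload = json.loads(
    '{"results": [{"city": "Oslo", "temp_c": 4.5}, {"city": "Lima", "temp_c": 19.0}]}'
)
rows = to_rows(sample_payload, ["city", "temp_c"])
request = build_request(BASE_URL, {"city": "Oslo", "limit": 2}, api_key="YOUR_API_KEY")

print(request.full_url)
print(rows)
```

With requests installed, `requests.get(url, params=..., headers=...)` followed by `.json()` plays the role of `build_request` and `fetch_json`, and the resulting rows can be passed to `pandas.DataFrame` for further manipulation.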