Unit 2 Statics and DA

Unit 2
STATISTICAL ANALYSIS AND DATA ACQUISITION

Relationship between attributes: Covariance, Correlation Coefficient
In statistical analysis, the relationship between attributes refers to how different variables or
attributes are related to each other. This relationship can be explored using various statistical
techniques. Here are some key concepts related to the relationship between attributes in statistical
analysis:
1. Correlation:
 Correlation measures the strength and direction of a linear relationship between two
continuous variables. The correlation coefficient, often denoted by r, ranges from -1
to 1. A positive value indicates a positive correlation, a negative value indicates a
negative correlation, and a value close to zero suggests little or no correlation.
2. Covariance:
 Covariance measures how two variables change together. Like correlation, it
indicates the direction of the relationship (positive or negative), but its magnitude is
not standardized. Covariance can be calculated between any two variables, whether
they are continuous or discrete.
3. Regression Analysis:
 Regression analysis is used to model the relationship between a dependent variable
and one or more independent variables. It helps in understanding the strength and
nature of the relationship, and it can be used for predicting the value of the
dependent variable based on the values of the independent variables.
4. Contingency Tables:
 Contingency tables are used when analyzing the relationship between two
categorical variables. They show the frequency distribution of the joint occurrence of
values for these variables, allowing you to assess if there is an association or
dependency between them.
5. Chi-Square Test:
Description: The chi-square test assesses the association between two categorical variables
by comparing observed and expected frequencies in a contingency table.
Use: Commonly used for testing independence in categorical data
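As a rough illustration of the regression idea in item 3, the sketch below fits a straight line by least squares in plain Python. It reuses the five (X, Y) pairs from the covariance problem later in these notes; the helper name predict is an assumption made for the example, not part of any library.

```python
# Simple linear regression by least squares, written out by hand.
# The five (x, y) pairs are the ones used in the covariance problem in these notes.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope = sum of cross-deviations / sum of squared x-deviations.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict(x_new):
    """Predicted y for a new x under the fitted line (hypothetical helper)."""
    return intercept + slope * x_new


print(f"y = {intercept:.2f} + {slope:.2f} * x")
print(f"prediction at x = 6: {predict(6):.2f}")
```

The slope uses the same cross-deviation sum that appears in the covariance, which is why regression, covariance, and correlation are usually discussed together.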

Covariance, Correlation Coefficient
Covariance and correlation coefficient are two statistical measures that describe the relationship
between two random variables. Both are used to quantify the degree to which two variables change
together.
1. Covariance:
 Covariance measures the extent to which two variables change in relation to each
other.
 It can take any value between negative infinity and positive infinity.
 A positive covariance indicates that as one variable increases, the other variable
tends to increase as well.
 A negative covariance indicates that as one variable increases, the other variable
tends to decrease.
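The formula for the sample covariance between two variables X and Y is given by:
\mathrm{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
where \bar{x} and \bar{y} are the sample means and n is the number of paired observations. (For the
population covariance, divide by n instead of n - 1.)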
2. Correlation Coefficient:
 Correlation coefficient is a standardized measure of the strength and direction of the
linear relationship between two variables.
 It ranges from -1 to 1.
 A correlation coefficient of 1 indicates a perfect positive linear relationship, -1
indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
 The correlation coefficient is obtained by dividing the covariance by the product of
the standard deviations of the two variables.
The formula for the correlation coefficient (Pearson correlation coefficient) is given by:
r = \frac{\mathrm{cov}(X, Y)}{s_X \, s_Y}
where s_X and s_Y are the standard deviations of X and Y.
In summary, while covariance measures the direction of the linear relationship between two
variables (positive or negative), correlation coefficient provides a standardized measure of both the
direction and strength of the linear relationship. Correlation is often preferred because it is
dimensionless and allows for easier comparison between different pairs of variables.
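The point about correlation being dimensionless can be checked numerically. The sketch below, which assumes NumPy is available and uses made-up height and weight values, computes covariance and correlation for the same data with height expressed first in metres and then in centimetres.

```python
import numpy as np

# Hypothetical paired measurements (illustrative values only).
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 69.0, 75.0, 88.0])

# The same heights expressed in centimetres.
height_cm = height_m * 100

# np.cov returns the 2x2 covariance matrix (sample version, n - 1 denominator);
# the off-diagonal entry is cov(X, Y).
cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r_m = np.corrcoef(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"covariance (m):  {cov_m:.3f}")
print(f"covariance (cm): {cov_cm:.3f}")
print(f"correlation (m):  {r_m:.4f}")
print(f"correlation (cm): {r_cm:.4f}")
```

Changing the unit multiplies the covariance by 100 but leaves r unchanged, which is why r is the easier quantity to compare across pairs of variables.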
Chi-Square
The chi-square (χ²) statistic is commonly used in statistics to assess the goodness of fit of a
distribution to a set of observed data. It is also used in the context of testing independence in
contingency tables. However, it is not directly used to measure skewness or kurtosis.
1. Goodness of Fit Test:
 In a goodness of fit test, the chi-square statistic is used to compare the observed
distribution of data with the expected distribution. The formula for the chi-square
goodness of fit statistic is:
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
where O_i is the observed frequency, E_i is the expected frequency, and the sum is taken over all
categories or bins.
 This test helps to assess whether the observed data follows a particular theoretical
distribution.
2. Testing Independence in Contingency Tables:
 The chi-square test is used to examine the independence between two categorical
variables. The test statistic is calculated based on the differences between the
observed and expected frequencies in a contingency table.
 The formula for the chi-square test of independence is similar to the goodness of fit
test, with the expected frequencies being calculated under the assumption of
independence between the variables.
 The chi-square test is sensitive to departures from expected frequencies, and a
statistically significant result suggests that there is an association between the two
categorical variables.
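A minimal sketch of the goodness of fit calculation, in plain Python. The observed die-roll counts are made-up illustration data, and the critical value is the usual table value for 5 degrees of freedom at the 5% level.

```python
# Chi-square goodness-of-fit test for a six-sided die.
# The observed counts are made-up illustration data (120 rolls).
observed = [25, 17, 15, 23, 24, 16]

# Under the null hypothesis of a fair die, every face is equally likely.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# chi^2 = sum over categories of (O_i - E_i)^2 / E_i
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom for goodness of fit: number of categories minus one.
dof = len(observed) - 1

# Critical value of the chi-square distribution for dof = 5 at the 5% level,
# taken from a standard chi-square table.
CRITICAL_5_PERCENT_DOF_5 = 11.070

print(f"chi-square = {chi_square:.3f}, dof = {dof}")
if chi_square > CRITICAL_5_PERCENT_DOF_5:
    print("Reject H0: the die does not look fair.")
else:
    print("Fail to reject H0: no evidence against a fair die.")
```

The test of independence uses the same sum; only the way the expected counts are obtained changes.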
Measure of Distribution (Skewness and Kurtosis)
Skewness and kurtosis are different statistical measures that describe the shape of a distribution.
“Skewness essentially is a commonly used measure in descriptive statistics that characterizes the
asymmetry of a data distribution, while kurtosis determines the heaviness of the distribution tails.”
1. Skewness:
 Skewness measures the asymmetry of a distribution.
 A positive skewness indicates a right-skewed distribution (tail on the right), while a
negative skewness indicates a left-skewed distribution (tail on the left).
2. Kurtosis:
While skewness describes the asymmetry of a distribution, kurtosis describes the weight of its tails
relative to a normal distribution; it is often also read, more loosely, as how peaked or flat the
distribution looks. The term comes from the Greek for "curved" or "arched" and was introduced by the
British mathematician Karl Pearson, who spent much of his career studying probability distributions.
High kurtosis indicates:
 Sharp peakedness in the distribution’s center.
 More values concentrated around the mean than in a normal distribution.
 Heavier tails because of a higher concentration of extreme values or outliers in the tails.
 Greater likelihood of extreme events.
On the other hand, low kurtosis indicates:
 Flat peak.
 Fewer values concentrated around the mean than in a normal distribution.
 Lighter tails.
 Lower likelihood of extreme events.
Depending on the degree, distributions have three types of kurtosis:
1. Mesokurtic distribution (kurtosis = 3, excess kurtosis = 0): perfect normal distribution or
very close to it.
2. Leptokurtic distribution (kurtosis > 3, excess kurtosis > 0): sharp peak, heavy tails
3. Platykurtic distribution (kurtosis < 3, excess kurtosis < 0): flat peak, light tails
 Positive excess kurtosis (leptokurtic) indicates heavier tails, while negative excess kurtosis
(platykurtic) indicates lighter tails compared to a normal distribution.
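The three categories can be illustrated with a small plain-Python sketch. It assumes the moment-based definition of kurtosis (fourth central moment divided by the squared second central moment, with no bias correction); the two samples and the helper names are made up for the example.

```python
def kurtosis(values):
    """Moment-based kurtosis m4 / m2**2 (equals 3 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2


def classify(values):
    """Label a sample using the sign of its excess kurtosis (kurtosis - 3)."""
    excess = kurtosis(values) - 3
    if excess > 0:
        return "leptokurtic"
    if excess < 0:
        return "platykurtic"
    return "mesokurtic"


# Illustrative samples (made up): evenly spread values vs. values with two extremes.
flat_sample = [1, 2, 3, 4, 5]
heavy_tailed_sample = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]

for name, sample in [("flat", flat_sample), ("heavy-tailed", heavy_tailed_sample)]:
    print(f"{name}: kurtosis = {kurtosis(sample):.3f} ({classify(sample)})")
```

Library functions may report excess kurtosis rather than kurtosis, so it is worth checking which convention a given function uses before comparing against 3 or 0.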
How to Calculate Skewness in Python

There are many ways to calculate skewness, but the simplest one is Pearson’s second
skewness coefficient, also known as median skewness:
skewness = \frac{3 (\bar{x} - \mathrm{median})}{s}
Let’s implement the formula manually in Python:

import numpy as np
import pandas as pd
import seaborn as sns

# Example dataset
diamonds = sns.load_dataset("diamonds")
diamond_prices = diamonds["price"]

mean_price = diamond_prices.mean()
median_price = diamond_prices.median()
std = diamond_prices.std()

# Pearson's second skewness coefficient: 3 * (mean - median) / standard deviation
skewness = (3 * (mean_price - median_price)) / std

print(
    f"The Pearson's second skewness score of diamond prices distribution is {skewness:.5f}"
)
# Output:
# The Pearson's second skewness score of diamond prices distribution is 1.15189
Another formula highly influenced by the works of Karl Pearson is the moment-based
formula to approximate skewness. It is more reliable and given as follows:
skewness = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3
Here:
 n represents the number of values in a distribution
 x_i denotes each data point
 \bar{x} is the mean and s is the standard deviation of the distribution
Let’s implement it in Python too:

def moment_based_skew(distribution):
    n = len(distribution)
    mean = np.mean(distribution)
    std = np.std(distribution)

    # Divide the formula into two parts
    first_part = n / ((n - 1) * (n - 2))
    second_part = np.sum(((distribution - mean) / std) ** 3)

    skewness = first_part * second_part
    return skewness

moment_based_skew(diamond_prices)
# Output: 1.618440289857168
If you don’t want to calculate skewness manually, you can use built-in methods
from pandas or scipy:

# Pandas version
diamond_prices.skew()
# Output: 1.618395283383529

# SciPy version
from scipy.stats import skew
skew(diamond_prices)
# Output: 1.6183502776053016

Explanation
• The code uses two different libraries, pandas and scipy, to calculate the skewness of a dataset.
• Skewness is a measure of the asymmetry of the probability distribution of a real-valued
random variable.
• The first part of the code calls the pandas 'skew' method on the 'diamond_prices' data.
• The output, 1.618395283383529, is positive, which indicates a right-skewed distribution.
• The second part of the code imports the 'skew' function from the 'scipy.stats' module.
• It then uses this function to calculate the skewness of the same 'diamond_prices' data.
• The output, 1.6183502776053016, is slightly different because pandas applies a sample-size
bias correction, while scipy.stats.skew by default (bias=True) does not.
Box and Whisker Plot
A Box and Whisker Plot (or simply Box Plot) is a graphical representation of the distribution of a
dataset. It provides a visual summary of the central tendency, spread, and skewness of the data.
Here are the key components of a Box Plot:
1. Box:
 The box represents the interquartile range (IQR), which is the range between the first
quartile (Q1) and the third quartile (Q3). It spans the middle 50% of the data.
 The length of the box indicates the spread or variability of the middle 50% of the
data.
 The line inside the box represents the median.
2. Whiskers:
 The whiskers extend from the box to the minimum and maximum values within a
specified range, often defined by a multiplier of the IQR.
 Whiskers show the range of the data, excluding outliers.
3. Outliers:
 Individual data points that fall outside the whiskers are considered outliers and are
plotted individually.
Example Problem
Using Box Plots to Compare Distributions:
Box Plots are useful for comparing the distributions of different datasets or groups. Here's how you
can interpret and use Box Plots for comparisons:
1. Central Tendency:
 Compare the medians of different Box Plots to assess the central tendency.
 If one median is higher than another, it suggests a higher central tendency in the
corresponding dataset.
2. Spread:
 Compare the lengths of the boxes to assess the spread or variability of the data.
 A longer box indicates greater variability in the middle 50% of the data.
3. Skewness:
 Observe the symmetry of the boxes. A skewed distribution may have one whisker
longer than the other.
4. Outliers:
 Identify and compare the presence of outliers in different datasets.
 Outliers can provide insights into extreme values or data points that deviate
significantly from the norm.
Other Statistical Graphs:
1. Histogram:
 A graphical representation of the distribution of a continuous dataset. It consists of
bars where the area of each bar corresponds to the frequency of observations within
a certain range.
2. Scatter Plot:
 Used to display the relationship between two continuous variables. Each point on
the plot represents an observation with coordinates (x, y).
3. Line Chart:
 Shows the relationship between two variables with continuous data points
connected by lines. It is often used to depict trends over time.
4. Bar Chart:
 Represents categorical data with rectangular bars. The length of each bar
corresponds to the frequency or proportion of observations in each category.
5. Pie Chart:
 Displays the proportion of a whole that each category represents. It is suitable for
categorical data where the parts contribute to the whole.
Choosing the appropriate graph depends on the nature of the data and the specific goals of the
analysis. Different graphs highlight different aspects of the data distribution and relationships.
Problem:
Question (16 Marks):
1. Covariance (8 Marks):
 Given two variables, X and Y, with the following data points:
X 1 2 3 4 5
Y 2 4 5 4 5
a. Calculate the mean of X and Y.
b. Compute the covariance between X and Y.
c. Interpret the sign of the covariance in the context of the relationship between X and Y.
d. Discuss the limitations of covariance as a measure of the relationship.
2. Correlation Coefficient (8 Marks):
 Using the same data for X and Y:
a. Calculate the correlation coefficient between X and Y.
b. Interpret the magnitude and sign of the correlation coefficient.
c. Compare and contrast the correlation coefficient with covariance, highlighting advantages and
disadvantages.
d. Discuss the implications of a correlation coefficient close to +1.
Answer:
1. Covariance (8 Marks):
Correlation Coefficient (8 Marks):
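The numerical parts of both questions can be checked with a short script. This is a minimal sketch that assumes NumPy is available and uses the sample convention (division by n - 1) for both the covariance and the standard deviations; with the population convention the covariance changes but r does not.

```python
import numpy as np

# Data from the question.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

mean_x, mean_y = x.mean(), y.mean()

# Sample covariance: sum of cross-deviations divided by n - 1 (assumed convention).
cov_xy = np.sum((x - mean_x) * (y - mean_y)) / (len(x) - 1)

# Pearson r: covariance divided by the product of the sample standard deviations.
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(f"mean of X = {mean_x}, mean of Y = {mean_y}")
print(f"cov(X, Y) = {cov_xy}")
print(f"r = {r:.4f}")
```

Under these conventions the means are 3 and 4, the sample covariance is 1.5, and r is about 0.77: the sign says Y tends to rise with X, and the magnitude of r indicates a fairly strong linear relationship.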

Conclusion:
This problem assesses the understanding and application of covariance and correlation coefficient,
covering calculations, interpretation, and comparing the two measures. It also delves into the
limitations and implications of these statistical measures.
Problem:
1. Chi-Square (4 Marks):
 A researcher conducted a survey to investigate the relationship between gender and
voting preference in a local election. The data is summarized in the contingency table
below:
              Male   Female
Candidate A    30      20
Candidate B    25      35
Candidate C    15      25
Apply the Chi-Square test to determine if there is a significant association between gender
and voting preference.
2. Measures of Distribution (Skewness and Kurtosis) (4 Marks):
 You are given a dataset representing the daily returns of a stock over the past year.
Calculate the skewness and kurtosis of the dataset and interpret the results.
3. Box and Whisker Plots (4 Marks):
 Consider two classes (Class A and Class B) with the following exam scores: Class A
(78, 85, 88, 90, 92) and Class B (75, 80, 85, 92, 95). Create box and whisker plots for
each class and compare the distributions.
4. Other Statistical Graphs (4 Marks):
 Using the dataset of monthly sales for a retail business, create a histogram to
represent the distribution of sales. Additionally, create a scatter plot to explore the
relationship between advertising spending and monthly sales.
Answer:
1. Chi-Square (4 Marks):
The Chi-Square statistic is calculated as follows:
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
where O_i is the observed frequency, E_i is the expected frequency, and the sum is taken over all
categories or bins.
 Applying this to the given data, we can calculate the Chi-Square statistic and
compare it to the critical value to determine if there is a significant association.
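One way to carry out this calculation for the gender and voting table is sketched below, assuming NumPy is available. The expected counts follow from the independence assumption, and the critical value is the usual table value for 2 degrees of freedom at an assumed 5% significance level.

```python
import numpy as np

# Observed counts from the question: rows = Candidate A, B, C; columns = Male, Female.
observed = np.array([
    [30, 20],
    [25, 35],
    [15, 25],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count under independence: (row total * column total) / grand total.
expected = np.outer(row_totals, col_totals) / grand_total

chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Critical value for dof = 2 at the 5% level, from a standard chi-square table.
CRITICAL_5_PERCENT_DOF_2 = 5.991

print(f"chi-square = {chi_square:.3f}, dof = {dof}")
print("significant at 5%:", chi_square > CRITICAL_5_PERCENT_DOF_2)
```

For this table the statistic comes out at about 5.52 with 2 degrees of freedom, just below 5.991, so the association between gender and voting preference is not significant at the 5% level.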
2. Measures of Distribution (Skewness and Kurtosis) (4 Marks):

 Use the formulas for skewness and kurtosis to calculate the values for the daily
returns dataset. Interpret the skewness and kurtosis values in the context of the
stock's risk and distribution characteristics.
3. Box and Whisker Plots (4 Marks):
 For each class, calculate the quartiles, interquartile range (IQR), and plot the box and
whisker plots. Compare the positions of medians, spread of data, and identify any
outliers.
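The quartiles for the two classes can be computed as sketched below in plain Python. The helper five_number_summary is hypothetical, and it assumes quartiles are taken as the medians of the lower and upper halves of the sorted data; other quartile conventions give slightly different values.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles taken as medians of the
    lower and upper halves (the middle value is excluded when the count is odd)."""
    data = sorted(values)
    half = len(data) // 2
    return (data[0], median(data[:half]), median(data), median(data[-half:]), data[-1])


class_a = [78, 85, 88, 90, 92]
class_b = [75, 80, 85, 92, 95]

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    low, q1, med, q3, high = five_number_summary(scores)
    print(f"{name}: min={low}, Q1={q1}, median={med}, Q3={q3}, max={high}, IQR={q3 - q1}")
```

Under this convention Class A has the higher median (88 against 85) and the smaller IQR (9.5 against 16), so its scores are both higher on average and more tightly grouped.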
4. Other Statistical Graphs (4 Marks):
 Create a histogram to represent the distribution of monthly sales, providing insights
into the sales patterns. Additionally, create a scatter plot to visualize the relationship
between advertising spending and monthly sales, indicating whether there is a
correlation.
Conclusion:
This comprehensive problem requires knowledge and application of various statistical concepts, their
calculations, and interpretation. It assesses the ability to use statistical tools for analyzing
relationships, distribution characteristics, and graphical representations in real-world scenarios.
Problem:
A company conducted a survey to investigate the relationship between job satisfaction and
department in a large organization. The data is summarized in the contingency table below:
                     Satisfied   Neutral   Dissatisfied
HR Department           45         20          15
Finance Department      30         25          25
IT Department           40         15          20
1. Chi-Square Test (10 Marks):
 Apply the Chi-Square test to determine if there is a significant association between
job satisfaction and department.
 Clearly state the null hypothesis (H0) and the alternative hypothesis (Ha).
 Calculate the expected frequencies for each cell.
 Compute the Chi-Square statistic and determine whether to reject the null
hypothesis.
 Provide a conclusion based on the results.
2. Interpretation (6 Marks):
 Interpret the Chi-Square test results in the context of the survey.
 Discuss the practical implications of any significant association found.
 Reflect on any limitations of the Chi-Square test for this type of analysis.
Answer:
1. Chi-Square Test (10 Marks):
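A possible way to work through this part with SciPy is sketched below. It assumes SciPy and NumPy are installed and a 5% significance level; scipy.stats.chi2_contingency returns the statistic, the p-value, the degrees of freedom, and the table of expected frequencies.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the question.
# Rows: HR, Finance, IT; columns: Satisfied, Neutral, Dissatisfied.
observed = np.array([
    [45, 20, 15],
    [30, 25, 25],
    [40, 15, 20],
])

# H0: job satisfaction is independent of department.
# Ha: job satisfaction is associated with department.
chi_square, p_value, dof, expected = chi2_contingency(observed)

ALPHA = 0.05  # assumed significance level

print("expected frequencies:")
print(np.round(expected, 2))
print(f"chi-square = {chi_square:.3f}, dof = {dof}, p-value = {p_value:.3f}")
print("reject H0" if p_value < ALPHA else "fail to reject H0")
```

With these counts the statistic is about 7.74 on 4 degrees of freedom (p-value about 0.10), so at the 5% level the null hypothesis of independence is not rejected.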

2. Interpretation (6 Marks):
a. Interpret the Chi-Square test results:
 Based on the calculated Chi-Square statistic and critical value, determine whether
the association is statistically significant.
b. Discuss the practical implications:
 Interpret the findings in terms of how job satisfaction is associated with different
departments.
 Provide insights into areas that might need attention or improvement.
c. Reflect on limitations:
 Chi-Square tests assume independence of observations, so discuss any factors that
might affect the results, such as the sample size or potential confounding variables.
Conclusion:
This problem evaluates the ability to apply the Chi-Square test, interpret results, and reflect on the
practical implications and limitations of the analysis in a real-world scenario.
What is Box and Whisker Plot?
A Box and Whisker Plot is a visual representation of the five-number summary. It is also
called a Box Plot. It consists of a rectangular “box” and two “whiskers.” A Box and
Whisker Plot contains the following parts:
 Box: The box in the plot spans from the first quartile (Q1) to the third quartile (Q3). This box
contains the middle 50% of the data and represents the interquartile range (IQR). The width
of the box provides insights into the data’s spread.
 Whiskers: The whiskers extend from the minimum value to Q1 and from Q3 to the maximum
value. They signify the range of the data, excluding potential outliers. The whiskers can vary
in length, indicating the data’s skewness or symmetry.
 Median Line: A line within the box represents the median (Q2). It divides the data into two
halves, revealing the central tendency.
 Outliers: Individual data points lying beyond the whiskers are considered outliers and are
often plotted as individual points.
Uses of Box and Whisker Plot
1. Visualising Data Distribution: Box plots are excellent tools for gaining a visual
understanding of the distribution of a dataset. They give a quick overview of the central tendency,
spread, and shape of the data distribution, helping to identify whether the data is symmetric,
skewed, or contains outliers.
2. Comparing Distributions: Box plots are useful for comparing the distributions of several
datasets side by side. This is especially valuable when you need to compare the characteristics of
different groups, populations, or categories, for instance comparing the grades of students from
different schools or regions, or examining the performance of different products or treatments in
a study.
3. Assessing Skewness: By looking at the relative lengths and positions of the box and whiskers, one
can assess the skewness of the data. A longer tail on one side of the box suggests skewness in
that direction.
4. Data Exploration: Box plots can serve as initial tools for data exploration. They give a compact
summary of a dataset’s key characteristics, helping to decide on appropriate data analysis
methods or transformations.
5. Statistical Analysis: Box plots are frequently used alongside statistical tests and analyses. They
help visualise the distribution of data before conducting hypothesis tests or comparing the means
of different groups.
6. Quality Control: In manufacturing and quality control processes, box plots are used to monitor
variation in product specifications and identify defects or deviations from quality standards.
They help recognise when a process is operating within acceptable limits or when it needs
adjustment.
7. Decision-Making: Box plots give decision-makers a clear and intuitive way to assess data
characteristics. They are used in business, finance, and healthcare to make informed decisions
based on data summaries.
8. Risk Assessment: In fields like finance and insurance, box plots can be used to visualise the risk
associated with different investments or insurance policies. They help stakeholders understand the
potential variability in returns or losses.
9. Public Health and Epidemiology: Box plots are used to visualise and compare health-related
data, for example the distribution of disease rates among different regions or demographic
groups.
10. Environmental Sciences: Box plots can be applied to analyse environmental data, for example air
quality measurements or water pollution levels, and to assess variation across time or locations.
When to Use Box and Whisker Plot
Box and Whisker Plots are particularly useful in the following situations:
1. Comparing Scores: When comparing the performance of students from different classes or
schools, a box plot helps assess the distribution of test scores in each group and shows whether
one group outperforms the others.
2. Analysing Employee Salaries: When examining the salaries of employees in an organisation, one
can use box plots to compare the salary distributions across departments or job roles, helping to
identify disparities or outliers.
3. Evaluating Product Quality: In manufacturing, to monitor the quality of a product, one can
make box plots of measurements taken from different production runs. This reveals variation and
whether the product meets quality standards.
4. Detecting Anomalies in Financial Data: When analysing financial data such as stock returns, one
can use box plots to identify outlier trading days or unusual price movements, which may indicate
significant events or errors in the data.
5. Comparing Patient Recovery Times: In healthcare, one can use box plots to compare the recovery
times of patients who received different treatments or surgical procedures. This can help
determine which treatment approach is more effective.
6. Assessing Marketing Campaigns: Marketers can use box plots to evaluate the effect of different
advertising campaigns by comparing metrics such as click-through rates or conversion rates across
campaign variants.
7. Monitoring Air Quality: Environmental scientists and agencies use box plots to visualise air
quality data, comparing pollutant concentrations across different monitoring stations or regions.
8. Assessing Investment Portfolios: Financial analysts can use box plots to compare the
distributions of returns for different investment portfolios, helping investors understand risk
and return trade-offs.
9. Comparing Housing Prices: Real estate professionals can use box plots to compare house prices
in different neighbourhoods or cities, giving insight into housing market variation.
10. Analysing Crime Rates: Law enforcement agencies can use box plots to compare crime rates
across regions or over time, in order to allocate resources and prioritise interventions.
How to Make Box and Whisker Plot
The following steps are involved in making a Box and Whisker Plot:
1. Collect Data: Gather the dataset that needs to be visualised.
2. Calculate Quartiles: Find the first quartile (Q1), third quartile (Q3), and median (Q2) of
the given data.
3. Determine Whiskers: Find the minimum and maximum values, excluding outliers.
4. Plot the Box and Whiskers: Draw a box from Q1 to Q3, a line inside the box at Q2, and whiskers
from the minimum to Q1 and from Q3 to the maximum.
5. Identify Outliers: Plot any data points outside the whiskers as individual points.
Example of Box and Whisker Plot
Example:
Suppose we have a dataset representing the test scores of a group of students: Data (test scores): 78,
85, 90, 92, 95, 96, 97, 98, 99, 100, 105, 110, 120.
Solution:
Step 1: Collect Data
Dataset: 78, 85, 90, 92, 95, 96, 97, 98, 99, 100, 105, 110, 120
Step 2: Calculate Quartiles

To create a Box and Whisker Plot, we need to calculate the quartiles (Q1 and Q3) and the median
(Q2).
-Q1 (the first quartile) is the median of the lower half of the data (78, 85, 90, 92, 95, 96) = 91
-Q2 (the median) is the median of the entire dataset = 97
-Q3 (the third quartile) is the median of the upper half of the data: (98, 99, 100, 105, 110, 120) =
102.5
Step 3: Determine Whiskers
To find the whiskers, calculate the minimum and maximum values within the dataset, excluding
potential outliers.
Minimum = 78, Maximum = 120
The required five-number summary is 78, 91, 97, 102.5, 120.
Step 4: Plot the Box and Whiskers
Now, we can create the Box and Whisker Plot:
-Draw a box from Q1 (91) to Q3 (102.5).
-Draw a line inside the box at Q2 (97).
-Extend the left whisker from the minimum (78) to Q1 (91).
-Extend the right whisker from Q3 (102.5) to the maximum (120).
Step 5: Identify Outliers
Any data points that fall outside the whiskers are considered outliers. Because the whiskers here
extend to the minimum and maximum, no points are plotted as outliers. (Under the common 1.5 × IQR
rule the upper fence would be 102.5 + 1.5 × 11.5 = 119.75, so the score of 120 would be flagged as
an outlier.) This Box and Whisker Plot gives a visual summary of the test scores, showing the median
(Q2) at 97 and the interquartile range (IQR) from Q1 to Q3 (91 to 102.5). It effectively illustrates
the central tendency, spread, and distribution of the dataset.
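The quartiles in this example can be reproduced with a few lines of plain Python. The sketch also applies the 1.5 × IQR fence rule as an assumed convention for flagging outliers, which is stricter than drawing the whiskers to the minimum and maximum.

```python
from statistics import median

scores = sorted([78, 85, 90, 92, 95, 96, 97, 98, 99, 100, 105, 110, 120])

# Split around the median. With an odd count the middle value is left out of
# both halves, matching the worked example.
half = len(scores) // 2
lower_half = scores[:half]
upper_half = scores[-half:]

q1 = median(lower_half)
q2 = median(scores)
q3 = median(upper_half)
iqr = q3 - q1

# Tukey's rule (an assumed convention): points beyond 1.5 * IQR from the box
# are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print("five-number summary:", (min(scores), q1, q2, q3, max(scores)))
print("fences:", (lower_fence, upper_fence))
print("outliers under the 1.5 * IQR rule:", outliers)
```

The five-number summary matches the hand calculation (78, 91, 97, 102.5, 120). With the fence rule, the score of 120 lies just above the upper fence of 119.75, so plotting software that uses this rule would draw it as an individual point.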
Internal Systems:
1. Enterprise Resource Planning (ERP) Systems:
 ERP systems integrate internal business processes, including accounting, human resources,
and inventory management.
 These systems provide a centralized source of data for various departments within an
organization.
2. Customer Relationship Management (CRM) Systems:

 CRM systems store and manage customer-related data, including interactions, preferences,
and transaction history.
 Valuable for businesses to understand customer behavior and improve customer satisfaction.
3. Operational Databases:
 These databases store transactional data generated during daily operations, such as sales,
purchases, and inventory movements.
 Examples include MySQL, Oracle, and Microsoft SQL Server databases.
4. Data Warehouses:
 Data warehouses consolidate and organize data from various sources for reporting and
analysis.
 They enable organizations to have a unified view of their data for strategic decision-making.
5. In-House Applications:
 Custom-built applications specific to an organization's needs can generate and store data.
 These applications may include project management tools, internal communication
platforms, or proprietary software.
External Systems:
1. External APIs (Application Programming Interfaces):
 Many organizations offer APIs that allow external systems to access their data.
 This can include data from social media platforms, financial institutions, or weather services.
2. Cloud-Based Services:
 Cloud platforms provide services where data can be stored and accessed remotely.
 Services like Amazon S3, Google Cloud Storage, and Microsoft Azure offer scalable and
flexible data storage solutions.
3. Open Data Sources:
 Governments and organizations often make datasets publicly available for research and
analysis.
 Examples include data.gov, World Bank datasets, and various scientific research databases.
4. Web Scraping:
 Web scraping involves extracting data from websites.
 External data acquisition can include information from competitor websites, news articles, or
any publicly available online content.
5. Vendor and Partner Data:
 Data acquired from external vendors, suppliers, or business partners can provide valuable
insights.
 This may include market trends, industry reports, or collaborative research data.
6. Social Media Platforms:
 Social media data, including user interactions, sentiment analysis, and trending topics, can be
acquired for marketing and brand analysis.
 APIs from platforms like Twitter, Facebook, and Instagram provide access to their data.
7. Sensor Data:
 For industries like manufacturing or IoT (Internet of Things), sensor data from external
devices is crucial.
 This can include temperature sensors, GPS data, or other telemetry data.
In the context of data acquisition, organizations often employ a combination of internal and external
data sources to create a comprehensive and diverse dataset for analysis and decision-making. The
integration of data from different sources is a key aspect of building a robust data ecosystem within
an organization.
Web APIs (Application Programming Interfaces):
1. Definition:
 Web APIs are sets of rules and protocols that allow different software applications to
communicate with each other.
 They enable the exchange of data and functionalities between different systems over the
internet.
2. Types of Web APIs:
 RESTful APIs: Representational State Transfer APIs are widely used for their simplicity and
scalability.
 SOAP APIs: Simple Object Access Protocol APIs use XML as a format for data exchange.
 JSON-RPC and XML-RPC APIs: These allow remote procedure calls using JSON or XML.
3. Data Acquisition through Web APIs:

 Organizations expose their data and functionalities through APIs, allowing external systems
to access and use the data.
 API endpoints often return data in formats like JSON or XML.
4. Examples of Web APIs:
 Twitter API: Provides access to Twitter's data, allowing developers to retrieve tweets, user
information, and trends.
 Google Maps API: Enables integration of mapping and location-based services.
 GitHub API: Allows developers to access information about repositories, issues, and users on
GitHub.
5. Authentication and Rate Limits:
 Web APIs often require authentication to control access.
 Rate limits are imposed to prevent abuse and ensure fair usage of the API.
6. Data from Social Media Platforms:
 Social media platforms like Facebook, Instagram, and LinkedIn provide APIs for accessing
user data, posts, and engagement metrics.
 This data is valuable for social media analytics and marketing.
Open Data Sources:
1. Definition:
 Open data refers to data that is freely available, accessible, and can be used, modified, and
shared by anyone.
 Governments, organizations, and institutions often release data openly for public use.
2. Government Open Data Portals:

 Many governments maintain open data portals where they publish datasets related to
demographics, economics, health, and more.
 Examples include data.gov (U.S.), data.gov.uk (UK), and data.gov.in (India).
3. International Organizations:
 International organizations like the World Bank and the United Nations release open datasets
covering global development indicators, economic data, and demographic information.
4. Scientific Research Databases:
 Universities and research institutions often share research data openly.
 Databases like PubMed, arXiv, and Kaggle datasets are popular sources for researchers and
data scientists.
5. Non-Profit Organizations:
 Non-profit organizations may release datasets related to their missions.

 Examples include environmental organizations sharing climate data or healthcare
organizations sharing public health statistics.
6. OpenStreetMap:
 OpenStreetMap provides open and collaborative mapping data that can be used for various
applications, including geographic information systems (GIS).
7. Benefits of Open Data:
 Encourages transparency and accountability.
 Fosters innovation as developers, researchers, and businesses can leverage diverse datasets.
 Supports data-driven decision-making in various domains.
8. Challenges:
 Ensuring data quality and reliability.
 Addressing privacy concerns when dealing with sensitive information.
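The data quality challenge listed above can be made concrete with a few basic checks run before an open dataset is analysed. The sketch below assumes the dataset has already been loaded into a pandas DataFrame; the function names (quality_report, duplicate_rows) and the sample columns are hypothetical and chosen only for illustration.

```python
import pandas as pd


def quality_report(df):
    """Summarise basic quality signals for each column of a DataFrame."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),                    # inferred column type
        "missing": df.isna().sum(),                        # count of missing values
        "missing_pct": (df.isna().mean() * 100).round(1),  # share of missing values
        "unique": df.nunique(),                            # distinct non-missing values
    })


def duplicate_rows(df):
    """Count rows that exactly repeat an earlier row."""
    return int(df.duplicated().sum())


# Small made-up sample standing in for a downloaded open dataset
sample = pd.DataFrame({
    "district": ["A", "B", "B", None],
    "population": [100, 250, 250, 80],
})
report = quality_report(sample)
print(report)
print("duplicate rows:", duplicate_rows(sample))
```

Checks like these do not prove that a dataset is reliable, but they quickly flag missing values and repeated records that would otherwise distort later statistics.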
In summary, Web APIs and Open Data Sources play crucial roles in data acquisition, offering a wealth
of information for diverse applications, from business analytics to research and development.
Integrating data from these sources enriches the depth and breadth of datasets available for analysis
and decision-making.
To interact with Web APIs and use Open Data Sources for data acquisition in programs, you
typically use a programming language such as Python or R. In Python, the requests library handles
API requests and data retrieval, and pandas handles data manipulation. Keep in mind that some
services require an API key.
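A minimal sketch of this workflow with requests and pandas follows. The helper names (fetch_json, records_to_frame, csv_text_to_frame) are hypothetical, and the bearer-token header is an assumption, since each service documents its own way of passing an API key. The sample records and CSV text are made up so that the data handling can be run without network access.

```python
import io

import pandas as pd
import requests


def fetch_json(url, params=None, api_key=None):
    """GET a JSON endpoint and return the decoded body.

    Assumption: when a key is needed, the service accepts it as a bearer token.
    """
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    response = requests.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()


def records_to_frame(records):
    """Flatten a list of JSON objects into a DataFrame (nested keys become a.b)."""
    return pd.json_normalize(records)


def csv_text_to_frame(text):
    """Parse CSV text, such as a file downloaded from an open data portal."""
    return pd.read_csv(io.StringIO(text))


# Made-up records shaped like a typical JSON API response
sample_records = [
    {"name": "repo-a", "owner": {"login": "alice"}, "stars": 3},
    {"name": "repo-b", "owner": {"login": "bob"}, "stars": 5},
]
repos = records_to_frame(sample_records)
print(repos)

# Made-up CSV text standing in for an open data download
sample_csv = "city,population\nA,10\nB,20\n"
cities = csv_text_to_frame(sample_csv)
print(cities["population"].sum())
```

With network access, a call such as fetch_json("https://api.github.com/users/octocat/repos") should return a list of repository records from the public GitHub API, which records_to_frame can turn into a table. Unauthenticated calls to that API are rate limited.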
