
CRASH COURSE DATA SCIENCE - BEGINNER LEVEL

DATA COLLECTION
1. Data collection is the process of gathering relevant information from various
sources to analyze and derive insights.
2. In data science, the quality of collected data directly impacts the accuracy of the
resulting analysis and models.
3. A well-defined sampling strategy ensures that collected data is representative of
the larger population.
4. Surveys, interviews, and questionnaires are common methods for collecting
primary data directly from individuals.
5. Web scraping involves extracting information from websites and is often used to
collect data from online sources.
6. Sensor networks and Internet of Things (IoT) devices contribute to the collection
of real-time data in various applications.
7. Secondary data refers to data collected by someone else for a different purpose
but can still be useful for analysis.
8. The bias present in collected data can lead to skewed insights and inaccurate
conclusions.
9. Data curation involves organizing, cleaning, and preparing collected data for
analysis.
10. The process of data collection should follow ethical guidelines to ensure privacy
and respect for individuals' rights.
DESCRIPTIVE STATISTICS
1. Descriptive statistics summarize and describe the main features of a dataset.
2. Descriptive statistics can be used to summarize both categorical and numerical
variables.
3. Range is a measure of dispersion that represents the difference between the
maximum and minimum values in a dataset.
4. The range is not a measure of central tendency; it describes spread, not the middle value
of a dataset.
5. The interquartile range (IQR) is a measure of spread that represents the range
between the first quartile (Q1) and the third quartile (Q3).
6. The mode is the value that occurs most frequently in a dataset.
7. The median is less affected by outliers than the mean.
8. The median is less influenced by extreme values in the dataset, making it a more
robust measure of central tendency compared to the mean.
9. Standard deviation measures the average distance of values from the mean.
10. Standard deviation quantifies the dispersion or spread of data by measuring the
average distance between each data point and the mean.
11. Variance is not the square root of the standard deviation; rather, the standard deviation is the square root of the variance.
12. Variance is the square of the standard deviation.
13. Skewness is a measure of the symmetry of a distribution.
14. Skewness indicates the extent to which a distribution is skewed or asymmetrical.
15. Correlation measures the strength and direction of the linear relationship
between two numerical variables.
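
As a hedged illustration of these measures, the short Python sketch below uses pandas and SciPy on a small invented list of exam scores and study hours (the data and variable names are placeholders, not part of these notes).

```python
import pandas as pd
from scipy import stats

# Hypothetical exam scores; any small numeric sample works here.
scores = pd.Series([55, 60, 62, 65, 65, 70, 72, 75, 80, 95])

print("Mean:", scores.mean())                  # average value
print("Median:", scores.median())              # middle value, robust to outliers
print("Mode:", scores.mode().tolist())         # most frequent value(s)
print("Range:", scores.max() - scores.min())   # maximum minus minimum
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
print("IQR:", q3 - q1)                         # spread of the middle 50%
print("Std dev:", scores.std())                # typical distance from the mean
print("Variance:", scores.var())               # square of the standard deviation
print("Skewness:", stats.skew(scores))         # asymmetry of the distribution

# Correlation between two hypothetical numeric variables.
hours = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 7, 9])
print("Correlation:", scores.corr(hours))      # strength and direction of the linear relationship
```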
EDA
1. EDA involves summarizing and visualizing data to gain insights and understand
patterns.
2. EDA is typically performed after data cleaning and preprocessing to ensure the
data is in a suitable format for analysis.
3. EDA includes identifying outliers (extreme values) and missing values in the
dataset, which can impact the validity of the analysis.
4. Descriptive statistics, such as mean, median, and standard deviation, are
commonly calculated during EDA to summarize the central tendency and
dispersion of the data.
5. EDA is a flexible and iterative process.
6. EDA can help detect relationships and correlations between variables, which can
provide valuable insights into the dataset.
7. The primary goal of EDA is to gain an understanding of the data rather than
formal hypothesis testing and statistical inference.
8. EDA can reveal potential data quality issues, such as inconsistent or erroneous
values, and identify data anomalies that require further investigation.
9. Graphical techniques, such as histograms, scatter plots, and box plots, are
commonly used in EDA to visualize the distribution, relationships, and outliers in
the data.
10. EDA is an ongoing process that is often revisited as new questions arise during analysis.
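
A minimal pandas sketch of a first EDA pass along these lines; the file name and columns are placeholders, not part of the original notes.

```python
import pandas as pd

# Hypothetical CSV; replace with your own dataset.
df = pd.read_csv("sales.csv")

print(df.head())                    # peek at the first rows
print(df.describe())                # mean, std, and quartiles for numeric columns
print(df.isna().sum())              # missing values per column
print(df.duplicated().sum())        # duplicate rows
print(df.corr(numeric_only=True))   # pairwise correlations between numeric variables
```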
DATA VISUALIZATIONS
1. Data visualization is the presentation of data in a graphical or pictorial format.
2. Bar chart, line chart, and pie chart are some of the common types of visualization
charts.
3. A line chart is a data visualization technique suitable for displaying trends over
time.
4. A heat map is used to represent distribution of values with colors.
5. A tree map is used to show hierarchical data using nested rectangles.
6. A box plot is used to show the distribution of data.
7. A choropleth map is used to represent geographic data with color variations.
8. The points on the scatter plot show the relationship between two variables.
9. In a bar chart, y-axis shows the dependent variable while x-axis shows the
independent variable.
10. Python is one of the most commonly used programming languages for creating
interactive data visualizations.
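
A small matplotlib sketch, with invented monthly sales numbers, showing a bar chart and a line chart of the kind described above.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 170, 180]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(months, sales)               # bar chart: category (x-axis) vs. value (y-axis)
ax1.set_title("Monthly sales (bar)")

ax2.plot(months, sales, marker="o")  # line chart: trend over time
ax2.set_title("Monthly sales (line)")

plt.tight_layout()
plt.show()
```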

DATA CLEANING
1. Imputation technique is used to fill in missing values.
2. Outlier detection is used to identify and handle unusual data points.
3. Standardization is used to bring all variables to a common scale.
4. Deduplication is used to identify and handle duplicate records.
5. Regular Expressions are used for pattern matching and extraction.
6. One-Hot Encoding is used for handling categorical variables.
7. Scaling is used to re-scale numerical variables.
8. Trimming is used to remove unnecessary white spaces.
9. Mean imputation involves replacing missing values with the mean of the variable.
10. Forward filling involves filling missing values with the value before them.
11. Interpolation involves estimating missing values based on the adjacent values.
12. Deleting rows involves removing rows with missing values.
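
A hedged pandas sketch tying several of these techniques together on an invented table; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["  Ali ", "Sara", "Sara", "Omar"],
    "age": [25, np.nan, np.nan, 40],
    "city": ["Lahore", "Karachi", "Karachi", None],
})

df["name"] = df["name"].str.strip()              # trimming: remove extra white space
df = df.drop_duplicates()                        # deduplication
df["age"] = df["age"].fillna(df["age"].mean())   # mean imputation of missing values
df["city"] = df["city"].ffill()                  # forward filling
df = pd.get_dummies(df, columns=["city"])        # one-hot encoding of a categorical variable

# Standardization: bring a numeric column to mean 0 and standard deviation 1.
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()
print(df)
```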
MACHINE LEARNING
1. The two main categories of machine learning models are supervised and
unsupervised.
2. Labeled data in supervised learning provides correct answers for training the
model to learn relationships between input features and output labels.
3. Precision is the ratio of correctly predicted positive observations to the total
predicted positives, while recall is the ratio of correctly predicted positive
observations to the total actual positives.
4. Accuracy might not be suitable for imbalanced datasets because it can be
dominated by the majority class and may not reflect the true model performance.
5. Cross-validation assesses a machine learning model's performance by dividing
the dataset into subsets, training/evaluating the model on different
combinations, and providing insights into its generalization capability.
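
The scikit-learn sketch below, run on a synthetic dataset, is one hedged way to see cross-validation, precision, and recall in practice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic labeled data: features X and binary labels y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train and evaluate on different splits of the data.
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy per fold:", scores)

# Precision and recall on a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))
```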

Data mining is the computational process of extracting knowledge from large datasets
through methods at the intersection of artificial intelligence, machine learning, statistics, and
database systems. This associative approach aims to discover meaningful patterns, and it
could be more appropriately named "knowledge mining" to emphasize the extraction of valuable
insights from data. The overarching goal is to transform raw data into an understandable
structure, facilitating further analysis and informed decision-making.

Tasks of Data Mining


1. Anomaly Detection:
● It is the task of identifying unusual data records that might be interesting or might be errors
requiring investigation.
● Example: In a credit card transaction dataset, if someone usually spends $50 per day, a sudden
transaction of $500 might be an anomaly, signaling potential fraud.
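
A minimal sketch of one simple approach (a z-score rule on invented daily spending); real fraud detection uses richer models.

```python
import numpy as np

# Hypothetical daily credit card spending in dollars.
spending = np.array([48, 52, 50, 47, 55, 51, 500])

mean, std = spending.mean(), spending.std()
z_scores = (spending - mean) / std

# Flag transactions more than 2 standard deviations from the mean.
anomalies = spending[np.abs(z_scores) > 2]
print("Possible anomalies:", anomalies)   # the $500 transaction stands out
```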

2. Association Rule Learning:


● It is the task of searching for relationships between variables to understand how they are
connected.
● Example: In a supermarket, data might show that customers buying diapers often purchase beer
as well. This association can be used for targeted marketing promotions.
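
A hedged sketch of counting how often two items co-occur in hypothetical shopping baskets; dedicated libraries implement full algorithms such as Apriori.

```python
# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]

# Support of {diapers, beer}: fraction of baskets containing both items.
both = sum(1 for b in baskets if {"diapers", "beer"} <= b)
support = both / len(baskets)

# Confidence of the rule diapers -> beer: of the baskets with diapers, how many also have beer.
diapers = sum(1 for b in baskets if "diapers" in b)
confidence = both / diapers

print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.75 and 1.00 here
```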

3. Clustering:
● It is the task of discovering groups or structures in the data that are in some way similar,
without predefined categories.
● Example: Social media platforms grouping users based on similar interests or behaviors, creating
communities.
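
A minimal scikit-learn sketch that groups invented user activity data into clusters without predefined labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical user features: [posts per week, minutes online per day].
users = np.array([[1, 10], [2, 15], [1, 12],
                  [20, 120], [22, 130], [25, 110]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(users)
print("Cluster labels:", kmeans.labels_)          # which group each user falls into
print("Cluster centers:", kmeans.cluster_centers_)
```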

4. Classification:
● It is the task of generalizing known patterns so they can be applied to new data.
● Example: An email program learning from labeled emails (spam or not) to automatically classify
new emails as either "legitimate" or "spam".
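
As a hedged illustration, the sketch below trains a Naive Bayes classifier on a few invented emails and then labels a new one; a real spam filter would use far more data and features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled emails: 1 = spam, 0 = legitimate.
emails = ["win a free prize now", "cheap pills free offer",
          "meeting agenda for monday", "project report attached"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)      # turn text into word-count features

model = MultinomialNB().fit(X, labels)    # learn from the labeled examples

new_email = vectorizer.transform(["free prize offer"])
print("Predicted label:", model.predict(new_email)[0])  # 1 means spam
```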

5. Regression:
● It is the task of finding a function that models the data with the least error.
● Example: Predicting house prices based on factors like square footage, number of bedrooms, and
location.
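
A hedged linear regression sketch with made-up house data, where square footage and bedroom count predict price.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [square feet, bedrooms] and their prices in USD.
X = np.array([[1200, 2], [1500, 3], [1800, 3], [2000, 4], [2500, 5]])
y = np.array([200_000, 250_000, 300_000, 350_000, 450_000])

model = LinearRegression().fit(X, y)            # fit the least-error linear function
print("Predicted price:", model.predict([[1600, 3]])[0])
```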

6. Summarization:
● It is the task of providing a more concise representation of the dataset, often through
visualization and reports.
● Example: Creating a bar chart to summarize monthly sales data, making it easy to see trends over
time.

Data Mining Architecture:


1. Knowledge Base:
● This is the foundational element that incorporates domain knowledge to guide the search
and assess interesting patterns. This includes concept hierarchies, user beliefs,
metadata, and other knowledge.
● Example: Organizing customer preferences into levels for targeted marketing, like "Basic,"
"Intermediate," and "Advanced."
2. Data Mining Engine:
● The core system with functional modules for tasks like association analysis,
classification, and cluster analysis.
● Example: Identifying associations, such as customers who buy laptops also purchasing
laptop accessories.
3. Pattern Evaluation Module:
● A component utilizing importance measures to assess pattern value and guide the
search. It may use thresholds for filtering patterns.
● Example: Setting a threshold to only consider sales patterns that show a significant
increase.
4. User Interface:
● The interface facilitates communication between users and the system, allowing queries,
result exploration, and visualization.
● Example: Allowing a user to query the system for trends in monthly sales and visually
presenting the findings.

This architecture combines domain knowledge, analytical modules, evaluation criteria, and user
interaction to efficiently extract meaningful insights from data.

Data Mining Process:


1. State the Problem and Formulate the Hypothesis:
● Description: This step involves defining a problem and making an initial guess
about how things are related. It requires knowledge of the specific field.
● Example: If we're studying customer behavior in a store, the problem might be
understanding what factors influence buying decisions. The initial guess could be
that the time of day affects purchasing habits.

2. Collect the Data:


● Description: Data collection involves gathering information either through
planned experiments or by observing existing data. It's crucial to ensure the
collected data represents the real-world scenario.
● Example: In our store example, we might collect data on customer purchases
throughout the day, recording items bought and the time of purchase.

3. Preprocess the Data:


● Description: Before analyzing the data, certain steps are taken to clean and
organize it. This includes finding and handling unusual data points (outliers) and
ensuring that different types of information are treated equally.
● Example: We might identify and remove data that doesn't fit the usual shopping
patterns, like extremely high-value purchases. Additionally, we might scale or
adjust the way we represent different types of data to ensure fair analysis.
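
A minimal sketch of the two steps mentioned here, filtering extreme purchase amounts and scaling the columns, using pandas on invented store data.

```python
import pandas as pd

# Hypothetical purchases: amount in dollars and hour of the day.
df = pd.DataFrame({"amount": [20, 35, 50, 42, 5000], "hour": [9, 13, 18, 20, 11]})

# Outlier handling: keep amounts within 1.5 * IQR of the quartiles.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["amount"] >= q1 - 1.5 * iqr) & (df["amount"] <= q3 + 1.5 * iqr)]

# Scaling: put both columns on a 0-1 range so they are treated comparably.
df_scaled = (df - df.min()) / (df.max() - df.min())
print(df_scaled)
```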

4. Estimate the Model:


● Description: This step involves choosing and applying a method to find patterns
in the data. It's not always straightforward, and multiple models might be tested
before selecting the most suitable one.
● Example: Using a statistical method to analyze the shopping data, like looking for
patterns that show certain items are frequently bought together.

5. Interpret the Model and Draw Conclusions:


● Description: The results from the model need to make sense and be
understandable to be useful. Balancing accuracy with simplicity is crucial
because complex models might be accurate but hard to understand.
● Example: If the model shows that certain products are often bought together, this
information needs to be presented in a way that's clear and actionable for
decision-makers, like store managers.

Major issues in data mining

1. Mining Different Kinds of Knowledge in Databases:


● Challenge: Users have changing interests and may seek different types of
knowledge.
● Data mining needs to cover a broad range of discovery tasks to cater for diverse
user needs.

2. Interactive Mining of Knowledge:


● Challenge: The data mining process needs to be interactive to allow users to
refine search patterns based on results.
● Interactivity enhances the effectiveness of the data mining process by involving
users in focusing the search.

3. Incorporation of Background Knowledge:


● Challenge: Utilizing background knowledge to guide the discovery process and
express patterns at multiple levels of abstraction.
● Background knowledge enhances the interpretability and expression of
discovered patterns.

4. Presentation and Visualization of Data Mining Results:


● Challenge: Expressing discovered patterns in high-level languages and visual
representations that are easily understandable.
● Effective communication of results is crucial for users to comprehend and utilize
the insights gained from data mining.

5. Handling Noisy or Incomplete Data:


● Challenge: Developing data cleaning methods to handle noise and incomplete
data, ensuring the accuracy of discovered patterns.
● Without proper data cleaning, the accuracy of discovered patterns may be
compromised.

6. Pattern Evaluation:
● Challenge: Assessing the interestingness of discovered patterns in terms of
representing common knowledge or lacking novelty.
● Patterns need to be interesting and valuable to users, either by providing new
insights or confirming existing knowledge.

These issues highlight the complexity of data mining tasks and the importance of
addressing various challenges to ensure the meaningful extraction and utilization of
knowledge from large datasets.

******************************* END ************************


Lecture

Statistics vs data mining

Types of variables

Dataset

Data preparation > Data cleaning


Population:
● The "population" in data mining refers to the entire universe or collection of
objects that are relevant to a specific application. It encompasses a broad range
of entities, from individuals to inanimate objects.
● Examples: The population could include people (alive or dead), hospital patients,
dogs in a city, train journeys between two locations, rocks on the moon, or even
web pages stored on the World Wide Web.

Sample:
● Given the often vast size of the population, a "sample" is a subset of this
universe that is accessible and used for analysis in data mining. It represents a
manageable portion from which we aim to extract information applicable to the
entire population.
● The sample is crucial as it allows for practical analysis without having to process
or examine the entire population. Insights gained from the sample are
extrapolated to make predictions about the larger dataset.

Statistics vs. Data Mining:


● Size of Dataset:
● Statistics: Often involves smaller datasets suitable for manual analysis.
● Data Mining: Deals with large datasets, sometimes in the terabytes,
making manual inspection impractical.
● Curse of Dimensionality:
● Statistics: Typically deals with a limited number of variables.
● Data Mining: When you have a lot of variables, using traditional statistics
becomes difficult due to the "curse of dimensionality."
● Predictions:
● Statistics: Focuses on analyzing data to draw conclusions or make
inferences.
● Data Mining: Emphasizes making predictions, utilizing large datasets to
identify patterns and trends.
● Example of Spurious Discovery:
● Data Mining: Small sample sizes can lead to spurious correlations. For
instance, correlating the "PSL Man of the Match" with stock market
movements, even though there may be no real connection.

Variables in Data Mining:


Objects in data mining are described by variables, often referred to as attributes.

Types of Variables in Data Mining:


1. Categorical Variables:
● Description: Corresponding to categories; includes nominal, binary, and
ordinal variables.
● Examples:
● Nominal: Object names, colors.
● Binary: True/False, 1/0.
● Ordinal: Small, medium, large.
2. Continuous Variables:
● Description: Corresponding to numerical values; includes integer,
interval-scaled, and ratio-scaled variables.
● Examples:
● Integer: 'Number of children.'
● Interval-scaled: Fahrenheit or Celsius temperature scales.
● Ratio-scaled: Kelvin temperature, molecular weight.

Types of Variables:
1. Categorical > Nominal Variables:
● Description: Used to categorize objects (e.g., name or color), with
numerical values having no mathematical interpretation.
● Example: Assigning numbers (1, 2, 3, ...) to represent categories without
meaningful arithmetic.
2. Categorical > Binary Variables:
● Description: A special case of nominal variables with only two possible
values (e.g., true or false, 1 or 0).
3. Categorical > Ordinal Variables:
● Description: Similar to nominal variables but with values that can be
arranged in a meaningful order (e.g., small, medium, large).
4. Continuous > Integer Variables:
● Description: Takes genuine integer values, and arithmetic operations have
meaningful interpretations (e.g., 'number of children').
5. Continuous > Interval-scaled Variables:
● Description: Takes numerical values with equal intervals from a zero point,
but the zero does not imply the absence of the measured characteristic
(e.g., Fahrenheit or Celsius temperature scales).
6. Continuous > Ratio-scaled Variables:
● Description: Similar to interval-scaled variables, but the zero point reflects
the absence of the measured characteristic (e.g., Kelvin temperature and
molecular weight).
7. 'Ignore' Attribute:
● Description: A third category representing variables of no significance for
the application. They are retained in the dataset but may not contribute to
the analysis (e.g., patient names or serial numbers).

Understanding these types of variables is essential in data mining as they influence the
choice of appropriate analysis methods and help in extracting meaningful patterns from
the data.

Dataset:
● The complete set of data available for an application is called a dataset.
● Representation: A dataset is often depicted as a table.
● Record or Instance:
● The set of variable values corresponding to each of the objects is called a
record or an instance.
● Each row in the dataset represents an instance.
● Each column contains the value of one of the variables (attributes) for
each of the instances.
● Example:
● The dataset is an example of labeled data.
● One attribute is given special significance, and the aim is to predict its
value.
● This attribute is given the standard name 'class.'
● Labeling:
● When there is no such significant attribute, we call the data unlabeled.

Understanding datasets is fundamental in data mining, as they serve as the foundation
for analysis, pattern recognition, and prediction. The structure of datasets, with
instances and attributes, plays a crucial role in extracting meaningful information from
the available data.

Example 1: Employee Performance


Dataset:

EmployeeID Age Department Years of Experience Performance Score

101 28 Sales 3 High

102 35 Marketing 7 Medium

103 42 HR 12 Low

104 30 IT 5 High

105 25 Finance 2 Medium

Explanation:

● Attributes:
● EmployeeID: Unique identifier for each employee.
● Age: Age of the employee.
● Department: Employee's department.
● Years of Experience: Employee's work experience.
● Performance Score: The performance level of the employee.
● Labeling:
● The 'Performance Score' is a labeled attribute, and the goal is to predict its
value.

Example 2: Housing Prices


Dataset:

HouseID Area (sq ft) Bedrooms Bathrooms Year Built Price (USD)

501 1500 3 2 1990 $250,000

502 2000 4 3 2005 $350,000


503 1200 2 1 1985 $200,000

504 1800 3 2 1998 $300,000

505 2500 5 4 2010 $450,000

Explanation:

● Attributes:
● HouseID: Unique identifier for each house.
● Area (sq ft): Size of the house.
● Bedrooms: Number of bedrooms.
● Bathrooms: Number of bathrooms.
● Year Built: Year the house was constructed.
● Price (USD): Selling price of the house.
● Labeling:
● The 'Price (USD)' is a labeled attribute, and the goal is to predict its value.

I. Surveys
Data Collection Method: Surveys
Description: A common method to collect information from a large population by asking a series of questions. Surveys can be conducted in various ways, such as in person, over the phone, by email, or online.
Advantages: Can collect data from a large population. Can be conducted remotely. Responses are standardized, making data comparison and analysis easy.
Disadvantages: Low response rates can lead to biased results. Respondents may not provide truthful or accurate responses. Survey questions may be poorly designed, leading to inaccurate data.

II. Interviews
Data Collection Method: Interviews
Description: Involves direct communication with individuals or groups to collect information. Interviews can be structured, semi-structured, or unstructured.
Advantages: Can provide in-depth information and insights. Can clarify ambiguous or complex responses. Can be adapted to the needs of the interviewee.
Disadvantages: Time-consuming and expensive. Responses may be influenced by interviewer bias. Small sample size may not be representative of the larger population.

III. Observations
Data Collection Method: Observations
Description: Involves collecting data by watching and recording the behavior of individuals or groups in natural or controlled settings.
Advantages: Can provide accurate and objective data. Can capture non-verbal behavior and interactions. Can be used to collect data on hard-to-measure variables.
Disadvantages: Time-consuming and labor-intensive. May be affected by observer bias. Can be difficult to replicate.

IV. Secondary Data
Data Collection Method: Secondary Data
Description: Involves collecting data from existing sources, such as government records, academic papers, or company reports.
Advantages: Can provide large amounts of data quickly. Can be used to compare and validate results from other data sources.
Disadvantages: May be outdated or incomplete. Data quality may be questionable. May not be tailored to the specific research question.

V. Cookies
Data Collection Method: Cookies
Description: Small files stored on a user's computer when visiting a website, tracking user activity, preferences, and interactions.
Advantages: Can help personalize a user's experience on a website. Can be used to remember items in a shopping cart or login credentials. Can help website owners track user behavior to improve functionality.
Disadvantages: Some cookies can track users across multiple websites, seen as an invasion of privacy. Can be used to target users with ads based on browsing history. Users may not be aware of or able to control/delete cookies.

VI. Web Trackers
Data Collection Method: Web Trackers
Description: Scripts embedded in websites to collect information about user activity, including IP address, browser, pages visited, and time spent.
Advantages: Help website owners monitor traffic and user behavior. Identify and fix technical issues on the website.
Disadvantages: Some web trackers track users across multiple websites, seen as an invasion of privacy. Can be used to target users with ads based on browsing history. Users may not be aware of or able to control/delete web trackers.

VII. Social Media
Data Collection Method: Social Media
Description: Social media platforms collect data about users' activity, including likes, follows, and group memberships.
Advantages: Users have control over their profiles. Social media is a powerful tool for businesses to reach potential customers.
Disadvantages: Social media can be used to spread misinformation or harmful content. Users may not be aware of how their data is used or shared.

VIII. Mobile Devices/IoT Devices
Data Collection Method: Mobile Devices/IoT Devices
Description: Mobile apps collect user data, including location, browsing history, and app usage, often used for targeted advertising. Smart devices produce data monitored by mobile apps.
Advantages: Can be used to collect data for improving user experiences. Some mobile apps may collect sensitive data for targeted advertising.
Disadvantages: Users may not be aware of data collection by mobile apps. Mobile apps may collect data without users' explicit consent. Privacy concerns may arise from the use of data collected by mobile devices.
This organized presentation includes a table for each data collection method,
summarizing the key information, advantages, and disadvantages.

Data Cleaning > Incorrectly Recorded Data:


Identifying and addressing erroneously recorded values in a dataset.
Example: Detecting a numerical attribute, such as age, recorded as '72' due to default
settings in data collection.
Problems Table (Incorrectly Recorded Data):

Age

72

72

72

72

Solution Table (Incorrectly Recorded Data):

Age

25

40

30

45

Data Cleaning > Incorrectly Recorded Data > Solution - Sorting:


Applying sorting to values to uncover unexpected patterns and errors.
Example: Sorting numerical values to identify outliers or inconsistent entries.
Problems Table (Sorting):
Exam Score

98

72

102

95

Solution Table (Sorting):

Exam Score

72

95

98

102

Data Cleaning > Incorrectly Recorded Data > Solution - Categorization:


Treating numerical variables with limited distinct values as categorical.
Example: Recognizing a variable with only six widely separated numerical values as
categorical.
Problems Table (Categorization):

Temperature

20

22

25

15

Solution Table (Categorization):


Temperature

Cold

Moderate

Warm

Very Cold

Data Cleaning > Incorrectly Recorded Data > Solution - Identical Values:
Handling variables where all values are identical.
Example: Identifying a column where all entries are 'Unknown' and considering it for
removal.
Problems Table (Identical Values):

Status

Unknown

Unknown

Unknown

Unknown

Solution Table (Identical Values):


Status

Remove

Keep

Remove

Remove

Data Cleaning > Incorrectly Recorded Data > Solution - Categorical for
Almost All Identical Values:
Treating a variable as categorical when almost all values are identical,
except for a few.
Example: Recognizing a column where most entries are 'Yes' except for a
few 'No' values.
Problems Table (Categorical for Almost All Identical Values):

Approval

Yes

Yes

No

Yes

Solution Table (Categorical for Almost All Identical Values):

Approval

Yes

Yes

No
Yes

Data Cleaning > Incorrectly Recorded Data > Out of Range Values:
Identifying values outside the normal range for a variable.
Example: Detecting a continuous attribute with most values in the range
200 to 5000, but a few outliers like 22654.8.
Problems Table (Out of Range Values):

Income

3000

4000

22654.8

5000

Solution Table (Out of Range Values):

Income

3000

4000

5000

4500

Data Cleaning > Incorrectly Recorded Data > Out of Range Values >
Handling Outliers:
Addressing outliers that may be genuine values significantly different from
others.
Example: Deciding whether to discard or adjust outliers in medical or
physics data.
Problems Table (Handling Outliers):

Test Score
85

92

105

78

Solution Table (Handling Outliers):

Test Score

85

92

95

78

Data Cleaning > Incorrectly Recorded Data > Repeated Values:


Identifying values that occur abnormally frequently in a dataset.
Example: Noticing a specific country, like 'Albania,' appearing unusually
often in a web service registration dataset.
Problems Table (Repeated Values):

Country

Albania

USA

Albania

Germany

Solution Table (Repeated Values):


Country

Albania

USA

Albania

Germany

Data Cleaning > Incorrectly Recorded Data > Repeated Values > Possible
Interpretations:
Considering potential explanations for abnormally frequent values.
Example: Exploring reasons why 'Albania' might be overly represented in a
country field.
Problems Table (Possible Interpretations):

Country

Albania

USA

Albania

Albania

Solution Table (Possible Interpretations):

Country

Albania

USA

Albania
USA

Data Cleaning > Incorrectly Recorded Data > Repeated Values >
Inconsistencies:
Identifying inconsistencies in abnormally frequent occurrences of values.
Example: Recognizing that a high proportion of recorded ages being 72
may indicate errors in data collection or processing.
Problems Table (Inconsistencies):

Age

72

28

72

72

Solution Table (Inconsistencies):

Age

72

28

72

28

Data Cleaning > Missing Values:


Addressing situations where certain attribute values are not recorded for
all instances in a dataset.
Example: Instances lacking data due to attributes not being applicable or
malfunctioning equipment.
Problems Table (Missing Values):
Attribute1 Attribute2 Attribute3

15 20 25

10 (Miss) 22

18 15 30

12 18 (Miss)

Solution Table (Missing Values):

Attribute1 Attribute2 Attribute3

15 20 25

10 18 22

18 15 30

12 18 25

Data Cleaning > Missing Values > Discard Instances:


Deleting instances with missing values, a conservative approach.
Example: Removing instances with at least one missing value and using the
remainder.
Problems Table (Discard Instances):

Attribute1 Attribute2 Attribute3

15 20 25

18 15 30

Solution Table (Discard Instances):

Attribute1 Attribute2 Attribute3


15 20 25

18 15 30

Data Cleaning > Missing Values > Replace by Most Frequent/Average Value:
Estimating missing values using the most frequent value for categorical
attributes or the average for continuous ones.
Example: Replacing missing values of a country attribute with the most
frequently occurring country.
Problems Table (Replace by Most Frequent/Average Value):

Country Age

USA 28

(Miss) 45

Germany 72

Solution Table (Replace by Most Frequent/Average Value):

Country Age

USA 28

USA 45

Germany 72
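
A short pandas sketch, on a made-up table, showing the discard and replace strategies described above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Country": ["USA", np.nan, "Germany", "USA"],
                   "Age": [28, 45, 72, np.nan]})

# Discard instances: drop any row containing a missing value.
dropped = df.dropna()

# Replace: most frequent value for the categorical column,
# average value for the continuous column.
filled = df.copy()
filled["Country"] = filled["Country"].fillna(filled["Country"].mode()[0])
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

print(dropped)
print(filled)
```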

Data Cleaning > Missing Values > Reducing the Number of Attributes:
Trimming datasets with numerous attributes to avoid computational
overhead.
Example: Using feature reduction techniques to select the most relevant
attributes for analysis.
Problems Table (Reducing the Number of Attributes):

Attribute1 Attribute2 Attribute3


15 20 25

10 (Miss) 22

18 15 30

12 18 (Miss)

Solution Table (Reducing the Number of Attributes):

Attribute1 Attribute3

15 25

10 22

18 30

12 25

1. Introduction to Descriptive Statistics

Descriptive statistics is a fundamental branch of statistics focused on the collection,
analysis, and interpretation of data. Its primary objective is to provide a succinct
summary of key features within a dataset, including central tendency, variability, and
shape. This field serves as a crucial tool for researchers, analysts, and decision-makers,
enabling them to make sense of data and extract meaningful insights.

Example:

In a market research study, descriptive statistics may be employed to analyze customer
feedback data, summarizing patterns and preferences to guide strategic decisions.
2. Measures of Central Tendency

Measures of central tendency aim to define the "typical" or central value within a
dataset. Common measures include the mean (average), median (middle value), and
mode (most frequent value). These measures offer valuable insights into the core
characteristics of a dataset.

Example:

In a classroom setting, measures of central tendency can be applied to analyze exam
scores, providing an understanding of the average performance and identifying the
most common score achieved.

3. Measures of Variability

Measures of variability quantify the spread or dispersion of data points within a dataset.
Key measures encompass the range (difference between largest and smallest values),
variance (average of squared differences from the mean), and standard deviation
(square root of the variance). These measures illuminate the distribution of values
within the dataset.

Example:

In financial analysis, understanding the standard deviation of stock prices helps assess
the level of risk associated with an investment, indicating how spread out the prices are
over time.

4. Measures of Shape

Measures of shape, including skewness (asymmetry) and kurtosis ("peakedness"),
describe the distribution's characteristics. These measures are crucial for identifying
any irregular features in the data, such as outliers or clusters.
Example:

In demographic studies, skewness might reveal whether income distribution is evenly
spread or if there is a concentration of higher incomes, indicating potential
socioeconomic disparities.

4.1) Skewness

Skewness quantifies the asymmetry of a distribution, indicating the extent to which the
data deviates from perfect symmetry. A skewness score can be positive, negative, or
close to zero, each conveying distinct characteristics about the distribution.

Example:

In analyzing household income data, a positive skewness suggests that a majority of
households earn below the average income, while a negative skewness implies the
opposite.

4.2) Kurtosis

Kurtosis measures the shape of a distribution, focusing on the heaviness of tails relative
to the center. It provides insights into the presence of outliers and extreme values within
the dataset.

Example:

In educational research, examining the kurtosis of test scores can indicate whether
there are a significant number of exceptionally high or low scores, potentially influencing
the overall performance trends.
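
A hedged SciPy sketch computing both measures on an invented right-skewed income sample.

```python
from scipy.stats import kurtosis, skew

# Hypothetical household incomes (in thousands); a few large values pull the tail to the right.
incomes = [22, 25, 27, 30, 31, 33, 35, 40, 90, 150]

print("Skewness:", skew(incomes))      # positive here: long right tail
print("Kurtosis:", kurtosis(incomes))  # excess kurtosis; above 0 means heavier tails than a normal curve
```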

5. Graphical Representations
Graphical representations are visual tools employed to enhance the understanding of
key features within a dataset. Common types include histograms (frequency
distribution), box plots (quartiles and outliers), and scatter plots (relationships between
variables).

Example:

In climate science, a scatter plot might visualize the relationship between temperature
and sea level rise, providing a clear depiction of any correlations or patterns.

5.1) Histogram

A histogram visually represents the frequency distribution of a dataset, aiding in the
interpretation of its shape, center, and spread.

Example:

Analyzing a histogram of patient recovery times in a medical study can reveal whether
the distribution is skewed, indicating variations in treatment effectiveness.

5.2) Interpretations of Box Plots

Box plots visually summarize central tendency, spread, skewness, and outliers within a
dataset, offering a concise overview of key features.

Example:

In a marketing survey, a box plot might illustrate the distribution of customer
satisfaction scores, providing insights into the spread of opinions and identifying
potential outliers.

5.3) Interpretation of Scatter Plots


Scatter plots visually represent relationships between two variables, showcasing
patterns, trends, and potential outliers.

Example:

In social science research, a scatter plot examining the relationship between education
level and income can highlight whether higher education correlates with increased
earnings.

6. Using Descriptive Statistics in Real-World Applications

Descriptive statistics find practical applications in various fields to analyze, interpret,
and derive insights from data.

Examples:

● In business, descriptive statistics can analyze sales data to identify market
trends and inform marketing strategies.
● In healthcare, they can analyze patient data to identify risk factors for specific
diseases, facilitating proactive healthcare measures.
● In social sciences, descriptive statistics can analyze survey data to identify
trends in public opinion, aiding policymakers in decision-making processes.
Data Visualization

Data visualization is the representation of data in a graphical or pictorial format. It
involves the process of transforming raw data into visual insights that can be easily
understood by stakeholders.

Importance of Data Visualization


1. Clarity:
● Helps clarify complex ideas and concepts, making it easier to understand
and communicate information.
2. Insight:
● Patterns and trends become apparent when data is presented visually,
aiding in better comprehension.
3. Efficiency:
● Saves time and reduces errors by facilitating the identification of outliers,
trends, and patterns.
4. Engagement:
● More engaging and interactive than other forms of communication,
increasing understanding among stakeholders.

Types of Data Visualizations


1. Column Charts (Vertical Bar)
● Simple and effective for representing data divided into discrete categories.
● Each category (x-axis) is represented by a bar, and the bar's height corresponds
to the data value (y-axis).

2. Bar Graph (Horizontal Bar)


● Similar to column charts but uses horizontal bars.
● Categories are displayed on the y-axis, and values are on the x-axis.
● Useful when category labels are lengthy or when space is limited.

3. Histograms
● Displays the distribution of continuous numerical data.
● Rectangular bars represent ranges of values, with the bar's width indicating the
range and height corresponding to frequency or count.
● Useful for understanding the shape, central tendency, and spread of a dataset.

4. Pie Charts
● Circular graph divided into slices representing the proportion or percentage of
different categories within a whole.
● Each slice's size is proportional to the corresponding value or percentage it
represents.

5. Line Charts
● Represents data that changes over time.
● Shows how data points are connected, creating a visual representation of trends
over time.

6. Scatter Plots
● Used to represent the relationship between two variables.
● Reveals patterns and trends that may not be apparent in other visualizations.

7. Heat Maps
● Graphical representation where values are represented by colors.
● Used to show the distribution of data across different categories.

8. Tree Maps
● Hierarchical representation of data, where each level of the hierarchy is
represented by a rectangle.
● The size of the rectangle corresponds to the value of the data.

Data visualization tools play a crucial role in creating these visual representations,
helping users make sense of complex datasets.
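
The matplotlib sketch below, on invented data, produces three of the chart types listed above (histogram, box plot, and scatter plot).

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)   # hypothetical numeric sample
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(scale=2, size=100)         # two related variables

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(values, bins=20)    # histogram: distribution of a numeric variable
axes[0].set_title("Histogram")
axes[1].boxplot(values)          # box plot: quartiles and outliers
axes[1].set_title("Box plot")
axes[2].scatter(x, y)            # scatter plot: relationship between two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```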
