Data Mining
DATA COLLECTION
1. Data collection is the process of gathering relevant information from various
sources to analyze and derive insights.
2. In data science, the quality of collected data directly impacts the accuracy of the
resulting analysis and models.
3. A well-defined sampling strategy ensures that collected data is representative of
the larger population.
4. Surveys, interviews, and questionnaires are common methods for collecting
primary data directly from individuals.
5. Web scraping involves extracting information from websites and is often used to
collect data from online sources.
6. Sensor networks and Internet of Things (IoT) devices contribute to the collection
of real-time data in various applications.
7. Secondary data refers to data collected by someone else for a different purpose
but can still be useful for analysis.
8. The bias present in collected data can lead to skewed insights and inaccurate
conclusions.
9. Data curation involves organizing, cleaning, and preparing collected data for
analysis.
10. The process of data collection should follow ethical guidelines to ensure privacy
and respect for individuals' rights.
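Point 5 above mentions web scraping. As a minimal sketch using only Python's standard library `html.parser`, applied to a hypothetical HTML snippet rather than a live site (real scrapers typically use libraries like requests and BeautifulSoup, and must respect a site's terms of service):

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a downloaded page.
HTML = """
<ul>
  <li class="price">19.99</li>
  <li class="price">24.50</li>
</ul>
"""

class PriceScraper(HTMLParser):
    """Collects the text of every <li class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(float(data.strip()))
            self.in_price = False

scraper = PriceScraper()
scraper.feed(HTML)
print(scraper.prices)  # [19.99, 24.5]
```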
DESCRIPTIVE STATISTICS
1. Descriptive statistics summarize and describe the main features of a dataset.
2. Descriptive statistics can be used to summarize both categorical and numerical
variables.
3. Range is a measure of dispersion that represents the difference between the
maximum and minimum values in a dataset.
4. The range is NOT a measure of central tendency; the middle value in a dataset is
given by the median.
5. The interquartile range (IQR) is a measure of spread that represents the range
between the first quartile (Q1) and the third quartile (Q3).
6. The mode is the value that occurs most frequently in a dataset.
7. The median is less affected by outliers than the mean.
8. The median is less influenced by extreme values in the dataset, making it a more
robust measure of central tendency compared to the mean.
9. Standard deviation measures the average distance of values from the mean.
10. Standard deviation quantifies the dispersion or spread of data by measuring the
average distance between each data point and the mean.
11. Variance is NOT the square root of the standard deviation.
12. Variance is the square of the standard deviation.
13. Skewness is a measure of the symmetry of a distribution.
14. Skewness indicates the extent to which a distribution is skewed or asymmetrical.
15. Correlation measures the strength and direction of the linear relationship
between two numerical variables.
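The measures above can be computed directly with Python's built-in `statistics` module; a short sketch over a small sample dataset:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # 5.0
median = statistics.median(data)       # 4.5 (middle value)
mode = statistics.mode(data)           # 4 (most frequent value)
stdev = statistics.pstdev(data)        # population standard deviation: 2.0
variance = statistics.pvariance(data)  # 4.0 -- the square of the standard deviation

# Range: difference between the maximum and minimum values.
value_range = max(data) - min(data)    # 7

# Interquartile range: spread between the first and third quartiles.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # 2.5

print(mean, median, mode, stdev, variance, value_range, iqr)
```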
EDA
1. EDA involves summarizing and visualizing data to gain insights and understand
patterns.
2. EDA is typically performed after data cleaning and preprocessing to ensure the
data is in a suitable format for analysis.
3. EDA includes identifying outliers (extreme values) and missing values in the
dataset, which can impact the validity of the analysis.
4. Descriptive statistics, such as mean, median, and standard deviation, are
commonly calculated during EDA to summarize the central tendency and
dispersion of the data.
5. EDA is a flexible and iterative process: findings at one step often prompt new
questions and further exploration.
6. EDA can help detect relationships and correlations between variables, which can
provide valuable insights into the dataset.
7. The primary goal of EDA is to gain an understanding of the data rather than
formal hypothesis testing and statistical inference.
8. EDA can reveal potential data quality issues, such as inconsistent or erroneous
values, and identify data anomalies that require further investigation.
9. Graphical techniques, such as histograms, scatter plots, and box plots, are
commonly used in EDA to visualize the distribution, relationships, and outliers in
the data.
10. EDA is an ongoing process that continues as new data and new questions arise.
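Two of the EDA steps above, counting missing values and flagging outliers, can be sketched without any libraries (in practice, pandas and matplotlib are the usual tools); the 1.5 × IQR rule below is one common outlier heuristic:

```python
import statistics

# Hypothetical sample with a missing value (None) and an extreme value.
ages = [25, 40, 30, 45, None, 28, 33, 102]

# Step 1: count missing values.
missing = sum(1 for v in ages if v is None)
observed = [v for v in ages if v is not None]

# Step 2: flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < low or v > high]

print(missing)   # 1
print(outliers)  # [102]
```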
DATA VISUALIZATIONS
1. Data visualization is the presentation of data in a graphical or pictorial format.
2. Bar chart, line chart, and pie chart are some of the common types of visualization
charts.
3. A line chart is a data visualization technique suitable for displaying trends over
time.
4. A heat map is used to represent the distribution of values with colors.
5. A tree map is used to show hierarchical data using nested rectangles.
6. A box plot is used to show the distribution of data.
7. A choropleth map is used to represent geographic data with color variations.
8. The points on the scatter plot show the relationship between two variables.
9. In a bar chart, the y-axis shows the dependent variable while the x-axis shows the
independent variable.
10. Python is one of the most commonly used programming languages for creating
interactive data visualizations.
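Libraries such as matplotlib or Plotly are the usual charting tools; as a dependency-free illustration of the bar-chart idea (independent variable on one axis, dependent on the other), here is a tiny text-based bar chart over hypothetical monthly sales:

```python
# Hypothetical monthly sales (independent variable: month; dependent: sales).
sales = {"Jan": 12, "Feb": 7, "Mar": 15}

def text_bar_chart(data, scale=1):
    """Render one '#' per `scale` units for each category."""
    lines = []
    for label, value in data.items():
        lines.append(f"{label} | {'#' * (value // scale)} {value}")
    return "\n".join(lines)

print(text_bar_chart(sales))
# Jan | ############ 12
# Feb | ####### 7
# Mar | ############### 15
```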
DATA CLEANING
1. Imputation techniques are used to fill in missing values.
2. Outlier detection is used to identify and handle unusual data points.
3. Standardization is used to bring all variables to a common scale.
4. Deduplication is used to identify and handle duplicate records.
5. Regular Expressions are used for pattern matching and extraction.
6. One-Hot Encoding is used for handling categorical variables.
7. Scaling is used to re-scale numerical variables.
8. Trimming is used to remove unnecessary white spaces.
9. Mean imputation involves replacing missing values with the mean of the variable.
10. Forward filling involves filling missing values with the value before them.
11. Interpolation involves estimating missing values based on the adjacent values.
12. Deleting rows involves removing rows with missing values.
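Several of the techniques above (mean imputation and forward filling, items 9 and 10) can be sketched without any libraries; in practice, pandas' `fillna` method does the same job:

```python
def mean_impute(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def forward_fill(values):
    """Replace None with the most recent preceding value."""
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled

data = [10, None, 30, None]
print(mean_impute(data))   # [10, 20.0, 30, 20.0]
print(forward_fill(data))  # [10, 10, 30, 30]
```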
MACHINE LEARNING
1. The two main categories of machine learning models are supervised and
unsupervised.
2. Labeled data in supervised learning provides correct answers for training the
model to learn relationships between input features and output labels.
3. Precision is the ratio of correctly predicted positive observations to the total
predicted positives, while recall is the ratio of correctly predicted positive
observations to the total actual positives.
4. Accuracy might not be suitable for imbalanced datasets because it can be
dominated by the majority class and may not reflect the true model performance.
5. Cross-validation assesses a machine learning model's performance by dividing
the dataset into subsets, training/evaluating the model on different
combinations, and providing insights into its generalization capability.
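Point 3's definitions of precision and recall can be checked directly on a toy set of predictions (scikit-learn's `precision_score` and `recall_score` compute the same quantities):

```python
# Toy binary labels: 1 = positive, 0 = negative.
actual    = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # correct positives / all predicted positives
recall = tp / (tp + fn)     # correct positives / all actual positives

print(precision, recall)  # 0.75 0.75
```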
Data mining is the computational process of extracting knowledge from large datasets
through methods at the intersection of artificial intelligence, machine learning, statistics, and
database systems. The approach aims to discover meaningful patterns, and it could more
appropriately be named "knowledge mining" to emphasize the extraction of valuable
insights from data. The overarching goal is to transform raw data into an understandable
structure, facilitating further analysis and informed decision-making.
3. Clustering:
● It is the task of discovering groups or structures in the data that are in some way similar, without
predefined categories.
● Example: Social media platforms grouping users based on similar interests or behaviors, creating
communities.
4. Classification:
● It is the task of generalizing known patterns to apply to new data.
● Example: An email program learning from labeled emails (spam or not) to automatically classify
new emails as either "legitimate" or "spam".
5. Regression:
● It is the task of finding a function that models the data with the least error.
● Example: Predicting house prices based on factors like square footage, number of bedrooms, and
location.
6. Summarization:
● It is the task of providing a more concise representation of the dataset, often through
visualization and reports.
● Example: Creating a bar chart to summarize monthly sales data, making it easy to see trends over
time.
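The regression task (item 5) can be illustrated with a one-variable least-squares fit; a minimal sketch on hypothetical house data (area in square feet vs. price):

```python
# Hypothetical training data: (area in sq ft, price in $1000s).
areas  = [1000, 1500, 2000, 2500]
prices = [200, 250, 300, 350]

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n

# Ordinary least squares for the line y = a + b*x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices)) / \
    sum((x - mean_x) ** 2 for x in areas)
a = mean_y - b * mean_x

print(a, b)          # approximately 100.0 and 0.1
print(a + b * 1800)  # predicted price for an 1800 sq ft house: about 280
```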
Knowledge Base:
● This is the foundational element that incorporates domain knowledge to guide the search
and assess interesting patterns. This includes concept hierarchies, user beliefs,
metadata, and other knowledge.
● Example: Organizing customer preferences into levels for targeted marketing, like "Basic,"
"Intermediate," and "Advanced."
Data Mining Engine:
● The core system with functional modules for tasks like association analysis,
classification, and cluster analysis.
● Example: Identifying associations, such as customers who buy laptops also purchasing
laptop accessories.
Pattern Evaluation Module:
● A component utilizing interestingness measures to assess pattern value and guide the
search. It may use thresholds for filtering patterns.
● Example: Setting a threshold to only consider sales patterns that show a significant
increase.
User Interface:
● The interface facilitates communication between users and the system, allowing queries,
result exploration, and visualization.
● Example: Allowing a user to query the system for trends in monthly sales and visually
presenting the findings.
This architecture combines domain knowledge, analytical modules, evaluation criteria, and user
interaction to efficiently extract meaningful insights from data.
6. Pattern Evaluation:
● Challenge: Assessing the interestingness of discovered patterns in terms of
representing common knowledge or lacking novelty.
● Patterns need to be interesting and valuable to users, either by providing new
insights or confirming existing knowledge.
These issues highlight the complexity of data mining tasks and the importance of
addressing various challenges to ensure the meaningful extraction and utilization of
knowledge from large datasets.
Types of variables
Dataset
Sample:
● Given the often vast size of the population, a "sample" is a subset of this
universe that is accessible and used for analysis in data mining. It represents a
manageable portion from which we aim to extract information applicable to the
entire population.
● The sample is crucial as it allows for practical analysis without having to process
or examine the entire population. Insights gained from the sample are
extrapolated to make predictions about the larger dataset.
Types of Variables:
Categorical > Nominal Variables:
● Description: Used to categorize objects (e.g., name or color), with
numerical values having no mathematical interpretation.
● Example: Assigning numbers (1, 2, 3, ...) to represent categories without
meaningful arithmetic.
Categorical > Binary Variables:
● Description: A special case of nominal variables with only two possible
values (e.g., true or false, 1 or 0).
Categorical > Ordinal Variables:
● Description: Similar to nominal variables but with values that can be
arranged in a meaningful order (e.g., small, medium, large).
Continuous > Integer Variables:
● Description: Takes genuine integer values, and arithmetic operations have
meaningful interpretations (e.g., 'number of children').
Continuous > Interval-scaled Variables:
● Description: Takes numerical values with equal intervals from a zero point,
but the zero does not imply the absence of the measured characteristic
(e.g., Fahrenheit or Celsius temperature scales).
Continuous > Ratio-scaled Variables:
● Description: Similar to interval-scaled variables, but the zero point reflects
the absence of the measured characteristic (e.g., Kelvin temperature and
molecular weight).
'Ignore' Attribute:
● Description: A third category representing variables of no significance for
the application. They are retained in the dataset but may not contribute to
the analysis (e.g., patient names or serial numbers).
Understanding these types of variables is essential in data mining as they influence the
choice of appropriate analysis methods and help in extracting meaningful patterns from
the data.
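The nominal-versus-ordinal distinction matters when encoding variables as numbers: integer codes preserve order, so they suit ordinal variables, while one-hot encoding avoids implying an order for nominal ones. A small sketch (category names are hypothetical):

```python
# Ordinal: values have a meaningful order, so integer codes preserve it.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = ["medium", "small", "large"]
encoded_sizes = [size_order[s] for s in sizes]
print(encoded_sizes)  # [1, 0, 2] -- comparisons like 0 < 2 are meaningful

# Nominal: no order, so one-hot encoding avoids implying one.
colors = ["red", "blue", "red"]
categories = sorted(set(colors))  # ['blue', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot)  # [[0, 1], [1, 0], [0, 1]]
```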
Dataset:
● The complete set of data available for an application is called a dataset.
● Representation: A dataset is often depicted as a table.
● Record or Instance:
● The set of variable values corresponding to each of the objects is called a
record or an instance.
● Each row in the dataset represents an instance.
● Each column contains the value of one of the variables (attributes) for
each of the instances.
● Example:
● The dataset is an example of labeled data.
● One attribute is given special significance, and the aim is to predict its
value.
● This attribute is given the standard name 'class.'
● Labeling:
● When there is no such significant attribute, we call the data unlabeled.
EmployeeID Age Department Years of Experience Performance Score
103 42 HR 12 Low
104 30 IT 5 High
Explanation:
● Attributes:
● EmployeeID: Unique identifier for each employee.
● Age: Age of the employee.
● Department: Employee's department.
● Years of Experience: Employee's work experience.
● Performance Score: The performance level of the employee.
● Labeling:
● The 'Performance Score' is a labeled attribute, and the goal is to predict its
value.
HouseID Area (sq ft) Bedrooms Bathrooms Year Built Price (USD)
Explanation:
● Attributes:
● HouseID: Unique identifier for each house.
● Area (sq ft): Size of the house.
● Bedrooms: Number of bedrooms.
● Bathrooms: Number of bathrooms.
● Year Built: Year the house was constructed.
● Price (USD): Selling price of the house.
● Labeling:
● The 'Price (USD)' is a labeled attribute, and the goal is to predict its value.
Data Collection Methods:
I. Surveys
II. Interviews
III. Observations
V. Cookies
Example data columns:
Age: 72, 72, 72, 72
Age: 25, 40, 30, 45, 98, 72, 102, 95
Exam Score: 72, 95, 98, 102
Temperature: 20, 22, 25, 15 (labels: Cold, Moderate, Warm, Very Cold)
Data Cleaning > Incorrectly Recorded Data > Solution - Identical Values:
Handling variables where all values are identical.
Example: Identifying a column where all entries are 'Unknown' and considering it for
removal.
Problems Table (Identical Values):
Status (before): Unknown, Unknown, Unknown, Unknown
Action: Remove, Keep, Remove, Remove
Data Cleaning > Incorrectly Recorded Data > Solution - Categorical for
Almost All Identical Values:
Treating a variable as categorical when almost all values are identical,
except for a few.
Example: Recognizing a column where most entries are 'Yes' except for a
few 'No' values.
Problems Table (Categorical for Almost All Identical Values):
Approval (before): Yes, Yes, No, Yes
Approval (after): Yes, Yes, No, Yes
Data Cleaning > Incorrectly Recorded Data > Out of Range Values:
Identifying values outside the normal range for a variable.
Example: Detecting a continuous attribute with most values in the range
200 to 5000, but a few outliers like 22654.8.
Problems Table (Out of Range Values):
Income (before): 3000, 4000, 22654.8, 5000
Income (after): 3000, 4000, 5000, 4500
Data Cleaning > Incorrectly Recorded Data > Out of Range Values >
Handling Outliers:
Addressing outliers that may be genuine values significantly different from
others.
Example: Deciding whether to discard or adjust outliers in medical or
physics data.
Problems Table (Handling Outliers):
Test Score (before): 85, 92, 105, 78
Test Score (after): 85, 92, 95, 78
Country (before): Albania, USA, Albania, Germany
Country (after): Albania, USA, Albania, Germany
Data Cleaning > Incorrectly Recorded Data > Repeated Values > Possible
Interpretations:
Considering potential explanations for abnormally frequent values.
Example: Exploring reasons why 'Albania' might be overly represented in a
country field.
Problems Table (Possible Interpretations):
Country (before): Albania, USA, Albania, Albania
Country (after): Albania, USA, Albania, USA
Data Cleaning > Incorrectly Recorded Data > Repeated Values >
Inconsistencies:
Identifying inconsistencies in abnormally frequent occurrences of values.
Example: Recognizing that a high proportion of recorded ages being 72
may indicate errors in data collection or processing.
Problems Table (Inconsistencies):
Age (before): 72, 28, 72, 72
Age (after): 72, 28, 72, 28
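Abnormally frequent values like the repeated ages of 72 can be flagged automatically; a minimal sketch using `collections.Counter` with a hypothetical frequency cutoff:

```python
from collections import Counter

ages = [72, 28, 72, 72, 35, 72, 41, 72]

counts = Counter(ages)
threshold = 0.5  # hypothetical cutoff: flag values making up over half the column
suspicious = [v for v, c in counts.items() if c / len(ages) > threshold]

print(suspicious)  # [72] -- 72 accounts for 5 of 8 entries
```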
Before (with missing values):
15 20 25
10 (Miss) 22
18 15 30
12 18 (Miss)
After (missing values imputed):
15 20 25
10 18 22
18 15 30
12 18 25
After deleting rows with missing values:
15 20 25
18 15 30
Before:
Country Age
USA 28
(Miss) 45
Germany 72
After (missing value imputed):
Country Age
USA 28
USA 45
Germany 72
Data Cleaning > Missing Values > Reducing the Number of Attributes:
Trimming datasets with numerous attributes to avoid computational
overhead.
Example: Using feature reduction techniques to select the most relevant
attributes for analysis.
Problems Table (Reducing the Number of Attributes):
Before (with missing values):
10 (Miss) 22
18 15 30
12 18 (Miss)
After (reduced to Attribute1 and Attribute3):
Attribute1 Attribute3
15 25
10 22
18 30
12 25
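The attribute-reduction step above amounts to selecting a subset of columns; a minimal sketch over the imputed rows (pandas users would write `df[['Attribute1', 'Attribute3']]`):

```python
# The dataset after imputation (columns: Attribute1, Attribute2, Attribute3).
rows = [
    [15, 20, 25],
    [10, 18, 22],
    [18, 15, 30],
    [12, 18, 25],
]

# Keep only the selected attributes (here: Attribute1 and Attribute3).
keep = [0, 2]
reduced = [[r[j] for j in keep] for r in rows]

print(reduced)  # [[15, 25], [10, 22], [18, 30], [12, 25]]
```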
2. Measures of Central Tendency
Measures of central tendency aim to define the "typical" or central value within a
dataset. Common measures include the mean (average), median (middle value), and
mode (most frequent value). These measures offer valuable insights into the core
characteristics of a dataset.
Example:
For the incomes 30, 35, 40, and 400 (in thousands), the median of 37.5 describes the
typical earner better than the mean of 126.25, which is pulled up by the single extreme
value.
3. Measures of Variability
Measures of variability quantify the spread or dispersion of data points within a dataset.
Key measures encompass the range (difference between largest and smallest values),
variance (average of squared differences from the mean), and standard deviation
(square root of the variance). These measures illuminate the distribution of values
within the dataset.
Example:
In financial analysis, understanding the standard deviation of stock prices helps assess
the level of risk associated with an investment, indicating how spread out the prices are
over time.
4. Measures of Shape
4.1) Skewness
Skewness quantifies the asymmetry of a distribution, indicating the extent to which the
data deviates from perfect symmetry. A skewness score can be positive, negative, or
close to zero, each conveying distinct characteristics about the distribution.
Example:
Household income distributions are typically right-skewed (positive skewness), with a
long tail of high earners pulling the mean above the median.
4.2) Kurtosis
Kurtosis measures the shape of a distribution, focusing on the heaviness of tails relative
to the center. It provides insights into the presence of outliers and extreme values within
the dataset.
Example:
In educational research, examining the kurtosis of test scores can indicate whether
there are a significant number of exceptionally high or low scores, potentially influencing
the overall performance trends.
5. Graphical Representations
Graphical representations are visual tools employed to enhance the understanding of
key features within a dataset. Common types include histograms (frequency
distribution), box plots (quartiles and outliers), and scatter plots (relationships between
variables).
Example:
In climate science, a scatter plot might visualize the relationship between temperature
and sea level rise, providing a clear depiction of any correlations or patterns.
5.1) Histogram
Example:
Analyzing a histogram of patient recovery times in a medical study can reveal whether
the distribution is skewed, indicating variations in treatment effectiveness.
5.2) Box Plot
Box plots visually summarize central tendency, spread, skewness, and outliers within a
dataset, offering a concise overview of key features.
Example:
A box plot of salaries across departments quickly shows which departments have the
widest pay ranges and which contain outliers.
Example:
In social science research, a scatter plot examining the relationship between education
level and income can highlight whether higher education correlates with increased
earnings.
3. Histograms
● Displays the distribution of continuous numerical data.
● Rectangular bars represent ranges of values, with the bar's width indicating the
range and height corresponding to frequency or count.
● Useful for understanding the shape, central tendency, and spread of a dataset.
4. Pie Charts
● Circular graph divided into slices representing the proportion or percentage of
different categories within a whole.
● Each slice's size is proportional to the corresponding value or percentage it
represents.
5. Line Charts
● Represents data that changes over time.
● Shows how data points are connected, creating a visual representation of trends
over time.
6. Scatter Plots
● Used to represent the relationship between two variables.
● Reveals patterns and trends that may not be apparent in other visualizations.
7. Heat Maps
● Graphical representation where values are represented by colors.
● Used to show the distribution of data across different categories.
8. Tree Maps
● Hierarchical representation of data, where each level of the hierarchy is
represented by a rectangle.
● The size of the rectangle corresponds to the value of the data.
Data visualization tools play a crucial role in creating these visual representations,
helping users make sense of complex datasets.