Data Mining
DATA COLLECTION
1. Data collection is the process of gathering relevant information from various
sources to analyze and derive insights.
2. In data science, the quality of collected data directly impacts the accuracy of the
resulting analysis and models.
3. A well-defined sampling strategy ensures that collected data is representative of
the larger population.
4. Surveys, interviews, and questionnaires are common methods for collecting
primary data directly from individuals.
5. Web scraping involves extracting information from websites and is often used to
collect data from online sources.
6. Sensor networks and Internet of Things (IoT) devices contribute to the collection
of real-time data in various applications.
7. Secondary data refers to data collected by someone else for a different purpose
but can still be useful for analysis.
8. The bias present in collected data can lead to skewed insights and inaccurate
conclusions.
9. Data curation involves organizing, cleaning, and preparing collected data for
analysis.
10. The process of data collection should follow ethical guidelines to ensure privacy
and respect for individuals' rights.
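Point 5 above mentions web scraping. As a minimal sketch using only Python's standard library `html.parser`, applied to a hypothetical HTML snippet rather than a live site (real scrapers typically use libraries like requests and BeautifulSoup, and must respect a site's terms of service):

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet standing in for a downloaded page.
HTML = """
<ul>
  <li class="price">19.99</li>
  <li class="price">24.50</li>
</ul>
"""

class PriceScraper(HTMLParser):
    """Collects the text of every <li class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(float(data.strip()))
            self.in_price = False

scraper = PriceScraper()
scraper.feed(HTML)
print(scraper.prices)  # [19.99, 24.5]
```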
DESCRIPTIVE STATISTICS
1. Descriptive statistics summarize and describe the main features of a dataset.
2. Descriptive statistics can be used to summarize both categorical and numerical
variables.
3. Range is a measure of dispersion that represents the difference between the
maximum and minimum values in a dataset.
4. The range is NOT a measure of central tendency; the middle value in a dataset is
given by the median.
5. The interquartile range (IQR) is a measure of spread that represents the range
between the first quartile (Q1) and the third quartile (Q3).
6. The mode is the value that occurs most frequently in a dataset.
7. The median is less affected by outliers than the mean.
8. The median is less influenced by extreme values in the dataset, making it a more
robust measure of central tendency compared to the mean.
9. Standard deviation measures the average distance of values from the mean.
10. Standard deviation quantifies the dispersion or spread of data by measuring the
average distance between each data point and the mean.
11. Variance is NOT the square root of the standard deviation.
12. Variance is the square of the standard deviation.
13. Skewness is a measure of the symmetry of a distribution.
14. Skewness indicates the extent to which a distribution is skewed or asymmetrical.
15. Correlation measures the strength and direction of the linear relationship
between two numerical variables.
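The measures above can be computed directly with Python's built-in `statistics` module; a short sketch over a small sample dataset:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # 5.0
median = statistics.median(data)       # 4.5 (middle value)
mode = statistics.mode(data)           # 4 (most frequent value)
stdev = statistics.pstdev(data)        # population standard deviation: 2.0
variance = statistics.pvariance(data)  # 4.0 -- the square of the standard deviation

# Range: difference between the maximum and minimum values.
value_range = max(data) - min(data)    # 7

# Interquartile range: spread between the first and third quartiles.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # 2.5

print(mean, median, mode, stdev, variance, value_range, iqr)
```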
EDA
1. EDA involves summarizing and visualizing data to gain insights and understand
patterns.
2. EDA is typically performed after data cleaning and preprocessing to ensure the
data is in a suitable format for analysis.
3. EDA includes identifying outliers (extreme values) and missing values in the
dataset, which can impact the validity of the analysis.
4. Descriptive statistics, such as mean, median, and standard deviation, are
commonly calculated during EDA to summarize the central tendency and
dispersion of the data.
5. EDA is a flexible and iterative process: findings at one step often prompt new
questions and further exploration.
6. EDA can help detect relationships and correlations between variables, which can
provide valuable insights into the dataset.
7. The primary goal of EDA is to gain an understanding of the data rather than
formal hypothesis testing and statistical inference.
8. EDA can reveal potential data quality issues, such as inconsistent or erroneous
values, and identify data anomalies that require further investigation.
9. Graphical techniques, such as histograms, scatter plots, and box plots, are
commonly used in EDA to visualize the distribution, relationships, and outliers in
the data.
10. EDA is an ongoing process that continues as new data and new questions arise.
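Two of the EDA steps above, counting missing values and flagging outliers, can be sketched without any libraries (in practice, pandas and matplotlib are the usual tools); the 1.5 × IQR rule below is one common outlier heuristic:

```python
import statistics

# Hypothetical sample with a missing value (None) and an extreme value.
ages = [25, 40, 30, 45, None, 28, 33, 102]

# Step 1: count missing values.
missing = sum(1 for v in ages if v is None)
observed = [v for v in ages if v is not None]

# Step 2: flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < low or v > high]

print(missing)   # 1
print(outliers)  # [102]
```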
DATA VISUALIZATIONS
1. Data visualization is the presentation of data in a graphical or pictorial format.
2. Bar chart, line chart, and pie chart are some of the common types of visualization
charts.
3. A line chart is a data visualization technique suitable for displaying trends over
time.
4. A heat map is used to represent the distribution of values with colors.
5. A tree map is used to show hierarchical data using nested rectangles.
6. A box plot is used to show the distribution of data.
7. A choropleth map is used to represent geographic data with color variations.
8. The points on the scatter plot show the relationship between two variables.
9. In a bar chart, the y-axis shows the dependent variable while the x-axis shows the
independent variable.
10. Python is one of the most commonly used programming languages for creating
interactive data visualizations.
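Libraries such as matplotlib or Plotly are the usual charting tools; as a dependency-free illustration of the bar-chart idea (independent variable on one axis, dependent on the other), here is a tiny text-based bar chart over hypothetical monthly sales:

```python
# Hypothetical monthly sales (independent variable: month; dependent: sales).
sales = {"Jan": 12, "Feb": 7, "Mar": 15}

def text_bar_chart(data, scale=1):
    """Render one '#' per `scale` units for each category."""
    lines = []
    for label, value in data.items():
        lines.append(f"{label} | {'#' * (value // scale)} {value}")
    return "\n".join(lines)

print(text_bar_chart(sales))
# Jan | ############ 12
# Feb | ####### 7
# Mar | ############### 15
```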
DATA CLEANING
1. Imputation techniques are used to fill in missing values.
2. Outlier detection is used to identify and handle unusual data points.
3. Standardization is used to bring all variables to a common scale.
4. Deduplication is used to identify and handle duplicate records.
5. Regular Expressions are used for pattern matching and extraction.
6. One-Hot Encoding is used for handling categorical variables.
7. Scaling is used to re-scale numerical variables.
8. Trimming is used to remove unnecessary white spaces.
9. Mean imputation involves replacing missing values with the mean of the variable.
10. Forward filling involves filling missing values with the value before them.
11. Interpolation involves estimating missing values based on the adjacent values.
12. Deleting rows involves removing rows with missing values.
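Several of the techniques above (mean imputation and forward filling, items 9 and 10) can be sketched without any libraries; in practice, pandas' `fillna` method does the same job:

```python
def mean_impute(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def forward_fill(values):
    """Replace None with the most recent preceding value."""
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled

data = [10, None, 30, None]
print(mean_impute(data))   # [10, 20.0, 30, 20.0]
print(forward_fill(data))  # [10, 10, 30, 30]
```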
MACHINE LEARNING
1. The two main categories of machine learning models are supervised and
unsupervised.
2. Labeled data in supervised learning provides correct answers for training the
model to learn relationships between input features and output labels.
3. Precision is the ratio of correctly predicted positive observations to the total
predicted positives, while recall is the ratio of correctly predicted positive
observations to the total actual positives.
4. Accuracy might not be suitable for imbalanced datasets because it can be
dominated by the majority class and may not reflect the true model performance.
5. Cross-validation assesses a machine learning model's performance by dividing
the dataset into subsets, training/evaluating the model on different
combinations, and providing insights into its generalization capability.
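Point 3's definitions of precision and recall can be checked directly on a toy set of predictions (scikit-learn's `precision_score` and `recall_score` compute the same quantities):

```python
# Toy binary labels: 1 = positive, 0 = negative.
actual    = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # correct positives / all predicted positives
recall = tp / (tp + fn)     # correct positives / all actual positives

print(precision, recall)  # 0.75 0.75
```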
Data mining is the computational process of extracting knowledge from large datasets
through methods at the intersection of artificial intelligence, machine learning, statistics, and
database systems. The approach aims to discover meaningful patterns, and it could more
appropriately be named "knowledge mining" to emphasize the extraction of valuable
insights from data. The overarching goal is to transform raw data into an understandable
structure, facilitating further analysis and informed decision-making.
3. Clustering:
● It is the task of discovering groups or structures in the data that are in some way similar, without
predefined categories.
● Example: Social media platforms grouping users based on similar interests or behaviors, creating
communities.
4. Classification:
● It is the task of generalizing known patterns to apply to new data.
● Example: An email program learning from labeled emails (spam or not) to automatically classify
new emails as either "legitimate" or "spam".
5. Regression:
● It is the task of finding a function that models the data with the least error.
● Example: Predicting house prices based on factors like square footage, number of bedrooms, and
location.
6. Summarization:
● It is the task of providing a more concise representation of the dataset, often through
visualization and reports.
● Example: Creating a bar chart to summarize monthly sales data, making it easy to see trends over
time.
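The regression task (item 5) can be illustrated with a one-variable least-squares fit; a minimal sketch on hypothetical house data (area in square feet vs. price):

```python
# Hypothetical training data: (area in sq ft, price in $1000s).
areas  = [1000, 1500, 2000, 2500]
prices = [200, 250, 300, 350]

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n

# Ordinary least squares for the line y = a + b*x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices)) / \
    sum((x - mean_x) ** 2 for x in areas)
a = mean_y - b * mean_x

print(a, b)          # approximately 100.0 and 0.1
print(a + b * 1800)  # predicted price for an 1800 sq ft house: about 280
```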
Knowledge Base:
● This is the foundational element that incorporates domain knowledge to guide the search
and assess interesting patterns. This includes concept hierarchies, user beliefs,
metadata, and other knowledge.
● Example: Organizing customer preferences into levels for targeted marketing, like "Basic,"
"Intermediate," and "Advanced."
Data Mining Engine:
● The core system with functional modules for tasks like association analysis,
classification, and cluster analysis.
● Example: Identifying associations, such as customers who buy laptops also purchasing
laptop accessories.
Pattern Evaluation Module:
● A component utilizing interestingness measures to assess pattern value and guide the
search. It may use thresholds for filtering patterns.
● Example: Setting a threshold to only consider sales patterns that show a significant
increase.
User Interface:
● The interface facilitates communication between users and the system, allowing queries,
result exploration, and visualization.
● Example: Allowing a user to query the system for trends in monthly sales and visually
presenting the findings.
This architecture combines domain knowledge, analytical modules, evaluation criteria, and user
interaction to efficiently extract meaningful insights from data.
6. Pattern Evaluation:
● Challenge: Assessing the interestingness of discovered patterns in terms of
representing common knowledge or lacking novelty.
● Patterns need to be interesting and valuable to users, either by providing new
insights or confirming existing knowledge.
These issues highlight the complexity of data mining tasks and the importance of
addressing various challenges to ensure the meaningful extraction and utilization of
knowledge from large datasets.
Types of variables
Dataset
Sample:
● Given the often vast size of the population, a "sample" is a subset of this
universe that is accessible and used for analysis in data mining. It represents a
manageable portion from which we aim to extract information applicable to the
entire population.
● The sample is crucial as it allows for practical analysis without having to process
or examine the entire population. Insights gained from the sample are
extrapolated to make predictions about the larger dataset.
Types of Variables:
Categorical > Nominal Variables:
● Description: Used to categorize objects (e.g., name or color), with
numerical values having no mathematical interpretation.
● Example: Assigning numbers (1, 2, 3, ...) to represent categories without
meaningful arithmetic.
Categorical > Binary Variables:
● Description: A special case of nominal variables with only two possible
values (e.g., true or false, 1 or 0).
Categorical > Ordinal Variables:
● Description: Similar to nominal variables but with values that can be
arranged in a meaningful order (e.g., small, medium, large).
Continuous > Integer Variables:
● Description: Takes genuine integer values, and arithmetic operations have
meaningful interpretations (e.g., 'number of children').
Continuous > Interval-scaled Variables:
● Description: Takes numerical values with equal intervals from a zero point,
but the zero does not imply the absence of the measured characteristic
(e.g., Fahrenheit or Celsius temperature scales).
Continuous > Ratio-scaled Variables:
● Description: Similar to interval-scaled variables, but the zero point reflects
the absence of the measured characteristic (e.g., Kelvin temperature and
molecular weight).
'Ignore' Attribute:
● Description: A third category representing variables of no significance for
the application. They are retained in the dataset but may not contribute to
the analysis (e.g., patient names or serial numbers).
Understanding these types of variables is essential in data mining as they influence the
choice of appropriate analysis methods and help in extracting meaningful patterns from
the data.
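The nominal-versus-ordinal distinction matters when encoding variables as numbers: integer codes preserve order, so they suit ordinal variables, while one-hot encoding avoids implying an order for nominal ones. A small sketch (category names are hypothetical):

```python
# Ordinal: values have a meaningful order, so integer codes preserve it.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = ["medium", "small", "large"]
encoded_sizes = [size_order[s] for s in sizes]
print(encoded_sizes)  # [1, 0, 2] -- comparisons like 0 < 2 are meaningful

# Nominal: no order, so one-hot encoding avoids implying one.
colors = ["red", "blue", "red"]
categories = sorted(set(colors))  # ['blue', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot)  # [[0, 1], [1, 0], [0, 1]]
```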
Dataset:
● The complete set of data available for an application is called a dataset.
● Representation: A dataset is often depicted as a table.
● Record or Instance:
● The set of variable values corresponding to each of the objects is called a
record or an instance.
● Each row in the dataset represents an instance.
● Each column contains the value of one of the variables (attributes) for
each of the instances.
● Example:
● The dataset is an example of labeled data.
● One attribute is given special significance, and the aim is to predict its
value.
● This attribute is given the standard name 'class.'
● Labeling:
● When there is no such significant attribute, we call the data unlabeled.
EmployeeID Age Department Years of Experience Performance Score
103 42 HR 12 Low
104 30 IT 5 High
Explanation:
● Attributes:
● EmployeeID: Unique identifier for each employee.
● Age: Age of the employee.
● Department: Employee's department.
● Years of Experience: Employee's work experience.
● Performance Score: The performance level of the employee.
● Labeling:
● The 'Performance Score' is a labeled attribute, and the goal is to predict its
value.
HouseID Area (sq ft) Bedrooms Bathrooms Year Built Price (USD)
Explanation:
● Attributes:
● HouseID: Unique identifier for each house.
● Area (sq ft): Size of the house.
● Bedrooms: Number of bedrooms.
● Bathrooms: Number of bathrooms.
● Year Built: Year the house was constructed.
● Price (USD): Selling price of the house.
● Labeling:
● The 'Price (USD)' is a labeled attribute, and the goal is to predict its value.
Data Collection Methods:
I. Surveys
II. Interviews
III. Observations
V. Cookies
Example data columns:
Age: 72, 72, 72, 72
Age: 25, 40, 30, 45, 98, 72, 102, 95
Exam Score: 72, 95, 98, 102
Temperature: 20, 22, 25, 15 (labels: Cold, Moderate, Warm, Very Cold)
Data Cleaning > Incorrectly Recorded Data > Solution - Identical Values:
Handling variables where all values are identical.
Example: Identifying a column where all entries are 'Unknown' and considering it for
removal.
Problems Table (Identical Values):
Status (before): Unknown, Unknown, Unknown, Unknown
Action: Remove, Keep, Remove, Remove
Data Cleaning > Incorrectly Recorded Data > Solution - Categorical for
Almost All Identical Values:
Treating a variable as categorical when almost all values are identical,
except for a few.
Example: Recognizing a column where most entries are 'Yes' except for a
few 'No' values.
Problems Table (Categorical for Almost All Identical Values):
Approval (before): Yes, Yes, No, Yes
Approval (after): Yes, Yes, No, Yes
Data Cleaning > Incorrectly Recorded Data > Out of Range Values:
Identifying values outside the normal range for a variable.
Example: Detecting a continuous attribute with most values in the range
200 to 5000, but a few outliers like 22654.8.
Problems Table (Out of Range Values):
Income (before): 3000, 4000, 22654.8, 5000
Income (after): 3000, 4000, 5000, 4500
Data Cleaning > Incorrectly Recorded Data > Out of Range Values >
Handling Outliers:
Addressing outliers that may be genuine values significantly different from
others.
Example: Deciding whether to discard or adjust outliers in medical or
physics data.
Problems Table (Handling Outliers):
Test Score (before): 85, 92, 105, 78
Test Score (after): 85, 92, 95, 78
Country (before): Albania, USA, Albania, Germany
Country (after): Albania, USA, Albania, Germany
Data Cleaning > Incorrectly Recorded Data > Repeated Values > Possible
Interpretations:
Considering potential explanations for abnormally frequent values.
Example: Exploring reasons why 'Albania' might be overly represented in a
country field.
Problems Table (Possible Interpretations):
Country (before): Albania, USA, Albania, Albania
Country (after): Albania, USA, Albania, USA
Data Cleaning > Incorrectly Recorded Data > Repeated Values >
Inconsistencies:
Identifying inconsistencies in abnormally frequent occurrences of values.
Example: Recognizing that a high proportion of recorded ages being 72
may indicate errors in data collection or processing.
Problems Table (Inconsistencies):
Age (before): 72, 28, 72, 72
Age (after): 72, 28, 72, 28
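Abnormally frequent values like the repeated ages of 72 can be flagged automatically; a minimal sketch using `collections.Counter` with a hypothetical frequency cutoff:

```python
from collections import Counter

ages = [72, 28, 72, 72, 35, 72, 41, 72]

counts = Counter(ages)
threshold = 0.5  # hypothetical cutoff: flag values making up over half the column
suspicious = [v for v, c in counts.items() if c / len(ages) > threshold]

print(suspicious)  # [72] -- 72 accounts for 5 of 8 entries
```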
Before (with missing values):
15 20 25
10 (Miss) 22
18 15 30
12 18 (Miss)
After (missing values imputed):
15 20 25
10 18 22
18 15 30
12 18 25
After deleting rows with missing values:
15 20 25
18 15 30
Before:
Country Age
USA 28
(Miss) 45
Germany 72
After (missing value imputed):
Country Age
USA 28
USA 45
Germany 72
Data Cleaning > Missing Values > Reducing the Number of Attributes:
Trimming datasets with numerous attributes to avoid computational
overhead.
Example: Using feature reduction techniques to select the most relevant
attributes for analysis.
Problems Table (Reducing the Number of Attributes):
Before (with missing values):
10 (Miss) 22
18 15 30
12 18 (Miss)
After (reduced to Attribute1 and Attribute3):
Attribute1 Attribute3
15 25
10 22
18 30
12 25
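The attribute-reduction step above amounts to selecting a subset of columns; a minimal sketch over the imputed rows (pandas users would write `df[['Attribute1', 'Attribute3']]`):

```python
# The dataset after imputation (columns: Attribute1, Attribute2, Attribute3).
rows = [
    [15, 20, 25],
    [10, 18, 22],
    [18, 15, 30],
    [12, 18, 25],
]

# Keep only the selected attributes (here: Attribute1 and Attribute3).
keep = [0, 2]
reduced = [[r[j] for j in keep] for r in rows]

print(reduced)  # [[15, 25], [10, 22], [18, 30], [12, 25]]
```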
2. Measures of Central Tendency
Measures of central tendency aim to define the "typical" or central value within a
dataset. Common measures include the mean (average), median (middle value), and
mode (most frequent value). These measures offer valuable insights into the core
characteristics of a dataset.
Example:
For the incomes 30, 35, 40, and 400 (in thousands), the median of 37.5 describes the
typical earner better than the mean of 126.25, which is pulled up by the single extreme
value.
3. Measures of Variability
Measures of variability quantify the spread or dispersion of data points within a dataset.
Key measures encompass the range (difference between largest and smallest values),
variance (average of squared differences from the mean), and standard deviation
(square root of the variance). These measures illuminate the distribution of values
within the dataset.
Example:
In financial analysis, understanding the standard deviation of stock prices helps assess
the level of risk associated with an investment, indicating how spread out the prices are
over time.
4. Measures of Shape
4.1) Skewness
Skewness quantifies the asymmetry of a distribution, indicating the extent to which the
data deviates from perfect symmetry. A skewness score can be positive, negative, or
close to zero, each conveying distinct characteristics about the distribution.
Example:
Household income distributions are typically right-skewed (positive skewness), with a
long tail of high earners pulling the mean above the median.
4.2) Kurtosis
Kurtosis measures the shape of a distribution, focusing on the heaviness of tails relative
to the center. It provides insights into the presence of outliers and extreme values within
the dataset.
Example:
In educational research, examining the kurtosis of test scores can indicate whether
there are a significant number of exceptionally high or low scores, potentially influencing
the overall performance trends.
5. Graphical Representations
Graphical representations are visual tools employed to enhance the understanding of
key features within a dataset. Common types include histograms (frequency
distribution), box plots (quartiles and outliers), and scatter plots (relationships between
variables).
Example:
In climate science, a scatter plot might visualize the relationship between temperature
and sea level rise, providing a clear depiction of any correlations or patterns.
5.1) Histogram
Example:
Analyzing a histogram of patient recovery times in a medical study can reveal whether
the distribution is skewed, indicating variations in treatment effectiveness.
5.2) Box Plot
Box plots visually summarize central tendency, spread, skewness, and outliers within a
dataset, offering a concise overview of key features.
Example:
A box plot of salaries across departments quickly shows which departments have the
widest pay ranges and which contain outliers.
Example:
In social science research, a scatter plot examining the relationship between education
level and income can highlight whether higher education correlates with increased
earnings.
3. Histograms
● Displays the distribution of continuous numerical data.
● Rectangular bars represent ranges of values, with the bar's width indicating the
range and height corresponding to frequency or count.
● Useful for understanding the shape, central tendency, and spread of a dataset.
4. Pie Charts
● Circular graph divided into slices representing the proportion or percentage of
different categories within a whole.
● Each slice's size is proportional to the corresponding value or percentage it
represents.
5. Line Charts
● Represents data that changes over time.
● Shows how data points are connected, creating a visual representation of trends
over time.
6. Scatter Plots
● Used to represent the relationship between two variables.
● Reveals patterns and trends that may not be apparent in other visualizations.
7. Heat Maps
● Graphical representation where values are represented by colors.
● Used to show the distribution of data across different categories.
8. Tree Maps
● Hierarchical representation of data, where each level of the hierarchy is
represented by a rectangle.
● The size of the rectangle corresponds to the value of the data.
Data visualization tools play a crucial role in creating these visual representations,
helping users make sense of complex datasets.