AI351 Lecture 1 - Data Preprocessing
1 Faculty of Computer Science and Engineering, AI Research Group, Ghulam Ishaq Khan Institute of
Engineering Sciences and Technology, Topi, Khyber Pakhtunkhwa, Pakistan; hashim.ali@giki.edu.pk
2 Department of Technology and Software Engineering, University of Europe for Applied Sciences, Berlin,
Germany; hashim.ali@ue-germany.de
* Correspondence: hashim.ali@giki.edu.pk
Abstract: The presence of missing data is an omnipresent challenge in the realm of machine learning, capable of profoundly influencing the outcomes of subsequent analyses and classification endeavors. This article delves into the intricacies of missing data handling, presenting a comprehensive overview of techniques such as data removal, statistical imputation, observation-based imputation, most frequent value imputation, random value imputation, and linear regression-based imputation. Each technique is examined in depth, offering insights into its utility and appropriateness under varying data scenarios. This article underscores the pivotal role played by missing data treatment in the machine learning pipeline, emphasizing that its effective management is not just a technical exercise but a fundamental prerequisite for the development of accurate and reliable machine learning models. As the field of machine learning continues to evolve, mastering the art of missing data handling remains an essential skill for data scientists, ensuring that our models stand on solid data foundations, thereby enhancing the quality and trustworthiness of our analytical and predictive outcomes.
Keywords: Data Preprocessing, Missing data, Data removal, Statistical imputation.
Data pre-processing involves a series of data preparation steps used to remove unwanted noise and filter out necessary data from a dataset. In this lecture, we will learn how to preprocess data by working through seven different ways to handle missing data in Python.

There is a general convention that almost 80% of one's time is spent pre-processing data, whereas only 20% is used to build the actual ML model itself. Hence, we can understand that data pre-processing is a vital step in building intelligent, robust ML models.
Data may not always be complete, i.e., some of the values in the data may be missing or null. Thus, there is a specific set of ways to handle the missing data and make the data complete.
The following example shows that the ‘Years of Experience’ of ‘Employee’ is missing. Also, the ‘Salary (in USD per year)’ of ‘Junior Manager’ is missing.
Figure 1. Displaying the sample dataset generated with missing values for a few cells.
import pandas as pd
import numpy as np

# Sample dataset with missing values (the exact values are illustrative)
df = pd.DataFrame({
    'Employee': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee'],
    'Years of Experience': [15.0, 8.0, 5.0, np.nan],
    'Salary (in USD per year)': [250000.0, 90000.0, np.nan, 40000.0],
})

df.head()
Remove the missing data rows (data points) from the dataset. However, using this technique decreases the available dataset and, in turn, results in a less robust model.
Figure 3. Expected result after removing the rows with missing data.
# Dropping the rows at index 2 and 3, which contain the missing values
dropped_df = df.drop([2, 3], axis=0)

dropped_df
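The snippet above hard-codes the indexes of the incomplete rows. As a more general sketch (on a hypothetical frame mirroring the lecture's dataset), pandas' dropna() removes every row containing at least one missing value without naming indexes:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the lecture's dataset (values are illustrative)
df = pd.DataFrame({
    'Employee': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee'],
    'Years of Experience': [15.0, 8.0, 5.0, np.nan],
    'Salary (in USD per year)': [250000.0, 90000.0, np.nan, 40000.0],
})

# Drop every row that contains at least one NaN
dropped_df = df.dropna(axis=0)
```

The `subset=[...]` and `thresh=...` parameters of dropna() can restrict which missing values actually trigger removal.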
Removing records (rows) with missing data can have some advantages in certain situations, although it should be done judiciously and with a clear understanding of the potential drawbacks.
Figure 4
• Preservation of Data Quality - By removing records with missing data, you ensure that the remaining data points are complete and free from missing values. This can help maintain the overall quality of your analysis.
• No Assumptions Required - Imputation methods make assumptions about the distribution or relationships within the data. Removing records avoids making such assumptions, which can be advantageous when you're uncertain about them.
• Potential for Improved Model Performance - In some cases, removing records with missing data can lead to improved model performance, especially when the missing data is associated with a specific pattern or condition that is not easily imputable.
• Simplified Analysis - Removing incomplete records simplifies the dataset and may make subsequent analysis more efficient.
• Focus on Critical Data - If certain variables with missing data are not critical to your analysis or research question, removing records with missing values in those variables lets you focus on the data that matters.
• Avoidance of Imputation Bias - When missing data is not missing completely at random (MCAR) or missing at random (MAR), imputation methods can introduce bias into your analysis; removing those records avoids this particular risk.
Removing records (rows) with missing data is a common approach to handling missing values, but it also comes with several drawbacks:
• Loss of Information - When you remove records with missing data, you are essentially discarding potentially valuable information. This can lead to a reduction in the size of your dataset, which may adversely affect the performance and generalizability of your models.
• Bias - Removing records with missing data can introduce bias into your analysis, especially if the missing data is not missing completely at random (MCAR). If certain types of data tend to be missing more often for specific groups or under certain conditions, removing those records can skew your results.
• Reduced Statistical Power - As you reduce the size of your dataset by removing records, you may also reduce the statistical power of your analysis. This can make it harder to detect real effects.
• Inefficiency - In cases where only a small percentage of records have missing values, removing them may be inefficient. You might end up with a smaller dataset even though collecting the data was difficult or expensive.
• Loss of Context - When you remove records, you lose the context and potentially important details associated with those observations. This can limit the interpretability of your results.
• Underestimated Variability - Removing records can lead to an underestimation of the variability in the dataset, which may result in overly optimistic conclusions.
• Non-Ignorable Missingness - In some cases, missing data is not missing at random (MAR) or MCAR but follows a specific pattern. Removing records in such cases may not be appropriate, as it can distort the analysis and lead to incorrect conclusions.
In summary, while removing records with missing data can be a quick and simple way to handle missing values, it should be done with caution and a clear understanding of the potential drawbacks. This approach is most suitable when the missing data is missing completely at random (MCAR) and when the loss of data is acceptable given the goals of your analysis. It is essential to consider the nature of the missing data and its implications for your analysis before deciding on this approach. In many cases, alternative methods such as imputation or advanced missing data techniques may be more appropriate to retain as much data as possible and mitigate these drawbacks.
Fill the missing data by taking the mean or median of the available data points. Generally, the median is preferred for filling missing values, as it is not affected as heavily by outliers as the mean. The example below uses the mean; swapping .mean() for .median() gives the median variant.
Figure 5
# Filling each column with their mean values
df['Years of Experience'] = df['Years of Experience'].fillna(df['Years of Experience'].mean())

df['Salary (in USD per year)'] = df['Salary (in USD per year)'].fillna(df['Salary (in USD per year)'].mean())

df
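Since the median is more robust to outliers, a median-based variant (sketched here on a small illustrative frame) fills every numeric column in one call:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column
df = pd.DataFrame({
    'Years of Experience': [15.0, 8.0, np.nan, 2.0],
    'Salary (in USD per year)': [250000.0, 90000.0, 40000.0, np.nan],
})

# df.median() returns one median per column; fillna matches them by column name
filled = df.fillna(df.median(numeric_only=True))
```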
Figure 6
Here are a few common statistical imputation techniques along with brief explanations:
• Mean Imputation - Missing values are replaced with the mean (average) of the observed values for that variable.
• Median Imputation - Missing values are replaced with the median (middle) value of the observed values for that variable, which is less sensitive to outliers than mean imputation.
• Mode Imputation - Missing categorical values are replaced with the mode (most frequently occurring value) of that variable.
• Regression Imputation - Missing values are predicted using regression models based on the relationships between the variable with missing data and other variables in the dataset.
• k-Nearest Neighbors (KNN) Imputation - Missing values are estimated by averaging or weighting the values of the k-nearest neighbors in the dataset based on similarity across the other variables.
• Random Forest Imputation - Missing values are imputed using a Random Forest algorithm, which leverages decision trees to predict missing values based on relationships with the other variables.
• Multiple Imputation - Multiple datasets with imputed values are created, and analyses are performed on each. Results are then combined to account for the uncertainty introduced by the missing data.
• Interpolation and Extrapolation - For time series or spatial data, missing values are estimated based on the values of adjacent time points or neighboring locations, respectively.
• Expectation-Maximization (EM) Imputation - An iterative procedure that estimates missing values and model parameters in a way that maximizes the likelihood of the observed data.
• Hot-Deck Imputation - Missing values are imputed by randomly selecting values from similar records in the dataset, effectively “borrowing” values from other observations.
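The simplest of these techniques can be sketched in a few lines of pandas on a made-up series (for the heavier methods, scikit-learn's sklearn.impute module provides SimpleImputer and KNNImputer, plus the experimental IterativeImputer):

```python
import numpy as np
import pandas as pd

# Made-up series with two gaps
s = pd.Series([10.0, np.nan, 30.0, 30.0, np.nan, 50.0])

mean_filled = s.fillna(s.mean())          # mean imputation
median_filled = s.fillna(s.median())      # median imputation
mode_filled = s.fillna(s.mode().iloc[0])  # mode imputation
interp_filled = s.interpolate()           # linear interpolation for ordered data
```

Note that interpolation fills each gap from its neighbors (here 20.0 and 40.0) rather than with a single global statistic, which is why it suits time series.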
These are just a few examples of statistical imputation techniques, each with its strengths and limitations. The choice of which method to use should depend on the characteristics of your data and the specific research question or analysis you are conducting.
Filling missing values through statistical imputation offers several advantages, making it a widely used approach:
• Preservation of Data Volume - Statistical imputation allows you to retain all records in your dataset, even those with missing values. This preserves the full volume of your data, which can be important for maintaining statistical power and the representativeness of your sample.
• Minimizes Data Loss - Unlike removing records with missing values, imputation minimizes data loss, ensuring that valuable information is not discarded. This is especially important when data are costly or difficult to collect.
• Preserves Patterns and Relationships - Imputation methods aim to preserve patterns and relationships within the data. This means that the imputed values are estimated based on the available data, considering variables' distributions and correlations. This can lead to more accurate representations of the underlying structure.
• Improved Model Performance - Imputing missing values can lead to improved model performance compared to removing records with missing data. Models trained on imputed datasets often have better predictive accuracy and generalizability.
• Reduced Bias - Imputation can reduce bias in your analysis, especially when missing data is not missing completely at random (MCAR) or missing at random (MAR). By retaining records with missing values and providing estimates for those missing values, imputation helps mitigate potential bias introduced by removing records.
• Consistency - Imputation methods ensure that the dataset remains consistent in terms of record count, which can simplify data analysis and downstream processes such as model training and evaluation.
• Flexibility - Various statistical imputation techniques are available, allowing you to choose the method that best suits your data and research question. Common methods include mean imputation, median imputation, regression imputation, and more advanced techniques like k-nearest neighbors (KNN) imputation and multiple imputation.
• Ease of Implementation - Many statistical imputation methods are readily available in data analysis and machine learning libraries, making them easy to implement within an existing workflow.
• Handling Missing Data Patterns - Statistical imputation methods can handle missing data patterns, including missing data in multiple variables. This flexibility is especially useful when dealing with complex datasets with interrelated missing values.
While statistical imputation offers several advantages, it's essential to choose the most appropriate imputation method for your specific dataset and research objectives. The choice should be guided by an understanding of the data's characteristics, the nature of the missingness, and the potential impact on subsequent analyses or models. Additionally, it's important to acknowledge the assumptions made by each imputation method and to perform sensitivity analyses to assess the robustness of your results to imputation choices.
While statistical imputation can be a valuable technique for handling missing data, it is not without drawbacks and limitations. Here are some of the drawbacks associated with it:
• Introduction of Bias - Imputation methods, especially simple ones like mean or median imputation, can introduce bias into the dataset. Imputed values are often estimated based on available data, and if the missing data is not missing completely at random (MCAR) or missing at random (MAR), this can lead to biased imputations.
• Reliance on Assumptions - Imputation methods make assumptions about the distribution and relationships within the data. If these assumptions do not hold true for your dataset, imputed values may be inaccurate and misleading.
• Underestimated Uncertainty - Simple imputation can lead to an underestimation of the variability in the data. This can result in overly confident or optimistic model predictions and may not accurately reflect the true uncertainty in the data.
• Distributional Assumptions - Some imputation methods assume that the data follows a normal distribution. If your data is not normally distributed, imputation methods like mean or median imputation may not be suitable.
• Complexity and Computational Cost - More advanced imputation techniques, such as multiple imputation or model-based methods, can be computationally expensive and complex to implement, particularly for large datasets.
• Propagation of Errors - Imputed values may contain errors or noise, which can propagate through subsequent analyses or modeling steps. This can lead to incorrect conclusions.
• Difficulty with Categorical Data - Imputing missing values in categorical variables can be challenging. Some imputation methods may not handle categorical data well and may require dedicated techniques.
• Loss of Transparency - Imputation can make the dataset less transparent, as imputed values are not actual observations. This can complicate data interpretation and make results harder to audit.
• Inability to Capture Complex Missing Data Mechanisms - While statistical imputation methods can handle common missing data mechanisms like missing at random (MAR), they may not effectively capture more complex patterns of missingness.
• Imputation Uncertainty - Imputed values come with an inherent level of uncertainty, which is often not accounted for in downstream analyses. Ignoring this uncertainty can lead to overconfident conclusions.
To mitigate these drawbacks, it's crucial to carefully consider the nature of your data and the missing data mechanism when choosing an imputation method. Additionally, conducting sensitivity analyses, evaluating the impact of imputation on your results, and considering multiple imputation techniques can help address some of these limitations and improve the reliability of your findings.
Manually fill in the missing data from observation. This may sometimes be possible for small datasets, but for larger datasets it is very difficult to do.
Fill in the missing value using the most repeated value in the dataset. This is done when most of the data is repeated and there is good reasoning to do so. Since there are no repeated values in the example, we can fill it with any one of the numbers in the respective column.
Figure 8. Results of filling missing values through the most repeated value in the data
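For a categorical column, where this strategy is most natural, the most-frequent-value fill is one line with mode() (the department data here is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one gap
dept = pd.Series(['Sales', 'Sales', np.nan, 'HR'])

# Replace the missing entry with the most repeated value
dept = dept.fillna(dept.mode().iloc[0])
```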
1.1.5. Fill in with random value within the range of available data
Take the given range of data points and fill in the data by randomly selecting a value within that range.
Figure 9. Results of filling in with random value within the range of the available data.
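A minimal sketch of this idea, assuming made-up experience values and a seeded random generator so the fill is reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the fill is reproducible

# Made-up column with one gap
years = pd.Series([2.0, 8.0, np.nan, 15.0])

# Draw replacements uniformly from the observed min-max range
mask = years.isna()
years[mask] = rng.uniform(years.min(), years.max(), size=mask.sum())
```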
Use regression analysis to find the most probable data point for filling in the dataset.
Figure 10. Results of filling the missing data through linear regression.
from sklearn.linear_model import LinearRegression

# Use only the rows where both columns are present
known = df.dropna()

regr = LinearRegression()

# Here the target is the Salary and the feature is Years of Experience
regr.fit(known[['Years of Experience']], known['Salary (in USD per year)'])

regr.predict([[3]])
Figure 11. Results from Linear Regression for finding the salary for 3 years of experience.
Therefore, the salary for 3 years of experience predicted by regression is 60000. Now, find the missing years of experience from the salary in the same way.
# Use only the rows where both columns are present
known = df.dropna()

regr = LinearRegression()

# Here the target is the Years of Experience and the feature is Salary
regr.fit(known[['Salary (in USD per year)']], known['Years of Experience'])

regr.predict([[40000.0]])
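A self-contained sketch of writing the regression estimate back into the frame (the numbers are made up and chosen to lie on a line; np.polyfit with degree 1 fits the same straight line as simple linear regression):

```python
import numpy as np
import pandas as pd

# Made-up frame: the complete rows lie exactly on Salary = 20000 * Years
df = pd.DataFrame({
    'Years of Experience': [1.0, 2.0, np.nan, 5.0],
    'Salary': [20000.0, 40000.0, 60000.0, 100000.0],
})

# Fit Salary = slope * Years + intercept on the complete rows
known = df.dropna()
slope, intercept = np.polyfit(known['Years of Experience'], known['Salary'], 1)

# Invert the fitted line to recover the missing Years of Experience from Salary
missing = df['Years of Experience'].isna()
df.loc[missing, 'Years of Experience'] = (df.loc[missing, 'Salary'] - intercept) / slope
```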
Figure 12. Results from the Linear Regression for finding the missing years of experience based on
salary.
2. Conclusion
In conclusion, handling missing data is a crucial aspect of machine learning that cannot be overlooked. Throughout this article, we have explored various techniques for addressing missing data, including the removal of rows with missing values, statistical imputation, observation-based imputation, imputing with the most frequent values, random value imputation, and even leveraging the power of linear regression. These methods not only ensure data completeness but also play a pivotal role in enhancing the accuracy and reliability of downstream analysis and classification techniques. The impact of missing data extends far beyond mere data manipulation; it influences the very foundations of our machine learning models and, consequently, the quality of decisions and insights drawn from them. As data scientists and machine learning practitioners, it is our responsibility to master these techniques, as the effective handling of missing data is an indispensable skill in our journey toward more accurate and robust machine learning solutions.