AI351 Lecture 1 - Data Preprocessing
1 Faculty of Computer Science and Engineering, AI Research Group, Ghulam Ishaq Khan Institute of
Engineering Sciences and Technology, Topi, Khyber Pakhtunkhwa, Pakistan; hashim.ali@giki.edu.pk
2 Department of Technology and Software Engineering, University of Europe for Applied Sciences, Berlin,
Germany; hashim.ali@ue-germany.de
* Correspondence: hashim.ali@giki.edu.pk
Abstract: The presence of missing data is an omnipresent challenge in the realm of machine learning, capable of profoundly influencing the outcomes of subsequent analyses and classification endeavors. This article delves into the intricacies of missing data handling, presenting a comprehensive overview of techniques such as data removal, statistical imputation, observation-based imputation, most frequent value imputation, random value imputation, and linear regression-based imputation. Each technique is examined in depth, offering insights into its utility and appropriateness under varying data scenarios. This article underscores the pivotal role played by missing data treatment in the machine learning pipeline, emphasizing that its effective management is not just a technical exercise but a fundamental prerequisite for the development of accurate and reliable machine learning models. As the field of machine learning continues to evolve, mastering the art of missing data handling remains an essential skill for data scientists, ensuring that our models stand on solid data foundations, thereby enhancing the quality and trustworthiness of our analytical and predictive outcomes.
Keywords: Data Preprocessing, Missing data, Data removal, Statistical imputation.
Data pre-processing involves a series of data preparation steps used to remove unwanted noise and filter out necessary data from a dataset. In this lecture, we will learn how to preprocess data by working through seven different ways to handle missing data in Python.

There is a general convention that almost 80% of one's time is spent pre-processing data, whereas only 20% is used to build the actual ML model itself. Hence, we can understand that data pre-processing is a vital step in building intelligent, robust ML models.
Data may not always be complete, i.e., some of the values in the data may be missing or null. Thus, there is a specific set of ways to handle the missing data and make the data complete.
The following example shows that the ‘Years of Experience’ of ‘Employee’ is missing. Also, the ‘Salary (in USD per year)’ of ‘Junior Manager’ is missing.
Figure 1. Displaying the sample dataset generated with missing values for a few cells.
import pandas as pd
import numpy as np

# Sample dataset with missing values (the exact values are illustrative)
df = pd.DataFrame({
    'Employee': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee'],
    'Years of Experience': [15.0, 8.0, 5.0, np.nan],
    'Salary (in USD per year)': [250000.0, 90000.0, np.nan, 40000.0],
})

df.head()
Remove the missing data rows (data points) from the dataset. However, using this technique decreases the available dataset and, in turn, results in a less robust model.
Figure 3. Expected result after removing the rows with missing data.
# Dropping the rows at index 2 and 3, which contain the missing values
dropped_df = df.drop([2, 3], axis=0)

dropped_df
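The snippet above hard-codes the indexes of the incomplete rows. As a more general sketch (on a hypothetical frame mirroring the lecture's dataset), pandas' dropna() removes every row containing at least one missing value without naming indexes:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the lecture's dataset (values are illustrative)
df = pd.DataFrame({
    'Employee': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee'],
    'Years of Experience': [15.0, 8.0, 5.0, np.nan],
    'Salary (in USD per year)': [250000.0, 90000.0, np.nan, 40000.0],
})

# Drop every row that contains at least one NaN
dropped_df = df.dropna(axis=0)
```

The `subset=[...]` and `thresh=...` parameters of dropna() can restrict which missing values actually trigger removal.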
Removing records (rows) with missing data can have some advantages in certain situations, although it should be done judiciously and with a clear understanding of the potential drawbacks.
Figure 4
• Preservation of Data Quality - By removing records with missing data, you ensure that the remaining data points are complete and free from missing values. This can help maintain the overall quality of your analysis.
• No Assumptions Required - Imputation methods make assumptions about the distribution or relationships within the data. Removing records avoids making such assumptions, which can be advantageous when you're uncertain about them.
• Potential for Improved Model Performance - In some cases, removing records with missing data can lead to improved model performance, especially when the missing data is associated with a specific pattern or condition that is not easily imputable.
• Simplified Analysis - Removing incomplete records simplifies the dataset and may make subsequent analysis more efficient.
• Focus on Critical Data - If certain variables with missing data are not critical to your analysis or research question, removing records with missing values in those variables lets you focus on the data that matters.
• Avoidance of Imputation Bias - When missing data is not missing completely at random (MCAR) or missing at random (MAR), imputation methods can introduce bias into your analysis; removing those records avoids this particular risk.
Removing records (rows) with missing data is a common approach to handling missing values, but it also comes with several drawbacks:
• Loss of Information - When you remove records with missing data, you are essentially discarding potentially valuable information. This can lead to a reduction in the size of your dataset, which may adversely affect the performance and generalizability of your models.
• Bias - Removing records with missing data can introduce bias into your analysis, especially if the missing data is not missing completely at random (MCAR). If certain types of data tend to be missing more often for specific groups or under certain conditions, removing those records can skew your results.
• Reduced Statistical Power - As you reduce the size of your dataset by removing records, you may also reduce the statistical power of your analysis. This can make it harder to detect real effects.
• Inefficiency - In cases where only a small percentage of records have missing values, removing them may be inefficient. You might end up with a smaller dataset even though collecting the data was difficult or expensive.
• Loss of Context - When you remove records, you lose the context and potentially important details associated with those observations. This can limit the interpretability of your results.
• Underestimated Variability - Removing records can lead to an underestimation of the variability in the dataset, which may result in overly optimistic conclusions.
• Non-Ignorable Missingness - In some cases, missing data is not missing at random (MAR) or MCAR but follows a specific pattern. Removing records in such cases may not be appropriate, as it can distort the analysis and lead to incorrect conclusions.
In summary, while removing records with missing data can be a quick and simple way to handle missing values, it should be done with caution and a clear understanding of the potential drawbacks. This approach is most suitable when the missing data is missing completely at random (MCAR) and when the loss of data is acceptable given the goals of your analysis. It is essential to consider the nature of the missing data and its implications for your analysis before deciding on this approach. In many cases, alternative methods such as imputation or advanced missing data techniques may be more appropriate to retain as much data as possible and mitigate these drawbacks.
Fill the missing data by taking the mean or median of the available data points. Generally, the median is preferred for filling missing values, as it is not affected as heavily by outliers as the mean. The example below uses the mean; swapping .mean() for .median() gives the median variant.
Figure 5
# Filling each column with their mean values
df['Years of Experience'] = df['Years of Experience'].fillna(df['Years of Experience'].mean())

df['Salary (in USD per year)'] = df['Salary (in USD per year)'].fillna(df['Salary (in USD per year)'].mean())

df
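Since the median is more robust to outliers, a median-based variant (sketched here on a small illustrative frame) fills every numeric column in one call:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column
df = pd.DataFrame({
    'Years of Experience': [15.0, 8.0, np.nan, 2.0],
    'Salary (in USD per year)': [250000.0, 90000.0, 40000.0, np.nan],
})

# df.median() returns one median per column; fillna matches them by column name
filled = df.fillna(df.median(numeric_only=True))
```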
Figure 6
Here are a few common statistical imputation techniques along with brief explanations:
• Mean Imputation - Missing values are replaced with the mean (average) of the observed values for that variable.
• Median Imputation - Missing values are replaced with the median (middle) value of the observed values for that variable, which is less sensitive to outliers than mean imputation.
• Mode Imputation - Missing categorical values are replaced with the mode (most frequently occurring value) of that variable.
• Regression Imputation - Missing values are predicted using regression models based on the relationships between the variable with missing data and other variables in the dataset.
• k-Nearest Neighbors (KNN) Imputation - Missing values are estimated by averaging or weighting the values of the k-nearest neighbors in the dataset based on similarity across the other variables.
• Random Forest Imputation - Missing values are imputed using a Random Forest algorithm, which leverages decision trees to predict missing values based on relationships with the other variables.
• Multiple Imputation - Multiple datasets with imputed values are created, and analyses are performed on each. Results are then combined to account for the uncertainty introduced by the missing data.
• Interpolation and Extrapolation - For time series or spatial data, missing values are estimated based on the values of adjacent time points or neighboring locations, respectively.
• Expectation-Maximization (EM) Imputation - An iterative procedure that estimates missing values and model parameters in a way that maximizes the likelihood of the observed data.
• Hot-Deck Imputation - Missing values are imputed by randomly selecting values from similar records in the dataset, effectively “borrowing” values from other observations.
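The simplest of these techniques can be sketched in a few lines of pandas on a made-up series (for the heavier methods, scikit-learn's sklearn.impute module provides SimpleImputer and KNNImputer, plus the experimental IterativeImputer):

```python
import numpy as np
import pandas as pd

# Made-up series with two gaps
s = pd.Series([10.0, np.nan, 30.0, 30.0, np.nan, 50.0])

mean_filled = s.fillna(s.mean())          # mean imputation
median_filled = s.fillna(s.median())      # median imputation
mode_filled = s.fillna(s.mode().iloc[0])  # mode imputation
interp_filled = s.interpolate()           # linear interpolation for ordered data
```

Note that interpolation fills each gap from its neighbors (here 20.0 and 40.0) rather than with a single global statistic, which is why it suits time series.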
These are just a few examples of statistical imputation techniques, each with its strengths and limitations. The choice of which method to use should depend on the characteristics of your data and the specific research question or analysis you are conducting.
Filling missing values through statistical imputation offers several advantages, making it a widely used approach:
• Preservation of Data Volume - Statistical imputation allows you to retain all records in your dataset, even those with missing values. This preserves the full volume of your data, which can be important for maintaining statistical power and the representativeness of your sample.
• Minimizes Data Loss - Unlike removing records with missing values, imputation minimizes data loss, ensuring that valuable information is not discarded. This is especially important when data are costly or difficult to collect.
• Preserves Patterns and Relationships - Imputation methods aim to preserve patterns and relationships within the data. This means that the imputed values are estimated based on the available data, considering variables' distributions and correlations. This can lead to more accurate representations of the underlying structure.
• Improved Model Performance - Imputing missing values can lead to improved model performance compared to removing records with missing data. Models trained on imputed datasets often have better predictive accuracy and generalizability.
• Reduced Bias - Imputation can reduce bias in your analysis, especially when missing data is not missing completely at random (MCAR) or missing at random (MAR). By retaining records with missing values and providing estimates for those missing values, imputation helps mitigate potential bias introduced by removing records.
• Consistency - Imputation methods ensure that the dataset remains consistent in terms of record count, which can simplify data analysis and downstream processes such as model training and evaluation.
• Flexibility - Various statistical imputation techniques are available, allowing you to choose the method that best suits your data and research question. Common methods include mean imputation, median imputation, regression imputation, and more advanced techniques like k-nearest neighbors (KNN) imputation and multiple imputation.
• Ease of Implementation - Many statistical imputation methods are readily available in data analysis and machine learning libraries, making them easy to implement within an existing workflow.
• Handling Missing Data Patterns - Statistical imputation methods can handle missing data patterns, including missing data in multiple variables. This flexibility is especially useful when dealing with complex datasets with interrelated missing values.
While statistical imputation offers several advantages, it's essential to choose the most appropriate imputation method for your specific dataset and research objectives. The choice should be guided by an understanding of the data's characteristics, the nature of the missingness, and the potential impact on subsequent analyses or models. Additionally, it's important to acknowledge the assumptions made by each imputation method and to perform sensitivity analyses to assess the robustness of your results to imputation choices.
While statistical imputation can be a valuable technique for handling missing data, it is not without drawbacks and limitations. Here are some of the drawbacks associated with it:
• Introduction of Bias - Imputation methods, especially simple ones like mean or median imputation, can introduce bias into the dataset. Imputed values are often estimated based on available data, and if the missing data is not missing completely at random (MCAR) or missing at random (MAR), this can lead to biased imputations.
• Reliance on Assumptions - Imputation methods make assumptions about the distribution and relationships within the data. If these assumptions do not hold true for your dataset, imputed values may be inaccurate and misleading.
• Underestimated Uncertainty - Simple imputation can lead to an underestimation of the variability in the data. This can result in overly confident or optimistic model predictions and may not accurately reflect the true uncertainty in the data.
• Distributional Assumptions - Some imputation methods assume that the data follows a normal distribution. If your data is not normally distributed, imputation methods like mean or median imputation may not be suitable.
• Complexity and Computational Cost - More advanced imputation techniques, such as multiple imputation or model-based methods, can be computationally expensive and complex to implement, particularly for large datasets.
• Propagation of Errors - Imputed values may contain errors or noise, which can propagate through subsequent analyses or modeling steps. This can lead to incorrect conclusions.
• Difficulty with Categorical Data - Imputing missing values in categorical variables can be challenging. Some imputation methods may not handle categorical data well and may require dedicated techniques.
• Loss of Transparency - Imputation can make the dataset less transparent, as imputed values are not actual observations. This can complicate data interpretation and make results harder to audit.
• Inability to Capture Complex Missing Data Mechanisms - While statistical imputation methods can handle common missing data mechanisms like missing at random (MAR), they may not effectively capture more complex patterns of missingness.
• Imputation Uncertainty - Imputed values come with an inherent level of uncertainty, which is often not accounted for in downstream analyses. Ignoring this uncertainty can lead to overconfident conclusions.
To mitigate these drawbacks, it's crucial to carefully consider the nature of your data and the missing data mechanism when choosing an imputation method. Additionally, conducting sensitivity analyses, evaluating the impact of imputation on your results, and considering multiple imputation techniques can help address some of these limitations and improve the reliability of your findings.
Manually fill in the missing data from observation. This may sometimes be possible for small datasets, but for larger datasets it is very difficult to do.
Fill in the missing value using the most repeated value in the dataset. This is done when most of the data is repeated and there is good reasoning to do so. Since there are no repeated values in the example, we can fill it with any one of the numbers in the respective column.
Figure 8. Results of filling missing values through the most repeated value in the data
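For a categorical column, where this strategy is most natural, the most-frequent-value fill is one line with mode() (the department data here is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one gap
dept = pd.Series(['Sales', 'Sales', np.nan, 'HR'])

# Replace the missing entry with the most repeated value
dept = dept.fillna(dept.mode().iloc[0])
```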
1.1.5. Fill in with random value within the range of available data
Take the given range of data points and fill in the data by randomly selecting a value within that range.
Figure 9. Results of filling in with random value within the range of the available data.
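A minimal sketch of this idea, assuming made-up experience values and a seeded random generator so the fill is reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the fill is reproducible

# Made-up column with one gap
years = pd.Series([2.0, 8.0, np.nan, 15.0])

# Draw replacements uniformly from the observed min-max range
mask = years.isna()
years[mask] = rng.uniform(years.min(), years.max(), size=mask.sum())
```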
Use regression analysis to find the most probable data point for filling in the dataset.
Figure 10. Results of filling the missing data through linear regression.
from sklearn.linear_model import LinearRegression

# Use only the rows where both columns are present
known = df.dropna()

regr = LinearRegression()

# Here the target is the Salary and the feature is Years of Experience
regr.fit(known[['Years of Experience']], known['Salary (in USD per year)'])

regr.predict([[3]])
Figure 11. Results from Linear Regression for finding the salary for 3 years of experience.
Therefore, the salary for 3 years of experience predicted by regression is 60000. Now, find the missing years of experience from the salary in the same way.
# Use only the rows where both columns are present
known = df.dropna()

regr = LinearRegression()

# Here the target is the Years of Experience and the feature is Salary
regr.fit(known[['Salary (in USD per year)']], known['Years of Experience'])

regr.predict([[40000.0]])
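A self-contained sketch of writing the regression estimate back into the frame (the numbers are made up and chosen to lie on a line; np.polyfit with degree 1 fits the same straight line as simple linear regression):

```python
import numpy as np
import pandas as pd

# Made-up frame: the complete rows lie exactly on Salary = 20000 * Years
df = pd.DataFrame({
    'Years of Experience': [1.0, 2.0, np.nan, 5.0],
    'Salary': [20000.0, 40000.0, 60000.0, 100000.0],
})

# Fit Salary = slope * Years + intercept on the complete rows
known = df.dropna()
slope, intercept = np.polyfit(known['Years of Experience'], known['Salary'], 1)

# Invert the fitted line to recover the missing Years of Experience from Salary
missing = df['Years of Experience'].isna()
df.loc[missing, 'Years of Experience'] = (df.loc[missing, 'Salary'] - intercept) / slope
```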
Figure 12. Results from the Linear Regression for finding the missing years of experience based on
salary.
2. Conclusion
In conclusion, handling missing data is a crucial aspect of machine learning that cannot be overlooked. Throughout this article, we have explored various techniques for addressing missing data, including the removal of rows with missing values, statistical imputation, observation-based imputation, imputing with the most frequent values, random value imputation, and even leveraging the power of linear regression. These methods not only ensure data completeness but also play a pivotal role in enhancing the accuracy and reliability of downstream analysis and classification techniques. The impact of missing data extends far beyond mere data manipulation; it influences the very foundations of our machine learning models and, consequently, the quality of decisions and insights drawn from them. As data scientists and machine learning practitioners, it is our responsibility to master these techniques, as the effective handling of missing data is an indispensable skill in our journey toward more accurate and robust machine learning solutions.