
Lecture 1

Navigating the Missing Data Maze in Machine Learning: Techniques and Impact
Raja Hashim Ali 1,2 *

1 Faculty of Computer Science and Engineering, AI Research Group, Ghulam Ishaq Khan Institute of
Engineering Sciences and Technology, Topi, Khyber Pakhtunkhwa, Pakistan; hashim.ali@giki.edu.pk
2 Department of Technology and Software Engineering, University of Europe for Applied Sciences, Berlin,
Germany; hashim.ali@ue-germany.de
* Correspondence: hashim.ali@giki.edu.pk

Abstract: The presence of missing data is an omnipresent challenge in the realm of machine learning, capable of profoundly influencing the outcomes of subsequent analyses and classification endeavors. This article delves into the intricacies of missing data handling, presenting a comprehensive exploration of diverse techniques such as data removal, statistical imputation, observation-based imputation, most frequent value imputation, random value imputation, and linear regression-based imputation. Each technique is examined in depth, offering insights into its utility and appropriateness under varying data scenarios. This article underscores the pivotal role played by missing data treatment in the machine learning pipeline, emphasizing that its effective management is not just a technical exercise but a fundamental prerequisite for the development of accurate and reliable machine learning models. As the field of machine learning continues to evolve, mastering the art of missing data handling remains an essential skill for data scientists, ensuring that our models stand on solid data foundations, thereby enhancing the quality and trustworthiness of our analytical and predictive outcomes.

Keywords: Data Preprocessing, Missing data, Data removal, Statistical imputation

1. Data Preprocessing in Python: Handling Missing Data

Data pre-processing involves a series of data preparation steps used to remove unwanted noise and extract the necessary data from a dataset. In this lecture, we will learn how to preprocess data by reading about six different ways to handle missing data in Python.

There is a general convention that almost 80% of one's time is spent pre-processing data, whereas only 20% is used to build the actual ML model itself. Hence, we can understand that data pre-processing is a vital step in building intelligent, robust ML models.

1.1. Techniques For Handling Missing Data

Data may not always be complete, i.e., some of the values in the data may be missing or null. Thus, there is a specific set of ways to handle the missing data and make the data complete.

The following example shows that the 'Years of Experience' of 'Employee' is missing. Also, the 'Salary (in USD per year)' of 'Junior Manager' is missing.

Figure 1. Displaying the sample dataset generated with missing values for a few cells.
import pandas as pd

# Creating the dataframe as shown above
df = pd.DataFrame({'Job Position': ['CEO', 'Senior Manager', 'Junior Manager',
                                    'Employee', 'Assistant Staff'],
                   'Years of Experience': [5, 4, 3, None, 1],
                   'Salary': [100000, 80000, None, 40000, 20000]})

# Viewing the contents of the dataframe
df.head()

Figure 2. Results as displayed after creating the table in Jupyter Notebook.

Some of the ways to handle missing data are listed below:

1.1.1. Data Removal

Remove the rows (data points) with missing data from the dataset. However, using this technique will shrink the available dataset and, in turn, reduce the robustness of the resulting model if the dataset is small to begin with.

Figure 3. Expected result after removing the rows with missing data.
# Dropping the 2nd and 3rd index
dropped_df = df.drop([2, 3], axis=0)

# Viewing the dataframe
dropped_df
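In practice, the affected rows do not have to be located by hand. As a minimal sketch, pandas' dropna() finds and removes them directly (the subset argument shown here restricts the check to a chosen column):

# Dropping every row that contains at least one missing value
dropped_df = df.dropna()

# Dropping rows only when a specific column is missing
dropped_df = df.dropna(subset=['Salary'])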

Removing records (rows) with missing data can have some advantages in certain situations, although it should be done judiciously and with a clear understanding of the potential trade-offs. Here are some advantages:


Figure 4

• Simplicity - Removing records with missing data is a straightforward and easy-to-implement approach. It doesn't require complex imputation methods or handling missing value patterns.
• Preservation of Data Quality - By removing records with missing data, you ensure that the remaining data points are complete and free from missing values. This can lead to cleaner and more reliable analyses.
• No Imputation Assumptions - Imputing missing values often involves making assumptions about the distribution or relationships within the data. Removing records avoids making such assumptions, which can be advantageous when you're uncertain about the validity of imputation assumptions.
• Potential for Improved Model Performance - In some cases, removing records with missing data can lead to improved model performance, especially when the missing data is associated with a specific pattern or condition that is not easily imputable.
• Reduced Complexity - In datasets with a large number of missing values, imputation methods can introduce complexity and computational overhead. Removing records simplifies the dataset and may make subsequent analysis more efficient.
• Focus on Critical Data - If certain variables with missing data are not critical to your analysis or research question, removing records with missing values in those variables allows you to focus on the core variables of interest.
• Avoidance of Bias - When missing data is not missing completely at random (MCAR) or missing at random (MAR), imputation methods can introduce bias into your analysis. Removing records eliminates this potential source of bias.

Removing records (rows) with missing data is a common approach to handling missing values in a dataset, but it comes with some significant drawbacks:

• Loss of Information - When you remove records with missing data, you are essentially discarding potentially valuable information. This can lead to a substantial reduction in the size of your dataset, which may adversely affect the performance and representativeness of your machine learning model.
• Bias - Removing records with missing data can introduce bias into your analysis, especially if the missing data is not missing completely at random (MCAR). If certain types of data tend to be missing more often for specific groups or under certain conditions, removing those records may bias your model's predictions.
• Reduced Statistical Power - As you reduce the size of your dataset by removing records, you may also reduce the statistical power of your analysis. This can make it harder to detect meaningful patterns or relationships in the data.
• Inefficiency - In cases where only a small percentage of records have missing values, removing them may be inefficient. You might end up with a smaller dataset that requires retraining your model, which can be time-consuming and computationally expensive.
• Loss of Context - Each record in a dataset represents a real-world observation or entity. When you remove records, you lose the context and potentially important details associated with those observations. This can limit the interpretability of your results.
• Underestimating Variability - Removing records with missing data can lead to an underestimation of the variability in the dataset, which may result in overly optimistic estimates of model performance.
• Non-Ignorable Missingness - In some cases, missing data is neither missing at random (MAR) nor missing completely at random (MCAR) but follows a specific pattern. Removing records in such cases may not be appropriate, as it can distort the analysis and lead to incorrect conclusions.

In summary, while removing records with missing data can be a quick and simple way to handle missing values, it should be done with caution and a clear understanding of the potential drawbacks. It is essential to consider the nature of the missing data and its implications for your analysis before deciding on this approach. Data removal is most suitable when the data is missing completely at random (MCAR) and when the loss of data is acceptable given the goals of your analysis. In many other cases, alternative methods such as imputation or advanced missing data techniques may be more appropriate, retaining as much data as possible while addressing missing values effectively.

1.1.2. Fill missing values through statistical imputation

Fill the missing data with the mean or median of the available data points. Generally, the median is preferred because, unlike the mean, it is not heavily affected by outliers. In the example below, however, the missing data is filled with the mean of each column; substituting median() for mean() gives the median-based variant.

Figure 5
# Filling each column with its mean value
df['Years of Experience'] = df['Years of Experience'].fillna(df['Years of Experience'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Viewing the dataframe
df
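The median-based variant mentioned above is a one-line change per column; a minimal sketch, assuming the original df before the mean fill has been applied:

# Filling each column with its median value instead of the mean
df['Years of Experience'] = df['Years of Experience'].fillna(df['Years of Experience'].median())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())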

Figure 6

Here are a few common statistical imputation techniques along with brief explanations:

• Mean Imputation - Missing values are replaced with the mean (average) of the observed values for that variable.
• Median Imputation - Missing values are replaced with the median (middle) value of the observed values for that variable, which is less sensitive to outliers than mean imputation.
• Mode Imputation - Missing categorical values are replaced with the mode (most frequent category) of the observed values for that variable.
• Regression Imputation - Missing values are predicted using regression models based on the relationships between the variable with missing data and other variables in the dataset.
• k-Nearest Neighbors (KNN) Imputation - Missing values are estimated by averaging or weighting the values of the k-nearest neighbors in the dataset based on similarity in other variables.
• Random Forest Imputation - Missing values are imputed using a Random Forest algorithm, which leverages decision trees to predict missing values based on relationships in the data.
• Multiple Imputation - Multiple datasets with imputed values are created, and analyses are performed on each. Results are then combined to account for the uncertainty associated with imputation.
• Interpolation and Extrapolation - For time series or spatial data, missing values are estimated based on the values of adjacent time points or neighboring locations, respectively.
• Expectation-Maximization (EM) Imputation - A statistical algorithm that iteratively estimates missing values and model parameters in a way that maximizes the likelihood of the observed data.
• Hot-Deck Imputation - Missing values are imputed by randomly selecting values from similar records in the dataset, effectively "borrowing" values from other observations.

These are just a few examples of statistical imputation techniques, each with its strengths and limitations. The choice of which method to use should depend on the characteristics of your data and the specific research question or analysis you are conducting.
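Several of these techniques are available off the shelf in scikit-learn. The following is a minimal sketch, not a definitive recipe, applied to the numeric columns of the running example; note that IterativeImputer (scikit-learn's regression-style imputer) is experimental and must be enabled explicitly:

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

num_cols = ['Years of Experience', 'Salary']
X = df[num_cols]

# Mean imputation: fill each column with its observed mean
mean_imputed = SimpleImputer(strategy='mean').fit_transform(X)

# KNN imputation: estimate from the k most similar complete rows
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Regression-style imputation: model each column from the others
iter_imputed = IterativeImputer(random_state=0).fit_transform(X)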

Filling missing values through statistical imputation offers several advantages, making it a valuable approach for handling missing data in many situations:

• Preservation of Data Volume - Statistical imputation allows you to retain all records in your dataset, even those with missing values. This preserves the full volume of your data, which can be important for maintaining statistical power and the representativeness of your analysis.
• Minimizes Data Loss - Unlike removing records with missing values, imputation minimizes data loss, ensuring that valuable information is not discarded. This is especially beneficial when dealing with limited data resources.
• Preservation of Patterns and Relationships - Statistical imputation methods aim to preserve patterns and relationships within the data. This means that the imputed values are estimated based on the available data, considering variables' distributions and correlations. This can lead to more accurate representations of the underlying data structure.
• Improved Model Performance - Imputing missing values can lead to improved model performance compared to removing records with missing data. Models trained on imputed datasets often have better predictive accuracy and generalizability.
• Reduced Bias - Imputation can reduce bias in your analysis, especially when missing data is not missing completely at random (MCAR) or missing at random (MAR). By retaining records with missing values and providing estimates for those missing values, imputation helps mitigate potential bias introduced by removing records.
• Consistency - Imputation methods ensure that the dataset remains consistent in terms of record count, which can simplify data analysis and downstream processes such as visualization, reporting, and model deployment.
• Flexibility - Various statistical imputation techniques are available, allowing you to choose the method that best suits your data and research question. Common methods include mean imputation, median imputation, regression imputation, and more advanced techniques like k-nearest neighbors (KNN) imputation and multiple imputation.
• Ease of Implementation - Many statistical imputation methods are readily available in data analysis and machine learning libraries, making them easy to implement within your data pipeline.
• Handling Missing Data Patterns - Statistical imputation methods can handle missing data patterns, including missing data in multiple variables. This flexibility is especially useful when dealing with complex datasets with interrelated missing values.

While statistical imputation offers several advantages, it's essential to choose the most appropriate imputation method for your specific dataset and research objectives. The choice should be guided by an understanding of the data's characteristics, the nature of the missingness, and the potential impact on subsequent analyses or models. Additionally, it's important to acknowledge the assumptions made by each imputation method and to perform sensitivity analyses to assess the robustness of your results to imputation choices.

While statistical imputation can be a valuable technique for handling missing data, it is not without drawbacks and limitations. Here are some of the drawbacks associated with filling missing values through statistical imputation:

• Introduction of Bias - Imputation methods, especially simple ones like mean or median imputation, can introduce bias into the dataset. Imputed values are often estimated based on available data, and if the missing data is not missing completely at random (MCAR) or missing at random (MAR), this can lead to biased imputations.
• Assumption Sensitivity - Different imputation methods make different assumptions about the distribution and relationships within the data. If these assumptions do not hold true for your dataset, imputed values may be inaccurate and misleading.
• Underestimation of Variability - Imputing missing values can lead to underestimation of the variability in the data. This can result in overly confident or optimistic model predictions and may not accurately reflect the true uncertainty in the data.
• Inappropriate Handling of Non-Normal Data - Some statistical imputation methods assume that the data follows a normal distribution. If your data is not normally distributed, imputation methods like mean or median imputation may not be suitable.
• Complexity and Computational Cost - More advanced imputation techniques, such as regression imputation or k-nearest neighbors (KNN) imputation, can be computationally expensive and complex to implement, particularly for large datasets.
• Propagation of Errors - Imputed values may contain errors or noise, which can propagate through subsequent analyses or modeling steps. This can lead to incorrect conclusions and predictions.
• Difficulty with Categorical Data - Imputing missing values in categorical variables can be challenging. Some imputation methods may not handle categorical data well, and choosing appropriate imputed values can be subjective.
• Loss of Transparency - Imputation can make the dataset less transparent, as imputed values are not actual observations. This can complicate data interpretation and make it harder to understand the true nature of the data.
• Inability to Capture Complex Missing Data Mechanisms - While statistical imputation methods can handle common missing data mechanisms like missing at random (MAR), they may not effectively capture more complex patterns of missingness.
• Imputation Uncertainty - Imputed values come with an inherent level of uncertainty, which is often not accounted for in downstream analyses. Ignoring this uncertainty can lead to overconfidence in model predictions.

To mitigate these drawbacks, it's crucial to carefully consider the nature of your data and the missing data mechanism when choosing an imputation method. Additionally, conducting sensitivity analyses, evaluating the impact of imputation on your results, and considering multiple imputation techniques can help address some of these limitations and provide a more robust treatment of missing data.


1.1.3. Fill missing values using observation

Manually fill in the missing data from observation. This may sometimes be possible for small datasets, but for larger datasets it is very difficult to do so.

Figure 7. Results of filling missing values through observation
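When the true values are known from outside knowledge, they can simply be written into the cells. A sketch of this, assuming hypothetically that the Employee's experience and the Junior Manager's salary were looked up from company records:

# Hypothetical values assumed to be known from observation
df.loc[3, 'Years of Experience'] = 2   # Employee
df.loc[2, 'Salary'] = 60000            # Junior Manager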

1.1.4. Fill in the most repeated value

Fill in the missing value using the most repeated value in the dataset. This is done when most of the data is repeated and there is good reasoning to do so. Since there are no repeated values in the example, we can fill it with any one of the numbers in the respective column, as sketched after the figure below.

Figure 8. Results of filling missing values through the most repeated value in the data
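In pandas this is one line per column via mode(); a minimal sketch, assuming the original df with its missing values, and with the caveat noted above that when no value repeats, mode() returns every value and [0] simply picks the first:

# Filling each column with its most frequent (modal) value
for col in ['Years of Experience', 'Salary']:
    df[col] = df[col].fillna(df[col].mode()[0])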

1.1.5. Fill in with random value within the range of available data

Take the given range of data points and fill in the data by randomly selecting a value from the available range.

Figure 9. Results of filling in with random value within the range of the available data.
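A minimal sketch of this idea, assuming the original df with its missing values and drawing uniformly between each column's observed minimum and maximum (the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(seed=0)
for col in ['Years of Experience', 'Salary']:
    lo, hi = df[col].min(), df[col].max()
    df[col] = df[col].apply(lambda v: rng.uniform(lo, hi) if pd.isna(v) else v)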

1.1.6. Fill in by regression

Use regression analysis to find the most probable data point for filling in the dataset.

Figure 10. Results of filling the missing data through linear regression.
from sklearn.linear_model import LinearRegression

# Excluding the rows with the null data
train_df = df.drop([2, 3], axis=0)

# Creating the linear regression model
regr = LinearRegression()

# Here the target is the Salary and the feature is Years of Experience
regr.fit(train_df[['Years of Experience']], train_df[['Salary']])

# Predicting the salary for 3 years of experience
regr.predict([[3]])

Figure 11. Results from Linear Regression for finding the salary for 3 years of experience.

Therefore, the salary for 3 years of experience by regression is 60000. Now, finding the years of experience based on salary.


from sklearn.linear_model import LinearRegression

# Excluding the rows with the null data
train_df = df.drop([2, 3], axis=0)

# Creating the linear regression model
regr = LinearRegression()

# Here the target is the Years of Experience and the feature is Salary
regr.fit(train_df[['Salary']], train_df[['Years of Experience']])

# Predicting the years of experience for a salary of 40000
regr.predict([[40000.0]])

Figure 12. Results from the Linear Regression for finding the missing years of experience based on salary.

Therefore, the years of experience for a salary of 40000 is 2.
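These two estimates can then be written back into the dataframe to complete it; a short sketch continuing the example above:

# Filling the missing cells with the regression estimates
df.loc[2, 'Salary'] = 60000.0           # salary predicted for 3 years of experience
df.loc[3, 'Years of Experience'] = 2.0  # experience predicted for a salary of 40000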

2. Conclusion

In conclusion, handling missing data is a crucial aspect of machine learning that cannot be overlooked. Throughout this article, we have explored various techniques for addressing missing data, including the removal of rows with missing values, statistical imputation, observation-based imputation, imputing with the most frequent values, random value imputation, and even leveraging the power of linear regression. These methods not only ensure data completeness but also play a pivotal role in enhancing the accuracy and reliability of downstream analysis and classification techniques. The impact of missing data extends far beyond mere data manipulation; it influences the very foundations of our machine learning models and, consequently, the quality of decisions and insights drawn from them. As data scientists and machine learning practitioners, it is our responsibility to master these techniques, as the effective handling of missing data is an indispensable skill in our journey toward more accurate and robust machine learning solutions.
