Data preprocessing transforms raw data into a format suitable for modeling. The key steps are handling missing data (imputation), encoding categorical variables (e.g. one-hot encoding), splitting the dataset into training and test sets, and scaling the features of the training data.
The set of operations to perform on the dataset is:
• Handling missing data
• Managing categorical data
• Dataset distribution (training and test split)
• Scaling the features

Preprocessing Steps in Python

• How to import libraries?

import pandas as pd
import numpy as np

• How to load datasets?
• If the dataset is in Excel format:

df = pd.read_excel("dataset.xlsx")

• If the dataset is in CSV format:

df = pd.read_csv("dataset.csv")

Dataset

Index  Student Name  Subject            Marks  Grade
0      Ramu          Maths              70     A
1      Somu          Maths              55     B
2      Lilly         Technical English  NaN    O
3      Rose          Python             80     A+
4      Nisha         Java               NaN    O
5      Seetha        Compiler           50     B
6      Patrick       Big Data           40     Fail
7      Peter         E-Commerce         75     A

• In the dataset above you can see NaN values, so the missing values must be handled. Missing values can be imputed with the:
• Mean
• Median
• Mode
• A constant value

Classifying Dependent and Independent Variables

• In our dataset, Grade is the dependent variable; Student Name, Subject and Marks are the independent variables.

X = df.iloc[:, [0, 1, 2]].values   # Student Name, Subject, Marks
y = df.iloc[:, 3].values           # Grade
print(X)
print(y)

Handling Missing Values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 2:3])             # Marks is the only numeric column
X[:, 2:3] = imputer.transform(X[:, 2:3])
print(X)

Feature Encoding

• In our dataset the Subject feature has string values. String-based features cannot be fed directly into a training model, so a method is required to convert the strings to numeric values. This is called one-hot encoding.

How to encode?

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [1])], remainder="passthrough")
X = np.array(ct.fit_transform(X))
print(X)
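The imputation and encoding steps above can be combined into one runnable sketch. The small DataFrame below is an assumption for illustration (a four-row subset of the slide's table), and the column indices follow its layout (Subject at position 1, Marks at position 2):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy DataFrame mirroring the first rows of the slide's dataset
df = pd.DataFrame({
    "Student Name": ["Ramu", "Somu", "Lilly", "Rose"],
    "Subject": ["Maths", "Maths", "Technical English", "Python"],
    "Marks": [70, 55, np.nan, 80],
    "Grade": ["A", "B", "O", "A+"],
})

# Independent variables (Name, Subject, Marks) and dependent variable (Grade)
X = df.iloc[:, [0, 1, 2]].values
y = df.iloc[:, 3].values

# Replace NaN in the numeric Marks column with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 2:3] = imputer.fit_transform(X[:, 2:3])

# One-hot encode the string-valued Subject column (index 1);
# the untouched columns (Name, Marks) are appended after the encoded ones
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
)
X = np.array(ct.fit_transform(X))
print(X)
```

With three distinct subjects, the result has three one-hot columns followed by the passthrough columns, and Lilly's missing mark becomes the mean of the other marks.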
Dataset Distribution

# split the data into training and test sets
from sklearn.model_selection import train_test_split
# (in older scikit-learn versions this lived in sklearn.cross_validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('X_train.shape: ', X_train.shape)
print('X_test.shape: ', X_test.shape)
print('y_train.shape: ', y_train.shape)
print('y_test.shape: ', y_test.shape)

Scaling the Features

# feature scaling of the training dataset
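The slide ends before the scaling code itself. A common choice (an assumption here, since the original code is cut off) is standardization with scikit-learn's StandardScaler, fitted on the training data only and then reused on the test data so that no test-set information leaks into preprocessing. The toy numeric matrix below stands in for the preprocessed X:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric feature matrix (e.g. the Marks column after imputation)
X = np.array([[70.0], [55.0], [68.3], [80.0], [68.3], [50.0], [40.0], [75.0]])
y = np.array(["A", "B", "O", "A+", "O", "B", "Fail", "A"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training data only, then apply the same
# training mean and standard deviation to the test data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train)
print(X_test)
```

After scaling, the training features have zero mean and unit variance; the test features are shifted and scaled by the training statistics, so their mean and variance are generally not exactly 0 and 1.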