Ex 5.1 Customer Behaviour Prediction
1. Introduction
3. Data Collection
● Missing values are handled through imputation techniques such as mean or median
imputation.
● Outliers are detected and removed using statistical methods or domain knowledge.
● Data is preprocessed by performing feature scaling, encoding categorical variables, and
handling any data transformations necessary for modelling.
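The cleaning steps above can be sketched on a small toy DataFrame; this is a minimal illustration (the values, and the use of mean imputation plus an IQR fence for outliers, are assumptions, not the project's actual data):

```python
import numpy as np
import pandas as pd

# Toy data with one missing value and one extreme salary (hypothetical values)
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 35, 28],
    'EstimatedSalary': [50000, 52000, 51000, 500000, 49000],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
})

# Mean imputation for the missing Age
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Outlier removal with a 1.5*IQR fence on EstimatedSalary
q1, q3 = df['EstimatedSalary'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df['EstimatedSalary'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Encode the categorical column
df['Gender'] = df['Gender'].map({'Female': 0, 'Male': 1})
```

After these steps the 500000 salary row is dropped and no missing values remain, leaving a numeric table ready for scaling.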
5. Exploratory Data Analysis (EDA)
6. Feature Selection
● The dataset is split into training and validation sets using techniques such as k-fold cross-validation.
● Hyperparameter tuning is performed using grid search or randomized search to optimize
model performance.
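Grid search combined with k-fold cross-validation can be sketched with scikit-learn's GridSearchCV; the synthetic data and the choice of logistic regression with a grid over the regularisation strength C are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the customer dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=1)

# 5-fold cross-validated search over the regularisation strength
param_grid = {'C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5,
                      scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
```

A randomized search works the same way via RandomizedSearchCV, sampling a fixed number of parameter settings instead of trying every combination.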
9. Model Evaluation
Model performance is evaluated using appropriate metrics:
● For classification tasks, metrics such as accuracy, precision, recall, F1-score, and ROC
AUC are computed.
● For regression tasks, metrics such as mean squared error (MSE) or mean absolute error
(MAE) are calculated.
● Models are compared based on their performance on the validation set.
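Computing these metrics is straightforward with sklearn.metrics; the labels and predictions below are hypothetical stand-ins for a validation set:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             mean_squared_error, mean_absolute_error)

# Classification metrics on hypothetical validation labels and predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]  # predicted P(class 1)

print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
print('f1       :', f1_score(y_true, y_pred))
print('roc auc  :', roc_auc_score(y_true, y_prob))  # uses scores, not labels

# Regression metrics on hypothetical continuous targets
r_true = [3.0, 5.0, 2.5, 7.0]
r_pred = [2.5, 5.0, 3.0, 8.0]
print('mse:', mean_squared_error(r_true, r_pred))
print('mae:', mean_absolute_error(r_true, r_pred))
```

Note that ROC AUC is computed from predicted probabilities (or decision scores), while accuracy, precision, recall, and F1 take hard class labels.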
10. Deployment
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('Customer_Behaviour.csv')
data.info()

def preprocess_inputs(df, engineer_features=False):
    df = df.copy()
    # Binary encode
    df['Gender'] = df['Gender'].replace({'Female': 0, 'Male': 1})
    # Feature engineering
    if engineer_features:
        income_threshold = df['EstimatedSalary'].quantile(0.95)
        df['High Income'] = df['EstimatedSalary'].apply(
            lambda x: 1 if x >= income_threshold else 0)
        old_age_threshold = df['Age'].quantile(0.75)
        df['Old Age'] = df['Age'].apply(
            lambda x: 1 if x >= old_age_threshold else 0)
        young_age_threshold = df['Age'].quantile(0.25)
        df['Young Age'] = df['Age'].apply(
            lambda x: 1 if x <= young_age_threshold else 0)
    return df

data = preprocess_inputs(data, engineer_features=True)
X = data.drop('Purchased', axis=1)  # 'Purchased' assumed to be the target column
y = data['Purchased']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, shuffle=True, random_state=1)

# Scale X
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train),
                       index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test),
                      index=X_test.index, columns=X_test.columns)

model = LogisticRegression()
model.fit(X_train, y_train)