TP 03 - KNN - Airline Passenger Satisfaction
Objective:
- Build a KNN model for a multi-class classification problem.
- Evaluate the model.
About Dataset
This dataset contains an airline passenger satisfaction survey. What factors are highly correlated to a
satisfied (or dissatisfied) passenger? Can you predict passenger satisfaction?
Column: Description
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)
Departure/Arrival time convenient: Satisfaction level of the convenience of the departure/arrival time
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of gate location
Food and drink: Satisfaction level of food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of on-board service
Leg room service: Satisfaction level of leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of cleanliness
Departure Delay in Minutes: Minutes of delay at departure
Arrival Delay in Minutes: Minutes of delay at arrival
Satisfaction: Airline satisfaction level (Satisfaction, neutral or dissatisfaction)
Assignment:
Build a multi-class KNN classification model that predicts passenger satisfaction from the
information available about each customer.
Steps:
1- Data preprocessing
2- Building the KNN model
3- Cross-validation: use cross-validation to find the best value of the parameter k
4- Validation curve: a shortcut that runs cross-validation over a range of values of one parameter
5- GridSearchCV: searches over several hyperparameters at once
6- Evaluation: accuracy, confusion matrix, ROC…
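The six steps above can be sketched with scikit-learn as follows. This is a minimal sketch, not the assignment's reference solution. It assumes a synthetic binary dataset from make_classification as a stand-in for the survey file (the real data would first need its categorical columns encoded and its missing values handled), and the candidate values for k, weights, and metric are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split, validation_curve)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the passenger survey (binary target).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Steps 1-2: preprocessing (scaling) and the KNN model in one pipeline,
# so the scaler is fitted only on the training folds.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step 3: cross-validation, one run per candidate k.
k_values = [1, 3, 5, 7, 9, 11]
cv_means = {}
for k in k_values:
    pipe.set_params(knn__n_neighbors=k)
    cv_means[k] = cross_val_score(pipe, X_train, y_train, cv=5).mean()
best_k = max(cv_means, key=cv_means.get)

# Step 4: validation_curve does the same sweep in a single call and also
# returns the training scores.
train_scores, val_scores = validation_curve(
    pipe, X_train, y_train,
    param_name="knn__n_neighbors", param_range=k_values, cv=5)

# Step 5: GridSearchCV tunes several hyperparameters at once.
param_grid = {
    "knn__n_neighbors": k_values,
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# Step 6: evaluation on the held-out test set.
y_pred = grid.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])

print("best k from cross-validation:", best_k)
print("best parameters from grid search:", grid.best_params_)
print("test accuracy:", round(acc, 3), "ROC AUC:", round(auc, 3))
print(cm)
```

Wrapping the scaler and the classifier in a Pipeline is a design choice: it keeps the test folds out of the scaling statistics during every cross-validation run.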
1. Gradient Descent-Based Algorithms
Machine learning algorithms that use gradient descent as an optimization technique, such as linear
regression, logistic regression, and neural networks, require the data to be scaled.
2. Distance-Based Algorithms
Distance-based algorithms like KNN, K-means, and SVM are the most affected by the range of features.
This is because they use distances between data points to determine their similarity, so a feature
with a large range dominates the result.
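A small worked example makes the effect concrete. The three passengers and the feature ranges below are hypothetical values chosen for illustration; the point is only that the nearest neighbour can change once both features are put on the same scale.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical passengers: (age in years, flight distance).
p = (25, 1000)
q = (65, 1010)  # very different age, almost the same flight distance
r = (26, 1500)  # almost the same age, different flight distance

# On raw values the flight distance dominates: q looks closest to p.
raw_pq = euclidean(p, q)
raw_pr = euclidean(p, r)

def rescale(point):
    """Min-max scaling with assumed ranges: age 0-100, distance 0-5000."""
    age, dist = point
    return (age / 100, dist / 5000)

# After rescaling both features to [0, 1], r becomes the closest to p.
scaled_pq = euclidean(rescale(p), rescale(q))
scaled_pr = euclidean(rescale(p), rescale(r))

print(f"raw:    d(p,q)={raw_pq:.2f}  d(p,r)={raw_pr:.2f}")
print(f"scaled: d(p,q)={scaled_pq:.4f}  d(p,r)={scaled_pr:.4f}")
```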
3. Tree-Based Algorithms
Tree-based algorithms, on the other hand, are fairly insensitive to the scale of the features.
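One way to see why: a tree node splits on a threshold of a single feature, and rescaling that feature moves the threshold without changing which samples fall on each side. The sketch below uses a hand-written one-feature decision stump (an illustration, not scikit-learn's tree implementation) and hypothetical flight distances and labels.

```python
def best_split(xs, ys):
    """Return (errors, left_indices) for the threshold split of one feature
    that misclassifies the fewest samples under a majority vote per side."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(1, len(order)):
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue  # no threshold can separate equal values
        left, right = order[:cut], order[cut:]
        errors = 0
        for side in (left, right):
            labels = [ys[i] for i in side]
            majority = max(set(labels), key=labels.count)
            errors += sum(1 for label in labels if label != majority)
        if best is None or errors < best[0]:
            best = (errors, frozenset(left))
    return best

# Hypothetical flight distances and binary satisfaction labels.
distances = [200, 350, 900, 1500, 2400, 3100]
labels = [0, 0, 0, 1, 1, 1]

# Min-max scale the feature to [0, 1].
lo, hi = min(distances), max(distances)
scaled = [(d - lo) / (hi - lo) for d in distances]

split_raw = best_split(distances, labels)
split_scaled = best_split(scaled, labels)
print(split_raw == split_scaled)  # same partition of the samples
```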
What is Standardization?
Standardization is a scaling technique that centers the values on the mean and rescales them to a
unit standard deviation. After the transformation, the attribute has a mean of zero and a standard
deviation of one: x' = (x - mean) / standard deviation.
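The formula can be checked on a few values. This is a minimal sketch using only the standard library; the ages are made-up sample values, and the population standard deviation is used, which is also what scikit-learn's StandardScaler computes.

```python
import statistics

def standardize(values):
    """Apply x' = (x - mean) / standard deviation to every value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

ages = [22, 35, 41, 58, 64]  # hypothetical passenger ages
z = standardize(ages)

print([round(v, 3) for v in z])
print("mean:", round(statistics.fmean(z), 6))
print("std: ", round(statistics.pstdev(z), 6))
```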
The Big Question – Normalize or Standardize?
Normalization (rescaling values into a fixed range, typically [0, 1]) is good to use when you know
that the distribution of your data does not follow a Gaussian distribution. This can be useful in
algorithms that do not assume any distribution of the data, like K-Nearest Neighbors and Neural
Networks.
Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian
distribution. However, this does not have to be necessarily true. Also, unlike normalization,
standardization does not have a bounding range. So outliers in your data are not squeezed into a
fixed interval, although they still shift the mean and inflate the standard deviation used for
scaling.
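The difference in bounding range shows up clearly on a column that contains an outlier. The sketch below uses made-up departure delays with one extreme value; the helper names min_max and z_score are introduced here for illustration only.

```python
import statistics

def min_max(values):
    """Normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardization: zero mean, unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

delays = [0, 5, 10, 15, 600]  # hypothetical delays in minutes, one outlier

normalized = min_max(delays)
standardized = z_score(delays)

# Normalized values are bounded: the outlier maps to exactly 1 and the
# ordinary delays are compressed near 0.
print([round(v, 3) for v in normalized])
# Standardized values are unbounded: the outlier lands well above 1.
print([round(v, 3) for v in standardized])
```

In both cases the outlier influences the scaling statistics; what changes is whether the result is forced into a fixed interval.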
The choice of using normalization or standardization will depend on your problem and the machine
learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or
standardize your data. You can always start by fitting your model to raw, normalized and
standardized data and compare the performance for best results.
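Such a comparison can be sketched as follows for a KNN classifier. It assumes a synthetic dataset from make_classification with one feature deliberately inflated to mimic a large-range column; the scores it prints say nothing about the airline data, and k = 5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumption: synthetic data standing in for the survey.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=1)
X[:, 0] *= 1000.0  # give one feature a much larger range than the others

# Same classifier, three treatments of the input features.
candidates = {
    "raw": make_pipeline(KNeighborsClassifier(n_neighbors=5)),
    "normalized": make_pipeline(MinMaxScaler(),
                                KNeighborsClassifier(n_neighbors=5)),
    "standardized": make_pipeline(StandardScaler(),
                                  KNeighborsClassifier(n_neighbors=5)),
}

# Mean 5-fold cross-validated accuracy for each treatment.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

for name, score in scores.items():
    print(f"{name:>12}: {score:.3f}")
```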