Data Preprocessing
Additional resource:
https://towardsdatascience.com/17-types-of-similarity-and-dissimilarity-measures-used-in-data-science-3eb914d2681
Explanation & question #2
1. Refer to file “titanic_train.xlsx.”
– Survived: 1 for survived, 0 for deceased
– Pclass: Ticket class
– Sibsp: # of siblings/spouses aboard the Titanic
– parch: # of parents/children aboard the Titanic
– fare: Passenger fare
– cabin: Cabin number
– embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
2. Which features contain blank, null, or empty values? How can we fix them?
3. Is the data set balanced? Explain, and provide a visualization!
4. Can we reduce the number of columns?
5. Try to derive an additional column that represents marriage status.
6. Do we have a strong correlation between fare and cabin?
7. Which variables are instrumental in explaining “Survived”? Provide an appropriate
visualization!
https://www.kaggle.com/competitions/titanic/overview
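Questions 2, 3, and 5 can be approached with a few lines of pandas. The following is a minimal sketch, not a reference solution. It assumes the sheet has been loaded into a DataFrame, that it contains a Name column with titles in the form "Surname, Title. Given names" (the column list above does not mention it), and that the title is an acceptable proxy for marriage status. The helper names and the sample rows are made up for illustration.

```python
import re

import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing cells per column; blank strings count as missing."""
    is_blank = df.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "")
    )
    missing = df.isna() | is_blank
    counts = missing.sum()
    summary = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    # Keep only the columns that actually have gaps, worst first.
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


def marital_status(name: str) -> str:
    """Heuristic (assumption): infer marriage status from the title."""
    match = re.search(r",\s*([^.]+)\.", name)
    title = match.group(1).strip() if match else ""
    if title in {"Mrs", "Mme"}:
        return "married"
    if title in {"Miss", "Mlle"}:
        return "unmarried"
    return "unknown"  # e.g. "Mr" or "Master" say nothing about marriage


# Hypothetical rows, only to show the calls.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane",
             "Roe, Miss. Ann", "Poe, Master. Tim"],
    "Age": [22.0, 38.0, None, 4.0],
    "Cabin": ["", "C85", None, " "],
})

summary = missing_summary(sample)
# Share of each class in the target; far from 50/50 means imbalanced.
class_balance = sample["Survived"].value_counts(normalize=True)
sample["MaritalStatus"] = sample["Name"].map(marital_status)
```

On the real file, sample would be replaced by the DataFrame read from titanic_train.xlsx. A bar chart of class_balance (for instance class_balance.plot(kind="bar"), which needs matplotlib) is one way to provide the requested visualization.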
Explanation & question #3
Refer to file “TA_restaurants_curated.xlsx”:
1. Summarize the data!
2. Which features contain blank, null, or empty values? Can we fix
them? How?
3. Do we have highly correlated columns?
4. Is the data set balanced? Explain, and provide a visualization!
5. Summarize the cuisine style based on city, rating, and price range!
6. Check whether a high number of reviews correlates with rating and
price range. Give supporting visualization!
7. Analyze the words in “Reviews” that correlate with Rating and Price
Range!
https://www.kaggle.com/datasets/damienbeneschi/krakow-ta-restaurans-data-raw
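For questions 6 and 7, one possible starting point is sketched below. It is an illustration under assumptions, not the expected answer: the column names "Number of Reviews", "Rating", "Price Range", and "Reviews", the three price labels in PRICE_LEVEL, and the sample rows are all assumed and should be checked against the actual file. Because price range is ordinal, the sketch uses a rank correlation, computed as the Pearson correlation of ranks.

```python
import pandas as pd

# Assumed price labels, mapped to an ordinal scale.
PRICE_LEVEL = {"$": 1, "$$ - $$$": 2, "$$$$": 3}


def rank_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman-style correlation between review count, rating, and price."""
    numeric = pd.DataFrame({
        "reviews": df["Number of Reviews"],
        "rating": df["Rating"],
        "price": df["Price Range"].map(PRICE_LEVEL),
    }).dropna()
    # Pearson correlation of (average) ranks is Spearman's rank correlation.
    return numeric.rank().corr()


def word_rating_stats(df: pd.DataFrame, min_count: int = 2) -> pd.DataFrame:
    """Mean rating of the restaurants whose review text contains each word."""
    words = df["Reviews"].fillna("").str.lower().str.findall(r"[a-z']+")
    long = pd.DataFrame({
        "rating": df["Rating"],
        # Count each word once per restaurant.
        "word": words.map(lambda ws: sorted(set(ws))),
    }).explode("word")
    long = long.dropna(subset=["word", "rating"])
    stats = long.groupby("word")["rating"].agg(["mean", "count"])
    return stats[stats["count"] >= min_count].sort_values(
        "mean", ascending=False
    )


# Hypothetical rows, only to show the calls.
restaurants = pd.DataFrame({
    "Number of Reviews": [120, 80, 15, 5],
    "Rating": [4.5, 4.0, 2.5, 3.0],
    "Price Range": ["$$ - $$$", "$$$$", "$", "$"],
    "Reviews": ["Great food, great service", "Great view",
                "Slow service", None],
})

corr = rank_correlation(restaurants)
word_stats = word_rating_stats(restaurants)
```

The same word_rating_stats idea can be repeated with the ordinal price level in place of the rating. Raising min_count on the full data set filters out words that appear in too few reviews to say anything reliable.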
Explanation & question #4
Refer to file “credit_scoring_sample.xlsx.”
• SeriousDlqin2yrs (target variable): Person experienced 90 days past due delinquency or worse
• RevolvingUtilizationOfUnsecuredLines: Total balance on credit cards and personal lines of credit
(excluding real estate and installment debt such as car loans), divided by the sum of credit limits
• Age: Age of borrower in years
• NumberOfTime30-59DaysPastDueNotWorse: Number of times borrower has been 30-59 days past
due but no worse in the last two years.
• NumberOfTime60-89DaysPastDueNotWorse: Number of times borrower has been 60-89 days past
due but no worse in the last two years.
• NumberOfTimes90DaysLate: Number of times the borrower has been 90 days or more past due.
• DebtRatio: Monthly debt payments, alimony, and living costs divided by monthly gross income
• MonthlyIncome: Monthly income
• NumberOfOpenCreditLinesAndLoans: Number of open loans (installment loans such as car loans or
mortgages) and lines of credit (e.g., credit cards)
• NumberRealEstateLoansOrLines: Number of mortgage and real estate loans, including home equity
lines of credit
• NumberOfDependents: Number of dependents in the family excluding themselves (spouse, children,
etc.)
https://www.kaggle.com/competitions/GiveMeSomeCredit/overview
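The variable definitions above imply some plausibility checks that are worth running before any modelling. The sketch below is one hedged way to do that. It assumes the column names match the list above exactly; the chosen thresholds, the helper name sanity_report, and the sample rows are illustrative assumptions. A utilization above 1 is not necessarily an error (a borrower can be over the limit), so the report only counts such rows for inspection.

```python
import pandas as pd


def sanity_report(df: pd.DataFrame) -> pd.Series:
    """Count rows that look implausible given the variable definitions."""
    checks = {
        "age_not_positive": df["Age"] <= 0,
        "utilization_above_1": df["RevolvingUtilizationOfUnsecuredLines"] > 1,
        "debt_ratio_negative": df["DebtRatio"] < 0,
        "income_missing": df["MonthlyIncome"].isna(),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})


# Hypothetical rows, only to show the calls.
credit = pd.DataFrame({
    "SeriousDlqin2yrs": [0, 0, 1, 0],
    "RevolvingUtilizationOfUnsecuredLines": [0.2, 1.4, 0.9, 0.0],
    "Age": [45, 0, 30, 62],
    "DebtRatio": [0.3, 0.5, 1.2, 0.1],
    "MonthlyIncome": [5000.0, None, 3000.0, 7000.0],
})

report = sanity_report(credit)
# Share of positive cases in the target; a small value signals imbalance.
default_rate = credit["SeriousDlqin2yrs"].mean()
```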
Explanation & question #5
Refer to file “Walmart data.xlsx”:
1. Which store and which department have high weekly sales?
2. Which months have the lowest sales?
3. Do sales change over time?
4. Draw the weekly sales chart! Do specific weeks or months
have a trend or pattern that repeats every month/year?
5. Find the correlation of Weekly_Sales, IsHoliday, Temperature,
Fuel_Price, CPI, and Unemployment.
https://www.kaggle.com/competitions/walmart-sales-forecasting/overview
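Questions 1, 2, and 5 reduce to group-by aggregations and a correlation matrix. The following sketch shows one way to set them up. The columns Store, Dept, and Date are assumed (only the six variables in question 5 are named above), the helper names are hypothetical, and the sample rows are invented for illustration.

```python
import pandas as pd


def top_store_dept(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Store/department pairs with the highest mean weekly sales."""
    mean_sales = df.groupby(["Store", "Dept"])["Weekly_Sales"].mean()
    return mean_sales.sort_values(ascending=False).head(n)


def monthly_sales(df: pd.DataFrame) -> pd.Series:
    """Mean weekly sales per calendar month (1 = January)."""
    month = pd.to_datetime(df["Date"]).dt.month
    return df.groupby(month)["Weekly_Sales"].mean().rename_axis("month")


def sales_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix of the variables listed in question 5."""
    cols = ["Weekly_Sales", "IsHoliday", "Temperature",
            "Fuel_Price", "CPI", "Unemployment"]
    # Casting to float turns the boolean IsHoliday flag into 0.0 / 1.0.
    return df[cols].astype(float).corr()


# Hypothetical rows, only to show the calls.
walmart = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Dept": [1, 2, 1, 1],
    "Date": ["2010-02-05", "2010-02-12", "2010-03-05", "2010-03-12"],
    "Weekly_Sales": [100.0, 200.0, 300.0, 500.0],
    "IsHoliday": [False, True, False, False],
    "Temperature": [40.0, 38.0, 50.0, 55.0],
    "Fuel_Price": [2.5, 2.6, 2.7, 2.8],
    "CPI": [211.0, 211.2, 211.3, 211.5],
    "Unemployment": [8.1, 8.1, 8.0, 7.9],
})

top = top_store_dept(walmart)
by_month = monthly_sales(walmart)
corr = sales_correlation(walmart)
```

For questions 3 and 4, summing Weekly_Sales per Date and plotting the resulting series is a natural next step. Comparing the same calendar weeks across years makes repeating patterns easier to see.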
Explanation & question #6
Refer to file “DelayedFlights.csv”:
1. Perform data preprocessing!
2. Explain, analyze, and visualize the data!
3. Give insights, trends, relations, and patterns,
and give visualization!
https://www.kaggle.com/datasets/giovamata/airlinedelaycauses
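Since the task leaves the preprocessing steps open, the sketch below shows one generic baseline: remove duplicate rows, drop columns that are mostly empty, and fill the remaining numeric gaps with the column median. The 50% threshold, the median choice, the helper names, and the column names UniqueCarrier, ArrDelay, and Notes in the sample are assumptions for illustration; the actual file should be inspected before settling on any of them.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, max_missing_share: float = 0.5) -> pd.DataFrame:
    """Baseline cleaning: dedupe, drop sparse columns, impute medians."""
    out = df.drop_duplicates()
    # Keep only columns whose share of missing values is acceptable.
    keep = out.isna().mean() <= max_missing_share
    out = out.loc[:, keep].copy()
    numeric = out.select_dtypes(include="number").columns
    out[numeric] = out[numeric].fillna(out[numeric].median())
    return out.reset_index(drop=True)


def group_summary(df: pd.DataFrame, by: str, value: str) -> pd.DataFrame:
    """Mean, median, and row count of `value` per group, largest mean first."""
    stats = df.groupby(by)[value].agg(["mean", "median", "count"])
    return stats.sort_values("mean", ascending=False)


# Hypothetical rows, only to show the calls.
flights = pd.DataFrame({
    "UniqueCarrier": ["AA", "AA", "UA", "UA"],
    "ArrDelay": [10.0, None, 30.0, 30.0],
    "Notes": [None, None, None, None],
})

clean = preprocess(flights)
delay_by_carrier = group_summary(clean, by="UniqueCarrier", value="ArrDelay")
```

Median imputation is only a default. For delay columns, a missing value may instead mean that no delay of that type occurred, in which case filling with 0 would be the better choice.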