
Homework 01: Data preprocessing & data insight


General requirement
1. Naming the file: “Homework-01_StudentName#1_StudentName#2.ipynb”
2. Submissions are accepted only as a Jupyter Notebook file (.ipynb).
3. Every time you use Python
code, include the code in the
ipynb.
4. Fill in the peer evaluation form!
5. Each student needs to submit the homework and also the peer review!
6. Upload and submit through
https://emas.ui.ac.id (deadline
Monday May 1st 10.00 WIB)
Explanation & Question #1
• Refer to file “vertebrate.xlsx.”
• The file contains characteristics of several vertebrates.
• Examine and analyze the correlations between these columns:
“Warm-blooded, Gives Birth, Aquatic Creature, Aerial Creature,
Has Legs, Hibernates.” Please be advised they are binary
variables.
• What variables are instrumental in classifying the “Class”
column? Explain and give a visualization!

Additional resource:
https://towardsdatascience.com/17-types-of-similarity-and-dissimilarity-measures-used-in-data-science-3eb914d2681
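As a rough starting point for comparing binary columns, the sketch below computes two common measures: the phi coefficient (which equals the Pearson correlation when both columns are coded 0/1) and the Jaccard similarity. The toy rows are hypothetical and only stand in for “vertebrate.xlsx”; the helper name `jaccard` is also an illustration, not part of the assignment.

```python
import pandas as pd

# Hypothetical toy rows standing in for vertebrate.xlsx; the real data
# would be loaded with pd.read_excel("vertebrate.xlsx") instead.
toy = pd.DataFrame({
    "Warm-blooded": [1, 0, 0, 1, 1, 0],
    "Gives Birth":  [1, 0, 0, 0, 1, 0],
    "Has Legs":     [1, 0, 1, 1, 1, 0],
})

# For 0/1 columns, the Pearson correlation is the phi coefficient.
phi = toy.corr(method="pearson")

def jaccard(a, b):
    """Rows where both are 1, divided by rows where at least one is 1."""
    both = ((a == 1) & (b == 1)).sum()
    either = ((a == 1) | (b == 1)).sum()
    return both / either if either else float("nan")

jac = jaccard(toy["Warm-blooded"], toy["Gives Birth"])
print(phi.round(3))
print("Jaccard:", round(jac, 3))
```

The phi matrix can be passed to a heatmap for the requested visualization. Jaccard ignores rows where both values are 0, which may matter for rare traits.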
Explanation & Question #2
1. Refer to file “titanic_train.xlsx.”
– Survived: 1 for survived, 0 for deceased
– Pclass: Ticket class
– Sibsp: # of siblings/spouses aboard the Titanic
– parch: # of parents/children aboard the Titanic
– fare: Passenger fare
– cabin: Cabin number
– embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
2. Which features contain blank, null, or empty values? How can we fix them?
3. Explain whether we have a balanced data set. Provide a visualization!
4. Can we reduce the number of columns?
5. Try to derive an additional column that represents marriage status.
6. Do we have a strong correlation between fare and cabin?
7. What are the variables that are instrumental in explaining “Survived”? Give an appropriate
visualization!

https://www.kaggle.com/competitions/titanic/overview
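For the missing-value and balance questions, a minimal sketch could look like the following. The rows are invented for illustration, the column names follow the spelling used above (the real file may capitalize them differently), and the chosen fixes (median, mode, and a hypothetical `has_cabin` flag) are only one option among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real data would come from
# pd.read_excel("titanic_train.xlsx").
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "fare":     [7.25, 71.28, np.nan, 8.05, 8.46],
    "cabin":    [None, "C85", None, None, None],
    "embarked": ["S", "C", "S", None, "Q"],
})

# Count of missing cells per column.
missing = toy.isna().sum()

# Share of each target class, to judge whether the data set is balanced.
balance = toy["Survived"].value_counts(normalize=True)

# One possible repair strategy (an assumption, not the required answer):
# median for a numeric column, most frequent value for a categorical one,
# and a 0/1 flag replacing a column that is mostly empty.
filled = toy.copy()
filled["fare"] = filled["fare"].fillna(filled["fare"].median())
filled["embarked"] = filled["embarked"].fillna(filled["embarked"].mode()[0])
filled["has_cabin"] = filled["cabin"].notna().astype(int)
filled = filled.drop(columns="cabin")

print(missing)
print(balance)
```

Replacing the sparse column with a flag also reduces the column count while keeping the information that a cabin was recorded.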
Explanation & question #3
Refer to file “TA_restaurants_curated.xlsx”:
1. Summarize the data!
2. Which features contain blank, null, or empty values? Can we fix
them? How?
3. Do we have highly correlated columns?
4. Explain whether we have a balanced data set. Provide a visualization!
5. Summarize the cuisine style based on city, rating, and price range!
6. Check whether a high number of reviews correlates with rating and
price range. Give supporting visualization!
7. Analyze the words in “Reviews” that correlate with Rating and Price
Range!
https://www.kaggle.com/datasets/damienbeneschi/krakow-ta-restaurans-data-raw
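To relate review counts to rating and price range, the price labels first need a numeric order. The sketch below assumes hypothetical column names (“City”, “Rating”, “Price Range”, “Number of Reviews”) and hypothetical price labels; check them against the actual file. It computes a rank-based (Spearman-style) correlation by ranking the columns and then taking the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy rows; column names and price labels are assumptions
# about TA_restaurants_curated.xlsx and should be verified.
toy = pd.DataFrame({
    "City": ["Krakow", "Krakow", "Paris", "Paris", "Paris"],
    "Rating": [4.5, 4.0, 3.5, 4.5, 5.0],
    "Price Range": ["$", "$$ - $$$", "$", "$$ - $$$", "$$$$"],
    "Number of Reviews": [120, 80, 15, 300, 950],
})

# Map the ordered price labels to integers so they can be correlated.
price_order = {"$": 1, "$$ - $$$": 2, "$$$$": 3}
toy["price_level"] = toy["Price Range"].map(price_order)

# Simple per-city summary of the rating.
summary = toy.groupby("City")["Rating"].agg(["mean", "count"])

# Rank correlation: Pearson correlation computed on the ranks.
numeric = toy[["Number of Reviews", "Rating", "price_level"]]
rank_corr = numeric.rank().corr(method="pearson")

print(summary)
print(rank_corr.round(2))
```

A rank-based measure is used here because the price level is ordinal and review counts tend to be heavily skewed.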
Explanation & question #4
Refer to file “credit_scoring_sample.xlsx.”
• SeriousDlqin2yrs (target variable): Person experienced 90 days past due delinquency or worse
• RevolvingUtilizationOfUnsecuredLines: Total balance on credit cards and personal lines of credit
except for real estate and no installment debt like car loans divided by the sum of credit limits
• Age: Age of borrower in years
• NumberOfTime30-59DaysPastDueNotWorse: Number of times borrower has been 30-59 days past
due but no worse in the last two years.
• NumberOfTime60-89DaysPastDueNotWorse: Number of times borrower has been 60-89 days past
due but no worse in the last two years.
• NumberOfTimes90DaysLate: Number of times the borrower has been 90 days or more past due.
• DebtRatio: Monthly debt payments, alimony, and living costs divided by monthly gross income
• MonthlyIncome: Monthly income
• NumberOfOpenCreditLinesAndLoans: Number of Open loans (installments like car loans or
mortgages) and Lines of credit (e.g., credit cards)
• NumberRealEstateLoansOrLines: Number of mortgage and real estate loans, including home equity
lines of credit
• NumberOfDependents: Number of dependents in the family excluding themselves (spouse, children,
etc.)
Explanation & question #4 (cont.)
Refer to file “credit_scoring_sample.xlsx.”

1. Draw chart(s) showing how age, DebtRatio, MonthlyIncome and
NumberOfDependents correlate with defaults!
2. Do we have redundant columns?
3. Explain whether we have a balanced data set. Provide a visualization!
4. Analyze which features are the most instrumental for the credit
company in monitoring defaults.

https://www.kaggle.com/competitions/GiveMeSomeCredit/overview
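One way to chart how a numeric feature relates to defaults is to bin it and compute the default rate per bin. The sketch below does this for age on invented rows; the bin edges and the `age_band` column are assumptions chosen for illustration, and the same pattern could be applied to DebtRatio, MonthlyIncome, or NumberOfDependents.

```python
import pandas as pd

# Hypothetical toy rows; the real data would come from
# pd.read_excel("credit_scoring_sample.xlsx").
toy = pd.DataFrame({
    "age": [25, 32, 41, 47, 58, 63, 71, 29],
    "SeriousDlqin2yrs": [1, 0, 1, 0, 0, 0, 0, 1],
})

# Illustrative bin edges; intervals are open on the left, closed on the right.
bins = [20, 40, 60, 80]
toy["age_band"] = pd.cut(toy["age"], bins=bins)

# Mean of a 0/1 target within each band is the default rate for that band.
default_rate = toy.groupby("age_band", observed=True)["SeriousDlqin2yrs"].mean()

# Overall class shares, to judge whether the target is balanced.
class_share = toy["SeriousDlqin2yrs"].value_counts(normalize=True)

print(default_rate)
print(class_share)
```

Calling `default_rate.plot(kind="bar")` in the notebook would turn the table into a chart, assuming matplotlib is installed.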
Explanation & question #5
Refer to file “Walmart data.xlsx”:
1. Which store and which department have high weekly sales?
2. Which months have meager sales?
3. Do sales change over time?
4. Draw the weekly sales chart! Do specific weeks or months
have a trend or pattern that repeats every month/year?
5. Find the correlation of Weekly_Sales, IsHoliday, Temperature,
Fuel_Price, CPI, and Unemployment.

https://www.kaggle.com/competitions/walmart-sales-forecasting/overview
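The store, month, and correlation questions mostly come down to grouping and aggregating. The sketch below uses invented rows; “Weekly_Sales” and “IsHoliday” appear in the questions above, while “Store” and “Date” are assumed column names that should be checked against the file.

```python
import pandas as pd

# Hypothetical toy rows; the real data would come from
# pd.read_excel("Walmart data.xlsx").
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2, 1, 2],
    "Date": pd.to_datetime([
        "2010-02-05", "2010-02-12", "2010-02-05",
        "2010-03-05", "2010-03-12", "2010-03-12",
    ]),
    "Weekly_Sales": [100.0, 150.0, 80.0, 60.0, 120.0, 90.0],
    "IsHoliday": [False, True, False, False, False, False],
})

# Total sales per store, largest first.
by_store = toy.groupby("Store")["Weekly_Sales"].sum().sort_values(ascending=False)

# Total sales per calendar month (1 = January, ..., 12 = December).
by_month = toy.groupby(toy["Date"].dt.month)["Weekly_Sales"].sum()
weakest_month = by_month.idxmin()

# Cast the boolean flag to float so it can enter a correlation matrix.
corr = toy[["Weekly_Sales", "IsHoliday"]].astype(float).corr()

print(by_store)
print(by_month)
print(corr.round(2))
```

With several years of data, grouping by both year and month instead of month alone keeps seasonal patterns from being mixed with a long-term trend.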
Explanation & question #6
Refer to file “DelayedFlights.csv”:
1. Perform data preprocessing!
2. Explain, analyze, and visualize the data!
3. Give insights, trends, relations, and patterns,
and give visualization!

https://www.kaggle.com/datasets/giovamata/airlinedelaycauses
