
What I would like to see in your projects:

1. Initial analysis of the data.
- column descriptions, data types, number of null/na values;
- numerical/statistical analysis of the columns.
2. Some interesting grouping/filtering of the data that highlights some of its features.
- At least 5 queries involving aggregating two or more columns;
- At least 5 queries that will filter data based on different column values.
3. Plots of the data for a single column and plots that will analyze multiple columns.
- At least 3 plots for a single column (ideally of different plot types);
- At least 3 plots involving multiple columns that showcase the relationships between them.
4. Modelling.
- Apply KFold or RepeatedKFold cross-validation;
- Apply grid search or randomized search pipelines;
- Create different types of models that suit your task (Linear Regression, Logistic
Regression, Lasso, Ridge, Elastic Net, Decision Tree, Random Forest, AdaBoost,
GradientBoost, SVM, KNN).
5. Depending on your task, provide a list of metrics on the test set for your final (best)
model, obtained from the hyperparameter tuning in step 4.
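To make steps 4 and 5 more concrete, here is a minimal sketch of what k-fold cross-validation combined with a grid search does under the hood. It uses only the standard library so that it runs anywhere; in a project you would normally rely on a library such as scikit-learn instead. The toy data, the tiny k-NN classifier, and all function names are assumptions made for illustration and are not part of the assignment.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(xs, ys, k, n_splits=5):
    """Mean accuracy of k-NN across the folds for one hyperparameter value."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), n_splits):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / len(fold_scores)


# Toy data, made up for illustration: label is 1 when x >= 1.0, with two flipped labels as noise.
xs = [i / 20 for i in range(40)]
ys = [1 if x >= 1.0 else 0 for x in xs]
ys[5], ys[30] = 1, 0

# "Grid search": evaluate every candidate value and keep the best mean CV score.
grid = [1, 3, 5]
scores = {k: cv_accuracy(xs, ys, k) for k in grid}
best_k = max(scores, key=scores.get)
best_score = scores[best_k]
print(scores, "best k:", best_k)
```

Note that this sketch tunes on all of the data. For step 5 you would first set aside a test set, run the search on the remaining data only, and report the final metrics on the held-out part.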
Example Datasets (You can find any other dataset on the internet):

1. Credit Card Fraud Detection (Classification)
`projects/credit_card_fraud.csv`
The dataset contains transactions made by credit cards in September 2013 by European
cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds
out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds)
accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, we cannot provide the original features and
more background information about the data. Features V1, V2, … V28 are the principal
components obtained with PCA; the only features which have not been transformed with
PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each
transaction and the first transaction in the dataset. The feature 'Amount' is the transaction
amount; this feature can be used for example-dependent cost-sensitive learning. Feature
'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.
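The class imbalance described above has a practical consequence for choosing metrics. The following sketch, using only the counts quoted in the description, shows that a trivial baseline which labels every transaction as legitimate reaches very high accuracy while detecting no fraud at all. The variable names are illustrative.

```python
# Counts taken from the dataset description.
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

fraud_rate = n_fraud / n_total

# Baseline that predicts class 0 (not fraud) for every transaction.
tp, fp = 0, 0                  # it never predicts fraud
fn, tn = n_fraud, n_legit      # so every fraud is missed, every legit one is "correct"

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"fraud rate:        {fraud_rate:.4f}")
print(f"baseline accuracy: {accuracy:.4f}")
print(f"baseline recall:   {recall:.4f}")
```

For this reason, metrics such as precision, recall, and F1 on the positive class are usually more informative than accuracy for this dataset.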

2. Customer Churn (Classification)
`projects/customer_churn.csv`
Each row represents a customer.
The data set includes information about:
- Customers who left within the last month – the column is called Churn;
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies;
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges;
- Demographic info about customers – gender, age range, and if they have partners and dependents.

3. Loan Prediction (Classification)


`projects/loan_prediction.csv`
| Column | Description | Type |
|---|---|---|
| income | Income of the user | int |
| age | Age of the user | int |
| experience | Professional experience of the user in years | int |
| profession | Profession | string |
| married | Whether married or single | string |
| house_ownership | Owned or rented or neither | string |
| car_ownership | Does the person own a car | string |
| risk_flag | Defaulted on a loan | string |
| currentjobyears | Years of experience in the current job | int |
| currenthouseyears | Number of years in the current residence | int |
| city | City of residence | string |
| state | State of residence | string |
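As a starting point for the initial analysis (data types and number of missing values), the column types listed above can be turned into a small typed loader. This is a sketch using only the standard library; it assumes the CSV header uses exactly the column names from the table, which may not hold for the real file. The sample rows are made up for illustration and are not taken from `projects/loan_prediction.csv`.

```python
import csv
import io

# Column types as listed in the column description.
SCHEMA = {
    "income": int, "age": int, "experience": int, "profession": str,
    "married": str, "house_ownership": str, "car_ownership": str,
    "risk_flag": str, "currentjobyears": int, "currenthouseyears": int,
    "city": str, "state": str,
}

# Made-up rows for illustration only.
SAMPLE = """income,age,experience,profession,married,house_ownership,car_ownership,risk_flag,currentjobyears,currenthouseyears,city,state
50000,30,5,Engineer,single,rented,no,0,3,12,CityA,StateA
,45,20,Doctor,married,owned,yes,0,9,11,CityB,StateB
72000,27,,Teacher,single,rented,no,1,2,,CityC,StateA
"""


def load_rows(handle):
    """Read CSV rows, cast each field per SCHEMA, and store empty fields as None."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {}
        for name, cast in SCHEMA.items():
            value = (raw.get(name) or "").strip()
            row[name] = cast(value) if value else None
        rows.append(row)
    return rows


def null_counts(rows):
    """Number of missing values per column."""
    return {name: sum(r[name] is None for r in rows) for name in SCHEMA}


rows = load_rows(io.StringIO(SAMPLE))
missing = null_counts(rows)
incomes = [r["income"] for r in rows if r["income"] is not None]
mean_income = sum(incomes) / len(incomes)

print(missing)
print("mean income:", mean_income)
```

To run it on the real file, replace `io.StringIO(SAMPLE)` with an opened file handle, for example `open("projects/loan_prediction.csv", newline="")`.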

4. Heart Failure (Classification)


`projects/heart_failure.csv`
Context
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an
estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these
deaths occur prematurely in people under 70 years of age. Heart failure is a common
event caused by CVDs and this dataset contains 11 features that can be used to predict a
possible heart disease.
People with cardiovascular disease or who are at high cardiovascular risk (due to the
presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or
already established disease) need early detection and management wherein a machine
learning model can be of great help.

Attribute Information
Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP:
Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave
abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH:
showing probable or definite left ventricular hypertrophy by Estes' criteria]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down:
downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]
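Most of the models listed in the modelling step expect numeric inputs, so the categorical attributes above need to be encoded first. The sketch below shows one possible encoding (binary flags plus one-hot vectors) built from the category codes in the attribute list. The function names and the patient record are made up for illustration; the dataset's FastingBS column already stores the 0/1 flag, so `fasting_bs_flag` only spells out the stated rule.

```python
# Category codes taken from the attribute list.
CHEST_PAIN_TYPES = ("TA", "ATA", "NAP", "ASY")
RESTING_ECG_TYPES = ("Normal", "ST", "LVH")
ST_SLOPE_TYPES = ("Up", "Flat", "Down")


def fasting_bs_flag(blood_sugar_mg_dl):
    """FastingBS rule: 1 if fasting blood sugar is above 120 mg/dl, 0 otherwise."""
    return 1 if blood_sugar_mg_dl > 120 else 0


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError(f"unexpected category: {value!r}")
    return [int(value == c) for c in categories]


def encode_patient(p):
    """Turn a dict of raw attributes into a flat list of numeric features."""
    return [
        p["Age"],
        int(p["Sex"] == "M"),
        *one_hot(p["ChestPainType"], CHEST_PAIN_TYPES),
        p["RestingBP"],
        p["Cholesterol"],
        p["FastingBS"],
        *one_hot(p["RestingECG"], RESTING_ECG_TYPES),
        p["MaxHR"],
        int(p["ExerciseAngina"] == "Y"),
        p["Oldpeak"],
        *one_hot(p["ST_Slope"], ST_SLOPE_TYPES),
    ]


# Made-up patient record for illustration only.
patient = {
    "Age": 54, "Sex": "M", "ChestPainType": "NAP", "RestingBP": 130,
    "Cholesterol": 220, "FastingBS": fasting_bs_flag(135),
    "RestingECG": "Normal", "MaxHR": 150, "ExerciseAngina": "N",
    "Oldpeak": 1.0, "ST_Slope": "Flat",
}
features = encode_patient(patient)
print(features)
```

One-hot encoding avoids implying an order between categories such as chest pain types, at the cost of more columns (18 numeric features from the 11 raw attributes here).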

5. Loan Prediction 2 (Classification)


`projects/loan_prediction_2.csv`
Columns:
Gender: gender of the customers
Married: marital status
Dependents: number of dependents of the customer
Education: highest level of studies
Self employed: whether the customer is self employed or not
ApplicantIncome: income of the customers
CoapplicantIncome: income of the co-applicant for the credit (if present)
LoanAmount: sum of money borrowed
Loan_Amount_Term: number of months of the loan
Credit History: whether the customer has a credit history or not
Property_Area: the social area of the customer (rural or urban)
Loan_Status: if the customer is given the loan or not.
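The grouping and filtering queries requested in the project requirements could look like the following sketch on this dataset: an approval rate aggregated per property area, and a filter combining conditions on two columns. It uses only the standard library. The records are made up for illustration, and the "Y"/"N" coding of Loan_Status as well as the area labels are assumptions, since the column list does not state the actual values.

```python
from collections import defaultdict

# Made-up records for illustration only; value codings are assumed.
applications = [
    {"Property_Area": "Urban", "ApplicantIncome": 5000, "LoanAmount": 120, "Loan_Status": "Y"},
    {"Property_Area": "Rural", "ApplicantIncome": 3000, "LoanAmount": 100, "Loan_Status": "N"},
    {"Property_Area": "Urban", "ApplicantIncome": 4000, "LoanAmount": 150, "Loan_Status": "N"},
    {"Property_Area": "Rural", "ApplicantIncome": 6000, "LoanAmount": 200, "Loan_Status": "Y"},
    {"Property_Area": "Urban", "ApplicantIncome": 7000, "LoanAmount": 180, "Loan_Status": "Y"},
]


def approval_rate_by(records, column):
    """Group records by `column` and return the share of approved loans per group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for r in records:
        totals[r[column]] += 1
        approved[r[column]] += int(r["Loan_Status"] == "Y")
    return {key: approved[key] / totals[key] for key in totals}


def filter_records(records, **conditions):
    """Keep the records for which every per-column predicate returns True."""
    return [r for r in records
            if all(pred(r[col]) for col, pred in conditions.items())]


rates = approval_rate_by(applications, "Property_Area")
high_income_urban = filter_records(
    applications,
    Property_Area=lambda v: v == "Urban",
    ApplicantIncome=lambda v: v >= 5000,
)

print(rates)
print(len(high_income_urban), "urban applicants with income >= 5000")
```

With a dataframe library the same queries are typically a one-line group-by and a boolean mask; the explicit version is shown only to make the logic visible.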
