SDBIS 2023 2024 Project Datasets
1. Initial analysis of the data.
- column descriptions, data types, number of null/na values;
- numerical/statistical analysis of the columns.
2. Some interesting grouping/filtering of the data that highlights notable features.
- At least 5 queries involving aggregating two or more columns;
- At least 5 queries that will filter data based on different column values.
3. Plots of the data for a single column and plots that will analyze multiple columns.
- At least 3 plots (ideally of different types) for a single column;
- At least 3 plots involving multiple columns and showcasing the relationship between
them.
4. Modelling.
- Apply KFold or RepeatedKFold cross-validation;
- Apply grid search or randomized search pipelines;
- Create different types of models suited to your task (Linear Regression, Logistic
Regression, Lasso, Ridge, Elastic Net, Decision Tree, Random Forest, AdaBoost,
Gradient Boosting, SVM, KNN).
5. Depending on your task, provide a list of metrics on the test set for your final (best)
model resulting from the hyperparameter tuning in step 4.
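The steps above can be sketched end to end as follows. This is only a minimal illustration, not a required solution: it assumes pandas and scikit-learn, uses synthetic data from `make_classification` instead of a real project file, and the column names `f0`..`f3` and `target` are hypothetical. Logistic Regression stands in for any of the listed models, and the plots from step 3 are left out.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a project dataset (hypothetical columns f0..f3, target).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
df = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
df["target"] = y

# Step 1: data types, null counts, and summary statistics.
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Step 2: one aggregation over two columns and one filter on column values.
agg = df.groupby("target")[["f0", "f1"]].agg(["mean", "std"])
filtered = df[(df["f0"] > 0) & (df["target"] == 1)]
print(agg)
print(len(filtered))

# Step 4: hold out a test set, then tune a pipeline with repeated K-fold CV.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"],
    test_size=0.2, stratify=df["target"], random_state=0,
)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=cv, scoring="f1")
search.fit(X_train, y_train)

# Step 5: metrics of the best model on the held-out test set.
y_pred = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

For a regression task the same structure applies with a regressor in the pipeline and a regression metric (for example `scoring="neg_mean_squared_error"`) in place of `f1`.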
Example datasets (you may also use any other dataset found on the internet):
1. Credit Card Fraud Detection (Classification)
`projects/credit_card_fraud.csv`
The dataset contains transactions made with credit cards in September 2013 by European
cardholders.
It covers transactions that occurred over two days, with 492 frauds
out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds)
accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, we cannot provide the original features and
more background information about the data. Features V1, V2, … V28 are the principal
components obtained with PCA; the only features that have not been transformed with
PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each
transaction and the first transaction in the dataset. Feature 'Amount' is the transaction
amount; it can be used, for example, for example-dependent cost-sensitive learning. Feature
'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.
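The imbalance described above matters when choosing metrics. The short sketch below uses only the two counts quoted in the description and shows that a model which never predicts fraud would still look almost perfect on accuracy, which suggests reporting precision, recall, or similar metrics as well. The variable names are illustrative only.

```python
# Counts quoted in the dataset description.
n_total = 284_807
n_fraud = 492

# Share of the positive class; roughly 0.1727%, quoted as 0.172% in the text.
fraud_rate = n_fraud / n_total

# A trivial baseline that labels every transaction as non-fraud (Class = 0).
true_negatives = n_total - n_fraud
baseline_accuracy = true_negatives / n_total  # very high despite being useless
baseline_recall = 0 / n_fraud                 # no fraud is ever detected

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```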
2. Customer Churn (Classification)
`projects/customer_churn.csv`
Each row represents a customer.
The dataset includes information about:
- Customers who left within the last month – the column is called Churn;
- Services that each customer has signed up for – phone, multiple lines, internet, online
security, online backup, device protection, tech support, and streaming TV and movies;
- Customer account information – how long they’ve been a customer, contract, payment
method, paperless billing, monthly charges, and total charges;
- Demographic info about customers – gender, age range, and if they have partners and
dependents.
Attribute Information
Age: age of the patient [years]
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP:
Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave
abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH:
showing probable or definite left ventricular hypertrophy by Estes' criteria]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down:
downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]
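The attributes above mix categorical codes (such as ChestPainType or ST_Slope) with numeric measurements, so most of the listed models need the categorical columns encoded first. One possible way to do this, assuming pandas and scikit-learn, is sketched below. The four rows are made up purely for illustration and are not taken from the real file; only the column names and category codes come from the attribute list.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only; they are NOT real records.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 58],
    "Sex": ["M", "F", "M", "F"],
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "RestingBP": [140, 160, 130, 136],
    "Cholesterol": [289, 180, 283, 164],
    "FastingBS": [0, 0, 0, 1],
    "RestingECG": ["Normal", "Normal", "ST", "LVH"],
    "MaxHR": [172, 156, 98, 99],
    "ExerciseAngina": ["N", "N", "Y", "Y"],
    "Oldpeak": [0.0, 1.0, 0.0, 2.0],
    "ST_Slope": ["Up", "Flat", "Up", "Down"],
    "HeartDisease": [0, 1, 0, 1],
})

categorical = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
numeric = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

# One-hot encode the categorical columns, standardize the numeric ones,
# and pass FastingBS through unchanged because it is already 0/1.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(toy.drop(columns="HeartDisease"))
y = toy["HeartDisease"]
print(X.shape)
```

In a real project the transformer would normally be the first step of a `Pipeline`, so that it is fitted only on the training folds during cross-validation.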