IDM Assignment 2


Hamza Faisal 22971


Business Understanding

• Sberbank, Russia’s oldest and largest bank, helps its customers by making predictions about
realty prices so that renters, developers, and lenders can be more confident when they purchase
real estate.
• Sberbank has provided us a dataset containing a vast collection of features, including housing
data and macroeconomic indicators, and we are to use different algorithms to develop a model
that accurately predicts realty prices.
• Since we are predicting numerical values based on different data points, the task can be framed
as a data mining problem; specifically, a supervised learning problem. As we are using
independent variables to predict the outcome of the dependent target variable (realty price), we
will apply different regression algorithms to help us understand the correlations between these
variables.
• For Sberbank, the objective is to help customers be more self-assured when purchasing real
estate by offering them the cheapest price possible; for us, the objective is to predict realty
prices as accurately as possible by analyzing the correlations among a large number of factors
and minimizing the RMSE between our predicted values and the actual realty prices.
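The RMSE objective mentioned above is straightforward to compute; a minimal sketch (the price arrays below are made up purely for illustration, not taken from the dataset):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example with made-up prices:
actual = [5_000_000, 6_200_000, 4_750_000]
predicted = [5_100_000, 6_000_000, 4_900_000]
print(rmse(actual, predicted))
```

A lower RMSE means the predictions sit closer to the actual sale prices, which is why the submissions below are compared on this number.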
Data Understanding

• Initial surface-level observations suggest that the dataset comprises a mixture of floating
point, integer, and string datatypes, spanning roughly 100,000 rows and more than 100
columns. Also present are binary nominal variables and categorical variables that have been
converted to continuous variables via dummy encoding. Within this dataset, the “price_doc”
variable is our target variable.
• The dataset also includes a collection of features about each property's surrounding
neighborhood, some of which are constant across each sub-area. Also provided are
macroeconomic variables such as several variants of GDP, average salaries,
employment rates, etc.
• However, many columns contain null or empty values, which will adversely affect the
prediction of realty prices. These columns may have to be removed in the data preparation
phase to prevent discrepancies in our prediction model.
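A quick way to surface the null-heavy columns flagged above is to count missing values per column; a minimal pandas sketch (the column names and values here are invented stand-ins, not the actual Sberbank data):

```python
import pandas as pd
import numpy as np

# Made-up stand-in for the training data.
df = pd.DataFrame({
    "full_sq": [43.0, np.nan, 77.0, 50.0],
    "build_year": [np.nan, np.nan, 1975.0, 2001.0],
    "price_doc": [5_850_000, 6_000_000, 5_700_000, 13_100_000],
})

# Count missing values per column, most-missing first.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)

# Columns exceeding a chosen missingness threshold are candidates to drop.
threshold = 1
to_drop = null_counts[null_counts > threshold].index.tolist()
print(to_drop)  # ["build_year"]
```

On the real dataset the threshold would be in the thousands of rows, as explored in the data modelling table below.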
Data Preparation

• Before starting the modelling process, I applied standard data cleaning procedures to the dataset so that
missing and unencoded values would have minimal impact on model quality (measured via the R^2 value).
The following are the steps I took to clean the data:
1) Data imputation: since many continuous attribute columns had missing values, I opted to substitute
those values with the mean of the column.
2) Encoding: there were many categorical columns, so I converted those to continuous columns through label
encoding. Initially, I did not opt for one-hot or dummy encoding because it would increase the number of
columns.
3) Handling remaining null values: after encoding, I again replaced all null values in each column with the
mean of that respective column.
This is the standard pre-processing I did before making the first submission. Where the data was cleaned
further for a particular submission, it is noted in the “Data Preprocessing” column of the data modelling table.
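The three steps above can be sketched roughly as follows; this is a minimal illustration on a made-up frame (the real assignment worked on the full Sberbank CSV, and the column names here are invented):

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for the raw training data.
df = pd.DataFrame({
    "full_sq": [43.0, np.nan, 77.0],
    "product_type": ["Investment", "OwnerOccupier", "Investment"],
    "price_doc": [5_850_000, 6_000_000, 13_100_000],
})

# 1) Mean-impute the continuous columns.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# 2) Label-encode the categorical columns (keeps the column count unchanged,
#    unlike one-hot encoding).
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# 3) Any nulls left after encoding get the column mean as well.
df = df.fillna(df.mean())
```

Label encoding was chosen here precisely because it does not add columns; the trade-off (an artificial ordering on the categories) is what later motivates the switch to one-hot encoding in the table below.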
Data Modelling
No. | Data Preprocessing | Model Details | Algorithm Used | Score
1 | Standard preprocessing as described above | Mean squared error was used as the evaluation criterion. As this was the initial submission, no additional preprocessing was done, to establish a baseline score. | Linear Regression | 3392286.12551
2 | Same as before | Same settings as the last submission; the only change was using R^2 as the evaluation criterion. | Linear Regression | 3392286.12551
3 | Same as before | Same settings; the only change was using a regression tree as the algorithm. | Regression Tree | 3596739.13627
4 | Same as before | Same settings; switched to the Random Forest Regressor, which improved the score. | Random Forest Regressor | 2174999.04935
5 | Same as before | Same settings; switched to the Gradient Boost Regressor. | Gradient Boost Regressor | 2977458.28191
6 | Same as before | Same settings; switched to the AdaBoost Regressor. | AdaBoost Regressor | 3652269.40542
7 | Dropped all rows in the training set that contained outliers | Dropping outlier rows improved the score only by a very small amount; the reduced amount of training data may have limited the gain. | Random Forest Regressor | 2152815.25745
8 | Capped all outliers to the upper/lower quartiles using the IQR method | Replaced outlier values with the upper/lower quartile of the respective column, but the score did not improve, perhaps because the data was not normally distributed and capping distorted its distribution. | Random Forest Regressor | 2314521.30069
9 | Removed outlier capping entirely (outliers kept in the data); filled null values with column medians | Score improved by filling the null values of each column with that column's median. | Random Forest Regressor | 2123910.53844
10 | Dropped all rows in the train set that contained outliers | Same as the previous submission except outlier rows were dropped; the score did not improve. | Random Forest Regressor | 2740528.96311
11 | Filled null values of columns with their standard deviation values | Filling the nulls of each column with its standard deviation did not improve the score. | Random Forest Regressor | 2154857.55406
12 | Filled null values of columns with their median values | Imputing with the median again improved the score, perhaps because the median is robust to the skewed distribution of the data. | Random Forest Regressor | 2113787.27074
13 | Dropped all 8 columns with at least 10,000 null values | No improvement here, maybe because too many columns and datapoints were dropped. This theory could be tested by dropping fewer columns. | Random Forest Regressor + Forward Selection | 2158337.66508
14 | Dropped all 5 columns with at least 12,000 null values | Score improved somewhat when fewer columns were dropped. This could be confirmed by dropping even fewer columns. | Random Forest Regressor + Forward Selection | 2124832.36276
15 | Same as before | Accidentally submitted the previous file again. | Random Forest Regressor | 2124832.36276
16 | Dropped all 3 columns with at least 17,000 null values | Dropping fewer columns did not improve the score, and overall dropping columns never scored well; from here on I will keep all columns, with their null values replaced by the column medians. | Random Forest Regressor | 2181056.73043
17 | Reverted to submission 12 preprocessing | AdaBoost with default parameters on the top 50 columns. Score worsened still; AdaBoost will not be used further. | AdaBoost Regressor + Forward Selection | 3116837.40527
18 | Same as before | No changes; same settings as submission 12, yet the score produced was worse than in submission 12. Identical settings gave a different score, suggesting randomness between runs of the same model. | Random Forest Regressor | 2140890.99149
19 | Performed standard scaling on x_train and x_test | Standard scaling produced the worst score so far. | Random Forest Regressor | 10028270.45330
20 | Performed MinMax scaling on x_train and x_test | MinMaxScaler compresses each column into a fixed range between its minimum and maximum; this transformation produced a bad score. | Random Forest Regressor | 7963230.43166
21 | Increased the test split to 0.2 from 0.15 | Increasing the split ratio brought no particular improvement, but since it is the norm, I will keep this split for future models. A polynomial interaction of degree 2 had little impact on the score and was very time consuming. | Linear Regression + Polynomial Interaction | 2160510.68883
22 | Same as before | Decreased the number of Random Forest trees to 200. Score improved a bit from before, but not by much. | Random Forest Regressor | 2115411.01026
23 | Same as before | Used Decision Tree regression with default parameters and squared error as the splitting criterion. | Decision Tree Regressor | 3584906.11883
24 | Same as before | Increased the Random Forest trees to 300 with the rest of the settings as before. Score dropped relative to 200 trees. | Random Forest Regressor | 2213330.93636
25 | Did one-hot encoding instead of label encoding | Score improved as the number of encoded columns increased. | Random Forest Regressor | 2130694.86557
26 | Same as before | Used Decision Tree regression with “poisson” as the splitting criterion and the splitter set to “random”. | Decision Tree Regressor | 2880326.99357
27 | Same as before | Changed the splitting criterion to “friedman_mse” and set the maximum tree depth to 10. | Decision Tree Regressor | 2957875.29684
28 | Same as before | Increased the tree depth to 30 from the previous model. | Decision Tree Regressor | 3513515.01412
29 | Same as before | Set the splitting criterion to “poisson” with the rest of the parameters at default. | Decision Tree Regressor | 5947012.86626
30 | Same as before | Used 50 trees in Random Forest with the rest of the parameters at default. Score improved drastically from the last submission. | Random Forest Regressor | 2190409.84127
31 | Same as before | Same model as the last submission; only decreased the number of trees from 50 to 25. | Random Forest Regressor | 2235733.66333
32 | Dropped the “sub-area” column because it had too many unique values; also dropped columns with multicollinearity > 0.95 | Increased the number of trees to 1000 and set a random seed of 42 to ensure consistency in future models. Score improved a lot after dropping these columns. | Random Forest Regressor | 1762142.67527
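The multicollinearity filter in submission 32 can be sketched with a pairwise-correlation pass; this is one common way to implement it (the column names below are invented for illustration):

```python
import pandas as pd
import numpy as np

def drop_collinear(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from every pair whose absolute correlation
    exceeds the threshold, keeping the first-seen column."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

df = pd.DataFrame({
    "area_sq_m": [30.0, 45.0, 60.0, 75.0],
    "area_sq_ft": [323.0, 484.0, 646.0, 807.0],  # ~perfectly collinear
    "floor": [2.0, 9.0, 5.0, 12.0],
})
print(drop_collinear(df).columns.tolist())  # area_sq_ft is dropped
```

Removing near-duplicate columns shrinks the feature space without discarding information, which matches the score improvement reported in submission 32.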

33 | Same as before | Set the objective to “reg:linear”, with the number of models set to 10 and a random state of 42. | Gradient Boost Regressor | 3049019.67531
34 | Same as before | Increased the number of models to 100. | Gradient Boost Regressor | 3322311.56428
35 | Same as before | Increased the number of models to 200. Score worsens as models are added. | Gradient Boost Regressor | 3420800.04449
36 | Same as before | Reduced the number of models to 2. | Gradient Boost Regressor | 4822008.50768
37 | Same as before | Accidentally submitted the previous file. | Gradient Boost Regressor | 4822008.50768
38 | Same as before | Ran with default parameters. Score improved due to the larger number of learning trees. | Extra Trees Regressor | 2074735.34347
39 | Dropped columns with more than 9,000 null values | Increased the number of trees to 1000 with the random state set to 42. | Extra Trees Regressor | 2075811.08781
40 | Dropped columns with more than 15,000 null values | Decreased the number of trees to 50. Score improved by dropping the columns with the highest number of null values. | Extra Trees Regressor | 1594195.87785
41 | Same as before | Increased the number of trees to 200. | Extra Trees Regressor | 1581125.02489
42 | Same as before | Increased the number of trees to 1000. Score improved. | Extra Trees Regressor | 1576188.73191
43 | Same as before | Reduced the number of trees to 250. It appears that a higher number of trees produces a better score. | Extra Trees Regressor | 1580294.5888
44 | Same as before | Increased the number of trees to 400. | Extra Trees Regressor | 1578311.05987
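Submissions 38 through 47 are essentially a manual sweep over the Extra Trees tree count; the same loop can be written compactly (tiny synthetic data stands in for the real training set here):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((300, 5))
y = X @ rng.random(5) + rng.normal(0, 0.05, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Sweep the number of trees the way submissions 38-47 did by hand.
for n in (50, 200, 1000):
    model = ExtraTreesRegressor(n_estimators=n, random_state=42, n_jobs=-1)
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(n, round(rmse, 4))
```

Fixing `random_state` makes such a sweep reproducible, which is exactly the lesson of the seed set in submission 32 and the run-to-run discrepancies noted elsewhere in the table.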
45 | Same as before | Increased the number of trees to 1100. Score improved but did not beat the current best score. | Extra Trees Regressor | 1577108.64684
46 | Same as before | Increased the number of trees to 2000. | Extra Trees Regressor | 1577181.74205
47 | Same as before | Decreased the number of trees to 900. | Extra Trees Regressor | 1577206.62902
48 | Tightened the multicollinearity filter: dropped columns with multicollinearity > 0.80 | Same settings as in the previous model. | Extra Trees Regressor | 1582538.9526
49 | Same as before | Submitted the same file yet the score produced was different, again suggesting randomness between runs. | Extra Trees Regressor | 1582993.2937
50 | Relaxed the multicollinearity filter: back to dropping columns with multicollinearity > 0.95 | Fixed a crucial mistake: X_train had previously been hardcoded to only 193 columns after one-hot encoding. Changed this to feed all one-hot encoded columns into X_train. This produced the best score so far. | Extra Trees Regressor | 1574086.47716
51 | Performed n-1 dummy encoding | One-hot encoding introduces multicollinearity into the dataset (the dummy variable trap), so I opted for n-1 dummy encoding to try to improve the score, but the score did not improve. | Extra Trees Regressor | 1577240.63723
52 | Reverted to submission 50 preprocessing, this time with one-hot encoding | Increased the number of trees to 1300. More trees produced a better score, resulting in the best score so far; the extra columns from one-hot encoding also helped accuracy. | Extra Trees Regressor | 1573442.66273
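The difference between submission 50's one-hot encoding and submission 51's n-1 dummy encoding is a single argument in pandas (the category values below are made up):

```python
import pandas as pd

s = pd.DataFrame({"product_type": ["Investment", "OwnerOccupier", "Investment"]})

# One-hot: one indicator column per category (n columns).
one_hot = pd.get_dummies(s, columns=["product_type"])
# n-1 dummy encoding: drop the first category's column to avoid
# the dummy variable trap (perfect multicollinearity).
dummy = pd.get_dummies(s, columns=["product_type"], drop_first=True)

print(one_hot.shape[1], dummy.shape[1])  # 2 1
```

For linear models the dropped column matters; tree ensembles are largely indifferent to it, which is consistent with the near-identical scores of submissions 50-52.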

53 | Filled null values of columns with their median values | Used the previous model, with the nulls in each column filled with the median. | Extra Trees Regressor | 1573952.3048
54 | Same as before | Implemented stacking with Random Forest and Extra Trees as the base estimators and Linear Regression as the final estimator, with the parameters of both set to default. Score worsened drastically. | Stacking | 4904707.92226
55 | Same as before | Same stacking model as before, except the Extra Trees base estimator was increased to 1000 trees. Score improved a bit, but less than expected. | Stacking | 4585886.19682
56 | Same as before | Decreased the number of models to 50 from the default 100. Score produced was as expected. | Gradient Boost Regressor | 3049450.86639

57 | Same as before | Implemented the XGBoost Regressor with all its parameters set to default. XGBoost performed well compared to Gradient Boost, being a more advanced variant of the same idea. | XGBoost Regressor | 2853892.82845
58 | Same as before | Changed the booster parameter to “dart” but the score was identical; most probably implemented incorrectly. | XGBoost Regressor | 2853892.82845
59 | Same as before | Increased the number of models to 500. Score worsened drastically compared to the default settings. | XGBoost Regressor | 3226911.81454
60 | Same as before | Decreased the number of models to 50. Score was a bit worse than the defaults, which is explainable since there were fewer trees to learn on. | XGBoost Regressor | 2862738.26407
61 | Same as before | Increased the number of models to 3000. Score worsened again as the number of models increased. | XGBoost Regressor | 3827481.92342
62 | Same as before | Changed the number of models to 300 with the learning rate increased to 0.5; the score still shows no clear pattern as the model count varies. | XGBoost Regressor | 2880326.99357
63 | Same as before | Changed the number of models to 700 with the learning rate increased to 0.7 and squared error as the objective. Score worsened still. | XGBoost Regressor | 2957875.29684
64 | Same as before | Implemented default linear regression with the top 10 columns chosen by the forward selector. Score was not good due to the much smaller amount of data. | Linear Reg. + Forward Selection | 3513515.01412
65 | Same as before | Same settings as in the previous model, again using the top 10 columns from the forward selector. Score was not good due to the much smaller amount of data. | Linear Reg. + Forward Selection | 3075196.04853
66 | Dropped columns with variance greater than 0.99 | Used the Extra Trees Regressor again with 1000 trees, coupled with the top 50 columns. | Extra Trees + Forward Selection | 1574002.23362
67 | Dropped columns with variance greater than 0.90 | Same model as the previous submission. Removing additional columns and the reduced data lowered the accuracy. | Extra Trees Regressor | 1779384.25508
68 | Same as before | Implemented stacking using Gradient Boost, Random Forest, and Extra Trees as base learners with Logistic Regression as the meta learner. | Stacking | 1728803.72516
69 | Same as before | Same model as earlier, except training was done with only the 10 best features selected using forward selection. | Stacking + Forward Selection | 1714147.57976
70 | Same as before | Same model as earlier, with the 50 best features selected using forward selection. | Stacking + Forward Selection | 1606096.82358
71 | Same as before | Same model as earlier, with the 30 best features selected using forward selection. | Stacking + Forward Selection | 1707712.37781
72 | Same as before | Same model as earlier, with the 40 best features selected using forward selection. | Stacking + Forward Selection | 1818668.32905
73 | Same as before | 1000 trees in Random Forest with the best 30 features and a polynomial interaction of degree 1. Score did not change much from the last submission, and the polynomial interaction again took a lot of processing time. | Random Forest + Polynomial | 1804573.61399
74 | Same as before | 2000 trees with the best 40 features and polynomial interaction. Score improved drastically, maybe because the extra trees struck a near-optimal balance between variance and bias. | Random Forest + Polynomial | 1577108.64684
75 | Same as before | 2500 trees with the best 50 features and polynomial interaction. Score did not improve. | Random Forest + Polynomial | 1577123.96828
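The forward selection used throughout the table (e.g., submissions 64-72) corresponds to scikit-learn's SequentialFeatureSelector in "forward" mode; a minimal sketch on synthetic data where only two of eight features matter:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.random((150, 8))
# Only the first two features actually drive the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.01, 150)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,   # the "top k best columns" in the table above
    direction="forward",
)
selector.fit(X, y)
print(np.flatnonzero(selector.get_support()))  # should pick features 0 and 1
```

Forward selection refits the estimator once per candidate feature per round, which is why it was so slow on the full Sberbank feature set, and why backward selection (which starts from all features) proved even slower, as noted in the challenges below.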
Data Evaluation
1) Which algorithm worked best for the given dataset and why?
• Extra Trees was, for me, the best-performing algorithm. After a score of 1762142.67527 with Random Forest, which initially put me in eighth place on the leaderboard, I
decided to try more sophisticated tree-based ensemble models, specifically XGBoost and Extra Trees, since tree-based ensembles frequently provide higher accuracy. Of these two,
Extra Trees produced my best score of 1573442.66273. This might be because, unlike Random Forest, Extra Trees does not bootstrap by default: each tree is trained on the entire
dataset, which lessens bias. And where Random Forest searches for the best split point on each candidate feature, Extra Trees chooses split points at random. This randomized
splitting reduces variance, and also explains the run-to-run discrepancies in the score when I varied the number of trees in this algorithm. Since both bias and variance are reduced
in Extra Trees, the chances of the model being under- or over-fitted also decrease.
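The distinction described above (no bootstrapping by default, randomized split thresholds) is visible directly in the scikit-learn estimators' default parameters; a quick check, not from the assignment itself:

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rf = RandomForestRegressor()
et = ExtraTreesRegressor()

# Random Forest bootstraps by default; Extra Trees trains each tree
# on the full dataset and randomizes the split thresholds instead.
print(rf.bootstrap, et.bootstrap)  # True False
```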

2) Which data transformation techniques (such as column filtering based on correlation) boosted your score?
• The following data transformation techniques improved my score:
i) One-hot encoding all the categorical attributes.
ii) Filling the null values of each column with that column's median.
iii) Dropping the “sub-area” column early on, since its many unique values contributed to variance after encoding.
iv) Dropping columns with more than 15,000 null values.
v) Dropping columns with multicollinearity greater than or equal to 0.95.
3) What were the overall challenges that you faced while improving the score?
• Running ensemble models was very time consuming, especially when the number of models was increased.
• Backward feature selection was extremely time consuming, taking almost 200 minutes to produce only the 5 best features. Because of this, I was not able to use
backward feature selection.
• Polynomial interaction with other models was also a very time- and memory-consuming task. The highest degree of polynomial interaction I was able to implement was 2, and
that was with Linear Regression. I tried to go higher, but my laptop crashed on two different attempts, so I chose not to use polynomial interaction with any other model.
