QT Project Report Group 9C PDF
PROPERTY-RELATED VARIABLES
PROJECT REPORT
QUANTITATIVE TECHNIQUES II
[QT2BJ21-2]
Submitted by:
DATA DESCRIPTION
We selected the "Municipal Debt Risk Analysis" dataset from Kaggle for our analysis. The
dataset is used to predict whether an account will result in bad debt. The data were
collected from the billing systems of 8 municipalities in South Africa over two years.
Evaluating the dataset:
1. Who collected the data?
The author of the dataset is Dylan Rawlins, a Kaggle user.
2. What is the source of the data?
Municipal Finance Systems of 8 Municipalities in South Africa.
3. When was the dataset created?
It was created in July 2020 and is expected to be updated annually.
4. What was the purpose of the dataset?
To predict if an account is a bad debt, based on various parameters like account
category, property value, property size, total billing and so on.
5. What type of data was collected?
The account category, property details, billing details, receipting details, debt records,
total write off, collection ratio, electricity bill, debt billing ratio, and bad debt details.
A sample of the dataset is given below:
Variable           Entry 1           Entry 2      Entry 3      Entry 4      Entry 5
accountcategoryid  12                1            1            1            1
accountcategory    Place of Worship  Residential  Residential  Residential  Residential
acccatabbr         POW               RES          RES          RES          RES
propertyvalue      0                 0            0            0            0
propertysize       0                 0            0            0            0
totalbilling       4177              2084         4177         1862         4456
avgbilling         116               58           116          155          124
totalreceipting    16525             0            0            0            0
avgreceipting      2066              0            0            0            0
total90debt        0                 25387        30269        19819        25161
totalwriteoff      0                 0            0            0            0
collectionratio    3.96              0            0            0            0
debtbillingratio   0                 12.18        7.25         10.64        5.65
totalelecbill      0                 0            0            0            0
hasidno            0                 0            0            0            0
baddebt            0                 1            1            1            1
Table 1: Sample of the Dataset (5 entries)
No two variables are highly negatively correlated; all the negative correlation values are
close to 0.
Interesting insights: The correlation between "Bad Debt" and whether the customer has an ID
No. is 0.3067. This is a weak positive correlation, suggesting that whether a debt goes bad is
only slightly related to whether the customer has an ID No.
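As an illustration of how such a correlation value can be computed, the sketch below calculates the Pearson correlation between two 0/1 variables. The values are hypothetical toy data, not taken from the Kaggle dataset, and the function name `pearson` is our own; this is not necessarily how the figure of 0.3067 was produced.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values (assumption): 1 = has ID number / bad debt, 0 = otherwise.
hasidno = [1, 1, 0, 0, 1, 0, 1, 0]
baddebt = [1, 1, 0, 0, 1, 0, 0, 1]

r = pearson(hasidno, baddebt)
print(round(r, 4))
```

For two binary variables, this value is the same as the phi coefficient of their 2x2 contingency table.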
2. Bar Plot: We used bar plots to understand the relationship between the various
categorical variables. Our dataset had three categorical variables:
We can see that out of all different types of accounts, the "Unknown" accounts (UKN) have
the highest percentage of Bad Debts, followed by "Municipal" accounts (MUN). "Educational"
(EDU) and "Industry" (IND) accounts have the lowest percentage of Bad Debts.
From the stacked count bar plot, we can see that the "Residential" (RES) account has the
highest number of Bad Debts (~45,000).
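The difference between the two views (percentage of bad debts versus count of bad debts) can be sketched as follows. The records are hypothetical toy data and the function name `bad_debt_share` is our own; the sketch only shows why a large category can lead on counts while a small category leads on percentage.

```python
from collections import defaultdict

def bad_debt_share(records):
    """Return {category: (bad_debt_count, percent_bad_debt)} from (category, baddebt) pairs."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for category, baddebt in records:
        totals[category] += 1
        bad[category] += baddebt
    return {c: (bad[c], 100.0 * bad[c] / totals[c]) for c in totals}

# Hypothetical toy records (assumption), using the report's category abbreviations.
records = [
    ("RES", 1), ("RES", 0), ("RES", 1), ("RES", 0), ("RES", 1), ("RES", 0),
    ("UKN", 1), ("UKN", 1),
    ("EDU", 0), ("EDU", 0),
]

shares = bad_debt_share(records)
for category, (count, percent) in sorted(shares.items()):
    print(category, count, round(percent, 1))
```

In this toy data, RES has the most bad debts by count while UKN has the highest percentage, which mirrors the pattern described above.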
We can see that there are more bills for customers without a social security number. However,
the percentage of accounts with bad debt is higher for customers who have a social security
number.
3. Scatter plot
a. Analyzing the relationship of Debt to Billing Ratio and Collection Ratio of Bad
Debt
We used a scatterplot to see how "baddebt" behaves with respect to the two ratios in our data
set: "debtbillingratio", the ratio of the debt issued to the bills generated for a particular
account, and "collectionratio", the ratio of the amount paid to the amount billed for a given
account. Using this, we wanted to see the relationship between these ratios and whether a debt
was defaulted.
"baddebt", being a categorical variable, takes only two values, 0 or 1, indicating whether the
debt was defaulted. This can be seen in the graph, where light blue shows defaulted debts and
dark blue shows debts paid back on time.
REGRESSION MODELLING
We split the dataset into train and test dataset using a 70-30 split for our analysis. We then
trained our model on the train dataset and used the model to predict the values of the test
dataset.
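A 70-30 split of this kind can be sketched as below. This is a minimal illustration under our own assumptions: the function name `train_test_split`, the fixed seed, and the placeholder rows are hypothetical, and the report does not say how its split was randomized.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows (assumption): integers standing in for account records.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters because billing extracts are often ordered (for example by municipality), and an unshuffled cut could put whole groups in only one of the two sets.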
To build the logistic regression model, we first cleaned the dataset by analysing two major
factors:
1. Missing values
There are no missing values in the dataset. The output is shown as follows.
2. Outliers
On creating boxplots, we found many outliers in the data, so simply removing them could
cause a massive loss of information. Most of the outliers were in the propertyvalue and
propertysize variables, which could be due to people not owning property; as a result, these
variables had many occurrences of the value 0.
Figure 12: Outliers by variable
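Boxplots conventionally flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to hypothetical toy values; the function name `iqr_outliers` and the data are assumptions, and the quartile method may differ from the one used to draw the report's boxplots.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical toy property values (assumption): mostly zeros, as described above.
propertyvalue = [0, 0, 0, 0, 0, 0, 1, 2, 3, 1000]
outliers = iqr_outliers(propertyvalue)
print(outliers)
```

When most values are 0, the quartiles sit at or near 0 and the IQR is small, so many non-zero values fall outside the fences. This is one way a zero-heavy variable can show a large number of outliers.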
Model Building
During Logistic regression modelling, we analyzed the impact of various independent variables
on the dependent variable i.e., baddebt and attempted to predict the value of baddebt variable.
The model was built on the train dataset, and it was tested for its accuracy on the test dataset.
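Testing accuracy on the test dataset can be sketched as follows: predicted probabilities are turned into 0/1 predictions and compared with the actual baddebt values. The 0.5 cut-off, the function names, and the toy values are all assumptions; the report does not state which threshold was used.

```python
def confusion_counts(actual, predicted_prob, threshold=0.5):
    """Return (true_pos, false_pos, true_neg, false_neg) for a probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical toy values (assumption): actual baddebt labels and model probabilities.
actual = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

counts = confusion_counts(actual, probs)
acc = accuracy(*counts)
print(counts, round(acc, 3))
```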
We analyzed the summary of model_2 and observed that the p-values of all the variables were
very low, which implied that all the variables in the model were now significant. However,
the AIC value increased significantly. So, we built a new model through backward elimination
using the step function to achieve a lower AIC value, and used that model as our final
prediction model to predict the results in the test dataset.
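For reference, the AIC of a model is 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood; a lower value is better. The sketch below computes it for a binary-outcome model from predicted probabilities. The function names and toy values are assumptions, and the code illustrates only the criterion, not the step procedure itself.

```python
import math

def log_likelihood(actual, predicted_prob):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(actual, predicted_prob)
    )

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_lik

# Hypothetical toy values (assumption): two candidate models on the same outcomes.
actual = [1, 0, 1, 0]
probs_small_model = [0.6, 0.4, 0.6, 0.4]  # 1 parameter, weaker fit
probs_large_model = [0.9, 0.1, 0.8, 0.2]  # 3 parameters, stronger fit

aic_small = aic(log_likelihood(actual, probs_small_model), n_params=1)
aic_large = aic(log_likelihood(actual, probs_large_model), n_params=3)
print(round(aic_small, 3), round(aic_large, 3))
```

Backward elimination repeatedly drops the variable whose removal lowers the AIC the most and stops when no removal improves it.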
Performing the step function, we arrived at the following result
In the step function, we observed that the AIC value fell sharply, from 2177123 to 3208.32,
when the debtbillingratio variable was not included as a predictor variable. Hence,
we built our third model, Model_3.
In this model, we observed that the AIC value had reduced to 3117.8. However, the p-values
for all the encoded variables were very high, so we applied the step function to check if we
could get a better model with reduced p-values.
Given below is the final result of the step function on model_4
We got the following output on building model_5 based on the result of the step
function applied on model_4.
Updated regression model – Model_5
Based on the results of the step function applied on model_4, we arrived at this
model.
Given below is the summary of the updated model_5
We decided to further analyze model_3, model_4 and model_5 for the following reasons.
Model_3 – This model had the lowest AIC value without the accountcategoryid being one-hot
encoded.
Model_4 – This model had all the one-hot encoded accountcategoryid variables.
Model_5 – This model had the lowest AIC value out of all the models.
MULTICOLLINEARITY DIAGNOSTICS
For multicollinearity diagnostics, we checked the VIF values for all the variables in each of
the three selected models.
It was observed that all the variables in model_3 had VIF values less than 5, suggesting
little to no multicollinearity among its predictors. In model_4, some variables had high
VIF values, implying that they were highly correlated with other predictors. Model_5 had one
variable with a moderately high VIF value, indicating moderate multicollinearity.
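The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it with ordinary least squares in plain Python. The helper names and the toy columns are hypothetical, and this is an illustration of the definition rather than the routine used for the report's VIF tables.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def r_squared(target, predictors):
    """R^2 of a least-squares fit of target on the predictors plus an intercept."""
    n = len(target)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y for row, y in zip(design, target)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(target) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot

def vif(columns, j):
    """Variance inflation factor of column j given the other columns."""
    others = [c for i, c in enumerate(columns) if i != j]
    return 1.0 / (1.0 - r_squared(columns[j], others))

# Hypothetical toy predictors (assumption): x1 and x2 are strongly correlated.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
x3 = [1, 0, 0, 1, 1, 0]
columns = [x1, x2, x3]
vifs = [vif(columns, j) for j in range(len(columns))]
print([round(v, 2) for v in vifs])
```

A VIF close to 1 means a predictor is nearly uncorrelated with the others; the threshold of 5 used above is a common rule of thumb.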
The resulting p-value was observed to be very low, which implied that we have sufficient
evidence to reject the null hypothesis. Hence, at least one variable in all three models was
significant. To check the significance of individual variables, we looked at the z-test results in
the summary of the model.
For the Z test, we took the null and alternate hypotheses as follows.
Null: The coefficient of the variable under consideration is equal to zero, which implies the
variable is not significant.
Alternate: The coefficient of the variable under consideration is not equal to zero, which
implies the variable is significant.
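The z-test for a single coefficient can be sketched as follows: the z statistic is the estimated coefficient divided by its standard error, and the two-sided p-value comes from the standard normal distribution. The function name and the numbers are hypothetical and are not taken from the model summaries.

```python
import math

def wald_z_test(coefficient, std_error):
    """Return (z, two-sided p-value) for the null hypothesis coefficient = 0."""
    z = coefficient / std_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical estimates (assumption): same standard error, different coefficients.
z_strong, p_strong = wald_z_test(0.5, 0.1)  # z = 5: reject the null
z_weak, p_weak = wald_z_test(0.1, 0.1)      # z = 1: cannot reject the null
print(round(z_strong, 2), p_strong)
print(round(z_weak, 2), round(p_weak, 4))
```

A very low p-value leads us to reject the null for that variable, while a high p-value means we cannot reject it.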
Analyzing the p-values for individual variables in the models under consideration, we observed
that the p-values for all the variables in model_3 and model_5 were very low. Hence, we could
reject the null for all the variables, implying that the variables were significant. However, the
p-values of some variables in model_4 were high, so we could not reject the null for those
variables, which implied that the model might not be accurate.
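\[
\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = -5.89 + 0.06547 \cdot \mathit{accountcategoryid} + 0.00000001313 \cdot \mathit{propertysize} + 9.069 \cdot \mathit{total90debt} + 7.626 \cdot \mathit{totalwriteoff} + 0.9248 \cdot \mathit{hasidno}
\]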
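To show how the fitted equation above turns into a predicted probability, the sketch below evaluates the logit for one account and applies the inverse logit p = 1 / (1 + exp(-logit)). The coefficients are copied from the equation as printed; the example account and the function names are hypothetical.

```python
import math

# Coefficients copied from the fitted equation in the report.
INTERCEPT = -5.89
COEFFICIENTS = {
    "accountcategoryid": 0.06547,
    "propertysize": 0.00000001313,
    "total90debt": 9.069,
    "totalwriteoff": 7.626,
    "hasidno": 0.9248,
}

def logit(account):
    """Linear predictor: intercept plus coefficient times value for each variable."""
    return INTERCEPT + sum(COEFFICIENTS[name] * account[name] for name in COEFFICIENTS)

def probability(account):
    """Inverse logit, written to avoid overflow for large positive or negative logits."""
    z = logit(account)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical account (assumption): residential, no debt or write-off, has an ID number.
account = {
    "accountcategoryid": 1,
    "propertysize": 0,
    "total90debt": 0,
    "totalwriteoff": 0,
    "hasidno": 1,
}
z = logit(account)
p = probability(account)
print(round(z, 5), round(p, 4))
```

Because a positive coefficient raises the logit, each positive term in the equation increases the predicted probability of bad debt as its variable grows.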
CONCLUSION
Through our analysis, we have established that variables such as Account Type, Property
Size, Total outstanding debt in the past 90 days, and Availability of Social Security
Number are significant factors that determine whether an account will result in bad debt for
the municipalities.
This conclusion is consistent with our initial intuition that bad debt is correlated with the
various account parameters. We can logically conclude that certain account categories (such as
government, municipality) have higher instances of debt going bad, which our model validates.
Our model also supports our initial assumption that property size matters: people with bigger
properties differ from people with smaller properties in how likely they are to default on debt.
Thus, our model could predict whether a customer will default with fairly high accuracy.
FUTURE SCOPE
In future, we would like to extend the analysis in the following ways to study the dataset
further and to try to make the model more accurate.
• Standardize the data and use the standardized data to build the model.
• Analyse the interaction effects of the independent variables.
• Build different models using different approaches like Naive Bayes, Decision Trees,
Support Vector Machines, etc., and use the model with the highest accuracy.
REFERENCES
• Kaggle.com. 2022. Comprehensive EDA on R with Logistic Regression. [online]
Available at: <https://www.kaggle.com/mbkinaci/comprehensive-eda-on-r-with-logistic-regression>.
• Holtz, Y., 2022. Grouped, stacked and percent stacked barplot in ggplot2. [online]
R-graph-gallery.com. Available at: <https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html>.
• Kaggle.com. 2022. Municipal Debt Risk Analysis. [online] Available at:
<https://www.kaggle.com/dmsconsultingsa/municipal-debt-risk-analysis>.
• Correspondence Analysis in Archaeology. 2022. R function for binary Logistic
Regression. [online] Available at: <http://cainarchaeology.weebly.com/r-function-for-binary-logistic-regression.html>.
• Stack Overflow. 2022. Stack Overflow - Where Developers Learn, Share, & Build
Careers. [online] Available at: <https://stackoverflow.com/>.