Use of Data Warehousing to Analyze Customer Complaint Data of CFPB of USA


W. R. A. Fonseka, D. G. M. Nadeesha, P. M. C. Thakshila, N. A. Jeewandara, D. M. Wijesinghe,
R. V. De. S. Sahabandu, P. P. G. D. Asanka
Sri Lanka Institute of Information Technology,
Malabe
Sri Lanka
rosh.fsk@gmail.com, mihirinadeesha@gmail.com, chamanithakshila@gmail.com, nimnajeewandara@gmail.com,
danu.singhe@gmail.com, vasana@live.com, dineshasanka@gmail.com

Abstract

The Consumer Financial Protection Bureau (CFPB) was established in the USA to enable consumers to report complaints and support issues regarding financial products and services to the US government. The complaint data is freely available for analysis and for tracking how efficiently and effectively financial institutions handle the complaints lodged against them. Each complaint consists of attributes that uniquely describe and identify it. These attributes were used for data mining, analysis and prediction. The data warehouse creation and data analysis were done using Microsoft SQL Server technologies. The Microsoft Decision Trees, Microsoft Naïve Bayes, Microsoft Time Series and Microsoft Neural Network mining models were used in this study. Based on the results, a correlation was observed between the growth of complaints in certain financial domains and changes in economic, political and regulatory forces. The probability predictions also show how likely each product is to receive a particular issue-related complaint, how likely a particular issue is to get a timely response, how likely a particular issue is to cause a consumer dispute, and what types of issues are mostly lodged via a particular submission method. This information can be used in prescriptive analysis to enhance financial consumer services and to improve the response quality of automated consumer support systems.

Index Terms: Financial, Data Mining, Big Data, Data Warehousing, Time Series, Decision Trees, Neural Network

I. INTRODUCTION
This paper presents a study of the complaints received by the Consumer Financial Protection Bureau (CFPB) of the United States of America (USA). Over the past few years (2011-2016), financial consumers have lodged over six hundred thousand complaints against various financial institutions in the USA. The study focuses specifically on identifying patterns in the existing data to predict future trends and numbers of complaints, and on using those predictions to improve the capacity to resolve complaints efficiently.

The CFPB was established with four missions: (1) to enforce laws to ensure access to credit; (2) to educate consumers; (3) to protect the financial interests of service members; and (4) to protect the interests of older Americans [1]. These missions also aim to improve and protect all internal and external stakeholders of the financial institutions.

The study of customer complaints helps to diagnose how consumers perceive and use financial products and services, to identify potential problems in the marketplace, and to achieve better outcomes for everyone. Reviewing the findings could help the industry and all kinds of financial institutions to take the right position when making decisions, policies and practices.

Customer complaints are a rich source of symptoms for companies and their stakeholders. By properly analyzing the complaint data they can make forecasts and take action. An Enterprise Data Warehouse (EDW) represents historical data from the data marts of different operational areas of a company or organization, and provides a way to address the 7Vs of big data. Volume, Velocity, Variety, Variability, Veracity, Visualization and Value are the current 7V challenges of big data [2]. Respectively, they refer to the terabytes of data added to databases daily, the high speed of data processing required by many real-time services, the heterogeneous data coming from different data sources, the constantly changing data, the inaccuracy caused by the noisy and messy nature of big data, the presentation of data in a readable and accessible way, and the potential value of the information, which lies in rigorous analysis of accurate data. An EDW is a technology focused mainly on reporting and data analysis, and it keeps central repositories of integrated data from disparate sources. This paper explains and showcases how data warehousing can be used to analyze a vast amount of customer complaint data to support predictive and prescriptive decisions.

II. CURRENT IMPLEMENTATION

This section explains how the CFPB handles consumer complaints. Consumers send thousands of complaints about financial products and services to companies for response, and the CFPB manages that process as a third party.

978-1-5090-6132-7/16/$31.00 ©2016 IEEE


Currently, they gather data mainly from three parties: consumers, the bureau and the company responsible for the financial product or service related to the complaint.

[Figure 1 flow: Complaint Submitted (by consumer), Review and Route (by bureau), Company Response (by company), Complaint Published (by bureau), Consumer Review (by consumer), Analyze & Report (by bureau)]
Figure 1: Current Implementation of the CFPB [3]

A consumer submits a complaint about an issue with a company regarding a financial product or service. The complainant receives email updates and can log in to the CFPB portal to track the status of the complaint. Consumers can submit complaints via phone call, email, fax, web, postal mail and referral systems. In the second step, the CFPB forwards the complaint and any documents provided by the complainant to the company and monitors the company's response. In the third step, the responsible company reviews the complaint, communicates with the complainant as needed and updates its response. In the fourth step, the CFPB publishes information about the complaint, in its own format, in its public consumer complaint database as OLTP (Online Transaction Processing) data. In the fifth step, the CFPB lets the complainant know when the company has responded to the complaint. The complainant can review that response and give feedback. Finally, complaints are used to help supervise companies, enforce federal consumer financial laws and write better rules and regulations. The CFPB also reports to Congress about the complaints that it receives [3].

A few issues can be identified in the current model. The existing database focuses mainly on historical data and provides only OLTP features; it does not produce any predictions or forecasts for analysis. Therefore, a data warehouse model is required to organize and analyze the data and to overcome the drawbacks of the current system.

III. DATA WAREHOUSE DESIGN
A data warehouse exists to facilitate the decision support systems of an organization. Decision support systems help users with ad-hoc analysis and strategic decision making. Generally, decision support systems require historical data, both summarized and at a transaction level of detail [4]. For the design of the data warehouse for CFPB consumer complaint data, the two most commonly discussed methods, those of Bill Inmon and Ralph Kimball, were considered.

In a nutshell, Bill Inmon's enterprise data warehouse approach (the top-down design) starts with a normalized data model, which is designed first. The dimensional data marts, which contain the data required for specific business processes or specific departments, are then created from the data warehouse.

Ralph Kimball's dimensional design approach (the bottom-up design) first creates the data marts that facilitate reports and analysis; these are then combined to create a broad data warehouse [2].

The most prominent similarities between Inmon's and Kimball's models are the use of time-stamped data and the extract, transform, and load (ETL) process. Although the execution of these two elements differs between the two models, the data attributes and query results are very similar [4].

Comparing the two approaches, the Kimball approach integrates individual business requirements, needs a small team of generalists and has a low start-up cost, while the Inmon approach requires enterprise-wide integration, a bigger team of specialists and high start-up costs. Inmon's approach is suitable for stable businesses that can afford the design time and the costs involved [5]. On the other hand, users need to be able to query these large amounts of data easily. Often, they may not know what relationships exist between the data elements they are searching for, or what it takes to find a statistically significant complaint-oriented relationship between two such unlikely occurrences. This is the challenge in providing an effective data warehouse solution.

Therefore, considering these factors, the Kimball approach is adequate for the CFPB consumer complaint data warehouse. In the Kimball approach, the fact table consists of the measurements and metrics related to the business processes. It is located at the center of a star schema or a snowflake schema, surrounded by dimension tables. A fact table typically has two types of columns: those that contain facts and those that are related via foreign keys to the dimension table(s) [3]. The first step in designing a fact table is to determine its granularity. By granularity, we mean the lowest level of information that will be stored in the fact table [4].

The CFPB data warehouse design for fact and dimension table relationships is modeled on the star schema. The fact table (Fact_CustomerComplaints) includes high-level information about the complaint: issue, sub-issue, timely response and consumer dispute state. Additional logic, in the form of lookup functions, was used to extract the ReceivedDate, Status and Product dimension surrogate keys. Using a replace function, the Yes/No strings were transformed to 1/0, and data conversion functions were used to convert the string results to integers for the TimelyResponse and ConsumerDisputed entries. Three dimension tables were created: Dimension_ReceivedDate, with high-level information about the complaint received date (day of month, month, quarter and year); Dimension_Status, with status information about the complaint submission method and the company response to the consumer; and Dimension_Product, with high-level information about the complaint product, sub-product, company, US state and ZIP code.
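The star schema described above can be sketched in a few lines. The following is a minimal illustration only, written in Python with the standard sqlite3 module rather than the Microsoft SQL Server tools used in the study. The table names and the *_SK key names follow the paper; the remaining column names, the column types and the helper functions yes_no_to_int and create_schema are assumptions made for the example.

```python
import sqlite3

# Star schema sketch. Table names and *_SK key names follow the paper;
# the other column names and all column types are assumptions.
DDL = """
CREATE TABLE Dimension_ReceivedDate (
    ReceivedDate_SK INTEGER PRIMARY KEY,
    ReceivedDate    TEXT NOT NULL UNIQUE,
    DayOfMonth      INTEGER NOT NULL,
    Month           INTEGER NOT NULL,
    Quarter         INTEGER NOT NULL,
    Year            INTEGER NOT NULL
);
CREATE TABLE Dimension_Status (
    Status_SK                 INTEGER PRIMARY KEY,
    SubmittedVia              TEXT,
    CompanyResponseToConsumer TEXT
);
CREATE TABLE Dimension_Product (
    Product_SK INTEGER PRIMARY KEY,
    Product    TEXT,
    SubProduct TEXT,
    Company    TEXT,
    USState    TEXT,
    ZIPCode    TEXT
);
CREATE TABLE Fact_CustomerComplaints (
    ComplaintID      INTEGER PRIMARY KEY,
    ReceivedDate_SK  INTEGER NOT NULL REFERENCES Dimension_ReceivedDate(ReceivedDate_SK),
    Status_SK        INTEGER NOT NULL REFERENCES Dimension_Status(Status_SK),
    Product_SK       INTEGER NOT NULL REFERENCES Dimension_Product(Product_SK),
    Issue            TEXT,
    SubIssue         TEXT,
    TimelyResponse   INTEGER CHECK (TimelyResponse IN (0, 1)),
    ConsumerDisputed INTEGER CHECK (ConsumerDisputed IN (0, 1))
);
"""


def yes_no_to_int(value: str) -> int:
    """Map a 'Yes'/'No' source string to 1/0 (replace plus integer conversion)."""
    return {"yes": 1, "no": 0}[value.strip().lower()]


def create_schema() -> sqlite3.Connection:
    """Create the four tables in an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn
```

The fact table holds one row per complaint (the grain), and each row points to exactly one row in each of the three dimensions through its surrogate key.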

Figure 2: Data warehouse design for CFPB consumer complaints data

IV. ETL
The original data set was obtained from the CFPB complaint data download portal in CSV flat file format. The data set contained over 500,000 records, recorded from December 2011 to April 2016. The complaints contained the following attributes: Date received, Product, Sub-product, Issue, Sub-issue, Consumer complaint narrative, Company public response, Company, State, ZIP code, Submitted via, Tags, Consumer consent provided, Date sent to company, Company response to consumer, Timely response, Consumer disputed and Complaint ID. The entire data set was structured as a single denormalized table. After analyzing the data set and its content, it was observed that some columns (attributes) were populated in too small a percentage of records for a proper analysis. Hence these columns were ignored prior to loading the staging database. The staging database was created to simplify the ETL process and reduce delays. The required fact and dimension tables were also created in parallel to this process.

The data was imported from the source into the staging database as follows. The destination database was created, and a single table named Staging_ConsumerComplaints was used to store the imported data. The source data set was filtered and only the following attributes were added to the staging database: ComplaintID, DateReceived, Product, SubProduct, Issue, SubIssue, Company, USState, ZIPCode, SubmittedVia, CompanyResponseToConsumer, TimelyResponse, ConsumerDisputed and ComplaintID_Copy. After the staging table was created, SQL Server Management Studio was used to import the data from the CSV file into it. It is important that the data types of the destination table attributes match the data types selected during the import process. Once all the required data was imported, the transformation processes were developed as an SSIS (SQL Server Integration Services) package in SQL Server Data Tools. The required transformations varied for each dimension and for the fact table. Hence the staging table was multiplexed among the different dimensions.

Dimension_ReceivedDate was processed first, using the DateReceived column of the staging table as the source. To extract distinct received dates, a sorting module was used with the setting to remove duplicate rows. Next, to derive the Day, Month and Year, a derived column module was used with the DATEPART() date/time function for each output. Once the month column was extracted, it was used to derive the quarter column using a derived column module with a set of conditional statements. After all the required attributes of the Dimension_ReceivedDate table were derived and filtered, the data was loaded into the relevant destination table.

For Dimension_Status, no complex transformations were used. The destination table was created according to the dimensional model with the CompanyResponseToConsumer and SubmittedVia attributes. A sorting module with the remove-duplicates configuration was used to sort and extract distinct records from the staging table, and the output was loaded into the destination table.

For Dimension_Product, no complex transformations were used either. The destination table was created according to the dimensional model with the Product, SubProduct, Company, State and ZipCode attributes. A sorting module with the remove-duplicates configuration was used to sort and extract distinct records from the staging table, and the output was loaded into the destination table.
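The dimension-loading steps above (sort, remove duplicates, derive date parts, derive the quarter with conditional statements) can be illustrated outside SSIS. The sketch below is a plain Python approximation, not the authors' SSIS package: the function names quarter_of, build_date_dimension and build_lookup are hypothetical, surrogate keys are simply assigned as consecutive integers, and the input values are assumed to be non-null and sortable.

```python
from datetime import date


def quarter_of(month: int) -> int:
    """Derive the quarter from the month with explicit conditions."""
    if month <= 3:
        return 1
    if month <= 6:
        return 2
    if month <= 9:
        return 3
    return 4


def build_date_dimension(received_dates):
    """Return one row per distinct received date, sorted, with a surrogate key."""
    rows = []
    for sk, d in enumerate(sorted(set(received_dates)), start=1):
        rows.append({
            "ReceivedDate_SK": sk,      # surrogate key (assumed: consecutive integers)
            "ReceivedDate": d,
            "Day": d.day,               # analogous to DATEPART(day, ...)
            "Month": d.month,           # analogous to DATEPART(month, ...)
            "Quarter": quarter_of(d.month),
            "Year": d.year,             # analogous to DATEPART(year, ...)
        })
    return rows


def build_lookup(staging_rows, columns):
    """Sort and de-duplicate the chosen staging columns.

    Returns a dict that maps each distinct combination of values to a
    surrogate key, so it can also serve as a lookup when loading facts.
    """
    distinct = sorted({tuple(row[c] for c in columns) for row in staging_rows})
    return {combo: sk for sk, combo in enumerate(distinct, start=1)}
```

For example, build_lookup(staging_rows, ["CompanyResponseToConsumer", "SubmittedVia"]) would play the role of the Dimension_Status load, and the same call with the five product-related columns would play the role of the Dimension_Product load.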

Once the dimensions were loaded to the data warehouse, the fact table was prepared for loading. For this it required the
data from the staging table as well as the surrogate keys from the dimensions. A separate SSIS package was created to transform and load the data into the fact table. Initially the staging table was set as the source, and all the required columns were selected to be loaded into the fact table. To map Dimension_ReceivedDate, a lookup module was used with the ReceivedDate column selected as the lookup column. With this lookup, the corresponding ReceivedDate_SK was extracted from Dimension_ReceivedDate to be inserted into the fact table. Similarly, the Status_SK was extracted from Dimension_Status using both the CompanyResponseToConsumer and SubmittedVia columns, and the Product_SK was extracted from Dimension_Product using the Product, SubProduct, Company, State and ZipCode columns. Additionally, the TimelyResponse and ConsumerDisputed columns had to be transformed into a binary data type. For this, a replace module was used to replace Yes/No with 1/0 respectively. Since the data type was still string after the replacement, a data conversion module was used to convert the string into an integer. Once all the columns were ready and in the correct formats, the data was loaded into the fact table.

V. ANALYSIS & PREDICTION

The following comparative analyses were carried out on different aspects of the CFPB data set, including the complaint submission methods, since the CFPB could allocate more resources to improve its service by opening up these channels to consumers and shaping the process as required or preferred by the CFPB.

Consumers submit their complaints via several methods, and Table 1 shows the percentage of complaints received by the CFPB through each current method. Over 60% of the total complaints were received via the web. Referrals, phone, postal mail, fax and email account for roughly 20%, 7%, 6.6%, 1.5% and 0.07% of complaints respectively. Overall, complaints submitted by every method received timely responses with a probability over 0.95. This suggests that the CFPB has no preference for any submission method and is impartial to all of them.

Table 1: Share of complaint submission method vs predicted timely response

Complaint Submit Method    Percentage of total complaints    Predicted Probability of a Timely Response
Web                        64.415                            0.9722
Referral                   20.162                            0.9805
Phone                      7.2388                            0.9717
Postal mail                6.6271                            0.9850
Fax                        1.4912                            0.9785
Email                      0.0659                            0.9592

Further comparative analysis was performed on the following scenarios.

a) Issue vs. Timely response

Table 2: Issue vs. timely response results

Issue                                       Predicted Probability of a Timely Response
Unable to get credit report/credit score    0.9995
Managing the loan or lease                  0.9801
Excessive fees                              0.8411

Table 2 shows that issues such as being unable to get a credit report or credit score, managing the loan or lease, and excessive fees are more likely to get a timely response than other issues, as these problems can be easily resolved by the relevant company.

b) Product, Issue vs. Company response

Table 3 lists complaints related to the issues "can't repay my loan" under the product student loan, "wrong amount charged or received" under the product money transfers, and "billing disputes" under the product credit card. In each case the predicted company response is "closed with explanation"; the first two issues have a probability above 0.8, whereas the last has a probability around 0.5.

Table 3: Product, issue vs. company response results

Product            Issue                               Company Response           Prediction Probability
Student loan       Can't repay my loan                 Closed with explanation    0.8906
Money transfers    Wrong amount charged or received    Closed with explanation    0.8155
Credit card        Billing disputes                    Closed with explanation    0.5247

c) Product, Sub product vs. Issue
The issue "managing the loan or lease" has a higher probability of falling under the sub-product vehicle loan than under vehicle lease or installment loan.

When comparing product and sub-product against issue, the following combinations had probabilities around 0.5: the issue of managing the loan or lease under the product consumer loan and sub-product vehicle loan; the issue of loan modification, collection, foreclosure under the product mortgage and sub-product FHA mortgage; and the issue of continued attempts to collect debt not owed under the product debt collection and sub-product vehicle lease.

Table 4: Product, sub-product vs. issue results

Product            Sub Product     Issue                                           Prediction Probability
Consumer loan      Vehicle loan    Managing the loan or lease                      0.5350
Mortgage           FHA mortgage    Loan modification, collection, foreclosure      0.5034
Debt collection    Vehicle lease   Continued attempts to collect debt not owed     0.4191

d) Complaints by day of the week and complaints by weekday or weekend
When comparing the number of complaints received by day of the week, we identified that the number of complaints received on weekdays is significantly higher than on weekends. This suggests that consumers are more likely to complain on weekdays, possibly because they need a quick response. According to this analysis, however, consumers get a timely response irrespective of whether the complaint was made on a weekday or a weekend. Tuesdays, Wednesdays and Thursdays had the highest number of complaints recorded by the CFPB.

The CFPB data set was also analyzed together with a few other databases in order to identify and recognize patterns related to USA financial complaints and their background.

e) Complaints and Household Median by State 2012-2014
It was observed that the state median increased in 2013 compared to 2012. A drop in the state median is shown in 2014. This 2014 reduction in the state median did not affect the overall increasing pattern of complaints received by the CFPB.

Figure 3: Household income median vs complaints of different states in year 2012, 2013 and 2014

Figure 4: Household income median variation percentage vs complaints variation percentage of different states in year 2012 to 2013 and 2013 to 2014

Figure 3 depicts a similarity between the household income median and the number of complaints. Figure 4 shows a few exceptions to the patterns of Figure 3.

Some key findings based on Figures 3 and 4 are:

• The CFPB has published 138,086 complaints about mortgages, which is the most complaints received about any financial product from December 2011 until March 16, 2015.
• The USA national share of complaints relative to the state median is nearly 4 percent.
• Only New Hampshire (NH) showed a decrease in the percentage of complaints from 2012 to 2013.
• Only West Virginia (WV) showed a negative percentage variation in household income median in both periods, 2012 to 2013 and 2013 to 2014. Meanwhile, the variation in its number of complaints showed a percentage increase.

Figure 5: Prediction probabilities of various models for timely response based on the complaint submission method

Figure 5 shows the prediction probabilities generated by the Microsoft Decision Trees, Naïve Bayes and Neural Network models (97.22%, 97.22% and 97.67% respectively) for the analysis of the complaint submission method versus the timely response parameter. This indicates that the optimal mining model can change depending on the attributes.

VI. NON-FUNCTIONAL FEATURES

Data Privacy
The original data set did not contain any personal information about the consumers or any confidential information related to the companies. Hence, no special measures were required or used to handle privacy or security.

Hardware Environment
The data warehouse creation and mining were done on a notebook PC with an Intel Core i7 processor, 16 gigabytes of memory and 1 terabyte of disk space. No special performance improvements, optimizations or configurations were required.
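Returning briefly to the predictions of Section V: the figures in Table 1 come from the Microsoft mining models, which are not reproduced here. As a rough illustration of the kind of quantity being reported, the sketch below estimates the share of each submission method and the probability of a timely response given the method using plain relative frequencies. This is a simplification and an assumption of this example, not the authors' method; the function name timely_rate_by_method is hypothetical, and it expects (submitted_via, timely_response) pairs with timely_response already converted to 0 or 1.

```python
from collections import Counter


def timely_rate_by_method(rows):
    """Estimate share and timely-response rate per submission method.

    rows: iterable of (submitted_via, timely_response) pairs, where
    timely_response is 0 or 1. Returns a dict keyed by method with the
    percentage share of all complaints and the relative frequency of
    timely responses within that method.
    """
    totals = Counter()
    timely = Counter()
    for method, flag in rows:
        totals[method] += 1
        timely[method] += flag
    n = sum(totals.values())
    return {
        method: {
            "share_pct": 100.0 * totals[method] / n,
            "p_timely": timely[method] / totals[method],
        }
        for method in totals
    }
```

On real data the input pairs would be read from the fact table joined with Dimension_Status; with only a handful of rows the estimates are of course not meaningful.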

VII. CONCLUSION & FUTURE WORK
Predictive analytics for the CFPB builds analytic models at the lowest level of the existing complaint recording system, based on the individual complaint and various other attributes such as the complaint product and the complaint company. It looks for predictable behaviors, propensities and business rules that can be used to predict the likelihood of certain behaviors and actions. These predictions would help the CFPB to better serve the purpose for which the bureau exists.

A survey could be carried out in the NH and WV states to determine why their patterns differ from those of other states in the relevant years. That may include future predictions for the discussed points in the coming years for the same states. Additionally, more development can be done using the SSIS and SSAS models to integrate with applications that provide analytics to customers about how each financial product, institution or issue can bring value or satisfaction before they commit to it.

VIII. REFERENCES
[1] http://www.cpmlegal.com. [Online]. Available: http://www.cpmlegal.com/blogs-Advocates-For-Justice,Why-the-CFPB-Is-Important-to-Seniors [Accessed April 4, 2016].
[2] Eileen McNulty, dataconomy. [Online]. Available: http://dataconomy.com/seven-vs-big-data/ [Accessed April 20, 2016].
[3] http://www.consumerfinance.gov/. [Online]. Available: http://www.consumerfinance.gov/complaint/process/ [Accessed March 21, 2016].
[4] Mary Breslin, "Data Warehousing Battle of the Giants: Comparing the Basics of the Kimball and Inmon Models," Business Intelligence Journal, Winter 2004, pp. 6-20.
[5] Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd ed. John Wiley & Sons, Inc.
[6] D. Hand, H. Mannila, and P. Smyth, "Retrieval by Content," in Principles of Data Mining. New Delhi: Prentice Hall of India Private Limited, 2007, pp. 456–464.
[7] Term Extraction Transformation. [Online]. Available: http://msdn.microsoft.com/en-us/library/ms141809.aspx [Accessed April 20, 2016].
[8] Term Lookup Transformation. [Online]. Available: http://msdn.microsoft.com/en-us/library/ms137850.aspx [Accessed April 22, 2016].
[9] tf–idf. [Online]. Available: https://goralewicz.com/blog/what-is-tf-idf/ [Accessed April 27, 2016].
[10] Cosine Similarity. [Online]. Available: http://www.minerazzi.com/tutorials/cosine-similarity-tutorial.pdf [Accessed April 27, 2016].
[11] Joseph F. Hair, Jr., William C. Black, Barry J. Babin, Rolph E. Anderson, and Ronald L. Tatham, Multivariate Data Analysis. New Delhi: Pearson Education in South Asia, 2012, p. 599.
