Detection of fraud statement based on word vector: Evidence from financial companies in China
Yi Zhang, Ailing Hu, Jiahua Wang, Yaojie Zhang
Highlights
• Our model can recognize most fraud reports in China.
• Fraud recognition is based on the text of the report rather than financial data.
• The empirical results support the effectiveness of the MD&A non-financial disclosure mechanism in China.
Abstract
This paper aims to detect plausible fraud in financial companies via text analysis of the annual and quarterly reports of China's listed companies. The Management Discussion and Analysis (MD&A) section is digitized as vectors. The empirical results indicate that, compared with various vector indexes, the bag-of-words (BoW) model combined with machine learning algorithms has predictive power and the ability to recognize fraud: the BoW model correctly recognizes 77% of the fraud reports. This would be helpful for audit authorities in identifying fraud reports.
Keywords
Public sector audit; Text analysis; Word vector; Bag-of-words; Machine learning
1. Introduction
Public sector auditing plays an important role in maintaining the security of financial markets and protecting the rights and interests of financial consumers. However, the audit and supervision of falsification and fraud by financing entities such as securities markets, bond markets, and bank credit markets rely mainly on business and financial data and on the extended verification of accounting analysis results. The review of publicly disclosed reports is still conducted largely case by case, and systematic data analysis focuses on structured data (Al-Hashedi and Magalingam, 2021). The main route for screening suspected falsification and fraud is driven by "financial statements - account sets and vouchers - raised funds or credit funds - specific business". The systematic application of unstructured data and the use of big-data thinking are still at an initial stage.
Bonsall et al. (2017) find that annual reports have gradually lengthened, that the proportion of unstructured text has significantly increased, and that a large amount of hidden information remains to be explored. Increasing attention is being paid to non-financial disclosures in reports (Dhaliwal et al., 2011; Marshall et al., 2009). Humphreys et al. (2010) verify that the MD&A is the most widely read section, and
the diversification of its language facilitates text analysis. Goel et al. (2010, 2012) find that the information provided by financial data is increasingly limited, while non-standardized texts are subjective and their language features can serve as predictors of fraud. In recent years, Chinese scholars have turned to the text of reports (Li, 2010; Ren and Lu, 2021), applying text analysis to study the relationship between financial information and economic phenomena; the results mainly concern semantics and readability (Kong et al., 2021; Luo et al., 2018). In contrast to mainstream fraud detection methods based on financial data (Cheng et al., 2021; Song et al., 2014), this paper focuses on the text of the MD&A. We first transform MD&A texts into digital data, and then combine machine learning with vector indexes to detect hidden problems in financial companies in the Chinese A-share stock market.
The rest of the paper is organized as follows. Section 2 describes the data, Section 3 provides empirical methodology, Section 4 presents the
empirical results, and Section 5 concludes.
2. Data
This research covers the financial listed companies under the CSRC Industry Classification (2012 Edition). Records of *ST, ST, sales, and transfer companies are excluded. The sample period runs from January 1, 2005 to December 31, 2019, yielding 3317 records in total, including 370 fraud records.
The fraud records are from the CSMAR database, and the MD&A annual and quarterly texts are respectively taken from the WIND database
and the RESSET database. The missing text data for 2011 and 2012 are extracted from the summary section of MD&A in the reports of listed
companies.
Due to the time lag of CSRC administrative penalty announcements, the year of fraud must be defined uniformly, and all reports within that year are judged to be fraudulent. Except where specific years are clearly indicated, the fraud years are taken as the years preceding the announcement year. Combining the "violating or not" label in the CSMAR database with manual judgment, the two types of samples are labeled, then cleaned and segmented. Table 1 shows the high-frequency words before and after data cleaning. The results indicate that removing invalid information greatly reduces the vocabulary size and the computational load; at the same time, the accuracy of word segmentation improves, increasing the credibility of the subsequent analysis.
Table 1. Comparison of word frequency output before and after data cleaning.
Note: The HIT Stopwords List was used to compare the 10 most frequent words before and after cleaning the noise in the "self-evaluation" column. Word segmentation is based on the Jieba library, available at https://github.com/fxsjy/jieba.
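The cleaning step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the token list stands in for Jieba's segmentation output, and the stopword set is a tiny hypothetical sample rather than the full HIT Stopwords List.

```python
from collections import Counter

# Hypothetical stopword sample standing in for the HIT Stopwords List.
STOPWORDS = {"the", "of", "and", "in", "a", "is"}

def clean_tokens(tokens):
    """Drop stopwords and single-character noise tokens."""
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

# In the paper, tokens come from Jieba segmentation of Chinese MD&A text;
# an English stand-in keeps this example self-contained.
tokens = ["the", "company", "reported", "growth", "in", "revenue",
          "and", "growth", "in", "profit"]
cleaned = clean_tokens(tokens)
freq = Counter(cleaned).most_common(3)   # "growth" tops the list with count 2
```

Removing the stopwords shrinks both the vocabulary and the frequency table, which is the effect Table 1 documents.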
3. Empirical methodology
Word vectorization is a fundamental technique in natural language processing and text analysis. As a classic model in text vectorization, the BoW model uses words as units and represents documents as N-dimensional vectors by constructing a dictionary: the length of the vector is the size of the dictionary, and each coordinate is a word frequency. Benezeth et al. (2015) use the BoW model for classification, and Purda and Skillicorn (2015) use the BoW model and a Support Vector Machine (SVM) to recognize fraud reports with an accuracy of 82.95%. Building on Purda and Skillicorn (2015), we screen and retain high-contribution words through a random forest, construct the BoW vector for the financial industry, and then predict the probability of fraud using an SVM.
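The BoW representation described above can be sketched in a few lines of pure Python; the toy documents below stand in for segmented MD&A texts.

```python
def build_vocab(docs):
    """Dictionary: word -> coordinate index, over all documents."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def bow_vector(doc, vocab):
    """N-dimensional vector; each coordinate is a word frequency."""
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

docs = [["revenue", "growth", "revenue"], ["loss", "growth"]]
vocab = build_vocab(docs)                      # {'revenue': 0, 'growth': 1, 'loss': 2}
vectors = [bow_vector(d, vocab) for d in docs]  # [[2, 1, 0], [0, 1, 1]]
```

In the paper, the dictionary is restricted to the high-contribution words retained by the random forest before the vectors are passed to the SVM.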
With Word2Vec vectors as input, three mainstream machine learning methods are used: Naive Bayes, Ensemble Learning, and SVM. In text
classification, Naive Bayes has high classification performance (Frank and Bouchaert, 2006), small sample demand, fast calculating, and less
influence of outliers; as a common model for ensemble learning, random forest has excellent text classification performance (Wu et al., 2014);
and Antweiler et al. (2004) have verified that SVM can effectively classify text.
Screening out non-fraud samples as completely as possible is the focus of attention when recognizing fraud reports, so relying on the ROC alone is one-sided. Moreover, the two types of samples are extremely unbalanced in size, so the weights of the different classes must be adjusted. Tables 2 and 3 introduce the confusion matrix and the related indices:
Accuracy: the proportion of samples correctly predicted. For fraud detection, accuracy mainly reflects the screening of non-fraud samples and is therefore not a suitable index on its own.
Precision: the proportion of correctly judged fraud samples among all samples predicted to be fraudulent, that is, the accuracy of a fraud judgment.
Recall: the proportion of correctly predicted samples among those that are actually fraudulent, describing how many fraud samples have been screened out. The higher the value, the better the effect.
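The three indices follow directly from the confusion matrix; a minimal sketch, with hypothetical counts chosen purely for illustration (not the paper's results):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall from a binary confusion matrix
    (positive class = fraud)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total        # dominated by the majority class
    precision = tp / (tp + fp)          # correct among predicted fraud
    recall = tp / (tp + fn)             # fraud samples screened out
    return accuracy, precision, recall

# Toy counts: 10 actual frauds among 100 samples.
acc, prec, rec = metrics(tp=8, fp=2, fn=2, tn=88)   # 0.96, 0.8, 0.8
```

Note how accuracy stays high even when a fifth of the frauds are missed, which is why recall is the key index in this unbalanced setting.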
4. Empirical results
In this section, we first maintain the ratio of the two sample types and randomly assign 80% of the samples to the training set (2653 samples in total, including 296 fraud samples) and the remaining 20% to the test set (664 samples in total, including 74 fraud samples). The three methods and the indexes of Section 3 are then computed separately on the training set and the test set. A method without a good detection effect on the training set is not evaluated further on the test set.
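A ratio-preserving split of this kind can be sketched as below; the helper is hypothetical, as the paper does not specify its exact sampling procedure.

```python
import random

def stratified_split(samples, labels, train_frac=0.8, seed=0):
    """Split while preserving the fraud / non-fraud ratio in each part."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train += [(s, y) for s in group[:cut]]
        test += [(s, y) for s in group[cut:]]
    return train, test

# Toy data with a 10% "fraud" ratio; both splits keep that ratio.
labels = [1] * 10 + [0] * 90
samples = list(range(100))
train, test = stratified_split(samples, labels)
```

Splitting within each class separately is what keeps 296 of the 370 fraud records in the training set and 74 in the test set, rather than a random count.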
The classification performance of the BoW method and the machine learning methods is far better than that of a random classifier. The BoW method has the highest AUC value, 0.98, and Naive Bayes the lowest, 0.68, so classification on the training set is good. Note, however, that a very large AUC can signal overfitting, so the test-set results must be examined as well.
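The AUC values compared here can be computed without tracing the full ROC curve, using the rank interpretation of AUC: the probability that a randomly chosen fraud sample scores above a randomly chosen non-fraud sample. A minimal sketch with toy scores:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a
    random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted fraud probabilities and true labels (1 = fraud).
value = auc(scores=[0.9, 0.8, 0.3, 0.2], labels=[1, 1, 0, 1])
```

A random classifier scores 0.5 on this measure, which is the baseline the 0.98 and 0.68 figures are judged against.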
The probability measurement in Table 4 is based on the SVM. 2700 high-contribution words are screened and retained through the random forest, which reduces the dimension of the word-frequency matrix (with a cumulative contribution rate of 85%) and improves computational efficiency. The distribution characteristics of the predicted probabilities of the overall, fraud, and non-fraud data are summarized and compared. There is no significant difference in general, and the distribution of the fraud samples is relatively concentrated. The calculated threshold is 0.125320; samples exceeding this value are judged to be fraudulent.
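The final decision rule is a simple threshold on the SVM's predicted probabilities; a sketch with hypothetical probability outputs:

```python
def classify(probabilities, threshold=0.125320):
    """Flag a report as fraud when its predicted fraud probability
    exceeds the paper's calculated threshold of 0.125320."""
    return [p > threshold for p in probabilities]

# Hypothetical SVM probability outputs for four reports.
flags = classify([0.05, 0.13, 0.40, 0.12])   # [False, True, True, False]
```

The threshold sits well below 0.5 because the fraud class is rare: a low cutoff trades some precision for the higher recall that audit screening needs.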
Note: The probability is measured based on SVM, and the probability of fraud reports among different samples and their distribution characteristics are calculated.
The threshold is 0.125320. If the probability of fraud is greater than the threshold, it is judged as a fraud sample.
According to the ROC curves and AUC values in Fig. 2, the detection effect on the test set is slightly lower than on the training set, and the AUC loss of the random forest is the largest, at 0.1884. This indicates that neither BoW nor the machine learning methods overfit markedly: prediction performance is good, generalization is acceptable, and the approach has practical significance in real applications.
Table 5 reports the classification performance on the test set. On single indexes, the SVM has the highest accuracy and precision, and its overall judgment and classification performance is better. For audit work, however, it is important to detect as many fraud samples as possible, so the recall rate is the key index. Overall, the BoW method performs best: it recognizes 77.03% of fraud samples and can improve audit efficiency.
Note: The above data are calculated from the confusion matrices of the BoW method and the machine learning methods on the test set. Accuracy is the proportion of correctly predicted samples; recall is the proportion of correctly predicted samples among those actually fraudulent; and precision is the proportion of correctly judged fraud samples among all samples predicted to be fraudulent.
5. Conclusions
The importance of unstructured data has gradually become prominent in the data era. Take the financial industry as an example. On the one hand, the empirical results show that China's MD&A non-financial disclosure mechanism for regular reports is effective and can provide useful incremental information to public sector audit authorities and financial consumers.² On the other hand, the recall rate of 77% shows that the BoW method and machine learning have real detection capabilities, covering nearly 80% of fraudulent companies. This offers a reference for public sector auditing in constructing an analysis route of "unstructured data - text mining analysis - abnormal characteristic data traceability - specific business and financial compliance review" for verifying falsification and fraud in the financial market, and it has strong technical guiding significance for improving audit efficiency and audit coverage. This is an inevitable trend in the development of data auditing.
Acknowledgments
We are grateful to the editor and two anonymous referees for insightful comments that significantly improved the paper. This work is
supported by the National Natural Science Foundation of China [72001110].
References
Al-Hashedi and Magalingam, 2021: K.G. Al-Hashedi, P. Magalingam. Financial fraud detection applying data mining techniques: a comprehensive review from 2009 to 2019. Comput. Sci. Rev., 40 (2021), Article 100402. https://doi.org/10.1016/j.cosrev.2021.100402
Benezeth et al., 2015: Y. Benezeth, A. Bertaux, A. Manceau. Bag-of-word based brand recognition using Markov clustering algorithm for codebook generation. 2015 IEEE International Conference on Image Processing (ICIP), pp. 3315-3318. https://doi.org/10.1109/ICIP.2015.7351417
Bonsall et al., 2017: S.B. Bonsall, A.J. Leone, B.P. Miller, K. Rennekamp. A plain English measure of financial reporting readability. J. Account. Econ., 63 (2017), pp. 329-357.
https://doi.org/10.1016/j.jacceco.2017.03.002
Dhaliwal et al., 2011: D.S. Dhaliwal, O.Z. Li, A. Tsang, Y.G. Yang. Voluntary nonfinancial disclosure and the cost of equity capital: the initiation of corporate social responsibility reporting. Account. Rev., 86 (2011), pp. 59-100. https://doi.org/10.2308/accr.00000005
Li, 2010: F. Li. The information content of forward-looking statements in corporate filings — a Naïve Bayesian machine learning approach. J. Account. Res., 48 (2010), pp. 1049-1102. https://doi.org/10.1111/j.1475-679X.2010.00382.x
Wu et al., 2014: Q. Wu, Y. Ye, H. Zhang, M.K. Ng, S.-S. Ho. ForesTexter: an efficient random forest algorithm for imbalanced text categorization. Knowledge-Based Syst., 67 (2014), pp. 105-116. https://doi.org/10.1016/j.knosys.2014.06.004
1. Results for the vector distance index are available upon request and are not listed in the paper due to space limitations.
2. At the same time, fraud detection is closely related to the potential risks of listed companies. Combined with recent empirical results on stock returns (Dai et al., 2021; Jiang et al., 2019), it is of great significance for financial market regulation and stock return forecasting.