Detection of fraud statement based on word vector: Evidence from financial companies in China
Yi Zhang, Ailing Hu, Jiahua Wang, Yaojie Zhang
Highlights
• Our model can recognize most fraud reports in China.
• Fraud recognition is based on the text of the report rather than financial data.
• The empirical results support the effectiveness of the MD&A non-financial disclosure mechanism in China.
Abstract
This paper aims to detect plausible fraud in financial companies via text analysis of the annual and quarterly reports of China's listed companies. The Management Discussion and Analysis (MD&A) section is digitized as vectors. The empirical results indicate that, compared with various vector indexes, the bag-of-words (BoW) model combined with machine learning algorithms has predictive power and the ability to recognize fraud: the BoW model correctly recognizes 77% of the fraud reports. This would be helpful for audit authorities in identifying fraud reports.
Keywords
Public sector audit; Text analysis; Word vector; Bag-of-words; Machine learning
1. Introduction
Public sector auditing plays an important role in maintaining the security of financial markets and protecting the rights and interests of financial consumers. However, the audit and supervision of falsification and fraud by financing entities such as securities markets, bond markets, and bank credit markets rely mainly on business and financial data and on the extended verification of accounting analysis results. The review of publicly disclosed reports is still conducted largely case by case, and systematic data analysis focuses on structured data (Al-Hashedi and Magalingam, 2021). The main route for screening suspected falsification and fraud is driven by "financial statements - account sets and vouchers - raised funds or credit funds - specific business". The systematic application of unstructured data and the use of big-data thinking are still at an initial stage.
Bonsall et al. (2017) find that annual reports have gradually lengthened, that the proportion of unstructured text has significantly increased, and that a large amount of hidden information remains to be explored. Increasing attention is being paid to non-financial disclosures in reports (Dhaliwal et al., 2011; Marshall et al., 2009). Humphreys et al. (2010) verify that the MD&A is the most widely read section, and
the diversification of its language facilitates text analysis. Goel et al. (2010, 2012) find that the information provided by financial data is increasingly limited, while non-standardized texts are subjective and their language features can serve as predictors of fraud. In recent years, Chinese scholars have turned to the text of reports (Li, 2010; Ren and Lu, 2021), applying text analysis to study the relationship between financial information and economic phenomena; the results mainly concern semantics and readability (Kong et al., 2021; Luo et al., 2018). In contrast to mainstream fraud detection methods based on financial data (Cheng et al., 2021; Song et al., 2014), this paper focuses on the text of the MD&A. We first transform MD&A texts into digital data, and then combine machine learning with vector indexes to detect hidden problems in financial companies in the Chinese A-share stock market.
The rest of the paper is organized as follows. Section 2 describes the data, Section 3 provides empirical methodology, Section 4 presents the
empirical results, and Section 5 concludes.
2. Data
This research covers the financial listed companies under the CSRC Industry Classification (2012 Edition). Records of *ST, ST, sales, and transfer companies are excluded. The sample period runs from January 1, 2005 to December 31, 2019, yielding 3317 records in total, including 370 fraud records.
The fraud records are from the CSMAR database, and the MD&A annual and quarterly texts are respectively taken from the WIND database
and the RESSET database. The missing text data for 2011 and 2012 are extracted from the summary section of MD&A in the reports of listed
companies.
Due to the time lag of CSRC administrative penalty announcements, the year of fraud must be defined uniformly, and all reports within that year are judged to be fraudulent. Except where specific years are clearly indicated, the fraud years are taken as the years preceding the announcement year. Combining the "violating or not" label in the CSMAR database with manual judgment, the two types of samples are labeled, then cleaned and segmented. Table 1 shows the high-frequency words before and after data cleaning. The results indicate that removing invalid information greatly reduces the vocabulary size and the computational load; at the same time, the accuracy of word segmentation improves, increasing the credibility of the subsequent analysis.
Table 1. Comparison of word frequency output before and after data cleaning.
Note: The HIT Stopwords List was used to compare the 10 most frequent words before and after cleaning the noise in the "self-evaluation" column. Word segmentation is based on the Jieba library, available at https://github.com/fxsjy/jieba.
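The cleaning step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the token list stands in for Jieba's segmentation output, and the stopword set is a tiny hypothetical sample rather than the full HIT Stopwords List.

```python
from collections import Counter

# Hypothetical stopword sample standing in for the HIT Stopwords List.
STOPWORDS = {"the", "of", "and", "in", "a", "is"}

def clean_tokens(tokens):
    """Drop stopwords and single-character noise tokens."""
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

# In the paper, tokens come from Jieba segmentation of Chinese MD&A text;
# an English stand-in keeps this example self-contained.
tokens = ["the", "company", "reported", "growth", "in", "revenue",
          "and", "growth", "in", "profit"]
cleaned = clean_tokens(tokens)
freq = Counter(cleaned).most_common(3)   # "growth" tops the list with count 2
```

Removing the stopwords shrinks both the vocabulary and the frequency table, which is the effect Table 1 documents.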
3. Empirical methodology
Word vectorization is a fundamental technique in natural language processing and text analysis. As a classic model in text vectorization, the BoW model uses words as units and represents documents as N-dimensional vectors by constructing a dictionary: the length of the vector is the size of the dictionary, and each coordinate is a word frequency. Benezeth et al. (2015) use the BoW model for classification, and Purda and Skillicorn (2015) use the BoW model and a Support Vector Machine (SVM) to recognize fraud reports with an accuracy of 82.95%. Building on Purda and Skillicorn (2015), we screen and retain high-contribution words through a random forest, construct the BoW vector for the financial industry, and then predict the probability of fraud using an SVM.
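The BoW representation described above can be sketched in a few lines of pure Python; the toy documents below stand in for segmented MD&A texts.

```python
def build_vocab(docs):
    """Dictionary: word -> coordinate index, over all documents."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def bow_vector(doc, vocab):
    """N-dimensional vector; each coordinate is a word frequency."""
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

docs = [["revenue", "growth", "revenue"], ["loss", "growth"]]
vocab = build_vocab(docs)                      # {'revenue': 0, 'growth': 1, 'loss': 2}
vectors = [bow_vector(d, vocab) for d in docs]  # [[2, 1, 0], [0, 1, 1]]
```

In the paper, the dictionary is restricted to the high-contribution words retained by the random forest before the vectors are passed to the SVM.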
With Word2Vec vectors as input, three mainstream machine learning methods are used: Naive Bayes, Ensemble Learning, and SVM. In text
classification, Naive Bayes has high classification performance (Frank and Bouchaert, 2006), small sample demand, fast calculating, and less
influence of outliers; as a common model for ensemble learning, random forest has excellent text classification performance (Wu et al., 2014);
and Antweiler et al. (2004) have verified that SVM can effectively classify text.
Screening out non-fraud samples as completely as possible is the focus of attention when recognizing fraud reports, so relying on the ROC alone is one-sided. Moreover, the two types of samples are extremely unbalanced in size, so the weights of the different classes must be adjusted. Tables 2 and 3 introduce the confusion matrix and the related indices:
Accuracy: the proportion of samples correctly predicted. For fraud detection, accuracy mainly reflects the screening of non-fraud samples and is therefore not a suitable index on its own.
Precision: the proportion of correctly judged fraud samples among all samples predicted to be fraudulent, that is, the accuracy of a fraud judgment.
Recall: the proportion of correctly predicted samples among those that are actually fraudulent, describing how many fraud samples have been screened out. The higher the value, the better the effect.
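The three indices follow directly from the confusion matrix; a minimal sketch, with hypothetical counts chosen purely for illustration (not the paper's results):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall from a binary confusion matrix
    (positive class = fraud)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total        # dominated by the majority class
    precision = tp / (tp + fp)          # correct among predicted fraud
    recall = tp / (tp + fn)             # fraud samples screened out
    return accuracy, precision, recall

# Toy counts: 10 actual frauds among 100 samples.
acc, prec, rec = metrics(tp=8, fp=2, fn=2, tn=88)   # 0.96, 0.8, 0.8
```

Note how accuracy stays high even when a fifth of the frauds are missed, which is why recall is the key index in this unbalanced setting.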
4. Empirical results
In this section, we first maintain the ratio of the two sample types and randomly assign 80% of the samples to the training set (2653 samples in total, including 296 fraud samples) and the remaining 20% to the test set (664 samples in total, including 74 fraud samples). The three methods and the indexes of Section 3 are then computed separately on the training set and the test set. A method without a good detection effect on the training set is not evaluated further on the test set.
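A ratio-preserving split of this kind can be sketched as below; the helper is hypothetical, as the paper does not specify its exact sampling procedure.

```python
import random

def stratified_split(samples, labels, train_frac=0.8, seed=0):
    """Split while preserving the fraud / non-fraud ratio in each part."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train += [(s, y) for s in group[:cut]]
        test += [(s, y) for s in group[cut:]]
    return train, test

# Toy data with a 10% "fraud" ratio; both splits keep that ratio.
labels = [1] * 10 + [0] * 90
samples = list(range(100))
train, test = stratified_split(samples, labels)
```

Splitting within each class separately is what keeps 296 of the 370 fraud records in the training set and 74 in the test set, rather than a random count.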
The classification performance of the BoW method and the machine learning methods is far better than that of a random classifier. The BoW method has the highest AUC value, 0.98, and Naive Bayes the lowest, 0.68, so classification on the training set is good. Note, however, that a very large AUC can signal overfitting, so the test-set results must be examined as well.
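The AUC values compared here can be computed without tracing the full ROC curve, using the rank interpretation of AUC: the probability that a randomly chosen fraud sample scores above a randomly chosen non-fraud sample. A minimal sketch with toy scores:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a
    random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted fraud probabilities and true labels (1 = fraud).
value = auc(scores=[0.9, 0.8, 0.3, 0.2], labels=[1, 1, 0, 1])
```

A random classifier scores 0.5 on this measure, which is the baseline the 0.98 and 0.68 figures are judged against.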
The probability measurement in Table 4 is based on the SVM. 2700 high-contribution words are screened and retained through the random forest, which reduces the dimension of the word-frequency matrix (with a cumulative contribution rate of 85%) and improves computational efficiency. The distribution characteristics of the predicted probabilities of the overall, fraud, and non-fraud data are summarized and compared. There is no significant difference in general, and the distribution of the fraud samples is relatively concentrated. The calculated threshold is 0.125320; samples exceeding this value are judged to be fraudulent.
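The final decision rule is a simple threshold on the SVM's predicted probabilities; a sketch with hypothetical probability outputs:

```python
def classify(probabilities, threshold=0.125320):
    """Flag a report as fraud when its predicted fraud probability
    exceeds the paper's calculated threshold of 0.125320."""
    return [p > threshold for p in probabilities]

# Hypothetical SVM probability outputs for four reports.
flags = classify([0.05, 0.13, 0.40, 0.12])   # [False, True, True, False]
```

The threshold sits well below 0.5 because the fraud class is rare: a low cutoff trades some precision for the higher recall that audit screening needs.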
Note: The probability is measured based on SVM, and the probability of fraud reports among different samples and their distribution characteristics are calculated.
The threshold is 0.125320. If the probability of fraud is greater than the threshold, it is judged as a fraud sample.
According to the ROC curves and AUC values in Fig. 2, the detection effect on the test set is slightly lower than on the training set, and the AUC loss of the random forest is the largest, at 0.1884. This indicates that neither BoW nor the machine learning methods overfit markedly: prediction performance is good, generalization is acceptable, and the approach has practical significance in real applications.
Table 5 reports the classification performance on the test set. On single indexes, the SVM has the highest accuracy and precision, and its overall judgment and classification performance is better. For audit work, however, it is important to detect as many fraud samples as possible, so the recall rate is the key index. Overall, the BoW method performs best: it recognizes 77.03% of fraud samples and can improve audit efficiency.
Note: The above data are calculated from the confusion matrices of the BoW method and the machine learning methods on the test set. Accuracy is the proportion of correctly predicted samples; recall is the proportion of correctly predicted samples among those actually fraudulent; and precision is the proportion of correctly judged fraud samples among all samples predicted to be fraudulent.
5. Conclusions
The importance of unstructured data has gradually become prominent in the data era. Take the financial industry as an example. On the one hand, the empirical results show that China's MD&A non-financial disclosure mechanism for regular reports is effective and can provide useful incremental information to public sector audit authorities and financial consumers.² On the other hand, the recall rate of 77% shows that the BoW method and machine learning have real detection capabilities, covering nearly 80% of fraudulent companies. This offers a reference for public sector auditing in constructing an analysis route of "unstructured data - text mining analysis - abnormal characteristic data traceability - specific business and financial compliance review" for verifying falsification and fraud in the financial market, and it has strong technical guiding significance for improving audit efficiency and audit coverage. This is an inevitable trend in the development of data auditing.
Acknowledgments
We are grateful to the editor and two anonymous referees for insightful comments that significantly improved the paper. This work is
supported by the National Natural Science Foundation of China [72001110].
References
Al-Hashedi and Magalingam, 2021: K.G. Al-Hashedi, P. Magalingam. Financial fraud detection applying data mining techniques: a comprehensive review from 2009 to 2019. Comput. Sci. Rev., 40 (2021), Article 100402. https://doi.org/10.1016/j.cosrev.2021.100402
Benezeth et al., 2015: Y. Benezeth, A. Bertaux, A. Manceau. Bag-of-word based brand recognition using Markov clustering algorithm for codebook generation. 2015 IEEE International Conference on Image Processing (ICIP), pp. 3315-3318. https://doi.org/10.1109/ICIP.2015.7351417
Bonsall et al., 2017: S.B. Bonsall, A.J. Leone, B.P. Miller, K. Rennekamp. A plain English measure of financial reporting readability. J. Account. Econ., 63 (2017), pp. 329-357.
https://doi.org/10.1016/j.jacceco.2017.03.002
Dhaliwal et al., 2011: D.S. Dhaliwal, O.Z. Li, A. Tsang, Y.G. Yang. Voluntary nonfinancial disclosure and the cost of equity capital: the initiation of corporate social responsibility reporting. Account. Rev., 86 (2011), pp. 59-100. https://doi.org/10.2308/accr.00000005
Li, 2010: F. Li. The information content of forward-looking statements in corporate filings — a Naïve Bayesian machine learning approach. J. Account. Res., 48 (2010), pp. 1049-1102. https://doi.org/10.1111/j.1475-679X.2010.00382.x
Wu et al., 2014: Q. Wu, Y. Ye, H. Zhang, M.K. Ng, S.-S. Ho. ForesTexter: an efficient random forest algorithm for imbalanced text categorization. Knowledge-Based Syst., 67 (2014), pp. 105-116. https://doi.org/10.1016/j.knosys.2014.06.004
1. Results for the vector distance index are available upon request and are not listed in the paper due to space limitations.
2. At the same time, fraud detection is closely related to the potential risks of listed companies. Combined with recent empirical results on stock returns (Dai et al., 2021; Jiang et al., 2019), it is of great significance for financial market regulation and stock return forecasting.