2 RRS-1 2024
ABSTRACT
Autocoding plays an essential role in editing official statistics data. We have
proposed a classification method that is fundamental to a hybrid autocoding system.
The essential feature of this system is that it combines a rule-based classification
method with a machine learning based classification method to perform the coding task
of the Family Income and Expenditure Survey. Shopping receipt image data are known to
be difficult to handle with only the rule-based part of the proposed system because
of the complexity of the data. Evaluating the classification results for shopping
receipt images against a variety of criteria is therefore essential for obtaining
correct data. For this reason, this paper reports results for various criteria
applied to shopping receipt image data coded by the hybrid autocoding system.
We found that the machine learning based classification element of the autocoding
system does most of the work in handling shopping receipt image data. Additionally,
as the quantity of shopping receipt image data has recently increased, the machine
learning based classification method is growing in importance. Moreover, the results
for the various criteria revealed a sensitivity that changes dynamically over time.
This may guide future development of the machine learning based classification
method, such as the fuzzy clustering based support vector machine currently in
development (Toko and Sato-Ilic, 2021; Toko and Sato-Ilic, 2022).
Keywords: Coding, Evaluation metrics, Fuzzy logic, Text classification
JEL Classification: C38
1. INTRODUCTION
Official statistics often require coding, that is, translating text descriptions
into standardized labels for data processing. Traditionally, this labor-intensive
task was performed manually by many experts over a long period of time. To
improve efficiency, automated coding has become a focus of recent research.
For example, Hacking and Willenborg (2012) described a methodology
(1)
(2)
Minimum: (5)
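The minimum in (5) is one of four operators named in the figure captions, alongside the algebraic product (3), the Hamacher product (4), and the Einstein product (6). As a minimal sketch, the standard textbook definitions of these four t-norms can be written as below. The function names and the arguments `a` and `b` are illustrative assumptions; the way the operators enter (7) together with (1) or (2) is not reproduced here, so the sketch applies them to two plain values in [0, 1].

```python
# Standard t-norm definitions for two values a, b in [0, 1].
# Function and argument names are illustrative, not taken from the paper.

def algebraic_product(a: float, b: float) -> float:
    """Algebraic product t-norm: a * b."""
    return a * b


def hamacher_product(a: float, b: float) -> float:
    """Hamacher product t-norm: ab / (a + b - ab), defined as 0 when a = b = 0."""
    denom = a + b - a * b
    return 0.0 if denom == 0.0 else (a * b) / denom


def minimum(a: float, b: float) -> float:
    """Minimum t-norm: min(a, b)."""
    return min(a, b)


def einstein_product(a: float, b: float) -> float:
    """Einstein product t-norm: ab / (2 - (a + b - ab))."""
    # The denominator equals 1 + (1 - a)(1 - b), so it is at least 1 on [0, 1].
    return (a * b) / (2.0 - (a + b - a * b))


if __name__ == "__main__":
    a, b = 0.5, 0.5
    for t_norm in (einstein_product, algebraic_product, hamacher_product, minimum):
        print(f"{t_norm.__name__}: {t_norm(a, b):.4f}")
```

For any pair of values in [0, 1] the four operators are ordered as Einstein product ≤ algebraic product ≤ Hamacher product ≤ minimum, so the choice of operator controls how strongly a low value pulls down the combined result.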
4. NUMERICAL EXAMPLES
(8)
(9)
(10)
In Figure 8, dashed lines show the results for receipt data and solid lines
show the results for non-receipt data. Figure 8 shows that the accuracy values
are stable for both kinds of data over the targeted period. Meanwhile, the
values of the F1 score, precision, and recall are not stable, especially
(a) Result of entropy based algebraic product in (7) using (1), (3)
(b) Result of entropy based Hamacher product in (7) using (1), (4)
(c) Result of entropy based minimum in (7) using (1), (5)
(d) Result of coefficient based algebraic product in (7) using (2), (3)
(g) Result of coefficient based Einstein product in (7) using (2), (6)
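The four criteria discussed for Figure 8 (accuracy, precision, recall, and F1 score) can be computed from true and predicted codes as in the following minimal sketch. Macro averaging over classes is an assumption made here for illustration, as are the function name `macro_metrics` and the example labels; the averaging scheme actually used for the figures is not stated in this section.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        # Undefined ratios (empty denominator) are counted as 0.
        return num / den if den else 0.0

    precisions = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    f1s = [safe_div(2 * p * r, p + r) for p, r in zip(precisions, recalls)]
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), len(labels)),
        "recall": safe_div(sum(recalls), len(labels)),
        "f1": safe_div(sum(f1s), len(labels)),
    }


if __name__ == "__main__":
    # Hypothetical item codes: one rare class ("clothing") is misclassified.
    y_true = ["food", "food", "food", "food", "clothing", "fuel"]
    y_pred = ["food", "food", "food", "food", "food", "fuel"]
    for name, value in macro_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

Under macro averaging, a single error on a rare class lowers precision, recall, and F1 much more than accuracy. This is one possible reason, offered here only as an illustration, why accuracy can look stable while the other criteria fluctuate.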
5. CONCLUSION
References
1. Benedikt, L., Joshi, C., De Wolf, N., Schouten, B. (2020), “Optical character
recognition and machine learning classification of shopping receipts”, Available at:
https://ec.europa.eu/eurostat/documents/54431/11489222/6+Receipt+scan+analysis.pdf
(accessed Dec. 2023)
2. Bezdek, J.C. (1981), Pattern recognition with fuzzy objective function algorithms,
Plenum Press.
3. Bezdek, J.C., Keller, J., Krishnapuram, R., Pal, N.R. (1999), Fuzzy models and algorithms
for pattern recognition and image processing, Kluwer Academic Publishers.
4. Gweon, H., Schonlau, M., Kaczmirek, L., Blohm, M., Steiner, S. (2017), “Three
methods for occupation coding based on statistical learning”, Journal of Official
Statistics, Vol. 33, No. 1, pp. 101-122.