
NLP Cross-Domain Recognition of Retail Products

Tobias Pettersson∗
tobias.pettersson@itab.com
ITAB Shop Products AB, Sweden, and Department of Computing, Jönköping AI Lab, Jönköping University, Sweden, and Department of Engineering, University of Skövde, Sweden

Rachid Oucheikh∗
rachid.oucheikh@ju.se
Department of Computing, Jönköping AI Lab, Jönköping University, Jönköping, Sweden

Tuwe Löfström
tuwe.lofstrom@ju.se
Department of Computing, Jönköping AI Lab, Jönköping University, Jönköping, Sweden

ABSTRACT
Self-checkout systems aim to provide a seamless and high-quality shopping experience and increase the profitability of stores. These advantages come with some challenges, such as shrinkage loss. Automatic recognition of the purchased products is a potential solution to these challenges. In this context, one of the major issues that emerges is data shift, which is caused by the difference between the environment in which the recognition model is trained and the environment in which the model is deployed. In this paper, we use transfer learning to handle the shift caused by a change of camera and lens or their position, as well as by critical factors, mainly lighting, reflection, and occlusion. We motivate the use of Natural Language Processing (NLP) techniques on textual data extracted from images, instead of image recognition, to study the efficiency of transfer learning techniques. The results show that cross-domain NLP retail recognition using the BERT language model results in only a small reduction in performance between the source and target domain. Furthermore, a small number of additional training samples from the target domain enables the model to perform comparably to a model trained on the source domain.

CCS CONCEPTS
• Computing methodologies → Object detection; Object identification; • Theory of computation → Online learning algorithms; • Networks → Network reliability.

KEYWORDS
Product recognition, text classification, retail, transfer learning, domain adaptation, NLP, BERT

ACM Reference Format:
Tobias Pettersson, Rachid Oucheikh, and Tuwe Löfström. 2022. NLP Cross-Domain Recognition of Retail Products. In 2022 7th International Conference on Machine Learning Technologies (ICMLT) (ICMLT 2022), March 11–13, 2022, Rome, Italy. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3529399.3529436

∗ Both authors contributed equally to this research.

¹ https://itab.com/products-and-services/products/checkouts-self-service-systems/hybrid-semi-automated-checkout-solutions/scanmate

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICMLT 2022, March 11–13, 2022, Rome, Italy
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9574-8/22/03. . . $15.00
https://doi.org/10.1145/3529399.3529436

1 INTRODUCTION
Self-service technologies aim to increase the profitability of organizations by shifting work from attendants to customers, who become active participants in the delivery of a service which they ultimately consume. Self-scan and Checkout (SCO) services are one type of these technologies and are increasingly being integrated into supermarkets. SCO brings many benefits: it can increase profits, reduce service time, optimize the number of staff needed, and provide high-quality service to customers.

Despite these advantages, SCO adoption creates some technical and financial challenges, such as minimizing stock loss. Furthermore, stores are growing and offering a wider range of products, and as the number of items ordered gradually increases, there is a need to ensure the reliability of the checkout process. Early in the adoption of SCO systems, retailers affirmed that these systems could increase their exposure to retail losses by creating new loss factors, both malicious and non-malicious [3]. Retailers calculate their stock loss, typically called ‘shrinkage’, by measuring the difference between the amount of stock they think they should have and what they actually have, usually ascertained through regular physical audits/stock counts. The difficulty lies in understanding the cause of any missing stock, particularly when stock audits are carried out infrequently (often annually). The time lag between a loss event happening and it eventually being recorded can be considerable, making identification of the cause very challenging. According to the Efficient Consumer Response (ECR) self-checkout report [2], retail stores with 50% of transactions processed through self-checkouts can expect their shrinkage losses to be 75% higher than the average rate found in grocery retailing.

To reduce shrinkage, product recognition technologies could be used to detect fraudulent behavior. Product recognition has mainly used image-based recognition models [6, 20]. However, a problem in the SCO environment is the collection of image data. Products are held and moved by customers, generating occlusion and motion blur in the image data. In other, controlled environments, such as a semi-automatic checkout¹, the products are registered through a barcode recognition system that provides image data on products without occlusion and motion blur. A potential workaround to the data shortage in the SCO environment, making it easier to collect more high-quality training data, could be to apply


data from the semi-automatic checkouts to the SCO environment. This can be reflected as a data distribution shift between the semi-automatic checkout, in which the system is trained, and the SCO target domain, in which it is deployed.

In this study, we are interested in fixed SCO systems as described by Beck [2]. In a fixed SCO, the customer scans the products at a designated machine. A camera can be placed where barcode readers are typically mounted, either below or facing the front of the product that is scanned. The camera stream provides images of the product when it is positioned at the counter. Using the captured images and machine learning techniques, the product can be classified into its correct category. Automating the recognition process has a number of benefits for both supermarkets and their customers, including increased revenue, identification of scanning errors, and detection of malicious behavior at the checkout.

We have opted to use text extracted from product images. This choice is motivated by our previous paper [11], in which we showed that Optical Character Recognition (OCR) and NLP techniques are promising for the recognition and verification of retail products. This success is due to the following reasons: text extraction using OCR is fairly robust to changes in scale, rotation, and illumination. Furthermore, text recognition accuracy is high even on images containing few textual elements.

Different camera placement and resolution, lighting conditions, and color information create a significant domain shift between images in the source and target domains. The goal of this paper is to exploit the source domain knowledge and apply it in the target domain. The source domain consists of text extracted from RGB images, while the target domain consists of text extracted from images taken by a monochrome camera. Consequently, the objective of this paper is to assess the efficiency of transfer learning from a source domain, using an RGB camera, to a target domain, using a monochrome camera, by using OCR and deep learning approaches for NLP. Furthermore, we evaluate two different kinds of models for the product classification: a Convolutional Neural Network (CNN) with GloVe embeddings, and BERT.

2 BACKGROUND
Learning models are commonly deployed in environments that differ from the training environment, so the data collected and used for training does not perfectly match the data in the deployment environment. Statistical learning theory usually assumes that training and test data are generated by the same process (i.e., the i.i.d. assumption), which is not practical for real-world data, where the model can be used in similar but not identical environments. Naturally, the performance of trained machine learning models may decrease when the distributions of the training and test data differ.

Another challenge, which may worsen the situation, is that collecting and labeling data sets large enough to train an efficient model for each deployment environment is costly, and we often have insufficient data, which may lead to overfitting or undesirable behavior. Transfer Learning (TL) is an important area of research which alleviates these issues and ultimately tackles the data scarcity problem. The TL paradigm aims to leverage the information gained in one or more learning tasks that have a larger dataset to improve performance in another related task with a smaller dataset [10, 22]. Thus, TL helps bridge the distribution gap between datasets retrieved from the source domain $\mathcal{D}_S$ and the target domain $\mathcal{D}_T$.

More formally (inspired by the description in [12]), a domain $\mathcal{D} = \{\mathcal{X}, p(X)\}$ can be defined as a space of all feature vectors $\mathcal{X}$ and a marginal probability distribution $p(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$ is a particular learning sample with $n$ feature vectors. For a specific domain, the goal is to learn an objective predictive function $f$ that predicts a label $y_i$ from label space $\mathcal{Y}$ for the corresponding sample $x_i$. This is called a task and can be denoted $\mathcal{T} = \{\mathcal{Y}, f(x)\}$. Given source and target data, transfer learning is used when there is a discrepancy in either the domains ($\mathcal{D}_S \neq \mathcal{D}_T$) or the tasks ($\mathcal{T}_S \neq \mathcal{T}_T$). Source and target domains are said to be different if their feature spaces are not the same ($\mathcal{X}_S \neq \mathcal{X}_T$) or their marginal probability distributions are unequal ($p(X_S) \neq p(X_T)$). In such cases, an effective way to learn the target task is to explicitly map the data to a common or domain-invariant representation. This learning strategy is called Domain Adaptation (DA).

The prevalent assumption is that the source and target label spaces are similar. DA methods are further classified into three categories depending on the availability of labels in the target domain [19]. Supervised DA methods assume that the target data is labeled (but generally small). In the semi-supervised case, in addition to a small amount of labeled target data, a larger amount of unlabeled target data is available. In the unsupervised setting, no labeled data is available for the target domain. In this paper, we use supervised DA techniques and experiment with a small, varying amount of labeled target data to evaluate the performance of the retrained model.

2.1 Transfer Learning in NLP
The arrival of Transfer Learning and rapid improvements in the performance of language models represent great progress in the NLP domain. Global Vectors (GloVe) is one of the models that learn distributed word representations in a multi-dimensional space [13]. It is a global log-bilinear unsupervised regression model that aims to obtain vector representations for words. This is achieved by mapping words into a meaningful space where the distance between semantically similar words is small. The model is trained on word-word co-occurrence statistics from a large corpus. The resulting representations capture meaningful linear substructures of the word vector space, and the obtained pretrained word embeddings can be used for new tasks. However, these embeddings do not capture the context of words. Recently developed language models, such as BERT [5] and ULMFiT [7], can learn the context of each word and capture the nuances and inter-dependencies among the words of the text. Such language models allow data scientists to reach reasonable accuracy using pretrained models without the need for an excessive amount of manually labeled data, and to achieve state-of-the-art performance on various NLP tasks. One of the possibilities to use these pretrained language models is to build new


neural network structures on top of them. Basically, there are three ways to train these stacked neural networks: stack-and-finetune, stack-only, and finetune-only. In the stack-only method, the pretrained model works as a feature extractor: its final layer is removed and replaced by one or several new layers. The weights of the pretrained model are frozen during training for the new task and only the weights of the additional layers are updated. Finetune-only methods use only the pretrained model and train it further on the new domain. Peters et al. [14] compared the stack-only and finetune-only methods and concluded that finetune-only is better than the stack-only option. Finally, stack-and-finetune combines both methods, fine-tuning the pretrained language model while also training the stacked model.

An important feature of recent language models, and one that led to their success, is the attention mechanism. Instead of encoding the sequence into a single vector, the attention mechanism computes a context vector over all tokens in the input sequence for each token in the output [23]. The decoder computes a relevancy score for all tokens on the input side. These scores are then normalized by a softmax operation to obtain the attention weights. The weights are used to compute a weighted sum of the encoder's hidden states, yielding the context vector $c_t$. The attention vector is then obtained by applying a hyperbolic tangent to the concatenation of the context vector and the target hidden state. This attention vector generally provides a better representation of the sequence than traditional fixed-size vector methods, because it identifies the relevant input tokens while generating each output token. Using this mechanism, Bahdanau et al. [1] were able to achieve state-of-the-art performance in machine translation tasks.

The Transformer architecture, which builds upon the significant improvements achieved by the attention mechanism and consists of an encoder and a decoder, was proposed by Vaswani et al. [18]. The encoder includes a multi-head attention layer, residual connections, a normalization layer, and a generic feed-forward layer. The decoder has a similar structure but additionally contains a "masked" multi-head attention layer. The Transformer has achieved new state-of-the-art results in various tasks such as machine translation and entailment.

Based on the Transformer, Bidirectional Encoder Representations from Transformers (BERT) was developed to learn contextual embeddings for words [5]. BERT has at its core a Transformer language model with a variable number of encoder layers and self-attention heads. It is pretrained in an unsupervised manner on unlabeled data extracted from a large corpus. During pretraining, some of the words in the input are randomly masked out, and the model conditions on both the left and right context of each masked word to predict it. Being bidirectional, BERT can efficiently learn contextual embeddings and achieve state-of-the-art results in various tasks such as named entity recognition and question answering.

2.2 Transfer learning for retail product recognition
Transfer learning has been exploited for retail product recognition using image datasets. Tonioni and Di Stefano [17] used image-to-image translation by means of a generative adversarial network (GAN) to address the domain shift and generate resilient cross-domain feature embeddings of product images. For the case where the source dataset includes subsets with different distributions, Thota et al. [16] proposed a multi-source deep learning-based domain adaptation system which serves to identify and verify the presence and legibility of use-by date information in pictures of food packaging. The method incorporates discrepancy losses, i.e., maximum mean discrepancy (MMD) [9] and correlation alignment (CORAL) [15], to extract domain-invariant representations for all domains. Furthermore, it aligns the distributions of all pairs of source and target domains in a common feature space, along with the class boundaries. Zhang et al. [23] developed a dual pyramid scale network to learn multi-scale features of the data using both detection and counting views. Furthermore, an iterative knowledge distillation training strategy is applied to leverage both image and instance levels and thereby narrow the semantic gap between the source and target domains. To tackle cross-domain fine-grained product recognition, Wang et al. [21] developed a CNN-based model which consists of two specialized components. The first is an adversarial component which handles the domain shift by gradually minimizing the discrepancy between different domains and ensuring alignment at both the domain level and the class level. The second is a self-attention module designed for fine-grained recognition by capturing the discriminative image regions.

3 THE PROPOSED APPROACH
Domain adaptation is a common challenge for image recognition systems. The domain shift between a source and a target domain can be caused by a change of background, pose, or type of camera. This may lead to a significant degradation in performance in the new domain, caused by different data distributions between training and test data [4]. To mitigate this, the model has to be re-calibrated to the target domain by collecting data from the target domain. However, this is time-consuming since it requires a lot of manual work.

The setup of the training procedure of our proposed approach is illustrated in figure 1. The images from the source domain consist of three-channel RGB data, while the target domain images are captured by a monochrome camera from another viewpoint. The image data from both domains are processed by an OCR module that detects words in an image and outputs them in a textual data format. The source textual data is used for training a product recognition model from scratch with RGB textual data, while the target domain uses textual data from both the source and target domains. Further details about the training and testing of the product recognition model are explained in section 4.2.

In this study, we have used two machine learning models, namely, CNN with GloVe and BERT. The architecture of the CNN consists of three one-dimensional convolutional layers separated by two max pooling layers, followed by one global max pooling layer and two dense layers.
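To make the layer ordering concrete, the sketch below traces the tensor shapes through the described stack (embedding, three 1-D convolutions separated by two max pooling layers, global max pooling, two dense layers). It is only an illustration: every hyperparameter in it (sequence length 128, embedding size 100, 64 filters, kernel size 5, pool size 2, unpadded convolutions, 64 hidden units) and the helper names are assumptions, since the paper does not state them. The 81 output units correspond to the 81 product classes of the dataset.

```python
def conv1d_len(n, kernel, stride=1):
    """Output length of a 1-D convolution without padding ('valid')."""
    return (n - kernel) // stride + 1


def pool1d_len(n, pool):
    """Output length of non-overlapping 1-D max pooling."""
    return n // pool


def trace_cnn(seq_len=128, emb_dim=100, filters=64, kernel=5, pool=2,
              hidden=64, n_classes=81):
    """Return (layer name, output shape) pairs for the described stack.

    All default values are illustrative assumptions, not the paper's settings.
    """
    shapes = [("embedding (GloVe)", (seq_len, emb_dim))]
    n = seq_len
    for i in range(1, 4):
        n = conv1d_len(n, kernel)
        shapes.append((f"conv1d_{i}", (n, filters)))
        if i < 3:  # only two pooling layers separate the three convolutions
            n = pool1d_len(n, pool)
            shapes.append((f"max_pool_{i}", (n, filters)))
    # Global max pooling collapses the time axis, leaving one value per filter.
    shapes.append(("global_max_pool", (filters,)))
    shapes.append(("dense_1", (hidden,)))
    shapes.append(("dense_2 (output)", (n_classes,)))
    return shapes


for name, shape in trace_cnn():
    print(f"{name:20s} {shape}")
```

With these assumed settings the sequence axis shrinks from 128 to 25 before the global max pooling removes it, so the dense layers see a fixed-size vector regardless of the text length.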


Figure 1: Training procedure of the proposed approach

The first layer is an embedding layer which receives as input the sequenced GloVe embeddings for the text data. This stack-only method was used in our previous paper [11] and yielded the best classification results compared to several other text classification models. BERT is a state-of-the-art model for NLP tasks. We used a pretrained BERT model² to evaluate the text recognition performance. The model used default configurations with a classification head consisting of a linear layer. During training, both the classification head and the underlying BERT model are trained, hence this is a stack-and-finetune model.

4 EXPERIMENTAL SETUP

4.1 Dataset description
The dataset has been collected using automatic scanning tunnels deployed by a retailer at the checkout counter in grocery stores. Figure 2 shows a schematic of the acquisition process.

Figure 2: Picture of the data acquisition setup. Products are passing through the scanning tunnel, registered by a barcode recognition system, enabling RGB and line scan camera data to be collected.

Two cameras have been used to collect image data. An RGB camera³ with a resolution of 2592x1944 has been placed at the top of the tunnel, facing the checkout belt. The second camera is a monochrome line scan camera⁴ with a horizontal resolution of 4096 pixels. The line scan camera is located below the tunnel scanner, and the products passing through are scanned through a small gap between the two checkout belts. The camera placement and lighting conditions of the line scan camera are similar to those of a front-facing barcode scanner in a fixed SCO system. The class labeling is performed by a separate barcode recognition system integrated in the tunnel scanner.

The dataset includes textual descriptions of 81 retail products extracted from two different types of images, namely, RGB and line scan (LS). Thus, we have two different domains representing the same products. Figure 3 shows example images from the dataset in both domains. There are a few characteristic differences between the images in the two domains. Due to low lighting conditions, the images of the line scan camera have to be monochrome. In addition, the placement of the lighting generates glare on the products. The RGB camera images show the products in more poses, since they are not captured face-down as in the line scan camera images. Furthermore, the resolution of the textual details is lower than in the line scan camera images.

Textual descriptions have been extracted from both the RGB and the LS domains using OCR via the Google Vision API⁵. Datasets for the domains have been extracted and some statistics are presented in table 1. The RGB domain consists of a training set, denoted Train_RGB, and a test set, denoted Test_RGB. The line scan data include a training set Train_LS and a test set Test_LS. All datasets have been extracted to have a balanced distribution over all classes.

In SCOs, products are often occluded while being scanned. To simulate the removal of text snippets at the edges of a product when it is occluded by a person's hand, a test set Test_LS_CC is created. It divides the words of each sample in Test_LS into multiple clusters using k-means, based on the center point of each word in the image. A mean center point is calculated using the positions of all words in a sample. The clusters whose centers are closest to the mean center point are then kept in the sample. Two different occlusion factors (OF), 0.25 and 0.5, have been used, meaning that either 25% or 50% of the text snippet clusters have been removed. This emulates the case where a customer is holding a standard-size package with either one or two hands. We evaluate 4 and 8 clusters with k-means to see whether the number of clusters affects our emulation method.

The size of the text snippets has been measured to analyze the characteristics of the RGB and the LS domains and is reported in table 1. It shows that the LS domain contains 27% more words per sample and that the average number of characters per word is 12% higher. The difference between the domains is expected, because smaller characters are distinguishable in the LS domain due to the higher resolution. Qualitative inspection of the dataset in both domains shows that misspellings occur frequently in the extracted words. Misreadings, where non-text parts of the image are interpreted as text, are also frequent. No correction, removal, or cleaning of the textual data has been performed in this work.

Table 1: Datasets extracted from RGB and LS domain.
Dataset | Classes | Samples | Average # words per sample | Average # characters per word
Train_RGB | 81 | 16200 | 22.2 | 16.4
Test_RGB | 81 | 4050 | 21.9 | 16.2
Train_LS | 81 | 4050 | 28.2 | 18.4
Test_LS | 81 | 3992⁶ | 27.8 | 18.2
Test_LS_CC | 81 | 3992⁶ | OF 0.25: 21.9; OF 0.5: 14.6 | OF 0.25: 18.6; OF 0.5: 18.9

² https://huggingface.co/transformers/model_doc/bert.html
³ DFK 33GP006 GigE color industrial camera: https://www.theimagingsource.com/products/industrial-cameras/gige-color/dfk33gp006/
⁴ Teledyne DALSA LA-GM-04K08A Line scan GigE camera: https://www.1stvision.com/cameras/models/Teledyne-DALSA/LA-GM-04K08A
⁵ https://cloud.google.com/vision
⁶ Some samples from 3 classes were not possible to collect due to corrupt image data from the LS camera.
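The occlusion emulation used to build Test_LS_CC (k-means over word center points, then keeping the clusters closest to the sample's mean center point) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmeans` and `simulate_occlusion`, the deterministic farthest-first initialization, the rounding of the number of removed clusters, and the handling of samples with fewer words than clusters are all hypothetical choices that the paper does not specify.

```python
import math


def kmeans(points, k, iters=50):
    """Lloyd's k-means on 2-D points; returns (centers, labels).

    Uses a deterministic farthest-first initialization (an assumption).
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))

    def assign(cs):
        # Index of the nearest center for every point.
        return [min(range(k), key=lambda c: math.dist(p, cs[c])) for p in points]

    for _ in range(iters):
        labels = assign(centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(centers)


def simulate_occlusion(words, k, occlusion_factor):
    """Remove the word clusters farthest from the sample's mean center point.

    words: list of (text, (x, y)) pairs, (x, y) being the word's center point.
    Returns the texts of the words that remain visible, in original order.
    """
    if len(words) < k:
        return [text for text, _ in words]  # too few words to cluster (assumption)
    points = [xy for _, xy in words]
    centers, labels = kmeans(points, k)
    mean = (sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points))
    n_keep = k - round(k * occlusion_factor)
    by_distance = sorted(range(k), key=lambda c: math.dist(centers[c], mean))
    keep = set(by_distance[:n_keep])
    return [text for (text, _), lab in zip(words, labels) if lab in keep]


# Made-up sample: two central words and three groups near the package edges.
words = [("BRAND", (50, 50)), ("MILK", (52, 50)), ("3%", (0, 50)), ("FAT", (2, 50)),
         ("1L", (100, 50)), ("ECO", (98, 50)), ("BEST", (50, 130)), ("BEFORE", (52, 130))]
print(simulate_occlusion(words, k=4, occlusion_factor=0.25))
print(simulate_occlusion(words, k=4, occlusion_factor=0.5))
```

On this made-up sample, an occlusion factor of 0.25 removes the bottom word group ("BEST", "BEFORE"), and 0.5 additionally removes the left group, leaving only the words nearest the middle of the package plus the closest edge group.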


Figure 3: Example images (a) and (b) from the dataset. Each sample shows an image captured by the RGB camera (left) and the line scan camera (right). Note that the example images in (a) are different samples and are displayed to show the domain difference.

To get more insight into the data distribution in the embedding space, we used t-distributed stochastic neighbor embedding (t-SNE) to visualize both domains. Figure 4 shows the source domain, in red, and the target domain, in blue, where each point in the scatter plot is a 3D projection of a sentence embedding representing one product. The shift between the domains is visually clear, and it is evident that the data distributions of the domains are different, with an overlap in some regions of the space.

Figure 4: Visualization of source (red) and target (blue) domains using t-SNE

In addition to the visualization, the gap between both domains is measured using the Word Mover's Distance (WMD) technique [8] and is presented in table 2. WMD measures the similarity between documents by computing the minimal traveling distance between them using word2vec embeddings. The mean of the distances between Train_RGB and the target domain is 1.32, while it is just 1.11 between Train_RGB and Test_RGB. The relative difference between these distances, about 16%, reflects the gap between the two domains and supports the visualization obtained by t-SNE.

Table 2: WMD distance between source and target domains
Domains | Mean | Variance
Train_RGB → Train_LS ∪ Test_LS | 1.32 | 0.33
Train_RGB → Test_RGB | 1.11 | 0.31

4.2 Experiments
Three experiments are conducted to evaluate cross-domain recognition of retail products using NLP. They are described below, and a summary of the training and test datasets for the experiments is given in table 3.

Experiment 1. The goal of the first experiment is to evaluate the performance of the model pretrained using RGB data on the LS data without any retraining of the model. The objective is to get insight into the gap between the two domains. Thus, we train the CNN and BERT models on a training set including only textual data from the RGB domain, Train_RGB. Then we evaluate them on Test_RGB and Test_LS and report the obtained accuracies.

Experiment 2. The goal of the second experiment is to explore how additional training samples from the target domain improve the model performance compared to experiment 1. Thus, we train the model with Train_RGB and an increasing number of additional samples from Train_LS, using a random selection of 2, 5, 10, 20, 30 and 50 samples per class for each training run. Due to the random selection of samples from Train_LS, 5 iterations are performed for each training run and the mean accuracy is reported.

Experiment 3. In a realistic context such as a fixed SCO system, the products are often occluded when scanned. In this experiment, we simulate how occlusion affects the performance of the model by removing parts (either 25% or 50%) of the textual data. Since clustering is used to find the clusters defining what to occlude in Test_LS_CC, 5 iterations are evaluated for each case and the mean result is reported.

Table 3: Setup of datasets for the experiments
Experiment | Training Dataset | Test Dataset
1 | Train_RGB | Test_RGB, Test_LS
2 | Train_RGB + {2, 5, 10, 20, 30, 50} samples from Train_LS | Test_LS
3 | Train_RGB | Test_LS_CC with occlusion factors of {0.25, 0.5} and cluster sizes of {4, 8}

For the BERT model, we used a sequence length of 128, which included most of the text snippets for the samples. On the few


samples that contained longer sequences than 128, the sequence was cut.

5 RESULTS AND DISCUSSION
The first experiment compares the performance of the two kinds of models that we have used, the CNN with GloVe and BERT. The results are reported in table 4. Once the CNN model is trained using Train_RGB, it is evaluated on the RGB domain using the Test_RGB set, where it achieves an accuracy of 81.3%. Without any retraining, the same pretrained model achieves 65.1% in the LS domain. The BERT model, on the other hand, achieves an accuracy of 88.2% on the test set from the source domain and as much as 80.3% in the target domain.

In summary, the results of this experiment show that when the model pretrained on the source domain is used directly on the target domain, the accuracy is reduced by approximately 16 percentage points for CNN and 8 percentage points for BERT. In addition, the BERT model shows a much higher performance than CNN.

Table 4: Accuracy results for Experiment 1 using BERT and CNN-based model
Model | Test_RGB | Test_LS
CNN | 81.3% | 65.1%
BERT | 88.2% | 80.3%

Table 5 shows the results of the second experiment, in which we train the models with varying amounts of data from the target domain. The goal is to determine the amount of data required for a reasonable performance, which should preferably be similar to the one obtained in the source domain. As explained in the previous section, the test dataset is fixed, while the training data from the target domain used for fine-tuning varies and includes 2, 5, 10, 20, 30 and 50 random text samples per class. Both models learn from a relatively small amount of data from the new domain. The accuracy of the CNN model increases by 4.4 percentage points when only 2 samples per product are added, and reaches 80.2% when 30 samples are added for each class, which is close to the accuracy obtained on the source domain. Retraining the BERT model with only 2 data points for each product yields an accuracy of 82.9%, which represents an increase of 2.6 percentage points. Adding 20 samples to the training data results in an accuracy of 86.0%, whereas 50 samples were enough to achieve 88.1%, which is comparable to the accuracy obtained in the source domain. The CNN model also achieves an accuracy comparable to that obtained in the source domain when 50 samples are added to the training data, although it remains below the BERT accuracy.

accuracy of 80.1% achieved in experiment 1 for Test_LS drops around 5 percentage points when having an occlusion factor of 0.25 and around 15 percentage points with an occlusion factor of 0.5 for Test_LS_CC. The classification difference is not very significant with an OF of 0.25. This scenario is comparable to a hand covering a medium-sized package, showing that the BERT model is robust to occlusion. The number of clusters was varied to investigate whether it affects our emulation of occlusion by a hand covering a product. The results with 4 and 8 clusters show almost the same accuracy for OF 0.25 and 0.5.

Table 6: Accuracy obtained in experiment 3 for the BERT model testing the impact of occlusion using Test_LS_CC
Dataset | Clusters | Occlusion factor | Accuracy
Test_LS_CC | 4 | 0.25 | 76.2%
Test_LS_CC | 4 | 0.5 | 67.9%
Test_LS_CC | 8 | 0.25 | 77.3%
Test_LS_CC | 8 | 0.5 | 67.9%

Summarizing the results, the experiments show that NLP can leverage the knowledge acquired in different environments. Specifically, the BERT model can achieve good performance for retail product recognition on a dataset with a large number of products, and it is robust to domain change. Retraining the model with 20 samples per product from the target domain leaves the accuracy only about 2 percentage points below that obtained in the source domain, and 50 samples were enough to obtain a performance comparable to that obtained in the source domain.

6 CONCLUSIONS
A common aspect of many learning tasks, including retail product recognition, is that the available labeled data used for training does not match the actual data in the deployment environment. This paper aims to handle this domain shift by transferring the learning acquired in the source domain to the deployment domain, which differs in aspects such as sensors, occlusion, viewing angles, and lighting conditions. The results show that cross-domain NLP retail recognition preserves similar performance when using the BERT language model. We also show that a small amount of extra training data from the target domain increases the performance to the same level as in the source domain. In the future, the effects of using and combining different OCR engines could also be explored for real-time applicability.

ACKNOWLEDGMENTS
This work was supported by the Swedish Knowledge Foundation (DATAKIND 20190194), the company ITAB, and Smart Industry Sweden (KKS-2020-0044).
Table 5: Accuracy obtained in experiment 2 using BERT and
CNN-based model
REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural Machine
#Samples per
Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs.CL]
class from 2 5 10 20 30 50
[2] Adrian Beck. 2018. Self-checkout in retail: Measuring the loss. Technical Report.
T r ain LS
Efficient Consumer Response Community. https://doi.org/10.13140/RG.2.2.14100.
CNN 69.5% 74.2% 76.3% 79.0% 80.2% 80.7% 55686
BERT 82.9% 84.1% 85.3% 86.0% 86.8% 88.1% [3] Adrian Beck and Matt Hopkins. 2017. Scan and rob! Convenience shopping,
crime opportunity and corporate social responsibility in a mobile world. Security
Journal 30, 4 (oct 2017), 1080–1096. https://doi.org/10.1057/sj.2016.6
The results from experiment 3 are summarized in table 6, includ- [4] Gabriela Csurka. 2017. Domain Adaptation for Visual Applications: A Compre-
ing results for the BERT model. They show that the classification hensive Survey. ArXiv abs/1702.05374 (2017).


[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
[6] Venugopal Gundimeda, Ratan S. Murali, Rajkumar Joseph, and N. T. Naresh Babu. 2019. An Automated Computer Vision System for Extraction of Retail Food Product Metadata. In First International Conference on Artificial Intelligence and Cognitive Computing, Raju Surampudi Bapi, Koppula Srinivas Rao, and Munaga V. N. K. Prasad (Eds.). Springer Singapore, Singapore, 199–216. https://doi.org/10.1007/978-981-13-1580-0_20
[7] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. https://doi.org/10.18653/v1/p18-1031
[8] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From Word Embeddings To Document Distances. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 37), Francis Bach and David Blei (Eds.). PMLR, Lille, France, 957–966.
[9] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. 2015. Learning Transferable Features with Deep Adaptation Networks. arXiv:1502.02791 [cs.LG]
[10] Shuteng Niu, Yongxin Liu, Jian Wang, and Houbing Song. 2020. A Decade Survey of Transfer Learning (2010–2020). IEEE Transactions on Artificial Intelligence 1, 2 (oct 2020), 151–166. https://doi.org/10.1109/tai.2021.3054609
[11] Rachid Oucheikh, Tobias Pettersson, and Tuwe Löfström. 2022. Product verification using OCR classification and Mondrian conformal prediction. Expert Systems with Applications 188 (2022), 115942. https://doi.org/10.1016/j.eswa.2021.115942
[12] Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (oct 2010), 1345–1359. https://doi.org/10.1109/tkde.2009.191
[13] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global Vectors for Word Representation. EMNLP 14, 1532–1543. https://doi.org/10.3115/v1/D14-1162
[14] Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). Association for Computational Linguistics. https://doi.org/10.18653/v1/w19-4302
[15] Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Correlation Alignment for Unsupervised Domain Adaptation. arXiv:1612.01939 [cs.CV]
[16] Mamatha Thota, Stefanos Kollias, Mark Swainson, and Georgios Leontidis. 2020. Multi-source domain adaptation for quality control in retail food packaging. Computers in Industry 123 (2020), 103293. https://doi.org/10.1016/j.compind.2020.103293
[17] Alessio Tonioni and Luigi Di Stefano. 2019. Domain invariant hierarchical embedding for grocery products recognition. Computer Vision and Image Understanding 182 (2019), 81–92. https://doi.org/10.1016/j.cviu.2019.03.005
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[19] Mei Wang and Weihong Deng. 2018. Deep visual domain adaptation: A survey. Neurocomputing 312 (oct 2018), 135–153. https://doi.org/10.1016/j.neucom.2018.05.083
[20] Wenyong Wang, Yongcheng Cui, Guangshun Li, Chuntao Jiang, and Song Deng. 2020. A self-attention-based destruction and construction learning fine-grained image classification method for retail product recognition. Neural Computing and Applications 32, 18 (jul 2020), 14613–14622. https://doi.org/10.1007/s00521-020-05148-3
[21] Yimu Wang, Ren-Jie Song, Xiu-Shen Wei, and Lijun Zhang. 2020. An Adversarial Domain Adaptation Network For Cross-Domain Fine-Grained Recognition. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 1217–1225. https://doi.org/10.1109/WACV45572.2020.9093306
[22] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big Data 3, 1 (may 2016). https://doi.org/10.1186/s40537-016-0043-6
[23] Libo Zhang, Dawei Du, Lufficc Li, Yanjun Wu, and Tiejian Luo. 2020. Iterative Knowledge Distillation for Automatic Check-Out. IEEE Transactions on Multimedia PP (11 2020). https://doi.org/10.1109/TMM.2020.3037502
