Namiki 2021
Authorized licensed use limited to: University of Exeter. Downloaded on May 27,2021 at 21:09:20 UTC from IEEE Xplore. Restrictions apply.
time recognition” as meaning “to attach a recognition result to the corresponding tracking ID before the object goes out of tracking range.” Thus, the main problems in combining CNN-based recognition with high-frame-rate imaging in real-time applications can be summarized as follows: 1) the workload of recognition needs to be reduced while maintaining high recognition accuracy, and 2) the robustness of the recognition needs to be improved using a sufficiently fast (i.e., light-weight) algorithm. For example, a naive approach to addressing the former problem involves randomly sampling one image from the image sequence obtained by the high-speed camera and passing it to the recognizer (Fig. 1a). Although this approach is fast and simple, the recognition accuracy becomes dependent on the image selection because the randomly sampled image is not guaranteed to be suitable for recognition.

To tackle the aforementioned problems, we propose a novel object recognition framework specialized for real-time applications with high-speed camera imaging. The proposed method consists of two key parts: population data cleansing based on the recognizability score and data ensemble with a single light-weight CNN model (Fig. 1b). Population data cleansing improves the recognition accuracy by quantifying the recognizability and by removing data with low recognizability from the original population, while data ensemble improves the robustness of the object recognition by merging the class probability outputs over multiple images. Experimental results with a real-world dataset show that our framework is more effective than existing methods. We also prepared a new video dataset for classification tasks of fast-moving objects, captured using a 1,000 fps high-speed camera, to evaluate our method.

The contributions of this paper are 1) our novel object recognition framework for high-speed camera imaging based on population data cleansing and data ensemble, and 2) a new dataset we constructed for object recognition using high-speed camera images.

This paper is organized as follows: Section 2 describes related work for achieving recognition in real time and existing high-frame-rate video datasets. Section 3 introduces our framework, which combines CNN-based recognition models with real-time systems using a high-speed camera. Section 4 presents an evaluation of the method in several experiments, and Section 5 concludes the paper.

II. RELATED WORK

A. Reducing the Workload of Recognition

Because a high-speed camera enables tracking objects without frame-by-frame recognition, what we need is to recognize the objects while they are in the camera view. Roughly three ways can be used to reduce the volume of image data input from the camera: 1) downscaling each image, 2) combining sequential images into one image, and 3) selecting one image from sequential images.

Examples of downscaling filters, starting with the classical techniques originating from Shannon's sampling theorem [5], include bilinear, bicubic, and Lanczos [6], [7]. Another approach focuses on changing the aspect ratio of an image while preserving the salient features of the image as much as possible (e.g., [8] and [9]). However, none of them is suitable for object recognition tasks because the downscaled (i.e., low-resolution) images usually lose or alter feature information necessary for detailed recognition.

Many studies have been performed on image composition, such as high dynamic range (HDR) reconstruction [10] and deblurring [11]. Although these methods maintain the image quality while controlling the number of images to be recognized in accordance with the capability of the recognizer, they require great computational resources because of the complex calculations. Furthermore, they cannot handle scenes where the appearance changes dynamically.

Image selection, one of the simplest ways to achieve low complexity, is basically performed randomly or at regular intervals; however, random image selection cannot guarantee high recognition accuracy when the image quality in the sequence varies widely.

Our framework overcomes these difficulties and achieves high robustness and high accuracy by calculating recognizability scores frame by frame and removing low-recognizability images prior to the image selection.

B. Downsizing the Recognition Network Model

Real-time recognition with a high-speed camera requires building a recognition model that is sufficiently small and accurate by considering conditions such as the learning cost, the specifications of the computing device (e.g., GPU and FPGA), and the input data characteristics. Although the latest CNN models generally have high spatial and temporal complexity, many researchers have proposed methods to obtain smaller architectures with high accuracy. These methods can be categorized into the following six types: factorization [12], pruning [13], neural architecture search (NAS) [14], early termination and dynamic computation graphs [15], distillation [16], and quantization [17].

However, most of these existing approaches aim to recognize one given image as accurately as possible, while our target is recognizing a vast number of sequential images obtained by a high-speed camera.

C. Ensemble Methods

The ensemble method usually trains multiple base recognizers and aggregates them to get final recognition results, thereby reducing the number of recognition errors. In most cases, it is known to be better than the single best base recognizer [18]. The paradigm to train the base recognizers is generally separated into two types: sequential methods and parallel methods.

The sequential methods create a strong recognizer from a number of weak ones. For example, boosting [19] recreates a model that attempts to correct the errors of the previous model and combines them following a deterministic strategy.
Fig. 2. Pipeline of our framework. (a) Online recognition process. (b) Training process of the scoring function F(u).
Models are added until the training dataset is predicted perfectly or a maximum number of models are added.

The parallel methods train multiple (often homogeneous) weak recognizers independently from each other and aggregate their outputs. For example, bagging [20] creates multiple bootstrap samples from the original distribution so that each sample acts as an almost independent dataset, fits weak recognizers to each of these samples, and aggregates them to obtain a less variant answer. Another well-known example is the random forest [21], one of the decision tree methods, which randomly samples over features and uses only a random subset of features to build each tree. For aggregation, averaging methods that output continuous numeric values are mainly used in regression tasks, while voting methods are mainly used in classification problems. There is also a method called stacking [22], which trains a meta-model to output a prediction based on multiple predictions from weak recognizers.

These ensemble methods train multiple weak recognizers and combine them in some way. However, they consume a large volume of memory in prediction and thus cause a large loading overhead when multiple individual recognizers are stored on devices such as a GPU.

Semi-supervised methods are also available. They increase the input data using various data augmentation techniques, utilize them to train a student recognizer, and update a teacher recognizer as an exponential moving average of the student weights [23].

From a real-time point of view, we assume that training and using a single weak recognizer for prediction is better than preparing multiple weak recognizers. Therefore, in the next section, we present our method to aggregate the predictions of a single recognizer over multiple input images.

D. High-Frame-Rate Video Dataset

Some existing datasets such as SHIP-v [24], SlowFlow [25], and N-CARS [26] have high-frame-rate video images. SHIP-v mainly deals with relatively high-speed human motions such as arm swinging captured at 1,000 fps for gesture recognition tasks. SHIP-v also contains some high-speed objects like a ping-pong ball and a yo-yo, as well as pan-tilt motions of the camera itself. SlowFlow is a collection of optical flow reference data in a variety of real-world scenes in 30–230 fps videos. N-CARS is a large real-world dataset for car classification, composed of over 24,000 samples of cars and non-cars (i.e., background). The images are derived from an event-based camera, which outputs pixel-level brightness changes instead of the light intensity itself. The camera is mounted behind the windshield of a car, and the view is similar to what the driver would see.

However, none of these datasets has been designed for classification tasks over object categories, especially for objects moving at high speed. Therefore, we introduce a new dataset to handle this issue in Section 4.

III. OUR FRAMEWORK

A. Pipeline of the Framework

Fig. 2a illustrates the pipeline of our framework for online recognition using a high-speed camera. The online recognition process is composed of three steps: object tracking, population data cleansing, and data ensemble recognition. The object tracker extracts frame-by-frame region of interest (ROI) images of each object using a self-window method [4] without visual feature recognition. Next, image quality scores in the population data cleansing step are calculated using a scoring function F(u) that is described in the next subsection, and a certain rate of ROI images with lower scores are removed from the recognition target population. The scoring function F(u) is trained using the results of whether the predictions of the pre-trained recognition model are true or false (Fig. 2b). Because images with higher F(u) scores are considered to have higher recognizability than those with lower scores, this F(u) can be used to cleanse, i.e., to select a good part of, the original population of ROI images. Then, a single light-weight CNN model receives multiple different ROI images of the same object simultaneously, which are sampled from the cleansed population, and the class probabilities over these images are aggregated into a single classification result. Note that the network used in the data ensemble is required to be fast enough to output the class probability before the target object exits the camera view.

In the following subsections, we describe the details of the population data cleansing and the data ensemble.
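The three-step online pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_proba` stands in for the light-weight CNN, and the function names, default cleansing rate, and ensemble size are assumptions.

```python
import numpy as np

def recognizability_score(u, w, w0):
    """Linear scoring function F(u) = w . u + w0 from tracking features (Sec. III-B)."""
    return float(np.dot(w, u) + w0)

def online_recognize(roi_images, features, w, w0, predict_proba,
                     cleansing_rate=0.4, n_ensemble=8):
    """Cleanse low-score ROIs, sample N images, and average class probabilities."""
    scores = [recognizability_score(u, w, w0) for u in features]
    # population data cleansing: drop the lowest-scoring fraction of ROIs
    order = np.argsort(scores)[::-1]
    keep = order[: max(1, int(len(order) * (1.0 - cleansing_rate)))]
    # data ensemble: take up to N images from the cleansed pool
    sampled = [roi_images[i] for i in keep[:n_ensemble]]
    probs = np.mean([predict_proba(x) for x in sampled], axis=0)
    return int(np.argmax(probs))  # final predicted class
```

In a real deployment the scoring and sampling would run per frame inside the tracking loop, while the CNN forward pass runs once per tracked object.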
B. Population Data Cleansing Using Tracking Features

The purpose of the population data cleansing step is to remove false ROI images, which the recognition model yields as wrong labels, as much as possible prior to the object recognition. Because this step is to be executed against one or more ROI images per frame (depending on the number of objects in the same frame) with no latency, the computational complexity should be as low as possible. Therefore, we do not expect this process to separate false and true data completely but to generate a new population of recognition target images with higher recognizability than the original one. The details of the data cleansing step are as follows.

First, we prepare a set of ROI images annotated with two classes, true (right label assignment) and false (wrong label assignment), in accordance with the output of the pre-trained recognition network that is used in the subsequent recognition step. Second, using a fast classification method such as a linear support vector machine (SVM) [27] or linear discriminant analysis (LDA) [28], we obtain a hyperplane

F(u) := w^T u + w_0 = 0,

where u, w, and w_0 represent an M-dimensional feature vector of the ROI image, a weight vector, and an offset value, respectively. w and w_0 are calculated so that F(u) separates true and false data in the feature space to a certain extent. To achieve low computational complexity, the elements of u need to be selected from the features that can be easily derived from the calculation in the prior object tracking step. Examples of such features can be categorized into the following types:
• Statistics of the pixel values (e.g., area, average, variance, skewness, and kurtosis)
• Spatial derivatives of the pixel values (e.g., gradient and Laplacian)
• Temporal derivatives of the object position (e.g., center of mass, velocity, and acceleration)

Then this F(u) is used as a scoring function in the online recognition process. Some of the ROI images that have lower scores than others are removed before the recognition process under the assumption that the remaining images with higher scores lead to higher recognition accuracy. Note that too much data cleansing may lose the data diversity, because of the simple scoring formula, and weaken the generalization ability of the subsequent data ensemble.

C. Data Ensemble with a Light-weight CNN Model

In the data ensemble step, we aim to improve and stabilize the recognition accuracy by inputting multiple ROI images into a single light-weight CNN model simultaneously and by aggregating the outputs, instead of constructing a large model with high accuracy on any single image.

First, the light-weight model is constructed by decreasing the convolution layers (or the channels of the convolution layers) of an existing CNN model such as MobileNetV2 [12] or ResNetV2 [29]. Although this may weaken the generalization performance of the model, we compensate for this disadvantage by inputting multiple images and aggregating the prediction outputs. Let N be the number of input images; then, N images are sampled from the pool of ROI images of the target object and input into the light-weight model. The predicted classification probabilities are aggregated into a single class output as follows:

H^j(x_i, ..., x_{i+N}) = (1/N) Σ_{k=i}^{i+N} h^j(x_k),
C_i = arg max_j H^j(x_i, ..., x_{i+N}),

where x_i is the i-th ROI image of the sampled images, h^j(x_i) ∈ [0, 1] is the j-th class probability of x_i output from the light-weight model, H^j is the averaged j-th class probability, and C_i is the final output of the predicted class. Note that the processing time varies in accordance with the number of images used in the data ensemble.

D. Time Constraint in Real-time Applications

As we mentioned in Section 1, our goal of "real-time recognition" means "to attach a recognition result before the object goes out of tracking range." Let T_rcg be the required time for the object recognition; then the time constraint in real-time applications is expressed as follows:

T_rcg := t_c(N) + t_e(M) < T_tr,

where t_c(N) is the required time to obtain N sequential images from a camera, t_e(M) is the processing time of the data ensemble with M (≤ N) images, and T_tr is the duration of tracking. The processing time of frame-by-frame tracking and data cleansing does not appear in the constraint because both are executed faster than the frame rate. When objects pass in front of the camera one by one, T_tr equals the reciprocal of the throughput.

Considering a visual inspection machine that uses a normal video-rate (i.e., 30 fps) camera, a reasonable maximum value of T_tr is around 30 milliseconds; that is to say, N and M can take values from 1 to 30 when a 1,000 fps camera is used.

IV. EXPERIMENTS

A. Data Preparation

Due to the lack of publicly available datasets for the classification task of moving objects in high-frame-rate videos, we prepared a new dataset inspired by the visual inspection process in mass production. The requirements of the dataset were as follows: 1) every video had to be captured at a high frame rate (e.g., 1,000 fps), 2) objects had to move at high speed, and 3) all data had to be annotated with object categories. We used IDP Express [30], which enabled us to conduct high-frame-rate recording (up to 2,000 fps, 512 × 512 pixels) and real-time image processing simultaneously, and collected the sequential ROI images of each object. The objects were 1-cm-diameter white blocks with two attributes: shape (two classes) and character (ten classes) carved on the surface. See Fig. 3 for the shooting background and an example of the object trajectory, and Fig. 4 for an example of sequential ROI images in the dataset.
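The scoring-function fit of Sec. III-B can be sketched with a minimal two-class Fisher discriminant, which yields the hyperplane F(u) = w^T u + w_0. This is an illustrative numpy-only stand-in for the linear SVM [27] or LDA [28] the paper names; the regularization constant and midpoint threshold are assumptions.

```python
import numpy as np

def fit_scoring_function(U, y):
    """Fit F(u) = w . u + w0 on tracking features.
    U: (n, M) feature matrix; y: 1 = true (right label), 0 = false."""
    U0, U1 = U[y == 0], U[y == 1]
    mu0, mu1 = U0.mean(axis=0), U1.mean(axis=0)
    # within-class scatter, regularized so the solve is always well-posed
    Sw = np.cov(U0, rowvar=False) + np.cov(U1, rowvar=False)
    Sw = Sw + 1e-6 * np.eye(U.shape[1])
    w = np.linalg.solve(Sw, mu1 - mu0)      # Fisher discriminant direction
    w0 = -0.5 * float(w @ (mu0 + mu1))      # boundary at the midpoint of class means
    return w, w0

def score(u, w, w0):
    """Recognizability score: higher means more likely to be recognized correctly."""
    return float(w @ u + w0)
```

Scoring is a single dot product per ROI, which is why it can run inside the per-frame tracking loop.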
TABLE I
SPECIFICATIONS OF THE HIGH-SPEED VIDEO DATASETS
[Table partly lost in extraction; the recoverable Resolution row lists 62 × 62 (extracted from 512 × 512 images) for our dataset against 1,024 × 1,024, 2,560 × 1,440, and N/A for the others.]

We used 70% of the dataset as the training data for the scoring function and the light-weight recognition model, 10% as the validation data, and the remaining 20% as the test data.
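The aggregation rule of Sec. III-C is a direct average of per-class probabilities followed by an argmax. A minimal transcription (the array layout is an assumption; `h_outputs` holds the light-weight model's outputs for the N sampled ROIs):

```python
import numpy as np

def aggregate(h_outputs):
    """Data ensemble aggregation:
    H^j = (1/N) * sum_k h^j(x_k);  C = argmax_j H^j.
    h_outputs: (N, n_classes) array of class probabilities."""
    H = np.mean(h_outputs, axis=0)   # averaged j-th class probability H^j
    return int(np.argmax(H))         # predicted class C_i
```

Averaging before the argmax lets a few low-quality frames be outvoted by the rest, which is the source of the robustness gain over single-image prediction.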
Fig. 5. Probability density distribution of the recognizability score F(u) against the (a) training and (b) test dataset. The false data biases towards the left side compared with the true data, so the cumulative ratio of the false data always surpasses that of the true data.

Fig. 6. Comparison of the recognition accuracy and processing time between our ensemble recognition models and conventional deeper recognition models. (a) and (b) show the mean and minimum accuracy in 100 trials, respectively.
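The feasibility condition of Sec. III-D, T_rcg := t_c(N) + t_e(M) < T_tr, reduces to simple arithmetic once the frame rate is fixed. A small sketch (the 30 ms tracking duration and the measured ensemble time are assumptions taken from the visual-inspection example):

```python
def is_real_time(n_frames, t_ensemble_ms, t_tracking_ms=30.0, fps=1000):
    """Check T_rcg = t_c(N) + t_e(M) < T_tr.
    n_frames: N, images captured; t_ensemble_ms: measured t_e(M);
    t_tracking_ms: T_tr, how long the object stays in tracking range."""
    t_capture_ms = n_frames * 1000.0 / fps   # t_c(N) at the given frame rate
    return t_capture_ms + t_ensemble_ms < t_tracking_ms
```

For example, at 1,000 fps, capturing 10 frames costs 10 ms, so an ensemble pass of up to roughly 20 ms still meets a 30 ms tracking window.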
…average recognition accuracy when the data ensemble number was small enough, while stably suppressing the repeatability errors. Note that the calculation of F(u) was executed in the object tracking process within 1 millisecond even when five objects existed in the camera view at the same time; therefore, the data cleansing process did not affect the latency of the overall process.

The aforementioned experiments demonstrated that our framework is effective for online object recognition tasks using a high-speed camera.

In addition, as shown in the supplemental video 1, we successfully developed a real-time sorting system that recognizes 1-cm-diameter free-falling blocks with three attributes (2 shapes, 6 colors, and 36 characters) at a rate of 100 objects per second and flicks out any designated objects immediately.

Fig. 7. Comparison of the recognition accuracy among trials with different data cleansing rates (DCRs). (a) Higher DCR results in higher mean accuracy when the data ensemble number is small enough. Dashed lines indicate cross points between the results with each DCR (blue, green, and orange) and the result with no data cleansing (red). (b) Higher DCR constantly results in higher minimum accuracy, which means few repeatability errors.

V. CONCLUSION

This paper proposed a novel object recognition framework for real-time applications with high-speed camera imaging. The key approaches of the framework are population data cleansing based on recognizability scores and data ensemble with multiple input images into a single light-weight CNN model.

1 We plan to release this video on the internet after acceptance.

REFERENCES

[1] …, "…mirror with background subtraction," Advanced Robotics, vol. 29, no. 12, pp. 457–468, Apr. 2015.
[2] A. Namiki, Y. Imai, M. Ishikawa, and M. Kaneko, "Development of a high-speed multifingered hand system and its application to catching," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), Las Vegas, NV, Oct. 2003, pp. 2666–2671.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv e-prints, arXiv:1512.03385, Dec. 2015.
[4] I. Ishii, Y. Nakabo, and M. Ishikawa, "Target tracking algorithm for 1 ms visual feedback system using massively parallel processing vision," in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), Minneapolis, MN, Apr. 1996, pp. 2309–2314.
[5] C. E. Shannon, "Communication in the presence of noise," in Proc. Institute of Radio Engineers (IRE), vol. 33, Jan. 1949, pp. 10–21.
[6] G. Wolberg, Digital Image Warping. Los Alamitos, CA: IEEE Computer Society Press, 1990.
[7] K. Turkowski and S. Gabriel, "Filters for common resampling tasks," in Graphics Gems I, A. Glassner, Ed. Academic Press, Jul. 1993, pp. 147–165.
[8] S. Avidan and A. Shamir, "Seam carving for content-aware image resizing," in Proc. ACM SIGGRAPH 2007, San Diego, CA, Aug. 2007, p. 10–es.
[9] L. Wolf, M. Guttmann, and D. Cohen-Or, "Non-homogeneous content-driven video-retargeting," in Proc. IEEE Int. Conf. on Computer Vision (ICCV), Rio de Janeiro, Oct. 2007, pp. 1–6.
[10] P. Sen, N. K. Kalantari, M. Yaesoubi, S. Darabi, D. B. Goldman, and E. Shechtman, "Robust patch-based HDR reconstruction of dynamic scenes," ACM Trans. on Graphics (TOG), vol. 31, no. 6, pp. 203:1–11, Nov. 2012.
[11] T. H. Kim, K. M. Lee, B. Scholkopf, and M. Hirsch, "Online video deblurring via dynamic temporal blending network," in Proc. IEEE Int. Conf. on Computer Vision (ICCV), Venice, Oct. 2017, pp. 4038–4047.
[12] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun. 2018, pp. 4510–4520.
[13] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," arXiv e-prints, arXiv:1608.08710, Aug. 2016.
[14] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le, "Neural optimizer search with reinforcement learning," in Proc. Int. Conf. on Machine Learning (ICML), vol. 70, Sydney, Aug. 2017, pp. 459–468.
[15] Z. Wu, T. Nagarajan, A. Kumar, and S. Rennie, "BlockDrop: Dynamic inference paths in residual networks," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun. 2018, pp. 8817–8826.
[16] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv e-prints, arXiv:1503.02531, Mar. 2015.
[17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. Int. Conf. on Machine Learning (ICML), vol. 37, Lille, Jul. 2015, pp. 1737–1746.
[18] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993–
1001, Oct. 1990.
[19] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of
on-line learning and an application to boosting,” Journal of Computer
and System Sciences, vol. 55, no. 1, pp. 119–139, Aug. 1997.
[20] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–
140, Aug. 1996.
[21] ——, “Random forests,” Machine Learning, vol. 45, pp. 5–32, Jan. 2001.
[22] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2,
pp. 241–260, 1992.
[23] A. Tarvainen and H. Valpola, “Mean teachers are better role mod-
els: Weight-averaged consistency targets improve semi-supervised deep
learning results,” in Proc. Int. Conf. on Neural Information Processing
Systems (NIPS), Long Beach, CA, Dec. 2017, pp. 1195–1204.
SHIP-v: Services for high-speed image processing. Ishikawa Senoo Lab., the Univ. of Tokyo. [Online]. Available: http://ishikawa-vision.org/ship-v/index-e.html
[25] J. Janai, F. Guney, J. Wulff, M. Black, and A. Geiger, “Slow flow:
Exploiting high-speed cameras for accurate and diverse optical flow
reference data,” in Proc. IEEE/CVF Conf. on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, Jul. 2017, pp. 1406–1416.
[26] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman, "HATS: Histograms of averaged time surfaces for robust event-based object classification," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun. 2018.
[27] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for
optimal margin classifiers,” in Proc. Annual Workshop on Computational
Learning Theory (COLT), Pittsburgh, PA, Jul. 1992, pp. 144–152.
[28] N. Mohanty, A. John, R. Manmatha, and T. M. Rath, Shape-Based Image Classification and Retrieval, Dec. 2013, vol. 31, pp. 249–267.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern
Recognition (CVPR), Las Vegas, NV, Jun. 2016, pp. 770–778.
[30] I. Ishii, T. Tatebe, Q. Gu, Y. Moriue, T. Takaki, and K. Tajima, “2000
fps real-time vision system with high-frame-rate video recording,” in
Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), Anchorage,
AK, Jun. 2010, pp. 1536 – 1541.