Namiki 2021
Authorized licensed use limited to: University of Exeter. Downloaded on May 27,2021 at 21:09:20 UTC from IEEE Xplore. Restrictions apply.
time recognition” as meaning “to attach a recognition result to the corresponding tracking ID before the object goes out of tracking range.” Thus, the main problems in combining CNN-based recognition with high-frame-rate imaging in real-time applications can be summarized as follows: 1) the workload of recognition needs to be reduced while maintaining high recognition accuracy, and 2) the robustness of the recognition needs to be improved using a sufficiently fast (i.e., light-weight) algorithm. For example, a naive approach to addressing the former problem involves randomly sampling one image from the image sequence obtained by the high-speed camera and passing it to the recognizer (Fig. 1a). Although this approach is fast and simple, the recognition accuracy becomes dependent on the image selection because the randomly sampled image is not guaranteed to be suitable for recognition.

To tackle the aforementioned problems, we propose a novel object recognition framework specialized for real-time applications with high-speed camera imaging. The proposed method consists of two key parts: population data cleansing based on the recognizability score and data ensemble with a single light-weight CNN model (Fig. 1b). Population data cleansing improves the recognition accuracy by quantifying the recognizability and by removing data with low recognizability from the original population, while data ensemble improves the robustness of the object recognition by merging the class probability outputs over multiple images. Experimental results with a real-world dataset show that our framework is more effective than existing methods. We also prepared a new video dataset for classification tasks of fast-moving objects, captured using a 1,000 fps high-speed camera, to evaluate our method.

The contributions of this paper are 1) our novel object recognition framework for high-speed camera imaging based on population data cleansing and data ensemble, and 2) a new dataset we constructed for object recognition using high-speed camera images.

This paper is organized as follows: Section 2 describes related work for achieving recognition in real time and existing high-frame-rate video datasets. Section 3 introduces our framework, which combines CNN-based recognition models with real-time systems using a high-speed camera. Section 4 presents an evaluation of the method in several experiments, and Section 5 concludes the paper.

II. RELATED WORK

A. Reducing the Workload of Recognition

Because a high-speed camera enables tracking objects without frame-by-frame recognition, what we need is to recognize the objects while they are in the camera view. Roughly three ways can be used to reduce the volume of image data input from the camera: 1) downscaling each image, 2) combining sequential images into one image, and 3) selecting one image from sequential images.

Examples of downscaling filters, starting with the classical techniques originating from Shannon's sampling theorem [5], include bilinear, bicubic, and Lanczos [6], [7]. Another approach focuses on changing the aspect ratio of an image while preserving the salient features of the image as much as possible (e.g., [8] and [9]). However, none of them is suitable for object recognition tasks because the downscaled (i.e., low-resolution) images usually lose or alter feature information necessary for detailed recognition.

Many studies have been performed on image composition, such as high dynamic range (HDR) reconstruction [10] and deblurring [11]. Although these methods maintain the image quality while controlling the number of images to be recognized in accordance with the capability of the recognizer, they require great computational resources because of the complex calculations. Furthermore, they cannot handle scenes where the appearance changes dynamically.

Image selection, one of the simplest ways to achieve low complexity, is basically performed randomly or at regular intervals; however, random image selection cannot guarantee high recognition accuracy when the image quality in the sequence varies widely.

Our framework overcomes these difficulties and achieves high robustness and high accuracy by calculating recognizability scores frame by frame and removing low-recognizability images prior to the image selection.

B. Downsizing the Recognition Network Model

Real-time recognition with a high-speed camera requires building a recognition model that is sufficiently small and accurate by considering conditions such as the learning cost, the specifications of the computing device (e.g., GPU and FPGA), and the input data characteristics. Although the latest CNN models generally have high spatial and temporal complexity, many researchers have proposed methods to obtain smaller architectures with high accuracy. These methods can be categorized into the following six types: factorization [12], pruning [13], neural architecture search (NAS) [14], early termination and dynamic computation graphs [15], distillation [16], and quantization [17].

However, most of these existing approaches aim to recognize one given image as accurately as possible, while our target is recognizing a vast number of sequential images obtained by a high-speed camera.

C. Ensemble Methods

The ensemble method usually trains multiple base recognizers and aggregates them to get final recognition results, thereby reducing the number of recognition errors. In most cases, it is known to be better than the single best base recognizer [18]. The paradigm to train the base recognizers is generally separated into two types: sequential methods and parallel methods.

The sequential methods create a strong recognizer from a number of weak ones. For example, boosting [19] recreates a model that attempts to correct the errors of the previous model and combines them following a deterministic strategy.
Fig. 2. Pipeline of our framework. (a) Online recognition process. (b) Training process of the scoring function F(u).
Models are added until the training dataset is predicted perfectly or a maximum number of models are added.

The parallel methods train multiple (often homogeneous) weak recognizers independently from each other and aggregate their outputs. For example, bagging [20] creates multiple bootstrap samples from the original distribution so that each sample acts as an almost independent dataset, fits weak recognizers to each of these samples, and aggregates them to obtain a less variant answer. Another well-known example is the random forest [21], one of the decision tree methods, which randomly samples over features and uses only a random subset of features to build each tree. For aggregation, averaging methods that output continuous numeric values are mainly used in regression tasks, while voting methods are mainly used in classification problems. There is also a method called stacking [22], which trains a meta-model to output a prediction based on multiple predictions from weak recognizers.

These ensemble methods train multiple weak recognizers and combine them in some way. However, they consume a large volume of memory in prediction and thus cause a large loading overhead when multiple individual recognizers are stored on devices such as a GPU.

Semi-supervised methods are also available. They increase the input data using various data augmentation techniques, utilize them to train a student recognizer, and update a teacher recognizer as an exponential moving average of the student weights [23].

From a real-time point of view, we assume that training and using a single weak recognizer for prediction is better than preparing multiple weak recognizers. Therefore, in the next section, we present our method to aggregate the predictions of a single recognizer over multiple input images.

D. High-Frame-Rate Video Dataset

Some existing datasets such as SHIP-v [24], SlowFlow [25], and N-CARS [26] have high-frame-rate video images. SHIP-v mainly deals with relatively high-speed human motions such as arm swinging captured at 1,000 fps for gesture recognition tasks. SHIP-v also contains some high-speed objects like a ping-pong ball and a yo-yo, as well as pan-tilt motions of the camera itself. SlowFlow is a collection of optical flow reference data in a variety of real-world scenes in 30–230 fps videos. N-CARS is a large real-world dataset for car classification, composed of over 24,000 samples of cars and non-cars (i.e., background). The images are derived from an event-based camera, which outputs pixel-level brightness changes instead of the light intensity itself. The camera is mounted behind the windshield of a car, and the view is similar to what the driver would see.

However, none of these datasets has been designed for classification tasks over object categories, especially for objects moving at high speed. Therefore, we introduce a new dataset to handle this issue in Section 4.

III. OUR FRAMEWORK

A. Pipeline of the Framework

Fig. 2a illustrates the pipeline of our framework for online recognition using a high-speed camera. The online recognition process is composed of three steps: object tracking, population data cleansing, and data ensemble recognition. The object tracker extracts frame-by-frame region of interest (ROI) images of each object using a self-window method [4] without visual feature recognition. Next, image quality scores in the population data cleansing step are calculated using a scoring function F(u) that is described in the next subsection, and a certain rate of ROI images with lower scores are removed from the recognition target population. The scoring function F(u) is trained using the results of whether the predictions of the pre-trained recognition model are true or false (Fig. 2b). Because images with higher F(u) scores are considered to have higher recognizability than those with lower scores, this F(u) can be used to cleanse, i.e., to select a good part of, the original population of ROI images. Then, a single light-weight CNN model receives multiple different ROI images of the same object simultaneously, which are sampled from the cleansed population, and the class probabilities over these images are aggregated into a single classification result. Note that the network used in the data ensemble is required to be fast enough to output the class probability before the target object exits the camera view.

In the following subsections, we describe the details of the population data cleansing and the data ensemble.
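The three-step online pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_proba` stands in for the light-weight CNN, and the function names, default cleansing rate, and ensemble size are assumptions.

```python
import numpy as np

def recognizability_score(u, w, w0):
    """Linear scoring function F(u) = w . u + w0 from tracking features (Sec. III-B)."""
    return float(np.dot(w, u) + w0)

def online_recognize(roi_images, features, w, w0, predict_proba,
                     cleansing_rate=0.4, n_ensemble=8):
    """Cleanse low-score ROIs, sample N images, and average class probabilities."""
    scores = [recognizability_score(u, w, w0) for u in features]
    # population data cleansing: drop the lowest-scoring fraction of ROIs
    order = np.argsort(scores)[::-1]
    keep = order[: max(1, int(len(order) * (1.0 - cleansing_rate)))]
    # data ensemble: take up to N images from the cleansed pool
    sampled = [roi_images[i] for i in keep[:n_ensemble]]
    probs = np.mean([predict_proba(x) for x in sampled], axis=0)
    return int(np.argmax(probs))  # final predicted class
```

In a real deployment the scoring and sampling would run per frame inside the tracking loop, while the CNN forward pass runs once per tracked object.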
B. Population Data Cleansing Using Tracking Features

The purpose of the population data cleansing step is to remove false ROI images, which the recognition model yields as wrong labels, as much as possible prior to the object recognition. Because this step is to be executed against one or more ROI images per frame (depending on the number of objects in the same frame) with no latency, the computational complexity should be as low as possible. Therefore, we do not expect this process to separate false and true data completely but to generate a new population of recognition target images with higher recognizability than the original one. The details of the data cleansing step are as follows.

First, we prepare a set of ROI images annotated with two classes, true (right label assignment) and false (wrong label assignment), in accordance with the output of the pre-trained recognition network that is used in the subsequent recognition step. Second, using a fast classification method such as a linear support vector machine (SVM) [27] or linear discriminant analysis (LDA) [28], we obtain a hyperplane

F(u) := w^T u + w_0 = 0,

where u, w, and w_0 represent an M-dimensional feature vector of the ROI image, a weight vector, and an offset value, respectively. w and w_0 are calculated so that F(u) separates true and false data in the feature space to a certain extent. To achieve low computational complexity, the elements of u need to be selected from the features that can be easily derived from the calculation in the prior object tracking step. Examples of such features can be categorized into the following types:
• Statistics of the pixel values (e.g., area, average, variance, skewness, and kurtosis)
• Spatial derivatives of the pixel values (e.g., gradient and Laplacian)
• Temporal derivatives of the object position (e.g., center of mass, velocity, and acceleration)

Then this F(u) is used as a scoring function in the online recognition process. Some of the ROI images that have lower scores than others are removed before the recognition process under the assumption that the remaining images with higher scores lead to higher recognition accuracy. Note that too much data cleansing may lose the data diversity, because of the simple scoring formula, and weaken the generalization ability of the subsequent data ensemble.

C. Data Ensemble with a Light-weight CNN Model

In the data ensemble step, we aim to improve and stabilize the recognition accuracy by inputting multiple ROI images into a single light-weight CNN model simultaneously and by aggregating the outputs, instead of constructing a large model with high accuracy on any single image.

First, the light-weight model is constructed by decreasing the convolution layers (or the channels of the convolution layers) of an existing CNN model such as MobileNetV2 [12] or ResNetV2 [29]. Although this may weaken the generalization performance of the model, we compensate for this disadvantage by inputting multiple images and aggregating the prediction outputs. Let N be the number of input images; then, N images are sampled from the pool of ROI images of the target object and input into the light-weight model. The predicted classification probabilities are aggregated into a single class output as follows:

H^j(x_i, ..., x_{i+N}) = (1/N) Σ_{k=i}^{i+N} h^j(x_k),
C_i = arg max_j H^j(x_i, ..., x_{i+N}),

where x_i is the i-th ROI image of the sampled images, h^j(x_i) ∈ [0, 1] is the j-th class probability of x_i output from the light-weight model, H^j is the averaged j-th class probability, and C_i is the final output of the predicted class. Note that the processing time varies in accordance with the number of images used in the data ensemble.

D. Time Constraint in Real-time Applications

As we mentioned in Section 1, our goal of "real-time recognition" means "to attach a recognition result before the object goes out of tracking range." Let T_rcg be the required time for the object recognition; then the time constraint in real-time applications is expressed as follows:

T_rcg := t_c(N) + t_e(M) < T_tr,

where t_c(N) is the required time to obtain N sequential images from a camera, t_e(M) is the processing time of the data ensemble with M (≤ N) images, and T_tr is the duration of tracking. The processing time of frame-by-frame tracking and data cleansing does not appear in the constraint because both are executed faster than the frame rate. When objects pass in front of the camera one by one, T_tr equals the reciprocal of the throughput.

Considering a visual inspection machine that uses a normal video-rate (i.e., 30 fps) camera, a reasonable maximum value of T_tr is around 30 milliseconds; that is to say, N and M can take values from 1 to 30 when a 1,000 fps camera is used.

IV. EXPERIMENTS

A. Data Preparation

Due to the lack of publicly available datasets for the classification task of moving objects in high-frame-rate videos, we prepared a new dataset inspired by the visual inspection process in mass production. The requirements of the dataset were as follows: 1) every video had to be captured at a high frame rate (e.g., 1,000 fps), 2) objects had to move at high speed, and 3) all data had to be annotated with object categories. We used IDP Express [30], which enabled us to conduct high-frame-rate recording (up to 2,000 fps, 512 × 512 pixels) and real-time image processing simultaneously, and collected the sequential ROI images of each object. The objects were 1-cm-diameter white blocks with two attributes: shape (two classes) and character (ten classes) carved on the surface. See Fig. 3 for the shooting background and an example of the object trajectory, and Fig. 4 for an example of sequential ROI images in the dataset.
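The scoring-function fit of Sec. III-B can be sketched with a minimal two-class Fisher discriminant, which yields the hyperplane F(u) = w^T u + w_0. This is an illustrative numpy-only stand-in for the linear SVM [27] or LDA [28] the paper names; the regularization constant and midpoint threshold are assumptions.

```python
import numpy as np

def fit_scoring_function(U, y):
    """Fit F(u) = w . u + w0 on tracking features.
    U: (n, M) feature matrix; y: 1 = true (right label), 0 = false."""
    U0, U1 = U[y == 0], U[y == 1]
    mu0, mu1 = U0.mean(axis=0), U1.mean(axis=0)
    # within-class scatter, regularized so the solve is always well-posed
    Sw = np.cov(U0, rowvar=False) + np.cov(U1, rowvar=False)
    Sw = Sw + 1e-6 * np.eye(U.shape[1])
    w = np.linalg.solve(Sw, mu1 - mu0)      # Fisher discriminant direction
    w0 = -0.5 * float(w @ (mu0 + mu1))      # boundary at the midpoint of class means
    return w, w0

def score(u, w, w0):
    """Recognizability score: higher means more likely to be recognized correctly."""
    return float(w @ u + w0)
```

Scoring is a single dot product per ROI, which is why it can run inside the per-frame tracking loop.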
TABLE I
SPECIFICATIONS OF THE HIGH-SPEED VIDEO DATASETS
[Table partly lost in extraction; the recoverable Resolution row lists 62 × 62 (extracted from 512 × 512 images) for our dataset against 1,024 × 1,024, 2,560 × 1,440, and N/A for the others.]

We used 70% of the dataset as the training data for the scoring function and the light-weight recognition model, 10% as the validation data, and the remaining 20% as the test data.
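The aggregation rule of Sec. III-C is a direct average of per-class probabilities followed by an argmax. A minimal transcription (the array layout is an assumption; `h_outputs` holds the light-weight model's outputs for the N sampled ROIs):

```python
import numpy as np

def aggregate(h_outputs):
    """Data ensemble aggregation:
    H^j = (1/N) * sum_k h^j(x_k);  C = argmax_j H^j.
    h_outputs: (N, n_classes) array of class probabilities."""
    H = np.mean(h_outputs, axis=0)   # averaged j-th class probability H^j
    return int(np.argmax(H))         # predicted class C_i
```

Averaging before the argmax lets a few low-quality frames be outvoted by the rest, which is the source of the robustness gain over single-image prediction.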
Fig. 5. Probability density distribution of the recognizability score F(u) against the (a) training and (b) test dataset. The false data biases towards the left side compared with the true data, so the cumulative ratio of the false data always surpasses that of the true data.

Fig. 6. Comparison of the recognition accuracy and processing time between our ensemble recognition models and conventional deeper recognition models. (a) and (b) show the mean and minimum accuracy in 100 trials, respectively.
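The feasibility condition of Sec. III-D, T_rcg := t_c(N) + t_e(M) < T_tr, reduces to simple arithmetic once the frame rate is fixed. A small sketch (the 30 ms tracking duration and the measured ensemble time are assumptions taken from the visual-inspection example):

```python
def is_real_time(n_frames, t_ensemble_ms, t_tracking_ms=30.0, fps=1000):
    """Check T_rcg = t_c(N) + t_e(M) < T_tr.
    n_frames: N, images captured; t_ensemble_ms: measured t_e(M);
    t_tracking_ms: T_tr, how long the object stays in tracking range."""
    t_capture_ms = n_frames * 1000.0 / fps   # t_c(N) at the given frame rate
    return t_capture_ms + t_ensemble_ms < t_tracking_ms
```

For example, at 1,000 fps, capturing 10 frames costs 10 ms, so an ensemble pass of up to roughly 20 ms still meets a 30 ms tracking window.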
…average recognition accuracy when the data ensemble number was small enough, while stably suppressing the repeatability errors. Note that the calculation of F(u) was executed in the object tracking process within 1 millisecond even when five objects existed in the camera view at the same time; therefore, the data cleansing process did not affect the latency of the overall process.

The aforementioned experiments demonstrated that our framework is effective for online object recognition tasks using a high-speed camera.

In addition, as shown in the supplemental video 1, we successfully developed a real-time sorting system that recognizes 1-cm-diameter free-falling blocks with three attributes (2 shapes, 6 colors, and 36 characters) at a rate of 100 objects per second and flicks out any designated objects immediately.

Fig. 7. Comparison of the recognition accuracy among trials with different data cleansing rates (DCRs). (a) Higher DCR results in higher mean accuracy when the data ensemble number is small enough. Dashed lines indicate cross points between the results with each DCR (blue, green, and orange) and the result with no data cleansing (red). (b) Higher DCR constantly results in higher minimum accuracy, which means few repeatability errors.

V. CONCLUSION

This paper proposed a novel object recognition framework for real-time applications with high-speed camera imaging. The key approaches of the framework are population data cleansing based on recognizability scores and data ensemble with multiple input images into a single light-weight CNN model.

1 We plan to release this video on the internet after acceptance.

REFERENCES

[1] …, "…mirror with background subtraction," Advanced Robotics, vol. 29, no. 12, pp. 457–468, Apr. 2015.
[2] A. Namiki, Y. Imai, M. Ishikawa, and M. Kaneko, "Development of a high-speed multifingered hand system and its application to catching," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), Las Vegas, NV, Oct. 2003, pp. 2666–2671.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv e-prints, arXiv:1512.03385, Dec. 2015.
[4] I. Ishii, Y. Nakabo, and M. Ishikawa, "Target tracking algorithm for 1 ms visual feedback system using massively parallel processing vision," in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), Minneapolis, MN, Apr. 1996, pp. 2309–2314.
[5] C. E. Shannon, "Communication in the presence of noise," in Proc. Institute of Radio Engineers (IRE), vol. 33, Jan. 1949, pp. 10–21.
[6] G. Wolberg, Digital Image Warping. Los Alamitos, CA: IEEE Computer Society Press, 1990.
[7] K. Turkowski and S. Gabriel, "Filters for common resampling tasks," in Graphics Gems I, A. Glassner, Ed. Academic Press, Jul. 1993, pp. 147–165.
[8] S. Avidan and A. Shamir, "Seam carving for content-aware image resizing," in Proc. ACM SIGGRAPH 2007, San Diego, CA, Aug. 2007, p. 10–es.
[9] L. Wolf, M. Guttmann, and D. Cohen-Or, "Non-homogeneous content-driven video-retargeting," in Proc. IEEE Int. Conf. on Computer Vision (ICCV), Rio de Janeiro, Oct. 2007, pp. 1–6.
[10] P. Sen, N. K. Kalantari, M. Yaesoubi, S. Darabi, D. B. Goldman, and E. Shechtman, "Robust patch-based HDR reconstruction of dynamic scenes," ACM Trans. on Graphics (TOG), vol. 31, no. 6, pp. 203:1–11, Nov. 2012.
[11] T. H. Kim, K. M. Lee, B. Scholkopf, and M. Hirsch, "Online video deblurring via dynamic temporal blending network," in Proc. IEEE Int. Conf. on Computer Vision (ICCV), Venice, Oct. 2017, pp. 4038–4047.
[12] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun. 2018, pp. 4510–4520.
[13] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," arXiv e-prints, arXiv:1608.08710, Aug. 2016.
[14] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le, "Neural optimizer search with reinforcement learning," in Proc. Int. Conf. on Machine Learning (ICML), vol. 70, Sydney, Aug. 2017, pp. 459–468.
[15] Z. Wu, T. Nagarajan, A. Kumar, and S. Rennie, "BlockDrop: Dynamic inference paths in residual networks," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun. 2018, pp. 8817–8826.
[16] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv e-prints, arXiv:1503.02531, Mar. 2015.
[17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. Int. Conf. on Machine Learning (ICML), vol. 37, Lille, Jul. 2015, pp. 1737–1746.
[18] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993–
1001, Oct. 1990.
[19] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of
on-line learning and an application to boosting,” Journal of Computer
and System Sciences, vol. 55, no. 1, pp. 119–139, Aug. 1997.
[20] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–
140, Aug. 1996.
[21] ——, “Random forests,” Machine Learning, vol. 45, pp. 5–32, Jan. 2001.
[22] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2,
pp. 241–260, 1992.
[23] A. Tarvainen and H. Valpola, “Mean teachers are better role mod-
els: Weight-averaged consistency targets improve semi-supervised deep
learning results,” in Proc. Int. Conf. on Neural Information Processing
Systems (NIPS), Long Beach, CA, Dec. 2017, pp. 1195–1204.
SHIP-v: Services for high-speed image processing. Ishikawa Senoo Lab., the Univ. of Tokyo. [Online]. Available: http://ishikawa-vision.org/ship-v/index-e.html
[25] J. Janai, F. Guney, J. Wulff, M. Black, and A. Geiger, “Slow flow:
Exploiting high-speed cameras for accurate and diverse optical flow
reference data,” in Proc. IEEE/CVF Conf. on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, Jul. 2017, pp. 1406–1416.
[26] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman, "HATS: Histograms of averaged time surfaces for robust event-based object classification," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun. 2018.
[27] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for
optimal margin classifiers,” in Proc. Annual Workshop on Computational
Learning Theory (COLT), Pittsburgh, PA, Jul. 1992, pp. 144–152.
[28] N. Mohanty, A. John, R. Manmatha, and T. M. Rath, Shape-Based Image Classification and Retrieval, Dec. 2013, vol. 31, pp. 249–267.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern
Recognition (CVPR), Las Vegas, NV, Jun. 2016, pp. 770–778.
[30] I. Ishii, T. Tatebe, Q. Gu, Y. Moriue, T. Takaki, and K. Tajima, “2000
fps real-time vision system with high-frame-rate video recording,” in
Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), Anchorage,
AK, Jun. 2010, pp. 1536 – 1541.