Troy McDaniel
Stefano Berretti
Igor D. D. Curcio
Anup Basu (Eds.)
LNCS 12015
Smart Multimedia
Second International Conference, ICSM 2019
San Diego, CA, USA, December 16–18, 2019
Revised Selected Papers
Lecture Notes in Computer Science 12015
Founding Editors
Gerhard Goos
Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis
Cornell University, Ithaca, NY, USA
Editors
Troy McDaniel, Arizona State University, Mesa, AZ, USA
Stefano Berretti, University of Florence, Florence, Italy
Igor D. D. Curcio, Nokia Technologies, Tampere, Finland
Anup Basu, University of Alberta, Edmonton, AB, Canada
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Organization
General Chairs
S. Panchanathan Arizona State University, USA
A. Leleve INSA Lyon, France
A. Basu University of Alberta, Canada
Program Chairs
S. Berretti University of Florence, Italy
M. Daoudi IMT Lille, France
Area Chairs
N. Thirion-Moreau SeaTech, France
W. Pedrycz University of Alberta, Canada
J. Wu University of Windsor, Canada
Tutorial Chair
H. Venkateswara Arizona State University, USA
Registration Chair
Y.-P. Huang Taipei Tech, Taiwan
Publicity Chairs
J. Zhou Griffith University, Australia
L. Gu NII, Japan
A. Liew Griffith University, Australia
Q. Tian University of Texas, USA
Finance Chair
L. Ying Together Inc., USA
Submissions Chairs
C. Zhao University of Alberta, Canada
S. Mukherjee University of Alberta, Canada
Web Chair
X. Sun University of Alberta, Canada
Advisors
E. Moreau SeaTech, France
I. Cheng University of Alberta, Canada
Image Understanding
Miscellaneous
Video Applications
Multimedia in Medicine
Biometrics
Short Paper
Fused Geometry Augmented Images for Analyzing Textured Mesh

1 Introduction
input to the CNN model. Second, their geometry maps are obtained by comput-
ing the geometric attributes on the face mesh model, then displaying and saving
them as 2D images. In our method, we rather establish a correspondence between
geometric descriptors computed on 3D triangular facets of the mesh and pixels
in the texture image. Specifically, in our case, geometric attributes are computed
on the mesh at each triangle facet, and then mapped to their corresponding pix-
els in the texture image using the newly proposed scheme. Third, our method
is computationally more efficient as we compute the geometric attributes on a
compressed facial mesh surface. With respect to the conformal mapping method
proposed by Kittler et al. [6], our method presents three fundamental differences.
First, in [6], the mapping is performed fitting a generic 3D morphable face model
(3DMM) to the 2D face image, whereas in our solution we map each face model
of a subject to its corresponding 2D face image. Second, the geometric informa-
tion mapped from the 3DMM to the 2D image is given by the point coordinates,
which are just an estimate of the actual values, obtained from the 3DMM by
landmark-based model fitting. In our work, instead, we map the actual point
coordinates, in addition to other varieties of shape descriptors, from each face
model to its corresponding image. Third, in terms of scope, while the work in [6]
deals with 2D face recognition, our work extends to any 3D textured mesh sur-
face. In [21], similarly to [6], a 3DMM is fit to a 2D image, but for face image
frontalization purposes. The face alignment method employs a cascaded CNN
to handle the self-occlusions. Finally, in [16], a spherical parameterization of the
mesh manifold is performed allowing a two-dimensional structure that can be
treated by conventional CNNs.
In the next section, we elaborate on our proposed scheme. Afterwards, in
Sect. 3, we describe the experiments validating our proposed method, then we
terminate by concluding remarks in Sect. 4.
As shown in the block diagram of Fig. 1, our proposed framework jointly exploits
the 3D shape and 2D texture information. First, we perform a down-sampling
on the facial surface derived from the full 3D face model. Subsequently, we com-
pute the new texture mapping transformation T between the texture image and
the compressed mesh. Afterward, we extract a group of local shape descriptors
that include mean and Gaussian curvature (H, K), shape index (SI), and the
local-depth (LD). These local descriptors can be computed efficiently and com-
plement each other by encoding different shape information. We also consider
the gray-level (GL) of the original 2D texture image associated with the mesh. A
novel scheme is then proposed to map the extracted geometric descriptors onto
2D textured images, using the inverse of the texture mapping transform T . We
dubbed these images Geometry Augmented Images (GAIs). It is relevant to note
here that we assume the existence of a 2D texture image, which is in correspon-
dence with the triangulated mesh via standard texture-mapping. In this respect,
our solution can also be regarded as a multi-modal 2D+3D solution, where the
2D texture data is required to enable the mapping, and also constitutes an addi-
tional feature that can be early fused with the 3D data in a straightforward way.
Here, triplets of GAIs originating from the 3D-to-2D descriptor mapping are then
fused to generate multiple three-channel images. We dubbed these images Fused
Geometry Augmented Images (FGAIs). The FGAIs are used as input to a CNN
to generate, through their output activations, a highly discriminative
feature representation to be used in the subsequent classification stage. To the
best of our knowledge, we are the first to propose such 2D and 3D information
fusion for textured-geometric data analysis.
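To make the descriptor computation and the three-channel fusion more concrete, the following Python sketch shows one possible way to derive a shape-index map from principal curvatures and to stack three descriptor maps into an FGAI-like image. This is our own illustration under stated assumptions (NumPy, the standard shape-index formula, min-max normalization to 8-bit channels); it is not the authors' implementation.

```python
import numpy as np

def shape_index(k1, k2):
    """Shape index from principal curvatures k1 >= k2 (standard formula; the
    paper's exact normalization is not given, so this is an assumption)."""
    return 0.5 - (1.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)

def normalize_to_uint8(x):
    """Rescale a descriptor map to [0, 255] so it can act as an image channel."""
    x = (x - x.min()) / (x.max() - x.min() + 1e-12)
    return (255 * x).astype(np.uint8)

def fuse_gais(gai_a, gai_b, gai_c):
    """Stack three single-channel GAIs into one three-channel FGAI."""
    return np.dstack([normalize_to_uint8(gai_a),
                      normalize_to_uint8(gai_b),
                      normalize_to_uint8(gai_c)])

# Example: H, GL and SI maps of the same spatial size -> an H-GL-SI FGAI.
H  = np.random.rand(224, 224)                 # mean-curvature GAI (placeholder data)
GL = np.random.rand(224, 224)                 # gray-level image   (placeholder data)
SI = shape_index(np.random.rand(224, 224), np.random.rand(224, 224) - 1.0)
fgai = fuse_gais(H, GL, SI)                   # shape (224, 224, 3), ready for a CNN
```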
Given a triangular mesh originating from a surface, we project the mesh
vertices onto the plane spanned by the main orientation of the surface. This
yields the depth function z = f (x, y) defined on the scattered set of vertices. The
function is then interpolated and re-computed over a regular grid constructed by
a uniform down-sampling of order k. The 2D Delaunay triangulation computed
over the achieved regular points produces a set of uniform triangular facets. We
complete them with the interpolated z coordinates to obtain a 3D down-sampled
regular mesh (Fig. 1(a)). A sub-sampling of step k produces approximately a
compression ratio of k for both the facets and the vertices. As the original 3D
mesh has an associated texture image, the mapping between the mesh and the
texture image is established by the usual texture mapping approach, where the
vertices of each triangle (facet) on the mesh are mapped to three points (i.e.,
pixels) in the image. However, the projection and the re-sampling step break
such correspondence. Therefore, in the next stage, we reconstruct a new texture
mapping between the 2D face image and the newly re-sampled mesh vertices. For
each vertex in the re-sampled mesh, we use its nearest neighbor in the original
mesh, computed in the previous stage, to obtain the corresponding pixel in the
original image via the texture mapping information in the original face scan.
This newly obtained texture mapping transformation (T in Fig. 1) between the
original image and the new re-sampled facial mesh allows us to map descriptors
computed on the facial surface, at any given mesh resolution, to a 2D geometry-
augmented image (GAI), which encodes the surface descriptor as pixel values
in a 2D image. In order to keep consistent correspondence between the texture
information and the geometric descriptors, we down-sample the original texture
image to bring it to the same resolution as its GAI counterpart. We do this as
follows: we take the vertices of each facet in the down-sampled facial mesh, and
we compute, using T⁻¹, their corresponding pixels in the original image. These
three pixels form a triangle in the original image; we assign to all pixels that are
confined in that triangle the average of their gray-level values (Fig. 2(a)(b)(c)).
Figure 2(d) depicts the GAIs obtained by mapping the LD descriptor computed
on the original mesh and its down-sampled counterpart, at a compression ratio
of 3.5.
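The resampling and texture-mapping reconstruction described above could be sketched as follows. The code below is a rough Python/SciPy approximation under our own assumptions (linear interpolation on the regular grid, a k-d tree for the nearest-neighbor search, and names such as downsample_mesh and step_k that do not appear in the paper); it is not the authors' code.

```python
import numpy as np
from scipy.interpolate import griddata
from scipy.spatial import Delaunay, cKDTree

def downsample_mesh(vertices, uv, step_k):
    """Sketch of the resampling stage and of the new texture mapping T.

    vertices : (N, 3) array of x, y, z after projection onto the main plane
    uv       : (N, 2) texture coordinates of the original mesh vertices
    step_k   : grid spacing controlling the compression ratio
    """
    x, y, z = vertices[:, 0], vertices[:, 1], vertices[:, 2]

    # Regular grid over the footprint of the projected vertices.
    gx = np.arange(x.min(), x.max(), step_k)
    gy = np.arange(y.min(), y.max(), step_k)
    GX, GY = np.meshgrid(gx, gy)

    # Interpolate the depth function z = f(x, y) on the regular grid.
    GZ = griddata((x, y), z, (GX, GY), method='linear')

    pts2d = np.column_stack([GX.ravel(), GY.ravel()])
    keep = ~np.isnan(GZ.ravel())
    pts2d, gz = pts2d[keep], GZ.ravel()[keep]

    # 2D Delaunay triangulation of the regular points -> uniform facets.
    tri = Delaunay(pts2d)
    new_vertices = np.column_stack([pts2d, gz])

    # New texture mapping T: each resampled vertex inherits the texture (UV)
    # coordinates of its nearest neighbor in the original mesh.
    nn = cKDTree(vertices[:, :2]).query(pts2d)[1]
    new_uv = uv[nn]
    return new_vertices, tri.simplices, new_uv
```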
After mapping the extracted local geometric descriptors onto the 2D textured
images, we encode the shape information with the help of a compact 2D matrix.
As we extract four local geometric descriptors, the shape information is repre-
sented by their corresponding three 2D GAIs, to which we add the gray level
Fig. 2. (a) A triangular facet in a down-sampled face mesh; (b) The texture mapping
associates the vertices of the facet in (a) to three pixels in the texture image forming
a triangle; (c) All the pixels confined in the triangle in (b) are assigned their mean
gray level; (d) GAIs obtained by 2D mapping of the LD descriptor computed on the
original mesh (top) and its down-sampled version (bottom); (e) FGAIs generated for
a sample face with the combination H-LD-SI, H-GL-SI, K-H-SI, GL-LD-SI, K-H-LD;
(f) a textured lesion surface and its FGAI images GL-SI-LD, H-GL-LD, H-K-GL, and
K-GL-SI.
Fig. 3. Bosphorus dataset: AU detection measured as AuC for each of the 24 AUs.
Results for 12 AUs are reported in the upper plot, and for the rest in the lower one
(notation: LF and UF stand, respectively, for lower- and upper-face; R and L indicate the
left and right part of the face). Comparison includes the following state-of-the-art meth-
ods: 3DLBPs [4], LNBPs [14], LABPs [13], LDPQs [10], LAPQs [10], LDGBPs [20],
LAGBPs [20], LDMBPs [17], LAMBPs [17].
3 Experiments
We demonstrated the effectiveness of our representation paradigm on three tasks:
facial expression recognition, action unit (AU) detection, and skin cancer classifi-
cation, using respectively, the Bosphorus 3D face database [15], the Binghamton
University 4D Facial Expression database (BU-4DFE) [19], and the Dermofit
dataset¹. First, we conducted an ablation study that aims to
compare the performance of the different variants in terms of the FGAI combi-
nation, the CNN layer used to derive the output features, and the down-sampled
data versus the original data. Because of the limited space, we just report the best
variant obtained for each dataset, namely, AlexNet-Conv4 with H-GL-SI, AlexNet-Conv4
with H-LD-SI, and Vgg-vd16-Conv5_3 with H-GL-LD for Bosphorus, BU-4DFE and
Dermofit, respectively. Note that for Dermofit, we augmented and normalized
the data using the same protocol reported in [5].
¹ https://licensing.eri.ed.ac.uk/i/software/dermofit-image-library.html.
For AU classification, we compared our results with a number of existing
methods, adopting the same protocol (i.e., 5-fold cross validation, 3,838 AU
scans of 105 subjects). Figure 3 shows the AuC results for each AU individually,
obtained with our proposed method, using the combination (H-GL-SI), together
with the state-of-the-art methods. Computing the average AuC on all the AUs,
it can be observed that our proposed feature representation scheme achieves
the highest score of 99.79%, outperforming the current state-of-the-art AuC of
97.2%, which is scored by the Depth Gabor Binary Patterns (LDGBP) feature
proposed in [13].
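A minimal sketch of this evaluation pipeline is given below: FGAIs are passed through a pretrained AlexNet truncated around its fourth convolutional layer, the activations serve as features, and a linear SVM scored by AuC handles one action unit. The truncation index, the use of torchvision and scikit-learn, and the LinearSVC classifier are our assumptions; the paper only states that AlexNet-Conv4 features were used.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

# Pretrained AlexNet; we truncate its convolutional stack to approximate the
# "Conv4" activations used in the paper (the exact cut index is our guess).
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
conv4 = torch.nn.Sequential(*list(alexnet.features.children())[:10])

def fgai_features(fgai_batch):
    """fgai_batch: float tensor of shape (B, 3, 224, 224), values in [0, 1]."""
    with torch.no_grad():
        act = conv4(fgai_batch)                  # (B, C, H, W) activations
    return act.flatten(start_dim=1).numpy()      # one feature vector per FGAI

# Hypothetical per-AU evaluation with placeholder data and binary labels.
X_train = fgai_features(torch.rand(32, 3, 224, 224))
X_test  = fgai_features(torch.rand(16, 3, 224, 224))
y_train, y_test = np.tile([0, 1], 16), np.tile([0, 1], 8)

clf = LinearSVC().fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.decision_function(X_test))
print(f"AuC for this AU: {auc:.3f}")
```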
For facial expression recognition, we adopted the same experimentation pro-
tocol reported in the recent state-of-the-art methods [7–9,18] (10-fold cross valida-
tion, expression scans collected from 60 subjects, randomly selected from 65
individuals). Results obtained with the H-GL-SI are reported in Table 1. For
both original and compressed face scans our method outperforms the current
state-of-the-art solutions by a significant margin of 18%/13% and 17%/12% for
the learned features of the AlexNet and the Vgg-vd16, respectively.
Table 1. Comparison with the state-of-the-art on the Bosphorus dataset using the
following configuration: FGAI H-GL-SI.
Method Accuracy
MS-LNPs [15] 75.83
GSR [16] 77.50
iPar-CLR [17] 79.72
DF-CNN svm [1] 80.28
Original-AlexNet 98.27
Compressed-AlexNet 93.29
Original-Vgg-vd16 98.16
Compressed-Vgg-vd16 92.38
Table 2. Comparison with the state-of-the-art on the BU-4DFE dataset using the
following configuration: FGAI H-LD-SI.
Method Accuracy
Abd El Meguid et al. [19] 73.10
Dapogny et al. [18]-1 72.80
Dapogny et al. [18]-2 74.00
Compressed-AlexNet 91.34
Compressed-Vgg-vd16 89.81
Table 3. Comparison with the state-of-the-art on the Dermofit dataset using the
following configuration: FGAI H-GL-LD.
4 Conclusion
We proposed a novel scheme, which maps the geometric information from 3D
meshes onto 2D textured images, providing a mechanism to simultaneously rep-
resent geometric attributes from the mesh model alongside the texture
information. The proposed mapped geometric information can be employed in
conjunction with a CNN for jointly learning 2D and 3D information fusion at
the data level, while providing a highly discriminative representation. Compared
to existing learned feature representation schemes for 3D data, the proposed
method is both memory and computation efficient as it does not resort to expen-
sive tensor-based or multi-view inputs. The effectiveness of our representation
has been demonstrated on three different surface classification tasks, showing a
clear boost in performance when compared to competitive methods.
References
1. Abd El Meguid, M.K., et al.: Fully automated recognition of spontaneous facial
expressions in videos using random forest classifiers. IEEE Trans. Affect. Comput.
5(2), 141–154 (2014). https://doi.org/10.1109/TAFFC.2014.2317711
2. Ballerini, L., et al.: A color and texture based hierarchical K-NN approach to the
classification of non-melanoma skin lesions. In: Celebi, M., Schaefer, G. (eds.) Color
Medical Image Analysis, vol. 6, pp. 63–86. Springer, Dordrecht (2015). https://doi.org/10.1007/978-94-007-5389-1_4
3. Dapogny, A., et al.: Investigating deep neural forests for facial expression recogni-
tion. In: IEEE International Conference on Automatic Face and Gesture Recogni-
tion (FG), pp. 629–633, May 2018
4. Huang, Y., et al.: Combining statistics of geometrical and correlative features for
3D face recognition (2006)
5. Kawahara, J., et al.: Deep features to classify skin lesions. In: International Sym-
posium on Biomedical Imaging (2016)
6. Kittler, J., et al.: Conformal mapping of a 3D face representation onto a 2D image
for CNN based face recognition. In: International Conference on Biometrics, pp.
146–155 (2018)
7. Li, H., et al.: 3D facial expression recognition via multiple kernel learning of multi-
scale local normal patterns. In: ICPR, pp. 2577–2580 (2012)
8. Li, H., et al.: An efficient multimodal 2D + 3D feature-based approach to automatic
facial expression recognition. Comput. Vis. Image Underst. 140(Suppl. C), 83–92
(2015)
9. Li, H., et al.: Multimodal 2D+3D facial expression recognition with deep fusion
convolutional neural network. IEEE Trans. Multimed. 19(12), 2816–2831 (2017).
https://doi.org/10.1109/TMM.2017.2713408
10. Ojansivu, V., Heikkilä, J.: Blur insensitive texture classification using local phase
quantization. In: Elmoataz, A., Lezoray, O., Nouboud, F., Mammass, D. (eds.)
ICISP 2008. LNCS, vol. 5099, pp. 236–243. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-69905-7_27
11. Razavian, S., et al.: CNN features off-the-shelf: an astounding baseline for recogni-
tion. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), pp. 806–813 (2014)
12. Ross, A., Jain, A.K.: Information fusion in biometrics. Pattern Recogn. Lett. 24,
2115–2125 (2003)
13. Sandbach, G., et al.: Binary pattern analysis for 3D facial action unit detection.
In: British Machine Vision Conference (BMVC), pp. 119.1–119.12. BMVA Press,
September 2012
14. Sandbach, G., et al.: Local normal binary patterns for 3D facial action unit detec-
tion. In: IEEE International Conference on Image Processing (ICIP), pp. 1813–
1816, September 2012
15. Savran, A., Alyüz, N., Dibeklioğlu, H., Çeliktutan, O., Gökberk, B., Sankur, B.,
Akarun, L.: Bosphorus database for 3D face analysis. In: Schouten, B., Juul,
N.C., Drygajlo, A., Tistarelli, M. (eds.) BioID 2008. LNCS, vol. 5372, pp. 47–56.
Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-89991-4_6
16. Sinha, A., Bai, J., Ramani, K.: Deep learning 3D shape surfaces using geometry
images. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS,
vol. 9910, pp. 223–240. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_14
17. Yang, M., et al.: Monogenic binary pattern (MBP): a novel feature extraction and
representation model for face recognition. In: International Conference on Pattern
Recognition, pp. 2680–2683 (2010)
18. Yang, X., et al.: Automatic 3D facial expression recognition using geometric scat-
tering representation. In: IEEE International Conference and Workshops on Auto-
matic Face and Gesture Recognition (FG), vol. 1, pp. 1–6, May 2015
19. Yin, L., et al.: A high-resolution 3D dynamic facial expression database. In: IEEE
Conference on Face and Gesture Recognition (FG), pp. 1–6, September 2008
20. Zhang, W., et al.: Local Gabor binary pattern histogram sequence (LGBPHS):
a novel non-statistical model for face representation and recognition. In: IEEE
International Conference on Computer Vision (ICCV), vol. 1, pp. 786–791, October
2005
21. Zhu, X., et al.: Face alignment across large poses: a 3D solution. In: IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), pp. 146–155, June
2016
Textureless Object Recognition Using an RGB-D Sensor
1 Introduction
paradigm, generally consists of two well-known stages called training and evalu-
ation. In the training stage, a number of samples are collected and then processed
to determine a set of characteristics that can accurately and uniquely represent
each object. Once the feature set is obtained, one can proceed to train a model
that separates the features and is therefore able to distinguish between
the different classes of objects. The major challenge is designing a recognition
system with real-time capabilities, robust against occlusions, illumination vari-
ations, and noisy environments. Industrial objects are usually textureless (e.g.,
made of aluminum material), small and homogeneous, and therefore, extracting
a representative and unique feature set is even more challenging. That is why
current object recognition algorithms often fail to recognize industrial parts.
Fig. 1. An illustration of our object recognition framework: (Left) A 48 × 48 cm² box,
where multiple objects are placed on the electric motorized 360° rotating turntable.
Three different objects are shown in this example. (Center) We used the Intel RealSense
D415 sensor and its field of view (FOV, H × V × D, with a range of 69.4° × 42.5° × 77°
± 3°) to capture the entire experimental scene. Four different views, with rotations of
0°, 25°, 180°, and 315°, of an example object are shown. (Right) Our object recognition
processing pipeline.
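For readers who want to reproduce a similar capture setup, the snippet below shows a minimal way to grab an aligned RGB-D frame from an Intel RealSense D415 with the pyrealsense2 library. The stream resolutions and frame rate are our own assumptions, not the settings used in the experiments.

```python
import numpy as np
import pyrealsense2 as rs

# Minimal capture of one aligned RGB-D frame from a RealSense D415.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

align = rs.align(rs.stream.color)   # align the depth map to the color frame
try:
    frames = align.process(pipeline.wait_for_frames())
    depth = np.asanyarray(frames.get_depth_frame().get_data())  # uint16 depth map
    color = np.asanyarray(frames.get_color_frame().get_data())  # BGR color image
    print(depth.shape, color.shape)
finally:
    pipeline.stop()
```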
2 Related Work
The original object detection methods were based on shape descriptors. Several
shape descriptors have been proposed by researchers. They can be categorized
into two groups: region based and contour based descriptors depending on pixel
values or curvature information [11].
Shape matrix [13], Zernike moments [6], convex hull [10], moment-based
descriptors [24], and medial axis features [19] are some region based shape descrip-
tors that have been proposed in the literature. Moment based descriptors are
very popular among them. It has been shown that they are usually concise,
computationally efficient, robust, and easy to compute, as well as invariant to scaling,
rotation, and translation of the object. However, it is difficult to correlate higher
order moments with the shape’s salient features due to the global nature of these
methods [6]. Medial axis features are robust to noise and reduce information
redundancy, but they are computationally expensive.
Some of the proposed contour based shape descriptors are harmonic shape
representation [31], Fourier descriptors [46], Wavelet descriptors [9], chain
code [14], curvature scale space (CSC) descriptors [29], spectral descriptors [8]
and boundary moments [42]. Fourier and Wavelet descriptors are robust to
noise in the spectral domain, while chain code is sensitive to noise. CSC descrip-
tors capture the maximal contour known as the object’s CSC contour. This
approach is robust to noise, changes in scale and orientation of objects, but does
not always produce results consistent with the human visual system [1,45].
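As a simple illustration of the two descriptor families discussed above, the sketch below computes a region-based descriptor (the seven Hu moment invariants) and a contour-based descriptor (low-frequency Fourier coefficients of the boundary) from a binary mask using OpenCV and NumPy. This is our own example, not an implementation of any of the cited methods, and the truncation length is arbitrary.

```python
import cv2
import numpy as np

def hu_moments(binary_mask):
    """Region-based descriptor: the seven Hu moment invariants of a binary mask."""
    m = cv2.moments(binary_mask, binaryImage=True)
    return cv2.HuMoments(m).ravel()

def fourier_descriptors(contour, n_coeffs=16):
    """Contour-based descriptor: low-frequency Fourier coefficients of the boundary.

    contour: (N, 1, 2) array as returned by cv2.findContours.
    Dropping the DC term and dividing by the first coefficient's magnitude
    gives rough translation/scale invariance.
    """
    pts = contour[:, 0, :].astype(np.float64)
    z = pts[:, 0] + 1j * pts[:, 1]          # complex boundary signal
    coeffs = np.fft.fft(z)
    coeffs = coeffs[1:n_coeffs + 1]         # discard the DC component
    return np.abs(coeffs) / (np.abs(coeffs[0]) + 1e-12)

# Example on a synthetic mask (a filled circle).
mask = np.zeros((128, 128), np.uint8)
cv2.circle(mask, (64, 64), 40, 255, -1)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
print(hu_moments(mask), fourier_descriptors(contours[0]))
```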
Visual recognition is another key factor used to improve the accuracy of
object detection and recognition. Techniques such as histogram of oriented
gradient (HOG) [25], local ternary patterns (LTP) [7], local binary patterns
(LBP) [36], scale invariant feature transform (SIFT) [12], binary robust indepen-
dent elementary features (BRIEF) [5], speeded-up robust features (SURF) [43], and
oriented FAST and rotated BRIEF (ORB) [40] have been proposed over the years.
However, these techniques still have shortcomings. For example, various images
of a specific object may appear profoundly different due to changes in the ori-
entation of the object and lighting conditions [28]. SIFT and ORB are robust
descriptors that facilitate object recognition. Although SIFT is slower than
ORB, it is more stable. Both SIFT and ORB are suitable for real-time applica-
tions. However, these methods rely on textural patterns, and if the objects do not
have enough textural information, like most of the industrial robotic automation
and machine shop applications, they will fail to detect objects accurately.
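The failure mode described above is easy to observe in practice: on textureless parts, keypoint detectors return very few descriptors and ratio-test matching collapses. The following OpenCV sketch, with hypothetical file names and a standard 0.75 ratio threshold chosen by us, illustrates SIFT and ORB matching between two views of the same object.

```python
import cv2

def match_keypoints(img_a, img_b, use_sift=True):
    """Detect, describe and match local features between two grayscale images.

    Returns the number of matches surviving Lowe's ratio test; on textureless
    industrial parts this count tends to be very low.
    """
    if use_sift:
        det, norm = cv2.SIFT_create(), cv2.NORM_L2
    else:
        det, norm = cv2.ORB_create(nfeatures=1000), cv2.NORM_HAMMING

    kp_a, des_a = det.detectAndCompute(img_a, None)
    kp_b, des_b = det.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0

    matches = cv2.BFMatcher(norm).knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good)

# Usage (hypothetical file names for two turntable views):
# a = cv2.imread("view_0.png", cv2.IMREAD_GRAYSCALE)
# b = cv2.imread("view_25.png", cv2.IMREAD_GRAYSCALE)
# print(match_keypoints(a, b, use_sift=True), match_keypoints(a, b, use_sift=False))
```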
In order to improve the accuracy of traditional object detection methods, researchers
later combined machine learning techniques with shape descriptors [18,39].
Many machine learning algorithms, such as support vector machine (SVM) [22],
decision trees (DT) [35], linear discriminant analysis (LDA) [2], Naive Bayes
(NB) [15], random forest (RF) [4], learning vector quantization (LVQ) [27], k-
means [20] and k-medians [26] have been tested and showed their efficiencies in
solving classification problems.
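A typical way to pair shape descriptors with such classifiers is sketched below using scikit-learn; the descriptor vectors, feature dimensionality, and classifier hyperparameters are placeholders of our own choosing, not values from any cited work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one shape-descriptor vector per object view (e.g., Hu moments and Fourier
# descriptors concatenated); y: object class labels. Both are placeholders here.
X = np.random.rand(200, 23)
y = np.random.randint(0, 3, 200)

classifiers = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```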
Other object recognition approaches, like LINE2D [23] and its variants,
examine a template of sparse points across a gradient map. These approaches