Toward More Accurate Heterogeneous Iris Recognition With Transformers and Capsules
Abstract. As diverse iris capture devices are deployed, the performance of iris recognition under heterogeneous conditions, e.g., cross-spectral and cross-sensor matching, degrades drastically.
However, the performance of existing hand-crafted descriptor-based and CNN-based methods is limited by the enormous domain gap under heterogeneous acquisition. To tackle this problem, we propose a
model with transformers and capsules to extract and match the domain-
invariant feature effectively and efficiently. First, we represent the fea-
tures from shallow convolution as vision tokens by spatial attention.
Then we model the high-level semantic information and fine-grained dis-
criminative features in the token-based space by a transformer encoder.
Next, a Siamese transformer decoder exploits the relationship between
pixel-space and token-based space to enhance the domain-invariant and
discriminative properties of convolution features. Finally, a 3D capsule network not only efficiently considers part-whole relationships but also increases intra-class compactness and inter-class separability, thereby improving the robustness of heterogeneous recognition. Experimental
validation results on two common datasets show that our method signif-
icantly outperforms the state-of-the-art methods.
1 Introduction
The irises of billions of citizens worldwide have been collected under near-infrared (NIR) illumination to establish national ID databases [19]. Visible (VIS) light sensors can acquire identifiable long-range images and are cheaper than NIR sensors [7]. However, drastic performance degradation is anticipated when VIS images are matched against NIR enrollments. Capsules can represent various properties of entities, and the routing mechanism can learn the part-whole relationship [3]. Motivated by this, we propose an improved 3D
capsule network as the feature matcher to support different instantiation parameters, such as specific spectra, specific sensors, and rotations, which improves robustness across unconstrained and heterogeneous domains. Furthermore, intra-class features are pulled closer together and inter-class features are pushed apart, thereby improving performance.
In summary, our main contributions are as follows:
2 Technical Details
The inputs are a pair of normalized iris images. The CNN feature extractor outputs feature maps for each normalized iris. Thereafter, a vision token generator is used to obtain vision tokens from the corresponding feature maps. Subsequently, the transformer feature extractor, comprising a transformer encoder and a Siamese transformer decoder, takes the vision tokens and generates refined feature maps. Finally, an efficient 3D capsule matcher makes the decision for iris image pairs (i.e., genuine or imposter), as sketched below.
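To make the data flow concrete, a minimal PyTorch sketch is given below. The module names, channel sizes, token count, and layer depths are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the pipeline above; sizes and depths are assumptions.
import torch
import torch.nn as nn

class VisionTokenGenerator(nn.Module):
    """Turns a (B, C, H, W) feature map into L vision tokens via spatial
    attention: one learned attention map per token, normalized over pixels."""
    def __init__(self, channels=64, num_tokens=16):
        super().__init__()
        self.attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, feat):
        maps = self.attn(feat).flatten(2).softmax(dim=-1)   # (B, L, HW)
        return maps @ feat.flatten(2).transpose(1, 2)       # (B, L, C)

class TransformerFeatureExtractor(nn.Module):
    """Encoder refines the tokens; the decoder lets pixel features query
    the token space, mirroring the Siamese transformer decoder."""
    def __init__(self, channels=64, heads=4, layers=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(channels, heads, batch_first=True)
        dec = nn.TransformerDecoderLayer(channels, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.decoder = nn.TransformerDecoder(dec, layers)

    def forward(self, feat, tokens):
        b, c, h, w = feat.shape
        memory = self.encoder(tokens)                       # token space
        queries = feat.flatten(2).transpose(1, 2)           # pixel space
        refined = self.decoder(queries, memory)             # (B, HW, C)
        return refined.transpose(1, 2).reshape(b, c, h, w)

# The refined maps of an iris pair then feed the 3D capsule matcher
# (sketched further below) for the genuine/imposter decision.
```

Because the encoder attends over L tokens rather than H×W pixels, the token-based modeling stays cheap even for 256×256 inputs.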
\[
Q^n = t^n W_q^n, \quad K^n = t^n W_k^n, \quad V^n = t^n W_v^n, \tag{1}
\]
\[
A\left(Q^n, K^n, V^n\right) = \sigma\!\left(\frac{Q^n \left(K^n\right)^{\top}}{\sqrt{d}}\right) V^n, \tag{2}
\]
where n denotes the nth layer of the transformer encoder and σ denotes the softmax function. Q^n, K^n, and V^n are the query, key, and value matrices of the self-attention. W_q^n, W_k^n, W_v^n ∈ R^{C×d} are three learnable transform matrices, t^n denotes the forwarded tokens, and d denotes the column dimension of Q^n, K^n, and V^n. MSA maps the input to different representation subspaces using multiple sets of (Q^n, K^n, V^n) in parallel; the results in these subspaces are then concatenated and passed through an additional transform matrix W_o^n to obtain the final output. MSA is formulated as follows:

\[
\mathrm{MSA}\left(t^n\right) = \mathrm{Concat}\left(A_1, A_2, \ldots, A_h\right) W_o^n, \tag{3}
\]

where A_i denotes the output of the ith self-attention head and h denotes the number of heads.
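As a concreteness check on Eqs. (1)–(3), a compact sketch follows; the head count and weight dimensions are illustrative assumptions.

```python
# Sketch of Eqs. (1)-(3); shapes and head count are assumptions.
import torch

def self_attention(t, Wq, Wk, Wv):
    """Single-head self-attention over tokens t of shape (B, L, C)."""
    Q, K, V = t @ Wq, t @ Wk, t @ Wv                 # Eq. (1)
    d = Q.shape[-1]
    A = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)
    return A @ V                                     # Eq. (2)

def msa(t, heads, Wo):
    """Multi-head self-attention: parallel heads, concatenated, projected."""
    outs = [self_attention(t, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return torch.cat(outs, dim=-1) @ Wo              # Eq. (3)

# Usage: h = 4 heads on C = 64-dim tokens, d = 16 per head.
C, d, h = 64, 16, 4
heads = [tuple(torch.randn(C, d) for _ in range(3)) for _ in range(h)]
Wo = torch.randn(h * d, C)
out = msa(torch.randn(2, 16, C), heads, Wo)          # (B, L, C)
```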
A set of capsules can represent entities, and the routing can learn their part-whole relationships. The 3D convolution obtains localized features, allowing neighboring capsules to share similar information, which significantly reduces the number of parameters compared to the dynamic routing in [15]. Subsequently, the log probability b ∈ R^{H×W×1×C×C} is initialized to 0 and passed through the 3d-softmax function, which is formulated
as follows:
\[
o_s = 3d\text{-}\mathrm{softmax}\left(b_s\right) = \frac{\exp\left(b_s\right)}{\sum_i^H \sum_j^W \sum_k^C \exp\left(b_{i,j,k,s}\right)}, \tag{6}
\]
where b_s denotes the log prior probability of the sth 3D capsule, and o_s denotes the coupling coefficients of the sth 3D capsule. The (i, j, k)th capsule in the digital capsule layer has C possible outcomes from the C 3D capsules in the previous layer. Therefore, the intermediate votes v̂ and the coupling coefficients o_s are combined by weighted summation to obtain the inactivated output capsules ϕ ∈ R^{H×W×N×C×1}, which is formulated as follows:
\[
\phi_{i,j,k} = \sum_s^{C} o_{i,j,k,s}\, \hat{v}_{i,j,k,s}, \tag{7}
\]
where ϕ_{i,j,k} denotes the (i, j, k)th inactivated output capsule in the 3D capsule layer. A modified 3D nonlinear activation function compresses the length of each capsule to between 0 and 1 to obtain the activated capsules ϕ̂ ∈ R^{H×W×N×C×1}, which is formulated as follows:
\[
\hat{\phi}_{i,j,k} = 3d\text{-}\mathrm{squash}\left(\phi_{i,j,k}\right) = \frac{\left\|\phi_{i,j,k}\right\|^2}{0.5 + \left\|\phi_{i,j,k}\right\|^2} \cdot \frac{\phi_{i,j,k}}{\left\|\phi_{i,j,k}\right\|}, \tag{8}
\]
where ϕ̂_{i,j,k} denotes the (i, j, k)th activated output capsule in the 3D capsule layer.
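For illustration, the 3d-squash of Eq. (8) takes only a few lines; placing the capsule pose along the N axis of a (B, H, W, N, C) tensor (batch dimension added, trailing singleton dropped) is an assumption.

```python
# Hedged sketch of Eq. (8); the (B, H, W, N, C) layout is an assumption.
import torch

def squash_3d(phi, dim=3, eps=1e-8):
    """Shrink each capsule's length into (0, 1) while keeping its direction."""
    sq = (phi ** 2).sum(dim=dim, keepdim=True)       # squared norm per capsule
    return (sq / (0.5 + sq)) * phi / (sq.sqrt() + eps)
```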
The agreement between v̂ and ϕ̂ is determined by their inner product: if the inner product of two capsules is positive, the coupling coefficient between them increases. The update of b is formulated as follows:
\[
b_{i,j,k,s} = b_{i,j,k,s} + \hat{\phi}_{i,j,k} \cdot \hat{v}_{i,j,k,s}, \tag{9}
\]
where b_{i,j,k,s} denotes the log probability between the sth capsule in the previous layer and the (i, j, k)th capsule in the next layer. Following [15], we use 3 routing iterations, which achieves good convergence; the loop below sketches the full procedure.
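A sketch of the full routing loop of Eqs. (6)–(9), reusing squash_3d from above, might look as follows; the vote layout (B, H, W, N, Cp, C), with Cp capsules in the previous layer (Cp = C in the text above), is an assumption.

```python
# Hedged sketch of Eqs. (6)-(9); the vote tensor layout is an assumption.
import torch

def softmax_3d(b):
    """Eq. (6): exp(b) normalized jointly over the H, W, and C axes
    of a (B, H, W, 1, Cp, C) log-prior tensor."""
    e = b.exp()
    return e / e.sum(dim=(1, 2, 5), keepdim=True)

def route_3d(votes, iters=3):
    """votes are the intermediate predictions v_hat: (B, H, W, N, Cp, C)."""
    B, H, W, N, Cp, C = votes.shape
    b = torch.zeros(B, H, W, 1, Cp, C)               # log priors start at 0
    for _ in range(iters):                           # 3 iterations as in [15]
        o = softmax_3d(b)                            # coupling coefficients
        phi = (o * votes).sum(dim=4)                 # Eq. (7): sum over Cp
        phi_hat = squash_3d(phi)                     # Eq. (8): (B, H, W, N, C)
        # Eq. (9): agreement (inner product over N) raises the log prior.
        b = b + (phi_hat.unsqueeze(4) * votes).sum(dim=3, keepdim=True)
    return phi_hat
```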
Following the mechanism in [15], the final routed capsules ϕ̂ are mapped into digital capsules g ∈ R^{2×N}, which denote the imposter and genuine capsules in the case of binary classification.
Fig. 5. Some samples from two datasets: (a) PolyU iris dataset and (b) IIIT-D CLI
dataset.
3.1 Datasets
We use a public algorithm [22] for each dataset to accurately segment, normalize,
and enhance the iris images. The normalized iris images, resized to 256×256, are taken as input by the subsequent network. The two datasets are summarized as follows: (1) The PolyU iris database contains 12540 iris images from 209 subjects, acquired at NIR and VIS wavelengths from the left and right eyes; each subject has 15 instances per eye in each spectrum. Samples are shown in Fig. 5 (a). (2) The IIIT-D CLI database contains 6570 iris images from 101 subjects, captured by two different NIR sensors. It was developed to evaluate contact-lens presentation attacks; because it contains 2163 images without contact lenses, it is also used to evaluate cross-sensor performance. Samples are shown in Fig. 5 (b).
The PolyU database is used to evaluate the cross-spectral performance. In the IIIT-D CLI database, the first 3 images per eye from the 101 subjects without contact lenses are selected as the training set, the fourth image as the validation set, and the remaining one as the test set. This generates 202 (1×1×101×2) genuine scores and 40602 (1×1×202×201) imposter scores from the IIIT-D CLI database to evaluate the cross-sensor performance.
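The score counts follow directly from this pairing and can be checked in a few lines of plain Python:

```python
# Verifying the cross-sensor score counts from the protocol above.
eyes = 101 * 2                        # 101 subjects, left and right eyes
genuine = 1 * 1 * eyes                # one test image per sensor per eye
imposter = 1 * 1 * eyes * (eyes - 1)  # each test eye against all other eyes
print(genuine, imposter)              # -> 202 40602
```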
Cross-spectral matching (PolyU):

Methods         GAR@FAR=0.1% (%)  GAR@FAR=0.01% (%)  EER (%)
IrisCode [6]    51.89             33.12              30.81
MRF [8]         64.91             47.74              24.50
CNN-SDH [19]    96.41             90.71              5.39
DenseSENet [1]  99.05             96.81              2.40
Ours            99.72             98.45              1.36

Cross-sensor matching (IIIT-D CLI):

Methods         GAR@FAR=0.1% (%)  GAR@FAR=0.01% (%)  EER (%)
ESG [5]         49.20             21.53              29.59
DA-DBNN [8]     89.92             85.04              10.02
CNN-SDH [19]    96.41             90.71              5.39
DenseSENet [1]  98.94             98.07              1.73
Ours            99.56             98.91              0.68
4 Conclusions
In this paper, we use transformers and capsules to ease the difficulty of heterogeneous iris recognition. Leveraging vision tokens, the transformer encoder and decoder model the relationship between the token-based space and the pixel space to refine low-level semantic features, and the 3D capsule routing efficiently models part-whole correlations. Extensive experiments on cross-spectral and cross-sensor scenarios show that the proposed method significantly improves the performance of heterogeneous iris recognition.
References
1. Chen, Y., Zeng, Z., Zeng, Y., Gan, H., Chen, H.: DenseSENet: more accurate and
robust cross-domain iris recognition. J. Electron. Imaging 30(6), 063024 (2021)
2. Kohli, N., Yadav, D., Vatsa, M., Singh, R.: Revisiting iris recognition with color
cosmetic contact lenses. In: 2013 International Conference on Biometrics (ICB),
pp. 1–7. IEEE (2013)
3. Kosiorek, A., Sabour, S., Teh, Y.W., Hinton, G.E.: Stacked capsule autoencoders.
In: Advances in Neural Information Processing Systems, vol. 32 (2019)
4. Liu, N., Zhang, M., Li, H., Sun, Z., Tan, T.: Deepiris: learning pairwise filter bank
for heterogeneous iris verification. Pattern Recogn. Lett. 82, 154–161 (2016)
5. Ma, L., Tan, T., Wang, Y., Zhang, D.: Personal identification based on iris texture
analysis. IEEE Trans. Pattern Anal. Mach. Intell. 25(12), 1519–1533 (2003)
6. Masek, L.: Matlab source code for a biometric identification system based on iris
patterns. http://people.csse.uwa.edu.au/pk/studentprojects/libor/ (2003)
7. Mostofa, M., Mohamadi, S., Dawson, J., Nasrabadi, N.M.: Deep GAN-based cross-
spectral cross-resolution iris recognition. IEEE Trans. Biometrics Behav. Identity
Sci. 3(4), 443–463 (2021)
8. Nalla, P.R., Kumar, A.: Toward more accurate iris recognition using cross-spectral
matching. IEEE Trans. Image Process. 26(1), 208–221 (2016)
9. Nguyen, K., Fookes, C., Ross, A., Sridharan, S.: Iris recognition with off-the-shelf
CNN features: a deep learning perspective. IEEE Access 6, 18848–18855 (2017)
10. Oktiana, M., et al.: Advances in cross-spectral iris recognition using integrated
gradientface-based normalization. IEEE Access 7, 130484–130494 (2019)
11. Oktiana, M., Saddami, K., Arnia, F., Away, Y., Munadi, K.: Improved cross-
spectral iris matching using multi-scale weberface and gabor local binary pattern.
In: 2018 10th International Conference on Electronics, Computers and Artificial
Intelligence (ECAI), pp. 1–6. IEEE (2018)
12. Pillai, J.K., Patel, V.M., Chellappa, R., Ratha, N.K.: Secure and robust iris recog-
nition using random projections and sparse representations. IEEE Trans. Pattern
Anal. Mach. Intell. 33(9), 1877–1893 (2011)
13. Pillai, J.K., Puertas, M., Chellappa, R.: Cross-sensor iris recognition through kernel
learning. IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 73–85 (2013)
14. Proença, H., Neves, J.C.: Irina: Iris recognition (even) in inaccurately segmented
data. In: CVPR, pp. 538–547 (2017)
15. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In:
Advances in Neural Information Processing Systems, vol. 30 (2017)
16. Sequeira, A., et al.: Cross-eyed: cross-spectral iris/periocular recognition database
and competition. In: 2016 International Conference of the Biometrics Special Inter-
est Group (BIOSIG), pp. 1–5. IEEE (2016)
17. Sequeira, A.F., et al.: Cross-eyed 2017: cross-spectral iris/periocular recognition
competition. In: IJCB, pp. 725–732. IEEE (2017)
18. Wang, J., et al.: Learning fine-grained image similarity with deep ranking. In:
CVPR, pp. 1386–1393 (2014)
19. Wang, K., Kumar, A.: Cross-spectral iris recognition using CNN and supervised
discrete hashing. Pattern Recogn. 86, 85–98 (2019)
20. Wei, J., Wang, Y., Li, Y., He, R., Sun, Z.: Cross-spectral iris recognition by learning
device-specific band. IEEE Trans. Circ. Syst. Video Technol. (2021)
21. Zhao, B., Tang, S., Chen, D., Bilen, H., Zhao, R.: Continual representation learning
for biometric identification. In: WACV, pp. 1198–1208 (2021)
22. Zhao, Z., Kumar, A.: An accurate iris segmentation framework under relaxed imaging
constraints using total variation model. In: ICCV, pp. 3828–3836 (2015)