Toward More Accurate Heterogeneous Iris Recognition with Transformers and Capsules

Zhiyong Zhou1,2, Yuanning Liu1,2, Xiaodong Zhu1,2(B), Shuai Liu1,2, Shaoqiang Zhang1,2, and Zhen Liu3

1 College of Computer Science and Technology, Jilin University, Changchun, China
  zhuxd@jlu.edu.cn
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Changchun, China
3 Graduate School of Engineering, Nagasaki Institute of Applied Science, Nagasaki, Japan

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
D.-T. Dang-Nguyen et al. (Eds.): MMM 2023, LNCS 13833, pp. 28–40, 2023.
https://doi.org/10.1007/978-3-031-27077-2_3

Abstract. As diverse iris capture devices have been deployed, the per-
formance of iris recognition under heterogeneous conditions, e.g., cross-
spectral matching and cross-sensor matching, has drastically degraded.
Nevertheless, the performance of existing manual descriptor-based meth-
ods and CNN-based methods is limited due to the enormous domain gap
under heterogeneous acquisition. To tackle this problem, we propose a
model with transformers and capsules to extract and match domain-
invariant features effectively and efficiently. First, we represent the fea-
tures from shallow convolution as vision tokens by spatial attention.
Then we model the high-level semantic information and fine-grained dis-
criminative features in the token-based space by a transformer encoder.
Next, a Siamese transformer decoder exploits the relationship between
pixel-space and token-based space to enhance the domain-invariant and
discriminative properties of convolution features. Finally, a 3D capsule
network not only efficiently considers part-whole relationships but also
increases intra-class compactness and inter-class separability. Therefore,
it improves the robustness of heterogeneous recognition. Experimental
validation results on two common datasets show that our method signif-
icantly outperforms the state-of-the-art methods.

Keywords: Heterogeneous iris recognition · Transformers · Capsules

1 Introduction
The irises of billions of citizens worldwide have been collected under near-
infrared (NIR) illumination to establish national ID databases [19]. Visible (VIS)
light sensors can acquire identifiable long-range images and are cheaper than NIR
sensors [7]. However, drastic performance degradation is anticipated when VIS


iris images captured by popular high-resolution smartphones and surveillance devices
are matched with the NIR irises stored in the national ID database [8]. In addition,
to reduce re-enrollment costs and time in large-scale applications, enrollment
and authentication are performed through different sensors, which also reduces
performance [13]. In cross-spectral and cross-sensor heterogeneous iris recognition,
the disparities in imaging cause a large gap in the domain distribution between iris
images. Thus, intra-class compactness is weakened, significantly affecting
recognition performance [20]. Cross-spectral and cross-sensor recognition is a
well-known challenging task, and related competitions have been held at
BIOSIG 2016 [16] and IJCB 2017 [17].
Traditional manual descriptor-based methods for reducing the domain gap in
heterogeneous irises can be roughly classified into integrated and non-integrated
methods. Oktiana et al. [11] proposed a multiscale Weberface integrated with a
Gabor local binary pattern. [10] integrated Gradientface-based normalization with
texture descriptors. Integrated methods overcome domain variance better than any
single descriptor alone; however, they introduce redundant prior knowledge and
increase computational complexity. Non-integrated methods have gradually been
developed to reduce the domain gap significantly. [13] aligned the domain
distributions by a kernel learning framework. [12] learned a common sparse
dictionary representation to mitigate the distribution discrepancy. Nevertheless,
such traditional methods require extensive prior knowledge and have a low
performance ceiling. CNNs have been successfully applied to iris recognition tasks
and surpass traditional methods [14]. [9] extracted domain-invariant features using
pre-trained off-the-shelf CNNs. [4] learned paired CNNs to adapt heterogeneous iris
features. [19] applied a Siamese network and a triplet network to extract features
and proposed a supervised discrete hash to improve the matching rate. However,
intra-class variations in iris images are more subtle than in natural images [21].
For heterogeneous iris recognition, as in fine-grained image recognition tasks [18],
pure CNNs are inherently limited by the size of their receptive fields. Thus, pure
CNNs model limited global contextual information and struggle to prevent
performance degradation. Generating more fine-grained domain-invariant
discriminative features is still a challenging task.
To address the above challenge, following [19], we propose a new feature
extraction and matching architecture with transformers and capsules. We introduce
a transformer that models rich global contextual information in a token-based
space, using a series of vision tokens to represent subtle intra-class variances at a
high level. Further, a Siamese transformer decoder establishes links between the
token-based space represented by the transformer encoder and the pixel space
represented by the CNN to refine the CNN features. Therefore, the receptive field
is greatly enlarged to obtain finer-grained domain-invariant discriminative
features. In the feature matching phase, [19] successfully utilized multilayer
perceptrons; however, such matchers lose spatial information and do not consider
the part-whole relationships of the features. Capsule networks [15] can be used as
an advanced alternative to CNNs: different capsules can represent various
properties of entities, and the routing mechanism can learn part-whole
relationships [3]. Motivated by this, we propose an improved 3D capsule network
as the feature matcher to support different instantiation parameters such as
specific spectra, specific sensors, rotations, etc., which improves robustness under
unconstrained conditions and across domains. Furthermore, intra-class features are
pulled closer to each other and inter-class features are pushed apart, thereby
improving the performance.
In summary, our main contributions are as follows:

(1) A transformer is proposed to model rich contextual spatial relationships in the
token-based space. It enhances the domain-invariant properties of features and
obtains finer-grained discriminative features for heterogeneous iris recognition.
(2) A modified 3D capsule network is proposed as a feature matcher to represent
the part-whole relationships between features while ignoring inter-domain
differences, which improves the robustness of heterogeneous iris recognition.
(3) Our approach yields state-of-the-art results on two commonly used datasets in
this community.
(4) To our knowledge, this work pioneers the use of transformers and capsules to
improve the performance of heterogeneous iris recognition.

Fig. 1. Illustration of the overall framework: preprocessing, twin CNN feature extractor, vision token generator, transformer feature extractor, and 3D capsule matcher.

2 Technical Details

The overall framework of our proposed method is summarized in Fig. 1.
Heterogeneous iris pairs from two different domains (i.e., NIR against VIS, or one
sensor against a different sensor) are preprocessed to obtain normalized

iris images. The CNN feature extractor outputs feature maps for each normal-
ized iris. Thereafter, a vision token generator is used to obtain vision tokens
from corresponding feature maps. Subsequently, the transformer feature extrac-
tor comprising a transformer encoder and a Siamese transformer decoder takes
vision tokens to generate refined feature maps. Finally, an efficient 3D capsule
matcher makes the decision for iris image pairs (i.e., genuine or imposter).

2.1 Vision Token Generator


The vision token generator is illustrated in Fig. 2, where a spatial attention
mechanism generates attention maps that interact with the CNN feature maps to
produce vision tokens. Let $d_1$ and $d_2$ be two different domains (i.e., spectra
or sensors). Let $f_{d_1}, f_{d_2} \in \mathbb{R}^{H \times W \times C}$ be the feature maps from the CNN feature
extractor corresponding to the $d_1$ and $d_2$ domains, where $H$, $W$, and $C$ denote the
height, width, and channels of the feature maps. Let $t_{d_1}, t_{d_2} \in \mathbb{R}^{L \times C}$ denote the
tokens from the vision token generator, where $L$ denotes the length of the tokens,
i.e., the number of elements in the tokens. A $1 \times 1$ convolution with a softmax
function over the $H \times W$ dimension takes $f_{d_i}$ ($i = 1, 2$) as input to obtain the
spatial attention maps $s_{d_i} \in \mathbb{R}^{H \times W \times L}$. Subsequently, a weighted sum between
$f_{d_i}$ and $s_{d_i}$ is used to obtain the vision tokens $t_{d_i} \in \mathbb{R}^{L \times C}$.
The vision token generator is formulated as follows:
$$t_{d_i} = (s_{d_i})^T f_{d_i} = \big(\sigma(\psi(f_{d_i}; W))\big)^T f_{d_i},$$  (1)

where $\psi(\cdot)$ denotes the $1 \times 1$ convolution, $W \in \mathbb{R}^{1 \times 1 \times C \times L}$ indicates the weight
parameter of the $1 \times 1$ convolution, $\sigma(\cdot)$ denotes the softmax function operating
on the $H \times W$ dimension, and $T$ indicates the transpose operation.
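For illustration, a minimal PyTorch sketch of Eq. (1) is given below. It assumes channel-first tensors and shares one generator between the two domains in a Siamese fashion; the module and variable names, as well as the values of L, H, and W in the usage line, are ours and not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class VisionTokenGenerator(nn.Module):
    """Sketch of Eq. (1): spatial-attention pooling of CNN features into L tokens."""

    def __init__(self, channels: int, num_tokens: int):
        super().__init__()
        # 1x1 convolution psi(.) producing one attention map per token.
        self.psi = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W) feature maps from the CNN feature extractor.
        s = self.psi(f).flatten(2)                            # (B, L, H*W) attention logits
        s = torch.softmax(s, dim=-1)                          # softmax over the H*W positions
        tokens = torch.bmm(s, f.flatten(2).transpose(1, 2))   # weighted sum -> (B, L, C)
        return tokens

# Usage: both domains pass through the same generator (shared weights assumed).
gen = VisionTokenGenerator(channels=32, num_tokens=16)
t_d1 = gen(torch.randn(2, 32, 64, 64))   # -> (2, 16, 32)
```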

Fig. 2. Illustration of the token generator.

Fig. 3. Illustration of the transformer feature extractor.

2.2 Transformer Encoder

As shown in Fig. 3(a), the transformer encoder has $N_e$ layers consisting of
residual connections, LayerNorm, multi-head self-attention (MSA), and a multilayer
perceptron (MLP). $t_{d_1}$ and $t_{d_2}$ are concatenated as $t \in \mathbb{R}^{2L \times C}$. Furthermore, $t$
plus a learnable positional coding $W_p \in \mathbb{R}^{2L \times C}$ is fed to the transformer encoder
to output $t_e \in \mathbb{R}^{2L \times C}$. An attention head in MSA can be formulated as follows:

$$Q^n = t^n W_q^n, \quad K^n = t^n W_k^n, \quad V^n = t^n W_v^n,$$
$$A(Q^n, K^n, V^n) = \sigma\!\left(\frac{Q^n (K^n)^T}{\sqrt{d}}\right) V^n,$$  (2)

where $n$ denotes the $n$th layer of the transformer encoder, $Q^n$, $K^n$, and $V^n$ are the
query, key, and value matrices of the self-attention, $W_q^n, W_k^n, W_v^n \in \mathbb{R}^{C \times d}$ are
three learnable transform matrices, $t^n$ denotes the forwarded tokens, and $d$
denotes the column dimension of $Q^n$, $K^n$, and $V^n$. MSA maps the input to different
representation subspaces using multiple sets of $(Q^n, K^n, V^n)$ in parallel. Then
the results in these subspaces are concatenated and passed through an additional
transform matrix to obtain the final output. MSA is formulated as follows:

MSA (tn) = Cat (a1 , a2 , . . . , a


hs ) WO ,
(3)
aj = A tn Wqnj , tn Wknj , tn Wvnj ,

where $\mathrm{Cat}(\cdot)$ denotes the concatenation operation along the vector dimension,
$a_j \in \mathbb{R}^{2L \times d}$ is the output of the $j$th self-attention head, $h_s$ denotes the number
of self-attention heads, and $W_O \in \mathbb{R}^{(h_s \times d) \times C}$ denotes the transform matrix
producing the MSA output. The MLP has three layers, comprising $C$ neurons in the
input and output layers and $2C$ neurons in the hidden layer with a GELU activation.
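A single encoder layer following Eqs. (2)-(3) might be sketched as below. It relies on PyTorch's nn.MultiheadAttention, which ties the per-head dimension to C / h_s (a simplification of the per-head matrices above), and it assumes a pre-norm ordering of LayerNorm and the residual connections, which the text does not specify.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one transformer-encoder layer (Eqs. (2)-(3)), pre-norm variant."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Three-layer MLP: C -> 2C -> C with GELU, as described above.
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B, 2L, C) concatenated tokens plus positional coding W_p.
        x = self.norm1(t)
        t = t + self.msa(x, x, x, need_weights=False)[0]   # residual MSA branch
        t = t + self.mlp(self.norm2(t))                    # residual MLP branch
        return t
```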

2.3 Transformer Decoder

As shown in Fig. 3(b), we separate $t_e$ into $t_{e,1}, t_{e,2} \in \mathbb{R}^{L \times C}$, and a Siamese
transformer decoder converts the tokens into pixel features, yielding the final
refined feature maps $f_{d_1}^r, f_{d_2}^r \in \mathbb{R}^{H \times W \times C}$ with global context. The transformer
decoder has $N_d$ layers consisting of residual connections, LayerNorm, multi-head
cross-attention (MCA), and an MLP. Unlike the transformer encoder, we remove the
positional coding and substitute MSA with MCA. The keys and values of MCA are
obtained from $t_{e,1}, t_{e,2}$, and the queries are obtained from $f_{d_1}, f_{d_2}$. The
configuration of the MLP is the same as in the transformer encoder. MCA can be
formulated as follows:
$$\mathrm{MCA}(f_{d_i}^n, t_{e,i}) = \mathrm{Cat}(a_1, a_2, \ldots, a_{h_c})\, W_O,$$
$$a_j = A\!\left(f_{d_i}^n W_{q_j}^n,\; t_{e,i} W_{k_j}^n,\; t_{e,i} W_{v_j}^n\right),$$  (4)

where $a_j \in \mathbb{R}^{(H \times W) \times d}$ denotes the output of the $j$th cross-attention head, $h_c$
denotes the number of cross-attention heads, and $W_O \in \mathbb{R}^{(h_c \times d) \times C}$ denotes the
transform matrix producing the MCA output.
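Correspondingly, one decoder layer of Eq. (4) could be sketched as follows, with queries taken from the flattened pixel features of one domain and keys/values from that domain's encoded tokens; the same module (shared weights) would be applied to both domains in the Siamese setting. The pre-norm ordering and the use of nn.MultiheadAttention are again simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Sketch of one Siamese transformer-decoder layer (Eq. (4)): queries from
    pixel features, keys/values from encoder tokens; no positional coding."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mca = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, f: torch.Tensor, t_e: torch.Tensor) -> torch.Tensor:
        # f:   (B, H*W, C) flattened CNN features of one domain (queries).
        # t_e: (B, L, C)   encoded tokens of the same domain (keys/values).
        q, kv = self.norm_q(f), self.norm_kv(t_e)
        f = f + self.mca(q, kv, kv, need_weights=False)[0]   # residual MCA branch
        f = f + self.mlp(self.norm2(f))                      # residual MLP branch
        return f   # reshaped back to (B, C, H, W) outside this layer
```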

2.4 3D Capsule Matcher

The structure of our 3D capsule matcher is shown in Fig. 4. It contains a capsule
convolution layer, a 3D capsule layer, and a digital capsule layer.

Fig. 4. Illustration of the 3D capsule matcher and reconstruction decoder.

The difference maps $f_d \in \mathbb{R}^{H \times W \times C}$ are obtained by applying an L2-norm operation
to the difference of $f_{d_1}^r$ and $f_{d_2}^r$ from the transformer decoder. Capsules are then
routed to the 3D capsule layer using 3D capsule routing and finally routed to the
digital capsule layer using a routing-by-agreement mechanism. The length of the
final output capsule indicates the probability that the image pair belongs to the
same class.

A Capsule Convolution Layer. The capsule convolution layer takes $f_d \in \mathbb{R}^{H \times W \times C}$
as input to obtain $u \in \mathbb{R}^{H \times W \times C' \times N'}$. The capsule convolution layer contains a
convolution, a reshaping operation, and a nonlinear activation. First, $f_d$ is
convolved with $C' \times N'$ filters, $3 \times 3$ kernels, a stride of 1, and a padding of 1
to obtain feature maps of shape $(H, W, C' \times N')$. Second, they are reshaped into 3D
capsules $S \in \mathbb{R}^{H \times W \times C' \times N'}$, where $N'$ denotes the dimension of each capsule and
$C'$ denotes the number of capsules. Finally, these capsules are passed through a
nonlinear activation function, squash, to obtain $u \in \mathbb{R}^{H \times W \times C' \times N'}$.
The squash function is formulated as follows:

$$u_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|},$$  (5)

where $s_j$ denotes the $j$th capsule among the 3D capsules and $u_j$ denotes the
nonlinear activation output of the $j$th capsule.
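A compact sketch of the capsule convolution layer and the squash of Eq. (5) is shown below; the channel-to-capsule reshape order and the defaults C' = 8, N' = 16 (taken from Sect. 3.2) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (5): rescale capsule vectors so their length lies in (0, 1)."""
    norm_sq = (s * s).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

class CapsuleConvLayer(nn.Module):
    """Sketch of the capsule convolution layer: conv -> reshape to capsules -> squash."""

    def __init__(self, in_channels: int, num_caps: int = 8, caps_dim: int = 16):
        super().__init__()
        self.num_caps, self.caps_dim = num_caps, caps_dim
        # C' x N' filters, 3x3 kernels, stride 1, padding 1 (shape preserving).
        self.conv = nn.Conv2d(in_channels, num_caps * caps_dim,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, f_d: torch.Tensor) -> torch.Tensor:
        # f_d: (B, C, H, W) difference maps.
        b, _, h, w = f_d.shape
        s = self.conv(f_d).view(b, self.num_caps, self.caps_dim, h, w)  # (B, C', N', H, W)
        return squash(s, dim=2)                                         # squash along N'
```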

B 3D Capsule Layer. Our proposed 3D capsule routing algorithm takes $u$ as input to
predict the new output capsules. The entire routing algorithm is shown in Fig. 4.
First, we reshape $u$ into $\hat{u} \in \mathbb{R}^{H \times W \times (C' \times N') \times 1}$. Second, $\hat{u}$ is convolved using a
3D convolution with $C'' \times N''$ filters, $3 \times 3 \times 3$ kernels, a stride of $(1, 1, N')$, and a
padding of $(1, 1, 0)$ to obtain the intermediate votes $v \in \mathbb{R}^{H \times W \times C' \times (C'' \times N'')}$.
Next, $v$ is reshaped into $\hat{v} \in \mathbb{R}^{H \times W \times C' \times N'' \times C''}$, so that a set of capsules can
represent entities for routing. The 3D convolution obtains localized features,
allowing neighboring capsules to share similar information; hence, it significantly
reduces the number of parameters compared to the dynamic routing in [15].
Subsequently, the log prior probabilities $b \in \mathbb{R}^{H \times W \times 1 \times C' \times C''}$ are initialized to 0
and passed through the 3d_softmax function, which is formulated as follows:

$$o_s = \mathrm{3d\_softmax}(b_s) = \frac{\exp(b_s)}{\sum_{i}^{H} \sum_{j}^{W} \sum_{k}^{C''} \exp(b_{i,j,k,s})},$$  (6)

where $b_s$ denotes the log prior probabilities of the $s$th 3D capsule, and $o_s$ denotes
the coupling coefficients of the $s$th 3D capsule. The $(i, j, k)$th capsule in the 3D
capsule layer receives votes from the $C'$ capsules in the previous layer. Therefore,
the intermediate votes $\hat{v}$ and the coupling coefficients $o_s$ are combined by a
weighted summation to obtain the inactivated output capsules
$\varphi \in \mathbb{R}^{H \times W \times N'' \times C'' \times 1}$, which is formulated as follows:


$$\varphi_{i,j,k} = \sum_{s}^{C'} o_{i,j,k,s}\, \hat{v}_{i,j,k,s},$$  (7)

where $\varphi_{i,j,k}$ denotes the $(i, j, k)$th inactivated output capsule in the 3D capsule
layer. A modified 3D nonlinear activation function compresses the length of each
capsule to between 0 and 1 to obtain the activated capsules
$\hat{\varphi} \in \mathbb{R}^{H \times W \times N'' \times C'' \times 1}$, which is formulated as follows:
$$\hat{\varphi}_{i,j,k} = \mathrm{3d\_squash}(\varphi_{i,j,k}) = \frac{\|\varphi_{i,j,k}\|^2}{0.5 + \|\varphi_{i,j,k}\|^2} \frac{\varphi_{i,j,k}}{\|\varphi_{i,j,k}\|},$$  (8)

where $\hat{\varphi}_{i,j,k}$ denotes the activated output capsule in the 3D capsule layer. The
agreement between $\hat{v}$ and $\hat{\varphi}$ is measured by their inner product: if the inner
product of two capsules is positive, the coupling coefficient between them
increases. The update of $b$ is formulated as follows:

$$b_{i,j,k,s} = b_{i,j,k,s} + \hat{\varphi}_{i,j,k} \cdot \hat{v}_{i,j,k,s},$$  (9)

where $b_{i,j,k,s}$ denotes the log probability between the $s$th capsule in the previous
layer and the $(i, j, k)$th capsule in the next layer. Following [15], we use 3 routing
iterations and achieve good convergence.
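Because the exact tensor layout of the votes is not fully recoverable from the text, the following is only a schematic of the routing loop in Eqs. (6)-(9): the votes are assumed to be precomputed by the 3D convolution and arranged as (batch, H, W, output capsules, input capsules, capsule dimension), the log priors start at zero, the 3d_softmax normalizes over the (i, j, k) positions, and three iterations are used as stated above.

```python
import torch

def squash_3d(phi: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (8): like squash, but with 0.5 in the denominator for a steeper activation."""
    norm_sq = (phi * phi).sum(dim=dim, keepdim=True)
    return (norm_sq / (0.5 + norm_sq)) * phi / torch.sqrt(norm_sq + eps)

def routing_3d(v_hat: torch.Tensor, iterations: int = 3) -> torch.Tensor:
    """Schematic 3D routing (Eqs. (6)-(9)).

    v_hat: (B, H, W, K, S, N) votes, where S indexes the input capsules,
           K the output capsules, and N the capsule dimension (this layout is
           an assumption, not the authors' code).
    Returns activated output capsules of shape (B, H, W, K, N).
    """
    b_logits = torch.zeros(v_hat.shape[:-1], device=v_hat.device)   # (B, H, W, K, S)
    for _ in range(iterations):
        # Eq. (6): "3d_softmax" over the H, W, K axes for each input capsule s.
        o = torch.softmax(b_logits.flatten(1, 3), dim=1).view_as(b_logits)
        # Eq. (7): weighted sum of votes -> inactivated output capsules.
        phi = (o.unsqueeze(-1) * v_hat).sum(dim=4)                   # (B, H, W, K, N)
        # Eq. (8): 3d_squash activation.
        phi_hat = squash_3d(phi, dim=-1)
        # Eq. (9): agreement update of the log priors.
        b_logits = b_logits + (phi_hat.unsqueeze(4) * v_hat).sum(dim=-1)
    return phi_hat
```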

C Digital Capsule Layer. The output capsules $\hat{\varphi}$ of the 3D capsule layer are
reshaped into $\hat{\varphi}_r \in \mathbb{R}^{(H \times W \times C'') \times N''}$. Then, using the routing-by-agreement
mechanism in [15], $\hat{\varphi}_r$ is mapped into the digital capsules $g \in \mathbb{R}^{2 \times N''}$, which
denote the imposter and genuine capsules in the case of binary classification.
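For completeness, a standard routing-by-agreement layer in the spirit of [15] is sketched below for the two digital capsules (imposter and genuine); the transform-matrix initialization and the output dimension are illustrative choices, not the authors' settings.

```python
import torch
import torch.nn as nn

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    norm_sq = (s * s).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

class DigitalCapsuleLayer(nn.Module):
    """Sketch of the digital capsule layer with routing-by-agreement [15]:
    maps M input capsules of dimension d_in to 2 output capsules."""

    def __init__(self, num_in: int, d_in: int, d_out: int = 16,
                 num_out: int = 2, iterations: int = 3):
        super().__init__()
        self.iterations = iterations
        # One transform matrix per (input capsule, output capsule) pair.
        self.W = nn.Parameter(0.01 * torch.randn(1, num_in, num_out, d_out, d_in))

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (B, M, d_in) reshaped capsules from the 3D capsule layer.
        u_hat = torch.matmul(self.W, u.unsqueeze(2).unsqueeze(-1)).squeeze(-1)  # (B, M, 2, d_out)
        b = torch.zeros(u_hat.shape[:3], device=u.device)                       # (B, M, 2)
        for _ in range(self.iterations):
            c = torch.softmax(b, dim=2)                                # couple inputs to outputs
            g = squash((c.unsqueeze(-1) * u_hat).sum(dim=1), dim=-1)   # (B, 2, d_out)
            b = b + (u_hat * g.unsqueeze(1)).sum(dim=-1)               # agreement update
        return g   # capsule lengths approximate class probabilities
```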

2.5 Loss Function

The coupling coefficients between capsules are learned by the routing algorithm,
while the remaining weight parameters are learned by back-propagation through a
margin loss, which increases intra-class compactness and inter-class separability.
The margin loss of the $j$th digital capsule is formulated as follows:

$$L_j = T_j \max(0, m^+ - \|g_j\|)^2 + \lambda_m (1 - T_j) \max(0, \|g_j\| - m^-)^2,$$  (10)

where $T_j$ is 1 when class $j$ is present and 0 otherwise, $m^+$ denotes the lower
boundary of $\|g_j\|$, $m^-$ denotes the upper boundary of $\|g_j\|$, $\lambda_m$ is a down-weighting
parameter, and $\|g_j\|$ denotes the probability of existence of class $j$. To prevent
overfitting of the 3D capsule matcher, we use a regularization loss, the
reconstruction loss, to guide the capsule decoder and the 3D capsule matcher toward
finer-grained feature judgments. As shown in Fig. 4, we use transposed convolution
layers as the capsule decoder to reconstruct the difference maps. The first four
layers employ BatchNorm and ReLU, and tanh is employed in the last layer.

In the training phase, we reshape the digital capsule $g_r \in \mathbb{R}^{1 \times N''}$ corresponding
to the ground truth into $\hat{g}_r \in \mathbb{R}^{1 \times 1 \times N''}$. It is then fed to the capsule decoder to
obtain the reconstructed difference maps $\hat{f}_d \in \mathbb{R}^{H \times W \times C}$. The reconstruction loss
is formulated as follows:

$$L_r = \|\hat{f}_d - f_d\|^2.$$  (11)
The total loss of our proposed method is the summation of the margin losses and the
reconstruction loss scaled by a hyperparameter $\lambda_r$. It is formulated as follows:

$$L_t = \sum_{j=1}^{2} L_j + \lambda_r L_r.$$  (12)
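The margin loss of Eq. (10) and the total loss of Eq. (12) can be written directly in PyTorch; the sketch below plugs in the hyperparameter values reported in Sect. 3.3 as defaults and assumes the digital capsules are arranged as (batch, 2, capsule dimension).

```python
import torch
import torch.nn.functional as F

def margin_loss(g: torch.Tensor, target: torch.Tensor,
                m_pos: float = 0.9, m_neg: float = 0.1, lam_m: float = 0.5) -> torch.Tensor:
    """Eq. (10): g is (B, 2, N) digital capsules; target is a (B,) long tensor
    with 0 = imposter and 1 = genuine."""
    lengths = g.norm(dim=-1)                                    # (B, 2) capsule lengths
    t = F.one_hot(target, num_classes=lengths.size(1)).float()  # T_j indicators
    loss = t * F.relu(m_pos - lengths).pow(2) \
         + lam_m * (1.0 - t) * F.relu(lengths - m_neg).pow(2)
    return loss.sum(dim=1).mean()

def total_loss(g, target, recon, diff_maps, lam_r: float = 0.1):
    """Eq. (12): sum of margin losses plus lambda_r times the squared-norm
    reconstruction loss of Eq. (11)."""
    rec = (recon - diff_maps).pow(2).flatten(1).sum(dim=1).mean()
    return margin_loss(g, target) + lam_r * rec
```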

3 Experiments and Results

To fairly and comprehensively evaluate the effectiveness of our proposed method,
we use two common datasets: the cross-spectral PolyU iris dataset [19] and the
cross-sensor IIIT-D Contact Lens Iris (CLI) dataset [2]. The genuine acceptance
rate (GAR), false acceptance rate (FAR), and equal error rate (EER) are used to
quantitatively evaluate the performance.
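As a reference for how these metrics can be computed from genuine and imposter score arrays (assuming higher scores indicate a genuine pair), a simple NumPy sketch is given below; it is a generic implementation, not the evaluation code used in the paper.

```python
import numpy as np

def gar_at_far(genuine: np.ndarray, imposter: np.ndarray, far: float) -> float:
    """GAR at a given FAR: threshold at the (1 - FAR) quantile of imposter scores."""
    thr = np.quantile(imposter, 1.0 - far)
    return float((genuine >= thr).mean())

def eer(genuine: np.ndarray, imposter: np.ndarray, num_thr: int = 1000) -> float:
    """Equal error rate: sweep thresholds until FAR is approximately equal to FRR."""
    thrs = np.linspace(min(genuine.min(), imposter.min()),
                       max(genuine.max(), imposter.max()), num_thr)
    fars = np.array([(imposter >= t).mean() for t in thrs])
    frrs = np.array([(genuine < t).mean() for t in thrs])
    idx = int(np.argmin(np.abs(fars - frrs)))
    return float((fars[idx] + frrs[idx]) / 2.0)
```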

Fig. 5. Some samples from two datasets: (a) PolyU iris dataset and (b) IIIT-D CLI
dataset.

3.1 Datasets

We use a public algorithm [22] on each dataset to accurately segment, normalize,
and enhance the iris images. The normalized iris images, resized to 256×256, are
taken as input by the subsequent network. The two datasets are summarized as
follows: (1) The PolyU iris database, containing 12540 iris images from 209
subjects, is acquired in the NIR and VIS wavelengths from the left and right eyes.
Each subject has 15 instances of each eye for both the VIS and NIR spectra.
Samples are shown in Fig. 5(a). (2) The IIIT-D CLI database, containing 6570 iris
images from 101 subjects, is acquired using two different NIR sensors. It was
developed to evaluate contact lens presentation attacks. Because it contains 2163
images without contact lenses, it is also used to evaluate cross-sensor
performance. Samples are shown in Fig. 5(b).

3.2 Network Structure

CNN Feature Extractor. We modify a pretrained ResNet18 by setting the stride of
the last two stages to 1 and adding a 1×1 convolution with 32 filters followed by
bilinear upsampling.
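A possible realization of this backbone is sketched below: the strides of the first blocks of layer3 and layer4 of a torchvision ResNet18 are set to 1, followed by a 1×1 convolution with 32 filters and bilinear upsampling. The upsampling factor of 2 (giving roughly 64×64 maps for a 256×256 input) and the weight loading are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torchvision

class CNNFeatureExtractor(nn.Module):
    """Sketch of the modified ResNet18 backbone described above."""

    def __init__(self, out_channels: int = 32, upscale: int = 2):
        super().__init__()
        backbone = torchvision.models.resnet18()  # load pretrained ImageNet weights here if desired
        # Set the stride of the first block of the last two stages to 1.
        for stage in (backbone.layer3, backbone.layer4):
            stage[0].conv1.stride = (1, 1)
            stage[0].downsample[0].stride = (1, 1)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
        self.proj = nn.Conv2d(512, out_channels, kernel_size=1)     # 1x1 conv, 32 filters
        self.up = nn.Upsample(scale_factor=upscale, mode="bilinear", align_corners=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, 256, 256) normalized iris image (grayscale replicated to 3 channels).
        return self.up(self.proj(self.stem(x)))   # -> approximately (B, 32, 64, 64)
```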

Transformer Feature Extractor. The number of layers $N_e$ of the transformer encoder
is set to 1. The number of layers $N_d$ of the transformer decoder is set to 4. The
numbers of heads $h_s$ in MSA and $h_c$ in MCA are set to 8. The dimension $d$ of each
head in MSA and MCA is set to 8.

3D Capsule Matcher. The convolution in the capsule convolution layer has
$C' \times N' = 8 \times 16$ filters. The 3D convolution in the 3D capsule layer has
$C'' \times N'' = 8 \times 16$ filters.

Capsule Decoder. It consists of 5 transposed convolution layers with filter numbers
{32, 64, 128, 256, 32}, 4×4 kernels, strides of {1, 2, 2, 2, 2}, and paddings of
{0, 1, 1, 1, 1}.
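Putting these specifications together, the capsule decoder might look as follows; the input capsule dimension of 16 is an assumption (matching N'' = 16 from the 3D capsule layer), and with it the five layers expand a 1×1 input to 64×64 difference maps.

```python
import torch
import torch.nn as nn

class CapsuleDecoder(nn.Module):
    """Sketch of the capsule decoder: five transposed convolutions with filter
    numbers {32, 64, 128, 256, 32}, 4x4 kernels, strides {1,2,2,2,2} and
    paddings {0,1,1,1,1}; BatchNorm+ReLU on the first four layers, tanh last."""

    def __init__(self, caps_dim: int = 16):
        super().__init__()
        filters = [caps_dim, 32, 64, 128, 256, 32]
        strides = [1, 2, 2, 2, 2]
        paddings = [0, 1, 1, 1, 1]
        layers = []
        for i in range(5):
            layers.append(nn.ConvTranspose2d(filters[i], filters[i + 1], kernel_size=4,
                                             stride=strides[i], padding=paddings[i]))
            if i < 4:
                layers += [nn.BatchNorm2d(filters[i + 1]), nn.ReLU(inplace=True)]
            else:
                layers.append(nn.Tanh())
        self.decoder = nn.Sequential(*layers)

    def forward(self, g_r: torch.Tensor) -> torch.Tensor:
        # g_r: (B, caps_dim, 1, 1) digital capsule of the ground-truth class.
        return self.decoder(g_r)   # -> (B, 32, 64, 64) reconstructed difference maps
```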

3.3 Training Setup

Our method is implemented using PyTorch 1.6.0. The training process lasts 100
epochs with a batch size of 32. We use the SGD optimizer and train on two NVIDIA
GeForce GTX 1080 Ti GPUs. The momentum is set to 0.99, the weight decay to 0.0005,
and the learning rate to 0.001. In the loss function, $m^-$ is set to 0.1, $m^+$ to 0.9,
$\lambda_m$ to 0.5, and $\lambda_r$ to 0.1.
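For reference, the optimizer and hyperparameter settings above translate to roughly the following PyTorch configuration; `model` stands in for the full network of Sect. 2 and is only a placeholder here.

```python
import torch

model = torch.nn.Linear(8, 2)   # placeholder for the full network of Sect. 2

optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.99, weight_decay=0.0005)

# Loss hyperparameters and schedule from Sect. 3.3.
m_pos, m_neg, lam_m, lam_r = 0.9, 0.1, 0.5, 0.1
epochs, batch_size = 100, 32
```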

3.4 Evaluation Protocol

To compare fairly with other state-of-the-art methods, we follow the data
partitioning protocol in [8]. In the PolyU iris database, the first ten images per
eye from the first 140 subjects are selected as the training set and the remaining
five images as the test set. This generates 2800 (10 × 140 × 2) genuine scores and
1,953,000 (6975 × 140 × 2) imposter scores from the PolyU iris database to evaluate
the cross-spectral performance. In the IIIT-D CLI database, the first three images
per eye from the 101 subjects without contact lenses are selected as the training
set, the fourth image as the validation set, and the remaining one as the test set.
This generates 202 (1 × 1 × 101 × 2) genuine scores and 40602 (1 × 1 × 202 × 201)
imposter scores from the IIIT-D CLI database to evaluate the cross-sensor
performance.

3.5 Comparison with Other State-of-the-art Heterogeneous Iris Recognition Algorithms

Experiments on the Cross-spectral Scenario. In the cross-spectral scenario, VIS
iris images are matched with NIR iris images. We compare against two traditional
methods, MRF [8] and IrisCode [6], and two deep learning methods, CNN with SDH [19]
and DenseSENet [1], because these methods report competitive results on the PolyU
iris database. As shown in Table 1, our proposed method obtains the best GAR and
EER among the compared state-of-the-art methods. Moreover, it significantly
outperforms the state-of-the-art traditional method MRF and the deep learning
method DenseSENet: it reduces the EER by 43.33% compared to DenseSENet and by
74.77% compared to CNN-SDH. The results show that our method models more
contextual relationships in the feature extraction space and obtains more
fine-grained discriminative features. In addition, our 3D capsule routing decreases
the loss of spatial information and is robust to domain variation and non-ideal
conditions, whereas the other deep learning methods consider neither the
part-whole correlation nor the loss of spatial information.

Table 1. Comparison of other methods on the cross-spectral scenario.

Methods          GAR@FAR=0.1 (%)   GAR@FAR=0.01 (%)   EER (%)
IrisCode [6]     51.89             33.12              30.81
MRF [8]          64.91             47.74              24.50
CNN-SDH [19]     96.41             90.71              5.39
DenseSENet [1]   99.05             96.81              2.40
Ours             99.72             98.45              1.36

Table 2. Comparison of other methods on the cross-sensor scenario.

Methods          GAR@FAR=0.1 (%)   GAR@FAR=0.01 (%)   EER (%)
ESG [5]          49.20             21.53              29.59
DA-DBNN [8]      89.92             85.04              10.02
CNN-SDH [19]     96.41             90.71              5.39
DenseSENet [1]   98.94             98.07              1.73
Ours             99.56             98.91              0.68

Experiments on the Cross-sensor Scenario. In the cross-sensor scenario, iris images
from one sensor are matched with iris images from a different sensor. Our method is
compared with two traditional methods, even-symmetric Gabor (ESG) [5] and
DA-DBNN [8], and a deep learning method, DenseSENet, on the IIIT-D CLI database.
As shown in Table 2, our proposed method obtains the best results: GAR = 99.56% at
FAR = 0.1 and GAR = 98.91% at FAR = 0.01. Compared with DenseSENet and DA-DBNN,
our method reduces the EER by 60.69% and 93.21%, respectively. It is worth noting
that we reimplement DenseSENet due to possible errors in the description and
partitioning of the dataset they report. The results show that our method improves
cross-sensor performance because the transformer models context in the token-based
space to refine features that are more domain-invariant than those from pure
convolutions or traditional descriptors. Moreover, compared with fully connected
networks and distance metrics, the 3D capsule matcher considers the most
informative partial representations while overlooking the domain gap.

Table 3. Ablation on different modules.

TG   TFE   CM   GAR@FAR=0.1 (%)   GAR@FAR=0.01 (%)   EER (%)
✘    ✘     ✘    72.52             61.43              19.26
✘    ✔     ✘    96.14             94.87              2.76
✔    ✔     ✘    97.90             96.34              2.46
✘    ✘     ✔    80.89             75.76              12.84
✔    ✔     ✔    99.56             98.91              0.68

Table 4. Ablation on λm and λr.

λm     λr     GAR@FAR=0.1 (%)   GAR@FAR=0.01 (%)   EER (%)
0.1    0.1    97.89             95.63              2.21
0.25   0.1    98.46             97.04              1.94
0.5    0.1    99.56             98.91              0.68
0.1    0.25   97.72             95.54              2.30
0.1    0.5    97.32             95.12              2.56

3.6 Ablation Experiments

Ablation on the Token Generator (TG), Transformer Feature Extractor (TFE), and 3D
Capsule Matcher (CM). The ablation experiments are performed on the IIIT-D CLI
database. When CM is removed, we add a fully connected network with softmax as the
feature matcher. When TG is removed, the input to TFE comes from a chunked sequence
of features. As seen in Table 3, the proposed token generator, transformer feature
extractor, and 3D capsule matcher all contribute to improving the performance.

Ablation on Hyperparameters. We performed ablation experiments on the IIIT-D CLI
database to select the hyperparameters of the loss function. From Table 4, we note
that the proposed approach achieves optimal performance when trained with
λm = 0.5 and λr = 0.1.

4 Conclusions

In this paper, we use transformers and capsules to ease the difficulty of
heterogeneous iris recognition. Leveraging vision tokens, the transformer encoder
and decoder model the relationship between the token-based space and the pixel
space to refine low-level semantic features, and 3D capsule routing efficiently
models the part-whole correlation. Extensive experiments on cross-spectral and
cross-sensor scenarios show that our proposed method significantly improves the
performance of heterogeneous iris recognition.

Acknowledgements. This work was supported by the National Natural Science
Foundation of China (grant number 61471181) and the Natural Science Foundation of
Jilin Province (grant number YDZJ202101ZYTS144).

References
1. Chen, Y., Zeng, Z., Zeng, Y., Gan, H., Chen, H.: DenseSENet: more accurate and
robust cross-domain iris recognition. J. Electron. Imaging 30(6), 063024 (2021)
2. Kohli, N., Yadav, D., Vatsa, M., Singh, R.: Revisiting iris recognition with color
cosmetic contact lenses. In: 2013 International Conference on Biometrics (ICB),
pp. 1–7. IEEE (2013)
3. Kosiorek, A., Sabour, S., Teh, Y.W., Hinton, G.E.: Stacked capsule autoencoders.
In: Advances in Neural Information Processing Systems, vol. 32 (2019)
4. Liu, N., Zhang, M., Li, H., Sun, Z., Tan, T.: Deepiris: learning pairwise filter bank
for heterogeneous iris verification. Pattern Recogn. Lett. 82, 154–161 (2016)
5. Ma, L., Tan, T., Wang, Y., Zhang, D.: Personal identification based on iris texture
analysis. IEEE Trans. Pattern Anal. Mach. Intell. 25(12), 1519–1533 (2003)
6. Masek, L.: Matlab source code for a biometric identification system based on iris
patterns. http://peoplecsse.uwa.edu.au/pk/studentprojects/libor/ (2003)
7. Mostofa, M., Mohamadi, S., Dawson, J., Nasrabadi, N.M.: Deep GAN-based cross-
spectral cross-resolution iris recognition. IEEE Trans. Biometrics Behav. Identity
Sci. 3(4), 443–463 (2021)
8. Nalla, P.R., Kumar, A.: Toward more accurate iris recognition using cross-spectral
matching. IEEE Trans. Image Process. 26(1), 208–221 (2016)
9. Nguyen, K., Fookes, C., Ross, A., Sridharan, S.: Iris recognition with off-the-shelf
CNN features: a deep learning perspective. IEEE Access 6, 18848–18855 (2017)
10. Oktiana, M., et al.: Advances in cross-spectral iris recognition using integrated
gradientface-based normalization. IEEE Access 7, 130484–130494 (2019)
11. Oktiana, M., Saddami, K., Arnia, F., Away, Y., Munadi, K.: Improved cross-
spectral iris matching using multi-scale weberface and gabor local binary pattern.
In: 2018 10th International Conference on Electronics, Computers and Artificial
Intelligence (ECAI), pp. 1–6. IEEE (2018)
12. Pillai, J.K., Patel, V.M., Chellappa, R., Ratha, N.K.: Secure and robust iris recog-
nition using random projections and sparse representations. IEEE Trans. Pattern
Anal. Mach. Intell. 33(9), 1877–1893 (2011)
13. Pillai, J.K., Puertas, M., Chellappa, R.: Cross-sensor iris recognition through kernel
learning. IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 73–85 (2013)
14. Proença, H., Neves, J.C.: Irina: Iris recognition (even) in inaccurately segmented
data. In: CVPR, pp. 538–547 (2017)
15. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In:
Advances in Neural Information Processing Systems, vol. 30 (2017)
16. Sequeira, A., et al.: Cross-eyed-cross-spectral iris/periocular recognition database
and competition. In: 2016 International Conference of the Biometrics Special Inter-
est Group (BIOSIG), pp. 1–5. IEEE (2016)
17. Sequeira, A.F., et al.: Cross-eyed 2017: cross-spectral iris/periocular recognition
competition. In: IJCB, pp. 725–732. IEEE (2017)
18. Wang, J., et al.: Learning fine-grained image similarity with deep ranking. In:
CVPR, pp. 1386–1393 (2014)
19. Wang, K., Kumar, A.: Cross-spectral iris recognition using CNN and supervised
discrete hashing. Pattern Recogn. 86, 85–98 (2019)
20. Wei, J., Wang, Y., Li, Y., He, R., Sun, Z.: Cross-spectral iris recognition by learning
device-specific band. IEEE Trans. Circ. Syst. Video Technol. (2021)
21. Zhao, B., Tang, S., Chen, D., Bilen, H., Zhao, R.: Continual representation learning
for biometric identification. In: WACV, pp. 1198–1208 (2021)
22. Zhao, Z., Ajay, K.: An accurate iris segmentation framework under relaxed imaging
constraints using total variation model. In: ICCV, pp. 3828–3836 (2015)
