
Using an LLM to Turn Sign Spottings into Spoken Language Sentences

Ozge Mercanoglu Sincan¹ and Necati Cihan Camgoz² and Richard Bowden¹
{o.mercanoglusincan, r.bowden}@surrey.ac.uk, neccam@meta.com
¹ University of Surrey, UK
² Meta Reality Labs, Switzerland

Abstract— Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos. In this paper, we introduce a hybrid SLT approach, Spotter+GPT, that utilizes a sign spotter and a pretrained large language model to improve SLT performance. Our method builds upon the strengths of both components. The videos are first processed by the spotter, which is trained on a linguistic sign language dataset, to identify individual signs. These spotted signs are then passed to the powerful language model, which transforms them into coherent and contextually appropriate spoken language sentences.

Fig. 1. Overview of the CSLR (Continuous Sign Language Recognition) and SLT (Sign Language Translation) tasks.
I. INTRODUCTION

Sign languages are visual languages that rely on manual hand articulations, facial expressions, and body movements. To bridge communication gaps between the Deaf community and hearing people, Sign Language Translation (SLT) (sign language → spoken language) and Sign Language Production (SLP) (spoken language → sign language) systems hold great significance.

SLT is often formulated as a Neural Machine Translation (NMT) task since it aims to generate spoken/written language sentences from sign language videos [5]. Compared to classical text-based NMT approaches, which work on easily tokenizable text, SLT involves processing continuous sign language videos, which are hard to align and tokenize. Furthermore, spoken and sign languages have different grammar, and the order of sign glosses differs from the order of spoken language words, as seen in Fig. 1.

To address this issue, some researchers have approached SLT as a combination of two sub-tasks: Sign Language Recognition (SLR), recognizing the constituent signs of sequences, and then translating the recognized signs into meaningful spoken language sentences [5], [35], [6]. Some researchers approach the first task as Continuous Sign Language Recognition (CSLR) and try to detect sequences of glosses as an intermediate representation of sign language videos. Camgoz et al. [5], [6] showed that using gloss-based intermediate representations improves SLT performance significantly. The common approach in CSLR is learning spatial and temporal visual representations with a sequence-to-sequence Connectionist Temporal Classification (CTC) loss [10]. These CSLR approaches require gloss supervision, which is hard and time-consuming to annotate accurately.

On the other hand, some researchers tackle SLT in an end-to-end manner by training SLT models that produce spoken language sentences directly from sign language videos [6], [34], [37]. Although end-to-end approaches can achieve excellent results on small SLT datasets, such as PHOENIX-2014-T [5], they underperform on SLT datasets that are weakly aligned or have a large domain of discourse [26], [2], [16].

In contrast to conventional SLT approaches, which typically rely on gloss and spoken sentence pairs for training a gloss-to-text model, we propose a hybrid SLT approach that combines a sign spotter with a Large Language Model (LLM). We first train a sign spotter using a large sign language dataset from the linguistic domain. Then, we utilize our spotter to recognize the sequence of sign glosses and pass those spotted glosses to the LLM via prompting to generate spoken sentences without requiring additional SLT-specific training or datasets.

II. RELATED WORK

A. Sign Language Recognition

SLR can be divided into two categories: Isolated Sign Language Recognition (ISLR) [12], [14], [1], [27], [2], [33], [3] and Continuous Sign Language Recognition (CSLR) [20], [36], [32], [8]. ISLR focuses on recognizing a single sign from a video. Feature extraction plays a critical role, and 3D-CNNs, specifically I3D, have achieved great success in isolated SLR in recent years [12], [14], [33]. On the other hand, CSLR focuses on recognizing a sequence of glosses. It is a weakly supervised recognition task since continuous SLR datasets [5], [9] usually provide gloss sequences without explicit temporal boundaries, because labeling each gloss frame by frame is a time-consuming process. For this reason, using the CTC loss [10] became popular [6], [20], [36], [32].
Some researchers develop CSLR models with the assistance of an ISLR model [8], [32]. Cui et al. [8] first trained a feature extraction module and then fine-tuned the whole system iteratively. Wei et al. [32] trained two ISLR models for two different sign languages and then proposed a multilingual CSLR approach that utilizes cross-lingual signs with their assistance. In this work, we utilize an ISLR model to detect the sequence of glosses.

B. Leveraging LLMs for Sign Language Translation

The development of Large Language Models (LLMs) has led to significant improvements in natural language processing tasks [22], [4], [29]. Recently, OpenAI introduced ChatGPT, which was trained using Reinforcement Learning from Human Feedback (RLHF). It has attracted significant attention because of its success in several tasks, including question answering, summarization, and machine translation [18], [21], [15]. Despite the effectiveness of ChatGPT, the fusion of ChatGPT and sign language translation remains an underexplored domain. We believe leveraging the power of ChatGPT in the context of sign language translation presents a promising direction. Shahin and Ismail [25] were the first to explore ChatGPT's capabilities for sign language translation. They tested ChatGPT for gloss-to-spoken language and spoken language-to-gloss translation with only 5 medical-related statements in English and Arabic sign-spoken language pairs. Although this was a very small test, they found that it performs better when translating into English than into Arabic.

To the best of our knowledge, we are the first to employ ChatGPT and a Sign Spotter for sign language translation directly from video inputs.

Fig. 2. An overview of the proposed sign language translation architecture.

III. METHOD

As shown in Fig. 2, our approach contains a two-step process. First, we train a sign spotter to identify individual signs. This model is trained to recognize a single isolated sign from the input video. Once trained, the spotter is applied to continuous sign language videos in a sliding window manner to obtain gloss predictions. We then pool and threshold these detections to obtain the final set of identified glosses. We call this first step Sign Spotting (Section III-A). Subsequently, we employ a pretrained LLM to convert these sequences of glosses into spoken language sentences. To achieve this, we prompt a ChatGPT model (Section III-B).

A. Sign Spotting

As the spotter, we utilize an I3D model [7]. We prepare isolated sign sequences from a continuous sign language dataset, MeineDGS [13], where the glosses are annotated at the frame level. Within the MeineDGS dataset, the average duration of a gloss is 10 frames, while we train our I3D models with a window size of 16. If an isolated sign sequence is shorter than 16 frames, we repeat the last frame. For longer sign instances, we generate multiple samples using a sliding window with a stride of 8.

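The clip preparation just described can be summarized in a short sketch. This is a minimal illustration under stated assumptions (a hypothetical `frames` array of shape T × H × W × 3), not the authors' released code; whether the tail of a long instance receives its own window is not specified in the paper, so the sketch simply takes all full stride-8 windows.

```python
import numpy as np

WINDOW = 16   # I3D input length used for training
STRIDE = 8    # sliding-window stride for long sign instances

def make_training_clips(frames: np.ndarray) -> list[np.ndarray]:
    """Turn one annotated sign instance (T x H x W x 3) into fixed-length clips.

    Short instances are padded by repeating the last frame; longer instances
    yield several overlapping clips via a stride-8 sliding window.
    """
    t = len(frames)
    if t < WINDOW:
        pad = np.repeat(frames[-1:], WINDOW - t, axis=0)   # repeat the last frame
        return [np.concatenate([frames, pad], axis=0)]
    return [frames[s:s + WINDOW] for s in range(0, t - WINDOW + 1, STRIDE)]
```
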
At training time, the model takes 16 consecutive frames. We resize the input images to 256 × 256 and then crop a 224 × 224 region. We replaced the ReLU activation function with the Swish activation function, as it improves sign language recognition performance [11], [33].

Following [2], we utilize a cross-entropy loss and an SGD optimizer [28] with momentum 0.9, batch size 4, and an initial learning rate of 0.01. We decrease the learning rate by a factor of 10 when the validation loss plateaus for 4 consecutive epochs. We use label smoothing of 0.1 to prevent overfitting. While input videos are cropped randomly during training, they are cropped from the center during evaluation. We also apply color augmentation during training.

After training our I3D model, we employ it to spot signs in coarticulated continuous sign language videos in a sliding window manner. With a stride of 1, for a given input video with T frames, we obtain T − 15 gloss predictions. We then propose a straightforward yet efficient solution to produce a final sequence of glosses. First, we filter gloss predictions by a threshold based on the model's prediction confidence. Following this, we collapse consecutive repeated predictions to obtain the final sequence of gloss predictions.

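The spotting and post-processing steps map directly onto a few lines of code. The sketch below is illustrative rather than the authors' implementation: `spotter` is a hypothetical callable returning per-class probabilities for a 16-frame clip, `id2gloss` is the class-index-to-gloss vocabulary, and the confidence threshold is left as a parameter (Section IV-C reports 0.7 working best).

```python
import numpy as np

def spot_glosses(frames, spotter, id2gloss, threshold=0.7, window=16):
    """Run the trained ISLR model over a continuous video with stride 1,
    keep confident predictions, and collapse consecutive repeats."""
    raw = []
    for start in range(len(frames) - window + 1):       # T - 15 windows in total
        probs = spotter(frames[start:start + window])   # per-class probabilities
        best = int(np.argmax(probs))
        if probs[best] >= threshold:                     # confidence filtering
            raw.append(id2gloss[best])
    # collapse runs of identical predictions into a single gloss
    collapsed = [g for i, g in enumerate(raw) if i == 0 or g != raw[i - 1]]
    return collapsed
```
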
B. ChatGPT

Due to the success of ChatGPT in many tasks, we utilize it to create spoken sentences from glosses. This strategy not only eliminates the requirement for gloss-to-text model training but also offers potential advantages. ChatGPT's ability to understand and generate contextually relevant sentences makes it an ideal candidate for translating glosses into coherent spoken language sentences.

It is worth noting that the order of glosses in sign languages is different from the word order in spoken languages. Meanwhile, the vocabulary of our gloss spotter is constrained by the set of classes on which the I3D model was trained. However, the dynamic nature of ChatGPT allows a flexible translation process, enabling us to overcome the limitations of a fixed gloss set and the different ordering.

We prompt the 'gpt-3.5-turbo' version of ChatGPT without fine-tuning it. For each input video, after we obtain a series of glosses, we pass them to ChatGPT and prompt the model to create a meaningful sentence from those glosses.

C. Prompt Engineering

We initially set our prompt by only defining the task as generating a sentence based on the list of words entered by the user. However, we observed that ChatGPT sometimes produces outputs like "Sorry, I could not generate a sentence." in cases where the Spotter failed to detect any gloss or the number of detected glosses was insufficient. To address this, we added two rules to prevent unrelated sentences and to produce "No Translation" in cases where translation was not feasible. The refined prompt is as follows:

• You are a helpful assistant designed to generate a sentence based on the list of words entered by the user. You need to strictly follow these rules:
  1) The user will only give the list of German words separated by a space, you just need to generate a meaningful sentence from them.
  2) Only provide a response containing the generated sentence. If you cannot create a German sentence then respond with "No Translation".

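To make the pipeline concrete, the refined prompt above can be sent as a system message, with the spotted glosses passed as the user message. The sketch below assumes the OpenAI Python client's chat-completions interface and is not the authors' code; the SYSTEM_PROMPT string simply restates the two rules listed above.

```python
from openai import OpenAI

SYSTEM_PROMPT = (
    "You are a helpful assistant designed to generate a sentence based on the "
    "list of words entered by the user. You need to strictly follow these rules: "
    "1) The user will only give the list of German words separated by a space, "
    "you just need to generate a meaningful sentence from them. "
    "2) Only provide a response containing the generated sentence. If you cannot "
    "create a German sentence then respond with \"No Translation\"."
)

def glosses_to_sentence(glosses: list[str], client: OpenAI) -> str:
    """Send the spotted glosses to gpt-3.5-turbo and return the generated sentence."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": " ".join(glosses)},
        ],
    )
    return response.choices[0].message.content.strip()

# e.g. glosses_to_sentence(["FAMILIE", "ESSEN", "ABEND", "RESTAURANT"], OpenAI())
```
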
IV. EXPERIMENTS

A. Dataset and Preprocessing

MeineDGS. MeineDGS is a large linguistic German Sign Language (DGS) dataset [13]. Videos contain free-flowing conversation between two deaf participants. We follow the sign language translation protocol defined in [23] on MeineDGS, which has 40,230 training, 4,996 development, and 4,997 test sentences.

We use the MeineDGS-V split [23], which distinguishes between sign variants, each carrying the same meaning but with differing motions. MeineDGS-V has approximately 10,000 glosses available for training. However, some signs have relatively few examples, and some are singletons. To make the dataset more balanced for isolated SLR model training, we selected glosses that have more than 12 samples, and we exclude the INDEX gloss, as it stands for pointing and is the predominant gloss in the dataset. This criterion leads to 2,301 classes. We train our I3D spotter on these 2,301 classes using the train, validation, and test splits of Saunders et al. [23].

Please note that in [23] the spoken sentences are lowercased, punctuation was removed, and all German characters (ä, ö, ü, and ß) were replaced with corresponding English letters. This causes some words to change meaning. Therefore, instead of the sentence representation of [23], we use the original spoken language sentences [13] in the sign language translation task.

DGS-20 Videos. To further evaluate our approach, we collected a new small dataset containing 20 videos, in which the participants sign in DGS using glosses that exist in the aforementioned 2,301 classes, together with their corresponding spoken language German translations.

B. Evaluation Metrics

We use the BLEU [17] and BLEURT [24] metrics to evaluate the performance of our SLT approach. BLEU is the most widely used metric in machine translation and is based on the precision of n-grams (consecutive word sequences). BLEURT, on the other hand, aims to achieve human-quality scoring. Higher scores indicate better translation. We use the sacreBLEU [19] implementation for BLEU and the BLEURT-20 checkpoint for the calculation of BLEURT scores.

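As a reference point for reproducing the scores, corpus-level BLEU with sacreBLEU can be computed as sketched below. The hypothesis and reference lists are placeholders, and the BLEURT call is only indicated in a comment because it assumes the separate bleurt package and the downloaded BLEURT-20 checkpoint.

```python
import sacrebleu

hypotheses = ["Es gibt nur ein Programm in der DDR."]   # model outputs (placeholder)
references = ["Es gab nur ein Programm in der DDR."]    # ground-truth sentences (placeholder)

# corpus_bleu expects one list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU-4: {bleu.score:.2f}")

# BLEURT (assuming the separate `bleurt` package and the BLEURT-20 checkpoint):
# from bleurt import score
# scorer = score.BleurtScorer("BLEURT-20")
# bleurt_scores = scorer.score(references=references, candidates=hypotheses)
```
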
C. Quantitative Results

Sign Spotting: We evaluate our spotter on the MeineDGS dataset, reporting per-instance and per-class accuracy scores. We first fine-tuned an I3D model pretrained on Kinetics [7] and obtained 53.24% and 40.70%, respectively. It has been shown that pretraining on larger sign recognition datasets improves performance [1], [31]. Therefore, we first fine-tuned the I3D model on the large-scale BOBSL (BBC-Oxford British Sign Language) dataset [2] and then fine-tuned this Kinetics + BOBSL pretrained model on MeineDGS. We observe a 1.5% performance increase from using sign language data for weight initialization. Our results are provided in Table I.

TABLE I
ISOLATED SIGN LANGUAGE RECOGNITION PERFORMANCE ON MEINEDGS-V.

Model  #Class  Pretrained on              Per-instance  Per-class
I3D    2,301   Kinetics [7]               53.24%        40.70%
I3D    2,301   Kinetics [7] + BOBSL [2]   54.57%        42.48%

SLT: We evaluate our entire SLT approach on the MeineDGS-V test split and our DGS-20 videos. The quantitative results are provided in Table II and Table III, respectively.

We empirically evaluate the Spotter's performance using varying probability thresholds. A threshold of 0.7 on the probability associated with each gloss prediction yielded the best results. It is worth mentioning that the DGS-20 videos result in higher performance than MeineDGS. The likely reason is that the sentences in DGS-20 are created with a limited vocabulary, which enables GPT to generate more consistent sentences.

TABLE II
PERFORMANCE OF OUR APPROACH ON THE MEINEDGS-V TEST SET.

Method               B-1    B-2   B-3   B-4   BLEURT
Spotter+GPT          14.82  4.19  1.45  0.64  21.62
Spotter+Transformer  19.5   6.13  2.48  1.08  19.01
Sub-GT+GPT           16.65  6.45  3.02  1.55  29.72

TABLE III
PERFORMANCE OF OUR APPROACH ON DGS-20.

Model                B-1    B-2    B-3    B-4    BLEURT
Spotter+GPT          38.25  24.13  15.81  9.12   46.93
Spotter+Transformer  18.65  4.91   1.79   0.92   19.85
GT+GPT               68.74  59.23  52.21  46.22  79.04

Evaluation of each component. To evaluate the performance of our components independently, we conduct two different types of experiments. First, to evaluate the performance of our Spotter, we replace its results with the ground-truth gloss annotations. Although the MeineDGS-V test split has 4,620 glosses, our spotter is only able to recognize 2,301 glosses. To make a fair comparison, we excluded gloss annotations that do not belong to our Spotter's vocabulary and call this experiment Sub-GT+GPT. As expected, these scores surpass the results obtained with the spotter.

Secondly, to compare the performance of GPT, we trained a transformer model [30] that takes gloss sequences and generates spoken sentences. We use two layers with 8 heads in both the transformer encoder and decoder, with 512 hidden units. We use the Adam optimizer with an initial learning rate of 6 × 10^-4 and a batch size of 64. We reduce the learning rate by a factor of 0.7 if the BLEU-4 score does not increase for 5 epochs. On MeineDGS, the transformer outperforms GPT in terms of BLEU. However, when considering BLEURT, which reflects semantic similarity, GPT achieves better results (Table II). On the DGS-20 videos, GPT outperforms the transformer on all metrics (Table III).

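The gloss-to-text baseline above corresponds to a small encoder-decoder transformer. The following is a minimal PyTorch sketch that only mirrors the hyperparameters quoted in the text (two layers, 8 heads, 512 hidden units, Adam at 6 × 10^-4, learning-rate decay of 0.7); the vocabulary sizes, feed-forward width, tokenization, and training loop are assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn

# vocabulary sizes are placeholders, not taken from the paper
GLOSS_VOCAB, TEXT_VOCAB, D_MODEL = 2301, 10000, 512

model = nn.Transformer(
    d_model=D_MODEL, nhead=8,
    num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=D_MODEL,   # feed-forward width is not specified in the paper; assumption
    batch_first=True,
)
src_embed = nn.Embedding(GLOSS_VOCAB, D_MODEL)   # gloss token embeddings
tgt_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)    # spoken-word token embeddings
generator = nn.Linear(D_MODEL, TEXT_VOCAB)       # projection to output vocabulary

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(src_embed.parameters())
    + list(tgt_embed.parameters()) + list(generator.parameters()),
    lr=6e-4,
)
# the paper reduces the learning rate by a factor of 0.7 when BLEU-4 stalls for 5 epochs;
# ReduceLROnPlateau with mode="max" is one way to express that schedule
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.7, patience=5
)
```
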
D. Qualitative Results

We provide qualitative results of our approach in Table IV. When the spotter can detect the majority of glosses, GPT demonstrates its capacity to generate high-quality spoken sentences from the glosses (Examples 1, 2, and 4). Even when provided with missing or additional glosses, ChatGPT still preserves semantic coherence, either with information gaps or through the generation of new content (Examples 5 and 6). On the other hand, and not surprisingly, ChatGPT is only as good as the glosses given to it. When the spotter fails to detect glosses, ChatGPT generates incorrect sentences (Example 3) or "No Translation".

TABLE IV
QUALITATIVE RESULTS OF THE PROPOSED METHOD. EXAMPLES 1-3 ARE FROM MEINEDGS; EXAMPLES 4-6 ARE FROM DGS-20.

Example 1
GT Glosses: NUR2B NUM-EINER1A:1d PROGRAMM1A ICH1 DDR4 ES-GIBT3
Spoken Language: Es gab nur ein Programm in der DDR. (There was only one program in the GDR.)
Spotter: NUM-EINER PROGRAMM DDR ES-GIBT
Spotter+GPT: Es gibt nur ein Programm in der DDR. (There is only one program in the GDR.)

Example 2
GT Glosses: ICH1 IMMER4A FAMILIE1 MEIN1 KIRCHE1
Spoken Language: Dort gibt es sehr gutes Essen. (There is very good food there.)
Spotter: ICH IMMER FAMILIE MEIN KIRCHE
Spotter+GPT: Es gibt sehr viel gutes Essen. (There is a lot of good food.)

Example 3
GT Glosses: LETZTE URLAUB SARDINIEN INDEX
Spoken Language: Letztens war ich im Urlaub auf Sardinien. (I was recently on holiday in Sardinia.)
Spotter: LETZTE MEHR
Spotter+GPT: Letzte Woche gab es mehr Regen als erwartet. (Last week there was more rain than expected.)

Example 4
GT Glosses: FAMILIE1 ESSEN1 ABEND2 RESTAURANT1
Spoken Language: Die Familie isst abends im Restaurant. (The family eats in the restaurant in the evening.)
Spotter: FAMILIE ESSEN ABEND RESTAURANT
Spotter+GPT: Die Familie geht am Abend ins Restaurant, um zu essen. (The family goes to the restaurant to eat in the evening.)

Example 5
GT Glosses: DOLMETSCHER1 SPRECHEN3 FAMILIE1 MADCHEN1
Spoken Language: Der Dolmetscher spricht mit der Familie des Mädchens. (The interpreter speaks to the girl's family.)
Spotter: DOLMETSCHER MIT MADCHEN SPRECHEN
Spotter+GPT: Der Dolmetscher spricht mit dem Mädchen. (The interpreter speaks to the girl.)

Example 6
GT Glosses: DOZENT1 MORGEN1B FAHRRAD1 UNIVERSITAT1
Spoken Language: Der Dozent fährt morgens mit dem Fahrrad zur Universität. (The lecturer rides his bike to the university in the morning.)
Spotter: DOZENT MORGEN FAHRRAD LUFT
Spotter+GPT: Morgen fahre ich mit dem Fahrrad zur Uni, weil mein Dozent sagt, frische Luft ist wichtig. (Tomorrow I'll ride my bike to university because my lecturer says fresh air is important.)

V. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a sign language translation method integrating a sign spotter with the natural language generation capabilities of ChatGPT. Our approach does not need any training or fine-tuning for sign language translation. It is also capable of producing coherent and fluent spoken language sentences.

One of the limitations of our approach is that its performance is dependent on the vocabulary of our spotter. However, our approach can be adapted to specialized sign language interpretation by fine-tuning the spotter on domain-specific gloss sets. A future direction may include expanding the vocabulary of the spotter to increase the range of recognized glosses. By increasing the range of recognized glosses, our system can become even more versatile and applicable across various sign language contexts.

VI. ACKNOWLEDGMENTS

This work was supported by the SNSF project 'SMILE II' (CRSII5 193686), the European Union's Horizon 2020 programme ('EASIER' grant agreement 101016982), and the Innosuisse IICT Flagship (PFFS-21-47). This work reflects only the authors' view, and the Commission is not responsible for any use that may be made of the information it contains. All experiments were run, datasets accessed, and models built by researchers at the University of Surrey.

REFERENCES

[1] S. Albanie, G. Varol, L. Momeni, T. Afouras, J. S. Chung, N. Fox, and A. Zisserman. BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pages 35–53. Springer, 2020.
[2] S. Albanie, G. Varol, L. Momeni, H. Bull, T. Afouras, H. Chowdhury, N. Fox, B. Woll, R. Cooper, A. McParland, et al. BBC-Oxford British Sign Language dataset. arXiv preprint arXiv:2111.03635, 2021.
[3] M. Bohacek and M. Hrúz. Learning from what is already out there: Few-shot sign language recognition with online dictionaries. In 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), pages 1–6. IEEE, 2023.
[4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[5] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden. Neural sign language translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[6] N. C. Camgoz, O. Koller, S. Hadfield, and R. Bowden. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10023–10033, 2020.
[7] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[8] R. Cui, H. Liu, and C. Zhang. A deep neural framework for continuous sign language recognition by iterative training. IEEE Transactions on Multimedia, 21(7):1880–1891, 2019.
[9] A. Duarte, S. Palaskar, L. Ventura, D. Ghadiyaram, K. DeHaan, F. Metze, J. Torres, and X. Giro-i-Nieto. How2Sign: A large-scale multimodal dataset for continuous American Sign Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2735–2744, 2021.
[10] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006.
[11] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, and Y. Fu. Skeleton aware multi-modal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3413–3423, 2021.
[12] H. R. V. Joze and O. Koller. MS-ASL: A large-scale data set and benchmark for understanding American Sign Language. arXiv preprint arXiv:1812.01053, 2018.
[13] R. Konrad, T. Hanke, G. Langer, D. Blanck, J. Bleicken, I. Hofmann, O. Jeziorski, L. König, S. König, R. Nishio, A. Regen, U. Salden, S. Wagner, S. Worseck, O. Böse, E. Jahn, and M. Schulder. MEINE DGS – annotiert. Öffentliches Korpus der Deutschen Gebärdensprache, 3. Release / MY DGS – annotated. Public corpus of German Sign Language, 3rd release, 2020.
[14] D. Li, C. Rodriguez, X. Yu, and H. Li. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1459–1469, 2020.
[15] S. Manakhimova, E. Avramidis, V. Macketanz, E. Lapshinova-Koltunski, S. Bagdasarov, and S. Möller. Linguistically motivated evaluation of the 2023 state-of-the-art machine translation: Can ChatGPT outperform NMT? In Proceedings of the Eighth Conference on Machine Translation, pages 224–245, 2023.
[16] M. Müller, S. Ebling, E. Avramidis, A. Battisti, M. Berger, R. Bowden, A. Braffort, N. C. Camgöz, C. España-Bonet, R. Grundkiewicz, et al. Findings of the first WMT shared task on sign language translation (WMT-SLT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 744–772, 2022.
[17] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[18] K. Peng, L. Ding, Q. Zhong, L. Shen, X. Liu, M. Zhang, Y. Ouyang, and D. Tao. Towards making the most of ChatGPT for machine translation. Available at SSRN 4390455, 2023.
[19] M. Post. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771, 2018.
[20] J. Pu, W. Zhou, and H. Li. Iterative alignment network for continuous sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4165–4174, 2019.
[21] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang. Is ChatGPT a general-purpose natural language processing task solver? In Empirical Methods in Natural Language Processing (EMNLP), 2023. arXiv preprint arXiv:2302.06476.
[22] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[23] B. Saunders, N. C. Camgoz, and R. Bowden. Signing at scale: Learning to co-articulate signs for large-scale photo-realistic sign language production. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5141–5151, 2022.
[24] T. Sellam, D. Das, and A. P. Parikh. BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696, 2020.
[25] N. Shahin and L. Ismail. ChatGPT, let us chat sign language: Experiments, architectural elements, challenges and research directions. In 2023 International Symposium on Networks, Computers and Communications (ISNCC), pages 1–7. IEEE, 2023.
[26] O. M. Sincan, N. C. Camgoz, and R. Bowden. Is context all you need? Scaling neural sign language translation to large domains of discourse. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1955–1965, 2023.
[27] O. M. Sincan and H. Y. Keles. AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods. IEEE Access, 8:181340–181355, 2020.
[28] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147. PMLR, 2013.
[29] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[31] M. Vázquez-Enríquez, J. L. Alba-Castro, L. Docío-Fernández, and E. Rodríguez-Banga. Isolated sign language recognition with multi-scale spatial-temporal graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3462–3471, 2021.
[32] F. Wei and Y. Chen. Improving continuous sign language recognition with cross-lingual signs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23612–23621, 2023.
[33] R. Wong, N. C. Camgöz, and R. Bowden. Hierarchical I3D for sign spotting. In European Conference on Computer Vision, pages 243–255. Springer, 2022.
[34] H. Yao, W. Zhou, H. Feng, H. Hu, H. Zhou, and H. Li. Sign language translation with iterative prototype. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15592–15601, 2023.
[35] B. Zhang, M. Müller, and R. Sennrich. SLTUNET: A simple unified model for sign language translation. In The Eleventh International Conference on Learning Representations (ICLR), 2022.
[36] J. Zheng, Y. Wang, C. Tan, S. Li, G. Wang, J. Xia, Y. Chen, and S. Z. Li. CVT-SLR: Contrastive visual-textual transformation for sign language recognition with variational alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23141–23150, 2023.
[37] H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li. Improving sign language translation with monolingual data by sign back-translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1316–1325, June 2021.
