
Early Screening of Autism in Toddlers via Response-To-Instructions Protocol

Jingjing Liu, Zhiyong Wang, Graduate Student Member, IEEE, Kai Xu, Bin Ji, Gongyue Zhang, Yi Wang, Jingxin Deng, Qiong Xu, Xiu Xu, and Honghai Liu, Senior Member, IEEE

Abstract—Early screening of autism spectrum disorder (ASD) is crucial since early intervention evidently leads to significant improvement of functional social behavior in toddlers. This article attempts to bootstrap the response-to-instructions (RTI) protocol with vision-based solutions in order to assist professional clinicians with an automatic autism diagnosis. The correlation between detected objects and the toddler's emotional features, such as gaze, is constructed to analyze autistic symptoms. Twenty toddlers between 16 and 32 months of age, 15 of whom were diagnosed with ASD, participated in this study. The RTI method is validated against human codings, and group differences between ASD and typically developing (TD) toddlers are analyzed. The results suggest that the agreement between clinical diagnosis and the RTI method reaches 95% for all 20 subjects, which indicates that vision-based solutions are highly feasible for automatic autism diagnosis.

Index Terms—Autistic early screening, gaze estimation, social behavior disorder.

I. INTRODUCTION

AUTISM spectrum disorders (ASD), characterized by severe impairments in social communication and unusual, restricted, or repetitive behaviors, comprise a series of neurodevelopmental disorders [1]. The signs of autism include difficulty in communicating, using language, talking about or understanding feelings, disinclination to share or engage in reciprocal play with others, lack of behaviors such as eye contact and joint attention, and sensitivity to physical contact [2]. It is noted that the incidence of autism is far beyond the public imagination. It is estimated that 1 in 59 children had autism in the U.S. in 2018, while the number of autistic patients reached 67 million worldwide [3]. Children with ASD bring about enormous costs for families and governments, including special education services and parental productivity loss [4]. ASD individuals can be severely stressed due to the difficulty of proper mutual communication and being misunderstood. Because of the lack of necessary social skills, their personal lives are impacted and their career opportunities are hampered, also resulting in a burden for society. Although there is currently no cure for autism, authoritative research suggests that early behavioral treatments can improve the symptoms of children with autism, and the earlier the intervention, the better the effect [5], [6]. Thus, the ability to screen toddlers with ASD in the early period is significant, since early screening and early diagnosis are the premise of early intervention. The importance of early screening of children with autism has also been highlighted in recent practice guidelines issued by the American Academy of Pediatrics [7].

Currently, the screening and diagnosis of autism are conducted based on developmental history, assessment scales, and behavior observation by professional clinicians [8]. In the process of clinical diagnosis, the physicians observe and record behaviors of children referring to assessment criteria and scales, such as the autism diagnostic interview-revised (ADI-R) and the autism diagnostic observation schedule (ADOS), which are considered to be the most standard tools in autism diagnosis [9], [10]. ADI-R refers to an interview with parents to collect children's behavioral manifestations in detail. ADOS is mainly employed to assess the ability of language communication, interpersonal communication, playing games, and imagination of individuals with suspected autism or other pervasive developmental disorders. Due to the different perspectives and levels of clinical experience maintained by different clinicians, there are some variabilities across their diagnostic results, since the process of assessing autism based on these scales heavily relies on manual observation. Thus, accurate diagnosis requires extensive clinical experience. Besides, the time required for ASD diagnosis is long [11] and qualified professional clinicians are in short supply. To cope with the above challenges in clinical autism screening, increased research efforts using technical means have been made to facilitate an objective and automatic process of autism diagnosis.

Manuscript received February 14, 2020; revised June 4, 2020; accepted August 5, 2020. Date of publication September 23, 2020; date of current version May 19, 2022. This work was supported by the National Natural Science Foundation of China under Grant 61733011 and Grant 51575338. This article was recommended by Associate Editor S. Chen. (Corresponding authors: Xiu Xu; Honghai Liu.)

Jingjing Liu and Zhiyong Wang are with the State Key Laboratory of Mechanical System and Vibration, Shanghai Jiao Tong University, Shanghai 200240, China, and also with the State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen 518055, China (e-mail: lily121@sjtu.edu.cn).

Kai Xu and Bin Ji are with the State Key Laboratory of Mechanical System and Vibration, Shanghai Jiao Tong University, Shanghai 200240, China.

Gongyue Zhang is with the School of Computing, University of Portsmouth, Portsmouth PO1 3HE, U.K.

Yi Wang, Jingxin Deng, Qiong Xu, and Xiu Xu are with the Department of Child Health Care, Children's Hospital of Fudan University, Shanghai 201102, China (e-mail: xuxiu@fudan.edu.cn).

Honghai Liu is with the State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen, China (e-mail: honghai.liu@icloud.com).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCYB.2020.3017866.

Digital Object Identifier 10.1109/TCYB.2020.3017866

Various methods are employed to collect and process autism-related information [12]–[16], such as EEG recording systems, wristbands with accelerometers, fMRI, vision-based and vocalization-based approaches, etc. Among these technical methods, computer vision methods [17], [18] especially stand out since they are a more intuitive way to model the pathological mechanism of autism based on behavioral factors. By capturing videos of subjects in an unconstrained environment, these methods can provide quantitative analysis of human language, eye gaze, expressions, and actions that can reflect typical autism patterns.

There is evidence that reduced levels of social attention and social communication, as well as increased repetitive behavior with objects, are early markers of ASD between 12 and 24 months of age [7]. Social attention and communication indicators of ASD include decreased response to one's name being called (i.e., "orienting to name"), reduced visual attention to socially meaningful stimuli, and less frequent use of joint attention and communicative gestures. Besides, failure to understand language instructions is also an essential observational indicator in the screening and diagnosis of ASD. Some researchers [19]–[21] have focused on recognizing one or two early indicators of autism with computer vision algorithms. In contrast, we aim to develop a vision-based system to assist autism screening in which various early indicators of autism are taken into consideration comprehensively. A multisensor platform is elaborately designed and built to collect video information from different views. In this article, we focus on describing toddlers' ability to disengage from a nonsocial stimulus and respond to language instructions. Incomprehension or neglect of primary interactive language is taken as a severe defect in toddlers with autism because an appropriate reaction to basic vocabulary is the first step in social communication. A novel experimental protocol is proposed in which toddlers are presented with toys as a central stimulus, and their ability to disengage from the toys and respond to instructions is tested. Computer vision algorithms, including hand detection and gaze estimation, are employed to assess the performance of toddlers, and their validity is verified by comparison against human codings. The main contributions of this article can be summarized as follows.

1) A novel experimental protocol called response-to-instructions (RTI) is able to assist the screening of autism, and relevant details are standardized for subsequent unified assessment.

2) Appropriate technical solutions are proposed to automatically assess the protocol in unconstrained conditions. Besides, a database called TASD (the ASD database) including hand movements of children is established, which can be used for the subsequent analysis of children's hand gestures.

3) A multivision sensor system is constructed to capture ASD children's quantitative behavior ranging from body movement to emotional states.

The remainder of this article is organized as follows. Section II reviews the related work on computational methods for characterizing autism symptoms. Section III presents the proposed protocol RTI and its hardware platform, as well as the related algorithms for realizing automatic assessment. Section IV describes experiments on ASD children and typically developing children and the results that validate the feasibility of RTI. Finally, this article is concluded in Section V with future work.

II. RELATED WORK

Core symptoms of autism can be grouped into three categories: 1) social interaction disorder; 2) verbal and nonverbal communication disorder; and 3) narrow interests and stereotypy. We mainly focus on the former two characteristics. Social interaction defects of ASD children are characterized by external behaviors, such as avoiding eye contact, lack of interest in and response to the human voice, lack of interest in socializing, and so on. Internally, ASD children have dysfunction in emotional perception, which manifests as difficulties in recognizing emotional and social information from faces. The objective measurement of these symptoms [22]–[24] would undoubtedly enhance the reliability of assessment methods. In this regard, various computer vision methods [25], [26] are employed to generate automated measurements and reveal intrinsic information. Wang et al. [27] proposed an objective and effective method to assist autism screening: a multisensor system is built for 3-D gaze direction estimation to assess a common clinical task in the autism diagnosis process, response to name calling. The experimental results on ten adults and seven children (five ASD subjects and two healthy subjects) achieved an average classification score of 92.7%. Joint attention also plays a major role in the development of autism, in which eye gaze is a key factor [28]. Courgeon et al. [29] attempted to simulate joint attention with virtual humans, which are endowed with the ability to follow the user's attention by eye tracking. Some works [30]–[32] attempt to uncover the characteristics of facial expressions that are notably distinct from those of typically developing (TD) children. Leo et al. [20] used a single-camera system to assess the capability of ASD children to produce facial expressions. A comparison of the system's outputs with the evaluations performed by psychologists made it evident that the proposed system could perform quantitative analysis and overcome human limitations in observing and understanding behaviors. ASD researchers have also proposed automatic emotion annotation solutions [33] to help autistic patients perceive facial expressions of emotion in their social lives. Hashemi et al. [21] provided computer vision tools for the early detection of autism based on three critical autism observation scale for infants (AOSI) activities that assess visual tracking, disengagement of attention, and sharing interest, respectively; visual attention is assessed using head motion by tracking facial features.

As for communication disorders, ASD children fail to use the right body language to express desires or transmit messages in terms of nonverbal communication. A common scene is that the child may pull an adult's hand toward what he wants without corresponding facial expressions and eye contact. However, there are few relevant studies on ASD children's hand gestures and body language using computer vision methods. For verbal communication disorders embodied in ASD children, the ability of language comprehension can be impaired in varying degrees. Since impairment of prosodic processing, in particular, is a common feature of ASD children, Depriest et al. [34] investigated prosodic phrasing in ASD in both language and music by using event-related brain potential (ERP) and behavioral methods.

To summarize, these related works confirm the potential effectiveness of computational approaches to assist in detecting and measuring autistic markers. This work delves into both the language comprehension ability and the social interaction tendency of ASD children. Hand movement and gaze direction are provided as technical means for quantitative measurement and comparison with TD children. To the best of our knowledge, this is one of the first studies that realizes automatic assessment of ASD children's aural comprehension ability by observing their hand movement indirectly.


III. PROTOCOLS AND METHODS

A. Response to Instructions Protocol

1) Sites and Equipment: A platform is designed as shown in Fig. 1(a) in order to evaluate the interactive activities in the RTI protocol. The child and the clinician sit face-to-face on the two chairs, interacting on the tabletop. Two kinds of toys are provided to the child in proper order: 1) two rattling balls of 4.5 cm in diameter and 2) one wind-up tortoise toy. Three RGB sensors (Logitech BRIO) and one depth sensor (Microsoft Kinect 1.0) are located at specific positions to record meaningful scenes from particular angles and to unify the data analysis. This multisensor platform ensures the coverage of multisource data collection from different angles, including 2-D images, 3-D images, human skeletons, and audio. Thus, it is not limited to the proposed protocol RTI and can be extended to other experimental tasks, which will be explored in our future work.

Fig. 1. (a) 3-D representation of the designed experimental scene. The experimental site and indoor equipment have fixed positions and sizes for unified data acquisition and processing. Sizes of the room, the table, and the chair are 4 m × 2.5 m, 0.6 m × 0.6 m × 0.6 m, and 0.25 m × 0.25 m × 0.25 m, respectively. Three RGB cameras, C1, C2, and C3, are located at the top, left front, and flank of the tabletop, respectively. The Kinect is hung on the wall to acquire the child's upper-body skeleton, at an angle of depression of 15°. (b) Top-view snapshot of the experiment scenario from C1. (c) Left-front-view snapshot of the experiment scenario from C2. (d) Toys provided to children, with annotated dimensions.

In this protocol, recordings from only two cameras are used: RGB cameras C1 and C2. C1 is 1 m above the tabletop and is mainly used for detecting the movement of toys and human hands. C2 is about 1 m away from the left front of the table, at an angle of depression of 15°. C2 provides more intuitive views of the experiment process than C1, which can be utilized to analyze the children's facial expressions and attention states. The RGB camera images are saved at 25 frames/s with a resolution of 800 × 600.

2) Protocol Architecture: The proposed protocol RTI aims to explore whether the child will respond to the instructions, as well as the abilities of understanding preliminary language and social communication. In practice, the clinician occasionally asks the child to hand over the toy for interactive play. TD toddlers aged between 16 and 32 months can mostly understand simple human language and are willing to respond and interact with others, while ASD toddlers may behave differently. As depicted in Fig. 2, the audio stream and video streams captured by the aforementioned platform are utilized to assess each round of tests with specific technical methods. Once the instructions are detected in the audio data by speech recognition, the video data near this time node are extracted as the input to the RTI method. Taking the object detection results and gaze direction results as inputs simultaneously, the evaluation method outputs the automatic assessment result for each round of tests. The yet another robot platform (YARP) [35] is employed to maintain multisource data collection and processing. To deal with the synchronization problem between different vision sensors, a multithreaded program is constructed, in which each sensor owns a separate thread and an additional thread is used to control the start and end of the other threads.
the language comprehension ability and social interaction ten- At most four rounds of test would be executed to ensure the
dency of ASD children. Hand movement and gaze direction rationality and validity of the protocol. Experimental processes
are provided as technical means for taking quantitative mea- of the first two rounds and the last two rounds of test are the
surement and comparison with TD children. To the best of our same except for the used toys in the interactions. For each
knowledge, this is one of the first studies that realize automatic kind of toy, at least one round of test will be taken. If the
assessment of ASD children’s aural comprehension ability by child fails the first round of test, a second round of test would
observing their hand movement indirectly and skillfully. be taken. The clinician would stretch out his hands when giv-
ing instructions as a hint to reduce the level of difficulty in
the second round of test. The specific experimental process
III. P ROTOCOLS AND METHODS
is designed as follows. At the beginning of the experiment,
A. Response to Instructions Protocol the parent is asked to play toy cars with the child to help
1) Sites and Equipments: A platform is designed as shown him adapt to the environment. After several minutes, the par-
in Fig. 1(a) in order to evaluate the interactive activities in ent would be replaced by the clinician. Then, the child would
the RTI protocol. The child and the clinician would sit face- be provided with another kind of toys: rattling balls or wind-
to-face on the two chairs, interacting on the tabletop. Two up tortoise as the central stimulus. The clinician may show
kinds of toys will be provided to the child in proper order: them how to play with the toys and give them some guidance.
1) two rattling balls of 4.5 cm in diameter and 2) one tor- When the child is playing with the toys, the clinician gives
toise winding toy. Three RGB sensors (Logitech BRIO) and the instructions “XX, give the ball/tortoise (XX is the child’s
one depth sensor (Microsoft Kinect 1.0) are designed to be name) to me” in a clear voice. If the child understands and
located at specific positions to record meaningful scenes under follows the instructions and delivers the toy to the clinician,
particular angles and unify data analysis. This multisensor plat- the clinician takes the toy and interacts with the child. Then,
form ensures the coverage of multisource data collection from this round of test ends. If the child does not respond to the
different angles, including 2-D images, 3-D images, human instructions, the clinician will repeat words in 1 min for the
skeletons, and audio. Thus, it is not limited to the proposed second time of test. This is regarded as one round of test.
protocol RTI and can be extended to other experimental tasks, The score for each round is illustrated in Fig. 2 depending on
which would be explored in our future work. the child’s performance and the final score is the sum of four
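The round structure described above (two toy types, a plain round followed by a hint round when the child does not respond) can be summarized in a small control-flow sketch. The function name and the per-round assessment callback are hypothetical; the per-round scores are left to the RTI evaluation rather than hard-coded here.

```python
def run_protocol(assess_round, toys=("ball", "tortoise")):
    """Run up to four rounds: for each toy, a plain round and, if needed, a hint round.
    assess_round(toy, with_hint) is a hypothetical callback returning "R" or "N"."""
    outcomes = []
    for toy in toys:
        result = assess_round(toy, with_hint=False)       # round 1 (or 3): no hint
        outcomes.append((toy, False, result))
        if result == "N":                                 # child did not respond to the plain round
            result = assess_round(toy, with_hint=True)    # round 2 (or 4): clinician extends her hands
            outcomes.append((toy, True, result))
    return outcomes
```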


Fig. 2. Overview of the whole protocol. There will be at most four rounds of tests depending on the subject's performance. Both the audio data flow and the video data flow are used to assess the subject's performance. The audio data flow offers a trigger for assessment via speech recognition. Then, the video data from different sources are used for object detection and gaze estimation, respectively. The relevant results are combined to score the subject's behavior. The final score is the sum of the four rounds.

TABLE I. mAP of Different Classes Over 0.5 IoU

B. Algorithms

The processing algorithms for the automatic measurement of the RTI protocol are divided into two parts: the interpretation algorithms for the analysis of human behavior, and the evaluation part that assesses the RTI protocol based on the results of the interpretation algorithms.

1) Interpretation Algorithms: The interpretation algorithms mainly detect human hand movement and attention states by using object detection and gaze estimation algorithms.

1) Object Detection: The successive images captured by camera C1 are used to detect the locations of the toys and the human hands on the tabletop. Since a hand is simply an instantiation of one particular object, it is taken as another object different from the toys used. Due to the flexible shapes of human hands and the occlusion in the interaction process, deep learning-based algorithms are employed for object detection instead of traditional object detection algorithms. In this article, a single-shot multibox detector (SSD) [36] model is taken as the preliminary model for object detection because of its high speed and high accuracy. In practice, the SSD model pretrained on the COCO dataset [37] is taken as our starting point, followed by transfer learning on our own dataset. The hands of the children, the hands of the clinician, the balls, and the wind-up toy are labeled as four kinds of objects. The original images from C1, sized 800 × 600 pixels, are cropped to squares of 500 × 500 pixels to eliminate unnecessary background. More than 15 000 images from ten subjects are used to build our own dataset, called TASD, in which images are shuffled and split into three parts: a) 50% for training; b) 40% for testing; and c) 10% for evaluation. The network is trained for 200k steps using the RMSProp optimizer with a starting learning rate of 0.004, a weight decay of 0.05, and a momentum of 0.9. Although the trained model performed well in detecting the children's hands, the clinician's hands, and the wind-up toy, the mean average precision (mAP) of the detection of the balls is relatively low. The unsatisfactory detection results for the balls may be due to the misdetection of small objects by the SSD model, since the balls are commonly occluded by hands. A simple but effective method using traditional image processing is utilized to resolve the misdetection of the balls. Color identification based on the HSV space is first applied to the RGB images captured by camera C1. Then, two requirements are imposed to pick the contours of the balls from the possible contours in the binary image, since the processed binary image has many noisy pixels and interfering candidates. On the one hand, the area, aspect ratio, width, and height of a candidate contour have to satisfy certain thresholds. On the other hand, the center of the ball's bounding box has to be located inside the table's contour. The table contour is also detected using color threshold segmentation in the HSV space. As shown in Table I, the mAP of the four different classes of objects over 0.5 IoU exceeds 90%.
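A minimal OpenCV sketch of this color-plus-contour fallback is given below. The HSV bounds and the area/aspect-ratio thresholds are illustrative assumptions rather than the values used in the paper.

```python
import cv2
import numpy as np

def find_ball_centers(bgr, lower_hsv, upper_hsv, table_contour,
                      min_area=50, max_area=2000, ar_range=(0.7, 1.4)):
    """Color thresholding in HSV followed by contour filtering; a contour is kept only if its
    area and aspect ratio pass the thresholds and its center lies inside the table contour."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lower_hsv, upper_hsv)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centers = []
    for c in contours:
        area = cv2.contourArea(c)
        x, y, w, h = cv2.boundingRect(c)
        if not (min_area < area < max_area and ar_range[0] < w / float(h) < ar_range[1]):
            continue                                   # reject noise by size and shape
        cx, cy = x + w / 2.0, y + h / 2.0
        if cv2.pointPolygonTest(table_contour, (cx, cy), False) > 0:
            centers.append((cx, cy))                   # center must lie inside the table
    return centers
```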


Fig. 3. (a) Cropped head image sample with the estimated head pose. (b) Image samples of the left eye and right eye. (c) Schematic of the eye model.

2) Gaze Estimation: Typical social communication and interaction impairments of ASD children can be reflected by their visual attention and gaze patterns. Thus, gaze direction is another significant indicator to be measured in the RTI protocol. Images captured by camera C2 are used for the children's head pose estimation and subsequent gaze estimation. In an unconstrained environment, accurate head pose estimation is the precondition and guarantee for gaze direction estimation. A fine-grained convolutional neural network called Hopenet [38] is employed due to its remarkable performance in difficult scenarios. Different from traditional head pose estimation methods, this model extracts the 3-D Euler angles (roll, yaw, and pitch) from 2-D images by using a multiloss network. Given a frame image of the video stream from C2, a dlib face detector is utilized to mark the location of the child's face, which is about 1.2–1.4 m from the camera. As shown in Fig. 3(a), the cropped image (approximately 90 × 90 pixels) around the child's face is taken as the input to the Hopenet pretrained on 300W-LP, a large synthetically expanded dataset. In addition to the head pose, the pupil center is also closely related to the human gaze. To begin with, facial landmarks on the child's face are automatically detected and tracked [39]. Then, the landmarks around the eyes locate the square eye region, which is taken as the region of interest. Within the cropped eye region image shown in Fig. 3(b) (approximately 14 × 10 pixels per eye), a robust eye center localization method [40] is applied to detect the 2-D pupil center, pL for the left eye and pR for the right eye. Both the raw pixel values and the gradient intensity values are used to train a support vector regression (SVR) model on two large image datasets, LFPW and HELEN, to detect the eye center.
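The face-cropping step that feeds the head pose network can be sketched as follows. The padding ratio, the output size, and the `pose_net` forward pass are assumptions for illustration; the paper specifies only a dlib detector and a roughly 90 × 90 face crop.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def crop_face(bgr, pad=0.2, out_size=224):
    """Detect the child's face with dlib and return a padded square patch, resized to the
    assumed input resolution of a Hopenet-style head pose network (hypothetical pose_net)."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)                       # upsample once to catch small faces
    if not faces:
        return None
    f = max(faces, key=lambda r: r.width() * r.height())
    cx, cy = (f.left() + f.right()) // 2, (f.top() + f.bottom()) // 2
    half = int(max(f.width(), f.height()) * (0.5 + pad))
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    patch = bgr[y0:cy + half, x0:cx + half]
    return cv2.resize(patch, (out_size, out_size))

# yaw, pitch, roll = pose_net(crop_face(frame))     # hypothetical forward pass of the pose network
```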
After obtaining the head pose and the pupil centers, a two-eye model-based gaze estimation method [41] is adopted to detect the gaze direction. Fig. 3(c) depicts the 3-D model of the two eyes of a subject. Taking the left eye as an example, the visual axis is defined as the 3-D line connecting the cornea center O_c^L and the point of gaze P_G. The line connecting the eyeball center O_e^L and the pupil center P_i^L is defined as the optical axis. The angle of deviation between the visual axis and the optical axis is known as Kappa, which is taken as a constant. The unit vector of the optical axis V_o^L is calculated from the 3-D locations of O_e^L and P_i^L under the world coordinate system, which is built at the center of C2:

V_o^L = (P_i^L - O_e^L) / r_e    (1)

where r_e = ||P_i^L - O_e^L||_2 is the radius of the eyeball, normally approximately 12.4 mm. With the aid of the Kinect, the 3-D pupil center P_i^L can be derived from the estimated 2-D pupil center p^L under the image coordinate system of camera C2. The calibration of camera C2 and the Kinect is performed in advance so that a 2-D pixel position in C2's image coordinate system can be converted into a 3-D location in the world coordinate system by taking the Kinect as an intermediary. The eyeball center O_e^L can be calculated through the coordinate system conversion from the head coordinate system to the world coordinate system, since the relative position of the eyeball center in the head coordinate system can be regarded as a constant. Thus, the eyeball center O_e^L is estimated as

O_e^L = R_H · (O_e^{L,H})^T + t_{H,W}    (2)

where R_H denotes the rotation matrix of the head coordinate system obtained from the estimated head pose angles, O_e^{L,H} denotes the eyeball center in the head coordinate system, which is set to the center of the two inner eye corners of the subject, and t_{H,W} represents the translation from the head coordinate system to the world coordinate system, i.e., the 3-D location of the head coordinate system's origin under the world coordinate system. Given the unit vector of the optical axis V_o^L, the unit vector of the visual axis V_g^L can be computed by rotating the optical axis by the Kappa angle as follows:

V_o^L = [ cos(ϕ^L) sin(θ^L),  sin(ϕ^L),  -cos(ϕ^L) cos(θ^L) ]^T
V_g^L = [ cos(ϕ^L + β^L) sin(θ^L + α^L),  sin(ϕ^L + β^L),  -cos(ϕ^L + β^L) cos(θ^L + α^L) ]^T    (3)

where θ^L and ϕ^L denote the horizontal and vertical components of V_o^L, and α^L and β^L are the corresponding compensations from the Kappa angle, taken as the approximate values of 5° and 1.5°, respectively.
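A small numpy sketch of (1) and (3), recovering the optical-axis angles from the 3-D pupil and eyeball centers and applying the Kappa compensation; it assumes the two points are already expressed in the world coordinate system and uses the 5° and 1.5° values quoted above.

```python
import numpy as np

def optical_axis(pupil_3d, eye_center_3d):
    """Unit vector of the optical axis, Eq. (1): V_o = (P_i - O_e) / ||P_i - O_e||."""
    v = np.asarray(pupil_3d, float) - np.asarray(eye_center_3d, float)
    return v / np.linalg.norm(v)

def visual_axis(v_o, alpha_deg=5.0, beta_deg=1.5):
    """Rotate the optical axis by the Kappa compensation angles, Eq. (3)."""
    phi = np.arcsin(v_o[1])                  # vertical angle: y = sin(phi)
    theta = np.arctan2(v_o[0], -v_o[2])      # horizontal angle from x and -z
    a, b = np.radians(alpha_deg), np.radians(beta_deg)
    return np.array([
        np.cos(phi + b) * np.sin(theta + a),
        np.sin(phi + b),
        -np.cos(phi + b) * np.cos(theta + a),
    ])
```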


2) Evaluation Method: The automatic evaluation of the protocol RTI starts only when the clinician gives meaningful instructions. Thus, it is crucial to obtain accurate time information of the instructions. The evaluation part is triggered once key words in the meaningful instructions are detected using speech recognition. An off-the-shelf speech recognition product is used: HKUST iFlytek (Xunfei) [42]. Once keywords such as "ball," "tortoise," or "give" are detected, both the gaze direction and the objects' locations are measured. According to the proposed protocol, a valid response to the instructions while playing with toys should be the intention of interaction with the person who issued the instructions. Ideally, the TD child would follow the instructions to deliver the toy to the clinician and turn his attention to the clinician for deeper eye communication. On the contrary, ASD children may keep playing alone or just glance at the clinician without any handing action.

On the one hand, a response to the instructions is characterized by visual attention on the person who gave the instructions. The final gaze vector G is calculated as the average of the gaze directions of the left eye V_g^L and the right eye V_g^R. When the gaze directions V_g^L and V_g^R are not available, G is estimated approximately as the orientation of the child's head:

G = (V_g^L + V_g^R) / 2    or    G = R_H · [0, 0, -1]^T.    (4)

Given the gaze vector G = {x_G, y_G, z_G}, ϕ denotes the eye gaze direction in radians transformed from the gaze vector G:

ϕ = atan2(x_G, y_G)    (5)

where atan2(x, y) denotes the function calculating the azimuth angle, that is, the included angle between the vector (x, y) and the x-axis. The child's attention on the clinician is detected when the 2-D gaze vector g falls in the zone Z0, as shown in Fig. 4(a). The boundaries ϕ1 and ϕ2 of zone Z0 are empirically defined by assuming the position of the clinician is fixed. On the other hand, the response to the instructions is reflected in executing the corresponding actions. Thus, the locations of the toys are measured against certain criteria. For the first-round test of each kind of toy, the criterion is that the child stretches his hands toward the clinician. As illustrated in Fig. 4(b), Pi has to fall in zone Z1, which implies that the child stretches out his hand. Note that Xtc is regarded approximately as the horizontal baseline of Z1 since the clinician's hands are occluded by her head in the first round. For the second-round test of each kind of toy, the clinician extends her hands, so the criterion changes: Pi has to fall in the zone Z2, which represents the neighboring area of the clinician's hands, as shown in Fig. 4(c). This means that the child must deliver and place the toy on the clinician's hands.

Fig. 4. Evaluation criteria. (a) ϕ denotes the azimuth angle of the 2-D gaze vector g, which is the projection of the 3-D gaze vector G on the camera plane. ϕ ∈ [ϕ1, ϕ2] indicates that the child's visual attention is on the clinician. (b) Pi denotes the center of the detected bounding box of the toy. Xtc denotes the horizontal distance between the center of the table and the left edge of the image. The region Z1 is determined by εx and εy both horizontally and vertically. (c) P1 and P2 denote the centers of the detected bounding boxes of the clinician's hands and the child's hands, respectively. Z2 is a circular region centered at P1 with a radius of ε.

As for the critical threshold values related to the zones Z0, Z1, and Z2, some of them are defined manually, ϕ1 and ϕ2 as 3π/4 and π, while the others are calculated as shown in Algorithm 1. D_{k,t} denotes the locations of P1, Pi, and Xtc at frame t for subject k, and f0 is the initial frame at which the instructions are detected. The basic idea is to treat the performance data of two groups, the response group and the non-response group, as two Gaussian distributions, and the threshold is determined by the 3σ rule. For rounds 1 and 3 of the test, Δ_{y,k,t} denotes the stretched-forward distance of subject k at frame t, and Δ_{y,k} denotes a key threshold value for subject k. The values of Δ_{y,k} calculated in the response data group RD are naturally larger than those calculated in the non-response data group ND, since a subject who responds will stretch his hand out toward the clinician. Thus, the threshold value ε_y defining the region Z1 is determined as (1/2)(μ1 − 3σ1 + μ2 + 3σ2). A similar pattern is applied to ε.

Given the criteria above, the number of frames tf1 satisfying P_{i,t} ∈ Z1 (Δ_{y,k,t} > ε_y, Δ_{x,k,t} < ε_x) or P_{i,t} ∈ Z2 (Δ_{k,t} < ε) is counted within 3 s from the moment the instructions are detected. As for the gaze direction, the number of frames tf2 satisfying g_t ∈ Z0 (ϕ1 < ϕ < ϕ2) is counted within 3 s before and after that moment. A "response" tag is assigned to a video clip only when tf1 and tf2 reach certain values, TF1 = 5 and TF2 = 8. As shown in Algorithm 1, TF1 is closely related to the calculation of the boundaries of the regions Z1 and Z2. Taking the professional clinician's experience and the collected data into consideration, TF1 is set to 5, implying a minimum duration of 0.2 s for hand movements meeting the requirements. Generally, the fixation time of an attentive human gaze is at least about 0.2 s [43]. Considering the practical gaze estimation errors, TF2 is set to 8, corresponding to a duration of 0.32 s, to ensure valid human attention detection.
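The frame-counting decision described above can be written compactly as follows. The per-frame inputs are assumed to have already been produced by the object detector and the gaze estimator, and the thresholds are the ϕ1, ϕ2, TF1, and TF2 values quoted in the text.

```python
import numpy as np

PHI_1, PHI_2 = 3 * np.pi / 4, np.pi   # boundaries of zone Z0 (attention on the clinician)
TF1, TF2 = 5, 8                       # minimum frame counts (0.2 s and 0.32 s at 25 frames/s)

def gaze_azimuth(gaze_vec):
    """Eqs. (4)-(5): azimuth angle of the 2-D projection of the gaze vector."""
    return np.arctan2(gaze_vec[0], gaze_vec[1])

def judge_response(toy_in_zone, gaze_vectors):
    """toy_in_zone: per-frame booleans (P_i in Z1 or Z2) for the 3 s after the instruction;
    gaze_vectors: per-frame gaze vectors for the 3 s before and after the instruction."""
    tf1 = int(np.sum(toy_in_zone))
    phi = np.array([gaze_azimuth(g) for g in gaze_vectors])
    tf2 = int(np.sum((phi > PHI_1) & (phi < PHI_2)))
    return tf1 >= TF1 and tf2 >= TF2   # tag "R" only when both criteria are met
```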


Algorithm 1 Evaluation Threshold Definition
Input:
1: Data sequences of two groups: the response data group RD and the non-response data group ND
2: RD = [D_1, D_2, ..., D_m]
3: ND = [D_{m+1}, D_{m+2}, ..., D_{m+n}]
4: D_k = [D_{k,1}, D_{k,2}, ..., D_{k,t}, ...]
5: where D_{k,t} = [(P^k_{1x,t}, P^k_{1y,t}), (P^k_{ix,t}, P^k_{iy,t}), X^k_{tc,t}], t = f_0, ..., f_0 + 75
Output: Threshold values ε_x, ε_y, ε
6: Rounds 1 and 3: ε_x = 0, ε_y = 0
7: for k = 1 to m + n do
8:   Δ_{y,k} = 0
9:   for t = f_0 + 1 to f_0 + 75 do
10:    Δ_{y,k,t} = P^k_{iy,f_0} − P^k_{iy,t}
11:  end for
12:  Δ'_{y,k,t} ← Δ_{y,k,t} sorted in descending order
13:  Δ_{y,k} = Δ'_{y,k,TF1}
14: end for
15: μ_1 = (1/m) Σ_{k=1}^{m} Δ_{y,k},  σ_1^2 = (1/m) Σ_{k=1}^{m} (Δ_{y,k} − μ_1)^2
16: μ_2 = (1/n) Σ_{k=m+1}^{m+n} Δ_{y,k},  σ_2^2 = (1/n) Σ_{k=m+1}^{m+n} (Δ_{y,k} − μ_2)^2
17: ε_y = (1/2)(μ_1 − 3σ_1 + μ_2 + 3σ_2)
18: for k = 1 to m do
19:   for t = f_0 + 1 to f_0 + 75 do
20:     if Δ_{y,k,t} > ε_y then
21:       if Δ_{x,k,t} = |P^k_{ix,t} − X^k_{tc,t}| > ε_x then
22:         ε_x = Δ_{x,k,t}
23:       end if
24:     end if
25:   end for
26: end for
27: Rounds 2 and 4: ε = 0
28: for k = 1 to m + n do
29:   for t = f_0 + 1 to f_0 + 75 do
30:     Δ^2_{k,t} = (P^k_{1x,t} − P^k_{ix,t})^2 + (P^k_{1y,t} − P^k_{iy,t})^2
31:   end for
32:   Δ'_{k,t} ← Δ_{k,t} sorted in ascending order
33:   Δ_k = Δ'_{k,TF1}
34: end for
35: μ_1 = (1/m) Σ_{k=1}^{m} Δ_k,  σ_1^2 = (1/m) Σ_{k=1}^{m} (Δ_k − μ_1)^2
36: μ_2 = (1/n) Σ_{k=m+1}^{m+n} Δ_k,  σ_2^2 = (1/n) Σ_{k=m+1}^{m+n} (Δ_k − μ_2)^2
37: ε = (1/2)(μ_1 + 3σ_1 + μ_2 − 3σ_2)

Algorithm 2 Multimodal Alignment and Processing
1: Raw data:
2:   Audio data: V_1, V_2, ..., V_{f1}, ..., V_{nf1}
3:   Video data 1 (C1): I^1_1, I^1_2, ..., I^1_{f2}, ..., I^1_{nf2}
4:   Video data 2 (C2): I^2_1, I^2_2, ..., I^2_{f2}, ..., I^2_{nf2}
5: Feature alignment:
6:   F_1(t) = Speech_rec([V_{(t−1)·fr}, ..., V_{t·fr}])
7:   F_2(f) = Obj_de(I^1_f) = [P_{1,f}(P_{1x,f}, P_{1y,f}), P_{2,f}(P_{2x,f}, P_{2y,f}), P_{i,f}(P_{ix,f}, P_{iy,f}), X_{tc,f}]
8:   F_3(f) = Gaze(I^2_f) = δ[(ϕ_t − ϕ_1)(ϕ_2 − ϕ_t)]
9:   dist(f) = ||P_{1,f} − P_{2,f}||, f ∈ [f*_i − Δf, f*_i + Δf] = Σ_{k=1}^{K} d_k N(f | μ_k, σ_k)
10:  f*_i = (t*_i − 1) · fr · n_{f2}/n_{f1}
11:  f*_i ← arg min_{μ_k} |μ_k − f*_i|
12: Description logic:
13:  Concepts: Response, Clinician
14:  Roles: IssueInstructionsBy, HandToyTo, GazeAt
15:  Response ≐ ∃IssueInstructionsBy.Clinician ⊓ ♦+_3s(∃HandToyTo.Clinician) ⊓ [♦+_3s(∃GazeAt.Clinician) ⊔ ♦−_3s(∃GazeAt.Clinician)]
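The 3σ thresholding at the heart of Algorithm 1 can be expressed in a few lines of numpy. This is a sketch of the statistical step only: the per-subject key value is taken as the TF1-th largest per-frame displacement, as in the pseudocode, and the two group statistics are combined by the 3σ rule.

```python
import numpy as np

def per_subject_key_value(stretch_per_frame, tf1=5):
    """Sort the per-frame stretched distances in descending order and take the TF1-th
    largest value as the subject's key value Delta_{y,k}."""
    return np.sort(np.asarray(stretch_per_frame, float))[::-1][tf1 - 1]

def threshold_from_groups(resp_vals, nonresp_vals):
    """epsilon_y = 0.5 * ((mu1 - 3*sigma1) + (mu2 + 3*sigma2)), where group 1 is the
    response group and group 2 the non-response group."""
    mu1, s1 = np.mean(resp_vals), np.std(resp_vals)
    mu2, s2 = np.mean(nonresp_vals), np.std(nonresp_vals)
    return 0.5 * ((mu1 - 3 * s1) + (mu2 + 3 * s2))
```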
To summarize, the entire course of multimodal data alignment and processing for the RTI protocol is outlined in Algorithm 2. First, the raw data are presented frame by frame as V_{f1}, I^1_{f2}, and I^2_{f2} (f1 and f2 denote the frame indices), since the frame rates of video collection and audio collection are different. Relevant features of the raw data are then calculated using the aforementioned methods. For the audio data, the function F_1(t) denotes the speech recognition result for each second from (t − 1) s to t s, and fr is the sampling frequency. As a two-valued function, F_1(t) returns 1 when keywords are detected in the speech recognition result and returns 0 otherwise. F_2(f) denotes the object detection results for frame f. F_3(f) is quite similar to F_1(t) and returns 1 when the detected gaze angle ϕ_t is in [ϕ_1, ϕ_2]. Given the features extracted from the different source data, the alignment of F_1(t) and F_2(f) is tackled by aligning the key nodes of the different features. F_1(t) and F_2(f) are both temporally ordered sequences with their own feature distributions. The detection of keywords in F_1(t) is matched with the abrupt change of the clinician's hand movement characterized by F_2(f). t*_i denotes the time of detecting keywords in the instructions for the ith round of test, that is, F_1(t*_i) = 1, and f*_i denotes the time node of the video features F_2(f) and F_3(f) corresponding to t*_i. f*_i is determined by describing ||P_{1,f} − P_{2,f}||, f ∈ [f*_i − Δf, f*_i + Δf], with a Gaussian mixture model and finding the crest nearest to t*_i. Finally, the guideline for evaluating the response to instructions is modeled by the temporal description logic (DL) language ALCQIT [44]. The DL concepts are Response and Clinician, and the candidate DL roles are IssueInstructionsBy, HandToyTo, and GazeAt. The expressivity of the RTI guideline can be understood as follows: "the toy is handed to the clinician within 3 s after the clinician issues instructions, and gaze at the clinician is detected within 3 s before or after the instructions; such a circumstance is recognized as one time of response."
IV. EXPERIMENTS AND RESULTS

A. Experiments

Twenty toddlers between 16 and 32 months of age, 15 of whom had ASD, together with a comparison group of TD toddlers (N = 5) of similar age, participated in the experiments to validate the protocol RTI and the related methods. Participants were recruited from communities in Shanghai, China. The children with ASD had been diagnosed by two professionals via ADI-R and ADOS in advance. Toddlers with known vision or hearing deficits were excluded. The experiments passed the ethical review. After the experiments, two professional clinicians gave human coding results by watching the recorded videos.

B. Results

The experimental results of valid response instances using the proposed algorithms over round 3 and round 4 are presented in Fig. 5. The distance curves often had many perturbations and discontinuities due to factors including noisy object detection results, occlusion, and missed detections. To overcome these factors, we analyze the number of frames whose detection results satisfy certain thresholds, corresponding to the green bars along the timeline in Fig. 5. The instances depicted in Fig. 5 are thus recognized as having a valid response since the number of desirable frames reaches TF1 and TF2.


Fig. 5. (a) Time series of the toy's location for round 3 after the instructions are given (time = 0), measured in pixels. The red dots are the vertical distance between the toy's current location and its initial location, that is, Δ_{y,k,t} for frame t and subject k. The blue crosses represent the horizontal distance between the center of the toy and the table, that is, Δ_{x,k,t} for frame t and subject k. Time segments meeting the threshold values are labeled as green bars along the timeline. Screenshots from the recorded video are displayed. (b) Time series of the toy's location for round 4. The green dots represent the distance between the center of the toy and the clinician's hands, that is, Δ_{k,t} for frame t and subject k. (c) Time series of the gaze direction. The green diamonds denote the gaze angle ϕ′ in degrees, corresponding to ϕ in radians. ϕ′_1 and ϕ′_2 are the degree values converted from ϕ_1 and ϕ_2.

Fig. 6. Experimental results assessed by the clinician and the RTI solution. Each round of tests includes two times. One time of test is considered either a "response" (R) or "none response" (N) depending on the child's performance. Participants #1–#15 are ASD children, while participants #16–#20 are TD children.

Fig. 6 summarizes the results of both the clinician and the RTI method for the assessment of the proposed protocol. Each time of test is judged as "R" (having response) only if the corresponding video clip satisfies (tf1 ≥ TF1) and (tf2 ≥ TF2); otherwise, the test is considered "N" (none response). As shown in Fig. 6, the agreement between the clinical diagnosis and the RTI method achieved 95% for the 20 subjects and 99.2% for the 124 testing times. Inter-rater reliability between them is quantified with a two-way intraclass correlation coefficient (ICC). The results suggest that the reliability between the RTI method and the human codings is excellent, with an ICC of 0.998 (95% confidence interval 0.996–0.999), since an ICC value above 0.9 indicates extremely high consistency. Taking the human codings as ground truth, the sensitivity of the RTI method on the experimental task is 99% and the specificity is 100%. The proposed RTI method achieves the same results as the human codings for TD children without any misdiagnosis. It fails only for part of the tests of a single subject, which only affects the assessed severity of ASD. Compared with the human codings, the assessment by the computational approach fails only for participant #9. The main reason is that erroneous gaze estimation results occurred in the RTI method for that time of test. To be more specific, during the second time of test in round 4 for participant #9, the child did give the toy to the clinician, but the child also reached out with his other hand to hold the clinician's hand and ask the clinician to wind up the tortoise toy. The child's behavior is considered to express needs instead of being a response to the instructions. In fact, pulling an adult's hand or clothing to express needs without eye contact is a common behavior of ASD children.
Apart from comparisons between the RTI method and the human codings, group differences are also studied in terms of engagement degree, response latency, and visual attention on human beings. The children's engagement degree, expressed as the proportion of activities on the desktop during the entire experiment process, is quantified as shown in Fig. 7.


Fig. 7. Proportion of time engaged and response latency in the task for the different groups.

TABLE II. Times and Duration of Children's Gaze at the Clinician

ASD children engaged in the task 73.64% of the time on average, compared to 80.41% for TD children. There is a slight difference between the ASD group and the TD group in average engagement degree, with TD children showing a higher average engagement degree; however, a larger intragroup difference was observed in the ASD group. Some autistic children can be attentive enough that they engaged fully in the activities on the desktop. In contrast, one particular autistic child showed very low engagement in the task, being interested in other things or just looking for his parents.

Response latency, namely, the latency between the instructions being issued and a valid response being detected, is also analyzed as shown in Fig. 7. For TD children, the mean response latency was 1.32 s (SD = 0.84). For ASD children who did respond to instructions, the mean response latency was similar, 1.40 s (SD = 0.77). There is no significant difference between the ASD and TD groups, in spite of a somewhat larger intragroup diversity for the TD group. In fact, only one TD child exhibited the highest latency, as long as 2.9 s. Such performance may be because a few children were rapt in playing with the toys when the instructions were issued. Regarding response latency, some studies [19] revealed a longer latency of orienting to name for ASD children, since motor delays in ASD, such as the inability to coordinate functional movements, may prevent a timely response. However, the same conclusion cannot be drawn here, which may be due to the insufficient number of experimental samples. More instances should be incorporated to reduce the impact of exceptions.

Statistical analysis of the children's visual attention on human beings is also conducted to reveal group differences, as shown in Table II. Successive frames satisfying ϕ ∈ [ϕ1, ϕ2] for more than 0.3 s are taken as one valid instance of visual attention on the clinician sitting opposite. Both the number of instances and their duration for the TD group are much larger than for the ASD group, which suggests that ASD children have less interest in proactively communicating with another person. Many studies [45] have also found that ASD children show a higher preference for objects and solitary play than for people. Findings consistent with other research verify the symptoms of autism and even provide a quantitative description.
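The episode counting behind Table II (consecutive frames with the gaze angle inside [ϕ1, ϕ2] for at least 0.3 s) can be sketched as follows, assuming the 25 frames/s rate of the recordings.

```python
import numpy as np

def attention_episodes(phi_per_frame, phi1, phi2, fps=25, min_dur_s=0.3):
    """Return (number of valid attention episodes, total duration in seconds), where an
    episode is a run of consecutive frames with phi in [phi1, phi2] lasting >= min_dur_s."""
    phi = np.asarray(phi_per_frame, float)
    in_zone = (phi >= phi1) & (phi <= phi2)
    min_len = int(round(min_dur_s * fps))
    count, total, run = 0, 0, 0
    for flag in np.append(in_zone, False):    # trailing False flushes the last run
        if flag:
            run += 1
        else:
            if run >= min_len:
                count += 1
                total += run
            run = 0
    return count, total / fps
```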
In general, the proposed RTI method is effective and reliable. Despite the effective results, some limitations have yet to be overcome. The inconsistency between the computational method and the human codings results from accidental individual behaviors of autistic children and from wrong detection results. First, techniques for analyzing gaze [24] need to be incorporated to acquire the children's attention more accurately. Second, more samples need to be collected to refine the protocol, since autism covers a wide spectrum of disorders and the performances of autistic children may vary greatly. The criteria for judging a response should be made stricter by considering more occasional scenes. For instance, the unexpected performance of participant #9 will be used to replenish the DL by considering the relationship between the child's hand movement and the clinician's hand movement. More experiments and detailed analysis of the children's eye gaze should be incorporated to improve the protocol and to analyze the impact of gender and age factors in further investigation. Also, the current study only tested the child's response to social stimuli; future research should also consider nonsocial stimuli.
V. CONCLUSION

We proposed a novel protocol called RTI, in which autism risk behaviors are captured and measured automatically to facilitate autism screening. A multisensor system was introduced to capture multisource information, and normative experimental procedures were designed to induce ASD-related symptoms. The analysis of hand movement and gaze direction is provided as a quantitative way to characterize children's performance in the protocol, namely, their response to instructions. Also, a comprehensive dataset of more than 15 000 annotated hand images collected from both ASD children and TD children was built for further research. Comparisons between the proposed automatic method and human coding demonstrated the effectiveness and reliability of the former, while it is more labor-saving and objective. Additionally, some subtle group differences were reported as a supplement for differentiating children with and without ASD. Compared with human coding and qualitative description, computational methods can also provide measurements of quantified timing, which are unavailable to human perception. Such automatic and objective detection by the proposed method allows for further and deeper findings on atypical autism characterization. With the aid of ubiquitous cameras and related technologies [46], this approach could be promoted for wider use, such as increasing access to autism screening in remote areas. In future work, the proposed protocol RTI will be improved in terms of technical methods [47], [48] and experiments. Additional protocols focusing on ASD children's other characteristics will also be explored to build a clinically acceptable autism screening system.
REFERENCES

[1] R. E. Kaliouby, R. Picard, and S. Baron-Cohen, "Affective computing and autism," Ann. New York Acad. Sci., vol. 1093, no. 1, pp. 228–248, 2007.
[2] B. Scassellati, H. Admoni, and M. Mataric, "Robots for use in autism research," Annu. Rev. Biomed. Eng., vol. 14, pp. 275–294, May 2012.
[3] J. Baio et al., "Prevalence of autism spectrum disorder among children aged 8 years—Autism and developmental disabilities monitoring network, 11 sites, United States, 2014 (vol. 67, p. 1, 2018)," Morbidity Mortality Weekly Rep., vol. 67, no. 45, p. 1280, Nov. 2018.
[4] X. Liu, Q. Wu, W. Zhao, and X. Luo, "Technology-facilitated diagnosis and treatment of individuals with autism spectrum disorder: An engineering perspective," Appl. Sci. (Basel), vol. 7, no. 10, pp. 731–736, Oct. 2017.
[5] M. Iliana, C. Tony, and H. Patricia, "A two-year prospective follow-up study of community-based early intensive behavioural intervention and specialist nursery provision for children with autism spectrum disorders," J. Child Psychol. Psychiat., vol. 48, no. 8, pp. 803–812, 2010.
[6] S. Ming, T. A. Mulhern, I. Stewart, L. Moran, and K. Bynum, "Training class inclusion responding in typically-developing children and individuals with autism," J. Appl. Behav. Anal., vol. 51, no. 1, pp. 53–60, 2018.
[7] L. Zwaigenbaum et al., "Early identification of autism spectrum disorder: Recommendations for practice and research," Pediatrics, vol. 136, no. S1, p. S10, 2015.
[8] E. Fernell, M. Eriksson, and C. Gillberg, "Early diagnosis of autism and impact on prognosis: A narrative review," Clin. Epidemiol., vol. 5, pp. 33–43, Feb. 2013.
[9] C. J. Dover and A. L. Couteur, "How to diagnose autism," Archives Disease Childhood, vol. 92, no. 6, p. 540, 2007.
[10] J. L. Matson, M. Nebel-Schwalm, and M. L. Matson, "A review of methodological issues in the differential diagnosis of autism spectrum disorders in children," Res. Autism Spectr. Disord., vol. 1, no. 1, pp. 38–54, 2007.
[11] N. Muty and Z. Azizul, "Detecting arm flapping in children with autism spectrum disorder using human pose estimation and skeletal representation algorithms," in Proc. Int. Conf. Adv. Informat. Concepts, 2017, pp. 33–45.
[12] T. Heunis et al., "Recurrence quantification analysis of resting state EEG signals in autism spectrum disorder—A systematic methodological exploration of technical and demographic confounders in the search for biomarkers," BMC Med., vol. 16, no. 1, pp. 28–37, 2018.
[13] J. R. Sato, M. Vidal, S. de Siqueira Santos, K. B. Massirer, and A. Fujita, "Complex network measures in autism spectrum disorders," IEEE/ACM Trans. Comput. Biol. Bioinformat., vol. 15, no. 2, pp. 581–587, Mar./Apr. 2018.
[14] M. S. Goodwin, M. Haghighi, Q. Tang, M. Akcakaya, D. Erdogmus, and S. Intille, "Moving towards a real-time system for automatically recognizing stereotypical motor movements in individuals on the autism spectrum using wireless accelerometry," in Proc. ACM Int. Joint Conf. Pervasive Ubiquitous Comput. (UbiComp), 2014, pp. 861–872.
[15] J. Wang, Q. Wang, H. Zhang, J. Chen, S. Wang, and D. Shen, "Sparse multiview task-centralized ensemble learning for ASD diagnosis based on age- and sex-related functional connectivity patterns," IEEE Trans. Cybern., vol. 49, no. 8, pp. 3141–3154, Aug. 2019.
[16] R. Ognjen, L. Jaeryoung, D. Miles, S. Bjorn, and R. W. Picard, "Personalized machine learning for robot perception of affect and engagement in autism therapy," Science, vol. 3, no. 19, 2018, Art. no. eaao6760.
[17] T. Li, B. Ni, M. Xu, M. Wang, Q. Gao, and S. Yan, "Data-driven affective filtering for images and videos," IEEE Trans. Cybern., vol. 45, no. 10, pp. 2336–2349, Oct. 2015.
[18] L. Shao, X. Zhen, D. Tao, and X. Li, "Spatio-temporal Laplacian pyramid coding for action recognition," IEEE Trans. Cybern., vol. 44, no. 6, pp. 817–827, Jun. 2014.
[19] K. Campbell et al., "Computer vision analysis captures atypical attention in toddlers with autism," Autism, vol. 23, no. 2, pp. 619–628, 2018.
[20] M. Leo et al., "Computational assessment of facial expression production in ASD children," Sensors, vol. 18, p. 3993, Nov. 2018.
[21] J. Hashemi et al., "Computer vision tools for the non-invasive assessment of autism-related behavioral markers," 2012. [Online]. Available: arXiv:1210.7014.
[22] T. Zhang, W. Zheng, Z. Cui, Y. Zong, and Y. Li, "Spatial–temporal recurrent neural network for emotion recognition," IEEE Trans. Cybern., vol. 49, no. 3, pp. 839–847, Mar. 2019.
[23] H. Meng, N. Bianchi-Berthouze, Y. Deng, J. Cheng, and J. P. Cosmas, "Time-delay neural network for continuous emotional dimension prediction from facial expression sequences," IEEE Trans. Cybern., vol. 46, no. 4, pp. 916–929, Apr. 2016.
[24] G. Boccignone and M. Ferraro, "Ecological sampling of gaze shifts," IEEE Trans. Cybern., vol. 44, no. 2, pp. 266–279, Apr. 2014.
[25] C. Liu, K. Conn, N. Sarkar, and W. Stone, "Online affect detection and robot behavior adaptation for intervention of children with autism," IEEE Trans. Robot., vol. 24, no. 4, pp. 883–896, Aug. 2008.
[26] P. Sarah et al., "Disease prediction using graph convolutional networks: Application to autism spectrum disorder and Alzheimer's disease," Med. Image Anal., vol. 48, pp. 117–130, Aug. 2018.
[27] Z. Wang, J. Liu, K. He, Q. Xu, X. Xu, and H. Liu, "Screening early children with autism spectrum disorder via response-to-name protocol," IEEE Trans. Ind. Informat., early access, Dec. 9, 2019, doi: 10.1109/TII.2019.2958106.
[28] Z. Yucel, A. A. Salah, C. Mericli, T. Mericli, R. Valenti, and T. Gevers, "Joint attention by gaze interpolation and saliency," IEEE Trans. Cybern., vol. 43, no. 3, pp. 829–842, May 2013.
[29] M. Courgeon, G. Rautureau, J.-C. Martin, and O. Grynszpan, "Joint attention simulation using eye-tracking and virtual humans," IEEE Trans. Affect. Comput., vol. 5, no. 3, pp. 238–250, Jul. 2014.
[30] M. Samad, N. Diawara, J. Bobzien, C. Taylor, J. Harrington, and K. Iftekharuddin, "A pilot study to identify autism related traits in spontaneous facial actions using computer vision," Res. Autism Spectr. Disord., vol. 65, pp. 14–24, Nov. 2019.
[31] K. Owada et al., "Computer-analyzed facial expression as a surrogate marker for autism spectrum social core symptoms," PLoS ONE, vol. 13, no. 1, 2018, Art. no. e0190442.
[32] T. Guha, Z. Yang, R. B. Grossman, and S. S. Narayanan, "A computational study of expressive facial dynamics in children with autism," IEEE Trans. Affect. Comput., vol. 9, no. 1, pp. 14–20, Jan.–Mar. 2018.
[33] X. Zhao, J. Zou, H. Li, E. Dellandrea, I. A. Kakadiaris, and L. Chen, "Automatic 2.5-D facial landmarking and emotion annotation for social interaction assistance," IEEE Trans. Cybern., vol. 46, no. 9, pp. 2042–2055, Aug. 2016.
[34] J. Depriest, A. Glushko, K. Steinhauer, and S. Koelsch, "Language and music phrase boundary processing in autism spectrum disorder: An ERP study," Sci. Rep., vol. 7, no. 1, 2017, Art. no. 14465.
[35] G. Metta, P. Fitzpatrick, and L. Natale, "YARP: Yet another robot platform," Int. J. Adv. Robot. Syst., vol. 3, no. 1, p. 2006, 2008.
[36] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[37] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., Apr. 2014, p. 8693.
[38] N. Ruiz, E. Chong, and J. M. Rehg, "Fine-grained head pose estimation without keypoints," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), vol. 1, 2018, Art. no. 215509.
[39] Y. Wang, Y. Hui, J. Dong, B. Stevens, and H. Liu, "Facial expression-aware face frontalization," in Proc. Asian Conf. Comput. Vis., 2016, pp. 375–388.
[40] Z. Wang, H. Cai, and H. Liu, "Robust eye center localization based on an improved SVR method," in Proc. 25th Int. Conf. (ICONIP), Jan. 2018, pp. 623–634.
[41] X. Zhou, H. Cai, Y. Li, and H. Liu, "Two-eye model-based gaze estimation from a Kinect sensor," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2017, pp. 1646–1653.
[42] iFlytek Open Platform. Accessed: Apr. 2019. [Online]. Available: https://www.xfyun.cn/services/voicedictation
[43] T. Ohno and H. Ogasawara, "Information acquisition model of highly interactive tasks," in Proc. ICCS/JCSS, Aug. 1999, p. 26.
[44] A. Scalmato, A. Sgorbissa, and R. Zaccaria, "Describing and recognizing patterns of events in smart environments with description logic," IEEE Trans. Cybern., vol. 43, no. 6, pp. 1882–1897, Dec. 2013.
[45] F. Happe and U. Frith, "The weak coherence account: Detail-focused cognitive style in autism spectrum disorders," J. Autism Develop. Disord., vol. 36, no. 1, p. 5, 2006.
[46] J. Han, L. Shao, D. Xu, and J. Shotton, "Enhanced computer vision with Microsoft Kinect sensor: A review," IEEE Trans. Cybern., vol. 43, no. 5, pp. 1318–1334, Oct. 2013.
[47] H. Zhou, H. Hu, H. Liu, and J. Tang, "Classification of upper limb motion trajectories using shape features," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 6, pp. 970–982, Nov. 2012.
[48] B. Liu, Z. Ju, and H. Liu, "A structured multi-feature representation for recognizing human action and interaction," Neurocomputing, vol. 318, pp. 287–296, Nov. 2018.
Jingjing Liu received the B.E. degree from the Beijing Institute of Technology, Beijing, China, in 2017. She is currently pursuing the Ph.D. degree with the Robotics Institute, Shanghai Jiao Tong University, Shanghai, China. She is also a Visiting Scholar with the State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen, China. Her current research interests include computer vision and applications in autistic screening and intervention.

Zhiyong Wang (Graduate Student Member, IEEE) received the B.E. degree from the South China University of Technology, Guangzhou, China, in 2016. He is currently pursuing the Ph.D. degree with the Robotics Institute, Shanghai Jiao Tong University, Shanghai, China. He is also a Visiting Scholar with the State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen, China. His research interests include gaze estimation, human motion analytics, and applications in autistic screening and intervention.

Kai Xu received the B.E. degree from the School of Mechanical and Automotive Engineering, Central South University, Changsha, China, in 2018. He is currently pursuing the master's degree with the School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include facial expression recognition, human pose estimation, and gaze estimation.

Bin Ji received the B.E. degree in mechanical engineering from the School of Mechano-Electronic Engineering, Xidian University, Xi'an, China, in 2018. He is currently pursuing the master's degree with the School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China. His current research interests include monocular 3-D human pose estimation based on deep learning and human–computer interaction.

Gongyue Zhang received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2015. He is currently pursuing the Ph.D. degree with the School of Computing, University of Portsmouth, Portsmouth, U.K. His research interests include image processing, computer vision, and applications in autistic screening and intervention.

Yi Wang received the master's degree in child healthcare from Fudan University, Shanghai, China, in 2020. Her research interests include molecular mechanisms and treatments of autism.

Jingxin Deng received the bachelor's degree in clinical medicine from Chongqing Medical University, Chongqing, China, in 2018. She is currently pursuing the M.Med. degree with the Children's Hospital of Fudan University, Shanghai, China. Her research interests include early child development and mechanisms of autism spectrum disorders.

Qiong Xu received the Ph.D. degree in pediatrics from Fudan University, Shanghai, China, in 2015. She is a Chief Physician and the Associate Chief of the Division of Child Health Care, Children's Hospital of Fudan University. She was a Visiting Scholar with Duke Children's Hospital and Health Center, Durham, NC, USA, from 2012 to 2013. She is working on early detection and early intervention for ASD in community- and hospital-based practices in Shanghai as well as exploring the molecular mechanism of genetic mutations in humans and animal models.

Xiu Xu received the Ph.D. degree in pediatrics from Fudan University, Shanghai, China, in 2001. She is a Professor of pediatrics and the Chief of the Division of Child Health Care, Children's Hospital of Fudan University, Shanghai. Her research interests include early evaluation, diagnosis, and intervention for neurodevelopmental disorders, with a special interest in autism spectrum disorder (ASD). As a PI for early detection and early intervention for ASD in community- and hospital-based practice in Shanghai, she contributed to setting up an ASD screening program in "The Three-Level Network" of child healthcare service in Shanghai, and she is the Co-PI for the National Project of the Ministry of Health in China on the epidemiology, diagnosis, and early intervention of ASD. Prof. Xu is the Steering Member of Child Health, the Society of Child Health, and the China Academy of Pediatrics.

Honghai Liu (Senior Member, IEEE) received the Ph.D. degree in robotics from King's College London, London, U.K., in 2003. He is a Professor with the State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen, China. He is also the Chair Professor of human machine systems with the University of Portsmouth, Portsmouth, U.K. His research interests include biomechatronics, pattern recognition, intelligent video analytics, intelligent robotics, and their practical applications, with an emphasis on approaches that could contribute to the intelligent connection of perception to action using contextual information. Prof. Liu is an Associate Editor of the IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, and the IEEE TRANSACTIONS ON CYBERNETICS. He is a Fellow of the Institution of Engineering and Technology.