
Computer Methods and Programs in Biomedicine 165 (2018) 13–23

Contents lists available at ScienceDirect

Computer Methods and Programs in Biomedicine


journal homepage: www.elsevier.com/locate/cmpb

Keyframe extraction from laparoscopic videos based on visual saliency detection

Constantinos Loukas a,∗, Christos Varytimidis b, Konstantinos Rapantzikos b, Meletios A. Kanakis c

a Laboratory of Medical Physics, Medical School, National and Kapodistrian University of Athens, Mikras Asias 75 str., Athens 11527, Greece
b School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
c Cardiothoracic Surgery Unit, Great Ormond Street Hospital for Children, London, UK

∗ Corresponding author. E-mail address: cloukas@med.uoa.gr (C. Loukas).
https://doi.org/10.1016/j.cmpb.2018.07.004

Article info

Article history:
Received 1 February 2018
Revised 1 June 2018
Accepted 16 July 2018

Keywords:
Video analysis
Keyframe extraction
Hidden Markov multivariate autoregressive models
Visual saliency

Abstract

Background and objective: Laparoscopic surgery offers the potential for video recording of the operation, which is important for technique evaluation, cognitive training, patient briefing and documentation. An effective way for video content representation is to extract a limited number of keyframes with semantic information. In this paper we present a novel method for keyframe extraction from individual shots of the operational video.

Methods: The laparoscopic video was first segmented into video shots using an objectness model, which was trained to capture significant changes in the endoscope field of view. Each frame of a shot was then decomposed into three saliency maps in order to model the preference of human vision to regions with higher differentiation with respect to color, motion and texture. The accumulated responses from each map provided a 3D time series of saliency variation across the shot. The time series was modeled as a multivariate autoregressive process with hidden Markov states (HMMAR model). This approach allowed the temporal segmentation of the shot into a predefined number of states. A representative keyframe was extracted from each state based on the highest state-conditional probability of the corresponding saliency vector.

Results: Our method was tested on 168 video shots extracted from various laparoscopic cholecystectomy operations from the publicly available Cholec80 dataset. Four state-of-the-art methodologies were used for comparison. The evaluation was based on two assessment metrics: Color Consistency Score (CCS), which measures the color distance between the ground truth (GT) and the closest keyframe, and Temporal Consistency Score (TCS), which considers the temporal proximity between GT and extracted keyframes. About 81% of the extracted keyframes matched the color content of the GT keyframes, compared to 77% yielded by the second-best method. The TCS of the proposed and the second-best method was close to 1.9 and 1.4, respectively.

Conclusions: Our results demonstrated that the proposed method yields superior performance in terms of content and temporal consistency to the ground truth. The extracted keyframes provided highly semantic information that may be used for various applications related to surgical video content representation, such as workflow analysis, video summarization and retrieval.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

In addition to its known therapeutic benefits, minimally invasive surgery (MIS) offers the potential for video recording based on the endoscopic camera that is naturally used for visualizing the anatomic area in operation. The recorded video may retrospectively be used for a number of important reasons [1]. First, for cognitive training of junior surgeons, something that makes video-based review an essential part of surgical education (e.g., teaching of complex tasks, management of critical scenarios, etc.) [2]. Second, for retrospective evaluation of applied techniques and for review of critical steps. Third, to provide the patient with a personal copy for future reference and for patient briefing. Fourth, to make decisions regarding a trainee's competence and to set requirements for advancement, as well as for skills improvement [3].
Recently, some countries have set legal requirements for digital recording of surgical operations, in order to provide evidence for lawsuits in case of malpractice [4]. To protect the patient's personal data, guidelines for standardized recording and secure archival of digital media produced in the operating room are currently under development [5].

Nowadays, given the rapid expansion of high-volume storage media and file-sharing technologies, the number of surgical videos recorded has increased tremendously. The videos are not only stored on local servers at the hospital, but they are also uploaded to video-sharing websites and educational resources, so that they are accessible to other educators and trainees. Cognitive training is typically performed by retrieving relevant videos in response to text queries and predefined keywords. This is usually achieved by manual pre-annotation of the video based on cues of potential interest, which is time consuming and limits additional future searches on non-annotated terms. Moreover, in most cases the surgeon performs manual video skimming in order to locate the event/task of interest, which is also inefficient. Content-based video search is not applied to large-scale search engines, mainly due to the diversity of the visual content and the lack of a universal way to represent the surgical procedures. To automate and enhance this process, one must develop technologies to effectively index and represent the content of the surgical video.

A popular way for video content representation is to extract frames that incorporate higher-level semantic information corresponding to key events (keyframes). For large-scale video data, such as those from surgical operations, a fundamental prerequisite is to segment the entire video into meaningful structural units (shots), which are considered to represent a continuous spatiotemporal action [6]. After extracting keyframes from each shot, it is then straightforward to establish the overall video context as a collection of representations arising from the individual shots. These representations may refer either to the actual keyframes, or to robust feature descriptors extracted from the keyframes. The former leads to applications related to video summarization and browsing, where instead of the entire video the user may visualize a number of preselected keyframes. Extracting descriptors from keyframes is mostly related to video indexing and image retrieval, where the goal is to retrieve keyframes similar to a query frame.

Keyframe extraction has been extensively applied across several video domains, such as news broadcasts, sport events, traffic cameras, TV shows, and movies [7–10]. The main assumption is that there are great redundancies among the frames of the shot, and the central goal is to select those frames that contain as much salient information as possible. Indicative image features include color, texture, edges, MPEG-7 motion descriptors, and optical flow. Among the various approaches employed, sequential comparison-based methods compare frames subsequent to a previously extracted keyframe [11], whereas global-based methods perform global frame comparison by minimizing an objective function [12]. Reference-based methods generate a reference frame, and extract keyframes based on the comparison of the shot frames with the reference frame [13]. Other popular methods include keyframe selection based on frame clustering [14], and trajectory curve representation of the frame sequence [15].

In the domain of medical endoscopy, various keyframe extraction methods have been proposed across a number of application fields, such as diagnostic hysteroscopy [16–18], gastrointestinal endoscopy [19], video capsule endoscopy [20–22] and endomicroscopy [23]. Although these approaches may potentially be applied in the field of MIS, surgical videos present significant differences compared to diagnostic videos, since the operator heavily interacts with the displayed anatomic organs (cutting, coagulation, clipping, etc.). The camera field of view changes constantly due to the operator's actions, not just because of navigation through the internal body. Consequently, the field of view may, for example, be covered with smoke (e.g., during coagulation), or change due to tool insertion/removal or organ manipulation (e.g., gallbladder removal). Compared to diagnostic examinations (e.g., capsule endoscopy, colonoscopy, etc.), in MIS motion is a strong indicator of surgical activity, which is related to a surgical action performed with the aid of the tools. Moreover, in diagnostic procedures the camera is manipulated by a single operator, whereas in MIS there is more than one operator (the camera operator and the surgeon performing the operation).

In MIS, keyframe extraction methods are limited, probably due to a number of visual challenges, such as camera/instrument movement, bleeding, tissue deformation, uninformative frames, and presence of smoke [24]. An effort for video summarization of arthroscopic procedures was presented in [25]. The proposed tool generated a keyframe-based summary by clustering similar frames. Five different combinations of features and dissimilarity metrics were employed. For laparoscopic videos, the most relevant work is [26], where a method based on monitoring of local features within a sliding temporal window was proposed. Using ORB (Oriented FAST and Rotated BRIEF) descriptors, nearby frames were matched in order to determine frames of significant content change (candidate keyframes). Keyframes that were similar to already selected keyframes were removed. The method was evaluated by external observers, based on the 'appropriateness' of the selected keyframes. The same group also proposed a browser for timeline-based visualization of the keyframes [27].

In this paper we propose an alternative methodology for keyframe extraction from laparoscopic videos that combines characteristics from the sequential frame comparison-based and clustering-based groups of methodologies. The first contribution lies in the segmentation of the video into shots using an objectness model trained to capture significant camera viewpoint changes. The second contribution lies in the representation of the shot as a multivariate signal of frame-accumulated metrics of visual saliency. In particular, each frame is decomposed into three maps of color, motion, and texture saliency. Color and motion saliency maps are extracted by adapting recent developments on image contrast measurement, whereas for texture LogGabor filters were employed. For each frame, a data vector of three saliency values is produced by accumulating the response across each map. The third contribution lies in modeling the 3D signal of saliency values as a hidden Markov multivariate autoregressive (HMMAR) process. This model allows, first, the representation of the multivariate signal as data vectors generated from a number of different states (components), each of which is a static MAR process on its own, and second, the temporal variation of the vectors based on a hidden Markov model. Hence, the video shot is eventually represented as a temporal sequence of predominant (most likely) states at each time point, which allows temporal clustering of the frames. In other words, the model provides a segmentation of the shot into a number of predefined states. A keyframe is extracted from each state by taking the frame that corresponds to the data vector with the highest state-conditional probability. A graphical overview of the main steps is presented in Fig. 1. The proposed method was compared against four different keyframe selection techniques, yielding superior results.

2. Materials and methods

2.1. Shot detection

Intuitively, a full MIS video is composed of a single shot, since the anatomical area in operation does not change. Nevertheless, important changes like tool insertion/removal, manipulation of the gallbladder or a mild change of camera viewpoint can be captured.
Fig. 1. A graphical overview of the main steps of the proposed keyframe extraction method.

We focus on the appearing/disappearing tools and train an objectness model to highlight their presence. The concept of objectness has been used in many algorithms for locating generic objects disregarding their identity [28,29]. Given the objectness response of different image subwindows, these algorithms propose candidate regions fully enclosing distinct objects. For our method, we adapt the objectness approach of [28]. We run the model on every 5th frame of the video, and compute a single-valued measure by averaging the objectness response across the frame. We suggest that changes in the global objectness of video frames correlate well with meaningful shot changes. We mark shots as the temporal periods between two consecutive outliers (sudden jumps or falls) of the measure. In particular, shot boundaries are marked at local minima or maxima of the objectness measure exceeding a fixed delta threshold, equal to 30% of the maximum value.

2.2. Color saliency map

Image contrast is an important attribute of human vision due to the preferential response of cortical cells to high-contrast stimuli [30]. In surgical images, most color values are concentrated around a certain part of the light spectrum. We suggest that frames containing regions with higher color contrast will be more attractive to the human observer. For our method, a color saliency map was generated for each frame by adapting the histogram-based contrast method proposed in [31] to the color statistics of the shot. Specifically, if C = {c_1, ..., c_m} denotes the employed histogram of color values for all pixels in a shot, the color saliency (CS) value for pixel i with color c_l is given by:

CS(i) = CS(c_l) = \sum_{j=1}^{n} f_j D(c_l, c_j)    (1)

where n denotes the total number of distinct color values in the shot, f_j is the frequency of pixels with color c_j, and D(c_l, c_j) is the color difference between colors c_l and c_j in the L*a*b* color space.

Due to the large number of possible color values (256^3), the number of colors in each channel was quantized to 12. Compared to natural images, some colors are rarely encountered in surgical images. So, the number of colors was further reduced by modifying the value of less frequent colors. In particular, color values with frequency >90% were kept unchanged, whereas the remaining ones were assigned the closest value of the unchanged colors. Finally, the color histogram was smoothed with a Gaussian filter.

2.3. Motion saliency map

Motion detection is an intrinsic characteristic of human vision. Surgical videos may contain different patterns of motion, mostly due to tissue manipulation and camera motion. A method was developed for calculating the motion saliency map for each frame in the shot. First, image blocks in a rectangular grid were tracked between consecutive frames in order to extract the optical flow vectors, mv_i:

B_k = \{ mv_i = (d_x, d_y)_i \mid i = 1, \ldots, N_k \}    (2)

where B_k denotes the 'actively moving' blocks for frame k, N_k is the number of these blocks, and mv_i = (d_x, d_y)_i is the motion vector for block i. Actively moving blocks were considered as those with a displacement greater than a predefined threshold.

The motion vectors of the active blocks were then quantized in terms of their orientation and magnitude. For orientation, 18 bins in the range [−180°, 180°) were employed. For magnitude, given that the range of displacements among frames can vary a lot, a K-means algorithm was used to cluster the displacements of all blocks contained in the shot into five classes of increasing motion magnitude. The center of the highest motion cluster had the greatest displacement, and the lowest motion cluster the lowest. Following this procedure, each active block was assigned two motion labels: orientation θ_i = {1, ..., 18}, and magnitude m_j = {0.2, ..., 1}.

A 2D orientation-magnitude histogram M = {θm_ij}, where θm_ij = (θ_i, m_j), was then computed for each frame. The motion saliency (MS) value for block i with labels θm_kl was computed as:

MS(i) = MS(\theta m_{kl}) = \sum_{i=1}^{18} \sum_{j=1}^{5} f_{ij} D(\theta m_{kl}, \theta m_{ij})    (3)

where f_{ij} is the pixel frequency of the bin θm_ij, and D(θm_kl, θm_ij) is the 'motion distance' between θm_kl and θm_ij, which was calculated as the weighted sum of the 'orientation distance' and the 'magnitude distance':

D(\theta m_{kl}, \theta m_{ij}) = w_\theta D(\theta_k, \theta_i) + (1 - w_\theta) D(m_l, m_j)    (4)

where

D(\theta_k, \theta_i) = k_\theta (1 - \cos(\theta_k - \theta_i))    (5)

D(m_l, m_j) = k_m |m_l - m_j|    (6)

The multiplication parameters were set to k_θ = 0.5 and k_m = 1.25, to ensure distance normalization to [0, 1]. The weight w_θ was determined as:

w_\theta = \frac{\sigma_\theta}{\sigma_\theta + \sigma_m}    (7)

where σ_θ and σ_m indicate the standard deviation (SD) of motion orientation and magnitude in the current frame, respectively.

Having calculated the MS score for each active image block, the score of non-active blocks was set to 0. Finally, the resulting MS map was multiplied by a Gaussian mask centered on the frame center, in order to model the preference of human attention to regions lying closer to the frame center.

2.4. Texture saliency map

A popular approach in characterizing texture regions includes spectral filtering with a 2D transform such as wavelet, discrete cosine, or Fourier [32]. After computation of the filter responses, texture features are extracted based on first- and second-order statistics. Among various filters, Gabor filters were shown to model the early psychovisual features of the human visual system [33]. LogGabor filters are defined as Gaussian functions shifted from the origin due to the singularity of the log function. In contrast to the Gabor, LogGabor filters have a zero DC component.

LogGabor filters have successfully been applied in image retrieval and texture classification applications [34,35]. A recent study has shown that they may also be used for generating dynamic saliency maps of video frames [36]. For our method, the intensity channel (I) of the color frame was initially convolved with a bank of 2D LogGabor filters using the parameters described in [37]. A series of filter response maps across n_o = 4 orientations (0°, 45°, 90°, 135°) and n_s = 7 scales was generated. Given that the convolution of the image with the LogGabor filter provides a complex response for each pixel, the L2 norm was used to describe the local energy. The final texture saliency (TS) map was computed as:

TS = \frac{1}{n_o} \sum_{o=1}^{n_o} \frac{\sum_{s=1}^{n_s} (E_{o,s} - \mu_{E_o})^2}{c}    (8)

where E_{o,s} is the computed energy map for a certain orientation and scale, \mu_{E_o} = \sum_{s=1}^{n_s} E_{o,s} / n_s is the mean energy map for a certain orientation, and c = \max \left[ \sum_{s=1}^{n_s} (E_{o,s} - \mu_{E_o})^2 \right] is a normalization factor to ensure that for each orientation the map is maximized to 1. The TS map was finally smoothed with a Gaussian filter.

2.5. Keyframe extraction based on HMMAR model

After computing the three saliency maps for each frame, a 3D signal was produced by accumulating the responses from each map, for every frame of the shot. Thus, each frame was described by a 3D vector of saliency responses, y_t. Using a MAR model, y_t was modelled as a weighted sum of the r previous values:

y_t = \sum_{k=1}^{r} y_{t-k} a_k + e_t    (9)

where a_k is the matrix of regression coefficients (weights), and e_t is Gaussian noise.

The uncertainty over y_t may also be expressed in a conditional probability form:

p(y_t \mid x_t, A, \Lambda) = \mathcal{N}(y_t \mid x_t A, \Lambda^{-1})    (10)

where x_t = [y_{t-1}, ..., y_{t-r}] collects the r previous observations, \mathcal{N} denotes the Gaussian distribution, A = [a_1, a_2, ..., a_r]^T, and Λ is the inverse covariance (precision).

For our method, the MAR model was extended to include hidden states in the data, providing the HMMAR model, which is based on two assumptions. First, the multivariate signal follows a Gaussian mixture model formulation. The signal consists of data vectors generated from a number of different components, each of which is a static MAR model on its own. For a mixture model with K components and parameters m = {A_j, Λ_j}, j = 1, ..., K, the likelihood is given by:

p(y_t \mid \pi, m) = \sum_{j=1}^{K} \pi_j \, p(y_t \mid x_t A_j, \Lambda_j^{-1})    (11)

where π_j is the weight of the jth component. Introducing a latent (hidden) variable z ∈ {z_1, ..., z_K} to denote either of the components, the previous equation may be rewritten as:

p(y_t \mid \pi, m) = \sum_{z_t} p(y_t \mid z_t, m) \, p(z_t \mid \pi)    (12)

where p(z_t = z_j | π) = π_j, and p(y_t | z_t = z_j, m) = p(y_t | m_j) is the state-conditional distribution of the observation y_t, given by Eq. (10).

The second assumption is the same as that followed by HMMs: the state (mixture component) at time t depends on the state value at time t − 1. Hence:

p(y_t \mid m) = \sum_{z_t} \sum_{z_{t-1}} p(y_t, z_t, z_{t-1} \mid m) = \sum_{z_t} \sum_{z_{t-1}} p(y_t \mid z_t, m) \, p(z_t \mid z_{t-1})    (13)

where p(z_t | z_{t−1}) is the transition probability from state z_{t−1} to z_t.

The parameters of the HMMAR model were computed under a variational Bayesian (VB) framework described in [38], and the number of regressors was set to r = 2. The input to the model was the 3D signal and the number of states, K.

The most likely state sequence {z_t}, t = 1, ..., T, for the signal was determined by the Viterbi algorithm. Consequently, the data vectors (and so the frames) were temporally grouped into K different states (z_j, j = 1, ..., K). For the purpose of this study, K also denotes the number of keyframes that needed to be extracted from the shot.

In order to select the most representative frame (keyframe) from each state, we find the time point (t_j) at which the corresponding saliency data vector, y_{t_j}, had the highest state-conditional probability. The frame corresponding to this time point was selected as a keyframe (F_{t_j}):

t_j = \arg\max_t \, p(y_t \mid z_t = z_j, m)    (14)

Following this procedure for each one of the K states, an equal number of keyframes {F_{t_j}}, j = 1, ..., K, was extracted from the shot.

Fig. 2 illustrates an example of the state sequence computation for a simulated 3D signal. The signal consists of four segments of three consecutive three-variate sinusoidal signals of increasing frequency (f_k = 10, 20, 30 Hz), corrupted with Gaussian noise. The sampling frequency is F_s = 100 Hz and each sinusoid has 1 s duration. The multivariate signal was fed into the HMMAR algorithm with parameters K = 3 and r = 1. The computed state sequence is plotted on top of the signal, where it is clear that the algorithm outputs the correct sequence with significant accuracy.

Fig. 2. Simulation results demonstrating the recovery of the state sequence for a three-variate signal consisting of four chunks of three equal-length consecutive sinusoids with different frequency.
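For readers who want to reproduce the toy example of Fig. 2, the sketch below builds a comparable test signal and applies the keyframe-selection rule of Eq. (14). It is only an illustration under stated assumptions: the noise level and per-channel phases are not given in the paper, the HMMAR fitting itself (VB estimation and Viterbi decoding, as in [38]) is not shown, and `log_prob`/`states` stand for the per-frame state-conditional log-probabilities and decoded state sequence that such a fitted model would provide.

```python
import numpy as np

def simulated_saliency_signal(fs=100, freqs=(10, 20, 30), repeats=4, noise_sd=0.2):
    """Toy 3-variate signal in the spirit of Fig. 2: `repeats` segments, each made of
    three consecutive 1-s sinusoids of increasing frequency, corrupted by Gaussian
    noise. noise_sd and the per-channel phases are assumed values."""
    t = np.arange(fs) / fs                                  # one second of samples
    phases = np.random.uniform(0, 2 * np.pi, size=3)        # one phase per channel
    channels = []
    for p in phases:
        segment = np.concatenate([np.sin(2 * np.pi * f * t + p) for f in freqs])
        channels.append(np.tile(segment, repeats))
    signal = np.column_stack(channels)                      # shape (repeats * 3 * fs, 3)
    return signal + np.random.normal(0.0, noise_sd, signal.shape)

def keyframes_from_states(log_prob, states):
    """Eq. (14): for every state j, pick the frame (among those assigned to j by the
    decoded sequence) whose state-conditional log-probability is highest.
    log_prob : (T, K) array from a fitted model; states : (T,) array of labels 0..K-1."""
    picks = {}
    for j in np.unique(states):
        idx = np.flatnonzero(states == j)
        picks[int(j)] = int(idx[np.argmax(log_prob[idx, j])])
    return picks
```

Restricting the arg max of Eq. (14) to the frames already assigned to state j is one reading of the equation; running it over all frames is equally possible and would change only a couple of lines.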
Table 1
Overview of the LC video shot dataset and the ground truth keyframes; # denotes ‘number
of’.

Video shots Mean duration ± SD (sec) Total # frames Mean # keyframes ± SD

168 67 ± 55 279,811 2.7 ± 1.6

3. Results

3.1. Dataset

The original dataset included 6 laparoscopic cholecystectomy (LC) operations from the recently published Cholec80 dataset [39]. The original resolution of the video frames was 1920 × 1080, captured at 25 frames per second (fps). To reduce the computational cost, spatial down-sampling by a factor of 4 was performed. The videos were divided into shots according to the technique described in the Methods. Shots of very short duration (<15 sec), or with several uninformative frames (e.g., out-of-patient views) were excluded. The final dataset included 168 video shots from various phases of the operation, such as gallbladder inspection, dissection, clipping, and coagulation.

The ground truth for the keyframes was extracted separately for each shot. After careful inspection of the shot, a number of keyframes (K) that included most of the important events was annotated by an experienced clinician (e.g., exposure of anatomical structures, clipping, cutting). The ground truth was decided based on the 'appropriateness as representative preview images', as suggested in the related work [26]. Some of the qualitative criteria employed by the clinician included: no blurriness, clear view of the anatomy (e.g., the dissected area is not covered by the tools), close-up views that display fine details of the anatomy, tool-tissue interaction and insertion of new tools (e.g., retrieval bag).

To avoid an excess or lack of keyframes, an empirical rule was followed by the clinician during annotation: each shot must contain at least one keyframe and in total no more than 4 keyframes per min. In total, 456 keyframes were extracted from the 168 shots. An overview of the video dataset is presented in Table 1.

3.2. Saliency maps

Fig. 3 shows an example of the color quantization process. Fig. 3(a) shows a typical frame selected from a shot, whereas Fig. 3(b) is the same image quantized to 57 color values. Despite the significant reduction of color values the image quality remains visually unchanged. The reduction of colors may be confirmed by the corresponding color histograms shown in Fig. 3(c)–(d). For better visualization, only 8 bins per channel were used. Each sphere is painted according to the color triplet of the bin center, whereas its radius is proportional to the number of pixels assigned to the bin. Despite the small number of bins used for histogram plotting, the color reduction is clear.

Fig. 4 shows the outcome of color smoothing on the computation of the HC saliency values on the original color image shown in Fig. 3(a). In particular, Fig. 4(a)–(b) show the saliency maps before and after color smoothing respectively, whereas Fig. 4(c)–(d) show the corresponding histograms. Lighter shades denote greater saliency. After color smoothing, regions with similar color are grouped together into similar saliency values (compare the tool shaft between Fig. 4(a),(b)). Moreover, the contrast of the saliency map is improved (see for example the gallbladder). Compared to Fig. 4(c) where most saliency values are <50, the histogram in Fig. 4(d) covers a greater region of the grayscale axis.

Fig. 5 shows an example of the construction of the MS map. Fig. 5(a) is a video frame extracted during dissection of the gallbladder, and Fig. 5(b) is the optical flow map. The instruments shown on the left and upper part of the image were used to keep the target site unclouded. The other instrument performed the dissection. Hence, most of the motion activity is concentrated in the region around the tip of this instrument. This is verified by the motion saliency map shown in Fig. 5(c). Using a pseudocolor map, pixels painted in red/blue represent high/low motion saliency, respectively. The moving regions around the instrument tip exhibit the greatest saliency.

Fig. 6 shows TS maps for various frames extracted from a video shot. Figs. 6(a)–(c) are the input frames to the algorithm, showing snapshots during gallbladder dissection (the cystic duct-artery are clearly seen), clipping, and coagulation, respectively. Fig. 6(d)–(f) are the computed TS maps, indicating the response of the LogGabor filter. Fig. 6(g)–(i) show the contours of the regions with TS values >10% of the maximum value, overlaid on the original images. It may be seen that most of the highly textured regions are surrounded by the contours. These regions include part of the dissected gallbladder, Fig. 6(g), (h), and coagulated liver bed and gallbladder, Fig. 6(i). The black regions in Fig. 6(d)–(f) essentially indicate low texture, something that may be verified by the corresponding uniform gray-tone regions, shown for example on the right in Fig. 6(a) and (b), and on the left in Fig. 6(c).

3.3. Performance evaluation

Four different keyframe selection techniques were used for comparison with the proposed method: random selection, uniform selection, K-means clustering, and the method proposed in [26]. In the first and second case, the keyframes are extracted from random and equidistant positions in the shot, respectively. In the third case, the clustering was performed on the 3D vectors extracted from the three saliency maps, in order to compare the HMMAR model with a static clustering technique. The technique in [26] performs keyframe extraction specifically on endoscopic surgery videos. It employs a metric to measure the distance of feature keypoints matched between consecutive frames. The distance is monitored over a sliding window, and a keyframe is selected when its value is outside an adaptive threshold range. However, this method does not generate a predefined number of keyframes, as required for the evaluation purpose of this study.
Fig. 3. Color quantization: the top row shows the same image before (a), and after (b), color quantization; the bottom row shows the corresponding color histograms ((c),
(d)).

Fig. 4. Color saliency and smoothing: the top row shows the resulting saliency map before (a), and after (b), color smoothing; the bottom row shows the corresponding histograms ((c), (d)).
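As a concrete illustration of Eq. (1) and of the quantization/smoothing steps visualized in Figs. 3 and 4, the sketch below computes histogram-contrast color saliency for a shot whose quantized palette and per-pixel palette labels are assumed to be available already (the 12-level per-channel quantization, the merging of infrequent colors and the final Gaussian smoothing of Section 2.2 are not repeated here). The function names, and the expectation that the palette is already expressed in L*a*b*, are our assumptions.

```python
import numpy as np

def color_saliency(palette_lab, freqs):
    """Eq. (1): the saliency of each quantized color is the frequency-weighted sum of
    its L*a*b* distances to every other color of the shot palette.
    palette_lab : (n, 3) array of quantized colors, assumed to be in L*a*b*.
    freqs       : (n,) array of pixel frequencies (fractions summing to 1)."""
    diff = palette_lab[:, None, :] - palette_lab[None, :, :]   # pairwise differences
    dist = np.linalg.norm(diff, axis=2)                        # D(c_l, c_j)
    return dist @ freqs                                        # CS(c_l) for every l

def color_saliency_map(label_map, saliency_per_color):
    """Assign every pixel the saliency of its quantized color; label_map holds the
    palette index of each pixel of the frame."""
    return saliency_per_color[label_map]
```

The frame-level color saliency entering the 3D signal of Section 2.5 is then the accumulated response of this map, e.g. `color_saliency_map(labels, cs).sum()`.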

Fig. 5. Motion saliency: (a) the original color frame, (b) its optical flow map, and (c) the final motion saliency map.
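In the same spirit, a minimal sketch of the block-level motion saliency of Eqs. (3)–(7) for one frame, assuming the active blocks have already been selected (displacement above the threshold) and their magnitudes quantized by the shot-level K-means into the labels {0.2, ..., 1}. Using block counts as the bin frequencies and converting orientation bins to radians before taking the standard deviation of Eq. (7) are our interpretations.

```python
import numpy as np

def motion_saliency(theta_bin, mag_level, k_theta=0.5, k_m=1.25, n_theta=18):
    """Eqs. (3)-(7) for the active blocks of one frame.
    theta_bin : (N,) int array, orientation bin index in {0, ..., 17} (20-degree bins).
    mag_level : (N,) float array, quantized magnitude label in {0.2, ..., 1.0}.
    Returns the MS value of every active block."""
    # 2D orientation-magnitude histogram of the frame (frequencies of (theta, m) pairs)
    pairs, counts = np.unique(np.column_stack([theta_bin, mag_level]),
                              axis=0, return_counts=True)
    freqs = counts / counts.sum()

    # adaptive weight from the spread of orientation vs. magnitude, Eq. (7)
    bin_width = 2 * np.pi / n_theta
    s_theta, s_m = np.std(theta_bin * bin_width), np.std(mag_level)
    w_theta = s_theta / (s_theta + s_m + 1e-12)

    ms = np.zeros(len(theta_bin))
    for b, (tb, ml) in enumerate(zip(theta_bin, mag_level)):
        d_theta = k_theta * (1 - np.cos((tb - pairs[:, 0]) * bin_width))     # Eq. (5)
        d_mag = k_m * np.abs(ml - pairs[:, 1])                               # Eq. (6)
        ms[b] = np.sum(freqs * (w_theta * d_theta + (1 - w_theta) * d_mag))  # Eqs. (3)-(4)
    return ms
```

Non-active blocks keep a score of 0, and the resulting map is multiplied by the center-weighted Gaussian mask described in Section 2.3.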
Fig. 6. Texture saliency: (a)–(c) are the intensity channels of three different video frames, (d)-(f) are the computed texture saliency maps, and (g)–(i) show the contours of
the regions with saliency value greater than 10% of the maximum value, overlaid on the original frames.
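Finally, once the bank of LogGabor energy maps is available (the filter construction with the parameters of [37] is not shown), Eq. (8) reduces to a few array operations; the assumed layout is an `energy` array of shape (n_o, n_s, H, W), i.e. one map per orientation and scale.

```python
import numpy as np

def texture_saliency(energy, eps=1e-12):
    """Eq. (8): per-orientation variance of the LogGabor energy across scales,
    normalized so that each orientation peaks at 1, then averaged over orientations.
    energy : (n_o, n_s, H, W) array of energy maps (L2 norm of the complex responses)."""
    mean_e = energy.mean(axis=1, keepdims=True)             # mu_Eo for each orientation
    var_map = ((energy - mean_e) ** 2).sum(axis=1)          # sum over scales, per orientation
    norm = var_map.max(axis=(1, 2), keepdims=True) + eps    # normalization factor c
    return (var_map / norm).mean(axis=0)                    # TS map; smooth with a Gaussian afterwards
```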

Hence, we selected the K keyframes with the greatest distance metric. In a few cases where the extracted number was less than K, the threshold range was gradually reduced.

The evaluation was based on the adaptation of two assessment metrics employed for similar purposes in gastroscopic videos [19]. The first one, Color Consistency Score (CCS), measures the color distance between the ground truth (GT) keyframe and the closest extracted keyframe:

CCS = \frac{\sum_k d_c(f_k, g_k)}{K}    (15)

where g_k is the kth GT keyframe, f_k is the extracted frame closest to g_k, and d_c is the color distance between the two frames:

d_c(f_k, g_k) = \begin{cases} 1 & \text{if } \overline{f_k - g_k} \le T_c \\ 0 & \text{otherwise} \end{cases}    (16)

where \overline{f_k - g_k} denotes the mean of the Euclidean distances between the RGB color vectors of the same pixels in f_k and g_k, and T_c is the content threshold, which here was set equal to the mean of \overline{f_{t+1} - f_t} in the shot.

The second metric, Temporal Consistency Score (TCS), considers the temporal proximity between the GT and the extracted keyframes:

TCS = \frac{\sum_k d_t(f_k, g_k)}{K}    (17)

where d_t is the temporal distance defined as:

d_t(f_k, g_k) = \begin{cases} 3 & \text{if } |t_{f_k} - t_{g_k}| \le T_{t1} \\ 2 & \text{if } |t_{f_k} - t_{g_k}| \in (T_{t1}, T_{t2}] \\ 1 & \text{if } |t_{f_k} - t_{g_k}| \in (T_{t2}, T_{t3}] \\ 0 & \text{if } |t_{f_k} - t_{g_k}| > T_{t3} \end{cases}    (18)

where |t_{f_k} − t_{g_k}| denotes the absolute difference between the timings of the extracted keyframe and the GT keyframe. The thresholds T_{t1}, T_{t2} and T_{t3} were set equal to T_m, 2T_m, and 3T_m respectively, where T_m = 2 sec was the mean of the 5 shortest temporal distances between all consecutive GT keyframes. For both CCS and TCS, each f_k was matched to the closest g_k according to its lowest color and temporal distance among all GT keyframes, respectively.

Table 2 presents the keyframe extraction results based on the aforementioned metrics. The proposed method (HMMAR-Saliency) yields the highest scores, whereas Uniform and Random selection yield the lowest ones. K-means clustering has TCS similar to that of [26], but its CCS is higher.

Based on the CCS definition, the results indirectly indicate that about 81% of the keyframes extracted by the proposed method match the color content of the GT keyframes, using the specified tolerance. For Uniform and Random selection only about 1/3rd of the keyframes have color content similar to the ground truth. Both methods ignore the content and temporal variation of the video frames. The CCS of the method in [26] is higher than Uniform and Random selection, but lower than K-means and significantly lower than the proposed method (about 16%).

For K-means, the CCS is lower by about 5% compared to that of HMMAR-Saliency. Note that the feature vectors employed for K-means clustering were the same as those employed by the proposed method (i.e. 3-variate vectors of saliency sums). However, the latter also considered the temporal variation of the saliency measures in the shot, whereas K-means did not. This difference is more profound in the TCS results. The TCS of the proposed method is close to 2 whereas for K-means it is 1.4. Based on the TCS definition, the keyframes extracted by the HMMAR-Saliency method are about 2–4 sec away from the ground truth, whereas for K-means 4–6 sec. The performance of the [26] method is lower than that of K-means. Uniform and Random selection provide the worst results, with keyframes separated by more than about 6 sec from the GT keyframes.

3.4. Performance validation

In addition to the aforementioned objective metrics we performed a validation experiment where a 2nd experienced user (validator) was asked to rate in a blind fashion the similarity of the keyframes generated by each method to the ones selected by the 1st experienced clinician ('ground truth'), using a 5-point Likert scale: −2 (very dissimilar) to 2 (very similar). A similar approach has been followed in a related work [26]. The rating focused on the anatomic tissues and surgical tools as well as their spatial relation, with regard to those shown in the ground truth. Table 3 shows the average rating of the validator for each method tested. Among the five methods, uniform and random selection performed worst.
Table 2
Keyframe extraction results based on the two metrics: CCS and TCS (average ± SD).

Uniform selection Random selection K-means clustering Method in [26] HMMAR-Saliency

CCS 0.35 ± 0.11 0.37 ± 0.12 0.77 ± 0.04 0.68 ± 0.08 0.81 ± 0.05
TCS 0.42 ± 0.06 0.34 ± 0.09 1.41 ± 0.04 1.07 ± 0.07 1.92 ± 0.06

Table 3
Average rating of the keyframes similarity to the ground truth.

Uniform selection Random selection K-means clustering Method in [26] HMMAR-Saliency

−0.96 −0.81 0.27 0.62 1.09

K-means and the method in [26] seem to have similar performance, between 'neutral' (rate: 0) and 'similar' (rate: 1). The proposed method achieved the best average rating, slightly above 'similar'. Moreover, we performed a statistical analysis on the user ratings. A value of p < 0.05 was assumed to be statistically significant. A non-parametric Friedman test showed statistically significant differences across the methods tested. A consequent Wilcoxon rank-sum test, with a Bonferroni adjustment, showed that the ratings for the proposed method were significantly higher than those of all other methods (p < 0.05).

3.5. Examples of keyframe extraction

Fig. 7 shows keyframe examples of the proposed method (HMMAR-Saliency) and the four methods used for comparison, with regard to the ground truth. The video shot is a segment from dissection of the gallbladder, about 1 min in duration. The 1st, 2nd, and 3rd GT keyframes correspond to inspection of the target tissue area, cystic duct-artery dissection, and gallbladder manipulation (close to the shot end), respectively. The Uniform, Random, and the [26] method seem to select keyframes with very similar content; only the first keyframe is close to the ground truth. For K-means, the 1st and 3rd keyframes are similar to the 1st and 2nd GT keyframes respectively, whereas the 2nd keyframe contains similar information to the 1st one. In contrast, our approach extracts keyframes with content similar to that of the GT, showing more diverse information about the video shot. The temporal proximity of the GT and the extracted keyframes may be seen in Fig. 8, which shows the three saliency signals, the three-state sequence generated by the HMMAR model, and the temporal location of the GT and HMMAR keyframes extracted from each state. It may be seen that, for this particular video shot, each state becomes more predominant at different parts of the video, although this was not always the case. The temporal difference between the 1st, 2nd and 3rd GT and extracted keyframes was 0.8, 1.2 and 2.3 sec, respectively.

Fig. 9 presents five keyframes from another video shot during clipping and cutting of the cystic duct-artery. The 1st GT keyframe shows the dissected gallbladder area, the 2nd-4th keyframes show clip application, and the last two keyframes show cutting. Although a detailed one-to-one visual comparison is complex, it may be seen that apart from the 3rd keyframe, our approach extracts keyframes with content very similar to that of the GT. The other methods failed to provide an appropriate representation of the video content, generating either many similar keyframes (e.g., 1–3 for Uniform, 3–5 for Random), or keyframes different in content/order with regard to the GT (e.g., 2nd, 3rd, 4th for K-means, 2nd, 3rd, 6th for the [26] method). The temporal location of the keyframes, the HMMAR state sequence, and the 3D saliency signal are shown in Fig. 10. The time difference between the GT and the proposed keyframes varied between 2.6 sec (4th keyframe) and 20.6 sec (3rd keyframe).

Fig. 7. Keyframe extraction for a video shot (≈1 min length). Top row shows the ground truth keyframes. The other rows show the keyframes (sorted in temporal order) extracted by four different keyframe selection techniques (random selection, uniform selection, K-means clustering, and the method in [26]), and the proposed method (bottom row).

4. Discussion

In this paper, we have presented a method for keyframe extraction from video shots of laparoscopic operations. Instead of common feature descriptors, we employed mechanisms of visual saliency detection with respect to three main sources of image information: color, texture, and motion. The underlying motive behind the use of visual saliency was to apply a model that detects the most noticeable regions in the image. Although the detection of regions that match human attention is a major field of research with significant challenges, recent developments have shown significant progress not only for static images [31], but also for dynamic sequences [36]. These approaches, along with the proposed motion saliency detection technique, were adapted to the characteristics of the laparoscopic videos with the aim to describe the frames' saliency content.

The accumulated response of the three saliency components across each frame provided a multivariate signal which reflected the variation of the overall saliency in the shot. The next task was to detect the clusters that the saliency vectors were grouped into, and pick the most representative one from each cluster.
Fig. 8. The bottom graph shows the three saliency signals (CS, MS and TS), whereas the top graph shows the most likely state sequence based on the HMMAR model, for
the video shot described in Fig. 7. The squares and the circles correspond to the temporal location of the ground truth and the extracted keyframes, respectively.

Fig. 9. Another example of keyframe extraction from a video shot of about 2 min length. The notation is the same as that followed for Fig. 7.

Fig. 10. Same as Fig. 8, but for the video shot described in Fig. 9.

Given that each vector was uniquely mapped into one frame, the selection of the most representative keyframe from each cluster was straightforward. Although one could use standard clustering techniques to find the underlying saliency clusters in the input signal, here we proposed the HMMAR model, which also includes dynamical information. As indicated by the results, static clustering of the saliency vectors did not provide as accurate results as the HMMAR model. Each data vector was part of a sequence with temporal dependencies, and static clustering cannot provide dynamic information. In contrast, the HMMAR model includes dynamic information regarding which state the saliency data is in at a particular time. Hence, the video shot was temporally segmented into a number of states, each with a different pattern of visual saliency. After identifying the most likely sequence of saliency states in the shot, the most representative data vector (and so keyframe) was selected from each state, based on its highest state-conditional probability. To avoid confusion, it should be noted that the states here are used for temporal clustering of the saliency data-vectors in the shot, and have no connection with higher level semantics (e.g., phases, actions, etc.), where one would have to combine data from several shots.
Hence, our application does not require imposing some type of constraints between state transitions, such as in other research works where the temporal clusters correspond to semantic units of the overall operation, some of which have strong temporal dependencies [40].

Another important point is that in this work all three feature types were used as input to the HMMAR model, in order to capture the overall pattern of temporal variation. Of course, one may well use a different HMMAR model for each feature type (1-variate process), and then follow the same procedure in order to find the most likely set of keyframes (for each feature type). However, this procedure does not ensure that the 'best' feature would be better than the combination. So, in this work we preferred to fuse all three feature types and then let the HMMAR model determine the most likely state sequence, based on the variation pattern of all feature types.

The experimental results based on various criteria showed that the extracted keyframes using the proposed scheme are much closer to the ground truth compared to the other methods. With regard to the color content metric, about 81% of the extracted keyframes matched the GT keyframes. The proposed method also extracted keyframes with the best temporal proximity with regard to the ground truth. Among the other methods compared, K-means clustering yielded better scores, but much lower than the proposed method. Moreover, a qualitative evaluation showed that the extracted keyframes convey more diverse information about the video shot. These results indicate the advantages of visual saliency to describe the most noticeable information in the video frames, and also of the HMMAR model to capture the underlying dynamics of the hidden saliency components.

A potential drawback of this study is that it examined video shots without uninformative frames, which are usually encountered when the camera is removed from the patient's body. Although one could add a pre-processing step based on color analysis techniques such as those proposed in [41,42], rejection of individual frames would impact negatively the flow of the shot, causing discontinuities in the saliency signal. A potential remedy would be to eliminate these frames prior to shot detection, using an 'image quality evaluation' framework based on supervised learning. Another possibility is a shot detection technique designed for uninformative shot rejection. Both issues will be investigated in the future.

Another limitation of this study is that the number of keyframes that needed to be extracted by each method was defined a priori, based on the ground truth. This procedure was followed to allow a fair comparison among the examined methods and the ground truth. Besides, most of the methods did not provide a way to control the number of frames that best represent the video shot. This is related to 'model order selection', a well-known problem in the field of machine learning. The proposed HMMAR model is based on a variational Bayesian formulation and provides a mechanism to tackle the number of states (and so keyframes) that best model the input data. In the future we aim to perform a detailed investigation of this issue along with other model order selection criteria that have been employed for this purpose.

Regarding the selection of the ground-truth data, a potential drawback is that a single experienced clinician was employed for keyframe selection. As pointed out in a related work [26], there may be discrepancies in the keyframes selected by different users, and yet the definition of what is a 'suitable keyframe' may change from one expert to another and even with time (based on the surgeon's experience). However, in this study our principal aim was not to compare the output among experienced users and computer-based methods. Due to the nature of the problem (lack of a 'definite ground-truth'), we rather aimed to perform a comparative evaluation among the various methods based on the same 'ground-truth' generated by an expert. Hence, the main goal of this work was to provide a generally meaningful solution (for the keyframe extraction problem), instead of an optimal solution that is hard to achieve with current video content analysis methods. Nevertheless, an alternative subjective validation on the similarity of the method-generated and ground truth keyframes showed that the proposed method yields superior performance. In the future, we plan to study the agreement between different users using objective criteria for annotated region contours (e.g., spatial agreement between tools or anatomic parts).

Although this work focused on keyframe extraction from surgical video streams, the proposed technique could also be used for other applications [1]. For example, the extracted keyframes may well be used as input to a video summarization algorithm, or to a surgical workflow recognition model [43]. To our knowledge the former has not yet been investigated in the literature, whereas the state-of-the-art in surgical phase recognition processes all video frames from the operation, which increases the computational complexity. We hope that the proposed methodology will inspire the investigation of these as well as additional novel applications in the field of surgical video analysis.

Declarations of interest

None.

Conflict of interest statement

The authors have no conflicts of interest or financial ties to disclose.

References

[1] C. Loukas, Video content analysis of surgical procedures, Surg. Endosc. 32 (2018) 553–568, doi:10.1007/s00464-017-5878-1.
[2] C. Loukas, N. Nikiteas, M. Kanakis, E. Georgiou, The contribution of simulation training in enhancing key components of laparoscopic competence, Am. Surg. 77 (2011) 708–715.
[3] C. Loukas, E. Georgiou, Performance comparison of various feature detector-descriptors and temporal models for video-based assessment of laparoscopic skills, Int. J. Med. Robot. Comput. Assist. Surg. 12 (2016) 387–398, doi:10.1002/rcs.1702.
[4] K.R. Henken, F.W. Jansen, J. Klein, L.P.S. Stassen, J. Dankelman, J.J. van den Dobbelsteen, Implications of the law on video recording in clinical practice, Surg. Endosc. 26 (2012) 2909–2916, doi:10.1007/s00464-012-2284-6.
[5] A.M.J. Turnbull, E.S. Emsley, Video recording of ophthalmic surgery-ethical and legal considerations, Surv. Ophthalmol. 59 (2014) 553–558, doi:10.1016/j.survophthal.2014.01.006.
[6] C. Loukas, N. Nikiteas, D. Schizas, E. Georgiou, Shot boundary detection in endoscopic surgery videos using a variational Bayesian framework, Int. J. Comput. Assist. Radiol. Surg. 11 (2016) 1937–1949, doi:10.1007/s11548-016-1431-2.
[7] N. Ejaz, I. Mehmood, S. Wook Baik, Efficient visual attention based framework for extracting key frames from videos, Signal Process. Image Commun. 28 (2013) 34–44, doi:10.1016/j.image.2012.10.002.
[8] S.K. Kuanar, R. Panda, A.S. Chowdhury, Video key frame extraction through dynamic Delaunay clustering with a structural constraint, J. Vis. Commun. Image Represent. 24 (2013) 1212–1227, doi:10.1016/j.jvcir.2013.08.003.
[9] S. Angadi, V. Naik, Shot boundary detection and keyframe extraction for sports video summarization based on spectral entropy and mutual information, in: Proc. Fourth Int. Conf. Signal Image Process. 2012 (ICSIP 2012), Springer India, 2013, pp. 81–97, doi:10.1007/978-81-322-0997-3_8.
[10] W. Hu, N. Xie, L. Li, X. Zeng, S. Maybank, A survey on visual content-based video indexing and retrieval, IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 41 (2011) 797–819, doi:10.1109/TSMCC.2011.2109710.
[11] J. Peng, Q. Xiao-Lin, Keyframe-based video summary using visual attention clues, IEEE Multimed. 17 (2010) 64–73, doi:10.1109/MMUL.2009.65.
[12] Y. Cong, J. Yuan, J. Luo, Towards scalable summarization of consumer videos via sparse dictionary selection, IEEE Trans. Multimed. 14 (2012) 66–75, doi:10.1109/TMM.2011.2166951.
[13] Z. Sun, K. Jia, H. Chen, Video keyframe extraction based on spatial-temporal color distribution, in: 2008 Int. Conf. Intell. Inf. Hiding Multimed. Signal Process., IEEE, 2008, pp. 196–199, doi:10.1109/IIH-MSP.2008.245.
[14] E. Spyrou, G. Tolias, P. Mylonas, Y. Avrithis, Concept detection and keyframe extraction using a visual thesaurus, Multimed. Tools Appl. 41 (2009) 337–373, doi:10.1007/s11042-008-0237-9.
[15] Q. Zhang, X. Xue, D. Zhou, X. Wei, Motion key-frames extraction based on amplitude of distance characteristic curve, Int. J. Comput. Intell. Syst. 7 (2014) 506–514, doi:10.1080/18756891.2013.859873.
[16] K. Muhammad, M. Sajjad, M.Y. Lee, S.W. Baik, Efficient visual attention driven framework for key frames extraction from hysteroscopy videos, Biomed. Signal Process. Control 33 (2017) 161–168, doi:10.1016/j.bspc.2016.11.011.
[17] N. Ejaz, I. Mehmood, S.W. Baik, Visual attention driven framework for hysteroscopy video abstraction, Microsc. Res. Tech. 76 (2013) 559–563, doi:10.1002/jemt.22205.
[18] K. Muhammad, J. Ahmad, M. Sajjad, S.W. Baik, Visual saliency models for summarization of diagnostic hysteroscopy videos in healthcare systems, Springerplus 5 (2016) 1495, doi:10.1186/s40064-016-3171-8.
[19] S. Wang, Y. Cong, J. Cao, Y. Yang, Y. Tang, H. Zhao, H. Yu, Scalable gastroscopic video summarization via similar-inhibition dictionary selection, Artif. Intell. Med. 66 (2016) 1–13, doi:10.1016/j.artmed.2015.08.006.
[20] D.K. Iakovidis, A. Koulaouzidis, Software for enhanced video capsule endoscopy: challenges for essential progress, Nat. Rev. Gastroenterol. Hepatol. 12 (2015) 172–186, doi:10.1038/nrgastro.2015.13.
[21] I. Mehmood, M. Sajjad, S.W. Baik, Video summarization based tele-endoscopy: a service to efficiently manage visual data generated during wireless capsule endoscopy procedure, J. Med. Syst. 38 (2014) 109, doi:10.1007/s10916-014-0109-y.
[22] I. Mehmood, M. Sajjad, S. Baik, Mobile-cloud assisted video summarization framework for efficient management of remote sensing data generated by wireless capsule sensors, Sensors 14 (2014) 17112–17145, doi:10.3390/s140917112.
[23] B. André, T. Vercauteren, A.M. Buchner, M.B. Wallace, N. Ayache, Learning semantic and visual similarity for endomicroscopy video retrieval, IEEE Trans. Med. Imaging 31 (2012) 1276–1288, doi:10.1109/TMI.2012.2188301.
[24] C. Loukas, E. Georgiou, Smoke detection in endoscopic surgery videos: a first step towards retrieval of semantic events, Int. J. Med. Robot. Comput. Assist. Surg. 11 (2015) 80–94, doi:10.1002/rcs.1578.
[25] M. Lux, O. Marques, K. Schöffmann, L. Böszörmenyi, G. Lajtai, A novel tool for summarization of arthroscopic videos, Multimed. Tools Appl. 46 (2009) 521–544, doi:10.1007/s11042-009-0353-1.
[26] K. Schoeffmann, M. Del Fabro, T. Szkaliczki, L. Böszörmenyi, J. Keckstein, Keyframe extraction in endoscopic video, Multimed. Tools Appl. 74 (2014) 11187–11206, doi:10.1007/s11042-014-2224-7.
[27] J. Lokoc, K. Schoeffmann, M. del Fabro, Dynamic hierarchical visualization of keyframes in endoscopic video, Lect. Notes Comput. Sci. 8936 (2015) 291–294.
[28] B. Alexe, T. Deselaers, V. Ferrari, Measuring the objectness of image windows, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 2189–2202, doi:10.1109/TPAMI.2012.28.
[29] J.R.R. Uijlings, K.E.A. Van De Sande, T. Gevers, A.W.M. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2013) 154–171, doi:10.1007/s11263-013-0620-5.
[30] J.H. Reynolds, R. Desimone, Interacting roles of attention and visual salience in V4, Neuron 37 (2003) 853–863. http://www.ncbi.nlm.nih.gov/pubmed/12628175 (accessed January 17, 2017).
[31] M.-M. Cheng, N.J. Mitra, X. Huang, P.H.S. Torr, S.-M. Hu, Global contrast based salient region detection, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 569–582, doi:10.1109/CVPR.2011.5995344.
[32] T. Randen, J.H. Husoy, Filtering for texture classification: a comparative study, IEEE Trans. Pattern Anal. Mach. Intell. 21 (1999) 291–310, doi:10.1109/34.761261.
[33] J.G. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, J. Opt. Soc. Am. A 2 (1985) 1160–1169. http://www.ncbi.nlm.nih.gov/pubmed/4020513 (accessed January 19, 2017).
[34] R. Nava, B. Escalante-Ramírez, G. Cristóbal, Texture image retrieval based on log-Gabor features, Lect. Notes Comput. Sci. 7441 (2012) 414–421, doi:10.1007/978-3-642-33275-3_51.
[35] J. Arrospide, L. Salgado, Log-Gabor filters for image-based vehicle verification, IEEE Trans. Image Process. 22 (2013) 2286–2295, doi:10.1109/TIP.2013.2249080.
[36] V. Leboran, A. Garcia-Diaz, X. Fdez-Vidal, X. Pardo, Dynamic whitening saliency, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017) 893–907, doi:10.1109/TPAMI.2016.2567391.
[37] A. Garcia-Diaz, X.R. Fdez-Vidal, X.M. Pardo, R. Dosil, Saliency from hierarchical adaptation through decorrelation and variance normalization, Image Vis. Comput. 30 (2012) 51–64, doi:10.1016/j.imavis.2011.11.007.
[38] M.J. Cassidy, P. Brown, Hidden Markov based autoregressive analysis of stationary and non-stationary electrophysiological signals for functional coupling studies, J. Neurosci. Methods 116 (2002) 35–53.
[39] A.P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. de Mathelin, N. Padoy, EndoNet: a deep architecture for recognition tasks on laparoscopic videos, IEEE Trans. Med. Imaging 36 (2017) 86–97, doi:10.1109/TMI.2016.2593957.
[40] N. Padoy, D. Mateus, D. Weinland, M.-O. Berger, N. Navab, Workflow monitoring based on 3D motion features, in: IEEE 12th Int. Conf. Comput. Vis. Workshops, IEEE, 2009, pp. 585–592, doi:10.1109/ICCVW.2009.5457648.
[41] B. Munzer, K. Schoeffmann, L. Boszormenyi, Relevance segmentation of laparoscopic videos, in: IEEE Int. Symp. Multimed., Anaheim, California, USA, IEEE, 2013, pp. 84–91, doi:10.1109/ISM.2013.22.
[42] J. Oh, S. Hwang, J. Lee, W. Tavanapong, J. Wong, P.C. de Groen, Informative frame classification for endoscopy video, Med. Image Anal. 11 (2007) 110–127, doi:10.1016/j.media.2006.10.003.
[43] C. Loukas, E. Georgiou, Surgical workflow analysis with Gaussian mixture multivariate autoregressive (GMMAR) models: a simulation study, Comput. Aided Surg. 18 (2013) 47–62, doi:10.3109/10929088.2012.762944.
