Accepted by IEEE Transactions on Systems, Man, and Cybernetics: Systems

Heterogeneous Information Fusion and Visualization for a Large-Scale Intelligent Video Surveillance System

Ching-Tang Fan, Yuan-Kai Wang, Member, IEEE, and Cai-Ren Huang

Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. C.-T. Fan is with the Graduate Institute of Applied Science and Engineering, Fu Jen Catholic University, New Taipei 24205, Taiwan (e-mail: bevis@islab.tw). Y.-K. Wang is with the Department of Electrical Engineering, Fu Jen Catholic University, New Taipei 24205, Taiwan (e-mail: ykwang@ieee.org). C.-R. Huang is with the Graduate Institute of Applied Science and Engineering, Fu Jen Catholic University, New Taipei 24205, Taiwan.

Abstract—Wide-area monitoring for a smart community can be challenging in systems engineering because of its large scale and heterogeneity at the sensor, algorithm and visualization levels. A smart interface that visualizes high-level information fused from a diversity of low-level surveillance data, and that facilitates rapid response to events, is critical for the design of the system. This paper presents an event-driven visualization mechanism fusing multimodal information for a large-scale intelligent video surveillance system. The mechanism proactively helps security personnel become aware of events intuitively through close cooperation among visualization, data fusion and sensor tasking. The visualization not only displays 2D, 3D and geographical information within a condensed interface, but also automatically shows only the important video streams corresponding to spontaneous alerts and events by a decision process called display switching arbitration. The display switching arbitration decides the importance of cameras by score ranking that considers event urgency and semantic object features. The system has been successfully deployed in a campus to demonstrate its usability and efficiency for an installation with two camera clusters that include dozens of cameras, and with a lot of video analytics to detect alerts and events. A further simulation comparing the display switching arbitration with similar camera selection methods shows that our method improves the visualization by selecting better representative camera views and reducing redundant switchover among multiview videos.

Index Terms—Display switching arbitration, information fusion, third-generation surveillance system (3GSS), visualization, visual surveillance, visualizability.

I. INTRODUCTION

THERE has been considerable emphasis recently on intelligent video surveillance research for achieving automatic interpretation of scenes and understanding the behaviors of objects. Safety and security surveillance, which targets crime prevention and forensic analyses, is one of the most critical applications of video analytics and intelligence, especially for smart communities [1]. Recently, automated visual surveillance has advanced to the third-generation surveillance system (3GSS) [2], which installs a high number of cameras in geographically diverse locations in a distributed manner to establish a multimodal camera network. The 3GSS is the culmination of research efforts from various disciplines ranging from signal acquisition, data fusion, communication models and video analytics to software architecture.

A technical evolution of surveillance systems from the first generation to the third generation is briefly explained in Table I. The first generation of surveillance system (1GSS) is a completely analogue system. Analogue closed-circuit television cameras capture the observed scene and transmit the video signals over analogue communication links to central back-end systems, which present and archive the video data in analogue form. Few intelligent techniques can be applied to improve the 1GSS. The second generation of surveillance system (2GSS) uses digital back-end components but analogue front-end equipment. However, digital video storage enables automated event detection and alarms by robust computer vision techniques such as abandoned object detection and behavior analysis. Proactive detection of alarming events substantially improves the quality of the surveillance system and lessens the burden of investigation tasks, from manually watching all recorded videos to semi-automatically collecting and organizing detected events. The 3GSS has evolved towards large, distributed and heterogeneous (with fixed, PTZ, and active cameras) surveillance systems for wide-area surveillance. Integrating the huge volume of information from a diversity of cameras enables a single human operator to monitor behaviors over wide areas and investigate the inferred meaning of compound events.

Ideally, an effective 3GSS should be capable of automatically issuing alerts and events of interest to operators in a control room environment. It avoids manual monitoring of cameras and prevents the operators from suffering information overload and a short attention span [3]. The 3GSS has several capabilities: automatic event understanding for alerting operators, computer-based monitoring of a target, and tracking of targets among cameras [4]. Some systems in 3GSS focus more on offering intelligent functions, such as intrusion detection and parking space counting, based on predefined objectives for each sensor. However, heterogeneity of diverse
information, including event messages, keyframes, and live and recorded videos, complicates the design of the systems [5]. In particular, their user interfaces become cluttered due to the need to display a lot of diverse events and information. The information overload problem still exists in 3GSSs [6] because the messy information and events incurred by the large scale of the system have to be digested and processed by well-trained security personnel. Developing new visualization mechanisms that improve interfaces by aggregating and fusing diverse surveillance information is imperative to foster and accelerate the progress of 3GSSs.

TABLE I. SUMMARY OF THE THREE GENERATIONS OF INTELLIGENT SURVEILLANCE SYSTEMS.

1GSS. Characteristics: analogue system with centralized architecture. Surveillance tasks: video recording for manual investigation after events. Research issues: video broadcasting and storage.

2GSS. Characteristics: partially digital system with centralized architecture. Surveillance tasks: video analysis for semi-automated investigation after events. Research issues: real-time computer vision for robust detection, tracking and behavior analysis.

3GSS. Characteristics: fully digital system with large-scale, distributed and heterogeneous architecture. Surveillance tasks: video understanding for semi-automated investigation during event time. Research issues: multisensor data fusion, space-time signal correspondence, task coordination and distribution, video communication and intelligent visualization, etc.

Fig. 1. The proposed system with a visualization mechanism on top of a classical 3GSS system.

This study proposes a novel visualization mechanism for a large-scale intelligent video surveillance system of 3GSSs. The bottom of Fig. 1 represents a 3GSS with heterogeneous cameras (analog, digital, embedded, and pan-tilt-zoom) in a distributed fashion. Intelligent tasks, such as the detection and recognition of objects, events and behaviors, analyze videos from several cameras and provide information to be shown in user interfaces, which can be messy when a visualization scheme fusing and condensing information is absent. Scene visualization, data fusion and sensor tasking are the main components of our visualization mechanism (middle of Fig. 1). Events, keyframes, videos and the 3D scene are automatically fused by the visualization mechanism. A compact and event-driven interface shows only critical information that is filtered and fused by the visualization mechanism. Finally, a scalable message-push architecture is adopted in the communication management component to facilitate the communication and cooperation among the three main components.

The mission of the scene visualization component is event-driven 2D and 3D visualization, incorporating information of events, keyframes, streaming videos, and 2D/3D geographic context. In addition to the fusion of heterogeneous information, the event-driven visualization includes a novel design to adaptively display live video streams by selecting only important camera views of interest from a high number of heterogeneous cameras. A visualizability measure is devised to evaluate the importance of the cameras, and a display switching arbitration is developed to decide the priority of camera views for visualization. The sensor tasking component receives commands from the interface and dispatches coordinated tasks to video analytics. Two missions, focus-of-attention and collaborative surveillance, are included. The focus-of-attention mission represents the task management of general video analytics detecting events for a surveillance scene through the use of stationary cameras, and the collaborative surveillance mission indicates the task management of special video analytics employing active cameras to assist stationary cameras in capturing crucial details such as close-ups of faces. When an event is triggered, the data fusion component activates the sparse geolocation mission to bind the event with geographical information, and the summarization mission to reduce redundant information and extract keyframes, event clips and synopses of the events.

Our contributions are threefold. First, the event-driven visualization scheme in Fig. 1 is first proposed by this paper to promptly display only critical information in a compact user interface. A visualizability measure concerning the semantics of visual features and the importance of cameras, views and events is proposed. Critical information for visualization is decided by the display switching arbitration that optimizes the visualizability measure. Second, the data fusion mentioned in Fig. 1 is novel. It integrates a diversity of information from camera sensors and geography to present the compact user interface for the human operator. Video summarization and sparse geolocation binding are two critical techniques developed in this paper to achieve this. Third, a full and improved 3GSS system was developed to demonstrate the usability and necessity of intelligent visualization. The communication management, sensor tasking, several video analytics tasks and multimodal camera networks implemented in this system, as shown in Fig. 1, are necessary to support our intelligent visualization schemes. However, details of classical
topics in 3GSS, such as video analytics tasks, are not proposed in this paper and are referred to [7-16]. This paper gives more attention to the proposed event-driven visualization and data fusion.

The remainder of this paper is organized as follows. Section II discusses background information and reviews related work on large-scale intelligent surveillance systems. Section III introduces the data fusion and event-driven visualization design. In Section IV, the details of the remaining system components are presented, including sensor tasking and communication management. Section V describes a system testbed operated in a campus, and a simulated experiment for the display switching arbitration. Finally, conclusions and suggestions for future work are provided in Section VI.

II. STATE-OF-THE-ART

The study of large-scale systems for wide-area video monitoring is briefly reviewed and discussed. A substantial number of studies have been devoted to video analytics, cooperative surveillance, and software architecture in the context of multi-camera surveillance systems; however, only a modicum of the systems explored the fusion and visualization problem.

Intelligent video surveillance, which applies computer vision algorithms to detect and recognize objects and events for specific prevention and forensic tasks [17], aims at active monitoring. A visualization scheme that effectively displays videos and events can simplify human operation and reduce manual tasks. Matrix arrangement of video displays to show objects and events corresponding to each single camera has been widely used as a traditional visualization scheme. However, the one-camera-one-monitor methodology is not feasible for a large-scale system with a high number of cameras [18]. An alternate approach employs a large screen to show all camera views in turn or randomly; however, the lack of semantic information, such as the spatial and temporal relationships among objects and events and the importance of camera views, incurs heavy overload for the management of the systems. Displaying tremendous information within a limited visual space is not a trivial task.

Object- and event-oriented interface designs are efficient ways for information visualization of large-scale systems. In [19] and [20], only information of specific objects, such as locations and tracks of objects, is visualized for sustaining attention. They offer only one interface by locating the object information across multiple cameras on a floor plan map for assisting navigation. [21] provides predictive selection of camera views by object tracking to simplify the operator's tasks. Nevertheless, methods [19]-[21] can be applied only to small-scale systems because they do not employ behavior analysis and event detection for extraction of high-level information. An event-oriented interface provides less but important information for large-scale systems, and facilitates a responsive interface for critical threats. IBM S3 [22] integrates diverse event detectors with a one-detector-one-monitor methodology into a large-scale deployable system. The number of user interfaces increases linearly and the workload of the operator increases dramatically when the number of detectors increases. Listing all events in an event inbox [23] is beneficial, but text information alone is deficient for responsive decisions, and nonverbal information such as keyframes and video clips is important.

Video summarization such as keyframes and skims provides a compact representation by eliminating the redundancy of videos and preserving only crucial frames for better visualization of stored videos. Keyframes are more widely used than video skimming because of the ease of browsing and navigating [24]. A fundamental approach of keyframe extraction considers only the changes of pixels and/or features in a frame [25]-[27]. A more compact keyframe representation achieved by incorporating object information is proposed by Erol and Kossentini [28] and Kim and Hwang [29]. Studies [7] and [30]-[31] further demonstrate that object information with high-level features based on human visual perception improves the optimization of keyframe extraction. In addition to these methods for summarization of single videos, [32] and [33] are more appropriate for large-scale systems because they extract keyframes from multiple videos acquired from multiple cameras with different viewpoints. Highly compact representation is achieved by removing more redundant frames. However, pixel and frame changes cannot be applied to multicamera keyframe extraction; instead, object and semantic features become more critical for extracting meaningful keyframes from multiple videos. Both centralized [32][33] and decentralized [34][35] networking approaches have been proposed. A global optimization mechanism that not only adapts to networking approaches but also incorporates semantic features is important for video summarization in large-scale systems.

Live video display within a limited visual space for a high number of cameras is also challenging. A hand-off approach for finding the next most meaningful camera is indispensable for focusing visual attention on a specific object. Some studies [23], [36]-[38] extract high-level semantic features for selecting a dominant camera. The dynamic camera switching method proposed in [23] identifies whichever camera exhibits the greatest image difference as the main sensor view. Kim and Kim [37] propose a probabilistic camera hand-off (PCH) method for continuous object tracking. The ratio of foreground blocks and the ratio of angle distance between the camera and the object are calculated for each camera to obtain the proximity probability. The camera with the highest proximity probability is identified as the dominant camera. The dominant camera can be identified using the tracking result of the aforementioned hand-off approach, and its object trajectory is represented on a map by using a homography. Goshorn et al. [38] propose a cluster-based multicamera surveillance network that generates a camera selection manager (CSM) as a cluster network. The CSM for the optimal view depends on weighted importance and four semantic features.
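To ground these hand-off ideas, the sketch below shows one plausible way to combine the two cues attributed to the PCH approach (the foreground-block ratio and the angle-distance ratio) into a per-camera proximity score and then pick the dominant camera. It is a hedged illustration of the general idea only; the actual formulation in [37] may weight or combine the cues differently, and the camera names and numbers are purely illustrative.

```python
# Hedged sketch of a PCH-style dominant-camera choice; the exact formulation in [37] may differ.
from dataclasses import dataclass

@dataclass
class CameraObservation:
    foreground_ratio: float   # fraction of image blocks covered by the tracked object's foreground
    angle_ratio: float        # normalized closeness of the object's heading to the camera's optical axis

def proximity(obs: CameraObservation, w_fg: float = 0.5, w_angle: float = 0.5) -> float:
    """Combine the two cues into a single proximity score in [0, 1]."""
    return w_fg * obs.foreground_ratio + w_angle * obs.angle_ratio

# Illustrative observations for two hypothetical cameras.
observations = {
    "cam_A": CameraObservation(foreground_ratio=0.12, angle_ratio=0.40),
    "cam_B": CameraObservation(foreground_ratio=0.25, angle_ratio=0.75),
}
dominant = max(observations, key=lambda c: proximity(observations[c]))
print(dominant)   # cam_B: the camera with the highest proximity score becomes the dominant camera
```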
The literature review shows that security personnel have to assess events expeditiously through an effective visualization mechanism. From the viewpoint of visualization, fulfilling two requirements is necessary for enhancing the practicability of surveillance cameras and monitors. The first requirement is a universal user interface involving heterogeneous sensor tasking that can effectively convey events, and the second requirement is an automatic display switching arbitration that selects meaningful camera views. From the perspective of the system, a relevant software and hardware architecture corresponding to the new visualization mechanism is essential.

III. HETEROGENEOUS DATA FUSION AND EVENT-DRIVEN VISUALIZATION

An event-driven user interface to guide the visual attention of operators to appropriate situations is proposed. A four-tier scheme shown in Fig. 2 is used to illustrate the idea. Each tier digests and transforms diverse information into high-level messages to be the input of the next tier. The first tier acquires and analyzes streaming videos to generate events. Tier 2 performs sensor tasking for camera collaboration when events occur. Geoinformation fusion for geolocation and visual fusion for summarization are performed in Tier 3, which corresponds to the data fusion component in Fig. 1. For data fusion, Tier 3 acquires public Web data from public-record databases. A geomodel and an event model are used for supporting both Tiers 2 and 3 for constructing a smart monitoring system. Tier 4 aggregates all related data and corresponds to the event-driven visualization component in Fig. 1.

Fig. 2. Four-tier scheme for the proposed visualization mechanism.

Tiers 1 and 2 are fundamental components that have been well discussed in classical 3GSS systems. Our implementations of these two tiers are described in the next section or referred to published papers. Most importantly, Tiers 3 and 4 are the critical components constituting the intelligent visualization scheme proposed in this paper, and are explained in this section. The data fusion subsection explains a compact interface for the visualization of only one event with time-space correspondence. The event-driven visualization subsection proposes an arbitration mechanism for the visualization of multiple spontaneous events.

A. Data Fusion for Compact Interface

A compact interface, achieved by two techniques, time-space correspondence and video summarization, is designed to fuse multimodal information of surveillance data. Traditional data fusion methods assume homogeneous signals are to be fused at the data level, feature level or class level by weighting schemes, which may be achieved by a probabilistic approach. However, heterogeneous information including symbolic, streaming signal and geographic data cannot be fused by a homogeneous weighting scheme; it can only be fused into an effective and compact interface by aggregating and binding the multimodal information with spatial and timing constraints.

Fig. 3. Interface configuration for multimodal information fusion. (a) Three areas to display an event's information of geography, image, video and text. (b) A set of master-slave video pairs to show streaming videos of clusters.

The configuration of the interface is shown in Fig. 3(a), which consists of three areas: a two-dimensional geographic map, a video streaming box and a dynamic text message bar. The 2D map shows a geographic layout of a monitoring area with cameras placed in corresponding locations. An event box may pop up alert messages and key frames for a camera when the video analytics responsible for the camera detect events. Text messages that can be read only by trained security personnel, such as abbreviated event code names, camera identification numbers, and event sketches, are shown in the dynamic text message bar. The video streaming box is responsible for the play of recorded videos of events and live videos of cameras.

As shown in Fig. 3(b), two streaming videos as a master-slave pair are assigned to a camera cluster. A camera cluster is a set of cameras monitoring a physical area within a neighborhood. There can be N clusters for a wide-area surveillance system, and each cluster has two boxes to play streaming videos. However, the number of cameras for each cluster is more than two, which means that two boxes alone are not enough to show all the videos of a cluster. Therefore the two cameras of a cluster chosen to be played in the video streaming box are marked with specific colors corresponding to the boxes' colors. This color correspondence helps the understanding of the synchronized and dynamic behavior between the 2D map area and the video box area. In addition, an automated scheme to dynamically select two camera views to play recorded event videos is devised with a visualizability model and a display switching arbitration method explained in the next subsection, while a manual operation by clicking a camera mark in the 2D map can also force the interface to show event clips and live videos in the master box of the cluster.

Two approaches of video summarization, event clips and key frames, are provided in this compact interface for each detected event. Both static and dynamic summaries improve the efficiency of prompt event filtering and browsing, but have different implications and effectiveness [39]. An object-based key frame extraction method described in our previous paper [7] is applied to this system. When an event is triggered, the corresponding video analytic task records a clip of the event and synchronizes the time and space of the event with our
visualization agent, which is described in Section IV. The agent analyzes the synchronized event clip and extracts key frames in real-time to promptly show the key frame in the event box when an event occurs. The synchronized event clip can be displayed in the video streaming box accompanying the event box, manually or automatically.

Space-time correspondence of multimodal information for a single event is therefore effectively fused in this compact interface. Time stamps of events and GPS coordinates of cameras are two fundamental data for space-time synchronization. A spatio-temporal hypergraph [40] could be applied to organize and synchronize the multimodal information.

With more than two spontaneous events, this compact interface still lacks space to display the information of all events. A strategy called display switching arbitration is necessary to dynamically select the most important events for visualization.

B. Event-driven Visualization by Display Switching Arbitration

The traditional way to display multiple events or all camera videos is infeasible for 3GSS due to the limited space of the interface. A smart display switching method is required to assist a single human operator in operating a multicamera surveillance system at a relatively high level of abstraction. An event-driven visualization strategy with dynamic arbitration of display switching is devised here. The display of our visualization interface can automatically switch to the most meaningful camera views by using a visualizability model.

Display switching arbitration aims at providing a dynamic view of video playing. The display switching arbitration approach consists of a local optimized process and a decentralized optimized process. Local visualizability is signified by the most representative frame of an object in the local video. In multiview videos, clustered visualizability is represented by the most representative frame that provides the optimal view. After assessing the semantic significance of the visualizabilities, we design two criterion functions by applying the visualizabilities to achieve the integration of meaningful representations.

Let us first consider a cluster of cameras $\mathbf{C} = \{C_1, C_2, \ldots, C_N\}$. For visualizability modeling, we express a scoring vector as a linear combination of two individual terms:

$$V = V_e + \gamma \cdot V_{obj}. \qquad (1)$$

The event term $V_e$ encodes the contribution of events for visualizability, and the object term $V_{obj}$ the contribution of object features for visualizability. A penalty $\gamma$, limited to the range $[0,1)$, is added to adjust the importance of object features with respect to the behaviors in the monitoring area. The dominant camera $C_v$ is obtained by calculating the visualizability of each camera, $V(C_i)$, and then choosing the camera with the highest visualizability score, $C_v = \arg\max_{C_i} V(C_i)$, where $C_i$ is the $i$th camera.

The event term is modeled by a weighted indicator function as

$$V_e(C_i) = P_e \cdot \Delta(C_i) \qquad (2)$$

where $\Delta(\cdot)$ is 1 if an event is triggered in camera $C_i$, and $P_e$ is defined as the priority of the event.

The object term represents the score of a weighted sum of $J$ object features, and is defined as

$$V_{obj}(C_i) = \sum_{j=1}^{J} w_j \times P_j(C_i) \qquad (3)$$

where $P_j(\cdot)$ and $w_j$ denote the presentation probability and the importance weight aggregated for the $j$th $\in \{1, 2, \ldots, J\}$ feature, respectively, with $\sum_{j=1}^{J} w_j = 1$. The presentation probability is calculated by

$$P_j(C_i) = \frac{X_j(C_i)}{\sum_{k=1}^{N} X_j(C_k)}. \qquad (4)$$

The parameter $X_j(C_i)$ is the $j$th element in the feature vector $X(C_i)$, which is calculated by applying the local optimized process to camera $C_i$. We perform keyframe extraction in a specific period as the local optimized process. The process is expressed as

$$X(C_i) = X_\epsilon(C_i), \quad \epsilon = \arg\max_{t'} F_L\big(X_{t'}(C_i), w\big), \quad t - \delta < t' < t + \delta \qquad (5)$$

where the criterion function $F_L(X_{t'}(C_i), w) = \sum_{j=1}^{J} w_j \times X_j(C_i)$ can be considered as the local object visualizability. The weighted sum of $X_j(\cdot)$ is the representative score of the local camera.

The proposed method considers three object features, namely the region of the object size, the region containing the object's skin, and the face of the object, to calculate the presentation probability for selecting a camera. We employ a Kalman filter to smooth all features to reduce the measurement noise in these object features.
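The arbitration above reduces to a small amount of arithmetic per camera. The following Python sketch illustrates Eqs. (1)-(4) under stated assumptions: the per-camera feature values (object size, skin region, face score) are assumed to have already been produced by the local keyframe-extraction step of Eq. (5), and the camera identifiers, feature ordering and numbers are illustrative only, not values from the deployed system.

```python
# Sketch of the visualizability model of Eqs. (1)-(4); names and values are illustrative.
from typing import Dict, List

def presentation_probability(features: Dict[str, List[float]], j: int) -> List[float]:
    """Eq. (4): normalize the j-th feature across all cameras in the cluster."""
    col = [x[j] for x in features.values()]
    total = sum(col)
    return [v / total if total > 0 else 0.0 for v in col]

def visualizability(features: Dict[str, List[float]],
                    event_priority: Dict[str, float],
                    weights: List[float],
                    gamma: float = 0.8) -> Dict[str, float]:
    """Eqs. (1)-(3): V(C_i) = V_e(C_i) + gamma * V_obj(C_i)."""
    cams = list(features.keys())
    J = len(weights)
    P = [presentation_probability(features, j) for j in range(J)]   # P[j][camera index]
    scores = {}
    for idx, cam in enumerate(cams):
        v_obj = sum(weights[j] * P[j][idx] for j in range(J))       # Eq. (3)
        v_e = event_priority.get(cam, 0.0)                          # Eq. (2): P_e * Delta(C_i)
        scores[cam] = v_e + gamma * v_obj                           # Eq. (1)
    return scores

# Illustrative cluster with three cameras and three object features
# (object size, skin region, face score).
features = {"C1": [0.2, 0.1, 0.0], "C2": [0.0, 0.0, 0.0], "C3": [0.5, 0.3, 0.9]}
priorities = {"C1": 2.0, "C2": 2.0, "C3": 1.0}   # watchful priority of the active events
scores = visualizability(features, priorities, weights=[1/3, 1/3, 1/3])
dominant = max(scores, key=scores.get)            # C_v = argmax_i V(C_i)
print(scores, dominant)
```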
IV. THE PROPOSED LARGE-SCALE INTELLIGENT VIDEO SURVEILLANCE SYSTEM

The proposed large-scale IVS system has a complex software and hardware architecture for use in a real environment; the detailed architecture is shown in Fig. 4. The main constituents of this system are four subsystems: intelligent visualization, sensor tasking, communication, and video streaming and storage.

Fig. 4. Detailed architecture of the large-scale intelligent video surveillance system.

A brief description of the interactions among the four subsystems is given first. Events triggered by the focus of attention in the sensor tasking subsystem are transmitted to the event agent in the intelligent visualization subsystem. The event agent then manages and stores diverse information on the event into several databases; in addition, the event agent sends the event to data fusion and the visualization. The visualization agent immediately presents all related information on multiple interfaces. The visualization agent also requests the cooperative surveillance to invoke related video analytics and control PTZ cameras. A queue manager is responsible for data bridging and buffering among subsystems. Details of the four subsystems are given below.

A. Intelligent Visualization Subsystem

Our intelligent visualization includes multiple user interfaces implemented for the central control room of our system. A Web-based interface, referred to herein as the main interface (see Fig. 6(a) for an example), uses the data fusion and smart display switching described in the previous section to seamlessly present a lot of prompt event results. It also offers a live mode for real-time event alerting and a playback mode for event retrieval.

Two other interfaces are also provided to supplement the abundant information of events. A 3D interface (see Fig. 7 for an example) blending Google Earth and live videos by homography projection is provided as an auxiliary display for aiding operators in determining the location of tracked objects [8]. The 3D interface provides an immersive surveillance experience to simulate peripheral vision. Video synopsis [9-10], as a high-efficiency video object retrieval technique, is also used in this system to effectively remove redundancies from a video and condense long videos into clips. It provides object-based video retrieval to supplement the event information in the main interface. An example of video synopsis is given in the middle one of the top three screens in Fig. 6(c).

The interaction among the three interfaces is all triggered by the main interface, because the main interface gives the personnel abstract but summarized information and the other two interfaces give the personnel further detailed information. The 3D interface is controlled from the main interface by selecting a corresponding area on the 2D map. More details of the area can then be monitored in depth by the security personnel with three-dimensional views. To see an object's behavior in the video synopsis interface, the personnel chooses an object of an event displayed in the main interface and the visualization agent then controls the synopsis task to play videos.

B. Sensor Tasking Subsystem

The sensor tasking subsystem includes a lot of video analytics tasks that work independently or collaboratively. Focus of attention represents a group of tasks that detect interested events independently. Cooperative surveillance represents collaborative tasks that monitor a wider area of interest. Focus-of-attention tasks are called attention tasks for brevity, and collaboration tasks are the shortened form of collaborative surveillance tasks.

Attention tasks in the focus of attention group are subdivided into three stages according to the necessity of continual manual inspection. Each stage is influenced by the consequence of the foregoing stage, and the tasks in a higher stage need more cameras for analysis. The first stage involves preprocessing tasks. The second stage involves detecting and tracking routine information about objects such as moving pedestrians and empty parking spaces. Complex situations are interpreted on the basis of object recognition, and evidence reasoning is performed for event inference. Processes at the third stage tend to be highly situation specific, and behavior and activity analysis tasks depend highly on the segmented regions corresponding to the objects of interest in an image. The final results of attention tasks are event-focused. (A minimal sketch of this staged flow is given after the task descriptions below.)

There are four attention tasks. To confirm that the image quality and the field of view of surveillance videos are favorable, a camera anomaly detection [11] task detects abnormal camera events such as defocusing, translating, and covering by statistically calculating accumulated variations of features in the temporal domain. Robust detection of vacant spaces in outdoor parking lots by using a multicamera monitoring method was practiced; the task was named parking space counting [12]. The task automatically counts the number of vacant spaces in a large parking lot, demonstrating its potential use for parking assistance. Vehicle anomaly, a concern regarding illegal parking, is detected by performing an illegal parking detection [13] task. This detection method was designed without background modeling. A texture-based method, which can overcome environmental challenges, is used for target tracking. Intrusion detection for a climbing [14] task is a constructive constituent for monitoring people anomalies. The human object in videos is segmented into body parts using deformable triangulation, and postures are analyzed for verifying climbing events in a restricted area.
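The staged organization described above can be captured by a small gating structure in which each stage runs only if the previous stage produced something to act on. The following Python sketch is a minimal illustration of that idea, assuming hypothetical stage functions (preprocess, detect_and_track, analyze_behavior) that are placeholders, not the actual analytics of [11]-[16].

```python
# Minimal sketch of the three-stage attention-task flow; stage functions are hypothetical placeholders.
from typing import List, Optional

def preprocess(frame) -> Optional[dict]:
    """Stage 1: preprocessing (e.g., reject frames from an anomalous camera)."""
    return {"frame": frame}             # returning None would mean "do not continue"

def detect_and_track(ctx: dict) -> List[dict]:
    """Stage 2: routine object detection and tracking (pedestrians, parking spaces)."""
    return []                           # list of tracked objects; empty means nothing to analyze

def analyze_behavior(objects: List[dict]) -> List[dict]:
    """Stage 3: situation-specific behavior/activity analysis on segmented regions."""
    return []                           # list of inferred events

def run_attention_task(frame) -> List[dict]:
    """Each stage is gated by the outcome of the foregoing stage."""
    ctx = preprocess(frame)
    if ctx is None:
        return []
    objects = detect_and_track(ctx)
    if not objects:
        return []
    return analyze_behavior(objects)
```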
Each collaborative task in the cooperative surveillance group is a single task with more than two cameras and more than two video analytics for an event. It is beneficial for a monitored area wider than one camera view. A PTZ camera is always utilized to complement a static camera. Video analytics of a static collaborative task are decentralized and performed on embedded systems in PTZ cameras. Collaborative tasks are reactive rather than proactive.

Two collaborative tasks are implemented. In a loitering detection task, a moving target in a wide open area is monitored through the automatic control of an embedded PTZ camera [15], combined with an intrusion detection algorithm on a static camera for continuous tracking. An illegal parking task combines a car parking detection algorithm with a human face detection [16] for a dynamic scene. The car parking event detected in an illegal position triggers the analytics in a PTZ camera to dynamically track a high-resolution human face, which is useful for forensics.

C. Communication Subsystem

A publisher-subscriber communication pattern was implemented to facilitate message delivery and enhance cooperation among the event agent, video acquisition and streaming agent, visualization agent, attention tasks, and Web clients. The sender is called a topic publisher, and the queue manager, according to the registered publisher-subscriber table, sends the information from the appointed publisher to all subscribers who require it. Because the information may be too abundant to transmit, many queues are used for publishers by the queue manager. Three topics and five queues are set for the various publishers, and a topic can map to several queues.

When an event is triggered, the visualization agent publishes the fused information to the topic of visualization, which is handled by the alert queue and the short message service (SMS) queue. The alert queue delivers push notifications to Web browsers, and the SMS queue sends an SMS notification. The keyframe extracted during the summarization mission consists of simple text messages and links that can be used to send SMS notifications to mobile clients; such links can be effective for driving users to a Web site or download link. Compared with push notifications that are limited to smartphones, SMS support is ubiquitous and available in all phones. Furthermore, a video streaming server publishes on the topic of video streaming. The video queue obtains videos and then embeds them into the main interface.
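As a concrete illustration of this publisher-subscriber pattern, the sketch below routes published messages from topics to queues through a registration table. It is a minimal Python sketch under stated assumptions: the topic and queue names (visualization, alert_queue, sms_queue, video_queue) and the message fields are hypothetical stand-ins, not the identifiers used in the deployed queue manager.

```python
# Minimal publisher-subscriber sketch; topic/queue names and message fields are hypothetical.
from collections import defaultdict, deque

class QueueManager:
    def __init__(self):
        self.topic_to_queues = defaultdict(list)   # the registered publisher-subscriber table
        self.queues = defaultdict(deque)

    def register(self, topic: str, queue_name: str):
        """A topic can map to several queues."""
        self.topic_to_queues[topic].append(queue_name)

    def publish(self, topic: str, message: dict):
        """The queue manager forwards a published message to every subscribed queue."""
        for queue_name in self.topic_to_queues[topic]:
            self.queues[queue_name].append(message)

    def consume(self, queue_name: str):
        q = self.queues[queue_name]
        return q.popleft() if q else None

# Example wiring echoing the description above: the visualization topic feeds
# both the alert queue (Web push) and the SMS queue.
qm = QueueManager()
qm.register("visualization", "alert_queue")
qm.register("visualization", "sms_queue")
qm.register("video_streaming", "video_queue")

qm.publish("visualization", {"event": "illegal_parking", "camera": "C3", "keyframe": "frame_0123.jpg"})
print(qm.consume("alert_queue"), qm.consume("sms_queue"))
```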
D. Video Streaming and Storage Subsystem

The proposed system provides its output on the Web-based user interface and receives streaming videos from multimodal cameras. The Web interface facilitates use of the system on portable devices such as smartphones and smartpads. The user interface is melded from static and dynamic data, where the HTTP protocol is used for accessing static data such as the interface layout. The dynamic data achieve bidirectional communication with a server through an AJAX or WebSocket protocol, facilitating live content transmission and the identification of real-time events. For streaming live video from existing IP cameras and streaming servers, a VLC plugin with AJAX is used. With the rapid development of HTML5 technology, HTML5 elements are effective for accessing video clips and map services.

Cameras in the system are not restricted to networked cameras; analog cameras are also supported, where the analog signal is digitized through a video server that provides real time streaming protocol (RTSP) video to the back-end. To balance network traffic and monitoring needs, each stationary camera owns at least two real-time streaming videos simultaneously. The two streaming videos may have different settings. One is a high-quality video encoded in the H.264 format with resolution as high as possible for providing forensic evidence in the storage subsystem, which is mounted to a large-scale storage area network with cloud storage. The other video has QVGA resolution, which is enough for video analysis. In addition to containing historical footage, the storage subsystem includes a relational database management subsystem built on SQLite that serves as the event database, and video clips of marked events and video synopses corresponding to events are recorded in the historical footage. If clients query events, the native video clips can be cut from the event database and streamed from the video streaming server.

V. TEST RUN AND CASE STUDY

The proposed system has been developed and installed in a campus with the Vision Based Intelligent Environment (VBIE) project [41] in Taiwan. More than 20 cameras classified into two camera clusters were set up and distributed in two disjoint regions of the campus, as shown in Fig. 5. Each cluster has a cooperative task with one collaborative PTZ camera.

Configuration of the camera networks is not trivial, and a lot of approaches were applied in the proposed system for different situations and constraints. The first thing to mention is that both overlapping and nonoverlapping fields of view (FoVs) of cameras are utilized in this system. The four cameras for parking space counting [12] are configured with overlapping FoVs, and the cameras for the other video analytics are configured with nonoverlapping FoVs. Camera calibration for the parking space counting utilizes the epipolar-plane constraint among cameras and feature points of the scene to obtain intrinsic and extrinsic parameters of the cameras. Exact object positions in three-dimensional space can then be estimated for the accurate counting of vacant spaces. A pair of PTZ and static cameras in the cooperative surveillance tasks collaborate to track the same object for an event, which also needs to be coordinated and calibrated. Object location is estimated first in the static view and the PTZ camera is then controlled to capture a high resolution object image or to track the behavior of the object. An assumption of depth information is applied in advance, so the static camera sends only x and y coordinates to the PTZ camera to adjust pan and tilt parameters. Details of the camera calibration and configuration for parking space counting and cooperative surveillance tasks are referred to published papers.
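The static-to-PTZ hand-off described above boils down to mapping an image position to pan and tilt angles once a depth assumption is fixed. The following Python sketch is a simplified geometric illustration of that idea, assuming a pre-calibrated image-to-ground homography for the static camera, a known PTZ mounting position, and an assumed constant target height; the matrix, positions and numbers are illustrative, not the calibration used in the deployed system.

```python
# Simplified static-to-PTZ hand-off sketch; geometry and numbers are illustrative assumptions.
import math

def ground_point_from_image(x: float, y: float, H: list) -> tuple:
    """Map an image point to a ground-plane point with a 3x3 homography H
    (H is assumed to be pre-calibrated for the static camera)."""
    X = H[0][0] * x + H[0][1] * y + H[0][2]
    Y = H[1][0] * x + H[1][1] * y + H[1][2]
    W = H[2][0] * x + H[2][1] * y + H[2][2]
    return X / W, Y / W

def pan_tilt_for_target(gx: float, gy: float, ptz_pos=(0.0, 0.0, 4.0), target_height=1.7):
    """Pan/tilt angles (degrees) pointing a PTZ camera at a target standing on the ground.
    ptz_pos is the assumed PTZ position (x, y, mounting height); target_height is an assumed constant."""
    dx, dy = gx - ptz_pos[0], gy - ptz_pos[1]
    dz = target_height - ptz_pos[2]
    pan = math.degrees(math.atan2(dy, dx))
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt

# Illustrative homography and a detection at image point (500, 600) in the static view.
H = [[0.01, 0.0, -3.2], [0.0, 0.01, -4.0], [0.0, 0.0, 1.0]]
gx, gy = ground_point_from_image(500, 600, H)
print(pan_tilt_for_target(gx, gy))
```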
Fig. 5. Sketch of the two camera clusters of the system on a campus map. The two red rectangles in the top map represent the two disjoint clusters, and red dots are cameras. 3D views of the two corresponding camera clusters are shown at the bottom.

The remaining cameras have disjoint FoVs, and an intuitive method is applied to ease the calibration loading of the whole system. A sparse geolocation method that calibrates the GPS coordinates of cameras against coordinates in the 2D map is used. The sparse geolocation refers to the geographic location of a camera where an event happens, which is used in our event-based representation in the visualization mechanism.

The Web-based main interface offers live and playback modes, as shown in Fig. 6(a) and (b). The live mode presents real-time information, and the playback mode is useful for event-based retrieval. The illegal parking event shown in this example is triggered and the event box with summarized text attributes and a keyframe pops up to indicate the alerted camera. The corresponding camera views of a cooperative task are live-streamed to the second line of the video streaming box, with the master box for a collaborative zoomed view and the slave box for the alerted camera view. A playback mode example shown in Fig. 6(b) illustrates the play of a recorded event clip of an intrusion detection by illegal fence climbing. Fig. 6(c) shows a real picture of the multiscreen interface deployed at our central control room. The two large interfaces at the bottom are the main interface on the right and the 3D interface on the left. Three other screens at the top display the results of three selected tasks, from right to left: the parking space detection task, video synopsis, and a live camera view for any selected camera.

Six examples of the 3D interface for the road surveillance around a building are shown in Fig. 7. The 2D map in the middle is part of the main interface. A click in the 2D map commands the 3D interface to show the Google Earth view at that location, and a live video is blended with the 3D view to give live 3D monitoring.

Fig. 6. Web-based main interface of (a) live mode and (b) playback mode. (c) Multiscreen interface.

The following gives experimental results of overall performance in pragmatic test runs, and simulation results of the accuracy and stability of the display switching arbitration.

A. Overall System Performance of Real Test Runs

The system was operated and tested for more than six months, and its reliability and robustness were verified in 33 formal test runs. The test runs were conducted on normal and abnormal behaviors of humans and vehicles under different weather conditions in a cluttered outdoor environment. Few false alarms were triggered, and many illegal events were successfully detected. Detailed performance analysis of each single task is referred to [7-16]. The system failed three times, that is, 91% accuracy, when improper key frames and video clips were
shown in the visualization interface. Concurrency control and conflicts among tasks induced the problem, and a delay and queuing mechanism with database concurrency could solve the issue.

Fig. 7. Six examples of the 3D interface interacting with the 2D map in the center. The darker areas in the six 3D examples are 3D information from Google Earth, and the lighter areas are blended live videos by homographic transformation.

Fig. 8 shows examples of successful event detections under challenging conditions. An illegal parking event that occurred on a rainy day is shown in Fig. 8(a); dynamic face detection was triggered to capture the driver's face, which was partly obscured by an umbrella. Fig. 8(b) and (c) show events on a sunny day with the presence of complex and dynamic shadows. A successful case of intrusion detection is shown in Fig. 8(b); a PTZ camera continuously tracked the person who climbed over a wall, as shown in Fig. 9. When the camera was sabotaged by a bag, the camera anomaly detection task issued an alert and provided a pair of images before and after the anomaly event, as depicted in Fig. 8(c). Fig. 8(d) presents an example of object retrieval. The left image shows a frame of a video synopsis, and the source video of the object in the yellow box is shown on the right side.

Fig. 8. Examples of successfully detected illegal events under different weather conditions.

Figs. 8(a) and (b) show two collaborative cases. The alert is first triggered by the attention task working on stationary cameras, and subsequently, the visualization agent assumes the responsibility of capturing a close-up image of the human face by controlling a collaborative PTZ camera.

Fig. 9. An example of embedded PTZ camera control for moving-object tracking.

When an event is triggered, the keyframe is fused with the map and text messages. Fig. 10 shows three examples of actively triggered events. Figs. 10(a) and (b) show a climbing event and a camera anomaly event, respectively. A set of collaborative events involving an illegal parking event and dynamic detection of the driver's face is shown in Fig. 10(c). In this case, after an alert was displayed in the 2D map for the illegal parking event, the collaborative dynamic face detection task found the face close to the vehicle's door. The face image was captured if dynamic face detection was successful. There are only three cameras in the camera cluster and each camera simultaneously records a video for the same triggered event. Different display switching results are produced because of the different event and object visualizabilities of the cameras. The event visualizability is equal to the watchful priority in this situation because all cameras have events.

Table II shows three cases of different watchful priorities for three events, where the climbing event occurs at Camera 1 and the camera anomaly event and illegal parking event occur at Cameras 2 and 3, respectively. The object visualizability is given by $V_{obj} = (0.2, 0, 0.8)$ because Camera 2 does not have any object and Camera 3 records a clear face that is larger than the object of Camera 1. The penalty $\gamma$ is set to
0.8. In Case 1, $V_e = (2, 2, 1)$, and the final visualizability is given by $V = (2.16, 2, 1.64)$, where Camera 1 is selected as the dominant camera. In Case 2, when the three events have the same watchful priority of 2, the visualizability of Camera 3 increases to 2.64, resulting in this camera being the dominant camera. When the priority of the camera anomaly is increased to 3 in Case 3, Camera 2 becomes the dominant camera, and it has a visualizability of 3.

Fig. 10. Examples of actively triggered events.

TABLE II. EXAMPLES OF DISPLAY SWITCHING ARBITRATION FOR EVENTS OCCURRING SIMULTANEOUSLY AND γ = 0.8.

Events: climbing at camera C1, camera anomaly at camera C2, illegal parking at camera C3.
Object visualizability V_obj(C_i): C1 = 0.2, C2 = 0, C3 = 0.8.
Case 1: V_e(C_i) = (2, 2, 1); V(C_i) = (2.16, 2, 1.64).
Case 2: V_e(C_i) = (2, 2, 2); V(C_i) = (2.16, 2, 2.64).
Case 3: V_e(C_i) = (2, 3, 2); V(C_i) = (2.16, 3, 2.64).
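To make the arbitration in Table II concrete, the following short Python sketch recomputes the three cases with Eq. (1), using γ = 0.8 and V_obj = (0.2, 0, 0.8) exactly as listed above; it is only a numerical check of the table, not part of the deployed system.

```python
# Numerical check of Table II: V(C_i) = V_e(C_i) + gamma * V_obj(C_i) with gamma = 0.8.
gamma = 0.8
v_obj = [0.2, 0.0, 0.8]                       # object visualizability of C1, C2, C3
cases = {
    "Case 1": [2, 2, 1],                       # event (watchful) priorities per camera
    "Case 2": [2, 2, 2],
    "Case 3": [2, 3, 2],
}
for name, v_e in cases.items():
    v = [ve + gamma * vo for ve, vo in zip(v_e, v_obj)]
    dominant = 1 + v.index(max(v))             # 1-based index of the camera with the highest score
    print(name, [round(x, 2) for x in v], "dominant camera:", dominant)
# Expected output, matching Table II: Case 1 -> [2.16, 2.0, 1.64], C1;
# Case 2 -> [2.16, 2.0, 2.64], C3; Case 3 -> [2.16, 3.0, 2.64], C2.
```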
B. Evaluation of Display Switching Arbitration at Ordinary Times

The accuracy and stability of the display switching arbitration are verified on a public data set and compared with the PCH [37] and CSM [38] algorithms. The weighted vector of the proposed method was $w = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. We use the ICDSC 2009 data set [42], which contains videos of routine activities of people in an indoor and common home environment recorded using eight cameras. The multiview of this data set includes dense and sparse views. The data set includes complex activities rather than only walking. We selected an activity performed in the living room. The video sequences of four cameras of the activity are shown in Fig. 11(a). Three of the cameras have overlapped FoVs. The resolution of all sequences was 320 × 240. The frame rate of the video was approximately 10 frames/s, and the videos were produced using the XVID codec. We apply background subtraction to extract human objects in the videos.

Fig. 11. Result obtained using the proposed approach in the case of the Smart Home sequences of the ICDSC 2009 data set. (a) Original images of the sequences and a top view. (b) Cluster keyframes of the decentralized process.

Fig. 11(a) shows a person walking through the entrance of the living room, proceeding to sit on the sofa to watch TV, then picking up a table magazine, and then changing to the other seat while reading the magazine. The segment that we sampled in the sequences was from Frame 0 to Frame 450; there was no overlap in the FoVs of Cameras 1 and 2 for this sequence. The top view shows that the object remained in each place. Three cameras in the living room detected the object in Frame 183. The person who moves to another place is not detected by Camera 3 in Frame 394. Finally, he returns to the sofa and sits in Frame 470.

The selected dominant camera views of the proposed clustered object visualizability are shown in Fig. 11(b). It demonstrates the capability of the proposed approach to select the FoV with the representative object. When a person is monitored by two cameras, the camera that captures the face or a larger extent of the skin is selected as the dominant camera. The accuracies, defined as the ratio of the number of image frames with the correct dominant camera selection over the total number of image frames, of the proposed camera selection, PCH [37] and CSM [38] are 95.87%, 56.72%, and 56.73%, respectively. When we replace the features of the PCH and the CSM with our proposed features, the accuracies of PCH and
CSM are increased to 75.82% and 78.77%. The results show that the proposed semantic features are very effective, and that the optimization algorithm with the visualizability criterion is a more effective design than the two compared algorithms. These two components constitute the high accuracy of the proposed method.

Fig. 12 shows the stability of camera selection for the test scenario. Stability refers to the ability of an algorithm to stably generate a cluster keyframe without false detection. Figs. 12(a) and (b) show the results obtained using the PCH and CSM algorithms, respectively. The PCH algorithm detected a dominant camera only from Frame 160, and has false negatives for frames from 70 to 150. Both algorithms were highly unstable from Frame 160, where the object appeared in both Cameras 3 and 4. Fig. 12(c) shows a relatively smooth result that is good for display switching. The use of Kalman filtering, semantic features and the optimization of the visualization model constitute the successful and stable results of the proposed method.

Fig. 12. Results of camera selection by (a) the PCH method [37], (b) the CSM method [38], and (c) the proposed method; the X-axis represents the frame number and the Y-axis represents the camera number.

VI. CONCLUSION

This paper presents a large-scale scalable system extended from 3GSSs. The system integrates a lot of computer vision tasks, such as object detection, camera anomaly detection, keyframe extraction, and mobile surveillance, with the knowledge acquired from over 10 years' experience of cooperation among academia, industry, and the government. By integrating algorithmic tasks through the proposed event-driven visualization and systems engineering techniques, an efficient system for wide-area visual surveillance by a single security operator is presented. The multitier scheme with a novel visualization mechanism is proposed for centralizing all heterogeneous surveillance information into a universal Web-based user interface. A new camera selection method considering not only objects but also events is developed with the display switching arbitration. The method enables meaningful FoV selection and smooth handoff among multiple cameras for both normal and event-alerting situations. The fusion of multimodal information is advocated for event-driven visualization. A collaborative scheme involving sensor tasking at static and dynamic cameras is proposed for object-oriented visualization to aid continuous visual tracking.

Although the visualization mechanism and new system concepts to complement 3GSSs have been presented in this paper, future work remains. In addition to accuracy and stability, the evaluation of system performance could apply other measures such as time to receive events, ergonomics and comfort. A subjective evaluation of the degree of assistance of the proposed approach to security personnel can be quantitatively assessed with respect to user interface issues such as ergonomics and comfort. Robust multiobject tracking across multiple cameras [43] can be helpful to the display switching arbitration. For example, if a person is to be monitored across cameras, an efficient human tracking technique can help the system determine the dominant camera for display switching with more accuracy.

ACKNOWLEDGEMENTS

The authors would like to thank the professors in the Vision Based Intelligent Environment project: Yi-Ping Hung, Sheng-Wen Shih, Yong-Sheng Chen, Jun-Wei Hsieh, Chi-Hung Chuang, Chin-Teng Lin, Cheng-Chang Lian, Chin-Chuan Han, Hsi-Jian Lee, Sheng-Jyh Wang, Daw-Tung Lin, Kuo-Chin Fan, Wen-Thong Chang and Wen-Hsiang Tsai, for their contribution to the success of this system. The authors also thank the graduate students of these professors for their tireless efforts and support.

REFERENCES

[1] X. Li, R. Lu, X. Liang, X. Shen, J. Chen, and X. Lin, "Smart community: An internet of things application," IEEE Communications Magazine, vol. 49, no. 11, pp. 68-75, Nov. 2011.
[2] T. D. Raty, "Survey on contemporary remote surveillance systems for public safety," IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 99, pp. 1-23, Mar. 2010.
[3] G. Smith, "Behind the screens: Examining constructions of deviance and informal practices among CCTV control room operators," Surveill. Soc., vol. 2, no. 2-3, 2002.
[4] X. Wang, "Intelligent multi-camera video surveillance: A review," Pattern Recognition Letters, vol. 34, no. 1, pp. 3-19, 2013.
[5] L. Yu and T. E. Boult, "System issues in distributed multi-modal surveillance," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Minnesota, USA, Jul. 2007, pp. 1-2.
[6] F. Porikli, F. Bremond, et al., "Video surveillance: Past, present, and now the future," IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 190-198, 2013.
[7] Y. K. Wang, L. Y. Wang, Y. C. Huang, and C. T. Fan, "An online object-based key frame extraction method for the abstraction of surveillance videos," in Proc. National Computer Symposium, Taiwan, pp. 241-249, November 2009.
[8] K. W. Chen, C. W. Lin, T. H. Chiu, M. Y. Chen, and Y. P. Hung, "Multi-resolution design for large-scale and high-resolution monitoring," IEEE Transactions on Multimedia, vol. 13, no. 6, pp. 1256-1268, Dec. 2011.
[9] D. T. Lin and L. Y. Liu, "Method of detecting moving object," U.S. Patent 121268,603. (Pending)
[10] Y. Pritch, A. Rav-Acha, and S. Peleg, "Nonchronological video synopsis and indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1971-1984, Nov. 2008.
[11] Y. K. Wang, C. T. Fan, K. Y. Cheng, and P. S. Deng, "Real-time camera anomaly detection for real-world video surveillance," in Proc. International Conference on Machine Learning and Cybernetics, vol. 4, China, Jul. 2011, pp. 1520-1525.
[12] C. C. Huang, S. J. Wang, Y. J. Chang, and T. Chen, "A hierarchical Bayesian generation framework for vacant parking space detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 12, pp. 1770-1785, Dec. 2010.
[13] C. C. Lien, Y. T. Tsai, M. H. Tsai, and L. G. Jang, "Vehicle counting without background modeling," in Proc. International Conference on Advances in Multimedia Modeling, Part I, Taipei, Taiwan, Jan. 2011.
[14] J. W. Hsieh, C. H. Chuang, S. Y. Chen, C. C. Chen, and K. C. Fan, "Segmentation of human body parts using deformable triangulation," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 40, no. 3, pp. 596-610, May 2010.
[15] C. T. Lin, L. Siana, Y. W. Shou, and T. K. Shen, "A conditional entropy-based independent component analysis for applications in human detection and tracking," EURASIP Journal on Advances in Signal Processing, vol. 2010, Apr. 2010.
[16] R. Khemmar, J. Y. Ertaud, and X. Savatier, "Face detection and recognition based on fusion of omnidirectional and PTZ vision sensors and heterogeneous database," International Journal of Computer Applications, vol. 61, no. 21, 2013.
[17] N. Haering, P. L. Venetianer, and A. Lipton, "The evolution of video surveillance: An overview," Machine Vision and Applications, vol. 19, no. 5-6, pp. 279-290, Sep. 2008.
[18] J. Ferenbok and A. Clement, "Hidden changes: From CCTV to 'smart' video surveillance," in A. Doyle, R. Lipert, and D. Lyon (Eds.), Eyes Everywhere: The Global Growth of Camera Surveillance, pp. 218-234, New York: Routledge, 2011.
[19] P. M. Roth, V. Settgast, P. Widhalm, M. Lancelle, J. Birchbauer, N. Brandle, S. Havemann, and H. Bischof, "Next-generation 3D visualization for visual surveillance," in Proc. IEEE Conference on Advanced Video and Signal Based Surveillance, Santa Fe, USA, Sep. 2011, pp. 343-348.
[20] A. Girgensohn, D. Kimber, J. Vaughan, T. Yang, F. Shipman, T. Turner, E. Rieffel, L. Wilcox, F. Chen, and T. Dunnigan, "DOTS: Support for effective video surveillance," in Proc. International Conference on Multimedia, Augsburg, Germany, Sep. 2007, pp. 423-432.
[21] N. Martinel, C. Micheloni, C. Piciarelli, and G. L. Foresti, "Camera selection for adaptive human-computer interface," IEEE Trans. on Systems, Man, and Cybernetics: Systems, vol. 44, no. 5, May 2014.
[22] Y. L. Tian, L. Brown, A. Hampapur, M. Lu, A. Senior, and C. F. Shu, "IBM smart surveillance system (S3): Event based video surveillance system with an open and extensible framework," Machine Vision and Applications, vol. 19, no. 5-6, pp. 315-327, Sep. 2008.
[23] D. Kieran, J. Weir, and W. Q. Yan, "A framework for an event driven video surveillance system," Journal of Multimedia, vol. 6, no. 1, Feb. 2011.
[24] G. C. Chao, Y. P. Tsai, and S. K. Jeng, "Augmented keyframe," J. Vis. Commun. Image R., vol. 21, pp. 682-692, 2010.
[25] A. M. Ferman, A. M. Tekalp, and R. Mehrotra, "Robust color histogram descriptors for video segment retrieval and identification," IEEE Trans. on Image Proc., vol. 11, no. 5, pp. 497-508, May 2002.
[26] H. C. Lee and S. D. Kim, "Rate-driven key frame selection using temporal variation of visual content," Electron. Lett., vol. 38, no. 5, pp. 217-218, February 2002.
[27] K. Sze, K. Lam, and G. Qiu, "A new key frame representation for video segment retrieval," IEEE Trans. on Circuits Syst. Video Technol., vol. 15, no. 9, pp. 1148-1155, 2005.
[28] B. Erol and F. Kossentini, "Automatic key video object plane selection using the shape information in the MPEG-4 compressed domain," IEEE Trans. on Multimedia, vol. 2, pp. 129-138, June 2000.
[29] C. Kim and J. N. Hwang, "Object-based video abstraction for video surveillance systems," IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, pp. 1128-1138, 2002.
[30] E. Spyrou and Y. Avrithis, "A region thesaurus approach for high-level concept detection in the natural disaster domain," Lecture Notes in Computer Science, LNCS 4816, pp. 74-77, 2007.
[31] Z. Ji, Y. Su, R. Qian, and J. Ma, "Surveillance video summarization based on moving object detection and trajectory extraction," in Proc. Int. Conf. on Signal Processing Systems, Yantai, China, pp. 250-253, July 2010.
[32] J. Yoder, H. Medeiros, J. Park, and A. Kak, "Cluster-based distributed face tracking in camera networks," IEEE Trans. on Image Proc., vol. 19, no. 10, pp. 2551-2563, October 2010.
[33] T. Matsuyama and N. Ukita, "Real-time multitarget tracking by a cooperative distributed vision system," Proc. IEEE, vol. 90, no. 7, pp. 1136-1150, July 2002.
[34] H. Medeiros, J. Park, and A. C. Kak, "Distributed object tracking using a cluster-based Kalman filter in wireless camera networks," IEEE Journal of Selected Topics in Signal Proc., vol. 2, no. 4, pp. 448-463, August 2008.
[35] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, "A survey on wireless multimedia sensor networks," Computer Networks, vol. 51, pp. 921-960, 2007.
[36] N. Martinel, C. Micheloni, C. Piciarelli, and G. L. Foresti, "Camera selection for adaptive human-computer interface," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 5, pp. 653-664, 2014.
[37] J. Kim and D. Kim, "Probabilistic camera hand-off for visual surveillance," in Proc. Int. Conf. on Distributed Smart Cameras, Stanford, USA, pp. 1-8, September 2008.
[38] R. Goshorn, J. Goshorn, D. Goshorn, and H. Aghajan, "Architecture for cluster-based automated surveillance network for detecting and tracking multiple persons," in Proc. ACM/IEEE Int. Conf. on Distributed Smart Cameras, Vienna, Austria, pp. 219-226, September 2007.
[39] A. Bakhtari, M. D. Naish, M. Eskandari, E. A. Croft, and B. Benhabib, "Active-vision-based multisensor surveillance: An implementation," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 36, no. 5, pp. 668-680, 2006.
[40] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z. H. Zhou, "Multi-view video summarization," IEEE Transactions on Multimedia, vol. 12, no. 7, pp. 717-729, 2010.
[41] Vision-based Intelligent Environment project (accessed Oct. 2015). [Online]. Available: http://cvrc.nctu.edu.tw/~TT/home.php?&lang=en.
[42] ICDSC Challenge - Smart Homes Data set (accessed 2009). [Online]. Available: http://wsnl2.stanford.edu/icdsc09challenge/.
[43] C. M. Huang and L. C. Fu, "Multitarget visual tracking based effective surveillance with cooperation of multiple active cameras," IEEE Trans. Syst., Man, Cybern. B, vol. 41, no. 1, pp. 234-247, 2011.
