Big Data Research: António Cruz, Joel P. Arrais, Penousal Machado


Big Data Research 27 (2022) 100291

Contents lists available at ScienceDirect

Big Data Research


www.elsevier.com/locate/bdr

Force-Directed Timelines: Visualizing & Exploring Temporal Patterns


António Cruz ∗ , Joel P. Arrais, Penousal Machado
University of Coimbra, CISUC, Department of Informatics Engineering, Coimbra 3030-290, Portugal

Article info

Article history:
Received 14 February 2021
Received in revised form 9 August 2021
Accepted 28 October 2021
Available online 10 November 2021

Keywords:
Data visualization
Time-series analysis
Interactive systems

Abstract

Visualization has been shown to be a valuable tool in the analysis of large and complex temporal datasets, aided by the emergence of new models such as Time Curves, which distort timelines to position time points based on their similarity with each other, reflecting changes in the data over time. In this paper, we further explore time-series functionally and aesthetically by presenting an interactive and parameter-based implementation of the Time Curves model, complemented with the addition of supporting visualizations and data analysis methods. In our implementation we introduce Time Paths, a force-directed layout that can dynamically transform the original model to not only smoothen the transitions between time points, but also reduce visual noise in favor of portraying overall patterns. The proposed addition of visual elements to the model includes temporal glyphs and a supporting timeline graph which help discover and better understand temporal patterns across complex datasets. Through interactive exploration, we demonstrate how these methods can be used to analyze and identify the main agents at the source of significant instances in three biological datasets. These methods are presented within CroP, a data visualization tool with coordinated multiple views aimed at the analysis of biological datasets.
© 2021 Elsevier Inc. All rights reserved.

* Corresponding author.
E-mail addresses: antonioc@dei.uc.pt (A. Cruz), jpa@dei.uc.pt (J.P. Arrais), machado@dei.uc.pt (P. Machado).

https://doi.org/10.1016/j.bdr.2021.100291
2214-5796/© 2021 Elsevier Inc. All rights reserved.

1. Introduction

Multiple disciplines must contend with the study of subjects which contain temporal variables, requiring the analysis of complex datasets that describe diverse processes changing over time simultaneously [1,2]. For instance, biological datasets often contain large quantities of multivariate data, including temporal and spatial attributes, that must be complemented with data from external databases. Visualization can be a powerful tool in data analysis, providing users with the means to navigate, brush and filter information through simple representations, as well as new ways to visualize data, highlighting patterns and significant moments [3,4]. Through this, it may be possible to obtain a deeper understanding of observed behaviors in datasets, identify their sources and potentially predict future events.

In this paper, we present new visual and interactive approaches for the exploration and analysis of time-series datasets based on the Time Curves layout [5], where timelines are bent using multidimensional scaling to position time points relative to their similarity. Our main contributions are twofold. Firstly, we present an interactive, parameter-based implementation of the Time Curves model, which is capable not only of representing the overall temporal patterns in diverse datasets, but also of smoothing the resulting visualizations to highlight predominant patterns. The latter is achieved through Time Paths, a force-directed layout that redraws Time Curve visualizations to control their sensitivity to variations in the data. Increasing the level of smoothing can be used not only to reduce visual clutter but also to promote the representation of predominant behaviors. Additionally, we can more easily control the visual properties of edges using Time Paths, which allows for smoother transitions between time points that more clearly represent the flow of time. Secondly, we complement this layout with the addition of supporting visualization elements aimed at facilitating the identification of specific moments and behaviors in the timeline, particularly when dealing with complex time curve visualizations. Specifically, we propose the addition of glyphs that represent the dataset at each time point, so as to show the evolution of the dataset without the need for multiple views, and a timeline graph visualization, which highlights moments marked by significant data variations or periods of stability. Furthermore, as time curves mainly portray general behaviors, we also present a lens-based approach that can be used to brush through time points and identify the source of these behaviors so that they can be isolated and studied further. These functionalities were implemented in CroP, an existing data visualization tool that is capable of representing network and time-series datasets through multiple coordinated views [6]. Despite being able to process relational datasets,
the focus of this paper is on the developed methods for the representation and analysis of time-series data.

The paper is structured as follows: Section 2 features an overview of related work in representing time-series data; Section 3 provides an overview of CroP as the framework for our main contributions; Section 4 presents the updated time curve model and its new functionalities; our experimental results are showcased in Section 5, while the validation of our methods is presented in Section 6; finally, Section 7 provides a closing summary and future work.

2. Related work

2.1. Time representation

Timelines have been used to describe sequences of events across many disciplines for centuries, having had multiple types of representations to help portray and analyze time-series within various contexts [7]. M. Brehmer et al. surveyed existing timeline visualizations in the context of storytelling, using these as a means to portray information coherently and in such a way that it engages the viewer [8]. Linear representations were found to be the most common as they have the advantage of portraying sequences of events with an easily perceived order, with scatter plots, line charts and histograms being common examples of simple supporting visualizations used to portray time-series data in various tools [9,10]. Alternatively, radial layouts also share this advantage, displaying ordered events on a circular outline. However, this shape is better suited for portraying cyclical or periodic events, as the timeline loops and its beginning and end share similar positions [8]. Spiral layouts portray a similar concept except with multiple loops, either expanding from or converging into the center. While spirals can portray cycles, they are space-filling layouts, making them useful for displaying large numbers of time points with a defined order within large areas [11,12].

Timelines can easily encode a small number of variables through shape, size and color, but may not be feasible when accurately representing complex systems changing over time, such as large networks where each time point represents a different state for each data point. While animation can be employed as a natural way to convey changes over time, where each state of the entire system can be represented in successive frames, it is limited by human perception capabilities, as people are more likely to only focus on significant changes [13]. Alternatively, complex glyphs or small multiples can be used to show a representation of the state of the data for each time point [14,15]. For instance, CircleView [16] utilizes segmented pie charts to represent multiple attributes over time in a simple visual artefact that not only uses similarity and ordering algorithms to help viewers identify significant elements and even patterns, but also user interaction such as selections, filtering and drilling-down to allow the exploration of more complex datasets. However, in any of these methods scalability must be considered, as showing thousands of time points simultaneously will significantly increase the complexity of visualizations and the excess of information may result in overlaps or even overwhelm the viewers.

2.2. Time curves

To manage visual complexity, visualizations may seek to reduce the size of the information space, condensing visualizations or even aggregating groups into singular visual elements. This often entails a loss of information in favor of reducing visual noise and highlighting general patterns. Bach et al. presented Time Curves, a visualization model which utilizes multidimensional scaling to position time points in low-dimensional space in such a way that their relative distance reflects the similarity between their attributes [5]. This is achieved by calculating the similarity between time points using data-specific metrics and then applying a force-directed layout on the timeline to attract time points based on their similarity, resulting in a bent timeline whose shape reflects the behaviors of the data, such as significant events, cyclical patterns, regressions and outliers.

Elzen et al. presented a similar concept which further showed how this layout is able to represent the overall behaviors of complex systems over time [17]. In their work, the properties of a network at each point in time are abstracted into a point on a two-dimensional plane to build a Time Curve, which will then portray the structural changes in the network over time. Through the Time Curve it is possible to identify the periods in which the network remained with a specific structure, the moments when this structure changed, the intensity of these changes, and when the network returned to similar, previous structures.

2.3. Interactive visualization

Visualization tools can employ interactive functions that provide users with the ability to navigate and filter the data, giving them control over the amount of information displayed on screen. Regarding navigation, visualization tools can employ a details-on-demand approach by allowing users control over the timeline to switch between time points and visualize additional information through supporting visualizations [18,19]. Switching between levels of detail can also be achieved through semantic zooming, where the amount of information displayed regarding each time point increases as the user zooms in on a section of the timeline [20].

Filtering or highlighting information can be achieved through queries. These can be performed indirectly through user interface elements, such as search bars to find specific values and sliders that set threshold values to filter less relevant data, or directly through brushing, where data is selected by interacting with its graphical representations. For instance, TimeSearcher and MaTSE allow users to select sections of time-series data visualizations in order to find temporal patterns that are similar to the one within the selected area [21,22]. Such methods are also advantageous in dynamic visualizations, such as those employed by DEVIS [23], a tool for visualizing and analyzing the evolution of relaxed functional dependencies over time. This tool utilizes a line plot and a dependency table that dynamically adapt to continuous discovery processes, allowing users to interact with the results through queries and filters, including the selection of an interval of time through a brush to analyze the changes to the dependencies during that period. An alternative method for exploratory queries is lens-based approaches [24]. These can be used both as a semantic zooming tool and as area brushes, as users can control which elements and information fall within their radius to be expanded.

3. Framework

The methods presented in this paper were integrated into CroP [6], a data visualization tool created in Java using the Processing library [25]. CroP employs a multiple coordinated views layout to visualize user-provided datasets at different levels of detail through various visualization models contained within flexible panels, including relational networks, tabular visualizations, linear graphs, and an implementation of time curves (Fig. 1).

Fig. 1. Screenshot of CroP, representing a dataset clustered into five groups: a Data Table (1), a Network panel (2), and a Time Curve panel (3) that can be divided into three areas: a time curve layout (a), an options menu for this layout (b), and a timeline slider with a graph (c). Time points are displayed with pie chart glyphs that represent node clusters; when brushed with the mouse lens, a larger chart represents similarity between clusters (d). (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)

While CroP was designed for the analysis of biological datasets, specifically protein-protein interaction networks and gene expression time-series datasets, it is also capable of processing generic relational and temporal data. A list of every loaded node and edge
is displayed in the Data Table panel (Fig. 1.1), while in the Network panel these are displayed through various layouts that reflect their attributes (Fig. 1.2). When temporal data exists, users can explore change in values over time through the timeline slider at the bottom of the panel. This data can also be clustered using a hierarchical algorithm to create groups of nodes that are based on either their relationships or temporal patterns [26,27]. This algorithm has the advantage of being more versatile, only needing to calculate similarity once before allowing users to choose the number of clusters while the visualization updates dynamically. However, as one algorithm may not be ideal for every problem, the framework has been prepared to support the future addition of more clustering algorithms. After being clustered, node groups on the network panel are sorted with a spiral layout which orders nodes by similarity within each cluster. While time-series can be loaded and analyzed without relational data, networks will position clusters based on the edges between their nodes, meaning that closer clusters will contain higher numbers of related nodes.

Regarding node representation, values are mapped between black, which represents minimum values, and saturated colors, which represent maximum values. As the default color hues, blue is used to represent time-series values and orange represents time progression. These hues can be switched with any color on the rainbow spectrum, allowing users to pick a color scheme that they feel comfortable with. Depending on the dataset and type of temporal behaviors being studied, color can be mapped to the direct values of each node, to their variation, or to their tendency. We define variation as the difference between the current value and that of the previous time point, so as to represent how values are varying over time. Tendency is an alternative approach that does not take into account values, only how the data shifts between the previous and next time points. Here, the brightest color represents a peak of values and black represents a valley, while a darker blue represents increasing values, and a grey dark blue represents decreasing values. This can be used in the analysis of patterns in tendency shifts across complex datasets, such as gene expression time-series where peaks of values represent over-expressed proteins.

In this paper, our focus is on the Time Curve panel, which promotes the discovery of relationships between time points through a visualization model based on the layout presented by Bach et al. [5]. The following section describes our revisions to the model and methods for visualizing and exploring time-series.

4. Exploring time curves

In this section, we describe the Time Curve panel (Fig. 1.3) and our implementation of the time curve layout, as well as a force-directed approach for smoothing this layout which we named Time Paths. These layouts are supported by a timeline graph to help users understand sequences of events, as well as by a lens-based approach that is used to analyze temporal behaviors and identify their source.

4.1. Time curve panel

CroP's Time Curve panel consists of a visualization layout, an options menu and a timeline slider (Fig. 1.a,b,c). When temporal data is loaded into CroP, each time point is displayed sequentially, laid out either as a horizontal line or as a spiral, the latter being used when the length of the former surpasses the width of the window. After the initial layout is loaded, the timeline can be distorted into a time curve through a force-directed layout comprised of springs between every time node. The formula used to calculate the force applied to each spring is based on Hooke's law [28] and each spring's ideal stretching length is determined using a similarity matrix. The similarity between any two time points $t_i$ and $t_j$ is determined through the function $f(i,j)$, described below, where $N$ is the number of nodes in a dataset, and $P$ is a node containing the time series $t$. It is computed by calculating both the difference of the values of the time points, $v(i,j)$, and the difference of their variation, $b(i,j)$, averaged between every point in the dataset. By using both of these operations in calculating similarity, time points are positioned based not only on their current values, but also in relation to whether these values are increasing or decreasing similarly over time.

$$v(i,j) = \sum_{P=0}^{N} \left| P_{t_i} - P_{t_j} \right| \tag{1}$$
$$b(i,j) = \sum_{P=0}^{N} \left| \left( P_{t_i} - P_{t_{i-1}} \right) - \left( P_{t_j} - P_{t_{j-1}} \right) \right| \tag{2}$$

$$f(i,j) = \begin{cases} v(i,j), & \text{if } i = 0 \\ \dfrac{v(i,j) + b(i,j)}{2}, & \text{else} \end{cases} \tag{3}$$

Values are averaged across every node in the dataset when calculating similarity, meaning that the distance between time points will reflect the percentage of nodes that are behaving similarly between them. As such, closer time points reflect higher amounts of nodes manifesting similar behaviors at those moments in time. However, if only a smaller percentage of the dataset is manifesting a behavior pattern, time points would be scattered without perceivable correlation. For such cases, we added a slider that controls how similarity is mapped to the distance between time points. For instance, if time points are close to each other when the slider is set to 60% maximum similarity, then at least 60% of all data in the dataset should be behaving similarly at those time points (as shown by the selected time points in Fig. 1.b). Furthermore, to more quickly identify an ideal parameter, the maximum similarity between every time point in the current dataset is marked on its slider (shown in Fig. 1.a).

By dragging the timeline slider (Fig. 1.c), every time node will be highlighted in sequence, hiding every edge except those between the time points that the slider crossed to create an animated transition between those selected. Additionally, the mouse can be used to pan and zoom the time curve layout, and time points can be selected by clicking them, which displays their names and marks them on the timeline graph with a circle. The menu on the top-left (Fig. 1.b) lists options to control the layout, such as toggling the visibility of nodes and edges, changing parameters of the layout, and switching the color scheme. The default color scheme maps temporal progression from the initial time point to the last using a gradient from black to orange.

4.2. Time paths

While the time curve layout is able to position time points relative to their similarity, visual complexity increases with the number of time points, making edge representation particularly important. The Processing library provides several options to manage the properties of the edges, including the creation of curved edges across multiple points, but these methods offer little control over visual attributes. To resolve this, we implemented a method that creates segmented edges with adjustable properties, allowing us to smoothen time curves by controlling their trajectory and curvature.

Time Paths is a layout that smoothens an existing time curve visualization by redrawing it with a brush controlled by parameter-based attraction forces. This brush consists of a moving point which is first placed at the initial time point on the original time curve, and it is then pulled towards the following time point using a spring, calculated using Hooke's law and a fixed attraction strength. The brush's route is mapped by intermediate points that are left behind as it moves, and after a set number of points, it is pulled towards the next time point. However, the new spring does not immediately replace the previous one, as we apply momentum: a percentage value that defines how quickly the attraction force to the previous time point is converted into the attraction force to the next point. Once the brush reaches the final time point, every edge of the new curve is defined by the sets of intermediate points that were left in its path. This provides increased control over the visual representation throughout the edge as we can define gradual transitions of both color and size between time nodes, as well as easily animate visual elements along the drawn trajectory. Using this, we added two animations that convey the flow of time: a pulse created from increasing and decreasing the weight of each segment in sequence, and arrow particles that move across the time curve, increasing in speed when edges are far apart to convey the intensity of variation between time points.

We defined two variables that can be controlled through sliders to dynamically update the layout: the number of intermediate points and momentum. The number of intermediate points between each time point controls the detail of each curve. Momentum controls the speed of forces converging between time points, where decreasing it will create sharp turns between points, and increasing it will create wider loops. The calculation of a Time Path only needs to be performed once for each set of parameters, as all of the intermediate points are saved along with their properties. Additionally, overlapping intermediate points will be removed, cleaning the visualization and improving drawing speed. However, it should be noted that time paths distort the position of the time nodes from the original time curve, in which their position best reflected their similarity. To diminish this, after the time path has been calculated, we move time nodes along the new path to a point that is closest to their original position on the time curve.

4.3. Supporting timeline

As the force-directed layout distorts the timeline, the order of the time points becomes harder to perceive. While interaction and animation can help users better perceive temporal order, these methods may not be ideal to identify significant moments when dealing with large amounts of time points. In order to more easily understand the sequence of events in a time series, regardless of the visual complexity of its time curve, we added a graph that represents the behaviors of the data over time to the timeline slider. The height of each bar in the graph is mapped to the distance between that time point and the next in the time curve, from zero to the largest current distance. As such, the size of the bar will represent the intensity of the changes in the data from one time point to the next in the current time curve. To exemplify, Fig. 2 shows a graph representing an instance where the distance between the first three time points is large, while the following ones are close together. This indicates that there were three large shifts in values, followed by a period of stability where values presented minimal changes. However, when represented as a bar chart, the initial three significant shifts present similar heights (Fig. 2.a), which may result in the viewer perceiving these time points as being similar. To better distinguish significant shifts in the data, we altered the bar representation so that shifts are represented as spikes, where their sharpness matches the intensity of the shifts, and stability is represented by smooth low portions (Fig. 2.b). To better convey the flow of these shifts over time, the shape of the graph is then curved, smoothing the transitions between points (Fig. 2.c). To achieve the spike shape, the height of the point between two sequential time points is calculated based on their height. The formula to calculate the middle point $h$ located between time points $t_1$ and $t_2$ is described below, where $T$ is an array of the average values of every time point, $\max(T)$ and $\min(T)$ are the maximum and minimum values in $T$, respectively, and $\mathrm{avg}(t_1, t_2)$ is the average value of $t_1$ and $t_2$. This formula describes how $h$ is calculated by mapping $\mathrm{avg}(t_1, t_2)$ from the set $[\min(T), \max(T)]$ into the set $[\mathrm{avg}(t_1, t_2), \min(T)]$, meaning that as the height of both time points increases (which would indicate two sequential large shifts of data), the middle point's height decreases.

$$h = \left( 1 - \frac{\mathrm{avg}(t_1, t_2) - \min(T)}{\max(T) - \min(T)} \right) \times \left( \mathrm{avg}(t_1, t_2) - \min(T) \right) + \min(T) \tag{4}$$


Fig. 2. Timeline graph showing high variation for the first three steps and then a period of minimal changes: as a basic bar chart (a), with middle points (b) and then
smoothed (c).

However, as this representation is more complex than the initial bar chart, we also established a minimum pixel width for each bar so that the spikes will not be drawn whenever the width of the graph is too small. As such, only the bar chart will be shown whenever the graph is too small for the spikes to be readable, which also saves processing power whenever there are too many time points. Additionally, there is a "similarity" color scheme that also helps discern behaviors independently of the number of time points. When selected, the color of time nodes is mapped to their similarity to every time point that has been selected. As such, when selecting a time point, every point that is similar will be highlighted on the graph. As before, selected time points are marked with a circle above their corresponding section of the graph.
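The similarity values used by this color scheme, and by the layout itself, come from Eqs. (1) to (3) in Section 4.1. A minimal Python sketch of those three functions follows; it is an illustration under stated assumptions, not CroP's Java implementation. The function names and the list-of-lists data layout are hypothetical, the sums are left unnormalized as the equations are printed (dividing by the number of nodes would give the averaged form described in the text), and falling back to $v(i,j)$ when $j = 0$ is an assumption, since Eq. (3) only states the case $i = 0$.

```python
def value_difference(series, i, j):
    """v(i, j), Eq. (1): summed absolute difference of node values at time points i and j."""
    return sum(abs(p[i] - p[j]) for p in series)


def variation_difference(series, i, j):
    """b(i, j), Eq. (2): summed absolute difference of the step taken into i and into j."""
    return sum(abs((p[i] - p[i - 1]) - (p[j] - p[j - 1])) for p in series)


def time_point_distance(series, i, j):
    """f(i, j), Eq. (3). Smaller results mean more similar time points."""
    if i == 0 or j == 0:
        # Eq. (3) names only i = 0; treating j = 0 the same way is an assumption,
        # made because the first time point has no previous value.
        return value_difference(series, i, j)
    return (value_difference(series, i, j) + variation_difference(series, i, j)) / 2.0


# Hypothetical dataset: two nodes, each with a three-point time series.
series = [
    [0.0, 1.0, 2.0],
    [0.0, 2.0, 4.0],
]
print(time_point_distance(series, 1, 2))
```

In this example both nodes rise by the same amount at each step, so $b(1,2)$ is zero and only the difference in values contributes to the result.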

4.4. Analyzing temporal patterns

A limitation of the Time Curves layout is that it can only show


how the data generally behaves over time. When viewing a time
curve that is representing the general behavior of thousands of
nodes, it does not provide any information regarding the nodes
contributing to such behaviors. As such, we are interested how
the model can be used to portray the state of the data at each
time point to better understand its pattern, as well as providing
the means to dig-down and discern the potential sources of these
behaviors. Fig. 3. Network panel (left) with HIV-1 dataset nodes clustered into six groups and
Our first challenge was to provide graphical representations of Time Curve panel (right). 1 - Maximum similarity at 100%. 2 - Maximum similarity
at 60% and time points 12H and 18H are selected; clusters are represented with pie
the dataset at each time point, allowing these to be compared after
slices (a, b & c).
being positioned by Time Curves without having to rely on inter-
action or additional views. To assure that our approach is scalable
to any dataset, we chose to portray data clusters as these repre- cluster between the selected time points. This similarity is deter-
sent a flexible number of groups of data points that have similar mined by calculating the standard deviation between their average
proprieties. Upon clustering the dataset, time nodes in the time values, represent clusters with consistent behaviors through filled
curve layout are replaced with pie chart glyphs. Each slice repre- slices (Fig. 3.2.a,c), while others will be shorter and less visible
sents one of the clusters, where the width of its arc represents the (Fig. 3.2.b). Additionally, the scroll wheel can be used to increase
number of nodes in the cluster and the color corresponds to the or decrease the size of the lens, and by using a keyboard shortcut,
average properties of every node in the group at that time point. nodes will remain selected even when no longer within the lens,
The pie chart slices are sorted relatively to the positions of the allowing users to brush through multiple distinct areas of a time
clusters on the network visualization, allowing them to be more curve.
easily matched to their corresponding cluster (Fig. 3.2). In order to identify the source of particular behaviors more
To further explore the relationships between nodes and their easily, selecting time points is also coordinated with the network
temporal patterns, we introduced a lens-based approach that pro- panel. When points are selected, the transparency and saturation
vides an on-demand visualization which highlights the similarities of each node in the network is mapped to their similarity across
and differences between any group of time points. Right-clicking the selected time points, highlighting nodes with consistent behav-
over the Time Curve panel will create a circular area around the iors. Furthermore, two outlines are drawn around each cluster: the
mouse, allowing users to brush over multiple time nodes simul- first outline has a varying weight which increases based on the
taneously and creating a larger pie chart visualization (Fig. 1.d). percentage of similar nodes within the cluster, while the second
This pie chart represents an aggregate of all the glyphs, where the outline is fixed, marking the maximum size for the first outline. As
color of each slice represents the average of each cluster across such, these outlines make it simpler to quickly identify how sim-
the selected time points. Moreover, the radius of each slice is ilar each cluster is across every selected time point (Fig. 3.2). As
now variable, matching the average similarity of their respective we highlight both significant individual nodes and significant clus-

5
A. Cruz, J.P. Arrais and P. Machado Big Data Research 27 (2022) 100291

Fig. 4. Time curve visualization of a sine wave dataset (a) and variations: an increase in amplitude (b), an increase of values (c), then combined with an increase of
frequency (d).

ters, users can more easily identify and isolate data groups that are displaying behaviors that they might want to study further.

5. Experimental results

In this section, we present several visualizations created by our models, first from simple datasets composed of a single time-series that can be easily matched with their results, and then from biological datasets that contain thousands of nodes with individual time-series, where we employ the developed methods to identify patterns and their source.

5.1. Individual time-series

While the representation of a single time-series may not offer much insight that is not already observable in a linear representation of the data, the following datasets allowed us to test the representation abilities of our layouts, as the resulting visualizations can be compared with each time-series. To demonstrate our implementation of the time curve layout, we conducted tests using a sine wave dataset with 500 time points, which depicts a simple consistent behavior: a cycle, where values increase and decrease repeatedly between a minimum and a maximum. Following this, the dataset was altered to depict other predictable behaviors, as shown in Fig. 4. The visualization of the basic sine wave dataset resulted in an oval that represents the shifts in variation over time, where the number of loops matches the number of cycles in the dataset (Fig. 4.a). When the minimum and maximum values were altered, the shape’s height also changed to match the amplitude of each cycle (Fig. 4.b). Adding a consistent value increase over time resulted in a gradual position shift for the ovals representing each cycle (Fig. 4.c). Finally, increasing frequency resulted in stronger value variations, which increased the width of the cycle’s loops. To make this more noticeable, we combined it with the previous value increase (Fig. 4.d). In these initial experiments, our implementation of the time curves layout was able to position time points based on their values and variations, creating visualizations that conveyed the overall behaviors of each time-series.

In order to test the Time Paths layout, we used two time-series gathered from real events describing cyclical behaviors with different trends and variations. These datasets were chosen not only to test the impact of different parameters, but also to demonstrate how the layout smoothens curves to reduce visual clutter and highlight general trends. The first dataset is “Wolfer’s Sunspot Numbers”, a yearly measurement of sunspots from 1770 to 1869 [29]. It contains 100 time points and presents cycles with similar minimums but varying peak values, resulting from periodic value increases of different intensities. The second dataset depicts “Monthly Milk Production” from January 1962 to December 1975 [30], containing 168 time points and characterized by a yearly production cycle with minor jumps in variation and a consistent increasing trend until the final five years, where the production increase stabilizes. In both cases, the Time Curve depicted each cycle with circular patterns whose sizes and positions matched the respective variations and trends observable in the dataset, which is consistent with previous results. For instance, the “Monthly Milk Production” dataset is represented with shifted loops, up until the final cycles when production no longer increases yearly, resulting in overlaps. The Time Path layout was then applied with four different sets of parameters, increasing the level of smoothing with each one. The results are depicted in Figs. 5 & 6.

The first of each set of results (3.a) shows the default parameters, which generally smoothed the initial time curve (2) while remaining sensitive to small variations. We can note a small loop in the center that highlights a slight increase in values within a decreasing trend. In the second result set (3.b), the number of intermediate points is reduced while momentum is increased, reducing the perturbations that indicate moments with small variation shifts, like the previous loop. In the third set (3.c), momentum was further increased, removing almost all visual clutter in favor of representing the overall cycles. In the last parameter set (3.d), momentum was increased again to create exaggerated depictions of the previous cycles, particularly large data variations such as the two largest increases in the sunspots dataset (Fig. 5.3.d). The time curves model was generally able to represent how time-series vary over time, while time paths helped reduce visual noise and highlight the main tendencies of the time-series, and offered alternative representations of otherwise simple datasets.

5.2. Biological datasets

Biological datasets can be generally characterized as complex, containing large quantities of data points with multiple variables. Unlike in the previous datasets, tendencies and significant moments are not easily identified without data analysis methods.

First, we present a visualization of a gene expression time-series RNA-Seq dataset over a network of 7,590 proteins in reaction to the HIV-1 infection, measured every 2 hours over a period of 24 hours. As this dataset was shown to present cyclical tendencies in early versions of our layout [31], we sought to explore these behaviors in its current iteration. The results are shown in Figs. 1 & 3, where the time curve was able to depict two cycles through three distinct groups of time points, where at least half of the dataset presented the same behaviors. Through the time curve lens, it was possible to quickly identify which nodes had similar behaviors at the same points in time. For instance, in Fig. 1 we show the lens being used on the group of time points for 4, 10 and 16 hours, resulting in a chart with one full slice colored in blue. Even without the network view, we can infer that this represents a cluster that only contains nodes that present peaks of expression across the selected time points. Moreover, the remaining slices indicate that the other clusters also contain large quantities of nodes expressing other


Fig. 5. Visualizations of the “Wolfer’s Sunspot Numbers” dataset, depicted as a line chart (1), a Time Curve (2), and through Time Paths (3) with different parameters (a to d).

Fig. 6. Visualizations of the “Monthly Milk Production” dataset, depicted as a line chart (1), a Time Curve (2), and through Time Paths (3) with different parameters (a to d).
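
Figs. 5 and 6 show Time Paths at increasing levels of smoothing. As a rough intuition for how a force-directed pass with momentum can round off a layout, the sketch below moves a particle through the time point positions under a spring-like pull. This is a generic illustration and not the Time Paths algorithm itself: the function `smooth_path`, its parameters `steps_per_point`, `stiffness` and `momentum`, and the zigzag input are all assumptions, and how they relate to CroP's actual parameters is not established here.

```python
def smooth_path(points, steps_per_point=10, stiffness=0.2, momentum=0.5):
    """Trace a particle that is pulled toward each 2D point in turn.

    points          : list of (x, y) time point positions
    steps_per_point : intermediate positions emitted per time point
    stiffness       : strength of the spring-like pull toward the target
    momentum        : fraction of the previous velocity that is retained
    """
    x, y = points[0]
    vx = vy = 0.0
    path = [(x, y)]
    for tx, ty in points[1:]:
        for _ in range(steps_per_point):
            # Pull proportional to the distance from the target, plus inertia.
            vx = momentum * vx + stiffness * (tx - x)
            vy = momentum * vy + stiffness * (ty - y)
            x, y = x + vx, y + vy
            path.append((x, y))
    return path

# Hypothetical zigzag layout standing in for a noisy time curve.
zigzag = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
exact = smooth_path(zigzag, steps_per_point=1, stiffness=1.0, momentum=0.0)
soft = smooth_path(zigzag, steps_per_point=3, stiffness=0.2, momentum=0.5)
```

With `stiffness=1.0` and `momentum=0.0` the particle lands exactly on every time point, so the input is reproduced; with a weaker pull it lags behind each target and the zigzag is flattened, which is the kind of trade-off between detail and overall shape that the parameter sets in the figures explore.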

Fig. 7. Time curve of the Plasmodium Falciparum dataset (left), and close-ups of two sets of time points being brushed with the mouse lens (a, b).
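
Brushing a group of time points with the lens, as in the close-ups of Fig. 7, amounts to asking which nodes behave alike across the selected time points and what share of each cluster they make up. The sketch below illustrates that idea under stated assumptions: the similarity test (value range within a `tolerance`), the function names `similar_nodes` and `cluster_similarity`, and the toy data are hypothetical and do not reproduce the similarity measure used in CroP.

```python
def similar_nodes(series, time_points, tolerance=0.1):
    """Names of nodes whose values at the selected time points
    all lie within `tolerance` of one another (assumed criterion)."""
    matches = []
    for name, values in series.items():
        selected = [values[t] for t in time_points]
        if max(selected) - min(selected) <= tolerance:
            matches.append(name)
    return matches

def cluster_similarity(clusters, series, time_points, tolerance=0.1):
    """Fraction of similar nodes in each non-empty cluster (0.0 to 1.0)."""
    similar = set(similar_nodes(series, time_points, tolerance))
    return {
        label: sum(1 for node in members if node in similar) / len(members)
        for label, members in clusters.items()
        if members
    }

# Toy data for illustration only: three nodes, four time points.
series = {
    "a": [0.0, 1.0, 0.0, 1.0],
    "b": [0.0, 0.5, 0.05, 0.9],
    "c": [1.0, 1.0, 1.0, 1.0],
}
clusters = {"x": ["a", "b"], "y": ["c"]}
alike = similar_nodes(series, [0, 2])
shares = cluster_similarity(clusters, series, [0, 1])
```

A per-cluster fraction of this kind is one plausible input for a slice whose size reflects the percentage of similar nodes within the cluster.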

consistent behaviors. In Fig. 3, we compare the dataset between 12 and 18 hours to identify that most of the dataset presents the same patterns of expression between these time points, with the exception of the bottom-most cluster (Fig. 3.b). Such clusters can be easily selected and isolated so they can be studied further.

Secondly, we analyzed a gene expression time-series of the developmental cycle of Plasmodium Falciparum, the agent responsible for human malaria. The dataset contains 5,080 genes, whose values were measured at one-hour intervals over 48 hours. The time curve visualization of this dataset shows a generally continuous behavior throughout, with small shifts in variation between subsequent time nodes and without overlaps (Fig. 7). Additionally, the dataset has a maximum similarity of nearly 90%, meaning that genes present very similar behaviors. This matches the description of this data in a previous study [32], which referred to the behavior of the genes as a cascade of continuous expression that lacks sharp transitions. Moreover, near the end, the dataset appears to return to a state similar to where it began, which is characteristic of a cycle. By using the lens, we can compare two time points near the beginning and end of the timeline, TP2 and TP45, to observe that most of the dataset does appear to present similar values and variations at those instances (Fig. 7.a). Despite the lack of sharp transitions, we can observe that the data does not present consistent variations. In the supporting timeline graph, we can easily identify both moments of stable variation and some significant shifts in the data. Using the lens to analyze some of these larger shifts, we can identify which groups of nodes present differences between time points instead of similarities. When comparing TP12 and TP13, we can identify that the two clusters on the left were responsible for the spike in the time curve, and just by observing their glyph colors, we can also conclude that these clusters had a significant increase in values (Fig. 7.b).

Finally, we visualized transcription profiles across a yeast cell cycle, where 4,381 alpha factor synchronized cells were followed across two cell cycles, sampled every 5 minutes throughout 2 hours [33]. The resulting time curve loops in the middle, which may be related to the second cell cycle (Fig. 8). Despite maximum similarity being set at 80%, the cycles appear to be significantly different from each other, although this difference is also substantiated by previous studies [33]. In particular, we can observe that the end of the second cycle is marked by a very significant shift in values, characterized mainly by a large decrease of values in one cluster and a large increase in three others (Fig. 8.b), followed by a return to values in line with the previous tendency. There is also


Fig. 8. Time curve of the yeast cell cycle dataset (left), and network visualizations of the dataset at 45 minutes (a) and at 105 minutes (b), clustered by temporal values.
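
Shifts such as the ones at 45 and 105 minutes in Fig. 8 can also be located numerically before inspecting them visually. The following is a minimal sketch under assumptions: it ranks consecutive time point pairs by the mean absolute change over all nodes and then lists the nodes that changed most. The function names and the toy values are hypothetical, and this is not presented as the computation behind the supporting timeline graph.

```python
def step_changes(series):
    """Mean absolute change over all nodes for each consecutive
    pair of time points (index t compares t with t + 1)."""
    rows = list(series.values())
    n_steps = len(rows[0]) - 1
    return [
        sum(abs(row[t + 1] - row[t]) for row in rows) / len(rows)
        for t in range(n_steps)
    ]

def largest_shifts(series, top=2):
    """Indices t of the `top` largest changes between t and t + 1."""
    changes = step_changes(series)
    ranked = sorted(range(len(changes)), key=lambda t: changes[t], reverse=True)
    return ranked[:top]

def drivers(series, t, top=3):
    """Nodes with the largest absolute change between t and t + 1."""
    ranked = sorted(series, key=lambda name: abs(series[name][t + 1] - series[name][t]),
                    reverse=True)
    return ranked[:top]

# Toy data for illustration only: three nodes, four time points.
series = {
    "n1": [0.0, 0.0, 5.0, 1.0],
    "n2": [1.0, 1.0, 1.0, 1.0],
    "n3": [2.0, 2.0, 6.0, 4.0],
}
changes = step_changes(series)
biggest = largest_shifts(series, top=1)
responsible = drivers(series, biggest[0], top=2)
```

Grouping the nodes returned by `drivers` by their cluster would give a first indication of which clusters are behind a spike, which is the question the lens answers interactively.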

another large shift at 45 minutes that breaks away from the expected tendency. We identified the cause as being a specific cluster of nodes that presents very little activity throughout the dataset until that point in time, during which the values of every node increase sharply (Fig. 8.a), then return to normalcy.

6. User testing

Throughout the experiments performed using our implementation of the time curve model, we were able to observe various types of behaviors in the visualizations that were created. However, one predominant concern is the comprehensibility of the model by different types of users, in particular those with minimal knowledge of data visualizations. To this end, we invited users with knowledge of biological datasets but low levels of experience with visualization to perform interface tests on CroP. As these tests were aimed at CroP as a tool, we will only present an overview and a brief discussion of the results in the context of the presented visualization methods. To further test the efficacy of our visualization models, we also performed surveys on individuals from different fields, focused on the interpretation of various time curve visualizations. The full tests and results are provided as supplementary materials.

6.1. Interface tests

The interface tests were conceived with the objective of detecting general usability problems in CroP and to better understand how users with minimal knowledge of data visualization would perform in navigating the tool, exploring data, and using its functions to identify patterns. The tests were performed by 9 college students from a Biochemistry degree, all of whom had a low level of experience with visualization tools. These participants were asked to load and explore the gene expression dataset of the HIV-1 infection (Fig. 1), utilizing the data table, network and time curve panels to solve various tasks. Regarding the visualization methods presented in this paper, participants were asked to use clustering to identify a group of nodes with a similar profile to that of another node and then use the time curve panel to search for a pattern of behavior. This was followed by a multiple-choice question regarding the behaviors they identified, how clearly they were represented, and whether the time glyphs helped in understanding the pattern.

None of the participants showed any significant issues in navigating the tool, applying clustering and layouts or exploring the data. However, 4 users were not able to identify the intended group in the clustering task and 2 users had difficulties in interpreting the time curve layout, stating that they could perceive neither continuous nor cyclical behaviors. Based on the received feedback, we can attribute these difficulties to having no prior knowledge of clustering and a lack of context on the data for low-level users. This would have required a higher focus on the time curves, such as more tasks or examples to help users further explore the model and surpass the inherent learning curve. In spite of this, 7 out of the 9 participants were able to correctly identify the pattern of the time curve as representing multiple cycles, rating the clarity of this representation with an average score of 5.6 out of 7. Additionally, these participants generally considered the temporal glyphs as helpful in understanding the pattern with a score of 5.1 out of 7, as the visual similarity of adjacent nodes helped distinguish each group. In the final feedback portion, participants were asked to give a score to several affirmations from 1 (Strongly Disagree) to 5 (Strongly Agree). The following is the average score given to each affirmation: they generally agreed that CroP was accessible with 3.9, that they required more time to use CroP properly with 3.9, and that they could see themselves using this tool in the future with 4.1, while they generally disagreed that the tool was complex with 1.6.

6.2. Visualization model survey

The survey consisted of a form containing several sets of multiple-choice questions, involving the identification of behaviors and patterns across multiple visualizations, as well as inquiries into participants’ preferences regarding the representation of the datasets. This study was conducted in person with university students from various fields of study: out of the 25 participants, 8 had an Information Visualization background, 4 had a Computational Creativity background, 6 had a Computer Science background, and 7 had a Biomedical Science background.

At the start of the survey, participants were shown a time curve from the sine waves dataset (Fig. 4.a), in which all of them were able to identify that the predominant behavior being represented was cyclical. When asked to match three modified sine wave time curves (Fig. 4.b,c,d) to their respective datasets, 48% of the participants correctly matched the first dataset, 84% of them were able to correctly match the second dataset and 64% correctly matched the third. Following this, they were asked to identify moments, periods and specific characteristics from the dataset visualizations presented in Figs. 5 & 6, and over 90% of participants were able to correctly identify both increasing trends and periods of stabilization, as well as moments with significant changes. However, only 54% were capable of identifying a specific outlier in the time curve visualization, although this should be read in light of their limited experience with the model. When asked to choose between six Time Path variations of these visualizations with different levels of smoothing, we could observe that participants generally preferred rounder curves with fewer variation details, as long as the visualization was still able to convey the overall behaviors of the original dataset. For instance, the Sunspot dataset visualizations with high smoothing parameters (Fig. 5.d) were unpopular as they distorted the representation of the larger cycles.


From the initial questions we were able to observe that the model is not immediately intuitive, which was also supported by the received feedback where 9 participants commented on the existence of a learning curve, especially for those without a visualization background. However, the average score given to the model’s ability to represent behaviors was 4.1 out of 5, accompanied by feedback that the model should be a useful tool in data analysis after the learning curve is surpassed, one noting the ability to identify outliers that weren’t noticeable in the line chart. The absence of interaction in these tests did highlight some limitations on the static models, such as overlapping lines causing visual noise, although this can be diminished when using the tool by scrolling through the timeline or using animations. Participants also gave the presentation of the Time Path visualizations an average score of 3.9 out of 5, commenting that they were generally more visually appealing than traditional visualization models, but that their application is very context-sensitive.

7. Conclusions & future work

In this paper, we presented new methods to further discover and interpret the behavior patterns represented by the Time Curves layout, integrated into CroP, our visualization tool for relational and temporal data.

Firstly, we presented our implementation of the Time Curves layout and demonstrated its ability to represent different types of behaviors in time-series datasets. We complemented this model with Time Paths, a parameter-based layout that dynamically transforms Time Curve visualizations to represent the original behaviors with different levels of detail or abstraction. In the experiments performed, the model was capable of not only representing different types of behaviors over time, but also smoothening layouts to reduce visual clutter and highlight overall trends. Time Paths also gives additional control over the representation of edges, which allowed for the creation of animated visual elements that convey direction and variation more easily.

Secondly, we provided both visual and interactive methods aimed at analyzing temporal patterns across large volumes of data points, allowing users to drill down into time curve visualizations. When data is clustered, time nodes are replaced with glyphs that represent the dataset at each time step, allowing time points to be compared without the need of additional views. Furthermore, we implemented a lens-based area brush that highlights nodes with similar behaviors between groups of time points. This lens is coordinated with the network panel to better identify groups and individual nodes with significant behaviors. Throughout the experiments performed with biological datasets, these methods helped quickly discover the groups of nodes responsible for various behaviors, which can then be isolated and compared with other variables to be studied further. Finally, we validated our models and methods through interface tests and visualization model surveys, performed with individuals from varied fields of study. While these confirmed that the time curve model possesses an inherent learning curve, even users with low knowledge of visualization were able to use the tools to discover patterns and interpret them.

As future work, we plan to further explore the presented models, not only to further improve their representation of data, but also to more easily portray significant information within complex datasets. In particular, the time curve model still presents limitations in its readability when representing large time-series, mainly due to edge overlap. Supported by the timeline graph and interactive methods which help pinpoint important moments, our models are still capable of being used in the analysis of complex datasets regardless of edge representation, but the integration of techniques such as edge-bundling and aggregation may reduce overlap and contribute to highlighting patterns. New layouts can also be integrated into CroP’s framework, either to provide alternative methods to map the position of time points to their similarity, such as new clustering methods, or to add a new dimension to the visualization, such as a 3D plot which would allow the exploration of the data through different angles.

Funding

This work is funded by national funds through the FCT - Foundation for Science and Technology, I.P. [SFRH/BD/124538/2016], within the scope of the project CISUC - UID/CEC/00326/2020 and by European Social Fund, through the Regional Operational Program Centro 2020, and through D4 - Deep Drug Discovery and Deployment [CENTRO-01-0145-FEDER029266].

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary material

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.bdr.2021.100291.

References

[1] S. Nusrat, T.A. Harbig, N. Gehlenborg, Tasks, techniques, and tools for genomic data visualization, Comput. Graph. Forum (2019), https://doi.org/10.1111/cgf.13727.
[2] N. Kerracher, J. Kennedy, K. Chalmers, The design space of temporal graph visualisation, in: N. Elmqvist, M. Hlawitschka, J. Kennedy (Eds.), EuroVis - Short Papers, The Eurographics Association, 2014.
[3] W. Aigner, S. Miksch, H. Schumann, C. Tominski, Visualization of Time-Oriented Data, 1st edition, Springer Publishing Company, Incorporated, 2011.
[4] S.I. O’Donoghue, B.F. Baldi, S.J. Clark, A.E. Darling, J.M. Hogan, S. Kaur, L. Maier-Hein, D.J. McCarthy, W.J. Moore, E. Stenau, J.R. Swedlow, J. Vuong, J.B. Procter, Visualization of biomedical data, Annu. Rev. Biomed. Data Sci. 1 (1) (2018) 275–304, https://doi.org/10.1146/annurev-biodatasci-080917-013424.
[5] B. Bach, C. Shi, N. Heulot, T. Madhyastha, T. Grabowski, P. Dragicevic, Time curves: folding time to visualize patterns of temporal evolution in data, IEEE Trans. Vis. Comput. Graph. 22 (2016) 559–568, https://doi.org/10.1109/TVCG.2015.2467851.
[6] A. Cruz, P. Machado, J.P. Arrais, CroP—Coordinated Panel visualization for biological networks analysis, Bioinformatics 36 (4) (2020) 1298–1299, https://doi.org/10.1093/bioinformatics/btz688.
[7] D. Rosenberg, A. Grafton, Cartographies of Time: A History of the Timeline, Princeton Architectural Press, 2013.
[8] M. Brehmer, B. Lee, B. Bach, N.H. Riche, T. Munzner, Timelines revisited: a design space and considerations for expressive storytelling, IEEE Trans. Vis. Comput. Graph. 23 (9) (2017) 2151–2164, https://doi.org/10.1109/TVCG.2016.2614803.
[9] A. Theocharidis, S. Dongen, A. Enright, T. Freeman, Network visualisation and analysis of gene expression data using biolayout express3d, Nat. Protoc. 4 (2009) 1535–1550, https://doi.org/10.1038/nprot.2009.177.
[10] A. Lex, M. Streit, H.-J. Schulz, C. Partl, D. Schmalstieg, P. Park, N. Gehlenborg, Stratomex: visual analysis of large-scale heterogeneous genomics data for cancer subtype characterization, Comput. Graph. Forum 31 (3pt3) (2012) 1175–1184, https://doi.org/10.1111/j.1467-8659.2012.03110.x.
[11] J.V. Carlis, J.A. Konstan, Interactive visualization of serial periodic data, in: Proceedings of the 11th Annual ACM Symposium on User Interface Software and Technology, UIST ’98, ACM, New York, NY, USA, 1998, pp. 29–38, http://doi.acm.org/10.1145/288392.288399.
[12] T. Bergstrom, K. Karahalios, Conversation clock: visualizing audio patterns in co-located groups, in: 2007 40th Annual Hawaii International Conference on System Sciences (HICSS’07), 2007, p. 78.
[13] T. von Landesberger, A. Kuijper, T. Schreck, J. Kohlhammer, J. van Wijk, J.-D. Fekete, D. Fellner, Visual analysis of large graphs: state-of-the-art and future research challenges, Comput. Graph. Forum 30 (6) (2011) 1719–1749, https://doi.org/10.1111/j.1467-8659.2011.01898.x, https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-8659.2011.01898.x, https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8659.2011.01898.x.


[14] J. Zhao, S. Drucker, D. Fisher, D. Brinkman, Timeslice: interactive faceted browsing of timeline data, in: AVI’12: Proceedings of the International Working Conference on Advanced Visual Interfaces, Capri Island, Italy, 2012, https://www.microsoft.com/en-us/research/publication/timeslice-interactive-faceted-browsing-timeline-data/.
[15] A. Lex, H.-J. Schulz, M. Streit, C. Partl, D. Schmalstieg, Visbricks: multiform visualization of large, inhomogeneous data, IEEE Trans. Vis. Comput. Graph. (InfoVis ’11) 17 (12) (2011) 2291–2300, https://doi.org/10.1109/TVCG.2011.250.
[16] D.A. Keim, J. Schneidewind, M. Sips, Circleview: a new approach for visualizing time-related multidimensional data sets, in: Proceedings of the Working Conference on Advanced Visual Interfaces, 2004, pp. 179–182.
[17] S. van den Elzen, D. Holten, J. Blaas, J.J. van Wijk, Reducing snapshots to points: a visual analytics approach to dynamic network exploration, IEEE Trans. Vis. Comput. Graph. 22 (1) (2016) 1–10, https://doi.org/10.1109/TVCG.2015.2468078.
[18] C. Niederer, H. Stitz, R. Hourieh, F. Grassinger, W. Aigner, M. Streit, Taco: visualizing changes in tables over time, IEEE Trans. Vis. Comput. Graph. 24 (1) (2018) 677–686, https://doi.org/10.1109/TVCG.2017.2745298.
[19] B. Bach, E. Pietriga, J. Fekete, Graphdiaries: animated transitions and temporal navigation for dynamic networks, IEEE Trans. Vis. Comput. Graph. 20 (5) (2014) 740–754, https://doi.org/10.1109/TVCG.2013.254.
[20] D.A.G. Aguilar, R. Therón, F.J. García-Peñalvo, Semantic spiral timelines used as support for e-learning, J. Univers. Comput. Sci. 15 (7) (2009) 1526–1545.
[21] H. Hochheiser, E.H. Baehrecke, S.M. Mount, B. Shneiderman, Dynamic querying for pattern identification in microarray and genomic data, in: 2003 International Conference on Multimedia and Expo. ICME ’03. Proceedings (Cat. No.03TH8698), vol. 3, 2003, III–453.
[22] P. Craig, A. Cannon, R. Kukla, J. Kennedy, Matse: the microarray time-series explorer, in: 2012 IEEE Symposium on Biological Data Visualization (BioVis), 2012, pp. 41–48.
[23] B. Breve, L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Visualizing dependencies during incremental discovery processes, in: EDBT/ICDT Workshops, 2020.
[24] C. Tominski, S. Gladisch, U. Kister, R. Dachselt, H. Schumann, Interactive lenses for visualization: an extended survey, Comput. Graph. Forum 36 (6) (2017) 173–200, https://doi.org/10.1111/cgf.12871, https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.12871, https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.12871.
[25] C. Reas, B. Fry, Processing: programming for the media arts, AI & Society 20 (4) (2006) 526–538, https://doi.org/10.1007/s00146-006-0050-9.
[26] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A.Y. Zomaya, S. Foufou, A. Bouras, A survey of clustering algorithms for big data: taxonomy and empirical analysis, IEEE Trans. Emerg. Top. Comput. 2 (3) (2014) 267–279.
[27] D. Müllner, Modern hierarchical, agglomerative clustering algorithms, arXiv:1109.2378, 2011.
[28] J. Rychlewski, On Hooke’s law, J. Appl. Math. Mech. 48 (3) (1984) 303–314, https://doi.org/10.1016/0021-8928(84)90137-0, http://www.sciencedirect.com/science/article/pii/0021892884901370.
[29] P.J. Brockwell, R.A. Davis, Time Series: Theory and Methods, Springer-Verlag, Berlin, Heidelberg, 1986.
[30] J. Minichino, Recurrent neural networks course project: time series prediction and text generation, https://github.com/techfort/aind2-rnn, 2017. (Accessed 2 December 2019).
[31] A. Cruz, J.P. Arrais, P. Machado, Interactive network visualization of gene expression time-series data, in: 2018 22nd International Conference Information Visualisation (IV), 2018, pp. 574–580.
[32] Z. Bozdech, M. Llinás, B.L. Pulliam, E.D. Wong, J. Zhu, J.L. DeRisi, The transcriptome of the intraerythrocytic developmental cycle of plasmodium falciparum, PLoS Biol. 1 (1) (2003) e5, https://doi.org/10.1371/journal.pbio.0000005.
[33] T. Pramila, W. Wu, S. Miles, W.S. Noble, L.L. Breeden, The forkhead transcription factor hcm1 regulates chromosome segregation genes and fills the s-phase gap in the transcriptional circuitry of the cell cycle, Genes Dev. 20 (16) (2006) 2266–2278, https://doi.org/10.1101/gad.1450606.
