
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/264791578

Custom Visualization Charts for Cancer Research in SAP Lumira

Article · April 2014


1 author:

Gerardo Navarro S
Hasso Plattner Institute

All content following this page was uploaded by Gerardo Navarro S on 18 August 2014.



Custom Visualization Charts for Cancer Research in
SAP Lumira

Gerardo Navarro Suarez


Hasso Plattner Institute
August–Bebel–Str. 88
14482 Potsdam, Germany
gerardo.navarro-suarez@student.hpi.uni-potsdam.de

Abstract—Cancer researchers investigate the best treatment for certain cancer types. After conducting xenograft experiments and collecting additional information, they are faced with a large amount of data. In order to identify patterns more easily, they summarize the collected data in visual charts. This is manual and tedious work with only minimal tool support. In order to fill this gap, we leverage and tailor existing visualization tools to meet the needs of cancer researchers. In this work, we focus on the design, development and integration of custom chart extensions into the SAP Lumira visualization tool. Thereby, we achieve the goal of providing an interactive and exploratory way for cancer researchers to discover insights within their data.

Keywords—SAP Lumira, Clustered Heat Map

I. INTRODUCTION

In the last decade, faster sequencing and decoding of the human genome opened up new areas in medical research, as it is now possible to dig deeper into certain diseases. Furthermore, this additional data source revealed considerable potential for personalized medicine, because it allows patients to be treated specifically based on their individual DNA or disposition [1]. Given this large amount of diagnostic data, it is clear that researchers need technical help to exploit all information inside the data – especially in the context of cancer research and treatment. Some data analysis tasks are already supported by applications using in-memory technology, e.g. alignment and variant calling [1].

Besides that, we extend the existing capabilities to the area of drug response analysis focused on cancer researchers. In our project scenario, cancer researchers investigate the best treatment for certain cancer types, such as head and neck cancer, at the DNA level. As part of their research, they collect meta data and tumor samples from patients and conduct many xenograft experiments which produce a large amount of data.

This data is edited, enhanced and analyzed by researchers in a manual and tedious process that takes from several days up to weeks. As part of the Analyze Genomes Project (AGP), we provide a cloud service that automates parts of this manual work [1]. Scientists upload the collected tumor-specific data to the online service that processes the data and calculates additional information [2]. To get a deeper understanding of their data, researchers initiate data analysis algorithms to find relevant dependencies within the tumor data or to classify the tumor data with machine learning approaches [3], [4].

Finally, researchers want to summarize the data in visual charts, because a graphical representation of raw data is easier for people in general to understand [5]. Additionally, it allows them to identify patterns more easily inside their data. Nowadays, the creation of such charts is a manual and inefficient task, as there is little tool support. In addition, the created charts are static visualizations, preventing researchers from interacting and engaging with their data freely. We see the need to tackle this problem by exploiting existing visualization tools. As a goal, we envision a tool providing interactive data exploration capabilities to cancer researchers. Researchers can engage with the data via the visualization and eventually gain new insights.

Given there are many Business Intelligence (BI) tools that focus on data visualization, it is not necessary to develop a new solution from the ground up, but rather to adjust existing BI tools to our use case. This also makes it possible to deliver the visualization tool to the researchers within the time frame of the project. SAP Lumira is such a BI tool that allows cancer researchers to access their data sources, transform the data, and eventually visualize data graphically. As a result, the rich set of visualization charts enables users with a basic technical understanding to explore their data freely and quickly discover valuable insights within these charts, without having to write scripts and SQL queries. Unfortunately, Lumira's visualization chart portfolio misses some chart types that are specifically used in the context of cancer research, such as a normalized stacked bar chart. Fortunately, SAP Lumira allows the definition and implementation of custom visualization charts as so-called chart extensions.

The contribution presented in this paper focuses on the design, development and integration of custom chart extensions into SAP Lumira to achieve the goal of providing an interactive and exploratory way for cancer researchers to discover insights within their data, like the clustered heat map shown in Fig. 1.

The remainder of the paper is structured as follows: In Sect. II, our work goes deeper into visualizations in the clinical context as part of the related work. Section III presents basic background information on SAP Lumira. In Sect. IV, we go deeper into the design of the new visualization charts and explain the details of implementing custom chart extensions for SAP Lumira. We evaluate and compare the developed charts in Sect. V. Our work concludes with an outlook in Sect. VI.

Fig. 1. Clustered heat map that shows the #Mutation count for a subset of genes and tumor samples. This SAP Lumira chart extension uses hierarchical clustering to sort and rearrange its cells. Through this, it is easier to understand the data and to detect insightful patterns, e.g. genes ATM and TP53 are relevant across all samples.

II. RELATED WORK

Before starting to work on specific visualizations in SAP Lumira, it is important to understand what graphical representations are currently used for clinical data and what visualization type would be helpful to cancer researchers. Stengel et al. and Snyder emphasize that the main goal of any visualization is to effectively communicate the information contained in the data in a non-distorting manner [5], [6]. As a result of this, the choice for a certain visualization depends on the specific data set that researchers wish to present. However, both books state that certain diagram types are more suited in the clinical context and therefore used more frequently, e.g. column bar charts, box plots and scatter plots.

Clinical research usually deals with a large amount of cross-referenced data that finally needs to be visualized in a comprehensive way that points out the main findings within the large data set. For this situation, heat maps proved to be the most effective way of displaying high-dimensional array data, and they are particularly suitable for gene expression array data [7]. In our scenario, cancer researchers often use the heat map visualization to explore certain aspects across all their collected tumor data, for example when investigating the effectiveness of drugs across a large set of tumors. The full potential of heat maps is revealed when the data is sorted or clustered. A sorted or clustered heat map ensures that similar data points are arranged close together, e.g. similar genes are located close to each other, which makes it easier to identify patterns across the large data set. The sorting is often achieved by applying hierarchical clustering methods separately on rows and columns of the heat map [8]. It is a more generic clustering approach that does not require specific knowledge about the data. Additionally, there are partitional clustering methods that map the data points to a known number of clusters, like K-means clustering. This is achieved by repeatedly assigning all samples to one of K clusters based on which cluster centroid is closest [9].

Examples for the usage of (clustered) heat maps, bar charts, box plots and combinations of different other visualizations in the clinical context can be found in various papers [6] [8]. Most papers also include unique and custom charts tailored to their respective data set because these unique charts emphasize certain aspects of the data in a better way than standard visualization chart types. These unique charts are not supported by SAP Lumira. Hence, it is important to identify and prioritize the visualizations that are not supported.

III. SAP LUMIRA OVERVIEW

SAP Lumira is a new self-service BI solution from SAP. The solution allows analysts, decision makers, and now cancer researchers to access one or multiple data sources, transform the data and eventually visualize data graphically. As a result, the rich set of visualizations enables users with a basic technical understanding to engage with their data and quickly discover valuable insights, without having to write scripts and SQL queries [10].

A. Integrating into SAP Lumira

SAP Lumira as a visualization tool is a good starting point for our scenario because of its rich feature set and easy-to-use interface for non-technical persons. The researchers would have three possibilities to interact with SAP Lumira:

1) SAP Lumira Desktop is a desktop program installed on Windows machines that enables users to prepare data from multiple sources, compose visualizations and share the defined visualizations with other contributors.
2) SAP Lumira Cloud is an online platform providing a similar feature set as the Lumira Desktop, but makes it available for mobile devices and browsers.
3) SAP Lumira Server is a UI5 HANA-based XS application that will run on-premise with similar mobile web access and user experience to Lumira Cloud, but specifically targeted to integrate well with the deployed SAP HANA repository [11], [12].
Amongst the three presented solutions, SAP Lumira Desktop allowed faster development and deployment of custom charts, for the following reasons:
• Lumira Desktop includes a software development kit (SDK) that provides stable API interfaces and utilities, streamlines the development steps, and makes it easier to implement custom visualizations as compact bundles [13].
• During our project, SAP Lumira Server was still in development and therefore only available as a developer release with no SDK or extension mechanism in place, unstable interfaces and no detailed documentation. Furthermore, working with the developer release of Lumira Server involved a large amount of reverse engineering and trial-and-error debugging that led to laborious and inefficient development.
• Developing custom visualization charts with the Lumira SDK makes it easier to port and deploy the custom charts to Lumira Server once a similar extension mechanism is established. An extension mechanism for visualization charts is planned to be included in post Q2 releases of Lumira Server.

B. Building custom visualizations in SAP Lumira

SAP Lumira offers a variety of visualizations out of the box that can definitely be of use in clinical publications [10]. However, it is impossible for Lumira to cover all possible visualization chart types. Therefore, the development team of Lumira Desktop introduced an SDK, which provides an extension framework with an associated API that lets you develop your own chart and integrate it with SAP Lumira [13]. The SDK includes the so-called VizPacker utility, a web client that helps to create the directory structure and the bundle needed to develop a visualization extension.

The development of custom visualizations with the SDK uses exclusively JavaScript as a programming language and takes advantage of two main libraries:
• SAP CVOM (Common Visual Object Modeler), also referred to as SAP HTML5 Visualization, is a visualization and charting engine designed to work with HTML5 technology to create different visualizations. CVOM provides utility classes to register and build own visualizations in SAP Lumira and other SAP products, utilizing other auxiliary libraries in the background, e.g. jQuery and RequireJS.
• D3.js is a JavaScript library for manipulating HTML documents. It allows you to bind arbitrary data to the Document Object Model (DOM), and then apply data-driven transformations to the DOM. This approach makes it easy to create visualizations in a fast and lean way depending on the available data.

Fig. 2. The development process for creating custom visualization chart extensions in SAP Lumira.

The documentation for the Lumira SDK is limited to a handbook and blog entries from the community [13], [14]. Unfortunately, there is little documentation and insight available regarding what a lean and efficient development process can look like for SAP Lumira chart extensions. In order to share our experiences and fill this gap, we present the structured development process we used and refined throughout our project. Figure 2 illustrates the process broken down into the following steps:

1) Design and define the requirements for your new visualization. What is the input – dimensions and measures [10], [15]? What should be visualized and how? At this point, a well-considered definition of required dimensions and measures is crucial. This avoids tedious refactoring later in the development / implementation phase.
2) Set up the development and test environment using the VizPacker utility. The VizPacker is able to automatically generate the test environment, which consists of an HTML website, the necessary files and JavaScript libraries including your chart extension bundle. As part of the setup, it is essential to construct and integrate the appropriate test data set into the test environment as defined earlier. As a best practice, we recommend extracting a real test data set directly from SAP Lumira using its debugging capabilities [13, p. 12].
3) Develop the chart extension bundle. Start by completing the declaration of the necessary dimensions and measures in the feed definition that is needed to bind data to the chart [15, p. 111] [13]. Continue working on the rendering code for the chart extension with D3.js – start by creating a preliminary version with static data and enhance your chart extension iteratively by including the real data and other visualization parameters given by the SDK, e.g. width and color palette. Finally, test your chart extension in the HTML test environment and browser before deploying it to SAP Lumira Desktop.
4) Deploy your custom chart extension to SAP Lumira. In case of the Lumira Desktop, this is done by copying your bundle to a certain directory on the Windows file system [13, p. 9]. When Lumira Desktop starts, it will look for new chart extensions and dynamically load them into the runtime.
5) Validate and test the chart extension in SAP Lumira. Are all requirements fulfilled? What can be refined? What about performance? These new requirements are the input
for the next development cycle.

The main characteristic of the proposed development process is that the chart extension implementation happens mainly outside of SAP Lumira Desktop, because the Lumira Desktop does not support ad-hoc reloading of bundles containing the chart extension [13, p. 24]. Therefore, you are forced to develop new visualization charts in a separate environment using tools like the VizPacker contained in the Lumira SDK.

There are many helpful examples for custom visualizations available on the web, such as a flag bar chart for showing the top three Olympic medal winners for 2004 and 2008 [14], [15].

IV. IMPLEMENTATION IN SAP LUMIRA

Having decided to build the custom visualizations inside SAP Lumira Desktop, our next step was to identify the necessary visualization types that need to be provided. In order to discover relevant but missing visualization types in Lumira Desktop, we evaluated the visualization charts used in cancer research papers and tried to find the right chart type in SAP Lumira Desktop for each of the visualizations [16], [17]. During this process, we focused on identifying reusable visualization charts. As stated before, many papers contain unique, custom visualizations that are tailored to the specific data from the published research. Although these graphical representations are comprehensive and valid, we decided to disregard these visualizations as it is hard to reuse them in other clinical papers and contexts.

As a result of this evaluation, we discovered that Stransky et al. include normalized stacked bar charts where each stack in the bar represents the ratio relative to the total amount of the bar [16]. Although Lumira Desktop supports stacked bar charts, it is not able to normalize the bars as required in the paper. Another non-supported visualization type is the clustered heat map used in many papers including [17]. In order to have the biggest impact with our custom visualizations, we decided to implement the normalized stacked bar chart and the clustered heat map, as these chart types are slight modifications of standard visualization types and continuously used in the clinical context and publications.

The next sections present the approach, implementation details and integration into SAP Lumira for each of the missing chart types. In addition, we analyze the complexity of the rendering code and demonstrate some drawbacks of the implemented chart extensions.

A. Normalized Stacked Bar Chart

A normalized stacked bar chart, also referred to as a stacked percentage column chart, is a slight variation of the commonly used bar chart. Bar charts are very useful, because they are easy to understand, and their visual structure matches the internal structure of data in many cases. But in cases with more complex data, it is necessary to switch to some variations in order to better visualize and indicate relationships inside the structure of a data set. For example, stacked bar charts can show a grouped structure in the data and a hierarchy inside the data up to one level deep. This is particularly useful when showing and comparing the totals of all bars because they visually aggregate all of the categories in a bar. For the purpose of visual comparison, normalized stacked bar charts hide the total quantities of each bar by normalizing and using percentages, which makes it easier to see the relative difference between quantities in each bar, like in Fig. 3.

As described before, the first part of the implementation is defining the input needed for the chart extension, i.e. dimensions and measures. For the normalized stacked bar chart extension, we defined two dimensions (Entity and Stacked Entity) and a measure (Stacked Measure), as seen in Lst. 1. This way, Lumira knows the input that is expected by the chart extension, waits for the user to select the desired data sets and combines the data sets into a cross-table data format. In case of Fig. 3, we selected TUMOR NAME and GENE NAME as dimensions and #Mutations as the only measure, leading to cross-table data input similar to a matrix with the dimensions as axes and the measure as the concrete value of the matrix cell.

    // First dimension feed (Entity): TUMOR NAME in Fig. 3.
    chart_definition.addFeed({
        "type": "Dimension",
        "id": "viz.ext.hig.module.snst.plot.DS1",
        "name": "Entity", "aaIndex": 1,
        "min": 1, "max": 1
    });

    // Second dimension feed (Stacked Entity): GENE NAME in Fig. 3.
    chart_definition.addFeed({
        "type": "Dimension",
        "id": "viz.ext.hig.module.snst.plot.DS2",
        "name": "Stacked Entity", "aaIndex": 2,
        "min": 1, "max": 1
    });

    // Measure feed (Stacked Measure): #Mutations in Fig. 3.
    chart_definition.addFeed({
        "type": "Measure",
        "id": "viz.ext.hig.module.snst.plot.MS1",
        "name": "Stacked Measure", "mgIndex": 1,
        "min": 1, "max": Infinity
    });

Listing 1. Code snippet from the feed definition for the normalized stacked bar chart. Each feed has an ID, a name, a type – either Dimension or Measure – and an index that allows finding the specific data feed within a given data structure. There are also many more additional parameters and settings, such as max and min.

The data input given as a data matrix is traversed completely and afterwards sorted by the Stacked Entity dimension. Afterwards the stacked bar chart, axes, tool-tip, legend and other elements are rendered.

The complexity of the rendering code can be described as $O(n \cdot m + n \cdot m \cdot \log m) \subseteq O(n \cdot m \cdot \log m)$ with n being the number of Entity dimension values and m being the number of Stacked Entity dimension values.

A commonly known drawback of (normalized) stacked bar charts is that too many categories add more visual noise, making it hard to detect patterns in the data. A possible improvement for this chart extension could be to limit the number of categories inside each bar and consolidate the smaller stacks (e.g. less than 2%) into one "Other" stack if necessary.

B. Clustered Heat Map using Sorted Neighborhood Method

As discussed before, clustered heat maps are a very useful visualization, since they make it easier to identify patterns within a large interconnected data set. There are many approaches to cluster heat maps. We decided to develop 1) our own approach inspired by the Sorted Neighborhood Method (SNM) shown in Fig. 4, and 2) a hierarchical clustering
Fig. 3. Normalized Stacked Bar Chart with TUMOR NAME as Entity dimension, GENE NAME as Stacked Entity dimension and #Mutations as the only
measure in order to show the relative breakdown of the DNA mutation counts for affected genes in tumor samples. It is interesting that almost 50 percent of
the mutations across all tumor samples belong to the genes ATM and TP53.
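
As an illustration of the normalization behind a chart like Fig. 3, the following minimal JavaScript sketch converts the absolute values of one bar into percentages and, following the possible improvement mentioned in Sect. IV-A, merges stacks below 2% into one "Other" stack. The function names `normalizeBar` and `consolidateSmallStacks`, the `{ name, value }` data shape and the mutation counts are assumptions made for this example; they are not taken from the chart extension or the Lumira SDK.

```javascript
// Hypothetical helpers for illustration only (not part of the Lumira SDK).

// Convert the absolute stack values of one bar into percentages of the bar total.
// `stacks` is an array of { name, value } objects for a single Entity value.
function normalizeBar(stacks) {
  const total = stacks.reduce((sum, s) => sum + s.value, 0);
  return stacks.map((s) => ({
    name: s.name,
    // A bar with a total of 0 has no meaningful breakdown, so every stack gets 0.
    percent: total === 0 ? 0 : (s.value * 100) / total,
  }));
}

// Merge every stack whose share is below `minPercent` into a single "Other" stack.
function consolidateSmallStacks(normalized, minPercent) {
  const kept = normalized.filter((s) => s.percent >= minPercent);
  const otherPercent = normalized
    .filter((s) => s.percent < minPercent)
    .reduce((sum, s) => sum + s.percent, 0);
  return otherPercent > 0
    ? kept.concat([{ name: "Other", percent: otherPercent }])
    : kept;
}

// Made-up mutation counts for one tumor sample.
const bar = [
  { name: "TP53", value: 60 },
  { name: "ATM", value: 38 },
  { name: "KRAS", value: 1 },
  { name: "APC", value: 1 },
];
const normalizedBar = consolidateSmallStacks(normalizeBar(bar), 2);
console.log(normalizedBar);
```

For the made-up counts above, TP53 and ATM keep their shares of 60% and 38%, while KRAS and APC (1% each) end up in a 2% "Other" stack.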

approach presented in Sect. IV-C. This gives us the possibility to compare and evaluate different characteristics of both approaches, e.g. quality of the clustered result and runtime.

In this section, we present an approach for clustering a heat map that follows the idea of SNM. SNM is an approach from the field of duplicate detection and record linkage providing a parallelizable and efficient way to find duplicates in large data sets [18]. SNM consists of 1) defining sorting keys for each data record consisting of relevant fields, 2) sorting data records using the predefined key, and 3) merging data records within a fixed-size window that moves through the sorted records.

Before going into SNM, it is important to mention that the chart extension receives a data matrix from SAP Lumira that represents the heat map values. Therefore, the feed definition for this clustered heat map is based on Lst. 1.

We adapted SNM for our scenario to cluster a heat map horizontally and vertically. This is necessary in order to have a consistent clustering of the whole heat map on a row and column level. The SNM approach is illustrated in Fig. 5 and can be summarized in the following steps that are executed for rows and columns of the data matrix individually:

1) Defining and calculating the sorting key as the sum of the values for each row and column. Other sorting keys are possible but need to be evaluated, e.g. average, min / max delta and entropy. At the same time calculate $\Delta^{row}_{min\_max}$, the range between the highest and lowest row / column sum value, as defined in Eq. 2.
2) Sorting rows and columns based on the sorting key. This will naturally rearrange higher values in the data matrix to the matrix's upper left corner and the lower values to the lower right corner. After this step, the sorted heat map will be easier to understand and interesting findings in the data will be easier to detect.
3) Merging similar rows and columns by comparing the row value sum ($row\_sum_i$) with the value sum of the previous row ($row\_sum_{i-1}$), as defined in Eq. 1 – for columns respectively. $row\_sum_i$ is the sum of all values in the matrix row i and consequently $row\_sums$ is the set of all row value sums. Rows i and i−1 are considered to be in one cluster if their absolute difference is smaller than a certain fraction X of the calculated $\Delta^{row}_{min\_max}$. X is a user-defined threshold between 0 and 1. In combination with $\Delta^{row}_{min\_max}$, it indicates how much the rows can differ until they are considered to be dissimilar. In case the difference of the two value sums is not smaller than this threshold, we insert a cluster break. A cluster break means that the compared rows / columns belong to different clusters and therefore should be visually separated. Other clustering approaches also use a threshold [19].

$$\left| row\_sum_i - row\_sum_{i-1} \right| < \Delta^{row}_{min\_max} \cdot X \qquad (1)$$

$$\Delta^{row}_{min\_max} = \max(row\_sums) - \min(row\_sums) \qquad (2)$$

After applying the adjusted SNM row-wise and column-wise to the data matrix input, the heat map is rendered. The identified cluster breaks from the last step are used to create small gaps in order to visualize the clusters within the heat map, as seen in Fig. 4 and Fig. 5.

Eq. 3 shows the complexity of the rendering code that can be broken down into the complexity for: 1) Applying SNM to n row value sums and m column value sums. $O(n \cdot \log n + k \cdot n) \subseteq O(n \cdot \log n)$ is the complexity of SNM with a sliding
Fig. 4. Heat map shows the #Mutation count for a subset of genes and tumor samples clustered with the Sorted Neighborhood Method (SNM). In this example
the user defined X = 0.25 as the threshold. The drawback of SNM is also obvious in this example when there are significant differences between two rows,
such as row TP53 and APC and KRAS. This leads to clusters with one element which is usually not desirable.

window size k of 1 when applied to all rows of the data matrix [18]; for the columns, it is $O(m \cdot \log m)$ respectively. 2) Constructing the data structure and rendering the final heat map. This can be done in $O(n \cdot m)$ as the matrix of rows and columns has to be traversed completely. The overall asymptotic complexity of the algorithm is bound to $O(n \cdot m)$, which is necessary in order to construct the data structure and render the complete matrix as a heat map.

$$O(n \cdot \log n + m \cdot \log m + n \cdot m) \subseteq O(n \cdot m) \qquad (3)$$

Fig. 5. SNM approach with concrete heat map (data matrix) example. Step 1 shows the calculated row sums used as the sorting key. After sorting the data matrix in Step 2, the cluster breaks are included when the sliding window moves across the dimensions in Step 3. For the specific data matrix example, Eq. 2 returns $\Delta^{row}_{min\_max} = 21 - 9 = 12$ and X = 0.25, which means that a cluster break is included when the difference between two row sums is greater than or equal to 3 = 12 · 0.25 – for columns respectively.

C. Clustered Heat Map using Hierarchical Clustering

SAP Lumira is a general purpose tool for creating visualizations from any kind of structured data input. If we want to provide clustered heat maps as seen in Fig. 1, we need to use generic clustering algorithms that do not require specific knowledge about the data. Hierarchical clustering is a clustering method that builds a hierarchy of clusters from the given data by iteratively merging the closest data points into one cluster (agglomerative hierarchical clustering). In order to identify the clusters that should be combined, the clustering algorithm needs a measure of dissimilarity between sets of observations. For hierarchical clustering, the measure is formed by combining an appropriate metric for distance calculation between data points and a linkage criterion for calculating the distance between merged data points [9].

The correct definition of data points is crucial to every clustering algorithm, because the distance functions for the data points are key for applying clustering methods to any domain. Especially when applying clustering to a heat map (data matrix), it is important to have a clear definition of the data point that should be clustered, as the integrity and consistency of the heat map must not be destroyed. It is inappropriate to use the single heat map cell values as data points, because the clustering result does not consider the position (row and column) of the data point within the heat map. It is hard to exploit this clustering result in order to rearrange the heat map consistently.

As a result, we decided to cluster rows and columns of the heat map individually by using row and column vectors. A row vector is an m-dimensional vector where m is the number of columns – and vice versa for column vectors. Having all row and column vectors, it is possible to cluster one dimension
and leverage the output of the clustering in order to rearrange • Including clustering information in data sets retrieved
and split the heat map along this dimension accordingly. We by SAP Lumira. The clustering algorithm could be per-
used primarily the Euclidean distance function to calculate the formed periodically on the data stored in HANA using
distance between the vectors. Single linkage was used as the its Predictive Analysis Library (PAL) [20]. Once a data
linkage criteria. set is requested by Lumira, the response can be enhanced
with pre-calculated clustering information that could be
The result of any hierarchical clustering is a dendrogram, a
used directly to render the heat map. A drawback is that
binary tree with data points as leaves. It represents the clustered
users are limited to data sets that have been clustered
data points and the nested clusters at certain similarity levels,
beforehand.
as shown in Fig. 6. The dendrogram is used to rearrange /
• Separate, auxiliary request in HANA database. Once the
reorder the heat map and identify the right position to split
user selects the data sets to visualize, SAP Lumira makes
the heat map. For example, rearranging rows in the heat map
a separate request to the HANA XS Engine and asks to
according to the clustering result can be done by traversing the
cluster the selected data set on the fly [12]. Finally, the
dendrogram tree and leveraging the order of the leaves which
heat map is rendered based on the clustering result. This
are by our definition the row vector. Due to this, similar rows
seems to be the worst option in terms of network overhead
will be side by side or at least close to each other.
and ad-hoc clustering as PAL requires setup time [20].
• Using existing JavaScript clustering libraries when ren-
dering the heat map inside SAP Lumira. Although clus-
tering can be a resource intensive task, we believe that
it can be performed on the client-side, because you can
assume that the amount of data is somehow limited by
the amount of information a user is able to comprehend.
So there will not be a useful heat map with over a million
cells that would require powerful infrastructure. The data
amount that need to be clustered is relatively small and
therefore manageable. Doing the clustering inside SAP
Lumira avoids additional communication with the HANA
data store and it performs better than the other options
considering the limited data set, although this still needs
proper benchmarking.
Fig. 6. Dendrogram tree as the result of applying hierarchical clustering to
the row vectors in Fig. 1. Traversing the tree retrieves the order of the rows Although it affects the responsiveness when calculating
based on its similarity and the cluster breaks. Cluster breaks are the turning and rendering the clustered heat map inside SAP Lumira, we
points in the tree that split up into two clusters, denoted as the black nodes. decided to implement the last option using clusterfck.js, a light
weight clustering library with limited feature set [21].
Identifying an appropriate position to include a gap in the heat map is difficult, because there is no specific information about the data and users' expectations regarding clusters have to be anticipated. There is no single approach for this problem, as it highly depends on the given data set [19]. However, there are approaches trying to identify and extract clusters from dendrograms without any specific knowledge about the data [19]. We want to pursue the same goal, but with a different approach.

Our implementation extracts cluster breaks from the dendrogram tree by searching for turning point nodes. A turning point node in a dendrogram tree is a node that has two non-leaf nodes as children, denoted as black nodes in Fig. 1. Such a node splits up the remainder of the tree into two areas that contain similar data points, i.e. row and column vectors. These areas can be seen as clusters and the black nodes as cluster breaks. The turning points and cluster breaks can be identified by doing an in-order traversal. The result of this traversal is an ordered list containing the rows / columns and cluster breaks, as shown in the bottom of Fig. 1. This list is used to add small gaps when rendering the heat map, see Fig. 1.

Implementing a hierarchical clustering algorithm in SAP Lumira can be done in many ways. It was clear to us that we would exploit existing algorithm implementations provided as a library, but there are a few options from an infrastructure point of view that needed to be considered.

The runtime complexity of the rendering code, given in Eq. 4, can be broken down into the complexity of:
1) Applying hierarchical clustering to $n$ row vectors and $m$ column vectors. The runtime complexity of hierarchical clustering is $O(n^2)$, because we used the Euclidean distance metric and the single linkage criterion in our implementation. Using other distance or linkage functions can increase the runtime complexity to $O(n^3)$.
2) Traversing the dendrogram tree for both clustering results. An in-order traversal can be done in $O(n)$, where $n$ is the number of leaves, which are row or column vectors.
3) Constructing the data structure and rendering the final heat map, which can be done in $O(n \cdot m)$ as the matrix of rows and columns has to be traversed completely.
This breakdown leads to Eq. 4, which is bounded by $O(n^2 + m^2)$.

$$O\big((n^2 + m^2) + (n + m) + (n \cdot m)\big) \subseteq O(n^2 + m^2) \tag{4}$$

V. EVALUATION

In this section, we present an evaluation of the chart extensions presented in this work. This evaluation focuses on comparing the two presented approaches in terms of the runtime complexity of the rendering code and the results on a qualitative basis.

Assessing the runtime complexity is a good indicator of performance behavior of the implementation when data are
scaled up. Although we do not assume that the visualizations will have to deal with a large amount of data, it is still necessary to analyze the complexity to avoid unresponsive user interfaces. Eq. 3 shows the complexity for the heat map with SNM. The complexity for the hierarchically clustered heat map is given in Eq. 4. Comparing the complexities, we noticed that both approaches belong to the same quadratic complexity class. However, the breakdown of each runtime complexity reveals that the SNM approach for clustering will be faster, because its asymptotic complexity is bounded by the rendering of the heat map, i.e. $O(n \cdot m)$. The hierarchical clustering approach is bounded by the runtime of the hierarchical clustering algorithms applied to rows and columns of the heat map, i.e. $O(n^2 + m^2)$. For instance, with $n = 100$ rows and $m = 20$ columns, $n \cdot m = 2{,}000$ whereas $n^2 + m^2 = 10{,}400$.

Looking at both clustering results, we are pleased with the quality of the clustering considering the fact that both approaches work without additional knowledge about the data. The readability and comprehensibility of the heat map improved. However, the hierarchically clustered heat map seems to produce better results in general than the heat map clustered with SNM. The reason for this is that clusters are identified directly from the result of the hierarchical clustering algorithm that continuously merges data points into clusters. On the other hand, we observed that SNM reacts sensitively to significant differences between row and column data, leading to clusters consisting of one element, as shown in Fig. 4. Furthermore, the SNM approach disregards slowly increasing row value sums, which can result in undiscovered clusters.

As a result, we can say that the clustered heat map using a hierarchical clustering algorithm produces a clearer and more reliable clustering result. Considering the rather small amount of data passed to the clustered heat map, its higher complexity will not have a significant impact on the runtime performance.

In order to discover further potential for optimization and reveal the benefit of clustered heat maps for cancer researchers, it is necessary to extend the evaluation and conduct benchmarks assessing other criteria, such as user experience and space complexity.

VI. CONCLUSION AND OUTLOOK

In this work, we presented the need for cancer researchers to have an interactive visualization of their collected data in order to explore and discover patterns and insights in their data. For this goal, we exploited and adjusted the capabilities of SAP Lumira, an existing data visualization tool, to fit the needs of cancer researchers. This required the design of new visualization charts that were missing: 1) Normalized Stacked Bar Chart, 2) Clustered Heat Map with Sorted Neighborhood Method, and 3) Clustered Heat Map with hierarchical clustering. For each visualization we explained the design and shared insights, implementation details and considerations from the integration of chart extensions in SAP Lumira. Finally, we compared both clustered heat map approaches and concluded that hierarchical clustering leads to better and more reliable clustering results.

Concerning future work, we will focus on a scientific evaluation of the developed visualization charts including performance and a qualitative benchmark. As part of the qualitative benchmark, it is required to reengage and conduct user studies with the cancer researchers focused on the comprehensibility and usefulness of the two clustered heat map approaches. The resulting feedback will help to refine the chart extensions in SAP Lumira. Simultaneously, we will continue developing new chart extensions that are necessary in the context of cancer research and treatment, such as a combination of clustered heat map and dendrogram [7], [17].

REFERENCES

[1] M.-P. Schapranow and H. Plattner, "HIG – An In-memory Database Platform Enabling Real-time Analyses of Genome Data," in Proceedings of the International Conference on Big Data, 2013, pp. 691–696.
[2] S. Aechtner, "An Algorithm to Calculate Functional Changes in the DNA Utilizing In-Memory Technology," 2014.
[3] T. Schubotz, "Drug Response Analysis by Association Rule Mining based on Genetic Changes in Genome Data," 2014.
[4] D. Petrick, "Drug Response Classification and Prediction for Tumor Data with Support Vector Machines," 2014.
[5] D. Stengel, M. Bhandari, and B. Hanson, Statistik und Aufbereitung klinischer Daten. Georg Thieme Verlag, 2011.
[6] T. Snyder, "Data Visualization for Clinical Trials Data Management and Operations," in A Picture is Worth a Thousand Tables. Springer, 2012, pp. 359–372.
[7] M. B. Eisen, P. T. Spellman, P. O. Brown et al., "Cluster Analysis and Display of Genome-wide Expression Patterns," Proceedings of the National Academy of Sciences, vol. 95, no. 25, pp. 14863–14868, 1998.
[8] B. A. Teicher, Tumor Models in Cancer Research. Springer, 2002, vol. 10.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: a Review," ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
[10] SAP AG. (2014, Mar.) SAP Lumira User Guide 1.15. [Online]. Available: http://help.sap.com/businessobject/product_guides/vi01/en/lum_115_user_en.pdf
[11] SAP AG. (2014, Mar.) SAP Lumira Server User Guide 1.15. [Online]. Available: http://help.sap.com/businessobject/product_guides/vi01/en/lumS115_user_en.pdf, http://help.sap.com/businessobject/product_guides/vi01/en/lumS115_install_en.pdf
[12] SAP AG. (2014, Mar.) SAP HANA Developer Guide. [Online]. Available: http://help.sap.com/hana/SAP_HANA_Developer_Guide_en.pdf
[13] SAP AG. (2014, Feb.) SAP Lumira SDK Getting Started Guide 1.14. [Online]. Available: http://help.sap.com/businessobject/product_guides/vi01/en/lum_114_vp_en.pdf
[14] SAP AG. (2014, Feb.) SAP Lumira Community. [Online]. Available: http://scn.sap.com/community/lumira
[15] SAP AG. (2014, Mar.) SAP Lumira User Guide 1.11. [Online]. Available: http://help.sap.com/businessobject/product_guides/vi01/en/vi1_0_11_user_en.pdf
[16] N. Stransky, A. M. Egloff, A. D. Tward et al., "The Mutational Landscape of Head and Neck Squamous Cell Carcinoma," Science, vol. 333, no. 6046, pp. 1157–1160, 2011.
[17] C. C. Whiteford, S. Bilke, B. T. Greer et al., "Credentialing Preclinical Pediatric Xenograft Models using Gene Expression and Tissue Microarray Analysis," Cancer Research, vol. 67, no. 1, pp. 32–40, 2007.
[18] M. A. Hernández and S. J. Stolfo, "The merge / purge problem for large databases," in ACM SIGMOD Record, vol. 24, no. 2. ACM, 1995, pp. 127–138.
[19] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky, "Automatic Extraction of Clusters from Hierarchical Clustering Representations," in Advances in Knowledge Discovery and Data Mining. Springer, 2003, pp. 75–87.
[20] SAP AG. (2014, Mar.) SAP HANA Predictive Analysis Library (PAL). [Online]. Available: http://help.sap.com/hana/SAP_HANA_Predictive_Analysis_Library_PAL_en.pdf
[21] (2012) Clusterfck: A Clustering Analysis Library in JavaScript. [Online]. Available: https://github.com/harthur/clusterfck

All online references checked on Mar. 31, 2014.
