Research - Cancer - Lumira - Study Case
net/publication/264791578
1 author:
Gerardo Navarro S
Hasso Plattner Institute
All content following this page was uploaded by Gerardo Navarro S on 18 August 2014.
Abstract—Cancer researchers investigate the best treatment for certain cancer types. After conducting xenograft experiments and collecting additional information, they are faced with a large amount of data. In order to identify patterns more easily, they summarize the collected data in visual charts. This is manual and tedious work with only minimal tool support. In order to fill this gap, we leverage and tailor existing visualization tools to meet the needs of cancer researchers. In this work, we focus on the design, development and integration of custom chart extensions into the SAP Lumira visualization tool. Thereby, we achieve the goal of providing an interactive and exploratory way for cancer researchers to discover insights within their data.

Keywords—SAP Lumira, Clustered Heat Map

I. INTRODUCTION

In the last decade, faster sequencing and decoding of the human genome opened up new areas in medical research, as it is now possible to dig deeper into certain diseases. Furthermore, this additional data source revealed considerable potential for personalized medicine, because it allows patients to be treated specifically based on their individual DNA or disposition [1]. Given this large amount of diagnostic data, it is clear that researchers need technical help to exploit all information inside the data – especially in the context of cancer research and treatment. Some data analysis tasks are already supported by applications using in-memory technology, e.g. alignment and variant calling [1].

Besides that, we extend the existing capabilities to the area of drug response analysis focused on cancer researchers. In our project scenario, cancer researchers investigate the best treatment for certain cancer types, such as head and neck cancer, at the DNA level. As part of their research, they collect meta data and tumor samples from patients and conduct many xenograft experiments, which produce a large amount of data.

This data is edited, enhanced and analyzed by researchers in a manual and tedious process that takes from several days up to weeks. As part of the Analyze Genomes Project (AGP), we provide a cloud service that automates parts of this manual work [1]. Scientists upload the collected tumor-specific data to the online service, which processes the data and calculates additional information [2]. To get a deeper understanding of their data, researchers initiate data analysis algorithms to find relevant dependencies within the tumor data or to classify the tumor data with machine learning approaches [3], [4].

Finally, researchers want to summarize the data in visual charts, because a graphical representation of raw data is easier for people in general to understand [5]. Additionally, it allows them to identify patterns more easily inside their data. Nowadays, the creation of such charts is a manual and inefficient task, as there is little tool support. In addition, the created charts are static visualizations, which prevents researchers from interacting and engaging with their data freely. We see the need to tackle this problem by exploiting existing visualization tools. As a goal, we envision a tool providing interactive data exploration capabilities to cancer researchers. Researchers can engage with the data via the visualization and eventually gain new insights.

Given that there are many Business Intelligence (BI) tools that focus on data visualization, it is not necessary to develop a new solution from the ground up, but rather to adjust existing BI tools to our use case. This also makes it possible to deliver the visualization tool to the researchers within the time frame of the project. SAP Lumira is such a BI tool that allows cancer researchers to access their data sources, transform the data, and eventually visualize data graphically. As a result, the rich set of visualization charts enables users with a basic technical understanding to explore their data freely and quickly discover valuable insights within these charts, without having to write scripts and SQL queries. Unfortunately, Lumira's visualization chart portfolio misses some chart types that are specifically used in the context of cancer research, such as a normalized stacked bar chart. Fortunately, SAP Lumira allows the definition and implementation of custom visualization charts as so-called chart extensions.

The contribution presented in this paper focuses on the design, development and integration of custom chart extensions into SAP Lumira to achieve the goal of providing an interactive and exploratory way for cancer researchers to discover insights within their data, like the clustered heat map shown in Fig. 1.

The remainder of the paper is structured as follows: In Sect. II, we go deeper into visualizations in the clinical context as part of the related work. Section III presents basic background information on SAP Lumira. In Sect. IV, we go deeper into the design of the new visualization charts and explain the details of implementing custom chart extensions for SAP Lumira. We evaluate and compare the developed charts in Sect. V. Our work concludes with an outlook in Sect. VI.

Fig. 1. Clustered heat map that shows the #Mutation count for a subset of genes and tumor samples. This SAP Lumira chart extension uses hierarchical clustering to sort and rearrange its cells. Through this, it is easier to understand the data and to detect insightful patterns, e.g. genes ATM and TP53 are relevant across all samples.

II. RELATED WORK

Before starting to work on specific visualizations in SAP Lumira, it is important to understand what graphical representations are currently used for clinical data and what visualization type would be helpful to cancer researchers. Stengel et al. and Snyder emphasize that the main goal of any visualization is to effectively communicate the information contained in the data in a non-distorting manner [5], [6]. As a result, the choice of a certain visualization depends on the specific data set that researchers wish to present. However, both books state that certain diagram types are more suited to the clinical context and therefore used more frequently, e.g. column bar charts, box plots and scatter plots.

Clinical research usually deals with a large amount of cross-referenced data that finally needs to be visualized in a comprehensive way that points out the main findings within the large data set. For this situation, heat maps proved to be the most effective way of displaying high-dimensional array data, and are particularly suitable for gene expression array data [7]. In our scenario, cancer researchers often use the heat map visualization to explore certain aspects across all their collected tumor data, such as the effectiveness of drugs across a large set of tumors. The full potential of heat maps is revealed when the data is sorted or clustered. A sorted or clustered heat map ensures that similar data points are arranged close together, e.g. similar genes are located close to each other, which makes it easier to identify patterns across the large data set. The sorting is often achieved by applying hierarchical clustering methods separately on rows and columns of the heat map [8]. It is a more generic clustering approach that does not require specific knowledge about the data. Additionally, there are partitional clustering methods that map the data points to a known number of clusters, like K-means clustering. This is achieved by repeatedly assigning all samples to one of K clusters based on which cluster centroid is closest [9].

Examples for the usage of (clustered) heat maps, bar charts, box plots and combinations of different other visualizations in the clinical context can be found in various papers [6] [8]. Most papers also include unique and custom charts tailored to their respective data set, because these unique charts emphasize certain aspects of the data in a better way than standard visualization chart types. These unique charts are not supported by SAP Lumira. Hence, it is important to identify and prioritize the visualizations that are not supported.

III. SAP LUMIRA OVERVIEW

SAP Lumira is a new self-service BI solution from SAP. The solution allows analysts, decision makers, and now cancer researchers to access one or multiple data sources, transform the data and eventually visualize data graphically. As a result, the rich set of visualizations enables users with a basic technical understanding to engage with their data and quickly discover valuable insights, without having to write scripts and SQL queries [10].

A. Integrating into SAP Lumira

SAP Lumira as a visualization tool is a good starting point for our scenario because of its rich feature set and easy-to-use interface for non-technical persons. The researchers would have three possibilities to interact with SAP Lumira:

1) SAP Lumira Desktop is a desktop program installed on Windows machines that enables users to prepare data from multiple sources, compose visualizations and share the defined visualizations with other contributors.
2) SAP Lumira Cloud is an online platform providing a feature set similar to Lumira Desktop, but makes it available for mobile devices and browsers.
3) SAP Lumira Server is a UI5 HANA-based XS application that will run on-premise with mobile web access and a user experience similar to Lumira Cloud, but specifically targeted to integrate well with the deployed SAP HANA repository [11], [12].
Amongst the three presented solutions, SAP Lumira Desktop allowed faster development and deployment of custom charts, for the following reasons:

• Lumira Desktop includes a software development kit (SDK) that provides stable API interfaces and utilities, streamlines the development steps, and makes it easier to implement custom visualizations as compact bundles [13].
• During our project, SAP Lumira Server was still in development and therefore only available as a developer release with no SDK or extension mechanism in place, unstable interfaces and no detailed documentation. Furthermore, working with the developer release of Lumira Server involved a large amount of reverse engineering and trial-and-error debugging that led to laborious and inefficient development.
• Developing custom visualization charts with the Lumira SDK makes it easier to port and deploy the custom charts to Lumira Server once a similar extension mechanism is established. An extension mechanism for visualization charts is planned to be included in post Q2 releases of Lumira Server.

B. Building custom visualizations in SAP Lumira

SAP Lumira offers a variety of visualizations out of the box that can be of use in clinical publications [10]. However, it is impossible for Lumira to cover all possible visualization chart types. Therefore, the development team of Lumira Desktop introduced an SDK, which provides an extension framework with an associated API that lets you develop your own chart and integrate it with SAP Lumira [13]. The SDK includes the so-called VizPacker utility, a web client that helps to create the directory structure and the bundle needed to develop a visualization extension.

The development of custom visualizations with the SDK uses JavaScript exclusively as its programming language and takes advantage of two main libraries:

• SAP CVOM (Common Visual Object Modeler), also referred to as SAP HTML5 Visualization, is a visualization and charting engine designed to work with HTML5 technology to create different visualizations. CVOM provides utility classes to register and build your own visualizations in SAP Lumira and other SAP products, and relies on auxiliary libraries in the background, e.g. jQuery and RequireJS.
• D3.js is a JavaScript library for manipulating HTML documents. It allows you to bind arbitrary data to the Document Object Model (DOM), and then apply data-driven transformations to the DOM. This approach makes it easy to create visualizations in a fast and lean way depending on the available data.

Fig. 2. The development process for creating custom visualization chart extensions in SAP Lumira.

The documentation for the Lumira SDK is limited to a handbook and blog entries from the community [13], [14]. Unfortunately, there is little documentation and insight available regarding what a lean and efficient development process can look like for SAP Lumira chart extensions. In order to share our experiences and fill this gap, we present the structured development process we used and refined throughout our project. Figure 2 illustrates the process broken down into the following steps:

1) Design and define the requirements for your new visualization. What is the input – dimensions and measures [10], [15]? What should be visualized and how? At this point, a well-considered definition of required dimensions and measures is crucial. This avoids tedious refactoring later in the development / implementation phase.
2) Set up the development and test environment using the VizPacker utility. The VizPacker is able to automatically generate the test environment, which consists of an HTML website, the necessary files and JavaScript libraries including your chart extension bundle. As part of the setup, it is essential to construct and integrate the appropriate test data set into the test environment as defined earlier. As a best practice, we recommend extracting a real test data set directly from SAP Lumira using its debugging capabilities [13, p. 12].
3) Develop the chart extension bundle. Start by completing the declaration of necessary dimensions and measures in the feed definition that is needed to bind data to the chart [15, p. 111] [13]. Continue working on the rendering code for the chart extension with D3.js – start by creating a preliminary version with static data and enhance your chart extension iteratively by including the real data and other visualization parameters given by the SDK, e.g. width and color palette. Finally, test your chart extension in the HTML test environment and browser before deploying it to SAP Lumira Desktop.
4) Deploy your custom chart extension to SAP Lumira. In case of Lumira Desktop, this is done by copying your bundle to a certain directory on the Windows file system [13, p. 9]. When Lumira Desktop starts, it will look for new chart extensions and dynamically load them into the runtime.
5) Validate and test the chart extension in SAP Lumira. Are all requirements fulfilled? What can be refined? What about performance? These new requirements are the input
for the next development cycle.

The main characteristic of the proposed development process is that the chart extension implementation happens mainly outside of SAP Lumira Desktop, because Lumira Desktop does not support ad-hoc reloading of bundles containing the chart extension [13, p. 24]. Therefore, you are forced to develop new visualization charts in a separate environment using tools like the VizPacker contained in the Lumira SDK.

There are many helpful examples for custom visualizations available on the web, such as a flag bar chart for showing the top three Olympic medal winners for 2004 and 2008 [14], [15].

IV. IMPLEMENTATION IN SAP LUMIRA

Having decided to build the custom visualizations inside SAP Lumira Desktop, our next step was to identify the necessary visualization types that need to be provided. In order to discover relevant but missing visualization types in Lumira Desktop, we evaluated the visualization charts used in cancer research papers and tried to find the right chart type in SAP Lumira Desktop for each of the visualizations [16], [17]. During this process, we focused on identifying reusable visualization charts. As stated before, many papers contain unique, custom visualizations that are tailored to the specific data from the published research. Although these graphical representations are comprehensive and valid, we decided to disregard these visualizations, as it is hard to reuse them in other clinical papers and contexts.

As a result of this evaluation, we discovered that Stransky et al. include normalized stacked bar charts where each stack in the bar represents the ratio relative to the total amount of the bar [16]. Although Lumira Desktop supports stacked bar charts, it is not able to normalize the bars as required in the paper. Another non-supported visualization type is the clustered heat map used in many papers including [17]. In order to have the biggest impact with our custom visualizations, we decided to implement the normalized stacked bar chart and the clustered heat map, as these chart types are slight modifications of standard visualization types and continuously used in the clinical context and publications.

The next sections present the approach, implementation details and integration into SAP Lumira for each of the missing chart types. In addition, we analyze the complexity of the rendering code and demonstrate some drawbacks of the implemented chart extensions.

A. Normalized Stacked Bar Chart

A normalized stacked bar chart, also referred to as a stacked percentage column chart, is a slight variation of the commonly used bar chart. Bar charts are very useful, because they are easy to understand, and their visual structure matches the internal structure of data in many cases. But in cases with more complex data, it is necessary to switch to some variations in order to better visualize and indicate relationships inside the structure of a data set. For example, stacked bar charts can show a grouped structure in the data and a hierarchy inside the data up to one level deep. This is particularly useful when showing and comparing the totals of all bars, because they visually aggregate all of the categories in a bar. For the purpose of visual comparison, normalized stacked bar charts hide the total quantities of each bar by normalizing and using percentages, which makes it easier to see the relative difference between quantities in each bar, like in Fig. 3.

As described before, the first part of the implementation is defining the input needed for the chart extension - dimensions and measures. For the normalized stacked bar chart extension, we defined two dimensions (Entity and Stacked Entity) and a measure (Stacked Measure), as seen in Lst. 1. This way, Lumira knows the input that is expected by the chart extension, waits for the user to select desired data sets and combines the data sets into a cross-table data format. In case of Fig. 3, we selected TUMOR NAME and GENE NAME as dimensions and #Mutations as the only measure, leading to a cross-table data input similar to a matrix with the dimensions as axes and the measure as the concrete value of the matrix cell.

    // First dimension: one bar per value
    chart_definition.addFeed({ "type": "Dimension",
        "id": "viz.ext.hig.module.snst.plot.DS1",
        "name": "Entity", "aaIndex": 1,
        "min": 1, "max": 1
    });

    // Second dimension: one stack per value inside each bar
    chart_definition.addFeed({ "type": "Dimension",
        "id": "viz.ext.hig.module.snst.plot.DS2",
        "name": "Stacked Entity", "aaIndex": 2,
        "min": 1, "max": 1
    });

    // Measure: the value that determines the size of each stack
    chart_definition.addFeed({ "type": "Measure",
        "id": "viz.ext.hig.module.snst.plot.MS1",
        "name": "Stacked Measure", "mgIndex": 1,
        "min": 1, "max": Infinity
    });

Listing 1. Code snippet from the feed definition for the normalized stacked bar chart. Each feed has an ID, a name, a type – either Dimension or Measure – and an index that allows finding the specific data feed within a given data structure. There are also many additional parameters and settings, such as max and min.

The data input given as a data matrix is traversed completely and afterwards sorted by the Stacked Entity dimension. Afterwards the stacked bar chart, axes, tool-tip, legend and other elements are rendered.

The complexity of the rendering code can be described as O(n · m + n · m · log(m)) ⊆ O(n · m · log(m)) with n being the number of Entity dimension values and m being the number of Stacked Entity dimension values.

A commonly known drawback of (normalized) stacked bar charts is that too many categories add more visual noise, making it hard to detect patterns in the data. A possible improvement for this chart extension could be to limit the number of categories inside each bar and consolidate the smaller stacks (e.g. less than 2%) into one "Other" stack if necessary.

B. Clustered Heat Map using Sorted Neighborhood Method

As discussed before, clustered heat maps are a very useful visualization, since they make it easier to identify patterns within a large interconnected data set. There are many approaches to cluster heat maps. We decided to develop 1) our own approach inspired by the Sorted Neighborhood Method (SNM) shown in Fig. 4, and 2) a hierarchical clustering
approach presented in Sect. IV-C. This gives us the possibility to compare and evaluate different characteristics of both approaches, e.g. quality of the clustered result and runtime.

Fig. 3. Normalized Stacked Bar Chart with TUMOR NAME as Entity dimension, GENE NAME as Stacked Entity dimension and #Mutations as the only measure in order to show the relative breakdown of the DNA mutation counts for affected genes in tumor samples. It is interesting that almost 50 percent of the mutations across all tumor samples belong to the genes ATM and TP53.

In this section, we present an approach for clustering a heat map that follows the idea of SNM. SNM is an approach from the field of duplicate detection and record linkage providing a parallelizable and efficient way to find duplicates in large data sets [18]. SNM consists of 1) defining sorting keys for each data record consisting of relevant fields, 2) sorting data records using the predefined key, and 3) merging data records within a fixed-size window that moves through the sorted records.

Before going into SNM, it is important to mention that the chart extension receives a data matrix from SAP Lumira that represents the heat map values. Therefore, the feed definition for this clustered heat map is based on Lst. 1.

We adapted SNM for our scenario to cluster a heat map horizontally and vertically. This is necessary in order to have a consistent clustering of the whole heat map on a row and column level. The SNM approach is illustrated in Fig. 5 and can be summarized in the following steps that are executed for rows and columns of the data matrix individually:

1) Defining and calculating the sorting key as the sum of the values for each row and column. Other sorting keys are possible but need to be evaluated, e.g. average, min / max delta and entropy. At the same time calculate $\Delta row_{min\_max}$, the range between the highest and lowest row / column sum value, as defined in Eq. 2.
2) Sorting rows and columns based on the sorting key. This will naturally rearrange higher values in the data matrix to the matrix's upper left corner and the lower values to the lower right corner. After this step, the sorted heat map is easier to understand and interesting findings in the data are easier to detect.
3) Merging similar rows and columns by comparing the row value sum ($row\_sum_i$) with the value sum of the previous row ($row\_sum_{i-1}$), as defined in Eq. 1 – for columns respectively. $row\_sum_i$ is the sum of all values in the matrix row i and consequently $row\_sums$ is the set of all row value sums. Rows i and i−1 are considered to be in one cluster if their absolute difference is smaller than a certain fraction X of the calculated $\Delta row_{min\_max}$. X is a user-defined threshold between 0 and 1. In combination with $\Delta row_{min\_max}$, it indicates how much the rows can differ until they are considered to be dissimilar. In case the difference of the two value sums is greater than the threshold, we insert a cluster break. A cluster break means that the compared rows / columns belong to different clusters and therefore should be visually separated. Other clustering approaches also use a threshold [19].

\[ |row\_sum_i - row\_sum_{i-1}| < \Delta row_{min\_max} \cdot X \tag{1} \]
\[ \Delta row_{min\_max} = \max(row\_sums) - \min(row\_sums) \tag{2} \]

After applying the adjusted SNM row-wise and column-wise to the data matrix input, the heat map is rendered. The identified cluster breaks from the last step are used to create small gaps in order to visualize the clusters within the heat map, as seen in Fig. 4 and Fig. 5.

Eq. 3 shows the complexity of the rendering code that can be broken down into the complexity for: 1) Applying SNM to n row value sums and m column value sums. O(n · log(n) + k · n) ⊆ O(n · log(n)) is the complexity of SNM with a sliding

Fig. 4. Heat map that shows the #Mutation count for a subset of genes and tumor samples clustered with the Sorted Neighborhood Method (SNM). In this example the user defined X = 0.25 as the threshold. The drawback of SNM is also obvious in this example when there are significant differences between two rows, such as rows TP53, APC and KRAS. This leads to clusters with one element, which is usually not desirable.
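The three steps of the adapted SNM (sum as sorting key, sorting, threshold-based cluster breaks as in Eq. 1 and Eq. 2) can be illustrated with a minimal JavaScript sketch for the rows of a data matrix. This is not the code of the chart extension: the function name clusterRowsBySum, its return shape and the plain array-of-arrays input are assumptions made for illustration only, and the sketch treats a difference exactly equal to the threshold as belonging to the same cluster, which the text leaves open.

```javascript
// Hypothetical helper (not part of the Lumira SDK or the original extension).
// matrix: array of rows, each row an array of numbers.
// threshold: the user-defined X between 0 and 1.
function clusterRowsBySum(matrix, threshold) {
  // Step 1: the sorting key of each row is the sum of its values.
  const rows = matrix.map((values, index) => ({
    index: index,
    sum: values.reduce((acc, v) => acc + v, 0),
  }));

  // Range between the highest and lowest row sum (Eq. 2).
  const sums = rows.map((r) => r.sum);
  const range = Math.max(...sums) - Math.min(...sums);

  // Step 2: sort descending so that rows with high sums come first.
  rows.sort((a, b) => b.sum - a.sum);

  // Step 3: compare each row with its predecessor. A cluster break is
  // recorded before position i when the difference exceeds range * X.
  const breaks = [];
  for (let i = 1; i < rows.length; i++) {
    const diff = Math.abs(rows[i].sum - rows[i - 1].sum);
    if (diff > range * threshold) {
      breaks.push(i);
    }
  }

  // order: original row indices in sorted order;
  // breaks: positions in the sorted order where a new cluster starts.
  return { order: rows.map((r) => r.index), breaks: breaks };
}

// Example input: row sums are 2, 19, 18 and 1, so the range is 18.
const example = clusterRowsBySum([[1, 1], [10, 9], [9, 9], [0, 1]], 0.25);
console.log(example.order, example.breaks);
```

The same function could be applied to the columns by transposing the matrix first. A renderer would then draw the rows in the returned order and leave a small gap before every position listed in breaks.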