Annex 1 - Description of The Action (Part B)
History of Changes
Date        Description
2020-05-21  Corrections of deliverable types (consistent with existing description): D1.2 RORDP, D3.4 DEMR, D4.1 DEMR, D4.3 DEMR, D10.3 RDEM; changed deliverable dissemination level: D10.1/D10.2 PU CO.
2020-05-21  Removed proposal content according to Grant Agreement instructions to avoid unnecessary redundancy regarding Part A; minor formatting and rewording.
2020-05-21  Added explicit commitment to support BDVA in all events relevant to the activities of the project (Section 2.2.1).
Table of Contents
1 Excellence
  1.1 Objectives
  1.2 Relation to the Work Program
  1.3 Concept and Methodology
  1.4 Ambition
2 Impact
  2.1 Expected Impacts
  2.2 Measures to Maximize Impact
3 Implementation
  3.1 Work Plan
  3.2 Management Structure and Procedures
  3.3 Consortium as a Whole
  3.4 Resources to Be Committed
4 Members of the Consortium
  4.1 Participants
  4.2 Third Parties Involved in the Project (Including Use of Third-Party Resources)
5 Ethics and Security
  5.1 Ethics
  5.2 Security
Productivity and Systems Support: Integrated, complex pipelines that involve data integration, cleaning, and preparation, machine learning model training and scoring, as well as high-performance libraries still require substantial manual effort, often by specialized teams that are unavailable to small and medium-sized enterprises. This is due to different
1 Matei Zaharia et al.: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012, pages 15-28.
2 Alexander Alexandrov et al.: The Stratosphere platform for big data analytics. VLDB J. 23(6) 2014, pages 939-964.
3 Tyler Akidau et al.: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. PVLDB 8(12) 2015, pages 1792-1803.
4 Martín Abadi et al.: TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016, pages 265-283.
5 Xiangrui Meng et al.: MLlib: Machine Learning in Apache Spark. JMLR 17, 2016, pages 1-7.
6 Matthias Boehm et al.: SystemML: Declarative Machine Learning on Spark. PVLDB 9(13) 2016, pages 1425-1436.
7 Adam Paszke et al.: PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019.
8 Luca Canali: Big Data Tools and Pipelines for Machine Learning in HEP. CERN-EP/IT Data Science Seminar 2019.
9 Katie Bouman: Imaging the Unseen: Taking the First Picture of a Black Hole. Spark Summit Europe 2019.
10 Yohai Bar-Sinai et al.: Learning data-driven discretizations for partial differential equations. PNAS 116(31) 2019, pages 15344-15349.
Overhead and Low Utilization: Relying on separate, statically provisioned clusters for data
management, ML systems, and HPC inevitably leads to unnecessary overhead for data
exchange and underutilization in the presence of workload fluctuations. The lack of
interoperability often even requires coarse-grained file exchange and renders the identification
of redundancy and optimization opportunities nearly impossible. This unnecessary data
exchange is problematic because studies11 have shown that even for ML scoring workloads
on consumer devices, 62.7% of total system energy is spent on data movement from the
memory system to the compute units and subsequent data transformations. With the trend
toward hardware specialization in terms of heterogeneous devices, along with vendor-
provided libraries, we expect this problem to become even more severe.
Lack of Common System Infrastructure: Despite some unidirectional efforts (from any of
the involved communities) on new programming models, the different systems for data
management and processing, ML systems, and HPC are still largely separated. Conceptual
ideas and techniques are reused but redundantly implemented in these different systems.
Although many systems are open source, they are often company-controlled (e.g., aiming at
user lock-in to generate cloud revenue), which hinders the integration, adoption, and broad
exploitation of research results. An open system infrastructure combining DM, HPC and ML,
inspired by OpenStack (which provides more generic infrastructure for cloud computing), is
highly desirable in order to improve interoperability, avoid boundary crossing and related
overheads, and enable provenance and versioning over entire pipelines.
Overall Objective: DAPHNE’s overall objective is the definition of an open and extensible systems
infrastructure for integrated data analysis pipelines, including data management and processing, HPC,
and ML training and scoring. DAPHNE will increase pipeline development productivity and reduce
unnecessary overheads and low utilization. We will build a reference implementation of a domain-specific language, an intermediate representation, compilation and runtime techniques, and integrated storage and accelerator devices, with selected advancements in critical components.
Strategic Objectives: The overall aim naturally leads to the following three strategic objectives:
Objective 1 System Architecture, APIs and DSL: Improve the productivity of developing integrated data analysis pipelines via appropriate APIs and a domain-specific language, and an overall system architecture for seamless integration with existing data processing frameworks, HPC libraries, and ML systems. A major goal is an open, extensible reference implementation of the necessary compiler and runtime infrastructure to simplify the integration of current and future state-of-the-art methods.
Objective 2 Hierarchical Scheduling and Task Planning: Improve the utilization of existing
computing clusters, multiple heterogeneous hardware devices, and capabilities of modern
storage and memory technologies through improved scheduling as well as static (compile
time) task planning. In this context, we also aim to automatically leverage interesting data
characteristics such as sorting order, degree of redundancy, and matrix/tensor sparsity.
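As a purely illustrative sketch of such a compile-time decision (not project code; the representation names and the threshold are assumptions), a planner could inspect data characteristics like sparsity to pick a storage format:

```python
# Hypothetical illustration of exploiting matrix sparsity at planning time:
# pick a compressed (coordinate) representation when most entries are zero.

def to_coo(dense):
    """Store only non-zero entries as (row, col, value) triples."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row) if v != 0]

def sparsity(dense):
    total = sum(len(row) for row in dense)
    nnz = sum(1 for row in dense for v in row if v != 0)
    return 1.0 - nnz / total

def choose_format(dense, threshold=0.5):
    # A simplified compile-time decision rule: use the sparse format
    # once the fraction of zeros exceeds the threshold.
    return "sparse" if sparsity(dense) > threshold else "dense"

m = [[0, 0, 3],
     [0, 0, 0],
     [1, 0, 0]]
print(choose_format(m))   # sparse (7 of 9 entries are zero)
print(to_coo(m))          # [(0, 2, 3), (2, 0, 1)]
```

In a real system, such rules would of course be cost-based and combined with further properties like sorting order and redundancy, as described above.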
11 Amirali Boroumand et al.: Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks. ASPLOS 2018, pages 316-331.
12 Xin Luna Dong, Theodoros Rekatsinas: Data Integration and Machine Learning: A Natural Synergy. SIGMOD 2018, pages 1645-1650.
13 David J. DeWitt, Michael Stonebraker: MapReduce: A major step backwards. The Database Column, 2008.
DaphneLib is a library of important operations (linear algebra, BLAS, FFT, relational algebra, FEM, I/O) and higher-level distribution primitives. We aim to provide different language bindings (e.g., Java, Python, and C++) – in terms of language-specific API frontends integrated in the host languages – in order to reuse language features like control flow, leverage rich library ecosystems, and provide interoperability with distributed frameworks like Spark and with HPC libraries. Similar to lazy evaluation in Spark, TensorFlow, or Dask14, these operations are collected (unrolled) into DAGs of operations, and then compiled and executed on demand, for example when the output is used by operations of the host language.
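This on-demand execution model can be sketched as follows (a minimal illustration with hypothetical names, not the actual DaphneLib API): operations build DAG nodes, and nothing is computed until the host language requests a result.

```python
# Minimal sketch of lazy evaluation: operations build a DAG of nodes and are
# only computed when the result is requested. All names are illustrative.

class LazyOp:
    def __init__(self, fn, *inputs):
        self.fn = fn          # kernel to run for this node
        self.inputs = inputs  # upstream DAG nodes
        self._cached = None

    def compute(self):
        # Execute on demand: recursively evaluate inputs, then this node.
        if self._cached is None:
            args = [i.compute() for i in self.inputs]
            self._cached = self.fn(*args)
        return self._cached

def const(v):
    return LazyOp(lambda: v)

def add(a, b):
    return LazyOp(lambda x, y: x + y, a, b)

def mul(a, b):
    return LazyOp(lambda x, y: x * y, a, b)

# (2 + 3) * 4 -- nothing runs until compute() is called by the host language.
expr = mul(add(const(2), const(3)), const(4))
print(expr.compute())  # 20
```

Collecting whole DAGs before execution is what enables the holistic compilation and optimization described throughout this section.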
14 Dask: Library for dynamic task scheduling, 2016, https://dask.org/.
15 Dan Moldovan et al.: AutoGraph: Imperative-style Coding with Graph-based Performance. SysML 2019.
Example Pipeline: Figure 1.3 shows an example integrated pipeline at a conceptual level. A simulation model is used to simulate a subset of a physical object; the output data is materialized in a distributed file system; then Spark is used for data-parallel featurization and random reshuffling, followed by ML model training via data-parallel parameter servers. Finally, the model is used for cost-effective
“simulation” of the entire physical object, followed by a complex data processing pipeline to find
interesting patterns. Instead of materializing and reshuffling the simulation output, why not fuse the
data generation into the subsequent ML training to avoid unnecessary data transfer? Furthermore,
why not change the simulation parameters to yield better convergence or generalization of the ML
model, and why not fuse the final full-scale simulation with the data analysis pipeline to avoid
unnecessary materialization? Besides these questions related to function versus data shipping and
materialization, there are also integration opportunities such as to reuse classical all-reduce primitives
inside the parameter server and leverage multiple heterogeneous devices during model training.
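To make the fusion idea concrete, the following toy sketch (purely hypothetical; the simulator, the closed-form "training" step, and all names are simplified stand-ins, not project code) streams simulated samples directly into model fitting instead of materializing them first:

```python
# Conceptual sketch of fusing simulation output directly into training,
# instead of materializing it to a distributed file system first.

def simulate(n):
    # Stand-in for a physics simulation: yields samples lazily, so the
    # subsequent training step can consume them without materialization.
    for i in range(n):
        x = float(i)
        yield x, 2.0 * x + 1.0   # (feature, label)

def train(samples):
    # Closed-form least-squares fit over streamed sufficient statistics,
    # standing in for distributed ML model training.
    n = sx = sy = sxx = sxy = 0.0
    for x, y in samples:
        n += 1; sx += x; sy += y; sxx += x * x; sxy += x * y
    w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - w * sx) / n
    return w, b

# Fused: simulation samples flow straight into training, no intermediate file.
w, b = train(simulate(10))
print(w, b)  # 2.0 1.0 (the simulator's true coefficients)
```

An integrated pipeline compiler could perform this kind of fusion automatically, avoiding the data transfer and reshuffling questioned above.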
Research Questions: Generalizing from this example to integrated data analysis pipelines at large, there are, from a technological perspective, important research questions related to (1) seamless integration of existing DM, HPC, and ML systems, (2) intermediate representation and systematic lowering, (3) holistic reasoning and optimization of integrated pipelines under different objectives, (4) code generation for sparsity exploitation, (5) hierarchical scheduling and HW exploitation, (6) managed storage tiers and HW acceleration, as well as (7) systematic exploitation of data characteristics (e.g., sparsity and redundancy) and operation characteristics. In this project, we aim to advance the state of the art in these areas and, accordingly, describe them in more detail in Section 1.3.3.1 (methodology regarding system architecture) and Section 1.4 (ambition beyond the state of the art).
16 BDVA – Big Data Value Association; http://bdva.eu
17 EuroHPC; http://eurohpc.eu
18 HiPEAC – High Performance and Embedded Architecture and Compilation;
We will follow the progress and results of these projects and of related projects from previous national and international programs, and connect with them through the EuroHPC initiative as well as existing connections such as through RAWLabs (via ITU) to the SmartDataLake project and through BSC (UM) to the INFORE, ExtremeEarth, and ELASTIC projects.
19 ARTEMIS-IA – Advanced Research & Technology for EMbedded Intelligent Systems; https://artemis-ia.eu
20 ECSEL – Electronic Components and Systems for European Leadership; https://www.ecsel.eu
21 WHPC – Women in HPC; http://womeninhpc.org
22 ACM-W – ACM Committee on Women; http://women.acm.org
Seamless Integration of DM, HPC, ML Systems: Combining existing techniques from these different areas in a seamless manner is itself a major contribution. Such an integration requires language abstractions with support for both first-order and second-order functions, where the latter take functions as arguments. The challenge is to find the balance between optimized kernels for important operations and primitives, augmented with a powerful but minimalistic infrastructure for processing user-defined functions (UDFs) that originate from different programming models (e.g., Spark and OpenMP).
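The distinction between the two function classes can be sketched as follows (illustrative Python, not the project's actual kernel infrastructure): a first-order kernel takes only data, while a second-order primitive takes a UDF as an argument.

```python
# Illustrative contrast: a first-order kernel works on data values only,
# while a second-order primitive takes a user-defined function (UDF)
# as an argument and applies it to the data.

def scale(values, factor):               # first-order: arguments are data
    return [v * factor for v in values]

def map_udf(udf, values):                # second-order: an argument is a function
    return [udf(v) for v in values]

data = [1.0, 2.0, 3.0]
print(scale(data, 2.0))                  # optimized built-in kernel path
print(map_udf(lambda v: v ** 2, data))   # generic UDF path
```

The design question raised above is how much of a workload can stay on the optimized first-order path versus the generic, harder-to-optimize UDF path.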
Intermediate Representation and Systematic Lowering: Defining the new IR dialect for multi-level lowering to address different hardware and data characteristics, while providing simple means of extensibility, is crucial for performance and adoption. We will extend known concepts of optimization passes and interesting data properties to bundles of operations and data items in a hierarchy from distributed datasets to local data blocks.
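As a toy illustration of an optimization pass over such an IR (entirely hypothetical; the three-address form and op names are assumptions, not the actual dialect), consider constant folding over a small program:

```python
# Toy illustration of an IR optimization pass: constant folding over a
# list of three-address instructions (dest, op, operand_a, operand_b).

def constant_fold(program):
    consts, out = {}, []
    for dest, op, a, b in program:
        # Resolve operands that are already known constants.
        a = consts.get(a, a)
        b = consts.get(b, b)
        if op == "const":
            consts[dest] = a
        elif isinstance(a, (int, float)) and isinstance(b, (int, float)):
            # Both operands known at compile time: fold the instruction away.
            consts[dest] = a + b if op == "add" else a * b
        else:
            out.append((dest, op, a, b))
    return out, consts

prog = [("t0", "const", 2, None),
        ("t1", "const", 3, None),
        ("t2", "add", "t0", "t1"),   # foldable: 2 + 3
        ("t3", "mul", "t2", "x")]    # depends on runtime input x
remaining, folded = constant_fold(prog)
print(folded["t2"])   # 5
print(remaining)      # [('t3', 'mul', 5, 'x')]
```

Real passes would additionally propagate data properties such as sparsity or sortedness across bundles of operations, as sketched in the paragraph above.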
Managed Storage Tiers and HW Acceleration: Integrated data analysis pipelines could benefit greatly from computational storage and HW acceleration (e.g., avoiding unnecessary overheads and fully utilizing the available hardware). Again, the diverse set of
operations and data representations creates new challenges for managed storage tiers, near-
data processing, data-path optimizations and adaptive data placement. Similarly, exploiting
available HW accelerator devices requires tuned kernels (partially created via automatic
operator fusion and code generation), good performance models, dedicated handling of
multiple devices, and again decision models for data placement.
23 Chunping Qiu et al.: Feature Importance Analysis for Local Climate Zone Classification Using a Residual Convolutional Neural Network with Multi-Source Datasets. Remote Sensing 10(10), 2018.
24 Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Deep Residual Learning for Image Recognition. CVPR 2016, pages 770-778.
25 Diederik P. Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. ICLR 2015.
26 ESA: Copernicus Open Access Hub – Sentinel-1/-2, https://scihub.copernicus.eu/
27 Olivia Bluder et al.: Modeling fatigue life of power semiconductor devices with ε-N fields. Winter Simulation Conference 2014, pages 2609-2616.
28 Olivia Bluder, Kathrin Plankensteiner, Michael Glavanovics: Estimation of safe operating areas for power semiconductors via Bayesian reliability modeling. ISBA 2014.
29 Stefan Schrunner et al.: A Comparison of Supervised Approaches for Process Pattern Recognition in Analog Semiconductor Wafer Test Data. ICMLA 2018, pages 820-823.
1.4 Ambition
The DAPHNE project is ambitious in its overall and specific objectives of creating open and
extensible systems support for integrated data analysis pipelines that combine DM, HPC and ML.
What sets it apart from projects with similar goals is its multi-community effort on building actual system infrastructure, ranging from language abstractions through compilation and runtime techniques to hierarchical scheduling as well as storage and accelerator devices. Providing such an extensible
system infrastructure along with a related benchmark toolkit would already be very valuable for both
30 Yeounoh Chung et al.: Slice Finder: Automated Data Slicing for Model Validation. ICDE 2019, pages 1550-1553.
31 Avraham Shinnar et al.: M3R: Increased performance for in-memory Hadoop jobs. PVLDB 5(12) 2012, pages 1736-1747.
32 Michael J. Anderson: Bridging the Gap between HPC and Big Data frameworks. PVLDB 10(8) 2017, pages 901-912.
33 Zhipeng Zhang et al.: MLlib*: Fast Training of GLMs Using Spark MLlib. ICDE 2019, pages 1778-1789.
34 Lucas M. Ponce et al.: Extension of a Task-Based Model to Functional Programming. SBAC-PAD 2019, pages 64-71.
35 Nikolay Malitsky et al.: Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform. CoRR abs/1805.04886, 2018.
36 Kazuaki Ishizaki: Transparent GPU Exploitation on Apache Spark. Spark Summit 2018.
37 Matthias Boehm et al.: Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7) 2014, pages 553-564.
38 Matthias Boehm et al.: Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning. BTW 2019, pages 267-286.
39 Gaurav Sharma, Jos Martin: MATLAB: A Language for Parallel Computing. International Journal of Parallel Programming 37(1) 2009, pages 3-36.
40 Florian Wolf et al.: Extending database task schedulers for multi-threaded application code. SSDBM 2015, pages 1-12.
41 Chris Lattner, Vikram S. Adve: LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. CGO 2004, pages 75-88.
42 TensorFlow MLIR; https://www.tensorflow.org/mlir
43 Joseph Vinish D'silva et al.: AIDA – Abstraction for Advanced In-Database Analytics. PVLDB 11(11) 2018, pages 1400-1413.
44 Andreas Kunft, Asterios Katsifodimos, Sebastian Schelter, Sebastian Breß, Tilmann Rabl, Volker Markl: An Intermediate Representation for Optimizing Machine Learning Pipelines. PVLDB 12(11) 2019, pages 1553-1567.
45 Viktor Leis et al.: Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age. SIGMOD 2014.
46 Evan R. Sparks et al.: Automating model search for large scale machine learning. SoCC 2015.
47 Evan R. Sparks et al.: KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. ICDE 2017.
48 Zeyuan Shang et al.: Democratizing Data Science through Interactive Curation of ML Pipelines. SIGMOD 2019.
49 Yongjoo Park et al.: BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees. SIGMOD 2019.
50 Matthias Boehm et al.: SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle. CIDR 2020.
51 Andreas Kunft et al.: BlockJoin: Efficient Matrix Partitioning Through Joins. PVLDB 10(13) 2018, pages 2061-2072.
52 Botong Huang et al.: Cumulon: optimizing statistical data analysis in the cloud. SIGMOD 2013.
53 Khaled Zaouk et al.: UDAO: A Next-Generation Unified Data Analytics Optimizer. PVLDB 12(12) 2019, pages 1934-1937.
54 Zeke Wang et al.: A One-Size-Fits-All System for Any-precision Learning. PVLDB 12(7) 2019, pages 807-821.
55 T. Neumann: Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB 4(9), 2011.
56 A. Crotty et al.: An Architecture for Compiling UDF-centric Workflows. PVLDB 8(12) 2015, pages 1466-1477.
57 G. Belter et al.: Automating the Generation of Composed Linear Algebra Kernels. SC 2009, pages 1-12.
58 A. K. Sujeeth et al.: OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. ICML 2011, pages 609-616.
59 Tarek Elgamal, Shangyu Luo, Matthias Boehm, Alexandre V. Evfimievski, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen: SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. CIDR 2017.
60 Google: TensorFlow XLA (Accelerated Linear Algebra). http://tensorflow.org/performance/xla
61 N. Vasilache et al.: Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. CoRR, 2018.
62 S. Palkar et al.: A Common Runtime for High Performance Data Analysis. CIDR 2017.
63 Fredrik Kjolstad et al.: The tensor algebra compiler. PACMPL 1(OOPSLA), 2017.
64 J. Knight: Intel Nervana Graph Beta. http://www.intel.ai/ngraph-preview-release/
65 NVIDIA: TensorRT – Programmable Inference Accelerator. http://developer.nvidia.com/tensorrt
66 Matthias Boehm et al.: On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML. PVLDB 11(12) 2018, pages 1755-1768.
67 Jérôme Petazzoni: Cgroups, namespaces and beyond: What are containers made from? DockerCon EU 2015.
68 Don Lipari: The SLURM Scheduler Design. User Group Meeting, 2012.
69 Benjamin Hindman et al.: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011.
70 Vinod Kumar Vavilapalli et al.: Apache Hadoop YARN: yet another resource negotiator. SoCC 2013.
71 Carlo Curino et al.: Hydra: a federated resource manager for data-center scale analytics. NSDI 2019.
72 Brendan Burns et al.: Borg, Omega, and Kubernetes. ACM Queue 14(1), 2016.
73 Ali Mohammed et al.: Two-level Dynamic Load Balancing for High Performance Scientific Applications. In Proceedings of the SIAM Parallel Processing (SIAM PP 2020), 2020.
74 Ahmed Eleliemy et al.: Exploring the Relation Between Two Levels of Scheduling Using a Novel Simulation Approach. ISPDC 2017.
75 Florina M. Ciorba et al.: A Combined Dual-stage Framework for Robust Scheduling of Scientific Applications in Heterogeneous Environments with Uncertain Availability. IPDPS Workshops 2012.
76 Ce Zhang, Christopher Ré: DimmWitted: A Study of Main-Memory Statistical Analytics. PVLDB 7(12) 2014, pages 1283-1294.
77 Iraklis Psaroudakis et al.: Scaling Up Concurrent Main-Memory Column-Store Scans: Towards Adaptive NUMA-aware Data and Task Placement. PVLDB 8(12) 2015, pages 1442-1453.
78 P. Tözün, H. Kotthaus: Scheduling Data-Intensive Tasks on Heterogeneous Many Cores. IEEE Data Engineering Bulletin 42(1), 2019, pages 61-72.
79 Azalia Mirhoseini et al.: Device Placement Optimization with Reinforcement Learning. ICML 2017, pages 2430-2439.
80 Celestine Dünner et al.: SnapML: A Hierarchical Framework for Machine Learning. NeurIPS 2018.
81 Markus Steinberger et al.: Whippletree: task-based scheduling of dynamic workloads on the GPU. ACM Trans. Graph. 33(6), 2014, pages 1-11.
82 Michael Gowanlock et al.: Accelerating the Unacceleratable: Hybrid CPU/GPU Algorithms for Memory-Bound Database Primitives. DaMoN 2019.
83 Ahmed Eleliemy, Florina M. Ciorba: Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach. IPDPS Workshops 2019.
84 Vivek Seshadri et al.: RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization. MICRO 2013.
85 Vivek Seshadri et al.: Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology. MICRO 2017.
86 Jaeyoung Do, Yang-Suk Kee, Jignesh M. Patel, Chanik Park, Kwanghyun Park, David J. DeWitt: Query processing on smart SSDs: opportunities and challenges. SIGMOD 2013.
87 Gustavo Alonso, Carsten Binnig, Ippokratis Pandis, Kenneth Salem, Jan Skrzypczak, Ryan Stutsman, Lasse Thostrup, Tianzheng Wang, Zeke Wang, Tobias Ziegler: DPI: The Data Processing Interface for Modern Networks. CIDR 2019.
88 Alberto Lerner, Rana Hussein, Philippe Cudré-Mauroux: The Case for Network Accelerated Query Processing. CIDR 2019.
89 Justin Meza et al.: Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management. Computer Architecture Letters 11(2), 2012, pages 61-64.
90 Niv Dayan et al.: GeckoFTL: Scalable Flash Translation Techniques for Very Large Flash Devices. SIGMOD 2016.
91 Till Westmann et al.: The Implementation and Performance of Compressed Databases. SIGMOD Record 29(3) 2000, pages 55-67.
92 Vijayshankar Raman, Garret Swart: How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations. VLDB 2006, pages 858-869.
93 Michael Stonebraker et al.: C-Store: A Column-oriented DBMS. VLDB 2005, pages 553-564.
94 Patrick Damme et al.: From a Comprehensive Experimental Survey to a Cost-based Selection Strategy for Lightweight Integer Compression Algorithms. ACM Trans. Database Syst. 44(3) 2019, pages 1-46.
95 Evangelia A. Sitaridi et al.: Massively-Parallel Lossless Data Decompression. ICPP 2016.
96 Jeremy Fowers et al.: A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs. FCCM 2015.
97 Vasileios Karakasis et al.: An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication. IEEE Trans. Parallel Distrib. Syst. 24(10) 2013, pages 1930-1940.
98 Ahmed Elgohary et al.: Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12) 2016, pages 960-971.
While code size is a rather imprecise measure, K1 represents a reduction of code size (e.g., lines of
code) for specifying the integrated pipelines by at least an order of magnitude for all applications.
The smaller the code size, the easier it will be to develop new pipelines, as well as understand and
extend existing pipelines, which also reduces the likelihood of bugs. Additionally, K2 states that we
aim to develop at least eight complex integrated data analysis pipelines using the new language
abstractions. This includes one DLR, two IFAT, one KAI, and one AVL use case, two additional
pipelines as part of DaphneBench, as well as one prepackaged pipeline for result analysis of the
DaphneBench toolkit. Finally, K3 indicates that all manual data export/import steps and data transfers
of intermediate results of a pipeline should be reduced to zero, i.e., should be eliminated. This is
crucial for a seamless composition and full automation of these pipelines.
Impact 2: “Demonstrated, significant increase of speed of data throughput and access, as measured
against relevant, industry-validated benchmarks” (Horizon 2020 Work Programme 2018-2020, Part
5.i - Page 49)
Objective 1 (System Architecture, APIs, and DSL) and Objective 2 (Hierarchical Scheduling) further address the challenge of unnecessary overhead and low utilization caused by expensive data exchange between disconnected programming models and libraries, by statically provisioned data processing, HPC, and ML clusters, by data exchange between heterogeneous computing devices, and by missed opportunities for multi-device operations and analysis pipelines.
Furthermore, Objective 3 (Use Cases and Benchmarking) aims to develop an initial proposal for a new benchmark of integrated data analysis pipelines that, if successful, might itself become an
K6 indicates that 6 out of the at least 8 developed pipelines – including a pipeline at Petabyte scale –
combine components of data-parallel computation, high-performance computing, and distributed ML
training and scoring, which makes a case for the distributed execution of integrated data analysis
pipelines and their good scalability. Finally, K7 states that we aim to achieve a 20% error reduction
on average across the different use case studies, which makes a case for productivity, while also
advancing the state-of-the-art in important applications such as earth observation, semiconductor
manufacturing, material degradation modeling, and automotive vehicle development.
Impact on Industry and Society: Improving prediction accuracy of these important use cases will
also have direct positive impact on industry and society. First, establishing data analysis pipelines for
semiconductor production data (and more broadly for device quality and reliability of semiconductor
devices) is a crucial step towards automation and Industry 4.0. High levels of automation in both
development and production will be needed to keep European companies competitive. In addition,
we will further strengthen Europe’s R&D know-how by using the new opportunities given by HPC,
ML, and large-scale data processing. Integrated data analysis pipelines over distributed datasets from
K8 states that we will implement and make available at least 2 state-of-the-art baselines per work package (WP3-WP7); these baselines are required anyway for a proper experimental evaluation of the individual results of these system-oriented work packages.
Users and Applications: Spurring interest of potential users is crucial for impact via adoption. Our
dissemination efforts target two classes of potential users: system or data engineers that are
designing and operating robust infrastructures, and application users that are mostly concerned with
productivity and accurate predictions. Our dissemination and exploitation strategy comprises five parts. First and foremost, it is about creating value through improved productivity, better resource utilization and less overhead, as well as robust and reliable infrastructure. Second, open sourcing our
100 FAIR Principles, https://www.go-fair.org/fair-principles/
101 Common Format and MIME Type for Comma-Separated Values (CSV) Files, https://tools.ietf.org/html/rfc4180
102 arXiv.org preprint archive, Cornell University, https://arxiv.org/
103 Zenodo (http://zenodo.org) is a general-purpose open-access repository developed under the European OpenAIRE program and operated by CERN.
Coordinator (KNOW): The DAPHNE project will be coordinated and managed by KNOW,
explicitly being responsible for all aspects of the interface between the project and the European
Commission (EC) and establishing and maintaining a complete view over the whole technical
Figure 3.2b: KNOW participation in different EC funding programs and roles in EC research projects104
Project Manager: The project manager will be responsible for the day-to-day coordination of the
project. We plan an average of 3h per working day to carry out the project management tasks for the
DAPHNE consortium. He or she will ensure that the project management procedures as defined in
D1.1. are carried out throughout the project’s lifetime and that supporting tools and templates will be
made available. The project manager will implement a number of performance criteria that reflect the quality of the work with respect to effort spent, costs incurred, and delays encountered. These indicators will be updated regularly to maintain an overview of the project's progress and to identify divergence from the initial plan and performance indicators in a timely manner.
Advisory Board: The advisory board is the senior supervisory entity and shall be responsible for the
proper execution and implementation of the decisions made by the general assembly. This body is composed of the 8 work package leaders (KNOW, ETH, HPI, ICCS, ITU, KAI, TUD, UNIBAS) and
is chaired by KNOW as the coordinator. The advisory board coordinates the cooperation and
104 EC: https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/how-to-participate/org-details/997997111
General Assembly Meeting: The general assembly will meet in person once a year,
organized and conducted by the project manager, to review the overall status of the project
(technical and administrative). It will be a forum for major decisions on the implementation
of the project plan. These meetings will discuss all issues of the project (including
unexpected difficulties, new ideas, etc.), financial and administrative aspects, strategic
issues, particularly in relation to sustainability of project outcomes, and risk management.
Advisory Board Meetings: The advisory board will also meet in person once a year, in combination with the general assembly meeting, to discuss results, work progress, future plans, and technical issues. In addition, the advisory board meets via conference call once a month.
WP/Task Meetings: Such meetings will mostly take place via conference calls and
whenever a need occurs. This could include achieving important milestones, submitting
major outcomes, or managing a specific issue related to the WP or individual task activities.
Weekly DAPHNE Student Conference Call: The coordinator will schedule a fixed weekly online meeting, chaired by Matthias Boehm, with all project members actively working on a task at the time. The aim of this call is to get regular updates on progress and to make sure technical questions are clarified.
EC Review Meetings: Together with the EC project officer the coordinator will schedule
yearly review meetings.
We expect the general assembly meeting to be held back-to-back with the advisory board
meeting to save on travel expenses.
Brief description
Know-Center (KNOW) is Austria's leading research center for data-driven business and big data
analytics. As a connecting link between science and industry, KNOW conducts application-
oriented research in cooperation with other academic institutions and with companies. Within
KNOW, we maintain a wider perspective and approach data-driven business as a cognitive
computing challenge. Our scientific strategy is to integrate (big) data analytics with human-
centered computing methods in order to create cognitive computing systems that will enable
humans to utilize massive amounts of data. The Center has an 18-year track record and
over 100 excellent researchers. KNOW founded the European Network of National Big Data
Centers of Excellence and was awarded the iSpace label in Silver as one of the leading big data
research centers in Europe by the Big Data Value Association (BDVA). From 2015 to 2018, we
published over 50 journal articles and over 160 conference articles in international peer-
reviewed venues. Since 2001, KNOW has bridged the gap between science and industry
in more than 700 applied scientific projects.
Data Management (DM) is a newly established research Area at KNOW, which deals with big
data storage and real-time streaming technologies. The goal of the DM Area is to develop scalable
and compliant data infrastructures and architectures based on innovative database concepts and
real-time streaming technologies. Its mission is to simplify data science by providing high-level,
data science-centric abstractions and by building systems and tools to execute these tasks in
an efficient and scalable manner.
Knowledge Discovery (KD) represents the data analytics perspective. The focus of this Area is
to research and develop algorithms and models for domain agnostic, data-driven analytics using
a variety of data types, ranging from textual to time series data. The goal is to extract a maximum
of value out of data with an appropriate (minimal) amount of human input. The Area’s
competences are ML, Information Retrieval and Natural Language Processing (NLP).
Core competences relevant to DAPHNE
The core competences of KNOW relevant to this project are three-fold. First, KNOW is an
experienced participant as well as a leader of EU-funded projects, which is furthermore reflected
by the excellent network of partners in academia and industry. Second, the DM area of KNOW,
headed by Matthias Böhm, has a track record of conducting high-quality research in the field of
data management and systems research, materializing in award-winning publications (see below)
and the implementation of systems for large-scale machine learning (ML) applications. Third,
Federal Ministry for Transport, Innovation and Technology of the Republic of Austria
4.1.2 AVL
Brief description
AVL List GmbH is the world's largest privately-owned company for development, simulation
and testing technology of powertrains (hybrid, combustion engines, transmission, electric drive,
batteries and software) for passenger cars, trucks and large engines. AVL has about 3850
employees in Graz (Austria), and a global network of 45 representations and affiliates resulting
in more than 9500 employees worldwide. AVL's Powertrain Engineering division focuses on
the research, design and development of various powertrains with a view to low fuel
consumption, low emissions, low noise and improved drivability. The Advanced Simulation
Technologies division develops and markets the simulation methods which are necessary for the
powertrain development work. The Instrumentation and Test Systems division is an established
manufacturer and provider of instruments and systems for powertrain and vehicle testing
including combustion diagnostic sensors, optical systems as well as complete engine, powertrain
and vehicle test beds. AVL supplies advanced development and testing solutions for
4.1.3 DLR
Resources
DLR operates the German satellite data archive, a fully operational data archive for national and
international earth observations for the sake of long-term data preservation and data
dissemination. Also available are HPC and HPDA infrastructure at DLR Jena.
4.1.4 ETH
Brief description
ETH Zurich is a science, technology, engineering and mathematics university in the city of
Zürich, Switzerland. It is an integral part of the Swiss Federal Institutes of Technology Domain
(ETH Domain) and is directly subordinate to Switzerland's Federal Department of Economic
Affairs, Education and Research. The school was founded by the Swiss Federal Government in
1854 with the stated mission to educate engineers and scientists, serve as a national center of
excellence in science and technology and provide a hub for interaction between the scientific
community and the industry.
Core competences relevant to DAPHNE
4.1.5 HPI
Brief description
The Hasso Plattner Institute (HPI) is unique in the German university landscape. Academically
structured as the independent Faculty of Digital Engineering at the University of Potsdam, HPI
unites excellent research and teaching with the advantages of a privately-financed institute and
a tuition-free study program. HPI is an elite, world-class educational facility. The Institute
already offers an optimal study and work environment and cooperates very closely with the
business community. HPI also maintains strong international contacts, for example with
Stanford University in California (USA).
Core competences relevant to DAPHNE
Prof. Dr. Tilmann Rabl leads the research group “Data Engineering Systems” at HPI and has
extensive expertise in benchmarking data management systems in various domains. He has been
part of several benchmarking standardization efforts within the Transaction Processing
Performance Council. In addition, he performs research on data processing on modern hardware
and end-to-end ML. With this background, he can contribute significantly to the DAPHNE
project. Within DAPHNE, Prof. Dr. Rabl will act as PI for HPI, mainly in the WP
“Benchmarking”.
Role and responsibility within DAPHNE
Within DAPHNE, Prof. Rabl acts as PI for HPI in the WP “Benchmarking.”
Curriculum vitae of senior staff
Prof. Dr. Tilmann Rabl (male) (h-index 18)
holds the Chair for Data Engineering Systems at the Hasso Plattner Institute and is a Professor at
the Digital Engineering Faculty of the University of Potsdam (Germany). He is also co-founder
and scientific director of the startup bankmark. Prof. Rabl received his PhD from the University
of Passau (Germany) in 2011. He spent 4 years at the University of Toronto (Canada) as a
postdoc in the Middleware Systems Research Group (MSRG). From 2015 to 2019, he was a
4.1.6 ICCS
Brief description
The Institute of Communications and Computer Systems (ICCS) is a non-profit Academic
Research Body established in 1989 by the Ministry of Education of Greece in order to carry
out research and development activities covering diverse aspects of telecommunications and
computer systems. ICCS is associated with the School of Electrical and Computer Engineering
(SECE) of the National Technical University of Athens (NTUA) (Greece). The personnel of
ICCS consists of a number of research scientists and more than 500 associate scientists
(including PhD students). The research carried out in ICCS is substantially supported by the
School of Electrical and Computer Engineering, NTUA. ICCS is very active with regard to
European co-funded research activities and has been the project coordinator of many EU projects
in various programs. ICCS will participate in this project through the Computing Systems
Laboratory (CSLab) of the SECE of NTUA.
CSLab is one of the largest research laboratories of the Computer Science Department of the
School of Electrical and Computer Engineering of the NTUA. CSLab possesses strong expertise
in computer architecture and large scale parallel (HPC) and distributed systems (Big Data, Cloud
and P2P infrastructures). Its members have performed research and development both at the
level of algorithmic design and application optimization and in the assembly and operation of
such systems, drawing on more than three decades of experience in the implementation,
optimization and operation of systems at all scales. With its experienced staff in administration,
4.1.7 IFAT
Brief description
Infineon designs, develops, manufactures and markets a broad range of semiconductors and
system solutions. The focus of its activities is on automotive electronics, industrial electronics,
communication and information technologies, and hardware-based security. The product range
comprises standard components, customer-specific solutions for devices and systems, as well as
specific components for digital, analogue and mixed-signal applications. Over 60% of Infineon’s
revenue is generated via power semiconductors, almost 20% via embedded control products
(microcontrollers for automotive, industrial as well as security applications) and the remainder
via radio-frequency components and sensors. Infineon generates 30% of its revenue in Europe,
57% in Asia and 13% in the Americas.
Infineon organizes its operations in four segments: Automotive, Industrial Power Control, Power
Management & Multimarket and Digital Security Solutions. Infineon Technologies Austria AG
(IFAT) is a 100% subsidiary company of the worldwide operating semiconductor producer
Infineon Technologies AG and was founded in April 1999. IFAT develops and manufactures
semiconductor and systems solutions for all business areas. In Austria, 1,977 of its 4,609
employees (as of 12/2019) work in the field of research and development. Hence, IFAT has
the largest R&D unit for microelectronics in Austria. This requires a high level of comprehensive
technology know-how in modelling, design and fabrication, as well as in the field of security.
In 2019, IFAT's expenses in the field of research and development were approximately
EUR 525 million, corresponding to 17% of its business volume of EUR 3,113 million. Infineon
Technologies Austria AG is a worldwide semiconductor producer with a long R&D history in
power semiconductors.
Core competences relevant to DAPHNE
For this project it is of great importance that IFAT provides its experience in the context of data
science, IT architectures, algorithms, data fusion, big data analysis and WIP flow management.
Role and responsibility within DAPHNE
IFAT will act as use case owner in the DAPHNE project. Furthermore, IFAT has experience in
working and managing European funded research projects.
Curriculum vitae of senior staff
4.1.8 INTP
4.1.9 ITU
Brief description
Founded in 1999, the IT University of Copenhagen (ITU) is Denmark's leading university with
a focus on IT research and education. ITU performs state-of-the-art teaching and research in
computer science, business IT and digital design. The ambition of ITU is to create and share
knowledge that is profound and that leads to pioneering information technology and services for
the benefit of humanity. The university works closely with the business community, the public
sector and international researchers and is characterized by a strong entrepreneurial spirit among
both students and researchers. ITU in numbers: 2,300 students and 330 employees (FTE). ITU
is part of the consortium of Danish universities, which together with the Danish eInfrastructure
Cooperation (DeIC) constitutes the Danish contribution in the Euro-HPC Joint Undertaking.
Core competences relevant to DAPHNE
The data systems group at ITU focuses on improving the scalability and performance of data
systems, i.e., the low-level system software that supports the basic functionality of data-intensive
applications (e.g., ML, databases) on modern hardware infrastructures. In recent years, the data
systems group has defined the open-channel SSD subsystem in the Linux kernel, developed the
open-source framework OX for programming storage controllers and proposed innovative
scheduling strategies on hardware accelerated architectures.
Role and responsibility within DAPHNE
Within DAPHNE, ITU will contribute to the micro-view layer. ITU will lead the
computational storage WP and will contribute to the scheduling and resource sharing WP, as
well as the hardware accelerator integration WP. This is a great match for the data systems
group's expertise and its interest in computational storage.
Curriculum vitae of senior staff
Prof. Pınar Tözün (female) (h-index 11)
is an Associate Professor at the IT University of Copenhagen. Before that, she was a research
staff member at IBM Almaden Research Center. Prior to joining IBM, she received her PhD
from École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. Her thesis was awarded
4.1.10 KAI
Full name: KAI Kompetenzzentrum Automobil- und Industrieelektronik GmbH
Short name: KAI
Participant number: 10
Brief description
KAI Kompetenzzentrum Automobil- und Industrieelektronik GmbH is a well-established
industrial research center with a large national and international network of partners and proven
experience in the coordination of interdisciplinary research projects. In addition to core
competences in the area of power electronics, reliability test concept development, advanced
semiconductor materials research, statistical lifetime and degradation modeling, data science
and multi-physics FEM simulation, KAI maintains a well-equipped electronics laboratory as
well as state-of-the-art simulation computing resources. KAI GmbH was founded under the
Austrian competence center program K-ind in 2006, with a focus on robust metallization and
interconnect technologies and active temperature cycle reliability of power semiconductors.
Since then, KAI has supported its project partner Infineon with device-level reliability stress
testing, analysis and modeling methodology for numerous new technology developments. Together
with its academic partners, KAI has published the results of this work at international conferences
and in journals. The topics covered in these publications include stress test concepts and
equipment developed at KAI, data science methods for the analysis and extraction of
information from large data sets, and new results in the areas of electrical and thermo-mechanical
FEM simulations and microphysical materials research.
Advanced data analytics, computer vision and ML: In the past years KAI has established
expertise in various areas of advanced data analytics, including statistics, ML and computer
vision. With the know-how gained KAI currently supports data-driven decision making in the
semiconductor industry in the production area by developing algorithms to identify deviations
and to support root cause analysis. Furthermore, KAI develops lifetime and degradation models
for the reliability assessment, using statistical methods based on data from electrical
measurements and quantitative information extracted from images of degraded semiconductor
devices.
High Performance Computing: Since 2008, KAI has maintained HPC equipment (see resources),
which is regularly updated in order to support research in advanced data analytics, computer
vision, ML and diverse fields of simulation (FEM, coupled circuit simulation, etc.). To ensure
4.1.11 TUD
4.1.12 UM
4.1.13 UNIBAS
Brief description
4.2 Third parties involved in the project (including use of third-party resources)
No third parties involved.
5.1 Ethics
We have NOT entered any ethics issues in the ethical issue table in the administrative forms.
DAPHNE will use data from different sources for the different applications. In this regard,
Task Leaders will be responsible for analyzing any potential issues related to the datasets to be
used. Although none are expected, any ethical issues, along with the corresponding analysis and
decisions, will be reported to the Steering Board and processed according to EC regulations.
5.2 Security