03 - Data Engineering

Data Engineering
Biopharma companies have made huge efficiency gains over the last few decades, notably in data
capture (with the move to eCRFs), business process improvements, and the data standardisation
efforts led by CDISC.
However, companies in the sector increasingly compete on their analytical capabilities, which requires a
centralised, combined and, as far as possible, automated data environment to support these deeper
insights. Clearly, data in R&D needs a major transformation; it is too siloed, fragmented and manually
intensive to be utilised effectively.
Exploring the term data engineering opens the door to the vast opportunities and roles available today,
with the overarching goal of optimising the use of data in day-to-day business operations. A simple
internet search for “what is data engineering?” returns many posts, each giving a slightly different
definition.
However, what is clear is that data engineering encompasses the many considerations for optimally
curating, transforming, securing and disseminating data suitable for analysis. As technology and tools
have become more advanced, building such a platform and infrastructure requires engineers and
architects of both general and specific expertise. The data engineer combines knowledge in areas such
as software development, infrastructure, data architecture, data warehousing, cloud technology and
data cleaning in order to design, build and test solutions that define the pipelines of data throughout the
enterprise, making the data accessible to the organisation.
Optimised data engineering appropriately balances the efficiency of an automated process against the
cost of development and maintenance of that process, ensuring repetitive processes that require
humans to write code, press keys, cut and paste and update documents are minimised or eliminated.
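The balance described above can be illustrated with a simple break-even calculation. The following is a minimal sketch only: the function name, its parameters and the hour figures are hypothetical and chosen for illustration, not taken from any PHUSE material.

```python
import math


def break_even_runs(build_hours, upkeep_hours_per_run, manual_hours_per_run):
    """Smallest number of runs at which automation costs no more than manual work.

    All inputs are illustrative estimates in hours. Returns None when the
    automated process never pays back, i.e. when its upkeep per run is not
    lower than the manual effort per run.
    """
    saving_per_run = manual_hours_per_run - upkeep_hours_per_run
    if saving_per_run <= 0:
        return None
    # The one-off build cost is recovered once total savings reach build_hours.
    return math.ceil(build_hours / saving_per_run)


# Hypothetical figures: 40 hours to build, 0.5 hours of upkeep per run,
# 2.5 hours of manual effort per run.
runs_needed = break_even_runs(40, 0.5, 2.5)
```

Under these assumed figures the automated process pays for itself after 20 runs; a task repeated less often than that over its lifetime may be better left manual.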
The Data Engineering Cluster will explore how established data engineering techniques successfully
deployed in other industries could be utilised in our industry. From traditional data warehousing to the
arrival of the big data lake, with data marketplaces, ePRO and IoT, the challenge is on to identify
analytical value from all these disparate data sources.
The aims of this Cluster are two-fold. The first is to gather the myriad resources available on traditional
methods of data engineering and share examples and use cases of practical applications, providing a
breadth of knowledge that could immediately benefit our existing clinical data estate. The second is to
share examples and use cases of how to utilise most effectively the big data that is increasingly being
introduced into our sector, so that we can learn about the more thought-leading subjects in this area and
help disseminate ways to automate and curate data for later use in data science.
1 - Data Engineering Maturity Matrix
Roles
With new disciplines come new roles – and data engineering is no different. Traditionally, data
management and statistical programming are the two main departments which capture, cleanse and
analyse clinical data. Going forwards, data engineers and data scientists are being introduced to
complement the more traditional roles.
The data engineer takes a holistic cross-study role, applying data standards to a centralised automated
data pipeline which does the heavy lifting for the data items common to all studies. This approach
frees data managers and statistical programmers to concentrate on the non-conformed data, that is, data
which is not common to all studies. A second benefit of a centralised data pipeline is the option to
pool multiple studies into a single central data repository/data lake, which satisfies a major requirement
for the data scientist.
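The cross-study pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular system: the study identifiers, source column names, COLUMN_MAP and both functions are hypothetical, and the target names (STUDYID, USUBJID, SEX, AGE) simply borrow SDTM-style variable names as an example of a common standard.

```python
# Hypothetical mapping from study-specific column names to a common standard.
COLUMN_MAP = {
    "STUDY_A": {"subj": "USUBJID", "sex_cd": "SEX", "age_yrs": "AGE"},
    "STUDY_B": {"subject_id": "USUBJID", "gender": "SEX", "age": "AGE"},
}


def standardise(study_id, records):
    """Split one study's records into conformed rows and non-conformed leftovers."""
    mapping = COLUMN_MAP[study_id]
    conformed, non_conformed = [], []
    for record in records:
        row = {"STUDYID": study_id}
        extras = {}
        for key, value in record.items():
            if key in mapping:
                row[mapping[key]] = value  # common item: rename to the standard
            else:
                extras[key] = value  # study-specific item: leave for manual review
        conformed.append(row)
        if extras:
            non_conformed.append({"STUDYID": study_id, **extras})
    return conformed, non_conformed


def pool(studies):
    """Run every study through the same pipeline and pool the conformed rows."""
    pooled, review = [], []
    for study_id, records in studies.items():
        conformed, non_conformed = standardise(study_id, records)
        pooled.extend(conformed)
        review.extend(non_conformed)
    return pooled, review


# Illustrative input: two studies with differently named columns.
studies = {
    "STUDY_A": [{"subj": "A-001", "sex_cd": "F", "age_yrs": 54, "site_note": "late visit"}],
    "STUDY_B": [{"subject_id": "B-007", "gender": "M", "age": 61}],
}
pooled, review = pool(studies)
```

Here pooled holds the rows that would feed a central repository, while review collects the study-specific items that remain with data managers and statistical programmers.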
To achieve quality data capture, near-real-time accessibility and meaningful analytics, the data engineer
cannot function without the data scientist, and effective teamwork optimises the value of each role. As
such, an analytics team would be composed of distinct roles/capabilities:
• data engineers (in areas such as database architecture, database development, machine
learning architecture, ETL scripting, etc.)
• data scientists
• business analysts
Data engineering brings together the broad expertise of these roles to ensure the data are curated and
accessible to the data scientist; in our environment today, this process is becoming more and more
complex. Therefore, expertise in curating big data and data of varying formats (structured and
unstructured) is a critical core competency to optimise the potential impact of these digital assets (i.e.
the data).
Technologies for Deployment and Automation

There are many tools available which enable and empower the data scientist to operate effectively and
optimally. These vary from open-source to proprietary software and cover all areas of data engineering.
PHUSE is a vendor-agnostic organisation and as such will not recommend any specific piece of
software. However, the benefits of certain functionality will be highlighted in the
examples and use cases we share, leaving the audience to decide which tools provide that functionality
in the most appropriate way for their need.
A sub-cluster will focus specifically on technologies.
Data Engineering Use Cases

There are a growing number of use cases within and outside of the biopharma industry where data
engineering realises the benefits described above. This sub-cluster will collate examples from around
the globe and synthesise them for the audience to be able to learn new approaches which they have not
previously adopted. We will share examples and use cases which help to understand how these new
approaches can be applied to our clinical trial data domain.
2 - Data Engineering for Biomarker Pipeline
3 - Knowledge Graphs
Data Engineering Techniques

Since the formation of CDISC, significant progress has been made in standardising the format of data
collected, analysed and submitted to the health authorities. This has allowed organisations to explore
methods to build automation in their data flows and reporting tools.
In addition, recent advances in data engineering techniques, such as machine learning and big data
processing, have allowed companies to curate more easily data that are in an unstructured, as well as a
structured, format. These concepts have been applied in the healthcare industry, as shared at the
PHUSE Single Day Event in Ridgefield, CT. At the event, Dr Wade Schulz, of the Yale School of
Medicine, shared the various technologies and tools used to build a data lake that integrates with their
clinical information systems to provide historic and real-time data for research studies and clinical
decision support. Tools in his “Data Science Toolkit” include, but are not limited to, Kafka, Storm,
Apache Spark, Hadoop, Apache HBase and Python to ingest, process, store and analyse clinical and
healthcare data. He also noted that a significant amount of future healthcare data is expected to be in
an unstructured format, posing a greater challenge for ingesting and processing data.
Several techniques and methodologies can be utilised to help organisations mature in their adoption of
data engineering. A specific sub-cluster has been designated to research these techniques and their
appropriateness for use within the clinical data landscape.
4 - Data Engineering Techniques

Reference Material
Recommended PHUSE Educational Material
• Data Engineering Project (Educating for the Future Working Group). PHUSE US Connect 2019.
Guy Garrett (Achieve Intelligence), Bev Hayes (Janssen Research & Development).
• Schulz, Wade. (2018, 26 July). Baikal – Implementing and Deploying Clinical Models with a Real-
Time Data Lake. PHUSE SDE. Focus on the Patients – Bridging Data to Solutions. Ridgefield,
Boehringer Ingelheim.
Recommended Readings
• Aghabozorgi, S., & Lin, P. (2016, 6 June). Data Scientist vs Data Engineer, What’s the Difference?
Cognitive Class Blog. https://cognitiveclass.ai/blog/data-scientist-vs-data-engineer/
• Paruchuri, V. (2017, 25 January). What is a Data Engineer? Data Engineering Series. Dataquest.
www.dataquest.io/blog/what-is-a-data-engineer/
• Eunice, T. (2015, 24 September). Do Data Scientists Need Data Management? IBM Big Data &
Analytics Hub, IBM. www.ibmbigdatahub.com/blog/do-data-scientists-need-data-management
• O’Neal, K., & Roe, C. (2015). Business Intelligence versus Data Science: A DATAVERSITY 2015
Report. DATAVERSITY. https://content.dataversity.net/DV2015BIvDSRP_DownloadWP.html
Recommended Videos
• DataEDGE 2016. (2016, 25 May). Data Engineering and Data Science: Bridging the Gap [Video].
YouTube. https://youtu.be/-K9SjrWpeys
Recommended Websites
