DEco2
Abstract
The Data Science Platform Applied to Health (PCDaS) is a research and technological development project that aims to develop
and apply novel data analysis methods to public health data. It fills a technological gap between the variety of data sources
available in legacy and unstandardized formats and the current needs and possibilities of Data Science applications to consume
and explore data for the benefit of the Brazilian Health System. PCDaS provides democratic access to health-related datasets
and information by requiring fewer technological abilities from its users while maintaining a continuously updated stack of
technologies. As a data ecosystem, our primary goal is to provide secure and remote access to health data, technological tools,
and a robust infrastructure to process and analyze large amounts of data that generally demand
computational power often unavailable to researchers. The infrastructure consists of multi-region on-premises and cloud
servers prepared to handle heavy Big Data analysis by multiple simultaneous users from anywhere. Providing
secure and remote access to health databases, whether in their original form or processed, is a daily breakthrough for a
public health researcher. Knowing that there is a place where they can access integrated data in a standard format makes the
research process much more manageable. To ensure quality, our data engineering and governance teams process these data
sources following a gold standard based on cross-tables provided by the Health Ministry (the TabNET system) and decoding
the original variables into meaningful names provided by the sources. The metadata, attributes, and ETL (Extract,
Transform, Load) process of each database are comprehensively documented. Every step is described in detail on the
PCDaS website, ensuring the comprehension and reproducibility of the process. These
features ensure that PCDaS users can effectively leverage the platform’s resources and capabilities, enabling them to conduct
research, perform data analysis, and collaborate within a secure and supportive environment to contribute to the Brazilian
Health System.
Keywords
Public Health, PaaS, Data Science, Data Ecosystem
Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW’23), Workshop on Data Ecosystems (DEco’23), August 28 - September 1, 2023, Vancouver, Canada
* Corresponding author.
Email: marcel.pedroso@icict.fiocruz.br (M. Pedroso); rebeccapsalles@acm.org (R. Salles); jefferson.lima@icict.fiocruz.br (J. Lima)
ORCID: 0000-0002-7323-2107 (M. Pedroso); 0000-0002-1825-0097 (R. Salles); 0000-0002-7560-1068 (J. Lima)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Marcel Pedroso et al. CEUR Workshop Proceedings 1–16

1. Introduction

The Data Science Platform applied to Health (“Plataforma de Ciência de Dados aplicada à Saúde” – PCDaS) is a research and technological development project of the Laboratory of Health Information (“Laboratório de Informação em Saúde” - LIS) from the Institute of Scientific and Technological Communication and Information in Health (“Instituto de Comunicação e Informação Científica e Tecnológica em Saúde” - ICICT), of the Oswaldo Cruz Foundation (“Fundação Oswaldo Cruz” - Fiocruz), in partnership with the Data Extreme Lab (DEXL) from the National Laboratory for Scientific Computing (“Laboratório Nacional de Computação Científica” - LNCC). Fiocruz and LNCC are two of the main research institutions in Brazil. Fiocruz is the most prominent institution of science and technology in health in Latin America, while LNCC is the first Brazilian institution in the field of Scientific Computing, having the most powerful IT resources in Latin America.

PCDaS aims to develop and apply novel methods of data analysis on public health data, filling a technological gap between the variety of data sources of interest to Public Health, available in legacy and unstandardized formats, and the current needs and possibilities of Data Science applications to consume and explore data for the benefit of the Brazilian Health System.

Data science emerged as an interdisciplinary field, covering the study of structured and non-structured data of different volume, complexity, variety, and other properties, also called “Big Data” [1]. Public Health may benefit from Data Science techniques and paradigms, with conditions to advance statistical and epidemiological practical applications to more complex and heterodox data sources.

Like other research fields, Public Health has historically been guided by a theory-driven or hypothesis-driven approach to science, with a priori assumptions. The new challenges imposed by Big Data go beyond scaling up computer servers to use the same methods. New approaches are needed to overhaul the methods and make better use of the full potential of available data in its diversity and complexity, leading to a data-driven approach to science [2, 3]. In the context of PCDaS, Data Science is interpreted as a field of study that can aid the discovery of knowledge and useful information from big or complex databases and support data-guided decision-making [4].

The Brazilian Health System, known as the Unified Health System (“Sistema Único de Saúde” – SUS), is publicly funded and offers universal free health care coverage to the Brazilian population. It was established in 1988, along with the re-democratization process of the country. A component of the SUS is the Department of Informatics (DataSUS), which is responsible for gathering, organizing, and disseminating Brazilian health data.

The health data at DataSUS is structured by several Health Information Systems (HIS) dedicated to covering different aspects of a person’s life cycle, resulting in specific HIS like the SINASC for birth data, SIH for hospital admissions data, and SIM for mortality data. There are dozens of other HIS maintained by DataSUS, covering aspects such as vaccinations, ambulatory services, records of health professionals, health services, health equipment, suspected cases of transmissible diseases, violence, and other themes. The anonymized raw data from those systems are publicly available through the DataSUS website.

In common, those HIS were created at different moments, trying to respond to different needs from epidemiological and administrative points of view. Most of those systems were implemented using legacy database technologies from the 1990s, such as dBase and other DOS and early Windows applications. Their technological modernization is planned but is taking place slowly, facing the challenges of covering a country of continental dimensions with inequalities in Internet access, technical workforce availability, political interference, and funding.

DataSUS has made Brazilian health data publicly available since its creation, fulfilling a mission of data dissemination and being a pioneer government agency of open data in the 1990s. The open data principles implemented in the HIS and the citizens’ right to information access are keystones for the creation of PCDaS.

Using the available data from DataSUS and other sources of relevant sociodemographic information, such as the Census and thematic population surveys, usually imposes steps of downloading and handling large numbers of files with storage needs, and processing and filtering the data for specific research needs. Due to the nature of the data formats available at the sources, this Extract-Transform-Load (ETL) process includes the use of legacy software, undocumented sources, and handling of larger-than-memory data by users who are not technology-savvy, such as social scientists, epidemiologists, geographers, and others.

The current ecosystem of health data sources and other information sources relevant to Public Health imposes on researchers and managers a long learning curve, with practices that are neither standardized nor shared, which leads several research groups to repeat the ETL processes and obtain similar but not comparable results.

Aiming at providing a Data Science platform for Public Health, PCDaS is structured as a Platform-as-a-Service (PaaS) to cover aspects such as ETL, data analysis, data visualization, modeling, artificial intelligence, and knowledge dissemination. At PCDaS, we strive to create a community of data scientists who collaborate with SUS to offer advanced technology and scientific computing services. Our primary focus is to help manage, store, analyze, visualize, and share extensive data related to healthcare and its socio-environmental influences. Our services cater to researchers, professors, students, educational and research institutions, as well as government officials. Our objective is to advocate for positive advancements in public health policies and society as a whole.

Besides this introduction, Section 2 presents a literature review, a background on PaaS, and discusses related works. Section 3 further details the components of the PCDaS data ecosystem. Section 4 shares some of the main research projects furthered by PCDaS as well as their products and scientific results. Finally, Section 5
concludes and describes the plans for expansion and continuous improvement of PCDaS.

2. Literature review

Over the past years, industry and academia have shown a growing interest in Big Data and analytics. Despite advancements in computer systems, handling large-scale data remains an important challenge. Commonly, we encounter hardware and time limitations, which drive the improvement of data processing methods. Big Data refers to the vast amount of data created and exchanged. Its applications are characterized by the “3Vs”: volume (information volume), velocity (data generation and consumption time), and variety (heterogeneous data sources) [5]. Additional dimensions such as veracity, validity, value, variability, venue, vocabulary, and vagueness have been proposed to complement the understanding of Big Data [1].

Over the past decade, data science has emerged as a highly relevant field. It encompasses a multidisciplinary approach, gathering different knowledge areas [6]. The fundamental objective of data science is to extract valuable insights and knowledge from data, leveraging analytical techniques and computational tools. Collaborative platforms such as Kaggle and OpenML play a relevant role in the data science community. For instance, the mentioned platforms provide datasets, data documentation, and some exploratory analysis.

In the context of public health, data science plays a crucial role. Goldsmith et al. [7] define data science in public health as a discipline that focuses on formulating and answering questions related to public health and well-being through data-centric approaches. In recent years, public health researchers in Brazil have increasingly turned to data science tools and methodologies to understand the Brazilian health system better and improve healthcare outcomes. The analyses using the information provided by DataSUS present themselves as a typical Big Data problem due to its volume.

DataSUS is the primary information system responsible for providing computational support to all instances of SUS. Despite its crucial role, accessing data from DataSUS can be challenging due to the fragmented nature of the available data sources, which makes it difficult for researchers and stakeholders to obtain a comprehensive view of the public health data landscape.

According to [8], Platform-as-a-Service (PaaS) is an interface that provides access to a complex set of technological components. Today, PaaS plays a central role in internet applications, abstracting architecture and infrastructure complexities for users. PaaS exhibits three significant characteristics [9]: it is internet resource centered, is open to third-party developers, and utilizes web APIs (Application Programming Interfaces).

PaaS offers access to its resources through network connections, delivering services to users. The concept of an open platform relies on enabling third-party developers to create internet-based value-added applications. This integration is facilitated through APIs, which grant third-party developers access to resources and services. Consequently, this accessibility yields tangible benefits for organizations, developers, and users.

One plausible solution to address Big Data challenges is through data ecosystems. Section 3 presents a more detailed description of our architecture, highlighting how we provide PCDaS as a Platform-as-a-Service. In this context, we have implemented a data ecosystem that facilitates collaborative efforts to enhance our understanding of public health in Brazil.

Data ecosystems transcend traditional production chains by incorporating three key characteristics: network, platform, and co-evolution [10]. As the mentioned work describes, networks within data ecosystems are formed by developers, providers, technology suppliers, and infrastructure. Platforms serve as the means through which ecosystem participants interact. Moreover, data ecosystems provide resources that enable participants to evolve through interactions among stakeholders and across different knowledge fields. A more specific discussion about PaaS and the relation between its components can be found in [8]. A data ecosystem can also be understood as a complex set of interactions between heterogeneous agents and their environment, similar to a biological ecosystem [11].

The increasing variety and volume of data have presented us with numerous challenges. Some solutions have been proposed to overcome time and hardware restrictions. For integrating data from heterogeneous sources while ensuring data quality, Ramalli et al. [12] propose using SciExpeM, a framework designed to speed up and support the development of scientific models. This work was further extended [13], where a data ecosystem was proposed specifically for chemical engineering.

The increasing adoption of IoT sensors has led to the continuous monitoring of daily activities. In Yu et al. [14], a data ecosystem is proposed to address predictive maintenance problems in an industrial context. In materials science, Blaiszik et al. [15] emphasize enhancing data ecosystems to develop new technologies. The paper presents two projects supporting machine learning applications in materials science.

Although the mentioned works primarily focus on engineering applications, data ecosystems can be applied across various domains. For instance, in an analysis of the effects of territorial politics and metropolitan governance, Kitchin and Moore-Cherry [16] discuss how fragmented governance can reduce economies of scale
and limit the effectiveness of public policies. These papers contribute to the growing body of knowledge on data ecosystems. They typically present domain-specific problems and strategies for solving Big Data challenges. Table 1 provides a comprehensive list of data ecosystems, and to the best of the authors’ knowledge, PCDaS is the first data ecosystem specialized in public health in Brazil. We selected some features for comparing some of the existing data ecosystems:

• Educational Services: indicates if the given platform provides training and technological literacy;
• Teams: indicates if the data ecosystem provider explicitly allocates teams for supporting data-driven projects;
• ETL and Documentation: indicates if the platform provides the ETL process and data documentation;
• Quality: indicates if the given data ecosystem provides some data quality discussion;
• Centralized View: indicates if data is provided after a cleaning and enrichment process;
• Funding: indicates how the data ecosystem is funded; and
• Domain: application area.

Table 1
Summary of the related work

An important player in data ecosystems is the World Bank¹, an international organization dedicated to promoting equity and reducing poverty worldwide. Among its objectives is the support of countries in improving their statistical capacity through advisory services, project support, partnership management, and financial resources.

¹ The World Bank provides capacity building through mentoring and funding programs.

Despite being a more mature data ecosystem when compared to the others presented in Table 1, the Canadian Institute for Health Information (CIHI) does not provide centralized access to data. The institution releases frequent reports and metrics which can provide a wider comprehension of the Canadian public health landscape, and also offers training on how to access the information made available.

It is also important to mention the Global Health Observatory (GHO) data repository, a World Health Organization initiative that aims to provide health-related statistics for its 194 Member States. It provides access to more than 1000 indicators via an API on collective health topics, e.g., mortality and burden of diseases, immunization, and malaria, among others.

Notably, most of the listed data ecosystems do not provide data from a centralized view. By centralizing data access, PCDaS can offer the data science community simpler usage. This decision allows us to enrich data, add value, and enable more complex analysis. By releasing microdata, PCDaS offers more flexibility for the platform users. It is also worth mentioning that making our methodology public ensures reproducibility.

3. PCDaS

As discussed in the previous sections, the utilization of health data from Brazilian Health Information Systems in research endeavors presents numerous complex challenges that demand careful consideration. These challenges can manifest from various aspects, including intrinsic characteristics of the data itself and the necessary infrastructure for handling and analyzing the vast volumes of data available.

From a data perspective, several critical issues arise when dealing with health data, including the integration of data from diverse sources, data processing tasks such as cleaning and enrichment, the challenge of working with pre-aggregated data lacking granularity, non-standard file formats, managing the sheer volume of data, and ensuring appropriate data modeling for various research needs.

From an infrastructure perspective, having a robust environment that can provide the necessary computational power to acquire, process, and analyze large volumes of data while ensuring privacy and security is crucial.
In addition to meeting the hardware requirements, the infrastructure must encompass the selection of suitable tools and services that can effectively support the research objectives. This infrastructure entails considering factors such as data storage, processing capabilities, scalability, and the ability to implement robust privacy and security measures.

PCDaS was created by establishing a robust data ecosystem that effectively addresses several of the aforementioned challenges. One of our primary focuses is to provide services that facilitate the acquisition and exploration of data by users. By promoting data integration, PCDaS aims to foster collaboration among different research groups, encouraging active participation in data sharing and promoting the reuse of valuable information. This collaborative approach can potentially enhance knowledge exchange, accelerate research projects, and ultimately contribute to data-driven healthcare decision-making.

The initial version of the platform, named PCDaS 1.0, became available in 2016 [17]. Despite its limitations, this release marked the realization of the concept of offering users a unified platform for accessing and analyzing vast amounts of HIS data. Subsequently, in 2019, PCDaS 1.5 was introduced [18], featuring enhancements to the user interface, comprehensive tutorials to aid platform utilization, and the addition of new and updated public datasets. PCDaS 2.0 was launched in 2021 [19], incorporating a range of improvements, such as Single Sign-On (SSO) functionality for users, seamless integration with Google Colab for notebooks and tutorials, and a broader selection of public datasets. More recently, a RESTful API was developed and released to users of partner projects (such as the ones presented in Section 4), with the aim of simplifying access to datasets hosted on the platform. In addition to enhancing the user experience, the platform’s backend undergoes continuous evolution through the expansion and optimization of its infrastructure, architecture, and tools.

Currently, PCDaS has around 1,450 active users, supporting dozens of research and technological development projects from a variety of groups (academic and governmental), with particular interest in research and analysis on Public Health and socio-environmental determinants of health. The following subsections describe in detail the components of the PCDaS data ecosystem.

3.1. Infrastructure and Architecture

The infrastructure of the platform is hosted in two computational environments, in different geographical locations. The first is located at the Data Processing Center of the National Laboratory for Scientific Computing (LNCC), a renowned Brazilian research institution with a strong focus on high-performance computing. The second part of the infrastructure is located at the Data Processing Center of the Oswaldo Cruz Foundation (Fiocruz), in Rio de Janeiro, which is widely recognized as one of the leading research centers in the field of public health. By leveraging the complementary nature of the infrastructure and staff members of these two institutions, PCDaS can provide computational power and services capable of meeting the specific needs of internal engineering teams and users interested in using the platform.

Regarding availability, the infrastructure located at Fiocruz complies with the ABNT NBR 15247 (Safe Storage Environment with Fire Resistance Classification and Test Method) and NBR 60529 (Protection of Electrical Equipment) standards. Additionally, it has two uninterruptible power supplies (UPS) and its own power generator, ensuring the operation of the equipment 24 hours a day and providing protection against humidity, corrosive gases, magnetism, and high temperatures.

Furthermore, since the services are located in different places, any network or power-related issues in one location only impact the services in that location, minimizing overall disruption to the platform’s services. Additionally, having different teams responsible for maintaining the separate infrastructures and tools allows for a more specialized focus on specific concerns. This fine-grained specialization ensures that each team can efficiently address maintenance tasks and any issues that arise, contributing to the overall stability and reliability of the platform.

The platform prioritizes using free and open-source software (FOSS) for its tools and services. This approach offers several advantages in terms of infrastructure maintenance. FOSS solutions are cost-effective, providing a more affordable alternative to commercial options. Additionally, FOSS is transparent, allowing for greater visibility into the code and ensuring the platform’s operations are built on trustworthy foundations. Finally, the customizable nature of FOSS enables the platform to adapt and meet specific requirements efficiently. This way, the platform can function with minimal impact even when budget constraints arise.

A simplified overview of the infrastructure and architecture of the platform is presented in Figure 1. Despite the specification of current infrastructure, tools, and services, the platform follows a tool-agnostic and evolutionary approach, with planned updates and migration to new tools and infrastructure when favorable. It can be seen that the platform provides a set of services tailored to specific needs. Internal engineering teams of the platform have tools for ETL jobs at their disposal (Apache Airflow and JupyterHub), as well as complete access (write, read) to a Data Warehouse solution (Elasticsearch). Platform users’ services are focused on providing computational power (JupyterHub and Google Colaboratory), data analysis tools (Kibana), and data consumption inter-
[Figure 1: simplified overview of the infrastructure and architecture of the platform. Recoverable labels: LNCC, FIOCRUZ, NFS Server, Logs, PCDaS Users; Read and Read / Write access paths.]
Table 2
Services by user category

on our website to showcase the project’s main goals and the individuals involved.

Finally, Partner Users receive the highest level of support and benefit from three additional features: e) Basic training in Python or R, offering classes on data usage, analysis, and manipulation with the chosen programming language. f) Training of the research team for using PCDaS to its fullest potential, and providing comprehensive support on leveraging our infrastructure and data sources. g) Extraction, Transformation, and Loading (ETL) of databases relevant to the research team’s interests. Our experienced engineering team assists in extracting and structuring data, allowing researchers to focus on their project’s core objectives without managing the technological aspects.

3.3. Data Management

This section outlines the Data Management process within PCDaS’ data ecosystem. Data management refers to the activities undertaken to ensure the accuracy, integrity, and accessibility of the datasets. It encompasses data collection, organization, processing, storage, integration, and governance. Through effective data management practices, PCDaS strives to maintain the quality and reliability of data, enabling researchers to derive meaningful insights and make informed decisions based on reliable and comprehensive information.

3.3.1. Data Extraction

Data extraction is facilitated by utilizing two approaches: leveraging the orchestration capabilities of Apache Airflow or employing custom code on Jupyter Notebooks. Apache Airflow is employed when there is a need for automated data collection based on predefined schedules or events. On the other hand, Jupyter Notebooks are utilized for simpler data collection scenarios where a one-shot extraction is sufficient. This approach provides flexibility and ease of use for ad-hoc data collection tasks. By employing both Apache Airflow and Jupyter Notebooks, the platform can accommodate various data collection requirements and ensure efficient and effective data retrieval processes.

3.3.2. Data Transformation

Following data collection, the subsequent phase involves data processing. A significant challenge that users encounter when working with data from various HIS is the presence of diverse legacy and unstandardized file formats. Consequently, converting these files into user-friendly formats, such as CSV, Parquet, and other compatible file formats, is imperative. Although this task may appear straightforward, it often requires extensive research, an understanding of domain requirements, and potentially the development of specialized libraries capable of handling these conversion processes effectively.

Another very important step of data transformation is data enrichment. In the case of public HIS, data files often consist of numeric values representing categories or amounts, with separate files containing mappings for these categories to their respective string values. It is essential to perform data mappings that generate datasets containing numeric category values and their corresponding string representations. Such a process enhances the usability of the data. It enables easier data interpretation and analysis, ensuring meaningful insights can be derived from the enriched dataset.

In certain situations, it becomes necessary to integrate different datasets. This integration can be achieved by utilizing a shared column to enrich the resulting dataset with additional information. By joining datasets, valuable insights can be gained from the merged data, providing more comprehensive underlying information. This process enables researchers and analysts to leverage multiple datasets’ combined knowledge and attributes, facilitating more informed analysis.

In cases where the final dataset is stored in our data warehouse, the data processing workflow includes the crucial step of data modeling. Data modeling ensures that the data is compatible with the data model of the destination storage system. It also ensures that the data is appropriately formatted to facilitate analytical queries. By performing data modeling, we ensure that the data is structured
and organized to support efficient and effective analysis, enabling users to derive meaningful insights from the dataset.

3.3.3. Data Validation

Data validation is an essential step in the data processing workflows of HIS for several reasons. One of the main motivations is the reliance on manual human input for data entry in many of these systems. Despite verifications being performed in the source systems, it is not uncommon to encounter inconsistencies in the provided data. By conducting data validation, these inconsistencies can be identified and addressed, ensuring the accuracy and reliability of the processed data. Additionally, data validation plays a vital role in maximizing trust in the transformation process, as it helps prevent issues with the data from propagating to downstream tasks.

Data validation involves examining and verifying transformed data to ensure its adherence to predefined rules. These rules can be described in documents, as is the case for various HIS, or defined by domain specialists. Regardless of their origin, these rules serve as guidelines for ensuring data accuracy and consistency.

3.3.4. Data Load

Once the data has been transformed, the subsequent step involves loading the transformed data into the target systems. In the case of public datasets provided by the platform, this entails inserting the data into our Elasticsearch cluster and loading the transformed CSV files into a shared environment accessible to users. This load allows users to query the inserted data using Elasticsearch’s powerful analytics engine. Additionally, users can utilize custom code on Jupyter or Colaboratory notebooks to perform more detailed analytics on the CSV files.

Another advantage of utilizing a distributed analytics system like Elasticsearch is its horizontal scalability. By consistently monitoring the system’s resource usage, we can identify demand increases and, if needed, add additional machines to the cluster. This scalability feature ensures the platform can handle growing workloads effectively and maintain optimal performance.

Private datasets generated for Partner Users can be loaded into their preferred systems in any acceptable format. However, since the platform provides infrastructure that users may not possess, it is common for datasets to be loaded into a data warehouse. It al-

3.3.5. Data Access Management

Data access is managed by leveraging the concept of Elasticsearch roles. A role associates a collection of users with a set of permissions. These permissions can include different types of access (read, write) to specific indices, clusters, and objects in Elasticsearch and Kibana. Users are granted only the specific permissions they need to do their jobs, without access to more data than they need. This helps to protect data from unauthorized access and allows fine-tuning of the access that users have to the different artifacts managed by the PCDaS platform.

3.3.6. Example

Figure 2 illustrates the data management process for the System of Hospital Admissions (SIH) data. This workflow represents the data management process for a single coverage of SIH, which releases data monthly.

The process starts with creating the folder structure to accommodate the downloaded and generated data. After this, a zip file containing the mappings used during the transformation step is downloaded from the DataSUS FTP. Decompressing the zip file results in several files in CNV and DBF formats. These files represent the mappings and are parsed to CSV file format.

The next step is downloading data files representing each state’s hospital admissions. As these files are in DBC file format, an additional step is necessary to convert them to a format compatible with popular processing tools. After format conversion, the dataset is validated by applying a set of rules provided in the SIH documentation. After the pre-validation, data is enriched by mapping categorical values to their respective strings and integrating them with other datasets. A post-validation step is performed after data enrichment to verify the correctness of the data, and if the data is valid, it is loaded into our data warehouse.

The loaded data is validated against cross-tables available at TabNET, a system provided by DataSUS in which users can consult information about several HIS of SUS. Finally, the enriched data are gathered in a zip file and made available in a shared space to which platform users have access.

It is worth emphasizing that this process is designed specifically for SIH data, although it can be adapted for other datasets if needed. However, having distinct data management processes tailored to different datasets is more common.
lows centralized storage, efficient data management, and 3.4. Data Analysis and Visualization
easy integration with the platform’s analytical tools and
services. As mentioned earlier, a primary objective of the platform
is to facilitate users’ access to HIS data. In order to achieve
this objective, it is essential to collect, enhance, and make
the data readily available and offer different interfaces
that support distinct needs for analyzing and visualizing the information.

When client-side data manipulations are required, users can utilize our JupyterHub service or Google Colab notebooks to execute queries over our Data Warehouse or manipulate CSV or other formats made available by the platform. Additionally, an API developed using the FastAPI framework is provided to simplify the querying process by supporting the SQL query language. Utilizing this API offers the advantage of logging user queries, enabling monitoring of request-related issues, and providing support when necessary.

Additionally, we offer a Kibana interface for users who prefer a more visual approach to data analysis or have limited familiarity with programming concepts. This interface provides a user-friendly and intuitive way to explore and analyze data stored in our ElasticSearch cluster, enhancing and simplifying the overall data analysis experience.

Various examples of analyses conducted on the platform’s public data are presented on the PCDaS website. These examples are available as Kibana dashboards and as tutorial examples that can be easily cloned and opened in Google Colaboratory notebooks. These resources serve as valuable references for users looking to explore and learn from practical use cases of data analysis on our platform.

3.5. Data Transparency

Data transparency is crucial for enabling users and collaborators to comprehend the complete data generation process. Making the entire process transparent has several advantages. Firstly, users can verify that the process aligns with their expectations. If any inconsistencies arise, they can suggest the modifications necessary to meet their needs. This transparency empowers users to actively participate in shaping the data generation process, fostering a collaborative and inclusive environment.

For public datasets, to ensure data transparency, we offer detailed documentation that provides clear and comprehensive information about the entire data lifecycle, including data collection, processing, and utilization. This documentation is a valuable resource for users and collaborators, enabling them to understand how the data is obtained and handled.

Regarding private datasets, all code generated by the platform’s engineering teams is shareable with our collaborators through simple notebooks and packaged applications. Additionally, we employ Docker containers to facilitate the sharing process. This approach ensures seamless collaboration and enables our collaborators to work efficiently with the code and access the necessary data analysis and processing tools.

3.6. Data Protection

Following Brazilian guidelines on research that includes data from humans, especially those of the Ethics Committee of the Escola Politécnica de Saúde Joaquim Venâncio, we adopt specific legal documents such as the Data Use Commitment Term, or “Termo de Compromisso de Utilização de dados” (TCUD), which states the project leaders, the objectives of the research, and how the data will be used and safeguarded. Projects that contain data with sensitive information follow specific guidelines from the Ethics Committee for data handling, storage, and retention policies.

3.7. Data FAIRness

Data sharing is a core element of PCDaS activities. Thus, with the aim of improving its quality and expanding the possibilities of reuse, it was decided to guide this process through the FAIR principles (Findable, Accessible, Interoperable, and Reusable).

Adopting FAIR principles makes it easier to discover and access data, while interoperability makes it possible to automate processes and integrate with other analytical tools, which can lead to new discoveries faster. In addition, improved transparency and reliability can lead to greater reuse of data and promote collaboration and innovation [13, 20, 21], which is fundamental to the field of public health.

At PCDaS, based on FAIR principles, information about data provenance, descriptive metadata, and the record of the data processing and dataset enrichment process are
shared. The data is shared in open formats such as CSV and through an API, which makes it readable by both humans and machines and, together with the shared documentation, expands the possibilities of interoperability and reuse.

4. Success cases

This section presents the main research projects conducted in partnership with PCDaS, which provided services of scientific training, technological infrastructure, ETL, and data analysis and visualization. The described projects had a high impact, especially in the field of public health, as can be observed from their achievements and the public adoption of their products and scientific results. These projects exemplify, validate, and represent the realization of the PCDaS goal of creating a community of data scientists and researchers, as well as government officials, who collaborate with SUS through the use of advanced technology and scientific computing services, furthering positive advancements in public health policies and society as a whole.

4.1. GCE

Created in 2003 by the Bill & Melinda Gates Foundation, Grand Challenges Explorations is a program that invests in impactful research to solve significant health and development challenges worldwide [22]. Grand Challenges Brazil is the result of a partnership between the Department of Science and Technology (DECIT) of the Ministry of Health, the National Council for Scientific and Technological Development (CNPq), and the Bill & Melinda Gates Foundation. In Brazil, projects supported in data science [23] used innovative approaches to data analysis and modeling with the support of the PCDaS team, which provided databases, infrastructure compatible with analysis in the Big Data scenario, teams dedicated to funded projects, and many other useful resources, as can be seen in Table 3.

Table 3: Indicators of PCDaS usage for the second GCE call.

Brazilian Obstetric Observatory (OOBR)

Among the main results achieved by PCDaS partner projects in the second call of the Grand Challenges Explorations – Brazil: Data Science to improve maternal and child health, women’s and children’s health in Brazil, it is worth mentioning the efforts and contributions of OOBr (Brazilian Obstetric Observatory) [24], which monitors and analyzes the area of maternal and child health in Brazil. The initiative has achieved remarkable advances in maternal, child, and obstetric health in a brief timeframe. Many resources have been generated, including articles, data visualization panels, indicators, and a book. Moreover, its members have actively contributed to seminars and conferences, which caught the attention of researchers, journalists, and managers [25, 26, 27].

Accessibility to services and the reduction of maternal and child mortality (ASSMI)

Another project supported by the PCDaS team and infrastructure, named "Accessibility to services and the reduction of maternal and child mortality" (ASSMI), focused on studying the impact of the displacement of many pregnant women who leave their homes and cities and travel long distances to health establishments that offer conditions for childbirth and neonatal care. Among the conclusions, the research indicated that the distances traveled by pregnant women for deliveries and neonatal care increased in the period analyzed (between 2006 and 2017) and that infant deaths can be avoided by reducing the distances between mothers and the place of delivery or neonatal care [28]. A detailed summary of other projects’ contributions and the support provided by PCDaS can be found in [29, 30].

Amplia Saúde

The “Amplia Saúde” project [31], which also received support from PCDaS, offers an unparalleled alternative for visual analysis of maternal and child health data during the pre- and perinatal periods. The project takes into consideration environmental and climate factors. As part of this initiative, the First Big Data View Workshop in Maternal and Neonatal Health was developed during the second call of the Grand Challenges Explorations – Brazil. The workshop offers interactive tools for exploring and visually analyzing maternal and early neonatal health data correlated with data on environmental problems and extreme weather conditions.

Vax*Sim and MATRECI

Immunizing children under five is highly effective and affordable in preventing dangerous and potentially fatal illnesses like polio, measles, diphtheria, tetanus, and pertussis. Sadly, around 20 million Brazilian children are still not receiving routine vaccination services, which creates disparities in healthcare coverage. With this scenario in mind, the project entitled "The Role of social media, the Bolsa-Família Program and Primary Health Care in immunization coverage for children under five in Brazil", also called Vax*Sim, had the support of PCDaS in relevant stages of the process, such as monitoring the social network Twitter to analyze what Brazilians were publishing and commenting on the topic of vaccination, and using data from the Information System on Live Births (SINASC) and the Information System on Mortality (SIM), in addition to sociodemographic data from municipalities, such as population and basic sanitation coverage [32].

Another project PCDaS supported is Breastfeeding in Brazil in the MATRECI Model: Mapping, Trend, Clustering, and Impact. This project aims to track the implementation and progress of breastfeeding initiatives in primary health care (PHC). The team identified successful pro-breastfeeding programs and their impact. They created a dataset on PHC infrastructure, breastfeeding rates, and programs. They also qualified and verified SISVAN breastfeeding information [33]. These last two projects yielded significant contributions in published articles with their respective conclusions [34, 35, 36, 37, 38].

4.2. PNS

As a result of the PNS project, a website was generated that gathers the previously mentioned information about the indicators in addition to dashboards with maps and figures, making it easier for researchers to carry out specific queries².

From its launch on April 1, 2021, to May 23, 2023, the site reached 50k users, and the engagement rate, which measures user interactions on the site in terms of navigation, was 51.5%, distributed across Brazil (94.63%), the United States (3.02%), and Portugal (0.41%). Table 4 presents the five most accessed pages.
the SUS by aiding, with data and technological tools, the implementation and monitoring of health policies for the Brazilian population. To ensure the success of PCDaS, we have planned to (i) expand our institutional partnerships, (ii) provide ongoing training for health data scientists, (iii) make new datasets available, and (iv) continuously improve the software ecosystem that supports the Platform.

Acknowledgments

This work was supported by the Institute for Scientific and Technological Communication and Information on Health (Icict/Fiocruz) in partnership with the National Laboratory of Scientific Computing (LNCC), the Research Fund of the Rio de Janeiro State Research Foundation (FAPERJ), the National Council for Scientific and Technological Development (CNPq), the Bill and Melinda Gates Foundation, the INOVA Fiocruz Program, and the Coordination for the Improvement of Higher Education Personnel (CAPES). Various PCDaS partners provided additional funding. We also thank Artur Ziviani (in memoriam), João Vitor Maués Dias Carneiro, Jorge Nundes, Leandro Zirondi, Marcelo Rabaço, Marcus Vinicius Carneiro Magalhães, and Matheus Ferreira for their contributions. Special thanks to Igor da Silva Morais, who materialized the ideas and aspirations of the PCDaS, "live long and prosper"!

References

[1] K. Borne, Top 10 big data challenges a serious look at 10 big data v’s, Blog Post 11 (2014).
[2] R. d. F. Saldanha, C. Barcellos, M. d. M. Pedroso, Ciência de dados e big data: o que isso significa para estudos populacionais e da saúde?, Cadernos Saúde Coletiva 29 (2021) 51–58. URL: https://doi.org/10.1590/1414-462X202199010305. doi:10.1590/1414-462X202199010305.
[3] S. A. Bohon, Demography in the big data revolution: Changing the culture to forge new frontiers, Population Research and Policy Review 37 (2018) 323–341. URL: http://link.springer.com/10.1007/s11113-018-9464-6. doi:10.1007/s11113-018-9464-6.
[4] PCDaS, Plataforma de ciência de dados aplicada à saúde, 2023. URL: https://pcdas.icict.fiocruz.br. doi:10.7303/syn25882127, laboratório de Informação em Saúde (Lis). Instituto de Comunicação e Informação Científica e Tecnológica em Saúde (Icict). Fundação Oswaldo Cruz (Fiocruz).
[5] D. Laney, et al., 3d data management: Controlling data volume, velocity and variety, META group research note 6 (2001) 1.
[6] J. M. Wing, The data life cycle, Harvard Data Science Review 1 (2019) 6.
[7] J. Goldsmith, Y. Sun, L. Fried, J. Wing, G. W. Miller, K. Berhane, The emergence and future of public health data science, Public Health Reviews (2021) 4.
[8] D. Beimborn, T. Miletzki, S. Wenzel, Platform as a service (paas), Wirtschaftsinformatik 53 (2011) 371–375.
[9] C. Lv, Q. Li, Z. Lei, J. Peng, W. Zhang, T. Wang, Paas: A revolution for information technology platforms, in: 2010 International Conference on Educational and Network Technology, IEEE, 2010, pp. 346–349.
[10] M. I. S. Oliveira, B. F. Lóscio, What is a data ecosystem?, in: Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age, dg.o ’18, Association for Computing Machinery, New York, NY, USA, 2018. URL: https://doi.org/10.1145/3209281.3209335. doi:10.1145/3209281.3209335.
[11] O. I. Dictionary, A new oxford dictionary (1962).
[12] E. Ramalli, G. Scalia, B. Pernici, A. Stagni, A. Cuoci, T. Faravelli, Data ecosystems for scientific experiments: managing combustion experiments and simulation analyses in chemical engineering, Frontiers in big Data 4 (2021) 663410.
[13] E. Ramalli, B. Pernici, et al., From a prototype to a data ecosystem for experimental data and predictive models, in: Proc. of the First International Workshop on Data Ecosystems (DEco’22), CEUR-WS, 2022, pp. 18–26.
[14] W. Yu, T. Dillon, F. Mostafa, W. Rahayu, Y. Liu, A global manufacturing big data ecosystem for fault detection in predictive maintenance, IEEE Transactions on Industrial Informatics 16 (2019) 183–192.
[15] B. Blaiszik, L. Ward, M. Schwarting, J. Gaff, R. Chard, D. Pike, K. Chard, I. Foster, A data ecosystem to support machine learning in materials science, MRS Communications 9 (2019) 1125–1133. doi:10.1557/mrc.2019.118.
[16] R. Kitchin, N. Moore-Cherry, Fragmented governance, the urban data ecosystem and smart city-regions: the case of metropolitan boston, Regional Studies 55 (2020) 1913.
[17] André Bezerra, Fiocruz disponibiliza plataforma de ciência de dados aplicada à saúde, 2016. URL: https://www.icict.fiocruz.br/content/fiocruz-disponibiliza-plataforma-de-ci%C3%AAncia-de-dados-aplicada-%C3%A0-sa%C3%BAde.
[18] André Bezerra, Lançada a versão 1.5 da plataforma de ciência de dados aplicada à saúde, 2019. URL: https://www.icict.fiocruz.br/content/lan%C3%A7ada-vers%C3%A3o-15-da-plataforma-de-ci%C3%AAncia-de-dados-aplicada-%C3%A0-sa%C3%BAde.
[19] Ariane Alves, Pcdas, a plataforma pública e gratuita de dados da saúde, ganha versão 2.0, 2021. URL: https://www.icict.fiocruz.br/content/pcdas-plata