All Pages 2023 v3 EPFL Library RDM FastGuide-17

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

EPFL LIBRARY

RESEARCH
FAST GUIDE #1 DATA
RESEARCH DATA: THE BASICS MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

DEFINITIONS [1] Code is also data!


 “Research data […] is collected, observed, or created, for purposes of analysis to
produce original research results”
 “Recorded factual material commonly accepted in the scientific community as Check the
necessary to validate research findings... ” Fast Guide #6:
 “Materials generated or collected during the course of conducting research... ” CODE AS DATA [2]

RESEARCH DATA LIFECYCLE [3] DO NOT CONFUSE

Acquire existing third-party data Write a Data Management Plan (DMP) Data point
Use licenses and copyrights Consider ethics authorizations 11.02 m/s
Reuse and repurpose data Explore existing data sources Raw data
Discover data sources Create naming convention
g+ y01 9
2
Follow-up research Find storage solutions
Cite data Choose metadata
Choose software x
Discover
and reuse
Plan and design
Check IP issues
65
Dataset

Archive and Use, organize


preserve and store
Collect data
Validate data
Review data Create metadata
Convert data format
Publish and share
Store and backup Database
Clean and curate data Derive and visualize data
Create preserve and document Capture data and metadata
Deposit data (repositories / archives) Interpret, process, and analyze data

Polish data; Validate and verify data; Create user documentation; Publish and share; Processed data
Select appropriate access to data; Promote data; Peer-review publications

Data type Description Examples


Data captured in-situ, can not be recaptured, recreated Sensor readings; Sensory (human) observations;
Observational
or replaced Survey results; Interview notes/transcripts

Data collected under controlled conditions, in situ or Gene sequences; Chromatograms;


Experimental
lab-based. Should be reproducible, but can be expensive Spectroscopy; Microscopy

The process of taking a large amount of data and using it Climate models; Economic models;
Simulation
to mimic real-world scenarios or conditions. Biogeochemical models

Derived / Reproducible, but can be very expensive Derived variables; Compiled database; 3D
Compiled models

Reference / Static or organic collection [peer-reviewed] datasets, Gene sequence databanks; Chemical
Canonical probably published structures; Census data; Spatial data portals

Structured information associated with data for purposes README files [4]; Publication keywords; File and
Metadata of discovery, description, use, management, and folder names
preservation

Credits and sources


[1] libguides.macalester.edu/data1 [3] data-archive.ac.uk
[2] go.epfl.ch/rdm-fastguide06 [4] go.epfl.ch/rdm-readme

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #2 DATA
FAIR DATA PRINCIPLES [1] MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

Data and metadata Both humans Data from different Published data
are easy to find by and computers datasets are prepared to be can be easily
both humans can readily access or combined or exchanged combined/replicated
and computers download datasets in future research

F INDABLE A CCESSIBLE I NTEROPERABLE R EUSABLE

F1 (Meta)data are A1 (Meta)data are I1 (Meta)data use a R1 (Meta)data are richly


assigned a globally retrievable by their formal, accessible, described with a plurality
unique and persistent identifier using a shared and broadly of accurate and relevant
identifier. standardized applicable language for attributes:
F2 Data are described communication protocol: knowledge R1.1
(meta)data are released
A1.1 representation. with a clear and accessible
with rich metadata. the protocol is open, free
and universally implementable; data usage license;
F3 Metadata clearly I2 (Meta)data use
R1.2
A1.2 the protocol allows for an
vocabularies that follow (meta)data are associated
and explicitly include authentication and authorization with detailed provenance;
FAIR principles.
the identifier of the data procedure where necessary. R1.3
(meta)data meet domain-
they describe. A2 Metadata are I3 (Meta)data include relevant community
F4 (Meta)data are qualified references to standards.
accessible, even when
other (meta)data.
registered or indexed in the data are no longer
a searchable resource. available.

DESCRIBE OPEN LINK PUBLISH


Describe provenance, Open up your data using Use persistent identifiers Deposit datasets in data
usage, and organization of standardized licenses [3]. (ex. DOI, HANDL, URN) to repositories, favoring
data with standardized Limitations may apply (ex. cross-link datasets. services with user-friendly
metadata [2]. data protection, commercial Publish files in open formats, interfaces. Make sure to
Make metadata available datasets, and double-use even alongside proprietary choose a FAIR-compliant
even if data is not. technology). formats. data repository, also for the
relative code.

MY FAIR DATA? FAIR ≠ Open WHERE TO PUBLISH DATA?

 FAIR ensures data can be


Check the FAIRness of found, understood and reused To find the appropriate repository
your dataset with this for your FAIR data, check the
self-assessment test [4]  Data can be shared under Data platforms dissemination table [6]
restrictions & still be FAIR: on the EPFL Library pages
"As open as possible, as
restricted as necessary”
Want to dig more? Check
Karel Luyben, re3data.org & fairsharing.org
president of the EOSC [5]

Credits and sources


[1] go-fair.org/fair-principles [4] ardc.edu.au/resources/aboutdata/fair-data/fair-self-assessment-tool
[2] go.epfl.ch/rdm-fastguide05 [5] doi.org/10.5281/zenodo.6807345
[3] go.epfl.ch/rdm-fastguide12 [6] go.epfl.ch/datarepo

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #3 DATA
COST OF RDM MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

RDM ACTIVITIES TO BE BUDGETED


RDM costs can be
eligible for funding
applications [1]
PEOPLE HARDWARE SOFTWARE SECURITY PUBLICATION

DATA MANAGEMENT DMP writing; DMP revision; DMP


PLAN publishing SNSF ERC / MSCA /
Horizon Europe
Databases and software; Data formatting;
COLLECTION
Data organization; Data transfer Up to CHF 10,000 for Costs for RDM (e.g.,
Open Research Data data storage,
Electronic Lab Notebook (ELN); Laboratory (ORD) activities, to processing and
ACTIVE MANAGEMENT prepare datasets and preservation) and for
Information Management System (SLIMS);
Data sharing platform grant access to them Open Access to
in non-commercial research data (APC,
repositories curation, …) are eligible
Data description and metadata;
DOCUMENTATION
Documentation and transcription

STORAGE /
Data back-up; Data storage; NAS; Cloud
BACK-UP
SOME FIGURES
Access control; Data security; Sensitive
ACCESS / CONTROL
data protection; 3rd party data Percentage relative to the total
0.2% funds requested to SNSF in 2018
Anonymization; Copyright assessment; for ORD activities [3]
SHARING
Data cleaning; Data publishing
RDM cost on the total project
PRESERVATION / Data preparation; Long-term preservation; 5% expenditure expected in 2016 to
ARCHIVING Data repository properly manage and steward data
[4]

Per year, cost estimation of not


RDM costs are listed in the budget templates and toolkits available on
the EPFL-ReO website [1]. An overview of the possible costs per research 26.2bn € having FAIR data for the EU [5]
activity is also presented by the Utrecht University [2].

HOW TO ESTIMATE?

The EPFL Cost Calculator for Data Management [6]


helps researchers to estimate the cost of managing,
storing, and publishing data.

At EPFL, Chronos [7] enables timekeeping for research


projects which need to justify the eligible personnel
costs to the funding bodies.

Credits and sources


[1] go.epfl.ch/ReO-toolkits_research-funding [4] eosc-portal.eu/sites/default/files/ realising_the_european_open_science_cloud_2016.pdf
[2] uu.nl/en/research/research-data-management/guides/ [5] op.europa.eu/s/n75s
costs-of-data-management [6] costcalc.epfl.ch/
[3] zenodo.org/record/3618209 [7] epfl.ch/research/services/fr/gerez-votre-projet/chronos/

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #4 DATA
FILE FORMATS MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

DEFINITION IDEALLY…
A file format is a standard way to encode data as a computer file. It follows a … prefer file formats that are:
protocol that specifies how bits are used to encode information in a digital storage
medium. File formats may be proprietary or open [1]. Unencrypted, uncompressed,
commonly used by the research
community
DO NOT FORGET
Compliant to an open, documented
When listing out the data formats you will be using, make sure to include: standard: interoperable among
diverse platforms and applications,
 The necessary software to view the data (e.g. SPSS v.3; Microsoft Excel 97- fully published and available
2003); royalty-free, fully and independently
 Information about version control; implementable by multiple
software providers on multiple
 If data are stored in one format during collection and analysis, then
platforms without any intellectual
converted to another for preservation: list out features that may be lost in
property [2]
data conversion such as system-specific labels.

TYPE OF DATA APPROPRIATE ACCEPTABLE DEPRECATED


Tabular (extensive metadata) csv – hdf5 txt – html – tex – fastq – por

Tabular (minimal metadata) csv – tab – ods – sql – tsv xml (if appropriate dtd) – xlsx xls – xlsb

txt – pdf – odt – odm – tex – md – htm – xml – pptx – rtf – docx – pdf (with doc – ppt – dvi –
Textual / Presentation
extxyz – odf embedded forms) – eps – ipf ps

m – r – py – iypnb – rstudio – rmd – NetCDF –


Code / Computation sdd mat – rdata
aiml

jcamp – jpg – jp2 – tif – tiff – pdf – gif indd – ait – psd –
Image / Spectroscopy tif – png – svg – jpeg – fits
– bmp – dm3 – oir – lsm spc

flac – wav – ogg – mxl – midi – mei –


Audio mp3 – aif
humdrum

Video mp4 – mj2 – avi – mkv ogm – mp4 – WebM wmv – mov – qt

NetCDF – tabular gis attribute data – shp –


Geospatial shx – dbf – prj – sbx – sbn – postGis – tif – tfw mdb – mif
– geoJson

3D structures & images x3d – x3dv – x3db – pdf3D – pov – pdbml dwg – dxf – pdb pxp

Other xml – json – rdf

WANT MORE?
Check out also these tables:
 fileformats.archiveteam.org/wiki/Scientific_Data_formats
 loc.gov/preservation/resources/rfs/TOC.html
 archives.gov/records-mgmt/policy/transfer-guidance-tables.html
 nationalarchives.gov.uk/information-management/manage-information/digital-records-transfer/file-formats-transfer

Credits and sources


[1] en.wikipedia.org/wiki/File_format
[2] guides.library.stanford.edu/data-best-practices/format-files

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #5 DATA
METADATA MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

DEFINITIONS Metadata can be created manually


or by automated information
MANUAL OR
processing
 “data that provide information about other data” [1] AUTOMATIC
(i.e. captured by computer or
machine)
 “structured information associated with an object for purposes of
discovery, description, use, management, and preservation” Metadata can be stored in the same
(National Information Standards Organization, 2008) [2] EMBEDDED OR file or structure as the data
SUPPLEMENTAL (embedded metadata), or in a
separate file (supplemental)
Family Description Examples
DESCRIPTIVE METADATA Metadata ensures
For finding or Title, Author, Subject, that data are FAIR
understanding a resource Keywords, Publication date

ADMINISTRATIVE METADATA Check the


Fast Guide #2: FAIR [3]
File type & size, Device
For decoding and
TECHNICAL METADATA version, Creation date/time
rendering files
Compression scheme
RESEARCH
PRESERVATION Long-term files Checksum date,
METADATA management Preservation event • Find datasets/code using metadata
Intellectual property rights Copyright status, License • Plan & Design metadata
RIGHTS METADATA
attached to content terms, Rights holder • Acquire / Create metadata
STRUCTURAL METADATA • Clean metadata, control quality
• FAIR-ify (normalize, enrich, reconcile)
Relationships of parts of Sequence,
resources to one another Place in hierarchy • Choose FAIR-compliant repositories
• Carefully fill in forms
• Publish metadata (along with data)
ELEMENTS TO BUILD (YOUR OWN) METADATA • Allocate / Add PIDs
• Interlink your and others’ PIDs
FORMAT, TECHNICAL, CONTENT • Hand-over for long-term preservation
INTERCHANGE STANDARDS MODEL

exif, IPTC, instrumentation specific Be systematic & consistent


ISA (Investigation-Study-
standards… Assay) framework, Force11
software citation principles
USEFUL RESOURCES
NORMS, STANDARDS,
REFERENCES www.dcc.ac.uk/resources/metadata-standards
STRUCTURE fairsharing.org / bartoc.org / lov.linkeddata.es
ISO 8601, ISO 639-1, ISO 3166-1, STANDARDS &
thesauri, vocabularies, authorities… SCHEMAS

INSPIRE, SDMX, Darwin TRY THEM OUT [4]


Metadata and metadata standards Core, Dublin Core, PROV
creation, adoption and maintenance is a Source code citation[4]: CodeMeta metadata
model, Datacite… generator
joint effort within and between interest-
based communities Dataset citation[5]: Datacite metadata generator

Credits and sources


[1] merriam-webster.com/dictionary/metadata [4] codemeta.github.io/codemeta-generator/
[2] framework.niso.org/24.html [5] github.com/mpaluch/datacite-metadata-generator
[3] go.epfl.ch/rdm-fastguide02

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #6 DATA
CODE AS DATA MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

As for data, when working with code, good management practices are needed.
The publication of code is crucial to understand, validate, reuse and repeat the research.

VERSIONING
Versioning systems are powerful code management tools.
The best-known is Git, it's free and open. It allows to:
 manage different versions of your code, and track and undo changes as needed;
 automatically back up code and its changes, being connected to a repository;
 collaboratively work in a team on the same code.

SHARING
To share code and make it visible, repositories provide various services like versioning
systems, wikis, task management, and issues tracking. One of the most used is Github.
 EPFL provides c4science.ch for code versioning. Data are stored in Switzerland.
 EPFL provides gitlab.epfl.ch (GitHub open-source alternative). Data are stored at EPFL.

DESCRIBING
The README file [1] is fundamental for coding documentation. It allows you to explain your
code, to yourself and others. On any publication of the code, you should add documentation
and rich metadata (e.g., README.txt, license.md, parameter files, comments on code [2],...).
Tools like sphinx-doc.org or doxygen.nl help by generating preformatted documentation.

LICENSING
As for data, it is important to explain how your code can be used or cited by others
(considering related restrictions). Instead of Creative Commons licenses as for data, for code it
is recommended to use software-specific licenses, such as:
 Open source licenses (permissive as MIT [3] or GPL [4]); Check the
 Academic licenses (restrict commercial usage); Fast Guide #12:
 Commercial licenses (reserve commercial usage); DATA & CODE
LICENSING [5]

PUBLISHING
Don't forget to generate a DOI to uniquely identify a version of your software and to make it
easily citable. Most data repositories [6] automatically generate a DOI for your code release.
TIP: Github provides a quick and easy integration with Zenodo [7]

PRESERVING
Preservation is important for keeping your work secure and also for scientific validation.
 c4science is an EPFL solution to preserve your code as a backup solution.
 If you use another code repository, you can always copy it on c4science.
 For the longer term, a generic data repository (like Zenodo ) is appropriate.

Credits and sources [4] opensource.org/licenses/gpl-license


[1] go.epfl.ch/rdm-readme [5] go.epfl.ch/rdm-fastguide12
[2] peps.python.org/pep-0008/#comments [6] go.epfl.ch/datarepo
[3] opensource.org/licenses/MIT [7] docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #7 DATA
ELN – ELECTRONIC LAB NOTEBOOK MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

DEFINITION
Are you a handwriter? An Electronic Lab Notebook (ELN) is a software, local or web-based, that replicates
a paper lab notebook and provides advanced functionalities.

Many interfaces allow


to write notes by ELN Paper notebook
using a tablet with a
Easy of transmission
stylus. You can whether store the
notes as digital canvas or convert Automated back-up
them to searchable text.
Team work uniformization

Interoperability

Storage and accessibility

Security Risk of hacking Risk of stealing


When considering an ELN for
Formats/tools may
your lab, ask yourself the Long term preservation
become obsolete *
following questions… Free of charge Open source available

You can always save your ELN in common and open formats such as CSV or PDF/A
*
PRACTICAL
INTERFACE
 Are the storage method and location adequate?
 If our ELN is cloud-based, where is data hosted  Do we find the interface suitable for our group?
 Who can access the ELN and its underlying data?  Is it compatible with mobile devices?
 Do we need a single machine where the ELN is installed?  Do we need a sample/laboratory management?
 How does it fit in our research group's workflow?
 Is the technical support online, onsite, via hotline, ...?
 Is the ELN business plan the best for our group? INTEROPERABILITY

 Is the ELN compatible with other software we use?


IMPORT-EXPORT
 Can we integrate other cloud software
 Can we import our previous notes? (SWITCHdrive, GitHub, Zotero, ...) with the ELN?
 Can we export our data in an open way?  Is compatibility limited to the import of data we generate?
 What are the import and export options, is there an API?  Can we integrate new services with the ELN?
 What are the import and export formats?  Can we use repositories from within the ELN?
 Do we have data volume limitations?  Data [2]: Zenodo, MaterialsCloud, Figshare,…
 Do we need proprietary software to reuse the data?  Code [3]: C4science, EPFL GitLab, …

 SLIMS: Commercial solution (EPFL spinoff) for a Laboratory Information


EPFL ELN [1]: Chemistry Notebook, Management System (LIMS), with ELN functionalities, integrating sample
management and different services for life scientists [4]
developed internally at EPFL
 ELN comparison table from the Harvard Medical School [5]

Credits and sources [3] c4science.ch & gitlab.epfl.ch


[1] eln.epfl.ch [4] agilent.com/en/product/software-informatics/lab-workflow-management-software/slims
[2] zenodo.org & materialscloud.org & figshare.com [5] zenodo.org/record/472375

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #8 DATA
PERSONAL DATA MANAGEMENT MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

DEFINITIONS

Sensitive personal data is personal data


Personal data is “all information relating to an “revealing racial or ethnic origin, political opinions,
identified or identifiable person” [1] religious or philosophical beliefs; trade union
Examples: Name, Date of birth, Address, Photos, membership; genetic data, biometric data
Videos, IP address, GPS coordinates, Telephone processed solely to identify a human being;
number, Credit card number, Number plate,… health-related data; data concerning a person's
sex life or sexual orientation” [2]

FEDERAL ACT ON DATA GENERAL DATA PROTECTION


PROTECTION (FADP) [3] REGULATION (GDPR) [6]

Applies to projects conducted in Switzerland, with Applies to projects involving personal data of subjects
additional laws for research involving human beings who are in the EU, with some derogations for scientific or
(Human Research Act) [4] statistical purposes (art.89)
Principles: Good faith, Lawfulness, Proportionality, Principles: Lawfulness, Data minimization, Accuracy,
Exactitude, Security Storage limitation, Integrity, Transparency, Privacy-by-
● Data collected on the internet [5] is still submitted to design, Confidentiality, Accountability
restrictions, even if published by the subjects ● Keep a description of how you will implement the
● Hash the identifiers if the project goals can be reached principles
without them, and restricted access right to the ● If data processing or storage are outsourced, document
pseudomisation key external services’ GDPR compliance
● You need to assess the risk of reidentification ● In the event of a data breach, notify the VPSI or the DSPS
● Inform the subjects about the contact details of your immediately
unit, the purposes of your data collection, the recipients of ● Inform subjects of their rights to modify their data, restrict
the personal data, their right to access their personal the use and withdraw their participation, as well as
data, and the likely consequences if they refuse to extensive information about data collection/processing
provide their personal data ● Provide a Data Protection Impact Assessment (DPIA) [7]
● Anonymized data received from a third party still requires if the project may result in a high risk, i.e. if it involves data
the subject to be informed of this new use processed on a large scale, innovative use of data,
● Legal consent for subjects under 18 years is required to sensitive data, vulnerable subjects, data transfers outside
collect their data the EU, ...
● Personal data can be published only if the subject ● Any transfer of personal data abroad is only guaranteed
consents to publication, but in no case if they are for transfers to countries [8,9] whose legislation ensures an
sensitive adequate level of protection
● Guarantee these subjects’ minimal rights: rights of ● Guarantee subjects’ minimal rights: rights of access,
access, modification, erasure rectification, portability, objection, erasure

Revised Federal Act on Data Protection (FADP) in 2020 Applicable as of May 2018
Expected to enter into force in September 2023 [8]
Any doubt?
Contact the EPFL Human Research
Credits and sources
[1] admin.ch/opc/en/classified-compilation/19920153/index.html#a3
Ethics Committee [10]
[2] ec.europa.eu/info/law/law-topic/data-protection/reform/rules-business-
and-organisations/legal-grounds-processing-data/sensitive-data/what-
personal-data-considered-sensitive_en / [7] ec.europa.eu/newsroom/article29/item-detail.cfm?item_id=611236
[3] https://www.fedlex.admin.ch/eli/cc/1993/1945_1945_1945/en / [8] edoeb.admin.ch/dam/edoeb/fr/dokumente/2018/staatenliste.pdf.
[4] admin.ch/opc/en/classified-compilation/20061313/index.html download.pdf/20181213_Staatenliste_f.pdf
[5] edoeb.admin.ch/edoeb/fr/home/protection-des- [9] ec.europa.eu/info/law/law-topic/data-protection/international-dimension-data-
donnees/Internet_und_Computer/services-en-ligne/medias-sociaux.html protection/adequacy-decisions_en
[6] https://gdpr-info.eu/ [10] https://www.bj.admin.ch/bj/fr/home/staat/gesetzgebung/datenschutzstaerkung.html

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #9 DATA
DATA MASKING MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

DEFINITION [1] ADVANTAGES APPLICABILITY


Why it is worth Tests on humans / sensitive data
 Complies with law  Name, identification number, location
Data masking, also called
 Makes data sharable data, online identifier, etc.
data obfuscation, is the
 Prevents data misuse  Factors specific to the physical,
process of hiding original
 Makes data publishable physiological, genetic, mental, economic,
data with modified content
cultural or social identity

PSEUDOANONYMIZATION

 FOR ACTIVE DATA


≠ ANONYMIZATION

 FOR PUBLISHED DATA


 REVERSIBLE  IRREVERSIBLE

● REPLACING ● REMOVING
Replace data by identifiers. Store the key separately and Suppress data or part of the outlier records. Appropriate
securely. for processing identifiers.
● ENCRYPTING ● GENERALIZING
Encrypt the data and store the key securely. Appropriate Diminish granularity by generalizing the variables.
for long-term preservation, not for data publishing. Appropriate for data too specific or unique records.
● SHUFFLING
UTILITY PROTECTION Shuffle data over one / several columns without
HINT
compromising their utility.
Mitigate the
identification risk,
● FAKING
but preserve the Prevent the identification of specific records, adding
data utility for fake data while preserving correlations.
RESEARCH DATA research.

Do you deal with


CHECK THESE TOOLS
personal data?
MASK IDENTITY OR ASSESS IDENTIFICATION RISKS
Check the Fast Guide #8:
 GRAASP insights [2] PERSONAL DATA MANAGEMENT [12]
 ARX Data Anonymization Tool (Java) [3]
 Amnesia [4]
 ARGUS (Java) [5]
 sdcMicro (R) [6] EPFL Research Ethics [13]
 Differential privacy queries (SQL) [7]
 Faker (Python) [8] Federal Act on Data Protection (FADP) [14]
 OpenPseudonymiser [9]
 AES Crypt [10] Human Research Act (HRA) [15]
General Data Protection Regulation (GDPR) [16]
Credits and sources
[1] en.wikipedia.org/wiki/Data_masking [8] hfaker.readthedocs.io/en/master/
[2] insights.graasp.org [9] openpseudonymiser.org
[3] arx.deidentifier.org [10] www.aescrypt.com
[4] amnesia.openaire.eu/ [12] go.epfl.ch/rdm-fastguide08
[5] qosient.com/argus/ anonymization.shtml [14] https://www.fedlex.admin.ch/eli/cc/1993/1945_1945_1945/en
[6] https://cran.r-project.org/web/packages/sdcMicro/index.html [15] admin.ch/opc/en/classified-compilation/20061313/index.html
[7] github.com/uber/sql-differential-privacy [16] https://gdpr-info.eu

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #10 DATA
STORE, PUBLISH, PRESERVE DATA MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

KEEP IN MIND: Store ≠ Backup ≠ Preserve ≠ Publish ≠ Archive

Who is involved?
RESEARCH DATA  Research teams  Research partners
 Institutions  Private partners
 Raw Data
Active data

 Processed Data  Funders  IT service providers


 Metadata
 Codes / Algorithms
 Virtual machines
PUBLISHING CONDITIONS
 Data ownership
STORAGE
 Stakeholders consent

 EPFL NAS [1]  Compliance with protection laws


Laboratory NAS Ensuring data integrity
Active data

 
 Cloud solutions  Providing appropriate metadata
 Shared databases
 Clarifying reuse licensing
 ELN / LIMS [2]
RE-USE

 Data management systems  Setting up embargoes (if needed)

PUBLISHING PRESERVING CRITERIA

 Data repositories [3]  Historical and scientific data value


Cold data

 Journals servers  Data quality and uniqueness


 Preprints  Reliability of sources
 Data papers
 Data preparation cost
 Data citation mechanisms
 Repository and maintenance cost
 Deposit responsibility

PRESERVATION

 Data repositories [3] HOW LONG TO PRESERVE?


 Post-processed, curated data
Cold data

 At least 10 years for the SNSF [5]


 Archive-ready format converted
files  Evaluate preserving criteria
 EPFL ACOUA [4]  Mind any retention and disposal schedules
(ACademic OUtput Archive)
 Stick to administrative and legal requirements

Credits and sources [3] go.epfl.ch/datarepo


[1] go.epfl.ch/epfl-nas [4] go.epfl.ch/acoua
[2] go.epfl.ch/rdm-fastguide07 [5] snf.ch/en/dMILj9t4LNk8NwyR/topic/open-research-data (FAQ)

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #11 DATA
DMP – DATA MANAGEMENT PLAN MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

DEFINITION What is data life cycle?


A DMP - Data Management Plan - is a document describing how data and Check the
code of a research project are managed during their life-cycle Fast Guide #1 :
RESEARCH DATA:
THE BASICS [1]
WHY

 COMPLIANCY Requested by research funders (public or private), a FOR WHOM?


DMP enhances research reproducibility and the use of public funds.
Especially for YOURSELF and YOUR TEAM, but
 TRANSPARENCY Usually published when the funding period ends, in practice, many funders now require a DMP.
a DMP completes the research results with the information on data, Here is a non-exhaustive list:
software, protocols, sources, etc.
 SNSF
 FORECAST To anticipate costs (materials and software) and identify  EPFL (some internal projects)
risks (eg. data loss, incompatible formats, security). DMPs allow  EC (ERC, FET, MSCA, H2020, ...)
institutions to better allocate services.  AXA Research Fund
 STREAMLINE To reduce risks of data loss and the efforts of reverse  U.S. Federal Grants
engineering for new collaborators. A DMP boosts data reuse in the lab  Wellcome Trust
and outside.  Ligue Vaudoise Contre le Cancer
 CCR-pro
Target the reproducibility of research results!
Anticipate questions about data in your projects.

WHAT
SOME TEMPLATES [5]
EPFL DMP
 DESCRIPTION Data types, formats, size. For EPFL-funded projects or if no other
template is provided, aligned with
 COLLECTION Sources, experiments, analysis, simulations. EPFL recommendations.
 CURATION Metadata, naming, datasets structures. SNSF DMP
Based on the SNSF Open Research
 STORAGE Active data, sharing tools, backup, preservation.
Data Policy, with additional guiding
 RISKS Access rights, anonymization, ethics assessment. examples.

 PUBLICATION Data repositories [2], IP (ex. data licenses [3]). ERC DMP
Based on the FAIR principles, with
 COSTS For RDM: refer to Fast Guide #03 [4]. additional guiding examples.
Not just administrative hurdle! Use your DMP as Horizon Europe DMP (or MSCA)
reference tool for in-lab discussions & decisions. Also for Marie Skłodowska-Curie
Actions’ applicants.
WHEN
NCCR RDM STRATEGY
Rather than a single project, it
 IDEALLY At the conception of your research project. describes the data management for all
the projects of a NCCR
 USUALLY When requesting funds.

 REALLY ASAP, but it is never too late. EPFL RDM STRATEGY


Based on the NCCR RDM Strategy,
The DMP is a living document! Keep it up-to-date though mostly targeted to single
throughout the project and secure it at the end. research groups.

Credits and sources [3] go.epfl.ch/rdm-fastguide12


[1] go.epfl.ch/rdm-fastguide01 [4] go.epfl.ch/rdm-fastguide03
[2] go.epfl.ch/datarepo [5] go.epfl.ch/rdm-guide

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #12 DATA
DATA & CODE LICENSING MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829

LICENSING APPLIES TO… AT EPFL?


EPFL owns the original data & code,
Collected, processed,
Original work, or from but the authors can use it for
Both data and code aggregated, augmented
another researcher
data / code research and IP [1, Art. 36]

WHY?
Licensing data & code is key for The protection of data by law
 Collaboration with other researchers, research groups, institutes is not harmonized internationally
but varies depending
 Reuse of others’ work, to generate new information or data/code on the specific country.
 Clarify ownership and authorship
 Differentiate authorized use of original vs. derivative work Licenses do not all have the same
international recognition.
 Share / Publish your work with clear usage rights

LICENSES FOR DATA LICENSES FOR CODE [7]


CC-Zero No restrictions [2] MIT Short term, Permissive, No warranty

CC-BY Mandatory citation [2] APACHE Permissive, Patents allowed, No warranty

CC-BY-SA Mandatory citation, Share Alike (Viral) [2] BSD All code by one organization, GPL mix not allowed

CC-BY-ND Mandatory citation, No Modifications [2] GPL Copyleft license, Patents allowed, Viral

CC-BY-NC Mandatory citation, Non Commercial [2] LGPL Libraries Sharing, Licenses mix allowed

CC-BY-NC-SA Mandatory citation, Non Commercial, Viral [2] AGPL Strong copyleft, Patents allowed, Viral

ODbL Open Access specific for databases [3]


Microdata
For unit-level data (i.e. sets of records
SHARING OR REUSING? [8]
Research
containing on individual respondent) [4,5] Data licenses define conditions about:
License
 Data ownership and use
 The treatment of original and derived data

Not sure where to start? This is important for researchers who:


This chooser helps you determine which Creative  Receive or collect and compile data from another researcher
Commons License is right for you in a few easy steps [6]  Generate information or data from the other researcher's data
on the other researcher's behalf or on their own behalf.

Any doubt? Moreover, a researcher may want to:


Contact the EPFL Technology Transfer Office [9]  Analyze/reuse another researcher’s data
 Process/aggregate data for own research, using the
Credits and sources processed data
[1] admin.ch/opc/en/classified-compilation/19910256/index.html#a36  Licensing the processed, aggregated, augmented data from
[2] creativecommons.org
[3] opendatacommons.org/licenses/odbl another researcher
[4] microdata.worldbank.org/index.php/terms-of-use
[5] ec.europa.eu/eurostat/cros/content/microdata-access_en
[6] chooser-beta.creativecommons.org
[7] choosealicense.com/appendix
[8] doi.org/10.1371/journal.pbio.1002235
[9] epfl.ch/research/services/units/technology-transfer-office

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA
EPFL LIBRARY
RESEARCH
FAST GUIDE #13 DATA
DATA/CODE PUBLICATION PLATFORMS MANAGEMENT

v3 – 2023
DOI: 10.5281/zenodo.3327829
WHY?
Need to anonymize
Publishing your research data/code might depend on many factors:
before publication?
 Research ethics (transparency, accountability, reproducibility)
 Funder’s request opening data/code (SNSF, Horizon Europe, etc.)
Check the Fast Guide #9:
 Your research group’s policy about publication
Data Masking [1]
 Increase citations (data and code as academic output)

WHERE TO PUBLISH DATA/CODE?

Check the most-used platforms adopted by EPFL researchers in this comparative


Data platforms dissemination table [2]. It includes platforms for publication, as
well as preservation of data and code, and for different scientific fields.

CRITERIA TO SELECT A PLATFORM Simple choice suggestion


Filter the comparative table for a platform that is:
Institutional / External
 Suggested by SNSF [3] and the EC [4]
Data or Code-oriented / Content neutral
 Trusted, listed by re3data.org
Discipline specific / Discipline neutral  Hosted in CH or EU
 Able to provide a DOI
Offers persistent identifiers (ex. DOI)  Maintained by a non-profit organization

Supports personal identifiers (ex. ORCID)  Used by your research community

Cost is sustainable

Servers' location compliant with laws


Publishing your dataset on Zenodo?
Suggested by funders Increase the findability of your work: select
the EPFL Community in the upload page [5]
Max size allowed
How to make your GitHub code citable?
Choice of licenses
Follow GitHub’s documentation[6] to make a
persistent snapshot of your code on Zenodo
Specific: ex. preview, metadata, community, etc.

5-STAR SCHEME FOR OPEN DATA [7]


Make your stuff available on the Web (whatever format) under an open license
“As open
Make it available as structured data (ex. Excel instead of scan of a table) as possible,
Make it available in non-proprietary open format (ex. CSV instead of Excel) as restricted
Use URLs to denote things, so that people can point at your stuff as necessary”
Karel Luyben,
Link your data to other data to provide context president of the EOSC [8]

Credits and sources


[1] go.epfl.ch/rdm-fastguide09 / [5] zenodo.org/communities/epfl
[2] go.epfl.ch/datarepo / [6] guides.github.com/activities/citable-code
[3] snf.ch/en/WtezJ6qxuTRnSYgF/topic/open-research-data-which-data-repositories-can-be-used / [7] https://5stardata.info
[4] open-research-europe.ec.europa.eu/for-authors/data-guidelines#:~:text=2.2.%20Select%20a%20Repository [8] doi.org/10.5281/zenodo.6807345

Contact and info


go.epfl.ch/rdm researchdata@epfl.ch BY - NC - SA

You might also like