PDF Data and Text Processing For Health and Life Sciences Francisco M Couto Ebook Full Chapter

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 53

Data and Text Processing for Health

and Life Sciences Francisco M. Couto


Visit to download the full and correct content document:
https://textbookfull.com/product/data-and-text-processing-for-health-and-life-sciences-
francisco-m-couto/
More products digital (pdf, epub, mobi) instant
download maybe you interests ...

Visuospatial Processing for Education in Health and


Natural Sciences Juan C. Castro-Alonso

https://textbookfull.com/product/visuospatial-processing-for-
education-in-health-and-natural-sciences-juan-c-castro-alonso/

Introduction to Biological Physics for the Health and


Life Sciences 2nd Edition Kirsten Franklin

https://textbookfull.com/product/introduction-to-biological-
physics-for-the-health-and-life-sciences-2nd-edition-kirsten-
franklin/

Data Mining Techniques for the Life Sciences 2nd


Edition Oliviero Carugo

https://textbookfull.com/product/data-mining-techniques-for-the-
life-sciences-2nd-edition-oliviero-carugo/

Imaging Technologies and Data Processing for Food


Engineers Sozer

https://textbookfull.com/product/imaging-technologies-and-data-
processing-for-food-engineers-sozer/
Food and nutrition economics : fundamentals for health
sciences First Edition Davis

https://textbookfull.com/product/food-and-nutrition-economics-
fundamentals-for-health-sciences-first-edition-davis/

Statistics for the Life Sciences 5th Edition Samuels

https://textbookfull.com/product/statistics-for-the-life-
sciences-5th-edition-samuels/

Statistics for the Life Sciences Myra L. Samuels

https://textbookfull.com/product/statistics-for-the-life-
sciences-myra-l-samuels/

Data Pipelines Pocket Reference: Moving and Processing


Data for Analytics 1st Edition James Densmore

https://textbookfull.com/product/data-pipelines-pocket-reference-
moving-and-processing-data-for-analytics-1st-edition-james-
densmore/

Biocalculus Calculus Probability and Statistics for the


Life Sciences 1st Edition James Stewart

https://textbookfull.com/product/biocalculus-calculus-
probability-and-statistics-for-the-life-sciences-1st-edition-
james-stewart/
Advances in Experimental Medicine and Biology 1137

Francisco Couto

Data and Text


Processing for
Health and Life
Sciences
Advances in Experimental Medicine
and Biology

Volume 1137

Editorial Board
IRUN R. COHEN, The Weizmann Institute of Science, Rehovot, Israel
ABEL LAJTHA, N.S. Kline Institute for Psychiatric Research,
Orangeburg, NY, USA
JOHN D. LAMBRIS, University of Pennsylvania, Philadelphia, PA, USA
RODOLFO PAOLETTI, University of Milan, Milano, Italy
NIMA REZAEI, Tehran University of Medical Sciences,
Children’s Medical Center Hospital, Tehran, Iran
More information about this series at http://www.springer.com/series/5584
Francisco M. Couto

Data and Text


Processing for Health
and Life Sciences

123
Francisco M. Couto
LASIGE, Department of Informatics
Faculdade de Ciências, Universidade de Lisboa
Lisbon, Portugal

ISSN 0065-2598 ISSN 2214-8019 (electronic)


Advances in Experimental Medicine and Biology
ISBN 978-3-030-13844-8 ISBN 978-3-030-13845-5 (eBook)
https://doi.org/10.1007/978-3-030-13845-5

© The Editor(s) (if applicable) and The Author(s) 2019. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence and indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative
Commons licence, unless indicated otherwise in a credit line to the material. If material is not
included in the book’s Creative Commons licence and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly
from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in
this book are believed to be true and accurate at the date of publication. Neither the publisher
nor the authors or the editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been made. The publisher remains
neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Aos meus pais, Francisco de Oliveira Couto e
Maria Fernanda dos Santos Moreira Couto.
Preface

During the last decades, I witnessed the growing importance of computer


science skills for career advancement in Health and Life Sciences. However,
not everyone has the skill, inclination, or time to learn computer program-
ming. The learning process is usually time-consuming and requires constant
practice, since software frameworks and programming languages change
substantially overtime. This is the main motivation for writing this book about
using shell scripting to address common biomedical data and text processing
tasks. Shell scripting has the advantages of being: (i) nowadays available
in almost all personal computers; (ii) almost immutable for more than four
decades; (iii) relatively easy to learn as a sequence of independent commands;
(iv) an incremental and direct way to solve many of the data problems that
Health and Life professionals face.
During the last decades, I had the pleasure to teach introductory computer
science classes to Life and Health and Life Sciences undergraduates. I used
programming languages, such as Perl and Python, to address data and text
processing tasks, but I always felt to lose a substantial amount of the time
teaching the technicalities of these languages, which will probably change
over time and are uninteresting for the majority of the students who do not
intend to pursue advanced bioinformatics courses. Thus, the purpose of this
book is to motivate and help specialists to automate common data and text
processing tasks after a short learning period. If they become interested (and
I hope some do), the book presents pointers to where they can acquire more
advanced computer science skills.
This book does not intend to be a comprehensive compendium of shell
scripting commands but instead an introductory guide for Health and Life
specialists. This book introduces the commands as they are required to
automate data and text processing tasks. The selected tasks have a strong
focus on text mining and biomedical ontologies given my research experience
and their growing relevance for Health and Life studies. Nevertheless, the
same type of solutions presented in the book are also applicable to many
other research fields and data sources.

Lisboa, Portugal Francisco M. Couto


January 2019

vii
Acknowledgments

I am grateful to all the people who helped and encouraged me along this
journey, especially to Rita Ferreira for all the insightful discussions about
shell scripting.
I am also grateful for all the suggestions and corrections given by my
colleague Prof. José Baptista Coelho and by my college students: Alice
Veiros, Ana Ferreira, Carlota Silva, Catarina Raimundo, Daniela Matias, Inês
Justo, João Andrade, João Leitão, João Pedro Pais, Konil Solanki, Mariana
Custódio, Marta Cunha, Manuel Fialho, Miguel Silva, Rafaela Marques,
Raquel Chora and Sofia Morais.
This work was supported by FCT through funding of DeST: Deep Seman-
tic Tagger project, ref. PTDC/CCI-BIO/28685/2017 (http://dest.rd.ciencias.
ulisboa.pt/), and LASIGE Research Unit, ref. UID/CEC/00408/2019.

ix
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Biomedical Data Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Scientific Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Amount of Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Ambiguity and Contextualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Biomedical Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Programming Skills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Why This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Third-Party Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Simple Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
How This Book Helps Health and Life Specialists? . . . . . . . . . . . . . . . 5
Shell Scripting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Text Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Relational Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
What Is in the Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Command Line Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Biomedical Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
What? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Where? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
How? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
What? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Where? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
How? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Data Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Caffeine Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Unix Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Current Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Windows Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Change Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Useful Key Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

xi
xii Contents

Shell Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Data File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
File Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Reverse File Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
My First Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Line Breaks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Redirection Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Installing Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Debug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Save Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Web Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Single and Double Quotes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Data Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Standard Error Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Data Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Single and Multiple Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Data Elements Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Task Repetition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Assembly Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
File Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
XML Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Human Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
PubMed Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
PubMed Identifiers Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Duplicate Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Complex Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
XPath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Namespace Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Only Local Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Extracting XPath Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Text Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Publication URL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Title and Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Disease Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Text Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Case Insensitive Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Number of Matches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Invert Match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
File Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Word Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Contents xiii

Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Extended Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Alternation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Multiple Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Quantifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Beginning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Ending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Near the End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Word in Between . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Full Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Match Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Character Delimiters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Wrong Tokens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
String Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Multi-character Delimiters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Keep Delimiters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Sentences File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Entity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Select the Sentence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Pattern File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Relation Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Multiple Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Relation Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Remove Relation Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Semantic Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
OWL Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Class Label . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Class Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Related Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
URIs and Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
URI of a Label . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Label of a URI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Synonyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
URI of Synonyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Parent Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Labels of Parents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Related Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Labels of Related Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Ancestors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Grandparents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Root Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
xiv Contents

My Lexicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Ancestors Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Merging Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Ancestors Matched . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Generic Lexicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
All Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Problematic Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Special Characters Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Removing Special Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Removing Extra Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Removing Extra Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Disease Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Inverted Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Case Insensitive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
ASCII Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Correct Matches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Incorrect Matches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Modified Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Surrounding Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
DiShIn Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Database File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
DiShIn Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Large Lexicons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
MER Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Lexicon Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
MER Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Acronyms

ChEBI Chemical Entities of Biological Interest


CSV Comma-Separated Values
cURL Client Uniform Resource Locator
DAG Directed Acyclic Graph
DBMS Database Management System
DiShIn Semantic Similarity Measures using Disjunctive Shared
Information
DO Disease Ontology
EBI European Bioinformatics Institute
GO Gene Ontology
HTTP Hypertext Transfer Protocol
HTTPS HTTP Secure
ICD International Classification of Diseases
MER Minimal Named-Entity Recognizer
MeSH Medical Subject Headings
NCBI National Center for Biotechnology Information
NER Named-Entity Recognition
OBO Open Biological and Biomedical Ontology
OWL Web Ontology Language
PMC PubMed Central
RDFS RDF Schema
SNOMED CT Systematized Nomenclature of Medicine – Clinical Terms
SQL Structured Query Language
TSV Tab-Separated Values
UMLS Unified Medical Language System
UniProt Universal Protein Resource
URI Uniform Resource Identifier
URL Uniform Resource Locator
XLS Microsoft Excel file format
XML Extensible Markup Language

xv
Introduction
1

Abstract
Health and Life studies are well known
Biomedical Data Repositories
for the huge amount of data they produce,
Fortunately, a significant portion of the
such as high-throughput sequencing projects
biomedical data is already being collected,
(Stephens et al., PLoS Biol 13(7):e1002195,
integrated and distributed through Biomed-
2015; Hey et al., The fourth paradigm:
ical Data Repositories, such as European
data-intensive scientific discovery, vol 1.
Bioinformatics Institute (EBI) and National
Microsoft research Redmond, Redmond,
Center for Biotechnology Information (NCBI)
2009). However, the value of the data should
repositories (Cook et al. 2017; Coordinators
not be measured by its amount, but instead
2018). Nonetheless, researchers cannot rely on
by the possibility and ability of researchers to
available data as mere facts, they may contain
retrieve and process it (Leonelli, Data-centric
errors, can be outdated, and may require a
biology: a philosophical study. University of
context (Ferreira et al. 2017). Most facts are only
Chicago Press, Chicago, 2016). Transparency,
valid in a specific biological setting and should
openness, and reproducibility are key aspects
not be directly extrapolated to other cases. In
to boost the discovery of novel insights
addition, different research communities have
into how living systems work (Nosek et al.,
different needs and requirements, which change
Science 348(6242):1422–1425, 2015).
over time (Tomczak et al. 2018).

Keywords Scientific Text


Bioinformatics · Biomedical data
repositories · Text files · EBI: European Structured data is what most computer applica-
Bioinformatics Institute · Bibliographic tions require as input, but humans tend to prefer
databases · Shell scripting · Command line the flexibility of text to express their hypoth-
tools · Spreadsheet applications · CSV: esis, ideas, opinions, conclusions (Barros and
comma-separated values · TSV: tab-separated Couto 2016). This explains why scientific text
values is still the preferential means to publish new

© The Author(s) 2019 1


F. M. Couto, Data and Text Processing for Health and Life Sciences,
Advances in Experimental Medicine and Biology 1137,
https://doi.org/10.1007/978-3-030-13845-5_1
2 1 Introduction

discoveries and to describe the data that support biological concepts or entities (homonyms). For
them (Holzinger et al. 2014; Lu 2011). Another example, many times authors improve the read-
reason is the long-established scientific reward ability of their publications by using acronyms to
system based on the publication of scientific mention entities, that may be clear for experts on
articles (Rawat and Meena 2014). the field but ambiguous in another context.
The second problem is the complexity of the
message. Almost everyone can read and under-
Amount of Text stand a newspaper story, but just a few can really
understand a scientific article. Understanding the
The main problem of analyzing biomedical text underlying message in such articles normally
is the huge amount of text being published every requires years of training to create in our brain
day (Hersh 2008). For example, 813,598 cita- a semantic model about the domain and to know
tions1 were added in 2017 to MEDLINE, a bibli- how to interpret the highly specialized terminol-
ographic database of Health and Life literature2 . ogy specific to each domain. Finally, the mul-
If we read 10 articles per day, it will take us takes tilingual aspect of text is also a problem, since
more than 222 years to just read those articles. most clinical data are produced in the native
Figure 1.1 presents the number of citations added language (Campos et al. 2017).
to MEDLINE in the past decades, showing the
increasing large amount of biomedical text that
researchers must deal with. Biomedical Ontologies
Moreover, scientific articles are not the only
source of biomedical text, for example clinical To address the issue of ambiguity of natural
studies and patents also provide a large amount language and contextualization of the message,
of text to explore. They are also growing at a fast text processing techniques can explore current
pace, as Figs. 1.2 and 1.3 clearly show (Aras et al. biomedical ontologies (Robinson and Bauer
2014; Jensen et al. 2012). 2011). These ontologies can work as vocabularies
to guide us in what to look for (Couto et al.
2006). For example, we can select an ontology
that models a given domain and find out which
Ambiguity and Contextualization official names and synonyms are used to mention
concepts in which we have an interest (Spasic
Given the high flexibility and ambiguity of natu-
et al. 2005). Ontologies may also be explored
ral language, processing and extracting informa-
as semantic models by providing semantic
tion from texts is a painful and hard task, even
relationships between concepts (Lamurias et al.
to humans. The problem is even more complex
2017).
when dealing with scientific text, that requires
specialized expertise to understand it. The major
problem with Health and Life Sciences is the in- Programming Skills
consistency of the nomenclature used for describ-
ing biomedical concepts and entities (Hunter and The success of biomedical studies relies on over-
Cohen 2006; Rebholz-Schuhmann et al. 2005). In coming data and text processing issues to take the
biomedical text, we can often find different terms most of all the information available in biomed-
referring to the same biological concept or entity ical data repositories. In most cases, biomedical
(synonyms), or the same term meaning different data analysis is no longer possible using an in-
house and limited dataset, we must be able to
efficiently process all this data and text. So, a
1 https://www.nlm.nih.gov/bsd/index_stats_comp.html common question that many Health and Life
2 https://www.nlm.nih.gov/bsd/medline.html specialists face is:
Ambiguity and Contextualization 3

Fig. 1.1 Chronological listing of the total number of citations in MEDLINE (Source: https://www.nlm.nih.gov/bsd/)

Fig. 1.2 Chronological listing of the total number of registered studies (clinical trials) (Source: https://clinicaltrials.
gov)
4 1 Introduction

Fig. 1.3 Chronological listing of the total number of patents in force (Source: WIPO statistics database http://www.
wipo.int/ipstats/en/)

be impenetrable to the common Health and Life


How can I deal with such huge amount of specialists and usually become outdated or even
data and text the necessary expertise, time unavailable some time after their publication or
and disposition to learn computer program- the financial support ends. Instead, this book will
ming? equip the reader with a set of skills to process text
with minimal dependencies to existing tools and
technologies. The idea is not to explain how to
This is the goal of this book, to provide a low-
build the most advanced tool, but how to create
cost, long-lasting, feasible and painless answer to
a resilient and versatile solution with acceptable
this question.
results.
In many cases, advanced tools may not be
most efficient approach to tackle a specific prob-
Why This Book? lem. It all depends on the complexity of problem,
and the results we need to obtain. Like a good
State-of-the-art data and text processing tools physician knows that the most efficient treatment
are nowadays based on complex and sophisti- for a specific patient is not always the most
cated technologies, and to understand them we advanced one, a good data scientist knows that
need to have special knowledge on program- the most efficient tool to address a specific infor-
ming, linguistics, machine learning or deep learn- mation need is not always the most advanced one.
ing (Holzinger and Jurisica 2014; Ching et al. Even without focusing on the foundational basis
2018; Angermueller et al. 2016). Explaining their of programming, linguistics or artificial intelli-
technicalities or providing a comprehensive list gence, this book provides the basic knowledge
of them are not the purpose of this book. The and right references to pursue a more advanced
tools implementing these technologies tend to solution if required.
How This Book Helps Health and Life Specialists? 5

Third-Party Solutions ually manipulate our data using the spreadsheet


application that we already are comfortable with,
Many manuscripts already present and discuss and at the same time be able to automatize some
the most recent and efficient text mining of the repetitive tasks.
techniques and the available software solutions
based on them that users can use to process data
In summary, this book is directed mainly
and text (Cock et al. 2009; Gentleman et al. 2004;
towards Health and Life specialists and
Stajich et al. 2002). These solutions include
students that need to know how to process
stand-alone applications, web applications,
biomedical data and text, without being
frameworks, packages, pipelines, etc. A common
dependent on continuous financial support,
problem with these solutions is their resiliency
third-party applications, or advanced com-
to deal with new user requirements, to changes
puter skills.
on how resources are being distributed, and to
software and hardware updates. Commercial
solutions tend to be more resilient if they have
enough customers to support the adaptation How This Book Helps Health and
process. But of course we need the funding Life Specialists?
to buy the service. Moreover, we will be still
dependent on a third-party availability to address So, if this book does not focus on learning pro-
our requirements that are continuously changing, gramming skills, and neither on the usage of any
which vary according to the size of the company special package or software, how it will help
and our relevance as client. specialists processing biomedical text and data?
Using open-source solutions may seem a great
alternative since we do not need to allocate fund-
ing to use the service and its maintenance is as- Shell Scripting
sured by the community. However, many of these
solutions derive from academic projects that most The solution proposed in this book has been
of the times are highly active during the funding available for more than four decades (Ritchie
period and then fade away to minimal updates. 1971), and it can now be used in almost every
The focus of academic research is on creating personal computer (Haines 2017). The idea is to
new and more efficient methods and publish provide an example driven introduction to shell
them, the software is normally just a means to scripting3 that addresses common challenges in
demonstrate their breakthroughs. In many cases biomedical text processing using a Unix shell4 .
to execute the legacy software is already a non- Shells are software programs available in Unix
trivial task, and even harder is to implement operating systems since 19715 , but nowadays are
the required changes. Thus, frequently the most available is most of our personal computers using
feasible solution is to start from scratch. Linux, macOS or Windows operating systems.

Simple Pipelines But a shell script is still a computer algo-


rithm, so how is it different from learning
If we are interested in learning sophisticated and another programming language?
advanced programming skills, this is not the right
book to read. This book aims at helping Health
and Life specialists to process data and text by
describing a simple pipeline that can be executed 3 https://en.wikipedia.org/wiki/Shell_script

with minimal software dependencies. Instead of 4 https://en.wikipedia.org/wiki/Unix_shell

using a fancy web front-end, we can still man- 5 https://www.in-ulm.de/~mascheck/bourne/#origins


6 1 Introduction

It is different in the sense that most solutions


are based on the usage of single command line
tools, that sometimes are combined as simple
pipelines. This book does not intend to create
experts in shell scripting, by the contrary, the few Fig. 1.4 Spreadsheet example
scripts introduced are merely direct combinations
of simple command line tools individually ex-
(Microsoft 2003), and XLS (Microsoft 2003)
plained before.
formats. We can try to open all these files in our
The main idea is to demonstrate the ability of
favorite text editor.
a few command line tools to automate many of
When opening the CSV file, the application
the text and data processing tasks. The solutions
will show the following contents:
are presented in a way that comprehending them
is like conducting a new laboratory protocol i.e. A,C
testing and understanding its multiple procedural G,T
steps, variables, and intermediate results. Each line represents a row of the spreadsheet, and
column values are separated by commas.
When opening the TSV file, the application
Text Files will show the following contents:

All the data will be stored in text files, which A C


command line tools are able to efficiently pro- G T
cess (Baker and Milligan 2014). Text files repre- The only difference is that instead of a comma
sent a simple and universal medium of storing our it is now used a tab character to separate column
data. They do not require any special encoding values.
and can be opened and interpreted by using When opening the XML file, the application
any text editor application. Normally, text files will show the following contents:
without any kind of formatting are stored using
a txt extension. However, text files can contain ...
data using a specific format, such as: <Table ss:StyleID="ta1">
<Column ss:Span="1" ss:Width="
CSV : Comma-Separated Values6 ; 64.01"/>
TSV : Tab-Separated Values7 ; <Row ss:Height="12.81"><Cell><
XML : eXtensible Markup Language8 . Data ss:Type="String">A</Data
></Cell><Cell><Data ss:Type="
All the above formats can be open (import), String">C</Data></Cell></Row>
edited and saved (export) by any text editor appli- <Row ss:Height="12.81"><Cell><
cation. and common spreadsheet applications9 , Data ss:Type="String">G</Data
such as LibreOffice Calc or Microsoft Excel10 . ></Cell><Cell><Data ss:Type="
For example, we can create a new data file using String">T</Data></Cell></Row>
LibreOffice Calc, like the one in Fig. 1.4. Then </Table>
we select the option to save it as CSV, TSV, XML ...

6 https://en.wikipedia.org/wiki/Comma-separated_values
Now the data is more complex to find and under-
7 https://en.wikipedia.org/wiki/Tab-separated_values
stand, but with a little more effort we can check
8 https://en.wikipedia.org/wiki/XML
that we have a table with two rows, each one with
9 https://en.wikipedia.org/wiki/Spreadsheet two cells.
10 To When opening the XLS file, we will get a
save in TSV format using the LibreOffice Calc, we
may have to choose CSV format and then select as field lot of strange characters and it is humanly im-
delimiter the tab character. possible to understand what data it is storing.
What Is in the Book? 7

This happens because XLS is not a text file is a cusses what type of information they distribute,
proprietary format11 , which organizes data using where we can find them, and how we will be
an exclusive encoding scheme, so its interpreta- able to automatically explore them. Most of the
tion and manipulation could only be done using a examples in the book use the resources pro-
specific software application. vided by the European Bioinformatics Institute
Comma-separated values is a data format so (EBI) and use their services to automatically
old as shell scripting, in 1972 it was already retrieve the data and text. Nevertheless, after
supported by an IBM product12 . Using CSV or understanding the command line tools, it will
TSV enables us to manually manipulate the data not be hard to adapt them to the formats used
using our favorite spreadsheet application, and by other service provider, such as the National
at the same time use command line tools to Center for Biotechnology Information (NCBI).
automate some of the tasks. In terms of semantics, the examples will use
two ontologies, one about human diseases and
the other about chemical entities of biological
Relational Databases interest. Most ontologies share the same structure
and syntax, so adapting the solutions to other
If there is a need to use more advanced data
domains are expected to be painless.
storage techniques, such as using a relational
As an example, the Chap. 3 will describe the
database13 , we may still be able to use shell
manual steps that Health and Life specialists may
scripting if we can import and export our data
have to perform to find and retrieve biomedi-
to a text format. For example, we can open
cal text about caffeine using publicly available
a relational database, execute Structured Query
resources. Afterwards, these manual steps will
Language (SQL) commands14 , and import and
be automatized by using command line tools,
export the data to CSV using the command line
including the automatic download of data. The
tool sqlite315 .
idea is to go step-by-step and introduce how each
Besides CSV and shell scripting being al-
command line tool can be used to automate each
most the same as they were four decades ago,
task.
they are still available everywhere and are able
to solve most of our data and text processing
daily problems. So, these tools are expected to Command Line Tools
continue to be used for many more decades to
come. As a bonus, we will look like a true The main command line tools that this book will
professional typing command line instructions in introduce are the following:
a black background window !  ¨
• curl: a tool to download data and text from
the web;
What Is in the Book? • grep: a tool to search our data and text;
• gawk: a tool to manipulate our data and text;
First, the Chap. 2 presents a brief overview of
• sed: a tool to edit our data and text;
some of the most prominent resources of biomed-
• xargs: a tool to repeat the same step for
ical data, text, and semantics. The chapter dis-
multiple data items;
• xmllint: a tool to search in XML data files.
11 https://en.wikipedia.org/wiki/Proprietary_format
12 http://bitsavers.trailingedge.com/pdf/ibm/370/fortran/ Other command line tools are also presented
GC28-6884-0_IBM_FORTRAN_Program_Products_for_ to perform minor data and text manipulations,
OS_and_CMS_General_Information_Jul72.pdf
13 https://en.wikipedia.org/wiki/Relational_database
such as:
14 https://en.wikipedia.org/wiki/SQL
15 https://www.sqlite.org/cli.html • cat: a tool to get the content of file;
8 1 Introduction

• tr: a tool to replace one character by another; 2018). A regular expression is a string that in-
• sort: a tool to sort multiple lines; clude special operators represented by special
• head: a tool to select only the first lines. characters. For example, the regular expression
A|C|G|T will identify in a given string any of
the four nucleobases adenine (A), cytosine (C),
Pipelines guanine (G), or thymine (T).
Another technique introduced is tokenization.
A fundamental technique introduced in Chap. 3 It addresses the challenge of identifying the text
is how to redirect the output of a command line boundaries, such as splitting a text into sentences.
tool as input to another tool, or to a file. This en- So, we can keep only the sentences that may have
ables the construction of pipelines of sequential something we want. Chapter 4 also describes how
invocations of command line tools. Using a few can we try to find two entities in the same sen-
commands integrated in a pipeline is really the tence, providing a simple solution to the relation
maximum shell scripting that this book will use. extraction challenge18 .
Scripts longer than that would cross the line of
not having to learn programming skills.
Chapter 4 is about extracting useful informa-
tion from the text retrieved previously. The ex- Semantics
ample consists in finding references to malignant
hyperthermia in these caffeine related texts, so Instead of trying to recognize a limited list of
we may be able to check any valid relation. entities, Chap. 5 explains how can we use ontolo-
gies to construct large lexicons that include all the
entities of a given domain, e.g. humans diseases.
Regular Expressions The chapter also explains how the semantics
encoded in an ontology can be used to expand a
A powerful pattern matching technique described search by adding the ancestors and related classes
in this chapter is the usage of regular expres- of a given entity. Finally, a simple solution to
sions16 in the grep command line tool to per- the Entity Linking19 challenge is given, where
form Named-Entity Recognition (NER)17 . Regu- each entity recognized is mapped to a class in
lar expressions originated in 1951 (Kleene 1951), an ontology. A simple technique to solve the
so they are even older than shell scripting, but ambiguity issue when the same label can be
still popular and available in multiple software mapped to more than one class is also briefly
applications and programming languages (Forta presented.

16 https://en.wikipedia.org/wiki/Regular_expression
17 https://en.wikipedia.org/wiki/Named- 18 https://en.wikipedia.org/wiki/Relationship_extraction

entity_recognition 19 https://en.wikipedia.org/wiki/Entity_linking

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons
licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to
obtain permission directly from the copyright holder.
Resources
2

Abstract What?
The previous chapter presented the impor-
tance of text and semantic resources for Health In the biomedical domain, we can find text in
and Life studies. This chapter will describe different forms, such as:
what kind of text and semantic resources are
Statement: a short piece of text, normally con-
available, where they can be found, and how
taining personal remarks or an evidence about
they can be accessed and retrieved.
a biomedical phenomenon;
Abstract: a short summary of a larger scientific
Keywords
document;
Biomedical literature · Programmatic access ·
Full-text: the entire text present in a scientific
UniProt citations service · Semantics ·
document including scattered text such as fig-
Controlled vocabularies · Ontologies · OWL:
ure labels and footnotes.
Web Ontology Language · URI: Uniform
Resource Identifier · DAG: Directed Acyclic Statements contain more syntactic and semantic
Graphs · OBO: Open Biomedical Ontologies errors than abstracts, since they normally are
not peer-reviewed, but they are normally directly
linked to data providing useful details about it.
The main advantage of using statements or ab-
stracts is the brief and succinct form on which
Biomedical Text the information is expressed. In the case of ab-
stracts, there was already an intellectual exercise
Text is still the preferential means of publishing to present only the main facts and ideas. Never-
novel knowledge in Health and Life Sciences, theless, a brief description may be insufficient to
and where we can expect to find all the draw a solid conclusion, that may require some
information about the supporting data. Text important details not possible to summarize in a
can be found and explored in multiple types short piece of text (Schuemie et al. 2004). These
of sources, the main being scientific articles and details are normally presented in the form of a
patents (Krallinger et al. 2017). However, less full-text document, which contains a complete
formal texts are also relevant to explore, such as description of the results obtained. For example,
the ones present nowadays in electronic health important details are sometimes only present in
records (Blumenthal and Tavenner 2010). figure labels (Yeh et al. 2003).

© The Author(s) 2019 9


F. M. Couto, Data and Text Processing for Health and Life Sciences,
Advances in Experimental Medicine and Biology 1137,
https://doi.org/10.1007/978-3-030-13845-5_2
10 2 Resources

One major problem of full-text documents is such as Google Scholar5 , Google Patents6 , Re-
their availability, since their content may have searchGate7 and Mendeley8 .
restricted access. In addition, the structure of the More than just text some tools also integrate
full-text and the format on which is available semantic links. One of the first search engines
varies according to the journal in where it was for biomedical literature to incorporate semantics
published. Having more information does not was GOPubMed9 , that categorized texts accord-
mean that all of it is beneficial to find what ing to Gene Ontology terms found in them (Doms
we need. Some of the information may even and Schroeder 2005). These semantic resources
induce us in error. For example, the relevance will be described in a following section. A more
of a fact reported in the Results Section may be recent tool is PubTator10 that provides the text
different if the fact was reported in the Related annotated with biological entities generated by
Work Section. Thus, the usage of full-text may state-of-the-art text-mining approaches (Wei
create several problems regarding the quality of et al. 2013).
information extracted (Shah et al. 2003). There is also a movement in the scientific
community to produce Open Access Publica-
tions, making full-texts freely available with
Where? unrestricted use. One of the main free digital
archives of free biomedical full-texts is PubMed
Access to biomedical literature is normally done Central11 (PMC), currently providing access to
using the internet through PubMed1 , an informa- more than 5 million documents.
tion retrieval system released in 1996 that allows Other relevant source of biomedical texts is
researchers to search and find biomedical texts of the electronic health records stored in health in-
relevance to their studies (Canese 2006). PubMed stitutions, but the texts they contain are normally
is developed and maintained by the National directly linked to patients and therefore their
Center for Biotechnology Information (NCBI), at access is restricted due to ethical and privacy is-
the U.S. National Library of Medicine (NLM), sues. As example, the THYME corpus12 includes
located at the National Institutes of Health (NIH). more than one thousand de-identified clinical
Currently, PubMed provides access to more than notes from the Mayo Clinic, but is only available
28 million citations from MEDLINE, a biblio- for text processing research under a data use
graphic database with references to a compre- agreement (DUA) with Mayo Clinic (Styler IV
hensive list of academic journals in Health and et al. 2014).
Life Sciences2 . The references include multiple From generic texts we can also sometimes find
metadata about the documents, such as: title, ab- relevant biomedical information. For example,
stract, authors, journal, publication date. PubMed some recent biomedical studies have been pro-
does not store the full-text documents, but it cessing the texts in social networks to identify
provides links where we may find the full-text. new trends and insights about a disease, such as
More recently, biomedical references are also processing tweets to predict flu outbreaks (Ara-
accessible using the European Bioinformatics maki et al. 2011).
Institute (EBI) services, such as Europe PMC3 ,
the Universal Protein Resource (UniProt) with its 5 http://scholar.google.com/
UniProt citations service4 . 6 http://www.google.com/patents
Other generic alternative tools have been also 7 https://www.researchgate.net/
gaining popularity for finding scientific texts, 8 https://www.mendeley.com/
9 https://gopubmed.org/
1 https://www.nlm.nih.gov/bsd/pubmed.html 10 http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/
2 https://www.nlm.nih.gov/bsd/medline.html PubTator/
3 http://europepmc.org/ 11 https://www.ncbi.nlm.nih.gov/pmc/
4 https://www.uniprot.org/citations/ 12 http://thyme.healthnlp.org/
Semantics 11

How? More important is to check the URL that is


now being used:
To automatically process text, we need program- https://www.uniprot.org/
matic access to it, this means that from the pre- citations/?sort=score&desc=&
vious biomedical data repositories we can only compress=no&query=id
use the ones that allow this kind of access. These :27702941%20OR%20id:22333316&
limitations are imposed because many biomed- format=tab&columns=id
ical documents have copyright restrictions hold
by their publishers. And some restrictions may We can check that the URL has three
define that only manual access is granted, and no main components: the scheme (https), the
programmatic access is allowed. These restric- hostname (www.uniprot.org), the service
tions are normally detailed in the terms of service (citations) and the data parameters. The
of each repository. However, when browsing the scheme represents the type of web connection to
repository if we face a CAPTCHA challenge get the data, and usually is one of these protocols:
to determine whether we are humans or not, Hypertext Transfer Protocol (HTTP) or HTTP
probably means that some access restrictions are Secure (HTTPS)18 . The hostname represents the
in place. physical site where the service is available. The
Fortunately, NCBI13 and EBI14 online ser- list of parameters depends on the data available
vices, such as PubMed, Europe PMC, or UniProt from the different services. We can change
Citations, allow programmatic access (Li et al. any value of the parameters (arguments) to get
2015). Both institutions provide Web APIs15 that different results. For example, we can replace
fully document how web services can be pro- the two PubMed identifiers by the following one
grammatically invoked. Some resources can in- 2902929119 , and our browser will now display
clusively be accessed using RESTful web ser- the information about this new document:
vices16 that are characterized by a simple uniform PubMed ID Title Authors/Groups
interface that make any Uniform Resource Lo- Abstract/Summary
cator (URL) almost self-explanatory (Richardson 29029291 Nutrition Influences...
and Ruby 2008). The same URL shown by our The good news is that we can use this link with
web browser is the only thing we need to know a command line tool and automatize the retrieval
to retrieve the data using a command line tool. of the data, including extracting the abstract to
For example, if we search for caffeine using process its text.
the UniProt Citations service17 , select the first
two entries, and click on download, the browser
will show information about those two docu-
ments using a tabular format. Semantics
PubMed ID Title Authors/Groups Lack of use of standard nomenclatures across bi-
Abstract/Summary ological text makes text processing a non-trivial
27702941 Genome-wide association task. Often, we can find different labels (syn-
... onyms, acronyms) for the same biomedical enti-
22333316 Modeling caffeine ties, or, even more problematic, different entities
concentrations ... sharing the same label (homonyms) (Rebholz-
Schuhmann et al. 2005). Sense disambiguation
13 https://www.ncbi.nlm.nih.gov/home/develop/api/
to select the correct meaning of an expression in
14 https://www.ebi.ac.uk/seqdb/confluence/display/

JDSAT/ 18 https://en.wikipedia.org/wiki/
15 https://en.wikipedia.org/wiki/Web_API
Hypertext_Transfer_Protocol
16 https://www.ebi.ac.uk/seqdb/confluence/pages/
19 https://www.uniprot.org/citations/?sort=score&desc=
viewpage.action?pageId=68165098 &compress=no&query=id:29029291&format=
17 https://www.uniprot.org/citations/
tab&columns=id
12 2 Resources

a given piece of text is therefore a crucial issue. to be open and available without any constraint
For example, if we find the disease acronym ATS other than acknowledging their origin.
in a text, we may have to figure out if it repre- Concepts are defined as OWL classes that may
senting the Andersen-Tawil syndrome20 or the X- include multiple properties. For text processing
linked Alport syndrome21 . Further in the book, we important properties include the labels that may
will address this issue by using ontologies and be used to mention that class. The labels may
semantic similarity between their classes (Couto include the official name, acronyms, exact syn-
and Lamurias 2019). onyms, and even related terms. For example, a
class defining the disease malignant hyperther-
mia may include as synonym anesthesia related
What? hyperthermia. Two distinct classes may share the
same label, such as Andersen-Tawil syndrome
In 1993, Gruber (1993) proposed a short but and X-linked Alport syndrome that have ATS as
comprehensive definition of ontology as an: an exact synonym.
an explicit specification of a conceptualization
Formality
In 1997 and 1998, Borst and Borst (1997) and The representation of classes and the relation-
Studer et al. (1998) refined this definition to: ships may use different levels of formality, such
a formal, explicit specification of a shared concep- as controlled vocabularies, taxonomies and the-
tualization saurus, that even may include logical axioms.
Controlled vocabularies are list of terms with-
A conceptualization is an abstract view of
out specifying any relation between them. Tax-
the concepts and the relationships of a given
onomies are controlled vocabularies that include
domain. A shared conceptualization means that a
subsumption relations, for example specifying
group of individuals agree on that view, normally
that malignant hyperthermia is a muscle tissue
established by a common agreement among the
disease. This is-a or subclass relations are nor-
members of a community. The specification is a
mally the backbone of ontologies. We should
representation of that conceptualization using a
note that some ontologies may include multi-
given language. The language needs to be formal
ple inheritance, i.e. the same concept may be a
and explicit, so computers can deal with it.
specialization of two different concepts. There-
Languages fore, many ontologies are organized as a directed
The Web Ontology Language (OWL)22 is acyclic graphs (DAG) and not as hierarchical
nowadays becoming one of the most common trees, as the one represented in Fig. 2.1. A the-
languages to specify biomedical ontologies saurus includes other types of relations besides
(McGuinness et al. 2004). Another popular alter- subsumption, for example specifying that caf-
native is the Open Biomedical Ontology (OBO)23 feine has role mutagen.
format developed by the OBO foundry. OBO
established a set of principles to ensure high
Gold Related Documents
The importance of these relations can be easily
quality, formal rigor and interoperability between
understood by considering the domain modeled
other OBO ontologies (Smith et al. 2007). One
by the ontology in Fig. 2.1, and the need to find
important principle is that OBO ontologies need
texts related to gold. Assume a corpus with one
20 http://purl.obolibrary.org/obo/DOID_0050434
distinct document mentioning each metal, except
21 http://purl.obolibrary.org/obo/DOID_0110034
for gold that no document mentions. So, which
22 https://en.wikipedia.org/wiki/
documents should we read first?
Web_Ontology_Language The document mentioning silver is probably
23 https://en.wikipedia.org/wiki/ the most related since it shares with gold two
Open_Biomedical_Ontologies parents, precious and coinage. However, choos-
Semantics 13

Fig. 2.1 A DAG


representing a metal
classification of metals
with multiple inheritance,
since gold and silver are
considered both precious
and coinage metals (All the
links represent is-a
relations)
precious coinage

platinum palladium gold silver copper

ing between the documents mentioning platinum the OBO initiative was precisely to tackle this
or palladium or the document mentioning copper somehow disorderly spread of definitions for the
depends on our information need. This informa- same concepts. Each OBO ontology covers a
tion can be obtained by our previous searches clearly specified scope that is clearly identified.
or reads. For example, assuming that our last
searches included the word coinage, then docu- OBO Ontologies
ment mentioning copper is probably the second- A major example of success of OBO ontologies
most related. The importance of these semantic is the Gene Ontology (GO) that has been widely
resources is evidenced by the development of the and consistently used to describe the molecular
knowledge graph24 by Google to enhance their function, biological process and cellular compo-
search engine (Singhal 2012). nent of gene-products, in a uniform way across
different species (Ashburner et al. 2000). Another
OBO ontology is the Disease Ontology (DO)
Where? that provides human disease terms, phenotype
characteristics and related medical vocabulary
Most of the biomedical ontologies are available disease concepts (Schriml et al. 2018). Another
through BioPortal25 . In December of 2018, Bio- OBO ontology is the Chemical Entities of Bio-
Portal provided access to more than 750 ontolo- logical Interest (ChEBI) that provides a classifi-
gies representing more than 9 million classes. cation of molecular entities with biological inter-
BioPortal allows us to search for an ontology or est with a focus on small chemical compounds
a specific class. For example, if we search for (Degtyarenko et al. 2007).
caffeine, we will be able to see the large list of on-
tologies that define it. Each of these classes rep- Popular Controlled Vocabularies
resent conceptualizations of caffeine in different Besides OBO ontologies, other popular con-
domains and using alternative perspectives. To trolled vocabularies also exist. One of them is the
improve interoperability some ontologies include International Classification of Diseases (ICD)26 ,
class properties with a link to similar classes maintained by the World Health Organization
in other ontologies. One of the main goals of (WHO). This vocabulary contains a list of

24 https://en.wikipedia.org/wiki/Knowledge_Graph
25 http://bioportal.bioontology.org/ 26 https://www.who.int/classifications/icd/en/
14 2 Resources

generic clinical terms mainly arranged and clas- files from there. This way, we will be sure that
sified according to anatomy or etiology. Another we get the most recent release in the original
example is the Systematized Nomenclature of format and select the subset of the ontology
Medicine – Clinical Terms (SNOMED CT)27 , that really matter for our work. For example,
currently maintained and distributed by the ChEBI provides three versions: LITE, CORE and
International Health Terminology Standards FULL33 . Since we are interested in using the
Development Organization (IHTSDO). The ontology just for text processing, we are probably
SNOMED CT is a highly comprehensive and not interested in chemical data and structures that
detailed set of clinical terms used in many is available in CORE. Thus, LITE is probably
biomedical systems. The Medical Subject Head- the best solution, and it will be the one we will
ings (MeSH)28 is a comprehensive controlled use in this book. However, we may be missing
vocabulary maintained by the National Library synonyms that are only included in the FULL
of Medicine (NLM) for classifying biomedical version.
and health-related information and documents.
Both MeSH and SNOMED CT are included OWL
in the Metathesaurus of the Unified Medical The OWL language is the prevailing language
Language System (UMLS)29 , maintained by the to represent ontologies, and for that reason will
U.S National Library of Medicine. This is a large be the format we will use in this book. OWL
resource that integrates most of the available extends RDF Schema (RDFS) with more com-
biomedical vocabularies. The 2015AB release plex statements using description logic. RDFS is
covered more than three million concepts. an extension of RDF with additional statements,
Another alternative to BioPortal is Ontobee30 , such as class-subclass or property-subproperty
a repository of ontologies used by most OBO relationships. RDF is a data model that stores in-
ontologies, but it also includes many non-OBO formation in statements represented as triples of
ontologies. In December 2018, Ontobee provided the form subject, predicate and object. Originally,
access to 187 ontologies (Ong et al. 2016). W3C recommended RDF data to be encoded
Other alternatives outside the biomedical do- using Extensible Markup Language (XML) syn-
main include the list of vocabularies gathered by tax, also named RDF/XML. XML is a self-
the W3C SWEO Linking Open Data community descriptive mark-up language composed of data
project31 , and by the W3C Library Linked Data elements.
Incubator Group32 . For example, the following example repre-
sents an XML file specifying that caffeine is a
drug that may treat the condition of sleepiness,
How? but without being an official treatment:
<treatment category="non-
After finding the ontologies that cover our do-
official">
main of interest in the previous catalogs, a good
<drug>caffeine</drug>
idea is to find their home page and download the
<condition>sleepiness</
27 https://digital.nhs.uk/services/terminology-and-
condition>
</treatment>
classifications/snomed-ct
28 https://www.nlm.nih.gov/mesh/
The information is organized in an hierarchi-
29 https://www.nlm.nih.gov/research/umls/
cal structure of data elements. treatment is
30 http://www.ontobee.org/
the parent element of drug and condition.
31 http://www.w3.org/wiki/TaskForces/
The character < means that a new data element
CommunityProjects/LinkingOpenData/
CommonVocabularies
is being specified, and the characters </ means
32 http://www.w3.org/2005/Incubator/lld/XGR-lld-

vocabdataset-20111025 33 https://www.ebi.ac.uk/chebi/downloadsForward.do
Further Reading 15

that a specification of data element will end. Sometimes, ontologies are also available as
The treatment element has a property named database dumps. These dumps are normally SQL
category with the value non-official. files that need to be fed to a DataBase Manage-
The drug and condition elements have as ment System (DBMS)36 . If for any reason we
values caffeine and sleepiness, respec- must deal with these files, we can use the simple
tively. This is a very simple XML example, command line tool named sqlite3. The tool
but large XML files are almost unreadable by has the option to execute the SQL commands to
humans. import the data into a database (.read com-
To address this issue other encoding languages mand), and to export the data into a CSV file
for RDF are now being used, such as N334 and (.mode command) (Allen and Owens 2011).
Turtle35 . Nevertheless, most biomedical ontolo-
gies are available in OWL using XML encoding.

URI Further Reading


The Uniform Resource Identifier (URI) was de-
One important read if we need to know more
fined as the standard global identifier of classes in
about biomedical resources is the Arthur Lesk’s
an ontology. For example, the class caffeine
book about bioinformatics (Lesk 2014). The
in ChEBI is identified by the following URI:
book has entire chapters dedicated to where data
http://purl.obolibrary.org/obo/ and text can be found, providing a comprehensive
CHEBI_27732 overview of the type of biomedical information
If a URI represents a link to a retrievable resource available, nowadays.
is considered a Uniform Resource Locator, or A more pragmatic approach is to explore the
URL. In other words, a URI is a URL if we vast number of manuals, tutorials, seminars and
open it in a web browser and obtain a resource courses provided by the EBI37 and NCBI38 .
describing that class.

36 https://en.wikipedia.org/wiki/Database#

Database_management_system
34 https://en.wikipedia.org/wiki/Notation3 37 https://www.ebi.ac.uk/training
35 https://en.wikipedia.org/wiki/Turtle_(syntax) 38 https://www.ncbi.nlm.nih.gov/home/learn/

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons
licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to
obtain permission directly from the copyright holder.
Data Retrieval
3

Abstract is (see Fig. 3.1). From all the information that


This chapter starts by introducing an example is available we can check in the infobox that
of how we can retrieve text, where every step there are multiple links to external sources. The
is done manually. The chapter will describe infobox is normally a table added to the top
step-by-step how we can automatize each step right-hand part of a web page with structured
of the example using shell script commands, data about the entity described on that page.
which will be introduced and explained as From the list of identifiers (see Fig. 3.2), let
long as they are required. The goal is to equip us select the link to one resource hosted by the
the reader with a basic set of skills to retrieve European Bioinfomatics Institute (EBI), the link
data from any online database and follow the to CHEBI:277322 .
links to retrieve more information from other CHEBI represents the acronym of the
sources, such as literature. resource Chemical Entities of Biological Interest
(ChEBI)3 and 27732 the identifier of the entry in
Keywords ChEBI describing caffeine (see Fig. 3.3). ChEBI
Unix shell · Terminal application · Web is a freely available database of molecular entities
retrieval · cURL: Client Uniform Resource with a focus on “small” chemical compounds.
Locator · Data extraction · Data selection · More than a simple database, ChEBI also
Data filtering · Pattern matching · XML: includes an ontology that classifies the entities
extensible markup language · XPath: XML according to their structural and biological
path language properties.
By analyzing the CHEBI:27732 web page we
can check that ChEBI provides a comprehensive
Caffeine Example set of information about this chemical compound.
But let us focus on the Automatic Xrefs tab4 .
As our main example, let us consider that we This tab provides a set of external links to other
need to retrieve more data and literature about
caffeine. If we really do not know anything
about caffeine, we may start by opening our 2 https://www.ebi.ac.uk/chebi/searchId.do?chebiId=
favorite internet browser and then searching CHEBI:27732
caffeine in Wikipedia1 to know what it really 3 http://www.ebi.ac.uk/chebi/
4 http://www.ebi.ac.uk/chebi/displayAutoXrefs.do?
1 https://en.wikipedia.org/wiki/Caffeine chebiId=CHEBI:27732

© The Author(s) 2019 17


F. M. Couto, Data and Text Processing for Health and Life Sciences,
Advances in Experimental Medicine and Biology 1137,
https://doi.org/10.1007/978-3-030-13845-5_3
Another random document with
no related content on Scribd:
ŒUFS AU PLAT.

A pewter or any other metal plate or dish which will bear the fire,
must be used for these. Just melt a slice of butter in it, then put in
some very fresh eggs broken as for poaching; strew a little pepper
and salt on the top of each, and place them over a gentle fire until
the whites are quite set, but keep them free from colour.
This is a very common mode of preparing eggs on the continent;
but there is generally a slight rawness of the surface of the yolks
which is in a measure removed by ladling the boiling butter over
them with a spoon as they are cooking, though a salamander held
above them for a minute would have a better effect. Four or five
minutes will dress them.
Obs.—We hope for an opportunity of inserting further receipts for
dishes of eggs at the end of this volume.
MILK AND CREAM.

Without possessing a dairy, it is quite possible for families to have


always a sufficient provision of milk and cream for their consumption,
provided there be a clean cool larder or pantry where it can be kept.
It should be taken from persons who can be depended on for
supplying it pure, and if it can be obtained from a dairy near at hand
it will be an advantage, as in the summer it is less easy to preserve it
sweet when it has been conveyed from a distance. It should be
poured at once into well-scalded pans or basins kept exclusively for
it, and placed on a very clean and airy shelf, apart from all the other
contents of the larder. The fresh milk as it comes in should be set at
one end of the shelf, and that for use should be taken from the other,
so that none may become stale from being misplaced or overlooked.
The cream should be removed with a perforated skimmer (or
skimming-dish as it is called in dairy-counties) which has been
dipped into cold water to prevent the cream, when thick, from
adhering to it. Twelve hours in summer, and twenty-four in winter, will
be sufficient time for the milk to stand for “creaming,” though it may
often be kept longer with advantage. Between two and three pints of
really good milk will produce about a quarter of a pint of cream. In
frosty weather the pans for it should be warmed before it is poured
in. If boiled when first brought in, it will remain sweet much longer
than it otherwise would; but it will then be unfit to serve with tea;
though it may be heated afresh and sent to table with coffee; and
used also for puddings, and all other varieties of milk-diet.
DEVONSHIRE, OR CLOTTED CREAM.

From the mode adopted in Devonshire, and in some other


counties, of scalding the milk in the following manner, the cream
becomes very rich and thick, and is easily converted into excellent
butter. It is strained into large shallow metal pans as soon as it is
brought into the dairy and left for twelve hours at least in summer,
and thirty-six in cold weather. It is then gently carried to a hot plate—
heated by a fire from below—and brought slowly to a quite scalding
heat but without being allowed to boil or even to simmer. When it is
ready to be removed, distinct rings appear on the surface, and small
bubbles of air. It must then be carried carefully back to the dairy, and
may be skimmed in twelve hours afterwards. The cream should be
well drained from the milk—which will be very poor—as this is done.
It may then be converted into excellent butter, merely by beating it
with the hand in a shallow wooden tub, which is, we are informed,
the usual manner of making it in small Devonshire dairies.
DU LAIT A MADAME.

Boil a quart of new milk, and let it cool sufficiently to allow the
cream to be taken off; then rinse an earthen jar well in every part
with buttermilk, and while the boiled milk is still rather warm, pour it
in and add the cream gently on the top. Let it remain twenty-four
hours, turn it into a deep dish, mix it with pounded sugar, and it will
be ready to serve. This preparation is much eaten abroad during the
summer, and is considered very wholesome. The milk, by the
foregoing process, becomes a very soft curd, slightly, but not at all
unpleasantly, acid in flavour. A cover, or thick folded cloth, should be
placed on the jar after the milk is poured in, and it should be kept in a
moderately warm place. In very sultry weather less time may be
allowed for the milk to stand.
Obs.—We give this and the following receipt from an unpublished
work which we have in progress, being always desirous to make
such information as we possess generally useful as far as we can.
CURDS AND WHEY.

Rennet is generally prepared for dairy-use by butchers, and kept


in farmhouses hung in the chimney corners, where it will remain
good a long time. It is the inner stomach of the calf, from which the
curd is removed, and which is salted and stretched out to dry on
splinters of wood, or strong wooden skewers. It should be preserved
from dust and smoke (by a paper-bag or other means), and portions
of it cut off as wanted. Soak a small bit in half a teacupful of warm
water, and let it remain in it for an hour or two; then pour into a quart
of warm new milk a dessertspoonful of the rennet-liquor, and keep it
in a warm place until the whey appears separated from the curd, and
looks clear. The smaller the proportion of rennet used, the more soft
and delicate will be the curd. We write these directions from
recollection, having often had the dish thus prepared, but having no
memorandum at this moment of the precise proportions used. Less
than an inch square of the rennet would be sufficient, we think, for a
gallon of milk, if some hours were allowed for it to turn. When rennet-
whey, which is a most valuable beverage in many cases of illness, is
required for an invalid to drink, a bit of the rennet, after being quickly
and slightly rinsed, may be stirred at once into the warm milk, as the
curd becoming hard is then of no consequence. It must be kept
warm until the whey appears and is clear. It may then be strained,
and given to the patient to drink, or allowed to become cold before it
is taken. In feverish complaints it has often the most benign effect.
Devonshire junket is merely a dish or bowl of sweetened curds
and whey, covered with the thick cream of scalded milk, for which
see page 451.
CHAPTER XXIII.

Sweet Dishes, or Entremets.

Jelly of two colours, with macedoire of fruit.


TO PREPARE CALF’S FEET STOCK.

The feet are usually sent in from


the butcher’s ready to be
dressed, but as they are sold at
a very much cheaper rate when
the hair has not been cleared
from them, and as they may
then be depended on for
supplying the utmost amount of
nutriment which they contain, it
is often desirable to have them
White and Rose-coloured Jelly.
altogether prepared by the cook.
In former editions of this work
we directed that they should be “dipped into cold water, and
sprinkled with resin in fine powder; then covered with boiling water
and left for a minute or two untouched before they were scraped;”
and this method we had followed with entire success for a long time,
but we afterwards discovered that the resin was not necessary, and
that the feet could be quite as well prepared by mere scalding, or
being laid into water at the point of boiling, and kept in it for a few
minutes by the side of the fire. The hair, as we have already stated in
the first pages of Chapter IX. (Veal), must be very closely scraped
from them with a blunt-edged knife; and the hoofs must be removed
by being struck sharply down against the edge of a strong table or
sink, the leg-bone being held tightly in the hand. The feet must be
afterwards washed delicately clean before they are further used.
When this has been done, divide them at the joint, split the claws,
and take away the fat that is between them. Should the feet be large,
put a gallon of cold water to the four, but from a pint to a quart less if
they be of moderate size or small. Boil them gently down until the
flesh has parted entirely from the bones, and the liquor is reduced
nearly or quite half; strain, and let it stand until cold; remove every
particle of fat from the top before it is used, and be careful not to take
the sediment.
Calf’s feet (large), 4; water, 1 gallon: 6 to 7 hours.
TO CLARIFY CALF’S FEET STOCK.

Break up a quart of the stock, put it into a clean stewpan with the
whites of five large or of six small eggs, two ounces of sugar, and the
strained juice of a small lemon; place it over a gentle fire, and do not
stir it after the scum begins to form; when it has boiled five or six
minutes, if the liquid part be clear, turn it into a jelly-bag, and pass it
through a second time should it not be perfectly transparent the first.
To consumptive patients, and others requiring restoratives, but
forbidden to take stimulants, the jelly thus prepared is often very
acceptable, and may be taken with impunity, when it would be highly
injurious made with wine. More white of egg is required to clarify it
than when sugar and acid are used in larger quantities, as both of
these assist the process. For blanc-mange omit the lemon-juice, and
mix with the clarified stock an equal proportion of cream (for an
invalid, new milk), with the usual flavouring, and weight of sugar; or
pour the boiling stock very gradually to some finely pounded
almonds, and express it from them as directed for Quince Blamange,
allowing from six to eight ounces to the pint.
Stock, 1 quart; whites of eggs, 5; sugar, 2 oz.; juice, 1 small
lemon: 5 to 8 minutes.
TO CLARIFY ISINGLASS.

The finely-cut purified isinglass, which is now in general use,


requires no clarifying except for clear jellies: for all other dishes it is
sufficient to dissolve, skim, and pass it through a muslin strainer.
When two ounces are required for a dish, put two and a half into a
delicately clean pan, and pour on it a pint of spring water which has
been gradually mixed with a teaspoonful of beaten white of egg; stir
these thoroughly together, and let them heat slowly by the side of a
gentle fire, but do not allow the isinglass to stick to the pan. When
the scum is well risen, which it will be after two or three minutes’
simmering, clear it off, and continue the skimming until no more
appears; then, should the quantity of liquid be more than is needed,
reduce it by quick boiling to the proper point, strain it through a thin
muslin, and set it by for use: it will be perfectly transparent, and may
be mixed lukewarm with the clear and ready sweetened juice of
various fruits, or used with the necessary proportion of syrup, for
jellies flavoured with choice liqueurs. As the clarifying reduces the
strength of the isinglass—or rather as a portion of it is taken up by
the white of egg—an additional quarter to each ounce must be
allowed for this: if the scum be laid to drain on the back of a fine
sieve which has been wetted with hot water, a little very strong jelly
will drip from it.
Isinglass, 2-1/2 oz.; water, 1 pint; beaten white of egg, 1
teaspoonful.
Obs.—At many Italian warehouses a preparation is now sold
under the name of isinglass, which appears to us to be highly
purified gelatine of some other kind. It is converted without trouble
into a very transparent jelly, is free from flavour, and is less
expensive than the genuine Russian isinglass; but when taken for
any length of time as a restorative, its different nature becomes
perceptible. It answers well for the table occasionally; but it is not
suited to invalids.
SPINACH GREEN, FOR COLOURING SWEET DISHES,
CONFECTIONARY, OR SOUPS.

Pound quite to a pulp, in a marble or Wedgwood mortar, a handful


or two of young freshly-gathered spinach, then throw it into a hair
sieve, and press through all the juice which can be obtained from it;
pour this into a clean white jar, and place it in a pan of water that is
at the point of boiling, and which must be allowed only to just simmer
afterwards; in three or four minutes the juice will be poached or set:
take it then gently with a spoon, and lay it upon the back of a fine
sieve to drain. If wanted for immediate use, merely mix it in the
mortar with some finely-powdered sugar;[158] but if to be kept as a
store, pound it with as much as will render the whole tolerably dry,
boil it to candy-height over a very clear fire, pour it out in cakes, and
keep them in a tin box or canister. For this last preparation consult
the receipt for orange-flower candy.
158. For soup, dilute it first with a little of the boiling stock, and stir it to the
remainder.
PREPARED APPLE OR QUINCE JUICE.

Pour into a clean earthen pan two quarts of spring water, and
throw into it as quickly as they can be pared, quartered, and
weighed, four pounds of nonsuches, pearmains, Ripstone pippins, or
any other good boiling apples of fine flavour. When all are done,
stew them gently until they are well broken, but not reduced quite to
pulp; turn them into a jelly-bag, or strain the juice from them without
pressure through a closely-woven cloth, which should be gathered
over the fruit, and tied, and suspended above a deep pan until the
juice ceases to drop from it: this, if not very clear, must be rendered
so before it is used for syrup or jelly, but for all other purposes once
straining it will be sufficient. Quinces are prepared in the same way,
and with the same proportions of fruit and water, but they must not
be too long boiled, or the juice will become red. We have found it
answer well to have them simmered until they are perfectly tender,
and then to leave them with their liquor in a bowl until the following
day, when the juice will be rich and clear. They should be thrown into
the water very quickly after they are pared and weighed, as the air
will soon discolour them. The juice will form a jelly much more easily
if the cores and pips be left in the fruit.
Water, 2 quarts; apples or quinces, 4 lbs.
COCOA-NUT FLAVOURED MILK.

(For sweet dishes, &c.)


Pare the dark outer rind from a very fresh nut, and grate it on a
fine and exceedingly clean grater, to every three ounces pour a quart
of new milk, and simmer them very softly for three quarters of an
hour, or more, that a full flavour of the nut may be imparted to the
milk without its being much reduced: strain it through a fine sieve, or
cloth, with sufficient pressure to leave the nut almost dry: it may then
be used for blanc-mange, custards, rice, and other puddings, light
cakes and bread.
To each quart new milk, 3 oz. grated cocoa-nut: 3/4 to 1 hour.
Obs.—The milk of the nut when perfectly sweet and good, may be
added to the other with advantage. To obtain it, bore one end of the
shell with a gimlet, and catch the liquid in a cup; and to extricate the
kernel, break the shell with a hammer; this is better than sawing it
asunder.
COMPÔTES OF FRUIT.

(Or Fruit stewed in Syrup.)


We would especially recommend these delicate and very
agreeable preparations for trial to such of our readers as may be
unacquainted with them, as well as to those who may have a
distaste to the common “stewed fruit” of English cookery. If well
made they are peculiarly delicious and refreshing, preserving the
pure flavour of the fruit of which they are composed; while its acidity
is much softened by the small quantity of water added to form the
syrup in which it is boiled. They are also more economical than tarts
or puddings, and infinitely more wholesome. In the second course
pastry-crust can always be served with them, if desired, in the form
of ready baked leaves, round cakes, or any more fanciful shapes; or
a border of these may be fastened with a little white of egg and flour
round the edge of the dish in which the compôte is served; but rice,
or macaroni simply boiled, or a very plain pudding is a more usual
accompaniment.
Compôtes will remain good for two or three days in a cool store-
room, or somewhat longer, if gently boiled up for an instant a second
time; but they contain generally too small a proportion of sugar to
preserve them from mould or fermentation for many days. The syrup
should be enriched with a larger quantity when they are intended for
the desserts of formal dinners, as it will increase the transparency of
the fruit: the juice is always beautifully clear when the compôtes are
carefully prepared. They should be served in glass dishes, or in
compôtiers, which are of a form adapted to them.
Compôte of spring fruit.—(Rhubarb). Take a pound of the stalks
after they are pared, and cut them into short lengths; have ready a
quarter of a pint of water boiled gently for ten minutes with five
ounces of sugar, or with six should the fruit be very acid; put it in,
and simmer it for about ten minutes. Some kinds will be tender in
rather less time, some will require more.
Obs.—Good sugar in lumps should be used for these dishes.
Lisbon sugar will answer for them very well on ordinary occasions,
but that which is refined will render them much more delicate.
Compôte of green currants.—Spring water, half-pint; sugar, five
ounces; boiled together ten minutes. One pint of green currants
stripped from the stalks; simmered five minutes.
Compôte of green gooseberries.—This is an excellent compôte if
made with fine sugar, and very good with any kind. Break five
ounces into small lumps and pour on them half a pint of water; boil
these gently for ten minutes, and clear off all the scum; then add to
them a pint of fresh gooseberries freed from the tops and stalks,
washed, and well drained. Simmer them gently from eight to ten
minutes, and serve them hot or cold. Increase the quantity for a large
dish.
Compôte of green apricots.—Wipe the down from a pound of quite
young apricots, and stew them very gently for nearly twenty minutes
in syrup made with eight ounces of sugar and three-quarters of a pint
of water, boiled together the usual time.
Compôte of red currants.—A quarter of a pint of water and five
ounces of sugar: ten minutes. One pint of currants freed from the
stalks to be just simmered in the syrup from five to seven minutes.
This receipt will serve equally for raspberries, or for a compôte of the
two fruits mixed together. Either of them will be found an admirable
accompaniment to a pudding of batter, custard, bread, or ground
rice, and also to various other kinds of puddings, as well as to whole
rice plainly boiled.
Compôte of Kentish or Flemish cherries.—Simmer five ounces of
sugar with half a pint of water for ten minutes; throw into the syrup a
pound of cherries weighed after they are stalked, and let them stew
gently for twenty minutes: it is a great improvement to stone the fruit,
but a larger quantity will then be required for a dish.
Compôte of Morella cherries.—Boil together for fifteen minutes, six
ounces of sugar with half a pint of water; add a pound and a quarter
of ripe Morella cherries, and simmer them very softly from five to
seven minutes: this is a delicious compôte. A larger proportion of
sugar will often be required for it, as the fruit is very acid in some
seasons, and when it is not fully ripe.
Compôte of damsons.—Four ounces of sugar and half a pint of
water to be boiled for ten minutes; one pound of damsons to be
added, and simmered gently from ten to twelve minutes.
Compôte of the green magnum-bonum or Mogul plum.—The
green Mogul plums are often brought abundantly into the market
when the fruit is thinned from the trees, and they make admirable
tarts or compôtes, possessing the fine slight bitter flavour of the
unripe apricot, to which they are quite equal. Measure a pint of the
plums without their stalks, and wash them very clean; then throw
them into a syrup made with seven ounces of sugar in lumps, and
half a pint of water, boiled together for eight or ten minutes. Give the
plums one quick boil, and then let them stew quite softly for about
five minutes, or until they are tender, which occasionally will be in
less time even. Take off the scum, and serve the compôte hot or
cold.
Compôte of the magnum-bonum, or other large plums.—Boil six
ounces of sugar with half a pint of water the usual time; take the
stalks from a pound of plums, and simmer them very softly for twenty
minutes. Increase the proportion of sugar if needed, and regulate the
time as may be necessary for the different varieties of fruit.
Compôte of bullaces.—The large, or shepherds’ bullace, is very
good stewed, but will require a considerable portion of sugar to
render it palatable, unless it be quite ripe. Make a syrup with half a
pound of sugar, and three-quarters of a pint of water, and boil in it
gently from fifteen to twenty minutes, a pint and a half of the bullaces
freed from their stalks.
Compôte of Siberian crabs.—To three-quarters of a pint of water
add six ounces of fine sugar, boil them for ten or twelve minutes, and
skim them well. Add a pound and a half of Siberian crabs without
their stalks, and keep them just at the point of boiling for twenty
minutes; they will then become tender without bursting. A few strips
of lemon-rind and a little of the juice are sometimes added to this
compôte.
Obs.—In a dry warm summer, when fruit ripens freely, and is rich
in quality, the proportion of sugar directed for these compôtes would
generally be found sufficient; but in a cold or wet season it would
certainly, in many instances, require to be increased. The present
slight difference in the cost of sugars, renders it a poor economy to
use the raw for dishes of this class, instead of that which is well
refined. To make a clear syrup it should be broken into lumps, not
crushed to powder. Almost every kind of fruit may be converted into
a good compôte.
COMPÔTE OF PEACHES.

Pare half a dozen ripe peaches, and stew them very softly from
eighteen to twenty minutes, keeping them often turned in a light
syrup, made with five ounces of sugar, and half a pint of water boiled
together for ten minutes. Dish the fruit; reduce the syrup by quick
boiling, pour it over the peaches, and serve them hot for a second-
course dish, or cold for rice-crust. They should be quite ripe, and will
be found delicious dressed thus. A little lemon-juice may be added to
the syrup, and the blanched kernels of two or three peach or apricot
stones.
Sugar, 5 oz.; water, 1/2 pint: 10 minutes. Peaches, 6: 18 to 20
minutes.
Obs.—Nectarines, without being pared, may be dressed in the
same way, but will require to be stewed somewhat longer, unless
they be quite ripe.
ANOTHER RECEIPT FOR STEWED PEACHES.

Should the fruit be not perfectly ripe, throw it into boiling water and
keep it just simmering, until the skin can be easily stripped off. Have
ready half a pound of fine sugar boiled to a light syrup with three-
quarters of a pint of water; throw in the peaches, let them stew softly
until quite tender, and turn them often that they may be equally done;
after they are dished, add a little strained lemon-juice to the syrup,
and reduce it by a few minutes’ very quick boiling. The fruit is
sometimes pared, divided, and stoned, then gently stewed until it is
tender.
Sugar, 8 oz.; water, 3/4 pint: 10 to 12 minutes. Peaches, 6 or 7;
lemon-juice, 1 large teaspoonful.

You might also like