Karen E. Nelson (Eds.) - Encyclopedia of Metagenomics - Genes, Genomes and Metagenomes - Basics, Methods, Databases and Tools-Springer US (2015)

Karen E. Nelson
Editor
Encyclopedia of Metagenomics
Genes, Genomes and Metagenomes:
Basics, Methods, Databases and Tools
With 216 Figures and 64 Tables

Editor
Karen E. Nelson
J. Craig Venter Institute
Rockville, MD, USA
ISBN 978-1-4899-7477-8 ISBN 978-1-4899-7478-5 (eBook)

ISBN 978-1-4899-7479-2 (print and electronic bundle)
DOI 10.1007/978-1-4899-7478-5
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2014954611
© Springer Science+Business Media New York 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed. Exempted from this
legal reservation are brief excerpts in connection with reviews or scholarly analysis or material
supplied specifically for the purpose of being entered and executed on a computer system, for
exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is
permitted only under the provisions of the Copyright Law of the Publisher’s location, in its
current version, and permission for use must always be obtained from Springer. Permissions for
use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable
to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal
responsibility for any errors or omissions that may be made. The publisher makes no warranty,
express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface
Welcome to the Encyclopedia of Metagenomics. It is anticipated that the
Encyclopedia will become a resource for tools, tool development and
all things metagenomic. Volume 1 encompasses Genes, Genomes and
Metagenomes. It covers a range of approaches to conduct metagenomics
surveys including descriptions of analysis tools. Several of these approaches,
including databases, have been under development from the beginning of the
metagenome era and are enabling the analysis and interpretation of large
microbial data sets from various environments.
“Genes, Genomes and Metagenomes” also covers DNA extraction, various
cloning and sequencing approaches, quality control and experimental
designs: all essential components of the microbiome and metagenomic
sequencing process. These approaches have continued to evolve and be
refined, and several improvements have been incorporated over the past few
years. This has also been driven by a switch to next-generation sequencing
(NGS) platforms including Ion Torrent, 454 and various Illumina
technologies.
Post-sequencing genome assembly, alignment tools, gene prediction and
annotation are also critical to successful data interpretation. Deeper dives in
Vol. 1 discuss codon usage, clustering programs and functional gene
characterization.
MD, USA Karen E. Nelson
September 2014
About the Editor
Dr. Karen E. Nelson is the President of the
J. Craig Venter Institute (JCVI). Prior to being
appointed President, Dr. Nelson held a number
of other positions at the Institute including
Director of JCVI’s Rockville Campus and
Director of Human Microbiology and
Metagenomics in the Department of Human
Genomic Medicine at the JCVI. She is also a
Professor at JCVI with an active research
program in genomics and metagenomics.
Dr. Nelson has led several genomic and
metagenomic efforts including those of several
reference microbial genomes and the first human metagenomics study that
was published in 2006. Additional ongoing studies in her group include
metagenomic approaches to study the ecology of the gastrointestinal tract
of humans and animals, studies on the relationship between the microbiome
and various human and animal disease conditions, reference genome
sequencing and analysis primarily for the human body, and other omics
studies. Dr. Nelson also heads the microbiome group at Human Longevity
Inc., which was recently formed in La Jolla, California.
Dr. Nelson received her undergraduate degree from the University of the
West Indies and her Ph.D. from Cornell University. She has authored or
coauthored over 100 peer-reviewed publications and edited three books and
is currently Editor-in-Chief of the journal Microbial Ecology. She also serves
on the Editorial Boards of BMC Genomics, GigaScience, and the Central
European Journal of Biology. She is also a standing member of the NRC
Committee on Biodefense, a member of the National Academy of Sciences
Board of Life Sciences, and a Fellow of the American Academy of Microbiology.
She was recently appointed an Honorary Professor at the University
of the West Indies.
Contributors
Takashi Abe Graduate School of Science and Technology, Niigata
University, Niigata, Japan
Yutaka Akiyama Department of Computer Science, Tokyo Institute of
Technology, Meguro-ku Tokyo, Japan
Rudolf Amann Molecular Ecology Group, Max Planck Institute for Marine
Microbiology, Bremen, Germany
Jaime Henrque Amorim Universidade Estadual de Santa Cruz,
Laboratório de Biotecnologia Microbiana, Ilhéus, BA, Brazil
Luke D. Bainard Semiarid Prairie Agricultural Research Centre,
Agriculture and Agri-Food Canada, Swift Current, SK, Canada
Annalisa Ballarini Laboratory of Microbial Genomics, Centre for
Integrative Biology (CIBIO), University of Trento, Trento, Italy
Navneet Batra Department of Biotechnology, GGDSD College,
Chandigarh, India
Arvind Behal Department of Biotechnology, GGDSD College, Chandigarh,
India
Robert G. Beiko Faculty of Computer Science, Dalhousie University,
Halifax, NS, Canada
Terrence H. Bell Department of Natural Resource Sciences, McGill
University, Sainte-Anne-de-Bellevue, QC, Canada
Johan Bengtsson-Palme Institute of Neuroscience and Physiology, The
Sahlgrenska Academy, University of Gothenburg, Göteborg, Sweden
Nicholas H. Bergman National Biodefense Analysis and Countermeasures
Center, Frederick, MD, USA
Sonu Bhatia Department of Biotechnology, GGDSD College, Chandigarh,
India
Kai Blin Interfakultäres Institut für Mikrobiologie und Infektionsmedizin
Tübingen, Mikrobiologie/Biotechnologie, Eberhard-Karls Universität,
Tübingen, Germany
Hervé M. Blottière INRA, AgroParisTech, Jouy en Josas, France
MetaGenoPolis, INRA, Jouy en Josas, France
Paul L. E. Bodelier Netherlands Institute of Ecology (NIOO-KNAW),
Wageningen, Netherlands
Germán Bonilla-Rosso Laboratorio de Evolución Molecular y Experimental,
Instituto de Ecología UNAM, Universidad Nacional Autónoma de México,
Mexico City, Mexico
Mark Borodovsky Joint Georgia Tech and Emory Wallace H Coulter
Department of Biomedical Engineering, Center for Bioinformatics and
Computational Genomics, Atlanta, GA, USA
Yan Boucher Department of Biological Sciences, University of Alberta,
Edmonton, AB, Canada
Jean-Luc Bouchot Department of Mathematics, Drexel University,
Philadelphia, PA, USA
Rainer Breitling Manchester Institute of Biotechnology, University of
Manchester, Manchester, UK
Florence Busato Laboratory for Epigenetics and Environment, Centre
National de Génotypage, CEA-Institut de Génomique, Evry, France
Brandi Cantarel Institute for Genome Sciences, University of Maryland
School of Medicine, Baltimore, MD, USA
Rebecca J. Case Department of Biological Sciences, University of Alberta,
Edmonton, AB, Canada
Patrick Chain Bioscience Division, Los Alamos National Laboratory,
Los Alamos, NM, USA
Chon-Kit Kenneth Chan Department of Mechanical Engineering,
The University of Melbourne, Melbourne, VIC, Australia
Trevor C. Charles Department of Biology, University of Waterloo,
Waterloo, ON, Canada
Chao Chen Dalian University of Technology, Dalian, China
Liangyu Chen Dalian University of Technology, Dalian, China
Tsute Chen Department of Microbiology, The Forsyth Institute,
Cambridge, MA, USA
Francis Y. L. Chin Department of Computer Science, The University of
Hong Kong, Hong Kong, China
Marco Cosentino Lagomarsino Computational and Quantitative Biology,
University Pierre et Marie Curie, Paris, France
CNRS, Paris, France
Paul Cotter Teagasc Food Research Centre, Moorepark, Fermoy, Co.,
Cork, Ireland
Alimentary Pharmabiotic Centre, University College, Cork, Ireland
Pedro Coutinho Centre National de la Recherche Scientifique & Aix-
Marseille Université, Marseille, France
Don Cowan Centre for Microbial Ecology and Genomics (CMEG), Genome
Research Institute (GRI), University of Pretoria, Hatfield, Pretoria,
South Africa
David E. Crowley Environmental Sciences, University of California,
Riverside, Riverside, CA, USA
Mulan Dai Semiarid Prairie Agricultural Research Centre, Agriculture and
Agri-Food Canada, Swift Current, SK, Canada
Rolf Daniel Institute of Microbiology and Genetics, Georg-August-
University Göttingen, Göttingen, Germany
Colin Davenport Hannover Medical School, Hannover, Germany
Tomas de Wouters INRA, AgroParisTech, Jouy en Josas, France
UMR Micalis, AgroParisTech, Jouy en Josas, France
Ye Deng Institute for Environmental Genomics, University of Oklahoma,
Norman, OK, USA
Chandrika Deshpande Department of Chemistry and Biomolecular
Sciences, Macquarie University, Sydney, NSW, Australia
Floyd Dewhirst Department of Molecular Genetics, The Forsyth Institute,
Cambridge, MA, USA
Greg Ditzler Department of Electrical and Computer Engineering, Drexel
University, Philadelphia, PA, USA
Joël Doré INRA, AgroParisTech, Jouy en Josas, France
US 1367 MetaGenoPolis, INRA, Jouy en Josas, France
UMR Micalis, AgroParisTech, Jouy en Josas, France
Inna Dubchak US Department of Energy Joint Genome Institute, Walnut
Creek, CA, USA
Lisa Durso Agroecosystem Management Research Unit, US Department
of Agriculture, University of Nebraska, Lincoln- East Campus, Lincoln,
NE, USA
Chitra Dutta Structural Biology & Bioinformatics Division, CSIR-Indian
Institute of Chemical Biology, Kolkata, West Bengal, India
Akihito Endo Department of Food and Cosmetic Science, Faculty of
Bioindustry, Tokyo University of Agriculture, Abashiri, Hokkaido, Japan
K. Martin Eriksson Department of Biological and Environmental Sciences,
University of Gothenburg, Göteborg, Sweden
Jean Euzéby Society of Systematic Bacteriology and Veterinary (SBSV) &
National Veterinary School de Toulouse (ENVT), Toulouse, France
James A. Foster Department of Biological Sciences, Institute for
Bioinformatics & Evolutionary Studies (IBEST), University of Idaho,
Moscow, ID, USA
Iddo Friedberg Department of Microbiology, Miami University, Oxford,
OH, USA
Limin Fu Center for Research in Biological Systems (CRBS), University of
California, San Diego, La Jolla, CA, USA
C. G. M. Gahan Department of Microbiology, School of Pharmacy &
Alimentary Pharmabiotic Centre, University College Cork, Cork, Ireland
Xiang Geng Dalian University of Technology, Dalian, China
Jan Gerken Microbial Genomics and Bioinformatics Research Group, Max
Planck Institute for Marine Microbiology, Bremen, Germany
Wolfgang Gerlach Institute for Genomics and Systems Biology, Argonne
National Laboratory, Argonne, IL, USA
Tarini Shankar Ghosh Biosciences R & D, TCS Innovation Labs, Tata
Research Development & Design Centre, Tata Consultancy Services
Limited, Pune, MH, India
Jack Gilbert Department of Ecology & Evolution, University of Chicago,
Chicago, IL, USA
Frank Oliver Glöckner Microbial Genomics and Bioinformatics Group,
Max Planck Institute for Marine Microbiology, Bremen, Germany
Jacobs University Bremen gGmbH, Bremen, Germany
Johannes Goll Informatics, The J. Craig Venter Institute, Rockville,
MD, USA
Juan M. Gonzalez Instituto de Recursos Naturales y Agrobiologia,
IRNAS-CSIC, Seville, Spain
Susumu Goto Bioinformatics Center, Institute for Chemical Research,
Kyoto University, Uji, Kyoto, Japan
Luigi Grassi Physics Department, Sapienza University of Rome, Rome,
Italy
Stefan J. Green University of Illinois at Chicago, Chicago, IL, USA
Charles W. Greer National Research Council Canada, Montreal, QC,
Canada
Igor V. Grigoriev US Department of Energy Joint Genome Institute,
Walnut Creek, CA, USA
Jacopo Grilli Dipartimento di Fisica “G. Galilei”, CNISM and INFN,
Università di Padova, Padova, Italy
Saman K. Halgamuge Department of Mechanical Engineering, The
University of Melbourne, Melbourne, VIC, Australia
Chantal Hamel Semiarid Prairie Agricultural Research Centre, Agriculture
and Agri-Food Canada, Swift Current, SK, Canada
Jun Hang Viral Diseases Branch, WRAIR, Silver Spring, MD, USA
Mohammed Monzoorul Haque Biosciences R & D, TCS Innovation Labs,
Tata Research Development & Design Centre, Tata Consultancy Services
Limited, Pune, MH, India
Stephen J. Harrop School of Physics, University of New South Wales,
Sydney, NSW, Australia
Martin Hartmann Molecular Ecology, Agroscope Reckenholz-Tänikon
Research Station ART, Zurich, Switzerland
Zhili He Department of Microbiology and Plant Biology, Institute for
Environmental Genomics, University of Oklahoma, Norman, OK, USA
Bernard Henrissat Centre National de la Recherche Scientifique &
Aix-Marseille Université, Marseille, France
Sarah Highlander Genomic Medicine, J. Craig Venter Institute, La Jolla,
CA, USA
Colin Hill Alimentary Pharmabiotic Centre, Department of Microbiology,
University College, Cork, Ireland
David Horn School of Physics and Astronomy, Tel Aviv University,
Tel Aviv, Israel
Arthur L. Hsu Department of Mechanical Engineering, The University of
Melbourne, Melbourne, VIC, Australia
Gangqing Hu Systems Biology Center, National Heart, Lung and Blood
Institute, National Institutes of Health, Bethesda, MD, USA
Shih-Ting Huang J. Craig Venter Institute, Rockville, MD, USA
Daniel H. Huson Center for Bioinformatics, Algorithms in Bioinformatics,
University of Tübingen,
Tübingen, Germany
Toshimichi Ikemura Nagahama Institute of Bio-Science and Technology,
Nagahama, Shiga, Japan
Hachiro Inokuchi Nagahama Institute of Bio-Science and Technology,
Nagahama, Shiga, Japan
Yuki Iwasaki Nagahama Institute of Bio-Science and Technology,
Nagahama, Shiga, Japan
Mukesh Jain Functional and Applied Genomics Laboratory, National
Institute of Plant Genome Research (NIPGR), New Delhi, India
Diego Javier Jiménez Department of Microbial Ecology, University of
Groningen, Center for Ecological and Evolutionary Studies (CEES),
Groningen, The Netherlands
Brian V. Jones Center for Biomedical and Health Science Research,
University of Brighton, School of Pharmacy and Biomolecular Sciences,
Brighton, East Sussex, UK
I. King Jordan School of Biology, Georgia Institute of Technology,
Atlanta, GA, USA
Amit Joshi Department of Biotechnology & Bioinformatics, SGGS
College, Chandigarh, India
Olivier Jousson Laboratory of Microbial Genomics, Centre for Integrative
Biology (CIBIO), University of Trento, Trento, Italy
Minoru Kanehisa Bioinformatics Center, Institute for Chemical Research,
Kyoto University, Uji, Kyoto, Japan
Geun-Joong Kim Department of Biological Sciences, College of Natural
Sciences, Chonnam National University, Gwangju, Republic of Korea
Joel Kostka School of Biology and Earth & Atmospheric Sciences, Georgia
Institute of Technology, Atlanta, GA, USA
Masaaki Kotera Bioinformatics Center, Institute for Chemical Research,
Kyoto University, Uji, Kyoto, Japan
Renzo Kottmann Max Planck Institute for Marine Microbiology, Bremen,
Germany
Marcio R. Lambais Luiz de Queiroz College of Agriculture (ESALQ),
University of São Paulo (USP), Piracicaba, SP, Brazil
Ronald F. Lamont Department of Gynecology and Obstetrics, Clinical
Institute, University of Southern Denmark, Odense University Hospital,
Odense, Denmark
Division of Surgery, University College London, Northwick Park Institute of
Medical Research Campus, London, UK
Yemin Lan School of Biomedical Engineering, Science and Health, Drexel
University, Philadelphia, PA, USA
Nicolas Lapaque INRA, AgroParisTech, Jouy en Josas, France
Henry C. M. Leung Department of Computer Science, The University of
Hong Kong, Hong Kong, China
Weizhong Li J. Craig Venter Institute, La Jolla, CA, USA
Mark Liles Department of Biological Sciences, Auburn University,
Auburn, AL, USA
Ho-Dong Lim Department of Biological Sciences, College of Natural
Sciences, Chonnam National University, Gwangju, Republic of Korea
Chien-Chi Lo Genome Science Group, Los Alamos National Laboratory,
Los Alamos, NM, USA
Hernan Lorenzi Informatics, J. Craig Venter Institute, Rockville, MD, USA
Petra Louis Rowett Institute of Nutrition and Health, Microbiology
Group, Gut Health Programme, University of Aberdeen, Aberdeen, UK
Connie Lovejoy Department of Biology, Laval University, Québec, QC,
Canada
Vedran Lucić Molecular Biology Department, Division of Biology, Faculty
of Science, University of Zagreb, Zagreb, Croatia
Wolfgang Ludwig Lehrstuhl für Mikrobiologie, Technische Universität
München, Freising, Germany
Haiwei Luo Department of Marine Sciences, University of Georgia, Athens,
GA, USA
Bridget Mabbutt Department of Chemistry and Biomolecular Sciences,
Macquarie University, Sydney, NSW, Australia
Norman J. MacDonald Faculty of Computer Science, Dalhousie University,
Halifax, NS, Canada
Emmanuelle Maguin INRA, AgroParisTech, Jouy en Josas, France
Sharmila Mande Biosciences R & D, TCS Innovation Labs, Tata Research
Development & Design Centre, Tata Consultancy Services Limited, Pune,
MH, India
Alan J. McCarthy Microbiology Research Group, Institute of
Integrative Biology, Biosciences Building, University of Liverpool,
Liverpool, UK
Alice C. McHardy Algorithmic Bioinformatics, Heinrich Heine University
Düsseldorf, Düsseldorf, Germany
David Mead Lucigen Corporation, Middleton, WI, USA
Marnix H. Medema Microbial Genomics and Bioinformatics Research
Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
Folker Meyer Institute of Genomic and Systems Biology, Argonne
National Laboratory, Argonne, IL, USA
Kentaro Miyazaki Department of Medical Genome Sciences, Graduate
School of Frontier Sciences, The University of Tokyo, Sapporo, Japan
Bioproduction Research Institute, National Institute of Advanced Industrial
Science and Technology, Sapporo, Japan
Yuki Moriya Bioinformatics Center, Institute for Chemical Research,
Kyoto University, Uji, Kyoto, Japan
Mark Morrison Diamantina Institute, The University of Queensland,
Woolloongabba, Brisbane QLD, Australia
Michael J. Moser Lucigen Corporation, Middleton, WI, USA
Raul Munoz Marine Microbiology Group, Department of Ecology and
Marine Resources, Institut Mediterrani d’Estudis Avançats (CSIC-UIB),
Illes Balears, Spain
Akira Muto Faculty of Agriculture and Life Science, Hirosaki University,
Hirosaki, Aomori, Japan
Heiko Nacke Institute of Microbiology and Genetics, Georg-August-
University of Göttingen, Göttingen, Germany
Istvan Nagy Institute of Biochemistry, Biological Research Centre of the
Hungarian Academy of Sciences, Szeged, Hungary
Tania Nasreen Department of Biological Sciences, University of Alberta,
Edmonton, AB, Canada
Shamima Nasrin Department of Biological Sciences, Auburn University,
Auburn, AL, USA
Josh D. Neufeld Department of Biology, University of Waterloo, Waterloo,
ON, Canada
R. Henrik Nilsson Department of Biological and Environmental Sciences,
University of Gothenburg, Göteborg, Sweden
Beifang Niu Center for Research in Biological Systems (CRBS), University
of California, San Diego, La Jolla, CA, USA
Brian D. Ondov National Biodefense Analysis and Countermeasures
Center, Frederick, MD, USA
Orla O’Sullivan Teagasc Food Research Centre, Moorepark, Fermoy, Co.,
Cork, Ireland
Asli Ismihan Ozen The Novo Nordisk Foundation Center for
Biosustainability, Technical University of Denmark, Kongens Lyngby,
Denmark
Stephan Pabinger Division of Bioinformatics, Biocenter, Innsbruck
Medical University, Innsbruck, Austria
AIT – Austrian Institute of Technology, Health & Environment Department,
Molecular Diagnostics, Vienna, Austria
Donovan H. Parks Faculty of Computer Science, Dalhousie University,
Halifax, NS, Canada
Australian Centre for Ecogenomics, University of Queensland, Brisbane
QLD, Australia
Ravi K. Patel Functional and Applied Genomics Laboratory, National
Institute of Plant Genome Research (NIPGR), New Delhi, India
Jörg Peplies Ribocon GmbH, Bremen, Germany
Adam M. Phillippy National Biodefense Analysis and Countermeasures
Center, Frederick, MD, USA
Rob Phillips Departments of Applied Physics and Bioengineering, California
Institute of Technology, Pasadena,
CA, USA
Rembert Pieper J. Craig Venter Institute, Rockville, MD, USA
Om Prakash National Centre for Cell Science, Pune, Maharashtra, India
Elmar Pruesse Microbial Genomics and Bioinformatics Research Group,
Max Planck Institute for Marine Microbiology, Bremen, Germany
Pei-Yuan Qian KAUST Global Collaborative Program, Division of Life
Science, Hong Kong University of Science and Technology, Hong Kong,
China
Christian Quast Microbial Genomics and Bioinformatics Research Group,
Max Planck Institute for Marine Microbiology, Bremen, Germany
Jean-Baptiste Ramond Centre for Microbial Ecology and Genomics
(CMEG), Genome Research Institute (GRI), University of Pretoria, Hatfield,
Pretoria, South Africa
Rachel Rezende Universidade Estadual de Santa Cruz, Laboratório de
Biotecnologia Microbiana, Ilhéus, BA, Brazil
Lavanya Rishishwar Bioinformatics, Georgia Institute of Technology,
Atlanta, GA, USA
Francisco Rodriguez-Valera Microbiologia, Universidad Miguel
Hernandez, Campus San Juan, San Juan, Alicante, Spain
Masa Roller Bioinformatics Group, Molecular Biology Department,
Division of Biology, Faculty of Science, University of Zagreb, Zagreb,
Croatia
Sandra Ronca Centre for Microbial Ecology and Genomics (CMEG),
Genome Research Institute (GRI), University of Pretoria, Hatfield, Pretoria,
South Africa
David J. Rooks Microbiology Research Group, Institute of Integrative
Biology, Biosciences Building, University of Liverpool, Liverpool, UK
Gail Rosen Department of Electrical and Computer Engineering, Drexel
University, Philadelphia, PA, USA
Paul Ross Teagasc Food Research Centre, Moorepark, Fermoy, Co., Cork,
Ireland
Ramon Rosselló-Móra Marine Microbiology Group, Department of
Ecology and Marine Resources, Institut Mediterrani d’Estudis Avançats
(CSIC-UIB), Illes Balears, Spain
Isaam Saeed Optimisation and Pattern Recognition Group, Melbourne
School of Engineering, The University of Melbourne, Parkville, Australia
Munmun Sarkar CSIR-Indian Institute of Chemical Biology, Kolkata,
India
Tulasi Satyanarayana Department of Microbiology, University of Delhi,
New Delhi, India
Karl-Heinz Schleifer Lehrstuhl für Mikrobiologie, Technische Universität
München, Freising, Germany
Thomas W. Schoenfeld Lucigen Corporation, Middleton, WI, USA
Matthew B. Scholz Genome Science Group, Los Alamos National
Laboratory, Los Alamos, NM, USA
Timmy Schweer Microbial Genomics and Bioinformatics Research Group,
Max Planck Institute for Marine Microbiology, Bremen, Germany
Vineet K. Sharma MetaInformatics Laboratory, Metagenomics and
Systems Biology Group, Department of Biological Sciences, Indian Institute
of Science Education and Research, Bhopal, India
Martin Sievers Zurich University of Applied Sciences, Institute of
Biotechnology, Waedenswil, Switzerland
Jagtar Singh Department of Biotechnology, Panjab University,
Chandigarh, India
Roy Sleator Department of Biological Sciences, Cork Institute of
Technology, Cork, Co. Cork, Ireland
Jens Stoye Faculty of Technology, Bielefeld University, Bielefeld,
Germany
Hikaru Suenaga Bioproduction Research Institute, National Institute of
Advanced Industrial Science and Technology, Sapporo, Japan
Moo-Jin Suh J. Craig Venter Institute, Rockville, MD, USA
Fengzhu Sun Molecular and Computational Biology Program, Department
of Biological Sciences, University of Southern California, Dana and David
Dornsife College of Letters, Arts and Sciences, Los Angeles, CA, USA
Visaahini Sureshan Department of Chemistry and Biomolecular Sciences,
Macquarie University, Sydney, NSW, Australia
Arbel D. Tadmor TRON – Translational Oncology at the University Medical
Center of the Johannes Gutenberg University Mainz, Mainz, Germany
Hideto Takami Microbial Genome Research Group, Japan Agency for
Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan
Eriko Takano Manchester Institute of Biotechnology, University of
Manchester, Manchester, UK
Sen-Lin Tang Bioinformatics Program, Taiwan International Graduate
Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan
Shiyuyun Tang School of Biology, Biodiversity Research Center, Georgia
Institute of Technology, Atlanta, GA, USA
Todd D. Taylor Laboratory for Integrated Bioinformatics, Core for Precise
Measuring and Modeling, RIKEN Center for Integrative Medical Sciences,
Yokohama, Japan
João Carlos Teixeira Dias Universidade Estadual de Santa Cruz,
Laboratório de Biotecnologia Microbiana, Ilhéus, BA, Brazil
Torsten Thomas School of Biotechnology and Biomolecular Sciences &
Centre for Marine Bio-Innovation, University of New South Wales, Sydney,
NSW, Australia
Toshiaki Tokimatsu Bioinformatics Center, Institute for Chemical
Research, Kyoto University, Uji, Kyoto, Japan
Jörg Tost Laboratory for Epigenetics and Environment, Centre National de
Génotypage, CEA-Institut de Génomique, Evry, France
Zlatko Trajanoski Division of Bioinformatics, Biocenter, Innsbruck
Medical University, Innsbruck, Austria
Susannah Tringe US Department of Energy Joint Genome Institute, Walnut
Creek, CA, USA
Huai-Kuang Tsai Institute of Information Science, Academia Sinica,
Taipei, Taiwan
Ching-Hung Tseng Bioinformatics Program, Taiwan International Graduate
Program, Biodiversity Research Center, Institute of Information Science,
Academia Sinica, Taipei, Taiwan
David Wayne Ussery Bioscience Division of Oak Ridge National Laboratory,
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Joy D. Van Nostrand Department of Microbiology and Plant Biology,
Institute for Environmental Genomics, University of Oklahoma, Norman,
OK, USA
Digvijay Verma Department of Microbiology, University of Delhi,
New Delhi, India
Kristian Vlahoviček Bioinformatics Group, Molecular Biology
Department, Division of Biology, Faculty of Science, University of Zagreb,
Zagreb, Croatia
Jun Wang BGI Shenzhen, Shenzhen, China
Lingling Wang Department of Animal Sciences, The Ohio State University,
Columbus, OH, USA
Tse-Yi Wang Department of Medical Research, Mackay Memorial Hospital,
New Taipei City, Taiwan
Yi Wang Department of Computer Science, The University of Hong Kong,
Hong Kong, China
Yong Wang Division of Deep Sea Science, Sanya Institute of Deep Sea
Science and Engineering, San Ya, Hainan, China
Yumei Wang Dalian University of Technology, Dalian, China
Tandy Warnow Institute for Genomic Biology, University of Illinois, IL, USA
Tilmann Weber Interfakultäres Institut für Mikrobiologie und
Infektionsmedizin Tübingen, Mikrobiologie/Biotechnologie, Eberhard-
Karls Universität, Tübingen, Germany
Martin Wu Department of Biology, University of Virginia, Charlottesville,
VA, USA
Sitao Wu Center for Research in Biological Systems (CRBS), University of
California, San Diego, La Jolla, CA, USA
Li Charlie Xia Molecular and Computational Biology Program, Department
of Biological Sciences, University of Southern California, Dana and David
Dornsife College of Letters, Arts and Sciences, Los Angeles, CA, USA
Jianping Xu Department of Biology, McMaster University, Hamilton, ON,
Canada
Yuko Yamada Nagahama Institute of Bio-Science and Technology,
Nagahama, Shiga, Japan
Jian Yang MOH Key Laboratory of Systems Biology of Pathogens, Institute
of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union
Medical College (CAMS&PUMC), Beijing, People’s Republic of China
Pablo Yarza Ribocon GmbH, Bremen, Germany
Yuzhen Ye Indiana University, School of Informatics and Computing,
Bloomington, IN, USA
Etienne Yergeau National Research Council Canada, Montreal, QC,
Canada
Pelin Yilmaz Microbial Genomics and Bioinformatics Research Group,
Max Planck Institute for Marine Microbiology, Bremen, Germany
S. M. Yiu Department of Computer Science, The University of Hong Kong,
Hong Kong, China
Zhongtang Yu Department of Animal Sciences, Environmental Science
Graduate Program, The Ohio State University, Columbus, OH, USA
María Mercedes Zambrano Molecular Genetics and Microbial Ecology,
Corporación CorpoGen, Bogotá, DC, Colombia
Xinqing Zhao School of Life Science and Biotechnology, Dalian University
of Technology, Dalian, People’s Republic of China
Jizhong (Joe) Zhou Department of Microbiology and Plant Biology,
Institute for Environmental Genomics, University of Oklahoma, Norman,
OK, USA
Department of Environmental Science and Engineering, Tsinghua University,
Beijing, China
Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley,
CA, USA
Huaiqiu Zhu Department of Biomedical Engineering, and Center for
Theoretical Biology, Peking University, Beijing, China
Zhengwei Zhu Center for Research in Biological Systems (CRBS), University
of California, San Diego, La Jolla, CA, USA
A
A 123 of Metagenomics from environmental samples. Arguably,

metagenomics has been the fastest growing field
Torsten Thomas1, Jack Gilbert2 and of microbiology in the last few years and has
Folker Meyer3 almost become a routine practice. The learning
1
School of Biotechnology and Biomolecular Sciences & Centre for Marine Bio-Innovation, University of New South Wales, Sydney, NSW, Australia
2 Department of Ecology & Evolution, University of Chicago, Chicago, IL, USA
3 Institute of Genomic and Systems Biology, Argonne National Laboratory, Argonne, IL, USA

Introduction

Microbial ecology aims to comprehensively describe the diversity and function of microorganisms in the environment. Until recently, culturing, microscopy, and chemical or biological assays were the main tools in this field. Molecular methods, such as 16S rRNA gene sequencing, were applied to environmental systems in the 1990s and started to uncover a remarkable diversity of organisms (Barns et al. 1994). Soon, the thirst for describing microbial systems was no longer satisfied by the knowledge of the diversity of just one or a few genes. Thus, approaches were developed to describe the total genetic diversity of a given environment (Riesenfeld et al. 2004). One such approach is metagenomics, which involves sequencing the total DNA extracted

curve in the field has been steep, and many obstacles still need to be overcome to make metagenomics a reliable and standard process. It is timely to reflect on what has been learned over the past few years from metagenome projects and to predict future needs and developments.

This brief primer gives an overview of the current status and practices as well as the limitations of metagenomics. We present an introduction to sampling design, DNA extraction, sequencing technology, assembly, annotation, data sharing, and storage.

Sampling Design and DNA Processing

Metagenomic studies of single habitats, for example, acid mine drainage (Tyson et al. 2004), termite hindgut (Warnecke et al. 2007), cow rumen (Hess et al. 2011), and the human gastrointestinal tract (Gill et al. 2006), have provided an insight into the basic diversity and ecology of these environments. Moreover, comparative studies have explored the ecological distribution of genes and the functional adaptations of different microbial communities to specific ecosystems (Tringe et al. 2005; Dinsdale et al. 2008; Delmont et al. 2011). These pioneering studies were predominately designed to develop
K.E. Nelson (ed.), Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools,
DOI 10.1007/978-1-4899-7478-5, © Springer Science+Business Media New York 2015
and prove the general metagenomic approach and were often limited by the high cost of sequencing. Hence, desirable scientific methodology, including biological replication, could not be adopted, a situation that precluded appropriate statistical analyses and comparison (Prosser 2010).

The significant reduction, and indeed continuing fall, in sequencing costs (see below) now means that the central tenets of scientific investigation can be adhered to. Rigorous experimental design will help researchers explore the complexity of microbial interactions and will lead to improved catalogs of proteins and genetic elements. Individual ecosystems can now be studied with appropriate cross-sectional and temporal approaches designed to identify the frequency and distribution of variance in community interaction and development (Knight et al. 2012). Such studies should also pay close attention to the collection of comprehensive physical, chemical, and biological data (see below). This will enable scientists to elucidate the emergent properties of even the most complex biological system. This capability will provide the potential to identify drivers at multiple spatial, temporal, taxonomic, phylogenetic, functional, and evolutionary levels and to define the feedback mechanisms that mediate equilibrium.

The frequency and distribution of variance within a microbial ecosystem are basic factors that must be ascertained by rigorous experimental design and analysis. For example, to analyze the microbial community structure from 1 L of seawater in a coastal pelagic ecosystem, one must also ideally define how representative this will be for the ecosystem as a whole and what the bounds of that ecosystem are. Numerous studies of marine systems have shown how community structure can vary between water masses and over time (e.g., Gilbert et al. 2012; Fuhrman 2009; Fuhrman et al. 2006, 2008; Martiny et al. 2006), and metagenomics currently helps further define how community structure varies in these environments (Ottesen et al. 2011; DeLong et al. 2006; Rusch et al. 2007; Gilbert et al. 2010a). In contrast, in soil systems variance in space appears to be far larger than in time (Mackelprang et al. 2011; Barberan et al. 2012; Bergmann et al. 2011; Nemergut et al. 2011; Bates et al. 2011). Considerable work is still needed to determine spatial heterogeneity, for example, how representative a 0.1 mg sample of soil is with respect to the larger environment from which it was taken.

The design of a sampling strategy is implicit in the scientific questions asked and the hypotheses tested, and standard rules outside of replication and frequency of observation are hard to define. However, the question of “depth of observation” is prudent to address because researchers can now sequence microbiomes of individual environments with exceptional depth or breadth. By enabling either deep characterization of the taxonomic, phylogenetic, and functional potential of a given ecosystem or a shallow investigation of these elements across hundreds or thousands of samples, current sequencing technology (see below) is changing the way microbial surveys are being performed (Knight et al. 2012).

DNA handling and processing play a major role in exploring microbial communities through metagenomics (see also DNA extraction methods for human studies, “Extraction Methods, DNA” and “Extraction Methods, Variability Encountered in”). Specifically, it is well known that the type of DNA extraction used for a sample will affect the community profile obtained (e.g., Delmont et al. 2012). Therefore, with projects like the Earth Microbiome Project that aim to compare a large number of samples, efforts have been made to standardize DNA extraction protocols for every physical sample. Clearly, no single protocol will be suitable for every sample type (Gilbert et al. 2010b, 2011). For example, a particular extraction protocol might yield only very low DNA concentrations for a particular sample type, making it necessary to explore other protocols in order to improve efficiency. However, differences among DNA extraction protocols may limit comparability of data. Therefore, researchers need to further define in qualitative and quantitative terms how different DNA extraction methodologies affect microbial community structure.
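The trade-off between depth and breadth of observation discussed above can be illustrated with a back-of-the-envelope calculation. The sketch below is only an approximation: it assumes that the reads of one run are split evenly among samples and that reads are drawn at random, so that a genome receives reads in proportion to its share of the community DNA. The function name, the 4 Mbp genome size, and the 0.1 % relative abundance are hypothetical values chosen for illustration; the run size (~3 G reads of 100 nt) is the HiSeq 2000 figure given in Table 1.

```python
def expected_coverage(total_reads, read_length_nt, n_samples,
                      relative_abundance, genome_size_bp):
    """Expected per-sample coverage of one genome in a metagenome.

    Illustrative assumptions: reads are divided evenly among samples and
    each read is drawn at random, so a genome receives reads in proportion
    to its share of the community DNA.
    """
    reads_per_sample = total_reads / n_samples
    bases_from_genome = reads_per_sample * read_length_nt * relative_abundance
    return bases_from_genome / genome_size_bp


# Hypothetical target: a 4 Mbp genome making up 0.1 % of the community DNA.
for n_samples in (1, 100, 1000):
    cov = expected_coverage(3e9, 100, n_samples, 0.001, 4e6)
    print(f"{n_samples:>5} samples: ~{cov:.3g}x expected coverage")
```

Under these assumptions, a run devoted to a single sample would cover the target genome about 75-fold, whereas the same run spread over 1,000 samples would cover it less than 0.1-fold. This is the sense in which a survey is either deep or broad.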
Sequencing Technology and Quality Control

The rapid development of sequencing technologies over the past few years has arguably been one of the driving forces in the field of metagenomics. While shotgun metagenomic studies initially relied on hardware-intensive and costly Sanger sequencing technology (Tyson et al. 2004; Venter et al. 2004) available only to large research institutes, the advent and continuous release of several next-generation sequencing (NGS) platforms has democratized the sequencing market and has given individual laboratories or research teams access to affordable sequencing data. Among the available NGS options, the Roche (Margulies et al. 2005), Illumina (Bentley et al. 2008), Ion Torrent (Rothberg et al. 2011), and SOLiD (Life Technologies) platforms have been applied to metagenomic samples, with the former two being more intensively used than the latter two. The features of these sequencing technologies have been extensively reviewed – see, for example, Metzker (2010) and Quail et al. (2012) – and are therefore only briefly summarized here (Table 1).

Roche’s platform utilizes pyrosequencing (also often referred to as 454 sequencing because of the name of the company that initially developed the platform) as its underlying molecular principle. Pyrosequencing involves the binding of a primer to a template and the sequential addition of all four nucleoside triphosphates in the presence of a DNA polymerase. If the offered nucleoside triphosphate matches the next position after the primer, then its incorporation results in the release of diphosphate (pyrophosphate, or PPi). PPi production is coupled by an enzymatic reaction involving an ATP sulfurylase and a luciferase to the production of a light signal that is detected through a charge-coupled device. The Ion Torrent sequencing platform uses a related approach; however, here, protons that are released during nucleoside incorporation are detected through semiconductor technology. In both cases, the production of light or charge signals relates to the incorporation of the sequentially offered nucleoside, which can be used to deduce the sequence downstream of the primer. Homopolymer sequences create signals proportional to the number of positions; however, the linearity of this relationship is limited by enzymatic and engineering factors, leading to well-investigated insertion and deletion (indel) sequencing errors (Prabakaran et al. 2011; McElroy et al. 2012).

Illumina sequencing is based on the incorporation and detection of fluorescently labeled nucleoside triphosphates to extend a primer bound to a template. The key feature of the nucleoside triphosphates is a chemically modified 3′ position that does not allow for further chain extension (“terminator”). Thus, the primer gets extended by only one position, whose identity is detected by different fluorescent colors for each of the four nucleosides. Through a chemical reaction, the fluorescent label is then removed, and the 3′ position is converted into a hydroxyl group
A 123 of Metagenomics, Table 1 Next-generation sequencing technologies and their throughput, errors, and application to metagenomics

Machine (manufacturer) | Throughput (per machine run) | Reported errors | Error/metagenomic example references
GS FLX Titanium (454/Roche) | ~1 M reads @ ~500 nt | 0.56 % indels; up to 0.12 % substitution | (McElroy et al. 2012; Fan et al. 2012)
HiSeq 2000 (Illumina) | ~3 G reads @ 100 nt | ~0.001 % indels; up to 0.34 % substitution | (McElroy et al. 2012; Quail et al. 2012; Hess et al. 2011)
Ion Torrent PGM (Life Technologies) | ~0.1–5 M reads @ ~200 nt | 1.5 % indels | (Loman et al. 2012; Whiteley et al. 2012)
SOLiD (Life Technologies) | ~120 M reads @ ~50 nt | Up to 3 % | (Salmela 2010; Zhou et al. 2011; Iverson et al. 2012)
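As a rough illustration of what the throughput column of Table 1 implies, the sketch below multiplies read count by read length to estimate the number of bases produced per run. The figures are the approximate values from the table (for the Ion Torrent PGM, both ends of the stated range are used); the function name and the dictionary layout are illustrative choices only.

```python
# Approximate (reads per run, read length in nt), taken from Table 1.
TABLE_1 = {
    "454/Roche Titanium": (1e6, 500),
    "Illumina HiSeq 2000": (3e9, 100),
    "Ion Torrent PGM (low end)": (0.1e6, 200),
    "Ion Torrent PGM (high end)": (5e6, 200),
    "SOLiD": (120e6, 50),
}


def yield_gbp(reads, read_length_nt):
    """Approximate number of bases per run, in gigabase pairs."""
    return reads * read_length_nt / 1e9


for platform, (reads, length) in TABLE_1.items():
    print(f"{platform}: ~{yield_gbp(reads, length):g} Gbp per run")
```

By this estimate the platforms differ by more than two orders of magnitude in data volume per run, which is one reason why read length and throughput have to be weighed together when choosing a platform.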
allowing for another round of nucleoside incorporation. The use of a reversible terminator thus allows for a stepwise and detectable extension of the primer that results in the determination of the template sequence. In theory, this process could be repeated to generate very long sequences; in practice, however, misincorporation of nucleosides in the many clonal template strands results in the fluorescent signal getting out of phase, and thus reliable sequencing information is only obtained for about 200 positions (Quail et al. 2012).

SOLiD sequencing utilizes ligation, rather than polymerase-mediated chain extension, to determine the sequence of a template. Primers are extended through ligation with fluorescently labeled oligonucleotides. The high specificity of the ligase ensures that only oligonucleotides matching the downstream sequence will be incorporated; and by encoding different oligonucleotides with different fluorophores, the sequence can be determined.

It is important to understand the features of the sequencing technology in terms of throughput, read length, and errors (see Table 1), because these will have a significant impact on downstream processing. For example, the relatively high frequency of homopolymer errors for the pyrosequencing technology can impact ORF identification (Rho et al. 2010) but might still allow for reliable gene annotation, because of its comparatively long read length (Wommack et al. 2008). Conversely, the short read length of Illumina sequencing might reduce the rate of annotation of unassembled data, but the substantial throughput and data volume generated can facilitate assembly of entire draft genomes from metagenomic data (Hess et al. 2011). These considerations are also particularly relevant with new sequencing technologies coming online. These include single-molecule sequencing using zero-mode waveguide nanostructure arrays (Eid et al. 2009), which promises read lengths beyond 1,000 bp and has been shown to improve the hybrid assemblies of genomes (Koren et al. 2012), as well as nanopore sequencing (Schneider and Dekker 2012), which also promises long read lengths.

One important practical aspect to consider when analyzing raw sequencing data is the quality value assigned to reads. For a long time, the quality assessment provided by the technology vendor was the only available option for data consumers. Recently, however, a vendor-independent method for error detection and characterization has been described that bases its error estimates on reads that are accidentally duplicated during the PCR stages (a fact described for Ion Torrent, 454, and Illumina sequencing technologies) (Trimble et al. 2012). Moreover, a significant number of publicly available metagenomic datasets contain sequence adapters (apparently because quality control is often performed on the level of assembled sequences, not raw reads). Simple statistical analyses with tools such as FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) will rapidly detect most of these adapter contaminations. An important aspect of quality control is therefore that each individual dataset requires error profiling and that relying on general properties of the platform used is not sufficient.

Assembly

Assembly of shotgun sequencing data can in general follow two strategies: the overlap-layout-consensus (OLC) and the de Bruijn graph approach (see also “▶ A De Novo Metagenomic Assembly Program for Shotgun DNA Reads”). These two strategies are employed by a number of different genome assemblers, and this topic has been reviewed recently (Miller et al. 2010). Basically, the OLC assembly involves the pairwise comparison of sequence reads and the ordering of matching pairs into an overlap graph. These overlapping sequences are then merged into a consensus sequence. Assembly with the de Bruijn strategy involves representing the sequence reads as a graph of all the k-mers they contain. Two k-mers are connected when the sequence reads have them in sequential, overlapping positions. Thus, all reads of a dataset are represented by the connections within
the de Bruijn graph, and assembled contigs are generated by traversing these connections to yield a sequence of k-mers.

The OLC assembly has the advantage that pairwise comparison can be performed to allow for a defined degree of dissimilarity between reads. This can compensate for sequencing errors and allows for the assembly of reads from heterogeneous populations (Tyson et al. 2004). However, the memory requirement for pairwise comparisons increases roughly quadratically with the number of reads in the dataset; hence, OLC assemblers often cannot deal with large datasets (e.g., Illumina data). Nevertheless, several OLC assemblers, including the Celera Assembler (Miller et al. 2008), Phrap (de la Bastide and McCombie 2007), and Newbler (Roche), have been used to assemble partial or complete draft genomes from metagenomic data; see, for example, Tyson et al. (2004), Liu et al. (2011), and Brown et al. (2012).

In contrast, memory requirements of de Bruijn assemblers are largely determined by the k-mer size chosen to define the graph. Thus, these assemblers have been used successfully with large numbers of short reads. Initially, de Bruijn assemblers designed for clonal genomes, such as Velvet (Zerbino and Birney 2008), SOAP (Li et al. 2008), and ABySS (Simpson et al. 2009), were used to assemble metagenomic data. Because of the heterogeneous nature of microbial populations, however, assemblies often ended up fragmented. One reason is that every positional difference between two reads from the same region of two closely related genomes will create a “bubble” in the graph. Another reason is that sequence errors in low-abundance reads cause terminating branches. Traversing such a highly branched graph leads to a large number of contigs. These problems have been partially overcome by modifying existing de Bruijn assemblers, as in MetaVelvet (Namiki et al. 2012), or by newly designed de Bruijn-based algorithms such as Meta-IDBA (Peng et al. 2011; see also “Meta-IDBA, overview”). Conceptually, these solutions often include the identification of subgraphs that correspond to individual genomes or the use of k-mer abundance information to find an optimal solution path through the graph.

These subdividing approaches are analogous to binning metagenomic reads or contigs in order to identify groups of sequences that define a specific genome. These bins or even individual sequence reads can also be taxonomically classified by comparison with known reference sequences. Binning and classifying of sequences can be based on phylogeny, similarity, or composition (or combinations thereof), and a large number of algorithms and software packages are available. For recent comparisons and benchmarking of binning and classification software, please see Bazinet and Cummings (2012) and Droge and McHardy (2012). Obviously, care has to be taken with any automated process, since nonrelated sequences can be combined to produce genomic chimera bins or classes. It is thus advisable that any binning or classification strategy is thoroughly tested through appropriate in vitro and in silico simulations (Mavromatis et al. 2007; Morgan et al. 2010; McElroy et al. 2012). Also, manual curation of contigs and iterative assembly and mapping can produce improved genomes from metagenomic data (Dutilh et al. 2009). Through such carefully designed strategies and refined processes, nearly complete genomes can be assembled, even for low-abundance organisms, from large numbers of short reads (Iverson et al. 2012).

Annotation

Initially, techniques developed for annotating clonal genomes were applied to metagenomic data, and several tools for metagenomic analysis, such as MG-RAST (Meyer et al. 2008) and IMG/M (Markowitz et al. 2008), were derived from existing software suites. For metagenomic projects, the principal challenges lie in the size of the dataset, the heterogeneity of the data, and the fact that sequences are frequently short, even if assembled prior to analysis.

The first step of the analysis (after extensive quality control; see above) involves identification
of genes from a DNA sequence. Fundamentally, two approaches exist: the extrinsic approach, which relies on similarity comparison of an unknown sequence to existing databases, and the intrinsic (or de novo) approach, which applies statistical analysis of sequence properties, such as codon usage, to define likely open reading frames (ORFs). For metagenomic data, the extrinsic approach (e.g., running a similarity search with BLASTX) comes at a significant computational cost (Wilkening et al. 2009), rendering it less attractive. De novo approaches based on codon or nucleotide k-mer usage are thus more promising for large datasets. De novo gene-calling software for microbial genomes is trained on long contigs and assumes clonal genomes. For metagenomic datasets, however, this approach is often unsuitable, because training data is lacking and multiple different codon usage (or k-mer) profiles are present owing to the many different genomes in the sample.

However, several software packages have been designed to predict genes for short fragments or even reads (see Trimble et al. 2012 for a review). The most important finding of that review is the effect of errors on gene prediction performance, reducing the reading frame accuracy of most tools to well below 20 % at 3 % sequencing error. Only the software FragGeneScan (Rho et al. 2010; see also “FragGeneScan, overview”) accounted for the possibility that metagenomic sequences may contain errors, thus allowing it to clearly outperform its competitors.

Once identified, protein-coding genes require functional assignment. Here again, numerous tools and databases exist. Many researchers have found that performing BLAST analysis against the NCBI nonredundant database adds little value to their metagenomic datasets. Preferable are databases that contain high-level groupings of functions, for example, into metabolic pathways as in KEGG (Kanehisa 2002) or into subsystems as in SEED (Overbeek et al. 2005). Using such higher-level groupings allows for the generation of overviews and comparison between samples after statistical normalization.

The time and resources required to perform functional annotations are substantial, but approaches that project multiple results derived from a single sequence analysis into multiple namespaces can minimize these computational costs (Wilke et al. 2012). Numerous tools are also available to predict, for example, short RNAs and/or other genomic features, but these tools are frequently less useful for large metagenomic datasets that exhibit both low sequence quality and short reads.

Several integrated services package annotation functionality into a single website. The CAMERA (Seshadri et al. 2007) website, for example, provides users with the ability to run a number of pipelines on metagenomic data. The Joint Genome Institute’s IMG/M web service also provides an analysis for assembled metagenomic data, which has been used so far for over 300 metagenomic datasets. The European Bioinformatics Institute provides a service aimed at smaller, typically 454/pyrosequencing-derived metagenomes. The most popular service is the MG-RAST system (Meyer et al. 2008), used for over 50,000 metagenomes with over 140 billion base pairs of data. The system offers comprehensive quality control, tools for comparison of datasets, and data import and export tools to, for example, QIIME (Caporaso et al. 2010) using standard formats such as BIOM (McDonald et al. 2012).

Metadata, Standards, Sharing, and Storage

With over 50,000 metagenomes available, the scientific community has realized that standardized metadata (“data about data”) and higher-level classification (e.g., a controlled vocabulary) will increase the usefulness of datasets for novel discoveries (see also ▶ Metagenomics, Metadata, and Meta-analysis). Through the efforts of the Genomic Standards Consortium (GSC) (Field et al. 2011), a set of minimal questionnaires has
been developed and accepted by the community (Yilmaz et al. 2010) that allows effective communication of metadata for metagenomic samples of diverse types. While the “required” GSC metadata is purposefully minimal and thus provides only a rough description, several domain-specific environmental packages exist that contain more detailed information.

As the standards evolve to match the needs of the scientific community, the groups developing software and analysis services have begun to rely on the presence of GSC-compliant metadata, effectively turning them into essential data for any metagenome project. Furthermore, comparative analysis of metagenomic datasets is becoming a routine practice, and acquiring metadata for these comparisons has become a requirement for publication in several scientific journals. Since reanalysis of raw sequence reads is often computationally too costly, the sharing of analysis results is also advisable. Currently only the IMG/M and MG-RAST platforms are designed to provide cross-sample comparisons without the need to recompute analysis results. In the MG-RAST system, moreover, users can share data (after providing metadata) with other users or make data publicly available.

Metagenomic datasets continue to grow in size. Indeed, the first metagenomes of several hundred gigabase pairs already exist. Therefore, storage and curation of metagenomic data have become a central theme. The on-disk representation of raw data and analyses has led to massive storage issues for groups attempting meta-analyses. Currently there is no solution for accessing relevant subsets of data (e.g., only reads and analyses pertaining to a specific phylum or a specific species) without downloading the entire dataset. Cloud technologies may in the future provide attractive computational solutions for storage and computing problems. However, specific and metadata-enabled solutions are required for cloud systems to power the community-wide (re-)analysis efforts of the first 50,000 metagenomes.

Conclusion

Metagenomics has truly proven a valuable tool for analyzing microbial communities. Technological advances will continue to drive down the sequencing cost for metagenomic projects and, in fact, the flood of current datasets indicates that funding to obtain sequences is not a major limitation. Major bottlenecks are encountered, however, in terms of storage and computational processing of sequencing data. With community-wide efforts and standardized tools, the impact of these current limitations might be managed in the short term. In the long term, however, large standardized databases will be required (e.g., a MetaGeneBank) to give information access to the entire scientific community. Every metagenomic dataset contains many new and unexpected discoveries, and the efforts of microbiologists worldwide will be needed to ensure that nothing is being missed. As for the data, whether raw or processed, it is just data. Only its biological and ecological interpretation will further our understanding of the complex and wonderful diversity of the microbial world around us.

Government License

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a US Department of Energy Office of Science Laboratory, is operated under Contract No. DE-AC02-06CH11357. The US Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

References

Barberan A, Bates ST, et al. Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 2012;6(2):343–51.
Barns SM, Fundyga RE, et al. Remarkable archaeal diversity detected in a Yellowstone National Park hot
spring environment. Proc Natl Acad Sci U S A. 1994;91(5):1609–13.
Bates ST, Berg-Lyons D, et al. Examining the global distribution of dominant archaeal populations in soil. ISME J. 2011;5(5):908–17.
Bazinet AL, Cummings MP. A comparative evaluation of sequence classification programs. BMC Bioinforma. 2012;13(1):92.
Bentley DR, Balasubramanian S, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–9.
Bergmann GT, Bates ST, et al. The under-recognized dominance of Verrucomicrobia in soil bacterial communities. Soil Biol Biochem. 2011;43(7):1450–5.
Brown MV, Lauro FM, et al. Global biogeography of SAR11 marine bacteria. Mol Syst Biol. 2012;8:595.
Caporaso JG, Kuczynski J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7(5):335–6.
de la Bastide M, McCombie WR. Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinforma. 2007. Chapter 11: Unit11 14.
Delmont TO, Malandain C, et al. Metagenomic mining for microbiologists. ISME J. 2011;5(12):1837–43.
Delmont TO, Prestat E, et al. Structure, fluctuation and magnitude of a natural grassland soil metagenome. ISME J. 2012;6(9):1677–87.
DeLong EF, Preston CM, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311(5760):496–503.
Dinsdale EA, Edwards RA, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–32.
Droge J, McHardy AC. Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief Bioinform. 2012;13(6):646–55.
Dutilh BE, Huynen MA, et al. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics. 2009;25(21):2878–81.
Eid J, Fehr A, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323(5910):133–8.
Fan L, Reynolds D, et al. Functional equivalence and evolutionary convergence in complex communities of microbial sponge symbionts. Proc Natl Acad Sci U S A. 2012;109(27):E1878–87.
Field D, Amaral-Zettler L, et al. The genomic standards consortium. PLoS Biol. 2011;9(6):e1001088.
Fuhrman JA. Microbial community structure and its functional implications. Nature. 2009;459(7244):193–9.
Fuhrman JA, Hewson I, et al. Annually reoccurring bacterial communities are predictable from ocean conditions. Proc Natl Acad Sci U S A. 2006;103(35):13104–9.
Fuhrman JA, Steele JA, et al. A latitudinal diversity gradient in planktonic marine bacteria. Proc Natl Acad Sci U S A. 2008;105(22):7774–8.
Gilbert JA, Field D, et al. The taxonomic and functional diversity of microbes at a temperate coastal site: a ‘multi-omic’ study of seasonal and diel temporal variation. PLoS One. 2010a;5(11):e15545.
Gilbert JA, Meyer F, et al. The earth microbiome project: meeting report of the “1 EMP meeting on sample selection and acquisition at Argonne National Laboratory October 6 2010”. Stand Genomic Sci. 2010b;3(3):249–53.
Gilbert JA, Bailey M, et al. The earth microbiome project: the Meeting Report for the 1st International Earth Microbiome Project Conference, Shenzhen, China, June 13th-15th 2010. Stand Genomic Sci. 2011;5(2):243–7.
Gilbert JA, Steele JA, et al. Defining seasonal marine microbial community dynamics. ISME J. 2012;6:298–308.
Gill SR, Pop M, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–9.
Hess M, Sczyrba A, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–7.
Iverson V, Morris RM, et al. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science. 2012;335(6068):587–90.
Kanehisa M. The KEGG database. Novartis Found Symp. 2002;247:91–101. discussion 101–103, 119–128, 244–152.
Knight R, Jansson J, et al. Designing better metagenomic surveys: the role of experimental design and metadata capture in making useful metagenomic datasets for ecology and biotechnology. Nat Biotechnol. 2012;30(6):513–2.
Koren S, Schatz MC, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012;30(7):693–700.
Li R, Li Y, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008;24(5):713–4.
Liu MY, Kjelleberg S, et al. Functional genomic analysis of an uncultured delta-proteobacterium in the sponge Cymbastela concentrica. ISME J. 2011;5(3):427–35.
Loman NJ, Misra RV, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012;30(5):434–9.
Mackelprang R, Waldrop MP, et al. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature. 2011;480(7377):368–71.
Margulies M, Egholm M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376–80.
Markowitz VM, Ivanova NN, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008;36(Database issue):D534–8.
Martiny JB, Bohannan BJ, et al. Microbial biogeography: putting microorganisms on the map. Nat Rev Microbiol. 2006;4(2):102–12.
Mavromatis K, Ivanova N, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4(6):495–500.
McDonald D, Clemente JC, et al. The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience. 2012;1(1):7.
McElroy KE, Luciani F, et al. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012;13:74.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11(1):31–46.
Meyer F, Paarmann D, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Miller JR, Delcher AL, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24(24):2818–24.
Miller JR, Koren S, et al. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95(6):315–27.
Morgan JL, Darling AE, et al. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One. 2010;5(4):e10209.
Namiki T, Hachiya T, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
Nemergut DR, Costello EK, et al. Global patterns in the biogeography of bacterial taxa. Environ Microbiol. 2011;13(1):135–44.
Ottesen EA, Marin R, et al. Metatranscriptomic analysis of autonomously collected and preserved marine bacterioplankton. ISME J. 2011;5(12):1881–95.
Overbeek R, Begley T, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33(17):5691–702.
Peng Y, Leung HC, et al. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics. 2011;27(13):i94–101.
Prabakaran P, Streaker E, et al. 454 antibody sequencing – error characterization and correction. BMC Res Notes. 2011;4:404.
Prosser JI. Replicate or lie. Environ Microbiol. 2010;12(7):1806–10.
Rothberg JM, Hinz W, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011;475(7356):348–52.
Rusch DB, Halpern AL, et al. The Sorcerer II global ocean sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007;5(3):e77.
Schneider GF, Dekker C. DNA sequencing with nanopores. Nat Biotechnol. 2012;30(4):326–8. doi: 10.1038/nbt.2181.
Salmela L. Correction of sequencing errors in a mixed set of reads. Bioinformatics. 2010;26(10):1284–90.
Seshadri R, Kravitz SA, et al. CAMERA: a community resource for metagenomics. PLoS Biol. 2007;5(3):e75.
Simpson JT, Wong K, et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23.
Trimble WL, Keegan KP, et al. Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC Bioinforma. 2012;13(1):183.
Tringe SG, von Mering C, et al. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–7.
Tyson GW, Chapman J, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428(6978):37–43.
Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304(5667):66–74.
Warnecke F, Luginbuhl P, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450(7169):560–5.
Whiteley AS, Jenkins S, et al. Microbial 16S rRNA Ion Tag and community metagenome sequencing using the Ion Torrent (PGM) platform. J Microbiol Methods. 2012;91(1):80–8.
Wilke A, Harrison T, et al. The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC Bioinforma. 2012;13:141.
Wilkening J, Wilke A, et al. Using clouds for metagenomics: a case study. IEEE Cluster 2009. 2009
Wommack KE, Bhavsar J, et al. Metagenomics: read length matters. Appl Environ Microbiol. 2008;74(5):1453–63.
Quail M, Smith ME, et al. A tale of three next generation Yilmaz P, Kottmann R, et al. The “Minimum Information
sequencing platforms: comparison of Ion Torrent, about an ENvironmental Sequence” (MIENS) specifi-
Pacific Biosciences and Illumina MiSeq sequencers. cation. Nat Biotechnol. 2010. in print.
BMC Genomics. 2012;13(1):341. Zerbino DR, Birney E. Velvet: algorithms for de novo
Rho M, Tang H, et al. FragGeneScan: predicting genes in short read assembly using de Bruijn graphs. Genome
short and error-prone reads. Nucleic Acids Res. Res. 2008;18(5):821–9.
2010;38(20):e191. Zhou R, Ling S, et al. Population genetics in nonmodel
Riesenfeld CS, Schloss PD, et al. Metagenomics: genomic organisms: II. Natural selection in marginal habitats
analysis of microbial communities. Annu Rev Genet. revealed by deep sequencing on dual platforms. Mol
2004;38:525–52. Biol Evol. 2011;28(10):2833–42.
A De Novo Metagenomic Assembly Program for Shotgun DNA Reads

Huaiqiu Zhu
Department of Biomedical Engineering, and Center for Theoretical Biology, Peking University, Beijing, China

Synonyms

MAP: metagenomic assembly program

Definition

Contig: a set of overlapping DNA segments that together represent a consensus region of DNA. Assembly (also genome assembly): the process of taking a large number of short DNA sequencing reads and putting them back together into contigs that represent the DNA from which the reads originated.

Introduction

MAP (metagenomic assembly program) is a de novo assembler designed for shotgun DNA reads (recommended length >200 bp) from metagenome sequencing projects (Lai et al. 2012). The program focuses on the metagenomic assembly problem for longer reads produced by, for example, Sanger sequencing (typically 700–1,000 bp) and 454 sequencing (typically 200–500 bp). It also uses mate-pair information, that is, reads from both ends of a DNA fragment of a given size (e.g., an insert in a vector plasmid in Sanger sequencing or a mate-pair template in 454 sequencing), which is commonly available in Sanger sequencing and in most of the new sequencing technologies, including 454 sequencing.

Although processing of shotgun metagenomic sequence data usually does not have a fixed end point, such as the recovery of one or more complete genomes for isolated microbial genomes, assembly tools, which aim to combine sequence reads into contigs, are still expected to play an important role in sequence processing because of the more valuable genomic content they can provide (Tyson et al. 2004; Venter et al. 2004). In the past decade, many assembly algorithms have been proposed for the sequence assembly problem. Among them are early algorithms targeted at Sanger sequencing technology, such as Phrap (http://www.phrap.org), Celera (Myers et al. 2000; Miller et al. 2008), and PCAP (Huang et al. 2003), and more recent algorithms targeted at next-generation technology, such as Velvet (Zerbino and Birney 2008) and SOAPdenovo (Li et al. 2010). However, these methods were not designed for metagenome sequencing, even though they are still commonly used to assemble metagenomic sequencing reads.

Compared to the isolated-genome assembly problem, the metagenomic assembly problem is more complicated because of two challenges (Kunin et al. 2008): (1) genomic repeats may originate from either the same genome or different genomes, and large numbers of mixed short DNA reads belong to many different species (for some environmental samples, even the population structure is poorly known); and (2) the inhomogeneous coverage distribution and the low abundance of organisms provide limited information for handling repeats. Because of these specific challenges, traditional assembly methods developed for single-genome assembly usually generate poor-quality draft assemblies on metagenomic data (Mavromatis et al. 2007). Thus, there is a need for efficient assembly methods developed specifically for metagenomic data.

Moreover, compared with Sanger and 454 sequencing, the shorter reads (<200 bp, typically 25–100 bp) and higher error rates of the newer sequencing platforms currently limit their utility for metagenomic analyses, because they make phylogenetic study and gene function inference difficult. In fact, short-read technologies have not been widely used in metagenome sequencing, while sequencing technologies producing longer reads, such as Sanger (usually 700–1,000 bp) and 454 sequencing (usually 200–500 bp), are still the overwhelming recommendation and thus remain the major source of metagenomic sequence data. Therefore, it remains worthwhile to emphasize the importance of longer reads for metagenomic analyses, including read assembly tools designed specifically for them.

Algorithm of MAP

MAP implements an improved version of the classical overlap/layout/consensus (OLC) strategy, in which several special algorithms are incorporated into the stages in order to compute correct contigs by connecting fragments linked by mate pairs and to prevent the false merging of unrelated reads. For the improved OLC strategy, MAP deploys a series of algorithms in three stages, as shown in Fig. 1. In the overlap stage, a filter algorithm based on q-grams (Mullikin et al. 2003) is used to obtain the read pairs that are likely to overlap, and a seed-and-extend alignment approach, similar to that used by BLAST (Altschul et al. 1990), is employed for the pairwise alignment calculation. In the consensus stage, a consistency-based consensus algorithm is used (Rausch et al. 2009), which is based on a multi-read alignment algorithm that aligns the reads using a consistency-enhanced alignment graph of shared sequence segments identified in advance. The most important innovation of MAP is the layout stage, which applies mate-pair information to deal with the repeat problem, as described below.

A De Novo Metagenomic Assembly Program for Shotgun DNA Reads, Fig. 1 The flowchart of the MAP algorithm

In the OLC approach of MAP, an overlap graph is used to facilitate the assembly process. Conceptually, reads and overlaps are represented in the graph G by nodes and bidirected edges, respectively. The arrows at the two ends of an edge are determined by the way the two reads overlap. A dovetail path is defined as an acyclic path in which each node has exactly one outward arrow and one inward arrow. A dovetail path therefore determines a contig, obtained by threading the reads corresponding to the nodes on the path, and the goal of the layout stage is to separate the graph into disconnected dovetail paths. However, since the graph may contain many misleading edges that represent false overlaps, mainly originating from two repetitive DNA regions or similar
fragments of different genomes, this goal is a formidable task. To this end, MAP is designed to determine the optimal dovetail paths with the aid of the clues given by mate pairs (Lai et al. 2012).

Compared with other assemblers, several distinct features of the MAP algorithm should be pointed out. First, MAP does not rely on additional information such as genome length or sequencing coverage, which is often used by assemblers targeting isolated genomes, because such information is not applicable to metagenomic assembly. More importantly, MAP uses mate-pair information differently from other assemblers. For example, the Celera Assembler (Myers et al. 2000) used mate-pair information in scaffold construction. The Celera Assembler later introduced a new pipeline, CABOG, which finds the best overlap graph in its unitigger module (Miller et al. 2008). In this algorithm, mate pairs are used to correct misassemblies by breaking unitigs that violate mate-pair constraints. PCAP (Huang et al. 2003) used mate-pair information to correct contigs and to link contigs into scaffolds. In contrast to these assemblers, MAP uses mate pairs as a core measure for constructing contigs when repeats hamper the assembly. Based on mate-pair information, MAP applies a series of procedures to implement the layout stage.

Performance of MAP

MAP is designed for metagenomic assembly of long-read data with mate pairs, such as Sanger reads (700–1,000 bp) and 454 reads (200–500 bp). MAP was assessed on simulated data in comparison with assemblers widely used for long-read data. Specifically, results on a simulated dataset with 800 bp reads show that the overall assembly performance of MAP can be superior to that of both Celera and Phrap for typical longer reads from Sanger sequencing, and results on a simulated dataset with 200 bp reads show that MAP has a clear advantage over Celera, Newbler (Margulies et al. 2005), and Genovo (Laserson et al. 2011) for typical shorter reads from 454 sequencing (Lai et al. 2012).

Availability

MAP is written in C++, and the source code is freely available under the GNU GPL license at http://bioinfo.ctb.pku.edu.cn/MAP/.

References

Altschul SF, Gish W, et al. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Huang X, Wang J, et al. PCAP: a whole-genome assembly program. Genome Res. 2003;13:2164–70.
Kunin V, Copeland A, et al. A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev. 2008;72:557–78.
Lai B, Ding R, et al. A de novo metagenomic assembly program for shotgun DNA reads. Bioinformatics. 2012;28(11):1455–62.
Laserson J, Jojic V, et al. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18:429–43.
Li R, Zhu H, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20:265–72.
Margulies M, Egholm M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80.
Mavromatis K, Ivanova N, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4:495–500.
Miller JR, Delcher AL, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24:2818–24.
Mullikin JC, Ning Z, et al. The phusion assembler. Genome Res. 2003;13:81–90.
Myers EW, Sutton GG, et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–204.
Rausch T, Koren S, et al. A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics. 2009;25:1118–24.
Tyson GW, Chapman J, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74.
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.
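As a rough illustration of the q-gram filtering idea mentioned for the overlap stage of MAP (see Algorithm of MAP), the following Python sketch indexes every read by its q-grams and keeps only those read pairs that share a minimum number of distinct q-grams; only such candidate pairs would then be passed on to pairwise alignment. This is not the MAP implementation: the function names, the value of q, and the threshold `min_shared` are assumptions chosen for a toy example.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(read, q):
    """Return the set of distinct substrings of length q in a read."""
    return {read[i:i + q] for i in range(len(read) - q + 1)}


def candidate_pairs(reads, q=4, min_shared=3):
    """Return index pairs of reads sharing at least min_shared distinct q-grams.

    Hypothetical helper for illustration; real filters also consider
    q-gram positions and read orientation.
    """
    # Inverted index: q-gram -> ids of the reads that contain it.
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for gram in qgrams(read, q):
            index[gram].add(rid)

    # Count the shared q-grams for every pair of reads that co-occur.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    return sorted(pair for pair, n in shared.items() if n >= min_shared)


# Toy reads: the first two overlap by "CGGTCA"; the third is unrelated.
reads = ["ACGTACGGTCA", "CGGTCATTGAC", "TTTTTTTTTTT"]
pairs = candidate_pairs(reads, q=4, min_shared=3)
print(pairs)
```

The filter is deliberately cheap: it avoids aligning all read pairs and leaves the decision about true overlaps to the subsequent seed-and-extend alignment.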
Ab Initio Gene Identification in Metagenomic Sequences

Shiyuyun Tang1 and Mark Borodovsky2
1 School of Biology, Biodiversity Research Center, Georgia Institute of Technology, Atlanta, GA, USA
2 Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Center for Bioinformatics and Computational Genomics, Atlanta, GA, USA

Synonyms

Statistical or intrinsic methods of gene prediction

Definition

Computational inference of how a metagenomic sequence is divided into protein-coding and noncoding regions based on the presence or absence of characteristic oligonucleotide frequency patterns.

Introduction

As of April 2013, sequences of 370 metagenomes were available in databases. On the other hand, the Genomes Online Database (www.genomesonline.org) lists 186 complete archaeal and 3,956 complete bacterial genomes; there are also about 15,000 incomplete (draft) prokaryotic genomes. With the average size of a metagenome being 100 times larger than that of an average prokaryotic genome, the current volume of metagenomic sequence is about twice as large as the total sequence in "genomic" data. Therefore, current metagenomes carry a larger wealth of genes than all the prokaryotic genomes, and this gap is growing.

Notably, gene prediction and annotation of gene and protein function are more challenging in metagenomes than in draft or complete genomes. To give a historical perspective, one can compare gene annotation of a metagenome with annotation of the first completely sequenced archaeal genome, Methanococcus jannaschii (Bult et al. 1996). All the M. jannaschii genes were predicted by an ab initio statistical method (Borodovsky and McIninch 1993), while the function of two thirds of them remained a mystery, since the translated protein sequences did not show sequence similarity to proteins in databases.

History repeats itself in metagenomes, since the majority of protein-coding regions in a new metagenome may code for proteins that do not show similarity to already known proteins. "Evidence-based" or "similarity-based" methods of gene finding (Kunin et al. 2008) provide gene predictions along with valuable information about the function of the encoded proteins. Similarity-based gene finders possess high specificity, close to 100 % (Altschul et al. 1997; Badger and Olsen 1999; Frishman et al. 1998; Gish and States 1993). Still, the drawback of similarity-based methods is low sensitivity; they cannot predict novel genes. Similarity-based methods are therefore less useful for gene prediction in metagenomes that carry many novel genes, while ab initio gene prediction methods, which do not depend on the presence of homologs in protein databases, are both effective and efficient for annotating genes in metagenomic sequences (Kunin et al. 2008).

Ab Initio Gene Finding

Ab initio gene prediction tools have high sensitivity (above 90 % for the best tools) and high specificity (above 90 % as well). Ab initio gene finders use statistical pattern recognition methods (Wooley et al. 2010). Statistical models such as Markov models, hidden Markov models (HMM), and hidden semi-Markov models (HSMM, also called hidden Markov models with duration) have proved very useful for modeling statistical patterns of nucleotide ordering in protein-coding and noncoding regions. Accurate ab initio gene finding in isolated genomes requires ample sequence data for estimation of algorithm parameters (model training).

In contrast to isolated (complete and draft) genomes, metagenomic sequences are derived from numerous genomes of heterogeneous microbial communities (microbiomes). A typical metagenomic sequence is short; its genomic context and phylogenetic origin are rarely known. Gene identification is also affected by sequencing and assembly errors, for example, errors that lead to frameshifts (changes of coding frame).

The major challenge for ab initio gene prediction in metagenomic sequences is that the sequences are often too short for reliable estimation of the parameters of species-specific models of coding and noncoding regions. Special training techniques have to be developed to address the challenging task of parameter estimation (see below). As in gene prediction in isolated genomes, newly predicted genes are immediately translated into proteins, and a similarity search is used in an attempt at functional annotation.

Gene Finders Currently Available for Metagenomes

Current metagenomic gene-finding tools include FragGeneScan (Rho et al. 2010), Glimmer-MG (Kelley et al. 2012), MetaGene Annotator (Noguchi et al. 2008), MetaGeneMark (Zhu et al. 2010), and Orphelia (Hoff et al. 2009, 2008). Glimmer-MG and MetaGeneMark are extensions of gene finders for complete or draft genomes, Glimmer3 (Delcher et al. 2007) and GeneMarkS (Besemer et al. 2001), respectively.

The MetaGeneMark algorithm uses the HSMM architecture originally developed in GeneMarkS (Besemer et al. 2001). The HSMM parameter derivation approach used in MetaGeneMark is to arrive at a large set of parameters (thousands of parameters related to oligonucleotide frequencies) from a small set (nucleotide frequencies determined in a short fragment) using the dependencies between oligonucleotide and nucleotide frequencies that have formed in evolution. The original idea of this approach (Besemer and Borodovsky 1999) was developed for small viral genomes before the start of the "metagenomic era" (see below for more details).

Glimmer-MG is based on interpolated Markov models or IMMs (Salzberg et al. 1998). Glimmer-MG scores metagenomic sequences and assigns them to clusters; then, the algorithm iteratively estimates the IMM parameters and reassigns sequences to clusters.

FragGeneScan (Rho et al. 2010), an HMM-based gene finder, has an additional ability to predict frameshifts caused by sequencing errors. Transition probabilities between coding frames are determined with respect to the error models of the sequencing technologies used to derive the input sequence.

MetaGene Annotator (Noguchi et al. 2008) works in two steps: in the first step, the program scores open reading frames (ORFs) with respect to base composition and length; in the second step, it connects high-scoring ORFs using dynamic programming.

Machine learning classification algorithms such as support vector machines and neural networks are also used for ab initio gene finding. To classify ORFs as coding or noncoding, Orphelia (Hoff et al. 2009, 2008) uses an artificial neural network that combines multiple features to compute ORF scores.

Parameter Estimation for Metagenomic Gene-Finding Algorithms

Patterns of oligonucleotide frequencies differ in coding and noncoding regions; these patterns are more pronounced when frequencies of longer oligomers are considered. Sequences with specific oligomer frequencies can be modeled by Markov chain models and, in the important case of protein-coding sequences, by three-periodic Markov chain models (Borodovsky et al. 1986). The number of parameters of a three-periodic Markov chain model increases exponentially with the model order; estimation of the parameters of the practically useful fifth-order model requires a sequence at least several hundred thousand nucleotides long. Use of a shorter training sequence leads to overfitting and will corrupt gene prediction. If the origin of the metagenomic sequence is known, sequences from the whole parent genome could be used for training. Alternatively, if novel metagenomic sequences from a single species are assembled into a sufficiently long contig, the model parameters can be estimated by self-training on the contig sequence (Besemer et al. 2001; Kelley et al. 2012). Most frequently, however, metagenomic sequences are short (of the order of a few hundred nucleotides) and novel. Therefore, a new approach to the derivation of model parameters is needed.

A novel approach for constructing parameters and building efficient models for gene prediction in short genomic sequences was proposed back in 1999 (Besemer and Borodovsky 1999). The idea was to use observed trends in the nucleotide frequencies in the three codon positions in genomes with various GC content. Use of these dependencies allows reconstruction of the species-specific codon usage pattern of the whole genome starting from a short fragment of this genome whose length is sufficient to estimate the genome GC content. This approach is based on the assumption of genome compositional uniformity, which is largely valid for prokaryotic genomes. It was shown that parameters provided by this approach allow sufficiently accurate gene prediction in short metagenomic sequences. Later on, with more genomes becoming available, this idea was extended (Zhu et al. 2010) to longer oligonucleotides (e.g., hexamers). With the GC content of a genome being an independent variable X, it could be shown that the frequency of phased K-mers in any of the three frames, variable Y, can be approximated by a polynomial of order K. In particular, the mononucleotide frequencies in the three codon positions can be approximated by linear functions. These dependencies indicate that GC content is a major driving factor that determines the evolution of the genome-wide codon usage pattern (Chen et al. 2004). In MetaGeneMark, the GC content determined for a short metagenomic sequence is used as an estimate of the GC content of the whole genome from which the sequence originated. This value allows immediate reconstruction of the frequencies of phased oligonucleotides and, at the next step, of the parameters of the three-periodic Markov chain models of the heuristic model (Zhu et al. 2010).

Interestingly, the heuristic models can also be used for gene prediction in complete or draft genomes. In comparison with the "native" models (models trained on the genome of interest), heuristic models are more sensitive to so-called "atypical" genes. Many atypical genes appear to be horizontally transferred genes with codon frequencies deviating from the dominant codon usage pattern of the "host" genome.

Another approach to model parameter estimation attempts to build a sufficiently large set of training sequences by linking anonymous sequences that appear to be taxonomically close. For example, Glimmer-MG assigns a taxon to a metagenomic sequence using a classification method called Phymm (Brady and Salzberg 2009) and then searches databases for genomes that belong to this taxon. Since this type of training is executed in real time, the running time of the gene-finding algorithm may increase significantly in comparison with an algorithm that selects a heuristic model from a set of models precomputed for possible values of GC content.

Additional Sequence Features Used by Metagenomic Gene Finders

Besides function-specific patterns in oligonucleotide composition, gene identification algorithms can use additional features that help discriminate protein-coding and noncoding regions. Such features include empirical length distributions of coding and noncoding regions, the mutual orientation of neighboring coding regions, and sequence patterns related to functional sites such as ribosomal binding sites (RBS). The two-component model of the RBS, consisting of a positional frequency matrix as a model of the RBS motif and the length distribution of the "spacer" (the sequence between the RBS and the gene start), carries important additional information for improving the accuracy of gene start prediction. In prokaryotic genomes the average spacer length is 5–7 nt. The RBS positional
Ab Initio Gene Identification in Metagenomic Sequences, Table 1 Gene prediction accuracy for five ab initio gene finders. Sn stands for sensitivity and Sp stands for specificity

Program            Test set                                             Sequence length (bp)  Sn (%)  Sp (%)  (Sn + Sp)/2 (%)  Publication
Orphelia           Fragments from 12 test species                       300                   82.1    91.7    86.9             Hoff et al. (2009)
FragGeneScan       Simulated short reads of 9 genomes                   400                   91.3    86.1    88.7             Rho et al. (2010)
MetaGeneMark       Fragments from 50 microbial chromosomes              400                   97.0    94.6    95.8             Zhu et al. (2010)
Glimmer-MG         Simulated 454 sequences                              535                   98.4    71.8    85.1             Kelley et al. (2012)
MetaGeneAnnotator  Subsequences of 13 genomes                           700                   95.1    91.0    93.1             Noguchi et al. (2008)
FragGeneScan       Simulated reads with 1 % sequencing error rate       400                   85.4    79.5    82.5             Rho et al. (2010)
Glimmer-MG         Simulated 454 reads with 1 % sequencing error rate   535                   83.6    62.5    73.1             Kelley et al. (2012)
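The summary column of Table 1 is the arithmetic mean of sensitivity and specificity. The short Python sketch below recomputes that column from the Sn and Sp values in the table; the helper name `average_accuracy` is hypothetical, and the round-half-up rule is an assumption, chosen because it reproduces the printed values.

```python
from decimal import Decimal, ROUND_HALF_UP

# (program, Sn %, Sp %) copied from Table 1; the last two rows are the
# tests on reads with a 1 % sequencing error rate.
TABLE_1 = [
    ("Orphelia", "82.1", "91.7"),
    ("FragGeneScan", "91.3", "86.1"),
    ("MetaGeneMark", "97.0", "94.6"),
    ("Glimmer-MG", "98.4", "71.8"),
    ("MetaGeneAnnotator", "95.1", "91.0"),
    ("FragGeneScan (1 % errors)", "85.4", "79.5"),
    ("Glimmer-MG (1 % errors)", "83.6", "62.5"),
]


def average_accuracy(sn, sp):
    """Return (Sn + Sp)/2, rounded half up to one decimal place."""
    mean = (Decimal(sn) + Decimal(sp)) / 2
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = [(name, average_accuracy(sn, sp)) for name, sn, sp in TABLE_1]
for name, avg in averages:
    print(f"{name:28s} {avg}")
```

Decimal arithmetic is used instead of binary floating point so that values such as 93.05 round to 93.1 as in the table.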
frequency matrix can be derived by algorithms such as the MCMC (Markov chain Monte Carlo)-based Gibbs sampler (Lawrence et al. 1993) or EM (Expectation Maximization)-based MEME (Bailey and Elkan 1994); detection of the RBS motif is done by finding the most conserved set of ungapped sequence fragments within the multiple alignment window. The structure of the two-component RBS model is convenient for incorporation into the HMM-based framework of several algorithms, such as MetaGeneMark and FragGeneScan.

Another feature, the prokaryotic gene length distribution, is approximated for complete or draft genomes by a gamma distribution with a mean value of about 900 nt; yet another, the length distribution of noncoding regions, is approximated by an exponential distribution. These two distributions, as well as the RBS spacer length distribution, are used in the HSMM-based algorithms (Besemer et al. 2001). Since short metagenomic sequences are more likely to contain partial genes than complete genes, length distributions of partial genes are used in HSMM-based metagenomic gene finders (Rho et al. 2010; Zhu et al. 2010).

About 70 % of neighboring genes in prokaryotic genomes have the same orientation (Noguchi et al. 2006), and many of them form co-transcribed "chains" or operons. Genes in an operon are located close to each other or even overlap. The four-base-pair overlap ATGA is very common in adjacent genes, as an overlap of the start codon ATG and the stop codon TGA. The average distance between adjacent genes with the same orientation is shorter than that between neighboring genes residing on complementary strands, especially in the gene start-to-gene start configuration, where additional space has to be available for promoters.

All these features are incorporated in metagenomic gene finders, e.g., MetaGeneMark. Tests of ab initio gene finders on simulated metagenomic sequences have shown that these algorithms are quite accurate, with average values of sensitivity and specificity above 90 %; see Table 1. However, the sensitivity drops if the sequence length goes below 200 nt (Yok and Rosen 2011; Zhu et al. 2010).

Ab Initio Gene Finding in Metagenomic Sequences with Errors

Real metagenomic sequences contain errors: substitutions, insertions, and deletions (indels), as well
Ab Initio Gene Identification in Metagenomic Sequences, Table 2 Frameshift prediction accuracy

Program        Sequence length (bp)  Sn (%)  Sp (%)  (Sn + Sp)/2
FragGeneScan   400                   81.0    43.2    62.1
FragGeneScan   600                   81.9    35.1    58.5
FragGeneScan   800                   82.8    29.4    56.1
MetaGeneTack   400                   75.8    70.2    73.0
MetaGeneTack   600                   80.1    61.7    70.9
MetaGeneTack   800                   81.5    51.9    66.7

Test set (all rows): fragments from 18 prokaryotic genomes with 20 % containing frameshifts. Publication (all rows): Tang et al. 2013
as chimerisms, when two reads from different species are joined due to assembly error. Indels can cause frameshifts in coding regions; thus gene prediction accuracy is affected by sequencing errors. The overall effect on accuracy depends on the error rates specific to the sequencing and finishing technologies; for example, the error rates reported for Sanger sequencing may be as low as 0.001 %, while sequencing errors in NGS technologies can go above 1 %. In both simulated Sanger reads and simulated 454 reads, a significant decrease in gene prediction sensitivity is observed when the error rate exceeds 1 % (Hoff 2009). Still, in assembled sequences, the per-nucleotide error rate of 0.5 % in raw reads can be reduced to as low as 0.005 %. This error rate is still large enough to affect 3–4.5 % of genes in assembled sequences (Luo et al. 2012).

To identify frameshift errors in metagenomic sequences, gene-finding algorithms have to model frame transitions that occur due to sequencing errors. In HSMM-based gene finders, e.g., FragGeneScan, new hidden states designating transitions between coding frames in the same strand were incorporated into the HSMM architecture. Another recent tool able to detect frameshifts in metagenomic coding regions is MetaGeneTack (Tang et al. 2013). It combines the original HSMM-based MetaGeneMark with the ab initio frameshift-finding program GeneTack (Antonov and Borodovsky 2010). Several filters of false-positive predictions were employed in MetaGeneTack to achieve higher accuracy. MetaGeneTack is reported to have higher frameshift prediction specificity than FragGeneScan (Table 2) in reads with error rates typical for metagenomic projects (Tang et al. 2013).

Yet another approach was used in Glimmer-MG, which, to trace possible indel errors, splits an ORF into three branches (frames), starting from the position of a nucleotide called with low confidence (Kelley et al. 2012). This approach was reported to have higher gene prediction accuracy on error-containing reads than FragGeneScan. Methods that account for sequencing errors generally perform better on real error-prone metagenomic sequences than "idealistic" approaches. The accuracy of sequencing error detection, however, depends on how accurate the modeling of sequencing errors is.

Summary

Accurate ab initio gene prediction in metagenomic sequences is necessary for reliable functional annotation. Ab initio algorithms identify genes in metagenomic sequences by detecting intrinsic statistical patterns of coding and noncoding regions. Being independent of data stored in databases, these methods are especially useful for discovering novel genes. Special techniques have been developed for derivation of the parameters of ab initio algorithms working with short anonymous metagenomic sequences. We have reviewed several ab initio gene finders developed for metagenomic sequences, including the latest tools that take into account possible sequencing errors (frameshifts).
Cross-References

▶ Computational Approaches for Metagenomic Datasets
▶ FragGeneScan: Predicting Genes in Short and Error-Prone Reads
▶ Metagenomics, Metadata, and Meta-analysis
▶ Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies
▶ Proteomics and Metaproteomics
▶ RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data

References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Antonov I, Borodovsky M. GeneTack: frameshift identification in protein-coding sequences by the Viterbi algorithm. J Bioinforma Comput Biol. 2010;8(3):535–51. PubMed PMID: 20556861.
Badger JH, Olsen GJ. CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol. 1999;16(4):512–24.
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings/International Conference on Intelligent Systems for Molecular Biology; ISMB International Conference on Intelligent Systems for Molecular Biology, Vol. 2; 1994; p. 28–36. PubMed PMID: 7584402.
Besemer J, Borodovsky M. Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 1999;27(19):3911–20. PubMed PMID: 10481031. PubMed Central PMCID: 148655.
Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001;29(12):2607–18. PubMed PMID: 11410670. PubMed Central PMCID: 55746.
Borodovsky M, McIninch J. GENMARK: parallel gene recognition for both DNA strands. Comp Chem. 1993;17(2):123–33.
Borodovsky MY, Sprizhitskii Y, Golovanov E, Aleksandrov A. Statistical patterns in primary structures of functional regions in the E. coli genome. III. Computer recognition of coding regions. Mol Biol. 1986;20:1145–50.
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6(9):673–6. PubMed PMID: 19648916. PubMed Central PMCID: 2762791.
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, et al. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science. 1996;273(5278):1058–73. PubMed PMID: 8688087.
Chen SL, Lee W, Hottes AK, Shapiro L, McAdams HH. Codon usage between genomes is constrained by genome-wide mutational processes. Proc Natl Acad Sci U S A. 2004;101(10):3480–5. PubMed PMID: 14990797. PubMed Central PMCID: 373487.
Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23(6):673–9. PubMed PMID: 17237039. PubMed Central PMCID: 2387122.
Frishman D, Mironov A, Mewes H-W, Gelfand M. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. 1998;26(12):2941–7.
Gish W, States DJ. Identification of protein coding regions by database similarity search. Nat Genet. 1993;3(3):266–72.
Hoff KJ. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009;10:520. PubMed PMID: 19909532. PubMed Central PMCID: 2781827.
Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P. Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinforma. 2008;9:217. PubMed PMID: 18442389. PubMed Central PMCID: 2409338.
Hoff KJ, Lingner T, Meinicke P, Tech M. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 2009;37(Web Server issue):W101–5. PubMed PMID: 19429689. PubMed Central PMCID: 2703946.
Kelley DR, Liu B, Delcher AL, Pop M, Salzberg SL. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res. 2012;40(1):e9. PubMed PMID: 22102569. PubMed Central PMCID: 3245904.
Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev. 2008;72(4):557–78. PubMed PMID: 19052320. PubMed Central PMCID: 2593568.
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262(5131):208–14. PubMed PMID: 8211139.
Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS ONE. 2012;7(2):e30087.
Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34(19):5623–30. PubMed PMID: 17028096. PubMed Central PMCID: 1636498.
Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res Int J Rapid Publ Rep Genes Genomes. 2008;15(6):387–96. PubMed PMID: 18940874. PubMed Central PMCID: 2608843.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191. PubMed PMID: 20805240. PubMed Central PMCID: 2978382.
Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998;26(2):544–8. PubMed PMID: 9421513. PubMed Central PMCID: 147303.
Tang S, Antonov I, Borodovsky M. MetaGeneTack: ab initio detection of frameshifts in metagenomic sequences. Bioinformatics. 2013;29(1):114–6. PubMed PMID: 23129300. PubMed Central PMCID: 3530910.
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):e1000667. PubMed PMID: 20195499. PubMed Central PMCID: 2829047.
Yok NG, Rosen GL. Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinforma. 2011;12:20. PubMed PMID: 21232129. PubMed Central PMCID: 3042383.
Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 2010;38(12):e132. PubMed PMID: 20403810. PubMed Central PMCID: 2896542.

AbundanceBin, Metagenomic Sequencing

Yuzhen Ye
Indiana University, School of Informatics and Computing, Bloomington, IN, USA

Definition

Binning is unsupervised clustering of metagenomic sequences into an unknown set of species.
AbundanceBin is a binning tool utilizing the different abundances of the species in a community.

Introduction

Binning is one of the challenging problems in the metagenomics field. It has two main applications. One application is for studying the structure of microbial communities. The other application is for improving the downstream analysis of metagenomic sequences, including metagenome assembly (which has been shown to be extremely difficult), considering that assembling reads one bin at a time significantly reduces the complexity of the metagenome assembly problem.

Composition-based methods have been the main approaches to unsupervised classification of reads. The basis of these approaches is that genome composition features (G + C content, dinucleotide frequencies, and synonymous codon usage) vary among organisms and are generally characteristic of evolutionary lineages. Tools in this category include TETRA (Teeling et al. 2004), TACOA (Diaz et al. 2009), and MetaCluster (Leung et al. 2011). Due to the substantial variance in sequence properties along a genome, the main limitation of composition-based approaches is that they require relatively long reads (at least 800 bp), although it has been shown that MetaCluster (Leung et al. 2011) can bin reads of 300 bp by employing a different distance metric (Spearman Footrule Distance) to reduce the local variations for 4-mers.

Note that a large collection of methods has been developed to classify sequencing reads in a supervised manner. MEGAN (Huson and Mitra 2012) is a representative approach of this kind. These methods either use composition information (as in NBC, a naïve Bayes classifier for metagenomic sequence classification (Rosen et al. 2011)) or employ similarity searches of metagenomic sequences against a database of known genes/proteins (as in MEGAN) and assign metagenomic sequences to taxa accordingly, with or without using phylogeny. They also differ in the algorithms used for classification: MEGAN pioneers the lowest common ancestor (LCA) algorithm (Huson et al. 2007), MTR (Gori et al. 2011) improves on the LCA algorithm by considering multiple taxonomic ranks, and MetaPhyler (Liu et al. 2011) achieves better classification
results by tuning the taxonomic classifier to each matching length, reference gene, and taxonomic level. Note that some tools in this category can only classify a subset of the metagenomic sequences instead of all. MLTreeMap (Stark et al. 2010) uses phylogenetic analysis of 31 marker genes for taxonomic distribution estimation. CARMA (Krause et al. 2008) searches for conserved Pfam domains and protein families in raw metagenomic sequences and classifies them into a higher-order taxonomy. The RDP classifier was designed for classification of 16S rRNA genes and later extended to classification of 18S rRNA genes, using a naïve Bayes classifier (Cole et al. 2009).

AbundanceBin

AbundanceBin (Wu and Ye 2011) is the first unsupervised clustering algorithm that utilizes abundance information of the species in the same microbial community to group reads into bins. The fundamental assumption of the AbundanceBin algorithm is that reads are sampled from genomes following a Poisson process, such that the sequencing reads can be modeled as a mixture of Poisson distributions.

An expectation–maximization (EM) algorithm is used in AbundanceBin to find parameters for the Poisson distributions (i.e., the means), which reflect the relative abundance levels of the source species. AbundanceBin then assigns reads to bins based on the fitted Poisson distributions. AbundanceBin gives an estimation of the genome size (or the concatenated genome size of species of the same or very similar abundances) and the coverage (which reflects the abundances of species) of each bin in an unsupervised manner without requiring prior knowledge of the structure of the microbial communities. The EM algorithm needs an important parameter, the number of bins, which is typically unknown for most metagenomic projects. AbundanceBin solves this problem by using a recursive binning approach to determine the total number of bins automatically. The recursive binning approach works by separating a dataset into two bins and proceeds by further splitting bins. The recursive procedure continues if (1) the predicted abundance values of two bins differ significantly; (2) the predicted genome sizes are larger than a certain threshold; and (3) the number of reads associated with each bin is larger than a certain threshold proportion of the total number of reads classified in the parent bin.

AbundanceBin achieves accurate classification of even very short sequences sampled from species with different abundance levels, as tested on simulated and real metagenomic datasets. The software is available for download at http://omics.informatics.indiana.edu/AbundanceBin.

Integrated Binning Methods

MetaCluster 3.0 is an integrated binning method based on an unsupervised top–down separation and bottom–up merging strategy, which can bin metagenomic fragments of species whose abundance ratios range from very balanced to very different (Leung et al. 2011). MetaCluster 4.0 further improves the binning algorithm and is able to handle datasets with a large number of species (e.g., 100 species) (Wang et al. 2012). MetaCluster is available for download at http://i.cs.hku.hk/~alse/MetaCluster/.

Joint Analysis of Multiple Metagenomic Samples

Baran and Halperin proposed an abundance-based (also termed coverage-based) binning algorithm (MultBin) that operates on multiple samples of the same environment simultaneously, assuming that the different samples contain the same microbial species, possibly in different proportions (Baran and Halperin 2012). MultBin employs a k-medoids clustering algorithm to cluster reads according to their coverage across the samples. Testing of MultBin on simulated metagenomic datasets shows that integrating information across multiple samples yields more precise binning on each of the samples.
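The Poisson-mixture idea described for AbundanceBin above can be illustrated with a small, generic EM fit. The sketch below is not AbundanceBin's implementation: it assumes the input is simply a list of integer counts (for example, how often each l-tuple occurs in the read set), it fits mixture weights rather than genome sizes, and the number of bins is fixed by the caller instead of being found recursively. The function names `poisson_logpmf`, `fit_poisson_mixture`, and `assign_bins` are hypothetical.

```python
import math


def poisson_logpmf(x, lam):
    """Log of the Poisson probability mass function at count x with mean lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def fit_poisson_mixture(counts, init_means, n_iter=200, tol=1e-9):
    """Fit a Poisson mixture to integer counts with plain EM.

    Returns (weights, means, responsibilities). The component means play
    the role of per-bin coverage; this is an illustrative assumption, not
    the parameterization used by AbundanceBin itself.
    """
    k = len(init_means)
    means = [float(m) for m in init_means]
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each count.
        resp = []
        ll = 0.0
        for x in counts:
            logs = [math.log(weights[j]) + poisson_logpmf(x, means[j])
                    for j in range(k)]
            top = max(logs)
            log_norm = top + math.log(sum(math.exp(v - top) for v in logs))
            ll += log_norm
            resp.append([math.exp(v - log_norm) for v in logs])
        # M-step: re-estimate weights and means from the soft assignments.
        for j in range(k):
            total = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = total / len(counts)
            weighted_sum = sum(r[j] * x for r, x in zip(resp, counts))
            means[j] = max(weighted_sum / total, 1e-12)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, resp


def assign_bins(resp):
    """Hard assignment: index of the most probable component per count."""
    return [max(range(len(r)), key=r.__getitem__) for r in resp]
```

With well-separated abundance levels, the fitted means approach the average count of each group, and `assign_bins` recovers the grouping. When the abundance levels are close, the components overlap and the assignment becomes uncertain, which is consistent with the text's point that abundance-based binning relies on abundance differences between species.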
Summary

Abundance-based (or coverage-based) binning approaches achieve accurate performance even for extremely short reads when species abundance differences exist, an ability that cannot be achieved by composition-based approaches, which suffer from the variance of the composition of short reads. Approaches that integrate abundance and composition information and approaches that utilize multiple samples have shown promising binning results.

References

Baran Y, Halperin E. Joint analysis of multiple metagenomic samples. PLoS Comput Biol. 2012;8(2):e1002373.
Cole JR, Wang Q, Cardenas E, et al. The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37(Database issue):D141–5.
Diaz NN, Krause L, Goesmann A, et al. TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics. 2009;10:56.
Gori F, Folino G, Jetten MS, et al. MTR: taxonomic annotation of short metagenomic reads using clustering at multiple taxonomic ranks. Bioinformatics. 2011;27(2):196–203.
Huson DH, Mitra S. Introduction to the analysis of environmental sequences: metagenomics with MEGAN. Methods Mol Biol. 2012;856:415–29.
Huson DH, Auch AF, Qi J, et al. MEGAN analysis of metagenomic data. Genome Res. 2007;17(3):377–86.
Krause L, Diaz NN, Goesmann A, et al. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. 2008;36(7):2230–9.
Leung HC, Yiu SM, Yang B, et al. A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics. 2011;27(11):1489–95.
Liu B, Gibbons T, Ghodsi M, et al. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics. 2011;12 Suppl 2:S4.
Rosen GL, Reichenberger ER, Rosenfeld AM. NBC: the naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics. 2011;27(1):127–9.
Stark M, Berger SA, Stamatakis A, et al. MLTreeMap - accurate maximum likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics. 2010;11:461.
Teeling H, Waldmann J, Lombardot T, et al. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004;5:163.
Wang Y, Leung HC, Yiu SM, et al. MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species. J Comput Biol. 2012;19(2):241–9.
Wu YW, Ye Y. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol. 2011;18(3):523–34.

Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads

Fengzhu Sun and Li Charlie Xia
Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Dana and David Dornsife College of Letters, Arts and Sciences, Los Angeles, CA, USA

Synonyms

Genome Relative Abundance estimation using Mixture Model theory (GRAMMy)

Introduction

Accurate estimation of microbial community composition based on metagenomic sequencing data is fundamental for subsequent metagenomic analysis. However, it is also a challenging computational problem because of the mixed nature of metagenomes and the fact that only a small fraction of them get sequenced.

With the advent of next-generation sequencing (NGS) technologies, there has been a significant increase in sequencing capacity but a reduction in single-read length. This paradigm shift in sequencing technologies has impacted downstream analyses. Specifically, the identification of the origin of a read becomes more difficult for several reasons. First, a large number of short reads cannot be uniquely mapped to a specific location of one genome. Instead, they map to multiple locations of one or multiple genomes.
These ambiguities are directly associated with the read length reduction in NGS technologies. Second, communities usually consist of many microbes with similar genomes, different only in some parts, making it impossible to determine the origin of a particular short read based solely on its sequence.

Despite these difficulties, NGS read sets have brought in richer abundance information of microbial communities than traditional datasets because of the significant increase in the number of reads. Along with the increase of read set size, efforts to assemble more reference genomes are ongoing. In addition, new experimental techniques, such as single-cell sequencing approaches, are being developed to sequence reference genomes directly from environmental samples. In the face of the challenges from short reads and the opportunities from fast-expanding reference genome databases, GRAMMy is a statistical framework developed to accurately and efficiently estimate the relative abundance of microbial organisms within the community (Xia et al. 2011).

Description

The GRAMMy Framework
The GRAMMy framework is based on a mixture model for short-read metagenomic sequencing and an expectation-maximization (EM) algorithm, as outlined in the model schema and the analysis flowchart in Figs. 1 and 2. GRAMMy accepts a set of shotgun reads as well as external references (e.g., genomes, scaffolds, or contigs) as inputs and subsequently performs the maximum-likelihood estimation (MLE) of the genome relative abundance (GRA) levels.

Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads, Fig. 1 The GRAMMy model. A schematic diagram of the finite mixture model underlying the GRAMMy framework for shotgun metagenomics. In the figure, "iid" stands for "independent identically distributed"
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads, Fig. 2 The GRAMMy flowchart. A typical flowchart of the GRAMMy analysis pipeline employs "map" and "k-mer" assignment

In the typical GRAMMy workflow, which is shown in Fig. 2, the end user starts with the metagenomic read set and reference genome set and then chooses between mapping-based ("map") and k-mer composition-based ("k-mer") assignment options (He and Xia 2007). In either option, after the assignment procedure, an intermediate matrix describing the probability that each read is assigned to one of the reference genomes is produced. This matrix, along with the read set and reference genome set, is fed forward to the EM algorithm module for estimation of the GRA levels. After the calculation, GRAMMy outputs the GRA estimates as a numerical vector, as well as the log-likelihood and standard errors for the estimates. If the taxonomy information for the input reference genomes is available, strain (genome) level GRA estimates can be combined to calculate higher taxonomic level abundance, such as species- and genus-level estimates.

Accurate GRAMMy Estimates with EM Algorithm
Formally, GRA is defined as the normalized abundance for m unique genomes, where the relative abundance for the jth known genome is

$$a_j = \frac{\#\,j\text{-th genome}}{\#\,\text{known genomes}}$$

Note that $g_m$ is a collective surrogate for unknown genomes and cannot be estimated in the model. Knowing length $l_j$, $a_j$ is one-to-one related to the corresponding mixing parameter $\pi_j$ by

$$a_j = \frac{\pi_j}{l_j \sum_{k=1}^{m-1} \frac{\pi_k}{l_k}}$$

Mixing component distributions are needed to solve for the mixing parameter $\pi$. These are the $p(r_i \mid z_{ij} = 1; g)$'s, i.e., the probabilities of generating a read $r_i$ from $g_j$. They are approximated empirically. The first approach is to use the number of high-quality hits $s_{ij}$ from BLAST, BLAT, or other mapping tools and approximate by $s_{ij}/l_j$; the second approach is to use k-mer composition as detailed in the original study (Xia et al. 2011). The EM algorithm to calculate $\pi$ iterates between the E-step

$$z_{ij}^{(t)} = \frac{p(r_i \mid z_{ij} = 1; g)\,\pi_j^{(t)}}{\sum_{k=1}^{m} p(r_i \mid z_{ik} = 1; g)\,\pi_k^{(t)}}$$

and the M-step

$$\pi_j^{(t+1)} = \frac{\sum_{i=1}^{n} z_{ij}^{(t)}}{n}$$

until convergence, where n is the total number of reads and the $z_{ij}$'s are entries in the missing read origin matrix Z. The estimated mixing parameters $\pi$ are then converted back to GRA estimates a.

GRAMMy Estimates for Human Gut Metagenomes
The human gastrointestinal tract harbors the largest group of human symbiotic microbes. Figure 3 shows the 99 most frequent species of the human gut based on the GRAMMy analysis of the 33 metagenomic samples collected from three human gut metagenome projects (Gill et al. 2006; Kurokawa et al. 2007; Turnbaugh et al. 2009). The medians of estimated average genome lengths for these metagenomes range from 2.8 to 3.7 Mbp. Among the top ten most frequent species, there are eight from the Firmicutes phylum including members of Faecalibacterium, Eubacterium, and Ruminococcus genera, and two from the Bacteroides genus of the Bacteroidetes phylum. Firmicutes and Bacteroidetes dominate the human gastrointestinal tract. Species' relative abundance displays a long-tail distribution, suggesting that many are detected across samples, though most of them are not highly abundant. The abundance levels of some species are highly variable (with larger box size), while most others are relatively constant.

Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads, Fig. 3 Frequent species of human gut microbiome. The 99 species occurring in at least 50 % of the 33 human gut samples with a minimum relative abundance of 0.05 % were selected

Conclusions

GRAMMy is a rigorous probabilistic framework for accurately and efficiently estimating genome relative abundance (GRA) based on shotgun metagenomic reads. Users have a wide choice of mapping and alignment tools to assign reads to references. The method is particularly suitable for NGS short read datasets due to its better handling of read assignment ambiguities. GRAMMy tools are packaged as a C++ extension to Python, which can be downloaded freely from GRAMMy's homepage: http://meta.usc.edu/softs/grammy.

Cross-References

▶ Approaches in Metagenome Research: Progress and Challenges
▶ Computational Approaches for Metagenomic Datasets
▶ Extended Local Similarity Analysis (eLSA) of Biological Data
▶ Metagenomic Research: Methods and Ecological Applications
▶ Metagenomics, Metadata, and Meta-analysis
▶ Molecular Ecological Network of Microbial Communities
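As an illustration of the E-step and M-step updates given in the section on GRAMMy estimates with the EM algorithm, the following minimal sketch iterates the two steps on a small read-by-genome probability matrix and then converts the mixing parameters to length-normalized relative abundances. It is not GRAMMy's code or API: the function names `em_mixing_parameters` and `mixing_to_gra` are hypothetical, every column is treated as a known genome (the surrogate component for unknown genomes is left out), and each read is assumed to have a nonzero probability for at least one genome.

```python
def em_mixing_parameters(read_probs, n_iter=1000, tol=1e-10):
    """Estimate mixing parameters by EM.

    read_probs[i][j] stands for the probability of generating read i from
    genome j. Starts from uniform mixing parameters and stops when the
    largest change between iterations falls below tol.
    """
    n = len(read_probs)
    m = len(read_probs[0])
    pi = [1.0 / m] * m
    for _ in range(n_iter):
        sums = [0.0] * m
        for row in read_probs:
            # E-step: posterior probability that this read came from genome j.
            denom = sum(p * w for p, w in zip(row, pi))
            for j in range(m):
                sums[j] += row[j] * pi[j] / denom
        # M-step: average the posteriors over all reads.
        new_pi = [s / n for s in sums]
        delta = max(abs(a - b) for a, b in zip(new_pi, pi))
        pi = new_pi
        if delta < tol:
            break
    return pi


def mixing_to_gra(pi_known, lengths):
    """Convert mixing parameters of known genomes to relative abundances
    by dividing each by its genome length and renormalizing."""
    per_length = [p / l for p, l in zip(pi_known, lengths)]
    total = sum(per_length)
    return [x / total for x in per_length]
```

For example, if three quarters of the reads come from a genome that is three times as long as a second genome, the mixing parameters are about 0.75 and 0.25, but the length-normalized abundances are equal. This is why the conversion from mixing parameters to GRA divides by genome length.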
References

Gill SR, Pop M, Deboy RT, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–9.
He PA, Xia L. Oligonucleotide profiling for discriminating bacteria in bacterial communities. Comb Chem High Throughput Screen. 2007;10(4):247–55.
Kurokawa K, Itoh T, Kuwahara T, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14(4):169–81.
Turnbaugh PJ, Hamady M, Yatsunenko T, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–4.
Xia LC, Cram JA, Chen T, et al. Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS One. 2011;6(12):e27992.

All-Species Living Tree Project

Pablo Yarza1, Raul Munoz2, Jean Euzéby3, Wolfgang Ludwig4, Karl-Heinz Schleifer4, Rudolf Amann5, Frank Oliver Glöckner6,7 and Ramon Rosselló-Móra2
1 Ribocon GmbH., Bremen, Germany
2 Marine Microbiology Group, Department of Ecology and Marine Resources, Institut Mediterrani d'Estudis Avançats (CSIC-UIB), Illes Balears, Spain
3 Society of Systematic Bacteriology and Veterinary (SBSV) & National Veterinary School de Toulouse (ENVT), Toulouse, France
4 Lehrstuhl für Mikrobiologie, Technische Universität München, Freising, Germany
5 Molecular Ecology Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
6 Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
7 Jacobs University Bremen gGmbH, Bremen, Germany

Synonyms

16S rRNA (SSU) and 23S rRNA (LSU) gene sequence databases; Alignments; LTP project; Manual curation; "Orphan" species; Taxa boundaries; Taxonomy/classification/phylogeny of Bacteria and Archaea; Type strains

Definition

The All-Species Living Tree Project (LTP) is an international initiative for the creation and maintenance of highly curated 16S rRNA and 23S rRNA gene sequence databases, alignments, and phylogenetic trees for all the type strains of Bacteria and Archaea.

Introduction

Classification and identification of Bacteria and Archaea reached a turning point around 35 years ago. It was the time when Carl Woese and co-workers demonstrated that ribosomal markers were appropriate to infer genealogical relationships by means of phylogenetic reconstructions (Fox et al. 1977). Rapidly, comparative analysis of rRNA gene sequences became a standard procedure with mature implications in microbial ecology and taxonomy: culture-independent exploration of ecosystems' diversity (Amann et al. 1995) and settlement of the phylogenetic backbone (i.e., our current accepted classification of Bacteria and Archaea; Garrity 2001). As a result, the total amount of ribosomal RNA entries in the public DNA databases has grown exponentially since the early 1990s, currently comprising at least 3,500,000 small (SSU) and 300,000 large (LSU) ribosomal subunit gene sequence entries. On the other hand, the number of bacterial and archaeal species with validly published names has followed arithmetic trends with a rate of around 500–700 annual descriptions during the last 7 years (Fig. 1), currently (December 2012) exceeding the total number of 10,300 species and subspecies. A comparative overview of these trends until December 2011 is shown in Fig. 1.

All-Species Living Tree Project, Fig. 1 Annual growth of ribosomal 16S rRNA (a) and 23S rRNA (b) gene sequence databases and species and subspecies names with standing in nomenclature until December 2011. SILVA SSU-Parc111 and LSU-Parc111 databases (http://www.arb-silva.de/documentation/release-111/) were filtered by submission date until December 2011 and their cumulative annual growth was plotted in red (SSU, 1A) and yellow bars (LSU, 1B). The cumulative growth of published species and subspecies names (according to LPSN; http://www.bacterio.cict.fr/number.html) since 1980 until December 2011 is plotted in blue. Note that the total number of names is around 2,000 above the total number of distinct type strains due to homotypic synonyms, new combinations, nomina nova, later heterotypic synonyms, or illegitimate names

Because the 16S rRNA has been, since the early 1990s and by orders of magnitude, the most often sequenced gene, there is no alternative phylogenetic marker with such a high coverage in public repositories. However, abundance is not the only requisite for a proper phylogenetic inference, and other single molecules (e.g., 23S rRNA) or combinations of them might perform better at reflecting genealogies of certain groups given the higher information content (Ludwig and Klenk 2001). Although far from reaching 16S rRNA levels, submission of alternative markers is growing fast, mostly because (i) the number of metagenomes and complete genomes is growing exponentially due to the reduction in sequencing and analysis costs and (ii) the recent initiative to complete the genome sequence of all type strains (GEBA initiative). Undoubtedly, comparative genomics will bring a new breakthrough for microbial taxonomy, and the current phylogenetic backbone based on ribosomal sequences will be carefully reviewed (Coenye et al. 2005). Nevertheless, at this point, the number of sequenced genomes of type strains is still low and therefore
the current possibilities for an in-depth taxonomic study are sparse.

The responsible teams of the ARB, SILVA, and LPSN projects (www.arb-home.de, www.arb-silva.de, and www.bacterio.net) together with the journal Systematic and Applied Microbiology (SAM) started the "All-Species Living Tree Project" (LTP; http://www.arb-silva.de/projects/living-tree), a project conceived to provide a tool especially designed for the microbial taxonomist scientific community (Yarza et al. 2008). The main objectives considered so far are (1) provide a curated 16S and 23S rRNA gene database for the type strains of all species with validly published names; (2) set up an optimized and universally usable alignment; (3) reconstruct reliable phylogenetic trees with all the type strains; (4) maintain the database, alignments, and trees through regular updates including the new validly published taxa and their respective 16S and 23S rRNA gene sequences; and (5) investigate, with the use of the database, fundamental aspects about taxonomy of Bacteria and Archaea such as phylogenetic thresholds in new taxa circumscriptions, coherence of current taxonomy by means of phylogenetic schemes, and relevance of the ribosomal RNA genes in taxonomic studies.

Creation and Maintenance of LTP Releases

LTP Datasets
First LTP datasets (release LTPs93 for SSU (Yarza et al. 2008), release LTPs102 for LSU (Yarza et al. 2010)) were prepared following six main steps:
1. Set up a list of candidate sequences. An initial sequence dataset consisted of a subsample of the SILVA database, filtering by "type" (T) or "cultured" (C) strains; this information mainly came from StrainInfo.
2. Set up a list of species names. In parallel we built a comprehensive, updated, and nonredundant (i.e., free of synonyms and according to latest valid nomenclature) list of validly published species and subspecies names from LPSN. When a species is divided into subspecies, we substituted the original species name by that of the subspecies (e.g., Staphylococcus sciuri subsp. sciuri instead of Staphylococcus sciuri). We avoided the "Candidatus" names (e.g., "Candidatus Aciduliprofundum boonei"), Cyanobacteria not validly published under the Bacteriological Code (e.g., Anabaena oscillatorioides), and later heterotypic synonyms (e.g., Pseudomonas chloritidismutans).
3. Manual cross-check. Then, each entry from our initial list of sequences was assigned to a species name by manually examining the companion contextual metadata. This process had to be done manually given the often outdated, mistaken, or absent taxonomic information such as the organism name or the strain numbers.
4. Quest for missing type strains. We realized that not all species names were represented in the list of sequences. Then, we inverted the process by searching in resources like EMBL, Bergey's Outlines, issues of the International Journal of Systematic and Evolutionary Microbiology (IJSEM), etc. with the aim to find a good-quality sequence entry for each missing type strain.
5. "Orphan" species recognition. Finally, we got a group of type strains whose 16S/23S rRNA genes had never been sequenced or whose existing sequences were of too low quality to be considered for the project (i.e., in terms of sequence length, number of ambiguities, etc.). We called them "orphan" species. The LTP project together with eleven international culture collections has driven the sequencing of these "orphan" species through the SOS initiative (Yarza et al. 2013).
6. Keep one sequence per species. On the other hand, the list of type-strain sequences was redundant in the sense that one single type strain could be represented by multiple sequence entries. This is the case of multiple independent sequencings and submissions, or the existence of several sequences due to multiple copies of the ribosomal operon. The aim of the LTP is, whenever possible, to keep one
All-Species Living Tree Project, Table 1 Summary of LTP releases. “Sync” fields correspond to IJSEM and EMBL release dates. “Net increase” of a release is the number of new entries minus the number of deleted entries. “% incorrect” refers to the percentage of new entries whose INSDC records carried incorrect information in the organism name field. Averages include standard deviation (mean ± SD)

Release   Type   IJSEM sync   EMBL sync   Total entries   New entries   Deleted entries   Net increase   % incorrect (a)   Average length (a)   Average % ambig. (b)
LTPs93    SSU    Dec. 2007    Dec. 2007   6,728           6,728         0                 6,728          22                1,465.0 ± 51.2       0.10 ± 0.26
LTPs95    SSU    Jun. 2008    Jun. 2008   7,006           299           21                278            45                1,446.0 ± 46.3       0.04 ± 0.11
LTPs100   SSU    Aug. 2009    Jun. 2009   7,710           750           46                704            50                1,448.0 ± 54.2       0.03 ± 0.11
LTPs102   SSU    Feb. 2010    Nov. 2009   8,029           363           44                319            58                1,453.6 ± 52         0.33 ± 0.12
LTPs102   LSU    Feb. 2010    Nov. 2009   792             792           0                 792            6                 2,866.1 ± 177.6      0.02 ± 0.11
LTPs104   SSU    Dec. 2010    May 2010    8,545           545           29                516            74                1,444.6 ± 62         0.27 ± 0.11
LTPs106   SSU    May 2011     Dec. 2010   8,815           279           9                 270            77                1,445.9 ± 51.1       0.03 ± 0.12
LTPs108   SSU    Dec. 2011    Jun. 2011   9,279           490           26                464            60                1,455.4 ± 51.9       0.02 ± 0.09

(a) Average length for the “new entries”
(b) Average percentage of ambiguities for the “total entries”
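
The bookkeeping behind Table 1 can be checked with a few lines of Python. The sketch below is only an illustration, not part of the LTP tooling: the row values are copied from the table, while the function name `check_rows` and the tuple layout are assumptions made for this example. It verifies that each net increase equals new minus deleted entries and that each total equals the previous total of the same gene type plus the net increase.

```python
# Rows copied from Table 1: (release, gene type, total, new, deleted, net increase)
ROWS = [
    ("LTPs93", "SSU", 6728, 6728, 0, 6728),
    ("LTPs95", "SSU", 7006, 299, 21, 278),
    ("LTPs100", "SSU", 7710, 750, 46, 704),
    ("LTPs102", "SSU", 8029, 363, 44, 319),
    ("LTPs102", "LSU", 792, 792, 0, 792),
    ("LTPs104", "SSU", 8545, 545, 29, 516),
    ("LTPs106", "SSU", 8815, 279, 9, 270),
    ("LTPs108", "SSU", 9279, 490, 26, 464),
]


def check_rows(rows):
    """Return a list of inconsistencies found in the table (empty if none)."""
    problems = []
    last_total = {}  # most recent total per gene type (SSU / LSU)
    for release, gene, total, new, deleted, net in rows:
        # "Net increase" is defined as new entries minus deleted entries.
        if new - deleted != net:
            problems.append(f"{release} {gene}: net {net} != {new} - {deleted}")
        # Each total should equal the previous total plus the net increase.
        expected = last_total.get(gene, 0) + net
        if expected != total:
            problems.append(f"{release} {gene}: total {total} != {expected}")
        last_total[gene] = total
    return problems


if __name__ == "__main__":
    print(check_rows(ROWS) or "Table 1 is internally consistent")
```

The SSU and LSU series are tracked separately because the single LSU release starts its own count.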
sequence per type strain in order to maintain simplicity, avoid confusion, and improve tree navigation and database usability. In general, the best quality available (including manual inspection of the alignment) was selected for the project and, in case of doubt, the earliest submission to an INSDC partner (www.insdc.org). From release LTPs102 (Yarza et al. 2010), when multiple paralogues exist due to rRNA operon copy number, several copies are kept if they show less than 98 % sequence identity (see below for further details).
LTP is maintained by scrutinizing the newly described species, nomenclatural changes, taxonomic notes, and opinions that are published monthly in the IJSEM journal. Their respective 16S and 23S rRNA gene sequence entries are acquired from the latest SILVA release and appended to the existing LTP database. Therefore, SILVA’s Reference (Ref) ARB databases (http://www.springerreference.com/docs/html/chapterdbid/304116.html) serve as template for the new LTP-ARB databases. Until now (December 2012) one LSU-based and seven SSU-based LTP releases have been produced (Table 1). New species are incorporated into the database only if they have a good-quality sequence in the respective SILVA release. Certain entries can be deleted if their corresponding species names are later found to be heterotypic synonyms, if they become rejected, or as a matter of taxonomic opinions. Sequence entries existing in an LTP database can also change by means of their metadata. Thus, for example, new combinations (i.e., a type strain which changes its name due to reclassification) or subdivision of a species into subspecies lead to a modification of the entry's taxonomic information fields.

Inaccurate or Mistaken Metadata
Inaccurate sequence-associated metadata tend to occur in more than 50 % of the newly added 16S rRNA entries (Table 1). Often, these “mistakes” arise because entries are not updated at the time of their first appearance in a scientific publication. This mainly occurs in taxonomy-associated information fields. Proving the uniqueness of a new species and naming it take time and, in the meantime, sequences are quickly produced and easily submitted to nucleotide databases. Most often, these submissions only show genus specifications; for example, sequence entry GU808562 appears as “Hymenobacter sp. HMD1010” but its real name is Hymenobacter yonginensis. Indeed, a Bacteriological Code-compliant (Lapage et al. 1992) nomenclature may be somewhat tricky, and it is common to
consider several Latin terms and derivations until one species name is finally accepted by authors and reviewers. Unavoidably, this bad-quality information is propagated from INSDC databases (primary sources) to other technological services like dedicated ribosomal databases (e.g., SILVA). Although extensive data curation is not a task of primary sources of information, it would be very beneficial if authors strengthened their commitment to the correctness of the metadata provided (e.g., the species name) or if authors were required to update their INSDC entries prior to manuscript acceptance (recommended action for scientific journals). This rough data finally arrives at resources like LTP, which have no choice but to check it carefully and provide new informational fields with corrected information; curated information can then flow back to other resources.

Multiple Copies of the Ribosomal Operon
In 2010, a comprehensive study was conducted to evaluate the intra-genomic variability of the 16S rRNA gene on complete type-strain genomes (Yarza et al. 2010). We observed that in very unusual exceptions, the intra-genus (94.5 %; Yarza et al. 2008) or intraspecies (98.7 %; Stackebrandt and Ebers 2006) boundaries could be exceeded within a single genome. In such cases, the selection of one or another sequence might seriously affect the interpretation of a phylogenetic inference. However, despite the fact that the vast majority of strains contain multiple copies of the rrn operon, only 2 % of them show divergences beyond 2 % (30 nucleotides). Thus, most likely, the selection of one or another copy should not affect the phylogenetic reconstructions. Consequently, starting from release s104 (Munoz et al. 2011), the LTP database includes all paralogues with divergences higher than 2 %. So far, this is the case for three species: Haloarcula marismortui ATCC 43049T, accession number AY596297, with 5.7 % of maximum inter-operonic divergence; Thermoanaerobacter pseudethanolicus ATCC 33223T, accession number CP000924, with 3.66 % of maximum inter-operonic divergence; and Desulfitobacterium hafniense DCB-2T, accession number CP001336, with 4.34 % of maximum inter-operonic divergence.

Sequence Quality in LTP Datasets
It has been suggested that sequences produced for taxonomic purposes should be at least 1,450 bases long with less than 0.5 % ambiguities (Stackebrandt et al. 2002). The reason is that the informative content of a molecular clock is linked to the total number of its variable positions (Ludwig and Klenk 2001). Statistics derived from LTP datasets indicate that, in general, sequence quality is acceptable for in-depth phylogenetic studies (~1,455 bases and 0.02 % ambiguities for LTPs108; Table 1). Figure 2 shows the annual variation of gene sequence length and percentage of ambiguities. The quality increase is mainly observed as a reduction in ambiguities, probably related to improved sequencing techniques. In any case, the completion of more full genome sequences of type strains will substantially increase the sequence quality (indicated by these two parameters) in the LTP database. Researchers should be encouraged to complete the 5′ ends of 16S rRNA gene sequences, as the first 250 bases contain hypervariable regions V1 and V2, which play an important role in comparisons between highly related organisms (Chakravorty et al. 2007).

All-Species Living Tree Project, Fig. 2 Annual distribution of the 16S rRNA gene sequence length and % of ambiguities in the 9,279 type-strain sequences corresponding to LTP release s108. Gene sequence length is given by the SILVA parameter “nuc_gene_slv” which cuts off the bases at the extremes when beyond the E. coli 16S rRNA gene limits. Percentage of ambiguities is given by the SILVA descriptor “ambig_slv”

Curated Metadata Introduced by the LTP
In addition to regular fields provided by the ARB-SILVA databases, sequence entries now include the following LTP-specific metadata fields:
1. fullname_ltp: corrected species name according to LPSN (http://www.bacterio.net).
2. rel_ltp: name of the LTP release where a sequence entry appeared for the first time.
3. hi_tax_ltp: name of the family where the taxon is classified. For unclassified genera, the name of the next available higher taxon above genus (e.g., “Acidobacteria” for Bryobacter aggregatus).
4. type_ltp: type species receive the label “type sp.” in this field.
5. riskgroup_ltp: risk-group classification of microorganisms obtained from the DSMZ
(Deutsche Sammlung von Mikroorganismen und Zellkulturen), according to the Federal Institute for Occupational Safety and Health (BAuA) in Germany.
6. tax_ltp: taxonomic classification into higher taxonomic ranks according to LPSN (http://www.bacterio.cict.fr/classifphyla.html).
7. url_lpsn_ltp: it contains the variable part of the URL leading to the LPSN’s species file (e.g., http://www.bacterio.net/bryobacter.html).

Alignments and Phylogenetic Trees
Setting up universal alignments is a key step in order to achieve optimal and comparable phylogenetic reconstructions. It has been one of the constant motivations of Wolfgang Ludwig and co-workers, who dealt with the huge task of preparing common and reliable alignments of ribosomal SSU and LSU sequences of Bacteria, Archaea, and Eukarya (Ludwig and Schleifer 1994). They found that secondary structure formations such as loops and helices occurred at the same relative positions along the molecule. This helped to refine the alignments because variable stretches, with low sequence similarities, could be optimally positioned by recognizing functional homology (due to evolutionary pressure) and functional stability of helices (due to chemical stability of base-pair bonds). A core dataset of sequences with highly curated alignments was incorporated into the SILVA system so that newly added sequences can be automatically aligned using this “seed alignment” as a reference (Ludwig et al. 2004; Pruesse et al. 2007). Periodically, more manually curated sequences are added to the seed, which improves its quality over time.
Although all new sequences incorporated into the LTP come from an ARB-SILVA database, they are again manually revised to further correct misplaced bases and to check highly variable regions. Before tree calculation, the complete alignment is filtered using maximum frequency filters (Table 2) that remove dubious orthologous positions caused by sequencing errors and hypervariability. Typically, LTP phylogenetic trees are calculated using the 40 % maximum frequency filter.
All-Species Living Tree Project, Table 2 Maximum frequency filters implemented into the LTPs108 ARB database

Filter name       Start position   Stop position   % min (a)   % max (a)   No. of positions (b)
LTPs108_ssu_10    0                50,000          10          100         1,433
LTPs108_ssu_20    0                50,000          20          100         1,433
LTPs108_ssu_30    0                50,000          30          100         1,432
LTPs108_ssu_40    0                50,000          40          100         1,390
LTPs108_ssu_50    0                50,000          50          100         1,288

(a) Minimum and maximum sequence identity. For tree reconstructions, columns are taken into account only if they have a positional conservation above the respective minimum values
(b) Number of homologous positions (columns) taken into account for tree reconstructions
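
The idea behind a maximum frequency filter can be illustrated with a minimal Python sketch. This is not the ARB implementation: the function names, the treatment of gaps (ignored when computing frequencies, with all-gap columns dropped), and the inclusive bounds are assumptions made for this example. A column is kept when its most frequent residue accounts for a share of the non-gap characters that lies between the minimum and maximum percentage.

```python
from collections import Counter

GAP_CHARS = {"-", "."}  # assumed gap symbols for this sketch


def max_frequency_filter(alignment, min_pct, max_pct=100.0):
    """Return indices of alignment columns whose most frequent residue makes up
    between min_pct and max_pct percent of the non-gap characters."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    kept = []
    for col in range(length):
        residues = [seq[col].upper() for seq in alignment if seq[col] not in GAP_CHARS]
        if not residues:
            continue  # column contains only gaps
        top_count = Counter(residues).most_common(1)[0][1]
        conservation = 100.0 * top_count / len(residues)
        if min_pct <= conservation <= max_pct:
            kept.append(col)
    return kept


def apply_filter(alignment, columns):
    """Reduce every aligned sequence to the selected columns."""
    return ["".join(seq[c] for c in columns) for seq in alignment]


# Toy alignment (hypothetical data): four sequences, six columns.
toy = [
    "ACGU-A",
    "ACGA-C",
    "ACCU-G",
    "AUCG-U",
]

if __name__ == "__main__":
    columns = max_frequency_filter(toy, min_pct=50)
    print(columns)
    print(apply_filter(toy, columns))
```

Raising the minimum percentage removes more variable columns, which mirrors the decreasing number of positions from the 10 % to the 50 % filter in Table 2.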
The first 16S rRNA-based phylogenetic tree was calculated for the release LTPs93 (Yarza et al. 2008). The sequence dataset consisted of 6,728 type-strain sequences plus 3,247 supporting sequences belonging to non-type strains used to reinforce underrepresented groups and to stabilize the topology. The multiple alignment of 9,975 16S rRNA gene sequences was submitted to different treeing methodologies including neighbor-joining, maximum likelihood, and maximum parsimony, all tested with several filters (30 %, 40 %, and 50 % maximum frequency filters) and all implemented in the ARB software package (Ludwig et al. 2004). A high degree of congruence was observed among them. The tree considered as optimal was a 40 %-filtered maximum likelihood reconstruction calculated using the RAxML algorithm (Stamatakis 2006), with the GTRGAMMA correction, with 100 bootstrap replicates, in a 5-node and 20-processor parallel environment. The last de novo phylogenetic reconstruction appears in the release LTPs108 and was similarly calculated; tree calculation was run with a dataset of 12,166 16S rRNA gene sequences.
The phylogenetic tree calculated using the 23S rRNA gene was particularly challenging due to data shortage in many groups. In order to set up a reliable phylogeny based on 23S rRNA data, we defined a core dataset made of high-quality sequences (type and non-type strains). The stringent quality filtering approach ended with around 2,000 high-quality and nonredundant LSU sequences. This dataset was submitted to a maximum likelihood reconstruction in combination with a 50 % maximum frequency filter allowing 2,463 positions of the entire alignment. The missing partial or lower-quality type-strain sequences were added to the tree using the ARB parsimony tool with the option for keeping the initial topology while inserting additional data.
The groups shown in the trees are defined by recognizing the type members and according to the taxonomic classification. The trees are carefully compared against previously reported topologies and current taxonomic classifications (Yarza et al. 2010). All the additional supporting sequences used to reconstruct the phylogeny are removed from the final tree while keeping its topology intact. Within the ARB database, the type species are labeled with a distinct color for easy recognition and tree handling.

Files Provided by the LTP

As a taxonomic tool, the LTP must be understood as a collection of reference materials, all publicly available at the project’s Web page (http://www.arb-silva.de/projects/living-tree), including:
1. Release documentation: (I) readme file with a release description and (II) PDF document describing the metadata fields introduced by the LTP
2. Tables: (I) new entries with outdated submission names and (II) list of changes in the dataset: added/deleted/modified entries
3. Export filter: ARB-export filter (.eft format) to extract data from LTP-ARB databases
4. Databases: (I) complete ARB databases including sequences, alignments, metadata, filters, and trees and (II) datasets in CSV format including LTP metadata
5. Alignments: (I) gapped exports in multi-FASTA format and (II) compressed exports in multi-FASTA format
6. Phylogenetic trees: (I) collapsed overviews in PDF format showing the distinct phyla, (II) full SSU (more than 80 pages long) and LSU trees in PDF format, and (III) full trees in NEWICK format, including group names and branch lengths

Side Research

Sequencing the Orphan Species Initiative (SOS)
The understanding that around 6 % of all classified species were missing from the ribosomal SSU sequence catalogues motivated us to start the “Sequencing the Orphan Species” (SOS) initiative (Yarza et al. 2013). During 3 years of work, the LTP team coordinated a network of 12 partner researchers and culture collections (ATCC, BZF, CECT, CIP, CCUG, DSMZ, JCM, ICMP, BCCM/LMG, MMG, NBRC, NCCB) in order to improve this situation by (re)sequencing the 16S rRNA gene of the “orphan” species. As a result, 351 type strains are now represented by a good-quality SSU gene sequence in the databases. They comprise representatives of 14 bacterial and archaeal phyla, 76 type species, and 78 pathogenic species. However, 201 type strains could not be accessed, as cultivable strains were not available at recognized culture collections. They represent 10 phyla and 17 type species.

Taxonomic Boundaries
In order to understand how the higher taxonomic categories could be circumscribed by means of a sequence identity threshold, we performed a statistical procedure to get the lowest similarity found within the members of a certain taxon (Yarza et al. 2008, 2010). By taking into account all the taxa at a particular taxonomic rank, we obtained general lower cutoff values of sequence identity for genus, family, and phylum based on 16S rRNA and 23S rRNA. In general, minimum 16S rRNA gene sequence identities of 94.9 % ± 0.4, 87.5 % ± 1.3, and 78.4 % ± 2.0 lead to the circumscription of a new genus, family, and phylum, respectively. For 23S rRNA genes, these values are slightly different: 93.2 % ± 1.3 (genus), 87.7 % ± 2.5 (family), and 75.3 % (phylum). As shown by the low errors, historically used criteria for genera, families, and phyla are quite homogeneous and do not lead to unambiguous circumscriptions. These cutoffs should be used with caution and always as a complementary approach. They are especially recommended for prospective studies in clone libraries and as additional support for the circumscription of new taxa or emendation of existing ones.

Summary

SSU and LSU databases made by the All-Species Living Tree Project (LTP; http://www.arb-silva.de/projects/living-tree) provide high-quality nearly full-length sequences of the type strains of all Archaea and Bacteria with validly published names. Setting up a type-strain sequence database included the sieving of the public DNA databases, whose sequence entries often appeared outdated or mistaken in their taxonomic metadata. It involved the initial manual cross-check of nearly 14,000 SSU and 6,000 LSU sequence entries against the catalogue of distinct species with validly published names retrieved from LPSN. Databases are complemented with manually curated metadata, manually curated alignments, and state-of-the-art phylogenetic reconstructions (in contrast to other similar resources like the EzTaxon (Santamaria et al. 2012)). The LTP team wants to remark that the aim of the project is not to reconstruct the currently described species genealogy with total fidelity but to provide a curated taxonomic tool for the scientific community. Our small but very representative SSU and LSU datasets may be used as a reference for identification and classification purposes in several fields of application, for example, facilitating the collection of sequences for the reconstruction of taxa genealogies (Cousin et al. 2012), enabling fast and
reliable taxonomic affiliations in rRNA surveys (Santamaria et al. 2012), or serving as reference datasets for testing bioinformatic procedures (Mizrahi-Man et al. 2013).

Cross-References

▶ Culture Collections in the Study of Microbial Diversity, Importance
▶ Phylogenetics, Overview
▶ SILVA Databases

References

Amann R, Ludwig W, Schleifer KH. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev. 1995;59:143–69.
Chakravorty S, Helb D, Burday M, et al. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J Microbiol Methods. 2007;69:330–9.
Coenye T, Gevers D, Van de Peer Y, et al. Towards a prokaryotic genomic taxonomy. FEMS Microbiol Rev. 2005;29:147–67.
Cousin S, Gulat-Okalla ML, Motreff L, et al. Lactobacillus gigeriorum sp. nov., isolated from chicken crop. Int J Syst Evol Microbiol. 2012;62:330–4.
Fox GE, Pechman KR, Woese CR. Comparative cataloguing of 16S ribosomal ribonucleic acid: molecular approach to prokaryotic systematics. Int J Bacteriol. 1977;27:44–57.
Garrity GM. Bergey’s manual of systematic bacteriology. 2nd ed. New York: Springer; 2001.
Lapage SP, Sneath PHA, Lessel EF, et al. International code of nomenclature of bacteria (1990 revision). Washington, DC: American Society for Microbiology; 1992. p. 295.
Ludwig W, Klenk HP. Overview: a phylogenetic backbone and taxonomic framework for prokaryotic systematics. In: Boone DR, Castenholz RW, Garrity GM, editors. Bergey’s manual of systematic bacteriology. 2nd ed. New York: Springer; 2001. p. 49–65.
Ludwig W, Schleifer KH. Bacterial phylogeny based on 16S and 23S rRNA sequence analysis. FEMS Microbiol Rev. 1994;15:155–73.
Ludwig W, Strunk O, Westram R, et al. ARB: a software environment for sequence data. Nucleic Acids Res. 2004;32:1363–71.
Mizrahi-Man O, Davenport ER, Gilad Y. Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs. PLoS One. 2013;8:e53608.
Munoz R, Yarza P, Ludwig W, et al. Release LTPs104 of the all-species living tree. Syst Appl Microbiol. 2011;34:169–70.
Pruesse E, Quast C, Knittel K, et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35:7188–96.
Santamaria M, Fosso B, Consiglio A, et al. Reference databases for taxonomic assignment in metagenomics. Brief Bioinform. 2012;13:682–95.
Stackebrandt E, Ebers J. Taxonomic parameters revisited: tarnished gold standards. Microbiol Today. 2006;33:152–5.
Stackebrandt E, Frederiksen W, Garrity GM, et al. Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol. 2002;52:1043–7.
Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–90.
Yarza P, Richter M, Peplies J, et al. The all-species living tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst Appl Microbiol. 2008;31:241–50.
Yarza P, Ludwig W, Euzéby J, et al. Update of the all-species living tree project based on 16S and 23S rRNA sequence analyses. Syst Appl Microbiol. 2010;33:291–9.
Yarza P, Spröer C, Swiderski J, et al. Sequencing Orphan Species initiative (SOS): filling the gaps in the 16S rRNA gene sequence database for all species with validly published names. Syst Appl Microbiol. 2013;36:69–73.

antiSMASH

Eriko Takano¹, Rainer Breitling¹ and Marnix H. Medema²
¹ Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
² Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany

Definition

antiSMASH (Medema et al. 2011) is a web server and a stand-alone software to identify, annotate, and compare gene clusters that encode the biosynthesis of secondary metabolites in bacterial and fungal genomes. antiSMASH offers a wide
range of options to identify and analyze biosynthetic gene clusters, including protein domain analysis of the large multi-domain enzymatic assembly lines involved, prediction of core chemical structures of their end compounds, and multiple cluster alignments to a database of all currently sequenced gene clusters.
The antiSMASH web server can be found at http://antismash.secondarymetabolites.org.

Introduction

Microbial secondary metabolites are of great interest to society because of their diverse biological activities, which make them interesting starting points for drug development. Many of them are already used as antibiotics, antitumor agents, or cholesterol-lowering drugs (Hutchinson and McDaniel 2001; Fischbach and Walsh 2009). Automated computational identification of gene clusters in newly sequenced genomes is becoming a cornerstone of genome-based drug discovery, due to the affordability of sequencing large numbers of genomes from microorganisms that potentially produce novel secondary metabolites (Walsh and Fischbach 2010).

Functionalities

Gene Cluster Detection
antiSMASH detects a wide range of different types of biosynthetic gene clusters, including those encoding the pathways toward polyketides (PKs), nonribosomal peptides (NRPs), terpenoids, ribosomal peptides, aminoglycosides, and non-NRP siderophores. The detection is performed by screening the gene sequences from the input against a library of profile Hidden Markov Models (pHMMs) (Eddy 2011), each of which is specific for genes characteristic of a certain gene cluster type, and passing the results through a hierarchical logic filter. A second detection algorithm is also run, which detects genomic regions that are enriched in Pfam domains (Finn et al. 2010) linked to secondary metabolism.

Protein Domain Analysis of Polyketide Synthases and Nonribosomal Peptide Synthetases
PKs and NRPs are synthesized by large megasynthase enzymes containing a multitude of protein domains, such as condensation (C), adenylation (A), and PCP-binding domains in nonribosomal peptide synthetases (NRPSs) and ketosynthase (KS), acyltransferase (AT), and ACP-binding domains in polyketide synthases (PKSs) (Fischbach and Walsh 2006). antiSMASH contains a library of pHMMs that can recognize all these protein domains as well as distinguish between various subtypes of these domains. In the antiSMASH output, the domain structures of any NRPSs or PKSs encoded in a gene cluster are visualized, and several downstream analysis options are provided for each domain (Fig. 1).

Core Chemical Structure Prediction
When a secondary metabolite biosynthesis gene cluster is detected, one of the key questions is what kind of chemical structure it produces. For NRPs and PKs, antiSMASH is able to give a first approximation of the core chemical structure of the end compound (Fig. 2). To do so, it uses several substrate specificity prediction methods (Yadav et al. 2003; Minowa et al. 2007; Röttig et al. 2011) that are based on the amino acid sequences of the A domains of NRPSs and the AT domains of PKSs. To infer the sequential arrangement of the predicted substrates of the A/AT domains in the resulting polyketide or peptide, the order of the PKS enzymes in a multimodular assembly line is predicted using their estimated docking domain binding affinities (Yadav et al. 2009) or, alternatively, colinearity of the PKS or NRPS genes with their enzymes is assumed.

Comparative Analysis of Gene Clusters
In order to understand the architecture and function of a secondary metabolite biosynthesis gene cluster, much is gained by examining it within its evolutionary context through the
[Figure 1: domain structure diagram of SCO6273 (type I modular PKS), with domains labeled KS, AT, DH, KR, and TD]
antiSMASH, Fig. 1 Domain structure of multi-domain enzymes such as PKSs and NRPSs as visualized by antiSMASH, offering several options for analysis when the mouse is positioned over a domain: one can, for example, run a BlastP search specifically with the sequence of this domain

[Figure 2: predicted core chemical structure drawing]
antiSMASH, Fig. 2 Prediction of the core chemical structure of an NRP by antiSMASH. The residues are based on a consensus between three prediction methods for the substrate specificities of the NRPS adenylation domains in the gene cluster
comparison with related gene clusters from species across the tree of life. To facilitate this, antiSMASH hosts a regularly updated database of gene clusters it has detected in all nucleotide sequences present in GenBank. antiSMASH then combines multiple BlastP runs into a comparative search of every identified gene cluster against all other known gene clusters. This is used to generate a multiple gene cluster alignment (Fig. 3), which can aid the biologist in assessing the novelty of the gene cluster, detecting the borders of the gene cluster, and identifying the conserved multigene modules that constitute its building blocks.

Secondary Metabolism-Specific Gene Family Analysis
Most genes involved in the biosynthesis of secondary metabolites have (close) homologues with similar functions in other secondary metabolite biosynthesis gene clusters. This can be used to infer the functions of the genes
antiSMASH, Fig. 3 Example of a multiple gene cluster alignment by antiSMASH, showing identified homologue
clusters of the query gene cluster
residing in the biosynthetic gene cluster based on sequence homology. antiSMASH simplifies this process by categorizing the genes of every identified gene cluster into secondary metabolism-specific gene families and automatically generating approximate phylogenetic trees of each gene in the context of its gene family.

Genome-Wide Pfam and Blast Analysis
Finally, antiSMASH also offers the possibility (transferred from CLUSEAN; Weber et al. 2009) to do a comprehensive analysis of all genes within a submitted genome, identifying Pfam matches and running Blast for each gene against a database of all bacterial and fungal protein sequences.

Stand-Alone Version

Stand-alone versions of antiSMASH are available for download for Windows, Mac OS X, and Ubuntu Linux. Additionally, several related scripts are available from the antiSMASH website. An EMBL formatting script can be downloaded to format raw FASTA sequences together with a text file containing gene
annotations into an EMBL file that can be submitted to antiSMASH. Also, a script is available which allows running antiSMASH on multiple files, in batch mode.

Development

antiSMASH is still under active development. Some features projected for the next release are batch input on the web server, protein sequence input, and subclass prediction for enzyme classes like terpene synthases and trans-AT PKSs. Feature requests, bug reports, or other questions/suggestions can be sent to the development team via the online contact form on the antiSMASH website.

Related Tools

Several other software tools for the study of secondary metabolism have been published. For example, ClustScan (Starcevic et al. 2008) and NP.searcher (Li et al. 2009) can both be used to detect bacterial polyketide and NRP biosynthesis gene clusters. The same is the case for CLUSEAN (Weber et al. 2009), the pipeline which has now been integrated entirely into antiSMASH. For the analysis of fungal sequences, SMURF (Khaldi et al. 2010) offers a gene cluster detection potential similar to that of antiSMASH. Structural analysis of polyketide synthases can be performed with the SBSPKS suite (Anand et al. 2010). Finally, draft genomes with many small contigs and metagenomes with fragments too small for gene cluster detection can be scrutinized with NaPDoS (Ziemert et al. 2012) in order to find protein domains related to secondary metabolite biosynthesis and analyze these phylogenetically.

Summary

antiSMASH is an easy-to-use web server for the detection of secondary metabolite biosynthesis gene clusters. Various functionalities (comparative, phylogenomic, enzymatic, etc.) are integrated in one single pipeline, making it straightforward for genomicists and natural product researchers to study the biosynthetic potential of any organism.

Cross-References

▶ Bacteriocin Mining in Metagenomes
▶ CLUSEAN, Overview
▶ Mining Metagenomic Datasets for Antibiotic Resistance Genes
▶ Phylogenetics, Overview

References

Anand S, Prasad MV, Yadav G, Kumar N, Shehara J, Ansari MZ, Mohanty D. SBSPKS: structure based sequence analysis of polyketide synthases. Nucleic Acids Res. 2010;38:W487–96.
Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195.
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–22.
Fischbach MA, Walsh CT. Assembly-line enzymology for polyketide and nonribosomal peptide antibiotics: logic, machinery, and mechanisms. Chem Rev. 2006;106:3468–96.
Fischbach MA, Walsh CT. Antibiotics for emerging pathogens. Science. 2009;325:1089–93.
Hutchinson CR, McDaniel R. Combinatorial biosynthesis in microorganisms as a route to new antimicrobial, antitumor and neuroregenerative drugs. Curr Opin Investig Drugs. 2001;2:1681–90.
Khaldi N, Seifuddin FT, Turner G, Haft D, Nierman WC, Wolfe KH, Fedorova ND. SMURF: genomic mapping of fungal secondary metabolite clusters. Fungal Genet Biol. 2010;47:736–41.
Li MH, Ung PM, Zajkowski J, Garneau-Tsodikova S, Sherman DH. Automated genome mining for natural products. BMC Bioinformatics. 2009;10:185.
Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA, Weber T, Takano E, Breitling R. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 2011;39:W339–46.
A 38 Approaches in Metagenome Research: Progress and Challenges
Minowa Y, Araki M, Kanehisa M. Comprehensive analysis of distinctive polyketide and nonribosomal peptide structural motifs encoded in microbial genomes. J Mol Biol. 2007;368:1500–17.
Röttig M, Medema MH, Blin K, Weber T, Rausch C, Kohlbacher O. NRPSpredictor2 – a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. 2011;39:W362–7.
Starcevic A, Zucko J, Simunkovic J, Long PF, Cullum J, Hranueli D. ClustScan: an integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res. 2008;36:6882–92.
Walsh CT, Fischbach MA. Natural products version 2.0: connecting genes to molecules. J Am Chem Soc. 2010;132:2469–93.
Weber T, Rausch C, Lopez P, Hoof I, Gaykova V, Huson DH, Wohlleben W. CLUSEAN: a computer-based framework for the automated analysis of bacterial secondary metabolite biosynthetic gene clusters. J Biotechnol. 2009;140:13–7.
Yadav G, Gokhale RS, Mohanty D. Computational approach for prediction of domain organization and substrate specificity of modular polyketide synthases. J Mol Biol. 2003;328:335–63.
Yadav G, Gokhale RS, Mohanty D. Towards prediction of metabolic products of polyketide synthases: an in silico analysis. PLoS Comput Biol. 2009;5:e1000351.
Ziemert N, Podell S, Penn K, Badger JH, Allen E, Jensen PR. The natural product domain seeker NaPDoS: a phylogeny based bioinformatic tool to classify secondary metabolite gene diversity. PLoS One. 2012;7:e34064.

Approaches in Metagenome Research: Progress and Challenges

Heiko Nacke and Rolf Daniel
Institute of Microbiology and Genetics, Georg–August–University of Göttingen, Göttingen, Germany

Synonyms

Function-based screening, Metagenomic biomolecule, Metagenomic library, Metagenomics, Next-generation sequencing, Sequence-based screening

Definition

Metagenomics comprises the culture-independent and DNA-based analysis of entire microbial communities and complements cultivation-based analysis of microorganisms. Metagenomic approaches allow comprehensive insights into the phylogenetic and functional diversity of complex microbial consortia present in moderate as well as extreme environments on Earth. The introduction of next-generation sequencing technologies enabled cost-effective high-throughput sequencing of metagenomic DNA molecules, resulting in increased resolution of microbial community analysis. In addition, screening of metagenomic libraries led to the identification of numerous novel biomolecules from various environments such as soil, seawater, or glacial ice.

Introduction

The highly diverse microbial niches on Earth harbor an extraordinarily high abundance and diversity of prokaryotic and eukaryotic microorganisms. The human body is colonized by a wide variety of microbes representing all three domains of life. The entirety of these microbial cells (the human microbiome), which is often described as an additional organ, exceeds the number of human cells by at least an order of magnitude, and its collective gene content outnumbers human genes by more than 100-fold (Li et al. 2012; Weinstock 2012). Various microorganisms have also been detected in extreme environments such as hydrothermal vents, sea ice, or deep inside the Earth’s crust. For example, a phylogenetically diverse and metabolically active microbial assemblage was identified in the brine of an ice-sealed Antarctic lake (Murray et al. 2012). The microorganisms existing in this aphotic ecosystem withstand a temperature of −13 °C, anoxic conditions, and high salinity.
Currently, less than 1 % of the microorganisms on Earth are readily culturable under laboratory conditions. To investigate the high percentage of uncultured microbes, different
metagenomic approaches can be routinely applied. Metagenomics allows the direct study of the collective genomes present in microbial ecosystems (Handelsman 2004). This approach has significantly expanded our knowledge of microbial phylogenetic and functional diversity and enabled the discovery of numerous previously unknown biomolecules. In the recent history of metagenomics, next-generation sequencing techniques in particular, which allow cost-effective and rapid decoding of metagenomic DNA, have been applied to analyze microbial populations. As a consequence, a number of bioinformatic tools to evaluate and compare comprehensive high-throughput metagenomic data have been developed in the last few years.
In this review, an overview of traditional and recent metagenomic research approaches, associated future challenges, and a short description of related meta-omic studies will be given.

Microbial Phylogenetic and Functional Diversity Determination

Small-subunit rRNA genes, universally distributed across prokaryotic and eukaryotic organisms, can be considered as evolutionary clocks enabling phylogenetic analysis. Most commonly, metagenome-derived 16S rRNA and 18S rRNA genes are used to phylogenetically characterize microbial communities. Furthermore, other conserved genes such as recA, rpoB, HSP70, or EF-Tu allow phylogenetic assignments (Ludwig and Klenk 2001). These genes can be investigated by applying traditional molecular approaches, including fingerprinting methods such as denaturing gradient gel electrophoresis and terminal restriction fragment length polymorphism analysis, or Sanger sequencing. A significant drawback of the Sanger sequencing-based analysis of microbial communities is the time-consuming and labor-intensive nature of this approach, as well as the required construction of clone libraries.
More recently, next-generation sequencing platforms were used to decode metagenomic DNA. Currently, the following next-generation sequencing technologies are available: sequencing by ligation (SOLiD – Applied Biosystems/Life Technologies), sequencing by synthesis (Solexa/Illumina), semiconductor chip sequencing (Ion Torrent/Life Technologies), pyrosequencing (454/Roche), and single-molecule sequencing (Oxford Nanopore Technologies, SMRT – Pacific Biosciences). Compared to Sanger sequencing, these cloning-independent techniques allow the generation of far more sequence data per run. Thus, microbial diversity comparisons between different environmental samples, requiring replicated data and statistical analysis, as well as deep analysis of highly complex microbial community structures, are possible. Currently, tens to hundreds of thousands of partial metagenomic small-subunit rRNA gene sequences are often produced using next-generation sequencing platforms. In a recent pyrosequencing-based 16S rRNA gene survey, a total of 41,141 bacterial and 30,651 archaeal sequences were analyzed to investigate prokaryotic diversity in Yunnan and Tibetan hot springs (Song et al. 2013). To (pre-)process small-subunit rRNA gene sequence datasets, various tools, software packages, analytical web servers, and virtual instances can be used (Gonzalez and Knight 2012). The QIIME package (Caporaso et al. 2010) provides workflows to extensively analyze high-throughput amplicon-based sequence data starting with raw sequences. Nevertheless, the avoidance of marker gene amplification bias by applying direct sequencing of metagenomic DNA instead of amplicon-based sequencing allows the most exact taxonomic assessment (Simon and Daniel 2011). To further improve microbial diversity and abundance estimation, Kembel et al. (2012) recently introduced an approach that incorporates 16S rRNA gene copy number information.
To identify the taxonomic affiliation of all sequences derived from metagenomic DNA, a process called binning can be carried out. Within binning procedures, sequences of a metagenomic dataset sharing the same taxonomic origin are “binned” (grouped). Composition-based binning is based on conserved genomic features such as dinucleotide
frequencies, GC content, and synonymous codon usage, whereas similarity-based binning makes use of sequence homology. Among others, PhyloPythiaS, introduced by Patil et al. (2011), represents an appropriate application to perform composition-based binning. With respect to similarity-based binning, searches against reference databases (e.g., National Center for Biotechnology Information databases) are typically performed using alignment tools such as BLAST+ (Camacho et al. 2009). Subsequently, BLAST results can be interpreted by applying software such as MEGAN (Huson et al. 2011).
Due to the often very high diversity of microbial communities, assembly of metagenome-derived sequences is challenging. In a recent metagenomic survey of honey bee gut microbiota, de novo assembly of 81,343,096 Illumina paired-end reads resulted in 54,700 scaffolds of contigs (total length, 76.6 Mb) (Engel et al. 2012). As in the approach of Engel et al. (2012), single-genome assemblers have been used for metagenome assembly with modified settings. Recently, a single-genome assembler (Velvet) has been extended to enable the assembly of short metagenomic reads (Namiki et al. 2012). For simulated datasets, this new de novo assembler (MetaVelvet) generated significantly higher N50 scores, a parameter that evaluates assembly quality, than the single-genome assemblers analyzed.
Based on assemblies or individual metagenomic sequence reads, gene prediction, annotation, and reconstruction of pathways can be carried out to assess the functional potential encoded by metagenomes. Consecutive processing of these steps is provided by a number of web-based tools like MG-RAST (Meyer et al. 2008). These tools utilize resources of reference databases such as SEED (Overbeek et al. 2005) and KEGG (Kanehisa et al. 2008) to link biological information to predicted genes. In a recent survey including metagenomic methods, the functional potential of Arctic Thaumarchaeota was investigated (Alonso Sáez et al. 2012). By analyzing a metagenome derived from a Southeast Beaufort Sea sample collected during Arctic winter, Alonso Sáez et al. (2012) identified thaumarchaeal pathways for ammonia oxidation. A number of other Thaumarchaeota are also capable of ammonia oxidation, but unexpectedly these Arctic thaumarchaeal organisms harbored a high abundance of genes involved in urea transport and degradation.

Metagenomic Biomolecule Discovery

To access the large pool of unexplored biomolecules, microbial community DNA has been extracted and metagenomic libraries have been constructed. Small-insert and large-insert metagenomic libraries can be screened to identify novel biomolecules. For the construction of small-insert libraries containing metagenomic DNA <15 kb, plasmids are appropriate vectors, whereas cosmids, fosmids, and bacterial artificial chromosomes (BACs) can be used for cloning of large metagenomic DNA molecules (cosmids and fosmids, 40 kb; BACs, 100–200 kb). Metagenomic libraries from different microbial habitats such as glacier ice, digestive tracts of animals, soil, hot springs, or seawater have already been constructed and successfully screened for novel biomolecules (see, e.g., Nacke et al. 2012). Some of these biomolecules exhibit valuable characteristics for industrial applications such as thermal stability, halotolerance, and activity under acidic or alkaline conditions. In a recent metagenomic approach, Sulaiman et al. (2012) isolated a gene encoding a novel cutinase homolog, designated LC-cutinase, with polyethylene terephthalate-degrading activity from a leaf-branch compost fosmid library. The enzyme showed higher specific polyethylene terephthalate-degrading activity than previously reported bacterial and fungal cutinases. Thus, LC-cutinase is a potent candidate for industrial applications, e.g., in the textile industry. In general, two different metagenomic screening approaches for the identification of novel biomolecules can be distinguished: function-based screening and sequence-based screening.
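Several of the quantities named in the preceding section on diversity determination are simple to compute. The following is a minimal Python sketch, assuming plain DNA strings and a list of contig lengths as input: it derives two of the composition features used for composition-based binning (GC content and dinucleotide frequencies) and the N50 score used to compare assemblies. The function names are illustrative and are not taken from PhyloPythiaS, Velvet, MetaVelvet, or any other cited tool, which use considerably more elaborate models.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def dinucleotide_frequencies(seq: str) -> dict:
    """Relative frequency of each of the 16 overlapping dinucleotides."""
    seq = seq.upper()
    keys = ["".join(pair) for pair in product("ACGT", repeat=2)]
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Pairs containing ambiguous symbols (e.g., N) are ignored.
    total = sum(counts[k] for k in keys)
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def n50(contig_lengths) -> int:
    """Largest length L such that contigs >= L contain at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Hypothetical toy inputs, for illustration only.
print(gc_content("ATGC"))
print(dinucleotide_frequencies("ACGT")["CG"])
print(n50([2, 3, 4, 5, 6]))  # total 20 bases; 6 + 5 = 11 >= 10, so N50 is 5
```

A composition-based binner would compute such feature vectors per contig or read and group sequences with similar vectors; N50 on its own says nothing about assembly correctness, so it is best read alongside total assembly length and contig count.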
Principle and Variations of Function-Driven Screens

To perform function-driven screening, the construction of small-insert or large-insert metagenomic libraries is required. A broad array of different function-based screening approaches can be applied using these libraries. Phenotypic insert detection (PID) is the most frequently applied screening strategy. Metagenomic library-containing clones expressing target genes are identified based on phenotypic characteristics. This method has been applied to identify novel lipolytic genes and gene families from German forest and grassland soil samples using tributyrin as a screening substrate (Nacke et al. 2011). A total of 37 lipolytic clones, encoding novel lipases and esterases, which could be assigned to five different known families and two putatively new families of lipolytic enzymes, were identified by halo formation on indicator agar plates. The potential to identify entirely novel target genes is an important advantage of function-driven screening approaches. Modulated detection (MD) represents another commonly applied strategy to perform function-based screening. A metagenomic library-containing host strain can grow under selective conditions only if it expresses a certain gene product. Recently, novel acid resistance genes were derived from planktonic and rhizosphere microbial communities of the Tinto River (Spain) using this strategy (Guazzaroni et al. 2013). Fifteen genes, mainly encoding putative proteins of unknown function, conferred acid resistance to the host strain Escherichia coli. Moreover, substrate-induced gene expression (SIGEX), product-induced gene expression (PIGEX), and metabolite-regulated expression (METREX) screening strategies allow the identification of target genes from metagenomic libraries (Simon and Daniel 2009). Recently, Wang et al. (2012) suggested biosensor-based genetic transducer (BGT) systems as an alternative and sensitive approach to screen for gene clusters whose expression produces small molecules that activate the employed biosensors. Nevertheless, all of these function-based screening approaches share one significant disadvantage: the dependence of target gene production on the expression machinery of the metagenomic library host.

Principle and Variants of Sequence-Based Screening

Conserved regions of genes or proteins enable sequence-driven screening approaches. Based on these regions, degenerate primers can be designed and fragments of target genes amplified. For example, novel biphenyl dioxygenase DNA segments encoding active site residues were obtained from polychlorobiphenyl-contaminated soils using this strategy (Standfuß-Gabisch et al. 2012). After sequencing of an amplified partial target gene, it can be decoded completely using primer walking and extracted environmental DNA or a metagenomic library as a template. In this way, an entire xylose isomerase gene (xym1) has been derived from a soil metagenomic library (Parachin and Gorwa-Grauslund 2011). The gene product of xym1 consisted of 443 amino acids and was most similar (83 % identity) to a xylose isomerase from Sorangium cellulosum. Additionally, novel complex polyketide and nonribosomal peptide biosynthesis gene clusters, which often exceed the average insert sizes of large-insert metagenomic libraries, can be discovered by using degenerate primers and subsequent chromosome walking (Piel 2011). The potential to identify genes of interest even if they are not expressed in a metagenomic library host represents a major advantage of sequence-based screening, but only novel variants of already-known gene or protein families can be detected by this method.

Future Challenges in Metagenomic Research and Related Meta-omic Approaches

One of the major requirements to combine and compare metagenomic studies conducted by
research groups worldwide is the definition and acceptance of minimum standards in experimental design. The same applies to metatranscriptomics, metaproteomics, and metabolomics. In this way, comparison and combination of results obtained from the different meta-omic approaches become feasible. Metatranscriptomics, metaproteomics, and metabolomics comprise the study of the collective gene transcripts, expressed proteins, and metabolites, respectively, generated by the microorganisms within an ecosystem (Nacke et al. 2014; Hettich et al. 2012; Patti et al. 2012). The consistent application and combination of appropriate meta-omic approaches will lead to an enormous extension of knowledge on the gene structure, diversity, activity, and responses of microbial communities on an ecosystem level. Furthermore, the rapid growth of meta-omic technologies will continuously demand progress in the field of bioinformatics. Thus, further development and linkage of meta-omic analysis tools will be important in the future. In addition, the application and improvement of culture-based methods will remain valuable for extending the number of available reference genomes to which metagenomic data can be mapped. In this context, the young discipline of single cell genomics has the potential to play a complementary role by continuously contributing novel reference genomes.

Summary

The introduction of metagenomics allowed culture-independent analysis of microbial populations in complex ecosystems. Subsequently, other culture-independent meta-omic disciplines including metatranscriptomics, metaproteomics, and metabolomics were established. Metagenomics provided insights into the enormous phylogenetic and functional diversity of microbial communities within various environments on Earth. The increasing number of next-generation sequencing technologies led to a more comprehensive and cost-effective assessment of the information encoded by metagenomic DNA. Metagenomic approaches comprising the construction and screening of metagenomic libraries resulted in the identification of previously unknown biomolecules, including biomolecules with industrially relevant characteristics.

Cross-References

▶ A 123 of Metagenomics
▶ Extraction Methods, Variability Encountered in
▶ Fosmid System
▶ Genome Portal, Joint Genome Institute
▶ Microbial Diversity, Bar-Coding Approaches
▶ Microbial Ecosystems, Protection of
▶ Phylogenetics, Overview

References

Alonso Sáez L, Waller AS, Mende DR, et al. Role for urea in nitrification by polar marine Archaea. Proc Natl Acad Sci USA. 2012;109:17989–94.
Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC Bioinforma. 2009;10:421.
Caporaso JG, Kuczynski J, Stombaugh J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–6.
Engel P, Martinson VG, Moran NA. Functional diversity within the simple gut microbiota of the honey bee. Proc Natl Acad Sci USA. 2012;109:11002–7.
Gonzalez A, Knight R. Advancing analytical algorithms and pipelines for billions of microbial sequences. Curr Opin Biotechnol. 2012;23:64–71.
Guazzaroni ME, Morgante V, Mirete S, et al. Novel acid resistance genes from the metagenome of the Tinto River, an extremely acidic environment. Environ Microbiol. 2013;15:1088–102.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Hettich RL, Sharma R, Chourey K, et al. Microbial metaproteomics: identifying the repertoire of proteins that microorganisms use to compete and cooperate in complex environmental communities. Curr Opin Microbiol. 2012;15:373–80.
Huson DH, Mitra S, Ruscheweyh HJ, et al. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21:1552–60.
Kanehisa M, Araki M, Goto S, et al. KEGG for linking genomes to life and environment. Nucleic Acids Res. 2008;36:D480–4.
Kembel SW, Wu M, Eisen JA, et al. Incorporating 16S gene copy number information improves estimates of
microbial diversity and abundance. PLoS Comput Biol. 2012;8:e1002743.
Li K, Bihan M, Yooseph S, Methé BA. Analyses of the microbial diversity across the human microbiome. PLoS ONE. 2012;7:e32118.
Ludwig W, Klenk HP. Overview: a phylogenetic backbone and taxonomic framework for procaryotic systematics. In: Garrity GM, Boone DR, Castenholz RW, editors. Bergey’s manual of systematic bacteriology, Vol. 1. 2nd ed. New York: Springer; 2001. p. 49–65.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Murray AE, Kenig F, Fritsen CH, et al. Microbial life at −13 °C in the brine of an ice-sealed Antarctic lake. Proc Natl Acad Sci USA. 2012;109:20626–31.
Nacke H, Will C, Herzog S, et al. Identification of novel lipolytic genes and gene families by screening of metagenomic libraries derived from soil samples of the German biodiversity exploratories. FEMS Microbiol Ecol. 2011;78:188–201.
Nacke H, Engelhaupt M, Brady S, et al. Identification and characterization of novel cellulolytic and hemicellulolytic genes and enzymes derived from German grassland soil metagenomes. Biotechnol Lett. 2012;34:663–75.
Nacke H, Fischer C, Thürmer A, et al. Land use type significantly affects microbial gene transcription in soil. Microb Ecol. 2014;67:919–30.
Namiki T, Hachiya T, Tanaka H, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40:e155.
Overbeek R, Begley T, Butler RM, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.
Parachin NS, Gorwa-Grauslund MF. Isolation of xylose isomerases by sequence- and function-based screening from a soil metagenomic library. Biotechnol Biofuels. 2011;4:9.
Patil KR, Haider P, Pope PB, et al. Taxonomic metagenome sequence assignment with structured output models. Nat Methods. 2011;8:191–2.
Patti GJ, Yanes O, Siuzdak G. Innovation: Metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol. 2012;13:263–9.
Piel J. Approaches to capturing and designing biologically active small molecules produced by uncultured microbes. Annu Rev Microbiol. 2011;65:431–53.
Simon C, Daniel R. Achievements and new knowledge unraveled by metagenomic approaches. Appl Microbiol Biotechnol. 2009;85:265–76.
Simon C, Daniel R. Metagenomic analyses: past and future trends. Appl Environ Microbiol. 2011;77:1153–61.
Song ZQ, Wang FP, Zhi XY, et al. Bacterial and archaeal diversities in Yunnan and Tibetan hot springs, China. Environ Microbiol. 2013;15:1160–75.
Standfuß-Gabisch C, Al-Halbouni D, Hofer B. Characterization of biphenyl dioxygenase sequences and activities encoded by the metagenomes of highly polychlorobiphenyl-contaminated soils. Appl Environ Microbiol. 2012;78:2706–15.
Sulaiman S, Yamato S, Kanaya E, et al. Isolation of a novel cutinase homolog with polyethylene terephthalate-degrading activity from leaf-branch compost by using a metagenomic approach. Appl Environ Microbiol. 2012;78:1556–62.
Wang Y, Chen Y, Zhou Q, et al. A culture-independent approach to unravel uncultured bacteria and functional genes in a complex microbial community. PLoS ONE. 2012;7:e47530.
Weinstock GM. Genomic approaches to studying the human microbiota. Nature. 2012;489:250–6.

Arbuscular Mycorrhizal Fungi Assemblages in Chernozems

Chantal Hamel, Luke D. Bainard and Mulan Dai
Semiarid Prairie Agricultural Research Centre, Agriculture and Agri-Food Canada, Swift Current, SK, Canada

Synonyms

Diversity, arbuscular mycorrhizal fungi, Canadian Prairie, Chernozem, land use.

Definition

Arbuscular mycorrhizal (AM) fungi are obligate plant symbionts that form the phylum Glomeromycota. These fungi contribute to plant nutrient uptake, influence soil biotic and abiotic environments, and provide important ecosystem services. 454-pyrosequencing of amplicons from metagenomic DNA revealed the distribution of AM fungi in major Canadian Chernozem great groups as influenced by land use and crop management.

Introduction

AM fungi form a mycorrhizal symbiosis with the roots of the majority of land plants. They have
coevolved with plants over 450 million years to produce today’s mycorrhiza, which is an organ specialized in the extraction of soil nutrients. As such, AM fungi are seen as a keystone of agricultural sustainability (Garg and Chandel 2010).
World grain, pulse, and biofuel crop production mainly occurs on deep (typically >18–25 cm) dark-colored soils rich in humus (>0.6 % organic carbon) and weatherable minerals, with high levels of base saturation (>50 %) and calcium as the main exchangeable cation (Durán et al. 2011). These soils have similar properties but have different names in other soil classification systems. They are Chernozems in Canada, Ukraine, and Russia; Mollisols in the USA and South America; Isohumosols or Black Soils in China; and Chernozems, Kastanozems, and Phaeozems according to the FAO (Liu et al. 2012). These soils have typically developed under conditions of moisture deficit and grassland vegetation in temperate regions around the globe. They mainly occur in a band across Eastern Europe and Central Asia, in northeast China, from south-central Canada down to the Gulf of Mexico, and over most of Uruguay and part of Argentina.

Tackling the Complexity of Soil Biodiversity

Soil hosts an extremely high level of microbial diversity (Young and Crawford 2004). However, high-throughput next-generation sequencing now allows generation of the massive sequence data required to characterize soil microbial diversity. Amplicon sequencing is preferred over whole genome sequencing for the study of the taxonomic diversity of targeted microbial groups. The 454 FLX and 454 FLX+ technologies allow the sequencing of DNA amplicons up to 400 and 800 bp in length, respectively. Such long sequences contain sufficient taxonomic information for the characterization of microbial communities, and their use conveniently eliminates the need for sequence assembly.
Pyrosequencing of amplicons and bioinformatic analysis of sequence data yield the profile of operational taxonomic units (OTUs) of the target microbial group in a soil sample. The concept of an OTU is useful in soil microbiology as the majority of microbial species are still undescribed. OTUs serve as a proxy for species, making it possible to measure and describe soil microbial diversity. In addition, OTUs can be identified by comparison with known sequences in public databases such as GenBank and MaarjAM. AM fungi have been difficult to study due to their obligate biotrophy and inability to grow in pure culture. However, polymerase chain reaction (PCR) made possible the amplification of DNA from their spores and enabled the molecular characterization and classification of taxa within the Glomeromycota (Schuessler 2013).
Fungal diversity is commonly assessed based on the internal transcribed spacer (ITS) of the ribosomal RNA gene. However, abundant SSU rRNA gene sequences of AM fungi are found in databases due to the traditional use of this region for the Glomeromycota. Several primer sets producing taxonomically informative amplicons short enough for use with first- and next-generation molecular techniques have been used in ecological studies of AM fungi.
The AM fungi have a patchy distribution in soil (Hart and Klironomos 2003). Thus, in order to capture their diversity, multiple samples must be taken at a study site. A composite sample is usually made by pooling and homogenizing all the samples from a sampling site. The distribution of organisms varies with soil depth; thus, sampling depth also matters. The AM fungi are normally found within the rooting depth.

Arbuscular Mycorrhizal Fungi in the Canadian Chernozems

AM fungal communities in the Canadian Prairie Chernozem soils are composed of a few dominant and a large number of subordinate taxa. Less than 6 % of the AM fungal OTUs accounted for half of all AM fungal reads (Dai et al. 2013). Across the Canadian prairie landscape, the Glomeraceae were the most abundant family, accounting for
65 % of all AM fungal OTUs and 54 % of the AM This concurs with the previous observation of
fungal reads. The Claroideoglomeraceae is sec- differences in the seasonal pattern of sporulation
ond in abundance with 25 % of all AM fungal of different AM fungal species (Dhillion and A
OTUs and 39 % of the AM fungal reads. Diversis- Anderson 1993). Seasonal variation of AM
poraceae accounted for 8 % of the OTUs and 7 % fungi in the North American Great Plains was
of the AM fungal reads. Paraglomaceae, also described as the replacement of the fungi of
Gigasporaceae, and Archeosporaceae are poorly the order Helotiales by AM fungi as the season
distributed across the prairie landscape, and unfolds in the North American Great Plains
Gigasporaceae and Archeosporaceae are rare. (Jumpponen 2011).
In other regions, spore counts in grazed The Chernozem great groups are distributed
Kastanozems of Inner Mongolia revealed that along a gradient of precipitation radiating out-
the AM fungal communities resembled those ward from the US border in eastern Alberta, i.e.,
observed in Canadian Chernozems (Tian from the Brown soil zone through Dark Brown
et al. 2009). The Gigasporaceae are susceptible and Black soils up to the Gray soil zone at the
to disturbance and largely absent in croplands, fringe of the boreal forest. The lowest abundance,
which explains their greater abundance in richness, and diversity of AM fungi were
the Kastanozems than in the Canadian Prairie observed in the driest soil zone (Brown Cherno-
Chernozems (Dai et al. 2012, 2013). Poorer zem), which supported a negative impact of
AM fungal diversity is reported from American moisture deficit on these fungi.
spore-based surveys of Mollisols under tallgrass Soil moisture appears to be just one of several
prairie cover where Paraglomaceae and factors that influence the composition of AM
Archeosporaceae were undetected (Eom fungal communities in Chernozem soils. Despite
et al. 2001; Bentivenga and Hetrick 1992). Tallgrass prairies managed with fire were found to be very highly dominated by the Glomeraceae (Bentivenga and Hetrick 1992), underlining the importance of land use in the structuring of AM fungal communities.
AM fungi share root occupation with fungal endophytes belonging to different taxonomic groups. Non-AM fungal endophytes are particularly abundant in temperate grasslands (Porras-Alfaro et al. 2011). This observation triggered the question as to whether AM fungi are at the end of their range in dry areas.
This hypothesis was explored in the Canadian Prairie using primers Glo1/NS31, which produced 18S rDNA amplicons of about 230 bp (Yang et al. 2010). A succession of AM fungi was detected as the soil dried from early to late summer, suggesting that the adaptation of AM fungi to soil moisture availability varies with species. Glomus viscosum, Funneliformis mosseae, and Glomus hoi were dominant in early summer, under conditions of moisture sufficiency, whereas the dominant AM fungal OTUs in late season conditions (i.e., dry soil) belonged to Glomus iranicum and Glomus macrocarpum.
the highest levels of precipitation in the Gray soil zone, the highly productive Black soils harbor the most abundant and diverse AM fungal communities (Dai et al. 2012). Black, Gray, Dark Brown, and Brown soils had an average of 10.2, 7.1, 7.0, and 6.2 AM fungal OTUs, respectively, and the Shannon diversity index of these soil groups follows a similar trend. AM fungal communities in Brown soils are characterized by a reduced relative abundance of Claroideoglomeraceae compared to Black and Dark Brown soils. Other important factors that influenced the abundance of AM fungal OTUs were A horizon thickness and physicochemical properties of the soils, such as bulk density, Zn level, pH, electrical conductivity, and sulfur level.
Soils are classified based on their physical and chemical properties. A soil type represents a living environment inhabited by different AM fungal communities. American Mollisols and Alfisols contain distinct AM fungal spore assemblages (Ji et al. 2012). Similarly, Canadian Chernozems and Podzols and even different great groups of Chernozems contained distinct assemblages of AM fungal rRNA gene sequences (Dai et al. 2013).
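The Shannon diversity index mentioned for the soil groups can be computed from per-OTU read counts. The following is a minimal sketch, assuming the natural-logarithm form of the index; the function name and the read counts are hypothetical and are not data from the cited studies.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over OTUs with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total          # relative abundance of this OTU
            h -= p * math.log(p)   # contribution of this OTU to H'
    return h


# Hypothetical read counts per AM fungal OTU (illustrative only).
even_sample = [25, 25, 25, 25]    # four equally abundant OTUs
skewed_sample = [97, 1, 1, 1]     # same richness, one dominant OTU

print(shannon_index(even_sample))
print(shannon_index(skewed_sample))
```

Both hypothetical samples contain four OTUs, but the evenly distributed sample yields the higher index, which is why OTU richness and Shannon diversity are usually reported together.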
Land use modifies the conditions of the soil environment, and the impact of land use on the structure of AM fungal communities exceeds that of soil type. In the Canadian Prairie, roadsides host a higher level of AM fungal diversity than cropland or natural areas (Dai et al. 2013). Roadsides have higher soil moisture levels than cropland and most natural areas, further indicating that water availability is an important determinant of the abundance and structure of AM fungal communities. Seven percent of the AM fungal OTUs found across the prairie soil zones are unique to croplands, whereas 14 % of the AM fungal OTUs are specific to roadsides. Roadsides and natural areas are dominated by an OTU closely related to Claroideoglomus lamellosum, C. etunicatum, and C. claroideum, which accounts for 14 % and 19 % of all AM fungal reads, respectively. In cropland, an OTU closely related to Funneliformis mosseae accounted for as much as 17 % of all AM fungal reads. The dominance of F. mosseae in croplands of the Canadian prairie is supported by studies based on metagenomic methods (Ma et al. 2005; Sheng et al. 2012; Dai et al. 2012, 2013) and on spore counts (Talukdar and Germida 1993).
Crop management systems also have a strong influence on the composition of AM fungal communities in Chernozem soils. Organic systems have been shown to support more abundant and diverse AM fungal communities compared to conventional systems (Dai et al. 2014). Organic systems also promote greater proliferation of Claroideoglomus and of incertae sedis taxa of the Glomeraceae, currently referred to as Glomus iranicum and Glomus indicum. However, these Glomeraceae incertae sedis are seemingly parasitic as they were associated with reduced crop growth and N and P uptake efficiency.
Summary
Metagenomic studies on the distribution of AM fungi in Chernozems are extremely useful to understand how the living soil provides ecological services and supports the production of food and bioproducts. Brown Chernozems are relatively poor in symbiotic AM fungi and are less hospitable to the Claroideoglomus than other Chernozems, whereas Black Chernozems are rich in AM fungal resources. The influence of soil type on the composition of AM fungal communities is relatively small compared to that of land use type. Funneliformis have a competitive edge and proliferate in conventional crop production systems, whereas Claroideoglomus and Glomeraceae incertae sedis are favored in organic production systems. These Glomeraceae incertae sedis, currently known as the G. iranicum/G. indicum group, are associated with reduced crop productivity and nutrient uptake.
References
Bentivenga SP, Hetrick BAD. The effect of prairie management practices on mycorrhizal symbiosis. Mycologia. 1992;84:522–7.
Dai M, Bainard LD, Hamel C, Gan Y, Lynch D. Impact of land use on arbuscular mycorrhizal fungal communities in rural Canada. Appl Environ Microbiol. 2013;79:6719–29. doi:10.1128/aem.01333-13.
Dai M, Hamel C, Bainard LD, St. Arnaud M, Grant CA, Lupwayi NZ, Malhi SS, Lemke R. Negative and positive contributions of arbuscular mycorrhizal fungal taxa to wheat production and nutrient uptake efficiency in organic and conventional systems in the Canadian prairie. Soil Biol Biochem. 2014;74:156–66.
Dai M, Hamel C, St. Arnaud M, He Y, Grant C, Lupwayi N, Janzen H, Malhi SS, Yang X, Zhou Z. Arbuscular mycorrhizal fungi assemblages in chernozem great groups revealed by massively parallel pyrosequencing. Can J Microbiol. 2012;58:81–92.
Dhillion SS, Anderson RC. Seasonal dynamics of dominant species of arbuscular mycorrhizae in burned and unburned sand prairies. Can J Bot. 1993;71:1625–30.
Durán A, Morrás H, Studdert G, Xiaobing L. Distribution, properties, land use and management of Mollisols in South America. Chin Geogr Sci. 2011;21:511–30.
Eom AH, Wilson GWT, Hartnett DC. Effects of ungulate grazers on arbuscular mycorrhizal symbiosis and fungal community structure in tallgrass prairie. Mycologia. 2001;93:233–42.
Garg N, Chandel S. Arbuscular mycorrhizal networks: process and functions. A review. Agron Sustain Dev. 2010;30:581–99.
Hart MM, Klironomos JN. Diversity of arbuscular mycorrhizal fungi and ecosystem functioning. In: van der Heijden MGA, editor. Mycorrhizal ecology, Ecological studies, vol. 157. Berlin: Springer; 2003. p. 225–42.
Ji B, Bentivenga SP, Casper BB. Comparisons of AM fungal spore communities with the same hosts but different soil chemistries over local and geographic scales. Oecologia. 2012;168:187–97.
Jumpponen A. Analysis of ribosomal RNA indicates seasonal fungal community dynamics in Andropogon gerardii roots. Mycorrhiza. 2011;21:453–64.
Liu X, Lee Burras C, Kravchenko YS, Duran A, Huffman T, Morras H, Studdert G, Zhang X, Cruse RM, Yuan X. Overview of Mollisols in the world: distribution, land use and management. Can J Soil Sci. 2012;92:383–402.
Ma WK, Siciliano SD, Germida JJ. A PCR-DGGE method for detecting arbuscular mycorrhizal fungi in cultivated soils. Soil Biol Biochem. 2005;37:1589–97.
Porras-Alfaro A, Herrera J, Natvig DO, Lipinski K, Sinsabaugh RL. Diversity and distribution of soil fungal communities in a semiarid grassland. Mycologia. 2011;103:10–21.
Schuessler A. Glomeromycota. Taxonomy. 2013. Accessed 6 Nov 2013. http://schussler.userweb.mwn.de/amphylo/
Sheng M, Hamel C, Fernandez MR. Cropping practices modulate the impact of glyphosate on arbuscular mycorrhizal fungi and rhizosphere bacteria in agroecosystems of the semiarid prairie. Can J Microbiol. 2012;58:990–1001.
Talukdar NC, Germida JJ. Occurrence and isolation of vesicular-arbuscular mycorrhizae in cropped field soils of Saskatchewan, Canada. Can J Microbiol. 1993;39:567–75.
Tian H, Gai JP, Zhang JL, Christie P, Li L. Arbuscular mycorrhizal fungi in degraded typical steppe of Inner Mongolia. Land Degrad Dev. 2009;20:41–54.
Yang C, Hamel C, Schellenberg MP, Perez JC, Berbara RL. Diversity and functionality of arbuscular mycorrhizal fungi in three plant communities in semiarid Grasslands National Park, Canada. Microb Ecol. 2010;59:724–33.
Young IM, Crawford JW. Interactions and self-organization in the soil-microbe complex. Science. 2004;304:1634–7.
B
Bacterial Diversity in Tree Canopies of the Atlantic Forest
Marcio R. Lambais¹ and David E. Crowley²
¹Luiz de Queiroz College of Agriculture (ESALQ), University of São Paulo (USP), Piracicaba, SP, Brazil
²Environmental Sciences, University of California, Riverside, Riverside, CA, USA
Synonyms
Bacterial communities in the phyllosphere of the Atlantic forest
Definition
16S rRNA gene profiling is one of the main approaches used for the study of microbial communities that are associated with plants and animals, which are composed mostly of species unable to grow under laboratory conditions. Even though plants harbor an enormous microbial diversity on their various surfaces, the functions of these microorganisms, except for a few that are pathogens or symbionts, are largely unknown, but are speculated to modify plant chemical signals, alter root exudation patterns, and provide protection against pathogens. Understanding of the factors that shape the structure of microbial communities, and the functions of microorganisms that are associated with plants, will likely be essential for establishing conservation strategies for protecting endangered plant species. The large reservoir of microbial diversity on plant surfaces also represents a largely untapped bank of microbial products that may be of interest for pharmaceutical, agricultural, and environmental applications.
Introduction
Plant surfaces in natural and agricultural ecosystems are colonized by a variety of epiphytic microorganisms that have been examined in relation to their diversity, ecology, and genetics using culture-dependent and culture-independent approaches. Among the various surfaces that are presented by plants, the leaf surface, also known as the phyllosphere (Ruinen 1956), is one of the most common habitats for terrestrial microorganisms. The phyllosphere may be colonized by bacterial cells at an average density of 10⁶–10⁷ cells cm⁻² on plants from temperate regions (Lindow and Brandl 2003), and this density may be even higher on tropical plants where dense canopies and a moist shaded environment are conducive to bacterial growth. Considering that the estimated total leaf area of terrestrial plants is approximately 6.4 × 10⁸ km² (Morris and Kinkel 2002), the number of bacterial cells on leaf surfaces globally has been estimated to be as high as 10²⁶ cells. Despite the importance of
plant-microbe interactions in plant disease, almost nothing is known about the indigenous, nonpathogenic bacteria that colonize plant leaf surfaces and their functions in terrestrial ecosystems.
Bacterial Diversity in Tree Canopies of the Atlantic Forest, Fig. 1 Microbial biofilm on the leaf surface of trees of the Atlantic forest. (a) Biofilm with multiple microbial species, based on morphology of cells. (b) Diatom cells embedded in the microbial biofilm on the leaf surface
The Phyllosphere Habitat
Due to the harsh conditions and the highly competitive environment on plant leaves, microorganisms that live in the phyllosphere almost certainly have evolved specific traits that enable them to grow in such environments. Diurnal variations in UV light incidence, temperature, water availability, osmotic conditions, the concentration of reactive oxygen species, as well as the low availability of nutrients make the phyllosphere an extreme environment for microbial growth (Vorholt 2012). All of these factors, together with the specific morphological traits of the leaves, may contribute to the selection of specific microbial populations of bacteria, fungi, archaea, and protozoa that will colonize the phyllosphere and interact at different levels with the plant host. In addition, the microbial populations will interact with each other through metabolic and signaling networks, leading to the self-organization of highly complex communities that have been selected by long-term coevolution with their plant host. In general, the bacterial populations in the phyllosphere occur as multispecies biofilms (Fig. 1) mostly located at the base of trichomes and nutrient-rich locations along the veins and junctions of epidermal cells (Morris et al. 1998; Monier and Lindow 2004). Communication between microbial cells, and between microbial and plant cells, may be an important factor controlling the dynamics of leaf colonization and biofilm growth and development.
One of the major selection factors for microbial colonization of leaf surfaces is the ability to tolerate or grow on the myriad chemical substances that are released from plant leaf tissues and/or produced by other microorganisms. This includes many thousands of secondary metabolites, such as monoterpenes that serve as signal factors and defense compounds, as well as chemical attractants and deterrents for insects, herbivores, and pathogens. However, the specific secondary metabolites driving the structure of bacterial communities in the phyllosphere are unknown.
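As a rough check on the global estimate quoted in the Introduction, the temperate-plant cell densities can be multiplied by the estimated total leaf area. The sketch below uses only the figures given in the text; the variable names are illustrative, and the assumption that a single density applies to all leaf surfaces is a simplification.

```python
# Unit conversion: 1 km = 1e5 cm, so 1 km^2 = 1e10 cm^2.
KM2_TO_CM2 = 1e10

leaf_area_km2 = 6.4e8                   # estimated total leaf area of terrestrial plants
density_low, density_high = 1e6, 1e7    # bacterial cells per cm^2 (temperate plants)

leaf_area_cm2 = leaf_area_km2 * KM2_TO_CM2
cells_low = leaf_area_cm2 * density_low
cells_high = leaf_area_cm2 * density_high

print(f"{cells_low:.1e} to {cells_high:.1e} cells")
```

With these inputs the product falls between about 6.4 × 10²⁴ and 6.4 × 10²⁵ cells, so the upper bound of 10²⁶ cells is consistent with the higher densities expected on tropical plants.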
Bacterial Communities in the Phyllosphere
Many early surveys of phyllosphere communities have relied on descriptions of bacteria that can be cultivated on agar media and isolated as individual colonies. Using various types of growth media, 85 species of culturable microorganisms from 37 genera have been reported in the phyllospheres of rye, olive, sugar beet, and wheat (Ercolani 1991; Legard et al. 1994; Thompson et al. 1993). While this is an impressive number of species, studies using molecular methods have revealed that the actual microbial species richness in the phyllosphere of agricultural plants is much greater than this and suggest that different plant species harbor unique communities that are similar for individuals of the same plant species (Yang et al. 2001). The discovery of high levels of bacterial species richness associated with different agronomic plants has prompted many questions about the true extent of microbial diversity that may be associated with the phyllosphere of different plants in natural ecosystems around the world. It has been speculated that since bacteria can be transported across the globe in dust (Griffin et al. 2002), only a small number of bacterial species may be adapted to grow on leaf surfaces. On the other hand, if each plant species selects for its own microbial community, the microbial species diversity that is associated with all of the different plant species on earth could be enormous. This question can only be answered by systematic surveys of phyllosphere microbial diversity in different ecosystems. Considering the current rate of extinction of plant species, it is especially urgent to begin surveys of phyllosphere microorganisms that are associated with endangered biomes.
Bacterial Community in the Phyllosphere of the Atlantic Forest
Many tropical forests and biodiversity hotspots contain endemic plant species that are preserved only in a few remnant areas. The Atlantic forest of Brazil is an example of a forest with high levels of biodiversity that is struggling to survive. The Atlantic forest used to be the second largest tropical forest in South America and represented 1.3 million km² in the 1500s, when the Portuguese first arrived in Brazil. Today, approximately 7 % of the original Atlantic forest remains, since most of it has been converted to agricultural or urban areas, leaving a patchwork of fragmented remnants. The remnants of the Atlantic forest are considered to be some of the oldest undisturbed forests on the planet, containing approximately 20,000 plant species, of which nearly half are endemic (Tabarelli et al. 2003). Several research projects have been developed in the Atlantic forest as part of the ongoing BIOTA-FAPESP (São Paulo Research Foundation) program, which has been successfully established to examine the biodiversity of the São Paulo State (Brazil).
Different approaches can be used to survey the microbial diversity in the phyllosphere. The first approach is using DNA fingerprinting methods. A low-resolution DNA fingerprinting method referred to as PCR-DGGE (polymerase chain reaction-denaturing gradient gel electrophoresis), through which amplified fragments of highly variable regions of the bacterial 16S rRNA gene are separated by electrophoresis in a denaturing gradient polyacrylamide gel, has been used for studying the bacterial community structures in the phyllosphere of tree species of the Atlantic forest. This methodology generates a distinctive fingerprint that can be used to compare the relative similarities of communities, but does not provide information on the identities of the bacterial species within the communities. To compare the phylogenetic diversity in the phyllosphere and generate diversity indices for different phyllosphere communities, sequencing of specific regions of the bacterial 16S rRNA gene is normally used.
With these combined approaches, it has been shown that the 16S rRNA gene band patterns for the bacterial communities from different tree species of the Atlantic forest are distinct from each other (Lambais et al. 2006). Communities from replicates for different individuals of the same tree species showed some expected variation, but overall are highly similar to each other.
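One common way to compare fingerprint band patterns of this kind is to score each gel lane as a set of band positions and compute a similarity coefficient between lanes. The sketch below uses the Jaccard coefficient as an assumption; the text does not state which similarity measure was applied, and the band positions and sample names are hypothetical.

```python
def jaccard_similarity(bands_a, bands_b):
    """Shared bands divided by all bands observed in either lane."""
    a, b = set(bands_a), set(bands_b)
    if not a and not b:
        return 1.0  # two empty lanes are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical band positions (arbitrary gel coordinates), illustrative only.
tree1_leaf = {12, 18, 25, 31, 40}
tree1_replicate = {12, 18, 25, 31, 44}   # another individual of the same species
tree2_leaf = {5, 18, 22, 37, 51}         # an individual of a different species

print(jaccard_similarity(tree1_leaf, tree1_replicate))
print(jaccard_similarity(tree1_leaf, tree2_leaf))
```

In this made-up example the two lanes from the same species share four of six bands, whereas the lanes from different species share one of nine, which mirrors the pattern of high within-species and low between-species similarity described above.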
The similarities between the leaf bacterial communities within and between species were further measured statistically and showed that the trees could be segregated into groups according to tree species, family, and order, suggesting a coevolution between trees and microbial populations associated with the phyllosphere (Lambais et al. data not published). Evidence of coevolution of microbial populations associated with the bark (dermosphere) and rhizosphere of trees of the Atlantic forest also has been observed, suggesting that plants coevolved with specific microbiomes (Lambais et al. data not published). An estimate of the bacterial species richness associated with the phyllosphere of trees in the Atlantic forest suggests the existence of 2–13 million undescribed bacterial species that colonize the collective phyllosphere of the Atlantic forest (Lambais et al. 2006). Interestingly, studies of the phyllosphere of different individuals of the same tree species in the Atlantic forest over a range of distances and at different times show that the similarities between bacterial community structures in the phyllosphere of the same plant species decrease with the increasing distance between individual trees, even though they still share high levels of similarity (Lambais et al. data not published). Over larger scales, such as when the bacterial communities of the individuals of the same plant species are separated by hundreds of kilometers, significant differences in community structure are observed. These data suggest that the bacterial diversity in the phyllosphere of plants of the Atlantic forest may be even higher than the predicted 2–13 million species estimate that does not take into account beta diversity. While still in an early phase, research aimed at measurements of beta diversity includes a survey of Tamarix trees in Mediterranean and Dead Sea regions in Israel and two locations in the USA (Finkel et al. 2011). These studies suggest that besides the plant genetic component driving the bacterial community structure in the phyllosphere, environmental conditions associated with particular geographical locations are also important. On the other hand, the high levels of similarity of the bacterial communities in the phyllosphere of Pinus ponderosa over transcontinental distances (Redford et al. 2010) suggest a strong genetic component in the regulation of the phyllosphere-associated microbiome.
The majority of bacterial OTUs in the phyllosphere of the trees of the Atlantic forest have been assigned to the phylum Proteobacteria. Based on a survey of several tree species in the Atlantic forest, including Ocotea dispersa, Ocotea teleiandra, Mollinedia schottiana, Mollinedia uleana, Eugenia cuprea, Eugenia melanogyna, and Tabebuia serratifolia, it has been shown that, in general, approximately half of the bacteria in the phyllosphere are phylogenetically related to Gammaproteobacteria, whereas 20 % are related to Alphaproteobacteria and 5 % to Flavobacteria, even though interspecific variation may occur (Lambais et al. data not published). For instance, in the phyllosphere of Ocotea teleiandra, a high frequency of Alphaproteobacteria and a low frequency of Gammaproteobacteria have been detected, in contrast to other tree species.
Altogether, these results show that every tree species that has been examined in the Atlantic forest contains its own unique bacterial community and that spatially separated individuals of the same tree species have similar bacterial communities, within the same environment (forest physiognomy). The variations in bacterial community structures in the phyllosphere that were observed using the PCR-DGGE and sequencing approaches to compare similarities among individuals indicate that the community compositions may vary on different leaves. This may correspond with different leaf ages, location in the canopy, light incidence, and microclimate conditions that influence the leaf environment and types of chemical substances that are secreted by the plant leaves. The bacteria may also interact with various fungi and algae that colonize the leaf surfaces and change the chemical and physical environment of the leaf habitat. In future studies, it will be necessary to examine the microbial communities on leaf surfaces at the microsite scale to determine changes in species composition and the ecology of different habitats on the leaf surface, for example, on the adaxial
and abaxial leaf surfaces or within biofilms and microcolonies at distinct physical locations on the leaf surface.
Drivers of Community Structure in the Phyllosphere
The development of different bacterial communities in the phyllosphere of different tree species demonstrates the strong effect of leaf surface environment as a selection factor. The initial inoculation of leaves of different trees very likely begins with the growth of opportunistic microorganisms that are transported in dust, by insects, or that are splashed from adjacent trees by rain. Inheriting a minimal microbiome through the seeds may also be a possibility. Further selection then occurs depending on differences in the types of carbon substrates that are available for growth, as well as various physical and environmental factors and interactions within the microbial community. The primary carbon substrates that are used for microbial growth include carbohydrates, amino acids, and organic acids. The composition and amounts of these substances may vary for different plant species, but may also vary over time depending on leaf age, insect damage, and rainfall, for instance. Another potentially important selective factor is the production of different types and quantities of monoterpenes and other volatile substances that are released from the leaf surfaces. These substances may be both toxic to some microorganisms and used as growth substrates by others. Phytochemistry research has shown that tree species have species-specific differences in their biochemical signatures for volatile molecules (Arey et al. 1995). If terpenes act as selective substances, certain types of bacteria may be predicted to occur in relation to the biochemical signatures of volatile organic compounds released by the leaves. Very little work has been conducted on this research topic, but bacteria are known to contain enzymes that convert terpenes to derivative substances. In this manner, the phyllosphere bacteria may influence chemical signaling to insects and other microorganisms or between plants. Terpenes and other plant secondary metabolites produced in plant leaves are also important feedstocks for various biochemicals that are used in the industry and for pharmacology. Future studies should investigate the genomes and genes encoding enzymes in the phyllosphere that may have broad application for industrial biotechnology, as in the work described by Delmotte et al. (2009), which used proteogenomics to study the microbial community associated with the phyllosphere of soybean, clover, and Arabidopsis.
Conclusion
Recent studies have provided only a glimpse into the microbial diversity that is associated with the tree canopies in the Atlantic forest, and there are many new questions that arise from this research. For example, to what degree do soil, nutritional, and other environmental factors affect the composition and structure of microbial communities in the phyllosphere? What is the diversity of fungi and Archaea on the plant leaf surfaces, and how do these microorganisms interact? Future research should also examine the functional aspects of phyllosphere microbial communities and the interactions that occur between phyllosphere bacteria and their host plants using metagenomics, metaproteomics, and metabolomics. As we begin to survey these bacterial communities through systematic study of different plant species, there will be exciting opportunities for studies of the metabolic capabilities and ecological functions of phyllosphere microorganisms in terrestrial ecosystems.
Summary
Each plant species is able to select its own bacterial community, and probably its own microbiome, which may be affected by plant genomic components and the environment. Altogether, the phyllosphere of plant species of the Atlantic forest may harbor several million species of bacteria that remain to be described. The roles
of the microbial communities of the phyllosphere in forest ecology are not yet known, but are likely to include chemical signaling, nitrogen fixation, and plant protection, among other functions. This immense microbial diversity may also provide new biomolecules of interest for pharmaceutical, agricultural, and environmental applications.
Cross-References
▶ New Computational Methodologies to Understand Microbial Diversity
References
Arey J, Crowley DE, Crowley M, Resketo M, Lester J. Hydrocarbon emissions from natural vegetation in California South-coast-air-basin. Atmos Environ. 1995;29:2977–88.
Delmotte N, Knief C, Chaffron S, et al. Community proteogenomics reveals insights into the physiology of phyllosphere bacteria. Proc Natl Acad Sci USA. 2009;106:16428–33.
Ercolani GL. Distribution of epiphytic bacteria on olive leaves and the influence of leaf age and sampling time. Microb Ecol. 1991;21:35–48.
Finkel OM, Burch AY, Lindow SE, Post AF, Belkin S. Geographic allocation determines the population structure in phyllosphere microbial communities of a salt-excreting desert tree. Appl Environ Microbiol. 2011;77:7647–55.
Griffin DW, Kellogg CA, Garrison VH, Shinn EA. The global transport of dust – an intercontinental river of dust, microorganisms and toxic chemicals flows through the Earth’s atmosphere. Amer Sci. 2002;90:228–35.
Lambais MR, Crowley DE, Cury JC, Büll RC, Rodrigues RR. Bacterial diversity in tree canopies of the Atlantic forest. Science. 2006;312:1917.
Legard DE, McQuilken MP, Whipps JM, Fenlon JS, Fermor TR, Thompson IP, Bailey MJ, Lynch JM. Studies of seasonal changes in the microbial populations on the phyllosphere of spring wheat as a prelude to the release of a genetically modified microorganism. Agric Ecosyst Environ. 1994;50:87–101.
Lindow SE, Brandl MT. Microbiology of the phyllosphere. Appl Environ Microbiol. 2003;69:1875–83.
Monier JM, Lindow SE. Frequency, size, and localization of bacterial aggregates on bean leaf surfaces. Appl Environ Microbiol. 2004;70:346–55.
Morris CE, Kinkel LL. Fifty years of phyllosphere microbiology: significant contributions to research in related fields. In: Lindow SE, Hecht-Poinar EI, Elliott V, editors. Phyllosphere microbiology. St Paul: APS Press; 2002. p. 365–75.
Morris CE, Monier JM, Jacques MA. A technique to quantify the population size and composition of the biofilm component in communities of bacteria in the phyllosphere. Appl Environ Microbiol. 1998;64:4789–95.
Redford AJ, Bowers RM, Knight R, Linhart Y, Fierer N. The ecology of the phyllosphere: geographic and phylogenetic variability in the distribution of bacteria on tree leaves. Environ Microbiol. 2010;12:2885–93.
Ruinen J. Occurrence of Beijerinckia species in the phyllosphere. Nature. 1956;177:220–1.
Tabarelli M, Pinto LP, Silva JMC, Costa CMR. Endangered species and conservation planning. In: Galindo-Leal C, Câmara IG, editors. The Atlantic forest of South America: biodiversity, status, threats and outlooks. Washington, DC: Island Press; 2003. p. 86–94.
Thompson IP, Bailey MJ, Fenlon JS, Fermor TR, Lilley AK, Lynch JM, McCormack PJ, McQuilken MP, Purdy KJ. Quantitative and qualitative seasonal changes in the microbial community from the phyllosphere of sugar beet (Beta vulgaris). Plant Soil. 1993;150:177–91.
Vorholt JA. Microbial life in the phyllosphere. Nat Rev Microbiol. 2012;10:828–40.
Yang CH, Crowley DE, Borneman J, Keen NT. Microbial phyllosphere populations are more complex than previously realized. Proc Natl Acad Sci USA. 2001;98:3889–94.
Bacteriocin Mining in Metagenomes
Orla O’Sullivan¹,², Colin Hill³, Paul Ross¹,² and Paul Cotter¹,²
¹Teagasc Food Research Centre, Moorepark, Fermoy, Co. Cork, Ireland
²Alimentary Pharmabiotic Centre, University College, Cork, Ireland
³Alimentary Pharmabiotic Centre, Department of Microbiology, University College, Cork, Ireland
Definition
Bacteriocins are heat-stable ribosomally synthesized peptides produced by one bacterium which are active against other bacteria and against which the producer has a specific immunity mechanism.
Introduction
Bacteriocins are ribosomally synthesized antimicrobial peptides that are produced by many bacteria and which kill or inhibit the growth of other bacteria. Bacteriocin producers are protected as a consequence of dedicated immunity (self-protective) systems (Cotter et al. 2005). Bacteriocins are of both academic and commercial interest, with several in use as food preservatives or as the active agent in clinical or veterinary antimicrobials. It is not surprising that there is significant interest in the identification and characterization of new bacteriocin gene clusters. The growing volume of metagenomic sequence data is an important resource which can be mined for the in silico discovery of novel bacteriocins.
A Background to Bacteriocins
Bacteriocins were first described in 1925 and since then bacteriocin producers have been identified in a myriad of different environments, bearing out a prediction by Klaenhammer in 1988 that bacteriocin production may be almost ubiquitous (Klaenhammer 1988). The spectrum of activity of these peptides can be narrow (lethal to bacteria in the same or closely related species) or broad (lethal to bacteria in other genera). Many bacteriocins function by depolarizing the cell membrane or through the inhibition of cell wall synthesis (Cotter et al. 2005). There are a number of different classification schemes. One approach, originally employed to classify bacteriocins produced by Gram-positive bacteria, has been to divide bacteriocins into two major classes: those which are modified (Class I) and those which are unmodified (Class II) (Cotter et al. 2005; Rea et al. 2011) (Table 1). This approach to classification excludes larger proteins, such as the bacteriolysins and the colicin-type antimicrobials, which as a consequence of their larger size may be regarded as representing different classes of antimicrobials.
Further classification of the Class I and II peptides is possible, for example, Class I bacteriocins from Gram-positive bacteria can be divided into Class Ia, Class Ib, and Class Ic. Class Ia, the lantibiotics, harbor the unusual posttranslationally modified residues lanthionine (Lan) and/or β-methyllanthionine (meLan); these are products of the interaction of cysteines with enzymatically dehydrated serines (dehydroalanine; Dha) and threonines (dehydrobutyrine; Dhb). Lantibiotics can be subdivided according to the enzyme responsible for lanthionine formation; subclass I use LanBC, subclass II use LanM, and subclass III use RamC-like, while subclass IV are modified by LanL enzymes. It should be noted, however, that subclass III and IV peptides identified to date have not been shown to possess antimicrobial activity and thus are referred to as lantipeptides. Class Ib, the labyrinthopeptins, have a labyrinthine structure and contain the posttranslationally modified amino acid labionin, formed through a series of serine phosphorylations, dehydrations of phosphoserines to didehydroalanines, and cyclizations. Class Ic, the sactibiotics, are cyclic peptides, generated from the posttranslation formation of intramolecular cross-linkages between the α-carbon and sulfur of amino
Bacteriocin Mining in Metagenomes, Table 1 Classification scheme for bacteriocins (modified from Rea et al. 2011)
Class    | Divisions                                                | Further subclasses                   | Examples
Class I  | Ia: Lantibiotics                                         | Subclasses I–IV                      | Lacticin 3147, nisin A, subtilin
         | Ib: Labyrinthopeptins                                    |                                      | Labyrinthopeptins A1 and A2
         | Ic: Sactibiotics                                         | Single- and two-peptide bacteriocins | Thuricin CD, subtilosin A
Class II | IIa: Pediocin-like                                       | Subclasses I–IV                      | Pediocin PA-1, mundticin
         | IIb: Two-peptide bacteriocins                            | Subclasses A and B                   | Salivaricin P, lactococcin G
         | IIc: Circular bacteriocins                               | Subclasses 1 and 2                   | Acidocin B, gassericin A
         | IId: Linear non-pediocin-like single-peptide bacteriocins |                                      | Lactococcin A
acids within the peptide. Class Ic bacteriocins can be further subdivided depending on whether
they are single- or two-peptide bacteriocins.

Class II bacteriocins can be divided into Class IIa, Class IIb, Class IIc, and Class IId.
Class IIa, pediocin-like bacteriocins, are typically highly active against the food pathogen
Listeria monocytogenes and contain a conserved hydrophilic, cationic region in the N-terminal
region termed the “pediocin box.” They can be subdivided into subclasses I–IV based on sequence
homology. Class IIb are two-peptide unmodified bacteriocins. Both peptides are required for
activity and both possess a conserved GxxG motif. These are further subdivided into two
subclasses based on sequence homology. Class IIc are cyclic peptides, resulting from the
covalent linkage of their N- and C-termini, and tend to contain numerous α-helical structures.
Class IIc can also be further divided into two subclasses based on sequence identity. Finally,
the Class IId, unmodified linear, non-pediocin-like bacteriocins are in essence bacteriocins
which do not fit into any of the other subclasses. (See Fig. 1 for an example of bacteriocin
structure.)

In addition to the requirement for a precursor bacteriocin peptide, bacteriocin activity is also
dependent on the production of several other proteins encoded within the corresponding
bacteriocin gene cluster. This gene cluster may encode proteins responsible for bacteriocin
transport, processing, regulation, immunity, and, in the case of the Class I bacteriocins,
peptide modification enzymes. The highly conserved accessory proteins encoded by bacteriocin
gene clusters can serve as useful driver sequences for downstream analysis as the bacteriocin
peptides themselves can be very diverse in their primary sequences.

Application of Bacteriocins
Bacteriocins have proved useful as antimicrobial compounds in the food and health industries.
In the food industry, bacteriocins such as nisin and pediocin PA-1 can improve food safety and
food quality. Bacteriocins produced by lactic acid bacteria (LAB) are of particular interest to
the food industry since LAB have been awarded GRAS status (Generally Recognized As Safe) and can
therefore be used in food preparations (Cotter et al. 2005). More recently, the contribution of
bacteriocin production to the efficacy of certain probiotics has been recognized, suggesting
another route via which bacteriocins can be of value within the food for health arena (Dobson
et al. 2012). In the health industry, the use of bacteriocins as an alternative to antibiotics
has long been mooted (Piper et al. 2009). The potential benefits of employing bacteriocins in
this arena have been particularly apparent in recent times as a consequence of an appreciation
of the “collateral damage” which antibiotics can inflict on the commensal microbiota.
Narrow-spectrum bacteriocins may well address this issue in view of their target specificity. In
the area of veterinary medicine, bacteriocins have proven useful in the control of mastitis in
cattle and as an additive to animal feed with a view to improving general animal health
(Abriouel et al. 2011). It has also been suggested that bacteriocins or bacteriocin-producing
microbes could be employed as biocontrol agents which, for example, could be added to soil to
control plant pathogens (Abriouel et al. 2011).

Identification of Novel Bacteriocin Gene Clusters
Traditionally, the identification of novel bacteriocin gene clusters has involved using classical
Bacteriocin Mining in Metagenomes, Fig. 1 Structure of nisin A, the prototypical Gram-positive modified bacteriocin (modified residues in gray)
microbiology to screen large collections of strains, using a culture-based assessment of their
ability to produce novel antimicrobials (Fig. 2).

Bacteriocin Mining in Metagenomes, Fig. 2 Representative agar plate depicting the outcome of a culture-based screen for bacteriocin activity

This is then followed by the subsequent identification of the responsible genes through
subcloning, mutagenesis, reverse genetics, or, more recently, sequencing of the corresponding
genome. However, in spite of constant improvements in culturing techniques, it is still
estimated that just 10–50% of bacteria are culturable. Fortunately, metagenomic DNA sequencing
provides an alternative with respect to identifying novel bacteriocin gene clusters by
facilitating an unbiased characterization of entire microbial communities. In particular, recent
improvements in sequencing technologies have resulted in a massive increase in sequence data,
leading to the development of valuable public databases and annotation pipelines
(http://camera.calit2.net/, http://img.jgi.doe.gov/, http://metagenomics.anl.gov/). The
generation of vast quantities of DNA sequence data from metagenomics-based projects from varying
environments across the globe represents a considerable resource from which new bacteriocin gene
clusters can be identified. There are a number of ways in which this information can be
harnessed. One example is BACTIBASE, a bacteriocin database and suite of analysis tools
established to archive known bacteriocin sequences and enhance the discovery of bacteriocins in
genomic data (Hammami et al. 2010). The current release of BACTIBASE contains 177 bacteriocin
sequences against which one can test the homology of a query bacteriocin sequence, perform
sequence alignments, and predict peptide structure (Hammami et al. 2010). Searches are limited
to the known sequences already in the database, and the usefulness of the tool is also affected
by the fact that bacteriocin peptides themselves often share little or no homology. A specific
bacteriocin mining tool, BAGEL2 (BActeriocin GEnome Location), was established to search for
novel bacteriocin sequences in genomic data (de Jong et al. 2010). BAGEL2 has a built-in
database of bacteriocin and bacteriocin-related sequences and, in addition to genes encoding the
structural bacteriocin peptide, uses genes involved in bacteriocin biosynthesis, regulation,
export, and immunity to reveal related genes in novel clusters. Additionally, searches can be
implemented against finished genome sequences or against novel genomes uploaded by the user. The
fact that genes involved in the modification of Class I bacteriocins, such as those generically
named lanM, lanB, and lanC or those encoding radical SAMs associated with sactibiotic
production, are frequently more highly conserved than the structural genes themselves has also
been utilized in recent years to identify Class I gene clusters in genomic and metagenomic
databases. During this period, targeted searches for bacteriocins in genomic data have resulted
in the discovery of several novel active bacteriocins, such as lichenicidin (Begley et al. 2009)
and a Streptococcus-associated lantibiotic (Majchrzykiewicz et al. 2010), among others. This
strategy parallels similar genome-based approaches which have identified gene clusters encoding
other ribosomally synthesized natural products (Velásquez and van der Donk 2011). In addition to
the identification of novel bacteriocins, the screening of genomes using the LtnM1 protein of
lacticin 3147 (Begley et al. 2009; O’Sullivan et al. 2011) or the radical SAM proteins of
thuricin CD (Murphy et al. 2011) as drivers has also revealed several
potential bacteriocin-encoding clusters. It is anticipated that many of these will be the focus
of further investigation in the coming years.

Identification of Bacteriocins in Metagenomes
The identification of bacteriocins within metagenomic DNA can be performed via laboratory-based
or in silico-based approaches. A recent example of the former involved a PCR-based screen to
establish the bacteriocin-producing potential within metagenomic DNA sourced from 40 Polish
cheeses (Więckowicz et al. 2011). In this case, PCR primers were designed to exploit conserved
sequence motifs within the four anti-listerial bacteriocin peptides, divercin V41, enterocin P,
mesentericin Y105, and bacteriocin 423. It was established that metagenomic DNA from each one of
the 40 cheeses yielded a PCR product, thereby highlighting the bacteriocin-producing potential
of the cheese microbiota (Więckowicz et al. 2011). While laboratory-based screens have
considerable potential, the vast information present in metagenomic DNA databases suggests that
in silico screening for bacteriocin gene clusters can be a more successful approach.

Recently, two studies have carried out basic homology searches against metagenomes to identify
clusters containing lanM genes and potentially encoding novel type II lantibiotics (O’Sullivan
et al. 2011), or those possessing trnC-/trnD-like genes which potentially encode novel
sactibiotics (Murphy et al. 2011). In both studies, a simple BLAST search against the CAMERA
(http://camera.calit2.net) metagenomic databases was implemented and homologous proteins were
identified. The lanM search revealed homologs in 11 metagenomes (Table 2). Three of these came
from an Indian Ocean metagenome, four from hypersaline lagoon metagenomes from the Galapagos
Islands, and one each from a coastal sea water metagenome from the Gulf of Mexico, a farm soil
metagenome, a whale fall carcass rib bone metagenome, and a coral reef metagenome (Table 2).
Further phylogenetic analysis with the 11 homologs and previously identified bacteriocin-like
gene clusters revealed that the homologs from the metagenomes were related to other lanM genes
from a wide variety of different microorganisms, thus highlighting the diverse nature of the
metagenome-associated genes (O’Sullivan et al. 2011). The search that used trnC-/trnD-like genes
as driver sequences yielded 365 TrnC homologs and 151 TrnD homologs in metagenomes from
environments as diverse as Waseca soil, a coral reef, and the ocean surrounding the Galapagos
Islands (Murphy et al. 2011), again highlighting the presence of bacteriocin-associated genes in
metagenomic data.

Despite the valuable insights provided by these analyses, they failed to identify complete
bacteriocin gene clusters. A more suitable analysis tool would allow a homology search with
multiple genes (or even an operon) and therefore enhance the possibility of identifying a true
Bacteriocin Mining in Metagenomes, Table 2 LanM homologs in metagenomic databases (from O’Sullivan et al. 2011)
Protein function             | Metagenome             | Location                     | % identity | E-value
Lantibiotic-modifying enzyme | Sea water              | Indian Ocean                 | 29         | 1.07E-16
Hypothetical protein         | Soil sample            | Waseca County, USA           | 25         | 2.85E-16
Lantibiotic-modifying enzyme | Whale fall rib carcass | Santa Cruz Basin, USA        | 27         | 4.65E-12
Lantibiotic-modifying enzyme | Hypersaline lagoons    | Galapagos Islands            | 30         | 1.83E-08
Hypothetical protein         | Coastal sea water      | Gulf of Mexico               | 25         | 2.39E-08
Hypothetical protein         | Hypersaline lagoons    | Galapagos Islands            | 24         | 1.55E-07
Hypothetical protein         | Open ocean             | Indian Ocean                 | 36         | 4.51E-07
Hypothetical protein         | Coral reef             | Cook’s Bay, French Polynesia | 24         | 5.89E-07
Hypothetical protein         | Open ocean             | Indian Ocean                 | 29         | 1.71E-06
Mersacidin-modifying enzyme  | Open ocean             | Galapagos Islands            | 25         | 3.81E-06
Hypothetical protein         | Hypersaline lagoons    | Galapagos Islands            | 24         | 4.94E-06
bacteriocin cluster. Existing tools for metagenome analysis are in two formats: functional
keyword search engines, such as those available through the MG-RAST (Glass et al. 2010) and
IMG/M (Markowitz et al. 2008) platforms, and homology search engines, such as JCoast (Richter
et al. 2008), MetaMine (Bohnebeck et al. 2008), and CAMERA (Sun et al. 2011). Functional
searches rely heavily on searching among annotated genes. This is inherently reliant on accurate
annotation, and due to the small size and heterogeneous nature of bacteriocin peptides, the
corresponding genes are often overlooked or mis-annotated. Homology search tools such as CAMERA
and JCoast are single gene search-driven, although JCoast does have a graphical user interface
that allows visualization of the surrounding gene neighborhood, which would prove particularly
useful for screening for the presence of other genes in the bacteriocin operons (Richter
et al. 2008). MetaMine allows homology searches with “gene neighborhoods”; again this would
prove particularly useful for bacteriocin clusters. MetaMine searches are, however, restricted
to marine metagenomic databases (Bohnebeck et al. 2008). It should also be noted that, as a
consequence of the evolution of DNA sequencing technologies, longer stretches of contiguous
metagenomic DNA will become available which will further enhance our ability to identify
complete bacteriocin gene clusters. Despite this, it must also be noted that the presence of
bacteriocin homologs alone is not an indicator of function. Clearly, in silico analysis is not
sufficient to determine the functional presence of a bacteriocin. However, the likelihood that
even a proportion of bacteriocin homologs will be deemed functional is an intriguing prospect.

Harnessing Bacteriocin Gene Clusters
While the in silico analysis of newly identified bacteriocin gene clusters within metagenomic
DNA can be of great value from a fundamental perspective, the harnessing of the antimicrobial
potential of these clusters will undoubtedly become a priority in the future. In the majority of
instances, the specific strain from which the fragment of metagenomic DNA has originated will
not be available, or may not be culturable, and other strategies will be required. The
genetics-based options available can be divided into in vivo and in vitro approaches. Regardless
of the approach, specific genes within the cluster will need to be regenerated through DNA
synthesis technology. In the case of in vivo harnessing, the DNA fragment(s) will be cloned and
expressed heterologously, using approaches such as those employed to facilitate the production
of a Streptococcus-associated lantibiotic cluster by Lactococcus lactis (Majchrzykiewicz
et al. 2010) and by Escherichia coli. Alternatively, when dealing with modified bacteriocins,
one can clone and express individual genes heterologously but then purify them to facilitate the
in vitro reconstitution of biosynthesis using the corresponding modification proteins or related
enzymes originating from other sources (Knerr and van der Donk 2012). Finally, an alternative
non-genetics-based approach, which is available when gene clusters predicted to encode
unmodified residues are identified, is to employ peptide synthesis with a view to generating a
synthetic equivalent of the natural antimicrobial. It is anticipated that these various options
will be widely used in the years to come.

Summary
In order to effectively mine metagenomes for bacteriocins, accurate annotation of the datasets
is essential. As the volume of data grows, it is anticipated that the precision of annotation
tools will improve in tandem. The number of bacteriocin-associated gene homologs present in
diverse metagenomic environments suggests the presence of multiple corresponding gene clusters.
The further expansion of metagenomic DNA databases will undoubtedly further increase our
appreciation of just how widespread, and diverse, these clusters are. As the commercial
application of bacteriocins becomes more common (for a review, see Cotter et al. 2005), we can
anticipate that we will reap the benefits of in silico screening and harnessing of this untapped
reservoir of novel bacteriocins.
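
The multi-gene search proposed under "Identification of Bacteriocins in Metagenomes" can be
approximated after the fact by checking which contigs are hit by more than one driver sequence.
The following is a minimal Python sketch under stated assumptions, not the method used in the
cited studies: it assumes the hits are available in the default 12-column tabular BLAST+ format
(-outfmt 6), that subject identifiers correspond to metagenomic contigs, and that an E-value
cutoff of 1e-5 is acceptable. The function name candidate_contigs and both thresholds are
hypothetical choices made for illustration.

```python
import csv
from collections import defaultdict

# Column order of BLAST+ tabular output (-outfmt 6) with its default fields.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def candidate_contigs(handle, max_evalue=1e-5, min_drivers=2):
    """Return contigs hit by at least `min_drivers` distinct driver queries.

    Assumption: query IDs name the driver proteins (e.g. TrnC, TrnD) and
    subject IDs name metagenomic contigs. Both cutoffs are illustrative.
    """
    drivers_by_contig = defaultdict(set)
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        if float(row["evalue"]) <= max_evalue:
            drivers_by_contig[row["sseqid"]].add(row["qseqid"])
    return {contig: sorted(drivers)
            for contig, drivers in drivers_by_contig.items()
            if len(drivers) >= min_drivers}
```

Contigs returned by such a filter would only be candidates for closer inspection; as noted
above, the presence of homologs alone is not an indicator of function.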
Cross-References

▶ Ab Initio Gene Identification in Metagenomic Sequences
▶ Computational Approaches for Metagenomic Datasets

References

Abriouel H, Franz CMAP, Omar NB, Gálvez A. Diversity and applications of Bacillus bacteriocins. FEMS Microbiol Rev. 2011;35(1):201–32. doi:10.1111/j.1574-6976.2010.00244.x.
Begley M, Cotter PD, Hill C, Ross RP. Identification of a novel two-peptide lantibiotic, lichenicidin, following rational genome mining for LanM proteins. Appl Environ Microbiol. 2009;75(17):5451–60. doi:10.1128/aem.00730-09.
Bohnebeck U, Lombardot T, Kottmann R, Glöckner FO. MetaMine – a tool to detect and analyse gene patterns in their environmental context. BMC Bioinforma. 2008;9:459. doi:10.1186/1471-2105-9-459.
Cotter PD, Hill C, Ross RP. Bacteriocins: developing innate immunity for food. Nat Rev Micro. 2005;3(10):777–88.
de Jong A, van Heel AJ, Kok J, Kuipers OP. BAGEL2: mining for bacteriocins in genomic data. Nucleic Acids Res. 2010;38 suppl 2:W647–51. doi:10.1093/nar/gkq365.
Dobson A, Cotter PD, Ross RP, Hill C. Bacteriocin production: a probiotic trait? Appl Environ Microbiol. 2012;78(1):1–6. doi:10.1128/aem.05576-11.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc. 2010(1):pdb.prot5368. doi:10.1101/pdb.prot5368.
Hammami R, Zouhir A, Le Lay C, Ben Hamida J, Fliss I. BACTIBASE second release: a database and tool platform for bacteriocin characterization. BMC Microbiol. 2010;10:22. doi:10.1186/1471-2180-10-22.
Klaenhammer TR. Bacteriocins of lactic acid bacteria. Biochimie. 1988;70(3):337–49. 0300-9084(88)90206-4 [pii].
Knerr PJ, van der Donk WA. Discovery, biosynthesis, and engineering of lantipeptides. Annu Rev Biochem. 2012. doi:10.1146/annurev-biochem-060110-113521.
Majchrzykiewicz JA, Lubelski J, Moll GN, Kuipers A, Bijlsma JJ, Kuipers OP, et al. Production of a class II two-component lantibiotic of Streptococcus pneumoniae using the class I nisin synthetic machinery and leader sequence. Antimicrob Agents Chemother. 2010;54(4):1498–505. doi:10.1128/AAC.00883-09.
Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008;36 suppl 1:D534–8. doi:10.1093/nar/gkm869.
Murphy K, O’Sullivan O, Rea MC, Cotter PD, Ross RP, Hill C. Genome mining for radical SAM protein determinants reveals multiple sactibiotic-like gene clusters. PLoS One. 2011;6(7):e20852. doi:10.1371/journal.pone.0020852.
O’Sullivan O, Begley M, Ross R, Cotter P, Hill C. Further identification of novel lantibiotic operons using LanM-based genome mining. Probiotics Antimicrob Protein. 2011;3(1):27–40. doi:10.1007/s12602-011-9062-y.
Piper C, Cotter PD, Ross RP, Hill C. Discovery of medically significant lantibiotics. Curr Drug Discov Technol. 2009;6(1):1–18.
Rea M, Cotter P, Hill C, Ross R. Classification of bacteriocins from gram-positive bacteria. In: Drider D, Rebuffat S, editors. Prokaryotic antimicrobial peptides - from genes to applications. New York: Springer; 2011. p. 29.
Richter M, Lombardot T, Kostadinov I, Kottmann R, Duhaime MB, Peplies J, et al. JCoast - a biologist-centric software tool for data mining and comparison of prokaryotic (meta)genomes. BMC Bioinforma. 2008;9:177. doi:10.1186/1471-2105-9-177.
Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39(Database issue):D546–51. doi:10.1093/nar/gkq1102.
Velásquez JE, van der Donk WA. Genome mining for ribosomally synthesized natural products. Curr Opin Chem Biol. 2011;15(1):11–21.
Więckowicz M, Schmidt M, Sip A, Grajek W. Development of a PCR-based assay for rapid detection of class IIa bacteriocin genes. Lett Appl Microbiol. 2011;52(3):281–9. doi:10.1111/j.1472-765X.2010.02999.x.

Binning Sequences Using Very Sparse Labels Within a Metagenome

Ching-Hung Tseng (1), Chon-Kit Kenneth Chan (2), Arthur L. Hsu (2), Saman K. Halgamuge (2) and Sen-Lin Tang (3)
(1) Bioinformatics Program, Taiwan International Graduate Program, Biodiversity Research Center, Institute of Information Science, Academia Sinica, Taipei, Taiwan
(2) Department of Mechanical Engineering, The University of Melbourne, Melbourne, VIC, Australia
(3) Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan

Synonyms

Binning using seeded GSOM
Definition

Binning is the process of categorizing sequences into different groups based on compositional
features, sequence similarity, or both.

Introduction

As metagenomes are typically composed of sequences from various species, how these sequences are
categorized into groups can radically affect the accuracy and sensitivity of downstream
analyses. Thus, sequence binning is a critical early step in a metagenomic analysis pipeline.
Several binning methods employing different strategies have been proposed. For example, BLAST
homology search helps to identify sequences of related species; kmer (Sandberg et al. 2001),
self-organizing map (SOM) (Abe et al. 2003), and TETRA (Teeling et al. 2004b) cluster sequences
by similar compositional features, i.e., oligonucleotide frequency; PhyloPythia (McHardy
et al. 2007), a support vector machine implementation, categorizes sequences based on both
pattern similarity and oligonucleotide frequency. The methods listed above, whether supervised
or unsupervised, have their own limitations. Supervised learning methods, including BLAST, kmer,
SOM, and PhyloPythia, require prior knowledge, such as completed genomes or labeled long
contigs, as training datasets; the unsupervised learning method, TETRA, may become intractable
for huge metagenomes because of the required computation of all-versus-all pairwise comparisons.
Although these methods have demonstrated great feasibility, a solution that resolves bins
without identifiable labels would make the binning task more applicable to typical metagenomes.
In this context, we introduce a semi-supervised learning method that couples a seeding strategy
with the growing self-organizing map (GSOM), called the seeded GSOM (S-GSOM), for sequence
binning.

Self-Organizing Map and Growing Self-Organizing Map Algorithms

The self-organizing map (SOM) (Kohonen 1990) is an unsupervised clustering algorithm. It can
visualize the clustering of unlabeled feature vectors on a static lattice grid map, which has a
pre-defined grid shape and map size during the entire training process. Within the map, every
node (or lattice) has a weight vector of the same dimension as the input vector. The SOM
algorithm separates training into three phases: initialization, ordering, and fine-tuning. In
the initialization phase, the weight vector of each initial node can be either generated from
random values or, generally, set using principal component analysis (PCA) to position a fully
unfolded map on the plane formed by the first two principal vectors in the input space (Kohonen
1999). The number of initial nodes needs to be determined by the user. In the ordering and
fine-tuning phases, each input identifies a winning node, which is the node on the map with the
smallest Euclidean distance to the input. Then the weight vectors of the winning node and its
neighboring nodes are updated by

$$ w(t+1) = w(t) + \alpha \, h \, [x(k) - w(t)], $$

where w is the weight vector of the node, x is the input vector ($w, x \in \mathbb{R}^D$, where
D is the dimension), k is the index of the current input vector, $\alpha$ is the learning rate,
and h is the neighborhood kernel function.

The Growing SOM (GSOM) (Alahakoon et al. 2000; Hsu and Halgamuge 2003) is an extension of SOM.
It is a dynamic SOM, which overcomes SOM’s weakness of the static map structure, i.e., GSOM
initiates its training with a minimum single lattice grid, depending on whether the rectangular
or hexagonal network topology is used, to facilitate the dynamic growth of the map in the
training process. GSOM employs the same weight adaptation and neighborhood kernel function as
SOM. The map size of a perfectly trained GSOM map is controlled by a global
parameter of growth, which is called the Growth Threshold (GT) and defined as

$$ GT = -D \times \ln(SF), $$

where D is the data dimension and SF is the user-defined Spread Factor that takes a value in
(0, 1], with 0 representing minimum and 1 representing maximum growth.

There are four phases in GSOM training: initialization, growing, and two smoothing phases. In
the initialization phase, weight vectors of initial nodes in the minimum single lattice grid are
initialized by random values and the GT is calculated according to the data dimension and the
user-defined SF. During the growing phase, every node keeps an accumulated error counter and the
counter of the winning node ($E_{winner}$) is updated by

$$ E_{winner}(t+1) = E_{winner}(t) + \lVert x(k) - w_{winner}(t) \rVert. $$

When $E_{winner}$ exceeds GT, a winning node that is at the boundary of the current map will
grow new nodes in its neighboring vacant lattice positions and initialize their weight vectors
by interpolating or extrapolating the weight vectors of existing neighboring nodes around the
winning node. If the winning node is not a boundary node, the accumulated error ($E_{winner}$)
is evenly distributed outwards to its neighbors. The two smoothing phases are for fine-tuning
the weights of nodes. The hexagonal lattice was used for GSOM in this study as the hexagonal
lattice yields better data topology preservation (Hsu et al. 2003).

Seeding Sequence and Metagenomic Dataset Preparation

From the NCBI Archaea/Bacteria genome database, we randomly selected 10, 20, and 40 species to
generate metagenomes of different complexity. Three sets were drawn for the 10 and 20 species
datasets, and only one set for the 40 species dataset due to the limitation imposed by the
available computing resources. Simulated metagenomes were denoted by “XSp_SetY” where X is the
number of species included in the metagenome and Y represents the serial number of the
metagenome. For example, “10Sp_Set1” is the first metagenome containing 10 random species. In
each genome, the seeding sequences were first identified as the flanking regions of 16S rRNA
genes, with lengths ranging from 8 to 13 kilobases (kb). The seeding sequences that overlapped
with other rRNA and tRNA genes were excluded to avoid possible interference caused by highly
similar sequence compositions. After removing the tRNA, rRNA, and seeding sequences, the
remaining genomic regions were randomly chopped into simulated metagenomic fragments with
lengths from 8 to 13 kb. The length restriction of 8–13 kb is used to provide a standardized
rule for both seeding and metagenomic sequences (Mavromatis et al. 2007), and with
single-molecule sequencing techniques on the horizon (Clarke et al. 2009), these are expected to
be achievable lengths for metagenomes in the near future.

The tetranucleotide frequency of metagenomic sequences is the training feature we used in our
implementation for binning because it has a better resolution in species separation (Abe
et al. 2003) and is highly similar between intragenomic fragments compared to intergenomic
fragments. The tetranucleotide frequencies were computed using a four-base sliding window and
normalized by the length of the corresponding sequence (frequency per base). A total of 256
(4^4) combinations of nucleotide usages, i.e., AAAA, AAAT, AAAG, AAAC . . . CCCG, and CCCC, are
represented in the feature vector of 256 dimensions.

Seeded GSOM Algorithm

Metagenomic sequences that belong to closely related species are likely to have homologous
sequences between the clusters (bins), and this fact makes the identification of clustering
boundaries much more difficult. Therefore, a modified strategy is needed to identify clusters so
that GSOM can be improved as a more practical solution for binning.
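
The tetranucleotide feature vector, the weight update, the growth threshold, and the error
accumulation described in the preceding sections can be illustrated with a short Python sketch.
It is a minimal illustration under stated assumptions rather than the authors' implementation:
the window is assumed to advance one base at a time, windows containing symbols other than A, C,
G, or T are assumed to be skipped, the error term is assumed to be the Euclidean norm, and all
function names are hypothetical.

```python
import itertools
import math

# All 256 tetranucleotides in a fixed (arbitrarily chosen) order.
TETRAMERS = ["".join(p) for p in itertools.product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def tetra_frequencies(seq):
    """256-dimensional feature vector: tetranucleotide counts per base."""
    seq = seq.upper()
    if len(seq) < 4:
        raise ValueError("sequence shorter than one tetranucleotide")
    counts = [0.0] * len(TETRAMERS)
    for i in range(len(seq) - 3):  # assumed step of one base
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # assumption: skip windows with N or other symbols
            counts[idx] += 1
    return [c / len(seq) for c in counts]  # normalize by sequence length


def growth_threshold(dim, spread_factor):
    """GT = -D * ln(SF), with SF in (0, 1]."""
    if not 0.0 < spread_factor <= 1.0:
        raise ValueError("spread factor must lie in (0, 1]")
    return -dim * math.log(spread_factor)


def update_weight(w, x, alpha, h):
    """One update step: w(t+1) = w(t) + alpha * h * (x - w(t))."""
    return [wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)]


def accumulate_error(error, x, w_winner):
    """Add the (assumed Euclidean) distance between x and the winner's weight."""
    return error + math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w_winner)))
```

For a 256-dimensional feature vector, a spread factor close to 1 gives a growth threshold close
to 0, so the winner's accumulated error exceeds it quickly and the map grows readily, which
matches the description of 1 as maximum growth.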
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 1 The S-GSOM algorithm. (a) Schematic diagram of the clustering process of S-GSOM. (b) The pseudo-code for the node-assigning process in S-GSOM

The seeded GSOM (S-GSOM), which allows clusters to be identified automatically in the feature
map using seeds (labeled data), is our proposed modification of GSOM. There are three core steps
in S-GSOM. Firstly, a very small number of labeled seeds (labeled feature vectors) are combined
with unlabeled data (unlabeled feature vectors). Secondly, the combined input vectors are fed
into GSOM training, which treats the seeds as unlabeled data. Finally, after the normal phases
of GSOM training, S-GSOM identifies clusters based on the location of seeds in the final map and
the specified number of nodes in the cluster (Fig. 1a).

In the last step of S-GSOM training, the cluster identification phase, the nodes that have seeds
are identified and labeled as clustered nodes. S-GSOM then assigns other un-clustered nodes, one
by one, to clusters iteratively until the specified clustering percentage (more details in the
Clustering Percentage (CP) Determination section) is reached. In each iteration, a set of
un-clustered nodes that are adjacent to the clustered nodes is identified. The node in the set
with the shortest Euclidean distance to its adjacent clustered node will be assigned to the same
cluster as that clustered node. However, nodes not containing any sample most likely represent a
cluster boundary. So the actual distance is multiplied by a penalty factor greater than one when
calculating the distance between empty nodes and clustered nodes. This leads S-GSOM not to
assign empty nodes to any cluster (Fig. 1b). Based on the empirical observation that the
clustering results are not very sensitive to penalty factor values between 2 and 5, a penalty
factor of 2 was used in all our experiments.

Before the initiation of the taxonomy-assigning process, the seeded nodes must be assigned to a
specific taxon. When all seeds in one node come from the same taxon or there is only a single
seed, it is trivial for S-GSOM to assign the seeded node to the same taxon as the seeds it
contains. If the seeds in one node belong to multiple taxa, the seeded node will be assigned to
the majority taxon. However, when seeds are of multiple taxa in equal numbers, e.g., two seeds
in one node belonging to taxon A and B, respectively, all seeds are discarded.

To illustrate the role of S-GSOM in binning, Fig. 2 depicts the schematic diagram that explains
how S-GSOM fits into the whole binning process.
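
The cluster identification phase described above can be sketched in Python as follows. This is
an illustrative reading of the text, not the pseudo-code of Fig. 1b: the map is assumed to be
given as plain dictionaries (weight vector, neighbor list, and number of mapped sequences per
node), ties between equally distant candidates are broken by iteration order, and the function
and parameter names are hypothetical.

```python
import math
from collections import Counter


def label_seeded_nodes(seeds_by_node):
    """Assign each seeded node to its majority taxon; ties are discarded."""
    labels = {}
    for node, taxa in seeds_by_node.items():
        ranked = Counter(taxa).most_common(2)
        if not ranked:
            continue  # node listed without any seed
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            labels[node] = ranked[0][0]
    return labels


def assign_nodes(weights, neighbors, hits, seeds_by_node, cp=0.55, penalty=2.0):
    """Grow clusters from seeded nodes until a fraction `cp` of nodes is clustered.

    weights:   node -> weight vector
    neighbors: node -> list of adjacent nodes on the map
    hits:      node -> number of sequences mapped to the node (0 = empty node)
    """
    labels = label_seeded_nodes(seeds_by_node)
    target = cp * len(weights)
    while len(labels) < target:
        best = None  # (distance, candidate node, taxon)
        for node, taxon in labels.items():
            for cand in neighbors[node]:
                if cand in labels:
                    continue
                dist = math.dist(weights[cand], weights[node])
                if hits[cand] == 0:  # empty nodes likely mark a cluster boundary
                    dist *= penalty
                if best is None or dist < best[0]:
                    best = (dist, cand, taxon)
        if best is None:  # no un-clustered node is adjacent to any cluster
            break
        labels[best[1]] = best[2]
    return labels
```

In this sketch the penalty only makes empty nodes less attractive; they remain unassigned when
the clustering percentage is reached before their penalized distance becomes the smallest one.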
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 2 An overview of the binning process using S-GSOM

Clustering Percentage (CP) Determination

Because metagenomic sequences of closely related species occur frequently at the cluster
boundary (Abe et al. 2003; Chan et al. 2008b), which is very likely to reduce the binning
accuracy, appropriate control of how many nodes are assigned to bins is necessary for S-GSOM to
achieve a trade-off between the amount of binned sequences and the binning accuracy. Hence, the
clustering percentage (CP) value is introduced, which is defined as the percentage of clustered
nodes relative to the total number of nodes on the map. It was noted that the performance of
S-GSOM declined when CP was higher than 55% (Fig. 3). However, S-GSOM binned more than 80% of
sequences at CP = 55% in most cases. Thus, a CP of 55% was used throughout the following
experiments.

Comparison of Semi-supervised Algorithms for Binning

To test the feasibility of semi-supervised methods for binning, four other notable
semi-supervised clustering algorithms, COP K-means, Constrained K-means, Seeded K-means, and
Transductive Support Vector Machine (TSVM), were used alongside S-GSOM. Among these methods,
different runs with random initialization of COP K-means and S-GSOM can lead to diverse results,
which is not an issue for Constrained K-means and Seeded K-means because they use the labeled
data for initialization. So the best results of COP K-means in 100 runs with random
initialization were reported, and to ensure repeatability, all the feature vectors in S-GSOM’s
initialization were fixed at the mid value 0.5 in all dimensions.

Two indices were used to measure clustering performance: the adjusted Rand index (ARI) (Hubert
and Arabie 1985) and the weighted F-measure (WF) (Van Rijsbergen 1979). A higher index indicates
better performance.

S-GSOM manifested consistently superior performance on both measures, ARI and WF, with the
exception of Constrained K-means on the ARI measure for the 10Sp_Set3 dataset (Table 1). We
suspect that the considerably worse performance of TSVM results from insufficient labeled data.
The superior performance of S-GSOM, which accurately assigned 75–90% of all sequences at
CP = 55%, clearly demonstrates that the adjustable CP value effectively
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 3 Identification of an appropriate clustering percentage (CP). Five datasets for each of 5, 10, and 20 species were randomly sampled. The average performance of S-GSOM over the datasets is plotted against CP. Performance tends to decrease as CP increases. A compromise value of CP = 55 % is marked, where both the number of assigned nodes and the performance are high
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 1 Clustering performance of semi-supervised algorithms. Performance is measured by the adjusted Rand index (ARI) and weighted F-measure (WF)
Dataset     COP K          Constrained K  Seeded K       TSVM           S-GSOM-55
            ARI    WF      ARI    WF      ARI    WF      ARI    WF      ARI    WF
10Sp_Set1   0.84   0.94    0.84   0.94    0.84   0.93    0.25   0.59    0.85   0.95
10Sp_Set2   0.89   0.96    0.79   0.90    0.78   0.90    0.41   0.69    0.93   0.97
10Sp_Set3   0.58   0.83    0.85   0.93    0.84   0.93    0.27   0.62    0.83   0.93
20Sp_Set1   0.91   0.90    0.77   0.82    0.76   0.82    0.45   0.65    0.97   0.96
20Sp_Set2   0.76   0.82    0.70   0.79    0.67   0.79    0.43   0.62    0.83   0.89
20Sp_Set3   0.81   0.89    0.75   0.86    0.75   0.86    0.46   0.67    0.97   0.98
40Sp        0.58   0.76    0.71   0.85    0.68   0.84    0.24   0.56    0.83   0.91
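As a reminder of how the first index in Table 1 is obtained, the following minimal sketch computes the adjusted Rand index from two labelings using the pair-counting form given by Hubert and Arabie (1985). The function name and the toy labelings are illustrative assumptions. How un-binned sequences were handled in the reported scores is not stated in this section, so the sketch simply compares two complete labelings.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand index of two labelings of the same items."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    # Pairs of items that share a cell, a row, or a column of the contingency table.
    cells = Counter(zip(true_labels, predicted_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted_labels).values())
    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g., both labelings form a single cluster
    return (sum_cells - expected) / (maximum - expected)


# The index ignores label names: a pure relabeling scores 1.0.
same_partition = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
# A labeling that splits every true cluster scores below zero.
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, a random assignment scores close to zero, which is why the TSVM values in Table 1 indicate weak agreement rather than merely moderate agreement.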
helps S-GSOM to achieve better clustering by not assigning those ambiguous sequences. The S-GSOM visualization of the binned sequences of 10Sp_Set1, 20Sp_Set1, and 40Sp is provided in Fig. 4.
We considered the 20-species metagenomes as examples to analyze the resolution of binning with S-GSOM. At CP = 55 %, an average of 82 % of sequences were assigned to their source species with 92 % accuracy. The distribution of the binning result is shown in Fig. 4b. Nodes containing seeds from multiple species are colored grey and labeled with the number of species. A significantly higher abundance of grey nodes around “C6” and “C7,” representing Haemophilus influenzae 86-028NP and Haemophilus somnus 129PT, respectively, indicates that metagenomic sequences with similar tetranucleotide frequencies, resulting from closely related species, tend to be clustered without
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 4 Resulting growing self-organizing maps (GSOM) of (a) 10Sp_Set1, (b) 20Sp_Set1, and (c) 40Sp metagenomes. Each hexagon represents a single node. A node that contains a single species is displayed in a color that uniquely identifies that species. A node without a letter contains no data (sequences). Grey nodes contain multiple species, and the number of species is given by the label
a clear boundary. This further highlights the importance of obtaining seeds in non-boundary regions.
In addition to its clustering performance, S-GSOM has a prominent advantage brought by the seeding method: the ability to cluster sequences of unseeded species, i.e., unknown species. To demonstrate this advantage, an iso-CP (constant CP) contour is delineated in Fig. 5a, generated from a five-species metagenome with only four seeds. By applying different CP values, a group of nodes was found
Binning Sequences Using Very Sparse Labels Within a Metagenome, Fig. 5 Illustration of exploring an unseeded cluster. (a) The five-species S-GSOM map. The seeded nodes are shown with unique colors and labels. Charcoal nodes are assigned at CP = 27 %, dark grey nodes at CP = 55 %, light grey nodes at CP = 77 %, and white nodes at CP = 100 %. (b) Internode distance map with nodes assigned at CP = 55 %
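The shading used for the iso-CP contour in Fig. 5a can be reproduced from the order in which nodes join clusters. The sketch below is a hedged illustration only: it assumes that nodes are clustered one at a time, so that every node has a well-defined CP value at which it is first assigned, and it uses the four contour levels named in the caption. The function names and the toy assignment order are hypothetical.

```python
def first_assignment_cp(assignment_order, total_nodes):
    """Map each node to the CP value (in percent) at which it becomes clustered.

    assignment_order -- node ids in the order they were clustered, seeds first
    total_nodes      -- number of nodes on the map, clustered or not
    """
    return {
        node: 100.0 * (rank + 1) / total_nodes
        for rank, node in enumerate(assignment_order)
    }


def contour_band(cp_value, levels=(27, 55, 77, 100)):
    """Return the lowest contour level at which the node is already clustered."""
    for level in sorted(levels):
        if cp_value <= level:
            return level
    return None


# Toy map with four nodes, clustered in the order n1, n2, n3, n4.
cp_of_node = first_assignment_cp(["n1", "n2", "n3", "n4"], total_nodes=4)
bands = {node: contour_band(cp) for node, cp in cp_of_node.items()}
```

A compact group of nodes that all fall into a high band, away from any seed, is the pattern that the text interprets as a candidate unseeded species.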
to be clustered only at CP = 77 % (in the top-right region). This situation is most likely when a species is relatively abundant but does not have a seed. Figure 5b shows the allocation of nodes to seeds at CP = 55 %. However, the protrusion of species “1” into the unassigned region, which belongs to species “5,” is an incorrect assignment of the kind that sometimes happens to nodes without a correct seed.

Comparison of Binning Fidelity Using S-GSOM

In this section, we compare the binning performance of S-GSOM with that of three other methods, BLAST, kmer, and PhyloPythia, as reported on metagenomes of different complexities (Mavromatis et al. 2007) after assembly by Arachne (Batzoglou et al. 2002), Phrap (Green 1996), and JAZZ (Aparicio et al. 2002). However, JAZZ produced a small number of contigs compared with Arachne and Phrap (Mavromatis et al. 2007), so contigs assembled by JAZZ were excluded. In addition, because simHC, a community without any dominant species, has few of the long contigs required by composition-based analysis (Mavromatis et al. 2007; Teeling et al. 2004a), we also excluded the simHC dataset from our analysis.
For a fair comparison, all methods need to be compared at the same taxonomic level of binning. Binning at a very high level, e.g., kingdom, clearly has no significance; therefore, the results are compared at the order level here, and results at other taxonomic levels are included in the supplementary materials of the original publication (Chan et al. 2008a). At the order level, the results for the simLC (low complexity) and simMC (medium complexity) metagenomes are each shown in two separate tables, one for binning contigs larger than 8 kb and the other for contigs composed of at least 10 reads. To evaluate the performance, rather than using simple averages over all bins (Mavromatis et al. 2007), we used a weighted average that gives higher weights to larger bins, to better reflect the number of correctly binned contigs.
In both simLC and simMC, S-GSOM performed reasonably well for binning contigs larger than 8 kb, where it was more accurate than all
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 2 Binning summary for low complexity metagenome for contigs larger than 8 kb
Assembler  Method                      Bins  Binned contigs  Total#Contigs  %ofBinContigs  #ofPredNotInAct  wSp    wSn
Arachne    kmer (7 mer)                0     0               202            0              85               –      0.000
Arachne    kmer (8 mer)                0     0               202            0              149              –      0.000
Arachne    BLAST distr 1               0     0               202            0              0                –      0.000
Arachne    BLAST distr 2               0     0               202            0              0                –      0.000
Arachne    S-GSOM (CP = 55 %)          1     141             202            69.8           0                1.000  0.698
Arachne    gen PhyloPythia (p:0.85)    1     168             202            83.17          0                1.000  0.832
Arachne    ssp. PhyloPythia (p:0.85)   1     186             202            92.08          0                1.000  0.921
Arachne    S-GSOM (CP = 75 %)          1     180             202            89.11          0                1.000  0.891
Arachne    gen PhyloPythia (p:0.5)     1     201             202            99.5           0                1.000  0.995
Arachne    ssp. PhyloPythia (p:0.5)    1     201             202            99.5           0                1.000  0.995
Phrap      kmer (7 mer)                0     0               229            0              129              –      0.000
Phrap      kmer (8 mer)                0     0               229            0              154              –      0.000
Phrap      BLAST distr 1               0     0               229            0              0                –      0.000
Phrap      BLAST distr 2               0     0               229            0              0                –      0.000
Phrap      S-GSOM (CP = 55 %)          1     157             229            68.56          0                1.000  0.686
Phrap      gen PhyloPythia (p:0.85)    1     185             229            80.79          0                1.000  0.808
Phrap      ssp. PhyloPythia (p:0.85)   1     205             229            89.52          0                1.000  0.895
Phrap      S-GSOM (CP = 75 %)          1     204             229            89.08          0                1.000  0.891
Phrap      gen PhyloPythia (p:0.5)     1     227             229            99.13          0                1.000  0.991
Phrap      ssp. PhyloPythia (p:0.5)    1     227             229            99.13          0                1.000  0.991
Total#Contigs total number of contigs in the dataset, %ofBinContigs the percentage of contigs binned, #ofPredNotInAct the number of contigs predicted as a taxon that is not present in the dataset, which are treated as the un-binned contigs, wSp weighted specificity, wSn weighted sensitivity
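The weighted scores reported in these tables give larger bins more influence than a simple per-bin average would. The exact definitions of wSp and wSn are given in the original publication and are not reproduced in this section, so the sketch below only illustrates the general idea under the assumption that each bin's score is weighted by its size. The function names and the two example bins are hypothetical.

```python
def simple_mean(per_bin):
    """Unweighted average of per-bin scores; per_bin holds (bin_size, score) pairs."""
    return sum(score for _, score in per_bin) / len(per_bin)


def size_weighted_mean(per_bin):
    """Average of per-bin scores weighted by bin size."""
    total_size = sum(size for size, _ in per_bin)
    return sum(size * score for size, score in per_bin) / total_size


# Invented example: one large, well-resolved bin and one small, poorly resolved bin.
example_bins = [(100, 0.9), (10, 0.5)]
unweighted = simple_mean(example_bins)       # both bins count equally
weighted = size_weighted_mean(example_bins)  # dominated by the 100-contig bin
```

In this invented example the simple mean is 0.7, while the size-weighted mean is 95/110, or about 0.86, which is closer to the fraction of contigs that were actually handled correctly.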
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 3 Binning summary for medium complexity metagenome for contigs larger than 8 kb
Assembler  Method                      Bins  Binned contigs  Total#Contigs  %ofBinContigs  #ofPredNotInAct  wSp    wSn
Arachne    kmer (7 mer)                0     0               301            0              47               –      0.000
Arachne    kmer (8 mer)                0     0               301            0              191              –      0.000
Arachne    BLAST distr 1               0     0               301            0              0                –      0.000
Arachne    S-GSOM (CP = 55 %)          2     220             301            73.09          0                1.000  0.731
Arachne    gen PhyloPythia (p:0.85)    2     242             301            80.4           0                1.000  0.804
Arachne    ssp. PhyloPythia (p:0.85)   2     242             301            80.4           0                1.000  0.804
Arachne    S-GSOM (CP = 75 %)          2     279             301            92.69          0                1.000  0.927
Arachne    gen PhyloPythia (p:0.5)     2     301             301            100            0                1.000  1.000
Arachne    ssp. PhyloPythia (p:0.5)    2     301             301            100            0                1.000  1.000
Phrap      kmer (7 mer)                0     0               401            0              84               –      0.000
Phrap      kmer (8 mer)                0     0               401            0              271              –      0.000
Phrap      BLAST distr 1               0     0               401            0              0                –      0.000
Phrap      S-GSOM (CP = 55 %)          2     318             401            79.3           0                1.000  0.793
Phrap      gen PhyloPythia (p:0.85)    2     301             401            75.06          0                1.000  0.751
Phrap      ssp. PhyloPythia (p:0.85)   2     295             401            73.57          0                1.000  0.736
Phrap      S-GSOM (CP = 75 %)          2     367             401            91.52          0                1.000  0.915
Phrap      gen PhyloPythia (p:0.5)     2     399             401            99.5           1                1.000  0.995
Phrap      ssp. PhyloPythia (p:0.5)    2     399             401            99.5           1                1.000  0.995
settings of the kmer and BLAST methods but was outperformed by PhyloPythia in both confidence settings (CP = 75 % vs. p-value = 0.5 and CP = 55 % vs. p-value = 0.85) regardless of the assembler used (Tables 2 and 3). Nevertheless, S-GSOM still outperformed PhyloPythia for simMC at the family level, particularly in terms of sensitivity, i.e., it had a higher true positive rate (refer to the supplementary materials of the original publication).
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 4 Binning summary for low complexity metagenome for contigs with at least 10 reads
Assembler  Method                      Bins  Binned contigs  Total#Contigs  %ofBinContigs  #ofPredNotInAct  wSp    wSn
Arachne    kmer (7 mer)                0     0               367            0              168              –      0.000
Arachne    kmer (8 mer)                0     0               367            0              312              –      0.000
Arachne    S-GSOM (CP = 55 %)          3     295             367            80.38          0                1.000  0.798
Arachne    gen PhyloPythia (p:0.85)    2     214             367            58.31          0                1.000  0.583
Arachne    ssp. PhyloPythia (p:0.85)   2     236             367            64.31          0                1.000  0.638
Arachne    S-GSOM (CP = 75 %)          3     343             367            93.46          0                0.950  0.926
Arachne    gen PhyloPythia (p:0.5)     2     292             367            79.56          0                1.000  0.796
Arachne    ssp. PhyloPythia (p:0.5)    2     296             367            80.65          0                1.000  0.798
Phrap      kmer (7 mer)                2     3               482            0.62           159              1.000  0.000
Phrap      kmer (8 mer)                3     17              482            3.53           281              1.000  0.000
Phrap      S-GSOM (CP = 55 %)          8     381             482            79.05          9                1.000  0.728
Phrap      gen PhyloPythia (p:0.85)    3     236             482            48.96          0                1.000  0.488
Phrap      ssp. PhyloPythia (p:0.85)   3     272             482            56.43          0                1.000  0.560
Phrap      S-GSOM (CP = 75 %)          8     443             482            91.91          9                1.000  0.840
Phrap      gen PhyloPythia (p:0.5)     4     368             482            76.35          1                1.000  0.759
Phrap      ssp. PhyloPythia (p:0.5)    5     387             482            80.29          1                1.000  0.797
At the order level, while PhyloPythia performed best in all binning tests on contigs larger than 8 kb, S-GSOM was the best-performing method for binning contigs that contain at least 10 reads (Tables 4 and 5).

Discussion

By including sequences with taxonomic information, i.e., seeds, S-GSOM is more feasible for the binning of metagenomes containing many unknown species. The visualization
Binning Sequences Using Very Sparse Labels Within a Metagenome, Table 5 Binning summary for medium complexity metagenome for contigs with at least 10 reads
Assembler  Method                      Bins  Binned contigs  Total#Contigs  %ofBinContigs  #ofPredNotInAct  wSp    wSn
Arachne    kmer (7 mer)                1     2               1,372          0.15           133              1.000  0.000
Arachne    kmer (8 mer)                0     0               1,372          0              1,241            –      0.000
Arachne    BLAST distr 1               0     0               1,372          0              0                –      0.000
Arachne    BLAST distr 2               0     0               1,372          0              1                –      0.000
Arachne    S-GSOM (CP = 55 %)          5     1,061           1,372          77.33          0                0.998  0.768
Arachne    gen PhyloPythia (p:0.85)    3     562             1,372          40.96          0                1.000  0.409
Arachne    ssp. PhyloPythia (p:0.85)   3     657             1,372          47.89          0                1.000  0.478
Arachne    S-GSOM (CP = 75 %)          5     1,253           1,372          91.33          0                0.983  0.897
Arachne    gen PhyloPythia (p:0.5)     4     1,036           1,372          75.51          6                1.000  0.753
Arachne    ssp. PhyloPythia (p:0.5)    4     1,102           1,372          80.32          4                1.000  0.802
Phrap      kmer (7 mer)                1     1               1,980          0.05           163              1.000  0.000
Phrap      kmer (8 mer)                2     391             1,980          19.75          1,457            1.000  0.000
Phrap      BLAST distr 1               0     0               1,980          0              2                –      0.000
Phrap      BLAST distr 2               0     0               1,980          0              3                –      0.000
Phrap      S-GSOM (CP = 55 %)          8     1,409           1,980          71.16          9                0.995  0.686
Phrap      gen PhyloPythia (p:0.85)    3     799             1,980          40.35          1                1.000  0.404
Phrap      ssp. PhyloPythia (p:0.85)   3     844             1,980          42.63          1                1.000  0.426
Phrap      S-GSOM (CP = 75 %)          8     1,708           1,980          86.26          9                0.991  0.816
Phrap      gen PhyloPythia (p:0.5)     5     1,484           1,980          74.95          6                1.000  0.745
Phrap      ssp. PhyloPythia (p:0.5)    5     1,524           1,980          76.97          4                1.000  0.767
property of S-GSOM further allows the identification of unseeded clusters. However, an unseeded species should have at least as many sequences as the seeded clusters; otherwise, S-GSOM may wrongly assign the unseeded species to an unrelated species at a low CP value, or the species may be considered part of the boundary of neighboring clusters and thus become hard to detect.
It is very likely that the 16S rRNA fragments of some species were not sampled or were difficult to sample. In such circumstances, we can still obtain
those metagenomic sequences in the candidate bins identified with the iso-CP contour map and then compare the sequences with existing databases by BLAST search. If any conserved marker gene, such as an elongation factor or cytochrome oxidase, is detected, we may then assess the clusters of these sequences by phylogenetic analysis.
Even though these composition-based binning methods have shown good results, they are currently hindered by the requirement for long sequences. This length limitation is partially due to chimeric sequences arising from the cloning procedures of experiments and from incorrect assembly of sequences. The former source of chimeric sequences can be reduced by advanced cloning-free sequencing, e.g., the Roche 454 Genome Sequencer FLX. The latter source, however, stems from the design of current assemblers, which are built to assemble all reads into one single genome and may therefore not meet the requirements of metagenomes with poor sequencing coverage or high species complexity. Therefore, if the number of chimeric sequences is reduced, the sequence length required by S-GSOM can also be reduced. To help reduce chimeric sequences, we suggest including compositional information at the assembly stage.

Summary

S-GSOM enables the clustering (binning) of metagenomic sequences by using sparse, phylogenetically labeled sequence fragments around highly conserved genes as seeds. The use of seeds makes S-GSOM more feasible when dealing with metagenomes containing many unknown species, which can be visualized using the CP contour display. In addition, S-GSOM is an efficient algorithm in terms of training time. By adjusting the CP value, users can retrieve different clustering results without retraining. Its self-organizing nature makes S-GSOM an automated process that can be improved as new seeds become available.

References

Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T. Informatics for unveiling hidden genome signatures. Genome Res. 2003;13:693–702.
Alahakoon D, Halgamuge SK, Srinivasan B. Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Trans Neural Netw. 2000;11:601–14.
Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002;297:1301–10.
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. ARACHNE: a whole-genome shotgun assembler. Genome Res. 2002;12:177–89.
Chan CK, Hsu AL, Halgamuge SK, Tang SL. Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics. 2008a;9:215.
Chan CKK, Hsu AL, Tang SL, Halgamuge SK. Using growing self-organising maps to improve the binning process in environmental whole-genome shotgun sequencing. J Biomed Biotechnol. 2008b;2008(513701):p 10. doi:10.1155/2008/513701
Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol. 2009;4:265–70.
Green P. Documentation for PHRAP. 1996; http://bozeman.mbt.washington.edu/
Hsu AL, Halgamuge SK. Enhancement of topology preservation and hierarchical dynamic self-organising maps for data visualisation. Int J Approx Reason. 2003;32:259–79.
Hsu AL, Tang S-L, Halgamuge SK. An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data. Bioinformatics. 2003;19:2131–40.
Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2:193–218.
Kohonen T. The self-organizing map. Proc IEEE. 1990;78:1464–80.
Kohonen T. Analysis of processes and large data sets by a self-organizing method. Intell Process Manuf Mater. 1999;1:27–36.
Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4:495–500.
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4:63–72.
Sandberg R, Winberg G, Branden CI, Kaske A, Ernberg I, Coster J. Capturing whole-genome characteristics in
short sequences using a naive Bayesian classifier. Genome Res. 2001;11:1404–9.
Teeling H, Meyerdierks A, Bauer M, Amann R, Glockner FO. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol. 2004a;6:938–47.
Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner FO. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004b;5:163.
Van Rijsbergen CJ. Information retrieval. London: Butterworths; 1979.

Biological Treasure Metagenome

Geun-Joong Kim and Ho-Dong Lim
Department of Biological Sciences, College of Natural Sciences, Chonnam National University, Gwangju, Republic of Korea

Synonyms

The economic and academic values of metagenome resources

Definition

Given that the possibility and frequency of finding novel genes, enzymes, and metabolites through conventional pure culture technology are gradually decreasing, the exploration of resources hidden in the metagenome, a treasure of new resources, is expected to provide a new breakthrough. The metagenome is currently the most promising candidate for exploring new biological resources, and there will therefore be continuous efforts to refine strategies and develop new protocols. Through research using the metagenome, we can measure microbial diversity, understand ecosystems through a window into the microbial genome contents of a specific environmental habitat, explore useful resources, and ultimately incorporate them into practical use.

Introduction

The coming of a new era, the “metagenome age,” which accesses the genomes of all microbes retrieved directly from environmental samples, paves a new way for the understanding and practical application of microbial resources (Hunter-Cevera 1998). The metagenome will provide powerful tools for understanding the microbial world, with the potential to uncover the constituents of entire living organisms for valuable use in fields such as agricultural, medicinal, and industrial biotechnology.
Microbes are currently recognized as possessing the most extensive genetic biodiversity. They have proliferated in the ecosystems of Earth for a very long time (3.5–4.2 billion years) and have thus adapted to various habitats seemingly incompatible with life (from a conventional point of view), such as the animal gut, deserts, Antarctic ice, and hot springs. Accordingly, their taxonomic species and metabolic functions are also more diverse than expected. Microbial consortia are therefore a treasure of resources with immense value in basic research and practical applications. Since the appearance of mankind on Earth, microbes and humans have maintained a close relationship through both direct and indirect interactions. In view of this long direct interaction, the roles of microbes in human nutrition and health are being established through integrated research, the Human Microbiome Project, which aims to characterize the microbial communities of the human body, including the nasal passages, oral cavity, skin, gastrointestinal tract, and urogenital tract (Lewis et al. 2012).
Accessing the microbial diversity of various niches through metagenomic approaches has been reported to provide valuable resources and clues for the application of microbes in human health and industry, as well as for scientific research on ecosystem function, the global biogeochemical cycle, and the origin of life. However, microbial species known as potential candidates in industry and those with
elucidated functions in ecosystems are only those that can be cultured, which are estimated to be less than 1 % of the microbes existing in nature; thus, most microbes are regarded as unculturable species (Handelsman 2004). Accordingly, strategies to overcome the limitation of pure culture and thus to explore the entire microbial resource have been attempted, which has induced a paradigm shift.
Metagenomics is a research area that studies the metagenome, which is the total of the genomes of all organisms existing in a certain habitat. It extracts DNA from a complex microbial community and analyzes the genomic information using molecular biological tools, mainly based on direct sequencing. Metagenomics is therefore a method of microbial community analysis that accesses all contents of microbial genomes and goes beyond the limited scope of cultivated cells. Metagenomic research has been revolutionized by the development of genome-manipulating technologies, and despite its short history, new functional genes, proteins, and biomaterials have been mined successfully (Xu 2006). A comprehensive understanding of the ecology and physiology of microbes has also been attained. In addition, a huge amount of sequence information derived from metagenomes is being integrated using bioinformatic tools. Accordingly, the scope of application across the entire range of biotechnology has changed on the basis of the potential value of the metagenome (Fig. 1).

The Value of Metagenome Resources Exploited
The results of gene prediction, sequence annotation, and metabolic assembly through (individual) genome reconstruction from a metagenome give not only an understanding of microbial ecology and physiology but also the expectation that useful genetic resources and whole in vivo synthetic pathways of specific compounds can be readily explored. With the development of amplification tools for rare DNAs and of technology for high-throughput DNA sequencing, it is now possible to analyze and understand the function of individual species in the whole community of natural strains. As an example, the elucidation of the broad distribution of non-extreme ammonia-oxidizing archaea (AOA) as dominant species in a wide range of ecological niches identified a major contributor to energy flow and nitrogen cycling in ecosystems (Erguder et al. 2009). A fundamental reconsideration of the geochemical cycle of nitrogen is thus required. In line with this, environmental shotgun sequencing of specific samples from the ocean, soil, plants, and animals has stimulated interest in the diversity of microorganisms and their metabolic gene clusters, enabling the elucidation of species and community functions in specific niches. With the introduction of high-throughput screening that can detect extremely weak activities and signals, new methods have been developed for the rapid detection of target libraries with a small
Biological Treasure Metagenome, Fig. 1 A value of metagenome resource. Current metagenomic studies result in various fields of applications that include, but are not limited to, environmental, agricultural, medical, and industrial needs
amount of sample, thus facilitating screening with more positive hits (genes and proteins). Candidates captured through this process are compared with other known resources on the basis of shared functional or sequence signatures, which predicts the functional roles of the candidates in silico. From the derived functional roles, the metabolic pathways and/or capacity of the whole microbiome are reconstructed. Currently, the microbiome in a specific environment is reliably established by reconstituting the genome data of all organisms with bioinformatic tools (Kunin et al. 2008). Metagenome data provide an understanding of complex biological systems through online public databases and data-integration tools. Assembled genomes analyzed in an integrated manner are providing a window for forming more complete genomes, and this is expected to reduce considerably the time and cost of finding new resources.

The Value of Metagenome Resources Remaining to be Elucidated
Besides the physiological and ecological values elucidated by metagenomic approaches, there is an obvious reason for obtaining useful genes or physiologically active substances from the metagenome: access to screening libraries of pure-cultured cells has been limited, and it is very difficult to sustain the novelty of resources originating from culturable strains. As far as is known, enzymatic degradation or synthesis is possible for almost every organic compound that can be found naturally or synthesized chemically. However, regardless of the existence of related enzymes with promising activities, the functional and sequence spaces of metagenome resources from all living organisms in ecosystems are still left mostly unexplored. Therefore, if the hurdles in the screening process that are due to approaches based on protein sequence homology can be overcome, it will be possible to find resources in new areas. Of course, it is generally known that screening from the metagenome competes with protein engineering technologies that mutate or fine-tune existing genetic resources or induce forced evolution in vitro. However, the limits of engineering processes in exploring sequence space, and the innate weakness of stepwise screening, which cannot effectively accumulate the effects of beneficial mutations in an alternative landscape, may reasonably explain the strength of exploring the metagenome of a microbial community that already possesses diverse functional space (including the biologically permitted sequence space) through evolutionary experience. Therefore, the metagenome can play a significant role as a resource that provides new alternatives and yields desired products through highly precise and specific enzyme reactions on the thousands of substrates used in industry (Fig. 2).
One of the major trends of research on biologically mediated processes is white biotechnology, which seeks alternatives to petrochemical compounds using renewable resources. Attention is therefore paid to the production of alternatives to fossil fuels by bioconversion or fermentation of biomass. To this end, regulatory genes, potential enzymes, and gene clusters related to the production of organic acids, alcohols, solvents, and diesels are also obtainable from the metagenome. Moreover, organic compounds that had been neglected for economic reasons are again in the spotlight, owing to improved price competitiveness, a low risk of environmental pollution, and innovative tools of systems biology. We also expect the metagenome to play critical roles in increasing agricultural productivity and the utilization of biomass. In addition, research on the human metagenome can provide clues to the causes of diseases, the acquired immune system, and new methods of treating pathogens through analyses of microbiome databases from microbial communities (Gill et al. 2006). Also, in response to the serious side effects of synthetic drugs and the increase in drug-resistant pathogens, it may be possible to find new natural inhibitors or suppressors in the metagenome, including quorum-sensing blockers, for use as antibiotics. In this respect, there are many attempts to approach the new potential of metagenome resources by analyzing the resistome formed naturally by biological species existing around the ecological
Biological Treasure Metagenome, Fig. 2 Exploring value creation from integrative research activities on the metagenome. Information concerning the application fields of metagenome resources is gathered and processed by systematically integrated systems. This information results in various fields of further application that address global problems, such as fine chemicals, the environment, medical services, and future energies. Metagenome information also provides clues to the origin and the minimal genome of living organisms
producers of these substances (D’Costa et al. 2007). It is generally believed that such an expectation can be realized by research on the metagenome, which has evolved in ecosystems through countless mutations, suppression, and competition among tens of millions of microbial species over billions of years.
Genomic data collected through the metagenome will ultimately be used in creating synthetically engineered species (with minimal synthesized genomes; Gibson et al. 2008; Lartigue et al. 2007; Dymond et al. 2011) that help solve global problems in areas such as medical services and energy. The goal of synthetic biology is to produce cell-level bio-factories for the biological production of valuable chemicals and drugs. In this research area, attempts have been made to create a platform for engineering orthogonal factors (which function without any interference in vivo), such as switches, circuits, and logic gates, in order to control multiple genes independently in host systems. In fact, the related tools and methods have already been applied successfully in various studies, giving rise to orthogonal DNA-protein or RNA-protein pairs. As an example of one such effort, orthogonal ribosome-mRNA pairs are composed of an mRNA containing a ribosome-binding site that is not recognized by the endogenous ribosome and an orthogonal ribosome that specifically translates the orthogonal mRNA; the pair thus functions independently, without severe effects on cell physiology and metabolism, when required (Wang et al. 2007). This result raises the possibility that useful cellular components can be synthesized efficiently by turning microbes into cell factories equipped with a minimal but sufficient genome and then adding more genes for specific purposes. The assignment of specialized and/or orthogonal functions may partly be attainable through novel parts (genes and proteins) and genetic circuits (signaling cascades and metabolic pathways) to be mined from the metagenome.

Summary

Through metagenomics, scientists have obtained a new view of the microbial world that differs from traditional concepts and are working to overcome the difficulties facing future society. The exhaustion of natural resources such as fossil fuels will increase interest in biological resources that rely on renewable sources, and this alone makes the metagenome highly worthy of study. Microbial diversity is so extensive that it is not easy to estimate the history of microbes in the ecosystems of the planet, and even now, in every ecosystem, they may continue to mutate in order to resist or adapt to unceasing changes. The metagenome thus has huge potential as a resource with novel activities, which may be used for many purposes.

References

D’Costa VM, Griffiths E, et al. Expanding the soil antibiotic resistome: exploring environmental diversity. Curr Opin Microbiol. 2007;10:481–9.
Dymond JS, Richardson SM, et al. Synthetic chromosome arms function in yeast and generate phenotypic diversity by design. Nature. 2011;477:471–6.
Erguder TH, Boon N, et al. Environmental factors shaping the ecological niches of ammonia-oxidizing archaea. FEMS Microbiol Rev. 2009;33:855–69.
Gibson DG, Benders GA, et al. Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science. 2008;319:1215–20.
Gill SR, Pop M, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–9.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Hunter-Cevera JC. The value of microbial diversity. Curr Opin Microbiol. 1998;1:278–85.
Kunin V, Copeland A, et al. A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev. 2008;72:557–78.
Lartigue C, Glass JI, et al. Genome transplantation in bacteria: changing one species to another. Science. 2007;317:632–8.
Lewis Jr CM, Obregon-Tito A, et al. The human microbiome project: lessons from human genomics. Trends Microbiol. 2012;20:1–4.
Wang K, Neumann H, et al. Evolved orthogonal ribosomes enhance the efficiency of synthetic genetic code expansion. Nat Biotechnol. 2007;25:770–7.
Xu J. Microbial ecology in the age of genomics and metagenomics: concepts, tools, and recent advances. Mol Ecol. 2006;15:1713–31.
C
Carbohydrate-Active Enzymes Database, Metagenomic Expert Resource

Brandi Cantarel¹, Pedro Coutinho² and Bernard Henrissat²
¹Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
²Centre National de la Recherche Scientifique & Aix-Marseille Université, Marseille, France

Definitions

Carbohydrate-active enzymes (CAZymes) designate the ensemble of the enzymes that catalyze the assembly, breakdown, or modification of oligosaccharides, polysaccharides, and glycoconjugates. They usually comprise glycoside hydrolases (GHs), polysaccharide lyases (PLs), carbohydrate esterases (CEs), and glycosyltransferases (GTs).

Introduction

Carbohydrates

Carbohydrates, in the form of mono-, di-, oligo-, and polysaccharides, as well as glycoconjugates, play important roles in all areas of biology. Beyond simple energy storage, carbohydrates underpin diverse biological processes, such as host-pathogen interactions, signal transduction, inflammation, intracellular trafficking, diseases and tumor metastasis, and differentiation/development. Importantly, carbohydrates represent about 75 % of the structural components of photosynthetically produced biomass. Sugar-rich plant cell walls, seeds, and tubers thus represent a major source of nutrients for herbivorous and omnivorous animals and for humans. These carbohydrates also have significant potential to address energy and material needs.

A striking feature of carbohydrates is their global structural variety, which results from a large diversity of monosaccharide building blocks and the possibility of numerous stereo- and regiospecific linkages (Laine 1994), which give rise to a myriad of structures that can be attached to proteins, lipids, nucleic acids, etc. Any biological molecule can be glycosylated, including proteins, lipids, nucleotides, and carbohydrates themselves, the level of such modifications often varying extensively. In fact, glycosylation of proteins is the most common posttranslational modification in eukaryotes and is also present in prokaryotes, strongly influencing many of their functional aspects, including cellular localization, turnover, and protein quality. Proteoglycans mediate cell communication, growth factor sequestration, microbial recognition, chemokine and cytokine activation, tissue morphogenesis during embryonic development, cell migration, and proliferation. Nature has exploited the tremendous possibilities offered by the sugar code by elaborating and breaking
down very specific complex carbohydrates in a highly specific manner. Exquisite details of complex carbohydrates create immense functional differences. For instance, cellulose and amylose, two simple polymers of glucose residues linked between their position 1 and their position 4, only differ by the equatorial vs. axial orientation of the glycosidic bond (β for cellulose and α for amylose). This minute difference gives rise to two massively different polysaccharides: cellulose, whose mechanical properties rival those of steel, is synthesized by plants as a structural polysaccharide notoriously recalcitrant to hydrolysis, while amylose is a component of starch and, as a reserve carbohydrate, is readily converted to glucose.

Carbohydrate-Active Enzymes and Their Classification

Carbohydrate-active enzymes (CAZymes) catalyze selective reactions to assemble and break down complex carbohydrates and glycoconjugates for a large array of biological functions globally underpinning glycobiology. These enzymes, which comprise glycoside hydrolases (GH), polysaccharide lyases (PL), carbohydrate esterases (CE), and glycosyltransferases (GT), have gradually evolved from a limited number of primordial carbohydrate-active enzyme-coding genes by acquiring novel specificities at substrate and product level. In addition, these enzymes often display a modular structure with a catalytic module appended to one or several other domains, such as carbohydrate-binding modules, allowing for increased specificity and/or specific targeting to a particular substrate/region (Boraston et al. 2004).

The sequence-based classification of CAZymes was initiated in 1991 (Henrissat 1991; Henrissat and Bairoch 1993, 1996) as a complement to the long-standing Enzyme Commission (EC) number system (http://www.chem.qmul.ac.uk/iubmb/enzyme/), which is based solely on enzyme activities. Given the prevalence of convergent evolution of enzymes that cleave glycosidic bonds, as well as the demonstrable catalytic promiscuity of individual enzymes, sequence-based classification has proven to be a robust way to unify information on enzyme structure, specificity, and mechanism, which provides significant predictive power. Initially motivated by a need to delineate cellulases (EC 3.2.1.4) into distinct structural families, the first incarnation of the GH family classification, as such, comprised 35 GH families (Henrissat 1991). Five years later, the number of GH families had grown to 57 (Henrissat and Bairoch 1996), and it has continuously expanded to reach 113 in 2009 (Cantarel et al. 2009). As of March 2012, 130 sequence-based families of GHs have been defined and are presented in the continuously updated CAZy database (http://www.cazy.org/). In parallel with the development of the classification of GH families, sequence-based classifications of the glycosyltransferases (GTs) (Campbell et al. 1997), polysaccharide lyases (PLs) (Lombard et al. 2010), carbohydrate esterases (CEs) (Cantarel et al. 2009), and carbohydrate-binding modules (CBMs) (Boraston et al. 2004) have similarly been developed and added to the CAZy database.

Functional Prediction of Carbohydrate-Active Enzymes

The immense variety of carbohydrate structures and their involvement in extremely different biological functions mean that functional annotations such as “putative carbohydrate-active enzyme” or “putative glycosidase” have very limited information value. Instead, a useful functional prediction for a CAZyme should indicate the likely nature of the sugar being cleaved or transferred, with a description of the exact connectivity between the sugar undergoing catalysis and the molecule it is attached to or detached from.

A feature that was recognized very early on was that the sequence-based families of carbohydrate-active enzymes group together enzymes of differing substrate specificity and hence group together enzymes with different EC numbers (Henrissat 1991; Campbell et al. 1997). Because of the multifunctional nature of these enzymes, it is believed that a limited number of catalytic and binding progenitors (protein domain families), which can be found in different combinations, gave rise to the vast number of enzymes and of carbohydrate structures that exist in
modern organisms, resulting in the gradual and simultaneous acquisition of exquisite substrate specificity for both carbohydrate biosynthesis and carbohydrate degradation. Since most CAZyme protein domain families are multifunctional, prediction of functional roles for uncharacterized carbohydrate-active enzyme-encoding genes simply by family assignment can lead to erroneous annotations, especially at high sequence divergence. Additionally, the universe of known carbohydrate structures with the same types of linkage bonds is smaller than the universe of possibilities; therefore, even when functions are known, there are potentially more possible substrates. As a result, the number of sequences that can be assigned to CAZy families increases very rapidly, but the number of CAZymes whose substrate specificity has been established (even roughly) grows at a much slower pace. As sequencing data grows with increasing genomic and metagenomic characterization, this proportion of characterized enzymes continues to decrease. In spite of limitations due to the presence of different substrate specificities in many CAZyme families, it is often possible to assign a broad substrate category (for instance, pectin, cellulose, xylan) to a number of CAZyme families (Cantarel et al. 2012), even if the precise substrate or product specificity (for instance, to distinguish between endo- and exo-acting enzymes or to distinguish between β-D-xylosidase and α-L-arabinofuranosidase) cannot be predicted accurately based on simple family assignment. In order to improve functional prediction, the partition of CAZyme families into subfamilies based on phylogenetic analysis has been explored. Significantly, subfamily classification of several families of GHs and PLs has shown that the majority of the defined subfamilies were monospecific, thus indicating a better correlation of substrate specificity between sequences at the subfamily level than at the family level (Lombard et al. 2010; St. John et al. 2010; Stam et al. 2006).

The advent of low-cost DNA sequencing has revolutionized biology, and the central question is no longer how to obtain nucleotide sequence, but how to make sense of it. In the vast majority of cases, inferences are made by detecting sequence similarity between the newly generated DNA sequence (or putatively encoded protein) and sequences already in databases. This approach does not perform equally with different classes of proteins in terms of the biological inference that can be derived. For instance, assignment to families of proteases/peptidases often has limited predictive power: the predictions are often based only on the fold, the most informative information being essentially that of the catalytic machinery (for instance, “serine protease”), with little predictive power as to the specific peptide substrate targeted by the enzyme. Thus, the very difficulty with CAZymes (huge structural and functional variety of substrates) is also at the origin of their intrinsic advantage: these enzymes had to evolve to achieve the exquisite specificity necessary to carry out their function in a selective manner. The high information content of complex carbohydrates has therefore translated into the proteins that assemble and deconstruct them by leaving evolutionary signals/traces that can be recognized in the sequence. While experimental developments in the field of glycomics are still slow in comparison to the boom in sequencing technologies, carbohydrate-active enzymes are perhaps the most adapted to functional inference from genomic and metagenomic data.

The direct genetic sequencing of microbial communities (metagenomics) is beginning to explore the great gene diversity in the microbial world. Environmental samples from diverse environments are being studied to better understand the role of microbes in various habitats, from the human body to the ocean floor. This technology has allowed scientists to begin to answer questions that could not be answered by studying only cultivable species. Here we review the burgeoning exploration of carbohydrate-active enzymes in metagenomic samples.

Glycobiology in Microbial Communities

Microbial communities isolated from human fecal material are the best studied with respect to CAZyme usage. CAZyme diversity in human gut microbiota studies (Gill et al. 2006;
Mahowald et al. 2009; Turnbaugh et al. 2010) showed 81 glycoside-hydrolase families, making the human gut one of the richest sources of CAZymes. Some of these studies aim to determine the relationship between CAZyme utilization and disease (Turnbaugh et al. 2009) and diet, including a vegetarian diet (Kabeerdoss et al. 2011) or differential fiber intake (Tasse et al. 2010). Taxonomic and genetic differences were found between omnivores and vegetarians, suggesting that energy intake, complex carbohydrate degradation, and production of butyrate (a product of dietary fiber fermentation) were higher in omnivores, with these functions associated with an increase in certain Clostridiales, such as Clostridium, Roseburia, and Eubacterium rectale (Kabeerdoss et al. 2011). Gene clusters involved in dietary fiber degradation have been shown to be larger than clusters involved in starch degradation and often contain genes involved in carbohydrate transport and binding (Tasse et al. 2010). Studies of the human gut have also revealed possible lateral gene transfer between marine and human microbes interacting in the human digestive system (Hehemann et al. 2010), in order to increase algae cell wall degradation.

While the human gut microbiota has been the subject of most studies, studies of other animal gut microbiota reveal an evolutionarily driven composition of gut microbiota. Thus mammalian gut microbiome composition and functional capabilities are likely driven by diet, such that carnivores, no matter the mammalian phylogeny, contain organisms and have bacterial functions for using proteins as an energy source, compared to herbivores, whose gut microbiota aims to convert complex plant carbohydrates into energy (Pope et al. 2010; Muegge et al. 2011; Zhu et al. 2011). The digestive microbiota of herbivores is being actively explored for the discovery of novel enzymes for the conversion of plant biomass to biofuels (Brulc et al. 2009; Duan et al. 2009; Suen et al. 2010). In a similar vein, the termite gut microbiota revealed a wealth of CAZymes involved in degrading wood polysaccharides (Matteotti et al. 2011; Warnecke et al. 2007).

In a comparison of carbohydrate-active enzymes in all human body sites (Cantarel et al. 2012), differences in abundance of CAZymes were identified between all major sites. In general, digestive sites and particularly stool contained the highest number of CAZy families and the highest abundance of CAZymes. These sites have a higher abundance of CAZymes involved in plant and algae degradation. GH94 (cellobiose, cellodextrin, and chitobiose phosphorylases) and GH30 (β-1,6-glucanase, β-xylosidase, β-D-fucosidase, β-glucosidase, and β-1,6-galactanase) are statistically more abundant in stool compared to the other four major body habitats. Oral sites appear to specialize in starch and glycogen degradation, as these functions are enriched in oral habitats compared to the stool. Vaginal microbial communities are enriched in CAZymes involved in sucrose cleavage and polymerization to fructans, potentially important for biofilm formation. Overall, the functional profiles are more similar within a body habitat than between habitats, even when the taxonomic profiles differ, suggesting functional adaptation of the community to the carbohydrates prevalent in the environment.

Practical Issues in Mining Metagenomes for CAZymes

In practice, annotation of CAZymes is not completely trivial. First, for historical reasons some CAZy families are distantly related, meaning there are sequences that share statistically significant sequence similarity with multiple families in the same region. These families are therefore grouped into “clans,” similarly to PFAM clans. Therefore, family assignment of these proteins is ambiguous, suggesting these proteins are general members of the clan with broad functional predictions. Second, CAZymes are modular, meaning these enzymes are often composed of multiple protein domains connected by linker regions. These domains can combine in a variety of permutations to form or give rise to diverse functions, e.g., a carbohydrate-binding domain can be attached to multiple catalytic domains to facilitate substrate specificity. Therefore, accurate annotation requires
comparison to a database of domains, rather than whole proteins. The length of the domains can vary greatly, from as little as 30 amino acids to several hundred residues, so when strict expectation value (E-value) thresholds (calculated by BLAST or HMMER), such as E-value < 1e-6, are imposed, the false-negative rate increases for the identification of distantly related members of short domain families. For metagenomics, these challenges are amplified since metagenomic gene predictions (i) are often fragmented, (ii) are often too short to contain multiple domains, and (iii) could be from organisms with few or no close evolutionary relatives.

As previously discussed, as sequencing coverage of microbial communities (in metagenomes) increases, these challenges of metagenomics will diminish and CAZy family assignment as well as domain structure prediction will gradually improve. The hardest problem to resolve will thus be the precise prediction of the substrate/product specificity of CAZymes. Such predictions require close relatedness between the query and at least one biochemically characterized CAZyme. Accelerating the pace of exploration of the sequence-to-specificity space of CAZymes is key to a leap toward accuracy, and this will require new experimental innovations in the field of carbohydrates and coupling computer-guided high-throughput functional investigations to structural genomics initiatives. Yet, notwithstanding the accuracy of functional prediction, the interpretation of carbohydrate-active enzyme profiles resulting from metagenomic investigations will require expertise in complex carbohydrate assembly and breakdown.

Summary

The advent of low-cost DNA sequencing has revolutionized biology, and the central question is no longer how to obtain nucleotide sequence, but how to make sense of it. Functional predictions start by sequence comparisons against databases of known annotated genes and proteins. However, there are two major caveats to this approach: (i) sequence similarity does not imply the same function and (ii) annotations on known sequences have varying degrees of accuracy depending on the level and quality of evidence, which ideally relies on experimental validation, but is often based on sequence similarity. Creating accurate annotations becomes complex in multifunctional protein families. Because the diversity in carbohydrate structure is large and the number of protein families acting on sugars is limited, carbohydrate-active enzyme gene families are often multifunctional and specificity is mediated by additional structural or carbohydrate-binding modules. The Carbohydrate-Active Enzymes database (CAZy; Cantarel et al. 2009) provides an expert-curated resource for the glycobiology community, whereby annotations and their underlying evidence are documented.

Cross-References

▶ A 123 of Metagenomics
▶ Human Gut Microbial Genes by Metagenomic Sequencing
▶ Mining Metagenomic Datasets for Cellulases

References

Boraston AB, Bolam DN, Gilbert HJ, Davies GJ. Carbohydrate-binding modules: fine-tuning polysaccharide recognition. Biochem J. 2004;382(Pt 3):769–81.
Brulc JM, Antonopoulos DA, Miller ME, Wilson MK, Yannarell AC, Dinsdale EA, et al. Gene-centric metagenomics of the fiber-adherent bovine rumen microbiome reveals forage specific glycoside hydrolases. Proc Natl Acad Sci U S A. 2009;106(6):1948–53.
Campbell JA, Davies GJ, Bulone V, Henrissat B. A classification of nucleotide-diphospho-sugar glycosyltransferases based on amino acid sequence similarities. Biochem J. 1997;326(Pt 3):929–39.
Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, Henrissat B. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for glycogenomics. Nucleic Acids Res. 2009;37(Database issue):D233–8.
Cantarel BL, Lombard V, Henrissat B. Complex carbohydrate utilization by the healthy human microbiome. PLoS One. 2012;7(6):e28742.
Duan CJ, Xian L, Zhao GC, Feng Y, Pang H, Bai XL, et al. Isolation and partial characterization of novel genes encoding acidic cellulases from metagenomes of buffalo rumens. J Appl Microbiol. 2009;107(1):245–56.
Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–9.
Hehemann JH, Correc G, Barbeyron T, Helbert W, Czjzek M, Michel G. Transfer of carbohydrate-active enzymes from marine bacteria to Japanese gut microbiota. Nature. 2010;464(7290):908–12.
Henrissat B. A classification of glycosyl hydrolases based on amino acid sequence similarities. Biochem J. 1991;280(Pt 2):309–16.
Henrissat B, Bairoch A. New families in the classification of glycosyl hydrolases based on amino acid sequence similarities. Biochem J. 1993;293(Pt 3):781–8.
Henrissat B, Bairoch A. Updating the sequence-based classification of glycosyl hydrolases. Biochem J. 1996;316(Pt 2):695–6.
Kabeerdoss J, Shobana Devi R, Regina Mary R, Ramakrishna BS. Faecal microbiota composition in vegetarians: comparison with omnivores in a cohort of young women in southern India. Br J Nutr. 2011;20:1–5.
Laine RA. A calculation of all possible oligosaccharide isomers both branched and linear yields 1.05 × 10^12 structures for a reducing hexasaccharide: the isomer barrier to development of single-method saccharide sequencing or synthesis systems. Glycobiology. 1994;4(6):759–67.
Lombard V, Bernard T, Rancurel C, Brumer H, Coutinho PM, Henrissat B. A hierarchical classification of polysaccharide lyases for glycogenomics. Biochem J. 2010;432(3):437–44.
Mahowald MA, Rey FE, Seedorf H, Turnbaugh PJ, Fulton RS, Wollam A, et al. Characterizing a model human gut microbiota composed of members of its two dominant bacterial phyla. Proc Natl Acad Sci U S A. 2009;106(14):5859–64.
Matteotti C, Haubruge E, Thonart P, Francis F, De Pauw E, Portetelle D, et al. Characterization of a new beta-glucosidase/beta-xylosidase from the gut microbiota of the termite (Reticulitermes santonensis). FEMS Microbiol Lett. 2011;314(2):147–57.
Muegge BD, Kuczynski J, Knights D, Clemente JC, Gonzalez A, Fontana L, et al. Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science. 2011;332(6032):970–4.
Pope PB, Denman SE, Jones M, Tringe SG, Barry K, Malfatti SA, et al. Adaptation to herbivory by the tammar wallaby includes bacterial and glycoside hydrolase profiles different from other herbivores. Proc Natl Acad Sci U S A. 2010;107(33):14793–8.
St John FJ, Gonzalez JM, Pozharski E. Consolidation of glycosyl hydrolase family 30: a dual domain 4/7 hydrolase family consisting of two structurally distinct groups. FEBS Lett. 2010;584(21):4435–41.
Stam MR, Danchin EG, Rancurel C, Coutinho PM, Henrissat B. Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of alpha-amylase-related proteins. Protein Eng Des Sel. 2006;19(12):555–62.
Suen G, Scott JJ, Aylward FO, Adams SM, Tringe SG, Pinto-Tomas AA, et al. An insect herbivore microbiome with high plant biomass-degrading capacity. PLoS Genet. 2010;6(9):e1001129.
Tasse L, Bercovici J, Pizzut-Serin S, Robe P, Tap J, Klopp C, et al. Functional metagenomics to mine the human gut microbiome for dietary fiber catabolic enzymes. Genome Res. 2010;20(11):1605–12.
Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–4.
Turnbaugh PJ, Quince C, Faith JJ, McHardy AC, Yatsunenko T, Niazi F, et al. Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins. Proc Natl Acad Sci U S A. 2010;107(16):7503–8.
Warnecke F, Luginbuhl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450(7169):560–5.
Zhu L, Wu Q, Dai J, Zhang S, Wei F. Evidence of cellulose metabolism by the giant panda gut microbiome. Proc Natl Acad Sci U S A. 2011;108(43):17714–9.

Challenge of Metagenome Assembly and Possible Standards

Matthew B. Scholz¹, Chien-Chi Lo¹ and Patrick Chain²
¹Genome Science Group, Los Alamos National Laboratory, Los Alamos, NM, USA
²Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA

Introduction

As technology and methodology have allowed for more advanced assemblies of metagenomes, the need for commensurate assignment of quality to these assemblies has become evident. There are currently no set standards for describing the quality of sequencing, assembly, or analysis of metagenomic assemblies. Uncorrected, this may
lead to faulty conclusions based on assumptions that the assembly is more or less accurate, or representative of the sample, than it truly is. This need is similar to, but far more complex than, the dilemma that faced the microbial sequencing and assembly community as more and more genomes were sequenced with new technologies and assembled with novel algorithms.

For bacterial genomes, the quality of assembly and finishing efforts has been standardized for several years, resulting in a much better understanding of the types of analyses that can be performed on each level of finished genome and the resulting value. While the need for standards in metagenomics has been very clear in terms of metadata (Yilmaz et al. 2011), less attention has been focused on the genomic data itself. As the field continues to advance and mature, it is clear that efforts in standardizing assemblies as well as functional and phylogenetic classification are sorely needed. Given the flux in application of various sequencing technologies (with different and sometimes variable sequence qualities) to genome reconstruction, the most recent version of this standard for microbial genomes is to divide sequences into broad levels of completeness and quality, from draft to completely finished (Chain et al. 2009). While these standards are valuable, it is difficult to apply similar standards to metagenomic assemblies, where the effort is to reconstruct the genomes (or parts of genomes) of many organisms present within a sample. The number of different species in a community selected for sequencing and metagenomic assembly can vary from two to millions of individual genomes from many species, with varying frequency of each genome. Additionally, each genome may be different in size, G+C content, and repetitiveness, as well as have other genome-specific issues that make it impossible to assemble all genomes equally well.

Additionally, community genomics are complicated by the potential existence of many strains of the same species (species of the same genus, etc.), of recombination, of horizontal gene transfer events among members of the community, and of other factors that further complicate analysis and assembly. All of these factors underlie the highly variable nature of metagenomes, making it difficult to generate accurate assemblies and also difficult to define standards or otherwise grade the effectiveness of an assembly. It can be reasonably stated that metagenomic assembly is still in its infancy and generally produces what can only be described as draft assemblies of metagenomic data, though it has certainly been possible in some rare cases to recover full and near-complete genomes from some environments (Huttenhower et al. 2012).

The utility of sequencing a community sample is based solely on the ability of the researcher to garner useful information from the data (assembly and annotation). This ability is, more often than not, reflective of the “quality” of a metagenome assembly. Additionally, the goals of a given project can also affect this question, by altering the types of analysis needed or the depth of sequencing required, among many factors. The gross differences and determinations of sequence diversity between two or more metagenomes can typically be analyzed using simple comparative tests, whereas a more in-depth analysis for gene content or for application to proteomic (peptide mass prediction) or metabolic pathway (function and operon prediction) analysis requires larger assembled regions of contiguous sequence (contigs), with low error rates.

While it is not possible to set de facto standards for metagenomic analysis or assembly, this entry is an attempt to discuss a number of the potential impediments to adequate or good metagenome assembly. Additionally, several possible methods for improvement and validation of assembled contig sets from metagenomic assemblies, as well as potential methods for generating higher quality draft metagenomes from an individual sample, will be discussed.

Barriers to Metagenomic Assembly

As has been addressed previously, metagenome assembly is difficult, requiring ever-increasing computational resources at a rate fast outpacing
“Moore’s law” (Miller et al. 2010; Scholz et al. 2012). This is a product of the limitations of current sequencing technologies coupled with the available assembly algorithms, which can falter when running into the massive scale of data produced by current next-generation sequencers (NGS). For metagenomic sequencing and assembly, there is a paradoxical problem with data. While the relatively low-cost, highest-throughput sequencers produce hundreds to thousands of gigabytes of data per run, the short read lengths of these technologies limit the types of assembly procedures that can be applied (Scholz et al. 2012). Due to the diversity of genomes and the variation between members of a community, a good assembly of a metagenome requires significantly more sequencing (potentially terabases of data per sample for some environments). In direct opposition to this requirement, the current state-of-the-art assemblers for NGS data are limited by available computational memory, meaning that, currently, computers are only capable of assembling as little as 1 % of the data required for the most complex of samples. The computational time, processing power, and required system memory for assembling any genome using state-of-the-art assembly algorithms (Miller et al. 2010) are directly proportional to the size and complexity of the genome(s) to be assembled (and partly coupled with errors introduced during the sequencing process). In the case of metagenomes, then, the first approximation would be that the requirements for assembly are a function of the number of unique bases (or unique “words” or k-mers) in all genomes contained within the community to be sequenced. This easily overshadows even the largest and most complex eukaryotic genomes, making assembly of all microbial genomes within a single metagenomic sample, given today’s infrastructure and algorithms, infeasible. This variation and computational limitation lead to a variable amount of data that can be incorporated

sequenced and increasing the percentage of reads incorporated, affecting the slope of the curve. In short, the more complex communities, such as those found in soils and sediments, will require much greater sequencing inputs to allow for assembly of a significant proportion of the data. Conversely, simpler communities, such as bioreactors, enrichment cultures, and naturally simple environments, can achieve nearly 100 % incorporation of data even with relatively few reads (<200 million Illumina reads).

Assembling Subsets of Data

To allow current assemblers to better process the mountains of data, it is generally believed that dividing reads into smaller, categorical bins may enable improved, or “targeted,” assembly. While this partitions the data into manageable parcels for assembly, it has also been used as a filtering method, to remove extraneous reads from the dataset pre-assembly (Godoy-Vitorino et al. 2012). Several very thorough methods have been developed for binning of reads or contigs. Binning can be performed as a function of nucleotide frequencies, or abstractions of these (k-mer-based filtering, etc.), on statistical analysis of read relationships (read topology), or on similarity to known genomes or genome signatures (homology, etc.). Additionally, HMMs or other learning algorithms may eventually be developed to allow rapid binning of reads. However, once binning is performed, many of the issues surrounding assembly of many sequences again become relevant and require in-depth analysis and work. As each binning method will invariably introduce both false positives and false negatives, it is not clear what effect these algorithms may have on a “final” assembly or if the effect will be consistent among different samples.
into any given assembly. It can be expected that Whole Sample Assembly
read incorporation into metagenome assemblies
will follow a logarithmic curve, with the amount Full metagenome sequence runs, or bins of
of available sequence data covering more of the metagenomic data, are run through an assembly
diversity and complexity of the community being methodology or program; however most current
algorithms are designed for isolate genome assembly (Miller et al. 2010; Scholz et al. 2012).
While isolate genome assembly assumes that there are a limited number of solutions to the
assembly, as the complexity of the genome increases and a concomitant amount of sequence data is
required, these decisions become more difficult for algorithms to make. For metagenomes, there
are additional complications, such as strain-level variation within a species, varying levels
of similarity among the multiple species within the population, including horizontal gene
transfer, and the ever-present problem of variability of organism frequency/abundance within
the community. Several recent attempts have been made (e.g., MetaVelvet (Namiki et al. 2012),
Ray (Boisvert et al. 2010), or Meta-IDBA (Peng et al. 2011)) to solve one or more of these
metagenome-specific issues. However, there is not yet a perfect algorithm, and all can benefit
from improved understanding of the inherent complexities within metagenomes as well as from
improved algorithms for determining which data are to be examined and how. Given the varied
nature of the complexities that exist in communities, it is likely that the perfect assembly
algorithm will have to evaluate the data and make decisions during metagenome assembly.

Assembly Validation and Metrics

Validation of metagenomic assemblies is not currently a standard process. Some efforts have
focused on validation using tools adapted from single-genome assemblies which, due to the
differences in complexity, can vary from being simply an inefficient method for validation at
best to being misleading and based on incorrect assumptions at worst. Validation of assembly
completeness (good assemblies provide large contigs with more of the raw data) and accuracy
(good assemblies harbor few errors, such that they are a close representation of the target
organism) is a nuanced and nebulous process even with isolate genomes. A typical series of
statistical properties of contigs is often used to describe the goodness of single-genome
assembly (N50, total assembly size, etc.). To improve accuracy, this can be combined with manual
inspection of the data underlying the contigs, using tools such as Consed (Gordon 2003),
Hawkeye (Schatz et al. 2011), or other alignment viewers. It bears noting here that these tools
require sequential attention to each individual contig, making validation of metagenomic
assemblies of many thousands to millions of contigs prohibitively time-consuming.

Additional validation can be gained by read mapping input sequence data to contigs to identify
errors or areas with unexplained variances in coverage. None of the tools or processes available
for validation of single-genome assemblies is directly applicable to metagenomes; each requires
either significant alterations in method or a completely new approach. This is due in part to
the much larger amount of data required for a metagenome assembly as well as the intrinsic
complexities associated with metagenomes, mentioned above.

How, then, does one assess whether a metagenomic assembly is good or valid? It is possible to
examine statistics of an assembly to determine if it has value and assess the quality of
assembled contigs to give a measure of what analysis can be performed on the assembled data
(e.g., longer contigs allow for more annotation analysis). Additionally, it is easy to calculate
the total number of bases assembled, allowing for a rough estimate of how many genomes may be
captured in an assembly. However, it is also important to validate that the assembly is an
accurate representation of the input data by use of read mapping or other comparative tools.
For metagenomes, with the stated issues of lack of uniformity, it is likely that a valuable tool
for obtaining improved assemblies will be to perform several assemblies in parallel and compare
inter-sample assemblies to each other. This will also be an important method to compare the
results of binning and of different assembly methods to combined, or iterative, assemblies of
the entire dataset. For environments that have been amply studied and for which there are
a number of pertinent reference sequences, such as for
human microbiome samples (Lampe 2008; Huttenhower et al. 2012; Methe et al. 2012), it is
possible to use these to validate assemblies. Recent work with isolation and sequencing of
single cells from within environmental samples raises the possibility of using reference-based
validation tools on metagenome samples as well (Kant et al. 2011; Leung et al. 2012).

Statistical Comparisons

As mentioned above, the first approximation of the quality of any assembly is an examination of
the metrics associated with the assembly. For metagenomes, these metrics should be different
from those used for isolate genome work. Statistics that are linked to the total assembly size
(e.g., N50) have little value, as the size of the metagenome, the assembly (and the assembled
number of contigs), and the choices made for assembly (binning, filtering, assembler algorithm,
Kmer size, etc.), which can affect the number and types of bases included in a metagenomic
assembly, can all result in drastically different interpretations. The evaluation of metagenome
assemblies is often conducted in a holistic manner, utilizing a number of important statistics
and validation metrics. This can be used to assess the completeness of various assembly methods.
Table 1 shows a selection of assembly statistics for a single sample (MH0001) from the MetaHIT
project using the assembly program SOAPdenovo with different Kmers as an input parameter. What
is important to note is that it is difficult to select the best assembly based on any single
metric, even given the same assembler with a single parameter change. In fact, it is the rule
rather than the exception that no single assembly of the data can provide the best statistics
for every metric.

Read Mapping as Contig Validation

It is important that any assembly be verified by methods beyond simple statistics. It is also
important that validation algorithms be independent from those utilized to perform the assembly.
Currently, Burrows-Wheeler-based read mapping (Langmead et al. 2009; Li and Durbin 2010) can
serve as an independent approach to validating the assembled contigs against the raw sequencing
data. This approach has the ability to validate assembled contigs on the basis of coverage of
every base contained within the contig (Fig. 1) as well as on the variation of coverage within
the contig (Fig. 2).

It may not always be the case that coverage along a contig will appear as even as with isolate
genomes, due to the issues of strain (allele) variations, of gene duplication, of ribosomal gene
similarities between species, and of horizontal gene transfer. Additionally, because read
Challenge of Metagenome Assembly and Possible Standards, Table 1 Statistical metrics of
metagenome assembly

Assembly type        Number of   Maximum       Total bases   Bases in largest   Bases in contigs   % read
                     contigs     contig size                 100 contigs        >10 kb             incorporation
SOAPdenovo-Kmer 21   378,624     18,148        63,350,050    1,025,623          438,734            60.8
SOAPdenovo-Kmer 23   303,536     18,150        55,682,346    1,155,330          839,420            61.1
SOAPdenovo-Kmer 25   244,200     25,192        47,972,706    1,220,072          949,421            60.6
SOAPdenovo-Kmer 27   188,074     23,935        40,311,428    1,162,160          843,499            59.9
SOAPdenovo-Kmer 29   140,502     28,068        33,228,335    1,177,230          804,055            58.9
SOAPdenovo-Kmer 31   109,722     28,068        27,463,402    1,245,286          918,627            57.8
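
The length-based columns of Table 1 can be computed directly from a list of contig lengths.
The following is a minimal sketch in Python, not the procedure used to produce the table; the
function name `assembly_metrics`, its parameters, and the dictionary keys are hypothetical
names chosen for illustration. N50 is included only because the text mentions it, and percent
read incorporation is omitted because it requires read mapping rather than contig lengths.

```python
def assembly_metrics(contig_lengths, top_n=100, large_cutoff=10_000):
    """Summarize contig lengths with the length-based metrics shown in Table 1.

    All names here are illustrative assumptions, not part of any published tool.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)

    # N50: walking from the longest contig down, the length of the contig at
    # which the running sum first reaches half of the total assembled bases.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    return {
        "number_of_contigs": len(lengths),
        "maximum_contig_size": lengths[0] if lengths else 0,
        "total_bases": total,
        "bases_in_largest_contigs": sum(lengths[:top_n]),
        "bases_in_large_contigs": sum(l for l in lengths if l > large_cutoff),
        "n50": n50,
    }


# Toy input: five contigs, reporting the two largest instead of the largest 100.
example = assembly_metrics([12_000, 8_000, 5_000, 3_000, 2_000], top_n=2)
print(example)
```

Running the same function over several assemblies of one sample gives one row per assembly,
which makes it easy to see that no single assembly is best on every metric.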
Challenge of Metagenome Assembly and Possible Standards, Fig. 1 Coverage histogram of
metagenome assembly. Displays percentage coverage of every contig as a function of the contig
length
[Figure: “Contigs Coverage vs. Contigs Length”; x-axis, Contigs Length (bp), 0 to 25,000;
y-axis, Contigs coverage (%), 40 to 100; annotation, Total Coverage: 99.42]

Challenge of Metagenome Assembly and Possible Standards, Fig. 2 Base-by-base coverage histogram
of a single contig generated within a metagenome assembly. Areas where coverage varies from the
mean may be identified as regions of low quality or confidence
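
The two coverage views in Figs. 1 and 2 (percentage of a contig covered, and base-by-base depth
along a contig) can be sketched from read alignment intervals. The code below is an
illustrative Python sketch only: it assumes alignments are already available as 0-based,
end-exclusive (start, end) pairs on one contig, and the function names and the `low`/`high`
thresholds relative to the mean are arbitrary assumptions, not values taken from the text.

```python
def per_base_depth(contig_length, alignments):
    """Depth at every base of a contig from (start, end) read alignments.

    Coordinates are assumed to be 0-based with an exclusive end.
    """
    delta = [0] * (contig_length + 1)
    for start, end in alignments:
        start, end = max(start, 0), min(end, contig_length)
        if start < end:
            delta[start] += 1   # a read begins covering here
            delta[end] -= 1     # and stops covering here
    depth, running = [], 0
    for change in delta[:contig_length]:
        running += change
        depth.append(running)
    return depth


def percent_covered(depth):
    """Percentage of bases with at least one read mapped (the Fig. 1 style value)."""
    if not depth:
        return 0.0
    return 100.0 * sum(1 for d in depth if d > 0) / len(depth)


def flag_regions(depth, low=0.5, high=2.0):
    """Return (start, end) runs whose depth is below low*mean or above high*mean."""
    if not depth:
        return []
    mean = sum(depth) / len(depth)
    regions, run_start = [], None
    for i, d in enumerate(depth):
        outlier = d < low * mean or d > high * mean
        if outlier and run_start is None:
            run_start = i
        elif not outlier and run_start is not None:
            regions.append((run_start, i))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(depth)))
    return regions


# Toy contig of 10 bases with three mapped reads; the last two bases are uncovered.
depth = per_base_depth(10, [(0, 6), (2, 8), (4, 8)])
print(depth, percent_covered(depth), flag_regions(depth))
```

In practice the intervals would come from an aligner's output; flagged runs are candidates for
inspection rather than confirmed errors, since uneven coverage is expected in metagenomes.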
mapping is fundamentally different from Kmer-based assemblies, short contigs will generally
have poorer coverage, when considering the percentage of total bases in the contig. This is due
to a so-called edge-effect that prevents a read from mapping to a contig if the read-to-contig
alignment ends in the middle of the read yet at the end of the constructed contig. However, due
to the
speed and accuracy of Burrows-Wheeler style aligners, this method of validation is both rapid
and sufficiently accurate to allow reasonable certainty that an assembly is valid and that the
contigs represent the genomes present within the sample. Finally, read mapping can be combined
with a number of other tools such as SAMtools (Li et al. 2009) to locate possible population
differences such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels),
and other assembly errors within the contigs. This allows assemblies to be validated and
potentially improved in an unsupervised manner based on the alignment of reads as well as to
make empirical judgments of assembly quality.

Comparisons of Multiple Assemblies

Beyond statistical comparisons of multiple assemblies and evaluation using the raw input data,
it is also possible to determine how similar two assemblies of the same initial data are. For
example, for the entries listed in Table 1, there is no guarantee that the largest contigs from
each assembly are the same or that the contigs have been recapitulated in the various other
assemblies. The mechanisms for comparing two contig sets to each other are evolving and can
range from full assembly alignments using BLAST- (McGinnis and Madden 2004) or NUCmer-based
comparisons (Delcher et al. 2002) to protein coding content-based analyses to more
sophisticated methods. In the future, training of better assembly pipelines may involve
evaluating differences among several results in terms of possible rearrangements, SNPs, indels,
and errors in joining repetitive regions to determine if one methodology can be considered
consistently better than another.

Generalized References and Site-Specific References for Validation

The recent explosion in sequencing capacity has resulted in an ever-increasing number of draft
and finished reference bacterial genomes. These genomes are useful both for phylogenetic and
functional classification and for validation of assembly. When it is known or suspected that
a particular organism is present within a sample (e.g., Rhizobium spp. are expected in
rhizosphere samples, while Escherichia coli is generally found in fecal samples), alignments
against such references can be used to validate contigs that are generated from the
metagenomic sample in question.

In the future, reference-based approaches may be best utilized in a sample-specific manner to
both contribute to and help validate metagenomic assemblies by using draft reference genomes
generated via single-cell (or microcolony) isolation from the same site, followed by
amplification and sequencing. The advent of multiple displacement amplification to allow for
the sequencing of minute quantities of DNA, including single cells or clusters of cells, shows
great promise for metagenomic projects by allowing the inclusion of sample-specific genomes to
be used in reference-based assembly methods.

Metagenome Assembly Standards: A Proposed Tiered System

Because the field is nascent, the methodology for metagenome assembly is still in great flux.
Currently available tools are able to produce valid, useful assemblies of some fraction of any
metagenomic sample. However, these assemblies must be considered as a set of draft contigs
only, particularly if no form of validation has been performed. Read-based validation can be
used to inform and improve on assemblies; however, this is a time-consuming process and should
not be expected to be a long-term, high-throughput solution for metagenome assemblies. This
does not obviate the need for validation protocols; it merely highlights the lack of
algorithmic approaches to the technique.

There are several promising areas of assembly investigation that could produce assemblies
distinguishable from draft or validated draft metagenomic assemblies. These include the use
Challenge of Metagenome Assembly and Possible Standards, Table 2 Proposed statistical reporting
metrics for metagenome assembly

Proposed metric                    Description
Percent of read incorporation      Percentage of read mapping or incorporated into assembly. This
                                   serves as a metric as to how much additional sequencing may be
                                   required for better assembly
Size of metagenome assembly        Number of base pairs included in the final assembly. This is
                                   a measure of how many genomes may have been assembled and can
                                   be utilized to determine what additional sequencing will be
                                   allowed in terms of additional sequence data incorporation
Largest contig size                This is typically a measurement of how well the most abundant
                                   organism assembled
Number of bases in large contigs   This measurement is similar to largest contig size but also
                                   allows depth of analysis to potentially include less
                                   well-assembled species
Fold coverage histogram            A histogram describing the number of bases covered at a given
                                   fold coverage. This will indicate the variation between
                                   abundant and non-abundant organisms

Challenge of Metagenome Assembly and Possible Standards, Table 3 Classification of assembly
methods for metagenomes. Reporting would ideally describe both classification and statistics
described in Table 2

Quality                            Description
Draft                              One assembler, one parameter
Quality draft (QD)                 Multiple assemblers, multiple parameters, merging-based final
                                   assembly
Binning assisted (HQD)             Multiple parameter assemblies performed by binning of reads
                                   into subsets, followed by merging-based final assembly
Reference-guided RHQD              Binning based on reference sequences, followed by HQD assembly
Location-specific                  Reference-guided assembly including sequencing and assembly of
reference-guided assembly          individual isolate, single cells, or microcolony-based
                                   organisms isolated from the same environment as the
                                   metagenome sample in question
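
Two of the Table 2 metrics, the fold coverage histogram and the percent of read incorporation,
can be sketched in a few lines of Python, together with a simple report that attaches one of
the Table 3 classifications. This is a hedged illustration only: the function names, the
`build_report` record layout, and the assumption that per-base depths and read counts are
already available from read mapping are all hypothetical and not prescribed by the entry.

```python
from collections import Counter

# Classification labels copied from Table 3.
CLASSIFICATIONS = (
    "Draft",
    "Quality draft (QD)",
    "Binning assisted (HQD)",
    "Reference-guided RHQD",
    "Location-specific reference-guided assembly",
)


def fold_coverage_histogram(depths):
    """Map each fold coverage value to the number of bases covered at that depth."""
    return dict(sorted(Counter(depths).items()))


def percent_read_incorporation(mapped_reads, total_reads):
    """Percentage of input reads that map to (are incorporated into) the assembly."""
    if total_reads == 0:
        return 0.0
    return 100.0 * mapped_reads / total_reads


def build_report(classification, depths, mapped_reads, total_reads):
    """Bundle a Table 3 classification with two Table 2 metrics (illustrative layout)."""
    if classification not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {classification!r}")
    return {
        "classification": classification,
        "percent_read_incorporation": percent_read_incorporation(mapped_reads, total_reads),
        "fold_coverage_histogram": fold_coverage_histogram(depths),
    }


# Toy values: ten bases of per-base depth and 608 of 1,000 reads mapped.
report = build_report("Draft", [1, 1, 2, 2, 3, 3, 2, 2, 0, 0], 608, 1000)
print(report)
```

Keeping the classification and the metrics in one record mirrors the proposal that reporting
describe both; a fuller record would also carry the remaining Table 2 metrics.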
of reference genome datasets to improve assemblies, the inclusion of long read technologies to
help generate longer contigs and scaffolds as well as to allow linkage of genetic differences
among haplotypes, and the use of iterative and combined assembly methods to correct “invalid”
contig and scaffold regions and to find previously unreported overlaps among contigs and reads.

In order to provide a complete assembly overview, the standardized reporting of two important
pieces of information for any assembly of a metagenomic sample is proposed. Tables 2 and 3
describe a first approximation of reporting that would help disseminate information regarding
the quality (Table 2) and assembly levels (Table 3) of metagenome assemblies to a broader
audience. The first and most important level of reporting is an accurate and consistent
description of the assembly metrics as discussed above and in Table 2. These metrics should
include, at a bare minimum, the percentage of reads incorporated in a sample, the total number
of bases in the resulting assembly, the size of the largest contig, and the number of bases in
the largest 100, 1,000 and 100,000 contigs. Additional options can include a histogram of fold
coverage of assembled contigs and alternative measures of assembly. The second level of
reporting requires a community acceptance of assembly types, similar to isolate genome
assemblies. The current default methodology for metagenome assembly (use of a single assembler,
with a single or best parameter selected) is proposed to be called a Draft Metagenome Assembly.
Iterative and multiple assemblies coupled with the merging of contigs and validation/correction
of contigs, such as that utilized at the DOE Joint Genome Institute and Los Alamos National
Laboratory, could be considered high-quality draft. Additional levels of quality require
technologies that are not currently adopted, including the use of general reference genomes to
perform reference-guided assemblies. Finally, the best assembly possible will require
sequencing and assembly of genomes gathered by use of
single-cell or microcolony isolation techniques for organisms present in the study site. This
best-case scenario would allow binning, assembly, and in-depth analysis of both the reads and
the contigs assembled.

The levels of validation, including read mapping or correction of reads, should be reported.
Using current technologies, this is the most likely end point for metagenome assemblies for the
foreseeable future. The final stages of improvement will result in near-total incorporation of
all generated reads into a final assembly set. Finally, for all assembly classifications, it is
also important that metadata, including the sample type, the amount and type of sequencing
technologies as well as any modifications (trimming, filtering, binning) to the data, the
numbers and types of reference genomes used to guide assembly, as well as the percent of reads
incorporated, be attached to all assemblies, for future analysis to be applicable to the same
samples.

Summary

This entry discusses the current difficulties associated with metagenomic assembly and presents
a path for systematic, universally understood and accepted methods for validation and
a classification system for metagenomic assemblies. Each of these areas will require intensive
research and tool development to approach the specified methods and to generate standard
metrics for analysis and comparisons. There is still a strong need to develop universally
applicable validation methods as well as a need to develop a panel of defined datasets for new
techniques to be validated against. Validation in this manner, coupled with active development
of new methods of assessing and reporting the quality of assembly techniques, will maximize the
possibility of generating broadly applicable and accurate assembly tools that perform well
under more than a single method of validation. In all, these proposed reporting mechanisms
(metagenome metadata) will improve the ability of researchers to effectively and confidently
utilize metagenome assembly data.

References

Boisvert S, Laviolette F, et al. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17(11):1519–33.
Chain PSG, Grafham DV, et al. Genome project standards in a New Era of sequencing. Science. 2009;326(5950):236–7.
Delcher AL, Phillippy A, et al. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002;30(11):2478–83.
Godoy-Vitorino F, Goldfarb KC, et al. Comparative analyses of foregut and hindgut bacterial communities in hoatzins and cows. Isme J. 2012;6(3):531–41.
Gordon D. Viewing and editing assembled sequences using Consed. Curr Protoc Bioinforma. 2003. Chapter 11: Unit11 12.
Huttenhower C, Gevers D, et al. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486(7402):207–14.
Kant R, van Passel MWJ, et al. Genome sequence of “Pedosphaera parvula” Ellin514, an aerobic verrucomicrobial isolate from pasture soil. J Bacteriol. 2011;193(11):2900–1.
Lampe JW. The human microbiome project: getting to the guts of the matter in cancer epidemiology. Cancer Epidemiol Biomarkers Prev. 2008;17(10):2523–4.
Langmead B, Trapnell C, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3).
Leung K, Zahn H, et al. A programmable droplet-based microfluidic device applied to multiparameter analysis of single microbes and microbial communities. Proc Natl Acad Sci. 2012;109(20):7665–70.
Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26(5):589–95.
Li H, Handsaker B, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004;32(Web Server issue):W20–5.
Methe BA, Nelson KE, et al. A framework for human microbiome research. Nature. 2012;486(7402):215–21.
Miller JR, Koren S, et al. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95(6):315–27.
Namiki T, Hachiya T, et al. Metavelvet: an extension of velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012.
Peng Y, Leung HCM, et al. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics. 2011;27(13):I94–101.
Schatz MC, Phillippy AM, et al. Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. Brief Bioinform. 2011.
Scholz MB, Lo CC, et al. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr Opin Biotechnol. 2012;23(1):9–15.
Yilmaz P, Gilbert JA, et al. The genomic standards consortium: bringing standards to life for microbial ecology. Isme J. 2011;5(10):1565–7.

CLUSEAN, Overview

Tilmann Weber and Kai Blin
Interfakultäres Institut für Mikrobiologie und Infektionsmedizin Tübingen,
Mikrobiologie/Biotechnologie, Eberhard-Karls Universität, Tübingen, Germany

Synonym

CLUster SEquence ANalyzer

Definition

CLUSEAN, the CLUster SEquence ANalyzer, is a BioPerl-based software pipeline for the annotation
of secondary metabolite biosynthetic gene clusters encoding the biosynthesis of molecules with,
e.g., antibiotic or anticancer activities. CLUSEAN contains modules for automated homology
search, protein domain identification, and, in the case of pathways containing modular
polyketide synthases and non-ribosomal peptide synthetases, substrate prediction for the
biosynthetic enzymes.

Introduction

A majority of antimicrobials used in human medicine to combat infectious diseases, e.g.,
tetracycline, penicillin, vancomycin, or erythromycin, many anticancer drugs, and other
bioactive molecules, e.g., the immunosuppressant rapamycin, are derived from microbial
secondary metabolites, also denoted as natural products. These compounds are mainly synthesized
by bacteria and fungi. Traditionally, screening for such compounds is performed by isolating
potential producers from diverse sources, testing many different growth conditions, chemically
isolating and purifying the produced compounds, and subsequently determining their structure
and testing their bioactivities. The progress in the development of novel high-throughput
sequencing technologies that allow cost-effective sequencing of microbial genomes, and the
increased knowledge on the biosynthetic pathways of natural product formation, recently led to
the availability of genome-mining methods as an alternative approach to the time- and
cost-intensive biological/chemical screening approach. In genome mining, DNA sequence
information is used to assess and evaluate the genetic potential of the investigated strain.
This approach is possible as the molecular principles underlying secondary metabolite
biosynthesis, despite the vast diversity and number of compounds, are highly conserved.

Aim and Scope of CLUSEAN

CLUSEAN, the CLUster SEquence ANalyzer (Weber et al. 2009), is a BioPerl (Stajich
et al. 2002)-based tool collection that allows a semiautomatic annotation and analysis of
secondary metabolite gene clusters. A typical CLUSEAN analysis run is carried out in two
stages: in the first stage, the gene products of whole genomes or biosynthetic gene clusters
are compared against standard databases. In the second stage, secondary metabolite-specific
analyses are carried out (Fig. 1).

During the first analysis stage, similar proteins of all annotated gene products are identified
using BLAST (Altschul et al. 1990) against the non-redundant protein database, and conserved
protein domains are identified with the HMMER (Eddy 2001) software searching against the Pfam
protein family database (Bateman et al. 2002).

In the second stage, protein domains commonly observed in the context of secondary metabolism
are identified using HMMER on
CLUSEAN, Overview, Fig. 1 Data processing within the CLUSEAN annotation pipeline (Reprinted
from Weber et al. 2009 with permission from Elsevier)

a custom HMM profile database. This analysis leads to the identification of the conserved
functional domains in modular polyketide synthases (PKS) and non-ribosomal peptide synthetases
(NRPS). Amino acid specificities of NRPS adenylation domains are predicted with an integrated
NRPS predictor (Rausch et al. 2005; Röttig et al. 2011).

All annotation is provided as annotation tags in EMBL-formatted sequence flat files which can
be imported in standard sequence analysis tools, e.g., the Artemis sequence editing software
(Rutherford et al. 2000) or the ACT sequence comparison tool (Carver et al. 2005). The CLUSEAN
annotation can be exported in tab- or comma-separated text files, or as MS Excel tables.

In addition to the prediction modules integrated into the automated pipeline script, additional
tools exist to define KS types of trans-AT PKS according to Nguyen et al. (2008) and to check
the presence of conserved amino acids in the catalytic domains of modular PKS and NRPS, which
can indicate functionality of the enzymatic domain and thus has an influence on the synthesized
product.

CLUSEAN has been included as an integral part of antiSMASH, the antibiotics and secondary
metabolites analysis shell, http://antismash.secondarymetabolites.org (Medema et al. 2011),
where most analysis results can be accessed interactively or downloaded on a user-friendly web
page.

Availability and System Requirements

CLUSEAN is freely distributed under a GNU GPL and can be downloaded from
https://bitbucket.org/tilmweber/clusean.

CLUSEAN has the following software requirements: BLAST+ 2.2.24 (or later), HMMer 2, HMMer 3,
BioPerl 1.6.9 (or later),
and Perl libraries Sort::ArrayOfArrays and Spreadsheet::WriteExcel::Simple.

Summary

Mining microbial and fungal genome data is a successful novel strategy to identify producers
of novel drug candidates. CLUSEAN is a widely used tool to provide automated annotation of
secondary metabolite gene clusters and to extract information from the sequence data which can
be basis for the deduction of the putative biosynthetic products.

Cross-References

▶ antiSMASH
▶ Bacteriocin Mining in Metagenomes

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, et al. The Pfam protein families database.
Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J. ACT: the artemis comparison tool. Bioinformatics. 2005;21(16):3422–3.
Eddy SR. HMMER: profile hidden Markov models for biological sequence analysis. 2001. Available from: http://hmmer.janelia.org/.
Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA, et al. AntiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 2011;39(Web Server issue):W339–46.
Nguyen T, Ishida K, Jenke-Kodama H, Dittmann E, Gurgui C, Hochmuth T, et al. Exploiting the mosaic structure of trans-acyltransferase polyketide synthases for natural product discovery and pathway dissection. Nat Biotechnol. 2008;26(2):225–33.
Rausch C, Weber T, Kohlbacher O, Wohlleben W, Huson DH. Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs).
Röttig M, Medema MH, Blin K, Weber T, Rausch C, Kohlbacher O. NRPSpredictor2–a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. 2011;39(Web Server issue):W362–7
Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, et al. Artemis: sequence visualization and annotation. Bioinformatics. 2000;16(10):944–5.
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, et al. The Bioperl toolkit: perl modules for the life sciences. Genome Res. 2002;12(10):1611–8.
Weber T, Rausch C, Lopez P, Hoof I, Gaykova V, Huson DH, et al. CLUSEAN: a computer-based framework for the automated analysis of bacterial secondary metabolite biosynthetic gene clusters. J Biotechnol. 2009;140(1–2):13–7.

Computational Approaches for Metagenomic Datasets

Colin Davenport
Hannover Medical School, Hannover, Germany

Synonyms

Bioinformatic analysis; Metagenome data analysis

Definition

The process of gaining information about a metagenomic community from sequence data using
a variety of interdisciplinary techniques and approaches.

Introduction

The history of computational approaches to metagenomic data analysis is brief given the rapid
development of the field. In 1998, a visionary paper described techniques for investigation of
the molecular diversity of environmental communities and coined the term metagenome
(Handelsman et al. 1998). Focus was placed on screening clone libraries for interesting
biological activities, a mainly laboratory-based endeavor
which has been continually successful at identify- programs intended for annotation of
ing relevant novel genes with novel functionality. metagenomic sequences before making some
Other researchers took a more technology-driven comments on relevant statistical analyses. Lastly,
approach by randomly sequencing metagenomic we briefly review the current state of affairs in
DNA from an acid mine biofilm and the metadata collection and standards.
well-known Sargasso Sea projects. These
sequence-based approaches required considerable
computational capacity for assembly and 16S rDNA Profiling
similarity searches. The Sargasso Sea project in
particular provided researchers with considerable Targeted sequencing of the ubiquitously present
headaches with data analysis due to its sheer size. 16S SSU bacterial and archaeal ribosomal gene
Microbiomes of humans and mice quickly has become a common technique in deriving
followed and have remained a major source of estimates of microbial diversity in a community.
metagenomic data to date, particularly with Despite its popularity, with approximately 90 %
respect to diet, health, and disease. As sequencing of all datasets having been produced according to
has become cheaper so has the demand for multi- this method (Davenport and T€ummler 2012), this
ple groups of samples and detailed comparative approach is not metagenomics in the strict sense.
analyses of time courses. Study design with con- 16S rDNA profiling completely ignores func-
trol groups has in turn become more complex and tional diversity such as gene content and acces-
critical. Some groups have even flirted with the sory genome elements while also overlooking
next stage in community analysis and investigated potentially important viral and eukaryotic taxa.
metatranscriptomes and metaproteomes of envi- However, this approach provides consistent qual-
ronmental communities. itative estimates of bacterial and archaeal mem-
While targeted sequencing of single genes has bers of the community, although care must be
been the norm for most projects to date (2012), taken with quantitative aspects. In addition, sev-
many groups are becoming interested again in eral capable software packages are available for
true metagenomics sensu stricto, i.e., the investi- analysis. Errors can occur for a number of rea-
gation of microbial community structure and sons, including copy number variations of ribo-
function using whole genome shotgun somal RNA operons in prokaryotic genomes, the
metagenome datasets. Large studies such as lack of coverage of “universal” primers, and
Metahit (Qin et al. 2010) and a comprehensive multi-template PCR biases. A recent effort has
cow rumen analysis (Hess et al. 2011) have incorporated copy number information of the 16S
driven the acceptance of this approach. Storage gene and reported improvements in microbial
requirements and computational resources can diversity estimates (Kembel et al. 2012).
quickly become limiting in these types of ana- In the past, 16S genes were sequenced using
lyses, though many of the state-of-the-art long Sanger reads, and only fully covered genes
algorithms described below do a good job at were used for analysis. Later, the long-read
mitigating these factors. 454 sequencing technology made targeting of
In the following sections we concentrate on one or more of the shorter so-called hypervariable
the principles, advantages, and problems of the regions of 16S genes possible at a much reduced
main approaches to computational metagenome cost. In turn, others have investigated the use of
analysis. We highlight existing approaches and overlapping paired end Illumina short-read tech-
mention some of the most widely applied soft- nologies to sequence hypervariable regions.
ware in the field, which is then listed with web There is still debate about whether only targeting
links in Table 1. We first deal with 16S rDNA regions of the 16S gene leads to similar results as
profiling, before describing the state of the art in using the full length gene and if this leads to
metagenome assembly and taxonomic assign- biases for some phylogenetic groups (Pinto and
ment algorithms. Subsequently, we discuss Raskin 2012).
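The copy number problem mentioned above for 16S rDNA profiling can be illustrated with a small calculation. The following is a minimal sketch, not the method of Kembel et al. (2012): it assumes that the 16S copy number of every taxon is already known, and the function name, taxon labels, and counts are hypothetical values chosen for illustration only.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Return relative abundances after dividing each taxon's 16S read
    count by its (assumed known) 16S gene copy number."""
    corrected = {
        taxon: count / copy_numbers[taxon]
        for taxon, count in read_counts.items()
    }
    total = sum(corrected.values())
    return {taxon: value / total for taxon, value in corrected.items()}


# Hypothetical example: taxon_A carries 7 copies of the 16S gene, taxon_B one.
reads = {"taxon_A": 700, "taxon_B": 100}
copies = {"taxon_A": 7, "taxon_B": 1}
abundances = copy_number_corrected_abundance(reads, copies)
print(abundances)
```

In this illustrative case the raw read proportions (87.5 % versus 12.5 %) would suggest a strongly dominant taxon_A, whereas the corrected estimate assigns both taxa the same relative abundance.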
Computational Approaches for Metagenomic Datasets, Table 1 A non-exhaustive list of software used directly or indirectly in metagenomics and mentioned in the article

Program | Availability (online tool or standalone) | Purpose | URL
Allpaths LG | Standalone | Read assembly | http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=12
PE-Assembler | Standalone | Read assembly | http://www.comp.nus.edu.sg/~bioinfo/peasm/PE_manual.htm
SSPACE | Standalone | Contig scaffolding | http://www.baseclear.com/landingpages/sspacev12/
AMOS (AMOScmp) | Standalone | Assisted read assembly | http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOScmp
Velvet (Columbus) | Standalone | Assisted read assembly | http://www.ebi.ac.uk/~zerbino/velvet/
Newbler (runMapping) | Standalone | Assisted read assembly | http://454.com/products/analysis-software/index.asp
VAAL | Standalone | Assisted read assembly, polymorphism discovery | ftp://ftp.broadinstitute.org/pub/crd/VAAL/VAAL_manual.doc
MetaVelvet | Standalone | Metagenome assembly | http://metavelvet.dna.bio.keio.ac.jp/
Meta-IDBA | Standalone | Metagenome assembly | http://i.cs.hku.hk/~alse/hkubrg/projects/metaidba/
Cross_match | Standalone | Masking of vector sequences | http://www.phrap.org/phredphrapconsed.html
Phrap | Standalone | Long-read assembly | http://www.phrap.org/phredphrapconsed.html
CAP3 | Standalone | Long-read assembly | http://seq.cs.iastate.edu/cap3.html
Glimmer-MG | Standalone | Ab initio gene finding in metagenomic samples | http://www.cbcb.umd.edu/software/glimmer-mg/
MetaGeneMark | Online and standalone | Ab initio gene finding in metagenomic samples | http://exon.gatech.edu/metagenome/Prediction/ and http://exon.gatech.edu/license_download.cgi
FragGeneScan | Standalone | Ab initio gene finding in metagenomic samples | http://omics.informatics.indiana.edu/FragGeneScan/
MetaGeneAnnotator | Standalone | Ab initio gene finding in metagenomic samples | http://metagene.cb.k.u-tokyo.ac.jp
Orphelia | Standalone | Ab initio gene finding in metagenomic samples | http://orphelia.gobics.de/
Prodigal | Standalone | Ab initio gene finding in metagenomic samples | http://prodigal.ornl.gov/
BLAST | Standalone and online | Homology search | http://blast.ncbi.nlm.nih.gov/
BLAT | Standalone and online | Homology search | http://genome.ucsc.edu/FAQ/FAQblat.html
HMMer | Standalone and online | Homology search | http://hmmer.janelia.org/
MG-RAST | Online | Metagenomic analysis pipeline | http://metagenomics.anl.gov/
IMG/M | Online | Metagenomic analysis pipeline | http://img.jgi.doe.gov/cgi-bin/m/main.cgi
CAMERA | Online | Metagenomic analysis pipeline | http://camera.calit2.net/
WebMGA | Online | Metagenomic analysis pipeline | http://weizhong-lab.ucsd.edu/metagenomic-analysis/
QIIME | Standalone and online | Metagenomic analysis pipeline | http://qiime.org/
Mothur | Standalone | Metagenomic analysis pipeline | http://www.mothur.org/
Uclust | Standalone | Sequence fragment clustering | http://drive5.com/usearch/manual/uclust_algo.html
tRNAscan-SE | Standalone and online | tRNA detection | http://lowelab.ucsc.edu/tRNAscan-SE/
InterProScan | Standalone and online | Protein functional analysis | http://www.ebi.ac.uk/Tools/pfa/iprscan/
MEGAN | Standalone | Comparative metagenomic analysis | http://ab.inf.uni-tuebingen.de/software/megan/
Vegan (R package) | Standalone | Major ordination methods | http://cc.oulu.fi/~jarioksa/softhelp/vegan.html
Picard (CollectGcBiasMetrics) | Standalone | GC bias metrics | http://picard.sourceforge.net/
MetaPhlAn | Standalone | Taxonomic classification | http://huttenhower.sph.harvard.edu/metaphlan
PhyloPythiaS | Online | Taxonomic classification | http://phylopythias.cs.uni-duesseldorf.de/index.php?phase=wait
Genometa | Standalone | Taxonomic classification | http://genomics1.mh-hannover.de/genometa/
PhymmBL | Standalone | Taxonomic classification | http://www.cbcb.umd.edu/software/phymm/
A typical 16S rDNA profiling analysis would include the following steps. Artifacts and errors can be excluded for the most part by rigorously filtering sequence reads according to empirical experience. Commonly used filter steps include rigorous checking of sequencing barcodes, read average Phred quality scores of 20 or more, and exclusion of reads with uncalled nucleotides (Ns). Software tools such as QIIME (Table 1) facilitate the computational processing of sequence data. The next step is comparison against a reference set of mostly full length 16S genes such as the Greengenes or Silva databases (Table 2) using the naïve Bayesian Classifier from the Ribosomal Database Project (Table 1). Lastly, software tools such as QIIME, Mothur, or the R Vegan package (Table 1) allow calculation of diversity metrics, rarefaction curves, and various statistical analyses (see below) which provide further information about the community under study.

Computational Approaches for Metagenomic Datasets, Table 2 A non-exhaustive list of online databases used directly or indirectly in metagenomics and mentioned in the article

Online database | Data | URL
Greengenes | 16S rDNA | http://greengenes.lbl.gov/cgi-bin/nph-index.cgi
Silva | rRNA | http://www.arb-silva.de/
PFAM | Protein families | http://pfam.sanger.ac.uk/
TIGRFAM | Curated multiple sequence alignments, HMMs for protein sequence classification | http://www.jcvi.org/cgi-bin/tigrfams/index.cgi
KEGG | Genomic, chemical, and systemic functional information | http://www.genome.jp/kegg/kegg1.html
EggNOG | Orthologous groups of genes | http://eggnog.embl.de/version_3.0/
COG | Clusters of orthologous groups of proteins | http://www.ncbi.nlm.nih.gov/COG/
SEED | Functional classification | http://www.theseed.org/wiki/Main_Page
GenBank | Annotated DNA | http://www.ncbi.nlm.nih.gov/genbank/
RefSeq | Genomic DNA, transcripts, and proteins | http://www.ncbi.nlm.nih.gov/RefSeq/
UniProt | Protein sequence and functional information | http://www.uniprot.org/
GO | Gene ontology | http://www.geneontology.org/
PATRIC | Protein families | http://www.patricbrc.org/portal/portal/patric/Home
PROSITE | Protein domains, families, and functional sites | http://prosite.expasy.org/
PRINTS | Protein fingerprints | http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/index.php
Pfam | Protein families | http://pfam.sanger.ac.uk/
ProDom | Protein domain families | http://prodom.prabi.fr/prodom/current/html/home.php
SMART | Proteomes | http://smart.embl-heidelberg.de/
PIR superfamily | Protein families, functions and pathways, interactions, structures and structural classifications, genes and genomes, ontologies, literature, and taxonomy | http://pir.georgetown.edu/pirwww/dbinfo/iproclass.shtml
Superfamily | Structural and functional annotation of proteins and genomes | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
Gene3D | Protein domains | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Panther | Gene functions | http://www.pantherdb.org/
HAMAP | Microbial proteomes | http://pbil.univ-lyon1.fr/help/HAMAP.html

Assembly

Longer DNA sequences extracted from metagenomic samples afford more precise taxonomic assignment and annotation by providing more information for homology and composition analyses, at the cost of losing quantitative information on the number of reads attributed to taxa. Also, full length genes may be recovered from the resulting assemblies when using a gene prospecting approach investigating phylogenies created with single-copy genes. Therefore, depending on available coverage, sequence read assembly can be considered. Estimation of the proportion of bacteria of interest present in a metagenomic sample and the total number of microbial (nonhost) reads can give a general idea of how successful the assembly step may be. A good discussion on estimating the probability of assembling a whole genome, or of achieving a specific average contig length, for a particular microbe of interest in a metagenomic sample is given in Wendl et al. (2012). Several sequencing biases are likely to differentially affect the quality of assembly for bacterial genomes due to their DNA composition. For example, reads from GC-rich genomes are likely to be underrepresented due to sequencing issues caused by the secondary structures and higher melting temperatures of GC-rich sequences (Frey et al. 2008) interfering with polymerization and ligation reactions. Polymerization errors occurring during PCR amplification will also result in sequencing errors. Presence of homopolymer runs is likely to introduce frameshift errors during sequencing (especially with the 454 and Ion Torrent platforms). Genome length would also affect the number of sequenced reads for a particular bacterium. These biases should be taken into account in quantitative metagenomics analysis. In summary, the dominant (or more common) species in the sample should produce more reads and are more likely to assemble better, though depending on their DNA composition and the sequencing technology used, a higher rate of sequencing errors can cause poor-quality assemblies even for these dominant species.
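A rough feeling for how well a given genome may assemble can be obtained from its expected coverage. The sketch below is not the model of Wendl et al. (2012); it assumes the classical Lander-Waterman (Poisson) approximation, in which reads fall uniformly at random on the genome, and it ignores the GC, homopolymer, and amplification biases described above. The function names and all numbers are hypothetical.

```python
import math


def expected_coverage(total_reads, read_length, read_fraction, genome_size):
    """Mean per-base coverage of one genome.

    read_fraction is the assumed share of all microbial reads that
    originate from the genome of interest (between 0 and 1).
    """
    return total_reads * read_fraction * read_length / genome_size


def expected_fraction_covered(coverage):
    """Expected fraction of the genome hit by at least one read under a
    Poisson model: 1 - exp(-coverage)."""
    return 1.0 - math.exp(-coverage)


# Hypothetical sample: 1 million 250 bp reads, 2 % of which come from a
# 5 Mb genome of interest.
cov = expected_coverage(
    total_reads=1_000_000,
    read_length=250,
    read_fraction=0.02,
    genome_size=5_000_000,
)
print(cov, expected_fraction_covered(cov))
```

Under these assumptions the genome is covered about once on average, which leaves roughly a third of its bases unsequenced; a low-abundance organism of this kind would therefore need considerably deeper sequencing before assembly becomes worthwhile.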
Read length plays an important role in achieving accuracy and high genomic coverage of an assembly. There is an ongoing debate whether it is sufficient to use short reads for metagenomic analysis (Luo et al. 2012) or alternatively only use reads that are as long as possible for proper annotation of genes, which ideally should include their promoters, riboswitches, co-operonic genes, and signature protein domains (Temperton and Giovannoni 2012). Regardless, using longer paired end reads will always result in more accurate assemblies better covering the length of the assembled genomes.

Due to sequencing costs, next-generation sequencing technologies offering relatively long, paired end reads and yet providing high coverage (enough to assemble underrepresented bacterial genomes) are ideal for metagenomics projects. For example, the Illumina MiSeq instrument can produce 8 Gb of 2 × 250 bp reads in one run at a lower per base cost than Sanger or 454 sequencing technologies, while still offering substantial sequence length. Another reason to avoid short single-end reads in the assembly step is that, aside from problems with assembly of repetitive regions, these reads are also likely to produce misassembled chimeric contigs. Generally, repetitive regions of any length can be assembled by using multiple libraries of paired end reads with varying insert sizes. Libraries with shorter insert sizes can be used to build initial contigs, avoiding misassembly of repetitive reads into pseudo contigs. Longer insert libraries can be used for scaffolding and gap filling of the initial contigs. For these reasons programs like Allpaths LG and PE-Assembler (Table 1) require paired end libraries with different insert sizes. Alternatively, standalone scaffolding tools such as SSPACE (Table 1) can be utilized to scaffold already existing contigs using long insert libraries. It is recommended to filter out poor-quality reads and analyze average base quality of the remaining reads. Based on this analysis a minimal required read length can be determined for uniform or adaptive (quality-based) trimming. When references of closely related organisms are available, it is possible to perform assisted assemblies using the reference genome as a guide. Programs such as AMOScmp, Velvet (Columbus), Newbler (runMapping), and VAAL (Table 1) can be used for assisted assembly. Most short-read assemblers are designed to assemble a single genome and are thus not optimal for assembly of metagenomic samples, where reads from homologous regions of less represented genomes can be treated as error reads. Development of metagenomic assemblers, such as MetaVelvet and Meta-IDBA (Table 1), should address this problem. In these assemblers the de Bruijn graph (Flicek and Birney 2009) for the entire assembly is analyzed for presence of subgraphs corresponding to multiple bacterial genomes in the sample.

Due to high costs Sanger shotgun sequencing of metagenomic samples is less attractive. However, it can be considered for low-diversity samples. Assembly of long Sanger reads can generate nearly complete bacterial genome sequences, ideal for subsequent annotation efforts. Cloning vectors offer large insert sizes, e.g., bacterial artificial chromosomes (BACs) (up to 200 Kb), yeast artificial chromosomes (YACs) (up to 1.5 Mb), and fosmids (up to 90 Kb). Therefore, it is possible to amplify, sequence, and assemble manageably large stretches of DNA sequence randomly positioned within a genome and overlapping each other, thus leading to assembly of nearly complete genomes. Vector sequences should be excluded from the assembly using vector-masking software such as Cross_match (Table 1). The two most commonly used long-read assemblers are Phrap and CAP3 (Table 1). Trimming of poor-quality 5′- and 3′-ends should be implemented to improve the assembly.

Taxonomic Assignment (Binning)

Assignment of derived sequence reads to their taxon of origin is a key goal of most metagenomic studies. This process is also referred to as binning, as sequences are placed into “bins” representing the various taxa. Two types of assignment have been largely utilized to date, compositional and sequence similarity based.
Compositional signals depend on the concept of the genome signature. This relies on the simple idea that the composition of oligomers such as tetramers from closely related genomes is more similar than that from distantly related genomes (Mrázek 2009). There is a significant body of research on this topic, covering identification of genomic islands, genes of aberrant composition, genome evolution, and classification of metagenome sequences. The main advantage of compositional classifiers is that they can determine associations in the absence of alignment by assessment of normalized oligomer counts. Furthermore, unsupervised machine learning techniques such as self organizing maps are not biased by the availability of fully sequenced reference sequences. The main drawback of these classifiers is the long sequences needed to derive robust oligomer statistics. For example, the program PhyloPythiaS (Table 1) and more recent frameworks typically require more than 1,000 bp of input sequence. As such, they are not able to assign the numerous short reads from modern Illumina and SOLiD sequencers to various taxa, which is certainly possible with other techniques (see below), but they do work well on assembled contigs. This leads to problems, as contigs do not reflect the distributions of raw reads initially observed in the metagenome. Also, some distantly related organisms may not have sufficiently divergent genome signatures for assignment.

Compositional data has been used in a number of metagenomic studies. Willner et al. (2009) analyzed the compositions of 86 microbial and viral metagenomes sequenced with 100 bp 454 reads. They found that dinucleotides explained more of the variance observed than higher order oligonucleotides such as tetramers, although this is probably due to the short length of the read sequences used, which leads to non-robust statistics for higher order oligomers. Further advantages of oligomers are their ability to detect contamination in contigs through a divergent oligomer profile and their relatively modest computational burden.

A more widely used method of binning sequences is to find sequences by similarity, given that a reference sequence is available. The key advantages of these methods are that they are accurate, robust, and widely accepted and can also give direct knowledge of gene content following alignment. The main disadvantage is the lack of available reference sequence for some taxa, which can lead to false overrepresentations of somewhat related taxa in the estimates. Also, computation tends to be more demanding than the compositional approach. This is especially so in the case of the BLAST algorithm used in the popular software MEGAN (Table 1). MEGAN uses a lowest common ancestor approach to assign reads with multiple database hits to a taxon. If a read hits unrelated bacteria from different phyla, the lowest common ancestor will be the rank above phylum, such as Bacteria. However, if a read hits different species of, say, Burkholderia, the algorithm will assign it to the genus Burkholderia. BLAST is effective since it allows alignments against the well characterized metagenomic protein space, as well as the less well-known nucleotide space.

Another popular solution is the web-based analysis toolbox MG-RAST (Table 1). MG-RAST allows taxonomic binning, but is more focused on functional investigation and comparison of metagenomes. It is further detailed in the Annotation section below. WebMGA (Table 1) is an alternative, very capable metagenomics web server which uses efficient algorithms such as FR-HIT and CD-HIT for flexible read alignment and highly efficient clustering, respectively. MetaPhlAn (Table 1) uses unique clade-specific marker genes as a reduced reference of about 400 thousand genes most representative of each taxonomic unit and maps reads to it. This kind of mapping potentially allows assignment of reads to finer taxonomic levels such as species and has the advantage of being extremely rapid. A further solution which seeks to use curated reference sequences is Genometa (Table 1). This GUI program puts emphasis on finding the mapping coordinates of even very short reads in a genome to check if a taxon is actually present, or if it is more likely to be either contamination or just a related ORF or genomic island.
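The lowest common ancestor idea described above for MEGAN can be sketched in a few lines. This is a simplified illustration only, not MEGAN's implementation: it assumes that each database hit has already been resolved to a root-to-leaf lineage, and it ignores the score thresholds a real tool would apply before considering a hit. The function name and the hard-coded lineages are assumptions made for the example.

```python
def lowest_common_ancestor(lineages):
    """Return the shared prefix of several root-to-leaf lineages.

    Each lineage is a list of taxon names ordered from the highest
    rank (e.g. "Bacteria") down to the hit itself.
    """
    if not lineages:
        return []
    shared = []
    # zip(*) walks all lineages rank by rank and stops at the shortest one.
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared.append(taxa_at_rank[0])
        else:
            break
    return shared


b_cepacia = ["Bacteria", "Proteobacteria", "Betaproteobacteria",
             "Burkholderiales", "Burkholderiaceae", "Burkholderia",
             "Burkholderia cepacia"]
b_pseudomallei = ["Bacteria", "Proteobacteria", "Betaproteobacteria",
                  "Burkholderiales", "Burkholderiaceae", "Burkholderia",
                  "Burkholderia pseudomallei"]
b_subtilis = ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
              "Bacillaceae", "Bacillus", "Bacillus subtilis"]

# Hits to two Burkholderia species place the read at the genus level.
print(lowest_common_ancestor([b_cepacia, b_pseudomallei])[-1])
# Hits to different phyla place the read at the level above phylum.
print(lowest_common_ancestor([b_cepacia, b_subtilis]))
```

The sketch also makes the main disadvantage visible: the more distantly related the hits of a read are, the less specific its assignment becomes.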
Other programs aim to combine compositional and similarity based tools. A well-known approach is PhymmBL (Table 1). This program uses both BLAST and compositional attributes to assign even reads as short as 100 bp. The authors found this technique to be more accurate than either of the methods alone and have continued to improve their software. In general, binning is still a difficult task, and algorithms which work very well on one dataset may be extremely limited on the next. As such, we recommend using at least two binning approaches on the sample to gain the maximum possible information.

Annotation

Annotation of metagenomics samples requires identification of features of interest in the assembled fragments or reads binned into their Operational Taxonomic Units (OTUs). For ab initio identification of potential gene sequences entire ORFs should be located. Various software exists to perform this task in metagenomic projects, e.g., Glimmer-MG, MetaGeneMark, FragGeneScan, MetaGeneAnnotator, and Orphelia (Table 1). These programs utilize various types of Markov models for analysis of codon usage or frequency of other genome composition elements in the binned genomes. However, instead of a single model the analysis is based on multiple Markov models trained with data from a large variety of bacterial species. The trained model providing the best fit is then selected for gene prediction. Given the complexity of metagenomic assemblies, especially when only short reads are utilized, it is expected that a large proportion of assembled contigs may only have partial ORFs. These sequences can still be included in homology analysis using BLAST, BLAT, or HMM (Table 1) searches against gene or protein nonredundant databases. There are a number of online pipelines, e.g., MG-RAST, IMG/M, and CAMERA (Table 1), available for ab initio- and homology-based DNA structure annotation as well as functional analysis of identified genes using a battery of publicly available databases that can be used for functional annotation, such as PFAM, TIGRFAM, KEGG, EggNOG, COG, SEED, GenBank, RefSeq, UniProt, GO, and PATRIC (Table 2).

MG-RAST allows the users to upload their sequence data (in FASTA, FASTQ, and SFF format) and metadata. Sharing of the uploaded data is preferred, but this is not mandatory. The data are quality controlled with the QC pipeline based on the settings provided by the user. The QC pipeline features include read quality filtration and trimming, dereplication, model organism screening, demultiplexing, and merging mate pairs. Currently, the minimal accepted read length is 75 bp. Assemblies can also be submitted. Starting from version 4.0 the pipeline will also support read assembly. Submissions to this pipeline are queued and submitted for feature prediction using FragGeneScan, which identifies the most likely reading frame and performs homology search on translated features. A program called Uclust is then used to cluster 90 % identical protein fragments. The number of reads in each cluster is identified to estimate abundances. The pipeline also provides various visualization tools to view the results and to perform comparative analysis with over 590 public metagenomes.

IMG/M concentrates on comparative analysis of microbial genomes. The pipeline accepts assembled or unassembled reads. Unassembled reads are quality controlled, trimmed, and dereplicated; their low-complexity regions are masked. Aside from protein coding genes, ab initio gene finding also includes detection of CRISPRs and noncoding RNA. RNA detection is performed using tRNAscan-SE for tRNAs and HMMs for rRNAs. Coding sequences are predicted using a combination of Prodigal, Metagene, MetaGeneMark, and FragGeneScan (Table 1). Longer sequences are also searched against a local nonredundant protein database using BLASTX. IMG/M provides functional annotation of the entire metagenome and supports functional comparisons to other stored annotated metagenomes. Various visualization tools facilitate this kind of comparison,
e.g., Phylogenetic Distribution of Genes or Radial Phylogenetic Tree.

CAMERA provides a collection of online tools for metagenomics analysis. The provided tools allow the following analysis steps: sequence QC, sequence assembly, ORF prediction, RNA prediction, BLAST, clustering, functional annotation, and viral diversity estimation.

InterProScan (Table 1) is one of the most advanced programs for protein functional analysis. It incorporates BLAST and HMM searches against an array of protein domain and functional site databases (PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D, PANTHER, and HAMAP (Table 2)). Online and locally installed versions are available. Due to the highly CPU-intensive nature of the BLAST and HMM searches, it is recommended to run this program on a computer cluster.

MEGAN is a standalone tool for visualization of BLAST search results as taxonomic dendrograms, functional dendrograms using the SEED classification, pathways using KEGG orthology, comparative visualization, etc. A good collection of general information about other available metagenomics software and resources can be found on http://seqanswers.com/wiki/Metagenomics.

Statistical Analysis

Early metagenomic datasets, such as the Sargasso Sea, were relatively simple surveying projects by design. Attempts were made to quantitate species abundances using relative abundance of reads and presence of 16S rRNA and single-copy genes. Later studies then focused more on comparative spatial or temporal variation of the microbial community. Due to this increasing sophistication multiple metrics for characterizing the complexity of the community have been developed. As many projects in metagenome analysis are not based on strict hypothesis testing, exploratory data analysis techniques such as multivariate statistics are often employed (see below). Generally, the metric under study is the estimate of abundance of a taxon, which can be obtained by multiple methods. An important step to ensure comparability is normalization. Normalization of these metrics can be undertaken using relative abundances, GC content, genome size, or prevalence of single-copy genes. However, normalization of true metagenomic data in particular is in a state of flux with little current consensus. Care must be taken with the GC content delivered by sequencers as a result of the different sample preparation and amplification schemes. It must be assumed that all sequencing runs have some form of quantitative bias against either or both low GC and high GC organisms, meaning that they will be underrepresented in the samples. This problem has not been widely considered in metagenomics to date. GC bias assessment programs, such as Picard’s CollectGcBiasMetrics (Table 1), are particularly useful in observing and quantifying relative bias of read coverage at different GC values using just a reference sequence and a BAM alignment file. Larger genomes are more likely to be sampled in a randomly sheared metagenomic DNA sample. This can be compensated for by normalizing for genome length, if applicable for the taxonomic attribution method used.

Many metrics have been taken directly from the field of ecology. Alpha, beta, and gamma diversity summarize the species diversity in one habitat, the change in species diversity between habitats, and the total species diversity across a larger scale landscape, respectively. Species richness is simply the number of species found, while species diversity includes a measure of the abundance of members of each species. Other measures such as Shannon and Simpson indices are also available. One use case is from Dinsdale and coworkers (2008), where functional metagenomic diversity was characterized separately across a range of bacterial and viral genomes in many different habitats. Interestingly, functional metagenomics was reported by Dinsdale and coworkers to explain a larger proportion of the variance in each dataset (about 75 %) than analysis of taxa by 16S rRNA genes only (about 10 %), and thus to be predictive of metabolic capacity within the taxa of an ecosystem.
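The difference between species richness and the diversity indices named above can be made concrete with a short calculation. The following is a minimal sketch using the standard textbook definitions, with the natural logarithm for the Shannon index and the Gini-Simpson form (one minus the sum of squared proportions) for the Simpson index; other variants of both indices are in use. The function names and the two count vectors are hypothetical.

```python
import math


def species_richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over observed taxa."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Two hypothetical communities with the same richness but different evenness.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]

for name, counts in (("even", even), ("skewed", skewed)):
    print(name, species_richness(counts),
          round(shannon_index(counts), 3),
          round(simpson_index(counts), 3))
```

Both communities contain four taxa, yet both indices are much lower for the skewed community, which is dominated by a single taxon.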
The aforementioned indices attempt to condense a highly multidimensional dataset into a single number, which can be useful as a summary but obscures the underlying data. Therefore, advanced ordination methods for multidimensional datasets such as principal components analysis (PCA) and multidimensional scaling (MDS) have been applied to differentiate communities and reveal associations with abiotic parameters. Whichever of the many ordination methods is chosen, it is of great importance to check the variance explained by the observed components or functions. Where the principal components or alternative statistics explain little of the variance in the data, this indicates the variation in the data cannot be explained by the variables measured, and caution must be taken in interpreting the results. Various clustering algorithms have also been demonstrably useful in grouping similar datasets based on measures from normalized read counts to oligomer content and differentiating them from controls in a manner identical to the clustering schemes popular in microarray group expression analysis (Mrázek 2009). Clustering can also be important for quality control and identification of outlier microbial communities, which may also be attributed to technical artifacts.

Further types of statistical community comparison metrics have been developed especially for metagenomics. One example is the UniFrac distance metrics used to calculate a distance measure between microbial communities using information from a supplied phylogenetic tree (Hamady et al. 2010). UniFrac uses a beta diversity measure detailing community membership over space and time, which has distinct advantages, and the phylogenetic tree method shows improvements over comparing simple lists of taxa.

Experimental design is of paramount importance in obtaining robust statistical results. Since estimates of microbial communities tend to be noisy, replicates are necessary to gain a reliable assessment of variance. As finance is usually the limiting factor, either samples can be sequenced at a lesser depth or cheaper sequencing technol-

As with any statistical analyses, care must be taken when performing multiple tests due to frequent generation of false positives. While Bonferroni corrections are extremely good at removing false positive test results, the extreme stringency of this method will certainly mask a number of biologically true associations (false negatives). As such, we advocate the use of less stringent corrections such as the Benjamini-Hochberg false discovery rate method (FDR; van den Oord and Sullivan 2003). Lastly, it should be noted that extensive and high quality metadata is crucial to observing and quantitating trends in microbial community structure.

Metadata

Collection of metadata about metagenomes is essential for making the sequence data and analysis results meaningful and reusable by the scientific community. Moreover, properly collected and complete metadata can also help the scientists originally analyzing a metagenomic sample to draw conclusions about their findings that otherwise may be overlooked. A first step in this direction is development of the minimum information about a genome sequence (MIGS) specification and its extension to the minimum information about a metagenome sequence (MIMS) specification by the Genomic Standards Consortium (GSC). MIGS provides general information about a genomic sequence, similar to what is collected by the NCBI Trace Archive or NCBI Short Read Archive, extended to more detailed metadata about environment, nucleic acid sequence source, and assay preparation. MIMS extends this specification to also include metadata about the habitat, e.g., temperature, pH, salinity, pressure, chlorophyll, conductivity, light intensity, dissolved organic carbon (DOC), current, atmospheric data, density, alkalinity, dissolved oxygen, particulate organic carbon (POC), phosphate, nitrate, sulfates, sulfides, and primary production (Field et al. 2008). An XML schema is used to implement the MIGS/MIMS checklist. This schema is the basis for ongoing
ogies can be used. development of the Genomic Contextual Data
Markup Language (GCDML). This language should support polymorphic validation of various taxa (requiring different checklists) and development of ontologies.

Another interesting resource that addresses the need for sharing standardized metagenomics data is the Genomes OnLine Database (GOLD, http://www.genomesonline.org/). This database contains a collection of completed and ongoing projects with the associated metadata, which are based on a controlled vocabulary coordinated with the GSC.

Another online resource that collects GSC-compliant metadata is CAMERA, already mentioned in the Annotation section of this review. CAMERA is involved in GSC activities and provides input for development of metagenomic metadata standards that are also used for submission of metagenomic data to CAMERA.

Summary

Recent improvements in next-generation sequencing technologies are providing new opportunities for metagenomics. While 16S rRNA gene profiling is still predominantly used for quantitative profiling of communities analysis, availability of long paired end reads produced by an Illumina MiSeq instrument or similar technology with high coverage and comparatively low cost should shift the focus of future metagenomics projects to genome assembly of microbes of interest and their functional annotation. As more microbial genomes and metagenomes are being sequenced and annotated, it is hard to overestimate the need for sharing these data and standardization of the associated metadata for comparative analysis. In this regard, development of minimum information standards (MIGS and MIMS) by the Genomic Standards Consortium and adaptation of these standards by the online analysis/storage resources, such as MG-RAST or IMG/M, are encouraging developments. As the increasing number of metagenomic projects becomes more detailed and complex, the need for more advanced analysis software also increases. In conclusion, computational analysis of metagenomic samples is becoming more affordable and available to the research community and provides exciting research and software development opportunities.

Advances in metagenomic analysis of microbial communities also provide opportunities for metatranscriptomic and metaproteomic research.

Cross-References

▶ Lessons Learned from Simulated Metagenomic Datasets
▶ Metagenomics, Metadata, and Meta-analysis
▶ Nucleotide Composition Analysis: Use in Metagenome Analysis
▶ Phylogenetics, Overview
▶ Silva Databases

References

Davenport CF, Tümmler B. Advances in computational analysis of metagenome sequences. Environ Microbiol. 2012. doi:10.1111/j.1462-2920.2012.02843.x.
Dinsdale EA, Edwards RA, Hall D, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452:629–32.
Field D, Garrity G, Gray T, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol. 2008;26:541–7.
Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat Methods. 2009;6:S6–12.
Frey UH, Bachmann HS, Peters J, Siffert W. PCR-amplification of GC-rich regions: slowdown PCR. Nat Protoc. 2008;3:1312–7.
Hamady M, Lozupone C, Knight R. Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J. 2010;4:17–27.
Handelsman J, Rondon MR, Brady SF, et al. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5:R245–9.
Hess M, Sczyrba A, Egan R, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331:463–7.
Kembel SW, Wu M, Eisen JA, et al. Incorporating 16S gene copy number information improves estimates of
microbial diversity and abundance. PLoS Comput Biol. 2012. doi:10.1371/journal.pcbi.1002743.
Luo C, Tsementzi D, Kyrpides NC, et al. Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 2012;6:898–901.
Mrázek J. Phylogenetic signals in DNA composition: limitations and prospects. Mol Biol Evol. 2009;26:1163–9.
Pinto AJ, Raskin L. PCR biases distort bacterial and archaeal community structure in pyrosequencing datasets. PLoS One. 2012;7:e43093.
Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Temperton B, Giovannoni SJ. Metagenomics: microbial diversity through a scratched lens. Curr Opin Microbiol. 2012;15:605–12.
van den Oord EJCG, Sullivan PF. False discoveries and models for gene discovery. Trends Genet. 2003;19:537–42.
Wendl MC, Kota K, Weinstock GM, et al. Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens theorem. J Math Biol. 2012. doi:10.1007/s00285-012-0586-x.
Willner D, Thurber RV, Rohwer F. Metagenomic signatures of 86 microbial and viral metagenomes. Environ Microbiol. 2009;11:1752–66.

Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes

Yong Wang1 and Pei-Yuan Qian2
1 Division of Deep Sea Science, Sanya Institute of Deep Sea Science and Engineering, San Ya, Hainan, China
2 KAUST Global Collaborative Program, Division of Life Science, Hong Kong University of Science and Technology, Hong Kong, China

Definition

The taxonomic classification of prokaryotic organisms based on morphological differences is difficult. A ribosomal RNA (rRNA) sequence has many polymorphic sites that can act as a genetic earmark to uncover the genetic background of prokaryotes (Fox 2010). Bacterial and archaeal genomes contain one to several copies of 16S rRNA genes (rDNA), depending on the physiological condition of the microbes (Klappenbach et al. 2000; Liao 2000). Because the rRNA sequences are vertically delivered to the next generation, they cannot be inherited by a different species. Hence, 16S rRNA sequences are considered to be a stable marker of morphological difference and have been applied in the taxonomic classification of prokaryotes (Woese 1987). Since the 1980s, partial and full-length sequences have been obtained using polymerase chain reaction (PCR) technology (Lane et al. 1985). These sequences have been deposited in the Ribosomal Database Project (RDP), Greengenes (http://greengenes.lbl.gov), and SILVA rRNA public databases (Cole et al. 2009; Pruesse et al. 2007). The closest relative of a microbial organism of interest can be figured out by comparing the organism’s rRNA sequence with the collected sequences of known species (DeLong 1992; Fuhrman et al. 1992). Moreover, with the development of next-generation sequencing techniques (Quail et al. 2012), rRNA sequences of microbial communities in environmental samples can be massively obtained in a short period (the efficiency is platform dependent). These advances in detection have dramatically improved our understanding of the communities of environmental microbes in different sites on Earth (Qian et al. 2011; Roussel et al. 2008).

The primer design is critical, regardless of which sequencing method is employed, the Sanger method or next-generation methods. With the rapid advances in metagenomics, environmental samples can now consist of thousands of microbial species (Tremaroli and Backhed 2012). Therefore, it is important to use primers that are suitable to most of the species to fully investigate the microbial community. If the primers fail to land on the matching parts of the rDNA of certain dominant microbes, then these species will be excluded from the PCR amplicons, resulting in a poor survey of the community. For example, in a study of the microbial communities in the Red Sea, the selection of primers almost failed to capture the entire SAR11 group belonging to alpha-Proteobacteria (Qian et al. 2011). Primer specification is also the
major concern in other studies (Huse et al. 2008; Huws et al. 2007; Klindworth et al. 2013).

The strategy for primer design is based on the conservation of the target sequences. Primers are designed to obtain the variant sequences between two conserved regions. The degree of conservation of the regions directly contributes to the coverage rate of the primers targeting a community in an environmental sample. The conserved regions in 16S rRNA sequences are involved in essential translational functions and interact with ribosomal proteins. For instance, universally conserved sites G530, A1492, and A1493 in 16S rRNA sequences are crucial for tRNA binding in the A site (Brimacombe and Stiege 1985; Demeshkina et al. 2012). Along with the neighboring conserved sites, these sites have been recognized to be ideal regions for primer design, as exemplified by the frequently used universal primers U519 and U1492 (Baker et al. 2003). Apart from the conserved regions, there are a total of nine variant regions that correspond to the species-specific structural sequences of ribosomal RNA (Huws et al. 2007; Wang and Qian 2009). The variant regions can be obtained through PCR, followed by sequencing and comparison for taxonomic assignment.

Conserved Regions in 16S rRNA Sequences

In a previous study, a method had been developed for the de novo identification of conserved regions in 16S rRNA sequences (Wang and Qian 2009). First, conserved sites are detected by checking the alignment file of the 16S rRNA sequences, and consecutive conserved sites are regarded as potential candidates as primers with a high coverage rate for all known species. Thus, all possible conserved regions can be located without having to understand every detail of the role of 16S rRNA sequences in ribosome and protein translation. The previous study examined long 16S rDNA sequences (>1,200 bp), but the overrepresentation of Firmicutes and Proteobacteria sequences skewed the results toward the dominant phyla (Wang and Qian 2009). The conserved regions were searched again using nonredundant core sequences from the SILVA database. A total of 11 bacterial and seven archaeal segments with degeneration sites were obtained. Because the nonredundant sequences were used and many of them were incomplete, the identified conserved sequences had more polymorphisms and the conservation degree at both ends of the 16S rDNA sequences could not be evaluated (Table 1). However, three new conserved regions, located at the bacterial 252–275 and 547–575 regions and the archaeal 560–578 region, were found in these sequences. In the overlapping segment between 565 and 575, a universally conserved region was recognized for bacterial and archaeal sequences: 5′-TGGG[C/T][C/G/T]TAAAG-3′. This region has been used to design primers for the identification of clinical bacteria (Nikkari et al. 2002). The positions of the conserved regions are standardized to the approximate positions on Escherichia coli 16S rDNA. It is interesting that all the archaeal conserved segments have corresponding bacterial segments at the same standardized positions and share some conserved sites with the bacterial counterparts (Table 1).

Evaluation of Candidate Primers

Candidate primers were selected from the conserved regions and were subjected to further evaluation. The candidates were matched to the core 16S rDNA sequences, and only two mismatches were allowed between the primers and the targeting regions. The results in Table 2 show the coverage rates of the candidates on the SILVA core datasets. Archaeal primers in the 328–346, 340–357, 916–931, and 953–972 regions are associated with a low coverage rate. This was caused by the presence of short archaeal 16S rDNA in the core dataset, at least 7 % of which did not have the targeting regions for these primers. Therefore, the core sequences could have been recovered better by these candidates. In addition to these archaeal primers, the two primers that initiate from position 683 in the bacterial and archaeal 16S rDNA sequences have the lowest coverage rates of
Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental
Microbes, Table 1 Conserved regions in archaeal and bacterial 16S rRNA core sequences
Start End Conserved sequence
Bacteria
252 275 TTGGYRRGGTAAHRGCYYACCAAG
311 365 CCACAHKGGVACTGAGAYACKGBCCACCTACGGGWGGCWGCAGTVRRGAAT
507 536 CTAACTHYGTGCCAGCAGCCGCGGTAAKAC
547 575 AGCGTTRYYCGGAWTYAYTGGGYKTAAAG
683 707 GTGTAGVRGTGAAATBCGTWGAKAT
765 806 GAAAGCKWGGGKAGCRAACRGGATTAGATACCCBGGTAGTCC
883 932 CTGGGRAGTACGVYCGCAAGRBTRAAACTCAAAGGAATTGACGGGGRCYC
935 986 ACAAGCRGYGGAGYRTGTGGYTTAATTCGAHRMWAMGCGMRRAACCTTACC
1,045 1,062 CAGGTGBTGCATGGYTGT
1,067 1,085 AGCTCGTGYCGTGAGRTGT
1,090 1,113 TTAAGTSCBRYAACGAGCGCAACC
Archaea
325 359 CWRGYCCTACGGGRYGCAGCAGKCGCGAAAMCTYY
514 539 GGTGYCAGCCGCCGCGGTAAHACCGC
560 578 WTTAYTGGGYYTAAAGCRT
679 701 GACRGTGAGGRAYGAARSCYDGG
781 806 CRAWCSGGATTAGACCCSRGTAGTCC
883 931 CTGGGRAGTAYGRYCGCAAGRYTGAAACTTAARGGAATTGGCGGGGGAG
953 972 GGTTYAATYGRABTCAACGC
The conserved regions were obtained by searching for consecutive conserved sites in an alignment file of 339 archaeal and
1,845 bacterial nonredundant core 16S rDNA sequences in the SILVA database (release 108). The cutoff percentage of
occurrence of a nucleotide at a conserved site is 90 %. The positions are according to Escherichia coli 16S rDNA
positions. Abbreviations for degenerate sites are Y for C or T, R for A or G, W for A or T, K for G or T, M for C or A,
S for C or G, V for not T, H for not G, B for not A, and D for not C
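
The search described in the Table 1 footnote (consecutive alignment columns in which one nucleotide reaches the 90 % cutoff) can be sketched as follows. This is only a minimal illustration, not the authors' program: the function names, the `min_len` parameter, and the treatment of gap-dominated columns as non-conserved are assumptions, and the reported ranges are alignment columns rather than E. coli positions. It also does not reproduce how degenerate sites inside a region were tolerated.

```python
from collections import Counter


def conserved_columns(alignment, cutoff=0.90):
    """Flag each column whose most frequent symbol reaches the cutoff.

    `alignment` is a list of equal-length, uppercase aligned sequences.
    Columns dominated by the gap symbol '-' are treated as not conserved
    (an assumption of this sketch).
    """
    n = len(alignment)
    flags = []
    for column in zip(*alignment):
        base, count = Counter(column).most_common(1)[0]
        flags.append(base != "-" and count / n >= cutoff)
    return flags


def conserved_regions(alignment, cutoff=0.90, min_len=15):
    """Return 1-based inclusive (start, end) runs of consecutive conserved columns."""
    flags = conserved_columns(alignment, cutoff)
    regions, start = [], None
    # A trailing False closes a run that extends to the last column.
    for i, ok in enumerate(flags + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                regions.append((start + 1, i))
            start = None
    return regions


# Toy alignment: column 5 is only 80 % conserved, so it splits the run.
toy = ["ACGTTGCA"] * 8 + ["ACGTAGCA", "ACGTCGCA"]
print(conserved_regions(toy, cutoff=0.90, min_len=3))
```

In practice `min_len` would be set near a usable primer length, and the column indices would still have to be mapped onto the E. coli numbering used in Table 1.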
88.5 % and 81.1 %, respectively (Table 2). Two candidates, 683–700 and 691–707, were selected from the bacterial conserved region of 683–707 for the test; both were associated with low coverage rates. Obviously, more degeneration sites have been introduced in these bacterial primers compared with the same sets described previously (Wang and Qian 2009). This means that more polymorphisms emerged in the nonredundant dataset, which results in the low rates for the two primers. Thus, the primers from this rRNA region are not recommended due to their generally low coverage rate, although a previous study obtained a 90.5 % coverage rate for a similar primer (Wang and Qian 2009). The overall quality of the other candidate primers is high, with the average coverage rate being 92.7 % (Table 2). The best bacterial primers in this study were located at the E. coli positions of 547–568, 556–575, 907–928, and 1,046–1,062. These primers are able to recover more than 95 % of the bacterial 16S rDNA sequences. For the archaeal candidates, the two candidate primers in the region of 514–539 have the highest coverage rates, 93.5 % and 93.8 %, respectively. Thus, the archaeal rDNA sequences appear to be more difficult to be fully covered than the bacterial sequences, considering the average coverage rate of 90.8 % with the low rates for the four primers at both ends ignored. In regard to universal primers, this study recommends the primers at the 515–533, 785–806, and 907–928 positions. High coverage rates for these primers were confirmed using the bacterial and archaeal datasets (Table 2).

Summary

A short list of 16S rDNA primers has been compiled using simplified nonredundant rDNA
Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental
Microbes, Table 2 Evaluation of candidate primers
Position Sequence % coverage
Bacteria
259–275 GGTAAHRGCYYACCAAG 93.6 %
321–338 ACTGAGAYACKGBCCACC 86.6 %
334–353 CCACCTACGGGWGGCWGCAG 94.1 %
515–533 GTGCCAGCAGCCGCGGTAA 93.6 %
547–568 AGCGTTRYYCGGAWTYAYTGGG 95.1 %
556–575 CGGAWTYAYTGGGYKTAAAG 96.9 %
683–700 GTGTAGVRGTGAAATBCG 88.5 %
691–707 GTGAAATBCGTWGAKAT 75.9 %
765–782 GAAAGCKWGGGKAGCRAA 82.1 %
785–806 GGATTAGATACCCBGGTAGTCC 94.7 %
907–928 AAACTCAAAGGAATTGACGGGG 96.6 %
946–964 AGYRTGTGGYTTAATTCGA 92.5 %
1,046–1,062 AGGTGBTGCATGGYTGT 95.8 %
1,067–1,085 AGCTCGTGYCGTGAGRTGT 93.0 %
1,090–1,113 TTAAGTSCBRYAACGAGC 90.8 %
Archaea
328–346 GYCCTACGGGRYGCAGCAG 83.8 %a
340–357 GCAGCAGKCGCGAAAMCT 80.2 %a
514–533 GGTGYCAGCCGCCGCGGTAA 93.8 %
519–539 CAGCCGCCGCGGTAAHACCGC 93.5 %
560–578 ATTAYTGGGYYTAAAGCRT 90.3 %
683–701 GTGAGGRAYGAARSCYDGG 81.1 %
785–806 GGATTAGATACCCSRGTAGTCC 90.9 %
883–902 CTGGGRAGTAYGRYCGCAAG 92.6 %
897–914 CGCAAGRYTGAAACTTAA 91.4 %
907–928 AAACTTAARGGAATTGGCGGGG 92.9 %
916–931 GGAATTGGCGGGGGAG 82.9 %a
953–972 GGTTYAATYGRABTCAACGC 79.9 %a
Degenerate nucleotide codes are defined in the footnote of Table 1
a Coverage percentages that need an adjustment due to incompleteness of some short 16S rDNA sequences at the region
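
A coverage rate of the kind reported in Table 2 can be approximated with a short script. The sketch below is an illustration under stated assumptions rather than the procedure actually used for the table: it scans unaligned, uppercase DNA sequences on the forward strand only, counts a position as matching when the base is allowed by the degenerate code from the Table 1 footnote, and accepts a window with at most two mismatches, as stated in the text. The names `mismatches`, `primer_hits`, and `coverage` are hypothetical.

```python
# Degenerate codes as listed in the footnote of Table 1.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "W": "AT", "K": "GT", "M": "AC", "S": "CG",
    "V": "ACG", "H": "ACT", "B": "CGT", "D": "AGT",
}


def mismatches(primer, window):
    """Count positions where the base is not allowed by the primer code."""
    return sum(base not in IUPAC[code] for code, base in zip(primer, window))


def primer_hits(primer, sequence, max_mismatches=2):
    """True if any window of the sequence matches within the mismatch limit."""
    k = len(primer)
    return any(
        mismatches(primer, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


def coverage(primer, sequences, max_mismatches=2):
    """Percentage of sequences that contain an acceptable primer site."""
    hits = sum(primer_hits(primer, s, max_mismatches) for s in sequences)
    return 100.0 * hits / len(sequences)


primer = "GTGCCAGCAGCCGCGGTAA"          # bacterial 515-533 candidate from Table 2
toy_sequences = [
    "AAAGTGCCAGCAGCCGCGGTAATTT",        # exact site
    "ATGCCAGCAGCCGCGGTAC",              # two mismatches, still accepted
    "ATGCCAGCTGCCGCGGTAC",              # three mismatches, rejected
    "ACGT",                             # too short to contain the site
]
print(coverage(primer, toy_sequences))
```

Sequences that are too short to contain the target region count as misses here, which mirrors the adjustment noted for the archaeal primers marked "a" in Table 2.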
datasets. These primers will be useful for identifying environmental microbes, as they are capable of detecting more than 90 % of the known bacteria and archaea. However, the number of prokaryotic organisms that resist being captured by these 16S rRNA primers cannot be estimated. It has been estimated that only about 1 % of the microbes on Earth are culturable. Moreover, the microbes colonizing extreme and geologically isolated environments are far from being completely explored (Pace 1997; Sogin et al. 2006). The conserved sites in the 16S rDNA sequences will be degenerated if more sequences and polymorphisms are revealed. Alternatively, the conserved sites will be more evident if full-length 16S rRNA sequences from variant rare biospheres are found to exhibit the same conservation patterns. In return, these conservation patterns may help to understand the role of ribosome RNAs in protein translation.

The method proposed here is also useful for generating specific primers for interested taxa at lower taxonomic levels in an environment. In a previous report, the CHECK_PROBE program in the RDP database and the BLAST program were employed to predict cyanobacteria-specific
primers (Nubel et al. 1997). However, there are problems with these methods as demonstrated previously (Wang and Qian 2009). Hence, the method here is recommended since it may enable more specific primers to be generated for different taxonomic levels.

Cross-References

▶ Binning Sequences Using Very Sparse Labels Within a Metagenome
▶ Challenge of Metagenome Assembly and Possible Standards
▶ I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments
▶ RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data

References

Baker GC, Smith JJ, Cowan DA. Review and re-analysis of domain-specific 16S primers. J Microbiol Methods. 2003;55:541–55.
Brimacombe R, Stiege W. Structure and function of ribosomal RNA. Biochem J. 1985;229:1–17.
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM. The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:141–5.
DeLong EF. Archaea in coastal marine environments. Proc Natl Acad Sci U S A. 1992;89:5685–9.
Demeshkina N, Jenner L, Westhof E, Yusupov M, Yusupova G. A new understanding of the decoding principle on the ribosome. Nature. 2012;484:256–9.
Fox GE. Origin and evolution of the ribosome. Cold Spring Harb Perspect Biol. 2010;2:1–18.
Fuhrman JA, McCallum K, Davis AA. Novel major archaebacterial group from marine plankton. Nature. 1992;356:148–9.
Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML. Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 2008;4:e1000255.
Huws SA, Edwards JE, Kim EJ, Scollan ND. Specificity and sensitivity of eubacterial primers utilized for molecular profiling of bacteria within complex microbial ecosystems. J Microbiol Methods. 2007;70:565–9.
Klappenbach JA, Dunbar JM, Schmidt TM. rRNA operon copy number reflects ecological strategies of Bacteria. Appl Environ Microbiol. 2000;66:1328–33.
Klindworth A, Pruesse E, Schweer T, Peplies J, Quast C, Horn M, Glockner FO. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 2013;41:e1.
Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR. Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A. 1985;82:6955–9.
Liao D. Gene conversion drives within genic sequences: concerted evolution of ribosomal RNA genes in bacteria and archaea. J Mol Evol. 2000;51:305–17.
Nikkari S, Lopez FA, Lepp PW, Cieslak PR, Ladd-Wilson S, Passaro D, Danila R, Relman DA. Broad-range bacterial detection and the analysis of unexplained death and critical illness. Emerg Infect Dis. 2002;8:188–94.
Nubel U, Garcia-Pichel F, Muyzer G. PCR primers to amplify 16S rRNA genes from cyanobacteria. Appl Environ Microbiol. 1997;63:3327–32.
Pace NR. A molecular view of microbial diversity and the biosphere. Science. 1997;276:734–40.
Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35:7188–96.
Qian P-Y, Wang Y, Lee OO, Lau SCK, Yang J, Lafi FF, Al-Suwailem A, Wong TYH. Vertical stratification of microbial communities in the Red Sea revealed by 16S rDNA pyrosequencing. ISME J. 2011;5:507–18.
Quail M, Smith M, Coupland P, Otto T, Harris S, Connor T, Bertoni A, Swerdlow H, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13:341.
Roussel EG, Bonavita M-AC, Querellou J, Cragg BA, Webster G, Prieur D, Parkes RJ. Extending the sub-sea-floor biosphere. Science. 2008;320:1046.
Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ. Microbial diversity in the deep sea and the underexplored rare biosphere. Proc Natl Acad Sci U S A. 2006;103:12115–20.
Tremaroli V, Backhed F. Functional interactions between the gut microbiota and host metabolism. Nature. 2012;489:242–9.
Wang Y, Qian P-Y. Conservative fragments in bacterial 16S rRNA genes and primer design for 16S ribosomal DNA amplicons in metagenomic studies. PLoS ONE. 2009;4:e7401.
Woese CR. Bacterial evolution. Microbiol Rev. 1987;51:221–71.
Culture Collections in the Study of Microbial Diversity, Importance

Martin Sievers
Zurich University of Applied Sciences, Institute of Biotechnology, Waedenswil, Switzerland

Introduction

Prokaryotes, which comprise the bacterial and archaeal domains, show very high biodiversity. Over 50 different phyla including candidate phyla with cultivable species and uncultivable representatives, which are only characterized via metagenomics, have been detected. Microbial strains are ubiquitous and are able to grow in extreme environments, and determining their functions and activities in the environment is essential for our understanding of life. Culture collections can help in the cataloging and preservation of microbial strains and their genomic DNA.

Role of Culture Collections

Culture collections are important in the preservation of biodiversity and thus contribute to the objectives of the Convention on Biological Diversity (CBD; www.cbd.int) through the preservation of important genetic resources.

The primary function of microbial culture collections is to gather, maintain, and distribute strains which have unique properties and are of practical value in various applications like research, teaching, quality control assays, and biotechnology (Uruburu 2003; Emerson and Wilson 2009). Culture collections supply their users with well-characterized strains and replicable parts (plasmid, DNA) as well as with the associated documentation relevant to these biological materials. Cultures, strains, and DNA from culture collections are distributed with a material transfer agreement (MTA) which provides users with all relevant handling information and regulations for commercial use of the supplied biological material.

Type strains, which constitute the name-bearing reference strain of a species and are often used in the study of bacterial systematics, are available from culture collections worldwide. Type strains must be deposited in two public collections in two different countries in order to have the name and thus the species validated (Stackebrandt 2010).

Culture collections are a valuable resource for the exploitation of biological diversity and can help countries rich in biodiversity to understand and utilize their microbial diversity more effectively (Arora et al. 2005). Culture collections also act as an interface between their providers and users of genetic resources to support fair and equitable sharing of the benefits based on documents like Prior Informed Consent (PIC) and mutually agreed terms (Sievers et al. 2010). In fulfilling these roles, culture collections have several responsibilities regarding biosafety requirements. These include compliance with international agreements and conventions on biodiversity, the support of researchers seeking intellectual property rights, the implementation of new technologies, and the search for additional funding for their vital work.

In addition, microbial culture collections which are recognized as international depositary authorities (IDA) offer deposition of microorganisms involved in inventions for patent purposes according to the Budapest Treaty (http://www.wipo.int/treaties/en/registration/budapest/trtdocs_wo002.html).

Importance of Microorganisms

Microbial strains are used for a wide range of scientific, industrial, and health-care applications, for example, as sources of enzymes, proteins, vitamins, organic acids, bioactive compounds, antimicrobial peptides, and biopolymers. Microorganisms are used in agriculture as bio-fertilizer, in wastewater treatment as agents for degradation of compounds with complex structures, for metal recovery, to catalyze specific chemical reactions, for bioenergy production, as starter cultures in the
production of fermented food, as probiotics, and as reference material in diagnostics and development of new therapeutics.

In contrast to their beneficial relatives, pathogenic microbes cause severe diseases in humans, animals, and plants, resulting in significant economic loss and risk to global health. The useful products and processes provided by microorganisms can be grouped into four broad categories: fine chemicals, processes, commodities, and emerging technologies (Kuo and Garrity 2002). Thus, the use and the study of microorganisms contribute to further economic growth and health promotion and are of immense social and ecological value (Komagata 1999; Prakash et al. 2012; Smith 2003).

Microbial Diversity

Due to their myriad environmental roles and functions, microorganisms are important components of the world’s biodiversity. Microbial diversity refers to the richness and degree of variability among species and strains within an ecosystem. Microbial communities of an investigated sample are composed of species which could be isolated as well as the “silent majority” species which are considered non-culturable under standard laboratory conditions and only their DNA is accessible for genetic characterization. The richness of bacterial species is highly variable in different environmental communities. Some environments like the upper atmosphere, glacial ice, and highly acidic stream waters have low numbers of bacterial species in comparison to soil, microbial mats, and marine water, which harbor vast numbers of bacterial species (Fierer and Lennon 2011). The estimation of the number of bacterial species per gram of soil is not a trivial task. Metagenomic approaches based on analysis of environmental DNA sequence data help to study microbial communities and to estimate their species richness. Based on high-throughput 16S rDNA pyrosequencing and phylogenetic analysis, the most abundant species of bacteria in different soil samples were assigned to the phyla Proteobacteria and Bacteroidetes (Roesch et al. 2007). A large percentage of soil bacteria could not be isolated by cultivation (between 90 % and 99 % in a given sample). Soil bacteria of the phyla Acidobacteria and Verrucomicrobia are poorly represented in pure cultures, and members of the Actinobacteria, Firmicutes, and Proteobacteria, in contrast, are well represented in culture collections. For example, the most dominant genus of the American Type Culture Collection (ATCC) soil accessions is Streptomyces, belonging to the phylum Actinobacteria, reflecting their importance as producers of bioactive compounds and in soil ecology (Floyd et al. 2005).

Currently, culture collections cover only a fraction of the diversity of microorganisms and will benefit from the deposition of new strains which are suitable for industrial use since they represent a rich and abundant source of novel molecules with various biological activities.

Identification of Strains at Species Level

Isolated strains are checked for purity microscopically, for morphological homogeneity by uniformity of colony form on agar plate, by distinct color formation on chromogenic agar, and for confirmation by denaturing gradient gel electrophoresis (DGGE; single band obtained). Identification of pure strains at species level is usually performed using ribosomal RNA gene sequence analysis. Housekeeping genes encoding RNA polymerase beta subunit (rpoB), RNA polymerase sigma factor (rpoD), gyrase beta subunit (gyrB), recombinase A (recA), or heat shock protein (hsp60) provide in some cases better genetic resolution on the species level than the 16S rDNA sequence used in taxonomic studies. Combinations of housekeeping genes in multi-locus analyses provide a taxonomic tool for identification of prokaryotes at species and strain level (Moore et al. 2010). DNA sequences in combination with protein spectra obtained by MALDI-TOF-MS are very efficient for identifying strains at species level. MALDI-TOF-MS used for species identification generates protein spectra in the size range between 2 and 20 kDa which is dominated by ribosomal
proteins. With this technology, the spectra generated from an unknown strain are compared with a reference data bank (Wieser et al. 2012). DNA-DNA hybridization (DDH) values are used to determine relatedness between strains, and strains are considered to belong to the same species when DDH values are approximately 70 % or greater (Wayne et al. 1987). Average nucleotide identity (ANI) of common genes has been proposed as an alternative method to replace DDH. The cutoff value of 70 % DDH for species delineation corresponds to an ANI value of 95 % (Goris et al. 2007). ANI can be calculated from partial genome sequences (at least 20 % of the genome) of the query strains (Richter and Rosselló-Móra 2009). Unique strains of one species can be identified by metabolic activities (sugar utilization, acid production), resistance to antibiotics, and genetic fingerprints obtained by rep-PCR.

Strains in a collection should undergo minimal passages before distribution to reduce genetic variation within these strains. This can be achieved by establishing a two-tiered system composed of a master bank and a working (distribution) bank for each organism (Day and Stacey 2008).

Future Tasks of Culture Collections

Environmental samples from habitats that harbor undiscovered microorganisms and are disappearing due to climate change or forest clearance should be a primary focus of culture collections in order to preserve their microbial diversity. The filamentous fungus Penicilliopsis clavariaeformis, which produces the orange pigment penicilliopsin, occurs on fruits and seeds of Diospyros trees in Indonesia and Taiwan (Hsieh and Ju 2002; Oxford and Raistrick 1940), and strains of this species will be lost to isolation when the trees disappear (Colwell 2002). To prevent the loss of microbial strains from disappearing habitats, preservation methods for ecosystems and natural communities have to be developed (Prakash et al. 2013).

Data generated from genomic and proteomic studies are useful to identify microbial species in communities and help to determine their function in an ecosystem. These data sets can be applied to develop cultivation methods for ecologically important microorganisms that are not yet cultivable (Prakash et al. 2012). Information based on DNA sequences is increasingly used in ecological research and in investigating microbial communities. Extracted DNA intended for DNA barcoding should be deposited in a repository for further taxonomic and biotechnological studies (Vernooy et al. 2010). Handling of biological sequence data derived from “omics” (genomics, transcriptomics, proteomics, metabolomics), including storage and accessibility, should be standardized. Culture collections that develop into biological resource centers (BRCs) meet high standards of quality management and accreditation and are able to participate in networking initiatives to strengthen the collaboration between collections and their users (Janssens et al. 2010; Stackebrandt 2010).

Sequencing of complete bacterial genomes leads to the discovery and characterization of new gene families (Wu et al. 2009). The ongoing characterization of microbes will lead to new strains, microbial metabolites, and novel protein-coding genes suitable for use in many industrial and health applications.

Summary

Culture collections play a key role in preservation, taxonomic characterization, and supply of diverse microbial strains with associated informative documents. Collection organizations such as WFCC (World Federation for Culture Collections) and ECCO (European Culture Collections’ Organisation) and activities of CABRI (Common Access to Biological Resources and Information) promote and support culture collections and their related services.

With recent advances in genomic analysis and molecular genetics, researchers are increasingly able to understand, harness, and engineer the vast biochemical potential of microorganisms. Further activities are therefore necessary to fully realize the huge scientific and economic potential of these rich and diverse natural resources.
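The species thresholds quoted in this entry for strain identification (DDH of approximately 70 % or greater, corresponding to an ANI of 95 %, with ANI computed from at least 20 % of each genome) can be written as a simple decision rule. The Python sketch below is illustrative only: the function name, the argument names, the preference for ANI over DDH when both are given, and the use of the values as hard cutoffs are assumptions made for the example. Real taxonomic decisions rest on more evidence than a single number.

```python
from typing import Optional

# Threshold values as cited in the text; treating them as hard cutoffs
# is a simplifying assumption of this sketch.
DDH_SPECIES_CUTOFF = 70.0    # percent (Wayne et al. 1987)
ANI_SPECIES_CUTOFF = 95.0    # percent (Goris et al. 2007)
MIN_GENOME_FRACTION = 0.20   # sequenced fraction (Richter and Rossello-Mora 2009)


def same_species(ddh: Optional[float] = None,
                 ani: Optional[float] = None,
                 genome_fraction: float = 1.0) -> Optional[bool]:
    """Return True or False if the data support a call, None otherwise.

    ANI is used only when enough of the genome was sequenced;
    otherwise the function falls back to the DDH value, if any.
    """
    if ani is not None and genome_fraction >= MIN_GENOME_FRACTION:
        return ani >= ANI_SPECIES_CUTOFF
    if ddh is not None:
        return ddh >= DDH_SPECIES_CUTOFF
    return None


if __name__ == "__main__":
    print(same_species(ani=96.2))                        # ANI-based call
    print(same_species(ddh=55.0))                        # DDH-based call
    print(same_species(ani=97.0, genome_fraction=0.1))   # too little sequence
```

Returning None rather than guessing keeps the "not enough data" case explicit, which matters when only a small part of a genome has been sequenced.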
C 114 Culturing
Cross-References

▶ A 123 of Metagenomics
▶ All-Species Living Tree Project
▶ Biological Treasure Metagenome
▶ Culturing
▶ Metagenomics, Metadata, and Meta-analysis
▶ Microbial Ecosystems, Protection of
▶ Phylogenetics, Overview
▶ Silva Databases

References

Arora DK, Saikia R, Dwievdi R, Smith D. Current status, strategy and future prospects of microbial resource collections. Curr Sci. 2005;89:488–95.
Colwell RR. The future of microbial diversity research. In: James TS, Reysenbach A-L, editors. Biodiversity of microbial life: foundations of earth’s biosphere, vol. 3. New York: Wiley-Liss; 2002. p. 521–34.
Day JD, Stacey GN. Biobanking. Mol Biotechnol. 2008;40:202–13.
Emerson D, Wilson W. Giving microbial diversity a home. Nat Rev Microbiol. 2009;7:758.
Fierer N, Lennon JT. The generation and maintenance of diversity in microbial communities. Am J Bot. 2011;98:439–48.
Floyd MM, Tang J, Kane M, Emerson D. Captured diversity in a culture collection: case study of the geographic and habitat distributions of environmental isolates held at the American type culture collection. Appl Environ Microbiol. 2005;71:2813–23.
Goris J, Konstantinidis KT, Klappenbach JA, et al. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol. 2007;57:81–91.
Hsieh HM, Ju YM. Penicilliopsis pseudocordyceps, the holomorph of Pseudocordyceps seminicola, and notes on Penicilliopsis clavariaeformis. Mycol. 2002;94:539–44.
Janssens D, Arahal DR, Bizet C, Garay E. The role of public biological resource centers in providing a basic infrastructure for microbial research. Res Microbiol. 2010;161:422–9.
Komagata K. Microbial diversity and the role of culture collections. 1999. http://old.iupac.org/symposia/proceedings/phuket97/komagata.pdf
Kuo A, Garrity GM. Exploiting microbial diversity. In: James TS, Reysenbach A-L, editors. Biodiversity of microbial life: foundations of earth’s biosphere, vol. 3. New York: Wiley-Liss; 2002. p. 477–520.
Moore ERB, Mihaylova SA, Vandamme P, et al. Microbial systematics and taxonomy: relevance for a microbial commons. Res Microbiol. 2010;161:430–8.
Oxford AE, Raistrick H. Studies in the biochemistry of micro-organisms: penicilliopsin, the colouring matter of Penicilliopsis clavariaeformis Solms-Laubach. Biochem J. 1940;34:790–803.
Prakash O, Shouche Y, Jangid K, Kostka JE. Microbial cultivation and the role of microbial resource centers in the omics era. Appl Microbiol Biotechnol. 2012. doi:10.1007/s00253-012-4533-y.
Prakash O, Nimonkar Y, Shouche YS. Practice and prospects of microbial preservation. FEMS Microbiol Lett. 2013;339:1–9.
Richter M, Rosselló-Móra R. Shifting the genomic gold standard for the prokaryotic species definition. Proc Natl Acad Sci U S A. 2009;106:19126–31.
Roesch LFW, Fulthorpe RR, Riva A, et al. Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J. 2007;1:283–90.
Sievers M, Dasen G, Wermelinger T, et al. Culture collections and the biotechnology deal. Chimia. 2010;64:782–3.
Smith D. Culture collections over the world. Int Microbiol. 2003;6:95–100.
Stackebrandt E. Diversification and focusing: strategies of microbial culture collections. Trends Microbiol. 2010;18:283–7.
Uruburu F. History and services of culture collections. Int Microbiol. 2003;6:101–3.
Vernooy R, Haribabu E, Muller MR, et al. Barcoding life to conserve biological diversity: beyond the taxonomic imperative. PLoS Biol. 2010;8:e1000417.
Wayne LG, Brenner DJ, Colwell RR, et al. Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol. 1987;37:463–4.
Wieser A, Schneider L, Jung J, Schubert S. MALDI-TOF MS in microbiological diagnostics-identification of microorganisms and beyond (mini review). Appl Microbiol Biotechnol. 2012;93:965–74.
Wu D, Hugenholtz P, Mavromatis K, et al. A phylogeny-driven genomic encyclopaedia of bacteria and archaea. Nature. 2009;462:1056–60.

Culturing

Sarah Highlander
Genomic Medicine, J. Craig Venter Institute, La Jolla, CA, USA

Definitions

Microbiome: The microbes (bacteria, archaea, fungi, protists, and viruses) that inhabit a specific environment or host, such as all the microbes that live in and on the human body.
Mesophile: An organism that grows and thrives within a moderate temperature range, usually between 20 and 45 °C.
Fastidious Organism: An organism that has very specific and usually complex growth requirements.
Microbial Commensalism: Interaction between two species where one benefits but does not harm or affect the other.
Microbial Mutualism: Interaction between two species where both organisms benefit.

Introduction

Only about 1 % of all prokaryotic species in the biosphere are thought to be cultivable (Handelsman 2004). A few thousand taxa are associated with the human microbiome. The taxa on the skin are mostly cultivable (ca. 90 %) (Gao et al. 2007), about 50 % of the oral species are cultivable (Dewhirst et al. 2010), and about 50 % of the taxa of the gut microbiome may be cultivable (Goodman et al. 2011). The history of cultivation of microbes can be traced to the Egyptians, who used yeasts and lactic acid bacteria in the production of breads, wines, and beers. Robert Koch first cultivated bacteria on solid medium in 1881. This permitted single colony isolation and production of pure cultures. While this seemed a logical approach, we are now learning that some organisms cannot grow in pure culture and that pure cultures are often not representative of the environment from which they were isolated.

Environmental Requirements for Growth

Bacteria can be separated into groups based on their growth requirements, which include specific nutrients, optimal temperature (usually mesophiles for members of the human microbiome) and pressure ranges, and sensitivity or tolerance to salinity or pH. A very important feature is the optimal oxygen concentration required for growth. There are five categories: (1) obligate aerobes (require ca. 20 % O2); (2) microaerophiles, which grow well at reduced O2 concentration; (3) facultative anaerobes that can grow aerobically or can respire anaerobically or grow fermentatively; (4) aerotolerant anaerobes, which are not killed by O2 but cannot respire aerobically and grow optimally only under anoxic conditions; and (5) obligate anaerobes that are usually killed in the presence of O2 and grow only in the absence of oxygen. Strict anaerobes must be manipulated in an anoxic chamber. Aerotolerant anaerobes can be briefly handled on the laboratory bench for standard bacteriological techniques, but they must be incubated anaerobically. Anaerobic organisms predominate in the human oral, gastrointestinal, and vaginal tracts. Examples of obligately anaerobic genera in the human body are Clostridium and Bacteroides; aerotolerant members include Propionibacterium and Lactobacillus; Escherichia is a facultative anaerobe. Some organisms require high concentrations of CO2 (5 %) in addition to O2. These are capnophiles; examples are some streptococci, Neisseria spp., and Haemophilus spp. that are residents of the respiratory tract.

Nutritional Requirements: Media Types

All bacteria require carbon, nitrogen, and sulfur for metabolism. It is generally believed that the human body is colonized only by heterotrophic bacteria that obtain energy from oxidation of organic carbon substrates, although the presence of autotrophic cyanobacteria has been reported in some oral and fecal samples based on 16S rDNA sequencing. In complex bacteriological culture media, carbon, nitrogen, and sulfur for heterotrophic growth are usually provided by a peptone (digested protein) such as casamino acids, Lab-Lemco (Oxoid), tryptone, or soytone. Vitamins, amino acids, and carbohydrates are provided by the addition of yeast extract. Sodium chloride is sometimes added as an osmotic stabilizer. Since many bacteria have special requirements for vitamins and other trace minerals, media are often supplemented
with hydrolysates such as beef extract or yeast extract. To create a solid medium, agar, a polysaccharide product of seaweed, is added to the broth formulation prior to autoclaving. To support the growth of fastidious organisms, fresh defibrinated blood (usually sheep’s blood) is often added to cooled agar media prior to pouring into Petri dishes. Haemophilus and Neisseria require blood that has been lysed prior to use (chocolate agar). Strict anaerobes require agar media that can be pre-reduced (i.e., incubated in an anoxic environment to remove all O2 prior to use), and some species require the addition of hemin and vitamin K. A good general agar for this use is Anaerobic Reducible Blood Agar (Remel, Lenexa, KS), which contains cysteine HCl, palladium chloride, and dithiothreitol to maintain a low redox potential of the agar. It also contains hemin and vitamin K and can be purchased with colistin and nalidixic acid as a selective medium for isolation of gram-positive organisms or with kanamycin, vancomycin, and neomycin as a selective medium for gram-negative organisms, particularly the Bacteroides.

Numerous preformulated specialty powdered and premade bacteriological media and plates are available for selection, identification, and cultivation of a wide variety of human bacterial species, with a focus on pathogens.

Methods to Enhance Growth of Uncultivable Organisms

The concept of isolation of pure cultures has been challenged by groups that have shown that cocultivation of organisms can sometimes lead to the successful isolation of previously uncultivable organisms. New technologies have been applied to isolate and capture cells and then incubate them in an environment that simulates (or is) the natural one.

The groups of Slava Epstein and Kim Lewis have made key contributions to methods and discoveries leading to the cultivation of such isolates. In 2002 they reported on the use of diffusion chambers to grow microbial colonies from an intertidal sandy flat in an aquarium containing seawater as the growth medium (Kaeberlein et al. 2002). They estimated that up to 40 % of the cells inoculated into the chamber could be cultivated, but attempts to grow these microcolonies in pure culture were very inefficient. One isolate, which grew poorly on the agar plates, grew well in coculture with any of three other isolates obtained from the chambers. They expanded this to create a high-throughput Ichip diffusion array that contains 192 chambers per array (Nichols et al. 2010). A clever application of the technology was the creation of an upper palate dental appliance that carried a 72-chamber Ichip diffusion array (Sizova et al. 2012). The appliance was worn by a subject for 48 h and then recovered and placed in an anaerobic chamber. Bacterial cells from the chambers were plated on a “basic anaerobic medium” that was low in sugar concentration to prevent selection for fast-growing species. This method contributed 39 isolates, several of which represented taxa that had not been previously cultivated. The take-home lessons that these authors stressed were that “domestication” of uncultivated organisms from the human microbiome is more likely if the bacteria are first grown in vivo and that cell growth should be allowed to occur “unimpeded by neighbors,” for example, by growth in diffusion chambers, by dilution to extinction (Rappe et al. 2002), or by growth encapsulated in microdroplets (Zengler et al. 2002). They also stressed the requirement for strict anaerobic conditions (for oral samples) and the use of media low in readily utilizable carbohydrates (Sizova et al. 2012).

The commensalism and mutualism of some bacterial species have been exploited to stimulate growth of previously uncultivated organisms. Examples of commensalism in dental plaque are well established, such as the catabolism of sugars by streptococci to lactic acid, which is fermented by the veillonellae, which cannot utilize sugars. Vartoukian et al. were able to cultivate Cluster
A Synergistetes, which had not been previously accomplished, by growing human plaque samples in a complex cooked meat medium (Vartoukian et al. 2010). Using fluorescent in situ hybridization (FISH) directed against the Synergistetes 16S rRNA, they followed the presence of an isolate, Synergistetes SGP1, and observed that the Synergistetes cells formed aggregates with other bacteria. Ultimately, they showed that growth of SGP1 was stimulated by cross-streaks of Staphylococcus aureus, Fusobacterium nucleatum, Parvimonas micra, and Tannerella forsythia, which were members of the cell aggregates. The mechanism of the effect has not yet been ascertained. Siderophore sharing, or “stealing,” is a theme common in bacterial pathogenesis and is the one that the Epstein and Lewis group observed as a mechanism that permitted coculture of some strains of bacteria isolated from sand biofilms (D’Onofrio et al. 2010). They observed that samples plated at high density yielded many more colonies than expected from plates with diluted biofilm samples and hypothesized that adjacent pairs of species might have growth dependencies. One strain, Micrococcus luteus KLE1011, was shown to secrete 5 distinct but related siderophores, any one of which was able to induce growth of the uncultivated strain Maribacter polysiphoniae KLE1104. The M. luteus strain was then used as “bait” to capture additional uncultivated bacteria from the samples (D’Onofrio et al. 2010). It would be surprising if this phenomenon was not observed between members of the human microbiome.

Summary

The cultivation of prokaryotes continues to follow mostly traditional methods, although some groups are beginning to recognize that the cultivation of the uncultivable requires a better appreciation of the in vivo environment of the species that is sought. The application of diffusion chambers and microdroplet technologies to human microbiome samples should accelerate cultivation of some species, and metabolic predictions from whole genome shotgun sequencing may, in future, permit rational cultivation of new species of bacteria.

References

D’Onofrio A, Crawford JM, Stewart EJ, Witt K, Gavrish E, Epstein S, et al. Siderophores from neighboring organisms promote the growth of uncultured bacteria. Chem Biol. 2010;17:254–64.
Dewhirst FE, Chen T, Izard J, Paster BJ, Tanner AC, Yu WH, et al. The human oral microbiome. J Bacteriol. 2010;192:5002–17.
Gao Z, Tseng CH, Pei Z, Blaser MJ. Molecular analysis of human forearm superficial skin bacterial biota. Proc Natl Acad Sci U S A. 2007;104:2927–32.
Goodman AL, Kallstrom G, Faith JJ, Reyes A, Moore A, Dantas G, et al. Extensive personal human gut microbiota culture collections characterized and manipulated in gnotobiotic mice. Proc Natl Acad Sci U S A. 2011;108:6252–7.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Kaeberlein T, Lewis K, Epstein SS. Isolating “uncultivable” microorganisms in pure culture in a simulated natural environment. Science. 2002;296:1127–9.
Nichols D, Cahoon N, Trakhtenberg EM, Pham L, Mehta A, Belanger A, et al. Use of ichip for high-throughput in situ cultivation of “uncultivable” microbial species. Appl Environ Microbiol. 2010;76:2445–50.
Rappe MS, Connon SA, Vergin KL, Giovannoni SJ. Cultivation of the ubiquitous SAR11 marine bacterioplankton clade. Nature. 2002;418:630–3.
Sizova MV, Hohmann T, Hazen A, Paster BJ, Halem SR, Murphy CM, et al. New approaches for isolation of previously uncultivated oral bacteria. Appl Environ Microbiol. 2012;78:194–203.
Vartoukian SR, Palmer RM, Wade WG. Cultivation of a Synergistetes strain representing a previously uncultivated lineage. Environ Microbiol. 2010;12:916–28.
Zengler K, Toledo G, Rappe M, Elkins J, Mathur EJ, Short JM, et al. Cultivating the uncultured. Proc Natl Acad Sci U S A. 2002;99:15681–6.
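The five oxygen-requirement categories described in this entry, together with the handling rules and example genera given for them, can be kept as a small lookup table when planning how isolates are to be manipulated. The Python sketch below is illustrative only; the dictionary layout and the function names are assumptions made for the example, and a handling value of None means that the entry gives no handling instruction for that category.

```python
# Oxygen-requirement categories as summarized in the Culturing entry.
# "handling" is None where the text states no handling rule.
OXYGEN_CATEGORIES = {
    "obligate aerobe": {
        "oxygen": "requires ca. 20 % O2",
        "handling": None,
        "examples": [],
    },
    "microaerophile": {
        "oxygen": "grows well at reduced O2 concentration",
        "handling": None,
        "examples": [],
    },
    "facultative anaerobe": {
        "oxygen": "grows aerobically, by anaerobic respiration, or by fermentation",
        "handling": None,
        "examples": ["Escherichia"],
    },
    "aerotolerant anaerobe": {
        "oxygen": "not killed by O2; grows optimally only under anoxic conditions",
        "handling": "brief bench handling, anaerobic incubation",
        "examples": ["Propionibacterium", "Lactobacillus"],
    },
    "obligate anaerobe": {
        "oxygen": "usually killed by O2; grows only in its absence",
        "handling": "anoxic chamber",
        "examples": ["Clostridium", "Bacteroides"],
    },
}


def needs_anoxic_chamber(category: str) -> bool:
    """True if the text says the category must be manipulated in an anoxic chamber."""
    return OXYGEN_CATEGORIES[category]["handling"] == "anoxic chamber"


def categories_for_genus(genus: str) -> list:
    """Return the categories for which the text names this genus as an example."""
    return [name for name, info in OXYGEN_CATEGORIES.items()
            if genus in info["examples"]]


if __name__ == "__main__":
    print(needs_anoxic_chamber("obligate anaerobe"))
    print(categories_for_genus("Lactobacillus"))
```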
C 118 Customizable Web Server for Fast Metagenomic Sequence Analysis
Customizable Web Server for Fast Metagenomic Sequence Analysis

Sitao Wu1, Zhengwei Zhu1, Limin Fu1, Beifang Niu1 and Weizhong Li2
1 Center for Research in Biological Systems (CRBS), University of California, San Diego, La Jolla, CA, USA
2 J. Craig Venter Institute, La Jolla, CA, USA

Synonyms

A customizable Web server for fast metagenomic sequence analysis

Definition

WebMGA is a Web server through which researchers can upload metagenomic sequence data and run various tools to analyze the data. The tools in WebMGA can also be accessed through the Web services using client-side scripts.

Introduction

Metagenomics is an approach that studies environmental microorganism populations, predominantly using the next-generation sequencing technologies developed during the last decade. Scientists have already studied the microbes in many different environments such as water, soil, air, human body sites, and many others.

Metagenomic data analysis from raw sequencing reads to biological discoveries is a very complicated process and includes many computational procedures such as sequence quality control, filtering, mapping, assembly, gene prediction, normalization, function and pathway analyses, visualization, and statistical studies. In the last several years, many computational tools have been developed to address the problems in metagenomic data analysis. For example, Metagene (Noguchi et al. 2006) and FragGeneScan (Rho et al. 2010) predict ORFs from fragmented sequences. Meta-RNA (Huang et al. 2009) scans rRNA from short sequences. Mothur (Schloss et al. 2009), QIIME (Caporaso et al. 2010), and CD-HIT-OTU (Li et al. 2012) are software packages for estimating microbial diversities based on 16S rRNA tags. RAMMCAP (Li 2009) is an integrated annotation pipeline that provides gene prediction, clustering, function annotations, and several other functions.

These complicated processes, together with the limited availability of computational resources, tend to discourage bench biologists from attempting to analyze their own metagenomic data. Integrated bioinformatics systems specific to metagenomic data analysis, especially easy-to-use Web portals, are therefore of great importance for researchers in various communities to fully utilize the metagenomic approach.

WebMGA was developed as a fast, easy, and flexible solution for metagenomic data analysis. It is freely available to all users at http://weizhongli-lab.org/metagenomic-analysis.

Metagenomic Analysis Tools Provided in WebMGA

More than 20 different analysis tools specially designed for metagenomic data analysis are provided by the WebMGA Web portal. Most of these tools are very fast. Also, they are implemented to be executed in parallel on a computer cluster. Given the raw metagenomic sequencing reads, the following analyses can be completed:
• A quality control (QC) script filters and trims raw sequencing reads and yields high-quality reads.
• Software SolexaQA (Cox et al. 2010) can also be used as a QC tool.
• Program CD-HIT-454 (Niu et al. 2010) identifies and removes artificial duplicates.
If the input sequences are filtered reads or DNA sequences, WebMGA provides the following analyses:
• tRNAscan (Lowe and Eddy 1997) finds tRNAs from the input sequences.
Customizable Web Server for Fast Metagenomic Sequence Analysis, Fig. 1 The web server page for DNA clustering
• Meta-RNA (Huang et al. 2009) identifies rRNAs from fragmented sequences using a hidden Markov model-based algorithm.
• A BLAST-based program identifies rRNAs by comparing the query against several rRNA reference databases.
• For metagenomic data from human subjects, WebMGA offers a tool that identifies human DNAs and RNAs and removes them from input metagenomic sequences. A fast mapping program FR-HIT (Niu et al. 2011) is used to align the input sequences against human reference sequences.
• CD-HIT-EST, an ultrafast sequence-clustering program, clusters the DNAs into groups or removes redundant sequences.
Customizable Web Server for Fast Metagenomic Sequence Analysis, Fig. 2 A simple workflow using tools in WebMGA
• WebMGA has a taxonomy-binning tool that maps the reads to reference genomes using FR-HIT and then assigns taxonomy annotations.
• ORF_finder (Li 2009) calls ORFs from input sequences by six-reading-frame translation.
• Metagene (Noguchi et al. 2006) identifies ORFs from fragmented sequences.
• FragGeneScan (Rho et al. 2010) identifies ORFs and also tries to correct frameshift errors.
Users can input protein or peptide sequences to run the following analyses:
• CD-HIT (Li et al. 2001, 2002; Li and Godzik 2006; Huang et al. 2010) clusters the input sequences into protein clusters or removes redundant sequences.
• A multistep clustering pipeline groups protein sequences into protein families.
• WebMGA uses the HMMER3 program (Eddy 2009) to compare input peptides against the Pfam and Tigrfam databases and assign the domain or protein families.
• WebMGA uses RPS-BLAST to compare input peptides against NCBI’s COG, KOG, and PRK databases and provide function annotations.
• WebMGA provides Gene Ontology (GO) annotations.
• WebMGA searches the KEGG database and provides pathway annotations.
16S rRNA tags can also be analyzed through WebMGA:
• RDP Classifier (Wang et al. 2007) analyzes rRNA tags and assigns taxonomy annotations.
• CD-HIT-OTU (Li et al. 2012) is a pipeline that filters and processes the raw rRNA tags and clusters them into operational taxonomic units (OTUs). CD-HIT-OTU is available at http://weizhongli-lab.org/cd-hit-otu.
Each of the above tools has a Web interface where users can run it individually. Users with programming skills can also compose a script to run a customized multistep analysis workflow through WebMGA’s Web services. As illustrated in Fig. 1, a user can upload a DNA dataset to run several analysis processes in parallel. The user can use the HMM-based or the BLAST-based method to find rRNAs and to produce a FASTA file with the rRNAs masked. This masked file is then processed by an ORF calling program, and the ORFs are used for function and pathway annotation. This workflow is illustrated in Fig. 2.

Summary

WebMGA provides researchers with tools for rapid metagenomic sequence analysis through a Web server and Web services. The tools and functions in WebMGA cover a large scope of metagenomic data analysis such as raw sequence quality control, human DNA filtering, OTU estimation, taxonomy binning, sequence clustering, and function and pathway annotation. By directly accessing the Web services with client-side scripts, users can customize and run their own workflows. The tools and data in WebMGA are
constantly being updated, and new tools for fast metagenomic data analysis will be continuously added.

Cross-References

▶ Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences
▶ FR-HIT Overview

References

Caporaso JG, Kuczynski J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7(5):335–6.
Cox MP, Peterson DA, et al. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinforma. 2010;11:485.
Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23(1):205–11.
Huang Y, Gilna P, et al. Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics. 2009;25(10):1338–40.
Huang Y, Niu B, et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2.
Li W. Analysis and comparison of very large metagenomes with fast clustering and functional annotation. BMC Bioinforma. 2009;10:359.
Li WZ, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.
Li WZ, Jaroszewski L, et al. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282–3.
Li WZ, Jaroszewski L, et al. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18(1):77–82.
Li W, Fu L, et al. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform. 2012;13(6):656–68.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25(5):955–64.
Niu B, Fu L, et al. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinforma. 2010;11:187.
Niu B, Zhu Z, et al. FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics. 2011;27(12):1704–5.
Noguchi H, Park J, et al. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34(19):5623–30.
Rho M, Tang H, et al. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.
Schloss PD, Westcott SL, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75(23):7537–41.
Wang Q, Garrity GM, et al. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73(16):5261–7.
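The ORF-calling step of the WebMGA workflow is described for ORF_finder as six-reading-frame translation. The Python sketch below illustrates that general idea under stated assumptions and is not the ORF_finder implementation: it uses the standard genetic code, treats an ORF as any stop-free stretch of translated sequence (no start codon required, which is a common choice for fragmentary reads), and the function names and the min_len parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T)."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate in frame 0; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_orfs(dna: str, min_len: int = 30):
    """Return (strand, frame, peptide) for stop-free stretches >= min_len residues."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Stop codons ("*") delimit candidate ORFs within each frame.
            for segment in protein.split("*"):
                if len(segment) >= min_len:
                    orfs.append((strand, frame, segment))
    return orfs


if __name__ == "__main__":
    for orf in six_frame_orfs("ATGGCCAAATAA", min_len=3):
        print(orf)
```

In a workflow such as the one in Fig. 2, the input to this step would be the rRNA-masked FASTA records, and the resulting peptides would be passed on to function and pathway annotation.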
D
DACTAL

Tandy Warnow
Institute for Genomic Biology, University of Illinois, IL, USA

Synonyms

Phylogeny = phylogenetic tree = tree; Multiple sequence alignment = MSA

Definition

DACTAL = “Divide-and-conquer trees (almost) without alignments.”

Introduction

DACTAL (divide-and-conquer trees (almost) without alignments) is a method for estimating very large phylogenetic trees which utilizes an iterative divide-and-conquer technique to “boost” the accuracy and speed of an existing phylogeny estimation method. DACTAL constructs trees without needing to compute or use a multiple sequence alignment on the full dataset. This contribution describes the method and demonstrates its performance on biological and simulated datasets.

Phylogeny estimation is a basic step in many bioinformatics analyses, and there are many methods for estimating phylogenies (Felsenstein 2003). Most frequently, phylogeny estimation for a given set of taxa is performed in a sequence of steps: (1) a gene is selected, (2) sequences for that gene in the taxa are obtained, (3) a multiple sequence alignment of the molecular data (DNA, RNA, or amino acid) is estimated, and (4) a tree is estimated on that resultant alignment. Although many preferred methods for phylogeny estimation are based on hard optimization problems (e.g., maximum likelihood), small datasets are not that hard to analyze, and effective heuristics are able to analyze small datasets quite well. Maximum likelihood analysis of datasets with many thousands of sequences is much more difficult, although some methods (e.g., RAxML (Stamatakis 2006) and FastTree-2 (Price et al. 2010); but see also Felsenstein 2003; Warnow 2013) are highly effective even on these datasets. The estimation of large multiple sequence alignments (MSA) is itself very challenging (Kemena and Notredame 2009; Liu et al. 2010; Blair and Murphy 2011): most current MSA methods are unable to analyze very large datasets (10,000 sequences and more) due to computational issues, while those that can analyze datasets of this size (e.g., Clustal-Quicktree and MAFFT-PartTree) produce alignments that result in insufficiently accurate trees. Of the various methods available for large-scale multiple sequence alignment, to date only SATé (Liu et al. 2009, 2012) has been shown to be effective at producing alignments that result in highly accurate trees. However, SATé is limited to
D 124 DACTAL
DACTAL, Fig. 1 DACTAL algorithmic design. the dataset into small, overlapping subsets, estimates trees
DACTAL can begin with an initial tree (bottom triangle), on each subset, and merges the small trees into a tree on
or through a technique that divides the unaligned sequence the entire dataset (figures included from a previous publi-
dataset into overlapping subsets. Each subsequent cation (Nelesen et al. 2012), with permission from the
DACTAL iteration uses a novel decomposition strategy publisher).
called “PRD” (padded recursive decomposition) to divide
datasets of about 50,000 sequences. Thus, the of sequences around each sequence with some
estimation of a large phylogenetic tree is a very overlap between the sequence subsets. Alterna-
challenging problem, and one of the biggest tively, the division can be performed by comput-
issues is the estimation of the multiple sequence ing an alignment and tree on the dataset (using
alignment for the dataset. some fast and approximate methods) and then
using the tree to produce a recursive decomposi-
tion of the sequence dataset. In either case, the
Methods decomposition that is produced produces subsets
that overlap at least one other subset by some
DACTAL (Nelesen et al. 2012) is a method for specified minimum amount (default 50) and that
estimating a very large phylogeny without need- are themselves small (by default each subset has
ing to estimate a multiple sequence alignment on at most 200 sequences).
the entire dataset. The basic approach is Once the decomposition is performed, trees
a combination of divide-and-conquer plus itera- are estimated on each subset, using some favored
tion (see Fig. 1). method; the default is a maximum likelihood
The input is a set of unaligned but homologous analysis (default RAxML) on a good multiple
sequences, and each iteration produces a tree (but sequence alignment, with the default being
no alignment) on the full dataset. With the excep- MAFFT (Katoh et al. 2005). These subsets are
tion of the first iteration, each iteration begins small (by default, they have at most
with the tree from the previous iteration. In the 200 sequences in them), and as the experimental
first iteration, the method begins by dividing the results show, this is sufficient even for datasets
dataset into overlapping subsets, each with at with about 28,000 sequences.
most some user-specified number of sequences; After the trees are computed, they can be
the default for this is 200. This division into sub- merged together into a tree on the full set of
sets can be accomplished through the use of taxa using a supertree method; the default is
a technique that uses BLAST to form small sets SuperFine+MRP (Swenson et al. 2012),
a supertree method that has excellent accuracy and which “boosts” the accuracy of MRP (another supertree method; see Bininda-Emonds 2004). Subsequent iterations begin with the tree estimated during the previous iteration and then decompose the dataset into overlapping subsets, compute trees on subsets, and merge the trees into a tree on the full dataset. The number of iterations is a parameter that is set by the user. Thus, DACTAL is a method that can be modified to enable different techniques for estimating trees on subsets and for combining subset trees into a tree on the full set of taxa, and the target subset size and overlap between subsets are parameters that can be set by the user. The default settings were selected for accuracy and speed and provide good results, as the results section demonstrates.

Results

The performance of DACTAL was evaluated in comparison to maximum likelihood trees computed on SATe-I (Liu et al. 2009) and other alignment methods on simulated datasets with 1,000 sequences and on biological datasets with 6,000–28,000 sequences (Nelesen et al. 2012). The results of these experiments are shown in the figures below and demonstrate that DACTAL had accuracy comparable to that of SATe-I and could analyze larger datasets than SATe-I. These experiments also show that DACTAL was substantially more accurate than two-phase methods (i.e., methods that align sequences and then estimate trees on these alignments).

Figure 2 compares running time and tree accuracy on the 20 replicate datasets for 15 model conditions with 1,000 taxa, originally used to evaluate SATe-I (Liu et al. 2009). These model conditions vary in terms of rates of evolution, indel lengths (short, medium, or long), and relative rates of substitutions and indels (insertions and deletions).

The error in tree estimation is computed using the missing branch rate, which is the fraction of the nontrivial bipartitions in the true (model) tree that are missing in the estimated tree. In this experiment, DACTAL is run for ten iterations, while SATe-I runs for 24 h after it computes the RAxML(MAFFT) starting tree. The running time comparison shows that DACTAL is much faster than SATe-I on every model condition. The comparison with respect to accuracy shows that DACTAL has approximately the same accuracy as SATe-I and that both DACTAL and SATe-I are much more accurate than the two-phase methods on the difficult 1,000-taxon model conditions. Finally, this figure also shows that DACTAL is faster than SATe, although it is slower than the two-phase methods.

DACTAL, Fig. 2 Comparisons of ten iterations of DACTAL to SATe and RAxML trees estimated on different alignments on “moderate-to-difficult” simulated 1,000-taxon datasets. We show missing branch rates (top) and runtimes in hours (bottom); n = 20 for each model condition, and standard error bars are shown. DACTAL and SATe runtimes include the time to compute RAxML(MAFFT) starting trees. Asterisks (*) denote model conditions for which DACTAL’s missing branch rate is a statistically significant improvement over the next best method, according to Benjamini-Hochberg-corrected pairwise t-tests (n = 40; alpha = 0.05) (figures included from a previous publication (Nelesen et al. 2012), with permission from the publisher).

DACTAL, Fig. 3 Comparisons of DACTAL and SATe iterations with two-phase methods on the 16S.T dataset with 7,350 sequences. The starting trees were RAxML on the MAFFT-PartTree alignment (RAxML(Part)) for SATe and FastTree-2 on the MAFFT-PartTree alignment (FT(Part)) for DACTAL. We show missing branch rates (top) and cumulative runtimes in hours (bottom); n = 1 for each reported value. Iteration 0 is used to compute the starting tree for DACTAL and SATe (figures included from a previous publication (Nelesen et al. 2012), with permission from the publisher).

Figure 3 shows performance on a single biological dataset, 16S.T, from the Comparative RNA Web (CRW) site (Cannone et al. 2002). This dataset has 7,350 sequences and a high rate of evolution and so represents a challenging phylogenetic dataset. The reference tree for this dataset is based on a curated structural multiple sequence alignment (Cannone et al. 2002). This figure shows four different two-phase methods (maximum likelihood computed using FastTree-2 or RAxML on either Clustal-Quicktree or MAFFT-PartTree alignments) as well as the trees obtained in each of ten iterations of SATe-I and DACTAL. Note how SATe-I and DACTAL both improve with each iteration, with the initial iterations producing the biggest reductions in tree error, and that they track each other iteration by iteration. However, note that each DACTAL
iteration is much faster than each SATe-I iteration, so that ten iterations of DACTAL finish in about 1/8 the time of ten iterations of SATe-I.

Discussion

DACTAL is a method for estimating trees from unaligned sequences. While it does not require the estimation of an alignment on the full dataset, it is not entirely alignment-free, since it estimates alignments on subsets. However, these subsets are small, containing at most 200 sequences by default, which reduces the computational and analytical challenges to running DACTAL. These experiments show that DACTAL can produce highly accurate phylogenetic estimates on very large datasets, improving on the accuracy of two-phase methods (that first align the sequences and then estimate the tree) and matching that of SATe-I in less time.

Alignment-free methods (i.e., that do not use any multiple sequence alignment technique at all to compute trees) have also been designed; these are surveyed in Vinga and Almeida 2003 and Chan and Ragan 2013. Alignment-free methods typically compute trees in three steps: first, each sequence is characterized by some distribution (e.g., its k-mer distribution for some appropriately chosen k), then distances between sequences are computed, and finally a tree is computed on the distance matrix. Unlike DACTAL, these truly alignment-free methods have not, to our knowledge, been shown to produce trees of comparable accuracy to methods that estimate multiple sequence alignments and then compute maximum likelihood trees on these alignments. Furthermore, the alignment-free methods surveyed in these papers do not have any theoretical guarantees under Markov models of evolution. An interesting contrast to these methods is the recent result given in Daskalakis and Roch 2010. This technique is guaranteed statistically consistent under the TKF1 model (Thorne et al. 1991) and so represents an important advance in theory. However, this method has not yet been implemented, so it remains a theoretical contribution rather than a usable technique. Unlike these truly alignment-free methods, DACTAL is not completely alignment-free, since it does compute alignments on subsets. However, the results shown here suggest that highly accurate trees are indeed possible without requiring a multiple sequence alignment on the full dataset.

Future Work

The phylogenetics research community has been developing improved methods for alignment and phylogeny estimation. These methods may well lead to improved estimations of larger trees and could reduce the need for methods like DACTAL. However, DACTAL may continue to be a useful tool for improving scalability of these methods to very large datasets, containing many tens of thousands of sequences, since these improved techniques could be used to estimate trees on subsets of taxa. This may be particularly relevant to the recent effort to develop methods that co-estimate sequence alignments and trees under complex models of sequence evolution (see Bouchard-Cote and Jordan 2013 for a recent paper and other methods surveyed in Warnow 2013). Most of these methods are limited to at most 200 sequences and are computationally intensive even at that size, and DACTAL could potentially be used to improve their scalability to larger datasets. More generally, the phylogenetics research community has been developing sophisticated techniques for highly accurate estimations of alignments and trees, but these statistically based methods often use techniques (such as MCMC) that are computationally intensive and do not run on large datasets. DACTAL provides a basic tool for improving the scalability of these techniques and so complements these efforts. Thus, large-scale phylogeny estimation may well improve through a combination of efforts: some aimed at improving the estimation of trees and alignments on small datasets, using statistically informed but computationally intensive methods, and others aimed at using divide-and-conquer to combine smaller trees into larger trees.
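
The iteration described in Methods (decompose into overlapping subsets, estimate a tree on each subset, merge the subset trees) can be outlined as a small driver with pluggable steps. This is only a sketch under stated assumptions: the names `dactal_sketch` and `sliding_window_decompose` are hypothetical, the sliding-window split is a toy stand-in for the BLAST-based and PRD decompositions, and the tree estimator and supertree method are left as caller-supplied functions rather than RAxML, MAFFT, or SuperFine+MRP.

```python
from typing import Any, Callable, Hashable, List, Optional, Sequence, Set

Tree = Any  # any tree representation the plug-in functions agree on


def sliding_window_decompose(taxa: Sequence[Hashable], tree: Optional[Tree],
                             max_size: int, overlap: int) -> List[Set[Hashable]]:
    """Toy decomposition (NOT PRD): consecutive windows sharing `overlap` taxa.

    The `tree` argument is ignored here; a tree-based decomposition would use it.
    """
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be non-negative and smaller than max_size")
    step = max_size - overlap
    subsets, start = [], 0
    while True:
        subsets.append(set(taxa[start:start + max_size]))
        if start + max_size >= len(taxa):
            return subsets
        start += step


def dactal_sketch(taxa: Sequence[Hashable],
                  decompose: Callable[..., List[Set[Hashable]]],
                  estimate_subset_tree: Callable[[Set[Hashable]], Tree],
                  merge_trees: Callable[[List[Tree]], Tree],
                  iterations: int = 1,
                  max_subset_size: int = 200,  # default subset size from the text
                  min_overlap: int = 50,       # default overlap from the text
                  starting_tree: Optional[Tree] = None) -> Optional[Tree]:
    """Run `iterations` rounds of decompose -> estimate -> merge."""
    tree = starting_tree
    for _ in range(iterations):
        subsets = decompose(taxa, tree, max_subset_size, min_overlap)
        # Sanity checks: every taxon is covered and no subset is too large.
        if set().union(*subsets) != set(taxa):
            raise ValueError("decomposition does not cover all taxa")
        if any(len(s) > max_subset_size for s in subsets):
            raise ValueError("a subset exceeds the maximum subset size")
        subset_trees = [estimate_subset_tree(s) for s in subsets]
        tree = merge_trees(subset_trees)  # each round starts from this tree
    return tree
```

For a quick check, a "tree" can be mocked as the frozen set of its leaves, for example `dactal_sketch(list(range(500)), sliding_window_decompose, frozenset, lambda ts: frozenset().union(*ts), iterations=2)`; real use would plug in an alignment-plus-tree estimator and a supertree method.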
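
The missing branch rate used in Results is the fraction of nontrivial bipartitions of the true tree that are absent from the estimated tree. A minimal sketch follows, assuming (purely for illustration) that trees are written as nested Python tuples with string leaf labels and that both trees have the same leaf set; the function names are hypothetical and are not taken from the DACTAL software.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the nontrivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # requires mutually comparable labels
    splits = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if reference in below else below
        # Nontrivial: at least two leaves on each side of the split.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def missing_branch_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    return len(true_splits - bipartitions(estimated_tree)) / len(true_splits)


true_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
est_tree = ((("A", "C"), "B"), ("D", ("E", "F")))
rate = missing_branch_rate(true_tree, est_tree)  # one of three splits is missing
```

In the toy example the true tree has three nontrivial bipartitions (AB|CDEF, ABC|DEF, ABCD|EF) and the estimated tree recovers two of them, so the rate is 1/3.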
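
The three-step alignment-free recipe summarized in the Discussion (characterize each sequence by a k-mer distribution, compute pairwise distances, build a tree from the distance matrix) can be illustrated for its first two steps. This is a hedged sketch: the choice of Euclidean distance between normalized k-mer frequencies and the function names are assumptions made for illustration, not a method from the surveyed papers, and the third step is left to any distance-based tree method such as neighbor joining.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_profile(seq, k):
    """Step 1: normalized k-mer frequency distribution of one sequence."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two k-mer profiles (an illustrative choice)."""
    kmers = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in kmers))


def distance_matrix(sequences, k=3):
    """Step 2: symmetric pairwise distances, keyed by (name, name) pairs."""
    profiles = {name: kmer_profile(seq, k) for name, seq in sequences.items()}
    dist = {(name, name): 0.0 for name in profiles}
    for a, b in combinations(sorted(profiles), 2):
        dist[(a, b)] = dist[(b, a)] = profile_distance(profiles[a], profiles[b])
    return dist


# Toy input; the resulting matrix would be passed to a distance-based tree method.
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAC", "s3": "TTTTGGGGCC"}
toy_dist = distance_matrix(toy, k=3)
```

Identical sequences receive distance zero and no alignment is computed at any point, which is what distinguishes this family of methods from DACTAL's subset alignments.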
Summary

DACTAL is a method for estimating large trees from unaligned sequences that uses an iterative divide-and-conquer technique. By design, DACTAL does not produce a multiple sequence alignment, yet analyses on many datasets (both real and simulated) show that DACTAL produces trees with great accuracy, improving on existing two-phase methods that first align and then estimate the tree from the sequences. These analyses also show that DACTAL matches the accuracy of SATe while being much faster. With the increased interest in estimating very large trees, this type of approach could enable highly accurate and very large-scale phylogenetic estimation.

Cross-References

▶ Computational Approaches for Metagenomic Datasets
▶ DACTAL
▶ MRL and SuperFine+MRL
▶ Phylogenetics, Overview
▶ SATe-Enabled Phylogenetic Placement
▶ Use of Viral Metagenomes from Yellowstone Hot Springs to study phylogenetic relationships and evolution

References

Bininda-Emonds O, editor. Phylogenetic supertrees: combining information to reveal the tree of life. Dordrecht: Kluwer Academic Publishers; 2004.
Blair C, Murphy RW. Recent trends in molecular phylogenetics. J Hered. 2011;102(1):130–8.
Bouchard-Cote A, Jordan MI. Evolutionary inference via the Poisson Indel Process. Proc National Academy of Sciences. 2013;110(4):1160–1166.
Cannone J, Subramanian S, Schnare M, et al. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron and other RNAs. BMC Bioinforma. 2002;3:2. doi:10.1186/1471-2105-3-2
Chan CX, Ragan MA. Next-generation phylogenomics. Biology Direct 2013;8:3.
Daskalakis C, Roch S. Alignment-free phylogenetic reconstruction. In: Berger B, editor. Proc. RECOMB 2010, volume 6044 of Lecture Notes in Computer Science. Berlin: Springer; p. 123–37.
Felsenstein J. Inferring phylogenies. Sunderland: Sinauer Associates; 2003.
Katoh K, Kuma K, Miyata T, et al. Improvement in the accuracy of multiple sequence alignment MAFFT. Genome Inform. 2005;16:22–33.
Kemena C, Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics. 2009;25:2455–65.
Liu K, Raghavan S, Nelesen S, et al. Rapid and accurate large-scale co-estimation of sequence alignments and phylogenetic trees. Science. 2009;324:1561–4.
Liu K, Linder CR, Warnow T. Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLOS Currents Tree of Life. 2010. doi:10.1371/currents.RRN1198.
Liu K, Warnow T, Holder M, et al. SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol. 2012;61:90–106.
Nelesen S, Liu K, Wang LS, et al. DACTAL: divide-and-conquer trees (almost) without alignments. Bioinformatics. 2012;28:i274–82.
Price MN, Dehal PS, Arkin AP. FastTree-2 – approximately maximum likelihood trees for large alignments. PLoS ONE. 2010;5:e9490. doi:10.1371/journal.pone.0009490.
Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–90.
Swenson M, Suri R, Linder CR, et al. SuperFine: fast and accurate supertree estimation. Syst Biol. 2012;61:214–27.
Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991;33:114–24.
Vinga S, Almeida J. Alignment-free sequence comparison–a review. Bioinformatics. 2003;19(4):513–523.
Warnow T. Large-scale multiple sequence alignment and phylogeny estimation. In: Chauve C, El-Mabrouk N, Tannier E, editors. Models and Algorithms for Genome Evolution. London: Springer; 2013. p. 85–146.

Diversity and Distribution of Marine Microbial Eukaryotes

Connie Lovejoy
Department of Biology, Laval University, Québec, QC, Canada

Marine microbial eukaryotes (MME) are morphologically, phylogenetically, and functionally diverse. The term protist is often used but is not
a valid taxonomic classification (Adl et al. 2005, 2007), and evolutionary relationships among MME at the highest taxonomic ranks remain controversial.

Functionally, they span all trophic levels, with phototrophic, heterotrophic, and mixotrophic taxa, where mixotrophic taxa are able to use both photosynthesis and heterotrophy as sources for carbon and energy. Taxonomically MME are found among all major branches of the eukaryotic tree of life, with the exception, at least up until now, of the Excavata (Adl et al. 2012). This taxonomic and functional diversity is also manifest in morphological diversity over several microscopic scales, depending on the group.

Unassignable taxa: Taxa that do not fit within descriptions based on standard taxonomy. In the case of gene sequences, this occurs when the sequence in question shows little homology with other sequences. The degree of difference is indicative of taxonomic level differences, e.g., domain, phylum, and order, to the level of genus. For the first 18S rRNA gene surveys using cloning and Sanger sequencing, several phylum-level novel sequences were discovered. The majority of these belong to uncultivated taxa and so there are no organisms available to infer functional roles.

The use of molecular tools to identify MME began some 10 years after the first reports of bacterial and archaeal diversity in the sea, in part because the existing taxonomic record from microscopy seemed complete. This changed when the first surveys were carried out, with two publications in 2001 highlighting surprising diversity of MME in the deep sea and identification of unassignable taxa (Diez et al. 2001; Moon-van der Staay et al. 2001). These studies triggered major programs that were mostly aimed at the so-called picoplankton, operationally defined as cells passing through a 3 μm filter and collected on either 0.8 or 0.2 μm filters (Vaulot et al. 2008). Most studies operationally defined picoplankton by size-fractionated filtration, and many of the novel groups originally thought to be picoplanktonic can in fact include cells >3 μm. Fragile cells are broken during filtration with cellular contents passing through the filter, and free DNA preserved in seawater can also be collected on the 0.2 μm filter. Such size fractionation is nevertheless useful, since it enriches the proportion of smaller cells. More recently, surveys of picoplankton have been carried out using cells that were collected following flow cytometry (FCM) cell sorting. Using this technique, other novel photosynthetic taxa have been discovered (Not et al. 2008). Placing novel taxa into known phylogenies requires aligning sequences with known groups and determining their placement within phylogenetic trees. Using nearly full-length 18S rRNA gene sequences, most of the early novel MME have been found to be within some higher level taxonomic grouping, and as new environments are surveyed, the distribution and diversity of uncultivated groups can be documented.

In principle, it is possible to identify whole communities of MME using metagenomic approaches. In practice, high-throughput multiplexing of different samples with primers specific for hypervariable regions of the 18S rRNA gene can be used to identify MME in natural environments (Amaral-Zettler et al. 2009; Comeau et al. 2011). These short sequences or reads are taxonomically assigned based on reference-curated 18S rRNA gene phylogenies. However, as with Bacteria and Archaea, the utility of identifying MME and their functional genes is directly related to the accuracy and completeness of reference databases. Many species that are found in the marine environment are the same as those reported using microscopy, and a major challenge is linking microscopy records with sequence data. An additional complication is the desirability of exploiting historic data sets and taxonomic treatises where only morphological descriptions are provided with no voucher specimens or cultures in existence, so that sequences cannot be matched to described morphological species.
Classification of MME

Historically MME were divided into plants (algae) and animals (protozoa), with algal classification following the botanical nomenclature code and protozoa following the zoological code. The term algae is now considered a non-taxonomic functional grouping of oxygenic C-photo-autotrophic (oxygen evolving photosynthesis) organisms that are neither bryophytes nor vascular plants. Cyanobacteria are sometimes referred to as algae and are important phototrophs in much of the world ocean, but they are not eukaryotes and are not treated here. Below is a brief survey of major MME categorized by trophic roles.

Phototrophs

While diatoms, coccolithophores, and dinoflagellates are the most frequently mentioned phototrophs in the ocean, there are many other taxa that can contribute substantially to oceanic primary productivity. The eukaryotic algae include heterogeneous and evolutionarily different groups. The origin and development of the first eukaryotic algae is explained through an endosymbiotic event where a heterotrophic eukaryote acquired or enslaved an ancestral cyanobacterium (cf. Reyes-Prieto et al. 2010). After genetic reduction and transformation, this event gave rise to primary plastids (chloroplasts) present in Glaucophyta, Rhodophyta (red algae), and Chlorophyta (green algae), and the three lineages are classified as Plantae (Raven et al. 2005) or more broadly as Archaeplastida (Adl et al. 2010) with the higher plants. Chlorophyta are ancestral to algal Streptophyta and are predominantly green with chlorophyll b as a secondary pigment; Prasinophyta and Mamiellaceae are the most common marine pelagic Chlorophyta.

Other algae are polyphyletic (lack an identifiable common ancestor), and for most, their chloroplasts originated as a secondary endosymbiotic event where a single-celled pre-rhodophyte alga was acquired or enslaved by another heterotrophic protist. Over time this lineage gave rise to other major algal phyla (Reyes-Prieto et al. 2010; Keeling 2009). Chlorophyll c is a secondary pigment common to most of these other algae; in the ocean, these include Diatomea, Pelagophyceae, Eustigmatales, Dictyochophyceae, Chrysophyceae, Raphidophyceae, and other stramenopiles. Among these are Parmales, which have siliceous walls and have been reported from electron microscopy (Kosman et al. 1993) and are closely related to or within the flagellated bolidophytes (Ichinomiya et al. 2011). About half of the living species of Dinophyceae (within the alveolates) are photosynthetic (Taylor et al. 2008).

Cryptophyta and Haptophyta, also with chlorophyll c, are now thought to have arisen through separate endosymbiotic events with different protists (Baurain et al. 2010), and their phylogenetic positions are uncertain.

Haptophyta in the sea include flagellated taxa, mostly in the Prymnesiales and Phaeocystales, and coccolith-bearing taxa referred to as coccolithophores, which include Isochrysidales (e.g., Emiliania) and Coccolithales. Many coccolithophores also have flagellated stages, and some species lack chloroplasts (Adl et al. 2012).

There are two other algal phyla that arose from endosymbiotic events where single-celled green algae gave rise to the chlorophyll b containing chloroplasts in the photosynthetic Euglenophyta and the Chlorarachniophyta, both of which are common in marine waters. Several dinoflagellates from diverse lineages have lost their original secondary endosymbiotically acquired chloroplast and have acquired new chloroplasts directly from either green algae, cryptophytes, or even diatoms in what are termed tertiary endosymbiotic events (Keeling 2009).

Mixotrophs

The majority of the phototrophic groups named above are also more than likely mixotrophic, at least on some level. Mixotrophy can range from the ability to use dissolved organic matter (osmotrophy) to the capacity to engulf (phagotrophy) bacteria or other microbial eukaryotes, including species that are larger than themselves. These mixotrophs can compete with heterotrophs for nutrients, carbon, and even energy when taking up preformed organic material. It is important to reiterate that these
are not plants. For example, mixotrophic Chrysophyceae are particularly common in Arctic sea ice and Arctic marine waters (Lovejoy et al. 2002; Rozanska et al. 2008). Other stramenopile mixotrophs include members of the Dictyochophyceae, Pelagophyceae, and Raphidophyceae; Euglenozoa, Cryptophyceae, Haptophyceae, and photosynthetic dinoflagellates are also mixotrophic.

Heterotrophs

Heterotrophic MME are also phylogenetically, morphologically, and functionally diverse. Among these are several microscopically recognizable heterotrophic species with uncertain taxonomic affinities that are also frequently recovered in environmental gene surveys. These include Telonema, Katablepharidae, and Centrohelida. Other groups are well placed in phylogenies, for example, choanoflagellates, which are included along with animals in the Opisthokonta. Among novel uncultivated MME are the biliphytes, originally termed picobiliphytes (Not et al. 2007), which although still uncultivated have been sequenced using single-cell genome amplification technology on cells collected via FCM. The genome from these cells has confirmed that they branch apart from other eukaryotes as a sister group to Cryptophytes and that they are most likely strict heterotrophs (Yoon et al. 2011); they have been formally described as a new phylum, the Picozoa, by Seenivasan et al. (2013).

Among historically important protist heterotrophs in the sea are representatives from the large supergroup Rhizaria, which includes Cercozoa, Polycystinea, Acantharia, and Foraminifera. Cercozoa from microscopy studies that have also been retrieved using environmental gene surveys include Cryothecomonas (Thaler and Lovejoy 2012).

The first environmental 18S rRNA gene surveys revealed a number of distinct lineages of marine stramenopiles (MASTs) (Massana et al. 2004), which for the most part seem to be bacterivores, although cultured representatives are lacking (Massana et al. 2009). Also among groups that are mostly known from environmental surveys are several clades of marine alveolates that are mostly related to parasitic taxa, including Amoebophyra, which infect dinoflagellates (Groisillier et al. 2006), and others most closely related to zooplankton parasites (Skovgaard et al. 2005).

References

Adl SM, Simpson AGB, Farmer MA, Andersen RA, Anderson OR, Barta JR, Bowser SS, Brugerolle G, Fensome RA, Fredericq S, James TY, Karpov S, Kugrens P, Krug J, Lane CE, Lewis LA, Lodge J, Lynn DH, Mann DG, McCourt RM, Mendoza L, Moestrup O, Mozley-Standridge SE, Nerad TA, Shearer CA, Smirnov AV, Spiegel FW, Taylor M. The new higher level classification of eukaryotes with emphasis on the taxonomy of protists. J Eukaryot Microbiol. 2005;52:399–451.
Adl SM, Leander BS, Simpson AGB, Archibald JM, Anderson OR, Bass D, Bowser SS, Brugerolle G, Farmer MA, Karpov S, Kolisko M, Lane CE, Lodge DJ, Mann DG, Meisterfeld R, Mendoza L, Moestrup O, Mozley-Standridge SE, Smirnov AV, Spiegel F. Diversity, nomenclature, and taxonomy of protists. Syst Biol. 2007;56:684–9.
Adl SM, Simpson AGB, Lane CE, Lukes J, Bass D, Bowser SS, Brown MW, Burki F, Dunthorn M, Hampl V, Heiss A, Hoppenrath M, Lara E, le Gall L, Lynn DH, McManus H, Mitchell EAD, Mozley-Standridge SE, Parfrey LW, Pawlowski J, Rueckert S, Shadwick L, Schoch CL, Smirnov A, Spiegel FW. The revised classification of eukaryotes. J Eukaryot Microbiol. 2012;59:429–93.
Amaral-Zettler LA, McCliment EA, Ducklow HW, Huse SM. A method for studying protistan diversity using massively parallel sequencing of V9 hypervariable regions of small-subunit ribosomal RNA genes. PLoS ONE. 2009;4.
Baurain D, Brinkmann H, Petersen J, Rodriguez-Ezpeleta N, Stechmann A, Demoulin V, Roger AJ, Burger G, Lang BF, Philippe H. Phylogenomic evidence for separate acquisition of plastids in cryptophytes, haptophytes, and stramenopiles. Mol Biol Evol. 2010;27:1698–709.
Comeau AM, Li WKW, Tremblay J-É, Carmack EC, Lovejoy C. Arctic ocean microbial community structure before and after the 2007 record sea ice minimum. PLoS ONE. 2011;6:e27492.
Diez B, Pedros-Alio C, Massana R. Study of genetic diversity of eukaryotic picoplankton in different oceanic regions by small-subunit rRNA gene cloning and sequencing. Appl Environ Microbiol. 2001;67:2932–41.
Groisillier A, Massana R, Valentin K, Vaulot D, Guillou L. Genetic diversity and habitats of two enigmatic
D 132 DNA Methylation Analysis by Pyrosequencing
marine alveolate lineages. Aquat Microb Ecol. Skovgaard A, Massana R, Balague V, Saiz
2006;42:277–91. E. Phylogenetic position of the copepod-infesting par-
Ichinomiya M, Yoshikawa S, Kamiya M, Ohki K, asite Syndinium turbo (Dinoflagellata, Syndinea). Pro-
Takaichi S, and Kuwata A. Isolation and characteriza- tist. 2005;156:413–23.
tion of Parmales (Heterokonta/Heterokontophyta/ Taylor FJR, Hoppenrath M, Saldarriaga JF. Dinoflagellate
stramenopiles) from the Oyashio region, western diversity and distribution. Biodivers Conserv.
North Pacific. J Phycol. 2011;47:144–151. 2008;17:407–18.
Keeling PJ. Chromalveolates and the evolution of plastids Thaler M, Lovejoy C. Distribution and diversity of
by secondary endosymbiosis. J Eukaryot Microbiol. a protist predator Cryothecomonas (Cercozoa) in Arc-
2009;56:1–8. tic marine waters. J Eukaryot Microbiol.
Kosman CA, Thomsen HA, Ostergaard JB. Parmales 2012;59:291–9.
(Chrysophyceae) from Mexican, Californian, Baltic, Arctic and Antarctic waters with the description of a new subspecies and several new forms. Phycologia. 1993;32:116–28.
Lovejoy C, Legendre L, Martineau MJ, Bacle J, von Quillfeldt CH. Distribution of phytoplankton and other protists in the North Water. Deep-Sea Res Part II Top Stud Oceanogr. 2002;49:5027–47.
Massana R, Castresana J, Balague V, Guillou L, Romari K, Groisillier A, Valentin K, Pedros-Alio C. Phylogenetic and ecological analysis of novel marine stramenopiles. Appl Environ Microbiol. 2004;70:3528–34.
Massana R, Unrein F, Rodriguez-Martinez R, Forn I, Lefort T, Pinhassi J, Not F. Grazing rates and functional diversity of uncultured heterotrophic flagellates. ISME J. 2009;3:588–96.
Moon-van der Staay SY, De Wachter R, Vaulot D. Oceanic 18S rDNA sequences from picoplankton reveal unsuspected eukaryotic diversity. Nature. 2001;409:607–10.
Not F, Valentin K, Romari K, Lovejoy C, Massana R, Tobe K, Vaulot D, Medlin LK. Picobiliphytes: a marine picoplanktonic algal group with unknown affinities to other eukaryotes. Science. 2007;315:253–5.
Not F, Latasa M, Scharek R, Viprey M, Karleskind P, Balague V, Ontoria-Oviedo I, Cumino A, Goetze E, Vaulot D, Massana R. Protistan assemblages across the Indian Ocean, with a specific emphasis on the picoeukaryotes. Deep-Sea Res Part I Oceanogr Res Pap. 2008;55:1456–73.
Raven JA, Finkel ZV, Irwin AJ. Picophytoplankton: bottom-up and top-down controls on ecology and evolution. Vie Et Milieu-Life Environ. 2005;55:209–15.
Reyes-Prieto A, Yoon HS, Moustafa A, Yang EC, Andersen RA, Boo SM, Nakayama T, Ishida K, Bhattacharya D. Differential gene retention in plastids of common recent origin. Mol Biol Evol. 2010;27:1530–7.
Rozanska M, Poulin M, Gosselin M. Protist entrapment in newly formed sea ice in the Coastal Arctic Ocean. J Mar Syst. 2008;74:887–901.
Seenivasan R, Sausen N, Medlin LK, Melkonian M. Picomonas judraskeda gen. et sp. nov.: The first identified member of the Picozoa Phylum nov., a widespread group of picoeukaryotes, formerly known as ‘picobiliphytes’. PLoS ONE 2013;8(3):e59565. doi:10.1371/journal.pone.0059565.
Vaulot D, Eikrem W, Viprey M, Moreau H. The diversity of small eukaryotic phytoplankton (≤3 μm) in marine ecosystems. FEMS Microbiol Rev. 2008;32:795–820.
Yoon HS, Price DC, Stepanauskas R, Rajah VD, Sieracki ME, Wilson WH, Yang EC, Duffy S, Bhattacharya D. Single-cell genomics reveals organismal interactions in uncultivated marine protists. Science. 2011;332:714–7.

DNA Methylation Analysis by Pyrosequencing

Florence Busato and Jörg Tost
Laboratory for Epigenetics and Environment, Centre National de Génotypage, CEA-Institut de Génomique, Evry, France

Synonyms

Quantitative sequencing by synthesis

Definition

Pyrosequencing is a sequencing-by-synthesis method that quantitatively monitors the real-time incorporation of nucleotides using an enzymatic conversion of pyrophosphate into a proportional light signal. Quantitative measures are crucial for applications such as the analysis of DNA methylation patterns, which are intensively studied in various developmental and pathological contexts, as well as for bacterial identification and determination of allelic imbalance.
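To make the role of the proportional light signal concrete, the following minimal Python sketch models an idealized pyrogram. It assumes a perfectly linear signal, complete incorporation, and no background, and it assumes bisulfite-treated DNA in which a methylated cytosine is read as C and an unmethylated one as T. The function names, the input format, and the example sequences are hypothetical and chosen only for illustration; this is not the algorithm of any instrument software.

```python
"""Toy model of pyrosequencing peak heights (illustrative assumptions only)."""


def simulate_pyrogram(templates, dispensation_order):
    """Return the expected relative peak height for each dispensation.

    templates: dict mapping the sequence to be synthesized (5'->3') to the
        fraction of molecules carrying it (hypothetical input format).
    dispensation_order: string of nucleotides, dispensed one at a time.
    """
    peaks = [0.0] * len(dispensation_order)
    for sequence, fraction in templates.items():
        position = 0
        for i, nucleotide in enumerate(dispensation_order):
            incorporated = 0
            # Extend through a homopolymer run: every incorporated
            # nucleotide adds one unit of signal to this dispensation.
            while position < len(sequence) and sequence[position] == nucleotide:
                position += 1
                incorporated += 1
            peaks[i] += fraction * incorporated
    return peaks


def percent_methylation(c_peak, t_peak):
    """Estimate methylation at one CpG from the C and T peak heights."""
    total = c_peak + t_peak
    if total <= 0:
        raise ValueError("no signal at the CpG position")
    return 100.0 * c_peak / total


if __name__ == "__main__":
    # Hypothetical mixture: 30 % of molecules read ACGA (methylated CpG),
    # 70 % read ATGA (unmethylated CpG converted by bisulfite).
    mixture = {"ACGA": 0.3, "ATGA": 0.7}
    order = "ATCGA"
    peaks = simulate_pyrogram(mixture, order)
    for nucleotide, height in zip(order, peaks):
        print(f"{nucleotide}: {height:.2f}")
    print(f"methylation: {percent_methylation(peaks[2], peaks[1]):.1f} %")
```

In this toy model the C and T peaks at the variable position share one unit of signal between them, so their ratio gives the methylated fraction directly, while a run of two identical nucleotides produces a peak of double height. Real pyrograms deviate from this ideal because of background signal and amplification bias.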
Introduction

While Sanger sequencing has been the “gold standard” for the identification of sequence variants for a long time, pyrosequencing, with its improved ability for quantification, decreased limit of detection, and accelerated workflow leading to a shorter time to results, has become a valuable alternative, notably for many clinical and diagnostic applications. Pyrosequencing is a sequencing-by-synthesis method in which nucleotides are incorporated complementary to a template strand, leading to the release of pyrophosphate (PPi) that will, after several enzymatic reactions, produce a light signal proportional to the amount of incorporated nucleotide (Fig. 1).

The experimental procedure of the pyrosequencing assay is simple and relatively robust, and results are highly reproducible. Therefore, pyrosequencing has become a widely used analysis platform for various biological and/or diagnostic applications such as routine (multiplex) genotyping of single-nucleotide polymorphisms (SNPs), methylation analysis of bisulfite-treated samples, bacterial typing, mutation detection, and allele quantification (Marsh 2007).

DNA Methylation Analysis by Pyrosequencing, Fig. 1 Nucleotides added into the pyrosequencing reaction (here exemplified by a thymine) are incorporated by the DNA polymerase extending the pyrosequencing primer when they are complementary to the DNA template sequence. This incorporation releases PPi, which is used together with APS by an ATP sulfurylase to produce ATP. ATP will be subsequently used by luciferase to oxidize luciferin to oxyluciferin, generating a proportional light signal. Unincorporated nucleotides are degraded by apyrase to avoid unspecific background signals. The reactions are detailed in the text

DNA Methylation

DNA methylation is a post-replication modification that occurs in mammals almost exclusively
on the 5 position of the pyrimidine ring of cytosines in the context of a dinucleotide CpG (Tost 2009). CpGs represent less than 1 % of all bases and are mostly methylated in the mammalian genome. CpGs are relatively rare because they are easily transformed into TpGs by deamination, and as thymine is a naturally occurring building block of the DNA, these mutations are less well recognized and repaired by the cellular machinery. This elevated mutation rate has led to CpG depletion during evolution.

However, relatively CpG-rich clusters, called CpG islands, are found in the promoter and first exon of approximately two-thirds of all genes. Mostly unmethylated, these CpG islands are distributed throughout the human genome and maintain the chromatin in an open configuration to allow transcription. The absence of DNA methylation is not directly correlated to the transcriptional activity of the corresponding gene, but rather to its transcriptional potential. However, a certain number of promoter CpG islands are methylated in a tissue-specific manner, and this DNA methylation helps to maintain transcriptional silence in non-expressed or noncoding regions of the genome. Methylated regions also maintain transcriptional inactivation, as exemplified by the methylation and repression of repetitive and transposable elements. Furthermore, some genes, called imprinted genes, express only one allele depending on their parent of origin (maternal or paternal allele), and the non-expressed allele is associated with a repressed imprinting control region, which is in many cases marked by DNA methylation. Inactivation of one X chromosome in female mammals is another example in which DNA methylation plays an important role in gene dosage and regulation.

During aging and in the context of pathologies, particularly cancer, regions that are normally unmethylated become methylated, and this hypermethylation can induce or is at least associated with aberrant gene expression patterns. For example, methylation of the DNA repair genes MLH1 and MGMT can lead to their inactivation, resulting respectively in microsatellite instability and increased mutation frequency. Methylation can also promote spontaneous deamination, enhance DNA binding of carcinogens, or increase ultraviolet absorption by DNA and, as a result, increase the rate of mutations, DNA adduct formation, and subsequent gene inactivation. As DNA methylation has been shown to be influenced by diet and environmental exposure, it has been postulated that DNA methylation might constitute a measurable molecular memory of our lifestyle and environment (Cortessis et al. 2012).

Methylation of cytosines in other sequence contexts (CpNpG, CpA, etc.) has been identified in cultured cells such as mouse embryonic stem cells. In plants, methylation on cytosines is more prevalent and more diverse compared to mammals, and their DNA is highly methylated. The methylcytosines are mainly located in CpG and CpNpG sequences, but they may also occur in other contexts. DNA methylation controls plant growth and development, with a particular involvement in the regulation of gene expression and DNA replication, similar to its function in mammalian cells.

Compared to mammals, bacteria have at least two methylated bases in addition to 5-methylcytosine: N6-methyladenine in the sequence context GpApTpC and GpApNpTpC and N4-methylcytosine (Casadesus and Low 2006). These methylated bases are involved in the protection of bacterial DNA, where they act as a defense mechanism against bacteriophage infection. They also play crucial roles in the control of DNA repair, replication, transposition, and, similar to eukaryotes, gene expression. In particular, adenine methylation plays an important role in the regulation of gene expression in bacteria, with its absence allowing the binding of specific proteins to the bacterial DNA. Methylation patterns have also been correlated to the virulence of several pathogens.

However, due to their greater diversity, the presence of many “orphan” methyltransferases, i.e., enzymes not part of a restriction enzyme system that methylate bacterial genomes at specific sites, and the only recent emergence of appropriate tools to study the DNA modifications, DNA methylation in bacteria has not been
a topic of intensive research. The advent of single-molecule sequencing technologies, such as the single-molecule real-time sequencer from Pacific Biosciences, which performs sequencing with an immobilized polymerase at the bottom of zero-mode waveguide wells in zeptoliter volumes, has revolutionized the possibilities for DNA methylation analysis in bacteria and allowed the direct readout of CpG and other methylation modifications in bacteria (Davis et al. 2013).

Principles of the DNA Methylation Analysis

As DNA methylation is involved in many biological processes, it is of great importance to analyze DNA methylation patterns and their variability. As DNA methylation is not retained during PCR amplification, it is necessary to make use of procedures that are able to differentiate the epigenetic state. Methods for DNA methylation analysis are based on four main principles. (1) Methylation-sensitive restriction endonucleases, i.e., enzymes that are blocked by methylated cytosines in their recognition sequence, are widely used for the analysis of methylation patterns in combination with their methylation-insensitive isoschizomers. Although methods based on methylation-sensitive restriction enzymes are simple and cost-effective, as they do not require any special instrumentation, they are hampered by the limitation to specific restriction sites, as only CpG sites found within these sequences can be analyzed. (2) The methylated fraction of a genome can be enriched by precipitation with a bead-immobilized antibody specific for 5-methylcytosine or (3) by affinity purification of methylated DNA with methyl-CpG-binding domain (MBD) proteins, but these methods do not permit the analysis of DNA methylation patterns at single-nucleotide resolution. (4) The most widely used approach consists of the chemical modification of genomic DNA with sodium bisulfite. This chemical reaction induces the hydrolytic deamination of non-methylated cytosines to uracils, while methylated cytosines are resistant to conversion under the chosen reaction conditions. This method thus translates the methylation signal into a sequence difference. After PCR amplification, the methylation status at a given position is manifested in the ratio of C (former methylated cytosine) to T (former non-methylated cytosine) and can be analyzed as a virtual C/T polymorphism spanning the entire allele frequency spectrum from 0 % to 100 % in the bisulfite-treated DNA. The latter principle is commonly used for DNA methylation analysis by pyrosequencing. It should be noted that the reduced complexity of the bisulfite-treated DNA (which essentially consists of a three-letter genome), creating homopolymeric and highly AT-rich sequences, provides a challenge for the design of PCR amplification-based assays and frequently induces a preferential amplification of either unmethylated or methylated alleles. This bias has to be monitored and corrected for to ensure accurate quantification of DNA methylation levels of the analyzed CpGs.

Principle of the Pyrosequencing Reaction

Pyrosequencing is a polymerase-based quantitative real-time sequencing method used to analyze multiple sequence variations in a region of interest. In contrast to conventional Sanger sequencing, which uses a mixture of the four fluorescently labeled chain-terminating ddNTPs and strand-elongating dNTPs, only one nucleotide is dispensed at a time by an inkjet-type cartridge in pyrosequencing reactions, using either a user-defined sequence-specific dispensation order or a repetitive cyclic dispensation order of the four nucleotides for unknown sequences.

This iterative incorporation of unmodified nucleotides by the exonuclease-deficient Klenow fragment of DNA polymerase I will result in the release of inorganic pyrophosphate (PPi), while all unincorporated nucleotides will be degraded by an apyrase prior to addition of the next nucleotide. When the polymerase encounters a noncomplementary nucleotide, it pauses while nucleotide degradation takes place. The pyrophosphate is, in the presence of adenosine phosphosulfate (APS), transformed by an ATP
sulfurylase into several products including ATP. The latter will be used in the subsequent step to oxidize luciferin to oxyluciferin by a luciferase, resulting in the creation of a proportional amount of photons, which can be monitored by a CCD camera (Fig. 1). The four enzymes are present in a well-balanced mixture allowing the DNA polymerase to extend the newly synthesized DNA strand until it encounters a noncomplementary nucleotide while at the same time avoiding unspecific nucleotide incorporation and out-of-phase sequencing. A key step in the development of applications for pyrosequencing was the addition of a single-stranded DNA binding protein to the reaction mixture (now also included in the commercial kits), which led to a substantial increase in read length and overall greater accuracy through the reduction of the formation of secondary structures and mispriming (Dupont et al. 2004).

Samples of interest are amplified by PCR performed with one of the two amplification primers being biotinylated. This allows the isolation of a single-stranded sequencing template through the capture of the biotinylated amplification product on streptavidin-coated Sepharose beads. After washing steps, the use of a sodium hydroxide solution allows the denaturation of the double-stranded DNA and isolation of the biotinylated single strand used as template in the pyrosequencing procedure. A (pyro)sequencing primer is subsequently annealed to this template, and the sequence is synthesized one nucleotide at a time. The light signals are then generated by the enzymatic cascade described above as the 3′ end of the nascent strand is extended. It should be noted that the nucleotide dATP acts as a natural substrate for luciferase (although less efficient compared to ATP). Therefore the α-S-dATP analogue is used as nucleotide for primer extension as it is equally well incorporated by the polymerase.

Pyrosequencing can analyze almost any polymorphism in the amplified sequence. As the expected sequence is in most cases known a priori, the sequence to analyze is simply entered into the software, which automatically creates a dispensation order, and once the sequencing reaches this polymorphism, both nucleotides of the variable position will be added successively and their proportional luminometric signal quantified by the software.

Since all the enzymatic reactions are quantitative, the intensity of the bioluminometric response is directly proportional to the amount of incorporated nucleotides: the incorporation of two identical consecutive nucleotides will have double the intensity (and therefore peak height in the resulting pyrogram) compared to the signal of a single-nucleotide incorporation. This quantitative nature of the results is the most important characteristic of the pyrosequencing technology because it allows performing quantitative applications such as DNA methylation analysis. Furthermore, as pyrosequencing proceeds at a rate of one dispensation per minute, results on the presence and abundance of variable nucleotides will be available between 10 and 60 min after launching a pyrosequencing reaction. The total time to results starting from the PCR amplification is commonly below 3–4 h and therefore much faster than conventional Sanger sequencing.

Inconveniences

However, there are some inconveniences associated with this technology, mainly concerning the analysis of variation in the close proximity of homopolymers, the size of the amplification product, and the sequencing read length. Pyrosequencing as well as the closely related 454 sequencing and semiconductor sequencing (Ion Torrent) suffer from a lack of precision in the analysis of homopolymers. The bioluminometric response is only linear (R² > 0.99) for the sequential addition of up to five identical nucleotides (C, G, T) or three α-S-dATPs. Sequence variation in close proximity to homopolymer reads might therefore not be easily resolved, and the quantitative accuracy might be limited. Due to the thermal instability of the enzymes, pyrosequencing has to be carried out at 28 °C, which limits the size of the amplification product to 350 base pairs as the formation of secondary structures can complicate annealing
of the sequencing primer or increase background signals. The limitation in the read length (less than 100 dispensed nucleotides) is mainly due to dilution effects and increasing background due to frameshifts of subpopulations of sequenced molecules. This drawback can be partly overcome using the serial pyrosequencing approach described below. Lastly, the setup and optimization of robust pyrosequencing assays, including the assay design but also the entry of an optimal dispensation order, requires a certain degree of experience and expertise, and only few tools are available in the public domain for the assay design.

Serial Pyrosequencing

To overcome the restriction in read length, a solution was found in the “recycling” of the single-stranded template after the pyrosequencing run. As this template is not altered during the pyrosequencing reaction, it can be recovered after the run by the same template preparation protocol used after PCR amplification. Several pyrosequencing primers can therefore be used on the same DNA template to cover the entire amplified sequence with sufficient intensity and good quantitative resolution. This improvement enables the analysis of an entire region amplified in a single PCR. While the approach was initially devised for DNA methylation analysis (Tost et al. 2006), it could also be used for the analysis of several sequence variations within the same amplification product.

Application: Genotyping and Mutation Detection

Pyrosequencing can be used to genotype single-nucleotide polymorphisms (SNPs) and detect mutations involved in various diseases (cancer, Alzheimer’s disease, heart diseases, diabetes) or in biological traits such as eye color or lactose intolerance.

Once the sequencing reaches the SNP (entered in the sequence to analyze in the software using the IUPAC single letter code), all possible nucleotides will be added one after another. Each allele combination will result in a specific pyrosequencing pattern that can easily be read either by the software or by the user. Besides simple qualitative genotyping, pyrosequencing can be used for quantitative applications such as the level of mutation or the potential loss of one allele (loss of heterozygosity (LOH)). LOH can result in a neutral phenotype but can also be involved in cancer as exemplified by the LOH of BRCA1 or BRCA2 in breast cancer.

Due to its relatively short read length, pyrosequencing is best suited for the detection and quantification of mutation hotspots such as the codons 12 and 13 of KRAS (Ogino et al. 2005), a gene commonly mutated in many cancers including lung, pancreatic, and colorectal cancer, where it is the most commonly mutated gene with a prevalence of ~40 % of patients. Similar applications concern the analysis of BRAF (V600E) or JAK2 (V617F) mutations and polymorphisms such as C677T MTHFR. Compared to conventional Sanger sequencing, the limit of detection is significantly improved (i.e., 2–7 % for pyrosequencing compared to 10–20 % for conventional Sanger sequencing), which enables the user to call low-level mutations with greater confidence and resolve, e.g., ambiguous Sanger sequencing results. This property of pyrosequencing is also of special importance in situations where, for example, few tumor cells are present among normal cells and/or a subclone of the tumor carries the mutation of interest, which might expand upon a given therapy and induce drug resistance. Pyrosequencing has also been applied to more complex genetic analyses requiring accurate sequencing such as HLA (sub)typing (Ugolotti et al. 2011). A quantitative readout is also of interest for the genotyping of SNPs in polyploid organisms such as plants, where pyrosequencing has proven to be an effective tool.

Application: Transcript Quantification

Just as it can quantify the ratio of mutations in a heterogeneous mixture of DNA, pyrosequencing
can quantify any variation of sequence. In the case of cDNA, it is thereby possible to determine an imbalance in the transcript quantity of different alleles (Yang et al. 2013).

Application: Bacterial Typing

Similar to the analysis of genetic variations, technologies used for bacterial identification have shifted from Sanger sequencing to the more user-friendly pyrosequencing technology, enabling a more extensive sampling of microbial diversity with reduced efforts. Pyrosequencing has been used for the identification of microbial species and detection of genetic mutations that confer resistance to antibiotics and antiviral drugs by sequencing well-characterized short hypervariable regions of bacterial genes such as 16S, 23S rRNA, or rnpB. Universal primers are located in the conserved regions amplifying the variable regions, which are subsequently pyrosequenced. The provided sequence (sometimes in addition to biochemical data) gives unambiguous and discriminatory information for microbial identification. It should be noted that due to the limited read length of the pyrosequencing technology, the careful design of the targets and location of the amplification primers are of utmost importance and depend on the biological question. Pyrosequencing has been successfully used to identify pathogens that were refractory to biochemical analyses in a hospital setting, identifying 78 different genera representing 16 different specimen types. Further, it was applied to the identification and subtyping of different strains of, for example, Helicobacter pylori, Mycobacterium, and Streptococcus (Petrosino et al. 2009). It has been used to differentiate between Gram-positive and Gram-negative bacteria using the 16S rRNA, demonstrating superior results to conventional Gram staining. Pyrosequencing has also been shown to have sufficient discrimination potential to identify highly similar strains of Yersinia pestis in a relatively short time and can also be used to identify antimicrobial resistance genes including mutations in the gyrase and other genes in quinolone-resistant Salmonella and ciprofloxacin-resistant Neisseria gonorrhoeae.

Similar assays have been developed for fungal and viral identification. It should be noted that, similar to Sanger sequencing, pyrosequencing requires pure bacterial isolates and thus a culture step prior to the analysis, as mixtures of different bacteria will lead to sequence patterns that will be inconclusive or difficult to interpret. Genome-wide sequencing approaches using the pyrosequencing-based 454 technology for metagenomics, which circumvent this problem through clonal amplification of single DNA molecules (and thus single bacteria), are discussed elsewhere in this encyclopedia in the context of, e.g., the Human Microbiome Project.

Application: DNA Methylation Analysis

Due to the recent interest in epigenetics in general and DNA methylation analysis in particular, DNA methylation analysis by pyrosequencing is probably the prime application of the technology as it allows simultaneous analysis and quantification of the methylation status of several CpG positions in close proximity (Tost and Gut 2007).

This point is of particular interest as successive CpGs might display significantly different levels of methylation, particularly in imprinted genes as well as at promoters devoid of a CpG island. Pyrosequencing has been demonstrated to be very reproducible if assays are performed in a quality-controlled and standardized fashion, including a sufficient amount of input DNA for methylation analysis (Dupont et al. 2004). Furthermore, the possibility to include controls for complete bisulfite conversion (i.e., the measurement at a cytosine outside of a CpG context) avoids a potential pitfall of DNA methylation analysis. Pyrosequencing has a limit of detection of ~5 % for the minor unmethylated or methylated allele, respectively, and the technical variability of the pyrosequencing reaction alone is very limited (~2 %). Variability increases to about 5 % if independent bisulfite conversion and PCR amplifications are performed (Dupont et al. 2004). Pyrosequencing is therefore much better suited and less complex than standard
Sanger sequencing for DNA methylation analysis. The recording of calibration curves using standards with a known degree of methylation during assay setup or during routine use also allows for correction for potential preferential amplification of methylated or unmethylated alleles, a phenomenon frequently encountered with bisulfite-treated DNA.

The quantitative accuracy can be applied to analyze global or gene-specific DNA methylation patterns of a sample. Pyrosequencing has been widely used to analyze the DNA methylation patterns of genes aberrantly silenced by promoter hypermethylation in cancer and other diseases. It has been used for the distinction between age-related and cancer-associated DNA methylation patterns or the analysis of the epigenetic field defect in cancer. A diagnostic test using pyrosequencing for the detection of aberrant DNA methylation patterns involved in the imprinting disorders Prader-Willi and Angelman syndromes was proposed.

Pyrosequencing can also be used for screening of differential DNA methylation between two sample groups by creating pools stratified for clinical parameters of interest, for example, cancerous versus matched peritumoral tissue (Dejeux et al. 2007). This method helps to concentrate research efforts and available biological material on genes displaying variable methylation patterns.

DNA methylation analysis can be performed in the tissue of interest itself or in biofluids that were in contact with the diseased tissue such as serum, plasma, sputum, or urine (How Kit et al. 2012). However, amounts of DNA that can be isolated are normally very small and thus require an additional step of genome-wide amplification prior to the quantitative DNA methylation detection. As DNA methylation marks are not retained during amplification, the amplification has to be performed on the bisulfite-treated DNA with its lower sequence complexity, decreased integrity due to the harsh conditions of the chemical treatment, and increased potential for secondary structures. The qMAMBA (quantitative methylation analysis of minute DNA amounts after whole bisulfitome amplification) assay has combined this amplification step with a pyrosequencing-based readout starting from as little as 100 pg of DNA (Paliwal et al. 2010). As a significant quantity of DNA is obtained after amplification, the DNA can be analyzed at multiple loci. It should however be noted that the quantitative accuracy of the genome-wide amplification on bisulfite-treated DNA is still controversial.

Nonetheless, these approaches have been used for, e.g., forensic trace identification, whereby tissue-specific DNA methylation patterns analyzed by pyrosequencing after bisulfitome amplification were used to identify the biofluid of origin (Madi et al. 2012). Another approach to analyze DNA methylation patterns in minute amounts of DNA combines the high sensitivity of methylation-specific PCR (MSP) with the specificity of the pyrosequencing-based readout. The replacement of the gel-based detection with the sequencing-based readout avoids some of the problems associated with potentially false-positive results induced by mispriming of the MSP primers (Shaw et al. 2006). Nonetheless, as molecules with a specific DNA methylation pattern are specifically enriched, the resulting pyrosequencing-based analysis is no longer quantitative and/or representative of the methylation patterns present in the analyzed sample.

Application: Global DNA Methylation Analysis

The LUminometric Methylation Assay (LUMA) is based on a polymerase extension assay using the pyrosequencing platform after digestion by methylation-sensitive and methylation-insensitive restriction enzymes (Karimi et al. 2006). In this case, the pyrosequencer measures the luminometric signal produced by nucleotide extension at the resulting digested sites. It is a quantitative and highly reproducible method and uses an internal control for DNA input. Besides, no modification of genomic DNA is required.

Furthermore, the global analysis of DNA methylation can be performed using the
pyrosequencing-based analysis of methylation patterns in repetitive elements such as ALU or LINE1 elements (Yang et al. 2004). While LINE1 elements have a relatively conserved sequence, thus allowing the design of a sequence-specific pyrosequencing assay for DNA methylation analysis, methylation of ALU elements is assessed by a cyclic dispensation. These assays have been widely used for the measurement of global DNA methylation changes in response to environmental stimuli (Cortessis et al. 2012).

Application: Allele-Specific DNA Methylation Analysis

Some genes display different methylation patterns on the two alleles either randomly or in a parent of origin-specific manner. Imprinting control regions regulating the expression of imprinted genes are commonly methylated on only one allele (inherited from the mother or the father) so that only one “parental allele” is expressed. Using a heterozygous SNP to differentiate the two alleles, the methylation status of each allele can be interrogated after enrichment of the methylated molecules using the above-described MSP with primers complementary to a specific DNA methylation pattern. The resulting amplification products are subsequently pyrosequenced, and the ratio of the two alleles after methylation enrichment is quantified by genotyping the SNP (Kristensen et al. 2013).

To analyze methylation on both alleles separately, it is possible to design two pyrosequencing primers, each specific for one allele, using the two alleles of a heterozygous single-nucleotide polymorphism to differentiate the two alleles (Wong et al. 2006). The specificity of the allele-specific enrichment can be further improved by modifying the base complementary to the SNP with an LNA (locked nucleic acid). Locked nucleic acids are RNA monomers with a modified backbone. The sugar phosphate backbone has a 2′-O-4′-C methylene bridge. The bridge increases the monomer’s thermal stability, reduces its flexibility, and increases the hybridization interactions of the base with the template, ensuring amplification of only the exactly complementary allele at the chosen temperature.

Summary

Pyrosequencing is an easy-to-use sequencing-by-synthesis method that can precisely analyze genetic and epigenetic variation in an amplified sequence of up to 350 base pairs. Its applications are wide and various: genotyping, methylation analysis, transcript quantification, bacterial typing, etc. The broad range of applications combined with the above-described advantages has made pyrosequencing a widespread analysis method.

Cross-References

▶ Approaches in Metagenome Research: Progress and Challenges
▶ Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes
▶ Extraction Methods, Variability Encountered in
▶ Metagenomic Research: Methods and Ecological Applications
▶ NGS QC Toolkit: A Platform for Quality Control of Next-Generation Sequencing Data

References

Casadesus J, Low D. Epigenetic gene regulation in the bacterial world. Microbiol Mol Biol Rev. 2006;70:830–56.
Cortessis VK, Thomas DC, Levine AJ, Breton CV, Mack TM, Siegmund KD, et al. Environmental epigenetics: prospects for studying epigenetic mediation of exposure-response relationships. Hum Genet. 2012;131:1565–89.
Davis BM, Chao MC, Waldor MK. Entering the era of bacterial epigenomics with single molecule real time DNA sequencing. Curr Opin Microbiol. 2013;16:192–8.
Dejeux E, Audard V, Cavard C, Gut IG, Terris B, Tost J. Rapid identification of promoter hypermethylation in hepatocellular carcinoma by pyrosequencing of etiologically homogeneous sample pools. J Mol Diagn. 2007;9:510–20.
Dupont JM, Tost J, Jammes H, Gut IG. De novo quantitative bisulfite sequencing using the pyrosequencing technology. Anal Biochem. 2004;333:119–27.
How Kit A, Nielsen HM, Tost J. DNA methylation based biomarkers: practical considerations and applications. Biochimie. 2012;94:2314–37.
Karimi M, Johansson S, Ekström TJ. Using LUMA: a Luminometric-based assay for global DNA-methylation. Epigenetics. 2006;1:45–8.
Kristensen LS, Treppendahl MB, Asmar F, Girkov MS, Nielsen HM, Kjeldsen TE, et al. Investigation of MGMT and DAPK1 methylation patterns in diffuse large B-cell lymphoma using allelic MSP-pyrosequencing. Sci Rep. 2013;3.
Madi T, Balamurugan K, Bombardi R, Duncan G, McCord B. The determination of tissue-specific DNA methylation patterns in forensic biofluids using bisulfite modification and pyrosequencing. Electrophoresis. 2012;33:1736–45.
Marsh S, editor. Pyrosequencing protocols, methods in molecular biology vol 373. Totowa: Humana Press; 2007.
Ogino S, Kawasaki T, Brahmandam M, Yan L, Cantor M, Namgyal C, Mino-Kenudson M, Lauwers GY, Loda M, Fuchs CS. Sensitive sequencing method for KRAS mutation detection by pyrosequencing. J Mol Diagn. 2005;7:413–21.
Paliwal A, Vaissière T, Herceg Z. Quantitative detection of DNA methylation states in minute amounts of DNA from body fluids. Methods. 2010;52:242–47.
Petrosino JF, Highlander S, Luna RA, Gibbs RA, Versalovic J. Metagenomic pyrosequencing and microbial identification. Clin Chem. 2009;55:856–66.
Shaw RJ, Akufo-Tetteh EK, Risk JM, Field JK, Liloglou T. Methylation enrichment pyrosequencing: combining the specificity of MSP with validation by pyrosequencing. Nucleic Acids Res. 2006;34:e78.
Tost J. DNA methylation: an introduction to the biology and the disease-associated changes of a promising biomarker. Mol Biotechnol. 2009;44:71–81.
Tost J, Gut IG. DNA methylation analysis by pyrosequencing. Nat Protoc. 2007;2:2265–75.
Tost J, Elabdalaoui H, Gut IG. Serial pyrosequencing for quantitative DNA methylation analysis. Biotechniques. 2006;40:721–6.
Ugolotti E, Vanni I, Raso A, Benzi F, Malnati M, Biassoni R. Human leukocyte antigen–B (-Bw6/-Bw4 I80, T80) and human leukocyte antigen–C (-C1/-C2) subgrouping using pyrosequence analysis. Hum Immunol. 2011;72:859–68.
Wong H-L, Byun H-M, Kwan J, Campan M, Ingles S, Laird P, et al. Rapid and quantitative method of allele-specific DNA methylation analysis. Biotechniques. 2006;41:734–9.
Yang AS, Estécio MRH, Doshi K, Kondo Y, Tajara EH, Issa J-PJ. A simple method for estimating global DNA methylation using bisulfite PCR of repetitive DNA elements. Nucleic Acids Res. 2004;32:e38.
Yang B, Wagner J, Yao T, Damaschke N, Jarrard DF. Pyrosequencing for the rapid and efficient quantification of allele-specific expression. Epigenetics. 2013;8:1039–42.
E
Environmental Shaping of Codon Usage and Functional Adaptation Across Microbial Communities

Vedran Lucić1, Masa Roller2, Istvan Nagy3 and Kristian Vlahoviček2
1 Molecular Biology Department, Division of Biology, Faculty of Science, University of Zagreb, Zagreb, Croatia
2 Bioinformatics Group, Molecular Biology Department, Division of Biology, Faculty of Science, University of Zagreb, Zagreb, Croatia
3 Institute of Biochemistry, Biological Research Centre of the Hungarian Academy of Sciences, Szeged, Hungary

Definition

Whole microbial communities exhibit patterns of synonymous codon usage similar to those of single microbial species, regardless of their phyletic composition. Therefore, methods applied to single microbial genomes to predict functionally important and lifestyle-relevant genes from the translational optimization of synonymous codons can also be applied to entire metagenomes. These predictions make it possible to discover new and functionally unannotated genes relevant to community metabolism and overall adaptation to a particular environment. This approach integrates the study of microbial community genomic information and provides an in silico functional metagenomic platform to complement metaproteomic studies.

Introduction

Environmental diversity studies have bypassed the common problem that less than 1% of microbes are amenable to cultivation in laboratory conditions (Staley and Konopka 1985) by instead using high-throughput sequencing to extract genomic information directly from the environmental sample, without prior culturing. Various environments and geological sites have been sampled using new-generation sequencing, such as sea (Venter et al. 2004), soil (Tringe et al. 2005a), and various extreme habitats (e.g., acid drainage from a metal mine (Tyson et al. 2004)), as well as gastrointestinal tracts of diverse organisms, including human (Gill et al. 2006) and mouse (Turnbaugh et al. 2006).

Most analyses of the sampled environments have focused on two main directions. The first classifies the functions of identified genes (open reading frames) according to annotation available through orthology databases such as COG/KOG (Clusters of Orthologous Groups of genes) (Tatusov et al. 2003) or KEGG-KO (Kyoto Encyclopedia of Genes and Genomes – Orthology) (Kanehisa et al. 2006) and subsequently ranks the relative “importance” of a particular function according to its abundance in the environment. The second direction focuses on estimating the phyletic distribution of microbial species
represented in the environment, based on similarity searches against known microbial species’ sequences (Huson et al. 2007).

For a thorough understanding of microbial communities at the systems level, it is necessary to capture the interplay of community constituents and organizational complexity in the community metabolism. Microbes in the same environment live within the same physical and chemical constraints, such as temperature, pH, or ion concentration, probably causing the GC content to be metagenome specific (Foerstner et al. 2005). Furthermore, communities of microbes have been shown to share tRNA pools to facilitate horizontal gene transfer (Tuller et al. 2011), which also implies a limited choice of preferred cognate codons within the shared tRNA pool. It has also been shown that fast growth rates introduce stronger bias in synonymous codon usage at the level of whole metagenomes (Vieira-Silva and Rocha 2010), much like the effect observed in single microbial species (Rocha 2004; Sharp et al. 2005).

Microbial communities living under the same environmental constraints can, at the level of genes, effectively be considered and studied as metagenomes, using approaches and methodology valid for single microbial genome studies. One such approach is functional characterization by translational optimization through synonymous codon usage bias.

The codon usage (CU) bias within a genome reflects the selection pressure for translational optimization of highly expressed genes, primarily the protein synthesis machinery such as ribosomal genes and elongation factors, but also genes with environmental adaptation functions (Supek et al. 2010). At the level of a single microbial genome, the effect of CU bias is routinely used to predict functionally relevant and highly expressed genes (Sharp and Li 1987; Karlin and Mrazek 2000; Plotkin and Kudla 2011). The choice of preferred codons in a single genome is most closely correlated with the abundance of the cognate tRNA molecules (Ikemura 1985; Kanaya et al. 2001; Tuller et al. 2010) and is further influenced by the genome’s GC content (Chen et al. 2004).

Eleven different microbial community sequencing samples (Table 1) were used to demonstrate that microbes living in the same ecological niche, regardless of their phyletic diversity, share a common preference for codon usage. CU bias is present at the community level and also differs between distinct communities. CU bias also varies within the community, with distributions resembling that of single microbial species, i.e., an intracommunity CU bias can be observed. The effects of intracommunity CU bias and translational optimization concepts are utilized to identify genes with CU close to that of the meta-ribosomal sample. These genes have high predicted expression across the entire microbial community and define its “functional fingerprint.” This approach establishes a functional metagenomic platform that enables functional studies at the level of entire microbial community samples.

Table 1 Metagenomes used to demonstrate the concept of environmental shaping of codon usage

Metagenome | NCBI Project ID | Reference
Global Ocean Sampling Expedition Metagenome, the Sargasso Sea version 1 | 13694 | (Venter et al. 2004)
Waseca County farm soil metagenome | 13699 | (Tringe et al. 2005b)
Whale fall metagenomes | 13700 |
5-way (CG) acid mine drainage biofilm metagenome | 13696 | (Tyson et al. 2004)
Human distal gut biome | 16729 | (Gill et al. 2006)
Lean mouse 1 gut metagenome | 17391 | (Turnbaugh et al. 2006)
Obese mouse 1 gut metagenome | 17397 |
US EBPR sludge metagenome | 17657 | (Martin et al. 2006)
OZ EBPR sludge metagenome | 17659 |

Description

Microbes living in the same ecological niche share a bias in CU. When comparing the distance
of each gene’s CU from the overall CU of its metagenome of origin with its distance from the overall CU of all other metagenomes, genes originating from one metagenome form a distinct cluster (as shown in Fig. 1a) and lie predominantly closer to the overall CU of their own metagenome than genes from other metagenomes do. If the amino acid sequence of each gene is kept constant but the codons are randomly chosen (Fig. 1b), the genes’ CU becomes equidistant to both metagenomes (i.e., the genes occupy the same portion of the plot) regardless of their metagenome of origin.

Fig. 1 Codon usage is metagenome specific. Soil versus human gut metagenome codon usage (CU) frequencies. (a) The distance (MILC) of each gene’s CU frequency to overall CU frequencies of two microbial communities. Genes (red in human gut (N = 33,422) and blue in Waseca soil (N = 88,696) metagenome) are predominantly closer to their respective metagenome of origin, therefore forming two distinct groups (the distribution of the log2 ratio of the two distances for each gene is shown in the inset). (b) If the amino acid composition of metagenomes is kept constant and the codons are randomly chosen, the CU bias of each metagenome is eliminated, resulting in a uniform distribution of CU distances and overlap of the two samples

The Variability of Single Species’ Codon Usage Across Metagenomes
When comparing the CU of species present in two distinct metagenomes, the genes of each species can be compared in terms of CU distance to (i) the overall CU of their respective metagenome and (ii) the CU of genes from the same species in a different metagenome. The resulting distance distributions, quantified with the intraclass correlation coefficient measure (ICC), show a statistically significant difference in the CU patterns of the compared phylogenies: the CU pattern within a species is more variable between metagenomes than the CU pattern across different species within the same metagenome (Fig. 2).

Comparison of CU variability of independently sequenced strains of microbes living in distinct niches is used to demonstrate that CU is a dynamic property that changes with different environmental constraints at the level of single bacterial species. Comparison between 12 strains of Propionibacterium acnes (Bruggemann et al. 2004; Hunyadkurti et al. 2011), commensal gram-positive bacteria that live in consistent environmental conditions, with 6 strains of the cosmopolitan bacterium Rhodopseudomonas palustris (Larimer et al. 2004; Oda et al. 2008), shows that there is less variation in CU per orthologous group in the P. acnes strains than in the R. palustris strains (Fig. 3). Despite the fact that the sampling includes twice as many strains from constrained environmental conditions (P. acnes) than from variable conditions (R. palustris), the variability in CU is smaller in
the constrained environmental conditions. R. palustris samples show an overall higher variability in CU, suggesting a plasticity of codon usage that reflects on translational optimization and adapts to each specific environment. Even though the R. palustris strains generally show more variation in CU (Fig. 3), both species, regardless of environmental constraints, show the least relative variation of CU within the COG categories (i.e., orthologous genes) for housekeeping functions, including ribosomal protein genes.

Fig. 2 Codon usage variability between same species in different metagenomes is larger than within a metagenome. ORFs from each identified species (using MEGAN) were compared against their originating metagenome (orange, total comparisons N = 2,058) and against same-species ORFs in a different metagenome (green, total comparisons N = 1,029). ICC measures were calculated, representing how “close” the CU profiles match, with ICC = 1 denoting a perfect match. The orange distribution shows less variability and is shifted toward higher ICC values, denoting the closer overall match of species’ CU to their metagenome of origin

The Variability of Codon Usage in Metagenomes upon Removal of Dominant Phyla
Community-level codon usage bias is not an effect caused by the most abundant species. CU frequencies of the Sargasso Sea metagenome, the largest dataset in this study, were compared to other investigated metagenomes and to itself but with dominant phyla removed. The comparisons between Sargasso Sea CU frequencies and other metagenomes all show ICC < 0.75, while the same Sargasso sample with dominant phyla of the Alphaproteobacteria class removed (36% of the whole set) and the Alphaproteobacteria class itself show virtually no deviation (ICC > 0.98 and 0.95, respectively) from the original metagenome CU.

Codon Usage in Metagenomes Follows Similar Patterns as in Single Microbial Genomes
As has been established at the level of single microbial genomes (Ikemura 1985; Kanaya et al. 2001), the distance of each gene’s CU frequency to the overall CU of the whole genome and to that of a “reference set” of highly expressed genes (ribosomal protein genes) gives a characteristic crescent-shaped plot (Fig. 4a; introduced by Karlin and Mrazek 2000). Metagenomes exhibit CU distance distributions similar to those observed in single bacterial genomes, despite the fact that they comprise genes that originate from diverse phylogenies (e.g., Santa Cruz whale carcass bone in Fig. 4b). If the amino acid composition of genes in a metagenome is kept constant but the codons are randomly chosen, the crescent plot shape analogous to that of single bacterial genomes is lost, along with the CU bias.
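The two-distance comparison and the codon randomization control described above can be sketched in Python. This is a minimal illustration under stated assumptions and not the analysis used in the study: cu_distance is a simplified likelihood-ratio stand-in for a codon usage distance (the figures report MILC, which is not reproduced here), the standard genetic code is assumed, and all function names and toy sequences are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

# Standard genetic code, bases ordered T, C, A, G (assumption: no alternative code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS[_aa].append(_codon)


def codons(seq):
    """Split an in-frame coding sequence into codons, dropping unknown triplets."""
    seq = seq.upper()
    triplets = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return [c for c in triplets if c in CODON_TO_AA]


def translate(seq):
    """Translate a coding sequence with the standard genetic code."""
    return "".join(CODON_TO_AA[c] for c in codons(seq))


def sense_counts(seq):
    """Count codons in one sequence, ignoring stop codons."""
    return Counter(c for c in codons(seq) if CODON_TO_AA[c] != "*")


def codon_usage(genes):
    """Pooled relative usage: synonymous codon frequencies sum to 1 per amino acid."""
    counts = Counter()
    for gene in genes:
        counts.update(sense_counts(gene))
    per_aa = Counter()
    for codon, n in counts.items():
        per_aa[CODON_TO_AA[codon]] += n
    return {codon: n / per_aa[CODON_TO_AA[codon]] for codon, n in counts.items()}


def cu_distance(gene, reference_usage, eps=1e-6):
    """Likelihood-ratio distance of a gene's CU from a reference usage (MILC stand-in)."""
    observed = sense_counts(gene)
    per_aa = Counter()
    for codon, n in observed.items():
        per_aa[CODON_TO_AA[codon]] += n
    length = sum(observed.values())
    if length == 0:
        return 0.0
    total = 0.0
    for codon, n in observed.items():
        # Expected count if the gene followed the reference usage for this amino acid.
        expected = per_aa[CODON_TO_AA[codon]] * max(reference_usage.get(codon, 0.0), eps)
        total += n * math.log(n / expected)
    return 2.0 * total / length


def b_plot_point(gene, overall_usage, ribosomal_usage):
    """One gene's coordinates: distance to overall CU and to the ribosomal reference set."""
    return cu_distance(gene, overall_usage), cu_distance(gene, ribosomal_usage)


def randomize_synonymous(gene, rng):
    """Keep the encoded amino acids but pick each codon uniformly among its synonyms."""
    return "".join(rng.choice(SYNONYMS[CODON_TO_AA[c]]) for c in codons(gene))


if __name__ == "__main__":
    rng = random.Random(0)
    community = ["ATGGCTGCTAAAGAAGCTCTGAAA", "ATGGCCGCTAAAGAACTGGCTAAA"]  # toy ORFs
    ribosomal = [community[0]]  # hypothetical "meta-ribosomal" reference set
    overall, reference = codon_usage(community), codon_usage(ribosomal)
    for gene in community:
        shuffled = randomize_synonymous(gene, rng)
        print(b_plot_point(gene, overall, reference),
              b_plot_point(shuffled, overall, reference))
```

Plotting the two coordinates for every gene of a real (meta)genome would give the kind of scatter discussed above, and repeating the calculation on the randomized sequences gives the control in which any codon preference is removed while the protein sequences stay unchanged. A published measure such as MILC would replace cu_distance in an actual analysis.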
Fig. 3 Environmental variability of codon usage. Variability of codon usage per COG category in 6 strains of Rhodopseudomonas palustris and in 12 strains of Propionibacterium acnes. The codon usage variability (calculated as median CU distance from the ribosomal set within an orthologous group to its centroid CU) for the strains of P. acnes (N = 15,436), living in consistent environmental conditions, is shifted to the left, i.e., it shows smaller variation and higher bias than for the R. palustris strains (N = 24,071) living in diverse environmental conditions

Fig. 4 Metagenomes show codon usage distribution similar to single genomes. The distance of each gene’s codon usage (CU) frequency from the overall CU of the (meta)genome and from the ribosomal reference set, displayed as a Karlin B-plot for (a) a single microbial genome (Escherichia coli, N = 4,358) and (b) a metagenome (whale carcass, N = 33,422). The metagenome shows the same characteristic distribution as the genome, with ribosomal genes closer to the CU of the ribosomal set than to the overall CU of the whole (meta)genome

Predicting Metagenomic Expression and Functional Profiles Through Synonymous Codon Usage
Under different environmental constraints, CU varies in single bacterial species, and metagenomes share synchronized CU as do single bacterial species. CU bias in metagenomes can therefore be used to predict the expression levels of genes in the same manner as is routinely done to predict genes optimized for high levels of expression in single microbial genomes (Sharp and Li 1987; Karlin and Mrazek 2000; Supek and Vlahovicek 2005). Figure 5 depicts the resulting predictions at the level of whole metagenomes using the meta-ribosomal protein reference set. The most significantly enriched functions in the high expression level sets are (i) amino acid transport and metabolism (COG supercategory E) for the Sargasso Sea, (ii) energy production and conversion (COG supercategory C) for the whale fall metagenomes, and (iii) inorganic ion transport and metabolism (COG supercategory P) for the acid mine biofilm metagenome. The most striking difference between metagenomes was the lack of enrichment in energy production and carbohydrate metabolism (COG supercategories C and G) in the obese mouse microbiota sample, in contrast to both lean human and mouse microbiota samples, indicating high metabolic activity of lean gut bacteria.

Artificial metagenomes, constructed from randomly selected genes of whole genome bacterial sequences from the NCBI with the same COG composition as their corresponding microbial samples, lose the environment-specific enrichment of optimization in their expression profiles.

Fig. 5 Enrichment of functions within highly expressed genes in metagenomes. Enrichment or depletion of functional annotations in the 3% of genes with the highest predicted expression (highest MELP measure) relative to the abundance of each COG supercategory in the whole metagenome for the OZ EBPR sludge (N = 29,754), Waseca farm soil (N = 88,696), acid mine biofilm (N = 79,257), Sargasso Sea (N = 688,539), US EBPR sludge (N = 20,175), Whale fall Santa Cruz microbial mat (N = 40,916), Whale fall Antarctic bone (N = 30,503), Whale fall Santa Cruz bone (N = 33,422), obese mouse gut (N = 4,058), lean mouse gut (N = 4,955), human gut (N = 47,765), Santa Cruz whale fall bone (N = 33,422), and acid mine (N = 79,257). Metagenomes show different functional enrichment patterns that are consistent with environmental requirements (e.g., metabolite transport functions [E] in the Sargasso Sea or energy conversion [C] in the whale carcass metagenome). Letters at the bottom represent COG supercategories

Validation with Metaproteomic Data
Predictions of gene expression for the Sargasso Sea metagenome were compared to the Sargasso Sea metaproteomic study (Sowell et al. 2008) and to a functionally (COG) classified subset of the human gut metaproteomic study (Verberkmoes et al. 2009). Predicted expression values based
on CU optimization positively correlate with abundance in metaproteomic studies, both for the comparison of each gene with the protein most similar in sequence (Sargasso Sea rho = 0.34) and when median values per gene and protein COG are compared (human gut rho = 0.34). This opens the way to an in silico prediction of the overall metagenomic proteome status.

Summary

Analysis of eleven distinct metagenomes shows that microbial communities exhibit codon usage bias similar to that already described for single microbial species. Microbial communities sharing an environment are likely to have similar synonymous codon usage-based translational optimization for expression of environment-specific genes. This effect can be used to identify genes with unknown function and “optimal” codon encoding, indicating their potential for high expression and therefore high relative importance in the community metabolism and lifestyle.

References

Bruggemann H, Henne A, Hoster F, Liesegang H, Wiezer A, Strittmatter A, et al. The complete genome sequence of Propionibacterium acnes, a commensal of human skin. Science. 2004;305:671–3.
Chen SL, Lee W, Hottes AK, Shapiro L, McAdams HH. Codon usage between genomes is constrained by genome-wide mutational processes. Proceedings of the National Academy of Sciences of the United States of America. 2004;101:3480–5.
Foerstner KU, von Mering C, Hooper SD, Bork P. Environments shape the nucleotide composition of genomes. EMBO reports. 2005;6:1208–13.
Gill SR, Pop M, DeBoy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–9.
Hunyadkurti J, Feltoti Z, Horvath B, Nagymihaly M, Voros A, McDowell A, et al. Complete Genome Sequence of Propionibacterium acnes Type IB Strain 6609. J Bacteriol. 2011;193:4561–2.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Research. 2007;17:377–86.
Ikemura T. Codon Usage and Transfer-RNA Content in Unicellular and Multicellular Organisms. Molecular Biology and Evolution. 1985;2:13–34.
Kanaya S, Yamada Y, Kinouchi M, Kudo Y, Ikemura T. Codon usage and tRNA genes in eukaryotes: Correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. Journal of Molecular Evolution. 2001;53:290–8.
Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, et al. From genomics to chemical genomics: new developments in KEGG. Nucl Acids Res. 2006;34:D354–7.
Karlin S, Mrazek J. Predicted highly expressed genes of diverse prokaryotic genomes. Journal of Bacteriology. 2000;182:5238–50.
Larimer FW, Chain P, Hauser L, Lamerdin J, Malfatti S, Do L, et al. Complete genome sequence of the metabolically versatile photosynthetic bacterium Rhodopseudomonas palustris. Nature Biotechnology. 2004;22:55–61.
Martin HG, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, et al. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nature Biotechnology. 2006;24:1263–9.
Oda Y, Larimer FW, Chain PSG, Malfatti S, Shin MV, Vergez LM, et al. Multiple genome sequences reveal adaptations of a phototrophic bacterium to sediment microenvironments. Proceedings of the National Academy of Sciences of the United States of America. 2008;105:18543–8.
Plotkin JB, Kudla G. Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet. 2011;12:32–42.
Rocha EPC. Codon usage bias from tRNA’s point of view: Redundancy, specialization, and efficient decoding for translation optimization. Genome Research. 2004;14:2279–86.
Sharp P, Li W. The codon Adaptation Index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987;15(3):1281–95.
Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Research. 2005;33:1141–53.
Sowell SM, Wilhelm LJ, Norbeck AD, Lipton MS, Nicora CD, Barofsky DF, et al. Transport functions dominate the SAR11 metaproteome at low-nutrient extremes in the Sargasso Sea. ISME J. 2008;3:93–105.
Staley JT, Konopka A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annual Review of Microbiology. 1985;39:321–46.
Supek F, Škunca N, Repar J, Vlahoviček K, Šmuc T. Translational Selection Is Ubiquitous in Prokaryotes. PLoS Genet. 2010;6:e1001004.
Supek F, Vlahovicek K. Comparison of codon usage measures and their applicability in prediction of microbial gene expressivity. BMC Bioinformatics. 2005;6:15.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:14.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, et al. Comparative metagenomics of microbial communities. Science. 2005a;308:554–7.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, et al. Comparative Metagenomics of Microbial Communities. Science. 2005b;308:554–7.
Tuller T, Carmi A, Vestsigian K, Navon S, Dorfan Y, Zaborske J, et al. An Evolutionarily Conserved Mechanism for Controlling the Efficiency of Protein Translation. Cell. 2010;141:344–54.
Tuller T, Girshovich Y, Sella Y, Kreimer A, Freilich S, Kupiec M, et al. Association between translation efficiency and horizontal gene transfer within microbial communities. Nucleic Acids Research. 2011;39:4743–55.
Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444:1027–31.
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74.
Verberkmoes NC, Russell AL, Shah M, Godzik A, Rosenquist M, Halfvarson J, et al. Shotgun metaproteomics of the human distal gut microbiota. ISME Journal. 2009;3:179–89.
Vieira-Silva S, Rocha EPC. The Systemic Imprint of Growth and Its Uses in Ecological (Meta)Genomics. PLoS Genet. 2010;6:e1000808.

Evaluating Putative Chimeric Sequences from PCR-Amplified Products

Juan M. Gonzalez
Instituto de Recursos Naturales y Agrobiologia, IRNAS-CSIC, Seville, Spain

Introduction

The term chimera has its origins in Greek mythology, where it denotes a creature composed of body parts from different living beings. In molecular biology, a chimeric sequence or chimera is a DNA sequence composed of DNA fragments originating from two or more genes or genomes.

Chimeric sequences can be generated naturally during DNA recombination, which occurs within a genome or when an organism takes up foreign DNA. These processes of crossover recombination are of interest in phylogenetic and evolutionary studies and need to be identified (Posada and Crandall 2002). Nevertheless, chimeras represent a serious problem when they are generated as artifacts during DNA manipulation and/or analysis.

Chimeric artifacts can be produced at different stages of experimental DNA studies. Examples include cloning procedures, DNA amplification, and DNA assembly during computational analysis (Fig. 1).

During DNA library preparation, genomic DNA is generally broken down into small fragments, which are introduced into cloning vectors or sequenced independently (Sambrook and Russell 2001). These fragments are generated by physical or enzymatic means. The generation of overlapping strand ends can lead to the random fusion of DNA fragments, resulting in chimeras that can be detected upon sequencing (Fig. 1a).

DNA amplification procedures are by far the most frequently reported processes generating chimeric sequences, and most amplification procedures are prone to generating chimeras. The most studied case is polymerase chain reaction (PCR) amplification, in which multiple copies of a target DNA region are produced through a cycling amplification reaction. The amplification is exponential, and errors arising during the reaction can be greatly amplified by the end of the PCR (Fig. 1b). For a variety of reasons, an incompletely amplified target fragment can act as a primer in the next cycle, potentially giving rise to a DNA fragment derived from two or more different DNA templates. Chimeric sequences can be generated during PCR amplification of any gene, although the most studied case is that of
ribosomal RNA genes (rRNAs), which contain both highly conserved and variable regions within their sequences. The rRNAs are present in every organism because cells require them for protein synthesis. In microbiology, rRNAs are widely used to detect and classify microorganisms; because most of these microorganisms are unculturable and cannot be detected otherwise, the rRNAs are, at present, the only means to survey for these microbes. A chimera would represent a nonexistent microorganism, so treating chimeras as real sequences can lead to serious overestimates of microbial diversity in environmental studies (Hugenholtz and Huber 2003; Gonzalez et al. 2005). Thus, it is of utmost importance to detect and filter out those chimeric DNA sequences.

Fig. 1 Scheme of different possibilities of potential chimera formation during DNA cloning and library preparation (a), PCR amplification (b), and computer processing of assembling DNA fragments (c). Examples are presented of chimeras formed during the preparation of libraries aimed at both vector cloning (left in a) and direct sequencing (right in a). During PCR amplification (b), incomplete synthesis of the target DNA fragment can lead in the next cycle to annealing to a different target with conserved regions and to extension using that different DNA target. The consequence is the generation of a chimera resulting from two different DNAs. During computation of the assembly of small DNA fragments obtained through sequencing (c), different possibilities could be similarly valid, and some of them can be chimeras

In addition to the potential to generate chimeras during DNA manipulation, the possibility of producing chimeras during computer processing of DNA sequences should be considered. Small DNA fragments forming DNA libraries are sequenced through a variety of sequencing platforms. These sequences are assembled into larger fragments of gene or genomic DNA. During this
assembly, a chimeric final sequence can be produced (Fig. 1c). Such chimeras arise mostly at the
ends of assembled DNA fragments, where they are generally induced by repetitive sequences (which
often cause trouble during the assembly process) or by chimeras formed during early DNA
manipulation steps or library preparation. These assembly errors can also cut short the generation
of larger contigs or fragments of genomic DNA. The assembly of DNA fragments from different
organisms into a single DNA sequence is a risk when working with DNA surveys of complex
communities, for instance, metagenomes, that is, genomic studies of complex microbial communities
(Mende et al. 2012).

Regardless of the step at which chimeric sequences are generated, they need to be detected and
filtered out before further analysis. Numerous strategies and pieces of software have been
proposed. Herein, the case of rRNAs will be used as an example, as most studies on chimera
evaluation have been carried out on these genes.

Chimeras and Microbial Diversity

Most surveys of the composition of microbial communities in natural environments are performed
through a PCR amplification step (Gonzalez et al. 2012). Generating a high number of fragments
from the rRNA genes (rRNA amplicons) represented in a community precedes library preparation and
sequencing (Wintzingerode et al. 1997; Roesch et al. 2007). At present, microbial communities are
understood to be composed of a highly diverse set of microorganisms, most of which remain
unculturable (Curtis et al. 2002). If microorganisms cannot be cultured in the laboratory, the
only means to analyze their potential features is through their nucleic acids. Due to the
complexity of genomes, accurate taxonomic classification of microorganisms can only be performed
with a small number of genes; the most frequently used are the rRNAs. Extensive databases have
been built with rRNAs, and today these genes represent the primary way to classify microorganisms
which are difficult to differentiate otherwise, either by morphology or physiological traits.

The rRNAs combine highly conserved and variable regions. Thus, partial synthesis of these genes
during PCR amplification can yield an incomplete DNA fragment able to anneal to a different rRNA
sequence in a complex mixture of DNAs. Annealing of that incomplete fragment to a target DNA from
a different organism, followed by extension in the same PCR cycle, results in a hybrid rRNA
sequence made up of portions of sequences from different microorganisms (Fig. 1b). Subsequent PCR
cycles generate multiple copies of that artifact. The resulting chimeras are undesired artifacts
that need to be detected and eliminated prior to further analysis.

The presence of chimeras in DNA databases has been previously reported (Hugenholtz and Huber 2003;
Ashelford et al. 2005; Gonzalez et al. 2005), and it hampers users who attempt to classify
microorganisms by their rRNAs. About 5 % of rRNA gene sequences may be suspicious or potential
chimeras (Ashelford et al. 2005; Haas et al. 2011). The use of curated rRNA-specific databases is
therefore recommended. Databases such as RDP (Ribosomal Database Project; Cole et al. 2009),
Greengenes (DeSantis et al. 2006), and SILVA (Quast et al. 2013) (Table 1) have curated entries.
These repositories aim to keep their entries free of chimeras and so allow a realistic
approximation to the identification of microorganisms through amplicon sequencing.

In spite of the potential for chimeras in environmental microbial surveys, current understanding
of these communities suggests a huge microbial diversity (Curtis et al. 2002). This enormous
diversity suggests that chimera detection is more complex than expected. However, the existence
of a large set of microbial rRNA sequences can be an ally for increasing the accuracy of chimera
detection. Only by knowing what is real can one discard what is unreal or chimeric (Gonzalez
et al. 2005).
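As a hedged illustration of the PCR chimera mechanism described in this section (an incompletely
extended strand from one template anneals to a different template and is completed there), the
following Python sketch builds a toy chimera in silico. The function names, the two 24-nt
sequences, and the assumption of pre-aligned templates of equal length are hypothetical
simplifications introduced here; they do not come from any published tool.

```python
def make_chimera(template_a: str, template_b: str, switch_pos: int) -> str:
    """Join the 5' part of template_a to the 3' part of template_b.

    Toy model of template switching: extension on template_a stops at
    switch_pos, and the partial product is completed on template_b.
    """
    if len(template_a) != len(template_b):
        raise ValueError("toy model assumes pre-aligned templates of equal length")
    if not 0 < switch_pos < len(template_a):
        raise ValueError("switch_pos must fall inside the sequence")
    return template_a[:switch_pos] + template_b[switch_pos:]


def identity(seq_x: str, seq_y: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(seq_x, seq_y) if x == y)
    return matches / len(seq_x)


# Two hypothetical fragments that share a conserved core (positions 8-15)
# flanked by variable regions, loosely mimicking rRNA gene structure.
organism_a = "ACGTACGTGGCCTTAAGATTACAG"
organism_b = "TTGACCAAGGCCTTAACCGGTTAC"

# The switch happens inside the conserved core, where annealing is plausible.
chimera = make_chimera(organism_a, organism_b, switch_pos=12)

print("chimera:", chimera)
print("5' half vs A:", identity(chimera[:12], organism_a[:12]))
print("3' half vs B:", identity(chimera[12:], organism_b[12:]))
print("full length vs A:", identity(chimera, organism_a))
print("full length vs B:", identity(chimera, organism_b))
```

Each half of the toy chimera matches a different parent perfectly, while the full-length sequence
matches neither parent completely. Treated as real, such a sequence would be counted as an
organism that does not exist.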
Evaluating Putative Chimeric Sequences from PCR-Amplified Products, Table 1 Some resources focused on
rRNAs including database and software suites incorporating options and tools for chimera detection

Name                              Chimera check procedure  Database/software   Link                                     Reference
Ribosomal Database Project (RDP)  Pintail                  Database and tools  http://rdp.cme.msu.edu                   Cole et al. 2003, 2009
SILVA                             Pintail                  Database and tools  http://www.arb-silva.de                  Quast et al. 2013
Greengenes                        Bellerophon              Database and tools  http://greengenes.lbl.gov                DeSantis et al. 2006
Mothur                            Various (a)              Software suite      http://www.mothur.org                    Schloss et al. 2009
QIIME                             ChimeraSlayer            Software suite      http://qiime.org                         Caporaso et al. 2010
AmpliconNoise                     Perseus                  Software suite      http://code.google.com/p/ampliconnoise/  Quince et al. 2011

(a) Various options are available: Bellerophon, Ccode, Pintail, ChimeraSlayer, Uchime, Perseus
Evaluating Putative Chimeric Sequences from PCR-Amplified Products, Table 2 Some of the latest software
alternatives for chimera detection in sequence data

Program        Link                                                   Reference
Bellerophon    http://comp-bio.anu.edu.au/bellerophon/bellerophon.pl  Hugenholtz and Huber 2003
Ccode          http://www.microextreme.net/downloads.html             Gonzalez et al. 2005
Pintail        http://www.mybiosoftware.com/rna-analysis/1262         Ashelford et al. 2005
WigeoN         http://microbiomeutil.sourceforge.net/#A_WigeoN        Haas et al. 2011
Decipher       http://decipher.cee.wisc.edu/FindChimeras.html         Wright et al. 2011
ChimeraSlayer  http://microbiomeutil.sourceforge.net/#A_CS            Haas et al. 2011
Uchime         http://drive5.com/uchime/uchime_download.html          Edgar et al. 2011
Perseus        http://code.google.com/p/ampliconnoise/                Quince et al. 2011
In fact, the large diversity of microorganisms known so far provides a range of variability within
specific microbial taxa. As microbial taxonomy and the sequences of rRNAs become increasingly
defined and curated, the detection of chimeric rRNAs is gaining accuracy. Thus, curated and
extensive rRNA databases will definitely help both to avoid flagging real sequences as chimeras
and to improve the accurate detection of unreal sequences as chimeras.

Chimera Evaluation

Different procedures have been published to check for chimeras in newly generated DNA sequences,
and a long list of programs has been proposed to check or detect chimeras. Table 2 presents some
of those alternatives, each with its original publication and a link to its homepage. As mentioned
above, most of these studies have aimed to detect chimeras in DNA fragments obtained from PCR
amplification, specifically of rRNA genes.

Originally, a simple method to intuitively and approximately detect a potential chimera was to
search independently for homologues of the initial and final portions of the DNA fragment. This
search was usually performed with BLAST (Altschul et al. 1990). If BLAST returned different
organisms for the initial and final portions of the DNA fragment, the fragment was suspected to be
a chimera (Cole et al. 2003). More sophisticated approaches have been designed through the years.
A fruitful method was to analyze potential chimeras by comparison to the sequences obtained from
the rRNA gene library being sequenced and analyzed (Hugenholtz and Huber 2003). A similar analysis
can be carried out against full DNA databases or repositories (Ashelford et al. 2005; Quast et al.
2013). Further improvements included analyzing the query sequence in relation to the known
sequences showing highest homology, for instance, within a taxonomic group. These known sequences
define the expected variability for small portions of the DNA fragment under analysis, so query
sequences showing a higher dispersion than that delimited by the known sequences were identified
as potential chimeras; these assessments included statistical results of the computational
analysis (Ashelford et al. 2005; Gonzalez et al. 2005). New procedures to screen for chimeras are
periodically proposed, mainly analyzing portions of the DNA fragment (Wright et al. 2011) and
checking whether DNA database searches return different results for them. A DNA fragment is
proposed as a chimera if it presents different homology results for different portions throughout
its length.

As a result of next-generation sequencing (NGS) platforms, large numbers of sequences are being
generated through whole-library sequencing. Screening such amounts of data would not be possible
without the latest developments and the recent design of pipelines for the analysis of large data
sets of DNA amplicon sequences (Schloss et al. 2009; Caporaso et al. 2010; Quince et al. 2011).
The inclusion of chimera checking procedures within these pipelines (Table 1) has greatly
facilitated the analysis of massive sequencing data. Nevertheless, the newly introduced algorithms
are masked by the advantages of the pipelines as a whole and the easy handling of large sequencing
data (Quince et al. 2011). Users should confirm that the computational pipeline used to process
their sequencing data includes a chimera filtering procedure. Besides, some of these pipelines
offer the possibility of using different databases. The inclusion of curated databases in these
analyses is an important point to consider.

Amplicon sequencing is still the most used procedure for microbial surveys through rRNAs.
Detecting potential chimeras during these studies is required to avoid the false consideration of
nonexisting microorganisms and an overestimation of microbial diversity. Current pipelines for the
processing of amplicon sequencing data incorporate chimera screening and filtering procedures.
Databases must continue their current effort to evaluate newly deposited sequences for potential
chimeras. Curated rRNA databases are a required reference for the taxonomic classification of
microorganisms through sequencing data. These efforts will result in a more accurate detection of
chimeras, a significant decrease in misclassifications due to erroneous sequences included in
databases, and an improved knowledge of microbial species, gene, and genomic diversity.

Future Perspectives

As NGS is attracting most research on genomics, metagenomics, transcriptomics, and amplicon
sequencing surveys, the massive data it generates and the work needed to process these results are
increasing exponentially. High-throughput procedures are required to cope with this demand. The
use of current pipelines, or future improvements, should build a standard for amplicon sequencing.
The detection of sequencing errors through bioinformatic algorithms should also be introduced into
these high-throughput pipelines, all aiming to obtain clean and accurate data prior to further
analysis. The screening and curation being performed by public repositories must continue in spite
of the developments in pipelines and algorithms to ensure that databases remain as clean as
possible of chimeric and erroneous sequences. At a time when sequencing analyses are no longer
manually edited, algorithms to automatically filter out chimeras, together with the required
curation at the scientist and database ends, will become increasingly relevant.
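The split-search heuristic described under Chimera Evaluation (search the initial and final
portions of a fragment independently and compare the best hits) can be sketched as follows. This
is a minimal sketch under stated assumptions, not the procedure of any program in Table 2. An
ungapped identity score against a tiny in-memory reference set stands in for a BLAST search, the
references are assumed to be pre-aligned to the query, and the names `best_hit`, `flag_chimera`,
`min_identity`, and the toy sequences are hypothetical.

```python
from typing import Dict, Tuple


def identity(seq_x: str, seq_y: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(1 for x, y in zip(seq_x, seq_y) if x == y) / len(seq_x)


def best_hit(fragment: str, start: int, references: Dict[str, str]) -> Tuple[str, float]:
    """Return (name, identity) of the reference most similar to the fragment.

    Assumption: references are pre-aligned to the query, so the fragment is
    compared with the same coordinates of every reference (no real search).
    """
    end = start + len(fragment)
    scores = {name: identity(fragment, ref[start:end]) for name, ref in references.items()}
    top = max(scores, key=scores.get)
    return top, scores[top]


def flag_chimera(query: str, references: Dict[str, str],
                 min_identity: float = 0.9) -> Tuple[bool, str, str]:
    """Flag a query whose two halves are best explained by different references."""
    if any(len(ref) != len(query) for ref in references.values()):
        raise ValueError("toy model assumes references aligned to the query length")
    half = len(query) // 2
    head_name, head_id = best_hit(query[:half], 0, references)
    tail_name, tail_id = best_hit(query[half:], half, references)
    # Suspicious only if both halves match well, but match different organisms.
    suspicious = head_name != tail_name and min(head_id, tail_id) >= min_identity
    return suspicious, head_name, tail_name


# Hypothetical reference set and queries (not real rRNA sequences).
references = {
    "taxon_A": "ACGTACGTGGCCTTAAGATTACAG",
    "taxon_B": "TTGACCAAGGCCTTAACCGGTTAC",
}
chimeric_query = "ACGTACGTGGCCTTAACCGGTTAC"  # 5' half of taxon_A, 3' half of taxon_B
genuine_query = references["taxon_A"]

print(flag_chimera(chimeric_query, references))
print(flag_chimera(genuine_query, references))
```

A fixed midpoint split is the simplest choice and misses chimeras whose breakpoint lies near one
end. The later methods cited above instead examine many portions along the fragment and attach
statistical support to the call.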
Acknowledgments The author acknowledges funding from the Spanish Ministry of Economy and
Competitiveness, project CONSOLIDER CSD2009-00006, which includes participation of Feder funds.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol.
  1990;215:403–10.
Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. At least 1 in 20 16S rRNA sequence
  records currently held in public repositories is estimated to contain substantial anomalies.
  Appl Environ Microbiol. 2005;71:7724–36.
Caporaso JG, Kuczynski J, Stombaugh J, et al. QIIME allows analysis of high-throughput community
  sequencing data. Nat Methods. 2010;7:335–6.
Cole JR, Chai B, Marsh TL, et al. The Ribosomal Database Project (RDPII): previewing a new
  autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucl Acids Res.
  2003;31:442–3.
Cole JR, Wang Q, Cardenas E, et al. The Ribosomal Database Project: improved alignments and new
  tools for rRNA analysis. Nucl Acids Res. 2009;37:D141–5.
Curtis TP, Sloan WT, Scannell JW. Estimating prokaryotic diversity and its limits. Proc Natl Acad
  Sci USA. 2002;99:10494–9.
DeSantis TZ, Hugenholtz P, Larsen N, et al. Greengenes, a chimera-checked 16S rRNA gene database
  and workbench compatible with ARB. Appl Environ Microbiol. 2006;72:5069–72.
Edgar RC, Haas BJ, Clemente JC, et al. UCHIME improves sensitivity and speed of chimera detection.
  Bioinformatics. 2011;27:2194–200.
Gonzalez JM, Zimmermann J, Saiz-Jimenez C. Evaluating putative chimeric sequences from
  PCR-amplified products. Bioinformatics. 2005;21:333–7.
Gonzalez JM, Portillo MC, Belda-Ferre P, Mira A. Amplification by PCR artificially reduces the
  proportion of the rare biosphere in microbial communities. PLoS ONE. 2012;7(1):e29973.
Haas BJ, Gevers D, Earl AM, et al. Chimeric 16S rRNA sequence formation and detection in Sanger
  and 454-pyrosequenced PCR amplicons. Genome Res. 2011;21:494–504.
Hugenholtz P, Huber T. Chimeric 16S rDNA sequences of diverse origin are accumulating in the
  public databases. Intl J Syst Evol Microbiol. 2003;53:289–93.
Mende DR, Waller AS, Sunagawa S, et al. Assessment of metagenomic assembly using simulated next
  generation sequencing data. PLoS ONE. 2012;7(2):e31386.
Posada D, Crandall AK. The effect of recombination on the accuracy of phylogeny estimation. J Mol
  Evol. 2002;54:396–402.
Quast C, Pruesse E, Yilmaz P, et al. The SILVA ribosomal RNA gene database project: improved data
  processing and web-based tools. Nucl Acids Res. 2013;41:D590–6.
Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ. Removing noise from pyrosequenced amplicons. BMC
  Bioinforma. 2011;12:38.
Roesch LFW, Fulthorpe RR, Riva A, et al. Pyrosequencing enumerates and contrasts soil microbial
  diversity. ISME J. 2007;1:283–90.
Sambrook JJ, Russell DDW. Molecular cloning. A laboratory manual. Cold Spring Harbor: Cold Spring
  Harbor Laboratory Press; 2001.
Schloss PD, Westcott SL, Ryabin T, et al. Introducing mothur: open-source, platform-independent,
  community supported software for describing and comparing microbial communities. Appl Environ
  Microbiol. 2009;75:7537–41.
Wintzingerode F, Göbel UB, Stackebrandt E. Determination of microbial diversity in environmental
  samples: pitfalls of PCR-based rRNA analysis. FEMS Microbiol Rev. 1997;21:213–29.
Wright ES, Yilmaz LS, Noguera DR. DECIPHER, a search-based approach to chimera identification for
  16S rRNA sequences. Appl Environ Microbiol. 2011;78:717–25.

Extended Local Similarity Analysis (eLSA) of Biological Data

Fengzhu Sun and Li Charlie Xia
Molecular and Computational Biology Program, Department of Biological Sciences, University of
Southern California, Dana and David Dornsife College of Letters, Arts and Sciences,
Los Angeles, CA, USA

Synonyms

Local association analysis; Local similarity analysis

Introduction

The advances in high-throughput low-cost experimental technologies have made possible time series
studies of hundreds or thousands of biological factors simultaneously. The availability of such
datasets leads to an increased interest in profile similarity analysis techniques that can
identify significant association patterns possibly carrying biological insights. In the context of
metagenomics, factors of particular interest are operational taxonomic units (OTUs), microbial
genomes, and environmental genes. Their association patterns may suggest microbe-environment
interactions, symbiotic relationships, and other types of interactions.

Many computational or statistical approaches exist to study profile similarity at the global
scale, such as Pearson’s correlation coefficients (PCC), Spearman’s correlation coefficients
(SCC), principal component analysis (PCA), multidimensional scaling (MDS), discriminant function
analysis (DFA), and canonical correlation analysis (CCA). However, in many biological settings,
the interaction may be active only within certain subintervals, or the response to regulation may
be time lagged. Methods based on the global relationships of profiles may fail to detect these
interactions. The extended local similarity analysis (eLSA) method is specifically developed to
capture local and potentially time-delayed co-occurrence and association patterns in time series
data that cannot otherwise be identified by ordinary correlation analysis.

Description

Local Association with Possible Time Delays

Local association refers to an association that occurs only in a subinterval of the time of
interest. Time-delayed association indicates that there is a time lag in the response of one
factor to the change in another factor. As an example of local association, in Fig. 1, the
top-left panel shows two series X and Y with nonsignificant correlation (r = 0.26, P = 0.273);
however, they are in fact significantly correlated in the time interval from 7 to 16, as shown in
the bottom-left panel (eLS = 0.43, P = 0.028). As an example of time-delayed local association, in
Fig. 1, the top-right panel shows two series X and Y with nonsignificant correlation (r = 0.26,
P = 0.272); however, they are in fact significantly correlated in the time interval from 4 to 17
if X is shifted three units toward the origin, as shown in the bottom-right panel (eLS = 0.51,
P = 0.006).

Extended Local Similarity Analysis

Extended local similarity analysis (eLSA) is an analysis technique designed to capture local
associations, possibly with time delays. eLSA extends the original local similarity analysis
technique (Qian et al. 2001; Ruan et al. 2006) and the local shape analysis technique to time
series data with replicates (Xia et al. 2011). Improvements in the computational efficiency of
p-values have also been made (Xia et al. 2013). Time series data of a pair of factors X and Y with
replicates can be expressed as data matrices X[1:m][1:n] and Y[1:m][1:n], where each column is one
sample from a time point and n is the number of time points; each row is a replicate and m is the
number of replicates.

Given time series data of two factors and a user-constrained delay limit, eLSA uses a dynamic
programming algorithm to find the configuration of the data that yields the highest extended local
similarity (eLS) score, a similarity metric defined as

\[
\mathrm{eLS}\left(X_{[1:m][1:n]},\, Y_{[1:m][1:n]}\right)
  = \max_{i,\,j,\,l \ \text{s.t.}\ |i-j| \le D}
    \left| \frac{1}{n} \sum_{k=0}^{l-1}
      F\left(X_{[1:m],\,i+k}\right)\, F\left(Y_{[1:m],\,j+k}\right) \right|
\]

where D is the delay limit and F is the summarizing function for repeated measures (mean, median,
etc.). For example, within a delay limit of two units, the first time spot of one series might be
aligned to the third time spot of the other series, thus maximizing their eLS.

For a dataset of many factors, eLSA is applied to each pairwise combination of factors in the
dataset. Candidate associations are then evaluated statistically by a permutation test, which
calculates the p-value as the proportion of scores exceeding the original eLS score after
shuffling the first series and reevaluating the eLS score many times, or more efficiently by
theoretical approximation.

Researchers can use eLSA to detect undirected associations, i.e., association patterns without
time delays, and directed associations, where the change of one factor may temporally lead or
follow another factor. Figure 2 shows the analysis pipeline of the eLSA technique.

Extended Local Similarity Analysis (eLSA) of Biological Data, Fig. 1 Examples of local and
time-delayed associations. Top left, two series X and Y with nonsignificant correlation (r = 0.26,
P = 0.273); bottom left, they are in fact significantly correlated in the time interval from
7 to 16 (eLS = 0.43, P = 0.028); top right, two series X and Y with nonsignificant correlation
(r = 0.26, P = 0.272); bottom right, they are significantly correlated in time interval from
4 to 17 if X is shifted three units toward origin (eLS = 0.51, P = 0.006)

Extended Local Similarity Analysis (eLSA) of Biological Data, Fig. 2 The eLSA pipeline. Users
start with raw data (matrices of time series) as input and specify their requirements as
parameters. The LSA tools subsequently F-transform and normalize the raw data and calculate
extended local similarity (eLS) scores and Pearson’s correlation coefficients. The tools then
assess the statistical significance (p-values) of these correlation statistics using the
permutation test and filter out insignificant results. Finally, the tools construct a partially
directed association network from the significant associations

Inferring Co-occurrence Networks Using eLSA

Studies adopting the local similarity analysis technique have produced interesting and novel
discoveries in microbial community network analysis. In one of these studies (Steele et al. 2011),
eLSA is used to find associations among relative abundances of bacteria, archaea, protists, total
abundance of bacteria and viruses, and physicochemical parameters. Co-occurrence networks were
generated from significant eLSA associations to visualize and identify time-dependent
relationships among ecologically important taxa, for example, the SAR11 cluster, stramenopiles,
alveolates, cyanobacteria, and ammonia-oxidizing archaea.

A subnetwork from the study is shown in Fig. 3. It is built around γ-proteobacteria OTUs as
central nodes (abbreviated Alt, alteromonas; CHB, CHABI-7; Gam, γ-proteobacterium; S86, SAR86;
S92, SAR92). This subnetwork identifies 12 γ-proteobacterial OTUs. γ-proteobacteria OTUs correlate
with eukaryotes and Crenarchaea (Cren), as well as environmental parameters and bacterial
production. γ-proteobacterium SAR92-749 is more likely an opportunistic species, as the relative
abundance of SAR92-749 positively correlated with bacterial production measured by leucine and
thymidine incorporation (eLS = 0.54, P = 0.003 and eLS = 0.495, P = 0.005, respectively).

Extended Local Similarity Analysis (eLSA) of Biological Data, Fig. 3 An eLSA subnetwork built
around γ-proteobacteria OTUs as central nodes (abbreviated Alt alteromonas, CHB CHABI-7, Gam
γ-proteobacterium, S86 SAR86, S92 SAR92)

Conclusion

The eLSA technique uniquely captures local and potentially time-delayed co-occurrence and
association patterns in time series data. It is also applicable to other types of gradient data,
including the response to different levels of treatments, temperature, humidity, or spatial
distributions. The analysis pipeline is implemented as a C++ extension to Python, which
streamlines data normalization, local similarity scoring, permutation testing, and network
construction. More information about the software is available from eLSA’s homepage at
http://meta.usc.edu/softs/lsa.

Cross-References

▶ Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads
▶ Computational Approaches for Metagenomic Datasets

References

Qian J, Dolled-Filhart M, Lin J, et al. Beyond synexpression relationships: local clustering of
  time-shifted and inverted gene expression profiles identifies new, biologically relevant
  interactions. J Mol Biol. 2001;314(5):1053–66.
Ruan Q, Dutta D, Schwalbach MS, et al. Local similarity analysis reveals unique associations among
  marine bacterioplankton species and environmental factors. Bioinformatics. 2006;22(20):2532–8.
Steele JA, Countway PD, Xia L, et al. Marine bacterial, archaeal and protistan association
  networks reveal ecological linkages. ISME J. 2011;5(9):1414–25.
Xia LC, Steele JA, Cram JA, et al. Extended local similarity analysis (eLSA) of microbial
  community and other time series data with replicates. BMC Syst Biol. 2011;5 Suppl 2:S15.
Xia LC, Ai D, Cram J, et al. Efficient statistical significance approximation for local similarity
  analysis of high-throughput time series data. Bioinformatics. 2013;29(2):230–7.

Extraction Methods, Variability Encountered in

Paul L. E. Bodelier
Netherlands Institute of Ecology (NIOO-KNAW), Wageningen, Netherlands

Synonyms

Bias in DNA extraction methods; Variation in DNA extraction methods
Definition

The variability in extraction methods is defined as differences in quality and quantity of DNA
observed using various extraction protocols, leading to differences in outcome of microbial
community composition assessments using genomic approaches.

Introduction

Microbial communities are at the very basis of life on Earth, catalyzing biogeochemical reactions
and driving global nutrient cycles (Falkowski et al. 2008). As yet, they are not on the global
biodiversity conservation agenda, implying that microbial diversity is not under any threat by
anthropogenic disturbance or climate change. However, this may be a misconception caused by the
rudimentary knowledge we have concerning microbial communities in their natural habitats as
compared to the knowledge we have on plants and animals. The inability to culture the vast
majority of microbes present in ecosystems prevents the detailed study of their ecology and
physiology. The introduction of culture-independent methods based on DNA and RNA studies has
revolutionized our abilities to study microbes and microbial communities in their natural habitat.
A vast array of methods has been established, going from assessing the complete genomes of single
cells to whole communities. A vast number of books, overview articles, as well as reviews have
been published on the molecular assessment of ecology, functioning, and diversity of environmental
microbes and microbial communities, of which the following are recommended: Liu and Jansson
(2010), De Bruin (2011), and Kowalchuk et al. (2007).

Despite the advances made and insight gained since the genomic revolution, we are still far away
from understanding the functioning of microbial communities in situ and especially their
individual contributions to biogeochemical reactions. The most challenging task ahead in microbial
ecology will be to compare and integrate data from various habitats and environmental conditions
that are collected by different laboratories in order to come to concepts and general principles
of community structure, functioning, and regulation. This is a challenge because the most crucial
step in any culture-independent molecular microbial study is the extraction of nucleic acids from
cells and the recovery from the environment. Within the last decades, countless procedures and
protocols have been developed to obtain DNA or RNA from often very complex habitats, optimized to
yield DNA/RNA amenable to PCR- or non-PCR-based downstream community composition analyses. This
was necessary because microbial cells as well as the habitats where they are retrieved from
contain compounds which damage the nucleic acids directly, make the DNA/RNA inaccessible, or
inhibit downstream applications directly. The major problem microbial ecology research is facing
is that the efficiency and outcome of community composition analyses vary between the protocols
used and between the environmental matrices the protocols are applied to. This entry will give an
overview of DNA extraction methods and associated biases and what can be done to improve
comparability between different habitats.

DNA Extraction from Environmental Samples and Sources of Variability

When retrieving DNA from complex habitats, there are two main hurdles to overcome. First, the DNA
has to be liberated from the cells. Second, the DNA has to be protected from degradation and
precipitation, which requires separation from other cell components and environmental
contaminants. As said, countless protocols have been developed, which consist mainly of five key
steps that can vary in the way they are executed. An overview of these steps and variants in
execution is given in Fig. 1. The overview is a summary of many studies, and giving these
references goes beyond the goal of this overview. However, most aspects addressed can be found in
Kowalchuk et al. (2007), Herrera and Cockell (2007), and Lombard et al. (2011).
Environmental sample
a Sample processing and storage (mixing, fresh, frozen, freeze dried)
    Indirect: extracting cells and separation from environmental matrix (shaking in extraction
      buffer, blending, sonication, density centrifugation)
    Direct: extraction in environmental matrix
b DNA liberation from cells
    In buffer of alkaline pH containing salts and chelating agents (e.g., EDTA) for nuclease
      deactivation. Vitamins, activated charcoal, Al2(SO4)3, CaCO3, CaCl2 can be added to bind and
      remove humic acids.
    Physical: bead beating, sonication, grinding, cryogenic mill, microwave, freeze-thawing
    Enzymatic: lysozyme, proteinase K, xanthogenate
    Chemical: SDS, CTAB, sarcosyl
    Optional: agarose plug digestion of indirectly extracted cells (high molecular weight DNA,
      for metagenomics)
    Collection of the aqueous phase containing DNA after centrifugation and subsequent removal of
      environmental matrix, including proteins, humic acids, and polysaccharides.
c DNA extraction and recovery
    Extraction using organic hydrophobic solvents (phenol, chloroform, isoamyl alcohol) causing
      precipitation of organic cell components. The hydrophilic DNA remains in the aqueous phase,
      which can be collected after phase separation following centrifugation. DNA is recovered
      from the aqueous phase by precipitation with ethanol or isopropanol.
d Additional purification
    Agarose gel electrophoresis (linear and nonlinear SCODA), electroelution, density
      centrifugation, spin-column based (solid phase, molecular size separation), anion exchange
      chromatography
e Quality control and quantitation
    Agarose gel electrophoresis, pulsed-field gel electrophoresis, spectrophotometry (260/280 nm
      ratio), NanoDrop, PicoGreen/SYBR green, Lab on Chip

Extraction Methods, Variability Encountered in, Fig. 1 Schematic presentation of the steps and
procedures to extract and purify DNA from environmental samples. Step B is the step in all
protocols where most biases are introduced
The first step in every DNA-based study is the collection and storage of environmental samples
before the DNA is extracted (step A in Fig. 1). Whether the samples are fresh, have been stored
cold or frozen, or have been freeze dried can already give rise to variations in the quality and
quantity of the extracted DNA, depending on the environmental matrix and the community
composition. However, it has recently been shown using a pyrosequencing approach that the
variation introduced by sample storage of soil and human-associated samples was insignificant
(Lauber et al. 2010). After sample storage, two routes can be followed to step B, the liberation
of DNA from cells (step B in Fig. 1). Either cells are released from the environmental matrix by
shaking or sonication, followed by harvesting by, e.g., density centrifugation with subsequent
lysis (indirect extraction), or the cells are lysed in the environmental matrix directly (direct
extraction). Generally, direct lysis is preferred because the DNA yield is higher, as no cells are
lost during cell extraction and purification. However, especially in metagenomic studies where
large intact DNA fragments (>20 kb in size) are required to obtain complete genes, operons, and
genomes, it has been shown that the indirect method is preferred and also does not lead to a
significant difference in overall diversity (Delmont et al. 2011b). The subsequent liberation of
DNA from cells is the step in all extraction protocols where most bias is introduced. Cell walls
have to be broken. The efficiency depends on the cell wall structure (Gram-positive vs.
Gram-negative) and the presence of extracellular slime layers composed of polysaccharides and
proteins. The lysis methods commonly used, physical, enzymatic, and chemical (Fig. 1), also differ
in their efficiency of lysis, giving rise to variability that strongly depends on the community
composition in terms of the presence of difficult-to-lyse cells. Also at this step, a choice of
method can be made on the basis of the downstream application. The physical disruption techniques
(e.g., bead beating) yield low molecular weight DNA (<20 kb) not suitable for metagenome studies.
In this case the enzymatic lysis methods in combination with detergents are preferred. The use of
agarose plugs to perform enzymatic lysis has been shown to be very effective in obtaining high
molecular weight DNA (Williamson et al. 2011).

Next to the method of lysis, the environmental matrix is also a source of variation. The
extraction and liberation of DNA is always executed in a “lysis buffer.” The buffer normally is of
alkaline pH (8–9), which reduces electrostatic interactions between DNA and proteins, inhibits
enzymes degrading DNA (nucleases), and facilitates denaturing of other proteins. Often a chelating
agent (e.g., EDTA) is added to the buffer, which destabilizes cell walls and membranes as well as
proteins by binding cations (Ca2+, Mg2+). Besides protecting DNA from degradation once it is
liberated, compounds that bind the DNA should be removed before non-DNA components are removed by
centrifugation. Humic acids are derived from plant and animal remains by decomposition and are
highly diverse in chemical structure. Because their functional groups vary and can adhere more or
less strongly to DNA, and because their amount and structure depend on the biota and chemical
conditions of the environment, the impact of humic acids on DNA extraction is highly variable.
Hence, a large number of compounds (step B, Fig. 1) have been tested and used to bind and remove
humic acids already at the stage of liberation of DNA. The latest addition was the use of vitamins
(Techer et al. 2010). Centrifugation removes cell debris and precipitated components, while the
supernatant containing the DNA is taken to step C (Fig. 1), which is the extraction of DNA from
the remaining organic cell and environmental components. This is done by phase separation using
hydrophobic solvents (step C, Fig. 1), keeping the DNA in the aqueous phase, which with
phenol-chloroform mixtures typically forms the upper layer above the denser hydrophobic phase
containing the remaining cell components. Variability in this step can only come from the quality
of the chemicals and the pipetting skills of the researcher. Care has to be taken not to collect
any of the hydrophobic phase, which leads to differences in the amounts of aqueous phase
collected. The DNA is recovered by precipitation
using ethanol or isopropanol which destroys the helical structure leading to precipitation.
After resuspension in water or buffer, the DNA is either ready for use in various analyses of
abundance, diversity, or genomic procedures or has to be cleaned up further to remove any
remaining impurities, as indicated in step D (Fig. 1). The potential additional variation
introduced here is that DNA can be lost, leading to changes in relative abundances because
some species no longer reach the detection limits of the respective downstream method. Hence,
when DNA yield from samples is low, additional cleanup is often not an option. At this step,
too, some procedures are more applicable when high molecular weight (HMW) DNA is preferred.
A procedure in which direct current and pulsating nonlinear currents are alternated in gel
electrophoresis has been shown to be very effective in purifying HMW DNA from soil (Engel
et al. 2012). The last step before downstream analyses is quality control and quantification
of the DNA concentration. UV spectrophotometry is most often used as an indicator of purity,
where the ratio of absorbance at wavelengths 260/280 nm should be approximately 1.8 to 2.0
when DNA is free from proteins or humic acids. The NanoDrop device is mostly used for this
purpose because it only requires a few µl of the precious extracted DNA. However, the
spectrophotometric methods suffer from the fact that co-extracted RNA is also measured and
that humic acids also absorb, eventually overestimating the amount of DNA in the extract.
Alternative methods based on fluorescent dyes binding to double-stranded DNA can be used,
which only detect DNA but which are also sensitive to interference by humic acids. Bias-free
quantification methods are those in which gel electrophoresis is combined with densitometry,
which is even available in a lab-on-chip format.
All the procedures described in Fig. 1 have also been combined and offered as commercial
ready-to-go DNA extraction kits for various environmental matrices, often accompanied by
machinery for cell lysis. Table 1 gives an overview of some commercially available kits and
equipment.

Variability and Community Composition Assessment

The central question in microbial ecological research is why microbial communities are
composed the way they are and what factors influence community composition. To this end it is
essential, when comparing one sample with another, that the differences observed are due to
biotic or abiotic factors and not to biases introduced by the methods used. It is obvious from
the previous section that a bias-free extraction of DNA from all environments is not possible.
The matrix shown in Fig. 1 is a collection of methods developed with the goal of obtaining
PCR-amplifiable DNA. Hence, the protocols were not designed for bias-free extraction but for
obtaining an extract that enables downstream applications. Considering the inherent problems
specific to various environmental matrices, no single protocol will suffice for all
environments. The protocols developed were designed and tested to yield the highest quality
and quantity of DNA and the highest diversity in fingerprinting methods (denaturing gradient
gel electrophoresis (DGGE), terminal restriction fragment length polymorphism (T-RFLP),
microarray), the highest abundance assessed with quantitative polymerase chain reaction
(qPCR), or the highest MW DNA in metagenomic studies. Hence, community composition was the
criterion for testing the performance of protocols, and the number of protocols available is
a good indicator of the biases introduced. However, it was demonstrated that even when
applying one protocol to exactly the same soil sample, community composition analyses
following DNA extraction are not bias-free (Pan et al. 2010). When a single well-homogenized
soil sample was extracted in different laboratories using the same protocol, biases were
already introduced at the initial extraction. The DNA quantity (Fig. 2a) as well as quality
varied significantly between laboratories, leading to significant differences in community
composition of methane-oxidizing bacteria (Fig. 3) as assessed by PCR-based microarray
analyses. Moreover, the same extractions performed by
Extraction Methods, Variability Encountered in, Table 1 Overview of a number of commercially
available DNA extraction kits, lysis equipment, additional cleanup kits, and DNA quantitation
methods

Soil DNA extraction kits
  PowerSoil and PowerMax/Mobio
    http://www.mobio.com/soil-dna-isolation/powermax-soil-dna-isolation-kit.html
  SoilMaster/Epicentre Technologies
    http://www.epibio.com/item.asp?id=388
  E.Z.N.A.® Soil DNA Kit/Omega BioTek
    http://www.omegabiotek.com/product_detail.php?ID=95
  ZR Soil Microbe DNA Kit/Zymo Research
    http://www.zymoresearch.com/media/downloads/212/D6001d.pdf
  FastDNA® SPIN kit for Soil/MP Biomedicals
    http://www.biocompare.com/11793-DNA-Purification-Kits-Soil/2691724-FastDNA96-Soil-Microbe-DNA-Kit/

Cell disruption equipment
  BioSpec Mini Bead Beater
    http://www.biospec.com/product/28/mini_beadbeater/
  MP Biomedicals FastPrep®-24 or MP Biomedicals FastPrep®-96
    http://www.mpbio.com/product_info.php?family_key=116004500
  Geno/Grinder®
    http://www.spexsampleprep.com/equipment-and-accessories/equipment_product.aspx?typeid=1
  Free/Mill®
    http://www.spexsampleprep.com/equipment-and-accessories/equipment_product.aspx?typeid=2

Additional cleanup kits
  Wizard® SV Gel and PCR Clean-Up System
    http://www.promega.com/products/dna-and-rna-purification/dna-fragment-purification/wizard-sv-gel-and-pcr-clean_up-system/
  Sepharose 4B® columns
    http://www.gelifesciences.com/webapp/wcs/stores/servlet/catalog/nl/GELifeSciences-nl/products/AlternativeProductStructure_17546/17075701
  Nonlinear electrophoresis (SCODA)
    http://www.borealgenomics.com/products/aurora/

DNA quality/quantity
  NanoDrop
    http://www.nanodrop.com/
  PicoGreen (Quant-iT™)
    http://www.invitrogen.com/site/us/en/home/brands/Product-Brand/Quant-iT.html
  Microfluidics, Agilent Bioanalyzer
    http://www.genomics.agilent.com/GenericB.aspx?PageType=Family&SubPageType=FamilyOverview&PageID=183
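
The UV-based purity and quantity checks discussed for the spectrophotometric devices in Table 1
can be illustrated with a small calculation. The sketch below is not taken from any of the
cited protocols: the function names are hypothetical, the conversion of one A260 unit to
50 ng/µl of double-stranded DNA is the common textbook convention for a 1 cm path length, and
the acceptance window for the A260/A280 ratio is an illustrative assumption.

```python
def dsdna_conc_ng_per_ul(a260, dilution_factor=1.0, path_cm=1.0):
    """Estimate dsDNA concentration from absorbance at 260 nm.

    Assumes the usual convention that A260 = 1.0 (1 cm path) corresponds to
    about 50 ng/ul of double-stranded DNA. Co-extracted RNA or humic acids
    also absorb, so this is an upper estimate for environmental extracts.
    """
    return a260 / path_cm * 50.0 * dilution_factor


def a260_a280_ratio(a260, a280):
    """Return the A260/A280 purity ratio."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return a260 / a280


def looks_pure(ratio, low=1.7, high=2.1):
    """Illustrative acceptance window; the limits are assumptions, not a standard."""
    return low <= ratio <= high


if __name__ == "__main__":
    # Hypothetical readings for one extract
    a260, a280 = 0.90, 0.50
    ratio = a260_a280_ratio(a260, a280)
    print(f"estimated dsDNA: {dsdna_conc_ng_per_ul(a260):.1f} ng/ul")
    print(f"A260/A280: {ratio:.2f}, within window: {looks_pure(ratio)}")
```

A ratio inside the window does not prove the absence of humic acids, which is one reason the
text recommends dye-based or gel-based quantification as a cross-check.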
two investigators simultaneously in the same laboratory using exactly the same chemicals and
equipment also yielded significant differences in DNA quantity (Fig. 2b) and quality, proving
that the investigator can also introduce biases, probably through pipet handling in step C
(Fig. 1) of the protocol. Another source of bias appeared to be the DNA quantitation method
(Fig. 2), which led to significantly different community profiles (Fig. 4) as well as
abundances of methane-consuming bacteria. In this case overestimation of DNA quantity by
NanoDrop leads to a higher dilution of the DNA to reach the same input amount of target DNA as
in the PicoGreen-based PCR reaction. This dilution reduced the remaining inhibition of the PCR
by contaminants still present in the DNA, with consequences for the subsequent outcome of the
downstream analyses.
Important improvements were made to reduce extraction bias by extracting the same sample
matrix, remaining in the pellet of step B (Fig. 1), multiple times (Feinstein et al. 2009).
After three extractions, DNA quantity as well as bacterial abundance reached a plateau which
was similar for a number of different lysis protocols. This demonstrates that a single
extraction always gives a biased picture of the community composition. Combining multiple
extraction protocols has been shown to enhance the detected diversity of recovered species by
more than 80 % (Delmont et al. 2011a) in soil samples. However, the relative abundances
obtained with the various approaches differed, making this approach very important
[Fig. 2 plot: DNA concentration (ng/uL) measured with NanoDrop and PicoGreen; panel a, by laboratory (A to E); panel b, by investigator (A1 to E2)]
Extraction Methods, Variability Encountered in, Fig. 2 DNA concentrations (means ± 1 standard
deviation) as analyzed with NanoDrop or PicoGreen, showing the comparisons between
laboratories (a) and between investigators in the various laboratories (b). Different letters
in panel A indicate significant differences between countries (P < 0.05, unequal honestly
significant difference test). In panel B, the asterisk indicates a significant difference
between investigators within one laboratory (as assessed using Student's t test; P < 0.01)
(From Pan et al. 2010 with permission)
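
The effect of the quantitation method on PCR input can be made concrete with a short
calculation. The sketch below is only an illustration under stated assumptions: all numbers
are invented, the dye-based reading is treated as the true concentration, the function names
are hypothetical, and the inhibitor is expressed in arbitrary units.

```python
def dilution_factor(measured_conc, target_conc):
    """Fold dilution an analyst would choose from the measured concentration."""
    if measured_conc <= 0 or target_conc <= 0:
        raise ValueError("concentrations must be positive")
    return measured_conc / target_conc


def after_dilution(true_conc, factor):
    """Actual concentration of DNA (or of a co-extracted inhibitor) after dilution."""
    return true_conc / factor


if __name__ == "__main__":
    true_dna = 100.0   # ng/ul, assumed equal to the dye-based reading
    uv_reading = 200.0  # ng/ul, assumed overestimate from UV absorbance
    inhibitor = 1.0     # arbitrary units of co-extracted contaminant
    target = 10.0       # ng/ul intended in the PCR template

    for label, reading in (("dye-based", true_dna), ("UV absorbance", uv_reading)):
        f = dilution_factor(reading, target)
        print(
            f"{label}: dilute {f:.0f}x, true DNA {after_dilution(true_dna, f):.1f} ng/ul, "
            f"inhibitor {after_dilution(inhibitor, f):.3f} units"
        )
```

With these invented numbers, the UV-based dilution halves both the true template and the
carried-over inhibitor relative to the dye-based dilution, which is the mechanism the text
proposes for the diverging community profiles.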
Extraction Methods, Variability Encountered in, Fig. 3 Nonmetric multidimensional scaling plot
using log-transformed Bray-Curtis dissimilarity matrices based on signal intensity values of
the pmoA microarray analyses on DNA extracted in five different laboratories. Distances
between symbols represent relative dissimilarity between MOB communities. Analyses of
similarity (ANOSIM) resulted in a significant difference between MOB community structures
analyzed in the different laboratories. Only samples from laboratory A and B did not differ
from each other (n = 8, except for laboratory E [n = 6]) (From Pan et al. 2010 with permission)
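
The dissimilarity measure named in the caption can be sketched in a few lines. The example
below is a minimal illustration and not the analysis pipeline of Pan et al. (2010): the
log(x + 1) transform, the probe intensities, and the function names are assumptions made for
the purpose of the example.

```python
import math


def log_transform(intensities):
    """Apply log(x + 1) to non-negative signal intensities (assumed transform)."""
    return [math.log1p(v) for v in intensities]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same number of probes")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def dissimilarity_matrix(profiles):
    """Pairwise Bray-Curtis matrix for a list of log-transformed profiles."""
    transformed = [log_transform(p) for p in profiles]
    return [[bray_curtis(a, b) for b in transformed] for a in transformed]


if __name__ == "__main__":
    # Hypothetical probe intensities for three extracts
    samples = [
        [120.0, 30.0, 0.0, 5.0],
        [110.0, 35.0, 2.0, 4.0],
        [10.0, 200.0, 50.0, 0.0],
    ]
    for row in dissimilarity_matrix(samples):
        print(["%.3f" % d for d in row])
```

A matrix of this kind is the usual input for ordination methods such as nonmetric
multidimensional scaling and for permutation tests such as ANOSIM.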
for complete diversity assessment but not for comparisons between different samples or
environments. The first attempt for standardization between samples and environments has been
established for soils where an ISO-certified extraction protocol was tested on various soils
and by a number of different laboratories (Petric et al. 2011). The protocol was only
standardized up to what is believed to be the step (step B, Fig. 1) causing most variation.
Thirteen different laboratories tested a number of soil types. There was variation in DNA
quantity and quality and
Extraction Methods, Variability Encountered in, Fig. 4 Nonmetric multidimensional scaling plot
using log-transformed Bray-Curtis dissimilarity matrices based on signal intensity values of
pmoA microarray analyses, performed on the basis of the NanoDrop or PicoGreen DNA quantitation
method. Distances between symbols represent relative dissimilarity between MOB communities.
Analyses of similarity (ANOSIM) resulted in a significant difference between MOB community
structures when based on different DNA concentration measurements (n = 8) (From Pan et al. 2010
with permission)
also in community fingerprinting but acceptable as compared to commonly observed variation.
Although the soils did not vary much in their complexity and only one fingerprinting method
was used, this standard protocol is a very important step toward comparability of samples. At
least for the intensively studied soil habitat, comparisons may be possible, and similar
standardizations for related habitats may be a way to go.

Conclusions

It is obvious that no single DNA extraction protocol will be bias-free and that applying a
single protocol to a sample will never yield a "true" picture of microbial community
composition. The inherent differences in the properties of environmental matrices prevent
this. However, important improvements have been made, leading to the recommendation to
perform multiple extractions on the same matrix and to use multiple protocols with varying
stringency of lysis to maximize diversity assessments of single samples. When different
samples have to be compared in time or between treatments or habitats, it is best to perform
the extractions in the same laboratory, by the same person, using identical chemicals and
machinery, especially the bead-beating apparatus. Of course the latter may not always be
feasible, and an extraction robot may be very useful in order to reduce variation caused by
pipet handling (e.g., Maxwell-16 system from Promega). However, in order to come to real
ecological comparisons of microbial communities, new methods of standardization have to be
developed. Internal standardization by spiking samples with a known amount of cells may be an
option. Most important, however, will be to assess for every sample matrix what the extent of
the bias is and to take that into account in the interpretation.

Summary

Microbial communities are the drivers of all ecosystems on Earth but are also the least
understood branch on the tree of life. The advent of molecular biological techniques assessing
environmental nucleic acids has revolutionized the amount of information on environmental
microbial communities. However, in the era of metagenomics and high-throughput sequencing, the
critical step in microbial community analyses is still the
extraction of DNA from environmental samples. DNA is extracted by liberation from cells
followed by extraction from the matrix using organic solvents and recovered by precipitation
with alcohols. The lysis of cells and the removal of contaminants that degrade or adhere to
the DNA call for many different approaches varying in effectiveness and leading to substantial
bias in downstream genomic or metagenomic applications. Next to this, variation can also be
introduced by investigator skills. Improvements have been made for increasing the observed
diversity in one single sample, and for soils, an ISO-certified extraction protocol has been
established facilitating ecological comparisons for this habitat. For true ecological
comparisons, new ways of standardization have to be developed.

References

De Bruin FJ, editor. Handbook of molecular microbial ecology II: metagenomics in different habitats. Hoboken: Wiley; 2011.
Delmont TO, Robe P, Cecillon S, Clark IM, Constancias F, Simonet P, et al. Accessing the soil metagenome for studies of microbial diversity. Appl Environ Microbiol. 2011a;77(4):1315–24.
Delmont TO, Robe P, Clark I, Simonet P, Vogel TM. Metagenomic comparison of direct and indirect soil DNA extraction approaches. J Microbiol Methods. 2011b;86(3):397–400.
Engel K, Pinnell L, Cheng J, Charles TC, Neufeld JD. Nonlinear electrophoresis for purification of soil DNA for metagenomics. J Microbiol Methods. 2012;88(1):35–40.
Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth's biogeochemical cycles. Science. 2008;320(5879):1034–9.
Feinstein LM, Sul WJ, Blackwood CB. Assessment of bias associated with incomplete extraction of microbial DNA from soil. Appl Environ Microbiol. 2009;75(16):5428–33.
Herrera A, Cockell CS. Exploring microbial diversity in volcanic environments: a review of methods in DNA extraction. J Microbiol Methods. 2007;70(1):1–12.
Kowalchuk GA, De Bruin FJ, Head IM, Akkermans AD, Van Elsas JD, editor. Molecular microbial ecology manual, 2nd ed. Dordrecht, The Netherlands: Kluwer Academic Publishers; 2007.
Lauber CL, Zhou N, Gordon JI, Knight R, Fierer N. Effect of storage conditions on the assessment of bacterial community structure in soil and human-associated samples. Fems Microbiol Lett. 2010;307(1):80–6.
Liu W-T, Jansson JK, editors. Environmental molecular microbiology. Norfolk: Caister Academic press; 2010.
Lombard N, Prestat E, van Elsas JD, Simonet P. Soil-specific limitations for access and analysis of soil microbial communities by metagenomics. Fems Microbiol Ecol. 2011;78(1):31–49.
Pan Y, Bodrossy L, Frenzel P, Hestnes AG, Krause S, Luke C, et al. Impacts of inter- and intralaboratory variations on the reproducibility of microbial community analyses. Appl Environ Microbiol. 2010;76(22):7451–8.
Petric I, Philippot L, Abbate C, Bispo A, Chesnot T, Hallin S, et al. Inter-laboratory evaluation of the ISO standard 11063 "Soil quality – method to directly extract DNA from soil samples". J Microbiol Methods. 2011;84(3):454–60.
Techer D, Martinez-Chois C, D'Innocenzo M, Laval-Gilly P, Bennasroune A, Foucaud L, et al. Novel perspectives to purify genomic DNA from high humic acid content and contaminated soils. Sep Purif Technol. 2010;75(1):81–6.
Williamson KE, Kan J, Polson SW, Williamson SJ. Optimizing the indirect extraction of prokaryotic DNA from soils. Soil Biol Biochem. 2011;43(4):736–48.

Extradiol Dioxygenases Retrieved from the Metagenome

Kentaro Miyazaki1,2 and Hikaru Suenaga2
1 Department of Medical Genome Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Sapporo, Japan
2 Bioproduction Research Institute, National Institute of Advanced Industrial Science and Technology, Sapporo, Japan

Synonyms

Extradiol Dioxygenases

Definition

Extradiol dioxygenases (EDOs) are mononuclear metalloenzymes that cleave the meta-position of
the C–C bond of catecholic compounds, yielding yellow-pigmented open-ring products (Fig. 1).
Extradiol Dioxygenases Retrieved from the Metagenome, Fig. 1 Meta-cleavage of catecholic
compounds by EDOs. Cleavage at the indicated position of the catecholic compound (R: -H, -Cl,
-CH3, -Ph, etc.) yields the yellow-colored meta-cleavage product
Introduction

Both naturally existing and synthetic aromatic hydrocarbons (e.g., petroleum products and
chemical wastes of agricultural and industrial origin) are common contaminants of the
environment (US Environmental Protection Agency; http://www.epa.gov). Microorganisms,
particularly bacteria, play crucial roles in the biodegradation of these compounds and
contribute to various biochemical cycles (Abraham et al. 2002; Chakraborty and Coates 2004;
Furukawa et al. 2004). Extensive efforts have been directed at surveying and analyzing the
pathways and genes responsible for the degradation of aromatic compounds with the aim of
reviving polluted environments by using these microorganisms (i.e., bioremediation) (Top and
Springael 2003; Janssen et al. 2005; de Lorenzo 2008). These studies have shown that the
initial conversion step in the degradation of aromatic compounds is catalyzed by various types
of enzymes, depending on the aromatic compound substrate, pathway, or the organism. Despite
the variation, however, aromatic compound substrates are converted to a limited number of
central intermediates, most commonly the catecholic compounds (Fritsche and Hofrichter 2005).
The subsequent cleavage of the aromatic rings of catechol derivatives is catalyzed by
extradiol dioxygenases (EDOs); these reactions are considered crucial in the biodegradation of
aromatic compounds (Lipscomb 2008). EDOs have thus served as functional markers in the
assessment of the biodegradation potential of specific bacterial communities (Vilchez-Vargas
et al. 2010).
Most of our knowledge on EDOs has been obtained from activities involving microbial screening.
Based on the observation of bacterial colonies that develop yellow pigments attributable to
the ring-cleavage products of catecholic substrates, bacteria expressing EDOs have been
isolated and studied in detail over the past three decades. However, information on the
degradation pathways, enzymes, and genes that are harbored by "uncultured" bacteria remains
unknown. Screening of those genes using a metagenomic approach should thus shed light on the
diversity, evolution, and biochemical properties of novel pathways, enzymes, and genes.

Enzymatic Classification of EDO Family

EDOs can be classified into at least three evolutionarily distinct families (Vilchez-Vargas
et al. 2010): type I belongs to the vicinal oxygen chelate superfamily, type II includes
enzymes consisting of different subunits, and type III belongs to the cupin superfamily. Type I
is considered a major family and is further divided into subfamilies (e.g., I.2.A) depending
on the amino acid sequences of the enzymes (Fig. 2). Enzymes belonging to the same subfamily
are defined as those with >54 % sequence identity (Eltis and Bolin 1996). Type I enzymes are
roughly classified into two groups: those that act on monocyclic aromatics (subfamily I.2) and
those that act on bicyclic aromatics (subfamily I.3). Despite differences in substrate
specificities, these enzymes share common mechanisms of reaction, occurring at similar
catalytic centers that contain a Fe(II) ion in the
[Fig. 2 image not reproduced: phylogenetic tree of type I EDOs, with branches labeled by subfamily (I.1.A to I.6) and by sequence or clone name; scale bar 0.1]
Extradiol Dioxygenases Retrieved from the Metagenome, Fig. 2 A phylogenetic tree showing both
metagenomic EDOs and previously identified type I EDOs. The metagenomic clones identified from
the activated sludge of wastewater from a coke plant (Suenaga et al. 2007) are shown in red.
The accession numbers of the previously identified EDOs are as follows: BPHC BACPJ, Q8GR45;
Q59770 RHORH, Q59770; PHEB BACST, P31003; Q89NL9 BRAJA, Q89NL9; Q59693 PSEPU, Q59693;
Q9ZAN5 9BURK, Q9ZAN5; Q52264 PSEPU, Q52264; Q52444 9SPHN, Q52444; Q45459 SPHYA, Q45459;
XYLE2 PSEPU, Q04285; Q59708 PSEPU, Q59708; DMPB PSEUF, P17262; Q59709 PSEPU, Q59709;
NAHH PSEPU, P08127; Q59720 PSESP, Q59720; Q7M0R7 ALCXX, Q7M0R7; XYLE1 PSEPU, P06622;
XYLE PSEAE, P27887; Q83U22 9PSED, Q83U22; Q6N3D3 RHOPA, CGA009; Q44048 ARTGO, Q44048;
Q50912 9SPHN, Q50912; MPC2 RALEU, P17296; Q6W1M5 RHISN, Q6W1M5; Q9KWQ8 RHOSR, Q9KWQ8;
Q8L185 9NOCA, Q8L185; BPHC2 RHOGO, P47232; DBFB PSEPA, P47243; CATA RHORH, Q53034;
BPHC PSEPA, P11122; P72325 RHOSO, P72325; Q52533 PSESP, Q52533; Q8VV92 9MICO, Q8VV92;
O69355 RHOER, O69355; Q762H4 RHORH, Q762H4; O69362 RHOER, O69362; Q762I0 RHORH, Q762I0;
O69359 RHOER, O69359; Q9KWQ5 RHOSR, Q9KWQ5; Q9LC87 9ACTO, Q9LC87; Q762F5 RHORH, Q762F5;
O69358 RHOER, O69358; TODE PSEPU, P13453; BPHC1 RHOGO, P47231; Q53126 RHOSR, Q53126;
Q51749 PSEFL, Q51749; BPHC PSES1, P17297; Q84EP0 9BURK, Q84EP0; Q52032 PSEPU, Q52032;
BPHC PSEPS, P08695; BPHC BURCE, P47228 (This figure was drawn using the FigTree software
(http://tree.bio.ed.ac.uk/software/figtree/))
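
The subfamily criterion cited in the text (>54 % sequence identity; Eltis and Bolin 1996) can
be illustrated with a small sketch. It is not the procedure used to build the tree in Fig. 2:
the sequences are invented, the inputs are assumed to be already aligned with "-" marking
gaps, and counting identity over gap-free columns only is one of several possible conventions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def same_subfamily(aligned_a, aligned_b, cutoff=54.0):
    """Apply the >54 % identity rule to one pair of aligned sequences."""
    return percent_identity(aligned_a, aligned_b) > cutoff


if __name__ == "__main__":
    # Invented aligned fragments, for illustration only
    query = "MKGVLHLGHV-VTDLERS"
    reference = "MKGILHLGHVAVNDLEKS"
    pid = percent_identity(query, reference)
    print(f"identity: {pid:.1f} %, same subfamily: {same_subfamily(query, reference)}")
```

In practice the identity would be computed from a full-length pairwise or multiple alignment,
and a gene that falls below the cutoff against every known subfamily is a candidate for a new
one.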
active site and are coordinated by the so-called 2-His-1-carboxylate facial triad motif
(Lipscomb 2008).

EDOs Retrieved from the Metagenome

At the time of writing of this report (March 2013), 42,295 "extradiol dioxygenase" sequences
have been deposited in the Protein database of NCBI (www.ncbi.nlm.nih.gov/protein), 1,076 of
which are derived from "uncultured bacteria." Of the 1,076 sequences, however, only a few
contain complete EDO sequences (Vilchez-Vargas et al. 2010; Suenaga 2012).
Based on the yellow coloration of catechol ring-cleavage products, 235 positive clones were
identified from a fosmid library constructed from environmental DNA extracted from
petrol-contaminated soil (Brennerova et al. 2009). PCR-based classification of the internal
sequences of the metagenomic EDO genes showed that only one-fourth of the observed EDOs belong
to subfamily I.3.A or I.3.B, which would be expected to predominate given the knowledge
obtained from isolated bacteria. Genes of subfamily I.2.A, which have frequently been used as
DNA markers for assessing the catabolic potential of polluted sites, were also absent
(Vilchez-Vargas et al. 2010). Functional analysis of representative proteins indicated that
one clone, s45, has exceptionally high affinity for different catecholic substrates.
Coke plant wastewater contains various aromatic compounds, and the activated sludge used for
its decontamination may serve as a rich resource for EDO discovery. Suenaga et al. (2007)
created a metagenomic fosmid library from the activated sludge, and functional screening
identified 91 EDO-positive clones. Based on their substrate specificity for various catecholic
compounds, 38 clones were subjected to shotgun DNA sequencing. Some clones contained two EDO
genes, and as a result, a total of 43 EDO genes were identified. Approximately half of these
were classified into known subfamilies, but surprisingly, 23 genes could not be classified
into existing subfamilies, and therefore, four new subfamilies, namely, I.1.C, I.2.G, I.3.M,
and I.3.N (Fig. 2), were proposed. Among these novel EDOs, the I.2.G subfamily genes were
overrepresented among the retrieved metagenomic EDOs and branched at a deep point in the
lineage. Enzymatic characterization demonstrated that the I.2.G EDOs have unique properties,
including Mn(II) dependence instead of the more common Fe(II) dependence, as well as the
highest affinity for catechol reported thus far, and tolerance for thermal and chemical
inhibitors (NaCN and H2O2) (Suenaga et al. 2009).

EDO Application for Bioremediation

Each polluted site harbors contaminants that carry environment-specific EDO genes. Monitoring
these "marker" EDO genes using the metagenomic approach may be a good method for evaluating
the bioremediation process (Widada et al. 2002). Furthermore, retrieving novel EDOs, as well
as engineering these for higher activity and stability, can enhance the development of
bioremediation processes.

Summary

Metagenomic approaches are an effective means of discovering novel enzymes including EDOs,
which present specific sequences and enzymatic properties based on their substrate preference,
metal dependence, inhibitor tolerance, and various physicochemical properties. Research
targeting different environments may help in furthering the knowledge about the diversity of
EDOs.

Cross-References

▶ Metagenomics Potential for Bioremediation
References

Abraham WR, Nogales B, Golyshin PN, et al. Polychlorinated biphenyl-degrading microbial communities in soils and sediments. Curr Opin Microbiol. 2002;5:246–53.
Brennerova MV, Josefiova J, Brenner V, et al. Metagenomics reveals diversity and abundance of meta-cleavage pathways in microbial communities from soil highly contaminated with jet fuel under air-sparging bioremediation. Environ Microbiol. 2009;11:2216–27.
Chakraborty R, Coates JD. Anaerobic degradation of monoaromatic hydrocarbons. Appl Microbiol Biotechnol. 2004;64:437–46.
Eltis LD, Bolin JT. Evolutionary relationships among extradiol dioxygenases. J Bacteriol. 1996;178:5930–7.
Fritsche W, Hofrichter M. Aerobic degradation by microorganisms. In: Rehm H-J, Reed G, editors. Biotechnology: environmental processes II, vol. 11b. 2nd ed. Weinheim: Wiley-VCH Verlag GmbH; 2008.
Furukawa K, Suenaga H, Goto M. Biphenyl dioxygenases: functional versatilities and directed evolution. J Bacteriol. 2004;186:5189–96.
Janssen DB, Dinkla IJT, Poelarends GJ, et al. Bacterial degradation of xenobiotic compounds: evolution and distribution of novel enzyme activities. Environ Microbiol. 2005;7:1868–82.
Lipscomb JD. Mechanism of extradiol aromatic ring-cleaving dioxygenases. Curr Opin Struct Biol. 2008;18:644–9.
De Lorenzo V. Systems biology approaches to bioremediation. Curr Opin Biotechnol. 2008;19:579–89.
Pieper DH, Seeger M. Bacterial metabolism of polychlorinated biphenyls. J Mol Microbiol Biotechnol. 2008;15:121–38.
Suenaga H, Ohnuki T, Miyazaki K. Functional screening of a metagenomic library for genes involved in microbial degradation of aromatic compounds. Environ Microbiol. 2007;9:2289–97.
Suenaga H, Mizuta S, Miyazaki K. The molecular basis for adaptive evolution in novel extradiol dioxygenases retrieved from the metagenome. FEMS Microbiol Ecol. 2009;69:472–80.
Suenaga H. Targeted metagenomics: a high-resolution metagenomics approach for specific gene clusters in complex microbial communities. Environ Microbiol. 2012;14:13–22.
Top EM, Springael D. The role of mobile genetic elements in bacterial adaptation to xenobiotic organic compounds. Curr Opin Biotechnol. 2003;14:262–9.
Vilchez-Vargas R, Junca H, Pieper DH. Metabolic networks, microbial ecology and "omics" technologies: towards understanding in situ biodegradation processes. Environ Microbiol. 2010;12:3089–104.
Widada J, Nojiri H, Omori T. Recent developments in molecular techniques for identification and monitoring of xenobiotic-degrading bacteria and their catabolic genes in bioremediation. Appl Microbiol Biotechnol. 2002;60:45–59.
F
Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences

Weizhong Li
J. Craig Venter Institute, La Jolla, CA, USA

Synonyms

CD-HIT is a fast program for clustering large amounts of protein and nucleotide sequences

Definition

Sequence clustering is a process to group sequences into groups (clusters) such that similar
sequences are clustered together and can be potentially represented by a single representative
sequence. CD-HIT uses a greedy incremental clustering algorithm enhanced by an efficient
word-filtering heuristic and an effective parallelization technique to cluster big sequence
datasets efficiently.

Introduction

Since the development of high-throughput sequencing technologies, the amount of available
biological sequences has increased dramatically and continues to increase rapidly. Efficient
handling and effective analysis of such massive amounts of data have become one of the major
issues and challenges in much sequencing-based research. Such challenges are typically
dominated by two factors: huge data size and high sequence redundancy. Sequence clustering is
a key technique that can address these two issues at once, by clustering the sequences and
reducing them to a smaller subset of representative sequences.
Sequence clustering is a technique to group sequences into groups (clusters), such that
similar sequences are clustered together and can be potentially represented by a single
representative sequence. The similarity between two sequences is normally defined based on an
optimal alignment between them. Such an optimal alignment is usually found by dynamic
programming techniques, which are computationally expensive. Traditional clustering algorithms
that require many pairwise sequence comparisons are impractical for clustering very large
sequence datasets. Reducing the number of sequence comparisons is the key to efficient
sequence clustering that can cope with the massive amount of sequencing data.
Greedy incremental clustering has been employed in sequence clustering to reduce the number of
sequence comparisons since the implementation of a tool by Holm and Sander (1998) to create
nrdb90 for protein sequences, with a decapeptide filter to further reduce the number of
comparisons. To overcome some limitations of that tool and further improve the clustering
efficiency, CD-HIT was developed to use
the same greedy incremental algorithm, but with a much more efficient filtering heuristic (Li et al. 2001, 2002). CD-HIT was then extended to support clustering of nucleotide sequences (Li and Godzik 2006) and became one of the most widely used programs for sequence clustering owing to its efficiency in handling large datasets.
The rapidly increasing amount of sequence data demands even more efficient clustering programs and has led to the development of an enhanced version of CD-HIT (Fu et al. 2012), which has been reengineered to support clustering of very large sequence datasets. In this new CD-HIT, a parallelization technique was developed to safely and efficiently parallelize the greedy incremental clustering algorithm. This parallel CD-HIT can achieve very good speedup (quasilinear speedup for up to eight cores) on multicore computers for sequence clustering.
CD-HIT and its derived programs such as CD-HIT-454, CD-HIT-DUP, CD-HIT-LAP, and CD-HIT-OTU have extensive applications in the metagenomics field. A summary of these applications is available from a recent review paper (Li et al. 2012).

Methods

CD-HIT uses a greedy incremental clustering algorithm with filtering heuristics based on shared word counting for efficient clustering. It is further enhanced by an effective parallelization technique that can achieve very good speedup on multicore computers.

Greedy Incremental Clustering

Greedy incremental clustering essentially works in the following way. Given a list of DNA or protein sequences, sort them from long to short, and take the first sequence as a cluster representative sequence. Then, for each of the remaining sequences (the query sequence), check whether it is similar to any of the existing representative (reference) sequences. If it is, mark the sequence as redundant; otherwise, add it to the representative sequence list.

Filtering Based on Shared Words

Checking a query sequence against each of the representative sequences is very inefficient, because such checking involves sequence comparison based on sequence alignment using dynamic programming, which is computationally expensive. To reduce such comparisons, a word (k-mer or q-gram) indexing table can be used to filter out unnecessary comparisons based on the number of words shared between the query sequence and each of the representative sequences.
The idea is that, for two sequences to have identity above an identity cutoff, they must share a minimum number of common words given the sequence lengths. It is easy to see that, given two sequences with an alignment length L and an identity cutoff C, the maximum number of mismatches and gaps that are allowed between two aligned sequences is E = L(1 − C), so the minimum number of shared words of length W should be L + 1 − (E + 1)*W. This is also the minimum number of shared words between a query sequence of length L and any longer reference sequence. In CD-HIT, this threshold is adjusted according to the presence of unknown letters such as "N" and "X" and to the command line options.
To speed up the counting of shared words, an indexing table is built for the representative sequences. For each word, it records the indices of the representative sequences that contain the word and the number of times the word occurs in each. This allows efficient counting of shared words between a query and each of the representative sequences.

Banded Alignment and Sequence Identity

In CD-HIT, sequence identity is computed based on an optimal alignment between two sequences. To reduce the computational time of dynamic programming, CD-HIT uses heuristics based on short words (shorter than the words for filtering) to estimate an optimal band and does banded alignment. Sequence identity is then calculated as the percentage of matched bases among the aligned bases within the whole or best alignment region.
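The shared-word filter described above can be sketched in a few lines of Python. This is a minimal illustration, not CD-HIT's implementation: the function names are hypothetical, E is rounded down to an integer, repeated words are counted with multiplicity (an assumption), and the adjustments for unknown letters and command line options are left out.

```python
import math
from collections import Counter, defaultdict


def min_shared_words(length, identity_cutoff, word_size):
    """Lower bound L + 1 - (E + 1) * W on shared words, with E = L * (1 - C)."""
    # The small epsilon guards against floating-point error, e.g. 100 * (1 - 0.9).
    max_errors = math.floor(length * (1.0 - identity_cutoff) + 1e-9)
    return max(0, length + 1 - (max_errors + 1) * word_size)


def word_counts(seq, word_size):
    """Count every overlapping word (k-mer) of the given size in seq."""
    return Counter(seq[i:i + word_size] for i in range(len(seq) - word_size + 1))


def build_index(representatives, word_size):
    """Map each word to {representative index: occurrences in that representative}."""
    index = defaultdict(dict)
    for rep_id, seq in enumerate(representatives):
        for word, n in word_counts(seq, word_size).items():
            index[word][rep_id] = n
    return index


def shared_word_counts(query, index, word_size):
    """Number of words the query shares with each indexed representative."""
    shared = Counter()
    for word, n_query in word_counts(query, word_size).items():
        for rep_id, n_rep in index.get(word, {}).items():
            shared[rep_id] += min(n_query, n_rep)  # assumption: count with multiplicity
    return shared


def candidates(query, index, n_reps, identity_cutoff, word_size):
    """Representatives that pass the filter and would still need an alignment."""
    needed = min_shared_words(len(query), identity_cutoff, word_size)
    shared = shared_word_counts(query, index, word_size)
    return [rep_id for rep_id in range(n_reps) if shared[rep_id] >= needed]


reps = ["ACGTACGTAC", "TTTTTTTTTT"]
index = build_index(reps, word_size=3)
print(candidates("ACGTACGTA", index, len(reps), identity_cutoff=0.9, word_size=3))
```

Only the representatives returned by `candidates` would go on to the expensive alignment step; all others are rejected by counting alone.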
CD-HIT Core Procedures: Checking and Clustering

In order to simplify CD-HIT and make an efficient implementation possible, the key steps of CD-HIT are abstracted into two core procedures: checking and clustering. The distinction between the checking procedure and the clustering procedure is also the key to an efficient parallelization.
Given a word indexing table, the checking procedure checks a query sequence against this table and its associated representatives, using the filtering heuristics and sequence comparison techniques described above. If the query is similar to one of the representatives, the query sequence is marked as redundant and is skipped in all future clustering steps.
The clustering procedure is identical to the checking procedure except that, if the query is not marked as redundant, it is added to the representative sequence list of the table, and the table is updated to index and incorporate the words of the new representative sequence.

The Sequential CD-HIT Algorithm

The sequential CD-HIT algorithm is formed by combining the greedy incremental clustering algorithm and the heuristics and techniques described above, with proper dividing of the input sequences. Basically, the steps are the following (Fig. 1):
1. Given a list of DNA or protein sequences (say S), sort them from long to short.
2. Take a sub-list of the longest sequences from S (and remove them from S) and do the clustering procedure on them starting from an empty word indexing table.
3. Using the word indexing table from step 2, do the checking procedure on the remaining sequences of S, and remove the sequences that are marked as redundant from S.
4. Repeat steps 2 and 3 until S becomes empty.

Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences, Fig. 1 Diagram for the sequential CD-HIT algorithm

The Parallel CD-HIT Algorithm

The parallel CD-HIT algorithm uses two word indexing tables to do sequence clustering. Since an efficient parallelization cannot be achieved within each single clustering cycle as described in the sequential algorithm section, the idea of the parallelization technique developed in the parallel CD-HIT is to properly interweave step 2 and step 3 between two consecutive clustering cycles, as follows (Fig. 2):
1. Given a list of DNA or protein sequences (say S), sort them from long to short.
2. Take a sub-list (say G) of the longest sequences from S (and remove them from S).
3. Use all threads to do the checking procedure concurrently on G using the word indexing table built by the clustering procedure from the previous cycle.
4. Use all-but-one threads to do the checking procedure on the sequences in S and, simultaneously, use the remaining thread to do the clustering procedure on the sequences in G starting from an empty word indexing table.
5. Repeat steps 2, 3, and 4 until S becomes empty.
Here, if the clustering procedure finishes processing G before the checking procedures finish processing S, the thread for the clustering procedure will switch to do the checking procedure on S as well. But if the checking procedures
finish before the clustering procedure, the clustering procedure will be terminated in order to start a new clustering cycle, and the unfinished sequences in G will be put back in S.

Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences, Fig. 2 Diagram for the parallel CD-HIT algorithm

In this parallel version of the algorithm, the
first and last clustering cycle will effectively use
a single thread to do the clustering procedure.
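As a rough illustration of the checking and clustering procedures and of how the sequential algorithm combines them, the following single-threaded sketch may help. It rests on stated assumptions: the class and function names are hypothetical, `naive_identity` is an ungapped stand-in for CD-HIT's banded alignment, the word filter only requires one shared word, and the batch size is arbitrary.

```python
def words(seq, w):
    """Set of overlapping words of length w in seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def naive_identity(query, rep):
    """Stand-in for banded alignment: ungapped, left-anchored match fraction."""
    matches = sum(a == b for a, b in zip(query, rep))
    return matches / len(query)


class WordTable:
    """Word indexing table plus its list of representative sequences."""

    def __init__(self, w):
        self.w = w
        self.reps = []
        self.index = {}  # word -> set of representative indices

    def add(self, seq):
        rep_id = len(self.reps)
        self.reps.append(seq)
        for word in words(seq, self.w):
            self.index.setdefault(word, set()).add(rep_id)


def check(query, table, cutoff, min_shared=1):
    """Checking procedure: True if the query is redundant with respect to the table."""
    shared = {}
    for word in words(query, table.w):
        for rep_id in table.index.get(word, ()):
            shared[rep_id] = shared.get(rep_id, 0) + 1
    return any(
        n >= min_shared and naive_identity(query, table.reps[rep_id]) >= cutoff
        for rep_id, n in shared.items()
    )


def cluster(query, table, cutoff):
    """Clustering procedure: like check, but a non-redundant query joins the table."""
    if check(query, table, cutoff):
        return True
    table.add(query)
    return False


def sequential_cd_hit(seqs, cutoff=0.9, w=3, batch=2):
    remaining = sorted(seqs, key=len, reverse=True)              # step 1
    representatives = []
    while remaining:                                             # step 4
        head, remaining = remaining[:batch], remaining[batch:]   # step 2
        table = WordTable(w)                                     # empty table each cycle
        for seq in head:
            cluster(seq, table, cutoff)
        representatives.extend(table.reps)
        remaining = [s for s in remaining if not check(s, table, cutoff)]  # step 3
    return representatives


seqs = ["ACGTACGTAC", "TTTTGGGGCC", "ACGTACGTA", "TTTTGGGG", "GGCCAATT"]
print(sequential_cd_hit(seqs))
```

Because `check` never modifies the table, step 3 is the part that can be spread across threads, which is the property the parallel algorithm exploits.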
Efficiency of the Parallel CD-HIT

Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences, Fig. 3 Evaluation of CD-HIT parallelization: computational time speedup with respect to the number of used CPU cores

The described parallelization technique is very effective for CD-HIT on large datasets. The main reason is that, in the parallel CD-HIT, all threads are guaranteed to be active simultaneously and do effective computation, and only the first and the last clustering cycle cannot use multithreading. But for large sequence datasets, the time spent on single-threaded computation for the first and the last cycle is negligible. So in theory, the speedup should approach linear for large datasets.
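The claim that the speedup approaches linear once the single-threaded first and last cycles become negligible can be made concrete with Amdahl's law. The entry itself does not invoke this law; the serial fraction s and the numbers below are illustrative assumptions, not measurements.

```latex
% s = fraction of the run time that is single-threaded (first and last cycle)
% p = number of threads
S(p) = \frac{1}{s + \dfrac{1 - s}{p}}, \qquad \lim_{s \to 0} S(p) = p

% Hypothetical example with s = 0.01:
S(8) = \frac{1}{0.01 + 0.99/8} \approx 7.48, \qquad
S(16) = \frac{1}{0.01 + 0.99/16} \approx 13.9
```

The larger the dataset, the smaller s becomes relative to the total work, so S(p) moves closer to p.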
Figure 3 shows a benchmarking result on two protein sequence datasets, Swissprot (437,168 sequences) and NR (12,954,819 sequences), and two nucleotide sequence datasets, Twin Study (8,294,694 sequences) and Human Gut (23,285,083 sequences). This test was done on a Debian Linux server with four 12-core AMD Opteron 6172 processors. As demonstrated, the parallel CD-HIT can achieve quasilinear speedup for up to eight cores, with good speedup for up to 16 cores.

Summary

CD-HIT is a very fast sequence clustering program that can cluster very big sequence datasets efficiently. The parallelized version of CD-HIT can further speed up the clustering process by using multiple CPU cores. With high-throughput sequencing technologies becoming more and more widely used, CD-HIT could play an essential role in facilitating the analysis of the massive amount of sequencing data.

References
Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics. 1998;14:423–9.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics. 2001;17:282–3.
Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18:77–82.
Li W, Fu L, Niu B, Wu S, Wooley J. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform. 2012;13:656–68.

Fosmid System

Francisco Rodriguez-Valera
Microbiologia, Universidad Miguel Hernandez, Campus San Juan, San Juan, Alicante, Spain

Synonyms

BAC; Cosmids; Large insert vectors

Definition

The original molecular cloning vectors were plasmids such as pBR322. They were meant to clone single genes and were based on multicopy plasmids that have low-stringency control of the copy number. Later on, with the development of genomics, larger insert vectors were required for the assembly of repeated regions and, in general, to pair-end the individual shotgun reads. Bacterial artificial chromosomes (BACs) were developed based on the large single-copy plasmids of the F group (Shizuya et al. 1992). These can be propagated in Escherichia coli with inserts larger than 300 Kbp. BACs were used by Beja and coworkers in one of the first and most influential papers of the early development of metagenomics, in which the existence of an energy-generating rhodopsin was found in a proteobacterial BAC clone (Beja et al. 2000). However, BACs are laborious to generate and do not work well with the limited amount of DNA that is normally available for metagenomics.
In the meantime, fosmids (F-based cosmids) were developed. Basically, they contain the replication origin of the E. coli F plasmid and can be packaged in a lambda capsid to be transfected rather than transformed. Based loosely on the cosmid vector but adding the F origin of replication, fosmids combine the advantages of BAC vectors (stability and single-copy maintenance) and the ease of transfection using a cosmid-based vector (Kim et al. 1992). Cosmids have cosN of phage lambda on the vector and use a phage terminase to generate cohesive ends at the cosN. This way, a fosmid insert of 40-kb average size can be cloned very efficiently after packaging in a lambda phage capsid and infection as in conventional cosmid cloning. Extensive libraries of fosmid clones are readily constructed and offer increased insert stability. They can be propagated in standard E. coli cultures, and the clones isolated as colonies can be collected by an automated colony-picking robot. They can be stored as phage suspensions and transferred to the host very efficiently. Also, the insert size is very even and can be estimated to be between 30 and 40 Kbp in most cases. The E. coli F-factor single-copy origin of replication guarantees that there will be only one copy per genome during the cloning phase, avoiding problems with chimera formation during this critical step. However, the inducible high-copy oriV can be used to amplify to up to 50 copies per cell which, while maintaining the stability of the plasmid, increases the DNA yield and the possibilities of expression in E. coli. For specific protocols of fosmid cloning, see for example: http://www.epibio.com/item.asp?ID=385.

Fosmids in Metagenomics

Before the development of second-generation sequencing such as 454 pyrosequencing or Illumina, all metagenomic studies were dependent on cloning of environmental DNA to sequence by Sanger using the vector primers. Small insert vectors have been widespread for the ease of generating very large libraries and also because the insert that can be sequenced by
Sanger using vector primers is smaller than the size of most inserts of this size (Venter et al. 2004).
However, large insert vectors, and particularly fosmids, have been very popular among metagenomic workers (DeLong et al. 2006; Martin-Cuadrado et al. 2007). The main reason is that the insert in a fosmid is a sizeable natural contig that typically contains 30–40 genes. This size is very appropriate for annotation since bacterial and archaeal gene clusters are arranged functionally, i.e., genes with related function, such as different enzymes of a metabolic pathway, are located next to each other, often organized in operons. Therefore, function can be inferred with much more reliability from a large contig. A common approach taken for analysis of fosmid libraries is fosmid-end sequencing using the vector primers. This generates datasets that are similar to the short insert (also sometimes known as shotgun) libraries but with the important difference that the ends sequenced are separated by a much larger distance. Also, the fosmids that show promise of revealing some interesting activity, or corresponding to an interesting microbe, can be fully sequenced (Fig. 1), traditionally by Sanger dideoxy but now also by high-throughput approaches (Martin-Cuadrado et al. 2009).
Fosmids can also be screened by PCR to select those belonging to selected groups of microbes, largely by using 16S rRNA primers (Martin-Cuadrado et al. 2008). This way, the fosmids containing ribosomal operons can be identified and those containing the target rRNA gene fully sequenced. This approach is a bit tricky when the target group is bacteria because fosmid preparations are always contaminated with E. coli DNA, and PCR of the 16S rRNA gene always gives that amplicon. As an alternative methodology to select bacterial fosmids containing ribosomal operons, primers for 16–23S gene spacer or ITS
[Figure 1: diagram of screening methods. Labels: Screening; Sequencing insert ends; Spotting clones on membrane; Hybridisation with different probes (rDNA, protein genes, etc.); Multiplex PCR; BLAST; phylogenetic analysis; 16S rDNA, ITS, and 18S rDNA targets for Bacteria, Archaea, and Eukaryotes]

Fosmid System, Fig. 1 Methods for selecting fosmids for full sequencing. End sequences can provide clues as to the kind of genes present in the fosmid and allow for selecting those involved in interesting processes or microbes (Martin-Cuadrado et al. 2009). Alternatively, fosmid clones can be screened by PCR or hybridization to select those that contain taxonomically informative genes such as rRNAs. In the case of bacteria, a strategy to select those containing other rRNAs different from E. coli that is present in all the clones is shown. The amplicon includes the internal transcribed spacer (ITS), and the size of this hypervariable region shows the clones containing rRNA genes different from those of E. coli
were used. The amplicons were run in an agarose gel, and only those with a significantly different size from that of E. coli were selected (Quaiser et al. 2008).
With the advent of high-throughput sequencing (HTS), the applications of fosmids are still significant. First of all, they provide a way to assemble much larger contigs, the Achilles' heel of HTS. Ghai et al. (2010) sequenced 1,000 pooled fosmids by 454 pyrosequencing and compared the results with the direct 454 pyrosequencing of the same DNA before cloning. The results indicated a strong bias in the fosmid clones against some specific groups of microbes such as Candidatus Pelagibacter ubique and Prochlorococcus that happen to be the most abundant microbes in this environment. Besides, the GC distribution plot indicated that high GC of ca. 50 % was enriched versus the reads of the directly sequenced DNA (Fig. 2). The reasons for these biases are obscure, and a similar bias was found for environmental BAC libraries (Feingersch and Beja 2009). However, fosmid cloning was complementary to direct pyrosequencing, providing a way to access microbes that were relatively less abundant in the sample, such as marine Euryarchaea or cyanophages in the case of marine samples from the photic zone. It also provided much larger contigs (up to 44 Kbp and close to 200 contigs over 10 Kbp). The importance of long contigs for interpreting metagenomic datasets cannot be stressed enough since annotation of large clusters of genes is much more reliable (see above). For example, Ghai et al. assembled large fragments of the genomes of marine Euryarchaea of group II that later on were instrumental in assembling the complete genome of one of their members from a natural environment (Iverson et al. 2012).
A recent application described for fosmid vectors has been their use for metavirome studies. Metaviromes have a major problem when sequenced by HTS. Viral genes are even more difficult to annotate, and inferring information from their sequence is close to impossible unless large fragments of the viral genome are available. This problem has been solved by fosmid cloning in a pilot study carried out by Garcia-Heredia et al. (2012). These authors retrieved viral DNA from a natural extreme environment and could reconstruct complete to near-complete genomes of viruses that prey on microbes whose pure culture is very fastidious and hence not adequate for classical phage isolation in pure culture.

[Figure 2: two histogram panels, (a) and (b); x-axis GC% from 0 to 100, y-axis from 0 to 35,000]

Fosmid System, Fig. 2 Frequency distribution of GC% for the two metagenomic sequence datasets from the Mediterranean water column at the deep chlorophyll maximum (50 m deep). (a) All reads obtained in the DCM direct 454 pyrosequencing dataset. (b) All reads of the DCM fosmids dataset after removing the vector pCC1fos sequences. GC% of vector pCC1fos = 48 %. For details see Ghai et al. (2010)
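A GC% frequency distribution of the kind compared between the direct and fosmid datasets can be computed per read as sketched below. This is only an illustration under assumptions: the function names and the bin width are hypothetical, ambiguous bases such as N are ignored, and it says nothing about how the cited study produced its plot.

```python
def gc_percent(read):
    """GC content of a read in percent, ignoring bases other than A, C, G, T."""
    read = read.upper()
    acgt = sum(read.count(base) for base in "ACGT")
    if acgt == 0:
        return None  # nothing countable, e.g. a read made only of N
    return 100.0 * (read.count("G") + read.count("C")) / acgt


def gc_histogram(reads, bin_width=5):
    """Count reads per GC% bin; bin_width is assumed to divide 100 evenly."""
    counts = [0] * (100 // bin_width)
    for read in reads:
        gc = gc_percent(read)
        if gc is None:
            continue
        # A read with exactly 100 % GC is placed in the last bin.
        counts[min(int(gc // bin_width), len(counts) - 1)] += 1
    return counts


reads = ["ATGC", "GGCC", "AATT", "NNNN"]
print(gc_histogram(reads, bin_width=50))
```

Running the same function on the reads of each dataset and comparing the two lists bin by bin gives the kind of comparison summarized in the figure.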
Besides, the chances of screening for biological activity are better when using larger inserts, among other things because the complete metabolic pathway might be present, in case more than one gene is needed, and also because the genomic context facilitates expression (e.g., better chances of the required promoters and control machinery being present). Many recent examples have used fosmid clones for expression of activities such as enzymes (Selvin et al. 2012) or bioactive compounds (Riaz et al. 2008; Huang et al. 2009; Parsley et al. 2011).
The third generation of high-throughput single-molecule nucleic acid sequencing such as Nanopore or Helicos might generate long reads that, provided they are reliable enough, might make fosmid cloning and sequencing obsolete (Munroe and Harris 2010; Manrao et al. 2012).

Summary

Many authors have used fosmid vectors to describe metagenomes. They allow large libraries to be generated with a relatively small investment of time and money, and they can be used for multiple purposes. For example, fosmid-end sequencing provides data similar to shotgun libraries (in small insert vectors), and the clones can then be screened for sequences of interest for full fosmid sequencing. There are many examples of studies carried out that way. Fosmids can be screened by PCR for genes of interest such as 16S rRNA or others. Fosmids are also better vectors for expression screening by biological activity. The advent of high-throughput sequencing technologies provides new opportunities for sequencing and screening fosmids. However, long-read single-molecule sequencing might replace the need for fosmid cloning and render this metagenomic approach obsolete.

References
Beja O, Suzuki MT, et al. Construction and analysis of bacterial artificial chromosome libraries from a marine microbial assemblage. Environ Microbiol. 2000;2(5):516–29.
DeLong EF, Preston CM, et al. Community genomics among stratified microbial assemblages in the ocean's interior. Science. 2006;311(5760):496–503.
Feingersch R, Beja O. Bias in assessments of marine SAR11 biodiversity in environmental fosmid and BAC libraries? ISME J. 2009;3(10):1117–9.
Garcia-Heredia I, Martin-Cuadrado AB, et al. Reconstructing viral genomes from the environment using fosmid clones: the case of haloviruses. PLoS One. 2012;7(3):30.
Ghai R, Martin-Cuadrado A, et al. Metagenome of the Mediterranean deep chlorophyll maximum studied by direct and fosmid library 454 pyrosequencing. ISME J. 2010;9:1154–66.
Huang Y, Lai X, et al. Characterization of a deep-sea sediment metagenomic clone that produces water-soluble melanin in Escherichia coli. Mar Biotechnol. 2009;11(1):124–31.
Iverson V, Morris RM, et al. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science. 2012;335(6068):587–90.
Kim UJ, Shizuya H, et al. Stable propagation of cosmid sized human DNA inserts in an F factor based vector.
Manrao EA, Derrington IM, et al. Reading DNA at single-nucleotide resolution with a mutant MspA nanopore and phi29 DNA polymerase. Nat Biotechnol. 2012;30(4):349–53.
Martin-Cuadrado AB, Lopez-Garcia P, et al. Metagenomics of the deep Mediterranean, a warm bathypelagic habitat. PLoS One. 2007;2(9):e914.
Martin-Cuadrado AB, Rodriguez-Valera F, et al. Hindsight in the relative abundance, metabolic potential and genome dynamics of uncultivated marine archaea from comparative metagenomic analyses of bathypelagic plankton of different oceanic regions. ISME J. 2008;2(8):865–86.
Martin-Cuadrado AB, Ghai R, et al. CO dehydrogenase genes found in metagenomic fosmid clones from the deep Mediterranean sea. Appl Environ Microbiol. 2009;75(23):7436–44.
Munroe DJ, Harris TJ. Third-generation sequencing fireworks at Marco Island. Nat Biotechnol. 2010;28(5):426–8.
Parsley LC, Linneman J, et al. Polyketide synthase pathways identified from a metagenomic library are derived from soil Acidobacteria. FEMS Microbiol Ecol. 2011;78(1):176–87.
Quaiser A, Lopez-Garcia P, et al. Comparative analysis of genome fragments of Acidobacteria from deep Mediterranean plankton. Environ Microbiol. 2008;10(10):2704–17.
Riaz K, Elmerich C, et al. A metagenomic analysis of soil bacteria extends the diversity of quorum-quenching lactonases. Environ Microbiol. 2008;10(3):560–70.
Selvin J, Kennedy J, et al. Isolation identification and biochemical characterization of a novel halo-tolerant lipase from the metagenome of the marine sponge Haliclona simulans. Microb Cell Fact. 2012;11(1):72.
Shizuya H, Birren B, et al. Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci USA. 1992;89(18):8794–7.
Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304(5667):66–74.

FragGeneScan: Predicting Genes in Short and Error-Prone Reads

Yuzhen Ye
Indiana University, School of Informatics and Computing, Bloomington, IN, USA

Definition

Protein-coding genes are functional units in genomes that encode proteins. FragGeneScan is a hidden Markov model (HMM)-based predictor of incomplete and complete genes from short reads or complete genomes of prokaryotes.

Introduction

Identification of genes is one of the most important and challenging problems in whole microbial genome sequencing projects (Davidsen et al. 2001; Aziz et al. 2008; Stewart et al. 2009). In metagenomics, gene finding can provide the opportunity to elucidate the activities and interactions of genes within an environmental sample, from which the metabolic and signaling pathways specific to the environment can be reconstructed and identified (Turnbaugh et al. 2009; HMP consortium 2012). Most commonly, genes encoded by metagenomes have been identified by using homology-based methods such as BLASTX (Altschul et al. 1990; Meyer et al. 2008), an approach that is challenged by the large amount of sequencing data even with recent developments of faster tools including RAPSearch (Ye et al. 2011; Zhao et al. 2012). Homology searches against known protein databases, however, cannot be used to predict novel genes, although discovering new genes is one of the most important aspects of metagenomics research. Alternatively, sequence conservation information can be utilized for prediction of novel protein-coding genes (Krause et al. 2006; Yooseph et al. 2008); for example, a Ka/Ks value of ~1 for a group of similar sequences indicates that these sequences are under no selective pressure and hence unlikely to code for proteins. This way, novel families that have multiple members in a metagenomic dataset can be identified (Yooseph et al. 2008). The other straightforward solution to novel gene prediction in metagenomics is to use feature-based approaches such as probabilistic models to evaluate the probabilities of open reading frames (ORFs) being protein-coding regions (Noguchi et al. 2006, 2008; Hoff et al. 2009), in a manner similar to conventional gene-finding methods such as Glimmer and GeneMark (Lukashin and Borodovsky 1998; Salzberg et al. 1998; Delcher et al. 1999).
Short read length and sequencing errors are two major issues that pose significant challenges to gene prediction: incomplete genes (gene fragments) are difficult to predict, and sequencing errors may cause frameshifts that further complicate gene prediction. The average length of genes in microorganisms is about 950 bps (Noguchi et al. 2006), which is much longer than the sequencing reads generated by most NGS platforms (Morozova et al. 2009; Metzker 2010; Quail et al. 2012). Different NGS methods now produce sequencing reads of various lengths ranging from 100 bps (from Illumina sequencers) to thousands of base pairs (PacBio sequencing) and have different error profiles (Morozova et al. 2009). Sanger sequencers produce reads with an error rate of up to 1 %, whereas 454 sequencers produce reads with an error rate of up to 3 % (Richter et al. 2008; Hoff 2009). Illumina sequencing technology may produce reads that have high mismatch rates, especially when relatively long reads are acquired (e.g., G is mistaken as T, and in later cycles A, C, and G are mistaken as T) (Kircher et al. 2009). In 454 sequencing reads, sequencing errors tend to occur in the homopolymer regions, resulting in frequent insertions and deletions. Most of the sequencing errors in
PacBio reads are also indels (Carneiro et al. 2012). It has been shown that ORF-based gene prediction methods are more substantially affected by sequencing errors (indels) that cause frameshifts (Hoff 2009; Tang et al. 2013). As a consequence, programs that are currently available for gene prediction from short reads show a significant decrease in their performance as the sequencing error rate increases. For example, a low sensitivity of 26–43 % was observed with a sequencing error rate of 2.8 % (Hoff 2009).

FragGeneScan Algorithm

The core of FragGeneScan (Rho et al. 2010) is a hidden Markov model (HMM) (Rabiner 1989), which incorporates codon usage bias, sequencing error models, and start/stop codon patterns in a unified model. The FragGeneScan HMM consists of two-level representations based on data abstraction. FragGeneScan considers separate states representing the gene regions in the forward strand and the reverse strand of a nucleotide sequence, such that it can predict genes simultaneously from both strands. The model has seven superstates, representing gene regions, start codons, and stop codons in both the forward (three states) and backward strands (three states), and noncoding regions (one state), respectively. The states for gene regions consist of six consecutive sets of a match state, an insertion state, and a deletion state, which collectively correspond to a six-periodic inhomogeneous HMM. Each match state in the gene regions uses a second-order Markov chain to model the codon usage. The state for noncoding regions is based on a first-order Markov chain. FragGeneScan also incorporates the sequence patterns for each start codon (ATG, GTG, and TTG) and stop codon (TAA, TAG, and TGA) in the start and stop state, respectively.
The FragGeneScan HMM has a unique feature. By allowing transitions between the insertion/deletion states and the match states, this model effectively detects frameshifts that are caused by indel errors in sequencing. Considering that complete genomic sequences are unlikely to contain indel errors, the transition probabilities to insertion and deletion states are set to 0 when applying FragGeneScan to gene prediction in complete genomic sequences.
Given a short read (or a complete genome), the gene prediction problem is to find the best path of hidden states (see below) that is most likely to generate the observed nucleotide sequence, which can be solved by the Viterbi algorithm. FragGeneScan reports genes if they meet the following three conditions: (1) the length of the genes is longer than 60 bps, (2) the genes start in a start state (start codon) or in a match state (internal region of genes), and (3) the genes end in a stop state (stop codon) or in a match state (internal region of genes). Therefore, FragGeneScan can predict complete genes as well as partial (fragmented) genes without start and/or stop codons. Since the probability of gene regions and noncoding regions is calculated solely based on the composition of sequences (which is consistent regardless of the read length and gene length), FragGeneScan is more robust when input sequences are of different lengths.

Applications of FragGeneScan

FragGeneScan software is available as open source at http://omics.informatics.indiana.edu/FragGeneScan. It has been incorporated into several metagenomic analysis pipelines, including MG-RAST (http://press.igsb.anl.gov/mgrdev/under-the-hood/mg-rast-tools/fraggenescan/), IMG/M (Markowitz et al. 2012), WebMGA (Wu et al. 2011), and EBI metagenomics service (Wu et al. 2011).

Summary

Gene prediction in short reads (and assemblies) will remain a challenging problem, even with recent advances in the field (Tang et al. 2013). Proteins predicted from environmental sequences have already greatly expanded the universe of protein sequences. Not surprisingly, an increasingly large number of the proteins we are getting are hypothetical proteins. Functional
prediction of these hypothetical proteins will play a key role in elucidating their functions, which however, will be an even more daunting task.

References

Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
Aziz R, Bartels D, Best A, et al. The RAST server: rapid annotations using subsystems technology. BMC Genomics. 2008;9(1):75.
Carneiro MO, Russ C, Ross MG, et al. Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics. 2012;13:375.
Davidsen T, Beck E, Ganapathy A, et al. The comprehensive microbial resource. Nucleic Acids Res. 2001;38 Suppl 1:D340–5.
Delcher AL, Harmon D, Kasif S, et al. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999;27:4636–41.
HMP consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486(7402):207–14.
Hoff K. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009;10(1):520.
Hoff KJ, Lingner T, Meinicke P, et al. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 2009;37:W101–5.
Kircher M, Stenzel U, Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 2009;10(8):R83.
Krause L, Diaz NN, Bartels D, et al. Finding novel genes in bacterial communities isolated from the environment. Bioinformatics. 2006;22:e281–9.
Lukashin AV, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998;26:1107–15.
Markowitz VM, Chen IM, Chu K, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012;40(Database issue):D123–9.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11(1):31–46.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9(1):386.
Morozova O, Hirst M, Marra M. Applications of new sequencing technologies for transcriptome analysis. Annu Rev Genomics Hum Genet. 2009;10:135–51.
Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34(19):5623–30.
Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 2008;15:387–96.
Quail MA, Smith M, Coupland P, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13:341.
Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77:257–86.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.
Richter DC, Ott F, Auch AF, et al. MetaSim – a sequencing simulator for genomics and metagenomics. PLoS ONE. 2008;3:e3373.
Salzberg SL, Delcher AL, Kasif S, et al. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998;26:544–8.
Stewart AC, Osborne B, Read TD. DIYA: a bacterial annotation pipeline for any genomics lab. Bioinformatics. 2009;25(7):962–3.
Tang S, Antonov I, Borodovsky M. MetaGeneTack: ab initio detection of frameshifts in metagenomic sequences. Bioinformatics. 2013;29(1):114–6.
Turnbaugh PJ, Hamady M, Yatsunenko T, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–4.
Wu S, Zhu Z, Fu L, et al. WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics. 2011;12:444.
Ye Y, Choi JH, Tang H. RAPSearch: a fast protein similarity search tool for short reads. BMC Bioinforma. 2011;12:159.
Yooseph S, Li W, Sutton G. Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering. BMC Bioinforma. 2008;9:182.
Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28(1):125–6.

FR-HIT Overview

Beifang Niu, Zhengwei Zhu, Limin Fu and Sitao Wu
Center for Research in Biological Systems (CRBS), University of California, San Diego, La Jolla, CA, USA

Definition

A crucial step in metagenomic data analysis is fragment recruitment, a process of aligning
sequencing reads to reference genomes. FR-HIT offers high speed and high sensitivity in recruiting large-scale metagenomic reads.

Introduction

Microbiome data are directly obtained from various environments and contain genomic information of many known and novel microorganisms. An important step in studying these organisms’ identity and abundance is to align the sequencing reads against the available reference genomes. This process was called fragment recruitment in the Global Ocean Sampling (GOS) project that surveyed the world’s oceans (Rusch et al. 2007).

A metagenomic dataset may have many novel species without available reference genomes. Even if references are available, the microbial species may undergo large variations. A fragment recruitment method therefore needs to find all significant alignments with an arbitrary number of mismatches and gaps.

There are many available alignment programs that can be considered for fragment recruitment. In terms of accuracy, BLAST is the best tool because it can identify very remote homology, so it was used in earlier studies such as GOS. But it is too slow for processing reads from the next-generation sequencing (NGS) platforms. The new generation of mapping programs, such as SOAP (Li et al. 2008), Bowtie (Langmead et al. 2009), BWA (Li and Durbin 2009), and many others, are orders of magnitude faster than BLAST. However, these mapping programs only tolerate a few mismatches, so they are not suitable for recruiting metagenomic reads.

FR-HIT is a very fast program to recruit metagenomic reads to homologous reference genomes (Niu et al. 2011). It offers both high speed and high sensitivity in recruiting NGS reads. A C++ implementation of FR-HIT and more details of this method are available at http://weizhongli-lab.org/frhit.

Methods

FR-HIT adopts a seeding strategy with overlapping q-gram hashing to locate candidate matching blocks on the reference sequences and then applies an effective filter within the candidate blocks to discard blocks that do not meet the minimum criteria for containing an alignment with the specified parameters. For each candidate block that passes the filter, the best matching subregions between the candidate block and a read are determined and subsequently passed to the banded Smith-Waterman algorithm, which carries out the actual alignment efficiently and finally verifies whether it is a valid recruitment hit.

Constructing Overlapping Q-Gram Hash Table

All reference sequences are stored together with a hash lookup table to rapidly locate overlapping q-grams (substrings of q characters). The overlapping q-grams are sampled at equidistant steps along the reference sequences. A reference of length m contains (m - q)/(q - p) + 1 q-grams with an overlap of p bases. Here q and p are user-adjustable parameters.

Identifying Candidate Matching Block

The candidate blocks are fragments on reference sequences that will be further considered for alignment with the query. For each query, all its overlapping q-grams are used to scan the q-gram hash table and collect the q-grams shared by reference sequences. Candidate blocks are derived from clusters of pieces on reference genomes marked by the shared q-grams.

Q-Gram Filtering and Banded Alignment

The q-gram filtering strategy was used before in QUASAR (Burkhardt et al. 1999) based on the q-gram lemma (Jokinen and Ukkonen 1991; Owolabi and Mcgregor 1988), which states that two sequences of length n with Hamming distance e share at least n + 1 - (e + 1)q common q-grams. FR-HIT calculates the maximal number of mismatches according to the user-specified alignment cutoff value and rejects the candidate blocks that do not have enough common q-grams. In this step, the length of the q-gram is 4. After filtering, banded alignments between the query and the candidate blocks that passed the filter are performed.

Performance of FR-HIT

The fragment recruitment performance of FR-HIT was compared to some widely used short-read mapping and sequence alignment tools including BLASTN, MegaBLAST, SOAP2, BWA, BWA-SW, SSAHA2, BLAT, and LAST using four metagenomic datasets of up to one million reads covering 454 GS20, 454 GSFLX, 454 Titanium, and Illumina platforms. Reads are aligned to available microbial reference genomes and considered recruited if the alignments are at least 30 bp long with at least 80 % identity.

The overall comparison of CPU time and the number of recruited reads is shown in Fig. 1. On average, FR-HIT is ~2 orders of magnitude faster than BLASTN with a similar recruitment rate. FR-HIT is slower than the mapping programs SOAP2, BWA, and BWA-SW, but it recruits several times more reads.

FR-HIT Overview, Fig. 1 Recruitment rate and speed of FR-HIT and other programs for four datasets. The x-axis is the ratio of CPU time relative to BLASTN; y-axis is the ratio of number of recruited reads relative to BLASTN

Fragment Recruitment Viewer

The results of alignments from FR-HIT can be interactively visualized using Fragment Recruitment Viewer, a tool that plots the alignments on a 2D map where the x-axis is the genome coordinate and y-axis is the alignment identity (Fig. 2). The map can be operated like a Google Map so that users can explore the recruitment alignments from one or multiple samples to many reference genomes. Fragment Recruitment Viewer is available from http://weizhongli-lab.org/mgaviewer. Some pre-calculated recruitment results using FR-HIT are available from the CAMERA project (http://camera.calit2.net).
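
The overlapping q-gram hash table described under Methods can be illustrated with a short sketch. The Python code below is a minimal illustration written for this entry and is not the FR-HIT C++ implementation; the function names (expected_qgram_count, build_qgram_index, lookup_read) and the toy sequences are hypothetical. It samples reference q-grams every q - p bases so that neighbouring q-grams overlap by p bases, and it assumes that the count formula is rounded down when (m - q) is not a multiple of (q - p).

```python
from collections import defaultdict


def expected_qgram_count(m: int, q: int, p: int) -> int:
    """Number of q-grams sampled from a reference of length m with overlap p.

    Uses (m - q)/(q - p) + 1, rounded down (the rounding is an assumption).
    """
    if not 0 <= p < q:
        raise ValueError("overlap p must satisfy 0 <= p < q")
    if m < q:
        return 0
    return (m - q) // (q - p) + 1


def build_qgram_index(reference: str, q: int, p: int) -> dict:
    """Map each sampled q-gram to the list of its start positions."""
    if not 0 <= p < q:
        raise ValueError("overlap p must satisfy 0 <= p < q")
    step = q - p  # equidistant sampling step along the reference
    index = defaultdict(list)
    for start in range(0, len(reference) - q + 1, step):
        index[reference[start:start + q]].append(start)
    return index


def lookup_read(read: str, index: dict, q: int) -> list:
    """Scan every q-gram of the read against the index.

    Returns (read_offset, reference_position) pairs for shared q-grams;
    clusters of such pairs would mark candidate blocks.
    """
    hits = []
    for offset in range(len(read) - q + 1):
        for ref_pos in index.get(read[offset:offset + q], ()):
            hits.append((offset, ref_pos))
    return hits
```

For a 12-base reference with q = 4 and p = 2, five q-grams are indexed (start positions 0, 2, 4, 6, and 8), in agreement with (m - q)/(q - p) + 1.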
FR-HIT Overview, Fig. 2 Screenshots of the Fragment Recruitment Viewer. The initial view of plot shows all hits to the full reference genome. X-axis is the genome coordinate, and y-axis is the alignment identity. Hits are colored by samples. The bottom of the plot shows genes of the reference genome colored by gene type (protein, rRNA, and tRNA). At right bottom corner, there are a few icons to zoom in, zoom out, increase and decrease plot size, and reset to the default view. Mouse wheel can be used to zoom the plot. Plot can be panned using mouse. Information of an alignment or a gene is displayed when the pointer is over it
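
The q-gram lemma quoted under Methods gives a simple rejection test for candidate blocks, which can be sketched as follows. The code is illustrative only and does not reproduce FR-HIT's filter: deriving the mismatch budget e from a percent-identity cutoff, and counting shared q-grams at identical offsets of two strings (which matches the Hamming-distance setting of the lemma but ignores gaps), are simplifying assumptions, and all function names are hypothetical.

```python
def qgram_lemma_threshold(n: int, e: int, q: int) -> int:
    """Minimum number of common q-grams for two length-n sequences
    at Hamming distance e: n + 1 - (e + 1) * q (never below zero)."""
    return max(0, n + 1 - (e + 1) * q)


def max_mismatches(n: int, min_identity_pct: int) -> int:
    """Mismatch budget implied by an identity cutoff (assumed rule).

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return n * (100 - min_identity_pct) // 100


def shared_qgrams_same_offset(a: str, b: str, q: int) -> int:
    """Count positions where both sequences carry the same q-gram."""
    n = min(len(a), len(b))
    return sum(1 for i in range(n - q + 1) if a[i:i + q] == b[i:i + q])


def passes_filter(read: str, block: str, q: int = 4,
                  min_identity_pct: int = 80) -> bool:
    """Keep a candidate block only if it shares enough q-grams."""
    n = min(len(read), len(block))
    e = max_mismatches(n, min_identity_pct)
    needed = qgram_lemma_threshold(n, e, q)
    return shared_qgrams_same_offset(read, block, q) >= needed
```

With n = 30 (the minimum alignment length used in the comparison), an 80 % identity cutoff, and q = 4, the mismatch budget is e = 6 and a candidate must share at least 30 + 1 - (6 + 1) × 4 = 3 q-grams to be kept.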
Summary

FR-HIT is an important tool to perform fragment recruitment analysis for metagenomic sequences. The recruitment results can be visualized using the Fragment Recruitment Viewer. They can also be analyzed to provide taxonomy and function annotations. As a fast alignment tool, FR-HIT can also be used for many applications such as filtering out human contamination from human microbiome samples.

References

Burkhardt S, Cramer A, Ferragina P. q-gram based database searching using a suffix array (QUASAR). RECOMB ’99; 1999 Apr 11–14; Lyon; 1999, pp. 77–83.
Jokinen P, Ukkonen E. 2 algorithms for approximate string matching in static texts. In: Tarlecki A, editor. Mathematical foundations of computer science. Lecture notes in computer science, vol 520. Berlin: Springer; 1991, pp. 240–248.
Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008;24:713–4.
Niu B, Zhu Z, Fu L, et al. FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes. Bioinformatics. 2011;27:1704–5.
Owolabi O, Mcgregor DR. Fast approximate string matching. Softw Pract Exp. 1988;18:387–93.
Rusch DB, Halpern AL, Sutton G, et al. The sorcerer II global ocean sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007;5:e77.

Functional Metagenomics of Bacterial-Cell Crosstalk

Tomas de Wouters1,3, Nicolas Lapaque1, Emmanuelle Maguin1, Joël Doré1,2,3 and Hervé M. Blottière1,2
1 INRA, AgroParisTech, Jouy en Josas, France
2 US 1367 MetaGenoPolis, INRA, Jouy en Josas, France
3 UMR Micalis, AgroParisTech, Jouy en Josas, France

Synonyms

Host-microbiota interactions

Definition

Functional analysis of a metagenome (combined genomes of a defined system) with the aim to understand and/or identify single components of the interaction of a microbe with specific cells.

Introduction

Complex ecosystems often exert several niche-specific functions. Dependent on their entanglement and the accessibility of the ecosystem, the identification and analysis of these single functions can be challenging.

Cultivability and metabolic interdependence of microbes in their ecosystems have confronted microbial ecologists with “the great plate-count anomaly” (Staley and Konopka 1985) since the beginning of their studies. The term summarizes the great discrepancy between the loads of microscopically observed bacteria in an environmental sample and the lower numbers obtained using culture-dependent counting techniques, indicating the lack of representativeness of culture-dependent techniques in the study of most complex bacterial ecosystems.

The development of molecular cloning approaches led microbial ecologists to explore the enzymatic potential of their ecosystems by heterologous expression. They developed techniques to extract total genomic DNA of bacterial origin from complex environmental samples. These metagenomes can subsequently be expressed in a well-known and cultivable host using fosmids, cosmids, or bacterial artificial chromosomes (BACs). The first application of this technique allowed the identification of formerly unknown fibrolytic enzymes from, among others, anaerobic and Gram-positive bacteria (Healy et al. 1995) using E. coli (a Gram-negative bacterium) as a host. The use of heterologous expression of the metagenome of an ecosystem to identify functionalities of uncultivable bacteria was later coined “functional metagenomics” as opposed to the use of molecular techniques for phylogenetic characterization and in silico functional predictions of microbial ecosystems, called metagenomics.

Human Metagenomics

With the discovery of the importance of the human microbiota for human health, the study of the different ecological niches of the human body gained a lot of attention in the late 1990s. To date, five principal niches of the human body have been addressed: the skin, nasal, oral, urogenital, and gastrointestinal microbiota (Huttenhower et al. 2012). Based on the rapid development of the next-generation sequencing technologies, these complex ecosystems have
been explored mainly through metagenomic studies of their phylogenetic composition and their metabolic repertoire as far as in silico prediction is possible.

Most attention has been focused on the intestinal microbiota, not only because of its unique bacterial density but also because of the large mucosal interface that exposes the human body to this bacterial load. The study of germ-free animals and large human cohorts revealed correlations between the composition of the intestinal microbiota and physiological conditions of the host, such as the proper development of immunity, a balanced metabolism, and the systemic inflammatory status (Cerf-Bensussan and Gaboriau-Routhiau 2010). This systemic impact indicates an interaction between the intestinal microbiota and the host that has since been subject to intensive scientific research.

Functional Studies of the Intestinal Microbiota

The human intestinal microbiota harbors a genetic repertoire >25 times larger than that of each human host (Qin et al. 2010) encoding a multitude of functions that contribute directly or indirectly to the host’s physiology. Cultivation efforts as compared to molecular techniques revealed that 70–80 % of the dominant bacteria are not yet cultured. Therefore up to 80 % of the intestinal microbes have no representative in any bacterial strain collection for potential functional studies (Suau et al. 1999; Hayashi et al. 2002). Functional studies of intestinal bacteria have therefore long been limited to the study of cultivable bacteria or the study of monoxenic and gnotobiotic animal models. In order to circumvent this limitation, culture-independent methods such as functional metagenomics have been adapted and used to study functions of the human intestinal microbiota (Table 1). Initially the approach was used to search for enzymatic activities specific for intestinal metabolic functions.

Using a BAC library prepared in an E. coli host, Walter and colleagues screened a mouse intestinal metagenome for β-glucanase activity, identifying 3 out of a total of 5,760 clones (containing a total of 320 Mb of genomic DNA, each clone bearing on average 55 kb) encoding enzymes of interest (Walter et al. 2005). Similarly, by screening a small fragment metagenomic library (14,000 clones, representing 77 Mb of genomic DNA, with cloned DNA fragments of up to 8 kb) derived from cow rumen content, Ferrer and colleagues identified and characterized 22 clones with distinct hydrolytic activities (Ferrer et al. 2005). In these two studies, the screening process only allowed a very limited coverage of the actual metagenome due to the size of the library. Although several studies have identified hydrolytic enzymes using plasmid libraries, one of the key issues of the functional approach is to obtain libraries bearing large fragments of DNA to have access to full operons and operational gene clusters, i.e., from 10 to 50 kb.

Indeed, Jones and colleagues developed a more promising approach by screening about 90,000 metagenomic fosmid clones derived from a human fecal sample (representing a total of about 3.6 Gb bacterial DNA which is about one
Functional Metagenomics of Bacterial-Cell Crosstalk, Table 1 Reported functional metagenomic screenings of the human intestinal microbiota

                          Target                       n of clones tested  Hit rate (%)  Reference
Enzymatic activity        Bile salt hydrolases         89,856              1 × 10^-3     (Jones et al. 2008)
                          Carbohydrate-active enzymes  156,000             2 × 10^-3     (Tasse et al. 2010)
                          β-Glucuronidase              4,608               1.79          (Gloux et al. 2011)
Host-microbe interaction  Cell proliferation           20,725              4 × 10^-2     (Gloux et al. 2007)
                          NF-κB activation             2,640               6 × 10^-2     (Lakhdari et al. 2010)
equivalent of the dominant intestinal metagenome) for bile salt hydrolase activity (Jones et al. 2008). They observed that these functions were present and enriched in all major gut bacterial divisions including Archaea, demonstrating the powerful capacities for discovery of the functional metagenomic approach. In the same way, Tasse and colleagues applied high throughput functional screenings to search human gut-derived metagenomic clones (156,000 clones representing 5.5 Gb of DNA) for their capacity to hydrolyze different polysaccharides (Tasse et al. 2010). This exhaustive analysis of carbohydrate-active enzymes allowed the identification of highly prevalent genes encoding enzymes that are involved in the catabolism of dietary fibers in the human intestinal tract, demonstrating again the strategic interest of the functional metagenomic approach.

Functional Metagenomics and Host-Microbiota Interaction

The intestinal microbiota had successfully been screened for its enzymatic activities with the help of metagenomic libraries using fosmids, cosmids, or bacterial artificial chromosomes (BACs) with single, low copy, or copy control vectors. Thus Gloux and colleagues set out to test if these metagenomic libraries were suited for the study of bacteria-host cell interactions at the intestinal interface, targeting the intestinal epithelial cells. They therefore screened a library of over 20,000 clones for their influence on proliferation of HT-29 human intestinal epithelial cells and CV1 kidney fibroblasts, showing that indeed this approach could reveal genes of interest in the dialogue between the host and its microbiota (Gloux et al. 2007).

The same group further developed this approach, performing the screening of over 2,500 clones on human colorectal carcinoma cell lines, namely, Caco-2 and HT-29, which were stably transfected with NF-κB-dependent reporter genes (Lakhdari et al. 2010). NF-κB is a key transcription factor in intestinal epithelial cells controlling, among others, the inflammatory response. This unique combination of reporter cell technology and functional metagenomics established a new approach to identify specific regulatory elements of the intestinal microbiota in the complex interactions between the intestinal microbiota and its host. They further identified the genes implicated in the observed effect using random transposition on the bioactive clones, showing that this approach can be used to identify genes involved in bacteria-cell crosstalk at the level of intestinal epithelium.

In order to reach a reasonable level of coverage of the metagenomic samples, this approach has been automated and, in parallel, screens have been developed for other transcription factors (AP1, PPARγ. . .) or target genes (ANGPTL4, TSLP, TGFβ. . .) in order to allow the high throughput application necessary to identify bioactive compounds of the intestinal microbiota (www.mgps.eu). The identification of these bioactive clones and the corresponding genes, molecules, and mode of action will help to untangle the complex interactions of the intestinal microbiota with its host.

Functional metagenomics can also be applied to identify indirect interactions of the intestinal microbiota with its host. Gloux and colleagues identified β-glucuronidases using a functional metagenomic screen on libraries derived from intestinal samples from healthy individuals and Crohn’s disease patients (Gloux et al. 2011). The study revealed the presence of a new class of β-glucuronidase that seems to be gut-specific and is hypothesized to play a role in the metabolism of xenobiotics. On this background, functional metagenomics in the human intestine could be a powerful tool to identify specific biodegradation or conversions observed in the intestine that can have physiological effects and for which the dominant causal agent is often unknown.

Up to now all reported functional metagenomic studies of host-microbe interactions were performed using E. coli as a host
strain. Since Gram-positive bacteria represent a large part of the intestinal microbiota and most of the probiotic bacteria described to have beneficial effects on human health are Gram-positive, great efforts have been made to develop easy cloning tools for such studies in Gram-positive hosts. Because the expression of heterologous genes in E. coli gave access to around 40 % of the genes of both Gram-positive and Gram-negative bacteria (Gabor et al. 2004), E. coli is a suitable but not universal host. The utility of a Gram-positive bacterial host is based on a possible preference for RBSs and hence increased transcription but also on secretion of proteins through Gram-positive-specific signal peptides or possible surface exposure of bioactive proteins through cell wall anchoring motifs. Screenings of metagenomic libraries in Streptomyces spp. (Wang et al. 2000) and even Archaea (Albers et al. 2006) have successfully been performed for other ecosystems. Efforts for targeted expression of candidate proteins of the human intestinal microbiota have been made by developing prediction tools for surface-exposed and secreted proteins in Gram-positive hosts in order to mine the abundantly available metagenomic data (Barinov et al. 2009). The expression of the identified candidate genes in a Gram-positive host such as Bacillus subtilis or Lactococcus lactis will allow functional screening in cell-based assays. Though such tools have been used for functional screens of pathogen-cell interaction, a functional metagenomic study of interactions between commensal bacteria and host cells using a Gram-positive host has not been published yet.

Summary

Metagenomic studies are applied to complex systems. Functional metagenomics is no exception. If we study a complex system, simplification can bring clarity. This is the case if we search for specific enzymatic activities in a complex ecosystem. At the same time, simplification harbors the danger of oversimplification and therefore error or deception.

The authors consider functional metagenomics a very useful and powerful tool to screen complex ecosystems for specific functions and believe it can be extended to the study of host-microbiota interactions as performed in the studies mentioned above. For a full understanding of the complex interaction of a microbiome with its cellular counterpart, however, this is only an exploratory tool that will always require validation in a more holistic and thus more complex model (Fig. 1).
Functional Metagenomics of Bacterial-Cell Crosstalk, Fig. 1 Possible models to study host-microbiota interactions ordered by complexity of the microbial (ordinate) and cellular model (abscissa) toward the understanding of human intestinal physiology
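
The library sizes quoted for the screening studies above follow from simple arithmetic: the total cloned DNA is roughly the number of clones multiplied by the mean insert size. The short sketch below reproduces the quoted figures. The function names are hypothetical, decimal units (1 Mb = 1,000 kb) are assumed, and the mean insert sizes derived for the Ferrer, Jones, and Tasse libraries are back-calculated from the quoted totals rather than values reported in the text.

```python
def library_size_mb(n_clones: int, mean_insert_kb: float) -> float:
    """Total cloned DNA in Mb for a library of n_clones inserts."""
    return n_clones * mean_insert_kb / 1000.0


def implied_mean_insert_kb(n_clones: int, total_mb: float) -> float:
    """Mean insert size in kb implied by a library's total size."""
    return total_mb * 1000.0 / n_clones


# (number of clones, total DNA in Mb) as quoted in the text
QUOTED_LIBRARIES = {
    "Ferrer et al. 2005": (14_000, 77.0),
    "Jones et al. 2008": (90_000, 3_600.0),
    "Tasse et al. 2010": (156_000, 5_500.0),
}

if __name__ == "__main__":
    # Walter et al. 2005: 5,760 BAC clones of about 55 kb each
    print(f"Walter et al. 2005: {library_size_mb(5_760, 55):.1f} Mb")
    for study, (n_clones, total_mb) in QUOTED_LIBRARIES.items():
        kb = implied_mean_insert_kb(n_clones, total_mb)
        print(f"{study}: about {kb:.1f} kb per clone")
```

The BAC library of Walter and colleagues comes to 5,760 × 55 kb ≈ 317 Mb, consistent with the quoted 320 Mb, while the quoted totals imply mean inserts of about 5.5 kb (Ferrer), 40 kb (Jones), and 35 kb (Tasse).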
Cross-References

▶ Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity
▶ Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing
▶ Use of Bacterial Artificial Chromosomes in Metagenomics Studies, Overview

References

Albers S-V, Jonuscheit M, Dinkelaker S, Urich T, Kletzin A, Tampé R, et al. Production of recombinant and tagged proteins in the hyperthermophilic archaeon Sulfolobus solfataricus. Appl Environ Microbiol [Internet]. 2006 [cited 2011 Aug 21];72(1):102–11. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1352248&tool=pmcentrez&rendertype=abstract
Barinov A, Loux V, Hammani A, Nicolas P, Langella P, Ehrlich D, et al. Prediction of surface exposed proteins in Streptococcus pyogenes, with a potential application to other Gram-positive bacteria. Proteomics [Internet]. 2009 [cited 2012 Sep 5];9(1):61–73. Available from http://www.ncbi.nlm.nih.gov/pubmed/19053137
Cerf-Bensussan N, Gaboriau-Routhiau V. The immune system and the gut microbiota: friends or foes? Nature Rev Immunol [Internet]. Nature Publishing Group; 2010 [cited 2011 Jul 20];10(10):735–44. Available from http://www.ncbi.nlm.nih.gov/pubmed/20865020
Ferrer M, Golyshina OV, Chernikova TN, Khachane AN, Reyes-Duarte D, Santos V a PM Dos, et al. Novel hydrolase diversity retrieved from a metagenome library of bovine rumen microflora. Environ Microbiol [Internet]. 2005 [cited 2013 Jan 28];7(12):1996–2010. Available from http://www.ncbi.nlm.nih.gov/pubmed/16309396
Gabor EM, Alkema WBL, Janssen DB. Quantifying the accessibility of the metagenome by random expression cloning techniques. Environ Microbiol [Internet]. 2004 [cited 2011 Jun 22];6(9):879–86. Available from http://www.ncbi.nlm.nih.gov/pubmed/15305913
Gloux K, Berteau O, El Oumami H, Béguet F, Leclerc M, Doré J. A metagenomic β-glucuronidase uncovers a core adaptive function of the human intestinal microbiome. Proc Natl Acad Sci U S A [Internet]. 2011 [cited 2011 Jul 29];108(Suppl):4539–46. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3063586&tool=pmcentrez&rendertype=abstract
Gloux K, Leclerc M, Iliozer H, L’Haridon R, Manichanh C, Corthier G, et al. Development of high-throughput phenotyping of metagenomic clones from the human gut microbiome for modulation of eukaryotic cell growth. Appl Environ Microbiol [Internet]. 2007 [cited 2011 Jun 22];73(11):3734–7. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1932692&tool=pmcentrez&rendertype=abstract
Hayashi H, Sakamoto M, Benno Y. Phylogenetic analysis of the human gut microbiota using 16S rDNA clone libraries and strictly anaerobic culture-based methods. Microbiol Immunol [Internet]. 2002 [cited 2011 Apr 21];46(8):535–48. Available from http://www.ncbi.nlm.nih.gov/pubmed/12363017
Healy FG, Ray RM, Aldrich HC, Wilkie AC, Ingram LO, Shanmugam KT. Direct isolation of functional genes encoding cellulases from the microbial consortia in a thermophilic, anaerobic digester maintained on lignocellulose. Appl Microbiol Biotechnol [Internet]. 1995 [cited 2011 Aug 17];43(4):667–74. Available from http://www.ncbi.nlm.nih.gov/pubmed/7546604
Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, Chinwalla AT, et al. Structure, function and diversity of the healthy human microbiome. Nature [Internet]. Nature Publishing Group; 2012 [cited 2012 Jun 13];486(7402):207–14. Available from http://www.nature.com/doifinder/10.1038/nature11234
Jones BV, Begley M, Hill C, Gahan CGM, Marchesi JR. Functional and comparative metagenomic analysis of bile salt hydrolase activity in the human gut microbiome. Proc Natl Acad Sci U S A [Internet]. 2008 [cited 2011 Aug 20];105(36):13580–5. Available from http://www.pnas.org/cgi/content/abstract/105/36/13580
Lakhdari O, Cultrone A, Tap J, Gloux K, Bernard F, Ehrlich SD, et al. Functional metagenomics: a high throughput screening method to decipher microbiota-driven NF-κB modulation in the human gut. Sturtevant J, editor. PLoS ONE [Internet]. 2010 [cited 2010 Oct 1];5(9):e13092. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2948039&tool=pmcentrez&rendertype=abstract
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature [Internet]. 2010;464(7285):59–65. Available from http://www.ncbi.nlm.nih.gov/pubmed/20203603
Staley JT, Konopka A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Ann Rev Microbiol [Internet]. 1985 [cited 2011 Aug 13];39:321–46. Available from http://www.ncbi.nlm.nih.gov/pubmed/3904603
Suau A, Bonnet R, Sutren M, Godon JJ, Gibson GR, Collins MD, et al. Direct analysis of genes encoding 16S rRNA from complex communities reveals many novel molecular species within the human gut. Appl Environ Microbiol [Internet]. 1999;65(11):4799–807. Available from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=91647&tool=pmcentrez&rendertype=abstract
Tasse L, Bercovici J, Pizzut-Serin S, Robe P, Tap J, Klopp C, et al. Functional metagenomics to mine the human gut microbiome for dietary fiber catabolic enzymes. Genome Res [Internet]. 2010 [cited 2010 Sep 18];20(11):1605–12. Available from http://www.ncbi.nlm.nih.gov/pubmed/20841432
Walter J, Mangold M, Tannock GW. Construction, analysis, and beta-glucanase screening of a bacterial artificial chromosome library from the large-bowel microbiota of mice. Appl Environ Microbiol. 2005;71(5):2347–54.
Wang GY, Graziani E, Waters B, Pan W, Li X, McDermott J, et al. Novel natural products from soil DNA libraries in a streptomycete host. Organ Lett [Internet]. 2000 [cited 2011 Aug 21];2(16):2401–4. Available from http://www.ncbi.nlm.nih.gov/pubmed/10956506

Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity

Petra Louis1 and Joël Doré2,3,4
1 Rowett Institute of Nutrition and Health, Microbiology Group, Gut Health Programme, University of Aberdeen, Aberdeen, UK
2 INRA, AgroParisTech, Jouy en Josas, France
3 US 1367 MetaGenoPolis, INRA, Jouy en Josas, France
4 UMR Micalis, AgroParisTech, Jouy en Josas, France

Introduction

Intestinal β-glucuronidases (EC 3.2.1.31) are among the major enzyme families associated with chemical detoxification (Fig. 1). They catalyze the hydrolysis of β-glucuronides naturally present in the human diet, in drugs, or those produced in the liver by glucuronidation via UDP-glucuronosyltransferases (EC 2.4.1.17), which is the major conjugation process in mammals (Tukey and Strassburg 2000; Haiser and Turnbaugh 2013). Numerous lipophilic compounds including metabolic wastes, vitamins, steroid hormones, plant- and animal-derived secondary metabolites, xenobiotics, and pharmaceuticals are thus converted to water-soluble compounds, allowing excretion via the bile and the digestive tract. The β-glucuronidase activity on glucuronide compounds in the gut lumen is primarily due to intestinal bacteria (Rod et al. 1977). This activity regenerates aglycone insoluble forms that are frequently reabsorbed by the host through the enterohepatic circulation, thus increasing circulating aglycone concentrations and extending body exposure. The presence of circulating hormones and xenobiotics is substantially due to this phenomenon and linked to bacterial β-glucuronidase activity. With regard to toxic aglycones, the bacterial activity is largely used as a marker of the potentially harmful effects of commensal bacteria, particularly in studies relating to colorectal cancer (McBain

Definitions
and Macfarlane 1998). b-glucuronidase activity
b-glucuronidases: Enzymes belonging to glyco- can also lead to increased toxicity of chemother-
side hydrolase family 2 that catalyze the cleavage apeutics; such toxic effects were reduced by
of b-D-glucuronic acid residues from a range of coadministration of a b-glucuronidase inhibitor
different compounds. in an animal model (Haiser and Turnbaugh
Functional metagenomics: Screening of 2013). In contrast, it is also involved in the ben-
metagenomic DNA cloned into heterologous eficial bioconversion of dietary compounds
hosts for the expression of specific functions. including lignans, flavonoids, sphingolipids,
Sequence-based metagenomics/metagenomic glycyrrhizin, or baicalein (Kim et al. 1998,
sequence mining: In silico analysis of 2000; Schmelz et al. 1999).
metagenomic sequence libraries for the presence b-glucuronidase activity is phylogenetically
of genes with sequence similarity to known genes. widely distributed among the microbiota and is
Degenerate PCR: Usage of a mixture of similar present in numerous genera including
PCR primers designed to amplify the same gene Bacteroides, Clostridium, Eubacterium, Lacto-
from different organisms, by targeting highly bacillus, Ruminococcus, Faecalibacterium,
conserved gene regions. Roseburia, Streptococcus, Peptostreptococcus,
Enterococcus, Bacillus, Staphylococcus, Corynebacterium, Acinetobacter, Catenabacterium, and Propionibacterium (Beaud et al. 2005; Dabek et al. 2008; Russell and Klaenhammer 2001; McBain and Macfarlane 1998). However, because of the difficulty of differentiating β-glucuronidase from β-galactosidase genes, very few corresponding protein or gene sequences have been clearly annotated as β-glucuronidase in the databases. The β-glucuronidase genes annotated in NCBI are primarily associated with the four major bacterial phyla present in the digestive tract: Bacteroidetes, Firmicutes, Actinobacteria, and Proteobacteria.

Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity, Fig. 1 Role of gut bacterial β-glucuronidase activity in the detoxification of dietary glucuronides, xenobiotics, drugs, and endogenous compounds [Figure not reproduced. Diagram labels: dietary phytochemicals (glycosides, esters); xenobiotics, drugs etc.; endogenous compounds (hormones, vitamins etc.); bacterial (& host) glycosidase & esterase activities; aglycones; liver; β-glucuronides; excretion with bile; gut; bacterial β-glucuronidase activity; enterohepatic circulation]

Taking into account the great diversity of glucuronides likely present in the digestive tract, the question of the diversity and specificities of β-glucuronidases is crucial to discriminate beneficial from harmful intestinal bacteria with regard to this activity. A few studies have suggested a diversity of enzyme action. Some bacterial groups are thought to exert activity toward para-nitrophenyl glucuronide or phenolphthalein glucuronide (Nanno et al. 1986), while others activate 1-nitropyrene (Morotomi et al. 1985). β-glucuronidases from Escherichia coli strains were more strongly induced by methyl-β-D-glucuronide than were those from other bacterial species (Tryland and Fiksdal 1998). In the case of glycyrrhizin metabolism, highly specialized β-glucuronidase activities were involved in cleavage of the two glucuronic acid residues carried by the molecule (Kim et al. 2000).

Genetic diversity has been demonstrated within the gusA genes of E. coli (Ram et al. 2004) and Ruminococcus gnavus species (Beaud et al. 2005) and for the genetic environment of different gusA genes of Ruminococcus gnavus strains (Beaud et al. 2005). We herein summarize the most recent results of metagenomic investigations of β-glucuronidase diversity within the human intestinal microbiota, derived from function-based and sequence-based approaches. This provides key elements toward a better understanding of the “ambiguous” roles of these enzymes in handling the large diversity of glucuronides reaching the colon.

Functional Screening-Based Identification of Human Fecal β-Glucuronidases in Metagenomic Clone Libraries

The genetic information from complex microbial ecosystems can be cloned as large fragments of genomes (Handelsman 2004) into libraries that can be used to detect gene clusters or operons allowing functional investigation. This approach has recently been applied to the intestinal
ecosystems and offers the potential to identify new genes from the microbiota, including its uncultured fraction. It is expected that about 40 % of enzymatic activities should be recoverable in E. coli (Gabor et al. 2004) and this host can express a significant number of genes (Handelsman et al. 1998; Rondon et al. 2000). The metagenomic approach has revealed new enzymes (Hayashi et al. 2005; Humblot et al. 2007; Yun et al. 2004; Kim et al. 2006; Tasse et al. 2010; Cecchini et al. 2013), anticancer products (Piel et al. 2005), and compounds important for industrial, biotechnological, or therapeutic applications (Streit and Schmitz 2004), all having no homolog in the host bacterium (E. coli). β-glucuronidase represents an important function of interaction between the intestinal microbiota and the host and a relevant intestinal activity for human health.

Metagenomic libraries from microbiota obtained from human ileum or feces were constructed in E. coli and their phylogenetic diversity analyzed (Manichanh et al. 2006). The first functional approach using these libraries argued in favor of efficient functional expression of genes from the four dominant phyla of the digestive tract (Gloux et al. 2007). Despite the presence of β-glucuronidase genes in the host bacterium (E. coli), we designed a screening strategy that allowed the identification of numerous bioactive clones. Following primary screening for metagenomic clones overexpressing β-glucuronidase activity, we subcloned the inserts in a uidA- E. coli strain (Gloux et al. 2011). Overall, 19 out of 6,144 metagenomic clones tested had fosmids able to express a β-glucuronidase activity based on para-nitro-phenyl-β-D-glucuronide bioconversion (Bardonnet and Blanco 1992), with levels ranging from 0.02 to 0.88 units. Phylogenetic, genetic, and functional characteristics of β-glucuronidase-positive inserts were investigated. A novel BG gene encoding a β-glucuronidase was identified in both Firmicutes and Bacteroidetes genetic backgrounds. The protein encoded by the gene has two conserved glutamate residues required for catalysis (Salleh et al. 2006) and the conserved predicted TIM barrel domain structure of glycosyl hydrolase family 2 enzymes (Marchler-Bauer and Bryant 2004). The BG protein also had unique features, including an additional C-terminal domain compared to known β-glucuronidases and primary sequence specificities that led to the proposal of novel consensus motifs for the Firmicutes-borne BG and for glycosyl hydrolase family 2 (Gloux et al. 2011).

Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity, Fig. 2 Abundance of Firmicutes BG (blue), Bacteroidetes BG (orange), and uidA homologs (green) in different environments. Abundance was assessed as hits per billion base pairs to correct for size difference of metagenomic datasets. The hit threshold was set as at least 50 % identity with 50 % sequence coverage. For full details of the different genomic hits, see Gloux et al. (2011)

On the basis of sequence specificities, the frequency of the novel Firmicutes or Bacteroidetes BGs within the human gut metagenomes could be assessed. It was such that at least one homolog could be found within approximately 10^4 bacterial genes, making it by far the most dominant BG gene in human gut metagenomes. It was absent from other environmental metagenomes, including animal guts, making it specific to the human gut metagenome (Fig. 2). It was present in the genomes of numerous human intestinal commensals belonging to the phylogenetic and
metagenomic cores described for the human microbiome (Tap et al. 2009; Qin et al. 2010). Finally, gene duplications and the spread of the gene across diverse phylogenetic lineages suggested an ecological drive to ensure the presence of the activity, via functional redundancy, in spite of population variability between individuals.

In conclusion, a novel class of BG was revealed by our functional metagenomic approach that may be part of a functional core specifically evolved to adapt to the human gut environment and potentially important in maintaining health.
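
The abundance measure reported in Fig. 2 (hits per billion base pairs, counting only hits with at least 50 % identity over at least 50 % of the sequence) can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the pipeline used by Gloux et al. (2011): the Hit record, the function names, the reading of coverage as aligned length over query length, and all example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment hit in a metagenomic dataset (hypothetical record)."""
    gene: str                 # illustrative label, e.g. "Firmicutes_BG"
    percent_identity: float   # identity of the alignment, in percent
    aligned_length: int       # number of query residues covered by the alignment
    query_length: int         # full length of the query sequence


def passes_threshold(hit, min_identity=50.0, min_coverage=50.0):
    """Keep a hit only if identity and coverage both reach the thresholds."""
    coverage = 100.0 * hit.aligned_length / hit.query_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


def hits_per_gbp(hits, dataset_size_bp, gene):
    """Normalize the number of accepted hits to one billion base pairs."""
    kept = sum(1 for h in hits if h.gene == gene and passes_threshold(h))
    return kept / (dataset_size_bp / 1e9)


# Made-up example: three candidate hits in a dataset of 2 billion base pairs.
example_hits = [
    Hit("Firmicutes_BG", 72.0, 500, 600),   # passes both thresholds
    Hit("Firmicutes_BG", 55.0, 400, 600),   # passes both thresholds
    Hit("Firmicutes_BG", 48.0, 590, 600),   # rejected: identity below 50 %
]
abundance = hits_per_gbp(example_hits, dataset_size_bp=2_000_000_000,
                         gene="Firmicutes_BG")
print(abundance)
```

Dividing by dataset size is what allows datasets of very different sequencing depth to be compared on one axis; the thresholds are exposed as parameters so that the effect of a stricter or looser cutoff can be explored.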

Sequence-Based Analysis of Human Fecal β-Glucuronidases

As described above, two different β-glucuronidase genes are known: the gus (also referred to as gusA or uidA) gene, which is present in many bacteria as well as higher organisms, and the BG gene, which was identified by a functional metagenomic approach (Gloux et al. 2011). Human fecal metagenomic sequences can be searched to establish the distribution of β-glucuronidases within the human gut community. Degenerate primers targeting highly conserved regions were designed to amplify both types of β-glucuronidase gene from human fecal DNA (McIntosh et al. 2012). This revealed that the gus gene is present in many different phylogenetic groups, whereas the BG gene appears only to be present in bacteria related to Bacteroidetes and Firmicutes. Over 30 different sequence types (operational taxonomic units, OTUs) were found for both genes, with only a few of those being highly abundant (Fig. 3). The majority of OTUs, including some of the abundant ones, corresponded to sequences that currently cannot be assigned to specific bacteria, as either they remain uncultured or their genomes have not yet been sequenced. Three Lachnospiraceae species appear to be among the main carriers of gus (Fig. 3). A recent study that compared levels of human fecal β-glucuronidase activity with overall community diversity also concluded that β-glucuronidase activity was mostly due to particular taxa rather than the wider community (Flores et al. 2012). Interestingly, the phylogenetic relationship of gus genes in known bacteria often did not agree with their relatedness based on the 16S rRNA gene sequence (McIntosh et al. 2012), which is commonly used to classify bacteria phylogenetically. Thus, it appears that the gus gene has been obtained by horizontal gene transfer in several bacteria. For the BG gene, there was a clear phylogenetic distinction between Firmicutes and Bacteroidetes, but many bacteria, especially among the Bacteroidetes, carried several copies of the gene with slightly divergent sequences.

Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity, Fig. 3 Distribution of different types of β-glucuronidase gene in feces of human volunteers (gus gene: 685 sequences from ten volunteers, BG gene: 400 sequences from six volunteers). For full details of the different OTUs detected, see McIntosh et al. (2012)

The degenerate PCR approach described here to investigate specific functions within microbial communities may miss sequences with slight variations in the primer regions that may nevertheless encode the same function. The data were therefore compared to a large metagenomic sequence library of 85 healthy human volunteers (Qin et al. 2010), which showed that the vast
majority of sequences had indeed been captured by the degenerate PCR approach (McIntosh et al. 2012). There were slight differences in relative abundance, which is not surprising considering the difference in technical approach as well as volunteer numbers, but overall both approaches correlated significantly in terms of relative abundance as well as prevalence of different OTUs. Thus, a targeted approach based on degenerate primers appears to provide a good coverage of β-glucuronidase genes. It currently also allows for a more in-depth analysis per volunteer, as the actual metagenomic sequence coverage per volunteer in the pioneer metagenomic sequencing studies is relatively low and many genes are only partially covered. With the vast advances in sequencing technology, however, direct metagenomic mining for specific functional genes will become increasingly attractive.

Sequence-based analysis of functional genes poses the risk of assigning functions to genes that may in fact carry out a different activity, and the actual enzyme activity will ultimately have to be established for representatives of gene variants less closely related to biochemically characterized ones. Especially for glycoside hydrolases, it is often difficult to infer function from sequence alone (▶ Carbohydrate-Active Enzymes Database, Metagenomic Expert Resource). Both β-glucuronidase genes are remotely related to each other based on protein sequence identity and belong to glycoside hydrolase family 2, which also includes enzymes with other specificities, including β-galactosidases and β-mannosidases (http://www.cazy.org/GH2.html). The gus gene has been characterized biochemically in bacteria from different phylogenetic backgrounds (Beaud et al. 2005; Russell and Klaenhammer 2001), and the presence of the gene in a panel of human gut isolates correlated relatively well with the detection of β-glucuronidase activity (Dabek et al. 2008). On the other hand, it was shown that different strains of the same species can show differences in enzyme activity levels when grown under the same conditions and that the level to which β-glucuronidase activity is induced varies depending on the growth substrate and the presence of β-glucuronides (Dabek et al. 2008; McIntosh et al. 2012). The BG gene was identified by screening for β-glucuronidase activity (Gloux et al. 2011), thus confirming its function. An investigation of several bacteria that harbor a BG gene but no gus gene revealed only low levels of β-glucuronidase activity, in both the absence and presence of β-glucuronide as inducer (McIntosh et al. 2012). Thus, BG genes may only be expressed under specific conditions that are yet to be identified. Alternatively, some variants of this diverse gene family may actually encode enzymes with different substrate specificities.

In conclusion, sequence-based analysis of genes encoding β-glucuronidases can be used to reveal the diversity of the β-glucuronidase-positive community and forms a solid basis for further functional investigation of this activity in representative organisms.

Summary

The metabolic activities of the microbial community present in the human gut are closely linked to the physiological status and overall health of its host. Bacterial β-glucuronidase activity directly interferes with one of the major host detoxification systems for a wide range of lipophilic compounds that enter the body via the diet, drugs, or exposure to environmental pollutants, as well as endogenous molecules. Glucuronidation of those compounds renders them more hydrophilic and facilitates their excretion, but β-glucuronidase activities within the gut microbiota convert them back to their respective aglycones, which leads to an extended retention time in the body. Many of those compounds are toxic or carcinogenic, but potentially health-promoting compounds, such as plant phenolics ingested with the diet, may also be glucuronidated. Metagenomics can be utilized to enhance our understanding of which bacteria in the human gut carry β-glucuronidase activity. A functional metagenomic approach, whereby genes from environmental communities are expressed in a heterologous host, has led to the identification of a novel type of β-glucuronidase gene, which was found to be prevalent within the
human gut microbiota but not commonly found in other environments. Metagenomic sequence mining for this novel gene, as well as a previously known β-glucuronidase gene, revealed the distribution of these genes in different phylogenetic lineages. These results provide a valuable foundation for further functional characterization of this important microbial activity.

Cross-References

▶ Carbohydrate-Active Enzymes Database, Metagenomic Expert Resource
▶ Fosmid System

References

Bardonnet N, Blanco C. uidA-antibiotic-resistance cassettes for insertion mutagenesis, gene fusions and genetic constructions. FEMS Microbiol Lett. 1992;72:243–7.
Beaud D, Tailliez P, Anba-Mondoloni J. Genetic characterization of the beta-glucuronidase enzyme from a human intestinal bacterium, Ruminococcus gnavus. Microbiology. 2005;151:2323–30.
Cecchini DA, Laville E, Laguerre S, Robe P, Leclerc M, Doré J, Henrissat B, Remaud-Siméon M, Monsan P, Potocki-Véronèse G. Functional metagenomics reveals novel pathways of prebiotic metabolization by human gut bacteria. PLoS ONE. 2013;8:1–9.
Dabek M, McCrae SI, Stevens VJ, Duncan SH, Louis P. Distribution of β-glucosidase and β-glucuronidase activity and of β-glucuronidase gene gus in human colonic bacteria. FEMS Microbiol Ecol. 2008;66:487–95.
Flores R, Shi J, Gail MH, Gajer P, Ravel J, Goedert JJ. Association of fecal microbial diversity and taxonomy with selected enzymatic functions. PLoS ONE. 2012;7:e39745.
Gabor EM, Alkema WB, Janssen DB. Quantifying the accessibility of the metagenome by random expression cloning techniques. Environ Microbiol. 2004;6:879–86.
Gloux K, Leclerc M, Iliozer H, L’haridon R, Manichanh C, Corthier G, Nalin R, Blottière HM, Doré J. Development of high-throughput phenotyping of metagenomic clones from the human gut microbiome for modulation of eukaryotic cell growth. Appl Environ Microbiol. 2007;73:3734–7.
Gloux K, Berteau O, El Oumami H, Béguet F, Leclerc M, Doré J. A metagenomic β-glucuronidase uncovers a core adaptive function of the human intestinal microbiome. Proc Natl Acad Sci U S A. 2011;108:4539–46.
Haiser HJ, Turnbaugh PJ. Developing a metagenomic view of xenobiotic metabolism. Pharmacol Res. 2013;69:21–31.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5:R245–9.
Hayashi H, Abe T, Sakamoto M, Ohara H, Ikemura T, Sakka K, Benno Y. Direct cloning of genes encoding novel xylanases from the human gut. Can J Microbiol. 2005;51:251–9.
Henrissat B, Cantarel B, Coutinho P. Carbohydrate-active enzymes database, metagenomic expert resource. http://www.springerreference.com/index.chapterbid/303280
Humblot C, Murkovic M, Rigottier-Gois L, Bensaada M, Bouclet A, Andrieux C, Anba J, Rabot S. Beta-glucuronidase in human intestinal microbiota is necessary for the colonic genotoxicity of the food-borne carcinogen 2-amino-3-methylimidazo[4,5-f]quinoline in rats. Carcinogenesis. 2007;28:2419–25.
Kim DH, Jung EA, Sohng IS, Han JA, Kim TH, Han MJ. Intestinal bacterial metabolism of flavonoids and its relation to some biological activities. Arch Pharm Res. 1998;21:17–23.
Kim DH, Hong SW, Kim BT, Bae EA, Park HY, Han MJ. Biotransformation of glycyrrhizin by human intestinal bacteria and its relation to biological activities. Arch Pharm Res. 2000;23:172–7.
Kim YJ, Choi GS, Kim SB, Yoon GS, Kim YS, Ryu YW. Screening and characterization of a novel esterase from a metagenomic library. Protein Expr Purif. 2006;45:315–23.
Manichanh C, Rigottier-Gois L, Bonnaud E, Gloux K, Pelletier E, Frangeul L, Nalin R, Jarrin C, Chardon P, Marteau P, Roca J, Dore J. Reduced diversity of faecal microbiota in Crohn’s disease revealed by a metagenomic approach. Gut. 2006;55:205–11.
Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004;32:327–31.
McBain AJ, Macfarlane GT. Ecological and physiological studies on large intestinal bacteria in relation to production of hydrolytic and reductive enzymes involved in formation of genotoxic metabolites. J Med Microbiol. 1998;47:407–16.
McIntosh FM, Maison N, Holtrop G, Young P, Stevens VJ, Ince J, Johnstone A, Lobley G, Flint HJ, Louis P. Phylogenetic distribution of genes encoding β-glucuronidase activity in human colonic bacteria and the impact of diet on faecal glycosidase activities. Environ Microbiol. 2012;14:1876–87.
Morotomi M, Nanno M, Watanabe T, Sakurai T, Mutai M. Mutagenic activation of biliary metabolites of
1-nitropyrene by intestinal microflora. Mutat Res. 1985;149:171–8.
Nanno M, Morotomi M, Takayama H, Kuroshima T, Tanaka R, Mutai M. Mutagenic activation of biliary metabolites of benzo(a)pyrene by beta-glucuronidase-positive bacteria in human faeces. J Med Microbiol. 1986;22:351–5.
Piel J, Butzke D, Fusetani N, Hui D, Platzer M, Wen G, Matsunaga S. Exploring the chemistry of uncultivated bacterial symbionts: antitumor polyketides of the pederin family. J Nat Prod. 2005;68:472–9.
Qin J, Ruiqiang L, Raes J, Arumugam M, Solvsten K, Burgdorf, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende D, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Yie Y, Tap J, Lepage P, Bertalan M, Batto JM, Hansen T, Le Paslier D, Linneberg A, Nielsen HB, Pelletier E, Renault P, Sicheritz-Ponten T, Turner K, Zhu H, Yu C, Li S, Jian M, Zhou Y, Zhang X, Li S, Yang H, Wang J, Brunak S, Brunak J, Dore J, Guraner F, Kristiansen K, Pedersen O, Parkhill J, Wessenbach J, MetaHIT Consortium, Bork P, Ehrlich SD, Wang J. A human gut microbial gene catalog established by deep metagenomic sequencing. Nature. 2010;464:59–65.
Ram JL, Ritchie RP, Fang J, Gonzales FS, Selegean JP. Sequence-based source tracking of Escherichia coli based on genetic diversity of beta-glucuronidase. J Environ Qual. 2004;33:1024–32.
Rod TO, Midtvedt T. Origin of intestinal beta-glucuronidase in germfree, monocontaminated and conventional rats. Acta Pathol Microbiol Scand. 1977;85([B]):271–6.
Rondon MR, August PR, Bettermann AD, Brady SF, Grossman TH, Liles MR, Loiacono KA, Lynch BA, MacNeil IA, Minor C, Tiong CL, Gilman M, Osburne MS, Clardy J, Handelsman J, Goodman RM. Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol. 2000;66:2541–7.
Russell WM, Klaenhammer TR. Identification and cloning of gusA, encoding a new beta-glucuronidase from Lactobacillus gasseri ADH. Appl Environ Microbiol. 2001;67:1253–61.
Salleh HM, Müllegger J, Reid SP, Chan WY, Hwang J, Warren RA, Withers SG. Cloning and characterization of Thermotoga maritima beta-glucuronidase. Carbohydr Res. 2006;341:49–59.
Schmelz EM, Bushnev AS, Dillehay DL, Sullards MC, Liotta DC, Merrill Jr AH. Ceramide-beta-D-glucuronide: synthesis, digestion, and suppression of early markers of colon carcinogenesis. Cancer Res. 1999;59:5768–72.
Streit WR, Schmitz RA. Metagenomics-the key to the uncultured microbes. Curr Opin Microbiol. 2004;7:492–8.
Tap J, Mondot S, Levenez F, Pelletier E, Caron C, Furet JP, Ugarte E, Munoz-Tamayo R, Paslier DL, Nalin R, Dore J, Leclerc M. Towards the human intestinal microbiota phylogenetic core. Environ Microbiol. 2009;11:2574–84.
Tasse L, Bercovici J, Pizzut-Serin S, Robe P, Tap J, Klopp C, Cantarel BL, Coutinho PM, Henrissat B, Leclerc M, Doré J, Monsan M, Remaud-Simeon M, Potocki-Veronese G. Functional metagenomics to mine the human gut microbiome for dietary fiber catabolic enzymes. Genome Res. 2010;20:1605–12.
Tryland I, Fiksdal L. Enzyme characteristics of beta-D-galactosidase- and beta-D-glucuronidase-positive bacteria and their interference in rapid methods for detection of waterborne coliforms and Escherichia coli. Appl Environ Microbiol. 1998;64:1018–23.
Tukey RH, Strassburg CP. Human UDP-glucuronosyltransferases: metabolism, expression, and disease. Annu Rev Pharmacol Toxicol. 2000;40:581–616.
Yun J, Kang S, Park S, Yoon H, Kim MJ, Heu S, Ryu S. Characterization of a novel amylolytic enzyme encoded by a gene from a soil-derived metagenomic library. Appl Environ Microbiol. 2004;70:7229–35.

Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing

Thomas W. Schoenfeld, Michael J. Moser and David Mead
Lucigen Corporation, Middleton, WI, USA

Introduction

The enzymes of phages and other viruses were vital to the early development of molecular biology and are still essential tools. However, the available viral enzymes represent a tiny sample of the potential diversity found in the global virosphere. Viral metagenomics has revealed a vast diversity of novel genes and a virtually limitless potential to provide new enzymes for use in molecular analysis. An important challenge to both the understanding of viral ecology and development of new viral enzymes is functional characterization of metagenomic sequences, which has lagged far behind the ability to collect sequence data. Described is
a program to identify and characterize replication operons of viral metapopulations isolated from natural thermal environments and develop the gene products as thermostable enzymes for nucleic acid amplification and sequencing. Approaches to functionally characterize viral replicases include (1) expression and biochemical analysis of gene products identified by sequence similarity, (2) functional screens to discover new families of genes, and (3) assembly of operons to predict function based on gene position. These approaches have uncovered at least two diverse families of replication operons including dozens of genes for thermostable DNA polymerases and reverse transcriptases, as well as likely replicase subunits. In addition, functional screens have uncovered one viral Pol unrelated to any known protein. These enzymes are being engineered as improved PCR, RT PCR, and DNA sequencing reagents. Diversity in the viral metagenomes is also being explored to optimize the activity of the genes discovered in the libraries and make them more suitable for the targeted applications.

Gene products of phages and other viruses (collectively referred to here as viruses) have historically provided many of the enzymatic tools for molecular biology. However, most of the commonly used viral enzymes are derived from a very limited number of cultivated viruses, primarily phages T4, T7, lambda, SP6, and phi29, and retroviruses Moloney murine leukemia virus (Mo-MLV) and avian myeloblastosis virus (AMV). The program to study hot spring virology in Yellowstone National Park (YNP), California, and Nevada has provided insight into viral ecology (Otto et al. 1998; Breitbart et al. 2004; Schoenfeld et al. 2008) and has revealed a nearly unlimited source of diversity for the search for new enzymes (Beechem et al. 1998; Moser et al. 2012; Perez et al. 2012). However, current approaches to functional analysis of viral metagenomes, while informative, are limited by their reliance on sequence similarity to infer gene function. Improvements in the ability to functionally characterize viral metagenomes are necessary to advance the field.

Thermostable DNA polymerases (Pols) have been a major research focus due mainly to their wide use in molecular detection and analysis. DNA polymerases are essential for PCR (Staley and Konopka 1985) and other target-specific (Petruska et al. 1998; Notomi et al. 2000) and whole genome amplification methods (Goodman and Fygenson 1998) and are also essential components of all the major DNA sequencing platforms. Sanger (dideoxy chain termination) DNA sequencing was the first major sequencing method to use DNA polymerases and was advanced by thermostable Pols (Tang et al. 2008). All of the leading next-generation sequencing-by-synthesis platforms (e.g., Roche/454 FLX, Illumina Genome Analyzer, Helicos Heliscope, Pacific Biosciences SMRT, ABI SOLiD) (Mardis 2008b; Shendure and Ji 2008) use at least one DNA polymerase for base discrimination and/or template preparation. DNA polymerase-based methods are driving discovery in research labs and, increasingly, in the clinic (Bhui-Kaur et al. 1998) as methods for nucleic-acid-based detection of infectious agents, cancer, and genetic variation advance next-generation diagnostics and personalized medicine. Progress in improving all these methods depends in part on more suitable DNA polymerases.

Viruses are rich sources of diverse new DNA polymerases. Compared to their cellular hosts, viruses use a wide array of strategies to replicate their genomes, and their genomes adopt nearly every conceivable form, including double-stranded and both positive and negative single-stranded RNA and DNA forms, with linear, circular, and multipartite topologies ranging in size from 1.2 Mb (mimivirus) down to 3.2 kb (hepatitis B virus) (Blanco et al. 1989; Detter et al. 2002). While many of these replicative strategies rely on host enzymes, a substantial subset of viral families supplies its own replication proteins. There is speculation that viruses may have played a key role in the evolution of replication strategies used by cellular life (Koonin 2006).

As replicases, viral polymerases are functionally distinct from the bacterial and archaeal
enzymes currently used in molecular biology. During prokaryotic cellular replication, processive leading-strand synthesis depends on a multisubunit complex including Pol III holoenzyme, helicases, and primases. E. coli Pol III holoenzyme is a 791 kD protein composed of nine subunits (reviewed in Xiang et al. (2008)). Because of this complexity, no Pol III derivative has been developed as a molecular biology reagent. Cell-derived reagent Pols, e.g., Taq, Pfu, or E. coli DNA polymerases, are all bacterial Pol I or archaeal Pol II derivatives that are mainly responsible in vivo for lagging strand and repair synthesis, neither of which requires strand separation or processive synthesis of long sequences. Viral Pols are functionally more like the leading-strand replicases and, accordingly, exhibit higher fidelity, rates of synthesis, and processivity (Ley et al. 2008). Phage T7 Pol, for example, incorporates 300 nt per second, six times faster than Escherichia coli Pol I; T4 phage replicates DNA ten times faster than its E. coli host (Heckler et al. 1984). Phi29 Pol has a processivity of >70,000 nucleotides (Blanco et al. 1989) (i.e., it incorporates over 70,000 nucleotides before dissociating), far greater than that of Taq Pol I, which has a processivity of between 50 and 80 (Merkens et al. 1995). Phi29 also has a strong strand displacement capability that, together with its processivity, makes it the polymerase of choice for whole genome amplification by multiple displacement amplification (MDA) (Dean et al. 2001). T7 phage Pol holoenzyme has a processivity of 1,000 nucleotides (Tabor et al. 1987) and efficiently incorporates chain-terminating nucleotide analogs, which facilitated Sanger sequencing until it was displaced by Thermo Sequenase, a Taq Pol derivative that was engineered based on the nucleotide variation in T7 DNA Pol that conferred efficient incorporation of dideoxynucleotides (Tabor and Richardson 1995). T5 Pol has both high processivity and a potent strand displacement activity that are independent of additional host or viral proteins (Andraos et al. 2004). T4 DNA Pol has a high proofreading activity that is commonly exploited for generating blunt ends, especially in physically sheared DNA (Karam and Konigsberg 2000). Retroviral replicases (i.e., reverse transcriptases), especially Mo-MLV and AMV, are indispensable for detection, analysis, and cloning of transcripts and RNA viruses (Morin et al. 2008; Wang et al. 2008). Together, these qualities make viral Pols attractive targets for development as reagents.

While the emphasis has been on DNA polymerases, viruses encode other useful enzymes. RNA polymerases, for example, are key components of a number of in vitro and in vivo transcription and translation systems, as well as several transcription-mediated amplification methods (Guatelli et al. 1990; Compton 1991). Virtually all ligation methods used for cloning and linker attachment depend on T4 DNA ligase due to its high activity on 5′- and 3′-extended and blunt DNA. The integrases and recombinases of various phages (e.g., lambda red and P1 cre/lox) have been used to integrate genes into bacterial and eukaryotic genomes. Resolvases (e.g., T4 endonuclease VII and T7 endonuclease I) have been used to detect single nucleotide polymorphisms (SNPs) (Babon et al. 2003). It is likely that these and many other methods that rely on viral enzymes can be further improved by novel enzyme activities. Functional metagenomic-based enzyme discovery and development should benefit a wide range of applications.

The enzymes that have been isolated by cultivation over the years demonstrate the potential of viruses as a source of new enzymes, but greatly underrepresent the richness of this resource. The extreme global abundance and diversity of viruses is well documented (Breitbart et al. 2002; Angly et al. 2006; Dinsdale et al. 2008; McDaniel et al. 2008; Schoenfeld et al. 2008). A liter of ocean water contains as many viruses as there are humans on the planet and much more genetic diversity (Wang et al. 2007). In fact, the bulk of the world’s genetic diversity is probably encoded in viral genomes. Despite the richness of the global virosphere as a source of diverse replicative proteins, standard approaches to discovering new enzymes by cultivating the viruses have proven extremely inefficient, and few new viral enzymes have been commercialized in the past decades.
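
As a rough illustration of the scale of these differences, the processivity and synthesis-rate figures quoted above can be compared directly. The sketch below uses only numbers cited in the text; the variable names are illustrative, and the E. coli Pol I rate is merely back-calculated from the "six times faster" statement rather than taken from a measurement.

```python
# Processivity (nucleotides incorporated per binding event), as quoted in the text.
phi29_processivity_nt = 70_000          # lower bound (">70,000")
t7_holoenzyme_processivity_nt = 1_000
taq_low_nt, taq_high_nt = 50, 80        # quoted range for Taq Pol I

# Fold difference between phi29 and Taq at both ends of the Taq range.
fold_min = phi29_processivity_nt / taq_high_nt
fold_max = phi29_processivity_nt / taq_low_nt

# T7 Pol is quoted at 300 nt/s, six times faster than E. coli Pol I.
t7_rate_nt_per_s = 300
pol_i_rate_nt_per_s = t7_rate_nt_per_s / 6   # implied by the text, not measured

print(f"phi29 vs Taq processivity: at least {fold_min:.0f}- to {fold_max:.0f}-fold")
print(f"phi29 vs T7 holoenzyme: at least "
      f"{phi29_processivity_nt / t7_holoenzyme_processivity_nt:.0f}-fold")
print(f"Implied E. coli Pol I rate: {pol_i_rate_nt_per_s:.0f} nt/s")
```

Because the phi29 value is a lower bound, the fold differences printed here are lower bounds as well.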
Notably, despite their widespread potential applications and notwithstanding substantial effort, thermostable viral Pols have completely eluded discovery by cultivation. There are now 34 fully sequenced genomes from thermophilic viruses in the NCBI database (February 2010): 27 archaeal viruses and 7 bacteriophages. None of these genomes or broad screens of hundreds of cultivated Thermus phage (Lopatto et al. 2008) has produced a thermostable DNA polymerase. Extensive analysis of cultivated crenarchaeal viral genomes from high-temperature environments reveals few recognizable features other than a small number of methylases, helicases, glycosyltransferases, and several unknown but shared genes (Rehrauer et al. 1998). At least one presumptive DNA polymerase has been identified in an archaeal viral genome (Baklanov et al. 1984), but not expressed in the lab. At least five Pols have been expressed from thermophilic bacteriophage genomes (Wang et al. 2006; Schmidt et al. 2008; T. Schoenfeld, unpublished); however, for unknown reasons, these enzymes are only moderately thermostable and incapable of surviving thermocycling in PCR or sequencing, despite the thermostability of their host Pols. In order to identify useful thermostable Pols, more efficient approaches are needed.

One of the main barriers to discovery of new viral enzymes is the technical challenge associated with cultivation. It is widely noted that cultivation in the lab selects against the great majority of Bacteria and Archaea. Cultivation of new viruses introduces another extreme level of selection against the vast majority of natural populations because cultivation requires the investigator to choose a host that can be grown in the lab, which severely limits the comprehensiveness of the screens. When examining extreme environments like thermal springs, which are dominated by autotrophic microbes, this host selection is even more limiting. Most of these cultivation efforts have focused on viruses that infect heterotrophic Bacteria, especially Thermus (Reha-Krantz et al. 1998; Karam and Konigsberg 2000; Pavlov and Karam 2000; Bebenek et al. 2001; Blondal et al. 2003; Ding et al. 2008; Lopatto et al. 2008; Schmidt et al. 2008) or a small number of thermoacidophilic Archaea, particularly Sulfolobus and Acidianus (reviewed in Rehrauer et al. (1998)), due to the relative ease of cultivating these hosts. Metagenomics promises to overcome these barriers and provide a largely unbiased sampling of viral populations.

In some respects viral metagenomes are especially well suited for discovery of enzymes for use in molecular analysis. Viral genomes are highly diverse and dense with genes associated with nucleic acid metabolism (Paulsen and Wintermeyer 1984). For example, a typical bacterial genome of 2 Mb contains three to five DNA polymerase genes, only one of which, polA, encodes enzymes that have been used as reagents. In contrast, a comparable 2 Mb of viral metagenome can yield up to 40 pol genes (Schoenfeld et al. 2008). However, the promise of using this diversity to advance the understanding of global ecology and to develop useful enzymes from viral metagenomes is tempered by the challenge of assigning function to the genes. The gigabases of viral metagenomic sequence data that have been generated over the past decade have provided only inferential insight into function or biochemistry of the viral genes and, consequently, few new molecular tools.

Efforts to glean insight from metagenomes are hampered by the nearly complete reliance on sequence similarity coupled with the extreme viral genomic diversity and the dearth of annotated sequences. Depending on the environment, 40–90 % of viral metagenomic sequences are unknown, novel sequences (Angly et al. 2006; Dinsdale et al. 2008; Bench 2007; Srinivasiah 2008; Schoenfeld 2008). All the next-generation platforms generate shorter reads that are even more difficult to assemble or align to sequences in GenBank, resulting in artificially low BLASTx homologies or, conversely, artificially high numbers of “unique” sequences (Wommack 2008). The VIROME database (virome.dbi.udel.edu) has cataloged 201 Mb of predicted open reading frames (ORFs) from long read sequence data (Feb 2010), the vast majority of which are novel and functionally uncharacterized.
Functional characterization of viral metagenomes has lagged far behind the ability to collect sequence data. Essentially none of the millions of gene functions inferred by sequence similarity has been proven biochemically by expression and analysis of the gene products. More importantly, the mere description of sequence similarity does little to further the understanding of viral biology or to identify useful new enzymes. Furthermore, sequence-similarity screens only identify genes with an annotated counterpart in a database. The relative scarcity of functionally annotated viral genes in GenBank has likely prevented discovery of truly novel enzyme families, which should be the strength of viral metagenomics.

Finally, a conceptual barrier associated with the definition of related viral types has prevented assembly of viral genomes and, consequently, inferences into function that are based on gene position. Phage genes of related function, especially replication-related genes, often occur in proximity within operons (El Omari et al. 2006). Assembly of sequence reads should allow reconstruction of operons; however, standard approaches relying on nucleotide identities of greater than 95 % are ineffective in assembly of viral metagenomes, and only a few very small, abundant phage genomes have been reconstructed from metagenomic data (Angly et al. 2006). Because even the relatively long Sanger reads are almost always too short to include more than one complete gene, these associations are generally missed. Because traditional shotgun sequencing, used in some of the work described below, involves the construction of clone libraries, adjacent genes could be identified by sequencing entire inserts from archived clones, but even this approach is limited by the sizes of inserts in the libraries, generally less than 5 kb. Since none of the next-generation sequencing methods uses clone libraries, this approach is impossible for most of the ongoing viral metagenomic projects. The fundamental problem is that viral populations are too molecularly diverse to accommodate the 95 % identity criterion. Among cultivated viruses, closely related phages are up to 50 % divergent at the nucleotide level (Truncaite et al. 2006; Wang and Silverman 2006). When assembly criteria are reduced to as low as 50 %, much larger assembled contigs are generated (Schoenfeld et al. 2008). This approach has proven effective in generating contigs that contain identifiable operons that not only allow isolation of genes of related function but also allow mapping of diversity onto the protein structure. These sequence variations correspond to biochemical differences in the gene products and provide a guide to enzyme engineering. In the work described below, a tripartite approach was used for functional analysis of viral metagenomes including (1) expression and biochemical characterization of the “BLASTx hits,” (2) functional screens to identify enzymes too dissimilar to known genes to be detected by sequence similarity, and (3) assembly of operons to infer gene function based on position in the genome.

Methods

Sampling, Library Construction, and Sequencing
Sampling, library construction, and sequencing of the YNP samples have been described (Schoenfeld et al. 2008). The Great Boiling Spring samples were collected as described and amplified using the Repli-g kit (GE Healthcare). DNA was sheared and inserted into pETite vector (Lucigen) and the library used to transform E. coli HI-Control BL21(DE3) cells (Lucigen). Individual clones from both libraries were sequenced in their entirety using standard chemistry (Life Technologies).

Bioinformatics
Sequence assemblies were performed using Sequencher (Gene Codes) or SeqMan (DNASTAR). ClustalW analysis was performed as described (Nandakumar and Shuman 2005).

Functional Screens
The clones from the Great Boiling Spring samples were grown in Luria broth, pelleted, and resuspended in buffer containing lysozyme. Lysates were incubated for 10 min at 70 °C and
centrifuged, and the supernatants were tested for DNA polymerase activity using the standard assay. Positive clones were cultivated at 50 ml in LB and retested. The inserts of clones with activity were sequenced in their entirety.

Cloning, Expression, Purification, and Mutagenesis
DNA polymerase genes that were further characterized were expressed at higher levels by insertion into pET28 vector and expression in E. cloni EXPRESS BL21(DE3) (Lucigen). DNA polymerase was purified by heat treatment and standard chromatography methods. Mutagenesis was performed using the QuikChange II Site-Directed Mutagenesis Kits (Agilent).

Biochemical Analysis and Applications Development
Biochemical assays were performed using standard methods (Mardis 2008; Marks et al. 2008).

Results and Discussion

Sequence-Based and Functional Discovery of New DNA Polymerases
In a recent study of viral metagenomes from Yellowstone hot springs, more than 28,000 Sanger-based long sequence reads (nearly 30 Mb of sequence) were determined (Schoenfeld et al. 2008). BLASTx alignment to the nonredundant protein database indicated that 156 ORFs had similarity to known pol genes. Fifty-nine appeared to be complete genes and were tested for DNA polymerase activity. Ten showed activity and seven of these were sequenced in their entirety. Although highly divergent from known viral and cellular genes, four were loosely grouped with family A polymerases and three grouped with family B polymerases. These pol genes are referred to as “PyroPhage” followed by an identifying number. The family A pols detected by this screen were too divergent to be grouped, but the family B Pols are referred to below as “PyroPhage 4110-like Pols” in reference to the first one discovered.

The degree of sequence conservation among pol genes in these libraries, while relatively low, was higher than that of most sequences found in viral metagenomes. The discovery of 156 partial genes among roughly 600 viral genome equivalents suggests that sequence-based screens were relatively efficient in identifying pol genes. Nonetheless, there are important disadvantages to this approach. One is that the diversity of viral pol genes is likely to be high enough that interesting new enzymes are missed. Another problem is that a gene must be situated in the random clone so that an identifiable portion of it is within the read length of the sequencing method (>1,000 nucleotides by Sanger, much less by newer sequencing approaches), and the gene must not extend beyond the boundaries of the random insert, which would leave it incomplete. It is unknown how many genes failed to fulfill the first criterion and were within the insert, but not within the sequence range. Of the 156 identified candidate pol genes, only 38 % fulfilled the second criterion and appeared complete. Finally, the identification of a gene does not mean that the gene will express efficiently in E. coli. For unknown reasons, among the 59 likely complete genes, 83 % failed to express at detectable levels.

Functional screens address many limitations of sequence-similarity screens and can often detect completely novel activities regardless of divergence from known genes or position in the insert, as long as the complete gene is present. By their nature, functional screens only detect complete, expression-competent genes. Viral metagenomic DNA from the Great Boiling Spring, Gerlach, NV, kindly provided by Brian Hedlund and Jeremy Dodsworth (University of Nevada-Las Vegas), was used to construct a library that was screened for expression of thermostable pol activity. Screening of 2,800 clones resulted in the discovery of 12 that were positive for primer extension activity. Eleven of these were more than 97 % identical to each other and are referred to as the “PyroPhage 74-like polymerases” in reference to the first member discovered. These pol genes share up to 45 % identity with the other polA-type genes from Yellowstone (PyroPhage 3173 and 967) and 56 %
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 1 Polymerase phylogenetic tree. Full-length viral metagenomic DNA polymerase amino acid sequences were compared by ClustalW to representative viral, microbial, and eukaryotic Pols and displayed in a neighbor-joining tree
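
The screening numbers reported for the two libraries can be cross-checked with a few lines of arithmetic. The following sketch uses only counts stated in the text (28,000 reads with 156 pol-like ORFs, 59 complete and 10 active, for the sequence-based screen; 2,800 clones with 12 positives for the functional screen). The function and variable names are illustrative, and treating reads as equivalent to clones is an assumption made here for the comparison.

```python
def per_thousand(hits: int, screened: int) -> float:
    """Hits per 1,000 clones (or reads) screened."""
    return 1000 * hits / screened

# Sequence-based screen of the Yellowstone libraries.
seq_screened, seq_orfs, seq_complete, seq_active = 28_000, 156, 59, 10
# Functional screen of the Great Boiling Spring library.
fun_screened, fun_active = 2_800, 12

frac_complete = seq_complete / seq_orfs          # compare with the quoted 38 %
frac_inactive = 1 - seq_active / seq_complete    # compare with the quoted 83 %

seq_rate = per_thousand(seq_active, seq_screened)
fun_rate = per_thousand(fun_active, fun_screened)

print(f"complete ORFs: {frac_complete:.0%}; complete but inactive: {frac_inactive:.0%}")
print(f"active Pols per 1,000 screened: {seq_rate:.2f} (sequence-based) "
      f"vs {fun_rate:.2f} (functional)")
print(f"fold difference: {fun_rate / seq_rate:.0f}")
```

The two fractions reproduce the percentages given in the text, which suggests that the quoted 83 % refers to the 49 complete genes that did not yield activity.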
identity to PyroPhage 488, a pol gene isolated 8 years earlier in a sequence-based screen of a metagenome from Little Hot Creek, Long Valley, CA, which is 400 km from Gerlach, NV, but still in the Great Basin. The final clone identified in the functional screen, PyroPhage 347, had no significant similarity to any known pol gene. In fact, the closest match to any known gene was an open reading frame of unknown function in a crenarchaeal virus, with a barely significant E value of 0.750. Due to this lack of similarity to genes of known function, this gene would never have been identified by sequence similarity.

The pol genes discovered by both screens were aligned by ClustalW to each other and to representative cellular and viral pol genes to construct a neighbor-joining tree (Fig. 1). Viral genes from these screens, as well as those retrieved from GenBank, were noticeably more diverse than cellular genes. Most PyroPhage pol genes are highly divergent from known cellular or viral pol genes. The exception is PyroPhage 3063, which is related to several polA genes of the Aquificales, which are known to be quite divergent from other bacterial polA genes (Griffiths and Gupta 2004).

Since the libraries were constructed from different hot spring populations, direct comparisons are difficult. However, while the overall rate of discovery of apparent DNA polymerase genes was comparable for the sequence-based and functional screens (156 pol genes from 28,000 clones compared to 12 from 2,800 clones, respectively), the rate of discovery of functional thermostable
enzymes was much lower for the sequence screens than the functional screens (10 of 28,000 vs. 12 of 2,800). The diversity of the enzymes in the Great Boiling Spring (GBS) library was much lower than that of the enzymes from Yellowstone springs, presumably reflecting a lower overall population diversity.

Biochemical Characteristics and Directed Engineering Improve Use of PyroPhage Pols in PCR and Sanger Sequencing
PyroPhage 3173 and 347 Pols proved to be the most thermostable of the newly discovered polymerases. In fact, these are the first viral Pols with adequate thermostability for PCR. PyroPhage 3173 Pol, which has been studied in greatest detail (Table 1), has adequate thermostability for thermocycling, inherent reverse transcriptase activity, and high fidelity that enable a number of applications for this enzyme. The proofreading activity proved highly beneficial for high-fidelity PCR amplification (Fig. 2). However, many applications benefit from the absence of proofreading activity. Alignment of the PyroPhage 3173 pol gene to E. coli polA (Beese and Steitz 1991) identified codons for two acidic residues, either of which could be mutated to eliminate exonuclease activity. This reduces fidelity to a level very close to that of Taq Pol, but simplifies the enzyme's use in PCR and other amplification methods. Like most family A Pols, 3173 has a strong discrimination against dideoxynucleotides that made it less effective in Sanger sequencing. Based on alignment to known proteins (Tabor and Richardson 1995), mutation F418Y (Fig. 3a) reduced discrimination against chain terminators to nearly zero, making the enzyme very effective for dye terminator cycle sequencing (Fig. 3b).

Table 1 Biochemical characteristics of PyroPhage 3173 DNAP
Property                 Value
3′–5′ exonuclease        Strong
5′–3′ exonuclease        None
Strand displacement      Strong
Extension from nicks     Strong
T½ at 95 °C              10 min
Km dNTPs                 40 µM
Km DNA                   5.3 nM
Processivity             42
Fidelity                 8 × 10^4
3′ ends of amplicons     Blunt
Template                 DNA or RNA

Single-Enzyme RT PCR with 3173 DNA Polymerase
The thermostability and reverse transcriptase activities seen in PyroPhage 3173 Pol allow efficient RT PCR amplification of mRNA and viral RNA genomic targets with improved performance compared to alternative single-enzyme solutions (Fig. 4). Quantitative detection of viral targets is linear over at least seven logs of dilution (Fig. 5). These benefits have significantly improved detection of transcripts and RNA viruses (Moser et al. 2012).

Currently almost all RT PCR depends on retroviral RTs, i.e., M-MLV and AMV RTs, which, despite wide use, have well-documented deficiencies that compromise RT PCR. Side activities in retroviral reverse transcriptases, including RNAse H and terminal transferase, lead to mismatch extension artifacts (Blumenthal 1980; Blumenthal and Hill 1980; Harrison and Zimmerman 1984; Pulsinelli and Temin 1991; Shah et al. 1995; Vratskikh et al. 1995; Ho and Shuman 2002; van Dijk et al. 2004). Primer-dependent bias in extension efficiency (Yin et al. 2003) and fidelity (Cheng et al. 2005) likely account for documented inaccuracy of RT PCR quantification (Loeffler et al. 2003), poor correlation between tests (Nelson et al. 2001), and/or complete amplification failure depending on the RT and the abundance of transcript (Damasko et al. 2005). Inherently low synthesis fidelity (up to one error per 500 nt, 20X higher than Taq Pol) results in misincorporations, frameshifts, and deletions (Kerr and Sadowski 1972; Little 1981; Heaphy et al. 1987). Strand-switching (Strauch et al. 2003) probably causes the inter- and intramolecular rearrangement artifacts (Cherepanov and de Vries 2001) that can be preferentially extended (Sharp et al. 1994) and result in recombination or insertion/deletion (indel) artifacts in cDNA synthesis (Evans et al. 1989; Snyder et al. 1992). A consequence of
Fig. 2 Fidelity of PyroPhage 3173 Pol and its exo- derivative. Fidelities of PCR amplification of PyroPhage 3173 wt and exonuclease minus Pols were compared to commercial sources of thermostable Pols in the lacI forward mutation assay (Lundberg et al. 1991)
Fig. 3 Directed engineering of 3173 Pol to improve Sanger sequencing. (a) Shown is the increased incorporation of dideoxy- and acyclonucleotides by the F418Y mutant of PyroPhage 3173 Pol, as indicated by increased inhibition of Pol activity by chain-terminating nucleotides. (b) The F418Y mutant was used as a direct substitute for Thermo Sequenase in a BigDye® (ABI) sequencing reaction
two-enzyme RT PCR is that the RT step can interfere with subsequent PCR (Harnett et al. 1985; McLaughlin et al. 1985; Evans et al. 1989; Petric et al. 1991; Snyder et al. 1992; Sharp et al. 1994), which compromises quantification of low abundance targets. Efforts to ameliorate these deficiencies include mutagenesis to disable or remove the RNAse H domain (Downie et al. 2004). These mutations reduce rearrangements, but lead to increased substitution errors and bias (Blumenthal and Hill 1980; Middleton et al. 1985; Vratskikh et al. 1995).
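
To give a sense of what an error rate of up to one per 500 nt means in practice, the sketch below estimates the expected number of errors and the fraction of error-free products for a single round of cDNA synthesis. It assumes independent errors at a uniform per-nucleotide rate (a simplification not made in the source), takes the Taq rate as 20-fold lower as quoted in the text, and uses a hypothetical 1,500 nt transcript.

```python
def expected_errors(length_nt: int, error_rate: float) -> float:
    """Expected number of misincorporations in one synthesized strand."""
    return length_nt * error_rate

def fraction_error_free(length_nt: int, error_rate: float) -> float:
    """Probability that a strand contains no errors, assuming independent errors."""
    return (1 - error_rate) ** length_nt

rt_error_rate = 1 / 500                # upper bound quoted for retroviral RTs
taq_error_rate = rt_error_rate / 20    # text: RT rate is 20X higher than Taq Pol
transcript_nt = 1_500                  # hypothetical cDNA length, for illustration

for name, rate in [("retroviral RT", rt_error_rate), ("Taq Pol", taq_error_rate)]:
    errors = expected_errors(transcript_nt, rate)
    clean = fraction_error_free(transcript_nt, rate)
    print(f"{name}: {errors:.2f} expected errors, {clean:.1%} error-free strands")
```

Under these assumptions only a small minority of full-length cDNAs made at the upper-bound RT error rate would be error-free, whereas most strands made at the Taq-like rate would be.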
Fig. 4 Reverse transcription PCR using PyroPhage 3173 Pol. (a) Total human liver RNA (1 µg, Promega) was reverse transcribed by M-MLV RT or PyroPhage 3173 Pol and then PCR amplified using Lucigen EconoTaq® PLUS Master Mix. Shown are targets of 144, 246, and 298 bp. (b) Single-enzyme RT PCR amplifications by PyroPhage 3173 Pol and Tth (Epicentre) were compared using a 160 bp MS2 phage RNA target over a 10^2- to 10^8-fold dilution series. Shown are real-time and post reaction melt data (top) and corresponding end point RT PCR agarose gel (bottom). Tth polymerase was used with Mn2+ as directed. Arrows show correct melt Tm (top) and amplicon (bottom)
Other enzymes have been explored as alternatives to retroviral RTs (e.g., Tth Pol (Rand and Gait 1984)), but none has proven a satisfactory replacement for most methods that rely on reverse transcription of RNA. PyroPhage 3173 is the most efficient Pol for single-enzyme RT PCR and, as such, an alternative to the retroviral RT-dependent methods.

Assembly of Composite Contigs from Viral Metagenomes
One anticipated drawback of using metagenomics as an enzyme discovery tool was the fragmentary nature of the reads, which was expected to hamper efforts to associate subunits of multisubunit enzymes. Many proteins, replicases in particular, function as multiple
Fig. 5 Single-enzyme, one-step RT PCR amplification of MS2 phage RNA using 3173 Pol. MS2 RNA was amplified by 40 cycles of RT PCR using the primers shown in Table 1 and 3173 Pol. (a) Products from 89 to 362 bp in length were amplified using one-step single-enzyme RT PCR cycling conditions: 15 s at 94 °C, then (10 s at 94 °C, 30 s at 72 °C) × 40. Products were resolved by 2 % agarose gel electrophoresis. (b) The MS2 RNA was diluted from 10^1- to 10^7-fold and amplified using a primer pair corresponding to the 160 bp fragment in Panel A. Real-time PCR fluorescence in RFU (relative fluorescence units) vs. PCR cycles. (c) Post-amplification thermal melt in -dRFU/dTemperature vs. Temperature (°C). Light blue region indicates melt curves for specific products. (d) Standard curve PCR cycle threshold vs. log10 RNA copy number in triplicate with linear least squares best fit line
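
Panel (d) describes a standard curve of cycle threshold (Ct) against log10 RNA copy number fitted by linear least squares. A minimal sketch of such a fit is given below. The Ct values are invented for illustration and are not the data in Fig. 5, and the conversion of slope to amplification efficiency uses the conventional relation E = 10^(-1/slope) - 1, which the source does not state.

```python
def least_squares(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical dilution series: log10(copy number) and invented Ct values.
log10_copies = [2, 3, 4, 5, 6, 7, 8]
ct = [33.1, 29.8, 26.4, 23.1, 19.7, 16.4, 13.0]

slope, intercept = least_squares(log10_copies, ct)
efficiency = 10 ** (-1 / slope) - 1   # 1.0 would correspond to doubling every cycle

def copies_from_ct(observed_ct: float) -> float:
    """Invert the standard curve to estimate copy number for an unknown sample."""
    return 10 ** ((observed_ct - intercept) / slope)

print(f"slope = {slope:.2f} cycles per log10 copies, efficiency = {efficiency:.0%}")
print(f"Ct of 25.0 corresponds to roughly {copies_from_ct(25.0):.2e} copies")
```

A slope near -3.3 cycles per tenfold dilution corresponds to near-perfect doubling, which is why linearity over many logs is usually read as evidence of quantitative detection.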
subunits. Indeed, the replicases of phages T4, T7, and Phi29 and viruses Mo-MLV, vaccinia, and herpes all function in vivo as multigene replication complexes comprising a number of subunits, e.g., helicases, primases, processivity factors, and clamp loaders (Blanco et al. 1994; Bertram et al. 1998; Goodman 1998; Reha-Krantz et al. 1998; Tang et al. 1998). While, in most cases, the polymerase subunits function independently in vitro, the utility may be improved by additional subunits. For example, T7 Pol apoenzyme, by itself, has low processivity and was not very effective in Sanger sequencing without its host-derived processivity factor, thioredoxin (Tabor et al. 1987; Tabor and Richardson 1987). Because proteins in replication complexes often have highly specific contacts with one another (Goodman 1998), it is important
that subunits are derived from the same viral genome and not from unrelated viruses.

Because these functionally related genes are often adjacent in operons, it is theoretically possible to identify them given long enough contiguous sequence. Experience shows that operons are almost always too large to be found in the relatively small insert clones seen in typical metagenomic libraries and, without modified assembly rules, are missed. With deep sequencing, these fragments could theoretically be assembled to recover complete viral genomes. In practice, the high degree of sequence polymorphism that characterizes viral metapopulations confounds assembly of related genes and only very limited assembly has been possible by standard protocols.

To accommodate this natural population diversity, assembly stringency was lowered experimentally from the standard 95 % identity to as low as 50 %. Assembly of the YNP Bear Paw (74 °C) and Octopus (93 °C) metagenomes at 50 % identity allowed recovery of composite contigs as large as 35 kb. Fully 7.04 Mb (33 %) of the Octopus reads assembled at this identity into 17 contigs of greater than 10 kb (Schoenfeld et al. 2008). These assemblies appear very reliable in associating orthologous sequences. Particularly in the Octopus library, the sequence reads are evenly distributed throughout the contigs with minimal stacking or other anomalies that would suggest amplification or cloning artifacts. The high numbers of reads on both strands, evenly distributed throughout the contigs, suggest these contigs represent independent clones of closely related genomes. Using the lower stringency assemblies, SNPs can be identified and mapped to the coding sequences. As additional biochemical and structural data become available, molecular diversity may be correlated with variations in function and structure.

Assembly of a Replication Operon from a Viral Metagenome
One of these contigs provided a unique opportunity to identify potential replicase subunits and associate population diversity of an assembled metagenome with the biochemistry of the gene products (Fig. 6).

This 16.5 kb contig, assembled at 50 % identity, includes 187 reads (average coverage of 11 reads per nucleotide position). GeneMark (Besemer and Borodovsky 2005) predicted 26 ORFs of greater than 100 nucleotides, which, when translated and annotated by BLASTp, appear to include at least a partial replication operon. The genes with the strongest similarity to four of these ORFs encode two primase subunits, uracil DNA glycosylase, a family B DNA polymerase, and nucleotide excision repair nuclease (dnaG, udg, polB, and ERCC4 genes, respectively). Homologs of these ORFs belong to crenarchaeal DNA replication/repair complexes (Roberts et al. 2003; Dionne and Bell 2005; Barry and Bell 2006). The predicted polB gene showed 28 % identity to Pyrobaculum islandicum polB2 (Kahler and Antranikian 2000). Three of the discrete clones that include the polB gene in this contig (PyroPhage 4110, 2783, and 2323 Pols; Fig. 1) have been expressed in E. coli to produce a functional thermostable DNA polymerase (data not shown). This contig also contains apparent homologs to a zinc finger-like protein and a transposon-like integrase/resolvase (tnp). Another ORF with highest similarity to the CRISPR-associated sequence cas4 (Haft et al. 2005) is more likely a separate member of the cas4 COG, presumably a recB-like exonuclease gene.

To correlate the level of sequence divergence with predicted gene function, SNP frequency was calculated and overlaid onto the 50 % assembly consensus sequence of the contig (Fig. 6). Overall distribution of SNPs in the contig was 0.705 per 10 bp. Replication-associated genes showed noticeably lower molecular diversity than the other ORFs. SNP distribution in the dnaG, udg, polB, and ERCC4 homologs was 0.565, 0.617, 0.569, and 0.548 per 10 bp, respectively, while the distribution in the Zn finger, cas4, and thyA homologs was 0.979, 1.31, and 0.728, respectively. Finer mapping of this diversity is being used to understand the functional differences in the enzymes encoded by the constituent clones of this contig.
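
The contig statistics above lend themselves to a quick consistency check. The sketch below uses only values quoted in the text and in Fig. 6 (a 16,542 bp contig, 187 reads, 11-fold average coverage, and the per-gene SNP frequencies). The grouping into replication-associated and other ORFs follows the text, while the unweighted averaging within each group and the variable names are assumptions made for illustration.

```python
contig_length_bp = 16_542   # contig length given with Fig. 6
n_reads = 187
mean_coverage = 11          # reads per nucleotide position

# Mean read length implied by contig length, read count, and coverage.
implied_read_length = mean_coverage * contig_length_bp / n_reads

# SNPs per 10 bp as quoted in the text.
snp_per_10bp = {
    "replication": {"dnaG": 0.565, "udg": 0.617, "polB": 0.569, "ERCC4": 0.548},
    "other": {"Zn finger": 0.979, "cas4": 1.31, "thyA": 0.728},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Unweighted mean per group (gene lengths are not given, so no weighting is possible).
group_means = {group: mean(genes.values()) for group, genes in snp_per_10bp.items()}
ratio = group_means["other"] / group_means["replication"]

print(f"implied mean read length: {implied_read_length:.0f} nt")
for group, value in group_means.items():
    print(f"{group}: {value:.3f} SNPs per 10 bp")
print(f"other / replication: {ratio:.2f}")
```

The implied read length is in the range expected for Sanger reads, and the replication-associated group falls below the contig-wide value of 0.705 SNPs per 10 bp while the other group lies above it, in line with the text.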
[Figure 6 plot. Contig: 16,542 bp, 187 reads, 87 % two reads per strand. Y-axis: SNPs per 10 bp (0 to 4). X-axis: position in bp (0 to 16,000). ORFs shown: dnaG, cas4, trip, Zn finger, thyA, udg, polB, ERCC4.]

Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 6 Assembly of a 16.5 kb viral metagenome consensus contig from Octopus hot spring showing single nucleotide polymorphism heterogeneity. (a) 16.5 kb contig was assembled at 50 % identity from the NYP Octopus hot spring library. Sequence coverage is shown on the top, with each line representing a separate read. Single nucleotide polymorphisms per 10 base pairs were normalized to the number of reads covering the respective nucleotide and are aligned with predicted open reading frames from the consensus sequence in the contig and the gene name of the strongest BLASTx similarity. Direction of transcription is shown by the arrows. Similarities to known genes were identified by BLASTp (Reprinted with permission (Schoenfeld et al. 2008))
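The caption states that SNP counts per 10 bp were normalized to read coverage but does not spell out the calculation. The Python sketch below shows one plausible reading, offered as an assumption rather than as the authors' method: at each position the number of reads disagreeing with the consensus is divided by the number of reads covering that position, and these per-position fractions are summed over consecutive 10 bp windows. The function name `snp_density` and the toy input arrays are hypothetical.

```python
from typing import List, Sequence


def snp_density(mismatch_reads: Sequence[int],
                coverage: Sequence[int],
                window: int = 10) -> List[float]:
    """Coverage-normalized SNP density per window (assumed definition).

    mismatch_reads[i]: reads that differ from the consensus at position i
    coverage[i]:       reads that cover position i
    Returns, for each consecutive window, the sum of mismatch/coverage
    over the positions in that window. Uncovered positions are skipped.
    """
    if len(mismatch_reads) != len(coverage):
        raise ValueError("mismatch_reads and coverage must have equal length")
    densities = []
    for start in range(0, len(coverage), window):
        total = 0.0
        pairs = zip(mismatch_reads[start:start + window],
                    coverage[start:start + window])
        for mismatches, depth in pairs:
            if depth > 0:
                total += mismatches / depth
        densities.append(total)
    return densities


# Toy data: 20 positions, each covered by 4 reads.
example_mismatches = [0, 2, 0, 0, 1, 0, 0, 0, 0, 0,
                      0, 0, 0, 4, 0, 0, 0, 0, 0, 0]
example_coverage = [4] * 20
print(snp_density(example_mismatches, example_coverage))
```

Dividing by depth keeps deeply sequenced regions from looking more polymorphic simply because more reads were available to disagree with the consensus.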
Identification of a Replicase Polyprotein from the Great Boiling Spring Metagenome
Based on the large number of highly similar isolates (<3 % amino acid divergence), the PyroPhage 74-like family of polA-like genes from the Great Boiling Spring in Nevada (Fig. 1) appears to be derived from abundant viruses with limited molecular diversity. Unlike the previously described pol genes, these were identified by functional screening, precluding the assembly of large contigs. However, this group of pol genes proved particularly useful for dissecting the molecular biology of a different replicase. The various polymerase-positive clones contain the carboxy terminal half of an apparent polyprotein, but vary in the amount of coding sequence for the amino terminal half (Fig. 7a), implying that the carboxy terminal half of the polyprotein is sufficient for polymerase activity. The polymerase gene appears to be part of an open reading frame that would encode a polyprotein of at least 100 kD. After expression in E. coli, this polyprotein is processed, either in vivo or in vitro, to produce a protein of about 55 kD (Fig. 7b). The amino terminal half of this apparent polyprotein has no known function and no significant sequence similarity to known proteins, but is likely to be associated with replication and is, therefore, the target of ongoing investigation.

Polyproteins are common elements used by RNA viruses (Nandakumar et al. 2004). The retroviral reverse transcriptases, for example, are all expressed as polyproteins that are proteolytically processed (Clepet et al. 2004). Heterologous viral polyproteins from hepatitis C have been shown to be active and properly processed in E. coli (Yin et al. 2004). However, replicases expressed as polyproteins are much rarer in DNA viruses. The PyroPhage 74-family Pol described here and the PyroPhage 3173 Pol described below are the first documented examples of thermophilic phage polyproteins that are actively processed in E. coli.
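The family definition at the start of this section rests on a divergence threshold (<3 % amino acid divergence). As a minimal sketch of how such a figure can be computed from two already-aligned protein sequences, the Python function below counts mismatched residues over the positions where neither sequence has a gap. The function name `percent_divergence`, the gap convention, and the toy sequences are assumptions for illustration and are not taken from the source.

```python
def percent_divergence(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned, ungapped positions at which two sequences differ.

    Both inputs must come from the same alignment and so have equal length.
    Columns where either sequence has a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a != res_b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


# Toy 40-residue sequences differing at a single position (index 5).
toy_reference = "ACDEFGHIKLMNPQRSTVWY" * 2
toy_variant = toy_reference[:5] + "A" + toy_reference[6:]

divergence = percent_divergence(toy_reference, toy_variant)
print(f"{divergence:.1f} % divergence; same family at <3 %: {divergence < 3.0}")
```

Percent identity is simply 100 minus this value when computed over the same columns.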
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 7 Putative polyprotein from Great Boiling Spring viral metagenome. The PyroPhage 74-like pol genes are aligned to the consensus sequence (Panel A). All of the clones contain the C-terminal half of a 100 kD ORF, but vary in the amount of N-terminal sequence. Despite differences in the sizes of open reading frames of the inserts, all PyroPhage 74-like clones express a thermostable protein of about 55 kD (Panel B). The 347 clone, in contrast, produces a 35 kD thermostable protein
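Protein sizes in this entry are quoted in kD, while ORF sizes are given in amino acids elsewhere. A rough conversion between the two can help when comparing a predicted ORF with an observed product. The Python sketch below assumes an average residue mass of about 110 Da, a common rule of thumb that is not stated in the source; real proteins deviate from it depending on composition, so the results are order-of-magnitude checks only. The function names are illustrative.

```python
AVG_RESIDUE_MASS_DA = 110.0  # assumed rule-of-thumb average, not from the source


def approx_mass_kd(n_residues: int) -> float:
    """Approximate protein mass in kD from its length in residues."""
    return n_residues * AVG_RESIDUE_MASS_DA / 1000.0


def approx_residues(mass_kd: float) -> int:
    """Approximate protein length in residues from its mass in kD."""
    return round(mass_kd * 1000.0 / AVG_RESIDUE_MASS_DA)


# Sizes quoted in the text and in the Fig. 7 caption.
for label, kd in [("polyprotein ORF (at least)", 100),
                  ("processed 74-like Pol", 55),
                  ("347 clone product", 35)]:
    print(f"{label}: ~{kd} kD is roughly {approx_residues(kd)} residues")
```

Under this assumption a 100 kD polyprotein corresponds to roughly 900 residues and the 55 kD product to roughly 500, consistent with the statement that the polymerase occupies about the carboxy terminal half of the ORF.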
Molecular Biology of the PyroPhage 3173 Replicase Operon
Expression of PyroPhage 3173 Pol, described above, illustrates another challenge in metagenomic-based enzyme discovery. Since, as with all metagenomes, the intact virus has never been cultivated and the sequence data are fragmentary, delineation of the open reading frame of the pol gene was unclear. For production and study of the 3173 Pol, expression was initiated at an ATG codon that appeared to be the most probable start site based on alignment to bacterial pol genes. Despite the success in using this 55 kD expression product in RT PCR and other applications (see above), anomalies were apparent in the open reading frame that was used for expression of this enzyme. First, there was no obvious adjacent ribosome binding site or transcriptional promoter. Second, there was no homologous ATG codon in the related 488 and 967 clones (Fig. 1), despite overall alignment with the 3173 gene. Finally, an open reading frame extended upstream from the putative start codon to the insertion site of the viral sequence in the cloning vector.

Low identity assembly of the 3173 clone proved useful in dissecting the molecular biology of this gene and allowed production of the complete enzyme corresponding to the likely in vivo product. In contrast to the 4110-like and 74-like polymerase families, the 3173 clone was derived from a highly divergent, less abundant virus, since reads from this clone failed to assemble at 95 % identity with any other read in the library. Assembly at 75 % identity resulted in a 7,299 nt contig (Fig. 8a) composed of four reads. This assembly was confirmed by PCR amplification of nearly the entire contig from viral DNA isolated from the same hot spring 4 years later to produce a product of the predicted size (Fig. 8b). This amplification also suggests the 3173-encoding virus is more persistent in the environment than other viral families, none of which was detectable in the later samples. This contig encodes four open reading frames of greater than 100 nt. The largest of these encodes a protein of 1,608 amino acids (170 kD), the carboxy terminal portion of which includes the 55 kD PyroPhage 3173 DNA polymerase. The amino terminal portion contains a coding sequence with only weak similarity to known genes. The other open reading frames encode putative helicases and a cas4/recB endonuclease protein.

Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 8 Analysis and PCR amplification of a 7.3 kb contig from 75 % NIAID assembly. A 7.3 kb contig was assembled from four clones in the hot springs viral metagenome. GeneMark identified four open reading frames of greater than 100 amino acids, the sizes of which (144, 229, 202, and 1,608 amino acids) are indicated (Panel A). These genes had BLASTx similarity to helicases, cas4 (recB), and DNA polymerases, with the indicated E values. Primers derived from the assembly are indicated by arrows and their positions on the contig are indicated by the associated numbers. These primers were used to amplify viral DNA isolated 4 years after the original collection (Panel B). An amplicon covering the 1,608 amino acid ORF (Panel B, lane 2) was inserted into an expression system and used to produce an apparent truncation product of ~80 kD, indicated by the arrow (Panel C), that co-purified with the Pol activity

The amplification product of the entire 1,608 amino acid ORF expressed in E. coli produced an 80 kD protein (Fig. 8c) that co-purified with thermostable DNA polymerase activity. The simplest explanation is that the 1,608 amino acid protein (expected MW of 170 kD) is processed in vivo or in vitro to generate the 80 kD product and that the original 55 kD PyroPhage 3173 Pol was a cloning anomaly. Supporting this
interpretation, amino acids 884 to 894 form the motif AYIYLGSIFVE, which was predicted by cleavage site analysis to be both labile to autolytic cleavage and accessible on the surface (Cosstick et al. 1984). Cleavage between G and S would result in a 704 amino acid (80 kD) protein. The amino terminal amino acids of the 80 kD protein align with the 5'–3' exonuclease domains of T. aquaticus and E. coli. The amino acids involved in nucleotide binding are conserved, but not the amino acids required for hydrolysis. Although the 55 kD protein has shown great utility, it is possible that addition of this 25 kD amino terminal sequence, or a portion thereof, would improve its function for certain applications. In addition to the 80 kD Pol protein, the other ORFs are being expressed to reconstitute the presumptive replicase holoenzyme.

This work highlights an important caveat of enzyme discovery by metagenomics. The fragmentary sequences can result in the recovery of partial genes. Assembly of sequences can be the only means of verifying ORFs. In this case, the partial gene proved highly useful, but in many cases, a functional protein could easily be missed by recovery of partial sequences.

Sequence Variants of PyroPhage 3173 DNA Polymerase Isolated from the Viral Metagenome
Metagenomics has proven quite useful for new enzyme discovery. The utility of viral metagenomes is greatly expanded when they are used to guide engineering. One approach to improving DNA polymerases is directed evolution (Ghadessy et al. 2001) based on random mutagenesis. While this approach is effective, the sheer number of mutants that must be screened to approach saturation is daunting. For an enzyme of the size of Taq Pol (832 amino acids), this would require $20^{832}$ clones to completely saturate the entire gene with mutagenized codons and test all the possible amino acids at each position. Even a fraction of this number overwhelms any current or conceivable screening capability. To limit the search, algorithms have been developed to target mutagenesis to specific domains (Voigt et al. 2001).

Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing, Fig. 9 Thermostability of PyroPhage 3173 Pol variants. The amplification product from Fig. 7b, lane 2, was cloned and expressed to produce thermostable protein. The clones grouped into at least four families that were 97 % identical to one another and 93 % identical to the original clone. The expressed Pol activity was purified and tested for thermostability by incubating for 10 min at the indicated temperature and assaying using the standard DNA polymerase assay (Panel A). Shown are amino acid alignments of a portion of the Q-helix from the prototype PyroPhage 3173 and the two least thermostable sequence variants (variants 1 and 11) (Panel B). These thermolabile variants had one or two unique amino acids, respectively, that mapped to this region

Metagenomic libraries are an alternative to random degenerate libraries as a source of molecular diversity. Since, in native populations, nature selects for active proteins, activities of variants in the libraries may differ, but they should all retain function. To study sequence variants, the 55 kD
version of PyroPhage 3173 amplified from viral DNA collected at Octopus hot spring (Fig. 8b) was cloned in an expression vector. Eleven clones were used to express DNA polymerase activity and the inserts were sequenced. The variants were 93 % identical to the original 3173 isolate and at least 97 % identical to one another. When the polymerases were partially purified and tested, they had a significant range of thermostability (Fig. 9). The two most labile enzymes had only one or two unique nucleotide polymorphisms each. Two of these independent sequence polymorphisms map within four codons of each other. No three-dimensional structure is available for PyroPhage 3173 Pol, but, based on sequence alignment to Taq DNA polymerase and its known protein structure (Kim et al. 1995), the polymorphisms associated with reduced thermostability likely map to the same alpha helix (the Q-helix) within one of the “fingers” of the Pol structure. If so, the two affected amino acids are at the proper spacing to be adjacent on the alpha helix (four amino acids apart) and likely interact to stabilize or destabilize the alpha helix and thereby alter thermostability.

While a goal of screening hot spring viromes was to find the most thermostable enzymes possible, the lower thermostability variants have value. Isothermal amplification methods such as LAMP (Notomi et al. 2000) use intermediate temperatures (i.e., 50–70 °C) and do not require extreme thermostability. Less thermostable enzymes will likely have higher activity at these intermediate temperatures (Giver et al. 1998). Equally important, amino acids that reduce thermostability map to regions that can be targeted to increase thermostability (Bae and Phillips 2004) and are attractive targets for mutagenesis.

Prospects

The focus of the efforts has been discovering and improving thermostable DNA polymerases.
Metagenomics is playing a role in both the discovery and development phases of this project. Viral metagenomics has revealed new replicase operons, thermophilic polyproteins, and entirely new classes of Pols with novel and useful activities for a number of methods of DNA and RNA detection and analysis. In the near future, it may be possible to assemble complete genomes from uncultivated viruses from thermal environments and recover intact replicase operons using the appropriate combination of sequencing strategy, assembly paradigm, and genome walking techniques. The information encoded in the viral metagenomes is being used to direct an enzyme improvement program. Additional applications can likely be improved by the discovery of enzymes other than Pols. In many cases, viral metagenomes are excellent sources of diversity for these discovery programs and presumably any biochemical characteristic that can be measured can be further improved by application of the knowledge gained through metagenomics.

References

Andraos N, Tabor S, Richardson CC. The highly processive DNA polymerase of bacteriophage T5. Role of the unique N and C termini. J Biol Chem. 2004;279(48):50609–18.
Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F. The marine viromes of four oceanic regions. PLoS Biol. 2006;4(11):e368.
Babon JJ, McKenzie M, Cotton RG. The use of resolvases T4 endonuclease VII and T7 endonuclease I in mutation detection. Mol Biotechnol. 2003;23(1):73–81.
Bae E, Phillips Jr GN. Structures and analysis of highly homologous psychrophilic, mesophilic, and thermophilic adenylate kinases. J Biol Chem. 2004;279(27):28202–8.
Baklanov MM, Riazankin IA, Butorin AS, Nechaev Iu S, Iamkovoi VI. Purification and characteristics of an RNA-ligase preparation from bacteriophage T4. Prikl Biokhim Mikrobiol. 1984;20(2):191–9.
Barry ER, Bell SD. DNA replication in the archaea. Microbiol Mol Biol Rev. 2006;70(4):876–87.
Bebenek A, Dressman HK, Carver GT, Ng S, Petrov V, Yang G, Konigsberg WH, Karam JD, Drake JW. Interacting fidelity defects in the replicative DNA polymerase of bacteriophage RB69. J Biol Chem. 2001;276(13):10387–97.
Beechem JM, Otto MR, Bloom LB, Eritja R, Reha-Krantz LJ, Goodman MF. Exonuclease-polymerase active site partitioning of primer-template DNA strands and equilibrium Mg2+ binding properties of bacteriophage T4 DNA polymerase. Biochemistry. 1998;37(28):10144–55.
Beese LS, Steitz TA. Structural basis for the 3'-5' exonuclease activity of Escherichia coli DNA polymerase I: a two metal ion mechanism. EMBO J. 1991;10(1):25–33.
Bench SR, Hanson TE, Williamson KE, Ghosh D, Radosovich M, Wang K, Wommack KE. Metagenomic characterization of Chesapeake Bay virioplankton. Appl Environ Microbiol. 2007;73(23):7629–41.
Bertram JG, Bloom LB, Turner J, O’Donnell M, Beechem JM, Goodman MF. Pre-steady state analysis of the assembly of wild type and mutant circular clamps of Escherichia coli DNA polymerase III onto DNA. J Biol Chem. 1998;273(38):24564–74.
Besemer J, Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 2005;33:W451–4. Web Server issue.
Bhui-Kaur A, Goodman MF, Tower J. DNA mismatch repair catalyzed by extracts of mitotic, postmitotic, and senescent Drosophila tissues and involvement of mei-9 gene function for full activity. Mol Cell Biol. 1998;18(3):1436–43.
Blanco L, Bernad A, Lazaro JM, Martin G, Garmendia C, Salas M. Highly efficient DNA synthesis by the phage phi 29 DNA polymerase. Symmetrical mode of DNA replication. J Biol Chem. 1989;264(15):8935–40.
Blanco L, Lazaro JM, de Vega M, Bonnin A, Salas M. Terminal protein-primed DNA amplification. Proc Natl Acad Sci U S A. 1994;91(25):12198–202.
Blondal T, Hjorleifsdottir SH, Fridjonsson OF, Aevarsson A, Skirnisdottir S, Hermannsdottir AG, Hreggvidsson GO, Smith AV, Kristjansson JK. Discovery and characterization of a thermostable bacteriophage RNA ligase homologous to T4 RNA ligase 1. Nucleic Acids Res. 2003;31(24):7247–54.
Blumenthal T. Interaction of host-coded and virus-coded polypeptides in RNA phage replication. Proc R Soc Lond B Biol Sci. 1980;210(1180):321–35.
Blumenthal T, Hill D. Roles of the host polypeptides in Q beta RNA replication. Host factor and ribosomal protein S1 allow initiation at reduced GTP concentration. J Biol Chem. 1980;255(24):11713–6.
Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, Azam F, Rohwer F. Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A. 2002;99(22):14250–5.
Breitbart M, Wegley L, Leeds S, Schoenfeld T, Rohwer F. Phage community dynamics in hot springs. Appl Environ Microbiol. 2004;70(3):1633–40.
Cheng Q, Nelson D, Zhu S, Fischetti VA. Removal of group B streptococci colonizing the vagina and oropharynx of mice with a bacteriophage lytic enzyme. Antimicrob Agents Chemother. 2005;49(1):111–7.
Cherepanov AV, de Vries S. Binding of nucleotides by T4 DNA ligase and T4 RNA ligase: optical absorbance and fluorescence studies. Biophys J. 2001;81(6):3545–59.
Clepet C, Le Clainche I, Caboche M. Improved full-length cDNA production based on RNA tagging by T4 DNA ligase. Nucleic Acids Res. 2004;32(1):e6.
Compton J. Nucleic acid sequence-based amplification. Nature. 1991;350(6313):91–2.
Cosstick R, McLaughlin LW, Eckstein F. Fluorescent labelling of tRNA and oligodeoxynucleotides using T4 RNA ligase. Nucleic Acids Res. 1984;12(4):1791–810.
Damasko C, Konietzny A, Kaspar H, Appel B, Dersch P, Strauch E. Studies of the efficacy of Enterocoliticin, a phage-tail like bacteriocin, as antimicrobial agent against Yersinia enterocolitica serotype O3 in a cell culture system and in mice. J Vet Med B Infect Dis Vet Public Health. 2005;52(4):171–9.
Dean FB, Nelson JR, Giesler TL, Lasken RS. Rapid amplification of plasmid and phage DNA using Phi 29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res. 2001;11(6):1095–9.
Detter JC, Jett JM, Lucas SM, Dalin E, Arellano AR, Wang M, Nelson JR, Chapman J, Lou Y, Rokhsar D, Hawkins TL, Richardson PM. Isothermal strand-displacement amplification applications for high-throughput genomics. Genomics. 2002;80(6):691–8.
Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, Fulton L, Fulton RS, Zhang Q, Wendl MC, Lawrence MS, Larson DE, Chen K, Dooling DJ, Sabo A, Hawes AC, Shen H, Jhangiani SN, Lewis LR, Hall O, Zhu Y, Mathew T, Ren Y, Yao J, Scherer SE, Clerc K, Metcalf GA, Ng B, Milosavljevic A, Gonzalez-Garay ML, Osborne JR, Meyer R, Shi X, Tang Y, Koboldt DC, Lin L, Abbott R, Miner TL, Pohl C, Fewell G, Haipek C, Schmidt H, Dunford-Shore BH, Kraja A, Crosby SD, Sawyer CS, Vickery T, Sander S, Robinson J, Winckler W, Baldwin J, Chirieac LR, Dutt A, Fennell T, Hanna M, Johnson BE, Onofrio RC, Thomas RK, Tonon G, Weir BA, Zhao X, Ziaugra L, Zody MC, Giordano T, Orringer MB, Roth JA, Spitz MR, Wistuba II, Ozenberger B, Good PJ, Chang AC, Beer DG, Watson MA, Ladanyi M, Broderick S, Yoshizawa A, Travis WD, Pao W, Province MA, Weinstock GM, Varmus HE, Gabriel SB, Lander ES, Gibbs RA, Meyerson M, Wilson RK. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455(7216):1069–75.
Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–32.
Dionne I, Bell SD. Characterization of an archaeal family 4 uracil DNA glycosylase and its interaction with PCNA and chromatin proteins. Biochem J. 2005;387(Pt 3):859–63.
Downie AB, Dirk LM, Xu Q, Drake J, Zhang D, Dutt M, Butterfield A, Geneve RR, Corum 3rd JW, Lindstrom KG, Snyder JC. A physical, enzymatic, and genetic characterization of perturbations in the seeds of the brownseed tomato mutants. J Exp Bot. 2004;55(399):961–73.
El Omari K, Ren J, Bird LE, Bona MK, Klarmann G, LeGrice SF, Stammers DK. Molecular architecture and ligand recognition determinants for T4 RNA ligase. J Biol Chem. 2006;281(3):1573–9.
Evans GF, Snyder YM, Butler LD, Zuckerman SH. Differential expression of interleukin-1 and tumor necrosis factor in murine septic shock models. Circ Shock. 1989;29(4):279–90.
Ghadessy FJ, Ong JL, Holliger P. Directed evolution of polymerase function by compartmentalized self-replication. Proc Natl Acad Sci U S A. 2001;98(8):4552–7.
Giver L, Gershenson A, Freskgard PO, Arnold FH. Directed evolution of a thermostable esterase. Proc Natl Acad Sci U S A. 1998;95(22):12809–13.
Goodman MF. Purposeful mutations. Nature. 1998;395(6699):221–3.
Goodman MF, Fygenson KD. DNA polymerase fidelity: from genetics toward a biochemical understanding. Genetics. 1998;148(4):1475–82.
Griffiths E, Gupta RS. Signature sequences in diverse proteins provide evidence for the late divergence of the Order Aquificales. Int Microbiol. 2004;7(1):41–52.
Guatelli JC, Whitfield KM, Kwoh DY, Barringer KJ, Richman DD, Gingeras TR. Isothermal, in vitro amplification of nucleic acids by a multienzyme reaction modeled after retroviral replication. Proc Natl Acad Sci U S A. 1990;87(19):7797.
Haft DH, Selengut J, Mongodin EF, Nelson KE. A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol. 2005;1(6):e60.
Harnett SP, Lowe G, Tansley G. A stereochemical study of the mechanism of activation of donor oligonucleotides by RNA ligase from bacteriophage T4 infected Escherichia coli. Biochemistry. 1985;24(25):7446–9.
Harrison B, Zimmerman SB. Polymer-stimulated ligation: enhanced ligation of oligo- and polynucleotides by T4 RNA ligase in polymer solutions. Nucleic Acids Res. 1984;12(21):8235–51.
Heaphy S, Singh M, Gait MJ. Effect of single amino acid changes in the region of the adenylylation site of T4 RNA ligase. Biochemistry. 1987;26(6):1688–96.
Heckler TG, Chang LH, Zama Y, Naka T, Chorghade MS, Hecht SM. T4 RNA ligase mediated preparation of
novel “chemically misacylated” tRNAPheS. Biochemistry. 1984;23(7):1468–73.
Ho CK, Shuman S. Bacteriophage T4 RNA ligase 2 (gp24.1) exemplifies a family of RNA ligases found in all phylogenetic domains. Proc Natl Acad Sci U S A. 2002;99(20):12709–14.
Kahler M, Antranikian G. Cloning and characterization of a family B DNA polymerase from the hyperthermophilic crenarchaeon Pyrobaculum islandicum. J Bacteriol. 2000;182(3):655–63.
Karam JD, Konigsberg WH. DNA polymerase of the T4-related bacteriophages. Prog Nucleic Acid Res Mol Biol. 2000;64:65–96.
Kerr C, Sadowski PD. Gene 6 exonuclease of bacteriophage T7. I. Purification and properties of the enzyme. J Biol Chem. 1972;247(1):305–10.
Kim Y, Eom SH, Wang J, Lee DS, Suh SW, Steitz TA. Crystal structure of Thermus aquaticus DNA polymerase. Nature. 1995;376(6541):612–6.
Koonin EV. Temporal order of evolution of DNA replication systems inferred by comparison of cellular and viral DNA polymerases. Biol Direct. 2006;1:39.
Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, Gordon D, Chinwalla A, Zhao Y, Ries RE, Payton JE, Westervelt P, Tomasson MH, Watson M, Baty J, Ivanovich J, Heath S, Shannon WD, Nagarajan R, Walter MJ, Link DC, Graubert TA, DiPersio JF, Wilson RK. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008;456(7218):66–72.
Little JW. Lambda exonuclease. Gene Amplif Anal. 1981;2:135–45.
Loeffler JM, Djurkovic S, Fischetti VA. Phage lytic enzyme Cpl-1 as a novel antimicrobial for pneumococcal bacteremia. Infect Immun. 2003;71(11):6199–204.
Lopatto D, Alvarez C, Barnard D, Chandrasekaran C, Chung HM, Du C, Eckdahl T, Goodman AL, Hauser C, Jones CJ, Kopp OR, Kuleck GA, McNeil G, Morris R, Myka JL, Nagengast A, Overvoorde PJ, Poet JL, Reed K, Regisford G, Revie D, Rosenwald A, Saville K, Shaw M, Skuse GR, Smith C, Smith M, Spratt M, Stamm J, Thompson JS, Wilson BA, Witkowski C, Youngblom J, Leung W, Shaffer CD, Buhler J, Mardis E, Elgin SC. Undergraduate research. Genomics education partnership. Science. 2008;322(5902):684–5.
Lundberg KS, Shoemaker DD, Adams MW, Short JM, Sorge JA, Mathur EJ. High-fidelity amplification using a thermostable DNA polymerase isolated from Pyrococcus furiosus. Gene. 1991;108(1):1–6.
Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008a;24(3):133–41.
Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008b;9:387–402.
Marks JL, Gong Y, Chitale D, Golas B, McLellan MD, Kasai Y, Ding L, Mardis ER, Wilson RK, Solit D, Levine R, Michel K, Thomas RK, Rusch VW, Ladanyi M, Pao W. Novel MEK1 mutation identified by mutational analysis of epidermal growth factor receptor signaling pathway genes in lung adenocarcinoma. Cancer Res. 2008;68(14):5524–8.
McDaniel L, Breitbart M, Mobberley J, Long A, Haynes M, Rohwer F, Paul JH. Metagenomic analysis of lysogeny in Tampa Bay: implications for prophage gene expression. PLoS One. 2008;3(9):e3263.
McLaughlin LW, Piel N, Graeser E. Donor activation in the T4 RNA ligase reaction. Biochemistry. 1985;24(2):267–73.
Merkens LS, Bryan SK, Moses RE. Inactivation of the 5'-3' exonuclease of Thermus aquaticus DNA polymerase. Biochim Biophys Acta. 1995;1264(2):243–8.
Middleton T, Herlihy WC, Schimmel PR, Munro HN. Synthesis and purification of oligoribonucleotides using T4 RNA ligase and reverse-phase chromatography. Anal Biochem. 1985;144(1):110–7.
Morin RD, Aksay G, Dolgosheina E, Ebhardt HA, Magrini V, Mardis ER, Sahinalp SC, Unrau PJ. Comparative analysis of the small RNA transcriptomes of Pinus contorta and Oryza sativa. Genome Res. 2008;18(4):571–84.
Moser MJ, Difrancesco RA, Gowda K, Klingele AJ, Sugar DR, Stocki S, Mead DA, Schoenfeld TW. Thermostable DNA polymerase from a viral metagenome is a potent rt-PCR enzyme. PLoS One. 2012;7(6):e38371.
Nandakumar J, Shuman S. Dual mechanisms whereby a broken RNA end assists the catalysis of its repair by T4 RNA ligase 2. J Biol Chem. 2005;280(25):23484–9.
Nandakumar J, Ho CK, Lima CD, Shuman S. RNA substrate specificity and structure-guided mutational analysis of bacteriophage T4 RNA ligase 2. J Biol Chem. 2004;279(30):31337–47.
Nelson D, Loomis L, Fischetti VA. Prevention and elimination of upper respiratory colonization of mice by group A streptococci by using a bacteriophage lytic enzyme. Proc Natl Acad Sci U S A. 2001;98(7):4107–12.
Notomi T, Okayama H, Masubuchi H, Yonekawa T, Watanabe K, Amino N, Hase T. Loop-mediated isothermal amplification of DNA. Nucleic Acids Res. 2000;28(12):E63.
Otto MR, Bloom LB, Goodman MF, Beechem JM. Stopped-flow fluorescence study of precatalytic primer strand base-unstacking transitions in the exonuclease cleft of bacteriophage T4 DNA polymerase. Biochemistry. 1998;37(28):10156–63.
Paulsen H, Wintermeyer W. Incorporation of 1,N6-ethenoadenosine into the 3' terminus of tRNA using T4 RNA ligase. 2. Preparation and ribosome
interaction of fluorescent Escherichia coli tRNAMetf. Eur J Biochem. 1984;138(1):125–30.
Pavlov AR, Karam JD. Nucleotide-sequence-specific and non-specific interactions of T4 DNA polymerase with its own mRNA. Nucleic Acids Res. 2000;28(23):4657–64.
Perez LE, Merrill GA, Delorenzo RA, Schoenfeld TW, Vats A, Moser MJ. Evaluation of the specificity and sensitivity of a potential rapid influenza screening system. Diagn Microbiol Infect Dis. 2012;75(1):77–80.
Petric A, Bhat B, Leonard NJ, Gumport RI. Ligation with T4 RNA ligase of an oligodeoxyribonucleotide to covalently-linked cross-sectional base-pair analogues of short, normal, and long dimensions. Nucleic Acids Res. 1991;19(3):585–90.
Petruska J, Hartenstine MJ, Goodman MF. Analysis of strand slippage in DNA polymerase expansions of CAG/CTG triplet repeats associated with neurodegenerative disease. J Biol Chem. 1998;273(9):5204–10.
Pulsinelli GA, Temin HM. Characterization of large deletions occurring during a single round of retrovirus vector replication: novel deletion mechanism involving errors in strand transfer. J Virol. 1991;65(9):4786–97.
Rand KN, Gait MJ. Sequence and cloning of bacteriophage T4 gene 63 encoding RNA ligase and tail fibre attachment activities. EMBO J. 1984;3(2):397–402.
Reha-Krantz LJ, Marquez LA, Elisseeva E, Baker RP, Bloom LB, Dunford HB, Goodman MF. The proofreading pathway of bacteriophage T4 DNA polymerase. J Biol Chem. 1998;273(36):22969–76.
Rehrauer WM, Bruck I, Woodgate R, Goodman MF, Kowalczykowski SC. Modulation of RecA nucleoprotein function by the mutagenic UmuD’C protein complex. J Biol Chem. 1998;273(49):32384–7.
Roberts JA, Bell SD, White MF. An archaeal XPF repair endonuclease dependent on a heterotrimeric PCNA. Mol Microbiol. 2003;48(2):361–71.
Schmidt CJ, Romanov M, Ryder O, Magrini V, Hickenbotham M, Glasscock J, McGrath S, Mardis E, Stein LD. Gallus GBrowse: a unified genomic database for the chicken. Nucleic Acids Res. 2008;36(Database issue):D719–23.
Schoenfeld T, Patterson M, Richardson PM, Wommack KE, Young M, Mead D. Assembly of viral metagenomes from Yellowstone hot springs. Appl Environ Microbiol. 2008;74(13):4164–74.
Shah JS, Liu J, Buxton D, Hendricks A, Robinson L, Radcliffe G, King W, Lane D, Olive DM, Klinger JD. Q-beta replicase-amplified assay for detection of Mycobacterium tuberculosis directly from clinical specimens. J Clin Microbiol. 1995;33(6):1435–41.
Sharp RL, May PC, Mayne NG, Snyder YM, Burnett JP. Cyclothiazide potentiates agonist responses at human AMPA/kainate receptors expressed in oocytes. Eur J Pharmacol. 1994;266(1):R1–2.
Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26(10):1135–45.
Snyder YM, Guthrie L, Evans GF, Zuckerman SH. Transcriptional inhibition of endotoxin-induced monokine synthesis following heat shock in murine peritoneal macrophages. J Leukoc Biol. 1992;51(2):181–7.
Srinivasiah S, Bhavsar J, Thapar K, Liles M, Schoenfeld T, Wommack KE. Phages across the biosphere: contrasts of viruses in soil and aquatic environments. Res Microbiol. 2008;159(5):349–57.
Staley JT, Konopka A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol. 1985;39:321–46.
Strauch E, Kaspar H, Schaudinn C, Damasko C, Konietzny A, Dersch P, Skurnik M, Appel B. Analysis of enterocoliticin, a phage tail-like bacteriocin. Adv Exp Med Biol. 2003;529:249–51.
Tabor S, Richardson CC. DNA sequence analysis with a modified bacteriophage T7 DNA polymerase. Proc Natl Acad Sci U S A. 1987;84(14):4767–71.
Tabor S, Richardson CC. A single residue in DNA polymerases of the Escherichia coli DNA polymerase I family is critical for distinguishing between deoxy- and dideoxyribonucleotides. Proc Natl Acad Sci U S A. 1995;92(14):6339–43.
Tabor S, Huber HE, Richardson CC. Escherichia coli thioredoxin confers processivity on the DNA polymerase activity of the gene 5 protein of bacteriophage T7. J Biol Chem. 1987;262(33):16212–23.
Tang M, Bruck I, Eritja R, Turner J, Frank EG, Woodgate R, O’Donnell M, Goodman MF. Biochemical basis of SOS-induced mutagenesis in Escherichia coli: reconstitution of in vitro lesion bypass dependent on the UmuD’2C mutagenic complex and RecA protein. Proc Natl Acad Sci U S A. 1998;95(17):9755–60.
Tang H, Yang X, Wang K, Tan W, Li H, He L, Liu B. RNA-templated single-base mutation detection based on T4 DNA ligase and reverse molecular beacon. Talanta. 2008;75(5):1388–93.
Truncaite L, Zajanckauskaite A, Arlauskas A, Nivinskas R. Transcription and RNA processing during expression of genes preceding DNA ligase gene 30 in T4-related bacteriophages. Virology. 2006;344(2):378–90.
van Dijk AA, Makeyev EV, Bamford DH. Initiation of viral RNA-dependent RNA polymerization. J Gen Virol. 2004;85(Pt 5):1077–93.
Voigt CA, Mayo SL, Arnold FH, Wang ZG. Computationally focusing the directed evolution of proteins. J Cell Biochem Suppl. 2001;37:58–63.
Vratskikh LV, Komarova NI, Yamkovoy VI. Solid-phase synthesis of oligoribonucleotides using T4 RNA ligase and T4 polynucleotide kinase. Biochimie. 1995;77(4):227–32.
Wang Y, Silverman SK. Efficient RNA 5'-adenylation by T4 DNA ligase to facilitate practical applications. RNA. 2006;12(6):1142–6.
Wang LK, Schwer B, Shuman S. Structure-guided muta- Xiang Z, Zhao Y, Mitaksov V, Fremont DH, Kasai Y,
tional analysis of T4 RNA ligase 1. RNA. Molitoris A, Ries RE, Miner TL, McLellan MD,
2006;12(12):2126–34. DiPersio JF, Link DC, Payton JE, Graubert TA,
Wang LK, Nandakumar J, Schwer B, Shuman S. The Watson M, Shannon W, Heath SE, Nagarajan R,
C-terminal domain of T4 RNA ligase 1 confers Mardis ER, Wilson RK, Ley TJ, Tomasson
specificity for tRNA repair. RNA. 2007;13(8): MH. Identification of somatic JAK1 mutations in
1235–44. patients with acute myeloid leukemia. Blood.
Wang X, Sun Q, McGrath SD, Mardis ER, Soloway PD, 2008;111(9):4809–12.
Clark AG. Transcriptome-wide identification of novel Yin S, Ho CK, Shuman S. Structure-function analysis of
imprinted genes in neonatal mouse brain. PLoS One. T4 RNA ligase 2. J Biol Chem. 2003;278(20):
2008;3(12):e3839. 17601–8.
Wommack KE, Bhavsar J, Ravel J. Metagenomics: read Yin S, Kiong Ho C, Miller ES, Shuman S. Characteriza-
length matters. Appl Environ Microbiol. 2008 tion of bacteriophage KVP40 and T4 RNA ligase 2.
Mar;74(5):1453–63. Virology. 2004;319(1):141–51.
Genome Atlases, Potential Applications in Study of Metagenomes

Asli Ismihan Ozen1 and David Wayne Ussery2
1The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
2Bioscience Division of Oak Ridge National Laboratory, Oak Ridge National Laboratory, Oak Ridge, TN, USA

Traditional microbiology has used a single-species approach, as in Koch’s postulates, where a bacterium is shown to be pathogenic by first isolating it from infected organisms, then growing it in monoculture, and finally reintroducing it into healthy individuals, where it causes the disease. In contrast, microbial ecology studies multispecies communities and their structure. Both of these areas have been very successful, and the two approaches are mirrored in comparative genomics, with the traditional analysis of single genomes versus many genomes or metagenomes isolated from an environment. It is possible to relate microbial ecology to reductionist, monoculture microbiology by comparing the two different data types. In this case, the reference is the single genome of an organism, and the other data type is the metagenome samples, in which most of the DNA in the environment is sampled. The comparisons are most reliable when the environmental DNA is in chunks containing at least several genes, obtained from fosmids, longer read lengths, or assembled short reads.

In recent years, a large amount of metagenomic data has become available in public databases such as CAMERA (Sun et al. 2011) and IMG/M (Markowitz et al. 2012). Some of these databases also provide analysis tools as Web servers, e.g., an implementation of BLAST (Altschul et al. 1990) or another fast alignment tool. This allows quick comparison of any sequence data against the metagenomes provided. However, if one is looking not for a single sequence but rather for chromosome-wide comparisons, interpreting the results can become difficult. A visualization tool such as BLAST Atlas (Hallin et al. 2008) is therefore a very useful way of looking at the conservation of proteins in various metagenomic samples along a given reference chromosome.

Figure 1 is a BLAST Atlas that illustrates this. An abundant ocean bacterium, Candidatus Pelagibacter ubique strain HTCC1062 (Giovannoni et al. 2005), has been chosen as the reference genome, to compare against several genomes and metagenome samples found in CAMERA Projects. P. ubique is a member of the Alphaproteobacteria, belongs to the SAR11 cluster, and is known to be a very common inhabitant of marine environments (García-Martínez and Rodríguez-Valera 2000; Brown et al. 2012). It is a free-living cell with a relatively small
[Figure: circular BLAST Atlas of P. ubique HTCC1062, 1,308,759 bp, with position ticks every 250 kb; labeled features include dnaA, “Surface,” and SAR11_0932]

Genome Atlases, Potential Applications in Study of Metagenomes, Fig. 1 A BLAST Atlas representing the comparison of the marine bacterium Pelagibacter ubique to the other four Pelagibacter genomes and seven metagenome samples. The six innermost lanes show the DNA properties of the reference genome P. ubique HTCC1062, followed by the genome’s annotation lane. Then come the BLAST lanes, where the BLAST result for the query genome against the reference is shown. The BLAST hit significance is indicated by the color intensity, where higher intensity corresponds to a more significant hit
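The color coding described in this caption can be sketched in a few lines of Python. This is only an illustration, assuming BLAST results in the standard 12-column tabular format (e-value in the eleventh column). The function names `best_hits` and `intensity`, the logarithmic scaling, and the `cutoff` and `floor` values are hypothetical choices, not the scheme used by BLAST Atlas itself.

```python
import math


def best_hits(blast_rows):
    """Return the lowest e-value seen for each query protein.

    Each row is a list of the 12 tabular BLAST columns
    (qseqid, sseqid, pident, ..., evalue, bitscore).
    """
    best = {}
    for row in blast_rows:
        query, evalue = row[0], float(row[10])
        if query not in best or evalue < best[query]:
            best[query] = evalue
    return best


def intensity(evalue, cutoff=1e-5, floor=1e-100):
    """Map an e-value to a color intensity between 0 and 1.

    Hits weaker than `cutoff` get 0 (grey, no match); e-values at or
    below `floor` get 1 (darkest). Both thresholds are assumptions.
    """
    if evalue > cutoff:
        return 0.0
    log_e = math.log10(max(evalue, floor))
    span = math.log10(cutoff) - math.log10(floor)
    return (math.log10(cutoff) - log_e) / span
```

Applying `intensity` to the best hit of each reference protein, in genome order, gives one lane of the atlas for one metagenome; proteins with no entry in `best_hits` would simply be drawn grey.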
genome of 1.3 Mbp, first isolated from the Sargasso Sea (Giovannoni et al. 1990), and requires added reduced sulfur for growth (Tripp et al. 2008). The genome comparisons in this study include other Pelagibacter species and Pelagibacterium halotolerans B2 (Huo et al. 2012). Note the darker green colors for the P. ubique lane and for the other closely related Pelagibacter species. However, apart from the reference strain, some regions of missing genes (gaps) can be seen.

The metagenome projects chosen are Moore Marine Microbial Sequencing (Sun et al. 2011), Global Ocean Sampling (GOS) (Yooseph et al. 2007), Whale Fall (Tringe et al. 2005), Acid Mine Drainage (Tyson et al. 2004), Microbial Community Genomics at the HOT/ALOHA (DeLong et al. 2006), Waseca County Farm Soil (Tringe et al. 2005), and Washington Lake (Kalyuzhnaya et al. 2008). In all comparisons, the P. ubique proteins were compared against the metagenomes using the BLAST
tool of the database itself with default parameters, and the results were then visualized with BLAST Atlas. The Moore Marine Microbes, GOS, and HOT/ALOHA samples have protein annotations; therefore, a BLASTP search was used. The other metagenomes are assembled but not annotated, so a TBLASTN comparison was made. Metagenomes that are not assembled were not used in this study, because protein comparison against metagenome reads was not very reliable.

In the BLAST Atlas, the six innermost lanes show some of the DNA properties (Jensen et al. 1999; Pedersen et al. 2000) of the reference chromosome, P. ubique HTCC1062. These are, from innermost to outermost: the average AT percentage (averaged over 10,000 bp), GC skew (10,000 bp average), global direct repeats, nucleosome position preference (green regions represent chromatin-free areas; Satchwell et al. 1986; Baldi et al. 1996), DNA helix stacking energy (on this scale, red regions will melt more readily, and green regions are more stable; Ornstein et al. 1978), and intrinsic curvature (blue indicates highly curved areas, and yellow indicates low levels of curvature; Bolshoy et al. 1991; Shpigelman et al. 1993). The next lane outward shows the annotations: coding sequences on the plus and minus strands. After the annotations, the BLAST lanes start, which show the BLAST hits at each position. The color intensity indicates how good a BLAST hit is, with darker colors representing regions of conserved proteins and grey areas containing poor or no matches. The first BLAST lane is P. ubique itself, as a control. The next few lanes are other Pelagibacter sp., and they show high resemblance to the reference P. ubique. The fifth lane is a Pelagibacterium, which should not be confused with Pelagibacter because it is classified in a completely different clade of the Alphaproteobacteria, as can be seen from the low protein similarity. However, its BLAST hit profile still resembles that of the other Pelagibacter sp.

According to this figure, almost all the coding genes of P. ubique are found in the CAMERA Marine Microbes samples, and most are also found in the GOS data, which means that the bacterium is present in these environments, as expected. One of the gap regions, around 510–564 kb, contains genes related to amino sugar metabolism (rfaD, rfaE), the pentose phosphate pathway (tktC), lipopolysaccharide synthesis (gmhA and gmhB), streptomycin biosynthesis (rpbB), and transferase activity (spsA, rfaG, rfaK). This gap region, together with a few bases downstream, is marked as “surface” because the area contains proteins related to surface features (ompS, LPS biosynthesis, etc.). Another gap includes a “giant protein” (Strom et al. 2012), annotated as “hypothetical protein SAR11_0932,” which is 7,317 amino acid residues long. This protein appears to be partially found in the Marine Microbes and GOS metagenomes (dark blue lane) because of the many repeat regions in the protein, which may resemble regions in proteins of other genomes. The whole protein itself is not found because it varies even within the same species; these “giant proteins” are known to be variable and are thought to be involved in protection against viral attacks as well as predation by protists (Strom et al. 2012). Some of the other gaps are due to tRNA or rRNA genes, because the BLAST lanes compare only protein sequences. In the other metagenome BLAST lanes, the BLAST hits are very weak, meaning that the P. ubique genes compared here are not present in those metagenome samples.

In summary, BLAST Atlas is a way to visualize the mapping of bacterial genomes against metagenomes, and it can be used to compare many different environments. If a certain protein, a set of proteins, or a genomic region is being investigated, this tool helps to establish the presence or absence of those proteins. It is also possible to zoom in on desired ranges of the genome to see local differences (Hallin et al. 2008).

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Baldi P, Brunak S, Chauvin Y, Krogh A. Naturally occurring nucleosome positioning signals in human exons and introns. J Mol Biol. 1996;263(4):503–10.
Bolshoy A, McNamara P, Harrington RE, Trifonov EN. Curved DNA without A-A: experimental estimation of all 16 DNA wedge angles. Proc Natl Acad Sci USA. 1991;88:2312–6.
Brown MV, Lauro FM, DeMaere MZ, et al. Global biogeography of SAR11 marine bacteria. Mol Syst Biol. 2012;8:595.
DeLong EF, Preston CM, Mincer T, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311(5760):496–503.
García-Martínez J, Rodríguez-Valera F. Microdiversity of uncultured marine prokaryotes: the SAR11 cluster and the marine Archaea of group I. Mol Ecol. 2000;9(7):935–48.
Giovannoni SJ, Britschgi TB, Moyer CL, Field KG. Genetic diversity in Sargasso Sea bacterioplankton. Nature. 1990;345(6270):60–3.
Giovannoni SJ, Tripp HJ, Givan S, et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science. 2005;309(5738):1242–5.
Hallin PF, Binnewies TT, Ussery DW. The genome BLAST atlas – a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst. 2008;4(5):363–71.
Huo Y-Y, Cheng H, Han X-F, et al. Complete genome sequence of Pelagibacterium halotolerans B2(T). J Bacteriol. 2012;194(1):197–8.
Jensen LJ, Friis C, Ussery DW. Three views of microbial genomes. Res Microbiol. 1999;150(9–10):773–7.
Kalyuzhnaya MG, Lapidus A, Ivanova N, et al. High-resolution metagenomics targets specific functional types in complex microbial communities. Nat Biotechnol. 2008;26(9):1029–34.
Markowitz VM, Chen I-MA, Chu K, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012;40(Database issue):D123–9.
Ornstein RL, Rein R, Breen DL, MacElroy R. An optimised potential function for the calculation of nucleic acid interaction energies. I. Base stacking. Biopolymers. 1978;17:2341–60.
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW. A DNA structural atlas for Escherichia coli. J Mol Biol. 2000;299(4):907–30.
Satchwell SC, Drew HR, Travers AA. Sequence periodicities in chicken nucleosome core DNA. J Mol Biol. 1986;191(4):659–75.
Shpigelman ES, Trifonov EN, Bolshoy A. Curvature: software for the analysis of curved DNA. Comput Appl Biosci. 1993;9:435–40.
Strom SL, Brahamsha B, Fredrickson KA, Apple JK, Rodríguez AG. A giant cell surface protein in Synechococcus WH8102 inhibits feeding by a dinoflagellate predator. Environ Microbiol. 2012;14(3):807–16.
Sun S, Chen J, Li W, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39(Database issue):D546–51.
Tringe SG, von Mering C, Kobayashi A, et al. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–7.
Tripp HJ, Kitner JB, Schwalbach MS, et al. SAR11 marine bacteria require exogenous reduced sulphur for growth. Nature. 2008;452(7188):741–4.
Tyson GW, Chapman J, Hugenholtz P, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428(6978):37–43.
Yooseph S, Sutton G, Rusch DB, et al. The Sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biol. 2007;5(3):e16.

Genome Portal, Joint Genome Institute

Igor V. Grigoriev, Susannah Tringe and Inna Dubchak
US Department of Energy Joint Genome Institute, Walnut Creek, CA, USA

Synonyms

Comparative genomics; Data integration; Genome analysis; Genome projects; Metagenomics

Definition

The US Department of Energy (DOE) Joint Genome Institute (JGI) is a national user facility with massive-scale DNA sequencing and analysis capabilities dedicated to advancing genomics for bioenergy and environmental applications. The JGI Genome Portal is an integrated genomic resource that gives the research community around the world access to a large collection of genomic data for plants, fungi, microbes, and metagenomes and to web-based interactive tools for their analysis.

Introduction

The Department of Energy (DOE) Joint Genome Institute (JGI) was established for the Human
Genome Project (Lander et al. 2001) and later was transformed into a national user facility for genome research in the DOE mission areas of bioenergy, carbon cycling, and biogeochemistry. JGI provides expertise and resources in DNA sequencing, technology development, and bioinformatics to the broader scientific community. Scientists around the world can make proposals to the JGI Community Sequencing Program (CSP; e.g., Martin et al. 2011) to sequence genomes, transcriptomes, and metagenomes and address important scientific questions of DOE mission relevance. Massive amounts of genomic data are assembled, annotated, and delivered to users by means of integrated databases and interactive analytical tools interconnected within the JGI Genome Portal (http://genome.jgi.doe.gov; Grigoriev et al. 2012).

Leading the world in the number of sequenced plants, fungi, microbes, and metagenomes (according to the Genomes Online Database (GOLD; Pagani et al. 2012)), JGI has dramatically increased its sequencing capabilities using new sequencing technologies. JGI projects evolved from sequencing three of the human chromosomes (Lander et al. 2001) to large-scale “Grand Challenge” projects such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA; Wu et al. 2009), the 1,000 Fungal Genome Project (Grigoriev et al. 2011), and the metagenomic projects targeting soil and rhizosphere. Since tracking individual organisms and samples at such a scale becomes critical, genomes and metagenomes sequenced or selected for sequencing are carefully catalogued and made available to the public along with their status and links to the produced data and available tools.

The sequenced data are assembled, annotated, and analyzed using various computational pipelines developed for each of the products delivered by JGI to its users. The resulting annotations are available for download and can also be viewed interactively using the JGI Genome Portal, which offers a wide array of databases and analytical systems to interpret the data. Some systems work across multiple JGI databases, while others allow users to specifically manage datasets on plants (Phytozome; Goodstein et al. 2012), fungi (MycoCosm; Grigoriev et al. 2012), microbes (Integrated Microbial Genomes or IMG; Markowitz et al. 2012b), and metagenomes (IMG/M; Markowitz et al. 2012a).

The JGI Genome Portal provides a unified access point to all JGI genomic databases and analytical tools, as well as worldwide statistics on the usage of the JGI resources and information about the latest genome releases and new tool development. A user can find all DOE JGI sequencing projects and their status, search for and download raw data, assemblies, and annotations of sequenced genomes and metagenomes, and interactively explore those datasets and compare them with other sequenced microbes, fungi, plants, or metagenomes using specialized systems tailored to each particular class of organisms. All these can serve as building blocks in comprehensive analyses of individual organisms or systems of interacting organisms.

A Catalogue of Genome Sequencing Projects

Metagenomic analysis requires reference genomes for better interpretation of sequence data derived from complex microbial communities. The democratization of sequencing allows many scientists to sequence appropriate genome references in their own labs prior to approaching metagenomes. Consolidation of genomic data sequenced in different places around the world is an important step in both genomics and metagenomics.

JGI’s collection of genomic projects includes thousands of projects of different types and is publicly available and searchable. Product types include standard or improved genome drafts, finished genomes, gene expression profiling, resequencing, metagenome projects, and others. The Project List (http://genome.jgi.doe.gov/genome-projects) is available from most of the Portal pages as a menu item and includes a detailed description of each project including its scope and current status, taxon, the JGI program, and the project lead. The Resources
column lists tools available for this project. Some of these tools, e.g., download, are available for all genomes, while others are taxon, project type, or stage dependent. For example, a plant or fungal genome will be linked to Phytozome or MycoCosm, respectively.

All JGI projects are also registered in the GOLD database, which includes a larger collection of projects sequenced around the world (Pagani et al. 2012). Currently it contains a list of about 16,000 genomes, including over 3,000 that are complete, and over 2,000 metagenomes. Besides its utility for metagenomics, having a comprehensive list of sequencing projects from all laboratories around the world also helps to avoid redundancy when sequencing targets are selected for large-scale projects like GEBA or 1,000 Fungal Genomes.

Annotated Genomes and Metagenomes

Finding genes in metagenomes is challenging, especially for eukaryotes with their complex intron-exon gene structure, and often relies on gene prediction based on similarity to proteins from other organisms. This requires a comprehensive collection of genes from different organisms across all domains of life. Besides the human genome (Lander et al. 2001), JGI sequenced and annotated genomes of the first poplar tree (Tuskan et al. 2006) and its ectomycorrhizal symbiont (Martin et al. 2008); lignocellulose degrading fungi (Berka et al. 2011; Eastwood et al. 2011) and microbial communities (Hess et al. 2011); diverse eukaryotes, often the first representatives of the Tree of Life branches (Tyler et al. 2006; Bowler et al. 2008; King et al. 2008; Fritz-Laylin et al. 2010; Colbourne et al. 2011); and prokaryotes (Wu et al. 2009) as well as soil (Tringe et al. 2005) and ocean metagenomes (Walsh et al. 2009). There are over 3,000 annotated reference genomes in the JGI database and three ways to find a particular genome of interest: using an interactive Tree of Life, search, and select functions.

The Tree of Life organizes the annotated genomes by the domains of life and links to Organism home pages. Clicking on a branch name produces a menu displaying available genomes in this kingdom, phylum, class, or order (Fig. 1), each connected to pages in different analytical resources. The same pages can be reached in a step-by-step genome selection from a hierarchical selection menu on the top of the page or by searching for genomes by keyword (e.g., plants, Eukaryota), name, taxonID, or projectID.

Each of the genomic datasets can be analyzed with a collection of tools linked directly to their genome databases. Each organism’s home page contains a description of the project, BLAST, download, and links to specialized resources as described in the next section.

Comparative Databases and Tools

Comparative genomics is a more powerful approach for functional annotation and evolutionary studies of genomes than analysis of individual genome sequences. It is also a primary method for annotation and analysis of metagenomes. The JGI Genome Portal includes a set of efficient comparative tools, such as gene clustering, whole-genome alignment, and building phylogenetic trees, that are used across different genomic resources at JGI. VISTA Point (http://genome.lbl.gov/vista) is an example of such tools. It was designed for visualization and analysis of pairwise and multiple DNA alignments (Frazer et al. 2004) at different levels of resolution in three visualization modes: (a) VISTA Browser, for visual comparative analysis of complete genome assemblies using pairwise and multiple large-scale alignments; (b) VISTA Synteny Viewer, a multi-tiered graphical display of pairwise alignments at three different levels of resolution; and (c) VistaDot, an interactive two-dimensional dot-plot genome synteny viewer across multiple chromosomes/scaffolds. Several specialized domain-specific computational systems for comparative genome analysis built at JGI include Phytozome, a comparative hub for plant genome and gene family data and analysis; MycoCosm to enable users to navigate across sequenced fungal
Genome Portal, Joint Genome Institute, Fig. 1 The JGI Genome Portal. A pull-down menu for the “Marine” category of Metagenomes is shown. BLAST and Download functions are available for the entire selected group. Each genome is linked to the associated resources. “Project list” on the top leads users to the list of all sequencing projects at the DOE JGI. The bottom portion of the page connects to the specialized databases in microbes (IMG) and metagenomes (IMG/M), fungi (MycoCosm), and plants (Phytozome)
genomes and to conduct comparative and genome-centric analyses and community annotation; and the IMG family of tools for large-scale comparative analysis of microbial genomes and metagenomes.

Phytozome (http://phytozome.net; Goodstein et al. 2012) gives access to the sequences and functional annotations of a growing number of complete plant genomes (31 in release v8.0), including land plants and selected algae. Phytozome provides both organism-centric and gene family-centric views as well as access to the BLAST, BLAT, and Search capabilities. Phytozome provides a view of the evolutionary history of every plant and every plant gene at the level of sequence, gene structure, gene family, and genome organization. The Phytozome project organizes the proteomes of green plants into gene families defined at the nodes on the green plant evolutionary tree. Genes have been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, TAIR, and JGI are hyperlinked and searchable. The gene family view gives access to the information on each family and its members, organized to highlight shared attributes.

GBrowse provides genome-centric views for all genomes included in Phytozome. Each organism browser displays a number of tracks including a gene prediction track, a track of homologous sequences from related species aligned against the genome, supporting EST and VISTA tracks identifying regions of this genome that are syntenic with other plant genomes.

MycoCosm (http://jgi.doe.gov/fungi; Grigoriev et al. 2012) brings together genomic data and analytical tools for diverse fungi that are important for energy and environment. Genomic data from the JGI and its users are integrated and curated via user community participation in data submission, curation, annotation, and analysis. Over 150 newly sequenced and annotated fungal genomes are available to the public through MycoCosm for genome-centric and comparative analyses. Visual navigation across the MycoCosm tree (Fig. 2b), where each node represents a group of phylogenetically related fungi and is linked to analysis tools, allows users to redefine the search and analysis space from a single organism to the entire list of fungal genomes.

The Genome browser with configurable selection of tracks displays predicted gene models and annotations along with different lines of evidence in support of these predictions, such as gene and protein expression profiles. Gene models and annotations are linked to community annotation tools to revise them if needed. Functional profiles of each genome summarize gene annotations according to the GO, KEGG, and KOG classifications and can be compared with each other to study gene family expansions or contractions at different levels of granularity. Clustering using BLAST alignments of all proteins and MCL can expand these analyses to gene families even without annotation and enable side-by-side comparison of each of the cluster members for pattern of protein domains, intron-exon structure, and synteny.

MycoCosm comparative views combine the abovementioned tools to study entire groups of genomes corresponding to MycoCosm nodes. Unlike the genome-centric view, there is no reference genome in this analysis, and, for example, a keyword or BLAST search for protein kinases in Basidiomycota or Ascomycota will show differences in the number of found genes or BLAST hits across different members of these phyla.

IMG, the Integrated Microbial Genomes database (http://img.jgi.doe.gov; Markowitz et al. 2012a, b), is a system designed for flexible comparative analyses of microbial genomic data, which incorporates all complete public microbial genomes as well as those sequenced at JGI. IMG with microbiome samples (IMG/M) is an expanded database that includes metagenome data from diverse environments, both sequenced at JGI and submitted by external users.

In addition to importing all public genomes and their annotations from NCBI’s RefSeq, IMG curates the data by adding features missed by many annotation pipelines, such as small RNAs; assigning proteins and domains to all major protein family databases (e.g., COG, TIGRfam); and linking to organism metadata stored in GOLD, such as oxygen requirements or environment of origin. Annotations can be viewed in detailed gene pages or summarized in genome pages that
Genome Portal, Joint Genome Institute, Fig. 2 Comparative genomic resources at JGI: (a) Phytozome for plants, (b) MycoCosm for fungi, and (c) IMG family of tools
include organism metadata in addition to statistics on genome size and gene counts within various categories.

The tools available in IMG allow for analyses at the gene, function, or genome level, using customizable “carts” for each of these data types. Thus, any given analysis can readily be performed on a single (meta)genome or several and can be extended to many individual genes, functions, or pathways. IMG/M includes a number of metagenome-specific functions, including the option to account for different organism abundances by weighting comparative analyses according to estimated gene copies, based on the contig read coverage reported in the assembly rather than simple gene counts. It also includes a “scaffold cart” for exploring genes within a given set of contigs or scaffolds as well as the option to categorize contigs/scaffolds into population “bins” based on oligonucleotide composition or other features.

Recent developments in IMG and IMG/M include the capacity to add and view (meta)transcriptome and (meta)proteome data in the context of a reference and compare expression profiles across experiments.

Metagenome Analysis

Analysis of metagenome data presents a number of challenges beyond those faced in isolate
Genome Portal, Joint Genome Institute, Fig. 3 Metagenomic analysis. A protein recruitment plot showing
alignment of genes from a hot spring sample to genomes from the family Hydrogenothermaceae
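A recruitment plot of this kind reduces to a set of points, one per aligned gene, giving the position on the reference genome and the identity of the alignment. The sketch below is a minimal illustration and not the IMG/M implementation; the input layout, the function name `recruitment_points`, and the `min_identity` threshold are assumptions made for the example.

```python
def recruitment_points(hits, gene_coords, min_identity=30.0):
    """Build (position, percent identity) points for a recruitment plot.

    hits        -- iterable of (metagenome_gene, reference_gene, percent_identity)
    gene_coords -- dict mapping reference_gene to its (start, end) on the genome
    """
    points = []
    for _, ref_gene, identity in hits:
        if ref_gene not in gene_coords or identity < min_identity:
            continue  # skip unknown reference genes and weak alignments
        start, end = gene_coords[ref_gene]
        # Place each hit at the midpoint of the reference gene it aligns to.
        points.append(((start + end) / 2, identity))
    return sorted(points)
```

Plotting the first element of each point on the horizontal axis and the second on the vertical axis gives the usual picture: dense bands at high identity where the sample contains close relatives of the reference, and empty stretches where reference genes recruit nothing.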
genome analysis, in particular the wide variation in individual organism abundances and the shallow coverage of low-abundance, but nonetheless biologically important, taxa. Both of these tend to result in highly fragmented assemblies, which are most readily interpreted when high-quality reference genome data are available.

Most metagenome analyses approach the data from either a phylogenetic perspective (i.e., who is there?) or a functional one (i.e., what are they doing?). Each of these uses a specific suite of tools, though nearly all rely on a well-curated database of genes with known phylogenies and functions. For phylogenetic analysis, genes or gene fragments are assigned to phylogenetic lineages based on homology to genes of known phylogenetic origin. This can be done for all genes from a metagenome dataset, for example, using
MEGAN (Huson and Mitra 2012), or for a set of large amounts of genomic data being produced in
conserved phylogenetic markers which can be different parts of the world. Effective analysis of
placed onto a tree of known sequences from isolate genomic and metagenomic data depends on the
genomes and/or amplified from uncultivated availability of comprehensive catalogues of ref-
organisms, for example, using pplacer (Matsen erence genome data for annotation and compara-
et al. 2012). IMG/M allows for both approaches – tive genomics as well as computational tools able
an overall perspective of all the genes in a dataset to process the large amounts of sequence data.
or on a specific set of contigs is provided through The JGI Genome Portal (http://genome.jgi.doe.
the “Phylogenetic Distribution of Genes” option gov) provides a unified access point to all JGI
on the main metagenome page or in the scaffold genomic databases and analytical tools including
cart, and genes with homology to particular phyla, list of sequencing projects at JGI and around the
families, genera, or species can be retrieved. When world, a comprehensive collection of annotated
there are good reference genomes available, align- genomes in all domains of life, and specialized
ments of protein-coding genes to those genomes databases for comparative analysis of plant, fun-
can be viewed in a recruitment plot (Fig. 3). gal, and microbial genomes and metagenomes.
Phylogenetic marker genes can also be extracted The latter is still in early stages of development,
and incorporated into trees using the “Phyloge- and data generated at unprecedented scale and
netic Marker COGs” option under the “Find complexity for metagenomes will require new
Functions” tab. approaches to data processing, analysis, and
Functional or “gene-centric” approaches visualization.
enable the comparison of metagenome datasets
at the functional level to both assess their relative
similarity and identify genes or functions that are
References
over- or underrepresented in a given dataset. This
type of approach is utilized by metagenome anal- Berka RM, Grigoriev IV, Otillar R, et al.
ysis systems like MG-RAST (Meyer et al. 2008). Comparative genomic analysis of the thermophilic
IMG/M provides several options for whole biomass-degrading fungi Myceliophthora thermophila
and Thielavia terrestris. Nat Biotechnol. 2011;29:
metagenome comparisons. Metagenomes can be
922–7.
clustered (under the “Compare Genomes” tab) Bowler C, Allen AE, Badger JH, et al. The Phaeodactylum
according to gene content, using either functional genome reveals the evolutionary history of diatom
(e.g., COG, Pfam) or phylogenetic criteria, and genomes. Nature. 2008;456:239–44.
Colbourne JK, Pfrender ME, Gilbert D, et al. The
the results visualized via hierarchical clustering,
ecoresponsive genome of Daphnia pulex. Science.
principal components analysis (PCA), or 2011;331:555–61.
a correlation matrix. Relative abundances of spe- Eastwood DC, Floudas D, Binder M, et al. The plant
cific gene families can be viewed via the abun- cell wall-decomposing machinery underlies the
functional diversity of forest fungi. Science.
dance profile function also under the “Compare
2011;333:762–5.
Genomes” tab. As mentioned above, these com- Frazer KA, Pachter L, Poliakov A, et al. VISTA: compu-
parisons can be made between partly assembled tational tools for comparative genomics. Nucleic
genomes by taking contig read depth into account Acids Res. 2004;32:W273–9.
Fritz-Laylin LK, Prochnik SE, Ginger ML, et al. The
when calculating gene abundance.
genome of Naegleria gruberi illuminates early eukary-
otic versatility. Cell. 2010;140:631–42.
Goodstein DM, Shu S, Howson R, et al. Phytozome:
Summary a comparative platform for green plant genomics.
Nucleic Acids Res. 2012;40:D1178–86.
Grigoriev IV, Cullen D, Goodwin SB, et al. Fueling
Technological innovations leading to the democ- the future with fungal genomics. Mycology. 2011;2:
ratization of genome sequencing have resulted in 192–209.
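
The read-depth adjustment mentioned in the discussion of IMG/M abundance profiles (taking contig read depth into account when calculating gene abundance) can be illustrated with a minimal sketch. This is not IMG/M's implementation, which is not described here. The function name `depth_weighted_abundance`, the input layout, and the rule that an unassembled read counts as one copy are all assumptions made for illustration.

```python
from collections import defaultdict

def depth_weighted_abundance(genes, contig_depth):
    """Hypothetical helper: relative abundance of gene families in one dataset.

    genes        -- iterable of (sequence_id, family) pairs, one per predicted gene
    contig_depth -- dict mapping contig id -> mean read depth of that contig
    """
    totals = defaultdict(float)
    for sequence_id, family in genes:
        # Assumption: a gene on an assembled contig represents about `depth`
        # copies, while a gene on an unassembled read counts as one copy.
        totals[family] += contig_depth.get(sequence_id, 1.0)
    grand_total = sum(totals.values())
    return {family: count / grand_total for family, count in totals.items()}

# Toy data: two contigs with different coverage plus one unassembled read.
genes = [
    ("contig_1", "COG0001"),
    ("contig_1", "COG0002"),
    ("contig_2", "COG0001"),
    ("read_17", "COG0002"),
]
contig_depth = {"contig_1": 10.0, "contig_2": 4.0}
profile = depth_weighted_abundance(genes, contig_depth)
```

Without the weighting, both families would appear equally abundant (two genes each); with it, the family carried on the more deeply covered contigs receives the larger share, which is the point of the adjustment when assembled and unassembled datasets are compared.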
Genome-Based Studies of Marine Microorganisms 231 G
Grigoriev IV, Nordberg H, Shabalov I, et al. The genome portal of the Department of Energy Joint Genome Institute. Nucleic Acids Res. 2012;40:D26–32.
Hess M, Sczyrba A, Egan R, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331:463–7.
Huson DH, Mitra S. Introduction to the analysis of environmental sequences: metagenomics with MEGAN. Methods Mol Biol. 2012;856:415–29.
King N, Westbrook MJ, Young SL, et al. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature. 2008;451:783–8.
Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
Markowitz VM, Chen IM, Chu K, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012a;40:D123–9.
Markowitz VM, Chen IM, Palaniappan K, et al. IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res. 2012b;40:D115–22.
Martin F, Aerts A, Ahren D, et al. The genome of Laccaria bicolor provides insights into mycorrhizal symbiosis. Nature. 2008;452:88–92.
Martin F, Cullen D, Hibbett D, et al. Sequencing the fungal tree of life. New Phytol. 2011;190:818–21.
Matsen FA, Hoffman NG, Gallagher A, et al. A format for phylogenetic placements. PLoS One. 2012;7:e31009.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Pagani I, Liolios K, Jansson J, et al. The Genomes OnLine Database (GOLD) v. 4: status of genomic and metagenomic projects and their associated metadata.
Tringe SG, von Mering C, Kobayashi A, et al. Comparative metagenomics of microbial communities. Science. 2005;308:554–7.
Tuskan GA, Difazio S, Jansson S, et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006;313:1596–604.
Tyler BM, Tripathy S, Zhang X, et al. Phytophthora genome sequences uncover evolutionary origins and mechanisms of pathogenesis. Science. 2006;313:1261–6.
Walsh DA, Zaikova E, Howes CG, et al. Metagenome of a versatile chemolithoautotroph from expanding oceanic dead zones. Science. 2009;326:578–82.
Wu D, Hugenholtz P, Mavromatis K, et al. A phylogeny-driven genomic encyclopaedia of bacteria and archaea. Nature. 2009;462:1056–60.

Genome-Based Studies of Marine Microorganisms

Xinqing Zhao¹, Chao Chen², Liangyu Chen², Yumei Wang² and Xiang Geng²
¹School of Life Science and Biotechnology, Dalian University of Technology, Dalian, People’s Republic of China
²Dalian University of Technology, Dalian, China

Synonyms

Genome mining of marine microorganisms

Definition

Genome-based studies of marine microorganisms utilize genetic information retrieved from the genomic sequences of marine microorganisms to guide the discovery of useful enzymes and natural products from these organisms. Chemical structures of natural products potentially synthesized by marine microorganisms can be predicted by aligning the biosynthetic genes with known gene sequences that are responsible for the biosynthesis of natural products, and the physicochemical properties (UV spectrum, molecular weight, polarity, etc.) obtained from the prediction can be used to guide further purification and structure elucidation of the compounds. When the genes or gene clusters of interest are not expressed, or are expressed only at a low level, various methods can be employed to activate the expression of biosynthetic genes. Identification of target natural products can be achieved by comparative metabolic profiling, heterologous expression, and other genome-mining strategies. For unculturable or yet-uncultured marine microbes in given environments, metagenomic, metatranscriptomic, and metaproteomic sequences can be employed. Function-based or
sequence-based screening of metagenomic libraries is subsequently performed to identify novel enzymes and natural products.

Introduction

Marine microorganisms are important sources of novel natural products and industrial enzymes, and many unique small molecules and proteins produced by marine microorganisms have been reported in recent years, facilitating novel drug discovery, agricultural biocontrol, and industrial applications. In the case of marine natural products, it has become clear that a vast diversity of chemistry can be explored from marine microorganisms, mainly marine bacteria and marine fungi (Imhoff et al. 2011). However, bioassay-guided screening of natural products has limitations in identifying compounds with novel functions that are not readily assayed, as well as in discovering novel compounds that exist in low amounts or are not produced at all under normal culture conditions. In addition, some marine microbes may grow very slowly under laboratory conditions or be unculturable using currently available methods. Therefore, it is important to develop new strategies to fully explore the biosynthetic potential of marine microorganisms.

The development of high-throughput sequencing technologies has facilitated the exploration of the full biosynthetic potential of marine microorganisms. It has become increasingly evident through the analysis of abundantly available genomic and metagenomic sequences that microorganisms have a much greater potential to produce various metabolites than was expected. Comparison of the known secondary metabolites with the genomic sequences of several actinobacteria led to the estimate that as much as 90 % of the biosynthetic potential of actinomycetes remains undiscovered (Wilkinson and Micklefield 2007). The available genomic sequences of marine microorganisms enable us to rapidly identify useful enzymes and natural products by genome mining.

Genome Mining for Natural Product Discovery

Typically, biosynthetic genes of small molecules in microorganisms are clustered together in the genome to form gene clusters, and bioinformatic analysis allows the rapid identification of gene clusters similar to the known ones, thus speeding up the discovery of natural products. Genome mining involves prediction of the biosynthetic potential of organisms by analyzing their genomic sequences, followed by screening or activation of enzymes and natural product biosynthesis by process optimization and/or genetic manipulations (Scheffler et al. 2013). Two types of small molecules, those encoded by multimodular polyketide synthases (PKS) and by non-ribosomal peptide synthetases (NRPS), have been the main focus. The biosynthesis of many polyketides and non-ribosomal peptides follows a colinearity rule, in which the product is assembled according to the number and type of domains within the enzymes, which makes it possible to predict the molecule structures (Winter et al. 2011; Nikolouli and Mossialos 2012).

Similar to the genome scanning method (Zazopoulos et al. 2003), genome mining has the limitation that only genes with functions similar to those of known ones are targeted; new or unusual pathways are poorly explored. However, the presence of PKS and NRPS genes is a good indication of natural products with a possibly broad spectrum of activities (Nikolouli and Mossialos 2012).

Genome mining of microorganisms started in 2000, with the identification of coelichelin as one of the first examples (Challis and Ravel 2000), while the first compound identified in marine actinobacteria by genome mining is the polyene macrolactam salinilactam A from Salinispora tropica (Udwary et al. 2007). The structures of coelichelin and salinilactam A are shown in Fig. 1.

Various genome-mining techniques have been reviewed elsewhere (Scheffler et al. 2013). Prediction of gene functions and chemical structures can be achieved using computer programs such as BLAST and THREADER, as well as other useful
bioinformatic tools such as antiSMASH and NP.searcher (Nikolouli and Mossialos 2012). Due to the limited knowledge of enzymatic functions and metabolic cross talk, the prediction of chemical structures is not always correct, and accurate annotation of gene functions and prediction of chemical structures require more advanced bioinformatic tools.

When the biosynthetic genes are actively expressed under lab conditions, information on the physicochemical properties of the target molecules, such as UV spectrum, molecular weight, and polarity, obtained from the bioinformatic prediction can be used to guide the further purification of the compounds. Thailandamide A was discovered by genome mining of Burkholderia thailandensis (Nguyen et al. 2008). Being temperature and light sensitive and also being produced in the early growth stage, thailandamide A might not have been identified using classical methods without the genome-guided isolation (Nguyen et al. 2008). The structure of thailandamide A is shown in Fig. 1.

Genome-Based Studies of Marine Microorganisms, Fig. 1 Structures of compounds discovered by genome mining (coelichelin, salinilactam, and thailandamide A)

The genomisotopic approach was first described with the discovery of orfamides from Pseudomonas fluorescens (Gross et al. 2007); it relies on feeding stable isotope-labeled amino acid precursors into the culture broth and subsequently detecting the labeled molecule to identify NRPS or mixed PKS/NRPS compounds.

Although low-level production of target molecules can be identified by the genomisotopic method, some metabolites are only produced under special circumstances; activation of production of these molecules requires mimicking specific nutritional, environmental, and biological conditions, such as special carbon and nitrogen sources, high temperature, UV irradiation, osmotic stress treatments, and coculture with another microbial strain (Scherlach and Hertweck 2009). In addition, genetic methods can also be employed to activate production of certain metabolites identified by genome mining, including overexpression of activation regulators and deletion of repressive regulators (Scheffler et al. 2013). Heterologous expression of the entire gene cluster in well-defined host strains, including E. coli, Streptomyces, Bacillus, and Saccharomyces cerevisiae, has also been employed in genome mining (Zhang et al. 2011). Selection of suitable host strains and expression vectors is critical to achieve
heterologous production of target active molecules. The scheme of genome mining is depicted in Fig. 2.

Genome-Based Studies of Marine Microorganisms, Fig. 2 Genome mining for identification of natural products

Metatranscriptomic and Metaproteomic Studies for Discovery of Novel Enzymes and Small Molecules

In addition to culture-dependent genome-mining studies, genome-based discovery of novel enzymes and natural products from environmental samples can also be achieved using culture-independent tools. It has been estimated that less than 1 % of the bacteria in most environmental samples are culturable (reviewed by Brady et al. 2009), and it is thus important to study the yet-uncultured microorganisms in the marine environment. A metagenome is the collection of genetic material (genomic DNA) of a mixed community of organisms recovered directly from given environmental samples. Environmental DNA (eDNA) extracted from marine sediments, seawater, or marine sponges, plants, or animals can serve as the starting point for metagenomic studies. Metagenomic DNA is cloned into various host cells, the most popular host being E. coli. Phenotype-based screening and DNA sequencing-based screening of metagenomic libraries yield positive clones with the target sequences (reviewed by Brady et al. 2009). Novel enzymes such as laccase, aromatic hydrocarbon dioxygenase, and halogenase, which have great potential for industrial applications and environmental bioremediation, have been isolated in marine metagenomic studies (Fang et al. 2011; Marcos et al. 2012; Bayer et al. 2013, reviewed by Kennedy et al. 2011). In addition, novel natural products were also identified in metagenomic libraries (reviewed by Brady et al. 2009), and Streptomyces and Ralstonia metallidurans were used as hosts for heterologous expression of metagenomic libraries. Metagenome mining of symbiotic bacteria of the marine sponge Theonella swinhoei resulted in the discovery of polytheonamides, which are extensively posttranslationally modified ribosomal peptides (Freeman et al. 2012). The metagenomic workflow is illustrated in Fig. 3.

Metatranscriptomic and metaproteomic studies focus on the expression of certain genes in a given environment at a given time (Schweder et al. 2008; Stewart et al. 2012) and have been used to characterize the metabolic behavior of microbial communities. Such techniques have not been employed to study the isolation of novel enzymes
and small molecules from the marine environment. In comparison to metagenomic studies, metatranscriptomics and metaproteomics overlook genes that are not expressed at a given time and are thus limited in fully exploring the biosynthetic potential of marine microorganisms. However, the same problem of silent gene expression can also be encountered when the metagenomic libraries are propagated in certain host cells; therefore, choosing diverse host cells and testing various conditions for expression of metagenomic libraries are important to identify novel enzymes and small molecules in the marine environment.

Genome-Based Studies of Marine Microorganisms, Fig. 3 Metagenomic method to discover novel natural products or enzymes

Summary

Genome mining has sped up the discovery of natural products and novel enzymes from microorganisms by exploring their full biosynthetic potential. Metagenomic studies combined with genome mining promote the advancement of studies of yet-uncultured marine microorganisms. The discovery of marine natural products and novel enzymes using genome-based methods is still in its early stage; however, development of genome mining and metagenomic approaches will facilitate discovery of more novel marine enzymes and natural products for biotechnological applications.

Acknowledgments The authors regret not being able to cite more references due to space limitations.

References

Bayer K, Scheuermayer M, Fieseler L, Hentschel U. Genomic mining for novel FADH(2)-dependent halogenases in marine sponge-associated microbial consortia. Mar Biotechnol (NY). 2013;15(1):63–72.
Brady SF, Simmons L, Kim JH, Schmidt EW. Metagenomic approaches to natural products from free-living and symbiotic organisms. Nat Prod Rep. 2009;26(11):1488–503.
Challis GL, Ravel J. Coelichelin, a new peptide siderophore encoded by the Streptomyces coelicolor genome: structure prediction from the sequence of its non-ribosomal peptide synthetase. FEMS Microbiol Lett. 2000;187(2):111–4.
Fang Z, Li T, Wang Q, Zhang X, Peng H, Fang W, Hong Y, Ge H, Xiao Y. A bacterial laccase from marine microbial metagenome exhibiting chloride tolerance and dye decolorization ability. Appl Microbiol Biotechnol. 2011;89:1103–10.
Freeman MF, Gurgui C, Helf MJ, Morinaka BI, Uria AR, Oldham NJ, Sahl HG, Matsunaga S, Piel J. Metagenome mining reveals polytheonamides as posttranslationally modified ribosomal peptides. Science. 2012;338(6105):387–90.
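
The colinearity rule described under Genome Mining for Natural Product Discovery, in which the number and type of domains in a modular PKS indicate the structure of the product, can be illustrated with a minimal sketch. It assumes the simplified textbook logic for type I PKS modules (reductive domains KR, DH, and ER acting in sequence on the β-keto group formed in each module). It is not the method of antiSMASH, NP.searcher, or any tool cited in this entry, and the function names and the example domain lists are hypothetical.

```python
def beta_carbon_state(domains):
    """Predict the beta-carbon state left by one PKS module (simplified rule).

    Assumption: KR reduces the beta-keto group to a hydroxyl, DH dehydrates
    the hydroxyl to a double bond, and ER saturates the double bond; each
    step needs the previous one.
    """
    present = set(domains)
    if "KR" not in present:
        return "keto"
    if "DH" not in present:
        return "hydroxyl"
    if "ER" not in present:
        return "double bond"
    return "methylene"

def predict_chain(modules):
    """Apply the rule module by module, in the order the modules are encoded."""
    return [beta_carbon_state(domains) for domains in modules]

# Hypothetical four-module assembly line, listed in genomic order.
modules = [
    ["KS", "AT", "ACP"],
    ["KS", "AT", "KR", "ACP"],
    ["KS", "AT", "DH", "KR", "ACP"],
    ["KS", "AT", "DH", "ER", "KR", "ACP"],
]
prediction = predict_chain(modules)
```

Real predictions also depend on substrate specificity of the AT (or NRPS adenylation) domains, stereochemistry, and tailoring enzymes, which is one reason the entry notes that predicted structures are not always correct.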
G 236 GeoChip-Based Metagenomic Technologies
Gross H, Stockwell VO, Henkels MD, Nowak-Thompson B, Loper JE, Gerwick WH. The genomisotopic approach: a systematic method to isolate products of orphan biosynthetic gene clusters. Chem Biol. 2007;14(1):53–63.
Imhoff JF, Labes A, Wiese J. Bio-mining the microbial treasures of the ocean: new natural products. Biotechnol Adv. 2011;29(5):468–82.
Kennedy J, O’Leary ND, Kiran GS, Morrissey JP, O’Gara F, Selvin J, Dobson ADW. Functional metagenomic strategies for the discovery of novel enzymes and biosurfactants with biotechnological applications from marine ecosystems. J Appl Microbiol. 2011;111(3):787–99.
Marcos MS, Lozada M, Di Marzio WD, Dionisi HM. Abundance, dynamics, and biogeographic distribution of seven polycyclic aromatic hydrocarbon dioxygenase gene variants in coastal sediments of Patagonia. Appl Environ Microbiol. 2012;78(5):1589–92.
Nguyen TA, Ishida K, Jenke-Kodama H, Dittmann E, Gurgui C, Hochmuth T, Taudien S, Platzer M, Hertweck C, Piel J. Exploiting the mosaic structure of trans-acyltransferase polyketide synthases for natural product discovery and pathway dissection. Nat Biotechnol. 2008;26(2):225–33.
Nikolouli K, Mossialos D. Bioactive compounds synthesized by non-ribosomal peptide synthetases and type-I polyketide synthases discovered through genome-mining and metagenomics. Biotechnol Lett. 2012;34:1393–403.
Scheffler RJ, Colmer S, Tynan H, Demain AL, Gullo VP. Antimicrobials, drug discovery, and genome mining. Appl Microbiol Biotechnol. 2013;97(3):969–78.
Scherlach K, Hertweck C. Triggering cryptic natural product biosynthesis in microorganisms. Org Biomol Chem. 2009;7:1753–60.
Schweder T, Markert S, Hecker M. Proteomics of marine bacteria. Electrophoresis. 2008;29:2603–16.
Stewart FJ, Ulloa O, DeLong EF. Microbial metatranscriptomics in a permanent marine oxygen minimum zone. Environ Microbiol. 2012;14(1):23–40.
Udwary DW, Zeigler L, Asolkar RN, Singan V, Lapidus A, Fenical W, Jensen PR, Moore BS. Genome sequencing reveals complex secondary metabolome in the marine actinomycete Salinispora tropica. Proc Natl Acad Sci. 2007;104(25):10376–81.
Wilkinson B, Micklefield J. Mining and engineering natural-product biosynthetic pathways. Nat Chem Biol. 2007;3(7):379–86.
Winter JM, Behnken S, Hertweck C. Genomics-inspired discovery of natural products. Curr Opin Chem Biol. 2011;15(1):22–31.
Zazopoulos E, Huang K, Staffa A, Liu W, Bachmann BO, Nonaka K, Ahlert J, Thorson JS, Shen B, Farnet CM. A genomics-guided approach for discovering and expressing cryptic metabolic pathways. Nat Biotechnol. 2003;21(2):187–90.
Zhang H, Boghigian BA, Armando J, Pfeifer BA. Methods and options for the heterologous production of complex natural products. Nat Prod Rep. 2011;28:125–51.

GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and Activities

Zhili He¹, Joy D. Van Nostrand¹ and Jizhong (Joe) Zhou¹,²,³
¹Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, OK, USA
²Department of Environmental Science and Engineering, Tsinghua University, Beijing, China
³Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

Synonyms

Functional gene array; Metagenomic technology

Definition

Functional gene arrays (FGAs) are a special type of microarray containing probes for key genes involved in microbial functional processes, such as biogeochemical cycling of carbon, nitrogen, sulfur, phosphorus, and metals, biodegradation of environmental contaminants, antibiotic resistance, energy processing, and stress response. GeoChips are considered to be the most comprehensive FGAs and an important metagenomic tool for microbial community analysis.

Introduction

Microorganisms are the most diverse group of organisms known in terms of phylogeny and functionality. However, they do not live alone but form distinct communities and play integrated and unique roles in ecosystems, such as biogeochemical cycling of carbon (C), nitrogen (N), sulfur (S), phosphorus (P), and metals (e.g., iron, copper, zinc), biodegradation or stabilization of environmental contaminants, and
interaction with hosts. Therefore, one of the most important goals of microbial ecology is to understand the diversity, composition, structure, function, dynamics, and evolution of microbial communities and their relationships with environmental factors and ecosystem functioning. Toward this goal, several challenges remain. First, microorganisms are generally too small to see or characterize with most approaches used for plant or animal studies. Second, microbial communities are extremely diverse. It is estimated that 1 g of soil contains 2,000–50,000 microbial species (Torsvik et al. 2002) and even up to millions of species (Gans et al. 2005). Third, a vast majority of microorganisms (>99 %) are uncultured (Whitman et al. 1998), making it difficult to study their functional ability and molecular mechanisms. Finally, establishing mechanistic linkages between microbial diversity and ecosystem functioning is even more difficult.

To address these challenges, culture-independent, high-throughput technologies for analysis of microbial communities are necessary. Indeed, many culture-independent approaches are available including PCR-based cloning analysis, denaturing gradient gel electrophoresis (DGGE), terminal-restriction fragment length polymorphism (T-RFLP), quantitative PCR, and in situ hybridization. However, these methods only provide snapshots of a microbial community but fail to provide a comprehensive view. Therefore, high-throughput metagenomic technologies are necessary for providing a rapid, specific, sensitive, and quantitative analysis of microbial communities and their relationships with environmental factors and ecosystem functioning.

Microarray-based technology can examine thousands of genes at one time, providing a much more comprehensive analysis of microbial communities. This technology, like GeoChip, has been developed and adopted to analyze microbial communities (He et al. 2007, 2010a; Hazen et al. 2010) and has been used to profile the functional diversity, composition, structure, and dynamics of microbial communities from different habitats (He et al. 2011, 2012a, b). A variety of studies demonstrate that microarrays can provide phylogenetic and functional information on a microbial community in a rapid, high-throughput, and parallel manner.

This overview is focused on the analysis of functional diversity, structure, and activity of microbial communities using GeoChip-based metagenomic technologies but also includes a brief introduction of GeoChips, GeoChip development, and GeoChip hybridization and data analysis.

GeoChips as the Most Comprehensive Functional Gene Arrays

Functional gene arrays (FGAs) are special microarrays containing probes for key genes involved in microbial functional processes, such as biogeochemical cycling of carbon (C), nitrogen (N), phosphorus (P), sulfur (S), and metals, antibiotic resistance, biodegradation of environmental contaminants, energy processing, and stress response. Since the exact functions of selected genes on FGAs are known, this type of array is especially useful for examining the functional diversity, composition, and structure of microbial communities across different times and scales. Several FGAs have been reported and evaluated, and they generally target specific functional processes, populations, or environments, including nodC and nifH arrays, a methanotroph gene (pmoA) array, a virulence marker gene (VMG) array, pathogen detection/diagnosis arrays, and a bioleaching array (He et al. 2012b). However, GeoChips are the most comprehensive FGAs to date, especially the later versions (GeoChips 2.0, 3.0, and 4.0), which target a variety of key microbial functional processes, such as C, N, P, and S cycling, contaminant bioremediation, and antibiotic resistance (He et al. 2012a).

GeoChips, constructed with 50-mer oligonucleotide probes, have evolved over several generations. The prototype GeoChip contained 89 PCR-amplicon probes for N-cycling genes (nirS, nirK, amoA, and pmoA) derived from pure-culture isolates and marine sediment clone libraries (Wu et al. 2001). The first-generation GeoChip (GeoChip 1.0) was constructed with
763 gene variants involved in nitrogen cycling (nirS, nirK, nifH, amoA), methane oxidation (pmoA), and sulfite reduction (dsrAB). Then, an expanded array was developed with 2,402 genes involved in organic contaminant biodegradation and metal resistance to monitor microbial populations and functional genes involved in biodegradation and biotransformation (Rhee et al. 2004). Specificity evaluation with representative pure cultures indicated that the designed probes appeared to be specific to their corresponding target genes. The detection limit was 5–10 ng of genomic DNA in the absence of background DNA and 50–100 ng of pure-culture genomic DNA in the presence of background DNA. Real-time PCR analysis was very consistent with the microarray-based quantification (He et al. 2011).

Although the prototype and GeoChip 1.0 arrays were used to probe specific functional groups or activities, they lacked a truly comprehensive probe set covering key microbial functional processes occurring in different environments. Therefore, more comprehensive GeoChips have been developed and evaluated. For example, GeoChip 2.0, containing 24,243 (50-mer) oligonucleotide probes, targeting ~10,000 functional gene variants from 150 gene families involved in the geochemical cycling of C, N, and P, sulfate reduction, metal reduction and resistance, and organic contaminant degradation, was developed as the first comprehensive FGA (He et al. 2007). After 2 years, GeoChip 3.0 was developed, which contained about 28,000 probes and targeted ~57,000 sequences from 292 gene families (He et al. 2010a). GeoChip 3.0 is more comprehensive and has several other distinct features compared to GeoChip 2.0, such as a common oligo reference standard (CORS) for data normalization and comparison, a software package for data management and future updating, the gyrB gene for phylogenetic analysis, and additional functional groups including those involved in antibiotic resistance and energy processing (He et al. 2010a). Based on GeoChip 3.0, GeoChip 4.0 was developed, which contains ~84,000 probes and targets >152,000 genes from 410 functional families. GeoChip 4.0 not only contains all functional categories from GeoChip 3.0 but also includes additional functional categories, such as genes from bacterial phages and those involved in stress response and virulence (Hazen et al. 2010; He et al. 2012a). All evaluations and studies demonstrate that GeoChip is a powerful tool for specific, sensitive, and quantitative analysis of microbial communities from a variety of habitats (He et al. 2011, 2012a, b).

GeoChip Development

GeoChip development involves several major steps, including selection of target genes, sequence retrieval and verification, oligonucleotide probe design, probe validation, and array construction as well as future automatic update, which are generally implemented by a GeoChip development and data analysis pipeline (http://ieg.ou.edu/) (He et al. 2010a).

Selection of Target Genes and Sequence Retrieval

A variety of functional genes can be used as functional markers targeting different processes, such as biogeochemical cycling of C, N, S, P, and metals, contaminant bioremediation, antibiotic resistance, and stress response. For example, 292 functional gene families were selected for GeoChip 3.0 with 41 for C cycling, 16 for N cycling, 3 for P utilization, 4 for S cycling, 173 for biodegradation of a variety of organic contaminants, 41 for metal reduction and resistance, 11 for antibiotic resistance, and 2 for energy processing. In addition, a phylogenetic marker (gyrB) was also chosen (He et al. 2010a). More importantly, when sequences for a known functional gene are available, they can be added in an updated GeoChip. For example, when GeoChip was updated to GeoChip 4.0, functional gene families involved in stress responses, bacterial phages, and virulence were added, resulting in 410 functional gene families on GeoChip 4.0 (Hazen et al. 2010; He et al. 2012a). Generally, genes are chosen for key enzymes or proteins with the corresponding
function(s) of interest. If a process involves multiple steps or a protein complex, those genes
responsible for catalytic subunits or with the active site(s) will be selected (He et al. 2011).

Sequence retrieval is performed generally with a pipeline with a database integrated for
managing all retrieved sequences and subsequently designed probes. For each functional gene,
the first step is to submit a query to the GenBank protein database and fetch all candidate
amino acid sequences. The key words may include the name of the target gene/enzyme, its
abbreviation and enzyme commission number (EC), and affiliated domains of bacteria, archaea,
and fungi. Second, retrieved sequences are validated by seed sequences (those sequences that
have been experimentally confirmed to produce the protein of interest and that the protein
functions as expected) with the HMMER program. Finally, all confirmed protein sequences are
searched against GenBank again to obtain their corresponding nucleic acid sequences for probe
design (He et al. 2010a).

Oligonucleotide Probe Design
A new version of CommOligo (He et al. 2012a) with group-specific probe design features can be
used to design both gene- and group-specific oligonucleotide probes with different degrees of
specificity based on the following criteria: (i) a gene-specific probe must have ≤90 % sequence
identity, ≤20-base continuous stretch, and ≥−35 kcal/mol free energy; (ii) a group-specific
probe has to meet the above requirements for nontarget groups, and it also must have ≥96 %
sequence identity, ≥35-base continuous stretch, and ≤−60 kcal/mol free energy within the group.
Computational and experimental evaluation indicates that these designed probes are highly
specific to their targets (He et al. 2007, 2010a).

Probe Validation and GeoChip Construction
All designed probes are subsequently verified against the GenBank (NR) nucleic acid database
for specificity. Normally, multiple (e.g., 20) probes for each sequence or each group of
sequences are designed, but only the best probe set for each sequence or each group of closely
related sequences will be chosen for array construction. GeoChip can be constructed in-house,
such as GeoChips 2.0 and 3.0 (He et al. 2007, 2010a), or commercially, like GeoChip 4.0
(Hazen et al. 2010; He et al. 2012a).

GeoChip Operation and Data Analysis

Generally, GeoChip operation and data analysis include target preparation, GeoChip
hybridization, image and data preprocessing, and data analysis (Fig. 1).

Target Preparation
Target preparation involves a few steps, including nucleic acid extraction and purification,
labeling, and hybridization (Fig. 1a). The most important step for successful GeoChip analysis
is nucleic acid extraction and purification from environmental samples, generally using a
well-established method that is able to produce large fragments of DNA. High-quality DNA
should have ratios of A260/A280 ~ 1.8 and A260/A230 > 1.7. Low A260/A230 ratios indicate
impurities in the DNA sample and can negatively influence subsequent labeling and
hybridization. Generally, since 1–5 μg of DNA or 5–20 μg of RNA is required for GeoChip
hybridization, whole-community genome amplification (WCGA) for DNA and whole-community RNA
amplification (WCRA) for RNA are necessary (He et al. 2012b). Non-amplified or amplified
nucleic acids are then labeled with fluorescent dye (e.g., Cy3, Cy5) using random priming with
the Klenow fragment of DNA polymerase for DNA and SuperScript™ II/III RNase H− reverse
transcriptase for RNA. The labeled nucleic acids are then purified and dried for hybridization
(Fig. 1a).

Hybridization, Imaging, and Data Preprocessing
Labeled nucleic acid target is suspended in a hybridization buffer containing 40–50 % formamide
and hybridized on GeoChip at 42–50 °C (He et al. 2007, 2010a, 2012b). The hybridization
stringency can be adjusted by changing the temperature and/or formamide concentration.
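
As a rough illustration of the probe design criteria listed under Oligonucleotide Probe Design, the following Python sketch checks a candidate probe against the two sets of thresholds. The `ProbeMatch` record and all function names are hypothetical, and the direction of each inequality (nontargets at or below the similarity thresholds, group members at or above them) is an assumption about how the criteria are meant to be read. It is not the CommOligo implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProbeMatch:
    """Similarity of one candidate probe to one sequence (hypothetical record)."""
    identity: float      # fraction of identical bases, 0.0 to 1.0
    stretch: int         # longest continuous identical stretch, in bases
    free_energy: float   # binding free energy in kcal/mol (more negative = more stable)


def nontarget_ok(m: ProbeMatch) -> bool:
    # Assumed reading: a nontarget must be dissimilar enough not to cross-hybridize.
    return m.identity <= 0.90 and m.stretch <= 20 and m.free_energy >= -35.0


def group_member_ok(m: ProbeMatch) -> bool:
    # Assumed reading: every member of the target group must bind the probe strongly.
    return m.identity >= 0.96 and m.stretch >= 35 and m.free_energy <= -60.0


def is_gene_specific(nontargets: Iterable[ProbeMatch]) -> bool:
    """Criterion (i): the probe passes against every nontarget sequence."""
    return all(nontarget_ok(m) for m in nontargets)


def is_group_specific(members: Iterable[ProbeMatch],
                      nontargets: Iterable[ProbeMatch]) -> bool:
    """Criterion (ii): criterion (i) for nontarget groups plus the within-group thresholds."""
    return is_gene_specific(nontargets) and all(group_member_ok(m) for m in members)
```

A probe would be rejected as soon as a single nontarget exceeds any one of the three thresholds, which is why the checks are combined with `all`.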
GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure
and Activities, Fig. 1 A schematic presentation of target preparation, GeoChip operation, and
data analysis of microbial communities from a variety of habitats. (a) Target preparation,
(b) GeoChip hybridization and data processing, (c) GeoChip data analysis (This figure is
adapted from Fig. 1 by He et al. (2012b))
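
The purity and input requirements given under Target Preparation can be expressed as a simple pre-hybridization check. This is a minimal sketch: the function names are hypothetical, the ±0.1 tolerance around the A260/A280 target of 1.8 is an assumed value (the text gives only "~ 1.8"), and the 1 μg cutoff is the lower end of the 1–5 μg DNA range quoted in the text.

```python
def dna_quality_ok(a260: float, a280: float, a230: float, tol: float = 0.1) -> bool:
    """Return True if the absorbance ratios meet the purity targets.

    A260/A280 should be about 1.8 (tolerance `tol` is an assumption) and
    A260/A230 should exceed 1.7.
    """
    ratio_280 = a260 / a280
    ratio_230 = a260 / a230
    return abs(ratio_280 - 1.8) <= tol and ratio_230 > 1.7


def needs_wcga(dna_ug: float, required_ug: float = 1.0) -> bool:
    """Return True if the DNA yield is below the amount needed for direct hybridization."""
    return dna_ug < required_ug
```

For example, a sample with A260 = 1.0, A280 = 0.55, and A230 = 0.5 gives ratios of about 1.82 and 2.0 and would pass, whereas a 0.2 μg yield would be flagged for amplification.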
For every 1 % increase in formamide, the effective temperature increases by 0.6 °C
(He et al. 2011). Hybridized arrays are imaged with a microarray scanner having a resolution of
at least 10 μm for homemade arrays and 2 μm for commercially manufactured arrays. Microarray
analysis software is then used to quantify the signal intensity (pixel density) of each spot.
Spot quality is also evaluated at this point using predetermined criteria, and positive spots
are called generally based on signal-to-noise ratio [SNR; SNR = (signal mean − background
mean)/background standard deviation] or signal-to-both-standard-deviations ratio [SSDR;
SSDR = (signal mean − background mean)/(signal standard deviation − background standard
deviation)] (He et al. 2012b).

Raw GeoChip data are further evaluated via the GeoChip data analysis pipeline
(He et al. 2010a). The quality of individual spots, evenness of control spot hybridization
signals across the slide surface, and background levels are assessed to determine overall array
quality. Spots flagged as poor or low quality are removed along with outliers: positive spots
with (signal − mean signal intensity of all replicate spots) greater than three times the
replicate spots' signal standard deviation (He et al. 2011). The signal intensities are then
normalized for further statistical analysis (Fig. 1b).

GeoChip Data Analysis
Data analysis is the most challenging part in the use of GeoChip for microbial community
analysis, and a variety of methods have been used to address fundamental microbial ecology
questions (Fig. 1c). First, various diversity indices (e.g., richness, evenness, diversity)
based on the number of functional genes detected and their abundances are used to examine the
functional
diversity of microbial communities. The relative abundance of specific genes or gene groups can
be determined based on the total signal intensity of the relevant genes or the number of genes
detected. The percentage of genes shared by different samples can also be calculated to compare
microbial communities examined. Second, for statistical analysis of the overall microbial
community composition and structure with FGA data, ordination techniques can be used such as
principal component analysis (PCA), detrended correspondence analysis (DCA), cluster analysis
(CA), and nonmetric multidimensional scaling (NMDS). PCA and DCA are multivariate statistical
methods, which reduce the number of variables needed to explain the data and highlight the
variability between samples. CA groups samples based on the overall similarity of gene
patterns. NMDS finds both a nonparametric monotonic relationship between the dissimilarities in
the item-item matrix and the Euclidean distances between items and the location of each item in
the low-dimensional space. Also, the response ratio can be used to determine changes of
specific functional genes between the control and the treatment. In addition, analysis of
variance (ANOVA), analysis of similarities (ANOSIM), nonparametric multivariate analysis of
variance (Adonis), and multi-response permutation procedure (MRPP) can be used to discern
dissimilarities of microbial communities over time and space (He et al. 2011, 2012b). Third, if
environmental data or other metadata are available, GeoChip data can be used to correlate
environmental variables with the functional microbial community structure. Methods for this
include Pearson's correlation coefficient (PCC), canonical correspondence analysis (CCA), and
the Mantel test. PCC measures the strength of linear dependence between two variables, such as
functional gene abundances detected by GeoChip and environmental variables. CCA has been used
in many cases in GeoChip-based studies to better understand how environmental factors affect
the community structure (He et al. 2011, 2012b). Also, based on the results of the CCA, the
relative influence of environmental variables on the microbial community structure can be
determined using variance partitioning analysis (VPA). In addition, further correlations of
GeoChip data with environmental parameters can be performed with the Mantel test
(He et al. 2007, 2010a, b, 2011, 2012b). Finally, GeoChip data can be used to infer functional
molecular ecological networks for revealing interactions of functional genes and their
associated populations. A recent study indicated that elevated CO2 substantially altered the
network interaction of soil microbial communities and that the shift in network structures is
significantly correlated with soil properties (He et al. 2012b; Zhou et al. 2010) (Fig. 1c).

GeoChip Applications

Different versions of GeoChip have been used to analyze microbial communities from different
habitats, such as aquatic systems, soils, extreme environments, human microbiomes, and
bioreactors for addressing fundamental scientific questions related to global change,
bioenergy, bioremediation, agricultural management, land use, and human health and disease as
well as ecological theories (He et al. 2011, 2012b). Several recent studies are highlighted,
especially with a focus on soil and water microbial communities. A list of representative
studies with different GeoChip versions is shown in Table 1.

Soils
Soil may harbor the most complex microbial communities among known habitats, and recently
GeoChips have been used to investigate soil microbial communities to address fundamental
ecological questions related to global change (e.g., elevated CO2, elevated O3, warming),
bioremediation of oil-contaminated fields, land use, agricultural management, and livestock
grazing.

Three recent studies focused on the response of soil microbial communities to global change,
including elevated CO2, temperature, and O3. First, GeoChip 3.0 was used to analyze soil
microbial communities under elevated CO2 at a multifactor grassland experiment site, BioCON
(biodiversity, CO2, and nitrogen deposition), in
GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and
Activities, Table 1 Summary of representative GeoChip applications. If no references are cited, those studies are
described in a previous review (He et al. 2012b)

Habitat or ecosystem | Ecosystem/sample type | GeoChip | Objectives of study/biological questions
Aquatic systems | Marine sediment | GeoChip 1.0 | Functional microbial community structure of marine sediments in the Gulf of Mexico
Aquatic systems | Ebro and Elbe river sediment | GeoChip 2.0 | Pesticide impacts on European rivers
Aquatic systems | Coral-associated marine water | GeoChip 2.0 | Microbial communities in healthy and yellow-band diseased coral (Montastraea faveolata)
Soils | Antarctic latitudinal transect soil | GeoChip 2.0 | Microbial C and N cycling across an Antarctic latitudinal transect
Soils | Deciduous forest soil | GeoChip 2.0 | Gene-area relation in microorganisms
Soils | Native grassland soil | GeoChip 2.0 | Afforestation impacts soil microbial communities and their functional potential
Soils | Strawberry farmland soil | GeoChip 2.0 | Microbial responses to farm management
Soils | Grassland soil | GeoChip 2.0 | Microbial responses to plant invasion
Soils | Agricultural soil | GeoChip 2.0 | Agricultural practices/land use (Xue et al. 2013)
Soils | Grassland soil | GeoChip 3.0 | Global change (elevated CO2) (He et al. 2010b)
Soils | Grassland soil | GeoChip 3.0 | Global change (warming) (Zhou et al. 2012)
Soils | Wheat rhizosphere soil | GeoChip 3.0 | Global change (elevated O3) (Li et al. 2013)
Soils | Citrus rhizosphere soil | GeoChip 3.0 | Rhizosphere microbial community responses to Candidatus Liberibacter asiaticus-infected citrus trees
Soils | Grassland soil | GeoChip 4.0 | The effect of grazing on microbial communities (Yang et al. 2013)
Contaminated sites | U-contaminated underground water (Oak Ridge, TN) | GeoChip 1.0 | Bioremediation of U-contaminated groundwater
Contaminated sites | U-contaminated underground water (Oak Ridge, TN) | GeoChip 2.0 | Bioremediation of U-contaminated groundwater (Van Nostrand et al. 2011)
Contaminated sites | U-contaminated sediment (Oak Ridge, TN) | GeoChip 2.0 | Bioremediation of U-contaminated sediments
Contaminated sites | U-contaminated underground water (Rifle, CO) | GeoChip 2.0 | Bioremediation of U-contaminated groundwater (Liang et al. 2012)
Contaminated sites | PCB-contaminated soil | GeoChip 2.0 | Microbial bioremediation of PCB-contaminated soil
Contaminated sites | Oil-contaminated soil | GeoChip 2.0 | Bioremediation of oil-contaminated soil
Contaminated sites | Arsenic-contaminated soil | GeoChip 3.0 | Rhizosphere microbial community responses to arsenic contamination and phytoremediation
Contaminated sites | Landfill groundwater | GeoChip 3.0 | Microbial responses to landfill-derived contaminants in groundwater (Lu et al. 2012)
Contaminated sites | Oil-spill seawater | GeoChip 4.0 | Microbial bioremediation of oil-spill sites (Hazen et al. 2010)
Extreme environments | Deep-sea hydrothermal vent (chimney) | GeoChip 2.0 | Functional gene diversity of deep-sea hydrothermal vent microbial communities
Extreme environments | Deep-sea basalt samples | GeoChip 2.0 | Functional gene diversity and structure of deep-sea basalt microbial communities
Extreme environments | GSL hypersaline water | GeoChip 2.0 | Functional gene diversity and structure of hypersaline water microbial communities
Extreme environments | Acid mine drainage (water) | GeoChip 2.0 | Functional gene diversity of microbial communities in acid mine drainage (AMD) systems
Bioreactors | Fluidized bed reactor for bioremediation | GeoChip 2.0 | Microbial bioremediation of hydrocarbon-contaminated water
Bioreactors | Microbial electrolysis cell for hydrogen production | GeoChip 3.0 | Microbial hydrogen production using wastewater
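
The spot-calling and outlier rules from the Hybridization, Imaging, and Data Preprocessing subsection can be sketched in Python as follows. This is a minimal illustration, not the GeoChip data analysis pipeline: the function names are hypothetical, and the SNR cutoff of 2.0 is an assumed value because the text does not state a threshold. The outlier rule follows the text literally, removing replicate spots whose signal exceeds the replicate mean by more than three standard deviations.

```python
from statistics import mean, stdev
from typing import List


def snr(signal_mean: float, background_mean: float, background_sd: float) -> float:
    """SNR = (signal mean - background mean) / background standard deviation."""
    return (signal_mean - background_mean) / background_sd


def is_positive_spot(signal_mean: float, background_mean: float,
                     background_sd: float, threshold: float = 2.0) -> bool:
    # The default threshold is an assumption, not a value given in the text.
    return snr(signal_mean, background_mean, background_sd) >= threshold


def drop_replicate_outliers(signals: List[float]) -> List[float]:
    """Remove replicate spots lying more than 3 SD above the replicate mean."""
    if len(signals) < 3:
        return list(signals)  # too few replicates to estimate a spread
    m = mean(signals)
    sd = stdev(signals)       # sample standard deviation of the replicates
    return [s for s in signals if (s - m) <= 3 * sd]
```

Because the standard deviation is computed from the same replicates that include the suspect spot, this rule can only flag an outlier when many replicates are available (with the sample standard deviation, at least 11).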
the Cedar Creek Ecosystem Science Reserve, MN (He et al. 2010b). The results showed that the
functional microbial community structure was markedly different between ambient CO2 and
elevated CO2 as indicated by DCA of GeoChip 3.0 data and 16S rRNA gene-based pyrosequencing
data. Also, genes involved in labile C degradation and C and N fixation were significantly
increased under elevated CO2 although the abundance of recalcitrant C degradation genes
remained unchanged. In addition, changes in the microbial community structure were
significantly correlated with soil C and N contents and plant productivity (He et al. 2010b).
Second, GeoChip 3.0 was used to understand the effect of increased temperature on soil
microbial communities and their roles in regulating soil carbon dynamics at a tallgrass prairie
ecosystem in the US Great Plains of Central Oklahoma. The results suggest soil microorganisms
may regulate soil carbon dynamics through three primary feedback mechanisms: (i) shifting
microbial community composition, leading to the reduced temperature sensitivity of
heterotrophic soil respiration; (ii) differentially stimulating labile C but not recalcitrant
C degradation genes to maintain long-term soil carbon stability and storage; and (iii)
enhancing nutrient-cycling processes to promote plant growth (Zhou et al. 2012). Third,
GeoChip 3.0 was used to investigate the functional composition and structure of rhizosphere
microbial communities from O3-sensitive and O3-relatively-sensitive wheat (Triticum aestivum
L.) cultivars under elevated O3 (eO3). Based on GeoChip hybridization signal intensities,
although the overall functional structure of rhizosphere microbial communities was not
significantly changed by eO3 or cultivar, the results showed that the abundance of specific
functional genes involved in C fixation and degradation, N fixation, and sulfite reduction did
change significantly in response to eO3 and/or wheat cultivars. Also, the O3-sensitive cultivar
appeared to harbor microbial functional communities in the rhizosphere more sensitive in
response to eO3 than the O3-relatively-sensitive cultivar. In addition, CCA suggested that the
functional structure of microbial communities involved in C cycling was largely shaped by soil
and plant properties including pH, dissolved organic carbon (DOC), microbial biomass C, C/N
ratio, and grain weight (Li et al. 2013). Those studies indicate that global change
significantly impacts soil microbial communities, which may in turn regulate ecosystem
functioning through different feedback mechanisms.

Various agriculture management practices may have significant influences on soil microbial
communities and their ecological functions.
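
The diversity indices and response ratio named under GeoChip Data Analysis can be sketched as follows. The text does not give formulas, so the definitions used here are assumptions based on common ecological usage: richness as the number of genes with a positive signal, the Shannon index for diversity, Pielou's index for evenness, and the natural log of the treatment-to-control mean ratio for the response ratio. All function names are hypothetical.

```python
import math
from typing import Sequence


def richness(signals: Sequence[float]) -> int:
    """Number of functional genes detected (signal greater than zero)."""
    return sum(1 for s in signals if s > 0)


def shannon(signals: Sequence[float]) -> float:
    """Shannon index H' = -sum(p_i * ln p_i), with p_i the relative signal intensity."""
    total = sum(s for s in signals if s > 0)
    return -sum((s / total) * math.log(s / total) for s in signals if s > 0)


def evenness(signals: Sequence[float]) -> float:
    """Pielou's evenness J = H' / ln(richness); defined as 0 when fewer than 2 genes."""
    n = richness(signals)
    return shannon(signals) / math.log(n) if n > 1 else 0.0


def log_response_ratio(treatment: Sequence[float], control: Sequence[float]) -> float:
    """ln(mean treatment signal / mean control signal) for one gene or gene group."""
    t_mean = sum(treatment) / len(treatment)
    c_mean = sum(control) / len(control)
    return math.log(t_mean / c_mean)
```

A positive log response ratio indicates a gene whose mean signal is higher under treatment than in the control; whether the change is significant would still need a confidence interval or a test such as those listed in the text.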
GeoChip 2.0 was used to evaluate the potential functions of soil microbial communities under
conventional (CT), low-input (LI), and organic (ORG) management systems at an agricultural
research site in Michigan. Compared to CT, a high diversity of functional genes was observed in
LI. The functional gene diversity in ORG did not differ significantly from that of either CT or
LI. The abundance of genes encoding enzymes involved in C, N, P, and S cycling was generally
lower in CT than in LI or ORG, but functional genes involved in lignin degradation, methane
generation/oxidation, and assimilatory N reduction remained unchanged. Also, significant
correlations were observed between NO3− concentration and denitrification gene abundance, NH4+
concentration and ammonification gene abundance, and N2O flux and denitrification gene
abundance, indicating a close linkage between soil N availability or utilization and associated
functional potential of soil microbial communities (Xue et al. 2013).

Livestock grazing is a type of global land-use activity. However, the effect of free livestock
grazing on soil microbial communities at the functional gene level remains unclear. GeoChip 4.0
was used to examine the effects of free livestock grazing on the microbial community at an
experimental site in Tibet, a region known to be very sensitive to anthropogenic perturbation
and global warming. The results showed that grazing changed the microbial community functional
structure, in addition to aboveground vegetation and soil geochemical properties. Further
statistical analysis showed that microbial community functional structures were closely
correlated with environmental variables and variations in microbial community functional
structures were mainly controlled by aboveground vegetation, soil C/N ratio, and NH4+-N.
Therefore, these results indicated that soil microbial community functional structure was very
sensitive to livestock grazing and revealed the role of soil microbial communities in the
regulation of soil N and C cycling, supporting the necessity to include microbial components in
evaluating the consequence of land use and/or climate change (Yang et al. 2013).

Groundwater and Aquatic Ecosystems
Due to human activities, groundwater and aquatic ecosystems are often contaminated from various
sources (e.g., mining, oil spill, landfill) and with a variety of toxic compounds (e.g., heavy
metals, herbicides, antibiotics, pesticides) and conditions (e.g., low pH, high salinity). To
understand how such contamination impacts groundwater and aquatic ecosystems, GeoChips were
used to investigate those microbial communities to explore the potential of in situ
bioremediation of contaminated sites by indigenous microbial communities.

A pilot-scale system was established to examine the feasibility of in situ U(VI)
immobilization at a highly contaminated aquifer in Oak Ridge, TN. Ethanol was injected
intermittently as an electron donor to stimulate microbial U(VI) reduction, leading to a
decrease of U(VI) concentrations below the Environmental Protection Agency drinking water
standard. GeoChip 2.0 was used to monitor microbial communities in three wells during active
U(VI) reduction and maintenance phases. The results showed that the overall microbial community
structure exhibited a considerable shift over the remediation phases examined and functional
populations of Fe(III)-reducing bacteria (FeRB), nitrate-reducing bacteria (NRB), and
sulfate-reducing bacteria (SRB) reached their highest levels during the active U(VI) reduction
phase (days 137–370), in which denitrification, Fe(III) reduction, and sulfate reduction
occurred sequentially, suggesting that these functional populations could play an important
role in both active U(VI) reduction and maintenance stability of reduced U(IV)
(Van Nostrand et al. 2011).

To better understand the microbial functional diversity changes with subsurface redox
conditions during in situ U(VI) bioremediation, GeoChip 2.0 was applied to examine groundwater
microbial communities at a uranium mill tailings remedial action (UMTRA) site (Rifle, CO). The
results indicated that functional microbial communities altered with a shift in the dominant
metabolic process and the abundance of dsrAB and mcr genes increased when redox conditions
shifted from Fe-reducing to sulfate-reducing conditions, while cytochrome genes were primarily
detected from Geobacter species and decreased with lower subsurface redox conditions.
Statistical analysis of environmental parameters and functional genes indicated that acetate,
U(VI), and redox potential were the most significant geochemical variables linked to the
microbial functional gene structures. This study indicates that microbial functional genes
could be very useful for tracking microbial community structure and dynamics during
bioremediation (Liang et al. 2012).

In another study, GeoChip 3.0 was used to study the functional gene diversity and structure of
groundwater microbial communities in a shallow landfill leachate-contaminated aquifer in
Norman, OK. Samples were taken from eight wells at the same aquifer depth immediately below a
municipal landfill or along the predominant downgradient groundwater flowpath. The results
showed that functional gene richness and diversity immediately below the landfill and the
closest well were considerably lower than those in downgradient wells and that landfill
leachate impacted the diversity, composition, structure, and functional potential of
groundwater microbial communities as a function of groundwater pH and concentrations of
sulfate, ammonia, and dissolved organic carbon (Lu et al. 2012).

In 2010, the Deepwater Horizon oil spill occurred in the Gulf of Mexico. GeoChip 4.0 was used
to examine the functional composition and structure of water microbial communities from the oil
plume and control sites. The results indicated that the water microbial community composition
and structure were dramatically altered in deep-sea oil plume samples. A variety of functional
genes involved in both aerobic and anaerobic hydrocarbon degradation were highly enriched in
the plume compared with outside the plume, indicating a great potential for intrinsic
bioremediation or natural attenuation in the deep sea. Various other microbial functional genes
that are relevant to C, N, P, S, and iron cycling, metal resistance, and bacteriophage
replication were also enriched in the plume. Overall, this study suggests that indigenous
microbial communities could have a significant role in biodegradation of oil spills in deep-sea
environments (Hazen et al. 2010).

Other Environments
GeoChips were also used to analyze microbial communities from other habitats/ecosystems,
including various contaminated sites (e.g., chromate-contaminated water, U-contaminated
sediments, polychlorinated biphenyl- and arsenic-contaminated soils), extreme environments
(e.g., acid mine drainage, hypersaline lakes, deep-sea basalts, deep-sea hydrothermal vents),
bioleaching systems, and bioreactors as well as the human microbiome (He et al. 2011, 2012b).

Summary

Although GeoChip technology has been demonstrated to be specific, sensitive, and quantitative
and applied to analyze microbial communities from different habitats, some key issues and
challenges still remain, including probe coverage, specificity, sensitivity, quantitative
capability, nucleic acid quality, the detection of microbial community activity, and challenges
by high-throughput sequencing technologies. It should be noted that probe coverage on GeoChip
is relatively low compared to the availability of functional gene sequences in databases,
especially for earlier versions of FGAs. One of the reasons is that some sequences do not have
specific probes based on the availability of sequence databases and software. Also, GeoChip
probe sets need continuous updates to reflect the current status of functional gene sequence
information.

Critical issues with GeoChip design and detection are specificity, sensitivity, and
quantitative capability, which are especially important since many gene variants within each
environmental sample are unknown. Array specificity is controlled by probe design and
hybridization conditions. A novel microarray probe design software tool, CommOligo
(He et al. 2012a), and its improved versions were used to design probes for GeoChip 2.0,
GeoChip 3.0, and GeoChip 4.0. Experimental evaluations of GeoChip 2.0 and GeoChip 3.0 indicated
that low percentages of false positives (0.002–0.025 %) were observed (He et al. 2007;
He et al. 2010a). GeoChip hybridizations are generally performed at 42–50 °C
with 50 % formamide. Sensitivity is another major concern since many gene variants are expected
to be of low abundance in environmental samples. The current level of sensitivity for
oligonucleotide arrays using environmental samples is approximately 50–100 ng or 10^7 cells, or
approximately 5 % of the microbial community, providing a coverage of only the most dominant
community members. Several strategies have been utilized to increase sensitivity. For example,
with WCGA and WCRA approaches, the sensitivity of GeoChip hybridization could increase to
10 fg. Also, array surface modifications, a decrease of hybridization solution, and the use of
new labeling techniques could increase GeoChip detection sensitivity (He et al. 2011, 2012a).
An important goal in microarray analysis is to provide quantitative information. GeoChip has
been shown to have a linear relationship between target DNA or RNA concentrations and
hybridization signal intensities. However, this relationship can be affected by sequence
divergence (i.e., the more divergent the sequence, the lower the signal intensity). Therefore,
two strategies are used to improve quantitative ability: mismatch probes and using relative
comparisons across samples rather than absolute comparisons (He et al. 2012a).

The quality and quantification of environmental nucleic acids are among the most important
factors for successful GeoChip hybridization and reliable data generation. DNA with large
fragments and minimal amounts of contaminants is especially important when samples need to be
amplified using WCGA. Accurate measurement of DNA yields is also important, so quantification
should be based on double-strand DNA (dsDNA) measurement (e.g., PicoGreen) rather than via
absorbance. While DNA detection provides information on the presence of functional genes in the
environment, it does not provide unconditional evidence for microbial activity. Population
changes can be used to infer microbial activity, but this may not be accurate. To monitor
microbial activity, mRNA should be used. However, since mRNA is easily degraded with rapid
turnover, usually has a low abundance, and makes up a small proportion of the total RNA,
improved RNA extraction methods are necessary to use environmental RNA for GeoChip analysis.
Alternatively, other techniques, such as stable isotope probing (SIP), enzyme activity,
metaproteomic analysis, and metabolite assays, may be used to study the functional activity and
ecosystem functions of microbial communities.

High-throughput sequencing technologies (e.g., 454, Illumina) are available for microbial
community analysis, which challenge GeoChip technologies. However, although these
sequencing-based technologies can discover novel sequences, it can be expensive to do in-depth
shotgun sequencing of a community. In addition, they suffer from a lack of appropriate
conserved primers for many target genes. Also, sequencing-based technologies have a
disadvantage of random sampling and/or undersampling, making it difficult to compare different
samples, while microarray-based technologies have a defined probe set, which is good for
community comparisons (He et al. 2012b). Therefore, due to the unique features and advantages
and disadvantages of both microarray-based and sequencing-based technologies, it is preferable
that they be used complementarily for microbial community analysis in order to address
fundamental questions in microbial ecology and environmental biology.

Acknowledgments This work conducted by ENIGMA (Ecosystems and Networks Integrated with Genes
and Molecular Assemblies) (http://enigma.lbl.gov), a Scientific Focus Area Program at Lawrence
Berkeley National Laboratory, was supported by the Office of Science, Office of Biological and
Environmental Research, of the US Department of Energy under Contract No. DE-AC02-05CH11231 and
by the Oklahoma Applied Research Support (OARS), Oklahoma Center for the Advancement of Science
and Technology (OCAST), State of Oklahoma, through AR11-035 and AR062-034.

References

Gans J, Wolinsky M, Dunbar J. Computational improvements reveal great bacterial diversity and high metal toxicity in soil. Science. 2005;309:1387–90.
Hazen TC, Dubinsky EA, DeSantis TZ, Andersen GL, Piceno YM, Singh N, et al. Deep-sea oil plume enriches indigenous oil-degrading bacteria. Science. 2010;330:204–8.
He Z, Gentry TJ, Schadt CW, Wu L, Liebich J, Chong SC, et al. GeoChip: a comprehensive microarray for investigating biogeochemical, ecological and environmental processes. ISME J. 2007;1:67–77.
He Z, Deng Y, Van Nostrand JD, Tu Q, Xu M, Hemme CL, et al. GeoChip 3.0 as a high-throughput tool for analyzing microbial community composition, structure and functional activity. ISME J. 2010a;4:1167–79.
He Z, Xu M, Deng Y, Kang S, Kellogg L, Wu L, et al. Metagenomic analysis reveals a marked divergence in the structure of belowground microbial communities at elevated CO2. Ecol Lett. 2010b;13:564–75.
He Z, Van Nostrand JD, Deng Y, Zhou J. Development and applications of functional gene microarrays in the analysis of the functional diversity, composition, and structure of microbial communities. Front Environ Sci Engin China. 2011;5:1–20.
He Z, Deng Y, Zhou J. Development of functional gene microarrays for microbial community analysis. Curr Opin Biotechnol. 2012a;23:49–55.
He Z, Van Nostrand JD, Zhou J. Applications of functional gene microarrays for profiling microbial communities. Curr Opin Biotechnol. 2012b;23:460–6.
Li X, Deng Y, Li Q, Lu C, Wang J, Zhang H, et al. Shifts of functional gene representation in wheat rhizosphere microbial communities under elevated ozone. ISME J. 2013;7(3):660–71.
Liang Y, Van Nostrand JD, N’Guessan LA, Peacock AD, Deng Y, Long PE, et al. Microbial functional gene diversity with a shift of subsurface redox conditions during in situ uranium reduction. Appl Environ Microbiol. 2012;78:2966–72.
Lu Z, He Z, Parisi VA, Kang S, Deng Y, Van Nostrand JD, et al. GeoChip-based analysis of microbial functional gene diversity in a landfill leachate-contaminated aquifer. Environ Sci Technol. 2012;46:5824–33.
Rhee S-K, Liu X, Wu L, Chong SC, Wan X, Zhou J. Detection of genes involved in biodegradation and biotransformation in microbial communities by using 50-mer oligonucleotide microarrays. Appl Environ Microbiol. 2004;70:4303–17.
Torsvik V, Ovreas L, Thingstad TF. Prokaryotic diversity – magnitude, dynamics, and controlling factors. Science. 2002;296:1064–6.
Van Nostrand JD, Wu L, Wu W-M, Huang Z, Gentry TJ, Deng Y, et al. Dynamics of microbial community composition and function during in situ bioremediation of a uranium-contaminated aquifer. Appl Environ Microbiol. 2011;77:3860–9.
Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proc Natl Acad Sci USA. 1998;95:6578–83.
Wu L, Thompson DK, Li G, Hurt RA, Tiedje JM, Zhou J. Development and evaluation of functional gene arrays for detection of selected genes in the environment. Appl Environ Microbiol. 2001;67:5780–90.
Xue K, Wu L, Deng Y, He Z, Van Nostrand J, Robertson PG, et al. Functional gene differences in soil microbial communities from conventional, low-input, and organic farmlands. Appl Environ Microbiol. 2013;79:1284–92.
Yang Y, Wu L, Lin Q, Yuan M, Xu D, Yu H, et al. Responses of the functional structure of soil microbial community to livestock grazing in the Tibetan alpine grassland. Glob Chang Biol. 2013;19:637–48.
Zhou J, Deng Y, Luo F, He Z, Tu Q, Zhi X. Functional molecular ecological networks. mBio. 2010;1(4):e00169.
Zhou J, Xue K, Xie J, Deng Y, Wu L, Cheng X, et al. Microbial mediation of carbon-cycle feedbacks to climate warming. Nat Clim Chang. 2012;2:106–10.

GHOSTM

Yutaka Akiyama
Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan

Definition

GHOSTM is a homology search tool developed for metagenomics and accelerated by GPU computing.
GHOSTM can be used as an alternative to the BLASTX program, which searches protein databases
using a translated nucleotide query. The GHOSTM system achieved calculation speeds that were
130 times faster than BLAST with 1 GPU. It also had a calculation speed that was 3.4 times
faster than BLAT with higher search sensitivity. GHOSTM is distributed under the MIT license
and its source code is available for download at http://code.google.com/p/ghostm/.

Introduction

In metagenomic analysis, the DNA sequence fragments obtained from environmental samples
frequently include DNA sequences from many different species, and closely related reference
genome sequences are often unavailable. Thus, sensitive approaches are required for the identification of novel genes. Metagenomic DNA fragments are often translated into protein-coding sequences and then assigned to protein families, such as those in the COG and Pfam databases. The BLASTX (Altschul et al. 1990) program has been used for such binning and classification because it can identify homologues that lack high nucleotide sequence identity: once the sequences are translated, a homologue can be found in a distantly related member of a protein family (Turnbaugh et al. 2006). The BLAST algorithm is sufficiently sensitive for searching protein families, but it is too slow for analyzing the large quantities of data produced by a next-generation sequencer. In practice, approximately 1,000 CPU days were needed for querying 20 million short reads against the KEGG database using the BLASTX program.

To address this issue, the GHOSTM software (Suzuki et al. 2012) was developed. GHOSTM uses GPU computing to search a database efficiently for homologous sequences. Graphics processing units (GPUs) were originally designed for graphics applications, but newer generations of GPUs have become powerful coprocessors for general-purpose computing because their computational power exceeds that of CPUs. For example, the peak performance of a GPU such as the NVIDIA Tesla K20 is approximately 3.5 TFLOPS, more than ten times that of the most recent CPUs. GPUs have already been used for several bioinformatics applications, such as CUDASW++ (Liu et al. 2010) and CUDA-BLASTP (Liu et al. 2011).

GHOSTM employs a new and efficient homology search algorithm suitable for GPU calculation. Like the BLASTX program, the system accepts as input a large number of short DNA fragment sequences produced by a next-generation sequencer and performs homology searches of these DNA sequences against a protein sequence database. The system demonstrated a calculation speed that was 130 times faster with one GPU than BLAST on a CPU.

Overview of the Algorithm

GHOSTM is composed mainly of three components, as shown in Fig. 1. The first component uses the indexes to find candidate alignment positions for a sequence in the database. The second component calculates local alignments around the candidate positions, using the Smith-Waterman algorithm to compute the alignment scores. Finally, the third component sorts the alignment scores and outputs the search results. Both the candidate search and the local alignment components require a large amount of computing time. Therefore, the queries in both components are processed in parallel and mapped onto GPUs, so that multiple queries are processed simultaneously on different GPU cores. GPUs have many computing cores (the Tesla S1070 has 240 cores per GPU), and this is the source of the reduction in GHOSTM's processing time. Importantly, the GHOSTM system requires a sufficient number of queries for maximum efficiency; in fact, when only one query sequence is used, GHOSTM becomes much slower than BLAST.

Search Performances

Because metagenomic analyses require highly sensitive searches, homology search programs that are fast but have low sensitivity, such as BLAT (Kent 2002), are difficult to use. In contrast, GHOSTM has sufficient search sensitivity for metagenomic analysis.

Figure 2 compares the search sensitivity of each homology search program. To evaluate search sensitivity, the search results obtained with the Smith-Waterman local alignment method implemented in SSEARCH (Pearson 1991) were assumed to be the correct answers. The performance of a particular method is evaluated as the fraction of its results that correspond to the correct answers obtained by SSEARCH. The search accuracy of GHOSTM was clearly higher than that of BLAT. Low-scoring hits (e.g., <50) are generally not used in practice because such hits can occur by chance. With the
exception of the low-score hits, GHOSTM successfully identified more than 90 % of the hits identified by SSEARCH. This result suggests that GHOSTM is sufficiently accurate for general usage.

GHOSTM, Fig. 1 Data flow and processing within GHOSTM

GHOSTM, Fig. 2 Search accuracy of GHOSTM

The computational times of BLAST, BLAT, and GHOSTM for 100 thousand reads are shown in Table 1. Each query read is 60 to 75 bp long, and the search target is the KEGG Genes (“genes.pep”) database (Kanehisa et al. 2010), which is approximately 2.5 GB in size. Using 1 GPU and 4 GPUs, the GHOSTM program was approximately 130 and 400 times faster, respectively, than the BLAST program using 1 thread. Moreover, GHOSTM was approximately 3.4 times faster than BLAT despite its higher search sensitivity. GHOSTM achieves both high search speed and high search sensitivity compared with previous homology search tools.

GHOSTM, Table 1 Comparison of search speed

Program              #GPUs   Time (s)   Acceleration ratio
GHOSTM (K = 4)       1       2,855      129.5
GHOSTM (K = 4)       4       909        406.7
BLAT                 -       9,898      37.3
BLASTX (1 thread)    -       369,678    1
BLASTX (4 threads)   -       102,255    3.6
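The three-stage design described in the Overview of the Algorithm (indexed candidate search, Smith-Waterman alignment around the candidates, then sorting of scores) can be illustrated with a small CPU-only sketch. This is not GHOSTM's implementation: all function names, the match/mismatch/gap scores, and the `margin` parameter are hypothetical choices made for illustration. The sketch compares protein strings directly, so it skips the translation of DNA queries, uses a simple identity score instead of a substitution matrix, and processes queries one at a time rather than mapping many queries onto GPU cores. The default `k=4` merely echoes the "K = 4" setting listed in Table 1, whose exact meaning is not described here.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every k-mer to the (sequence id, offset) positions where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_candidates(query, index, k):
    """Stage 1: look up each query k-mer and collect (sequence id, diagonal) seeds."""
    candidates = set()
    for qpos in range(len(query) - k + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + k], ()):
            candidates.add((seq_id, dpos - qpos))
    return candidates


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Stage 2: best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database, index, k=4, margin=8):
    """Run all three stages for one query; return (score, sequence id), best first."""
    best_per_seq = {}
    for seq_id, diagonal in find_candidates(query, index, k):
        target = database[seq_id]
        # Align only a region around the candidate position, not the whole target.
        start = max(0, diagonal - margin)
        end = min(len(target), diagonal + len(query) + margin)
        score = smith_waterman(query, target[start:end])
        if score > best_per_seq.get(seq_id, 0):
            best_per_seq[seq_id] = score
    # Stage 3: sort the alignment scores and report the results.
    return sorted(((s, sid) for sid, s in best_per_seq.items()), reverse=True)


# Toy protein database (made-up sequences) and a single query.
database = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "GGGGGGGGGGGG",
    "p3": "MKTAYIAKQR",
}
index = build_kmer_index(database, k=4)
hits = search("KTAYIAKQ", database, index, k=4)
print(hits)
```

Because each query is handled independently in stages 1 and 2, the outer loop over queries is the natural unit to parallelize. That is consistent with the observation above that such a design needs many queries to be efficient and gains nothing from parallel hardware when given a single query.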
Installation and Requirements

The source code of GHOSTM is distributed under the MIT license and is available for download at http://code.google.com/p/ghostm/. GHOSTM is implemented in C++ with the NVIDIA CUDA library and requires CUDA version 2.2 or higher. Thus, the user needs an NVIDIA GPU card, such as a Tesla K20, to execute the GHOSTM program. GHOSTM can also be run on a general GeForce graphics card as well as on a Tesla. The performance of GHOSTM depends mainly on the number of CUDA cores and their clock rates, so several GeForce GTX cards show better performance than Tesla cards. However, current GeForce cards do not have error-correcting code (ECC) memory, and thus the search results obtained using such cards may be unreliable because of GPU memory errors. Therefore, Tesla GPUs are recommended, especially if the user has to process a large number of sequences.

Summary

Sequencing technology continues to improve, and sequencers are producing larger and larger quantities of data. This explosion of sequence data makes computational analysis with contemporary tools more difficult. GHOSTM, an efficient tool based on GPU-computing techniques, is a potential solution to this problem.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010;38(Database issue):D355–60.
Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12(4):656–64.
Liu Y, Schmidt B, Maskell DL. CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes. 2010;3:93.
Liu W, Schmidt B, Müller-Wittig W. CUDA-BLASTP: accelerating BLASTP on CUDA-enabled graphics hardware. IEEE/ACM Trans Comput Biol Bioinform. 2011;8(6):1678–84.
Pearson WR. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics. 1991;11(3):635–50.
Suzuki S, Ishida T, Kurokawa K, Akiyama Y. GHOSTM: a GPU-accelerated homology search tool for metagenomics. PLoS One. 2012;7(5):e36060. Fernandez-Fuentes N, ed.
Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444(7122):1027–31.
H
Horizontal Gene Transfer and Bacterial Diversity

Chitra Dutta¹ and Munmun Sarkar²
¹Structural Biology & Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, Kolkata, West Bengal, India
²CSIR-Indian Institute of Chemical Biology, Kolkata, India

Synonyms

Lateral gene transfer

Definition

Horizontal gene transfer (HGT) is the process in which genetic material is transmitted between two organisms that are not parent and offspring. HGT is pervasive among bacteria, even among very distantly related ones. Through transmission of distinct physiological traits from one organism to another, it may cause drastic changes in the ecological and pathogenic character of bacterial species and thereby may catalyze the diversification of bacterial lineages.

Introduction

Bacteria are the most diverse and versatile life forms of our planet. Given that they are tiny, unicellular organisms with relatively small genomes, the variations observed in their cellular architectures, metabolic properties, and ecological preferences are remarkable. Such enormous diversity may be attributed to the extremely dynamic genomes of bacteria, which evolve rapidly through alteration, acquisition, deletion, and rearrangement of relevant genetic information through various molecular mechanisms. These mechanisms include not only processes of internal modification of genetic material, like mutation or homologous recombination, but also exchange of specific sets of genes with other species through the process of horizontal transfer (Ochman et al. 2000). Mutations usually lead to slow, subtle, but continuous refinement and alteration of existing genes that may foster diversification and speciation of microorganisms on an evolutionary time scale. HGT, on the contrary, is capable of introducing abrupt large-scale changes in the gene repertoire of an organism that may confer novel physiological traits on the recipient, enable it to explore new ecological niches, and even generate new variants of bacterial strains by “genetic quantum leaps” (Groisman and Ochman 1996).

Mechanisms of HGT

HGT is in sharp contrast with the process of vertical transfer, which propagates genes from the parental generation to offspring via sexual or asexual reproduction.
There are three principal mechanisms for interspecies transmission of DNA elements in HGT (Ochman et al. 2000):
(i) Transformation – uptake of naked DNA elements from the environment
(ii) Transduction – the bacteriophage-mediated transmission of genetic material between organisms recognized by the phage
(iii) Conjugation – transfer of DNA from the donor to the recipient through cell-to-cell contact via a sexual pilus

However, mere insertion of the donor DNA element into a recipient cytoplasm does not ensure a successful HGT unless this foreign DNA sequence becomes stable in the host chromosome. Though the transfer or uptake of a short DNA sequence is usually indiscriminate with respect to the functional or compositional features of the transmitted sequence, stabilization of this foreign DNA element in the host organism depends critically on the compatibility of the transferred genes with the transcriptional and translational machinery of the host (Dutta and Pan 2002). Stable incorporation of the newly acquired DNA into the host genome can be mediated by any of the following processes: (i) homologous recombination, which normally limits the process to closely related organisms; (ii) persistence as an episome, if favored by natural selection; (iii) integration mediated by mobile genetic elements; and (iv) illegitimate incorporation through chance events of double-strand break repair.

Factors Regulating the Events of HGT and Their Outcomes

Depending on the organisms involved and the gene transfer mechanisms that are operational, there are a number of factors that can foster or limit the transfer, uptake, stabilization, and expression of foreign DNA molecules in bacteria.

Factors that may foster an event of HGT (Wiedenbeck and Cohan 2011) include both mechanistic and functional aspects. The phylogenetic closeness of the donor and the recipient often facilitates HGT, since close relatives are likely to have greater sequence identity and hence a higher probability of homologous recombination as well as homology-facilitated illegitimate recombination (HFIR). Bacteria with the same restriction–modification system can more easily share a phage or a plasmid and exchange their DNA elements. Short DNA segments (carrying one to several genes) usually have a greater probability of undergoing a successful adaptive HGT, even across deeply divergent bacteria, because a short segment may allow an organism to selectively pick up a niche-transcending gene or set of genes without acquiring the niche-specifying genes of the donor. Furthermore, a short DNA segment may also survive in a host with a distinct restriction–modification system, as it is less likely to contain a given recognition sequence and may thereby be better protected from cleavage by the restriction system of the host. And, needless to say, a niche-transcending HGT that provides an important adaptation to a recipient will always have a selective advantage.

Among the mechanistic barriers limiting unregulated uptake of foreign DNA in bacteria are the lack of similarity between the donor and the recipient, which may prohibit the integration of new sequence into a replicating genetic unit; surface exclusion, which may create an effective barrier against conjugative transfer into cells; and the presence of distinct restriction–modification systems in the host (Thomas and Nelsen 2005).

A protein’s connectivity may be another important factor for the transferability of genes across organisms. The complexity hypothesis (Jain et al. 1999) predicts a low rate of transfer for genes whose products are involved in many complex interactions. Transfer of only one part of a complex set of coadapted structures is likely to bring about an incompatibility and loss of function. It is thought that bacterial genes may be broadly classified into two categories according to their transferability (Nakamura et al. 2004): (i) less transferable “informational” genes involved in replication, translation, and transcription and (ii) frequently transferable “operational” genes involved in metabolism. It has also been reported that among operational genes, those involved in cell surface, DNA binding, and
pathogenicity-related functions have a higher probability of HGT than the genes related to amino acid biosynthesis, biosynthesis of cofactors, energy metabolism, intermediary metabolism, fatty acid and phospholipid metabolism, and nucleotide biosynthesis.

Any recipient organism would also try to resist an event of HGT that might incur harmful pleiotropic effects. The deleterious side effects of a new acquisition often drive natural selection toward “domesticating” the acquired DNA, i.e., toward ameliorating its negative fitness effects (Wiedenbeck and Cohan 2011). Newly acquired genes may have higher rates of evolution than other genes in the genome. Another mechanism for domesticating a horizontally acquired adaptation involves initial repression of the acquired gene(s) in the host genome by histone-like nucleoid-structuring proteins (H-NS) (Dorman 2004). The compositional differences between a donor segment and the recipient diminish over time as incorporated genes are subjected to the mutational bias of the host (Lawrence and Ochman 1997).

Bacterial Diversity Incurred by HGT

HGT is thought to be a prime contributor to bacterial evolution. As more and more genome sequences are determined, it is becoming clear that cross-species transmission of genetic information through HGT is pervasive among bacteria, that it may occur at vast phylogenetic distances, and that it may confer novel phenotypes and functions on the host organism by introducing fully functional genes and gene clusters. Unlike point mutations, which can only adjust preexisting phenotypes, HGT may result in drastic changes in the metabolic, pathological, or ecological character of a microbial species, thereby allowing effective and competitive exploitation of new niches (Lawrence 1999; Hacker and Kaper 2000). In cases where habitat differences suggest ecological differentiation between close relatives, a genome-based analysis often identifies one or more events of HGT as the primary cause of the ecological divergence. Some of the niche-transcending traits that are commonly introduced in bacterial species through HGT are as follows.

Novel Metabolic Traits and Niche Adaptation

In bacteria, a substantial portion of species-specific functions can be attributed to HGT. Through HGT, divergent bacterial populations may share an adaptation that transcends their differences in cellular architectures, physiological capabilities, and ecological niches. For instance, enterotoxigenic Escherichia coli, which attacks the epithelial cells of the small intestine, shares the class 5 fimbriae with Burkholderia cepacia, which resides in the lungs of cystic fibrosis patients and attacks the respiratory epithelium. On the other hand, closely related bacteria or even strains of the same species may exhibit radically different metabolic, physiological, or pathogenic traits, thanks to HGT. Bacillus anthracis (strain Ames ancestor), Bacillus cereus (ATCC1098), and Bacillus thuringiensis (serovar konkukian str. 97–27) are all considered a single species, as they show more than 94 % average nucleotide identity (ANI) and have highly syntenic gene repertoires. And yet they are drastically different in their phenotypes: a highly virulent pathogen and potentially lethal bioterror agent, a source of food poisoning, and an eco-friendly organic biopesticide, respectively (Doolittle and Papke 2006).

HGT, in many cases, endows the recipient with novel metabolic capabilities that enable it either to invade a new niche or to improve its performance in its current niche (Cohan and Koeppel 2008). For example, acquisition of the lac operon has enabled Escherichia coli to utilize the milk sugar lactose as a carbon source and thereby to explore a new niche, the mammalian colon, where it has established a commensal relationship (Ochman et al. 2000). An event of HGT may even allow for conversion of the recipient into a radically different organism that may inhabit niches completely unexplorable by organisms relying on mutational processes alone. Examples include the aerobic methanotrophs that have gained the ability to synthesize critical cofactors for
H4MPT-mediated methyl-group transfer by acquiring genes from methanogenic archaea, bacteria that exploit halorhodopsin homologues as light-driven proton pumps, and cyanobacteria gaining the capability of oxygenic photosynthesis through acquisition of a second photosystem (Gogarten et al. 2002).

Speciation and Sub-speciation in Bacteria

A substantial part of the speciation and sub-speciation in bacteria can be explained as the result of macroevolution events mediated by HGT (de la Cruz and Davies 2000). Using E. coli and Salmonella as a model system, it has been demonstrated that 17 % of their genomes (~800 kb) appear to have been acquired by HGT during the past 100 million years. As the majority of this DNA seems to have been recruited recently, it is apparent that considerable genetic flux may still be occurring across these two species, and the 234 detectable HGT events that have persisted are probably “the tip of the iceberg of the thousands of mobile sequences” that have been acquired or shed by any particular E. coli strain (de la Cruz and Davies 2000). Comparison of the members of a well-known collection of E. coli strains (the ECOR collection) revealed that these strains are quite variable in the size and macro-organization of their chromosomes and plasmids. These observations indicate that a significant proportion of the genome of any strain of a single bacterial species may comprise fragments of functional genetic elements from various origins, which, if properly “nurtured,” can give rise to new bacterial species.

Adoption of Pathogenic/Symbiotic Lifestyle Through Acquisition of Genomic Islands

Horizontal acquisition of virulence factors is a common strategy by which bacteria transform from a benign form into a pathogen. A pathogenic strain of any bacterial species is often distinguished from the nonpathogenic variants of the same or related species by the presence of a cluster of virulence factors like toxins, invasion factors, adherence factors, and secretion systems, the G+C composition of which may differ significantly from that of the core genome of the respective species. Such discrete gene clusters, referred to as “virulence cassettes” or “pathogenicity islands” (Groisman and Ochman 1996; Hacker et al. 1997), usually reside at tRNA and tRNA-like loci, which appear to be common sites for integration of foreign sequences (Hacker et al. 1997; Ochman et al. 2000), and are flanked by 16–20 bp perfect or almost perfect direct repeats. They may also carry insertion elements or transposons. All these observations strongly argue in favor of horizontal acquisition of these islands by their host genomes. Conversion of laboratory strains of E. coli from avirulent to virulent forms upon experimental introduction of genes from other species (Isberg and Falkow 1985; McDaniel and Kaper 1997) and the presence of large virulence plasmids in pathogenic Shigella and Yersinia (Gemski et al. 1980; Portnoy et al. 1981; Maurelli et al. 1985; Sasakawa et al. 1988) supported the notion of horizontal transfer of virulence factors in bacteria.

With the accumulation of genome sequences of diverse bacterial species, it became clear that pathogenicity islands represent a subclass of a more diverse group of genetic elements, designated genomic islands (GIs). A GI is a part of a genome, usually 10–200 kb in length, containing a set of horizontally acquired genes that might be beneficial for the host bacterium under specific environmental conditions. GIs may be associated with diverse adaptive functions that enable the respective species to survive in or colonize a specialized niche or to adopt a distinct lifestyle. For instance, nitrogen fixation genes harbored by “symbiosis islands” in various Rhizobiaceae species enable these organisms to develop a symbiotic relationship with legumes, which, in turn, facilitates their survival inside the root nodules of the legumes (Sullivan and Ronson 1998).

Dissemination of the gene clusters (operons) involved in the catabolism of xenobiotics in polluted environments is often attributed to transfer of specific integrative and conjugative elements (ICElands), a special type of genomic island, across bacterial populations (van der Meer and Sentchilo 2003; de la Cruz and Davies 2000).
Examples include the ICElands containing the clc element for chlorobenzoate and chlorocatechol degradation in Pseudomonas sp. strain B13 or in Ralstonia sp. strain JS705. It may be mentioned in this context that xenobiotic degradation pathways usually require complex genetic systems like operons of ten or more genes or even regulons of several operons along with their control circuits. For instance, in Sphingomonas aromaticivorans, 15 gene clusters directly associated with the catabolism or transport of aromatic compounds reside in a large conjugative plasmid, pNL1, and have enabled the host bacteria to degrade compounds such as biphenyl, naphthalene, xylene, and cresol (Romine et al. 1999).

The same or similar GIs may carry out distinct functions in different species, depending upon the genetic background and lifestyle of their hosts (Dutta and Paul 2012). For instance, GIs carrying secretion systems of type III in the pathogenic strains of the Salmonella, Shigella, and Yersinia groups or of type IV in Legionella pneumophila and Helicobacter pylori are known to be involved in the infectious process of their respective hosts. But similar GIs encoding the type III system of rhizobia or the type IV system of F plasmids function as symbiotic islands that enhance the fitness of the host organisms in their natural niches. GIs encoding adherence factors like P-, S-, and F1C-fimbriae in E. coli strains of the human gut microbiome act as saprophytic islands that facilitate colonization of the gut by these microbes. But if under special circumstances the P-, S-, or F1C-positive E. coli reaches the human urinary tract, the same island may serve as a pathogenicity island that helps its host microbe to infect the bladder or kidney.

Antibiotic Resistance

A major health concern over the past few decades is the emergence of numerous antibiotic-resistant pathogenic strains. Horizontal gene transfer is one of the major reasons for the dissemination of various antibiotic resistance factors throughout diverse microbial species. The resistance genes located in various mobile DNA elements (such as plasmids) are easily transferred from one species to another, especially in the hospital environment among close contaminants and in patients with compromised immunity, thus resulting in nosocomial infections caused by multidrug-resistant bacterial strains.

Among the different classes of antibiotic-resistant organisms, the two most widely studied phenotypes are resistance to β-lactams and resistance to fluoroquinolones (Barlow 2009). β-Lactam antibiotics, one of the major groups of antibiotics used globally, act by inhibiting bacterial cell wall biosynthesis, mainly in gram-positive bacteria. They contain a β-lactam ring in their structures and require this ring to be intact in order to be effective. The transfer of the β-lactamase gene (whose product acts by cleaving the β-lactam ring) into many previously sensitive strains, predicted to have come from different gram-negative species such as E. coli, has resulted in various pathogenic strains resistant to most available antimicrobials. One of the most cited examples is methicillin-resistant Staphylococcus aureus (MRSA), one of the most virulent strains of S. aureus, which is resistant to most β-lactams. In addition to β-lactamase activity, another gene, mecA, is associated with resistance to most β-lactams. This gene acts by producing an altered penicillin-binding protein with lower affinity for β-lactam antibiotics. Another group of antibiotics, the fluoroquinolones, which are effective against many gram-negative bacteria and widely used in both human medicine and veterinary practice, is also becoming less effective because of the growing incidence of resistant strains.

Different strains of enterococci, natural commensals of the human gut, have been shown to contribute to several cases of HGT because they carry a large number of plasmids. Cases of vancomycin resistance in E. faecalis and E. faecium have been shown to be mediated through a type of pheromone-independent plasmid (Palmer et al. 2010). Recent cases of plasmid-mediated transfer of vancomycin resistance from enterococci to MRSA are producing an alarming rate of last-line antibiotic failure, leading to the growth of nosocomial pathogens against which no antibiotic is effective.
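The observation made earlier in this entry, that horizontally acquired clusters such as pathogenicity islands often differ in G+C composition from the core genome, suggests a simple compositional screen. The following is a minimal sketch of that idea only, not a method described in this entry: the function names, the window and step sizes, and the deviation threshold are all hypothetical, and the input is a synthetic sequence. Because acquired genes drift toward the mutational bias of the host over time, a screen of this kind would be expected to miss older acquisitions.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def atypical_gc_windows(genome, window=1000, step=500, threshold=0.10):
    """Flag windows whose G+C fraction deviates from the genome-wide value.

    Returns the genome-wide G+C fraction and a list of
    (start, end, window G+C fraction) tuples for the flagged windows.
    """
    genome_gc = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        window_gc = gc_fraction(genome[start:start + window])
        if abs(window_gc - genome_gc) >= threshold:
            flagged.append((start, start + window, round(window_gc, 3)))
    return genome_gc, flagged


# Synthetic example: a 50 % G+C background with a 1,000-base AT-only insert.
genome = "ACGT" * 1000 + "AT" * 500 + "ACGT" * 1000
genome_gc, flagged = atypical_gc_windows(genome)
for start, end, window_gc in flagged:
    print(start, end, window_gc)
```

In practice, compositional signals of this kind are usually combined with other evidence mentioned in this entry, such as location at tRNA loci, flanking direct repeats, and the presence of insertion elements or transposons.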
Evolutionary or Ecological Implications of HGT

Discoveries of rampant interspecies gene transfer across the entire microbial world and even beyond have underscored the need for reviewing the basic concepts of biological evolution. As proposed by Doolittle (1999), a single universal phylogenetic tree might not be the best way to depict relationships between all living and extinct species. Instead, a web- or netlike pattern might provide a more appropriate representation (Doolittle 1999). It appears that some genes have flowed “randomly” through the biosphere, almost as if all life forms constituted one global superorganism, divided into subpopulations, within and between which genes are exchanged at varying frequencies (Dutta and Pan 2002).

Microbial niches are also no longer considered static domains. A microbial niche may be considered a dynamic domain that is continuously redefined after each gene transfer event. This alteration of niche boundaries then imposes a different filter on the influx of foreign DNA, imparting distinct selective pressures on incoming genes. Recently, Martin et al. (2013) proposed a new model of ecological speciation via gradual genetic isolation, instigated by differential niche acclimatization of nascent bacterial populations. The model predicted how microbial populations, despite having ecological cohesion, can display high genomic diversity through selective, local HGT events, by tapping into a gene pool that is adaptive toward continuously changing, local organismic interactions.

The pervasiveness of HGT across the entire living world has also redefined the concept of the “universal ancestor” (Woese 1998). The presence of a gene in all three domains of life (Bacteria, Archaea, and Eukarya) does not necessarily ensure its existence in their common ancestor; it could have arisen at a later age in one domain and spread to the others. As stated by Woese (1998): “the universal ancestor is not a discrete entity. It is, rather, a diverse community of cells that survives and evolves as a biological unit.”

Summary

Horizontal gene transfer (HGT), the process of interspecies transfer of genetic material via mobile genetic elements such as plasmids, phages, genomic islands, and genomic modules, plays an important role in bacterial evolution, speciation, and diversification. HGT is pervasive among bacteria and may occur at vast phylogenetic distances. By introducing fully functional genes and gene clusters, an event of HGT may confer novel phenotypes and functions on the host organism; may result in drastic changes in its metabolic, pathological, or ecological character, thereby allowing effective and competitive exploitation of new niches; and can even generate new variants of bacterial strains by “genetic quantum leaps.” The widespread distribution of various antibiotic resistance genes throughout diverse microbial species, the dissemination of gene clusters (operons) involved in biodegradative pathways, the transformation of various bacteria from a benign form into a pathogen, the evolution of rhizobia–legume symbiosis, and interstrain variations in the size and macro-organization of the chromosomes of any specific bacterial species may all be attributed to HGT. Genetic elements that can be transferred as a functional unit and provide a niche-transcending adaptation have a greater probability of undergoing a successful adaptive HGT. Informational genes involved in replication, translation, and transcription are, in general, less transferable than the operational genes involved in metabolism. Stabilization of the transferred material within the host is often limited by the genetic and ecological similarity of the donor and the recipient. Recognition of HGT as a prime factor in bacterial speciation and diversification has revolutionized the basic concepts of biological evolution. It has been proposed that all prokaryotes together might be considered one “global superorganism” divided into subpopulations within and across which genes are frequently exchanged. It has also been proposed that bacterial niches and HGT constantly interact with one another, each affecting the other as lineages evolve.
References
Barlow M. What antimicrobial resistance has taught us about horizontal gene transfer. Methods Mol Biol. 2009;532:397–411.
Cohan FM, Koeppel AF. The origins of ecological diversity in prokaryotes. Curr Biol. 2008;18:R1024–34.
de la Cruz F, Davies J. Horizontal gene transfer and the origin of species: lessons from bacteria. Trends Microbiol. 2000;8:128.
Doolittle WF. Lateral genomics. Trends Cell Biol. 1999;9:M5–8.
Doolittle WF, Papke RT. Genomics and the bacterial species problem. Genome Biol. 2006;7:116.
Dorman CJ. H-NS: a universal regulator for a dynamic genome. Nat Rev Microbiol. 2004;2:391–400.
Dutta C, Pan A. Horizontal gene transfer and bacterial diversity. J Biosci. 2002;27 Suppl 1:27–33.
Dutta C, Paul S. Microbial lifestyle and genome signatures. Curr Genomics. 2012;13:153–62.
Gemski P, Lazere JR, Casey T, Wohlhieter JA. Presence of a virulence-associated plasmid in Yersinia pseudotuberculosis. Infect Immun. 1980;28(3):1044–7.
Gogarten JP, Doolittle WF, Lawrence JG. Prokaryotic evolution in light of gene transfer. Mol Biol Evol. 2002;19:2226–38.
Groisman EA, Ochman H. Pathogenicity islands: bacterial evolution in quantum leaps. Cell. 1996;87:791–4.
Hacker J, Blum-Oehler G, Mühldorfer I, Tschäpe H. Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution. Mol Microbiol. 1997;23:1089–97.
Hacker J, Kaper JB. Pathogenicity islands and the evolution of microbes. Annu Rev Microbiol. 2000;54:641–79.
Isberg RR, Falkow S. A single genetic locus encoded by Yersinia pseudotuberculosis permits invasion of cultured animal cells by Escherichia coli K-12. Nature. 1985;317(6034):262–4.
Jain R, Rivera MC, Lake JA. Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci U S A. 1999;96:3801–6.
Lawrence JG. Gene transfer, speciation, and the evolution of bacterial genomes. Curr Opin Microbiol. 1999;2:519–23.
Lawrence JG, Ochman H. Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol. 1997;44:383–97.
Martin F, Polz MF, Alm EJ, Hanage WP. Horizontal gene transfer and the evolution of bacterial and archaeal population structure. Trends Gen. 2013;29:170–5.
Maurelli AT, Baudry B, d’Hauteville H, Hale TL, Sansonetti PJ. Cloning of plasmid DNA sequences involved in invasion of HeLa cells by Shigella flexneri. Infect Immun. 1985;49(1):164–71.
McDaniel TK, Kaper JB. A cloned pathogenicity island from enteropathogenic Escherichia coli confers the attaching and effacing phenotype on E. coli K-12. Mol Microbiol. 1997;23(2):399–407.
Nakamura Y, Itoh T, Matsuda H, Gojobori T. Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat Genet. 2004;36:1126.
Ochman H, Lawrence JG, Groisman EA. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405:299–304.
Palmer KL, Kos VN, Gilmore MS. Horizontal gene transfer and the genomics of enterococcal antibiotic resistance. Curr Opin Microbiol. 2010;13(5):632–9.
Portnoy DA, Falkow S. Virulence-associated plasmids from Yersinia enterocolitica and Yersinia pestis. J Bacteriol. 1981;148(3):877–83.
Romine MF, Stillwell LC, Wong KK, Thurston SJ, Sisk EC, Sensen C, Gaasterland T, Fredrickson JK, Saffer JD. Complete sequence of a 184-kilobase catabolic plasmid from Sphingomonas aromaticivorans F199. J Bacteriol. 1999;181(5):1585–602.
Sasakawa C, Kamata K, Sakai T, Makino S, Yamada M, Okada N, Yoshikawa M. Virulence-associated genetic regions comprising 31 kilobases of the 230-kilobase plasmid in Shigella flexneri 2a. J Bacteriol. 1988;170(6):2480–4.
Sullivan JT, Ronson CW. Evolution of rhizobia by acquisition of a 500-kb symbiosis island that integrates into a phe-tRNA gene. Proc Natl Acad Sci U S A. 1998;95:5145–9.
Thomas CM, Nielsen KM. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat Rev Microbiol. 2005;3:711.
van der Meer JR, Sentchilo V. Genomic islands and the evolution of catabolic pathways in bacteria. Curr Opin Biotechnol. 2003;14:248–54.
Wiedenbeck J, Cohan FM. Origins of bacterial diversity through horizontal genetic transfer and adaptation to new ecological niches. FEMS Microbiol Rev. 2011;35:957–76.
Woese C. The universal ancestor. Proc Natl Acad Sci U S A. 1998;95:6854–9.

Host-Virus Interaction: From Metagenomics to Single-Cell Genomics

Arbel D. Tadmor¹ and Rob Phillips²
¹TRON – Translational Oncology at the University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
²Departments of Applied Physics and Bioengineering, California Institute of Technology, Pasadena, CA, USA

Synonyms
DNA packaging gene; Large terminase subunit gene; TerL
Definition
dPCR (digital PCR) is a PCR reaction performed in a nanoliter or subnanoliter volume, making it possible to detect single molecules.
MetaCAT (metagenome cluster analysis tool) is a metagenome data mining tool that uses an iterative dynamic clustering approach to identify the most abundant genes in a given metagenome with respect to a reference dataset containing potentially homologous genes.
Bacteriophage is a virus that infects and replicates within bacteria (phage for short).
Prophage is a phage genome that is integrated into the bacterial genome or exists in the form of a plasmid within the cell.

Introduction
It is widely appreciated today that viruses are a dominant and critical part of Earth’s biosphere. Yet despite major advances in the study of environmental viruses, in most cases our knowledge of which viruses go with which hosts is meager. In the classic phage isolation technique, known as the plaque assay, a confluent layer of host cells is infected with a low density of viral particles. When a virus infects a cell within this “lawn” of host cells, the cell lyses, and new viral particles infect adjacent cells, thereby creating a clearing, or plaque, in the lawn. This technique for isolating viruses requires that the host of the virus be culturable. However, given that >99% of microbes on Earth cannot be cultured at this time, the vast majority of phage-host systems cannot be investigated in the laboratory using these classical phage enrichment techniques. Consequently, little is known about the biology of most viruses and their host specificity in the wild.

Metagenomic studies of environmental viruses circumvent the cultivation limitation and therefore have offered a unique glimpse into the genetic composition of environmental viruses (Kristensen et al. 2010; Mokili et al. 2012). In low complexity environments such as natural acidophilic biofilms, metagenomic analysis can utilize antiviral defense systems known as CRISPRs to pair viruses with their hosts by matching spacer sequences that occur both in the genome of the virus and in the genome of the host (Andersson and Banfield 2008). Metagenomes, however, are generally limited in their ability to shed light on the nature of host-virus interaction since most environments are more complex. More importantly, the physical entity of the cell is lost in the process of preparing the metagenome, thus destroying the possibility of assigning a given virus to a corresponding host. A way to circumvent this problem is through culture-independent single-cell analysis methods that use microfluidic devices to trap and manipulate single cells.

Single-Cell Genomics
Microfluidic devices are currently routinely used to control and manipulate small volumes of liquid, including trapping and analyzing single cells (Kalisky et al. 2011). Once trapped, individual cells are lysed, and their genetic content can be probed. In the case of host-virus interaction, the genome of the virus forms a unique association with the bacterial cell (Fig. 1). Thus, in an ideal scenario, both the genome of the host and its virus would be sequenced. Although single-cell sequencing has been demonstrated (Kalisky et al. 2011; Kalisky and Quake 2011), for practical reasons, the number of cells that may be interrogated using this method remains at present quite low. As an alternative approach, it is possible to analyze single bacterial cells by PCR using microfluidic digital polymerase chain reaction (dPCR) arrays. This method, which is relatively high throughput, can currently interrogate several thousands of single cells within days.

Microfluidic Digital PCR
In microfluidic digital PCR a sample consisting of either DNA or cells is partitioned uniformly onto an “array” of nanoliter or subnanoliter PCR chambers, with each chamber ideally containing a single DNA molecule or a single cell
[Figure 1: (a) image of a microfluidic array panel; (b) normalized fluorescence versus cycle number (1 to 45), log scale from 0.01 to 1.0; (c) schematic of a microfluidic PCR chamber (150 μm across) holding a bacterial cell with its SSU rRNA gene and a viral marker gene]

Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Fig. 1 End point fluorescence measured in a panel of a microfluidic digital PCR array. (a) The measured end point fluorescence from the rRNA channel (right half of each chamber) and the terminase channel (left half of each chamber) in a microfluidic array panel. Each panel in this array (one of twelve) consists of 765 reaction chambers of 150 × 150 × 270 μm³ (6 nL) each. Retrieved colocalizations are outlined in orange, and positive rRNA chambers randomly selected for retrieval are outlined in gray. FA indicates false alarm (a probable terminase primer-dimer). (b) Normalized amplification curves of all chambers in (a) after linear derivative baseline correction (red/viral, green/rRNA). (c) Specific physical associations between a bacterial cell and the viral marker gene resulting in colocalization include, for example, an attached or assembling virion, an injected DNA, an integrated prophage, or a plasmid containing the viral marker gene (Tadmor et al. 2011)
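The chamber geometry quoted in the Fig. 1 caption and the Ct comparison made in the text for a 15 μL benchtop reaction versus a 6 nL chamber can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the Ct shift assumes ideal doubling per cycle, and the Poisson loading model (with a made-up mean of 0.1 cells per chamber) is a common assumption for partitioned samples rather than something stated in the source.

```python
import math


def chamber_volume_nl(x_um: float, y_um: float, z_um: float) -> float:
    """Volume of a box-shaped chamber in nanoliters (1 nL = 1e6 cubic micrometers)."""
    return x_um * y_um * z_um / 1e6


def ct_shift_cycles(bulk_volume_nl: float, chamber_volume: float) -> float:
    """Cycles saved when one template sits in the smaller volume,
    assuming ideal doubling per cycle and a concentration-based threshold."""
    return math.log2(bulk_volume_nl / chamber_volume)


def poisson_occupancy(mean_templates_per_chamber: float) -> dict:
    """Fractions of chambers that are empty, singly loaded, or multiply loaded
    under an assumed Poisson loading model."""
    p_empty = math.exp(-mean_templates_per_chamber)
    p_single = mean_templates_per_chamber * p_empty
    return {"empty": p_empty, "single": p_single,
            "multiple": 1.0 - p_empty - p_single}


volume = chamber_volume_nl(150, 150, 270)   # dimensions from the Fig. 1 caption
shift = ct_shift_cycles(15_000, 6)          # 15 uL benchtop reaction vs. 6 nL chamber
occupancy = poisson_occupancy(0.1)          # hypothetical mean loading per chamber

print(f"chamber volume: {volume:.3f} nL")
print(f"Ct shift: {shift:.1f} cycles")
print({name: round(p, 4) for name, p in occupancy.items()})
```

Under the Poisson assumption, a low mean loading keeps multiply occupied chambers rare at the cost of leaving most chambers empty, which is the usual trade-off when one target per chamber is wanted.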
(Kalisky et al. 2011; Kalisky and Quake 2011). Each chamber is loaded with a mixture of primers and fluorescent probes that target the loci of interest. The advantage of performing quantitative PCR (qPCR) reactions in such tiny volumes is that the likelihood of contamination is reduced, and the fluorescent signal per PCR chamber is greatly intensified. In a standard benchtop qPCR reaction, for example, the reaction volume is 15 μL compared to a dPCR reaction volume of 6 nL. Therefore dPCR Ct values are reduced by about log2(2,500) ≈ 11.3 cycles. In addition, the large dilution factor ensures that the vast majority of dPCR chambers are free from spurious reactions and contaminating molecules such as residual genomic DNA that is intrinsic to some reagents. These factors together provide the sensitivity required to PCR amplify and detect single molecules. Once thermocycling is completed, chambers containing the targets of interest are identified via the fluorescent signal, sampled, and post-amplified in the laboratory for sequencing using conventional benchtop PCR machines.

An appealing aspect of this technology is that cells may be harvested directly from the environment and loaded onto a microfluidic dPCR array. This method therefore does not require that cells be cultured beforehand and does not depend on
gene expression, the position of the targets in the genome, or on the physiological state of the cell at the time of harvest (Ottesen et al. 2006).

The first application of microfluidic digital PCR technology to study environmental bacteria involved colocalization of two genes present in the same individual bacterium (Ottesen et al. 2006). In this study, one marker targeted an important functional gene expressed by certain members of the microbial community resident in the hindgut of termites, and the second marker targeted the small subunit ribosomal RNA (SSU rRNA) gene that was used to phylogenetically identify the bacterium (also known as the 16S marker). By colocalizing and subsequently sequencing both markers from many individual cells, the identity of cells carrying the functional gene was ascertained in cases of repeated colocalizations.

To study host-virus interaction, the dPCR approach described above was extended to colocalize the SSU rRNA gene with a marker targeting a certain viral gene prevalent in the environment of interest, demonstrating proof-of-concept on the termite system (Tadmor et al. 2011). Targeting viruses, however, which are fundamentally different biological entities from bacteria, raised certain questions that needed to be addressed. First and foremost, unlike prokaryotes that have universal markers such as the SSU rRNA gene, viruses do not have a single shared gene that can be used as a universal marker (Rohwer and Edwards 2002). In fact, viral metagenomic studies have shown that viruses are likely the largest reservoirs of unknown genetic diversity, with the majority of putative viral sequences exhibiting no significant similarity to currently known genes (Edwards and Rohwer 2005; Kristensen et al. 2010; Mokili et al. 2012). To make matters worse, viruses are notorious for replicating their genetic material with low fidelity. Consequently the definition of a viral gene in the environment is relatively fluid. Finally, many genes present in the genome of the virus may be of prokaryote origin, making them poor signature markers for the virus. Thus, to utilize digital PCR to study host-virus interaction, an unequivocal viral marker that is ubiquitous in the environment of interest should be identified.

Requirements from a Viral Marker Gene
Not all viral genes are suitable to be unequivocal markers of a viral entity. As an example, the integrase gene, which codes for an enzyme that is used by the virus to integrate into the genome, is prevalent not only among phages, but also among certain nonviral entities such as plasmids, pathogenicity islands, and integrons (Casjens 2003). Similar arguments apply to viral genes involving lysis, regulation of gene expression, and DNA replication in viruses (Casjens 2003). Casjens therefore argues that ideal “cornerstone” phage genes (or at least prophage genes) are genes involved in the assembly of the virion. Of these, genes that appear to be not only virus specific but also particularly conserved are the large terminase subunit (TerL) and portal protein genes (Casjens 2003).

TerL genes have certain additional features that make them particularly attractive as viral markers. The TerL gene is a component of the DNA packaging and cleaving mechanism present in numerous double-stranded DNA phages (Rao and Feiss 2008). It contains an N-terminal ATPase domain, which is the “engine” of the DNA packaging motor, and a C-terminal nuclease domain (Rao and Feiss 2008). The ATPase domain of the TerL gene is conserved in a wide variety of dsDNA phages, including the eukaryotic herpes virus (Przech et al. 2003), suggesting it is an ancient viral domain (Rao and Feiss 2008). Indeed, Koonin et al., who define “hallmark viral genes” as “genes shared by many diverse groups of viruses with only distant homologs in cellular organisms and with strong indications of monophyly of all viral members of the respective gene families” and thus “can be viewed as distinguishing characters of the virus state” (Koonin et al. 2006), identified the ATPase subunit of the terminase gene as such a hallmark viral gene. Since TerL genes have particularly well-conserved functional residues and motifs (Rao and Feiss 2008), they are well suited for
degenerate primer design. At the same time, across biology TerL genes do not share overall significant sequence similarity (Rao and Feiss 2008), thereby making them sensitive viral markers.

Targeting a “cornerstone” or “hallmark” gene of a virus may, however, be of questionable use if the selected marker tags a defective prophage. Since a necessary condition for the virus to be active is that its cornerstone gene be functional, it is important to verify that the cornerstone gene is under negative selection pressure (Nei and Kumar 2000). Nonfunctional genes may contain errant stop codons, frameshift mutations, or mutations in certain highly conserved residues essential for the proper function of the protein. Yet demonstrating that a particular family of TerL alleles from the environment of interest is under negative selection pressure does not guarantee that the virus is active in this environment since a viral gene may remain functional while the prophage itself is defective. Such a situation can occur if there was insufficient time for point mutations to have accumulated in the gene of interest after the prophage was inactivated (Casjens 2003). Therefore, viruses carrying the viral marker may have been active only in recent evolutionary history. In an alternative scenario, the putative marker indeed degraded over time upon prophage inactivation, but it was subsequently repaired by a recombination event with another phage that was likely functional (Casjens 2003). In such a case the marker can serve as an indicator for indirect infection.

Another possibility is that the putative marker was recruited by the bacterium because it confers on the bacterium a competitive advantage, as is the case with lysogenic conversion genes. In this case the gene would remain under negative selection pressure, while the rest of the prophage degenerated (Boyd and Brüssow 2002; Casjens 2003). It is unlikely, however, that the host will recruit TerL genes since these are highly specialized motors required for virion synthesis. Alternatively, the putative marker could be part of a functional non-phage entity such as a gene transfer agent (GTA) or a bacteriocin (Casjens 2003). In the case of TerL genes, bacteriocins can be ruled out since these taillike structures do not contain head-related proteins (Daw and Falkiner 1996). GTAs can only be ruled out if the entire genome of the putative viral entity is obtained. Full length viral genomes may be obtained by means of single-cell sequencing techniques.

Identifying Ubiquitous Viral Markers
Although universally shared viral genes do not exist, it is beneficial to select a viral marker that is ubiquitous in the environment of interest. Ubiquitous markers not only have the potential to recover greater genetic diversity from the environment, but can possibly also be found in similar or related environments. Identification of a ubiquitous viral marker in the environment of interest, assuming one exists, is not straightforward and requires sophisticated metagenome data mining approaches. To address this problem the authors developed a bioinformatic program called MetaCAT (metagenome cluster analysis tool), which employs a heuristic clustering and ranking approach that aims to identify the most abundant genes of a given class (e.g., viral genes) in a metagenome, without relying on superficial features such as gene annotation (Tadmor et al. 2011). The input to MetaCAT is a metagenome (either assembled translated contigs or raw nucleotide reads) and a reference library of known genes (e.g., all known viral genes). The output of MetaCAT is a list of known reference genes (derived from an input reference library) that were found to be present in the metagenome, ranked by their abundance in the metagenome. Abundance of a reported gene is defined as the number of metagenome gene objects or reads that yield significant alignments with respect to this gene. A key feature of MetaCAT is that it uses an iterative dynamic clustering algorithm to group putative homologous reference genes from the input reference library. The clustering is dynamic in the sense that it is performed on the fly based on the matches found in the given metagenome, thereby avoiding loss of information that would occur if the reference library was a priori clustered. The clustering is performed iteratively until all
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Fig. 2 Schematic illustration of the MetaCAT algorithm. MetaCAT maps clusters of similar known reference genes to groups of metagenome gene objects or reads. MetaCAT defines two known reference genes as being similar or “related” if their footprint in the metagenome has a significant overlap. The abundance of a given cluster of related known reference genes in the metagenome is defined as the number of metagenome gene objects (or reads) with an E value below a given threshold found when BLASTing a representative from the gene cluster against the metagenome. The key feature of MetaCAT lies in its ability to cluster the list of known reference genes per metagenome and report a minimally redundant list of known genes that have putative homologs in the metagenome, ranked by their abundance in the metagenome. This list can then be used to generate hypotheses about the given metagenome. In this figure the left oval depicts a reference database of genes (black dots), and the right oval depicts a metagenome, with black dots representing metagenome gene objects. Hexagons in the reference database represent clusters of related reference genes identified by MetaCAT. Each hexagon is linked to a corresponding cluster of metagenome gene objects depicted by a circle of matching color
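The footprint idea in the Fig. 2 caption can be illustrated with a toy greedy clustering. This is not the MetaCAT implementation, which is not reproduced in this entry: the function name, the overlap measure (shared hits divided by the smaller footprint), the 0.5 cutoff, and the gene names in the example are all assumptions made for illustration. The input is assumed to be a mapping from each reference gene to the set of metagenome gene objects it hit below some E value threshold.

```python
def cluster_by_footprint(hits, min_overlap=0.5):
    """Greedy sketch: group reference genes whose metagenome footprints overlap.

    hits: dict mapping reference gene name -> set of metagenome object IDs
          with a significant alignment to that gene (hypothetical input format).
    Returns (representative, members, abundance) tuples, most abundant first.
    """
    # Reference genes with no significant hits are absent from the metagenome.
    remaining = {gene: set(fp) for gene, fp in hits.items() if fp}
    clusters = []
    while remaining:
        # The gene with the largest footprint represents the next cluster.
        rep = max(remaining, key=lambda g: (len(remaining[g]), g))
        rep_fp = remaining.pop(rep)
        members = [rep]
        for gene in sorted(remaining):
            fp = remaining[gene]
            shared = len(fp & rep_fp) / min(len(fp), len(rep_fp))
            if shared >= min_overlap:      # "related": footprints overlap
                members.append(gene)
        for gene in members[1:]:
            del remaining[gene]            # redundancy removed from the list
        clusters.append((rep, members, len(rep_fp)))
    return clusters


example_hits = {
    "terL_A": {1, 2, 3, 4},
    "terL_B": {2, 3, 4},
    "integrase_X": {7, 8},
    "unmatched_Z": set(),
}
for rep, members, abundance in cluster_by_footprint(example_hits):
    print(rep, members, abundance)
```

Because the grouping is driven by hits in the metagenome at hand rather than by a fixed pre-clustering of the reference library, the same library can collapse differently for different metagenomes, which is the sense in which the caption describes the clustering as being performed per metagenome.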
identified redundancy is removed. In this way the final reported list of ranked genes (or clusters of genes) contains orders of magnitude fewer elements than the reference library and is amenable to manual inspection (Fig. 2).

If gene annotation information is included in the reference database, this information will be provided by MetaCAT in the ranked list of genes, making it a straightforward task to identify genes of interest. As an example, Table 1 lists all the TerL genes identified by MetaCAT in the metagenome of the hindgut of a higher termite collected from Costa Rica (Tadmor et al. 2011).

Each known reference gene found by MetaCAT to be present in the metagenome can be paired with a metagenome gene object that yielded the lowest E value. This metagenome gene object is referred to as the “representative contig” of the known reference gene. By BLASTing the representative contigs corresponding to the top ranking candidate markers against other metagenomes from similar environments, or against genomes of organisms isolated from similar environments, it is possible to identify ubiquitous viral genes, if present (Fig. 3). Closely related genes found in multiple datasets are favorable candidates for putative ubiquitous markers.

In the case of the termite hindgut, the list of representative contigs corresponding to reported genes in Table 1 was BLASTed against the genome of Treponema primitia, a spirochete isolated from a lower termite collected from Northern California. Performing this analysis revealed that the representative contig of the top ranking gene found by MetaCAT indeed had significant hits (E value of ~10^-5) in the genome of T. primitia and mapped to two prophage-like elements. In this case, BLASTing the TerL gene from the prophage-like element back against the metagenome revealed close homologs with a similarity of 70 to 78% identity (Tadmor et al. 2011). Such a bootstrapping approach enabled the identification of a ubiquitous viral marker in the termite hindgut environment. Indeed, degenerate primers designed against this marker were able to amplify closely related homologs of this marker in other species of termites (as well as a wood-feeding roach) collected from various locations in the United States (Tadmor et al. 2011).

In this context, it is worthwhile to mention that MetaCAT is not restricted to ranking only viral
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Table 1 TerL genes identified by MetaCAT in the metagenome of a hindgut of a higher termite collected from Costa Rica. The following table lists TerL genes with minimal E values ≤ 10^-7 and abundances ≥ 5 that were identified by MetaCAT to have putative homologs in the metagenome of the hindgut of a Nasutitermes sp. termite. TerL genes are ranked by the number of metagenome gene objects yielding alignments with E value scores below 0.001. Also shown are the E value scores obtained when BLASTing the representative contig of each RefSeq TerL gene cluster against the genome of Treponema primitia (ZAS-2), using a cutoff value of 0.01, with values above this threshold marked as not significant (N.S.)

Organism name | Virus classification | No. of hits in metagenome | BLAST RefSeq gene against metagenome (E value) | BLAST representative contig against ZAS-2 (E value)
Clostridium phage phiC2 | dsDNA viruses Caudovirales; Myoviridae | 19 | 4.0E-40 | 2.0E-05
Streptococcus phage SMP | dsDNA viruses Caudovirales | 11 | 3.0E-34 | N.S.
Pseudomonas phage PaP3 | dsDNA viruses Caudovirales; Podoviridae | 7 | 2.0E-09 | N.S.
Enterobacteria phage lambda | dsDNA viruses Caudovirales; Siphoviridae | 6 | 2.0E-180 | N.S.
Enterobacteria phage HK022 | dsDNA viruses Caudovirales; Siphoviridae | 6 | 8.0E-69 | N.S.
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Fig. 3 Bioinformatic approach to identify ubiquitous viral markers in a given environment. In the proposed approach to identify putative ubiquitous viral markers, a metagenome from a given environment is first analyzed by MetaCAT to produce a list of candidate viral genes abundant in the metagenome. The corresponding representative contig of each candidate viral gene (defined as the contig yielding the lowest E value) is then BLASTed against a second dataset, such as another metagenome from a similar environment or a genome of an isolate from a similar environment. If the percent identity is sufficiently high, allowing for degenerate primer design, this candidate can be considered a putative viral marker and can be further evaluated by experiment. If the percent identity is not high, but the E value is significant, a bootstrap-like approach may be employed where the contig/gene from the new dataset is BLASTed back against the original metagenome, thereby potentially revealing more conserved markers.
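The decision logic in the Fig. 3 caption can be written as a small triage function. This is a sketch of the caption's branching only; the function name and both cutoffs (70% identity, E value of 0.001) are placeholder assumptions, since the source does not give the numerical thresholds that make a hit "sufficiently high" for degenerate primer design.

```python
def triage_candidate(percent_identity, e_value,
                     identity_cutoff=70.0, e_value_cutoff=1e-3):
    """Classify a BLAST hit of a representative contig against a second dataset.

    Both cutoffs are hypothetical placeholders, not values from the source.
    """
    if e_value > e_value_cutoff:
        return "not significant"   # no usable homolog in the second dataset
    if percent_identity >= identity_cutoff:
        return "putative marker"   # conserved enough to attempt primer design
    return "bootstrap"             # BLAST the new hit back against the metagenome


# Hypothetical hits: (percent identity, E value)
for hit in [(75.0, 1e-30), (45.0, 2e-5), (30.0, 0.5)]:
    print(hit, "->", triage_candidate(*hit))
```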
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics, Fig. 4 Workflow using the microfluidic digital PCR array for host-virus colocalization in a novel environmental sample (Tadmor et al. 2011)
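Part of the colocalization workflow is judging whether a host ribotype that recurs among retrieved colocalizations does so more often than its background frequency on the array would predict. One simple way to put a number on this is a one-sided binomial tail probability. The sketch below assumes a binomial null model and a hypothetical function name; it is not necessarily the statistic used by Tadmor et al. (2011), and the example numbers are invented.

```python
from math import comb


def ribotype_p_value(times_observed, colocalizations_retrieved, ribotype_frequency):
    """P(X >= times_observed) for X ~ Binomial(n, f), where n is the number of
    retrieved colocalizations and f is the ribotype's estimated frequency among
    rRNA-positive chambers on the array. The binomial null is an assumption."""
    k, n, f = times_observed, colocalizations_retrieved, ribotype_frequency
    if not 0 <= k <= n:
        raise ValueError("times_observed must lie between 0 and the number retrieved")
    if not 0.0 <= f <= 1.0:
        raise ValueError("ribotype_frequency must be a probability")
    return sum(comb(n, i) * f**i * (1.0 - f) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: a ribotype making up 2% of rRNA-positive chambers
# appears in 3 of 40 retrieved colocalizations.
print(ribotype_p_value(3, 40, 0.02))
```

The background frequency would come from the library of randomly sampled rRNA-positive chambers, so the uncertainty of that estimate carries over into the p-value.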
genes, but it is possible to define other taxonomic groups as input reference libraries. For example, one can use MetaCAT to find the most abundant genes in a given environment involved in a certain metabolic pathway, the most abundant mitochondrion-related genes in a given sample, the most abundant antibiotic genes in a soil sample, and so on. MetaCAT can therefore be thought of as a useful tool for generating hypotheses regarding a given environment. (Requests to obtain a beta version of MetaCAT may be sent to arbel.tadmor@tron-mainz.de.)

Colocalizing Viral-SSU rRNA Genes on Digital PCR Arrays
Once a viral marker has been selected, a diversity of this marker can be retrieved from various bioinformatic sources (e.g., metagenomes and sequenced genomes), and degenerate primers targeting the marker of interest may be designed. Colocalization of viral genes is, however, complicated by the fact that the low replication fidelity of viruses makes it unlikely to recover the exact same allele twice from the dPCR arrays, contrary to the case of colocalizing two bacterial genes. P-values can, nevertheless, still be assigned to repeated SSU rRNA ribotypes retrieved from a given array, irrespective of the paired gene, using knowledge of the frequency of the given ribotype on the array. It is possible to estimate ribotype frequencies by randomly selecting chambers positive for the SSU rRNA gene and constructing a phylogenetic library of array ribotypes (Tadmor et al. 2011). Host-phage cophylogeny can then be reconstructed from genuine colocalizations, providing a unique glimpse
into the evolutionary dynamics of the system and shedding light on the flow of viral genes between hosts in the given environment. An overview of the workflow using dPCR to colocalize host-virus genes is provided in Fig. 4.

Summary and Outlook
The method outlined in this review provides a general scheme for analyzing host-virus interactions at the single-cell level without having to culture either host or virus. The method first involves a bioinformatic analysis of a metagenomic dataset or datasets from the environment of interest to recover a ubiquitous viral marker. This marker is then colocalized with a universal gene identifying the host by means of dPCR performed on single cells. The methods presented in this review are general and can be applied to other environments.

Cross-References
▶ Computational Approaches for Metagenomic Datasets
▶ Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution
▶ Viral MetaGenome Annotation Pipeline

References
Andersson AF, Banfield JF. Virus population dynamics and acquired virus resistance in natural microbial communities. Science. 2008;320(5879):1047–50.
Boyd EF, Brüssow H. Common themes among bacteriophage-encoded virulence factors and diversity among the bacteriophages involved. Trends Microbiol. 2002;10(11):521–9.
Casjens S. Prophages and bacterial genomics: what have we learned so far? Mol Microbiol. 2003;49(2):277–300.
Daw MA, Falkiner FR. Bacteriocins: nature, function and structure. Micron. 1996;27(6):467–79.
Edwards R, Rohwer F. Viral metagenomics. Nat Rev Microbiol. 2005;3(6):504–10.
Kalisky T, Quake SR. Single-cell genomics. Nat Methods. 2011;8(4):311–4.
Kalisky T, Blainey P, Quake SR. Genomic analysis at the single-cell level. Annu Rev Genet. 2011;45:431–45.
Koonin E, Senkevich T, Dolja V. The ancient virus world and evolution of cells. Biol Direct. 2006;1(1):29.
Kristensen DM, Mushegian AR, Dolja VV, Koonin EV. New dimensions of the virus world discovered through metagenomics. Trends Microbiol. 2010;18(1):11–9.
Mokili JL, Rohwer F, Dutilh BE. Metagenomics and future perspectives in virus discovery. Curr Opin Virol. 2012;2(1):63–77.
Nei M, Kumar S. Molecular evolution and phylogenetics. USA: Oxford University Press; 2000.
Ottesen E, Hong J, Quake S, Leadbetter J. Microfluidic digital PCR enables multigene analysis of individual environmental bacteria. Science. 2006;314(5804):1464–7.
Przech AJ, Yu D, Weller SK. Point mutations in exon I of the herpes simplex virus putative terminase subunit, UL15, indicate that the most conserved residues are essential for cleavage and packaging. J Virol. 2003;77(17):9613–21.
Rao VB, Feiss M. The bacteriophage DNA packaging motor. Annu Rev Genet. 2008;42:647–81.
Rohwer F, Edwards R. The phage proteomic tree: a genome-based taxonomy for phage. J Bacteriol. 2002;184(16):4529–35.
Tadmor AD, Ottesen EA, Leadbetter JR, Phillips R. Probing individual environmental bacteria for viruses by using microfluidic digital PCR. Science. 2011;333(6038):58–62.

Human Gut Microbial Genes by Metagenomic Sequencing

Jun Wang
BGI Shenzhen, Shenzhen, China

Synonyms
Genes in the human gut microbial community; Metagenome of the human gut microbiota

Definition
A gene is identified in human distal gut (colon) microbes when reads from high-throughput sequencing of fecal samples are assembled and an open reading frame (ORF) is predicted from the resulting DNA sequence. Such a gene could usually be mapped to a group of bacterial species and linked to certain functions. Metagenomic
H 266 Human Gut Microbial Genes by Metagenomic Sequencing
studies on other parts of the gastrointestinal tract are often performed invasively using animals and are not discussed here.

Introduction

The human gut has long been known to contain microbial species. Until the advent of high-throughput metagenomic sequencing, however, these mysterious microbes largely eluded interrogations by their human host. Recent advancements described here and in other entries reveal awe-inspiring complexity, dynamics, and significance of the gut microbiota.

Eubacteria dominate the microbial community in the human gut (Scanlan and Marchesi 2008; Marchesi 2010; Parfrey et al. 2011). Both eubacteria and archaebacteria species are routinely classified to genus level according to their 16S rRNA gene sequences. Unfortunately, taxonomic classification of commensal eukaryotes in the gut has remained a tedious process (Parfrey et al. 2011). As a consequence, our understanding of the eukaryotic minorities in the gut lags far behind that of the bacterial communities. The term “gut microbes” is equivalent to “gut bacteria” hereafter.

Metagenomic sequencing of total DNA extracted from fecal samples constitutes a key step in forging our understanding of gut bacteria beyond taxonomy. The approach allows researchers to obtain complete genome sequences, identify genes, and predict functions. Such metagenomic information is especially precious for those bacteria that are yet to succumb to laboratory culture conditions.

This overview is intended to briefly summarize our current roll call of the various genes present in the human gut flora as well as the functional relevance of these genes to the microorganisms and human beings under normal and perturbed states.

Identification of Gut Microbial Genes

Next-generation, high-throughput, and cost-efficient short-read data (mainly produced by Illumina sequencing technology) have come of age in metagenomic studies. Considering the nonuniform abundance of gut microbial species and the high level of discordance between individual humans, deep sequencing and wide sampling are critical for a comprehensive understanding of the human gut flora. In 2010, high-throughput short-read sequencing was introduced into human gut microbiome research and showed great potential (Qin et al. 2010).

Bacterial DNA obtained from human fecal samples could be readily used for high-throughput sequencing on the Illumina platform. After a few quality control steps, the short reads from each sample were assembled de novo (Fig. 1), using software such as SOAPdenovo (Kultima et al. 2012). Protein-coding genes were then predicted from the assembled contigs (Kultima et al. 2012). Genes from multiple samples were pooled together and compared with one another to remove redundancy. Finally, a nonredundant gene catalog was generated and could serve as a basis for functional and phylogenetic analyses (Fig. 1) (Qin et al. 2010).

As an alternative to de novo assembly, mapping of reads to an existing gene catalog allows convenient identification of genes in a sample. Naturally, such a time-saving approach requires the gene catalog to encompass a complete set of high-quality reference genes.

Total Gene Number and Its Variability

Metagenomic sequencing of 124 Europeans (as part of the MetaHIT (Metagenomics of the Human Intestinal Tract) project) resulted in a gut microbial gene catalog containing 3.3 million nonredundant genes (Qin et al. 2010). Although this gene number might still increase as more samples are sequenced, especially those from patients of a particular disease (e.g., in Qin et al. 2012), this number of known gut microbial genes is already 150-fold greater than the number of genes encoded by the human genome.

A total of 294,110 of the gut microbial genes were found in at least 50 % of individuals, which were termed
“common” genes (Qin et al. 2010). The remaining ~90 % genes, although typically seen in multiple samples, were not widely shared. Each individual carried 536,112 ± 12,167 nonredundant genes, of which 204,056 ± 3,603 (around 38 %) were common genes. Thus, significant interpersonal differences exist in terms of the number, type, and sequence of the genes.

Human Gut Microbial Genes by Metagenomic Sequencing, Fig. 1 High-throughput metagenomic analysis of the human gut flora. DNA from fecal samples is sequenced using the Illumina platform. The short reads generated are assembled into contigs and open reading frames (ORFs) are predicted. A nonredundant gene catalog is created from the ORFs. The genes are then annotated functionally and phylogenetically according to databases

Common Functions Encoded by Gut Bacteria

Functional annotation of the gut metagenome involves aligning the genes to databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, COG (Clusters of Orthologous Groups), and eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) databases (Fig. 1). At present, a significant fraction of genes remain functionally unknown regardless of the database used, although common genes could usually be annotated with greater success. The wealth of information in the gut metagenome awaits exploration in both global and targeted fashion.

Just as gut microbial genes are to some extent shared between individuals, there are functionalities that are common to the human gut microbiota (Qin et al. 2010; Human Microbiome Project Consortium 2012). Major metabolic pathways such as central carbohydrate metabolism and amino acid synthesis can be seen in all samples. Essential protein complexes, for example, DNA replication machinery, RNA polymerases, ribosome, and secretory apparatus, are also part of the core gut microbiota genes. Moreover, genes that are not required by all bacteria but are important for life in the gut are expected in the common set. Such genes would presumably reflect adaptation to gut temperature, oxygen level, and nutrients as well as interaction with host cells and other microbes (Fig. 2). The distinction between common and rare functions, however, becomes semantic as one looks into these genes. We find it more convenient to discuss these in the following section.

Genes Influenced by Host Environmental Factors

Traditionally viewed as a place for water and salt resorption, the colon’s integral role in human
Human Gut Microbial Genes by Metagenomic Sequencing, Fig. 2 Functions encoded by the human gut microbial metagenome in relation to the gut. Gut microbes contain genes important for the survival and success of themselves, at the same time depend on, serve, and manipulate their human host. Diseases follow when the symbiotic relationship goes awry
nutrition has only become realized through studies of the gut (fecal) microbiota. Various digested or indigestible components of the diet arrive at the colon and constitute a major environmental factor shaping the gut microbial ecosystem (Fig. 2). Complex carbohydrates are fermented by bacteria of the phylum Firmicutes, producing short-chain fatty acids (SCFAs, including acetate, propionate, and butyrate) for use by the host cells. In contrast, if the host diet relies more on simple sugars, as has become common in the United States, enzymes for metabolizing mono- and disaccharides could be more prominent in the gut flora (Yatsunenko et al. 2012). Similarly, dietary intake of amino acids and vitamins appears to modulate the balance between their catabolism and anabolism by gut bacteria.

Bile acids (BAs) secreted by the host to emulsify dietary fats make a strong impact on the gut microbiota. On one hand, primary BAs are known to be converted to more effective secondary BAs through 7α-dehydroxylation by intestinal bacteria. On the other hand, with their amphipathic properties, BAs show a strong antimicrobial activity. Rats on a diet supplemented with the BA cholic acid recapitulated effects of high-fat diet on the gut flora (reported in mice), namely, an increased ratio of Firmicutes to Bacteroidetes and a declining microbial diversity (Islam et al. 2011; Ley et al. 2005). Thus, elevated bile secretion stimulated by high-fat diet likely plays a major role in reshaping the gut microbiome (Islam et al. 2011).

Antibiotic administration could lead to profound and long-lasting alterations in the intestinal microbiome (Dethlefsen and Relman 2010; Cho et al. 2012). The distortion is typically manifested as a sharp decrease in microbial diversity accompanied by an overgrowth of Proteobacteria, especially in pathogenic Enterobacteriaceae populations (Nyberg et al. 2007). Antibiotic intake exerts a strong selective pressure on the intestinal flora and increases transfer of antibiotic resistance genes (ARGs) among gut microbes, leading to an accumulation of resistant strains (Sullivan et al. 2001; Schjørring and Krogfelt 2011). These antibiotic-resistant pathogens and nonpathogens could persist in the gut well after removal of the selective pressure.

Notably, current evidence suggests that while commensal bacterial species vary between hosts of different genetic background and environmental factors, the individuality is smaller at the functional level, i.e., similar genes in different gut bacteria could serve similar purposes and are selected by similar factors (Spor et al. 2011).

Gut Microbiota and Diseases

A growing body of evidence suggests that the gut microbial flora is central to human health. Although we are very far from a definitive comprehension of healthy versus diseased gut
microbiota, it is fair to say that a productive and well-balanced symbiotic relationship with our little gut residents is of key importance for us human beings. Altered gut microbial composition has been reported in various gut-related diseases such as colorectal cancer and inflammatory bowel diseases (IBDs) and extends to conditions like anorexia, allergies, cardiovascular diseases, and even autism (Clemente et al. 2012; Tremaroli and Bäckhed 2012). These diseases are more or less accompanied by dysbiosis, a state where benign or beneficial gut microbes are overtaken by pathogens and normal processes like fermentation, synthesis of metabolites, barrier function, etc. become disrupted.

On a metagenomic level, the gut microbiome of leptin-deficient obese mice (ob/ob) showed an increased capacity for energy harvest from the gut, encoding enzymes that could initiate breakdown of otherwise indigestible polysaccharides (Turnbaugh et al. 2006). However, the end products of bacterial fermentation, SCFAs especially butyrate, appear protective and negatively regulate inflammation in the gut (Maslowski et al. 2009). Butyrate synthesis genes in the gut flora were depleted in diabetes and symptomatic atherosclerosis patients compared to healthy controls (Qin et al. 2012; Karlsson et al. 2012). Together with studies on butyrate and IBDs (Thibault et al. 2010; Scharl and Rogler 2012), current results point to a key role of butyrate metabolism in colon health, with extensive interplays between the gut flora and the host.

Another common theme in gut microbial homeostasis may be the handling of oxidative stress. The gut metagenome of diabetes patients was enriched for genes involved in sulfate reduction and oxidative stress resistance (Qin et al. 2012). Atherosclerosis was associated with an underrepresentation of phytoene dehydrogenase gene and a matching reduction in the antioxidant β-carotene in patient serum (Karlsson et al. 2012). Oxidative stress is also known to contribute to IBDs such as Crohn’s disease (Iborra et al. 2011).

Metagenome-Wide Association Study for Diagnosis

To go beyond a descriptive account of genes present in healthy or unhealthy human gut microbiota, it could be very helpful to perform a metagenome-wide association study (MGWAS) for identification of disease markers and evaluation of disease prospect (Fig. 3). A standard genome-wide association study (GWAS) looks for genetic variants, typically single-nucleotide polymorphisms (SNPs) in a genome, and relates them to a phenotype such as a disease. MGWAS stems from the concept of “metagenome.” Accordingly, the relative abundance of a gene in a metagenome, instead of the presence of a SNP, is used to establish correlation with disease.

The proof-of-principle study for MGWAS was performed on type 2 diabetes mellitus (Qin et al. 2012). In a reference gene catalog updated from previous work (Qin et al. 2010), 3,298,811 genes were found in the healthy or diabetic cohorts (total n = 145). After filtering for shared genes and clustering based on numerical relationships and phylogeny, the dimensionality was reduced to 1,138,151 genes. The first stage of analysis concluded with 278,168 statistically significant gene markers for diabetes. In Stage II, new samples (n = 100 for each cohort) were sequenced and profiled with the markers from Stage I. The analysis further reduced the number of gene markers to 52,484. For lowest error rate, as few as 50 gene markers were found to be optimal and were successfully applied to diabetic/nondiabetic classification of 23 additional samples. Besides gene markers, markers from functional annotations (KEGG orthologous groups, eggNOG orthologous groups) and metagenomic linkage groups (MLG) that represent taxonomic units also present valuable information (Qin et al. 2012).

In addition to diabetes, the same study identified gene markers and orthologous group markers for IBDs and for enterotypes (Qin et al. 2012), raising the stakes for routine application of MGWAS to other microbiota-related diseases. It remains to be seen how factors such as age, gender, and BMI (body mass index) confound MGWAS in various diseases, especially during initial marker
Human Gut Microbial Genes by Metagenomic Sequencing, Fig. 3 Metagenome-wide association study for gut flora-related diseases. For each sample, sequencing reads are mapped to the reference gene catalog (Fig. 1) and relative abundance of genes is computed. Genes and species that are under- or overrepresented in patients are selected following a rigorous procedure. The analysis results in gene markers and metagenomic linkage groups that can be used for diseased/undiseased classification and potentially for prognosis and diagnosis
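The core idea of the workflow in Fig. 3 (per-sample relative abundance of each gene, followed by selection of genes that are under- or overrepresented in patients) can be illustrated with a minimal Python sketch. Several parts are assumptions made only for illustration and are not taken from Qin et al. (2012): the gene names and read counts are invented toy data, a simple permutation test on the difference of group means stands in for the "rigorous procedure", and gene-length normalization and multiple-testing correction are omitted.

```python
import random
from statistics import mean


def relative_abundance(counts):
    """Turn per-gene mapped-read counts of one sample into fractions that sum to 1."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


def permutation_p_value(cases, controls, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means (illustrative choice)."""
    rng = random.Random(seed)
    observed = abs(mean(cases) - mean(controls))
    pooled = list(cases) + list(controls)
    k = len(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def candidate_markers(case_samples, control_samples, alpha=0.05):
    """Return genes whose relative abundance differs between patients and controls."""
    case_ra = [relative_abundance(s) for s in case_samples]
    ctrl_ra = [relative_abundance(s) for s in control_samples]
    genes = set().union(*case_ra, *ctrl_ra)
    markers = {}
    for gene in sorted(genes):
        a = [s.get(gene, 0.0) for s in case_ra]
        b = [s.get(gene, 0.0) for s in ctrl_ra]
        p = permutation_p_value(a, b)
        if p < alpha:
            direction = "enriched" if mean(a) > mean(b) else "depleted"
            markers[gene] = (direction, p)
    return markers


# Hypothetical toy data: mapped-read counts per gene, one dict per sample.
controls = [
    {"geneA": 40, "geneB": 10, "geneC": 50},
    {"geneA": 38, "geneB": 12, "geneC": 50},
    {"geneA": 42, "geneB": 8, "geneC": 50},
    {"geneA": 41, "geneB": 9, "geneC": 50},
]
cases = [
    {"geneA": 20, "geneB": 30, "geneC": 50},
    {"geneA": 22, "geneB": 28, "geneC": 50},
    {"geneA": 18, "geneB": 32, "geneC": 50},
    {"geneA": 19, "geneB": 31, "geneC": 50},
]

markers = candidate_markers(cases, controls)
for gene, (direction, p) in markers.items():
    print(gene, direction, round(p, 3))
```

In this toy example geneA comes out as depleted and geneB as enriched in the patient group, while geneC, whose relative abundance is identical in every sample, is not selected. A real analysis would work on millions of genes and would need the filtering, clustering, and validation stages described in the text.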
selection. Factors such as sample size, read length, and ecological and genomic diversity all need to be taken into consideration during study design and interpretation. The emergence of an optimum MGWAS workflow and subsequent biological investigations would probably involve effort from researchers across disciplines.

Summary

Metagenomic analyses of the human gut microbiota offer road maps for elucidating the interplay between the gut symbionts and their human host. The information could guide detailed characterization of bacterial species
individually and as a community. The range of metabolites flowing in and out of the microbes and the plasticity of the gut flora are expected to revolutionize nutrition science. Causal relationships between host factors, gut microbiota, and diseases, when established, hold great promise for human health.

References

Cho I, Yamanishi S, Cox L, Methé BA, Zavadil J, Li K, et al. Antibiotics in early life alter the murine colonic microbiome and adiposity. Nature. 2012;488:621–6.
Clemente JC, Ursell LK, Parfrey LW, Knight R. The impact of the gut microbiota on human health: an integrative view. Cell. 2012;148:1258–70.
Dethlefsen L, Relman DA. Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation. Proc Natl Acad Sci. 2010;108:4554–61.
Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207–14.
Iborra M, Moret I, Rausell F, Bastida G, Aguas M, Cerrillo E, et al. Role of oxidative stress and antioxidant enzymes in Crohn’s disease. Biochem Soc Trans. 2011;39:1102–6.
Islam KBMS, Fukiya S, Hagio M, Fujii N, Ishizuka S, Ooka T, et al. Bile acid is a host factor that regulates the composition of the cecal microbiota in rats. Gastroenterology. 2011;141:1773–81.
Karlsson FH, Fåk F, Nookaew I, Tremaroli V, Fagerberg B, Petranovic D, et al. Symptomatic atherosclerosis is associated with an altered gut metagenome. Nat Commun. 2012;3:1245.
Kultima JR, Sunagawa S, Li J, Chen W, Chen H, Mende DR, et al. MOCAT: a metagenomics assembly and gene prediction toolkit. PLoS ONE. 2012;7:e47656.
Ley RE, Bäckhed F, Turnbaugh P, Lozupone CA, Knight RD, Gordon JI. Obesity alters gut microbial ecology. Proc Natl Acad Sci U S A. 2005;102:11070–5.
Marchesi JR. Prokaryotic and eukaryotic diversity of the human gut. Adv Appl Microbiol. 2010;72:43–62.
Maslowski KM, Vieira AT, Ng A, Kranich J, Sierro F, Yu D, et al. Regulation of inflammatory responses by gut microbiota and chemoattractant receptor GPR43. Nature. 2009;461:1282–6.
Nyberg SD, Osterblad M, Hakanen AJ, Löfmark S, Edlund C, Huovinen P, et al. Long-term antimicrobial resistance in Escherichia coli from human intestinal microbiota after administration of clindamycin. Scand J Infect Dis. 2007;39:514–20.
Parfrey LW, Walters WA, Knight R. Microbial eukaryotes in the human microbiome: ecology, evolution, and future directions. Front Microbiol. 2011;2:153.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490:55–60.
Scanlan PD, Marchesi JR. Micro-eukaryotic diversity of the human distal gut microbiota: qualitative assessment using culture-dependent and -independent analysis of faeces. ISME J. 2008;2:1183–93.
Scharl M, Rogler G. Inflammatory bowel disease pathogenesis: what is new? Curr Opin Gastroenterol. 2012;28:301–9.
Schjørring S, Krogfelt KA. Assessment of bacterial antibiotic resistance transfer in the gut. Int J Microbiol. 2011;2011:312956.
Spor A, Koren O, Ley R. Unravelling the effects of the environment and host genotype on the gut microbiome. Nat Rev Microbiol. 2011;9:279–90.
Sullivan A, Edlund C, Nord CE. Effect of antimicrobial agents on the ecological balance of human microflora. Lancet Infect Dis. 2001;1:101–14.
Thibault R, Blachier F, Darcy-Vrillon B, De Coppet P, Bourreille A, Segain J-P. Butyrate utilization by the colonic mucosa in inflammatory bowel diseases: a transport deficiency. Inflamm Bowel Dis. 2010;16:684–95.
Tremaroli V, Bäckhed F. Functional interactions between the gut microbiota and host metabolism. Nature. 2012;489:242–9.
Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444:1027–31.
Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, et al. Human gut microbiome viewed across age and geography. Nature. 2012;486:222–7.

Human Oral Microbiome Database (HOMD)

Tsute Chen1 and Floyd Dewhirst2
1 Department of Microbiology, The Forsyth Institute, Cambridge, MA, USA
2 Department of Molecular Genetics, The Forsyth Institute, Cambridge, MA, USA

Introduction

The human oral cavity is a rich biological site with several microbial niches including teeth, gingival sulcus, tongue, cheek, hard and soft
palates, tonsils, throat, and saliva. The microbiome of the oral cavity (Dewhirst et al. 2010) and its niches have been examined based on 16S rRNA sequencing (Aas et al. 2005; Bik et al. 2010; Human Microbiome Project 2012a, b). The metagenome of the oral cavity has been studied to a limited degree prior to 2012 due to the complexity of the site (Alcaraz et al. 2012; Belda-Ferre et al. 2012; Xie et al. 2010). More than 700 prevalent species comprise the oral microbiome, but many taxa are present at less than 0.1 % of the microbial population (Dewhirst et al. 2010). As oral bacterial reference genomes are becoming available, primarily through the efforts of the Human Microbiome Project (Human Microbiome Project 2012a, b), it is becoming possible to attribute metagenomic sequences to organisms at genus and species level (Martin et al. 2012). The anchoring of metagenome sequence information to specific organisms in a taxonomic framework is key to developing a full description of the bacteria-bacteria and bacteria-host interactions that underlie human oral health and disease.

The Human Oral Microbiome Database (HOMD) was developed in response to the lack of any naming or taxonomic scheme for the thousands of human oral 16S rRNA clone sequences that were being generated in the early 2000s and dumped into GenBank without any taxonomic anchor. Investigators were publishing manuscripts using clone names (such as BU063) as provisional taxonomic names. The only way to phylogenetically place an oral clone was to personally align sequences and generate one’s own phylogenetic trees. We recognized that there was a need for a 16S rRNA-based provisional taxonomic scheme to name and provide reference sequences for unnamed taxa known only from clone or isolate 16S rRNA sequences. The naming scheme had to be provisional because formal naming under the bacterial code requires isolation in pure culture and full phenotypic characterization; 16S rRNA sequence by itself is insufficient for formal naming. The taxonomic scheme described more fully below is based on a Human Oral Taxon number which runs currently from 001 to 918.

At about the time we recognized the need to create a taxonomic framework for the oral microbiome, the National Institute of Dental and Craniofacial Research released a request for proposals on “The metagenome of the oral microbiome.” We responded with a proposal entitled “A foundation for the oral microbiome and metagenome,” which was funded as DE016937. The goals of the grant were (1) to set up the HOMD web-accessible database with a provisional taxonomic scheme and to present all oral genomes in a graphical interface, (2) to complete reference genomes for oral taxa, and (3) to obtain isolates of previously uncultivated taxa and make them available to the research community by placing them in national-type culture collections. We have made steady progress in achieving these goals, and this project is currently in its seventh year of funding.

The HOMD Website

The HOMD contains various types of information on human oral microorganisms, including taxonomic, genomic, and bibliographic data. The purpose of the HOMD website (http://www.homd.org) is to provide an easy-to-use online interface to search, retrieve, and navigate among these different types of information. HOMD also provides web-based bioinformatics software tools for data mining and analyses.

Technically, the HOMD website is constructed using a LAMP system and hosted on web server computers. The LAMP system provides a Linux operating system, Apache web service, MySQL relational database, and PHP dynamic web page rendering. Textual contents such as the taxonomy and metagenomic information are queried and results dynamically displayed in the web browser by the LAMP system. A dedicated high-performance computer cluster is deployed to handle computationally demanding analyses such as sequence homology searches.

The HOMD has been designed to be compatible with most commonly used web browsers such as Microsoft Internet Explorer, Firefox,
Google Chrome, and Safari. We suggest the use of one of these popular web browsers to ensure the functionalities of HOMD web pages and tools. All the HOMD information and tools are viewable and available to the general public without having to log in or acquire a user account. The log-in function is mainly for the purpose of maintaining the website and the curation of the database information. If a user has been designated a curator, he or she will see additional administrative submenus.

Detailed functionalities, web interfaces, and tools as well as useful usage tips are presented below. Technical information such as the implementation and design of the HOMD has been published elsewhere (Chen et al. 2010).

Features of the HOMD Web Pages
The design of the website was based on the feedback of several researchers in the field of oral microbiology over the past several years. The user interface was designed to be user-friendly, intuitive, and practical. On top of every HOMD page (Fig. 1), there is a top banner for the HOMD logo, which automatically reduces to smaller size (in height) once the user navigates away from the home page so that the banner will not take up too much space from the requested content. Clicking the top banner image also brings the user back to the HOMD home page. The top navigation menu is located right below the top banner and is also accessible throughout all the HOMD pages. The top navigation menu provides access points to all HOMD’s tools and information on all the web pages.

Another useful feature of the HOMD web pages is the unique page ID system. The rightmost item displayed on the top navigation menu is the page ID, a unique code that distinctly identifies the current page that a user is viewing. For example, the page ID of the HOMD home page is “HP1” (Fig. 1), and once a user navigates away from the home page to, e.g., the Taxon Table page, the page ID automatically changes to “TT1.” This feature allows precise page referencing. This is particularly useful when a user needs to refer to a specific page on the HOMD site for discussion, bug reporting, or suggestion.

The HOMD home page also includes a top-down oriented expandable menu on the left side and an introductory paragraph in the center. On the right side are the Meta-Database Search, the Announcement, and the Database Update boxes. The Meta-Database Search is very useful for searching desired information across all the subsets of HOMD databases, including the taxonomy, the metagenomic information, as well as the dynamic genome annotations. The result page lists the number of matches to the keyword and provides links leading to detailed information. The Announcement box displays the important system-wide updates and news for the HOMD. The Database Update box is automatically updated by the HOMD dynamic genome annotation pipeline (see “Dynamic Annotation of Genomic Sequences” section) to keep track of the status of the genome annotation.

HOMD also provides comprehensive documentation and update history of data and tools. The HOMD User’s Guide (i.e., the help documentation) was designed to help users to use the tools, navigate the information, and interpret the results provided by HOMD. The User’s Guide is accessible through the top navigation menu on all pages and is dynamically linked to the relevant guide for each different tool. For example, when users are viewing the Taxon Table page, the “How to Use This Page” menu item shown in the top navigation menu will lead directly to the page that explains the use of the Taxon Table. Alternatively, users can also browse the entire user documentation by clicking the “Table of content” tab shown on top of each documentation page as well as the “User’s Guide” links on the top menu and side menu of the home page. Every document of HOMD can be searched either through the search box located at the bottom of the table of contents of the documentation page or through the Meta-Database Search box located at the top-right part of the home page.

The design of the online interfaces of HOMD has been driven by suggestions from HOMD users. HOMD is open to suggestions and feedback from the research community to further improve its interface and content. Currently, HOMD provides several different ways to communicate with the
Human Oral Microbiome Database (HOMD), Fig. 1 Screenshot of the HOMD home page
research team and research community. The contact information provides e-mail addresses for direct communication with the HOMD research team. There is also a mailing list for important updates and announcements. Users can use their own e-mail address to subscribe to the HOMD Mailing List (https://groups.google.com/forum/#!forum/homd-mail) by sending an empty e-mail to the e-mail address: homd-mail+subscribe@googlegroups.com. An automatic e-mail will be sent to the subscriber for confirmation. HOMD also provides a discussion platform for the research community (https://groups.google.com/forum/#!forum/homd-forum). Note that these web links may change over time. In any case, current or updated web links provided here will be available on the HOMD website.

The HOMD Database Schema
The information and data provided by HOMD are stored in several databases. The Oral Taxon IDs and the genome IDs serve as the keys to cross-link these databases. The database table structures and the contents can be downloaded from the HOMD FTP (file transfer protocol) site at ftp://ftp.homd.org to allow users to reconstruct the databases and perform advanced queries on their own computers.

Download Data from HOMD
Most of the data recorded in HOMD, including taxonomy, genomics, and 16S rRNA reference sequences, can be downloaded from the HOMD FTP site (ftp://ftp.homd.org). The FTP site provides both current and archived versions of the
data for comparison. The FTP site can be accessed directly in the web browser. Each folder comes with a “readme” text file explaining the data, data format, and potential usage. Selected data such as the aligned reference sequence dataset, aligned 16S rRNA datasets for each taxon, and an HOMD taxonomy database in Excel format can be downloaded from the links provided in the HOMD web pages.

Taxonomy

Compilation of the HOMD Taxa
The HOMD describes information linked to oral microbe species. For bacteria, or archaea, that have not been validly named, there is no definition of “species.” Molecular methods to identify novel species generally have used 16S rRNA sequencing of isolates or 16S rRNA-based analysis of clone libraries. These strains or clones can then be clustered into phylotypes or taxa based on their 16S rRNA sequences. Phylotypes can be defined for any similarity cutoff. In HOMD, a cutoff of 98.5 % 16S rRNA sequence similarity was used to cluster the 16S rRNA sequences at the species level to define novel oral bacterial phylotypes. Each validly named species and novel phylotype cluster was given a unique integer number called Human Oral Taxon (HOT) ID.

The original collection of oral microbial taxonomy information came from a combination of literature, primarily reports from Forsyth Institute investigators (Dzink et al. 1985, 1988; Socransky and Haffajee 1994; Tanner et al. 1979, 1998) and from Lillian Holderman Moore and Ed Moore (Moore et al. 1982, 1983; Moore and Moore 1994) formerly at the Anaerobe Laboratory at the Virginia Polytechnic Institute. 16S rRNA sequences for these named species came either from sequences obtained in our laboratory or from GenBank. Over the past 20 years, our laboratory constructed and sequenced over 600 16S rRNA gene libraries and obtained over 35,000 clone sequences. The cloning, sequencing, aligning, treeing, and clustering methods used to create HOMD are described elsewhere (Dewhirst et al. 2010). In brief, sequences were manually aligned in a secondary structure-based database using the program RNA (Paster and Dewhirst 1988). Distance matrices and neighbor-joining trees were generated to determine the clustering of sequences. Sequences with similarity equal to or greater than 98.5 % were grouped together into a single taxon. Sequences were extensively checked for chimeras and several sequences and some provisional taxa were removed. As a result, several hundred apparently novel full 16S rRNA sequences were identified this way.

To share the information of both the named and novel human oral microbial taxa with the research community, we decided to build a database and designed web query interfaces and tools. When the HOMD was publicly launched in 2010, there were a total of 619 Human Oral Taxa in the initial release of the HOMD database. The 753 reference 16S rRNA gene sequences upon which this analysis was done have been released publicly for download on the HOMD website as version 10. At the time of writing this chapter, the total number of taxa described in the HOMD taxonomy database has grown to 688, represented by a total of 833 reference 16S rRNA sequences (HOMD RefSeq Version 13.1).

Navigating the HOMD Taxa
The HOMD taxonomy information can be viewed and retrieved in several different ways. The information can be viewed online directly in a web browser or downloaded as text files. For the online web browser viewing, the taxonomy pages can be searched with keywords or by visual navigation with the Taxon Table (Fig. 2) and the Taxonomic Hierarchy (Fig. 3). The Taxon Table can also be downloaded as an Excel or tab-delimited plain text file from the Tools & Download page or through the HOMD FTP site. The keyword search can be done through the Meta-Database Search box on the home page or on the Taxon Table page. Both search boxes look for input keyword(s) in all text fields of the HOMD taxonomy database table.

On the Taxon Table page, all the human oral microbial taxa are listed in a table ordered
Human Oral Microbiome Database (HOMD), Fig. 2 Screenshot of the Taxon Table
alphabetically by organism names. The order can be changed by clicking the column name HOT IDs,
Genus, or Species names, to toggle the display sort order. Three commonly used filters are also
provided to show only those taxa with “named species,” “unnamed cultivated species,” or
“uncultured phylotypes.” Each taxon listed in the table contains links to the individual Taxon
Description page (described later) and to the genomic information, if available.

The taxa can also be viewed in the taxonomic hierarchical order, i.e., from domain, phylum, class,
order, family, genus, to species levels, on the Taxonomic Hierarchy page (Fig. 3). The
hierarchical tree is fully collapsed by default and can be dynamically expanded at any given
level (or all levels). The link, at the species level, brings users to the detailed Taxon
Description page. The designation of each level is followed by two numbers enclosed in the square
brackets indicating the number of taxa and genome sequences. For example, “Phylum Proteobacteria
[107, 144]” indicates that in the phylum Proteobacteria, 107 taxa were identified in the oral
cavity and 144 strains have genomic sequences available at HOMD. If a strain has been sequenced by
multiple groups, or multiple strains sequenced for a species, we provide each sequence when
available.

Another way to check the summary of the HOMD taxa is to view the number of taxa at various
taxonomy levels. The Taxonomic Level page provides a list of taxa and the number of taxa at the
next lower level for each of the 7 taxonomic levels. Currently, the numbers are Domain (2), Phylum
(14), Class (24), Order (40), Family (83), Genus (183), and Species (688).
Human Oral Microbiome Database (HOMD), Fig. 3 Screenshot of the Taxonomic Hierarchy expanded at the order
level Bacteroidales
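As a small illustration of the bracket notation used on the Taxonomic Hierarchy page, the
following Python sketch splits a label such as "Phylum Proteobacteria [107, 144]" into its rank,
name, number of taxa, and number of genome sequences. The function name, the type name, and the
exact label layout are assumptions made for this example only; the labels shown on the HOMD
website may be formatted differently.

```python
import re
from typing import NamedTuple


class HierarchyLabel(NamedTuple):
    rank: str      # e.g. "Phylum"
    name: str      # e.g. "Proteobacteria"
    taxa: int      # first number in the brackets: taxa at this level
    genomes: int   # second number in the brackets: genome sequences


# Assumed layout: "<Rank> <Name> [<taxa>, <genomes>]"
_LABEL_RE = re.compile(r"^\s*(\w+)\s+(.+?)\s*\[\s*(\d+)\s*,\s*(\d+)\s*\]\s*$")


def parse_hierarchy_label(label: str) -> HierarchyLabel:
    """Parse one hierarchy label into its four components."""
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized hierarchy label: {label!r}")
    rank, name, taxa, genomes = match.groups()
    return HierarchyLabel(rank, name, int(taxa), int(genomes))


if __name__ == "__main__":
    print(parse_hierarchy_label("Phylum Proteobacteria [107, 144]"))
```

The example label is the one quoted in the text; any other label would need to follow the same
assumed layout.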
Taxon Description

The HOMD Taxon Description page (Fig. 4) provides comprehensive information for a specific human
oral microbial taxon. The information provided can be summarized in four categories: Taxonomic
Hierarchy, biological characteristics, references, and community comments. Throughout the page,
clickable dynamic cross-links are provided for additional information. The taxon page can be
edited and curated by designated curators after logging in. The page also allows input and
comments from users in the research community. The information described on this page is the
following:

Human Oral Taxon (HOT) ID – The Human Oral Taxon ID is a unique numeric ID representing a
particular taxon. The taxon can be unambiguously referred to from other sources of scientific
literature. The taxon can be accessed on the web with an easy uniform resource locator (URL)
format, http://www.homd.org/taxon=NNN, where NNN is the HOT ID. The Human Microbiome Project Data
Analysis and Coordination Center (DACC; accessible at http://www.hmpdacc.org) is using HOT IDs to
designate the taxonomic identity of isolates from the oral cavity, with URLs cross-referenced to
HOMD. These URLs are embedded in the data provided by DACC so that users can track down the more
comprehensive information for an individual genome. The HOT IDs were also embedded in the GenBank
sequence records for the 35,000 clone sequences that were used to build the initial collections
of the HOMD taxa. The text embedded in the GenBank records has the syntax
/db_xref="HOMD:tax_NNN", in which NNN is the numeric HOT ID. If the GenBank sequence is viewed in
the web browser through the NCBI website, the
Human Oral Microbiome Database (HOMD), Fig. 4 Screenshot of the Taxon Description page
portion of the text “tax_NNN” is also clickable and links to the corresponding taxon page on the
HOMD website. For example, the GenBank record for the partial 16S rRNA sequence of the
Alloprevotella rava clone GB024 (Accession No. GU409552,
http://www.ncbi.nlm.nih.gov/nuccore/GU409552) contains the text /db_xref="HOMD:tax_302", because
the HOT ID for A. rava is 302. Clicking “tax_302” in this GenBank record in the web browser will
bring the user to the corresponding taxon page on HOMD (http://www.homd.org/taxon=302). NCBI
embeds external database reference IDs in the GenBank records for cross-database referencing.
More information can be found at this link: http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref.
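The two conventions described for the HOT ID, the taxon URL and the GenBank db_xref qualifier, can
be sketched in Python as follows. This is a minimal illustration, not HOMD or NCBI code: the
function names are hypothetical, the URL pattern is simply the one quoted in the text (it is not
checked against the live website), and the record fragment in the usage lines is a made-up
two-line excerpt rather than the full GU409552 record.

```python
import re
from typing import Optional

# URL pattern as quoted in the text; whether it still resolves is not verified here.
HOMD_TAXON_URL = "http://www.homd.org/taxon={hot_id}"

# Qualifier as described in the text, e.g. /db_xref="HOMD:tax_302"
_DB_XREF_RE = re.compile(r'/db_xref="HOMD:tax_(\d+)"')


def taxon_url(hot_id: int) -> str:
    """Build the taxon page URL for a numeric HOT ID."""
    if hot_id <= 0:
        raise ValueError("HOT ID must be a positive integer")
    return HOMD_TAXON_URL.format(hot_id=hot_id)


def hot_id_from_genbank(record_text: str) -> Optional[int]:
    """Return the HOT ID from the first HOMD db_xref qualifier, or None if absent."""
    match = _DB_XREF_RE.search(record_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical fragment; only the db_xref line is taken from the text.
    fragment = '/clone="GB024"\n/db_xref="HOMD:tax_302"\n'
    hot_id = hot_id_from_genbank(fragment)
    if hot_id is not None:
        print(hot_id, taxon_url(hot_id))
```

Returning None for records without a HOMD cross-reference keeps the helper usable on arbitrary
GenBank text, since most records carry no such qualifier.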
Status – This field displays the culturing status of the taxon. A taxon can be a validly named
cultivated species, an unnamed cultivated species, or an unnamed uncultured phylotype. The field
is updated whenever the actual status of the taxon changes.

Type strain/reference strain – If the taxon’s status is validly named cultivated species, the Type
Strain is listed here; if the taxon is an unnamed isolate, the strain information is listed as
Reference Strain. If no cultivated strain is available yet, the Reference Strain field is listed
as “None, not yet cultivated.”

Classification – The Taxon Description page lists the nomenclature of each taxonomic level from
Domain to Species. This classification is defined by HOMD and may differ from the NCBI Taxonomy.
The NCBI Taxonomy can be accessed using a dynamic link. The HOMD taxonomy is based on analysis of
where each taxon falls in phylogenetic trees generated using several treeing methods and
including over 100 non-oral reference taxa identified by searching the “greengenes” 16S rRNA gene
database (http://greengenes.lbl.gov). For example, in HOMD, an organism such as Eubacterium
saburreum is placed in the family Lachnospiraceae (because that is where it falls
phylogenetically), rather than in the family Eubacteriaceae (because its incorrect genus name
“Eubacterium” has not yet been revised). Synonyms of the taxon that are currently in use or were
used before in the literature are also provided.

16S rRNA gene sequence – GenBank accession number and link to the corresponding NCBI Entrez record
for one or more 16S rRNA gene sequences associated with the taxon.

16S rRNA gene sequence alignment – This field provides the link to the downloadable clone
sequences preliminarily aligned to the reference sequence to which the clones belong. The current
set contains the approximately 35,000 clone sequences (Dewhirst et al. 2010) aligned for each
taxon. The clone alignments are provided in concatenated FASTA format with the reference
sequence(s) on top which were used as the template for alignment. To view the alignment in color
and for further adjustment, third-party alignment viewing software may be used, such as SeaView
(http://pbil.univ-lyon1.fr/software/seaview.html) and BioEdit
(http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Because some pairs of clone sequences may be
nonoverlapping (i.e., 500-base sequences at opposite ends of the molecule), this file must be used
with caution for tree construction.

Phylogeny – A phylogenetic tree showing the position of this taxon among related HOMD taxa is
provided here. The tree images are in PDF format and can be viewed or downloaded with the link
provided in this field. A link to a list of all the downloadable phylogenetic tree images
encompassing all the HOMD taxa is also provided.

Prevalence by molecular cloning – The number of clones found for this taxon in an analysis of
approximately 35,000 clones (Dewhirst et al. 2010). Based on the number of clones found, the rank
abundance of the taxon (out of 619) is given.

Synonyms – Lists previous names for the organism if validly named. Isolate or clone designations
are given as synonyms when they have appeared in the literature as “names” for the taxon, such as
“BU063” (Zuger et al. 2007).

NCBI taxonomy – For validly named species, there is a link to the NCBI Taxonomy. NCBI has no
taxonomy for unnamed taxa, which is the reason HOMD was created.

PubMed search – The number of hits when the name (genus plus species) of this taxon is used in a
PubMed search. HOMD automatically updates this hit number every 2 weeks. To get the most
up-to-date result, simply click the “PubMed Link” to pull up the search result live from the NCBI
PubMed site. In general, there are no results for unnamed taxa, hence the need for HOMD. When
articles referencing these taxa (often through clone numbers) are found by HOMD curators or
community members, they are manually added to the Taxon Description.
Nucleotide search – Similar search as above using NCBI Entrez “nucleotide” as the reference
database. The latest result (hit count) is displayed with a link to NCBI for the most up-to-date
search.

Protein search – Similar search as above using NCBI Entrez “protein” as the reference database.
The latest result (hit count) is displayed with a link to NCBI for the most up-to-date search.

Genomic sequence – The number of genomes that have been sequenced is indicated here with a link to
a detailed list of these genomes.

Hierarchy structure – An expandable/collapsible view of a dynamically displayed taxonomy tree
indicating the position of the taxon on the page.

Cultivability – Conditions and media for growing strains of this taxon, if available.

Phenotypic characteristics – Generic phenotypic description of the taxon if the taxon has
cultivated member(s).

Prevalence and source – Describes the frequency and source of clones and isolates from different
oral sites and states of health or disease when known.

References – Literature and publications referencing this taxon. These references are manually
curated with up to ten key references which may also include older references not indexed in
PubMed.

Community comments – Registered and logged-in users can provide feedback related to this taxon. A
comment requires the approval of the HOMD curators before it is shown to the public.

Identification of 16S rRNA Gene Sequence by BLAST Search
One of the most used HOMD software tools is the customized BLAST search specifically designed to
identify user-provided 16S rRNA sequences against the comprehensive collection of the 16S rRNA
reference gene sequences. Currently there are a total of 688 taxa defined based on version 13.1
of the 16S rRNA reference sequences. Since a phylotype can include members with up to 1.5 %
sequence divergence (23 bases for a full 1,500-base sequence), multiple reference sequences have
been selected where we have sequences diverging by more than 10 bases within a taxon.

HOMD provides two primary sets of 16S rRNA gene reference sequences (RefSeq) for download and for
BLAST search. The first set is the HOMD 16S rRNA RefSeq. This set contains sequences representing
all currently named and unnamed oral taxa. In the latest reference sequence set (version 13.1 at
the time of writing), there are 834 reference sequences representing the 688 taxa. The second is
the HOMD 16S rRNA Extended RefSeq. This set contains additional 16S rRNA reference gene sequences
that are distinctively different from existing taxa but have not yet been assigned a taxon ID.

The HOMD reference sequences are corrected consensus sequences. Many have been corrected and
extended based on alignment with other sequences for that taxon, with Ns and indels removed.
Therefore, for many sequences, there will be differences between the reference sequence and the
GenBank sequence listed in the header information. We have not yet updated our own GenBank
sequences and cannot update those from other depositors. We believe these are currently the best
reference sequences available and, for the purposes of BLAST analysis, have the advantage of
being of a uniform length.

On the HOMD 16S rRNA Sequence Identification page (Fig. 5), users can copy and paste the query
sequences into the text field or upload them from their computer. The query sequences should be
in concatenated FASTA format. The maximum number of query sequences that can be uploaded in a
single search is 5,000. Since viewing the BLAST results in the web browser for over 5,000 query
sequences becomes very slow, for searches of over 5,000 sequences, please contact the HOMD team.
The HOMD 16S rRNA BLAST online tool was only designed for a modest number of sequences, up to a
couple of thousand, which can be submitted in several batches. It is not capable of handling
larger numbers of sequence reads, such as hundreds of thousands of reads from a next-generation
sequencing pipeline. For larger numbers of sequences, the search can be done on a collaboration
basis. HOMD provides secure FTP (sFTP) upload for large batches of user sequences,
Human Oral Microbiome Database (HOMD), Fig. 5 HOMD 16S rRNA Sequence Identification. (a) Query
sequence input interface; (b) Result page
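The adjusted percent identity (API) reported on the result page, which this section defines as
100 × M / (M + MM) with alignment gaps excluded, can be sketched in Python as follows. This is an
illustration under stated assumptions, not the HOMD implementation: the function names, the
representation of a hit as a tuple, and the reading of "alignment shorter than 95 % of the query"
as a comparison of alignment length with query length are all choices made for this example.

```python
from typing import Iterable, List, Tuple


def count_matches(query_aln: str, ref_aln: str, gap: str = "-") -> Tuple[int, int]:
    """Count identical (M) and mismatched (MM) columns of a pairwise alignment.

    Columns where either sequence has a gap are skipped entirely.
    """
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned strings must have equal length")
    matches = mismatches = 0
    for q, r in zip(query_aln.upper(), ref_aln.upper()):
        if q == gap or r == gap:
            continue  # gaps are excluded from the API calculation
        if q == r:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches


def adjusted_percent_identity(matches: int, mismatches: int) -> float:
    """API = 100 * M / (M + MM)."""
    aligned = matches + mismatches
    if aligned == 0:
        raise ValueError("no ungapped aligned positions")
    return 100.0 * matches / aligned


# Assumed hit layout: (reference id, M, MM, alignment length)
Hit = Tuple[str, int, int, int]


def rank_hits(hits: Iterable[Hit], query_length: int,
              min_coverage: float = 0.95) -> List[Hit]:
    """Drop hits aligned over less than min_coverage of the query, then sort by API."""
    eligible = [h for h in hits if h[3] >= min_coverage * query_length]
    return sorted(eligible,
                  key=lambda h: adjusted_percent_identity(h[1], h[2]),
                  reverse=True)


if __name__ == "__main__":
    m, mm = count_matches("ACGT-ACGTA", "ACGTTACCTA")
    print(m, mm, adjusted_percent_identity(m, mm))
```

In the toy alignment, one gapped column is skipped, so the identity is computed over nine columns
rather than ten; this is the effect the text describes for reads with indels next to the primer.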
and the search will be sent manually to the HOMD BLAST server cluster on the user’s behalf and the
results made downloadable through the sFTP site. The upload page also provides options for
adjusting the BLAST search parameters, although the default setting should be sensitive enough to
pick up matches with even short oligo sequences.

Once the query sequences are submitted, the sequences are uploaded to the HOMD computer servers
and queued for the BLAST search. Once all the searches are done, the results are presented back
to the submitter in a tabular format. Results containing up to 20 top matches for each query
sequence can be downloaded in text or Excel file formats. Original full BLAST results including
the alignments can also be accessed from the result page. The match identity is presented as
straight BLAST results and as an adjusted percent identity (API) calculated as

$$\mathrm{API} = 100 \times \frac{M}{M + \mathit{MM}}$$

where M is the matched (identical) and MM the mismatched sequence length between the query and the
reference sequence, respectively. This calculation excludes any gaps introduced during the
alignment process of the BLAST search. We have found that this correction gives much better
values for single primer sequence reads where the sequence adjacent to the primer often includes
indels. The top hits are ordered by their API rank, and sequences with alignment shorter than
95 % of the query sequence are excluded from ranking. The top four matched reference sequences are
listed by this method, and the table shown on the web page contains links to the original BLAST
results as well as to the Taxon Description pages for the reference sequences. The results for
the 20 top matches can be downloaded as plain text or in Microsoft Excel format.

Genomics

Genomics Tools Overview
Complementary to the taxonomy information, HOMD also provides comprehensive information and
tools for studying genomes of the human oral microbes. The HOMD genomics database serves as the
curated repository for the molecular sequences of the human oral microbiome, including complete
and partial genomic sequences, as well as the 16S rRNA sequences mentioned in the previous
section. Genomic sequences available at HOMD can be fully assembled genomes, high-coverage
genomes, or genome surveys. HOMD also keeps track of the status of ongoing genome sequencing
projects for human oral microorganisms. A Sequence Meta Information page is created to hold
relevant genomics and sequence meta information if a sequencing project for a human oral microbe
is announced and available in the NCBI Genome Project Database. The genome project status is
updated biweekly based on information collected from the NCBI Genome Project Database with an
automatic query script. Once genomic sequences are publicly released, they are dynamically
annotated by HOMD (Dynamic Annotation). Annotation done by other data centers, if available, is
termed “static annotation” and is viewable in a separate panel in the Genome Viewer (described
below). Relevant tools are provided for viewing and searching the annotation. These tools were
first developed as part of the Bioinformatics Resource for Oral Pathogens (BROP:
http://www.brop.org; Chen et al. 2005). The programs and the data-mining schemes used in HOMD are
designed for both finished and unfinished (collections of multiple contigs) genome sequences.
The tools are integrated with the HOMD website and are conveniently accessible by users. Icons or
links to available tools pertaining to a specific genome are automatically presented on the
relevant page. Important genomic data and bioinformatics tools provided by HOMD are described
below. Additional information on tools is also available in the previous publication (Chen
et al. 2005).

Genome Table
HOMD organizes genomes in three viewing options: Taxa with Annotated Genomes, Taxa with Genomes
in Progress, and View All Genomes. The first option lists the oral taxa with annotated (static
or dynamic) genomic information and provides links to all the genomes available for each taxon.
The View Genome button links to the Genome Table showing all the available genomes of a specific
taxon. The Genome Table shows the Oral Taxon ID (HOT), the Genus and Species names, Strain
Culture Collection, HOMD Sequence ID (SEQ ID), number of contigs and singlets, combined sequence
length, and links to available tools and information. The second option (Taxa with Genomes in
Progress) lists those oral taxa with a genomic sequencing project still in progress but no
sequence yet available. The third option shows all the genomes in alphabetical order and provides
searching and sorting functions for easy navigation. Each genome listed has a link to the
Sequence Meta Information page described next.

Sequence Meta Information
The Sequence Meta Information page provides detailed biological, molecular biological, genetic,
genomic, and taxonomic as well as annotation information for a particular strain that has been,
is being, or will be sequenced (Fig. 6). Information on these pages is semiautomatically updated.
Updated information from both Genomes OnLine and NCBI Genome Project Database is retrieved
biweekly and compared with the existing database automatically. New or modified Genomic Project
information is then added to the Sequence Meta Information pages with confirmation by curators.
The Sequence Meta Information page contains the following human-curated information related to
Human Oral Microbiome Database (HOMD), Fig. 6 Screenshot of the Sequence Meta Information page
the target organism: Oral Taxon ID, HOMD Sequence ID (SEQ ID), Organism Name (genus, species),
Culture Collection Entry Number, Isolate Origin, Sequencing Status, NCBI Genome Project ID, NCBI
Taxonomy ID, Genomes Online Goldstamp ID, NCBI Genome Survey Sequence Accession ID, JCVI
(previously TIGR) CMR ID, Sequencing Center, number of contigs and singlets, combined length
(Kbp), GC percent, DNA molecular summary, ORF annotation summary, and 16S rRNA gene sequence. In
addition, original external information such as NCBI Genome Project Database, NCBI Taxonomy
Database, Genomes OnLine Database, and rRNA in NCBI Nucleotide Database, if available, is parsed
into separate tables below the Sequence Meta Information for convenient referencing.

Full and High-Coverage Genomes
Full genomes are the oral microbial genomes that have been fully assembled, while the
high-coverage genomes are not fully assembled but represent coverage of most of the genomes.
Both types of genomes are annotated and deposited in a public database such as GenBank. HOMD
aims to provide frequently updated genomic annotation for oral bacterial genomes (see below). In
addition, HOMD provides graphical genomic viewing for static annotations done by other public
data centers such as NCBI or JCVI.

Genome Surveys
One of the original major goals of the NIH-funded project “A Foundation for the Oral Microbiome
and Metagenome,” DE016937, was to partially sequence up to 100 representative human oral
microbial species. A total of 12 low-coverage partial genomic sequences were sequenced and
deposited in NCBI before this project fused with the Human Microbiome Project. The genome
information for these 12 surveys is still maintained on HOMD even though they currently also have
complete or high-coverage genomes (The Forsyth Metagenomic Support Consortium and Izard 2010).
Since the launch of the Human Microbiome Project, the HOMD team has been providing genomic DNA
from
human oral microbes to the four HMP sequencing centers for high coverage rather than survey
sequencing (The Forsyth Metagenomic Support Consortium and Izard 2010).

Dynamic Annotation of Genomic Sequences
One of the major features of the HOMD Genomic Database is the automatic and frequent updating of
the genomic annotation pipeline for genomes of oral isolates. Although the amount of sequence
data is still growing rapidly, the computational power needed for bioinformatic analysis of this
data is catching up, and the cost and energy consumption per CPU are decreasing due to the
availability of multi-core CPU formats. The lower cost of computational power has made it
feasible for us to set up a small computation farm dedicated to the annotation of human oral
microbial genomes. HOMD recruited a cluster of multi-core multi-node computer servers to
frequently update the annotation. Current HOMD genome annotation algorithms include (i) BLASTP
(http://www.ncbi.nih.gov/BLAST; Altschul et al. 1997) search against weekly updated NCBI
nonredundant protein data (ftp://ftp.ncbi.nih.gov/blast/db/), (ii) BLASTP search against
Swiss-Prot protein data (http://us.expasy.org/sprot/; Boeckmann et al. 2003), and (iii)
InterProScan search against various sequence databases (Zdobnov and Apweiler 2001;
http://www.ebi.ac.uk/interpro/). To provide data on the functional potential of genomes, BLASTP
search results against Swiss-Prot are further processed for the construction of KEGG metabolic
pathways and Gene Ontology Trees. We take advantage of the fact that the well-annotated
Swiss-Prot protein sequence descriptions contain interlinks to the ENZYME (Bairoch 2000) and Gene
Ontology (Camon et al. 2003) databases. The dynamic genome annotation runs full time daily on the
dedicated computer cluster except during the weekend, when the latest NCBI nonredundant protein
database, Swiss-Prot, and InterPro databases are being downloaded to and updated on our server.
Currently a total of 324 genomes representing 306 taxa are being repeatedly annotated by this
pipeline. On average, each genome takes ~3 h to be annotated; thus, the current re-annotation
frequency is approximately a month for all the 300+ genomes. Additional genomes are being added
to the annotation pipeline as more sequences are made available by other public sequencing
projects such as the Human Microbiome Project (http://www.hmpdacc.org). A live update status of
the genome annotation is provided on the HOMD home page indicating the latest genome annotated or
updated. HOMD aims to maintain frequent and dynamic computer annotation for the genomic sequence
of at least one isolate from each oral taxon whenever sequences are made publicly available, as
well as static annotation of all annotated releases.

Genome Explorer
Genome Explorer is the centralized web interface that interconnects all the genomics resources in
HOMD (Fig. 7). The front end of Genome Explorer is a user-friendly interface that allows
investigators to navigate among all the genomics information provided at HOMD. HOMD Genomics
Tools can be accessed either by selecting the tool or the genome first. If the user chooses the
desired tool first, the user is then directed to the Genome Explorer interface for selecting
genomes. Once a target genome is chosen, the interface dynamically presents all the tools,
including linked external databases, available for the selected genome. Currently available
tools include Genome Viewer, Dynamic Annotation, BLAST, Annotator, EMBOSS, KEGG pathways
(Kanehisa 2002), Gene Ontology Tree (Ashburner et al. 2000), Genomewide ORF Alignment, and
Sequence Download. The back end of Genome Explorer is a searchable annotation database that
integrates all the results generated from the data-mining pipeline described below. The search
result is presented in a paginated and sortable table that also provides web links to (i) a
summary page for the individual ORF, (ii) Genome Viewer to show the exact location of the target
ORF in the genome, and (iii) the original BLAST or InterProScan results. The summary page
provides all the information and tools available for a specific ORF, including all the
data-mining results mentioned above, as well as convenient links to other web tools for
Human Oral Microbiome Database (HOMD), Fig. 7 HOMD Genome Explorer displaying results of Dynamic
Annotation for the genome Aggregatibacter actinomycetemcomitans HK1651
performing fresh search and analysis. In short, Genome Explorer is a one-stop interface for all
the genomic information available for each target genome or gene.

Genome Viewer

Genome Viewer is a unique graphical genomic sequence viewer developed originally for the BROP
project (Chen et al. 2005) (Fig. 8). The Genome Viewer was designed to alleviate the
inconvenience encountered when comparing two different sets of annotations for the same genome.
Genome Viewer provides a graphical, six-frame translational view of the same region of the
genome with individual panels showing different sets of annotations. It has easy navigating
features including zooming, centering, and searching by gene ID. For example, the genome
Porphyromonas gingivalis W83 has been annotated by JCVI (TIGR), Los Alamos National Laboratory,
and NCBI separately. These different annotations can be viewed and compared side by side in the
Genome Viewer (http://www.homd.org/index.php?name=GenomeExp&org=pgin&gprog=gview).

HOMD Genomic BLAST

With the increasing number of genomes being sequenced, the output of a high-throughput BLAST
search can be very complex and time-consuming to interpret, with many redundant results. We
recently developed a graphic tool based on newly improved BLAST+ (Camacho et al. 2009) that
allows the user to customize BLAST searches by dynamically selecting a group of any combination
of the genomic sequences available in HOMD. The HOMD Genomic BLAST provides a visual
taxonomy-based navigation interface (Fig. 9) for easy and dynamic selection of a set of genomes
for sequence homology search. The selection can be a combination of individual genomes and/or
Human Oral Microbiome Database (HOMD), Fig. 8 HOMD Genome Viewer displaying multiple sources of
annotations for Aggregatibacter actinomycetemcomitans HK1651
a group of genomes related at any taxonomic level (species, genus, etc.). The BLAST parameters
are dynamically presented after the genome selection, and the results are available on the web
and for download in multiple formats.

The HOMD Genomic BLAST query interface starts with the selection of the genomes to be searched
against. All the HOMD genomes available for search are displayed and selectable in a collapsible
tree based on the taxonomy hierarchy. As shown in Fig. 9, upon starting the HOMD Genomic BLAST,
the taxonomy hierarchical tree is fully expanded by default and can be dynamically collapsed at
any given level. The links, at the species level or genomes level, lead to the detailed Taxon
Description or Sequence Meta Information page, respectively. Numbers indicated in the square
brackets at each level are the numbers of oral taxa, genomes with meta information, genomes with
HOMD annotation,
Human Oral Microbiome Database (HOMD), Fig. 9 Screenshot of the HOMD Genomic BLAST tool – the genome
selection page showing 107 Bacteroides genomic sequences selected for BLAST Search
and genomes with NCBI annotation, respectively. The genome selection is flexible and can be a
single genome, any randomly selected individual genomes, a group of genomes at any taxonomy level
(from Domain to Species), all the genomes dynamically annotated at HOMD, all the genomes with
static annotations by NCBI, or a representative genome from all the species. The total number of
genomes selected is shown on top of the page.

After the genomes are selected, users are directed to the next page for providing the query
sequence and options for BLAST search (Fig. 10). A summary of the selected genome(s) is presented
on top of this page with an option for going back and modifying the selection. Below the summary
is the query sequence form. The query sequence, in FASTA format, can be copied and pasted into
the sequence field or uploaded directly from user’s computer. Multiple sequences are allowed
with the limit of ten sequences. BLAST parameters are dynamically changed based on the type of
query and subject sequences. The query sequences can be either nucleotide or protein sequences.
The subject can be whole genomic DNA sequences or nucleotide or amino acid sequences of the
annotated proteins of the selected genomes. Once the sequence type (nucleotide or protein) is
selected by user for both query and subject sequences, suitable BLAST programs are dynamically
displayed for selection. For example, if both query and subject sequences are proteins, only
BLASTP is available for search; likewise, if both queries and
Human Oral Microbiome Database (HOMD), Fig. 10 The HOMD Genomic BLAST tool – query sequence input
and BLAST parameter adjustment page
subjects are nucleotides, the search can be done with BLASTN, BLASTX, or TBLASTX. Furthermore,
alternative algorithms are available for nucleotide to nucleotide searches, including MegaBLAST
(Morgulis et al. 2008) and Discontiguous MegaBLAST (Morgulis et al. 2008). Similarly, for
protein to protein searches, available algorithms are BLASTP, PSI-BLAST (Altschul et al. 1997),
PHI-BLAST (Zhang et al. 1998), and DELTA-BLAST (Boratyn et al. 2012). For each BLAST program,
only the parameters and options corresponding to the selected program type and algorithm appear
on this page. Detailed information about BLAST parameters is available under the link “Help.”
For advanced users, command-line style BLAST+ parameters can be added in the Advanced Option
section (Camacho et al. 2009).

Upon submission of the BLAST search, the requested job is sent to the back-end service for
processing. The back-end service consists of a computer cluster to handle multiple requests from
the query interface. The selected genomes/nucleotides/proteins are dynamically compiled into a
virtual sequence database searchable by the BLAST programs, using the “blastdb_aliastool” tool
provided by BLAST+ (Camacho et al. 2009). The search jobs are distributed to the compute nodes of
the cluster, which is managed by the TORQUE resource manager
(http://www.adaptivecomputing.com/products/open-source/torque). During the search process, the
user is presented with an intermediate page to monitor the job status. This status page reports a
summary of the job as well as the time elapsed since submission. The status page periodically
refreshes itself, effectively polling the server while the job runs. The BLAST result is
automatically presented when the job completes.

BLAST results are presented dynamically in the output interface (Fig. 11). Users can check the
details of BLAST job information and choose to download the results in different formats, such
as HTML, archive, text, tabular, CSV, and XML. Additional jobs can also be submitted for the
same queries and subjects with modified parameters. The search strategy including the query,
subject, and BLAST parameters can be saved or downloaded for future reference. The actual BLAST
results are presented in a manner similar to the typical HTML format. They include a Graphical
Overview section (Fig. 3) to display the alignment of the “high-scoring pairs” (HSPs) between the
query and the subject sequences. HSPs are plotted against the query sequence and highlighted by
different colors based on alignment scores. Every HSP on the plot is hyperlinked with the
corresponding pairwise alignment in the Alignment section. Subject sequences that matched the
query are listed in the Descriptions section, sorted by the expected (e) values. The Alignment
section presents the alignments of the HSPs as a series of pairwise alignments. Each alignment
contains a hyperlink to the corresponding HOMD- or NCBI-annotated gene, if such information is
available.

To provide the research community with a satisfactory experience of the HOMD Genomic BLAST and
its convenient features, we currently allow up to ten query sequences to be searched in a single
job request. Since the time needed for the computation is linearly proportional to the numbers
of both query and subject sequences, we expect the maximal waiting time to be no longer than
10 min, provided no previous job is waiting in the job queue. In fact, when a total of ten
protein sequences of 500 amino acids in length were submitted to an empty queue to search
against all the protein sequences of all HOMD genomes, the job was completed in about 400 s.
Special requests may be considered for jobs containing more query sequences than the current
limit, on a collaboration basis.

The number of genomes hosted by the HOMD database has grown from approximately 600 genomes at
launch (June 2011) to nearly 1,200 genomes towards the beginning of 2013. We expect the number
to continue to grow, in concordance with the growth of the NCBI microbial genomes, as well as
the progress of the Human
Human Oral Microbiome Database (HOMD), Fig. 11 The HOMD Genomic BLAST tool result summary page showing different download options for the BLAST search results
Microbiome Project. To keep pace with this foreseeable growth and the computing power necessary for Genomic BLAST and other tools, we will continue our efforts to enhance the capabilities of HOMD’s computer backbone.

Conclusions

The goal of creating the HOMD website and tools has been to create a community resource for those interested in obtaining information on human oral bacteria and their genomes. We have attempted to create a useful provisional taxonomic scheme so that investigators can refer to phylogenetically defined taxa rather than unanchored clones or OTUs. We provide full-length reference sequences and BLAST tools tied to our taxonomic scheme. Finally, we provide access to all genomes completed for human oral bacteria.

References

Aas JA, et al. Defining the normal bacterial flora of the oral cavity. J Clin Microbiol. 2005;43:5721–32.
Alcaraz LD, et al. Identifying a healthy oral microbiome through metagenomics. Clin Microbiol Infect. 2012;18 Suppl 4:54–7.
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–9.
Bairoch A. The ENZYME database in 2000. Nucleic Acids Res. 2000;28:304–5.
Belda-Ferre P, et al. The oral metagenome in health and disease. ISME J. 2012;6:46–56.
Bik EM, et al. Bacterial diversity in the oral cavity of 10 healthy individuals. ISME J. 2010;4:962–74.
Boeckmann B, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–70.
Boratyn GM, Schäffer AA, Agarwala R, Altschul SF, Lipman DJ, Madden TL. Domain enhanced lookup time accelerated BLAST. Biol Direct. 2012;7:12. doi:10.1186/1745-6150-7-12. PMID: 22510480.
Camacho C, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.
Camon E, et al. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 2003;13:662–72.
Chen T, et al. The bioinformatics resource for oral pathogens. Nucleic Acids Res. 2005;33:W734–40.
Chen T, et al. The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information. Database (Oxford). 2010;2010:baq013.
Dewhirst FE, et al. The human oral microbiome. J Bacteriol. 2010;192:5002–17.
Dzink JL, et al. Gram negative species associated with active destructive periodontal lesions. J Clin Periodontol. 1985;12:648–59.
Dzink JL, et al. The predominant cultivable microbiota of active and inactive lesions of destructive periodontal diseases. J Clin Periodontol. 1988;15:316–23.
Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012a;486:215–21.
Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012b;486:207–14.
Kanehisa M. The KEGG database. Novartis Found Symp. 2002;247:91–101. discussion 101–103, 119–128, 244–152.
Martin J, et al. Optimizing read mapping to reference genomes to determine composition and species prevalence in microbial communities. PLoS One. 2012;7:e36427.
Moore WE, Moore LV. The bacteria of periodontal diseases. Periodontol 2000. 1994;5:66–77.
Moore WE, et al. Bacteriology of severe periodontitis in young adult humans. Infect Immun. 1982;38:1137–48.
Moore WE, et al. Bacteriology of moderate (chronic) periodontitis in mature adult humans. Infect Immun. 1983;42:510–5.
Morgulis A, et al. Database indexing for production MegaBLAST searches. Bioinformatics. 2008;24:1757–64.
Paster BJ, Dewhirst FE. Phylogeny of campylobacters, wolinellas, Bacteroides gracilis, and Bacteroides ureolyticus by 16S ribosomal ribonucleic acid sequencing. Int J Syst Bacteriol. 1988;38:56–62.
Socransky SS, Haffajee AD. Evidence of bacterial etiology: a historical perspective. Periodontology. 1994;5:7–25.
Tanner AC, et al. A study of the bacteria associated with advancing periodontitis in man. J Clin Periodontol. 1979;6:278–307.
Tanner A, et al. Microbiota of health, gingivitis, and initial periodontitis. J Clin Periodontol. 1998;25:85–98.
The Forsyth Metagenomic Support Consortium, Izard J. Building the genomic base-layer of the oral “omic” world. In: Sasano T, Suzuki O, editors. Interface oral health science 2009: proceedings of the 3rd international symposium for interface oral health science. New York: Springer; 2010.
Xie G, et al. Community and gene composition of a human dental plaque microbiota obtained by metagenomic sequencing. Mol Oral Microbiol. 2010;25:391–405.
Zdobnov EM, Apweiler R. InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001;17:847–8.
Zhang Z, Schäffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF. Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 1998;26(17):3986–90.
Zuger J, et al. Uncultivated Tannerella BU045 and BU063 are slim segmented filamentous rods of high prevalence but low abundance in inflammatory disease-associated dental plaques. Microbiology. 2007;153:3809–16.
I
Insights into Environmental Microbial Denitrification from Integrated Metagenomic, Cultivation, and Genomic Analyses

Stefan J. Green1, Lavanya Rishishwar2, Om Prakash3, I. King Jordan4 and Joel Kostka5
1University of Illinois at Chicago, Chicago, IL, USA
2Bioinformatics, Georgia Institute of Technology, Atlanta, GA, USA
3National Centre for Cell Science, Pune, Maharashtra, India
4School of Biology, Georgia Institute of Technology, Atlanta, GA, USA
5School of Biology and Earth & Atmospheric Sciences, Georgia Institute of Technology, Atlanta, GA, USA

Synonyms

Genome sequencing; Metagenomic technology; Nitrogen cycling

Definition

The high sequence diversity of microbial functional genes can hinder cultivation-independent molecular analyses. Likewise, cultivation-based approaches also provide a distorted picture of in situ microbial communities. When cultivation and cultivation-independent molecular approaches are applied in tandem, deeper insights into the community structure of organisms catalyzing specific metabolic functions can be obtained. Coupled cultivation, amplicon, genome, and metagenome sequence data, targeting denitrifying bacteria from a highly contaminated subsurface environment, were analyzed to reveal novel denitrifier diversity and the extent of bias associated with commonly used PCR primer sets targeting denitrification genes. Furthermore, genome sequencing revealed that some denitrifiers are incapable of denitrification from nitrate and demonstrated the need for integrated molecular and cultivation approaches to characterization of microbial communities.

Introduction

The advent of next-generation sequencing platforms and the subsequent increased availability of genomic and metagenomic sequence data have revolutionized environmental microbiology. However, though our eyes have been opened to the vast genotypic and metabolic potential of microbial communities in nature, exploration of the role of specific microbial groups in ecosystem function still requires the application of cultivation-based approaches. In fact, the verification of microbial phenotypes through cultivation is arguably more critical than ever as metagenomic information now allows for the generation of boundless hypotheses based on the metabolic potential represented by complex
microbial communities. Although the advances in cultivation-independent molecular analyses of microbial communities have been well advertised (e.g., high-throughput amplicon sequencing (e.g., Caporaso et al. 2011), metagenomics (e.g., Tringe et al. 2005), and metatranscriptomics (e.g., Poretsky et al. 2009)), parallel advances in cultivation have also been made, including the use of lower organic carbon media, extended incubation, single-cell encapsulation approaches, and overall improved mimicking of natural conditions within a culture vessel (e.g., Bollmann et al. 2007; Kaeberlein et al. 2002; Zengler et al. 2002). Here, data from metagenomic sequencing and from isolation, physiological testing, and whole-genome sequencing of denitrifying bacteria from the highly contaminated subsurface of the Oak Ridge Integrated Field Research Challenge (ORIFRC) site are considered, along with the implications of this analysis for understanding the environmental distribution and ecological niche of denitrifying bacteria.

The ORIFRC Site

The ORIFRC site is highly contaminated with spent uranium and a wide variety of other contaminants (e.g., other radionuclides, heavy metals, and volatile organic contaminants) as a result of long-term uranium enrichment for nuclear weapons, coupled with improper disposal in unlined ponds (S-3 ponds) (Brooks 2001; Kostka and Green 2011; NABIR 2003; Watson et al. 2004). Although the ponds have been subsequently drained, much of the contaminant has migrated into the subsurface, where it serves to feed a plume migrating down-gradient across the site (Watson et al. 2005). Uranium is the priority contaminant of concern, though the nitrate in the near-source zone (adjacent to the former S-3 ponds) reaches extraordinarily high concentrations (in the range of 10–1,000 mM) due to the use of nitric acid in the processing of uranium. The high level of nitrate complicates remediation strategies at the site by inhibiting microbial reduction of soluble hexavalent uranium to an insoluble mineral form of tetravalent uranium (e.g., Finneran et al. 2002; Kostka and Green 2011; Shelobolina et al. 2003). The moderately high acidity in the source zone (pH 3–4) also suppresses microbial activity and diversity (Fields et al. 2005; Hemme et al. 2010). Despite the restrictive conditions, there is evidence for significant nitrous oxide production in the near-source zone (Spalding and Watson 2008). As the low pH is ameliorated down-gradient of the source zone, nitrate, nitrous oxide, and soluble uranium are attenuated without active remediation, due to both microbial and geochemical processes (Kowalsky et al. 2011).

The contaminant levels in the near-source zone are alarming, and source zone remediation strategies have been examined, with limited success (Wu et al. 2007). The extraordinary levels of nitrate must be removed before microbial reduction of U(VI) to U(IV) can proceed (Akob et al. 2008; Luo et al. 2005; Wu et al. 2006, 2010), and down-gradient remediation has been more effective as nitrate is essentially absent (e.g., Gihring et al. 2011). The presence of nitrous oxide in the source zone wells suggested the presence of in situ denitrification, and thus interest grew in microorganisms capable of nitrate reduction at in situ pH, with the hope that stimulation of these native organisms could aid in the long-term removal of uranium from the site groundwater. Initial studies revealed significant diversity in nitrite reductase genes in groundwater at the site, including genes encoding both the copper-containing (nirK) and cytochrome (nirS) forms (Palumbo et al. 2004; Yan et al. 2003).

Based on metagenomic analysis of acidic groundwater from the site, Hemme et al. (2010) hypothesized that denitrification comprised the predominant form of metabolism in the near-source zone microbial community due to the low oxygen and lack of fermentation genes observed there. The overabundance of nitrate/nitrite antiporters in the metagenome was interpreted as a further indication of the strong effect of the elevated nitrate on the source zone microbial community.

Prior to the metagenome sequencing of the acidic groundwater at the ORIFRC site, cultivation-independent molecular surveys had been performed to track denitrifying organisms. As the denitrification phenotype is a polyphyletic
trait, and can be acquired readily via lateral gene transfer, ribosomal RNA gene sequencing is not suitable for identifying and tracking denitrifying organisms. Functional gene assays, targeting nitrate, nitrite, nitric oxide, and nitrous oxide reductases, have been performed for this purpose. Yan et al. (2003) and Palumbo et al. (2004) performed site-wide surveys of nitrite reductase genes at the ORIFRC site. However, no clear pattern relating the composition and relative abundance of nitrite reductase genes to groundwater geochemical conditions was observed. For example, a principal component analysis of clusters of nirK (the gene encoding copper-containing nitrite reductase) sequences grouped all wells across the pH gradient together, with the exception of one high-nitrate groundwater sample. In all wells, the most abundant nirK sequences were most similar to the nirK gene sequence derived from Hyphomicrobium zavarzinii, and all sequences were most similar to gene sequences derived from Proteobacteria. Thus, although a substantial diversity of nitrite reductase genes was observed, with many novel gene sequences recovered, more recent data from genome and metagenome sequencing indicate that the predominant denitrifiers were not detected in single-gene surveys (Green et al. 2010, 2012; Hemme et al. 2010).

Combined Cultivation and Direct Molecular Studies of Denitrifying Bacteria

The study of denitrifying microorganisms at the ORIFRC field site was approached in a multipronged fashion, including (a) site-wide microbial community characterization using DNA extraction from sediment and groundwater, coupled with high-throughput bacterial ribosomal RNA (rRNA) gene amplicon sequencing, (b) quantitative PCR (qPCR) analyses of bacterial small subunit (SSU) rRNA and nitrite reductase (nirK) gene abundance in groundwater and sediment samples, (c) cultivation and physiological testing of denitrifying bacteria from sediment and groundwater, and (d) de novo whole-genome sequencing of denitrifying isolates. Subsequently, genomic DNA (gDNA) samples from the site were reanalyzed with novel primers targeting unique nirK genes, and whole-genome sequences were also recovered from non-denitrifying reference strains related to organisms isolated from the field site.

Bacteria from six distinct genera of denitrifiers were isolated, including strains of Hyphomicrobium (Alphaproteobacteria), Afipia (Alphaproteobacteria), Pseudomonas (Gammaproteobacteria), Rhodanobacter (Gammaproteobacteria), Bacillus (Firmicutes), and Intrasporangium (Actinobacteria) (Green et al. 2010). Under laboratory conditions, all strains were capable of growth with nitrate as the sole electron acceptor, though the Gram-positive strains produced only nitrous oxide as a terminal product, while Rhodanobacter spp. produced a mixture of nitrous oxide and nitrogen gas. Physiological and genetic characterization of the isolates from the genus Rhodanobacter was prioritized, as these organisms had been detected in great abundance in acidic groundwater as well as sediments from the near-source zone (Green et al. 2010, 2012). Bacteria from this genus were revealed to have extraordinarily high relative abundance in the near-source zone, over multiple sampling seasons, and were sometimes the only active organisms detected in RNA-based analyses of groundwater samples (Green et al. 2012). Highly similar strains were independently isolated from ORIFRC site sediment using a diffusion chamber approach (Bollmann et al. 2010), and in a metagenomic survey of acidic groundwater from the site, one of the dominant organisms detected (so-called FW106 gI) is clearly a member of the genus Rhodanobacter (Hemme et al. 2010). This organism contained a full denitrification pathway.

Despite the apparent numerical abundance of members of the genus Rhodanobacter in the acidic source zone, these organisms were not detected in prior molecular surveys of denitrification pathway genes at the ORIFRC site (Palumbo et al. 2004; Yan et al. 2003). Nor could PCR amplification of nirS (cytochrome cd1-containing nitrite reductase), nirK, or nosZ (nitrous oxide reductase) genes be achieved using standard primer sets (Green et al. 2010). Similar challenges were presented by the other
isolated strains, excepting Afipia. For the Hyphomicrobium strain, a novel primer set targeting nirK was designed based on a reference gene available in GenBank, but no similar reference sequences were available for the other strains. Subsequently, metagenome sequence data from acidic groundwater acquired at the site (Hemme et al. 2010) were surveyed, and two novel nirK sequences were identified. Using these de novo assembled sequences, primer sets were developed that allowed the amplification of a nirK gene from the Rhodanobacter isolates and from putative Rhodanobacter organisms in environmental genomic DNA (Green et al. 2010, 2012). Quantitative PCR analysis was used to quantify SSU rRNA and nirK gene abundance in groundwater from across the watershed, and this analysis revealed that nirK genes were present in abundance across the ORIFRC site, including nirK genes derived from Rhodanobacter (Green et al. 2012). Coupled with relative abundance measurements derived from qPCR of rRNA genes and from rRNA gene amplicon sequencing, this analysis revealed that Rhodanobacter were the most abundant organisms in the near-source zone, that nirK genes most similar to those from Rhodanobacter strains were most abundant in the near-source zone, and that Rhodanobacter organisms were active, not just present, in the near-source zone. Coupled with in vitro analysis of the physiological capabilities of Rhodanobacter strains in pure culture, these data led to the hypothesis that bacteria from the genus Rhodanobacter are the dominant near-source zone denitrifiers at the ORIFRC site. This hypothesis is supported by studies conducted in other ecosystems which demonstrate that Rhodanobacter spp. dominate under low pH, denitrifying conditions (e.g., van den Heuvel et al. 2010).

Direct PCR amplification of nitrite reductase genes from Rhodanobacter and other denitrifiers isolated from the site was not successful using standard primers, and subsequently, de novo shotgun genome sequencing and draft assembly of these bacterial denitrifiers were performed. The initial draft sequences of Rhodanobacter and Intrasporangium recovered complete nirK genes and helped determine the cause of PCR amplification failure. First, the putative nitrite reductase genes from these organisms were highly divergent from many sequences present in gene databases, and the sequences contained a large number of mismatches with the most commonly used primer sets for targeting bacterial nirK genes (e.g., 10 and 11 mismatches, respectively, between primer R3Cu and the first and second nirK genes of R. denitrificans 2APBS1 (Green et al. 2010; Hallin and Lindgren 1999)). In addition, most Rhodanobacter spp. have two highly divergent nirK genes located in different positions in the genome (Green et al. 2010; Kostka et al. 2012). Two independently isolated strains of Rhodanobacter (Bollmann et al. 2010) similarly contain two nirK genes apiece, and both are nearly (>99% similar) or completely identical to nirK genes from R. denitrificans 2APBS1T. Both forms of nirK are expressed under denitrifying conditions in R. denitrificans 2APBS1T, but the purpose of two copies of the gene is not yet clear (Green et al. 2012). One copy of the gene, colloquially called “nirK-B,” is most similar to nirK genes from certain Proteobacteria, including Betaproteobacteria from the genera Burkholderia and Ralstonia. The second copy, called “nirK-V,” is most similar to the nirK gene from Opitutus terrae PB90-1, within the phylum Verrucomicrobia.

To examine this phenomenon on a broader phylogenetic scale, Green et al. (2010) recovered complete nirK and nosZ genes from a number of microorganisms which had been sequenced by the Joint Genome Institute. These genes were aligned and primer binding sites were identified. This analysis revealed that the difficulty in amplifying nirK genes from ORIFRC site isolates is symptomatic of a broader difficulty in detecting denitrifying bacteria through single primer set amplification due to large numbers of mismatches between primer and gene sequences. The commonly used primer sets (including quantitative PCR primer sets) target a relatively narrow range of organisms, primarily within the Proteobacteria (Green et al. 2010). Thus, molecular approaches that depend upon single primers, even heavily degenerate primers, cannot be used
suitably to detect or quantify denitrifiers in environmental samples, and the true diversity and abundance of denitrifiers are most likely greatly underestimated from current surveys. Alternate approaches, which utilize the full availability of reference sequence data derived from de novo genome sequencing and from shotgun metagenome sequencing of environmental samples, must be developed to more fully assess the distribution of these important organisms.

Although the nitrite reductase gene is a particularly dramatic example, it is not unique in this regard, and other functional genes of significance to biogeochemical processes have shown similar levels of sequence diversity. The sequence diversity of nirK may be in part due to the multiple physiological roles for nitrite reduction (detoxification, respiration), different conditions under which the enzymes may be active (e.g., prior to anoxic conditions, after total anoxia), and multiple locations for nitrite reductases (periplasm, inner membrane) and for the different forms of the gene (copper nitrite reductase, nirK, and cytochrome cd1, nirS). This broad sequence divergence with retained function is present in other functional genes, including other genes in the denitrification pathway (e.g., nosZ; Green et al. 2010; Jones et al. 2013; Sanford et al. 2012).

Although many Rhodanobacter spp. isolated from the ORIFRC site subsurface were capable of complete denitrification, some members of the genus were incapable of growth on nitrate. Similarly, in a survey of the literature regarding Rhodanobacter, most strains were identified as aerobic bacteria, incapable of nitrate reduction. Strains isolated independently from the ORIFRC site were observed to be acid tolerant (arrest of growth was observed at pH 3.5–4), tolerant of high levels of nitrate (up to 250 mM), and moderately tolerant of various heavy metals, including uranium (Bollmann et al. 2010). The initial description of R. thiooxydans, the closest relative of R. denitrificans, indicated that the organism was capable of nitrate, but not nitrite, reduction (Lee et al. 2007). Subsequent work, however, demonstrated that these organisms are capable of complete denitrification from nitrate (Prakash et al. 2012; van den Heuvel et al. 2010). More recently, a novel species, R. caeni, was described as capable of nitrate reduction to nitrite, but no evidence for complete denitrification was demonstrated (Woo et al. 2012). Likewise, R. sp. strain A2-61, shown to form intracellular uranium-phosphate complexes, was unable to reduce nitrate (Sousa et al. 2013).

To understand the genetic basis of the differences in physiology with respect to denitrification, the genomes of five additional strains of bacteria from the genus Rhodanobacter were sequenced (Kostka et al. 2012). In total, three strains of denitrifying Rhodanobacter were sequenced (R. denitrificans 2APBS1T, R. denitrificans 116-2, R. thiooxydans) alongside three strains of apparent non-denitrifying (from nitrate) Rhodanobacter (R. fulvus Jip2 (Im et al. 2004), R. spathiphylli B39 (De Clercq et al. 2006), and R. sp. 115, isolated from the ORIFRC site (Kostka et al. 2012)). Preliminary analysis of the genomes of the six Rhodanobacter strains revealed that all members of the genus contained nearly complete denitrification pathways, including two copies of the nitrite reductase gene nirK (excepting R. spathiphylli, with only a single copy). All denitrifying isolates contained many genes in the dissimilatory denitrification pathway, but non-denitrifying isolates were missing several key genes involved in nitrate respiration, such as nitrate reductase genes (i.e., narG, narH, narJ, and narI). The genomic context of these genes was further examined, and it was observed that the nitrous oxide reductase genes (e.g., nosZ) showed the greatest synteny among all six genomes (Fig. 1). Since relatively few organisms conduct nitrous oxide reduction alone, it may be supposed that the high level of synteny in this gene and the lower synteny in other parts of the denitrification pathway favor the hypothesis that the common ancestor of the bacteria within the genus Rhodanobacter likewise contained a full denitrification pathway, with subsequent rearrangement of the genes in the pathway. Further clarity will be obtained with additional whole-genome sequences of related organisms from the Xanthomonadaceae.
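The primer-mismatch counts reported in this section (for example, 10 and 11 mismatches between primer R3Cu and the two nirK genes of R. denitrificans) come from comparing a primer with its binding site position by position. The Python sketch below illustrates that kind of comparison under simplifying assumptions. Degenerate positions are expanded using the standard IUPAC nucleotide codes, the primer is compared in the same orientation as the template (a real reverse primer would first be reverse-complemented), and the sequences are short hypothetical strings, not the actual R3Cu primer or any real nirK sequence. The function names are illustrative and are not taken from the cited studies.

```python
from typing import Tuple

# IUPAC nucleotide codes mapped to the bases each code can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer: str, site: str) -> int:
    """Count positions where the site base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have equal length")
    return sum(
        1
        for p, s in zip(primer.upper(), site.upper())
        if s not in IUPAC[p]
    )


def best_site(primer: str, template: str) -> Tuple[int, int]:
    """Slide the primer along the template.

    Returns (offset, mismatches) for the window with the fewest mismatches;
    ties are resolved in favor of the leftmost window.
    """
    k = len(primer)
    if len(template) < k:
        raise ValueError("template is shorter than the primer")
    scored = [
        (count_mismatches(primer, template[i:i + k]), i)
        for i in range(len(template) - k + 1)
    ]
    mismatches, offset = min(scored)
    return offset, mismatches


# Hypothetical toy sequences for illustration only.
toy_primer = "ACGRTTY"        # R = A or G, Y = C or T
toy_template = "GGACGATTCAA"

if __name__ == "__main__":
    print(best_site(toy_primer, toy_template))
```

Applied to an alignment of reference genes, a count of this kind could be used to flag genes that a given primer is unlikely to amplify; the cutoff would be an analyst's choice and is not specified in the text.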
Insights into Environmental Microbial Denitrification from Integrated Metagenomic, Cultivation, and Genomic Analyses, Fig. 1 Gene order in the genomic region of the nitrous oxide reductase gene (nosZ) in denitrifying and apparent non-denitrifying strains of bacteria from the genus Rhodanobacter. Strong gene synteny is observed between denitrifying (highlighted in green) and apparent non-denitrifying lineages (highlighted in pink). Gene order in Marinobacter aquaeolei VT8 (Gammaproteobacteria, Alteromonadaceae), capable of anaerobic growth on nitrate, was included as an out-group organism with a complete genome sequence. Gene symbols: apbE, ApbE family lipoprotein; cheY-like, two-component system sensor histidine kinase-response regulator hybrid protein; dapE, succinyldiaminopimelate desuccinylase; DUF, protein of unknown function DUF2165; hip, high potential iron-sulfur protein; hisK, sensor histidine kinase; HYP, hypothetical protein; nosD, periplasmic copper-binding protein; nosF, ABC transporter related protein; nosL, NosL protein; nosR, nitrous oxide expression regulator, NosR; nosY, ABC-type transport system involved in multi-copper enzyme maturation, permease component; nosZ, nitrous oxide reductase; PGA, peptidase S45 penicillin amidase; tatA, twin-arginine translocation protein, TatA/E; tatB, twin-arginine-targeting protein translocase TatB; tatC, twin-arginine-targeting protein translocase subunit TatC; trxB, thioredoxin reductase oxidoreductase; badM/Rrf2, BadM/Rrf2 family transcriptional regulator; nifB, molybdenum cofactor biosynthesis protein A; ppiC, PpiC-type peptidyl-prolyl cis-trans isomerase
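The gene-order comparison summarized in Fig. 1 can be approximated numerically. The Python sketch below is one simple way to do so and is not the method used by the authors, since the text does not state how synteny was assessed. It scores conservation of gene order as the fraction of neighboring gene pairs in one genome that are also neighbors in another. The two gene orders are hypothetical, assembled from symbols in the figure legend purely for illustration, and the function names are invented for this example.

```python
def adjacent_pairs(order):
    """Return the set of unordered neighboring gene pairs in a gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def shared_adjacency_fraction(order_a, order_b):
    """Fraction of neighboring pairs in order_a that are also neighbors in order_b."""
    pairs_a = adjacent_pairs(order_a)
    if not pairs_a:
        return 0.0
    return len(pairs_a & adjacent_pairs(order_b)) / len(pairs_a)


# Hypothetical gene orders; these do not reproduce the arrangement in Fig. 1.
genome_a = ["nosR", "nosZ", "nosD", "nosF", "nosY", "nosL"]
genome_b = ["nosR", "nosZ", "nosD", "nosF", "nosL", "nosY"]

if __name__ == "__main__":
    print(shared_adjacency_fraction(genome_a, genome_b))
```

Because pairs are treated as unordered, a local inversion of two genes changes the score less than a relocation of a gene to a distant position. This is a design choice of the sketch, not a property described in the source.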
Conclusions Regarding Rhodanobacter

Bacteria from the genus Rhodanobacter appear to fill a relatively specific ecological niche, but under appropriate conditions, these organisms can dominate to an extreme extent. Conditions which appear to enable bacteria from the genus Rhodanobacter to dominate include low pH, high nitrate, low/variable oxygen concentrations, and heavy metal contamination. Although data in the literature are not particularly abundant for Rhodanobacter, what is present suggests that heavy metal tolerance is a common feature of these organisms. Bollmann et al. (2010) isolated two strains of Rhodanobacter that are tolerant of 200 micromolar uranium (as well as other heavy metals), and most recently Sousa et al. (2013) described R. sp. strain A2-61, tolerant of up to 500 micromolar uranium, under aerobic conditions. R. denitrificans strains are capable of tolerating 1 mM uranium (data not shown).
Interestingly, R. sp. strain A2-61 was capable of forming intracellular uranium-phosphate complexes, presumably a detoxification strategy. In a survey of the genome of R. denitrificans 2APBS1T, multiple genes involved in metal resistance have been detected, and these genes are strongly associated with horizontal gene transfer as indicated by low lineage probability scores (LPI), anomalous nucleotide compositions, and association with putative mobile genetic elements such as transposons and integrons (data not shown).

The presence of a near-complete denitrification pathway in “non-denitrifying” strains of bacteria from the genus Rhodanobacter suggests that denitrification capability is an inherent trait of all members of the genus but that denitrification by these organisms often requires nitrite rather than nitrate. Since nitrite is often available where there is nitrate, and a number of organisms are capable of nitrate-to-nitrite reduction but cannot reduce nitrite further, the lack of a nitrate reductase may not be overly limiting for facultative anaerobes such as members of the Rhodanobacter. For example, in a study of denitrification capabilities in bacteria from the genus Bacillus, most-probable-number assays of a soil sample revealed nearly an order of magnitude greater abundance of organisms capable of nitrate-to-nitrite reduction relative to complete denitrifiers (Verbaendert et al. 2011). A further confounding observation is the presence of two putative nirK genes in almost all Rhodanobacter, including the non-nitrate reducers. It may be that the multiple nitrite reductases are involved in tolerance of high nitrate/nitrite conditions, stressful conditions that are further exacerbated by low pH (Spain and Krumholz 2012). The nitrite reductases may also represent two different strategies relating to denitrification by Rhodanobacter under fluctuating aerobic/anaerobic conditions, such as those found in the ORIFRC site subsurface. As described by Bergaust et al. (2011), bacteria can employ complex strategies to maximize energy generation, but provide insurance in case of sudden changes in environmental condition. Thus,

anaerobes) will favor the use of oxygen as terminal electron acceptor, and repress nitrogen oxyanion reduction to avoid loss of ATP-generation capability through a truncated respiratory pathway, and “entrapment” under anoxic conditions without capability to continue respiration (Bergaust et al. 2011). It has been hypothesized that an earlier onset of denitrification (in terms of oxygen concentration) is an indication of the likelihood for nitrous oxide production by the strain (Bergaust et al. 2011; Zumft and Kroneck 2007). This is consistent with the initial characterization of R. denitrificans, in which both nitrous oxide and dinitrogen accumulated during pure culture growth conditions in vitro, while other isolates from the site completed denitrification to dinitrogen (Afipia, Hyphomicrobium) or nitrous oxide only (Gram positives; Bacillus and Intrasporangium) (Green et al. 2010). Further work is needed to determine the regulatory strategy taken by Rhodanobacter in the subsurface under aerobic/microaerophilic/anaerobic conditions.

Are Rhodanobacter extremophiles? Based on the current data, it is not clear that they are. Although members of the genus can grow at pH values below pH 4, the optimum growth pH for R. denitrificans 2APBS1 is pH 6 (Bollmann et al. 2010; Prakash et al. 2012). However, even at circumneutral pH with excess organic carbon, growth by R. denitrificans is slow (generation time ~24 h). This may represent another strategy by Rhodanobacter strains leading to dominance in contaminated/extreme environments, but low relative abundance in more ameliorated conditions. It appears most likely that Rhodanobacter retain a variety of physiological capabilities, namely anaerobic growth, metal tolerance and detoxification, a denitrification phenotype, and broad carbon substrate utilization capability (including acetate), that under specific environmental conditions provide them with the opportunity for dominance.

Conclusions Regarding Denitrification

The ORIFRC, with nitrate-replete groundwaters,
while in the presence of oxygen, denitrifying represents an ideal natural laboratory for investi-
bacteria (which are nearly always facultative gation of the microbial populations that mediate
denitrification. Through a close coupling of cultivation-based and molecular approaches, characterization of denitrifying bacteria from the ORIFRC site has significant implications not just for broader characterization of denitrifying organisms but also for the application of PCR-based approaches to characterize microbial functional groups. With specific reference to denitrification, it was observed that the most commonly used primers targeting functional genes within the dissimilatory denitrification pathway were highly biased to a select group of genes largely derived from bacteria within the Proteobacteria, and the genes from organisms outside this group could not conceivably be targeted with PCR due to the excessively large number of mismatches between primer and gene sequence. Thus, results generated from single-gene primer sets (even degenerate ones) must be interpreted carefully. A similar finding has been obtained for nitrous oxide reductase genes as well (Sanford et al. 2012). Since de novo genome and shotgun metagenome sequences generate gene sequences that are clearly identifiable as nitrite (or nitrous oxide) reductases but also impossible to target with common primers, new strategies must be developed to detect a broader collection of denitrifiers in the environment. As the organisms capable of denitrification are broadly distributed and are polyphyletic, functional gene analyses will continue to be essential to identify and quantitate denitrifying microorganisms and to characterize denitrifying microbial communities.
One of the essential extrapolations of these findings is that the true abundance of denitrification capability in bacterial lineages is underestimated due to two processes revealed in this study. First, the high sequence divergence present in functional genes in the denitrification pathway limits the detection of denitrification genes from isolates through PCR and sequencing. Second, the partial pathway observed in Rhodanobacter strains suggests that when searching for denitrification capabilities, other electron acceptors besides nitrate should be tested. In a sense, cultivation approaches and physiological testing of Rhodanobacter strains have been partially misleading regarding the potential ecological niche for these organisms, and only when coupled with whole-genome sequencing has the putative in situ functional capability of these organisms been revealed. In an analysis of Bacillus isolates and culture-collection strains, Verbaendert et al. (2011) revealed that nitrate was not always a suitable electron acceptor for verification of denitrification capability and that 20 % of denitrifying strains could use nitrite but not nitrate to initiate denitrification. They opine that the true abundance of denitrifiers is underestimated because typically only nitrate is used as an electron acceptor when testing for denitrification capability, and this is consistent with observations of isolates of the genus Rhodanobacter. Remarkably, they also observed that growth conditions can affect electron acceptor utilization, and this can further lead to physiological capabilities being missed. No doubt analogous situations for other genes, organisms, and functions are with us, waiting to be identified. Thus, it seems clear that for more robust physiological characterization of bacterial strains, genome-guided physiological testing must be implemented. Such an approach will have profound implications for the assessment of the ecological role of bacterial taxa.
Prior to the acquisition of multiple genomes from the genus Rhodanobacter, the denitrification phenotype in Rhodanobacter strains was hypothesized to result from a relatively recent lateral gene transfer rather than from vertical transmission, which now appears to be the case (Green et al. 2010). Hemme et al. (2010) also opined that the inferred lateral gene transfer events most likely occurred after the introduction of contamination at the site. With multiple genomes in hand, phylogenetic analysis of the nitrite reductase genes from the whole-genome sequences of multiple Rhodanobacter strains revealed a phylogeny consistent with that of the rRNA genes from the same organisms. If there were lateral gene transfer events, these predated the last common ancestor of the genus Rhodanobacter, with the most parsimonious
interpretation being that nitrate reduction capability was later lost from certain members of the genus. The evolutionary history of the full denitrification pathway, however, appears to be fragmented: for example, the nirK genes do appear to be derived from a lateral gene transfer, but this transfer is not recent and certainly is independent of the ORIFRC site. The Rhodanobacter nosZ genes are more consistent with other Gammaproteobacterial denitrifiers. It is possible, though entirely speculative, that Rhodanobacter previously had type (or class) I soluble periplasmic nitrite reductases, like those present in Pseudomonas denitrificans, and these have been subsequently replaced by type II cytoplasmic membrane nitrite reductases. The ecologic benefit derived from this is not clear yet, but may relate to activity under aerobic and anaerobic conditions, as has been observed for nitrate reductases (Bedzyk et al. 1999).

Summary
A combination of approaches to the study of denitrifying bacteria in a contaminated subsurface environment, including cultivation and physiological testing of denitrifying bacteria, de novo whole-genome sequencing, and shotgun metagenome sequencing, revealed key limitations to the application of more straightforward molecular approaches. Commonly used PCR primers targeting functional genes in the denitrification pathway are shown to be incapable of detecting a broad diversity of environmental denitrifiers. Likewise, some denitrifiers are incapable of initiating denitrification from nitrate and may be misidentified in routine physiological testing of bacterial isolates. Bacteria from the genus Rhodanobacter, which can be abundant in highly contaminated environments with low pH, appear to be native denitrifiers, while metal resistance genes appear to have been acquired via lateral gene transfer. Overall, Rhodanobacter dominate in certain environments with low pH, heavy metal contamination, and conditions favoring the denitrification phenotype.

Cross-References
▶ Culture Collections in the Study of Microbial Diversity, Importance
▶ Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing
▶ GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and Activities
▶ Lateral Gene Transfer and Microbial Diversity

References
Akob DM, Mills HJ, Gihring TM, Kerkhof L, Stucki JW, Anastacio AS, Chin KJ, Kusel K, Palumbo AV, Watson DB, Kostka JE. Functional diversity and electron donor dependence of microbial populations capable of U(VI) reduction in radionuclide-contaminated subsurface sediments. Appl Environ Microbiol. 2008;74:3159–70.
Bedzyk L, Wang T, Ye RW. The periplasmic nitrate reductase in Pseudomonas sp. strain G-179 catalyzes the first step of denitrification. J Bacteriol. 1999;181:2802–6.
Bergaust L, Bakken LR, Frostegard A. Denitrification regulatory phenotype, a new term for the characterization of denitrifying bacteria. Biochem Soc Trans. 2011;39:207–12.
Bollmann A, Lewis K, Epstein SS. Incubation of environmental samples in a diffusion chamber increases the diversity of recovered isolates. Appl Environ Microbiol. 2007;73:6386–90.
Bollmann A, Palumbo AV, Lewis K, Epstein SS. Isolation and physiology of bacteria from contaminated subsurface sediments. Appl Environ Microbiol. 2010;76:7413–9.
Brooks SC. Waste characteristics of the former S-3 ponds and outline of uranium chemistry relevant to NABIR Field Research Center studies. Oak Ridge: NABIR Field Research Center; 2001.
Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A. 2011;108 Suppl 1:4516–22.
De Clercq D, Van Trappen S, Cleenwerck I, Ceustermans A, Swings J, Coosemans J, Ryckeboer J. Rhodanobacter spathiphylli sp. nov., a gammaproteobacterium isolated from the roots of Spathiphyllum plants grown in a compost-amended potting mix. Int J Syst Evol Microbiol. 2006;56:1755–9.
Fields MW, Yan TF, Rhee SK, Carroll SL, Jardine PM, Watson DB, Criddle CS, Zhou JZ. Impacts on microbial communities and cultivable isolates from groundwater contaminated with high levels of nitric acid-uranium waste. FEMS Microbiol Ecol. 2005;53:417–28.
Finneran KT, Housewright ME, Lovley DR. Multiple influences of nitrate on uranium solubility during bioremediation of uranium-contaminated subsurface sediments. Environ Microbiol. 2002;4:510–6.
Gihring TM, Gengxin Z, Brooks SC, Campbell JH, Watson DB, Brandt CC, Yang Z, Criddle CS, Lowe K, Overholt WA, Wu W-M, Mehlhorn T, Kostka JE, Green SJ, Schadt CW. A limited microbial consortium is responsible for longer-term bioreduction of uranium in a contaminated aquifer. Appl Environ Microbiol. 2011;77:5955–65.
Green SJ, Prakash O, Gihring TM, Akob DM, Jasrotia P, Jardine PM, Watson DB, Brown SD, Palumbo AV, Kostka JE. Denitrifying bacteria from the terrestrial subsurface exposed to mixed waste contamination. Appl Environ Microbiol. 2010;76:3244–54.
Green SJ, Prakash O, Overholt WA, Cardenas E, Hubbard D, Akob DM, Tiedje JM, Watson DB, Jardine PM, Brooks SC, Kostka JE. Denitrifying bacteria from the genus Rhodanobacter dominate bacterial communities in the highly contaminated subsurface of a nuclear legacy waste site. Appl Environ Microbiol. 2012;78:1039–47.
Hallin S, Lindgren PE. PCR detection of genes encoding nitrite reductase in denitrifying bacteria. Appl Environ Microbiol. 1999;65:1652–7.
Hemme CL, Deng Y, Gentry TJ, Fields MW, Wu L, Barua S, Barry K, Tringe SG, Watson DB, He Z, Hazen TC, Tiedje JM, Rubin EM, Zhou J. Metagenomic insights into evolution of a heavy metal-contaminated groundwater microbial community. ISME J. 2010;4:660–72.
Im WT, Lee ST, Yokota A. Rhodanobacter fulvus sp. nov., a beta-galactosidase-producing gammaproteobacterium. J Gen Appl Microbiol. 2004;50:143–7.
Jones CM, Graf DR, Bru D, Philippot L, Hallin S. The unaccounted yet abundant nitrous oxide-reducing microbial community: a potential nitrous oxide sink. ISME J. 2013;7:417–26.
Kaeberlein T, Lewis K, Epstein SS. Isolating “uncultivable” microorganisms in pure culture in a simulated natural environment. Science. 2002;296:1127–9.
Kostka JE, Green SJ. Microorganisms and processes linked to uranium reduction and immobilization. In: Stolz JF, Oremland RS, editors. Microbial metal and metalloid metabolism: advances and applications. Washington, DC: ASM Press; 2011.
Kostka JE, Green SJ, Rishishwar L, Prakash O, Katz LS, Marino-Ramirez L, Jordan IK, Munk C, Ivanova N, Mikhailova N, Watson DB, Brown SD, Palumbo AV, Brooks SC. Genome sequences for six Rhodanobacter strains, isolated from soils and the terrestrial subsurface, with variable denitrification capabilities. J Bacteriol. 2012;194:4461–2.
Kowalsky MB, Gasperikova E, Finsterle S, Watson D, Baker G, Hubbard SS. Coupled modeling of hydrogeochemical and electrical resistivity data for exploring the impact of recharge on subsurface contamination. Water Resour Res. 2011;47.
Lee CS, Kim KK, Aslam Z, Lee ST. Rhodanobacter thiooxydans sp. nov., isolated from a biofilm on sulfur particles used in an autotrophic denitrification process. Int J Syst Evol Microbiol. 2007;57:1775–9.
Luo J, Cirpka OA, Wu WM, Fienen MN, Jardine PM, Mehlhorn TL, Watson DB, Criddle CS, Kitanidis PK. Mass-transfer limitations for nitrate removal in a uranium-contaminated aquifer. Environ Sci Technol. 2005;39:8453–9.
NABIR. Bioremediation of metals and radionuclides... What it is and how it works. Berkeley: Lawrence Berkeley National Laboratory; 2003.
Palumbo AV, Schryver JC, Fields MW, Bagwell CE, Zhou JZ, Yan T, Liu X, Brandt CC. Coupling of functional gene diversity and geochemical data from environmental samples. Appl Environ Microbiol. 2004;70:6525–34.
Poretsky RS, Hewson I, Sun S, Allen AE, Zehr JP, Moran MA. Comparative day/night metatranscriptomic analysis of microbial communities in the North Pacific subtropical gyre. Environ Microbiol. 2009;11:1358–75.
Prakash O, Green SJ, Jasrotia P, Overholt WA, Canion A, Watson DB, Brooks SC, Kostka JE. Rhodanobacter denitrificans sp. nov., isolated from nitrate-rich zones of a contaminated aquifer. Int J Syst Evol Microbiol. 2012;62:2457–62.
Sanford RA, Wagner DD, Wu QZ, Chee-Sanford JC, Thomas SH, Cruz-Garcia C, Rodriguez G, Massol-Deya A, Krishnani KK, Ritalahti KM, Nissen S, Konstantinidis KT, Loffler FE. Unexpected nondenitrifier nitrous oxide reductase gene diversity and abundance in soils. Proc Natl Acad Sci U S A. 2012;109:19709–14.
Shelobolina ES, O’Neill K, Finneran KT, Hayes LA, Lovley D. Potential for in situ bioremediation of a low-pH, high-nitrate uranium-contaminated groundwater. Soil Sediment Contam. 2003;12:865–84.
Sousa T, Chung AP, Pereira A, Piedade AP, Morais PV. Aerobic uranium immobilization by Rhodanobacter A2–61 through formation of intracellular uranium–phosphate complexes. Metallomics. 2013;5(4):390–7.
Spain AM, Krumholz L. Cooperation of three denitrifying bacteria in nitrate removal of acidic nitrate- and uranium-contaminated groundwater. Geomicrobiol J. 2012;29:830–42.
Spalding BP, Watson DB. Passive sampling and analyses of common dissolved fixed gases in groundwater. Environ Sci Technol. 2008;42:3766–72.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM. Comparative metagenomics of microbial communities. Science. 2005;308:554–7.
van den Heuvel RN, van der Biezen E, Jetten MS, Hefting MM, Kartal B. Denitrification at pH 4 by a soil-derived Rhodanobacter-dominated community. Environ Microbiol. 2010;12:3264–71.
Verbaendert I, Boon N, De Vos P, Heylen K. Denitrification is a common feature among members of the genus Bacillus. Syst Appl Microbiol. 2011;34:385–91.
Watson DB, Kostka JE, Fields MW, Jardine PM. The Oak Ridge field research center conceptual model. NABIR Field Research Center Report, Oak Ridge; 2004.
Watson DB, Doll WE, Gamey TJ, Sheehan JR, Jardine PM. Plume and lithologic profiling with surface resistivity and seismic tomography. Ground Water. 2005;43:169–77.
Woo SG, Srinivasan S, Kim MK, Lee M. Rhodanobacter caeni sp. nov., isolated from sludge from a sewage disposal plant. Int J Syst Evol Microbiol. 2012;62:2815–21.
Wu WM, Carley J, Fienen M, Mehlhorn T, Lowe K, Nyman J, Luo J, Gentile ME, Rajan R, Wagner D, Hickey RF, Gu BH, Watson D, Cirpka OA, Kitanidis PK, Jardine PM, Criddle CS. Pilot-scale in situ bioremediation of uranium in a highly contaminated aquifer. 1. Conditioning of a treatment zone. Environ Sci Technol. 2006;40:3978–85.
Wu WM, Carley J, Luo J, Ginder-Vogel MA, Cardenas E, Leigh MB, Hwang CC, Kelly SD, Ruan CM, Wu LY, Van Nostrand J, Gentry T, Lowe K, Mehlhorn T, Carroll S, Luo WS, Fields MW, Gu BH, Watson D, Kemner KM, Marsh T, Tiedje J, Zhou JZ, Fendorf S, Kitanidis PK, Jardine PM, Criddle CS. In situ bioreduction of uranium (VI) to submicromolar levels and reoxidation by dissolved oxygen. Environ Sci Technol. 2007;41:5716–23.
Wu W-M, Carley J, Green SJ, Luo J, Kelly SD, Nostrand J, Lowe K, Mehlhorn T, Carroll S, Boonchayanant B, Loffler FE, Watson DB, Kemner KM, Zhou J, Kitanidis PK, Kostka JE, Jardine PM, Criddle CS. Effects of nitrate on the stability of uranium in a bioreduced region of the subsurface. Environ Sci Technol. 2010;44:5104–11.
Yan TF, Fields MW, Wu LY, Zu YG, Tiedje JM, Zhou JZ. Molecular diversity and characterization of nitrite reductase gene fragments (nirK and nirS) from nitrate- and uranium-contaminated groundwater. Environ Microbiol. 2003;5:13–24.
Zengler K, Toledo G, Rappe M, Elkins J, Mathur EJ, Short JM, Keller M. Cultivating the uncultured. Proc Natl Acad Sci U S A. 2002;99:15681–6.
Zumft WG, Kroneck PM. Respiratory transformation of nitrous oxide (N2O) to dinitrogen by Bacteria and Archaea. Adv Microb Physiol. 2007;52:107–227.

Integrated Database Resource for Marine Ecological Genomics

Renzo Kottmann
Max Planck Institute for Marine Microbiology, Bremen, Germany

Synonyms
Database; Environmental data; Environmental genomics; GIS; Integration; Marine; Metagenomics

Definition
Megx.net, the integrated database resource for marine ecological genomics, is the first database to integrate bacterial and archaeal genes, genomes, and metagenomes from the marine environment with curated contextual metadata, as well as environmental data from heterogeneous resources.

Introduction
In recent years, microbial ecology and environmental microbiology have undergone a paradigm shift, moving from a single-experiment science to a high-throughput endeavor. Although the genomic revolution is rooted in medicine and biotechnology, it is currently the environmental sector, specifically the marine sector, which delivers the greatest quantity of data (Gilbert and Dupont 2011). Marine ecosystems, covering >70 % of the Earth’s surface, host the majority of biomass and significantly contribute to global organic matter and energy cycling. Microorganisms are known to be the “gatekeepers” of these processes, and insights into their lifestyle and fitness can enhance our ability to monitor, model, and predict future changes.
Recent developments in sequencing technology have made routine sequencing of whole
microbial communities from natural environments possible. Prominent examples in the marine field are the Global Ocean Sampling (GOS) campaign (Rusch et al. 2007), ICOMM, TaraOceans, Malaspina, and the Ocean Sampling Day 2014 of the Micro B3 project.
These large-scale sequencing projects bring new challenges to data management and software tools for assembly, gene prediction, and annotation, which are fundamental steps in genomic analysis. Several dedicated database resources have emerged to tackle the current need for large-scale metagenomic data management and analysis, among which are CAMERA (Sun et al. 2010), IMG/M (Markowitz et al. 2008), and MG-RAST (Meyer et al. 2008). Nevertheless, it is increasingly apparent that the full potential of comparative genome and metagenome analysis can be achieved only if the geographic and environmental context of the sequence data is considered. The metadata describing a sample’s geographic location and environment, and the details of its processing from the time of sampling to sequencing and subsequent analyses, are important for modeling species’ responses to environmental change or the spread and niche adaptation of bacteria and viruses. Megx.net’s unique integration of contextual and sequence data allows microbial ecologists and marine scientists to better compare biological data to understand the complex interplay between organisms, genes, and their environment.

Database Structure and Content
The Microbial Ecological Genomics Database (MegDB), the backbone of megx.net, is a centralized database based on the PostgreSQL database management system. The georeferenced data concerning geographic coordinates and time are managed with the PostGIS extension to PostgreSQL.
Sequences in MegDB are retrieved from the International Nucleotide Sequence Database Collaboration (INSDC). Currently, MegDB contains 1,832 prokaryote genomes (940 incomplete or draft) and 80 marine shotgun metagenomes from the GOS microbial dataset. Finally, megx.net also incorporates all sequenced marine phage genomes in MegDB, which is the first step towards integrating viral genomic and biogeochemical data (Duhaime et al. 2011).
In an effort towards integrating microbial diversity with specific sampling sites, megx.net includes georeferenced small and large subunit rRNA gene sequences from the SILVA rRNA gene databases project (Quast et al. 2013). As of SILVA release 102, only 9 % (16S/18S) and 2 % (23S/28S) of over one million sequences in SILVA SSUParc (16S/18S) and LSUParc (23S/28S) databases are georeferenced.
All genomic sequences in megx.net are supplemented with contextual data from GOLD (Pagani et al. 2012), NCBI Genome Projects, and Moore Foundation’s Marine Microbial Genome Sequencing Project.
The main environmental data is retrieved from three sources:
1. World Ocean Atlas: a set of objectively analyzed (one decimal degree spatial resolution) climatological fields of in situ measurements
2. World Ocean Database: a collection of scientific, quality-controlled ocean profiles
3. SeaWiFS chlorophyll a data
These data are described at 33 standard depths for annual, seasonal, and monthly intervals. Together, the location and time data (x, y, z, and t) serve as a universal anchor and link environmental data to the sequence and contextual data.

Standards Compliance and Interoperability
Standards are an important means of enhancing data exchange and interoperability between different database resources. MegDB is designed to store all contextual data recommended by the Genomic Standards Consortium and is thus compliant with the Minimum Information about any (x) Sequence (MIxS) standard (Yilmaz et al. 2012). However, most sequence data is missing contextual metadata. Therefore, numerous bacterial and archaeal genomes were manually curated to assign geographic coordinates to
reveal their environmental origin. Even with careful curation, a geographic origin could not be assigned to the majority of genomes. In order to give at least an indication of the environmental origin of sequence data, they were manually curated with terms of the Habitat-Lite subset of the Environment Ontology (Hirschman et al. 2008).

Functionalities

Genes Mapserver
The Genes Mapserver gives a sample-centric view of the georeferenced MegDB content. The map is interactive, offering user-friendly navigation and an overlay of the MegDB environmental data layers to display sampling sites on a world map in their environmental context. Sample site details and interpolated data can be retrieved by clicking the sampling points on the map.
The GIS Tools of the Genes Mapserver allow extraction of interpolated values for several physicochemical and biological parameters, such as temperature, dissolved oxygen, nitrate, and chlorophyll concentrations, over specified monthly, seasonal, or annual intervals.

Geographic-BLAST
The Geographic-BLAST tool queries the MegDB genome, metagenome, marine phage, and rRNA gene sequence data using the BLAST algorithm (Altschul et al. 1990). The Geographic-BLAST tool permits the alignment of query sequences against five databases instead of the standard BLAST query database:
• Prokaryotic genomes
• Global Ocean Sampling Metagenomes, which are publicly available metagenomes from the Global Ocean Sampling expedition
• 16S/18S rRNA
• 23S/28S rRNA
• Marine phage genomes
The results are reported according to the sample locations (if available) of the database hits and plotted on the Genes Mapserver world map, where they are labeled by the number of hits per site. Standard BLAST results are shown in a table, which also provides direct access to the associated contextual data of the hits (Fig. 1).

GIS Tools
The GIS tools allow post-factum retrieval of interpolated environmental parameters, such as temperature, nitrate, or phosphate, for any location in the ocean waters based on profile and remote sensing data.
Two GIS tools are currently available:
• World Ocean Atlas Extractor, comprising analyzed climatological fields of physicochemical parameters and biological layers obtained at monthly, seasonal, and annual samplings
• World Ocean Database Extractor, comprising time series measurements of physicochemical parameters and biological layers
Both GIS tools make use of Inverse Distance Weighted (IDW) interpolation to estimate the environmental data at a given geographic location, time, and depth in the ocean.

MetaBar
MetaBar aims to help investigators efficiently capture, store, and submit contextual metadata gathered in the field. It is a spreadsheet-based sample data collection tool designed to support the complete workflow from the sampling event up to the metadata-enriched sequence submission to an INSDC database (Hankeln et al. 2010).

CDinFusion
Megx.net hosts a public installation of CDinFusion, a Web-based tool to combine MIxS-compliant contextual and sequence data in (Multi)FASTA formatted files prior to submission (Hankeln et al. 2011). It creates submission-ready files for the NCBI submission system. However, CDinFusion is not (yet) appropriate for preparing data for the Sequence Read Archive (SRA) submission system.

Web Services
Megx.net offers programmatic access via Web services for experienced users and software developers. All geographical maps can be
retrieved via simple Web requests, as specified by the Web Map Service (WMS) standard. The base URL for WMS requests is http://www.megx.net/wms/gms, where one can also find a tutorial on how to use this service. Megx.net also provides access to MIxS reports in Genomic Contextual Data Markup Language (GCDML) XML files for all marine phage genomes through similar HTTP queries, e.g., http://www.megx.net/gcdml/Prochlorococcus_phage_P-SSP7.xml (Kottmann et al. 2008).

Integrated Database Resource for Marine Ecological Genomics, Fig. 1 Geographic distribution of BLAST results of a proteorhodopsin from Dokdonia sp. PRO95. Blue crosses and label indicating the number of significant BLAST hits in the GOS metagenome samples. The map is generated using the web service of the Genes Mapserver

Current and Future Developments
Currently, megx.net is further developed within the FP7 EU project Micro B3 as an open source project to become an integral part of the Micro B3 Information System. This information system builds on a handful of long-established data resources that span marine science. These data resources include SeaDataNet and its network of National Oceanographic Data Centers (oceanographic data), EurOBIS (macrobiological data), and EBI’s European Nucleotide Archive (EBI-ENA; molecular sequence data). While these resources exist to broaden and simplify access to data in their domains, integration of their data across domains requires megx.net to develop a set of new tools and Web services to facilitate seamless interoperability between the different data domains.

Summary
Megx.net’s unique integration of environmental and sequence data allows microbial ecologists and marine scientists to better contextualize and compare biological data, using, e.g., the Genes Mapserver and GIS tools. The integrated datasets facilitate a holistic approach to understanding the complex interplay between organisms, genes, and their environment. As such, megx.net is continuously improved to serve as a fundamental resource in the emerging field of ecosystems biology.

Cross-References
▶ A 123 of Metagenomics
▶ Computational Approaches for Metagenomic Datasets
▶ SILVA Databases
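
The entry states that both megx.net GIS tools use Inverse Distance Weighted (IDW) interpolation but does not give the formula or parameters. The following Python sketch illustrates the general IDW idea only and is not the megx.net implementation: the function names, the distance exponent of 2, and the use of great-circle distance are all assumptions made for illustration, and the sketch interpolates in the horizontal plane only, whereas the tools described above also account for time and depth.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def idw_estimate(target, observations, power=2.0):
    """Inverse Distance Weighted estimate at target = (lat, lon).

    observations: iterable of (lat, lon, value) tuples.
    power: distance exponent; 2.0 is a common default (an assumption here).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lat, lon, value in observations:
        distance = haversine_km(target[0], target[1], lat, lon)
        if distance == 0.0:
            return value  # target coincides with an observation
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no observations supplied")
    return weighted_sum / weight_total


# Hypothetical temperature observations (lat, lon, degrees C), for illustration.
stations = [(0.0, 1.0, 10.0), (0.0, -1.0, 20.0)]
estimate = idw_estimate((0.0, 0.0), stations)
```

Because each weight falls off with distance, nearby observations dominate the estimate, and a target that coincides with an observation simply returns that observation's value. A larger exponent makes the estimate more local.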
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
Duhaime MB, Kottmann R, Field D, Glöckner FO. Enriching public descriptions of marine phages using the MIGS standard: a case study assessing the contextual data frontier. Stand Genomic Sci. 2011;4(2):1.
Gilbert JA, Dupont CL. Microbial metagenomics: beyond the genome. Ann Rev Mar Sci. 2011;3(1):347–71. Annual Reviews.
Hankeln W, Buttigieg PL, Fink D, Kottmann R, Yilmaz P, Glöckner FO. MetaBar – a tool for consistent contextual data acquisition and standards compliant submission. BMC Bioinforma. 2010;11:358.
Hankeln W, Wendel NJ, Gerken J, Waldmann J, Buttigieg PL, Kostadinov I, et al. CDinFusion – submission-ready, on-line integration of sequence and contextual data. PLoS ONE. 2011;6(9):e24797. Highlander SK, editor.
Hirschman L, Clark C, Cohen KB, Mardis S, Luciano J, Kottmann R, et al. Habitat-lite: a GSC case study based on free text terms for environmental metadata. OMICS J Integr Biol. 2008;12(2):129–36.
Kottmann R, Gray T, Murphy S, Kagan L, Kravitz S, Lombardot T, et al. A standard MIGS/MIMS compliant XML schema: toward the development of the Genomic Contextual Data Markup Language (GCDML). OMICS J Integr Biol. 2008;12(2):115–21.
Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, Chen IM, Grechkin Y, Dubchak I, Anderson I, Lykidis A, Mavromatis K, Hugenholtz P, Kyrpides NC. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008. PMID:17932063
Meyer F, Paarmann D, D’Souza M, Olson R, Glass E, Kubal M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9(1):386.
Pagani I, Liolios K, Jansson J, Chen I-MA, Smirnova T, Nosrat B, et al. The Genomes OnLine Database (GOLD) v. 4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2012;40(Database issue):D571–9.
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41(Database issue):D590–6.
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, et al. The Sorcerer II global ocean sampling expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007;5(3):e77.
Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2010;39(Database):D546–51.
Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol. 2012;29(5):415–20. doi:10.1038/nbt.1823.

Integrons as Repositories of Genetic Novelty

Bridget Mabbutt (1), Chandrika Deshpande (1), Visaahini Sureshan (1) and Stephen J. Harrop (2)
(1) Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW, Australia
(2) School of Physics, University of New South Wales, Sydney, NSW, Australia

Synonym
Novel proteins engaged for LGT within the gene cassette/integron system

Definition
An important vehicle for lateral (or horizontal) gene transfer in bacteria is the integron: it enables the capture and expression of genes as small mobile elements, or gene cassettes. These mobile gene cassettes encompass a vast pool of genetic novelty, ostensibly for purposes of adaptation. In most cases, their functional annotation is obscured by their characteristically high sequence novelty. Our isolation and solving of protein structures encoded by the cassette metagenome reveals a relatively high proportion of completely novel folds. These newly defined crystal structures are found to encompass diverse topologies and fold families and delineate new protein domains.
Introduction
Bacteria dominate the planet; they are omnipresent, inhabiting a wide range of environments, including those appearing too extreme or inhospitable for life (Rothschild and Mancinelli 2001). Lateral gene transfer (LGT) is known to contribute to the enormous genetic diversity of this microbial world. Rendering the bacterial genome in a constant state of flux, LGT can be said to produce a gene pool that is collectively owned, leading to the concept of a mobile prokaryotic metagenome (Koonin and Wolf 2008).
One important mediator of LGT involves the integron system (Boucher et al. 2007; Cambray et al. 2010; Hall 2012), which allows bacteria to capture and express genes occurring in the environment as small mobile elements, named gene cassettes. Although originally identified as the vehicle for the spread of antibiotic resistance, it is now clear that the integron/gene cassette system is not limited to the clinical context, but plays a wider role in shaping niche advantage (Labbate et al. 2012). While most integrons contain a small number of gene cassettes (generally up to ~10), in some instances multiple insertion events assemble large cassette arrays, particularly notable within chromosomes of Vibrio species (Boucher et al. 2006; Joss et al. 2009).
It is immediately obvious that the cassette metagenome comprises a repertoire of distinctly novel genes, with sequence homologs (if any) sparsely represented or not annotated in current databases. This is true for both isolated gene cassettes and gene cassette arrays derived from cultivated bacterial strains (Rowe-Magnus et al. 2003; Boucher et al. 2006), as well as for wider metagenomic surveys (Elsaied et al. 2007; Koenig et al. 2008).
With the cassette metagenome extending beyond the coverage of conventional sequencing, protein structure provides a first functional inference for many gene cassettes through determination of three-dimensional fold homology relationships (Sureshan et al. 2013). This approach has resulted in the structural definition of many new proteins, although a large subset includes entirely novel folds. It is now established that the gene cassette metagenome encodes fully folded and functional proteins and includes new enzymes and protein-binding factors (Robinson et al. 2005; Robinson et al. 2008). This newly expanding group of protein folds and structures reveals the extraordinary genetic novelty encoded by the cassette metagenome.
This entry focuses on cassette-encoded proteins directly recovered by the technique of cassette PCR (outlined in Fig. 1) (Stokes et al. 2001; Boucher et al. 2007). The method has been exploited for uncultured bacteria present within environmental samples, as well as for strain isolates of Vibrio cholerae and the related V. metecus (formerly V. paracholerae).

Integrons as Repositories of Genetic Novelty, Fig. 1 Recovery of gene cassettes from integron arrays. The structure of an integron, showing core features including the intI gene (beige) with its Pint promoter, the attI attachment site and the Pc promoter. Three integrated gene cassettes (blue, red, and yellow) are shown. By using primers (green arrows) targeting the 59-be elements, cassette PCR has the capacity to recover gene cassettes and arrays independently of any specific encoded sequence. This allows recovery of entirely novel gene cassettes (Adapted from Boucher et al. 2007 and Stokes et al. 2001)

Novel (Currently Unique) Gene Cassette Structures
Examination of protein structures encoded by the cassette metagenome reveals that a relatively high proportion display a completely novel fold (Sureshan et al. 2013). These newly defined three-dimensional structures encompass diverse topologies and fold families and impact beyond specific gene cassettes to delineate new protein domains and their sequence homologs. Although it is not yet possible to identify specific substrates or biochemical properties for these first members of new families, their molecular features and organizations (see Fig. 2) contribute currency to the ongoing discussion assessing the degree to which function and/or protein network capacity favors mobilization of genes (Cohen et al. 2011; Labbate et al. 2012).

All-α Fold Members
The crystal structure determined for a gene cassette isolated from a sewage outfall (Hfx_cass2, PDB 3FXH) depicts a dimeric protein incorporating a compact fold of six helical segments. The homodimer is stabilized by a hydrophobic interface engaging two helices from each chain (Fig. 2a). Exposed on the external face of each subunit is a triangular-shaped hydrophobic crevice flanked by two acidic residues and a flexible loop. Pronounced acidic surface features extend perpendicular to each cavity due to Glu and Asp side chains of an outer helix. This unique binding groove, presented twice on opposing faces of the dimer and possibly gated by residues of the flexible loop, appears highly appropriate for hydrophobic and/or basic substrates or protein partners.

Integrons as Repositories of Genetic Novelty, Fig. 2 Ribbon depiction of novel cassette-encoded protein structures: (a) Hfx_cass2, (b) Vpc_cass2, (c) Hfx_cass5, (d) Vch_cass3, (e) Vch_cass14, (f) Hfx_cass1. Each subunit within the oligomeric organization is indicated in a different color. Putative binding sites, for interaction with either small molecule ligands or, potentially, other protein partners, are highlighted in cyan
A distinct all-helical protein has also been identified in a gene cassette recovered from a V. metecus strain, Vpc_cass2 (PDB 3JRT). The fold incorporates a four-helix bundle with helical extensions wrapping about at midpoint (Fig. 2b); orthogonal packing of two chain pairs creates a globular-shaped dimer. Sequence homologs (Shewanella baltica and Moritella genomes, at ~50 % identity) highlight preservation of exposed residues (Lys63, Glu66, His109′, Val110′) clustered across the dimeric interface, indicating a possible substrate-binding site. This fold is weakly related to the substrate-binding domain of the kanamycin nucleotidyltransferase (KNTase-C) clan of proteins, yet the shape of the dimeric interface in Vpc_cass2 is distinct from that found in its closest KNTase-C relatives (e.g., HI0074 from Haemophilus influenzae). Lehmann and workers have documented substrate-binding/nucleotide-binding module pairs prevalent in bacterial genomes, particularly from harsh conditions and pathogens (Lehmann et al. 2003). Thus, the mobile gene cassette Vpc_cass2 may comprise one half of a bipartite system with the capacity to organize with a nucleotidyltransferase domain into a functional enzyme.

α + β Fold Members
A gene cassette also isolated from a sewage outfall (Halifax, Canada), Hfx_cass5, occurs as two domain-swapped α + β dimers organized into a tetramer (PDB 3IF4; Fig. 2c). Across the center of the tetramer, 3₁₀ helices of two opposing subunits stack via polar and charged groups. The flattened nature of the tetramer and the asymmetrical interactions of its component dimers result in two large faces with markedly different surface features. A small group of sequence homologs (55–71 % identity) includes gene cassettes from contaminated environments: a geographically distinct sewage outfall in Canada and an Australian industrial site (Stokes et al. 2001). Residues mediating the tetrameric organization are preserved across all members of this emerging sequence family, indicating this to be the functional form. Also conserved is the inter-module linker segment, which presents basic groups that line the pronounced surface clefts on both faces of the tetramer.
Derived from a strain of V. cholerae, the structure of Vch_cass3 (PDB 3FY6) reveals an unusual two-layered α + β organization. Within the dimer, central helices stack end-to-end, so separating and exposing two distinct sheet components (Fig. 2d). A long pronounced surface cleft is enclosed between the outer edge strands of these two sheets, flanked by acidic side chains. To date, two sequence homologs (~40 % identity) have been detected: within Desulfatibacillum alkenivorans from polluted water and a metagenomic sample of Antarctic bloom-forming cyanobacterium. These sequence relatives do not, however, retain the distinctive Asp/Glu residues surrounding the proposed binding cleft within the Vch_cass3 structure.
Another V. cholerae-derived gene cassette, Vch_cass14, also incorporates an α + β dimer, in this case within a two-layer sandwich fold (PDB 3IMO, Fig. 2e). Sequence relatives of this gene cassette have been found in the genomes of several soil- and water-dwelling bacteria. A particularly long and deep ligand cavity is internalized within this protein, appropriate for a linear hydrophobic substrate (e.g., fatty acid or alcohol). The features of this binding cavity are retained across all sequence relatives; 20 of the Vch_cass14 internal residues are conserved in its closest homologs. A high degree of conservation is also seen among residues responsible for mediating dimerization of the module, pointing to a dimeric functional protein. A notable feature of the dimer, possibly of functional importance, is the projection of positively charged surface clusters from the two exposed β-sheets.

α/β Fold Members
An unusual trimeric protein is encoded by Hfx_cass1, a gene cassette extracted from a salt marsh environment (Koenig et al. 2008). Although there are no sequence homologs in current databases, the unique three-layered α/β fold bears some topological relationship to the zinc transporter CzrB of Thermus thermophilus. This new cassette-encoded protein presents three clefts at each inter-subunit interface across the flattened trimer surface (Fig. 2f). The clefts are polar in nature, occupied in the crystal structure by water, and surrounded by pronounced acidic loops. Although the chemical organization of the binding site is unique to Hfx_cass1, some components are common to active site chemistry of enzymes known to engage with adenosine- and/or nicotinamide-based cofactors.

New Variants of Known Folds Encoded by Gene Cassettes

Cationic Drug-Binding Module
The structure (PDB 3GK6) of gene cassette Cass2 derived from environmental V. cholerae has identified an independent binding module related to domains of the AraC/XylS transcription activator system (Deshpande et al. 2011). Sequence analysis identifies the cassette-encoded protein to be representative of a group of independent binding modules undergoing lateral gene transfer within Vibrio and related species. Closest structural relatives of the Cass2 β-barrel (Fig. 3a) occur as domains of multidrug-binding proteins (including BmrR), incorporating a hydrophobic binding pocket with a signature glutamate side chain. Cass2 has been demonstrated to bind a range of cationic drug compounds. The structure of this module depicts a surface proximal to the drug-binding cavity with features homologous to those engaged for protein interaction within multidomain transcriptional regulators. Thus, it can be proposed that the Cass2 family has the capacity to form functional transcription regulator complexes and possibly represents evolutionary precursors to multidomain regulators of cationic compounds.

α + β Barrel Transporter
A gene cassette derived from industrially polluted soil has yielded a new member of the highly adaptable α + β barrel family of transport proteins and enzymes (Fig. 3b). The dimeric structure of Bal32a (PDB 1TUH (Robinson et al. 2005)) features cone-shaped binding pockets within each barrel, common to this superfamily for engaging small hydrophobic substrates or peptides. The Bal32a structure is, however, unique in that each of its central cavities is unusually deep and isolated from solvent by a flexible loop. A potential catalytic site of clustered polar groups within the barrel is equivalently positioned to corresponding active sites within structurally related enzymes. Although these enzymes likely share a common evolutionary ancestry, with preservation of active site features internal to the barrel, their very low overall sequence relationship to Bal32a (<20 % identity) suggests a wide adaptation of the α + β barrel fold for varied demands. Within its originating cassette array, the Bal32a gene cassette was immediately adjacent to a second cassette, Bal32b, encoding a likely membrane-associated protein. This suggests the two components may well function in concert as a combined binding and transport system.

Integrons as Repositories of Genetic Novelty, Fig. 3 Ribbon depiction of cassette-encoded new variants of known folds: (a) Cass2, (b) Bal32a, (c) iMazG. Each subunit within the oligomeric organization is indicated in a different color. Putative binding sites, for interaction with either small molecule ligands or, potentially, other protein partners, are highlighted in cyan
MazG Enzyme Subfamily
As part of an ongoing investigation of an intact integron array of 116 gene cassettes located in Vibrio rotiferianus DAT722 (Boucher and Stokes 2006; Chowdhury et al. 2011), a new type of MazG nucleoside triphosphate pyrophosphohydrolase (NTP-PPase) has been described (Robinson et al. 2007). This cassette-encoded protein, iMazG, has close sequence relatives (some within gene cassettes) only within Vibrio sp. and other aquatic γ-proteobacteria. The structure of the iMazG tetramer (PDB 2Q5Z) (Fig. 3c) shows the typical α-helical hairpin fold of the general enzyme family in “closed” and “open” states, as well as its essential Mg²⁺-coordination site. However, this new class of MazG enzymes contains significant variation, with unique loop and β-turn features connecting the four helices of the scaffold, creating a distinct substrate site adjacent to the divalent metal. Functional assays demonstrated that this single-domain type of MazG cleaves phosphates of dNTP substrates, with a preference for dCTP and dATP. Thus, iMazG has the capacity to act as a house-cleaning enzyme capable of removing noncanonical dNTPs.

Gene Cassettes Encode Novel Protein Folds with Distinct Binding Features
Regardless of the degree of novelty displayed, all gene cassette-derived structures appear to be consistent with adaptive functions (e.g., secondary metabolism, DNA modification) and possibly selective advantage (e.g., drug resistance).
A tendency to form homo-oligomers has been a consistent observation across this structural survey of cassette proteins, with only one exception to date (the cationic drug-binding protein Cass2 from Vibrio (Deshpande et al. 2011)). This clear preference for oligomerization may be a consequence of the relatively short sequence lengths of gene cassettes within arrays, stabilizing small protein modules which can perhaps also be readily and flexibly mixed for different functions. Such modules may readily combine with appropriate catalytic, binding, or membrane domains as adaptive pressure selects more specific biochemical or regulatory networks (Bornberg-Bauer and Alba 2013). Certainly, the surface features described for each of the cassette protein structures have potential to act as heterogeneous protein interfaces within multidomain or multi-protein systems.

Summary
Our structural studies continue to reinforce the notion that the highly novel gene cassette metagenome is not merely a repository of sequence divergent variants of known proteins, but in fact mobilizes a repertoire of genes belonging to poorly characterized protein families. Thus, to fully scope and understand the global proteome, it remains essential to continue to independently target structural investigation of the metagenomic element.

Cross-References
▶ Lateral Gene Transfer and Microbial Diversity
▶ Metagenomic Potential for Understanding Horizontal Gene Transfer

References
Bornberg-Bauer E, Alba MM. Dynamics and adaptive benefits of modular protein evolution. Curr Opin Struct Biol. 2013;23(3):459–66.
Boucher Y, Stokes HW. The roles of lateral gene transfer and vertical descent in vibrio evolution. In: Fabiano Lopes Thompson BA, Swings JG, editors. The biology of vibrios. Washington, DC: ASM Press; 2006. p. 84–94.
Boucher Y, Nesbo CL, Joss MJ, Robinson A, Mabbutt BC, Gillings MR, et al. Recovery and evolutionary analysis of complete integron gene cassette arrays from Vibrio. BMC Evol Biol. 2006;6:3.
Boucher Y, Labbate M, Koenig JE, Stokes HW. Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends Microbiol. 2007;15(7):301–9.
Cambray G, Guerout A, Mazel D. Integrons. Annu Rev Genet. 2010;44:141–66.
Cohen O, Gophna U, Pupko T. The complexity hypothesis revisited: connectivity rather than function constitutes a barrier to horizontal gene transfer. Mol Biol Evol. 2011;28(4):1481–9.
Deshpande CN, Harrop SJ, Boucher Y, Hassan KA, Di Leo R, Xu X, et al. Crystal structure of an integron gene cassette-associated protein from Vibrio cholerae identifies a cationic drug-binding module. PLoS One. 2011;6(3):e16934.
Elsaied H, Stokes HW, Nakamura T, Kitamura K, Fuse H, Maruyama A. Novel and diverse integron integrase genes and integron-like gene cassettes are prevalent in deep-sea hydrothermal vents. Environ Microbiol. 2007;9(9):2298–312.
Hall RM. Integrons and gene cassettes: hotspots of diversity in bacterial genomes. Ann N Y Acad Sci. 2012;1267:71–8.
Joss MJ, Koenig JE, Labbate M, Polz MF, Gillings MR, Stokes HW, et al. ACID: annotation of cassette and integron data. BMC Bioinformatics. 2009;10:118.
Koenig JE, Boucher Y, Charlebois RL, Nesbo C, Zhaxybayeva O, Bapteste E, et al. Integron-associated gene cassettes in Halifax Harbour: assessment of a mobile gene pool in marine sediments. Environ Microbiol. 2008;10(4):1024–38.
Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world.
Labbate M, Boucher Y, Luu I, Chowdhury PR, Stokes HW. Integron associated mobile genes: Just a collection of plug in apps or essential components of cell network hardware? Mob Genet Elements. 2012;2(1):13–8.
Lehmann C, Lim K, Chalamasetty VR, Krajewski W, Melamud E, Galkin A, et al. The HI0073/HI0074 protein pair from Haemophilus influenzae is a member of a new nucleotidyltransferase family: structure, sequence analyses, and solution studies. Proteins. 2003;50(2):249–60.
Robinson A, Wu PS, Harrop SJ, Schaeffer PM, Dosztanyi Z, Gillings MR, et al. Integron-associated mobile gene cassettes code for folded proteins: the structure of Bal32a, a new member of the adaptable alpha + beta barrel family. J Mol Biol. 2005;346(5):1229–41.
Robinson A, Guilfoyle AP, Harrop SJ, Boucher Y, Stokes HW, Curmi PM, et al. A putative house-cleaning enzyme encoded within an integron array: 1.8 Å crystal structure defines a new MazG subtype. Mol Microbiol. 2007;66(3):610–21.
Robinson A, Guilfoyle AP, Sureshan V, Howell M, Harrop SJ, Boucher Y, et al. Structural genomics of the bacterial mobile metagenome: an overview. Methods Mol Biol. 2008;426:589–95.
Rothschild LJ, Mancinelli RL. Life in extreme environments. Nature. 2001;409(6823):1092–101.
Rowe-Magnus DA, Guerout AM, Biskri L, Bouige P, Mazel D. Comparative analysis of superintegrons: engineering extensive genetic diversity in the Vibrionaceae. Genome Res. 2003;13(3):428–42.
Roy Chowdhury P, Boucher Y, Hassan KA, Paulsen IT, Stokes HW, Labbate M. Genome sequence of Vibrio rotiferianus strain DAT722. J Bacteriol. 2011;193(13):3381–2.
Stokes HW, Holmes AJ, Nield BS, Holley MP, Nevalainen KM, Mabbutt BC, et al. Gene cassette PCR: sequence-independent recovery of entire genes from environmental DNA. Appl Environ Microbiol. 2001;67(11):5240–6.
Sureshan V, Deshpande CN, Boucher Y, Koenig JE, Stokes HW, Harrop SJ, et al. Integron gene cassettes: a repository of novel protein folds with distinct interaction sites. PLoS One. 2013;8(1):e52934.

IPRStats, Overview

Iddo Friedberg
Department of Microbiology, Miami University, Oxford, OH, USA

Abbreviations
EBI European Bioinformatics Institute
GO Gene Ontology
IMG/M Integrated Microbial Genome/Metagenomics
pHMM Profile hidden Markov model
PSSM Position-specific scoring matrix
SQL Structured Query Language
XML Extensible Markup Language

Definition
IPRStats is a lightweight platform-independent open-source licensed software package for storing and visualizing metagenomic data annotated by InterProScan. IPRStats is unique in that it provides the user with the same annotation choices offered by the popular open reading frame annotation pipeline, InterProScan. IPRStats can be installed either as a Web server or as a stand-alone software.
Introduction
The functional annotation of open reading frames (ORFs) in metagenomic data is a highly challenging problem. The problem is difficult enough with regular genomic data. When functionally annotating metagenomic data, one is confronted with the additional problems arising from sequence fragmentation, imperfect assemblies, unmitigated sequencing errors, partially identified ORFs, and higher rates of error in ORF calling. One way to overcome these problems is to do away with ORF calling altogether. Instead, assembled metagenomic sequences are translated in all six reading frames. Those that produce proteins above a certain minimal length threshold (say, 100 aa) are subjected to functional analysis. The rationale behind such a strategy is that there is a very low probability that a sequence which is (1) long enough and (2) found in a database of protein signatures is not a true ORF or a partial ORF. Each such sequence is then treated as a member in a population, with biological function attributes assigned to it. The storage, visualization, and analysis of metagenomic data can be handled using common database, statistical and visualization tools used in population analyses. Here such a package, IPRStats, which is based on the popular InterProScan tool, is described.

Annotation of Translated Sequences
Homology-based transfer algorithms require, first and foremost, a comprehensive, accurately annotated, and up-to-date reference sequence database, but no single database can boast all three traits at 100 % (Schnoes et al. 2009). This is true for pairwise sequence alignment algorithms and simple sequence-motif algorithms as well as for the more complex profile hidden Markov models (pHMM) (Eddy 1998) and position-specific score matrix (PSSM) similarity-based algorithms. Therefore, several function annotation programs are typically used to functionally annotate ORFs. The rationale is that by using more than one algorithm to functionally annotate a protein, the lack of sensitivity that may result from using only one program can be overcome. Additionally, a consensus method can help weed out false positives, by picking only those annotations on which there is a plurality agreement, or some other voting mechanism. InterProScan (Zdobnov and Apweiler 2001) is a function annotation program that compares query protein sequences against a repository of collected and annotated protein signatures. These InterPro (McDowall and Hunter 2011) member databases employ a variety of motifs, pHMMs, and position-specific score matrices (PSSMs) to describe protein families. Those include PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D, PANTHER, and HAMAP. These also include the associated software used to query these databases: pfscan, FingerPRINTScan, HMMER 3.0, HMMER 2.3, and BLAST. More information on current member databases and search software employed in InterPro, including updated references, can be found at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/README.html

Visualization and Management of Metagenomic Function Annotations from InterProScan Using IPRStats
InterProScan can be installed on computer clusters and therefore can handle large amounts of sequence data. However, when analyzing large amounts of sequence data, as in metagenomic data, there are two needs which InterProScan does not meet: first, a visualization of the results to make them comprehensible and, second, a simple data storage and retrieval mechanism for further analysis.
To meet both needs, each translated sequence is treated as a member in a population, which is assigned one or more functional attributes by the member programs of InterProScan. IPRStats (Kelly et al. 2010), or InterProScan STATistics, uses the output of InterProScan as its input and quickly produces charts and tables enabling a visualization of the functional potential of the sequences analyzed. It also stores the results in a simple SQL schema (Fig. 1b), which can be used by other applications for downstream data analysis and presentation.
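The population of translated sequences referred to above could be produced along the following lines. This is only a minimal sketch of six-frame translation with a length filter, not the code used by IPRStats or InterProScan. The function names, the use of the standard genetic code, and the choice to split each frame at stop codons are assumptions made for illustration; the 100-residue default mirrors the example threshold given in the Introduction.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the standard code is appropriate for the sample being analyzed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_segments(dna, min_len=100):
    """Yield (strand, offset, peptide) for stop-free stretches >= min_len."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            # Hypothetical design choice: break each frame at stop codons
            # and keep only the stretches that pass the length threshold.
            for segment in protein.split("*"):
                if len(segment) >= min_len:
                    yield strand, offset, segment
```

Each tuple yielded by `six_frame_segments` would correspond to one member of the sequence population that is subsequently submitted for signature searching.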
Figure 1 describes the information flow in IPRStats. The output of an InterProScan run is stored in XML format. The XML file is parsed into a seven-table SQLite or MySQL database. The tables follow the data structure outlined by the InterProScan XML schema. After reading the tables, IPRStats displays the information alphanumerically and graphically. The tabs in the sidebar of the main program screen toggle between the displays of results for each sequence signature program called by InterProScan. The results display includes a chart (Fig. 1d) and a table (Fig. 1e). The chart is either a pie chart or a bar chart, which shows the count of different sequence signatures from the relevant program in the analyzed sequence population. Chart drawing is implemented using either Google Chart Tools or matplotlib. Google Chart Tools is a web-based API that dynamically generates charts using a URL string, so when drawing using Google Chart Tools, an active Internet connection is required. Alternatively, matplotlib may be used: matplotlib is a Python plotting library with a MATLAB-like interface, which can be used for chart graphics as well, and does not require an Internet connection.
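The flow just described (XML in, SQL tables, per-program signature counts out) can be illustrated with a small sketch using only the Python standard library. The XML layout, the single `hits` table, and the function names below are hypothetical simplifications: they do not reproduce the actual InterProScan XML schema or the seven-table IPRStats schema, and the sample hits are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for InterProScan XML output;
# the element and attribute names are assumptions, and the hits are made up.
SAMPLE_XML = """
<results>
  <protein id="seq1">
    <match db="Pfam" signature="PF00005"/>
    <match db="PROSITE" signature="PS00211"/>
  </protein>
  <protein id="seq2">
    <match db="Pfam" signature="PF00005"/>
  </protein>
  <protein id="seq3">
    <match db="Pfam" signature="PF00072"/>
  </protein>
</results>
"""


def load_matches(xml_text, conn):
    """Parse the XML and store one row per (protein, database, signature)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits (protein TEXT, db TEXT, signature TEXT)")
    root = ET.fromstring(xml_text)
    rows = [(p.get("id"), m.get("db"), m.get("signature"))
            for p in root.iter("protein") for m in p.iter("match")]
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def signature_counts(conn, db):
    """Count how often each signature of one member database occurs."""
    cur = conn.execute(
        "SELECT signature, COUNT(*) FROM hits WHERE db = ? "
        "GROUP BY signature ORDER BY COUNT(*) DESC, signature", (db,))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_matches(SAMPLE_XML, conn)
pfam_counts = signature_counts(conn, "Pfam")
```

The list of (signature, count) pairs returned by `signature_counts` is the kind of data that a pie or bar chart of one member database would display; it could be handed to any charting library, matplotlib included.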
Availability
IPRStats is written in Python, with a graphic user

interface (GUI) based on wxWidgets, a cross-
platform toolkit for graphic user interfaces. Rely-
ing on platform-independent fully open-source
infrastructure ensures that we maximize portabil-
ity of IPRStats. Currently IPRStats has been
tested on Windows XP/7, Max OS 10.6, and
Ubuntu GNU/Linux 9.10 and 10.04. IPRStats is
IPRStats, Overview, Fig. 1 Overview of IPRStats. IPRStats, Overview, Fig. 1 (continued) (d) Graphic dis-
(a) Protein sequence information as a single FASTA file play. (e) Table display. (f) Toggle between results from
submitted to InterProScan (one or more proteins). different InterPro member databases (Reproduced from
(b) InterProScan XML output imported into IPRStats seven under BMC CC 2.0 license, copyright owned by
SQL database. (c) Display of sequence signature statistics. authors)
I 316 I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments
downloadable from GitHub at http://github.com/ Definition

idoerg/IPRStats. Packages for Windows, Mac
OSX, and Linux are available at http://github. Estimation of microbial diversity in an environ-
com/idoerg/IPRStats/downloads Community ment by efficiently identifying and classifying
participation and further development of this 16S rRNA gene fragments in metagenomic
tool are strongly encouraged. datasets using computational methods.
References Introduction
Eddy S. Profile hidden Markov models. Bioinformatics. Recent advances in high-throughput sequencing
1998;14(9):755–63.
technologies have enabled life-science
Kelly RJ, Vincent DE, Friedberg I. IPRStats: visualization
of the functional potential of an InterProScan run. researchers to rapidly sequence and characterize
BMC Bioinformatics. 2010;11:S13. the entire genomic content of microbial commu-
McDowall J, Hunter S. InterPro protein classification. nities residing in diverse ecological niches. A key
Methods Mol Biol. 2011;694:37–47. doi: 10.1007/
978-1-60761-977-2_3. PMID:21082426.
advantage of characterizing microbial communi-
Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Anno- ties in this fashion is that it enables the concom-
tation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009;5(12):e1000605. doi: 10.1371/journal.pcbi.1000605. Epub 2009 Dec 11. PMID:20011109.
Zdobnov EM, Apweiler R. InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001;17(9):847–8.

I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments

Algorithms for Efficient In Silico Identification and Classification of Ribosomal RNA Gene Fragments in Metagenomic Datasets

Sharmila Mande, Tarini Shankar Ghosh and Mohammed Monzoorul Haque
Biosciences R & D, TCS Innovation Labs, Tata Research Development & Design Centre, Tata Consultancy Services Limited, Pune, MH, India

Synonyms

Classification of 16S rRNA gene fragments; In silico identification of 16S rRNA gene fragments

itant characterization of several microbes (constituting the community), most of which cannot be studied using traditional culture-based genomic approaches. Moreover, this approach (referred to as “Metagenomics”) is useful in understanding the interaction patterns between the resident microbes as well as between the microbes and the environment.

Characterizing and comparing the taxonomic as well as functional diversity of microbial communities (obtained from varied ecological niches) are the broad objectives of metagenomic projects. These objectives are attained using two well-established approaches (Fig. 1). In the first approach (commonly referred to as the amplicon-based approach), a quick snapshot of taxonomic diversity of a given environmental sample is obtained by specifically amplifying, cloning, and sequencing genes or gene fragments corresponding to one or more phylogenetic marker genes. The 16S rRNA gene is the most widely used phylogenetic marker gene employed in such amplicon-based approaches. Subsequently, bioinformatic approaches are used for taxonomically classifying these sequenced genes or gene fragments. The relative proportions of various taxonomic groups present in the metagenomic dataset (representing a given environmental sample) are then obtained from the identified taxa. In the second approach
(commonly referred to as the shotgun-sequencing approach), the genomic content of a given environmental sample is extracted and sequenced. Genomic fragments (referred to as “reads”) obtained from the sequencing platforms are then computationally analyzed in terms of taxonomy and function. Since the shotgun-sequencing approach generates millions of reads originating from random positions/locations within the genomes of various microbes constituting a given environmental sample, a subset of these reads (hereafter referred to as 16S rDNA fragments) are expected to originate from genomic regions that specifically encompass the 16S rRNA genes of the resident microbes. Identifying 16S rDNA fragments (from within millions of reads constituting a typical metagenomic dataset) and subsequently classifying them is therefore expected to aid in quickly deciphering the taxonomic diversity of a given metagenomic dataset. The following sections describe two algorithms, namely, i-rDNA and C16S, which are used for the identification and taxonomic classification of 16S rDNA fragments in metagenomic datasets, respectively.

I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 1 An overview of different approaches adopted by metagenomic projects for profiling the taxonomic and/or functional diversity of a given environment. Advantages and limitations of each approach are also summarized. Black regions depicted in the genomic fragments correspond to entire or a fragment of 16S rRNA gene

i-rDNA: Algorithm for Identification of 16S rRNA Gene Fragments in Metagenomic Datasets

One of the simplest ways of identifying 16S rDNA fragments in a metagenomic dataset is by performing similarity searches of all reads constituting the dataset against a database containing known 16S rRNA gene sequences. Such similarity searches are typically performed using popular algorithms such as BLAST (Altschul et al. 1990) and BLAT (Kent 2002). Similarity of a read with sequences in the database is evaluated based on how it aligns with these sequences. Reads having significant similarity (similarity being defined in terms of alignment parameters such as e-value, identity, and alignment length) with database sequences are identified as 16S rDNA fragments. Since this approach enables identification as well as taxonomic classification of 16S rDNA fragments, it is currently incorporated as a standard procedure in popular metagenomic analysis platforms such as MG-RAST (Meyer et al. 2008) and CAMERA (Seshadri et al. 2007). Given the robustness of BLAST/BLAT algorithms, this approach has high sensitivity in identifying/classifying 16S rDNA fragments (even for reads with lengths <100 bp) originating from known and characterized genomes.

Although the BLAST-based approach identifies 16S rDNA sequences with high sensitivity, it requires huge compute power for performing alignments of millions of metagenomic reads with thousands of reference 16S rRNA gene sequences. This makes it unsuitable for practical use in research labs lacking access to high-end computational infrastructure. Another alignment-based methodology attempts to address/overcome this limitation by employing hidden Markov models (HMMs) that represent the universally conserved sequence architecture of the 16S rRNA gene (Huang et al. 2009). These HMMs, built separately for bacterial and archaeal kingdoms, reflect the sequence conservation pattern observed within the 16S rRNA genes of microbes belonging to these two lineages. For identification of 16S rDNA fragments, reads in a metagenomic dataset are individually aligned to these two HMMs. Reads obtaining significant alignment scores are then tagged as 16S rDNA fragments. Given that the alignments of individual reads are done only against two HMMs, rather than against thousands of individual reference 16S rRNA gene sequences (as in the case of the BLAST-based approach), the execution time as well as the requirements of compute power are significantly reduced. Moreover, this approach is observed to achieve levels of detection sensitivity similar to those of the BLAST-based approach.

Though the above-described HMM-based approach represents a rapid way of identifying 16S rDNA fragments (as compared to the BLAST-based approach), it still involves performing alignments of each individual read (in metagenomic datasets) against two HMMs. Consequently, adopting the HMM-based approach (on a standard workstation) for identification of 16S rDNA fragments within huge metagenomic datasets (e.g., the Human Microbiome Project containing more than
32 million sequences) is expected to take several hours to a few days. The recently published i-rDNA method (Mohammed et al. 2011) has addressed this issue by employing a sequence composition-based step prior to the similarity search step performed against the bacterial and archaeal HMMs (Fig. 2). This precursor step is based on the following premise/observations. Given that significant portions of 16S rRNA gene sequences are universally conserved across all prokaryotic lineages, genomic regions encompassing 16S rRNA genes are characterized by distinct sequence compositions (in terms of oligonucleotide usage patterns) as compared to the other regions of the genome. The i-rDNA method utilizes this observation to first identify a subset of reads which have an oligonucleotide composition similar to that of 16S rRNA gene sequences and subsequently provide this small subset of reads as input to the HMM-based approach. This step of prefiltering data (based on compositional characteristics) essentially reduces the volume of data which are provided as input to the HMM alignment step. The finer algorithmic details of the i-rDNA method are explained in the subsequent paragraphs.

I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 2 Available approaches for identification of 16S rRNA gene fragments in metagenomic datasets obtained using shotgun sequencing. Advantages and limitations of each approach are also summarized. Black regions depicted in the genomic fragments correspond to entire or a fragment of 16S rRNA gene

The i-rDNA method first captures the oligonucleotide usage patterns which are specific to 16S rRNA gene sequences. This procedure is performed as a one-time preprocessing step. For this purpose, genomic fragments (of lengths 1,000 bp each) from all completely sequenced prokaryotic genomes are first obtained. Each fragment is then represented as a 256-dimensional vector containing the frequencies of all possible tetranucleotides. Subsequently, vectors
corresponding to all fragments are clustered (using the k-means clustering algorithm) based on their tetranucleotide frequency patterns. This generates a feature vector space with a number of clusters. Centroids of the clusters are then calculated based on the fragments contained in them. Each cluster in the feature vector space is thus represented by its centroid. Given the unique sequence composition of the 16S rRNA gene, genomic fragments encompassing this gene are localized to a subset of these clusters. In the preprocessing step of the i-rDNA method, clusters containing significant proportions of 16S rRNA gene fragments (as compared to other clusters) are identified and tagged as “probable 16S” clusters (Fig. 3). This information is stored in the form of a mapping file that contains cluster centroids along with their respective tags (either probable 16S or non-16S).

The i-rDNA method identifies 16S rDNA fragments (from amongst all reads constituting a given metagenomic dataset) in the following manner. For each read, the distances of its tetranucleotide frequency vector to all the cluster centroids in the mapping file (obtained as described in the previous paragraph) are first computed. This step helps in identification of a set of clusters having tetranucleotide composition most similar to that of the read. If a significant proportion of the identified clusters are observed to be pre-tagged (in the mapping file) as “probable 16S,” the read is classified as a “probable 16S rDNA” fragment (Fig. 3). Only those reads classified as “probable 16S rDNA” fragments are provided as input to the downstream HMM search. Adoption of the above strategy in the published study (Mohammed et al. 2011) indicated a six to ten times reduction in the number of sequences provided as input to the HMM search, thereby drastically reducing the overall time for identifying 16S rDNA fragments (in metagenomic datasets). Furthermore, this noticeable reduction in the overall analysis time was observed to be achieved without any significant loss in detection sensitivity (Mohammed et al. 2011).

Table 1 provides an additional comparison of detection sensitivity and execution time for three approaches, namely, BLAST based, HMM based, and i-rDNA, for four simulated metagenomic datasets. These datasets were generated by providing 35 prokaryotic genomes as input to the MetaSim sequence simulator software (Richter et al. 2008). Sequences in each of these datasets simulated the lengths and error rates of four popular sequencing platforms, viz., Sanger (sequence length approximately 800 bp), 454-titanium (~400 bp), 454-standard (~250 bp), and Illumina (~110 bp). These comparative evaluations were performed on a standard Linux workstation having a 2.33 GHz dual core processor and 2 GB of RAM. Results in this table indicate the utility of the i-rDNA method in reducing the overall time taken for identification of 16S rDNA fragments in metagenomic datasets. The i-rDNA method is observed to be 50 and 8 times faster in identifying 16S rDNA fragments as compared to the BLAST-based approach and the HMM-based meta_rna program, respectively. As can be observed, this reduction in time for identification is not accompanied by a noticeable decrease in detection sensitivity.

C16S: Algorithm for Taxonomic Classification of 16S rRNA Gene Fragments in Metagenomic Datasets

Extraction and classification of 16S rRNA gene fragments is one of the quickest ways to estimate taxonomic diversity of any microbial community. Due to the presence of several characteristic features, the 16S rRNA gene has been used as an ideal taxonomic marker. Primarily, this gene is ubiquitously present within the genomes of all prokaryotic organisms. Secondly, given its role in key cellular processes (e.g., protein synthesis), the probability of this gene being involved in lateral gene transfer events is also minimal (Jain et al. 1999; Daubin et al. 2003). This property enables its use as a phylogenetic marker to study the evolutionary patterns in diverse prokaryotic lineages with high confidence. Furthermore, 16S rRNA genes are characterized by highly conserved regions (U1-U8) that flank hypervariable regions (V1-V9) (Jonasson et al. 2002). Universal/customized primers designed against these conserved stretches (which are adjacent to the
hypervariable regions) facilitate specific isolation, PCR-based amplification and subsequent sequencing of the entire length (or specific portions) of 16S rRNA genes. The hypervariable regions within the sequenced 16S rRNA gene fragments are specific to each organism and thus serve as “taxonomic barcodes.” These “barcodes” can be used to classify 16S rRNA gene fragments sampled from a given environment into different taxonomic groups.

Various strategies are currently employed for classification of 16S rDNA fragments (Fig. 4). Overall, these strategies involve comparing the sequences and/or the compositions of 16S rDNA fragments with sequences/models corresponding to known taxonomic groups. Details of these strategies are described below. The BLAST-based approach (described in the previous section) is also employed for classifying 16S rDNA fragments. For this purpose, 16S rDNA fragments having significant hits with reference 16S rRNA gene sequences (from known and characterized microbes) are assigned to the taxa corresponding to the best hit(s). In this process, the quality of the BLAST hit (obtained between the query 16S rDNA fragment and reference 16S rRNA gene sequence) is judged based on user-specified thresholds of alignment parameters such as bit score, e-value, identity percentage, etc. Apart from the huge compute power requirement (for performing the alignment step), the BLAST-based approach has the following limitation. In a given metagenomic dataset, a large proportion of query 16S rDNA fragments typically originate from hitherto unknown taxa. Such sequences may belong to an entirely new species or genus or family or order or class or even a new phylum. Attempting to map such novel query 16S rDNA fragments to known taxonomic groups is expected to result in incorrect taxonomic classification. Although using a stringent set of BLAST thresholds (for evaluating alignment quality prior to assignment) is expected to reduce the misclassification rate (to some extent), a large number of 16S rDNA fragments may remain unassigned/unclassified. It may be noted that various read mapping algorithms, e.g., BWA (Li and Durbin 2010), Bowtie (Langmead et al. 2009), etc., have also been used for aligning query 16S rDNA fragments with sequences in reference databases. The premise and the overall methodology for inferring the taxonomic origin of query sequences however remain the same as in the BLAST-based approach.

Inferring the taxonomic origin of query 16S rDNA sequences can also be performed by mapping/aligning them to precomputed multiple sequence alignments (MSAs). These MSAs are generated by pre-aligning well-annotated 16S rRNA gene sequences belonging to organisms of known taxonomic lineages. A detailed description of the methods adopting such strategies is provided in another review (Sun et al. 2011). MSA-based approaches, though observed to provide robust taxonomic inferences, are critically dependent on the quality and the taxonomic coverage of the reference sequences which are used for generating the precomputed alignments. Furthermore, given the algorithmic complexity of the process of performing/generating multiple sequence alignment(s), enormous amounts of time and compute power are typically required for MSA-based analyses.

The widely popular RDP classifier (Wang et al. 2007) attempts to address the limitations associated with the above-described BLAST-based as well as MSA-based approaches. This method taxonomically classifies a query 16S rDNA fragment by comparing its compositional properties (e.g., oligonucleotide usage pattern)

I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 3 A conceptual overview of the framework used by the i-rDNA method. (a) A schematic representation of the preprocessing step of the i-rDNA method. A feature vector space is generated by performing a k-means clustering (using tetranucleotide frequencies) of genomic fragments from all completely sequenced microbial genomes. In this feature vector space, clusters C3, C5, C6, and C7, containing genomic fragments harboring portions of 16S rRNA gene in significant proportions, are tagged as “probable 16S” clusters. Red dots: fragments originating from genomic regions harboring portions of 16S rRNA gene. Blue dots: fragments not containing any portion of the 16S rRNA gene. Black dots: centroids corresponding to each of the clusters in the feature vector space. (b) Identification workflow of the i-rDNA method. Tetranucleotide frequency vectors corresponding to query reads (R1 and R2) are first mapped to the feature vector space (generated in the preprocessing phase of i-rDNA as described above in (a)). Read R1 maps to an area (within the feature vector space) that is in close proximity to cluster centroids C3, C5, C6, and C7 (all of which are pre-tagged as “probable 16S” clusters). Consequently, read R1 is identified as a “probable 16S rDNA” fragment. Read R2 is in close proximity to clusters C8, C9, and C10 (all of which are pre-tagged as “non-16S” clusters). Read R2 is therefore identified as a “non-16S rDNA” fragment
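The composition-based prefilter of i-rDNA (a tetranucleotide frequency vector per read, compared against tagged cluster centroids) can be sketched roughly as follows. This is a minimal illustration and not the published implementation: the Euclidean distance, the number of nearest clusters examined (`k_nearest`), and the cutoff (`min_fraction`) are assumptions chosen for the example, and the function names are hypothetical.

```python
from itertools import product
from math import dist

# All 256 tetranucleotides in a fixed order, mapped to vector positions.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}

def tetra_freq(seq):
    """Return a 256-dimensional tetranucleotide frequency vector for seq."""
    seq = seq.upper()
    counts = [0] * 256
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # windows containing N etc. are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 256

def is_probable_16s(read, centroids, tags, k_nearest=3, min_fraction=0.5):
    """Tag a read as "probable 16S" if enough of its nearest clusters carry that tag.

    centroids: list of 256-dimensional vectors (one per cluster).
    tags:      parallel list of booleans, True for "probable 16S" clusters.
    k_nearest and min_fraction are illustrative values, not published ones.
    """
    vec = tetra_freq(read)
    ranked = sorted(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
    nearest = ranked[:k_nearest]
    if not nearest:
        return False
    hits = sum(tags[i] for i in nearest)
    return hits / len(nearest) >= min_fraction
```

Only reads for which such a check returns true would be passed on to the (much slower) HMM search, which is where the reduction in overall analysis time comes from.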
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Table 1 Performance of i-rDNA, meta_rna (an HMM-based identification method) and BLAST in terms of detection sensitivity and execution time. The approximate length of reads constituting each of the four simulated test datasets is indicated in brackets

                                           Detection sensitivity (%)       Execution time (in seconds)
Test dataset              Number of reads  i-rDNA   meta_rna   BLAST       i-rDNA   meta_rna   BLAST
Illumina (~110 bp)        1,000,000        93.1     94.6       98.1        102      1,110      6,317
454-Standard (~250 bp)    400,000          90.6     96.4       99.2        97       1,026      6,681
454-Titanium (~400 bp)    250,000          91.3     97.1       99.6        92       947        6,128
Sanger (~800 bp)          100,000          87.6     95.2       99.8        105      929        5,783
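The speed-up of i-rDNA over the other two methods can be worked out directly from the execution times in Table 1. The short sketch below does only that arithmetic; the dictionary layout and the function name `speedups` are illustrative choices.

```python
# Execution times (in seconds) copied from Table 1.
times = {
    "Illumina (~110 bp)":     {"i-rDNA": 102, "meta_rna": 1110, "BLAST": 6317},
    "454-Standard (~250 bp)": {"i-rDNA": 97,  "meta_rna": 1026, "BLAST": 6681},
    "454-Titanium (~400 bp)": {"i-rDNA": 92,  "meta_rna": 947,  "BLAST": 6128},
    "Sanger (~800 bp)":       {"i-rDNA": 105, "meta_rna": 929,  "BLAST": 5783},
}

def speedups(row):
    """Return how many times faster i-rDNA is than each other method in one row."""
    base = row["i-rDNA"]
    return {method: t / base for method, t in row.items() if method != "i-rDNA"}

for dataset, row in times.items():
    s = speedups(row)
    print(f"{dataset}: {s['meta_rna']:.1f}x vs meta_rna, {s['BLAST']:.1f}x vs BLAST")
```

The smallest ratios occur for the Sanger dataset (about 8.8 times against meta_rna and about 55 times against BLAST); the ratios for the other three datasets are larger.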
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Fig. 4 Different
approaches available for the taxonomic classification of 16S rRNA gene fragments
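The C16S strategy described in this entry (pick the best-scoring genus-specific HMM, then use that score to decide how specific the assignment may be) can be sketched as below. This is only an illustration under stated assumptions: the HMM scoring itself is not shown (scores are taken as input), the threshold values are hypothetical rather than the precomputed thresholds of the published method, and the function name `assign_taxon` is made up for the example.

```python
# Ranks from most to least specific.
RANKS = ["genus", "family", "order", "class", "phylum"]

def assign_taxon(hmm_scores, lineages, thresholds):
    """Pick the best-scoring genus HMM, then back off to a higher rank if the score is low.

    hmm_scores: {genus: score of the query against that genus-specific HMM}
    lineages:   {genus: {rank: taxon name}}
    thresholds: {rank: minimum score needed to assign at that rank} (hypothetical)
    Returns (rank, taxon), or ("unassigned", None) when no threshold is met.
    """
    best_genus = max(hmm_scores, key=hmm_scores.get)
    score = hmm_scores[best_genus]
    for rank in RANKS:  # try the most specific rank first
        if score >= thresholds[rank]:
            return rank, lineages[best_genus][rank]
    return "unassigned", None

# Example with made-up scores and thresholds.
lineages = {
    "Escherichia": {"genus": "Escherichia", "family": "Enterobacteriaceae",
                    "order": "Enterobacterales", "class": "Gammaproteobacteria",
                    "phylum": "Proteobacteria"},
    "Bacillus": {"genus": "Bacillus", "family": "Bacillaceae",
                 "order": "Bacillales", "class": "Bacilli", "phylum": "Firmicutes"},
}
thresholds = {"genus": 400, "family": 300, "order": 200, "class": 100, "phylum": 50}
print(assign_taxon({"Escherichia": 250, "Bacillus": 120}, lineages, thresholds))
```

Because a lower best score is taken to indicate greater taxonomic divergence between the query and the best-matching genus, a query from an unrepresented genus ends up assigned at a higher rank of that lineage instead of being forced into the genus.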
with models generated using compositional features of sequences of known taxonomic lineages. For this purpose, it first creates (as a preprocessing step) Naive Bayesian models that capture 8-mer oligonucleotide word frequencies in 16S rDNA sequences belonging to known genera. During the classification step, for a given query sequence, the RDP classifier first identifies a model (and the corresponding genus) whose 8-mer word frequencies are “most” similar to that of the query sequence. The classifier then employs a bootstrapping procedure to compute a confidence score of assignment to each taxon belonging to the taxonomic lineage of the identified genus. The query sequence is then assigned to a taxon (within this lineage) that is at the most specific taxonomic level and also generates a confidence score that exceeds the user-specified confidence score threshold. Besides being alignment-free, a major advantage of the RDP classifier is the bootstrapping procedure employed to compute the confidence scores. The overall strategy of this procedure ensures the accurate assignment of novel 16S rDNA sequences (i.e., originating from hitherto unknown organisms) to related taxa at appropriately higher taxonomic levels. However, it is important to note that the overall process of classification involves scoring and identifying the “best” taxonomic lineage corresponding to the query sequence. This scoring however does not take into account the actual level of compositional similarity between the model corresponding to the “best” taxonomic lineage and the query sequence. Consequently, in cases where 16S rDNA fragments originate from taxonomic lineages that have minor representation in existing 16S rDNA databases, the classification accuracy of the RDP classifier has been shown to decrease (Biers et al. 2009).

In contrast to all three methods described above, the recently published C16S algorithm (Ghosh et al. 2012) employs genus-specific HMMs for the taxonomic classification of 16S rDNA fragments. The overall classification strategy is based on the following premise. 16S rDNA sequences contain alternating conserved and hypervariable regions. The latter regions are characterized by clade-specific sequence variation patterns. As a preprocessing step, the C16S algorithm captures these clade-specific patterns at the taxonomic level of genus. For this purpose, genus-specific HMMs are first generated and subsequently utilized for classifying query 16S rDNA fragments. During the classification phase, a query 16S rDNA fragment is first mapped to these precomputed genus-specific HMMs and the genus corresponding to the best scoring HMM is identified. The score obtained in this process is then utilized for dynamically identifying an appropriate level of taxonomic assignment for each query sequence. The strategy of correlating the HMM score with the taxonomic level of assignment is based on the empirical observation that the HMM score decreases with increasing taxonomic divergence between the taxa corresponding to the query and the HMM (Ghosh et al. 2012).

The classification methodology adopted by C16S has the following advantages. First, employing representative genus-specific HMMs significantly reduces the time and compute power as compared to that typically required by BLAST-based or MSA-based classification approaches. Second, the use of precomputed threshold scores in C16S ensures assignment of query sequences (originating from unknown organisms) at appropriately higher taxonomic levels, thereby reducing its misclassification rate as compared to that by the RDP classifier. Finally, given that the identified taxonomic levels are specific to the extent possible, the overall specificity of assignments by C16S is not compromised.

The above observations (with respect to classification efficiency of C16S) are also reflected in the results of a comparative evaluation between the C16S algorithm and the RDP classifier (run with default parameters). This evaluation was performed using five simulated 16S rDNA datasets (each comprised of 30,000 sequences). While one of these datasets consisted of full-length 16S rRNA gene sequences from taxonomically diverse microbes, the others consisted of 16S rDNA fragments that mimicked the length and the sequencing error rates associated with four popular sequencing platforms, viz., Sanger, 454-titanium, 454-standard, and Illumina. Furthermore, for each dataset, evaluation was performed in four different simulated metagenomic scenarios, wherein the input 16S rDNA sequences mimicked those originating from entirely new genera, families, orders, and classes, respectively. These simulated scenarios were generated by progressively removing the models corresponding to the genus, family, order, and class of the source organisms
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments, Table 2 Distribution of taxonomic assignments obtained using C16S and RDP classifier for five simulated metagenomic datasets (each comprised of 30,000 sequences)

                        Database scenario
                        Minus genus     Minus family    Minus order     Minus class
Assignment category     C16S    RDP     C16S    RDP     C16S    RDP     C16S    RDP
Illumina dataset (average read length ~110 bp)
  Correct               86.7    81.2    91.2    86.2    88.7    86.2    85.3    84.9
  Higher levels         16.7    10.2    21.2    16.6    39.2    34.7    65.7    63.9
  Intermediate levels   56.4    56.7    52.2    48.5    25.6    26.3    0       0
  Specific levels       13.6    14.3    17.8    21.1    23.9    25.2    19.6    21
454-Standard dataset (average read length ~250 bp)
  Correct               95.7    80.5    92.6    88.2    84.5    80.7    84.6    84.3
  Higher levels         12.6    4.5     19.8    8.3     37.8    31.5    60.4    60.2
  Specific levels       26.7    29.3    25      34      28.3    26.6    24.2    24.1
454-Titanium dataset (average read length ~400 bp)
  Correct               94.4    79.5    92.6    84.3    87.2    70.3    86.4    79.5
  Higher levels         1.2     2.7     20.3    4.1     22.2    7.3     47.2    43.4
  Specific levels       36.8    34.3    26.4    36.0    46.6    41.7    39.2    36.1
Sanger dataset (average read length ~800 bp)
  Correct               82.2    58.1    88.4    73.1    90.1    59.6    88.2    68.6
  Higher levels         11.2    2.1     19.9    3.1     37.0    6.0     48.2    32.4
  Specific levels       36.9    35.8    27.4    37.8    46.7    27.8    40.0    36.2
Dataset with full-length 16S rRNA gene sequences
  Correct               90.1    57.6    88.8    64.9    78.8    48.3    70.7    51.8
  Higher levels         3.3     2.1     6.4     3.4     11.2    4.4     19.8    13.8
  Intermediate levels   44      16      47.3    26.4    19.6    15.8    0       0
  Specific levels       42.8    39.5    35.1    35.1    48.0    28.1    50.9    38.0
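The assignment categories used in Table 2 (“specific,” “intermediate,” and “higher” levels, all of which count as correct) can be expressed as a small decision function. The sketch below follows the definitions given in this entry; the rank list, the function name `categorize`, and the choice to report an assignment at a removed rank as "incorrect" are assumptions made for the illustration.

```python
# Lineage ranks ordered from the root down to genus.
RANKS = ["root", "cellular organisms", "superkingdom",
         "phylum", "class", "order", "family", "genus"]
HIGHER = {"root", "cellular organisms", "superkingdom"}

def categorize(assigned_rank, assigned_taxon, source_lineage, removed_rank):
    """Categorize one assignment in a "minus <removed_rank>" database scenario.

    source_lineage: {rank: taxon} of the organism the query originated from.
    removed_rank:   highest rank whose model is absent from the database
                    ("genus", "family", "order" or "class").
    """
    if source_lineage.get(assigned_rank) != assigned_taxon:
        return "incorrect"
    if assigned_rank in HIGHER:
        return "higher levels"
    specific_rank = RANKS[RANKS.index(removed_rank) - 1]  # immediately higher rank
    if assigned_rank == specific_rank:
        return "specific levels"
    if RANKS.index("phylum") <= RANKS.index(assigned_rank) < RANKS.index(specific_rank):
        return "intermediate levels"
    # Ranks at or below the removed rank have no model in this scenario (assumption).
    return "incorrect"

lineage = {"root": "root", "cellular organisms": "cellular organisms",
           "superkingdom": "Bacteria", "phylum": "Proteobacteria",
           "class": "Gammaproteobacteria", "order": "Enterobacterales",
           "family": "Enterobacteriaceae", "genus": "Escherichia"}
print(categorize("order", "Enterobacterales", lineage, "family"))
```

With this reading, the "minus class" scenario has no intermediate levels at all, because the specific level there is already the phylum; this matches the zeros in the corresponding columns of Table 2.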
(corresponding to the query 16S rDNA fragments) from the databases utilized by RDP classifier as well as the C16S algorithm. Results of this evaluation (Table 2) indicate improved levels of classification accuracy of the C16S algorithm as compared to the RDP classifier. Interestingly, the improvement in performance, with respect to both classification accuracy and specificity, is especially pronounced in simulated scenarios, wherein query sequences originate from hitherto unknown genomes lacking counterpart models at the levels of order and class (in the databases of both algorithms). Furthermore, for full-length and Sanger datasets, the classification accuracy and specificity of C16S is observed to be noticeably better than that of the RDP classifier.

Correct assignments are assignments made to taxa lying in the path between the root and the source genus of the query sequence.

Correct assignments are further subcategorized into “specific levels,” “intermediate levels,” and “higher levels” as described below:

(a) Specific levels: If HMMs corresponding to genus or family or order or class are absent from the reference database, assignment of a query sequence is classified as “correct” at “specific level,” only if the assignment is made to a correct taxon at the immediate higher taxonomic level. For instance, in a “new family” simulated database scenario (wherein HMMs corresponding to the source family of the query 16S rDNA fragment are absent from the reference database), an assignment of the query sequence to the corresponding order is categorized as correct at specific level.
(b) Intermediate levels: Correct assignments to taxa lying between the phylum level and the specific level (as described above) are classified as “correct” at “intermediate levels.”
(c) Higher levels: Assignments to root or cellular organisms or to superkingdom levels are categorized as correct assignments at “higher levels.”

Summary

One of the major objectives of most metagenomic projects is to profile and subsequently compare the spatial and temporal variations of microbial communities residing in diverse ecological niches. Analyzing such variations helps in the identification of microbial groups that confer specific characteristics to a given environment in terms of phenotype/function. Development of efficient in silico methods for identifying and classifying 16S rRNA genes (or gene fragments) from metagenomic datasets (obtained using amplicon-based or shotgun sequencing approach) is therefore an important computational problem. This article describes two recently reported methods, viz., i-rDNA and C16S, that cater to the tasks of identification and classification of 16S rDNA fragments in metagenomic datasets. The i-rDNA method represents an approach which is efficient in terms of execution speed as well as detection sensitivity. Given its ability to directly identify 16S rDNA fragments from metagenomic datasets (obtained using the shotgun sequencing approach), it holds the potential to completely bypass the experimental procedures (and the related costs of the same) associated with extraction, cloning, and sequencing of the 16S rRNA gene or gene fragments. On the other hand, the relatively higher classification accuracy of the C16S method (as compared to other contemporary classification methods) is expected to provide an accurate picture of taxonomic diversity of microbial communities inhabiting any given environment.

Cross-References

▶ Computational Approaches for Metagenomic Datasets
▶ Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes
▶ Microbial Diversity, Bar-Coding Approaches
▶ Nucleotide Composition Analysis: Use in Metagenome Analysis
▶ Phylogenetics, Overview
▶ RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data

References

Altschul SF, Gish W, et al. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Biers EJ, Sun S, et al. Prokaryotic genomes and diversity in surface ocean waters: interrogating the global ocean sampling metagenome. Appl Environ Microbiol. 2009;75(7):2221–9.
Daubin V, Moran NA, et al. Phylogenetics and the cohesion of bacterial genomes. Science. 2003;301(5634):829–32.
Ghosh TS, Gajjalla P, et al. C16S - a Hidden Markov Model based algorithm for taxonomic classification of 16S rRNA gene sequences. Genomics. 2012;99(4):195–201.
Huang Y, Gilna P, et al. Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics. 2009;25(10):1338–40.
Jain R, Rivera MC, et al. Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci U S A. 1999;96(7):3801–6.
Jonasson J, Olofsson M, et al. Classification, identification and subtyping of bacteria based on pyrosequencing and signature matching of 16S rDNA fragments. APMIS. 2002;110(3):263–72.
Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12(4):656–64.
Langmead B, Trapnell C, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26(5):589–95.
Meyer F, Paarmann D, et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;19(9):386.
Mohammed MH, Ghosh TS, et al. i-rDNA: alignment-free algorithm for rapid in silico detection of ribosomal gene fragments from metagenomic sequence data sets. BMC Genomics. 2011;12 Suppl 3:S12.
Richter DC, Ott F, et al. MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3(10):e3373.
Seshadri R, Kravitz SA, et al. CAMERA: a community resource for metagenomics. PLoS Biol. 2007;5(3):e75.
Sun Y, Cai Y, et al. A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief Bioinform. 2011;13(1):107–21.
Wang Q, Garrity GM, et al. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73(16):5261–7.
K
KEGG and GenomeNet, New Developments, Metagenomic Analysis

Masaaki Kotera, Yuki Moriya, Toshiaki Tokimatsu, Minoru Kanehisa and Susumu Goto
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan

Synonyms

GenomeNet; Kyoto Encyclopedia of Genes and Genomes

Definition

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a database resource representing biological systems, such as the cell, the organism, and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. GenomeNet is a database and computational service for genome research and related research areas in biomedical sciences, operated by the Kyoto University Bioinformatics Center in Japan. Both services work in collaboration, putting a special focus on the visualization and interpretation of large amounts of data, such as metagenome sequence data, derived from high-throughput measurement techniques.

Introduction

The number of complete genomes has been increasing dramatically. From the completion of the Haemophilus influenzae genome in 1995, it took about 13 years (1995–2008) to complete a total of 500 species. The number of complete genomes is expected to have quadrupled (~2,000) during the following 4 years (2009–2012). The total number of putative genes in these ~2,000 genomes is ~8 million. In contrast, recent prevailing technology such as Next Generation Sequencing produces even larger amounts of data. One emerging field enabled by this advance in technology is referred to as metagenomics, i.e., genomic-scale sequencing of samples containing a mix of different species. The total amount of publicly available metagenomic data has already become larger than that of genomes: 139 metagenome samples are currently stored in the KEGG database (http://www.kegg.jp/; Kanehisa et al. 2012) and the total number of putative genes in these samples is ~14 million. The need arises for novel tools and interfaces to handle this flood of data, which is expected to exponentially increase in the foreseeable future.

KEGG has been storing complete genome, draft genome, and metagenomic data and giving them additional functional annotations. The development of KEGG is the continuous effort to construct an integrative knowledgebase for widespread use in many fields, such as molecular biology. GenomeNet (http://www.genome.jp/; Kanehisa et al. 2002) is a database and
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 1 KEGG and GenomeNet among Internet resources
computational service for genome research and related research areas in biomedical sciences, operated by the Kyoto University Bioinformatics Center. It integrates KEGG with other databases that focus on genes, proteins, enzyme reactions, metabolic compounds, drugs, natural products, and other biological resources scattered all over the world. DBGET/LinkDB (http://www.genome.jp/dbget/; Fujibuchi et al. 1998) is an integrated database retrieval system for handling such molecular biology databases and is used as a backbone system in GenomeNet and KEGG. KEGG and GenomeNet also develop Web tools for functional analysis based on genome, metabolome, and metabolic reaction information and provide an integrated analysis environment for researchers and the general public (Fig. 1). KEGG and GenomeNet depend on each other to provide the high-quality knowledge and sophisticated user interfaces that promote the interpretation of massive amounts of biological data.

Genomic and Metagenomic Contents in KEGG

The KEGG Organism pages list complete genomes, expressed sequence tag (EST) datasets, metagenomes, and pangenomes (sets of sequences derived from a group of closely related strains, typically in bacterial phyla) at the following URLs:
Complete and draft genomes (http://www.kegg.jp/kegg/catalog/org_list.html)
EST datasets (http://www.kegg.jp/kegg/catalog/org_list2.html)
Metagenomes (http://www.kegg.jp/kegg/catalog/org_list3.html)
Pangenomes (http://www.kegg.jp/kegg/catalog/org_list1.html)
Genome sequences registered in the RefSeq database are incorporated in the KEGG GENES database, and additional annotation is given so that the genes have links to ortholog groups, pathways, etc. Annotation is processed manually with the help of the in-house KOALA (KEGG Orthology And Links Annotation) software, based on the bidirectional best-hit strategy of SSEARCH. Once the annotation is completed, the organism-specific pathway is automatically generated on the basis of the KEGG Orthology and reference pathway (explained below).
By June 2011, KEGG had incorporated two environmental metagenome samples retrieved from the ocean and 137 microbiome samples from human intestines (Fig. 2). KEGG gives complete and draft genomes organism codes consisting of three or four characters (e.g., hsa for H. sapiens, human). The KEGG Organism codes specify organisms and are also used as the prefixes of the pathway map IDs (e.g., hsa00010 for the glycolysis/gluconeogenesis pathway in H. sapiens). KEGG recently introduced an identifier system named "T numbers" that specify the sets of sequencing data (EST, metagenomes, and pangenomes). At the time of writing, KEGG has incorporated metagenome data from three sources (NCBI, Metagenome.jp, and MetaHIT).
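The identifier conventions described above can be illustrated with a small sketch. The regular expressions below are assumptions inferred only from the examples in this entry (a three- or four-letter organism code such as hsa, a pathway map ID such as hsa00010, and a T number such as T30003); they are not an official KEGG grammar, and the function name classify_identifier is hypothetical.

```python
import re

# Assumed patterns, inferred from the examples in the text only.
ORG_CODE = re.compile(r"^[a-z]{3,4}$")            # e.g. hsa
PATHWAY_ID = re.compile(r"^([a-z]{3,4})(\d{5})$")  # e.g. hsa00010
T_NUMBER = re.compile(r"^T\d{5}$")                 # e.g. T30003


def classify_identifier(identifier):
    """Return (kind, code, map_number) for a KEGG-style identifier.

    kind is one of "pathway_map", "t_number", "organism_code", "unknown".
    map_number is only set for pathway map IDs.
    """
    match = PATHWAY_ID.match(identifier)
    if match:
        # The organism code is the prefix of the pathway map ID.
        return ("pathway_map", match.group(1), match.group(2))
    if T_NUMBER.match(identifier):
        return ("t_number", identifier, None)
    if ORG_CODE.match(identifier):
        return ("organism_code", identifier, None)
    return ("unknown", identifier, None)


if __name__ == "__main__":
    for example in ["hsa", "hsa00010", "T30003", "M00002"]:
        print(example, classify_identifier(example))
```

Under these assumptions a reference map ID such as map00010 is also reported as a pathway map, with "map" in the position of the organism code.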
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 2 Screenshot of KEGG
Metagenomes page
Examples of T numbers include T30001 for planktonic microbial communities from the North Pacific Subtropical Gyre (retrieved from NCBI), T30003 for a human gut metagenome collected from a healthy Japanese adult male F1-S (retrieved from Metagenome.jp), and T30016 for a human gut microbial gene sample from a healthy Danish female (retrieved from MetaHIT).
For users interested in an organism (identified by the KEGG Organism code) or a sample (identified by the T number), embedded links make it easy to jump to the corresponding summary pages. Clicking "T30003" in the KEGG Metagenomes page, for instance, takes the user to the summary page specific to the sample T30003. KEGG provides this type of page for all genomes, metagenomes, pangenomes, and EST datasets. Also, users can search for genes of interest and jump to pathway maps, functional hierarchy, modules, etc.

KEGG PATHWAY Maps and BRITE Functional Hierarchy

KEGG PATHWAY maps (http://www.kegg.jp/kegg/pathway.html) and BRITE functional hierarchy (http://www.kegg.jp/kegg/brite.html) generally do not focus on a specific organism. BRITE contains a number of hierarchical classifications of vocabularies used in journal articles and other public data in academic communities. The "reference" pathway maps are the combined
pathways present in a number of organisms and represent a consensus among many published articles. Only the reference pathway map is manually drawn with in-house software called KegSketch, whereas all other organism-specific maps are computationally generated. The user can conduct a search limited to an organism of interest as well as a comprehensive search throughout all of the genome-sequenced organisms. In the pathway maps, rectangles and circles represent gene products (mostly proteins) and other molecules (mostly metabolites), respectively. The maps are drawn in black and white in reference pathways, i.e., when no organism has been specified. When the user specifies an organism of interest, the organism-specific pathways include some colored rectangles indicating that the specified organism possesses the corresponding genes or proteins in the genome (Fig. 3, left). White rectangles indicate that no genes have been annotated to the corresponding function. This does not necessarily mean that the organism does not possess the corresponding genes; it is possible that the genes have not been identified yet.

KEGG Module

KEGG has three different levels of resolution for visualizing pathways: global maps (Fig. 6), (conventional) pathway maps (Fig. 3, left), and pathway modules (Fig. 3, right). Mapping genes to global maps helps users to grasp the overview of the sample. Mapping genes to pathway maps is useful to check the functional capability of the genome or metagenome. There are some cases where the smaller functional units, as defined in KEGG Modules, are more helpful for conducting a detailed analysis. KEGG Modules include consecutive reaction steps, operons or other regulatory units, and phylogenetic units obtained by genome comparison. KEGG has recently been focusing effort on the development and annotation of KEGG Modules, leading to an increase in the number of entries. KEGG Module (http://www.kegg.jp/kegg/module.html) collects functional units classified into the following four categories: (1) pathway modules, representing smaller pathway units than KEGG PATHWAY maps, such as M00002 (glycolysis, core module involving three-carbon compounds; see Fig. 3, right); (2) structural complexes, often forming molecular machineries, such as M00072 (oligosaccharyltransferase); (3) functional sets, for other types of essential sets, such as M00360 (aminoacyl-tRNA synthetases, prokaryotes); and (4) signature modules, as markers of phenotypes, such as M00363 (EHEC pathogenicity signature, Shiga toxin).

KEGG Orthology (KO)

The coloring of the rectangles in the organism-specific pathways, i.e., the estimated presence or absence of the respective genes in pathway maps, is determined based on the KEGG Orthology (KO). KO collects groups of orthologous genes having a common function and the same evolutionary origin. A group of orthologous genes (a KO entry) is given an identification number (K number) and in principle corresponds to more than one gene derived from more than one organism. Genes assigned to the same K number correspond to the same rectangle in a PATHWAY map (Fig. 3, left), MODULE (Fig. 3, right), and BRITE hierarchy. The top page of KO (http://www.kegg.jp/kegg/ko.html) provides the form to obtain an ortholog table (Fig. 4), which shows currently annotated genes in individual genomes for a given set of K numbers, together with coloring of adjacent genes on the chromosome. Each KEGG Module also contains a link to the corresponding ortholog table. The ortholog table is a useful tool to check the completeness and consistency of genome annotations. KO entries for complete genomes are manually defined and annotated by the KEGG expert curators based on the phylogenetic profiles and functional annotations of the genes. In contrast, KO assignments for draft genomes, metagenomes, pangenomes, and EST datasets are made automatically by KAAS (KEGG Automatic Annotation Server), one of the GenomeNet tools.
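The KO-based coloring rule described above can be sketched as follows. This is a minimal illustration of the logic, not KEGG's implementation: the function name color_rectangles is hypothetical, and the K numbers and gene IDs in the example are illustrative placeholders that have not been checked against KEGG.

```python
def color_rectangles(pathway_kos, gene_to_ko):
    """Decide which rectangles of a pathway map are colored for one genome.

    pathway_kos: K numbers that appear as rectangles on the reference map.
    gene_to_ko:  mapping from gene ID to the K number assigned to that gene.
    Returns {K number: (state, genes)}, where state is "colored" when at
    least one gene of the genome is annotated with that K number and
    "white" otherwise.
    """
    # Group the genes of the genome by their assigned K number.
    genes_by_ko = {}
    for gene, ko in gene_to_ko.items():
        genes_by_ko.setdefault(ko, []).append(gene)

    coloring = {}
    for ko in pathway_kos:
        genes = sorted(genes_by_ko.get(ko, []))
        coloring[ko] = ("colored" if genes else "white", genes)
    return coloring


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    rectangles = ["K00001", "K00002", "K00003"]
    annotation = {
        "org:g1": "K00001",
        "org:g2": "K00001",  # two genes can share one rectangle
        "org:g3": "K00003",
        "org:g4": "K09999",  # annotated, but not on this map
    }
    for ko, (state, genes) in color_rectangles(rectangles, annotation).items():
        print(ko, state, genes)
```

As in the text, a "white" result only means that no gene has been annotated with that K number; it does not show that the organism lacks the function.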
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 3 Mapping human genome onto glycolysis pathway and module. Human genome mapped onto glycolysis pathway map00010 (left) and module M00002 (right). Rectangles colored in green indicate that the human genome possesses the corresponding genes. KEGG Orthology entries are used to define KEGG Modules, which are parts of pathway maps, as indicated by the red lines
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 4 Screenshot of the ortholog table for module M00002. Ortholog tables contain links to the genes corresponding to the orthologs (K numbers) in genome-sequenced species. Columns and rows represent orthologs and species, respectively
KAAS Automatic Annotation

The KEGG Automatic Annotation Server (KAAS) (Moriya et al. 2007) is one of the genome analysis tools available in GenomeNet (http://www.genome.jp/tools/kaas/) and has been developed for annotating draft genomes, metagenomes, pangenomes, and EST datasets in the framework of KEGG. KAAS accepts any group of gene sequences and helps users annotate these genes if the genes are derived from organisms that are not yet members of the KEGG Organisms, or if users have obtained the gene IDs by other means (Fig. 5). After the sequence data are submitted, the calculation may take a long time to complete. Therefore, users are requested to input their e-mail addresses, and the URL for accessing the calculation result is sent to them later. The result contains the corresponding KO list and links to automatically colored PATHWAY pages and BRITE pages. Users are advised to download the result, since results are removed from the GenomeNet server after a few days.

Mapping Metagenome Data on KEGG PATHWAY

It is possible to color KEGG PATHWAY/BRITE in a user-defined manner by using KEGG Mapper (http://www.kegg.jp/kegg/mapper.html). This will become more valuable for the interpretation of metagenome and pangenome studies. KEGG Mapper has an option to specify multiple organisms at a time. This option is particularly helpful
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 5 KEGG Automatic Annotation
Server (KAAS)
not only for comparing genomes but also for visualizing host-microbiome relationships, such as that of the human gut microbiome, as well as host-symbiont and host-pathogen relationships. If a user inputs "hsa + pfa", meaning human (Homo sapiens) plus a pathogen (Plasmodium falciparum 3D7), the resulting pathways will be double colored. The two colors represent the gene products from the two organisms. This option accepts any combination of up to ten genomes. For instance, the query "hsa + mmu + dme", which means human (Homo sapiens) + mouse (Mus musculus) + fruit fly (Drosophila melanogaster), provides a three-colored map.
Metagenomes can also be viewed with KEGG Mapper. Figure 6 shows the human genome and a human intestine metagenome mapped onto a global map, where green lines indicate genes that only the human genome possesses, red lines indicate genes found only in the gut metagenome, and blue lines indicate genes possessed by both. Figure 7 shows an example of the reconstructed thiamine metabolism pathway obtained by mapping the human genome (hsa, colored in green) and a human intestine metagenome (T30003, colored in pink). Thiamine diphosphate shown in this pathway is an essential nutritional factor for humans, but it cannot be synthesized without the help of the symbiotic bacteria in the human intestine. By clicking one of the pink-colored rectangles (e.g., ThiC), a user can see the list of corresponding genes in the metagenome (Fig. 8). The possible common sets of functions between human genome and human gut
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 6 Mapping of human genome and
human intestine metagenome on a global map
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 7 Mapping of human genome and
human intestine metagenome on thiamine metabolism
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 8 Examples of the metagenome
sequences annotated in the place of ThiC
metagenome can also be compared in terms of the possessed KEGG Module entries. From the top page of a metagenome sample (e.g., T30003), the user can jump to the module page, where the thiamine biosynthesis module (M00127) is present (Fig. 9). The human genome also has a corresponding module page, but it contains no such module, meaning that no gene in the human genome has been annotated to have such a function.

Conclusion

This review introduced the KEGG and GenomeNet resources, putting emphasis on the
KEGG and GenomeNet, New Developments, Metagenomic Analysis, Fig. 9 KEGG Module entries assigned for
a metagenome sample
usage for metagenomics studies. Their focus on metagenomes has just begun; however, they plan on developing novel user-oriented tools designed for discovery and analysis of metagenomic data. For further reading, some other publications are recommended (Wheelock et al. 2009a, b; Tokimatsu et al. 2011; Kotera et al. 2012) explaining other contents that are not mentioned in this review. The authors appreciate any suggestions, questions, and comments on KEGG and GenomeNet. Please send a message through the feedback form (http://www.genome.jp/feedback/).

Cross-References

▶ A 123 of Metagenomics
▶ Approaches in Metagenome Research: Progress and Challenges
▶ Computational Approaches for Metagenomic Datasets
▶ Customizable Web Server for Fast Metagenomic Sequence Analysis
▶ Genome Portal, Joint Genome Institute
▶ GHOSTM
▶ Human Gut Microbial Genes by Metagenomic Sequencing
▶ Human Oral Microbiome Database (HOMD)
▶ MEMOsys: Platform for Genome-scale Metabolic Models
▶ MetaBioME
▶ MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
▶ Metagenomic Research: Methods and Ecological Applications
▶ PhyloPythia(S)
▶ Variable Selection to Improve Classification of Metagenomes
▶ Viral MetaGenome Annotation Pipeline

References

Fujibuchi W, Sato K, Ogata H, Goto S, Kanehisa M. KEGG and DBGET/LinkDB: integration of biological relationships in divergent molecular biology data. In: Knowledge sharing across biological and medical knowledge based systems, Technical report WS-98-04. AAAI Press; 1998. p. 35–40. http://www.aaai.org/Papers/Workshops/1998/WS-98-04/WS98-04-006.pdf
Kanehisa M, Goto S, Kawashima S, Nakaya A. The KEGG databases at GenomeNet. Nucleic Acids Res. 2002;30(1):42–6.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012;40(Database issue):D109–14. Epub 2011 Nov 10.
Kotera M, Hirakawa M, Tokimatsu T, Goto S, Kanehisa M. The KEGG databases and tools facilitating omics analysis: latest developments involving human diseases and pharmaceuticals. Chapter 2. In: Wang J, Choon Tan A, Tian T, editors. Next generation microarray bioinformatics. Springer; 2012. ISBN 978-1-61779-399-8. doi:10.1007/978-1-61779-400-1_2 [PMID: 22130871]. http://link.springer.com/protocol/10.1007%2F978-1-61779-400-1_2
Moriya Y, Itoh M, Okuda S, Yoshizawa A, Kanehisa M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007;35:W182–5.
Tokimatsu T, Kotera M, Goto S, Kanehisa M. KEGG and GenomeNet resources for predicting protein function from omics data including KEGG PLANT resource. Chapter 14. In: Kihara D, editor. Protein function prediction for omics era. Springer; 2011. p. 271–288. http://link.springer.com/chapter/10.1007%2F978-94-007-0881-5_14
Wheelock CE, Wheelock AM, Kawashima S, Diez D, Kanehisa M, van Erk M, Kleemann R, Haeggstrom JZ, Goto S. Systems biology approaches and pathway tools for investigating cardiovascular disease. Mol Biosyst. 2009a;5:588–602.
Wheelock CE, Goto S, Yetukuri L, D’Alexandri FL, Klukas C, Schreiber F, Oresic M. Bioinformatics strategies for the analysis of lipids. Methods Mol Biol. 2009b;580:339–68.

Krona: Interactive Metagenomic Visualization in a Web Browser

Brian D. Ondov, Nicholas H. Bergman and Adam M. Phillippy
National Biodefense Analysis and Countermeasures Center, Frederick, MD, USA

Abbreviations

BLAST Basic Local Alignment Search Tool
HTML HyperText Markup Language
NCBI National Center for Biotechnology Information
RDP Ribosomal Database Project
XML eXtensible Markup Language

Definition

Krona is an interactive visualization tool for exploring the composition of metagenomes within a Web browser.

Introduction

Much of the research performed in metagenomics is exploratory, making visualization a prominent aspect of the field. Graphically representing a metagenome, however, is not a trivial task. A single sample can easily contain too many species to represent in one figure, and classifications are not always specific. This often forces visualizations to summarize the sample at higher ranks, such as genus or family, trading details for a meaningful overview. Though user interaction can typically reveal more specific classifications,
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 1 Types of overviews. The traditional pie chart (a) shows abundances of organisms in a metagenome, summarized at the phylum level. Many phyla are still too small to compare, while genus- and species-level classifications for the larger phyla cannot be seen even though they would be large enough. The multilayer pie chart (b) depicts ranks more dynamically, dividing high-level classifications into more specific ones toward the outside of the circle. This allows more details to be shown for large phyla while small phyla are grouped and labeled
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 2 Zoomed multilayer pie chart. Standard zooming can show more detail for a region of a multilayer pie chart, but can move the center off screen and cause wedges to become nearly rectangular. As a result, it is less intuitive to discern relative abundances and hierarchical organization
there is still a trade-off between comparing the most abundant organisms and viewing their most specific classifications (Huson et al. 2007; Meyer et al. 2008). Krona uses multilevel pie charts to visualize both the most abundant organisms and their most specific classifications (Fig. 1). Rather than hiding lower ranks in its overview, Krona hides low-abundance organisms, which can be expanded interactively. Additionally, Krona’s browser-based implementation allows it to be much more portable than other interactive metagenomic visualization tools.

Overviews and Details

Interactive visualizations can make complex results more accessible by providing both high-level overviews and detailed views of specific portions as needed (Shneiderman 2002). Though an overview can (and usually must) omit some complexity, this view helps users determine which areas to view in further detail and provides context as they browse between sections. Multilevel pie charts are a good option for metagenomic overviews because they can convey hierarchy implicitly, nesting lower-level wedges within higher ones (Draper et al. 2009). This allows the abundances of multiple levels to be shown in the same view and using the same scale. As in other metagenomic visualizations, some nodes will have to be hidden for the overview to be informative. The benefit of multilevel pie charts is that the nodes are hidden based on abundance rather than specificity of their classifications. This gives priority to nodes that make
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 3 Polar zooming. Zooming in polar space allows the zoomed region to retain the intuitive properties of the original multilayer pie chart. A wedge in the overview (a, green) is stretched around the center (b–e) until it fills the entire circle (f). The detailed view also serves as a new overview from which the process can be repeated with smaller wedges
up the greatest portion of the sample, which are
typically of the most interest. A potential draw-
back, however, is that simply zooming in on the
smaller nodes would cause them to lose their
resemblance to a pie chart (Fig. 2). Krona avoids
this problem using polar zooming, in which
a wedge is stretched around the center until it
forms a new multilayer pie chart (Fig. 3). The
zoomed in view also serves as a new overview for
further zooming, allowing even complex hierar-
chies to be explored with only a small amount of
navigation.
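The ideas in this section (nested wedges on one scale, hiding by abundance rather than rank, and polar zooming) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Krona's actual code: the nested-dictionary data format, the function names, the min_angle threshold, and the sample magnitudes are all hypothetical.

```python
def total(node):
    """Magnitude of a node: its own count plus that of all descendants."""
    return node.get("magnitude", 0) + sum(total(c) for c in node.get("children", []))


def layout(node, start=0.0, end=360.0, depth=0, min_angle=5.0):
    """Return wedges (name, depth, start_angle, end_angle) in degrees.

    Children are nested inside the angular span of their parent, so all
    levels share one scale. A wedge narrower than min_angle is hidden
    together with its subtree, whatever its taxonomic rank.
    """
    wedges = [(node["name"], depth, start, end)]
    node_total = total(node)
    if node_total == 0:
        return wedges
    angle = start
    for child in sorted(node.get("children", []), key=total, reverse=True):
        span = (end - start) * total(child) / node_total
        if span >= min_angle:
            wedges.extend(layout(child, angle, angle + span, depth + 1, min_angle))
        angle += span
    return wedges


def find(node, name):
    """Depth-first search for the node with the given name."""
    if node["name"] == name:
        return node
    for child in node.get("children", []):
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


def polar_zoom(root, name, min_angle=5.0):
    """Stretch the chosen wedge around the center until it fills the circle."""
    return layout(find(root, name), 0.0, 360.0, 0, min_angle)


# Hypothetical sample: one dominant phylum, one minor phylum, one rare phylum.
sample = {"name": "root", "children": [
    {"name": "Proteobacteria", "children": [
        {"name": "Escherichia", "magnitude": 600},
        {"name": "Vibrio", "magnitude": 300}]},
    {"name": "Firmicutes", "magnitude": 90},
    {"name": "RarePhylum", "children": [
        {"name": "RareGenusA", "magnitude": 6},
        {"name": "RareGenusB", "magnitude": 4}]},
]}

if __name__ == "__main__":
    print(layout(sample))                    # overview: RarePhylum is hidden
    print(polar_zoom(sample, "RarePhylum"))  # zoomed view: its genera appear
```

In the overview the rare phylum spans only 3.6 degrees and is hidden, while genus-level wedges of the dominant phylum remain visible. Zooming on the rare phylum makes it the new root, so its genera are laid out over the full circle and the zoomed view can serve as a new overview.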
Interactivity Without Installation

Since researchers often use visualizations to convey data to others, portability is an essential feature of visualization software. In the past, interactive features were typically at odds with portability because they required software to be installed. However, thanks to technologies such as JavaScript and HTML5, the modern Web browser has become a ubiquitous, standardized platform for interactivity. Many software packages are now entirely Web based, hosting both tools and data on centralized servers. While this “cloud computing” model offers many advantages, it also creates a dependency on those servers and an obligation for the software developers to maintain and scale them. Furthermore, it requires researchers to store their data remotely, which may not always be desirable. Krona offers a compromise in which each chart is a locally stored Web page, in the form of a single HTML file with embedded XML data. When this file is opened in a Web browser, viewing code is fetched from the Internet, allowing the data to be viewed interactively without installing software or using remote storage (Fig. 4). Krona charts can easily be shared with anyone who has an Internet connection and a modern Web browser. They can also be embedded in existing Web pages without modifying the server. When an Internet connection is not available, Krona charts can still be viewed locally, but this requires installation.

Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 4 Krona architecture. Embedding XML data within an HTML document and linking to remote JavaScript allows a hybrid of Web-based interactivity and locally stored data

Showing More Information

Confidence
Metagenomic classification algorithms are constantly improving, but their results still come with a significant degree of uncertainty. Only a small fraction of the tree of life is represented in reference databases, and this causes widespread bias in classifications (Wooley et al. 2010). It is thus important to consider classification confidence, whenever it is available, when analyzing classification results. Krona can vary wedge coloring to visualize classification confidence in tandem with abundances (Fig. 5).

Comparison
Metagenomic studies often compare differences in metagenomes sampled from multiple locations or times. Though direct comparison of samples is infeasible for multilayer pie charts, it is possible to convey differences through animation and color. To show animated differences, the chart can be morphed from one sample to the next, causing wedges that change significantly in size to draw attention from the user (Fig. 6). To show the differences with color, each wedge can be colored based on how much it varies between
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 5 Classification confidence. Classification confidence is mapped to a gradient from red (signifying low confidence) to green (signifying high confidence), allowing it to be depicted in tandem with abundance and hierarchy
samples. These two methods can also be combined to provide a clearer picture of sample variation.

Applications in Metagenomics
Metagenomic analyses typically produce data from one of two categories: taxonomic and functional. Taxonomic classifications, which place sequences on the tree of life, are inherently hierarchical because of the various ranks in the tree (species, genus, etc.). Functional classifications, which describe the roles of predicted proteins, are often made hierarchical by grouping specific functions into more general ones. Since both types of data focus on quantities within hierarchies, both are suited to visualization with Krona charts. To create Krona HTML files from these data, many common formats can be imported with KronaTools, a software package for Unix-based systems. Classifications can be directly imported from the RDP Classifier, Phymm/PhymmBL, FCP, MG-RAST, or the Web-based bioinformatics platform Galaxy. For raw BLAST results downloaded from NCBI or the METAREP metagenomic repository, KronaTools performs MEGAN-like (lowest common ancestor) classification using NCBI taxonomy information. When importing
Krona: Interactive Metagenomic Visualization in a Web Browser, Fig. 6 Comparing datasets. Differences between samples are shown with an animated transition from one sample (a) to the next (b). The persistence of wedge coloring between samples helps the user keep track of individual wedges and draws attention to ones that change by significant amounts
classifications from RDP and PhymmBL, a color gradient can be used to represent the average reported confidence of assignments to each node. For MG-RAST, METAREP, and raw BLAST results, the nodes can be colored by e-value, score, or percent identity. Since classifications can sometimes be performed on assembled contigs rather than reads, KronaTools can be given contig magnitudes to more accurately convey abundance in the chart. To extend KronaTools to formats that are not yet supported, it can also import generic tabular files containing NCBI Taxonomy Identifiers or Enzyme Commission numbers. Other types of classifications can be imported from basic text files or an Excel template detailing lineage and magnitude. Finally, a custom XML file can be imported to gain complete control over the chart, including custom attributes and colors for each node. Since node attributes can contain HTML and hyperlinks, XML import allows Krona to be deployed as a custom data browsing and extraction platform in addition to a visualization tool.

Summary

Krona enables the interactive visualization of complex metagenomic data without installed software or cloud computing resources. It uses multilayer pie charts to provide overviews that emphasize the most abundant members of a sample, while its polar zooming intuitively provides details for the least abundant. Supplementary data, such as classification confidence and sample variation, can be conveyed through color and animation. Krona’s hybrid Web/local architecture allows each interactive chart to be a single file, viewable on any computer with an Internet connection and a modern Web browser. Charts can be created from common metagenomic and generic file formats using KronaTools, a software package for Unix-like systems. Both Krona and KronaTools are freely available under a BSD open-source license from http://krona.sourceforge.net.

Acknowledgments This publication was developed and funded under agreement no. HSHQDC-07-C-00020 awarded by the US Department of Homeland Security for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the US Department of Homeland Security. The Department of Homeland Security does not endorse any products or commercial services mentioned in this publication.

Cross-References

▶ METAREP, Overview
▶ MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
▶ Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome

References

Draper G, Livnat Y, Riesenfeld R. A survey of radial methods for information visualization. Vis Comput Graph IEEE Trans. 2009;15(5):759–76.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17(3):377–86.
Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386.
Shneiderman B. The eyes have it: a task by data type taxonomy for information visualizations. Visual languages, 2002.
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):e1000667.
L
Lateral Gene Transfer and Microbial Diversity

Tania Nasreen, Rebecca J. Case and Yan Boucher
Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada

Synonyms

Horizontal gene transfer (HGT); Lateral gene transfer (LGT)

Definition

LGT comprises genetic changes within an individual or a population that occur through the acquisition of DNA from individuals that are not an organism’s direct cellular parent or progenitor. One of its effects on microbial populations is to alter diversity through the acquisition of genetic material or by homogenization of a population. If molecular sequences are available for a community, statistical estimators can be used to calculate its total diversity and structure so that it can be compared to other ecosystems.

Introduction

The “uncultured majority” (Whitman et al. 1998) of prokaryotes captures the imagination, as it suggests a seemingly limitless potential for biological diversity. This estimate of less than 1 % of prokaryotes being represented by cultures (Torsvik et al. 1990) suggests that exploration of the untapped diversity of microbial species, genes, pangenomes, metabolism, behaviors, and complex interactions will be a fruitful endeavor. However, how we discover and understand microbial diversity has been heavily influenced by LGT. We can compare any bacterial or archaeal 16S rRNA gene directly recovered from the environment to the comprehensive public sequence databases. This allows the identification of this gene’s host based on its similarity and phylogenetic placement relative to sequences from described (and therefore cultured) prokaryotes. These sequences can also be compared to all the other 16S rRNA gene sequences directly retrieved from the environment; however, without a described culture, little can be inferred about their hosts’ physiology and thereby their role in an ecosystem. This is further complicated by the prevalence of LGT, which means that few if any phenotypic characteristics of a species, genus, family, or phylum can be inferred. Therefore, a sequence rarely tells us anything about the ecology of an organism, and its real value is that it can tell us something about the biological diversity of an ecosystem.

Impact of LGT on Measurements of Microbial Diversity

Diversity has been used as a metric by ecologists for decades and can be correlated with other
information to describe an ecosystem (Gravel et al. 2011). The diversity of a system is not
simply the number of organisms or unique DNA sequences identified. Probability-based estimators
can be used to extrapolate the total diversity from subsampling the diversity of operational
taxonomic units (OTUs), defined by a similarity threshold of the 16S rRNA gene sequence. This
can be done for populations with parametric (e.g., rarefaction) or nonparametric (e.g., Chao1)
distributions. This is analogous to capture-recapture methods of determining the population size
of animals. For example, to determine the population size of swamp wallabies, the ears of several
wallabies are tagged within a population, and subsequent sampling of the population can be used
to estimate its size by calculating the probability of recapturing tagged wallabies among
non-tagged wallabies. Molecular microbial ecology is much more powerful than such macroecology
studies, as it rarely focuses on a single species but rather on the total bacterial and/or
archaeal community and the numerous populations that encompass thousands of species. Diversity
estimators are then used to calculate the total diversity and structure of the community using
indices such as Simpson's Diversity Index (proportional distribution of all species), species
evenness (distribution of individuals among species), and the Shannon Index (entropy of the
community measured from its richness and evenness). These indices allow us to compare natural
and experimental communities to identify factors that influence diversity, such as the volume of
water in tree holes (Bell et al. 2005) or a chronosequence within a lichen (Mushegian et al. 2011).
Microbial systems rarely have a perceived intrinsic value, in that people do not marvel at
a termite's hindgut as they do at old-growth forests. Their value is in what they do: their
function. Diversity is a powerful measurement in microbial ecology, as it has a major influence
on the productivity and stability (or resilience) of an ecosystem (Gravel et al. 2011). The
indices described above are useful in characterizing these systems, as they can be used to
compare their productivity. However, in microbial systems, the productivity is not often studied
(with the exception of phytoplankton in aquatic systems). Often a specific process is of
interest, such as degradation of xenobiotics or denitrification. This presents one of the biggest
dilemmas for microbial ecologists, as they cannot study a phylogenetic group (for which there
are many 16S rRNA-based primers and probes that could be used in targeted studies) and infer the
function of the group (Case et al. 2007). Macroecologists can infer that plants are primary
producers at the base of the food web and provide shelter for other species as habitat-forming
species, which is not possible for microecologists. This is the result of LGT, as this phenomenon
facilitates the movement of genes among phylogenetically distant organisms. This means that
phylogeny based on universal marker genes such as 16S rRNA is not a predictive tool of function
in microbiology.
Molecular methods have been adapted to circumvent this conundrum so that functional genes (such
as hupL for hydrogen oxidation) can be directly targeted through PCR (Balskus et al. 2011). Such
functional genes can then be used in community fingerprinting, clone libraries, or CARD-FISH,
which has been adapted to identify mRNA to look at expression of specific genes inside cells.
Such gene-omic approaches (sequencing of a single marker gene directly from an environmental
sample) are popular for targeted studies and can be adapted to high-throughput sequencing
techniques. Datasets that include deep sequencing of a gene involved with a specific function can
be used to identify redundancy in a system. Such redundancy is important for the stability of an
ecosystem through environmental change, as genetic redundancy represents the diversity of
organisms able to perform a function within a system. The alternative to gene-omic approaches is
metagenomics, whose popularity has been greatly influenced by the disconnect created by LGT
between phylogeny and function. Metagenomics retrieves large nontargeted sequence datasets from
an environment such that metabolic networks and interactions can be inferred from the
community's metagenome. This method can be
coupled to metatranscriptomics (RNA) and/or metaproteomics (proteins) to move beyond the genetic
potential of a metagenome to the transcribed and translated. These methods, however, have their
greatest power when targeted or used in low-diversity systems (Hugenholtz and Tyson 2008).

Mechanisms Responsible for the Generation of Genetic Diversity in Microbes

What is measured through gene-omic approaches such as 16S rRNA gene or nifH clone libraries is
nucleotide sequence diversity. The latter, however, is not the only type of genetic diversity.
Metagenomic or genomic approaches allow the measurement of gene content diversity, which is the
measurement of differences in the genes found in various genomes or metagenomes. Both of these
are strongly affected by LGT, which influences not only the rate at which they change but also
how they change.
Sequence Diversity. The only force responsible for de novo creation of genetic diversity is
mutation. It can be defined as changes in the DNA sequence of a genome that are inherited from
a progenitor. The nature of such changes can vary: base pair substitutions, insertion/deletion of
one or more nucleotide(s), as well as larger or more complex changes (such as chromosomal
rearrangement or gene duplication) (Fig. 1). The physical causes of mutations are also diverse:
unforced DNA replication errors, errors during proofreading or post-replication mismatch repair,
and DNA damage leading to replication errors or inaccurate repair. Although mutation is
responsible for creating diversity, it is not the only phenomenon introducing variation in
particular groups or lineages of microbes. Genetic changes within an individual or a population
can occur through the acquisition of DNA from individuals that are not an organism's direct
cellular progenitor. This process is LGT. In bacteria and archaea, it has two main steps. First,
foreign DNA penetrates the cellular envelope in one of three ways: transformation (the uptake of
DNA directly from the environment or from a membrane vesicle), conjugation (cell-to-cell contact
mediated by the apparatus encoded on a conjugative element or by a cytoplasmic fusion), or
transduction (introduction of DNA by a phage) (Fig. 1). Second, integration into the new host
genome is required, which can be achieved by homologous recombination (i.e., requiring
a homologous region of DNA between the donor and recipient), heterologous recombination (i.e.,
not requiring a homologous region of DNA between the donor and recipient DNA), or
extrachromosomal maintenance and replication.
We can now obtain minimal LGT estimates through quantification of homologous recombination. This
type of LGT directly affects sequence diversity and is usually simply termed "recombination" in
most molecular population studies. This is because mathematical models currently used in
population genetics can only take into account changes in genetic material that is present in all
members of the population, therefore excluding acquisition of novel genetic material through
heterologous recombination and as extrachromosomal elements. Population recombination rates
therefore only include events in which foreign DNA, through replacement of a homologous locus by
recombination, is integrated in the host genome. Studies that have compared population mutation
and recombination rates in various prokaryotic lineages have found a relatively even split
between those in which mutation introduces most of the changes and those where homologous
recombination is responsible for most nucleotide variations (sequence diversity).
Gene Content Diversity. LGT also (if not predominantly) introduces change through the acquisition
of novel genetic material through heterologous recombination. This, in combination with gene loss
and gene duplication, leads to changes in the gene content of an organism. For example, strains
of the marine heterotrophic bacterial genus Vibrio, which are identical at one or more
protein-coding housekeeping genes, can be differentiated by genome size (up to 800 kb
Lateral Gene Transfer and Microbial Diversity, Fig. 1 Description of the processes generating genetic diversity in
bacteria and archaea
variation) (Thompson et al. 2005). Also, strains of the nitrogen-fixing soil bacterium Frankia
that are more than 97 % identical in their rRNA gene sequences (the conventional cutoff value for
a bacterial species) can differ by as many as 3,500 genes, which represents nearly half of their
7.5 Mb genomes (Normand et al. 2007). Although gene content and sequence diversity are often
correlated, this is not always the case. According to empirical data, the correlation is
hypothesized to hold for bacteria that partially overlap in their ecological niche
(Konstantinidis et al. 2006). Sequence diversity dominates for bacteria with identical or almost
entirely overlapping niches (little change in gene content), and gene content diversity is more
pronounced when bacteria occupy separate niches. Ecological adaptation is therefore directly
linked with gene content diversity but less so with sequence diversity.
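The contrast between sequence diversity and gene content diversity can be illustrated with a toy calculation. The following Python sketch is illustrative only: the two strains, their aligned marker sequences, and their gene lists are invented, and the Jaccard distance is used as one simple, assumed measure of gene content difference, not as the method of any study cited here.

```python
def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical positions in two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def gene_content_distance(genes_a: set, genes_b: set) -> float:
    """Jaccard distance: share of genes found in only one of the two genomes."""
    union = genes_a | genes_b
    if not union:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(union)


# Hypothetical strains: identical marker gene, different accessory genes.
marker_a = "ATGGCTAAGCTTGACCGT"
marker_b = "ATGGCTAAGCTTGACCGT"
genome_a = {"rpoB", "gyrB", "recA", "nifH", "tetA"}
genome_b = {"rpoB", "gyrB", "recA", "hupL", "luxI", "ctxA"}

identity = sequence_identity(marker_a, marker_b)
distance = gene_content_distance(genome_a, genome_b)
print(f"marker identity: {identity:.2f}")
print(f"gene content distance: {distance:.2f}")
```

In this invented case the two strains are indistinguishable at the marker gene yet share only three of eight genes, which mirrors the Vibrio and Frankia observations in miniature.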
Impact of LGT on the Phenotypic Diversity of Microbes

Microorganisms exhibit great diversity in their cellular structures, metabolic properties,
interactions, and ecological niches. It is well established that mutation (sequence diversity)
has contributed to the phenotypic diversification of microorganisms. However, growing numbers of
genomic studies suggest that LGT influences the acquisition of novel functions through its effect
on gene content, not sequence, diversity. For example, recent studies of the genomic context and
phylogenetic relatedness of proteorhodopsin genes suggested that they had been transferred by LGT
from marine Archaea to Proteobacteria. This single gene is hypothesized to provide its host with
a competitive advantage by allowing it to harness light energy for cellular function. As these
organisms reside in the photic zone of the ocean, proteorhodopsin allows them to take full
advantage of available UV energy (Frigaard et al. 2006).
In some species, most of the genetic variation and adaptation occurs through LGT. Although
Prochlorococcus species have a conserved core of genes, they show significant variation in the
genes present on genomic islands. These represent the evolutionary hot spots inside their
genomes. It is hypothesized that these genomic islands are acquired by LGT and undergo extensive
rearrangement, suggesting a common mechanism of niche differentiation in microbial species. The
pathogenicity islands of pathogenic bacteria also share the same characteristics (Coleman
et al. 2006). Some genomic island-associated LGTs are thought to be mediated by phages, since
they can carry host genome fragments. For example, the cholera toxin gene in Vibrio cholerae is
actually encoded within the genome of a bacteriophage (CTXφ) that requires the toxin co-regulated
pilus (TcpA), an intestinal colonization factor, as its receptor. TcpA is encoded within the
pathogenicity island named VPI. However, this VPI region mainly constitutes the genome of another
bacteriophage (Faruque and Mekalanos 2012). Thus, two individual LGT events involving these two
phages have the potential to make almost any Vibrio cholerae strain into a potent human pathogen.
Various metabolic properties, virulence, and antibiotic resistance traits can also be carried on
plasmids or transposons or a combination of the two. This makes these genes more likely to be
transferred through LGT. For example, Tn10 is a transposon consisting of a pair of IS10 elements,
a tetracycline resistance determinant, and a regulatory gene. Similarly, transposon Tn5 consists
of two IS50 elements and a three-gene operon that confers resistance to kanamycin, bleomycin, and
streptomycin. Both of these transposons can be incorporated into the chromosomes of
phylogenetically diverse groups of bacteria. Plasmids are the other major mediator of antibiotic
resistance gene acquisition by LGT. Not only are plasmids themselves transfer agents, but they
can also change rapidly through LGT. For example, based on gene organization and sequence
similarity, plasmid pKF3-140 found in Klebsiella pneumoniae has been speculated to have
originated from Escherichia coli (plasmids p1ESCUM and pUTI89) and to have been further modified
by acquiring resistance genes from different enteric bacteria by LGT.
Another genetic element facilitating LGT and phenotypic diversity is the integron. This genetic
element carries genes for the site-specific recombination of mobile gene cassettes into the host
genome. It has been found that about 17 % of the sequenced bacterial genomes have integrons. For
example, many species of Pseudomonas contain integrons with a variable number of gene cassettes
(10–32) that are considered to have been obtained by LGT at the late stage of species segregation
(Vaisvila et al. 2001).
These are only a few representative examples of the contribution of LGT to the phenotypic and
genotypic diversity of microbial populations. Importantly, this diversity is not only driven by
natural selection. Microbes have evolved the ability to sense their environment and generate
diversity as a response to a stressor. For example, the transfer of genomic islands encoding
specific metabolic properties is sometimes controlled by quorum sensing mechanisms. The genomic
island ICEMlSymR7A of Mesorhizobium loti strain R7A encodes proteins required for symbiotic
nitrogen fixation and proteins that regulate its transfer by quorum sensing to nonsymbiotic
mesorhizobia (Ramsay et al. 2009). Another example of stressor-generated genotypic diversity is
CRISPRs. These elements are considered to be an acquired immune system against viruses and
plasmids by which the host identifies foreign DNA in a sequence-specific manner (Horvath and
Barrangou 2010). Experimental evidence of CRISPR-mediated immunity to bacteriophages has been
shown in Streptococcus thermophilus. After exposure to a phage to which S. thermophilus was
susceptible, only a small fraction of cells survived, but the genomes of the survivors had
acquired novel sequences in their CRISPR loci identical to the DNA of the infecting phage. This
is a genomic change directly triggered by an environmental factor. Similarly, the SOS response,
a global regulatory network that is activated in response to DNA damage, has recently been
discovered to induce recombination activity in integrons. This causes an increased acquisition
of gene cassettes, potentially encoding novel phenotypes. This creates a link between genetic
diversity and the environmental factors that induce the SOS response, such as oxidative stress,
pH change, and exposure to antibiotics (Guerin et al. 2009).

Cross-References

▶ Metagenomic Potential for Understanding Horizontal Gene Transfer
▶ Mining Metagenomic Datasets for Antibiotic Resistance Genes
▶ Novel approaches to Pathogen Discovery in Metagenomes
▶ Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies

References

Balskus EP, Case RJ, Walsh CT. The biosynthesis of cyanobacterial sunscreen scytonemin in intertidal microbial mat communities. FEMS Microbiol Ecol. 2011;77(2):322–32.
Bell T, Ager D, Song JI, et al. Larger islands house more bacterial taxa. Science. 2005;308(5730):1884.
Case RJ, Boucher Y, Dahllof I, Holmstrom C, Doolittle WF, Kjelleberg S. Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol. 2007;73(1):278–88.
Coleman ML, Sullivan MB, Martiny AC, et al. Genomic islands and the ecology and evolution of Prochlorococcus. Science. 2006;311(5768):1768–70.
Faruque SM, Mekalanos JJ. Phage-bacterial interactions in the evolution of toxigenic Vibrio cholerae. Virulence. 2012;3(7):556–65.
Frigaard NU, Martinez A, Mincer TJ, DeLong EF. Proteorhodopsin lateral gene transfer between marine planktonic bacteria and archaea. Nature. 2006;439(7078):847–50.
Gravel D, Bell T, Barbera C, et al. Experimental niche evolution alters the strength of the diversity-productivity relationship. Nature. 2011;469(7328):89–92.
Guerin E, Cambray G, Sanchez-Alberola N, et al. The SOS response controls integron recombination. Science. 2009;324(5930):1034.
Horvath P, Barrangou R. CRISPR/Cas, the immune system of bacteria and archaea. Science. 2010;327(5962):167–70.
Hugenholtz P, Tyson GW. Microbiology: metagenomics. Nature. 2008;455(7212):481–3.
Konstantinidis KT, Ramette A, Tiedje JM. The bacterial species definition in the genomic era. Philos Trans Roy Soc London B Biol Sci. 2006;361(1475):1929–40.
Mushegian AA, Peterson CN, Baker CC, Pringle A. Bacterial diversity across individual lichens. Appl Environ Microbiol. 2011;77(12):4249–52.
Normand P, Lapierre P, Tisa LS, et al. Genome characteristics of facultatively symbiotic Frankia sp. strains reflect host range and host plant biogeography. Genome Res. 2007;17(1):7–15.
Ramsay JP, Sullivan JT, Jambari N, et al. A LuxRI-family regulatory system controls excision and transfer of the Mesorhizobium loti strain R7A symbiosis island by activating expression of two conserved hypothetical genes. Mol Microbiol. 2009;73(6):1141–55.
Thompson JR, Pacocha S, Pharino C, et al. Genotypic diversity within a natural coastal bacterioplankton population. Science. 2005;307(5713):1311–3.
Torsvik V, Goksoyr J, Daae FL. High diversity in DNA of soil bacteria. Appl Environ Microbiol. 1990;56(3):782–7.
Vaisvila R, Morgan RD, Posfai J, Raleigh EA. Discovery and distribution of super-integrons among pseudomonads. Mol Microbiol. 2001;42(3):587–601.
Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proc Natl Acad Sci U S A. 1998;95(12):6578–83.
Lessons Learned from Simulated Metagenomic Datasets 353 L
Lessons Learned from Simulated Metagenomic Datasets

Germán Bonilla-Rosso
Laboratorio de Evolución Molecular y Experimental, Instituto de Ecología UNAM,
Universidad Nacional Autónoma de México, Mexico City, Mexico

Definition

A simulation is the dynamic modeling of a real process over time. A simulated metagenomic
dataset is the product of a single simulation iteration of the sequencing process of a microbial
community under a specific set of sequencing-platform model parameters.

Summary

The use of simulations to produce model metagenomic datasets makes it possible to test the
performance of technological methodologies and to test theoretical hypotheses that cannot be
addressed by empirical experimentation. Methodologically, it has been used to evaluate the
performance of assembly programs and the effect of differences in read length and error rate on
the quality of the resulting datasets. Theoretically, it has revealed biases and heterogeneity in
the estimation of several diversity metrics from metagenomic samples. However, the full potential
of the implementation of simulated datasets to metagenomics is still to be revealed.

Introduction

Microbial communities, and the metagenomic datasets resulting from their sequencing, are systems
with high nested complexity. To analyze them, there is a growing need to test the robustness of
new methodological and analytical tools (Angly et al. 2012). These tests could in theory be
carried out by constructing in vitro communities (Morgan et al. 2010), but this approach is
expensive, time-consuming, and limited to communities of reduced complexity, so the alternative
presented is to apply mathematical models and simulations to test the robustness of the tools for
their analysis (Caswell 1988).
A simulation is the imitation of a natural time-ordered sequence of states a system takes in
a given time period with another that is the product of a representative model (Peck 2008). In
other words, simulations are the dynamic imitation of natural processes that follow the changing
states of a system under a particular theoretical model. Simulations are used principally because
the equations in the models cannot be followed in time, but the individual states in the
processes defined by the model can. The models create virtual worlds, with rules defined by the
model parameters, that can be modified and followed in ways that would be too costly or unethical
in real systems, and the simulations that can be run in these modeled worlds can be seen as
individual experimental systems (Winsberg 2003). Most commonly, simulations are theoretical
models used to explain natural phenomena and test the outcome of theoretical hypotheses (Caswell
1988). In computational biology, they are often used to numerically estimate the behavior of
a system that is too complex to be resolved by analytical solutions, by generating a sample of
scenarios that represent stochastically distinct moments of the same state of a modeled system
under particular conditions (Peck 2008).
Since no ecological community (microbial or otherwise) has been sampled to exhaustion, and no
completely and accurately annotated metagenome is available (Mende et al. 2012), the construction
of simulated datasets relies on the available genomic data from the complete genomes of
individual datasets. These simulated datasets have been used to date for two main purposes:
testing the sequencing performance of different platforms and their processing pipelines, and
analyzing the accuracy of a diverse set of alpha and beta diversity estimations.
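The second purpose, checking diversity estimates against a community whose true diversity is known, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any cited study or simulator: the 200-taxon community, its geometric rank-abundance curve, the sample size of 500 reads, and the helper names are all invented, and reads are idealized as error-free taxon labels. The richness estimator is the bias-corrected form of Chao1.

```python
import math
import random
from collections import Counter


def shannon_index(counts):
    """Shannon entropy H' = -sum(p * ln p) over the observed taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from singletons and doubletons."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def simulate_sample(abundances, n_reads, rng):
    """Draw n_reads taxon labels in proportion to the known relative abundances."""
    taxa = list(abundances)
    weights = [abundances[t] for t in taxa]
    return Counter(rng.choices(taxa, weights=weights, k=n_reads))


# Hypothetical source community: 200 taxa with a geometric rank-abundance curve.
true_richness = 200
raw = {f"taxon_{i}": 0.97 ** i for i in range(true_richness)}
norm = sum(raw.values())
community = {t: w / norm for t, w in raw.items()}
true_shannon = shannon_index(list(community.values()))

rng = random.Random(42)  # fixed seed so the run is repeatable
sample = simulate_sample(community, n_reads=500, rng=rng)
counts = list(sample.values())

print("true richness:", true_richness, "observed:", len(counts))
print("Chao1 estimate:", round(chao1(counts), 1))
print("true Shannon:", round(true_shannon, 3),
      "sample Shannon:", round(shannon_index(counts), 3))
```

Because the source community is fully specified, any gap between the estimates and the true values can be attributed to sampling alone. Repeating the draw with different seeds or read counts gives the replication that in vitro communities make costly.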
Simulation of Metagenomic Datasets

To simulate the production of a metagenomic dataset, a program needs three basic components:
a pool of reference sequences, usually annotated complete genomes, from which sequences will be
drawn; a profile that details the composition (taxonomic assignment of species) and structure
(relative abundance of species) of the designed source community; and an error model that
specifies how variability will be introduced to the simulation and that usually accounts for
sequencing-platform-associated errors and rates of mutation introduction. In recent years,
several different simulation software programs have been made available that differ in the type
of sequencing platform supported and the adjustable parameters to model errors. The nature of the
first and most commonly used simulated dataset, and the two most commonly used and representative
software programs available for metagenomic dataset simulations, are reviewed in the following
sections.

The FAMeS Dataset
The first metagenomic simulated dataset was produced by the group led by Konstantinos
Mavrommatis at the Department of Energy Joint Genome Institute (JGI), with the objective of
benchmarking the alternative metagenomic processing pipelines commonly used in the JGI sequencing
facility (Mavromatis et al. 2007). They randomly selected reads from the complete genome projects
of 113 isolates sequenced at JGI as their pool of sequences. Since these reads were derived from
the real clone libraries in the shotgun sequencing process, they incorporate real errors derived
from the Sanger sequencing method. Three different artificial source communities were designed
with contrasting structure and composition: a low-complexity community with a single dominant
near-clonal population like that found in bioreactors (simLC), a moderately complex community
with few dominant populations and several low-abundance ones like those observed in acid mine
drainage biofilms (simMC), and a high-complexity community with no dominant populations such as
those observed in soils and microbial mats (simHC). These datasets were deposited and made
available online as part of the Fidelity of Analysis of Metagenomic Samples program (FAMeS:
http://fames.jgi-psf.org/index.html) as an attempt to standardize the benchmarking of
metagenomic assembly and annotation tools.
These datasets are very atypical in that they were simulated from real sequencing reads, so that
the sampling step from genomes was not simulated. As such, they only model the error distribution
of the Sanger sequencing platform as implemented by the JGI's particular shotgun sequencing
process, which prevents their extrapolation to other sequencing platforms and error models.
Moreover, the lack of replication, the fixed species richness (the 113 isolate genomes from the
pool), and the reduced and arbitrary complexity range of the source community profiles prevent
their use for testing more ecological hypotheses that contrast either species richness or
gradients in structure complexity. These datasets, however, introduced the concept of
benchmarking metagenomic analysis pipelines with simulated datasets, and more recent studies have
used their community profiles for the construction of new simulations with replications and
their extrapolation to different sequencing platforms (Mitra et al. 2010; Charuvaka and Rangwala
2011; Pignatelli and Moya 2011).

MetaSim
One of the first computer programs developed specifically to simulate metagenomic datasets was
developed and is maintained by the group led by Daniel Huson at the University of Tübingen
(Richter et al. 2008). It has been widely used both because of its efficient algorithm and the
benefit of having a GUI. MetaSim uses complete genomes as a reference pool of sequences and by
default can take advantage of the complete genomes available at the NCBI RefSeq database. This
also allows the use of NCBI's taxonomy to construct the source community profiles, either by
providing a relative abundance matrix or by using an interactive graphic taxonomy tree to select
the genomes to be included. Finally, MetaSim provides a large array of adjustable error-model
configurations
such as read length, sequencing depth, error rate, and error distribution. MetaSim includes
a variety of default error models for the three main sequencing platforms (Sanger, 454, Illumina)
that can be easily modified.
MetaSim allows the user to easily develop complex designs of source community profiles by
specifying the richness, structure, and composition of a community via a species-abundance
matrix. These profiles can be saved and several simulations can be run, allowing the comparison
of simulated datasets from the same sample under different sequencing platforms and error models.
Its main limitation is its dependency on the available reference genomes and their associated
taxonomy at NCBI, which as of July 2012 contained more than 2,000 genomes. Finally, MetaSim
includes a tool to simulate the sampling from a set of "evolved" genome offspring derived from
the reference genomes using an evolutionary tree. That is, it mimics real metagenomic datasets,
which usually contain populations of organisms with different degrees of relatedness to the
available reference genomes, and in this way simulates the genetic variability of real, natural
populations.

Grinder
While most metagenomic dataset simulators were developed under a vision of metagenomic process
benchmarking, Grinder, developed and maintained by Florent Angly at the Australian Centre for
Ecogenomics (Angly et al. 2012), is the first simulator with a more ecologically oriented
perspective. Its main novel feature is that it can also simulate amplicon datasets, addressing
the need to benchmark the tools for the analysis of 16S rRNA amplicon datasets that are widely
used in microbial ecology. As a pool of reference sequences, Grinder can use any sequence
database in FASTA format, like the NCBI RefSeq genomes database for metagenome datasets, and
GreenGenes or SILVA for the amplicon datasets. Grinder supports error models for the three main
sequencing platforms (Sanger, 454, Illumina) and allows the implementation of user-defined error
models. It allows for the adjustment of error-model configurations such as genome size bias,
read length, sequencing depth, substitution and error distribution, and homopolymer and read end
error rates for metagenomic datasets, and chimera production and gene copy number for amplicon
datasets (Angly et al. 2012).
Grinder accepts two different methods to provide community profiles. The first is the canonical
species-abundance matrix, where the user simultaneously defines community composition and
structure. The second is to define the community richness and a rank abundance model for the
relative abundance distribution of species; in this case, however, composition is selected
randomly from the species list. Moreover, multiple datasets can be produced simultaneously from
the same profile, both for replication purposes when source communities are identical and to
simulate the sampling of related communities with a defined percentage of shared species (beta
diversity) (Angly et al. 2012).

Lessons Learned from Simulated Metagenomic Datasets

Benchmarking of Technical Aspects
As explained above, the first simulated metagenome comprising the FAMeS dataset (Mavromatis
et al. 2007) was developed to evaluate the fidelity of the sequencing processing pipeline
regarding the assembly and gene prediction of metagenomes derived from shotgun sequencing. They
revealed that the application of common single-isolate genome assemblers resulted in a low
incorporation of reads into contigs and a high degree of chimeric contigs, which in turn can
lead to up to 20 % of inaccurately called genes in metagenomes and errors in functional and
taxonomic annotations (Mavromatis et al. 2007). Although the pipelines and sequencing platform
addressed by Mavromatis et al. (2007) are outdated, several recent studies have confirmed their
findings on the low performance of metagenome assemblers with communities that are more complex
than a few dominant clonal populations, either with new sequencing platforms (Pignatelli and Moya
2011; Mende et al. 2012) or alternative assembly
methods (Pignatelli and Moya 2011; Charuvaka and Rangwala 2011; Mende et al. 2012).
The effect of average read length on gene annotation has been addressed by Wommack et al. (2008).
They simulated the subsampling of existing Sanger-sequenced metagenomic datasets, producing
shorter (<400 bp) reads characteristic of the next-generation sequencing technologies 454 and
Illumina. Their simulations revealed that short reads can miss up to 72 % of the annotated
functions revealed by longer (~750 bp) Sanger reads and can detect only highly conserved
sequences with phylogenetically close relatives in reference databases (Wommack et al. 2008). The
simulations also indicate that even an increase in sampling depth with short reads (as promised
by the Illumina platform) does not improve the annotation achieved by long reads. In addition,
a related study using simulated datasets to assess the effect of sequencing error on gene
prediction (Hoff 2009) revealed that all metagenomic gene prediction tools show a reduced
accuracy at gene calling with increasing sequencing error rates and that their individual
performance seems to be affected by the taxonomic composition of the samples, except when using
Sanger reads with error rates below 0.15 % (Hoff 2009). Pignatelli and Moya (2011) adapted the
FAMeS community profiles to the 454 and Illumina sequencing platforms at a deeper sequencing
coverage and demonstrated that all de novo assemblers produce a significant amount of chimeric
contigs (up to 10 %) that have a profound impact on the functional and phylogenetic annotation of
metagenomic sequences. Since domain and motif databases like Pfam and TIGRfam rely on short
conserved sequences, they may give better annotations at a functionally more general level
(Pignatelli and Moya 2011).
All these studies reveal that the assembly of metagenomic datasets is highly influenced by the
for the assembly of metagenomic datasets like Genovo (Laserson et al. 2011), IDBA-UD (Peng
et al. 2012), and MetaVelvet (Namiki et al. 2012) shows a promising improvement in metagenomic
assembly, although only for low-complexity communities with phylogenetically distant members. An
approach that should be used in all assembly benchmarking studies is the comparison of the
assembly obtained with the mixed simulated metagenomic dataset against the assembly obtained with
an independent assembly of each species, since most simulated datasets are produced from the
annotated complete genomes from isolates, as done by Charuvaka and Rangwala (2011) and Namiki
et al. (2012).

Evaluation of Ecological Aspects
Computer simulations have long been used in community ecology for modeling communities (Garfinkel
1962) and testing the performance of diversity indexes (e.g., Heltshe and Forrester 1983). But
the use of computer-simulated datasets to study the diversity of microbial communities had to
wait until molecular methods were available to study microbial communities (Liu et al. 1997; Bent
and Forney 2008). Simulated communities are the only option to test the performance of diversity
metrics on metagenomic datasets, since currently no natural community has been sampled to
exhaustion and hence no real diversity measure is accurately known against which we can compare
our estimations. The design of an artificial community in vitro and its subsequent sequencing
(Morgan et al. 2010) is at best methodologically and economically unfeasible for testing the
performance of several replicated datasets (Angly et al. 2012).
Three published studies exist that use simulated datasets to evaluate the performance of
community diversity metrics, two of which deal with 16S rRNA amplicon-derived datasets
(Kuczynski et al. 2010; Parks and Beiko 2012)
community composition complexity, depth of and one with metagenomic datasets (Bonilla-
sequencing coverage, and average length of the Rosso et al. 2012).
sequenced reads, discouraging the assembly of Bent and Forney (2008) were the first to
metagenomic datasets. Nevertheless, the recent implement large-scale sequencing simulations
development of software specifically designed to evaluate alpha diversity (species diversity in
individual samples) metrics from 16S rRNA amplicon clone libraries and T-RFLPs. They demonstrated that most alpha diversity metrics are sensitive to the number of rare and uncommon species, which are precisely the ones likely to be undersampled by 16S rRNA amplicon-based techniques (the so-called tragedy of the uncommon). Moreover, they showed that different methods applied to the same community can produce radically different estimations for these metrics (Bent and Forney 2008). Using a replicated simulated dataset of nine communities in a cross-gradient of species richness and dominance, Bonilla-Rosso et al. (2012) demonstrated that the use of conserved protein genes in metagenomic datasets outperforms 16S rRNA genes at reflecting the original community. Moreover, they showed that the most common alpha diversity metrics derived from metagenomic samples are biased because of insufficient sampling and variations in the representation of the taxonomic composition. These last two studies point toward the use of scale-dependent metrics such as Rényi's profiles or Hill's numbers as a better representation of alpha diversity's multidimensional nature.

Two studies have addressed the performance of beta diversity metrics (similarity in species composition between samples) with simulated datasets. The use of simulated datasets to test ecological hypotheses was first implemented with deep sequencing of 16S rRNA amplicons (Kuczynski et al. 2010). Addressing the effect of the environment on community structure, they simulated datasets to model communities that were either shaped along an environmental gradient or partitioned by the environment into discrete clusters. They found that the patterns from environmental gradients were more easily detected than those from ecological clustering, especially when differences between clusters were subtle. Moreover, qualitative methods (richness based) performed better on clustered datasets, while quantitative methods (abundance based) performed better on gradients, so both types of methods should be applied if the underlying pattern is unknown. Finally, they demonstrate that patterns are more readily identified with several low-coverage samples than with few deep-coverage datasets (Kuczynski et al. 2010). These results were further extrapolated to similarity metrics that incorporate phylogenetic information, and it was found that most distance metrics are highly intercorrelated and highly robust to rooting, choice of threshold for defining OTUs, and the presence of basal lineages (Parks and Beiko 2012).

Perspectives

Often obscured by the large amount of data produced, metagenomics is still a very young discipline where a consensus set of rigorously tested analytical tools is still lacking. Moreover, the rapid advance of sequencing technologies causes a constant development and diversification of their accompanying bioinformatic tools and approaches, which require an objective quantification of their performance. This is worsened by the lack of theoretical understanding of the assembly, dynamics, and functioning of natural microbial communities. The use of simulated datasets after sequencing modeling is the best alternative to approach the benchmarking of technical and analytical methodologies as well as the testing of theories and hypotheses. However, a much more efficient benchmarking framework is still needed.

A set of source communities from which new datasets are to be simulated needs to be designed by consensus within the academic community as the minimal standard benchmarking starting point, so that the performance of bioinformatic tools can be compared across studies and sequencing platforms. This was the original intention of the FAMeS dataset (Mavromatis et al. 2007), but currently almost every new tool is tested against a tailored simulated dataset, in part because the three FAMeS communities cover a narrow range of community composition options. Ideally, this standard source community dataset should be designed in a way that spans a wide spectrum across three dimensions of
assembled communities: number of species (richness), relative abundance (dominance), and taxonomic composition (phylogenetic relatedness). As an example, the effect of the presence of closely related strains on both the assembly and diversity estimation of a metagenomic sample is largely unknown.

Variability is a factor that should be considered more often in simulated datasets. There is a need to incorporate variation in platform-specific error models, and the incorporation of empirical thresholds for best and worst case scenarios in simulation software would greatly improve this. Moreover, because simulations are stochastic, two independent simulations from the same source community will produce different datasets, and this sampling variability should be incorporated into the benchmarking and hypothesis testing process so that variability can be included in the models and significant differences can be tested statistically.

Finally, it should be noted that the potential of simulated datasets for metagenomics is far from fully explored, since they have mostly been used to test the performance of technical methodologies. As mentioned by Caswell (1988), they can be readily applied for exploring the consequences of proposed ecological theories, finding simple explanatory models that can reproduce the observed patterns in natural communities, and aiding in the design of accurate future experiments. Furthermore, applying replication to variability modeling will also permit the identification of theoretical thresholds for the detection of differences between communities and as such will help define the scope and limits of metagenomics.

Cross-References

▶ A 123 of Metagenomics
▶ Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads
▶ Approaches in Metagenome Research: Progress and Challenges
▶ Computational Approaches for Metagenomic Datasets
▶ Extraction Methods, Variability Encountered in
▶ Microbial Ecology in the Age of Metagenomics: An Introduction
▶ Mock Community Analysis
▶ Next-Generation Sequencing for Metagenomic Data: Assembling and Binning

References

Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW. Grinder: a versatile amplicon and shotgun sequence simulator. Nucleic Acids Res. 2012;40(12):e94.
Bent SJ, Forney LJ. The tragedy of the uncommon: understanding limitations in the analysis of microbial diversity. ISME J. 2008;2(7):689–95.
Bonilla-Rosso G, Eguiarte LE, Romero D, Travisano M, Souza V. Understanding microbial community diversity metrics derived from metagenomes: performance evaluation using simulated data sets. FEMS Microbiol Ecol. 2012;82:37–49. doi:10.1111/j.1574-6941.2012.01405.x.
Caswell H. Theory and models in ecology: a different perspective. Ecol Mod. 1988;43(1–2):33–44.
Charuvaka A, Rangwala H. Evaluation of short read metagenomic assembly. BMC Genomics. 2011;12 Suppl 2:S8.
Garfinkel D. Digital computer simulation of ecological systems. Nature. 1962;194(4831):502–7.
Heltshe JF, Forrester NE. Estimating species richness using the jackknife procedure. Biometrics. 1983;39(1):1–11.
Hoff KJ. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009;10(1):520.
Kuczynski J, Liu Z, Lozupone C, et al. Microbial community resemblance methods differ in their ability to detect biologically relevant patterns. Nat Methods. 2010;7(10):813–9.
Laserson J, Jojic V, Koller D. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18(3):429–43.
Liu WT, Marsh TL, Cheng H, Forney LJ. Characterization of microbial diversity by determining terminal restriction fragment length polymorphisms of genes encoding 16S rRNA. Appl Environ Microbiol. 1997;63(11):4516–22.
Mavromatis K, Ivanova N, Barry K, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4(6):495–500.
Mende DR, Waller AS, Sunagawa S, et al. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS One. 2012;7(2):e31386.
Mitra S, Schubach M, Huson DH. Short clones or long clones? A simulation study on the use of paired reads in metagenomics. BMC Bioinformatics. 2010;11(Suppl 1):S12.
Morgan JL, Darling AE, Eisen JA. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One. 2010;5(4):e10209.
Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40:e155. doi:10.1093/nar/gks678.
Parks DH, Beiko RG. Measures of phylogenetic differentiation provide robust and complementary insights into microbial communities. ISME J. 2012;7:173–83. doi:10.1038/ismej.2012.88.
Peck SL. The hermeneutics of ecological simulation. Biol Philos. 2008;23(3):383–402.
Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.
Pignatelli M, Moya A. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. PLoS One. 2011;6(5):e19984.
Richter DC, Ott F, Auch AF, Schmid R, Huson DH. Metasim – a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3(10):e3373.
Winsberg E. Simulated experiments: methodology for a virtual world. Philos Sci. 2003;70(1):105–25.
Wommack KE, Bhavsar J, Ravel J. Metagenomics: read length matters. Appl Environ Microbiol. 2008;74(5):1453–63.
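
Illustrative note: this entry points toward scale-dependent metrics such as Rényi's profiles or Hill's numbers as a better representation of alpha diversity. The following minimal Python sketch shows how such a profile could be computed from a vector of species abundances. The function names and the two toy communities are assumptions made for illustration only; they are not taken from any of the cited studies or tools.

```python
import math


def hill_number(abundances, q):
    """Hill number (effective number of species) of order q."""
    total = sum(abundances)
    # Relative abundances of the species that were actually observed.
    p = [a / total for a in abundances if a > 0]
    if q == 1:
        # The order-1 value is defined as a limit: exp(Shannon entropy).
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


def renyi_entropy(abundances, q):
    """Renyi entropy of order q, the logarithm of the Hill number."""
    return math.log(hill_number(abundances, q))


# Two hypothetical communities with the same richness but different dominance.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]
for q in (0, 1, 2):
    print(q, round(hill_number(even, q), 3), round(hill_number(skewed, q), 3))
```

At order 0 the profile counts species regardless of abundance, so both toy communities look identical; higher orders give increasing weight to dominant species, which is why a profile across several orders can separate communities that a single index cannot.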
M

MEMOSys: Platform for Genome-Scale Metabolic Models

Stephan Pabinger1,2 and Zlatko Trajanoski1
1 Division of Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
2 AIT – Austrian Institute of Technology, Health & Environment Department, Molecular Diagnostics, Vienna, Austria

Synonyms

Bioinformatics platform for genome-scale metabolic models

Definition

MEMOSys is a web-based platform for constructing, managing, and storing genome-scale metabolic models. It provides sophisticated query and data exchange mechanisms, offers an integrated version control system, and allows researchers to easily compare models. MEMOSys is freely available at http://www.icbi.at/memosys under the GNU Affero General Public License.

Introduction

Driven by recent innovations in sequencing technology, genome-scale metabolic models have been compiled for a number of different organisms (Henry et al. 2010). Each model is in general a network consisting of metabolites that are connected by reactions. Genome-scale models include all reactions occurring in a living organism and are primarily reconstructed using the annotated genome and literature information. Metabolic models can be used to provide an alternative approach for integrating large amounts of data about biological systems to gain novel insights into their interconnected functionality (Kay and Wren 2009). Moreover, they have already been used for a variety of different purposes including strain engineering (Benedict et al. 2012), gene deletion studies (Choi et al. 2010), biofuel production (de Jong et al. 2011), and interpretation of gene and protein expression data (Gowen and Fong 2010).

The generation of new models is a well-documented iterative process comprising a multitude of different steps (Thiele and Palsson 2010), where often 10 % of construction time is needed to model 90 % of reactions and 90 % to collect the remaining 10 % (Rocha et al. 2008). Until the final version of a model is assembled, usually several intermediate revisions are generated. During this reconstruction process, simulated results are constantly compared to experimental data, and if they do not agree, the model is critically reevaluated (Baart and Martens 2012). It is therefore of great importance to be able to review all changes, extract previous versions, compare different versions of one model, and have access to easy to use software for creating and manipulating models.
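
To make the idea of comparing different versions of one model more concrete, the following minimal Python sketch treats each revision as a mapping from reaction identifiers to reaction equations and reports what was added, removed, or changed between two revisions. It is an illustration under simplifying assumptions, not the way MEMOSys is implemented; the function name, the dictionary layout, and the example reactions are all hypothetical.

```python
def compare_revisions(old, new):
    """Compare two model revisions given as {reaction_id: equation} dicts."""
    old_ids, new_ids = set(old), set(new)
    shared = old_ids & new_ids
    return {
        # Reactions present only in the newer revision.
        "added": sorted(new_ids - old_ids),
        # Reactions present only in the older revision.
        "removed": sorted(old_ids - new_ids),
        # Reactions kept under the same identifier but with a new equation.
        "changed": sorted(r for r in shared if old[r] != new[r]),
        "unchanged": sorted(r for r in shared if old[r] == new[r]),
    }


# Two hypothetical revisions of a tiny model.
rev1 = {"R1": "glc + atp -> g6p + adp", "R2": "g6p <-> f6p"}
rev2 = {"R1": "glc + atp -> g6p + adp + h", "R3": "f6p + atp -> fdp + adp"}
diff = compare_revisions(rev1, rev2)
print(diff)
```

The same set operations apply when two different models are compared, provided their components share identifiers; the groups correspond to the regions of a two-set Venn diagram.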
The MEtabolic MOdel research and development System (MEMOSys) (Pabinger et al. 2011, 2014) has been developed to support the construction, modification, and management of genome-scale metabolic models. It is a web-based bioinformatics platform that uses an automatic version control system to store the complete developmental history of all model components. This allows researchers to access the entire model at any time during the iterative model building process. Furthermore, MEMOSys offers sophisticated query mechanisms and supports the exchange of models using standardized formats.

Model Management

Database Structure

MEMOSys has been designed to store all properties of a metabolic model in a database. The model itself is represented by a name, its unique model identifier, and the reactions, genes, and metabolites it contains. In addition, it is assigned to an organism and may contain references to an image that graphically represents the metabolic network. MEMOSys supports the upload of arbitrary additional data files, which can be directly linked to stored models. Such files may include experimental data sets that were used to validate the model during the reconstruction process. In addition, analysis results from external tools can be directly attached to the investigated model.

Each model has an arbitrary number of reactions, which are described by a multitude of properties, including name, Enzyme Commission (EC) number, reactants, products, and reversibility. Reactions can be linked to citations in order to provide primary literature evidence and are assigned to a subsystem, which is used to group reactions into metabolic pathways. MEMOSys supports the definition of lower and upper bound constraints, which are automatically included when the model is exported into a file and can then be directly used in constraint-based analyses. Reactants and products of reactions contain the metabolite itself and the stoichiometric coefficient for that metabolite, and they are assigned to a compartment. Compartments are linked to the corresponding Systems Biology Ontology (SBO) term and arranged in a hierarchy to support fine-grained compartmentalization when exporting models. SBO is a hierarchically arranged set of controlled, relational vocabularies of terms that are commonly used in mathematical modeling.

MEMOSys uses an integrated balance check mechanism that validates the elemental composition of consumed and produced reactants. The check is automatically executed when reactions are modified or during the import of a new model.

Each organism of a model can be annotated with the corresponding BioCyc (Karp et al. 2005) identifier. BioCyc is a biological database collection, which includes highly curated genome and pathway information for individual organisms. In order to facilitate the assignment process, MEMOSys dynamically fetches all available organisms from BioCyc and provides suggestions to select the correct identifier.

Genes and their relationship to other genes and reactions can be described using hierarchical structures and Boolean operators (e.g., [gene1 or gene2] and gene3). They are linked to the corresponding BioCyc pages if the organism identifier and the unique gene symbol are provided. In addition, for genes having a reference to the Universal Protein Resource (UniProt) (Magrane and Consortium 2011) database, MEMOSys offers a mechanism to download the amino acid sequence of the encoded protein and provides an integrated system to fetch additional information from the UniProt entry. UniProt is a popular, freely accessible comprehensive resource containing protein sequence data as well as functional and annotation information.

Genome-scale metabolic models rely on annotations to unambiguously identify model components. History has shown that biologists have been using different notations and naming schemes for the same gene or protein. MEMOSys allows researchers to annotate reactions, metabolites, genes, and compartments with references to external databases using the minimum information requested in the annotation of biochemical models (MIRIAM) (Le Novère et al. 2005) notation. Every MIRIAM identifier is a single
unique string, which unambiguously references an object in an external resource and facilitates scientific collaborations and model comparability. MEMOSys automatically transforms MIRIAM annotations into web addresses and displays direct links to the external data sources.

Furthermore, the application includes a mechanism to easily define additional external databases, which can then be used by all model components to create further references and annotations.

Due to the iterative model building process, components may be modified several times by different members of the reconstruction team. To facilitate the discussion between researchers, MEMOSys features an integrated web board that allows attaching discussions to every model component. Associated threads are shown at each component page, and the latest comments of all discussions are displayed on the home screen. In addition, global threads can be created to discuss general properties of models.

Querying System

MEMOSys uses enhanced lists to present and query data stored in the database. Every list can be customized to display a selection of available attributes. They are fully sortable and incorporate attributes from different tables into one view, which allows comprehensive data representations. MEMOSys supports fine-grained searches where different restrictions can be combined to query for a specific question. In addition, the application offers an easy to use quick search mechanism that allows users to easily search for reactions, metabolites, genes, and organisms.

As all model components are highly connected with each other, MEMOSys displays links to referenced components throughout the system and allows free navigation within and across all stored models.

Versioning

The construction of a metabolic model is an iterative task, which has been broken down into 96 steps (Thiele and Palsson 2010) generating several intermediate versions until the final model is established. Therefore, MEMOSys integrates an automatic version control system, which creates a new revision for every modification of a model component. This system allows researchers to access the complete model history and query, compare, and export previous versions of a model. Each modification can be annotated with a comment, and the complete change history is displayed as a list at the respective component pages. The home screen of the application lets the user specify which version of a model should be used and lists the latest modifications for metabolites and reactions.

Data Access and Supervision

MEMOSys is a multiuser application using four different user classes to control data access: (a) unregistered visitors are allowed to view accepted, publicly available versions of models; (b) registered users are able to display, in addition to publicly available models, accepted versions of assigned models; (c) editors have access to all versions of their own models and are able to create, update, and delete model components; in addition, they are allowed to upload files to the application and import models; (d) administrators are editors with additional rights to access all models, change the public availability of models, and accept modifications.

Each modification of a model component is at first marked as pending and needs to be confirmed by an administrator. Upon approval of a modification, a new internal revision number is assigned to the model. In addition to the automatically set revision number, administrators can assign major version numbers to each model. MEMOSys differentiates between publicly available models, which are visible to all visitors and contain all accepted modifications, and restricted models that are only visible to registered users and editors of the assigned models.

Comparison

As the construction of a draft genome-scale metabolic model is getting more and more a routine
application, future developments will strongly rely on already existing reconstructions of related organisms. In addition, researchers are often interested in the subtle differences between organisms when exploring specific biological functionalities. Hence, MEMOSys offers a flexible and intuitive mechanism to assess the similarity between models, allowing users to compare any version of different models. Furthermore, it is possible to compare two versions of the same model to identify development changes.

The first section of the comparison result presents Venn diagrams that graphically display the calculated differences for reactions, metabolites, and genes. Next, restrictions on the used models can be set to display only differences in selected compartments and subsystems. In addition to the graphical representation, the application shows detailed lists of unique and equal model components and uses tabs to switch between the lists of reactions, metabolites, and genes. Every model component is connected to the corresponding page where detailed information is presented.

Data Exchange

MEMOSys features the export of current metabolite and reaction query result lists into Excel or PDF files, where only the active result set is included. Since several methods and toolboxes which analyze genome-scale metabolic models have been published over the last years (Baart and Martens 2012), MEMOSys provides a sophisticated data exchange mechanism that allows the export of models into valid SBML files. The Systems Biology Markup Language (SBML) (Hucka et al. 2003) provides a common intermediate format that can be used to define models of regulatory networks, metabolic pathways, signaling pathways, and gene regulation networks.

The exported files are compliant with the consensus yeast format (Herrgård et al. 2008) or with the COBRA toolbox format (Schellenberger et al. 2011). Researchers are able to export all available versions of a model and restrict the set of exported reactions by either including only reactions that are in certain subsystems or using the result of a reaction query as input for the export mechanism.

MEMOSys features three different ways to assign reactions and metabolites to compartments (compartmentalization), which allow researchers to directly use exported models in analysis tools that do not support a fine-grained assignment of reactions to compartments.

The system supports the import of models that are encoded in a valid format as defined by the consensus yeast reconstruction group. In addition, existing models in SBML format can be used to improve the annotation of stored model components (see Fig. 1).

Installation

The application itself and the source code of MEMOSys are freely available under the GNU Affero General Public License. As MEMOSys is a web application, it is recommended to install it on a server system and set appropriate access permissions for potential users. A detailed user guide and installation instructions are available at the distribution website. MEMOSys is available for download at http://www.icbi.at/MEMOSys.

Summary

During the last years, numerous genome-scale metabolic models have been developed for a multitude of different organisms. They are a promising approach to systematically analyze complex cellular systems and have been successfully applied for improving gene annotation, increasing the product yield, and predicting the effect of gene deletions.

The web-based MEtabolic MOdel research and development System (MEMOSys) is a versatile bioinformatics platform for the management, storage, modification, and development of genome-scale metabolic models. It facilitates the construction of new models by providing a built-in version control system, which allows researchers to access the complete reconstruction history.
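
As an illustration of the kind of consistency check a model management platform can apply during reconstruction, the elemental balance check described under Database Structure can be sketched in a few lines of Python. This is a simplified sketch, not the MEMOSys implementation: the function names are hypothetical, formulas are assumed to be plain element-count strings without brackets or charges, and stoichiometric coefficients are assumed to be positive numbers.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets, no charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def element_imbalance(reactants, products, formulas):
    """Return {element: product-side surplus}; an empty dict means balanced."""
    net = Counter()
    for metabolite, coeff in products.items():
        for element, n in parse_formula(formulas[metabolite]).items():
            net[element] += coeff * n
    for metabolite, coeff in reactants.items():
        for element, n in parse_formula(formulas[metabolite]).items():
            net[element] -= coeff * n
    # Keep only the elements whose atom counts differ between the two sides.
    return {element: n for element, n in net.items() if n != 0}


# Hypothetical metabolite table and two test reactions.
formulas = {"H2": "H2", "O2": "O2", "H2O": "H2O"}
print(element_imbalance({"H2": 2, "O2": 1}, {"H2O": 2}, formulas))  # balanced reaction
print(element_imbalance({"H2": 1, "O2": 1}, {"H2O": 1}, formulas))  # one oxygen atom is lost
```

A check of this form can be run whenever a reaction is edited or a model is imported, which is when the entry states that MEMOSys executes its own check.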
MEMOSys: Platform for Genome-Scale Metabolic Models, Fig. 1 The user interface for improving the annotation of metabolites and genes. After loading an SBML file, the system identifies new or different annotations and offers the possibility to select the correct annotation for each modification. In addition, empty model component fields can be filled with new annotations, or all stored annotations can be replaced with the currently loaded ones
Research on existing models is facilitated by a powerful search system, a feature-rich comparison mechanism, and standardized references to external databases.

MEMOSys provides customizable data exchange mechanisms using the SBML format to enable further analysis in external tools and supports different user roles and access rights to allow collaborations across departments and universities. The system is freely available at http://www.icbi.at/MEMOSys.

Cross-References

▶ KEGG and GenomeNet, New Developments, Metagenomic Analysis
▶ New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE

References

Baart GJE, Martens DE. Genome-scale metabolic models: reconstruction and analysis. Methods Mol Biol. 2012;799:107–26.
Benedict MN, Gonnerman MC, Metcalf WW, et al. Genome-scale metabolic reconstruction and hypothesis testing in the methanogenic archaeon Methanosarcina acetivorans C2A. J Bacteriol. 2012;194:855–65.
Choi HS, Lee SY, Kim TY, et al. In silico identification of gene amplification targets for improvement of lycopene production. Appl Environ Microbiol. 2010;76:3097–105.
de Jong B, Siewers V, Nielsen J. Systems biology of yeast: enabling technology for development of cell factories for production of advanced biofuels. Curr Opin Biotechnol. 2011;23:624–30.
Gowen CM, Fong SS. Genome-scale metabolic model integrated with RNAseq data to identify metabolic states of Clostridium thermocellum. Biotechnol J. 2010;5:759–67.
Henry CS, DeJongh M, Best AA, et al. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat Biotechnol. 2010;28:977–82.
Herrgård MJ, Swainston N, Dobson P, et al. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nat Biotechnol. 2008;26:1155–60.
Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics. 2003;19:524–31.
Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, Ahrén D, Tsoka S, Darzentas N, Kunin V, López-Bigas N. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucl Acids Res. 2005;33(19):6083–89.
Kay E, Wren BW. Recent advances in systems microbiology. Curr Opin Microbiol. 2009;12:577–81.
Le Novère N, Finney A, Hucka M, et al. Minimum information requested in the annotation of biochemical models (MIRIAM). Nat Biotechnol. 2005;23:1509–15.
Magrane M, Consortium U. UniProt knowledgebase: a hub of integrated protein data. Database (Oxford). 2011;2011:bar009.
Pabinger S, Rader R, Agren R, et al. MEMOSys: bioinformatics platform for genome-scale metabolic models. BMC Syst Biol. 2011;5:20.
Pabinger S, Snajder R, Hardiman T, Willi M, Dander A, Trajanoski Z. MEMOSys 2.0: an update of the bioinformatics database for genome-scale models and genomic data. Database. 2014;bau004. doi:10.1093/database/bau004. Published online February 14, 2014.
Rocha I, Förster J, Nielsen J. Design and application of genome-scale reconstructed metabolic models. Methods Mol Biol. 2008;416:409–31.
Schellenberger J, Que R, Fleming RMT, et al. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0. Nat Protoc. 2011;6:1290–307.
Thiele I, Palsson BØ. A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc. 2010;5:93–121.

MetaBin

Vineet K. Sharma1 and Todd D. Taylor2
1 MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research, Bhopal, India
2 Laboratory for Integrated Bioinformatics, Core for Precise Measuring and Modeling, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan

Synonyms

Taxonomic assignment; Taxonomic binning; Taxonomic classification
Definition

MetaBin: Taxonomic binning of metagenomic sequences.

Introduction

The first, and primary, challenge in metagenomic data analysis is to ascertain the genomic origin of metagenomic sequences and to make appropriate taxonomic assignments (Tringe and Rubin 2005; McHardy et al. 2007; Sharma et al. 2012). Composition-based and homology-based classification of metagenomic sequences are the two main approaches that are currently used (McHardy et al. 2007; Huson et al. 2007). Among the two, homology-based methods are more sensitive and accurate but require a large amount of time to generate the BLAST alignments, which are used as an input for these programs. The composition-based approach is exploited by classification tools such as PhyloPythia, TETRA, and TACOA, for taxonomic classification of metagenomic sequences (Diaz et al. 2009; McHardy et al. 2007; Teeling et al. 2004). These methods require prior training using longer reads (>800 bp) to carry out the classification, and thus the classifications remain limited to higher taxonomic levels. The homology-based approach assesses the taxonomic identity of a read from the results of a homology-based search against a known reference sequence database, which is usually the NCBI non-redundant (NR) database (Sayers et al. 2011). Examples of some homology-based tools are MEGAN and SOrt-ITEMS (Huson et al. 2007; Monzoorul et al. 2009). Both of these carry out taxonomic binning based on the BLAST bit-score and lowest

reference databases and are able to carry out classification of metagenomic sequences at lower taxonomic levels (genus or species) when a comprehensive reference database is used. Another homology-based method, WebCARMA, scans for the presence of conserved Pfam domains and protein families in the metagenomic reads (Gerlach et al. 2009).

The motivation to develop a better homology-based algorithm for taxonomic classification came from the fact that none of the available methods are comprehensive, in that they have not considered some key features of metagenomic sequences which could result in increased and more accurate taxonomic assignments. Therefore, in this entry, a novel algorithm called "MetaBin" is presented which exploits the information from all possible ORFs (complete or partial) for each sequence read while carrying out the taxonomic assignment. This algorithm is faster and results in much higher accuracy and sensitivity for taxonomic classification. It can be used for the taxonomic assignments of various read lengths (45 bp, single or paired end) which are commonly generated using available traditional and next-generation sequencing technologies.

Methods

Reference Database Construction and Simulated Reads

The non-redundant (NR) sequence database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/) was retrieved from NCBI (Sayers et al. 2011). In addition, genomic sequences of 25 completed bacterial genomes belonging to different taxonomic lineages were retrieved (ftp.ncbi.nih.gov/
common ancestor (LCA) approach. If a read genomes/Bacteria). Local versions of the NR
shows a match with multiple genomes, it is database were created, to test the performance
assigned to the common taxonomic ancestor of MetaBin, by removing all sequences belong-
(higher level) of the hits. Since these are only ing to the associated genus and family. This helps
based on bit scores, they may lead to incorrect in assessing the performance of MetaBin on reads
or nonspecific taxonomic assignment. The for which no genome of the genus (novel
homology-based methods primarily depend on genome) is present in the NRminusFamily or
the representation of genomic sequences in the NRminusGenus database. The reads created
from these genomes are similar to reads of novel or yet unknown genomes because the NRminusGenus or NRminusFamily databases do not contain any genome of that genus. Simulated read datasets were created using the MetaSim program to represent Sanger (read length ~800 bp) and 454 (read lengths of ~400 and ~250 bp) sequences (Richter et al. 2008). A Perl script was developed for generating 1,000 simulated reads of length ~75 bp and ~45 bp, respectively, from each of the bacterial genomes, since the option to create short reads was absent in MetaSim.

The metagenomic sequences for a real metagenomic dataset were taken from human gut samples from a single Spanish male individual generated by Illumina sequencing (V1.CD-2, age 49, BMI 27.76, 20,707,369 high-quality reads, library 090107) (ftp://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/) (Qin et al. 2010). The sample sequences for the Sargasso Sea (Sargasso Sea Subsample 1) were downloaded from http://www-ab.informatik.uni-tuebingen.de/software/megan/old-datasets (Huson et al. 2007). This set contains the first 10,000 reads from Sample 1 of the Sargasso Sea dataset (Venter et al. 2004).

BLAST (version 2.2.22, ftp://ftp.ncbi.nih.gov/blast/) was downloaded from NCBI. MEGAN (version 3.8) (http://www-ab.informatik.uni-tuebingen.de/data/software/megan/download/welcome.html) (Huson et al. 2007), SOrt-ITEMS (http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS) (Monzoorul et al. 2009), and TACOA (version 1.0, http://www.cebitec.uni-bielefeld.de/brf/tacoa/tacoa.html) (Diaz et al. 2009) were retrieved from their respective sites. WebCARMA (version 1.0) was run from its Web server (http://webcarma.cebitec.uni-bielefeld.de/cgi-bin/webcarma.cgi) (Gerlach et al. 2009).

Algorithm Development

MetaBin provides significant improvements over currently existing homology-based methods for better taxonomic assignments. It reduces (up to 1,000-fold) the amount of time needed to generate the alignments by using Blat (Kent 2002) as a faster alignment method in place of BLASTX, which brings the analysis time close to that of composition-based methods. This feature makes it practical to use a more accurate and sensitive homology-based approach for both Web- and console-based high-throughput analysis of large datasets.

A unique approach has been adopted which considers the taxonomic information from all verified complete or partial ORFs present in a read and then assigns a taxonomic bin. This helps to make correct assignments of reads of diverse lengths to different taxonomic bins. Since our procedure comprehensively considers all imaginable cases, the results are more accurate and specific, and the assignments are not limited by read length. (Details are provided in the manuscript, Sharma et al. 2012.)

The taxonomic binning of the simulated read datasets was carried out using MetaBin and MEGAN, and the assignments were counted at three levels, namely, “Genus,” “Phylum,” and “Higher.” The “correct assignments” were those where the assigned phylum was the same as the expected phylum, that is, where the read was assigned to its own phylum. Only the intragenic reads were considered to calculate sensitivity and the positive predictive value (PPV) because the NR reference database contains only protein sequences, and thus the reads coming from known protein coding regions (intragenic) are expected to find a match. The following standard formulae were used to calculate sensitivity and PPV:

\[ \text{Sensitivity (\%)} = \frac{TP}{TP + FN} \times 100 \]

\[ \text{Positive predictive value (PPV) (\%)} = \frac{TP}{TP + FP} \times 100 \]

True positive (TP) = number of reads assigned with correct (expected) phylum
False positive (FP) = number of reads assigned to other (incorrect) phylum
False negative (FN) = number of unassigned intragenic reads plus number of reads assigned above the phylum level (higher)
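
The sensitivity and PPV definitions given in the Methods can be illustrated with a short sketch. This is not MetaBin's own code (MetaBin was written in Perl); the function names and the encoding of read assignments (a phylum name, the string "higher", or None for an unassigned read) are assumptions made only for this illustration.

```python
def count_outcomes(assignments, expected_phylum):
    """Tally TP, FP, and FN for the intragenic reads of one genome.

    Each entry of `assignments` is a phylum name, the string "higher"
    (read binned above the phylum level), or None (read left unassigned).
    This encoding is hypothetical and chosen only for the example.
    """
    tp = fp = fn = 0
    for phylum in assignments:
        if phylum is None or phylum == "higher":
            fn += 1  # unassigned, or assigned above the phylum level
        elif phylum == expected_phylum:
            tp += 1  # assigned to the correct (expected) phylum
        else:
            fp += 1  # assigned to another (incorrect) phylum
    return tp, fp, fn


def sensitivity_and_ppv(tp, fp, fn):
    """Return (sensitivity %, PPV %) using the formulae in the text."""
    sensitivity = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    ppv = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv


# Example: 12 intragenic reads simulated from a Firmicutes genome.
reads = ["Firmicutes"] * 6 + ["Proteobacteria"] * 4 + ["higher", None]
tp, fp, fn = count_outcomes(reads, "Firmicutes")
print(sensitivity_and_ppv(tp, fp, fn))  # (75.0, 60.0)
```

In this example 6 reads are true positives, 4 are false positives, and 2 are false negatives, so sensitivity is 6/8 and PPV is 6/10.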
The average sensitivity and PPV were calculated for all simulated read datasets aligned with the complete NR database or the NR-G versions.

MetaBin Development

The MetaBin algorithm was developed in Perl (version 5.10.1), and the dendrogram images were generated using the Perl GD module. Options are provided to change run parameters such as bin size, minimum bit-score, and the bit-score range used to select hits; to create a dendrogram image after comparing the proportions of each taxonomic group in the selected metagenomes; and to display the respective proportions as a pie chart. The algorithm can be used for the taxonomic assignments of both single- and paired-end sequence reads. A user-friendly website (http://metabin.riken.jp/) was developed on our server, including detailed instructions for installation, usage, and updating of the taxonomy database. A free stand-alone executable program is also provided and can be downloaded for different operating systems including Windows, Linux, and Mac.

Results

The overall performance of MetaBin was found to be superior to the other available tools such as MEGAN, SOrt-ITEMS, TACOA, and WebCARMA for all read datasets. It assigned a higher percentage of reads to their correct genus and phylum, as compared to the other methods. Particularly for the short (<100 bp) Illumina reads, it assigned up to 18 % more reads to their correct taxonomic levels. This ability of MetaBin to make more accurate assignments at the lower and more specific taxonomic levels is both useful and unique. For all simulated read datasets, the average sensitivity and PPV of MetaBin were similar to or higher than those of MEGAN, especially for short reads. For ~75 bp reads, MetaBin showed up to 6 % and 16.8 % higher average sensitivity as compared to MEGAN and SOrt-ITEMS, respectively. For ~45 bp reads, MetaBin showed up to 32 % and 46 % higher average sensitivity as compared to MEGAN and SOrt-ITEMS, respectively.

The performance of MetaBin was also evaluated on real metagenomic data: recent human gut data obtained by Illumina sequencing (short reads) from a European male individual were analyzed using MetaBin with Blat as the alignment program. Only those bins containing at least 10,000 reads were considered under default parameter conditions. The analysis of such a large metagenomic dataset demonstrates the ability of MetaBin to work on real metagenomic datasets. In this analysis, Bacteroidetes was found to be the most abundant phylum (77.4 %), followed by Firmicutes (16.8 %), Proteobacteria (3.5 %), Actinobacteria (1.7 %), Cyanobacteria (0.27 %), and Euryarchaeota (0.24 %). These results corroborate previous observations (Kurokawa et al. 2007).

The performance was also evaluated using longer (~800 bp) reads obtained from the Sargasso Sea dataset. Using this common dataset, the results of MetaBin, MEGAN, and SOrt-ITEMS were compared. MetaBin and MEGAN both predicted a similar number of bins; however, MetaBin assigned comparatively more reads (nearly twice the number of reads at the species level) to each of these common bins, which shows its higher sensitivity and higher accuracy. The performance of SOrt-ITEMS was poor compared to both MetaBin and MEGAN. A brief comparison of MetaBin was also carried out with one of the composition-based methods (TACOA) and with another method based on homology to protein families (WebCARMA) using the above dataset. Both the composition- and protein family-based methods showed limitations in making comprehensive taxonomic assignments and performed poorly as compared to homology-based methods.

The Web Server

Different pages are provided on the Web server with several options for carrying out online taxonomic analysis. The main page is the
“Application” page, where the user can submit and carry out taxonomic analysis of either sequence reads or Blastx output (Fig. 1).

MetaBin, Fig. 1 Screenshot of “application” page using a sample query

Two options, BLAT and BLAST, are provided to generate the alignments. The input sequences should be submitted in FASTA format; ORFs are predicted for them, and the qualified ORFs are aligned against the NCBI NR database using Blat. This output is used to classify the sequences into their appropriate taxonomic bins. The other option, BLAST, uses Blastx for generating the alignments and takes much longer than Blat. The input parameters such as minimum bit-score (Blat or Blastx output), bit-score range to select hits, and bin size (minimum number of reads needed to form a taxonomic bin) can be changed or used as default. The “Results” page provides the output files in tab-delimited format and displays thumbnail images of the taxonomic tree (*.png file) and functional annotation of the reads using COGs functional classes. The results can be downloaded from the website (Fig. 2).

The “Visualization” page provides several options for displaying the results and carrying out comparative analysis (Fig. 3).

An option to upload the resultant *.json file generated after using the stand-alone version for additional Web-based analyses is also provided. There are options to visualize the taxonomic tree and prepare a “composition chart” for a single dataset. The composition chart gives an overview of the microbial distribution in the dataset and shows their abundance values. Another option, “Compare Metagenome Profiles,” is available to compare the taxonomic profiles of up to five metagenomic datasets.

The stand-alone console-based version is provided to analyze large metagenomic datasets locally on the user’s system after installation. A free stand-alone executable program is available for download for several operating systems including Linux, Mac, and Windows.

Discussion

Homology-based approaches are more common and considered to be more specific and useful for diverse read lengths as compared to composition-based approaches. However, their implementation on large metagenomic datasets is limited due to the longer analysis time. The MetaBin algorithm overcomes this limitation and provides a significant improvement over the currently existing homology-based methods for better and faster taxonomic assignments by using a more specific ORF-based approach. It carries out more accurate and specific taxonomic assignments at both genus and species levels. The replacement of BLAST by Blat in MetaBin makes it possible to employ a more accurate and sensitive homology-based approach for the high-throughput analysis of large datasets and also for the development of a Web-based community server. The performance of this approach was validated using
both simulated reads and real metagenomic datasets. In addition, it can be a tool of choice for large metagenomic datasets as demonstrated in this entry. It can be used for the taxonomic assignment of sequence reads of diverse lengths (≥45 bp) derived from any existing sequencing technology, and perhaps it is the only method which can be applied for the taxonomic binning of reads of lengths as short as 45–75 bp with high accuracy and sensitivity. Thus, the MetaBin Web server and program can be considered a significant improvement over currently existing programs for carrying out the taxonomic binning of metagenomic sequences with high accuracy, speed, and sensitivity.

MetaBin, Fig. 2 Screenshot of results page for the sample query

MetaBin, Fig. 3 Screenshot of visualization page

References

Diaz NN, Krause L, Goesmann A, Niehaus K, Nattkemper TW. TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinforma. 2009;10:56.
Gerlach W, Junemann S, Tille F, Goesmann A, Stoye J. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinforma. 2009;10:430.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–64.
Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14:169–81.
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4:63–72.
Monzoorul HM, Ghosh TS, Komanduri D, Mande SS. SOrt-ITEMS: sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics. 2009;25:1722–30.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Richter DC, Ott F, Auch AF, Schmid R, Huson DH. MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3:e3373.
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011;39:D38–51.
Sharma VK, Kumar N, Prakash T, Taylor TD. Fast and accurate taxonomic assignments of metagenomic sequences using MetaBin. PLoS ONE. 2012;7:e34030.
Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner FO. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinforma. 2004;5:163.
Tringe SG, Rubin EM. Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005;6:805–14.
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74.

MetaBioME

Computational Tool for Mining Metagenomic Datasets to Discover Novel Biocatalysts by Using a Homology-Based Approach

Vineet K. Sharma¹ and Todd D. Taylor²
¹MetaInformatics Laboratory, Metagenomics and Systems Biology Group, Department of Biological Sciences, Indian Institute of Science Education and Research, Bhopal, India
²Laboratory for Integrated Bioinformatics, Core for Precise Measuring and Modeling, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan

Synonyms

Biocatalysts; Commercially useful enzymes; CUEs

Definition

MetaBioME: Comprehensive metagenomic biomining engine.

Introduction

The relationship between man and microbes is as old as the age of man himself, and it is no wonder that man carries around ten times more of these little friends than of his own cells (Gill et al. 2006). However, it has only been a few thousand years since man first learned to harness the power of microbes, initially to accomplish crude and trivial fermentations like brewing and curdling. Today, these applications have been extended to almost all areas, such as agriculture, pharmaceuticals, industry, and biotechnology, where microbes have become indispensable. These applications have now become more refined, and the most remarkable change is that microbial enzymes have replaced whole microbes in many such processes.
These microbial enzymes, a.k.a. “biocatalysts,” offer ecologically friendly or “green” solutions for the implementation of biochemical processes at a reduced cost and produce a large variety of chemical substances without involving the use of polluting reagents that are often characteristic of chemical synthesis (Ferrer et al. 2005). However, only a few enzymes are currently known which can be used as biocatalysts due to the limited number of sequenced microbes, which is principally limited by the fact that most (>98 %) of the microbes cannot be cultured, a necessary step for their sequencing by traditional methods (Amann et al. 1995). This as yet unculturable majority of microbes potentially conceals an enormous treasure of unknown biological functions locked in their unidentified genes, proteins, and biochemical pathways. Therefore, approaches aimed at mining environmental genetic diversity can significantly enhance the enzyme repertoire and will be helpful in the discovery of novel biocatalysts with potential biotechnological applications.

Another feasible, yet challenging, method is to create novel biocatalysts by using in silico approaches and bioengineering to reshuffle the 20 known amino acids and mutate the existing proteins. However, there exist nearly infinite possibilities for such an approach, and it is impractical and costly to test them all. In this scenario, nature appears the veteran since it began its bioengineering laboratory billions of years ago and has already created and tested an intriguing diversity of biochemical pathways and their constituent enzymes that perform numerous transformations of molecules in diverse biological systems with great precision and specificity. Therefore, it is conceivable that the ideal biocatalyst may already exist in nature, and a wise strategy would be to augment our knowledge base by exploring the inherent diversity of nature.

To this end, metagenomics has emerged as a powerful culture-independent approach for exploring the complexity of microbial genomes in their natural environments (Tringe and Rubin 2005a). Many metagenomic projects have recently been conducted, such as metagenomic studies of soil, sea, acid mines, human gut, termite gut, whale fall, etc. (Daniel 2005; Edwards and Rohwer 2005; Kurokawa et al. 2007; Tringe et al. 2005; Turnbaugh et al. 2006; Tyson et al. 2004; Venter et al. 2004a; Warnecke et al. 2007), and several large-scale worldwide metagenomic projects are currently in progress or in planning. From these metagenomic projects, some important biocatalysts have already been isolated, such as lipases/esterases, proteases, nitrilases, β-lactamases, hydrolases, cellulases, α-amylases, xylanases, oxidoreductases, and dehydrogenases (Ferrer et al. 2005; Yun and Ryu 2005). Therefore, the upcoming information from further metagenomic projects holds enormous prospect for the discovery of novel genes, biocatalysts, and biochemical pathways, irrespective of the necessity for complete genomic sequences.

Novel biocatalysts can be detected in genomic or metagenomic libraries using three commonly used strategies: (i) homology-driven screening, (ii) substrate-induced gene expression screening, and (iii) activity-based analysis (Ferrer et al. 2005; Yun and Ryu 2005). While these methods have certain advantages like high specificity and reliability, they require extensive mining of large genomic or metagenomic libraries and result in a few positives per enzymatic screening (Ferrer et al. 2005). This is further limited by the low quality of DNA, low coverage, host bias, and need for better vector-host combinations for expression.

An alternative and promising approach which now exists involves direct shotgun sequencing of metagenomic libraries (Tringe and Rubin 2005b). This approach was earlier considered too expensive, since it required massive sequencing by conventional sequencers (Sanger). However, the recent availability of a new generation of sequencers, like Roche 454, Illumina HiSeq, Ion Torrent, etc., has made sequencing even more high-throughput, several orders less expensive, and most importantly cloning independent (Mardis 2008). Considering the sheer volume of metagenomic samples and implementation of such high-throughput sequencing methods, combined with high-throughput computational analysis, screening of potential biocatalysts is more promising and is likely to accelerate the process of biocatalyst discovery.
In the present entry, we describe a computational platform and resource to identify novel biocatalysts in metagenomic datasets using homology-based approaches. We have developed a comprehensive Metagenomic BioMining Engine (MetaBioME) platform (Sharma et al. 2010), which provides a unique resource for the identification of novel alternatives to the existing known biocatalysts and novel biocatalysts in metagenomic datasets, which can be used as leads for further experimental verification.

Results

The distribution of 510 biocatalysts in nine application categories indicates that the highest number (234, 46 %) of biocatalysts is present in the “Biotechnology” category and the lowest (3, 3 %) in the “Energy” category (Fig. 1).

MetaBioME, Fig. 1 Distribution of enzymes (EC classes) into nine application categories
[Figure legend: Oxidoreductase (EC 1), Transferase (EC 2), Hydrolase (EC 3), Lyase (EC 4), Isomerase (EC 5), Ligase (EC 6). Categories shown: General Applications, Nutrition, Medical, Industry, Enzymatic Analysis, Environment, Energy, Biotechnology, Agriculture, All Applications, All Enzymes.]

Oxidoreductases (EC 1), which catalyze oxidation-reduction reactions, are most abundant in five out of nine applications, namely, Enzymatic Analysis (95 %), Energy (75 %), General Applications (74 %), Environment (45 %), and Medical (42 %). Transferases (EC 2), which perform the transfer of functional groups from one molecule to another, are most abundant in three application categories, namely, Agriculture (50 %), Nutrition (48 %), and Biotechnology (37 %). Hydrolases (EC 3), which are involved in formation of two separate products from a single substrate by hydrolysis, are most abundant only in Industrial applications (45 %). It is clear from the above findings that oxidoreductases (EC 1) are most widely used as biocatalysts followed by transferases (EC 2) and hydrolases (EC 3). It is also noteworthy that although hydrolases (EC 3) constitute most of the enzymes among the six EC classes, they are not the most widely employed biocatalysts. The biocatalysts belonging to the remaining three EC classes (4, 5, and 6) were not as widely distributed or were completely absent from many of the nine application categories.

Gene Prediction in Metagenomic Datasets (Except HFV)

The average contig length in the metagenomic datasets varied between 0.8 and 1.8 kb with the exception of AMD (4.18 kb). The prediction of
ORFs by Glimmer and MetaGene showed considerable variation, with MetaGene predicting up to twice the number of ORFs as compared to Glimmer. With the exception of AMD, which had an average of four ORFs per contig predicted by MetaGene and Glimmer, the average number of ORFs per contig for the remaining datasets was found to vary between 0.6 and 2.3. The median protein length in bacteria was reported in one study as 267 amino acids (801 base pairs) (Brocchieri and Karlin 2005). Since, in the above analysis, the average length of the contigs varies between 0.8 and 1.8 kb, and the average number of ORFs per contig varies between 0.6 and 2.3, it is likely that a significant portion of at least one ORF can be predicted in a contig of about 1 kb (Tringe et al. 2005). The ORFs predicted by Glimmer and MetaGene in all the metagenomic datasets were fed into the “Metabase” database, which is being used for the development of MetaBioME.

Identification of Potential Biocatalysts

Using MetaBioME’s homology-based approach, we identified 199 potential alternatives (49 % of total biocatalysts) to known biocatalysts in the metagenomic datasets using a stringent threshold of identity ≥50 % and coverage ≥90 %. Among the nine application categories, novel alternatives to known biocatalysts could be predicted for 39–50 % of total biocatalysts in each category. We further relaxed the above cutoff (identity ≥30 % and coverage ≥90 %) to identify an expanded list of potential alternate biocatalysts in the metagenomic datasets which could be used as leads for experimental verification. Using this relaxed cutoff, novel alternatives for a total of 305 (75 %) biocatalysts could be identified in the metagenomic datasets from all application categories. Among these potential biocatalysts, 20 were commonly found in all nine metagenomic datasets, while 64 biocatalysts were rare and could be found in only one of the nine metagenomic datasets.

Description of Web Resource: MetaBioME

We used the above strategy, data, and results to develop a comprehensive resource, “MetaBioME,” which can be queried through a publicly available Web interface at http://metasystems.riken.jp/metabiome (Sharma et al. 2010). The key idea of MetaBioME is to develop a computational tool for mining metagenomic datasets by using homology-based approaches to discover novel biocatalysts and novel alternatives for existing biocatalysts, with advanced analysis options for facilitating the validation of results. Therefore, for comprehensive querying, we have developed the following query pages:

MetaSearch: It houses a pre-classified set of 510 biocatalysts in nine application categories that can be searched for in different metagenomic datasets.
MetaXplorer: It contains the complete set of EC enzymes and options to search for their homologous ORFs in metagenomic datasets.
MetaAlign: It allows users to submit a gene or protein sequence of interest and search for the existence of a homologous ORF in metagenomic datasets.

The details of these query pages are provided below.

MetaSearch: Search for Biocatalysts in Metagenomic Datasets

The “MetaSearch” query page is designed to identify novel biocatalysts, categorized into nine main application categories, in metagenomic datasets (Fig. 2).

This pre-classification helps the user to select biocatalysts belonging to any application area and search for them in metagenomic datasets. A search can be made by selecting one or more of the application categories and a single metagenomic dataset. Since the metagenomic datasets contain volumes of information, the number of hits reported for each query is expectedly large; therefore, we have currently restricted the option to select and search in only one metagenomic dataset per query. The queries can also be made by selecting different attributes such as EC number, enzyme name, Swiss-Prot ID, biochemical pathway, and substrate or products. Multiple keywords can also be submitted using Boolean operators. An option is also provided to limit the number of results by selecting “Best hit” or “Best 10 hits.”
MetaBioME, Fig. 2 Screenshot of “MetaSearch” page with a sample query

On submission of a query for a selected application category and metagenomic dataset, MetaBioME examines the alignments of all Swiss-Prot sequences known for all EC numbers present in that category with the ORFs predicted in contigs of the selected metagenomic dataset. The subsequent “MetaResults” page displays the qualified hits as a table sorted on the basis of percent coverage (completeness of the alignment) and provides a list of all matching Swiss-Prot IDs which showed at least 50 % coverage with the matched metagenomic contigs (Fig. 3).

MetaBioME, Fig. 3 Screenshot of “MetaResults” page showing the results of the submitted sample query

Comprehensive information for each match can be retrieved by clicking on the Swiss-Prot ID link on the Results page, which opens up the “MetaBioME Profile” page. The profile page summarizes information on the enzyme properties, reaction performed, pathway information (as available in KEGG), links to related publicly available databases, queried dataset, and application category. This information is followed by a table of predicted ORFs, where the ORFs are segregated as commonly predicted by Glimmer and MetaGene and uniquely predicted by Glimmer or MetaGene, respectively. The ORF showing the best match with the Swiss-Prot sequence is highlighted in green. This table is followed by the contig view window displaying the predicted ORFs as directional arrows as per the orientation of the ORFs on the contig. The best match is displayed as a green-colored arrow. Each arrow can be clicked to retrieve the nucleotide and protein sequences of the predicted ORFs. This window is followed by a table providing summary information for the best matching ORF. The next table provides information on the closest available PDB structure and displays the 3-D protein structure.

In order to provide a useful indicator for the goodness of the results, we have provided a “MetaBioME Rating,” which rates the best matching ORF on a scale of 1–5 stars, with a single star for lowest match and five stars for the best match. In the case of a good match, users are advised to carry out an “Advanced Search,” which helps to confirm the goodness of the results by using a suite of options. Users can check the alignment of the Swiss-Prot sequence of the selected biocatalyst with the best matching ORF. Since conserved motifs likely play a key role in the activity of an enzyme, all Swiss-Prot sequences belonging to the same EC number can be aligned together or with the best matching ORF to find the overall sequence homology among these sequences. This helps in the identification of conserved motifs and confirms if the best matching ORF also possesses any conserved motifs which may be present in the Swiss-Prot sequences. As another functional confirmation, users can also look for the presence of conserved domains in the best matching ORF by aligning the sequence against the NCBI Conserved Domains Database (CDD).

Additionally, the user can also check if the same Swiss-Prot sequence of the biocatalyst in question is present in any other metagenomic
M 378 MetaBioME
MetaBioME, Fig. 4 Screenshot of “CUEsXplorer” page
dataset by carrying out a homology search against number. Any representative Swiss-Prot sequence
other metagenomic datasets. Another search can be selected and searched by TBLASTN in
option is provided to determine if the novel one or more metagenomic datasets selected from
predicted ORF sequence is already present or the drop-down menu. The results “MetaSearch
has a close match with any protein from a Results” and profile “MetaBioME Profile”
known genome available in the Non-Redundant pages, for the submitted query, are similar to as
(NR) database. These additional options are help- explained in the earlier section. This query page
ful in confirming the uniqueness of the novel provides users with an option to search all known
identified biocatalyst. enzymes, as available in EC, irrespective of their
known role as biocatalyst, which is a subset of
CUEsXplorer: Explore Commercially Useful this set.
Enzymes (CUEs) Database
This page provides options for exploring the MetaAlign: Online Application to Search for
CUEs database for any application category or Protein Sequences in Metagenomic Datasets
EC classification. It provides details about MetaAlign is an application powered by the
enzyme function and the curation summary of BLAT (faster and less sensitive) and BLAST
any selected enzyme (Fig. 4). (slower and more sensitive) sequence alignment
tools (Fig. 6).
MetaXplorer: Search for Enzymes in It provides the user an option to carry out
Metagenomic Datasets homology-based searches for single or multiple
This query page provides an option to select and (multi-FASTA format) submitted nucleotide or
search for any enzyme from the six EC classes in protein sequences against the metagenomic
metagenomic datasets (Fig. 5). sequences available in the ten metagenomic
On selecting any EC class, a list box datasets. Larger files containing multiple
containing all EC numbers belonging to that sequences can also be uploaded, with an email
class opens up. Selecting an EC number from being sent to the user on completion of analysis.
this list box reveals an expanded page with infor- The searches can be limited by selecting the
mation on the enzyme name, EC number, Prosite threshold E-value and the number of resultant
ID, enzymatic reaction, KEGG pathway, and list hits. The output format can also be specified as
of all Swiss-Prot IDs belonging to that EC “tabular” or “full” (complete alignment).
MetaBioME, Fig. 5 Screenshot of “MetaXplorer” page

Discussion

There is so much richness and natural diversity inherent in the metagenomic data that the possibility of retrieving functional genes of interest is almost certain. This is further assured with the availability of more metagenomic datasets, deeper coverage, and completed genomic sequences. Therefore, a computational homology-based search engine such as MetaBioME has great potential to reveal novel alternatives for existing biocatalysts.

To look for an “ideal biocatalyst,” however, is not an easy task, since the requirements and conditions of the bioprocesses are not constant. Generally, an “ideal” catalyst is defined in terms of turnover number (kcat) or, for a given process, in terms of the maximum specificity constant (kcat/KM) (Burton et al. 2002). However, from a bioprocess viewpoint, each bioprocess is constrained by a set of conditions governed by the specific properties of the substrates, products, and the bioconversion reaction. Thus, the currently used microbial biocatalysts, whose selection has been restricted by the limited number of available genomes, may not be “ideal,” and sometimes the industrial processes have to be designed to fit only mediocre enzymes (Lorenz and Eck 2005).

Therefore, MetaBioME does not involve an exclusive approach in looking for ideal biocatalysts, but employs an inclusive approach to identify all possible alternatives under reasonable criteria. For any given function (EC number), MetaBioME reports all possible ORFs (with stringent cutoff similarity) from the naturally existing diverse protein repertoire of yet unidentified microbial genomes which have evolved and survived in diverse environments. Thus, each resultant metagenomic ORF having significant similarity to a known biocatalyst is unique with distinct characteristics such as thermodynamic and pH stability, turnover frequency, specific activity, etc., offering a wide choice for their selection and employment as per the requirements for a given bioprocess. This approach is especially useful for pharmaceutical and supporting fine-chemical companies, both of which explore multiple diverse biocatalysts to construct their local databases for biotransformations (Lorenz and Eck 2005).

The alternative novel biocatalysts found using MetaBioME can serve as leads for further experiments involving cloning and expression to establish their enzymatic activity and commercial potential. Therefore, a combination of computational predictions of MetaBioME with activity-based mining and subsequent tailoring of these proteins using bioengineering techniques could provide a promising route to replace chemical synthesis with biotechnological processes, which are ultimately more sustainable for mankind.
MetaBioME, Fig. 6 Screenshot of “MetaAlign” page

Methods

Enzyme Database
We have used the Enzyme Commission number (EC number) as a numerical classification scheme for enzymes based on the chemical reactions they catalyze, with each EC class exclusively defining the function performed by the enzyme (Bairoch 2000). Information on the complete set of 4,877 enzymes annotated with an EC number was retrieved from the ENZYME nomenclature database, as available at ExPASy. Swiss-Prot sequences were retrieved from the Swiss-Prot database (O’Donovan et al. 2002) for the different enzymes belonging to these EC numbers. The remaining EC numbers did not have any known Swiss-Prot sequence. An EC number in this analysis is used exclusively to refer to an enzyme and defines its function.

We curated a database of 510 microbial enzymes, with known or potential commercial applications as “biocatalysts,” by mining the information available at BRENDA (Barthelmes et al. 2007), NCBI (Wheeler et al. 2008), ExPASy (Gasteiger et al. 2003), and available literature. These biocatalysts were classified into nine broad application categories, namely, Agriculture, Biotechnology, Energy, Environment, Enzymatic Analysis, General Applications, Industry, Medical, and Nutrition. These broad application categories were further subclassified into 21 more specific subcategories.

Other Resources
The Non-Redundant (NR) and Conserved Domains Database (CDD) were retrieved from NCBI (ftp://ftp.ncbi.nih.gov/blast/db), and the Protein Data Bank (PDB) database was retrieved from the Worldwide Protein Data Bank (wwPDB) (http://www.wwpdb.org/). Protein structures were created using RasMol (version 2.6).

Mining the Metagenomic Databases
The publicly available metagenomic data from ten diverse environments is analyzed in the current version of the database. Of these, the Sargasso Sea [SSEA] dataset was retrieved from the J. Craig Venter Institute (https://research.venterinstitute.org/sargasso/) (Venter et al. 2004b), and the remaining nine datasets, including sludge [SLUDGE] (Garcia et al. 2006), acid mine drainage [AMD] (Tyson et al. 2004), whale fall [WFALL] (Tringe et al. 2005), soil [SOIL] (Tringe et al. 2005), human gut (2 individuals) [HGUTI] (Gill et al. 2006), human gut (13 individuals) [HGUTII] (Kurokawa et al. 2007), mouse gut [MGUT] (Turnbaugh et al. 2006), termite gut [TGUT] (Warnecke et al. 2007), and human fecal virus [HFV] (Zhang et al. 2006), were retrieved from the DDBJ database (ftp://ftp.ddbj.nig.ac.jp/database/wgs/WGS_ORGANISM_LIST.html). The sequences available in these datasets are referred to as “contigs” by the authors; therefore, we have called them “contigs” in this analysis. However, we realize that several metagenomic sequences in these datasets are too short to be called contigs and are likely singletons.

Swiss-Prot protein sequences of known biocatalysts were aligned with their corresponding nucleotide sequences (contigs) in each metagenomic dataset using TBLASTN with a threshold of E < 10^-6, with only the best ten matches being considered, and the output was generated in XML format.

Complete and partial ORFs (open reading frames) were predicted in the metagenomic sequences using the Glimmer (Delcher et al. 2007) and MetaGene (Noguchi et al. 2006) gene prediction programs with a minimum length of 50 amino acids (150 nucleotides). We adopted a self-training approach for implementing Glimmer by using the contig itself as the training sequence. Additional confidence for an ORF prediction is provided by integrating the results of MetaGene and Glimmer, using an in-house developed algorithm, “SuperGene.” It classifies the ORFs as “Exact” (same start and end predicted by both methods), “End_match” (start is variable and only end is matching), and “Unique” (predicted by only one method). The “Exact” ORFs are predicted with higher confidence and with reliable start and end positions, because they were predicted by two independent methods. For the “End_match” cases, the longer ORF was kept in this analysis to ensure that no part of an ORF was left out, even if some extra part was included in the initial prediction. The exact start and end were further confirmed after alignment of the ORFs with their corresponding Swiss-Prot sequences. The ORFs lying at the terminals of the contigs were considered partial. The above data was imported into a MySQL database (Metabase).

Web Interface and Metabase Development
Apache (version 2.2.8), MySQL (version 5.0.45), PHP (version 5.2.4), and Perl (version 5.8.5) were used for development of the GUI. The back-end database was called “Metabase.” The Web server was developed using Apache HTTP Server (version 2.2.8). Client-side scripting was done using XHTML, JavaScript, and AJAX, and server-side scripting was done using PHP and XML. The publicly available applications, BLAT (v34) (Kent 2002), BLAST (version 2.2.17) (Wheeler et al. 2008), and MAFFT (version 6.240) (Katoh et al. 2005), were used for additional analysis.
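The ORF integration step described under “Mining the Metagenomic Databases” can be illustrated with a short sketch. It is not the SuperGene implementation, which is not given here; it only reproduces the three labels under stated assumptions: each ORF is reduced to a hypothetical record of contig, strand, start, and end, and each predictor reports at most one ORF per stop position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    contig: str
    strand: str  # "+" or "-"
    start: int   # coordinate of the predicted translation start
    end: int     # coordinate of the stop end of the ORF

    def length(self):
        # abs() so that minus-strand ORFs (start > end) are handled too
        return abs(self.end - self.start) + 1


def integrate(glimmer_orfs, metagene_orfs):
    """Label ORFs from two predictors as Exact, End_match, or Unique.

    Assumption of this sketch: within one predictor, no two ORFs share
    the same (contig, strand, end).
    """
    by_end_g = {(o.contig, o.strand, o.end): o for o in glimmer_orfs}
    by_end_m = {(o.contig, o.strand, o.end): o for o in metagene_orfs}
    merged = []
    for key in sorted(set(by_end_g) | set(by_end_m)):
        g, m = by_end_g.get(key), by_end_m.get(key)
        if g is not None and m is not None:
            if g.start == m.start:
                merged.append((g, "Exact"))
            else:
                # Same end, different start: keep the longer prediction.
                merged.append((max(g, m, key=Orf.length), "End_match"))
        else:
            merged.append((g if g is not None else m, "Unique"))
    return merged


if __name__ == "__main__":
    glimmer = [Orf("c1", "+", 10, 300), Orf("c1", "+", 400, 900)]
    metagene = [Orf("c1", "+", 10, 300), Orf("c1", "+", 340, 900)]
    for orf, label in integrate(glimmer, metagene):
        print(label, orf)
```

Keying both prediction sets by the stop end makes the End_match case a simple dictionary lookup; the 50-amino-acid minimum length and the detection of partial ORFs at contig ends are left out of the sketch.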
Cross-References

▶ Binning Sequences Using Very Sparse Labels Within a Metagenome
▶ Challenge of Metagenome Assembly and Possible Standards
▶ Computational Approaches for Metagenomic Datasets
▶ FragGeneScan: Predicting Genes in Short and Error-Prone Reads
▶ MetaBin
▶ MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
▶ New Computational Methodologies to Understand Microbial Diversity
▶ Next-Generation Sequencing for Metagenomic Data: Assembling and Binning
▶ NGS QC Toolkit: A Platform for Quality Control of Next-Generation Sequencing Data

References

Amann RI, Ludwig W, Schleifer KH. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev. 1995;59:143–69.
Bairoch A. The enzyme database in 2000. Nucleic Acids Res. 2000;28:304–5.
Barthelmes J, Ebeling C, Chang A, Schomburg I, Schomburg D. BRENDA, AMENDA and FRENDA: the enzyme information system in 2007. Nucleic Acids Res. 2007;35:D511–4.
Brocchieri L, Karlin S. Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 2005;33:3390–400.
Burton SG, Cowan DA, Woodley JM. The search for the ideal biocatalyst. Nat Biotechnol. 2002;20:37–45.
Daniel R. The metagenomics of soil. Nat Rev Microbiol. 2005;3:470–8.
Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with glimmer. Bioinformatics. 2007;23:673–9.
Edwards RA, Rohwer F. Viral metagenomics. Nat Rev Microbiol. 2005;3:504–10.
Ferrer M, Martinez-Abarca F, Golyshin PN. Mining genomes and ‘metagenomes’ for novel catalysts. Curr Opin Biotechnol. 2005;16:588–93.
Garcia MH, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, Yeates C, He S, Salamov AA, Szeto E, et al. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol. 2006;24:1263–9.
Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 2003;31:3784–8.
Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–9.
Katoh K, Kuma K, Miyata T, Toh H. Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inform. 2005;16:22–33.
Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–64.
Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14:169–81.
Lorenz P, Eck J. Metagenomics and industrial applications. Nat Rev Microbiol. 2005;3:510–6.
Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24:133–41.
Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34:5623–30.
O’Donovan C, Martin MJ, Gattiker A, Gasteiger E, Bairoch A, Apweiler R. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform. 2002;3:275–84.
Sharma VK, Kumar N, Prakash T, Taylor TD. MetaBioME: a database to explore commercially useful enzymes in metagenomic datasets. Nucleic Acids Res. 2010;38(Database issue):D468–72.
Tringe SG, Rubin EM. Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005a;6:805–14.
Tringe SG, Rubin EM. Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005b;6:805–14.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, et al. Comparative metagenomics of microbial communities. Science. 2005;308:554–7.
Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444:1027–31.
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004a;304:66–74.
Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE,
Nelson W, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004b;304:66–74.
Warnecke F, Luginbuhl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, Cayouette M, McHardy AC, Djordjevic G, Aboushadi N, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450:560–5.
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008;36:D13–21.
Yun J, Ryu S. Screening for novel enzymes from metagenome and SIGEX, as a way to improve it. Microb Cell Fact. 2005;4:8.
Zhang T, Breitbart M, Lee WH, Run JQ, Wei CL, Soh SW, Hibberd ML, Liu ET, Rohwer F, Ruan Y. RNA viral community in human feces: prevalence of plant pathogenic viruses. PLoS Biol. 2006;4:e3.

MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource

Daniel H. Huson
Center for Bioinformatics, Algorithms in Bioinformatics, University of Tübingen, Tübingen, Germany

Synonyms

MEGAN = MEtagenome ANalyzer

Definition

MEGAN is a tool for analyzing metagenomic sequencing data, allowing the user to interactively explore the taxonomic and functional content of a dataset. It also supports the comparison of multiple datasets. The program was originally published in (Huson et al. 2007), and the most recent version was published in (Huson et al. 2011). Written in Java, the program runs on all major operating systems. The program can be downloaded from http://www-ab.informatik.uni-tuebingen.de/software/megan.

Introduction

Metagenomics is the study of uncultured organisms in their native environment using DNA sequencing (Handelsman et al. 1998). In a typical project, DNA (or, in the case of metatranscriptomics, cDNA reverse-transcribed from RNA) is extracted from an environmental sample and then shotgun sequenced. Once a metagenome dataset of DNA sequencing reads has been generated in this way, the first three main computational challenges are to (1) estimate the taxonomic content of the sample, (2) estimate its functional content, and (3) compare different samples.

To address these challenges, the first step is to align the set of sequencing reads against a database of known reference protein sequences such as NCBI-NR or RefSeq (Benson et al. 2005) using a pairwise alignment tool such as BLASTX (Altschul et al. 1990) or RapSearch2 (Zhao et al. 2012). A read is said to hit a given reference sequence if a significant alignment is found in this process. The comparison of the sequencing reads against a reference database is usually the computationally most expensive step of analysis, and subsequent steps are based on the obtained alignments. Given the result of the alignment step, an analysis program such as MEGAN is then required to explore and analyze the data.

Taxonomic Analysis

To perform a taxonomic analysis of a metagenomic dataset, MEGAN attempts to place each read onto a node in the NCBI taxonomy, based on an analysis of its hits. The key idea is to use all ranks of the taxonomy so as to assign reads specific to a particular species near the leaves of the taxonomy and to map sequences that are conserved across a wider range of organisms to higher-level nodes. For example, a read that comes from a gene that only Escherichia coli has will be placed on the E. coli node, whereas a read that comes from a gene that is shared widely across different Proteobacteria will be assigned to the node labeled Proteobacteria.
MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 1 Taxonomy analysis of 500,000 reads from an in vitro-simulated microbial community (Morgan et al. 2010). Each circle represents a taxon in the NCBI taxonomy and is scaled logarithmically to indicate how many reads have been assigned to it. In addition to the taxon name, each node is also labeled by the cumulative number of reads assigned to, or below, that node

The input to MEGAN is a file of DNA reads and a file containing all their hits in a reference database, usually in BLAST or SAM format. In addition, at start-up, MEGAN reads in the whole NCBI taxonomy. To perform a taxonomic analysis of a metagenome dataset, MEGAN processes each DNA read in turn, assigning each read to the node in the NCBI taxonomy that is the lowest common ancestor of the set of species associated with all reference sequences that were hit by the read. This approach is known as the LCA algorithm.

The LCA algorithm has a number of parameters, such as minScore, the minimum bit score that an alignment must achieve to be considered; minPercent, a further filter to remove all those hits whose bit score differs by more than the given percentage from the top scoring hit for the given read; and minSupport, the minimum number of reads that a node in the NCBI taxonomy must attract before it is shown in the final output.

Reads that have no hits are assigned to a special node labeled No Hits, whereas reads that have hits but cannot be assigned to a taxon are mapped to a special Unassigned node. In addition, reads consisting of highly repetitive sequence are assigned to a Low Complexity node.
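The LCA placement and its three parameters can be illustrated on a toy taxonomy. The following is a minimal sketch, not MEGAN's code: the seven-node taxonomy, the function names, and the parameter values in the usage lines are all hypothetical. Two behaviours are assumptions of the sketch rather than statements from the text: a read whose hits all fall below the minimum score is reported as Unassigned, and nodes with too little support are simply dropped from the summary.

```python
from collections import Counter

# Toy taxonomy as child -> parent (illustrative only; the real NCBI
# taxonomy is far larger).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def path_to_root(taxon):
    """List of nodes from the taxon up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lowest_common_ancestor(taxa):
    taxa = list(taxa)
    common = set(path_to_root(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(path_to_root(taxon))
    # Walking upward from the first taxon, the first shared node is the LCA.
    for node in path_to_root(taxa[0]):
        if node in common:
            return node


def assign_read(hits, min_score, min_percent):
    """hits: list of (taxon, bit_score) pairs for one read."""
    if not hits:
        return "No Hits"
    kept = [(t, s) for t, s in hits if s >= min_score]
    if not kept:
        return "Unassigned"  # assumption of this sketch, see lead-in
    best = max(s for _, s in kept)
    cutoff = best * (1.0 - min_percent / 100.0)
    return lowest_common_ancestor(t for t, s in kept if s >= cutoff)


def summarize(assignments, min_support):
    """Count reads per node and keep nodes with at least min_support reads."""
    counts = Counter(assignments)
    return {node: n for node, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    reads = [
        [("Escherichia coli", 100.0)],
        [("Escherichia coli", 100.0), ("Salmonella enterica", 95.0)],
        [],
    ]
    placed = [assign_read(h, min_score=50.0, min_percent=10.0) for h in reads]
    print(summarize(placed, min_support=1))
```

A read with a single strong hit stays at the species node, while a read with near-equal hits to two Proteobacteria moves up to their common ancestor, which mirrors the E. coli and Proteobacteria example given earlier.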
The pertinent part of the NCBI taxonomy is displayed in the taxonomy viewer of MEGAN, and by default, each node is scaled logarithmically to represent the number of reads associated with it; see Fig. 1. Nodes can be interactively collapsed or expanded to show more or fewer details of the classification. The user can select nodes of interest and then either inspect the associated reads and alignments, or save them to a file, or chart them in a number of standard ways.

MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 2 SEED-based functional analysis of 500,000 reads from an in vitro-simulated microbial community (Morgan et al. 2010). The SEED classification tree has been partially expanded to show details on functional roles involved in flagellar motility. Each circle represents a SEED category and is scaled logarithmically to indicate the cumulative number of reads that have been assigned to it. In addition to the SEED name, each node is also labeled by the number of reads assigned to, or below, that node

Functional Analysis

MEGAN uses both the SEED (Overbeek et al. 2005) and the KEGG (Kanehisa and Goto 2000) classifications to analyze the functional content of a metagenome dataset. In essence, the SEED classification maps genes onto functional roles, and these appear in different subsystems. Similarly, KEGG maps genes onto KEGG orthology groups, or KO groups, which are associated with enzymes that appear in different
pathways. In both cases, the classification can be represented as a tree with roughly 13,000 nodes.

To perform a SEED-based analysis, for each read in the input, MEGAN identifies the highest scoring hit to a reference sequence for which the corresponding functional role is known and then maps the read to that functional role. In a KEGG-based analysis, each read is mapped to a KO group in a similar fashion.

Both the SEED and KEGG classifications are displayed as trees in MEGAN, and the viewers provide the same interactive features as the taxonomy viewer. In addition, the KEGG viewer allows one to see how reads map to different enzymes in a given pathway; see Figs. 2 and 3.

MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 3 KEGG-based functional analysis of 500,000 reads from an in vitro-simulated microbial community (Morgan et al. 2010). The KEGG classification tree has been partially expanded to show details on KO groups involved in flagellar assembly. Each circle represents a KEGG category and is scaled logarithmically to indicate the cumulative number of reads that have been assigned to it. In addition to the KEGG name, each node is also labeled by the number of reads assigned to, or below, that node

Sequence Alignment

As pointed out above, the main computational step is to compute pairwise alignments between the set of DNA reads and all sequences in an appropriate reference database. Based on this,
it is possible to construct reference-guided multiple sequence alignments between all reads that hit the same reference sequence. This calculation is implemented in MEGAN in a new feature called the alignment viewer. Once the user has specified a node in the taxonomy, SEED, or KEGG viewer for which the alignment viewer is to be launched, the program first collects all reference sequences that correspond to the given node and then, for each such reference sequence, the program determines all reads that hit it. The user can then select a reference sequence, and the corresponding sequence alignment is subsequently displayed; see Fig. 4.

MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 4 MEGAN’s alignment viewer constructs and displays a multiple sequence alignment between all reads that map to the same reference sequence. The top track shows the reference sequence and the main panel displays the aligned reads. Letters shown in gray belong to the reads but are not part of the alignment

Comparison of Datasets

To facilitate the comparison of datasets, MEGAN allows the user to open multiple datasets simultaneously, showing each dataset in a different window. The user can then select a number of open datasets to be combined into a single new comparison document. For such a document, the taxonomy, SEED, and KEGG viewers indicate how many reads were assigned to each node for each original input document by drawing the node as a pie chart or bar chart; for example, see Fig. 5. MEGAN also supports the calculation of standard ecological indices for a comparison
document and then, based on this, the program can be used to compute a tree, network, or MDS plot (not shown here).

MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource, Fig. 5 High-level comparison of taxonomic content of four different cDNA datasets from a seawater monitoring study (Gilbert et al. 2008). The four different datasets are represented by different colors, and each node shows a bar chart that indicates the number of reads assigned to that node, on a logarithmic scale

Handling Large Data

As sequencing technologies continue to improve, the size of analyzed datasets continues to increase. MEGAN was reportedly used to perform the taxonomic analysis of 124 human gut samples involving around 600 gigabases of sequence (Qin et al. 2010). In an ongoing study, MEGAN is currently being used to analyze a set of 350 million reads with 1.3 billion BLASTX matches. While MEGAN is mainly designed for interactive use on a laptop or desktop computer, all features of the program can also be accessed in command-line mode, and thus analyses can also be performed on a server within the framework of a larger bioinformatic analysis pipeline.

Summary

MEGAN is an interactive tool for analyzing the taxonomic and functional content of metagenomic (and metatranscriptomic) datasets.
Input is a set of DNA reads and the result of comparing the reads against a reference database. Taxonomic analysis is performed by placing DNA reads onto nodes of the NCBI taxonomy, whereas functional analysis is based on mapping reads to SEED and KEGG categories. The program supports comparative analysis of multiple datasets. The program is written in Java and runs on all major operating systems. When run in command-line mode, the program can also be integrated into larger bioinformatic analysis pipelines.

Cross-References

▶ Metagenomics, Metadata, and Meta-analysis

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
Benson D, Karsch-Mizrachi I, Lipman D, Ostell J, Wheeler D. Genbank. Nucleic Acids Res. 2005;1(33):D34–8.
Gilbert JA, Field D, Huang Y, Edwards R, Li W, Glina P, Joint I. Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One. 2008;3:e3042.
Handelsman J, Rondon M, Brady S, Clardy J, Goodman R. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5:245–9.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17(3):377–86.
Huson DH, Mitra S, Weber N, Ruscheweyh H-J, Schuster SC. Integrative analysis of environmental sequences using megan4. Genome Res. 2011;21:1552–60.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
Morgan JL, Darling AE, Eisen JA. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS ONE. 2010;5.
Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang H-Y, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Rückert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33(17):5691–702.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto J-M, Hansen T, Le Paslier D, Linneberg A, Nielsen HB, Pelletier E, Renault P, Sicheritz-Ponten T, Turner K, Zhu H, Yu C, Li S, Jian M, Zhou Y, Li Y, Zhang X, Li S, Qin N, Yang H, Wang J, Brunak S, Dore J, Guarner F, Kristiansen K, Pedersen O, Parkhill J, Weissenbach J, Bork P, Ehrlich SD, Wang J. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28(1):125–6.

Metagenome of Acidic Hot Spring Microbial Planktonic Community: Structural and Functional Insights

Diego Javier Jiménez¹ and María Mercedes Zambrano²
¹Department of Microbial Ecology, University of Groningen, Center for Ecological and Evolutionary Studies (CEES), Groningen, The Netherlands
²Molecular Genetics and Microbial Ecology, Corporación CorpoGen, Bogotá, DC, Colombia

Synonyms

The microbiome of Andean acidic hot springs

Definition

Metagenomic analyses were done to obtain a deeper view of the microbial community structure and to gain insight regarding the functional properties present in the planktonic fraction of these Neotropical high Andean acidic hot springs.
Introduction These acidic-hot ecosystems are also of interest

High-mountain Andean ecosystems are rich in biodiversity and natural resources (Myers et al. 2000). The South American Andean region is part of what is known as the “Ring of Fire” and has several hot springs that represent unique and undisturbed extreme environments due to their high elevation and exposure to ultraviolet (UV) light. These springs are heated mainly by the underlying magma chamber from volcanic activity; they are oligotrophic and vary in their geochemistry, such as mineral content, temperature, and pH. Thus far, little is known regarding the microbiomes of these high-mountain ecosystems. A hot spring is characterized by discharge of hot water from a vent. There is, however, no universally accepted definition of “hot,” and the temperature for distinguishing a “warm spring” from a “hot spring” remains contentious (Rzonca and Schulze-Makuch 2003; Pentecost et al. 2003; Jones and Renaut 2011). Hot springs contain several microhabitats such as the planktonic fraction (which has low cell density), microbial mats, and sediments (with high cell density), each with different microbial assemblages. The microbial diversity in the planktonic fraction is dictated by environmental physicochemical characteristics such as pH, redox potential, temperature, and concentration of trace elements (Siering et al. 2006; Mathur et al. 2007). Metagenomic (total DNA), metataxonomic (16S rRNA and/or ITS sequences), meta-transcriptomic (mRNA), and PCR-target analyses have been extremely valuable for describing the microbial structure and functionality in different hot springs (López-López et al. 2013; Wemheuer et al. 2013; Liu et al. 2011). Cyanobacteria and Chloroflexi, for example, are abundant in low-temperature sediment samples from high-mountain hot springs located in the Tibetan plateau (Wang et al. 2013). A previous analysis of the planktonic microbial community in one Colombian acidic hot spring (El Coquito) located in the national park of Los Nevados showed that Bacteria rather than Archaea dominated the community, with predominance of Proteobacteria, Firmicutes, and Planctomycetes (Bohórquez et al. 2012a).

as a source of potential biotechnological products, new species (Tirawongsaroj et al. 2008; Bouraoui et al. 2013), and features relevant to ecosystem maintenance and ecology such as horizontal gene transfer, UV damage, and biogeochemical cycles. The microbial planktonic community contained putative chemotrophic bacteria potentially involved in cycling of ferrous iron and sulfur-containing minerals. In extremely acidic and UV light-irradiated hot springs, primary production may also be mediated by phototrophic acidophiles (mainly eukaryotic micro-algae) (Aguilera et al. 2010). However, the presence of bacterial-rhodopsin photosystems has been reported to complement the chemotrophic lifestyle (Bohórquez et al. 2012b).

Metataxonomic Approach: Microbial Diversity Assessment by 16S rRNA Sequences

Microbial diversity in terrestrial hot springs has been extensively studied in locations as varied as Yellowstone National Park (YNP), Japan, New Zealand, Great Basin, Iceland, Thailand, the Philippines, Russia, and the Tibetan plateau. These surveys, done mostly by 16S rRNA gene analysis, have expanded our view of the microbial communities present in these extreme and difficult-to-study water ecosystems. An elegant multi-approach based on 16S rRNA analysis of an acidic hot spring in the Colombian Andes, called El Coquito (EC) (Fig. 1), was recently carried out using high-throughput sequencing, PhyloChip, and 16S rRNA clone libraries (Bohórquez et al. 2012a). The EC hot spring is located at 3,973 m above sea level and is characterized by an acidic pH (2.7), high solar radiation (~9–11 mW/cm² nm UV-B), and high sulfate content (1,003 mg SO₄²⁻ L⁻¹). This spring is moderately hot, with a water temperature of approximately 29 °C, which is considerably higher than ambient temperature (~9 °C) (Rzonca and Schulze-Makuch 2003). Despite differences among the results obtained with the three strategies used to analyze the microbial diversity of
Metagenome of Acidic Hot Spring Microbial Planktonic Community: Structural and Functional Insights, Fig. 1 Photographs of the acidic hot spring El Coquito (EC); red circle indicates the planktonic fraction and black square indicates the biofilm surface formation
this ecosystem, there was dominance of the orders Burkholderiales, Legionellales, Rhodospirillales, Rhodocyclales, Clostridiales, Planctomycetales, Nitrospirales, Rhizobiales, and Acidimicrobiales. The most abundant genera were Acidithiobacillus, Acidiphilium, Leptospirillum, Thiomonas, Acidocella, and Acidisphaera. In general, the community was reminiscent of those found in hot and acidic environments with mesophilic organisms (Norris 2001; Stout et al. 2009). The high abundance of chemolithoautotrophic and heterotrophic acidophiles suggested that primary production in this community could be driven by solar energy at the surface and by inorganic chemicals that affect the biogeochemistry of iron and sulfur in the water. A more recent evaluation of 16S rRNA sequences present in EC hot spring, based on analysis of whole metagenome sequencing, which eliminates biases associated with PCR and cloning (Jiménez et al. 2012), showed consistent results and prevalence of Gammaproteobacteria, Alphaproteobacteria, and Betaproteobacteria (25 %), followed by micro-algae chloroplast ribosomal DNA (15 %), Firmicutes (14 %), and Bacteroidetes (6 %). Both studies detected oxygenic eukaryotic phototrophs that could be present both in the planktonic fraction and in mat communities. Overall, the community was dominated by Bacteria rather than Archaea; it had a large proportion of novel and unclassified sequences and the presence of eukaryotic micro-algae. In addition, the presence of chemolithoautotrophic acidophiles in this high-mountain thermal spring suggested that primary production could be driven by chemical energy in the water, as well as by solar energy at the surface.

A comparative study of the planktonic microbial communities in five high-mountain hot springs was also carried out by 16S rRNA gene assessment. The springs, which varied in altitude, geographical location, and geochemical
characteristics, also showed differences in terms of diversity indexes. However, certain bacterial phyla showed predominance in all of them: Proteobacteria, Aquificae, Chloroflexi, Cyanobacteria, Firmicutes, Nitrospirae, and Thermotogae. Based on cluster analysis of the microbial populations, these spring communities grouped together in a manner consistent with sample physicochemical parameters, with pH and sulfate concentration being the parameters that most influenced the population structure. Some springs were also characterized by site-specific bacterial taxa that distinguished each community. Thus despite their geographic proximity and similar origins, the environmental factors at each location have resulted in marked differences in the microbial assemblages present.

Metagenomic Approach: Taxonomic and Functional Assignment of Metagenome Sequences

Although 16S rRNA gene analysis is very useful for assessing microbial diversity, it does not provide ecologically relevant functional information. Thus a direct analysis of total metagenomic sequences becomes relevant. The current and most frequently used tools for taxonomic and functional classification of metagenomic reads are based on local alignments (BLAST) against different databases and associating best hits to taxa, specific genes, functional identifiers, or metabolic pathways (Montaña et al. 2012). An analysis was therefore carried out with 53 Mb of metagenomic information retrieved from a planktonic fraction of the EC hot spring (Jiménez et al. 2012). However, only 8,121 reads (2.9 %) of the total reads could be assigned to a taxonomic category, suggesting a great amount of previously undescribed sequences or a large amount of noncoding DNA present in these genomes (especially in micro-eukaryotes). A high number of sequences were related to Acidithiobacillales (represented by sequences related to Acidithiobacillus caldus, Acidithiobacillus ferrooxidans, and Acidithiobacillus thiooxidans) followed by Legionellales and Rhodospirillales that included Acidiphilium cryptum (1,681 assigned reads). A high proportion of sequences related to enzymes involved in transposition and integration of mobile genetic elements (transposases) were mapped to the A. cryptum JF-5 genome. By using BLASTX against the NCBI-nr database and the MEGAN v4.0 software, 19,876 sequences were associated with KEGG pathways, specifically to metabolism of carbohydrates (2,623), amino acids (2,584), energy (1,920), and nucleotides (1,431). A total of 87,023 reads (30.9 %) were assigned to 25 COG categories and most of the sequences were related to replication, recombination, and repair (10,712 reads), suggesting that these systems could be important in this ecosystem where high UV radiation, acidic pH, and high water temperature may cause significant damage to DNA. Deep sea hydrothermal vent chimneys and hot spring microbial communities are enriched in genes involved in mismatch DNA repair and homologous recombination, perhaps due to the need for extensive DNA repair systems to cope with extreme conditions that could have potential deleterious effects on their genomes (Klatt et al. 2011; Xie et al. 2011). In this study we also identified sequences associated with quorum sensing and cellular communication in biofilms, structures that could form on the surfaces of these acidic hot springs and could be relevant for ecosystem functionality (Fig. 1).

Metagenomic Approach: Nitrogen and Sulfur Transformations

Pathways involved in nitrogen and sulfur metabolism could be important in acidic hot spring habitats where terminal electron acceptors other than O2 may be relevant, such as nitrate, ferric iron, arsenate, thiosulfate, elemental S, sulfate, or CO2. Genes related to the dissimilatory reduction of nitrate to nitrite (narGHI genes), conversion of nitrite to N2 (nirK, nirS, norB, nosZ), and associated with ferredoxin-nitrite reductase (nirA) were found in the metagenome of EC hot spring (Fig. 2a). The presence of nif
Metagenome of Acidic Hot Spring Microbial Planktonic Community: Structural and Functional Insights, Fig. 2 Partial (a) nitrogen and (b) sulfur pathways identified by KEGG affiliation of the sequences from EC hot spring. Boxes indicate the KEGG characteristic identified and numbers in gray circles indicate the amount of sequence reads affiliated to the KEGG function (Jiménez et al. 2012)
K genes (associated with sulfate-reducing Thermodesulfovibrio and sulfur-reducing bacteria Desulfitobacterium) also indicated that in addition to denitrification, nitrogen fixation could also be taking place in this acidic hot spring. Based on taxonomic affiliation, the dissimilatory nitrate reduction is most likely carried out by Proteobacteria-like organisms, while assimilatory reduction of nitrate was associated mostly with acidophilic micro-algae, Acidobacteria, Spartobacteria, and Alphaproteobacteria (Jiménez et al. 2012). Conversion of sulfate into adenylylsulfate and, further, into sulfite and H2S was also predicted from sequence analysis of the EC metagenome. This included genes involved in conversion of adenylylsulfate to sulfite (aprAB; cysH), in sulfite reduction and H2S formation (cysI), and in the oxidation of sulfite to sulfate (sulfite oxidase enzymes) (Meyer and Kuever 2008). These pathways indicate that the oxidation of H2S and/or SO2 could be linked to the acidity of the environment (Jones et al. 2012).

PCR-Target Approach: Proteorhodopsin-Like Genes in Andean Acidic Hot Springs

These Andean mountain hot springs are subjected to a large amount of solar light, yet taxonomic surveys identified only a few phototrophic bacteria (Bohórquez et al. 2012a; Jiménez et al. 2012). Thus a search was conducted to identify energy-harvesting bacterial proteorhodopsins (PRs) that could also contribute to productivity in these ecosystems (Bohórquez et al. 2012b). PRs are retinal-binding bacterial transmembrane proton pumps that can generate energy from light and are therefore important in terms of carbon cycling and energy flux in various aquatic ecosystems (Fuhrman et al. 2008). PCR with degenerate primers designed to target an internal conserved region in the PR gene was used to identify putative PR sequences. Recovered sequences showed between 80 % and 100 % identity at the amino acid level with previously reported PR sequences from both freshwater and marine samples. These sequences contained conserved residues indicative of proton-pumping activity and of pigments that absorb green light. They harbored diversity at the amino acid level and clustered into three groups, showing similarity with both freshwater and marine sequences. The presence of these genes indicated that PR phototrophy might play a role in these oligotrophic high-mountain aquatic habitats exposed to abundant sunlight by providing a possible advantage that could contribute to survival.

Summary

The sequence-based exploration of the metagenomic content in Andean hot springs goes beyond the identification of taxa using 16S rRNA gene analysis and provides insight into metabolic potential and ecosystem function. Taxonomic surveys of EC spring and other similar springs indicated overall predominance of Bacteria over Archaea, even in the most acidic waters. Certain bacterial taxa predominated, but there were also site-specific groups at each spring, indicating that the surveyed microbiomes were different. The functional annotation showed that the microbial community in EC spring contained pathways involved in nitrogen and sulfur metabolism, as well as extensive DNA repair systems, possibly to cope with UV radiation at such high altitudes. Processes involved in denitrification, nitrogen fixation, and sulfide oxidation were likely linked to the acidity of the environment. Finally, the presence of PR sequences in these communities suggests that these genes might play an important role in bacterial survival in these aquatic ecosystems.

Cross-References

▶ A 123 of Metagenomics
▶ Approaches in Metagenome Research: Progress and Challenges
▶ Biological Treasure Metagenome
▶ Computational Approaches for Metagenomic Datasets
▶ KEGG and GenomeNet, New Developments, Metagenomic Analysis
▶ Lateral Gene Transfer and Microbial Diversity
▶ Metagenomic Potential for Understanding Horizontal Gene Transfer
▶ Metagenomics, Metadata, and Meta-analysis

References

Aguilera A, Souza-Egipsy V, González-Toril E, Rendueles O, Amils R. Eukaryotic microbial diversity of phototrophic microbial mats in two Icelandic geothermal hot springs. Int Microbiol. 2010;13(1):21–32.
Bohórquez LC, Delgado-Serrano L, Lopez G, Osorio-Forero C, Klepac-Ceraj V, et al. In-depth characterization via complementing culture-independent approaches of the microbial community in an acidic hot spring of the Colombian Andes. Microb Ecol. 2012a;63:103–15.
Bohórquez LC, Ruiz-Pérez CA, Zambrano MM. Proteorhodopsin-like genes present in thermoacidophilic high-mountain microbial communities. Appl Environ Microbiol. 2012b;78(21):7813–7.
Bouraoui H, Rebib H, Aissa MB, Touzel JP, O’donohue M, Manai M. Paenibacillus marinum sp. nov., a thermophilic xylanolytic bacterium isolated from a marine hot spring in Tunisia. J Basic Microbiol. 2013. doi:10.1002/jobm.201200275. [Epub ahead of print].
Fuhrman JA, Schwalbach MS, Stingl U. Proteorhodopsins: an array of physiological roles? Nat Rev Microbiol. 2008;6:488–94.
Jiménez DJ, Andreote FD, Chaves D, Montaña JS, Osorio-Forero C, et al. Structural and functional insights from the metagenome of an acidic hot spring microbial planktonic community in the Colombian Andes. PLoS ONE. 2012;7(12):e52069.
Jones B, Renaut R. Hot springs and geysers. In: Reitner J, Thiel V, editors. Encyclopedia of geobiology. Berlin: Springer; 2011. doi:10.1007/SpringerReference_187284 2012-09-10 14:32:43 UTC. Springer Reference (www.springerreference.com).
Jones DS, Albrecht HL, Dawson KS, Schaperdoth I, Freeman KH, et al. Community genomic analysis of an extremely acidophilic sulfur-oxidizing biofilm. ISME J. 2012;6:158–170.
Klatt CG, Wood JM, Rusch DB, Bateson MM, Hamamura N, et al. Community ecology of hot spring cyanobacterial mats: predominant populations and their functional potential. ISME J. 2011;5:1262–78.
Liu Z, Klatt CG, Wood JM, Rusch DB, Ludwig M, et al. Metatranscriptomic analyses of chlorophototrophs of a hot-spring microbial mat. ISME J. 2011;5:1279–90.
López-López O, Cerdán ME, González-Siso MI. Hot spring metagenomics. Life. 2013;2:308–20.
Mathur J, Bizzoco RW, Ellis DG, Lipson DA, Poole AW, et al. Effects of abiotic factors on the phylogenetic diversity of bacterial communities in acidic thermal springs. Appl Environ Microbiol. 2007;73(8):2612–23.
Meyer B, Kuever J. Homology modeling of dissimilatory APS reductases (AprBA) of sulfur-oxidizing and sulfate-reducing prokaryotes. PLoS One. 2008;3(1):e1514.
Montaña JS, Jiménez DJ, Hernandez M, Angel T, Baena S. Taxonomic and functional assignment of cloned sequences from high Andean forest soil metagenome. A Van Leeuw J Microb. 2012;101:205–15.
Myers N, Mittermeier RA, Mittermeier CG, da Fonseca GA, Kent J. Biodiversity hotspots for conservation priorities. Nature. 2000;403:853–8.
Norris PR. Acidophiles. In: Wiley J and Sons, editors. Encyclopedia of life sciences. 2001. p. 1–6. doi:10.1038/npg.els.000033. http://els.net. Accessed 11 Nov 2011.
Pentecost A, Jones B, Renaut RW. What is a hot spring? Can J Earth Sci. 2003;40:1443–6.
Rzonca B, Schulze-Makuch D. Correlation between microbiological and chemical parameters of some hydrothermal springs in New Mexico, USA. J Hydrol. 2003;280:272–84.
Siering PL, Clarke JM, Wilson MS. Geochemical and biological diversity of acidic, hot springs in Lassen volcanic National Park. Geomicrobiol J. 2006;23(2):129–41.
Stout LM, Blake RE, Greenwood JP, Martini AM, Rose EC. Microbial diversity of boron-rich volcanic hot springs of St. Lucia, Lesser Antilles. FEMS Microbiol Ecol. 2009;70(3):402–12.
Tirawongsaroj P, Sriprang R, Harnpicharnchai P, Thongaram T, Champreda V, et al. Novel thermophilic and thermostable lipolytic enzymes from a Thailand hot spring metagenomic library. J Biotechnol. 2008;133:42–9.
Wang S, Hou W, Dong H, Jiang H, Huang L, et al. Control of temperature on microbial community structure in hot springs of the Tibetan Plateau. PLoS ONE. 2013;8(5):e62901.
Wemheuer B, Taube R, Akyol P, Wemheuer F, Daniel R. Microbial diversity and biochemical potential encoded by thermal spring metagenomes derived from the Kamchatka Peninsula. Archaea. 2013:(136714).
Xie W, Wang F, Guo L, Chen Z, Sievert SM, et al. Comparative metagenomics of microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting chemistries. ISME J. 2011;5:414–26.
Metagenomes: 23S Sequences

23S rRNA Genes in Metagenomes

Pelin Yilmaz1 and Frank Oliver Glöckner1,2
1 Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
2 Jacobs University Bremen gGmbH, Bremen, Germany

Synonyms

Environmental genomics; Large subunit rRNA; Metagenomes; Metagenomics; 23S ribosomal RNA gene; 23S rRNA

Definition

As an evolutionary marker, 23S ribosomal RNA (rRNA) offers more diagnostic sequence stretches and greater sequence variation than 16S rRNA. The main drawback of using 23S rRNA as a phylogenetic marker is that it is still not as widely used. In a survey of 23S rRNA gene sequences found in metagenomic datasets, the Global Ocean Sampling (GOS) metagenome revealed that 23S rRNA gene sequences are twice as abundant as 16S rRNA gene fragments, with 23S rRNA gene fragments being generally about 100 bp longer.

Introduction

The distribution of 23S rRNA gene sequences in the GOS and other metagenomes remains unexplored. Although the 16S rRNA gene has been established as the standard molecule for analyzing the taxonomic diversity in metagenomes, using the 23S rRNA gene as a phylogenetic marker offers advantages over using the 16S rRNA gene. With an average length of 2,900 bases, it is almost twice as long as the 16S rRNA and, therefore, is theoretically a more informative phylogenetic marker than the 16S rRNA gene (Ludwig and Schleifer 1994; Ludwig et al. 1995; Ludwig and Klenk 2001). Both the 23S and 16S rRNA molecules share the same properties in terms of molecule ubiquity, as well as sequence and structure conservation. Furthermore, phylogenetic trees based on 16S rRNA and on 23S rRNA genes have comparable topologies (Rijk et al. 1995; Ludwig and Schleifer 1999).

A disadvantage of the 23S rRNA gene is the relatively low number of sequences available in the public databases as compared to 16S rRNA genes. Currently (May 2014), only 446,998 23S/28S sequences are publicly available, compared to 4,346,367 16S/18S sequences (Quast et al. 2013). Furthermore, the low number of 23S/28S rRNA sequences (29,397) longer than 1,900 bases (full length) limits the assessment of taxonomic diversity due to reduced resolution in taxonomic assignments. The lower number of available 23S rRNA gene sequences can historically be explained by the technical difficulty and higher cost of sequencing the larger molecule with Sanger sequencing technology. However, with new technologies and constantly decreasing sequencing costs, these difficulties are becoming less pronounced.

Summary of rRNA Gene Fragment Retrieval

The 23S/28S rRNA gene is twice as long as the 16S/18S rRNA gene; hence, the probability of retrieving a 23S/28S rRNA gene fragment should be proportionately higher. Ratios of approximately 2:1 of identified 23S/28S rRNA over 16S/18S rRNA observed at different sites in the GOS metagenome study support this expectation: GS000d (904 23S/28S vs. 438 16S/18S), GS029 (351 23S/28S vs. 162 16S/18S), or GS112a (227 23S/28S vs. 113 16S/18S) (Fig. 1a). This twofold difference is also reflected by the average number of fragments retrieved per site, which is 301 for 23S/28S rRNA and 177 for
Metagenomes: 23S Sequences, Fig. 1 (a) Comparison of number of 23S/28S (dark gray bars) and 16S/18S (light gray bars) rRNA fragments retrieved from each GOS sample dataset. (b) Average length of 23S/28S (dark gray circles) and 16S/18S (light gray circles) rRNA fragments from each GOS sample dataset in terms of number of aligned bases within the rRNA gene boundaries, excluding any fragment (23S/28S or 16S/18S) that contained less than 100 aligned bases. Sites marked with an “*” indicate that less than five fragments were retrieved
16S/18S rRNA. Furthermore, 23S/28S rRNA gene fragments are considerably longer than 16S/18S gene fragments (Fig. 1b). Where an average 23S/28S rRNA fragment has 836 aligned bases within the rRNA gene boundaries, a 16S/18S rRNA fragment has 713 aligned bases. More abundant and longer rRNA gene fragments may provide additional information in assessing taxonomic diversity, both with phylogeny and operational taxonomic unit-based methods, as well as increasing the chances to affiliate other gene fragments with specific lineages. Both 23S/28S and 16S/18S rRNA fragments are randomly distributed over the rRNA gene regions, meaning that no specific sequence region is over- or underrepresented.

Taxonomic Diversity Based on 23S and 16S rRNA Genes

Percentages of both 23S and 16S rRNA fragments associated with major marine bacterial and archaeal taxa show good agreement with each other (Fig. 2a, b). Specifically, based on 23S rRNA assignments, 43 % of the retrieved rRNA fragments are associated with Alphaproteobacteria, followed by 17 % Gammaproteobacteria, 9 % Actinobacteria, 8 % Cyanobacteria, 8 % Bacteroidetes, 3 % Betaproteobacteria, 2 % Euryarchaeota, and 0.4 % Crenarchaeota (Fig. 2a). However, less agreement in the assignment of 23S rRNA and 16S rRNA fragments is observed with less
Metagenomes: 23S Sequences, Fig. 2 Percentage of 23S (a) and 16S (b) rRNA fragments associated with major marine bacterial and archaeal taxa among all GOS sample datasets, except GS038–GS046 and GS050. Percentages were calculated based on absolute numbers of fragments associated with a given taxon
abundant marine taxa. For example, Chloroflexi- and Deferribacteres-associated fragments are not observed in the 23S rRNA gene-based classification, which may be ascribed to the lack of annotated clades for these taxa. In such cases, 16S rRNA gene-based classifications appear to provide better estimations.

Similar trends are also observed in sample-by-sample distribution of taxa at the “class” level for both 23S and 16S rRNA-based assignments, as compared to the previous overall assessment (Fig. 3a, b). Alphaproteobacteria, Gammaproteobacteria, Actinobacteria, Cyanobacteria, Flavobacteria, and Betaproteobacteria are the most abundant taxa in the majority of sample datasets. However, differences are observed in the occurrence or relative abundance of minor groups, such as Planctomycetacia or Aquificae. In certain cases, 23S rRNA-based assessments predict higher relative abundances or occurrence in sample datasets for other taxa. Up to 12-fold more Epsilonproteobacteria-associated 23S rRNA fragments are found in sample dataset GS000b compared to 16S rRNA fragments. Additionally, Lentisphaeria, which appears to be present in ten sites according to 23S rRNA classifications, is observed at only two sites according to 16S rRNA gene classifications.

The former case, where 16S rRNA-based assignments estimated more taxa in more sample datasets, demonstrates the current drawback of 23S rRNA-based classification (i.e., its lack of resolution due to insufficient full-length reference sequences). On the other hand, the latter observations demonstrate that when reference sequences are present for a taxon, the higher number of 23S rRNA fragments retrieved can capture what is missed with 16S rRNA fragments.

Investigating relative abundances at lower taxonomic levels can shed light on more prominent habitat-specific diversity patterns. However, with the current size and content of LSU rRNA reference databases, the 23S rRNA has a distinct disadvantage in achieving this. As summarized in Table 1, the percentage of 23S rRNA gene fragments that can be classified to a certain taxon is comparable to the 16S rRNA gene-based classification at domain, phylum or class levels. A decrease in percentage of classified 23S rRNA fragments was observed at lower levels, from 95 % at the class level down to even 17 % at the genus level. This can be explained by the 23,197 sequences of taxonomically classified cultured organisms in the SILVA SSU Ref dataset (release 102) versus only 3,602 sequences in the LSU Ref dataset of the same release.
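The widening gap between the two markers at lower ranks can be made explicit with a short calculation. The following is a minimal Python sketch that uses the percentages reported in Table 1 and the reference-set sizes quoted above; the variable and function names are illustrative assumptions and are not part of the original study.

```python
# Percentage of rRNA gene fragments classified at each rank (values from Table 1).
RANKS = ["domain", "phylum", "class", "order", "family", "genus"]
CLASSIFIED_23S = [99.9, 96.6, 94.4, 78.8, 35.4, 16.6]
CLASSIFIED_16S = [100.0, 100.0, 99.1, 96.3, 80.0, 31.2]


def classification_gap(pct_23s, pct_16s):
    """Return the 16S-minus-23S difference, in percentage points, per rank."""
    return [round(b - a, 1) for a, b in zip(pct_23s, pct_16s)]


gaps = dict(zip(RANKS, classification_gap(CLASSIFIED_23S, CLASSIFIED_16S)))
for rank, gap in gaps.items():
    print(f"{rank:>7}: 16S classifies {gap:4.1f} percentage points more fragments")

# Reference-set sizes quoted in the text (SILVA release 102).
ssu_ref, lsu_ref = 23197, 3602
ref_ratio = ssu_ref / lsu_ref
print(f"SSU Ref holds about {ref_ratio:.1f} times as many sequences as LSU Ref")
```

Under these numbers the largest gap appears at the family level, which is consistent with the explanation that the smaller LSU reference set limits resolution below the class level.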
Metagenomes: 23S Sequences, Fig. 3 The relative abundance of 23S (a) and 16S (b) rRNA fragments associated with different taxa (rows) at each GOS sample dataset (columns). Presence of a spot indicates the presence of fragments associated with a given taxon, and the area of a spot represents the relative abundance. Relative abundances are based on absolute counts of all fragments from a given site associated with a certain taxon, which are then normalized according to the total fragment counts from that site. Abundances are not normalized with respect to single copy genes, and since rRNA operons can occur multiple times in a genome, the numbers do not represent cell abundances. The taxa shown here are on the “class” level, except Cyanobacteria, which is at the “phylum” level
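The per-site normalization described in the caption of Fig. 3 can be sketched in a few lines of Python. This is only an illustration of the stated rule (fragment counts per taxon divided by the total fragment count of the site); the function name and the example counts are hypothetical and are not data from the GOS study.

```python
def relative_abundance(counts_by_site):
    """Divide each taxon's fragment count by the total fragment count of its site."""
    result = {}
    for site, counts in counts_by_site.items():
        total = sum(counts.values())
        if total == 0:
            # A site without fragments has no defined relative abundances.
            result[site] = {}
            continue
        result[site] = {taxon: n / total for taxon, n in counts.items()}
    return result


# Hypothetical fragment counts, for illustration only.
example_counts = {
    "site_A": {"Alphaproteobacteria": 60, "Gammaproteobacteria": 30, "Flavobacteria": 10},
    "site_B": {},
}
rel = relative_abundance(example_counts)
print(rel["site_A"])
```

As the caption notes, such values describe fragment proportions only; they are not corrected for rRNA operon copy number and therefore do not represent cell abundances.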
Metagenomes: 23S Sequences, Table 1 Percentage of 23S and 16S rRNA gene fragments that can be classified up to domain, phylum, class, order, family, and genus levels. Total number of fragments classified are 20,036 and 12,491 for 23S and 16S rRNA, respectively, excluding Eukarya and fragments with less than 300 aligned bases for LSU and less than 100 aligned bases for SSU

Level     23S rRNA gene fragments (%)   16S rRNA gene fragments (%)
Domain    99.9                          100.0
Phylum    96.6                          100.0
Class     94.4                          99.1
Order     78.8                          96.3
Family    35.4                          80.0
Genus     16.6                          31.2

Specificity of Common 23S rRNA Primers and Probes

Including the 23S rRNA gene sequences identified in the GOS metagenome dataset in the SILVA LSU Parc dataset increased its size by 12 % (SILVA release 102). Furthermore, they have not undergone PCR amplification and hence provide a unique opportunity for testing the coverage of previously described universal amplification primers, as well as widely used class-specific probes.

The most recently developed primer sets (129f, 189f, 457r, 2490r) (Hunt et al. 2006), as well as primer 2241r (Lane 1991), show reasonable group coverage for the 23S rRNA gene sequences identified in the GOS dataset with an average of 85 % (Table 2), and the results are comparable to those obtained from matching the primers against the SILVA LSU Parc dataset (release 102) with a difference of only 2 %. The reference dataset used by Hunt and colleagues, with 2,176 sequences, is smaller than both the LSU Parc (average of 11,000 target group sequences) and the GOS 23S (average of 5,400 target group sequences) datasets used in this study. However, the authors have included environmental shotgun sequences from the Sargasso Sea pilot study (Venter et al. 2004) in their dataset, which would account for the comprehensiveness of these primers also in the GOS 23S dataset.

Contrary to these results, the primers developed for the amplification of variable regions of bacterial 23S rRNA sequences (11a–97ar) (Van Camp et al. 1993) show very poor group coverage in the GOS 23S dataset sequences, with generally less than 50 % coverage of the target group. 90 % group coverage is only observed for 69ar (Table 2). Although the primer binding sites were highly conserved, this is counteracted by the very small dataset that these primers were based on. Surprisingly, primers 53a to 97ar are observed to have higher group coverage within the GOS 23S rRNA sequences than within LSU Parc.

The two archaeal primers (LSU190-F and LSU2445a-R) (DeLong et al. 1999) show very low group coverage in the GOS 23S dataset (Table 2), with 14 % and 5 %, respectively. Nevertheless, while the percentages are higher in the LSU Parc, they do not exceed 50 %.

For the BET42a probe (Manz et al. 1992), 79 % group coverage is observed. This, as well as the number of outgroup hits within the GOS 23S dataset, is close to that reported by a previous evaluation (Amann and Fuchs 2008) (Table 2). Group coverage within LSU Parc (87 %) is in accordance with Amann and Fuchs (2008) (Table 2), although considerably more outgroup hits, 348 in LSU Parc versus 62, are observed.

The GAM42a probe coverage in the GOS 23S dataset (Table 2) is almost half (42 %) of the value reported previously (76 %) (Amann and Fuchs 2008) and the corresponding evaluation of the LSU Parc (78 %) dataset. Since the mismatches could result from sequencing errors, the alignments of sequences with mismatches to the probe GAM42a were manually inspected. A few cases were likely to be sequencing errors and were mainly observed in fragments obtained from ends of sequencing reads. The majority of the mismatches, however, were consistent and class-specific. These mismatches are up to four bases and are found mainly between E. coli positions 1,030–1,040. Although this evaluation of the GAM42a probe was based on a single environment, the surface ocean, limitations and anomalous results with the GAM42a
Metagenomes: 23S Sequences, Table 2 Specificities of selected primers and probes, evaluated on the 23S/28S rRNA gene fragments retrieved from the GOS metagenomes having more than 300 aligned bases within the rRNA gene boundaries and on the SILVA Parc release 102 LSU dataset. Outgroup hits are the sum of both Archaea and Eukarya in case of bacterial primers, both Bacteria and Eukarya in case of archaeal primers, only Eukarya in case of bacterial and archaeal primers, and non-Betaproteobacteria and non-Gammaproteobacteria for BET42a and GAM42a probes

                                          GOS 23S/28S                                LSU Parc
Primer/probe  Ref.  Target group          Size of       Group         Outgroup      Size of       Group         Outgroup
                                          target group  coverage (%)  hits          target group  coverage (%)  hits
129f          a     Bacteria              4,853         74            0             10,640        82            4
189f          a     Bacteria              5,285         87            0             11,508        87            0
457r          a     Bacteria              5,551         86            4             11,177        83            279
2241r         b     Bacteria              5,832         84            10            11,457        86            3,967
2490r         a     Bacteria              5,734         94            0             10,821        98            0
11a           c     Bacteria              5,256         20            0             11,478        39            0
23ar          c     Bacteria              5,619         23            0             10,526        49            4
43a           c     Bacteria              5,633         6             0             10,999        44            0
53a           c     Bacteria              5,320         3             0             10,594        1             0
62ar          c     Bacteria              5,540         8             0             11,455        5             0
69ar          c     Bacteria              5,731         90            0             11,443        87            0
93a           c     Bacteria              5,737         62            0             10,322        55            0
93ar          c     Bacteria              5,731         63            0             10,327        56            2
97ar          c     Bacteria              4,969         55            0             9,165         29            38
LSU190-F      d     Bacteria and Archaea  5,348         14            0             11,741        24            0
LSU2445a-R    d     Archaea               142           5             0             262           28            0
BET42a        e     Betaproteobacteria    209           79            63            570           87            348
GAM42a        e     Gammaproteobacteria   980           42            1             2,877         78            10

References: a Hunt et al. 2006; b Lane 1991; c Van Camp et al. 1993; d DeLong et al. 1999; e Manz et al. 1992
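The average group coverage quoted in the text for the Hunt et al. (2006) primers and primer 2241r (Lane 1991) can be recomputed from Table 2. The Python sketch below is only a cross-check of that arithmetic; the dictionary layout and variable names are assumptions made for illustration.

```python
# Group coverage (%) from Table 2 as (GOS 23S/28S, LSU Parc) for each primer.
coverage = {
    "129f": (74, 82),
    "189f": (87, 87),
    "457r": (86, 83),
    "2241r": (84, 86),
    "2490r": (94, 98),
}

n = len(coverage)
mean_gos = sum(gos for gos, _ in coverage.values()) / n
mean_parc = sum(parc for _, parc in coverage.values()) / n
difference = mean_parc - mean_gos

print(f"Mean coverage in GOS 23S/28S: {mean_gos:.1f} %")
print(f"Mean coverage in LSU Parc:    {mean_parc:.1f} %")
print(f"Difference:                   {difference:.1f} percentage points")
```

The means come out at 85.0 % for the GOS fragments and 87.2 % for LSU Parc, which matches the reported average of 85 % and the difference of about 2 %.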
probe have been reported previously for other environments as well, which were found to be mainly due to polymorphisms at E. coli position 1,033 (Yeates et al. 2003; Barr et al. 2010). Our observation confirms these reports, by adding additional polymorphisms before and after this position. Consequently, the limitations of the GAM42a probe might be more severe than previously thought, and therefore, we recommend the design and testing of novel Gammaproteobacteria probes.

Summary

This comparative overview of 16S and 23S rRNA fragments retrieved from the GOS metagenomes exemplifies the possibility and power of using 23S rRNA genes. High-quality taxonomic classification for biodiversity analysis, as well as primer and probe design, depends on the size and extent of the reference dataset used. The advantage of using the larger 23S rRNA genes for biodiversity analysis, especially for the marine system, has been shown previously (Peplies et al. 2004). Additionally, a recent study assessing the diversity of paralogous 23S rRNA genes has shown that significant sequence diversification was observed in 184 species, further supporting the suitability of this molecule for taxonomy (Pei et al. 2009). Although an obvious limitation faced during this study was the small size of the 23S rRNA gene reference datasets, this is likely to be overcome in the near future with the contribution of (meta-)genomic sequences from
mega-sequencing projects, such as the Human Microbiome Project, the TerraGenome, the Tara
Oceans, or the Genomic Encyclopedia of Bacteria and Archaea. Moreover, studies assessing the
characteristics and sequence diversity of 23S rRNA genes in bacterial and archaeal genomes, in
combination with efforts to design, test, and reevaluate universal and group-specific primers
and probes, can renew the interest and utilization of this molecule. The application of
continually advancing, cheaper sequencing technologies to the undiscovered fraction of the 23S
rRNA gene sequences can result in a higher appreciation of this valuable phylogenetic marker.

References

Amann R, Fuchs BM. Single-cell identification in microbial communities by improved fluorescence in situ hybridization techniques. Nat Rev Microbiol. 2008;6(5):339–48.
Barr JJ, Blackall LL, et al. Further limitations of phylogenetic group-specific probes used for detection of bacteria in environmental samples. ISME J. 2010;4:959–61.
DeLong E, Taylor L, et al. Visualization and enumeration of marine planktonic Archaea and Bacteria by using polyribonucleotide probes and fluorescent in situ hybridization. Appl Environ Microbiol. 1999;65(12):5554–63.
Hunt DE, Klepac-Ceraj V, et al. Evaluation of 23S rRNA PCR primers for use in phylogenetic studies of bacterial diversity. Appl Environ Microbiol. 2006;72(3):2221–5.
Lane DJ. 16S/23S rRNA sequencing. In: Stackebrandt E, Goodfellow M, editors. Nucleic acid techniques in bacterial systematics. Chichester/New York: Wiley; 1991. p. 115–75.
Ludwig W, Klenk HP. A phylogenetic backbone and taxonomic framework for prokaryotic systematics. In: Boone DR, Castenholz RW, editors. The Archaea and the deeply branching and phototrophic Bacteria. New York: Springer; 2001. p. 49–65.
Ludwig W, Schleifer KH. Bacterial phylogeny based on 16S and 23S rRNA sequence analysis. FEMS Microbiol Rev. 1994;15(2–3):155–73.
Ludwig W, Schleifer K. Phylogeny of Bacteria beyond the 16S rRNA standard. ASM News. 1999;65(11):752–7.
Ludwig W, Rossello-Mora R, et al. Comparative sequence analysis of 23S rRNA from Proteobacteria. Syst Appl Microbiol. 1995;18:164–88.
Manz W, Amann R, et al. Phylogenetic oligodeoxynucleotide probes for the major subclasses of Proteobacteria: problems and solutions. Syst Appl Microbiol. 1992;15(4):593–600.
Pei A, Nossa CW, et al. Diversity of 23S rRNA genes within individual prokaryotic genomes. PLoS ONE. 2009;4(5):e5437.
Peplies J, Glöckner FO, et al. Comparative sequence analysis and oligonucleotide probe design based on 23S rRNA genes of Alphaproteobacteria from North Sea bacterioplankton. Syst Appl Microbiol. 2004;27(5):573–80.
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41:D590–D596.
Rijk P, Peer Y, et al. Evolution according to large ribosomal subunit RNA. J Mol Evol. 1995;41(3):366–75.
Van Camp G, Chapelle S, et al. Amplification and sequencing of variable regions in bacterial 23S ribosomal RNA genes with conserved primer sequences. Curr Microbiol. 1993;27(3):147–51.
Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304(5667):66–74.
Yeates C, Saunders AM, et al. Limitations of the widely used GAM42a and BET42a probes targeting bacteria in the Gammaproteobacteria radiation. Microbiology. 2003;149(5):1239–47.

Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome

Brian V. Jones1 and C. G. M. Gahan2
1 Centre Biomedical and Health Science Research, University of Brighton, School of Pharmacy and Biomolecular Sciences, Brighton, East Sussex, UK
2 Department of Microbiology, School of Pharmacy & Alimentary Pharmabiotic Centre, University College Cork, Cork, Ireland

Definitions

Metagenome/metagenomics: The collective genomes of all members of a particular microbial
community may be referred to as the metagenome (or a genome of many).
Metagenomics refers to methods which seek to understand the composition, development, and
function of microbial ecosystems through analysis of the community metagenome.

Function-driven metagenomics: A metagenomic approach in which emphasis is placed on the
recovery of genes encoding a defined function of interest, through assays based on heterologous
gene expression. Typically metagenomic DNA is used to generate genetic libraries in a surrogate
host species that may be easily manipulated in the laboratory. Each clone in the library
(analogous to books in a conventional library) represents a fragment of metagenomic DNA from
a member of the microbial community under study. Libraries are then subsequently screened to
identify clones encoding and expressing activities of interest.

Large-insert library/genetic library: Due to the complexity of microbial communities, genetic
libraries constructed for function-driven metagenomic analysis often seek to clone large
fragments of metagenomic DNA (typically ranging from 40 to 200 kb in size, depending on the
specifics of the cloning system used). The term “insert” refers to the metagenomic DNA
fragments which are ligated, or “inserted,” into a plasmid vector that maintains them in the
surrogate host bacterium. Insert sizes of ~40 kb and over are usually referred to as “large
inserts,” giving rise to the term “large-insert library.”

Sequence-driven metagenomics: A metagenomic approach in which the emphasis is placed on the
generation and analysis of nucleotide sequence data from metagenomic DNA. Typically
sequence-based approaches are utilized to provide a broad overview of the population structure
and predicted functions undertaken by a microbial community.

Heterologous gene expression: Refers to the expression of genes in an organism from which they
did not originate. For function-driven metagenomics, this generally refers to the expression of
genes encoded by cloned fragments of metagenomic DNA in the surrogate host species used to
construct genetic libraries (typically Escherichia coli).

Bile Acids and Microbial Bile Acid Metabolism

Bile acids (BA) are cholesterol derivatives synthesized in the liver and linked with either
glycine or taurine to form conjugated bile acids (CBA) (Ridlon et al. 2006; Begley et al.
2005a, b; Fig. 1). The dominant CBA in humans are glycine conjugates of cholic acid and
chenodeoxycholic acid, with CBA forming a major component of bile stored in the gall bladder
(Ridlon et al. 2006). In response to food intake, bile is secreted into the lumen of the
intestine where CBA facilitate the digestion of dietary fat, promoting the emulsification of
lipids and their subsequent absorption across the intestinal epithelium (Ridlon et al. 2006;
Begley et al. 2005a). However, the functions of bile acids are not limited to digestion, and BA
are also important signaling molecules that contribute to the regulation of diverse metabolic
processes (Thomas et al. 2008; Fig. 2). These include regulation of mucosal immune responses in
the intestine, as well as aspects of energy homeostasis and fat storage (Thomas et al. 2008;
Inagaki et al. 2006; Houten et al. 2006; Jones 2011; Watanabe et al. 2006; Fig. 2). As such, BA
are now no longer viewed as purely digestive secretions but also as metabolic integrators and
key regulators of intestinal homeostasis (Thomas et al. 2008; Hofmann and Eckmann 2006; Jones
2011).

The regulatory functions of bile acids are believed to act through two main receptors, the
nuclear receptor FXRalpha and the membrane receptor TGR5, for which bile acids are the natural
ligands (Thomas et al. 2008). These receptors are highly expressed in the liver and intestinal
tissues but also in numerous extraintestinal tissues (Thomas et al. 2008). Although the
majority of bile acids are efficiently reclaimed from the intestine and returned directly to
the liver for reuse (referred to as enterohepatic circulation), a portion enter the systemic
circulation and signal other organs through these receptors, coordinating cholesterol,
triglyceride and glucose metabolism, as well as fat storage (Thomas et al. 2008; Fig. 2).
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 1 Structure of
dominant conjugated bile acids in humans (Modified from Begley et al. 2005a). Major bile acid
species in the human bile acid pool are conjugated forms linked to either glycine or taurine
via amide bonds, with glyco-conjugates dominant in humans (Ridlon et al. 2006). The predominant
bile acid species are cholic acid (CA) and chenodeoxycholic acid (CDCA), which are generated de
novo in the liver from cholesterol and referred to as primary bile acids (Ridlon et al. 2006;
Thomas et al. 2008). De-conjugated CA and CDCA, as well as derivatives of these primary BA
formed in the intestine, are recovered and returned to the liver where they are conjugated and
re-assimilated into the bile acid pool. For comprehensive reviews, see Ridlon et al. (2006) and
Begley et al. (2005a)
CBA have also been implicated in the control of microbial growth in the small intestine via
toxic effects on colonizing bacteria (Begley et al. 2005a; Ridlon et al. 2006). This
antimicrobial effect is thought to repress bacterial growth in the small intestine and prevent
microbes proliferating to levels which are harmful to the human host. Local mucosal immune
responses in the intestine are also regulated by bile acids (through FXRalpha) and implicated
in microbial population control in this compartment (Inagaki et al. 2006). It is most likely
that bile acid mediated mucosal immune regulation works in synergy with the direct effects of
bile acids on resident microbes, to prevent bacterial overgrowth in the small intestine and
associated deleterious effects on host health (Inagaki et al. 2006; Hofmann and Eckmann 2006;
Begley et al. 2005a; Fig. 2).

However, once secreted into the intestinal lumen, CBA are subject to extensive
biotransformation by indigenous gut microbes, leading to the formation of a range of secondary
and tertiary products (Ridlon et al. 2006; Begley et al. 2005a; Jones et al. 2008; Fig. 3).
These modified bile acids display altered binding characteristics for bile acid receptors, with
microbial products of bile acid metabolism among the most potent agonists (Thomas et al. 2008).
This highlights the potential for microbes resident in the human gut microbiome to influence
wider aspects of host metabolism and phenotype, through interaction with bile acid signaling
pathways (Jones et al. 2008; Thomas et al. 2008; Jones 2011; Ogilvie and Jones 2012). Congruent
with this hypothesis is the accumulating body of evidence implicating microbial bile acid
metabolism as the basis of a long-standing dialogue between the human host and its gut
microbiome (Jones 2011; Inagaki et al. 2006; Gadaleta et al. 2011; Maran et al. 2009; Modica
et al. 2008; Duboc et al. 2013; Jones et al. 2008). As such, there is increasing interest in
understanding the role of this activity in human health and disease processes, with this
function of the gut microbiome likely to be a viable target for disease prevention through
manipulation, or augmentation of the intestinal microbial ecosystem.
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 2 Overview of
physiological functions undertaken by bile acids. Boxes shaded violet summarize the direct
functions of bile acids in the small intestine, attributed to their physical properties. Boxes
shaded blue summarize regulatory functions of bile acids, through interaction with the main
bile acid receptors TGR5 and FXRalpha. For comprehensive reviews of bile acid signaling, see
Thomas et al. (2008)
Overview of Bile Salt Hydrolases: Biochemistry, Structure, and Function

Bile salt hydrolases (BSH; EC 3.5.1.24) (also designated as choloylglycine hydrolases or
conjugated bile acid hydrolases) are members of the N-terminal nucleophilic (Ntn) hydrolase
superfamily of proteins and catalyze the hydrolysis of conjugated bile acids, linked with the
amino acids taurine or glycine (tauro-CBA, glyco-CBA), to liberate free primary bile acids and
amino acids (Fig. 3; Begley et al. 2006; Kumar et al. 2006). The wider enzyme superfamily also
contains the penicillin V acylase (PVA; EC 3.5.1.11) enzyme family, and BSH and PVA enzymes
share significant homology and catalyze hydrolysis of the same type of chemical bond. While the
main substrates for BSH and PVA enzymes (conjugated bile acids and penicillins, respectively)
vary considerably in structure, PVA has been shown to exhibit some moderate activity against
bile acids and some BSH enzymes demonstrate mild activity against penicillin V (Kumar et al.
2006). This suggests that each enzyme group has preferential activity against a specific
substrate but that some overlap in activities also occurs (Kumar et al. 2006). The sequence
homology between these enzyme families has led to mis-annotation of PVA in some bacterial
genomes, for example, in the initial genome annotation of Listeria monocytogenes (Begley et al.
2005b) and Lactobacillus plantarum WCFS1 (Lambert et al. 2008a). This highlights a requirement
for functional enzymatic
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 3 Major bile
acid transformations undertaken by the human gut microbiota (Modified from Jones 2011). Bile
salt hydrolase (BSH) catalyzes the initial de-conjugation of CBA to liberate free primary bile
acids and amino acids. Free primary bile acids are then available to further modification by
the gut microbiome and converted to secondary forms. A multistep 7-alpha dehydroxylation
pathway is responsible for generation of key secondary BA species which accumulate in the bile
acid pool. GCA, TCA, glyco- and tauro-conjugated cholic acid, respectively; GCDCA, TCDCA,
glyco- and tauro-conjugated chenodeoxycholic acid, respectively; CA, CDCA, free primary bile
acids cholic acid and chenodeoxycholic acid, respectively; DCA, LCA, free secondary BA
deoxycholic acid and lithocholic acid, respectively. For comprehensive reviews of microbial
bile acid transformations, see Ridlon et al. (2006) and Begley et al. (2005a)
analysis in order to determine substrate preferences and to guide annotation (Jones et al.
2008; Lambert et al. 2008b).

The crystal structure has been solved for a number of BSH (Kumar et al. 2006; Rossocha et al.
2005) and PVA (Suresh et al. 1999) enzymes and demonstrates a conservation in overall structure
suggestive of shared mechanisms of action and an evolutionary relationship between BSH and PVA
(Kumar et al. 2006). Detailed analysis of the structure of BSH and PVA enzymes indicates that
there is a significant difference in the organization of specific loops near the active site in
each case which may explain differences in substrate specificity (Kumar et al. 2006).

Structural and functional analysis of BSH enzymes from different bacteria has revealed the
presence of conserved amino acids that are thought to be essential for bile hydrolysis. In
particular the thiol group of the Cys-1 amino acid has been shown to be essential for catalytic
activity (Kim et al. 2004; Lodola et al. 2012). In addition a number of amino acids including
Asp-20, Tyr-82, Asn-175, and Arg-228 are highly conserved across numerous BSH enzymes
(Begley et al. 2006) and have recently been shown to be essential for catalytic activity mainly
through electrostatic interactions with the Cys-1 sulfur atom (Lodola et al. 2012) (Begley
et al. 2006). Despite high levels of amino acid conservation, different BSH enzymes display
subtle differences in their preferred bile substrates with some enzymes exhibiting hydrolysis
of glyco- and tauro-conjugated bile acids and others demonstrating specific hydrolysis of
tauro-conjugated bile acids (Jones et al. 2008). BSH enzymes with specificity for
tauro-conjugated bile acids are highly represented among the Bacteroidetes and form a separate
phylogenetic group relative to other BSH enzymes but have not been characterized in detail
(Jones et al. 2008). Further biochemical analysis of a variety of BSH enzymes is warranted to
determine the structural variances that give rise to these subtle differences in bile acid
substrate range.

Metagenomic Analysis of Bile Salt Hydrolases (BSHs)

As the human gut microbiota is composed predominantly of microbes which are yet to be grown in
the laboratory, a range of culture-independent approaches have been developed and applied to
study this and other microbial communities (Handelsman 2004; Jones and Marchesi 2007; Qin
et al. 2010; Kurokawa et al. 2007; Gill et al. 2006). Metagenomic approaches constitute
a particularly powerful branch of the culture-independent techniques available for
characterization of microbial ecosystems, in which the collective genomes of all species
comprising a community are considered as a single, community-wide, genetic unit (the
metagenome) (Handelsman 2004). Access and analysis is guided by this basic principle, and
metagenomic approaches are rooted in the extraction of total, mixed community DNA (metagenomic
DNA) without any prior cultivation (Handelsman 2004). Recovered community DNA is then either
subject to direct analysis using high-throughput sequencing (shotgun metagenomics or
sequence-based metagenomics) or used to construct large-insert genetic libraries for
function-based screening (function-driven metagenomics) (Handelsman 2004) (Fig. 4).

The resulting data not only affords access to census-type information describing the
composition of a community (who is there?) but also permits access to the broader functional
content encoded by microbial ecosystems (what are they doing?) (Handelsman 2004; Jones et al.
2008). Recently both function-driven and sequence-based metagenomic approaches have been
applied to analyze BSH activity in the gut microbiome and provide good examples of the capacity
for metagenomics to generate novel functional insights into a microbial community and, in the
case of the human microbiome, to understand its influence on host health (Jones et al. 2008;
Ogilvie and Jones 2012).

Function-Driven Metagenomic Analysis of Bile Salt Hydrolases: Due to the relative paucity of
information regarding the genes underpinning bile acid metabolism in the gut microbiome,
initial community-wide studies of this activity utilized a function-driven metagenomic
approach, to assess the diversity and phylogenetic distribution of BSH activity in this
ecosystem (Jones et al. 2008; Fig. 4). The reliance on heterologous gene expression in the
surrogate host (typically E. coli) and the requirement for a phenotypic screen for the trait of
interest are clear limitations of the function-based strategy but are offset by unique benefits
of this approach over other metagenomic techniques (Handelsman 2004). A major advantage of the
function-driven approach is that no prior knowledge or sequence data for the genes underpinning
an activity is required, which not only allows the application of metagenomics to poorly
studied microbial functions in a community (such as bile acid metabolism) but also permits the
recovery of novel, unrelated enzyme classes catalyzing a particular reaction (Jones et al.
2008; Handelsman 2004). Furthermore, a clear confirmation of activity among the genes
identified is intrinsic to the function-driven approach. This is
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 4 Overview of
metagenomic approaches to study microbial ecosystems. Recovery of metagenomic DNA (Modified
from Ogilvie et al. 2012): Metagenomic approaches begin with sampling the microbial ecosystem
and extracting DNA from the mixed community as a whole, without any prior cultivation. This
metagenomic DNA may then be subjected to one or more strategies to access the functional
content of the ecosystem under study and/or explore the population structure and identify
species present. Sequence-driven metagenomics: Metagenomic DNA may be subject directly to
high-throughput sequencing (shotgun metagenomics) or first used as a template for PCR reactions
intended to amplify key genes of interest. The latter is most typically used to amplify
phylogenetic anchors, such as genes for 16S ribosomal RNA, which permit a census of the species
present in a community. Sequences generated directly in the shotgun approach can subsequently
be compared with well-characterized microbial genomes and/or assembled into large contigs and
genes predicted, in order to assess the functions encoded by community members (with
information on population structure also captured in this strategy where relevant gene
of major benefit in the analysis of enzymes such as BSH, which share a considerable degree of
sequence homology with closely related enzymes in the wider Ntn_CGH-like (COG3049) family of
proteins (Jones et al. 2008; Kumar et al. 2006). In particular BSH are closely related to
penicillin V amidases, from which they are believed to have evolved, and comparison of sequence
data alone is often insufficient for the accurate prediction of function in these enzymes
(Kumar et al. 2006; Jones et al. 2008).

The function-driven approach employed to survey BSH activity in the gut microbiome was based on
screening large-insert genetic libraries (constructed from metagenomic DNA derived from stool
samples), using a simple plate-based assay to identify clones able to de-conjugate CBA (Fig. 5
Library construction and Screen). The basis of this screen is the complementation of the
BSH-deficient E. coli host used to construct libraries and the subsequent de-conjugation of CBA
incorporated into the bacterial growth media used for screening (Jones et al. 2008; Dashkevicz
and Feighner 1989). Once liberated, free bile acids are no longer soluble and precipitate to
form a halo around BSH-positive clones, allowing those harboring active BSH to be easily
identified and recovered for further analysis (Jones et al. 2008; Fig. 5). Characterization of
BSHs recovered from the human gut metagenomic library through function-based screening provided
the basis to subsequently examine the distribution and evolution of this activity among members
of the gut microbiome, the conservation of this function between distinct human microbiomes,
and the role of this activity in gut-associated bacteria (Jones et al. 2008; Ogilvie and Jones
2012; Fig. 6).

Distribution of BSH Activity Among Members of the Human Gut Microbiome: Sequence data obtained
from metagenomic clones encoding BSH activity was used to predict the phylogenetic origin of
the BSHs obtained and determine which members of the gut microbiome encode this function (Jones
et al. 2008). Although the taxonomic resolution afforded by this analysis was limited by a lack
of conserved phylogenetic anchors in many metagenomic clones (such as 16S rRNA genes) and the
limited availability of genome sequences from gut-associated bacterial species at the time of
analysis (against which recovered BSH sequences could be compared), this survey nevertheless
revealed a broad distribution of BSH activity within the gut microbiome (Jones et al. 2008).

All major bacterial phyla comprising the human gut microbiome (Bacteroidetes, Firmicutes,
Actinobacteria) were shown to encode this function, highlighting the high level of redundancy
and general stability of BSH activity within the community (Jones et al. 2008). Furthermore,
BSH activity was also identified in the archaeal species Methanobrevibacter smithii, which
commonly forms a part of the human gut microbiome (Jones et al. 2008). These observations
further expanded the representation of BSH among community members and revealed this function
to be
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 4 (continued)
such as 16S rRNA genes are identified). Function-driven metagenomics: these approaches rely on
the construction of large-insert genetic libraries and the heterologous expression of cloned
genes in the surrogate host species (as used to explore BSH activity in the human gut
microbiome; Jones et al. 2008). Although the requirement for genes originating in diverse and
distantly related species to express functional proteins in the library host is a limitation of
this method, unlike sequence-driven approaches, there is no requirement for prior information
or well-characterized sequences from genes of interest. This is a major advantage of the
function-driven approach which facilitates the identification of novel enzyme classes and is
well suited to explore activities for which few initial examples of well-characterized genes or
proteins exist. However, a second major caveat of the function-driven approach is that
a suitable high-throughput screen for the activity of interest must also be available (see
Fig. 5). Fosmid (vectors based on the E. coli F-plasmid) and BACs (bacterial artificial
chromosomes) represent the most commonly used systems for construction of large-insert
metagenomic libraries
present in two domains of life (bacteria and archaea) in the gut microbiome (Jones et al.
2008). In addition, the expression in E. coli of BSH genes predicted to originate from a wide
range of bacterial species, as well as archaea, highlights the coverage afforded by the
function-driven approach in this case (Jones et al. 2008). Despite this strategy being limited
by the ability of the surrogate host to express the trait of interest, and genes derived from
a wide range of often distantly related microbes, the function-driven survey of BSH
demonstrates the clear potential for genes of diverse phylogenetic origin to be obtained by
this method (Jones et al. 2008). Continued improvements in the range of hosts and vector
systems available will further enhance the utility of this approach and expand the role of this
strategy in the analysis of microbial communities.

Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 5 Example of
function-based screening for bile salt hydrolase activity (Modified from Jones et al. 2008).
High-throughput function-driven metagenomic analysis of BSH activity in the human gut
microbiome utilized a simple plate-based screen to identify clones encoding this activity
(Dashkevicz and Feighner 1989). De-conjugation of CBA incorporated in the media results in
precipitation of free bile acids and the formation of a distinct halo around BSH+ clones. The
image shows the phenotype of the surrogate, BSH-deficient E. coli host, and a corresponding
BSH-positive metagenomic clone on the bile agar media

Insights into BSH Evolution and Its Role in Gut Bacteria: The recovery of novel BSH sequences
from the gut microbial ecosystem with confirmed function also allowed a deeper insight into the
evolution and role of this activity in the gut microbiome (Jones et al. 2008). To understand
the evolution of this activity within the gut community, sequences from novel BSH obtained by
function-driven metagenomics were compared with a large collection of related sequences, from
gut-associated and non-gut-associated species, belonging to the wider Ntn_CGH family of which
BSH are members. Clustering of these enzymes based on similarity of amino acid sequences
revealed those derived from gut-associated microbes generally grouped together, despite
originating from very different species (Jones et al. 2008). When the substrate range of
enzymes with proven function was mapped against the observed groupings, a clear shift toward
BSH activity was also evident among sequences that originated from gut microbes (Jones et al.
2008). Subsequently, murine experiments designed to test the contribution of BSH to bacterial
survival in the gut clearly demonstrated the role of BSH in facilitating colonization of this
habitat by mitigating the toxic effects of bile acids in the intestine (Jones et al. 2008).

Collectively, the results of experiments in murine models, together with trends observed in
comparisons of functional BSH with related sequences, indicated BSH activity to be a common
microbial adaptation to the gut environment, with selective pressure from conjugated bile acids
likely to have driven the divergence of members of the Ntn_CGH family of proteins in gut
bacteria toward BSH activity (Jones et al. 2008). Overall, this points to CBA as a key
selective pressure in the gut habitat, and the development of a common mechanism for dealing
with this stress in a diverse cross section of the community is congruent with the concept of
host-level selection on functions of the gut microbiome (Ley et al. 2006). Bacteria face many
challenges when colonizing and persisting in the mammalian intestinal tract, but the solutions
developed for mitigating these barriers to survival must also be acceptable to the higher host
organism and facilitate bacterial colonization without negative impact on fitness of the host
(Jones et al. 2008; Ley et al. 2006). Therefore, the human host is believed to exert
a selective pressure on functions and activities undertaken by the gut microbiome as a whole,
and analysis of
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome, Fig. 6 Relative
abundance of bile salt hydrolases in the gut microbiome in health and disease (From Ogilvie and
Jones 2012). Human gut microbiomes from the MetaHIT dataset were surveyed using sequence from
BSH with proven function to identify homologues to these genes (minimum of 35 % amino acid
identity over 50 aa or more and an E-value of 1e-5 or lower) in the 124 individual gut
microbiomes represented in this dataset (Qin et al. 2010). Identified BSH sequences were
subsequently affiliated to different bacterial divisions based on sequence similarities and
used to calculate the relative abundance of BSHs for major phylogenetic divisions in each gut
microbiome (expressed as Hits/Mb DNA). ACT Actinobacteria; BACT Bacteroidetes; FIRM Firmicutes;
TOTAL BSH relative abundance in MetaHIT dataset as a whole irrespective of phylogenetic
affiliations. Healthy healthy individuals only (n = 99), UC individuals with ulcerative colitis
only (n = 21), CD individuals with Crohn’s disease only (n = 4). Error bars indicate standard
error of the mean. Level of significance: * P = < 0.01; ** P = < 0.00
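
The survey summarized in the Fig. 6 caption amounts to filtering homology-search hits by
threshold and normalizing the count by the amount of sequence in each metagenome. A minimal
sketch follows, assuming the thresholds are read as at least 35 % amino acid identity over an
alignment of 50 aa or more with an E-value of 1e-5 or lower, and that every row passing the
filter counts as one hit. The record layout, the names Hit, passes and hits_per_mb, and all
example numbers are hypothetical; they do not reproduce the MetaHIT analysis.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str        # functionally confirmed BSH used as the search query
    subject: str      # matching metagenomic sequence
    identity: float   # percent amino acid identity
    align_len: int    # alignment length in amino acids
    evalue: float     # expectation value reported by the search tool


def passes(hit, min_identity=35.0, min_len=50, max_evalue=1e-5):
    """Apply the three thresholds (assumed interpretation of the caption)."""
    return (hit.identity >= min_identity
            and hit.align_len >= min_len
            and hit.evalue <= max_evalue)


def hits_per_mb(hits, metagenome_bp):
    """Relative abundance as accepted hits per megabase of metagenomic DNA."""
    accepted = sum(1 for h in hits if passes(h))
    return accepted / (metagenome_bp / 1e6)


# Hypothetical hits against one hypothetical 4 Mb metagenome.
toy_hits = [
    Hit("bsh_ref1", "seq_001", 62.0, 120, 1e-30),  # accepted
    Hit("bsh_ref1", "seq_002", 34.0, 150, 1e-12),  # identity too low
    Hit("bsh_ref2", "seq_003", 48.5, 45, 1e-8),    # alignment too short
    Hit("bsh_ref2", "seq_004", 40.0, 80, 1e-3),    # E-value too high
    Hit("bsh_ref3", "seq_005", 35.0, 50, 1e-5),    # exactly at the thresholds
]

print(hits_per_mb(toy_hits, 4_000_000))
```

Normalizing by megabases of DNA is what makes metagenomes of different sequencing depth
comparable; a real survey would also need to decide how to treat several queries matching the
same metagenomic sequence, which this sketch does not address.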
microbial BSHs suggests that these may be an example of a mutually acceptable arrangement
between the host and its microbiome (Jones et al. 2008).

BSH Activity as a Conserved Feature of the Human Gut Microbiome: Although the initial
application of function-driven screens provided much fundamental insight into bile acid
metabolism by the gut microbiome, these studies also fill an additional role in generating
baseline sequence data from genes with proven functions or activities. Such data in itself
constitutes a useful and valuable resource for numerous other applications, including the
accurate annotation and interpretation of shotgun metagenomes (and complete bacterial genomes),
opening the way for larger-scale sequence-based surveys of key functions within microbial
ecosystems. This is exemplified by the use of BSH recovered through function-driven
metagenomics to
subsequently interrogate a range of sequence-based shotgun metagenomes, in order to examine
the representation of this activity among distinct gut communities and other microbial
ecosystems (Ogilvie and Jones 2012; Jones et al. 2008).

This approach was first applied to survey 15 human gut metagenomes and several non-gut
metagenomes from a range of habitats (Jones et al. 2008). Comparison of the relative abundance
of genes with homology to functional BSHs in human gut microbiomes with non-gut habitats
revealed an enrichment of putative BSHs in the human gut microbiome (Jones et al. 2008). This is
in keeping with the concept of CBA as an important habitat-associated selective pressure for gut
microbes (absent in non-gut environments) and BSH as a conserved microbial adaptation to life in
the mammalian intestinal tract (Jones et al. 2008).

When relative abundance of BSH homologues was compared between individual gut microbiomes, the
potential for interindividual variation in abundance and types of BSH was also highlighted
(Jones et al. 2008). Because BSH catalyzes the initial rate-limiting step in the wider pathway
of microbial bile acid metabolism facilitated by the gut microbiome (Fig. 3), variation in
overall levels of BSH should be a good predictor of the capacity for bile acid modification in a
given microbiome (Jones et al. 2008; Ogilvie and Jones 2012). Furthermore, previous
characterization of BSH types originating from the main phylogenetic groups in the human gut
microbiome revealed differences in substrate range of enzymes encoded by different phyla,
highlighting the potential for shifts in community structure to also alter aspects of bile acid
metabolism by altering the prevailing bile acid modifications undertaken by gut microbes
(Jones et al. 2008).

Metagenomic Analysis of Bile Salt Hydrolases in Health and Disease

Due to the role of bile acids in regulating metabolism and mucosal immune responses and the
potential for the gut microbiome to influence this signaling network through bile acid
transformations, alterations in capacity for bile acid metabolism in the human gut microbiome
may play a role in the pathogenesis of numerous diseases (Jones et al. 2008; Ogilvie and Jones
2012; Jones 2011). For example, the products of microbial bile acid metabolism have been linked
to the initiation and pathogenesis of colorectal cancer (CRC) through several mechanisms,
including the direct carcinogenicity of some BA (Bernstein et al. 2005; Hill 1990; O’Keefe 2008;
Debruyne et al. 2001).

Recent observations also implicate the perturbation of bile acid signaling as a potential
mechanism contributing to the pathogenesis of CRC and inflammatory bowel diseases, with the
dedicated bile acid receptor FXRalpha demonstrated to be protective against both CRC and
Crohn’s disease in murine models (Gadaleta et al. 2011; Modica et al. 2008; Duboc et al. 2013;
Maran et al. 2009). Since activation of this receptor is implicated in the downregulation of
mucosal immune responses and protection against autoimmune damage and induction of
antiapoptotic pathways in the human gut (Gadaleta et al. 2011; Duboc et al. 2013), alterations
to microbial bile acid metabolism leading to changes in the balance of BA species available for
receptor binding have clear implications for disease initiation and progression.

The initial function-driven metagenomic analysis of BSH activity in the gut microbiome also
provided the basic information to explore these theories further and to begin to explore the
association between microbial bile acid metabolism and intestinal diseases (Jones et al. 2008;
Ogilvie and Jones 2012). This is exemplified by the application of gut-derived BSH sequences
(with proven activity) to explore changes in the BSH profile in the microbiomes of individuals
with inflammatory bowel disease (Ogilvie and Jones 2012). Surveys of whole community shotgun
metagenomes for genes homologous to functional BSH sequences revealed a distinct reduction in
the relative abundance of BSH homologues in the gut microbiomes of individuals with Crohn’s
disease (CD), primarily within BSHs affiliated with
the Firmicutes division (Ogilvie and Jones 2012). These changes are in keeping with the
well-documented dysbiosis and shift in community structure characteristic of CD (where the
diversity of Firmicutes is markedly reduced) (Manichanh et al. 2006; Qin et al. 2010) and the
role of FXRalpha signaling in regulation of mucosal immune responses (Gadaleta et al. 2011).
These metagenomic-based predictions of changes in functional capacity of the CD gut microbiome
related to bile acid metabolism have since been validated and a reduction in capacity for bile
acid modification demonstrated in active disease (Duboc et al. 2013). The apparent deficiency of
this function in the CD gut microbiome now raises the potential for targeting bile acid
metabolism in the gut microbiota as a marker for disease risk or therapeutic intervention.

Summary

The analysis of bile acid metabolism in the human gut microbiome has benefited greatly from the
application of metagenomics and provides an excellent example of how these powerful
community-level approaches can rapidly provide significant insight into the functioning and
development of microbial ecosystems. In the case of the human gut microbiome, and other
host-associated microbial consortia, metagenomic approaches can also generate new understanding
of how bacteria interact with and impact upon their higher host organisms.

In the case of bile acid metabolism by the gut microbiome, the deployment of metagenomics to
explore this aspect of the indigenous intestinal microbiota has rapidly enhanced our
understanding of this activity, its effect on human health, and its function within the gut
microbiome. Our knowledge of bile acid metabolism by the gut microbiome has now been elevated to
a point where tangible hypotheses regarding impacts on host health can be formulated and tested.
Although much remains to be done and our understanding is far from complete, metagenomics will
undoubtedly continue to play a key role in ongoing studies and has already yielded new targets
for disease diagnosis, prophylaxis, or treatment which can now be explored further.

References

Begley M et al. The interaction between bacteria and bile. FEMS Microbiol Rev. 2005a;29:625–91.
Begley M et al. Contribution of three bile-associated loci, bsh, pva, and btlB, to gastrointestinal persistence and bile tolerance of Listeria monocytogenes. Infect Immun. 2005b;73:894–904.
Begley M et al. Bile salt hydrolase activity in probiotics. Appl Environ Microbiol. 2006;72:1729–38.
Bernstein H et al. Bile acids as carcinogens in human gastrointestinal cancers. Mutat Res. 2005;589:47–65.
Dashkevicz MP, Feighner SD. Development of a differential medium for bile salt hydrolase-active Lactobacillus spp. Appl Environ Microbiol. 1989;55:11–6.
Debruyne PR et al. Mutat Res. 2001;480–81:359–69.
Duboc H et al. Connecting dysbiosis, bile-acid dysmetabolism and gut inflammation in inflammatory bowel diseases. Gut. 2013;62:531–9.
Gadaleta RM et al. Farnesoid X receptor activation inhibits inflammation and preserves the intestinal barrier in inflammatory bowel disease. Gut. 2011;60:463–72.
Gill SR et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–9.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85.
Hill MJ. Bile flow and colon cancer. Mutat Res. 1990;238:313–20.
Hofmann AF, Eckmann L. How bile acids confer gut mucosal protection against bacteria. Proc Natl Acad Sci U S A. 2006;103:4333–4.
Houten SM et al. Endocrine functions of bile acids. EMBO J. 2006;25:1419–25.
Inagaki T et al. Regulation of antibacterial defense in the small intestine by the nuclear bile acid receptor. Proc Natl Acad Sci U S A. 2006;103:3920–5.
Jones BV. Bacterial bile acid modification and potential pharmaceutical applications. J Appl Ther Res. 2011;8:94–100.
Jones BV, Marchesi JR. Accessing the mobile metagenome of the human gut microbiota. Mol Biosyst. 2007;3:749–58.
Jones BV et al. Functional and comparative metagenomic analysis of bile salt hydrolase activity in the human gut microbiome. Proc Natl Acad Sci U S A. 2008;105:13580–5.
Kim GB et al. Cloning and characterization of the bile salt hydrolase genes (bsh) from Bifidobacterium bifidum strains. Appl Environ Microbiol. 2004;70:5603–12.
Kumar RS et al. Structural and functional analysis of a conjugated bile salt hydrolase from Bifidobacterium longum reveals an evolutionary relationship with penicillin V acylase. J Biol Chem. 2006;281:32516–25.
Kurokawa K et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 2007;14:169–81.
Lambert JM et al. Functional analysis of four bile salt hydrolase and penicillin acylase family members in Lactobacillus plantarum WCFS1. Appl Environ Microbiol. 2008a;74:4719–26.
Lambert JM et al. Improved annotation of conjugated bile acid hydrolase superfamily members in Gram-positive bacteria. Microbiology. 2008b;154:2492–500.
Ley RE et al. Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell. 2006;124:837–48.
Lodola A et al. A catalytic mechanism for cysteine N-terminal nucleophile hydrolases, as revealed by free energy simulations. PLoS ONE. 2012;7:e32397. doi:10.1371/journal.pone.0032397.
Manichanh C et al. Reduced diversity of faecal microbiota in Crohn’s disease revealed by a metagenomic approach. Gut. 2006;55:205–11.
Maran RRM et al. Farnesoid X receptor deficiency in mice leads to increased intestinal epithelial cell proliferation and tumor development. J Pharmacol Exp Ther. 2009;328:469–77.
Modica S et al. Nuclear bile acid receptor FXR protects against intestinal tumorigenesis. Cancer Res. 2008;68:9589–94.
O’Keefe SJD. Nutrition and colonic health: the critical role of the microbiota. Curr Opin Gastroenterol. 2008;24:51–8.
Ogilvie LA, Jones BV. Dysbiosis modulates capacity for bile acid metabolism in the gut microbiomes of patients with IBD: a mechanism or marker for disease? Gut. 2012. doi:10.1136/gutjnl-2012-302137.
Ogilvie LA et al. Evolutionary, ecological and biotechnological perspectives on plasmids resident in the human gut mobile metagenome. Bioeng Bugs. 2012;3:1–19.
Qin J et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Ridlon JM et al. Bile salt biotransformations by human intestinal bacteria. J Lipid Res. 2006;47:241–59.
Rossocha M et al. Conjugated bile acid hydrolase is a tetrameric N-terminal thiol hydrolase with specific recognition of its cholyl but not of its tauryl product. Biochemistry. 2005;44:5739–48.
Suresh CG et al. Penicillin V acylase crystal structure reveals new Ntn-hydrolase family members. Nat Struct Biol. 1999;6:414–6.
Thomas C et al. Targeting bile-acid signalling for metabolic diseases. Nat Rev Drug Discov. 2008;7:678–93.
Watanabe M et al. Bile acids induce energy expenditure by promoting intracellular thyroid hormone activation. Nature. 2006;439:484–9.

Metagenomic by RAPD Profiling

Jaime Henrque Amorim, João Carlos Teixeira Dias and Rachel Rezende
Universidade Estadual de Santa Cruz, Laboratório de Biotecnologia Microbiana, Ilhéus, BA, Brazil

Detailed description and study of taxa, metabolic pathways, protein/peptide interactions, and
molecular relationships in microenvironments attract great interest because they may yield new
molecules with important applications and new knowledge about microenvironment dynamics.
However, such advances are not possible with culture-dependent techniques, because the culturing
conditions of unknown microorganisms are themselves unknown (Yun et al. 2004; Riesenfeld et al.
2004). Metagenomic approaches have been proposed as a way to access the data contained in these
ecosystems (Johnson and Slatkin 2006; McHardy and Rigotsos 2007). This technology allows access
to taxonomical and metabolic data (Streit and Schmitz 2004; Roh et al. 2006; McHardy and
Rigotsos 2007) independently of culturing procedures. Nevertheless, a main drawback of
metagenomic approaches is that most of them are preceded by conventional DNA extraction methods
that compromise taxonomical representativeness and hinder cloning steps through interfering
substances. By biasing taxonomical representativeness, such methods also limit the mining of new
molecules. In addition, conventional metagenomic approaches based on cloning of polymerase chain
reaction (PCR) products are unfeasible if the aim is to access taxonomical and metabolic
diversity at the same time (Schloss and Handelsman 2003). Another aspect that limits the study
of metagenomic content is the use of bioinformatics methods (Rondon et al. 2000; Roh et al.
2006; Huson et al. 2007) that depend on sequences of the same gene to compute the taxonomical or
metabolic profile of the environment (Rondon et al. 2000; Roh et al. 2006). Thus,
once more, they do not allow understanding of taxonomical and metabolic content at the same
time (Rondon et al. 2000; Roh et al. 2006).

Random amplified polymorphic DNA (RAPD) is an approach that allows the study of genetic
diversity and population structure of bacteria (Baker and Banfield 2003; Akbar et al. 2005). In
metagenomic studies, it has been exploited in its conventional form, through analysis of the
polymorphic amplified DNA segments in electrophoretic devices (Helton and Wommac 2009; Patel and
Behera 2011). However, we recently reported a new application for RAPD in metagenomics by
coupling it with an innovative metagenomic DNA extraction method (Amorim et al. 2012). By
cloning RAPD instead of PCR products, we were able to access taxonomical and metabolic content
at the same time. This advantage is due to the capacity of RAPD primers to anneal to a broader
range of DNA segments, yielding amplified DNA of different sizes and from different gene
families.

Randomly amplifying metagenomic DNA segments opens at least three fields of investigation:
(i) it is possible to infer the taxonomical diversity of a specific environment if a suitable
bioinformatic approach is available (Huson et al. 2007; Amorim et al. 2012); (ii) it is possible
to take advantage of the variety of gene families amplified and mine new genes and molecules;
(iii) if DNA fragments of different sizes and from different gene families are amplified, it is
possible to infer the metabolic network in a specific environment. All of these possibilities
may significantly improve and expand the use of RAPD in the study of genetic diversity and
population structure of microorganisms.

To study environmental taxonomical and genetic diversity using RAPD in the context of
metagenomics, it is necessary to clone the amplified DNA and then study its sequences. Another
possibility is to determine the sequences directly on pyrosequencing devices. However, it is
important to determine a size cutoff for the sequences that will be used to compute the
taxonomical diversity of the environment, in order to avoid sequences that are inconclusive
regarding taxonomical information (Huson et al. 2007). In addition, it is also necessary to use
a suitable algorithm that is able to compute such taxonomical content based on sequences from
different gene families (Amorim et al. 2012). The advantage of studying environmental diversity
with this approach is the possibility to amplify not only bacterial but also viral, fungal, and
other eukaryotic sequences at the same time, owing to the nonspecificity of RAPD primers. Thus,
this approach makes it possible to infer the whole taxonomical diversity of a specific
environment.

Randomly amplifying sequences from a variety of gene families may yield new molecules with
important applications. Again, the nonspecificity of RAPD primers benefits the diversity of
amplified environmental DNA. However, a size cutoff must again be determined, in order to
maximize the probability of cloning a complete viable gene. In addition, the representativeness
of metagenomic content must be considered in order to also maximize the probability of mining
new genes and molecules. This requirement arises because some DNA extraction methods that
precede metagenomic approaches restrict or limit metagenomic representativeness regarding the
taxonomical and metabolic diversity of a specific environment (Amorim et al. 2008, 2012). The
advantage of mining new substances with this approach is the possibility to search for molecules
with different biological functions at the same time. In addition, it is possible to couple new
molecule mining to genetic diversity profiling by using RAPD.

As a new application, it was shown that the use of RAPD in metagenomics may make it possible to
infer the metabolic network of a specific environment. Once genes involved in related metabolic
pathways are amplified, it is possible to use suitable algorithms to study their relationships
within the same taxon or between different taxa. For all the possibilities discussed here, RAPD
seems to be a simple but robust tool to significantly improve metagenomic research.
References

Akbar T, Akhtar K, Ghauri MA, Anwar MA, Rehman M, Rehman M, Zafar Y, Khalid AM. Relationship among acidophilic bacteria from diverse environments as determined by randomly amplified polymorphic DNA analysis (RAPD). World J Microbiol Biotech. 2005;21:645–8.
Amorim JH, Macena TNS, Lacerda-Junior GV, Rezende RP, Dias JCT, Cascardo JCM. An improved extraction protocol for metagenomic DNA from a soil of the Brazilian Atlantic Rainforest. Genet Mol Res. 2008;4:1226–32.
Amorim JH, Vidal RO, Lacerda-Junior GV, Dias JCT, Brendel M, Rezende RP, Cascardo JCM. A simple boiling-based DNA extraction for RAPD profiling of landfarm soil to provide representative metagenomic content. Genet Mol Res. 2012;11:182–9.
Baker BJ, Banfield JF. Microbial communities in acid mine drainage. FEMS Microbiol Ecol. 2003;44(2):139–52.
Helton RR, Wommac KE. Seasonal dynamics and metagenomic characterization of estuarine viriobenthos assemblages by randomly amplified polymorphic DNA PCR. Appl Environ Microbiol. 2009;75(8):2259–65.
Huson DH, Auch AF, Stephan JQ, Schuster C. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Johnson PLF, Slatkin M. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res. 2006;112:1320–7.
McHardy AC, Rigotsos I. What’s in the mix: phylogenetic classification of metagenome sequence samples. Curr Opin Microbiol. 2007;10:499–503.
Patel AK, Behera N. Genetic diversity of coal mine spoil by metagenomes using random amplified polymorphic DNA (RAPD) marker. Indian J Biotechnol. 2011;10:90–6.
Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004;38:525–52.
Roh C, Villatte F, Kim B, Schmid RD. Comparative study of methods for extraction and purification of environmental DNA from soil and sludge samples. Appl Biochem Biotechnol. 2006;134:97–112.
Rondon MR, August PR, Bettermann AD, Brady SF, Grossman TH, Liles MR, Loicono KA, Handelsman J, Goodman RM. Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol. 2000;115:2541–7.
Schloss P, Handelsman J. Biotechnological prospects from metagenomics. Curr Opin Biotechnol. 2003;14:303–10.
Streit WR, Schmitz RA. Metagenomics – the key to the [...] 492–8.
Yun J, Kang S, Park S, Yoon H, Kim M, Heu S, Ryu S. Characterization of a novel amylolytic enzyme encoded by a gene from a soil-derived metagenomic library. Appl Environ Microbiol. 2004;11:7229–35.

Metagenomic Potential for Understanding Horizontal Gene Transfer

Luigi Grassi1, Jacopo Grilli2 and Marco Cosentino Lagomarsino3,4
1 Physics Department, Sapienza University of Rome, Rome, Italy
2 Dipartimento di Fisica “G. Galilei”, CNISM and INFN, Università di Padova, Padova, Italy
3 Computational and Quantitative Biology, University Pierre et Marie Curie, Paris, France
4 CNRS, Paris, France

Definition

Horizontal gene transfer (HGT) describes the biological phenomenon by which an organism
acquires genes from organisms belonging to other species, genera, or taxa. Its name reflects the
fact that the transfer of genetic information between organisms that are not necessarily related
is different from the “vertical” transmission of genes from parent to offspring. Early reports
(Smith et al. 1992) interpreted HGT as a rare event, unable to significantly influence the
global composition of target genomes. This first impression was rapidly overturned by the advent
of genomic sequencing technologies. For example, the comparison of the genomes of Escherichia
coli and Haemophilus influenzae, two bacteria belonging to the same evolutionary lineage, shows
a significant difference in their gene content (Tatusov et al. 1996). This difference, which
cannot be explained by vertical descent alone, gave a first indication of the massive role
played by HGT in the evolution of prokaryotic genomes. Subsequent evidence from multiple genomes
indicates that HGT acts
pervasively on prokaryotic genomes (Woese 2000). For example, a detailed analysis by Dagan and
Martin (2007) considered 57,670 gene families across 190 sequenced genomes, demonstrating that
at least two-thirds and possibly all of them have been affected by HGT at some time in their
evolutionary past.

Introduction

The pervasive occurrence of HGT reveals the importance of this process in forging the extant
prokaryotic genomes (Ochman et al. 2005). The consequent question is how and to what extent the
transferred genes innovate a genome over the course of evolution and how they are incorporated
into a genome’s existing biochemical and regulatory networks. This issue requires the study of
multiple genomes and naturally overlaps with metagenomics, because HGT is affected by
environmental, ecological, and population factors acting at the level of communities of
coexisting species. In brief, understanding HGT requires knowledge of its consequences on
genomes, populations, and ecosystems.

HGT Impact on the Evolution of Genomes

A genome affected by HGT can acquire two types of genes: genes homologous to existing ones and
genes that are not (Ochman et al. 2000). Both types of HGT influence the evolution of a lineage
but do so in very different manners and contexts. The first mechanism is favored when the
phylogenetic distance between donor and acceptor is small (Andam and Gogarten 2011); these
transfers may occur via homologous exchange, whose probability increases with genetic similarity
(Vulic et al. 1997). The second type of HGT involves the acquisition of new genes, with a
sporadic phylogenetic distribution. Such transfers might supply genes that confer novel
phenotypic properties and result in the rapid adaptation of a bacterial species. Both types of
transfers can leave traces in the metabolism of the acquiring genomes. Several studies found
that the majority of changes to the metabolic network of Escherichia coli in the past 100
million years are due to HGT (Pál et al. 2005; Lercher and Pal 2008). Interestingly, it appears
that horizontally transferred genes are integrated at the periphery of the network, whereas
central parts remain evolutionarily stable. This is also supported by the modular nature of
prokaryotic genes. Indeed, metabolically related genes (e.g., genes coding proteins in
physiologically coupled reactions) are often transferred together as operons. Thus, HGT appears
to be the main force able to expand bacterial metabolic networks by enlarging their periphery in
response to changing environments (Lercher and Pal 2008; Pang and Maslov 2011). Necessarily,
this has consequences for the addition of new genes regulating the metabolic pathways, defining
some observed quantitative features of genome composition (Grilli et al. 2012; Koonin 2011).

All the above studies evaluate the contribution of HGT to the evolution of prokaryotic genomes
using the tools of comparative genomics. Quite interestingly, the effects and consequences of
HGT can also be evaluated with direct experiments. For example, a single-cell analysis was
performed in Escherichia coli in 2008 (Babic et al. 2008). This study demonstrated the high
efficiency (up to 96 % of recipients) of recombination and integration of transferred DNA. In
another study, Babic et al. monitored in real time, through fluorescence microscopy, the
sequential conjugation events of an integrative and conjugative element encoding a green
fluorescent protein (GFP) (Babic et al. 2011). A recent study investigated the novelty of
protein domains acquired through HGT in Proteobacteria, focusing on their specific features
(Grassi et al. 2012). The results indicate that protein domains subject to HGT have a
transferability proportional to their total frequency in the pool of considered genomes, and at
the same time, HGTs of exogenous protein families are found less frequently for larger genomes.
Based on these observations, one can conclude that HGTs behave as if they were drawn
randomly from a cross-genomic community gene pool, much like gene duplicates are drawn from a
genomic gene pool. Similar conclusions were drawn from a recent comparative study of Escherichia
coli and Salmonella enterica genomes (Karberg et al. 2011). These results indicate a role of a
common gene pool in determining the genes available for horizontal transfer and link the problem
to the structure of past and existing bacterial communities and ecosystems.

Advantage of Metagenomics in the Study of HGT

The use of single (often cultured) bacterial species for the study of HGT has several
limitations. Firstly, the great majority of bacteria (more than 99 %) cannot be cultured in the
laboratory. Furthermore, such a “single-species” approach gives, by definition, an
organism-centric view of the phenomenon. This implies a limited understanding of microbial
physiology, genetics, and community ecology. Many recent shotgun sequencing projects
characterized the genome content of whole microorganism communities (Riesenfeld et al. 2004).
For example, the “Sorcerer II” Global Ocean Sampling expedition was designed with the precise
aim of giving a global snapshot of the marine microbiological world (Rusch et al. 2007). The
results of this important project revealed an impressive distance between marine microorganisms
and cultivated ones. Very few metagenomic sequences were found to be similar to those of
annotated genomes. A subsequent analysis of these data indicates that the abundant and
cosmopolitan picoplanktonic prokaryotes tend to have smaller genomes (Yooseph et al. 2010). Such
a condition is probably associated with a slow-growth lifestyle and with the relative inability
to sense or rapidly acclimate to energy-rich conditions. By contrast, the microbial taxa display
the ability of growing slowly and surviving in energy-limited environments, while growing
rapidly in energy-rich environments. One other focus of interest for metagenomics is the
exploration of the human microbiome. This large project has the final goal of characterizing
associations between the human microbiome and the health of an individual (Nelson et al. 2010),
i.e., the ecological influence that microorganisms have on humans.

Metagenomic data make it difficult to formalize the traditional concept of species (and
consequently of recombination and HGT among prokaryotes). Nevertheless, there are interesting
findings that reveal the crucial role played by HGT in microbial communities. For example,
Hehemann et al. (2010) point out that metagenomic samples derived from feces of Japanese
individuals are enriched in carbohydrate-active enzymes (e.g., porphyranases and agarases),
while the same enzymes are absent in metagenomic samples derived from North American
individuals. Interestingly, gut bacteria from Japanese individuals have acquired these enzymes
through HGT. This finding confirms the observation that HGT events among bacteria from different
environments can also occur inside the human intestine (Lester et al. 2006). Another recent
study by Smillie et al. (2011) uses metagenomics to describe the forces governing HGT. The
authors identified, through a heuristic method, recent HGTs among thousands of microbial
genomes. Roughly one-quarter of the identified transfers include at least one predicted mobile
element, confirming the importance of such elements in facilitating gene exchange. However, the
most interesting finding of this study is that bacteria isolated from the human body are 25
times more likely to share HGT genes than bacteria living in different environments (aquatic,
terrestrial, and nonhuman host associated). This phenomenon is even more striking considering
human isolates derived from the same body site, with a rate of transfer increased by a factor of
two. The authors also studied this high transferability in the human microbiome by separating
bacteria according to their ribosomal 16S distance, reporting that even the most divergent
bacteria, separated by billions of years of evolution but sharing the same ecological niche, are
affected by more HGT than the most closely related isolates living in different niches.

The above findings indicate that ecological factors are relevant for driving HGT in the
human microbiome and thus play a role in its evolution and genomic composition. A global understanding of these aspects is a future challenge for metagenomics, which could expand our fundamental understanding of evolution, with implications for biotechnology and health.

Summary

Horizontal gene transfer (HGT) is a widespread phenomenon in prokaryotes. Its pervasive mode of action enormously influences the receiving genomes. In light of this, HGT appears among the main forces able to expand bacterial metabolic networks in response to changing environments. Metagenomics opens up new perspectives for the study of HGT by making it possible to uncover the ecological factors relevant for driving HGT. Notably, it can directly investigate complex ecosystems such as the marine microbial world and the human microbiome.

Cross-References

▶ Integrons as Repositories of Genetic Novelty
▶ Lateral Gene Transfer and Microbial Diversity
▶ Metagenome of acidic hot spring microbial planktonic community: Structural and functional insights

References

Andam CP, Gogarten JP. Biased gene transfer and its implications for the concept of lineage. Biol Direct. 2011;6:47.
Babic A, Lindner AB, Vulic M, Stewart EJ, Radman M. Direct visualization of horizontal gene transfer. Science. 2008;319:1533–6.
Babic A, Berkmen MB, Lee CA, Grossman AD. Efficient gene transfer in bacterial cell chains. MBio. 2011;2(2):e00027.
Dagan T, Martin W. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci U S A. 2007;104:870–5.
Grassi L, Caselle M, Lercher MJ, Lagomarsino MC. Horizontal gene transfers as metagenomic gene duplications. Mol Biosyst. 2012;8:790–5.
Grilli J, Bassetti B, Maslov S, Lagomarsino MC. Joint scaling laws in functional and evolutionary categories in prokaryotic genomes. Nucleic Acids Res. 2012;40:530–40.
Hehemann JH, Correc G, Barbeyron T, Helbert W, Czjzek M, Michel G. Transfer of carbohydrate-active enzymes from marine bacteria to Japanese gut microbiota. Nature. 2010;464:908–12.
Karberg KA, Olsen GJ, Davis JJ. Similarity of genes horizontally acquired by Escherichia coli and Salmonella enterica is evidence of a supraspecies pangenome. Proc Natl Acad Sci U S A. 2011;108:20154–9.
Koonin EV. Are there laws of genome evolution? PLoS Comput Biol. 2011;7:e1002173.
Lercher MJ, Pal C. Integration of horizontally transferred genes into regulatory interaction networks takes many million years. Mol Biol Evol. 2008;25:559–67.
Lester CH, Frimodt-Moller N, Sorensen TL, Monnet DL, Hammerum AM. In vivo transfer of the vanA resistance gene from an Enterococcus faecium isolate of animal origin to an E. faecium isolate of human origin in the intestines of human volunteers. Antimicrob Agents Chemother. 2006;50:596–9.
Nelson KE, Weinstock GM, Highlander SK, Worley KC, Creasy HH, Wortman JR, et al. A catalog of reference genomes from the human microbiome. Science. 2010;328:994–9.
Ochman H, Lawrence JG, Groisman EA. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405:299–304.
Ochman H, Lerat E, Daubin V. Examining bacterial species under the specter of gene transfer and exchange. Proc Natl Acad Sci U S A. 2005;102 Suppl 1:6595–9.
Pál C, Papp B, Lercher MJ. Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat Genet. 2005;37:1372–5.
Pang TY, Maslov S. A toolbox model of evolution of metabolic pathways on networks of arbitrary topology. PLoS Comput Biol. 2011;7:e1001137.
Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004;38:525–52.
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, et al. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007;5:e77.
Smillie CS, Smith MB, Friedman J, Cordero OX, David LA, Alm EJ. Ecology drives a global network of gene exchange connecting the human microbiome. Nature. 2011;480:241–4.
Smith MW, Feng DF, Doolittle RF. Evolution by acquisition: the case for horizontal gene transfers. Trends Biochem Sci. 1992;17:489–93.
Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS, Borodovsky M, et al. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Curr Biol. 1996;6:279–91.
Vulic M, Dionisio F, Taddei F, Radman M. Molecular keys to speciation: DNA polymorphism and the control of genetic exchange in enterobacteria. Proc Natl Acad Sci U S A. 1997;94:9763–7.
Woese CR. Interpreting the universal phylogenetic tree. Proc Natl Acad Sci U S A. 2000;97:8392–6.
Yooseph S, Nealson KH, Rusch DB, McCrow JP, Dupont CL, Kim M, et al. Genomic and functional adaptation in surface ocean planktonic prokaryotes. Nature. 2010;468:60–6.


Metagenomic Research: Methods and Ecological Applications

Navneet Batra¹, Sonu Bhatia¹, Arvind Behal¹, Jagtar Singh² and Amit Joshi³
¹Department of Biotechnology, GGDSD College, Chandigarh, India
²Department of Biotechnology, Panjab University, Chandigarh, India
³Department of Biotechnology & Bioinformatics, SGGS College, Chandigarh, India

Synonyms

Community genomics; Ecological genomics; Environmental genomics

Definition

The aim of metagenomics is to investigate the enormous diversity of taxonomically and phylogenetically relevant genes, individual catabolic genes, and whole operons by exploring the genomes of uncultured microbes. The concept of metagenomics was introduced by Handelsman and involves the extraction of genomic DNA from the microbial community inhabiting an environment. This DNA is either cloned into libraries for functional screening or subjected to PCR-based enrichment for a gene of interest. Generally, DNA-based analysis is considered the most appropriate way to assess the structure of an environmental microbial community, as no selection bias is involved. High-throughput methods can be employed for direct sequencing of the metagenome. The functional approach is used to explore genes that encode novel enzymes or drugs, but function-based metagenomics still needs to advance through the use of high-throughput screening.

Introduction

An enormous genetic and biological pool of microbial diversity is present on Earth. It accounts for 4–6 × 10^30 prokaryotic cells comprising 10^6–10^8 distinct genospecies. Various studies based on molecular approaches indicate that only approximately 1 % of the total microbial population is culturable under the restricted and optimized range of cultivation conditions and media. Cultivation techniques pose several difficulties and limitations. To counter this problem, various DNA-based molecular methods have been developed. Culture-independent methods were first applied to environmental systems such as the hot springs in Yellowstone National Park. Further technical developments in this field have led to the era of metagenomics (Singh et al. 2009).
Conserved rRNA gene sequences are used in sequence-driven strategies to explore microbial diversity. The 16S rRNA gene has been established as the standard molecule for taxonomic diversity analysis in metagenomics. Recently, the 23S rRNA gene has also been considered because it is almost twice as long as the 16S rRNA gene and can therefore prove to be a more informative phylogenetic marker (Yilmaz et al. 2011); however, it has the drawback that fewer reference sequences for this marker are present in public databases. All metagenomic output is collected and shared across public databases and retrieved using bioinformatic tools capable of dealing with the extensive data generated. The Genomic Standards Consortium (GSC) was established in 2005 with the main intention to promote and share information about the resources required in the
development of better and improved mechanisms of metadata capture and exchange (Mocali and Benedetti 2010).

Methodology

All metagenomic approaches are based mainly on the isolation and examination of DNA extracted directly from naturally occurring microbial populations. In functional (expression-dependent) metagenomic studies, DNA libraries are examined by high-throughput assays to identify clones that have a specific desired phenotype. The homology-based approach, in contrast, consists of probing the library to identify clones containing conserved sequences. Metagenomics includes the following major steps.

Sampling
Most studies show that the estimate of microbial diversity increases with the area sampled. At the start of a metagenomic analysis of microorganisms, sampling is done using well-established protocols to provide the best representative sample of the desired site (Hirsch et al. 2010). Microbial activity and growth patterns in soil are influenced by physical, chemical, and biological properties. Soil substratum and geographical location affect the phylogenetic composition of the microbial community. Selection of the sampling site and of the sampling method are important considerations (Kakirde et al. 2010). The number and diversity of the microorganisms sampled are affected by the depth of the soil at which sampling is performed. Sampling at multiple spatial scales and at different intervals is used to demonstrate the spatial heterogeneity of soil microbial communities in an agricultural soil. It was suggested that geostatistical analysis can be used to describe the spatial distribution of the microbes present in the subsurface of soil, along with power analysis for the assessment of the required sample size (Mocali and Benedetti 2010). Sampling variability can be significantly reduced by using such a regime. Sampling from aquatic systems in marine habitats has been comprehensively undertaken by the Global Ocean Sampling (GOS) expedition. The Atlantic, Pacific, and Indian oceans are covered as part of its metagenomic studies (Yilmaz et al. 2011).
Techniques for environmental sampling depend on the purpose of the study, the habitat to be sampled, and the desired downstream analysis (Lewin et al. 2012). With the advent of new high-throughput methods, the number of samples accessed can be greatly increased. Sampling for metagenomic analysis has been reported from different environments, including soils, surface water of the sea, deep sea sediments, various organs of animals and humans, compost, sludge, acid mine sites, Arctic sediments, etc. (Singh et al. 2009; Kakirde et al. 2010; Xing et al. 2012). Oil samples were obtained from a production well at the onshore Potiguar Basin, Brazil, with an in situ temperature of 42.2 °C and a depth of 535.5–540.5 m, for the purpose of screening for hydrocarbon biodegraders in a metagenomic clone library (Vasconcellos et al. 2010).

Extraction Methods
The quality of extracted DNA samples should be high for construction of a metagenomic library, so extraction and purification of nucleic acids should be performed with care. The method of DNA extraction is chosen according to the size of the target genes and the screening strategy. Metagenome extraction is a compromise between vigorous extraction, which is needed to represent all microbial genomes, and the need to minimize DNA shearing and the co-extraction of sample contaminants (Cowan et al. 2005). In the direct extraction methodology, the samples are processed without cultivation of the microbes, using detergents and enzymes. The samples are further treated with phenol or chloroform. It has been argued that this method of extraction is biased; for example, ammonia-oxidizing bacteria and methane-oxidizing bacteria are not as easily displaced from soil particles as other bacteria inhabiting the soil, and actinomycete spores may be underrepresented (Hirsch et al. 2010). Physical
means of separation of microbes with lysis-based extraction is employed in the indirect extraction approach. Cell lysis can be performed using methods such as sonication, grinding, freeze-thawing, and solubilization of cell membranes and cell walls by detergents or by enzymatic means (Singh et al. 2009). A bead-beating method was used for microbial lysis on soil samples collected from barren regions of Gujarat (India) to index the microbial population and community structure in saline-alkaline soil using gene-targeted metagenomics (Keshri et al. 2013). Direct DNA extraction methods recover 10–100 times more DNA than indirect methods, but the DNA fragments obtained with indirect methods are longer. DNA extracted by direct methods also contains more impurities than DNA extracted by indirect methods (Xing et al. 2012).
The extraction method is selected on the basis of the desired application. Delmont et al. (2011) compared direct and indirect soil extraction approaches and concluded that there was a more than 40 % decrease in Eukarya sequences when using indirect DNA extraction as compared to the direct method. Archaeal and bacterial sequences also increased in indirect approaches. Another concern in the extraction process is the presence of various contaminants in the soil, for example, humic acid, polyphenols, polysaccharides, and nucleases, which can inhibit different applications including PCR and metagenomic library construction. Relying on a single DNA extraction method can underestimate the total number of bacterial ribotypes present in marine sediments (Singh et al. 2009; Kakirde et al. 2010). To obtain a significant amount of DNA, large quantities of material are required. RNA recovery from environments is quite similar to DNA isolation. To decrease physical degradation and RNase activity, samples are specifically processed: harvesting of samples is followed by freezing at −80 °C. A sulfate salt solution can be used to coprecipitate cellular RNA with proteins. cDNA metagenomic libraries can be constructed to identify functional eukaryotic genes using the RNA extraction approach (Cowan et al. 2005).

Enrichment of Sample
Whereas non-enrichment methods can maintain the high diversity of microbial communities, enrichment is performed to increase the specificity of a sample's genomic DNA. Sequence-based screening for novel genes benefits from enrichment (Xing et al. 2012). Such methods have the power to select a particular community based on its function. The loss of diversity can be moderated by altering the degree of selection pressure applied. Active biomolecules of the microorganisms can be targeted using genome enrichment strategies. Enrichment of the target population can be achieved by the use of selective media, owing to the population's capability to utilize a specific substrate. Novel techniques are employed to enrich a specific microbial community, such as 5-bromo-2′-deoxyuridine (BrdU) labeling of actively growing microbes, followed by separation of the labeled nucleic acids by density gradient centrifugation and immuno-capture techniques. Growth of a microbial community that utilizes a specific substrate can be enhanced by adding that substrate along with BrdU (Singh et al. 2009).
The SSH (suppressive subtractive hybridization) technique is also used for specific gene enrichment and for identifying genetic differences between microorganisms. Samples are ligated with adaptors, and fragments are selected on the basis of subtractive hybridization. The effect of specific pollutants on the community DNA can be determined with this enrichment technique by making a comparison with a reference metagenome obtained in the absence of that pollutant (Cowan et al. 2005). Another enrichment method is stable isotope probing (SIP), which can be used to target metagenomics to specific populations. SIP involves a stable isotope-labeled substrate and separation of the heavier nucleic acids (DNA/RNA) by density gradient centrifugation. Metagenome expression profiles in response to specific substrates or xenobiotic compounds can be compared in a method based on differential expression analysis (DEA), which can identify genes that are upregulated for a specific activity (Cowan et al. 2005). In addition, the microarray method, phage display expression system, and multiple
displacement amplification and differential display are other methods for enrichment of genomic DNA (Singh et al. 2009). Aerobic and anaerobic microbial enrichments can also be performed, as was done in a study screening for hydrocarbon degraders. These cultures were grown in Schott bottles containing 500 ml Widdel B mineral medium supplemented with n-hexadecane as carbon source. This resulted in anaerobic enrichment of the sample (Vasconcellos et al. 2010). The WGA (whole genome amplification) approach uses short random primers to replicate DNA and is employed when samples of limited size (microsamples) are to be processed (Hirsch et al. 2010).

Construction of Metagenomic DNA Libraries

Construction of a metagenomic library depends on an appropriate vector. The quality of the extracted DNA and the associated research goals play an important role in vector selection. Plasmids, cosmids, bacterial artificial chromosomes (BACs), and fosmids are extensively used vectors. The choice of vector is influenced by the size of the insert fragment, the copy number of vector required, the host used, and the screening methods (Xing et al. 2012). Cosmid DNA libraries are constructed with insert sizes ranging between 25 and 35 kb. BAC libraries permit insert sizes of up to 200 kb, and fosmid libraries carry inserts of about 40 kb of foreign DNA (Streit and Schmitz 2004). The pCR 2.1 vector was used for cloning, and plasmids were further screened for insert size by PCR-based amplification, in the study of community structure in saline-alkaline soil (Keshri et al. 2013). Molecular classification of gliomas was done using a P1-derived artificial chromosome (PAC). Large human genomic DNA is best cloned in YAC or BAC vectors (Xing et al. 2012). Entire metabolic pathways can be recovered by cloning large fragments of metagenomic DNA in vectors. Host selection is another important criterion for efficient cloning. The host strain should be selected on the basis of the efficiency of the conversion process, gene expression, plasmid stability in the host cells, and screening of the target genes. Commonly used host strains are E. coli, Streptomyces sp., Pseudomonas sp., and Rhizobium sp. Highly sheared DNA poses a major problem in library generation, because ligatable sticky ends cannot be formed from it. Blunt-end ligation can overcome this problem to some extent (Singh et al. 2009).
An integrated approach combining stable isotope probing (SIP) and metagenomics has increased the frequency of clones containing desirable target genes. In one study of methane-utilizing bacteria in a forest soil, the sample was labeled with 13CH4, and the "heavy" DNA was used to construct a bacterial artificial chromosome library. Only 2,300 clones had to be screened to obtain the pmoCAB operon encoding the subunits of methane monooxygenase, whereas in a non-SIP study 250,000 fosmid clones were screened to find the pmoCAB operon (Uhlik et al. 2013).

Screening of Clones from Metagenomic Libraries

After the metagenomic library is obtained, the clones are screened. Function-based screening, also known as biological activity screening, selects positive clones that express desirable characteristics. Specific phenotypes of individual clones can be detected directly by functional assays, by adding chemical dyes or chromophore-conjugated enzymes. New antibiotic resistance gene determinants (ARGD) can be investigated by functional analysis. A novel chloramphenicol-florfenicol resistance gene was discovered by screening an Alaskan soil metagenomic clone library (Monier et al. 2011). Banik and Brady (2010) reviewed metagenomic approaches toward the discovery of antimicrobials. In work performed by Schmitz and coworkers, bacteriophage DNA was isolated from bat, guano, and earthworm guts; its functional screening led to the discovery of three new lysins capable of inhibiting Bacillus anthracis proliferation. Another function-based approach is the use of host strains requiring heterologous complementation by foreign genes for growth
under selective conditions. The recombinant clones that contain the target gene and produce the corresponding gene product in active form show optimum growth. This functional complementation was used to isolate a lysine racemase (Lyr) gene from a soil metagenome; E. coli BCRC 51734 cells were used as the host and D-lysine as the selection agent (Chen et al. 2009). The above approach faces certain problems, including inaccurate transcription of target genes and improper assembly of the corresponding enzymes. Screening efficiency can be improved by enrichment of the target microbes or by the use of sensitive screening substrates (Streit and Schmitz 2004).
Sequence-driven screening methods rely on primers and probes designed from known conserved sequences of phylogenetic or functional genes. The target clone is identified by PCR-based amplification or hybridization. PCR amplification of 30 genes encoding novel patellamide-like precursor peptides from Prochloron sp. symbionts living in consortia with marine sponges was reported by Schmidt and coworkers (Banik and Brady 2010). Fifteen new variants of the gene encoding the precursor to the microviridin peptide were identified by Ziemert and coauthors using a PCR-based methodology. Homology-based screening is carried out mostly by using degenerate PCR primers, RT-PCR, DNA microarrays, integron-based methods, and affinity capture methods of sequence-based screening, as reported in the literature (Xing et al. 2012). A relatively new method for genetic screening is substrate-induced gene expression screening (SIGEX), in which metabolism-related genes are selectively expressed in the presence of certain substrates. Chromatography-based screening techniques, known as compound configuration screening, have also been reported. Clones are screened for their capability to produce new structural compounds that show chromatographic peaks different from those of the host cells. The microarray-based GeoChip technology has been developed to assess the genetic and functional diversity of a microbial community. The reactome array is a new sensitive metabolite array that offers functional analysis of metabolic phenotypes (Streit and Schmitz 2004).
A gene of interest can also be identified by random sequencing. Phylogeny can be linked with the functional gene by performing phylogenetic analysis with flanking DNA.

Metagenomic Sequencing

The area of sequencing has changed gradually. Classical Sanger sequencing technology is being superseded by next-generation sequencing (NGS). The Sanger method is preferred for its low error rate, long read length (>700 bp), and large insert sizes, but it has the drawback of being labor-intensive. Array-based sequencing and in vitro amplification of target DNA fragments constitute second-generation DNA sequencing. Such technology is implemented in the 454 Genome Sequencer, Illumina Genome Analyzer, and SOLiD platforms (Xing et al. 2012). These next-generation approaches allow massively parallel sequencing of samples. Pyrosequencing allows sequencing of 100–200 bp of single-stranded DNA, employs luciferase-based real-time monitoring of pyrophosphate release (Guazzaroni et al. 2009), and has accuracy rates comparable to those of Sanger sequencing.
Metagenomics employs two approaches. First, in the system-based approach, the complete DNA sample is processed and analyzed; MG-RAST (Metagenomic Rapid Annotation Using Subsystem Technology), for example, characterizes an HTS pyrosequencing run (Larsen et al. 2012). Second, the species identification-based approach relies on PCR-based amplification of specific regions and therefore carries the risk of missing certain taxa. One of the efficient methods of high-throughput analysis (HTS) of genes is based on microarrays, with which differential gene expression can be quantified and environmental bacterial diversity monitored (Cowan et al. 2005). Second-generation sequencing technologies help in obtaining more information from complex microbial communities (Logares et al. 2012). Open reading frames and operons can be identified by analysis of longer contiguous sequences. Colony hybridization and pyrosequencing when combined with
the metagenomic approach helped in gaining information about the genetic organization and diversity of a specific operon.
Adding sample-specific oligonucleotide barcodes to PCR primers, known as barcoding or multiplexing, has the advantage that a number of samples can be sequenced simultaneously at a relatively reduced cost (Willner and Hugenholtz 2013). Third-generation sequencing is evolving fast. The first such technology to become available was the PacBio RS from Pacific Biosciences. In this technology, an immobilized polymerase performs the sequencing, and four differently colored nucleotides are detected in real time (Logares et al. 2012). Another innovative sequencing platform, known as Ion Torrent, is based on the principle that DNA polymerization releases protons, which can be used to detect nucleotide incorporation. Read lengths of >100 bp can be obtained with this technology. DNA nanoballs can be sequenced with a technology offered by Complete Genomics (Thomas et al. 2012).

Assembly, Binning, and Annotation

Recovering and characterizing the genomes of uncultured organisms requires assembly of short-read fragments into longer genomic contigs. Reference-based assembly is applied if closely related reference genomes are available. De novo assembly requires large computational resources (Thomas et al. 2012). Binning, a process based on sequence comparison of unknown DNA with reference databases, helps to sort DNA sequences into groups representing genomes from closely related organisms. Metagenome sequence data are generally annotated by feature prediction and functional annotation. Feature prediction labels sequences as genes, and functional annotation assigns putative gene functions and taxonomic neighbors.

Data Handling and Statistical Analysis

Statistical approaches help metagenomics to link functional and phylogenetic information to the biological, physical, and chemical parameters that fully characterize a microbial community. Multivariate statistical analysis is provided by various tools such as the Primer-E package. This package helps in the generation of multidimensional scaling (MDS) plots, analysis of similarities (ANOSIM), and identification of the species that contribute most to differences between samples (SIMPER) (Thomas et al. 2012). A wide variety of bioinformatic tools and databases are available for metagenomic studies (Table 1).

Ecological Inferences

Community Studies
The ecological role of microorganisms can be highlighted by conducting a genome-wide analysis. The ecosystem is highly dynamic in structure, and shotgun metagenomics allows direct sequencing of community DNA. Metagenomics generates environmental microbial community data that help in the investigation of the microbial environmental interactome (MEI) (Larsen et al. 2012). PCR-based methods such as amplified ribosomal DNA restriction analysis (ARDRA), denaturing gradient gel electrophoresis (DGGE), and terminal restriction fragment length polymorphism (T-RFLP) have been used for the characterization of community microorganisms. These techniques were applied to study the bacterial response in a pesticide-contaminated soil (Imfeld and Vuilleumier 2012). Subsurface oil reservoirs with high pressure and high concentrations of salt, heavy metals, and organic solvents have been analyzed by metagenomics. In another study, permafrost samples from the Canadian High Arctic and Alaska were investigated in order to understand their potential linkage to global warming (Lewin et al. 2012). A microbial niche study was conducted on flowing acid mine drainage to determine the industrial community structure of a natural acidophilic biofilm growing on it (Streit and Schmitz 2004). The ECOMIC-RMQS project is a French initiative to characterize soil microbial communities. Innovative studies and methodologies can determine an organism's possible habitat in
Metagenomic Research: Methods and Ecological Applications, Table 1 Bioinformatic tools and databases commonly used in metagenomic studies
Name | Description | Website
ARB | Tools for sequence database handling and data analysis | www.arb-home.de
CAMERA | Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis | http://camera.calit2.net
CARMA | Characterizing short-read metagenomes | www.cebitec.uni-bielefeld.de/brf/carma/
COG | Clusters of Orthologous Groups | http://www.ncbi.nlm.nih.gov/COG/
DDBJ | DNA Data Bank of Japan | http://www.ddbj.nig.ac.jp/
DOTUR | Defining Operational Taxonomic Units and Estimating Species Richness | http://www.plantpath.wisc.edu/fac/joh/dotur.html
EMBL | European Molecular Biology Laboratory | www.embl.de/services/bioinformatics/index.php
GAAS | Genome relative Abundance and Average Size | http://sourceforge.net/projects/gaas/
GenBank | Genetic sequence database | www.ncbi.nlm.nih.gov/Genbank/metagenome.html
GOLD | Genomes Online Database | www.genomesonline.org
GSC | Genomic Standards Consortium | www.gensc.org
INSDC | International Nucleotide Sequence Database Collaboration | http://www.insdc.org/
IMG/M | Integrated Microbial Genomes | http://img.jgi.doe.gov/
KEGG | Kyoto Encyclopedia of Genes and Genomes | http://www.genome.jp/kegg/
LefSe | LDA Effect Size | http://huttenhower.sph.harvard.edu/galaxy/root?tool_id=lefse_upload
MEGAN | MEtaGenome ANalyzer | www-ab.informatik.uni-tuebingen.de/software/megan
Megx.net | Marine Ecological GenomiX | www.megx.net
MetaPhlAn | Metagenomic Phylogenetic Analysis | http://huttenhower.sph.harvard.edu/
GraPhlAn | Graphical Phylogenetic Analysis | http://huttenhower.sph.harvard.edu/
METAREP | JCVI Metagenomics Reports | http://jcvi.org/metarep/
PyNAST | Python Nearest Alignment Space Termination | www.qiime.org/pynast/
Naive Bayes classifier | Probabilistic classifier | http://www.statsoft.com/textbook/naive-bayes-classifier/
MG-RAST | Metagenomic RAST | http://metagenomics.nmpdr.org
PHACCS | Phage Communities from Contig Spectra | http://sourceforge.net/projects/phaccs/
RefSeq | Reference Sequence | http://www.ncbi.nlm.nih.gov/refseq/
ShotgunFunctionalizeR | R-package for functional comparison of metagenomes | http://shotgun.zool.gu.se
SILVA | Comprehensive online ribosomal RNA sequence database | www.arb-silva.de
SINA | Bioinformatic tools for sequence alignment | www.arb-silva.de
SmashCommunity | Stand-alone metagenomic annotation and analysis pipeline | http://www.bork.embl.de/software/smash/
Sort-ITEMS | Sequence orthology-based approach for improved taxonomic estimation of metagenomic sequences | http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS/
STAMP | Statistical Analysis of Metagenomic Profiles | http://kiwi.cs.dal.ca/Software/STAMP
TACOA | Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach | http://www.cebitec.uni-bielefeld.de/brf/tacoa/tacoa.html
TETRA | Fragment assignment by intrinsic tetranucleotide frequencies | www.megx.net/tetra
Treephyler | Fast taxonomic profiling of metagenomes | http://www.gobics.de/fabian/treephyler.php
Fast UniFrac | Comparison of microbial communities | http://bmf2.colorado.edu/fastunifrac
XplorSeq | Mac OSX software for sequence analysis | www.phyloware.com/Phyloware/XplorSeq.html
Xipe | Statistical comparison program | http://edwards.sdsu.edu/cgi-bin/xipe.cgi
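
Table 1 describes TETRA as assigning fragments by intrinsic tetranucleotide frequencies. The general idea behind such composition-based assignment can be illustrated with a minimal Python sketch. This is not TETRA's actual algorithm: the function names (tetra_profile, pearson, assign_fragment) and the toy sequences are hypothetical, and the sketch simply compares raw relative tetranucleotide frequencies by Pearson correlation, which is an assumed simplification.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(seq):
    """Return the relative tetranucleotide frequencies of seq (256 floats)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        raise ValueError("sequence has no unambiguous tetranucleotide")
    return [counts[t] / total for t in TETRAMERS]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


def assign_fragment(fragment, references):
    """Return the name of the reference whose profile correlates best."""
    profile = tetra_profile(fragment)
    return max(
        references,
        key=lambda name: pearson(profile, tetra_profile(references[name])),
    )


# Toy reference "genomes" (hypothetical sequences, for illustration only).
references = {
    "AT_rich": "ATATTTAAATATTAATTTAAATAT" * 5,
    "GC_rich": "GCGGCCGCGGGCCGCCGGCG" * 5,
}
print(assign_fragment("GCCGCGGGCGCCGG", references))
```

Real fragments are far longer than these toy strings; very short sequences give noisy profiles, so composition-based assignment of this kind is usually applied to longer contigs.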
multidimensional space. Microbial assemblage prediction (MAP), a bioclimatic tool, helps in modeling the relative abundance of microbial taxa as a function of environmental parameters (Larsen et al. 2012). The SIP-metagenomics approach can be employed in the identification of microbial species that degrade xenobiotic compounds.

Bioprospecting
New thermostable and thermolabile biocatalysts can be discovered from extreme ecological communities. Viral metagenomes from high-temperature environments recently yielded a new thermostable DNA polymerase with reverse transcriptase activity for RT-PCR (Lewin et al. 2012). A variety of enzymes have been isolated using metagenomics (Table 2). By using metagenomics, scientists were able to identify many genes playing a role in various processes including cell cycle, metabolism, DNA repair, transcriptional regulation, etc. (Sharma et al. 2005).
A novel cold-active xylanase gene was isolated from the community DNA of goat rumen contents. A human gut metagenomic library was subjected to high-throughput screening, and 310 clones showing various enzyme activities were isolated (Xing et al. 2012). Novel antibiotics, e.g., indirubin, deoxyviolacein, and violacein, were successfully obtained through metagenomics (Ghazanfar et al. 2010).

Metagenomic Research: Methods and Ecological Applications, Table 2 Enzymes/biocatalysts isolated using metagenomic approaches
Name of enzymes
Agarase | DNA polymerase | Nitrile hydratase
Alcohol oxidoreductase | Endoglucanase | Nuclease
Alkane hydroxylase | Exoglucanase | Pectinase
Amidase | Esterase | Phytase
Amylase | β-Glucosidase | Polyketide synthase
Cellulase | Glucoamylase | Protease
Chitinase | Laccase | Rhamnosidase
Decarboxylase | Lipase | Single-stranded DNA ligase
4-Hydroxybutyrate dehydrogenase | Mannanase | Xylanase
Dehydratase | Nitrilase | β-Lactamase
Compiled from Cowan et al. 2005; Singh et al. 2009; Ghazanfar et al. 2010; Xing et al. 2012

Clinical Metagenomics
In an initiative by the National Institutes of Health (NIH), the Human Microbiome Project is being undertaken to characterize the microbial communities present at various sites in the human body. The main objective of this project is to study these microbes in the healthy and diseased states of the human body. Nelson et al. (2010) sequenced 178 microbial genomes present at multiple body sites, and further novel predicted polypeptides
were identified. The gut metagenome of patients with Crohn’s disease revealed a characteristic disease-associated microbiota. By using metagenomics, the healthy human virome can be characterized and infectious diseases of unknown etiology can be diagnosed. Novel viruses such as cardiovirus and klassevirus have been reported in viral metagenomes of human fecal samples. For fungal diversity analysis, the nuclear ribosomal internal transcribed spacer (ITS) region is employed. A mycobiome was prepared from oral rinse samples to study fungal species diversity. High-throughput sequencing of the fungal metagenome was applied to samples from patients with cystic fibrosis to identify new species. Community profiling using high-throughput sequencing (HTS) provides new insights in the area of clinical microbiology (Willner and Hugenholtz 2013).

Summary

The metagenomic approach gives access to a vast repertoire of genetic information. DNA and RNA from numerous ecosystems are sampled, extracted, and processed. Functional- and sequence-based screening of metagenomic libraries has helped in establishing phylogenetic relationships among the communities. It has opened a new era of discovery of novel genes and microbial interaction-based studies. Innovative metagenomic sequencing efforts will be essential to resolve the complexity involved in various microbiomes. It is important to share and critically apply the outcomes of metagenomic research. With advances in metagenome library construction, screening methodology, and gene expression, metagenomics can evolve into a significant technology for microbial diversity analysis.

Cross-Reference

▶ A 123 of Metagenomics
▶ Biological Treasure Metagenome
▶ Microbial Ecology in the Age of Metagenomics: An Introduction
▶ Next-Generation Sequencing for Metagenomic Data: Assembling and Binning

References

Banik JJ, Brady SF. Recent application of metagenomic approaches toward the discovery of antimicrobials and other bioactive small molecules. Curr Opin Microbiol. 2010;13:603–9.
Chen IC, Lin WD, Hsu SK, et al. Isolation and characterization of a novel lysine racemase from a soil metagenomic library. Appl Environ Microbiol. 2009;75:5161–6.
Cowan D, Meyer Q, Stafford W, et al. Metagenomic gene discovery: past, present and future. Trends Biotechnol. 2005;23:321–9.
Delmont TO, Robe P, Clark I, et al. Metagenomic comparison of direct and indirect soil DNA extraction approaches. J Microbiol Methods. 2011;86:397–400.
Ghazanfar S, Azim A, Ghazanfar MA, et al. Metagenomics and its application in soil microbial community studies: biotechnological prospects. J Anim Plant Sci. 2010;6:611–22.
Guazzaroni ME, Beloqui A, Golyshin PN, et al. Metagenomics as a new technological tool to gain scientific knowledge. World J Microbiol Biotechnol. 2009;25:945–54.
Hirsch PR, Mauchline TH, Clark IM. Culture-independent molecular techniques for soil microbial ecology. Soil Biol Biochem. 2010;42:878–87.
Imfeld G, Vuilleumier S. Measuring the effects of pesticides on bacterial communities in soil: a critical review. Eur J Soil Biol. 2012;49:22–30.
Kakirde KS, Parsley LC, Liles MR. Size does matter: application-driven approaches for soil metagenomics. Soil Biol Biochem. 2010;42:1911–23.
Keshri J, Mishra A, Jha B. Microbial population index and community structure in saline-alkaline soil using gene targeted metagenomics. Microbiol Res. 2013;168:165–73.
Larsen P, Hamada Y, Gilbert J. Modeling microbial communities: current, developing, and future technologies for predicting microbial community interaction. J Biotechnol. 2012;160:17–24.
Lewin A, Wentzel A, Valla S. Metagenomics of microbial life in extreme temperature environments. Curr Opin Biotechnol. 2012;24:1–10.
Logares R, Haverkamp THA, Kumar S, et al. Environmental microbiology through the lens of high-throughput DNA sequencing: synopsis of current platforms and bioinformatics approaches. J Microbiol Methods. 2012;91:106–13.
Mocali S, Benedetti A. Exploring research frontiers in microbiology: the challenge of metagenomics in soil microbiology. Res Microbiol. 2010;161:497–505.
Monier JM, Demaneche S, Delmont TO, et al. Metagenomic exploration of antibiotic resistance in soil. Curr Opin Microbiol. 2011;14:229–35.
Nelson KE, Weinstock GM, Highlander SK, et al. A catalog of reference genomes from the human microbiome. Science. 2010;328:994–9.
Sharma R, Ranjan R, Kapardar RK, et al. Unculturable bacterial diversity: an untapped resource. Curr Sci. 2005;89:72–7.
Singh J, Behal A, Singla N, et al. Metagenomics: concept, methodology, ecological inference and recent advances. Biotechnol J. 2009;4:480–94.
Streit WR, Schmitz RA. Metagenomics - the key to the 492–8.
Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microbiol Inform Exp. 2012;2:3.
Uhlik O, Leewis MC, Strejcek M, et al. Stable isotope probing in the metagenomics era: a bridge towards improved bioremediation. Biotechnol Adv. 2013;31:154–65.
Vasconcellos SP, Angolini CFF, García INS, et al. Screening for hydrocarbon biodegraders in a metagenomic clone library derived from Brazilian petroleum reservoirs. Org Geochem. 2010;41:675–81.
Willner D, Hugenholtz P. Metagenomics and community profiling: culture-independent techniques in the clinical laboratory. Clin Microbiol Newsl. 2013;35:1–9.
Xing MN, Zhang XZ, Huang H. Application of metagenomic techniques in mining enzymes from microbial communities for biofuel synthesis. Biotechnol Adv. 2012;30:920–9.
Yilmaz P, Kottmann R, Pruesse E, et al. Analysis of 23S rRNA genes in metagenomes – a case study from the global ocean sampling expedition. Syst Appl Microbiol. 2011;34:462–9.

Metagenomics Potential for Bioremediation

Terrence H. Bell1, Charles W. Greer2 and Etienne Yergeau2
1 Department of Natural Resource Sciences, McGill University, Sainte-Anne-de-Bellevue, QC, Canada
2 National Research Council Canada, Montreal, QC, Canada

Synonyms

Metagenomics of polluted substrates/environments

Definition

Bioremediation refers to the detoxification of environments through the activities of living organisms. In many environments, microorganisms are the main agents of bioremediation, as they adapt their existing biochemical pathways to the degradation or conversion of pollutants. Human intervention can often improve the ability of microorganisms to rapidly remediate contaminants, but how treatments affect species diversity and gene allocation in complex microbial communities is not well characterized. The metagenome of a contaminated environment includes all DNA contained within it; however, a variety of screening methods can be used in bioremediation studies to simplify the collection and analysis of targeted genomic information.

Introduction

Pollution is a ubiquitous global concern, as many natural and synthetic compounds have been introduced into environments in which they pose hazards to the health of humans and ecosystems. Bioremediation is the degradation, conversion, or stabilization of these compounds by organisms, generally microorganisms and plants. When the organisms that are native to a contaminated site effectively remove contaminants without intervention, the toxicity at the site may simply be monitored as the pollutant is reduced or converted to a less toxic form. In many cases, however, intervention can increase the rate of bioremediation. The addition of stimulating amendments on site (e.g., nutrients, organic matter) and the relocation of contaminated material to off-site treatment facilities are the most common approaches to encouraging remediation.

Often it is microorganisms that play the most significant role in bioremediation. High-resolution genetic information is required to understand how contaminants and treatments affect the complex microbial communities that exist in natural environments. Some taxonomic groups have been linked to the presence of various pollutants, but many of the taxa and enzymes that can potentially participate in bioremediation remain unknown. Thousands of microbial species may exist in a single gram of soil, so when pollutants are similar in composition to compounds
that occur naturally in the environment, a large number of species are able to compete to use the pollutant as a source of carbon, nutrients, or energy. At the other extreme, when the introduced pollutant is complex or synthetic in origin, there may be no local strains that are immediately capable of metabolizing it or reducing its toxicity.

A number of bioremediating microorganisms have been isolated from contaminated sites, but it is now generally understood that the information obtained from these isolates is insufficient to understand the workings of complex microbial communities. More complete genetic information from natural environments is required to understand how contamination affects microbial communities as a whole, and whether there is potential for further optimization of bioremediation. The large-scale, culture-independent studies that are required to meet this end are now possible with the advent of new high-throughput sequencing technologies.

Aspirations for Metagenomics in Bioremediation

Understanding the differences between a contaminated environment and its uncontaminated equivalent is a major topic of study in bioremediation research, as it can help in determining how much of the natural function of the system has been altered by contamination. Metagenomic data can provide information about taxonomic and enzymatic diversity both pre- and post-contamination, which will allow the mining of potentially active genes and organisms. Accumulating metagenomes from a variety of contaminated and uncontaminated equivalent environments will make it possible to link changes in contaminant composition and concentration to specific genes and taxa. In addition, such studies will answer questions about the microbial ecology of the contaminated system, specifically how microorganisms respond to the disturbance created by the contaminant. Adjustments of nutrients, carbon sources, pH, temperature, oxygen, and water content are frequently parts of treatment scenarios applied to contaminated sites, so metagenomic studies of bioremediation will also provide information on how microbial communities respond to changes in a variety of environmental factors. To date, only a handful of such studies have been conducted (Table 1).

Types of Metagenomic Studies Used in Bioremediation

Strictly speaking, metagenomics involves the entirety of genetic information contained within a sample. More efficient sequencing now makes it possible to produce these data, but the effort required to thoroughly analyze such huge datasets is a limiting step in metagenomic studies. Even when full metagenomes are sequenced, analysis of the data will often focus on specific genes of interest. There is also a trade-off between the number of samples analyzed and the depth of sequencing possible. While it is tempting to completely sequence and annotate single samples, it is difficult to know how representative a single sample is of an entire environment or, in the case of composite samples, how much variability exists within the environment.

As a compromise, many studies of contaminated sites have used what has been referred to as gene-targeted metagenomics (Iwai et al. 2010), in which specific gene regions are amplified and then sequenced using high-throughput technologies. This has been used in bioremediation studies to look at specific functional genes (Bell et al. 2011; Iwai et al. 2010) as well as 16S rRNA gene diversity (e.g., Bell et al. 2011; Gihring et al. 2011). The limitations of gene-targeted metagenomics are that (1) genetic information that is not immediately of interest cannot be explored in the future, (2) novel genes that cannot be amplified by the selected primers are excluded from the analysis, and (3) information about the relative occurrence of the targeted genes within the sample will be lost.

Several recent reports have incorporated some type of metagenomics into the study of the
microorganisms living in contaminated environments. Since the labor required to process data is beginning to outweigh the cost of sequencing as the limiting step in metagenomic analyses, a variety of screening methods have been used in bioremediation studies to optimize the output of information (Fig. 1). The various approaches to metagenomics that have been taken in bioremediation research are outlined below.

Metagenomics Potential for Bioremediation, Table 1 Studies that have used metagenomics to study microbial populations in contaminated substrates
Substrate | Contaminant | Treatment | Gene groups examined | Key finding | Sequencing type | References
Whole genome sequencing
Groundwater | Heavy metals, nitrate, organic solvents | None | 16S rRNA, metabolism, stress response | Significant loss of species and metabolic diversity following more than 50 years of contamination | PRISM 3730 capillary DNA sequencer | Hemme et al. 2010
Soil | Diesel | Monoammonium phosphate and aeration | 16S rRNA, alkyl group hydroxylases, extradiol dioxygenase, intradiol dioxygenase, gentisate/homogentisate dioxygenase | Shift from Gammaproteobacteria to Alphaproteobacteria and Actinobacteria after 1 year of remediation | Roche/454 GS FLX Titanium | Yergeau et al. 2012
Gene-targeted sequencing
Soil | JP-8 jet fuel | Monoammonium phosphate | 16S rRNA, alkB | Alphaproteobacteria in contaminated soils were more effective at incorporating added nitrogen than were other bacterial taxa | Roche/454 GS FLX Titanium | Bell et al. 2011
Rhizosphere soil | PCB | None | Toluene/biphenyl dioxygenases | Unexpected gene diversity, including 25 novel clusters | Roche/454 FLX | Iwai et al. 2010
Subsurface sediment | Uranium (VI) | Ethanol injection | 16S rRNA | Identified indicator taxa specific to various hydrochemical conditions and those that responded to treatment | Roche/454 FLX | Cardenas et al. 2010
Mangrove sediment | MF380 heavy fuel oil | None | 16S rRNA | Wide diversity in both contaminated and uncontaminated sediment, with indicator taxa detected for each | Roche/454 FLX | dos Santos et al. 2011
Groundwater | Uranium, sulfate, nitrate | Emulsified vegetable oil | 16S rRNA | Very narrow group of microorganisms that were stimulated by the treatment and/or involved in remediation | Roche/454 FLX | Gihring et al. 2011
Liquid media | Synthetic aromatic alkanoic acids | Added individual alkanoic acids | 16S rRNA | Microbial community was unique to the contaminant added, which varied in alkyl side branching | Roche/454 | Johnson et al. 2011
Functional screening
Soil | Aliphatic and aromatic hydrocarbons | Air sparging | Extradiol dioxygenase | High diversity of extradiol dioxygenase genes in contaminated soil; one extradiol dioxygenase gene found per 3.6 Mb of DNA | ABI PRISM 3100 Genetic Analyzer | Brennerova et al. 2009
Activated sludge | Various aromatic compounds | None | Extradiol dioxygenase | Identified novel arrangements of the extradiol dioxygenase degradation pathway on plasmid-like DNA | ABI 3730xl DNA Analyzer | Suenaga et al. 2009

Multiplexed 16S rRNA Gene Sequencing
Because of its potential to quickly assign taxonomy to large numbers of microorganisms, 16S rRNA gene sequencing has gone through several waves of popularity in microbial ecology. Comparisons of the 16S rRNA gene profiles of environmental samples have taken off again with the advent of high-throughput sequencing (Tringe and Hugenholtz 2008) and are currently more popular than any other type of metagenomic study. One reason is that a large number of 16S rRNA gene entries exist in NCBI and EMBL, as do curated 16S rRNA gene databases such as the Ribosomal Database Project (http://rdp.cme.msu.edu/) and Green Genes (http://greengenes.lbl.gov/). As a result, profiles of community diversity can be produced with only a cursory understanding of bioinformatics. While early techniques such as T-RFLP and DGGE gave some indication of the variation between samples, they only described small portions of microbial communities. Even clone library studies rarely sampled more than a few hundred clones, whereas multiplexed next-generation sequencing easily provides several thousand sequences per sample.

Since studies into bioremediation generally aim to identify effective pathways for converting or tolerating contaminants, how relevant is taxonomy? There is still debate surrounding how much functional redundancy exists between microbial species and how prevalent horizontal gene transfer (HGT) is within microbial communities, yet a recent metagenomic study shows that distinct bacterial species likely do exist (Caro-Quintero and Konstantinidis 2012). A number of 16S rRNA gene surveys have been conducted in contaminated environments and have been used to assess how microbial communities vary in relation to uncontaminated reference environments or how a community changes in a contaminated environment over time. In several of these studies, 16S rRNA gene-targeted metagenomics has identified indicator species that are specific to certain contaminants and environmental conditions (Cardenas et al. 2010; dos Santos et al. 2011). Similar multiplexed studies may be used to identify indicator species across multiple environments at similar stages of
contamination, and these indicator species could theoretically be used to assess the state of other contaminated sites.

Metagenomics Potential for Bioremediation, Fig. 1 Methods for integrating metagenomics into bioremediation studies

The major advantage of the high-throughput sequencing approach when compared with earlier 16S rRNA gene profiling techniques is the depth of coverage. In mangrove sediment contaminated with heavy fuel, little change was seen at the phylum level following contamination, while large shifts were observed at finer taxonomic levels (dos Santos et al. 2011), an effect that may not have been visible using coarser profiling methods. Similarly, 16S rRNA gene pyrosequencing showed that a very narrow group of taxa was stimulated by emulsified oil injection in a uranium-contaminated aquifer (Gihring et al. 2011). With less sequencing coverage, it would be impossible to determine whether these were the only taxa stimulated or simply the most dominant members of the community.

Multiplexed Functional Gene Sequencing
In many bioremediation studies, specific catabolic, reducing, or oxidizing genes are the subjects of interest. In such cases, it may be desirable to simply amplify and sequence these targeted genes. As with 16S rRNA gene sequencing, many samples can be processed by multiplex sequencing for a limited cost. Degenerate primers have been used to amplify alkane monooxygenase genes from hydrocarbon-contaminated Arctic soil, and sequencing showed that those related to Alphaproteobacteria responded most positively to amendment with monoammonium phosphate (Bell et al. 2011). Amplicons were also obtained from a PCB-contaminated soil using degenerate primers targeting toluene/biphenyl dioxygenase genes, and sequencing identified a variety of novel dioxygenase gene clusters (Iwai et al. 2010). In terms of gene discovery, the major drawback of this approach is that gene identification depends on novel genes having significant homology at the primer-targeted regions. Even when the targeted genes are known, the chosen primers will bias the observed relative gene abundance within each sample. Amplicon size must also be considered, since many sequencing technologies have a maximum read length, although with time, this is becoming less of a concern.

Functional Screening
Since bioremediation is generally focused on which microbial communities most effectively degrade pollutants, it can potentially be straightforward to functionally screen for samples of interest. A study of contaminated Arctic soils compared the hydrocarbon-degrading efficiency of various soils in response to different in situ and ex situ treatments, with degradation occurring significantly more effectively in one location. Subsequently, a metagenomic analysis was conducted throughout a year-long time course on the soil that most rapidly degraded the contaminating hydrocarbons, along with an uncontaminated reference soil (Yergeau et al. 2012). Metagenomic studies that are conducted in vitro also involve an aspect of selection, as only microorganisms that are capable of growing in mixed culture prevail. Mixed culture studies are common, as they often evaluate the potential for bioremediation in treatment facilities. Metagenomics is starting to be applied to such studies, as in one case in which it was determined that the amount of branching in synthetic aromatic alkanoic acids led to vastly different microbial communities (Johnson et al. 2011).

Prescreening of DNA can also be conducted on large genomic fragments that are contained within plasmids, such as fosmids or cosmids. By transforming these vectors into hosts such as E. coli, the DNA fragment can be screened for the ability to mineralize or tolerate a specific contaminant. This strategy permits the identification of genes that are involved in the catabolism of particular pollutants, or that permit host survival, provided the essential pathway can be contained in a single DNA fragment and can be expressed in the host. Sequencing is also more targeted using this approach, as the sequencing of housekeeping and rRNA genes is limited.

To search for genes capable of degrading catechol, metagenomic DNA from a hydrocarbon-contaminated soil was fragmented, cloned into fosmid vectors, transformed into E. coli, and plated with catechol as a carbon substrate. A high
diversity of extradiol dioxygenase genes was observed, as well as a surprisingly high density of one extradiol dioxygenase per 3.6 Mb of DNA screened (Brennerova et al. 2009). A similar approach identified novel extradiol dioxygenase genes, as well as previously unknown arrangements of catechol-degrading pathways (Suenaga et al. 2009). The drawbacks of this approach are that the entire genetic pathway must be contained within a single plasmid; that the host may be unable to survive in the presence of any toxic gene products, meaning that not all relevant genes will necessarily be identified; and that some genes may not be expressed if the chosen host is not closely related to the organism from which the DNA fragment originated.

Full Metagenome Analysis
Full metagenomic sequencing, when possible, provides the greatest amount of information. With this approach, any number of post hoc analyses can be conducted on a dataset. While much of the genetic information obtained from a given environment may lack appropriate comparators in existing gene banks, collecting full metagenomic information will allow future researchers the opportunity to analyze the dataset. At the moment, a number of database projects are ongoing in an attempt to collect and annotate metagenomic data, including some from contaminated sites (e.g., http://www.hydrocarbonmetagenomics.com/).

To date, only a handful of complete metagenomic studies have been conducted in contaminated environments. While 16S rRNA gene studies are useful in determining the relative microbial diversity of environments, the metabolic potential of a microbial community may not be strictly linked to its taxonomic profile. Thus, full metagenomic studies can be used to assess how diversity relates to functional potential. A metagenomic study of a diesel-contaminated Arctic soil showed that a shift in 16S rRNA gene sequences from Gammaproteobacteria to Alphaproteobacteria and Actinobacteria mostly correlated with a shift in hydroxylases and dioxygenases that were affiliated with those same organisms (Yergeau et al. 2012), demonstrating that, in this case, taxonomic affiliation was significant. Similarly, most of the functional genes (stress response, metal resistance, etc.) identified in the metagenome of a heavy metal-contaminated groundwater community were traced to Gammaproteobacteria, the group that also dominated the 16S rRNA gene profile (Hemme et al. 2010).

Full metagenomes can also provide information on the relative abundance of genes of interest. PCR-based approaches introduce a primer bias prior to sequencing, whereas strict metagenomic analysis permits a more direct quantitative comparison. Within the contaminated groundwater metagenome, stress-response genes, such as those involved in DNA repair and heavy metal resistance, were more abundantly represented than would be expected in an uncontaminated community (Hemme et al. 2010). Most hydrocarbon-degrading genes were more abundant in the contaminated Arctic soil metagenomes than in the uncontaminated reference soil, but extradiol aromatic ring-cleavage dioxygenase sequences decreased after a year of treatment, while other dioxygenases increased in abundance, and alkane hydroxylases remained constant throughout treatment (Yergeau et al. 2012). Caution should be exercised when using preparatory techniques such as whole genome amplification, since the quantitation of genes can be affected (Yergeau et al. 2010). Although the amount of DNA required for metagenomic sequencing is decreasing, whole genome amplification may still be necessary in very low biomass systems, as can be found in some highly contaminated environments.

Information Lacking from Bioremediation Literature

Genes Involved in Bioremediation
Key pathways involved in the bioremediation of major contaminants are known, but many novel enzymes and pathways are still being discovered. The lack of sequence conservation in some key gene families has made it difficult to determine their true diversity using PCR-based methods.
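The dependence of PCR-based surveys on sequence conservation at primer sites can be illustrated with a minimal sketch. The primer and the three gene fragments below are invented for illustration and are not taken from any of the cited studies; only the IUPAC ambiguity codes are standard. The sketch also assumes that a primer anneals only where it matches exactly, which is a simplification of real PCR, where some mismatches are tolerated.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_to_regex(primer):
    """Expand a degenerate primer into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

def amplifiable(primer, templates):
    """Return names of templates containing an exact match to the primer."""
    pattern = primer_to_regex(primer)
    return [name for name, seq in templates.items() if pattern.search(seq)]

# Hypothetical primer and gene fragments (illustrative only).
primer = "AAYACNGCNCAYGA"
templates = {
    "known_variant_1": "GGCAACACTGCGCATGAGTT",
    "known_variant_2": "GGCAATACAGCACACGATTT",
    "divergent_gene": "GGCAAGACTGCGCAAGAGTT",
}

detected = amplifiable(primer, templates)
print(detected)
```

In this toy case the two variants that fit the degenerate positions are recovered, while the divergent fragment is missed because it differs at positions the primer does not allow for. Genes of this kind are invisible to a primer-based survey but would still be present in a shotgun metagenome.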
In the case of genes that code for enzymes that are involved in normal forms of metabolism or other housekeeping functions within the cell, this diversity may be extensive. Metagenomic studies across contaminated environments will help correlate gene groups with contaminants, and this may identify roles for pathways that had previously been considered unimportant in the conversion or tolerance of contaminants.

Microbial species that are not directly involved in bioremediation can also represent a sizeable proportion of a contaminated community. Soils contaminated with hydrocarbons have still provided homes for populations of nitrifying bacteria (Deni and Penninckx 1999) and cyanobacteria (Yergeau et al. 2012), while the stimulation of the microbial reductive dechlorination of PCE and TCE by adding organic products tends to promote many microorganisms that are not involved in remediation (Strycharz et al. 2008). In addition, microorganisms that function in various nutrient cycles (e.g., nitrogen fixers) may be important to the functioning of the overall community. To date, it is not known how much these other species affect functioning in contaminated environments or how bioremediation is affected if some processes are disrupted.

Extent of Horizontal Gene Transfer
It can be difficult to determine the taxonomic affiliation of plasmid-borne DNA, and certain key genes involved in bioremediation, such as naphthalene dioxygenases and alkane monooxygenases (Whyte et al. 1997), have been found on plasmids. Mobile genetic elements are known to be common in at least some natural environments, but it is not known how significant a role HGT plays in the adaptation of microbial communities to contamination.

In metagenomic studies, genes can be compared with the background DNA of the community metagenome, which can help in identifying the prevalence of HGT. Bioinformatic analysis of a metagenome under long-term contamination showed that roughly 12 transposons were present per Mb of DNA, which was similar to reference strains of Xanthomonas, the dominant community member. In addition, large differences in %G+C and codon bias between putatively transposed genes suggested a very recent origin for acetone carboxylases, mercuric resistance operons, and czcD divalent cation transporters (Hemme et al. 2010). The persistence of HGT after 50 years of continued contaminant stress suggests that it may be very important to the survival of microorganisms in a contaminated environment.

Horizontal gene transfer was also suspected when a mismatch between the number of cytochrome P450 genes affiliated with Rhodococcus and the relative abundance of Actinobacteria was observed in the metagenome of diesel-contaminated Arctic soils (Yergeau et al. 2012). A number of the genes detected in this study can be plasmid-borne, so this may be a common response. Future metagenomic analyses pre- and post-contamination may show how quickly this process can shape the genetic structure of microbial communities. If HGT is determined to be a major force shaping newly contaminated environments, the metagenomic screening of mobile elements alone may be another method of eliminating large amounts of housekeeping and redundant genetic information.

Quantitation
As mentioned, metagenomes that have not been modified by processes such as whole genome amplification may permit actual quantification of gene abundances. Whereas techniques such as qPCR and PCR-based diversity studies are subject to amplification biases, the metagenome represents all of the genetic information that could be extracted from a sample. Most previous attempts to quantify microbial allocation of gene resources to important processes in contaminated sites have relied heavily on PCR methods.

Some early metagenomic studies have already shown the potential of quantitation. The relative genomic allocation to the degradation of various components of jet fuel, a complex contaminant, was observed in a contaminated soil community. It was also observed that known hydrocarbon-degrading genes represented a disproportionate amount of the total metagenome
(Brennerova et al. 2009). An overabundance of contaminant breakdown, and a recent review
genes conferring resistance to heavy metals, describes the potential power of combining SIP
nitrate, and organic solvents was observed in with metagenomics (Chen and Murrell 2010).
a heavy metal-contaminated aquifer (Hemme SIP-metagenomic analyses of contaminated sub-
et al. 2010). Semiquantitative approaches have strates allow the genes and species that actively
also been used to determine relative shifts in respond to pollutants to be separated from the
species abundance and nitrogen incorporation in huge amount of background genetic information
contaminated environments (Bell et al. 2011; that may remain from the initial, uncontaminated
Cardenas et al. 2010), and future studies using soil. The link between taxonomic affiliation and
full metagenome analysis would permit actual community function is already being explored
quantification. through the combination of SIP and high-
throughput sequencing (Bell et al. 2011), while
advances in RNA-SIP will provide a comprehen-
The Future of Metagenomics in sive picture of how the addition of substrates,
Bioremediation whether contaminants or amendments, directly
affects transcription. At the moment, the CsCl
Technologies that facilitate metagenomic gradients that are required to separate labeled
research are advancing quickly, and many studies and unlabeled nucleic acids are extremely cum-
that had previously been outside the realm of bersome and limit the number of samples that can
consideration are becoming possible. Companies be processed within a given study.
such as PacBio and Nanopore are producing However, a novel proteomic-SIP technique,
sequencers that will allow Kb reads of DNA, using 2-dimensional liquid chromatography-
which will make it possible to assemble continu- tandem mass spectrometry (2D-LC-MS/MS),
ous genomes in mixed communities. Even with was able to examine the isotopic ratios of roughly
current technologies this is becoming feasible, as 100,000 spectra while simultaneously searching
M
the entire draft genome of a novel permafrost a database of 31,966 protein sequences in under
methanogen was assembled by end-to-end 24 h (Pan et al. 2011). The computing power
linking of 113 bp paired-end reads that were required to conduct the analysis was enormous,
produced in a metagenomic study using Illumina but as with all high-throughput processing, this
GAII technology (Mackelprang et al. 2011). can be expected to change rapidly with time. The
The combination of various high-throughput potential for applying the proteomic-SIP tech-
techniques will enable comprehensive studies of nique in bioremediation studies is enormous, as
microbial communities and shed light on the even small numbers of proteins produced by rare
links between species diversity, gene density, microorganisms can be tracked (Pan et al. 2011).
gene expression, protein production, and chemi- This will be especially useful in examining bio-
cal transformation in contaminated environ- remediation pathways that involve syntrophic
ments. Stable isotope probing (SIP) is interactions, or those involved in the processing
a technique that involves adding heavy isotope- of slowly degraded contaminants, in which nutri-
labeled compounds to a substrate and allowing ent flux and subsequent protein production are
microorganisms to consume it and incorporate bound to be low.
the labeled atoms into cellular components such In contaminated environments, metagenomics
as DNA, RNA, and phospholipids. In the case of has been used to compare polluted substrates with
DNA-SIP, all DNA from a treated sample is uncontaminated reference substrates (e.g.,
extracted and then centrifuged in CsCl gradients Yergeau et al. 2012) and has also been used to
to separate the “heavy” (labeled) from the “light” directly measure species composition within the
(unlabeled) DNA. This technique has great same matrix before and after contamination (dos
potential in terms of identifying functionally Santos et al. 2011). These types of comparative
active microbes, specifically those involved in studies are geared at understanding what genetic
information distinguishes a contaminated environment from similar pristine systems. One of the next major efforts in metagenomics is likely to be the identification of a core microbiome (Shade and Handelsman 2012). In other words, what genes and species are common across an environment and across multiple environments. With a more comprehensive idea of what core microbiomes exist, environments may be aligned by their conserved regions, much as sequences are now, and the true variability between environments can then be assessed. In the context of bioremediation, it will be important to understand whether there are critical genes and organisms that must respond positively to the introduction of a contaminant in order to achieve successful remediation. Genes promoted outside of this common core must then be the result of other environmental or stochastic processes.

Many current genomic studies focus on snapshots of genetic information in environmental samples, but the high growth rate of microorganisms means that many microbial communities are undergoing constant and rapid evolution. This suggests that longer-term metagenomic studies should be a focal point of future research. The metagenomic study by Hemme et al. (2010) of metal-contaminated groundwater showed that 50 years of pollutant stress had reduced species and metabolic diversity to a minimal level of complexity. While all necessary metabolic pathways were found, more than ten times fewer OTUs, with a similar loss in metabolic complexity, were present than were observed at an adjacent background site. Monitoring how evolution selects genes in contaminated environments over the long term will undoubtedly assist in the understanding and treatment of chronically contaminated sites, although the interpretation of large amounts of data will first require a solution to the human-processing bottleneck.

Summary

A variety of metagenomic approaches are available to bioremediation researchers. The choice of technique will depend heavily on the question that is being asked, as well as the resources that are available. While full metagenomic studies provide the greatest amount of data per sample, surveying for indicator species or gene diversity across a wide range of samples may be more appropriate in many cases. These methods may change quickly as technology continues to improve, but ultimately, the best approaches will be those that answer questions about how to most efficiently improve the bioremediation of contaminated sites.

References

Bell TH, Yergeau E, Martineau C, et al. Identification of nitrogen-incorporating bacteria in petroleum-contaminated Arctic soils by using [(15)N]DNA-based stable isotope probing and pyrosequencing. Appl Environ Microb. 2011;77:4163–71.
Brennerova MV, Josefiova J, Brenner V, et al. Metagenomics reveals diversity and abundance of meta-cleavage pathways in microbial communities from soil highly contaminated with jet fuel under air-sparging bioremediation. Environ Microbiol. 2009;11:2216–27.
Cardenas E, Wu WM, Leigh MB, et al. Significant association between sulfate-reducing bacteria and uranium-reducing microbial communities as revealed by a combined massively parallel sequencing-indicator species approach. Appl Environ Microb. 2010;76:6778–86.
Caro-Quintero A, Konstantinidis KT. Bacterial species may exist, metagenomics reveal. Environ Microbiol. 2012;14:347–55.
Chen Y, Murrell JC. When metagenomics meets stable-isotope probing: progress and perspectives. Trends Microbiol. 2010;18:157–63.
Deni J, Penninckx MJ. Nitrification and autotrophic nitrifying bacteria in a hydrocarbon-polluted soil. Appl Environ Microb. 1999;65:4008–13.
dos Santos HF, Cury JC, do Carmo FL, et al. Mangrove bacterial diversity and the impact of oil contamination revealed by pyrosequencing: bacterial proxies for oil pollution. PLoS One. 2011;6:e16943.
Gihring TM, Zhang GX, Brandt CC, et al. A limited microbial consortium is responsible for extended bioreduction of uranium in a contaminated aquifer. Appl Environ Microb. 2011;77:5955–65.
Hemme CL, Deng Y, Gentry TJ, et al. Metagenomic insights into evolution of a heavy metal-contaminated groundwater microbial community. ISME J. 2010;4:660–72.
Iwai S, Chai BL, Sul WJ, et al. Gene-targeted-metagenomics reveals extensive diversity of aromatic dioxygenase genes in the environment. ISME J. 2010;4:279–85.
Johnson RJ, Smith BE, Sutton PA, et al. Microbial biodegradation of aromatic alkanoic naphthenic acids is affected by the degree of alkyl side chain branching. ISME J. 2011;5:486–96.
Mackelprang R, Waldrop MP, DeAngelis KM, et al. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature. 2011;480:368–71.
Pan CL, Fischer CR, Hyatt D, et al. Quantitative tracking of isotope flows in proteomes of microbial communities. Mol Cell Proteomics. 2011;10:M110.006049.
Shade A, Handelsman J. Beyond the Venn diagram: the hunt for a core microbiome. Environ Microbiol. 2012;14:4–12.
Strycharz SM, Woodard TL, Johnson JP, et al. Graphite electrode as a sole electron donor for reductive dechlorination of tetrachlorethene by Geobacter lovleyi. Appl Environ Microb. 2008;74:5943–7.
Suenaga H, Koyama Y, Miyakoshi M, et al. Novel organization of aromatic degradation pathway genes in a microbial community as revealed by metagenomic analysis. ISME J. 2009;3:1335–48.
Tringe SG, Hugenholtz P. A renaissance for the pioneering 16S rRNA gene. Curr Opin Microbiol. 2008;11:442–6.
Whyte LG, Bourbonnière L, Greer CW. Biodegradation of petroleum hydrocarbons by psychrotrophic Pseudomonas strains possessing both alkane (alk) and naphthalene (nah) catabolic pathways. Appl Environ Microb. 1997;63:3719–23.
Yergeau E, Hogues H, Whyte LG, et al. The functional potential of high Arctic permafrost revealed by metagenomic sequencing, qPCR and microarray analyses. ISME J. 2010;4:1206–14.
Yergeau E, Sanschagrin S, Beaumier D, et al. Metagenomic analysis of the bioremediation of diesel-contaminated Canadian high Arctic soils. PLoS One. 2012;7:e30058.


Metagenomics, Metadata, and Meta-analysis

Jack Gilbert
Department of Ecology & Evolution, University of Chicago, Chicago, IL, USA

Synonyms

Comparative analysis; Contextual data; Environmental data; Network analysis; Shotgun metagenomics

Definition

The analytical approach of identifying emergent patterns in ecological properties of microbial communities by sequencing community structure and function and defining the physical, chemical, and biological parameters of the ecosystem.

Metagenomics is the study of all genetic material from all organisms in a defined sample (Handelsman et al. 1998). However it is defined, metagenomics is just a term used to describe a selection of tools and techniques that enable us to uncover the DNA from the organisms in an environment (which can comprise any ecosystem, from soil to human intestinal tract). Metadata (also known as contextual data) refers directly to information regarding the original sample, the extraction and handling of the DNA, and the sequencing platform and data processing information (Field et al. 2011; Yilmaz et al. 2011). Without such metadata, metagenomic sequence data would be of little use for anything other than basic gene discovery. Meta-analysis, which is the process of performing comparative investigation of features between datasets, is greatly enhanced by the combination of metagenomic data and metadata (Knight et al. 2012).

Metagenomics

Our microbial planet is more than 1 × 10^30 microbial cells (Whitman et al. 1998), a billion more cells than stars in the known universe (Gilbert 2010). This dominance of biomass is encapsulated nicely by a quotation attributed to Julian Davies, “Once the diversity of the microbial world is catalogued, it will make astronomy look like a pitiful science” (Gewin 2006). Microbial life comprises the main functional drivers of our planet’s ecosystems (Falkowski et al. 2008), yet their diversity and ecological networks remain largely unknown. In the last 15 years, metagenomics has provided a tool to explore the vast unseen majority with a greater resolution and depth of field than culturing has yet provided (Hugenholtz and Kyrpides 2009). The explosion
in 2004 of direct sequencing approaches, which provided a different route to market compared to clone-dependent sequencing, has accelerated the implementation and data generation capability of this technique. Existing studies have been well reviewed in terms of the impact on community ecology interpretation and novel biochemical process identification (Gilbert and Dupont 2011).

Metadata

The ensuing data bonanza (Field et al. 2011) has driven the need for more robust and comprehensive standards for recording and sharing information about why, how, and from where the sequencing data was generated. One person’s metadata is another person’s primary data, and so the community outreach to determine the consensus for recording different data types and information has been a mammoth effort. The Genomic Standards Consortium (GSC; Field et al. 2011) has risen to be one of the most prominent and successful standards communities. The central tenet of the Genomic Standards Consortium is to promote mechanisms that standardize the description of genomes, metagenomes, and amplicon sequences and the exchange and integration of these data and associated metadata (www.gensc.org). The GSC has created three minimum information checklists, which collectively are known as the Minimum Information about any (x) Sequence (MIxS) checklists. The three standards are the Minimum Information about a Genome Sequence (MIGS; Field et al. 2008), the Minimum Information about a Metagenome Sequence (MIMS), and the Minimum Information about a Marker Gene Sequence (MIMARKS) (Yilmaz et al. 2011). These information checklists and the ancillary environmental data sheets describe the types of information the community would like to see associated with the sequence data, and importantly provide a description for recording these data using a defined standard. This enables a level playing field for the provision and sharing of data between organizations and PIs, and the checklists have been adopted by the International Nucleotide Sequence Database Collaboration (INSDC) and a considerable number of journals. The major proponent from the latter group is the GSC’s own journal, Standards in Genomic Sciences, which requires a detailed but standard description of the associated metadata for genome and metagenome reports (Gilbert et al. 2010a; Nelson et al. 2009).

Meta-analysis

Meta-analysis is defined as the combination of results from different studies that have similar or related research hypotheses. While not strictly a meta-analysis, the use of comparative metagenomics to explore the principles of microbial ecology stems from the common analysis of data generated by different studies in different ecosystems to explore central hypotheses, usually related to the overall distribution of taxonomic and functional attributes in the community. Initial efforts include comparative analysis of four metagenomic samples from soil and whale fall (Tringe et al. 2005), 87 viral and microbial metagenomic datasets from nine biomes (Dinsdale et al. 2008), metagenomic datasets from 86 viral and microbial communities (Willner et al. 2009), and more recently 77 metagenomes (Delmont et al. 2011). These studies have led to the conclusion that different environments have habitat-specific functional and taxonomic fingerprints that indicate environment-specific genomic adaptation. Of course, this should be taken with the caveat that each comparative study has a small number of metagenomes in the analysis and that each metagenomic dataset only comprises a tiny fraction of the functional information present in any community. The latter point is made obvious by ultra-deep screening of microbial diversity, whereby even in marine coastal surface waters, the species richness can be astounding (>100,000 taxa per L of water; Caporaso et al. 2011).

Importantly, cross-sample comparisons should be performed in concert with dynamic
comparative analysis of the contextual environmental data. These physical, chemical, and biological data that describe the environment in which the microbial organisms under investigation were isolated are vital to interpreting the gradients of function and specific trends in gene persistence seen between samples and studies. Within one study, such as the Global Ocean Sampling (Rusch et al. 2007) or Western English Channel (Gilbert et al. 2010b), the link between environmental metadata and the functional or taxonomic sequence data can be implicit. However, in comparative studies, it is rare to be able to generate canonical correlations between specific functional gene abundances and different contextual metadata, as different studies tend to measure different parameters differently. The Earth Microbiome Project (www.earthmicrobiome.org) is working not just to create comparable data on the basis of standard methodological protocols (e.g., DNA extraction, PCR, sequencing) but also to obtain data with comparable contextual information, e.g., temperature measurements, latitude and longitude, ammonia concentrations, pH, etc. All these metadata are being collated into large-scale databases with the Genomic Standards Consortium’s MIxS checklists as the data framework, and so they represent the community consensus for these records.

Summary

Metagenomics studies now need to be performed using the principles of scientific investigation and excellent statistical experimental design, using replication and adequate controls to determine if the perceived biological variation actually could be used to explore basic ecological principles. The only appropriate way to perform good meta-analysis for metagenomic studies is to utilize excellent metadata, and this comes back to the design of the experiment, long before any molecular analysis has even been suggested. It also must leverage multidisciplinary effort to obtain the right data to answer the relevant questions.

Cross-References

▶ Approaches in Metagenome Research: Progress and Challenges
▶ Biological Treasure Metagenome
▶ Challenge of Metagenome Assembly and Possible Standards
▶ Computational Approaches for Metagenomic Datasets
▶ Metagenomic Research: Methods and Ecological Applications

References

Caporaso JG, Field D, Paszkiewicz K, Knight R, Gilbert JA. Evidence for a persistent microbial community in the Western English Channel. ISME J. 2012;6:1089–1093.
Delmont TO, et al. Metagenomic mining for microbiologists. ISME J. 2011;5(12):1837–43.
Dinsdale EA, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452(7187):629–32.
Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth’s biogeochemical cycles. Science. 2008;320(5879):1034–9.
Field D, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol. 2008;26(5):541–7.
Field D, et al. The genomic standards consortium. PLoS Biol. 2011;9(6):e1001088.
Gewin V. Genomics: discovery in the dirt. Nature. 2006;439(7075):384–6.
Gilbert JA. Beyond the infinite – tracking bacterial gene expression. Microbiol Today. 2010;37(2):82–5.
Gilbert JA, Dupont CL. Microbial metagenomics: beyond the genome. Ann Rev Mar Sci. 2011;3:347–71.
Gilbert JA, et al. Metagenomes and metatranscriptomes from the L4 long-term coastal monitoring station in the Western English Channel. Stand Genomic Sci. 2010a;3(2):183–93.
Gilbert JA, et al. The taxonomic and functional diversity of microbes at a temperate coastal site: a ‘multi-omic’ study of seasonal and diel temporal variation. PLoS One. 2010b;5(11):e15545.
Handelsman J, et al. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5(10):R245–9.
Hugenholtz P, Kyrpides NC. A changing of the guard. Environ Microbiol. 2009;11(3):551–3.
Knight R, et al. Designing better metagenomic surveys: the role of experimental design and metadata capture in making useful metagenomic datasets for ecology and biotechnology. Nat Biotechnol. 2012;30(6):513–520.
Nelson OW, Harrison SH, Garrity GM. Meeting report SIGS1: first conference of the standards in genomic sciences eJournal. Stand Genomic Sci. 2009;1(1):72–6.
Rusch DB, et al. The sorcerer II global ocean sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007;5(3):e77.
Tringe SG, et al. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–7.
Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proc Natl Acad Sci USA. 1998;95(12):6578–83.
Willner D, Thurber RV, Rohwer F. Metagenomic signatures of 86 microbial and viral metagenomes. Environ Microbiol. 2009;11(7):1752–66.
Yilmaz P, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol. 2011;29(5):415–20.


MetaRank: Ranking Microbial Taxonomic Units or Functional Groups for Comparative Analysis of Metagenomes

Tse-Yi Wang(1) and Huai-Kuang Tsai(2)
(1) Department of Medical Research, Mackay Memorial Hospital, New Taipei City, Taiwan
(2) Institute of Information Science, Academia Sinica, Taipei, Taiwan

Definition

MetaRank is a rank conversion scheme for analyzing microbial communities based on the relative order of member (taxonomic unit or functional group) abundances rather than their estimated values (e.g., proportions). It leverages a series of statistical hypothesis tests to compare member abundances within microbial communities and determine their ranks, providing an alternative rank-based method for characterizing metagenomes.

Introduction

Metagenomics is a field that involves sampling, sequencing, and analyzing the genetic material of microorganisms in microbial communities (Hugenholtz and Tyson 2008). A key question in metagenomics is whether and how changes in the microbial abundances of taxonomic units or functional groups relate to alterations of habitats (Hamady and Knight 2009). To characterize the relationship, it is important to compare microbial community compositions in different environments (Wooley et al. 2010).

Many statistical methods (e.g., Metastats (White et al. 2009), ShotgunFunctionalizeR (Kristiansson et al. 2009), STAMP (Parks and Beiko 2010)) have been developed for comparative metagenomics in an attempt to identify differentially abundant features between microbial communities. Most of these methods employ statistical hypothesis tests to determine whether member abundances are equal in distinct communities and focus on the quantitative differences between microbial community compositions. They are highly dependent on the precision of estimated values in member abundances.

However, estimated abundances might deviate from the true abundances in habitats due to sampling biases and other systematic artifacts in metagenomic data processing (Ashelford et al. 2005; Brady and Salzberg 2009; Gomez-Alvarez et al. 2009; Mavromatis et al. 2007). Although systematic artifacts can be corrected through improvements in data processing techniques, sampling biases will remain unavoidable unless exhaustive data of the whole populations become available (Wooley and Ye 2010).

To reduce the effects of sampling biases, MetaRank performs a series of rank conversions for analyzing microbial communities based on the ranks of members rather than their estimated abundances. It leverages the fact that the ranks of highly abundant members are less affected by sampling biases because large values and, by extension, their relative order are robust against small deviations. It also utilizes statistical hypothesis testing to compare member abundances within communities and determine the ranks as follows: highly abundant members are assigned high ranks, and any two members whose abundances are not statistically significantly different are assigned the same rank.
Empirical tests on real datasets and synthetic samples (Kurokawa et al. 2007; Ley et al. 2006; Mavromatis et al. 2007) confirm that MetaRank is able to reduce the effects of sampling biases and help to clarify the characteristics of metagenomes. The ranks converted by MetaRank have small normalized standard deviations, which clearly reveal the common traits within a set of metagenomes. The ranks also capably identify the discriminating features of microbial community compositions (Wang et al. 2011). In addition, it is noted that MetaRank, as a rank-based approach, has the same disadvantages as all nonparametric methods: there is a loss of information and a loss of the ability to provide parametric statistics for inference. Therefore, MetaRank is a useful rank-based alternative for analyzing metagenomes that complements parametric methods.

Methods

Given a metagenomic sample of a microbial community, MetaRank first employs binomial tests to iteratively select highly abundant members within the community, followed by multinomial tests to rank the selected members in each run.

Binomial Tests for Selecting Highly Abundant Members

For N members in a microbial community, let $X_n$ represent the abundance of the nth member in the metagenomic sample and $\hat{p}_n$ (i.e., $X_n/S$) be the sample proportion of the nth member, where $n = 1, 2, \ldots, N$ and $S = X_1 + X_2 + \cdots + X_N$. Under the assumption that all nucleic acids of microorganisms in habitats are equally likely to be sampled and sequenced in metagenomic experiments, the abundance $X_n$ of the nth member in the sample is modeled as a binomial random variable:

\[
X_n \sim \mathrm{Binomial}(S, p_n),
\]

where $p_n$ is the unknown population proportion of the nth member in the habitat and is estimated by the sample proportion $\hat{p}_n$.

To select highly abundant members with proportions that are significantly higher than the average proportion ($1/N$), MetaRank applies hypothesis tests, $H_o: p_n \le 1/N$ vs. $H_a: p_n > 1/N$ for all $1 \le n \le N$. Since $X_n \sim \mathrm{Binomial}(S, p_n)$ with mean $E(X_n) = S p_n$ and variance $\mathrm{Var}(X_n) = S p_n (1 - p_n)$, the binomial distribution of the test statistic $X_n$ under $H_o$ is approximated by a normal distribution with z-statistic $Z_n$:

\[
Z_n = \frac{X_n - E(X_n)}{\sqrt{\mathrm{Var}(X_n)}}
    = \frac{X_n - \frac{S}{N}}{\sqrt{\frac{S}{N}\left(1 - \frac{1}{N}\right)}}
    = \frac{\hat{p}_n - \frac{1}{N}}{\sqrt{\frac{1}{SN}\left(1 - \frac{1}{N}\right)}}
    \approx \mathcal{N}(0, 1)
\]

when the sample size S is large enough such that $0 \le E(X_n) \pm 3\sqrt{\mathrm{Var}(X_n)} \le S$. Otherwise, the exact binomial test is applied when S is small such that $E(X_n) - 3\sqrt{\mathrm{Var}(X_n)} < 0$ or $S < E(X_n) + 3\sqrt{\mathrm{Var}(X_n)}$.

The p-value for the exact binomial test is calculated as follows:

\[
P[X_n \ge x_n] = \sum_{k = x_n}^{S} \binom{S}{k} \frac{1}{N^k} \left(1 - \frac{1}{N}\right)^{S-k}
\]

where $x_n$ is the observed value of the test statistic $X_n$.

MetaRank considers members that reject the null hypothesis with statistical significance as highly abundant. For those that fail to reject the null hypothesis (assuming $N'$ members remain), MetaRank temporarily sets them aside and continues to select members whose proportions are significantly larger than the average ($1/N'$) in the next iteration. When none of the remaining members reject the null hypothesis, MetaRank terminates the selection procedure and considers all remaining members as rare members.
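The iterative selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the significance level of 0.05, and the example counts are hypothetical; the exact binomial test is used in every iteration (the normal approximation is skipped for simplicity); and the total S is assumed to be recomputed over the members that remain in each iteration, a detail the text does not specify.

```python
from math import comb


def binomial_tail(x, s, p):
    """Exact upper-tail probability P[X >= x] for X ~ Binomial(s, p)."""
    return sum(comb(s, k) * p**k * (1 - p) ** (s - k) for k in range(x, s + 1))


def select_abundant(counts, alpha=0.05):
    """Iteratively select members whose proportion exceeds the average.

    counts: dict mapping member name -> observed abundance (read count).
    alpha:  significance level (an assumption; not given in the text).
    Returns (iterations, rare): a list of the member groups selected in
    each iteration, and the list of members left over as rare.
    """
    remaining = dict(counts)
    iterations = []
    while remaining:
        n = len(remaining)           # N' members still under consideration
        s = sum(remaining.values())  # assumption: S recomputed over them
        if s == 0:
            break
        # Test H_o: p_n <= 1/N' against H_a: p_n > 1/N' for every member.
        selected = [m for m, x in remaining.items()
                    if binomial_tail(x, s, 1 / n) < alpha]
        if not selected:             # nobody rejects H_o: stop selecting
            break
        iterations.append(sorted(selected))
        for m in selected:
            del remaining[m]
    return iterations, sorted(remaining)


# Hypothetical read counts for five taxa; suitable for small totals only,
# since the direct sum becomes slow and imprecise for very large S.
example_counts = {"A": 60, "B": 25, "C": 8, "D": 4, "E": 3}
iterations, rare = select_abundant(example_counts)
print("selected per iteration:", iterations)
print("rare members:", rare)
```

Members selected in earlier iterations would receive higher ranks than those selected later, and the leftover members would share the lowest rank.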
Thus, in each iteration, the selected members (whose proportions are larger than the average) are ranked higher than the remaining members (whose proportions are equal to or smaller than the average). Moreover, the members selected in distinct iterations are ranked in their selected order; more specifically, the members selected in the first iteration are assigned a higher rank than the ones selected in the second iteration. At the end, the rare members are ranked the lowest in the community.

Multinomial Tests for Ranking Highly Abundant Members

Based on the above procedure, MetaRank ranks the abundances in the target community according to the following three rules. First, all rare members are assigned the same smallest rank. Second, the members selected in distinct iterations are ranked according to the order in which they were selected; thus, the members selected in the first iteration of the procedure are assigned higher ranks than all the others. Third, if two abundances (the ith and jth members) are selected in the same iteration, MetaRank determines their ranks ($R_i > R_j$, $R_i < R_j$, or $R_i = R_j$) by two hypothesis tests, $H_o: p_i \le p_j$ vs. $H_a: p_i > p_j$ and $H'_o: p_j \le p_i$ vs. $H'_a: p_j > p_i$. If $H_o$ is rejected, $R_i > R_j$; conversely, if $H'_o$ is rejected, $R_i < R_j$. However, if neither $H_o$ nor $H'_o$ is rejected, $R_i = R_j$.

Under the same assumption that all nucleic acids are equally likely to be sampled and sequenced, each abundance $X_n$ is modeled as a binomial random variable; any two abundances $X_i$ and $X_j$ are jointly modeled by the multinomial distribution (i.e., the multidimensional generalization of the binomial distribution):

\[
(X_i, X_j) \sim \mathrm{Multinomial}(S, p_i, p_j)
\]

where $p_i$ and $p_j$ are the unknown population proportions of the ith and jth members in the habitat and are estimated by the sample proportions $\hat{p}_i$ and $\hat{p}_j$. For large S such that $0 \le E(X_i) \pm 3\sqrt{\mathrm{Var}(X_i)} \le S$ and $0 \le E(X_j) \pm 3\sqrt{\mathrm{Var}(X_j)} \le S$, the z-statistics of the approximate tests are

\[
Z_{ij} = \frac{X_i - X_j}{\sqrt{X_i + X_j}}
\quad \text{and} \quad
Z_{ji} = \frac{X_j - X_i}{\sqrt{X_j + X_i}}.
\]

Otherwise, the exact multinomial tests are applied. The p-values are calculated as

\[
P[X_i - X_j \ge x_i - x_j]
  = \sum_{h = x_i - x_j}^{S} \; \sum_{k = 0}^{h - (x_i - x_j)}
    \frac{S!}{h!\,k!\,(S - h - k)!}
    \left(\frac{\hat{p}_i + \hat{p}_j}{2}\right)^{h+k}
    \left(1 - \hat{p}_i - \hat{p}_j\right)^{S-h-k}
\]

and

\[
P[X_j - X_i \ge x_j - x_i]
  = \sum_{h = 0}^{S - (x_j - x_i)} \; \sum_{k = h + (x_j - x_i)}^{S}
    \frac{S!}{h!\,k!\,(S - h - k)!}
    \left(\frac{\hat{p}_i + \hat{p}_j}{2}\right)^{h+k}
    \left(1 - \hat{p}_i - \hat{p}_j\right)^{S-h-k}
\]

where $x_i$ and $x_j$ are the observed values of $X_i$ and $X_j$.

As a result, the sorted abundances $X_{(1)} \le X_{(2)} \le \ldots \le X_{(m)} \le \ldots \le X_{(M)}$ are converted into ranks $1 \le R_{(1)} \le R_{(2)} \le \ldots \le R_{(m)} \le \ldots \le R_{(M)} \le M$, where the subscript in parentheses (m) denotes the mth order in the community and M is the total number of members. For members whose abundances cannot be distinguished from each other by hypothesis testing, MetaRank converts them into their average order; i.e., for any $m'$, $m''$ such that $R_{(m')} < R_{(m'+1)} = R_{(m'+2)} = \ldots = R_{(m''-1)} < R_{(m'')}$ (given $R_{(0)} = 0$ and $R_{(M+1)} = M + 1$), we have

\[
R_{(m'+1)} = R_{(m'+2)} = \cdots = R_{(m''-1)} = \frac{m' + m''}{2}.
\]

For example, the ranks of the rare members (assuming $N''$ members remain in the last iteration) are converted into $(N'' + 1)/2$.

Empirical Tests

To evaluate its utility in comparative analysis of microbiomes, MetaRank is applied to real
metagenomes and synthetic samples (Kurokawa et al. 2007; Ley et al. 2006; Mavromatis et al. 2007; Wang et al. 2011). In synthetic samples, it is shown that, compared with the estimated proportions or the ordinary ranks of straightforwardly sorted abundances, the ranks converted by MetaRank have a smaller normalized standard deviation and are less affected by sampling biases. In real metagenomes, MetaRank is able to clarify the common traits and detect the discriminating features of those microbiomes.

Simulation Analyses of Synthetic Samples

Synthetic samples are generated by randomly resampling reads from a pooled dataset of real metagenomes (Ley et al. 2006) for investigating the effects of sampling biases on the ranks converted by MetaRank, estimated proportions, and ordinary ranks (Wang et al. 2011). All the reads are pooled together as a synthetic library, and at the taxonomic level of phylum, five thousand synthetic samples are generated for each sample-sequencing depth r ∈ {10 %, 20 %, . . ., 90 %}. The effects of sampling biases are examined by the variability between the random synthetic samples, and the variability between the random samples is measured by the normalized standard deviation (CV; coefficient of variation) in the ranks converted by MetaRank, estimated proportions, or ordinary ranks. As shown in Fig. 1, the normalized standard deviations in the ranks converted by MetaRank are smaller than the ones in the estimated proportions and the ordinary ranks. Similar observations are also found at the taxonomic levels of class, order, family, genus, and other simulated datasets in the Wang et al. (2011) study. The results confirm that MetaRank is able to reduce the effects of sampling biases.

MetaRank: Ranking Microbial Taxonomic Units or Functional Groups for Comparative Analysis of Metagenomes, Fig. 1 The averages of CV, which is the normalized standard deviation, in the ranks converted by MetaRank, estimated proportions and ordinary ranks at the phylum level of the 5,000 synthetic samples for each sample-sequencing depth r ∈ {10 %, 20 %, . . ., 90 %}. Under distinct sample-sequencing depth, the averages of CV in the ranks converted by MetaRank are smaller than the ones in the others

Demonstration Studies in Real Metagenomes

In the real datasets from the human gut microbiomes (Ley et al. 2006; Kurokawa et al. 2007), MetaRank demonstrates its ability to clearly reveal the characteristics of metagenomes in comparative analyses (Wang et al. 2011). The first dataset contains samples from obese individuals and lean controls of human gut metagenomes in a one-year diet study. The second dataset includes infant and adult samples. In the first dataset, the obese samples are extracted from 12 obese individuals (I1, I2, . . ., I12) at four distinct time points (week 0, 12, 26, and 52), and the lean controls are extracted from two lean individuals (I13 and I14) at two time points (week 0 and 52), all
MetaRank: Ranking Microbial Taxonomic Units or Functional Groups for Comparative Analysis of
Metagenomes, Fig. 2 The hierarchical clustering results of the ranks converted by MetaRank at
the phylum level in 12 obese individuals at week 0 (I1W0, I2W0, . . ., I12W0) and week 52
(I1W52, I2W52, . . ., I12W52), including the four lean controls (I13W0, I14W0, I13W52, and
I14W52), based on UPGMA. The hierarchical agglomerative clustering (bottom-up clustering)
initially treats each sample as a single cluster at the bottom and then successively agglomerates
pairs of nearest clusters until all clusters have been merged into a single cluster at the top.
Given a fixed distance of 0.2 (i.e., Pearson correlation 0.8), there are three main clusters,
within each of which the unweighted arithmetic mean of distances is smaller than 0.2
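The bottom-up procedure described in the caption of Fig. 2 can be sketched in a few lines. This is a minimal illustration, not the implementation used in the Wang et al. (2011) study: the function names and the toy rank profiles are hypothetical, the distance is taken as one minus the Pearson correlation, and merging simply stops once the nearest pair of clusters is farther apart than the chosen cutoff.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """Distance = 1 - Pearson correlation (profiles must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - covariance / (norm_x * norm_y)


def upgma_clusters(profiles, cutoff):
    """Average-linkage (UPGMA) clustering, stopped at a distance cutoff.

    profiles: dict mapping a sample name to its list of ranks.
    Returns the clusters that remain once the nearest pair exceeds the cutoff.
    """
    names = list(profiles)
    pairwise = {
        frozenset(pair): pearson_distance(profiles[pair[0]], profiles[pair[1]])
        for pair in combinations(names, 2)
    }

    def average_distance(a, b):
        # Unweighted arithmetic mean over all between-cluster sample pairs.
        total = sum(pairwise[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in names]  # each sample starts as a cluster
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        if average_distance(a, b) > cutoff:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical rank profiles over five taxa for four samples.
toy_profiles = {
    "s1": [1, 2, 3, 4, 5],
    "s2": [1, 2, 3, 5, 4],
    "s3": [5, 4, 3, 2, 1],
    "s4": [5, 4, 3, 1, 2],
}
toy_clusters = upgma_clusters(toy_profiles, cutoff=0.2)
```

With a cutoff of 0.2, samples are grouped only when their rank profiles correlate at roughly 0.8 or higher on average, which mirrors the threshold quoted in the caption.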
denoted by the convention IxWy, where x represents the xth individual and y represents the time
point. In the second dataset, four infant and nine adult samples were extracted from different
individuals for COG-functional analysis.

When comparing metagenomes in the first dataset (Ley et al. 2006), MetaRank is able to clarify
the common traits of similar samples (Wang et al. 2011). The taxonomic abundances in the obese
samples and the lean controls are converted into ranks by MetaRank, followed by hierarchical
clustering with UPGMA (Unweighted Pair Group Method with Arithmetic Mean). Figure 2 illustrates
the result of the simple case that only consists of the samples at weeks 0 and 52 (before and
after diet). As shown in Fig. 2, given a fixed distance of 0.2 (i.e., Pearson correlation 0.8),
there are three main clusters, within each of which the unweighted arithmetic mean of distances
is smaller than 0.2. The four lean controls are closely grouped together in one cluster that
contains some obese samples at week 0 and all the obese samples at week 52 except one (I4W52).
More than half of the
obese samples at week 0 are in the other two clusters. The result shows that after dieting
almost all the obese samples are clustered together with the four lean controls. Similar
results are observed in the members of the biome at the taxonomic levels of class, order,
family, and genus (Wang et al. 2011).

Additionally, MetaRank is able to detect rank-based differences and identify discriminating
features between metagenomes in the second dataset (Kurokawa et al. 2007). The abundances of
functional groups in the infant and adult samples are first converted into ranks by MetaRank.
Then the t-test is applied to identify rank-based differences between the infant and adult
samples. When compared with proportion differences detected by a parametric method (only t-test
without MetaRank), it is found that MetaRank, a nonparametric approach, helped to identify
additional functional groups as discriminating features (Wang et al. 2011). Since nonparametric
and parametric methods are complementary to each other in statistics (one cannot replace the
other), MetaRank is thus a useful rank-based approach complementary to parametric methods.

Summary

Most statistical methods for comparative analysis of microbial community compositions rely on
estimated abundances of members. However, when processing metagenomic data, sampling biases and
systematic artifacts cause noisy deviations that may result in estimated abundances differing
from true abundances. MetaRank, which converts highly abundant members into higher ranks, is
designed to reduce the effects of noisy deviations. It leverages the fact that the ranks of
highly abundant members are robust against small deviations. Empirical tests on synthetic samples
and real metagenomes confirm that the ranks converted by MetaRank have small normalized standard
deviations, facilitate the comparative analysis of metagenomes, and help to reveal the common
characteristics or the discriminating features within a set of microbiomes. Therefore, MetaRank,
as a nonparametric approach, provides a useful rank-based alternative to analyzing microbial
community compositions.

Cross-References

▶ STAMP: Statistical Analysis of Metagenomic Profiles

References

Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. At least 1 in 20 16S rRNA sequence
records currently held in public repositories is estimated to contain substantial anomalies.
Appl Environ Microbiol. 2005;71:7724–36.
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with
interpolated Markov models. Nat Methods. 2009;6:673–6.
Gomez-Alvarez V, Teal TK, Schmidt TM. Systematic artifacts in metagenomes from complex microbial
communities. ISME J. 2009;3:1314–7.
Hamady M, Knight R. Microbial community profiling for human microbiome projects: tools,
techniques, and challenges. Genome Res. 2009;19:1141–52.
Hugenholtz P, Tyson GW. Microbiology: metagenomics. Nature. 2008;455:481–3.
Kristiansson E, Hugenholtz P, Dalevi D. ShotgunFunctionalizeR: an R-package for functional
comparison of metagenomes. Bioinformatics. 2009;25:2737–8.
Kurokawa K, Itoh T, Kuwahara T, et al. Comparative metagenomics revealed commonly enriched gene
sets in human gut microbiomes. DNA Res. 2007;14:169–81.
Ley RE, Turnbaugh PJ, Klein S, Gordon JI. Microbial ecology: human gut microbes associated with
obesity. Nature. 2006;444:1022–3.
Mavromatis K, Ivanova N, Barry K, et al. Use of simulated data sets to evaluate the fidelity of
metagenomic processing methods. Nat Methods. 2007;4:495–500.
Parks DH, Beiko RG. Identifying biologically relevant differences between metagenomic
communities. Bioinformatics. 2010;26:715–21.
Wang TY, Su CH, Tsai HK. MetaRank: a rank conversion scheme for comparative analysis of microbial
community compositions. Bioinformatics. 2011;27:3341–7.
White JR, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features
in clinical metagenomic samples. PLoS Comput Biol. 2009;5:e1000352.
Wooley JC, Ye Y. Metagenomics: facts and artifacts, and computational challenges. J Comput Sci
Technol. 2010;25:71–81.
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6:e1000667.
METAREP, Overview

Johannes Goll
Informatics Department, The J. Craig Venter Institute, Rockville, MD, USA

With increasing scale and complexity of current metagenomic studies approaching terabase-volumes
of sequence data, scalability of biological analysis software has become an essential
requirement. Toward that end, we have developed JCVI Metagenomics Reports (METAREP), an
open-source tool, which integrates the highly scalable search engine Solr/Lucene, R, and CakePHP
into an extensible Web-based software to query, browse, compare, and share extremely large
volumes of metagenomic annotations. The software allows flexible and simultaneous comparison of
taxonomic, biological pathway, and individual enzyme abundances across hundreds of samples. In
this chapter, we provide an overview of this functionality, data format, import, installation,
and customization. We present new features that have been released with version 1.4.0 including
the implementation of two-way statistical tests to compare features of two datasets without
replicates, protein sequence integration, and BLASTP homology search capabilities. The latest
functionality can be tested on example data at JCVI’s public METAREP instance, which is
available at http://www.jcvi.org/metarep (via the “Try It” button). The open-source code of the
software and developer information is accessible at the project’s source code repository at
https://github.com/jcvi/METAREP.

Introduction

Metagenomics describes a scientific approach in which DNA, extracted from microbes sampled from
a certain environment, is used to reconstruct the genomic potential and interactions of whole
microbial communities. This circumvents the problem that the majority of microbes cannot be
cultured outside their native habitat and thus cannot be investigated using a classic genome
sequencing approach (Handelsman 2004). With increasing sequencing throughput of next-generation
sequencing technologies, this approach has become commonplace and is being applied to soils,
oceans, agriculture, and human health. The goal is to understand how the microbe’s genetic
repertoire is used during nutrient cycling and energy production and, especially, what role it
plays in human health and chronic disease.

As a consequence, the challenge for most microbiologists has shifted away from data generation
to effective data storage and analysis methods (Stein 2010). An effective approach to handling
these immense data volumes is to use workflows in combination with high-performance computer
clusters or grids with hundreds of processors that execute homology searches in parallel for
subsets of assembled or unassembled fragmented sequences (reads). Based on hits to reference
sequences from completely sequenced genomes, the end data products are typically organismal as
well as metabolic gene or read-based count profiles.

The Human Microbiome Project (HMP) (http://nihroadmap.nih.gov) is an excellent example
highlighting the scale of metagenomic projects currently taking place. The HMP is a very
ambitious effort to characterize the microbial community associated with the human body. The
jumpstart consortium consists of four sequencing centers: the Baylor College of Medicine Human
Genome Sequencing Center, the Broad Institute, the J. Craig Venter Institute (JCVI), and the
Genome Center at Washington University. While it involves a range of activities including the
extensive collection of metadata, generation of reference genomes, and marker gene studies, one
of the most data-intensive phases of the project is the shotgun metagenomic survey of over
650 samples from 254 healthy individuals initially examining 15–18 body habitats. In future
phases the HMP will compare this baseline data to clinical samples to examine the specific role
the microbiome plays in disease and the maintenance of human health. At present over
METAREP, Overview, Table 1 Comparison of metagenomic software

| Resource/software | Year of latest release | Maintaining institution | Free annotation services | Workflow | Open-source | Web site |
|---|---|---|---|---|---|---|
| IMG/M | 2012 | US Department of Energy | Yes | Annotation: COG, Pfam, TIGRFAM, InterPro, KEGG | No | http://img.jgi.doe.gov/cgi-bin/m/main.cgi |
| EBI metagenomics portal | 2011 | European Bioinformatics Institute | Yes | Annotation: InterPro, GO | Yes | https://www.ebi.ac.uk/metagenomics/ |
| CLOVR | 2011 | University of Maryland School of Medicine | Yes | 16S, clustering, assembly, annotation: COG, RefSeq | Yes | http://clovr.org |
| Galaxy | 2010 | Penn State University | No | Only taxonomy/phylogeny, some community extensions | Yes | http://galaxy.psu.edu |
| METAREP | 2012 | J. Craig Venter Institute | No | No inbuilt annotation workflow, users can upload existing annotations | Yes | http://www.jcvi.org/metarep/ |
| CAMERA 2.0 | 2010 | California Institute for Telecommunications and Information Technology | Yes | ORF finding, tRNA, rRNA finding, clustering, genome assembly, annotation: Pfam, TIGRFAM, COG | No | https://portal.camera.calit2.net/ |
| MG-RAST | 2008 | Argonne National Lab | Yes | SEED subsystem cluster | Yes | http://metagenomics.anl.gov/ |
20,864 million reads of Illumina data have been produced from healthy individuals. The
comparison of the sequence reads to protein databases alone is estimated to generate data
exceeding 12 terabytes (Human Microbiome Project Consortium 2012). We believe that the HMP
typifies the scope and complexity of metagenomic projects that will come. The collection,
integration, sharing, and comparison of this data represent a characteristic example of the
current metagenomic data analysis challenges. Toward this end we have developed METAREP, an
open-source and thus adjustable software that enables exploratory data analysis for projects of
this size and larger (Goll et al. 2010, 2012).

A variety of other free metagenomic annotation and analysis software is accessible to
researchers (Table 1). Efforts that include annotation workflows and free compute resources are
provided by the US Department of Energy (IMG/M (Markowitz et al. 2012)), the European
Bioinformatics Institute, the Argonne National Laboratory (MG-RAST (Meyer et al. 2008)), and the
University of San Diego (CAMERA (Sun et al. 2011)). Efforts that require compute resources owned
by the researchers (or rented via a cloud service) include CLOVR (Angiuoli et al. 2011), Galaxy
(Goecks et al. 2010), and METAREP. The free annotation resources, however, are often tightly
coupled to each center’s specific infrastructure including its compute resources. Thus they
cannot easily be installed and modified to satisfy custom needs including privacy concerns and
advanced data access management. In contrast CLOVR, Galaxy, and METAREP are self-contained and
can be run on other systems, and the source code can be adapted to handle project-specific needs.
On the analysis side, most resources provide summary results that fit a certain workflow and are
tailored toward answering a certain question. METAREP is an exception, as it supports generic
exploratory data analysis for annotations from different workflows that can be
queried and filtered dynamically. For example, its functionality can be used to visualize how
specific taxonomic or metabolic markers vary across samples. METAREP does not support
a particular workflow but a generic annotation input format. As a consequence, it does not
include annotation workflows. To bridge this gap, users can run a public annotation service or
a custom local pipeline, format the data, and import it.

In the following sections we will describe how to import data, highlight features to analyze
individual and multiple datasets, carry out BLAST searches, and customize the software.

Data Format and Import Process

The current METAREP tab-delimited format specification for 17 fields is shown in Table 2.
Understanding this format is crucial for subsequent analysis. Following the outlined conventions
will help users to leverage as much of the functionality as possible and understand what fields
are supported. The format has been designed to accommodate common data types that are produced
by many annotation workflows without being tied to a specific workflow. The disadvantage of this
flexibility is that a custom parser needs to be written to format the output of a certain
workflow according to this tab format before importing the data. However, in most cases,
generating the METAREP tab-delimited format is trivial. In addition, METAREP provides data
formatting functionality for two workflows: (1) the JCVI Prokaryotic Metagenomic Annotation
Pipeline (JPMAP (Tanenbaum et al. 2010)) and (2) the HUMAnN metabolic reconstruction pipeline
(Abubucker et al. 2012). The open-source code for formatting output from these two pipelines
serves as a template for supporting other formats. The code base also includes a Perl utility
script (scripts/perl/metarep_loader.pl) to import tab-delimited annotation files into METAREP
projects (more details on how to use the import script can be found at
https://github.com/jcvi/METAREP/wiki/Installation-Guide-v-1.4.0).

Except for the first two columns (peptide_id and library_id), which specify the unique ID of the
respective annotation entry (gene/protein ID) and the library/dataset ID, respectively, columns
are optional. This has the advantage that workflows that do not produce all of the data types are
supported. The last two columns in Table 2 provide example values for each of the fields per
pipeline. While the unique ID fields mentioned before store a single value, most of the other
fields can store multiple values (as indicated in column 3). By convention, multiple values are
double pipe separated. For example, information for a multi-enzymatic protein can be stored by
setting the value of the ec_id field to “1.6.99.3||1.6.5.3”. By convention, the ec_id field
stores the enzyme accessions according to IUBMB format. Higher-level enzymatic levels are encoded
using dashes for all unspecified levels, e.g., 3.4.-.-. The go_id field stores accessions defined
by the Gene Ontology (Ashburner et al. 2000) with accessions being prefixed using uppercase
“GO:”. The hmm_id is a generic field for hidden Markov model-based assignments. It takes Pfam
accessions (PF234) (Punta et al. 2012), TIGRFAM accessions (TIGR23423) (Haft et al. 2003),
superfamily accessions (SSF345) (Madera et al. 2004), and combinations of the same (separated by
double pipes).

The blast_* fields store information of BLAST (Altschul et al. 1990) alignments (but can hold
alignment information from other alignment software). In particular, the blast_tree field stores
organismal information in the form of the lowest taxon using the NCBI Taxonomy as the reference
taxonomy. For example, to indicate that a certain annotation entry belongs to Escherichia coli,
the blast_tree field can be set to NCBI taxon id “83333”. If multiple NCBI taxon IDs are
provided, the lowest common ancestor will be determined during the data import process based on
the NCBI taxon ID set provided by the user. The blast_evalue, blast_pid (proportion of identical
amino acids), and blast_cov (proportion of coverage of query sequence) reflect alignment quality
data types. The field values range from 0 to 1 and allow users to filter their data based on
alignment quality (see searching and filtering). The ko_id field stores the KEGG Ortholog
METAREP, Overview, Table 2 Data format

| Column | Field name | Multi-valued | Description | JPMAP | HUMAnN |
|---|---|---|---|---|---|
| 1 | peptide_id | No | Unique entry ID | JCVI_PEP_1234123 | ptr:453118 |
| 2 | library_id | No | Dataset ID | SRS011061 | SRS011061 |
| 3 | com_name | Yes | Functional description | Sugar ABC transporter, periplasmic sugar-binding protein | LGMN, legumain, K01369 legumain [EC:3.4.22.34] |
| 4 | com_name_src | Yes | Functional description source, description assignment | Uniref100_A23521 | ptr:453118 |
| 5 | go_id | Yes | Gene Ontology ID | GO:0009265 | GO:0001509 |
| 6 | go_src | Yes | Gene Ontology source assignment | PF02511 | K01369 |
| 7 | ec_id | Yes | Enzyme commission ID | 2.1.1.148 | 3.4.22.34 |
| 8 | ec_src | Yes | Enzyme commission source | PRIAM | ptr:453118 |
| 9 | hmm_id | Yes | HMM ID | PF02511 | NA |
| 10 | blast_tree | Yes | NCBI Taxonomy ID | 246194 | 9598 |
| 11 | blast_evalue | No | BLAST E-value | 1.78E-20 | Median |
| 12 | blast_pid | No | BLAST percent identity | 0.93 | Median |
| 13 | blast_cov | No | BLAST sequence coverage | 0.82 | N/A |
| 14 | Filter | Yes | Filter tag | Repeat | N/A |
| 15 | ko_id | Yes | KEGG Ortholog ID | N/A | K01369 |
| 16 | ko_src | Yes | KEGG Ortholog source | N/A | ptr:453118 |
| 17 | Weight | No | Weight to adjust abundance of assignments | 1 | 43.23 |
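The column layout of Table 2 and the double-pipe convention for multi-valued fields can be exercised with a small formatter. This is only a sketch of what a custom parser might emit, not the project's own loader code. The function names are hypothetical, the lowercase spellings "filter" and "weight" are assumed, and leaving optional columns empty is an assumption about how missing values are represented.

```python
# Column order as listed in Table 2 (lowercase "filter"/"weight" are assumed).
FIELDS = [
    "peptide_id", "library_id", "com_name", "com_name_src", "go_id", "go_src",
    "ec_id", "ec_src", "hmm_id", "blast_tree", "blast_evalue", "blast_pid",
    "blast_cov", "filter", "ko_id", "ko_src", "weight",
]
# Fields marked "No" in the multi-valued column of Table 2.
SINGLE_VALUED = {
    "peptide_id", "library_id", "blast_evalue", "blast_pid", "blast_cov", "weight",
}
REQUIRED = ("peptide_id", "library_id")
SEPARATOR = "||"  # multiple values are double pipe separated


def format_record(entry):
    """Turn a dict of annotation values into one tab-delimited line."""
    for name in REQUIRED:
        if not entry.get(name):
            raise ValueError(f"missing required field: {name}")
    cells = []
    for name in FIELDS:
        value = entry.get(name, "")  # assumption: optional columns stay empty
        if isinstance(value, (list, tuple)):
            if name in SINGLE_VALUED:
                raise ValueError(f"{name} stores a single value")
            value = SEPARATOR.join(str(item) for item in value)
        cells.append(str(value))
    return "\t".join(cells)


def parse_record(line):
    """Split a tab-delimited line back into a dict; multi-valued fields become lists."""
    cells = line.rstrip("\n").split("\t")
    if len(cells) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, found {len(cells)}")
    record = {}
    for name, cell in zip(FIELDS, cells):
        if name in SINGLE_VALUED:
            record[name] = cell
        else:
            record[name] = cell.split(SEPARATOR) if cell else []
    return record


example_line = format_record({
    "peptide_id": "JCVI_PEP_1234123",
    "library_id": "SRS011061",
    "ec_id": ["1.6.99.3", "1.6.5.3"],  # multi-enzymatic protein
    "go_id": ["GO:0009265"],
    "blast_tree": ["246194"],
    "blast_evalue": "1.78E-20",
})
```

Keeping the column order in a single list makes it easy to check that every emitted line has exactly 17 columns before handing the file to the import script.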
accession (KO2134). Both the ec_id and ko_id fields are used to support two types of pathway
analysis (see pathway analysis section). Pathway analysis based on the ec_id field allows
analysis of 100 strictly metabolic pathways. Pathway functionality based on ko_id is more
comprehensive, supporting 200 additional non-metabolic pathways such as transcription and
translation. Depending on which field is populated, functionality is activated. Source fields
(fields with a _src postfix) describe the origin of a certain value. For example, an enzyme
accession may have been assigned based on a certain TIGRFAM model or a reference gene/protein
homology hit or other methods. The ec_src field can be used to track this information. Finally,
the weight field allows users to assign a weight to a certain entry to adjust the absolute and
relative frequency of associated entry values. The field can be used for encoding abundance
information such as the number of reads that support a certain gene/protein (in transcriptomic or
assembly studies) or spectral counts in meta-transcriptomic studies. By default the weight field
is set to 1.

When we subsequently refer to annotation attributes, we mean a selection of these fields that
are used throughout the software to provide summary statistics and compare datasets. They refer
to NCBI Taxonomy, Gene Ontology, Enzyme Classification, HMM, and KEGG/Metacyc pathways and KEGG
Ortholog fields. A feature refers to a certain value that an annotation attribute can take.
A feature-dataset matrix is a two-dimensional matrix with features of a certain annotation
attribute as rows and datasets as columns. Cells represent the sum of weights of the respective
feature-dataset combination (by default it reflects the number of genes/peptides with that
specific feature).

Single Dataset Options

Dataset Summary Statistics

The View Dataset Page displays the imported annotations and provides high-level summaries of
annotation attributes including detailed pathway summaries. The Data Tab shows the imported data
in tabular format. This is helpful to check if the data has been correctly imported. The Summary
Tab provides an overview of overall annotation statistics including a high-level taxonomic
breakdown. Subsequent tabs summarize statistics for a corresponding annotation attribute. For
each, the top 20 ranked features with the absolute and relative counts are displayed. Users can
adjust the number of top features displayed (up to 1,000 ranks) and download the data in
tab-delimited format.

Dataset Search and Filter Options

The Search Page facilitates dynamic filtering of annotations and allows users to export matching
entries and associated statistics. Once a query is executed, the page summarizes top 10
statistics for several annotation attributes in the form of lists and pie charts. The page also
lists individual matching annotation entries so that users can confirm that the query correctly
retrieved the desired results. The top 10 statistics, matching annotations, and underlying
protein sequences (if configured, see configuration) can be exported. To search a dataset, users
can enter a search term and select the field to search in from a drop-down box. Selections
include ID-based and name-based searches. The former performs exact searches; the latter
executes fuzzy name-based searches. For example, the user can enter 2.7.1.147 and select the
Enzyme ID field from the drop-down box to search for exact matches. Alternatively, the user can
carry out a fuzzy name-based search for “Glucokinase” which retrieves three matching enzymes:
ADP-specific glucokinase (2.7.1.147), glucokinase (2.7.1.2), and phosphoglucokinase (2.7.1.10).
For both search strategies, the selection triggers a query generation process that creates
a query that is compatible with the Solr/Lucene query syntax
(http://wiki.apache.org/solr/SolrQuerySyntax). The original search term is prefixed by the
search field, and multiple terms can be logically combined using the AND, OR, and NOT keywords.
In the ID-based example, the final query that will be generated is “ec_id:2.7.1.147”. For the
name-based example, the final query represents a logical combination
(“OR”) of all individual matches, in this case “ec_id:2.7.1.147 OR ec_id:2.7.1.2 OR
ec_id:2.7.1.10”. The same principle is applied to pathway name-based searches. A search for
“starch and sucrose metabolism” using the name-based KEGG pathway name (EC) option searches for
all enzymes in that pathway by generating the following query: “ec_id:1.1.1.22 OR
ec_id:1.1.99.13 OR ec_id:2.4.1.1 OR ec_id:2.4.1.10 OR . . .”. While the drop-down helps to build
queries, experienced users can enter the Solr/Lucene-formatted queries directly. This has the
advantage of entering custom logical combinations of particular fields of interest (a complete
list of fields and example queries is shown in Table 3). Note that if the value itself contains
a colon (which is a special character of the Solr/Lucene language to separate field names from
values), it needs to be preceded by a backward slash. For example, a search for
“go_id:GO\:0000160” instead of “go_id:GO:0000160” will return the desired results.

Some fields store complete hierarchies, including the NCBI Taxonomy and the Gene Ontology. The
former is encoded in the blast_tree field, which stores the whole taxonomic lineage (according to
NCBI) for each entry in the form of NCBI taxon IDs. For example, a protein entry with a species
assignment of “Escherichia coli” with NCBI taxon 562 has the following nine NCBI taxon IDs
stored in the blast_tree field:
562 = Escherichia coli; 561 = Escherichia; 543 = Enterobacteriaceae; 91347 = Enterobacteriales;
1236 = Gammaproteobacteria; 1224 = Proteobacteria; 2 = Bacteria; 131567 = cellular_organisms;
1 = root.
This allows users to find the entry by searching for “blast_tree:562” (Escherichia coli) as well
as “blast_tree:2” (Bacteria) or any other IDs that are part of that lineage. This can be very
helpful for excluding or including proteins that were assigned to a certain taxonomic group. For
example, “ec_id:2.7.1.2 AND blast_tree:2” can be used to filter the data for bacterial
glucokinases. A search for “NOT blast_tree:9606” excludes entries that were assigned to “Homo
sapiens”.

Another way of fuzzy searching (in addition to the name-based searches using the drop-down menu)
is to use Solr/Lucene wildcard characters. There are two supported wildcards: “?” and “*”. The
“?” performs a single-character wildcard search. For example, to find common names like fliF,
fliC, and fliS, one can search for “com_name_txt:fli?”. The “*” performs a multiple-character
wildcard search. For example, to search for all transferases 2.1.1.1, 2.1.1.2, 2.1.1.3, etc., one
can enter “ec_id:2.*”. For the quantitative alignment information (proportion of identical amino
acids, proportion of covered query amino acids, E-value), range queries can be applied that
identify entries that fall between a minimum and maximum. For example, to filter the data for
1.0E-20 ≤ E-value ≤ 1.0E-5, one can search the blast_evalue_exp field (which stores the negative
E-value exponent) for “blast_evalue_exp:[5 TO 20]”. To exclude the boundary values from the
result list, the user can use “blast_evalue_exp:{5 TO 20}”. This is equivalent to
1.0E-20 < E-value < 1.0E-5. When filtering for E-values, there is usually no defined lower bound.
This can be reflected using the wildcard character “*”. For example, “blast_evalue_exp:[5 TO *]”
searches for all entries with an E-value ≤ 1.0E-5. Finally, if the sequence store path has been
defined (see section “Installation and Configuration”), the user can enter an amino acid sequence
into the search box and select the Search by Sequence option with a certain minimum E-value. The
software then executes a BLASTP search behind the scenes, returns the top-matching entry
accessions (peptide_id field) concatenated by an OR, and visualizes summary statistics for
homologous proteins.

Drill into Datasets Using Hierarchical Datasets

The Browse Dataset Pages are available for several annotation hierarchies including NCBI
Taxonomy, Gene Ontology, Enzyme Classification, and KEGG and Metacyc metabolic pathways. For KEGG
two different pathway hierarchies can be selected: enzyme based and KO based. The difference is
that the enzyme-based version uses enzyme assignments and maps them to EC-based KEGG pathways
(a subset of KEGG
METAREP, Overview, Fig. 1 Screenshot of the METAREP Browse Pathway (EC) page
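The query conventions described under “Dataset Search and Filter Options” (field-prefixed terms, OR combinations, escaped colons, and ranges on the negative E-value exponent) can be sketched with a few string helpers. The helper names are hypothetical and are not part of METAREP or Solr; only the colon is escaped here because it is the only special character the text discusses, and wildcards are passed through unchanged.

```python
def escape_value(value):
    """Precede colons with a backslash so they are not read as field separators."""
    return value.replace(":", "\\:")


def term(field, value):
    """Prefix a search value with its field name, e.g. ec_id:2.7.1.147."""
    return f"{field}:{escape_value(value)}"


def any_of(field, values):
    """Combine several values of one field with the OR keyword."""
    return " OR ".join(term(field, value) for value in values)


def evalue_exponent_range(min_exponent, max_exponent=None, inclusive=True):
    """Range query on blast_evalue_exp, the negative E-value exponent.

    A missing upper exponent is written as the wildcard "*".
    Square brackets include the boundary values; curly braces exclude them.
    """
    upper = "*" if max_exponent is None else str(max_exponent)
    opening, closing = ("[", "]") if inclusive else ("{", "}")
    return f"blast_evalue_exp:{opening}{min_exponent} TO {upper}{closing}"


# Queries corresponding to the examples in the text.
glucokinase_query = any_of("ec_id", ["2.7.1.147", "2.7.1.2", "2.7.1.10"])
go_query = term("go_id", "GO:0000160")
bacterial_glucokinase = f"{term('ec_id', '2.7.1.2')} AND {term('blast_tree', '2')}"
strong_hits = evalue_exponent_range(5)  # E-value <= 1.0E-5
```

Building queries from helpers like these keeps the escaping rule in one place, which matters because an unescaped colon inside a GO accession would otherwise be read as a field separator.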
pathways that are mainly related to metabolism), while the KO-based version uses KEGG Orthologs
to infer pathway membership and uses a more comprehensive set of pathways including
non-metabolic processes such as translation and transcription. The number of hits is displayed
for each node in the tree, and a user can click on a tree node and expand further. After clicking
a node, a summary of that node is shown in the right panel featuring a pie chart calculated from
its sub-nodes and top lists of functional and taxonomic assignments. Once the user has reached
the pathway level, for the KEGG versions of the Browse Pathways pages, relative abundance of
pathway members is visualized on top of pathway maps (Fig. 1).

Multi-dataset Analysis Options

Compare Feature Abundance Profiles Across Datasets

The Compare Page unifies a variety of descriptive, graphical, and statistical analysis options to
compare annotation attributes of dozens of datasets. The page features three distinct panels,
the Dataset Select Panel, the Filter and Options Panel, and the Results Panel (Fig. 2). The right
upper dataset select box in the Dataset Select Panel allows users to select datasets by dragging
selected datasets to the left upper panel or by clicking on the plus symbol. The dataset
selection can be narrowed down by entering keywords in the search textbox in the left upper
panel. The Filter and Options Panel provides a textbox to enter a Lucene query (see section
“Dataset Search and Filter Options” and Table 3). If applied, each dataset gets filtered and only
annotation entries that match the query are retained for the comparison. A typical example is to
apply a more stringent BLAST E-value baseline. Another example, highlighted in {REF}, is to
filter all datasets for a certain enzymatic marker, e.g., pyruvate dehydrogenase complex
(“ec_id:1.2.4.1 OR ec_id:2.3.1.12 OR ec_id:1.8.1.4” or the shorter version “ec_id:1.2.4.1 OR
2.3.1.12 OR 1.8.1.4”). A minimum count value can be entered into the Min. Count Field to retain
only features whose minimum count across all datasets is equal to or higher than the specified
count. By default this field is set to 0, showing any features with at least one dataset having
a count of one (features with zero counts across all datasets are discarded). The main compare
METAREP, Overview, Fig. 2 Screenshot/conceptual overview of the METAREP Compare Page. Current
implementation of the METAREP Compare page (key options are highlighted in green panels). The
page allows users to compare absolute and relative abundance of annotation categories across
multiple datasets including taxonomic, pathway, enzyme, and GO classifications. Visualization
options include heatmap (shown), hierarchical clustering, multidimensional scaling, and Mosaic
Plots. Advanced Compare options include statistical tests for pairwise dataset comparisons
(Fisher’s Exact Test, Equality of Proportions Test) as well as for comparing two dataset
populations (Wilcoxon Rank-Sum and a nonparametric t-test)
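The three count matrices offered on the Compare Page (absolute, relative, and row-normalized heatmap counts) follow from one another by simple normalization. The sketch below illustrates that arithmetic on a toy feature-dataset matrix. The function names and the example counts are hypothetical, and the total dataset count is assumed to be the column sum unless explicit totals are supplied.

```python
def relative_counts(absolute, dataset_totals=None):
    """Divide each cell by the total count of its dataset (column).

    absolute: dict mapping a feature to its list of counts, one per dataset.
    dataset_totals: optional totals per dataset; column sums are assumed otherwise.
    """
    n_datasets = len(next(iter(absolute.values())))
    if dataset_totals is None:
        dataset_totals = [
            sum(row[j] for row in absolute.values()) for j in range(n_datasets)
        ]
    return {
        feature: [c / t if t else 0.0 for c, t in zip(row, dataset_totals)]
        for feature, row in absolute.items()
    }


def heatmap_counts(relative):
    """Row-normalize: divide each relative count by the sum across all datasets."""
    normalized = {}
    for feature, row in relative.items():
        row_sum = sum(row)
        normalized[feature] = [v / row_sum if row_sum else 0.0 for v in row]
    return normalized


# Toy absolute count matrix: two features (rows) across two datasets (columns).
absolute_matrix = {"1.2.4.1": [2, 3], "2.3.1.12": [6, 1]}
relative_matrix = relative_counts(absolute_matrix)
heatmap_matrix = heatmap_counts(relative_matrix)
```

Row normalization makes features with very different overall abundance comparable in a heatmap, because each row then sums to one regardless of how common the feature is.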
options can be selected from the drop-down next to the Min. Count Field and are organized by the
following categories: Count Matrices, Statistical Tests (2 Datasets), Statistical Tests
(2 populations), and Plot options.

The Results Panel is automatically updated upon option selection, displaying feature-dataset
matrices or graphical representations of the same. A certain annotation attribute can be selected
by clicking on the respective tab and exported using the disk with the green arrow key. In the
following, we describe the options in more detail.

Count Matrices: Applicable if at Least One Dataset Was Selected

Absolute Count Matrix shows a numeric representation of a feature-dataset matrix with cells
containing the number of counts for a feature-dataset combination.

Relative Count Matrix shows a numeric representation of a normalized feature-dataset matrix with
cells containing the number of counts for a feature-dataset combination divided by the total
dataset count. If a filter was entered, the cells represent the number of counts for
a feature-dataset combination divided by the total count of the filtered dataset.

Heatmap Count Matrix shows a numeric representation of a row-normalized feature-dataset matrix
with cells containing the relative counts per dataset divided by the sum of relative counts per
row, i.e., across all datasets. Cells are color coded according to their row-normalized counts.
The color scheme can be changed using a drop-down menu.

Statistical Tests (2 Datasets)

The following dataset tests are applicable if two datasets were selected. As input for the tests,
METAREP, Overview, Table 3 METAREP search fields

Field name | Description | Type/range | Example | Result

Core annotation fields
peptide_id | Peptide ID | text | peptide_id:1120333534885 | Retrieve hit with the specified peptide ID
com_name_txt (default field) | Common name | text | com_name_txt:phage | All hits containing the word phage
com_name_src | Common name source | text | com_name_src:PF00204 | All hits having names assigned based on this PFAM hit
go_id | Gene Ontology ID | text | go_id:GO\:0000160 | Hits with GO:0000160; use “\” before the colon
go_tree | Gene Ontology tree | Integer portion of ID | go_tree:160 | Skip “GO:” prefix; all hits with GO:0000160 or with GO IDs that are lower (more specific) in the GO hierarchy
go_src | Gene Ontology source | text | go_src:PF00204 | All hits that have GO terms assigned based on this PFAM hit
ec_id | Enzyme ID | text | ec_id:5.99.1.3 | All hits with Enzyme ID 5.99.1.3
ec_src | Enzyme source | text | ec_src:PF00204 | All hits that have EC IDs assigned based on this PFAM hit
ko_id | KEGG Ortholog ID | text | ko_id:K01369 |
ko_src | KEGG source | text | ko_src:ptr\:453118 |
hmm_id | HMM ID | text | hmm_id:PF00204 | All hits that have a PF00204 HMM assignment
library_id | Library ID | text | library_id:GS-00a-01-01-2P5KB | All hits that belong to library GS-00a-01-01-2P5KB (helpful to search for library entries within populations)
filter | Any filter tag (e.g., sequence duplicates) | text | filter:duplicate | All hits tagged with the filter tag duplicate
 | | | -filter:duplicate | Exclude entries with filter tag duplicate

Alignment fields
blast_species | Species | text | blast_species:Chlamydia* | All Chlamydia species
blast_tree | Taxonomy | integer (NCBI Taxonomy ID) | blast_tree:2 | All bacteria
 | | | -blast_tree:2 | Exclude all bacteria
blast_evalue_exp | Negative E-value exponent | positive integer | blast_evalue_exp:[20 TO *] | All hits with BLAST E-value ≤ 10^-20
 | | | blast_evalue_exp:[10 TO 20] | All hits with 10^-20 ≤ E-value ≤ 10^-10
blast_pid | Percent identity | Float between 0 and 1 | blast_pid:[0.9 TO *] | All hits with BLAST percent identity ≥ 90 %
 | | | blast_pid:[0.6 TO 0.8] | All hits with 60 % ≤ percent identity ≤ 80 %
blast_cov | Percent sequence coverage | Float between 0 and 1 | blast_cov:[0.8 TO *] | All hits with BLAST percent sequence coverage ≥ 80 %
 | | | blast_cov:[0.2 TO 0.3] | All hits with 20 % ≤ sequence coverage ≤ 30 %
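
The search syntax of Table 3 can also be assembled programmatically. The following minimal Python
sketch builds query strings of the kind shown in the table. The helper names (term, value_range,
any_of, exclude) are hypothetical and not part of METAREP; the code only formats text and does not
contact a METAREP or Solr/Lucene server.

```python
def term(field, value):
    """Return 'field:value', escaping colons inside the value (e.g., GO:0000160)."""
    escaped = value.replace(":", "\\:")
    return f"{field}:{escaped}"


def value_range(field, low=None, high=None):
    """Return an inclusive range query; None stands for an open end ('*')."""
    low_text = "*" if low is None else str(low)
    high_text = "*" if high is None else str(high)
    return f"{field}:[{low_text} TO {high_text}]"


def any_of(*clauses):
    """Combine clauses with OR, as in the pyruvate dehydrogenase example."""
    return " OR ".join(clauses)


def exclude(clause):
    """Prefix a clause with '-' to exclude matching entries."""
    return "-" + clause


if __name__ == "__main__":
    # Pyruvate dehydrogenase complex, written out field by field.
    print(any_of(term("ec_id", "1.2.4.1"),
                 term("ec_id", "2.3.1.12"),
                 term("ec_id", "1.8.1.4")))
    # GO ID with the escaped colon, E-value and identity ranges, and an exclusion.
    print(term("go_id", "GO:0000160"))
    print(value_range("blast_evalue_exp", 20))
    print(value_range("blast_pid", 0.6, 0.8))
    print(exclude(term("filter", "duplicate")))
```

Keeping the escaping in one helper avoids forgetting the backslash that Table 3 requires before
colons inside values such as GO IDs.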
a 2 × 2 contingency table is generated separately for each feature with two dataset columns and
rows representing observations for the presence and absence of the respective feature. As multiple
features are simultaneously tested, Bonferroni-corrected p-values and FDR-based q-values are
listed, which are recommended over the individual p-values.
Equality of Proportions Test tests whether the relative counts of a feature in the two datasets
are equal or not. It is equivalent to the chi-square test of independence in case of a 2 × 2
contingency table. It is a large sample approximation test, in which the normal distribution is
used to approximate the binomial distribution. Typically, a minimum cell count of five is
recommended so that the large sample approximation holds reasonably well. The software accounts
for this by automatically setting the Min. Count Option to five, removing any features from the
feature-dataset matrix that have counts lower than five. Results are sorted by ascending p-value.
As multiple features are simultaneously tested, Bonferroni-corrected p-values and FDR-based
q-values are listed. All three measures can be used to filter the data using the drop-down menu
(q-values are recommended). Result representation and filtering can be applied to any of the
statistical tests described subsequently.
Fisher’s Exact Test tests whether the relative counts of a feature in the two datasets are equal
or not. It is an exact test, in which the null distribution follows a hypergeometric
distribution. Thus, it can be used for feature-dataset matrices that contain small cell counts.
However, as it is computationally much more intensive, execution takes much longer than for the
Equality of Proportions Test.

Statistical Tests (2 Populations)
The following population tests are applicable if two populations were selected. A typical scenario
for using these tests is to compare two groups of samples, for example, multiple samples taken
from healthy and diseased individuals or from unfarmed and farmed land. The METAREP administrator
has privileges to create populations from the collection of imported libraries via the Project
Page. As for the two-way dataset tests, multiple testing is taken into account by providing
Bonferroni-corrected p-values and FDR-based q-values which are recommended over the individual
p-values.
Wilcoxon Rank-Sum Test performs multiple two-sample nonparametric Wilcoxon rank-sum tests (also
known as Mann-Whitney Test) in which each feature is compared across two populations. It tests
whether differences in the medians of the normalized counts for a certain feature are due to
chance or not. The null hypothesis states that there is no difference between the
dataset-normalized population medians of a feature. The alternative hypothesis states that there
is a significant difference between the population medians.
METASTATS is a modified nonparametric t-test for detecting differentially abundant features in
metagenomic samples (White et al. 2009). The test can be used to compare features across two
populations. The null hypothesis states that there is no difference between the
dataset-normalized population means of a feature. The alternative hypothesis states that there is
a significant difference between the population means. The null distribution is approximated via
randomization, and a t-statistic is computed for each iteration (see the section “Installation
and Configuration” on how to adjust the number of iterations). For low counts (less than 8),
a Fisher’s Exact Test is used instead of the nonparametric t-test.

Plots
The following plot options are applicable if at least three datasets were selected:
Mosaic Plots draw groups of aligned rectangles, one for each dataset. Features are vertically
stacked with the height of a feature (vertical axis) being proportional to the relative count
(Fig. 3c). The width of the rectangle (horizontal axis) is proportional to the overall dataset
size (compared to the other datasets). Thus, a Mosaic Plot provides a more comprehensive view
than a Barplot as it provides a way of visualizing both the relative feature contribution within
datasets and the relative overall dataset size.
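
As an illustration of the pairwise dataset tests described under “Statistical Tests (2 Datasets)”,
the following Python sketch runs a two-sided Fisher’s Exact Test on the 2 × 2 table of each
feature and then derives Bonferroni-corrected p-values and q-values. It is a sketch under stated
assumptions, not METAREP’s implementation: the function names and the example counts are
hypothetical, and the Benjamini-Hochberg procedure is assumed for the FDR-based q-values because
the text does not name the procedure that METAREP uses.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely as the observed one.
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q-values (assumed FDR procedure), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def compare_two_datasets(counts_a, counts_b):
    """Return (feature, p, Bonferroni p, q) rows sorted by ascending p-value."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    features = sorted(set(counts_a) | set(counts_b))
    p_values = []
    for feature in features:
        present_a, present_b = counts_a.get(feature, 0), counts_b.get(feature, 0)
        # Columns are the two datasets; rows are presence and absence of the feature.
        p_values.append(fisher_exact_two_sided(present_a, present_b,
                                               total_a - present_a, total_b - present_b))
    rows = zip(features, p_values, bonferroni(p_values), benjamini_hochberg(p_values))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # Hypothetical family-level counts for two datasets.
    dataset_a = {"Rhodobacteraceae": 30, "Comamonadaceae": 5, "Other": 65}
    dataset_b = {"Rhodobacteraceae": 10, "Comamonadaceae": 25, "Other": 65}
    for row in compare_two_datasets(dataset_a, dataset_b):
        print(row)
```

The exact test is written with binomial coefficients for clarity; for large counts a statistics
library would be the more practical choice.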
METAREP, Overview, Fig. 3 (continued)
METAREP, Overview, Fig. 3 Compare Page plot options exemplified using a selection of eight Global
Ocean Survey (GOS) samples. Plots A, B, and D show the same selection of datasets based on
organismal composition on the family level (assigned based on the best reference hit using BLAST)
with the Minimum Count Option set to 5. Plot C summarizes the same datasets for the phylum level.
GS-11 and GS-12 were sampled from the Chesapeake Bay, Annapolis, MD, USA, and Delaware Bay, NJ,
USA, respectively. GS-30, GS-31, and GS-32 were sampled close to the Galapagos Islands (GS-30 off
Roca Redonda, GS-31 Fernandina Island, and GS-32 mangrove on Isabella Island). For GS-11, GS-30,
and GS-32, two samples were taken from the same location. The hierarchical clustering and heatmap
dataset-based dendrograms and the MDS plot show that the replicated samples cluster together. The
dendrogram shows that, although the mangrove samples are distinct from the rest of the Galapagos
Islands, they are closer to each other when compared to the East Coast samples. The heatmap shows
an increase of Rhodobacteraceae (orange to white) and a decrease in Comamonadaceae and
Burkholderiaceae families (orange to red) when comparing these two groups based on the
% abundance level

Hierarchical Cluster Plots provide visual summaries of groups of dataset “clusters” that are
similar with respect to their feature composition (Fig. 3a). The input to clustering is
a normalized feature-dataset matrix. Here, for normalization, the total feature count across
selected features is used per dataset (note, this is different from the Relative Count Matrix
normalization which uses the total dataset count). Distances (dissimilarities) between datasets
in multidimensional space can be computed using the feature vectors. Users can choose from
various distance metric options including Euclidean, Morisita-Horn, Bray-Curtis, and Jaccard that
can be selected via a drop-down menu. After distances have been computed, datasets are clustered
using an iterative procedure referred to as hierarchical clustering: initially, each dataset
belongs to its own cluster. During each iteration, an optimal cluster pair is aggregated into
a higher-level cluster and distances are recomputed between the new and the remaining clusters.
The process continues until there is one single cluster and a tree structure of successive
clustering events, a dendrogram, is drawn (Fig. 3a, b). Users can influence the process of
recomputing distances based on several aggregation methods including single linkage (uses
minimum distance between the new cluster members and an outside cluster), average linkage (uses
average distance), and complete linkage (uses maximum distance). Centroid uses the mean vector
differences, while Ward’s minimum variance minimizes the overall within-cluster variance. For
a review see Milligan et al. (1980). According to Milligan et al., the method with the best
overall performance has been either average linkage or Ward’s minimum variance. A PDF of the
dendrogram and computed distances can be downloaded via the Results Panel’s export option.
Heatmap Plots are similar to Hierarchical Clustering Plots in that they visualize datasets as
well as feature differences using hierarchical clustering in the form of a dendrogram (Fig. 3b,
shown on the right and top, respectively). The main difference is additional quantitative
information in the form of a heatmap in which normalized feature-dataset counts are color coded
based on a color gradient. Users can change the base color of the gradient. Columns and rows are
reordered to optimize the layout of the two dendrograms on each axis. Hierarchical clustering
options include the Distance Metric as well as the Cluster Method. A PDF of the heatmap and both
sets of computed distances can be downloaded via the Results Panel’s export option.
Multidimensional Scaling Plots apply non-metric multidimensional scaling to project differences
between datasets onto a two-dimensional plane in which similar datasets are closer and less
similar datasets are farther apart (Fig. 3d). Like for hierarchical clustering, a dissimilarity
matrix based on
the normalized counts is used as input for the algorithm, which can be specified by the user.
A PDF of the final heatmap and the computed distances can be downloaded via the Results Panel’s
export option.

Homology Searches
The BLAST Sequence Page provides functionality to screen multiple datasets for a protein sequence
of interest (Fig. 4). Highly conserved single-copy marker genes, such as dnaG, for example, can
be used to approximate the number of genomes in a dataset (Wu and Eisen 2008). The page uses the
same “Select Datasets” panel as the Compare Page. BLAST options include the input sequence text
area, BLAST Min. E-value, and a text field for entering a filter query. The Result Panel
summarizes BLASTP alignment results filtered for homologous entries that match the filter query
in different formats that can be selected by choosing one of three tabs. The Annotation Tab lists
key alignment statistics along with annotations of homologous entries. The Alignment Tab displays
the default BLASTP alignment output including textual representation of sequence alignments. The
Tabular Tab tabulates the default tab-delimited BLASTP output (-m8 BLASTP option). Results for
each of these tabs can be downloaded via the Results Panel’s export option. To activate this
option, protein sequences of each dataset need to be formatted using the BLAST utility program
formatdb and organized in a sequence store on the Web server that runs the METAREP instance (see
section “Installation and Configuration”).

Installation and Configuration

METAREP utilizes a variety of open-source software including R, Lucene/Solr, CakePHP, MySQL,
Apache HTTP Server, and SQLite that need to be downloaded and installed. Version 1.4.0 of METAREP
can be downloaded at https://github.com/jcvi/METAREP/zipball/1.4.0-beta. For installation
instructions please visit https://github.com/jcvi/METAREP/wiki/Installation-Guide-v-1.4.0. For
later versions, please visit the Project Page at https://github.com/jcvi/METAREP/wiki.
As part of the data import process, additional annotation attributes including NCBI Taxonomy
lineage, GO assignments, and KEGG pathways are fetched from a SQLite database. The database can
be updated using the scripts/perl/metarep_update_database.pl script. To update the KEGG
attributes, the script needs to be pointed to a local snapshot of KEGG downloaded from the KEGG
FTP site (license is required).
Once installation is completed, the instance can be configured by modifying the
“app/config/metarep.php” file. An important configuration that impacts performance and stability
is the number of Solr/Lucene servers used for retrieving annotation information. While METAREP
can be run in a setup with a single server (SOLR_MASTER_HOST), for best performance and
stability, we recommend running a second server (slave) on another machine. The additional
server can be configured using the SOLR_SLAVE_HOST variable. In theory, more than two slave
servers can be defined, but METAREP currently supports only one slave server. A two-server setup
can handle more concurrent traffic than a single server and thus can improve the average query
response time (an important factor if many users are anticipated to access data simultaneously).
The two Solr/Lucene servers will replicate data across the two different machines, and user
traffic is balanced between the two servers using an internal load-balancing mechanism
implemented in the Web-logic component of the METAREP software. The slave server, using Solr’s
inbuilt replication functionality, will automatically replicate new index files that have been
uploaded to the master server. If one server goes down (for maintenance, testing, malfunction,
etc.), the other server can still handle user requests. The two-server system is thus also more
fault tolerant and enables updates to the server without interfering with the user experience.
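
The combination of load balancing and failover described for the two-server setup can be
illustrated with a small Python sketch. It is not METAREP’s load-balancing code, which the text
does not show: the ServerPool class, the host names, and the is_alive callback are hypothetical,
and round-robin selection is assumed as one simple balancing strategy.

```python
class ServerPool:
    """Round-robin selection over servers that skips servers reported as down."""

    def __init__(self, hosts, is_alive):
        self._hosts = list(hosts)
        self._is_alive = is_alive  # callable(host) -> bool, supplied by the caller
        self._cursor = 0

    def pick(self):
        """Return the next reachable host, starting at the round-robin cursor."""
        for offset in range(len(self._hosts)):
            index = (self._cursor + offset) % len(self._hosts)
            if self._is_alive(self._hosts[index]):
                # Advance the cursor so the next request goes to the other server.
                self._cursor = (index + 1) % len(self._hosts)
                return self._hosts[index]
        raise RuntimeError("no server is reachable")


if __name__ == "__main__":
    down = set()  # hosts currently considered unreachable (hypothetical health state)
    pool = ServerPool(["solr-master.example.org", "solr-slave.example.org"],
                      lambda host: host not in down)
    print([pool.pick() for _ in range(4)])  # alternates between both servers
    down.add("solr-master.example.org")     # simulate maintenance on the master
    print([pool.pick() for _ in range(2)])  # all requests now go to the slave
```

With both servers up, requests alternate between them; when one is marked as down, every request
is served by the remaining server, which is the fault-tolerance property described above.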
METAREP, Overview, Fig. 4 Screenshot of the BLAST Sequence Page. A protein sequence of interest
can be searched against a selection of datasets. BLAST results can be displayed and exported in
various formats including annotation (shown), alignment (shown in the zoom panel), and tabular
The INTERNAL_EMAIL_EXTENSION variable can be specified to identify internal users that register
with the instance and set permissions accordingly. By default, users that register with the
specified e-mail extension are granted full data access. The GOOGLE_ANALYTICS_TRACKER_ID and
GOOGLE_ANALYTICS_DOMAIN_NAME variables configure the instance to synchronize Web usage with
Google Analytics to track usage statistics.
The NUM_METASTATS_BOOTSTRAP_PERMUTATIONS variable sets the number of replicates to determine the
null distribution for the METASTATS test and can be increased to increase the precision of the
p-values (see White et al. 2009 for details).
To activate the METAREP BLAST functionality, searching and exporting of sequences, the
SEQUENCE_STORE_PATH variable needs to be defined. This path points to the location on the Web
server where the formatdb-formatted protein sequence files are kept (organized by project ID and
dataset, Table 4). The perl/scripts/metarep_format_sequence.pl utility can be used to format
sequence data according to this format. If an FTP server is available, data sharing of
a collection of custom files per dataset via the dataset download option
METAREP, Overview, Table 4 METAREP sequence and FTP data organization

Feature | Root directory | Project | Dataset directory | Files
Sequence export and BLAST functionality | Sequence store root directory | 12 | GS695_GDQ27C301_0p1 | GS695_GDQ27C301_0p1.phr, GS695_GDQ27C301_0p1.pin, GS695_GDQ27C301_0p1.psd, GS695_GDQ27C301_0p1.psi, GS695_GDQ27C301_0p1.psq, formatdb.log
 | | 12 | GS695_GLDFQNX02 | GS695_GLDFQNX02_viral/, GS695_GLDFQNX02_viral.phr, pin, psd, psi, psq, formatdb.log
FTP export functionality | FTP root directory | 12 | | GS695_GDQ27C301_0p1.tgz
 | | 12 | | GS695_GLDFQNX02.tgz
can be activated by specifying the FTP_HOST, FTP_USERNAME, and FTP_PASSWORD variables. The
software identifies FTP data by looking for the project ID folder followed by a tar-gzipped file
that has a matching dataset name, i.e., <dataset-names>.tgz.

Example Hardware Configurations
The main requirements are driven by the amount of annotations that are to be stored in index
files and served by a Solr/Lucene server. The main impact on performance for a single machine is
the amount of memory available for result retrieval, caching, and the operating system’s file
caching. If annotations are weighted, i.e., the weight field is set to values other than 1, the
CPU requirements increase (see Goll et al. 2012, Fig. 6). We are currently running a two-server
system that is served by two load-balanced Dell PowerEdge R710 servers each having eight cores
(2.66 GHz), 72 GB RAM, and 2 × 600 GB HD. So far we have successfully indexed 190 M. Our HMP
METAREP instance that serves over 400 million weighted annotation entries runs on a single
server with two multi-threaded Xeon X7560 2.26 GHz processors with a total of 16 cores
(32 threads), 256 GB RAM, and 4 terabytes of disk space. For performance benchmarks, please see
Goll et al. (2010), Supplementary Fig. 1, and Goll et al. (2012), Fig. 6.

Additional Resources

As part of the NIH HMP project, the software was tested with short-read annotations derived from
over 14 trillion Illumina reads (Goll et al. 2012). The study includes several scenarios on how to
investigate the NIH human microbiome data including how to analyze specific metabolic markers,
cluster datasets based on their metabolic profile, and identify pathways that are differentially
abundant between human body habitats. The data can be accessed at www.jcvi.org/hmp-metarep. The
following short video tutorial summarizes key functionality (YouTube ID:7FPJaPyLjMk). The METAREP
home page at www.jcvi.org/metarep provides an anonymous login via the “Try It” button to evaluate
the latest functionality for a collection of ocean samples taken from the North Pacific
Subtropical Gyre (DeLong et al. 2006). The open-source code of the software and developer
information including how to contribute to the open-source project is available at the project’s
source code repository at https://github.com/jcvi/METAREP. For questions and comments, please
join the mailing list at www.jcvi.org/metarep or directly send an e-mail to
metarep@googlegroups.com.

References

Abubucker S, Segata N, Goll J, Schubert AM, Izard J, et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol. 2012;8:e1002358. doi:10.1371/journal.pcbi.1002358.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10. doi:10.1016/S0022-2836(05)80360-2.
Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, et al. CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics. 2011;12:356. doi:10.1186/1471-2105-12-356.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet. 2000;25:25–9. doi:10.1038/75556.
DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311:496–503. doi:10.1126/science.1120250.
Goecks J, Nekrutenko A, Taylor J, Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11:R86. doi:10.1186/gb-2010-11-8-r86.
Goll J, Rusch DB, Tanenbaum DM, Thiagarajan M, Li K, et al. METAREP: JCVI metagenomics reports–an open source tool for high-performance comparative metagenomics. Bioinformatics. 2010;26:2631–2. doi:10.1093/bioinformatics/btq455.
Goll J, Thiagarajan M, Abubucker S, Huttenhower C, Yooseph S, et al. A case study for large-scale human microbiome analysis using JCVI’s Metagenomics Reports (METAREP). PLoS ONE. 2012;7:e29044.
Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003;31:371–3.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68:669–85. doi:10.1128/MMBR.68.4.669-685.2004.
Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012;486:215–21. doi:10.1038/nature11209.
Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J. The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res. 2004;32:D235–9. doi:10.1093/nar/gkh117.
Markowitz VM, Chen I-MA, Chu K, Szeto E, Palaniappan K, et al. IMG/M-HMP: a metagenome comparative analysis system for the human microbiome project. PLoS ONE. 2012;7:e40151.
Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386. doi:10.1186/1471-2105-9-386.
Milligan GW. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika. 1980;45(3):325–42.
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40:D290–301. doi:10.1093/nar/gkr1065.
Stein LD. The case for cloud computing in genome informatics. Genome Biol. 2010;11:207. doi:10.1186/gb-2010-11-5-207.
Sun S, Chen J, Li W, Altintas I, Lin A, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39:D546–51. doi:10.1093/nar/gkq1102.
Tanenbaum DM, Goll J, Murphy S, Kumar P, Zafar N, et al. The JCVI standard operating procedure for annotating prokaryotic metagenomic shotgun sequencing data. Stand Genomic Sci. 2010;2:229–37. doi:10.4056/sigs.651139.
White JR, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol. 2009;5:e1000352. doi:10.1371/journal.pcbi.1000352.
Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9:R151. doi:10.1186/gb-2008-9-10-r151.
MetaTISA: Metagenomic Gene Start Prediction with

Huaiqiu Zhu1 and Gangqing Hu2
1Department of Biomedical Engineering, and Center for Theoretical Biology, Peking University,
Beijing, China
2Systems Biology Center, National Heart, Lung and Blood Institute, National Institutes of Health,
Bethesda, MD, USA

Synonyms

Gene start annotation; Translation initiation site (TIS) prediction

Definition

Gene start: the start position from which a genomic sequence can be translated into protein.

Introduction

Exact knowledge of gene starts plays an important role in the identification of native purified
proteins from high-throughput proteomics (Poole et al. 2005). In addition, a correct prediction
of gene start facilitates the identification of cis-regulatory signals related to translation
initiation (Hu et al. 2008c) and thus facilitates the understanding of the diversity and
evolution scenario of translation initiation mechanisms (Zheng et al. 2011). However, gene start
annotation in widely used public databases such as GenBank and RefSeq is not of high quality in
general (Nielsen and Krogh 2005). In particular, the longest open reading frame is frequently
used to annotate a protein-coding gene (Besemer et al. 2001), which results in a systematically
low quality of gene start annotations for GC-rich species (Nielsen and Krogh 2005; Hu
et al. 2008b). Therefore, accurate gene start prediction has been an intensive research subject
for many labs in the last decade.
In recent years, gene start prediction for microbial genomes has achieved high accuracies for
a number of methods (Besemer et al. 2001; Zhu et al. 2004; Tech et al. 2005; Delcher et al. 2007;
Makita et al. 2007; Hu et al. 2008a, 2009; Hyatt et al. 2010). Most of the methods are
unsupervised and can be roughly sorted into two groups based on whether specific assumptions are
made on gene start-related features. The first group involves statistical models specifically
designed for the cis-regulatory signals in the vicinity of gene start such as the Shine-Dalgarno
(SD) signal (Shine and Dalgarno 1974). Assumptions are then made regarding the length of the
signal, the start codon usage, and the distances between the signal and start codon (Besemer
et al. 2001; Zhu et al. 2004; Delcher et al. 2007; Makita et al. 2007; Hu et al. 2008a; Hyatt
et al. 2010). These methods show consistently high prediction accuracies on a number of genomes
such as E. coli and B. subtilis. However, the assumptions that apply to these genomes may not
apply to others. The other methods build a statistical model to characterize the whole sequences
around gene starts and do not make specific assumptions on gene start-related genomic features.
Tech et al. (2005) introduced a second-order Markov model with positional smoothing to
characterize sequence properties around gene start and achieved comparable accuracies to other
methods. This method however is criticized for potential dependency on the quality of initial
annotation (Makita et al. 2007). Later on, Hu et al. (2009a) introduced a classification of
putative start codons into three categories based on evolutionary pressures acting on the
sequences: true start codons (purifying selection), false start codons in intergenic regions
(minimal sequence feature preserved under neutral selection), and false start codons in coding
regions (period-three oscillations in sequence content under purifying selection) (Hu
et al. 2008b). The sequence feature of each group is then characterized by a non-homogeneous
Markov model, and an iterative unsupervised procedure is utilized for parameter estimations (Hu
et al. 2009b). The method achieves a better accuracy than other
methods, and the prediction is independent of the quality of initial annotation (Hu et al. 2009b).
Since many of the metagenomics projects involve high-throughput proteomics to identify novel
proteins followed by experimental validation, the development of gene start prediction algorithms
for metagenomic fragments receives increasing attention (Hoff et al. 2008). It is important to
realize that although the current methods are successful in gene start prediction for microbial
genomes, they are not directly applicable to metagenomic projects (Hu et al. 2009a). This is
largely caused by the fragmentary nature of the metagenomic sequences and the uncertainties in
their phylogenetic origins. A tool called MetaTISA (Hu et al. 2009a) was implemented to address
this question.

“Binning Followed by Self-Training”

MetaTISA is essentially a sequential application of metagenomics binning (a process that
identifies from what species a particular sequence has originated) and an unsupervised procedure
for gene start prediction within each bin:
1. Binning: given a set of metagenomic fragments, each fragment is assigned to a genus based on
its k-mer nucleotide frequencies as described in Sandberg et al. (2001). Briefly, a metagenomic
fragment F of size l consists of l-(k-1) overlapping motifs M of size k. The probability of
finding fragment F in genus Gi, denoted by p(F|Gi), can be estimated by a product of the
probabilities of finding each motif M in genus Gi, which can be estimated from the normalized
k-mer nucleotide frequencies within genus Gi. Based on Bayesian statistics, given the occurrence
of fragment F, the probability that F belongs to Gi may be expressed as
p(Gi|F) = [p(F|Gi)p(Gi)]/P(F), where P(F) is the probability of finding fragment F, which is
independent of genus, and p(Gi) is a prior probability that reflects the relative abundance of
genus Gi in the metagenomic sample of concern. MetaTISA assumes that the prior probability is
equal among genera and then assigns the phylogenetic origin of fragment F to the genus that
reports the maximal value of p(F|Gi). Since the k-mer frequencies are pre-calculated for each
genus Gi, it is crucial to keep the parameters updated to maintain the classification accuracy
especially when novel genera are discovered.
2. Unsupervised gene start predictions: fragments assigned to the same genus are supposed to have
close phylogenetic origin and share a similar mechanism of translation initiation. In this
regard, gene start prediction methods developed for microbial genomes may be applied. MetaTISA
utilizes the methods described in Hu et al. (2009b) to accomplish this, with several
considerations. Firstly, it trains the parameters for each genus in an unsupervised manner (also
known as self-training). This offers the advantage of not requiring a set of known training
data. However, for a genus that receives only a small number of fragments (<200 by default), the
prediction does require pre-computed parameters. But note that the parameter training for this
genus is also an unsupervised process (Hu et al. 2009b). Secondly, the prediction is independent
of the quality of input. For metagenomic samples, the quality of gene start prediction from
a gene annotation pipeline may vary considerably across fragment bins with different GC content.
Thirdly, the method estimates the probability that a putative start codon is within coding
regions. This helps tell the completeness of a coding sequence within a fragment. Last but not
least, it outputs genus-specific parameters that may facilitate the comparison of TIS-related
signals among different metagenomic samples (Noguchi et al. 2008).

Prediction Accuracies

MetaTISA is designed as a post-processor for gene prediction pipelines currently available for
metagenomes, such as MGA (Noguchi et al. 2008), GeneMark.hmm (Zhu et al. 2010), and Glimmer-MG
(Kelley et al. 2012). The improvements brought by MetaTISA are demonstrated by post-processing
gene predictions from MGA on metagenomic fragments simulated using 100 genomes. Two kinds of
simulations with different fragment sizes are conducted: 400 bp for 454 or 700 bp for Sanger.
When assessed on experimentally verified datasets, the sensitivities are improved by 6–8 %
without a loss of specificities regardless of the choice of fragment length (Hu et al. 2009a). An
indirect way of accuracy assessment on real metagenomic samples is to investigate the
TIS-feature-associated parameters self-trained for each genus. As an example, the method is
applied to post-process MGA’s predictions for Human Gut Community Subject 7, and as a result it
reveals expected RBS patterns such as SD signals for genera within Firmicutes (Hu et al. 2009a).

Cross-References

▶ FragGeneScan: Predicting Genes in Short and Error-Prone Reads
▶ MetaBin

References

Besemer J, Lomsadze A, et al. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001;29(12):2607–18.
Delcher AL, Bratke KA, et al. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23(6):673–9.
Hoff KJ, Tech M, et al. Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics. 2008;9:217.
Hu G, Liu Y, et al. New solutions of translation initiation site prediction for prokaryotic genomes. Prog Biochem Biophys. 2008a;35(11):1254–62.
Hu G, Zheng X, et al. Computational evaluation of TIS annotation for prokaryotic genomes. BMC Bioinformatics. 2008b;9:160.
Availability Hu G, Zheng X, et al. ProTISA: a comprehensive resource
for translation initiation site annotation in prokaryotic
genomes. Nucleic Acids Res. 2008c;36(Database
The tool is written in C++ and the source code is issue):D114–9.
freely available under GNU GPL license. A web Hu G, Guo J, et al. MetaTISA: Metagenomic Translation
server (http://mech.ctb.pku.edu.cn/MetaTISA/) Initiation Site Annotator for improving gene start pre-
diction. Bioinformatics. 2009a;25(14):1843–5.
is dedicated for the user to run the program online
Hu G, Zheng X, et al. Prediction of translation initiation
and to receive the results by email. The web site for microbial genomes with TriTISA. Bioinfor-
server also provides downloading service for matics. 2009b;25(1):123–5.
source codes, files for pre-computed parameters, Hyatt D, Chen GL, et al. Prodigal: prokaryotic gene rec-
ognition and translation initiation site identification.
and executable version for Windows and Linux
BMC Bioinformatics. 2010;11:119.
platforms. Kelley DR, Liu B, et al. Gene prediction with Glimmer for
metagenomic sequences augmented by classification
and clustering. Nucleic Acids Res. 2012;40(1):e9.
Makita Y, de Hoon MJ, et al. Hon-yaku: a biology-driven
Summary Bayesian methodology for identifying translation ini-
tiation sites in prokaryotes. BMC Bioinformatics.
By a sequential combination of metagenomic 2007;8:47.
fragments binning and a self-training of parame- Nielsen P, Krogh A. Large-scale prokaryotic gene predic-
tion and comparison to genome annotation. Bioinfor-
ters within each bin, MetaTISA significantly
matics. 2005;21(24):4322–9.
improves the identification of gene starts for Noguchi H, Taniguchi T, et al. MetaGeneAnnotator:
metagenomes. Noteworthy, this “binning- detecting species-specific patterns of ribosomal
followed-by-self-retraining” scheme has been binding site for precise gene prediction in anonymous
prokaryotic and phage genomes. DNA Res.
successfully applied to the prediction of protein-
2008;15(6):387–96.
coding sequences for metagenomes (Kelley Poole 2nd FL, Gerwe BA, et al. Defining genes in the
et al. 2012). genome of the hyperthermophilic archaeon
Pyrococcus furiosus: implications for all microbial genomes. J Bacteriol. 2005;187(21):7325–32.
Sandberg R, Winberg G, et al. Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res. 2001;11(8):1404–9.
Shine J, Dalgarno L. The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc Natl Acad Sci U S A. 1974;71(4):1342–6.
Tech M, Pfeifer N, et al. TICO: a tool for improving predictions of prokaryotic translation initiation sites. Bioinformatics. 2005;21(17):3568–9.
Zheng X, Hu G, et al. Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes. BMC Genomics. 2011;12:361.
Zhu H, Hu G, et al. Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics. 2004;20(18):3308–17.
Zhu W, Lomsadze A, et al. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 2010;38(12):e132.

Metaxa, Overview

Johan Bengtsson-Palme1, Martin Hartmann2, K. Martin Eriksson3 and R. Henrik Nilsson3
1 Institute of Neuroscience and Physiology, The Sahlgrenska Academy, University of Gothenburg, Göteborg, Sweden
2 Molecular Ecology, Agroscope Reckenholz-Tänikon Research Station ART, Zurich, Switzerland
3 Department of Biological and Environmental Sciences, University of Gothenburg, Göteborg, Sweden

Synonyms

16S extraction; rRNA extraction; SSU extraction; Taxonomic assignment

Definition

Metaxa is a software tool for extracting full-length and partial ribosomal small subunit (SSU; 16S/18S/12S) sequences from metagenomic datasets and for classifying the extracted sequences to taxonomic domains and organelle of origin. Metaxa is freely available from http://microbiology.se/software/metaxa/.

Introduction

A common question in metagenomic studies concerns the species composition of the community sampled (Desai et al. 2012). This is frequently addressed using a specific genetic marker, typically the ribosomal RNA (rRNA) small subunit (SSU) gene sequence (also referred to as the 16S, 18S, or 12S subunit depending on the lineage under scrutiny). In some studies, the SSU gene is amplified by PCR and sequenced separately in order to study microbial diversity. However, even if the SSU sequences are not targeted for separate sequencing, it is still possible to identify and extract the SSU component of a metagenome. This task has traditionally been carried out through similarity searches against sequence databases such as GenBank (Benson et al. 2009), SILVA (Pruesse et al. 2007), GreenGenes (DeSantis et al. 2006), or RDP (Cole et al. 2007).
The complexity of the data requires frequent manual intervention to accurately sort out the origin of the sequences in such BLAST-based approaches, and the process is further complicated by the fact that the SSU gene is found not only in the core genome of bacteria, archaea, and eukaryotes but also in the chloroplasts and mitochondria of eukaryote organisms. These different gene copies, although often very similar to one another, are non-orthologous and should in most cases not be analyzed jointly. Metagenomic efforts are generally interested in the bacterial and/or eukaryote diversity in the sample, and thus any mitochondrial or chloroplast SSU sequences, bearing high similarity to bacterial SSU genes, may confound the analysis if left in the dataset. To avoid noise and bias associated with analyzing non-orthologous sequences as if they were orthologous, the sequences must be subjected to manual inspection, which is a time-consuming process further complicated by the
large number of incorrectly identified or poorly annotated reference sequences in the public sequence databases (Bidartondo 2008; Hartmann et al. 2011).
Metaxa (Bengtsson et al. 2011) is a software package that resolves the problem of extracting and sorting SSU sequences to origin in an accurate and rapid way. The end result is a set of FASTA files, each representing all SSU sequences from a particular organelle or taxonomic domain, for further analysis of species composition or other endeavors.

Methods

Extraction
The rRNA SSU gene is composed of eight to nine hypervariable (“V”) regions flanked by more conserved domains (Hartmann et al. 2010). Metaxa carries out the extraction of SSU sequence fragments from the metagenome using the HMMER package (Eddy 2010) and Hidden Markov Models (HMMs) representing the most conserved parts of the SSU gene, chiefly at the 5′ and 3′ end of each V region. These HMMs are modeled according to the same principles as those of V-Xtractor (Hartmann et al. 2010). Since the Metaxa models represent a set of highly conserved domains, false-positive matches can be all but avoided as only high-scoring profile matches are considered. Metaxa features HMM profiles representing the archaeal and bacterial 16S genes, the eukaryote 18S gene, the mitochondrial 12S and 16S genes, and the chloroplast 16S gene. These sets of HMM profiles enable Metaxa to identify and distinguish among all these classes of SSU sequences.

Classification
After extracting all SSU sequences from the query dataset, Metaxa proceeds to classify the extracted SSU sequence fragments. This is performed by comparing each fragment to a carefully selected set of reference SSU sequences from GreenGenes, SILVA, CRW (Cannone et al. 2002), and MitoZoa (Lupi et al. 2010) using BLAST (Altschul et al. 1997). By default, the five highest-scoring BLAST matches are examined for origin in terms of organelle or taxonomic domain, and each origin is given a score based on the number of sequences among the top five BLAST hits that belong to the respective origin. The matches to the HMM profiles in the previous step are weighted together with these BLAST-based origin scores to make a final call on the most likely origin of the sequence fragment. In cases where the origin cannot be determined with certainty, but where there is a strong candidate, Metaxa assigns the sequence to the most likely origin, but flags it as potentially in need of manual inspection. If scores for origin are tied altogether, the sequence is assigned into a special “uncertain” bin. In the two latter cases, sequence alignments of the extracted fragment and the five best BLAST matches are computed automatically using MAFFT (Katoh and Toh 2008), to assist the user in the interpretation process.

Input and Output
Metaxa takes input in the FASTA format and outputs one FASTA file for each origin found. Optionally, Metaxa can also produce output in table format. The entire running process is outlined in Fig. 1.

Performance

Metaxa has been shown to classify more than 99.95 % of the core-release sequences in the SILVA database according to their annotated origin, and it has a false-positive rate of 0.00012 % (Bengtsson et al. 2011). When evaluated on simulated metagenomic data comprising three sets of 100,000 sequences with fragment lengths of 1,000, 300, and 100 bp, Metaxa processed the datasets in 112, 47, and 35 min, respectively, with very high accuracy down to typical 454 read lengths (300 bp), retaining fidelity for bacterial sequences even at read lengths as short as 100 bp (Fig. 2). This suggests that Metaxa is highly reliable for Sanger, as well as 454-derived, metagenomes, and that it is useful even on metagenomes generated using short-read
sequencing technologies, such as Illumina. Metaxa takes advantage of multiple processor cores, if available, and it has no software or hardware restriction on the number of input sequences.

Metaxa, Overview, Fig. 1 Overview of the Metaxa running process

Applications

Metaxa has obvious uses in deriving taxonomic inferences from metagenomic sequence sets. For example, a set of sequences extracted using Metaxa could be used for sequence diversity analysis. However, because of the classification capabilities of Metaxa, it is also useful in sorting out PCR-amplified SSU libraries before continuing with species richness investigations such as rarefaction or species accumulation analysis. Here, the ability of Metaxa to separate chloroplast and mitochondrial SSU sequences from other SSU entries is crucial for the accuracy of the downstream analysis. Metaxa could also be
used as a tool to verify the authenticity of annotations in SSU sequence databases and reference libraries.

Metaxa, Overview, Fig. 2 Performance of Metaxa at different read lengths

Availability

Metaxa is written in Perl and released as an open-source package under the GNU GPL v. 3 license. It runs on Unix and Linux platforms, including Mac OS X. The software package can be freely downloaded from http://microbiology.se/software/metaxa/.

Summary

Metaxa is a high-performance software tool for extracting and classifying SSU sequences from metagenomic datasets. The accuracy of the software is very high, providing high sensitivity toward SSU fragments even at short read lengths while maintaining a false-positive rate of about 0.00012 %. Metaxa is fast compared to, e.g., BLAST, and it takes advantage of multiple processor cores where available. It can be used as a tool for taxonomic analysis of metagenomes as well as a classification tool for SSU amplicons. Metaxa is freely available from http://microbiology.se/software/metaxa/.

Cross-References

▶ Microbial Diversity, Bar-Coding Approaches
▶ Phylogenetics, Overview
▶ Silva Databases

References

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Bengtsson J, Eriksson KM, Hartmann M, Wang Z, Shenoy BD, Grelet G-A, Abarenkov K, Petri A, Alm Rosenblad M, Nilsson RH. Metaxa: a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and
chloroplasts in metagenomes and environmental sequencing datasets. Antonie van Leeuwenhoek. 2011;100(3):471–5.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2009;37(Database issue):D26–31.
Bidartondo MI. Preserving accuracy in GenBank. Science (New York). 2008;319(5870):1616.
Cannone JJ, Subramanian S, Schnare MN, Collett JR, D’Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Müller KM, Pande N, Shang Z, Yu N, Gutell RR. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics. 2002;3:2.
Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, McGarrell DM, Bandela AM, Cardenas E, Garrity GM, Tiedje JM. The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Res. 2007;35(Database issue):D169–72.
Desai N, Antonopoulos DA, Gilbert JA, Glass EM, Meyer F. From genomics to metagenomics. Curr Opin Biotechnol. 2012;23(1):72–6.
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006;72(7):5069–72.
Eddy S. HMMER. http://hmmer.janelia.org (2010), Accessed 2012-05-15.
Hartmann M, Howes CG, Abarenkov K, Mohn WW, Nilsson RH. V-Xtractor: an open-source, high-throughput software tool to identify and extract hypervariable regions of small subunit (16S/18S) ribosomal RNA gene sequences. J Microbiol Method. 2010;83(2):250–3.
Hartmann M, Howes CG, Veldre V, Schneider S, Vaishampayan PA, Yannarell AC, Quince C, Johansson P, Björkroth KJ, Abarenkov K, Hallam SJ, Mohn WW, Nilsson RH. V-REVCOMP: automated high-throughput detection of reverse complementary 16S rRNA gene sequences in large environmental and taxonomic datasets. FEMS Microbiol Lett. 2011;319(2):140–5.
Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008;9(4):286–98.
Lupi R, de Meo PD, Picardi E, D’Antonio M, Paoletti D, Castrignanò T, Pesole G, Gissi C. MitoZoa: a curated mitochondrial genome database of metazoans for comparative genomics studies. Mitochondrion. 2010;10(2):192–9.
Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35(21):7188–96.

Microbial Diversity, Bar-Coding Approaches

James A. Foster
Department of Biological Sciences, Institute for Bioinformatics & Evolutionary Studies (IBEST), University of Idaho, Moscow, ID, USA

Introduction

Amplicon fingerprints are useful for ecological studies of microbial communities. Most studies to date have used these techniques for determining how many species are present (richness, or alpha diversity) in what ratios (beta diversity), which populations or species are present, and what metabolic or ecological functions the community and its constituents may provide. These data inform downstream analyses to determine the response of microbial ecosystems to environmental change, the relationship between human microbiota and health, the ecological succession, the co-evolutionary constraints within and between communities and their environments, and more (Foster et al. 2012a).
This encyclopedia entry focuses on bacterial fingerprinting, since it has a longer history and is more mature than fingerprinting techniques for other kingdoms of life. But these techniques are in principle applicable to all microbial organisms, including archaea and eukarya such as fungi, diatoms and tiny arthropods, and viruses (assuming they are organisms). Amplicons for bacteria have been in use since the beginning of the molecular revolution and their gene products have been well characterized. However, potential amplicons exist for all organisms. As dominant as bacterial life is on Earth, it is by no means the only microbial realm of interest. Nonetheless, it is the focus of this entry.
The terminology herein is taken from the bacterial ecology literature. A population is a collection of individuals of the same type. In sexual organisms, a population is typically a collection of individuals from the same species.
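The quantities named in this Introduction, richness and the ratios in which populations occur, can be made concrete with a small calculation. The following Python sketch is illustrative only and is not part of any tool cited in this entry. It assumes that every amplicon read has already been assigned a population label (the labels and counts below are invented), and it adds the Shannon index as one common way to summarize relative abundances, a choice this entry does not itself prescribe. The function name is hypothetical.

```python
import math
from collections import Counter


def summarize_community(assignments):
    """Return (richness, relative_abundance, shannon) for a list of population labels."""
    counts = Counter(assignments)
    total = sum(counts.values())
    # Richness: the number of distinct populations observed.
    richness = len(counts)
    # Relative abundance: the share of reads assigned to each population.
    relative_abundance = {pop: n / total for pop, n in counts.items()}
    # Shannon index (natural log); an assumed summary statistic, not one named in the entry.
    shannon = -sum(p * math.log(p) for p in relative_abundance.values())
    return richness, relative_abundance, shannon


# Invented example: ten reads assigned to three populations.
reads = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
richness, relative_abundance, shannon = summarize_community(reads)
print(richness)                 # 3
print(relative_abundance["A"])  # 0.6
print(round(shannon, 3))        # 0.898
```

Two communities with the same richness can still differ strongly in such an abundance summary, which is why both kinds of quantity are usually reported.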
In asexual organisms, however, the species concept is problematic. In any case, one may be interested in discriminating to a subspecific or strain level, or indeed to higher levels. Thus, the definition of a population is relative to the specific question under investigation. A community is a collection of co-occurring populations. Therefore, the number of distinct populations in a community is the richness of that community. The diversity of a community includes the relative abundance of populations and their potential interactions.
Amplicon fingerprinting techniques have developed in tandem with new sequencing technologies. Current fingerprinting approaches are particularly well adapted to modern high-throughput sequencing and have largely replaced older techniques based on electrophoresis or capillary sequencers. The older approaches are still useful for crude estimates using older, and therefore inexpensive and less used, equipment. However, as the cost of new sequencing technologies drops, more modern amplicon fingerprinting approaches are likely to continue to replace their predecessors.
Amplicon fingerprinting techniques are culture independent, meaning that it is unnecessary to grow cultures of individual populations or communities before extracting DNA. This is particularly significant in the microbial world, since most bacteria and archaea cannot currently be grown in the lab. Estimates show that as much as 97 % of existing microbial biodiversity is currently uncultivable (Whitman et al. 1998). These techniques enable ecological and functional analysis of communities that largely consist of otherwise inaccessible “biotic dark matter.”

Choosing Amplicons

With bacteria, the amplicon of choice has long been the gene for the small RNA subunit of the ribosome, known as 16S rDNA for its size (16 Svedberg units). Nearly universal primers exist for several regions of this gene. The secondary structure of 16S rDNA is well characterized and highly conserved, providing a reliable guide for fast and accurate alignment of large sets of sequences (Nawrocki et al. 2009). This gene is strongly conserved, since it is a critical part of the translational machinery in bacteria (and some archaea). So it is in principle useful for recognizing deep phylogenetic divergences. And finally the 16S rDNA gene shows little evidence of horizontal transfer, which makes it more useful as a phylogenetic marker. Woese and Fox first demonstrated the utility of 16S rDNA analysis with their discovery that archaea are a distinct kingdom of life (Woese 2004; Woese and Fox 1977).
Several hypervariable regions in the 16S rDNA gene provide enough sequence variation to distinguish bacterial populations, sometimes to the strain level. Hypervariable regions typically contain loops in the rRNA secondary structure, which change more as species evolve, since they are not as structurally constrained as stems. Reliable primers exist for nine regions, known as V1 through V9, that were short enough to be completely sequenced by Sanger sequencing when the primers were developed (Kim et al. 2011). Hypervariable regions differ in the specificity and precision with which they can distinguish different types of organisms, so the choice of amplicon primers is study specific (Schloss 2010; Bazinet and Cummings 2012). As newer sequencing technologies have increased the length of genetic fragments that can be sequenced, it has become standard practice to amplify from one end of one region to an end of another region. For example, V35 and V69, which span regions 3–5 and 6–9, respectively, are common in the literature.
Since it has become possible to sequence much larger fragments, it has become common to attach “bar code adapters” to primers. This makes it easier to multiplex samples from several different experimental treatments into single sequencing runs and then separate the data algorithmically. In theory, one could improve resolution of fingerprinting techniques by multiplexing several primers for multiple hypervariable regions, as if fingerprinting multiple fingers at the same time. However, most projects currently work with only single sets of primers. However,
very soon it will be feasible to sequence the entire 16S rDNA gene, which of course will comprise all hypervariable regions, making the choice of primers irrelevant for microbial community fingerprinting. An intriguing possibility will be to multiplex fingerprinting from multiple genes that expand analysis beyond the bacterial kingdom, for example, multiplexing 16S rDNA and 18S rDNA amplicons.
Databases of full 16S rRNA sequences exist for hundreds of thousands of microbes (Cole et al. 2007; DeSantis et al. 2006). A typical workflow searches these databases for putative homologues to amplicons. The annotations for these hits then inform likely taxonomic and functional associations (Kuczynski et al. 2010).
But modern databases have serious limitations. It is rarely possible to classify bacteria below the family level, since there are vastly more different populations than have been observed. As cultivation-independent sequencing methods grow more popular, new sequences in the databases tend to be from unclassified, and therefore unannotated, populations. Annotations in existing databases are highly biased toward pathogenic or other human-associated organisms. Very closely related genera, species, and strains can differ dramatically in their metabolic potential and preferred ecological habitats. Finally, different species vary widely in their 16S rDNA copy numbers, making it easy to confuse dosage effects and within-individual sequence variation with species abundances.
Other genetic targets may serve the same function as 16S does for microbial ecology, provided they exhibit sufficient variation, stability, and vertical inheritance. For example, the RNA polymerase β-subunit gene, rpoB, is a single-copy gene and has been recommended as an alternative to 16S rDNA. Other highly conserved housekeeping genes such as cytochrome B (cytB), those responsible for electron transport in aerobic organisms, may be more appropriate for plant studies or deep resolution of Cyanobacteria. And of course, eukaryotes and some Archaea do not have 16S ribosomal subunits, so a more appropriate gene is their small subunit analogue, the 18S rDNA gene. Currently, these alternatives do not have databases comparable to those available for 16S rDNA and have fewer useful primers.
No fingerprinting technique based on a single gene, however carefully chosen, can hope to distinguish all microbes or fully elucidate all microbial metabolic and ecological functions. Even when it becomes feasible to routinely sequence entire 16S rDNA genes from individual cells, the gene-based amplicon analysis will only produce gene genealogies rather than organismal phylogenies or full metabolic profiles. Multiplexing amplicon processing for several genes may improve phylogenetic resolution. But as it becomes feasible to sequence entire genomes for whole communities with shotgun metagenomics or single-cell genomics, it will become unnecessary to choose target amplicons at all.

Fingerprinting Techniques

Fragment-based techniques use the length of amplicon fragments as fingerprints. The spectra of these lengths indicate which microbial populations were in the original sample, assuming that there is sufficient variation in the amplicon fragments. We present the three most common fragment-based techniques here.
Temperature gradient and denaturing gradient gel electrophoresis (TGGE and DGGE) separate the DNA fragments by size using standard gel electrophoresis (Fischer and Lerman 1979). The resulting band patterns are then the community fingerprints. Presumably, more complex patterns represent more complex communities, and patterns from distinct populations contribute additively to the overall pattern so that one can decompose the community fingerprint into constituent populations.
Automated ribosomal intergenic spacer analysis (ARISA) determines the spectra of the intergenic spacer region (ITS) between small and large ribosomal subunit genes in bacteria (Fisher and Triplett 1999). The flanking genes are highly conserved, making ITS a reasonable amplicon. Moreover, the length of the ITS is highly
variable between bacterial species, so a spectrum of ITS lengths is a reasonable fingerprint.
Terminal restriction fragment length polymorphism (TRFLP) analysis binds fluorescent markers to the amplicon PCR primers before restriction, marking the restriction fragments adjacent to the primer (Schütte et al. 2008). One can then separate the labeled fragments by size, for example, in a capillary sequencer. The spectra of the lengths of these fragments are then the fingerprint for the study sample.
All three length-based fingerprinting techniques have inherent biases and limitations, and all three are still commonly used. A PubMed search on 12 July 2012 for the terms “DGGE,” “ARISA,” and “TRFLP” returned 5658, 119, and 107 hits, respectively, with several recent citations indicating current use of all three techniques.
Bioinformatics has been critical for interpreting fragment-based amplicon fingerprint data. A common approach has been to perform in silico analyses of existing databases, to determine length spectra for known sequences. This provides a kind of “reverse telephone book” with which one can translate empirical fingerprints into possible population compositions. Two typical tools for this sort of analysis, focused on TRFLP and still in heavy use, are the Microbial Community Analysis (MiCA) suite and the TRFLP Analysis Program (TAP-TRFLP) (Shyu et al. 2007; Cole et al. 2009).
Sequence-based fingerprinting techniques use the amplicon sequences themselves as fingerprints, rather than their length spectra. Current sequencing technologies, also known as next-generation sequencing, have made it feasible to sequence millions of amplicons in a single run. Different sequencing technologies vary in their sequencing accuracy, typical type of sequencing errors, and length of amplicon (Foster et al. 2012b). Consequently, the vast majority of current amplicon fingerprinting projects use amplicon sequences rather than derived data such as lengths.
Bioinformatics to analyze sequence-based fingerprints is a very active area of research. New and improved algorithms are constantly emerging for cleaning and quality control of raw data, detecting erroneous sequences (such as chimeras), aligning sequences, clustering fingerprints by similarity, searching for similar annotated sequences in existing databases, and more. Two software packages aggregate state-of-the-art algorithms and pipelines to bring the state of the art to the typical user, namely, Quantitative Insights Into Microbial Ecology (QIIME) and MOTHUR (Caporaso et al. 2010; Schloss et al. 2009). Both packages are compatible with most computing platforms and are updated regularly with the newest algorithms from the research community. Both have extensive tutorials and reference documentation. MOTHUR is open source. Both packages perform most standard diversity analyses and produce datasets that can be imported into the R statistical environment for further analysis (Beck et al. 2011).
To summarize, amplicon choice remains important to fingerprinting analyses, though fragments of the 16S rDNA gene remain the amplicon of choice for bacterial community diversity studies. Amplicon sequences are becoming the fingerprints of choice, though derived data such as length spectra for restriction fragments or interspacer regions are still widely used. Future sequencing technologies are sure to change the fingerprinting landscape significantly. Finally, amplicon fingerprinting analysis requires extensive bioinformatic support, and appropriate tools are available.

Cross-References

▶ Culture Collections in the Study of Microbial Diversity, Importance
▶ Metagenomics, Metadata, and Meta-analysis
▶ Microbial Ecology in the Age of Metagenomics: An Introduction
▶ New Computational Methodologies to Understand Microbial Diversity
▶ Next-Generation Sequencing for Metagenomic Data: Assembling and Binning
▶ Protein-coding Genes as Alternative Markers in Microbial Diversity Studies
References

Bazinet AL, Cummings MP. A comparative evaluation of sequence classification programs. BMC Bioinformatics. 2012;13(1):92. doi:10.1186/1471-2105-13-92.
Beck D, Settles M, Foster JA. OTUbase: an R infrastructure package for operational taxonomic unit data. Bioinformatics (Oxford, England). 2011;27(12):1700–1. doi:10.1093/bioinformatics/btr196.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Noah F, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7(5):335–6. doi:10.1038/nmeth.f.303.
Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, McGarrell DM, Bandela AM, Cardenas E, Garrity GM, Tiedje JM, et al. The Ribosomal Database Project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Res. 2007;35:D169–72. doi:10.1093/nar/gkl889.
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, et al. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–5. doi:10.1093/nar/gkn879.
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006;72(7):5069–72. doi:10.1128/AEM.03006-05.
Fischer SG, Lerman LS. Length-independent separation of DNA Restriction Fragments in two-dimensional gel electrophoresis. Cell. 1979;16(1):191–200.
Fisher MM, Triplett EW. Automated approach for ribosomal intergenic spacer analysis of microbial diversity and its application to freshwater bacterial communities. Appl Environ Microbiol. 1999;65(10):4630–6.
Foster JA, Moore JH, Gilbert JA, Bunge J. Microbiome studies: analytical tools and techniques. In: Russ B Altman, A Keith Dunker, Lawrence Hunter, Teri E Klein (eds), Pac Symp Biocomput. 2012a;200–2. World Scientific, Singapore.
Foster JA, Bunge J, Gilbert JA, Moore JH. Measuring the microbiome: perspectives on advances in DNA-based techniques for exploring microbial life. Brief Bioinform. 2012b. doi:10.1093/bib/bbr080.
Kim M, Morrison M, Yu Zhongtang. Evaluation of different partial 16S rRNA gene sequence regions for phylogenetic analysis of microbiomes. 2011;84(1):81–7. doi:10.1016/j.mimet.2010.10.020
Kuczynski J, Liu Z, Lozupone C, McDonald D. Microbial community resemblance methods differ in their ability to detect biologically relevant patterns. Nat Methods. 2010;7(10):813–9. doi:10.1038/nmeth.1499.
Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinformatics (Oxford, England). 2009. doi:10.1093/bioinformatics/btp157.
Schloss PD. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput Biol. 2010. doi:10.1371/journal.pcbi.1000844.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75(23):7537–41. doi:10.1128/AEM.01541-09.
Schütte UME, Abdo Z, Bent SJ, Shyu C, Williams CJ, Pierson JD, Forney LJ. Advances in the use of Terminal Restriction Fragment Length Polymorphism (T-RFLP) analysis of 16S rRNA genes to characterize microbial communities. Appl Microbiol Biotechnol. 2008;80(3):365–80. doi:10.1007/s00253-008-1565-4.
Shyu C, Soule T, Bent SJ, Foster JA, Forney LJ. MiCA: a web-based tool for the analysis of microbial communities based on terminal-restriction fragment length polymorphisms of 16S and 18S rRNA genes. J Microb Ecol. 2007;53(4):562–70.
Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: the unseen majority. Proc Natl Acad Sci U S A. 1998;95(12):6578–83.
Woese CR. A new biology for a new century. Microbiol Mol Biol Rev MMBR. 2004;68(2):173–86. doi:10.1128/MMBR.68.2.173-186.2004.
Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci USA. 1977;74(11):5088–90.

Microbial Ecology in the Age of Metagenomics: An Introduction

Jianping Xu
Department of Biology, McMaster University, Hamilton, ON, Canada

Introduction

Microbial ecology is an interdisciplinary science related to microbiology and ecology. Its investigations range from analyzing the diversity of microorganisms within and among the different ecological niches on Earth to understanding the interrelationships among microorganisms, between microorganisms and macroorganisms, and between microorganisms and their abiotic environmental factors. Microbial diversity and
the interactions between microbes and other organisms can be analyzed at morphological, structural, physiological, and/or genetic levels. The recent advances in high-throughput technologies, especially in genome sequencing, are reshaping our understanding of microbial ecology. This entry introduces the fundamental concepts and issues in microbial ecology, with a brief focus on how metagenomics tools are impacting microbial diversity studies.

Microorganisms and Microbiology

A microorganism refers to any life form that can’t be easily seen by the naked human eye. Microorganisms encompass morphologically, structurally, and phylogenetically very diverse forms of life and traditionally include both acellular life forms such as viruses and cellular life forms in all three domains, the Bacteria, Archaea, and Eukarya (Woese 1987). Organisms in Bacteria and Archaea are completely microbial. Even in Eukarya, macroorganisms such as animals and plants represent only parts of two of at least eight superkingdoms within this domain, while the remaining six or more superkingdoms are exclusively microbial (Baldauf 2003). While most microorganisms can’t be seen at all by the naked eye, for many microorganisms, certain stages of their life cycles can be easily visualized. For example, mushrooms, the sexual reproductive structure of certain groups of fungi, are a common occurrence on forest floors at certain times of the year.
Microorganisms were first seen and described by Antonie van Leeuwenhoek in 1676 when he used a microscope to examine a variety of natural and human-made objects. Subsequent developments in methodologies for growing, purifying, and studying microorganisms ushered in a golden era of microbiology, which is still going strong today. Microorganisms have now been found in virtually every habitable niche on Earth, from hot springs to salt lakes, from frozen environments in Antarctica and glaciers at the top of mountains to hydrothermal vents at the bottom of the deepest oceans. Current estimates put the number of microbial cells on Earth at around 5.0 × 10^30, about eight orders of magnitude greater than the number of stars in the observable universe. Indeed, despite their small sizes, the large number of microbial cells on Earth makes microorganisms the single largest carbon sink, larger than those of plants and animals. Their large number, broad ecological distribution, and vast diversity of metabolic pathways unparalleled by macroorganisms make microbes indispensable and central to our considerations of global geochemical cycles and environmental issues.
Most of the early methodologies for studying microorganisms are still widely used today, and many discoveries about the fundamental features of life were made using microorganisms as model systems. Among the many practical contributions of microbiology, microbiological discoveries have significantly impacted (and are continuing to impact) the control and prevention of diseases in plants, animals, and humans. However, techniques and methodologies alone were insufficient for establishing microbiology as a fledgling field of scientific investigation. Reductionist approaches and guidelines for hypothesis testing such as Koch’s postulates for identifying the causative agents of infectious diseases were pivotal for the development of microbiology. Interestingly, with the rapid developments both in high-throughput experimental tools (e.g., Xu 2014) and in bioinformatics software capable of analyzing large and diverse datasets, holistic views about microorganisms are beginning to attract significant scientific attention. Indeed, aside from the traditional subdisciplines such as microbial cell biology, biochemistry, physiology, genetics, ecology, and evolutionary biology, microbiology now also includes microbial genomics, systems microbiology, microbial community ecology, and ecosystem microbiology. In addition, the diverse subdisciplines of microbiology have become integral components of agriculture, forestry, animal husbandry, fishery, mining, environmental sciences, and medicine.
Microbial Ecology

Broadly speaking, microbial ecology is the scientific discipline that examines the relationships between microorganisms and their environments. Ecologically oriented studies of microbes were performed as soon as their existence was realized. However, the term microbial ecology came into frequent use only in the early 1960s, and its emergence as an independent field of investigation was propelled by both the awakening public interest in environmental issues and the increasing recognition of the essential roles of microbes in Earth’s geochemical cycling and in human welfare.
At present, microbial ecological investigations can be grouped into three broad types: (i) identifying the taxonomic, structural, and functional diversities of microorganisms in natural ecological niches; (ii) analyzing the relationships among microorganisms, between microorganisms and macroorganisms (plants and animals including humans), and between microorganisms and environmental factors (such as nutrients, temperature, pH, pressure, oxygen); and (iii) investigating the mechanisms that generate and maintain the diversity of microorganisms and their relationships with each other and with their biotic and abiotic factors in natural environments. Among the three types of research activities, most metagenomics studies of microbial ecology have focused on microbial diversity in natural environments.
Below is a brief introduction to metagenomics and how metagenomics approaches have shaped our understanding of microbial diversity. For the impact of metagenomics tools on the other two aspects of microbial ecology, please refer to other entries in this encyclopedia.

Metagenomics

Metagenomics refers to the field of study that analyzes genetic materials obtained directly from environmental samples. Several other terms, such as environmental genomics, ecological genomics, and community genomics, have emerged over the years to describe direct analyses of environmental DNA (Marco 2009). However, metagenomics has emerged as the favorite term and the prefix “meta-” is now used to describe the direct analyses of environmental RNA, proteins, and metabolites, corresponding respectively to meta-transcriptomics, meta-proteomics, and metabolomics (Fig. 1). Together, the direct analyses of biological molecules from natural environments constitute the field of “meta-omics” (Fig. 1).
The different subfields of meta-omics analyze complementary sets of biological molecules directly from the environments that together help provide holistic views of the natural biological communities. For example, analyses of environmental DNA samples can provide estimates of the taxonomic and genome diversities of organisms in ecological niches in nature, while the extracted RNA, protein, and metabolites provide information about the functions of the environmental genomes, including the degrees to which genes are transcribed and translated, and the types and amounts of metabolites generated in natural ecological niches. In addition, to properly analyze and integrate the diverse biological datasets, effective “meta-programs” are also needed and several such programs are currently available (de Bruijn 2011).
Because biological materials (e.g., different types of microbial cells) can be very different from each other in terms of their size, morphology, and structure, obtaining DNA (and/or RNA, protein, and metabolites) directly from environmental samples that can realistically reflect their native biological states may require extensive sample treatments. Such treatments may include sorting biological samples (including different types of cells and viral particles) based on sizes, removing materials that inhibit downstream reactions, and applying different extraction methods that permit the lysis of cells with specific types of cell walls. Once the pools of targeted biological materials are obtained, additional treatments of these materials may be needed before they are channeled into high-throughput analytical platforms. Below is a brief overview of the applications of metagenomic tools on estimates of microbial diversity.
Microbial Ecology in the Age of Metagenomics: An Introduction, Fig. 1 Legend: an overview of meta-omics: the direct analyses of biological molecules such as DNA, RNA, protein, and metabolites using high-throughput technologies. To effectively utilize such data, suites of “meta-programs” are required to analyze and integrate the diverse meta-datasets (Modified from Xu 2010)
Estimates of Microbial Genetic Diversity Using Metagenomic Data

Depending on the objectives of research, microbial diversity in the environment can be expressed as a quantitative measure using several common indices such as phylogenetic diversity, species diversity, genotype diversity, gene diversity, and nucleotide diversity. Above the species level, microbial diversity can be quantified based on evolutionary distances among the observed taxonomic groups from a specific environment. Below the species level, microbial diversity can be described using population genetic parameters such as nucleotide diversity, gene diversity, and genotype diversity. Nucleotide diversity, gene diversity, and genotype diversity refer respectively to the probability that two randomly drawn bases at a specific site of the genome, alleles of a specific gene locus, and genotypes in a population will be different (Xu 2010). At the species level, microbial diversity is measured as species diversity. There are various measures of species diversity. One commonly used measure is the probability that two randomly drawn individuals in an environment will belong to different species. This measure takes into account both the number of species (species richness) and the frequency of each species (species abundance) in the environment. Conceptually, this measure of species diversity is similar to those used for nucleotide diversity, gene diversity, and genotype diversity.
Microbial species diversity is among the most commonly analyzed and compared in microbial ecological studies. The earliest and still one of the most common metagenomics methods for estimating species diversity of prokaryotes (including both Bacteria and Archaea) in natural environments is the direct analysis of sequence variation at the 16S ribosomal RNA gene
(Pace et al. 1985). These analyses may involve the polymerase chain reaction (PCR), denaturing gradient gel electrophoresis (DGGE), cloning, and sequencing. A broadly accepted criterion to delineate prokaryote species is that two strains belong to the same species if their 16S rRNA genes show at least 97 % sequence similarity (de Bruijn 2011). In eukaryotic microbes such as fungi, a similar criterion (97 % sequence similarity) is often used, albeit for a different DNA fragment, the internal transcribed spacer (ITS) regions of the ribosomal RNA gene cluster (Schoch et al. 2012). However, in more recent analyses, direct sequencing of extracted environmental DNA using NGS technologies is increasingly used. These analyses suggest that the cultured microbes from most ecological niches represent <1 % of the true microbial species richness in their respective niches and that many of these uncultured microbes belong to distinct and previously unknown phylogenetic groups (de Bruijn 2011). Metagenomic analyses, especially those based on NGS technologies (Xu 2014), have generated very large datasets from environments including the human body (e.g., the human microbiome initiative; http://nihroadmap.nih.gov/hmp/) and the oceans (the Global Ocean Sampling surveys; http://www.jcvi.org/cms/research/projects/gos/overview/). Scientists from many countries participate in these large-scale projects.
The species diversity studies based on DNA sequences at the 16S rRNA gene are increasingly complemented by other types of data that augment our understanding of microbial diversity in natural environments. One type of such data is genetic variation among strains within a species. With high-throughput DNA sequencing, genetic variants of a gene fragment from different strains of the same species in the same ecological niche can be reliably identified (de Bruijn 2011). With sufficient genome coverage, it’s also possible to uncover genome variants. Such information allows direct comparisons of gene frequencies and genotype frequencies among microbial populations from diverse ecological niches, including the inferences of the modes of reproduction in nature (Xu 2010). The second type of complementary data is the messenger RNA sequences obtained from environmental samples. In combination with DNA sequence data, the mRNA data allow inferences of the potential physiological activities of the different groups of microorganisms in natural environments (de Bruijn 2011).

Summary

This entry serves as an introduction to microorganisms, microbiology, microbial ecology, and metagenomics. The impact of metagenomics on estimates of microbial diversity was briefly discussed. With the increasing application of high-throughput technologies in analyzing biological materials (DNA, RNA, proteins, and metabolites) directly from environments, the future of microbial ecology is looking brighter than ever.

Cross-References

▶ Microbial Diversity, Bar-Coding Approaches

References

Baldauf SL. The deep roots of eukaryotes. Science. 2003;300:1703–6.
de Bruijn F. Handbook of molecular microbial ecology I: metagenomics and complementary approaches. New Jersey: Wiley/Blackwell; 2011. p. 113–22.
Marco D. Metagenomics: theory, methods and applications. Norfolk: Caister Academic Press; 2009.
Pace NR, Stahl DA, Olsen GJ, Lane DJ. Analyzing natural microbial populations by rRNA sequences. ASME News. 1985;51:4–12.
Schoch CL*, The Fungal Barcode Consortium (one of 100 collaborators). Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc Natl Acad Sci U S A. 2012;109:6241–6.
Woese CR. Bacterial evolution. Microbiol Rev. 1987;51:221–71.
Xu J. Microbial population genetics. Norfolk: Caister Academic Press; 2010.
Xu J. Next-generation sequencing: technologies and applications. Norfolk: Caister Academic Press; 2014.
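
As an illustration of the species-diversity measure described in the entry above (the probability that two randomly drawn individuals in an environment belong to different species), the following Python sketch computes that probability from a list of species labels, alongside species richness. It is a minimal sketch, not a method taken from the entry: the function names, the choice between sampling with and without replacement, and the toy samples are all assumptions made for illustration.

```python
from collections import Counter


def species_richness(labels):
    """Number of distinct species labels observed in the sample."""
    return len(set(labels))


def pairwise_difference_diversity(labels, with_replacement=True):
    """Probability that two randomly drawn individuals are different species.

    `labels` holds one species label per sampled individual (for example,
    one label per classified 16S rRNA read). Both the function name and the
    `with_replacement` switch are hypothetical, chosen for this sketch.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two individuals")
    if with_replacement:
        # 1 - sum(p_i^2), where p_i is the relative abundance of species i
        return 1.0 - sum((c / n) ** 2 for c in counts.values())
    # Drawing without replacement: 1 - sum(n_i * (n_i - 1)) / (n * (n - 1))
    same = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same / (n * (n - 1))


if __name__ == "__main__":
    # Two toy communities with the same richness but different abundances.
    even = ["sp_A", "sp_B", "sp_C", "sp_D"] * 25
    skewed = ["sp_A"] * 97 + ["sp_B", "sp_C", "sp_D"]
    for name, sample in (("even", even), ("skewed", skewed)):
        print(name, species_richness(sample),
              round(pairwise_difference_diversity(sample), 4))
```

The two toy communities have the same richness (four species) but very different diversity values, which is the point made in the text: the measure reflects both the number of species and the frequency of each. The same calculation applies to nucleotide, gene, or genotype diversity if the labels are bases, alleles, or genotypes instead of species.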
Microbial Ecosystems, Protection of

Paul L. E. Bodelier
Netherlands Institute of Ecology (NIOO-KNAW), Wageningen, Netherlands

Synonyms

Conservation of microbial diversity and ecosystem functions provided by microbes; Preservation of microbial diversity and ecosystem functions provided by microbes

Definition

The use, management, and conservation of ecosystems in order to preserve microbial diversity and functioning.

Introduction

Ecosystems collectively determine biogeochemical processes that regulate the Earth system. Loss of biodiversity is generally regarded as detrimental to ecosystems and ecosystem functioning and therefore has been a central issue for environmental scientists during the last decades (Hooper et al. 2012). Microorganisms (i.e., bacteria, archaea, protozoa, and fungi) comprise a major part of the total biomass of organisms inhabiting Earth and represent the largest source of biodiversity. They play critical roles in biogeochemical processes and ecosystem functioning and are fundamental to many ecosystem services (e.g., soil health, wastewater treatment, nutrient recycling, human health, carbon sequestration, etc.) (see Table 1) (Ducklow 2008). Considering the challenges we are facing with overexploitation of the planet, climate change, pandemics, increasing demands in food production, and need for renewable energy and resources, it is remarkable that microbes and their diversity are absent in the ongoing debates about global biodiversity loss and conservation policy, despite various pleas to do so (Cockell and Jones 2009). Biodiversity-ecosystem function (BEF) research inherently requires the investigation of the relationship between species assemblies and ecosystem processes, a link which is difficult to make with microbes. High diversity, rapid generation times, high adaptability due to genome rearrangements, and ubiquitous distribution have led to the notion that microbial communities are highly redundant and omnipresent and therefore inextinguishable. However, the latter is a misconception driven by a number of gaps in our understanding of the functioning of microbial communities and the relevance of microbial diversity in ecosystem functioning.

Knowledge Gaps in Understanding Microbial BEF

Definition of Species
Considering the IUCN Red List of species and the associated criteria to get on this list (http://www.iucn.org/), it is quite obvious that microbes have not made it onto the list yet. A species is a fundamental unit of biological organization, but its relevance for microbes is debated. The inability to define taxonomic units equivalent to animal and plant species is also one of the most fundamental problems hampering the study of the BEF matter in microbial communities. The first problem is that the isolation and cultivation of microbes in order to assess their geno- and phenotype has led to the description of only 7,000 species whereas DNA-based methods have identified more than 100 prokaryotic phyla to be present in ecosystems (Pace 2009). Hence, only approx. 1 % of the actual microbial biodiversity is represented as cultured organisms while the characteristics and functions of the remaining 99 % are unknown. Next to this, bacterial taxonomy employs universal thresholds of DNA-sequence difference to help demarcate species. However, the sequence-identity cutoff value used to demarcate species has led to “species” that are enormously diverse in their genome content, physiology, and ecology. Hence, what is
Microbial Ecosystems, Protection of, Table 1 Major groups of microbes and ecosystem services they provide. The last column depicts the ecosystem service category as was defined in the Millennium Ecosystem Assessment 2005

Microbial group | Process | Ecosystem service | Ecosystem service category
Heterotrophic bacteria/Archaea | Organic matter breakdown, mineralization | Decomposition, nutrient recycling, climate regulation, water purification | Supporting and regulating
Photoautotrophic bacteria | Photosynthesis | Primary production, carbon sequestration | Supporting and regulating
Chemo(litho)autotrophic | Specific elemental transformations (e.g., NH4+, S2-, Fe2+, CH4 oxidation) | Nutrient recycling, climate regulation, water purification | Supporting and regulating
Unicellular phytoplankton | Photosynthesis | Primary production, carbon sequestration | Supporting and regulating
Archaea | Specific elemental transformation (e.g., metals, CH4 formation, NH4+ oxidation), often in extreme habitats | Nutrient recycling, climate regulation, carbon sequestration | Supporting and regulating
Protozoa | Mineralization of other microbes | Decomposition, nutrient recycling, soil formation | Supporting
Fungi | Organic matter breakdown and mineralization | Decomposition, nutrient recycling, soil formation, primary production (i.e., mycorrhizal fungi) | Supporting
Viruses | Lysis of hosts | Nutrient recycling | Supporting
All | Production of metabolites (e.g., antibiotics, polymers), degradation of xenobiotics, genetic transformation and rearrangement | Production of precursors to industrial and pharmaceutical products | Provisioning
All | Huge diversity, versatility, environmental and biotechnological applications | Educational purposes, getting students interested in science | Cultural

From Bodelier 2011
regarded as a “species” in microbiology would definitely not be comparable to species in macroecology (animals and plants), and commonly the term “operational taxonomic units” is used in microbiology. The situation will improve due to better cultivation methods and insights, resulting in increased coverage of phylogenetic lineages with cultured representatives. Next to this, metagenomic, metaproteomic, and even single-cell genomic techniques enable the characterization of functions of not-yet-cultivated organisms in their environment (Raes and Bork 2008). These novel techniques will facilitate the development and application of novel concepts in environmental microbiology which may bridge the gap with macroecology, bypassing the species hang-up in order to develop generic concepts and theories in microbial ecology. Recently, the concept of ecological coherence of taxa higher than the species level was put forward, which suggests that deeper clades of various ranks may be used as alternative ecologically meaningful units in microbial ecology (Philippot et al. 2010). With the vast amount of metagenomic data available from an increasing variety of environments and the advent of comparative genomics, the field of microbial ecology is undergoing a paradigm shift away from taxa-oriented concepts of community analysis that have been inherited from macroorganism ecology toward trait-centered and/or systems biology-oriented approaches in which functional units (protein-coding genes, enzymes, metabolites) are the key components of the overall ecosystem (Green et al. 2008). Using functional traits and environmental gradients can bring general
patterns into community ecology, and a trait-centered perspective would be a tractable way for microbial ecology to address the significance of microbial diversity for ecosystem functioning. Considering the fact that in plant sciences BEF studies are also incorporating traits rather than species richness only, the trait-centered approach may offer options for convergence of macro- and microbial ecology which will be essential for including microbes in conservation policy.

Lack of Microbial Biogeography?
The conventional view of the distribution of microbial species through space and time has been dominated for decades by the “Baas-Becking” hypothesis “everything is everywhere, but the environment selects.” According to this hypothesis, the lack of dispersal limitations of microorganisms would ensure a global distribution, while local deterministic factors would determine the relative abundance of “latent” and “flourishing” species. This view is in sharp contrast with plants and animals which show clear taxa-area relationships and biogeography. The Baas-Becking legacy is likely one of the main reasons why microbial diversity is not on the biodiversity-conservation agenda. However, a number of studies in the last decade have demonstrated species-area relationships, biogeography, and spatial patterns at various scales for microbes (see Zhou et al. 2008). Next to this, microbial endemism has been reported as well, while studies using high-throughput sequencing technology clearly demonstrated the presence of habitat-specific communities shaped by edaphic factors and historical contingencies. A meta-analysis of all currently available 16S rRNA gene sequences revealed clear environmental distributions on the genus or species level with soil and freshwater as least selective habitats, while marine, animal, and thermal habitats were the most selective (Tamames et al. 2010). The emerging pattern in microbial biogeography studies is definitely that not all microbial communities occur everywhere and that local conditions can lead to unique associations of microbes. However, whether microbes obey the same distribution and community assembly rules as macroorganisms can only be answered when it is possible to study complete microbial populations at ecologically relevant scales.

Inability to Link Species Diversity to Function
Connecting individual microbial species to the biogeochemical processes they catalyze is a prerequisite for assessing BEF relationships in microbial communities. However, considering the lack of a species concept, the metabolic versatility, the large number of unknown species, and the scale issue involved, this is the central problem area in the field of environmental microbiology. The majority of studies in the literature have relied on correlating changes in activity to changes in community composition or diversity, and only a few articles can actually show a causal relationship. A myriad of techniques have been developed for linking diversity and function (see Wagner 2009). However, many of these techniques were based on the analyses of ribosomal RNA or mRNA transcripts of functional genes, indicating only the potential to be involved in specific processes. The use of stable isotope probing (SIP) has brought about a major breakthrough in environmental microbiology (see Murrell and Whiteley 2011). The general approach is that stable isotopically (13C/15N) labeled substrates are incorporated into taxonomically relevant molecules (RNA/DNA, lipids, proteins). Only the microbes which have actively been incorporating the stable isotopes are detected when analyzing RNA/DNA or PLFA using GC-IRMS (gas chromatography-isotope ratio mass spectrometry) or proteins using GC-MS or LC-MS (liquid chromatography-mass spectrometry). The major disadvantages of SIP are the use of unnaturally high substrate concentrations in case of DNA- and RNA-based SIP, the different label uptake rates per species, and cross feeding. More recent work has addressed the shortcomings of traditional SIP studies by using magnetic bead capturing of mRNA, Raman spectroscopy, and NanoSIMS (nanoscale secondary ion mass spectrometry) (see Murrell and Whiteley 2011), also in combination with metagenomic techniques, uncovering active species of which no cultured representatives are available or discovering
unknown pathways or genes involved in biogeochemical processes (see Chen and Murrell 2010). The most recent addition to the SIP repertoire combined microarray detection and NanoSIMS, attaining low label incorporation levels and high phylogenetic resolution without PCR amplification of the target community (Mayali et al. 2012). The challenge in applying SIP-based techniques will be in BEF experiments, where experimental designs allowing for causal and mechanistic conclusions require high sample throughput.

Resistance, Resilience, and Redundancy of Microbial Communities
The absence of microbial diversity in the BEF debate, conservation issues, and global biogeochemical process models is also caused by the paradigm of microbial omnipresence, high adaptability, and functional redundancy. Indeed, resilience after reduced diversity and redundancy of species carrying out similar functions has been demonstrated (see Bodelier 2011). But is this the rule? A number of studies have demonstrated a direct relationship between diversity and ecosystem process rate (see Bodelier 2011). Recently, a comprehensive meta-analysis demonstrated that out of 110 studies, more than 70 % demonstrate that microbial community composition was not resistant (resistance being the degree to which community composition remains unchanged when disturbed) against disturbances (fertilization, CO2 increase, temperature, carbon amendment) (Allison and Martiny 2008). This held true for broad taxonomic groups (fungi, bacteria, archaea) as well as narrow groups with specific functions (methane oxidation, nitrification). The same study demonstrated that the resilience (i.e., the rate at which microbial community composition returns to its original composition after being disturbed) is on the order of years. Fertilization even led to differences in communities of N-cycling microbes (nitrifiers, denitrifiers) for more than 50 years (Hallin et al. 2009). Similar long-lasting effects have also been observed for methane-consuming microbes. Microbes consuming atmospheric methane are responsible for 6–10 % of global methane consumption. The process is sensitive to agricultural practices, and recovery after land abandonment can take decades, which coincides with an increase in diversity of these microbes (Fig. 1; Levine et al. 2011).
Aspects of community composition other than richness per se have been demonstrated to regulate the stability of biogeochemical processes. The initial evenness of redundant community members was demonstrated to be important in resistance to salt stress in denitrifying communities (Wittebolle et al. 2009), indicating that relative abundance of the populations in a community is an important determinative factor for process stability, even in redundant communities. Functional redundancy sensu stricto is difficult to assess in microbial communities, since it requires the contribution of individual community members to processes and separation between diversity and environmental factors. The stability of a particular function (e.g., methane conversion) in time is very likely affected by more properties or traits of species than the expression of that one particular functional gene only, e.g., response to inhibitors or general adaptation of species to a particular environment. Moreover, populations of interacting microbes on microbially relevant scales may not consist of many different species also due to spatial arrangement or isolation, e.g., along roots, soil pores, plant leaves, biofilms, or microbial flocs in sewage treatment. The growing body of experimental evidence suggests that microbial communities can be sensitive to disturbances and that resilience is linked to diversity. However, the majority of studies are descriptive, correlative, or strongly reductionist in nature, not allowing for causal or mechanistic conclusions.

Closing the Gaps

It is obvious that the omission of microbial communities from the BEF debate and in the managing and conservation of ecosystems is due to a lack of understanding of the functioning and composition of environmental microbial communities. The controversy between huge diversity and redundancy on the one hand and the lack of knowledge on 99 % of that diversity on the other hand leads to the fact that we do not know what we have to protect and what might have been lost
already. This controversy hampers the examination of the importance of microbial diversity for ecosystem functioning. Consequently, BEF studies in environmental microbiology are largely of descriptive nature and disconnected from ecological concepts. Approaches have been “top-down” or “bottom-up,” treating species/genotypes, community traits, and interactions as a “black box” (see Bodelier 2011). However, the rapid methodological developments of the last decades are narrowing down the limitations which kept environmental microbiology at the descriptive level. The “omic” techniques enable studying community ecology and physiology of known as well as unknown microbial species, and a systems biology approach for microbial communities is not out of reach (Raes and Bork 2008). In situ adaptation of community members as well as in situ profiling of whole genome transcripts and proteins of individual species is feasible. Next to this, methodology and concepts are emerging, enabling individual-based physiology and ecology and even interactions on microbially relevant scales. Theoretical and conceptual approaches from macroecology are being applied to understand microbial community structure and to link it to ecosystem processes (Bodelier 2011). Ultrahigh-throughput community assessment methods will facilitate processing of large numbers of samples and replicates in order to obtain sufficient information allowing for experimental designs which yield mechanistic understanding of environmental microbial communities, eventually leading to the opening of the “black box.”

Microbial Ecosystems, Protection of, Fig. 1 The recovery of methanotroph diversity and atmospheric CH4 consumption following row-crop agriculture. Increase in methanotroph diversity (open symbols) and CH4 consumption (closed symbols) as a function of time since cessation of agriculture. The data clearly show that agricultural use diminishes methanotrophic diversity as well as function and that it can take decades before recovery takes place. All measurements (diversity and consumption) are annual averages with error bars representing standard errors. Land-use treatments are as follows: agricultural management of historically tilled lands (AG), early successional fields abandoned from agriculture in 1989 (ES), successional forests abandoned from agriculture in the 1950s (SF), managed grasslands on never-tilled soil (MG), and deciduous forests (DF) (From Levine et al. 2011, with permission)

Microbial Community Conservation

The fact that there are no microbial species on the Red List nor are microbial communities in nature conservation policy does not mean that there are no initiatives toward conservation of microbial
communities. From the medical as well as biotechnological perspective, there is a need for the preservation of microbial genetic diversity which is mainly done by storing isolated and described microbial species in public culture collections (e.g., ATCC (http://www.lgcstandards-atcc.org/), DSMZ (http://www.dsmz.de/), and NCIMB (http://www.ncimb.com/)). However, since most of the diversity is represented in uncultured and not characterized microbes as part of environmental communities, we run the risk of losing genetic diversity of which we do not know its value yet, in itself a good reason for conservation. Well-known examples of biotechnological spin-off of environmental microbial communities have led to conservation efforts. The discovery of the heat-resistant Taq polymerase enzyme, used in PCR reactions, in the bacterium Thermus aquaticus (http://en.wikipedia.org/wiki/Thermus_aquaticus) in hot springs in Yellowstone National Park, has led to declaring these hot springs as conservation targets in order to preserve the microbial genetic potential mainly for biotechnological applications (http://serc.carleton.edu/microbelife/topics/bioprospecting.html), thereby making these hot springs the first environmental microbial conservation areas. Another development that contributes substantially to the “protection” of environmental microbial communities is the TEEB (The Economics of Ecosystems and Biodiversity) initiative which expresses the value of ecosystems, ecosystem services, and biodiversity in monetary values (http://www.teebweb.org/). Although this valuing of ecosystems is controversial and anthropogenically centered, it definitely created awareness for biodiversity among policy makers, politicians, and industry. The assessment of Earth ecosystems, biodiversity, and ecosystem services by 1,300 experts (Millennium 2005) identified key areas of ecosystem protection and conservation in order to keep our planet habitable. In all of these ecosystems, microbes play pivotal roles, a fact which is generally being recognized. Especially soils are a main focus when it comes to microbial processes because of the many ecosystem services soils and soil microbes provide and because of the fact that soils harbor the largest source of microbial biodiversity. It is within soil conservation that many initiatives are taken toward conservation of soil biodiversity, like the EU soil framework directive in development (http://ec.europa.eu/environment/soil/biodiversity.htm) where also microbes are explicitly taken into account (Gardi et al. 2009). Combined with the already existing EU habitat conservation legislation (http://ec.europa.eu/environment/nature/natura2000/index_en.htm), important habitats containing a large part of microbial diversity on Earth are conserved. However, we still need to know what it is that needs to be preserved and what we can potentially lose or affect by climate change, habitat destruction, land-use change, urbanization, etc. This requires inventories of microbial diversity and functioning. Despite the serious limitations in methods to assess the sheer endless microbial diversity, there are a number of initiatives going on that come as close as possible. The Earth Microbiome Project (http://www.earthmicrobiome.org/) is an initiative to assess functional microbial diversity in more than 200,000 environmental samples which will be collected and analyzed in a coordinated way and will be complemented with essential metadata which can be used to infer ecological or biogeographical aspects of the communities in the database. A similar initiative has already been in place for a number of years focusing on marine microbial communities (http://www.coml.org/international-census-marine-microbes-icomm), while the TerraGenome project specifically focuses on soils (http://www.terragenome.org/). Hence, many steps on the “roadmap toward microbial conservation,” as put forward by Cockell (Cockell and Jones 2009) a number of years ago, have been taken. Projects attempting to make microbial diversity inventories are initiated and scientific approaches to link microbial species to ecosystem functions are being developed. Nevertheless, “the Red List” species approach will definitely not be applicable to microbes as already pointed out above. Hence, we need different approaches and concepts regarding “conservation units” for microbial
communities which are useful and understandable for policy makers and politicians. Habitat conservation is a good starting point, but probably we can also put forward “vulnerable” nonredundant environmental microbes which are carrying out important ecosystem functions like methane oxidizers which may be affected by anthropogenic disturbance, diminishing their functioning in the environment for decades (see Fig. 1). Next to this, educating the public, policy makers, and politicians on the importance and sheer uniqueness of microbes and microbial communities will be of utmost importance in the process of getting microbes on the conservation agenda. If not protecting them for their valuable functions, we should do it for the sake of ethics (Cockell 2011).

Summary

Despite the eminent role microbes and microbial communities play in all ecosystems on Earth, they are not considered in conservation policy or legislation. This is due to an utter lack of fundamental knowledge on crucial issues concerning environmental microbial communities. The species-oriented approach in conservation biology is not amenable to microbes where there are difficulties in defining species and where more than 99 % of all species present in the environment are not known. Next to this, we have no idea what the importance is of microbial diversity for ecosystem functioning because of the lack of methodology to do so. The most important problem is probably the notion that microbes are so abundant, diverse, and resilient that they are not threatened by extinction. However, rapid developments in the field of environmental microbiology, mainly in the application of genomic and isotopic techniques, have revolutionized our knowledge and demonstrate that microbes display biogeography and are sensitive to environmental disturbance and that for a number of environmentally relevant processes, community composition is linked to ecosystem functioning. Hence, microbes are not “untouchable” and omnipresent, but in order to get them onto the conservation agenda, we have to be able to assess which microbes are present in environments in order to be able to monitor changes with possible consequences for ecosystem functions. There are many initiatives underway seeking to make inventories of functional diversity of microbial communities in marine, terrestrial, and freshwater habitats. This knowledge will facilitate assessing impacts and consequences of anthropogenic disturbances on microbial communities and their functioning in the future and pave the way for the protection of environmental microbial communities. For the time being, we have to rely on habitat conservation guidelines and legislation to ensure maintenance of microbial communities.

References

Allison SD, Martiny JBH. Resistance, resilience, and redundancy in microbial communities. Proc Natl Acad Sci U S A. 2008;105:11512–9.
Bodelier PLE. Toward understanding, managing, and protecting microbial ecosystems. Front Microbiol. 2011;2(80).
Chen Y, Murrell JC. When metagenomics meets stable-isotope probing: progress and perspectives. Trends Microbiol. 2010;18(4):157–63.
Cockell CS. Microbial rights? EMBO Rep. 2011;12(3):181.
Cockell CS, Jones HL. Advancing the case for microbial conservation. Oryx. 2009;43(4):520–6.
Ducklow H. Microbial services: challenges for microbial ecologists in a changing world. Aquat Microb Ecol. 2008;53(1):13–9.
Gardi C, et al. Soil biodiversity monitoring in Europe: ongoing activities and challenges. Eur J Soil Sci. 2009;60(5):807–19.
Green JL, Bohannan BJM, Whitaker RJ. Microbial biogeography: from taxonomy to traits. Science. 2008;320(5879):1039–43.
Hallin S, et al. Relationship between N-cycling communities and ecosystem functioning in a 50-year-old fertilization experiment. ISME J. 2009;3(5):597–605.
Hooper DU, Adair EC, Cardinale BJ, Byrnes JEK, Hungate BA, Matulich KL, Gonzalez A, Duffy JE, Gamfeldt L, O’Connor MI. A global synthesis reveals biodiversity loss as a major driver of ecosystem change. Nature. 2012. doi:10.1038/nature11118.
Levine UY, et al. Agriculture’s impact on microbial diversity and associated fluxes of carbon dioxide and methane. ISME J. 2011;5(10):1683–91.
Mayali X, Weber PK, Brodie EL, Mabery S, Hoeprich PD, Pett-Ridge J. High-throughput isotopic analysis of RNA microarrays to quantify microbial resource use. ISME J. 2012;6:1210–21.
Mining Metagenomic Datasets for Antibiotic Resistance Genes 487 M
Millennium Ecosystem Assessment 2005. Ecosystems and human well-being: general synthesis. United Nations. www.millenniumassessment.org/en/synthesis.aspx
Murrell JC, Whiteley AS, editors. Stable isotope probing and related technologies. American Society for Microbiology (ASM); Washington DC, 2011.
Pace NR. Mapping the tree of life: progress and prospects. Microbiol Mol Biol Rev. 2009;73(4):565–76.
Philippot L, Andersson SGE, Battin TJ, Prosser JI, Schimel JP, Whitman WB, Hallin S. The ecological coherence of higher bacterial taxonomic ranks. Nat Rev Microbiol. 2010;8:523–9.
Raes J, Bork P. Systems microbiology – timeline – molecular eco-systems biology: towards an understanding of community function. Nat Rev Microbiol. 2008;6(9):693–9.
Tamames J, et al. Environmental distribution of prokaryotic taxa. BMC Microbiol. 2010;10.
Wagner M. Single-cell ecophysiology of microbes as revealed by Raman microspectroscopy or secondary ion mass spectrometry imaging. Annu Rev Microbiol. 2009;63:411–29.
Wittebolle L, et al. Initial community evenness favours functionality under selective stress. Nature. 2009;458(7238):623–6.
Zhou JZ, et al. Spatial scaling of functional gene diversity across various microbial taxa. Proc Natl Acad Sci U S A. 2008;105(22):7768–73.

Mining Metagenomic Datasets for Antibiotic Resistance Genes

Lisa Durso
Agroecosystem Management Research Unit, US Department of Agriculture, University of Nebraska, Lincoln-East Campus, Lincoln, NE, USA

Synonyms

Anthropogenic and human associated; Horizontal gene transfer and lateral gene transfer; Whole-genome sequencing and metagenomic sequencing

Definition

Metagenomics refers to samples in which the entire bacterial or microbial community DNA is used and to high-throughput DNA sequencing of microbial community DNA. These sample types and methods can be used to gather information on genes that code for antibiotic resistance.

Introduction

Antibiotics are medicines that are used to kill, slow down, or prevent the growth of susceptible bacteria. They became widely used in the mid-twentieth century for controlling disease in humans, animals, and plants and for a variety of industrial purposes. Antibiotic resistance is a broad term. Depending on the classification scheme used, there are between eight and twenty different classes of antibiotics, with multiple compounds in each class. These different categories represent different basic chemical structures and modes of action; some antibiotics will inhibit cell wall synthesis, for example, while others target portions of the ribosome and a cell’s protein processing machinery. Just as there are many types of antibiotics, there are also many types of antibiotic resistance. Some types of resistance are specific for an individual antibiotic, while others, such as multidrug resistance efflux pumps, can confer resistance to multiple different kinds of antibiotics. It is also likely that there are naturally occurring antibiotics that have yet to be described.

Antibiotic resistance is a normal and natural phenomenon that can be documented even in ancient (permafrost from 30,000 years ago) and pristine habitats such as Antarctica and the Sargasso Sea (Allen et al. 2009; D’Costa et al. 2011; Durso et al. 2012). In addition to naturally occurring antibiotic resistance, there is no doubt that anthropogenic or human-associated use of antibiotics for health, food production, veterinary, and industrial purposes has dramatically impacted resistance. The continued emergence of antibiotic-resistant, opportunistic, and pathogenic infections in health-care settings has become a major public health concern, especially the emergence of bacteria that are resistant to multiple antibiotics or multiple classes of antibiotics. Yet few details are known about how
antibiotic resistance genes move through environmental, agricultural, and clinical settings. Metagenomics provides one tool to start characterizing antibiotic resistance genes across habitats.

The term “metagenomic” has multiple meanings. Historically it was used to describe the kind of sample that was collected and referred to collecting DNA or genomic information not just from a single organism or isolate but from a whole community, a metagenome, consisting of both cultured and uncultured organisms (metagenomic samples). More recently, the term metagenomic has come to describe a specific type of analysis that relies on high-throughput nucleic acid sequencing of either 16S rDNA or whole-community DNA (metagenomic sequencing). In addition to providing metagenomic sequencing information, the new high-throughput sequencing methods can be used to profile whole-community RNA (metatranscriptome) and whole-community protein (metaproteome). This entry will examine studies using both metagenomic samples and the use of metagenomic sequencing to gather information on functional genes that code for antibiotic resistance. Although the focus here will be on mining metagenomic data for information on antibiotic resistance genes, it is acknowledged that functional and gene-based metagenomic studies complement experiments involving gene expression, protein production, and phenotypic characterization of individual and community resistance.

The Antibiotic Resistome

The concept of an antibiotic “resistome” was first proposed in 2006 by D’Costa et al. to describe the sum total of all antibiotic resistance genes across the globe and all genetic elements that could give rise to antibiotic resistance genes (D’Costa et al. 2006). It includes genes carried by pathogenic bacteria that cause illness, as well as by opportunistic and nonpathogenic bacteria. This concept provides a framework that unites antibiotic resistance in human, animal, and plant clinical applications with the broader pool of antibiotic-resistant bacteria present in the environment, including food, water, and soil. D’Costa et al. (2006) cultured spore-forming bacteria from soil and screened them against 21 antibiotics, including both old and new antibiotics and naturally occurring and synthetic antibiotics. Based on their results, they identified the soil as a reservoir of antibiotic resistance genes and proposed the idea of a pan-microbial resistome. Contrary to the general public perception that use of antibiotics in human medicine and agriculture is the root cause of antibiotic resistance, the antibiotic resistome hypothesis supports the idea of a naturally occurring global pool of antibiotic resistance and suggests that the environment (especially soil) serves as a reservoir of antibiotic resistance elements. In this model antibiotic resistance elements can be enriched and selected for by anthropogenic antibiotic use. However, unlike previous models, the concept of the antibiotic resistome expands the focus from the selection of pathogens via the direct use of antibiotics in clinical settings to a global pool of antibiotic resistance that can potentially be transferred from harmless bacteria into human, animal, and plant pathogens. Later work by the same group (Wright 2007; D’Costa et al. 2011) as well as others (Riesenfeld et al. 2004; Henriques et al. 2006; Aminov and Mackie 2007; Mori et al. 2008) provides supporting evidence for the natural occurrence of antibiotic resistance, especially in soil, and the global distribution of antibiotic resistance genes.

Conceptually, the relationship between increased anthropogenic use of antibiotics and increases in the number and types of antibiotic-resistant bacteria and antibiotic resistance genes is clear. On a practical level, many of the details regarding the ecology of antibiotic resistance and antibiotic resistance genes in the environment remain unknown. These include fate and transport of naturally occurring and anthropogenically induced antibiotic resistance genes within and between environmental, agricultural, and clinical settings as well as rates of gene transfer, rates of gene expression, and impact of naturally occurring and anthropogenically introduced antibiotic concentrations on short- and long-term microbial community structure.
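As a rough data-structure sketch of the resistome idea described above, the gene pool can be pictured as the union of the resistance genes found in each reservoir, and the genes a clinical setting shares with other reservoirs as a set intersection. The reservoir names, the gene labels (`gene_A` and so on), and the helper functions are hypothetical placeholders chosen for illustration; they are not data from the studies cited in this entry, and the sketch ignores the precursor genetic elements that the resistome concept also covers.

```python
# Hypothetical reservoirs and placeholder gene labels; none of these
# assignments come from the studies cited in this entry.
reservoirs = {
    "soil": {"gene_A", "gene_B", "gene_C"},
    "agricultural": {"gene_B", "gene_D"},
    "clinical": {"gene_B", "gene_C", "gene_E"},
}


def resistome(pools):
    """Return the union of resistance genes over every reservoir."""
    genes = set()
    for pool in pools.values():
        genes |= pool
    return genes


def shared_with(pools, target):
    """Return genes in the target reservoir that also occur elsewhere."""
    others = set()
    for name, pool in pools.items():
        if name != target:
            others |= pool
    return pools[target] & others


all_genes = resistome(reservoirs)
clinical_shared = shared_with(reservoirs, "clinical")
print(sorted(all_genes))
print(sorted(clinical_shared))
```

In this toy layout, the genes returned by `shared_with` are the ones whose presence in more than one setting would prompt the questions about transfer and transport raised above; the sketch says nothing about the direction or rate of any transfer.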
Antibiotic Resistance Genes

The genes that code for antibiotic resistance are carried either as part of the regular bacterial chromosome, which is passed vertically to individual daughter cells, or as part of mobile genetic elements such as plasmids and transposons which can be transferred both vertically to daughter cells and horizontally to other strains or species of bacteria. These antibiotic resistance genes, sometimes called antibiotic resistance determinants or antibiotic resistance elements, code for a variety of different kinds of proteins involved in inactivating the antibiotic, removing the antibiotic from the cell, or modifying the target of the antibiotic so that it is not recognized by the drug. For any specific antibiotic, there may be multiple types of resistance mechanisms. Many of these mechanisms are complex and require the coordination of a suite of genes, so that for any individual antibiotic, there are multiple different antibiotic resistance genes.

There are two basic approaches to mining metagenomic datasets for antibiotic resistance genes: those that are database dependent and those that are discovery driven. The database-dependent systems are good for comparative studies that screen large numbers of samples or large numbers of genes and examine similarities or differences in antibiotic resistance gene patterns across samples. These methods rely on previously sequenced antibiotic resistance genes to provide the information used to design primers or to provide a list against which new sequences are compared. The limitation of database-dependent methods is that a particular gene must already have been sequenced in order to be in the database, and researchers can only screen against genes that have already been discovered, characterized, and deposited in the database. Discovery-driven methods, while time-consuming and low-throughput, can be used to describe novel antibiotic resistance genes. In this approach, DNA fragments from metagenomic samples are cloned into hosts such as E. coli, or constructs such as bacterial artificial chromosomes (BACs), and then screened for a particular phenotype. The collection of DNA fragments in the new host is called a library. In the case of antibiotic resistance, the clone or BAC libraries are plated onto media containing a specific amount of antibiotic. If they grow in the presence of the antibiotic, they are considered resistant. If they do not grow, they are considered sensitive. In human medicine and clinical settings, there are well-defined standard methods that specify, by organism and antibiotic, the concentration needed to be considered resistant. In environmental and experimental settings, these standards do not exist, and there is no consistent definition across studies.

Studying Antibiotic Resistance Genes from Metagenomic Samples

Metagenomic samples can be mined for known as well as uncharacterized antibiotic resistance genes using functional screening of metagenomic clone libraries. After creating the libraries, clones are plated onto media containing the antibiotic of interest. Colonies that grow in the presence of the antibiotic are assumed to be carrying an antibiotic resistance gene from the original sample. The inserts from the resistant clones can be sequenced, and the sequences compared against a database of known antibiotic resistance genes. As early as 1997, these methods were used to characterize the diversity of quinolone resistance genes in soil (Waters and Davies 1997). This functional metagenomic approach has been used to target specific classes of antibiotic resistance genes, as well as more general surveys of antibiotic resistance where libraries are screened against multiple antibiotics. For example, tetracycline resistance has been assayed from human mouth and organic pig samples (Diaz-Torres et al. 2003; Kazimierczak et al. 2009), and β-lactamase genes have been extracted from samples such as tropical surface waters and Alaskan soil (Henriques et al. 2006; Allen et al. 2009).

The mining of functional genes focuses on two main types of samples. When trying to determine baseline levels of antibiotic resistance and evolutionary relationships of individual genes, pristine samples and those dating from before the use of
antibiotics are used (D’Costa et al. 2011). When searching for novel antibiotic resistance genes, complex samples are used, especially those with increased levels of antibiotic compounds such as feces or activated sludge (Sommer et al. 2009; Mori et al. 2008). It is also possible to use publicly available information to screen for potentially novel antibiotic resistance genes. Both the National Center for Biotechnology Information (NCBI) and the MG-RAST server (Meyer et al. 2008) have extensive DNA sequence datasets that are available to the public. Once identified via the public databases, antibiotic resistance genes of interest can then be characterized using other methods (Toth et al. 2010).

Studying Antibiotic Resistance Genes Using Metagenomic Sequencing Methods

One tool that is useful for exploring antibiotic resistance in metagenomic samples is MG-RAST (Meyer et al. 2008). MG-RAST, developed at Argonne National Laboratory and the University of Chicago, provides metagenomic data analysis tools for both public and private metagenomic sequencing sets. There are hundreds of publicly available metagenomes on the MG-RAST website (http://metagenomics.anl.gov). These can be accessed directly using the sample ID number or via the “browse metagenome” function. Researchers may submit their own metagenomic datasets to the site for analysis, with processing priority given to datasets that will be made immediately available to the public. After normalization, both taxonomic and functional data are extracted from the submitted sequences and made available for visualization via the website. There are many different classification schemes that are available for organizing data on MG-RAST. One system, called SEED (Overbeek et al. 2005), is designed to classify functional genes across genomes using a standardized system for categorizing genes or gene fragments. The SEED system of organization is hierarchical in nature, and each of the primary functional groups or systems is composed of subsystems. Examples of primary SEED functional groups are “cell wall synthesis,” “nitrogen metabolism,” and “virulence.” Within the virulence functional category is a subset of genes that are associated with “resistance to antibiotic and toxic compounds” (RATC). Drilling even further down into this particular functional group, gene fragments are binned by categories such as “aminoglycoside adenylyltransferases,” “beta-lactamase resistance,” and “resistance to fluoroquinolones.”

After identifying a metagenome, a list of antibiotic resistance genes can be accessed using the “analysis” icon. Under “Data Type” choose “Functional Abundance” and “Hierarchical Classification.” The Data Selection annotation source should be “subsystems” and the Data Visualization option should be “table.” Then, hit the “generate” button. After processing the data, a table will be displayed with three functional classification levels, along with abundance and quality data. The abundance results are clickable and open a window that lists the taxonomic assignments of each of the hits, as well as a link to the actual sequence and M5nr nonredundant protein data. The M5nr database allows classification of the fragment across multiple classification schemes.

Because metagenomic sequencing is performed on whole-community DNA without a PCR step, the data generated can be considered quantitative. So in addition to describing which antibiotic resistance genes are present, metagenomic analysis can quantify the relative amounts or proportions of individual genes and/or gene classes, both within any individual sample and across samples from different habitats. As with all methods associated with tracking antibiotic resistance in the environment, there are no standard methods (Allen et al. 2010). However, control metrics available through MG-RAST, in particular a new metric called duplicate read inferred sequencing error estimation (DRISEE; Keegan et al. 2012), can serve as screening tools to decide on minimum quality standards for inclusion or exclusion of specific metagenomic samples for analysis.

These metagenomic sequencing tools can be used to start addressing questions related to the
ecology of antibiotic resistance in specific habitats and across ecosystems. Metagenomic analysis of 45 microbiomes across the globe revealed functional gene profiles that correlated with environment (Dinsdale et al. 2008). This idea was expanded to antibiotic resistance genes, providing an antibiotic resistance “fingerprint” for some samples (Durso et al. 2011). A metagenomic analysis of public datasets was performed specifically comparing RATC genes from agricultural and nonagricultural metagenomes (Durso et al. 2012). Among the 26 metagenomes studied, the total percent of RATC gene fragments (based on all classified fragments) ranged from 0.7 % for the Sargasso Sea sample to 4.4 % for the dog. The fecal samples (dog, fish, three human, and cattle) had the highest overall percent of RATC genes, while the marine samples (Chesapeake, Galapagos, Zanzibar, Gulf of Mexico, Key West, Madagascar, Gulf of Maine, and Sargasso Sea) had the lowest overall percent of RATC genes. In addition to having the highest proportion of RATC genes, the dog metagenome also displayed the highest diversity of RATC classes (31 classes) and the Sargasso Sea displayed the lowest diversity (7 classes). Using MG-RAST, individual classes of antibiotic resistance genes could be examined. The fish metagenome, for instance, had over ten times as many genes coding for mercuric reductase and mercury resistance (3.9 %), compared to the average for the other metagenomes (0.31 %), while the day 29 metagenome of kimchi, a Korean fermented vegetable dish, had high levels of the two-protein Gram-positive multidrug resistance compared to the other metagenomes examined. In both of these examples, the metagenomic data reflect what we already know about the biology of these systems and suggest that metagenomic RATC data can be used to distinguish fundamental differences in microbial community ecology from diverse microbiomes.

In addition to information on specific antibi-

pulled out and used for taxonomic purposes. In addition, MG-RAST has the ability to link protein-coding fragments with taxonomic assignments using SEED and other systems. Currently, the only way to access this linked information for individual reads from MG-RAST is through the “assignment” column on the functional gene table, so it is time-consuming to assemble this linked data, even for a single metagenome.

Grouped data are more easily accessible in MG-RAST using the “workbench” function. In the functional table, the last column contains a box titled “to workbench.” Reads belonging to specific functional groups can be selected, and then a second taxonomic-specific analysis can be run exclusively on the reads in the workbench. Using these methods, information can be gathered on which bacteria are likely carrying specific antibiotic resistance genes and how the bacterial communities may change over time or space.

Some types of antibiotic resistance, such as beta-lactamase, MDR efflux pumps, and fluoroquinolone resistance, are broadly distributed across many (>10) taxa, while other types of resistance genes such as tetracycline and vancomycin resistance are more taxonomically restricted (4 or 5 taxa each). Within individual antibiotic resistance classes, the taxonomic distribution of specific genes or gene classes varies by metagenome. For example, MDR efflux pump genes are associated mainly with Clostridia in animal agriculture metagenomes but are more frequently assigned to Gammaproteobacteria in coastal marine samples. Metagenomic sequencing enables researchers to track the change in microbial communities over time. One set of publicly available metagenomes follows the fermentation of kimchi over the course of a month. The antibiotic resistance gene profiles associated with the kimchi change dramatically as the fermentation progresses, and these specific changes can be tracked using metagenomic sequencing.
otic resistance genes, analysis of metagenomic The strengths of these metagenomic sequenc-
sequencing data can also provide taxonomic ing methods are that they allow researchers to
information about a sample. The use of the 16S identify and gain a quantitative understanding of
ribosomal RNA gene to classify bacteria is well functional gene relationships across samples and
known. Some of the fragments in a metagenomic geographies. It should be kept in mind that there
sample that code for 16S rRNA genes can be are many places where artifacts of processing of
either the sample itself or the resulting sequence data can influence the results. Although these sequence-based metagenomic data are excellent for getting oriented in a system and providing an overview of what is potentially there, the output is of fairly low resolution and requires follow-up using other methods before detailed conclusions can be drawn. Nonetheless, there is great value in the information that these kinds of techniques can provide. Like the Lewis and Clark expedition, which mapped the entire US western frontier based on sampling a single route covering much less than 1 % of today’s public roads in the area, data generated by metagenomic sequencing methods provide a first step in exploring previously unknown territory. For antibiotic resistance, they offer the capacity to examine the prevalence of antibiotic gene distribution on a global scale and the opportunity to begin to compare distribution of specific antibiotic resistance genes across samples and time.

Summary

The ecology of antibiotic resistance genes in the environment remains largely unexplored. Metagenomic tools provide the opportunity to identify novel antibiotic resistance genes, explore the epidemiology of antibiotic-resistant genes across multiple habitats, and begin to define relationships between antibiotic resistance genes and the bacteria that likely carry them. The availability of public metagenomic datasets affords all researchers an opportunity to ask and answer questions about antibiotic resistance.

References

Allen H, Cloud-Hansen K, Wolinski J, et al. Resident microbiota of the gypsy moth midgut harbors antibiotic resistance determinants. DNA Cell Biol. 2009;28(3):109–17.
Allen H, Donato J, Wang H, et al. Call of the wild: antibiotic resistance genes in natural environments. Nat Rev. 2010;8:251–9.
Aminov R, Mackie R. Minireview: evolution and ecology of antibiotic resistance genes. FEMS Microbiol Lett. 2007;271:147–61.
D’Costa V, McGrann K, Hughes D, et al. Sampling the antibiotic resistome. Science. 2006;311:374–7.
D’Costa V, King C, Kalak L, et al. Antibiotic resistance is ancient. Nature. 2011;477(7365):457–61.
Diaz-Torres M, McNab R, Spratt D, et al. Novel tetracycline resistance determinant from the oral metagenome. Antimicrob Agents Chemother. 2003;47(4):1430–2.
Dinsdale E, Edwards R, Hall D, et al. Functional metagenomic profiling of nine biomes. Nature. 2008;452:629–33.
Durso L, Harhay G, Bono J, et al. Virulence-associated and antibiotic resistance genes of microbial populations in cattle feces analyzed using a metagenomic approach. J Microbiol Methods. 2011;84(2):278–82.
Durso LM, Miller DN, Wienhold BJ. Distribution and quantification of antibiotic resistant genes and bacteria across agricultural and non-agricultural metagenomes. PLoS One. 2012;7:e48325.
Henriques I, Moura A, Alves A, et al. Analysing diversity among β-lactamase encoding genes in aquatic environments. FEMS Microbiol Ecol. 2006;56:418–29.
Kazimierczak K, Scott K, Kelly D, et al. Tetracycline resistance of the organic pig gut. Appl Environ Microbiol. 2009;75(6):1717–22.
Keegan K, Trimble W, Wilkening J, et al. A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE. PLoS Comput Biol. 2012;8(6):e1002541. doi:10.1371/journal.pcbi.1002541.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Mori T, Mizuta S, Suenaga H, et al. Metagenomic screening for bleomycin resistance genes. Appl Environ Microbiol. 2008;74(21):6803–5.
Overbeek R, Begley T, Butler R, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.
Riesenfeld C, Goodman R, Handelsman J. Uncultured soil bacteria are a reservoir of new antibiotic resistance genes. Environ Microbiol. 2004;6(9):981–9.
Sommer MO, Dantas G, Church GM. Functional characterization of the antibiotic resistance reservoir in the human microflora. Science. 2009;325:1128–31.
Toth M, Smith C, Frase H, et al. An antibiotic-resistance enzyme from a deep-sea bacterium. J Am Chem Soc. 2010;132:816–23.
Waters B, Davies J. Amino acid variation in the GYRA subunit of bacteria potentially associated with natural resistance to fluoroquinolone antibiotics. Antimicrob Agents Chemother. 1997;41(12):2766–9.
Wright G. The antibiotic resistome: the nexus of chemical and genetic diversity. Nat Rev. 2007;5:175–86.
Mining Metagenomic Datasets for Cellulases

David J. Rooks and Alan J. McCarthy
Microbiology Research Group, Institute of Integrative Biology, Biosciences Building, University of Liverpool, Liverpool, UK

Synonyms

Environmental DNA; Glycosyl hydrolases; Metagenomes; Metatranscriptomes

Definition

Metagenomic (DNA) or metatranscriptomic (cDNA) sequence datasets generated using DNA or RNA extracts are obtained directly from environmental samples. These include soil, water, gut contents, and degrading organic matter/plant biomass and biofilms; laboratory-incubated microcosms or mesocosms in which cellulose-degrading microorganisms are enriched also serve as sources of nucleic acids for the preparation of sequence datasets. Genes encoding glycosyl hydrolases and specifically those likely to be active against cellulose (cellulases) can be sought, most efficiently in the large sequence datasets generated by the application of pyrosequencing technologies.

Cellulose and Its Biodegradation

Cellulose is the most abundant form of photosynthetically fixed carbon in the biosphere. It is a fibrous linear homopolymer of glucose in the form of cellobiose (dimer) units linked by β-1,4-glycosidic bonds, and it occurs naturally in plants, some fungi, protozoa, and one group of animals – the urochordates (Lynd et al. 2002). Native cellulose is a highly crystalline polymer due to the formation of rigid microfibrillar structures stabilized by inter- and intramolecular hydrogen bonds and van der Waals interactions between the polymer chains, and this is largely responsible for its recalcitrance. This network of bonds leads to a mostly uniform arrangement of fibers, and the resultant crystalline cellulose lacks enzyme-accessible surface morphologies, further enhancing resistance to hydrolysis (Zhou et al. 2009). Cellulose usually occurs naturally in close physical association with hemicelluloses, which are heteropolysaccharides that, in terrestrial plants, form the lignocarbohydrate matrix enveloping cellulose fibers and essentially constituting the plant cell wall structure. Cotton is the only naturally occurring pure form of highly crystalline cellulose. For microorganisms to hydrolyze and metabolize insoluble polymeric cellulose, extracellular cellulases must be produced and in multiple forms that act synergistically. The two primary models are those in which the enzymes are truly secreted, versus the cellulosome, a surface-bound multimeric complex of polypeptides comprising catalytic and non-catalytic components; the cellulosome has been likened to a polysaccharide processing nanomachine (Fontes and Gilbert 2010). There is a possible third model in which cellulose is bound to the bacterial cell surface and further processed in the periplasmic space (see Ransom-Jones et al. 2012). Three major types of enzymatic activities are found: (i) endoglucanases, (ii) exoglucanases (cellobiohydrolases), and (iii) β-glucosidases (cellobiases). The evidence for oxidative attack on cellulose has often been equivocal, but there are now data that establish the involvement of an enzyme (GH61) in cellulose depolymerization (Quinlan et al. 2011).

Cellulase Structure and Function

Cellulases are generally glycosidic hydrolase (GH) enzymes that utilize the same mechanism of acid-base catalysis with inversion or retention of glucose anomeric configuration (Davies and Henrissat 1995). Cellulases are modular enzymes composed of independently folding, structurally and functionally discrete units, referred to as either domains or modules (Henrissat et al. 1998), and
are the most diverse enzymes that catalyze a single reaction. Automated data mining suggests that there are 15 glycoside hydrolase families that contain cellulases; families are defined by amino acid sequence similarity (CAZy – see below). Structural studies show that cellulases have eight different protein folds and contain a carbohydrate-binding module, which is usually linked to a catalytic domain (Shoseyov et al. 2006). Glycosyl hydrolases with open active sites typically exhibit endocellulolytic activity (endoglucanases) and cleave β-1,4 links at amorphous sites in the polysaccharide chain to generate chain ends and ultimately oligosaccharides of various lengths (Horn et al. 2006). Those with tunnel-like active sites exhibit exocellulolytic activity and are cellobiohydrolases that act in a processive manner on the reducing or nonreducing ends to liberate either glucose or cellobiose as major products. β-Glucosidases convert cellobiose to glucose, completing the highly synergistic enzymatic depolymerization of cellulose.

The Carbohydrate-Active Enzyme Database (CAZy)

Identification of cellulase genes per se can be achieved by interrogating DNA sequence databases to identify homologies or, more ambitiously, to look for new types or classes of enzymes among the genes of unknown function that invariably dominate metagenome sequence datasets. The former is facilitated by the Carbohydrate-Active Enzyme (CAZy) database (https://www.cazy.org) (Cantarel et al. 2009), a comprehensive repository of CAZymes that is an almost unique resource for enzyme discovery. At present, CAZy covers approximately 300 protein families, including glycoside hydrolases (GHs), glycosyltransferases (GTs), polysaccharide lyases (PLs), carbohydrate esterases (CEs), and carbohydrate-binding modules (CBMs). All known cellulases are found within the CAZy database and are denoted by two enzyme commission numbers: EC 3.2.1.4 (endoglucanase) and EC 3.2.1.91 (cellobiohydrolase). Families GH5 and GH9 have the largest number of biochemically characterized cellulases; however, this could be largely due to the abundance of these cellulases in the limited number of model cellulolytic organisms that have been studied (Sukharnikov et al. 2011). The database is frequently updated to provide rich sets of manually curated information on all groups of CAZymes, i.e., names, GenBank accession numbers, EC designations, 3D structure, and taxonomy, and the information can serve as an invaluable resource to identify CAZyme genes or gene fragments in both genomes and metagenomes. Although the collection of enzyme data in CAZy is invaluable for enzymologists, annotations could be significantly improved; the term “characterized” in CAZy is applied equally to proteins that have been analyzed biochemically and to those for which function has been computationally predicted (Stam et al. 2006).

Metagenomics

The vast majority of microorganisms in the biosphere have yet to be cultivated and remain an untapped source of enzymes for biotechnological applications. The current impetus to find novel cellulases for applications, particularly in biomass refining, stems from the importance of utilizing cellulose as a substrate for second-generation biofuel production. The requirement for synergy and the low specific activity of cellulases in native cellulose saccharification processes remain a major challenge. Environmental microbiology research was changed radically by molecular biology, with the greatest effort directed toward describing true phylogenetic/functional diversity in natural microbial communities by PCR amplification of marker genes. However, cellulase genes, although well defined at the protein sequence level, can rarely be simply amplified in this way because the extent of nucleotide sequence variation does not enable the design of appropriate oligonucleotide primers for PCR. Subsequently, the development of quantitative PCR and the use of environmental RNA as the template moved this field forward,
but we are now firmly in the era of environmental metagenomics, made possible by pyrosequencing technology (next-generation sequencing). Thus, metagenomics, the direct sequencing of DNA fragments from environmental samples, enables mining of the vast genetic resource held in the genomes of uncultured microorganisms that dominate natural microbial communities. Currently, a single pyrosequenced metagenome can comprise up to 15 gigabases in reads of up to 600 bp (Illumina). Alternatively, the metagenome can be cloned into a suitable vector that can accommodate large inserts (20–40 kb) and subsequently screened for cellulases (Rooks et al. 2012). Functional screening in an expression host (usually E. coli) using Congo red staining of carboxymethyl cellulose (CMC) (McDonald et al. 2012) has successfully recovered cellulases from a diversity of metagenomes, including those from soil (Voget et al. 2006), the buffalo rumen (Duan et al. 2009), the termite hindgut (Warnecke et al. 2007), and the human intestine (Qin et al. 2010).

Metatranscriptomes

Metagenomics provides information on the potential metabolic and functional capacity of a microbial community. However, these DNA-based analyses cannot differentiate between expressed and non-expressed genes. Environmental transcriptomics (metatranscriptomics) retrieves and sequences mRNAs from the microbial community to provide an unbiased perspective on gene expression in situ. Due to the difficulties inherent in the processing of environmental RNA to maintain integrity and ultimately recover the high-quality mRNA from the predominantly ribosomal RNA background, publications are relatively few in number. The first report of a pyrosequenced metatranscriptome from a complex microbial community was by Leininger et al. (2006) who demonstrated that archaeal transcripts of the key enzyme (amoA) in ammonia oxidation were several orders of magnitude more abundant in soils than the bacterial equivalent, suggesting that it is members of the Archaea that are the primary drivers of nitrification. Gilbert et al. (2008) identified a large number of novel highly expressed sequence clusters from marine microbial communities, the majority of which were orphaned, thus demonstrating the utility of the metatranscriptomic approach in the discovery of novel genetic variants. Damon et al. (2012) addressed the global activities of soil eukaryotes by sequencing 2 × 10,000 cDNAs synthesized from polyadenylated mRNA directly extracted from forest soils. A total of 2,076 sequences were putative homologues to genes for different enzyme classes; specific annotation identified enzymes active on major plant biomass polymers, with glycoside hydrolases representing 0.5 % of the total. Finally, a metatranscriptomic analysis targeted specifically at fungal glycoside hydrolases induced by the addition of cellulosic substrates to soil generated 47 putative cellulase sequences spanning 13 families identified within a cDNA sequence dataset comprising over 56,000 protein-coding sequence fragments (Takasaki et al. 2013). Therefore, despite the inherent difficulties of extracting, enriching, and processing mRNA from environmental samples, for which technological solutions are emerging, metatranscriptomics offers the advantage of targeting genes that are active in the environment and therefore functionally competent and exploitable.

Bioinformatics and Screening

In environmental metagenomics, determining the true microbial community structure that will lead to the discovery of new taxa, and hence novel enzymes, has been the most important driver. The bioinformatic tools and approaches available tend to reflect this emphasis on taxonomy. MEGAN (Huson et al. 2007) is a data management program used in the taxonomic analysis of large sequencing datasets, processing the results of comparisons between a known database and metagenome-derived sequences. In the context of this entry, information on the presence/abundance of known taxa of cellulose degraders
can be provided, and it is always an analysis worth doing. To identify novel cellulases, more sophisticated bioinformatic approaches are required to search for domains and motifs indicative of enzymes with cellulose binding and/or catalytic functions. Sequence comparisons among proteins with suggestive domain architectures or genomic contexts in metagenomic DNA have the potential to identify novel cellulases; the discovery of a new carbohydrate-binding module in metagenomic DNA by Mello et al. (2010) is a particularly good example of what can be achieved by the continuing development of bioinformatic tools.

With complete sequences and their genomic context if located within larger sequenced DNA fragments, homology-based approaches can be extended. Firstly, structural modeling of members of likely cellulase families can identify those with unusual binding and catalytic sites that may therefore exhibit functional novelty. Secondly, domains of unknown function, which are likely to be putative cellulase or cellulase-related sequences because they are consistently linked by genome context, can be characterized through distant homology, non-homology, and structure-based approaches. This is exemplified by the identification of a novel cellulase from a sequenced marine bacterial genome through signature domains that assemble enzymes into plant cell wall degradative complexes (Bras et al. 2011).

Sequences with matches indicative of cellulases can of course be identified by BLAST searches against the CAZy database (see above) and through functional annotation pipelines such as SEED (Overbeek et al. 2005) and MG-RAST (Meyer et al. 2008) to provide taxonomic affiliations for functional and hypothetical protein-encoding genes. However, identification of even distant relationships for the short sequence read output (<500 bp) that is characteristic of pyrosequencing is a bioinformatic challenge. The danger of simply searching against databases of known cellulase gene sequences is that true novelty will be missed and the metagenomes will only be mined for variants of these known cellulases. Much longer sequences, ideally complete genes, are the best source material for bioinformatic prediction of potential cellulase function, and metagenomic/metatranscriptomic datasets can provide the probes to identify such genes and their neighbors in contemporaneously produced fosmid or bacterial artificial chromosome (BAC) libraries (Rooks et al. 2012). Subsequent cloning, overexpression, and purified protein production then provide sufficient material for the detailed structure/function characterization, combining classical biochemistry and structural biology approaches, necessary to establish that a novel cellulase has been teased out from the metagenome.

Future Prospects

Firstly, the tandem approach of using environmental RNA and DNA as the starting material to generate complementary metatranscriptomes and metagenomes, thus benefitting from the specific advantages of each, is becoming more feasible with developments in ribosomal RNA depletion and messenger RNA enrichment techniques. Pyrosequencing on the 454 platform, which had predominated due to the relatively long read lengths (ca. 800 bp) that could be obtained, is receding to be replaced by next-generation sequencing technology that can deliver ever-increasing read lengths (currently ca. 500 bp by using paired end reads) in combination with read numbers in the 10^7 range. All of this is taking place in an economically competitive environment in which sequencing run costs continue to decrease. The bioinformatic bottleneck remains in terms of computer processing capacity, and thus time, but specifically in relation to mining metagenomes for genes encoding enzymes, the future is the ability to reliably predict and model protein structure and function in silico and thus identify truly novel cellulases among those numerous translated metagenomic sequences that lack homology with any known protein-coding sequences.
References

Bras JL, Cartmell A, Carvalho AL, et al. Structural insights into a unique cellulase fold and mechanism of cellulose hydrolysis. Proc Natl Acad Sci U S A. 2011;108:5237–42.
Cantarel BL, Coutinho PM, Rancurel C, et al. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res. 2009;37:233–8.
Damon C, Lehembre F, Oger-Desfeux C, et al. Metatranscriptomics reveals the diversity of genes expressed by eukaryotes in forest soils. PLoS ONE. 2012;7:e28967.
Davies G, Henrissat B. Structures and mechanisms of glycosyl hydrolases. Structure. 1995;3:853–9.
Duan CJ, Xian L, Zhao GC, et al. Isolation and partial characterization of novel genes encoding acidic cellulases from metagenomes of buffalo rumens. J Appl Microbiol. 2009;107:245–56.
Fontes CM, Gilbert HJ. Cellulosomes: highly efficient nanomachines designed to deconstruct plant cell wall complex carbohydrates. Ann Rev Biochem. 2010;79:655–81.
Gilbert JA, Field D, Huang Y, et al. Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS ONE. 2008;3:e3042.
Henrissat B, Teeri TT, Warren RA. A scheme for designating enzymes that hydrolyse the polysaccharides in the cell walls of plants. FEBS Lett. 1998;425:352–4.
Horn SJ, Sikorski P, Cederkvist JB, et al. Costs and benefits of processivity in enzymatic degradation of recalcitrant polysaccharides. Proc Natl Acad Sci U S A. 2006;103:18089–94.
Huson DH, Auch AF, Qi J, et al. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Leininger S, Urich T, Schloter M, et al. Archaea predominate among ammonia-oxidizing prokaryotes in soils. Nature. 2006;442:806–9.
Lynd LR, Weimer PJ, van Zyl WH, et al. Microbial cellulose utilization: fundamentals and biotechnology. Microbiol Mol Biol Rev. 2002;66:506–77.
McDonald JE, Rooks DJ, McCarthy AJ. Methods for the isolation of cellulose-degrading microorganisms. Methods Enzymol. 2012;510:349–74.
Mello LV, Chen X, Rigden DJ. Mining metagenomic data for novel domains: BACON, a new carbohydrate-binding module. FEBS Lett. 2010;584:2421–6.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Overbeek R, Begley T, Butler RM, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.
Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Quinlan RJ, Sweeney MD, Lo Leggio L, et al. Insights into the oxidative degradation of cellulose by a copper metalloenzyme that exploits biomass components. Proc Natl Acad Sci U S A. 2011;108:15079–84.
Ransom-Jones E, Jones DL, McCarthy AJ, et al. The fibrobacteres: an important phylum of cellulose-degrading bacteria. Microb Ecol. 2012;63:267–81.
Rooks DJ, McDonald JE, McCarthy AJ. Metagenomic approaches to the discovery of cellulases. Methods Enzymol. 2012;510:375–94.
Shoseyov O, Shani Z, Levy I. Carbohydrate binding modules: biochemical properties and novel applications. Microbiol Mol Biol Rev. 2006;70:283–95.
Stam MR, Danchin EG, Rancurel C, et al. Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of alpha-amylase-related proteins. Protein Eng Des Sel. 2006;19:555–62.
Sukharnikov LO, Cantwell BJ, Podar M, et al. Cellulases: ambiguous nonhomologous enzymes in a genomic perspective. Trends Biotechnol. 2011;29:473–9.
Takasaki K, Miura T, Kanno M, et al. Discovery of glycoside hydrolase enzymes in an avicel-adapted forest soil fungal community by a metatranscriptomic approach. PLoS ONE. 2013;8:e55485.
Voget S, Steele HL, Streit WR. Characterization of a metagenome-derived halotolerant cellulase. J Biotechnol. 2006;126:26–36.
Warnecke F, Luginbuhl P, Ivanova N, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450:560–5.
Zhou W, Schuttler HB, Hao Z, et al. Cellulose hydrolysis in evolving substrate morphologies I: a general modeling formalism. Biotech Bioeng. 2009;104:261–74.

Mock Community Analysis

Sarah Highlander
Genomic Medicine, J. Craig Venter Institute, La Jolla, CA, USA

Definitions

Mock community: A defined mixture of microbial cells and/or viruses or nucleic acid molecules created in vitro to simulate the composition of a microbiome sample or the nucleic acid isolated therefrom.

Microbiome: The microbes (bacteria, archaea, fungi, protists, and viruses) that inhabit
a specific environment or host, such as all the microbes that live in and on the human body.

Metagenome: The complete DNA (genomic) content of a microbiome sample. The term “metagenome” was first used by Handelsman et al. to describe the “collective genomes of soil microflora” (Handelsman et al. 1998).

Introduction

Although a few studies have reported creation of mock communities for environmental microbial systems, this review will be restricted to mock communities that have been developed for studies of the human microbiome.

Microbial Mock Communities

For the human sampling aspect of the Human Microbiome Project (HMP), the clinical centers at Baylor College of Medicine and the Washington University School of Medicine were tasked with obtaining microbiome samples from 18 different body sites. These samples were in the form of saliva, tooth scrapings, buccal swabs, vaginal swabs, nasal swabs, skin scrapings, feces, etc. Each sample had a different physical and microbial composition, yet it was necessary to have a standard and uniform method of DNA extraction for each. The method selected included chemical lysis with sodium dodecyl sulfate (SDS) and mechanical disruption by bead beating followed by column purification of the DNA from the cell lysate (http://www.hmpdacc.org/doc/HMP_MOP_Version12_0_072910.pdf) using the MO BIO PowerSoil DNA Isolation Kit (Carlsbad, CA). As a means to evaluate the DNA purification protocol, we created a mock cell community that consists of 22 bacterial strains and one archaeal strain, mostly representing strains found at different sites within the human body (Table 1). The strains were selected as having different features such as different cell wall compositions (gram positive, gram negative, spore formers, encapsulated, thick cell wall), aerobe or anaerobe, high and low percent G+C, and having completely sequenced genomes. The strains were grown under appropriate growth conditions to late logarithmic or stationary phase and then mixed at an equal ratio at a concentration of 10^8 cells/ml. The cell mix is available, free of charge, from BEI Resources (www.beiresources.org). A similar mixture, formulated in 40 % glycerol (BEI HM-281), was also created to be used as a viable mock community for single cell studies.

Mock Community Analysis, Table 1 Strains in the HMP mock cell community (BEI HM-280)

Genus species                    Strain number
Acinetobacter baumannii          ATCC 17978
Actinomyces odontolyticus        ATCC 17982
Bacillus cereus                  ATCC 10987
Bacteroides vulgatus             ATCC 8482
Bifidobacterium adolescentis     DSM 20083
Clostridium beijerinckii         ATCC 51743
Deinococcus radiodurans          ATCC 13939
Enterococcus faecalis            ATCC 47077
Escherichia coli                 ATCC 700296
Helicobacter pylori              ATCC 700392
Lactobacillus gasseri            ATCC 33323
Listeria monocytogenes           ATCC BAA-679
Methanobrevibacter smithii       ATCC 35061
Neisseria meningitidis           ATCC BAA-335
Porphyromonas gingivalis         ATCC 33277
Propionibacterium acnes          DSM 16379
Pseudomonas aeruginosa           ATCC 47085
Rhodobacter sphaeroides          ATCC 17023
Staphylococcus aureus            ATCC BAA-1718
Staphylococcus epidermidis       ATCC 12228
Streptococcus agalactiae         ATCC BAA-611
Streptococcus mutans             ATCC 700610
Streptococcus pneumoniae         ATCC BAA-334

We extracted DNA from the mock cell community using the HMP standard DNA isolation protocol, then performed 454 amplicon sequencing of the V1–V3 variable regions of the 16S ribosomal RNA gene. We failed to detect any M. smithii or bifidobacterial reads and recovered less than 1 % of the total reads corresponding to the following input organisms: Acinetobacter
baumannii, Actinomyces odontolyticus, Clostrid- bead beating (as included in the HMP protocol)
ium beijerinckii, Deinococcus radiodurans, delivered the best representation of the commu-
Helicobacter pylori, Lactobacillus gasseri, nity structure. Addition of mutanolysin, but not
Rhodobacter sphaeroides, or Streptococcus lysozyme, or lysostaphin, also enhanced recovery
spp. In contrast, the relative abundance of of the expected proportions of 16S rRNA gene
Neisseria reads was approximately 35 % and the sequences. In sum, L. iners was overrepresented
relative abundance of Bacillus and Enterococcus using all techniques and the two gram-negative
reads isolated from the mock community was organisms, E. coli and P. aeruginosa, were
approximately 15 % for each genus. These underrepresented in all. Thus, the authors caution
observations are likely due to a combination of that none of the methods tested returned the
the relative ability of an organism to be lysed and actual representation of the input mock
the percent match of the 16S primers to rRNA community.
gene targets. For example, it is known that the In another comparison of extraction methods,
534R primer has numerous mismatches to Willner et al. created a mock community of
actinobacterial 16S rRNA genes, particularly to 12 strains that included organisms relevant to
the bifidobacteria, and an evaluation of primer respiratory infections and cystic fibrosis
mismatches to other members of the mock (CF) (Willner et al. 2012). The goal was to use
community revealed F27 mismatches to this mock community to compare and evaluate
Acinetobacter, Pseudomonas, and Escherichia methods for DNA extraction prior to their appli-
and numerous 534R mismatches to Methanobre- cation to clinical bronchoalveolar lavage (BAL)
vibacter, as described below. samples obtained from CF patients. They also
Although we did not use our mock cell com- developed an in silico simulation of the mock
munity for rigorous testing of lysis and DNA BAL community using the software package
extraction methods for metagenomics, Yuan Grinder (http://sourceforge.net/projects/
et al. have performed a systematic evaluation of biogrinder/). The mock community was com-
M
common DNA extraction methods (Yuan posed of the following bacterial species from
et al. 2012), using a mock community composed actively growing stocks (relative proportions in
of equal cell counts of 11 bacterial species chosen parentheses): P. aeruginosa (1), Burkholderia
to represent different human body sites: E. coli, cepacia (0.1), S. aureus (0.1), Haemophilus
S. aureus, P. aeruginosa, S. agalactiae, Coryne- influenzae (0.1), Moraxella catarrhalis (0.01),
bacterium tuberculostearicum, Lactobacillus S. epidermidis (0.01), Klebsiella pneumoniae
iners, Lactobacillus crispatus, Atopobium vagi- (0.01), N. meningitidis (0.001), Burkholderia
nae, Gardnerella vaginalis, and P. acnes. They multivorans (0.001), Legionella pneumophila
compared six different DNA methods that com- (0.0001), S. pneumoniae (0.0001), and Neisseria
bined different lysis (enzymatic, chemical, and gonorrhoeae (0.00001). Aliquots of the mock
bead beating) and DNA purification (silica col- community were extracted using a “CTAB
umn or phenol/chloroform plus isopropanol pre- method,” a “saline protocol,” using the
cipitation) methods. DNA yield and DNA NucleoSpin Tissue Kit (pellet and liquid proto-
integrity (shearing) were evaluated. Microbial cols) and the MO BIO PowerSoil Kit. Commu-
abundance was measured by 454 sequencing of nity abundance was evaluated by 454 16S rDNA
the 16S rDNA V1-V2 regions using a mixture of sequencing of the V8-V9 regions using
forward primers that were chosen to prevent bias a degenerate 1,114 F3 primer (Willner
against Lactobacillus spp. and Gardnerella spp. et al. 2012). Data were normalized to 900 reads
(Yuan et al. 2012), followed by statistical ana- per sample. At this level, few (<1 %) to no
lyses that included accommodation for differ- streptococcal reads were detected in most of the
ences in 16S rRNA gene copy number per preparations, and no Legionella reads were
organism. Extraction methods that included detected in any of the preparations. In contrast,
the abundance of Neisseria reads was greater than that predicted by the in silico model. Unfortunately, each sample included Escherichia and Dechloromonas as contaminants and the CTAB samples had a high percentage of Stenotrophomonas. This made it difficult to draw conclusions about the efficiency and reproducibility of the methods employed.

Diaz et al. created two different oral bacterial mock cell mixes, one in even cell distribution and one in unequal distribution, using seven species that are representative of the tooth surface (Diaz et al. 2012): Streptococcus oralis, S. mutans, Lactobacillus casei, Actinomyces oris, Fusobacterium nucleatum, P. gingivalis, and Veillonella sp. Late logarithmic phase cultures were mixed in an even distribution based on cell counts and in an uneven mixture that replicated the proportions found in the oral cavity. These cell communities were lysed using a single protocol that included lysozyme treatment, overnight proteinase K digestion, and column purification of the DNA. As a control, genomic DNAs from the seven bacteria were mixed in equal proportion based on 16S rRNA gene copy number. All DNAs were used for 454 sequencing of the V1-V2 region of the 16S rRNA gene. Very few S. mutans or P. gingivalis reads were recovered, despite efficient sequencing of the control DNA, suggesting that the lysis method was inefficient for these two members of the mock community.

DNA Mock Communities

While mock communities composed of mixtures of cells were intended to be used to evaluate different DNA extraction methods, they also revealed biases in 16S rRNA gene amplification, sequencing, and classification. DNA mock communities have been created in attempts to address these issues, to examine sensitivity and presence of chimeric sequences, to serve as controls for protocol development, etc. DNA mock communities can be composed of mixtures of genomic DNA, of plasmid clones of genes (usually the 16S rRNA gene), or of PCR amplicons. Generally, these mock communities were created as controls and calibrators for 16S rRNA gene variable region sequencing on next-generation sequencing platforms, but they are useful in the context of metagenomic sequencing as well.

Turnbaugh et al. used genomic DNA from 67 gut bacterial strains (e.g., containing the genera Bifidobacterium, Collinsella, Bacteroides, Prevotella, Clostridium, Dorea, Roseburia, Ruminococcus, Streptococcus, Citrobacter, Enterobacter, Proteus, and Providencia) to create even and uneven mixtures as calibrators for 454 16S rDNA sequencing of the V2 region for a twin study of gut microbiomes (Turnbaugh et al. 2010). Following quality filtering, pyrosequencing, denoising and chimera removal, the estimated diversity (at 97 % species cutoff) of the three uneven mock communities was 75, 58, and 63, respectively, which was remarkably close to the 62 phylotypes expected in the community. Diversity was not estimated for the even communities, although the ratio of observed-to-expected sequences by phylotype was tabulated and reported. This revealed an absence of bifidobacterial reads, due to multiple primer mismatches, and overabundances of sequences mapping to other genera, including the Bacteroides and the clostridia. The authors acknowledged that these observations could be the result of a number of factors including variations in 16S rRNA gene copy number per strain and DNA quality.

During the development of standardized 454 16S rDNA sequencing protocols for the HMP, we created mock DNA communities using genomic DNAs from 21 of the strains listed in Table 1 (B. adolescentis and P. gingivalis were not included) plus Candida albicans MYA-2876. DNAs were prepared from individual cultures and each DNA preparation was validated for purity by Sanger paired-end sequencing of 384 full-length 16S rDNA clones obtained from each. Genomic DNAs were combined, based on 16S rRNA gene copy number, to form even or staggered mock communities. The even communities theoretically contained 10^5 16S rDNA copies from each species per amplification reaction, and the staggered communities had 16S rDNA copies that ranged from
10^3 to 10^6 copies from a particular species per reaction. All reactions contained approximately 1,000 copies of the C. albicans 18S rRNA gene (Haas et al. 2011; Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012). These mock communities were used to develop an improved chimera detection tool, called ChimeraSlayer, and revealed a high level of chimerism in short variable region products (Haas et al. 2011). The mock community was essential to validate and benchmark methods for 16S rDNA sequencing by all four genome sequencing centers involved in the HMP (the Baylor College of Medicine Human Genome Sequencing Center, the Broad Institute, the J. Craig Venter Institute, and the Washington University Genome Sequencing Center) and revealed clear cases of primer mismatches that caused some genera to be underrepresented (Fig. 1).

Mock Community Analysis, Fig. 1 Deviation from the expected for the 16S rDNA sequencing of the 20 bacterial + one archaeal mock community. (a) Distribution of reads over the 18 genera; expected frequencies (gray) were determined by whole-genome shotgun sequencing of the mock community, and observed frequencies were determined by 454 reads (red) or Sanger 3730 reads (blue). Error bars represent standard error. (b) Lowest percent mismatch between primer and 16S rRNA gene copy by organism, sequencing technology, and variable gene region (Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012)
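To make the idea of a copy-number-based calibrator concrete, the following is a minimal sketch (not the HMP analysis pipeline) of how expected 16S read fractions could be derived from the number of 16S rDNA copies added per reaction, and how observed read counts could be compared against them. The taxon names, copy numbers, and read counts are hypothetical, and the calculation assumes unbiased amplification and sequencing, which is exactly the assumption a mock community is meant to test.

```python
import math

def expected_fractions(copies_per_reaction):
    """Expected 16S read fraction per taxon, assuming unbiased amplification."""
    total = sum(copies_per_reaction.values())
    return {taxon: n / total for taxon, n in copies_per_reaction.items()}

def log2_deviation(observed_reads, expected):
    """log2(observed fraction / expected fraction); None if a taxon has no reads."""
    total_reads = sum(observed_reads.values())
    result = {}
    for taxon, exp in expected.items():
        obs = observed_reads.get(taxon, 0) / total_reads
        result[taxon] = math.log2(obs / exp) if obs > 0 else None
    return result

# Hypothetical three-member staggered community (16S rDNA copies per reaction).
staggered = {"taxon_A": 10**6, "taxon_B": 10**4, "taxon_C": 10**3}
expected = expected_fractions(staggered)

# Hypothetical classified read counts from one sequencing run.
observed = {"taxon_A": 9000, "taxon_B": 1000, "taxon_C": 0}
deviation = log2_deviation(observed, expected)

for taxon in staggered:
    print(taxon, round(expected[taxon], 5), deviation[taxon])
```

A positive value indicates overrepresentation relative to the input, a negative value indicates underrepresentation, and `None` marks a taxon that was not detected at all, as reported above for some members affected by primer mismatches.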
Sequencing the mock community clearly illustrated the need for quality filtering and chimera checking of 454 data of variable regions, as shown in Fig. 2. Without filtering, the diversity of the 21 species (18 operational taxonomic units) in the community is estimated to number in the 100s. Following quality filtering and chimera removal, the community richness is only a few-fold higher than expected, especially for the V1-V3 and V3-V5 regions of the 16S rDNA gene.

Mock Community Analysis, Fig. 2 Quality filtering and chimera checking and removal improve estimates of community diversity as evaluated by rarefaction analysis. Operational taxonomic units (OTUs) are plotted versus number of 454 sequence reads for three 16S rDNA variable region windows, V1-V3, V3-V5, and V6-V9, before quality filtering (black), after quality filtering (green), and after quality filtering and chimera removal (red). The expected number of OTUs in the mock community was 18 (dotted line) (Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012)

The HMP mock community also revealed examples of misclassification, poor classifiability (lowest in V6-V9), and unexplained overrepresentation of some genera (Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012). This mock community has been used to evaluate how quality filtering impacts taxonomic classification of reads generated on the Illumina platform (Bokulich et al. 2013), and a modified version has been used to develop a dual-index method for 16S amplicon sequencing on the Illumina MiSeq platform (Kozich et al. 2013).

Another type of mock DNA community is a set of plasmid clones of nearly full-length 16S rRNA gene fragments (Wu et al. 2010). Here, PCR amplicons from Clostridium difficile, Bacteroides fragilis, S. pneumoniae, Desulfovibrio vulgaris, Campylobacter jejuni, Rhizobium vitis, Lactobacillus delbrueckii, E. coli, Treponema sp., and Nitrosomonas sp. were cloned into pTOPO vectors and then used to create even and staggered DNA mixtures, which were then used as templates for 454 sequencing of the V1-V2 regions of the 16S rRNA gene. The authors report that correct proportions of input 16S rDNA sequence type were recovered following 454 sequencing and analysis, although different polymerases used for replicates of the staggered community gave slightly different results. Use of cloned 16S rRNA genes as controls is convenient, although genes on
supercoiled high copy number plasmids are not likely to be good surrogates for chromosomal ribosomal genes.

Summary

DNA mock communities have identified problems with use of “universal” 16S rRNA gene primers for amplification of variable regions for microbiome sequencing and have revealed flaws in taxonomic classification systems, where known sequences were classified incorrectly (Jumpstart Consortium Human Microbiome Project Data Generation Working Group 2012). They have also shown the critical requirement for stringent read quality filtering and chimera removal of 16S rDNA sequencing reads, which has helped to reduce estimates of the size of the “rare biosphere” of the human microbiome.

Mock communities of cells have proved valuable as controls for development of uniform DNA extraction methods for microbiome samples and DNA mixtures continue to be important as calibrators for 16S rDNA and metagenomic sequencing on changing high-throughput platforms. Neither type of mock community is perfect. Cell mixtures are easily contaminated; they may have incorrect cell counts due to clumping, dead cells, or the presence of bacteriophage; and are limited to species that can be grown without difficulty. DNA mixtures may also be contaminated, so it is important to validate the purity of the preparations prior to mixing, and mixtures based on 16S rDNA copy number may be skewed if calculations or assumptions are incorrect, particularly if the genomes of the input DNAs are not finished. Cell communities are plagued with the same issues of amplification bias and misclassification discovered using DNA communities.

Despite the flaws inherent in mock communities, they are useful as a uniform benchmark for microbiome and metagenome technology development and evaluation. The concept could be expanded to include mock communities of viruses and fungi. Further, one could imagine developing mock communities composed of different types of molecules such as RNAs or peptides or known components of the metabolome as being useful controls for microbiome work.

Cross-References

▶ Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes
▶ Extraction Methods, Variability Encountered in

References

Bokulich NA, Subramanian S, Faith JJ, et al. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods. 2013;10:57–9.
Diaz PI, Dupuy AK, Abusleme L, et al. Using high throughput sequencing to explore the biodiversity in oral bacterial communities. Mol Oral Microbiol. 2012;27:182–201.
Haas BJ, Gevers D, Earl AM, et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 2011;21:494–504.
Handelsman J, Rondon MR, Brady SF, et al. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5:R245–9.
Jumpstart Consortium Human Microbiome Project Data Generation Working Group. Evaluation of 16S rDNA-based community profiling for human microbiome research. PLoS ONE. 2012;7:e39315.
Kozich JJ, Westcott SL, Baxter NT, et al. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol. 2013;79:5112–20.
Turnbaugh PJ, Quince C, Faith JJ, et al. Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins. Proc Natl Acad Sci USA. 2010;107:7503–8.
Willner D, Daly J, Whiley D, et al. Comparison of DNA extraction methods for microbial community profiling with an application to pediatric bronchoalveolar lavage samples. PLoS ONE. 2012;7:e34605.
Wu GD, Lewis JD, Hoffmann C, et al. Sampling and pyrosequencing methods for characterizing bacterial communities in the human gut using 16S sequence tags. BMC Microbiol. 2010;10:206.
Yuan S, Cohen DB, Ravel J, et al. Evaluation of methods for the extraction and purification of DNA from the human microbiome. PLoS ONE. 2012;7:e33865.
Molecular Ecological Network of Microbial Communities

Ye Deng1 and Jizhong (Joe) Zhou2,3,4
1 Institute for Environmental Genomics, University of Oklahoma, Norman, OK, USA
2 Department of Microbiology and Plant Biology, Institute for Environmental Genomics, University of Oklahoma, Norman, OK, USA
3 Department of Environmental Science and Engineering, Tsinghua University, Beijing, China
4 Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

Synonyms

Microbial community network; Microbial ecological network; Microbial interaction network; Microbial network; Molecular ecological network

Definition

The network of microbial communities constructed using molecule-based experimental data, especially metagenomic data (e.g., microarray hybridization, sequencing), is referred to as a molecular ecological network (MEN). It aims to understand the interaction of members in a given community. If the molecules are based on phylogenetic gene markers (e.g., 16S small subunit ribosomal DNA), the networks are defined as phylogenetic molecular ecological networks (pMENs).

Introduction

In microbial ecology, the majority of data analytical efforts are focused on revealing the composition and diversity of a microbial community and also the changes across space, time series, and/or with experimental treatments. The conventional analytical approaches usually use species richness and α-diversity to depict a community structure, and several diversity indexes, such as the Simpson index and Shannon index, are used to measure the level of α-diversity (Fig. 1). Besides, the difference between two communities is often estimated by β-diversity, and more multivariate statistical techniques, such as ordination and regression methods, are used to describe the community patterns and associations with environmental factors (Deng 2013). Compared to the intensive studies of community composition and diversity, much less attention has been paid to the interactions and network relationships among microbial species (Zhou et al. 2010).

In natural environments, microbial species rarely live independently; instead, large numbers of organisms tend to exist sympatrically and synchronously through various types of symbiotic relationships. Their relationships could be positive (mutualism and commensalism) or negative (competition, predation, and amensalism) for the partner species (Faust and Raes 2012), and all relationships among the species form a complicated interaction web. These relationships can be simply exhibited as a network structure (Fig. 1), in which each node represents a species and the edge linking two nodes represents the interaction between these two species. More complex relationships in the community could be integrated into the network model as well. For instance, the strength of a relationship could be represented by edge weights, and regulatory or beneficial relationships could be represented by edge directions. Additionally, the abundance of species could be visualized as the sizes of nodes. Therefore, a comprehensive network structure could adequately depict the inherent relationships within a microbial community.

Since the end of the last century, ecological network studies have been initiated and well developed in macro-ecology (Montoya et al. 2006). Food web structures have been intensively studied due to their crucial contribution to the stability of biological communities (Pimm 2002). Meanwhile, mutualistic networks have also attracted a lot of attention (Bascompte and Jordano 2007). Through those interactions, an ecosystem is capable of accomplishing systems-level functions which could not be achieved by individual populations. Therefore,
explaining the ecological network structures, dynamics, and mechanisms has become an essential part of ecology. However, studies of interactions among microbial species are much more difficult than those in macro-ecology, mainly due to their incredibly high species diversity. Besides, most natural microbial species are uncultivable and invisible to the naked eye, which makes it more challenging to define network structure in a microbial community. Here, the definition of the phylogenetic molecular ecological network (pMEN) for microbial communities, network inference, and the common network properties are first introduced, and then several key ecological questions that can be addressed through network analysis are described.

Molecular Ecological Network of Microbial Communities, Fig. 1 The study of microbial ecology from species richness and diversity to interaction network

Phylogenetic Molecular Ecological Network for Microbial Communities

Owing to technical innovations in molecular biology, modern microbial taxonomy often relies on phylogenetic molecular markers, such as ribosomal RNA (rRNA) genes or some highly conserved coding genes (e.g., nifH, amoA, gyrB). In microbial diversity surveys, consequently, the operational taxonomic unit (OTU) is used to delimit microbial taxa by the similarity of those sequences (Achtman and Wagner 2008). Each OTU then represents a certain taxon, such as a species or a genus. The composition and diversity of microbial communities are thus actually based on molecular OTUs rather than individual species. Recently, due to the rapid development of high-throughput sequencing technology, large numbers of microbial diversity surveys have been carried out in various environmental habitats through small subunit (SSU) rRNA sequencing projects. These massive, community-wide, replicated metagenomic data provide unprecedented opportunities to infer the interaction networks in microbial communities (Raes and Bork 2008).

As a result, an ecological network generated from metagenomic data really reflects the
relationships among molecular OTUs. Therefore, such molecule-based ecological networks in microbial communities are referred to as molecular ecological networks (MENs) (Zhou et al. 2010). The networks derived from functional gene markers are referred to as functional molecular ecological networks (fMENs) (Zhou et al. 2010), and those based on phylogenetic gene markers as phylogenetic molecular ecological networks (pMENs) (Zhou et al. 2011).

Molecular Ecological Network of Microbial Communities, Fig. 2 The common steps of molecular ecological network (MEN) analysis. Two major parts are included: network inferences and network analyses. In each of them, several key steps are listed

Network Inference Approach

For metagenomic data analysis, the abundance of each gene marker in a sample is measured by the number of sequences for sequencing data or hybridization signal intensity for microarray data. Thereafter, the determined gene richness and abundance are used to describe the composition and structure of this microbial community. Based on such experimental data, a network graph can be constructed to illustrate the interactions of different gene markers (species) (Fig. 2). The process of constructing the connection diagram from the behavior of its components is known as network inference or reverse engineering (Faust and Raes 2012).

Various approaches for network inference have been developed and widely used in both genomic biology and ecology (Barabasi and Oltvai 2004; Faust and Raes 2012). Based on the mathematical algorithms, they can be classified into Bayesian network, relevance network, and ordinary and partial differential equation methods [reviewed by De Jong (2002)]. Besides, some graphical theory-based methods were recently developed (Kramer et al. 2009). Among them, the relevance network method is the most commonly used approach due to its simple calculation procedure and high noise tolerance (Deng et al. 2012). For the relevance network method, a similarity is first measured between each pair of OTUs. This similarity measurement can be Pearson, Spearman, biweight, and jackknife correlations or mutual information (Hardin et al. 2007).

For network inference, another critical step is to identify a true link (Fig. 2). The key question is how similar two nodes must be for a link to be considered true. The most commonly used way to choose the similarity threshold is based on biological knowledge: some true interactions are confirmed by previous experimental discovery, and the similarity values of those interactions are then used to determine the threshold for other links. A network constructed through such an arbitrary threshold is subjective rather than objective (Barabasi and Oltvai 2004). There are also several methodical ways of determining the similarity threshold,
such as the significance level of correlation (p value), false discovery rate (FDR), permutation test, and random matrix theory (RMT)-based methods. Among them, p value-based and permutation test-based methods give the least strict threshold and lead to large numbers of links in a messy network that may resemble a random network. The FDR-based method has the strictest threshold, which could generate a loosely connected network in which many true interactions might be ignored. The RMT-based algorithm has advantages in this step (Luo et al. 2007). This method is able to automatically identify a threshold based on the inherent properties of the similarity matrix. The results indicated that it robustly reveals meaningful relationships in high-throughput data in both genomics and ecology (Luo et al. 2007; Zhou et al. 2010, 2011).

Network Properties

After the species interactions have been inferred, many pMENs have been constructed for communities in different habitats, such as soil, ocean, groundwater, and the human gut (Deng et al. 2012; Faust and Raes 2012). Several common topological properties, such as small world, scale-free, and modularity (Table 1), were also observed in all kinds of pMENs, as in other biological networks from food webs in macro-ecology to complex regulation networks in molecular biology. These common network properties are important for the robustness and stability of complex systems (Barabasi and Oltvai 2004; Kitano 2004; Zhou et al. 2010, 2011).

The scale-free property is one of the most notable characteristics of complex systems. It describes the finding that most nodes in a system have few directly linked nodes (neighbors), while a few nodes have a large number of neighbors (Table 1). It implies that the roles of species in the microbial community might be quite different. A few microbial species could be generalists with higher connectivity, which are inclined to have closer relationships with environmental traits than other species (Zhou et al. 2011; Deng et al. 2012).

The “small world” property describes the fact that any two nodes in a network can be connected by passing through just a few linked neighbors (Table 1). The idea originates in sociology, where it is known as the six degrees of separation between us and everyone else on this planet. This property usually reflects the efficiency of a system and may be valuable for microbial communities. In a small-world community, energy, materials, and information can be easily transported within the entire system. In the microbial community, this characteristic drives efficient communication among different members so that relevant responses to environmental changes can be made rapidly (Zhou et al. 2011; Deng et al. 2012).

The modularity property describes the fact that a network can be decomposed into subnetworks, also called modules, according to its structure (Table 1). Each module in gene regulatory networks is considered a functional unit, which consists of several elementary genes and performs an identifiable task (Luo et al. 2007). Modularity in an ecological community may reflect habitat heterogeneity, physical contact, functional association, divergent selection, and/or phylogenetic clustering of closely related species (Olesen et al. 2007; Zhou et al. 2010). Also, microorganisms in the same module could have similar ecological niches (Zhou et al. 2011; Faust and Raes 2012).

Besides these three common properties, there are many other topological indexes that could be used to measure the organization and structure of microbial networks, such as clustering coefficient, hierarchy, density, transitivity, and connectedness [definitions and descriptions are given in Deng et al. (2012)]. All these could become valuable indexes for measuring microbial community structure in studies of microbial ecology.

Network Interpretation Aspects

Once the network graph is drawn, we should disclose the ecological meanings behind this structure. Several key ecological questions can be addressed through network analysis procedures (Fig. 2).
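As an illustration of the relevance network approach outlined under Network Inference Approach, the following minimal sketch computes a Pearson correlation for every pair of OTUs and keeps the pairs whose absolute correlation reaches a cutoff. The abundance values, OTU names, and the fixed cutoff of 0.9 are hypothetical; in particular, the cutoff is supplied by hand here and is not derived by the RMT-based procedure, which is not implemented in this sketch.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def relevance_network(abundance, threshold):
    """Return {(otu_a, otu_b): r} for OTU pairs with |r| >= threshold."""
    links = {}
    for otu_a, otu_b in combinations(sorted(abundance), 2):
        r = pearson(abundance[otu_a], abundance[otu_b])
        if abs(r) >= threshold:
            links[(otu_a, otu_b)] = r
    return links

# Hypothetical abundances of four OTUs across the same six samples.
abundance = {
    "OTU1": [10, 20, 30, 40, 50, 60],
    "OTU2": [21, 39, 62, 80, 99, 121],
    "OTU3": [5, 7, 4, 6, 5, 8],
    "OTU4": [50, 40, 30, 20, 10, 5],
}

links = relevance_network(abundance, threshold=0.9)
for pair, r in links.items():
    print(pair, round(r, 3))
```

The sign of each retained correlation distinguishes positive from negative associations. Raising or lowering the cutoff changes how many links survive, which is why the choice of threshold is treated as a critical step in network inference.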
Molecular Ecological Network of Microbial Communities, Table 1 The most commonly used topological indexes and properties for complex networks

Network property: Connectivity
Mathematical measurement: $k_i = \sum_{j=1}^{m} a_{ij}$, where $m$ is the number of all neighbors (linked nodes) of node $i$ and $a_{ij}$ is the strength between nodes $i$ and $j$. For the unweighted network, $k_i$ equals the number of neighbors
Ecological implication: It was used to describe the number of interactions of each node, also named as node degree. In most complex systems, the nodes with the highest connectivity always played crucial roles and were usually considered as network centers. In pMEN, the study found that nodes with higher connectivity were inclined to have closer relationships with environmental traits (Zhou et al. 2011; Deng et al. 2012)

Network property: Scale-free
Mathematical measurement: $P(k) \sim k^{-\gamma}$, where $P(k)$ is the number of nodes with $k$ degrees, $k$ is connectivity, and $\gamma$ is a constant
Ecological implication: In most cases, the connectivity distributions of pMEN and other complex systems follow this power law, indicating most nodes in a network have few neighbors, while few nodes have large amount of neighbors. This phenomenon suggests the most species in the communities are peripherals, but a few of the species could be generalists and play more important roles than others

Network property: Small world
Mathematical measurement: $GD = \frac{\sum_{i \neq j} d_{ij}}{n(n-1)}$. GD is the abbreviation of the average geodesic distance, where $d_{ij}$ is the shortest path between nodes $i$ and $j$, and $n$ is the total number of all nodes
Ecological implication: A smaller GD means all the nodes in the network are closer, indicating each two nodes in the network could be connected by a small number of acquaintances, and so-called small-world network. Most pMENs are small-world network, which imply that the energy, materials, and information can be easily transported through entire systems (Deng et al. 2012)

Network property: Modularity
Mathematical measurement: $M = \sum_{b=1}^{N_M} \left[ \frac{l_b}{L} - \left( \frac{K_b}{2L} \right)^2 \right]$, where $N_M$ is the number of modules in the network, $l_b$ is the number of links among all nodes within the $b$th module, $L$ is the number of all links in the network, and $K_b$ is the sum of degrees (connectivity) of nodes which are in the $b$th module
Ecological implication: Modularity property was used to demonstrate a network which could be naturally divided into subcommunities, so-called modules. A modularity value can be calculated by Newman’s method (Newman 2006) whose value is between 0 and 1. Modularity in an ecological community may reflect habitat heterogeneity, physical contact, functional association, divergent selection, and/or phylogenetic clustering of closely related species (Olesen et al. 2007; Zhou et al. 2010)
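The indexes in Table 1 can be evaluated directly on a small example. The sketch below is an illustrative calculation only: it assumes an unweighted, undirected, connected network stored as a neighbor dictionary, and a module assignment that is given in advance rather than detected. The six-node graph (two triangles joined by one link) and all function names are hypothetical.

```python
from collections import deque

def connectivity(adj):
    """Node degree k_i for an unweighted network (number of neighbors)."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def average_geodesic_distance(adj):
    """GD = sum of shortest-path lengths over ordered pairs / (n * (n - 1)).

    Uses breadth-first search; assumes the network is connected.
    """
    n = len(adj)
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))

def modularity(adj, modules):
    """M = sum over modules of l_b / L - (K_b / (2 L))^2, as in Table 1."""
    k = connectivity(adj)
    L = sum(k.values()) / 2          # every link is counted from both ends
    M = 0.0
    for members in modules:
        members = set(members)
        l_b = sum(1 for u in members for v in adj[u] if v in members) / 2
        K_b = sum(k[u] for u in members)
        M += l_b / L - (K_b / (2 * L)) ** 2
    return M

# Hypothetical network: two triangles (a-b-c and d-e-f) joined by the link c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
modules = [{"a", "b", "c"}, {"d", "e", "f"}]

k = connectivity(adj)
gd = average_geodesic_distance(adj)
m = modularity(adj, modules)
print(k, gd, round(m, 4))
```

In this toy network the two nodes that bridge the triangles have the highest connectivity, and the modularity is positive because most links fall inside the two predefined modules.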
Identify Key Populations/Species in the Community Based on Network Topology

In a scale-free network, the roles of nodes in the community could be quite different. Most nodes are just peripheral, and they contribute less to the network structure and stability. But a few of the nodes may be located in the core of the network, and removing one of them will largely change the network structure. These key nodes could be identified by multiple network indexes, such as connectivity, stress centrality, betweenness, eigenvector centrality, clustering coefficient, and vulnerability [definitions and descriptions are given in Deng et al. (2012)]. The nodes with higher indexes may carry out different functions for the network structure. For example, the nodes with highest connectivity are commonly regarded as centers in the network, while the nodes with highest betweenness usually serve as bridges to connect other nodes. Therefore, in pMEN these key nodes representing species also could play different roles for the
microbial community. Previous results showed that nodes with higher connectivity tended to have closer relationships with environmental traits in pMENs (Zhou et al. 2011; Deng et al. 2012), indicating that they could be more important than other species in responding to environmental change.
The key nodes can also be determined based on the nodes' roles in their own modules. In a module-separated network, the topological role of a node can be defined by two parameters, within-module connectivity ($z_i$) and among-module connectivity ($P_i$) (Guimera and Amaral 2005), and the roles of nodes can be illustrated in a ZP plot [see Fig. 4c in Deng et al. (2012)]. According to the values of $z_i$ and $P_i$, the roles of nodes are classified into four categories: peripherals, connectors, module hubs, and network hubs. In a pMEN, peripherals could represent specialists, while module hubs and connectors are close to generalists, and network hubs act as supergeneralists (Deng et al. 2012).

Associations Between Network Structures and Environmental Factors
The correlations between pMEN topology and environmental factors can be examined in both direct and indirect ways. Indirectly, the OTU significance (GS) is first calculated, which is the square of the correlation coefficient ($r^2$) between the OTU abundance profile and an environmental factor. A higher GS value indicates that the species fits the variance of the environmental factor better than species with lower GS values. Thereafter, the correlation between GS and the nodes' topological indexes (e.g., connectivity, betweenness) measures the relationship of network topology with environmental factors (Deng et al. 2012). The correlation can be calculated either with the Pearson correlation for the GS of a single environmental factor or with Mantel and partial Mantel tests for the GS of multiple environmental factors (Zhou et al. 2011; Deng et al. 2012).
The correlations between module-based eigengenes and environmental factors can detect the modules' response to environmental variance. The module eigengene, obtained through singular value decomposition (SVD), is the most representative variable for all the OTUs within a module (Langfelder and Horvath 2007). Eigengene network analysis can reveal the network organization at the module level and directly test the correlation between modules and environmental factors (Deng et al. 2012). Because the taxa in a module could be functionally associated with overlapping ecological niches (Faust and Raes 2012), the module eigengenes are able to distinguish the module functions by their associations with environmental factors.

Network Comparisons for Microbial Communities Under Different Conditions, Locations, or Across Time Series
To analyze how the environment affects network structures and species interactions, networks are constructed and compared under different experimental conditions, at different geographic locations, or across a time series. Various network indexes can be evaluated among different communities in terms of network sensitivity and robustness, but since only a single value is available for each network, standard statistical analyses cannot be used to assess the statistical significance of differences. Thus, randomized networks are introduced to generate a null model for each identified network. Different methods randomize the network differently; the commonly used Maslov-Sneppen method keeps the numbers of nodes and links unchanged but rewires the positions of all links in the pMEN, so that the sizes of the networks are the same and the randomly rewired networks are comparable with the original ones (Maslov and Sneppen 2002). This method has typically been used for ecological network analyses. For each identified network, usually a total of 100 randomized networks are generated, and therefore, all network indexes can be calculated 100 times. Then the average and standard deviation for each index of all random networks are obtained. The statistical
Z-test can then test the differences in the indices between the MEN and the random networks. Meanwhile, to compare network indices under different conditions, the Student t-test can be employed using the standard deviations derived from the corresponding randomized networks (Deng et al. 2012).
Besides the overall network topology, network comparisons can also be performed at different levels and in different aspects, such as node overlaps, module preservations, topological roles of individual nodes, and network hubs among different networks (Zhou et al. 2011). The changes at different levels suggest switches of species interactions under different conditions that could be ecologically important for the microbial community in dealing with environmental changes.

Summary
The current studies of microbial networks are still limited but rapidly growing. From network inference to network interpretation, there are many fundamental ecological concerns that have not been well addressed, such as how well the modeled networks reflect the real interactions among microbial species, whether these interactions are casual or fixed under different environmental conditions, and how to classify the types of interactions among microbial species (i.e., mutualistic or trophic). Caution must be taken when interpreting the underlying mechanisms that shape microbial communities through the present network analysis.
Nevertheless, by taking advantage of rapid technical advances, microbial ecology studies can be performed at a new level, that of network inference. Consequently, through the analysis of network structures, previously ignored interactions among microbial species could be revealed and their responses to environmental changes could be disclosed. With the development and refinement of methodology, studies of microbial interaction networks will attract more and more attention in the near future.

References
Achtman M, Wagner M. Microbial diversity and the genetic nature of microbial species. Nat Rev Microbiol. 2008;6(6):431–40.
Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5(2):101–15.
Bascompte J, Jordano P. Plant-animal mutualistic networks: the architecture of biodiversity. Annu Rev Ecol Evol Syst. 2007;38:567–93.
De Jong H. Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol. 2002;9(1):67–103.
Deng Y. Microarray data analysis. In: He Z, editor. Microarrays: current technology, innovations and applications. Norwich: Horizon Scientific Press; 2013.
Deng Y, Jiang YH, et al. Molecular ecological network analyses. BMC Bioinforma. 2012;13(1):113.
Faust K, Raes J. Microbial interactions: from networks to models. Nat Rev Microbiol. 2012;10(8):538–50.
Guimera R, Amaral LAN. Functional cartography of complex metabolic networks. Nature. 2005;433(7028):895–900.
Hardin J, Mitani A, et al. A robust measure of correlation between two genes on a microarray. BMC Bioinforma. 2007;25:8.
Kitano H. Biological robustness. Nat Rev Genet. 2004;5(11):826–37.
Kramer N, Schafer J, et al. Regularized estimation of large-scale gene association networks using graphical Gaussian models. BMC Bioinforma. 2009;24:10.
Langfelder P, Horvath S. Eigengene networks for studying the relationships between co-expression modules. BMC Syst Biol. 2007;1:54.
Luo F, Yang Y, et al. Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinforma. 2007;8:299.
Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science. 2002;296(5569):910–3.
Montoya JM, Pimm SL, et al. Ecological networks and their fragility. Nature. 2006;442(7100):259–64.
Newman MEJ. Modularity and community structure in networks. Proc Natl Acad Sci U S A. 2006;103(23):8577–82.
Olesen JM, Bascompte J, et al. The modularity of pollination networks. Proc Natl Acad Sci U S A. 2007;104(50):19891–6.
Pimm SL. Food webs: University of Chicago Press; 2002.
Raes J, Bork P. Molecular eco-systems biology: towards an understanding of community function. Nat Rev Microbiol. 2008;6(9):693–9.
Zhou J, Deng Y, et al. Functional molecular ecological networks. mBio. 2010;1(4):e00110–69.
Zhou J, Deng Y, et al. Phylogenetic molecular ecological network of soil microbial communities in response to elevated CO2. mBio. 2011;2(4):e00122–11.
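
The comparison of an observed network index with randomized networks, as described in the section on network comparisons, can be sketched as follows. This is a minimal illustration under stated assumptions, not the published pipeline: the helper names (`rewire`, `count_triangles`, `degrees`, `z_score`), the toy network, the use of a triangle count as the index, and the number of swaps per randomization are all hypothetical choices. The rewiring step swaps the end points of two randomly chosen links, which keeps the numbers of nodes and links and every node's degree unchanged, in the spirit of the Maslov-Sneppen procedure.

```python
import random
import statistics
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    return Counter(node for edge in edges for node in edge)


def rewire(edges, n_swaps, rng):
    """Degree-preserving rewiring: (a,b),(c,d) -> (a,d),(c,b)."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop or reuse a node
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # swap would create a duplicate link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def count_triangles(edges):
    """Example network index: number of triangles in the network."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    # every triangle is seen once from each of its three links
    return sum(len(nbrs[u] & nbrs[v]) for u, v in edges) // 3


def z_score(observed, random_values):
    """Z statistic of the observed index against the random networks."""
    sd = statistics.stdev(random_values)
    if sd == 0:
        raise ValueError("index does not vary among random networks")
    return (observed - statistics.mean(random_values)) / sd


# Toy network (illustrative only): two triangles joined by one link.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
rng = random.Random(0)
random_indexes = [
    count_triangles(rewire(toy_edges, 10 * len(toy_edges), rng))
    for _ in range(100)  # 100 randomized networks, as in the text
]
z = z_score(count_triangles(toy_edges), random_indexes)
```

Any other network index can be substituted for `count_triangles`; the Z statistic only needs the observed value and the values from the randomized networks.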
Monitoring Lactic Acid Bacterial Diversity During Shochu Fermentation

Akihito Endo
Department of Food and Cosmetic Science, Faculty of Bioindustry, Tokyo University of Agriculture, Abashiri, Hokkaido, Japan

Synonyms
Beneficial microbe; Lactic acid bacteria

Definition
The composition of lactic acid bacteria during fermentation of a traditional Japanese distilled spirit is reviewed.

Introduction
Shochu is a Japanese distilled spirit made from several starchy materials. Fermentation of alcoholic beverages is usually carried out by a combination of several microorganisms. Lactic acid bacteria (LAB) are well known to play beneficial roles in several food fermentations, including dairy products, vegetables, and meat. This bacterial group is also beneficial for fermentation of beverages, e.g., aroma production and reduction of acid level in wines and growth prevention of spoilage microorganisms in sake. Recent culture-dependent and culture-independent studies revealed that LAB can be seen in fermentation mashes of shochu. In the present chapter, lactic acid bacterial diversity during shochu fermentation is briefly reviewed.

Shochu Fermentation
Shochu is a popular Japanese distilled spirit and is mainly produced in the south Kyushu area of Japan. Rice, sweet potato, or barley is usually used as the main ingredient. The fermentation process consists of three stages, i.e., koji production, yeast-seed production, and alcoholic fermentation. Koji mold (Aspergillus niger or A. kawachii) saccharifies starches to glucose by amylases, and yeast, Saccharomyces cerevisiae, is responsible for alcoholic fermentation. Such microorganisms are inoculated into the fermentation as starters. Rice or barley is usually used as a koji ingredient, and the main ingredient is added to the mash at the beginning of the alcoholic fermentation stage. During the saccharification, the mold produces large amounts of citric acid, resulting in a significant decrease of pH. The pH in the yeast-seed production stage is therefore between 3.0 and 3.5 and that in the alcoholic fermentation stage between 4.0 and 4.5. Alcoholic concentration at the end of fermentation is 14–17 % (v/v). Because of such a harsh environment for bacterial survival, bacterial diversity had been considered to be poor, and very few studies have been done on this microbiota so far.

LAB Diversity in Fermentation Mashes of Shochu
Culture-dependent and culture-independent studies have been carried out to study LAB diversity in fermentation mashes of shochu. The LAB population in yeast-seed is generally low, i.e., below 10^5 CFU/ml of mash as determined by culturing and real-time quantitative PCR. Poor LAB diversity (0–2 species in each mash) is seen in the yeast-seed. Lactobacillus plantarum, L. paracasei, Weissella confusa/cibaria, Leuconostoc citreum, and Enterococcus faecium have been found in the mashes by denaturing gradient gel electrophoresis (DGGE) combined with LAB-specific primers (Endo 2005; Endo and Okada 2005b). LAB diversity in the alcoholic fermentation stage is dependent on the variety of main ingredients. Sweet potato generally produces a higher population and richer diversity of LAB than rice or barley. This might be due to differences of nutrition between the main
ingredients. Sweet potato mashes contain 10^4–10^8 CFU/ml of LAB and rice or barley mashes contain 10^4–10^5 CFU/ml or less. Lactobacillus brevis, L. fermentum, L. helveticus, L. hilgardii, L. kefiri, L. nagelii, L. paracasei, L. pentosus, L. plantarum, Leuconostoc mesenteroides, Leuc. citreum, Leuc. lactis, Lactococcus lactis, Enterococcus faecium, Pediococcus pentosaceus, and W. confusa/cibaria have been found in alcoholic fermentation mashes made from sweet potato (Endo 2005; Endo and Okada 2005b). Lactobacillus satsumensis is a novel species found in the mashes (Endo and Okada 2005a). Of these species, W. confusa/cibaria is the most frequently seen. Several species found in this fermentation can also be seen in wine fermentation. This might be due to similar harsh environments (high alcohol content and low pH) in the fermentation of the two alcoholic beverages. Mashes made from rice or barley contain poorer LAB diversity than those made from sweet potato.
An interesting DNA sequence, which was characterized as uncultured Leuconostoc sp., was found in yeast-seed by DGGE profile. BLAST analysis of the sequence revealed low similarities (below 95 %) against known Leuconostoc spp. but high similarities (99.3 %) against uncultured Leuconostoc spp. (accession nos. EU469745 and AJ405013) (Endo 2005, 2011), suggesting the presence of unknown LAB in shochu mashes. The population of the organism is approximately 10^8 CFU/ml as determined by qPCR. Because of its predominance in the yeast-seed, it might be an acid-tolerant Leuconostoc sp., although Leuconostoc spp. are known to be acid sensitive.
Most of the LAB strains found in shochu mashes were resistant to 10–15 % (v/v) of alcohol, and, moreover, they were able to grow at pH 3.5 (Endo 2011). Very few strains were able to grow at pH 3.0. Most of the strains metabolize citrate in the presence of glucose. These characteristics suggest that LAB seen in shochu mashes have adapted to their habitat. Citrate metabolism by LAB produces several aroma compounds, including diacetyl, acetoin, and acetic acid. These compounds have both positive and negative impacts on the quality of the final product. Proper management of LAB might therefore yield shochu of better quality.

Summary
Shochu is a traditional Japanese distilled spirit made from starchy materials. During the fermentation, Aspergillus spp. saccharify the ingredients and Saccharomyces cerevisiae carries out alcoholic fermentation. Aspergillus spp. produce large amounts of citric acid during the fermentation, which protects the fermentation from spoilage. LAB generally have a low population and poor diversity at the beginning of fermentation (yeast-seed stage), but their population and diversity increase in the later fermentation (alcoholic fermentation stage). W. confusa/cibaria, Lactobacillus spp., and Leuconostoc spp. are usually seen in the fermentation. These LAB are able to survive in an alcoholic and acidic environment, suggesting that they have adapted to their habitat.

Cross-References
▶ Culturing
▶ Evaluating Putative Chimeric Sequences from PCR-amplified Products

References
Endo A. Lactic acid bacterial diversity during shochu fermentation. PhD thesis, Tokyo University of Agriculture; 2005.
Endo A. Diversity of lactic acid bacteria in fermented products. Jpn J Lactic Acid Bact. 2011;22:87–92.
Endo A, Okada S. Lactobacillus satsumensis sp. nov., isolated from mashes of shochu, a traditional Japanese distilled spirit made from fermented rice and other starchy materials. Int J Syst Evol Microbiol. 2005a;55:83–5.
Endo A, Okada S. Monitoring the lactic acid bacterial diversity during shochu fermentation by PCR-denaturing gradient gel electrophoresis. J Biosci Bioeng. 2005b;99:216–21.

MRL and SuperFine+MRL

Tandy Warnow
Institute for Genomic Biology, University of Illinois, IL, USA

Synonyms
Phylogeny = phylogenetic tree = tree; MRL = matrix representation with likelihood; MRP = matrix representation with parsimony

Introduction
The estimation of evolutionary trees is one of the basic challenges in biology (Felsenstein 2003), but current methods have great difficulties with large datasets, often due to computational issues. For example, methods like maximum likelihood (ML) and maximum parsimony (MP) are highly accurate techniques when they can be properly run, but both are NP-hard (a technical term that has the consequence that exact algorithms are not likely to be found except through exhaustive search techniques). As a result, ML and MP analyses on large datasets either cannot be run at all, or take a very long time to run, or return poor results. Since most accounts of the number of species suggest that the Tree of Life itself will involve many millions of species, truly large-scale phylogenetic estimation is beyond the reach of current methods. Instead, an alternative approach has been proposed, in which different research groups would calculate phylogenetic trees on subsets of the species set and then these trees would be combined into a tree on the full dataset. The techniques that combine trees into a tree on the full taxon set are called “supertree methods,” and the resultant large tree is called a “supertree.”
There are many supertree methods (surveyed in Bininda-Emonds 2004), but matrix representation with parsimony (MRP), developed in Baum (1992) and Ragan (1992), is the most well known and most frequently used. MRP operates in two steps: first the input source trees are each represented by a matrix over {0,1,?}, where each row represents a species and each column represents a branch in the source tree. These matrices are then concatenated together to form the “MRP matrix.” Finally, this matrix is analyzed using maximum parsimony heuristics, where maximum parsimony is the NP-hard optimization problem that seeks to find a tree on the species set with the smallest number of total changes.
In Swenson et al. (2011), MRP was compared to a collection of other supertree methods and found to be the most reliable with respect to accuracy and ability to analyze large datasets. However, that study also showed that the Quartets MaxCut (QMC) method developed by Snir and Rao (2012) was more accurate than MRP for those datasets on which QMC was able to run. An interesting variant on MRP was developed by Nguyen et al. (2012), in which the MRP matrix was analyzed under maximum likelihood, using a symmetric two-state model. This method, called MRL for “matrix representation with likelihood,” was shown to be more accurate than MRP on simulated datasets.
Thus, while MRP remains the most frequently used supertree method, MRL and QMC are new supertree methods that offer some advantages over MRP; furthermore, new supertree methods continue to be developed.
In Swenson et al. (2012), a new technique was developed called “SuperFine.” This is a meta-method that can be used with any supertree method (e.g., MRP, MRL, QMC, etc.), to produce a modified supertree method. For example, when SuperFine is used with MRP, it is referred to as SuperFine+MRP, and when it is used with MRL, it is referred to as SuperFine+MRL. SuperFine has two steps. The first step computes a “strict consensus merger” (SCM) tree from the set of input trees, where the SCM tree contains many high degree nodes (“polytomies”). The second step refines this tree by using the base method to refine each polytomy. The refinement around each polytomy is performed by encoding each of the source trees on a new leafset {1, ..., d}, where d is the degree of the polytomy; these
MRL and SuperFine+MRL, Fig. 1 We present tree error and running times (in minutes) for supertree methods on ten replicates of 1,000-taxon datasets. The method given parenthetically indicates the heuristic used to solve MRP or MRL (e.g., PAUP* for MP and FastTree-2 (Price et al. 2010) or RAxML for ML). The scaffold density refers to the percentage of the taxa that are in the “scaffold” dataset. We show standard error for the missing branch rates, and the standard deviation for the running times. Averages are computed for those replicates with sufficient taxonomic overlap to perform an accurate supertree analysis: n = 10 for all scaffold densities except n = 7 for the 20% scaffold density and n = 9 for 50% scaffold density (reproduced (with permission from the publisher) from Nguyen et al. (2012, 7:3))
smaller source trees are then passed to the base supertree method, which computes a supertree on {1, ..., d}, and this supertree replaces the polytomy. The refinements around the polytomies can be performed in parallel since they are independent. Hence, the second step is not only very fast, but very easily parallelized. In Swenson et al. (2012), they showed that SuperFine+MRP gave much more accurate trees than MRP and was also much faster. They also compared SuperFine+QMC and QMC and showed similar improvements. Finally, Nguyen et al. (2012) compared SuperFine+MRL and MRL and showed similar improvements. Thus, SuperFine is a method that can improve supertree methods.
A comparison between these different methods (SuperFine+MRP, SuperFine+MRL, MRP, and MRL) is shown in Fig. 1. The experiment involves gene trees that evolve within a species tree under a birth-death process, and so may not contain all the taxa; however, some genes are universal and so contain all the taxa. These genes are then used to evolve sequences under different sequence evolution models. There are two types of gene trees: “clade-based” gene trees that are restricted to clades in the species tree and “scaffold” trees that are used to link the clade-based trees together. The scaffold trees are produced by random sampling of the taxa and then using the universal genes to construct a source tree. Thus, the clade-based source trees contain a subset of the taxa in a clade, while the scaffold trees contain a random subset of the taxa, but may, in some cases, contain all the taxa. The scaffold density refers to the percentage of the taxa in the scaffold tree. Scaffold source trees for the supertree problem are produced by selecting a scaffold density and then concatenating the alignments from the universal genes on the randomly selected scaffold taxa and computing a maximum likelihood tree on the concatenated alignment. Similarly, the clade-based source trees are computed by selecting a clade and then finding the genes that provide the best coverage for that clade (from the clade-based genes), concatenating the alignments, and computing a maximum likelihood tree on the concatenated alignments. Finally, supertrees are computed on the source trees using MRP, MRL, SuperFine+MRP, and SuperFine+MRL. The resultant species trees are then compared to each other with respect to the missing branch rate and running time. Figure 1 shows results obtained on 1,000-taxon simulated datasets and demonstrates that SuperFine+MRP
and SuperFine+MRL provide the best accuracy of all methods and are much faster than the other methods. It also shows that MRL outperforms MRP with respect to accuracy under low scaffold density conditions. Finally, MRP is “solved” using MP heuristics in PAUP* (Swofford 2003), while MRL is “solved” using either FastTree-2 (Price et al. 2010) or RAxML (Stamatakis 2006). Note that the choice of ML heuristic has an impact on the running time and accuracy of the MRL method. Note also that SuperFine+MRL(FastTree) matches the accuracy of SuperFine+MRP(PAUP*) but is much faster.

Summary
The construction of a large phylogeny, potentially spanning the Tree of Life, is considered to be one of the hardest computational problems in biology. A central approach to this problem involves using supertree methods, which combine source trees, each on a subset of the species, into a tree on the full set of species. While many supertree methods have been developed, MRP (matrix representation with parsimony) is the most well known and most frequently used supertree method. However, newer supertree methods, including MRL (matrix representation with likelihood) and QMC (Quartets MaxCut), have been introduced that provide accuracy comparable to or better than that of MRP. Finally, a new technique for “boosting” supertree methods has been developed. This method, called “SuperFine,” operates in two steps, where the first step constructs a consensus tree from the source trees and the second step uses the base supertree method to refine the consensus tree. Simulations show that SuperFine improves the accuracy of MRP, MRL, and QMC while also reducing the running time used by these methods. Thus, SuperFine is a general purpose meta-method for improving supertree estimation.

Funding
This work was supported by NSF grant DEB 0733029 to T.W.

References
Baum BR. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon. 1992;41:3–10.
Bininda-Emonds O, editor. Phylogenetic supertrees: combining information to reveal the tree of life. Dordrecht: Kluwer Academic Publishers; 2004.
Felsenstein J. Inferring phylogenies. Sunderland: Sinauer Associates; 2003.
Nguyen N, Mirarab S, Warnow T. MRL and SuperFine+MRL: new supertree methods. Algoritm Mol Biol. 2012;7:3.
Price M, Dehal P, Arkin A. FastTree 2 – approximately maximum likelihood trees for large alignments. PLoS ONE. 2010;5:e9490.
Ragan MA. Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol. 1992;1:53–8.
Snir S, Rao S. Quartet MaxCut: a fast algorithm for amalgamating quartet trees. Mol Phylogenet Evol. 2012;62:1–8.
Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–90.
Swenson MS, Suri R, Linder CR, et al. An experimental study of Quartets MaxCut and other supertree methods. Algoritm Mol Biol. 2011;6:7.
Swenson M, Suri R, Linder CR, et al. SuperFine: fast and accurate supertree estimation. Syst Biol. 2012;61:214–27.
Swofford DL. PAUP*: phylogenetic analysis using parsimony (*and other methods), version 4. Sinauer Associates. 2003.
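
The matrix encoding used by MRP and MRL, described in the Introduction, can be illustrated with a minimal sketch. It assumes rooted source trees written as nested tuples of taxon names and the usual clade-based coding (1 for taxa inside the clade defined by a branch, 0 for other taxa of the same source tree, ? for taxa absent from that tree). The function names and the two toy source trees are hypothetical, and the resulting matrix would still have to be passed to a parsimony or likelihood heuristic, which is not shown here.

```python
def leaves(tree):
    """Set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set below every internal branch (root excluded)."""
    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            yield frozenset(leaves(node))
        for child in node:
            yield from walk(child, False)
    yield from walk(tree, True)


def mrp_matrix(source_trees):
    """Concatenate the {0,1,?} encodings of all source trees.

    Returns a dict: taxon -> string with one character per branch.
    """
    taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in source_trees:
        in_tree = leaves(tree)
        for clade in clades(tree):
            for taxon in taxa:
                if taxon not in in_tree:
                    rows[taxon].append("?")   # taxon missing from this tree
                elif taxon in clade:
                    rows[taxon].append("1")   # inside the clade
                else:
                    rows[taxon].append("0")   # in the tree, outside the clade
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two toy source trees on overlapping taxon sets (illustrative only).
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("B", "C"), "E")
matrix = mrp_matrix([tree1, tree2])
# matrix == {"A": "10?", "B": "101", "C": "011", "D": "01?", "E": "??0"}
```

Each column corresponds to one internal branch of one source tree, so the first two columns come from the first toy tree and the third from the second; taxon E is absent from the first tree and therefore receives ? in its columns.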
N
New Computational Methodologies to Understand Microbial Diversity

Haiwei Luo
Department of Marine Sciences, University of Georgia, Athens, GA, USA

Synonyms
Bioinformatic methods for exploring genetic diversity; Methods for metagenomic sequence analysis

Definition
Microbial diversity is broadly defined as genetic variation in natural microbial populations.

Introduction
Metagenomics studies the genetic materials of a natural microbial community recovered from an environmental sample. A typical metagenomic study involves two major steps, including an initial experimental stage for genetic material extraction and sequencing and a following stage using standard bioinformatic tools for molecular sequence analysis. The present review, however, focuses on several recently developed computational methods that are designed to explore ecological diversity of microbial populations through analyzing published metagenomic databases. Although these methods have only been used to mine metagenomic data sets from the oceans, they can be easily adapted to those from any other environments.

An Ensemble Machine Learning Method to Predict Protein Subcellular Localization of Metagenomic Sequences
Bacteria consume dissolved organic matter (DOM) through hydrolysis, transport, and intracellular metabolism, and these activities occur in distinct subcellular localizations. Therefore, investigation of protein and proteome subcellular localization is likely to improve our understandings about how bacteria interact with DOM.
Many computational algorithms have been developed to predict the subcellular localization of proteins. These algorithms employ a variety of supervised machine learning techniques and different information sources to make predictions. They can be generally classified into three types. One type of methods explores the presence/absence of signal peptides or specific protein domains, such as SignalP (Dyrlov Bendtsen et al. 2004) and Phobius (Käll et al. 2007). These methods require protein sequences to be complete. Metagenomic peptides, however, are often fragmentary, making these methods not applicable. The second type, such as Proteome Analyst, uses localization information from
well-annotated homologous sequences identified by BLAST. It is not suitable for making discoveries about a protein family with different subcellular localizations. The third type of methods builds machine learning models (e.g., support vector machine) and predicts protein localization using features, such as amino acid/dipeptide compositional bias, physicochemical properties of amino acids, and others. Since these sequence features are derived from whole protein sequences, most algorithms in this category are minimally affected by the incompleteness of peptide sequences. Examples are CELLO (Lu et al. 2004), SUBLOC (Hua and Sun 2001), and PSLDOC (Chang et al. 2008). Only the third approach is useful in the case of metagenomic peptides which are often fragmentary.
Because all algorithms have their own bias, the predictions from individual algorithms in the third category are frequently inconsistent. This is related to the fact that sorting signals targeting different subcellular locations usually share some similarities. For example, sorting signals targeting the periplasm and outer membrane both have N-terminal positively charged regions. In this case, prediction algorithms usually have some ambiguity in distinguishing these neighboring compartments. When an algorithm predicts a protein as a periplasmic protein with the highest confidence, it also implies that the protein has a probability of being located in its neighboring compartments, including the cytoplasm, inner membrane, outer membrane, and extracellular space, with higher probability assigned to the locations closest to the periplasm. Indeed, neighboring compartments are usually reported as suboptimal predictions by the component algorithms (CELLO, SUBLOC, and PSLDOC). The recently proposed MetaP algorithm considers neighborhood relations among subcellular localizations and also suboptimal predictions. It thus has the benefit of resolving conflicting predictions by the base algorithms and achieves higher precision and accuracy of prediction (Luo et al. 2009).
The predicted location of MetaP for a sequence s is the one that has the maximum sum of weighted voting for that subcellular localization. The prediction can be denoted formally as $P_s = \arg\max_i \sum_{j=1}^{N} P(i, j)$, where $N$ is the total number of base predictors and $i$ is the index of a predicted subcellular compartment: cytoplasmic ($i = 1$), cytoplasmic membrane ($i = 2$), periplasmic ($i = 3$), outer membrane ($i = 4$), and extracellular ($i = 5$). $P(i, j)$ denotes the voting weight of the prediction of the $j$th base predictor for compartment $i$. It is defined as $P(i, j) = \sum_{k=0}^{M_j} 2^{-|C_k - i|} W_k$, where $M_j$ is the number of predictions of the $j$th predictor. This means that the voting weight of a prediction by the $j$th predictor for compartment $i$ depends on the offset of the index $C_k$ of its predicted class with regard to the index $i$ as well as on its normalized score $W_k$. The voting weight $W_k$ for the $k$th prediction is defined on the basis of its relative score by comparison with all other predictions made by this algorithm. Because raw scores of predictions from different component base algorithms are not directly comparable, the raw score $S_k$ is converted into a normalized probability $p(k) = p(S < S_k)$ by calculating the percentage of predictions with lower raw scores among all predictions for a given algorithm. $W_k$ is then defined as $W_k = p(k)$.
The performance of MetaP and the component algorithms was evaluated using sets of testing sequences whose localizations were verified by experiments (Menne et al. 2000). For the purpose of testing the accuracy of fragmentary protein prediction, the N-terminal of the testing sequences is removed. This benchmark test showed that MetaP makes more accurate predictions of fragmentary peptide sequences than any component method.
MetaP was applied to several protein families of alkaline phosphatases using the Global Ocean Sampling (GOS) metagenomic data sets (Luo et al. 2009). Alkaline phosphatases are major hydrolytic enzymes of organic phosphoesters, which are the dominant forms of dissolved organic phosphorus in the ocean and provide an important source to meet bacterial phosphorus requirements. It was thought that marine bacterial alkaline phosphatases are exclusively
New Computational Methodologies to Understand Microbial Diversity, Fig. 1 Subcellular localization distributions of APases recovered from the GOS metagenomic database (figure adapted from Luo et al. 2009)
ectoenzymes. However, MetaP predicted that about 40 % of the alkaline phosphatases are located in the cytoplasm (Fig. 1). Further bioinformatic analysis suggested that the cytoplasmic alkaline phosphatases might play a role in hydrolyzing the imported small organic phosphorus compounds. In addition, application of MetaP to a metatranscriptomics data set showed diel variations in the fraction of transcripts encoding inner membrane and periplasmic proteins compared to cytoplasmic proteins (Fig. 2), suggesting a close coupling of photosynthetic extracellular release and bacterial consumption (Luo 2012).

An Evolutionary Genetic Method to Classify Metagenomic Reads Taxonomically

Metagenomic DNA represents the genetic potential of the microbial community in an environment. Due to its unbiased nature, the majority of a metagenomic sample consists of DNA from the abundant microbial lineages. Therefore, it provides raw material for studying the genome content of the abundant taxa in nature. On the other hand, metagenomic DNA is a mixture from all microbes in the sample, making it difficult to study the genome content of a specific microbial lineage in a systematic way. It is therefore important to develop high-throughput computational approaches to systematically classify metagenomic genes taxonomically. This would lead to an improved understanding of the ecological functions of the abundant taxa in nature.
Definitively assigning sequences from diverse metagenomic data sets to taxonomic groups is problematic, however. Most applications rely on BLAST-based (Altschul et al. 1997) identification of best hits to an annotated sequence database. While the BLAST best hit approach is easy to use, its accuracy is decidedly influenced by the composition of the annotated database. Thus, a substantial fraction of best BLAST hits may not be the closest relatives phylogenetically, an issue that is exacerbated when taxonomic groups are not evenly represented in the database (Koski and Golding 2001). A second type of method employs machine learning principles to classify metagenomic reads based on nucleotide sequence characteristics (McHardy et al. 2007). These methods are also subject to the high
New Computational Methodologies to Understand Microbial Diversity, Fig. 2 Differential gene expression in protein subcellular localizations between day and night in surface waters of North Pacific Subtropical Gyre. The letter above the bar indicates the significance level: (a), P < 0.001; (b), P < 0.05 (figure adapted from Luo 2012)
false-positive issue, which cannot meet the needs of many ecological studies.
A bioinformatic approach was recently developed to assign metagenomic gene fragments to taxonomic groups by computing evolutionary distances of protein-coding DNA sequences (Luo et al. 2012). In a protein-coding DNA sequence, point mutations occur both in synonymous sites, which do not change the corresponding amino acid sequence, and in non-synonymous sites, which change the encoded amino acids. Thus, the evolutionary distances of protein-coding DNA sequences can be represented using synonymous (dS) and non-synonymous (dN) substitution rates. More specifically, dS is the number of synonymous substitutions per synonymous site, and dN is the number of non-synonymous substitutions per non-synonymous site. Since synonymous mutations are largely invisible to natural selection, synonymous sites are easily saturated with substitutions. In contrast, most non-synonymous mutations are deleterious, and many of them have been eliminated by purifying selection. Thus, dN is much smaller than dS in a vast majority of genes (Luo and Hughes 2012). Often, marine microbial ecologists are interested in highly diverged lineages (e.g., Roseobacter, SAR11, Vibrio, Prochlorococcus). At this level of divergence, the synonymous sites are saturated with substitutions among some members of the lineage (Luo and Hughes 2012). Therefore, dN is used to measure the evolutionary distances of protein-coding genes. The dN pipeline assigns a metagenomic gene to a microbial clade (e.g., the marine Roseobacter clade) based on the requirement that the mean evolutionary distance between a metagenomic gene and each of the reference orthologous genes from the clade members is smaller than the mean of all pairwise comparisons among the reference orthologous genes in that clade. Mathematically, the requirement can be expressed using

$$\frac{\sum_{1}^{n} d_{N,\,\mathrm{ref\text{-}meta}}}{n} < \frac{\sum_{1}^{\binom{n}{2}} d_{N,\,\mathrm{ref\text{-}ref}}}{\binom{n}{2}},$$

in which $n$ is the number of reference orthologous genes, $d_{N,\,\mathrm{ref\text{-}meta}}$ is dN between a reference gene and the metagenomic gene fragment, and $d_{N,\,\mathrm{ref\text{-}ref}}$ is dN between two reference genes.
The dN pipeline takes in alignments, each consisting of reference orthologous genes belonging to the core genome of a monophyletic microbial clade and one metagenomic gene fragment with unknown taxonomic affiliation. Identification of putative gene fragments from metagenomic reads requires in silico translation of the reads in six reading frames and then selection of all fragments with a certain minimal length (e.g., 60 amino acids) between stop
New Computational Methodologies to Understand Microbial Diversity, Fig. 3 A flowchart of preprocessing steps for dN pipeline for high-confidence phylogenetic classification of metagenomic DNA fragments. The circles on the leftmost are Roseobacter genomes, in which pink-colored parts represent core genomes. The gene fragments in the GOS metagenome are categorized into three parts, in which pink colored are Roseobacter core genes, green colored are other bacterial genes homologous to the Roseobacter core genes, and blue colored are not homologous to the core genes which are not recovered by a BLAST similarity search. The dN pipeline is designed to filter out other bacterial genes in green, but a few true Roseobacter sequences are missing because of the conservative nature of the dN pipeline (figure adapted from Luo et al. 2012)
codons. Then, BLAST identifies a set of putative metagenomic gene fragments that are homologous to the reference genes (Fig. 3). Each of the homologous metagenomic gene fragments is aligned to the reference genes at the amino acid level, and the DNA sequences are imposed on the alignment. Next, the PAML software (Yang 1997) computes dN for each pairwise comparison in the DNA alignment.
The output of the dN pipeline is a set of metagenomic gene fragments that are assigned to the microbial clade. Validation of the dN pipeline using phylogenetic analyses showed that the false-positive rate is smaller than 1 %. Since these classified metagenomic gene fragments are homologous to the core genomes of a microbial clade, which encode biological functions that are essential to basic cellular functionality, they are unlikely to provide valuable information about ecologically relevant processes. However, depending on the library design for sequencing, a read may be partnered with a paired-end read, both of which are from the same DNA molecule, and the paired-end read may carry an ecologically relevant gene. Thus, an important extension of the dN pipeline is to examine the paired-end reads of the assigned reads. Here, the metagenomic gene fragment that is directly identified by the dN pipeline is named “anchor sequence,” and its paired-end read is named “mate pair sequence.” These assigned metagenomic genes are by no means a comprehensive list of genes affiliated with this microbial clade, since they can only be identified if they are core genes or physically linked to a core gene of that clade.
This whole procedure, including preprocessing, the dN pipeline, and the mate read analysis, was applied to assign metagenomic genes in the Global Ocean Sampling (GOS) data sets (Rusch et al. 2007) to the marine Roseobacter clade. The major finding is that the uncultivated Roseobacter populations differ systematically in several genomic attributes from their cultured representatives, including fewer genes for signal transduction and cell surface modifications but more genes for Sec-like protein secretion systems, anaplerotic CO2 incorporation, and phosphorus and sulfate uptake (Fig. 4). Several of these trends match well with characteristics previously identified as distinguishing r- versus K-selected ecological strategies in bacteria, suggesting that the r-strategist model assigned to cultured roseobacters may be less applicable to their free-living oceanic counterparts. Thus, genomic analyses of cultured roseobacters appear to be biasing our view of the lineage’s
New Computational Methodologies to Understand Microbial Diversity, Fig. 4 Differential representation of gene families in oceanic compared to cultured roseobacters (M versus A plot). Families plotting above the line are enriched, and those plotting below the line are depleted in the oceanic roseobacters. Non-gray symbols represent gene families with significant differential representation between the two metagenomes. Colors indicate gene families with similar functions: dark purple, anaplerotic CO2 incorporation; light purple, Sec secretion system; dark orange, signaling; light orange, nutrient transport; teal, antibiotic synthesis or resistance; maroon, C1 metabolism; dark green, cell surface properties; and light green, hypothetical. This plot shows differential representation for just one of three simulated metagenomic data sets that were constructed, all of which had congruent results (figure adapted from Luo et al. 2012)
ecology toward a stronger r-selected ecological model than is merited (Luo et al. 2012).

A Statistical Modeling Approach for Comparative Metagenomics

An important goal of metagenomics is to explain the genetic potential of the microbial community in the context of environmental gradients. One way of approaching this goal is to reveal correlations between environmental gradients and gene abundance. Although standard statistical tests such as regression analysis have been successful in correlating gene abundance and geochemical parameters in large-scale sampling efforts, exploring smaller data sets requires designing more sophisticated statistical modeling approaches. The following example illustrates such an approach (Luo et al. 2011).
Phosphonate contains a stable carbon-phosphorus (C-P) bond, comprising 25 % of the high-molecular-weight dissolved organic phosphorus in the ocean. Phosphonates are degraded by two types of enzymes, C-P lyases and hydrolases. The C-P lyase is a multienzyme complex, and the corresponding genes are only expressed when inorganic phosphate becomes limited, suggesting that the activity of C-P lyase genes is regulated by phosphate concentrations. In the ocean, there is a vertical gradient of phosphate level, in which phosphate is depleted in the upper euphotic zone (<100 m), reaches its maximum at the base of the mesopelagic zone (1,000 m), and has a minor decrease in the bathypelagic zone (>1,000 m). Only the phosphate level in the upper euphotic zone can be a limiting factor to biological productivity. Thus, the depth profile of the ocean water column provides a natural platform to test microbial adaptation to phosphate gradient
by correlating the vertical gradient of phosphate and C-P lyase gene abundance in different depths.
Examination of a recently available metagenomic data set containing thousands of sequences at each of seven depths (10 m, 70 m, 130 m, 200 m, 500 m, 770 m, 4,000 m) in the North Pacific Subtropical Gyre showed that the lytic executor genes (phnG, phnH, phnI, phnJ, phnM) of the C-P lyase complex are exclusively found in the upper euphotic zone. To validate the pattern of C-P lyase executor genes being present in the surface ocean metagenomes but absent in deeper samples, a statistical approach was designed. Testing the significance of the existence or absence of executor genes in the two depth regions is equivalent to testing the following two hypotheses: (1) executor genes exist in surface waters (≤70 m), and (2) executor genes are absent at depths ≥130 m. The basic method applied was the one-sample test on proportions. Specifically, a 95 % confidence interval was set up to indicate the range of possible true proportions (or population proportions) of executor genes, which was defined as $p \pm 1.96\sqrt{p(1-p)/N}$. Here, $N$ is the number of observed genes in total. The symbol $p$ denotes the sample proportion. In this context, sample proportions are the proportions of the executor genes among all the genes collected and mathematically defined as the ratio between the number of observed executor genes and the number of observed genes in total (i.e., sample size $N$) at each depth category (either ≤70 m or ≥130 m) in the water column. The confidence interval was set up based on the normality assumption of proportions when the sample size ($N$) is large, which is certainly a valid assumption for this data set. If zero is not included in the interval, then the existence of the executor genes can be confirmed with 95 % confidence. In the case of the upper euphotic zone, the 95 % confidence interval does not include zero, confirming that executor genes exist in surface waters (≤70 m).
The above method would not be applicable if sample proportions of the executor genes are zeros because any 95 % confidence interval would include zero. This indicates that it is unlikely to see any executor genes at the designated water depths where these genes are not found. However, the absence of the executor genes could indicate that these genes are rare and the sample size is not large enough or the number of observed executor genes might have been miscounted, i.e., misclassification might have occurred.
To take these possibilities into account, the following two directions are considered for further investigation: (1) increasing the sample size so that it is large enough to collect one executor gene or (2) increasing the number of observed executor genes to correct possible misclassifications. Biologically, these two directions are independent, but mathematically they are linked. The following mathematical principle proves that if direction 2 cannot lead to the confirmation of the true existence of executor genes, then the existence cannot be confirmed through direction 1 either.
$N_{130}$ is used to denote the current sample size at depth 130 m and deeper. A larger sample size is denoted as $N_{130} + X$ with $X > 0$. If the executor genes are miscounted by 1, a 95 % confidence interval can be set up with the lower bound given as $1/N_{130} - 1.96\sqrt{(1/N_{130})(1 - 1/N_{130})/N_{130}}$. If this confidence interval includes zero, then its lower bound must be negative, i.e., $1/N_{130} < 1.96\sqrt{(1/N_{130})(1 - 1/N_{130})/N_{130}}$. This gives $(N_{130} - 1)/N_{130} > 1/1.96^2$. Now the sample size increases to $N_{130} + X$. Since $(N_{130} + X - 1)/(N_{130} + X) > (N_{130} - 1)/N_{130}$ and $(N_{130} - 1)/N_{130} > 1/1.96^2$, we then have $(N_{130} + X - 1)/(N_{130} + X) > 1/1.96^2$; that is, the interval for one executor gene in the larger sample still includes zero.
To test direction 2, the numbers of miscounted genes were specified as $N_{mis}$ = 1, 2, 3, 4, 5, 6. Again, the sample proportions and the 95 % confidence intervals are calculated for each case. It can be seen that zero was always included in the 95 % confidence intervals unless four or more executor genes are misclassified (Fig. 5), which is unlikely to happen in practice because these executor genes are rare.
New Computational Methodologies to Understand Microbial Diversity, Fig. 5 Assumed sample proportions (denoted by circles) and 95 % confidence intervals (dashed lines). Y-axis: to get the probability, multiply the y-values by $10^{-5}$; X-axis: the number of assumed miscounted executor genes among $N_{130}$ = 327,741 genes. The filled squares are the locations of zeros (figure adapted from Luo et al. 2011)
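The one-sample interval check on proportions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `proportion_ci` and `interval_includes_zero` are hypothetical, and the only values taken from the text are the 1.96 multiplier, the sample size of 327,741 genes from the Fig. 5 caption, and the assumed miscounts 1 through 6.

```python
import math


def proportion_ci(count, n, z=1.96):
    """Normal-approximation interval p +/- z * sqrt(p * (1 - p) / n)."""
    p = count / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def interval_includes_zero(count, n, z=1.96):
    """True when the lower bound is not above zero, so presence is unconfirmed."""
    lower, _ = proportion_ci(count, n, z)
    return lower <= 0.0


N_130 = 327_741  # sample size quoted in the Fig. 5 caption

# Direction 2: assume 1..6 executor genes were miscounted and re-check the interval.
for n_mis in range(1, 7):
    lower, upper = proportion_ci(n_mis, N_130)
    print(n_mis, f"{lower:.2e}", f"{upper:.2e}", interval_includes_zero(n_mis, N_130))
```

Under these assumptions the lower bound becomes positive only from four assumed miscounts onward, because the condition reduces to roughly $\sqrt{N_{mis}} > 1.96$.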
Therefore, executor genes are indeed absent at depths ≥130 m.
At this point, the two earlier proposed hypotheses have been tested. The proposed statistical modeling approach can be widely applied to resolve biological questions regarding correlations between environmental gradient and gene abundance when the data sets are not large enough to do linear regression analysis.

Summary

In the past decade, large-scale metagenomic data sets have been released to the public community, and this trend is likely to be continued. Many biological questions may be answered by analyzing these data sets with appropriate computational approaches. Some of the promising methods are illustrated above, which are based on machine learning techniques, molecular evolutionary principles, and statistical modeling approaches. These studies are examples of future research directions of computational metagenomics.

References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
Chang J-M, Su EC-Y, Lo A, Chiu H-S, Sung T-Y, Hsu W-L. PSLDoc: protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins. 2008;72:693–710.
Dyrlov Bendtsen J, Nielsen H, von Heijne G, Brunak S. Improved prediction of signal peptides: signalP 3.0. J Mol Biol. 2004;340:783–95.
Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001;17:721–8.
Käll L, Krogh A, Sonnhammer ELL. Advantages of combined transmembrane topology and signal peptide prediction – the Phobius web server. Nucleic Acids Res. 2007;35:W429–32.
Koski LB, Golding GB. The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 2001;52:540–2.
Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, et al. Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics. 2004;20:547–56.
Luo H. Predicted protein subcellular localization in dominant surface ocean bacterioplankton. Appl Environ Microbiol. 2012;78:6550–7.
Luo H, Hughes AL. dN/dS does not show positive selection drives separation of polar-tropical SAR11 populations. Mol Syst Biol. 2012;8.
Luo H, Benner R, Long RA, Hu J. Subcellular localization of marine bacterial alkaline phosphatases. Proc Natl Acad Sci USA. 2009;106:21219–23.
Luo H, Zhang H, Long RA, Benner R. Depth distributions of alkaline phosphatase and phosphonate utilization genes in the North Pacific Subtropical Gyre. Aquat Microb Ecol. 2011;62:61–9.
Luo H, Löytynoja A, Moran MA. Genome content of uncultivated marine Roseobacters in the surface ocean. Environ Microbiol. 2012;14:41–51.
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4:63–72.
Menne KML, Hermjakob H, Apweiler R. A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics. 2000;16:741–2.
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, et al. The sorcerer II global ocean sampling expedition: Northwest Atlantic through Eastern tropical pacific. PLoS Biol. 2007;5:e77.
Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Bioinformatics. 1997;13:555–6.

New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE

Hideto Takami
Microbial Genome Research Group, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan

Synonyms

Functional potential evaluator

Definition

Although one of the main goals of genomic analysis is to elucidate the comprehensive functions (functionome) in individual organisms or a whole community in various environments, a standard evaluation method for discerning the functional potentials harbored within the genome or metagenome has not yet been established. Thus, a new evaluation method for the potential functionome, based on the completion ratio of Kyoto Encyclopedia of Genes and Genomes (KEGG) functional modules, was developed. Basic methodology and application of this method for comparative functional genomics and metagenomics are expounded in this entry.

Introduction

One of the main goals of genomic and metagenomic analyses is to extract the comprehensive functions (functionome) harbored in an individual organism or a whole community in various environments. However, evaluating the potential functionome is still difficult when compared with the functional annotation of individual genes or proteins, i.e., based on a similarity search against a reference database such as the NCBI-NR database of non-redundant protein sequences, usually employing a variant of the BLAST program, or on the protein domain search against a protein family database such as PFAM. This is mainly because a standard methodology for extracting functional category information, such as individual metabolism, energy generation, and transportation systems, has not yet been fully established. Traditionally, clusters of orthologous groups (COGs) have been used for functional classification of proteins, particularly in microbial genome sequencing projects. The COG database provides 17 functional categories for orthologous groups in order to facilitate functional studies and serves as a platform for functional annotation of newly sequenced genomes and studies on genome evolution. Although the COG functional categories are often used within Standards in Genomic Sciences (http://standardsingenomics.org/index.php/sigen) as a standard analysis, through combination with the Integrated Microbial Genomes (IMG) system (Markowitz et al. 2012), no large functional differences are usually observed in such broad categories, even between phenotypically different organisms and also whole microbial
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 1 Outline of the methodology. (a) Workflow from sequencing to evaluation of the potential functionomes. (b) Detailed workflow of the three annotation servers, KAAS, MG-RAST, and MEGAN4, using query sequences after gene finding process of sequenced data; KAAS and MEGAN4 use BLASTP and BLASTX for amino acid and nucleotide query sequences, respectively, and MG-RAST uses only BLASTX. All use different databases, i.e., KEGG GENES for KAAS, M5nr (Willke et al. 2012) for MG-RAST (M5nr includes the SEED as a subset), and NCBI-NR for MEGAN4, and different default threshold values for the BLAST hits. Each server converts the hit entries to the corresponding orthology IDs for functional annotation and pathway/module/subsystem mapping. Red-colored texts of KAAS indicate its improvements in the current study. This figure has been modified from the previous one (Takami et al. 2012)
communities in different environments. Thus, it is difficult to differentiate the functional potentials between different genomes and metagenomes by analysis based on COG classification.
Recently, more detailed and comprehensive functional categories facilitated in KEGG (Kanehisa and Goto 2000) and SEED (Overbeek et al. 2005) have been used for comparative genomics and as metagenomics tools to highlight functional features represented by KAAS (KEGG Automatic Annotation Server) (Moriya et al. 2007), MG-RAST (Meyer et al. 2008), and MEGAN (Huson et al. 2011) (Fig. 1). They all employ a similarity-based method for functional annotations, but utilize different databases for protein sequences, default threshold values, and orthology IDs for mapping annotated sequences to functional categories depending on their desired outputs, namely, pathways in KEGG or subsystems in SEED. Notably, KAAS has been applied to protein-coding sequences from several metagenomic samples, and their annotated KEGG pathways and other classifications are already available. The outputs of these systems include functional distributions of each sample by hierarchical classification using KEGG and/or SEED and comparisons between several samples when necessary. However, it is still difficult to evaluate the functional potentials via the current classification systems (such as pathway map-based analysis) because the functional information from different organisms such as microbes, plants, and animals has been mixed up.
On the other hand, KEGG MODULE, a newly defined database that collects pathway modules and other functional units, presents a promising tool for functional classification (Kanehisa et al. 2008). Because the KEGG modules cover major metabolisms and physiological processes necessary for functional characterization of each categorized organisms such as plants, animals, and microbes, a new evaluation method using the KEGG MODULE database was developed to resolve the difficulties for evaluation of potential functionome and it was employed for comparative functional genomics and metagenomics (Takami et al. 2012). Based on this result, we also developed metabolic and physiological potential evaluator (MAPLE) system. The MAPLE provides a user-friendly Web interface not only for characterization of potential functionome harbored in the genomic and metagenomic sequences but also for comparative analyses for the module completion ratio (MCR) and mapping patterns to the KEGG modules (http://www.genome.jp/tools/maple/).

Development of New Evaluation Method for Potential Functionome

KEGG MODULE
KEGG MODULE (Kanehisa et al. 2008) is a collection of pathway modules and other functional units designed for automatic functional annotation or pathway enrichment analysis. Pathway modules such as the TCA cycle core module (Fig. 2a) are tighter functional units than KEGG pathway maps and are defined as consecutive reaction steps, operon or other regulatory units, and phylogenetic units obtained by genome comparisons. Other functional units include (1) structural complexes representing sets of protein subunits for molecular machineries such as photosystems (Fig. 2b), (2) functional sets representing other types of essential sets such as aminoacyl-tRNA synthetases, and (3) signature modules representing markers of phenotypes such as enterohemorrhagic E. coli pathogenicity signature for Shiga toxin. The KEGG MODULE falls into 56 small functional categories (Table 1), and the latest version is available from the KEGG FTP site (http://www.kegg.jp/kegg/download). Each module is defined by the combination of KO identifiers so that it can be used for annotation and interpretation purposes in individual genomes or metagenomes. Notations of the Boolean algebra-like equation for this definition include space-delimited items for pathway elements, comma-separated items in parentheses for alternatives, a plus sign to define a complex, and a minus sign for an optional item. Some modules have branching points in their reaction cascades, leading to different products or alternative reaction pathways. These modules are divided into several parts depending on the branching patterns and are redefined as submodules for accurate calculation of the completion ratio. The module completion ratio was calculated for each submodule to examine fine-grained functional categories (Takami et al. 2012).

Calculation of the Module Completion Ratio Based on a Boolean Algebra-Like Equation
The completion ratio of all KEGG functional modules in each organism was calculated based on a Boolean algebra-like equation. For this analysis, one genome was selected from each of the 1,041 available prokaryotic species as of March 2013. As one of the examples, M00009_1 is a core pathway module for the TCA cycle comprising eight components (Fig. 2a). In each KO number set, vertically connected KO identifiers indicate a complex and therefore represent “And” or “+” in the Boolean algebra-like equation, whereas horizontally located K numbers indicate alternatives and represent “Or” or “,” in the equation. When genes are assigned to all KO identifiers in each reaction according to the Boolean algebra-like equation, the module completion ratio (MCR) becomes 100 %. If genes are not assigned to KO identifiers in two components, the MCR is calculated as 75 % (6/8 × 100 = 75). On the other hand, M00163_1 comprising six components in cyanobacteria represents a complex module for photosystem I. If genes assigned to KO identifiers in two of those components are missing, the MCR is calculated as 66.7 % (Fig. 2b).
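The completion-ratio arithmetic can be illustrated with a short Python sketch. This is a simplified reading of the description above, not the MAPLE implementation: the nested-list representation, the function names, and all KO labels (`K_A1`, `K_B1`, and so on) are hypothetical placeholders, real KEGG module definition strings are not parsed, and optional items (the minus sign) are ignored.

```python
from typing import List, Set

# A module is a list of components (reaction steps). Each component lists its
# alternatives ("," / Or); each alternative is a set of KO identifiers that
# must all be present ("+" / And, i.e., a complex).
Module = List[List[Set[str]]]


def component_complete(alternatives: List[Set[str]], assigned: Set[str]) -> bool:
    """A component is filled if every KO of at least one alternative is assigned."""
    return any(complex_kos <= assigned for complex_kos in alternatives)


def module_completion_ratio(module: Module, assigned: Set[str]) -> float:
    """Percentage of components in the module that are filled."""
    filled = sum(component_complete(c, assigned) for c in module)
    return 100.0 * filled / len(module)


# Hypothetical eight-component module; the identifiers are placeholders.
toy_module: Module = [
    [{"K_A1"}, {"K_A2"}],   # step 1: either KO suffices
    [{"K_B1", "K_B2"}],     # step 2: two-subunit complex, both required
    [{"K_C"}], [{"K_D"}], [{"K_E"}], [{"K_F"}], [{"K_G"}], [{"K_H"}],
]

# KOs assigned to genes in a hypothetical genome; K_G and K_H are missing.
assigned = {"K_A2", "K_B1", "K_B2", "K_C", "K_D", "K_E", "K_F"}

print(module_completion_ratio(toy_module, assigned))  # 6 of 8 components filled
```

With six of eight components filled, the sketch reproduces the 75 % case from the text; a six-component module with two unfilled components gives about 66.7 %.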
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 2 KEGG functional modules. (a) A pathway module. The module M00009 comprising eight components is defined for the citrate cycle (TCA cycle) core module and represented as a Boolean algebra-like equation of KO identifiers or K numbers for computational applications. The relationship between this module and the corresponding KEGG pathway map is also shown by indicating corresponding K number sets in the module and EC numbers in the pathway map using the same index. In each K number set, vertically connected
Assignment of the Query Sequences to KO Identifiers
Because KAAS is an efficient tool for assigning KO identifiers to genes from complete genomes based on a BLAST search of the KEGG GENES database combined with a bidirectional best-hit method (Moriya et al. 2007), the KAAS system is used to assign KO identifiers to protein sequences from metagenome projects and to users’ own data from other genome and metagenome projects. Recently, the KAAS system was slightly modified to improve the accuracy of KO assignments by (i) using a variable bit-score threshold instead of a fixed one (60 in the original KAAS system) to avoid missed annotations when there are sufficient high-scoring hits for KO assignment and (ii) considering taxonomic information of each KO when more than one candidate KO is obtained (Fig. 1) (Takami et al. 2012). This modification improved the positive predictive value (#true positives/#all positives) by 2–5 % in the KO reassignment tests for 30 selected species. The latest stand-alone KAAS system for Linux and Mac OS X is available from the Web site of KAAS HELP (http://www.genome.jp/tools/kaas/help.html). This new KAAS was used for estimation of database dependency on the accuracy of the KO assignment (Fig. 3). Escherichia coli was selected as a representative of prokaryotic species, and four different types of data sets were constructed: without E. coli and closely related species (1,239 species), without all species within family Enterobacteriales (1,200 species), without all species within class Gammaproteobacteria (1,040 species), and without all species within phylum Proteobacteria (755 species). The draft genome of E. coli from infants in Trondheim, Norway (accession, ERX127960), was used for this analysis because the assembled genome from the short-read sequences produced by a 454 GS FLX Titanium sequencer contains several sequencing errors. The amino acid sequences of complete CDSs identified from the draft genome were randomly fragmented to 50, 60, 80, 100, 120, 150, and 200 residues in length, and each fragment was subjected to verification of database dependency based on the accuracy of KO identifier assignment (Fig. 3). In general, because most microbes thriving in natural environments are uncultivable, many genes in environmental metagenomes do not show significant similarity to those from known species in the public genome database. Especially when microbial genomes belonging to the same phylum as the query microbe are missing in the genome database, the accuracy rate of KO assignment to proteins phylogenetically distant from known phyla is expected to be low. In fact, when all species within phylum Proteobacteria were not included in the data set, the accuracy rate of KO assignment to full proteins of E. coli decreased to 80 %, but the accuracy rate of approximately 70 % was maintained even in the proteins fragmented to about 100 residues (Fig. 3). Considering these results, even if the genes from unidentified phyla of the so-called candidate division are included in the metagenomes, the KAAS system can presumably assign KO identifiers to genes longer than 300 bp (100 amino acids) with an accuracy rate of approximately 70 %.

Distribution Patterns of the Module Completion Ratio in 1,256 Prokaryotic Species

KEGG modules are modular functional units derived from the KEGG pathways and are categorized into pathway modules, structural complexes, functional sets, and genotypic
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 2
(continued) K numbers indicate a complex and therefore represent “And” or “+” in the Boolean
algebra-like equation, whereas horizontally located K numbers indicate alternatives and represent
“Or” or “,” in the equation. (b) A structural complex module. The structural complex module
M00163 comprising six components is defined for the type I photosystem. The Boolean algebra-like
equation and the corresponding KEGG pathway map are also shown. This figure has been redrawn with
the updated KEGG module database from the previous one (Takami et al. 2012)
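The “And”/“Or” reading of a module definition described in the Fig. 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the KAAS or MAPLE implementation: the nested-list representation, the function names, and the identifiers KA to KG are hypothetical placeholders, and the completion ratio is assumed here to be the percentage of components that are satisfied.

```python
from typing import FrozenSet, List, Set

# A component is a list of alternatives ("," = Or); each alternative is a set
# of K numbers that must all be present ("+" = And, i.e., a complex).
Alternative = FrozenSet[str]
Component = List[Alternative]


def component_complete(component: Component, assigned: Set[str]) -> bool:
    """True if at least one alternative has all of its K numbers assigned."""
    return any(alternative <= assigned for alternative in component)


def module_completion_ratio(module: List[Component], assigned: Set[str]) -> float:
    """Percentage of components in the module that are complete."""
    if not module:
        raise ValueError("module has no components")
    done = sum(component_complete(component, assigned) for component in module)
    return 100.0 * done / len(module)


# Toy three-component module, written as (KA,KB) (KC+KD) (KE,KF+KG).
# KA..KG are placeholders, not real KO identifiers.
toy_module = [
    [frozenset({"KA"}), frozenset({"KB"})],
    [frozenset({"KC", "KD"})],
    [frozenset({"KE"}), frozenset({"KF", "KG"})],
]
genome_kos = {"KB", "KC", "KF", "KG"}

# Components 1 and 3 are satisfied; component 2 lacks KD.
print(round(module_completion_ratio(toy_module, genome_kos), 2))  # prints 66.67
```

Keeping each alternative as a set makes the complex (“+”) test a simple subset check, and the list of alternatives maps directly onto the “,” choices in the equation.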
signatures. Each KEGG module is designed for automatic functional annotation by a Boolean
algebra-like equation of KEGG Orthology IDs. However, it remains uncataloged as to which species
possess common modules or if certain modules demonstrate universality or rareness between
specific species, phyla, etc. Specific information regarding the phylogenetic profiles of each
module holder would be especially useful for annotating metagenomes. Thus, the distribution
patterns of the completion ratios of the KEGG modules were examined in the 1,256 prokaryotic
species whose genomic sequences have been completed. Although distribution of the module
completion ratios in the 1,256 species varied greatly depending on the kind of module, it could be
categorized into four patterns (universal, restricted, diversified, and nonprokaryotic)
regardless of the module type (pathway, structural complex, signature, or functional set), when
considering 70 % of all species to represent a majority measurement for the patterns (Table 2 and
Fig. 4).
Pattern A defined as “universal” comprised modules completed by more than 70 % of the 1,256
species (Fig. 4a). Of 226 pathway modules containing submodules, modules grouped into pattern A
account for only 7.5 % (Table 2) and mainly belong to the categories of central carbohydrate
metabolism and cofactor and vitamin biosynthesis. Pattern B defined as “restricted” comprised
modules completed by less than 30 % of the species (Fig. 4b) and accounted for 17.3 % of all the
pathway modules, and 37 modules were rare modules completed by less than 10 % of the 1,256
species (Table 2). Pattern C defined as “diversified” accounted for 40.3 % of all the pathway
modules and comprised modules ranging widely in completion ratios. M00012_1 (the glyoxylate cycle
comprising five components) is one of the representatives of pattern C (Fig. 4c). One or several
KO identifiers were assigned to each reaction in this module; however, KO identifiers, except for
K01637 and K01638 assigned to the third and fourth components, were also assigned to other
pathway modules such as the TCA (Krebs) cycle (M00009_1), first carbon oxidation (M00010_1),
reductive

New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Table 1
Breakdown of small functional categories of the KEGG modules
Pathway modules: Cofactor and vitamin biosynthesis; Central carbohydrate metabolism; Aromatics
degradation; Lipid metabolism; Aromatic amino acid metabolism; Carbon fixation; Methane
metabolism; Glycan metabolism; Sterol biosynthesis; Fatty acid metabolism; Lysine metabolism;
Other carbohydrate metabolism; Glycosaminoglycan metabolism; Terpenoid backbone biosynthesis;
Cysteine and methionine metabolism; Nitrogen metabolism; Branched-chain amino acid metabolism;
Lipopolysaccharide metabolism; Purine metabolism; Pyrimidine metabolism; Polyamine biosynthesis;
Alkaloid and other secondary metabolite biosynthesis; Sugar metabolism; Other terpenoid
biosynthesis; Serine and threonine metabolism; Arginine and proline metabolism; Phenylpropanoid
and flavonoid biosynthesis; Sulfur metabolism; Histidine metabolism; Other amino acid metabolism
Structural complex modules: Saccharide and polyol transport system; Phosphotransferase system
(PTS); ATP synthesis; Phosphate and amino acid transport system; Mineral and organic ion
transport system; ABC-2 type and other transport systems; Bacterial secretion system; Metallic
cation, iron-siderophore, and vitamin B12 transport system; RNA processing; Ubiquitin system;
Spliceosome; Protein processing; Repair system; DNA polymerase; Peptide and nickel transport
system; Replication system; RNA polymerase; Proteasome; Photosynthesis; Carbohydrate metabolism;
Ribosome; Glycan metabolism
Functional set modules: Two-component regulatory system; Aminoacyl-tRNA; Nucleotide sugar
Signature modules: Pathogenicity
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 3 Effect
of database dependency on accuracy of the KO assignment. Purple triangles show the results using
the data set without proteins from the genera Escherichia, Salmonella, Shigella, and Yersinia
(1,239 species). Similarly, green squares, brown diamonds, and blue dots show the results without
proteins from the order Enterobacteriales (1,200 species), class Gammaproteobacteria (1,040
species), and phylum Proteobacteria (755 species), respectively. KO identifiers specific to the
genera Escherichia, Salmonella, Shigella, and Yersinia (16 KO identifiers), order
Enterobacteriales (90), class Gammaproteobacteria (203), or phylum Proteobacteria (370) were
removed in advance from the protein data set. Here, the accuracy is defined by the sensitivity
TP/(TP + FN), where TP and FN are the numbers of true positives and false negatives,
respectively. The truncated proteins were also used to confirm the effect of amino acid (a.a.)
sequence lengths on the accuracy of KO assignments as described in the text. This figure has been
slightly modified from the previous one (Takami et al. 2012)
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Table 2
Classification of the KEGG modules based on the module completion ratio of 1,256 prokaryotes
Completion  Definition of    Pathways [226]          Structural complexes [331]  Functional sets [86]    Signatures [9]
pattern     module type      Total       Rare        Total        Rare           Total       Rare        Total      Rare
A           Universal        17 (7.5)    0 (0)       9 (2.7)      0 (0)          1 (1.2)     0 (0)       0 (0)      0 (0)
B           Restricted       39 (17.3)   37 (47.4)   133 (40.2)   99 (81.1)      77 (89.5)   67 (97.1)   8 (88.9)   8 (88.9)
C           Diversified      91 (40.3)   41 (52.6)   70 (21.1)    23 (18.9)      5 (5.8)     2 (2.9)     1 (11.1)   1 (11.1)
D           Nonprokaryotic   79 (35.0)   0 (0)       119 (36.0)   0 (0)          3 (3.5)     0 (0)       0 (0)      0 (0)
Values are No. of modules (%). [] shows total number of the KEGG modules containing branched
modules. “Rare” indicates the modules completed by less than 10 % of 1,256 prokaryotic species.
Universal, the modules completed by more than 70 % of 1,256 prokaryotic species. Restricted, the
modules completed by less than 30 % of 1,256 prokaryotic species. Diversified, the modules that
vary in the module completion ratio among 1,256 prokaryotic species. Nonprokaryotic, the modules
not completed by any prokaryotic species
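The percentages in Table 2 use two different denominators, which is easy to overlook: the “Total” percentages are relative to all modules of that type (for example, 226 pathway modules), whereas the “Rare” percentages are relative to all rare modules of that type (for example, 37 + 41 = 78 rare pathway modules). The short check below recomputes them from the counts in the table; the dictionary layout and function name are illustrative choices, not part of the original method.

```python
# Counts copied from Table 2: pattern -> (total modules, rare modules).
TABLE2 = {
    "Pathways": {"A": (17, 0), "B": (39, 37), "C": (91, 41), "D": (79, 0)},
    "Structural complexes": {"A": (9, 0), "B": (133, 99), "C": (70, 23), "D": (119, 0)},
    "Functional sets": {"A": (1, 0), "B": (77, 67), "C": (5, 2), "D": (3, 0)},
    "Signatures": {"A": (0, 0), "B": (8, 8), "C": (1, 1), "D": (0, 0)},
}


def percentages(counts):
    """Return pattern -> (total %, rare %) rounded to one decimal place."""
    all_modules = sum(total for total, _ in counts.values())
    all_rare = sum(rare for _, rare in counts.values())
    result = {}
    for pattern, (total, rare) in counts.items():
        total_pct = round(100 * total / all_modules, 1)
        rare_pct = round(100 * rare / all_rare, 1) if all_rare else 0.0
        result[pattern] = (total_pct, rare_pct)
    return result


for module_type, counts in TABLE2.items():
    print(module_type, percentages(counts))
```

For pattern B this gives 17.3 % and 47.4 % for pathway modules and 40.2 % and 81.1 % for structural complex modules, matching the table.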
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 4 Typical
completion patterns to the KEGG modules by 1,256 prokaryotic species. (a) Universal modules. The
modules completed by more than 70 % of 768 prokaryotic species. M00018_1, which is threonine
biosynthesis (aspartate-homoserine-threonine), is one of the examples of the pattern A-1.
(b) Restricted modules completed by less than 30 % of 768 prokaryotic species. M00038_1, which is
tryptophan metabolism, is one of the examples of the pattern B. (c) Diversified modules. These
are the modules that vary in the module completion ratio among 1,256 prokaryotic species.
M00012_1, which is the glyoxylate cycle, is one of the examples of the pattern C.
(d) Nonprokaryotic modules completed by no prokaryotic species. M00014_1, which is the
glucuronate pathway, is one of the examples of the pattern D. Breakdown of taxonomic variations
that complete each KEGG module is summarized in Table 3. This figure has been redrawn with the
updated KEGG module and genome databases from the previous one (Takami et al. 2012)
TCA cycle (M00173_1), and C4-dicarboxylate cycle (nicotinamide adenine dinucleotide (NAD)+-malic
enzyme type) (M00171_1). Some KO IDs assigned to many of the modules, categorized into pattern C,
were also assigned to several other independent modules. Thus, when the module completion ratio
is low, the relationship between the module completion ratio of the targeted module and others to
which the same KO identifiers are assigned should be considered. Pattern D, which accounted for
35.0 % of all pathway modules, comprised nonprokaryotic modules that are not completed by
prokaryotic species (Fig. 4d).
Of the 331 structural complex modules containing submodules redefined from modules with various
complex patterns, 133 modules were categorized into pattern B (40.2 %) and 99 were rare modules
(Table 2). Pattern C accounted for only 21.1 % in the structural modules compared with 40.3 % in
the pathway modules. Thus, it was hypothesized that most of the structural complex modules,
except for pattern D, are shared only in limited prokaryotic species.
Nonprokaryotic modules account for 35 % of pathway and 36 % of structural complex modules,
respectively, and other modules were classified into various taxonomic patterns such as
prokaryotic, Bacteria specific, and Archaea specific based on the MCR profiles (Table 3). These
four patterns indicate the universal and unique nature of each module and also the versatility of
the KO identifiers mapped to each module. Thus, the four criteria and taxonomic classification
for each module should be helpful for the interpretation of results based on module completion
profile.

Application of the Evaluation Method for Potential Functionome to Genomic and Metagenomic
Analyses

Comparative Functionome Analysis of Bacilli Based on the KEGG Modules
Bacillus and its related species in genera such as Oceanobacillus and Geobacillus reclassified
from genus Bacillus (Bacillus-related species) are known to thrive in a wide range of
environmental conditions: pH 2–12, temperatures between 5 and 78 °C, salinity from 0 % to 30 %
NaCl, and pressures from 0.1 MPa (atmospheric pressure) to at least 30 MPa (pressure at a depth
of 3,000 m) (Takami 2006). The genome structure of these species within family Bacillaceae is
comparatively similar, and the core structure comprising more than 1,400 orthologous groups is
well conserved among Bacillaceae (Uchiyama 2008). Therefore, moderately related bacillar genomes
from eight species with different phenotypic properties were selected to test our evaluation
method for potential functionome using KEGG modules, in order to differentiate the functional
potentials harbored in their genomes.
The gene products from eight bacillar genomes were assigned to KO identifiers constructing each
module in 139 pathway, 112 structural complex, and 25 functional set modules. There was a
significant difference in the module completion ratio by eight bacilli in terms of at least
25 pathway, 40 structural complex, and 15 functional set modules (Fig. 5a, b). In particular, the
completion ratio in Oceanobacillus iheyensis, a mesophilic, extremely halotolerant alkaliphile,
was very low in three modules for NAD biosynthesis, phosphatidylethanolamine biosynthesis, and
biotin biosynthesis. These three modules were completed by all bacilli except for O. iheyensis
although they are categorized into one of the diversified modules (pattern C). Conversely, the
module for tryptophan biosynthesis belonging to pattern C was completed by only O. iheyensis,
although other species partially completed it. Through these results it was evident that
O. iheyensis differs from other bacilli in its metabolic potentials.
Some of the completed structural complex modules were found to be shared in bacilli with the same
phenotypic properties or to be independently species specific (Fig. 5b). For example, the
Firmicutes-specific modules for the teichoic acid transport system were shared only among three
mesophilic neutrophiles (B. subtilis, B. amyloliquefaciens, and B. licheniformis), although this
module is widely shared in other genera such as Staphylococcus, Clostridium, and Listeria within
phylum Firmicutes. On the other hand, two other modules, the iron (III) transport system and
phosphonate transport system, which are shared in many prokaryotic species within various phyla
and belong to pattern C, were shared only among three mesophilic alkaliphiles (B. halodurans,
B. pseudofirmus, and O. iheyensis). Although it has been previously reported that the orthologous
genes for the phosphonate transport system were shared between O. iheyensis and B. halodurans
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Table 3
Breakdown of taxonomic patterns of the KEGG modules

Pathway [226]
Major taxonomic pattern    Number (%)
Nonprokaryote    79 (35.0)
Prokaryote    50 (22.1)
Bacteria    30 (13.3)
Proteobacteria    27 (11.9)
Euryarchaeota    10 (4.4)
Proteobacteria/Actinobacteria    5 (2.2)
Firmicutes    4 (1.8)
Proteobacteria/Firmicutes/Actinobacteria    3 (1.3)
Chloroflexi    2 (0.9)
Crenarchaeota    2 (0.9)
Cyanobacteria    2 (0.9)
Actinobacteria/Crenarchaeota    1 (0.4)
Chlamydiae/Cyanobacteria    1 (0.4)
Chloroflexi/Deinococcus-Thermus/Euryarchaeota    1 (0.4)
Euryarchaeota/Crenarchaeota    1 (0.4)
Firmicutes/Euryarchaeota    1 (0.4)
Proteobacteria/Acidobacteria    1 (0.4)
Proteobacteria/Actinobacteria/Acidobacteria    1 (0.4)
Proteobacteria/Actinobacteria/Bacteroidetes    1 (0.4)
Proteobacteria/Actinobacteria/Cyanobacteria    1 (0.4)
Proteobacteria/Cyanobacteria    1 (0.4)
Proteobacteria/Firmicutes    1 (0.4)
Proteobacteria/Verrucomicrobia    1 (0.4)

Structural complex [331]
Major taxonomic pattern    Number (%)
Nonprokaryote    119 (36.0)
Bacteria    55 (16.6)
Prokaryote    51 (15.4)
Proteobacteria    36 (10.9)
Firmicutes    17 (5.1)
Actinobacteria    5 (1.5)
Cyanobacteria    5 (1.5)
Archaea    4 (1.2)
Proteobacteria/Firmicutes    4 (1.2)
Euryarchaeota/Crenarchaeota    3 (0.9)
Euryarchaeota/Crenarchaeota/Nanoarchaeota    3 (0.9)
Proteobacteria/Firmicutes/Fusobacteria    3 (0.9)
Euryarchaeota    2 (0.6)
Firmicutes/Tenericutes/Actinobacteria    2 (0.6)
Proteobacteria/Actinobacteria    2 (0.6)
Proteobacteria/Aquificae    2 (0.6)
Proteobacteria/Firmicutes/Actinobacteria    2 (0.6)
Actinobacteria/Cyanobacteria    1 (0.3)
Actinobacteria/Verrucomicrobia/Nitrospirae    1 (0.3)
Firmicutes/Fusobacteria    1 (0.3)
Firmicutes/Spirochaetes    1 (0.3)
Proteobacteria/Actinobacteria/Deinococcus-Thermus    1 (0.3)
Proteobacteria/Actinobacteria/Verrucomicrobia    1 (0.3)
Proteobacteria/Bacteroidetes/Aquificae    1 (0.3)
Proteobacteria/Chlamydiae    1 (0.3)
Proteobacteria/Chlorobi    1 (0.3)
Proteobacteria/Chlorobi/Deferribacteres    1 (0.3)
Proteobacteria/Cyanobacteria    1 (0.3)
Proteobacteria/Cyanobacteria/Chlorobi    1 (0.3)
Proteobacteria/Firmicutes/Deferribacteres    1 (0.3)
Proteobacteria/Firmicutes/Spirochaetes    1 (0.3)
Proteobacteria/Tenericutes    1 (0.3)
Proteobacteria/Thermodesulfobacteria    1 (0.3)

Functional set [86]
Major taxonomic pattern    Number (%)
Proteobacteria    26 (30.2)
Firmicutes    19 (22.1)
Bacteria    11 (12.8)
Actinobacteria    6 (7.0)
Cyanobacteria    6 (7.0)
Nonprokaryote    3 (3.5)
Prokaryote    3 (3.5)
Firmicutes/Fusobacteria    2 (2.3)
Proteobacteria/Nitrospirae    2 (2.3)
Firmicutes/Tenericutes/Thermotogae    1 (1.2)
Proteobacteria/Acidobacteria/Deferribacteres    1 (1.2)
Proteobacteria/Acidobacteria/Planctomycetes    1 (1.2)
Proteobacteria/Chrysiogenetes/Firmicutes    1 (1.2)
Proteobacteria/Cyanobacteria    1 (1.2)
Proteobacteria/Firmicutes/Chlamydiae    1 (1.2)
Proteobacteria/Nitrospirae/Deferribacteres    1 (1.2)
Proteobacteria/Spirochaetes/Verrucomicrobia    1 (1.2)

Signature [9]
Major taxonomic pattern    Number (%)
Proteobacteria    5 (55.6)
Euryarchaeota    1 (11.1)
Proteobacteria/Actinobacteria    1 (11.1)
Proteobacteria/Thaumarchaeota    1 (11.1)
Proteobacteria/Verrucomicrobia/Nitrospirae    1 (11.1)

[] shows total number of the KEGG modules containing branched modules
(Takami et al. 2012), it could be easily visualized using our new evaluation method that this
system was also shared by another mesophilic alkaliphile, B. pseudofirmus, whose genome sequence
has been completed recently. Although how the differentiated functional modules confer phenotypic
properties directly or indirectly is still unclear, the above results should be helpful for
better understanding the physiological properties.

Comparative Functionome Analysis of Humans and Human Gut Microbiomes
The completion ratio of each KEGG module was compared between humans and human gut microbiomes to
illustrate their metabolic linkage. Previously reported metagenomic data of gut microbiomes from
13 healthy Japanese individuals were used (Kurokawa et al. 2007). There was a significant
difference in the module completion ratios of 13 individuals in terms of at least 33 pathway
modules (Fig. 6a).
The most complete 16S rRNA gene sequence-based enumerations available in human gut microbiomes
indicate that more than 90 % of phylotypes belong to just two of the 70 known divisions of
Bacteria, the Bacteroidetes and the Firmicutes, with the remaining phylotypes distributed among
eight other phyla (Eckburg et al. 2005). Pairwise comparison of the completion ratio of the KEGG
module clearly demonstrated the well-recognized functional complementation of the gut microbiome
to the human host, which includes essential amino acid and vitamin biosynthesis. The contributors
completing the modules for vitamin production are Firmicutes, Bacteroidetes, Actinobacteria, and
Gammaproteobacteria. Completion patterns of the KEGG module for these amino acids and vitamins
mainly fall into patterns C and D except for riboflavin biosynthesis belonging to one of the
universal modules A, indicating that these modules are involved in the nutritional supply for the
gut microbiome as well as for the host (Fig. 6b). Interindividual variation was also evident in
the completion ratio of the module for vitamins. For example, the module belonging to pattern C
for pyridoxal (vitamin B6) biosynthesis was mainly attributable to Bacteroidetes in adults and
Gammaproteobacteria in infants; however, its completion ratio in two male infants (In-B and In-E)
was extremely low (33.33 %) (Fig. 6a). Interindividual variations in completion ratios were also
observed in modules for polyamine biosynthesis, for example, putrescine, spermidine, and spermine
(Takami et al. 2012). Similarly, the completion ratio of the KEGG modules for γ-aminobutyric acid
(GABA) varied among individuals, and Gammaproteobacteria mainly contributed to GABA production
(Fig. 6a). Because these polyamines and GABA are essential biological substances that act as cell
growth promoters and inhibitory neurotransmitters, respectively, in humans, these variations may
be linked to susceptibilities to certain diseases. Indeed, a recent report on metabolic changes
in gut microbiomes after bariatric surgery for obese patients demonstrated their potential for
polyamine production in the gut; elevated protein putrefaction because of the bypassed food
passage promoted putrescine and GABA production from gut microbiota (Li et al. 2011).
Interestingly, gut microbiomes showed preference for amino acid catabolism. The gut
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 5
Comparison of module completion patterns in eight phenotypically different Bacillus-related
species. (a) Pathway modules showing remarkable differences appeared among the eight species.
(b) Structural complex modules showing remarkable differences appeared among the eight species.
Upper plot indicates common or specific modules in the species possessing each phenotype. Green
letters show rare modules completed by less than 10 % of 1,256 prokaryotic species. Alphabet in
parentheses shows the patterns of completion profile based on the module completion ratio as
shown in Table 2 and Fig. 4. bsu, B. subtilis; bao, B. amyloliquefaciens; bli, B. licheniformis;
bha, B. halodurans; B. pseudofirmus; oih, O. iheyensis; gka, G. kaustophilus; and gth,
G. thermoglucosidasius. This figure has been redrawn with the updated KEGG module database from
the previous one (Takami et al. 2012)
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE, Fig. 6
Comparison of module completion patterns in humans and human gut microbiomes from 13 healthy
individuals. (a) Typical pathway modules showing remarkable differences in the module completion
ratio appeared among human gut microbiomes from 13 healthy individuals. (b) Typical pathway
modules possessing complementary relationships between humans and human gut microbiomes in the
module completion ratio. (c) Typical pathway modules for which the completion ratio in the human
gut microbiome is very low in contrast to that in humans. Detailed information of the
13 individuals has been previously described (Kurokawa et al. 2007). This figure has been redrawn
with the updated KEGG module database from the previous one (Takami et al. 2012)
microbiome did not seem to utilize exogenous lysine, leucine, and aromatic amino acids such as
tryptophan and tyrosine (Fig. 6c). To our knowledge, this is a novel finding on the nutritional
preference of gut microbes. This may be one of the mutualistic representations of gut microbiomes
to avoid nutritional competition with the host because these aromatic amino acids are precursors
of various biological substances such as catecholamines, melatonin, serotonin, thyroid hormones,
and NAD. Thus, the new evaluation method based on the KEGG modules is expected not only to
highlight the metabolic linkage between host and commensal microbes but also to identify
microbiome-based biomarkers for particular diseases.

Summary

A new evaluation method for potential functionomes based on the KEGG modules was developed. Using
this new method, a significant difference in module completion ratio among eight bacilli in terms
of at least 25 pathway, 40 structural complex, and 15 functional set modules was highlighted,
although how the differentiated functional modules confer phenotypic properties directly or
indirectly is unclear thus far. Because the coverage of KEGG modules over whole metabolic and
signaling networks is continuously increasing, differences in module completion ratio will
provide some important clues to the understanding of phenotypic properties. Furthermore,
variations in the functional potential of human gut microbiomes from 13 healthy individuals could
be characterized by the pathway and structural complex module units, and the complementarity
between biochemical functions in human hosts and nutritional preferences in human gut microbiomes
was identified.
Functional annotation of metagenomic sequences remains difficult because metagenomic data
targeting various environments still contain incomplete genes from various unidentified species
that are absent from the reference database. In this entry, the KAAS system was used for
functional annotation of the human metagenomes and also applied to estimate database dependency
on the accuracy of the KO assignment using the E. coli draft genome. As a result, the KAAS system
could correctly assign genes to KO groups with an accuracy rate of approximately 80 %, even if
the gene hosts were not classified into known phyla within the reference database. Thus, this
method should work well for comparative functional analysis in metagenomics and can target
unknown environments containing various uncultivable microbes within unidentified phyla, although
further verification studies on database dependency for metagenomics should be performed. Based
on this method, we developed the metabolic and physiological potential evaluator (MAPLE) and
provided a user-friendly Web interface not only for the characterization of potential functionome
harbored in the genomic and metagenomic sequences but also for comparative analyses for the MCR
and mapping patterns to the KEGG modules (http://www.genome.jp/tools/maple/).

Cross-References

▶ Computational Approaches for Metagenomic Datasets
▶ Human Gut Microbial Genes by Metagenomic Sequencing
▶ KEGG and GenomeNet, New Developments, Metagenomic Analysis
▶ Metagenomic Research: Methods and Ecological Applications

References

Eckburg PB, Bik EM, Bernstein CN, et al. Diversity of the human intestinal microbial flora.
Science. 2005;308:1635–8.
Huson DH, Mitra S, Ruscheweyh HJ, et al. Integrative analysis of environmental sequences using
MEGAN4. Genome Res. 2011;21:1552–60.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res.
2000;28:27–30.
Kanehisa M, Araki M, Goto S, et al. KEGG for linking genomes to life and environment. Nucleic
Acids Res. 2008;36:D480–4.
Kurokawa K, Itoh T, Kuwahara T, et al. Comparative metagenomics revealed commonly enriched gene
sets in human gut microbiome. DNA Res. 2007;14:169–81.
Li JV, Ashrafian H, Bueter M, et al. Metabolic surgery profoundly influences gut microbial-host
metabolic cross-talk. Gut. 2011;60:1214–23.
Markowitz VM, Chen I-MA, Palaniappan K, et al. IMG: the integrated microbial genomes database and
comparative analysis system. Nucleic Acids Res. 2012;40:D115–22.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the
automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386.
Moriya Y, Itoh M, Okuda S, et al. KAAS: an automatic genome annotation and pathway reconstruction
server. Nucleic Acids Res. 2007;35:W182–5.
Overbeek R, Begley T, Butler RM, et al. The subsystems approach to genome annotation and its use
in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.
Takami H. Genomic diversity of extremophilic Gram-positive endospore-forming Bacillus-related
species. In: Williams CR, editor. Trends in genome research. New York: NOVA Publisher; 2006.
p. 25–85.
Takami H, Taniguchi T, Moriya Y, et al. Evaluation method for the potential functionome harbored
in the genome and metagenome. BMC Genomics. 2012;13:699.
Uchiyama I. Multiple genome alignment for identifying the core structure among moderately related
microbial genomes. BMC Genomics. 2008;9:515.
Wilke A, Harrison T, Wilkening J, et al. The M5nr: a novel non-redundant database containing
protein sequences and annotations from multiple sources and associated tools. BMC Bioinformatics.
2012;13:141.

Next-Generation Sequencing for Metagenomic Data: Assembling and Binning

Henry C. M. Leung, Yi Wang, S. M. Yiu and Francis Y. L. Chin
Department of Computer Science, The University of Hong Kong, Hong Kong, China

Introduction

Microorganisms contribute the largest number of living cells in the world. The activity of
different microbes forms a microbial ecosystem which affects all species in the world. The
traditional method of studying microorganisms requires culturing a single kind of microbe and
studying each microbe by its genome, one at a time, based on next-generation sequencing (NGS)
technology (Perna et al. 2001). However, as a single kind of microbe usually cannot live alone
and over 99 % of microbes cannot be cultivated in the laboratory (Rappe and Giovannoni 2003;
Eisen 2007), the traditional culture-based method cannot analyze the interactivity of a microbial
community well.
Metagenomics, which studies all microbes in a community as a whole, was introduced to solve this
problem. Based on the NGS technology (Shendure and Ji 2008), instead of sequencing each single
cultivated microbe one by one, metagenomics sequences all microbes in an environment sample as a
community directly without cultivation (Weinstock 2012; Gilbert and Dupont 2011; Hunter
et al. 2012; Tremaroli and Backhed 2012; Wooley et al. 2010). Thus, genomes of microbes that
could not be studied before can now be obtained and analyzed.
However, the complexity of a microbial community is high. There can be tens of thousands of kinds
of microbes in a single sample. As genomes of these microbes coexist in the sample, reads (short
DNA fragments) obtained from genomes of different microbes are mixed and need to be separated
after the NGS step. More seriously, as the abundance of different microbes in a sample can vary
by several orders of magnitude (Qin et al. 2010), few reads are sequenced from the low-abundance
species, and these reads may be treated as erroneous reads. Thus, several approaches have been
developed for analyzing metagenomic data depending on the property of samples and research
objectives.

Sequencing Biomarker

Traditional sequencing techniques, e.g., Sanger (Sanger and Coulson 1975), have a relatively low
throughput. Thus, it is impossible to sequence the whole genomes of all microbes in a sample,
especially for the low-abundance species. Instead of sequencing the whole genome,
biologists usually design primers for capturing short regions in the genomes of various microbes, e.g., fingerprinting polymerase chain reaction (PCR) on 16S rRNA genes. Each 16S rRNA gene is a 1.5-kilobase-long gene encoding part of the prokaryotic ribosome. Although its sequence varies among different bacteria, the 16S rRNA gene contains some conserved regions (required for ribosome function), so that primers can be designed to capture the 16S rRNA gene of different bacteria. Moreover, species whose 16S rRNA genes are 97 % identical are usually placed in the same operational taxonomic unit (OTU) (Weinstock 2012). Thus, sequencing the 16S rRNA genes can determine which kinds of bacteria are in a sample and their relative abundances (16S rRNA genes of high-abundance bacteria will be sequenced more often than those of low-abundance bacteria, resulting in more reads covering these genes). Similarly, the 18S rRNA gene encodes part of the eukaryotic ribosome and can be sequenced for identifying eukaryotes in a sample.

However, as the read lengths of the most popular sequencing techniques are shorter than 1.5 kb (the typical length of a 16S rRNA gene), biologists can only sequence a portion of each 16S rRNA gene, and the accuracy of identification depends on the read length. Traditional Sanger sequencing can produce reads about 1 kb long, which cover a larger portion of the 16S rRNA gene. However, its throughput is low, so the 16S rRNA genes of many species may not be sequenced and the relative abundances of species may not be estimated well. One of the next-generation sequencing techniques, 454 pyrosequencing, can produce several orders of magnitude more reads than Sanger sequencing, but the read length is about 400 bases, which covers only a short portion of the 16S rRNA gene, and thus the sensitivity of identifying different microbes in a sample decreases. The Illumina platform, another next-generation sequencing technique, can produce several orders of magnitude more reads than 454 pyrosequencing; however, the read length is at most 250 bases, resulting in lower sensitivity than 454 pyrosequencing.

Besides the problem of sequencing the whole 16S rRNA gene with high throughput, there is another problem in analyzing metagenomic data using 16S rRNA genes (or 18S rRNA genes). Microbes can transfer genes from one to another without reproduction (horizontal gene transfer), and thus the 16S rRNA gene of one kind of microbe may be transferred to another microbe, which introduces problems in analyzing metagenomic data. In practice, microbes can also have multiple copies of the 16S rRNA gene, varying from 1 to 15 (Case et al. 2007; Klappenbach et al. 2001), and together with horizontal gene transfer this makes the abundances difficult to estimate. Recently, other housekeeping genes, e.g., rpoB, amoA, pmoA, nirS, nirK, nosZ, and pufM, have been used (in addition to the 16S rRNA gene) for identifying different species in a metagenomic sample.

Sequencing Whole Genome

Since using a single biomarker or only a few biomarkers to represent a species can be problematic, another way to analyze metagenomic data is to sequence the whole genomes of the different microbes in the sample. With the help of high-throughput next-generation sequencing techniques, biologists can sequence the whole genomes of all microbes in a sample with reasonably high sequencing depth.

Assembling Reads

As the read lengths of next-generation sequencing are much shorter than the genomes of microbes, analyzing sequenced reads directly is difficult, especially for the Illumina platform. One possible way is to assemble overlapping short reads into longer contigs before analysis (Mende et al. 2012). Although there are many existing assembly algorithms (Vyahhi et al. 2012; Peng et al. 2010) designed for genomic data, they cannot be applied to metagenomic data directly for the following reasons:

1. Abundances of different microbes vary in metagenomic data. Since erroneous reads
introduce ambiguity into assembly, existing genomic assemblers try to identify erroneous reads and remove them before assembling. Based on the assumption that erroneous reads are sampled fewer times than correct reads, these genomic assemblers usually treat reads, or length-k substrings of reads called k-mers, that have a low sampling rate (multiplicity) as erroneous and remove them before assembling. However, since the abundances of microbes vary a lot in metagenomic data, correct reads and k-mers from low-abundance microbes can be sampled far fewer times than erroneous reads and k-mers from high-abundance microbes. Genomic assemblers therefore fail to remove the erroneous reads and k-mers correctly and produce either very short contigs or incorrect long contigs.

2. Common regions across different microbes. Due to horizontal gene transfer and the existence of common housekeeping genes, some common patterns can appear in multiple genomes. As the read length can be shorter than these common patterns, genomic assemblers cannot determine the genomic sequences of microbes near their common patterns. Although a similar problem also appears in assembling genomic data, the number of common patterns in metagenomic data is much larger than that in genomic data (Peng et al. 2011). As a result, shorter or erroneous contigs will be produced by existing genomic assemblers.

3. Huge data size. As the number of microbes in a metagenomic sample is huge, a high sequencing depth is required to obtain enough reads (say 10× coverage) from each microbe (especially the low-abundance microbes). Thus, the total amount of input reads for assembling metagenomic data (e.g., 200G nucleotides in the metagenomic data of cow stomach (Qin et al. 2010) and over 100G nucleotides required for studying a soil metagenome (Frisli et al. 2013)) can be much larger than that for genomic data. How to store and process this huge amount of reads becomes a big problem.

Due to the above problems, several assemblers have been developed for assembling metagenomic data, including Genovo (Laserson et al. 2011) for 454 pyrosequencing and MetaVelvet (Namiki et al. 2012), Ray Meta (Boisvert et al. 2012), Meta-IDBA (Peng et al. 2011), and IDBA-UD (Peng et al. 2012) for the Illumina platform. Since 454 pyrosequencing reads are longer than those produced by the Illumina platform and the number of input reads is much smaller, Genovo stores all the input reads and calculates their pairwise overlaps. It then calculates the probability that a set of reads is sampled from the same contig using a Bayesian approach and applies a series of hill-climbing steps to obtain a set of contigs with the highest likelihood. However, this approach fails when the number of input reads increases (Boisvert et al. 2012). Because of the huge amount of input reads, MetaVelvet, Ray Meta, Meta-IDBA, and IDBA-UD all assemble contigs using the de Bruijn graph approach. A de Bruijn graph represents the connections among a set of reads using k-mers, the length-k substrings of the reads. Each k-mer in the reads is represented by a vertex, and there is an edge from vertex u to vertex v if and only if k-mers u and v appear consecutively in at least one read, i.e., the length-(k-1) suffix of u is the same as the length-(k-1) prefix of v. Thus, a contig is represented by a path in the de Bruijn graph. Because of sequencing errors and common regions among different genomes, paths representing different genomes may overlap and the de Bruijn graph becomes complicated. Existing metagenomic assemblers apply different approaches to decompose the de Bruijn graph or to determine contigs directly from it. Meta-IDBA decomposes the de Bruijn graph based on the observation that there are more interconnections between k-mers sampled from the same genome than between k-mers sampled from different genomes. After decomposition, paths representing different genomes are separated and can be reconstructed more easily. MetaVelvet decomposes the de Bruijn graph based on the multiplicities of k-mers.
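The de Bruijn graph construction described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used by MetaVelvet, Ray Meta, Meta-IDBA, or IDBA-UD: the function names (build_de_bruijn, extend_unambiguous_path) and the toy reads are assumptions made for the example, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring of a read, from left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Build the k-mer graph: one vertex per k-mer, and an edge u -> v
    whenever u and v occur consecutively in at least one read."""
    edges = defaultdict(set)
    multiplicity = defaultdict(int)
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            multiplicity[kmer] += 1          # how often the k-mer was sampled
            if previous is not None:
                edges[previous].add(kmer)    # (k-1)-suffix of u == (k-1)-prefix of v
            previous = kmer
    return edges, multiplicity


def extend_unambiguous_path(edges, start):
    """Walk from `start` while the current vertex has exactly one successor,
    spelling out the contig that the path represents."""
    contig = start
    current = start
    visited = {start}
    while len(edges.get(current, ())) == 1:
        (successor,) = edges[current]
        if successor in visited:             # stop on cycles
            break
        contig += successor[-1]              # each step adds one new base
        visited.add(successor)
        current = successor
    return contig


if __name__ == "__main__":
    # Hypothetical overlapping reads sampled from the sequence ATGGCGTGCAA.
    toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
    graph, counts = build_de_bruijn(toy_reads, k=4)
    print(extend_unambiguous_path(graph, "ATGG"))
```

With these toy reads every vertex has a single successor, so the walk recovers the whole sequence as one contig. In real metagenomic data, vertices with several successors (caused by errors or by regions shared between genomes) are exactly where the walk stops and where the decomposition strategies discussed in this section are needed.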
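The role of k-mer multiplicity, and the reason a single global cutoff breaks down when abundances are uneven, can be illustrated with a small Python sketch. The read counts below are invented for the example, and the helper names (count_kmers, keep_by_global_threshold) are hypothetical; no existing assembler is being reproduced here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count the multiplicity of every k-mer over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def keep_by_global_threshold(counts, threshold):
    """Keep k-mers whose multiplicity reaches one global threshold,
    as a genomic assembler would do before assembling."""
    return {kmer for kmer, count in counts.items() if count >= threshold}


if __name__ == "__main__":
    # Assumed toy sample: a high-abundance genome, a few erroneous copies
    # of its read, and a correct read from a low-abundance genome.
    sample = ["ACGTAC"] * 20 + ["ACGAAC"] * 3 + ["TTGGCA"] * 2
    counts = count_kmers(sample, k=4)
    kept = keep_by_global_threshold(counts, threshold=3)
    print("ACGA" in kept)   # erroneous k-mer from the abundant genome
    print("TTGG" in kept)   # correct k-mer from the rare genome
```

In this toy sample the erroneous k-mers of the abundant genome (multiplicity 3) survive a threshold of 3, while the correct k-mers of the rare genome (multiplicity 2) are discarded. Lowering the threshold keeps more errors and raising it loses more of the rare genome, which is the motivation for thresholds derived from local information.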
By determining local peaks in the distribution of k-mer multiplicities, MetaVelvet decomposes the de Bruijn graph according to the multiplicities. As k-mers sampled from different genomes may have similar multiplicities and k-mers sampled from the same genome can have different multiplicities (due to sequencing bias), IDBA-UD calculates the average multiplicity of the k-mers in the same contig and uses it to identify erroneous k-mers and k-mers sampled from different genomes. As the threshold is determined locally, it can decompose the de Bruijn graph more accurately than Meta-IDBA and MetaVelvet, which use global thresholds. Ray Meta takes another approach to constructing contigs. Instead of decomposing the de Bruijn graph, it applies a heuristics-guided graph traversal to reconstruct the contigs. Although all the above assemblers try to reconstruct contigs from metagenomic data, short contigs (several thousand nucleotides) and chimeric contigs (misassembled contigs that join sequences from different genomes) can still result because of the high diversity of metagenomic data.

Since the number of k-mers is large, research has been carried out on storing the de Bruijn graph using less memory. Several efficient data structures have been developed based on the Bloom filter (Chikhi and Rizk 2012; Pell et al. 2012). A Bloom filter uses a hash table and several hash functions to record the existence of k-mers. When storing a k-mer, each hash function calculates an address based on the pattern of the k-mer, and all these addresses are set to 1 in the hash table. Thus, the existence of a k-mer in the reads can be determined by checking several bits in the hash table. Although there may be some false-positive k-mers, the number of false positives is small when the hash table is large enough and there are multiple hash functions.

Binning

After reconstructing contigs, each long contig can be aligned to known reference genomes in the database to identify the microbes in the sample (Huson et al. 2011). Even when there is no similar reference genome in the database, gene sequences may be predicted (Rho et al. 2010) using different classifiers, which helps in analyzing the metabolism of the unknown microbes. However, for contigs sampled from microbes without a reference genome and for low-abundance microbes without enough reads to assemble long contigs, a binning approach is required. Note that since most microbes cannot be cultivated and their genomes are still unknown, many reads and contigs cannot be aligned to any reference genome in the database.

Binning reads and contigs means clustering reads and contigs sampled from the same microbe using properties they have in common. Composition-based methods use generic features, e.g., GC content, codon usage, dinucleotide distribution, and 4-mer distribution, to classify reads sampled from different genomes. Existing supervised or semi-supervised binning algorithms (Brady and Salzberg 2009; McHardy et al. 2006) construct a classifier to determine the source of reads based on the reference genomes in the database. Compared with alignment-based methods, these algorithms do not require the exact reference genome. Instead, a classifier can be constructed from a similar genome in the database, so that more reads can be binned. However, as there are a limited number of reference genomes in the database, many reads still cannot be classified correctly. Some binning algorithms are designed to cluster reads sampled from the same genome using properties of the reads directly, without any reference genomes.

MetaCluster 3.0 (Yang et al. 2010) clusters reads based on their 4-mer distributions. Given two long reads from the same genome, the occurrence frequencies of the different 4-mers on the two reads should be similar (Zhou et al. 2008). MetaCluster 3.0 calculates the pairwise Spearman distance between reads based on their 4-mer distributions and clusters the reads using the k-means clustering method. However, MetaCluster 3.0 can only handle metagenomic data with similar abundances and long reads (500 bp or more). In order to bin short reads of length about 100 bp, AbundanceBin (Wu and Ye 2011) and TOSS (Tanaseichuk et al. 2012) consider the occurrence frequency of k-mers (k = 25) in all the reads. k-mers that occur frequently should be sampled
from high-abundance microbes, while k-mers that occur rarely should be sampled from low-abundance microbes. Based on this assumption, AbundanceBin and TOSS can bin reads according to their k-mer frequencies. However, when the abundances of two microbes are similar (abundance ratio within 1:3), these algorithms fail to separate the reads sampled from the two microbes. MetaCluster 4.0 (Wang et al. 2012a) improves on MetaCluster 3.0 by combining overlapping short reads into long virtual contigs and estimating the 4-mer or 5-mer distribution of the virtual contigs. As the virtual contigs are much longer than the short reads, the 4-mer distribution of the virtual contigs can be estimated accurately. By constructing a huge number of small clusters and merging clusters with similar 4-mer distributions, MetaCluster 4.0 can handle metagenomic data with microbes of different abundances. However, these unsupervised binning algorithms cannot handle low-abundance microbes well because they cannot distinguish reads sampled from these low-abundance microbes from erroneous reads sampled from high-abundance microbes. MetaCluster 5.0 (Wang et al. 2012b) is designed for binning reads from both high- and low-abundance microbes. It performs binning in two rounds. In the first round, its target is to bin reads sampled from high-abundance microbes using strict parameters for constructing virtual contigs and clustering reads. Reads sampled from low-abundance microbes are handled in the second round using less strict parameters. By applying multiple rounds of binning, MetaCluster 5.0 can bin reads from microbes with sequencing depth as low as 6× in a metagenomic dataset containing 100 microbes. However, it still cannot bin reads sampled from microbes with sequencing depth lower than 6×.

Conclusion

Assembling and binning reads are two important procedures for analyzing metagenomic data. The high biodiversity and large variations in the abundances of genomes in metagenomic data make the problems challenging. A common practice for analyzing metagenomic data is to assemble short reads into longer contigs and then try to identify the microbes in the sample by aligning the contigs and unassembled reads to reference genomes. As most microbes have no reference genome in the database, the unaligned reads and contigs should be binned together using generic features, e.g., GC content, codon usage, dinucleotide distribution, and 4-mer distribution. A previous study shows that binning contigs instead of reads can improve the accuracy of binning, because long contigs carry more generic information than short reads. However, little research has been done on how to improve the result of assembly using binning. Moreover, researchers usually use the information in reference genomes only through alignment and supervised binning. In fact, similar genomes in the database may also be used to improve the performance of de novo assembly. As the performance of existing de novo assemblers and binning algorithms on real biological data is not satisfactory, further research on combining assembly, binning, and the use of reference genomes may be a possible way to improve the analysis of metagenomic data.

References

Boisvert S, Raymond F, Godzaridis E, et al. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012;13(12):R122.
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6:673–6.
Case RJ, Boucher Y, Dahllof I, et al. Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol. 2007;73:278–88.
Chikhi R, Rizk G. Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algoritm Bioinforma. 2012;7534:236–48.
Eisen JA. Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol. 2007;5(3):e82.
Frisli T, Haverkamp TH, Jakobsen KS, et al. Estimation of metagenome size and structure in an experimental soil microbiota from low coverage next-generation sequence data. J Appl Microbiol. 2013;114(1):141–51.
Gilbert JA, Dupont CL. Microbial metagenomics: beyond the genome. Ann Rev Mar Sci. 2011;3:347–71.
Hunter CI, Mitchell A, Jones P, et al. Metagenomic analysis: the challenge of the data bonanza. Brief Bioinform. 2012;13(6):743–6.
Huson DH, Mitra S, Ruscheweyh HJ, et al. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21:1552–60.
Klappenbach JA, Saxman PR, Cole JR, et al. rrndb: the ribosomal RNA operon copy number database. Nucleic Acids Res. 2001;29:181–4.
Laserson J, Jojic V, Koller D. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18(3):429–43.
McHardy AC, Martin HG, Tsirigos A, et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2006;4:63–72.
Mende DR, Waller AS, Sunagawa S, et al. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS ONE. 2012;7(2):e31386.
Namiki T, Hachiya T, Tanaka H, et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
Pell J, Hintze A, Canino-Koning R, et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci. 2012;109(33):13272–7.
Peng Y, Leung HC, Yiu SM, et al. IDBA- a practical iterative de Bruijn graph de novo assembler. Res Comput Mol Biol. 2010;6044:426–40.
Peng Y, Leung HC, Yiu SM, et al. Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics. 2011;27:i94–101.
Peng Y, Leung HC, Yiu SM, et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with high uneven depth. Bioinformatics. 2012;28:1420–8.
Perna N, Plunkett III G, Burland V, et al. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature. 2001;409:529–33.
Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
Rappe MS, Giovannoni SJ. The uncultured microbial majority. Annu Rev Microbiol. 2003;57:369–94.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191.
Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975;94(3):441–8.
Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–45.
Tanaseichuk O, Borneman J, Jiang T. A probabilistic approach to accurate abundance-based binning of metagenomic reads. Algoritm Bioinforma. 2012;7534:404–16.
Tremaroli V, Backhed F. Functional interactions between the gut microbiota and host metabolism. Nature. 2012;489:242–9.
Vyahhi N, Pyshkin A, Pham S, et al. From de Bruijn graphs to rectangle graphs for genome assembly. Algoritm Bioinforma, LNCS. 2012;7534:249–61.
Wang Y, Leung HC, Yiu SM, et al. MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species. J Comput Biol. 2012a;19:241–9.
Wang Y, Leung HC, Yiu SM, et al. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics. 2012b;28:i356–62.
Weinstock GM. Genomic approaches to studying the human microbiota. Nature. 2012;489:250–6.
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):e1000667.
Wu YW, Ye Y. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol. 2011;18(3):523–34.
Yang B, Peng Y, Henry CM, et al. Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers. BMC Bioinforma. 2010;11 Suppl 2:S5.
Zhou F, Olman V, Xu Y. Barcodes for genomes and applications. BMC Bioinforma. 2008;9(1):546.

NGS QC Toolkit: A Platform for Quality Control of Next-Generation Sequencing Data

Ravi K. Patel and Mukesh Jain
Functional and Applied Genomics Laboratory, National Institute of Plant Genome Research (NIPGR), New Delhi, India

Synonyms

Format converters; Illumina; NGS data quality control; NGS data trimming; Roche 454

Definition

NGS QC Toolkit is a Perl-based stand-alone program package for the quality control (QC) of next-generation sequencing (NGS) data. In addition to QC tools, it consists of many subsidiary tools for handling and processing of data obtained from Illumina and Roche 454 sequencing platforms. The open-source toolkit is freely available at http://www.nipgr.res.in/ngsqctoolkit.html.
Introduction

The need for fast and high-throughput sequencing has resulted in the development of NGS technologies. The advent of these technologies has transformed genomics research by providing an opportunity to study genetic information at single-base resolution in a cost-effective manner (Metzker 2010). However, NGS data usually contain several artifacts due to technical errors and limitations associated with the different NGS platforms. These sequence artifacts, including read errors, poor-quality reads, and primer/adaptor contamination, might affect downstream sequence analysis, such as de novo genome and transcriptome assembly, gene expression studies, and single nucleotide polymorphism detection. To avoid misleading conclusions, it is necessary to filter the NGS data for these sequence artifacts (Benaglio and Rivolta 2010).

NGS platform vendors have developed commercial QC pipelines dedicated to mitigating the effect of the limitations associated with their platforms. However, even after processing through these pipelines, many sequence artifacts remain in the data. Several efforts have been made to resolve one or another of the sequence artifacts, but many of them are specific to a particular sequencing platform. NGS QC Toolkit (Patel and Jain 2012) can handle many of the known sequence artifacts in Illumina and Roche 454 sequencing data. It is a stand-alone and user-friendly toolkit written in the Perl programming language, with a modularized structure supported by several subroutines for various tasks, which allows better maintainability. The toolkit comprises many easy-to-use tools for quality check and filtering, trimming, generating statistics, and conversion between different file formats/variants for Illumina and Roche 454 sequencing data (Fig. 1).

QC Workflow

The toolkit provides dedicated tools for the QC of single-end (SE) and paired-end (PE) data from the Illumina (IlluQC tools) and Roche 454 (454QC tools) sequencing platforms. Although the various parameters in these tools are set to sensible default values, they can be adjusted by users to optimize the QC analysis, which makes these tools versatile for different NGS assays. IlluQC can identify the different FASTQ file variants (Cock et al. 2009) and set the quality scoring system accordingly for further analysis. Reads are analyzed based on their quality, and poor-quality reads that do not fulfill the user-specified criteria are discarded. The filtered reads are checked for primer/adaptor sequence contamination, and the matching reads are discarded. The high-quality filtered data is exported as output along with various quality statistics. The 454QC tools read FASTA files and filter reads based on the specified length cutoff at several stages in the analysis. They can also trim reads containing homopolymer(s) longer than a specified length. Further, the quality check and primer/adaptor sequence match are performed in a manner similar to the IlluQC tools. However, unlike the IlluQC tools, the 454QC tools trim the respective ends of a read showing a primer/adaptor match. Eventually, the high-quality reads are exported in FASTA format. Processing of Roche 454 PE data (using 454QC_PE.pl) requires an additional step of finding the linker sequence to separate and process both end reads simultaneously.

Key Characteristics

While NGS QC Toolkit shares its features with many other QC tools (Schmieder and Edwards 2011; Cox et al. 2010; Lassmann et al. 2009; Pandey et al. 2010), it also provides a few unique attributes for the QC analysis of NGS data. In addition to high-quality filtered data output, it is equipped with modules for generating several different kinds of statistics in graphical format, along with text files, to help users better understand the data quality (Patel and Jain 2012).

Reduced Computational Time and Storage Space Requirement

Continued improvement in NGS technologies has achieved larger read length and manyfold
increase in throughput. To reduce the time required for the QC of several gigabases of sequence data, parallel computing has been implemented in the QC tools. A significant decrease in analysis time was evident when using the parallelized QC tools on multi-core computer systems (Patel and Jain 2012). Nevertheless, the tools can also be run on single-core computers without any additional requirement. Another challenge with huge NGS datasets is the increased storage space requirement, which is considerably reduced by the use of compressed (gzip) files. The high-quality filtered output data in compressed gzip files can be used directly for downstream analysis, which saves a large amount of storage space.

Conservation of PE Data Integrity

PE sequencing data helps increase sequence coverage and confidence in the alignment, which is crucial for downstream analysis. However, surprisingly, few QC pipelines other than the NGS QC Toolkit maintain the pairing information of the PE data in the filtered data. The QC tools analyze both reads of each pair concurrently and export the high-quality filtered PE data along with the unpaired reads (when only one read of the pair passes the QC filters). In this way, the QC tools maintain PE data integrity and try to retain all important high-quality sequencing data.

Homopolymer Trimming

A major artifact is introduced in Roche 454 pyrosequencing data by the use of pyrophosphate for the detection of incorporated bases. It was found that the linearity of signal intensity is disturbed when a longer homopolymer is encountered (Margulies et al. 2005). This artifact may affect downstream analysis by introducing frameshifts. The 454QC tools provide an optional parameter to trim homopolymers of a given minimum threshold length.

FASTQ Variant Detection

The use of inconsistent variants of the FASTQ format by different sequencing platforms makes it difficult for users to apply appropriate tools for the analysis, because the quality scoring system varies with the variant (Cock et al. 2009). To make the analysis easier, the IlluQC tools are programmed to first identify the input FASTQ variant automatically and set the appropriate scoring system for further QC analysis.

Additional Tools

Apart from the QC tools, a number of additional tools are provided in the toolkit to manage and generate statistics for NGS data (Fig. 1). A set of sequence format converter tools convert between different variants of the FASTQ format based on the equations described previously (Cock et al. 2009). The toolkit also provides tools for conversion between the FASTQ and FASTA formats. The TrimmingReads.pl tool can trim reads based on two criteria. It can trim a given number of bases from the 5′ and/or 3′ end of the reads. The other mode of trimming is to trim low-quality bases from the 3′ end of the reads using a user-defined quality score threshold. HomopolymerTrimming.pl, as the name suggests, clips the 3′ end of a read starting from the first nucleotide of a homopolymer of user-defined cutoff length. A tool newly introduced at the request of users, AmbiguityFiltering.pl, helps to filter reads containing ambiguous bases (N/X content) or to trim flanking ambiguous bases. A couple of tools, AvgQuality.pl and N50Stat.pl, generate statistics to help nonexpert users to access various sequence statistics.

Installation

The toolkit requires a Perl interpreter and a few additional Perl modules such as GD (optional; required to generate QC graphs) and String::Approx. Users need to download the NGSQCToolkit zip folder from the website. The toolkit is ready to use just after unzipping the folder. The distribution includes all the tools along with a user manual, which provides important links for module installation and describes the tools and their usage in detail. The tools report missing dependencies if required modules are not found or are improperly installed.
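The two trimming modes described under Additional Tools (quality-based trimming at the 3′ end and clipping at a homopolymer) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Perl code of TrimmingReads.pl or HomopolymerTrimming.pl: the function names, the list-of-integers quality representation, and the exact cutoff semantics are all hypothetical.

```python
import re


def trim_low_quality_3prime(sequence, qualities, threshold):
    """Remove bases from the 3' end until a base with quality >= threshold
    is reached. `qualities` holds one integer score per base."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < threshold:
        end -= 1
    return sequence[:end], qualities[:end]


def trim_at_homopolymer(sequence, min_length):
    """Clip the read starting at the first base of the first homopolymer
    run that is at least `min_length` bases long (assumed semantics)."""
    if min_length < 2:
        raise ValueError("min_length must be at least 2")
    # (.) captures one base; \1{n-1,} requires n-1 or more repeats of it.
    match = re.search(r"(.)\1{%d,}" % (min_length - 1), sequence)
    if match is None:
        return sequence
    return sequence[:match.start()]


if __name__ == "__main__":
    seq, quals = trim_low_quality_3prime("ACGTAC", [30, 30, 30, 30, 10, 5], 20)
    print(seq, quals)
    print(trim_at_homopolymer("ACGTAAAAAAGT", 6))
```

A real QC tool would typically also discard reads that become too short after trimming; that step is left out to keep the sketch focused on the two trimming rules themselves.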
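As an illustration of the FASTQ variant detection discussed under Key Characteristics, the following Python sketch guesses the variant from the range of quality characters, using the ASCII offsets documented by Cock et al. (2009): 33 for the Sanger variant and 64 for the Solexa and Illumina 1.3+ variants. How IlluQC actually performs the detection is not described here, so the heuristic, the function names, and the variant labels are assumptions. The guess is also inherently ambiguous when a file contains only high-quality characters.

```python
# ASCII offsets of the three FASTQ variants described by Cock et al. (2009).
OFFSETS = {"sanger": 33, "solexa": 64, "illumina": 64}


def guess_fastq_variant(quality_strings):
    """Guess the FASTQ variant from the lowest quality character seen.

    Characters below ';' (ASCII 59) only occur with the Sanger offset of 33;
    characters from ';' to '?' (59..63) encode negative Solexa scores;
    otherwise the data are assumed to be Illumina 1.3+ (Phred+64)."""
    characters = [ch for quality in quality_strings for ch in quality]
    if not characters:
        raise ValueError("no quality characters to inspect")
    lowest = min(ord(ch) for ch in characters)
    if lowest < 59:
        return "sanger"
    if lowest < 64:
        return "solexa"
    return "illumina"


def decode_qualities(quality_string, variant):
    """Convert a quality string to integer scores for the given variant.
    For 'solexa' the result is Solexa-scaled, not Phred-scaled."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    print(guess_fastq_variant(["IIII#"]))
    print(decode_qualities("I5", "sanger"))
```

Once the variant is known, the same offset table is enough to re-encode qualities from one variant to another, except for Solexa scores, which need the score conversion equations referred to in the text.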
[Fig. 1: diagram of the NGS QC Toolkit showing its tools grouped into four categories]

Quality Control
IlluQC.pl: QC of Illumina data
IlluQC_PRLL.pl: Parallel QC of Illumina data
454QC.pl: QC of Roche 454 data
454QC_PRLL.pl: Parallel QC of Roche 454 data
454QC_PE.pl: QC of Roche 454 paired-end data

Trimming
TrimmingReads.pl: Trimming of reads from the ends
HomoPolymerTrimming.pl: Trimming of reads at 3′ end from the homopolymer of user-specified length
AmbiguityFiltering.pl: Filtering/trimming reads for ambiguous bases

Format Conversion
FastqTo454.pl: Separation of sequence and read quality in different files from FASTQ file
FastqToFasta.pl: Conversion of FASTQ file to FASTA format
SangerFastqToIlluFastq.pl: Conversion of read quality encoding from Sanger to Illumina variant (FASTQ file)
SolexaFastqToIlluFastq.pl: Conversion of read quality encoding from Solexa to Illumina variant (FASTQ file)

Statistics
AverageQuality.pl: Calculation of average quality score for each read in FASTA file
N50Stat: Calculation of various statistics for sequences in FASTA format

NGS QC Toolkit: A Platform for Quality Control of Next-Generation Sequencing Data, Fig. 1 Various QC and data processing tools included in the NGS QC Toolkit
Toolkit Updates

Continuous support and updates have played a crucial role in the popularity of the NGS QC Toolkit among researchers working on NGS data analysis. It has been under active development since it was first developed more than 3 years ago. Several updates have been made to make the toolkit compatible with the ever-evolving sequencing technologies and to fulfill the requirements of users (http://www.nipgr.res.in/ngsqctoolkit.html).

Summary

NGS QC Toolkit is an open-source stand-alone toolkit for the QC of NGS data, which can be used on any operating system with the prerequisites installed. It offers user-friendly parallel computing QC tools for the quality check and filtering of Illumina and Roche 454 sequencing data. These tools provide various parameters to optimize the QC analysis of different kinds of NGS assays. In addition, the toolkit comprises numerous supplementary tools for handling/processing of NGS data. This toolkit is being regularly modified and improved to accommodate users’ requirements and make it compatible with changing sequencing data file formats. It is anticipated that this toolkit will provide an easy platform for QC analysis of NGS data, even for non-bioinformaticians.

Cross-References

▶ A De Novo Metagenomic Assembly Program for Shotgun DNA Reads
▶ DNA Methylation Analysis by Pyrosequencing

References

Benaglio P, Rivolta C. Ultra high throughput sequencing in human DNA variation detection: a comparative study on the NDUFA3-PRPF31 region. PLoS One. 2010;5(9):e13071.
Cock PJA, Fields CJ, Goto N, et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2009;38:1767–71.
Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics. 2010;11:485.
Lassmann T, Hayashizaki Y, Daub CO. TagDust-a program to eliminate artifacts from next generation sequencing data. Bioinformatics. 2009;25:2839–40.
Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11:31–46.
Pandey RV, Nolte V, Schlotterer C. CANGS: a user-friendly utility for processing and analyzing 454 GS-FLX data in biodiversity studies. BMC Res Notes. 2010;3:3.
Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One. 2012;7(2):e30619.
Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011;27:863–4.

Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome

Digvijay Verma and Tulasi Satyanarayana
Department of Microbiology, University of Delhi, New Delhi, India

Synonyms

Community genomics; Culture-independent approach; Environmental genomics;

restricted and the fragments are cloned. The clones are screened to select colonies with the desired xylanase gene, and the insert is sequenced and the gene is subcloned and expressed. The recombinant xylanase is purified and characterized and tested for its applicability in generating xylo-oligosaccharides from agro-residues and pulp bleaching.

Introduction

Hemicellulosic components are an integral part of lignocellulosic residues and the second most abundant renewable polymer of plant cell walls after cellulose. Xylan is the main constituent of the hemicelluloses of lignocellulosic agro-residues. β-1,4-linked xylosyl residues form the backbone of xylan, which makes it a homopolysaccharide. Since xylan contains several groups such as arabinosyl, acetyl, and glucuronosyl residues in the side chains, xylans are heteropolysaccharides (Hori and Elbein 1985; Coughlan and Hazlewood 1993). Heteropolymeric xylan requires the synergistic action of multiple xylanolytic enzymes for complete degradation. The complex xylanolytic system includes endoxylanase (1,4-β-D-xylan xylanohydrolase; EC 3.2.1.8), β-xylosidase (1,4-β-D-xylan xylohydrolase; EC 3.2.1.37), α-glucuronidase, α-L-arabinofuranosidase, and acetyl xylan esterase. The CAZY database (http://www.cazy.org/fam/acc_GH.html) classified xylanases into six glycosyl hydrolase families: GH5, GH8, GH10, GH11, GH30, and GH43 (Collins et al. 2005). Family 10 and 11 xylanases are, however, widely distributed in nature. Owing
Endoxylanase; Endo-b-1,4 xylanase; Thermo- to low molecular weight and substrate stringency,
alkali-stable xylanase; Xylanase family 11 xylanases are considered as true
xylanases, while GH10 xylanases share broad
substrate specificity with higher molecular
Definition weight.
Xylanases have successfully been used in var-
For retrieving genes encoding thermo-alkali- ious industries like ramie fiber degumming, food
stable xylanases by culture-independent processing, and textile, biofuels, feed, and paper/
(metagenomic) approach, the DNA extracted pulp industries. However, xylanases must be
from hot and alkaline environmental samples is alkalistable and thermostable to withstand the
extreme conditions prevailing in the paper industries in the pre-bleaching of kraft pulp. Although several xylanases have been reported from a large number of microorganisms, most of them do not have adequate thermostability and alkalistability for use in the paper and pulp industries. The majority of xylanases have been obtained from the culturable 0.1–1 % of the total microbial diversity existing in natural environments. Culture-independent metagenomic approaches permit retrieval of genes encoding useful enzymes from environmental samples without laborious and elaborate methods of cultivating microbes. The immense demand for alkalistable and thermostable xylanases encouraged us to adopt this innovative strategy for retrieving genes that encode thermo-alkali-stable xylanases from environmental metagenomes.
In this investigation, a metagenomic library was constructed and screened for clones with xylanase activity. The xylanase-encoding gene (Mxyl) (accession no. AFP81696) was subcloned and expressed, and the recombinant xylanase was purified and characterized. To the best of our knowledge, this is the first report on retrieving a thermo-alkali-stable GH 11 family xylanase by a metagenomic approach.

Methodology

Collection of Samples and Construction of Metagenomic Library
The samples of compost soil were collected in sterile polyethylene bags from the vicinity of a hot water spring near Fukuoka, Japan, and stored at 4 °C. The pH of the samples is in the acidic range (3.0–4.5). Soil DNA was extracted according to Verma and Satyanarayana (2011). Metagenomic DNA was processed for constructing the metagenomic library. Five µg of metagenomic DNA was partially digested with 0.5 U of restriction enzyme Sau3AI. The fragments of 3–12 kb were eluted from agarose gel (1.2 %, w/v) with a gel extraction kit according to the manufacturer's protocol (Macherey-Nagel, Germany). One hundred nanograms of insert DNA and 300 ng of BamHI-digested and dephosphorylated p18GFP vector were ligated using T4 DNA ligase overnight at 16 °C. The ligation mixture was transformed into competent E. coli DH10B cells by the heat shock method. The metagenomic library was spread and screened for xylanase activity on LB-ampicillin agar plates containing 0.3 % (w/v) RBB-xylan (4-O-methyl-D-glucurono-D-xylan-remazol brilliant blue R) (Sigma, St. Louis, MO, USA). The transformants were grown at 37 °C overnight and observed for the zone of xylan hydrolysis.

Screening for Xylanase and Sequence Analysis
The pure clone (TSDV-MX1) showing a clear zone of xylan hydrolysis was sequenced using M13 forward and reverse primers followed by different internal primers on an Applied Biosystems 373 stretch automated sequencer (Applied Biosystems, Foster City, CA, USA) at the Nucleic Acid Sequencing Facility of the University of Delhi South Campus, New Delhi (India), to obtain the full sequence of the insert. The ORFs were identified using NCBI's open reading frame (ORF) finder tool (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). BLASTN and BLASTP of NCBI were used to align the nucleotide and amino acid sequences, respectively. Multiple alignments of the amino acid sequences were carried out using the CLUSTALW program (http://www.ebi.ac.uk/clustalW). The phylogenetic analysis was done using MEGA 2.1 with the neighbor-joining strategy.

Construction and Expression of Plasmids pET28a-Mxyl and pET22b-Mxyl
The xylanase gene was amplified and ligated into the digested vectors, followed by transformation into competent E. coli XL1 blue cells to obtain pET28a-Mxyl and pET22b-Mxyl. The recombinant constructs were confirmed by colony PCR followed by double digestion of the construct with restriction enzymes. The clones having the xylanase gene were transformed into E. coli BL21(DE3) and processed for sequencing.
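The ORF identification step described under Screening for Xylanase and Sequence Analysis can be illustrated with a small sketch. This is not the algorithm of NCBI's ORF finder, only a minimal forward-strand scan under simplifying assumptions: ORFs start at ATG, end at the first in-frame stop codon, and the reverse strand is ignored. The function name `find_orfs`, the `min_len` threshold, and the toy sequence are hypothetical and chosen for illustration only.

```python
# Minimal ORF scan over the three forward reading frames (illustrative only).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=30):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_len is a hypothetical length cut-off in nucleotides.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:   # keep only ORFs above the cut-off
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs

# Toy sequence (not from the study): ATG + 10 alanine codons + TAA.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
print(find_orfs(toy))  # [(2, 38, 2)]
```

A real insert would also need the reverse complement scanned and alternative start codons considered, which dedicated tools handle.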
The recombinant plasmid having the accurate sequence was then transformed into E. coli BL21(DE3) competent cells for the expression of recombinant proteins from pET28a-Mxyl and pET22b-Mxyl. The expression was induced by adding isopropyl-β-D-1-thiogalactopyranoside (IPTG) to a final concentration of 1 mM, and the culture was further cultivated at 30 °C. The samples were collected at 1 h intervals for determining the enzyme titers. Localization of the recombinant protein was determined by collecting the intracellular, extracellular, and periplasmic fractions from the cells followed by assay for xylanase (Verma and Satyanarayana 2012).

Site-Directed Mutagenesis
Multiple sequence alignment of the recombinant xylanase with known xylanases revealed Glu117 and Glu209 to be catalytically important residues. This was confirmed experimentally by site-directed mutagenesis using the GeneArt site-directed mutagenesis kit (Invitrogen, Carlsbad, USA). Two point mutations (Glu117Asp and Glu209Asp) were created in the metagenomic xylanase gene and expressed in E. coli BL21(DE3) cells. The induced mutations were confirmed by sequencing.

Xylanase Assay
Xylanase was assayed according to Archana and Satyanarayana (1997) at 80 °C and pH 9.0. One unit of xylanase is defined as the amount of enzyme required to liberate 1 µmol of reducing sugar as xylose ml⁻¹ min⁻¹ under the assay conditions.

Purification and Biochemical Characterization of rMxyl
The rMxyl was purified by affinity chromatography using Ni2+-NTA agarose (Novagen, Germany) (Verma and Satyanarayana 2012). The characteristics of the recombinant xylanase, such as the effect of pH, temperature, metal ions, inhibitors, and detergents on enzyme activity, thermostability, and substrate specificity, have been studied. Kinetic properties of the recombinant enzyme (Km and Vmax) on different xylans from birchwood, beech wood, and oat spelt were calculated from Lineweaver-Burk double reciprocal plots.

Saccharification of Agro-residues/Hydrolysis of Xylan
One percent (w/v) standard xylo-oligosaccharides (X2–X6) and agro-residues (wheat bran, corncobs, and sugarcane bagasse) were treated with recombinant xylanase (10–20 U/g) to assess the hydrolysis of XOs and lignocellulosic substrates. All the substrates (wheat bran, corncobs, and sugarcane bagasse) were suspended in glycine-NaOH buffer (pH 9.0) and incubated at 80 °C. Aliquots were collected at the desired intervals and analyzed on silica-based TLC plates (Merck, Germany) to determine the hydrolysis products. The saccharification of agro-residues was determined using DNSA reagent (Miller 1959).

Results

Construction of Metagenomic Library, DNA Sequencing, and Bioinformatics Analysis

When 5.0 µg of high molecular weight (20–30 kb) metagenomic DNA was digested with Sau3AI and the fragments were ligated into p18GFP vector, the library was obtained with an efficiency of 3.6 × 10^4 clones per µg of DNA; the insert sizes were in the range of 3.0–8.0 kb with an average size of 5.5 kb. On screening, a clone having the xylanase gene was spotted on an LB-amp plate containing RBB-xylan. The full sequence of the insert was 6.231 kbp long, and BLAST analysis revealed its prokaryotic origin. The complete insert contained nine transcriptional units, including a complete ORF of the 1,077 bp long xylanase gene. The sequence showed putative −35 (CACGCCA) and −10 (TAAAAA) sequences and a ribosomal binding site (AGGGG) upstream of the xylanase gene, followed by the complete ORF having ATG and TAA as start and stop codons, respectively (Fig. 1). The xylanase displayed five conserved regions (I–V) of GH11 xylanase having two
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 1 Deduced amino acid sequence of recombinant xylanase (rMxyl) and its nucleotide sequence. The red underlined region is leader sequence; cyan-highlighted regions represent GH11 catalytic domain. Gray-highlighted regions are compositionally biased regions that were not used in database search and proposed as linker regions. Bluish-green-highlighted region depicts substrate binding domain

Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 2 Multiple sequence alignment of xylanase with other xylanases available in database. GenBank accession number and source of microorganisms were given as follows: 182406872 (glycosyl hydrolase family 11 precursor [uncultured bacterium]), 17826947 (Pseudomonas sp. ND137), 29367333 (uncultured Cellvibrio sp.), 388259220 (Cellvibrio sp. BR), 302868167 (Micromonospora aurantiaca ATCC 27029), 386849796 (Actinoplanes sp. SE50/110), 194368056 (Streptomyces sp. S27). Five signature sequences: I (AYLTLYGW), II (VEYYIVDN), III (FWQYWSV), IV (HFDAWASLG), and V (MATEGY) of GH11 family are colored. Two catalytically important residues (Glu 117 and Glu 209) are marked with black circle
catalytically important residues (Glu109 and Glu217) present in signature sequences II and V (Fig. 2). Amino acid homology showed maximum identity (79 %) with the xylanase gene of an uncultured bacterium and Actinoplanes sp. SE50/110 followed by a metagenomic GH11 xylanase (71 %). It shared 63–75 % homology with xylanases produced by Streptomyces spp. The xylanase retrieved in this investigation exhibits 75, 67, and 64 % similarity with the endo-1,4-β-xylanases of Cellulomonas fimi, Micromonospora aurantiaca 27029, and Amycolatopsis mediterranei U32, respectively. It, however, has lower homology with the xylanases of Microbulbifer hydrolyticus (63 %), Pseudomonas sp. ND137 (62 %), uncultured Cellvibrio sp. (58 %), Cellvibrio mixtus (57 %), and Aspergillus fumigatus AF293 (52 %) (Fig. 3).

Expression of the Xylanase Gene in E. coli and Localization of the Encoded Recombinant Xylanase (rMxyl)

The xylanase gene was successfully cloned into pET28a and pET22b vectors. The recombinant plasmids were expressed in E. coli BL21(DE3) on induction with 1.0 mM IPTG at A600 of 0.6–0.7 and 30 °C. At higher levels of expression, it led to the formation of inclusion bodies, which could be solubilized using 6.0 M urea. The highest titer of the recombinant enzyme was achieved in 4–6 h. The construct (pET28a-Mxyl) expressed a high proportion of xylanase in the cytoplasmic fraction (83 %), followed by the periplasmic (9 %) and extracellular (8 %) fractions after 4–5 h of induction. When the xylanase gene was cloned and expressed in pET22b(+) vector, a high proportion
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 3 Phylogenetic tree of recombinant xylanase. rMxyl showed highest homology with xylanase of Cellulomonas fimi ATCC 484 followed by uncultured microbial GH 11 xylanase. Neighbor-joining (NJ) tree is constructed by using MEGA 4.0 software. Bootstrap values (n = 1,000 replicates) are represented as percentage. The scale bar depicts the allowed changes per amino acid position
of intracellular enzyme (>60 %) was produced in the initial 3 h of induction, and thereafter, it declined. The periplasmic xylanase was optimum at 12 h, while the extracellular fraction gradually increased and reached a peak (29 %) in 24 h.

Site-Directed Mutagenesis

Muteins having Glu117Asp and Glu209Asp completely lost the activity. These two glutamates are highly conserved residues in the signature sequences LVEYYIVDN and MATEGY, and these are responsible for catalytic activity of GH 11 xylanase.

Purification, Biochemical Characterization, and Zymogram Analysis of rMxyl

The recombinant xylanase was purified by Ni2+-NTA resin affinity chromatography, and the purified recombinant protein could be eluted using imidazole (100–400 mM). The protein appeared as a single band of 40 kDa against the protein markers on 15 % SDS-PAGE, and the recombinant xylanase was revealed as a clear band of xylan hydrolysis by zymogram analysis (Fig. 4). The xylanase was active over a broad range of pH (6.0–12.0) with optimum at 9.0, and it retained ~55 % residual activity at pH 10.0 (Fig. 5a).
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 4 Analysis of rMxyl using SDS-PAGE (15 % polyacrylamide gel). (a). Lane 1 protein marker, Lane 2 and 3 are washes with 20 and 30 mM imidazole. Recombinant xylanase was eluted using different concentrations of imidazole (100, 200, 250, 300, 400, 450, 500 mM). Purified xylanase showed molecular mass of ~42 kDa on staining with Coomassie Brilliant Blue R-250. (b). Zymogram analysis of purified xylanase using Congo red staining method

The rMxyl is active in the temperature range between 40 °C and 100 °C (Fig. 5b) with optimum at 80 °C and retains more than 90–95 % activity after exposure to 60 °C and 70 °C for 3 h. The enzyme has a T1/2 of 2.0 h at 80 °C and 15 min at 90 °C (Fig. 5c). The recombinant enzyme did not lose activity after 3 h exposure to pH 8.0 and 9.0, and thereafter, it declined (50 % residual activity after 4 h). Approximately 20–45 % loss in activity was recorded on either side of the pH optimum after 1 h incubation (Fig. 5d). Mg2+, Sn2+, and Fe2+ stimulated rMxyl activity, while Hg2+ and Mn2+ strongly inhibited enzyme activity even at 1 mM. Other metal ions exerted varied inhibitory action on xylanase. More than 30 % activity was lost in the presence of Mn2+ (Table 1). NBS and PMSF inhibited the activity to a significant extent even at 1 mM concentration. β-ME and DTT strongly inhibited enzyme activity. A stimulatory effect of EDTA was recorded on xylanase activity. Most of the metal ions did not affect enzyme activity at 1 mM concentration. Xylanase activity was, however, significantly inhibited at higher concentration by Pb2+, Ag2+, Ca2+, Mn2+, Ba2+, Cd2+, and Co2+. In the presence of Hg2+, the enzyme lost activity completely. Similarly, trace amounts of β-mercaptoethanol (β-ME) and dithiothreitol (DTT) completely inhibited the xylanase activity. Inhibition in the presence of N-bromosuccinimide (NBS) signifies the role of tryptophan in catalysis, while the effect of EDTA confirms it as a non-metalloenzyme.

Saccharification of Agro-residues/Hydrolysis of Xylan

The rMxyl hydrolyzed xylan from various sources. The enzyme activity was very high on birchwood xylan (relative activity 100 %) in comparison with that on xylan from beech wood (97 %) and arabinoxylan (80 %). There was no activity on carboxymethylcellulose (CMC) and other non-xylan polysaccharides (starch, pullulan, and chitin). The Km and Vmax values of the enzyme on birchwood xylan are 8.0 ± 1.21 mg/ml and 300 ± 9.12 µmol/min/mg, respectively. The saccharification of wheat bran was high (15.2 %) as compared to that of corncobs (9.89 %) and sugarcane bagasse (4.71 %). Various xylo-oligosaccharides were detected in the hydrolysates (Fig. 6).

Discussion

Although several xylanases have been reported from diverse microbiota using traditional culture-dependent approaches, the majority of them do not endure the extreme temperature and alkaline conditions prevailing in industrial processes. An alternate strategy was, therefore, adopted to retrieve a thermo-alkali-stable xylanase gene (Mxyl) by a culture-independent metagenomic approach. The metagenomic library constructed with the DNA extracted from the compost-soil samples yielded a clone that produced xylanase. Although the compost soils are in the acidic pH range, an alkalistable and thermostable endoglucanase had been reported from rice straw compost (Son-Ng et al. 2009).
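The kinetic and stability figures reported in the Results (Km of about 8.0 mg/ml and Vmax of about 300 µmol/min/mg on birchwood xylan from Lineweaver-Burk plots; T1/2 of 2.0 h at 80 °C and 15 min at 90 °C) can be related to one another with a short sketch. The substrate concentrations and velocities below are hypothetical, noise-free values generated from the reported constants, not the authors' measurements, and the half-life conversion assumes first-order thermal inactivation, which the text does not state. All function names are illustrative.

```python
import math

def michaelis_menten(s, km, vmax):
    """Initial velocity at substrate concentration s."""
    return vmax * s / (km + s)

def lineweaver_burk(substrate, velocity):
    """Least-squares line through (1/S, 1/v); returns (Km, Vmax).

    Uses 1/v = (Km/Vmax) * (1/S) + 1/Vmax.
    """
    xs = [1.0 / s for s in substrate]
    ys = [1.0 / v for v in velocity]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    vmax = 1.0 / intercept
    km = slope * vmax
    return km, vmax

def residual_activity(t, half_life):
    """Percent activity left after time t, ASSUMING first-order inactivation."""
    k = math.log(2) / half_life          # inactivation rate constant
    return 100.0 * math.exp(-k * t)

# Hypothetical substrate series (mg/ml); velocities are synthetic.
s_values = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
v_values = [michaelis_menten(s, 8.0, 300.0) for s in s_values]

km_fit, vmax_fit = lineweaver_burk(s_values, v_values)
print(round(km_fit, 3), round(vmax_fit, 3))

# Activity after 1 h under the reported half-lives (times in hours).
print(round(residual_activity(1.0, 2.0), 1))    # 80 °C, T1/2 = 2.0 h
print(round(residual_activity(1.0, 0.25), 2))   # 90 °C, T1/2 = 15 min
```

With real, noisy data the double-reciprocal transform weights low-substrate points heavily, so a direct nonlinear fit of the Michaelis-Menten equation is often preferred; the linear form is shown here only because it is the method named in the Methodology.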
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 5 Effect of pH and temperature on the activity and stability of rMxyl. (a and b) The recombinant xylanase was incubated in various buffers (pH 3–12) and at various temperatures (40–100 °C) and assayed for xylanase activity. (c) Recombinant xylanase was incubated in glycine-NaOH buffer without substrate and kept at various temperatures. Aliquots were collected at various time intervals and stored at 0 °C for calculating residual activity. (d) Similarly, enzyme was incubated in various buffers (pH 8–11) and aliquots from different time intervals were used for xylanase assays
The culture-independent approach has started yielding useful biocatalysts from the hidden Pandora's Box of non-culturable microbial diversity. The protein encoded by the xylanase gene comprises 358 amino acids, of which 16 are acidic and 21 basic. The predicted molecular weight, pI, and instability index of recombinant xylanase are ~40 kDa, 8.8, and 33.44, respectively. The xylanase contained a 43-amino-acid-long leader sequence at the N-terminal region followed by a catalytic domain (44th–212th) of GH11 family interrupted by a short stretch of arginine- and threonine-rich non-catalytic region (WSVRQ2R2TG2TIT2). In addition, a serine-rich Q linker region (S2GS2DITVG2TS2G2TS2G2S3G2S10G4) has also been detected from amino acid 213 to 248 just after the catalytic domain. Such repeated amino acids make linker regions that usually separate the catalytic domain from the carbohydrate-binding domain (Gilkes et al. 1991). Moreover, linkers have also been reported as integral parts of various xylanases that connect thermo-stabilizing domains, surface
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Table 1 Effect of modulators on rMxyl activity

Metal ions     1 mM             5 mM             10 mM
Mg2+           106.45 ± 1.05    99.65 ± 0.98     87.38 ± 0.45
Fe2+           108.65 ± 0.75    116.01 ± 0.27    93.67 ± 1.32
Sn2+           110.43 ± 0.67    76.12 ± 0.44     45.17 ± 0.63
Ni2+           91.21 ± 0.22     79.01 ± 1.34     32.84 ± 0.43
Zn2+           91.67 ± 0.32     76.64 ± 0.78     32.89 ± 0.89
Pb2+           81.33 ± 0.67     20.78 ± 0.32     9.65 ± 0.67
K+             81.21 ± 1.08     20.62 ± 0.12     12.67 ± 0.45
Ag2+           73.48 ± 0.53     54.55 ± 0.69     27.83 ± 0.98
Ca2+           72.43 ± 0.43     35.45 ± 0.21     12.09 ± 0.19
Mn2+           71.76 ± 0.63     27.34 ± 1.32     9.67 ± 0.27
Ba2+           66.45 ± 0.67     23.91 ± 0.34     18.65 ± 0.33
Cd2+           54.67 ± 0.43     29.33 ± 0.49     12.87 ± 0.65
Co2+           59.15 ± 1.23     29.63 ± 0.65     12.54 ± 1.12
Na+            61.43 ± 0.78     39.75 ± 1.06     27.35 ± 0.78
Cu2+           29.12 ± 0.18     15.76 ± 0.76     10.09 ± 0.87
Hg2+           0                0                0

Inhibitors     1 mM             5 mM             10 mM
NBS            46.66 ± 0.12     35.67 ± 0.09     20.12 ± 0.11
IAA            103.45 ± 0.54    89.75 ± 0.32     69.85 ± 1.56
β-ME           0                0                0
DTT            0                0                0
EDTA           105.65 ± 1.23    107.19 ± 1.01    89.98 ± 0.56

Detergents     0.1 % (v/v)      0.5 % (v/v)
Tween 20       103.45 ± 1.32    105.67 ± 0.98
Triton X100    108.32 ± 0.96    104.05 ± 0.92
SDS            97.34 ± 1.32     65.89 ± 0.19

Control        100 ± 0.12       100 ± 0.23       100 ± 0.67
layer homology domains, and dockerin domains which play a role in stabilizing the protein. Amino acid homology and hydrophobic cluster analysis categorized this high molecular weight xylanase into GH11 family. Metagenomic origin, distinct characteristics, lower homology, and higher molecular weight (>30 kDa) make this a novel xylanase. The integrated N-terminal pelB signal sequence in pET22b(+) directed the enzyme to the periplasm, which further led to secretion into the extracellular environment.
The site-directed mutagenesis of two glutamate residues to aspartate resulted in a complete loss of xylanase activity due to disruption of the double-displacement mechanism. In order to take advantage of the thermostability of the recombinant xylanase, it was subjected to high temperature prior to purification on Ni2+-NTA agarose resin. This step reduced the extra load of non-His-tagged, less thermostable, contaminant host proteins (Mamo et al. 2006; Verma and Satyanarayana 2012).
The rMxyl exhibits optimum activity at higher temperature (80 °C) and pH (9.0), which is similar to xylanases produced by Dictyoglomus thermolacticum, Thermotoga maritima, Bacillus stearothermophilus, and Geobacillus thermoleovorans having optimal activity at or above 80 °C (Uchino and Fukuda 1983; Mathrani and Ahring 1992; Khasin et al. 1993; Verma and Satyanarayana 2012). The activity and stability of rMxyl at higher pH are the crucial properties of
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome, Fig. 6 Profile of xylo-oligosaccharides liberated by the action of rMxyl. Lane (A1–A4)*: spots of X1, X2, and X3 were detected from wheat bran. Lane (B1–B4)*: hydrolysate from corncobs showed prominently X2 and X3, while X3, X4, and X5 were detected from hydrolysate of sugarcane bagasse (C1–C4)*. Lane M: standards of various XOs. X1 xylose, X2 xylobiose, X3 xylotriose, X4 xylotetraose, X5 xylopentaose. *: 1/2/3/4 time intervals of 5, 15, and 30 min and 1 h, respectively
xylanases for their applicability in the paper processing industry. The shelf-life of rMxyl is more than 3 months at 4 °C, over which it retains greater than 90 % activity. The recombinant xylanase is optimally active at 80 °C and pH 9.0, which distinguishes it from already reported xylanases. The xylanase of Thermotoga maritima has a Topt of 90 °C, but it gets inactivated fast at pH 6.0 (Yoon et al. 2004). Similarly, alkalistability at higher pH is reported for many xylanases, but these are active at lower temperatures (Khasin et al. 1993). The recombinant xylanase of GH10 family from Bacillus halodurans showed both properties together, having optima at 75 °C and pH 9.0, but it loses 50 % activity at 65 °C after 4 h and gets inactivated very fast at 80 °C (Mamo et al. 2006). The metagenomic xylanase, on the other hand, has good thermostability at higher temperatures (60 °C, 70 °C, and 80 °C) with only 20–30 % loss after 3 h exposure. The most significant aspect of this investigation is obtaining a highly alkalistable (pHopt. 9.0) and thermostable (Topt. 80 °C) xylanase from environmental samples by a metagenomic approach.
Cations (Mg2+, Sn2+, and Fe2+) stimulated the rMxyl activity, while Hg2+ and Mn2+ at 1 mM significantly inhibited the activity. The inhibition of xylanase by Hg2+ suggests the presence of tryptophan residues whose indole ring is oxidized, thereby inhibiting the xylanase activity. The inhibition of xylanase activity by Cu2+ is similar to that of the majority of xylanases (Matteotti et al. 2012). In Glaciecola mesophila KMM 241, EDTA caused ~25 % enhancement in activity (Guo et al. 2009). NBS inhibition suggests the involvement of tryptophan in xylanase activity. Total loss of xylanase activity by β-ME and DTT suggests the distortion of disulfide linkages present between cysteine residues (Maalej et al. 2009; Matteotti et al. 2012). Detergents exerted a slight stimulatory effect on the recombinant xylanase, which is a common feature of other xylanases. However, rMxyl was inhibited by SDS.
The rMxyl hydrolyzed birch wood and beech wood xylans efficiently. The structural similarity of beech wood and birch wood xylans may be the reason for the high activity. The enzyme exhibited almost similar activities on oat spelt
xylan and arabinoxylan. Oat spelt xylan is a type of arabinoxylan very rich in arabinose (xylose/arabinose = 66:34) (Gruppen et al. 1992; Kormelink and Voragen 1993). Interestingly, the rMxyl liberated xylo-oligosaccharides from xylan in just 5 min, and this was sustained on prolonged incubation. Several xylanases have been reported from various microorganisms that liberate xylo-oligosaccharides following xylan hydrolysis. Alkaline xylanases show better action on agro-residues by lowering the steric hindrance caused by cellulose and enhancing the solubility of hemicellulosic materials (Gruppen et al. 1992). The metagenomic xylanase finds application in the food industry for the production of xylo-oligosaccharides as prebiotics (Vazquez et al. 2000).

Summary

Most of the xylanases retrieved by culture-dependent and culture-independent approaches exhibit optimal activity in the pH and temperature ranges of 6.0–8.0 and 40–60 °C, respectively. The xylanase (rMxyl) obtained in this investigation through a metagenomic approach displays alkalistability as well as thermostability. This is the first report on a xylanase with twin stabilities obtained through a culture-independent approach. A very low similarity in amino acid sequence of the enzyme with other known xylanases makes it a novel xylanase. The possibility of obtaining thermo-alkali-stable xylanase from composts may lead to an intense search for similar enzymes in this and other related niches.

References

Archana A, Satyanarayana T. Xylanase production by thermophilic Bacillus licheniformis A99 in solid-state fermentation. Enzyme Microb Technol. 1997;21:12–7.
Collins T, Gerday C, Feller G. Xylanases, xylanase families and extremophilic xylanases. FEMS Microbiol Rev. 2005;29:3–23.
Coughlan MP, Hazlewood GP. β-1,4-D-xylan-degrading enzyme systems: biochemistry, molecular biology and applications. Biotechnol Appl Biochem. 1993;17:259–89.
Gilkes NR, Henrissat B, Kilburn DG, et al. Domains in microbial β-1,4-glycanases: sequence conservation, function, and enzyme families. Microbiol Rev. 1991;55:303–15.
Gruppen H, Hamer RJ, Voragen AGJ. Water unextractable cell wall material from wheat flour. 2. Fractionation of alkali extracted polymers and comparison with water extractable arabinoxylans. J Cereal Sci. 1992;16:53–67.
Guo B, Chen X, Sun C, et al. Gene cloning, expression and characterization of a new cold-active and salt tolerant endo-β-1,4-xylanase from marine Glaciecola mesophila KMM 241. Appl Microbiol Biotechnol. 2009;84:1107–15.
Hori H, Elbein AD. The biosynthesis of plant cell wall polysaccharides. In: Higuchi T, editor. Biosynthesis and biodegradation of wood components. Orlando: Academic; 1985. p. 109–35.
Khasin A, Alchanati I, Shoham Y. Purification and characterization of a thermostable xylanase from Bacillus stearothermophilus T-6. Appl Environ Microbiol. 1993;59:1725–30.
Kormelink FJM, Voragen AGJ. Degradation of different [(glucurono)arabino] xylans by a combination of purified xylan-degrading enzymes. Appl Microbiol Biotechnol. 1993;38:688–95.
Maalej I, Belhaj I, Masmoudi NF, Belghith H. Highly thermostable xylanase of the thermophilic fungus Talaromyces thermophilus: purification and characterization. Appl Biochem Biotechnol. 2009;158:200–12.
Mamo G, Delgado O, Martinez A, et al. Cloning, sequencing analysis and expression of a gene encoding an endoxylanase from Bacillus halodurans S7. Mol Biotechnol. 2006;33:149–59.
Mathrani IM, Ahring BK. Thermophilic and alkaliphilic xylanase from several Dictyoglomus isolates. Appl Microbiol Biotechnol. 1992;38:23–7.
Matteotti C, Bauwens J, Brasseur C, et al. Identification and characterization of a new xylanase from gram-positive bacteria isolated from termite gut (Reticulitermes santonensis). Protein Expr Purif. 2012;83:117–27.
Miller GL. Use of dinitrosalicylic acid reagent for determination of reducing sugars. Anal Chem. 1959;31:426–8.
Son-Ng I, Li CW, Yeh Y, et al. A novel endoglucanase from the thermophilic bacterium Geobacillus sp. 70PC53 with high activity and stability over broad range temperatures. Extremophiles. 2009;13:425–35.
Uchino F, Fukuda O. Taxonomic characteristics of an acidophilic strain of Bacillus producing thermophilic acidophilic amylase and thermostable xylanase. Agric Biol Chem. 1983;47:965–7.
Vazquez MJ, Alonso JL, Dominguez H, et al. Xylo-oligosaccharides: manufacture and applications. Trends Food Sci Technol. 2000;11:387–93.
Verma D, Satyanarayana T. Cloning, expression and applicability of thermo-alkali-stable xylanase of Geobacillus thermoleovorans in generating xylo-oligosaccharides from agro-residues. Bioresour Technol. 2012;107:333–8.
Verma D, Satyanarayana T. An improved protocol for DNA extraction from alkaline soil and sediment samples for constructing metagenomic libraries. Appl Biochem Biotechnol. 2011;165:454–64.
Yoon HS, Han NS, Kim CH. Expression of Thermotoga maritima endo-β-1,4-xylanase gene in E. coli and characterization of the recombinant enzyme. Agric Chem Biotechnol. 2004;47:157–60.

Novel Approaches to Pathogen Discovery in Metagenomes

Jun Hang
Viral Diseases Branch, WRAIR, Silver Spring, MD, USA

Synonyms

Community genomics; Metagenomics and pathogen identification; Microbiome and virome; Pathogenomics

Definitions

Pathogen discovery: identification of causative microbial or viral agent(s) for an illness or asymptomatic infection. The identification may refer to etiological diagnosis for individuals, epidemiology investigation on population scale, and animal or environmental surveillance on orphan pathogens.
Metagenomics: genomic study on a population of biologically or functionally close microorganisms as a whole community, without separation of components into pure culture isolates.

Introduction

The best-known statement on pathogen discovery is probably the so-called Koch's postulates, in which isolation of the disease-causing microbe and determination of its etiological features are of the essence (Falkow 2004; Lipkin 2010). There are fascinating and tragic stories in medical history of human volunteers or doctors who sacrificed their health or even their lives to test pathogens on themselves to satisfy the postulates. The principles guided the development of clinical microbiology and remain important guidelines, if not rules, even in the era of molecular biology and genomics. Nevertheless, studies have shown that the vast majority of microorganisms cannot be readily grown or are not cultivable at all (Handelsman 2004). This is also true for pathogens; in other words, numerous varieties of potential pathogens exist and evolve in the environment, and it is just a matter of time when and where they will emerge or reemerge to cause sporadic cases or outbreaks. In addition, the manifestation of some diseases is attributable to the coexistence of multiple organisms or an imbalanced microbial community in host tissues.
The technical approach to pathogen diagnostics evolves along with scientific discovery and technological innovation in microbiology as well as other disciplines. A variety of techniques are used in clinical labs, including traditional microbiology tests, rapid serological assays, and various molecular assays. They are well designed and validated with reliable sensitivity and specificity (Lipkin 2010). Many of them are automated for improved speed, convenience, and accuracy. However, in spite of their great effectiveness and robustness, the threat from emerging pathogens remains real. In particular, because of rising globalization and drastic climate changes, novel pathogens and new variant strains have appeared and spread more often. There are chances that a highly virulent pathogen may escape detection by conventional methods and cause a widespread outbreak and public health crisis with dramatic economic loss and social consequences.
To answer the emergent challenge, novel approaches utilizing advanced technologies have been developed to effectively identify pathogens as well as elucidate pathogenesis mechanisms in a comprehensive way (Lipkin 2010; Olsen et al. 2012). Metagenomics analyzes all genomic
information in a specifically defined population. The deep and comprehensive metagenomic information allows individual organisms of interest to be interrogated in the context of the whole community and with its phylogenetic relatives (Joseph and Read 2010). This strategy has transformed the way we perceive the microbial world. The related laboratory and bioinformatics approaches have been used successfully to identify causal pathogens for outbreaks and to provide vital insights into their source and/or evolutionary origins (Koser et al. 2012). Approaches based on the rich knowledge from metagenomics are being vigorously applied to pathogen discovery and are believed to be a clear path for the future of clinical diagnostics (Eisen and MacCallum 2009; Olsen et al. 2012).

Strategy and Schemes

Genomic approaches to detection of pathogens in clinical specimens are either based on known genomic information (sequence dependent) or designed to capture unique and disease-relevant as well as redundant and irrelevant sequences altogether (sequence independent) (Olsen et al. 2012). Metagenomics was initially developed in the era of Sanger sequencing (Fredericks and Relman 1996; Handelsman 2004) and truly thrived with the emergence of next-generation sequencing (NGS) technologies, which make DNA sequencing much less expensive and hugely productive (Petrosino et al. 2009). It is now feasible and affordable either to sequence a number of amplicons at exceedingly high depth to capture rare variants or to sequence, by design, all DNA and/or RNA in a complex sample. NGS allows direct sequencing of microbial contents without microbiological cultivation for isolation and enrichment. Numerous molecular biology techniques for sample preparation prior to sequencing and bioinformatics tools for data mining and analysis have been developed (Thomas et al. 2012). Experiment design and the choice of technical and analytical approaches are vital for the sensitivity and accuracy of pathogen discovery.

16S Ribosomal RNA Gene Sequencing for Human Microbiota Assessment and Identification of Bacterial Pathogens

The bacterial 16S rRNA gene sequence has long been used to classify bacteria down to the taxonomic level of genus or lower. In contrast to amplification, cloning, and sequencing of full-length 16S rRNA genes by the Sanger method, NGS enables massive acquisition of a million or more 16S rRNA gene segment sequences in a single run to decipher bacterial composition (species richness and abundance) in a community (Kuczynski et al. 2012). Sequence across two to three variable regions has been suggested to contain taxonomic information unique enough for classification. Roche 454 pyrosequencing is currently the method of choice due to its relatively long read length and low sequence error rate. Read length averages 300–500 bases for the Roche GS FLX Titanium system and 500–800 bases for the recent FLX+ system. The FLX+ application for amplicon sequencing is currently under development and has yet to be validated for 16S sequencing, where it is expected to achieve longer read lengths without compromising sequence quality. Unlike genome sequencing, in which reads are assembled by overlapping to obtain a consensus sequence, in 16S-based metagenomic analysis 454 sequencing reads are classified individually, i.e., each read is assigned to an operational taxonomic unit (OTU). Therefore, high-performance sample preparation and sequencing procedures, stringent data processing, and a sound analytical pipeline are critical for achieving and maintaining accuracy and sensitivity. Many studies comparing materials and methods for optimization have been published (Kuczynski et al. 2012). One significant open resource is the Data Analysis and Coordination Center (DACC) of the National Institutes of Health (NIH) Common Fund supported Human Microbiome Project (HMP), which is available at the website
http://www.hmpdacc.org/. The fundamental knowledge on healthy human microbial communities and the developed metagenomics techniques and analytic tools are being brought into the clinical arena with encouraging successes in diagnosing difficult diseases and complex outbreaks (Loman et al. 2012). Pathogenomics is showing its power and clinical importance by revealing the genomic and metagenomic basis for complicated syndromes that cannot be explicitly understood with conventional clinical tests. As a consequence, improved therapeutic practices, reduced medication costs, and more-informed disease prevention measures can result in dependable public health protection.

Microbial Metagenomics and Single-Cell Sequencing

There are considerable and ongoing efforts to characterize collective whole genomes in an entire community. For example, several research teams used Illumina’s NGS technology and a shotgun sequencing approach to generate several hundred gigabases of microbial sequences for extensive cataloging of genes in human gut microbes (Qin et al. 2010; Arumugam et al. 2011). From a number of studies, the depth and comprehensiveness of our knowledge on human microbiomes is unprecedented, and it would not be possible without the advanced NGS technologies and the associated sophisticated bioinformatics tools (Kuczynski et al. 2012; Thomas et al. 2012). However, such a shotgun unselective metagenomic strategy requires tremendous computational power and may not be efficient or cost-effective enough for routine pathogen diagnostics practice. There are multiple approaches that may help overcome the hurdles to the wide use of metagenomics in clinical settings. The HMP and other international programs aim to build a database of fully annotated complete genome sequences for bacteria of clinical and human health relevance. The high-quality reference genome database with rich and definitive genomic, genetic, functional, and phenotypic information will be the key to a metagenomics-based clinical test (Joseph and Read 2010). Other components essential to feasibility include a streamlined sample processing and sequencing system with automation, a convenient data collection and management procedure, and an efficient bioinformatics pipeline that works in concert with reference information for sequence analysis and is amenable to integration with medical records and interactive communication with worldwide disease networks and specific study consortiums (Koser et al. 2012).
In addition to the promising clinical use of whole-genome metagenomics, the scientific and technical resources gained from metagenomics quests have a multitude of utilities that can make existing pathogen discovery methods better designed and better performing (Fournier and Raoult 2011). For instance, with the comprehensive genomic information on the microbial community corresponding to the specimens, multilocus sequence typing (MLST), PCR-based molecular assays, microarray-based assays, etc. can be made more specific for the targets with reduced nonspecificity. Moreover, assay results can be interpreted with better estimation of the probability of miscalls and false positives, and therefore concluded with increased confidence.
Another promising approach is single-cell genome sequencing for pathogen discovery. Individual microorganisms or parasites are physically isolated out of a complex community, i.e., clinical matrix, either under microscopy by morphology or using devices such as flow cytometry cell sorting. Both methods are well established and already routinely used in clinical laboratories. A harvested single cell or a homogeneous pool of cells is then subjected to amplification and sequencing. Multiple displacement amplification (MDA) from a single cell has been shown to be robust and faithful for downstream sequencing and microarray applications. Studies showed 95 % or higher genome coverage by using single-cell genomic sequencing (Pallen et al. 2010). The culture-free approach coupled with lab-on-chip microfluidic cell harvesting and processing automation may become suitable for clinical diagnostic use.
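The genome coverage figure quoted above for single-cell sequencing can be read as a breadth-of-coverage measure: the fraction of reference positions covered by at least one read. The following is a minimal sketch of that calculation, assuming reads have already been aligned and are supplied as half-open (start, end) intervals on a single reference. The function name and the toy intervals are illustrative and do not come from the cited studies.

```python
def breadth_of_coverage(intervals, genome_length):
    """Fraction of reference positions covered by at least one aligned read.

    intervals: iterable of half-open (start, end) alignment coordinates.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        # Clip to the reference so stray coordinates cannot inflate the count.
        start, end = max(0, start), min(genome_length, end)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # Disjoint from the running block: close it and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Toy example: three reads on a 1,000-base reference.
reads = [(0, 400), (300, 700), (750, 1000)]
print(breadth_of_coverage(reads, 1000))  # 0.95
```

Merging overlapping intervals before counting avoids counting a position twice when reads overlap, which is the usual case at high sequencing depth.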
[Fig. 1, panels a and b; figure not reproduced. Panel a, workflow steps: specimen collection, storage, transportation, log and archiving, deidentification, etc.; pre-processing (centrifugation, filtration, nuclease, etc.) into solids (cells, bacteria) and liquids (virus, DNA/RNA); DNA/RNA extraction; 16S rRNA gene PCR amplification or random amplification; NGS library preparation; next-generation sequencing; sequence reads (raw data quality-processing; partitioning: de novo assembly, host sequence decontamination, etc.; viral sequence identification; bacterial sequence identification; other unidentified sequences); data analyses and interpretations; report and plan for further investigation. Panel b labels: Amplicon V1-V3, Amplicon V5-V3, Amplicon V9-V6, Primer A, Key, 16S sequences, Primer B, MID.]

Novel Approaches to Pathogen Discovery in Metagenomes, Fig. 1 Pathogen discovery workflow. (a) Flow diagram of the main procedures to pathogen identification. (b) 16S-based targeted metagenomics for determination of bacterial composition. Top panel shows 16S rRNA gene and hypervariable regions 1–9. Center panel shows three amplicons commonly used in 16S-based metagenomic sequencing. The arrows indicate sequencing direction. Bottom panel shows fusion primers for PCR amplification of 16S rRNA gene segments. Primer A/B and key sequences are compatible to the choice of downstream NGS platform, e.g., Roche/454 GS. Amplicon(s) for each sample can be barcoded individually using sequences such as 454’s 10-nt Multiplexing identifier (MID) sequences. (c) Unbiased random amplification. Random reverse transcription is primed by random hexamers or octamers tailed with specific sequence. Subsequent random PCR uses the random primers and the primer matching with the specific sequence. The double-stranded random amplicons can be sequenced with NGS for viral sequence identification
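The per-sample barcoding mentioned in the caption implies a demultiplexing step after sequencing. The sketch below shows one simple way this could be done, assuming each read begins with its 10-nt MID and that only exact matches are accepted. The barcode sequences and sample names are hypothetical and are not the vendor's MID set; a production pipeline would normally also remove key and primer sequences and tolerate sequencing errors in the barcode.

```python
from collections import defaultdict

MID_LENGTH = 10  # length of the multiplexing identifier, per the caption


def demultiplex(reads, mid_to_sample):
    """Group reads by their leading MID and strip the MID from each read.

    reads: iterable of read sequences (strings) that start with the MID.
    mid_to_sample: dict mapping MID sequence -> sample name.
    Reads whose prefix matches no known MID go to the 'unassigned' bin.
    """
    bins = defaultdict(list)
    for read in reads:
        mid, insert = read[:MID_LENGTH], read[MID_LENGTH:]
        sample = mid_to_sample.get(mid.upper())
        if sample is None:
            bins["unassigned"].append(read)
        else:
            bins[sample].append(insert)
    return dict(bins)


# Hypothetical barcodes, for illustration only (not the vendor's MID set).
barcodes = {"ACGTACGTAC": "sample_1", "TGCATGCATG": "sample_2"}
reads = [
    "ACGTACGTACGGATTAGATACCC",
    "TGCATGCATGCCTACGGGAGGCA",
    "NNNNNNNNNNGGATTAGATACCC",
]
for sample, seqs in demultiplex(reads, barcodes).items():
    print(sample, seqs)
```

Unmatched reads are kept whole in a separate bin rather than discarded, so that the fraction of unassignable reads can be checked as a quality indicator.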
Unbiased Random Amplification and Sequencing for Viral Pathogen Identification

While microbial metagenomics for bacterial pathogen identification is still at an early stage, viral metagenomics has become a robust approach for hunting novel viral pathogens when viral culture and molecular assays cannot make the diagnosis (Djikeng and Spiro 2009; Mokili et al. 2012). Because of the vast number and variety of viruses in nature and the high frequency of evolution events including nucleotide mutagenesis, sequence recombination, and segment reassortment, it is not rare that a novel virus or a new virus variant escapes initial detection or is misdiagnosed and causes an outbreak. A de novo approach with no requirement for known sequence is therefore advantageous for viral pathogen discovery. One technical breakthrough is to identify novel viral sequence by unbiased random amplification and massive sequencing with NGS platforms. The process illustrated in Fig. 1 includes the following major steps: sample preparation which may require extra preprocessing to enrich viruses and reduce non-virus contents, random reverse transcription and anchored random PCR amplification, sequencing the random amplicons by NGS, and data mining for identification of viral pathogen
[Fig. 2; figure not reproduced. Flowchart steps: metagenomic random sequencing raw reads; pre-filtering (trim QV17 bases; remove reads shorter than 50 bases); trim primer sequence from both ends; then two paths. Path 1: sequence clustering; remove host sequences (decontamination); bacterial hits (bacteria database); viral hits (virus database); other hits (nr/nt database); other hits (nr database); artificial/novel sequences. Path 2: de novo assembly into contigs and unassembled reads; bacterial hits (bacteria database); viral hits (virus database); other hits (nr/nt database); other hits (nr database); artificial/novel sequences. Both paths end in taxonomic annotation and a summary report.]

Novel Approaches to Pathogen Discovery in Metagenomes, Fig. 2 Bioinformatics strategy for identification of viral sequences. Metagenomic sequencing data are processed using streamlined multiple sequence analyzing tools to search for disease-related sequence hits. Two typical analysis paths are shown as examples. It is crucial for the efficiency to reduce redundancy (e.g., sequence assembly) and human genome sequences (decontamination) prior to database alignment while retaining relevant sequences. Reduced volume sequences are subjected to thorough alignments to the specified as well as mega databases
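The pre-filtering step named in Fig. 2 (trim QV17 bases; remove reads shorter than 50 bases) can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: "trim QV17" is read here as removing the run of bases with quality below 17 at the 3' end of a read, since the figure does not say which end is trimmed or whether the threshold is inclusive. Function names and the toy reads are hypothetical.

```python
MIN_QUALITY = 17   # quality threshold named in Fig. 2
MIN_LENGTH = 50    # minimum read length named in Fig. 2


def trim_low_quality_tail(sequence, qualities, min_quality=MIN_QUALITY):
    """Remove the run of bases with quality below min_quality at the 3' end."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def prefilter(reads, min_quality=MIN_QUALITY, min_length=MIN_LENGTH):
    """Yield (sequence, qualities) for reads that survive trimming and length filtering."""
    for sequence, qualities in reads:
        sequence, qualities = trim_low_quality_tail(sequence, qualities, min_quality)
        if len(sequence) >= min_length:
            yield sequence, qualities


# Toy reads: (sequence, per-base Phred quality values).
reads = [
    ("A" * 60, [30] * 55 + [10] * 5),   # trimmed to 55 bases, kept
    ("C" * 60, [30] * 40 + [5] * 20),   # trimmed to 40 bases, dropped
]
kept = list(prefilter(reads))
print([len(seq) for seq, _ in kept])  # [55]
```

Trimming before the length filter matters: a read that is long enough as sequenced can fall below the 50-base cutoff once its low-quality tail is removed.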
sequences. The significance of the culture-independent viral metagenomic approach was shown in studies in which novel viruses responsible for unresolved infections were identified. The discoveries by metagenome sequencing also led to subsequent confirmation with PCR, successful viral isolation by choosing suitable cells, and complete viral genome sequences for rapid molecular tests for epidemiology and surveillance. Another noteworthy use of unbiased metagenomic sequencing is to detect coinfection of viruses or virus variants with estimation of the relative abundance for personalized medicine. For example, sensitive and accurate monitoring of low-level drug-resistant HIV variants is clinically relevant for proactive health care of the HIV-infected population (Gega and Kozal 2011).
There are a variety of protocols that originated from the same technical approach but were designed differently based on individual circumstances. The considerations include enriching viral contents in complex matrices by pretreatment with nucleases to degrade nonviral naked nucleic acids or concentration of viral particles by filtration or centrifugation, DNase treatment prior to reverse transcription to reduce genomic DNA, the removal of ribosomal RNA,
size selection of amplification products, and adjusting clonal amplification conditions to sequence random amplicons of a broad range of sizes. To find virus sequence reads in metagenomic sequencing of clinical samples, a capable bioinformatics workflow is needed to achieve the required sensitivity, specificity, and speed. The workflow may comprise a set of data processing operations which can be chosen from tools such as de novo assembly, sequence clustering, decontamination (e.g., removal of human sequences), NCBI BLAST tools, etc. Two exemplary workflows are shown in Fig. 2. Nevertheless, further simplified and streamlined sample preparation and sequencing procedures that can be readily reproduced in the clinical laboratory, good data management and sharing practices, and diagnostic-specific bioinformatics solutions will be essential for viral pathogen discovery by means of metagenomics.

Summary

The capability for pathogen discovery is driven by technology innovation. Koch’s postulates evolved from their original microbial form to the molecular postulates (Falkow 2004) and currently “the metagenomic version” (Mokili et al. 2012). The multidisciplinary strategy and methodology of metagenomics open a new era of pathogen discovery: analyze pathogenesis in comprehensive ecology and community views; delineate etiology with information on pathogen coinfection, virulent variants, and concurrent factors; individualize therapy with consideration of metagenomes for optimal efficacy; and avoid misuse of antibiotics and antiviral drugs (Relman 2011). Next-generation sequencing is not only the ultimate sequence-based approach for pathogen identification but also a solution to stimulate clinical microbiology and molecular diagnostics when a novel pathogen is encountered. Despite the sound “proof-of-principle” as well as advancements in both technical and analytical means, substantial individual and concerted efforts are needed to translate pathogen discovery on metagenomes from explorative research to standardized clinical practices.

References

Arumugam M, Raes J, et al. Enterotypes of the human gut microbiome. Nature. 2011;473(7346):174–80.
Djikeng A, Spiro D. Advancing full length genome sequencing for human RNA viral pathogens. Futur Virol. 2009;4(1):47–53.
Eisen JA, MacCallum CJ. Genomics of emerging infectious disease: a PLoS collection. PLoS Biol. 2009;7(10):e1000224.
Falkow S. Molecular Koch’s postulates applied to bacterial pathogenicity – a personal recollection 15 years later. Nat Rev Microbiol. 2004;2(1):67–72.
Fournier PE, Raoult D. Prospects for the future using genomics and proteomics in clinical microbiology. Annu Rev Microbiol. 2011;65(65):169–88.
Fredericks DN, Relman DA. Sequence-based identification of microbial pathogens: a reconsideration of Koch’s postulates. Clin Microbiol Rev. 1996;9(1):18–33.
Gega A, Kozal MJ. New technology to detect low-level drug-resistant HIV variants. Futur Virol. 2011;6(1):17–26.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004;68(4):669–85.
Joseph SJ, Read TD. Bacterial population genomics and infectious disease diagnostics. Trends Biotechnol. 2010;28(12):611–8.
Koser CU, Ellington MJ, et al. Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS Pathog. 2012;8(8):e1002824.
Kuczynski J, Lauber CL, et al. Experimental and analytical tools for studying the human microbiome. Nat Rev Genet. 2012;13(1):47–58.
Lipkin WI. Microbe hunting. Microbiol Mol Biol Rev. 2010;74(3):363–77.
Loman NJ, Misra RV, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012;30(5):434–9.
Mokili JL, Rohwer F, et al. Metagenomics and future perspectives in virus discovery. Curr Opin Virol. 2012;2(1):63–77.
Olsen RJ, Long SW, et al. Bacterial genomics in infectious disease and the clinical pathology laboratory. Arch Pathol Lab Med. 2012;136(11):1414–22.
Pallen MJ, Loman NJ, et al. High-throughput sequencing and clinical microbiology: progress, opportunities and challenges. Curr Opin Microbiol. 2010;13(5):625–31.
Petrosino JF, Highlander S, et al. Metagenomic pyrosequencing and microbial identification. Clin Chem. 2009;55(5):856–66.
Qin J, Li R, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
Relman DA. Microbial genomics and infectious diseases. N Engl J Med. 2011;365(4):347–57.
Thomas T, Gilbert J, et al. Metagenomics – a guide from sampling to data analysis. Microb Inform Exp. 2012;2(1):3.

Nucleotide Composition Analysis: Use in Metagenome Analysis

Isaam Saeed
Optimisation and Pattern Recognition Group, Melbourne School of Engineering, The University of Melbourne, Parkville, Australia

Synonyms

Binning; Genome signature; Nucleotide frequency

Definition

The composition of nucleotide bases in a microbial genome is not random and is instead biased toward different compositional structures that vary between organisms. These biases occur as identifiable patterns in oligonucleotide base composition, and it is by these patterns that otherwise anonymous metagenomic sequences are grouped into inferred populations. This allows for more in-depth analysis of the functional potential of a sampled microbial community in the context of constituent members (inferred populations).

Introduction

The composition of nucleotide bases in a microbial genome is not random and is instead biased toward different compositional structures that vary between organisms. These biases occur as identifiable patterns in oligonucleotide base composition, and it is by these patterns that otherwise anonymous metagenomic sequences can be grouped into inferred populations enabling in-depth functional analysis.
Extensive sequencing of microbial DNA made possible the large-scale analysis of this genome base composition. Such analyses have revealed that the various patterns in base composition may be related to specific molecular machinery within microbial cells that help shape base composition. These biasing effects are thought to be mediated by the processes of DNA repair and replication, mutations, and base-step conformational tendencies that operate in concert to give rise to the characteristic base composition of different microbial genomes (Karlin et al. 1997).
Since the sequencing methodology of metagenomics does not preserve the association between sequenced reads and their genome of origin, functional analysis of a metagenome can only provide an overall snapshot of what a microbial community can potentially do. However, if the association between a sequence in a metagenome and the original genome (or population) from which it was sampled can be inferred, then the resulting functional analysis can probe deeper into the inner workings of a microbial community. Processing sequences in this manner prior to functional analysis is referred to as binning. There are currently two major ways to address the binning problem (McHardy and Rigoutsos 2007): the first classifies sequences using a database of preexisting knowledge of microbial organisms; and the second groups related sequences based on the common patterns that arise from biases in the base composition of microbial genomes. The latter approach reflects the exploratory nature of metagenomics, given that the majority of microorganisms cannot be cultivated in a laboratory environment and therefore may not yet be represented in current databases.
When considering the use of patterns (or genome signatures) in nucleotide base composition for binning, there are two major factors that will influence the quality of the resulting set of identified groups (inferred populations). The first is the taxonomic resolution of patterns to be
used, which is governed by the between-genome distinctness of a pattern. The second is the accuracy at which these patterns can be grouped, which is governed by the within-genome conservation of a pattern.

A Simple Binning Strategy Using GC Content

It is well established that there are differences in GC content between various microbial genomes. The benefit of this to binning is that these biases can often be used as a representative pattern to group related sequences that share similar GC content. Although localized GC content can vary throughout a genome, if large enough sequences are available in a metagenome, then the assumption that the observed GC content is representative of the full genome composition still holds. It should be noted that GC content is not a unique property of individual genomes, and if it is used it will group sequences coarsely (in terms of microbial taxonomy). In such a scenario, if GC content is significantly different between the identified groups, then it can be assumed that these groups are unrelated, but it is not conclusive to say that the sequences within each group are related unless further analysis is conducted (due to the nonuniqueness property of GC content). To increase the taxonomic resolution of binning using GC content, it can often be combined with other complementary features. An approach of this sort was used to group sequences of the metagenome of an acid mine drainage biofilm (Tyson et al. 2004), where GC content was combined with local assembly depth to distinguish the dominant populations that shared similar GC content.
Generally, metagenomes with a small number of dominant species tend to be easier to assemble, and the resulting contig lengths can often make GC content a viable option to group sequences in such data sets. However, there are still limitations to this approach, and for more complex metagenomes, GC content has been superseded by higher-order statistics of base composition, referred to as oligonucleotide (or n-mer) frequencies.

Nucleotide Frequency

With the advent of large-scale, high-throughput DNA sequencing, the increased sample size of sequenced DNA molecules provided a foundation for extensive statistical evaluation of nucleotide composition in different genomes. Further studies of genomic composition, in light of the increasing number of available genome sequences, pioneered the use of higher-order statistics to describe signatures in microbial genomes. The underlying principle of these signatures is based on the observation that specific oligomers are under-/overrepresented in different genomes and that similar biases occur in related genomes. Nucleotide frequency is among the most widely used ways of representing these biases and is calculated by counting all occurrences of fixed-length oligos (or n-mers) within a sequence and then normalizing by the total number of oligos in that sequence to arrive at an estimate of the oligonucleotide frequency content. The features of microbial genomes based on nucleotide frequency, which have been successfully applied to metagenomic studies, include the following: the dinucleotide odds ratio, codon signatures/trinucleotide frequencies, and tetranucleotide frequencies.

Dinucleotide Odds Ratio
Among the earliest of these nucleotide frequency signatures found to be biologically relevant was the dinucleotide odds ratio, which was based on early in vitro studies on differences in dinucleotide content between various organisms (Karlin et al. 1997). This signature considered the dinucleotide frequency content of a sequence and factored out the effect of mononucleotide frequencies using a normalization scheme based on a Markov model, as given by

$$\rho^{2}_{i} = \frac{f_{XY}}{f_X f_Y},$$

where X and Y represent the first and second mononucleotide in the dinucleotide to be normalized and f represents the frequency of mono-/dinucleotides. The derived statistic, also referred to as the dinucleotide odds ratio, could adequately describe biases specific to various
microbial organisms. For example, it was observed that there is a general TpA avoidance mechanism across various microbial genomes and a CpG underrepresentation in thermophilic microbes. Distances between sequences represented using the dinucleotide odds ratio are evaluated using the Manhattan distance, also referred to as the δ* distance. When this odds ratio differs from 1, the resulting statistic provides a means to estimate the under-/overrepresentation of specific dinucleotides, given by the limits 0.78 and 1.23, respectively (Karlin et al. 1997). Although several genome-wide biases were found when using this odds ratio statistic, the discrimination between larger sets of microbial genomes (more representative of real-world metagenomes) is still better handled by higher-order frequencies.

Codon Signatures/Trinucleotide Frequency
Gene sequences are relatively conserved within a genome, as any changes at critical locations may cause the gene product to be defective. This motivated the use of signatures based on this knowledge to capture more representative patterns in microbial genomes. Codon usage in the gene sequences is thought to be mediated by the overall genome composition and is also related to the flexibility of the choice of codons due to the degeneracy of the genetic code. Trinucleotide frequencies (i.e., frequencies of all possible 3-mers: AAA, AAC, AAT, . . ., GGG) can be used to capture some of these biases, and alternatively, an extension to the dinucleotide odds ratio is also able to capture these codon signatures using dinucleotides (Karlin et al. 1998). This codon signature is constructed as follows:

$$\gamma_{XY}(1,2) = \frac{f_{XY}(1,2)}{f_X(1)\, f_Y(2)}$$
$$\gamma_{XY}(2,3) = \frac{f_{XY}(2,3)}{f_X(2)\, f_Y(3)}$$
$$\gamma_{XY}(3,4) = \frac{f_{XY}(3,4)}{f_X(3)\, f_Y(4)},$$

where the indices represent the nucleotide base at the first, second, or third nucleotide within a codon (with index 4 referring to the first base of the next codon). This signature requires at least 50 full-length genes from a given genome to make a stable estimate, and these ratios can be biased within a genome (not only between genomes), depending on the set of gene classes that comprise the genes in a genome. Due to these issues, such signatures may cause difficulty in grouping sequences for the purpose of binning.

Tetranucleotide Frequency
Tetranucleotide frequency (all possible combinations of 4-mers, of which there are 256) offers greater discrimination between species in a metagenome than lower-order nucleotide frequencies. For this reason, tetranucleotide frequency is perhaps the most widely used signature in clustering metagenomic sequences. Moreover, it has been found to capture a species-specific signature (a reasonably strong phylogenetic signal at lower taxonomic ranks), which makes it not only a more powerful alternative for clustering metagenomic sequences but also a source of biologically meaningful groupings of sequences (Teeling et al. 2004). This was also confirmed by Mrazek (2009), who correlated 16S rRNA distances with various signatures and found that tetranucleotide frequency was able to outperform other feature sets. It has also been found that tetranucleotide frequency can be used to find conserved signatures flanking 16S rRNA genes, which can in turn be used to assign classes to the identified groups of sequences (Chan et al. 2008).

Strand Bias
Prior to the use of these signatures, it should be noted that for oligonucleotide frequencies, the feature vector requires correction for biases between strands (Tyson et al. 2004). This is often remedied by counting the number of n-mers on the original sequence, as well as on the reverse complement, and then taking the average of the two prior to normalization.

Normalization Techniques for Nucleotide Frequency

Given the observed nucleotide frequencies for each sequence, it is often necessary to normalize each observation prior to further analysis. (Note:
it is still possible to simply take the frequencies of the observed number of n-mers in a sequence.)

Markov Normalization
The dinucleotide odds ratio is a special case of Markov normalization. In the general case, the maximal-order Markov normalization of an observed nucleotide frequency vector is given by

$$\rho^{k}_{i} = \frac{f_{1 \ldots k}\, f_{2 \ldots k-1}}{f_{1 \ldots k-1}\, f_{2 \ldots k}},$$

where the appropriate statistic for tetranucleotide frequency, for example, is given when k = 4. This normalization scheme essentially aims to filter out lower/higher nucleotide frequencies. Lower-order normalization schemes are possible, but they have poorer correlation properties with phylogenetic distances (Mrazek 2009).

Z-Score Normalization
Another approach to normalization uses the Z-score transform to assess the statistical significance of observed n-mers (Tyson et al. 2004). For tetranucleotide frequency, the Z-score normalization is computed as follows: the expected value for a given tetramer is calculated by

$$E(n_1 n_2 n_3 n_4) = \frac{N(n_1 n_2 n_3)\, N(n_2 n_3 n_4)}{N(n_2 n_3)},$$

and the variance is calculated using

$$\sigma^2(n_1 n_2 n_3 n_4) = E(n_1 n_2 n_3 n_4)\, \frac{\left[N(n_2 n_3) - N(n_1 n_2 n_3)\right]\left[N(n_2 n_3) - N(n_2 n_3 n_4)\right]}{N(n_2 n_3)^2},$$

which gives the required normalization for each tetramer:

$$Z(n_1 n_2 n_3 n_4) = \frac{N(n_1 n_2 n_3 n_4) - E(n_1 n_2 n_3 n_4)}{\sqrt{\sigma^2(n_1 n_2 n_3 n_4)}}.$$

The Oligonucleotide Frequency-Derived Error Gradient (OFDEG)

An extension to oligonucleotide frequency-based features is the oligonucleotide frequency-derived error gradient (OFDEG) (Saeed and Halgamuge 2009). OFDEG was observed by exploring the relationship of nucleotide frequency between short fragments and the original DNA sequence from which they are sampled. The observation is based on a resampling method, where instead of using the entire sequence to estimate the maximum likelihood point estimate of nucleotide frequency, the OFDEG measure resamples the nucleotide base composition of varying length subsequences to capture the distribution of oligomeric frequencies.

For example, it is straightforward to compute the un-normalized nucleotide frequency of an entire genome (referred to in this definition as an occurrence vector). Similarly, the occurrence vector for a short subsequence sampled from anywhere along the genome can be easily computed. Intuitively, the error between these two occurrence vectors (defined in terms of Euclidean distance) would be large. Nevertheless, the error is recorded and another subsequence, of increased length, is sampled again from anywhere along the genome. Trivially, the error between the occurrence vector of this new subsequence and the occurrence vector of the genome would be reduced. This process is continued until the length of subsequences is equivalent to the length of the genome, while keeping track of the error at each sampling instance. The resulting error as a function of subsequence length is found to be linear (up to a given subsequence length). The rate of error reduction (or gradient) of this linear trend, within the bounds of the linear region, is referred to as the OFDEG. It has been found that this linear gradient is different for various genomes and is remarkably consistent within genomes as well as between fragments of a genome. The measure essentially captures the relative magnitude of biases in nucleotide base composition in a manner similar to entropic measures and has been used in combination with other complementary features to successfully group related sequences in various simulated and real-world metagenomes.
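The OFDEG resampling procedure can be illustrated with a minimal Python sketch. It is not the published implementation: the function names (`occurrence_vector`, `ofdeg`), the use of tetranucleotides, the number of samples per length, the fixed random seed, and the least-squares fit over all supplied lengths are assumptions made for illustration. In particular, the caller is assumed to supply only lengths that fall within the linear region, and the strand correction discussed earlier is omitted for brevity.

```python
import math
import random
from collections import Counter
from itertools import product

K = 4  # oligonucleotide length; tetranucleotides are an illustrative choice
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def occurrence_vector(seq):
    """Un-normalized counts of every K-mer in seq, in a fixed K-mer order."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    return [counts[kmer] for kmer in KMERS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ofdeg(seq, lengths, samples_per_length=20, seed=0):
    """Slope of the mean sampling error versus subsequence length.

    Assumption: every value in `lengths` lies in the linear region;
    choosing that region is left to the caller.
    """
    rng = random.Random(seed)
    full = occurrence_vector(seq)
    mean_errors = []
    for length in lengths:
        errors = []
        for _ in range(samples_per_length):
            # Sample a subsequence of this length from anywhere in seq.
            start = rng.randrange(len(seq) - length + 1)
            sub = occurrence_vector(seq[start:start + length])
            errors.append(euclidean(sub, full))
        mean_errors.append(sum(errors) / len(errors))
    # Ordinary least-squares slope of mean error against length.
    mx = sum(lengths) / len(lengths)
    my = sum(mean_errors) / len(mean_errors)
    sxx = sum((x - mx) ** 2 for x in lengths)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lengths, mean_errors))
    return sxy / sxx
```

Because the error shrinks as the sampled subsequence approaches the full sequence, the returned gradient is expected to be negative; sequences with different compositional biases would be expected to give gradients of different magnitude.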
Combining Features to Increase the Taxonomic Resolution of Binning

It has recently been demonstrated that when two signatures capture different characteristics of base composition, they can be used to group sequences differently, and in cases where these groups are mutually exclusive and at different taxonomic resolutions, such features can be arranged hierarchically to increase the taxonomic resolution at which sequences in a metagenome are grouped (Saeed et al. 2012).

This concept was demonstrated using the combination of GC content and OFDEG as a preliminary set of features to coarsely group a metagenome and then using tetranucleotide frequency to refine these coarse groups further. This is particularly important when the number of populations in a metagenome increases, as tetranucleotide frequency on its own is known to saturate in its discriminatory power when this occurs (Saeed et al. 2012). Since both OFDEG and GC content generate coarse groups, the groups can then be processed using a high-resolution feature set for refinement. This has been found to improve the binning performance over existing methods on simulated as well as real metagenomic data sets (Saeed et al. 2012).

Grouping Related Metagenomic Fragments Based on Sequence Composition

Methods that operate on nucleotide composition for grouping related sequences can be classified in terms of two broad machine learning paradigms: those which construct supervised classifiers and those which rely on unsupervised exploratory clustering.

Supervised Classification
A classifier can be trained using existing knowledge of patterns based on the analysis of reference sequences in current databases. These methods consider the classification of each sequence in isolation, and their accuracy will be dependent on the representativeness of the training set in relation to the metagenome under investigation.

Unsupervised Clustering
Unsupervised learning is not predicated on the availability of reference sequences for training. Instead, methods which operate in this paradigm group related DNA sequences by the inferred similarity of patterns, which is consistent with the exploratory nature of metagenomic studies. This approach is often advantageous, particularly when the sampled community consists of microbes that are either underrepresented or not represented in existing databases. Moreover, these methods can be applied directly to a data set to reveal hidden patterns among related sequences in a metagenome without enforcing a priori knowledge of what phylotypes should be present.

However, clustering methods which operate on density estimation require a sufficient number of sequences per population in order for each population to be discovered as a cluster. As such, highly complex metagenomes with no dominant populations are difficult to analyze in this manner and are at present perhaps better suited to supervised methods (provided a suitable training set can be constructed).

The Effects of Various Forms of Noise in Grouping Metagenomic Sequences Using Nucleotide Base Composition
Category I: When unrelated, or distantly related, genomes have highly similar compositional signatures, the number of false positives in grouping sequences can increase and will consequently affect binning specificity.
Category II: Genomes that have large intra-genomic variation in base composition can often increase the number of false negatives (these can sometimes be observed as outliers) during binning and consequently affect binning sensitivity.
Category III: A more complex form of noise occurs when organisms partially share common characteristics in base composition, which will cause groups to overlap.
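The effect of false positives and false negatives on a single bin can be quantified when the true origin of each sequence is known, as in a simulated metagenome. The sketch below is illustrative only: the function name and arguments are hypothetical, and it uses the standard classification definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); some binning studies instead report precision under the name specificity, so the definition in use should be checked before comparing results.

```python
def bin_sensitivity_specificity(true_labels, bin_labels, population, bin_id):
    """Sensitivity and specificity of one bin with respect to one population.

    true_labels: the known source population of each sequence
    bin_labels:  the bin each sequence was assigned to
    """
    tp = fp = fn = tn = 0
    for truth, assigned in zip(true_labels, bin_labels):
        in_population = truth == population
        in_bin = assigned == bin_id
        if in_population and in_bin:
            tp += 1          # correctly binned member of the population
        elif in_bin:
            fp += 1          # unrelated sequence drawn into the bin
        elif in_population:
            fn += 1          # member of the population left outside the bin
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Under these definitions, Category I noise would be expected to show up mainly as extra false positives (lower specificity) and Category II noise as extra false negatives (lower sensitivity).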
With the use of model-based clustering in an unsupervised setting with a hierarchical set of features, there is the potential to increase the accuracy of binning by removing these forms of noise (Saeed et al. 2012).

Limitations and Future Directions

Significant advances in compositional binning approaches have primarily looked at the issue of representing the composition of a sequence, rather than refining machine learning methods that operate in an unrepresentative feature space. Such is the case of the succession of GC content by higher-order oligonucleotide frequencies, for example. For instance, with an increase in the number of fully sequenced bacterial and archaeal genomes, it was observed that compositional features tend to saturate in their capacity to uniquely describe a microbial genome (or clade) (Teeling et al. 2004). The use of complementary features can address this limitation to a certain extent (Saeed et al. 2012).

Given that metagenomes only contain fragments, which in most cases can be quite short, the length of sequences often limits the representativeness of features based on nucleotide frequency (Mavromatis et al. 2007). This is because these signatures are statistical in nature and require sufficient sequence lengths with which to estimate a representative signature. The minimum sequence length has been argued to be 1 kbp (McHardy and Rigoutsos 2007), 40 kbp (Teeling et al. 2004), and even 50–100 kbp (Karlin et al. 1997); but in general caution is advised when applying such techniques to sequences less than 1 kbp as this may result in unrepresentative and sparse feature vectors.

This limitation of sequence length in metagenomes is not only due to currently achievable read lengths but also the complexity of assembling metagenomes (in comparison to single-genome studies). For complex communities (species rich), the required coverage for reasonable levels of assembly (N50 contig length greater than 1 kbp) translates into substantial sequencing requirements. In light of this, methods that operate on nucleotide frequency alone can be seen to be at a disadvantage, but with the anticipated longer read lengths and higher throughput that future sequencing platforms are capable of generating, coupled with the development of a wide variety of novel tools for metagenomic data analysis, these issues may be largely alleviated and composition-based binning will be an important tool for metagenome analysis.

Summary

The analysis of nucleotide base composition in grouping related metagenomic sequences allows for more in-depth analysis of the functional potential of a sampled microbial community, in the context of constituent members (inferred populations), rather than simply observing the overall functional potential of a community. The features in current use are essentially based on nucleotide frequencies, which describe the relative abundance of n-mers in a sequence, and various extensions to these signatures have also been introduced, such as the oligonucleotide frequency-derived error gradient (OFDEG). The performance of composition-based features for binning can be improved when complementary features are used in combination, which can result in an increase in the taxonomic resolution of the groups that result. In general, however, the accuracy of grouping sequences using nucleotide base composition is largely governed by the algorithm used to analyze the patterns (whether in a supervised or an unsupervised setting), the available sequence lengths, and the choice of compositional feature. On the other hand, the level of taxonomic resolution that can be achieved in such an analysis is more heavily influenced by the choice of compositional feature alone. In most cases, these limitations can be alleviated with advances in sequencing technology. Nevertheless, there is much that can be unveiled when patterns are extracted from metagenomic sequences. It is, however, a matter of knowing what patterns to extract and how best to extract them before an attempt is made to group them.
References

Chan C-K, Hsu A, Halgamuge SK, Tang S-L. Binning sequences using very sparse labels within a metagenome. BMC Bioinforma. 2008;9:215.
Karlin S, Mrazek J, Campbell A. Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol. 1997;179(12):3899–913.
Karlin S, Campbell A, Mrazek J. Comparative DNA analysis across diverse genomes. Annu Rev Genet. 1998;32:185–225.
Mavromatis K, Ivanova N, Barry K, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4(6):495–500.
McHardy A, Rigoutsos I. What’s in the mix: phylogenetic classification of metagenome sequence samples. Curr Opin Microbiol. 2007;10:499–503.
Mrazek J. Phylogenetic signals in DNA composition: limitations and prospects. Mol Biol Evol. 2009;26(5):1163–9.
Saeed I, Halgamuge SK. The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments. BMC Genomics. 2009;10(S3):S10.
Saeed I, Tang S-L, Halgamuge SK. Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition. Nucleic Acids Res. 2012;40(5):e38.
Teeling H, Meyerdierks A, Bauer M, et al. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol. 2004;6(9):938–47.
Tyson G, Chapman J, Hugenholtz P, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
O
Open Resource Metagenomics

Trevor C. Charles and Josh D. Neufeld
Department of Biology, University of Waterloo, Waterloo, ON, Canada

Synonyms

Open-access metagenomics; Open-source metagenomics; Shared metagenomic libraries

Definition

Open resource metagenomics encompasses the emerging resource sharing system that facilitates distribution of metagenomic libraries throughout the research community. These libraries are constructed from DNA isolated directly from environmental samples. Broad host range cosmids are ideal for open resource metagenomics, accommodating large DNA inserts for screening or selections in multiple prokaryotic or eukaryotic hosts.

The Challenges of Functional Metagenomics

By capturing DNA extracted directly from environmental samples, the first metagenomic libraries were constructed in the late 1990s (Handelsman et al. 1998; Rondon et al. 2000). Metagenomic libraries are typically constructed for specific applications such as retrieving genes with desired functions. Functional metagenomics is the use of metagenomic libraries to isolate genes of interest based on associated activity of captured environmental genes (recently reviewed in Ekkers et al. 2012). Metagenomic libraries may be screened or selected in several potential host organisms (Wexler and Johnston 2010), commonly for functions of potential biotechnological value. This approach facilitates the discovery of novel genes, without requiring culture of the organisms that naturally carry those genes or sequence homology to known genes.

In recent years, the combination of high-capacity sequencing and advances in computational analysis of metagenomic sequence data has resulted in dramatic improvements in gene discovery in the absence of functional screening (Thomas et al. 2012). Despite these improvements, a fundamental limitation is that links between sequence and function tend to be substantially incomplete. This is not only a limitation of metagenomic library analysis, it is also an important caveat for the study of genomes from individual organisms. For example, it is often not possible to assign a function to a gene product of a characterized protein family, although that is precisely the limitation of computational methods for sequence-based analyses. Confident determination of specific functions, such as substrate specificity associated with sequence motifs, relies on the availability of experimental data. Arguably, the most interesting and valuable
metagenomic genes will be those whose function could not have been predicted by sequence alone; these genes would be more likely to encode products with truly novel properties.

A major advantage of metagenomic libraries is that once they are made, they can be a permanent resource, a snapshot of the microbial community that the DNA was extracted from. The same library, if stored properly, can be screened multiple times, indefinitely. Below we outline several methodological considerations for maximizing benefit from open resource metagenomic libraries.

Although they have sometimes been used, small-insert libraries are not optimal for functional metagenomics. The smaller the insert, the less chance that individual clones will contain full operons, including the regions required for control of gene expression. As a result, the use of bacterial artificial chromosome (BAC; Kakirde et al. 2012) and cosmid/fosmid (Aakvik et al. 2009; Neufeld et al. 2011; Taupp et al. 2011) vectors enables the cloning of fragments that are large enough to include multiple operons. Such large-insert libraries require fewer clones to ensure that they are representative.

Depending on DNA yields, quality, and size, metagenomic libraries of environmental microbial communities may yield several million clones. If such libraries are distributed into 384-well plates, this would represent over 2,500 plates per million clones (1,000,000 / 384 ≈ 2,604). Plate storage would require extensive freezer space, and screening such libraries, one clone at a time, would be prohibitively laborious and costly, even with the use of robotic manipulation. An alternative strategy we recommend is to recover and maintain the libraries as pools of clones. This procedure involves physical harvesting and mixing of all individual colonies from all initial library plates, followed by the preparation of aliquot suspensions for cryopreservation and subsequent distribution.

Another important consideration is that different host backgrounds will selectively express only a subset of an environmental metagenome. For example, Bacteroides genes use specialized promoters for their transcription (Mastropaolo et al. 2009). Host-specific limitations on gene expression include posttranscriptional controls, including translation initiation, codon usage, protein folding, enzyme activation, and transport. Also, wild-type and mutant strains that are most appropriate for a given screen might be available only in a host background that does not support replication of a given vector. This is especially true when using vectors that only replicate in Escherichia coli and other Gammaproteobacteria. For these reasons, it is advantageous to choose or design vectors that can be maintained in diverse host backgrounds.

Metagenomic libraries are often constructed for specific applications, such as to screen for a desired enzyme activity. Unlike the situation for single culture isolates, which must be deposited in accessible culture collections or otherwise made available as a requirement of publication of research results involving them, there is no such requirement or expectation for metagenomic libraries. This is unfortunate, as high-quality metagenomic libraries are technically challenging and costly to construct, and their full value is often not realized if their use is restricted to one or a few research groups.

Achieving Metagenomic Resource Sharing

We formally proposed that to ensure maximum value, metagenomic libraries should be made publicly available to members of the research community, without restriction (Neufeld et al. 2011). This is the concept of open resource metagenomics. We recommended that libraries be pooled to ensure ease of archiving as frozen stocks and for subsequent distribution and handling. We also recommended that cosmid libraries be used, because they allow the efficient cloning of large inserts of >30 kb. To facilitate screening in a diversity of host backgrounds, cosmid vectors with broad host range origins of replication are recommended, as well as Gateway recombinational systems for easy transfer of inserts to other vectors. An example of such a resource is
the Canadian MetaMicroBiome Library project (CM2BL; http://cm2bl.org), which houses a collection of Canadian soil metagenomic libraries in an IncP cosmid Gateway vector. The largest library in this collection contains over eight million clones. To assist users in deciding which libraries to choose for a given application, extensive metadata and taxonomic sequence information are accessible in an online database.

Summary

The open resource metagenomics initiative aims to increase the availability of metagenomic libraries to the research community as a public and scientific resource. The principle of free and open sharing of metagenomic libraries is central to this initiative, including direct access to associated metadata and DNA sequences. Increased gene discovery as a result of the use of these libraries not only has the potential to provide novel, biotechnologically useful genetic material but should increase the overall understanding of gene functions and their relationship to DNA sequence.

References

Aakvik T, Degnes KF, Dahlsrud R, Schmidt F, Dam R, Yu L, Völker U, Ellingsen TE, Valla S. A plasmid RK2-based broad-host-range cloning vector useful for transfer of metagenomic libraries to a variety of bacterial species. FEMS Microbiol Lett. 2009;296:149–58.
Ekkers DM, Cretoiu MS, Kielak AM, Van Elsas JD. The great screen anomaly – a new frontier in product discovery through functional metagenomics. Appl Microbiol Biotechnol. 2012;93:1005–20.
Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5:R245–9.
Kakirde KS, Wild J, Godiska R, Mead DA, Wiggins AG, Goodman RM, Szybalski W, Liles MR. Gram negative shuttle BAC vector for heterologous expression of metagenomic libraries. Gene. 2012;475:57–62.
Mastropaolo MD, Thorson ML, Stevens AM. Comparison of Bacteroides thetaiotaomicron and Escherichia coli 16S rRNA gene expression signals. Microbiology. 2009;155:2683–93.
Neufeld JD, Engel K, Cheng J, Moreno-Hagelsieb G, Rose DR, Charles TC. Open resource metagenomics; a model for sharing metagenomic libraries. Stand Genomic Sci. 2011;5:203–10.
Rondon MR, August PR, Betterman AD, Brady SF, Grossman TH, Liles MR, Loiacono KA, Lynch BA, Macneil IA, Minor C, Tiong CL, Gilman M, Osburne MS, Clardy J, Handelsman J, Goodman RM. Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol. 2000;66:2541–7.
Taupp M, Mewis K, Hallam SJ. The art and design of functional metagenomic screens. Curr Opin Biotechnol. 2011;22:465–72.
Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microb Inform Exp. 2012;2:3.
Wexler M, Johnston AB. Wide host-range cloning for functional metagenomics. Methods Mol Biol. 2010;668:77–96.
P
Phylogenetics, Overview

Phylogenetics: A Root and Branch Analysis of the Tree of Life

Roy Sleator
Department of Biological Sciences, Cork Institute of Technology, Cork, Co. Cork, Ireland

Synonyms

Evolutionary relatedness

Definition

Phylogenetics, derived from the Greek terms phylon (meaning “tribe”) and genetikos (meaning “genitive” or origin), is the study of the evolutionary history of species, organisms, genes, or proteins through the construction and analysis of mathematical entities known as trees or phylogenies.

Introduction

Darwin’s The Origin of Species marked the birth of phylogeny, a discipline whose primary aims are to classify all living organisms, grouping all extant descendants of a given ancestor within specific groups or clades; to provide insights into the shared properties of members within each clade; and to allow retrodiction, i.e., the ability to infer ancestral properties based on observable characteristics of extant organisms. A significant limitation of traditional morphology-based phylogeny approaches is the fact that reconstructing ancient evolutionary events requires a vast sum of character changes. Furthermore, many of these morphological characters are likely under selective pressure and subject to convergence (Sleator 2010). Based solely on this criterion, most organisms lack sufficient phenotypic characters to perform effective comparative analyses (Lopez and Bapteste 2009).

The development of modern DNA and protein sequence technologies has, however, effectively eliminated this limitation. Modern phylogenetic analysis involves the progressive alignment of nucleic acid and/or protein sequences between extant organisms. A hypothesis is then produced to explain the repartition of character states, and the results are presented as a phylogenetic tree, which is simply a graphic representation of the computed output.

The accelerating accumulation of molecular sequence data arising from recent concerted large-scale genomic and metagenomic sequencing projects (Sleator et al. 2008) continues to afford new opportunities and perspectives for dissecting evolutionary relationships. Indeed, while early molecular phylogenetic approaches centered on individual DNA sequences coding for RNA or proteins, or the derived amino acid sequences of the latter, more recent analysis of
whole genomes has led to the development of phylogenomics, a powerful approach to analyze complete genome sequences as a metasequence (Forterre and Gadelle 2009).

Discussion

Tree-Building Methods: The Major Analytical Approaches
While several methods exist for inferring evolutionary relatedness, most can be classified as either distance- or character-based methods (outlined in Fig. 1). Distance (or algorithmic) methods employ an algorithm incorporating a model of evolution (e.g., amino acid substitution) to compute a distance matrix from which a phylogenetic tree is calculated by means of progressive clustering. Specifically, distances in the matrix relate to the number of differences between each pair of sequences (either DNA or protein). The model of evolution specifies how amino acid substitutions occurred in the protein sequences since they last shared a common ancestor. Finally, the tree is constructed from the numerical data in the matrix, with the most closely related sequences occupying a position on the tree which is distant from the less closely related sequences. Both the neighbor-joining (NJ) and the unweighted pair group method using arithmetic averages (UPGMA) approaches to tree building employ distance-based methods.

Phylogenetics, Overview, Fig. 1 Tree-building methods. Schematic overview of the major analytical approaches to phylogenetic tree building

Although fast and readily available in user-friendly software packages such as MEGA (Tamura et al. 2011), distance-based methods have a number of significant limitations. NJ, for example, provides only a single tree as opposed to character-based methods which compute a consensus tree from several optimal or near optimal candidates. Furthermore, NJ may compute different trees depending on the order in which the constituent sequences are added. Finally, given that differences are presented as distance values, it is impossible to identify the specific character changes that support a branch (Soltis and Soltis 2003a).

Character-based methods (also referred to as tree-searching methods) search for the most probable tree for a specific sequence set based on characters at each position of the sequence alignment and a model of evolution. The most common character-based approaches include maximum parsimony (MP), maximum likelihood (ML), and to a lesser extent Bayesian methods.

MP seeks to find the tree or trees that are compatible with the minimum number of substitutions among sequences, i.e., the fewest evolutionary changes. An advantage of MP is that it provides diagnosable units (i.e., specific sets of characters) for each clade and branch lengths in terms of the number of changes on each branch of the tree. However, a significant limitation of the MP approach is that it requires strict assumptions of consistency across sites and among lineages. Thus, MP performance is significantly affected when mutational rates differ between conserved and hypervariable regions or if evolutionary rates are highly variable among evolutionary lineages. Finally, parsimony lacks an explicit model of evolution.
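To make the parsimony criterion concrete, the following Python sketch scores a fixed candidate tree by the minimum number of substitutions it requires, using Fitch's small-parsimony algorithm. This is one standard way of evaluating the MP criterion for a given topology, not a full tree search, and it is not drawn from the text above. The nested-tuple tree encoding, the function names, and the toy alignment are assumptions made for illustration.

```python
def fitch_site(tree, states):
    """Return (candidate_states, changes) for one alignment column.

    tree:   a taxon name (leaf) or a 2-tuple of subtrees (internal node)
    states: dict mapping each taxon name to its character in this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # The children disagree: one substitution is required at this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Minimum number of substitutions the tree needs to explain the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical four-taxon alignment used only to illustrate the scoring.
alignment = {"A": "AAGT", "B": "AAGT", "C": "ACGA", "D": "ACGA"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
```

For this toy alignment, `parsimony_score(tree_1, alignment)` is lower than `parsimony_score(tree_2, alignment)`, so an MP search restricted to these two candidates would prefer the first topology.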
ML methods are based on specific probabilistic models of evolution and search for the tree with maximum likelihood under these models. The model of evolution may be empirical, derived from general assumptions about the evolution of sequences, or parametric, based on values estimated from the dataset. The major advantage of likelihood approaches is that they are based on powerful statistical theory which facilitates the application of robust statistical hypothesis testing and significant refinements to the resulting phylogenetic trees. However, while these strong statistical foundations make ML techniques arguably the most powerful approach in terms of phylogenetic reconstruction, paradoxically this strength is also a significant weakness, in that ML approaches are computationally intensive and, as a result, significantly slower than alternative approaches. As such, ML analysis can only be practically applied to a limited number of sequences (Soltis and Soltis 2003a).

In practice, both distance- and character-based methods tend to be used in tandem. An initial tree may be estimated by a distance-based method and used to test the parameters of the model of evolution. The most appropriate of these might then be used in a maximum likelihood tree search.

Testing the Reliability of a Tree
There are two approaches to finding the best tree: those that use optimality criteria that can be evaluated for any given tree (used for MP and ML) and those that involve the progressive clustering of sequence subsets (used for NJ and UPGMA). In the optimality methods, trees are evaluated one by one by either exhaustive, branch and bound, or heuristic searches. Exhaustive searches evaluate all possible bifurcating trees to find a globally optimal topology; such an approach is only feasible for a relatively small number of taxa (<10). Rather than evaluating every possible tree, the branch and bound approach first chooses a local optimum value for tree length representing the total number of evolutionary changes on the tree; any tree length greater than the local optimum is automatically discarded, thus saving time and computational expense. Branch and bound searches are effective up to ~20 taxa. Heuristic (or “best guess”) searches employ a “hill climbing” approach: an initial tree is chosen and subsequently modified; changes leading to an inferior tree descend the hill and the tree is rejected; changes leading to an improvement ascend the hill; when no further improvement is possible, the search is terminated. Although this is an extremely fast approach, there is no guarantee that the returned tree is the global optimum (the summit) rather than merely a local optimum (a foothill’s plateau).

Once an optimum tree is chosen, some statistical measure of internal support for clades must also be provided to show that the tree is sufficiently robust and biologically meaningful. To this end a variety of methods have been proposed to verify the evolutionary reliability of trees, of which the most commonly used is bootstrap analysis. Bootstrapping can be divided into parametric and nonparametric approaches (Wrobel 2008). Nonparametric bootstrapping is a numerical resampling approach in which a set of sequence alignments, referred to as bootstrap or pseudo-alignments, is formed from the dataset by random sampling with replacement. This process is repeated several times (depending on the size of the dataset and the specifications of the analysis), usually with a default setting of 1,000 replicates. Bootstrap values are conservative measures of phylogenetic accuracy, with values of 70 % or more representing “true” clades in experimental phylogenies. Parametric bootstrapping on the other hand creates replicate samples using numerical simulation as opposed to resampling. This approach is usually applied to test competing hypotheses.

Although generally effective, the bootstrap approach rests on a number of assumptions which are not optimal when applied to molecular sequence analysis (for an overview see Box 1). In addition to bootstrapping, another measure of internal support which is often used in phylogenetic analyses is jackknifing. Although similar to bootstrapping, jackknifing involves one significant difference; rather than resampling the data, this approach uses only subsets of the available data (i.e., resampling without replacement to
create a smaller dataset). The purpose of this is to account for the presence of possible “outlier” characters which might have a disproportionate influence on the resulting tree.

Other less common approaches to measuring internal support include the decay index for parsimony analyses (Hernandez Fernandez and Vrba 2005) and the posterior probabilities generated in Bayesian inference (Wrobel 2008).

Box 1. Limitations of Bootstrap Analysis when Applied to Molecular Sequences
• The statistical bases of bootstrap analysis require that all positions of an alignment are independently and identically distributed. However, this assumption fails to hold true for either nucleotide or amino acid sequences. For example, in proteins certain di-residues (in the primary structure) are either over- or underrepresented (Karlin et al. 1991), while strong correlations are observed between positions that interact within the 3D structure (Karlin et al. 1994).
• Bootstrap analysis is hampered by unequal evolutionary rates. If mutational rates are too high or uneven among lineages, the bootstrap proportion P is usually an overestimate (Soltis and Soltis 2003b).
• Molecular sequences are not representative of a homologous population, and as such resulting bootstrap values may not signify reliable clusters (Brocchieri 2001).

Difficulties Associated with Creating Reliable Phylogenetic Trees

Phylogenetic inferences are only as good as the alignments they are drawn from – “Garbage in; garbage out.” The majority of current alignment protocols are based on dynamic programming (DP) procedures which seek to identify the maximal alignment score, a value determined by the choice of scoring matrix (e.g., PAM or BLOSUM) and the assignment of gap penalties. Rather than searching for the optimal alignment of n sequences in an n-dimensional space, most DP methods employ fast heuristic or “greedy” approaches, progressively aligning pairs of sequences. However, while effective, this approach has a number of shortcomings in terms of phylogenetic analysis (for an overview of these shortcomings, see Box 2).

An alternative approach involves the application of motif finding algorithms which select common sequence motifs and align only these most conserved domains with no allowance for gaps or insertions (Lawrence et al. 1993).

In addition to alignment difficulties, two of the most significant problems associated with assessing tree reliability are long-branch attraction (LBA) associated with mutational saturation and lateral gene transfer (LGT) mediated, at least in part, by viruses and mobile genetic elements (Sapp 2007). As mutations accumulate during evolution, a point of mutational saturation is reached at which there is no further divergence between taxa (Brocchieri 2001). From this point on it becomes impossible to estimate evolutionary distance; furthermore very divergent sequences tend to be attracted together (Fig. 2) – hence the name – thus skewing their true position (Lopez and Bapteste 2009).

Box 2. Sequence Alignment Shortcomings
• Heuristic methods, although fast, only provide a best guess or estimate of the optimal alignment.
• Alignments are sensitive to the choice of similarity matrix (for amino acid sequence alignments) and gap penalty which are user adjustable – thus requiring human intervention.
• Hierarchically aligning pairs of sequences is prone to generate biases and dominance by the most similar sequences.

What Next...?

Phylogenomics – the merging of phylogenetics and genomics – is perhaps the most exciting recent development in the field of evolutionary mapping (Delsuc et al. 2005). Rather than concentrating on a single phylogenetic marker, whole-genome phylogenomic approaches involve comparisons of gene content: the
Phylogenetics, Overview, Fig. 2 Long-branch attraction. A simulated example of long-branch attraction. (i) The real tree of the relationships among five taxa, with two taxa (B and C) having long evolutionary branches. (ii) An inferred tree of the taxa in which B and C are artificially grouped together because of the phenomenon of long-branch attraction
presence or absence of orthologous genes (or gene families) and/or gene order. Genomic relationships based on genomic content and organization (representing the genomic profile) are inferred using the genomic signature which computes differences within, and between, species-specific sequences based on dinucleotide relative abundance differences. The generality and robustness of the genomic signature give it an advantage over traditional approaches which, based on individual sequences, are strongly influenced by mutational events such as LGT.

Finally, while it is tempting to consider only the Darwinian-Mendelian model of vertical gene transfer in phylogenetic analysis, recent evidence suggests that the role of LGT in shaping evolution can no longer be ignored. Indeed, in certain prokaryotes the LGT rate is comparable to and, in some instances, significantly higher than the rate of spontaneous mutation (Lawrence 2002). LGT has also been observed between eukaryotes (Andersson et al. 2007) as well as between organelles of the same cell (Archibald et al. 2003). A major consequence of LGT is that instead of focusing on the elusive “tree of life” (Puigbo et al. 2009), phylogenetic analysis must now consider the whole forest, corresponding to the integrated framework of vertical and lateral gene transfer (Lopez and Bapteste 2009). Slowly, but surely, evolutionary biologists are beginning to “see the wood for the trees” (Sleator 2011).

Summary

Recent rapid expansions in the DNA and protein databases, arising from large-scale genomic and metagenomic sequence projects, have forced significant development in the field of phylogenetics, the study of the evolutionary relatedness of the planet’s inhabitants. Advances in phylogenetic analysis have greatly transformed our view of the landscape of evolutionary biology, transcending the view of the tree of life which has shaped evolutionary theory since Darwinian times. Indeed, modern phylogenetic analysis no longer focuses on the restricted Darwinian-Mendelian model of vertical gene transfer but must also consider the significant degree of lateral gene transfer which connects and shapes almost all living things.

Cross-References

▶ DNA Methylation Analysis by Pyrosequencing
▶ Horizontal Gene Transfer and Bacterial Diversity
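
The genomic signature discussed in this entry is described only in words. As a rough illustration, the sketch below assumes the commonly used odds-ratio form of dinucleotide relative abundance, rho_XY = f_XY / (f_X f_Y), with counts pooled over both strands, and takes the mean absolute difference between two signatures as a distance. The function names, the pooling of strands, and the choice to skip ambiguous bases are assumptions made for this example; they are not taken from the text.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def dinucleotide_relative_abundance(seq):
    """Return a dict mapping each dinucleotide XY to f_XY / (f_X * f_Y).

    Frequencies are pooled over the sequence and its reverse complement
    (an assumption of this sketch); positions with ambiguous bases are skipped.
    """
    seq = seq.upper()
    mono, di = Counter(), Counter()
    for strand in (seq, reverse_complement(seq)):
        mono.update(b for b in strand if b in BASES)
        di.update(
            strand[i:i + 2]
            for i in range(len(strand) - 1)
            if strand[i] in BASES and strand[i + 1] in BASES
        )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence has no unambiguous dinucleotides")
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        # A base that never occurs gives an undefined ratio; report 0.0 here.
        rho[x + y] = observed / expected if expected > 0 else 0.0
    return rho


def signature_distance(seq_a, seq_b):
    """Mean absolute difference between two dinucleotide signatures."""
    rho_a = dinucleotide_relative_abundance(seq_a)
    rho_b = dinucleotide_relative_abundance(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    print(signature_distance("AAAAACCCCC", "ACACACACAC"))
```

Because the measure uses all dinucleotides in a sequence rather than a single aligned marker, it can be computed for any pair of sufficiently long sequences without an alignment step. Real applications would use much longer sequences than the toy strings shown here.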
References

Andersson JO, Sjogren AM, Horner DS, Murphy CA, Dyal PL, Svard SG, Logsdon JR JM, Ragan MA, Hirt RP, Roger AJ. A genomic survey of the fish parasite Spironucleus salmonicida indicates genomic plasticity among diplomonads and significant lateral gene transfer in eukaryote genome evolution. BMC Genomics. 2007;8:51.
Archibald JM, Rogers MB, Toop M, Ishida K, Keeling PJ. Lateral gene transfer and the evolution of plastid-targeted proteins in the secondary plastid-containing alga Bigelowiella natans. Proc Natl Acad Sci U S A. 2003;100:7678–83.
Brocchieri L. Phylogenetic inferences from molecular sequences: review and critique. Theor Popul Biol. 2001;59:27–40.
Delsuc F, Brinkmann H, Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005;6:361–75.
Forterre P, Gadelle D. Phylogenomics of DNA topoisomerases: their origin and putative roles in the emergence of modern organisms. Nucl Acids Res. 2009;37:679–92.
Hernandez Fernandez M, Vrba ES. A complete estimate of the phylogenetic relationships in Ruminantia: a dated species-level supertree of the extant ruminants. Biol Rev Camb Philos Soc. 2005;80:269–302.
Karlin S, Bucher P, Brendel V, Altschul SF. Statistical methods and insights for protein and DNA sequences. Annu Rev Biophys Biophys Chem. 1991;20:175–203.
Karlin S, Zuker M, Brocchieri L. Measuring residue associations in protein structures. Possible implications for protein folding. J Mol Biol. 1994;239:227–48.
Lawrence JG. Gene transfer in bacteria: speciation without species? Theor Popul Biol. 2002;61:449–60.
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262:208–14.
Lopez P, Bapteste E. Molecular phylogeny: reconstructing the forest. C R Biol. 2009;332:171–82.
Puigbo P, Wolf Y, Koonin E. Search for a ‘Tree of Life’ in the thicket of the phylogenetic forest. J Biol. 2009;8:59.
Sapp J. The structure of microbial evolutionary theory. Stud Hist Philos Biol Biomed Sci. 2007;38:780–95.
Sleator RD. An overview of the processes shaping protein evolution. Sci Prog. 2010;93:1–6.
Sleator RD. Phylogenetics. Arch Microbiol. 2011;193:235–9.
Sleator RD, Shortall C, Hill C. Metagenomics. Lett Appl Microbiol. 2008;47:361–6.
Soltis DE, Soltis PS. The role of phylogenetics in comparative genetics. Plant Physiol. 2003a;132:1790–800.
Soltis PS, Soltis DE. Applying the bootstrap in phylogeny reconstruction. Stat Sci. 2003b;18:256–67.
Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011;28:2731–9.
Wrobel B. Statistical measures of uncertainty for branches in phylogenetic trees inferred from molecular sequences by using model-based methods. J Appl Genet. 2008;49:49–67.

PhyloPythia(S)

Alice C. McHardy
Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Düsseldorf, Germany

Definition

PhyloPythia and its successor PhyloPythiaS are fast and accurate oligomer signature-based classifiers for the taxonomic assignment of metagenome sequence fragments.

Introduction

Metagenomics uses random shotgun sequencing to recover genome sequence information from microbial communities without the need for cultivation of their member species. It thus gives access to the vast portion of the microbial world that cannot be cultured with standard techniques (Hugenholtz 2002). The sequencing of randomly sheared microbial community DNA initially generates a collection of short sequence fragments called reads. Depending on the sequencing technology used, the amount of generated data and read lengths vary (Metzker 2010; Droge and McHardy 2012): while traditional Sanger sequencing generates reads of around 800 bp, the commercially available “next-generation” sequencing technologies return reads of approximately 50–75 bp (SOLiD sequencing by Applied Biosystems/Life Technologies), 75–300 bp (sequencing by synthesis technology by Solexa/Illumina), 100–200 bp (semiconductor chip sequencing by Ion Torrent/Life Technologies), and 550–1,000 bp (pyrosequencing by
454/Roche). The recently developed single-molecule sequencers produce read lengths of over 1 kb (PacBio SMRT) and of 5–10 kb (Oxford Nanopore technology). Currently, a single run of an Illumina HiSeq 2000 machine produces up to six billion paired-end reads or 600 Gb of sequence data (Illumina 2012).

Bioinformatics methods are subsequently applied to process the data. Assembly software such as MetaVelvet (Namiki et al. 2012) can be used to reconstruct longer contiguous sequence fragments, or contigs, based on overlaps in reads. For paired-end reads, the distances between reads originating from the two ends of an individual DNA fragment are approximately known. If paired-end reads are assembled into different contigs, the orientation of these contigs relative to each other and the size of the unassembled gap between them can be inferred. This ordering of contigs with gaps of known sizes is also referred to as a scaffold. The resulting sequence fragments, i.e., the contigs, scaffolds, and remaining unassembled reads, could in principle originate from any member species of the microbial community.

In taxonomic assignment or “binning,” the fragments are assigned to individual species or higher-ranking clades (see Droge and McHardy (2012) for a recent review). The term “binning” was coined as a metaphor to describe the process of separating the fragment mixture by placing individual fragments into bins representing the different taxonomic origins. Besides variations caused by amplification bias of sequencing, the number of reads recovered for a community member should be approximately proportional to the product of its abundance and the size of its genome (Segata et al. 2012). Thus fragments are more likely to originate from the more abundant community members, which are more extensively covered by sequencing. Taxonomic assignment is different from taxonomic profiling for a metagenome. In profiling, the relative abundances of the different community members are estimated based on taxonomic assignment of either universal or clade-specific marker genes found on a subset of the sample fragments (Wu and Eisen 2008; Sharpton et al. 2011; Segata et al. 2012; Wu and Scott 2012).

With the exception of highly complex communities, such as those found in soil, assembly and taxonomic assignment of metagenome samples sequenced to sufficient depth allow the reconstruction of draft genomes, corresponding to sets of contigs or scaffolds representing more than 50% of a genome (Pope et al. 2010; Hess et al. 2011; Iverson et al. 2012). This enables a functional analysis and reconstruction of metabolic potential for individual community members. The annotation of assembled and unassembled metagenome fragments can be performed with publicly available servers such as MG-RAST, IMG/M, and CAMERA (Glass et al. 2010; Sun et al. 2011; Markowitz et al. 2012). In annotation, the presence and functionalities of genes and operons are identified and metabolic pathways reconstructed by comparing enzymes predicted to be encoded in these fragments with known reference pathways for model organisms.

In the following, the PhyloPythia and PhyloPythiaS software for the taxonomic assignment of metagenome sequence fragments are described.

Description

PhyloPythia and its successor PhyloPythiaS are oligomer signature-based classifiers for the taxonomic assignment of metagenome sequence fragments (McHardy et al. 2007; Patil et al. 2011). The methods are named after the Pythia, the priestess at Apollo’s oracle in ancient Delphi. They use the similarity in oligomer usage between a query sequence and a target clade as information. For prokaryotes, this allows genome sequence fragments to be assigned to the species or higher-ranking taxonomic clades from which they originate. Oligomer- or composition-based taxonomic assignment differs from sequence similarity-based or phylogenetic methods in that global instead of local properties of the genome sequence are used as information. There is no requirement for homologous sequences of related
taxa to be known for every analyzed fragment. A fraction of a species’ genome sequence, typically 100 kb or more, suffices as reference data. Reference data can be obtained by identifying contigs with conserved marker genes such as 16S rRNA from the sample itself or by additional sequencing of large insert libraries containing marker genes (Warnecke et al. 2007; Pope et al. 2010). Oligomer-based assignment therefore is advantageous for taxonomic assignment of metagenomes from microbial communities with few available sequenced genomes of their members or of related species. Oligomer-based taxonomic assignment is faster than alignment-based methods, as no sequence similarity searches in a large collection of reference sequences are required. This makes it well suited for the analysis of large next-generation sequence samples. For short fragments of less than 1 kb or for assignment over long taxonomic distances, homology-based methods tend to be more accurate (Patil et al. 2011). With PhyloPythia, the relative frequencies of 4–6 mer oligomer patterns with up to two wildcard characters in a sequence fragment are used as features to train ensembles of multi-class support vector machine classifiers with a Gaussian kernel for individual taxonomic ranks. These are subsequently combined for the assignment of variable length sequence fragments. PhyloPythiaS uses an ensemble of structured support vector machines with a linear kernel trained with the relative frequencies of 4–6 mer oligomers in sequence fragments. The structured output formulation allows a classifier to be learned simultaneously for the entire taxonomy under consideration of commonalities of clades with partially shared evolutionary histories.

Summary

Metagenomics uses random shotgun sequencing to recover genome sequence information from microbial communities without the need for cultivation of their member species. It thus gives access to the vast portion of the microbial world that cannot be cultured with standard techniques. Bioinformatics methods are subsequently applied to process the data. Assembly software is used to generate genomic sequence fragments, which could in principle originate from any member species of the microbial community. In taxonomic assignment or “binning,” the fragments are assigned to the individual species or higher-ranking clades from which they originate. PhyloPythia and its successor PhyloPythiaS are oligomer signature-based classifiers for the taxonomic assignment of metagenome sequence fragments. Oligomer signature-based taxonomic assignment is faster than alignment-based methods, as no sequence similarity searches in a large collection of reference sequences are required. Oligomer signature-based assignment is well suited for the taxonomic assignment of metagenomes from microbial communities with few available sequenced genomes of their members or of related species. For microbial community members with draft genomes reconstructed by taxonomic binning, a functional analysis based on gene content and reconstruction of metabolic potential can be performed.

Cross-References

▶ A 123 of Metagenomics
▶ Genome Portal, Joint Genome Institute
▶ KEGG and GenomeNet, New Developments, Metagenomic Analysis

References

Droge J, McHardy AC. Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief Bioinforma. 2012;13(6):646–55.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc. 2010;2010(1):pdb prot5368.
Hess M, Sczyrba A, Egan R, Kim TW, Chokhawala H, Schroth G, et al. Metagenomic discovery of
biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–7.
Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002;3(2):REVIEWS0003.
Illumina. 2012. Available from: http://www.illumina.com/Documents/systems/hiseq/datasheet_hiseq_systems.pdf.
Iverson V, Morris RM, Frazar CD, Berthiaume CT, Morales RL, Armbrust EV. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science. 2012;335(6068):587–90.
Markowitz VM, Chen IM, Chu K, Szeto E, Palaniappan K, Grechkin Y, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012;40(Database issue):D123–9.
McHardy AC, Garcia-Martin H, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4(1):63–72.
Metzker ML. Sequencing technologies – the next generation. Nat Rev Genet. 2010;11(1):31–46.
Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
Patil KR, Haider P, Pope PB, Turnbaugh PJ, Morrison M, Scheffer T, et al. Taxonomic metagenome sequence assignment with structured output models. Nat Methods. 2011;8(3):191–2.
Pope PB, Denman SE, Jones M, Tringe SG, Barry K, Malfatti SA, et al. Adaptation to herbivory by the tammar wallaby includes bacterial and glycoside hydrolase profiles different from other herbivores. Proc Natl Acad Sci U S A. 2010;107(33):14793–8.
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9(8):811–4.
Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O’Dwyer JP, Green JL, et al. PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Comput Biol. 2011;7(1):e1001061.
Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39(Database issue):D546–51.
Warnecke F, Luginbuhl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007;450(7169):560–5.
Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9(10):R151.
Wu M, Scott AJ. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics. 2012;28(7):1033–4.

Plasmid Capture from Metagenomes

Brian V. Jones
Center for Biomedical and Health Science Research, University of Brighton, School of Pharmacy and Biomolecular Sciences, Brighton, East Sussex, UK

Definitions

Metagenome: The collective genomes of all members of a bacterial community.
Mobile metagenome: The total pool of mobile genetic elements associated with a bacterial community.
Mobile genetic element (MGE): A discrete genetic unit capable of mediating its own transfer between distinct DNA molecules and/or between distinct host cells of the same or different species. Plasmids, transposons, insertion sequences, conjugative transposons, integrons, and bacteriophage are all examples of MGE.
Plasmid: Closed circular DNA molecule that replicates within host cells as an autonomous extrachromosomal element.
Plasmidome: Plasmid fraction of the mobile metagenome. May be defined as the total pool of plasmids associated with a microbial community and a component of the mobile metagenome as a whole.
Horizontal gene transfer: Transfer and acquisition of genetic material between distinct cells or species, outside of and in addition to the normal process of inheritance (vertical gene transfer).
Synonyms

Gut microbiota; Lateral gene transfer (LGT); Mobile microbiome

Introduction

Complex and diverse microbial ecosystems exist in a wide range of habitats ranging from aquatic and terrestrial environments, to those created on and within animals, plants, and other metazoans. The activities of these microbial consortia (microbiomes) contribute to important environmental processes such as nutrient cycling and bioremediation, while those associated with higher eukaryotic organisms are now widely recognized to be intimately involved in host health and aspects of host development (Ley et al. 2006; Jones 2010; Jones and Marchesi 2007a; Strom 2008).

However, members of both host-associated and “free-living” environmental microbiomes in turn play host to a wide range of mobile genetic elements (MGE) such as plasmids, transposons, and bacteriophage, which are now also being recognized as important components of these microbiomes (Jones and Marchesi 2007a; Jones et al. 2010; Jones 2010; Ogilvie et al. 2012; Kav et al. 2012; Reyes et al. 2010; Zhang et al. 2011). Collectively the total pool of MGE associated with a particular microbial ecosystem is referred to as its mobile metagenome (Jones and Marchesi 2007a, b), and there is increasing interest in understanding how this versatile and dynamic reservoir of genes and genetic elements is involved in the development of these ecologies. For host-associated microbiomes there is also the added dimension of how the mobile metagenome of a particular ecosystem may impact on the health of the higher eukaryotic host, either through effects on the host microbiome or directly through the functions encoded by constituent MGE (Jones 2010; Ogilvie et al. 2012; Ley et al. 2006).

Moreover, MGE have also been proposed to facilitate the spread of beneficial functions within a bacterial community (Jones and Marchesi 2007a; Jones 2010; Lozupone et al. 2008; Heuer and Smalla 2012; Ley et al. 2006). MGE are capable of moving between distinct molecules of DNA and/or host cells and are also well documented to acquire new genetic material from host bacteria and subsequently disseminate this to other species. This feature of MGE facilitates the exchange and maintenance of genetic material between diverse species, a process termed horizontal gene transfer (HGT). HGT allows cells to rapidly acquire new genes and activities, which facilitates adaptation to new environments and the formation of new functional pathways, and is believed to be a pivotal factor in the evolution and diversification of bacteria (Ochman et al. 2000; Heuer and Smalla 2012; Jones and Marchesi; Jones 2010).

This is of particular relevance to host-associated ecosystems, such as the human microbiome, where HGT is proposed to have played a key role in stabilizing the functional output of such ecosystems. For example, in the human gut microbiome, dissemination of key traits to multiple species in the community through HGT is thought to generate functional redundancy and protect against loss of important activities from the community as a whole (Ley et al. 2006; Lozupone et al. 2008; Jones and Marchesi 2007a; Jones 2010). In this context, it is notable that the human microbiome has now been shown to support an emergent and extensive network of gene exchange (with the highest rates of transfer observed in the gut microbiome) (Smillie et al. 2011), and it seems likely that MGE forge the majority of connections within this network.

Plasmids and Plasmidomes

Of the numerous types of MGE that will make up a particular mobile metagenome, those capable of autonomous cell to cell transfer are of special interest. Plasmids in particular are believed to be highly important in this regard and to be prevalent in many bacterial ecosystems. Not only are plasmids frequently capable of mediating their own transfer between distinct and diverse
bacterial species, but also act as vehicles for other MGE, and are known to encode a diverse array of accessory functions, including those relevant to health of higher eukaryotic organisms (reviewed in Ogilvie et al. 2012; Ochman et al. 2000; Smalla et al. 2000a). Functions encoded by plasmids include virulence factors, antibiotic resistance determinants, bacteriocins, nutrient acquisition and utilization, and degradation of xenobiotic compounds, as well as factors that mediate tolerance of a wide range of physical parameters (reviewed in Ogilvie et al. 2012; Ochman et al. 2000; Smalla et al. 2000a; Heuer and Smalla 2012).

Plasmids are covalently closed circular molecules of DNA which replicate as extrachromosomal elements in the cytoplasm, independently of the host cell chromosome. The copy number of different plasmids can vary considerably, ranging from 1 to 2 copies per cell for some plasmids to several hundred copies per cell for others (Espinosa et al. 2000; Novick 1987). This variation in copy number contributes to gene dosage effects for plasmid encoded genes, potentially increasing the output from plasmid encoded activities.

The size and gene content of these elements are also highly variable and range from small cryptic plasmids encoding no obvious functions outside of those essential for replication and maintenance to large mega plasmids of several hundred kilobases, which encode a diverse array of activities (Espinosa et al. 2000; Novick 1987; Heuer and Smalla 2012). Typically, larger plasmids are present in low copy number as they present a greater metabolic burden to host cells and often also encode all machinery necessary to initiate their own transfer between host cells via conjugation.

Plasmids are classified into distinct families, generally distinguished based on their ability to coexist and replicate within the same host cell (incompatibility groups) and the sequence homology of their replication machinery (Espinosa et al. 2000; Novick 1987). However, from studies of the established and well-characterized plasmid families, it is clear that plasmid genomes are highly diverse in nature, with plasmids in a particular family exhibiting a high degree of similarity around regions involved in basic replication and maintenance (the plasmid “backbone,” or core replicon) but considerable variation in overall size and gene content. Many plasmids have also been described as possessing a modular organization, with essential backbone functions and accessory genes organized as distinct gene clusters (Schlüter et al. 2007; Heuer and Smalla 2012). This modularized genome architecture affords plasmids a high degree of genetic flexibility in terms of gene loss or recruitment and is consistent with the diversity of plasmids and functions represented within a particular plasmid family.

Considering the diversity of the prokaryotic world and the relatively small numbers of plasmids characterized to date, it is clear that our knowledge of these elements remains limited. In conjunction with the insights into microbial ecology and diversity provided by the application of molecular genetic approaches (such as metagenomics) to the study of microbial communities, this has prompted many researchers to adopt a broader view of plasmids (and other MGE) associated with a particular microbiome (Jones et al. 2010; Ogilvie et al. 2012; Kav et al. 2012; Zhang et al. 2011). This shifts the emphasis to the global population of plasmids resident in a given ecosystem and the collective functions and activities they encode, giving rise to the concept of the plasmidome (Kav et al. 2012). The plasmidome refers to the total pool of plasmids associated with a particular mobile metagenome, and may be thought of as a distinct component of the mobile metagenome as a whole.

Accessing the Plasmidome

Plasmids are probably the best studied MGE, and a range of strategies exist to specifically recover and characterize these genetic elements (reviewed in Ogilvie et al. 2012). These include approaches that have been specifically designed to permit community-level analysis of microbial plasmidomes and to capture and analyze
plasmids from the non-cultivatable fraction of microbial communities, which account for the vast majority of bacterial species in these ecosystems. The development of such tools has been, and will continue to be, a major challenge with current approaches each exhibiting distinct strengths and weaknesses when applied to community-level analysis of plasmids (summarized in Table 1).

A particular issue faced by all approaches to survey microbial plasmidomes, as well as other facets of a given mobile metagenome, is the difficulty in evaluating the ability of any method to provide universal access to the plasmidome and identify any bias in the plasmids that may be identified and recovered (Ogilvie et al. 2012). Unlike analysis of the core chromosomal content of a microbiome, where detailed surveys of
Plasmid Capture from Metagenomes, Table 1 Relative merits of approaches available for analysis of microbial plasmidomes and plasmid capture from metagenomes (Modified from Ogilvie et al. 2012)

Plasmid isolation strategy: Endogenous isolation
Advantages:
• Original bacterial host is known
• May be used for all cultivatable bacteria
• Applicable to all plasmid types
Disadvantages:
• Requires host cultivation restricting utility for study of natural communities
• Reliance on plasmid encoded traits if surrogate host species required for plasmid characterization
Reference: Reviewed in Smalla and Sobecky (2002); Heuer and Smalla (2012); Ogilvie et al. (2012)

Plasmid isolation strategy: Exogenous isolation
Advantages:
• Culture independent
• Selective isolation of self-transmissible or mobilizable elements
• Potentially capable of isolating all plasmid types (circular and linear) and sizes
• Can isolate plasmids irrespective of abundance in community
Disadvantages:
• Relies on plasmid encoded traits for plasmid transfer, selection, and maintenance in surrogate host
• Original bacterial host unknown
• Range of plasmids isolated dependent on mating conditions used and dictated by numerous “unknown” environmental variables influencing host cell physiology and plasmid transfer kinetics
Reference: Bale et al. (1988)

Plasmid isolation strategy: PCR-based detection
Advantages:
• Culture independent
• High throughput
• Sensitive
• Scope for accurate quantitation of plasmids
Disadvantages:
• Original bacterial host unknown
• Complete characterization of plasmid detected generally impossible
• Limited to detection of known and characterized plasmid lineages used for primer design
Reference: Götz et al. (1996)

Plasmid isolation strategy: TRACA
Advantages:
• Culture independent
• Suitable for development of high-throughput strategies
• Can isolate plasmids irrespective of abundance in a community
• Fully independent of plasmid encoded traits
• Sequence-based characterization of plasmids facilitated by known Tn sequence in plasmids
• Potentially applicable to all circular plasmids and bacterial communities
• May permit capture of MGE other than plasmids when present as circular DNA molecules
Disadvantages:
• Original bacterial host unknown
• Transposon may inactivate genes of interest, impeding phenotypic characterization
• Currently available Tn elements and surrogate host may limit range of plasmids isolated
• Linear plasmids not captured
• Transformation step may introduce size bias
• Plasmids belonging to same incompatibility group as Tn origin may not be captured due to stability issues in surrogate host
• Potential for bias towards numerically dominant plasmids
Reference: Jones and Marchesi (2007b); Jones et al. (2010); Warburton et al. (2011); Zhang et al. (2011)
(continued)
Plasmid Capture from Metagenomes, Table 1 (continued)
Plasmid isolation
strategy Advantages Disadvantages Reference
Standard • Culture-independent • Original bacterial host unknown Kazimierczak
metagenomic • Suitable for development of • Likely bias towards numerically dominant et al. (2009)
libraries high-throughput strategies plasmids
(BAC/Fosmid) • Initial capture independent of • Screening relies on plasmid encoded traits
plasmid encoded traits expressed in surrogate host species
• Sequence-based characterization • Not specifically designed for plasmid
facilitated capture, and non-plasmid sequences
dominate libraries
• Generally only incomplete, partial
plasmids identified
• General compatibility of library
construction methods with plasmid capture
unknown
• Plasmids belonging to same
incompatibility group as vector
(BAC/Fosmid) may not be represented due to
instability of clones in surrogate host
• Plasmids belonging to same
incompatibility group as vector may not be
captured due to stability issues in surrogate
host
Shotgun • Culture-independent • Original bacterial host unknown Zhang
sequencing of • Suitable for development of • Removal of contaminating chromosomal et al. (2011)
plasmidomes high-throughput strategies DNA potentially problematic Kav
• Independent of plasmid encoded • Not suitable for survey of linear plasmids et al. (2012)
traits with present strategies for removal of
• Potential for complete access to chromosomal DNA
circular elements within a bacterial • Accurate assembly of complete plasmids
plasmidome will likely require a more comprehensive set
of reference plasmid genomes than presently
available
• Pre-sequence processing of plasmid DNA
(removal of chromosomal fragments and P
plasmid DNA amplification) likely to
introduce bias into final dataset. Requires
subsequent quantitative analysis to confirm
relative abundance of particular plasmids
population structure can be first undertaken using conserved housekeeping genes present in all bacterial chromosomes (such as genes encoding 16S rRNA), no such global survey is possible for plasmids (Ogilvie et al. 2012). As such, surveys of microbial plasmidomes are impeded by a fundamental lack of knowledge regarding the composition of these malleable gene pools, making the development and validation of methods which access a representative cross section of the plasmidome virtually impossible at present. Nevertheless, available strategies still offer the potential to provide much insight into microbial plasmidomes and have been applied to study a range of microbial ecosystems yielding important fundamental insights into the composition and functional content of associated plasmid pools.
Endogenous isolation: The simplest and most widely used approach to study plasmids is the direct isolation of plasmid DNA from host bacteria. This approach classically involves the cultivation of host species, usually with selection for particular traits of interest believed to be plasmid encoded (reviewed in Smalla and Sobecky 2002; Jones and Marchesi 2007a; Ogilvie et al. 2012).
Extracted plasmid DNA is subsequently transferred into a new host, ideally of the same species, with E. coli K12-type strains most commonly deployed. Plasmids are then typically characterized based on the phenotypes they confer upon host species but are increasingly examined at the nucleotide level, and plasmids sequenced as part of whole genome sequencing projects may also be considered as examples of endogenous isolation.
Aside from its simplicity and general applicability to all plasmid types (including linear plasmids), the major benefit of this approach is the identification of the natural host species of a particular plasmid. Conversely, the reliance on cultivation of host bacteria, as well as the reliance on plasmid encoded traits and their expression in surrogate host species (selectable markers and plasmid replication machinery), severely restricts the utility of this approach for access to the plasmidome. However, the general strategy of direct isolation of plasmid DNA from host cells can also be applied to the total community without prior cultivation, and when combined with high-throughput sequencing or other culture-independent approaches, this direct extraction method forms the basis for many “metagenomic” strategies for plasmidome analysis (discussed in detail below).
Exogenous isolation: Exogenous isolation approaches were the first to address some of the limitations inherent in endogenous approaches for community-level analysis of plasmidomes (Bale et al. 1988; Hill et al. 1992). Exogenous methods rely on the natural ability of plasmids to initiate or participate in cell-cell transfer between distinct host species. This strategy accesses plasmids using a selectable surrogate host species (most typically E. coli) in biparental or tri-parental matings with the donor population, during which plasmids may be transferred from donor cells in the community to the selectable recipient (Fig. 1; Bale et al. 1988; Hill et al. 1992; reviewed in Ogilvie et al. 2012; Smalla and Sobecky 2002; Heuer and Smalla 2012). Essentially this system utilizes the surrogate host as a “fishing net,” to pick up plasmids circulating within the donor community under study, and plasmid-carrying recipient cells are subsequently identified by cultivation on media selective for the recipient organism (often rifampicin resistance), as well as plasmid encoded traits. Biparental matings, involving only the donor community and selectable recipient, can be used to retrieve self-transmissible plasmids capable of initiating autonomous conjugal transfer processes (Fig. 1). Alternatively, donor cells carrying a “helper” plasmid may also be introduced along with the selectable recipient, in a tri-parental mating approach (Hill et al. 1992). In this case, the “helper” plasmid sets up plasmid conjugation apparatus, which can subsequently be exploited by plasmids that may be mobilized between cells, but are not capable of independent transfer (Fig. 1). In particular, the retrieval of self-transmissible elements may be seen as a strength of the exogenous isolation approach, since these elements are likely to be the most informative and important in understanding MGE-mediated prokaryotic gene flow both within and between microbiomes.
Although this method offers a number of significant advantages over endogenous approaches, the capture of plasmids is still reliant on plasmid encoded traits, including the presence of selectable markers, as well as the ability of plasmids to successfully replicate in the surrogate host species used (Ogilvie et al. 2012; Smalla and Sobecky 2002; Heuer and Smalla 2012). Plasmids lacking in traits selected for, or unable to replicate successfully in surrogate hosts, will not be captured using these approaches. In addition, the cell-cell transfer of plasmids is influenced by numerous environmental variables, as well as the physiological status of donor and recipient cells, with metabolically inactive community members unlikely to participate in conjugal transfer processes. These factors also impact on plasmid transfer rates, the types of plasmid that can be acquired and the portion of the plasmidome that may be accessed (Ogilvie et al. 2012; Smalla and Sobecky 2002; Heuer and Smalla 2012). Collectively, these factors restrict the range of plasmids that may be captured and limit the utility of this approach for studying microbial plasmidomes.
Plasmid Capture from Metagenomes, Fig. 1 Overview of exogenous isolation approaches for the acquisition of plasmids from microbial communities. Arrows indicate plasmid transfer between donor (mixed microbial community), recipient, and “helper” populations. Purple arrows indicate plasmid transfer in biparental matings in which selectable recipient cells are used to acquire self-transmissible plasmids directly from the donor population. Green arrows indicate transfer events in tri-parental matings, in which cells harboring a self-transmissible “helper” plasmid are utilized to initiate conjugal transfer events with the donor population and the selectable recipient, in order to acquire mobilizable but non-self-transmissible plasmids
Direct plasmid detection by PCR: A range of PCR primers have been developed in order to distinguish between plasmids of different families based on backbone sequences, but these have also been employed as surveying tools to identify the presence of particular plasmid types in total community DNA extracts (Götz et al. 1996; Smalla et al. 2000b). While this approach is potentially useful in gaining an overview of the types of plasmids comprising a particular plasmidome and their relative abundance (if utilized with a quantitative PCR strategy), its usefulness is currently limited by the relatively small number of plasmid genomes available from which discriminatory primer sets may be established. This limits the range of plasmids encompassed in such surveys to those families already isolated and characterized. A further disadvantage is that along with a lack of data on host range, no information on functional content of plasmids is offered by this method, and there is little or no scope to characterize detected plasmids in greater detail. As such, this approach does not at present constitute a viable strategy for in depth and comprehensive analysis of entire plasmidomes, but may be used to augment other strategies and provide further information on isolated plasmids. Despite the present limitations, the usefulness of this approach is likely to grow as more sequence information and associated data is generated, and greater numbers of habitat associated reference data sets become available in the future.
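Where family-specific primer sets are combined with quantitative PCR, the relative abundance of a plasmid family can be expressed against a reference amplicon such as the 16S rRNA gene. The following Python sketch illustrates a generic efficiency-corrected calculation from threshold cycle (Ct) values; it is not taken from the studies cited here. The function names, the family labels, and all Ct values are hypothetical, the efficiency parameter is assumed to be the fold increase per cycle (2.0 for perfect doubling), and the calculation ignores differences in amplicon length and in 16S rRNA gene copy number per genome.

```python
def relative_abundance(ct_target, ct_reference, eff_target=2.0, eff_reference=2.0):
    """Ratio of target to reference template in the same DNA extract.

    Assumes the starting amount of each template is proportional to E ** (-Ct),
    where E is the per-cycle amplification factor (hypothetical default: 2.0).
    """
    if eff_target <= 1.0 or eff_reference <= 1.0:
        raise ValueError("amplification factors must be greater than 1")
    return (eff_reference ** ct_reference) / (eff_target ** ct_target)


def plasmid_profile(ct_by_family, ct_16s, efficiencies=None):
    """Relative abundance of each plasmid family against the 16S rRNA gene.

    ct_by_family maps a family label to its Ct value; efficiencies optionally
    maps a family label to its amplification factor (default 2.0).
    """
    efficiencies = efficiencies or {}
    return {
        family: relative_abundance(ct, ct_16s, eff_target=efficiencies.get(family, 2.0))
        for family, ct in ct_by_family.items()
    }


if __name__ == "__main__":
    # Hypothetical Ct values for two plasmid families and the 16S rRNA gene.
    cts = {"family_A": 27.0, "family_B": 30.0}
    for family, ratio in plasmid_profile(cts, ct_16s=18.0).items():
        print(f"{family}: {ratio:.2e} copies per 16S rRNA gene copy")
```

In practice, amplification factors would be estimated from standard curves for each primer pair, and such ratios remain restricted to plasmid families for which primers exist.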
Plasmid Capture from Metagenomes, Fig. 2 Overview of culture-independent metagenomic approaches for microbial plasmidome analysis. Acquisition of plasmid DNA: Plasmid DNA may be harvested and processed in a number of ways before use in strategies to capture plasmids or access the plasmidome. Plasmid DNA may be acquired from either total metagenomic DNA extracts of the microbial community or specific plasmid extraction methods. Recovered pools of plasmids may subsequently be processed to remove contaminating chromosomal sequences and, if necessary for certain plasmidomes, to amplify the recovered plasmid DNA. Plasmid capture and plasmidome access: Recovered plasmidome extracts may then be used in conjunction with one or more culture-independent approaches for plasmid capture or general access to the plasmidome. Available culture-independent approaches include the generation of standard metagenomic libraries, the use of the TRACA plasmid capture approach, or direct shotgun sequencing of amplified plasmids
Transposon aided capture (TRACA): The culture-independent transposon-aided capture system (TRACA) has been specifically designed for the acquisition of plasmids from whole communities and to overcome some of the main limitations of endogenous and exogenous approaches (Jones and Marchesi 2007b). The basic premise of this system is to retrofit all plasmids with a suitable selectable marker and an origin of replication compatible with the surrogate host biomachinery, using an in vitro transposon (Tn) system encoding this information (Fig. 2).
Following Tn integration, plasmids are subsequently transformed into a surrogate bacterial host and cells carrying plasmids selected for based on antibiotic resistance genes harbored by the inserted Tn. In this way, plasmids may be acquired independently of the traits they encode, and their replication in the surrogate host is facilitated (Jones and Marchesi 2007b). This provides access to plasmids in a bacterial community
regardless of functions encoded and has been successfully applied to study plasmids in a number of environments, including the human gut, the oral cavity, and activated sludge (Jones and Marchesi 2007b; Warburton et al. 2011; Zhang et al. 2011).
Although the TRACA system offers major advantages over other approaches, this method does not circumvent all issues and may be subject to a unique limitation in regard to the size of plasmids that can be captured when using this approach (reviewed in Jones and Marchesi 2007a; Ogilvie et al. 2012). Plasmids isolated by this system to date have all been in the smaller size range (~14 kb and smaller), indicating the TRACA system may be biased towards the capture of small plasmids or even unable to acquire larger plasmids altogether. The reasons behind this potential size restriction are presently unclear, although the transformation step in which Tn-tagged plasmids are introduced into surrogate host cells is known to work more efficiently with smaller DNA molecules, and there is also potential for a size bias to be introduced during the purification of plasmid DNA (Jones and Marchesi 2007b).
It is also possible that the size range of plasmids captured by this system will be a function of the plasmidome composition and the predominance of smaller plasmids in the ecosystems that have been explored with this method to date (Ogilvie et al. 2012). Although there is presently no definitive data available on the average plasmid size in any given microbial ecosystem, initial evidence suggests that physical features of plasmids, such as size, are responsive to pervading environmental and ecological conditions in the same way as host chromosomes (Slater et al. 2008). Overall, it is most probable that both the composition of the plasmidome and inherent attributes of the TRACA system dictate the profile of plasmids captured by this approach.
Regardless of these potential limitations, the TRACA method provides an additional and useful tool for the exploration of bacterial plasmidomes, overcoming some of the major disadvantages of other methods. There is also much scope to improve the existing TRACA approach and expand the range of plasmids that may be acquired with this system.
Retrieval of plasmids from standard metagenomic libraries: Access to plasmid sequences contained in standard metagenomic libraries derived from total community DNA has also been described (Fig. 2; Kazimierczak et al. 2009). In particular, the isolation of plasmids or plasmid fragments from such libraries of the organic pig gut microbiome has been demonstrated and included those with the ability to replicate autonomously when liberated from the library vector and reconstructed by self-ligation (Kazimierczak et al. 2009). Despite the novelty of this approach, this strategy suffers from the same drawbacks as endogenous and exogenous methods in its reliance on plasmid encoded traits for initial plasmid identification and subsequent demonstration of autonomous replication in surrogate host species (Kazimierczak et al. 2009). Furthermore, this approach is not at present designed to specifically retrieve plasmids, but rather total community DNA which is dominated by chromosomal sequences. As such this approach is not presently suitable for the specific analysis of microbial plasmidomes, and in the original study by Kazimierczak et al. (2009), libraries were analyzed for clones encoding antibiotic resistance genes, rather than plasmid sequences per se. However, there is clearly scope to utilize this method to further explore existing metagenomic data sets and enhance the interpretation of these valuable resources by illuminating mobile genetic elements captured in these repositories.
Shotgun sequencing of plasmidomes: More recently the first true applications of the metagenomic approach to study plasmidomes have been described (Fig. 2; Zhang et al. 2011; Kav et al. 2012). In these studies, plasmid DNA was extracted from the target community without any prior enrichment or cultivation, subjected to high-throughput sequencing, and fragments of plasmid genomes subsequently assembled from the resulting reads (Zhang et al. 2011; Kav et al. 2012). This permitted a global survey of plasmid-encoded functions present in the bovine plasmidome (Kav et al. 2012), as well as an
activated sludge microbial community (Zhang et al. 2011), demonstrating proof of principle for the shotgun sequencing approach to plasmidome analysis.
Although this approach should in theory be able to offer total and unbiased access to the entire plasmidome of a given microbial community, in practice limitations and potential biases remain. For example, in the study by Kav et al. (2012), sufficient plasmid DNA for sequencing was only obtained after amplification of the recovered plasmid DNA by rolling circle amplification. As such there is potential for some plasmids to be preferentially amplified over others, introducing bias into the resulting data set. In addition, the complete removal of contaminating chromosomal sequences is also challenging, and despite the availability of “plasmid safe” DNases which do not act on circular molecules, total elimination of chromosomal DNA from plasmid extracts appears to constitute a bottleneck in this strategy (Zhang et al. 2011; Kav et al. 2012), with linear plasmids also likely to be removed during this process. As such there is further potential to alter the composition of the plasmid pool obtained during this stage of plasmid DNA preparation.
There is also potential for errors in assembly due to the mosaic nature of these elements, a situation that may be exacerbated by the presence of any contaminating chromosomal sequences. In this regard, the availability of reference plasmid genomes captured by methods which acquire whole, intact plasmids (such as exogenous isolation and TRACA) will constitute a highly valuable resource that will significantly enhance the power and accuracy of the shotgun plasmidome approach (Fig. 2), and some researchers have already begun to combine these strategies (Zhang et al. 2011). Finally, extensive sequencing will likely be required for most plasmidomes, in order to move beyond representation of numerically dominant plasmids (particularly for assembly of complete replicons) and provide the depth of coverage required to access the full diversity of a given plasmidome.
Despite these potential issues, it is clear that the shotgun sequencing approach to plasmidome analysis constitutes a major advance in accessing plasmids resident in microbial communities, in terms of both depth of coverage and the cross section of plasmids that may be covered. Further development of such approaches, in parallel with the development of more detailed and extensive reference data sets from plasmids captured through TRACA or exogenous approaches, for the first time places the comprehensive analysis of a microbial plasmidome within reach.

Retrieval of Host Range Data Following Plasmid Capture from Metagenomes
A major drawback of all culture-independent community-level approaches for investigation of microbial plasmidomes, and capture of plasmids from metagenomic data sets, is the loss of host range data inherent in these strategies (Table 1). All such strategies effectively divorce acquired plasmids or plasmid sequences from any phylogenetic affiliation, undermining a primary motivation for undertaking many such surveys: a fundamental understanding of gene flow in these communities. Despite this, several approaches may be used to supplement the initial culture-independent plasmid capture strategy and provide some indication of plasmid phylogenetic affiliation and long-term host range.
Plasmids captured through culture-independent approaches may subsequently be utilized to develop fluorescent probes suitable for use in fluorescence-activated cell sorting (FACS) applications (reviewed in Ogilvie et al. 2012). The development and use of such probes in FACS systems permits intact cells harboring target genes or sequences to be separated from the rest of the microbial community and subsequently identified through culture-independent molecular genetic approaches, such as 16S rDNA sequence analysis (Zwirglmaier et al. 2004). This strategy, termed Ring-FISH (recognition of individual genes by fluorescence in situ hybridization), has previously been implemented and demonstrated as a feasible approach for the recovery of cells encoding genes of interest, including those encoded by plasmids.
Alternatively a range of in silico approaches have been applied to plasmid host affiliation (reviewed in Ogilvie et al. 2012). Plasmid sequences may be compared directly to curated sequence databases where phylogenetic information on plasmid genomes and other genes is available. The homology of plasmid sequences to database entries may then be used to infer phylogeny of captured plasmids (Jones and Marchesi 2007b; Jones et al. 2010; Kav et al. 2012; Zhang et al. 2011). However, the mosaic nature of plasmids and the potential for a single element to be composed of genetic material with highly diverse origins, coupled with inherent biases in public databases due to the paucity of available plasmid genomes, undermine the accuracy of this approach, particularly when applied to fragmentary data sets such as metagenomic libraries and shotgun plasmidomes.
Alternatively, strategies based on correlation of nucleotide usage patterns in plasmids with bacterial chromosomes have also been described (Campbell et al. 1999; Suzuki et al. 2010). These are based on the premise that over time, plasmids and other MGE that are long-term residents of a given host species adapt to their host at the nucleotide level and acquire a corresponding “genomic signature” in terms of nucleotide usage profiles (Campbell et al. 1999; Suzuki et al. 2010). As this underlying genomic signature has been shown to permit discrimination between chromosomal sequences of different bacterial species, there is also scope to employ plasmid nucleotide usage patterns to retrieve host range information. Dinucleotide and trinucleotide usage patterns, based on the abundance of all possible two-nucleotide or three-nucleotide combinations in a given DNA sequence, have been used in this way and shown to provide insight into plasmid host range, at least in terms of potential long-term bacterial host species to which plasmids are well adapted (Campbell et al. 1999; Suzuki et al. 2010). There is much scope to incorporate such analyses into culture-independent surveys of bacterial plasmidomes, as downstream processing steps that may provide some of the phylogenetic inference lacking in metagenomic approaches.

Summary
There is now much evidence to support the concept of distinct, community-associated plasmidomes and wider mobile metagenomes (reviewed in Jones 2010; Ogilvie et al. 2012). However, the mobile and promiscuous nature of many MGE (including many plasmids) makes this a much less clearly defined genetic reservoir, and membership of a particular mobile metagenome will be far less exclusive than for the core chromosomal complement of the associated microbiome (Jones 2010). A greater understanding of the composition and functional capacities of these mobile metagenomes, and key MGE such as plasmids, will be important for understanding and ultimately manipulating many important microbial ecosystems, as well as providing fundamental insight into the mechanisms of gene flow within and between distinct microbiomes. Although no available method for accessing microbial plasmidomes represents a panacea for the study of these dynamic gene pools, the application of tools currently available, particularly when used in combination, holds much potential for greatly expanding our knowledge of plasmid diversity, abundance, and functionality within microbial mobile metagenomes.

References
Bale MJ, Day MJ, Fry JC. Novel method for studying plasmid transfer in undisturbed river epilithon. Appl Environ Microbiol. 1988;54(11):2756–8.
Campbell A, Mrazek J, Karlin S. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc Natl Acad Sci U S A. 1999;96:9184–9.
Espinosa M, Cohen S, Couturier M, et al. Plasmid replication and copy number control. In: CM Thomas (ed) The horizontal gene pool, bacterial plasmids and gene spread. Amsterdam: Harwood Academic Publishers; 2000. p. 207–48.
Götz A, Pukall R, Smit E. Detection and characterization of broad-host-range plasmids in environmental bacteria by PCR. Appl Environ Microbiol. 1996;63:1980–6.
Heuer H, Smalla K. Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS Microbiol Rev. 2012. doi:10.1111/j.1574-6976.2012.00337.x.
Hill K, Weightman AJ, Fry JC. Isolation and screening of plasmids from the epilithon which mobilise recombinant plasmid pD10. Appl Environ Microbiol. 1992;58:1292–300.
Jones BV. The human gut mobile metagenome: a metazoan perspective. Gut Microbes. 2010;1(6):417–33.
Jones BV, Marchesi JR. Accessing the mobile metagenome of the human gut microbiota. Mol Biosyst. 2007a;3:749–58.
Jones BV, Marchesi JR. Transposon-aided capture (TRACA) of plasmids resident in the human gut mobile metagenome. Nat Methods. 2007b;4:55–61.
Jones BV, Sun F, Marchesi JR. Comparative metagenomic analysis of plasmid encoded functions in the human gut microbiome. BMC Genomics. 2010;11:46.
Kav AB, Sasson G, Jami E, et al. Insights into the bovine rumen plasmidome. Proc Natl Acad Sci U S A. 2012;109:5452–7.
Kazimierczak KA, Scott KP, Kelly D, Aminov RI. Tetracycline resistome of the organic pig gut. Appl Environ Microbiol. 2009;75:1717–22.
Ley RE, Peterson DA, Gordon JI. Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell. 2006;124:837–48.
Lozupone CA, Hamady M, Cantral BL, et al. The convergence of carbohydrate active gene repertoires in human gut microbes. Proc Natl Acad Sci U S A. 2008;105:15076–81.
Novick RP. Plasmid incompatibility. Microbiol Rev. 1987;51:381–95.
Ochman H, Lawrence JG, Groisman EA. Lateral gene transfer and the nature of bacterial innovation. Nature. 2000;405:299–304.
Ogilvie LA, Firouzmand S, Jones BV. Evolutionary, ecological and biotechnological perspectives on plasmids resident in the human gut mobile metagenome. Bioeng Bugs. 2012;3(1):1–19.
Reyes A, Haynes M, Hanson N, et al. Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature. 2010;466:334–8.
Schlüter A, Szczepanowski R, Pühler A, et al. Genomics of IncP-1 antibiotic resistance plasmids isolated from wastewater treatment plants provides evidence for a widely accessible drug resistance pool. FEMS Microbiol Rev. 2007;31:449–77.
Slater FR, Bailey MJ, Tett AJ, Turner SL. Progress towards understanding the fate of plasmids in bacterial communities. FEMS Microbiol Ecol. 2008;66:3–13.
Smalla K, Sobecky PA. The prevalence and diversity of mobile genetic elements in bacterial communities of different environmental habitats: insights gained from different methodological approaches. FEMS Microbiol Ecol. 2002;42:165–75.
Smalla K, Osburne AM, Wellington EMH. Isolation and characterisation of plasmids from bacteria. In: CM Thomas (ed) The horizontal gene pool, bacterial plasmids and gene spread. Amsterdam: Harwood Academic Publishers; 2000a. p. 207–48.
Smalla K, Krögerrecklenfort E, Heuer H, et al. PCR-based detection of mobile genetic elements in total community DNA. Microbiology. 2000b;146:1256–7.
Smillie CD, Smith MB, Friedman J, et al. Ecology drives a global network of gene exchange connecting the human microbiome. Nature. 2011. doi:10.1038/nature10571.
Strom SL. Microbial ecology of ocean biogeochemistry: a community perspective. Science. 2008;320:1043–5.
Suzuki H, Yano H, Brown CJ, Top EM. Predicting plasmid promiscuity based on genomic signature. J Bacteriol. 2010;192(22):6045–55.
Warburton P, Allan E, Hunter S, et al. Isolation of bacterial extra-chromosomal DNA from human dental plaque associated with periodontal disease, using transposon-aided capture (TRACA). FEMS Microbiol Ecol. 2011;78:349–54.
Zhang T, Zhang X-X, Ye L. Plasmid metagenome reveals high levels of antibiotic resistance genes and mobile genetic elements in activated sludge. PLoS ONE. 2011;6:e26041.
Zwirglmaier K, Ludwig W, Schleifer KH. Recognition of individual genes in a single bacterial cell by fluorescence in situ hybridization – RING-FISH. Mol Microbiol. 2004;51(1):89–96.

Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies

Martin Wu
Department of Biology, University of Virginia, Charlottesville, VA, USA

Synonyms
Automated Phylogenomic Inference Application (AMPHORA)

Introduction
The small subunit ribosomal RNA (SSU rRNA or 16S rRNA) has been widely used in microbial systematics and diversity studies. The 16S rRNA gene is appealing as a marker gene for numerous reasons. First of all, it is distributed in every single
cellular organism. Secondly, because regions of sequenced directly from environments without
16S rRNA sequence are highly conserved, 16S prior isolation, culturing, and PCR amplification.
rRNA gene can be PCR amplified from a wide Metagenomics therefore overcomes a major hur-
diversity of taxa using “universal” primers and dle for using protein genes for microbial diversity
sequenced, bypassing the need to isolate and cul- studies in that it makes the sequences of protein
ture the organisms in question. Consequently, genes readily accessible. Because metagenomic
millions of 16S rRNA reference sequences are sequencing is random in nature, microbial com-
available for microbial classification and identi- position estimated based on metagenomic
fication (Cole 2009). sequencing is less biased than the 16S rRNA
Although 16S rRNA has been the “gold stan- PCR-based survey. When using single-copy
dard” in microbial diversity studies, it has several protein-coding genes for relative species abun-
shortcomings. First, because 16S rRNA only dance estimation, it further eliminates the bias
makes up a tiny fraction of a genome (~0.1 %), associated with the copy-number variations of
its application as a marker gene in classifying the 16S rRNA gene.
metagenomic sequences is seriously limited. Sec- The rapid growth of genomic data also pre-
ondly, the widely recognized bias in 16S rRNA sents challenges for using protein-coding genes
PCR skews the estimation of the relative abun- in microbial diversity studies. In order to answer
dance of species in a population (Acinas the question of “who is there” in metagenomic
et al. 2005). Thirdly, the 16S rRNA gene copy studies, there is a pressing need for developing
number varies substantially from species to spe- an automated high-throughput, high-quality
cies, further complicates the effort to accurately application for metagenomic phylotyping. Sev-
estimate microbial composition (Kembel eral factors should be considered for such an
et al. 2012). To circumvent these problems, application. First, because genes can be
protein-coding genes such as as rpoB, pyrG, exchanged in bacteria and archaea, it is impera-
recA, and HSP70 have been used as alternative tive to only use genes that are recalcitrant to
phylogenetic markers to complement rRNA- lateral gene transfer for phylotyping. Secondly,
based analyses (Ludwig and Klenk 2000; Santos for accurate estimation of the microbial compo-
and Ochman 2004). Because protein genes are sition, only single-copy protein genes should be
conserved at the amino acid level and not at the used as the marker genes. Thirdly, tree-based
nucleotide level, they evolve faster and thus have phylotyping involves multiple steps including P
more power at resolving the relationships of marker identification, sequence alignment, tree
closely related species than the 16S rRNA gene. reconstruction, and taxonomy assignment. For
Unfortunately for the same reason, it is extremely large-scale phylogenetic analysis, several techni-
difficult to design “universal primers” that can be cal hurdles need to be overcome to make
used to PCR amplify protein-coding genes from high-quality sequence alignments prior to the
distantly related species (Santos and Ochman phylogenetic inference.
2004). As a result, protein-coding genes have
seen very limited use in broad-spectrum
microbial surveys. Description
Recent explosive growth in genomic
sequences has changed the landscape. Thousands AMPHORA is an automated phylogenomic
of complete bacterial genomes are available and inference application (Wu and Eisen 2008; Wu
many more are on the way of being sequenced and Scott 2012). It offers speed, reliability, and
(Pagani et al. 2012). With each genome sequence high-quality analyses using protein-coding genes
come along thousands of protein-coding genes, as alternative marker genes for microbial diver-
vastly expanding the amount of data available for sity studies. The main components of the
protein marker genes. In metagenomic studies, AMPHORA are illustrated in Fig. 1 and are
genomes of a mixed microbial population are described in detail below.
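The copy-number bias discussed in the introduction can be made concrete with a small numerical sketch. The community below is entirely hypothetical (taxon names, cell counts, and rRNA copy numbers are invented for illustration), and the sketch assumes ideal random sampling with no PCR bias. It only shows why counting a single-copy marker recovers the cell proportions, while counting 16S rRNA genes inflates taxa that carry many rRNA operons.

```python
# Hypothetical community: cell counts and 16S rRNA gene copies per genome.
# All numbers are invented for illustration only.
community = {
    "taxon_A": {"cells": 500, "rrna_copies": 1},
    "taxon_B": {"cells": 300, "rrna_copies": 4},
    "taxon_C": {"cells": 200, "rrna_copies": 10},
}


def relative_abundance(counts):
    """Normalize a dict of counts so that the values sum to 1."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


# True organismal abundance: proportion of cells per taxon.
true_abundance = relative_abundance(
    {name: taxon["cells"] for name, taxon in community.items()}
)

# Abundance inferred from 16S rRNA gene counts (cells x copies per genome).
rrna_abundance = relative_abundance(
    {name: taxon["cells"] * taxon["rrna_copies"] for name, taxon in community.items()}
)

# Abundance inferred from a single-copy protein marker (one gene per cell).
marker_abundance = relative_abundance(
    {name: taxon["cells"] * 1 for name, taxon in community.items()}
)

for name in community:
    print(name, round(true_abundance[name], 3),
          round(rrna_abundance[name], 3), round(marker_abundance[name], 3))
```

In this toy case the single-copy marker reproduces the true proportions exactly, whereas the 16S-based estimate overstates the taxon with ten rRNA copies. Real data add sampling noise and gene-length effects that this sketch ignores.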
Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies, Fig. 1 A flowchart illustrating the major components of AMPHORA
Protein Phylogenetic Marker Database
AMPHORA relies on a core phylogenetic marker database to identify a set of protein marker genes from the input sequences. The phylogenetic marker database contained 31 bacterial markers initially (Wu and Eisen 2008) and was recently expanded to include 104 genes from the archaeal domain (Wu and Scott 2012). To limit potential complications from paralogy and lateral gene transfers, only single-copy genes that are “universally” distributed in bacteria or archaea were selected. As expected, most of the marker genes are housekeeping genes involved in DNA replication, transcription, translation, or central metabolism, which are thought to be less prone to lateral gene transfers (Jain 1999; Sorek et al. 2007). The use of single-copy genes provides the additional benefit of reducing the bias in the relative species abundance estimation.
AMPHORA uses the HMMER3 package to search for marker genes in the input sequences. Profile Hidden Markov Model (HMM)-based sequence similarity search is as fast as BLAST but is more sensitive (Eddy 2011). AMPHORA can take either protein or DNA sequences as input, which means that users can use AMPHORA to phylotype metagenomic reads directly without having to first annotate the DNA sequences. When DNA sequences are used, AMPHORA will first identify the open reading frames (ORFs) and then search the translated peptide sequences for marker genes.

High-Quality and Highly Reproducible Sequence Alignments
Molecular phylogenetic analysis assumes common ancestry, or homology, for every single column of a multiple sequence alignment. However,
this assumption is often violated when distantly related sequences are aligned. Low-quality alignment regions are noisy and can obscure the true phylogenetic signal contained elsewhere in the alignment. It has been shown that alignment quality can have greater impact on the accuracy of the tree than does the tree-building method employed (Lake 1991; Morrison and Ellis 1997; Hwang et al. 1998; Cammarano et al. 1999; Landan and Graur 2007). Therefore, preparing high-quality sequence alignments is the most critical part of the tree-based phylotyping process. Quality of the sequence alignment at each column can be assessed (a step known as masking), and low-quality regions of the alignment can be deleted or down weighted (a step known as filtering) prior to making a tree. Masking and filtering improve the accuracy of phylogenetic analysis (Grundy and Naylor 1999; Castresana 2000; Loytynoja and Goldman 2008; Wu et al. 2012).
One great advantage of using AMPHORA is that it provides automated high-quality alignment masking and filtering. This is achieved by taking advantage of a unique feature of profile HMM-based multiple sequence alignments. When using an HMM to align sequences, new sequences can be mapped to the “seed” sequence alignment that is used to build the HMM, column by column. If the columns in the “seed” alignment have precomputed quality scores, they can then be transferred to the new alignment, thereby providing automated masking and filtering. Quality scores have been assigned to the “seed” alignments of AMPHORA’s marker genes using a probability-based alignment masking program named Zorro (Wu et al. 2012). Incorporating Zorro makes it practical to quickly expand the phylogenetic marker database to include hundreds of marker genes. It also makes it much easier for users to add markers of their own choice and to build their personalized phylogenetic marker database to use with AMPHORA.

Tree-Based Phylotyping
By comparing to the reference sequences, metagenomic sequences can be classified and assigned taxonomy. There are two approaches to phylotyping. Similarity-based phylotyping such as MEGAN works by BLAST searching the metagenomic sequence against a reference database such as the NCBI nonredundant amino acid database and then assigning the common taxonomy of the top hits to the sequence (Huson et al. 2007). Similarity-based phylotyping is extremely fast. However, it requires the user to select an arbitrary cutoff to define the top hits. Since different microbial species and protein families evolve at different rates, there is no single universal cutoff that is applicable in all situations. Also because of the evolutionary rate variation, top hits are not guaranteed to be the closest relatives of the query sequence (Koski and Golding 2001). Therefore, taxonomy assigned using the top hits can be misleading, especially when no close relatives are available in the database.
Tree-based phylotyping works by placing the metagenomic sequences into a phylogeny of the reference sequences. The metagenomic sequence is assigned the taxonomy of its sister clade, the closest relative according to the phylogeny. Since evolutionary methods can account for the evolutionary rate variations, tree-based phylotyping is more robust than similarity-based phylotyping. In addition, there is no need to choose an arbitrary cutoff in tree-based phylotyping. It has been shown that tree-based phylotyping outperformed similarity-based phylotyping methods (Wu and Eisen 2008).
Insertion of the sequences into the reference tree has been one of the rate-limiting steps in tree-based phylotyping. However, new placement algorithms make it possible to insert thousands of sequences into a reference tree simultaneously, therefore dramatically speeding up the process (Matsen et al. 2010; Berger et al. 2011). AMPHORA takes advantage of RAxML’s evolutionary placement algorithm and can perform either parsimony or likelihood tree-based phylotyping. It places sequences into the NCBI’s taxonomic hierarchy and assigns a confidence score at each rank of the taxonomic classification.
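The dependence of similarity-based phylotyping on an arbitrary cutoff can be illustrated with a minimal sketch. It is not MEGAN's implementation: the hit list, bit scores, lineages, and the `top_fraction` parameter are hypothetical, and the rule used here (keep hits scoring within a fraction of the best hit, then report the lowest common ancestor of their lineages) is only one simple way to assign "the common taxonomy of the top hits".

```python
def lowest_common_ancestor(lineages):
    """Return the longest rank prefix shared by all lineages (root first)."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def assign_by_top_hits(hits, top_fraction):
    """Assign taxonomy from hits scoring within top_fraction of the best hit.

    hits is a list of (bit_score, lineage) tuples; top_fraction is the
    user-chosen (arbitrary) cutoff that defines which hits count as "top".
    """
    best = max(score for score, _ in hits)
    kept = [lineage for score, lineage in hits
            if score >= best * (1.0 - top_fraction)]
    return lowest_common_ancestor(kept)


# Hypothetical BLAST hits for one metagenomic read.
hits = [
    (200.0, ["Bacteria", "Proteobacteria", "Alphaproteobacteria"]),
    (190.0, ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]),
    (150.0, ["Bacteria", "Firmicutes", "Bacilli"]),
]

strict = assign_by_top_hits(hits, 0.01)  # keeps only the best hit
medium = assign_by_top_hits(hits, 0.10)  # keeps the two best hits
loose = assign_by_top_hits(hits, 0.30)   # keeps all three hits

print(strict)
print(medium)
print(loose)
```

The same read is classified to class, phylum, or only domain level depending on the cutoff, which is the sensitivity that tree-based placement avoids by reading the assignment off the position in the reference phylogeny.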
Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies, Fig. 2 Bacterial composition of
the GOS dataset analyzed using AMPHORA
AMPHORA Analysis of the Global Ocean Survey Dataset
AMPHORA was used to phylotype the environmental shotgun sequencing reads of the Global Ocean Survey (GOS) (Rusch et al. 2007). From the 41 million predicted peptides, 213,583 peptides were identified that corresponded to the 31 bacterial and 104 archaeal marker genes. Using the number of reads per marker, it was estimated that 95.4 % of the reads in the GOS dataset belonged to bacteria while 4.6 % of the reads were from archaea, indicating that the ocean surface water is dominated by bacteria. The relative abundance of major bacterial groups is shown in Fig. 2. Alphaproteobacteria is the most abundant group overall, making up 47.8 % of the bacterial population. This is mainly due to a single clade of Pelagibacter ubique that constituted 35.8 % of the bacterial population sampled in GOS.
Because all the marker genes in AMPHORA are single-copy genes, the relative abundance of sequences in each marker gene can be used as an approximation for the relative organismal abundance in the population. In agreement, the relative abundance of the Pelagibacter ubique clade estimated by AMPHORA (35.8 %) is very close to previous quantitative estimations by fluorescence in situ hybridization showing that, on average, cells of the clade account for one-third of the ocean surface bacterioplankton communities (Morris et al. 2002). Also as expected, the bacterial diversity profiles are remarkably consistent between the different marker genes (Fig. 2).

Summary

Metagenomics has the potential to transform the way we study microbial diversity. To fully realize this potential, it is important to develop a set of well-curated protein-coding genes as alternative
marker genes. AMPHORA builds on a set of universally conserved, single-copy protein genes that are ideal for analyzing bacterial diversity. It facilitates the large-scale phylogenetic analysis of these marker genes and should be of broad application in the study of microbial evolution and ecology.

References

Acinas SG, Sarma-Rupavtarm R, Klepac-Ceraj V, et al. PCR-induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample. Appl Environ Microbiol. 2005;71:8966–9.
Berger SA, Krompass D, Stamatakis A. Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol. 2011;60:291–302.
Cammarano P, Creti R, Sanangelantoni AM, et al. The archaea monophyly issue: a phylogeny of translational elongation factor G(2) sequences inferred from an optimized selection of alignment positions. J Mol Evol. 1999;49:524–37.
Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17:540–52.
Cole JR, Wang Q, Cardenas E, et al. The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–5.
Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195.
Grundy WN, Naylor GJ. Phylogenetic inference from conserved sites alignments. J Exp Zool. 1999;285:128–39.
Huson DH, Auch AF, Qi J, et al. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Hwang UW, Kim W, Tautz D, et al. Molecular phylogenetics at the Felsenstein zone: approaching the Strepsiptera problem using 5.8S and 28S rDNA sequences. Mol Phylogenet Evol. 1998;9:470–80.
Jain R. Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci. 1999;96:3801–6.
Kembel SW, Wu M, Eisen JA, et al. Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput Biol. 2012;8:e1002743.
Koski LB, Golding GB. The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 2001;52:540–2.
Lake JA. The order of sequence alignment can bias the selection of tree topology. Mol Biol Evol. 1991;8:378–85.
Landan G, Graur D. Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol. 2007;24:1380–3.
Loytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–5.
Ludwig W, Klenk H-P. Overview: a phylogenetic backbone and taxonomic framework for procaryotic systematics. In: Boone DR, Castenholz RW, Garrity GM, editors. Bergey’s manual of systematic bacteriology, vol. 1. New York: Springer-Verlag; 2000. p. 49–65.
Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinforma. 2010;11.
Morris RM, Rappe MS, Connon SA, et al. SAR11 clade dominates ocean surface bacterioplankton communities. Nature. 2002;420:806–10.
Morrison DA, Ellis JT. Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa. Mol Biol Evol. 1997;14:428–41.
Pagani I, Liolios K, Jansson J, et al. The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2012;40:D571–9.
Rusch DB, Halpern AL, Sutton G, et al. The sorcerer II global ocean sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007;5:e77.
Santos SR, Ochman H. Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environ Microbiol. 2004;6:754–9.
Sorek R, Zhu Y, Creevey CJ, et al. Genome-wide experimental determination of barriers to horizontal gene transfer. Science. 2007;318:1449–52.
Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9:R151.
Wu M, Scott AJ. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics. 2012;28:1033–4.
Wu M, Chatterji S, Eisen JA. Accounting for alignment uncertainty in phylogenomics. PLoS ONE. 2012;7(1):e30288.

Proteomics and Metaproteomics

Rembert Pieper, Shih-Ting Huang and Moo-Jin Suh
J. Craig Venter Institute, Rockville, MD, USA

Synonyms

Global proteomics; Protein profiling of microbial communities; Proteomics of biological systems
Definition

Proteomics pertains to the comprehensive analysis of expressed proteins from a cell, a multicellular system, an extracellular environment, or a large set of recombinant clones. This is achieved using combinations of protein separation, identification, and/or assay techniques, such as liquid chromatography-mass spectrometry (LC-MS), two-dimensional gel electrophoresis-mass spectrometry (2DE-MS), affinity purification-mass spectrometry (AP-MS), and protein- or antibody-based microarrays. The objectives in proteomics research can be diverse; they include protein quantification on a global scale, highly parallel analysis of protein functions and interactions, structural characterization of protein complexes, unraveling trafficking of proteins and their distribution in different cellular compartments, and discovery of protein signatures for a disease state or other perturbation. Metaproteomics is a recent extension of proteomics where the biological systems under study are increased in complexity. This pertains to two or more coexisting organisms that may functionally interact with each other, with mutual benefits or to the advantage of some and detriment of other species.

Introduction

Proteomics is a relatively young scientific discipline at the interface of analytical biochemistry and molecular biology. Together with transcriptomics, the discipline emerged in part as a result of the “genomics revolution,” specifically the availability of databases derived from genome sequencing and annotation efforts that reliably predict potentially expressed proteins. Protein sequence information for all open reading frames (ORFs) is an important component of high-throughput mass spectrometry- and microarray-based proteomics. Proteins arrayed on microarray chips are usually derived from the expression of genes in recombinant systems. Expression requires sequence information to generate clones for the targeted genes. Proteins analyzed by mass spectrometry (MS), the technology that advanced proteomics the most and resulted in Nobel Prize awards in Chemistry for K. Tanaka and J. B. Fenn in 2002, can be identified from complex mixtures on a global scale using computational methods that compare experimental mass spectra to the entirety of theoretical peptide masses and sequences derived from protein sequences annotated in a searchable database. Typically, peptides rather than proteins are analyzed in MS-based proteomic experiments because their mass range (length of 5–30 amino acids) makes them more suitable for ionization and accurate mass analysis, and fragmentation of peptides in tandem MS experiments allows sequence analysis. Peptide-spectrum matches (PSMs) require mathematical algorithms for probability-based assignment of peptides to their protein(s) of origin. In addition to increasingly powerful algorithms and the exponential growth of complete genomic databases (for thousands of species and subspecies), MS techniques for ionization, for accurate measurement of the mass-to-charge ratio of peptides and proteins, and for their fragmentation have also dramatically advanced. Mass spectrometers now measure proteins with sensitivities in the attomole range, mass accuracies in the 1–3 ppm range, and a peak resolution of up to 60,000, at very high speeds and with considerable automation. Proteomes of prokaryotic and mammalian cells can now be profiled in a few days to a couple of weeks, including proteins present in less than 100 copies in a cell. For example, the proteomes of yeast and the human HeLa cell line have been exhaustively characterized (de Godoy et al. 2008; Nagaraj et al. 2011). MS-based proteomics requires high-resolution separation techniques to reduce the complexity of peptides or proteins in a sample. Two-dimensional gel electrophoresis (2DE) was used for protein separation before MS emerged as the method of choice. In the last decade, 2DE has been gradually replaced by shotgun proteomics, a strategy that takes advantage of controlled enzymatic fragmentation of proteins into peptides prior to MS analysis. Shotgun proteomics has a superior dynamic range for proteome coverage compared to 2DE and is a more
sensitive detection method and less problematic as it pertains to the exclusion of proteins difficult to solubilize and separate in gels.
Protein microarrays allow immobilization of thousands of purified proteins and their interaction analysis with other proteins or small molecule ligands (Wolf-Yadlin et al. 2009). This technique generally does not require MS since the position of proteins on the array is predefined, and a highly parallel assay on the microarray facilitates detection of an activity or interaction with a ligand or substrate. Interactions of proteins with small molecules have more recently been studied with chemical probes that establish covalent bonds to proteins and thus characterize their functions (Speers and Cravatt 2009). Here, MS is typically used for protein identification. Finally, by the use of protein interaction screens, large protein networks (“interactomes”) have been established, e.g., for the bacterium Mycoplasma pneumoniae (Kuhner et al. 2009) and for baker’s yeast (Ho et al. 2002). MS is the only proteomic technique that permits comprehensive analyses of posttranslational modifications which play a key role in the modulation of protein functions, localizations, interactions, and control of turnover rates in the cell (Olsen et al. 2010; van Noort et al. 2012). It is also the leading technology for research in metaproteomics, where the expressed proteome is derived from more than one species, often a microbial community. Community dynamics and in some cases a host species influence the protein complement expressed by each participating species. Metagenomic data or concatenated genomes of multiple species are essential input as they deliver the databases necessary for proteomic analysis of a complex biological system.

Expression Proteomics

Protein/Peptide Sample Preparation and Separation
Various areas of proteomic research were already mentioned. This overview focuses on expression proteomics. First, it is more relevant and applicable to the questions of metagenomics (i.e., the competitive and/or synergistic nature of interactions in multi-species communities). Second, functional analysis of uncharacterized proteins requires multiple methodological approaches not yet feasible on a metagenomic scale or methodologically not distinct. In expression proteomics, sample preparation is an essential component and usually needs to be adapted to a given scientific objective. Table 1 lists examples of common sample types and approaches to recover the protein mixtures prior to their analysis or that of their digestion products.

Proteomics and Metaproteomics, Table 1 Proteomics and metaproteomics sample preparation methods
Sample group | Type of sample | Protein recovery method
Multiple-organism environmental sample | Soil, ocean water, stool/gut microbiome | Enrichment for cellular materials, cell lysis
Heterogeneous tissue | Liver, bladder | Isolation of cell types prior to cell lysis or tissue disruption
Cell culture | Bacterial, fungal, or mammalian cell culture | Concentration of extracellular fraction or cell lysis
Cell compartment or fraction | Mitochondria, nuclei, exosomes, chloroplasts, bacterial periplasm | Cell compartment isolation followed by its disintegration
Cellular complex | Proteasome, bacterial secretion, and secondary metabolite biosynthesis systems | Cell complex purification and disintegration
Protein-containing secretion fluid | Blood plasma, urine, hatch fluid of larvae | Removal of lipids, carbohydrates, cellular debris
Host-pathogen system | Intracellular viruses, bacteria, fungi, infected eukaryotic cells | Separation of host and pathogen cells; cell lysis
Life cycle stages of a species | Parasitic organism with a complex life cycle | Lysis of cells or cellular compartments

Expression proteomics may focus solely on protein identifications from a given biological
sample, but quantitative assessments of a subcellular or cellular proteome are often of interest (e.g., the comparative analysis of the Escherichia coli proteome isolated from exponential versus stationary-phase cultures or of the mouse liver proteome prior to, during, and after recovery from an infection with a hepatitis virus). Sample preparation may involve steps to remove nonorganic or other matter not of interest in a study (e.g., soil or digested foods in studies of soil and gut microbiomes, respectively), but typically it starts with the isolation of tissues, cells, cell organelles, or fluids followed by extraction and/or concentration of protein mixtures. An exception to this experimental sequence is a strategy that involves protein labeling with isotopes of a living cell/organism prior to cell and protein extraction. Stable isotope labeling of amino acids in culture (SILAC) is a frequently used method where cells are cultured with defined media containing amino acids (e.g., Lys, Arg) that contain carbon and nitrogen atom isotopes which alter the total mass of the amino acid. If two types of samples are to be compared in an experiment (e.g., two cell cycle time points or mammalian cells pre- and post-viral infection), they are cultured with different Lys and/or Arg isotopes. Following combination of the two cell populations and their lysis, otherwise identical proteins (peptides) can be compared quantitatively based on their isotope mass differences. Chemical isotope-labeling methods for proteins such as iTRAQ follow the same principles (differential quantification from a multiplexed sample), but the labeling step occurs after isolation of the protein digestion products.
Traditionally, the first step of proteomic analysis has been high-resolution separation of proteins in two-dimensional gels that permitted mapping of the most abundant proteins of a complex mixture and their relative quantification (O’Farrell 1975). Proteins are visualized in 2D gels using protein-binding dyes such as Coomassie Brilliant Blue, and more sensitive fluorescent dyes that stain most proteins resolved in a gel (up to 1,000 protein spots). Sample preparation for 2D gels includes solubilization or dilution of a protein mixture in a denaturing solution (8 M urea, 2 M thiourea, detergents such as 4 % CHAPS or 1 % Nonidet-NP40, 0.1 % ampholytes, and DTT as a reducing agent). Prior to this step, the removal of salts and other macromolecules positively affects resolution and identification of proteins. Many different 2DE modification techniques have been introduced to improve spot resolution in alkaline and acidic pH ranges, in high and low Mr regions, and for lipid-associated and hydrophobic proteins (Gorg et al. 2004). Figure 1 displays a 2D gel profile of an E. coli O157:H7 cell lysate next to one from a mouse stool microbiome fraction. While more than 500 proteins were resolved in the E. coli gel, it is evident that the resolution limit of the more complex stool protein sample (hundreds of different gut microbial species and secreted human proteins) was reached. For such complex metaproteomic samples, 2D gels are not useful because few proteins are well resolved and identifiable as distinct spots. Differential quantification of proteins from 2D gels is performed with software tools that allow pixel-based spot intensity measurements, gel-to-gel spot matching and normalization, and generation of annotated spot maps that characterize the proteome under investigation. Subcellular proteomes of bacteria exposed to various environmental stress conditions have been analyzed in 2D gels (Pieper et al. 2008). Sample preparation of body fluids such as serum may include LC separations to remove highly abundant proteins and fractionate other proteins prior to 2DE (Pieper et al. 2003). Sample preparation for analysis of eukaryotic subcellular compartments often involves buoyant density-based centrifugal enrichment steps and differential display. Tagging of genes with reporter gene constructs to localize expressed proteins in subcellular compartments has also been used. An example is the comprehensive survey of the mitochondrial proteome in yeast (Prokisch et al. 2004). Methods used for the disintegration of cells, the isolation of subcellular organelles, and the subsequent protein extraction and solubilization require project- and cell type-specific optimization.
The shotgun proteomics workflow integrates a protein digestion step prior to analyte (peptide)
Proteomics and Metaproteomics, Fig. 1 Protein profiles of (a) Shiga toxin-producing E. coli (serotype O157:H7) and (b) a murine stool fraction enriched in bacteria displayed in 2D gels. Samples of ~150 µg protein were loaded onto pH 4–7, 25 cm immobilized pH gradient (IPG) strips and isoelectrically focused applying 64 kVh. Following reduction and alkylation of proteins in the IPG strips, proteins were separated according to size in second-dimension 8–18 %T SDS-PAGE gels (25 × 20 cm) for 1.8 kVh. Gels were stained with the dye Coomassie Brilliant Blue G250 (Courtesy of Christine Peterson and Prashanth Parmar for their contributions to the gel electrophoresis data depicted in the figure)
separation, identification, and quantification. Shotgun proteomics was developed 10 years ago (Wolters et al. 2001). Proteins are solubilized, denatured, and subjected to proteolysis with endoproteinases such as trypsin, LysC, and/or GluC. Using a combination of these enzymes increases the coverage of a proteome. The mixture of digested proteins typically contains more than 100,000 peptide fragments in a wide abundance range. Digested proteins are sometimes applied to an LC column (e.g., with a reversed phase or ion exchange matrix) or an immobilized pH gradient gel strip to separate peptides further and reduce their complexity in the resulting fractions. Peptides are fractionated based on certain biophysical traits such as hydrophobicity, Mr, or net charge. Peptide eluates from LC columns may be directly spotted on plates for serial analysis via matrix-assisted laser desorption ionization MS (MALDI-MS). A far more time- and cost-effective and less tedious approach to obtain comprehensive proteome coverage, however, is to apply concentrated and desalted peptide fractions to online LC tandem mass spectrometry (LC-MS/MS). The shotgun proteomics workflow is displayed in the schematic of Fig. 2.

Protein Identification
MALDI-MS and LC-MS/MS have been the standard techniques for identifying proteins from 2D gel spots, often with considerable automation in the spot excision from gels and the enzymatic digestion to generate dissolved peptide mixtures. MALDI-time of flight (TOF) MS generates peptide mass fingerprints (PMFs) that are analyzed with an MS algorithm in which the protein of origin is identified based on the count of mass-matching peptides in the experimental spectrum compared to those predicted from the in silico enzymatic digest for a given protein (Fig. 3a). Nano-electrospray ionization (ESI) is the main ionization technique for LC-MS/MS experiments. LC-MS/MS not only provides one separation dimension for peptides, but also the data for protein identification, first on the MS level via generation of an accurate mass-to-charge ratio (m/z) for a peptide and then via data-dependent selection of an MS (peptide) peak for further fragmentation via gas-phase collision-induced dissociation (CID). MS peaks in fragment spectra, typically y- and b-ions resulting from the cleavage of peptide bonds, define the peptide sequence. Computational methods are available
Proteomics and Metaproteomics, Fig. 2 Shotgun proteomics workflow. After generation of a cell lysate or protein extract, proteins are digested with an endoproteinase (e.g., trypsin). The peptide mixture is separated on a reversed phase C18 or a strong anion exchange LC column. Peptide fractions are lyophilized and applied to nano LC-MS/MS sequentially. The mass spectra combined from all fractions are collected and interpreted with an algorithm and a relevant protein sequence database. Identified peptides are assigned to proteins of origin and counted to obtain a protein quantity estimate. Abundance profiles from different samples can be displayed in form of heat maps
to assign a peptide sequence based on its original the database of theoretical PSMs, the more com-
m/z value and tandem MS data that deliver putationally challenging it is to determine the
a series of daughter m/z values for N- and best peptide match for an experimental mass
C-terminal fragment ions (Fig. 3b). The MS and spectrum. Herein lies one of the fundamental
subsequent MS/MS analyses are performed in challenges of metaproteomics: protein sequence
automated duty cycles defined by the LC-MS databases to be searched not only contain
instrument software so that tens or even hun- sequences derived from one but numerous fully
dreds of thousands of MS and subsequent annotated genomes or large metagenomic read
MS/MS scans are performed in series. The assemblies that are partially annotated. Their
aggregate of data from these scans describes content of predicted protein sequences is sub-
the proteome in the shotgun proteomics experi- stantially increased. MS platforms have recently
ment. Due to the fact that the matching of the- moved towards ultra-high pressure LC for high-
oretical MS/MS (peptide fragment) spectra and resolution peptide separations and high-
experimental spectra is performed with probabi- resolution, high-mass accuracy MS, such as the
listic models defined by a software and its inher- Orbitrap and Quadrupole-TOF instruments.
ent algorithm(s), shotgun proteomics data yield Excellent peptide separation has the benefit
a number of peptide (and protein) identifications that more peptides derived from
at a specific false discovery rate, often in con- low-abundance proteins are enriched in frac-
junction with an MS score matrix that also attri- tions and more likely to be selected during the
butes a measure of correct peptide identification. MS data-dependent duty cycle for MS/MS anal-
Algorithms integrated in such software tools are ysis. High-mass accuracy and resolution
used to score PSMs and assign the highest scor- enhance the confidence in peptide (sequence)
ing PSM to a peptide. Protein inferences are assignments via PSMs. For a detailed review
made by assigning peptides to a distinct protein of LC-MS platforms used for proteomics appli-
sequence in the database (Fig. 3c). The larger cations, see (Yates et al. 2009).
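The fragment-ion arithmetic that underlies such peptide sequence assignments can be illustrated with a minimal Python sketch. It assumes singly charged b- and y-ions, standard monoisotopic residue masses, and unmodified peptides; the example peptide SAMPLER and the function names are hypothetical and are not taken from any particular search engine.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the intact peptide and to y-ions
PROTON = 1.00728   # mass of one charge-carrying proton


def precursor_mz(peptide, charge=1):
    """m/z of the intact peptide carrying `charge` protons."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values, keyed by ion name."""
    n = len(peptide)
    ions = {}
    for i in range(1, n):
        # b-ion: the first i residues (N-terminal fragment)
        ions[f"b{i}"] = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        # y-ion: the remaining n - i residues (C-terminal fragment)
        ions[f"y{n - i}"] = (
            sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        )
    return ions


if __name__ == "__main__":
    for name, mz in sorted(fragment_ions("SAMPLER").items()):
        print(f"{name}\t{mz:.4f}")
```

A search algorithm compares a theoretical ion list of this kind with the peaks observed in an MS/MS spectrum; the scoring itself is software-specific and is not sketched here.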
Proteomics and Metaproteomics, Fig. 3 Principles of peptide mass fingerprinting (PMF) and tandem mass spectrometry (MS/MS). (a) PMF of a purified 2D gel protein spot (e.g., from MALDI-TOF/TOF analysis). (b) Snapshot of MS and subsequent MS/MS scan in a shotgun proteomics dataset. Top: MS spectrum representing a peptide mixture derived from a variety of proteins, with one peptide peak at m/z 698.81 (+2 ion charge). Bottom: the peptide at m/z 698.81 was selected for CID fragmentation in a Velos Pro ion trap instrument and the sequence EFVGGGYVTVLVR assigned based on y- and b-ion series. For m/z values of the PMF and m/z values for a peptide and its MS/MS data, assignments to a protein in the searched database are made (c)
Protein Quantification

Relative protein quantification from 2D gel profiles involves correct spot matching based on their gel coordinates and MS data for each individual spot. 2DE has low dynamic range (two orders of magnitude) and high-abundance thresholds for accurate volumetric spot density measurements, resulting in quantitative analyses limited to the top 5–20 % of the actual proteome. These limitations are a major reason why quantitative proteomics has moved towards the use of other techniques: (1) shotgun proteomics, which allows quantification of a far larger proportion of the proteome with higher dynamic range, and (2) targeted proteomics, which allows high-precision peptide quantification in absolute terms in a wide dynamic range, but usually for a small number of proteins. The latter technique is moving towards larger scale, as demonstrated recently in an effort to monitor all yeast kinases and phosphatases (Picotti et al. 2010). Targeted proteomics is often associated with the term multiple reaction monitoring (MRM) proteomics where stable isotope-labeled peptide standards are used for quantitative comparisons. MRM proteomics is dominated by triple quadrupole/ion trap hybrid MS instruments. It requires preselection of “target” peptides, often tryptic peptides that are unique to a given protein of interest. Such peptides are generated in situ via enzymatic digestion of a sample in which the protein is to be quantified. Equivalent chemically synthesized peptide standards are spiked into the sample in known concentrations to allow absolute quantification. MRM experiments continue to advance in speed and complexity (multiplexing is possible) as shown in a recent study in which 63 urinary proteins were simultaneously measured in hundreds of samples (Chen et al. 2012).
A variety of methods for global quantification of proteins are available for shotgun proteomics as recently reviewed (Mueller et al. 2008; Elliott et al. 2009). In addition to isotope label-based approaches (e.g., SILAC, iTRAQ), spectral counting and MS1 (peptide) intensity-based measurements are common. Unlike MS1 peak intensity measurements, spectral counting can be performed with low-resolution (ion trap) MS instruments. For both methods, software tools have been created allowing estimation of protein copies per cell (absolute quantification) for a large number of distinct proteins expressed in a cell. The schematic in Fig. 2 illustrates how spectral counting data are analyzed for proteome-wide quantification following assignment of peptides to different proteins of origin. Recently, more than 10,000 human proteins have been identified and quantified via shotgun proteomics using an Orbitrap mass analyzer, presumably representing the entire expressed proteome of a cancer cell line (Nagaraj et al. 2011). An additional layer of complexity is added when proteomic data are searched for specific post-translational modifications (PTMs). Among the PTMs are N-terminal truncation, phosphorylation, N-acetylation of Lys residues and N-termini, and glycosylation of various side chains, all of which can modify the protein’s cellular function, localization, and trafficking and interaction with other proteins or ligands inside or outside of the cell. Ubiquitination of Lys residues often sends a protein into a degradation pathway. Comprehensive knowledge of all PTMs and their dynamics in a specific environment or cell state has not yet been achieved. Likewise, proteomic research is just beginning to provide information on distinct protein activities not functionally annotated in databases. One of the promising technologies is activity-based proteomic profiling that allows labeling of proteins in their active sites by the use of chemical probes (Speers and Cravatt 2009). The main limitation is generating a large number of specific chemical probes for high-throughput screens. A field more adapted to high-throughput analyses is interaction proteomics, where proteins of unknown function can be associated with protein complexes of known function, thus allowing assignment of new biological roles. This field includes AP-MS which, for example, has been utilized to study 178 soluble protein complexes of the M. pneumoniae proteome (Kuhner et al. 2009).
Metaproteomics

The term metaproteomics was first introduced in 2004 by Rodriguez-Valera to describe the concept of an expressed protein complement from environmental microbial communities (Rodriguez-Valera 2004). Metagenomics has been a driving force behind metaproteomic efforts; it essentially defines the protein sequence databases to be searched, and it also provides a biological context. Major interest has also arisen from the human microbiome project. Metaproteomics should take into consideration the host environment. This would expand the definition of metaproteomics to the study of biological systems consisting of two or more species that may interact in a mutualistic manner or to the detriment of some but the benefit of other species. This definition would include symbiotic relationships (e.g., N2-fixing bacteria with legumes), host-pathogen relationships (infectious disease), and host-commensal relationships (e.g., the gut microbiome). The boundary of the latter two is fluid, as metagenomic and other studies have revealed. Metaproteomic data are of interest because they add a degree of function to the description of a complex community: microbes (and their hosts) live in the same environment, compete for the same resources, and send molecular signals to each other including quorum sensing, chemotaxis, and adhesion in response to the changing environment. The competition for resources implicates metabolism that is enabled by proteins. Likewise, inter- and intraspecies signaling implicates proteins and peptides or structures synthesized by proteins (e.g., LPS and secondary metabolites). Thus, quantitative analysis of proteins via metaproteomics promises to deliver new insights into the dynamics of complex biological communities. Studies may be highly experimental, considering the efforts to model microbial communities or a pathogen invading a macrophage in vitro. They may constitute a natural environment such as polybacterial biofilms growing in hydrothermal hot springs or on a urinary tract device (Hall-Stoodley et al. 2004).
While the first metaproteomic studies pertained to low complexity systems, they highlighted the ability to elucidate dynamic aspects of the adaptation of species to community living. Ram et al. investigated natural acid mine drainage microbial biofilms (Ram et al. 2005). More than 2,000 proteins from five different species were identified using shotgun proteomics, 48 % from Leptospirillum group II. Oxidative stress and refolding proteins were highly expressed, supporting the notion that their activities were critical in a challenging environment. Markert et al. investigated the proteome of an unculturable γ-proteobacterial endosymbiont of Riftia pachyptila, a deep-sea tube worm without a digestive system (Markert et al. 2007). The worm sustains a high growth rate using the symbiont’s capacity for chemosynthesis of carbon compounds fixing CO2 and oxidizing ambient H2S. Using 2DE-MS proteomics, three abundant major sulfide oxidation proteins critical for energy metabolism in the endosymbiont were identified. It was determined that both the reductive tricarboxylic acid and Calvin cycles were used for CO2 fixation. A more complex metaproteome, that of human distal gut microbiota, was examined by Verberkmoes et al. (2009). A particular challenge of analyzing such a complex metaproteome is the high number of species (and diverse subspecies and strains), most of which are not culturable and whose genomes remain to be sequenced. Therefore, databases for metaproteomic data searches, which are composed of only sequenced and fully annotated genomes of bacteria known to colonize the distal gut, are not truly representative of metagenomic (species) complexity. It is nonetheless useful to use such “imperfect” databases to gain insights into a human body-associated complex microbial metaproteome. Assessing quantitative estimates of protein counts representing distinct cellular function categories, it was reported that proteins linked to carbohydrate metabolism, energy generation, and ribosomal translation were most abundant in the distal gut metaproteome (Verberkmoes et al. 2009). Nearly 20 % of the mass spectra
matched protein sequences derived from Bacteroides and Bifidobacterium, confirming relatively high abundance of these genera in the human gut. Despite the application of a bacterial enrichment procedure, 30 % of the PSMs represented matches to human proteins, including a large proportion of those active in cell-cell adhesion and innate immunity. This finding supported the notion that the host immune system interacts extensively with its gut microbiome. Analysis of urinary tract metaproteomes linked to asymptomatic bacteriuria (Fouts et al. 2012) resulted in protein identifications from two to five different opportunistic pathogens and provided preliminary evidence for host-bacterial interactions, specifically a battle for iron. Human lactotransferrin, an iron sequestration protein, and iron acquisition proteins and receptors from E. coli and Klebsiella pneumoniae were identified in the same samples.
Summary and Outlook

In conclusion, proteomics is a highly advanced discipline that contributes to science at the biological systems level. Metaproteomics has clear potential to elucidate functional interactions of coexisting microbial species and, if applicable, those with their eukaryotic host environments. Major challenges to enable in-depth and accurate metaproteomic profiling efforts for highly diverse communities remain to be addressed. Only a fraction of the genomes represented in complex microbial communities have been sequenced. Comprehensive metagenomic sequence datasets are very promising resources for advanced proteomic data searches. However, such datasets can be incomplete and may have sequence inaccuracies and significant redundancy which, in turn, affect the reliability of assignments of peptides and proteins on the species level via PSMs derived from MS-based proteomic datasets. Further improvement of metagenomic assembly and computational methods will benefit the quality of metaproteomic datasets since their analysis depends on predicted protein sequence data. A particular challenge pertains to the high amino acid sequence identities among highly conserved (housekeeping) proteins of related species in a microbiome. Since protein identification in shotgun proteomics relies on peptide sequence data followed by in silico assignment to proteins, it impedes taxonomic profiling on the species level analogous to the short reads of NextGen sequencing technologies. Nonetheless, metaproteomic data already contribute effectively to the elucidation of the metabolic capacity of complex biological systems and the cross-talk of such systems with their host environments. Robust computational algorithms and workflows will have a positive impact on the future of metaproteomics. Use of multiple “omics” technologies allows insights into complex intra- and extracellular biological processes and their cross-talk and integration into a biological system.
References

Chen YT, Chen HW, et al. Multiplexed quantification of 63 proteins in human urine by multiple reaction monitoring-based mass spectrometry for discovery of potential bladder cancer biomarkers. J Proteomics. 2012;75(12):3529–45.
de Godoy LM, Olsen JV, et al. Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature. 2008;455(7217):1251–4.
Elliott MH, Smith DS, et al. Current trends in quantitative proteomics. J Mass Spectrom. 2009;44(12):1637–60.
Fouts DE, Pieper R, et al. Integrated next-generation sequencing of 16S rDNA and metaproteomics differentiate the healthy urine microbiome from asymptomatic bacteriuria in neuropathic bladder associated with spinal cord injury. J Transl Med. 2012;10(1):174.
Gorg A, Weiss W, et al. Current two-dimensional electrophoresis technology for proteomics. Proteomics. 2004;4(12):3665–85.
Hall-Stoodley L, Costerton JW, et al. Bacterial biofilms: from the natural environment to infectious diseases. Nat Rev Microbiol. 2004;2(2):95–108.
Ho Y, Gruhler A, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002;415(6868):180–3.
Kuhner S, van Noort V, et al. Proteome organization in a genome-reduced bacterium. Science. 2009;326(5957):1235–40.
Markert S, Arndt C, et al. Physiological proteomics of the uncultured endosymbiont of Riftia pachyptila. Science. 2007;315(5809):247–50.
Mueller LN, Brusniak MY, et al. An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. J Proteome Res. 2008;7(1):51–61.
Nagaraj N, Wisniewski JR, et al. Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol. 2011;7:548.
O’Farrell PH. High resolution two-dimensional electrophoresis of proteins. J Biol Chem. 1975;250(10):4007–21.
Olsen JV, Vermeulen M, et al. Quantitative phosphoproteomics reveals widespread full phosphorylation site occupancy during mitosis. Sci Signal. 2010;3(104):ra3.
Picotti P, Rinner O, et al. High-throughput generation of selected reaction-monitoring assays for proteins and proteomes. Nat Methods. 2010;7(1):43–6.
Pieper R, Gatlin CL, et al. The human serum proteome: display of nearly 3700 chromatographically separated protein spots on two-dimensional electrophoresis gels and identification of 325 distinct proteins. Proteomics. 2003;3(7):1345–64.
Pieper R, Huang ST, et al. Characterizing the dynamic nature of the Yersinia pestis periplasmic proteome in response to nutrient exhaustion and temperature change. Proteomics. 2008;8(7):1442–58.
Prokisch H, Scharfe C, et al. Integrative analysis of the mitochondrial proteome in yeast. PLoS Biol. 2004;2(6):e160.
Ram RJ, Verberkmoes NC, et al. Community proteomics of a natural microbial biofilm. Science. 2005;308(5730):1915–20.
Rodriguez-Valera F. Environmental genomics, the big picture? FEMS Microbiol Lett. 2004;231(2):153–8.
Speers AE, Cravatt BF. Activity-based protein profiling (ABPP) and click chemistry (CC)-ABPP by MudPIT mass spectrometry. Curr Protoc Chem Biol. 2009;1:29–41.
van Noort V, Seebacher J, et al. Cross-talk between phosphorylation and lysine acetylation in a genome-reduced bacterium. Mol Syst Biol. 2012;8:571.
Verberkmoes NC, Russell AL, et al. Shotgun metaproteomics of the human distal gut microbiota. ISME J. 2009;3(2):179–89.
Wolf-Yadlin A, Sevecka M, et al. Dissecting protein function and signaling using protein microarrays. Curr Opin Chem Biol. 2009;13(4):398–405.
Wolters DA, Washburn MP, et al. An automated multidimensional protein identification technology for shotgun proteomics. Anal Chem. 2001;73(23):5683–90.
Yates JR, Ruse CI, et al. Proteomics by mass spectrometry: approaches, advances, and applications. Annu Rev Biomed Eng. 2009;11:49–79.
RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data

Norman J. MacDonald1, Donovan H. Parks1,2 and Robert G. Beiko1
1 Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
2 Australian Centre for Ecogenomics, University of Queensland, Brisbane, QLD, Australia

Definition

Algorithm, software, and Web service for taxonomic classification of metagenome fragments using both homology and compositional information.

Introduction

A central task in many metagenomic studies is the inference of community function from sequence data. An additional challenge is the need to assign functional genes to particular members of the community, in order to determine which organisms are responsible for carrying out which molecular processes. While sequences derived from a given microorganism often carry a “signature” that reflects mutational bias and other processes in the genome of that organism (Campbell et al. 1999), these patterns are by no means uniform and they often cannot distinguish closely related organisms. Further compounding the problem is the reliance on short-read-based approaches to metagenome sequencing, which can generate reads less than 200 nucleotides in length, and short or ambiguous assemblies in many cases. Successful classification methods use homology (e.g., BLAST comparisons against genes or proteins from a set of reference genomes) or composition (e.g., distribution of tetranucleotide sequences) for classification, with a newer generation of “hybrid” classifiers using both (e.g., PhymmBL; Brady and Salzberg 2009). We have developed RITA, a hybrid approach that uses streamlined approaches to rapidly generate homology and composition information and combines these sets of predictions in a supervised classification pipeline that sorts sequences into different classification groups based on the strength and agreement of the two types of predictions.

Requirements

Software: RITA is implemented in Python and can be used as a stand-alone program (http://kiwi.cs.dal.ca/Software/RITA) or via the Web service (http://ratite.cs.dal.ca/rita). Queries to the Web service are limited to 10,000 sequences at a time. For compositional classifications RITA uses the Fragment Classification Package FCP (Parks et al. 2011; http://kiwi.cs.dal.ca/Software/FCP).
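Because Web service queries are limited to 10,000 sequences, a larger FASTA file has to be submitted in pieces. The following Python sketch shows one way to split FASTA records into batches of at most that size. The function names are hypothetical helpers, not part of RITA or FCP, and how the batches are submitted is left open.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records, size=10000):
    """Group records into lists holding at most `size` entries each."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_fasta(batch, path):
    """Write one batch of (header, sequence) pairs to a FASTA file."""
    with open(path, "w") as handle:
        for header, sequence in batch:
            handle.write(f">{header}\n{sequence}\n")
```

A typical use would open the input file, pass the file handle to `read_fasta`, and call `write_fasta` once per batch with an output name of the user's choosing.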
For the stand-alone version, a locally installed copy of either the BLAST+ software suite or USEARCH (Edgar 2010) is necessary.
Reference Databases: Since RITA is a supervised classifier, it requires a reference database of sequenced genomes with associated taxonomic information. Genomic information is typically acquired from the NCBI database of sequenced genomes, and this step can be performed automatically using the scripts provided with the FCP software package. From these sequenced genomes (and optionally, similarly formatted files provided by the user), RITA can build reference models for both composition and homology. If rank-flexible classification (described below) is to be performed, a set of 16S rRNA gene sequences corresponding to the reference sequenced genomes will be required as well. Instructions on acquiring and preparing these can be viewed at http://kiwi.cs.dal.ca/Software/RITA.
Input Data: The user must provide their metagenomic sequences in a FASTA-formatted file. The sequences can be of any length. If rank-flexible classification is desired, a list of sampled 16S rRNA gene sequences must also be provided.

The RITA Pipeline

The primary objective of RITA is to make taxonomic assignments that consider both the agreement between composition and homology and the strength of evidence from both types of classification technique. Homology-based classification is performed using local alignment-based comparative tools such as BLAST (Altschul et al. 1997). Many variants of BLAST have been developed which differ in the type of sequence information being compared (e.g., nucleotide, 6-way translated nucleotide, amino acid), as well as sensitivity and speed. Although RITA can be configured to run in a number of different ways, the default approach is based on the sequential use of three different algorithms: Discontiguous MEGABLAST for fast but low-sensitivity comparisons between a nucleotide query and nucleotide database, BLASTN for slower but more sensitive nucleotide-nucleotide comparisons, and BLASTX for sensitive comparisons between a translated query sequence and reference database of protein sequences. The objective in using this ordering is to place the fastest algorithms first, which removes the need to run the slower algorithms on all query sequences. The stand-alone version of RITA also includes the option to use UBLAST (Edgar 2010), which aims to prioritize searches against a reference database in order to avoid searching the entire database. Approaches such as LCA (Huson et al. 2007), MetaPhyler (Liu et al. 2011), and CARMA (Gerlach and Stoye 2011) use phylogenetic information for taxonomic classification, but our trials of RITA showed no additional benefit to the use of phylogenetic trees in the classification scheme we describe below.
For compositional classifications, we encode each reference genome as a series of nucleotide words (i.e., k-mers) of a fixed length to generate frequency distributions of each word. These frequency profiles are then used to train a naïve Bayes (NB) classifier (Parks et al. 2011), which assigns likelihoods to each query fragment based on the match between its k-mer profile and those representing the different genomes in the reference database. The genome with the largest likelihood for a given fragment is the best compositional match to that fragment. The crucial assumption of the NB classifier is of independence among input k-mers: while this assumption is clearly violated by k-mer decompositions of DNA sequences (for instance, the frequency of the 6-mer AAAAAA will be closely tied to that of AAAAAC), in practice this does not impact on the performance of the classifier. Phymm is a compositional classifier that uses more-sophisticated Markov models of sequence composition: while these are better at describing the compositional profile of a genome, in practice they are much slower and no more accurate than our NB approach.
The RITA pipeline combines homology and composition information by first assessing whether the predictions of composition and homology agree. While homology alone outperforms composition alone in most classification tasks, the genomic patterns reflected in compositional profiles provide complementary information, and agreement between the two types of
data is not trivially obtained. If agreement is found for a given fragment and the first BLAST algorithm considered, then the fragment will be classified with the predicted taxonomic label and assigned to group 1, the highest-confidence group. If the predictions of composition and homology disagree, then classification using homology alone will be attempted in the following manner. When running RITA, the user specifies a minimum margin for homology-based classification based on e-values: the default value is 20 orders of magnitude. If the globally best e-value is smaller (i.e., better) than the best e-value from a different taxonomic group by an amount greater than or equal to this margin, then the result is considered strong evidence for assignment to the best-matching group, and the fragment is assigned to group 2. If the fragment remains unclassified, the same procedure is followed for subsequent BLAST algorithms with potential classifications to group 3, group 4, etc. If all homology-based options have been exhausted, classification is made to one of two groups based on the NB classifier alone. Similar to the homology margin described above, the globally best NB likelihood is compared to the best likelihood from a different taxonomic group. If this ratio exceeds a user-specified amount, then the fragment is assigned to the higher-confidence composition-only group. If the ratio does not exceed this amount, then the fragment is assigned to the last and lowest-confidence group.
The procedure above describes rank-specific classification, where all fragments are classified at a given taxonomic rank, for instance, phylum or genus. However, different groups of microbes may be more or less represented by sequenced genomes, and there may be more evidence to make precise assignments to some groups than to others. In the extreme case, some bacterial phyla are represented by a single sequenced individual, making it impossible to distinguish between genera and other groups within this phylum. One solution to this problem is to classify all fragments to a very high rank such as phylum or class, but this discards precision in cases where it may be available. Our solution is to use a rank-flexible version of RITA that assigns an appropriate taxonomic group and rank based on the strength of available evidence. To perform rank-flexible classification of a metagenome sample using RITA, the user must provide a list of 16S rRNA genes that were identified from the sample. These genes are used to limit the taxonomic scope of the RITA predictions. The provided 16S rRNA genes are mapped into a tree of all 16S genes from the reference database of sequenced genomes. All genomes represented within a minimal clade containing one of the sampled 16S genes will be flagged as assignable to a taxonomic rank that is no more precise than the rank covering all members of that clade. For example, if a sampled 16S gene maps to the reference tree such that all of its sister taxa are from the same order, then RITA will consider matches to those taxa to be equivalent at the rank of order. In this manner, the level of classification is determined by the density of reference genome sampling around the observed 16S rRNA gene sequences from the environmental sample.

Interpreting RITA Results and Factors Affecting Prediction Accuracy

RITA Output: RITA returns detailed results of both composition- and homology-based models. Most critical in the RITA output is a tab-separated file that lists the predictions associated with each DNA sequence. Examples of RITA output are given in Table 1, with some taxonomic ranks omitted to fit each result on a single line. The first column contains the name of the sequence as obtained from the sequence file. The second and third columns give the confidence group associated with the prediction, first by number and then by name. Group 1, “NB_DCMEGABLAST,” indicates agreement between the first homology prediction method used (Discontiguous MEGABLAST) and the NB classifier, while group 2 corresponds to a prediction made based on a strong separation between the best and second-best groups according to homology. The fourth column shows the taxonomic rank at which the prediction was made, and the remaining columns give the
RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data, Table 1 Examples of RITA output
seq1  1  NB_DCMEGABLAST     CLASS  Actinobacteria       Nocardioides_sp._JS614
seq2  2  DCMEGABLAST_RATIO  CLASS  Deltaproteobacteria  Syntrophus aciditrophicus
seq3  5  NB_RATIO           CLASS  Alphaproteobacteria  Phenylobacterium zucineum
seq4  6  NB_ML              CLASS  Sphingobacteria      Pedobacter saltans
labels associated with that prediction, with the final column showing the actual genome that yielded the best prediction.
Summarizing Results: In most cases, we do not recommend using all classes when building taxonomic summaries of the contents of a metagenome. In particular, the accuracy of the final two classes (which are based on composition only) tends to be very poor when sequences are short. If high precision is desired, then a user can focus their attention on either group 1 alone or those groups in which homology is a factor in the prediction. In the example above, this would include groups 1–4 and exclude the last two groups, 5 and 6. However, when sequences are long (>2,500 nt in length) due to assembly, predictions based on composition alone are more reliable and can be included in the final set of predictions. Also, if the user has a reasonable expectation of “who is there” based on, e.g., taxonomic assignment of marker genes, this knowledge can be used as the basis for accepting a subset of predictions from the last two groups.
Factors Affecting the Accuracy of RITA Predictions: Several factors have been tested and shown to impact on the accuracy of RITA predictions. Among the most notable are:
Reference genome availability. Classification of a fragment to a taxonomic group at a given rank obviously depends on the existence of at least one sequenced genome from this group in the reference database. Even at the level of genus, inclusion of multiple reference genomes is desirable to adequately map out the pan-genome for homology-based predictions and to capture compositional variation within the group. Compositional signal is highly variable within order, class, and phylum, and best matching to homologs is difficult as well due to the confounding effects of gene loss and lateral gene transfer. Consequently the classification accuracy on fragments from genomes that are taxonomically novel (i.e., that have no relatives in the reference database at ranks such as order or class) will be extremely poor. This presents a significant challenge when samples are known to be enriched in poorly represented phyla such as Verrucomicrobia, Acidobacteria, or the many candidate phyla that lack sequenced representatives. If human microbiome samples are being processed using RITA, it is highly desirable to add the draft genomes sequenced by the Human Microbiome Consortium (Markowitz et al. 2012) to increase the coverage of common human-associated taxonomic groups: the effects of including or excluding these genomes are shown in Fig. 1.
Short fragments. The effect of fragment length on classification accuracy has been extensively characterized (McHardy and Rigoutsos 2007; Brady and Salzberg 2009; MacDonald et al. 2012). While hybrid classifiers such as RITA can give accuracy in excess of 50 % even on metagenomic fragments 50 nt in length, a high degree of misclassification is likely and many false-positive predictions can be anticipated. Restricting predictions to the “agreement” groups such as group 1 is highly desirable in this case.
Long fragments. A different problem is seen when applying RITA to long, assembled metagenomic fragments. Since RITA considers only the best BLAST match for a given fragment, the homology prediction for a long assembly will be based on one of many genes. If the prediction associated with this gene is incorrect (for instance, if it was recently transferred into the sequenced organism from a different genome), then homology and composition will likely disagree, and the entire fragment will likely be assigned to a
RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data, Fig. 1 RITA classifications of 33,000 metagenomic fragments from obese twin gut metagenomes (Turnbaugh et al. 2009). D-BLASTN = Discontiguous MEGABLAST. The number of assignments to each RITA group is shown, with different colors indicating assignments to different genera. (a) Assignments made to a set of reference genomes, excluding the draft genomes sequenced by the HMP. A majority of sequences are assigned to the low-confidence “NB ratio” category. (b) Classifications of the same data set with inclusion of the HMP reference genomes, showing a doubling of the number of assignments to the highest-confidence group (NB and D-BLASTN) and a near-halving of assignments to the NB ratio group. Plots were generated by the RITA Web server
composition-only bin. In this case, the NB classification will likely be correct due to the length of the fragment, and inspection of the homology affinities of other genes in this fragment will likely confirm the correct classification.

“Sticky” taxa. Some genera are both extremely diverse in their gene content and well represented in the set of sequenced genomes. Notable examples of this include the genera Clostridium, Streptococcus, and Bacillus. Since these genera have large pan-genomes and appear to share frequently with other lineages, many query fragments may be incorrectly assigned to these large groups. Care should be taken when results include a large number of these genera. Also, genera such as Buchnera and Sulcia that are dominated by small genomes tend to have low genomic G+C contents; as a consequence, fragments from low-G+C regions of other genomes may tend to be incorrectly assigned to these organisms. Since many of these organisms are restricted to highly specific settings such as insect bacteriomes, spurious matches to these groups can readily be identified.

Summary

RITA is a hybrid supervised classification system for metagenomic reads that has been shown to give useful accuracy on fragments as small as 50 nt in length. Accurate classification depends on the criteria listed above, in particular the availability of good reference databases. The key to RITA’s speed is the use of the very fast NB classifier and prioritizing the slower homology search approaches. BLASTX in particular is very slow and can increase running time by an order of magnitude, so avoiding translated nucleotide queries or using the UBLASTX algorithm of Edgar (2010) is crucial to rapid execution. Predictions based on composition alone are less reliable than those that include homology as a criterion, but composition-based predictions become more accurate with increasing sequence length.

Cross-References

▶ MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
▶ Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3

References

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6:673–6.
Campbell A, Mrázek J, Karlin S. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc Natl Acad Sci U S A. 1999;96:9184–9.
Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–1.
Gerlach W, Stoye J. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 2011;39:e91.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics. 2011;12 Suppl 2:S4.
MacDonald NJ, Parks DH, Beiko RG. Rapid identification of high-confidence taxonomic assignments for metagenomic data. Nucleic Acids Res. 2012;40:e111.
Markowitz VM, Chen IM, Chu K, Szeto E, Palaniappan K, Jacob B, Ratner A, Liolios K, Pagani I, Huntemann M, Mavromatis K, Ivanova NN, Kyrpides NC. IMG/M-HMP: a metagenome comparative analysis system for the human microbiome project. PLoS One. 2012;7:e40151.
McHardy AC, Rigoutsos I. What’s in the mix: phylogenetic classification of metagenome sequence samples. Curr Opin Microbiol. 2007;10:499–503.
Parks DH, MacDonald NJ, Beiko RG. Classifying short genomic fragments from novel lineages using composition and homology. BMC Bioinforma. 2011;12:328.
Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI. A core gut microbiome in obese and lean twins. Nature. 2009;457:480–4.
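The acceptance rules described in this entry (group 1 alone for high precision, the homology-informed groups otherwise, and composition-only predictions for long fragments or expected taxa) can be sketched as a simple post-filter. This is a minimal illustration and not part of RITA itself: the `Prediction` record, the `accept` function, and the numbering of homology-informed groups as 1–4 are assumptions taken from the example in the text, and the 2,500 nt threshold is the length quoted there.

```python
from dataclasses import dataclass

# Assumption: groups 1-4 involve homology and groups 5-6 are
# composition-only, as in the example described in the text.
HOMOLOGY_GROUPS = {1, 2, 3, 4}
LONG_FRAGMENT_NT = 2500  # length above which composition alone is more reliable


@dataclass
class Prediction:
    """Hypothetical record for one classified fragment."""
    fragment_id: str
    length_nt: int
    group: int   # confidence group, 1 (highest) to 6
    taxon: str


def accept(pred, high_precision=False, expected_taxa=frozenset()):
    """Return True if a prediction should be kept in the final set."""
    if high_precision:
        # Keep only the agreement group.
        return pred.group == 1
    if pred.group in HOMOLOGY_GROUPS:
        return True
    if pred.length_nt > LONG_FRAGMENT_NT:
        # Long assembled fragments: composition-only calls are more reliable.
        return True
    # Composition-only call on a short fragment: keep it only if the taxon
    # is expected from prior knowledge (e.g., marker-gene surveys).
    return pred.taxon in expected_taxa


preds = [
    Prediction("f1", 120, 1, "Bacteroides"),
    Prediction("f2", 120, 3, "Clostridium"),
    Prediction("f3", 120, 5, "Buchnera"),
    Prediction("f4", 3000, 6, "Ruminococcus"),
    Prediction("f5", 120, 6, "Bacteroides"),
]
kept_default = [p.fragment_id for p in preds if accept(p)]
kept_strict = [p.fragment_id for p in preds if accept(p, high_precision=True)]
kept_prior = [p.fragment_id for p in preds
              if accept(p, expected_taxa={"Bacteroides"})]
```

In this toy input the short composition-only assignment to Buchnera is dropped under every setting, which mirrors the caution about spurious matches to small, low-G+C genomes.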
S
SATe-Enabled Phylogenetic Placement

Tandy Warnow
Institute for Genomic Biology, University of Illinois, IL, USA

Synonyms

Evolutionary tree; Phylogenetic tree; Phylogeny; Tree

Introduction

Taxonomic identification of DNA fragments produced in a shotgun sequencing analysis is a basic problem in metagenomic data analysis. One approach for this problem operates as follows: the fragmentary sequences are assigned to genes and then these fragmentary sequences are inserted into a calculated or precomputed taxonomy based on the same gene. The insertion of fragments of a gene sequence into a tree on full-length sequences is called “phylogenetic placement.”

The input to the phylogenetic placement problem is generally assumed to be a reference alignment A on full-length sequences for a gene and its maximum-likelihood tree T (Felsenstein 2003; Price et al. 2010; Stamatakis 2006; Swofford 2003), as well as a set Q of “query sequences.” The set Q thus represents the fragmentary sequences, whose taxon identification is uncertain but for which the gene assignment is assumed correct. In general, the reference alignment A and tree T are also assumed correct, and so the objective is to place the fragments in Q into T as close as possible to their correct position.

Methods for phylogenetic placement include EPA (Berger et al. 2011), pplacer (Matsen et al. 2010), PaPaRa (Berger and Stamatakis 2011), and others. Of these, EPA and pplacer are essentially identical in performance and technique: first, a profile hidden Markov model (HMM; Eddy 1998) is computed for the reference alignment, and then it is used to align each of the query sequences, one at a time. Thus, |Q| extended alignments are computed, each containing the reference sequences and one query sequence, and inducing A on the reference sequences. Then, maximum-likelihood methods are used to insert the query sequence into the tree T. The calculation of the extended alignment and the placement of a single query sequence into the tree are each reasonably fast; however, because there can be many query sequences, this approach can be computationally intensive. Nevertheless, the analyses of different query sequences are independent, and so this process can be easily parallelized. Furthermore, this approach has good accuracy when the reference alignments and trees are correct.

In Mirarab et al. (2012), PaPaRa and pplacer were studied on a range of datasets, varying the rate of evolution and the number of sequences. This study showed that both PaPaRa and pplacer
had good accuracy for genes that evolve under low rates of evolution, but when the rate of evolution increased, their accuracy dropped substantially. Two observations resulted: first, under a high rate of evolution, the reference alignment and tree would be difficult to estimate, an observation that has been made elsewhere. However, more surprisingly, when the sequences evolved under a high rate of evolution, even with a good alignment and tree, the technique for computing the extended alignment did not have good accuracy.

Mirarab et al. developed a divide-and-conquer technique for improving this approach to phylogenetic placement, which they termed SEPP. This technique operates by using the reference tree to divide the dataset into subsets (as used in SATé-II (Liu et al. 2012)) and then uses HMMER (Eddy 1998) to compute an HMM on each subset using the induced alignment from the reference alignment. Thus, SEPP stands for SATé-enabled phylogenetic placement. Instead of using a single HMM to represent the reference alignment, a collection of HMMs is used, each on a different subset of the taxa. The calculation of the extended alignment for each query sequence is then made by using HMMER to score the fit between the query sequence and each of the subset HMMs, and the HMM that has the best score is used to align the query sequence to the alignment on that subset. Because the subset alignments are all in agreement with the reference alignment on the full dataset, transitivity then provides the alignment of the query sequence to the full dataset. In this way, the extended alignment of each query sequence can be computed. Once the extended alignment is calculated, the query sequence can be inserted into the reference tree using maximum likelihood, just as in EPA and pplacer.

SEPP also allows the user to limit the subtree of the reference tree into which the query sequence will be placed through an additional parameter. Thus, SEPP takes two parameters: the number of leaves in the subtree on which SEPP builds an HMM (based on the induced alignment) and the number of leaves in the (perhaps larger) subtree that contains the alignment subset, into which the query sequence is then placed. Both parameters influence the accuracy and running time of SEPP.

Thus, the most important difference between SEPP and EPA and pplacer is just how the extended alignment is computed. The technique in SEPP for calculating the extended alignment is based on decomposing the taxon set into subsets using the reference tree, and so the important issue is how the taxon set is decomposed. Mirarab et al. used the centroid edge decomposition strategy first employed in the SATe multiple sequence alignment method (Liu et al. 2012). This strategy removes an edge that breaks the taxon set roughly in half and then repeats the process on each subtree until the desired number of subtrees is computed. Thus, SEPP is SATe-enabled phylogenetic placement.

SEPP takes two parameters: the size of the “alignment subsets” that the reference tree is decomposed into and the size of the larger subsets (called “placement subsets”) into which the query sequences can be placed after their extended alignments are computed. Both parameters impact accuracy and speed. For example, smaller alignment subsets result in better accuracy but increase the running time. Similarly, larger placement subsets improve accuracy but increase the running time. The experimental study showed that setting both parameters identically and decomposing to ten subsets gave a good trade-off between accuracy and running time.

The experimental study in Mirarab et al. (2012) showed that this default setting for SEPP gave improved accuracy compared to pplacer and PaPaRa; results from this study are reproduced below in Fig. 1. The test datasets have 500 query sequences (half “long” and half “short,” where long sequences have a length on average of 250 and short sequences have a length on average of 100), and the placement methods insert these query sequences into a reference tree and alignment on 500 full-length sequences (average length 1,000 nt). Mirarab et al. (2012) also showed that SEPP provided improved computational performance over these methods with respect to both time and peak memory usage for very large datasets.
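The centroid edge decomposition described above can be illustrated with a short sketch. It is a simplified illustration rather than the SEPP or SATé implementation: the tree is assumed to be given as an undirected edge list, the stopping rule is a maximum number of leaves per subset (the hypothetical parameter `max_size`), and the search for the most balanced edge is a plain quadratic scan chosen for clarity rather than speed.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def side_of(adj, start, cut_edge):
    """Nodes reachable from `start` without crossing `cut_edge`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt in seen or frozenset((node, nxt)) == cut_edge:
                continue
            seen.add(nxt)
            stack.append(nxt)
    return seen


def centroid_decompose(adj, leaves, max_size):
    """Recursively split a tree at its most balanced edge.

    `leaves` is the set of taxon labels in this tree; recursion stops once
    a subtree holds at most `max_size` of them.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if len(leaves) <= max_size:
        return [set(leaves)]
    best = None
    for edge in {frozenset((u, v)) for u in adj for v in adj[u]}:
        u, _ = tuple(edge)
        side = side_of(adj, u, edge)
        # 0 means the edge splits the leaf set exactly in half.
        imbalance = abs(len(leaves) - 2 * len(side & leaves))
        if best is None or imbalance < best[0]:
            best = (imbalance, side)
    side_a = best[1]
    side_b = set(adj) - side_a
    parts = []
    for side in (side_a, side_b):
        sub_adj = {n: adj[n] & side for n in side}
        parts += centroid_decompose(sub_adj, leaves & side, max_size)
    return parts


# Toy balanced tree with eight taxa t1..t8 (hypothetical labels).
edges = [("r", "a"), ("r", "b"), ("a", "a1"), ("a", "a2"),
         ("b", "b1"), ("b", "b2"), ("a1", "t1"), ("a1", "t2"),
         ("a2", "t3"), ("a2", "t4"), ("b1", "t5"), ("b1", "t6"),
         ("b2", "t7"), ("b2", "t8")]
taxa = {"t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"}
subsets = centroid_decompose(build_adjacency(edges), taxa, max_size=2)
```

Under the settings reported in the text, a 500-taxon reference tree with subsets of size 50 would yield about ten subsets, each of which would then receive its own HMM.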
[Figure 1: bar chart. Vertical axis: Delta Error (edges), 0–8. Bars: PaPaRa+pp, HMMALIGN+pp, 50/50.]

SATe-Enabled Phylogenetic Placement, Fig. 1 Delta tree error of three phylogenetic placement methods on simulated datasets with 500 query sequences and 500 reference sequences, given the true alignment and the true tree. The delta error is the average additional distance (in edges) from the correct placement produced by using the extended alignment rather than the true alignment in placing each query sequence. We show results obtained using PaPaRa or HMMALIGN to compute the extended alignments followed by pplacer to place each query sequence. We also show SEPP(50/50) (the default setting), which uses alignment subsets of size 50 (10% of the reference tree) to compute the extended alignment, and then places the query sequences into the same subtree. Note that SEPP(50/50) has less than half the error of HMMALIGN+pplacer, and that PaPaRa+pplacer has about double that of HMMALIGN+pplacer (reproduced from Mirarab et al. 2012, with permission from the publisher)

Summary

SEPP is a technique for performing phylogenetic placement, a basic algorithmic problem in large-scale phylogeny estimation and also in taxonomic classification of fragmentary sequences that are often produced in a shotgun sequencing analysis of metagenomic data. Phylogenetic placement methods such as EPA and pplacer use a reference tree and alignment on full-length sequences for a given gene, represent the reference alignment using an HMM, and then align each fragmentary sequence to the reference alignment using the HMM. This extended alignment for the given fragmentary sequence is then used to find the best placement in the reference tree. SEPP produces more accurate placements than both EPA and pplacer, because instead of using a single HMM to represent the entire reference alignment, it uses multiple HMMs, each on a different subset of the taxa. Although formulated for use specifically with HMMER tools for computing HMMs and aligning sequences to the reference alignment, it could also be used to boost other phylogenetic placement methods. Finally, the divide-and-conquer technique employed in SEPP is a general technique for boosting machine learning methods that could be extended to other classification problems in bioinformatics.

Funding

This work was supported by NSF grant DEB 0733029 to T.W.

References

Berger SA, Stamatakis A. Aligning short reads to reference alignments and trees. Bioinformatics. 2011;27(15):2068–75.
Berger SA, Krompass D, Stamatakis A. Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol. 2011;60(3):291–302.
Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–63.
Felsenstein J. Inferring phylogenies. Sunderland: Sinauer Associates; 2003.
Liu K, Warnow T, Holder M, Nelesen S, Yu J, Stamatakis A, Linder CR. SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol. 2012;61(1):90–106.
Matsen F, Kodner R, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010;11:538.
Mirarab S, Nguyen N, Warnow T. SEPP: SATe-enabled phylogenetic placement. Pacific Symposium on Biocomputing. 2012.
Price M, Dehal P, Arkin A. FastTree 2 - approximately maximum likelihood trees for large alignments. PLoS ONE. 2010;5:e9490.
Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–90.
Swofford DL. PAUP*: phylogenetic analysis using parsimony (*and other methods), version 4. Sunderland: Sinauer Associates; 2003.
Serial Analysis of V1 Ribosomal Sequence Tags

Zhongtang Yu1 and Mark Morrison2
1Department of Animal Sciences, Environmental Science Graduate Program, The Ohio State University, Columbus, OH, USA
2Diamantina Institute, The University of Queensland, Woolloongabba, Brisbane, QLD, Australia

Synonyms

Serial analysis of ribosomal sequence tags (SARST); Serial analysis of V6 ribosomal sequence tags (SARST-V6)

Definition

By this technique, the V1 hypervariable region of 16S rRNA genes is amplified as a ribosomal sequence tag (RST) by PCR using universal primers, concatenated head to tail, cloned, and sequenced. By enabling multiple RSTs to be sequenced from each RST concatemer, SARST-V1 substantially increases the number of sequences obtained on either the Sanger or next-generation sequencing platforms, thus increasing the depth of coverage of microbiome analysis.

Introduction

Sufficient characterization of actual diversity is a prerequisite to understanding the function of microbiomes and to exploring and manipulating them for beneficial applications. Sequencing and phylogenetic analysis of 16S rRNA genes have been the primary approaches in such a pursuit. Detailed characterization of microbiomes, however, requires a large number of 16S rRNA genes to be sequenced from each microbiome sample, especially when members present at low abundance need to be identified. Before next-generation sequencing (NGS) technologies for DNA sequencing became available, it was not feasible to generate adequate sequences of 16S rRNA genes from any complex microbiome because 16S rRNA genes first need to be cloned, and then individual clones need to be sequenced in a one-clone-one-sequence fashion, which is a costly and labor-intensive process. Indeed, all the 16S rRNA gene datasets produced by the Sanger sequencing technology are too small to capture the full diversity (Bent and Forney 2008; Tiedje et al. 1999). One strategy to reduce the cost of DNA sequencing-based analysis of microbial diversity is to sequence concatemers of a sequence tag of 16S rRNA genes using the serial analysis of ribosomal sequence tags (SARST) (Kysela et al. 2005; Neufeld et al. 2004; Yu et al. 2006). SARST was adapted from the serial analysis of gene expression (SAGE) (Velculescu et al. 1995), an approach first developed to substantially improve analysis of gene expression in eukaryotes (Carulli et al. 1998). In SARST, one of the hypervariable regions of 16S rRNA genes is used as a sequence tag. So far, SARST has been developed based on either hypervariable region V1 (referred to as SARST-V1) (Yu et al. 2006) or V6 (SARST-V6) (Kysela et al. 2005). Except for the different hypervariable regions used as the sequence tags, SARST-V1 and SARST-V6 have similar procedures.

Overview of SARST-V1 Procedures

The entire process of SARST-V1 (Fig. 1) consists of (i) amplification of the V1 region of 16S rRNA genes using a pair of bacterial primers; (ii) digestion of the PCR amplicons to cut off the primers; (iii) purification and concatenation of individual ribosomal sequence tags (RSTs); (iv) gel sizing, end repair, and cloning of the concatemers; and (v) sequencing of cloned RST concatemers and phylogenetic analysis of individual RSTs. The detailed procedures have been described elsewhere (Yu and Morrison 2011; Yu et al. 2006). Here we describe the major steps, alternatives, and cautions when warranted.
[Figure 1: flow diagram. 16S rRNA gene flanked by biotin-labeled (B B) primers BsgI-64F and BsgI-109R → (i) PCR → (ii) 1. Digest with BsgI; 2. Recover RSTs (by magnetic beads) → (iii) Concatenate (by T4 ligase) → (iv) 1. Gel sizing; 2. Blunt end polishing; 3. Cloning → RST concatemer libraries → (v) Sequencing and analysis of RSTs → Inference of diversity and microbiome composition.]

Serial Analysis of V1 Ribosomal Sequence Tags, Fig. 1 Schematic of the SARST-V1 process. BB, dual biotin label conjugated to the 5′ end of the primers. BsgI-Bact64F and BsgI-Bact109R, bacterial forward primer and reverse primer, each with an extension containing a BsgI recognition site (the figure was modified from reference 22 with permission from Wiley-Blackwell)
(i) PCR amplification. The V1 region is amplified using BsgI-Bact64F (5′-dual biotin-TTT GAC CGT GCA GCY TAA YRC ATG CAA GTCG-3′) and BsgI-Bact109R (5′-dual biotin-TTT GAC CGT GCA GYY CAC GYG TTA CKC ACC CGT-3′). Each primer has an extension region that contains a recognition site for BsgI (GTGCAG), and the 5′-most nucleotide of this extension region is labeled with at least one biotin or biotin-tetra-ethyleneglycol (biotin-TEG) molecule. The quality and quantity of the PCR products are evaluated using a PAGE (8 %T, 19:1) mini gel. Then, the PCR products are purified using the QIAquick PCR Purification Kit (QIAGEN, Valencia, CA) or by ethanol precipitation following extraction with phenol/chloroform. Hot-start PCR using a hot-start Taq DNA polymerase or a hot-start dNTP mix is recommended in the PCR amplification to reduce formation of primer dimers, which can contaminate the RSTs.

(ii) Digestion of PCR products and primer removal. The purified PCR products are digested with BsgI, a type IIs restriction endonuclease that cuts 16 base pairs (bp) downstream from the recognition site. The released RSTs are separated from the primers using streptavidin-coated magnetic beads, such as Dyna 280 beads (Dynal, Oslo, Norway), which immobilize the primers that have a biotin label at the 5′ end.

(iii) Concatenation of individual RSTs. Each of the freed RSTs has one 2-nt overhang at both 3′ termini, and these overhangs facilitate annealing of individual RSTs in head-to-tail orientation in series (Fig. 1).
Consequently, individual RSTs are ligated head-to-tail to form concatemers in 5′–3′ orientation. Because of the short overhangs and the desire to form long (>0.5 kb) concatemers, the concatenation is often performed overnight using a DNA ligase, such as T4 DNA ligase. It should be noted that BsgI is the most suitable type IIs endonuclease available. New type IIs endonucleases that produce 3- or 4-nt overhangs will improve concatenation. Additionally, when new primers are designed to target other V regions, the recognition site of BsgI (or the type IIs restriction endonuclease used) should be at such a distance from the 3′ end that digestion of the PCR products leaves at least five base pairs of the primers at each end of the freed RSTs. These conserved base pairs allow delineation of individual RSTs from the sequenced concatemers.

(iv) Gel sizing, end repair, and cloning of concatemers. Concatenation of individual RSTs produces concatemers of varying lengths. The concatemers of 0.5–2.0 kb need to be size selected, typically using gel (either agarose or polyacrylamide) electrophoresis, and recovered from the gel slice using commercial kits, such as the MinElute Gel Extraction Kit (QIAGEN). Following end repair by T4 DNA polymerase, the concatemers are then cloned by ligation into a cloning vector (e.g., pZerO-2.1 from Invitrogen or pSmartLCKan from Lucigen) that has been digested with a blunt-end restriction endonuclease. Alternatively, an adenine overhang can be added to each end of each concatemer so that the concatemers can be cloned using the TOPO TA cloning kit (Invitrogen). Direct cloning of the blunt-ended concatemers might be preferred because it increases cloning efficiency of SAGE concatemers (Koehl et al. 2003), which are similar in length to RST concatemers.

(v) Sequencing and phylogenetic analysis of cloned RST concatemers. The cloned concatemers are sequenced using the Sanger DNA sequencing technology. A typical Sanger sequencing read (greater than 500 bp) can determine the sequence of 19 individual RSTs (Yu et al. 2006). Individual RSTs are then delineated using the conserved base pairs that flank individual RSTs. The individual RSTs can first be grouped into OTUs and then compared to databases (Neufeld et al. 2004; Poitelon et al. 2009; Yu et al. 2006), or they can be compared to databases without grouping (Kysela et al. 2005). BLASTn and SeqMatch are two programs that can be used to compare RSTs to the sequences archived in GenBank (http://www.ncbi.nlm.nih.gov/) and RDP (http://rdp.cme.msu.edu/), respectively. Other programs, such as ESPRIT (Sun et al. 2009), mothur (Schloss et al. 2009), QIIME (Caporaso et al. 2010), CD-HIT (Li and Godzik 2006), and UniFrac (Lozupone et al. 2006), can also be used in RST analysis. Most of the RST datasets produced in previous studies can be found either in the NCBI Gene Expression Omnibus (GEO) database (Ashby et al. 2007; Neufeld et al. 2004; Yu et al. 2006) or in the Sequence Read Archive (SRA) (Huber et al. 2007).

Full-length 16S rRNA gene sequences can be grouped or assigned to species and genera using 97 % and 95 % sequence similarity as the cutoff values, respectively (Ludwig et al. 1998; Stackebrandt and Goebel 1994). However, most researchers also use these cutoff values in analyzing partial sequences. Because sequence divergence is not evenly distributed along the 16S rRNA gene (particularly among the nine V regions), different cutoff values are needed when different regions of 16S rRNA genes are analyzed (Kim et al. 2011). As such, different cutoff values of sequence similarity are needed to group and assign individual RSTs to RST-based OTUs. Alternatively, individual RSTs can be compared to rRNA gene sequence databases to identify longer sequences, which can then be used to characterize the microbiomes.
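The delineation of individual RSTs from a sequenced concatemer, using the conserved primer-derived base pairs that flank each tag, can be sketched as a simple pattern search. This is only an illustrative sketch: the flank sequences `AGTCG` and `GGTGA`, the function name `delineate_rsts`, and the length limits are hypothetical placeholders, since the actual conserved bases depend on the primers and on where the restriction enzyme cuts.

```python
import re


def delineate_rsts(read, left_flank, right_flank, min_len=10, max_len=80):
    """Extract tag sequences lying between conserved flanks in one read.

    `left_flank` and `right_flank` stand for the conserved primer-derived
    bases at each end of a freed RST (placeholder values in the example).
    Tags outside the length limits are discarded as likely artifacts.
    """
    pattern = re.compile(
        re.escape(left_flank.upper()) + "(.+?)" + re.escape(right_flank.upper())
    )
    tags = [m.group(1) for m in pattern.finditer(read.upper())]
    return [t for t in tags if min_len <= len(t) <= max_len]


# Toy concatemer of two tags joined head-to-tail, with a 2-nt junction.
LEFT, RIGHT = "AGTCG", "GGTGA"        # hypothetical conserved flanks
tag1 = "AACGGTCCTTAGCTAAGC"
tag2 = "TTGACCAGGAATCCGTTA"
read = LEFT + tag1 + RIGHT + "GT" + LEFT + tag2 + RIGHT
rsts = delineate_rsts(read, LEFT, RIGHT)
```

The extracted tags could then be grouped into OTUs or compared directly to reference databases, as described above; a real pipeline would also need to handle sequencing errors within the flanks and tags read from the reverse strand.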
SARST Based on Other V Regions

Besides the V1 region, the V6 region (987–1,045 nt) (Kysela et al. 2005; Poitelon et al. 2009) has also been used in SARST. As demonstrated in analyzing soil, rumen, and marine samples, both V1 and V6 appeared to have sufficient phylogenetic information to allow taxonomic assignments of the recovered RSTs (Neufeld and Mohn 2005; Neufeld et al. 2004; Pinloche et al. 2013; Poitelon et al. 2009; Yu et al. 2006). The V5 region has also been used in a SAGE-like analysis (referred to as serial analysis of rRNA genes, SARD) of the soil microbiome (Ashby et al. 2007). Since it only generates 14-bp RSTs, SARD may not provide enough phylogenetic information for reliable taxonomic assignments of the RSTs. Other V regions can be targeted in SARST, but when choosing a V region for SARST, the following need to be considered: the length and divergence of the sequence, the availability of universal primers, and the frequency of the recognition site of the type IIs restriction endonuclease chosen within the V region.

Summary

SARST was developed before NGS technologies became available, and it significantly improved upon the traditional one-clone-one-sequence approach with respect to both cost and coverage. SARST will remain useful even when NGS technologies are affordable. First, SARST-V1 can generate as much phylogenetic information as longer 454 pyrosequencing reads (Pinloche et al. 2013). Second, deep coverage is not compromised when multiple bar-coded microbiome samples are analyzed simultaneously in a single NGS run. Additionally, as the read length of NGS continues to increase, concatemers of RSTs can be sequenced without cloning of RST concatemers.

Cross-References

▶ A 123 of Metagenomics
▶ Approaches in Metagenome Research: Progress and Challenges
▶ Computational Approaches for Metagenomic Datasets
▶ Metagenomic Research: Methods and Ecological Applications

References

Ashby MN, Rine J, Mongodin EF, Nelson KE, Dimster-Denk D. Serial analysis of rRNA genes and the unexpected dominance of rare members of microbial communities. Appl Environ Microbiol. 2007;73:4532–42.
Bent SJ, Forney LJ. The tragedy of the uncommon: understanding limitations in the analysis of microbial diversity. ISME J. 2008;2:689–95.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–6.
Carulli JP, Artinger M, Swain PM, Root CD, Chee L, Tulig C, Guerin J, Osborne M, Stein G, Lian J, Lomedico PT. High throughput analysis of differential gene expression. J Cell Biochem Suppl. 1998;30–31(Suppl):286–96.
Huber JA, Mark Welch DB, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML. Microbial population structures in the deep marine biosphere. Science. 2007;318:97–100.
Kim M, Morrison M, Yu Z. Evaluation of different partial 16S rRNA gene sequence regions for phylogenetic analysis of microbiomes. J Microbiol Methods. 2011;84:81–7.
Koehl A, Friauf E, Nothwang HG. Efficient cloning of SAGE tags by blunt-end ligation of polished concatemers. Biotechniques. 2003;34:692–4.
Kysela DT, Palacios C, Sogin ML. Serial analysis of V6 ribosomal sequence tags (SARST-V6): a method for efficient, high-throughput analysis of microbial community composition. Environ Microbiol. 2005;7:356–64.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
Lozupone C, Hamady M, Knight R. UniFrac–an online tool for comparing microbial community diversity in a phylogenetic context. BMC Bioinformatics. 2006;7:371.
Ludwig W, Strunk O, Klugbauer S, Klugbauer N, Weizenegger M, Neumaier J, Bachleitner M, Schleifer KH. Bacterial phylogeny based on comparative sequence analysis. Electrophoresis. 1998;19:554–68.
Neufeld JD, Mohn WW. Unexpectedly high bacterial diversity in Arctic Tundra relative to boreal forest soils, revealed by serial analysis of ribosomal sequence tags. Appl Environ Microbiol. 2005;71:5710–8.
Neufeld JD, Yu Z, Lam W, Mohn WW. Serial analysis of ribosomal sequence tags (SARST): a high-throughput method for profiling complex microbial communities. Environ Microbiol. 2004;6:131–44.
Pinloche E, McEwan N, Marden JP, Bayourthe C, Auclair E, Newbold CJ. The effects of a probiotic yeast on the bacterial diversity and population structure in the rumen of cattle. PLoS ONE. 2013;8:e67824.
Poitelon JB, Joyeux M, Welte B, Duguet JP, Prestel E, Lespinet O, DuBow MS. Assessment of phylogenetic diversity of bacterial microflora in drinking water using serial analysis of ribosomal sequence tags. Water Res. 2009;43:4197–206.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.
Stackebrandt E, Goebel BM. Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in Bacteriology. Int J Syst Bacteriol. 1994;44:846–9.
Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, Farmerie W. ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res. 2009;37:e76.
Tiedje JM, Asuming-Brempong S, Nüsslein K, Marsh TL, Flynn SJ. Opening the black box of soil microbial diversity. Appl Soil Ecol. 1999;13:109–22.
Velculescu V, Zhang L, Vogelstein B, Kinzler K. Serial analysis of gene expression. Science. 1995;270:484–7.
Yu Z, Morrison M. Sequence-based characterization of microbiomes by Serial Analysis of Ribosomal Sequence Tags (SARST). Handbook of molecular microbial ecology I. Wiley; 2011. p. 265–73.
Yu Z, Yu M, Morrison M. Improved serial analysis of V1 ribosomal sequence tags (SARST-V1) provides a rapid, comprehensive, sequence-based characterization of bacterial diversity and community composition. Environ Microbiol. 2006;8:603–11.

SILVA Databases

Christian Quast1, Elmar Pruesse1, Jan Gerken1, Timmy Schweer1, Pelin Yilmaz1, Jörg Peplies2 and Frank Oliver Glöckner3,4
1Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
2Ribocon GmbH, Bremen, Germany
3Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
4Jacobs University Bremen gGmbH, Bremen, Germany

Synonyms

Alignment; Classification; Probe and primer evaluation; Quality assessment; Ribosomal RNA gene datasets; Taxonomy

Definition

SILVA (from Latin silva, forest) is a comprehensive web resource (http://www.arb-silva.de) for up-to-date, quality-controlled databases of aligned ribosomal RNA gene (rDNA) sequences from the Bacteria, Archaea, and Eukarya domains.

Introduction

Sequencing the ribosomal RNA gene (rDNA) is the method of choice for nucleic acid-based detection and identification of microbes, their taxonomic assignment, phylogenetic analysis, and investigation of microbial diversity. Today (July 2012), more than 3.5 million small and large subunit (SSU and LSU) rDNA sequences are publicly available, and their analysis demands appropriate software tools and specialized, quality-controlled databases. The SILVA datasets, established in 2007, provide high-quality,
[Figure 1: bar chart titled "ribosomal RNA Database Growth"; x-axis: Year (1992–2011 and SILVA r111); y-axis: rRNA Sequences (0–3.5M)]
SILVA Databases, Fig. 1 Growth of the ribosomal RNA databases since 1992
comprehensive rDNA datasets comprising sequences from the Bacteria, Archaea, and Eukarya domains. All sequences are checked for anomalies, carry a rich set of sequence-associated contextual information, and have multiple taxonomic classifications and the latest validly described nomenclature. The SILVA datasets are based on the EMBL/EBI Nucleotide Sequence Database (EMBL-Bank), a member of the International Nucleotide Sequence Database Collaboration (INSDC) comprising all publicly available DNA sequences. They are generated by an automatic software pipeline for the extraction of SSU and LSU rDNA sequences as well as quality control. The alignment is based on the latest comprehensive ARB (Ludwig et al. 2004) alignments. The datasets are extensively annotated by third-party data integration. Substantial manual curation of the alignment and taxonomy is performed on each public release. SILVA dataset updates and new online features are continuously released on the SILVA web portal (http://www.arb-silva.de) which provides detailed statistics and documentation of the resource.
SILVA Datasets
The SILVA project provides datasets for all SSU and LSU rDNA sequences found in EMBL-Bank that fulfill the SILVA quality criteria. Since their first public release in February 2007, based on EMBL-Bank release 89, these datasets have increased in size by a factor of 10 and 5 for the SSU Parc and LSU Parc datasets, respectively. Moreover, the growth is clearly exponential (Fig. 1) as is the growth of the general DNA sequence databases. Detailed information on the current SILVA database content can be found in the documentation section of the SILVA web portal.
The SILVA SSU and LSU rDNA datasets each consist of two subsets: (1) the “Parc” datasets comprising the complete SILVA
database content and (2) the “Ref” datasets comprising high-quality subsets of sequences in the Parc datasets. For the SSU dataset, two additional subsets are provided: (3) the “Ref NR” dataset, a nonredundant version of the Ref subset, and (4) a type strain dataset provided by the “All-Species Living Tree Project” (LTP) (Munoz et al. 2011), which is also available for the LSU rDNA. All datasets and individual subsets can be downloaded as ARB files for direct use with the ARB software package (Ludwig et al. 2004). In the future, only SSU Ref and SSU Ref NR datasets will be offered in the ARB format to avoid excessive hardware demands on the user side.
SILVA Parc
All sequences in the Parc datasets have a minimum length of 300 aligned nucleotides within the boundaries of the rRNA genes. Sequences are only accepted if they have less than 2 % ambiguities, homopolymers, or vector contamination. Additionally, after the alignment, minimal quality requirements for sequence quality and base-pair score, as well as alignment identity and quality, are applied. For details, please refer to the section “Quality Control” and the respective dataset documentation page (e.g., http://www.arb-silva.de/documentation/background/release-111/).
SILVA Ref
The SILVA Ref datasets represent subsets of the corresponding SILVA SSU and LSU Parc datasets. They comprise only “full-length” or nearly “full-length” sequences. An SSU sequence is considered to be of “full length” if it contains at least 1,200 aligned bases within the rRNA gene boundaries. For sequences classified as Archaea, this threshold has been lowered to 900 aligned bases to avoid losing the majority of sequences. LSU sequences are considered “full length” if they are at least 1,900 bases long.
More stringent thresholds for alignment quality and identity are applied for the Ref datasets. Consequently, the Ref datasets contain considerably fewer sequences than the Parc datasets: about one-fourth in the case of the SSU Parc database and about one-tenth for the LSU Parc database (ratios for the SILVA 111 release). Sequences originating from the “Human Skin Microbiome” (HSM) (Grice et al. 2009), the “Mouse Wound Microbiota” (MWM) (Grice et al. 2010), and the “Guerrero Negro Hypersaline Microbial Mat” (GNHM) large-scale sequencing projects are excluded from the SSU Ref dataset. Instead, these sequences, with more than 490,000 (SILVA 111) long sequence reads in total, are provided in a dedicated dataset. This is done to further restrict the size of the SILVA SSU Ref dataset and to avoid overrepresentation of sequences of a specific origin.
For both SILVA Ref datasets, the ARB files are supplemented with a manually classified “guide tree,” incrementally built using the ARB parsimony tool with filters to remove highly variable positions and followed by removal of sequence entries represented by anomalous tree branch lengths. These trees also represent the basis for the SILVA taxonomy (see section “SILVA Taxonomy” below).
SILVA Ref NR (Nonredundant)
For users interested in a representative SSU rDNA sequence collection, the SILVA project offers a nonredundant (NR) version of the SSU Ref subset. This dataset is created by applying clustering at 99 % (up to SILVA 108) and 98 % (from SILVA 111 on) sequence identity. Of each cluster, only the longest sequence is kept. This reduces the size of the dataset to less than 50 % of its original size, even though the sequences omitted in the SSU Ref dataset from the HSM, MWM, and GNHM projects (see above) are included for clustering. Sequences from cultivated species are preserved in all cases to serve as an anchor for taxonomy. The resulting SSU Ref NR dataset with its manually curated “guide tree” can be used as a representative dataset for classification, phylogenetic analysis, and probe design. It is the recommended dataset to be used as a starting point for all users interested in environmental rDNA sequence analysis.
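The length criteria for the Parc and Ref datasets and the "longest sequence per cluster" rule of the Ref NR dataset can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SILVA pipeline: the `Entry` record, the function names, and the single `aligned_bases` field are hypothetical, the clusters are assumed to be precomputed (e.g., at 98 % identity), and the stricter alignment quality and identity thresholds applied to the Ref datasets are not modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Length thresholds quoted in the text.
PARC_MIN_ALIGNED = 300
REF_MIN_SSU = 1200
REF_MIN_SSU_ARCHAEA = 900
REF_MIN_LSU = 1900


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one rDNA sequence (not a SILVA data structure)."""
    accession: str
    gene: str            # "SSU" or "LSU"
    domain: str          # "Bacteria", "Archaea", or "Eukarya"
    aligned_bases: int   # simplification: one length value for all checks
    cultivated: bool = False


def ref_threshold(entry: Entry) -> int:
    """Return the 'full-length' cutoff that applies to this entry."""
    if entry.gene == "LSU":
        return REF_MIN_LSU
    if entry.domain == "Archaea":
        return REF_MIN_SSU_ARCHAEA
    return REF_MIN_SSU


def datasets_for(entry: Entry) -> Set[str]:
    """Length-based dataset membership; Ref is a subset of Parc."""
    if entry.aligned_bases < PARC_MIN_ALIGNED:
        return set()
    if entry.aligned_bases >= ref_threshold(entry):
        return {"Parc", "Ref"}
    return {"Parc"}


def nr_representatives(clusters: Iterable[List[Entry]]) -> List[Entry]:
    """Keep the longest entry of each cluster plus every cultivated entry."""
    kept: List[Entry] = []
    for cluster in clusters:
        if not cluster:
            continue
        longest = max(cluster, key=lambda e: e.aligned_bases)
        kept.extend(e for e in cluster if e is longest or e.cultivated)
    return kept
```

For example, an archaeal SSU sequence with 950 aligned bases would fall into both Parc and Ref under these rules, whereas a bacterial SSU sequence of the same length would be in Parc only.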
SILVA Taxonomy
A substantial revision of the classification of all prokaryotic sequences in the Ref datasets was first published with SILVA release 100. Based on the “guide trees,” all phylogenetic assignments are manually curated, taking into account taxonomic information provided by Bergey’s Taxonomic Outline of the Prokaryotes (Garrity et al. 2004); the taxonomic outlines for volumes 3, 4, and 5 of Bergey’s Manual; and the List of Prokaryotic names with Standing in Nomenclature (Euzéby 1997). Furthermore, extensive effort is spent to represent prominent uncultured and not validly published environmental clades, groups, and taxa, respectively. The majority of these clades and groups are annotated in the “guide tree” based on literature surveys and personal communications. Taxonomic groups consisting only of sequences from uncultured organisms are named after the clone sequence submitted earliest. Due to this exhaustive manual approach, SILVA currently contains the most up-to-date and detailed bacterial and archaeal taxonomic classification.
To also create an improved and unified taxonomy for Eukarya based on 18S rDNA sequences, the Eukaryotic Taxonomy Working Group (ETWG) was founded in October 2011. The first version of these efforts is deployed with SILVA release 111.
SILVA SEED Datasets
SILVA uses customized and specialized reference datasets for specific tasks within its software pipeline. Such internal reference datasets are called SEEDs.
SEEDs for Alignment
As of July 2012, the SEED used for SSU rRNA gene sequence alignment has 50,000 alignment positions including all gaps and consists of about 57,000 high-quality, aligned SSU rDNA reference sequences. The alignment SEED of the LSU rRNA gene comprises 150,000 positions but includes only about 3,000 aligned sequences. Both SEEDs contain representative sequences from the Bacteria, Archaea, and Eukarya domains and are manually curated and continuously enhanced.
SEEDs for Quality Control
The SEED used for the detection of sequence anomalies in SSU sequences is based on the corresponding alignment SEED with all sequences removed if any indication of an anomaly was found. This reduces the size of the SEED by a factor of 6. The detection of anomalies is not done for LSU rDNA sequences because none of the available tools can be applied.
For identification of vector contaminations, a SEED based on the EMVEC (EBI) and UniVec (NCBI) reference datasets is used, with all sequences resembling an rDNA sequence removed.
Data Retrieval
Three strategies are applied to retrieve SSU and LSU rDNA sequences from EMBL-Bank:
• A keyword search is used to extract annotated SSU and LSU rDNA sequences. Additionally, a set of relaxed keywords is applied to account for sequences with spelling mistakes in the annotation.
• A whitelist taken from the Ribosomal Database Project (RDP) (Cole et al. 2005) is used to retrieve sequences that are not covered by the keyword search.
• HMMs (one for each of the three domains of life for both LSU and SSU) taken from the RNAmmer tool (Lagesen et al. 2007) are searched against the complete EMBL-Bank. Sequences that match one of the HMMs and were not already imported by one of the two previous approaches are added.
In all cases, the entries in the datasets are flagged by their origin of retrieval.
Alignment
After import, sequences are aligned using the SINA software (SILVA Incremental Aligner)
(Pruesse et al. 2012). Similar to the ARB project, the tool follows the concept of an incremental alignment. Briefly, no de novo multiple sequence alignment is created; instead, the highly accurate manual alignment of closely related sequences found in the corresponding alignment SEED is used as a template to align each sequence included in the SILVA datasets. This approach guarantees a high-quality alignment of rDNA sequences.
Quality Control
Every imported and aligned SSU and LSU gene sequence has to pass a multistage quality inspection to assure the high quality of the SILVA datasets. Sequences are checked for sequence and alignment quality using various parameters. Sequences are excluded from the SILVA releases in case they fail any of the applied tests or show reduced quality based on combined quality values. Additionally, sequences are tested for anomalies but no filtering is done by the SILVA project based on these results. The information is provided to the users for individual filtering of the datasets, if required.
Detailed statistics on the SILVA quality control can be found on the SILVA web portal for all SILVA releases, e.g., http://www.arb-silva.de/documentation/background/release-111/.
Sequence Quality
The SILVA sequence quality checks test for ambiguous bases, extended homopolymeric stretches, and vector contaminations.
Ambiguous bases are nucleotides that are valid characters according to the International Union of Pure and Applied Chemistry (IUPAC) DNA encoding but do not resolve to “A”, “C”, “G”, or “T”. A maximum of two percent of ambiguous nucleotides within the rRNA gene boundaries is allowed by SILVA.
Homopolymers are stretches of identical nucleotides that commonly appear with a maximum of up to four nucleotide repetitions in native rDNAs. In contrast, extended stretches within a sequence represent an indication of reduced sequence quality caused by the sequencing process. As a consequence, if homopolymers of five or more nucleotides are found within a sequence and these stretches account for more than 2 % of the sequence within the rRNA gene boundaries, the sequence is excluded from the SILVA datasets.
Unaligned overhangs of a sequence are checked against the vector SEED using BLAST (Altschul et al. 1998) to identify cloning artifacts. If it is likely that the unaligned part of a sequence is a vector sequence and the unaligned part is longer than the aligned part, the sequence is excluded from SILVA. Sequences in SILVA are not allowed to contain more than 2 % vector contamination.
The three parameters are combined into an overall “sequence quality” value. This score represents the mean of the three individual parameters. It is normalized to values in the range of 0–100, such that 100 represents the best possible quality of a sequence.
All thresholds to reject a sequence were defined based on statistical analysis of the retrieved SSU and LSU rDNA sequences.
Alignment Quality
Four characteristics of the alignment process are evaluated in the pipeline and a sequence is rejected if it fails to pass one of these: the base-pair score, the alignment quality, the alignment identity, and the alignment length within the boundaries of the rRNA gene.
The base-pair score is calculated from the number of bases involved in helix binding according to the secondary structure model of Gutell et al. (1994).
The alignment quality score is a measure of the identity of the query sequence to the reference sequences that are used as a template for the alignment. High values (>90) indicate that closely related sequences have been found in the alignment SEED and that the resulting alignment is likely to be accurate. Low values suggest that further manual inspection of the particular sequence is needed.
Additionally, the alignment identity of the query sequence to its closest relative in the
alignment SEED is considered to guarantee the specificity of the alignment. Two positions in the alignment are considered identical if both positions have the same unambiguous nucleotide according to the IUPAC encoding.
To fit the SILVA unified scoring scheme, the base-pair and alignment quality scores are normalized to values between 0 and 100, such that 100 represents the maximum score.
Chimera/Anomaly Detection
To detect sequence anomalies, a customized version of the Pintail software (Ashelford et al. 2005) is used. This software checks whether a pair of sequences is mutually anomalous (e.g., chimeric) by computing a distance profile and comparing it to a predicted distance profile. The result is “yes,” “likely,” or “no,” depending on the amount of measured deviation from expectation. From this operation, the SILVA Pintail score is constructed by running each sequence against the ten most similar sequences retrieved from the chimera SEED. Sequences that have passed all tests with “no” (not anomalous) get a score of “100 %,” whereas all tests returning “likely” would yield a 50 % score. Only SSU sequences are checked for anomalies because the Pintail software does not contain profiles for sequences other than 16S rDNAs.
Third-Party (Meta) Data
One of the unique features of the SILVA datasets is extensive data integration based on various third-party resources and manifold linkage of the SILVA database entries to external data sources.
Taxonomies
Every sequence in the SILVA databases carries the EMBL-Bank taxonomy assignment. Where available, the greengenes (DeSantis et al. 2006) and RDP (Cole et al. 2005) taxonomies are added for comparison. All entries of the SILVA Ref datasets are also assigned to the taxonomy of the SILVA project (see section “SILVA Taxonomy”). For LSU rDNA sequences, only the EMBL-Bank and SILVA taxonomy are available due to a lack of additional resources.
Nomenclature
With every SILVA release, all organism names are updated according to the “Nomenclature Up-to-Date” website of the “Deutsche Sammlung für Mikroorganismen und Zellkulturen” (DSMZ). All synonyms and name replacements are recorded.
Strain Annotation
The strain field of an entry in the SILVA datasets is annotated using SILVA-specific labels if an entry matches one or more of the following criteria:
• The label “e[G]” is added if an entry is part of the list of genomes offered by the EBI.
• The label “l[T]” is added if the entry is part of the type strain datasets of “The All-Species Living Tree” Project (Munoz et al. 2011).
• The label “s[T]” is added if an entry is listed as a type strain by the StrainInfo project (Dawyndt et al. 2005).
• The label “s[C]” is added if an entry is a cultured strain according to the StrainInfo project.
• The label “r[T]” is added if an entry is listed as a type strain by the RDP project.
Furthermore, manually curated habitat information and GPS coordinates are assigned to each entry based on information provided by the megx.net project (Kottmann et al. 2010).
SILVA Website/Online: Service
One of the problems associated with the ever increasing amount of sequences is the hardware resources required to store and analyze the data. As a response to allow users to still work with these datasets, features requesting comprehensive reference datasets such as probe and primer evaluation for testing the in silico accuracy of oligonucleotide signatures are now offered by the SILVA web portal. Additionally, the SILVA website offers extensive data retrieval
SILVA Databases, Fig. 2 The entry of Amorphus coralli (DQ097300) within the genus Amorphus displayed in the
SILVA “Taxonomy Browser”
functions for the compilation of individual sequence subsets from the comprehensive online database as well as preconfigured, quality-constrained subsets for direct download.
Taxonomy Browsing, Searching, and Download Cart
The SILVA “Taxonomy Browser” allows navigation through a selected taxonomy by clicking on the respective nodes. The browser starts with showing all taxonomic groups of the highest level of the selected taxonomy. By selecting one of these groups, a new list view appears with all subgroups, preserving the former levels within a horizontal scroll bar layout. If a sequence entry is selected, a detailed summary will be opened. This summary shows full annotation of an entry and a traffic-light-like view of the main quality parameters (Fig. 2).
The browser can also be used to create customized subsets of the SILVA databases and to display the results of the online services provided by SILVA. For each taxonomic group in the browser, the fraction of corresponding sequences in the cart can be highlighted (Fig. 2).
The advanced search functionality offered on the SILVA website allows the user to easily compile custom subsets of sequences. Besides simple searches, e.g., for accession numbers, organism names, taxonomic entities, or publication DOI/PubMed IDs, complex queries including several database fields are also possible. Constraints such as the sequence length or quality values can be used to further filter the sequences. Customized sequence subsets compiled by the user including the results of the SILVA online services can be collected in the SILVA cart system and downloaded in various formats.
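The quality values that can serve as search constraints derive from checks such as those in the section “Sequence Quality.” A minimal Python sketch of the ambiguity and homopolymer tests follows. The function names are hypothetical, the whole input string is assumed to lie within the rRNA gene boundaries, and counting the full length of every run of five or more identical bases is one interpretation of the text rather than the documented SILVA implementation.

```python
import re

# IUPAC nucleotide codes that do not resolve to a single base (A, C, G, T/U).
AMBIGUITY_CODES = frozenset("RYSWKMBDHVN")
# Runs of five or more identical unambiguous bases.
EXTENDED_HOMOPOLYMER = re.compile(r"([ACGTU])\1{4,}")
# Both checks use the 2 % limit quoted in the text.
MAX_PERCENT = 2.0


def _prepare(seq: str) -> str:
    """Upper-case the sequence and reject empty input."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return seq


def percent_ambiguous(seq: str) -> float:
    """Percentage of positions holding an IUPAC ambiguity code."""
    seq = _prepare(seq)
    return 100.0 * sum(ch in AMBIGUITY_CODES for ch in seq) / len(seq)


def percent_homopolymer(seq: str) -> float:
    """Percentage of positions covered by runs of >= 5 identical bases."""
    seq = _prepare(seq)
    covered = sum(len(m.group(0)) for m in EXTENDED_HOMOPOLYMER.finditer(seq))
    return 100.0 * covered / len(seq)


def passes_sequence_filters(seq: str) -> bool:
    """True if neither percentage exceeds the 2 % limit."""
    return (percent_ambiguous(seq) <= MAX_PERCENT
            and percent_homopolymer(seq) <= MAX_PERCENT)
```

Vector contamination, the third sequence quality parameter, depends on a BLAST search against the vector SEED and is therefore left out of this sketch.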
SILVA Databases, Fig. 3 The web interface and results of the SILVA “TestProbe” service
Alignment, Sequence-Based Searches, and Classification
Users can align their own sequences using the SILVA SSU and LSU SEEDs with a fully configurable online version of the SILVA aligner (SINA). The aligned sequences can be downloaded in either ARB or FASTA file formats.
Submitted sequences can also be searched against one of the predefined datasets (Parc, Ref, or NR). This function will return a list of closely related sequences which can be added to the cart system for building and downloading customized datasets.
Finally, the “Least Common Ancestor” feature of the aligner can be used to classify sequences against any of the taxonomies provided by the SILVA project.
TestProbe
The SILVA probe match and evaluation tool called “TestProbe” detects and displays all occurrences of a given probe or primer sequence within any specified SILVA datasets or subsets thereof. It is offered to test and visualize in silico specificity and target group coverage (sensitivity) of rDNA-targeting probes and single primers against the SILVA datasets. The tool can be
SILVA Databases, Fig. 4 The web interface and results of the SILVA “TestPrime” service
configured to allow up to five mismatches between probe and target sequences and mismatches can be weighted. The resulting number of matches and non-matches is shown as a set of pie charts (Fig. 3), and an additional list provides sequence names, accession numbers, and a graphical representation of the probe’s binding site within all matches. Sequences in this list can be added to the cart system for subsequent download.
TestPrime
Similar to the SILVA “TestProbe” tool, “TestPrime” allows searching for all sequences within the SILVA datasets or subsets thereof which are targeted by a given pair of primers. The number of allowed mismatches can be configured and results are shown in overview pie charts (Fig. 4) and the corresponding sequences can be selected for download.
Summary
The SILVA project provides comprehensive, quality-controlled, richly annotated, and aligned reference rDNA datasets to support the molecular assessment of biodiversity, as well as
investigations of the evolution of organisms. Applications of these datasets range from basic research in microbiology and molecular ecology to the detection of contaminants and pathogens in biotechnology and medicine. The taxonomically fully classified Ref and Ref NR datasets are perfectly suited for the classification of metagenomic or amplicon-based next-generation sequencing data.
The combination of SILVA datasets with the ARB software suite provides an easy to use workbench for researchers to perform in-depth sequence analysis and phylogenetic reconstructions as well as manual curation of rDNA datasets. Furthermore, the SILVA datasets have become an integral part of the MOTHUR (Schloss et al. 2009), QIIME (Caporaso et al. 2010), and MG-RAST (Meyer et al. 2008) analysis tools and pipelines.
Cross-References
▶ A 123 of Metagenomics
▶ Computational Approaches for Metagenomic Datasets
References
Altschul S, Madden T, et al. BLAST and PSI-BLAST: a new generation of protein database search programs. FASEB J. 1998;12:A1326.
Ashelford KE, Chuzhanova NA, et al. At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol. 2005;71:7724–36.
Caporaso JG, Kuczynski J, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–6.
Cole JR, Chai B, et al. The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res. 2005;33:D294–6.
Dawyndt P, Vancanneyt M, et al. Knowledge accumulation and resolution of data inconsistencies during the integration of microbial information sources. IEEE Trans Knowl Data Eng. 2005;17(8):1111–26.
DeSantis TZ, Hugenholtz P, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006;72:5069–72.
Euzéby JP. List of bacterial names with standing in nomenclature: a folder available on the internet. Int J Syst Bacteriol. 1997;47(2):590–2.
Garrity GM, Bell JA, et al. Taxonomic outline of the prokaryotes. Bergey’s manual of systematic bacteriology. 2nd ed. New York: Springer; 2004. Release 5.0.
Grice EA, Kong HH, et al. Topographical and temporal diversity of the human skin microbiome. Science. 2009;324:1190–2.
Grice EA, Snitkin ES, et al. Longitudinal shift in diabetic wound microbiota correlates with prolonged skin defense response. Proc Natl Acad Sci U S A. 2010;107:14799–804.
Gutell RR, Larsen N, et al. Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective. Microbiol Rev. 1994;58:10–26.
Kottmann R, Kostadinov I, et al. Megx.net: integrated database resource for marine ecological genomics. Nucleic Acids Res. 2010;38:D391–5.
Lagesen K, Hallin P, et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35:3100–8.
Ludwig W, Strunk O, et al. ARB: a software environment for sequence data. Nucleic Acids Res. 2004;32(4):1363–71.
Meyer F, Paarmann D, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386.
Munoz R, Yarza P, et al. Release LTPs104 of the all-species living tree. Syst Appl Microbiol. 2011;34:169–70.
Pruesse E, Peplies J, et al. SINA: accurate high throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics. 2012;28:1823–9.
Schloss PD, Westcott SL, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.
Simultaneous Quantification of Multiple Bacteria
Annalisa Ballarini and Olivier Jousson
Laboratory of Microbial Genomics, Centre for Integrative Biology (CIBIO), University of Trento, Trento, Italy
Synonyms
Composition assessment; Abundance determination
Definition
The term “quantification” derives from the Latin terms quant (meaning “how much”) and facere (meaning “to make”). Quantifying is the act of determining or measuring the quantity of something. Simultaneous quantification of multiple bacteria stands for medium- to high-throughput detection and abundance determination of a bacterial community.
Introduction
Bacteria are widespread and abundant in the environment and comprise bacterial strains which are pathogenic for plants, animals, or human beings. Historically, the threat represented by pathogenic bacteria led to the development of culture-, serological-, and molecular-based methods to identify and possibly quantify the causative agent of an occurring infectious disease. These methods commonly target with high sensitivity and specificity one or few bacterial strains or species, but are not suitable for collecting comprehensive information on a microbial community. Recently, the advent of high-throughput sequencing technologies and the extensive metagenomic studies performed on human microbiomes have shown how shifts in bacterial composition and quantity may also correlate locally or systemically with the health status of the human host. These high-throughput metagenomic technologies can provide comprehensive information on microbial composition, functions, and dynamics, accelerating the development of complementary or alternative methods for environmental studies, clinically oriented studies, and routine diagnostics. The methods for simultaneous quantification of multiple bacteria are based on the following techniques: PCR, microarray, or high-throughput sequencing.
End-point and real-time quantitative PCRs are widely used techniques for identification and/or quantification of bacteria. The PCR is based on a primer extension reaction catalyzed by DNA polymerase, thus requiring a priori knowledge of the potential target bacteria. These techniques amplify a few copies of the target DNA to millions of copies after 30–40 cycles, ensuring high sensitivity of detection. Being also user-friendly, fast, and cost-effective, these techniques are broadly used for clinical detection of potential pathogens. Multiplexing can also be applied by means of multiple primer pairs hybridizing to different target sequences. However, the multiplexing capacity is limited, as simultaneous detection of a high number of bacteria leads to decreased sensitivity, increased costs, and bacterial misidentifications. Thus, these techniques cannot reach the throughput required for bacterial community assessment.
High-throughput sequencing (see entries “▶ Approaches in Metagenome Research: Progress and Challenges”; and “▶ Metagenomic Research: Methods and Ecological Applications”) is the most widely used method for metagenomic studies, allowing the highest throughput and in-depth determination of bacterial communities’ composition without the requirement of a priori knowledge. The advantage of this technology is to be able to detect not only well-characterized microbes but also variant strains (Roh et al. 2010). It can be applied either for deciphering all genetic information including functional classes’ composition via whole-genome shotgun sequencing or for defining exclusively the taxonomic composition of bacterial communities by targeted sequencing of phylogenetic markers. Despite the continuous cost reductions, whole-genome shotgun sequencing still requires high expense, time, and complex computational resources (see entry “▶ Computational Approaches for Metagenomic Datasets”). For these reasons, this method is prohibitive for prolonged clinical studies or routine diagnostics. On the other hand, targeted sequencing is less costly and complex in data processing but commonly addresses the 16S rRNA gene, which was reported to have a phylogenetic resolution limited to the family or genus level for several clades.
Microarray-based methods allow parallel detection of a large number of sequences in a single hybridization. Like PCR-based assays, they require a priori knowledge of the potential targets. Their feature is to combine a medium
to high throughput with a small format, rapid and automated processing, and a low cost per sample. Thus, they sit between PCR and sequencing and are well suited for cost-effective clinically oriented studies and informative routine diagnostics. Moreover, applied to metagenomic samples, they can provide a useful complementary approach to high-throughput sequencing, determining the appropriate sequencing depth according to the complexity of the bacterial communities.
The overview below focuses on DNA microarrays as tools for simultaneous quantification of multiple bacteria and, more extensively, on well-known 16S microarrays and the recently developed BactoChip, a multi-marker phylogenetic microarray.

DNA Microarrays for Simultaneous Detection of Multiple Bacteria

DNA microarray technology allows high-throughput screening of nucleic acid sequences for complementary binding. The sequences bound to the solid surface of the microarray may be synthetic oligonucleotides or DNA fragments, either synthesized directly or spotted on the surface. In the presence of the target DNA sample, nucleic acid hybridization can occur. The stringency and rate of hybridization can be controlled by varying temperature, salt concentration, and washes (passive hybridization) or by applying electric fields on a microelectronic device (active hybridization). After hybridization, unbound template DNA is washed away and the bound template is detected, using dedicated scanners, mostly by means of fluorescent labels or enzymatically active moieties previously incorporated in the DNA sample. The high-affinity binding of template nucleic acids to their complementary target can be used for the identification of microorganisms and relative abundance determination. Key steps for highly accurate bacterial detection are target DNA region selection and probe design. According to the target regions chosen, microarrays for bacterial profiling can be classified in two main categories: microbial function or phylogenetically targeted.
Functional gene microarrays mostly target a combination of genes from several functional classes. The feature of these microarrays is to define the capabilities of the bacterial community under investigation rather than its composition. For clinical purposes, arrays commonly target functional genes belonging to virulence and antibiotic resistance gene families (Jaing et al. 2008). Environmental applications focus instead on the known functions of the specific bacterial niche under study.
An example of this microarray type is the GeoChip, which was developed for characterizing microbial communities isolated from the environment at both the structural and functional level. It proved successful in association with high-throughput approaches to provide in-depth information on defined environmental niches, such as sulfate-reducing bacterial communities important to environmental cleanup [see entry “▶ GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and Activities”].
The second category is represented by phylogenetic oligonucleotide microarrays (see entry “▶ Phylogenetics, Overview”), which instead target phylogenetic marker genes and can be based on a single or multiple marker approach. In contrast to functional gene arrays, they aim at discriminating bacteria by defining their identity. Single marker phylogenetic arrays target variants of universally conserved rRNA sequences. Some of the publicly described chips targeting single phylogenetic markers are listed here.
An example of a panmicrobial 16S-based microarray is the GreenChipPm (Palacios et al. 2007). This single marker detection array was designed to target respiratory pathogens (vertebrate viruses, fungi, bacteria, and protozoa). It includes, among others, probes designed to specifically bind variable segments of the 16S rRNA gene, the best-known “universal” bacterial marker for phylogenetic determination.
The PhyloChip™ (Second Genome, San Francisco) is an oligonucleotide microarray targeting the segments of the single 16S rRNA
gene for high-throughput detection of microbial communities in both environmental and clinical samples (Brodie et al. 2007; Ghosh et al. 2009; Wu et al. 2010). Several versions were developed, the latest being the G3 version (Hazen et al. 2010). The G3 comprises 1.1 million DNA probes and covers nearly 60,000 operational taxonomic units. To increase reliability, some microarray designs, including the PhyloChips, define multiple target regions within the marker, adopting a so-called multiple probe concept to increase the overall accuracy of the detection score.
HOMIM (Preza et al. 2009) is the acronym for Human Microbial Identification Microarray, a tool developed to detect simultaneously 300 bacterial species from the oral microbiome, including non-cultivable ones. The target bacteria were selected among the ones identified by 16S rRNA sequencing in healthy roots and root caries in the elderly. Experiments performed with this array showed general agreement with the results of 16S rRNA gene sequencing analysis. Since 2008, a core facility at the Forsyth Institute (Cambridge, Massachusetts) has provided a service, based on this platform, to rapidly screen clinical samples from the oral cavity, esophagus, and lungs.
The HITChip (human intestinal tract chip) (Rajilic-Stojanovic et al. 2009) is a microarray-based metagenomic tool designed for profiling the human gastrointestinal microbiota. This phylogenetic microarray comprises 4,809 oligonucleotide probes and discriminates 1,140 species via two hypervariable regions of the small subunit ribosomal RNA (SSU rRNA) gene. The validation performed with SSU rRNA clones and clinical samples proved that this microarray provides a highly reproducible fingerprint and also has quantification potential. In particular, tests performed with synthetic mixtures showed it can detect 40 different amplicons, including those with a relative abundance of 0.1 %. The HITChip was shown to correctly identify a universal microbiota at genus-level resolution.
Overall, the most used phylogenetic marker is the 16S rRNA gene. The presence of highly conserved regions flanking the variable 16S rRNA target sequence allows introducing a PCR-based amplification step for bacterial target enrichment (e.g., GreenChipPm). This pre-amplification step ensures an increased sensitivity in detection but does increase the processing time and may introduce biases in relative abundance quantitation.
Besides that, being universally conserved within the bacterial kingdom, the 16S rRNA gene may not be sufficient for specific and reproducible bacterial identification, especially in complex systems. In fact, the high conservation of this gene across taxa has been reported to cause cross-hybridization events, affecting both resolution and abundance determination, and to fail to discriminate below the genus level for many clades.
Recently, as an alternative to 16S or single marker arrays for microbial profiling, a multiple marker phylogenetic microarray has been designed, the BactoChip (Ballarini et al. 2013). The array design was based on the notion that metagenomic sequencing data offer a powerful view on the microbial diversity of the sampled communities and that an increasing number of complete and annotated bacterial genomes are publicly available.

The BactoChip: A Multi-marker Phylogenetic Microarray for Species-Level Resolution

The BactoChip (Ballarini et al. 2013) was designed with the aim of overcoming the issues of resolution and abundance determination of 16S-based microarrays and thus approaching the throughput and specificity of sequence-based techniques. To date, one version of the BactoChip has been described, detecting via a PCR-independent approach a set of 54 bacterial species belonging to multiple genera of clinical interest. The number of target bacteria was limited by the availability of typed strains for experimental validation and of complete bacterial genome sequences for computational microarray design. However, the developed method for marker selection may be extended to the whole microbial world, thus allowing high accuracy of
microbial composition assessment even in complex samples.
Computational and Experimental Design. The BactoChip in silico design is based on the knowledge deriving from metagenomic datasets and complete bacterial genome sequences. The computational tool for DNA marker identification employed a pairwise identity threshold above 99 % to define core genes for most species, where core genes are those shared by all available sequenced strains of the same species. Unique genes (i.e., core genes unique for each bacterial species) were then selected by removing all core genes with blastn hits outside the target species. Probes targeting an average of 10 markers per bacterial species were designed to have similar physicochemical parameters and were directly synthesized on “custom high-definition Agilent DNA Comparative Genomic Hybridization arrays 8x15K” (Agilent Technologies, Santa Clara, CA, USA). Besides internal control and other probes, the BactoChip includes 2,094 marker gene probes targeting 54 bacterial species.
Testing on Pure Isolates, Synthetic Communities, and Clinical Samples. The BactoChip was validated by performing hybridization experiments with 37 bacterial species individually, multiple congeneric species, and synthetic bacterial communities of up to 15 microorganisms. It was also tested with oral microbiomes from two healthy subjects spiked with 5 different species at known relative abundance. Single reference strains used for validation were collected from the LGC Standards ATCC, the Leibniz Institute DSMZ, or university hospitals. Synthetic communities were obtained by mixing single strains in known DNA quantities. Oral microbiomes were collected from saliva, DNA was extracted with standard protocols, and the bacterial load was determined by real-time PCR. The BactoChip unambiguously identified almost all tested species (97.3 %) from 19 genera with near-perfect accuracy (AUC > 0.99). In case of malfunctioning probes (false negative or false positive), the presence of multiple probes per marker gene and multiple genes per species prevented species misidentification. Testing performed with multiple congeneric bacterial species from the Staphylococcus genus showed how this microarray design can resolve to the species level even genera known to be poorly resolved by the 16S marker gene. The performance of the BactoChip in identifying bacteria and determining relative abundances was tested by means of synthetic bacterial communities comprising 9 and 15 different species at even and staggered concentrations. The species-level specificity was also confirmed in this experimental setting. The microarray quantified both bacterial communities with high accuracy, with an overall high correlation (0.97, p < 10^−10) between reference relative abundance values and estimated ones. Experiments performed on saliva microbiomes isolated from healthy volunteers, spiked with reference species in known amounts, proved the feasibility of this approach for microbiome profiling and detected the native and spiked-in species within clinical samples over a 100-fold dynamic range.

Summary and Conclusions

High-throughput metagenomic technologies have provided an extensive amount of data on microbial composition, functions, and dynamics, accelerating the development of complementary or alternative methods for environmental studies, clinically oriented studies, and routine diagnostics. Next-generation sequencing technology undoubtedly yields, without the need for a priori knowledge, the maximum amount of information on the sequence composition of a microbial sample. However, this technology requires complex computational analyses to extract the information of interest and still entails high costs and processing times. Among the alternative molecular-based techniques currently available (multiplex, real-time PCR, or array-based assays), microarrays represent the most promising technique for parallel detection and relative abundance quantitation of bacteria within complex microbial samples, combining high throughput with a user-friendly rapid protocol and a low cost per sample. Besides
functional microarrays, which commonly target defined environmental niches and aim at the classification of functional classes (e.g., GeoChip), microarrays for microbial identification are phylogenetic based. To date, ribosomal genes (in particular the 16S) are the most used phylogenetic markers for microbial profiling through microarrays (e.g., GreenChipPm, PhyloChip™, HOMIM, HITChip). However, because the 16S gene is highly conserved throughout the bacterial kingdom, it is difficult to resolve bacteria below the family or genus level for some clades. Metagenomic data available on human body site samples have shown how defining the bacterial profile at species level may generate a more in-depth understanding of the relation between bacterial composition and health. Recently, a multi-marker phylogenetic microarray was described (the BactoChip) which proved to be highly specific in bacterial species identification, feasible for microbial profiling, and reliable for relative abundance quantification over a 100-fold dynamic range, even within complex ecosystems. Being based on complete genomic sequences, the BactoChip array design relies on a smaller number of reference sequences available in public sequence databases than the historically used 16S rRNA phylogenetic marker sequences. However, the exponentially increasing number of complete bacterial genome sequences will soon fill this gap, allowing an optimized marker selection for accurate microbial profiling both for the ecosystem and the human body.

Cross-References

▶ Approaches in Metagenome Research: Progress and Challenges
▶ Computational Approaches for Metagenomic Datasets
▶ Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes
▶ GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and Activities
▶ Metagenomic Research: Methods and Ecological Applications
▶ Phylogenetics, Overview

References

Ballarini A, Segata N, Huttenhower C, Jousson O. Simultaneous quantification of multiple bacteria by the bactochip microarray designed to target species-specific marker genes. PLoS One. 2013;8(2):e55764.
Brodie EL, DeSantis TZ, Parker JPM, Zubietta IX, Piceno YM, Andersen GL. Urban aerosols harbor diverse and dynamic bacterial populations. Proc Natl Acad Sci U S A. 2007;104:299–304.
Ghosh D, Roy K, Williamson KE, Srinivasiah S, Wommack KE, Radosevich M. Acyl-homoserine lactones can induce virus production in lysogenic bacteria: an alternative paradigm for prophage induction. Appl Environ Microbiol. 2009;75:7142–52.
Hazen TC, Dubinsky EA, DeSantis TZ, Andersen GL, Piceno YM, Singh N, Jansson JK, Probst A, Borglin SE, Fortney JL, Stringfellow WT, Bill M, Conrad ME, Tom LM, Chavarria KL, Alusi TR, Lamendella R, Zhou J, Mason OU. Deep-sea oil plume enriches indigenous oil-degrading bacteria. Science. 2010;330:204–8.
Jaing C, Gardner S, McLoughlin K, Mulakken N, Alegria-Hartman M, Banda P, Williams P, Gu P, Wagner M, Manohar C, et al. A functional gene array for detection of bacterial virulence elements. PLoS One. 2008;3(5):e2163.
Palacios G, Quan P-L, Jabado O, Conlan S, Hirschberg D, Liu Y. Panmicrobial oligonucleotide array for diagnosis of infectious diseases. Emerg Infect Dis. 2007;13(1):73–81.
Preza D, Olsen I, Willumsen T, Boches SK, Cotton SL, Grinde B, Paster BJ. Microarray analysis of microflora in root caries of the elderly. Eur J Clin Microbiol Infect Dis. 2009;28(5):509–17.
Rajilic-Stojanovic M, Heilig HGHJ, Molenaar D, Kajander K, Surakka A, Smidt H, de Vos WM. Development and application of the human intestinal tract chip, a phylogenetic microarray: analysis of universally conserved phylotypes in the abundant microbiota of young and elderly adults. Environ Microbiol. 2009;11(7):1736–51.
Roh SW, Abell GCJ, Kim K-H, Nam YD, Bae JW. Comparing microarrays and next-generation sequencing technologies for microbial ecology research (review). Trends Biotechnol. 2010;28(6):291–9.
Wu CH, Sercu B, Van De Werfhorst LC, Wong J, DeSantis TZ, Brodie EL, Hazen TC, Holden PA, Andersen GL. Characterization of coastal urban watershed bacterial communities leads to alternative community-based indicators. PLoS One. 2010;5:e11285.
STAMP: Statistical Analysis of Metagenomic Profiles

Donovan H. Parks1,2 and Robert G. Beiko1
1 Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
2 Australian Centre for Ecogenomics, University of Queensland, Brisbane QLD, Australia

Definition

Cross-platform software providing statistical analyses and plots of taxonomic and functional profiles.

Introduction

Comparative metagenomic studies aim to understand differences in the structure and function of microbial communities from different habitats. Statistical approaches can be used to highlight differences between pairs of metagenomic samples or defined groups of samples (e.g., samples from sick and healthy individuals). STAMP (Statistical Analysis of Metagenomic Profiles) is a software platform for analyzing metagenomic profiles (Parks and Beiko 2010), such as taxonomic profiles indicating the number of marker genes assigned to different taxonomic units or functional profiles indicating the number of sequences contributing to a specific subsystem or pathway. It aims to promote best practices in reporting statistical results by encouraging the use of effect sizes and confidence intervals when assessing biological importance. A user-friendly, graphical interface permits easy exploration of statistical results and generation of publication-quality plots for inferring the biological relevance of features in a metagenomic profile. STAMP is open-source, extensible via a plug-in framework and available for all major platforms.

Defining Metagenomic Profiles and Sample Metadata

STAMP requires metagenomic profiles to be specified in a tab-separated text file. The first row of the file contains the header for each column. Columns indicating the hierarchical structure of a feature must be organized from the highest to lowest level in the hierarchy. There are no restrictions on the depth of a hierarchy and hierarchies may be multifurcating. However, hierarchies must form a strict tree structure (i.e., a child can have only one parent). The number of sequences or reads assigned to each leaf node in the hierarchy must be specified for each metagenomic sample. To allow for different normalization methods, counts may be integers or real numbers. An example STAMP profile is given in Table 1.
Several methods have been proposed for generating taxonomic or functional profiles from metagenomic data. STAMP supports analyzing profiles generated by MG-RAST (Meyer et al. 2008), IMG/M (Markowitz et al. 2008), mothur (Schloss et al. 2009), CoMet (Lingner et al. 2011), and RITA (MacDonald et al. 2012). Profiles generated using these software platforms can be converted to STAMP-compatible profiles using functionality provided within STAMP. The simple format of STAMP profiles helps ensure that results from other software platforms can be converted for processing by STAMP.
Additional data associated with each metagenomic sample can be defined through an optional tab-separated metadata file. The first column of this file indicates the name of each sample and should correspond to an entry in the

STAMP: Statistical Analysis of Metagenomic Profiles, Table 1 Example STAMP profile
Hierarchical level 1    Hierarchical level 2    Sample 1    Sample 2    Sample 3
Category A              Subcategory A1          0           4.4         4
Category A              Subcategory A1          3           5           5
Category A              Subcategory A2          4.8         3.5         2
Category B              Subcategory B1          2           32          6.5
Category C              Subcategory C1          1           2           2
Category C              Subcategory C1          7.2         6           4
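The profile format described above (header row, hierarchy columns ordered from highest to lowest level, then one count column per sample) can be illustrated with a short Python sketch. This is not STAMP's own parser: the function name `read_profile`, the `n_levels` argument, the choice to sum rows that share a hierarchy path, and the example data are all assumptions made for illustration.

```python
import csv
import io


def read_profile(text, n_levels):
    """Parse a tab-separated profile with n_levels hierarchy columns
    followed by one count column per sample (hypothetical helper)."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    samples = header[n_levels:]
    parent_of = {}  # (level, node name) -> parent name
    counts = {}     # hierarchy path -> per-sample counts
    for row in body:
        path = tuple(row[:n_levels])
        # Strict tree: a node may appear under one parent only.
        for level in range(1, n_levels):
            key = (level, path[level])
            if parent_of.setdefault(key, path[level - 1]) != path[level - 1]:
                raise ValueError(f"{path[level]!r} has more than one parent")
        values = [float(v) for v in row[n_levels:]]  # integers or reals
        if len(values) != len(samples):
            raise ValueError(f"row {path} does not match the header")
        # Assumption: rows sharing a path are summed rather than rejected.
        previous = counts.get(path, [0.0] * len(samples))
        counts[path] = [a + b for a, b in zip(previous, values)]
    return samples, counts


# Hypothetical two-level profile with two samples.
EXAMPLE = (
    "Level 1\tLevel 2\tSample 1\tSample 2\n"
    "Category A\tSubcategory A1\t3\t5\n"
    "Category A\tSubcategory A2\t4.8\t3.5\n"
    "Category B\tSubcategory B1\t2\t1.5\n"
)

samples, counts = read_profile(EXAMPLE, n_levels=2)
print(samples)
print(counts[("Category B", "Subcategory B1")])
```

Checking the one-parent rule while reading keeps the error close to the offending row, which is convenient when converting profiles from other tools.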
corresponding STAMP profile. Additional columns may specify any other data relevant to the samples being considered. Within STAMP, these additional columns can be used to define groups (i.e., collections of one or more samples) over which statistical tests and plots can be calculated. An example metadata file is given in Table 2.

STAMP: Statistical Analysis of Metagenomic Profiles, Table 2 Example metadata file
Sample Id    Location    Phenotype    Gender    Sample size
Sample 1     Canada      Obese        Female    4,000
Sample 2     Canada      Lean         Male      2,000
Sample 3     Italy       Lean         Female    3,000

Statistical Analysis of Metagenomic Profiles

STAMP provides statistics for assessing biologically relevant differences between pairs of metagenomic samples or treatment groups. Two-sample (e.g., Fisher’s exact test, G-test), two-group (Welch’s t-test, White’s nonparametric t-test), and multigroup (ANOVA, Kruskal-Wallis H-test) statistical hypothesis tests are provided for identifying statistically significant features. Features with p-values below a nominally chosen threshold (e.g., 0.05) can reasonably be assumed to be enriched or depleted due to ecological differences between samples or treatment groups as opposed to representing a sampling artifact. STAMP also reports effect size statistics such as the difference or ratio between proportions in order to aid in determining if a statistically significant feature is of biological relevance. Consideration of effect sizes is essential as small, biologically uninteresting differences may be statistically significant when sample sizes are large. Confidence intervals are computed for all effect size statistics. These indicate the range of effect size values that have a specified probability (typically 95 %) of being compatible with the observed data and are an important additional statistic for reasoning about biological relevance.
Metagenomic profiles typically consist of several hundred or thousand features. Care must be taken when performing multiple hypothesis tests. For example, a profile consisting of 1,000 features is expected to yield about 50 features with a p-value less than 0.05 simply due to chance variation, even if no true differences exist. STAMP provides two techniques for correcting p-values when multiple hypothesis tests are being performed. The first controls the familywise error rate using a correction method such as Bonferroni, Holm-Bonferroni, or Šidák. This adjusts the reported p-values so that the probability of observing one or more false positives is less than a specified probability. During data exploration, this approach can be too conservative and it may be beneficial to adjust the p-values using a false discovery rate procedure. Under this approach, a q-value is calculated for each feature that indicates the expected proportion of false positives within the set of features with a smaller q-value (Benjamini and Hochberg 1995). Additionally, STAMP can filter features using a number of criteria in addition to p- or q-values in order to focus on biologically interesting features, e.g., those with a large effect size or consisting of a substantial number of reads.

Exploration of Metagenomic Profiles

STAMP provides the following interactive, publication-quality plots for exploring metagenomic profiles:
Bar plots indicate the proportion of sequences of each feature within a pair of samples or the proportion of sequences of a single feature across all samples (Fig. 1a).
Box plots illustrate how the proportion of sequences of a single feature is distributed within different treatment groups using a box-and-whiskers graphic (Fig. 1b). Box-and-whiskers graphics show the median of the data as a line, the mean of the data as a star, the 25th and 75th percentiles of the data as the bottom and top of the box, and use whiskers to indicate the most extreme data point within 1.5*(75th–25th percentile) of
[Figure 1, panels a–d. Panels a and b plot the proportion of sequences (%) assigned to Bacteroides for Enterotypes 1–3; panel c plots PC1 (57.0 %) against PC2 (21.2 %); panel d shows mean proportion (%) and difference in mean proportions (%) with 95 % confidence intervals, with p-values of <0.001 for Enterotype 1 : Enterotype 3, <0.001 for Enterotype 1 : Enterotype 2, and ≥0.1 for Enterotype 3 : Enterotype 2.]

STAMP: Statistical Analysis of Metagenomic Profiles, Fig. 1 Exploration of the gut microbiota of 32 individuals reported by Arumugam et al. (2011) to form three distinct clusters or enterotypes. (a) Bar plot showing the relative proportion of Bacteroides. Samples are colored according to the enterotype to which they have been assigned. (b) Box plot showing the distribution in the proportion of Bacteroides from samples assigned to each enterotype. (c) Principal coordinate analysis plot determined from the proportion of reads assigned to each genus within a sample. (d) Post hoc plot for Bacteroides indicating (1) the mean proportion and standard deviation within each enterotype, (2) the difference in mean proportions between each pair of enterotypes along with 95 % confidence intervals, and (3) a p-value indicating if the mean proportion is equal for a given pair
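The quantities drawn in a box-and-whiskers graphic such as panel (b) can be computed with a few lines of Python. This is a minimal sketch, not STAMP's implementation. It assumes the common Tukey convention in which the 1.5 × interquartile-range limit is measured from the edges of the box, and it assumes linear interpolation for the percentiles; the function name `box_summary` and the data are hypothetical.

```python
from statistics import mean, median, quantiles


def box_summary(values):
    """Summary statistics for one box in a box-and-whiskers graphic."""
    data = sorted(values)
    # 25th and 75th percentiles (assumption: inclusive linear interpolation).
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    reach = 1.5 * (q3 - q1)
    # Assumption: the whisker limit is measured from the box edges.
    inside = [v for v in data if q1 - reach <= v <= q3 + reach]
    return {
        "median": median(data),
        "mean": mean(data),
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],    # most extreme point within the lower limit
        "whisker_high": inside[-1],  # most extreme point within the upper limit
        "outliers": [v for v in data if v < q1 - reach or v > q3 + reach],
    }


# Hypothetical proportions (%) of one feature in ten samples of a group.
summary = box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

With one extreme sample the mean is pulled well above the median, which is why showing both markers on the same box is informative.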
the median. Data points outside of the whiskers are shown as crosses.
PCA plots give the first three principal components of a metagenomic profile as determined by applying principal component analysis (Fig. 1c). Clicking on a marker within the plot indicates the sample represented by the marker.
Post hoc plots contrast each pair of groups considered in a multigroup statistical hypothesis test (Fig. 1d). They indicate the mean proportion of sequences within each group, the difference
[Figure 2. Extended error bar plot for females (F) and males (M) showing the difference in mean proportions (%) with 95 % confidence intervals and the p-value for each genus: Peptostreptococcus 0.017, Heliobacterium 0.029, Parvimonas 0.033, Aliivibrio 0.054, Bradyrhizobium 0.062, Anaerococcus 0.088, Geobacillus 0.098.]

STAMP: Statistical Analysis of Metagenomic Profiles, Fig. 2 Exploration of compositional differences in the gut microbiota of males and females sampled by Arumugam et al. (2011). The extended error bar plot indicates all genera where Welch’s t-test produces an uncorrected p-value < 0.1. All genera are overabundant within the gut microbiota of males (M) compared to females (F)
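Fig. 2 reports uncorrected p-values. The familywise error rate and false discovery rate corrections named in the text (Bonferroni, Šidák, Holm-Bonferroni, Benjamini-Hochberg) can be sketched in plain Python as follows. These are the standard textbook formulations and are not taken from the STAMP source code; the input p-values are hypothetical and are not the values shown in the figure.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def sidak(pvalues):
    """Sidak adjustment: 1 - (1 - p)^m."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def holm(pvalues):
    """Holm-Bonferroni step-down adjustment."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (often reported as q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical uncorrected p-values for four features.
raw = [0.01, 0.04, 0.03, 0.005]
for name, method in [("Bonferroni", bonferroni), ("Sidak", sidak),
                     ("Holm", holm), ("Benjamini-Hochberg", benjamini_hochberg)]:
    print(name, [round(q, 4) for q in method(raw)])
```

On these inputs the false discovery rate adjustment is never larger than the familywise adjustments, which reflects why the text describes familywise control as the more conservative choice during data exploration.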
in mean proportions for each pair of groups along with the confidence interval of this effect size statistic, and a p-value indicating if the mean proportion is equal for a given pair.
Extended error bar plots display the p-value, effect size, and associated confidence interval for all unfiltered features in a metagenomic profile (Fig. 2). In addition, a bar plot indicates the proportion of sequences assigned to a feature in each sample or group. This provides all information required to reason about the biological relevance of a feature in a single plot.
Scatter plots indicate either the proportion of sequences or mean proportion of sequences assigned to each feature within a pair of samples or a pair of treatment groups, respectively. This plot is useful for identifying features that are clearly enriched in one of the two samples or groups. When considering a pair of samples, confidence intervals calculated with the Wilson score method can be shown. For a pair of treatment groups, different statistics indicating the spread of the data can be displayed (e.g., standard deviation, minimum and maximum proportions).

All plots provide a range of customization options. For example, PCA plots can be restricted to the first two principal components, and individual panels of the extended error bar plot can be selectively hidden. Plots can be saved in either vector (PDF, PS, EPS, SVG) or raster (PNG) formats. The resolution of raster files can be set to allow for generation of plots suitable for printed publication or display on posters.
Tabular views of statistical results are also provided and columns can be sorted to help identify interesting patterns. Tables can be saved as tab-separated value files for subsequent display in any text editor or spreadsheet program or for inclusion as supplemental information in publications.

Summary

Statistics can greatly aid in the comparison of metagenomic profiles. STAMP provides a simple graphical environment for performing statistical analyses that are tailored to the needs of comparative metagenomic studies. It provides a range of statistical hypothesis tests and can identify statistically significant features between pairs of samples or defined treatment groups. Different multiple test correction methods are provided in order to account for the large number of features
typical of metagenomic profiles and to aid in data exploration. The biological relevance of significant features can be assessed through a range of publication-quality plots that provide key statistics such as effect sizes and confidence intervals. Interactive filtering allows the most biologically interesting features to be quickly identified and plots of specific features to be generated. STAMP’s wide range of statistics and simple interactive interface make it a valuable tool in comparative metagenomic studies.

Cross-References

▶ MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
▶ Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3

References

Arumugam M, Raes J, Pelletier E, et al. Enterotypes of the human gut microbiome. Nature. 2011;473:174–80.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300.
Lingner T, Aßhauer KP, Schreiber F, Meinicke P. CoMet – a web server for comparative functional profiling of metagenomes. Nucleic Acids Res. 2011;39 Suppl 2:W518–23.
MacDonald NJ, Parks DH, Beiko RG. Rapid identification of high-confidence taxonomic assignments for metagenomic data. Nucleic Acids Res. 2012;40:e111.
Markowitz VM, Ivanova NN, Szeto E, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008;36(Database issue):D534–8.
Meyer F, Paarmann D, D’Souza M, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386.
Parks DH, Beiko RG. Identifying biologically relevant differences between metagenomic communities. Bioinformatics. 2010;26:715–21.
Schloss PD, Westcott SL, Ryabin T, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.

Subtractive Hybridization Magnetic Bead Capture: Molecular Technique for Recovery of Full-Length ORFs from Metagenomes

Don Cowan, Sandra Ronca and Jean-Baptiste Ramond
Centre for Microbial Ecology and Genomics (CMEG), Genome Research Institute (GRI), University of Pretoria, Hatfield, Pretoria, South Africa

Synonyms

Recovery of full-length ORFs from metagenomic DNA

Definition

Subtractive hybridization magnetic bead capture (SHMBC) is a sequence-based metagenomic technique for the recovery of full-length ORFs from heterogeneous metagenomic DNA samples.

Introduction

It is widely acknowledged that the vast majority (~99 %) of microorganisms present in the environment are resistant to culture using classical microbiological methods. Approximately half of the total estimated bacterial phyla (61) are still to be cultured (Vartoukian et al. 2010). However, environmental microbial communities constitute a valuable resource for biotechnology and are a valid target for identification of novel genes and/or biological compounds such as biocatalysts or secondary metabolites (Sharma et al. 2005). In order to bypass the limitations of microbial culturing and to discover new microbial genes and functions, two approaches have been implemented, either culture-based, through the development of innovative strategies and media
S 646 Subtractive Hybridization Magnetic Bead Capture
to “culture the unculturables” (Vartoukian et al. 2010), or culture-independent, by using “meta-omics” technologies (Riesenfeld et al. 2004; Cowan et al. 2005).
Metagenomics enables in-depth investigation of the totality of (microbial) genomes present in any given environment (Riesenfeld et al. 2004), including extreme habitats which constitute fields of choice for the discovery of robust enzymes and biological compounds suitable for industrial processes (Cowan et al. 2005). In practice, metagenomic studies can either be function- or sequence-driven (Riesenfeld et al. 2004; Schmeisser et al. 2007). While the former is widely used, it remains limited (i) by the choice and number of substrates available, (ii) by the difficulty of designing novel substrates, (iii) by the fact that heterologous expression systems and hosts are required, and (iv) by the need to clone full-length open reading frames (ORFs) or gene clusters to enable activities to be detected. In contrast, the latter allows access to all the sequences from any given environment and thus to its complete metabolic/catabolic potential, provided that similar sequences have previously been annotated and their encoded activity(ies) characterized (Schmeisser et al. 2007). In metagenomic gene-mining studies (whether function- or sequence-based), one of the main experimental challenges is therefore to recover complete ORFs. The recent advent of high-throughput second-generation “meta”sequencing technologies potentially provides access to all the sequences present in a metagenome, and while it facilitates the isolation of the targeted sequence(s) from metagenomic samples, it does not avoid laborious experimental procedures. Here, we report on a novel molecular biology technique for the recovery of full-length ORFs from environmental metagenomes, termed subtractive hybridization magnetic bead capture (Meyer et al. 2007).

Method

SHMBC is a sequence-based technique developed for the retrieval of complete ORFs from metagenomic DNA from environmental samples. The three core elements of the technology include high-quality metagenomic DNA and the production of “tester DNA” and “driver DNA,” where the latter is immobilized on magnetic beads (Fig. 1).

Metagenomic DNA Extraction and Fragmentation
Prior to performing SHMBC, it is essential to obtain high-quality and high molecular weight metagenomic DNA, by chemical (e.g., cell lysis using detergents) and/or mechanical (e.g., bead-beating) extraction protocols (Roh et al. 2006). Due to variable physical and chemical compositions, extracting high-quality metagenomic DNA from environmental samples can be challenging, most notably in attaining a complete representation of the microbial (functional) diversity, including rare phyla/sequences.
For functional investigation, the isolation of complete ORFs and/or gene clusters is crucial. For SHMBC, metagenomic DNA fragment sizes of 1–5 kb are ideal as they are short enough to permit relatively easy PCR amplification following subtractive hybridization (Fig. 1). The detergent-based metagenomic DNA extraction method developed by Zhou is recommended as it typically yields high-quality metagenomic DNA (>23 kb; Zhou et al. 1996). The whole extraction process requires ~6 h of labor. Alternatively, various column-based metagenomic DNA extraction kits are commercially available and are efficient in extracting high-quality DNA (~10 kb in 1 h; Knauth et al. 2013).
The fragmentation of metagenomic DNA to the appropriate size can be performed by physical disruption methods (e.g., by freeze/thaw or freeze-boiling cycles or bead-mill homogenization) or enzymatic digestion (using restriction enzymes). Metagenomic DNAs contaminated by co-extracted compounds (notably humic acids or heavy metals), which can hamper downstream restriction and/or PCR amplification reactions, can be diluted or purified. These procedures generally lead to the reduction of metagenomic DNA yields.
[Figure 1: flowchart of the SHMBC protocol. Steps shown: environmental DNA extraction and fractionation (1–5 kb); construction of the “tester DNA” by ligation of a T7 adapter; PCR amplification of the “driver DNA” (targeted ORF) using degenerate biotinylated primers; heat denaturation and immobilization on streptavidin-covered magnetic beads; heat denaturation; subtractive hybridization (1. hybridization, 2. wash); direct PCR amplification of full-length ORFs using T7 primers.]
Subtractive Hybridization Magnetic Bead Capture: Molecular Technique for Recovery of Full-Length ORFs from Metagenomes, Fig. 1 Schematic subtractive hybridization magnetic bead capture (SHMBC) protocol

“Tester DNA” Construction
The “tester DNA” is a modified metagenomic DNA sample to be probed by SHMBC for full-length ORF recovery (Fig. 1). Once fractionated to an appropriate size (1–5 kb), the metagenomic DNA is manipulated to generate the “tester DNA” by ligating T7 adapters at the 3′ and 5′ ends (in blue, in Fig. 1). Recommended protocols are given in Meiring et al. (2010). The T7 adapters contain T7 priming sites which enable a direct PCR amplification of the 1–5 kb DNA fragments following subtractive hybridization.

“Driver DNA” Production
The “driver DNA” is the hybridization probe used for the recovery of full-length ORFs from the “tester DNA.” Its production is thus crucial for the success of SHMBC. The “driver DNA” can be PCR-amplified directly from the purified metagenomic DNA using gene-specific primers (Fig. 1). However, environmental samples are characterized by composite bacterial communities, potentially with polymorphic sequences coding for multiple related genes. To PCR-amplify heterologous gene sequences, the use of degenerate primers is necessary. For the production of valid “driver DNA,” a compromise must therefore be made between primer degeneracy (the number of degenerate bases) and primer coverage (the number of matched homologous genes). Highly specific primers may only target limited numbers of organisms/genes, while excessive degeneracy often leads to high levels of nonspecific binding and to the amplification of untargeted sequences.
In general, in order to design degenerate PCR primers, homologous nucleotide sequences from different microorganisms are retrieved from databases (e.g., GenBank) and aligned such that conserved ~20-mer sequences can be
identified. An alternative is to identify conserved amino acid sequences in homologous proteins, as they usually constitute key components of active sites and/or are necessary for protein stability. Moreover, the genetic code being degenerate (or redundant), synonymous codons (i.e., triplets of nucleotides coding for a single amino acid) generally differ in their last base, which may be replaced in degenerate primers by the nucleotide “inosine.” Amplicons obtained with degenerate primer sets must initially be cloned and sequenced to verify their specificity.
Biotinylated ORF-/gene-specific “driver DNAs” are produced using 5′-biotinylated gene-specific forward degenerate primers (the reverse remaining unlabelled; Meyer et al. 2007, Meiring et al. 2010). Single-stranded “driver DNAs” are immobilized to streptavidin-coated magnetic beads which have high affinity for biotin (Fig. 1). The “driver DNA”-magnetic bead complex constitutes the hybridization probe for SHMBC.

Subtractive Hybridization of “Tester DNA” and Full-Length ORF Amplification
“Tester DNAs” and “driver DNAs” are generally hybridized overnight. To modify SHMBC selectivity, hybridization temperatures and hybridization buffer salt concentrations can be adjusted. To ensure specific post-subtractive hybridization PCR amplifications, unbound “tester DNAs” are eliminated with successive SDS washes. Recovered magnetic beads with hybridized “tester DNA” can be used directly to amplify target ORFs using T7 primers (Fig. 1). To amplify full-length ORFs, high-fidelity and proofreading Taq polymerases are recommended.

Comparison with Other Techniques

The function-driven isolation of ORFs has most commonly been used with pure isolates or clones from metagenomic libraries. However, this strategy is limited because of significant technical and methodological challenges. Notably, less than 1 % of environmental bacteria can be cultured. Obtaining axenic cultures, a prerequisite for any physiological characterization, is a fastidious and unreliable process as numerous variables potentially influence microbial growth (e.g., amounts of various but specific nutrients, pH, temperature, atmospheric gas composition, etc.). Function-based screening of clones is dependent on the expression of genes in foreign hosts, and only a few model microorganisms are widely used as transformation hosts (e.g., Escherichia coli, Bacillus subtilis, Geobacillus sp., Streptococcus pneumoniae, Neisseria gonorrhoeae, Haemophilus influenzae, Helicobacter pylori, Acinetobacter baylyi, and some cyanobacteria). In addition, a successful transformation guarantees neither the expression of heterologous genes nor the production of functional proteins/enzymes in the foreign host. Finally, the detection of an enzymatic activity is dependent on the existence or design of suitable media and/or of an assay to detect the specific enzyme activities (Waschkowitz et al. 2009).
Current advances in second-generation sequencing technology (454 pyrosequencing, Illumina, and SOLiD™) have increased the effectiveness of sequence-based screening as hundreds of millions of sequencing reads can be acquired in a single run, enabling the detection of rare ORFs in metagenomic samples. However, isolation of full-length sequences is still necessary for functional studies, and the short sequence read lengths (which do not cover entire ORFs) may become problematic (Morales and Holben 2011). In consequence, complex computational gene assembly strategies must be implemented in order to recover full-length ORFs (Liu et al. 2012). When working with genes or ORFs with multiple homologs present in a database, the latter can be used as reference sequences during gene annotation and assembly processes. However, in the absence of such a reference sequence, de novo approaches can only provide a probability of sequence fragments belonging to a specific gene cluster (Thomas et al. 2012). New gene families can be annotated from next-generation sequencing datasets by de novo gene assembly methods (i) by reconstructing novel sequences/genes based on nucleotide frequency, (ii) by implementation of a conventional Overlap
Layout Consensus (OLC) strategy, or (iii) by probabilistic De Bruijn graphs (Paszkiewicz and Studholme 2010). However, gene size variations due to large insertions, deletions, or polymorphisms can lead to complicated de novo assemblies even within closely related taxa. Finally, computational assembly of next-generation sequencing data is limited by the fact that short-read assemblies rely on data reduction algorithms, in which reads from low-abundance organisms may be discarded, ultimately leading to the disappearance of rare ORFs from the datasets.

Applications and Improvements

The sequence-based SHMBC technique constitutes a valuable prescreening method, as it collects ORFs of interest from complex metagenomic mixtures; and the latter can further be functionally analyzed. Such an approach increases the chance of obtaining positive hits in post-functional screening protocols.
Recently, a comparable subtractive hybridization approach in combination with a pre-enrichment microsatellite strategy was performed to isolate novel phaC gene sequences from the marine bacterium Paracoccus homiensis (Latisnere-Barragan and Lopez-Cortes 2012). This methodology allowed the efficient isolation of full-length phaC ORFs and the construction of enriched plasmid libraries of phaC genes, thereby reducing the experimental costs of genome sequencing (Latisnere-Barragan and Lopez-Cortes 2012). In this study, a single genome was screened but the authors suggest that SHMBC coupled with microsatellite enrichments could be used to retrieve phaC sequences from complex microbial community metagenomes.
In forensic studies, SHMBC was developed (i) to extract and pre-concentrate short tandem repeats (STRs) from degraded DNA samples, which is a common problem in crime scene analysis, and (ii) to compensate for STR allele imbalance, allele dropout, and sequence-specific inhibitions generally encountered in such samples (Wang and McCord 2011). The authors conclude that SHMBC using the specifically designed probes (i.e., “driver DNA”) significantly improved the recovery of STR alleles from degraded DNA samples.
SHMBC was recently amended by combining the DNA fractionation and linker ligation steps (Harris et al. 2009). Such a procedure increased the efficiency in constructing the “tester DNA.” In this particular study, SHMBC was applied as an enrichment tool to identify and isolate sex-specific regions in the complex genome of the Australian python (Morelia spilota imbricata). The authors stressed the critical importance of nondegraded DNA.
SHMBC as a pre-enrichment tool could also be used in comparative meta-transcriptomic studies. In such studies, the analysis of full-length ORFs may identify the most frequently represented functional genes in different ecosystems and thus potentially unravel differential trophic structures and functions.
To conclude, we strongly believe that this technique should be considered by the scientific community for use prior to any “meta-functional” or functional genomic studies. SHMBC can readily be automated and routinely used as a full-length ORF pre-enrichment tool to detect functions of interest in metagenomic samples (Latisnere-Barragan and Lopez-Cortes 2012), for disease diagnosis (Wang et al. 2011), and could further be implemented to test processed food products for the presence of genetically modified organisms (GMOs) and/or other adulterations (such as horse meat contamination in beef lasagne!) or GMO cross-contaminations in crop fields.

Summary

A wide range of function-based and/or sequence-based screening techniques have been developed to study/isolate ORFs in metagenomes. The subtractive hybridization magnetic bead capture (SHMBC) technique is potentially a cost-effective and efficient method for the isolation of full-length ORFs from metagenomic DNA preparations. This approach could be widely
used as a pre-enrichment tool prior to performing post-functional studies or sequence-based functional analyses.

Cross-References

▶ Approaches in Metagenome Research: Progress and Challenges
▶ Biological Treasure Metagenome
▶ Metagenomic Research: Methods and Ecological Applications
▶ Mining Metagenomic Datasets for Antibiotic Resistance Genes
▶ Mining Metagenomic Datasets for Cellulases
▶ Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies

References

Cowan D, Meyer Q, Stafford W, Muyanga S, Rory Cameron R, Wittwer P. Metagenomics gene discovery: past, present and future. TRENDS Biotechnol. 2005;23:321–9.
Harris RP, Groth DM, Ledger J, Lee CY. Identification of sex specific DNA regions in the snake genome using a subtractive hybridization technique. Proc Assoc Advmt Anim Breed Genet. 2009;18:572–5.
Knauth S, Schmidt H, Tippkotter R. Comparison of commercial kits for the extraction of DNA from paddy soils. Lett Appl Microbiol. 2013;56:222–8.
Latisnere-Barragan H, Lopez-Cortes A. Isolation of phaC gene from marine bacteria Paracoccus homiensis strain E33 by magnetic beads subtractive hybridization. Ann Microbiol. 2012;62:1691–5.
Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M. Comparison of next-generation sequencing systems. J Biomed Biotechnol. 2012. doi:10.1155/2012/251364.
Meiring T, Mulako I, Tuffin MI, Meyer Q, Cowan DA. Retrieval of full-length functional genes using subtractive hybridization magnetic bead capture. Methods Mol Biol. 2010;668:287–97. Clifton, NJ.
Meyer QC, Burton SG, Cowan DA. Subtractive hybridization magnetic bead capture: a new technique for the recovery of full-length ORFs from metagenome. Biotechnol J. 2007;2:36–40.
Morales SE, Holben WE. Linking bacterial identities and ecosystem processes: can ‘omic’ analyses be more than the sum of their parts? FEMS Microbiol Ecol. 2011;75:2–16.
Paszkiewicz K, Studholme DJ. De novo assembly of short sequence reads. Brief Bioinform. 2010;11:457–72.
Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004;38:525–52.
Roh C, Villatte F, Kim B-G, Schmid RD. Comparative study of methods for extraction and purification of environmental DNA from soil and sludge samples. Appl Biochem Biotechnol. 2006;134:97–112.
Schmeisser C, Steele H, Streit WR. Metagenomics, biotechnology with non-culturable microbes. Appl Microbiol Biotechnol. 2007;75:955–62.
Sharma R, Ranjan R, Kapardar RK, Grover A. ‘Unculturable’ bacterial diversity: an untapped resource. Curr Sci. 2005;89:72–7.
Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microbiol Inform Exp. 2012;2:3–3.
Vartoukian SR, Palmer RM, Wade WG. Strategies for culture of ‘unculturable’ bacteria. FEMS Microbiol Lett. 2010;309:1–7.
Wang J, McCord B. The application of magnetic bead hybridization for the recovery and STR amplification of degraded and inhibited forensic DNA. Electrophoresis. 2011;32:1631–8.
Wang HH, Zhao CY, Li F. Rapid identification of mycobacterium tuberculosis complex by a novel hybridization signal amplification method based on self-assembly of dna-streptavidin nanoparticles. Braz J Microbiol. 2011;42:964–72.
Waschkowitz T, Rockstroh S, Daniel R. Isolation and characterization of metalloproteases with a novel domain structure by construction and screening of metagenomic libraries. Appl Environ Microbiol. 2009;75:2506–16.
Zhou JZ, Bruns MA, Tiedje JM. DNA recovery from soils of diverse composition. Appl Environ Microbiol. 1996;62:316–22.
T
Taxa Counting Using Specific Peptides of Aminoacyl tRNA Synthetases

David Horn
School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel

Synonyms

Short Read Analysis; Specific Peptides; Taxa Counting

Definition

Motifs that appear on Aminoacyl tRNA Synthetases can serve as specific peptides (SP) whose presence in a metagenome indicates which taxa it contains. This is used to devise a method, based on gene fragments rather than on 16S rRNA sequences, which allows for taxa counting from short read metagenomic data. It is exemplified on human gut microbial data.

Introduction: The SP Approach

Specific peptides (SPs) are short deterministic motifs whose presence in the protein sequence is a good predictor of an enzymatic activity of the protein. SPs were introduced by Kunik et al. (2007), and their predictive powers on full protein sequences were established by Weingart et al. (2009). Their results are the basis of the webtool http://horn.tau.ac.il/DME11.html which supplies enzymatic assignments for queried protein sequences. This methodology has been applied directly to short reads, obtaining enzymatic and taxonomic signatures of data, by Weingart et al. (2010). These authors have extracted a set of SPs that are associated with single proteins of the aaRS families, known as the S61 set (because the EC numbers of these enzymes, indicating their 4-level enzymatic classification, start with 6.1.1.). The application of SPs to taxa counting in metagenomic data has been developed by Persi et al. (2012). To ensure high precision of the prediction process, it is required that the length of the SPs in the S61 set is at least nine amino acids. The resulting list contains 3,949 SPs.

The Taxa Counting Algorithm

For short read data one first converts all genomic reads to amino acid strings in the six possible reading frames. One then identifies all reads that share a single SP. Choosing the largest group of such reads, one tries to group the short reads into sets such that all reads within a set are consistent with one another (i.e., can be fused
with each other) and every set is inconsistent with the other ones. Although this mathematical problem is NP complete, one may devise simple algorithms that carry it out efficiently (Persi et al. 2012). The strong consistency conditions can be relaxed to allow for errors, such that reads within a set may differ from each other by one amino acid, and different sets have to differ from each other by at least two amino acids. The number of different sets becomes a lower bound on the number of different taxa. For short reads, distinguishing between species belonging to the same genus is impossible. Depending on the length of the short reads, chances are high however for distinguishing between different families, classes, and phyla. For the case of long sequences or extensive contigs, one can resort to searching for sequences that share several SPs of the same aaRS enzyme. This allows one to address the question of counting different species or even different strains of the same species.

Tests and Applications

Persi et al. (2012) have compared the S61 SP approach with the 16S rRNA analysis on an artificial metagenome composed of 64 genomes of different species that represent bacterial taxonomic diversity. For some of the principal phyla, they selected pairs of strains of the same species, such that the resolutions of the taxonomic delineation of the two methods can be tested and compared. The SP approach has been proved to match the accuracy provided by the 16S analysis and sometimes even to surpass it. The novel method has then been applied to species counting in the human gut microbiome employing the data of Qin et al. (2010). These data were based on samples taken from 124 individuals. In addition to raw short read data, the authors have presented genomic contigs, as well as a nonredundant set of 3.3 M ORFs derived from full genomic analysis (also called “prevalent genes”). The analysis of the prevalent genes has led Qin et al. (2010) to conclude that there exist more than 1,000 different species in their metagenomic data. Persi et al. (2012) have argued that the prevalent genes, when analyzed using the S61 approach, display only half this count. If, however, the full set of contigs is analyzed, an estimate of over 1,000 different species and strains is obtained. The number of different genera has been estimated to be relatively small, presumably of the order of a few tens.
Of particular interest is the application of the novel method to short read data. Here this method is quite unique. It allows for a quick estimate of species count directly from raw data. Short read singletons that are often discarded from metagenomic analysis, because they cannot combine with other short reads to form longer contigs, can be readily included in this analysis. Moreover, one can test the sensitivity of the results to sample size, to the minimal distance d allowed between reads that are classified in the same taxon, and to noise in the data.
The raw data contain errors, and every misidentification of an amino acid will affect taxa counts. The probability of such errors was estimated to be below 1 %. This was then tested by inserting artificial random errors at the level of 1 % into analyzed reads. The results showed that the d ≥ 2 counts of the set with artificial errors are similar to the d ≥ 1 estimates drawn from the raw data. One may therefore conclude that limiting oneself to d ≥ 2 analysis of the raw data suffices to eliminate the majority of errors in the data. Sample sizes of order 1,000 short reads of the Qin et al. (2010) data lead to counts of 200 or more taxa. The counts keep increasing linearly with sample size, indicating that greater depth unravels larger numbers of strains and species. Focusing on large distances between reads, such as d ≥ 7, the taxa counts in the analysis of Persi et al. (2012) saturate at about 60, providing a stable bound on the number of species that are expected to have quite large Hamming distances (over 150) between their relevant protein sequences. Finally it is interesting to note that an analysis of Persi et al. (2012) carried out for all short reads of one of the subjects has shown
10 % novel species with respect to the contigs of Qin et al. (2010), and about 45 % novelties when compared to all Uniprot enzymes.

Discussion

The richness of microbiomes has become a widely recognized topic. It is often analyzed by using 16S rRNA definitions of OTUs, whose direct contact with observed and analyzed organisms may be lacking. The alternative method provided by the S61 SP approach is based on peptides that have been extracted from an analysis of enzymes recorded in Swiss-Prot. This is the only bias of this method. It is conceivable that some genes of new species will be so far removed from the known ones that no SP match will occur and they may thus avoid detection. Therefore the list of SPs should be updated from time to time (Weingart et al. 2009) as the database grows.
A major advantage of the S61 SP methodology is its simplicity: its straightforward implementation does not require any further choice of parameters or comparisons with additional databases. Furthermore, it is satisfying to realize that it can be applied to short reads. Even those short reads that cannot be combined into contigs may lead to informative conclusions on taxa counting using the S61 approach.

Summary

Taxonomic deciphering of metagenomic data usually relies on 16S rRNA analysis. Alternatively one may use genomic information, in particular genes related to single proteins, i.e., those known to appear only once in a genome. Some of the Aminoacyl tRNA Synthetases (aaRS) fit this description. Employing specific peptides, whose occurrence is restricted to these protein families, one can devise algorithms of taxa counting. The latter turn out to be informative even for short read metagenomic data.

Cross-References

▶ Computational Approaches for Metagenomic Datasets
▶ Human Gut Microbial Genes by Metagenomic Sequencing

References

Kunik V, Meroz Y, Solan Z, et al. Functional representation of enzymes by specific peptides. PLoS Comput Biol. 2007;3(8):e167.
Persi E, Weingart U, Freilich S, Horn D. Peptide markers of aminoacyl tRNA synthetases facilitate taxa counting in metagenomic data. BMC Genomics. 2012;13:65.
Qin J, Li R, Raes J, Arumugam M, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:08821.
Weingart U, Lavi Y, Horn D. Data mining of enzymes using specific peptides. BMC Bioinformatics. 2009;10:446.
Weingart U, Persi E, Gophna U, Horn D. Deriving enzymatic and taxonomic signatures of metagenomes from short read data. BMC Bioinformatics. 2010;11:390.

Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3

Wolfgang Gerlach¹ and Jens Stoye²
¹Institute for Genomics and Systems Biology, Argonne National Laboratory, Argonne, IL, USA
²Faculty of Technology, Bielefeld University, Bielefeld, Germany

Synonyms

Classification; Metagenome; Taxonomy

Definition

CARMA3 is a program to assign taxonomic identifiers to metagenomic sequences of unknown taxonomic origin.

Introduction

The vast majority of microbes cannot be cultivated in a monoculture and thus cannot be sequenced by means of traditional methods. To explore these microbes, they have to be analyzed within their natural microbial communities. High-throughput sequencing (HTS) technologies like Roche’s 454 sequencing, ABI’s SOLiD, or Illumina’s Genome Analyzer make it possible to sequence microbial DNA samples of such communities, called metagenomes. Due to the restricted read lengths produced by these technologies, reconstruction of complete genomic sequences from a metagenome is impossible. However, by comparing the metagenomic fragments with sequences of known function, it is possible to analyze the biological diversity and the underlying metabolic pathways in microbial communities.
To infer the taxonomic origin of metagenomic reads, two kinds of methods, composition-based and comparison-based, can be distinguished. The composition-based methods extract sequence features like GC content or k-mer frequencies and compare them with features computed from reference sequences with known taxonomic origin (Abe et al. 2005; Diaz et al. 2009; Karlin et al. 1997; McHardy et al. 2007). A disadvantage is that short reads are not suited for this method as rather long reads are required to obtain a reasonable classification accuracy. The comparison-based methods, in contrast, rely on homology information obtained by database searches. They can be further subdivided into methods that are based on hidden Markov model (HMM) homology searches (Eddy 1998) and those that are based on BLAST homology searches (Altschul 1990, 1997; Gish and States 1993). CARMA version 1 (Krause et al. 2008) and CARMA version 2 (Gerlach et al. 2009) belong to the HMM-based methods. CARMA version 3 (Gerlach and Stoye 2011) has been implemented in two variants, one of which is HMMER3-based and therefore also belongs to the HMM-based methods.
For the taxonomic classification of metagenomic reads based on BLAST, different methods have been developed. Probably the most basic method is to use BLAST to search for the best hit in a database of sequences with known origin. Since the evolutionary distance between the source organisms of the metagenomic fragment and the database sequence is unknown, a classification result solely based on a best BLAST hit has to be interpreted carefully. In general, such a classification is more reliable on higher taxonomic levels (e.g., superkingdom or phylum) than on lower taxonomic levels (e.g., genus or species), but it is difficult to decide which taxonomic level is reliable enough, as this strongly varies for each metagenomic fragment.
The program MEGAN (Huson et al. 2007, 2011) is based on the lowest common ancestor (LCA) approach. A BLAST search is performed, and all BLAST hits that have a bit score close to the bit score of the best hit are collected. The metagenomic fragment is then classified by computing the LCA of all species in this set. One of the reasons for the improved classification accuracy of this approach is that fragments with ambiguous hits are assigned at higher taxonomic levels.
The SOrt-ITEMS method (Haque et al. 2009) extends the LCA approach and uses additional techniques to reduce the number of false-positive predictions. One approach is the reduction of the number of hits by using a reciprocal BLAST search step. Another technique used is the adaptation of the taxonomic assignment level for all hits, based on different alignment parameters like sequence similarity between the metagenomic fragment and the aligned database sequence.
Inspired by these techniques, in particular the reciprocal search step of SOrt-ITEMS, CARMA3 (Gerlach and Stoye 2011) was developed to further improve the accuracy of the taxonomic classification. It makes explicit use of the assumption of a model of evolution where different gene families have different rates of mutation, but within each family this rate does not change too much.

Methods

The first step in CARMA3 consists of a BLASTx search of the metagenomic DNA sequence
Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3, Fig. 1 (a) Projections of BLAST hits obtained from reciprocal search onto the lineage of t1. The dashed edges represent projections of unknown phylogenetic affiliations x and x′ of metagenomic sequences q and q′, respectively. (b) Intervals given by reciprocal bit scores for each taxonomic rank and level assignments of x and x′ based on their score
against the NCBI NR protein database. All protein fragments in the database that have an alignment with the metagenomic sequence are extracted. These sequences, as well as the protein translation of the metagenomic DNA sequence, as given by the BLAST alignment between the metagenomic query and the best database hit, are used to create a small protein BLAST database. In the second step of CARMA3, the reciprocal BLAST search, the extracted protein fragment that corresponds to the best BLAST hit is searched against this database using BLASTp. Since the protein fragment that is searched against the database is included in the database, this database sequence produces a perfect alignment and yields the best BLAST bit score.
Let ti be the taxonomic affiliation of the ith best BLAST hit in the reciprocal search, and let x be the (unknown) species of the metagenomic sequence. Clearly t1, which is also the taxonomic affiliation of the best BLAST hit in the first BLAST search, is the phylogenetically closest known relative of x. Since the taxonomy assignment t1 is usually located at taxonomic rank species, strain, or substrain, and metagenomic sequences mostly come from species that are phylogenetically more distantly related, using t1 as taxonomic classification for x would be an overprediction. Therefore, the purpose of this method is to approximate the lowest common ancestor of t1 and x, which would be the best possible taxonomic classification.
The reciprocal search provides similarity scores in terms of BLAST bit scores between t1 and all other database sequences. Since the taxonomic affiliations of the other database sequences, except the metagenomic sequence, are known, the reciprocal search provides means to correlate BLAST bit scores with phylogenetic distances. Database sequences that are more closely related to t1 tend also to have higher reciprocal bit scores than the less closely related sequences. A toy example for this is given in Fig. 1a.
Each ti is projected onto the lowest common ancestor of ti and t1, a taxon within the lineage of t1. For each taxon in the lineage of t1 that gets projections from a subset of ti, an interval is defined by the minimum and the maximum reciprocal bit scores from the BLAST hits in this subset. Intervals for the reciprocal search example are depicted in Fig. 1b. These intervals can be used to assign a metagenomic sequence to a taxon in the lineage of t1 based on its reciprocal score. In general (case a) this method tries to assign the metagenomic sequence to the lowest taxonomic rank at which its reciprocal score is still within the borders of the interval at that rank. If such an interval does not exist (case b), the lowest taxonomic rank is chosen for which all bit scores are still lower than the bit score of the metagenomic sequence.
Two examples for the taxonomic classification are given in Fig. 1b. Metagenomic read
q with unknown phylogenetic affiliation x has a reciprocal score of 90. Since the bit score of q is higher than the bit score of the single hit in the interval at rank family (t5 = 80), but smaller than any hit in the interval at taxonomic rank genus (t4 = 95 and t2 = 120), x gets assigned to the rank family (case b). The second metagenomic read q′ with reciprocal score of 105 is within the borders of the interval at rank genus and thus x′ gets assigned to the rank genus (case a).
Real data often does not show the properties assumed in this model, and sometimes reciprocal scores are missing for a taxonomic rank. To accommodate this, CARMA3 additionally employs techniques like polishing, linear interpolation, and a fallback method, described in detail in the original publication (Gerlach and Stoye 2011). CARMA3 is also available in a variant that is based on HMMER3 homology searches against the Pfam (Finn et al. 2010) database. In this variant the metagenomic sequences are aligned against Pfam family alignments from which reciprocal scores can be computed that are required for the taxonomic classification. Both the BLAST and the HMMER variants of CARMA3 can also be used for the taxonomic classification of amino acid sequences.

Results and Discussion

CARMA3 is available via the WebCARMA pipeline that takes metagenomic reads as input and outputs taxonomic and functional classifications. The pipeline runs on the compute cluster of the Bielefeld University Bioinformatics Resource Facility at the Center for Biotechnology (CeBiTec) and is freely accessible at http://webcarma.cebitec.uni-bielefeld.de. The complete source code of CARMA3 (C/C++) has been released under the GPL and is available for download from the WebCARMA homepage.
CARMA3 has been evaluated in various experiments including simulated and real metagenomes. In the following the results of two of these experiments are shown. The first experiment is a qualitative comparison of CARMA3 with SOrt-ITEMS and MEGAN using simulated data. The simulated metagenome consists of 25 randomly chosen bacterial genomes from the NCBI ftp site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). N = 25,000 metagenomic reads were simulated using MetaSim (Richter et al. 2008) with the default 454 sequencing error model, resulting in an average read length of 265 bp. The second experiment is an example of the applicability of CARMA3 in the case of very large metagenomes that can be produced, for example, by the Illumina sequencing technology. In this experiment the real data set consists of 3.3 million nonredundant microbial genes of the gene catalogue of the human gut microbiome (Qin et al. 2010). Fecal samples from different individuals were sequenced with the Illumina Genome Analyzer (GA), which yielded 576.7 Gb of sequence. The reads were assembled into longer contigs, and a gene finder was used to detect open reading frames (ORFs). Similar ORFs were clustered to obtain the final nonredundant gene set. This gene set was downloaded and the ORFs were translated into protein sequences using the NCBI Genetic Code 11.

Comparison with Other Methods Using Simulated Data
To evaluate the different BLAST-based methods regarding their ability to classify sequences of unknown source organism, three BLAST NR protein databases were created: “order-filtered,” without sequences from species that share the same order as any of the species from the simulated metagenome; “species-filtered,” without sequences from species in the simulated metagenome; and “All,” the complete NR database.
The BLASTx runs for CARMA3, SOrt-ITEMS, and MEGAN against these three databases were performed with default E-value threshold (-e 10), soft sequence masking (-F “mS”), and frameshift penalty 15 (-w 15). To ensure comparability, CARMA3 used the same thresholds as SOrt-ITEMS regarding the BLASTx hits, a minimal bit score of 35, and a minimal alignment length of 25. The parameter
Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3, Table 1 Comparison of the taxonomic classification accuracy of the different BLASTx-based methods CARMA3, SOrt-ITEMS, and MEGAN using the order-filtered database (TP true positives, FP false positives)

                CARMA3            SOrt-ITEMS        MEGAN
Rank            TP       FP       TP       FP       TP       FP
Superkingdom    12,696   861      12,576   786      12,626   1,849
Phylum          8,989    1,224    9,254    1,736    8,079    1,985
Class           4,066    1,495    4,062    1,937    3,649    2,479
Order           –        2,507    –        4,011    –        4,975
Family          –        1,186    –        2,565    –        4,087
Genus           –        210      –        798      –        4,041
Species         –        23       –        0        –        3,544
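
The figures in Table 1 can be cross-checked with a short calculation. The following sketch recomputes the relative reduction in false positives of CARMA3 compared with SOrt-ITEMS at the ranks order, family, and genus, and derives the number of unclassified reads U = N − TP − FP for one cell of the table. It assumes that the total N used in the evaluation equals the 25,000 simulated reads; the variable and function names are illustrative only.

```python
# False positives from Table 1 (order-filtered database).
fp_carma3 = {"order": 2507, "family": 1186, "genus": 210}
fp_sortitems = {"order": 4011, "family": 2565, "genus": 798}

# Relative FP reduction of CARMA3 versus SOrt-ITEMS, in percent.
fp_reduction = {
    rank: 100.0 * (1.0 - fp_carma3[rank] / fp_sortitems[rank])
    for rank in fp_carma3
}

for rank, percent in fp_reduction.items():
    print(f"{rank}: {percent:.1f} % fewer false positives")

N = 25_000  # assumption: all simulated reads enter the evaluation


def unknown(tp, fp, total=N):
    """Reads left unclassified at a rank: U = N - TP - FP."""
    return total - tp - fp


# CARMA3 at rank superkingdom: TP = 12,696 and FP = 861.
print(unknown(12_696, 861))
```

The reductions come out at roughly 37 % (order), 54 % (family), and 74 % (genus), which is the 37–74 % range quoted in the text.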
for the minimal number of reads that are required to report a taxon in SOrt-ITEMS and MEGAN was set to 1 in all experiments. To ensure comparability of MEGAN with the other two BLAST-based methods, the top percent parameter was increased from ten (default) to 15, resulting in more conservative predictions. Although CARMA3 is a parameter-free method, an artificial parameter p was introduced to slightly increase the sensitivity at the cost of decreased specificity, in order to yield a sensitivity comparable to that of the other two methods. This allowed for evaluating each of the methods based on their number of false positives. The values of p were 1.024 for order-filtered, 1.033 for species-filtered, and 1.15 for the unfiltered database.
The taxonomic classification methods assign to a metagenomic read one taxon and therefore also one taxonomic rank. This taxon implicitly provides a taxonomic classification also for the higher taxonomic ranks. For example, the taxon Gammaproteobacteria at the taxonomic rank class implicitly provides the taxonomic classification Bacteria at the taxonomic rank superkingdom. The taxonomic ranks below the predicted taxon can be considered to be classified as “unknown.” Therefore, for each taxonomic rank, a metagenomic read can either be correctly classified and counts as a true positive (TP), can be wrongly classified and counts as a false positive (FP), or it is not classified and counts as unknown (U). As for each taxonomic rank the numbers TP, FP, and U sum up to the total number N of reads used in the evaluation and U equals N − TP − FP, U is not explicitly given in the results.
The complete table for all results can be found in the original publication. Table 1 shows the results for the evaluation on the order-filtered database. While CARMA3 performs better than SOrt-ITEMS at rank class, since it has the same number of true positives but fewer false positives, for the ranks superkingdom and phylum, it is not clear which method is better. At the taxonomic ranks order to genus, where the metagenomic sequences have been filtered away, CARMA3 has far fewer (37–74 %) false positives than SOrt-ITEMS. CARMA3 has better results than MEGAN at all taxonomic ranks, while SOrt-ITEMS has better results than MEGAN at all taxonomic ranks below superkingdom. The results for the species-filtered and the complete NR database, where closely related or identical reference species are available in the database, show that in such a setting CARMA3 performs similarly to the other two methods.

Taxonomic Classification of the Human Gut Microbiome with CARMA3
A taxonomic classification based on BLAST has the advantage of a high sensitivity, which is in particular important if no closely related reference species are available. The main bottleneck of this approach is the computation time required for the BLAST search. Over 98 % of the total
running time of a CARMA3 analysis is due to the initial BLASTx search against the NR database. While a BLASTx analysis of a complete 454 run is feasible on a compute cluster in the order of hours or a few days, this approach seems to be less practical for the analysis of all unassembled reads produced by a complete run of an Illumina sequencing machine that produces one to two orders of magnitude more bases in total than a 454 sequencing machine in a single run.
One way to overcome this limitation is the usage of data reduction techniques. This is a common strategy to handle the amount of data produced by Illumina sequencing machines (Qin et al. 2010; Hess et al. 2011). Typical steps involve the assembly of reads into longer fragments, gene detection with a gene finder to detect open reading frames (ORFs), clustering of highly similar ORFs, and translation of the nonredundant ORFs into protein sequences. Such a metaproteome has, in contrast to the full set of unassembled Illumina reads, a size that makes the analysis with the BLASTp variant of CARMA3 possible on a compute cluster in the order of hours or a few days. To evaluate the applicability of CARMA3 on amino acid sequences derived from assembled Illumina reads, the BLASTp variant of CARMA3 was used to analyze the gene catalogue of the human gut microbiome (Qin et al. 2010). The results were compared to the taxonomic classification of another study of the human intestinal microbial flora based on 13,355 prokaryotic 16S ribosomal RNA gene sequences (Eckburg et al. 2005).
Both methods, the 16S rDNA analysis and CARMA3, identify Firmicutes and Bacteroidetes as the most abundant phyla, followed by Proteobacteria, Actinobacteria, Verrucomicrobia, and Fusobacteria. Also, in both analyses, the phylum Firmicutes consists mainly of the class Clostridia. Nearly all genera of the Clostridia that have been predicted by the 16S rDNA analysis, like Eubacterium, Ruminococcus, Dorea, Butyrivibrio, and Coprococcus, have also been predicted by CARMA3. Also most of the species of Clostridia like E. rectale, E. hallii, R. torques, R. gnavus, F. prausnitzii, D. formicigenerans, and D. longicatena that are found by the 16S rDNA analysis could be confirmed by CARMA3. However, the species E. hadrum and R. callidus that have been found by 16S rDNA were not found by CARMA3. The genus Clostridium, which is the taxon found by CARMA3 to have the highest abundance in the class Clostridia, is not reported by the 16S rDNA analysis. The reason for this might be that the 16S rDNA sequence of Clostridium bartlettii, which mostly contributes to the genus Clostridium and is known to be found in human feces, might not have been available at the time of the 16S rDNA analysis (Song et al. 2004). Also the species R. inulinivorans and R. intestinalis of the genus Roseburia, which are found by CARMA3 but not by the 16S rDNA analysis, are known to occur in human feces (Duncan et al. 2002; Scott et al. 2011). For the second most abundant phylum, the Bacteroidetes, the authors of the 16S rDNA analysis report a high variability in the distribution of phylotypes in samples from different subjects. Nevertheless, all phylotypes reported by the authors of the 16S rDNA analysis, B. vulgatus, Prevotellaceae, B. thetaiotaomicron, B. caccae, and B. fragilis, were among the 11 or, in case of B. putredinis, among the 22 most abundant taxa predicted by CARMA3 (Gerlach and Stoye 2011, Supplementary Figs. S22–S25).
The comparison of the taxonomic predictions of the 16S rDNA analysis and CARMA3 has revealed a high consistency in the results of both methods. This shows that CARMA3 can also be used for the taxonomic classification of amino acid sequences obtained from assembled Illumina reads.

Summary

CARMA3 is a method for the taxonomic classification of assembled and unassembled metagenomic sequences that can be used in combination with BLAST- and HMMER-based homology searches. Except for the homology search and the fallback scenario, this method is parameter-free. In addition, for the HMMER-based variant, it also provides a functional
classification of the metagenomic sequence.
Typically, a metagenomic sample contains many novel species that have not been sequenced before. Such a scenario has been simulated with the order-filtered database, and it also has been shown that in most cases CARMA3 not only performs better than existing BLAST-based methods, but most strikingly, it is better at avoiding FP predictions on lower taxonomic ranks when only remote homologues are available for the classification of novel species.
One reason for the high accuracy of CARMA3 is that reciprocal hits provide a reasonable estimation of the last common ancestor of the metagenomic sequence and its best hit in the sequence database. In contrast to the other BLAST-based methods, this method is not based on the LCA and therefore does not discard reciprocal hits that can provide valuable information for the taxonomic classification.
A drawback of using BLASTx is its running time. The computational bottleneck of the CARMA3 pipeline is the homology search, in particular the BLAST search. In the evaluation the initial BLAST search accounted for over 98 % of the total running time. However, this is a problem shared with all BLAST-based approaches. Furthermore, it has been shown in the evaluation that this problem can be dealt with by the use of data reduction strategies which include assembly and gene detection steps.
Currently available biological sequence databases are known to be biased because they mainly contain sequences of species that are culturable. Although the authors have tried to minimize the effect of this bias on the results of their evaluation by creating the order-filtered database, this bias has to be kept in mind when generalizing the evaluation results to metagenomic reads from unculturable species.

Cross-References

▶ MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
▶ PhyloPythia(S)

References

Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T. Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 2005;12:281–90.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Diaz NN, Krause L, Goesmann A, Niehaus K, Nattkemper TW. TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinforma. 2009;10:56.
Duncan SH, Hold GL, Barcenilla A, Stewart CS, Flint HJ. Roseburia intestinalis sp. nov., a novel saccharolytic, butyrate-producing bacterium from human faeces. Int J Syst Evol Microbiol. 2002;52(Pt 5):1615–20.
Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA. Diversity of the human intestinal microbial flora. Science. 2005;308:1635–8.
Eddy SR. Profile hidden Markov models (review). Bioinformatics. 1998;14(9):755–63.
Finn RD, Mistry J, Tate J, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38(Database issue):D211–22.
Gerlach W, Stoye J. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 2011;39(14):e91.
Gerlach W, Jünemann S, Tille F, Goesmann A, Stoye J. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinforma. 2009;10:430.
Gish W, States DJ. Identification of protein coding regions by database similarity search. Nat Genet. 1993;3(3):266–72.
Haque MM, Ghosh TS, Komanduri D, Mande SS. SOrt-ITEMS: sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics. 2009;25(14):1722–30.
Hess M, Sczyrba A, Egan R, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–7.
Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.
Huson DH, Mitra S, Weber N, Ruscheweyh H, Schuster SC. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011;21:1552–60.
Karlin S, Mrázek J, Campbell AM. Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol. 1997;179:3899–913.
Krause L, Diaz NN, Edwards RA, et al. Taxonomic composition and gene content of a methane-producing microbial community isolated from a biogas reactor. J Biotechnol. 2008;136(1–2):91–101.
McHardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4(1):63–72.
Qin J, Li R, Raes J, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65.
Richter DC, Ott F, Auch AF, Schmid R, Huson DH. Metasim: a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3(10):e3373.
Scott KP, Martin JC, Chassard C, Clerget M, Potrykus J, Campbell G, Mayer C-D, Young P, Rucklidge G, Ramsay AG, Flint HJ. Substrate-driven gene expression in Roseburia inulinivorans: importance of inducible enzymes in the utilization of inulin and starch. Proc Natl Acad Sci U S A. 2011;108(1):4672–4679.
Song YL, Liu CX, McTeague M, Summanen P, Finegold SM. Clostridium bartlettii sp. nov., isolated from human feces. Anaerobe. 2004;10(3):179–84.

The Vaginal Microbiome in Health and Disease

Ronald F. Lamont
Department of Gynecology and Obstetrics, Clinical Institute, University of Southern Denmark, Odense University Hospital, Odense, Denmark
Division of Surgery, University College London, Northwick Park Institute of Medical Research Campus, London, UK

Definition of the Human Vaginal Microbiome

The full collection of microbial genomes (bacterial, viral, fungal, etc.) in the human vagina.

Introduction

The resident microbial flora of the healthy vagina provides protection from infection by a number of different mechanisms. Until recently, our knowledge of the composition of the vaginal microbial flora came from qualitative/semiquantitative descriptive studies using culture-dependent techniques. Following the development and introduction of culture-independent molecular-based techniques, new information with respect to the composition of normal vaginal flora in health and disease has expanded our knowledge (Lamont et al. 2011). Most studies, whether culture dependent or independent, give the impression that the composition of vaginal flora is static and do not reflect the fact that such communities undergo shifts in their relative representation, abundance, and virulence between individuals and over time (Costello et al. 2009), all of which are affected by many factors. In this way, there may be a relatively stable “core” vaginal microbiome together with a “variable” microbiome that is affected inter alia by transient members of the community as well as by host factors such as environment, lifestyle, genotype, and immune response (Turnbaugh et al. 2007).

Normal Vaginal Flora

Culture and microscopy of “normal” vaginal flora typically demonstrates a predominance of Lactobacillus species, which are believed to promote a healthy vaginal milieu by providing numerical dominance but also by producing lactic acid to maintain an acid environment that is inhospitable to many bacteria. Lactobacilli also produce hydrogen peroxide (H2O2), antibiotic hydroxyl radicals, bacteriocins, and probiotics. Most of the data on the vaginal microbiome published to date have been derived from healthy asymptomatic women of reproductive age (Zhou et al. 2007; Srinivasan et al. 2010; Ravel et al. 2011; Gajer et al. 2012). Using culture-independent techniques, it can be demonstrated that a significant proportion (7–33 %) of healthy women lack appreciable numbers of Lactobacillus species in
the vagina that may be replaced by other lactic acid-producing bacteria such as Atopobium vaginae, Megasphaera, and Leptotrichia species. Although the structure of the communities may differ between populations, this demonstrates that vaginal health can be maintained, provided the function of these communities (lactic acid production) continues. Consequently, the absence of lactobacilli or the presence of certain organisms such as Gardnerella vaginalis or species of Peptostreptococcus, Prevotella, Pseudomonas, and/or Streptococcus does not constitute an abnormal state.

The Role of Lactobacilli from Culture-Independent Studies
Culture-based techniques, because they fail to detect fastidious organisms, underestimate the diversity of vaginal microbial flora, but because of deficiencies in the phenotypic identification of lactobacilli, they overestimate the diversity of Lactobacillus species in the vagina. Some 20 years ago, using culture-based phenotypic techniques, Redondo-Lopez et al. concluded that no two women were colonized by the same two Lactobacillus species (Redondo-Lopez et al. 1990). Using culture-independent techniques, we now know this is inaccurate, and because of their significant role in health and disease, much attention has been given to the identification of lactobacilli using genotypic means. Culture-independent studies using molecular-based techniques have been carried out in different populations from different geographic locations (Lamont et al. 2011). Racial variation and geographical area are important, and different racial groups within the same geographical region have significant differences in what is the dominant vaginal organism. In most populations, Lactobacillus crispatus is the most common dominant isolate, and White women are more likely to be dominated by L. crispatus and/or Lactobacillus jensenii than any other species of Lactobacillus. A number of genetic as well as environmental factors might explain at least part of this observation. Alternatively, diet might influence the Lactobacillus species resident in the gastrointestinal tract and hence the vagina, as the lactobacilli of the gut vary between Japanese and Western women.
Over 120 species of Lactobacillus have been identified, and more than 20 species have been detected in the vagina. Using molecular-based techniques and in contrast to the assertion of Redondo-Lopez et al., outlined above (Redondo-Lopez et al. 1990), we now know that healthy vaginal flora does not contain high numbers of many different species of Lactobacillus. At least six subtypes or community state types (CSTs) of vaginal microbiome exist (Zhou et al. 2007; Zhou et al. 2010; Ravel et al. 2011; Gajer et al. 2012). Four of these CSTs are mainly dominated by one or two lactobacilli from a range of four species (L. crispatus, L. jensenii, Lactobacillus iners, and Lactobacillus gasseri). The remaining two CSTs lack substantial numbers of different species of lactobacilli and are composed of a diverse array of anaerobic bacteria including species associated with bacterial vaginosis (BV) such as Prevotella, Megasphaera, G. vaginalis, Sneathia, and A. vaginae (Fredricks et al. 2005). In Lactobacillus-dominated CSTs, other species are rare, are lower in titer, and tend to be novel phylotypes. The exclusion of other species is in keeping with the theory of “competitive exclusion” and the superior ability of microorganisms such as L. crispatus to compete with other bacteria for vaginal resources, a survival strategy known as “bacterial interference.” Alternatively, the rare coexistence of multiple dominant species of Lactobacillus could result from preemptive colonization by a particular species or from host factors that strongly influence the choice of species to colonize the vagina.

Lactobacillus iners: Under-detected and Underappreciated
The existence of L. iners was unknown prior to 1999, but due to molecular-based studies, it is now known to play a significant role in the vaginal microbial flora. Culture-independent methods have identified L. iners, a lactic acid-producing bacterium, as one of the organisms most frequently isolated from the vagina of healthy women. In contrast to L. crispatus,
which is rarely dominant in bacterial vaginosis (BV), L. iners can be detected at high levels in most subjects with and without BV, and in many studies it is the only Lactobacillus species detected in women with BV. It has been postulated that this may be because L. iners may be better adapted to the conditions associated with BV, i.e., the polymicrobial state of the vaginal flora and elevated pH. Alternatively, it could be the relative resistance of L. iners to unknown factors that led to the demise of other Lactobacillus species during the onset of BV or to a relative lack of antagonism of L. iners to BV-associated anaerobes, so that their dominance predisposes the individual to the acquisition of BV.

Community Group Variations Among Different Ethnic Groups
The vaginal microbiome and pH in asymptomatic, sexually active women who were fairly equally represented according to self-reported ethnic group (Hispanic, Black, Asian, White) have been studied (Ravel et al. 2011). The proportion of each community group and pH among the four ethnic groups varied significantly. Bacterial communities dominated by lactobacilli were found significantly more commonly in Asian and White women (80.2 % and 89.7 %, respectively) compared to only 59.6 % and 61.9 % in Hispanic and Black women, respectively. Similarly, median pH values were significantly higher in Black and Hispanic women compared to Asian and White women.

Abnormal Vaginal Flora

Abnormal vaginal flora may occur because of a sexually transmitted infection (STI), e.g., trichomoniasis, or through colonization by an organism that is not part of the normal vaginal community. Alternatively, abnormal vaginal flora may result from overgrowth or increased virulence of an organism that is a constituent part of normal vaginal flora such as Escherichia coli. Alterations in vaginal flora do not necessarily imply disease or result in symptoms. Disease results from the interplay between microbial virulence, numerical dominance, and the innate and adaptive immune response of the host (Smith 1934). The most common disorder of vaginal flora is BV, which is a polymicrobial condition characterized by a decrease in the quality or quantity of lactobacilli and by a 1000-fold increase in the number of other organisms, determined by culture-dependent techniques, particularly anaerobes such as Mycoplasma hominis, G. vaginalis, and Mobiluncus species. BV is increasingly associated with adverse outcomes in gynecology such as pelvic inflammatory disease, postabortal sepsis, infertility, post-hysterectomy vaginal cuff infections, and the acquisition of STIs such as gonorrhea, Chlamydia, trichomoniasis, and HIV. In pregnancy, BV has been associated with early and late miscarriage, recurrent abortion, postpartum endometritis, and preterm birth.

Atopobium vaginae: Under-detected and Underappreciated
The genus Atopobium is a member of the family Coriobacteriaceae and forms a distinct branch within the phylum Actinobacteria. Following sequence analysis, three species formerly designated Lactobacillus minutus, Lactobacillus rimae, and Streptococcus parvulus, within the lactic acid-producing group of bacteria, have been reclassified as the genus Atopobium. In 1999, an organism similar but not identical to these three species was isolated from the vagina of a healthy woman in Sweden, and the organism was named Atopobium vaginae (Rodriguez et al. 1999). Since that time, using molecular-based techniques, A. vaginae has frequently been detected in the vagina and is found much more commonly in women with BV than in those with normal flora (Lamont et al. 2011). A. vaginae is strictly anaerobic and is very sensitive to clindamycin in vitro, but is highly resistant to nitroimidazoles such as metronidazole and secnidazole.

High Diversity of Flora in Bacterial Vaginosis Compared with Normal Flora
Using various molecular-based techniques and the Amsel clinical criteria, or Nugent score to
classify normal or abnormal flora, a number of studies have demonstrated a high diversity of
organisms in women with BV compared to women with normal flora. Collectively, these studies
demonstrate the presence of species such as A. vaginae, Porphyromonas asaccharolytica,
bacterial vaginosis-associated bacteria (BVAB)-1, BVAB-2, and BVAB-3 in the order
Clostridiales and species of Megasphaera, Leptotrichia, Dialister, Chloroflexi, Eggerthella,
Olsenella, Streptobacillus, and Shuttleworthia which are either novel or unfamiliar to
clinicians (Lamont et al. 2011). For many of these undetected or under-detected organisms,
there is evidence of disease association. The renamed Atopobium parvulum, Atopobium minutum,
and Atopobium rimae have been associated with oral infections, dental and tubo-ovarian
abscesses, and abdominal wound infections, supporting the view that these organisms can be
pathogenic to the host. Leptotrichia sanguinegens/amnionii has been reported in association
with postpartum endometritis, adnexal masses, and fetal death and has been detected in the
amniotic fluid of women with preterm labor, preterm prelabor rupture of the membranes, and
preeclampsia. Also, in a study of 45 women with salpingitis and 44 controls (women seeking
tubal ligation), bacterial 16S rDNA sequences were found in the fallopian tube specimens of
24 % of cases, but in none of the controls. Bacterial phylotypes closely related to
Leptotrichia species and A. vaginae were among those identified in the cases. In addition,
Dialister pneumosintes was found as the sole agent in the blood culture from a woman with
suppurative postpartum ovarian thrombosis.

It has also been demonstrated that many of these organisms have specificity for BV and that
the number of phylotypes found in association with BV is statistically significantly greater
than the number detected in the presence of intermediate flora (a distinct entity in its own
right) (Taylor-Robinson et al. 2003) or normal flora. This statistic largely results from the
extreme dominance of lactobacilli in healthy women, which makes detection of other species
unlikely, even when they are present at levels of 100,000 or more cells/sample. In summary,
these studies have demonstrated that different subjects with BV have different microbial
profiles, indicating heterogeneity in the composition of bacterial taxa in women with BV.
Women without BV had bacterial communities dominated by Lactobacillus species, accounting for
86 % of all sequences. In contrast, women with BV did not possess a single dominant
phylotype, but instead had a diverse array of vaginal bacteria, often at relatively low
abundances.

The Diagnosis of Bacterial Vaginosis

Bacterial vaginosis can be diagnosed clinically, microscopically, enzymatically, and
chromatographically, using qualitative or semiquantitative culture methods or using composite
clinical criteria. Currently, the gold standard is the Nugent score (Nugent et al. 1991), but
the number of diagnostic methods testifies to the fact that no single test is ideal and that
they can all provide false-positive and false-negative results.

Confounding Factors

Findings from molecular-based studies are now highlighting possible explanations for why
diagnosis by microscopy may be inconsistent and why molecular methods may replace them:

1. Mobiluncus: One of the three organisms quantified as part of the Nugent score is
   Mobiluncus. Several cloning and sequencing studies have only rarely identified Mobiluncus.
   Fluorescence in situ hybridization (FISH) technology has demonstrated that BVAB-1 has
   curved-rod morphology, similar to Mobiluncus morphotypes, and it is possible that during
   microscopic examination of vaginal smears, Mobiluncus species may have been
   overrepresented and mistaken for BVAB-1. Alternatively, as species-specific PCR agrees
   with the Nugent score, Mobiluncus may be missed in universal PCR studies because it
   frequently falls below a threshold titer where it can be detected.
2. Atopobium: The urea produced by Atopobium species is associated with halitosis, and
   similarly, species of Megasphaera cause beer spoilage by producing turbidity, off-flavors
   and off-colors. Accordingly, if two genera associated with malodorous metabolites can be
   found in the vagina of healthy women and amines can be found in women without BV, then
   diagnostic techniques to diagnose BV, based on amine production and odor formation, may
   need to be reconsidered. Microscopically, Atopobium species are gram-positive, elliptical
   cocci, or rod-shaped organisms that occur singly, in pairs, or in short chains. The
   variable cell morphology of Atopobium renders it well camouflaged among the mixture of
   other species present in bacterial communities where the Nugent score is 4. A. vaginae is
   fastidious, grows anaerobically, and forms small pinhead colonies on culture that are
   easily missed. Although phylogenetically different from other lactic acid-producing
   bacteria, they are not phenotypically exceptional, and it is not difficult to see why the
   significance of this organism based on culture, microscopy, and phenotype may be
   overlooked and underappreciated.
3. Symptomatic relationships: Using species-specific primers, the relationships between five
   fastidious organisms associated with BV were compared with BV diagnosed by Amsel and/or
   Nugent scores, and also with the individual Amsel clinical criteria (Haggerty et al.
   2009). The two biovars of Ureaplasma urealyticum (Ureaplasma parvum and Ureaplasma
   urealyticum – biovar 2) were associated with vaginal discharge and raised pH, but not with
   BV by either Amsel or Nugent criteria or any of the individual Amsel clinical criteria. In
   contrast, with Leptotrichia sanguinegens/amnionii, A. vaginae, and BVAB-1, an elevated pH
   >4.5 was a universal feature, and they were all associated with BV by both Amsel and
   Nugent criteria and with the finding of >20 % of epithelial cells as clue cells, a feature
   that has already been reported. A positive test for amine odor upon the addition of 10 %
   solution of potassium hydroxide was significantly more likely in women testing positive
   for BVAB-1. Douching is a recognized risk factor for BV, and the detection of Leptotrichia
   and A. vaginae was three times more likely, and BVAB-1 twice as likely, when women
   reported douching.

Diagnosis of BV Using Qualitative and Quantitative Molecular Techniques

Some organisms or combinations of organisms have high sensitivities or specificities for the
diagnosis of BV using the Amsel criteria and the Nugent score (Fredricks et al. 2005;
Fredricks et al. 2007). Using quantitative real-time PCR, the association of individual
organisms with BV diagnosed by Nugent score was examined qualitatively. At a threshold of
10^8 DNA copies/ml, Lactobacillus species was predictive of normal flora (sensitivity 44 %;
specificity 100 %). BVAB-1, BVAB-2, and BVAB-3 alone, or in combination, had high specificity
for BV diagnosed by Amsel criteria.

Since A. vaginae and G. vaginalis are frequently detected in association with BV, a number
of authors using molecular-based techniques have examined the possibility of combining these
two organisms as a means of diagnosing BV. Using DNA quantitation, 19 out of 20 BV samples
had either a DNA level for A. vaginae ≥10^8 copies/ml or G. vaginalis ≥10^9 copies/ml, and
nine out of 20 had both. The combination of an A. vaginae DNA level ≥10^8 copies/ml and a
G. vaginalis DNA level ≥10^9 copies/ml demonstrated the best predictive criteria for the
diagnosis of BV with excellent sensitivity (95 %), specificity (99 %), negative predictive
value (NPV, 99 %), and positive predictive value (PPV, 95 %) (Menard et al. 2008).

Culture-Independent Studies in Pregnancy

Culture-independent techniques have been used to measure prevalence, diversity, and abundance
of organisms, particularly ureaplasmas in amniotic fluid, in association with suspected
cervical insufficiency, preterm labor, preterm prelabor rupture of membranes (PPROM),
small-for-gestational-age babies, preeclampsia, and the potential for bacteria from the oral
cavity to colonize amniotic fluid. However, apart from combining pregnant women with
nonpregnant women to increase sample numbers, the information with respect to the vaginal
microbiome in pregnant women is limited, particularly with respect to the outcome of
pregnancy, especially preterm birth. Using species-specific primers, Wilks et al. quantified
the production of H2O2 by lactobacilli from swabs taken at 20 weeks of gestation from the
vagina of 73 women considered to be at high risk of preterm birth (Wilks et al. 2004). The
levels of H2O2 production varied between species of Lactobacillus. The presence of
lactobacilli producing high levels of H2O2 was associated with a reduced incidence of BV at
20 weeks of gestation and subsequent chorioamnionitis. The authors postulated that
H2O2-producing lactobacilli reduced the incidence of ascending genital tract colonization in
pregnancy, which leads to infection and preterm birth. In a longitudinal study of 100
pregnant women, vaginal swabs were obtained at mean gestational ages of 8.6, 21.2, and 32.4
weeks, respectively (Verstraelen et al. 2009). In the first trimester, 77 women had normal or
Lactobacillus-dominated flora, 13 of whom developed abnormal flora in the second or third
trimester. When the first-trimester normal flora was dominated by L. gasseri or L. iners,
there was a tenfold risk of conversion to abnormal flora. In contrast, normal flora
comprising L. crispatus had a fivefold decreased risk of conversion to abnormal flora. This
may be because only a small percentage of L. gasseri and L. iners strains produce H2O2.

Knowledge of the vaginal microbiome in pregnancy is limited to only a few studies
(Verstraelen et al. 2009; Hernández-Rodriguez et al. 2011; Aagaard et al. 2012), none of
which analyzed samples collected longitudinally. Recently, using 16S rDNA sequencing in
normal pregnant women sampled longitudinally, the vaginal microbiome was found to be
different from that of nonpregnant women; also the vaginal microbiome during pregnancy is
more stable than in the nonpregnant state (Romero et al. 2014).

Conclusions

Stability and resilience of the vaginal ecosystem is now recognized to be of importance in
the health of a bacterial community as well as the response to perturbations. The relative
abundance of certain phylotypes correlates well with low or high Nugent scores, which is used
for the diagnosis of normal flora or BV. The inherent difference within and between women in
different ethnic groups strongly argues for a more refined definition of the subtypes of
bacterial communities normally found in healthy women and the need to appreciate differences
between individuals so they can be taken into account in risk assessment and diagnosis of
disease.

References

Aagaard K, Riehle K, Ma J, Segata N, Mistretta TA, Coarfa C, et al. A metagenomic approach to characterization of the vaginal microbiome signature in pregnancy. PLoS ONE. 2012;7:e36466.
Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R. Bacterial community variation in human body habitats across space and time. Science. 2009;326:1694–7.
Fredricks DN, Fiedler TL, Marrazzo JM. Molecular identification of bacteria associated with bacterial vaginosis. N Engl J Med. 2005;353:1899–911.
Fredricks DN, Fiedler TL, Thomas KK, Oakley BB, Marrazzo JM. Targeted PCR for detection of vaginal bacteria associated with bacterial vaginosis. J Clin Microbiol. 2007;45:3270–6.
Gajer P, Brotman R, Bai G, Sakamoto J, Schütte U, Zhong X, et al. Temporal dynamics of the human vaginal microbiota. Sci Transl Med. 2012;4:132ra152.
Haggerty CL, Totten PA, Ferris M, Martin DH, Hoferka S, Astete SG, et al. Clinical characteristics of bacterial vaginosis among women testing positive for fastidious bacteria. Sex Transm Infect. 2009;85:242–8.
Hernández-Rodriguez C, Romero-González R, Albani-Campanario M, Figueroa-Damián R, Meraz-Cruz N, Hernández-Guerrero C. Vaginal microbiota of healthy pregnant Mexican women is constituted by four Lactobacillus species and several vaginosis-associated bacteria. Infect Dis Obstet Gynecol. 2011;2011:851485.
Lamont R, Sobel J, Akins R, Hassan S, Chaiworapongsa T, Kusanovic J, et al. The vaginal microbiome: new information about genital tract flora using molecular based techniques. BJOG. 2011;118:533–49.
Menard JP, Fenollar F, Henry M, Bretelle F, Raoult D. Molecular quantification of Gardnerella vaginalis and Atopobium vaginae loads to predict bacterial vaginosis. Clin Infect Dis. 2008;47:33–43.
Nugent RP, Krohn MA, Hillier SL. Reliability of diagnosing bacterial vaginosis is improved by a standardized method of gram stain interpretation. J Clin Microbiol. 1991;29:297–301.
Ravel J, Gajer P, Abdo Z, Schneider G, Koenig S, McCulle S, et al. Vaginal microbiome of reproductive-age women. Proc Natl Acad Sci U S A. 2011;108(Suppl1):4680–7.
Redondo-Lopez V, Cook RL, Sobel JD. Emerging role of lactobacilli in the control and maintenance of the vaginal bacterial microflora. Rev Infect Dis. 1990;12:856–72.
Rodriguez J, Collins MD, Sjoden B, Falsen E. Characterization of a novel Atopobium isolate from the human vagina: description of Atopobium vaginae sp. nov. Int J Sys Bacteriol. 1999;49:1573–6.
Romero R, Hassan S, Gajer P, Tarca A, Fadrosh D, Nikita L, et al. The composition and stability of the vaginal microbiota of normal pregnant is different from that of non-pregnant women: results of a longitudinal study using culture-independent techniques. Microbiome 2014. In press.
Smith T. Parasitism and disease. Princeton: Princeton University Press; 1934.
Srinivasan S, Liu C, Mitchell C, Fiedler T, Thomas K, Agnew K, et al. Temporal variability of human vaginal bacteria and relationship with bacterial vaginosis. PLoS ONE. 2010;5:e10197.
Taylor-Robinson D, Morgan DJ, Sheehan M, Rosenstein IJ, Lamont RF. Relation between Gram-stain and clinical criteria for diagnosing bacterial vaginosis with special reference to Gram grade II evaluation. Int J STD AIDS. 2003;14:6–10.
Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI. The human microbiome project. Nature. 2007;449:804–10.
Verstraelen H, Verhelst R, Claeys G, De Backer E, Temmeman M, Vaneechoutte M. Longitudinal analysis of the vaginal microflora in pregnancy suggests that L. crispatus promotes the stability of the normal vaginal microflora and that L. gasseri and/or L. iners are more conducive to the occurrence of abnormal vaginal microflora. BMC Microbiol. 2009;9:116.
Wilks M, Wiggins R, Whiley A, Hennessy E, Warwick S, Porter H, et al. Identification and H(2)O(2) production of vaginal lactobacilli from pregnant women at high risk of preterm birth and relation with outcome. J Clin Microbiol. 2004;42:713–17.
Zhou X, Brown C, Abdo Z, Davis C, Hansmann M, Joyce P, et al. Differences in the composition of vaginal microbial communities found in healthy Caucasian and black women. ISME J. 2007;1:121–33.
Zhou X, Hansmann M, Davis C, Suzuki H, Brown C, Schütte U, et al. The vaginal bacterial communities of Japanese women resemble those of women in other racial groups. FEMS Immunol Med Microbiol. 2010;58:169–81.

tRNA Gene Database Curated Manually by Experts

tRNADB-CE and Use of tRNAs as Phylogenetic Markers for Metagenomic Sequences

Takashi Abe (1), Hachiro Inokuchi (2), Yuko Yamada (2), Akira Muto (3), Yuki Iwasaki (2) and
Toshimichi Ikemura (2)
(1) Graduate School of Science and Technology, Niigata University, Niigata, Japan
(2) Nagahama Institute of Bio-Science and Technology, Nagahama, Shiga, Japan
(3) Faculty of Agriculture and Life Science, Hirosaki University, Hirosaki, Aomori, Japan

Synonyms

tRNA; tRNADB-CE; Taxonomic assignment using tRNA genes

Definition

The tRNA gene database curated manually by experts (“tRNADB-CE”)
(http://trna.ie.niigata-u.ac.jp) has been constructed and annually updated by analyzing all
available complete and draft genomes of Bacteria and Archaea, virus genomes, chloroplast
genomes, and eukaryote genomes plus fragment sequences obtained from metagenome analyses of
environmental samples. By compiling tRNAs from known prokaryotes that had identical
sequences, we found high phylogenetic preservation of tRNA sequences, especially at the
phylum level. Furthermore, a large number of tRNAs obtained by metagenome analyses of
environmental samples had sequences identical to those found in known prokaryotes. The
identical sequence group, therefore, can be used as molecular phylogenetic markers to clarify
microbial community structures in environmental ecosystems as well as in clinical samples.
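
The idea of grouping tRNAs by identical sequence and then checking how far each group is confined to one lineage can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the record layout (sequence, phylum), the toy sequences (which are made up and are not real tRNAs), and the function names `strip_cca`, `build_isgs`, `phylum_conserved_fraction`, and `assign_phylum`. It does not reproduce the database's own pipeline, which this entry describes as clustering with CD-HIT followed by expert curation.

```python
from collections import defaultdict


def strip_cca(seq):
    """Drop a terminal CCA, if present (simplification: the entry excludes the 3' CCA end)."""
    seq = seq.upper()
    return seq[:-3] if seq.endswith("CCA") else seq


def build_isgs(records):
    """Group (sequence, phylum) records into identical sequence groups keyed by sequence."""
    groups = defaultdict(list)
    for seq, phylum in records:
        groups[strip_cca(seq)].append(phylum)
    return dict(groups)


def phylum_conserved_fraction(isgs, min_size=2):
    """Fraction of groups with at least min_size members that contain a single phylum."""
    eligible = [phyla for phyla in isgs.values() if len(phyla) >= min_size]
    if not eligible:
        return 0.0
    conserved = sum(1 for phyla in eligible if len(set(phyla)) == 1)
    return conserved / len(eligible)


def assign_phylum(seq, isgs):
    """Return the phylum if seq matches a single-phylum group exactly, otherwise None."""
    phyla = set(isgs.get(strip_cca(seq), []))
    return next(iter(phyla)) if len(phyla) == 1 else None


# Hypothetical toy records; the sequences are invented placeholders.
records = [
    ("GCGGATTTAGCTCAGCCA", "Proteobacteria"),
    ("GCGGATTTAGCTCAG", "Proteobacteria"),
    ("GGGCCTGTAGCTCAA", "Firmicutes"),
    ("GGGCCTGTAGCTCAA", "Firmicutes"),
    ("GCCCGGATAGCTCAG", "Firmicutes"),
    ("GCCCGGATAGCTCAG", "Actinobacteria"),
]
isgs = build_isgs(records)
print(len(isgs))                                   # number of identical sequence groups
print(phylum_conserved_fraction(isgs))             # share of groups confined to one phylum
print(assign_phylum("GGGCCTGTAGCTCAACCA", isgs))   # a "metagenomic" read matched to a group
```

A group that spans more than one phylum is deliberately left unassigned here, which mirrors the entry's point that only lineage-specific groups are useful as markers.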
Introduction

In accord with the remarkable progress of DNA sequencing technology, a vast quantity of
metagenomic sequences obtained from a wide variety of environmental and clinical samples have
been decoded and released by DDBJ/EMBL/GenBank. A massive number of metagenomic sequences,
including short sequences obtained with new-generation sequencers, should contain a large
number of complete tRNA sequences because the lengths of tRNA sequences are short. However,
practically no information on tRNA genes has been annotated for metagenomic sequences in
DDBJ/EMBL/GenBank. The search for tRNA genes in metagenomic sequences can provide a new
strategy to clarify microbial community structures in environmental samples. Thus, we
included a vast number of tRNA genes found in metagenomic sequences in the tRNADB-CE (Abe
et al. 2009, 2011).

When we focused on a group of tRNAs with an identical sequence, we found tRNAs only in a
particular lineage of phylogenetic groups. Notably, such phylotype-specific tRNA sequences
were also found in many species-unknown genomic fragments obtained by metagenome analyses.
This fact shows that tRNA is a good phylogenetic marker for discovering the phylotype
composition and microbial community structure in an environmental sample.

Search for tRNA Genes

In order to enhance the completeness and accuracy of searching for tRNA genes, three computer
programs, tRNAscan-SE (Lowe and Eddy 1997), ARAGORN (Laslett and Canback 2004), and
tRNAfinder (Kinouchi and Kurokawa 2006), were used in combination since their algorithms were
partially different and rendered somewhat different results. First, we checked to what degree
the predicted regions and the anticodons of individual tRNA genes were consistent with each
other. The tRNA genes concordantly found by the three programs were stored in the database.
Second, three experts in the tRNA experimental field manually checked the discordant cases
(approximately 3 % of the total of bacterial gene candidates) independently and included
reliable cases in the database.

For fragment sequences obtained from metagenome analyses, only tRNA genes concordantly found
by the three programs and those that had sequences identical to tRNAs already included in the
database were stored. A large number of tRNA genes were detected in various environmental
samples, and their numbers were separately listed by category of environment. This enabled us
to clarify microbial community structures in environmental samples using tRNAs as
phylogenetic markers. Because a significant portion of environmental DNA sequences are
thought to be from unculturable microbes, tRNA genes of novel species should be included.

Functions of tRNADB-CE and Data Access

The tRNADB-CE allows browsing of the stored data and search for the database with
user-specified input as described previously in detail (Abe et al. 2009, 2011). A browse page
is presented in Fig. 1. First, a list of tRNA genes and anticodons can be browsed depending
on the numbering of genomes (i.e., genome ID) or DNA fragments of environmental samples
stored in the database. The statistical information for copy numbers of tRNA genes in each
phylotype/species and the anticodon type in each amino acid group can also be browsed
(Fig. 1a). By clicking the sequence ID of each tRNA gene, detailed information on the
selected tRNA genes can be browsed, including tRNA gene sequences, their upstream and
downstream sequences (10 nt), information on the secondary structures of the tRNA, and
curation comments on the tRNA.

A “keyword search” can also be conducted using retrieved items such as species name, amino
acid, anticodon, sequence ID, and genome ID. This function can be performed by using multiple
keywords in combination. The database
tRNA Gene Database Curated Manually by Experts, Fig. 1 Basic functions of tRNADB-CE
supports two types of sequence search: sequence similarity search “BLASTN” and pattern
search. In pattern search (i.e., oligonucleotide sequence search), we can focus the search
area on the stems/loops of cloverleaf structures and combine the areas in various patterns.
After selecting tRNA genes of interest using the sequence search procedures, multiple
alignments with ClustalW and downloads of aligned sequences and obtained dendrograms are
available (Fig. 1b).

Identical Sequence Groups and Their Use as Phylogenetic Markers for Environmental
Metagenomic Sequences

When we conducted the clustering of tRNA gene sequences, except the 3′ CCA terminal sequence,
from complete and draft genomes of Bacteria and Archaea by sequence alignment using the
CD-HIT (Li and Godzik 2006), we found high phylogenetic preservation of tRNA genes; a
particular tRNA sequence was found only in a particular lineage of phylogenetic groups. We
designated here the tRNA group with an identical sequence as “identical sequence group: ISG”
(Fig. 2a) and listed the numbers of ISGs for each phylotype (Fig. 2b) and for each anticodon
(Fig. 2c). tRNAs with one anticodon type were classified and listed according to the ISG
along with the phylotype information of each tRNA (Fig. 2d), and thus, the range of
phylotypes found for each ISG could be examined. If we focused on ISGs composed of more than
five sequences, approximately 95 % of ISGs were conserved at a phylum level, showing most
tRNAs to be good phylogenetic markers at least at the phylum level. The ISGs could provide
tRNA Gene Database Curated Manually by Experts, Fig. 2 List and search for identical sequence group (ISG)
a strategy for selecting reliable phylogenetic markers. In addition, approximately 65 % of
ISGs were conserved even at the genus level, showing the possible existence of good
genus-specific markers. By combining the data provided by this database with other detailed
knowledge of a particular tRNA obtained by experiments or from literature, users may obtain
useful phylogenetic markers (e.g., genus-specific markers) by themselves.

Interestingly, among tRNA genes found in metagenomic sequences derived from environmental
samples, approximately 25 % of tRNA genes were identical in sequence to genes from
species-known prokaryotes. Using tRNAs found in an environment sample that were assigned to
ISGs, we could predict the microbial community structure in an environmental ecosystem at
least at the phylum level (Fig. 2e). The database also has a function for searching for
sequences with 97 % or 95 % sequence identity (2- or 3-nt difference, respectively)
(Fig. 2a). By using tools in the database and specific markers found by users (e.g.,
genus-specific markers), users can clarify microbial populations in an ecosystem by
themselves.

The present strategy can be applied even to data of short sequences obtained with
new-generation sequencers, such as Sequence Read Archive (SRA), in NCBI. In metagenomic
analyses using new-generation sequencers, the phylogenetic characterization of short
sequences with existing bioinformatics methods was particularly difficult, except for
sequences unambiguously mapped on a known sequenced genome. Because complete tRNA genes can
be found even from short genomic fragments of around 100 bases, tRNA genes should become one
of the most effective means for identifying
microbial populations in an ecosystem in the case of metagenome studies conducted with
next-generation sequencers.

When we consider the rapid growth of genomic and metagenomic sequences accumulated in
DDBJ/EMBL/GenBank, our present strategy to search for reliable tRNAs, including manual
curation by experts, may be inadequate. Our group previously developed BLSOM (Batch-Learning
Self-Organizing Map) for oligonucleotide composition, which clustered (self-organized)
genomic sequence fragments according to the phylogenic group (Abe et al. 2003). The
oligonucleotide BLSOM was successfully applied to the phylogenetic classification of a large
quantity of metagenomic sequences (Abe et al. 2005). When we conducted BLSOM for the tetra-
and pentanucleotide compositions of bacterial tRNAs, tRNAs were accurately separated
according to the amino acid, showing the BLSOM to be an additional informatics strategy for
the assignment of reliable tRNAs. When we focused on tRNAs with the same anticodon, tRNAs
were separated according to the phylotype on BLSOM, showing that the BLSOM is also applicable
to the phylogenetic assignment of tRNAs present in metagenomic sequences.

Summary

By compiling the tRNAs of known prokaryotes with identical sequences, we found high
phylogenetic preservation of tRNA sequences, especially at the phylum level. Furthermore, a
large number of tRNAs obtained by metagenome analyses had sequences identical to those found
for known prokaryotes. The identical sequence group, therefore, can be used as molecular
phylogenetic markers to clarify microbial community structures of environmental ecosystems.
The tRNADB-CE allows users to obtain phylotype-specific markers (e.g., genus-specific
markers) by themselves and to clarify microbial community structures in ecosystems in detail.
tRNADB-CE can be accessed freely at http://trna.ie.niigata-u.ac.jp.

References

Abe T, Kanaya S, Kinouchi M, Ichiba M, Kozuki T, Ikemura T. Informatics for unveiling hidden genome signatures. Genome Res. 2003;13:693–702.
Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T. Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Res. 2005;12:281–90.
Abe T, Ikemura T, Ohara Y, Uehara H, Kinouchi M, Kanaya S, Yamada Y, Muto A, Inokuchi H. tRNADB-CE: tRNA gene database curated manually by experts.
Abe T, Ikemura T, Sugahara J, Kanai A, Ohara Y, Uehara H, Kinouchi M, Kanaya S, Yamada Y, Muto A, Inokuchi H. tRNADB-CE 2011: tRNA gene database curated manually by experts. Nucleic Acids Res. 2011;39:D210–3.
Kinouchi M, Kurokawa K. tRNAfinder: a software system to find all tRNA genes in the DNA sequence based on the cloverleaf secondary structure. J Comp Aided Chem. 2006;7:116–26.
Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004;32:11–6.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–64.
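
As a supplementary illustration of the concordance step described under "Search for tRNA Genes" in this entry, the following Python sketch sorts hand-made predictions from three programs into calls supported by all three and calls that would be left for manual review. It is only a sketch under stated assumptions: the `Call` record, the agreement rule (same strand, same anticodon, at least 90 % overlap of the shorter call), and the function names `agree` and `triage` are hypothetical, and no real output of tRNAscan-SE, ARAGORN, or tRNAfinder is parsed. The entry itself does not state the exact concordance criteria that were used.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """One predicted tRNA gene, with 1-based inclusive coordinates (hypothetical layout)."""
    start: int
    end: int
    strand: str
    anticodon: str


def agree(a, b, min_overlap=0.9):
    """Assumed rule: same strand and anticodon, and the regions overlap almost fully."""
    if a.strand != b.strand or a.anticodon != b.anticodon:
        return False
    overlap = min(a.end, b.end) - max(a.start, b.start) + 1
    shorter = min(a.end - a.start, b.end - b.start) + 1
    return overlap >= min_overlap * shorter


def triage(calls_by_program):
    """Return (concordant calls, discordant (program, call) pairs).

    A call is concordant when every program has an agreeing call. Concordant genes
    are reported once, using the first program's coordinates. Pairwise agreement is
    not transitive, so a real pipeline would need a more careful grouping step.
    """
    programs = list(calls_by_program)
    concordant, discordant = [], []
    for prog in programs:
        for call in calls_by_program[prog]:
            support = sum(
                any(agree(call, other) for other in calls_by_program[p])
                for p in programs
            )
            if support == len(programs):
                if prog == programs[0]:
                    concordant.append(call)
            else:
                discordant.append((prog, call))
    return concordant, discordant


# Hand-made example predictions; coordinates and anticodons are invented.
calls = {
    "tRNAscan-SE": [Call(100, 175, "+", "GCA"), Call(500, 572, "-", "TTC")],
    "ARAGORN": [Call(100, 175, "+", "GCA"), Call(500, 572, "-", "TTC"),
                Call(900, 973, "+", "CAT")],
    "tRNAfinder": [Call(101, 175, "+", "GCA"), Call(500, 572, "-", "CTC")],
}
concordant, discordant = triage(calls)
print(concordant)
for prog, call in discordant:
    print("review:", prog, call)
```

In this toy input only the gene at positions 100 to 175 is supported by all three programs; the remaining calls disagree on the anticodon or are reported by a single program and would go to manual checking.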
Use of Bacterial Artificial Chromosomes in Metagenomics Studies, Overview

Lingling Wang (1), Shamima Nasrin (2), Mark Liles (2) and Zhongtang Yu (3)
(1) Department of Animal Sciences, The Ohio State University, Columbus, OH, USA
(2) Department of Biological Sciences, Auburn University, Auburn, AL, USA
(3) Department of Animal Sciences, Environmental Science Graduate Program, The Ohio State
University, Columbus, OH, USA

Synonyms

Environmental genomic libraries; Metagenomic libraries; Metagenomic studies

Definition

Cloning of large metagenomic DNA fragments using bacterial artificial chromosome (BAC)
vectors provides an opportunity to study the functional diversity and to harness the
metabolic potential of diverse microorganisms in various microbiomes. This technology is
especially relevant to the study of biosynthetic pathways encoded by large gene clusters that
would not be cloned as contiguous regions using other cloning vectors.

Introduction

Despite progress in culturing a greater diversity of the members of microbial assemblages,
the vast majority of prokaryotic taxa are not readily cultured in the laboratory and remain
largely unknown. Because of this “great plate count anomaly,” microbiologists are forced to
use cultivation-independent alternative approaches to access and study the functional
diversity in microbiomes. Direct cloning of metagenomic DNA fragments into BAC clone
libraries and subsequent analyses provide opportunities to investigate the genetic makeup and
potential of a microbiome. Metagenomic DNA fragments can be cloned into plasmid, cosmid, or
fosmid (a low-copy-number cosmid that is based on the F-factor replicon of Escherichia coli)
vectors, but these cloning vectors can only carry relatively small DNA fragments: plasmids,
<20 kb; cosmids, 37–52 kb; and fosmids, <42 kb. On the other hand, a BAC vector can carry a
much longer DNA fragment (up to 300 kb). Although BAC libraries are technically more
difficult to construct than other types of clone libraries, BAC clone libraries have several
advantages. First, to clone a certain amount of metagenomic DNA, a smaller number of BAC
clones are needed compared to libraries constructed using other cloning vectors. Second, the
production of many bioactive compounds (e.g., antibiotics, multimodular polyketide, or
nonribosomal peptide) is encoded by a gene cluster whose length typically exceeds what can be
carried by a plasmid, cosmid, or fosmid vector
(Piel 2011). Third, a cis-regulatory element is often required for the expression of a gene
or operon. However, the inserts cloned into a plasmid, cosmid, or fosmid vector might not
allow for the cloning of both a gene or operon and its cis-regulatory element into the same
clone. The lack of a cis-regulatory element can prevent the metabolic phenotype of interest
from being detected during activity-based screening of clone libraries. Fourth, cloned
metagenomic DNA fragments are less stable in a plasmid or a cosmid vector than in a BAC
vector. Therefore, BAC libraries have unique utility in metagenomic studies of microbiomes.
In this entry, the construction, screening, bioinformatic and biochemical analysis, and
utilization of BAC clone libraries are overviewed.

Isolation of High Molecular Weight DNA

Because the purpose of BAC cloning is to recover contiguous regions of microbial genomes, the
DNA fragments recovered from a microbiome sample should be significantly longer than those
required for fosmid or other cloning vectors. Many studies have compared DNA extraction
methods suitable for metagenomic applications (Delmont et al. 2011), and this section will
review those methods that are appropriate for BAC cloning.

Extraction of high molecular weight (HMW) DNA from a microbiome is always a significant issue
due to the inherent conflict between the need to recover DNA from diverse microorganisms
while preserving DNA integrity. Direct DNA extraction methods, by which DNA is recovered
directly from an environmental sample, provide high DNA yields from phylogenetically diverse
microorganisms; yet the vast majority of the DNA is sheared and likely less than 100 kb. In
contrast, indirect DNA extraction methods, in which microbial cells are first isolated from
sample matrices prior to DNA extraction, result in a lower DNA yield, but the resultant DNA
is of significantly greater molecular weight compared to direct extraction methods. However,
indirect DNA extraction can produce less representative metagenomic DNA because some
microbial cells can be difficult to isolate from the sample matrices. The choice of a direct
or an indirect extraction method depends on the nature of the environmental sample. For
example, for a sample with high levels of contaminants, such as soils and sediments, there
may be an advantage in using indirect extraction to significantly reduce humic acid levels
co-extracted with the DNA. However, indirect extraction methods may not yield sufficient DNA
from samples with lower cellular abundance, such as sediments and aquifers. Therefore, it is
important to use an empirical approach and evaluate multiple methods for HMW DNA extraction.

Several protocols have been developed to overcome the difficulties in extracting HMW DNA for
metagenomic studies. Stein et al. (1996) introduced an innovative technique that embeds
microbial cells in an agarose gel matrix. These embedded cells are lysed in situ to prevent
mechanical shearing of the microbial DNA. The Nycodenz extraction technique is another
technique that is used to avoid mechanical shearing of HMW DNA (Berry et al. 2003). This
technique prevents physical damage to bacterial cells by cushioning them during the
high-speed centrifugation step. Some environmental contaminants are also removed during the
centrifugation.

The use of multiple extraction methods for a single sample may increase the ultimate yield
and phylogenetic representation of the recovered metagenomic DNA. The adoption of a
particular DNA extraction method should be evaluated by the DNA yield and the representation
of diverse microbial genomes within the recovered DNA. The diversity represented in the
recovered DNA can be assessed by molecular phylogenetic analysis based on sequencing of the
universally conserved 16S rRNA gene. Although phylogenetic diversity analysis can be
influenced by many factors, including biases inherent in PCR, it is a rapid analysis to
assess the phylogenetic composition of samples and libraries. Next-generation sequencing
(NGS) of 16S rRNA gene amplicons permits a greater depth of
sequencing coverage compared to traditional Sanger sequencing technology.

Subsequent to extraction, the metagenomic DNA may need to be purified to achieve efficient cloning. Purification is especially needed for metagenomic DNA extracted from soil samples because soil can contain high levels of humic acids and other phenolic compounds that tend to be co-extracted with the DNA and hamper downstream processes (e.g., restriction digestion and cloning of the DNA). Multiple approaches have been developed to purify metagenomic DNA, including the use of CTAB, hydroxyapatite column purification, or formamide denaturation. Formamide can be more effective in removing some contaminants from HMW DNA due to its inherent capability to denature DNA and remove contaminants that are tightly intercalated between the DNA bases and strands (Liles et al. 2008). Formamide or polyvinylpolypyrrolidone (PVPP) can also help remove nuclease contaminants. Hydroxyapatite chromatography has advantages over other purification approaches because it can efficiently fractionate nucleic acids with different conformations (i.e., dsDNA, ssDNA, dsRNA, and ssRNA) while helping remove sample contaminants (Andrews-Pfannkoch et al. 2010). This differential elution of nucleic acids can easily be accomplished by changing the phosphate concentration of the elution buffers at a constant temperature or by increasing both the temperature and the phosphate concentration.

Fragmentation and Size Selection of Metagenomic DNA

Fragments of metagenomic DNA of a uniform length of about 150 kb are prepared either enzymatically or mechanically. Enzymatic fragmentation relies upon partial restriction digestion. However, the extent of partial digestion is difficult to control. Additionally, partial restriction digestion can result in nonrandom DNA fragmentation and a significant reduction in DNA size. An alternative to partial restriction digestion is mechanical shearing, which results in random fragmentation of metagenomic DNA. This method has been demonstrated with multiple eukaryotic genomes and recently applied to the construction of soil BAC libraries (Kakirde and Nasrin et al., unpublished data).

A major challenge in constructing high-quality BAC libraries is to retain large DNA fragments while removing small ones that can be preferentially cloned. Multiple strategies are available to recover and clone large metagenomic DNA fragments. Pulsed field gel electrophoresis (PFGE) is the most frequently used method for size selection of partially digested or sheared metagenomic DNA. Alternatively, agarose gel electrophoresis can be used as it can provide better resolution of HMW DNA. Because sucrose gradient centrifugation can only resolve DNA fragments of 5–60 kb, it is not a suitable method to separate HMW DNA fragments for BAC library construction.

Construction of BAC Clone Libraries

Several BAC vectors have been developed for metagenomic cloning that enable transfer and expression of cloned DNA in multiple heterologous hosts. The initial development of the pBELOBAC11 vector (Kim et al. 1996) was instrumental in permitting BAC cloning of environmental DNA (Rondon et al. 2000). However, the inherent restriction of the pBELOBAC11 vector to a single copy within an E. coli host cell was a severe limitation for downstream analysis of the resultant BAC libraries. The development of inducible-copy control BAC vectors by inclusion of an RK2 origin of replication enabled more facile BAC library construction (Wild et al. 2002), and different derivatives of these vectors have been constructed and are commercially available (Lucigen, Middleton, WI; Epicentre, Madison, WI). The inducible-copy control BAC vectors were further modified by including the complete mini-RK2 replicon within the BAC vector to enable shuttling of BAC clones into multiple heterologous hosts (Kakirde et al. 2011), greatly expanding the host range for heterologous expression analysis of BAC libraries.
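The size-selection step described above can be illustrated with a few lines of Python. This is only a minimal sketch: the fragment lengths are invented, and the 100–200 kb window around the roughly 150 kb target is an assumption chosen for illustration rather than a value taken from any protocol. The function names `size_select` and `summarize` are hypothetical.

```python
def size_select(lengths_kb, min_kb=100.0, max_kb=200.0):
    """Keep only fragments inside the excised gel window (window is an assumption)."""
    return [kb for kb in lengths_kb if min_kb <= kb <= max_kb]


def summarize(lengths_kb):
    """Return the number of fragments, their summed length, and their mean length."""
    count = len(lengths_kb)
    total_kb = sum(lengths_kb)
    mean_kb = total_kb / count if count else 0.0
    return {"count": count, "total_kb": total_kb, "mean_kb": mean_kb}


# Invented fragment lengths (kb) standing in for a sheared DNA preparation.
fragments_kb = [8, 12, 25, 40, 60, 95, 120, 150, 155, 180, 240]

kept = size_select(fragments_kb)
print("before selection:", summarize(fragments_kb))
print("after selection: ", summarize(kept))
```

In this toy data set most fragments are small, which mirrors the concern that small fragments would be cloned preferentially if they were not removed before ligation.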
The size-selected metagenomic DNA fragments are ligated into an appropriate BAC vector using a DNA ligase. DNA fragments prepared by partial restriction digestion can be directly ligated to the chosen BAC vector that has been linearized with the same restriction enzyme. If randomly sheared DNA fragments are cloned, however, both ends of each fragment need to be repaired to blunt ends. To increase cloning efficiency, adaptors of an appropriate restriction enzyme can also be ligated to the repaired ends. After ligation, the insert-carrying BAC vector is transformed into highly competent E. coli cells, typically using electroporation. It should be noted that even if a shuttle vector capable of transfer to other hosts is used, and the ultimate expression host is not E. coli, it is advisable to construct the BAC library in E. coli first to take advantage of its high transformation efficiency. It should also be noted that this is the most difficult step in BAC cloning and that the larger number of fosmid libraries reported in the literature compared to BAC libraries is merely a reflection of the more facile fosmid cloning. Once transformants are isolated on the respective antibiotic selection plates, a representative number of colonies should be evaluated for the percentage of BAC clones with inserts and the average insert size.

If the library statistics are satisfactory for further analysis, then colonies can be archived. The archiving of a BAC library, which typically consists of a vast number of clones, is an important step since researchers frequently need to access clones for confirmation, verification, screening, and other analyses. BAC clones are usually suspended in a cryoprotectant medium, typically 10–15 % glycerol or 8 % dimethyl sulfoxide (DMSO) in the original growth medium of the bacterial host. The BAC clones are usually grown in 96-well or 384-well format for high throughput handling and screening.

Screening of BAC Libraries

Unlike shotgun metagenomic sequencing, BAC libraries provide the opportunity to identify and access the functional diversity that can be determined phenotypically. The strategies for activity-based screening of BAC libraries depend on the nature of the compounds or enzymes of interest and should be designed carefully (Taupp et al. 2011). The advent of new technologies has enabled the application of high throughput screening (HTS) approaches to identify clones with the desired phenotypes (producing certain compounds or enzymes) from a large number of BAC clones. The analysis of cell lysates, DNA, or supernatants from BAC clones can be performed with a great diversity of screening targets to identify the clones carrying the desired activities or genetic targets (Lakhdari et al. 2010). For example, BAC clones expressing the desired activity can be identified by adding an indicator substrate of the enzymes of interest to the growth medium. Depending on the nature of the assay, the active clones may be detected by visual inspection of an indicator agar plate, flow cytometry, a spectrophotometer, or a fluorescent microtiter plate reader (Taupp et al. 2011).

One of the first proof-of-concept studies for identifying a functional natural product from a BAC library was accomplished via sequence-based screening. In the seminal paper of Béjà et al. (2000), the first bacterial proteorhodopsin, a light-driven proton pump, was discovered by identifying specific BAC clones that contained a 16S rRNA gene sequence and then identifying the other linked functional genes contained within the same clone. While this study used a fosmid library, this approach is equally applicable to BAC libraries and has been used to describe some of the functional diversity associated with as-yet-uncultured bacteria (Liles et al. 2008). This approach is inherently limited to the metagenomic DNA sequences that are immediately adjacent to rRNA operons, but it was nonetheless a useful method in the initial exploration of BAC libraries.

Enzymatic activities expressed from BAC clones may be identified via many different methods, including colorimetric or fluorescent assays, as well as indicator media (Taupp et al. 2011). For example, lipase-producing BAC clones can be detected on LB
agar plates supplemented with 1.0 % tributyrin by formation of a halo around individual clones due to tributyrin hydrolysis. Cellulases and xylanases can be detected using agar plates supplemented with carboxymethyl cellulose (CMC) and soluble xylan, respectively. Other enzymatic activities identified using a BAC approach include but are not limited to esterase, alcohol dehydrogenase, amidase, amylase, protease, chitinase, dehydratase, and β-lactamase (Lorenz and Eck 2005).

Identification of secondary metabolites expressed from a heterologous host is dependent upon having large-insert BAC clones, along with suitable transcriptional and translational machinery. The best examples of a functional metagenomic approach to identify antimicrobial activities were the isolation of turbomycin A and B (Gillespie et al. 2002), the identification of antibacterial activities expressed in cosmid libraries in different proteobacterial hosts (Craig et al. 2010), and the identification of gene clusters involved in antifungal activity (Chung et al. 2008).

In addition to activity-based screening, sequence-based screening is a widely used approach to find genes or gene clusters involved in particular functions within a BAC library. For example, an alternative to functional expression of libraries to identify antibacterial-active clones is to first identify clones that contain known pathways involved in secondary metabolite synthesis and then to express these pathways in a related host, permitting isolation of novel analogs of known metabolites. This has been demonstrated previously in the case of polyketide synthases (PKS) and nonribosomal peptide synthetases (NRPS). Feng et al. first identified the type II PKS biosynthetic system in two different cosmid clones by sequence-based homology screening of a cosmid library (Feng et al. 2010). Their sequence-based screening followed by heterologous expression of a type II PKS biosynthetic gene cluster identified three new fluostatins that were previously uncharacterized in cultured species (Feng et al. 2010). This approach is equally applicable to screening BAC libraries.

Sequencing and Bioinformatic Analysis of BAC Libraries

The inserts of BAC clones that exhibit certain metabolic activities can be sequenced to determine the coding sequence, structural and regulatory features of the gene(s), and potential phylogenetic markers. Three common sequencing strategies are typically used.

Subcloning and Sequencing of Individual BAC Clones
The insert of a BAC clone is first fragmented mechanically or enzymatically using a restriction enzyme. The resultant smaller inserts can be cloned into a plasmid vector (e.g., pUC vector) and then sequenced individually using the Sanger sequencing technology (see the “Subcloning” section below). The full length of the BAC insert can be assembled from the sequenced subclones (see “Sequence Assembly” section below). Because it is time-consuming to subclone and sequence a large number of BAC clones, this approach is primarily used to sequence one or a few BAC clones of interest.

End Sequencing
Both ends of a BAC clone insert can be sequenced using the Sanger sequencing technology and primers that specifically anneal to the vector regions that flank the insert (Pope and Patel 2008). This approach only allows sequencing of a short region at each end of a BAC clone, and thus only limited sequence information can be determined. In contemporary studies, end sequencing is primarily used to match a BAC clone with its corresponding sequence that is determined using shotgun sequencing of pooled BAC clones (see the “Shotgun Sequencing” section below). The genetic information of each BAC clone can then be analyzed with respect to its phenotypic activities observed during activity screening.

Shotgun Sequencing of Pooled Select BAC Clones
Recent advances in DNA sequencing technologies, especially the NGS technologies, have made
it more cost-effective and efficient to sequence pooled BAC clones of interest in a shotgun manner. The 454 FLX Titanium (Roche) is the most suitable NGS currently available to achieve effective shotgun sequencing of pools of BAC clones because it can generate relatively long sequence reads (average 500 bp). The Illumina systems (Illumina) have also been used even though they produce shorter (150 bp) sequence reads. Ion Torrent (Life Technologies) is a new NGS that can generate 260 bp reads and should be another suitable NGS technology for the aforementioned shotgun sequencing.

Pooled BAC clones are first randomly fragmented into small fragments, with the length of the fragments depending on the specific NGS technology used. Adapters (short oligonucleotides) may need to be ligated to the ends of each fragment to facilitate sequencing. Unique barcodes (short oligonucleotides) can be incorporated into the adapter for each BAC clone. In that case, sequence reads for individual BAC clones can be separated based on the unique barcodes. However, it is often cost-prohibitive to barcode individual BAC clones when a large number of them are sequenced. Therefore, pooled BAC clones are typically sequenced in a shotgun manner. However, assembly of complete BAC inserts may be problematic using a pooled BAC clone format, depending on the degree of coverage and the sequence similarity among the BAC clone DNA inserts. Details on these NGS technologies can be found in the respective entries of this encyclopedia.

It should be pointed out that except for the 454 FLX Titanium system, the other aforementioned NGS technologies produce short reads. Very high (50× or greater) coverage is needed to assemble the individual NGS reads into long contigs and the full-length inserts. Alignment analysis of the end sequences (see End sequencing above) and the assembled BAC insert sequences can help match individual BAC clones with their respective sequences. To ensure high-quality sequences, BAC clones need to be prepared free of E. coli chromosomal DNA.

Sequence Assembly

Individual sequence reads from the same BAC insert are linked together using de novo sequence assembly, a bioinformatic process, to form contigs and to reconstruct the BAC insert sequence without reference to any genome sequence. Most software tools used in de novo sequence assembly (referred to as sequence assemblers) seek overlapping sequences among individual sequence reads and then merge them together based on the overlapping sequences. A number of sequence assemblers are available that use different strategies. Sequence reads generated by subcloning and the Sanger technology are relatively long (500 bp or longer), and only low coverage (10×) is needed to assemble complete BAC inserts. Such long reads can be assembled by alignment against each other and merged based on overlapping sequences using programs such as Phrap (Gordon 2004) and CAP3 (Huang and Madan 1999). Assembling short sequence reads generated by NGS technologies requires different strategies because it is computationally intensive to align the large numbers of short sequence reads required to assemble long contigs or full-length BAC inserts. Moreover, high sequencing coverage (at least 50×, depending on the read length) is needed to compensate for the short sequence reads. One alternative approach is to use a graph-based algorithm (e.g., a de Bruijn graph built from k-mers) to detect certain short fragments to facilitate assembling short sequence reads. Velvet (Zerbino and Birney 2008) is one bioinformatic program that uses the de Bruijn graph to assemble short sequence reads. Other popular sequence assemblers used to assemble NGS reads include SSAKE, SHARCGS, VCAKE, Euler, SOAPdenovo, ABySS, and ALLPATHS (Miller et al. 2010).

Sequence reads generated by the 454 FLX Titanium system on average reach 500 bp in length. Such a read length approaches that of Sanger sequencing reads. Two bioinformatic programs, Newbler (www.454.com) and Arachne (Batzoglou et al. 2002), are designed to assemble sequence reads generated by 454 systems.
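The coverage figures and the k-mer graph idea discussed above can be made concrete with a small Python sketch. It is a toy illustration under stated assumptions, not the algorithm used by Velvet or any other assembler named here: the 150 kb insert size is taken from the fragment length mentioned earlier in this entry, the three reads are invented and error-free, and the function names `reads_needed`, `de_bruijn_graph`, and `walk_unambiguous` are hypothetical.

```python
import math
from collections import defaultdict


def reads_needed(insert_bp, read_bp, coverage):
    """Smallest read count so that reads * read_bp / insert_bp >= coverage."""
    return math.ceil(coverage * insert_bp / read_bp)


def de_bruijn_graph(reads, k):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffix of every k-mer in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Read counts for one 150 kb insert at the coverage levels quoted in the text.
print(reads_needed(150_000, 500, 10))   # long reads, low coverage
print(reads_needed(150_000, 150, 50))   # short reads, high coverage

# Three invented, overlapping, error-free reads from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, k=4)
print(walk_unambiguous(graph, "ATG"))
```

Real data contain sequencing errors and repeats, which create branches in the graph; production assemblers add error correction and graph simplification steps that this sketch deliberately omits.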
Because each of these two sequence assemblers has its own preference and the resulting assemblies are somewhat different, contigs generated from each of them can be reassembled using an overlap-based algorithm, such as Minimus of the AMOS package (Sommer et al. 2007), to further improve the accuracy and extend the lengths of the contigs assembled.

Comparative sequence assembly involves alignment of sequence reads against reference genome(s). It is rarely applied to assembling shotgun BAC sequence reads because few reference genomes are available in most cases. However, as more and more microbial genomes and metagenomes are sequenced in some habitats, such as the human gut, comparative sequence assembly may be used to facilitate assembling shotgun BAC sequence reads. The Reference Mapper from Roche (www.roche.com), Eland from Illumina (www.illumina.com), Corona from ABI (www.appliedbiosystems.com), and some shareware bioinformatic programs (e.g., SOAP, MAQ, and segemehl) can be used in comparative sequence assembly (Kunin et al. 2008).

Prediction and Annotation of Open Reading Frame (ORF)

The assembled contigs provide the opportunity to determine the putative genes, their structure and organization, and their putative function. Putative genes are defined by open reading frames (ORFs) that encode proteins of a minimum length (>50 amino acids). The early bioinformatic programs were intended to predict ORFs from sequenced genomes by either recognizing specific genome signals or comparing them to protein or cDNA databases. These specific genome signals are usually species specific. Hence, these bioinformatic programs have limited utility when applied to metagenomic sequence data. In recent years, several programs and database environments have been developed to specifically predict ORFs from metagenomic sequence data. The commonly used ones include GeneMark.hmm, Metagene annotator (MGA, http://metagenomics.anl.gov/), and Orphelia (Yok and Rosen 2011). Orphelia differs from the other two ab initio bioinformatic programs in combining a similarity-based algorithm with a composition-based method. However, the specificity and sensitivity of these programs remain to be improved. Furthermore, sequencing and assembly errors significantly affect the accuracy of gene prediction. For instance, GeneMark.hmm is very sensitive to insertions and deletions, which are the main types of sequencing error of the 454 FLX system, thus producing false-positive and false-negative predictions caused by frame shifts. As the algorithms of ORF-finding programs continue to improve, future bioinformatic tools should improve in sensitivity and accuracy in finding ORFs in BAC clone libraries and other metagenomic sequence data.

A predicted ORF can then be annotated by comparing it to databases, most of which maintain a publicly accessible online server. The tentative function of an ORF is typically inferred using in silico bioinformatic analysis by comparing it to comprehensive databases, such as GenBank (www.ncbi.nlm.nih.gov/), the EMBL Nucleotide Sequence Database (EMBL-bank, www.ebi.ac.uk/embl/), and the DNA Data Bank of Japan (DDBJ, www.ddbj.nig.ac.jp/). Certain functional and structural features of ORFs can also be identified using specialty databases, such as KEGG (www.genome.jp/kegg/) for prediction of metabolic functions, SignalP (www.cbs.dtu.dk/services/SignalP/) for prediction of the presence of signal peptides, and the TransMembrane Protein DataBase (pdbtm.enzim.hu/) for the presence of transmembrane domains. It should be noted that although annotation of ORFs has significantly improved over the past 10 years owing to the development of software tools and databases and the accumulation of sequenced and annotated genomes and metagenomes, the gene function predicted by in silico analysis sometimes does not represent its actual biological characteristics. Moreover, inaccurate annotations can
be cascaded and amplified in databases. Researchers should always keep in mind such potential discrepancies when annotating ORFs identified in BAC libraries and other metagenomic sequence data.

Phylogenetic Analysis
Unlike BAC libraries constructed from pure cultures, BAC libraries from microbiome samples are constructed from hundreds or even thousands of species of diverse microbes. Thus, one of the primary analyses to be performed on BAC libraries is evaluating the microbiome composition. It is straightforward to infer the taxon from which a BAC sequence is derived if a phylogenetic marker, such as an SSU or LSU rRNA gene, can be found. However, most BAC clones lack such a phylogenetic marker. Alternative methods are available to taxonomically predict the origin of BAC clones based on a number of features, such as sequence composition or homology (Kunin et al. 2008). Composition-based software focuses on sequence composition signatures, primarily oligonucleotide frequencies, to distinguish contigs from each other. Phymm (www.cbcb.umd.edu/software/phymm/) and TETRA (www.megx.net/tetra/) are commonly used bioinformatic programs of this category. On the other hand, homology-based methods predict the taxonomic origin of BAC clones by searching for homologous sequences available in databases. Representative homology-based bioinformatic programs include BLAST (blast.ncbi.nlm.nih.gov/), MEGAN (ab.inf.uni-tuebingen.de/software/megan/), and SIGNATURE (www.cmbi.ru.nl/signature/). Some hybrid classifiers, such as PhymmBL, which is a combination of Phymm and BLAST (Brady and Salzberg 2009), are also available and can improve taxonomic assignment accuracy. It should be cautioned that although some short sequences can be classified accurately, reliable prediction generally requires long reads or contigs. Therefore, precise and accurate taxonomic classification of BAC clones depends on a careful selection of sequences and assembly strategy.

Biochemical Characterization of Gene Functions of Interest

Subcloning
The proteins or enzymes encoded by a gene or gene cluster of a particular BAC clone can be biochemically characterized following subcloning and overexpression. Subcloning entails cloning the gene(s) of interest from the selected BAC clone into another vector, usually an expression vector (e.g., a pET vector). The gene(s) can either be excised from the BAC clone using an appropriate restriction enzyme or amplified using PCR. In the latter case, a pair of primers is needed that anneal to the sequences flanking the target gene(s). However, neither method may be suitable in some cases, for example, when no appropriate restriction enzyme is available that does not cut within the target gene(s) or when the target gene(s) is too long to be PCR amplified. Recently, a homologous recombination-based subcloning approach, referred to as BAC recombineering (Warming et al. 2005), has been used to overcome the aforementioned subcloning obstacles.

Overexpression and Characterization of Expressed Gene Products

The enzyme encoded by the gene of interest can be biochemically characterized if overexpressed and purified. Briefly, a subcloned gene is transformed into an appropriate host, which is typically E. coli, or another expression system, depending on the characteristics of the protein to be expressed. It should be noted that some genes in BAC clones might not be expressed successfully because the chosen expression system lacks the suitable transcription or translation systems. Another expression host may be evaluated for its ability to express the gene(s) of interest using a shuttle vector. In addition, the expressed protein may be toxic to the expression host. Recently, fusion proteins and co-expression systems have been incorporated into the expression strategies to overcome some of the limitations mentioned above. A fusion protein can aid in solubilization
and/or purification of the overexpressed proteins. Commonly used fusion proteins include glutathione S-transferase (GST), maltose-binding protein (MBP), and histidine tags.

In some cases, a protein can be expressed but cannot be correctly folded due to the lack of an appropriate chaperone protein in the heterologous host. To overcome this obstacle, a co-expression vector that allows coexistence of multiple expression vectors and expression of a chaperone protein can be used to help the correct folding of heterologous proteins. The Duet vectors (Novagen) are among the systems used to co-express genes of interest from BAC clones. Upon induction, sufficient quantities of the expressed protein can be obtained to determine its characteristics, such as substrate range, product, temperature and pH optima, kinetics, and stability.

Prediction of Protein Structure

The three-dimensional (3D) structure of a protein provides invaluable insights into the molecular basis of its functions. Additionally, detailed knowledge of the spatial arrangement of key amino acid residues within the overall 3D structure helps in designing experiments to characterize the protein and understand the molecular mechanisms of its functions. The structure of a purified protein can be determined experimentally using X-ray crystallography, high-resolution electron microscopy, or nuclear magnetic resonance (NMR) spectroscopy. Detailed information on these types of characterization is beyond the scope of this entry but can be found in other entries of this encyclopedia. Although experimental methods can help determine the actual 3D structure of a protein, they are expensive, time-consuming, and not always applicable. Thus, alternative bioinformatic programs are often used to predict the structure of a protein from its amino acid sequence. Some of the commonly used programs/servers include I-TASSER (zhanglab.ccmb.med.umich.edu/I-TASSER/), Modeller (salilab.org/modeller/), and Phyre2 (www.sbg.bio.ic.ac.uk/phyre2/).

Functions and Bioactive Compounds Identified by BAC

A variety of enzymes and other bioactive compounds have been identified through BAC libraries. These include xylanases, cellulases, lipases, proteases, amylases, esterases, and type II polyketide synthases. Examples of bioactive compounds discovered from BAC libraries include antibiotics, patellamide D, and ascidiacyclamide. New antibiotic resistance genes have also been found in BAC libraries. Future applications of BAC libraries will probably lead to the discovery of novel compounds or enzymes useful for medical or technological purposes.

Summary

Metagenomic studies using BAC clone libraries allow access to metabolic activities and biocatalysts from uncultured microbes. Unlike shotgun deep sequencing, BAC libraries provide a unique opportunity to gain access to metabolic activities and the underpinning enzymes involved in synthesis or biodegradation of many useful compounds and biocatalysts. Functional diversity archived in BAC libraries can also be accessed repeatedly for various studies, including detailed characterization of the enzymes and metabolic activities. Furthermore, BAC libraries enable access to and capture of large gene clusters that exceed the capacity of fosmid vectors. Thus, BAC libraries complement both shotgun deep DNA sequencing and fosmid libraries in metagenomic studies of microbiomes.

Cross-References

▶ A De Novo Metagenomic Assembly Program for Shotgun DNA Reads
▶ Fosmid System
▶ KEGG and GenomeNet, New Developments, Metagenomic Analysis
▶ Phylogenetics, Overview
References

Andrews-Pfannkoch C, Fadrosh DW, Thorpe J, Williamson SJ. Hydroxyapatite-mediated separation of double-stranded DNA, single-stranded DNA, and RNA genomes from natural viral assemblages. Appl Environ Microbiol. 2010;76:5039–45.
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES. ARACHNE: a whole-genome shotgun assembler. Genome Res. 2002;12:177–89.
Béjà O, Aravind L, Koonin EV, Suzuki MT, Hadd A, Nguyen LP, Jovanovich SB, Gates CM, Feldman RA, Spudich JL, Spudich EN, DeLong EF. Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science. 2000;289:1902–6.
Berry AE, Chiocchini C, Selby T, Sosio M, Wellington EMH. Isolation of high molecular weight DNA from soil for cloning into BAC vectors. FEMS Microbiol Lett. 2003;223:15–20.
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6:673–6.
Chung EJ, Lim HK, Kim JC, Choi GJ, Park EJ, Lee MH, Chung YR, Lee SW. Forest soil metagenome gene cluster involved in antifungal activity expression in Escherichia coli. Appl Environ Microbiol. 2008;74:723–30.
Craig JW, Chang FY, Kim JH, Obiajulu SC, Brady SF. Expanding small-molecule functional metagenomics through parallel screening of broad-host-range cosmid environmental DNA libraries in diverse proteobacteria. Appl Environ Microbiol. 2010;76:1633–41.
Delmont TO, Robe P, Clark I, Simonet P, Vogel TM. Metagenomic comparison of direct and indirect soil DNA extraction approaches. J Microbiol Methods. 2011;86:397–400.
Feng Z, Kim JH, Brady SF. Fluostatins produced by the heterologous expression of a TAR reassembled environmental DNA derived type II PKS gene cluster. J Am Chem Soc. 2010;132:11902–3.
Gillespie DE, Brady SF, Bettermann AD, Cianciotto NP, Liles MR, Rondon MR, Clardy J, Goodman RM, Handelsman J. Isolation of antibiotics turbomycin A and B from a metagenomic library of soil microbial DNA. Appl Environ Microbiol. 2002;68:4301–6.
Gordon D. Viewing and editing assembled sequences using consed. In: Baxevanis AD, Davison DB, editors. Current protocols in bioinformatics. New York: Wiley; 2004. p. 11.12.11–43.
Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res. 1999;9:868–77.
Kakirde KS, Wild J, Godiska R, Mead DA, Wiggins AG, Goodman RM, Szybalski W, Liles MR. Gram negative shuttle BAC vector for heterologous expression of metagenomic libraries. Gene. 2011;475:57–62.
Kim UJ, Birren BW, Slepak T, Mancino V, Boysen C, characterization of a human bacterial artificial chromosome library. Genomics. 1996;34:213–8.
Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev. 2008;72:557–78.
Lakhdari O, Cultrone A, Tap J, Gloux K, Bernard F, Ehrlich SD, Lefevre F, Dore J, Blottiere HM. Functional metagenomics: a high throughput screening method to decipher microbiota-driven NF-kappaB modulation in the human gut. PLoS One. 2010;5.
Liles MR, Williamson LL, Rodbumrer J, Torsvik V, Goodman RM, Handelsman J. Recovery, purification, and cloning of high-molecular-weight DNA from soil microorganisms. Appl Environ Microbiol. 2008;74:3302–5.
Lorenz P, Eck J. Metagenomics and industrial applications. Nat Rev Microbiol. 2005;3:510–6.
Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95:315–27.
Piel J. Approaches to capturing and designing biologically active small molecules produced by uncultured microbes. Annu Rev Microbiol. 2011;65:431–53.
Pope PB, Patel BK. Metagenomic analysis of a freshwater toxic cyanobacteria bloom. FEMS Microbiol Ecol. 2008;64:9–27.
Rondon MR, August PR, Bettermann AD, Brady SF, Grossman TH, Liles MR, Loiacono KA, Lynch BA, MacNeil IA, Minor C, Tiong CL, Gilman M, Osburne MS, Clardy J, Handelsman J, Goodman RM. Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol. 2000;66:2541–7.
Sommer DD, Delcher AL, Salzberg SL, Pop M. Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. 2007;8:64.
Stein JL, Marsh TL, Wu KY, Shizuya H, DeLong EF. Characterization of uncultivated prokaryotes: isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J Bacteriol. 1996;178:591–9.
Taupp M, Mewis K, Hallam SJ. The art and design of functional metagenomic screens. Curr Opin Biotechnol. 2011;22:465–72.
Warming S, Costantino N, Court DL, Jenkins NA, Copeland NG. Simple and highly efficient BAC recombineering using galK selection. Nucleic Acids Res. 2005;33:e36.
Wild J, Hradecna Z, Szybalski W. Conditionally amplifiable BACs: switching from single-copy to high-copy vectors and genomic clones. Genome Res. 2002;12:1434–44.
Yok NG, Rosen GL. Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinformatics. 2011;12:20.
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome
Kang HL, Simon MI, Shizuya H. Construction and Res. 2008;18:821–9.
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution

Thomas W. Schoenfeld and David Mead
Lucigen Corporation, Middleton, WI, USA

Introduction

High-temperature subterrestrial aquifers are vast ecosystems fueled solely by chemical reducing potential rather than solar radiation as is the case for surface life (Fournier 2005). The volume of the global thermal aquifer has been estimated as high as 10¹⁹ L (Gold 1992), with microbial and viral abundances approaching those of the oceans (Breitbart et al. 2004b). This study, previously reported in Pride and Schoenfeld (2008), Schoenfeld et al. (2008), and Heidelberg et al. (2009), examined planktonic viruses directly isolated from two mildly alkaline siliceous hot springs in Yellowstone National Park (YNP). With temperatures of 74 °C and 93 °C, life in these springs is comprised exclusively of bacterial and archaeal cells and viruses, all uniquely adapted to the temperature and chemistry extremes of the environment (Reysenbach et al. 2002). The springs in these water-driven systems are direct outflows of the thermal aquifer and not secondarily heated surface water, as is the case for vapor-driven systems (Fournier 2005). In this respect they are distinct from acidic springs, mud pots, and other thermal features that have provided many of the published thermophilic virus samples. Because the springs are direct outflows of the aquifers, conceivably, viruses in these springs may proliferate not only at the surface but deeper in the vent as well, where increased pressures and temperatures as high as 180–270 °C are found at depths of 100–550 m throughout the caldera of YNP (Fournier 2005). If viruses proliferate in the subsurface aquifer, hot springs separated by kilometer distances that share common water sources may also share viral populations.

The roles of viruses in the ecology of hydrothermal environments have not been studied in detail, although they appear to play a role in host mortality and carbon cycling (Breitbart et al. 2004b) and are probably the only predators. In better studied marine environments, an estimated 10³⁰ viruses in the world’s oceans (Suttle 2007) may comprise several hundred thousand different types (Angly et al. 2006) and are responsible for a significant proportion of microbial mortality and thus have a profound influence on carbon and other nutrient cycles (Suttle 2007). Viruses also may be important vehicles for lateral gene transfer via lysogeny and transduction and probably promote diversity by preferentially lysing the most abundant species (Weinbauer and Rassoulzadegan 2004). Analyses of viral metagenomes (Cann et al. 2005; Angly et al. 2006; Bench et al. 2007) and cultured viral genomes (Pedulla et al. 2003; Kwan et al. 2005) have consistently shown that a minority of these sequences have detectable similarity to sequences in GenBank and very few are similar to known viruses. In spite of extensive sequencing of marine virus metapopulations, only a few small RNA genomes of 5–10 kb have been assembled (Culley et al. 2007), presumably due to the extreme viral diversity that confounds the assembly of viral genomes (see Chaps. 2–10, Vol. I).

Enrichment cultivation has provided most of the knowledge of thermophilic viruses (defined here as those growing at >70 °C). Since the first reports of thermophilic viruses (Sakaki and Oshima 1975; Martin et al. 1984), hundreds of bacteriophages (Yu et al. 2006), dozens of crenarchaeal viruses (reviewed in Snyder et al. 2003; Prangishvili and Garrett 2005), and one euryarchaeal virus (Geslin et al. 2003) have been isolated from thermal springs and vents around the world. Cultivated Thermus bacteriophages belong to four morphological families: Myoviridae, Siphoviridae, Tectiviridae, and Inoviridae (Yu et al. 2006). Their morphologies and the available genomic sequences (Naryshkina et al. 2006) suggest similarity to mesophilic bacteriophages. Most known thermophilic bacteriophages appear to be lytic, although this could be
biased by the method of their discovery (Yu et al. 2006). Cultivated thermophilic crenarchaeal viruses infect the genera Sulfolobus, Acidianus, Pyrobaculum, and Thermoproteus. Morphologies and genome content suggest crenarchaeal viruses are unrelated to viruses of Euryarchaeota, Bacteria, or Eukarya (Prangishvili et al. 2006a). All of the cultivated crenarchaeal viruses proliferate as chronic, nonlytic infections.

While enrichment cultures have been highly informative in the study of thermophilic viruses, important contextual information such as relative abundance, diversity, and distribution is lost. Furthermore, these analyses exclude the majority of viruses that are not readily cultivated (Snyder et al. 2004). No viral cultivation study fully replicates the temperature and pressure extremes and the chemistries that characterize the subsurface vents, which limits cultivation of not only viruses but hosts, as well. Unlike cellular life, no universal genetic marker (e.g., rDNA) exists for viruses. Direct metagenomic analysis of viruses from environmental samples circumvents these limitations and provides insight into biology, evolution, and adaptations to the environment and composition of viral assemblages through studies of their genomic sequences. No metagenomic analysis of waterborne viral populations in geothermal environments has been reported. In fact, planktonic life in thermal environments is under-explored in general, with microbial diversity studies of hot spring environments focused almost exclusively on sediments (Barns et al. 1994; Hugenholtz et al. 1998; Blank et al. 2002), adherent filaments (Reysenbach et al. 1994), or mats (Ward et al. 1998). The goal of this study was to profile the diversity, composition, and adaptations of viral assemblages in two hot springs of YNP based on metagenomic analysis of viruses inhabiting these environments.

Materials and Methods

Site Description and Sampling
Viral particles were isolated from Bear Paw (an unofficial name for LRNN374) (N44.5560955 W110.8347866) and Octopus (N44.5340836 W110.7978895) hot springs (Stoner et al. 2001). The temperatures of the hot springs are based on direct measurement on the day of the sampling. The pH values were determined by the USGS (McCleskey et al. 2004). Thermal water (400–600 L) was filtered using a 100 kD MWCO tangential flow filter (GE Healthcare). Viral particles were concentrated to 2 L, filtered through a 0.2 μm filter, and further concentrated to 100 mL using a 100 kD filter. Viral concentrates were imaged by transmission electron microscopy (TEM) (Leo 912AB operating at 80 kV). Direct viral enumeration was performed by epifluorescence microscopy (Noble and Fuhrman 1998). As recommended (Wen et al. 2004), samples were unfixed and were stained with SYBR Gold. The samples were stored at 4 °C for no more than 24 h before counting. Immediate freezing of samples in liquid nitrogen was not possible, so viral abundances may be somewhat underestimated.

Viral DNA Processing and Extraction
Viral concentrates were centrifuged at 12,000 rpm for 20 min, syringe-filtered using a 0.2 μm Acrodisc filter (Gelman), and further concentrated to 400 μL by filtration using a 30 kD MWCO Centricon spin filter (Millipore). Concentrates judged by epifluorescence microscopy to be substantially free of microbial cells were used for library construction. Viral concentrates were transferred to SM buffer (0.1 M NaCl, 8 mM MgSO4, 50 mM Tris-HCl pH 7.5) using a 30 kD MWCO spin filter. Benzonase endonuclease (Sigma, 10 U) was added, and the reactions were incubated for 30 min at 23 °C. EDTA (20 mM), SDS (0.5 %), and Proteinase K (100 U) were added, and the reactions were incubated for 3 h at 56 °C. NaCl (0.7 M) and CTAB (1 %) were added, and DNA was extracted with phenol/chloroform and ethanol precipitated.

Library Construction and Sequencing
Viral DNA was physically sheared to 3–6 kb using a HydroShear device (Genomic Solutions, MI). The ends were made blunt using the DNATerminator end repair kit (Lucigen, WI),
and the fragments were ligated to a double-stranded asymmetrical linker comprised of one phosphorylated blunt end (5′-GATGCGGCCGCTTGTATCTGATACTGCT-3′, Linker 1) and one non-phosphorylated staggered end (5′-GGAGCAGTATCAGATACAAGCGGCCGCATC-3′, Linker 2) to fix the primer in a defined orientation relative to the genomic DNA. Gel fractionation was used to remove unligated linkers and to isolate 3–6 kb fragments. These fragments were PCR amplified using Vent DNA polymerase (New England Biolabs, MA) and a primer targeted to Linker 1 (5′-AGCAGTATCAGATACAAGCGGCCGCATC-3′). Amplification products were gel purified again, inserted into the cloning site of the transcription-free pSMART vector (Lucigen), and used to transform E. coli 10G cells (Lucigen). Libraries were sequenced by the Department of Energy’s Joint Genome Institute (Walnut Creek, CA). The sequences were deposited in the GenBank trace archive and are retrievable using CENTER_NAME = “JGI” and SEQ_LIB_ID = “AOIX” for Bear Paw sequences and SEQ_LIB_ID = “APNO” and SEQ_LIB_ID = “ATYB” for Octopus sequences.

Bioinformatics
Viral metagenome sequencing reads were compared to the nonredundant (nr) protein database (GenBank) using BLASTx (Altschul et al. 1997). The 50 most significant BLASTx scores (E < 10⁻³) were recorded. The first occurrences of keywords in the output of the BLASTx were counted using PERL scripts written for this project, and the sequences were categorized by function. Sequences were assembled using the SeqMan® program (DNASTAR, WI) at a minimum of 50 % or 95 % identity over a minimum of 20 nt. Metagenome sequence libraries were compared to each other and to all the sequences in GenBank using tBLASTx (NCBI) with a cutoff of E < 10⁻³. Where indicated, the apparent open reading frames were identified and translated using the GeneMark program (Lukashin and Borodovsky 1998). These translations were compared to the nr protein database using the BLASTp program. The rank abundances were calculated using the PHAge Communities from Contig Spectrum (PHACCS) web utility located at http://phage.sdsu.edu/research/tools/phaccs/ (Angly et al. 2005) based on an average genome length of 50 kb.

Results and Discussion

Sampling Sites, Viral Abundance, and Morphologies

The two hot springs that provided samples are listed in Table 1. Bear Paw hot spring is in the River Group of the Lower Geyser Basin of YNP, while Octopus is about 5 km away in the White Creek area. Although the pH values of these hot springs are both circumneutral, the temperatures and apparent microflora differ widely. Bear Paw is significantly cooler and is characterized by orange sedentary microbial growth in the pool. Octopus water emerges at the boiling point at the local elevation of 2,300 m, with none of the orange growth. Octopus hot spring is well documented to support prolific microbial life (Brock and Brock 1968), and its geochemistry (McCleskey et al. 2004) is suitable for chemotrophic metabolism. Reported analyses based on rDNA sequences from filaments and sediments (Reysenbach et al. 1994; Blank et al. 2002) show that microbial diversity is relatively limited compared to moderate-temperature environments. These studies and others comparing lipid and isotope composition (Jahnke et al. 2001) suggest the microbes in the filaments and the sediments, close in proximity and temperature to the sample site in this study, are primarily Bacteria, with Aquificales and Thermotogales most highly represented. No detailed study of the planktonic life from Octopus or the chemical composition or life in Bear Paw has been published.

Virus-enriched fractions were isolated from 400 to 600 L of hot spring water for library construction and sequence analysis. Viral abundances (Table 1) were at the lower end of the range of 10⁴–10⁹ reported for thermal springs in Long Valley, California (Breitbart et al. 2004b), and moderate-temperature aquatic environments
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 1 Sample sites and abundance of viral and microbial counts

Hot spring  Temp (°C)  pH    Cells/mL   Viruses/mL  Virus:microbe ratio  Virus/mL in concentrate  Virus/mL theoretical(a)  Efficiency
Bear Paw    74         7.34  4.3 × 10⁶  1.44 × 10⁶  0.33                 1.48 × 10⁸               7.21 × 10⁹               2.1 %
Octopus     93         8.14  9.0 × 10⁵  3.07 × 10⁵  0.34                 2.18 × 10⁸               1.53 × 10⁹               14.2 %

(a) Based on a concentration factor of 5,000 (500 L to 100 mL)
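The derived columns of Table 1 follow from the measured counts and the stated 5,000-fold concentration factor. The short Python sketch below recomputes them as a consistency check. The dictionary layout and function name are illustrative assumptions, not part of the original study, and small differences from the published values (for example 7.20 × 10⁹ versus 7.21 × 10⁹) presumably reflect rounding of the published counts.

```python
# Values transcribed from Table 1; the data layout and names are illustrative only.
SITES = {
    "Bear Paw": {"cells_per_ml": 4.3e6, "viruses_per_ml": 1.44e6, "concentrate_per_ml": 1.48e8},
    "Octopus": {"cells_per_ml": 9.0e5, "viruses_per_ml": 3.07e5, "concentrate_per_ml": 2.18e8},
}

# 500 L (500,000 mL) reduced to 100 mL, as stated in the table footnote.
CONCENTRATION_FACTOR = 500_000 / 100


def derived_columns(site):
    """Return (virus:microbe ratio, theoretical virus/mL in concentrate, recovery efficiency)."""
    vmr = site["viruses_per_ml"] / site["cells_per_ml"]
    theoretical = site["viruses_per_ml"] * CONCENTRATION_FACTOR
    efficiency = site["concentrate_per_ml"] / theoretical
    return vmr, theoretical, efficiency


if __name__ == "__main__":
    for name, site in SITES.items():
        vmr, theoretical, efficiency = derived_columns(site)
        print(f"{name}: VMR={vmr:.2f}, theoretical={theoretical:.3g}/mL, efficiency={efficiency:.1%}")
```

The efficiency is simply the measured concentrate abundance divided by the abundance expected if every virus in the starting volume had been retained.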
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 1 TEM images of viruslike particles directly isolated from YNP hot springs. Images from Bear Paw (Panels a and b) and Octopus (Panels c and d) hot springs are shown. The bar in each figure is 200 nm (Images are courtesy of Sue Brumfield and Mark Young, Montana State University. Reproduced with permission from Schoenfeld et al. (2008))
(Wommack and Colwell 2000). The virus/microbe ratios (VMRs) in the hot springs were much lower than in moderate-temperature environments (typically 3–10). These low VMRs may be related to the observation that none of the cultured thermophilic crenarchaeal viruses proliferate via lytic infections, a lifestyle that would result in large burst sizes at the same time as the microbial population is reduced. Actual yields of viruses were significantly below theoretical yields (Table 1) for both hot springs. It is not known if this loss was systematic and, therefore, biased the metagenomic analysis. Tailed, rod-shaped, and filamentous morphologies were observed in the concentrates (Fig. 1). Morphologies of viral particles in the concentrates represent most morphological families of known thermophilic viruses. Tailed morphologies are commonly associated with bacteriophages and euryarchaeal viruses (Geslin et al. 2003; Yu et al. 2006); rod-shaped and filamentous morphologies are more commonly associated with crenarchaeal viruses (Prangishvili and Garrett 2004).

Library Construction and Sequencing

Advances in sequencing capacity make analyses of large numbers of clones feasible; however,
challenges in sampling and library construction have prevented the widespread use of metagenomic shotgun sequencing for studying viral populations. At around 50 ag of DNA per virus, abundances of 10⁵–10⁶ viruses per mL correspond to 5–50 ng of viral DNA per liter. In practice, processing of hundreds of liters of spring water generally yielded no more than 100 ng of DNA, much lower than is normally required for library construction. This low yield of virus precluded cesium chloride purification of the viral particles, as is commonly used for marine viral metagenomic library construction. Viral DNA also contains cytotoxic genes and modified nucleotides that induce host restriction systems. A linker-dependent, anonymous method of DNA amplification was used to access this diversity, allowing construction of 3–8 kb insert libraries with none of the potential modified nucleotides. This library construction method has been used in the analysis of several cultivated and uncultivated viral genomes (Breitbart et al. 2003, 2004a; Seguritan et al. 2003; Lindell et al. 2004; Paul et al. 2005; Bench et al. 2007) but never fully described. Viral DNA was physically sheared, and short (20 bp) linkers were ligated to the DNA fragments to serve as priming sites for PCR. Amplified fragments were cloned into a transcription-free pSMART vector to minimize cloning bias due to cytotoxic sequences (Godiska et al. 2005). The use of flanking synthetic linkers provides identical primer annealing sites for each viral template in the mixture, which significantly limits amplification bias. A noteworthy characteristic of this approach is that it selects exclusively for dsDNA viruses. All cultivated thermophilic bacteriophage and archaeal viruses have dsDNA genomes except certain Thermus-specific Inoviruses, which have ssDNA genomes (Yu et al. 2006). Notably, several viral nucleic acid preparations from these and other springs sampled as part of this study had RNase-digestible material (data not shown), suggesting that RNA viruses inhabit these hot spring environments.

A total of 28,883 Sanger sequence reads were determined from Bear Paw (7,685 reads) and Octopus (21,198 reads) hot springs. Paired-end reads averaged 981 nucleotides each or nearly 30 Mb total. Assuming an average genome size of 50 kb, which is supported by agarose gel electrophoresis of the viral genomic DNA (data not shown), this sequencing depth represents about 600 viral genomic equivalents. The quality of the libraries is highly dependent on the amount of DNA used in their construction. The sequence reads of the Octopus library contained very few anomalies that would suggest amplification bias or cloning artifacts. Some of the reads from the Bear Paw library were less random than the Octopus library, as demonstrated by several cases of sequence stacking.

Contaminating cellular DNA in viral DNA preparations was greatly reduced by filtration and nuclease treatment. Only viral preparations substantially free of microbial cells as judged by epifluorescence microscopy were used for library construction. Detection of rDNA sequences (5S, 16S, and 23S) in the libraries was used to identify contaminating cellular DNA. These sequences are absent in known viral genomes but highly conserved in microbial cells. A typical bacterial genome contains 15 rRNA genes (Coenye 2003). Most hyperthermophilic archaeal and bacterial genomes contain three to six rRNA genes, although the genomes of thermophilic Geobacillus that grow in the temperature range of Bear Paw contain up to 30 rRNA genes (Feng et al. 2007). BLASTn analysis identified only four rDNA sequences in the 10.4 microbial genome equivalents sequenced from the Octopus library (two 23S and two 16S) and eight in the 3.8 microbial genome equivalents from the Bear Paw library, suggesting viral enrichment was quite high, particularly for the Octopus library. This inference is supported by a high similarity to sequences of cultivated viruses (shown below) and a large number of BLASTx similarities to genes associated with viral functions. In particular, the hundreds of presumptive genes for viral functions, such as replication, transcription, translation, lysogeny, recombination, lysis, and structural proteins (Table 2), are consistent only with a predominantly viral origin of the sequences.
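The DNA yield and sequencing depth figures quoted in this section are simple unit conversions, sketched below in Python. The function names and default arguments are illustrative assumptions; the inputs (50 ag of DNA per virus, a mean read length of 981 nt, and a 50 kb average genome) are the values given in the text.

```python
# 1 attogram = 1e-18 g and 1 nanogram = 1e-9 g, so 1 ag = 1e-9 ng.
NG_PER_ATTOGRAM = 1e-9
ML_PER_LITER = 1000


def viral_dna_ng_per_liter(viruses_per_ml, dna_ag_per_virus=50):
    """Expected mass of viral DNA (ng) in one liter of water."""
    return viruses_per_ml * ML_PER_LITER * dna_ag_per_virus * NG_PER_ATTOGRAM


def genome_equivalents(n_reads, mean_read_nt=981, genome_nt=50_000):
    """Return (total sequenced nucleotides, number of average-sized genomes that represents)."""
    total_nt = n_reads * mean_read_nt
    return total_nt, total_nt / genome_nt


if __name__ == "__main__":
    for abundance in (1e5, 1e6):
        print(f"{abundance:.0e} viruses/mL -> {viral_dna_ng_per_liter(abundance):.0f} ng DNA per liter")
    total_nt, equivalents = genome_equivalents(28_883)
    print(f"{total_nt / 1e6:.1f} Mb sequenced, about {equivalents:.0f} genome equivalents")
```

With these inputs the total comes to a little over 28 Mb and roughly 570 genome equivalents, in line with the rounded "nearly 30 Mb" and "about 600" given above.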
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 2 Functional grouping of predicted genes in the viral metagenomes

                                                                  Bear Paw  Octopus  Bear Paw  Octopus
Total reads                                                       7,685     21,198
No BLASTx similarity                                              2,545     8,469

                                                                  Number of reads     Percent with a
COGs functional category                                          matching a keyword  keyword match
F. Nucleotide transport and metabolism                            1,445     2,130    35.09 %   37.81 %
J. Translation, ribosomal structure, and biogenesis               221       336      5.37 %    5.96 %
K. Transcription                                                  278       325      6.75 %    5.77 %
L. Replication, recombination and repair                          688       989      16.71 %   17.55 %
O. Posttranslational modification, protein turnover, chaperones   181       213      4.40 %    3.78 %
None virus specific                                               350       596      8.50 %    10.58 %
No match to a keyword                                             955       1,045    23.19 %   18.55 %
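The percentages in Table 2 can be reproduced from the read counts if each count is divided by the sum of the seven category counts for that library (4,118 for Bear Paw and 5,634 for Octopus) rather than by the total number of reads. The Python sketch below illustrates this. The choice of denominator is an inference from the numbers in the table, not something the text states, and the variable names are illustrative.

```python
# Read counts transcribed from Table 2, in row order F, J, K, L, O,
# "None virus specific", "No match to a keyword".
KEYWORD_COUNTS = {
    "Bear Paw": [1445, 221, 278, 688, 181, 350, 955],
    "Octopus": [2130, 336, 325, 989, 213, 596, 1045],
}


def percent_of_category_total(counts):
    """Express each count as a percentage of the sum of all listed counts."""
    total = sum(counts)
    return [round(100 * count / total, 2) for count in counts]


if __name__ == "__main__":
    for library, counts in KEYWORD_COUNTS.items():
        print(library, sum(counts), percent_of_category_total(counts))
```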
Identification of Likely Gene Products and Viral Lifestyles

BLASTx analysis of the individual reads was used to identify coding sequences in the libraries. While most reads revealed no significant similarity to known proteins (i.e., no BLASTx similarity; Table 2), a significant portion of the sequences could be assigned an apparent function based on BLASTx analysis. The majority of these predicted functions fall into five of the 23 NCBI Clusters of Orthologous Groups (COG) functional categories (Tatusov et al. 1997) or are virus-specific functions that have no assigned COG function, e.g., lysin, packaging, capsid, tail, or tape measure protein (Table 2). The five COG categories are all nucleic acid metabolism-, information processing-, and translation-related functions, which are commonly associated with phages and viruses.

Certain similarities were particularly informative. The 532 lysin-like genes among 600 viral equivalents suggest lytic viruses are quite common in the hot springs, in contrast to the cultured thermophilic crenarchaeal viruses, all of which are nonlytic. Although lysin genes were highly abundant and are typically proximal to holin genes, no homologs for holins were seen, probably reflecting the high molecular diversity observed in known holin genes (Young 1992). The 86 apparent integrase genes imply that lysogeny is also common in thermal aquifers, consistent with previous studies that show integrase homologs in six crenarchaeal viral genomes (ATV, STSV1, and four SSV isolates) (Wiedenheft et al. 2004; Xiang et al. 2005; Prangishvili et al. 2006b), and induction of prophage by mitomycin C in 1–9 % of hot spring microbial cells (Breitbart et al. 2004b).

Viruses and Lateral Gene Transfer in Thermal Environments

Viruses have been implicated in lateral gene transfer and nonorthologous gene replacement in cellular genomes (Villarreal and DeFilippis 2000; Daubin and Ochman 2004). Viruses also may have played critical roles in the evolution of DNA as a genetic material, DNA replication mechanisms, the separation of the three domains of life, and the origin of the eukaryotic nucleus, reviewed in Forterre (2006). Gene similarities seen in the metagenomic libraries support the role of viruses in cellular evolution. The 13 apparent reverse transcriptases were almost exclusively related to the intron-associated maturase/reverse transcriptases and retrotransposon reverse transcriptases. These genes and the recombinase, integrase, and transposase genes represent 5.1 % and 3.4 % of the identifiable reads in the Bear Paw and Octopus libraries,
respectively, suggesting that the appropriate machinery for lateral gene transfer exists in hot spring viral genomes (Canchaya et al. 2003).

Other sequence similarities provide evidence of ongoing gene transfer within these populations. Helicase genes shared among viruses and cells from all domains have been considered examples of nonorthologous replacement of cellular genes by viral genes (Filee et al. 2003). Hundreds of reads showed sequence similarity to the superfamily II helicases of a wide range of cells and viruses. For example, the 2 kb Octopus contig 158 had significant similarity to helicases of bacterial, archaeal, and eukaryotic cells as well as to phage and archaeal viruses (Table 3).

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 3 Sources of superfamily II helicase similarities to Octopus contig 158 and strength of similarity by BLASTx

Source of similarity                     Domain          E-value
Staphylococcus phage Twort               Bacteriophage   2E-16
Myxococcus xanthus                       Bacteria        1E-15
Sulfolobus islandicus filamentous virus  Archaeal virus  8E-15
Lactobacillus plantarum bacteriophage    Bacteriophage   3E-14
Pyrococcus abyssi                        Archaea         4E-08
Sulfolobus solfataricus                  Archaea         1E-06
Eremothecium gossypii (a fungus)         Eukarya         9E-05
Tribolium castaneum (an insect)          Eukarya         4E-04
Homo sapiens                             Eukarya         6E-03

Also common in the metagenomic libraries are presumptive ribonucleotide reductases (14 and 50 in Bear Paw and Octopus springs, respectively) and thymidylate synthase (seven and 51, respectively) genes. The conservation of these genes between viral and cellular genomes of all domains and the biochemical activities of the gene products imply that viral genes played a key role in the transition from RNA-based to DNA-based genomes (Forterre 2005). DNA polymerase (pol) genes have also been proposed as likely examples of nonorthologous replacement by viral genes (Filee et al. 2002). A total of 156 pol gene homologs were identified in the two metagenomic libraries, with all the polymerase families represented. In contrast, only one pol gene has been identified by BLASTx analysis of the known crenarchaeal viral genomes (ABV), and three pol genes are found in thermophilic bacteriophage genomes (Hjörleifsdottir et al. 2002; Naryshkina et al. 2006). The high abundance of both pol and lys genes in the metagenomic libraries compared to cultured genomes suggests that the current view of diversity may be biased by the difficulty in culturing certain types of viruses.

Sequence Assembly and Estimation of Viral Diversity

The degree to which metagenomic reads assemble has been used to assess the diversity of the viral populations. Previous studies have used >95 % identity over 20 nucleotides as the assembly stringency (Breitbart et al. 2002, 2003, 2004a; Angly et al. 2006). Using this criterion, the power law rank-abundance model built into the PHAge Communities from Contig Spectrum tool (PHACCS; Angly et al. 2005) predicted 1,400 and 1,310 viral types in Bear Paw and Octopus hot springs, respectively, with no one viral type representing more than about 2 % of the population (Table 4). For reference, 1,650, 3,350, 7,180, 7,340, and 2,390 viral genotypes were reported in estuarine, nearshore marine, open ocean, marine sediments, and fecal viral assemblages, respectively (Breitbart et al. 2002, 2003, 2004a; Angly et al. 2006; Bench et al. 2007), with no single viral species representing more than 2–3 % in any case.

There are several limitations in assessing actual numbers of viral species from metagenomic libraries. First, these models assume viral genomes evolve uniformly. However, different regions of viral genomes are clearly more conserved than others (Lindell et al. 2004). Genetic diversity outside the
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 4 Sequence assembly data and estimation of viral diversity

                       Bear Paw       Octopus       Totals
Sequence reads         7,685          21,198        28,883

                       Bear Paw 95 %  Octopus 95 %  Bear Paw 50 %  Octopus 50 %
Contigs assembled      6,191          13,543        4,850          4,788
Avg. reads per contig  1.239          3.129         1.587          4.427
Largest contig (nt)    3,503          4,554         8,007          35,089
Power law richness     1,440          1,310         548            283
Evenness score         0.946          0.954         0.933          0.936
Most abundant virus    2.14 %         1.88 %        3.93 %         4.88 %
Shannon-Wiener score   6.88           6.85          5.88           5.29
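The richness, Shannon-Wiener, and evenness rows of Table 4 are numerically related. The Python sketch below checks the published evenness scores against Pielou's evenness, J = H / ln(S), where H is the Shannon-Wiener score and S is the richness. That PHACCS reports evenness in exactly this form is an assumption made here for illustration; the source does not state the formula, and the published values agree with it only to within rounding.

```python
import math

# (power law richness S, Shannon-Wiener score H, published evenness) from Table 4.
TABLE4 = {
    "Bear Paw 95 %": (1440, 6.88, 0.946),
    "Octopus 95 %": (1310, 6.85, 0.954),
    "Bear Paw 50 %": (548, 5.88, 0.933),
    "Octopus 50 %": (283, 5.29, 0.936),
}


def pielou_evenness(shannon, richness):
    """Pielou's J: observed Shannon diversity relative to its maximum, ln(richness)."""
    return shannon / math.log(richness)


if __name__ == "__main__":
    for label, (richness, shannon, published) in TABLE4.items():
        computed = pielou_evenness(shannon, richness)
        print(f"{label}: computed J={computed:.3f}, published={published:.3f}")
```

Values of J close to 1 indicate that no single viral type dominates, consistent with the most abundant type accounting for only a few percent of each assemblage.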
conserved regions is probably higher than these models indicate. Second, the generation of new viral species by mosaicism, modular evolution, or lateral gene transfer (Villarreal and DeFilippis 2000; Canchaya et al. 2003; Weinbauer and Rassoulzadegan 2004) would not be detected using assembly of <1 kb sequence reads. On the other hand, given the dynamic nature of viral genomes, this approach is well suited to a view of the diversity and evolution of viruses that considers genes or groups of genes rather than whole genomes. Finally, assembly at >95 % nucleotide identity fails to account for molecular diversity among related viral types, which is higher than that of cellular species. In fact such stringency would fail to associate viruses that, based on classical criteria (host range, morphology, replication lineages, and physicochemical and antigenic properties), are considered to be related (Lucchini and Brussow 1999; Hatfull et al. 2006; Kwan et al. 2006) although they may share as little as 50 % nucleotide identity over much of their genomes.

Lower Stringency Assemblies Reveal Population Heterogeneity

To accommodate genomic heterogeneity inherent to viral populations, sequences were also assembled at 50 % identity (Table 4). As expected, the numbers of viral types decreased to 548 and 283 in Bear Paw and Octopus, respectively. These lower stringency assemblies proved quite useful for associating sequences of related, but not identical, viral types and for studying diversity among these related viruses. At 95 % identity, the largest contigs were 3.5 and 4.6 kb for Bear Paw and Octopus, respectively (Table 4). At 50 % identity, Octopus reads assembled into 17 contigs of greater than 10 kb, including contigs of 35 kb and 19 kb, each comprising >1,000 reads. In each case, reads were evenly distributed across the contigs. The 17 contigs of >10 kb comprise a total of 7.04 Mbp (33 % of total metagenomic sequence) or about 140 viral equivalents. The four strongest BLASTx hits to the 35 kb contig belonged to thermophilic crenarchaeal viruses Acidianus rod-shaped virus (ARV), Sulfolobus islandicus rod-shaped viruses 1 (SIRV1) and 2 (SIRV2), and Sulfolobus islandicus filamentous virus (SIFV) (Table 5). The only significant similarity for the 19 kb contig was to the thermophilic crenarchaeal virus, Pyrobaculum spherical virus (PSV). In the Bear Paw library, with roughly one third as many reads, the largest contig that assembled at 50 % identity was 8 kb. Five hundred thirty-four (7 %) of the reads assembled into 19 contigs >4 kb. These include 0.5 Mbp of reads or ten viral equivalents.

The larger composite contigs allow associations that were impossible at standard stringency. More than 200 million bases have been sequenced from marine viral metagenomic libraries, but only one small phage genome has been
reconstructed (Angly et al. 2006). To validate the low-stringency assemblies and to further study the molecular biology of the viruses, the 4 kb cognates of one contig of four reads that assembled at 50 % nucleic acid identity (NAID) were PCR amplified, cloned, and sequenced (Schoenfeld 2014). This confirms that at least this assembly accurately reflects the virome sequence. Furthermore, this contig includes an apparent replisome, and amplification based on the low-stringency assembly allows study of an operon that, due to its size, could not otherwise be recovered from the fragmentary metagenomic data.

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 5 Numbers of 95 % contigs with tBLASTx similarities (E < 0.001) to the respective cellular genomes

                 Bear Paw   Octopus
Archaea
  Pyrobaculum         124       684
  Aeropyrum            62       626
  Sulfolobus           38       326
  Acidianus            25       185
Bacteria
  Aquifex             474     1,138

Certain contigs provide compelling evidence that the 50 % assemblies associate genuine orthologous sequences. An example is Bear Paw contig 327 (Fig. 2). Eleven open reading frames (ORFs) were identified by the GeneMark algorithm (Lukashin 1998). BLASTp analysis of each shows strongest similarity to the putative coding sequences of PSV (Haring et al. 2004). Nucleotide identities were as high as 88 %, gene order is perfectly preserved relative to the cultured virus, and gene overlap is identical between the composite contig and the cultivated virus. Interestingly, two different ORFs of the PSV genome, gp4 and gp5, are apparently related to each other, since both had significant similarity to the same region of the consensus contig. In both the cultured viral genome and the consensus contig, the gp7 PSV gene overlaps gp6 in the opposite orientation.

Contig 722 from the Octopus spring library provided a unique opportunity to associate population diversity of an assembled metagenome with the biochemistry of the gene products (Fig. 3). This 16.5 kb contig, assembled at 50 % identity, includes 187 reads (average coverage of 11 reads per nucleotide position). GeneMark predicted 26 ORFs of greater than 100 nucleotides, including an apparent replication operon. The genes with the strongest similarity to four of these ORFs encode primase, uracil DNA glycosylase, family B DNA polymerase, and nucleotide excision repair nuclease (dnaG, udg, polB, and ERCC4 genes, respectively). Homologs of these ORFs belong to crenarchaeal DNA replication/repair complexes (Roberts and White 2003; Dionne and Bell 2005; Barry and Bell 2006). The predicted polB gene showed 28 % identity to Pyrobaculum islandicus polB2 (Kahler and Antranikian 2000) and has an archaeal codon profile (data not shown). Sequences from three of the discrete clones that comprise the polB gene in this contig have been expressed in E. coli to produce a functional thermostable DNA polymerase (data not shown). This contig also contains apparent homologs to a zinc finger-like protein and a transposon-like integrase/resolvase (tnp), functions commonly associated with viruses and phages. Another ORF with highest similarity to the CRISPR-associated sequence cas4 (Haft et al. 2005) is unlikely to be part of a functional CRISPR system. Unlike authentic Cas sequences, this one is virus-derived and is not proximal to a CRISPR sequence or other typically associated sequences. More likely this gene is a separate member of the Cas4 COG, presumably a RecB-like exonuclease (Haft et al. 2005).

To correlate the level of sequence divergence with predicted gene function, SNP frequency was aligned to the 50 % assembly consensus sequence of the contig. Overall distribution of SNPs in the contig was 0.705 per 10 bp. Replication-associated genes showed noticeably lower molecular diversity than the other ORFs. SNP distribution in the dnaG, udg, polB, and ERCC4 homologs was 0.565, 0.617, 0.569, and 0.548 per 10 bp, respectively, while the distribution in the Zn finger, cas4, and thyA homologs was 0.979, 1.31, and 0.728 per 10 bp, respectively.
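The comparison between replication-associated ORFs and the other ORFs can be checked with a few lines of arithmetic. The sketch below uses only the SNP frequencies quoted above; the unweighted group means are an assumption made for illustration (the source does not say whether the authors weighted by ORF length), and the variable names are hypothetical.

```python
from statistics import mean

# SNPs per 10 bp for each ORF homolog, as quoted in the text.
snp_per_10bp = {
    "dnaG": 0.565, "udg": 0.617, "polB": 0.569, "ERCC4": 0.548,  # replication/repair
    "Zn finger": 0.979, "cas4": 1.31, "thyA": 0.728,             # other ORFs
}
replication = ("dnaG", "udg", "polB", "ERCC4")
contig_average = 0.705  # overall SNPs per 10 bp for the contig, from the text

others = [gene for gene in snp_per_10bp if gene not in replication]

# Assumption: simple unweighted means, not weighted by ORF length.
rep_mean = mean(snp_per_10bp[gene] for gene in replication)
other_mean = mean(snp_per_10bp[gene] for gene in others)

for gene, rate in snp_per_10bp.items():
    print(f"{gene:10s} {rate:5.3f}  {rate / contig_average:4.2f}x contig average")
print(f"replication mean {rep_mean:.3f}, other mean {other_mean:.3f}")
```

On these figures every replication-associated homolog falls below the contig-wide value of 0.705, while each of the three other homologs lies above it.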
[Figure 2 graphic: gene map on a 0–5,000 bp scale with ORFs labeled gp6, gp7, gp8, gp9, gp10, gp11, gp12, gp13, and gp14 (identities printed in order: 88 %, 81 %, 88 %, 88 %, 87 %, 85 %, 80 %, 80 %, 70 %) and a gp4/5 region (34/57 %).]

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 2 Genes and gene order are highly conserved between a cultured crenarchaeal virus and a consensus contig from the Bear Paw library. Contig 372 (5,492 bp, 71 reads) was assembled at 50 % identity from the Bear Paw library. Open reading frames identified by GeneMark algorithm were compared by BLASTp to proteins in GenBank. Similarities to Pyrobaculum spherical virus proteins are shown with percent coding identity. The gene names are based on the annotation in GenBank and are named in order of their location on the viral chromosome. Direction of transcription is indicated by the arrows (Reproduced with permission from Schoenfeld et al. (2008))
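A claim that gene order is preserved between a contig and a reference genome can be tested mechanically once each ORF has a best hit. The following is a minimal sketch, not the authors' procedure: the function name is hypothetical, the ORF start coordinates are invented placeholders, and the ranks 6 to 14 simply stand for gp6 to gp14.

```python
def gene_order_conserved(hits):
    """Check collinearity of best hits.

    hits: (orf_start_on_contig, gene_rank_on_reference) pairs, one per ORF
    that has a best hit. Returns True when the reference ranks strictly
    increase or strictly decrease along the contig (the contig may be
    assembled in reverse orientation relative to the reference).
    """
    ranks = [rank for _, rank in sorted(hits)]
    pairs = list(zip(ranks, ranks[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


# Hypothetical ORF start coordinates; ranks 6..14 stand for gp6..gp14.
example = [(120, 6), (700, 7), (1300, 8), (1900, 9), (2500, 10),
           (3100, 11), (3700, 12), (4300, 13), (4900, 14)]
shuffled = example[:3] + [(1900, 12), (2500, 9)]

print(gene_order_conserved(example))   # collinear order
print(gene_order_conserved(shuffled))  # one rearrangement
```

Allowing either strictly increasing or strictly decreasing ranks is a design choice so that a reverse-complemented contig is not reported as rearranged.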
[Figure 3 graphic: contig of 16,542 bp, 187 reads, annotated "87 % two reads per strand". Top panel: read coverage. Middle panel: SNPs per 10 bp (vertical axis, 0 to 4) against position in bp (horizontal axis, 0 to 16,000). Bottom panel: ORFs labeled dnaG, cas4, tnp, Zn finger, thyA, udg, polB, and ERCC4.]

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 3 Alignment of nucleotide polymorphisms with coding sequences in a 16.5 kb consensus contig from Octopus hot spring. Contig 722 was assembled at 50 % identity from the Octopus library. Sequence coverage is shown on the top, with each line representing a separate read. Single-nucleotide polymorphisms per ten base pairs were normalized to the number of reads covering the respective nucleotide (middle) and are aligned with predicted open reading frames from the consensus sequence in the contig and the gene name of the strongest BLASTx similarity (bottom). Direction of transcription is shown by the arrows. Similarities to known genes were identified by BLASTp (Reproduced with permission from Schoenfeld et al. (2008))
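The coverage-normalized SNP profile described in the caption can be sketched as follows. The source does not give the exact normalization, so this is one plausible reading rather than the authors' method: each position contributes the fraction of covering reads that disagree with the consensus, and those fractions are summed over consecutive 10 bp windows. Reads are assumed to be aligned without gaps, and `snp_density` is a hypothetical helper name.

```python
def snp_density(consensus, reads, window=10):
    """Coverage-normalized mismatches per window.

    consensus: consensus sequence of the contig.
    reads: list of (start, sequence) pairs, aligned to the consensus
           without gaps (an assumption made to keep the sketch short).
    Returns one value per window: the sum over positions of
    (reads disagreeing with the consensus) / (reads covering the position).
    """
    n = len(consensus)
    coverage = [0] * n
    mismatches = [0] * n
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < n:
                coverage[pos] += 1
                if base != consensus[pos]:
                    mismatches[pos] += 1

    densities = []
    for w in range(0, n, window):
        total = 0.0
        for pos in range(w, min(w + window, n)):
            if coverage[pos]:  # skip uncovered positions
                total += mismatches[pos] / coverage[pos]
        densities.append(total)
    return densities


# Toy data: a 20 bp consensus and four reads, two of which carry one mismatch.
consensus = "ACGTACGTACGTACGTACGT"
reads = [
    (0, "ACGTACGTAC"),
    (0, "ACGAACGTAC"),        # mismatch at position 3
    (5, "CGTACGTACGTACGT"),
    (10, "GTACGTTCGT"),       # mismatch at position 16
]
print(snp_density(consensus, reads))
```

Averaging such window values over the coordinates of each ORF would give per-gene figures of the kind quoted in the text.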
Similarities to Known Viral and Microbial Genomes Imply Phylogeny

tBLASTx analysis was used to infer phylogenetic origin of the 95 % assembled contig sequences. A large proportion of the contigs (41 % from Bear Paw and 63 % from Octopus) had no tBLASTx similarity (E < 0.001) to any sequence in GenBank (Fig. 4). Although it is typical for viral metagenomic libraries analyzed in this way to have a high proportion of sequences without identifiable homologs, these libraries contained the highest frequency of novel sequence reported to date using long-read Sanger chemistry. This trend likely reflects the lack of sequence data from microorganisms in high-temperature environments as well as high diversity.

[Figure 4 graphic: pie charts for (a) Bear Paw and (b) Octopus showing the fractions of contigs assigned to archaeal virus, phage, eukaryotic virus, archaea, bacteria, eukarya, and no similarity. Largest slices, Bear Paw and Octopus respectively: no similarity 41.2 % and 62.8 %, bacteria 44.1 % and 12.3 %, eukarya 8.8 % and 16.5 %, archaeal virus 1.3 % and 3.5 %.]

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 4 Broad classification of viral metagenomic contigs based on tBLASTx similarities. Contigs assembled at 95 % identity from Bear Paw and Octopus reads (Panel a and b, respectively) were compared to sequences in GenBank to infer phylogeny. Shown are frequencies of contigs with no significant sequence similarity in GenBank (E < 0.001) and those with sequence similarity to Bacteria, Archaea, Eukarya, and their respective viruses (Reproduced with permission from Schoenfeld et al. (2008))

Interestingly, the libraries contained a sizable number of sequences with homology to eukaryotic genes, 16.5 % for Octopus Spring and 8.3 % for Bear Paw, which may reflect the commonly observed overlap in gene sequence homology between Archaea and Eukarya (Brown and Doolittle 1997). Almost all known crenarchaeal viruses were cultivated on three archaeal genera, Pyrobaculum, Sulfolobus, and Acidianus. Interestingly, these genera were three of the four most common archaeal sources of the sequence similarities to the two libraries, the other being Aeropyrum (Table 5). Genetic similarities to Sulfolobus and Acidianus are surprising because these two genera have been found exclusively in highly acidic environments. Nearly half the bacterial similarities were to Aquifex. Apparently no attempts have been made to cultivate phage on any strain in the Aquificales order.

Genome Signature Sequences to Associate Host/Virus Sequences

The ability to determine phylogenetic relationships in viral metapopulations is important to the current understanding of their community composition and function. In the absence of universal signature genes like 16S sequences, BLASTx and tBLASTx alignments have been the primary tools to determine phylogeny of viral metagenomic sequences and to correlate them with their respective hosts. BLASTx and tBLASTx focus on amino acid sequence similarities and ignore differences in codon usage and other patterns of nucleotide content, which can be highly informative.

Sequence signature-based methods, independent of nucleotide or amino acid alignment, are being developed to classify the phylogenies of viral metagenomes and their hosts. PhyloPythia is an approach designed for cellular metagenomes (McHardy et al. 2007; see also Chap. 47, Vol. I) that classifies based on oligonucleotide composition differences. Alternative
approaches use differences in codon usage, which are generally conserved between hosts and viruses (Lucks et al. 2008). Genome signature-based phylogenetic classification (GSPC) analyzes differences in di-, tri-, and tetranucleotide utilization patterns to associate phylogenetic relationships, which are influenced by codon usage bias, as a basis for correlating hosts and viruses (Pride et al. 2006; Yooseph and Sutton 2008).

A GSPC study based on tetranucleotide utilization in the Yellowstone viral metagenomes from Bear Paw and Octopus hot springs was reported in Pride and Schoenfeld (2008), which includes the details of the analysis and the statistical support. To be statistically significant, the analysis used only contigs >1.9 kb (3.8 kb when analyzing both strands) assembled at 95 % identity. Contigs of this size should include 95 % of tetranucleotide combinations at least 7.5 times. Approximately 19.3 % and 39.0 % of the Bear Paw and Octopus metagenomic contigs, respectively, representing the more abundant viruses, conformed to these criteria. The GSPC analysis classified 20 of 22 Bear Paw contigs and 69 of 70 Octopus contigs, a much higher proportion of the reads than either BLASTx or PhyloPythia with significantly stronger statistical support (see Pride and Schoenfeld 2008). The method is useful to group contigs by relatedness, which might assist assembly, and to infer phylogenies and hosts. The GSPC analysis suggests that Octopus viruses belong primarily to the archaeal families Globuloviridae and Fuselloviridae (56 of 69) while Bear Paw members belong primarily to the bacteriophage order Caudovirales (which includes the families Myoviridae, Podoviridae, and Siphoviridae) (17 of 20). The analysis also estimates that 80 % of the Octopus contigs have archaeal signatures, while 77 % of Bear Paw contigs have bacterial signatures, a finding consistent with BLASTx analysis.

The apparent predominance of archaeal viruses seems inconsistent with the reported dominance of Octopus sediments and filaments by Bacteria (Blank et al. 2002; Rachel et al. 2002). Furthermore, the viral populations appear much more diverse than would be predicted based on the low diversity of microbes in the sediments and filaments. The BLASTx, GSPC, and diversity data all suggest that the viruses are infecting hosts other than the sedentary surface bacteria, implying significant proliferation either in the pool or in the vent. The viruses used in this study were planktonic isolates collected close to the outflow source immediately after emergence, making it more unlikely that the hosts were surface microbes in the filament, sediments, or water column.

Alignment of the Metagenome to Cultivated Viral Genomes

Overall, only 3.4 % of the high stringency (95 % assembly) contigs from the two libraries showed similarity to known viral sequences. Most of these similarities were to cultivated thermophilic crenarchaeal viruses (Table 6). Similarity to the only non-thermophilic virus, phage Twort (Kwan et al. 2005), was limited to the helicase gene, which shares similarity with that of SIFV (see above). The two libraries shared comparable frequencies of sequence similarity to archaeal viruses and bacteriophage. Notable exceptions were Acidianus rod-shaped virus and Sulfolobus islandicus rod-shaped viruses 1 and 2, for which the Octopus library showed a higher frequency of similarity than the Bear Paw library, and S. tengchongensis spindle-shaped virus 1, for which similarity was less common in Octopus than in Bear Paw.

Alignment of the metagenomes to whole genome sequences of six cultivated thermophilic viruses revealed striking conservation of certain sequences (Fig. 5). Almost the entire genome of Pyrobaculum spherical virus (PSV) has similarity to sequences in both metagenomic libraries, with median identities of 60 % and 51 % to the Bear Paw and Octopus libraries, respectively. Sequence similarities to the other crenarchaeal viruses and to bacteriophage YS40 were limited to a few specific ORFs, but the degree of similarity was relatively high in those regions. Interestingly, nearly all of the ORFs showing high levels of homology are among the few thermophilic
crenarchaeal virus genes for which a function has been assigned or inferred (Fig. 5 and references therein). These regions of high conservation are genes associated with virion components, DNA replication, transposition, recombination, or nucleic acid metabolism.

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 6 Numbers of 95 % contigs with tBLASTx similarities to cultured viral sequences

                                                                                                                 Number of tBLASTx similarities
Virus                                                          References                          Accession            Bear Paw   Octopus
ARV, Acidianus rod-shaped virus                                Vestergaard et al. 2005             AJ875026                   36       228
SIRV 1 and 2, Sulfolobus islandicus rod-shaped virus 1 and 2   Blum et al. 2001; Peng et al. 2001  AJ344259, AJ414696         30       217
PSV, Pyrobaculum spherical virus                               Haring et al. 2004                  AJ635161                   44       152
SIFV, S. islandicus filamentous virus                          Arnold et al. 2000                  AF440571                    7        46
STSV1, S. tengchongensis spindle-shaped virus 1                Xiang et al. 2005                   AJ783769                   26        22
ATV, Acidianus two-tailed virus                                Prangishvili et al. 2006b           AJ888457                    8        17
YS40, Thermus thermophilus YS40 phage                          Naryshkina et al. 2006              DQ997624                   15        41
TTSV1, Thermoproteus tenax spherical virus 1                   Ahn et al. 2006                     AY722806                    6        12
Twort, Staphylococcus phage Twort                              Kwan et al. 2005                    AY954970                    4        21

The degree of alignment to cultivated viruses was surprising. PSV was isolated from Obsidian hot spring (74 °C, pH 5.6), about 30 km away from both Octopus and Bear Paw. The geochemistry of this thermal feature is distinct from the springs in this study (Shock et al. 2005), and life within includes a highly diverse population of Archaea and Bacteria (Barns et al. 1994; Hugenholtz et al. 1998), most of which have not been detected in Octopus hot spring (Reysenbach et al. 1994; Blank et al. 2002) or elsewhere. In contrast, Thermoproteus tenax spherical virus, which is quite similar to PSV in terms of sequence, morphology, and habitat (Ahn et al. 2006), had very limited similarity to the YNP viral metagenomic sequences (not shown). The other viruses showing high similarity to the metagenomic sequences were isolated on different continents and, with the exception of YS40, occurred in highly acidic springs. This observation is more remarkable because the microbial populations of acidic and neutral hot springs are quite distinct (Reysenbach et al. 2002). The one other virus cultivated from Yellowstone, SSV-RH (Wiedenheft et al. 2004), had no significant tBLASTx similarity to any of the metagenomic samples.

Identification of CRISPR Spacer Cognate Sequences in the Octopus Viral Metagenome

Evidence has been accumulating recently associating CRISPR (clustered regularly interspaced short palindromic repeats) systems with acquired resistance to lateral gene transfer from viruses and episomal elements (reviewed by van der Oost et al. 2009). CRISPRs were first discovered as repetitive sequences found in most bacterial and virtually all archaeal genomes. These systems are functionally analogous, but nonhomologous, with eukaryotic RNA interference and appear to limit the lateral transfer of genes by targeting them for nucleolytic degradation prior to their integration into the genome. The emerging view is that sequences in the repeat region of the CRISPR system correspond to sequences in viral or episomal genes and are transcribed in the host cell as part of a targeting system to neutralize viral infections. However, little direct evidence of conservation between the CRISPR spacer sequences and viral genomes has been found in natural environments. The first
demonstration of a correspondence between CRISPR spacers and viral sequences was in dairy bacteria and their associated phages (Horvath et al. 2008) and, by inference, in acid mine drainages (Andersson and Banfield 2008). In these and other cases, the lack of viral metagenomes limited insight into the coevolution of these genes in microbial and viral populations. Furthermore, since the CRISPR spacer sequences are generally only 20–50 nucleotides in length, it has been difficult to assign function of the targeted genes by BLASTx or other means.

[Figure 5 graphic: six panels, one per cultivated virus genome (PSV, 28,337 bp; SIRV1, 32,308 bp; ARV, 24,655 bp; ATV, 62,730 bp; STSV, 75,294 bp; YS40, 152,372 bp). Horizontal axis: genome position in bp. Vertical axis: percent identity, 0 to 100 %. Legend: Bear Paw, Octopus. Conserved regions are labeled with the gene abbreviations listed in the caption.]

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Fig. 5 Alignment of Octopus and Bear Paw viral metagenomic library contigs with six cultured virus genomes. Contigs assembled at >95 % identity from the viral metagenomic libraries were compared by tBLASTx to the genomes of PSV, SIRV1, ARV, ATV, STSV, and YS40. Each bar represents the alignment of a unique metagenomic sequence to the indicated location on the cultivated viral genome, shown on the horizontal axis. Percent coding sequence identities are shown in the vertical axis. The threshold for inclusion is E-value <10⁻³. Red bars indicate Bear Paw alignments; blue bars indicate Octopus alignments. Also shown are the known or predicted functions of the conserved coding sequences (rep replication related, vir virion component, gt glycosyltransferase, tnp transposase, cp coat protein, dam adenine DNA methylase, ts thymidylate synthase, dut dUTPase, dcm cytosine DNA methylase, hel helicase, rec recombinase, rnr ribonucleotide reductase) (Arnold et al. 2000; Blum et al. 2001; Peng et al. 2001; Haring 2004; Kessler et al. 2004; Vestergaard et al. 2005; Xiang 2005; Ahn et al. 2006; Naryshkina 2006; Prangishvili 2006b) (Reproduced with permission from Schoenfeld et al. (2008))

The Octopus viral metagenome, in conjunction with a microbial metagenome and two
Synechococcus genomes isolated from the same hot spring within 2 years of one another, provided a unique opportunity to identify the genes targeted by a CRISPR system and observe coevolution of a CRISPR system and its target in host and viral genomes (Heidelberg et al. 2009). The two Synechococcus strains contained sequences with the hallmarks of a CRISPR system. Like other such sequences, the CRISPR spacers had no BLASTn or tBLASTx similarity to any sequences in GenBank. When compared to the microbial metagenome, 180 elements had similarity to CRISPR spacer sequences. Of these, four shared similarity with 23 sequences in the Octopus viral metagenome.

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 7 Octopus virome sequences showing silent or conservative changes compared to the CRISPR spacer sequences of the Synechococcus genome

Sequence                                  %NAID  Predicted AA sequence  % AASIM/AAID
AGTTTACCCTCAAGTGGGAAGGCGGCTTTGTCCACCATCC         FTLKWEGGFVHH
..........T...........T.................  95     ............           100/100
..........T...........T.................  95     ............           100/100
.......T..T...........T.................  92     ............           100/100
.......T..T...........T.................  92     ............           100/100
..........T...........T....A............  92     ........Y...           100/91
........T.GCGC.....G..T....AC..GA....C..  70     ...R....Y.N.           100/75
....C.....G..A........G..G...........C..  86     ............           100/100
.A.................G.....G.AC..AA.T.....  80     ........Y.N.           100/83
.A.................G.....G.AC..AA.T.....  80     ........Y.N.           100/83
.A.................G.....G.AC..AA.T.....  80     ........Y.N.           100/83
.A.................G.....G.AC..AA.T.....  80     ........Y.N.           100/83
.A.................G.....G.AC..AA.T.....  80     ........Y.N.           100/83
.C.....A..A..A........T..T.AC..AA....C..  75     ........Y.N.           100/83
.C.....A..A..A.....G..T..G.AC..AA....C..  73     ........Y.N.           100/83
.C.....A..A..A.....G..T..G.AC..AA....C..  73     ........Y.N.           100/83

Interestingly, two CRISPR spacer sequences shared by the isolates and the microbial and viral metagenomes had similarity to different regions of one gene in the viral metagenome. The assembly of 23 reads covering this single gene indicates that this was one of the most abundant and conserved elements in the entire metagenome, which, by itself, would seem to make it an attractive target for a presumed antiviral system. The data provided by the viral metagenome reveal that the target of the spacers was a likely lysozyme gene, the conservation of which may be explained by evolutionary constraints due to the interaction with a host cell wall. Inspection of the lys gene assembly revealed the apparent coevolution of the CRISPR system and its viral target (Table 7). Of the 23 viral metagenome reads, only five had detectable nucleic acid identities (NAID).

The sequence of one CRISPR spacer is shown in Line 1. Shown below are sequences from the
virome with similarity to this CRISPR spacer or the same region in reads identified by similarity to a second independent CRISPR spacer or a translation of one of these. Conserved nucleotides are shown as dots; those that diverge from the CRISPR 1 spacer are shown as letters. The percent nucleic acid identities (%NAID) to CRISPR 1 and the percent amino acid similarity and identity (% AASIM and % AAID, respectively) to the predicted translation of CRISPR 1 are also shown (adapted from Heidelberg et al. 2009). The remainder had sequence variances that reduced NAID to as low as 70 %; however, all of these nucleotide variations were silent or conservative with respect to the amino acid sequence, which would likely allow the sequence to evade targeting by the CRISPR system, but not affect the enzymatic function of the gene product. These data suggest a high rate of coevolution or "germ warfare" between the viruses and their hosts in this extreme environment.

Similarities Between the Two Hot Springs' Viral Populations

The two libraries were compared to one another to determine any variation between the viral populations in the two very different thermal environments. Contigs assembled at 95 % from the two libraries were compared to each other by tBLASTx and BLASTn (Table 8). The differences between the two analyses should be the result of noncoding nucleotides. Since gene densities are high in viral genomes and there is very little intergenic sequence, these differences are mainly due to silent codon variations, which should be largely free of selective pressure.

Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution, Table 8 Nucleotide and coding similarities between the viral populations of Octopus and Bear Paw hot springs

                                                                            tBLASTx         BLASTn
Frequency (number) of Octopus contigs with similarity to Bear Paw contigs   43 % (5,843)    21 % (2,876)
Frequency (number) of Bear Paw contigs with similarity to Octopus contigs   26 % (1,593)    21 % (1,339)
Average length of similarity (nucleotides)                                  298             175
Average identity                                                            74 %            87 %
Average expect value                                                        1.38E-05        3.00E-05

Most remarkable is the similarity between the two libraries by either analysis. By tBLASTx, 5,843 of the Octopus contigs (43 %) and 1,593 of the Bear Paw contigs (26 %) shared amino acid coding similarity. By BLASTn, 2,876 (21 %) and 1,339 (21 %) of the respective contigs shared nucleotide similarity. The average percent identities were 74 % and 87 % and the expect values were 1.38E-05 and 3.00E-05 for tBLASTx and BLASTn, respectively, although the average length of sequence alignment (298 and 175 bp) was modest in both cases. This level of similarity did not allow extensive assembly of contigs from the two libraries, even at 50 % identity, presumably due to the short lengths of alignment (not shown). Taken together, these data suggest a mosaic-like pattern of overlap of much of the coding content in the two hot springs, although entire viral genomes or even entire genes are not necessarily fully conserved. The fact that the degrees of identity at the nucleotide level and at the translational level were relatively close suggests that this overlap is not due solely to selective pressure on the coding sequence, but must be explained by other mechanisms. This extensive conservation of viral sequences between the two hot springs in this study is surprising, given that microbial populations are highly temperature dependent (Reysenbach et al. 2002) and the surface temperatures of these hot springs differ by 19 °C (74 °C vs. 93 °C).

Conservation and Distribution of Viruses in Thermal Environments

Taken together, the above analyses suggest that (1) viral populations in the water columns are largely independent of microbial populations reported in the pools and (2) viral genomes, particularly certain genes, are more conserved both
regionally and globally than might have been predicted. The regional and global conservation of viral sequences is an intriguing area for further study. There are examples of globally distributed genes among marine viral assemblages (Breitbart and Rohwer 2005; Short and Suttle 2005). Since the oceans are contiguous across the earth, an obvious distribution mechanism exists. Groups of highly similar Sulfolobus viruses (Wiedenheft 2004) and Thermus phages (Yu 2006) have been isolated from thermal springs on different continents. In these cases, viruses were isolated from environments of similar pH and temperature and were cultivated on the same host under similar laboratory conditions. Gene homologs to these viruses were detected despite the absence of these selective conditions. Conversely, most crenarchaeal virus morphotypes have been detected in enrichments from YNP (Rice et al. 2001; Rachel et al. 2002; Wiedenheft et al. 2004); however, little is known about conservation of genes in these enrichments.

The mechanism and basis of this conservation of viral sequence is open to speculation. It is possible that viruses sharing common genes adapt to the different host populations of the environment. Alternatively, hot springs may be inoculated by airborne viruses from other springs. It is also possible that the viruses acquire genes from mesophilic viruses, although this explanation has no support in this study. Lineages of conserved viral genes may be older than the separation of the continents. Another explanation is proliferation of the viruses deeper in the vent. Thermophilic Bacteria and Archaea, potential hosts for viruses, have been detected in thermal aquifers several km beneath the earth's surface at abundances similar to those measured in this study (Moser et al. 2005) and many are distributed worldwide. While it is impossible to separate the contribution of the subsurface viruses from any proliferation at the surface in the two pools in this study, samples from thermal springs with no pool at all, collected within seconds of their emergence, have similar or somewhat higher viral abundances to those measured in this report (Breitbart et al. 2004b), suggesting subsurface proliferation is at least a significant contributor to viral populations at the surface. Subsurface proliferation of viruses would also explain the apparent disconnect between the planktonic viral populations in the pool and the reported sedentary microbial populations, described above. An implication of subsurface proliferation of viruses is that the habitable portion of the subterranean aquifer could be continuous across much of the Yellowstone caldera or even much larger areas. A second implication is that, given the higher pressures in the vents, the temperature limit of life in the subterrestrial aquifers could significantly exceed the temperatures measured at the surface.

Computer Analysis

Availability of computer programs is described in the original publications (Schoenfeld et al. 2008; Heidelberg et al. 2009; Pride and Schoenfeld 2008).

References

Ahn DG, Kim SI, Rhee JK, Kim KP, Pan JG, et al. TTSV1, a new virus-like particle isolated from the hyperthermophilic crenarchaeote Thermoproteus tenax. Virology. 2006;351:280–90.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
Andersson AF, Banfield JF. Virus population dynamics and acquired virus resistance in natural microbial communities. Science. 2008;320:1047–50.
Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, et al. PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics. 2005;6:41.
Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. The marine viromes of four oceanic regions. PLoS Biol. 2006;4:e368.
Arnold HP, Zillig W, Ziese U, Holz I, Crosby M, et al. A novel lipothrixvirus, SIFV, of the extremely thermophilic crenarchaeon Sulfolobus. Virology. 2000;267:252–66.
Barns SM, Fundyga RE, Jeffries MW, Pace NR. Remarkable archaeal diversity detected in a Yellowstone National Park hot spring environment. Proc Natl Acad Sci USA. 1994;91:1609–13.
Barry ER, Bell SD. DNA replication in the archaea. Microbiol Mol Biol Rev. 2006;70:876–87.
Bench SR, Hanson TE, Williamson KE, Ghosh D, Radosovich M, et al. Metagenomic characterization of Chesapeake Bay virioplankton. Appl Environ Microbiol. 2007;73:7629–41.
Blank CE, Cady SL, Pace NR. Microbial composition of near-boiling silica-depositing thermal springs throughout Yellowstone National Park. Appl Environ Microbiol. 2002;68:5123–35.
Blum H, Zillig W, Mallok S, Domdey H, Prangishvili D. The genome of the archaeal virus SIRV1 has features in common with genomes of eukaryal viruses. Virology. 2001;281:6–9.
Breitbart M, Rohwer F. Here a virus, there a virus, everywhere the same virus? Trends Microbiol. 2005;13:278–84.
Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, et al. Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci USA. 2002;99:14250–5.
Breitbart M, Hewson I, Felts B, Mahaffy JM, Nulton J, et al. Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol. 2003;185:6220–3.
Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, et al. Diversity and population structure of a near-shore marine-sediment viral community. Proc Biol Sci. 2004a;271:565–74.
Breitbart M, Wegley L, Leeds S, Schoenfeld T, Rohwer F. Phage community dynamics in hot springs. Appl Environ Microbiol. 2004b;70:1633–40.
Brock TD, Brock ML. Measurement of steady-state growth rates of a thermophilic alga directly in nature. J Bacteriol. 1968;95:811–5.
Brown JR, Doolittle WF. Archaea and the prokaryote-to-eukaryote transition. Microbiol Mol Biol Rev. 1997;61:456–502.
Feng L, Wang W, Cheng J, Ren Y, Zhao G, et al. Genome and proteome of long-chain alkane degrading Geobacillus thermodenitrificans NG80-2 isolated from a deep-subsurface oil reservoir. Proc Natl Acad Sci USA. 2007;104:5602–7.
Filee J, Forterre P, Sen-Lin T, Laurent J. Evolution of DNA polymerase families: evidences for multiple gene exchange between cellular and viral proteins. J Mol Evol. 2002;54:763–73.
Filee J, Forterre P, Laurent J. The role played by viruses in the evolution of their hosts: a view based on informational protein phylogenies. Res Microbiol. 2003;154:237–43.
Forterre P. The two ages of the RNA world, and the transition to the DNA world: a story of viruses and cells. Biochimie. 2005;87:793–803.
Forterre P. The origin of viruses and their possible roles in major evolutionary transitions. Virus Res. 2006;117:5–16.
Fournier RO. Geochemistry and dynamics of the Yellowstone National Park Hydrothermal System. In: Inskeep W, editor. Geothermal biology and geochemistry in YNP. Bozeman: Thermal Biology Institute; 2005.
Geslin C, Le Romancer M, Erauso G, Gaillard M, Perrot G, et al. PAV1, the first virus-like particle isolated from a hyperthermophilic euryarchaeote, "Pyrococcus abyssi". J Bacteriol. 2003;185:3888–94.
Godiska R, Patterson M, Schoenfeld T, Mead D. Beyond pUC: vectors for cloning unstable DNA. In: Kieleczawa, editor. DNA sequencing: optimizing the process and analysis. Sudbury: Jones and Bartlett Publishers; 2005.
Gold T. The deep, hot biosphere. Proc Natl Acad Sci USA. 1992;89:6045–9.
Haft DH, Selengut J, Mongodin EF, Nelson KE. A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol. 2005;1:e60.
Canchaya C, Fournous G, Chibani-Chennoufi S, Dillmann Haring M, Peng X, Brugger K, Rachel R, Stetter KO,
ML, Brussow H. Phage as agents of lateral gene trans- et al. Morphology and genome organization of the
fer. Curr Opin Microbiol. 2003;6:417–24. virus PSV of the hyperthermophilic archaeal genera
Cann AJ, Fandrich SE, Heaphy S. Analysis of the virus Pyrobaculum and Thermoproteus: a novel virus fam-
population present in equine faeces indicates the pres- ily, the Globuloviridae. Virology. 2004;323:233–42.
ence of hundreds of uncharacterized virus genomes. Hatfull GF, Pedulla ML, Jacobs-Sera D, Cichon PM,
Virus Genes. 2005;30:151–6. Foley A, et al. Exploring the mycobacteriophage
Coenye T, Vandamme P. Intragenomic heterogeneity metaproteome: phage genomics as an educational plat-
between multiple 16S ribosomal RNA operons in form. PLoS Genet. 2006;2:e92.
sequenced bacterial genomes. FEMS Microbiol Lett. Heidelberg JF, Nelson WC, Schoenfeld T, Bhaya D. Germ
2003;228:45–9. warfare in a microbial mat community: CRISPRs pro-
Culley AI, Lang AS, Suttle CA. The complete genomes of vide insights into the co-evolution of host and viral
three viruses assembled from shotgun libraries of genomes. PLoS ONE. 2009;4:e4169.
marine RNA virus communities. Virol J. 2007;4:69. Hjörleifsdottir SH, Hreggvidsson GO, Fridjonsson OH,
Daubin V, Ochman H. Start-up entities in the origin of Aevarsson A, Kristjansson JK. Bacteriophage RM
new genes. Curr Opin Genet Dev. 2004;14:616–9. 378 of a thermophilic host organism. US Patent; 2002.
Dionne I, Bell SD. Characterization of an archaeal family Horvath P, Romero DA, Coute-Monvoisin AC,
4 uracil DNA glycosylase and its interaction with Richards M, Deveau H, et al. Diversity, activity, and
PCNA and chromatin proteins. Biochem J. 2005;387: evolution of CRISPR loci in Streptococcus
859–63. thermophilus. J Bacteriol. 2008;190:1401–12.
Hugenholtz P, Pitulle C, Hershberger KL, Pace NR. Novel Noble RT, Fuhrman JA. Use of SYBR Green I for rapid
division level bacterial diversity in a Yellowstone hot epifluorescence counts of marine viruses and bacteria.
spring. J Bacteriol. 1998;180:366–76. Aquat Microb Ecol. 1998;14:113–8.
Jahnke LL, Eder W, Huber R, Hope JM, Hinrichs KU, Paul JH, Williamson SJ, Long A, Authement RN, John D,
et al. Signature lipids and stable carbon isotope et al. Complete genome sequence of phiHSIC,
analyses of Octopus Spring hyperthermophilic com- a pseudotemperate marine phage of Listonella pelagia.
munities compared with those of Aquificales represen- Appl Environ Microbiol. 2005;71:3311–20.
tatives. Appl Environ Microbiol. 2001;67:5179–89. Pedulla ML, Ford ME, Houtz JM, Karthikeyan T,
Kahler M, Antrankian G. Cloning and characterization of Wadsworth C, et al. Origins of highly mosaic
a family B DNA polymerase from the hyperthermo- mycobacteriophage genomes. Cell. 2003;113:171–82.
philic crenarchaeon Pyrobaculum islandicum. Peng X, Blum H, She Q, Mallok S, Brugger K,
J Bacteriol. 2000;182:655–63. et al. Sequences and replication of genomes of the
Kessler A, Brinkman AB, van der Oost J, Prangishvili archaeal rudiviruses SIRV1 and SIRV2: relationships
D. Transcription of the rod-shaped viruses SIRV1 to the archaeal lipothrixvirus SIFV and some eukaryal
and SIRV2 of the hyperthermophilic archaeon viruses. Virology. 2001;291:226–34.
sulfolobus. J Bacteriol. 2004;186:7745–53. Prangishvili D, Garrett RA. Exceptionally diverse
Kwan T, Liu J, DuBow M, Gros P, Pelletier J. The com- morphotypes and genomes of crenarchaeal hyperther-
plete genomes and proteomes of 27 Staphylococcus mophilic viruses. Biochem Soc Trans. 2004;32:204–8.
aureus bacteriophages. Proc Natl Acad Sci Prangishvili D, Garrett RA. Viruses of hyperthermophilic
USA. 2005;102:5174–9. Crenarchaea. Trends Microbiol. 2005;13:535–42.
Kwan T, Liu J, Dubow M, Gros P, Pelletier J. Comparative Prangishvili D, Garrett RA, Koonin EV. Evolutionary
genomic analysis of 18 Pseudomonas aeruginosa bac- genomics of archaeal viruses: unique viral genomes
teriophages. J Bacteriol. 2006;188:1184–7. in the third domain of life. Virus Res.
Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, 2006a;117:52–67.
Rohwer F, et al. Transfer of photosynthesis genes to Prangishvili D, Vestergaard G, Haring M, Aramayo R,
and from Prochlorococcus viruses. Proc Natl Acad Sci Basta T, et al. Structural and genomic properties of
USA. 2004;101:11013–8. the hyperthermophilic archaeal virus ATV with an
Lucchini S, Desiere F, Brussow H. Comparative genomics extracellular stage of the reproductive cycle. J Mol
of Streptococcus thermophilus phage species supports Biol. 2006b;359:1203–16.
a modular evolution theory. J Virol. 1999;73:8647–56. Pride DT, Schoenfeld T. Genome signature analysis of
Lucks JB, Nelson DR, Kudla GR, Plotkin JB. Genome thermal virus metagenomes reveals Archaea and ther-
landscapes and bacteriophage codon usage. PLoS mophilic signatures. BMC Genomics. 2008;9:420.
Comput Biol. 2008;4:e1000001. Pride DT, Wassenaar TM, Ghose C, Blaser MJ. Evidence
Lukashin AV, Borodovsky M. GeneMark.hmm: new solu- of host-virus co-evolution in tetranucleotide usage
tions for gene finding. Nucleic Acids Res. patterns of bacteriophages and eukaryotic viruses.
1998;26:1107–15. BMC Genomics. 2006;7:8.
Martin A, Yeats S, Janekovic D, Reiter WD, Aicher W, Rachel R, Bettstetter M, Hedlund BP, Haring M,
et al. SAV 1, a temperate u.v.-inducible DNA Kessler A, et al. Remarkable morphological diversity
virus-like particle from the archaebacterium of viruses and virus-like particles in hot terrestrial
Sulfolobus acidocaldarius isolate B12. EMBO J. environments. Arch Virol. 2002;147:2419–29.
1984;3:2165–8. Reysenbach AL, Wickham GS, Pace NR. Phylogenetic
McCleskey RB, Ball JW, Nordstrom DK, Holloway JM, analysis of the hyperthermophilic pink filament com-
Taylor HE. Water-chemistry data for selected hot munity in Octopus Spring, Yellowstone National Park.
springs, geysers, and streams in Yellowstone National Appl Environ Microbiol. 1994;60:2113–9.
Park, Wyoming, 2001–2002. U.S. Geological Survey Reysenbach AL, Gotz D, Yernool D. Microbial diversity
Open-File Report 2004-1316; 2004. of marine and terrestrial thermal springs. In: Staley JT,
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Reysenbach AL, editors. Biodiversity of microbial
Rigoutsos I. Accurate phylogenetic classification of life. New York: Wiley Liss; 2002. U
variable-length DNA fragments. Nat Methods. Rice G, Stedman K, Snyder J, Wiedenheft B, Willits D,
2007;4:63–72. et al. Viruses from extreme thermal environments.
Moser DP, Gihring TM, Brockman FJ, Fredrickson JK, Proc Natl Acad Sci USA. 2001;98:13341–5.
Balkwill DL, et al. Desulfotomaculum and RobertsJ A, Bell SD, White MF. An archaeal XPF repair
Methanobacterium spp. dominate a 4– to 5-km-deep endonuclease dependent on a heterotrimeric PCNA.
fault. Appl Environ Microbiol. 2005;71:8773–83. Mol Microbiol. 2003;48:361–71.
Naryshkina T, Liu J, Florens L, Swanson SK, Pavlov AR, Sakaki Y, Oshima T. Isolation and characterization of
et al. Thermus thermophilus bacteriophage phiYS40 a bacteriophage infectious to an extreme thermophile,
genome and proteomic characterization of virions. Thermus thermophilus HB8. J Virol. 1975;15:
J Mol Biol. 2006;364:667–77. 1449–53.
Schoenfeld T, Patterson M, Richardson PM, Wommack Vestergaard G, Haring M, Peng X, Rachel R, Garrett RA,
KE, Young M, et al. Assembly of viral metagenomes et al. A novel rudivirus, ARV1, of the hyperthermo-
from Yellowstone hot springs. Appl Environ philic archaeal genus Acidianus. Virology.
Microbiol. 2008;74:4164–74. 2005;336:83–92.
Seguritan V, Feng IW, Rohwer F, Swift M, Segall Villarreal LP, DeFilippis VR. A hypothesis for DNA
AM. Genome sequences of two closely related viruses as the origin of eukaryotic replication proteins.
Vibrio parahaemolyticus phages, VP16T and VP16C. J Virol. 2000;74:7079–84.
J Bacteriol. 2003;185:6434–47. Ward DM, Ferris MJ, Nold SC, Bateson MM. A natural
Shock EL, Holland M, Meyer-Dombard DR, Amend view of microbial biodiversity within hot spring
JP. Geochemical sources of energy for microbial cyanobacterial mat communities. Microbiol Mol Biol
metabolism in hydrothermal ecosystems: Obsidian Rev. 1998;62:1353–70.
Pool, Yellowstone National Park. In: Inskeep WP, Weinbauer MG, Rassoulzadegan F. Are viruses driving
McDermott TR, editors. Geothermal biology and microbial diversification and diversity? Environ
geochemistry in YNP. Bozeman: Thermal Biology Microbiol. 2004;6:1–11.
Institute; 2005. Wen K, Ortmann AC, Suttle CA. Accurate estimation of
Short CM, Suttle CA. Nearly identical bacteriophage viral abundance by epifluorescence microscopy. Appl
structural gene sequences are widely distributed in Environ Microbiol. 2004;70:3862–7.
both marine and freshwater environments. Appl Envi- Wiedenheft B, Stedman K, Roberto F, Willits D, Gleske
ron Microbiol. 2005;71:480–6. AK, et al. Comparative genomic analysis of hyperther-
Snyder JC, Stedman K, Rice G, Wiedenheft B, Spuhler J, mophilic archaeal Fuselloviridae viruses. J Virol.
et al. Viruses of hyperthermophilic Archaea. Res 2004;78:1954–61.
Microbiol. 2003;154:474–82. WommackKEand Colwell RR. Virioplankton: viruses in
Snyder JC, Spuhler J, Wiedenheft B, Roberto FF, aquatic ecosystems. Microbiol Mol Biol Rev.
Douglas T, et al. Effects of culturing on the population 2000;64:69–114.
structure of a hyperthermophilic virus. Microb Ecol. Xiang X, Chen L, Huang X, Luo Y, She Q, et al.
2004;48:561–6. Sulfolobus tengchongensis spindle-shaped virus
Stoner DL, Geary MC, White LJ, Lee RD, Brizzee JA, STSV1: virus-host interactions and genomic features.
et al. Mapping microbial biodiversity. Appl Environ J Virol. 2005;79:8677–86.
Microbiol. 2001;67:4324–8. Yooseph S, Li W, Sutton G. Gene identification and pro-
Suttle CA. Marine viruses–major players in the global tein classification in microbial metagenomic sequence
ecosystem. Nat Rev Microbiol. 2007;5:801–12. data via incremental clustering. BMC Bioinformatics.
Tatusov RL, Koonin EV, Lipman DJ. A genomic perspec- 2008;9:182.
tive on protein families. Science. 1997;278:631–7. Young R. Bacteriophage lysis: mechanism and regulation.
van der Oost J, Jore MM, Westra ER, Lundgren M, Microbiol Rev. 1992;56:430–81.
Brouns SJ. CRISPR-based adaptive and heritable Yu MX, Slater MR, Ackermann HW. Isolation and char-
immunity in prokaryotes. Trends Biochem Sci. acterization of Thermus bacteriophages. Arch Virol.
2009;34:401–7. 2006;151:663–79.
Variable Selection to Improve Classification of Metagenomes

Greg Ditzler1, Yemin Lan2, Jean-Luc Bouchot3 and Gail Rosen1
1 Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
2 School of Biomedical Engineering, Science and Health, Drexel University, Philadelphia, PA, USA
3 Department of Mathematics, Drexel University, Philadelphia, PA, USA

Introduction

Metagenomics is the study of DNA extracted from the microbial communities in an environment, in comparison to traditional genomics, which studies the nucleic acids from single organisms (Wooley et al. 2010). In a metagenomic study, a sample is collected directly from the environment, which can be a gram of soil (Rousk et al. 2010; Bowers et al. 2011), a milliliter of ocean water (Williamson et al. 2008), a swab from an object (Caporaso et al. 2011), or a sample of the microbes associated with a host organism, such as humans (Caporaso et al. 2011; Costello et al. 2009). The microbial content of an environmental sample is termed its “microbiome.” There are several questions that are of particular importance when the microbiome is being examined. In particular, who is there, how much of each species is there, and what are they doing overall? Some of these questions can be addressed using DNA/RNA sequencing followed by homology and taxonomic classification; however, hypotheses usually focus on a narrower question: which organisms and/or their functions (e.g., metabolisms) best differentiate multiple phenotypes in a collection of samples? Consider a collection of gut microbiome samples that were collected from patients with inflammatory bowel disease (IBD) and a control set from individuals who do not have IBD. A natural question to ask when examining the differences between the gut microbiomes of the two phenotypes is: what organisms or genes can distinguish patients with IBD from healthy controls? Knowing the answers to such questions can improve our understanding of a disease and aid in developing medicines that target its cause.

The problem of finding differentiating features, or variables of interest, is commonly referred to as feature selection and has been studied extensively in the machine learning community (see Guyon et al. 2006; Saeys et al. 2007). Feature selection is the process of finding a subset of features that best differentiate between multiple classes or, in our case, phenotypes in a data set. The process of selecting features is typically achieved by maximizing some objective function (e.g., mutual information) in a greedy fashion. The central motivation for feature selection is to find a smaller subset of features that can be used to differentiate between the multiple phenotypes, which in turn can reduce the computational complexity of the classification algorithm tailored to
Variable Selection to Improve Classification of Metagenomes, Table 1 Functional databases mostly used for creating functional profiles

Category | Database | Description
Large collection of reference sequences | RefSeq | Around 18 million proteins from 18,000 organisms; annotations are available for a subset of the database; well annotated for human sequences
Large collection of reference sequences | UniProtKB/Swiss-Prot | Manually curated annotations for 500,000+ sequences, covering 12,930 organisms
Standardized ontologies | Gene Ontology | Well-controlled vocabulary, primarily for eukaryotes
Gene orthologous groups | COG | Gene groups classified into 23 functional categories, inferred from 66 prokaryote and unicellular eukaryote genomes
Gene orthologous groups | KOG | Eukaryote version of COG containing 7 eukaryotic genomes
Gene orthologous groups | eggNOG | Automated annotation of orthologs in 1,133 species
Metabolism pathway | KEGG | 400+ manually drawn pathways, based on reactions from multiple species
Metabolism pathway | BioCyc/MetaCyc | 2,000+ single-organism, experimentally derived pathways
Metabolism pathway | SEED | Subsystems that describe metabolic machinery with expert curation
Protein domains and families | Pfam | A large collection of protein families that share the same domain
Protein domains and families | FIGfam | Protein families that share domains and pairwise align for their full-length sequences, resulting in fewer sequences per family
do such a task. Furthermore, regression could be used instead of classification in the case of continuous environmental variables; however, for this entry, we assume that phenotypes take on discrete states, and therefore, classification is the primary focus. Previously, feature selection has been shown to be useful for reducing the complexity of metagenome classification (Ditzler et al. 2012); however, in this article, its use is expanded to determine the relevance of biological features to associated phenotypes, thus aiding researchers in drawing conclusions from metagenomic data.

Feature selection can be applied to a variety of metagenomic data (e.g., 16S rRNA, whole genome shotgun, taxonomic annotations, gene annotations). In addition to selecting species that differentiate microbiomes, many studies wish to map DNA/RNA sequences to functional categories and address enriched/depleted functions between samples. Depending on the type of question being asked and the nature of the data, there are a variety of functional databases to choose from. Table 1 highlights some of the most widely used databases. Large reference sequence databases with a variety of functional descriptions are preferred because they provide detailed annotation of diverse data sets. This raw labeling of sequences can provide much information; however, it cannot be used to analyze hierarchical functional structure in a data set, such as what high-level functions (e.g., reproduction/cellular transport) are upregulated in my sample. Instead, sequence labeling can answer what genes exist in my sample or which sample is functionally more diverse, because these databases provide better annotation coverage in the sample than higher-level databases. However, if it is required to annotate with well-defined vocabularies, which is needed to make biological inferences and associations, then one wishes to use a standardized ontology database. For example, researchers can use Gene Ontology annotation to examine what functions are enriched in the sample compared to others. In some cases, researchers wish to annotate the function of a gene that appears in multiple organisms rather than just one. In other words, the focus is to accurately assign homologous genes associated with multiple species, which is especially important in metagenomics due to the complex mixture of organisms in a sample. Therefore, orthologous group databases are useful for annotating the homologous function of orthologs. For studying a microbiome’s metabolism rather than molecular functions, such as asking the questions what
biological processes are enriched/missing from a diseased microbiome or should photosynthesis activity be enhanced in surface soil compared to deeper layer soil samples, several metabolic pathway databases can be used. Finally, protein family databases search for conserved domains and motifs of protein sequences and are important when considering the origin and evolution of proteins. For example, protein motifs that characterize pathogenicity may be used as potential targets for diagnosis and treatment.

Since the diversity of functional databases serves a variety of research questions, it is important to note that many studies adopt several databases for annotation. Therefore, the optimal feature selection technique may depend on the database choice and the nature of the taxonomic or functional data, such as the dimension of the feature space, data sparsity, and the possible range of fold change between samples.

This entry is organized as follows: section “Feature Selection” highlights the components of a general feature selection algorithm and how to design such an algorithm. Section “A Description of the MetaHit Database” presents the benchmark MetaHit data set, followed by an empirical analysis of feature selection algorithms tested on the MetaHit data set in section “Data Analysis.” Finally, section “Conclusion” draws concluding remarks for feature selection applied to metagenomic data.

Feature Selection

Feature selection can provide a unique insight about the variables that provide discriminating information about populations, or phenotypes, typically contained in the metadata. This metadata could be as simple as two populations, such as healthy or unhealthy, or significantly more complex by containing many different populations within a data sample. It is natural during the analysis of a biological data set to ask the question: which variables provide the most differentiation between multiple populations? Such questions can be answered using feature selection (Guyon and Elisseeff 2003). There are several items to consider before applying feature selection to a (biological) data set. First, how many features should be selected? Most feature selection algorithms assume that the end-user must select this parameter, and the quality of the results will most likely be highly dependent on the value of this parameter. In many situations, cross validation can be used to search for an acceptable value. Second, what is the primary objective for feature selection? Is it the goal of the end-user to perform classification, or are they simply looking for the top k features in the data set? The design of the objective function, J(.), for feature selection can be used to emphasize and address these questions.

Let J(.) be a function of the features Xj (for j ∈ {1, …, Q}), the label variables Y, and the current relevant feature set F. Note that the collection of variables (e.g., operational taxonomic units, Pfams, etc.) is denoted by X. The objective function can be designed in such a way that it reflects the task at hand. For example, if a biologist is interested in the top-ranking features that carry the most mutual information between Xj and Y, then the objective function should reflect this goal. In this situation, using a mutual information maximization (MIM) method is sufficient to achieve this goal (Lewis 1992). MIM can be implemented as follows: (a) compute I(Xj;Y) for all j (I(Xj;Y) is the mutual information between Xj and Y), (b) rank the mutual information values in descending order, and (c) select the top k variables with the largest mutual information and place them in F.

However, many times we seek to classify data based on Y, and in such situations designing a more complex objective function is required. For example, it may be more advantageous to select F in such a way that the features contained in F are informative about Y; however, they are not redundant (redundancy meaning that two or more features provide the same information about Y). An example of such an objective function is given by

\[
J(X_j, Y, F) = I(X_j; Y) - \sum_{X_s \in F} I(X_j; X_s) \qquad (1)
\]

where the first term maximizes the mutual information between the features, Xj, and metadata, Y,
Variable Selection to Improve Classification of Metagenomes, Fig. 1 Generic forward feature selection algorithm for a filter-based method
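A forward search of the kind named in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: feature values are assumed to be already discretized, mutual information is estimated with a simple plug-in estimator, and the names `mutual_information`, `forward_select`, and `penalize_redundancy` are hypothetical. The score is the relevance of a candidate feature to the labels minus its summed mutual information with the features already selected; with `penalize_redundancy=False` the search reduces to MIM ranking.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in estimate of I(a; b) in nats for two discrete 1-D arrays."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_vals.size, b_vals.size))
    np.add.at(joint, (a_idx.ravel(), b_idx.ravel()), 1.0)  # contingency counts
    joint /= joint.sum()                                   # joint distribution
    pa = joint.sum(axis=1, keepdims=True)                  # marginal of a
    pb = joint.sum(axis=0, keepdims=True)                  # marginal of b
    nz = joint > 0                                         # skip empty cells
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

def forward_select(X, y, k, penalize_redundancy=True):
    """Greedy forward search; returns the column indices in selection order."""
    n_features = X.shape[1]
    relevance = [mutual_information(X[:, j], y) for j in range(n_features)]
    remaining = list(range(n_features))
    selected = []                                          # relevant set F
    for _ in range(min(k, n_features)):
        best_j, best_score = None, -np.inf
        for j in remaining:
            score = relevance[j]
            if penalize_redundancy:
                # subtract the redundancy with every feature already in F
                score -= sum(mutual_information(X[:, j], X[:, s]) for s in selected)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)                            # add to F
        remaining.remove(best_j)                           # remove from X
    return selected
```

The cost grows with the number of candidate features times the number already selected, so for thousands of Pfams a practical implementation would cache the pairwise mutual information values.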
while the second term penalizes Xj for being redundant with the current relevant feature set F. The design of the objective function is quite important to the application to which feature selection is being applied. There are several works that highlight such results on bioinformatics data (Saeys et al. 2007), information theory methods (Brown et al. 2012), and general feature selection techniques (Guyon and Elisseeff 2003).

A simple algorithm for feature selection is the forward selection search, which is shown in Fig. 1. The method begins by initializing the relevant feature set F to the empty set. Then for k cycles, equation (1) is maximized, and the feature that maximizes the expression is added to the relevant feature set, F, and removed from the feature set, X. The forward selection search is used with several feature selection objective functions in the section on “Data Analysis.”

A Description of the MetaHit Database

As mentioned in the Introduction, feature selection can allow researchers in metagenomics to interpret the differentiating features in a data set. The interpretation can be insightful and allow the researchers to determine the functional differences between multiple phenotypes. As a case study, let us examine a metagenome data set collected by Qin et al. (2010), which is widely referred to as the MetaHit data set. The data were collected from Illumina-based metagenomic sequencing of 124 fecal samples from 124 European individuals from Spain and Denmark.

The MetaHit data set represents one of the most comprehensive studies of the human gut microbiome. Among the 124 individuals in the database, 25 have inflammatory bowel disease (IBD), and 42 are obese. It is interesting to note that only three of the individuals who have IBD are also obese. Let us consider two different labeling schemes for the data: IBD and obesity, both of which are binary prediction problems. The sequences from each individual were functionally annotated using the Pfam database (Finn et al. 2010) in a recent study that used the MetaHit data set for feature selection on patient age (Lan et al. 2013). There are a total of 6,343 unique functional features detected in the data set, and Fig. 2 shows the log10 of the total abundance for each of the 6,343 functional features over the 124 observations in the data set. One way to (loosely) assess the separability of the IBD and no IBD patients (or obese and not obese) in the data is to examine the principal coordinate analysis (PCoA) plots of the patients’ Pfam data (Gower 1967). Figure 3 shows the PCoA scatter plots for the two labeling schemes, computed with the Euclidean distance. From these plots we observe that there is a considerable amount of overlap between the classes for both labeling schemes.

Data Analysis

In this section, the classification accuracy and area under the receiver operating characteristic (auROC)
curve for the MetaHit data set are examined when feature selection is applied. The accuracy is measured using the standard 0–1 loss, and the auROC is interpreted as the probability of ranking a target data instance higher than a randomly selected nontarget data instance (Fawcett 2006). The IBD/obese class label is identified as the target for the calculation of the auROC. The joint mutual information feature selection algorithm (JMI) is implemented with a forward selection search, and the naïve Bayes classifier is implemented with a multinomial model. The FEAST feature selection toolbox implements the JMI algorithm (Brown et al. 2012). All statistics are presented as averages from tenfold cross validation using stratified sampling. Stratified sampling ensures that instances from each class will be in each cross-validation data set. Note that completely random cross-validation data set partitions do not guarantee this property.

Variable Selection to Improve Classification of Metagenomes, Fig. 2 Logarithm of the total abundance of each feature detected by the Pfam database for Qin et al. (2010)’s human gut microbiome data set. The x-axis represents the rank of each feature corresponding with the number of detections sorted in descending order. From the plot, it is obvious that there are few Pfams with a large abundance and many Pfams with a very low abundance count. For example, there are 2,572 Pfams with 10 or fewer occurrences across the 124 observations

The auROC and loss for the multinomial naïve Bayes classifier are measured using the two labeling schemes described in section “A Description of the MetaHit Database” (i.e., IBD and obese). Table 2 contains the classification assessments from the different labeling schemes as well as a variation in the number of features that are selected via JMI. From Table 2, it is clear that feature selection can have a noticeable effect on the classification results. This is best shown in Fig. 4, which shows the number of features selected by the MIM algorithm versus the loss (Fig. 4a) and the auROC (Fig. 4b). Note that these results are generated using the mutual information maximization approach; however, similar results/trends are observed for other feature selection methods.

Figure 5a presents a visualization of the MetaHit data set before and after MIM feature
Variable Selection to Improve Classification of Metagenomes, Fig. 3 (a) IBD (b) Obese. Multidimensional scaling of the MetaHit data set with the IBD and obese labeling of the samples. There appears to be a significant amount of overlap between the controls and targets for both prediction problems
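An ordination of the kind shown in this figure can be sketched with classical multidimensional scaling, which is what PCoA amounts to when the Euclidean distance is used. The following is a minimal illustration, not the code used to produce the figure; the function name `pcoa_euclidean` and the assumption that rows of `X` are samples and columns are Pfam abundances are hypothetical.

```python
import numpy as np

def pcoa_euclidean(X, n_components=2):
    """Principal coordinates of the rows of X under the Euclidean distance."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # squared pairwise Euclidean distances between samples
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)                     # guard against round-off
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering     # double-centered Gower matrix
    eigvals, eigvecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale             # one row of coordinates per sample
```

Plotting the first two columns of the returned array, colored by phenotype label, gives a scatter plot comparable in spirit to the two panels described above.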
selection is applied. The features are sorted from high to low in terms of overall abundance, and the patients are ordered such that samples 1–99 do not have IBD and samples 100–124 have IBD. Clearly, this shows a large amount of sparsity that is inherent in the data, which would also be evident if taxonomic abundances were used instead of Pfams. Figure 5b shows that most of the features selected by MIM are relatively abundant; however, simply because a feature is abundant does not imply that the feature is relevant. This can be observed near the 44th feature in Fig. 5b. Note that the features in Fig. 5b are ordered by the time they were selected by the forward search.

The top Pfams that maximize the mutual information for the MetaHit data set are shown in Table 3. It is known that in IBD patients the expression of ABC transporter protein (PF00005, the first feature MIM selected for classifying IBD versus no IBD samples) is decreased, which limits the protection against various luminal threats (Deuring et al. 2011). The feature selection for IBD also identified glycosyltransferase (PF00535), whose alteration is hypothesized to result in recruitment of bacteria to the gut mucosa and increased inflammation (Campbell et al. 2001). In addition, the genotype of acetyltransferase (PF00583) plays an important role in the pathogenesis of IBD, which is useful in the diagnostics and treatment of IBD (Baranska et al. 2011). It is not surprising that ABC transporter (PF00005), which is known to mediate fatty acid transport that is associated with obesity and insulin-resistant states (Ashrafi 2007), is also selected for obesity, along with ATPases (PF02518) that catalyze dephosphorylation reactions to release energy.

Variable Selection to Improve Classification of Metagenomes, Table 2 Area under the ROC (auROC) curves and classification error for a naïve Bayes classifier tested using tenfold cross validation

Number of features | auROC (IBD) | Error (IBD) | auROC (obese) | Error (obese)
10 | 0.706 | 0.233 | 0.640 | 0.395
15 | 0.624 | 0.290 | 0.672 | 0.352
25 | 0.616 | 0.292 | 0.660 | 0.403
50 | 0.750 | 0.223 | 0.649 | 0.422
100 | 0.660 | 0.249 | 0.659 | 0.397
200 | 0.654 | 0.257 | 0.643 | 0.389
500 | 0.635 | 0.277 | 0.641 | 0.378
All | 0.665 | 0.238 | 0.622 | 0.240

Conclusion

This entry has presented a broad overview of how feature selection algorithms can be used to facilitate the analysis and interpretation of data in the field of metagenomics. Recall that metagenomic abundance data can be of very large dimension (e.g., MetaHit), and feature selection reduces the
Variable Selection to Improve Classification of Metagenomes, Fig. 4 (a) Loss of naïve Bayes. (b) auROC of naïve Bayes. The effect of the number of features selected by the MIM algorithm versus the loss (left) and the auROC (right). The number of features being selected has a larger effect on the auROC (i.e., detection of target population examples) than the accuracy of the system. Similar results are observed with JMI and other feature selection methods
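The two ingredients of this evaluation, the auROC read as a ranking probability and stratified assignment of samples to cross-validation folds, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names `auroc` and `stratified_folds`, the `target` and `seed` parameters, and the convention of counting tied scores as one half are assumptions.

```python
import numpy as np

def auroc(scores, labels, target=1):
    """Probability that a target instance is scored above a nontarget one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == target]
    neg = scores[labels != target]
    diff = pos[:, None] - neg[None, :]           # every target/nontarget pair
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample a fold index so every class is spread over all folds."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_of = np.empty(labels.size, dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        fold_of[idx] = np.arange(idx.size) % n_folds   # deal out round-robin
    return fold_of
```

With 25 target samples and ten folds, the round-robin assignment places two or three target samples in every fold, which a purely random partition does not guarantee.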
Variable Selection to Improve Classification of Metagenomes, Fig. 5 (a) No feature selection. (b) Feature selection. Visualization of the abundance matrix (on a log10 scale) (a) before and (b) after MIM feature selection. The x-axis represents a feature and the y-axis represents samples. Samples 1 through 99 do not have IBD, and samples 100 through 124 have IBD. (b) contains the top 50 features relevant to the 124 data sets. Differences between the two classes cannot be visualized; however, classification auROCs are 10–15 % above chance
Variable Selection to Improve Classification of Metagenomes, Table 3 List of the “top” Pfams as selected by the MIM feature selection algorithm. Note that redundancy terms are not accounted for in the objective of MIM. Hence, the features below are the ones that provide the largest amounts of mutual information. The ID in parentheses is the Pfam accession number

Feature | IBD features | Obese features
Feature 1 | ABC transporter (PF00005) | ABC transporter (PF00005)
Feature 2 | Phage integrase family (PF00589) | MatE (PF01554)
Feature 3 | Glycosyltransferase family 2 (PF00535) | TonB-dependent receptor (PF00593)
Feature 4 | Acetyltransferase (GNAT) family (PF00583) | Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase (PF02518)
Feature 5 | Helix-turn-helix (PF01381) | Response regulator receiver domain (PF00072)

dimensionality of the space to allow for a quick interpretation of the data. Furthermore, feature selection is also useful for classification because

References

Ashrafi K. Obesity and the regulation of fat metabolism. WormBook. 2007;9:1–20. Review. PMID:18050496.
Baranska M, Trzcinski R, Dziki A, Rychlik-Sych M, Dudarewicz M, Skretkowicz J. The role of n-acetyltransferase 2 polymorphism in the etiopathogenesis of inflammatory bowel disease. Dig Dis Sci. 2011;56(7):2073–80. doi: 10.1007/s10620-010-1527-4. Epub 2011 Feb 15. PMID:21321790.
Bowers RM, McLetchie S, Knight R, Fierer N. Spatial variability in airborne bacterial communities across land-use types and their relationship to the bacterial communities of potential source environments. ISME J. 2011;5:601–12.
Brown G, Pocock A, Zhao M-J, Luján M. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res. 2012;13:27–66.
Campbell BJ, Yu LG, Rhodes JM. Altered glycosylation in inflammatory bowel disease: a possible role in cancer development. Glycoconj J. 2001;18(11–12):851–8. Review.
Caporaso JG, Lauber CL, Costello EK, Berg-Lyons D, Gonzalez A, Stombaugh J, Knights D, Gajer P, Ravel J, Fierer N, Gordon JI, Knight R. Moving pictures of the human microbiome. Genome Biol. 2011;12(5).
it allows us to remove potentially irrelevant fea- Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI,
tures from the data set, which allows the classier Knight R. Bacterial community variation in human
body habitats across space and time. Science.
to focus on learning from the relevant informa-
2009;326:1694–7.
tion rather than attempt to decipher what is or is Deuring JJ, Peppelenbosch MP, Kuipers EJ, van der
not relevant. Woude CJ, de Haar C. Impeded protein folding and
function in active inflammatory bowel disease. Biochem Soc Trans. 2011;39:1107–11.
Ditzler G, Polikar R, Rosen G. Information theoretic feature selection for high dimensional metagenomic data. In: International Workshop on Genomic Signal Processing and Statistics, 2012.
Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27:861–74.
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunesekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy S, Bateman A. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–222.
Gower J. Multivariate analysis and multidimensional geometry. J R Stat Soc. 1967;17(1):13–28.
Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
Guyon I, Gunn S, Nikravesh M, Zadeh LA. Feature extraction: foundations and applications. Berlin: Springer; 2006.
Lan Y, Kriete A, Rosen GL. Selecting age-related functional characteristics in the human gut microbiome. Microbiome. 2013;1(1):2. doi: 10.1186/2049-2618-1-2.
Lewis DD. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language. p. 212–217.
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto JM, Hansen T, Le Paslier D, Linneberg A, Nielsen HB, Pelletier E, Renault P, Sicheritz-Ponten T, Turner K, Zhu H, Yu C, Jian M, Zhou Y, Li Y, Zhang X, Qin N, Yang H, Wang J, Brunak S, Dore J, Guarner F, Kristiansen K, Pedersen O, Parkhill J, Weissenbach J, Bork P, Ehrlich SD. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–65.
Rousk J, Baath E, Brookes PC, Lauber CL, Lozupone C, Caporaso JG, Knight R, Fierer N. Soil bacterial and fungal communities across a pH gradient in an arable soil. ISME J. 2010;4:1340–51.
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Oxf Bioinforma. 2007;23(19):2507–17.
Williamson S, Rusch D, Yooseph S, Halpern A, Heidelberg K, Glass J, Andrews-Pfannkoch C, Fadrosh D, Miller C, Sutton G, Frazier M, Venter JC. The Sorcerer II global ocean sampling expedition: metagenomic characterization of viruses within aquatic microbial samples. PLoS Biol. 2008;3(1).
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):1–13.

Viral Metagenome Annotation Pipeline

Hernan Lorenzi
Informatics, J. Craig Venter Institute, Rockville, MD, USA

Introduction

Viruses are the most abundant and diverse organisms on Earth, yet only a small fraction of the viral genome sequence space has been decoded. Based on analyses of environmental viral communities, it is estimated that only 1 % of the existing viral diversity has been explored. For more than a century, cultivation of viruses has remained the gold standard for virus discovery and characterization. One major limitation of this approach is that for most viral species, their hosts (predominantly microbes) are either unknown or cannot be grown in culture. Viral metagenomics (VM) circumvents this limitation by sequencing viral genetic material isolated directly from the environment. A typical viral metagenomics workflow is depicted in Fig. 1. Viral metagenomic methods have evolved significantly since their beginnings. Initially, they involved viral particle purification and enrichment from environmental samples, shearing of isolated nucleic acids followed by an optional cDNA synthesis step in the case of RNA viruses, cloning into shotgun libraries, and direct sequencing of the total DNA content by Sanger sequencing. This low-throughput approach has been used in the past for the characterization of viral communities from many different environments (Steward and Preston 2011; Bench et al. 2007). During the last decade, the advent of high-throughput sequencing technologies and the development of novel viral particle purification methods have been revolutionizing the field of VM, facilitating the rapid expansion of viral genome data and boosting the number of associated metagenomics publications in PubMed (Fig. 2). Among the many sequencing
platforms currently available, Illumina and 454 FLX/titanium pyrosequencing have been the most frequently used for the characterization of VM samples. One Illumina sequencing lane generates approximately 125–150 million reads of up to 150 bp in length while one full-plate run of 454 titanium produces ~1 million reads of about 350–450 bp. A VM project usually involves two or more Illumina/454 runs, and therefore, the volume of sequence produced by these studies is on the order of several gigabases. This huge volume of sequencing data makes downstream annotation and analysis methods very challenging and computationally expensive. Therefore, it is critical to preprocess sequencing data whenever possible in order to reduce the amount of sequence to be annotated. Pre-annotation processing methods include elimination of redundant and low-quality reads, assembly, and viral gene identification (Fig. 1). How the sequencing data will be processed and annotated will depend on the goals of the project and characteristics of the viral community, but the options can basically be divided into two major annotation strategies, read-based annotation (RBA) and gene-based annotation (GBA). The former approach is more straightforward and involves the identification of similarities to protein sequences or domains directly on the reads to classify them into phylogenetic or functional groups. Gene-based annotation, on the other hand, involves an optional assembly of sequencing reads into contigs and/or scaffolds, gene identification, and functional prediction of predicted proteins. In this entry, we describe current tools, databases, and methodologies that have been developed in the past few years for RBA and GBA of viral metagenomic datasets and discuss their advantages and drawbacks.

Viral Metagenome Annotation Pipeline, Fig. 1 Schematic representation of a viral metagenomic workflow

Read-Based Annotation

Direct annotation of sequencing reads is frequently used when the goal of the VM study is to investigate and compare the type of species or gene functions that are present in one or more viral communities. In general, read-based annotation assumes that each read encodes a single gene. Before proceeding with any annotation, it is important to preprocess sequencing reads to eliminate regions with low-quality base calls and duplicated reads. This is particularly important when working with next generation sequencing (NGS) data, since pyrosequencing and Illumina technologies have a higher error rate compared with Sanger. In particular, 454 pyrosequencing is prone to the generation of artifactual indels in regions containing homopolymers (Kunin et al. 2010; Gilles et al. 2011) while Illumina reads have a higher substitution error rate than 454 but deal better with homopolymeric regions (Minoche et al. 2011). Also, NGS platforms have a tendency to produce a significant number of
duplicated reads, in particular when sequencing libraries are constructed from very limited quantities of starting RNA/DNA material (<10 ng). There are several programs that can be used to remove exact or near exact duplicated reads or trim low-quality bases and vector sequences without requiring a large computer infrastructure. Some examples are BIGpre (Zhang et al. 2011), Trimmomatic (Bolger et al. 2014; http://www.usadellab.org/cms/index.php?page=trimmomatic), PyroCleaner (Jerome et al. 2011), CD-HIT (Huang et al. 2010), NGS QC Toolkit (Patel and Jain 2012), LUCY (Chou and Holmes 2001), and SeqClean (Chen et al. 2007). For example, Trimmomatic is a Java-based program that can run in Linux, Windows, and Mac OS X operating systems and has several different options for trimming low-quality bases and adaptor sequences from Illumina reads. BIGpre is compatible with both 454 and Illumina platforms; it detects and removes redundant reads after taking sequencing errors into account and trims low-quality reads from raw data as well. BIGpre and NGS QC Toolkit also output a number of quality stats about NGS reads that are useful to assess the presence of sequence bias and the correlation between forward and reverse reads among other tools.

Viral Metagenome Annotation Pipeline, Fig. 2 Number of articles in PubMed about viral metagenomics during the period 2004–2011

Once raw sequencing reads have been processed, it is possible to proceed with the annotation stage. Because viruses have a fast evolutionary rate, any comparison at the nucleotide level is not sensitive enough to detect similarities between reads from a studied metagenome and nucleotide databases of characterized viral genes or genomes. In consequence, all searches should be done using translated sequences. The simplest annotation approach is to compare the six-frame translations of each read against a collection of well-annotated protein databases using BLASTX or equivalent algorithms to identify the types of viral species or functions encoded by the viral metagenome. The main advantage of RBA is that it does not involve previous gene identification or assembly of reads, processes that require some level of user expertise. Another benefit is that translation-based similarity searches are independent of gene structure and therefore may prove to be more sensitive than GBA when studying viral communities whose genomes are enriched in intron-containing genes. However, RBA has several disadvantages. First, sequence similarity searches using BLASTX or equivalent programs are computationally demanding and time consuming. Second, many databases of conserved protein domains or motifs cannot be queried using nucleotide sequences or on-the-fly translations. Third, when reads code for more than one gene, the molecular functions associated with the most divergent
genes on the read are usually masked by the gene with the best (lowest) e-value and hence are difficult to detect. Fourth, further characterization and phylogenetic analysis of protein families are complicated by the fact that it is difficult to generate multiple sequence alignments from evolutionarily related genes that start at different positions on their respective reads. Lastly, the higher indel rate of NGS reads, in particular in 454-derived sequences, creates artifactual translation frameshifts that can lead to an overestimation of gene family diversity and complicates the interpretation of results from protein database searches.

Gene-Based Annotation

A more thorough and efficient way to annotate sequencing datasets from viral communities is to identify protein-coding genes before carrying out any comparison against protein databases. This approach considerably reduces the amount of sequencing data to be queried and hence computing time, expands the spectrum of databases that can be searched, and simplifies the interpretation of results and further evolutionary studies.

Although GBA may involve different bioinformatics tools, databases, and cutoffs, it is usually composed of the following consecutive steps: (i) sequence assembly; (ii) protein-coding gene identification; (iii) similarity searches of predicted proteins against generic or specialized databases of characterized proteins, conserved protein domains or motifs; and (iv) functional assignments of predicted proteins following a series of predefined rules. Below, we will discuss each of these steps in more detail.

Assembling Viral Metagenomes

Metagenomic sequence assembly is a fundamental way to improve metagenomic annotation. For example, the sensitivity of both phylogenetic assignment methods based on nucleotide composition and metagenomic ab initio gene finders increases with sequence length (McHardy et al. 2007; Li 2009; Yok and Rosen 2010). Single-genome assemblers usually do not perform well on metagenomic datasets because they are not designed to handle a mixture of reads derived from different strains and species with distinct relative abundances. In this context, sequences of highly abundant species are likely to be misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. There are a number of programs and websites specifically designed for generating de novo contigs and scaffolds of overlapping metagenomic NGS reads. The CAMERA website (Sun et al. 2011) offers a meta-assembly procedure for 454 reads which consists of running a number of single-genome assemblers with carefully optimized parameters on the metagenomic dataset; it then collects all the resulting contigs and assigns quality scores by consensus analysis, and finally, it uses an adaptation of phrap (http://www.phrap.org) to reassemble the contigs based on computed quality scores. There are also a number of metagenome-specific de novo assemblers, such as MetaVelvet (Namiki et al. 2012), Meta-IDBA (Peng et al. 2011), IDBA-UD (Peng et al. 2012), and Genovo (Laserson et al. 2011). These programs deal better with a mixed population of species with different abundances compared to single-genome de novo assemblers (Namiki et al. 2012) and seem to reduce the number of chimeric contigs. Also, depending on the species diversity of the metagenome, some of these programs may perform differently (Namiki et al. 2012), and therefore, it is better to try a variety of assembly programs before starting to work on the annotation of a particular dataset.

Ab Initio Gene Identification

Gene features in viral genomes are strongly dictated by the genetic characteristics of their host. Thus, bacterial viruses, or bacteriophages, are mostly composed of single-exon genes while eukaryotic-infecting viruses may contain genes with more than one exon. In spite of this property, the majority of genes encoded by viral genomes do not have introns. Therefore, there are a number of gene finders that are suitable for the ab initio identification of viral genes on either NGS reads or assembled sequences, although
none of them have been specifically developed for viral metagenomic samples. Two of the most widely used gene finders are MetaGeneAnnotator (Noguchi et al. 2008) and FragGeneScan (Rho et al. 2010). MetaGeneAnnotator integrates statistical models from prophage, bacterial and archaeal genes, and ribosomal-binding sites, and it also uses a self-training model from input sequences for making predictions. FragGeneScan incorporates sequencing error models and codon usage information in a hidden Markov model (HMM) that improves the prediction of protein-coding genes in NGS reads and assemblies. FragGeneScan is able to compensate for artifactual frameshifts in pyrosequencing reads caused by the higher frequency of indels at homopolymeric regions. An alternative strategy to the identification of genes with gene finders is using naïve six-frame translations (NSFT) that identify each possible ORF of at least 80 nt in length. In this case, 5′ and 3′ ends of reads can be considered as start and stop codons, respectively, to also incorporate partial genes truncated by read ends. In those cases where there are two or more overlapping ORFs, it is possible to analyze all of them or select candidate genes based on their properties: length, dn/ds ratio, similarity at the protein level, etc. An alternative to this approach is to combine the results of NSFT with gene predictions from FragGeneScan or MetaGeneAnnotator and pick the longest predicted gene per region.

Viral Metagenome Annotation Pipeline, Fig. 3 Relative number of viral, bacterial, archaeal, and eukaryotic proteins in GenBank, UniProt/Swiss-Prot, and UniProt/TrEMBL. Numbers are relative to the total number of proteins in each database

Functional Annotation of Predicted Genes

Functional predictions of protein sequences are usually done in two consecutive steps: (A) similarity searches against very well-curated protein databases and (B) functional assignments based on database hits. A fundamental problem in functional annotation of viral genes is how to assign functional roles to their encoded proteins when viral sequences are highly divergent from those already present in well-annotated protein databases. To make the situation even more complicated, proteins of viral origin represent a tiny fraction of the proteins deposited in public repositories (Fig. 3). In consequence, in a typical VM project, only a very small proportion of viral peptides give significant hits (e-value ≤ 1 × 10⁻⁵) against protein databases. Therefore, protein database searches have to be complemented with other bioinformatics tools to increase the number of functionally predicted viral proteins. In this section we describe a strategy for functional annotation of viral metagenomic datasets as implemented in the Viral MetaGenomic Annotation Pipeline (VMGAP) at the J. Craig Venter Institute (Lorenzi et al. 2011). This pipeline makes use of databases of conserved protein domains, mobile genetic elements, and environmental peptides to improve the sensitivity and quality of the annotation. The first step in the VMGAP is to perform several similarity searches
between the VM peptides to be annotated and the following databases:

(i) BLASTP searches against public nonredundant protein databases
Several generic nonredundant protein databases can be used for functional assignment of viral proteins: GenBank NR, UniProtKB (UniProt Consortium 2012), and UniRef (Suzek et al. 2007). UniProtKB consists of two databases, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Protein records in UniProtKB/Swiss-Prot are annotated and reviewed by a curator, while entries in UniProtKB/TrEMBL are automatically annotated and classified. UniRef is a group of nonredundant protein databases derived from clustering UniProtKB entries at different percentages of identity. Thus, UniRef100 combines identical complete and fragmented protein sequences from any organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 % or 50 % sequence identity levels. One of the main advantages of using a clustered protein database such as UniRef90 and UniRef50 is that they significantly reduce the time required for similarity searches and improve detection of distant relationships, since all closely related proteins are collapsed in a single representative sequence (Suzek et al. 2007).

(ii) BLASTP searches against the ACLAME database
The ACLAME database is a repository of mobile genetic elements such as bacteriophages, plasmids, and transposons (Leplae et al. 2010). Entries in ACLAME are organized into families based on their sequence similarity and function. Those families with more than three members are manually annotated with functional assignments using gene ontology terms from GO (Shoop et al. 2004) and MeGO (Toussaint et al. 2007). MeGO is an ontology developed by ACLAME to describe biological functions, processes, and components specific to mobile genetic elements that are not present in the GO database (for an example see Fig. 4).

(iii) BLASTP searches against the GenBank environmental nonredundant database
An intriguing aspect of VM is the fact that the majority of viral predicted proteins do not share similarity with any known sequence. This collection of unknown proteins, which are usually discarded as “junk” sequences, may represent a formidable source for the discovery of new viral species. One way to exploit these protein sequences is to compare them against the proteins from other metagenomic datasets to gain some insight about the viral entities that are shared between them. The GenBank environmental nonredundant database (env_nr) is a collection of all the protein sequences derived from metagenomic projects deposited in GenBank.

(iv) HMM searches against the PFAM database
PFAM is a database of hidden Markov models of conserved protein domains (Punta et al. 2012). Because these domains are usually associated with a particular molecular function or protein family and evolve at a slower pace compared to other protein regions, they are excellent tools for identifying functional domains in highly divergent protein sequences such as the ones from viruses. PFAM HMM searches can be run with the HMMER2 suite of programs (Eddy 2011) in two different modes, global or local, allowing for total or partial alignments of the HMMs to the queried protein sequences, respectively. If gene predictions are done on reads, a high proportion of partial (truncated) proteins is expected. In that case, local HMM searches are a more sensitive approach. HMM searches using global alignments are more specific than local ones and perform better on complete proteins. However, even in assembled VM datasets the proportion of truncated genes is very high, since assemblies tend to be very fragmented. Recently, PFAM released a new generation of HMMs compatible with a new development of the HMMER package, HMMER3. These HMMs can only be run in local mode but have similar specificity and
sensitivity to those from the two PFAM HMMER2 models (local and global) combined. HMMER3 uses a faster algorithm and hence is a better choice for performing HMM searches on VM protein datasets.

Viral Metagenome Annotation Pipeline, Fig. 4 Example of MeGO terms as they appear in ACLAME using AmiGO

(v) RPS-BLAST against the NCBI-CDD database
The NCBI Conserved Domain Database (CDD) (Marchler-Bauer et al. 2011) is a compendium of position-specific scoring matrices (PSSMs) representing conserved protein domains, protein families, and superfamilies gathered from SMART (Letunic et al. 2012), the COG database (Tatusov et al. 2003), NCBI-curated protein domains (Sayers et al. 2012), and PFAM. In spite of having some overlap with PFAM HMMs, PSSMs derived from PFAM domains do not behave exactly the same as their HMM counterparts, and therefore, they complement each other. CDD-PSSMs are usually associated with a molecular function or represent signatures of protein families and therefore provide useful functional information. Since PSSMs are BLAST scoring matrices specific to conserved protein domains or motifs, their use gives better sensitivity than regular BLASTP when detecting these domains on more divergent proteins.

(vi) Additional bioinformatic tools for functional annotation
Because a significant proportion of the proteins encoded by viral metagenomes are unknown, it is useful to take advantage of in silico protein-signal prediction tools that could provide hints about their putative function. An important first step toward understanding the biological role of unknown viral proteins is determining their subcellular localization while infecting their host. A set of popular protein localization prediction programs has been developed for the identification of protein signals that
dictate the subcellular destination of peptides: SignalP (Petersen et al. 2011), ChloroP (Emanuelsson et al. 1999), TargetP (Emanuelsson et al. 2000), and TMHMM (Krogh et al. 2001; http://www.cbs.dtu.dk/services/TMHMM/). None of these programs are specifically designed for viral genes. However, once in the host, viruses can use prokaryotic/eukaryotic signals to target their own proteins to defined subcellular locations. SignalP 4.0 uses a neural network-based method to predict signal peptides from Gram-positive, Gram-negative, and eukaryotic peptides, and it has been recently improved to distinguish between signal peptides and transmembrane domains located near the N-terminus of proteins. ChloroP also uses a neural network approach to predict chloroplast transit peptides, and therefore, it might be useful for the functional annotation of viruses that infect plants. TargetP is a program that predicts the subcellular location of eukaryotic proteins. The location assignment is based on the presence of any of the following N-terminal signals: chloroplast transit peptide, mitochondrial targeting peptide, or signal peptide. TMHMM is a program that predicts transmembrane domains based on HMM searches. Each of these programs outputs a p-value that can be used to select highly significant predictions.

Functional Assignments to VM Proteins Based on Annotation Rules

The second stage of functional annotation is the processing of the functional information produced from database searches to generate a file containing a summary of the functional characteristics (product names, GO/MeGO terms, EC numbers, etc.) for each viral peptide. Each piece of evidence generated by the analyses described above is more or less informative or accurate depending on the origin of the VM, the queried databases, and the programs used. Therefore, the best approach is to apply a series of hierarchical rules to prioritize the use of a certain piece of evidence over another based on how trustworthy and useful that evidence is. Figure 5 shows a potential hierarchical scheme similar to the one used as
Viral Metagenome Annotation Pipeline, Fig. 5 Hierarchical scheme for functional annotation of viral proteins
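The tiered scheme of Fig. 5 can be illustrated with a minimal Python sketch. It is not the VMGAP implementation; it only encodes the ordering and the BLASTP cutoffs quoted in this section (e-value ≤ 1 × 10⁻¹⁰, coverage ≥ 80 %, identity ≥ 50 % for tiers 1–4; e-value ≤ 1 × 10⁻⁵, coverage ≥ 70 %, identity ≥ 30 % for tiers 8–11). The `Hit` record, the source labels, the function names, and the numeric CDD cutoffs are assumptions made for illustration only, since the text gives no numbers for CDD hits.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# BLASTP cutoffs quoted in the text (tiers 1-4 and tiers 8-11).
STRICT = {"evalue": 1e-10, "coverage": 80.0, "identity": 50.0}
RELAXED = {"evalue": 1e-5, "coverage": 70.0, "identity": 30.0}
# Assumption: the text gives no numeric CDD cutoffs; these are placeholders.
CDD_MAX_EVALUE = 1e-5
CDD_MIN_COVERAGE = 80.0
# Database priority: ACLAME, UniProt/Swiss-Prot (US), UniProt/TrEMBL (UT), GenBank NR (GB).
BLAST_ORDER = ("ACLAME", "US", "UT", "GB")
UNKNOWN_TIER = 13


@dataclass
class Hit:
    """One piece of evidence for a predicted protein (hypothetical record)."""
    source: str                         # "ACLAME", "US", "UT", "GB", "PFAM_FULL",
                                        # "CDD", "PFAM_LOCAL", or "ENV_NR"
    evalue: float
    identity: float = 0.0               # percent identity (BLASTP hits)
    coverage: float = 0.0               # percent of the shorter sequence or domain
    above_trusted_cutoff: bool = False  # PFAM bit score above the trusted cutoff
    product: str = ""


def passes(hit: Hit, cut: dict) -> bool:
    """True if a BLASTP hit satisfies all three cutoffs."""
    return (hit.evalue <= cut["evalue"]
            and hit.coverage >= cut["coverage"]
            and hit.identity >= cut["identity"])


def tier_of(hit: Hit) -> Optional[int]:
    """Return the tier (1 is best) that a hit qualifies for, or None."""
    if hit.source in BLAST_ORDER:
        rank = BLAST_ORDER.index(hit.source)
        if passes(hit, STRICT):
            return 1 + rank              # tiers 1-4
        if passes(hit, RELAXED):
            return 8 + rank              # tiers 8-11
        return None
    if hit.source == "PFAM_FULL" and hit.above_trusted_cutoff:
        return 5
    if (hit.source == "CDD" and hit.evalue <= CDD_MAX_EVALUE
            and hit.coverage >= CDD_MIN_COVERAGE):
        return 6
    if hit.source == "PFAM_LOCAL" and hit.evalue <= 1e-5:
        return 7
    if hit.source == "ENV_NR" and hit.evalue <= 1e-5:
        return 12
    return None


def assign(hits: Iterable[Hit]) -> Tuple[int, Optional[Hit]]:
    """Pick the best-ranked hit; ties within a tier go to the lower e-value."""
    best: Optional[Tuple[int, Hit]] = None
    for hit in hits:
        tier = tier_of(hit)
        if tier is None:
            continue
        if best is None or tier < best[0] or (
                tier == best[0] and hit.evalue < best[1].evalue):
            best = (tier, hit)
    return best if best is not None else (UNKNOWN_TIER, None)


if __name__ == "__main__":
    evidence = [
        Hit("GB", 1e-30, identity=90.0, coverage=95.0, product="terminase"),
        Hit("ACLAME", 1e-6, identity=35.0, coverage=75.0, product="portal protein"),
        Hit("PFAM_LOCAL", 1e-7),
    ]
    tier, hit = assign(evidence)
    print(tier, hit.product if hit else "unknown protein")
```

In this sketch a strong GenBank NR hit outranks a weak ACLAME hit because the strict tiers are evaluated before the permissive ones, which mirrors the ordering of the rules; a protein with no qualifying evidence falls through to the unknown tier.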
part of the VMGAP at the JCVI. Under this scheme, hits against the ACLAME database are the highest ranked supporting evidence for functional assignments. Hence, any viral protein that hits an ACLAME entry with an e-value ≤ 1 × 10⁻¹⁰, with at least 50 % identity spanning 80 % of the length of the shortest sequence, will automatically inherit the functional annotation associated with that particular ACLAME peptide. The second, third, and fourth tiers of evidence correspond respectively to highly significant BLASTP hits against UniProt/Swiss-Prot (US), UniProt/TrEMBL (UT), and GenBank NR (GB). US ranks higher than UT and GB because entries in US are manually curated. BLASTP hits against UT have a higher priority than GB hits because UT annotation is usually more comprehensive compared with GB. Ranked fifth and sixth are hits against almost complete PFAM HMMs and CDD-PSSMs, respectively. PFAM hits are more reliable than CDD hits because they can be selected based on their e-value but also using pre-calibrated domain-specific bit score cutoffs named trusted cutoffs. Any protein that hits a PFAM HMM with a bit score above its trusted cutoff is considered to contain that particular domain with a very high level of confidence. CDD domains, on the other hand, are selected based only on the e-value of the RPS-BLAST hit and coverage of the CDD domain, and hence, hits are less reliable compared with tier five. Local hits against PFAM HMM domains with e-values ≤ 1 × 10⁻⁵ are ranked seventh, below CDD hits. Because these hits span just a portion of the HMM, they are solely selected by their e-value and not by their bit score. Tiers eight to 11 look for less reliable hits against ACLAME, US, UT, and GB databases in that order using more permissive cutoffs (e-value ≤ 1 × 10⁻⁵; coverage ≥ 70 %, identity ≥ 30 %) compared to tiers 1–4 (e-value ≤ 1 × 10⁻¹⁰; coverage ≥ 80 %, identity ≥ 50 %). Ranked 12th are BLASTP hits against environmental protein databases, such as GenBank env_nr, with e-values of 1 × 10⁻⁵ or lower. Entries in environmental protein databases are likely to lack any functional annotation. However, associated metadata such as geographic location, body site, type of disease, etc., may still provide some clues about the biology of the viruses present in the VM sample. Finally, if the viral protein does not contain a database hit that falls within any of the first 12 tiers, then it is considered an unknown protein.

Note that the rules described above can be further improved by, for example, the incorporation of results from subcellular localization predictions (TargetP, SignalP, ChloroP, and TMHMM) between tiers 12 and 13 or any other functional analysis.

Applying the rules described above, it is possible to assign product names, EC numbers, and GO/MeGO terms to predicted proteins from the VM sample. For example, if a viral predicted protein has a hit against a peptide from the ACLAME database above the cutoffs from tier 1, then it can inherit the product name as well as the GO or MeGO terms associated with that particular ACLAME entry. UniProt entries, in particular from US, are also a very good source of product names, EC numbers, and GO terms. However, these assignments should be done from high confidence hits only.

Web Resources for Functional Annotation of VM Datasets

Currently, there are a number of publicly available bioinformatics tools that can be used for the structural (gene identification) and functional annotation of viral metagenomes. MG-RAST (Glass et al. 2010; Meyer et al. 2008) is a popular web resource able to perform structural and functional annotations on both NGS reads and assembled metagenomic data. One main advantage is that all computations are run by the MG-RAST server, and therefore, the user is not required to have a big computer infrastructure. It also handles Illumina and 454 reads and provides several read preprocessing tools such as elimination of duplicated or contaminated reads and deletion of low-quality sequences and short reads. Structural annotation is carried out either on reads or assemblies using FragGeneScan while functional annotation is being done by
similarity searches against a nonredundant protein database that compiles the following public protein databases: GenBank NR, KEGG (Tanabe and Kanehisa 2012), IMG (Markowitz et al. 2012), InterPro (Hunter et al. 2012), PATRIC (Gillespie et al. 2011), Phantome (Dwivedi et al. 2012; http://www.phantome.org/), RefSeq (Pruitt et al. 2012), SEED (DeJongh et al. 2007), UniProt/Swiss-Prot, UniProt/TrEMBL, COG (Tatusov et al. 2003), GO, KO (Mao et al. 2005), and eggNOG (Powell et al. 2012). Among these databases is Phantome, a protein database of complete phage genomes manually curated by experts using a subsystem approach (Overbeek et al. 2005). Another useful feature of MG-RAST is that it allows the annotated VM samples provided by the user to be compared with the more than 10,000 metagenomic datasets that are publicly available at the MG-RAST server.

Another useful web resource is CAMERA (Sun et al. 2011), which allows the construction of customized workflows for the analysis of external metagenomic data. Among the many bioinformatic tools available are an assembly pipeline for 454 reads, protein clustering with CD-HIT, clustering of duplicated 454 reads, gene predictions based on different gene finders, and a general pipeline for annotation of metagenomic datasets.

References

Bench SR, Hanson TE, Williamson KE, Ghosh D, Radosovich M, Wang K, Wommack KE. Metagenomic characterization of Chesapeake Bay virioplankton. Appl Environ Microbiol. 2007;73(23):7629–41. 2168038.
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014; doi:10.1093/bioinformatics/btu170
Chen YA, Lin CC, Wang CD, Wu HB, Hwang PI. An optimized procedure greatly improves EST vector contamination removal. BMC Genomics. 2007;8:416. 2194723.
Chou HH, Holmes MH. DNA sequence quality trimming and vector removal. Bioinformatics. 2001;17(12):1093–104.
DeJongh M, Formsma K, Boillot P, Gould J, Rycenga M, Best A. Toward the automated generation of genome-scale metabolic networks in the SEED. BMC Bioinforma. 2007;8:139. 1868769.
Dwivedi B, Schmieder R, Goldsmith DB, Edwards RA, Breitbart M. PhiSiGns: an online tool to identify signature genes in phages and design PCR primers for examining phage diversity. BMC Bioinformatics. 2012;13:37.
Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7(10):e1002195. 3197634.
Emanuelsson O, Nielsen H, von Heijne G. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 1999;8(5):978–84. 2144330.
Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000;300(4):1005–16.
Gilles A, Meglecz E, Pech N, Ferreira S, Malausa T, Martin JF. Accuracy and quality assessment of 454 GS-FLX titanium pyrosequencing. BMC Genomics. 2011;12:245. 3116506.
Gillespie JJ, Wattam AR, Cammer SA, Gabbard JL, Shukla MP, Dalay O, Driscoll T, et al. PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species. Infect Immun. 2011;79(11):4286–98. 3257917.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc. 2010; 2010(1):pdb prot5368.
Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT suite: a web server for clustering and comparing biological 2828112.
Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2012;40(Database issue):D306–12. 3245097.
Jerome M, Noirot C, Klopp C. Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Res Notes. 2011;4:149. 3117718.
Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001;305(3):567–80.
Kunin V, Engelbrektson A, Ochman H, Hugenholtz P. Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol. 2010;12(1):118–23.
Laserson J, Jojic V, Koller D. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18(3):429–43.
Leplae R, Lima-Mendez G, Toussaint A. ACLAME: a classification of mobile genetic elements, update 2010. Nucleic Acids Res. 2010;38(Database issue):D57–61. 2808911.
Letunic I, Doerks T, Bork P. SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res. 2012;40(Database issue):D302–5. 3245027.
Li W. Analysis and comparison of very large metagenomes with fast clustering and functional annotation. BMC Bioinforma. 2009;10:359. 2774329.
Lorenzi HA, Hoover J, Inman J, Safford T, Murphy S, Kagan L, Williamson SJ. The Viral MetaGenome Annotation Pipeline (VMGAP): an automated tool for the functional annotation of viral Metagenomic shotgun sequencing data. Stand Genomic Sci. 2011;4(3):418–29. 3156399.
Mao X, Cai T, Olyarchuk JG, Wei L. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics. 2005;21(19):3787–93.
Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, et al. CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res. 2011;39(Database issue):D225–9. 3013737.
Markowitz VM, Chen IM, Chu K, Szeto E, Palaniappan K, Grechkin Y, Ratner A, Jacob B, Pati A, Huntemann M, et al. IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res. 2012;40(Database issue):D123–9. 3245048.
McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007;4(1):63–72.
Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, et al. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinforma. 2008;9:386. 2563014.
Minoche AE, Dohm JC, Himmelbauer H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol. 2011;12(11):R112. 3334598.
Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):e155.
Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 2008;15(6):387–96. 2608843.
Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, De Crecy-Lagard V, Diaz N, Disz T, Edwards R, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33(17):5691–702. 1251668.
Patel RK, Jain M. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One. 2012;7(2):e30619. 3270013.
Peng Y, Leung HC, Yiu SM, Chin FY. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics. 2011;27(13):i94–101. 3117360.
Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.
Petersen TN, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011;8(10):785–6.
Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, et al. eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 2012;40(Database issue):D284–9. 3245133.
Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40(Database issue):D130–5. 3245008.
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40(Database issue):D290–301. 3245129.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191. 2978382.
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012;40(Database issue):D13–25. 3245031.
Shoop E, Casaes P, Onsongo G, Lesnett L, Petursdottir EO, Donkor EK, Tkach D, Cosimini M. Data exploration tools for the gene ontology database. Bioinformatics. 2004;20(18):3442–54.
Steward GF, Preston CM. Analysis of a viral metagenomic library from 200 m depth in Monterey Bay, California constructed by direct shotgun cloning. Virol J. 2011;8:287. 3128862.
Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S, Stocks K, Allen EE, Ellisman M, Grethe J, et al. Community cyberinfrastructure for advanced microbial ecology research and analysis: the CAMERA resource. Nucleic Acids Res. 2011;39(Database issue):D546–51. 3013694.
Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282–8.
Tanabe M, Kanehisa M. Using the KEGG database resource. Curr Protoc Bioinform. 2012. Chapter 1:Unit1 12.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41. 222959.
The UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012;40(Database issue):D71–5. 3245120.
Toussaint A, Lima-Mendez G, Leplae R. PhiGO, a phage ontology associated with the ACLAME database. Res Microbiol. 2007;158(7):567–71.
Yok N, Rosen G. Benchmarking of gene prediction programs for metagenomic data. Conf Proc IEEE Eng Med Biol Soc. 2010;2010:6190–3.
Zhang T, Luo Y, Liu K, Pan L, Zhang B, Yu J, Hu S. BIGpre: a quality assessment package for next-generation sequencing data. Genomics Proteomics Bioinforma. 2011;9(6):238–44.

Viral Pathogens in Clinical Samples by Use of a Metagenomic Approach

Jian Yang
MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College (CAMS&PUMC), Beijing, People’s Republic of China

Synonyms

Metagenomic detection of viral agents in clinical samples

Definition

Viral pathogens in clinical samples here refer to human viruses that have been isolated from (or discovered in) clinical samples and are associated with known human diseases. Viruses from environmental samples or nonhuman biological samples, as well as the large number of commensal viruses in the human virome, are not discussed in this entry.

Introduction

Viral diseases continue to threaten public health and medicine in the twenty-first century by causing a significant disease burden globally. Accurate and rapid identification of the viral agents is the key step towards better control and prevention of the associated diseases. Traditional techniques for virus discovery, such as cultivation-, morphology-, serology-, and immunology-based methods, contributed significantly to the identification of most important viral pathogens during the last century. In addition, modern molecular methods such as PCR and microarrays have played increasingly important roles in clinical virology practice over the past decade. The newly emerged metagenomics-based method is a particularly powerful approach for virus identification, since genetic material can be analyzed directly from clinical samples, bypassing the need for culturing, cloning, or a priori knowledge of what viruses may be present. The recent advent of next-generation sequencing (NGS) technologies, which have dramatically improved the speed and cost-effectiveness of sequencing, has fueled the clinical application of the metagenomic method for viral diagnosis. Herein, we summarize the most recent studies that have successfully identified viral pathogens from clinical samples by using the NGS-based metagenomic approach.

Viral Pathogens in Diseased Human Tissues

The astonishing power of the NGS-based metagenomic approach for clinical diagnosis was first illustrated by two remarkable studies reported in 2008. Merkel cell carcinoma (MCC) is a rare but aggressive human skin cancer that typically affects elderly and immune-suppressed individuals. By high-throughput metagenomic sequencing of the cDNA library of tumor tissues and digital subtraction of the human transcriptome, Feng et al. identified a novel polyomavirus that may be a contributing factor in the pathogenesis of MCC (Feng et al. 2008). Another study used a similar strategy to discover a new arenavirus that was likely associated with a cluster of fatal transplant-associated diseases, after many traditional and molecular assays, including culture, PCR, and oligonucleotide microarray, had failed to identify any potential infectious agents (Palacios et al. 2008). The success of the
NGS-based metagenomic approach in clinical diagnosis provided a new route for the identification of pathogens from clinical samples and was believed to herald a breakthrough in the field of pathogen discovery (MacConaill and Meyerson 2008). Indeed, the same group from Columbia University went on to apply metagenomic approaches to screen postmortem tissues for potential viral agents, and they successively identified a new hemorrhagic fever-associated arenavirus named Lujo virus from Southern Africa and an astrovirus as a causative agent of encephalitis in a patient with agammaglobulinemia (Briese et al. 2009; Quan et al. 2010).

Viral Pathogens in Fecal Samples

Because they are relatively easy to obtain, stool specimens are the clinical samples most intensively investigated with metagenomic approaches to date (Table 1). Diarrhea is one of the major infectious causes of death worldwide, but about 40 % of diarrhea cases are of unknown etiology. Metagenomic approaches were recently used by different groups to screen stool samples from diarrhea patients, and many known eukaryotic viruses as well as several new viral species and genera were discovered, including a novel gyrovirus species, GyV3, and a potential new parvovirus genus (Nakamura et al. 2009; Phan et al. 2012a, b). The same group from Blood Systems Research Institute also analyzed fecal samples from 35 South Asian children with nonpolio acute flaccid paralysis and identified a large number of known enteric viruses as well as several new viral species (Victoria et al. 2009). However, numerous viruses were also detectable in the samples from six healthy contacts of the patients. In addition, two groups applied the metagenomic approach to gastrointestinal illness of unknown etiology and revealed a new astrovirus, VA1, and a novel picornavirus named klassevirus, respectively (Finkbeiner et al. 2009; Greninger et al. 2009). Further studies are still required to fully characterize these newly identified potential viral pathogens.

Viral Pathogens in Respiratory Specimens

The respiratory tract is one of the organs of the human body most heavily exposed to microorganisms. Therefore, the new NGS-based metagenomic approaches have been used extensively in different studies to identify viral agents from patients with respiratory infections (Table 1). However, the quantities of sample obtained from the respiratory tract, either swabs or aspirates, are much lower than those of the fecal samples mentioned above. Detection of potential viral agents from such minute respiratory samples using the metagenomic approach is therefore particularly challenging. All of the aforementioned studies targeting human tissues or stools employed the Roche/454 platform for metagenomic sequencing, as it produced longer reads (but lower overall throughput) than other NGS platforms. Nevertheless, three of the five published studies working on respiratory specimens used the Illumina platform instead. In fact, the ultrahigh throughput of the Illumina platform can largely compensate for its disadvantage in read length compared with the Roche/454 platform (Yang et al. 2011). In addition, a simulation study showed that the Illumina technology was more sensitive than the Roche/454 technology in detecting viruses in biological samples (Cheval et al. 2011). Indeed, using only 36 bp reads, our group identified seven known respiratory viral agents from 16 clinical samples, including a case of coinfection that would have been misdiagnosed by conventional PCR assays (Yang et al. 2011). Moreover, with the paired-end sequencing strategy, the novel enterovirus 109 was readily identified from a case of acute respiratory illness in a Nicaraguan child (Yozwiak et al. 2010), and 90 % of the viral genome of H1N1 influenza A virus could even be assembled de novo (Greninger et al. 2010).

Viral Pathogens in Clinical Samples by Use of a Metagenomic Approach, Table 1 Selected clinical viral diagnosis reports using a metagenomic approach based on next-generation sequencing technologies

| Sample types | Specimen | Related diseases | Viral pathogens detected | Sequencing platform | Reference |
|---|---|---|---|---|---|
| Diseased human tissues | Tumor tissues | Merkel cell carcinoma (a type of human skin cancer) | Merkel cell polyomavirus (novel) | Roche/454 | Feng et al. 2008 |
| Diseased human tissues | Postmortem tissues | Fatal transplant-associated diseases | New arenavirus | Roche/454 | Palacios et al. 2008 |
| Diseased human tissues | Postmortem tissues and sera | Hemorrhagic fever | Lujo virus (novel) | Roche/454 | Briese et al. 2009 |
| Diseased human tissues | Biopsy and postmortem tissues | Encephalitis | Astrovirus | Roche/454 | Quan et al. 2010 |
| Fecal samples | Stools | Diarrhea | Norovirus | Roche/454 | Nakamura et al. 2009 |
| Fecal samples | Stools | Nonpolio acute flaccid paralysis | Several known enteric viruses and five novel viruses | Sanger, Roche/454 | Victoria et al. 2009 |
| Fecal samples | Stools | Pediatric gastroenteritis | Klassevirus (novel) | Roche/454 | Greninger et al. 2009 |
| Fecal samples | Stools | Acute gastroenteritis | Astrovirus VA1 (novel) | Sanger, Roche/454 | Finkbeiner et al. 2009 |
| Fecal samples | Stools | Diarrhea | Several known viruses and one novel gyrovirus species | Roche/454 | Phan et al. 2012a |
| Fecal samples | Stools | Pediatric acute diarrhea | Several known viruses and one potential novel genus in the Parvoviridae family | Roche/454 | Phan et al. 2012b |
| Respiratory specimens | Nasopharyngeal aspirates | Influenza | Influenza virus | Roche/454 | Nakamura et al. 2009 |
| Respiratory specimens | Nasopharyngeal swabs | Acute pediatric respiratory illness | Human enterovirus 109 (novel) | Illumina | Yozwiak et al. 2010 |
| Respiratory specimens | Nasopharyngeal swabs | Influenza | 2009 pandemic H1N1 influenza A virus | Illumina | Greninger et al. 2010 |
| Respiratory specimens | Nasopharyngeal aspirates | Acute lower respiratory tract infections | Seven known respiratory viral agents | Illumina | Yang et al. 2011 |
| Respiratory specimens | Nasopharyngeal aspirates | Acute lower respiratory tract infections | Several known respiratory viruses and one novel type of rhinovirus C | Roche/454 | Lysholm et al. 2012 |
| Blood samples | Blood | Hemorrhagic fever | Bundibugyo ebolavirus (novel) | Roche/454 | Towner et al. 2008 |
| Blood samples | Sera | Fever, thrombocytopenia, and leukopenia syndrome | Henan fever virus (novel) | Illumina | Xu et al. 2011 |
| Blood samples | Sera | Hemorrhagic fever | Yellow fever virus | Roche/454 | McMullan et al. 2012 |
| Blood samples | Sera | Dengue-like disease | Human herpesvirus 6 and several other known viruses | Illumina | Yozwiak et al. 2012 |

Viral Pathogens in Blood Samples

Viral hemorrhagic fever (VHF) is a severe illness characterized by high fever and bleeding, which may be caused by a number of viruses. Recently, a group from the Centers for Disease Control and Prevention screened for viral agents in blood samples from VHF patients in Uganda using the Roche/454-based metagenomic approach. They successfully identified a new Ebola virus likely responsible for a large hemorrhagic fever outbreak in western Uganda
(Towner et al. 2008). In another study, of a suspected hemorrhagic fever epidemic in northern Uganda, they used the same strategy and not only recognized yellow fever virus but also generated 98 % of the virus genome sequence, which facilitated the follow-up phylogenetic analyses (McMullan et al. 2012). Illumina platforms have also been employed for the detection of viral pathogens in blood samples with a metagenomic approach (Table 1). During an outbreak of fever, thrombocytopenia, and leukopenia syndrome in China that appeared to be tick-transmitted, most patients tested negative for the initially suspected human granulocytic anaplasmosis. Hence, a metagenomic approach based on paired-end Illumina sequencing was applied to screen for potential viral agents in the sera of patients, and a novel bunyavirus was successfully identified (Xu et al. 2011). In addition, the novel virus was confirmed to be present in 78 % of the acute-phase serum samples by further RT-PCR testing.

Summary

Since its introduction in 2008, we have witnessed the emergence and extensive application of the NGS-based metagenomic approach as a powerful tool in diagnostic virology. The intrinsic properties of metagenomics give the method prominent advantages in speed and sensitivity for parallel screening of known viral pathogens as well as detection of new, unexpected viral agents in clinical samples. With the continuous development and improvement of high-throughput sequencing technologies, the metagenomic approach will probably become an essential diagnostic method in clinical routines.

However, at the current stage, several issues should be kept in mind when applying the metagenomic approach in viral diagnostic practice. First, the choice of NGS platform is critical to both the preceding preparation of sample nucleic acids and the subsequent sequence data analyses. Though the majority of published applications used the Roche/454 platform, the Illumina technology is increasingly employed in the most recent studies, as its higher throughput offers greater sensitivity than the former. Second, unlike traditional methods, the metagenomic approach relies heavily on subsequent bioinformatic data analyses, which can be very tricky, particularly when detecting potential novel viruses. The lack of standard protocols for metagenomic data analysis has hampered wider application of the metagenomic approach. Third, results from the metagenomic approach only indicate the presence of given viruses in the clinical samples screened; they cannot directly establish that the viruses identified are responsible for the human diseases investigated. Hence, the biological and medical interpretation of metagenomic results may require further evidence from epidemiology, morphology, immunology, etc.

Cross-References

▶ Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing
▶ Viral MetaGenome Annotation Pipeline

References

Briese T, Paweska JT, McMullan LK, et al. Genetic detection and characterization of Lujo virus, a new hemorrhagic fever-associated arenavirus from southern Africa. PLoS Pathog. 2009;5:e1000455.
Cheval J, Sauvage V, Frangeul L, et al. Evaluation of high-throughput sequencing for identifying known and unknown viruses in biological samples. J Clin Microbiol. 2011;49:3268–75.
Feng H, Shuda M, Chang Y, et al. Clonal integration of a polyomavirus in human Merkel cell carcinoma. Science. 2008;319:1096–100.
Finkbeiner SR, Li Y, Ruone S, et al. Identification of a novel astrovirus (astrovirus VA1) associated with an outbreak of acute gastroenteritis. J Virol. 2009;83:10836–9.
Greninger AL, Chen EC, Sittler T, et al. A metagenomic analysis of pandemic influenza A (2009 H1N1) infection in patients from North America. PLoS One. 2010;5:e13381.
Greninger AL, Runckel C, Chiu CY, et al. The complete genome of klassevirus – a novel picornavirus in pediatric stool. Virol J. 2009;6:82.
Lysholm F, Wetterbom A, Lindau C, et al. Characterization of the viral microbiome in patients with severe lower respiratory tract infections, using metagenomic sequencing. PLoS One. 2012;7:e30875.
MacConaill L, Meyerson M. Adding pathogens by genomic subtraction. Nat Genet. 2008;40:380–2.
McMullan LK, Frace M, Sammons SA, et al. Using next generation sequencing to identify yellow fever virus in Uganda. Virology. 2012;422:1–5.
Nakamura S, Yang CS, Sakon N, et al. Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach. PLoS One. 2009;4:e4219.
Palacios G, Druce J, Du L, et al. A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med. 2008;358:991–8.
Phan TG, Li L, O’Ryan MG, et al. A third gyrovirus species in human faeces. J Gen Virol. 2012a;93:1356–61.
Phan TG, Vo NP, Bonkoungou IJ, et al. Acute diarrhea in West African children: diverse enteric viruses and a novel parvovirus genus. J Virol. 2012b;86:11024–30.
Quan PL, Wagner TA, Briese T, et al. Astrovirus encephalitis in boy with X-linked agammaglobulinemia. Emerg Infect Dis. 2010;16:918–25.
Towner JS, Sealy TK, Khristova ML, et al. Newly discovered ebola virus associated with hemorrhagic fever outbreak in Uganda. PLoS Pathog. 2008;4:e1000212.
Victoria JG, Kapoor A, Li L, et al. Metagenomic analyses of viruses in stool samples from children with acute flaccid paralysis. J Virol. 2009;83:4642–51.
Xu B, Liu L, Huang X, et al. Metagenomic analysis of fever, thrombocytopenia and leukopenia syndrome (FTLS) in Henan Province, China: discovery of a new bunyavirus. PLoS Pathog. 2011;7:e1002369.
Yang J, Yang F, Ren L, et al. Unbiased parallel detection of viral pathogens in clinical samples by use of a metagenomic approach. J Clin Microbiol. 2011;49:3463–9.
Yozwiak NL, Skewes-Cox P, Gordon A, et al. Human enterovirus 109: a novel interspecies recombinant enterovirus isolated from a case of acute pediatric respiratory illness in Nicaragua. J Virol. 2010;84:9047–58.
Yozwiak NL, Skewes-Cox P, Stenglein MD, et al. Virus identification in unknown tropical febrile illness cases using deep sequencing. PLoS Negl Trop Dis. 2012;6:e1485.
List of Entries

A 123 of Metagenomics
A De Novo Metagenomic Assembly Program for Shotgun DNA Reads
Ab Initio Gene Identification in Metagenomic Sequences
AbundanceBin, Metagenomic Sequencing
Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads
All-Species Living Tree Project
antiSMASH
Approaches in Metagenome Research: Progress and Challenges
Arbuscular Mycorrhizal Fungi Assemblages in Chernozems
Bacterial Diversity in Tree Canopies of the Atlantic Forest
Bacteriocin Mining in Metagenomes
Binning Sequences Using Very Sparse Labels Within a Metagenome
Biological Treasure Metagenome
Carbohydrate-Active Enzymes Database, Metagenomic Expert Resource
Challenge of Metagenome Assembly and Possible Standards
CLUSEAN, Overview
Computational Approaches for Metagenomic Datasets
Conserved Regions in 16S Ribosome RNA Sequences and Primer Design for Studies of Environmental Microbes
Culture Collections in the Study of Microbial Diversity, Importance
Culturing
Customizable Web Server for Fast Metagenomic Sequence Analysis
DACTAL
Diversity and Distribution of Marine Microbial Eukaryotes
DNA Methylation Analysis by Pyrosequencing
Environmental Shaping of Codon Usage and Functional Adaptation Across Microbial Communities
Evaluating Putative Chimeric Sequences from PCR-Amplified Products
Extended Local Similarity Analysis (eLSA) of Biological Data
Extraction Methods, Variability Encountered in
Extradiol Dioxygenases Retrieved from the Metagenome
Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences
Fosmid System
FragGeneScan: Predicting Genes in Short and Error-Prone Reads
FR-HIT Overview
Functional Metagenomics of Bacterial-Cell Crosstalk
Functional Metagenomics of Human Intestinal Microbiome β-Glucuronidase Activity
Functional Viral Metagenomics and the Development of New Enzymes for DNA and RNA Amplification and Sequencing
Genome Atlases, Potential Applications in Study of Metagenomes
Genome Portal, Joint Genome Institute
Genome-Based Studies of Marine Microorganisms
GeoChip-Based Metagenomic Technologies for Analyzing Microbial Community Functional Structure and Activities
GHOSTM
Horizontal Gene Transfer and Bacterial Diversity
Host-Virus Interaction: From Metagenomics to Single-Cell Genomics
Human Gut Microbial Genes by Metagenomic Sequencing
Human Oral Microbiome Database (HOMD)
Insights into Environmental Microbial Denitrification from Integrated Metagenomic, Cultivation, and Genomic Analyses
Integrated Database Resource for Marine Ecological Genomics
Integrons as Repositories of Genetic Novelty
IPRStats, Overview
I-rDNA and C16S: Identification and Classification of Ribosomal RNA Gene Fragments
KEGG and GenomeNet, New Developments, Metagenomic Analysis
Krona: Interactive Metagenomic Visualization in a Web Browser
Lateral Gene Transfer and Microbial Diversity
Lessons Learned from Simulated Metagenomic Datasets
MEMOSys: Platform for Genome-Scale Metabolic Models
MetaBin
MetaBioME
MEtaGenome ANalyzer (MEGAN): Metagenomic Expert Resource
Metagenome of Acidic Hot Spring Microbial Planktonic Community: Structural and Functional Insights
Metagenomes: 23S Sequences
Metagenomic Analysis of Bile Salt Hydrolases in the Human Gut Microbiome
Metagenomic by RAPD Profiling
Metagenomic Potential for Understanding Horizontal Gene Transfer
Metagenomic Research: Methods and Ecological Applications
Metagenomics Potential for Bioremediation
Metagenomics, Metadata, and Meta-analysis
MetaRank: Ranking Microbial Taxonomic Units or Functional Groups for Comparative Analysis of Metagenomes
METAREP, Overview
MetaTISA: Metagenomic Gene Start Prediction with
Metaxa, Overview
Microbial Diversity, Bar-Coding Approaches
Microbial Ecology in the Age of Metagenomics: An Introduction
Microbial Ecosystems, Protection of
Mining Metagenomic Datasets for Antibiotic Resistance Genes
Mining Metagenomic Datasets for Cellulases
Mock Community Analysis
Molecular Ecological Network of Microbial Communities
Monitoring Lactic Acid Bacterial Diversity During Shochu Fermentation
MRL and SuperFine+MRL
New Computational Methodologies to Understand Microbial Diversity
New Method for Comparative Functional Genomics and Metagenomics Using KEGG MODULE
Next-Generation Sequencing for Metagenomic Data: Assembling and Binning
NGS QC Toolkit: A Platform for Quality Control of Next-Generation Sequencing Data
Novel Alkalistable and Thermostable Xylanase-Encoding Gene (Mxyl) Retrieved from Compost-Soil Metagenome
Novel Approaches to Pathogen Discovery in Metagenomes
Nucleotide Composition Analysis: Use in Metagenome Analysis
Open Resource Metagenomics
Phylogenetics, Overview
PhyloPythia(S)
Plasmid Capture from Metagenomes
Protein-Coding Genes as Alternative Markers in Microbial Diversity Studies
Proteomics and Metaproteomics
RITA: Rapid Identification of High-Confidence Taxonomic Assignments for Metagenomic Data
SATe-Enabled Phylogenetic Placement
Serial Analysis of V1 Ribosomal Sequence Tags
SILVA Databases
Simultaneous Quantification of Multiple Bacteria
STAMP: Statistical Analysis of Metagenomic Profiles
Subtractive Hybridization Magnetic Bead Capture: Molecular Technique for Recovery of Full-Length ORFs from Metagenomes
Taxa Counting Using Specific Peptides of Aminoacyl tRNA Synthetases
Taxonomic Classification of Metagenomic Shotgun Sequences with CARMA3
The Vaginal Microbiome in Health and Disease
tRNA Gene Database Curated Manually by Experts
Use of Bacterial Artificial Chromosomes in Metagenomics Studies, Overview
Use of Viral Metagenomes from Yellowstone Hot Springs to Study Phylogenetic Relationships and Evolution
Variable Selection to Improve Classification of Metagenomes
Viral Metagenome Annotation Pipeline
Viral Pathogens in Clinical Samples by Use of a Metagenomic Approach

This encyclopedia includes no entries for W, X, Y & Z.

