Bioinformatics Research and Applications 12th International Symposium ISBRA 2016 Minsk Belarus June 5 8 2016 Proceedings 1st Edition Anu Bourgeois

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 54

Bioinformatics Research and

Applications 12th International


Symposium ISBRA 2016 Minsk Belarus
June 5 8 2016 Proceedings 1st Edition
Anu Bourgeois
Visit to download the full and correct content document:
https://textbookfull.com/product/bioinformatics-research-and-applications-12th-interna
tional-symposium-isbra-2016-minsk-belarus-june-5-8-2016-proceedings-1st-edition-a
nu-bourgeois/
More products digital (pdf, epub, mobi) instant
download maybe you interests ...

Bioinformatics Research and Applications 10th


International Symposium ISBRA 2014 Zhangjiajie China
June 28 30 2014 Proceedings 1st Edition Mitra Basu

https://textbookfull.com/product/bioinformatics-research-and-
applications-10th-international-symposium-isbra-2014-zhangjiajie-
china-june-28-30-2014-proceedings-1st-edition-mitra-basu/

Experimental Algorithms 15th International Symposium


SEA 2016 St Petersburg Russia June 5 8 2016 Proceedings
1st Edition Andrew V. Goldberg

https://textbookfull.com/product/experimental-algorithms-15th-
international-symposium-sea-2016-st-petersburg-russia-
june-5-8-2016-proceedings-1st-edition-andrew-v-goldberg/

Pattern Recognition and Information Processing: 13th


International Conference, PRIP 2016, Minsk, Belarus,
October 3-5, 2016, Revised Selected Papers 1st Edition
Viktor V. Krasnoproshin
https://textbookfull.com/product/pattern-recognition-and-
information-processing-13th-international-conference-
prip-2016-minsk-belarus-october-3-5-2016-revised-selected-
papers-1st-edition-viktor-v-krasnoproshin/

Theory and Applications of Satisfiability Testing SAT


2016 19th International Conference Bordeaux France July
5 8 2016 Proceedings 1st Edition Nadia Creignou

https://textbookfull.com/product/theory-and-applications-of-
satisfiability-testing-sat-2016-19th-international-conference-
bordeaux-france-july-5-8-2016-proceedings-1st-edition-nadia-
Applied Reconfigurable Computing 12th International
Symposium ARC 2016 Mangaratiba RJ Brazil March 22 24
2016 Proceedings 1st Edition Vanderlei Bonato

https://textbookfull.com/product/applied-reconfigurable-
computing-12th-international-symposium-arc-2016-mangaratiba-rj-
brazil-march-22-24-2016-proceedings-1st-edition-vanderlei-bonato/

Engineering Secure Software and Systems 8th


International Symposium ESSoS 2016 London UK April 6 8
2016 Proceedings 1st Edition Juan Caballero

https://textbookfull.com/product/engineering-secure-software-and-
systems-8th-international-symposium-essos-2016-london-uk-
april-6-8-2016-proceedings-1st-edition-juan-caballero/

NASA Formal Methods 8th International Symposium NFM


2016 Minneapolis MN USA June 7 9 2016 Proceedings 1st
Edition Sanjai Rayadurgam

https://textbookfull.com/product/nasa-formal-methods-8th-
international-symposium-nfm-2016-minneapolis-mn-usa-
june-7-9-2016-proceedings-1st-edition-sanjai-rayadurgam/

Unifying Theories of Programming 6th International


Symposium UTP 2016 Reykjavik Iceland June 4 5 2016
Revised Selected Papers 1st Edition Jonathan P. Bowen

https://textbookfull.com/product/unifying-theories-of-
programming-6th-international-symposium-utp-2016-reykjavik-
iceland-june-4-5-2016-revised-selected-papers-1st-edition-
jonathan-p-bowen/

Search Based Software Engineering 8th International


Symposium SSBSE 2016 Raleigh NC USA October 8 10 2016
Proceedings 1st Edition Federica Sarro

https://textbookfull.com/product/search-based-software-
engineering-8th-international-symposium-ssbse-2016-raleigh-nc-
usa-october-8-10-2016-proceedings-1st-edition-federica-sarro/
Anu Bourgeois · Pavel Skums
Xiang Wan · Alex Zelikovsky (Eds.)
LNBI 9683

Bioinformatics Research
and Applications
12th International Symposium, ISBRA 2016
Minsk, Belarus, June 5–8, 2016
Proceedings

123
Lecture Notes in Bioinformatics 9683

Subseries of Lecture Notes in Computer Science

LNBI Series Editors


Sorin Istrail
Brown University, Providence, RI, USA
Pavel Pevzner
University of California, San Diego, CA, USA
Michael Waterman
University of Southern California, Los Angeles, CA, USA

LNBI Editorial Board


Søren Brunak
Technical University of Denmark, Kongens Lyngby, Denmark
Mikhail S. Gelfand
IITP, Research and Training Center on Bioinformatics, Moscow, Russia
Thomas Lengauer
Max Planck Institute for Informatics, Saarbrücken, Germany
Satoru Miyano
University of Tokyo, Tokyo, Japan
Eugene Myers
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden,
Germany
Marie-France Sagot
Université Lyon 1, Villeurbanne, France
David Sankoff
University of Ottawa, Ottawa, Canada
Ron Shamir
Tel Aviv University, Ramat Aviv, Tel Aviv, Israel
Terry Speed
Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
Martin Vingron
Max Planck Institute for Molecular Genetics, Berlin, Germany
W. Eric Wong
University of Texas at Dallas, Richardson, TX, USA
More information about this series at http://www.springer.com/series/5381
Anu Bourgeois Pavel Skums

Xiang Wan Alex Zelikovsky (Eds.)


Bioinformatics Research
and Applications
12th International Symposium, ISBRA 2016
Minsk, Belarus, June 5–8, 2016
Proceedings

123
Editors
Anu Bourgeois Xiang Wan
Department of Computer Science Department of Computer Science
Georgia State University Hong Kong Baptist University
Atlanta, GA Kowloon Tong
USA Hong Kong SAR
Pavel Skums Alex Zelikovsky
Centre for Disease Control and Prevention Georgia State University
Atlanta, GA Atlanta, GA
USA USA

ISSN 0302-9743 ISSN 1611-3349 (electronic)


Lecture Notes in Bioinformatics
ISBN 978-3-319-38781-9 ISBN 978-3-319-38782-6 (eBook)
DOI 10.1007/978-3-319-38782-6

Library of Congress Control Number: 2016938416

LNCS Sublibrary: SL8 – Bioinformatics

© Springer International Publishing Switzerland 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG Switzerland
Preface

The 12th edition of the International Symposium on Bioinformatics Research and


Applications (ISBRA 2016) was held during June 5–8, 2016, in Minsk, Belarus. The
symposium provides a forum for the exchange of ideas and results among researchers,
developers, and practitioners working on all aspects of bioinformatics and computa-
tional biology and their applications.
There were 77 submissions received in response to the call for papers. The Program
Committee decided to accept 42 of them for publication in the proceedings and for oral
presentation at the symposium: 22 for Track 1 (an extended abstract) and 20 for Track
2 (an abridged abstract). The technical program also featured invited keynote talks by
five distinguished speakers: Dr. Teresa M. Przytycka from the National Institutes of
Health discussed the network perspective on genetic variations, from model organisms
to diseases; Prof. Ion Mandoiu from the University of Connecticut spoke on challenges
and opportunities in single-cell genomics; Prof. Alexander Schoenhuth from Centrum
Wiskunde and Informatica spoke on dealing with uncertainties in big genome data;
Prof. Ilya Vakser from the University of Kansas discussed genome-wide structural
modeling of protein–protein interactions; and Prof. Max Alekseyev from George
Washington University spoke on multi-genome scaffold co-assembly based on the
analysis of gene orders and genomic repeats.
We would like to thank the Program Committee members and external reviewers for
volunteering their time to review and discuss symposium papers. Furthermore, we
would like to extend special thanks to the steering and general chairs of the symposium
for their leadership, and to the finance, publicity, local organization, and publication
chairs for their hard work in making ISBRA 2016 a successful event. Last but not least,
we would like to thank all authors for presenting their work at the symposium.

June 2016 Anu Bourgeois


Pavel Skums
Xiang Wan
Alex Zelikovsky
Organization

Steering Chairs
Dan Gusfield University of California, Davis, USA
Ion Măndoiu University of Connecticut, USA
Yi Pan Georgia State University, USA
Marie-France Sagot Inria, France
Ying Xu University of Georgia, USA
Alexander Zelikovsky Georgia State University, USA

General Chairs
Sergei V. Ablameyko Belarusian State University, Belarus
Bernard Moret École Polytechnique Fédérale de Lausanne,
Switzerland
Alexander V. Tuzikov National Academy of Sciences of Belarus

Program Chairs
Anu Bourgeois Georgia State University, USA
Pavel Skums Centers for Disease Control and Prevention, USA
Xiang Wan Hong Kong Baptist University, SAR China
Alex Zelikovsky Georgia State University, USA

Finance Chairs
Yury Metelsky Belarusian State University, Belarus
Alex Zelikovsky Georgia State University, USA

Publications Chair
Igor E. Tom National Academy of Sciences of Belarus

Local Arrangements Chair


Dmitry G. Medvedev Belarusian State University, Belarus

Publicity Chair
Erin-Elizabeth A. Durham Georgia State University, USA
VIII Organization

Workshop Chair
Ion Mandoiu University of Connecticut, USA

Program Committee
Kamal Al Nasr Tennessee State University, USA
Sahar Al Seesi University of Connecticut, USA
Max Alekseyev George Washington University, USA
Mukul S. Bansal University of Connecticut, USA
Robert Beiko Dalhousie University, Canada
Paola Bonizzoni Università di Milano-Bicocca, Italy
Anu Bourgeois Georgia State University, USA
Zhipeng Cai Georgia State University, USA
D.S. Campo Rendon Centers for Disease Control and Prevention, USA
Doina Caragea Kansas State University, USA
Tien-Hao Chang National Cheng Kung University, Taiwan
Ovidiu Daescu University of Texas at Dallas, USA
Amitava Datta University of Western Australia, Australia
Jorge Duitama International Center for Tropical Agriculture, Columbia
Oliver Eulenstein Iowa State University, USA
Lin Gao Xidian University, China
Lesley Greene Old Dominion University, USA
Katia Guimaraes Universidade Federal de Pernambuco, Brazil
Jiong Guo Universität des Saarlandes, Germany
Jieyue He Southeast University, China
Jing He Old Dominion University, USA
Zengyou He Hong Kong University of Science and Technology,
SAR China
Steffen Heber North Carolina State University, USA
Jinling Huang East Carolina University, USA
Ming-Yang Kao Northwestern University, USA
Yury Khudyakov Centers for Disease Control and Prevention, USA
Wooyoung Kim University of Washington Bothell, USA
Danny Krizanc Wesleyan University, USA
Yaohang Li Old Dominion University, USA
Min Li Central South University, USA
Jing Li Case Western Reserve University, USA
Shuai Cheng Li City University of Hong Kong, SAR China
Ion Mandoiu University of Connecticut, USA
Fenglou Mao University of Georgia, USA
Osamu Maruyama Kyushu University, Japan
Giri Narasimhan Florida International University, USA
Bogdan Pasaniuc University of California at Los Angeles, USA
Steven Pascal Old Dominion University, USA
Andrei Paun University of Bucharest, Romania
Organization IX

Nadia Pisanti Università di Pisa, Italy


Russell Schwartz Carnegie Mellon University, USA
Joao Setubal University of Sao Paulo, Brazil
Xinghua Shi University of North Carolina at Charlotte, USA
Yi Shi Jiao Tong University, China
Pavel Skums Centers for Disease Control and Prevention, USA
Ileana Streinu Smith College, USA
Zhengchang Su University of North Carolina at Charlotte, USA
Wing-Kin Sung National University of Singapore, Singapore
Sing-Hoi Sze Texas A&M University, USA
Weitian Tong Georgia Southern University, USA
Gabriel Valiente Technical University of Catalonia, Spain
Xiang Wan Hong Kong Baptist University, SAR China
Jianxin Wang Central South University, China
Li-San Wang University of Pennsylvania, USA
Peng Wang Shanghai Advanced Research Institute, China
Lusheng Wang City University of Hong Kong, SAR China
Seth Weinberg Old Dominion University, USA
Fangxiang Wu University of Saskatchewan, Canada
Yufeng Wu University of Connecticut, USA
Minzhu Xie Hunan Normal University, China
Dechang Xu Harbin Institute of Technology, China
Can Yang Hong Kong Baptist University, SAR China
Ashraf Yaseen Texas A&M University-Kingsville, USA
Guoxian Yu Southwest University, China
Alex Zelikovsky Georgia State University, USA
Yanqing Zhang Georgia State University, USA
Fa Zhang Institute of Computing Technology, China
Leming Zhou University of Pittsburgh, USA
Fengfeng Zhou Chinese Academy of Sciences, China
Quan Zou Xiamen University, China

Additional Reviewers

Aganezov, Sergey Cirjila, Ionela Leung, Yuk Yee Previtali, Marco


Alexeev, Nikita Cobeli, Matei Li, Xin Rizzi, Raffaella
Avdeyev, Pavel Della Vedova, Luo, Junwei Sun, Sunah
Castro-Nallar, Gianluca Pei, Jingwen Sun, Luni
Eduardo Ghinea, Razvan Peng, Xiaoqing Wang, Weixin
Chen, Huiyuan Ionescu, Vlad Pham, Son Zhang, June
Chen, Yong Kuksa, Pavel Postavaru, Stefan Zhao, Wei
Chu, Chong Lara, James Preda, Catalina Zhou, Chan
Contents

Next Generation Sequencing Data Analysis

An Efficient Algorithm for Finding All Pairs k-Mismatch Maximal


Common Substrings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Sharma V. Thankachan, Sriram P. Chockalingam, and Srinivas Aluru

Poisson-Markov Mixture Model and Parallel Algorithm for Binning


Massive and Heterogenous DNA Sequencing Reads . . . . . . . . . . . . . . . . . . 15
Lu Wang, Dongxiao Zhu, Yan Li, and Ming Dong

FSG: Fast String Graph Construction for De Novo Assembly of Reads Data . . . 27
Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali,
and Raffaella Rizzi

OVarCall: Bayesian Mutation Calling Method Utilizing Overlapping


Paired-End Reads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Takuya Moriyama, Yuichi Shiraishi, Kenichi Chiba, Rui Yamaguchi,
Seiya Imoto, and Satoru Miyano

High-Performance Sensing of DNA Hybridization on Surface


of Self-organized MWCNT-Arrays Decorated by Organometallic
Complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
V.P. Egorova, H.V. Grushevskaya, N.G. Krylova, I.V. Lipnevich,
T.I. Orekhovskaja, B.G. Shulitski, and V.I. Krot

Towards a More Accurate Error Model for BioNano Optical Maps . . . . . . . . 67


Menglu Li, Angel C.Y. Mak, Ernest T. Lam, Pui-Yan Kwok, Ming Xiao,
Kevin Y. Yip, Ting-Fung Chan, and Siu-Ming Yiu

HapIso: An Accurate Method for the Haplotype-Specific Isoforms


Reconstruction from Long Single-Molecule Reads. . . . . . . . . . . . . . . . . . . . 80
Serghei Mangul, Harry (Taegyun) Yang, Farhad Hormozdiari,
Elizabeth Tseng, Alex Zelikovsky, and Eleazar Eskin

Protein-Protein Interactions and Networks

Genome-Wide Structural Modeling of Protein-Protein Interactions. . . . . . . . . 95


Ivan Anishchenko, Varsha Badal, Taras Dauzhenka, Madhurima Das,
Alexander V. Tuzikov, Petras J. Kundrotas, and Ilya A. Vakser

Identifying Essential Proteins by Purifying Protein Interaction Networks . . . . 106


Min Li, Xiaopei Chen, Peng Ni, Jianxin Wang, and Yi Pan
XII Contents

Differential Functional Analysis and Change Motifs in Gene Networks


to Explore the Role of Anti-sense Transcription . . . . . . . . . . . . . . . . . . . . . 117
Marc Legeay, Béatrice Duval, and Jean-Pierre Renou

Predicting MicroRNA-Disease Associations by Random Walking on


Multiple Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Wei Peng, Wei Lan, Zeng Yu, Jianxin Wang, and Yi Pan

Progression Reconstruction from Unsynchronized Biological Data using


Cluster Spanning Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Ryan Eshleman and Rahul Singh

Protein and RNA Structure

Consistent Visualization of Multiple Rigid Domain Decompositions


of Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Emily Flynn and Ileana Streinu

A Multiagent Ab Initio Protein Structure Prediction Tool for Novices


and Experts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Thiago Lipinski-Paes, Michele dos Santos da Silva Tanus,
José Fernando Ruggiero Bachega, and Osmar Norberto de Souza

Filling a Protein Scaffold with a Reference. . . . . . . . . . . . . . . . . . . . . . . . . 175


Letu Qingge, Xiaowen Liu, Farong Zhong, and Binhai Zhu

Phylogenetics

Mean Values of Gene Duplication and Loss Cost Functions . . . . . . . . . . . . . 189


Paweł Górecki, Jarosław Paszek, and Agnieszka Mykowiecka

The SCJ Small Parsimony Problem for Weighted Gene Adjacencies . . . . . . . 200
Nina Luhmann, Annelyse Thévenin, Aïda Ouangraoua, Roland Wittler,
and Cedric Chauve

Path-Difference Median Trees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211


Alexey Markin and Oliver Eulenstein

NEMo: An Evolutionary Model with Modularity for PPI Networks . . . . . . . . 224


Min Ye, Gabriela C. Racz, Qijia Jiang, Xiuwei Zhang,
and Bernard M.E. Moret

Multi-genome Scaffold Co-assembly Based on the Analysis of Gene Orders


and Genomic Repeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Sergey Aganezov and Max A. Alekseyev
Contents XIII

Sequence and Image Analysis

Selectoscope: A Modern Web-App for Positive Selection Analysis


of Genomic Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Andrey V. Zaika, Iakov I. Davydov, and Mikhail S. Gelfand

Methods for Genome-Wide Analysis of MDR and XDR Tuberculosis from


Belarus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Roman Sergeev, Ivan Kavaliou, Andrei Gabrielian, Alex Rosenthal,
and Alexander Tuzikov

Haplotype Inference for Pedigrees with Few Recombinations . . . . . . . . . . . . 269


B. Kirkpatrick

Improved Detection of 2D Gel Electrophoresis Spots by Using Gaussian


Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Michal Marczyk

Abridged Track 2 Abstracts

Predicting Combinative Drug Pairs via Integrating Heterogeneous Features


for Both Known and New Drugs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Jia-Xin Li, Jian-Yu Shi, Ke Gao, Peng Lei, and Siu-Ming Yiu

SkipCPP-Pred: Promising Prediction Method for Cell-Penetrating Peptides


Using Adaptive k-Skip-n-Gram Features on a High-Quality Dataset . . . . . . . 299
Wei Leyi and Zou Quan

CPredictor2.0: Effectively Detecting both Small and Large Complexes from


Protein-Protein Interaction Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Bin Xu, Jihong Guan, Yang Wang, and Shuigeng Zhou

Structural Insights into Antiapoptotic Activation of Bcl-2 and Bcl-xL


Mediated by FKBP38 and tBid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Valery Veresov and Alexander Davidovskii

VAliBS: A Visual Aligner for Bisulfite Sequences . . . . . . . . . . . . . . . . . . . 307


Min Li, Xiaodong Yan, Lingchen Zhao, Jianxin Wang, Fang-Xiang Wu,
and Yi Pan

MegaGTA: A Sensitive and Accurate Metagenomic Gene-Targeted


Assembler Using Iterative de Bruijn Graphs . . . . . . . . . . . . . . . . . . . . . . . . 309
Dinghua Li, Yukun Huang, Henry Chi-Ming Leung, Ruibang Luo,
Hing-Fung Ting, and Tak-Wah Lam
XIV Contents

EnhancerDBN: An Enhancer Prediction Method Based on Deep Belief


Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Hongda Bu, Yanglan Gan, Yang Wang, Jihong Guan,
and Shuigeng Zhou

An Improved Burden-Test Pipeline for Cancer Sequencing Data . . . . . . . . . . 314


Yu Geng, Zhongmeng Zhao, Xuanping Zhang, Wenke Wang, Xiao Xiao,
and Jiayin Wang

Modeling and Simulation of Specific Production of Trans10,


cis12-Conjugated Linoleic Acid in the Biosynthetic Pathway . . . . . . . . . . . . 316
Aiju Hou, Xiangmiao Zeng, Zhipeng Cai, and Dechang Xu

Dynamic Protein Complex Identification in Uncertain Protein-Protein


Interaction Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Yijia Zhang, Hongfei Lin, Zhihao Yang, and Jian Wang

Predicting lncRNA-Protein Interactions Based on Protein-Protein Similarity


Network Fusion (Extended Abstract) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Xiaoxiang Zheng, Kai Tian, Yang Wang, Jihong Guan,
and Shuigeng Zhou

DCJ-RNA: Double Cut and Join for RNA Secondary Structures


Using a Component-Based Representation . . . . . . . . . . . . . . . . . . . . . . . . . 323
Ghada Badr and Haifa Al-Aqel

Improve Short Read Homology Search Using Paired-End Read Information . . . 326
Prapaporn Techa-Angkoon, Yanni Sun, and Jikai Lei

Framework for Integration of Genome and Exome Data for More Accurate
Identification of Somatic Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Vinaya Vijayan, Siu-Ming Yiu, and Liqing Zhang

Semantic Biclustering: A New Way to Analyze and Interpret Gene


Expression Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
Jiří Kléma, František Malinka, and Filip Železný

Analyzing microRNA Epistasis in Colon Cancer Using Empirical Bayesian


Elastic Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Jia Wen, Andrew Quitadamo, Benika Hall, and Xinghua Shi

Tractable Kinetics of RNA–Ligand Interaction . . . . . . . . . . . . . . . . . . . . . . 337


Felix Kühnl, Peter F. Stadler, and Sebastian Will

MitoDel: A Method to Detect and Quantify Mitochondrial DNA Deletions


from Next-Generation Sequence Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Colleen M. Bosworth, Sneha Grandhi, Meetha P. Gould,
and Thomas LaFramboise
Contents XV

TRANScendence: Transposable Elements Database and De-novo Mining


Tool Allows Inferring TEs Activity Chronology . . . . . . . . . . . . . . . . . . . . . 342
Michał Startek, Jakub Nogły, Dariusz Grzebelus, and Anna Gambin

Phylogeny Reconstruction from Whole-Genome Data Using Variable


Length Binary Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Lingxi Zhou, Yu Lin, Bing Feng, Jieyi Zhao, and Jijun Tang

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347


Next Generation Sequencing Data
Analysis
An Efficient Algorithm for Finding All Pairs
k-Mismatch Maximal Common Substrings

Sharma V. Thankachan1 , Sriram P. Chockalingam2 , and Srinivas Aluru1(B)


1
School of CSE, Georgia Institute of Technology, Atlanta, USA
sharma.thankachan@gatech.edu, aluru@cc.gatech.edu
2
Department of CSE, Indian Institute of Technology, Bombay, India
sriram.pc@iitb.ac.in

Abstract. Identifying long pairwise maximal common substrings


among a large set of sequences is a frequently used construct in com-
putational biology, with applications in DNA sequence clustering and
assembly. Due to errors made by sequencers, algorithms that can accom-
modate a small number of differences are of particular interest, but
obtaining provably efficient solutions for such problems has been elusive.
In this paper, we present a provably efficient algorithm with an expected
run time guarantee of O(N logk N + occ), where occ is the output size,
for the following problem: Given a collection D = {S1 , S2 , . . . , Sn } of n
sequences of total length N , a length threshold φ and a mismatch thresh-
old k ≥ 0, report all k-mismatch maximal common substrings of length
at least φ over all pairs of sequences in D. In addition, we present a result
showing the hardness of this problem.

1 Introduction

Due to preponderance of DNA and RNA sequences that can be modeled as


strings, string matching algorithms have myriad applications in computational
biology. Modern sequencing instruments sequence a large collection of short reads
that are randomly drawn from one or multiple genomes. Deciphering pairwise
relationships between the reads is often the first step in many applications. For
example, one may be interested in finding all pairs of reads that have a sufficiently
long overlap, such as suffix/prefix overlap (for genomic or metagenomic assem-
bly) or substring overlap (for read compression, finding RNA sequences contain-
ing common exons, etc.). Sequencing instruments make errors, which translate to
insertion, deletion, or substitution errors in the reads they characterize, depend-
ing on the type of instrument. Much of modern-day high-throughput sequencing
is carried out using Illumina sequencers, which have a small error rate (< 1–2 %)
and predominantly (> 99 %) substitution errors. Thus, algorithms that tolerate
a small number of mismatch errors can yield the same solution as the much
more expensive alignment/edit distance computations. Motivated by such appli-
cations, we formulate the following all pairs k -mismatch maximal common
substrings problem:


c Springer International Publishing Switzerland 2016
A. Bourgeois et al. (Eds.): ISBRA 2016, LNBI 9683, pp. 3–14, 2016.
DOI: 10.1007/978-3-319-38782-6 1
4 S.V. Thankachan et al.

Problem 1. Given a collection D = {S1 , S2 , . . . , Sn } of n sequences of total


length N , a length threshold φ, and a mismatch threshold k ≥ 0, report all
k-mismatch maximal common substrings of length ≥ φ between any pair of
sequences in D.

Throughout this paper, for a sequence Si ∈ D, |Si | is its length, Si [x] is its
xth character and Si [x..y] is its substring starting at position x and ending at
position y, where 1 ≤ x ≤ y ≤ |Si |. For brevity, we may use Si [x..] to denote the
suffix of Si starting at x. The substrings Si [x..(x+l −1)] and Sj [y..(y +l −1)] are
a k-mismatch common substring if the Hamming distance between them is at
most k. It is maximal if it cannot be extended on either side without introducing
another mismatch.
Before investigating Problem 1, consider an existential version of this prob-
lem, where we are just interested in enumerating those pairs (Si , Sj ) that contain
at least one k-mismatch maximal common substring of length ≥ φ. For k = 0 and
any given (Si , Sj ) pair, this can be easily answered in O(|Si | + |Sj |) time using a
generalized suffix tree based algorithm because no mismatches are involved. For
k ≥ 1, we can use the recent result by Aluru et al. [1], by which the k-mismatch
longest common substring can be identified in O((|Si | + |Sj |) logk (|Si | + |Sj |))
time. Therefore, this straightforward approach  can solve the existential prob-
lem over all pairs of sequences in D in i j (|Si | + |Sj |) logk (|Si | + |Sj |) =
O(n2 L logk L) time, where L = maxi |Si |. An interesting question is, can we
solve this problem asymptotically faster?
We present a simple result showing that the existence of a purely combi-
natorial algorithm, which is “significantly” better than the above approach is
highly unlikely in the general setting. In other words, we prove a conditional
lower bound, showing that the bound is tight within poly-logarithmic factors.
This essentially shows the hardness of Problem 1, as it is at least as hard as its
existential version. Our result is based on a simple reduction from the boolean
matrix multiplication (BMM) problem. This hard instance is simulated via care-
ful tuning of the parameters – specifically, when L approaches the number of
sequences and φ = Θ(log n). However, this hardness result does not contradict
the possibility of an O(N + occ) run-time algorithm for Problem 1, because occ
can be as large as Θ(n2 L2 ).
In order to solve such problems in practice, the following seed-and-extend type
filtering approaches are often employed (see [13] for an example). The underline
principle is: if two sequences have a k-mismatch common substring of length ≥ φ,
then they must have an exact common substring of length at least τ =  k+1 φ
.
Therefore, using some fast hashing technique, all pairs of sequences that have a
τ -length common substring are identified. Then, by exhaustively checking all such
candidate pairs, the final output is generated. Clearly, such an algorithm cannot
provide any run time guarantees and often times the candidate pairs generated
can be overwhelmingly larger than the final output size. In contrast to this, we
develop an O(N ) space and O(N logk N +occ) expected run time algorithm, where
occ is the output size. Additionally, we present the results of some preliminary
experiments, in order to demonstrate the effectiveness of our algorithm.
An Efficient Algorithm for Finding All Pairs 5

2 Notation and Preliminaries


Let Σ be the alphabet for all sequences in D. Throughout the paper, both |Σ| and
k are assumed to be constants. Let T = S1 $1 S2 $2 . . . Sn $n be the concatenation
of all sequences in D, separated by special characters $1 , $2 , . . . , $n . Here each
$i is a unique special symbol and is lexicographically larger than all characters
in Σ. Clearly, there exists a one to one mapping between the positions in T
(except the $i positions) and the positions in the sequences in D. We use lcp(·, ·)
to denote the longest common prefix of two input strings and lcpk (·, ·) to denote
their longest common prefix while permitting at most k mismatches. We now
briefly review some standard data structures that will be used in our algorithms.

2.1 Suffix Trees, Suffix Arrays and LCP Data Structures

The generalized suffix tree of D (equivalently, the suffix tree of T), denoted by
GST, is a lexicographic arrangement of all suffixes of T as a compact trie [11,15].
The GST consists of |T| leaves, and at most (|T| − 1) internal nodes all of which
have at least two child nodes each. The edges are labeled with substrings of T.
Let path of u refer to the concatenation of edge labels on the path from root
to node u, denoted by path(u). If u is a leaf node, then its path corresponds
to a unique suffix of T (equivalently a unique suffix of a unique sequence in D)
and vice versa. For any node, node-depth is the number of its ancestors and
string-depth is the length of its path.
The suffix array [10], SA, is such that SA[i] is the starting position of the suffix
corresponding to the ith left most leaf in the suffix tree of T, i.e., the starting
position of the ith lexicographically smallest suffix of T. The inverse suffix array
ISA is such that ISA[j] = i, if SA[i] = j. The Longest Common Prefix array LCP
is defined as, for 1 ≤ i < |T|

LCP[i] = |lcp(TSA[i] , TSA[i+1] )|

In other words, LCP[i] is the string depth of the lowest common ancestor of
the ith and (i + 1)th leaves in the suffix tree. There exist optimal sequential
algorithms for constructing all these data structures in O(|T|) space and time
[6,8,9,14].
All operations on GST required for our purpose can be simulated using
SA, ISA, LCP array, and a range minimum query (RMQ) data structure over
the LCP array [2]. A node u in GST can be uniquely represented by an interval
[sp(u), ep(u)], the range corresponding to the leaves in its subtree. The string
depth of u is the minimum value in LCP[sp(u), ep(u) − 1] (can be computed
in constant time using an RMQ). Similarly, the longest common prefix of any
two suffixes can also be computed in O(1) time. Finally, the k-mismatch longest
common prefix of any two suffixes can be computed in O(k) time as follows: let
l = |lcp(T[x..], T[y..])|, then for any k ≥ 1, |lcpk (T[x..], T[y..])| is also l if either of
T[x+l], T[y+l] is a $i symbol, else it is l+1+|lcpk−1 (T[(x+l+1)..], T[(y+l+1)..])|.
6 S.V. Thankachan et al.

2.2 Linear Time Sorting of Integers and Strings


Our algorithm relies heavily on sorting of integers and strings (specifically, suf-
fixes). The integers are always within the range [1, N ]. Therefore, linear time
sorting (via radix sort) is possible when the number of integers to be sorted is
sufficiently large (say ≥ N  for some constant  > 0). In order to achieve linear
sorting complexity for even smaller input sizes, we use the following strategy:
combine multiple small collections such that the total size becomes sufficiently
large. Assign a unique integer id in [1, N ] to each small collection, and with each
element e within a small collection with id i, associate the key i · N + e. We can
now apply radix sort and sort all elements within the combined collection w.r.t.
their associated key. An additional scanning step suffices to separate all small
collections with their elements sorted. The strings to be sorted in our algorithms
are always suffixes of T. Using GST, each suffix can be mapped to an integer
in [1, N ] that corresponds to its lexicographic rank in constant time. Therefore,
any further suffix sorting (or sparse suffix sorting) can also be reduced to linear
time integer sorting.

3 Hardness Result
3.1 Boolean Matrix Multiplication (BMM)
The input to BMM problem consists of two n × n boolean matrices A and B
with all entries either 0 or 1. The task is to compute their boolean product C.
Specifically, let Ai,j denotes the entry corresponding to ith row and jth column
of matrix A, where 0 ≤ i, j < n, then for all i, j, compute


n−1
Ci,j = (Ai,k ∧ Bk,j ). (1)
k=0

Here ∨ represents the logical OR and ∧ represents the logical AND opera-
tions. The time complexity of the fastest matrix multiplication algorithm is
O(n2.372873 ) [16]. The result is obtained via algebraic techniques like Strassen’s
matrix multiplication algorithm. However, the fastest combinatorial algorithm is
better than the standard cubic algorithm by only a poly-logarithmic factor [17].
As matrix multiplication is a highly studied problem over decades, any truly
sub-cubic algorithm for matrix multiplication which is purely combinatorial is
highly unlikely, and will be a breakthrough result.

3.2 Reduction
In this section, we prove the following result.

Theorem 1. If there exists an f (n, L) time algorithm for the existential version
of Problem 1, then there exists an O(f (2n, O(n log n))) + O(n2 ) time algorithm
for the BMM problem.
An Efficient Algorithm for Finding All Pairs 7

This implies BMM can be solved in sub-cubic time via purely combinatorial
techniques if f (n, L) is O(n2− L) or O(n2 L1− ) for any constant  > 0. As the
former is less likely, the later is also less likely. In other words, O(n2 L) bound is
tight under BMM hardness assumption. We now proceed to show the reduction.
We create n sequences X1 , X2 , . . . , Xn from A and n sequences Y1 , Y2 , . . . , Yn
from B. Strings are of length at most L = nlog n + n − 1 over an alphabet
Σ = {0, 1, $, #}. The construction is the following.

– For 1 ≤ i ≤ n, let Ui = {k | Ai,k = 1}. Then, for each Ui , create a correspond-


ing string Xi as follows: encode each element in Ui in binary in log n bits,
append it with $ symbol, then concatenate all of them. For example, if n = 4
and Ui = {0, 2}, then Xi = 00$10$.
– For 1 ≤ i ≤ n, let Vj = {k | Bk,j = 1}. Then, for each Vj , create a corre-
sponding string Yj as follows: encode each element in Vj in binary in log n
bits, append it with # symbol, then concatenate all of them. For example, if
n = 4 and Vj = {1, 2}, then Yj = 01#10#.
– The database D is the set of all Xi ’s and Yj ’s.

Lemma 1. The entry Ci,j = 1, iff Ui ∩ Vj is not empty. The set Ui ∩ Vj is not


empty iff Xi and Yj have a common substring of length log n.
Using this result, we can compute the boolean product of A and B from the out-
put of the existential version of Problem 1 plus additional O(n2 ) time. Therefore,
BMM can be solved in f (2n, O(n log n)) + O(n2 ) time.

4 Our Algorithm for k-Mismatch Maximal Common


Substrings
We present an O(N logk N + occ) expected time algorithm. Recall that each
position x in the concatenated text T corresponds to a unique position in a
unique sequence Sd ∈ D, therefore we denote the sequence identifier d by seq(x).
Each output can be represented as a pair (x, y) of positions in T, where
1. seq(x) = seq(y)
2. T[x − 1] = T[y − 1]
3. |lcpk (T[x..], T[y..])| ≥ φ.

4.1 The Exact Match Case


Without mismatches, the problem can be easily solved in optimal O(N + occ)
worst case time. First create the GST, then identify all nodes whose string depth
is at least φ, such that the string depth of their parent nodes is at most (φ−1). Such
nodes are termed as marked nodes. Clearly, a pair of suffixes satisfies condition (3)
iff their corresponding leaves are under the same marked node. This allows us to
process the suffixes under each marked node w independently as follows: let Suff w
denotes the set of starting positions of the suffixes of T corresponding to the leaves
in the subtree of w. That is, Suff w = {SA[j] | sp(w) ≤ j ≤ ep(w)}. Then,
8 S.V. Thankachan et al.

1. Partition Suff w into (at most) Σ + 1 buckets, such that for each σ ∈ Σ,
there is a unique bucket containing all suffixes with previous character σ.
All suffixes with their previous character is a $ symbol are put together in a
special bucket. Note that a pair (x, y) is an answer only if both x are y are
not in the same bucket, or if both of them are in the special bucket.
2. Sort all suffixes w.r.t the identifier of the corresponding sequence (i.e., seq(·)).
Therefore, within each bucket, all suffixes from the same sequence appear
contiguously.
3. For each x, report all answers of the from (x, ·) as follows: scan every bucket,
except the bucket in which x belongs to, unless x is in the special bucket.
Then output (x, y) as an answer, where y is not an entry in the contiguous
chunk of all suffixes from seq(x).

The construction of GST and Step (1) takes O(N ) time. Step (2) over all
marked nodes can also be implemented in O(N ) time via integer sorting (refer
to Sect. 2.2). By noting down the sizes of each chunk during Step (2), we can
implement Step (3) in time proportional to the sizes of input and output. By
combining all, the total time complexity is O(N · |Σ| + occ), i.e., O(N + occ)
under constant alphabet size assumption.

4.2 The k-Mismatch Case


Approximate matching problems are generally harder, because the standard data
structures such as suffix trees and suffix arrays may not apply directly. There-
fore, we follow a novel approach, where we transform the approximate matching
problem over exact strings into an exact matching problem over inexact copies
of strings, which we call as modified suffixes.

Definition 1. Let # be a special symbol not in Σ. A k-modified suffix is a suffix


of T with its k characters replaced by #.

Let Δ be a set of positions. Then, TΔ [x..] denotes the |Δ|-modified suffix


obtained by replacing |Δ| positions in the suffix T[x..] as specified by Δ. For
example, let T = aaccgattcaa, Δ = {2, 4}, then T[5..] = gattcaa and TΔ [5..] =
g#t#caa.
Our algorithm consists of two main phases. In the first phase, we create a
collection of sets of k-modified suffixes. In the second phase, we independently
process each set constructed in the first phase and extract the answers. The
first phase takes O(N H k ) time, where as the second phase takes O(N H k + occ)
time. Here H is the height of GST. It is known that the expected value of H is
O(log N ) [3]. Therefore, by combining the time complexities of both phases with
H replaced by O(log N ), we obtain the expected run time as claimed. We now
describe these phases in detail.

Details of Phase-1. This phase is recursive (with levels of recursion starting


from 0 up to k), such that at each level h > 0, we create a collection of sets of
An Efficient Algorithm for Finding All Pairs 9

h-modified suffixes (denoted by C1h , C2h , . . . ) from the sets in the previous level.
At level 0, we have only a single set C01 , the set of all suffixes of T. See Fig. 1 for
an illustration. To generate the sets at level h, we take each set Cgh−1 at level
(h − 1) and do the following:

– Create a compact trie of all strings in Cgh−1


– For each internal node w in the trie, create a set consisting of the strings
corresponding to the leaves in the subtree of w, but with their (l + 1)th
character replaced by #. Here l is the string depth of w. Those strings with
their (l + 1)-th character is a $i symbol are not included.

level 0 C10

level 1 C11 C21 . . . . . .


.
.
.
level (h − 1)
. . h−1
Cg−1 Cgh−1 h−1
Cg+1 . .

level (h)
. . Cfh−1 Cfh Cfh+1 . .

level k C1k C2k C3k C4k . . . .

Fig. 1. The sets . . . , Cfh−1 , Cfh , Cfh+1 . . . of h-modified suffixes are generated from the
set Cgh−1 .

From our construction procedure, the following properties can be easily


verified.

Property 1. All modified suffixes within the same set (at any level) have #
symbols at the same positions and share a common prefix at least until the last
occurrence of #.

Property 2. For any pair (x, y) of positions, there will be exactly one set
at level k, such that it contain k-modified suffixes of T[x..] and T[y..] with #
symbols at the first k positions in which they differ. Therefore, the lcp of those
k-modified suffixes is equal to the lcpk of T[x..] and T[y..].

We have the following result about the sizes of these sets.


Lemma 2. No set is of size more than N and the sum of sizes of all sets at a
particular level h is ≤ N × H k , where H is the height of GST.
10 S.V. Thankachan et al.

Proof. The first statement follows easily from our construction procedure and the
second statement can be proved via induction. Let Sh be the sum of sizes of all
sets at level h. Clearly, the base case, S0 = |C10 | = N , is true. The sum of sizes of
sets at level h generated from Cgh−1 is at most |Cgh−1 |× the height of the compact
trie over the strings in Cgh−1 . The height of the compact trie is ≤ H, because if we
remove the common prefix of all strings in Cgh−1 , they are essentially suffixes of
T. By putting these together, we have Sh ≤ Sh−1 · H ≤ Sh−2 · H 2 ≤ · · · ≤ N H k .
Space and Time Analysis: We now show that Phase-1 can be implemented in
O(N ) space and O(N logk N ) time in expectation. Consider the step where we
generate sets from Cgh−1 . The lexicographic ordering between any two (h − 1)-
modified suffixes in Cgh−1 can be determined in constant time. i.e., by simply
checking the lexicographic ordering between those suffixes obtained by deleting
their common prefix up to the last occurrence of #. Therefore, suffixes can
be sorted using any comparison based sorting algorithm. After sorting, the lcp
between two successive strings can be computed in constant time. Using the
sorted order and lcp values, we can construct the compact trie using standard
techniques in the construction of suffix tree [4]. Finally, the new sets can be
generated in time proportional to their size. In summary, the time complexity
for a particular Cgh−1 is O(|Cgh−1 |(log N + H)). Overall time complexity is
 
Cfk + (H + log n) Cfh = O(N (log N + H)k−1 H)
h≤k f h<k f

By replacing H by O(log N ), we bound the expect run time of Phase-1 by


O(N logk N ).
We generate the sets in pre-order of the corresponding node in the recursion
tree. As soon as a set (at level k) is generated, we immediately pass it to Phase-2,
extract the necessary information and discard it from the working space. Also,
any set at level h < k is also deleted after all k-level sets in its subtree are
processed. This way, at any point of time in the execution of the algorithm, we
need to maintain only k sets, corresponding to the sets in a root to leaf path in
the recursion tree. Since the size of each set is at most N (Lemma 2), we can
bound the working space also by O(N ), assuming k = O(1).

Details of Phase-2. In this phase, we seek to process each set Cfk created by
Phase-1 independently and generate the answers in time linear to the total size of
all sets and output. i.e., O(N H k +occ). We first present a simple O((N +occ)H k )
time approach. Following are the key steps.

1. Create a compact trie over all k-modified suffixes in Cfk . Then identify the
marked nodes as before. Recall that a marked node has a string depth ≥ φ,
where as the string depth of its parent is < φ.
2. Let Δ be the set of k positions corresponding to modifications in the k-
modified suffixes in Cfk . Clearly, if the leaves corresponding to two modified
suffixes (say TΔ [x..] and TΔ [y..]) are in the same subtree of a marked node,
An Efficient Algorithm for Finding All Pairs 11

then their lcpk is ≥ φ. If seq(x) = seq(y) and T[x − 1] = T[y − 1], then report
(x, y) as an answer.
The trie can be created in time linear to the size of Cfk . Note that the key step
in the creation of a trie is the sorting of k-modified suffixes. To do it efficiently,
we map each k-modified suffix to the lexicographic rank of the suffix obtained
by removing all characters (from left) until its last # symbol. Using this as the
key, the k-modified suffixes can be sorted via integer sorting (refer to Sect. 2.2).
The second step of extracting answers can also be implemented using the exact
same procedure described in Sect. 4.1. However, the problem with this approach
is that, a pair (x, y) can get reported more than once, although only once per
set. In the worst case, an answer can get reported H k times. The resulting time
complexity is therefore O((N + occ)H k ).

Improving the Run Time Complexity. To achieve the claimed O(N H k +


occ) run time, we need to ensure that each output (x, y) is reported exactly
once. For this, we explore Property 2 as follows: while processing a pair of two
k-modified suffixes TΔ [x..] and TΔ [y..] under the subtree of some marked node,
report (x, y) as an answer iff
1. lcp(TΔ [x..], TΔ [y..]) = lcpk (T[x..], T[y..]).
2. seq(x) = seq(y)
3. T[x − 1] = T[y − 1]
From Property 2, for a pair (x, y), there will be only one pair of k-modified
suffixes satisfying this condition (1). The following is unique to that pair: T[x +
l − 1] = T[y + l − 1] for all l ∈ Δ. Therefore, the task can be executed efficiently
by processing the set of k-modified suffixes in the subtree of each marked node
w as follows:
1. Partition them into (at most) Σ + 1 buckets based on the previous character
as in Sect. 4.1.
2. Partition the k-modified suffixes in each bucket into (at most) |Σ|k sub-
buckets based on the sequence of k characters that were originally at the
positions in Δ. Each sub-bucket is therefore associated with a unique string
of length (1 + k): the (previous) character corresponding to the bucket in
which it belongs to, followed by the sequence of k characters at the positions
in Δ.
3. Within each sub-bucket, sort the k-modified suffixes based on the identifier
of the sequence to which it belongs.
4. Finally, for each TΔ [x..], we visit each sub-bucket and find answers of the
form (x, ·) as follows: let c0 , c1 , c2 , . . . , ck be the sequence of (1 + k) characters
corresponding to a sub-bucket. If c0 = T[x − 1] or T[x − 1] is some $i symbol
and ct = T[x + t − 1] for t = 1, 2, . . . k, then for all entries TΔ [y..] in the
sub-bucket with seq(x) = seq(y), report (x, y) as an answer. Notice that the
entries within a sub-bucket are sorted according to the sequence identifier.
Therefore, all entries with seq(x) = seq(y) comes together as a contiguous
chunk, which can be easily skipped.
12 S.V. Thankachan et al.

Analysis: The overall time for implementing the first three steps is O(N H k )
and final step takes O(N H k |Σ|k + occ) time. Therefore, total time complexity
is O(N H k + occ), assuming k and |Σ| are constants.

Theorem 2. Problem 1 can be solved in O(N ) space and O(N logk N + occ)
expected time, assuming k and |Σ| are constants.

Note: If the hamming distance between two reads is < k, then our algorithm
will not output them as an answer. To capture such answers, we shall run the
algorithm for all numbers of mismatches starting from 0 up to k. The run-time
remains the same.

5 Preliminary Experiments

We have implemented our algorithm using the C++11 standard. As noted ear-
lier, we use SA, LCP array, and RMQ data structures to simulate all the opera-
tions on the suffix trees. We construct the suffix array for the set of the sequences
D using the libdivsufsort library [12]. We build ISA corresponding to SA by a
single pass over SA. Construction of LCP array and RMQ tables are based on the
implementations in the SDSL library [5]. For the construction of the LCP and
RMQ tables, we use Kasai et al.’s algorithm [7] and Bender-Farach’s algorithm
[2], respectively. Although the SDSL library supports bit compression techniques
to reduce the size of the tables and arrays in exchange for relatively longer time
to answer queries, we do not compress these data structures. Instead, we use
32-bit integers both for indices and prefix lengths.
Note that our algorithm does not require the internal nodes to be processed
in any specific order. Therefore, we do not need to construct parent-child links
or suffix links, which are typically present in a suffix tree. We only need a list
of all the internal nodes, their string depths and the corresponding suffixes.
We represent the internal node u by the tuple (sp(u), ep(u), string-depth), where
sp(u) and ep(u) are the left and right indices in SA anchoring the range of suffixes
having path(u) as their prefix.
We conducted our preliminary experiments on a system with Intel Xeon E5-
2660 CPU having 10 cores and 64 GB RAM. We created a 2,049,118 read input
derived from the RNA-Seq dataset with accession number SRX011546, from the
NCBI SRA repository. This dataset corresponds to an Illumina sequencing of
Human CD4 T cells, with 45bp reads. Table 1 shows the run-time results for this
dataset for different numbers of mismatches allowed k. The length threshold φ
is chosen as 25.
The run-time complexity for generating all valid occurrences increases expo-
nentially with k. As shown in Table 1, this behavior is observed for smaller
values of k. However, for larger values of k, the run-time escalation is signifi-
cantly smaller than the exponential dependence predicted by the theory. This is
probably because as the matches extend towards the end of the sequences, the
number of compact tries that need to be generated is limited.
An Efficient Algorithm for Finding All Pairs 13

Table 1. Time to generate k-mismatch pairs with φ = 25.

Hamming distance (k) Run time (seconds)


0 21.44
1 931.33
2 3244.99
3 4584.36
4 4801.55
5 4920.68

For large input sizes, the value of k for which the algorithm can be run
in practice will become limited. Our method can be supplanted with seed and
extend heuristics, where the seed itself can be based on approximate sequence
matching with a limited value of k. Because our algorithm can process internal
nodes of the GST in any order, it allows easy parallelization on multiple cores
sharing the same shared memory. This can be used to further increase the scale
of the datasets or the values of k for which the algorithm is practical.

Acknowledgment. This research is supported in part by the U.S. National Science


Foundation under IIS-1416259.

References
1. Aluru, S., Apostolico, A., Thankachan, S.V.: Efficient alignment free sequence
comparison with bounded mismatches. In: Przytycka, T.M. (ed.) RECOMB 2015.
LNCS, vol. 9029, pp. 1–12. Springer, Heidelberg (2015)
2. Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H.,
Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg
(2000)
3. Devroye, L., Szpankowski, W., Rais, B.: A note on the height of suffix trees. SIAM
J. Comput. 21(1), 48–53 (1992)
4. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity
of suffix tree construction. J. ACM 47(6), 987–1011 (2000)
5. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play
with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA
2014. LNCS, vol. 8504, pp. 326–337. Springer, Heidelberg (2014)
6. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In:
Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003.
LNCS, vol. 2719, pp. 943–955. Springer, Heidelberg (2003)
7. Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-
common-prefix computation in suffix arrays and its applications. In: Amir, A.,
Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidel-
berg (2001)
8. Kim, D.-K., Sim, J.S., Park, H.-J., Park, K.: Linear-time construction of suffix
arrays. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS,
vol. 2676, pp. 186–199. Springer, Heidelberg (2003)
14 S.V. Thankachan et al.

9. Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: Baeza-
Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp.
200–210. Springer, Heidelberg (2003)
10. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches.
SIAM J. Comput. 22(5), 935–948 (1993)
11. Edward, M.: McCreight.: a space-economical suffix tree construction algorithm. J.
ACM 23(2), 262–272 (1976)
12. Mori, Y.: Libdivsufsort: a lightweight suffix array construction library, pp. 1–12
(2003). https://github.com/y-256/libdivsufsort
13. Peterlongo, P., Pisanti, N., Boyer, F., Lago, A.P.D., Sagot, M.-F.: Lossless filter for
multiple repetitions with hamming distance. J. Discrete Algorithms 6(3), 497–509
(2008)
14. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260
(1995)
15. Weiner, P.: Linear pattern matching algorithms. In: Switching and Automata The-
ory, pp. 1–11 (1973)
16. Williams, V.V.: Multiplying matrices faster than coppersmith-winograd. In: Pro-
ceedings of the 44th Symposium on Theory of Computing Conference (STOC),
New York, NY, USA, pp. 887–898, 19–22 May 2012 (2012)
17. Yu, H.: An improved combinatorial algorithm for boolean matrix multiplication.
In: Halldórsson, M.M., Iwama, K., Kobayashi, N., Speckmann, B. (eds.) ICALP
2015. LNCS, vol. 9134, pp. 1094–1105. Springer, Heidelberg (2015)
Poisson-Markov Mixture Model and Parallel
Algorithm for Binning Massive
and Heterogenous DNA Sequencing Reads

Lu Wang, Dongxiao Zhu(B) , Yan Li, and Ming Dong

Department of Computer Science, Wayne State University,


Detroit, MI 48202, USA
{lu.wang3,dzhu,rock liyan,mdong}@wayne.edu

Abstract. A major computational challenge in analyzing metagenomics


sequencing reads is to identify unknown sources of massive and hetero-
geneous short DNA reads. A promising approach is to efficiently and
sufficiently extract and exploit sequence features, i.e., k-mers, to bin the
reads according to their sources. Shorter k-mers may capture base com-
position information while longer k-mers may represent reads abundance
information. We present a novel Poisson-Markov mixture Model (PMM)
to systematically integrate the information in both long and short
k-mers and develop a parallel algorithm for improving both reads bin-
ning performance and running time. We compare the performance and
running time of our PMM approach with selected competing approaches
using simulated data sets, and we also demonstrate the utility of our
PMM approach using a time course metagenomics data set. The proba-
bilistic modeling framework is sufficiently flexible and general to solve
a wide range of supervised and unsupervised learning problems in
metagenomics.

Keywords: Probabilistic clustering · Expectation-Maximization algo-


rithm · Metagenomics · Next-generation sequencing (NGS) · Parallel
algorithm

1 Introduction

Metagenomics sequencing reads are typically sequenced from a large number of


heterogeneous sources with diverse abundances. There are two related yet dis-
tinct computational problems. The first is unsupervised binning of the reads to
identify unknown sources. Reads from the same sources are more similar com-
pared to the rest and the sources can later be labeled as Operational Taxonomic
Units (OTU’s). The other is supervised classification of the reads to assign each
read to a labeled known source, such as a taxonomic or a patient treatment/risk
group. Here we will focus on the more challenging reads binning problem.
Reads binning has posed the following unprecedented algorithmic and com-
putational challenges, ranked by decreasing priority, to bioinformatics research

c Springer International Publishing Switzerland 2016
A. Bourgeois et al. (Eds.): ISBRA 2016, LNBI 9683, pp. 15–26, 2016.
DOI: 10.1007/978-3-319-38782-6 2
16 L. Wang et al.

community: (1) How to sufficiently and robustly extract discriminating features


from the reads? This is essentially a k-mers (sequencing feature) counting and
selection problem; (2) How to account for the differential abundances across
bins? Some sources may generate more reads whereas others may generate less;
(3) How to filter out the inseparable reads? Some reads contain useful feature
information, but others don’t. The latter can come from the common sequences
shared among the sources and was referred to as inseparable reads, and (4) How
to efficiently process ultra-high throughput (hundreds of millions), very short
(≈ 100 bp) reads?
A key to overcome the first challenge is to sufficiently and robustly extract
sequence features, i.e., k-mers (substring of length k), from NGS reads since it
is the only information available from DNA sequencing data. Earlier approaches
usually align the entire reads to non-redundant coding sequences (nr) and/or
functional groups based on sequence similarity, usually via a BLASTX search. In
metagenomics, familiar examples include CARMA [4], MEGAN [6] and Phymm
[1]. CARMA attempts to assign short reads to known Pfam domains (struc-
tural components conserved across multiple proteins) and protein families [4].
MEGAN classifies reads to the Lowest Common Ancestor (LCA) based on mul-
tiple BLASTX score hits [6]. These dynamic programming approaches use infor-
mation in the long k-mers to construct optimal read sequence alignment result.
Other approaches used information in the shorter k-mers. Phymm used inter-
polated Markov models (IMMs) [18] to characterize variable-length short k-mers
that are typical of a phylogenetic grouping. Short k-mers, such as oligonucleotide
[14], dinucleotide [7] and tetranucleotide counts [16,19], were used as the discrim-
inative features to capture the information on base composition heterogeneity,
perhaps in deference to the long sequencing contigs generated from the earlier
sequencing technology. In particular, our recent work [16] used short k-mers in
a mixture of Markov chains to calculate the probability of each read assigned
to each bin. Presumably, reads binning approaches using both short k-mers and
long k-mers as features are more desirable.
An effective approach to overcome the second challenge is to explicitly capture
abundance information. For example, AbundanceBin extracted and used feature
information from long k-mers of the reads, which directly yield read abundance
information [21], to fit a mixture of Poisson models. Each component models the
abundance of an individual bin. Similarly, an effective approach to overcome the
third challenge is to develop non-mutually exclusive probabilistic clustering meth-
ods, where each read can simultaneously fall into different clusters with different
posterior probabilities. A read with similar posterior probabilities across all the
bins can be considered as non-informative, thus inseparable reads.
Due to the increasing degree of problem complexity, recent works focused
more on developing analytic workflows, which exploit the information in short
and/or long k-mers and solve the problem in a heuristic manner, e.g., [9,20].
However, the short and long k-mer reads features were not used systematically,
i.e., the performance can be compromised by the choices of user-defined cut-
off’s and the heuristic k-means type algorithms. Thus, it is subject to high
variance. Moreover, the deterministic reads partitioning significantly undermines
Another random document with
no related content on Scribd:
Carolina convention, the same Iredell, after pointing out that the
American concept of the relation of citizen to all governments had
become basic American law, contrasts that fact with the fundamental
law of Great Britain where “Magna Charta itself is no constitution, but
a solemn instrument ascertaining certain rights of individuals, by the
legislature for the time being; and every article of which the
legislature may at any time alter.” (4 Ell. Deb. 148.)
In the Pennsylvania convention, on December I, 1787, one of the
most distinguished lawyers of that generation made a memorable
speech, expressing the universal knowledge that the American
concept had taken forever the place of the Tory concept in
fundamental American law. We commend a careful study of that
speech to those of our public leaders and “constitutional” lawyers,
who for five years have been acting on the assumption that the Tory
concept has again become our fundamental American law. We
average Americans, after living with those earlier Americans, are not
surprised to listen to the statements of Wilson. “The secret is now
disclosed, and it is discovered to be a dread, that the boasted state
sovereignties will, under this system, be disrobed of part of their
power.... Upon what principle is it contended that the sovereign
power resides in the state governments?... The proposed system
sets out with a declaration that its existence depends upon the
supreme authority of the people alone.... When the principle is once
settled that the people are the source of authority, the consequence
is, that they may take from the subordinate governments powers
which they have hitherto trusted them, and place those powers in the
general government, if it is thought that there they will be productive
of more good. They can distribute one portion of power to the more
contracted circle, called state governments; they can also furnish
another proportion to the government of the United States. Who will
undertake to say, as a state officer, that the people may not give to
the general government what powers, and for what purposes, they
please? How comes it, sir, that these state governments dictate to
their superiors—to the majesty of the people?” (2 Ell. Deb. 443.)
We average Americans, legally bound (as American citizens) by
no command (interfering with our human freedom) except from our
only legislature at Washington and then only in those matters in
which we ourselves, the citizens of America, have directly given it
power to command us, now intend insistently to ask all our
governments, the supreme one at Washington and the subordinate
ones in the states of which we are also citizens, exactly the same
question which Wilson asked.
Daniel Webster asked almost exactly the same question of Hayne
and history does not record any answer deemed satisfactory by the
American people. Webster believed implicitly in the concept of
American law stated by those who made our Constitution. Like them,
and unlike our “constitutional” lawyers, he knew that the Tory
concept of the relation of men to their government had disappeared
from American basic law.
“This leads us to inquire into the origin of this government, and the
source of its power. Whose agent is it? Is it the creature of the state
legislatures, or the creature of the people?... It is, sir, the people’s
constitution, the people’s government—made for the people, made
by the people, and answerable to the people. The people of the
United States have declared that this Constitution shall be the
supreme law. We must either admit the proposition, or dispute their
authority. The states are, unquestionably, sovereign, so far as their
sovereignty is not affected by this supreme law. But the state
legislatures, as political bodies, however sovereign, are yet not
sovereign over the people.... The national government possesses
those powers which it can be shown the people have conferred on it,
and no more.... We are here to administer a Constitution emanating
immediately from the people, and trusted by them to our
administration.... This government, sir, is the independent offspring
of the popular will. It is not the creature of state legislatures; nay,
more, if the whole truth must be told, the people brought it into
existence, established it, and have hitherto supported it, for the very
purpose, amongst others, of imposing certain salutary restraints on
state sovereignties.... The people, then, sir, erected this government.
They gave it a constitution, and in that constitution they have
enumerated the powers which they bestow upon it.... Sir, the very
chief end, the main design for which the whole constitution was
framed and adopted, was to establish a government that should not
be obliged to act through state agency, depend on state opinion and
state discretion.... If anything be found in the national constitution,
either by original provisions, or subsequent interpretation, which
ought not to be in it, the people know how to get rid of it. If any
construction be established, unacceptable to them, so as to become
practically a part of the constitution, they will amend it at their own
sovereign pleasure. But while the people choose to maintain it as it
is—while they are satisfied with it, and refuse to change it—who has
given, or who can give, to the state legislatures a right to alter it,
either by interference, construction, or otherwise?... Sir, the
people have not trusted their safety, in regard to the general
constitution, to these hands. They have required other security, and
taken other bonds.” (From Webster’s reply to Hayne, U. S. Senate,
January, 1830. 4 Ell. Deb. 498 et seq.)
We average Americans, now educated in the experience of the
average American from 1776 to the beginning of 1787, find much
merit and comfort in Webster’s understanding of basic American law.
He had a reasoned and firm conviction that Americans really are
citizens and not subjects. His conviction, in that respect, while
opposed to the convictions of our leaders and “constitutional”
lawyers, has seemed to us quite in accord with the convictions of
earlier leaders such as Iredell and Wilson and the others, and also
with the decisions of our Supreme Court.
Briefly stated, it has become quite clear to us that the American
people, from 1776 to 1787, were fixed in their determination to make
our basic American law what the conviction of Webster and the
leaders of every generation prior to our own knew it to be. Let us go
back, therefore, to the Americans in the Philadelphia convention of
1787, who worded the Constitution which is the supreme law of
America, and ascertain how their knowledge of fundamental
American law dictated the wording of their proposed Seventh Article.
CHAPTER VIII
PHILADELPHIA ANSWERS “CONVENTIONS, NOT
LEGISLATURES”

We recall how clearly the Americans at Philadelphia, in 1787,


knew that any grant of national power to interfere with the freedom of
individuals was the constitution of government. We recall the bitter
conflict of opinion, threatening the destruction of the assembly, over
the manner of choosing the members of the legislature to exercise
whatever powers of that kind the citizens of America might grant. We
recall the great opposition to the proposal of a grant of any power of
that kind and to the particular proposal of each of the enumerated
powers of that kind, all embodied in the First Article.
We have thus come to know with certainty that the minds of the
Americans at Philadelphia, during those strenuous four months,
were concentrated mainly upon a proposal to grant some national
power to interfere with the human freedom of all Americans. In other
words, we have their knowledge that their proposed First Article, by
reason of its grants of such power, would constitute a new nation
and government of men, if those grants were validly made by those
competent to make such grants.
Under which circumstances, we realize that it became necessary
for them to make a great legal decision, in the construction of basic
American law, and, before making that decision, which was
compelled to be the result of judgment and not of will, accurately to
ascertain one important legal fact. Indeed, their decision was to be
the actual conclusion reached in the effort to ascertain that legal fact.
This was the single question to which they must find the right
answer: “Under our basic American law, can legislatures ever give to
government any power to interfere with the human freedom of men,
or must every government in America obtain its only valid powers of
that kind by direct grant from its own citizens?”
It is easy for us to state that they should have known that the
answer to that question was expressly and authoritatively given in
the Statute of ’76. It was there plainly enacted that every just power
of any government must be derived from the direct grant of those to
be governed by its exercise. Yet our own leaders for the last five
years have not even asked the question, much less known the right
answer.
At Philadelphia, in 1787, they did know it. They had no doubt
whatever about it. We shall see that quickly in our brief review of the
record they made at Philadelphia in ascertaining and deciding, as a
legal necessity, to whom their First Article and its enumerated grants
of national power must be sent and, when we boast of how quickly
we knew the answer, we should admit that we did not know it until
after we had lived again with them through their experience of the
preceding ten or twelve years which had educated them, as it has
just educated us, to that knowledge. Furthermore, many of us
average Americans will be unable to explain, until later herein, why,
during the last five years, our own leaders have not known the right
answer. The Statute of ’76 has not been wholly unknown to them.
The record of the Philadelphia Convention and the ratifying
conventions has not been entirely a closed book to them. The
important and authentic statements of Webster and other leaders of
past generations have been read by many of them. If they did not
understand and know the correct answer, as we now realize they
have not known, let us not withhold from the Americans at
Philadelphia our just tribute of gratitude that they did accurately
know, when it was amazingly important to us that they should know.
When those Americans came to answer that question, there were
facts which might have misled them as other similar facts of lesser
importance have undoubtedly misled our leaders.
In 1776, from that same Philadelphia had gone a suggestion that a
constitution of government, with Articles granting power to
government, be made in each former colony. In 1787, there had
gone from that same Philadelphia a proposal that a constitution of a
general government for America be made, with Articles granting
power to that government. The proposal of 1776 had suggested that
the proposed Articles be made by the people themselves, assembled
in conventions. The proposal of 1777 had suggested that the
proposed Articles be made by the legislative governments of the
states. Both proposals, even as to the makers of the respective
Articles, had been acted upon. All the Articles, although some had
been made by the people themselves and others by legislatures, had
been generally recognized as valid law. Some of the men at
Philadelphia in 1787 had been members of the proposing Second
Continental Congress, when the respective proposals of 1776 and
1777 had gone from Philadelphia. When, in 1787, they were called
upon to find and state, as their legal decision, the correct answer to
their important question, it was necessary for them to ascertain, as
between state “legislatures” and the people themselves, in
“conventions,” which could validly make the Articles which had been
worded and were about to be proposed. It would not, therefore, have
been beyond the pale of our own experience if the earlier proposals
had misled them and they had made the wrong answer to the
question which confronted them. Furthermore, as we have already
noted, although we can little realize the influence of such a fact upon
men seeking the correct legal answer to an important question, their
whole proposal was a new adventure for men on an uncharted sea
of self-government. Under all of which circumstances, let us again
pay them their deserved tribute that they went unerringly to the only
correct answer.
We know that the essence of that answer is expressed in the
Seventh Article proposed from Philadelphia. Only one answer was
possible to Americans of that generation. They had been “subjects”
and had become “citizens.” They knew the vital distinctions between
the two relations to government.
The Convention which framed the Constitution was, indeed,
elected by the state legislatures. But the instrument, when it
came from their hands, was a mere proposal, without
obligation, or pretensions to it. It was reported to the then
existing Congress of the United States, with the request that it
might “be submitted to a convention of delegates, chosen in
each state by the people thereof, under the recommendation
of its legislature, for their assent and ratification.” This mode
of proceeding was adopted; and by the Convention, by
Congress, and by the state legislatures, the instrument was
submitted to the people. They acted upon it in the only
manner in which they can act safely, effectively, and wisely,
on such a subject, by assembling in convention. It is true, they
assembled in their several states; and where else should they
have assembled? No political dreamer was ever wild enough
to think of breaking down the lines which separate the states,
and of compounding the American people into one common
mass. Of consequence, when they [the American people] act,
they act in their states. But the measures they adopt do not,
on that account, cease to be the measures of the people
themselves, or become the measures of the state
governments. From these conventions the Constitution [the
First Article grants of power to interfere with individual
freedom] derives its whole authority. The government
proceeds directly from the people; is “ordained and
established” in the name of the people.... It required not the
affirmance, and could not be negatived, by the state
governments.... To the formation of a league, such as was the
Confederation, the state sovereignties were certainly
competent.
But, when a general government of America was to be given any
national power to interfere with the individual freedom of its citizens,
as in the First Article of 1787 and in the Eighteenth Amendment of
1917,
acting directly on the people, the necessity of referring it to
the people, and of deriving its powers directly from them, was
felt and acknowledged by all. The government of the Union,
then, (whatever may be the influence of this fact on the case,)
is, emphatically, and truly, a government of the people. In form
and in substance it emanates from them. Its powers are
granted by them, and are to be exercised directly on them,
and for their benefit. (Marshall in the Supreme Court,
M’Culloch v. Maryland, 4 Wheat. 316.)
Marshall was one of the Americans who had been at Valley Forge
in 1778, and at other places whose sacrifices made it the basic law
of America that all power over American citizens must be derived by
direct grant from themselves. Later, he was prominent in the Virginia
convention where all Americans in Virginia knew and acted upon this
basic law. These facts qualified him to testify, from the Bench of the
Supreme Court, that all Americans then knew and acknowledged the
binding command of that basic law.
Under such circumstances, it was impossible that the Americans
at Philadelphia should not have known and obeyed that law in the
drafting of their proposed Seventh and Fifth Articles. Both of these
Articles, the Seventh wholly, and the Fifth partly, deal with the then
future grant of national power over the people and its only legal gift
by direct grant from the people themselves, assembled in their
“conventions.” Both Articles name the people of America, by the one
word “conventions.”
That Philadelphia should not have strayed from the legal road
clearly marked by the Statute of ’76 was certain when we recall how
large a part Madison played at Philadelphia, and particularly how he
personally worded and introduced, in the closing hours at
Philadelphia, what we know as its Fifth Article. As to his personal
knowledge of this basic law, we recall his letter of April, 1787, where
he said, “To give the new system its proper energy, it will be
desirable to have it ratified by the authority of the people, and not
merely by that of the legislatures.” And we recall his later words,
when urging Americans to adopt the Constitution with its Fifth and
Seventh Articles, he said of the Seventh, “This Article speaks for
itself. The express authority of the people alone could give due
validity to the Constitution,” to its grants of power over the people in
its First Article. (Fed. No. 43.)
That we may fix firmly in our own minds the knowledge which all
Americans then had, which our leaders never acquired or have
entirely forgotten, let us briefly review what the earlier Americans did
at Philadelphia in obedience to that knowledge of basic American
law.
On May 28, Randolph of Virginia “opened the main business” of
the Convention. He proposed fifteen resolutions embodying the
suggestion of what should be in the different Articles. Resolution
Number 15 was that such Articles should be submitted to
“conventions,” “to be expressly chosen by the people, to consider
and decide thereon.” (5 Ell. Deb. 128.)
The first short debate on this Resolution took place on June 5. In it
Madison stated that he “thought this provision essential. The Articles
of Confederation themselves were defective in this respect, resting,
in many of the states, on the legislative sanction only.” The resolution
was then postponed for further consideration. On June 12, “The
question was taken on the 15th Resolution, to wit, referring the new
system to the people of the United States for ratification. It passed in
the affirmative.” (5 Ell. Deb. 183.) This was all in the Committee of
the Whole.
On June 13, that Committee made their full report, in which the
Randolph Resolution Number 15 was embodied in words as
Resolution Number 19 of the report. On June 16, while the
Convention was again sitting as a Committee of the Whole, the great
struggle was on between the conflicting opinions as to how and in
what proportion should be elected the future legislators who were to
exercise the granted powers over Americans. On that day, the
discussion centered on the relative merits of the Randolph national
proposals and a set of federal Articles amending the existing Federal
Constitution. In supporting Randolph, Wilson of Pennsylvania stated
that “he did not fear that the people would not follow us into a
national government; and it will be a further recommendation of Mr.
Randolph’s plan that it is to be submitted to them, and not to the
legislatures, for ratification.” (5 Ell. Deb. 196.)
On July 23, Resolution Number 19 came up for action.
Remembering how insistent many of the delegates were that the
general government should be kept a purely federal one, it is not
surprising to find Oliver Ellsworth of Connecticut opening the short
debate with a motion that the Constitution “be referred to the
legislatures of the states for ratification.” But it will also be
remembered that the powers to be granted in the new Articles had
not yet been settled. The nationalists in the Convention, intent on
having some national Articles, knew that the proposed ratification
must be by the people themselves, “felt and acknowledged by all” to
be the only competent grantors of national powers.
Colonel Mason of Virginia “considered a reference of the plan to
the authority of the people as one of the most important and
essential of the resolutions. The legislatures have no power to ratify
it. They are the mere creatures of the state constitutions, and cannot
be greater than their creators.... Whither, then, must we resort? To
the people, with whom all power remains that has not been given up
in the constitutions derived from them. It was of great moment that
this doctrine should be cherished, as the basis of free government.”
(5 Ell. Deb. 352.)
Rufus King of Massachusetts, influenced undoubtedly by the error
of thinking that the Convention meant to act within the Articles of
Confederation, was inclined to agree with Ellsworth “that the
legislatures had a competent authority, the acquiescence of the
people of America in the Confederation being equivalent to a formal
ratification by the people.... At the same time, he preferred a
reference to the authority of the people, expressly delegated to
conventions, as the most certain means of obviating all disputes and
doubts concerning the legitimacy of the new Constitution.” (5 Ell.
Deb. 355.)
Madison “thought it clear that the legislatures were incompetent to
the proposed changes. These changes would make essential
inroads on the state constitutions; and it would be a novel and
dangerous doctrine, that a legislature could change the constitution
under which it held its existence.” (5 Ell. Deb. 355.)
Ellsworth’s motion to send to the state legislative governments,
and not to the people themselves, assembled in “conventions,” was
lost by a vote of seven to three. Resolution Number 19, that the new
Articles must be sent to the people themselves was adopted by a
vote of nine to one, Ellsworth and King both voting for it. (5 Ell. Deb.
356.)
This impressive discussion, now continued for over a month of
1787, with its display of accurate knowledge of the distinction
between sending Articles to legislatures and “referring” them to the
people, makes quite amusing what we shall hear later in 1917. It will
come from the counsel of the political organization which dictated
that governments should make the supposed Eighteenth
Amendment. After he kindly tells us that history has proven that
these Americans of 1787 “builded more wisely than they knew,”
meaning “than he knew,” he shall later impart to us the remarkable
information that “the framers in the Constitutional Convention knew
very little, if anything, about referendums.”
The Resolutions, which had now become twenty-three in number,
on July 26, were referred to the Committee of Detail to prepare
Articles in conformity therewith. On August 6, that Committee made
its report of twenty-three worded Articles. In Article XXII was
embodied the requirement that the Constitution should be submitted
“to a convention chosen in each state, under the recommendation of
its legislature, in order to receive the ratification of such convention.”
This provision, the Philadelphia answer and always the only legal
answer to the question as to who can validly grant power to interfere
with individual freedom, was later seen not properly to belong in the
Constitution itself. For which reason, it was taken out of the
Constitution and embodied in a separate Resolution which went with
the Constitution from Philadelphia.
In Article XXI, the first draft of our Article VII, it was provided: “The
ratification of the conventions of —— states shall be sufficient for
organizing this Constitution.” (5 Ell. Deb. 381.)
The month of August was passed in the great debates on the
proposed grants of national power and the other proposed Articles.
When the Convention was drawing to a close on August 30, Articles
XXI and XXII were reached.
Gouverneur Morris of Pennsylvania “moved to strike out of Article
XXI the words, ‘conventions of the,’ after ‘ratification,’ leaving the
states to pursue their own modes of ratification.” Rufus King “thought
that striking out ‘conventions,’ as the requisite mode, was equivalent
to giving up the business altogether.” Madison pointed out that, “The
people were, in fact, the fountain of all power.” The motion of Morris
was beaten. An attempt was made to fill the blank in Article XXI with
the word “thirteen.” “All the states were ‘No’ except Maryland.” The
blank was then filled by the word “nine” the vote being eight to three.
The two articles were then passed, the vote thereon being ten to
one. (5 Ell. Deb. 499-502.)
On September 10, the beginning of the last business week of the
Convention, Gerry of Massachusetts moved to reconsider these two
Articles. The short discussion was not in connection with any matter
in which we are now interested. His motion was lost. The entire set
of worded Articles was then referred to a committee for revising the
style and arrangement of the Articles agreed upon. (5 Ell. Deb. 535.)
On Wednesday, September 12, that Committee reported our
Constitution, with its seven Articles, as we know them except for
some slight changes made during the discussions of the last three or
four days of the Convention. In these seven Articles, the language of
the earlier Article XXII did not appear. As it really was the statement
of the correct legal conclusion of the Convention that its proposed
Articles, because they would grant power to interfere with individual
freedom, must necessarily be made by the people themselves, its
proper place was outside the Constitution itself and in a special
Resolution of the same nature as every Congress resolution
proposing an amendment to that Constitution. That was the view of
the Committee and, on Thursday, September 13, the Committee
reported such special Resolutions, in the very words of the former
Article XXII. “The proceedings on these Resolutions are not given by
Mr. Madison, nor in the Journal of the Federal Convention. In the
Journal of Congress, September 28, 1787, Volume 4, p. 781, they
are stated to have been presented to that body, as having passed in
the Convention on September 17 immediately after the signing of the
Constitution.” (5 Ell. Deb. 602.)
This is the Resolution:
“Resolved, That the preceding Constitution be laid before the
United States in Congress assembled; and that it is the opinion of
this Convention, that it should afterwards be submitted to a
convention of delegates chosen in each state by the people thereof,
under the recommendation of its legislature, for their assent and
ratification; and that each convention, assenting to and ratifying the
same, should give notice thereof to the United States in Congress
assembled.
“Resolved, That it is the opinion of this Convention, that, as soon
as the conventions of nine states shall have ratified this Constitution,
the United States in Congress assembled should fix a day, etc.” (5
Ell. Deb. 541.)
This Resolution is the most authoritative statement of the legal
conclusion reached by these leaders of a people then “better
acquainted with the science of government than any other people in
the world.” The conclusion itself was compelled by accurate
knowledge that the government of “citizens” can validly obtain only
from the citizens themselves, by their direct grant, any power to
interfere with their individual freedom. The expression of that
knowledge, in the Resolution, is, in many respects, one of the most
important recorded legal decisions ever made in America. We
average Americans, educated with those Americans at Philadelphia
through their experience of the years between 1775 and 1787,
cannot misunderstand the meaning and importance of that decision.
Instructed by our review of their actions and their reasoning at
Philadelphia in reaching that conclusion and making that legal
decision, we know, with an accurate certainty, that it was their
declaration to the world and to us that no proposal from Philadelphia
suggested that Americans again resume the relation of “subjects” to
any government or governments.
Our minds impressed with this accurate knowledge that such was
not their purpose, we now prepare to complete our education as
American citizens, not subjects, by reading the Philadelphia story
and language of their Fifth Article, their only other Article which even
partly concerned the future grant of new government power to
interfere with individual American freedom. By reason of our
education, we will then come to the reading of the language of this
Article, as the Americans read it and understood it when they made it
in their “conventions” that followed the proposing convention of
Philadelphia.
Being educated “citizens” and not “subjects,” we ourselves will no
longer, as our leaders have done for five years, mistake the only
correct and legal answer to the indignant outburst of Madison, who
wrote this Fifth Article at Philadelphia. “Was, then, the American
Revolution effected, was the American Confederacy formed, was the
precious blood of thousands spilt, and the hard-earned substance of
millions lavished, not that the people of America should enjoy peace,
liberty, and safety, but that the governments of the individual states,
that particular municipal establishments, might enjoy a certain extent
of power, and be arrayed with certain dignities and attributes of
sovereignty? We have heard of the impious doctrine in the Old
World, that the people were made for kings, not kings for the people.
Is the same doctrine to be revived in the New, in another shape—
that the solid happiness of the people is to be sacrificed to the views
of political institutions of a different form?” (Fed. No. 45.)
The American answer, from the people of America assembled in
the conventions that ratified that Fifth Article, was a clear and
emphatic “No.” The Tory answer of the last five years, from our
leaders and our governments, has been an insistent “Yes.”
No one, however, with any considerable degree of truthfulness,
can assert that there has come from the American people
themselves, during the last five years, any very audible “Yes.” To
whatever extent individual opinions may differ as to the wisdom or
legality of the new constitution of government of men, made entirely
by governments, no unbiased observer has failed to note one
striking fact. By a very extensive number of Americans otherwise
law-abiding, Americans in all classes of society, the new government
edict, the government command to “subjects,” has been greeted with
a respect and obedience strikingly similar to the respect and
obedience with which an earlier generation of Americans received
the Stamp Act and the other government edicts between 1765 and
1776.
When the Americans of that earlier generation were denounced by
the government which had issued those edicts to its “subjects,” one
of the latter, five years before Americans ceased to be “subjects” of
that government, stated: “Is it a time for us to sleep when our free
government is essentially changed, and a new one is forming upon a
quite different system—a government without the least dependence
upon the people?”
It may be but a coincidence that, while our American government
was announcing its recognition of the wide-spread American
disrespect for the new government edict, it is only a few days since
throughout America there resounded many eulogies of the Samuel
Adams, who made that statement in the Boston Gazette of October
7, 1771. In those eulogies, there was paid to him the tribute that he
largely helped to bring about the amazing result of American desire
for individual freedom which culminated in the assembling of the
Americans in the “conventions” which ratified the proposed
Constitution.
We have already sensed that the existence of the supposed
Eighteenth Amendment depends entirely upon an amazing modern
meaning put into the Fifth Article made in those conventions. Let us,
therefore, who are Americans now educated in the experience of the
Americans who assembled in those “conventions,” sit therein with
them and there read the story and the language of the Fifth Article as
they read it when they made it.
CHAPTER IX
THE FIFTH ARTICLE NAMES ONLY
“CONVENTIONS”

It has been the misfortune of our prominent Americans of this


generation that they read the Fifth Article with preconceived notions
of its meaning. To the error of that method of reading it, we average
Americans will not pay the tribute of imitation. We know that its
meaning to those who made it in the “conventions” of the earlier
century is the meaning which it must have as part of the supreme
law of the land. That we may read it as they read it and get its clear
and only possible meaning, as they got it, we shall briefly review the
story of its wording and its proposal at Philadelphia. That Convention
immediately preceded the assembling of the people in their own
“conventions.” In each of their “conventions,” among the people
assembled, were some who had been prominent at Philadelphia,
such as Madison and Randolph and Mason in Virginia, Hamilton in
New York, Wilson in Pennsylvania and the Pinckneys in South
Carolina. Moreover, between the Philadelphia proposal and the
assembling of these conventions, Madison and Hamilton, proposer
and seconder of the Fifth Article at Philadelphia, had been publishing
their famous essays, now collectively known as The Federalist, in
the New York newspapers to explain the Articles worded at
Philadelphia and to urge their adoption. Under which circumstances,
it is clear that, if we want to read and know the meaning of the Fifth
Article as it was understood in those conventions, the Fifth Article
which named those same “conventions,” we must complete our
education by an accurate and brief review of the story of that Article
at Philadelphia. Only in that way shall we average Americans of
today be in the position in which were the Americans who made that
Article.
When we read that story of Philadelphia, in relation to the Fifth
Article, one thing stands out with amazing clarity and importance.
We already know how that Convention, until its last days, was
concentrated upon the hotly debated question of its own proposed
grants of national powers in the First Article. In the light of which
continued concentration, it is not surprising to learn that, until almost
the very last days, the delegates forgot entirely to mention, in their
tentative Fifth Article, the existing and limited ability of state
legislatures to make federal or declaratory Articles, and mentioned
only “conventions” of the people, who alone could or can make
national Articles.
The first suggestion of what we now know as the Fifth Article was
on the second day, May 29, when the Randolph Resolution 13 read
“that provision ought to be made for the amendment of the Articles of
union whensoever it shall seem necessary.” This wording was the
exact language of Resolution 17 of the report of the Committee of
the Whole. It was adopted by the Convention on July 23. Three days
later, with the other Resolutions, it was referred to the Committee of
Detail “to prepare and report the Constitution.” On August 6, this
Committee, in the first draft of our Constitution, reported the
following: “Art. XIX. On the Application of the legislatures of two-
thirds of the states in the Union, for an amendment of this
Constitution, the legislature of the United States shall call a
convention for that purpose.”
We see clearly why the delegates, their minds concentrated on
their own proposed grants of national powers, mentioned only the
people themselves, the “conventions” of the “Seventh” and “Fifth”
Articles, who alone can make national Articles, and forgot to mention
legislatures, because the latter never can make national Articles.
That kind of Article was the only thing they were then thinking about.
Naturally, it then escaped their attention that, if they proposed a wise
and proper distribution of national power between the new American
government and the respective existing state governments, almost
every future Article, if not every one, would be of the federal kind,
which legislatures or governments could validly make, as they had
made all the Articles of the existing federation. Clearly for that
reason this Article XIX never even mentioned the existing and limited
ability of legislatures.
Between this report of August 6 and August 30, the Convention
was again entirely occupied with the grants of national power and
the election of the legislators to exercise it or, in other words, with
what is now the First Article. On August 30, Article XIX was adopted
without any debate.
We are now aware that the Convention was within two weeks of its
end and no one had mentioned, in what is now the Fifth Article, the
state governments or legislatures as possible makers of federal
Articles, if and when such Articles were to be made in the future.
It was not until September 10, Monday of the last Convention
week, that Article XIX again came up for action, when Gerry of
Massachusetts moved to reconsider it. His purpose, as he himself
stated it, was to object because it made it possible that, if the people
in two-thirds of the states called a convention, a majority of the
American people assembled in that convention “can bind the Union
with innovations that may subordinate the state constitutions
altogether.” Hamilton stated that he could see “no greater evil, in
subjecting the people in America to the major voice than the people
of any particular state.” He went on to say that he did think the Article
should be changed so as to provide a more desirable “mode for
introducing amendments,” namely, drafting and proposing them to
those who could make them. In this respect he said: “The mode
proposed was not adequate. The state legislatures will not apply for
alterations, but with a view to increase their own powers. The
national legislature will be the first to perceive, and will be most
sensible to, the necessity of amendments; and ought also to be
empowered, whenever two-thirds of each branch should concur, to
call a convention. There could be no danger in giving this power, as
the people would finally decide in the case.” (5 Ell. Deb. 531.)
Roger Sherman of Connecticut then tried to have the Article
provide that the national government might also propose
amendments to the several states, as such; such amendment to be
binding if consented to by the several states, namely, all the states.
For reasons that will appear in a moment, this clear attempt to
enable the states, mere political entities, and their legislatures,
always governments, to do what they might wish with the individual
freedom of the American citizen—thus making him their subject—
was never voted upon. It was, however, seconded by Gerry of
Massachusetts. Its probable appeal to Sherman, always a strong
opponent of the national government of individuals instead of the
federal government of states, was that it would make it difficult to
take away any power from Connecticut, unless Connecticut wished
to give it up. Its appeal to Gerry, consistently a Tory in his mental
attitude to the relation of government and human being, was
undoubtedly the fact that it would permit government or governments
to do what they might wish with individual freedom. It does not
escape the attention of the average American that our governments
and leaders, during the last five years, have not only displayed the
mental attitude of Gerry but have also acted as if the proposal, which
he urged, had been put into what is our Fifth Article. Only on that
theory can we average Americans, with our education, understand
why governments in America have undertaken to exercise and to
vest in our government a national power over us, which power
neither is enumerated in the First Article nor was ever granted by the
citizens of America to their only government; nor can we understand
why our leaders have assumed that governments in America, which
are not even the government of the American citizens, can do either
or both of these things. We know, if governments and leaders do not,
that neither thing can ever be possible in a land where men are
“citizens” and not “subjects.”
CHAPTER X
ABILITY OF LEGISLATURES REMEMBERED

Living through the days of that Convention, we have now seen


three months and ten days of its sincere and able effort to word a
Constitution which would “secure the Blessings of Liberty” to the
individual American. We have seen them spend most of their time in
the patriotic endeavor to adjust and settle how much, if any, national
power to interfere with individual freedom that Constitution shall give
to its only donee, the new and general government. In other words,
we have seen the mind and thought and will of that Convention
almost entirely concentrated, for those three months and ten days,
upon the Article which is the constitution of government, the First
Article, with its enumerated grants of general power to interfere with
the human rights of the American citizen.
Keeping in mind the object of that intense concentration, the First
Article grants of power of that kind, we average Americans note, with
determined intent never to forget, the effect of that concentration
upon the wording of our Fifth Article up to that tenth day of
September. We note, with determined intent never to forget, that,
from May 30 to September 10, the only maker of future changes
mentioned was the “people” of America, the most important reservee
of the Tenth Amendment, the “conventions” of the American people
named in both the Seventh and the Fifth Articles.
As this fact and its tremendous meaning have never been known
or mentioned in the sorry tale of the five years from 1917 to 1922, we
average Americans are determined to dwell upon it briefly so that we
cannot escape an accurate appreciation of the short remaining story
of the one week at Philadelphia, in 1787, in relation to our Fifth
Article.

You might also like