Life Sciences Driven Customized Linux Distributions

Bilal Wajid¹ and Erchin Serpedin¹

¹ Dept. of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA.

May 28, 2014

Abstract
Research in the Life Sciences has moved from a purely hypothesis-driven science to a data- and
hypothesis-driven science. Huge volumes of data require powerful systems, intelligent algorithms and a group
of people maintaining and improving the infrastructure associated with the software environments. These
software environments need to be constantly maintained, configured and updated to suit the researchers’
ever-changing needs and goals. To address these challenges, engineers and computer scientists have proposed
multiple solutions built on Linux systems that bundle all the software needed by the research group. This
paper presents a review of the major Life Sciences driven customized Linux distributions (henceforth
referred to as ‘Life-Linux distros’) used in academia and industry.

1 Introduction
The publication of the human genome project [1] drove biological data throughput to proportions that have
surpassed Moore’s law [2]. Such data-intensive research drove biologists and engineers to work together in
order to translate data into knowledge. With a huge selection of software available, biological research has
accelerated tremendously. However, the current state-of-the-art software compels biologists to spend an
increasing amount of time and resources installing, configuring and maintaining software rather than doing
research [3, 4].
To address these problems in the field of biology, not one but many solutions have been proposed, all of
which are built on Linux-based operating systems. Examples like BioLinux [5], Scubuntu [6], Scibuntu
(http://scibuntu.sourceforge.net/), Open Discovery [7], BioSLAX (http://www.bioslax.com) [8], LXtoo [3] and
Scientific Linux (https://www.scientificlinux.org/) address a broad set of users. However, as there is no
‘one size fits all’ solution, scientists have come up with other solutions, each tailored to a specific
problem. For instance, BioBrew provided ‘over-the-counter’ cluster functionality [8–10]. DNALinux [11, 12]
provided a pre-configured virtual machine that runs on top of the free VMware Player on Windows XP and Vista,
meaning that one could use Windows in parallel with running one’s bioinformatics applications in DNALinux
[8, 11]. BioPuppy (http://www.biopuppy.org/), built on top of Puppy Linux, is a very compact distribution
aimed at a classroom setting in which the instructor teaches students the use of specific tools. Other
Ubuntu-based distributions like BioconductorBuntu [13] and PhyLIS [4] are tailored to a specific type of
scientific application. For example, BioconductorBuntu provides a microarray processing platform, whereas
PhyLIS tries to fulfill the computational needs of phylogenetics and phyloinformatics.
This paper discusses and compares Life-Linux distros: Section 2 provides details of the major Linux
distributions used in the life sciences, and Section 3 explains the concept of virtualization.

2 Linux Distributions in Biology


Linux is a Unix-like operating system first released in 1991 by Linus Torvalds under the free and open-source
software development and distribution model [14]. It is the most popular operating system for embedded
systems, servers, mainframe computers and supercomputers. Interestingly, Android (http://www.android.com/c),
a popular operating system for mobile phones, is also built on top of a Linux kernel.
The Linux source code may be used, modified and distributed by anyone under the GNU General Public License,
which allows developers to create their own derivatives such as Debian (https://www.debian.org/), Ubuntu
(http://www.ubuntu.com/), Linux Mint (http://www.linuxmint.com/), Fedora (https://fedoraproject.org/), Arch
Linux (https://www.archlinux.org/), Red Hat Enterprise Linux (http://www.redhat.com), SUSE Linux Enterprise
Server (https://www.suse.com/), etc. Each Linux distribution is built on top of the Linux kernel. The
‘kernel’ is the central part of an operating system; it interacts with the applications and manages the CPU,
the memory and the hardware based on the overall requirements of all the utilities running on the system
(http://en.wikipedia.org/wiki/Kernel_(computing)). Apart from the kernel, each customized Linux distribution
contains a large selection of libraries and tools that fulfill the distribution’s intended use. For instance,
a Linux distribution targeting desktops would contain a desktop environment or window manager like GNOME
(http://www.gnome.org/), Unity (https://unity.ubuntu.com/), LXDE (http://lxde.org/), JWM
(http://joewing.net/projects/jwm/), etc., whereas a Linux distribution targeting servers may omit the
graphical environment and instead contain the libraries needed to run server applications.

Table 1: Brief comparison of Linux distributions used in Life Sciences. (*) Gentoo is available in multiple
environments like GNOME, KDE, Xfce, LXDE, i3, etc. The number of packages pre-compiled in Gentoo is not
known to the authors.

Operating system  Last updated  Free  GUI                    Precompiled packages (approx.)  Size of media  RAM (MB)
SLAX              2013          Yes   KDE Plasma Workspaces  2,050                           200 MB         256
Knoppix           2013          Yes   LXDE                   3,600                           4.2 GB         256
Puppy Linux       2014          Yes   JWM                    128                             161 MB         256
Gentoo Linux      2012          Yes   *                      *                               3.9 GB         512
Xubuntu           2014          Yes   Xfce                   52,541                          700 MB         512
Ubuntu            2014          Yes   Unity                  52,541                          965 MB         512
Enterprise Linux  2014          No    GNOME                  12,516                          255 MB         512

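
The split between the kernel and the rest of a distribution, described earlier in this section, can be
observed on a running system. The following is a minimal sketch, not part of the reviewed distributions: it
assumes a Python 3 interpreter and the presence of an /etc/os-release file, which many but not all Linux
distributions ship (older releases may lack it). The function names parse_os_release and describe_system are
illustrative.

```python
import platform


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release style file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        info[key.strip()] = value.strip().strip('"').strip("'")
    return info


def describe_system(os_release_path="/etc/os-release"):
    """Return (kernel name, kernel release, distribution name)."""
    try:
        with open(os_release_path, encoding="utf-8") as handle:
            distro = parse_os_release(handle.read())
    except OSError:
        distro = {}  # Assumption: file is absent, e.g. on a non-Linux system.
    name = distro.get("PRETTY_NAME") or distro.get("NAME") or "unknown"
    return platform.system(), platform.release(), name


if __name__ == "__main__":
    kernel, release, distro_name = describe_system()
    print(f"kernel: {kernel} {release}")
    print(f"distribution: {distro_name}")
```

Two distributions built on the same kernel release would report the same kernel line but different
distribution lines, which is the distinction the text draws.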
There are several Linux distributions that have been modified to fit the needs of the Life Sciences
community. Among them one can mention SLAX (http://www.slax.org/), Knoppix
(http://www.knopper.net/knoppix/index-en.html), Puppy Linux (http://puppylinux.org/), Gentoo Linux
(http://www.gentoo.org/), Xubuntu (http://xubuntu.org/), Ubuntu (http://www.ubuntu.com/) and Enterprise Linux
(http://www.redhat.com/products/enterprise-linux/); see Table 1. SLAX (a Slackware flavor of Linux) is
characterized by its compactness and its unique modular approach, which allows software to be copied to and
run on SLAX without a specific need for installation and configuration (http://www.slax.org/). The SLAX
build is composed of ‘base’ modules that together form the core of the system, while all the other
components are the utilities which the users want on their SLAX system. Since the utilities can be copied or
removed prior to generating a distributable version, SLAX allows full customization. BioSLAX is a customized
distribution of SLAX Linux [15].
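
The modular build described above can be pictured with a small model. This is a hedged sketch only: it does
not use SLAX tooling, and the module names, sizes and the rule that base modules cannot be removed are
assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Module:
    name: str
    size_mb: int
    base: bool = False  # True for modules that form the core system


@dataclass
class Build:
    modules: dict = field(default_factory=dict)

    def add(self, module):
        """Copy a module into the build; no separate install step is modeled."""
        self.modules[module.name] = module

    def remove(self, name):
        """Drop an optional module; base modules are protected (assumption)."""
        if self.modules[name].base:
            raise ValueError(f"{name} is a base module and cannot be removed")
        del self.modules[name]

    def total_size_mb(self):
        return sum(m.size_mb for m in self.modules.values())


if __name__ == "__main__":
    build = Build()
    build.add(Module("core", 120, base=True))        # hypothetical base module
    build.add(Module("desktop", 60, base=True))      # hypothetical base module
    build.add(Module("sequence-tools", 35))          # hypothetical utility
    build.add(Module("office-suite", 90))            # hypothetical utility
    build.remove("office-suite")                     # trim before distributing
    print(sorted(build.modules), build.total_size_mb())
```

A customized distribution in this picture is simply the base modules plus whichever utility modules remain
when the distributable image is generated.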
Knoppix is a bootable Live system on CD or DVD which, because of its decompression strategy, contains up to
2 GB of executable software on its CD version and over 8 GB of executable software on its DVD “Maxi” edition
(http://www.knopper.net/knoppix/index-en.html). Its latest release, Knoppix 7.2, also includes
A.D.R.I.A.N.E. (Audio Desktop Reference Implementation And Networking Environment), a talking menu system
designed especially for blind users. BioKnoppix is a custom distribution of Knoppix [16]. BioKnoppix was last
updated in 2004.
Puppy Linux is a very compact Live Linux distribution (approximately 100 MB in size) with very low system
requirements and a short boot time. It contains basic applications like word processors, spreadsheets,
Internet browsers, games and image editors (http://puppylinux.org/). BioPuppy, derived from Puppy Linux, is a
compact distribution for bioinformaticians and computational biologists, specially designed to cope with the
needs of beginners, students and staff. However, BioPuppy has been discontinued [8].
Gentoo, which can be built on either a Linux or a FreeBSD kernel, is referred to as a ‘metadistribution’
because its configurability and performance allow it to be customized as a secure server, a desktop, a gaming
system or an embedded solution (http://www.gentoo.org/). LXtoo is a custom distribution of Gentoo aiming to
provide a collection of tools for sequence and structural analysis, microRNA target prediction, microarray
and proteomics data mining, protein network analyses, and molecular dynamics and modeling [3]. LXtoo was last
updated in 2012.
Xubuntu is a GNU/Linux-based derivative of the popular Ubuntu operating system (http://xubuntu.org/).
The software tools are free of charge, easily usable in the user’s local language and admit various levels
of configurability. Xubuntu uses the system’s hardware efficiently, yet it is lightweight enough that
moderately old machines may be able to run it as well. Xubuntu has a large community infrastructure, which
allows operating system problems and challenges to be resolved quickly. DNALinux is a Virtual Machine with
bioinformatics software preinstalled on Xubuntu [8]. For details on ‘virtual machine’ and ‘virtualization’,
please refer to Section 3. DNALinux was last updated in 2009.
Ubuntu is one of the most popular operating systems to emerge since Windows and Mac OS. Ubuntu is now
available not only for desktops and notebooks but also for the cloud, servers, phones, tablets and even
television screens (http://www.ubuntu.com/). Ubuntu provides long-term support for its products, is
carefully engineered and is free. Ubuntu can also be customized, which is why many developers chose it as the
underlying operating system for their distributions, including Bio-Linux, Scubuntu, BioconductorBuntu, PhyLIS
and Baari. Amongst Ubuntu derivatives, Bio-Linux 7 provides a rich suite of bioinformatics software totalling
more than 500 programs. Bio-Linux provides a graphical menu for bioinformatics programs, as well as the
necessary documentation and sample data useful for testing programs [17]. Bio-Linux can also be run on cloud
computing architectures by using its cloud version, ‘CloudBioLinux’ [18]. Bio-Linux was last updated in 2012.
Other Ubuntu derivatives are dedicated to specific areas of research. For instance, BioconductorBuntu caters
for microarrays by supporting several microarray analysis pipelines, including oligonucleotide and dual- or
single-dye experiments, and post-processing with Gene Set Enrichment Analysis [13]. BioconductorBuntu was
last updated in 2009; however, it is also available as a software package within the Ubuntu software
repository and can be easily installed via the Ubuntu Software Center. Similarly, PhyLIS aims to present a
phylogenetic workbench covering a wide range of applications, from sequence data manipulation to alignment
and tree search, including visualization, model selection, divergence time estimation, macroevolutionary
analyses and tools for automation and batch analysis [4]. PhyLIS was also last updated in 2009. Lastly, Baari
contains more than 60 software tools and packages oriented towards Next Generation Sequencing, supporting
pre-assembly tools, genome assemblers and post-assembly tools. Baari therefore offers a well-tailored
environment for both novices and experts working in the field of genome assembly
(http://people.tamu.edu/~bilalwajidabbas/Baari.html). See Table 2 for a comparison of the major Life-Linux
distros.

3 Virtualization
In principle, many operating systems can run in parallel on a single server using a process called
virtualization. Through virtualization, all the applications associated with an operating system, including
the OS itself, live in a separate container called a Virtual Machine (VM). VMs are completely isolated from
each other. However, all the hardware, meaning the CPUs, memory, disk space and networking, is pooled
together and delivered dynamically to each VM by software called a hypervisor. The hypervisor delivers to
each VM the resources it needs at run time, making the most efficient use of the hardware. Therefore, through
virtualization, hardware overhead costs decrease, and since virtualization helps to run the servers at their
optimum capacity, operating efficiency increases considerably. Examples of virtualization software are VMware
(http://www.vmware.com/) and Parallels Desktop (http://www.parallels.com/). Free options include VirtualBox
(https://www.virtualbox.org/), QEMU (http://wiki.qemu.org/Main_Page), Cooperative Linux (coLinux)
(http://www.colinux.org/) and FreeVPS (http://sourceforge.net/projects/freevps/). Genobuntu can be installed
on an Ubuntu system that is running in parallel with Windows or Mac OS using the above-mentioned software
packages.
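
The pooling of hardware and its on-demand delivery to VMs can be illustrated with a toy model. This is a
minimal sketch under stated assumptions: it models only CPUs and memory, it is not the interface of VMware,
VirtualBox or any other product named above, and the class and VM names are hypothetical.

```python
class Hypervisor:
    """Toy model: a shared pool of CPUs and memory handed out to VMs on demand."""

    def __init__(self, cpus, memory_mb):
        self.free = {"cpus": cpus, "memory_mb": memory_mb}
        self.allocations = {}  # VM name -> resources currently held

    def allocate(self, vm, cpus, memory_mb):
        """Grant resources to a VM only if the pool can cover the request."""
        if cpus > self.free["cpus"] or memory_mb > self.free["memory_mb"]:
            return False
        held = self.allocations.setdefault(vm, {"cpus": 0, "memory_mb": 0})
        held["cpus"] += cpus
        held["memory_mb"] += memory_mb
        self.free["cpus"] -= cpus
        self.free["memory_mb"] -= memory_mb
        return True

    def release(self, vm):
        """Return everything a VM holds to the shared pool."""
        held = self.allocations.pop(vm, {"cpus": 0, "memory_mb": 0})
        self.free["cpus"] += held["cpus"]
        self.free["memory_mb"] += held["memory_mb"]


if __name__ == "__main__":
    host = Hypervisor(cpus=4, memory_mb=8192)
    print(host.allocate("ubuntu-vm", 2, 4096))   # request fits in the pool
    print(host.allocate("windows-vm", 2, 2048))  # request fits in the pool
    print(host.allocate("extra-vm", 1, 1024))    # refused: no CPUs left
    host.release("ubuntu-vm")                    # resources flow back
    print(host.allocate("extra-vm", 1, 1024))    # now succeeds
```

Each VM sees only what it has been granted, while the pool as a whole stays close to fully used, which is
the efficiency argument made in the text.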

4 Conclusion
Scientists in the Life Sciences and engineers have worked closely together to simplify the computational
framework needed for the analysis of biological data. Through various degrees of interaction, engineers and
computer scientists have produced multiple solutions, most of which are based on Linux platforms. As these
platforms evolve, most Life-Linux distros are shifting towards Ubuntu as the underlying operating system
because of its popularity and long-term support. Amongst the various Life-Linux distros, Bio-Linux has been
adopted as the industry standard by most life scientists. However, as Bio-Linux is not tailored to any
particular area of research, other Life-Linux distros such as PhyLIS, BioconductorBuntu and Baari have
evolved, each catering to a specific area of research.

Acknowledgments
The first author would like to thank his students in Summer 1 of the Duke TIP program, who motivated him to
write this paper.

Operating system   Free  Reliable  Base OS           Software                                                 Open source  LTS  GUI            Security  Threat detection  x86/x64  Cloud  Script files
Baari              Yes   Yes       Ubuntu 13.10      30+ genome assembly tools                                Yes          Yes  Unity          Yes       Yes               x64      No     Yes
BioPuppy           Yes   Yes       Puppy Linux       Bioinformatics applications for a classroom setting      Yes          No   X.org, Xvesa   No        No                x86      No     No
BioSLAX            Yes   Yes       SLAX              Modularized bioinformatics applications                  Yes          Yes  KDE, X Window  Yes       Yes               x86      No     No
LXtoo              Yes   Yes       Gentoo Linux 11   Sequence analysis, protein-protein interactions          Yes          Yes  X11 Desktop    Yes       Yes               x86/x64  No     No
Open Discovery 3   No    Yes       Fedora Sulphur 9  Molecular dynamics, docking, sequence analysis           Yes          Yes  GNOME 2.22     Yes       Yes               x86/x64  Yes    No
BioBrew            Yes   Yes       Red Hat 7.3       Cluster software                                         Yes          No   KDE, GNOME     Yes       Yes               x86      No     No
Scibuntu*          Yes   No        N/A               N/A                                                      N/A          No   N/A            No        No                x86/x64  No     No
PhyLIS             Yes   Yes       Ubuntu 8          Phylogenetics                                            Yes          No   Unity          Yes       Yes               x86/x64  No     No
DNALinux           Yes   Yes       Xubuntu           DNA and protein analysis; also contains Virtual Desktop  Yes          No   XFCE 4.2.2     Yes       Yes               x86      Yes    No
Scientific Linux   Yes   Yes       Linux 6.4         For a general audience, not specifically biologists      Yes          Yes  GNOME          Yes       Yes               x86/x64  No     No
BioconductorBuntu  Yes   Yes       Ubuntu 12.04      Bioconductor Buntu 2.11                                  Yes          Yes  Unity          Yes       Yes               x86/x64  No     No
Scubuntu           Yes   Yes       Ubuntu 9.1        For a general audience, not specifically biologists      Yes          Yes  Unity          Yes       Yes               x86/x64  No     No
Bio-Linux 7        Yes   Yes       Ubuntu 12.04      500+ bioinformatics applications with 7 assembly tools   Yes          Yes  Unity          Yes       Yes               x64      Yes    No

Table 2: Comparison of different Linux distributions. (*) Contrary to popular opinion, Scibuntu is not an
operating system; it is simply a script file that helps in downloading and installing a group of generic
software packages. The Virtual Desktop, a feature of DNALinux, is a preconfigured virtual machine (VM) which
runs on top of the free VMware Player. Through the Virtual Desktop one can use DNALinux in parallel with
Windows. VMware also allows all other Linux-based distributions to run on top of Windows as a virtual
machine; see Section 3 for further details.

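
The comparison in Table 2 lends itself to simple queries, for example tallying which base operating systems
the Life-Linux distros build on. The following is a minimal sketch: the rows are transcribed from the Base OS
and Cloud columns of Table 2, Scibuntu is left out because the table lists it as a script with no base OS,
and the helper names are illustrative.

```python
from collections import Counter

# (distribution, base OS, cloud support) transcribed from Table 2.
TABLE_2 = [
    ("Baari", "Ubuntu 13.10", False),
    ("BioPuppy", "Puppy Linux", False),
    ("BioSLAX", "SLAX", False),
    ("LXtoo", "Gentoo Linux 11", False),
    ("Open Discovery 3", "Fedora Sulphur 9", True),
    ("BioBrew", "Red Hat 7.3", False),
    ("PhyLIS", "Ubuntu 8", False),
    ("DNALinux", "Xubuntu", True),
    ("Scientific Linux", "Linux 6.4", False),
    ("BioconductorBuntu", "Ubuntu 12.04", False),
    ("Scubuntu", "Ubuntu 9.1", False),
    ("Bio-Linux 7", "Ubuntu 12.04", True),
]


def base_family(base_os):
    """Drop version tokens, e.g. 'Ubuntu 12.04' -> 'Ubuntu'."""
    return " ".join(tok for tok in base_os.split() if not tok[0].isdigit())


def tally_families(rows):
    """Count how many distributions share each base OS family."""
    return Counter(base_family(base) for _, base, _ in rows)


def cloud_ready(rows):
    """List the distributions marked as cloud-capable."""
    return [name for name, _, cloud in rows if cloud]


if __name__ == "__main__":
    for family, count in tally_families(TABLE_2).most_common():
        print(f"{family}: {count}")
    print("cloud-capable:", ", ".join(cloud_ready(TABLE_2)))
```

Counting Xubuntu separately, Ubuntu is the most common base in this tally, which is consistent with the
shift towards Ubuntu noted in the conclusion.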
References
[1] J. D. McPherson, M. Marra, L. Hillier, R. H. Waterston, A. Chinwalla, J. Wallis, M. Sekhon, K. Wylie,
E. R. Mardis, R. K. Wilson et al., “A physical map of the human genome,” Nature, vol. 409, no. 6822, pp.
934–941, 2001.
[2] D. Field, L. Amaral-Zettler, G. Cochrane, J. R. Cole, P. Dawyndt, G. M. Garrity, J. Gilbert, F. O. Glöckner,
L. Hirschman, I. Karsch-Mizrachi et al., “The Genomic Standards Consortium,” PLoS Biology, vol. 9, no. 6,
p. e1001088, 2011.
[3] G. Yu, L.-G. Wang, X.-H. Meng, and Q.-Y. He, “LXtoo: an integrated live Linux distribution for the
bioinformatics community,” BMC Research Notes, vol. 5, no. 1, p. 360, 2012.
[4] R. C. Thomson, “PhyLIS: a simple GNU/Linux distribution for phylogenetics and phyloinformatics,”
Evolutionary Bioinformatics Online, vol. 5, p. 91, 2009.
[5] D. Field, B. Tiwari, T. Booth, S. Houten, D. Swan, N. Bertrand, M. Thurston et al., “Open software for
biologists: from famine to feast,” Nature Biotechnology, vol. 24, no. 7, pp. 801–804, 2006.
[6] P. Van Zyl and T. Fogwill, “Empowering African scientists - an investigation into a CD-based installer for
Scubuntu,” 2008.
[7] U. Vetrivel and K. Pilla, “Open Discovery: an integrated live Linux platform of bioinformatics tools,”
Bioinformation, vol. 3, no. 4, p. 144, 2008.
[8] A. Rana and F. Foscarini, “Linux distributions for bioinformatics: an update,” EMBnet.news, vol. 15,
no. 3, pp. pp–35, 2009.
[9] T. Zhu, J. Zhou, Y. An, J. Zhou, H. Li, G. Xu, and D. Ma, “Construction and characterization of a
rock-cluster-based EST analysis pipeline,” Computational Biology and Chemistry, vol. 30, no. 1, pp. 81–86,
2006.
[10] N. D’Agostino, M. Aversano, and M. L. Chiusano, “ParPEST: a pipeline for EST data analysis based on
parallel computing,” BMC Bioinformatics, vol. 6, no. Suppl 4, p. S9, 2005.
[11] S. Bassi and V. V. Gonzalez, “DNALinux Virtual Desktop Edition,” 2007.
[12] T. Kant, “Open source bioinformatics workbench options for life science researchers,” New York Science
Journal, vol. 3, no. 10, 2010.
[13] P. Geeleher, D. Morris, J. P. Hinde, and A. Golden, “BioconductorBuntu: a Linux distribution that
implements a web-based DNA microarray analysis server,” Bioinformatics, vol. 25, no. 11, pp. 1438–1439,
2009.
[14] J. W. Eckert, Linux+ Guide to Linux Certification. Cengage Learning, 2012.
[15] S. Ranganathan, W.-L. Hsu, U.-C. Yang, and T. W. Tan, “Emerging strengths in Asia Pacific
bioinformatics,” BMC Bioinformatics, vol. 9, no. Suppl 12, p. S1, 2008.
[16] D. C. Leucucta and A. A. Cadariu, “BioKnoppix–bioinformatics Linux distribution,” Applied Medical
Informatics, vol. 13, no. 3, 4, pp. 49–53, 2011.
[17] D. Field, B. Tiwari, and J. Snape, “Bioinformatics and data management support for environmental
genomics,” PLoS Biology, vol. 3, no. 8, p. e297, 2005.
[18] K. Krampis, T. Booth, B. Chapman, B. Tiwari, M. Bicak, D. Field, and K. E. Nelson, “Cloud BioLinux:
pre-configured and on-demand bioinformatics computing for the genomics community,” BMC Bioinformatics,
vol. 13, no. 1, p. 42, 2012.
