ESSENTIALS OF COMPUTERS AND BIOINFORMATICS

UNIT 1
Introduction to Computer, user interface with the Operating System, binary coding
system and Network terminologies. Working with windows and MS office software
concerning word processing, spreadsheets and presentation software.

UNIT 2
Internet and ICT with its Applications, IT Act, System Security (virus/firewall).
Cloud computing: using Google Docs, Google Scholar, Google Sheets, Google Meet,
MS Teams and Zoom scheduling. Overview of life science oriented software and its
usage in laboratories (Python, MATLAB and others) and healthcare (Azure,
HoloLens, etc.)

UNIT 3
Forms of biological information and the need for storage. General introduction to
biological databases; nucleic acid databases (NCBI, DDBJ, and EMBL). Protein
databases (PIR, Uniprot). Specialized genome databases: (SGD and microbial
genome database-MGDB). Structure database-PDB.

UNIT 4
Retrieval methods for Nucleic acid and Protein Sequences. Use of Bioinformatic
Tools -sequence homology- substitution matrices- PAM and BLOSUM. Pairwise
alignment (global and local) using BLAST. Multiple sequence alignment using
Clustal omega.

UNIT 5
Methods of Genome analysis: Shotgun and Hierarchical methods. Gene Prediction
using GENEMARK. Phylogenetic tree construction by MEGA (Molecular Evolutionary
Genetics Analysis). Protein Structure visualization Tools: RasMol and MolMol.

UNIT 1
Introduction to Computer, user interface with the Operating System, binary
coding system and Network terminologies. Working with windows and MS
office software concerning word processing, spreadsheets and presentation
software.

Introduction to Computers
A computer is an electronic device that processes data, converting it into
information that is useful to people. Any computer—regardless of its
type—is controlled by programmed instructions, which give the machine a
purpose and tell it what to do.
Charles Babbage, often called the father of the computer, began designing
the first mechanical computer, the Difference Engine, in the early 1820s.
There are 2 main types of computers. They are:
1. Digital computers: So called because they work “by the numbers."
That is, they break all types of information into tiny units, and use
numbers to represent those pieces of information. Digital computers
also work in very strict sequences of steps, processing each unit of
information individually, according to the highly organized
instructions they must follow.

2. Analog computers: Early analog computers were mechanical
devices, weighing several tons and using motors and gears to perform
calculations. A more manageable type of analog computer is the
old-fashioned slide rule.

The 6 primary types of computers are:


1. Desktop computers:
The most common type of personal computer is the desktop computer—a
PC that is designed to sit on (or under) a desk or table. These are the
systems you see all around you, in schools, homes, and offices, and they are
the main focus of this book. Not only do these machines enable people to do
their jobs with greater ease and efficiency, but they can be used to
communicate, produce music, edit photographs and videos, play
sophisticated games, and much more. Used by everyone from preschoolers
to nuclear physicists, desktop computers are indispensable for learning,
work, and play. A desktop computer is a full-size computer that is too big to
be carried around. The main component of a desktop PC is the system unit,
which is the case that houses the computer’s critical parts, such as its
processing and storage devices. There are two common designs for desktop
computers. The more traditional desktop model features a horizontally
oriented system unit, which usually lies flat on the top of the user’s desk.
The other common design is the vertically oriented tower model, whose
system unit stands upright on or under the desk.
2. Workstations:
A workstation is a specialized, single-user computer that typically has more
power and features than a standard desktop PC. These machines are
popular among scientists, engineers, and animators who need a system
with greater-than-average speed and the power to perform sophisticated
tasks. Workstations often have large, high-resolution monitors and
accelerated graphics-handling capabilities, making them suitable for
advanced architectural or engineering design, modeling, animation, and
video editing.
3. Notebook computers:
Notebook computers, as their name implies, approximate the shape of an
8.5-by-11-inch notebook and easily fit inside a briefcase. Because people
frequently set these devices on their lap, they are also called laptop
computers. Notebook computers can operate on alternating current or
special batteries. Because of their portability, notebook PCs fall into a
category of devices called mobile computers—systems small enough to be
carried by their user.
4. Tablet computers:
Tablet PCs offer all the functionality of a notebook PC, but they are lighter
and can accept input from a special pen—called a stylus or a digital pen—
that is used to tap or write directly on the screen. Many tablet PCs also have
a built-in microphone and special software that accepts input from the
user's voice. Tablet PCs run specialized versions of standard programs and
can be connected to a network. Some models also can be connected to a
keyboard and a full-size monitor.
5. Handheld computers:
Handheld personal computers are computing devices small enough to fit in
your hand. A popular type of handheld computer is the personal digital
assistant (PDA). A PDA is no larger than a small appointment book and is
normally used for special applications, such as taking notes, displaying
telephone numbers and addresses, and keeping track of dates or agendas.
Many PDAs can be connected to larger computers to exchange data. Most
PDAs come with a pen that lets the user write on the screen. Some
handheld computers feature tiny built-in keyboards or microphones that
allow voice input.
6. Smart phones:
Some cellular phones double as miniature PCs. Because these phones offer
advanced features not typically found in cellular phones, they are
sometimes called smart phones. These features can include Web and email
access, special software such as personal organizers, or special hardware
such as digital cameras or music players.
The other types of computers are:
1. Supercomputers:
Supercomputer is a broad term for one of the fastest computers currently
available. Supercomputers are very expensive and are employed for
specialized applications that require immense amounts of mathematical
calculations (number crunching). For example, weather forecasting
requires a supercomputer. Other uses of supercomputers include scientific
simulations, (animated) graphics, fluid dynamic calculations, nuclear
energy research, electronic design, and analysis of geological data (e.g. in
petrochemical prospecting). Perhaps the best known supercomputer
manufacturer is Cray Research.
2. Mainframe computers:
Mainframe was a term originally referring to the cabinet containing the
central processor unit or "main frame" of a room-filling Stone Age batch
machine. After the emergence of smaller "minicomputer" designs in the
early 1970s, the traditional big iron machines were described as
"mainframe computers" and eventually just as mainframes. Nowadays a
Mainframe is a very large and expensive computer capable of supporting
hundreds, or even thousands, of users simultaneously. The chief difference
between a supercomputer and a mainframe is that a supercomputer
channels all its power into executing a few programs as fast as possible,
whereas a mainframe uses its power to execute many programs
concurrently. In some ways, mainframes are more powerful than
supercomputers because they support more simultaneous programs. But
supercomputers can execute a single program faster than a mainframe. The
distinction between small mainframes and minicomputers is vague,
depending really on how the manufacturer wants to market its machines.
3. Minicomputers:
A minicomputer is a midsize computer. In the past decade, the distinction
between large minicomputers and small mainframes has blurred, however, as
has the distinction between small minicomputers and workstations. But in
general, a minicomputer is a multiprocessing system capable of supporting
up to 200 users simultaneously.
4. Personal computers:
It can be defined as a small, relatively inexpensive computer designed for
an individual user. In price, personal computers range anywhere from a few
hundred pounds to over five thousand pounds. All are based on the
microprocessor technology that enables manufacturers to put an entire
CPU on one chip. Businesses use personal computers for word processing,
accounting, desktop publishing, and for running spreadsheet and database
management applications. At home, the most popular use for personal
computers is for playing games and recently for surfing the Internet.

Generations of Computers
1st generation: 1946-1959 → vacuum tubes
2nd generation: 1959-1965 → transistors
3rd generation: 1965-1971 → integrated circuits
4th generation: 1971-1980 → VLSI (Very Large Scale Integration) microprocessors
5th generation: 1980-present → AI and ULSI (Ultra Large Scale Integration)

Parts of a Computer
A computer system consists of 4 main parts:
1. Hardware:
The mechanical devices that make up the computer are called hardware.
Hardware is any part of the computer you can touch. A computer’s
hardware consists of interconnected electronic devices that you can use to
control the computer’s operation, input, and output.
Internal hardware: Motherboard, RAM, Hard disk, SSD, CPU.
External hardware: Monitor, Mouse, Keyboard, Printer, Speaker.
2. Software:
Software is a set of instructions that makes the computer perform tasks. In
other words, software tells the computer what to do. (The term program
refers to any piece of software.) Some programs exist primarily for the
computer's use to help perform tasks and manage its own resources. Other
types of programs exist for the user, enabling him or her to perform tasks
such as creating documents. Thousands of different software programs are
available for use on personal computers.
Examples: Internet Explorer, Google Chrome, Microsoft Office, Opera,
Mozilla Firefox, Oracle Database.
3. Data:
Data consist of individual facts or pieces of information that by themselves
may not make much sense to a person. A computer’s primary job is to
process these tiny pieces of data in various ways, converting them into
useful information.
4. User:
People are the computer operators, also known as users.
The information processing cycle has four parts, and each part involves one
or more specific components of the computer:
» Input:
During this part of the cycle, the computer accepts data from some source,
such as the user or a program, for processing.
» Processing:
During this part of the cycle, the computer’s processing components
perform actions on the data, based on instructions from the user or a
program.
» Output:
Here, the computer may be required to display the results of its processing.
For example, the results may appear as text, numbers, or a graphic on the
computer’s screen or as sounds from its speaker. The computer also can
send output to a printer or transfer the output to another computer
through a network or the Internet. Output is an optional step in the
information processing cycle but may be ordered by the user or program.

» Storage:
In this step, the computer permanently stores the results of its processing
on a disk, tape, or some other kind of storage medium. As with output,
storage is optional and may not always be required by the user or program.
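
The four steps of the information processing cycle can be sketched in a few lines of Python. This is only an illustration: the function names, the sample marks, and the use of a dictionary in place of a disk are all assumptions made for the example.

```python
# Illustrative sketch of the information processing cycle.
# All names here are hypothetical and chosen only for this example.

def process(marks):
    """Processing: turn raw data (marks) into information (an average)."""
    return sum(marks) / len(marks)

def run_cycle(raw_text, storage):
    # Input: accept data from some source (here, a string of numbers).
    marks = [int(part) for part in raw_text.split()]
    # Processing: act on the data according to the program's instructions.
    average = process(marks)
    # Output (optional): show the result to the user.
    print(f"Average mark: {average}")
    # Storage (optional): keep the result for later use.
    storage["average"] = average
    return average

saved = {}  # a dictionary stands in for a disk or other storage medium
run_cycle("70 80 90", saved)
```

Here the raw marks are the data, and the average is the useful information produced from them.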

Memory Devices:
In a computer, memory is one or more sets of chips that store data and/or
program instructions, either temporarily or permanently. Memory is a
critical processing component in any computer. Personal computers use
several different types of memory, but the two most important are called
random access memory (RAM) and read-only memory (ROM).
These two types of memory work in very different ways and perform
distinct functions.
1. Random Access Memory
The most common type of memory is called random access memory (RAM).
As a result, the term memory is typically used to mean RAM. RAM is like an
electronic scratch pad inside the computer. RAM holds data and program
instructions while the CPU works with them. When a program is launched,
it is loaded into and run from memory. As the program needs data, it is
loaded into memory for fast access. As new data is entered into the
computer, it is also stored in memory—but only temporarily. Data is both
written to and read from this memory. (Because of this, RAM is also
sometimes called read/write memory.) Like many computer components,
RAM is made up of a set of chips mounted on a small circuit board.
RAM is volatile, meaning that it loses its contents when the computer is
shut off or if there is a power failure. Therefore, RAM needs a constant
supply of power to hold its data. For this reason, you should save your data
files to a storage device frequently, to avoid losing them in a power failure.
RAM has a tremendous impact on the speed and power of a computer.
Generally, the more RAM a computer has, the more it can do and the faster
it can perform certain tasks. The most common measurement unit for
describing a computer’s memory is the byte, the amount of memory it
takes to store a single character such as a letter of the alphabet or a
numeral. When referring to a computer's memory, the numbers are often so
large that it is helpful to use terms such as kilobyte (KB), megabyte (MB),
gigabyte (GB), and terabyte (TB) to describe the values.
Types of RAM: SRAM, DRAM.
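
The memory units described above can be related to each other with a short Python sketch. It assumes the binary convention in which each unit is 1024 times the previous one (1 KB = 1024 bytes); the function name is hypothetical.

```python
# Assumption: binary multiples, i.e. 1 KB = 1024 bytes, 1 MB = 1024 KB, ...
UNITS = ["bytes", "KB", "MB", "GB", "TB"]

def describe_size(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    size = float(num_bytes)
    unit_index = 0
    while size >= 1024 and unit_index < len(UNITS) - 1:
        size /= 1024
        unit_index += 1
    return f"{size:.1f} {UNITS[unit_index]}"

# One plain ASCII character occupies one byte.
print(len("a".encode("ascii")))

# A memory module holding 8 * 1024**3 bytes is described as 8 GB.
print(describe_size(8 * 1024 ** 3))
```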

2. Read Only Memory


Unlike RAM, read-only memory (ROM) permanently stores its data, even
when the computer is shut off. ROM is called non-volatile memory because
it never loses its contents. ROM holds instructions that the computer needs
to operate. Whenever the computer's power is turned on, it checks ROM for
directions that help it start up, and for information about its hardware
devices.
Types of ROM: PROM, EPROM, EEPROM, MROM
MROM: MASK Programmed ROM
PROM: Programmable ROM
EPROM: Erasable Programmable ROM
EEPROM: Electrically Erasable Programmable ROM

User Interface
A user interface refers to the part of an operating system, program, or
device that allows a user to enter and receive information.
A text based user interface displays text, and its commands are usually
typed on a command line using a keyboard.
With a graphical user interface, the functions are carried out by clicking or
moving buttons, icons, and menus using a pointing device.
Text User Interface (TUI):
Modern graphical user interfaces have evolved from text based user
interfaces.
Some operating systems, such as Linux, can still be used with a text based
user interface. In this case, the commands are entered as text. To display the
text based user interface of Windows, called Command Prompt, open the Start
menu and type cmd. Press Enter on the keyboard to launch the command prompt
in a separate window. With the command prompt, you can type your
commands from the keyboard instead of using the mouse.
Graphical User Interface (GUI):
In most operating systems, the primary user interface is graphical, i.e.,
instead of typing commands, you can manipulate various graphical objects
with a pointing device.

Most GUIs have the following:


• A start menu with program groups
• A taskbar showing running programs
• A desktop
• Various icons and shortcuts

Operating System
A computer needs a set of programs called an operating system or OS to
control its devices.
An OS is a program which manages all the computer hardware. It provides
the base for application programs and acts as an intermediary between the
user and the computer hardware. There are different kinds of operating
systems, for example, Windows, Linux, and Mac OS.
The operating system enables, among other things:
• The identification and activation of devices connected to the
computer,
• The installation and use of programs, and
• The handling of files.

Binary Coding System


A binary code represents text, computer processor instructions, or any
other data using a two-symbol system. The two-symbol system used is
often “0” and “1” from the binary number system. The binary code assigns a
pattern of binary digits, also known as bits, to each character, instruction,
etc. All digital data stored and used by computers consists of strings of such
binary digits or bits which can be thought of as machine language.
For example, a binary string of eight bits (which is also called a byte) can
represent any of 256 possible values and can, therefore, represent a wide
variety of different items.
The American Standard Code for Information Interchange (ASCII), uses a 7-
bit binary code to represent text and other characters within computers,
communications equipment, and other devices. Each letter or symbol is
assigned a number from 0 to 127. For example, lowercase “a” is represented
by 1100001 as a bit string (which is “97” in decimal).
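
The ASCII example above can be checked with Python's built-in functions ord, chr, and format. This is a small sketch; the helper name to_ascii_bits is made up for the illustration.

```python
def to_ascii_bits(character):
    """Return the 7-bit ASCII pattern of a single character as a string."""
    code = ord(character)          # decimal code, e.g. 97 for "a"
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "07b")     # 7 binary digits, padded with zeros

letter = "a"
bits = to_ascii_bits(letter)
print(letter, ord(letter), bits)

# Converting the bit string back gives the original character.
print(chr(int(bits, 2)))
```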
Importance of Binary code:
The binary number system is the base of all computing systems and
operations. It enables devices to store, access and manipulate all types of
information directed to and from the CPU or memory. This makes it
possible to develop applications that enable users to do the following:
• View websites
• Create and update documents
• Play games
• View streaming video and other kinds of graphical information
• Access software
• Perform calculations and data analysis

The binary scheme of digital 1s and 0s offers a simple and elegant way for
computers to work. It also offers an efficient way to control logic circuits
and to detect whether an electrical signal is true (1) or false (0).
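How binary numbers work:
Take the binary number 01001111. Starting from the rightmost bit, each
bit is multiplied by the power of 2 for its position:
1 × 2^0 = 1 × 1 = 1
1 × 2^1 = 1 × 2 = 2
1 × 2^2 = 1 × 4 = 4
1 × 2^3 = 1 × 8 = 8
0 × 2^4 = 0 × 16 = 0
0 × 2^5 = 0 × 32 = 0
1 × 2^6 = 1 × 64 = 64
0 × 2^7 = 0 × 128 = 0
So, 1 + 2 + 4 + 8 + 64 = 79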
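
The same positional calculation can be written as a short Python function. This is a sketch for illustration; the function name is hypothetical, and Python's built-in int(text, 2) performs the same conversion.

```python
def binary_to_decimal(bits):
    """Add up bit * 2**position, counting positions from the right."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position
    return total

value = binary_to_decimal("01001111")
print(value)

# The built-in conversion gives the same result.
print(int("01001111", 2))
```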

Network Terminologies
When 2 or more computers or devices are connected to transfer files and
data or to communicate with each other, the resulting system is referred
to as a computer network.
Network is a collection of interconnected devices, such as computers,
printers, and servers that can communicate with each other. In a computer
network, each computer is called a node.

The various types of computer networks are:


• LAN: Local Area Network
• MAN: Metropolitan Area Network
• PAN: Personal Area Network
• WAN: Wide Area Network
LAN:
A local area network (LAN) is a network contained within a small
geographic area, usually within the same building. Home Wi-Fi networks
and small business networks are common examples of LANs. Most LANs
connect to the Internet at a central point: a router. Home LANs often use a
single router, while LANs in larger spaces may additionally use network
switches for more efficient packet delivery.
LANs almost always use Ethernet, Wi-Fi, or both in order to connect devices
within the network. Ethernet is a protocol for physical network connections
that requires the use of Ethernet cables. Wi-Fi is a protocol for connecting
to a network via radio waves.
A variety of devices can connect to LANs, including servers, desktop
computers, laptops, printers, IoT devices, and even game consoles. In
offices, LANs are often used to provide shared access to internal employees
to connected printers or servers.
Advantages:
• The number of computers can vary from 2 to a few hundred.
• Bandwidth is high, so data speed is high.
• Propagation delay is low.
• Fault tolerance is high.
• Data transfer is fast due to the limited number of computers in this
network.
• Privately owned network, so maintenance cost is low.

MAN:
A metropolitan area network (MAN) is a computer network that connects
computers within a metropolitan area, which could be a single large city,
multiple cities and towns, or any given large area with multiple buildings. A
MAN is larger than a local area network (LAN) but smaller than a wide area
network (WAN). MANs do not have to be in urban areas; the term
“metropolitan” implies the size of the network, not the demographics of the
area that it serves. Like WANs, a MAN is made up of interconnected LANs.
Since MANs are smaller, they are usually more efficient than WANs, since
data does not have to travel over large distances. MANs typically combine
the networks of multiple organizations, instead of being managed by a
single organization.
Most MANs use fiber optic cables to form connections between LANs. Often
a MAN will run on “dark fiber” — formerly unused fiber optic cables that
are able to carry traffic. These fiber optic cables may be leased from
private-sector Internet service providers (ISP). In some cases, this model is
reversed: a city government builds and maintains a metropolitan fiber optic
network, then leases dark fiber to private companies.
Advantages:
• Highly secure network.
• High speed due to the optical fiber medium.
• Cost effective.
• Fast data transfer over long distances.

PAN:
A personal area network (PAN) connects electronic devices within a user’s
immediate area. The size of a PAN ranges from a few centimeters to a few
meters. One of the most common real-world examples of a PAN is the
connection between a Bluetooth earpiece and a smartphone. PANs can also
connect laptops, tablets, printers, keyboards, and other computerized
devices.
PAN network connections can either be wired or wireless. Wired
connection methods include USB and FireWire; wireless connection
methods include Bluetooth (the most common), Wi-Fi, IrDA, and Zigbee.
While devices within a PAN can exchange data with each other, PANs
typically do not include a router and thus do not connect to the Internet
directly. A device within a PAN, however, can be connected to a local area
network (LAN) that then connects to the Internet. For instance, a desktop
computer, a wireless mouse, and wireless headphones can all be connected
to each other, but only the computer can connect directly to the Internet.
A wireless personal area network (WPAN) is a group of devices connected
without the use of wires or cables. Today, most PANs for everyday use are
wireless. WPANs use close-range wireless connectivity protocols such as
Bluetooth.
The range of a WPAN is usually very small, as short-range wireless
protocols like Bluetooth are not efficient over distances larger than 5-10
meters.

WAN:
A wide area network (WAN) is a large computer network that connects
groups of computers over large distances. WANs are often used by large
businesses to connect their office networks; each office typically has its
own local area network, or LAN, and these LANs connect via a WAN. These
long connections may be formed in several different ways, including leased
lines, VPNs, or IP tunnels. The definition of what constitutes a
WAN is fairly broad. Technically, any large network that spreads out over a
wide geographic area is a WAN. The Internet itself is considered a WAN.

Characteristics:
• The most expensive type of network to set up.
• High maintenance networks.
• Communication over a large geographical area, up to the entire globe.
• Data can be transferred to almost anywhere in the world.
• Speeds of up to about 100 Gbps on high-capacity links.

→ (IM) Instant Messaging: an online facility that lets users chat or
talk in real time, provided by Skype, Google Talk, Windows Live
Messenger, etc.
→ (VoIP) Voice over Internet Protocol: a protocol used especially for
voice transfer over IP networks, making phone calls by using the
internet.
→ Podcast: a digital audio or video recording available on the
internet.
→ Social Networking Websites: websites that provide users with a
common platform where they can share their messages as text, audio,
video, images, etc. Examples: Facebook, Twitter, etc.
→ Chat Rooms: a dedicated area on the internet facilitating users to
communicate.
→ (PSTN) Public Switched Telephone Network: the aggregate of the
world’s telephone networks that are operated by national, regional,
or local telephone operators. It provides infrastructure and services
for public telephony. The PSTN consists of telephone lines, fiber-optic
cables, microwave transmission links, cellular networks,
communications satellites, and undersea telephone cables
interconnected by switching centers, such as central offices, network
tandems, and international gateways, which allow telephone users to
communicate with each other.
→ (ISDN) Integrated Services Digital Network: a set of communication
standards for simultaneous digital transmission of voice, video, data,
and other network services over the digitalized circuits of the public
switched telephone network.
→ (ADSL) Asymmetric Digital Subscriber Line: a type of digital
subscriber line (DSL) technology, a data communications technology
that enables faster data transmission over copper telephone lines
than a conventional voiceband modem can provide.
→ Download: a process that saves data from the internet to a personal
computer.
→ Upload: a process that transfers the saved data from a personal
computer to the internet server.
→ Dial-up: a technique in which a phone line is used in order to connect
to the internet.
→ Broadband: a wide bandwidth data transmission that transports
multiple signals and traffic types swiftly.
→ Protocol: a set of rules and standards that define how devices on a
network communicate with each other.
→ IP Address: a unique numerical identifier assigned to each device on
a network, used to identify and communicate with other devices.
→ Router: a networking device that connects multiple networks and
forwards data packets between them.
→ Switch: a networking device that connects devices on a network and
forwards data packets between them.
→ Firewall: a security device or software that monitors and controls
incoming and outgoing network traffic, based on a set of predefined
security rules.
→ (DNS) Domain Name System: translates readily memorized domain
names to the numerical IP addresses needed for locating and
identifying computer services and devices with the underlying
network protocols.
→ (DHCP) Dynamic Host Configuration Protocol: a network
management protocol used on Internet Protocol (IP) networks for
automatically assigning IP addresses and other communication
parameters to devices connected to the network using a client–server
architecture.
→ (TCP/IP) Transmission Control Protocol/ Internet Protocol: a set of
protocols used to communicate over the internet and other networks.
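
The idea of an IP address as a numerical identifier within a network can be explored with Python's standard ipaddress module. The addresses below are made-up examples from a private range, and the helper name same_network is hypothetical.

```python
import ipaddress

# Example values only: a private address and a typical home LAN range.
address = ipaddress.ip_address("192.168.1.20")
network = ipaddress.ip_network("192.168.1.0/24")

def same_network(addr, net):
    """Return True if the address belongs to the given network range."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(net)

print(address.version)          # 4 means an IPv4 address
print(address.is_private)       # True for addresses reserved for local networks
print(network.num_addresses)    # how many addresses the range contains
print(same_network("192.168.1.20", "192.168.1.0/24"))
```

A router or switch makes a similar membership decision when it works out whether a packet is meant for a device on the local network or must be forwarded elsewhere.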

MS Word
A word processor is a computer program for creating and editing text
documents.
Word processor software provides a general set of tools for entering,
editing, and formatting text.
A word processor has everything that a conventional typewriter has. It
also provides various useful features that are not available on a typewriter.
To Launch Word:
To start Word 2019, click on the Start button, and then select
Microsoft Word 2019 from the list of programs. The Microsoft Word icon can
be pinned to the taskbar for quick access.
→ Title Bar: Displays the name of the file you are currently working on. It
also contains three buttons. The Minimize button reduces the
window to an icon, but Word remains active. The Restore button
returns the Word window to its original maximum size. The Close button
takes us out of Word.
→ Menu Bar: This consists of various commands that can be accessed by
clicking on the menu options under these menu headings.
→ Standard Toolbar: Displays icons for common operations, such as
Open, Print, Save, etc., which can be done by clicking on the suitable
tool.
→ Formatting Toolbar: Displays the options that can be used to format
our document.
→ Ruler: Indicates the width of the document, which can be increased
or decreased. You can also see how many lines you have written.
→ Work area: This is the area where you can enter the text.

→ Vertical Scroll Bar: For larger text in the document, you can scroll the
vertical bar to view the text in different positions.
→ Horizontal scrollbar: Used to move from left to right of the document
and vice versa if the document is too wide to fit on the screen.
→ Search Object Selection Button: This helps us select one of several
tools used to find something in a document.
→ Normal View button: Helps us view the document much as it will be
printed. Arranges the text so that no part of the document is hidden on
the screen.
→ Print Layout View: This option allows us to see how the document
will be printed. All headers, footers, and comments are displayed.
→ Draw Toolbar: One of several toolbars that may be available on the
screen. This toolbar is used to make drawings on the document.
→ Status bar: This bar always shows you your current position as far as
the text goes. It shows you the current position of your cursor in terms
of the page number, line number, etc.
Features:
• Fast Typing: Typing text in a word processor is fast since there is no
associated mechanical carriage movement.
• Editing functions: Any type of correction (insert, delete, change, etc.)
can be easily done as and when needed.
• Permanent storage: Documents can be stored indefinitely. The saved
document can be called up at any time.
• Formatting functions: Entered text can be created in any form and
style (bold, italic, underline, different fonts, etc.).
• Graphics: Provides the ability to insert drawings into documents,
making them more useful.
• OLE (Object Linking and Embedding): OLE is a program integration
technology used to exchange information between programs about
objects. Objects are entities stored as graphs, equations, video clips,
audio clips, images, and so on.
• Alignment: You can align your text as you like, for example, left, right,
or centered. You can even justify the text, i.e., align it on both sides.
• Delete errors: You can remove a word, line, or paragraph in a single
stroke, and the rest of the text will readjust automatically.
• Line Spacing: You can set the line spacing from one to nine according
to your preference.
• Moving the Cursor: You can move the cursor from one word to another
or from one paragraph to another as needed.
• Naming a Document: You can name a document and retrieve it from
your hard drive at any time for editing, updating, correction, and even
for printing.
• Page break: You can set a page break at any point in the text so that
the text after it starts on a new page when printed.
• Search and Replace: You can search for a specific word in the entire
document and replace it with another word.
• Thesaurus: You can exchange a word with one of its synonyms. This
way you can avoid the repetition of a single word in a document and
add beauty to the language.
• Indentation: Refers to the space between the text boundaries and the
margins of the page. There are three types of indents: positive,
negative, and hanging.
• Header and footer: A header or footer is text or a graphic, such as a
page number, a date, or a company logo, that is typically printed at
the top or bottom of each page of a document.
• Page orientation: Refers to whether the text is printed lengthwise or
across the page. Printing lengthwise is called PORTRAIT and printing
across is called LANDSCAPE.
• Spell Checker: Not only can it check spelling mistakes, but it can also
suggest possible alternatives for misspelled words.
• Mail Merge: This is a function that allows you to print a large number
of letters/documents with more or less similar texts. For example, the
same letter of invitation can be sent to many guests with only the
name and address changed.
• Co-authoring: When someone else is working with you on a document,
you’ll see their presence and the changes they’re making.
• Icons and SVGs: You can add icons or other scalable vector graphics
(SVGs) to your documents, change their color, apply effects, and
adjust them to suit your needs.
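
The mail merge idea from the list above can be sketched in a few lines of Python. This is only an illustration of the concept, not how MS Word implements it; the letter text, the guest records, and the function name mail_merge are made-up examples.

```python
from string import Template

# One letter template; $name and $address are the only parts that change.
LETTER = Template(
    "Dear $name,\n"
    "You are invited to our annual function.\n"
    "Address on file: $address\n"
)

# Hypothetical guest list (in a word processor this is the data source).
guests = [
    {"name": "Asha", "address": "12 Lake Road, Chennai"},
    {"name": "Ravi", "address": "7 Hill Street, Madurai"},
]


def mail_merge(template, records):
    """Return one filled-in letter per record."""
    return [template.substitute(record) for record in records]


letters = mail_merge(LETTER, guests)
for letter in letters:
    print(letter)
```

Each guest receives the same text, with only the name and address fields replaced.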

MS Excel
Microsoft Excel is a powerful electronic spreadsheet program you can use
to automate accounting work, organize data, and perform a wide variety of
tasks. Excel is designed to perform calculations, analyze information, and
visualize data in a spreadsheet. The application also includes database and
charting features.
To launch Excel for the first time:
1. Click on the Start button.
2. Click on All Programs.
3. Select Microsoft Office from the menu options, and then click on
Microsoft Excel 2019.
The Excel window contains the following elements:
→ Quick Access Toolbar: Displays quick access to commonly used
commands.
→ Search Bar: Advanced search helps you find and perform tasks.
→ Title Bar: Displays the name of the application file.
→ File Tab: The File tab has replaced the Office button. It helps you to
manage the Microsoft application and provides access to options such
as Open, New, Save As, Print, etc.
→ Name Box: Displays the active cell location.
→ Cell: The intersection of a row and column; cells are always named
with the column letter followed by the row number (e.g. A1 and
AB209); cells may contain text, numbers and formulas.
→ Range: One or more adjacent cells. A range is identified by its first
and last cell address, separated by a colon. Example ranges are B5:B8,
A1:B1 and A1:G240.
→ Status Bar: Displays information about the current worksheet.
→ New Sheet: Add a new sheet button.
→ Ribbon: Displays groups of related commands within tabs. Each tab
provides buttons for commands.
→ Formula Bar: Input formulas and perform calculations.
→ Worksheet: A grid of cells that is 16,384 columns wide
(A-Z, AA-AZ, BA-BZ … XFD) and 1,048,576 rows long.
→ View Option: Displays the worksheet view modes.
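
The cell and range naming rules above (column letter followed by row number, ranges written as first:last) can be illustrated with a short Python sketch. The function names are hypothetical helpers written for this example; they are not part of Excel. Column 16,384, the last column in current Excel versions, comes out as XFD.

```python
import re


def column_letters(index):
    """Convert a 1-based column number to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def column_number(letters):
    """Convert Excel column letters back to a 1-based number (AB -> 28)."""
    number = 0
    for ch in letters.upper():
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


def cells_in_range(ref):
    """Expand a range such as 'B5:B8' into the list of its cell names."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", ref)
    if match is None:
        raise ValueError(f"not a range reference: {ref!r}")
    first_col, last_col = column_number(match[1]), column_number(match[3])
    first_row, last_row = int(match[2]), int(match[4])
    return [
        f"{column_letters(col)}{row}"
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


print(column_letters(16384))
print(cells_in_range("B5:B8"))
```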

Features:
• Charts: Charts can be used to represent the data in richly detailed
graphical format.
• SmartArt: We can utilize SmartArt to express information by aligning
data in creative ways graphically.
• Clip Arts: We can include ready-to-use clip arts to convey our
message in a visual format.
• Shapes: We can use a variety of shapes to depict data in
infographics. With the help of the freeform feature we can draw any
shape.

• Pictures: Any image can be inserted to enhance the objects, for
example, as backgrounds of worksheets, shapes, and charts.
• Tables: We can convert a range of data into a table. This makes it
easier to sort, filter, and analyze the data rapidly.
• Grouping: With parent and child records, we can group the rows and
columns.
• Sorting: In Excel, we can sort the data in ascending or descending
order by one or more columns.
• Filtering: The data can be filtered in Excel. Basic filters can be set
with a variety of criteria, and the Advanced Filter option allows us
to perform more complex filters.
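
Sorting on more than one column and filtering by a condition can be sketched in Python as follows. The rows, the column names, and the pass mark of 75 are invented for illustration; Excel performs the same operations through its Sort and Filter commands.

```python
from operator import itemgetter

# Hypothetical worksheet rows: one dictionary per row.
rows = [
    {"name": "Meena", "dept": "Botany", "marks": 78},
    {"name": "Arun", "dept": "Zoology", "marks": 91},
    {"name": "Kavi", "dept": "Botany", "marks": 85},
    {"name": "Divya", "dept": "Zoology", "marks": 64},
]

# Sort by two columns: dept ascending, then marks descending.
# sorted() is stable, so we sort by the secondary column first.
by_marks = sorted(rows, key=itemgetter("marks"), reverse=True)
sorted_rows = sorted(by_marks, key=itemgetter("dept"))

# Filter: keep only the rows whose marks are at least 75.
filtered_rows = [row for row in rows if row["marks"] >= 75]

for row in sorted_rows:
    print(row["dept"], row["name"], row["marks"])
```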

MS PowerPoint
Microsoft PowerPoint, usually just called PowerPoint, is a software
program developed by Microsoft to produce effective presentations. It is a
part of the Microsoft Office suite. It is a complete presentation graphics
package that gives you everything needed to create a professional-looking
presentation: slides together with word processing, drawing, outlining,
graphing, and presentation management tools.
PowerPoint was developed by Dennis Austin and Thomas Rudkin at a
software company named Forethought Inc. It was originally going to be
named Presenter, but due to trademark issues it was renamed PowerPoint in
1987. The first Windows version of PowerPoint was released alongside
Windows 3.0 in 1990. The initial version of PowerPoint only allowed slide
progression in one direction, i.e., forward, and the amount of customization
was somewhat limited.
With every version, the program became more creative and more
interactive. Numerous other features were also added in later versions,
which greatly increased the demand for and use of this MS Office program.
The default file extension of a PowerPoint presentation is “.pptx” (“.ppt”
in versions before PowerPoint 2007). It is a slide-based presentation
program that uses graphics, videos, and other features to make a
presentation more interactive and interesting.

To Launch Microsoft PowerPoint 2019:
1. Click on the Start button.
2. Click on the PowerPoint 2019 icon from the options panel.
3. The PowerPoint Template window will appear.
4. Click on the Blank Presentation icon.
→ Quick Access Toolbar: Displays quick access to commonly used commands.
→ Title Bar: Displays the name of the open file.
→ File Tab: The File tab has replaced the Office button of Office 2007.
It helps you to manage the Microsoft application and provides access to
options such as Open, New, Save As, Print, etc.
→ Thumbnail Slide: Displays a snapshot of each slide.
→ Title: Placeholder Section where text is entered.
→ Subtitle: Placeholder Section where text and/or graphics are entered.
→ Status Bar: Displays information about the slide presentation, such as
page numbers.
→ Ribbon: Displays groups of related commands within tabs. Each tab
provides buttons for commands.
→ Collapse: Collapses the ribbon so only the tab names show.
→ Work Area: Each slide has an area where text and graphics are
entered for a presentation. There are various slide layouts to work
from.
→ View Option: Displays several View modes for slides, chart, graphics
and media in the slides.
Features:
• Slide Layout: Multiple layouts are available on which a presentation
can be based. This option is available under the “Home” tab, where
one can select from the layout options provided.
• Insert – Clipart, Video, Audio: Under the “Insert” tab, one can choose
what to insert in the presentation. This may include images, audio,
video, header, footer, symbols, shapes, etc.
• Slide Design: MS PowerPoint has various themes with which
background color and designs or textures can be added to a slide.
This makes the presentation more colorful and attracts the attention
of the audience. Themes are applied from the “Design” tab of MS
PowerPoint. Although design templates are already available, the
design can also be customized if someone wants to add a new texture
or color. Apart from this, slide designs can also be downloaded
online.
• Animations: During the slide show, the slides appear on the screen
one after the other. To animate the way in which a slide presents
itself, one can use the options under the “Animations” tab.


UNIT 2

Internet and ICT with its Applications, IT Act, System Security
(virus/firewall). Cloud computing- using Google docs, Google Scholar,
Google sheets, Google meet, MS teams and Zoom scheduling. Overview of
life Science oriented software’s, their usage in laboratories (Python,
MATLAB and others) and healthcare (Azure, HoloLens, etc.)

INTERNET

The Internet is a global network of billions of computers and other
electronic devices. With the Internet, it's possible to access almost any
information, communicate with anyone else in the world, and do much
more.

What is the Web?

The World Wide Web—usually called the Web for short—is a collection of
different websites you can access through the Internet. A website is made
up of related text, images, and other resources. Websites can resemble
other forms of media—like newspaper articles or television programs—or
they can be interactive in a way that's unique to computers.

How does the Internet work?

At this point you may be wondering, how does the Internet work? The
exact answer is pretty complicated and would take a while to explain.
Instead, let's look at some of the most important things you should know.

It's important to realize that the Internet is a global network of physical
cables, which can include copper telephone wires, TV cables, and fiber optic
cables. Even wireless connections like Wi-Fi and 3G/4G rely on these
physical cables to access the Internet.


When you visit a website, your computer sends a request over these wires
to a server. A server is where websites are stored, and it works a lot like
your computer's hard drive. Once the request arrives, the server retrieves
the website and sends the correct data back to your computer.
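
The request and response exchange described above can be tried out on a single computer. The sketch below is a minimal illustration using Python's standard library: it starts a tiny web server on the local machine, sends it one request, and reads the reply. The page content, the handler class name, and the function fetch_once are assumptions made for this example; real websites run on dedicated server software.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# The "website" stored on our demo server.
PAGE = b"<html><body>Hello from the server</body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the same small page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_once():
    """Start the server, send one request to it, and return the reply."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Bypass any system proxy so the request goes straight to our server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"http://127.0.0.1:{port}/") as response:
            return response.status, response.read()
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_once()
print(status, body.decode())
```

The status code 200 means the server found the page and sent it back successfully.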

Applications of Internet:

1. Communication
Communication refers to exchanging ideas and thoughts between or among
people to create understanding. On the Internet, this takes place through
services such as email, instant messaging, and voice or video calls. The
communication process involves the elements of source, encoding, channel,
receiver, decoding, and feedback. In organizations, both formal and
informal communications take place simultaneously. Formal communications
are official communications in the form of orders, notes, circulars,
agenda, minutes, etc. Apart from formal communications, informal
grapevine communications also exist. Informal communications are usually
in the form of rumors, whispers, etc. They are unofficial, unrecorded, and
spread very fast.

2. Web Browsing
Web Browsing is one of the applications of the internet. A web browser is a
program that helps the user to interact with all the data in the WWW
(World Wide Web). There are many web browsers present in today's
world. Some of them are as follows:
• Google Chrome
• Firefox
• Safari
• Internet Explorer
• Microsoft Edge

3. Online Shopping
The Internet has turned shopping into a new kind of market, where many
virtual shops are open 24x7. The shops provide all the necessary details
of a product on their website, so the user can choose as per their needs.

4. Real-Time Update
The Internet makes it easy to get updates on things happening in real time
in any part of the world, for example in sports, politics, business, and
finance. Such real-time updates support many decisions.

5. Social Media
Many young people spend much of their free time on social media, which
the Internet makes possible. Social media is a place where users can
communicate with anyone, such as friends, family, and classmates. Users
can promote their businesses on social media as well. You can also share
your thoughts, pictures and videos with your friends on social media.

6. Job Search
The Internet has transformed the job market. Candidates can search for
suitable jobs and apply for them online. Companies also post their
openings on the Internet and hire candidates whose skills match the job
role.
Many platforms are dedicated to this. Some of them are listed below.
• LinkedIn
• Monster.com
• Naukri.com
• Indeed
• Glassdoor
• Upwork

7. Education
The Internet has a vital role in the education field. It has become an
effective tool in both teaching and learning. Teachers can upload their
notes or learning videos to websites with the help of the Internet. This
makes the learning process more diverse and enjoyable.

8. Travel
Users can easily search for their favorite tourist places worldwide and plan
their trips. One can book holiday trips, cabs, hotels, flight tickets, clubs, etc.,
with the help of the Internet. Some websites that provide these facilities are
as follows:
• goibibo.com
• makemytrip.com
• olacabs.com

9. Stock Market Update
A stock market update refers to the latest information and news related to
the financial markets, particularly the stock market. The stock market is
where individuals buy and sell shares of publicly traded companies. Stock
market updates include vital data and statistics, like the current levels
of major indices, individual stock prices, trading volumes, market
capitalization, and price movements.

10. Video Conferencing
Video conferencing means using computers to provide a video link between
two or more people. It allows users in different locations to hold
face-to-face meetings. Unlike a telephone call, participants can see as
well as hear each other. Video conferencing is a widely accepted mode of
communication among businesses, households, and other organizations.

ICT
Information and communications technology (ICT) is an extended term
for information technology (IT) that stresses the role of unified
communications and the integration of telecommunications (telephone
lines and wireless signals) and computers, as well as the necessary
enterprise software, middleware, storage and audiovisual systems, that
enable users to access, store, transmit, understand and manipulate
information.

ICT is also used to refer to the convergence of audiovisual and telephone
networks with computer networks through a single cabling or link system.
There are large economic incentives to merge the telephone networks with
the computer network system using a single unified system of cabling,
signal distribution, and management. ICT is an umbrella term that includes
any communication device, encompassing radio, television, cell phones,
computer and network hardware, satellite systems and so on, as well as the
various services and appliances associated with them, such as video
conferencing and distance learning. ICT also includes analog technology,
such as paper communication, and any mode that transmits communication.
In health care
Applications of ICTs in health care include:

• Telehealth
• Artificial intelligence in healthcare
• Use and development of software for COVID-19 pandemic
mitigation
• mHealth
• Clinical decision support systems and expert systems
• Health administration and hospital information systems
• Other health information technology and health informatics

In science
Applications of ICTs in science, research and development, and academia
include:

• Internet research
• Online research methods
• Science communication and communication between scientists
• Scholarly databases
• Applied metascience
Models of access
Scholar Mark Warschauer defines a "models of access" framework for
analyzing ICT accessibility. In the second chapter of his book, Technology
and Social Inclusion: Rethinking the Digital Divide, he describes three
models of access to ICTs: devices, conduits, and literacy. Devices and
conduits are the most common descriptors for access to ICTs, but they are
insufficient for meaningful access to ICTs without the third model of
access, literacy. Combined, these three models roughly incorporate all
twelve of the criteria of "Real Access" to ICT use, conceptualized by a
non-profit organization called Bridges.org in 2005:

1. Physical access to technology
2. Appropriateness of technology
3. Affordability of technology and technology use
4. Human capacity and training
5. Locally relevant content, applications, and services
6. Integration into daily routines
7. Socio-cultural factors
8. Trust in technology
9. Local economic environment
10. Macro-economic environment
11. Legal and regulatory framework
12. Political will and public support

IT ACT
The Information Technology Act, 2000 (also known as ITA-2000, or the IT
Act) is an Act of the Indian Parliament (No 21 of 2000) notified on 17
October 2000. It is the primary law in India dealing with cybercrime and
electronic commerce. The Act provides a legal framework for electronic
governance by giving recognition to electronic records and digital
signatures. It also defines cyber-crimes and prescribes penalties for
them. The Act directed the formation of a Controller of Certifying
Authorities to regulate the issuance of digital signatures. It also
established a Cyber Appellate Tribunal to resolve disputes arising from
this new law. The Act also amended various sections of the Indian Penal
Code, 1860, the Indian Evidence Act, 1872, the Banker's Books Evidence
Act, 1891, and the Reserve Bank of India Act, 1934 to make them compliant
with new technologies.

SYSTEM SECURITY
System security encompasses a wide range of practices and technologies
designed to safeguard computer systems, including hardware, software,
networks, and data. It involves implementing various security controls,
such as firewalls, antivirus software, access controls, and encryption, to
defend against potential threats.
Firewalls, one of the fundamental components of system security, act as a
barrier between a trusted internal network and untrusted external
networks, filtering incoming and outgoing network traffic based on
predetermined security rules. They help prevent unauthorized access to a
network and protect against various types of cyber threats, such as
malware, hacking attempts, and denial-of-service attacks.
Antivirus software is another essential tool in system security. It scans files
and programs for known patterns of malicious code, preventing malware
from infecting a system. It also provides real-time protection by monitoring
system activity and blocking suspicious or potentially harmful activities.
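
The idea of scanning for known patterns of malicious code can be sketched as follows. This is a deliberately simplified illustration: the signature names and byte patterns are invented for the example, and real antivirus products use far larger databases and additional techniques.

```python
# Hypothetical signature database: signature name -> byte pattern.
SIGNATURES = {
    "Demo.TestPattern.A": b"MALICIOUS-DEMO-PAYLOAD",
    "Demo.TestPattern.B": b"\xde\xad\xbe\xef\x13\x37",
}


def scan_bytes(data):
    """Return the names of all known signatures found in the data."""
    return sorted(
        name for name, pattern in SIGNATURES.items() if pattern in data
    )


clean_file = b"Just an ordinary text file."
infected_file = b"header...MALICIOUS-DEMO-PAYLOAD...trailer"

print(scan_bytes(clean_file))
print(scan_bytes(infected_file))
```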
Access controls are mechanisms that limit and control user access to
computer systems and resources. They ensure that only authorized
individuals can access sensitive data and perform specific actions. Access
controls can include password authentication, biometric identification, and
role-based access controls, among others.
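
A role-based access control check can be sketched as below. The roles, users, and actions are hypothetical examples for a laboratory system; the point is only that permissions are attached to roles, and users receive permissions through their role.

```python
# Hypothetical roles and the actions each role may perform.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "technician": {"read", "write"},
    "student": {"read"},
}

# Hypothetical users and their assigned roles.
USER_ROLES = {"priya": "admin", "john": "student"}


def is_allowed(user, action):
    """Return True only if the user's role grants the requested action."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("priya", "delete"))
print(is_allowed("john", "write"))
```

Unknown users have no role, so every request from them is refused.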
Encryption is a process of converting data into a format that is unreadable
to unauthorized individuals. It uses mathematical algorithms to scramble
data, making it unintelligible unless decrypted with the correct key.
Encryption is commonly used to protect sensitive information during
transmission over networks or when stored on devices.
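
The principle of scrambling data with a key and recovering it with the same key can be shown with a toy example. The sketch below uses a simple XOR with a random key as long as the message (a one-time pad). It is for illustration only; the message text is made up, and real systems should rely on vetted algorithms such as AES provided by established libraries.

```python
import secrets


def xor_bytes(data, key):
    """XOR every byte of data with the matching byte of the key."""
    if len(key) != len(data):
        raise ValueError("key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = "Patient ID 1042: result negative".encode("utf-8")
key = secrets.token_bytes(len(message))  # random key, kept secret

ciphertext = xor_bytes(message, key)    # unreadable without the key
recovered = xor_bytes(ciphertext, key)  # applying the same key decrypts

print(ciphertext.hex())
print(recovered.decode("utf-8"))
```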
The security of a system can be threatened via two violations:
• Threat: A program that has the potential to cause serious
damage to the system.
• Attack: An attempt to break security and make unauthorized use
of an asset.
Security can be compromised via any of the breaches mentioned:
• Breach of confidentiality: This type of violation involves the
unauthorized reading of data.

• Breach of integrity: This violation involves unauthorized
modification of data.
• Breach of availability: It involves unauthorized destruction of
data.
• Theft of service: It involves the unauthorized use of resources.
• Denial of service: It involves preventing legitimate use of the
system.

Security System Goal:
Integrity: The objects in the system must not be modified by any
unauthorized user; a user who does not have sufficient rights should not
be allowed to modify the important system files and resources.
Secrecy: The objects of the system must be accessible only to a limited
number of authorized users. Not everyone should be able to view the
system files.
Availability: All the resources of the system must be accessible to all the
authorized users, i.e., no single user or process should be able to hog
all the system resources. If such a situation occurs, denial of service
can happen: for example, malware might hog the resources for itself,
preventing the legitimate processes from accessing the system resources.

Threats can be classified into the following two categories:
Program Threats: A program written by a cracker to hijack the
security or to change the behavior of a normal process. In other words, if a
user program is altered and made to perform malicious, unwanted tasks,
it is known as a Program Threat.
System Threats: These threats involve the abuse of system services. They
strive to create a situation in which operating-system resources and user
files are misused. They are also used as a medium to launch program
threats.

Types of Program Threats:
Virus: The most widely known threat. It is a self-replicating,
malicious program that attaches itself to a system file and then rapidly
replicates itself, modifying and destroying essential files and leading
to a system breakdown.
Trojan Horse: A code segment that misuses its environment is called a
Trojan Horse. Trojans appear to be attractive and harmless cover programs
but are really harmful hidden programs that can be used as a virus
carrier. In one version of the Trojan, the user is fooled into entering
confidential login details in an application. Those details are stolen
by a login emulator and can be further used for information breaches.
The serious danger of a Trojan horse is that it appears at first glance
to be useful software, but once installed or run on the computer it does
real damage and turns out to be malicious.
Another variant is Spyware. Spyware accompanies a program that the
user has chosen to install. It downloads ads to display on the user’s
system, creating pop-up browser windows, and when certain sites are
visited by the user, it captures essential information and sends it to
a remote server. Such attacks are also known as Covert Channels.
Trap Door: The designer of a program or system might leave a hole in the
software that only they are capable of using. Trap Doors are quite
difficult to detect because, to analyze them, one needs to go through the
source code of all the components of the system. In other words, a trap
door is a secret entry point into a running or static program that allows
someone to gain access to a system without going through the usual
security access procedures.
Logic Bomb: A program that initiates a security attack only under a
specific situation. To be precise, a logic bomb is a malicious program
that is inserted intentionally into the computer system and is triggered
when specific conditions have been met.
Worm: A computer worm is a type of malware that replicates itself and
infects other computers while remaining active on affected systems. A
computer worm replicates itself in order to infect machines that aren’t
already infected. It frequently accomplishes this by taking advantage of
components of an operating system that are automatic and unnoticed by
the user. Worms are frequently overlooked until their uncontrolled
replication depletes system resources, slowing or stopping other
activities.

Types of System Threats:
Aside from the program threats, various system threats also endanger
the security of our system:
Worm: An infection program that spreads through networks. Unlike a
virus, it mainly targets LANs. A computer affected by a worm attacks the
target system and writes a small program, a “hook”, on it. This hook is
then used to copy the worm to the target computer. This process
repeats recursively, and soon enough all the systems of the LAN are
affected. The worm uses the spawn mechanism to duplicate itself: it
spawns copies of itself, using up a majority of system resources and also
locking out all other processes.
Port Scanning: It is a means by which the cracker identifies the
vulnerabilities of the system to attack. It is an automated process that
involves creating a TCP/IP connection to a specific port. To protect the
identity of the attacker, port scanning attacks are launched from Zombie
Systems, that is, previously independent systems that have been
compromised and still serve their owners while being used for such
malicious purposes.
Denial of Service: Such attacks aren’t aimed at collecting information
or destroying system files. Rather, they are used for disrupting the
legitimate use of a system or facility. These attacks are generally
network-based.

Security Measures Taken:
To protect the system, security measures can be taken at the following
levels:
Physical: The sites containing computer systems must be physically
secured against armed and malicious intruders. The workstations must be
carefully protected.
Human: Only appropriate users must have the authorization to access the
system. Users must guard against phishing (being tricked into revealing
confidential information) and dumpster diving (attackers collecting
discarded information in order to gain unauthorized access).
Operating system: The system must protect itself from accidental or
purposeful security breaches.
Networking System: Almost all of the information is shared between
different systems via a network. Intercepting these data could be just as
harmful as breaking into a computer. Hence, the network should be
properly secured against such attacks.

Firewall
A firewall is a network security device, either hardware or software-based,
which monitors all incoming and outgoing traffic and, based on a defined
set of security rules, accepts, rejects or drops that specific traffic.
Accept : allow the traffic.
Reject : block the traffic but reply with an “unreachable error”.
Drop : block the traffic with no reply.
A firewall establishes a barrier between secured internal networks and
outside untrusted networks, such as the Internet.
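
How a firewall applies a defined set of rules can be sketched as follows. The rule table, the addresses, and the default action are assumptions chosen for illustration; the sketch checks rules from top to bottom and uses the first one that matches, which is one common design but not the only one.

```python
import ipaddress

# Hypothetical rule table: (source network, destination port, action).
# A port of None matches any port. The first matching rule wins.
RULES = [
    (ipaddress.ip_network("10.0.0.0/8"), None, "accept"),     # internal network
    (ipaddress.ip_network("203.0.113.0/24"), None, "drop"),   # blocked network
    (ipaddress.ip_network("0.0.0.0/0"), 22, "reject"),        # no outside SSH
    (ipaddress.ip_network("0.0.0.0/0"), 443, "accept"),       # allow HTTPS
]
DEFAULT_ACTION = "drop"  # anything not explicitly allowed is dropped


def decide(source_ip, dest_port):
    """Return 'accept', 'reject' or 'drop' for one incoming connection."""
    address = ipaddress.ip_address(source_ip)
    for network, port, action in RULES:
        if address in network and (port is None or port == dest_port):
            return action
    return DEFAULT_ACTION


print(decide("10.1.2.3", 80))
print(decide("198.51.100.7", 22))
print(decide("198.51.100.7", 8080))
```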
Advantages of using Firewall
1. Protection from unauthorized access: Firewalls can be set up
to restrict incoming traffic from particular IP addresses or
networks, preventing hackers or other malicious actors from
easily accessing a network or system.
2. Prevention of malware and other threats: Firewalls can be set
up to block traffic linked to known malware or other security
concerns, assisting in the defense against these kinds of attacks.
3. Control of network access: By limiting access to specified
individuals or groups for particular servers or applications,
firewalls can be used to restrict access to particular network
resources or services.
4. Monitoring of network activity: Firewalls can be set up to
record and keep track of all network activity. This information is
essential for identifying and investigating security problems and
other kinds of suspicious behavior.
5. Regulation compliance: Many industries are bound by rules
that demand the usage of firewalls or other security measures.
Organizations can comply with these rules and avoid fines or
penalties by using a firewall.
6. Network segmentation: By using firewalls to split up a bigger
network into smaller subnets, the attack surface is reduced and
the security level is raised.


CLOUD COMPUTING
Cloud computing is the on-demand availability of computing resources
(such as storage and infrastructure) as services over the internet. It
eliminates the need for individuals and businesses to manage physical
resources themselves, and they pay only for what they use.
Types of cloud computing deployment models
Public cloud: Public clouds are run by third-party cloud service providers.
They offer compute, storage, and network resources over the internet,
enabling companies to access shared on-demand resources based on their
unique requirements and business goals.

Private cloud: Private clouds are built, managed, and owned by a single
organization and privately hosted in their own data centers, commonly
known as “on-premises” or “on-prem.” They provide greater control,
security, and management of data while still enabling internal users to
benefit from a shared pool of compute, storage, and network resources.

Hybrid cloud: Hybrid clouds combine public and private cloud models,
allowing companies to leverage public cloud services and maintain the
security and compliance capabilities commonly found in private cloud
architectures.

Google Docs

1. Cloud-Based Access and Storage:
- Google Docs operates entirely in the cloud, meaning documents are
stored remotely on Google's servers.
- Users can access their documents from any device with an internet
connection by simply logging into their Google account.
- This cloud-based approach eliminates the need for manual file transfers
and ensures that documents are always up to date across devices.
2. Real-Time Collaboration:
- Multiple users can work on the same document simultaneously, with
changes being instantly reflected for all collaborators.
- Each user's edits are color-coded, allowing collaborators to see who
made specific changes.

- Real-time collaboration extends to comments and suggestions, fostering
efficient communication and collaboration among team members.
3. Revision History:
- Google Docs maintains a detailed revision history of each document,
recording every change made by collaborators.
- Users can access the revision history to view previous versions of the
document, including who made each change and when.
- This feature provides transparency and accountability, allowing users to
track the evolution of a document and revert to earlier versions if
necessary.
4. Commenting and Suggesting:
- Users can leave comments anywhere within a document to provide
feedback, ask questions, or initiate discussions.
- Comments can be resolved once addressed, keeping the document clean
and organized.
- The "Suggesting" mode allows collaborators to make proposed edits
without directly modifying the original text.
- Authors can review and accept/reject suggestions individually or
collectively, giving them control over the final version of the document.
5. Offline Editing:
- Google Docs offers offline access through the use of the Google Docs
Offline extension for Chrome.
- Users can enable offline editing for specific documents, allowing them to
work without an internet connection.
- Changes made offline are automatically synced with the cloud once the
user reconnects to the internet, ensuring seamless collaboration and
version control.
6. Templates:
- Google Docs provides a wide range of professionally designed templates
for various document types, including resumes, letters, reports, and more.
- Templates serve as starting points for document creation, offering pre-
formatted layouts and designs that users can customize to their needs.
- This feature saves time and helps users create polished documents
without starting from scratch.

7. Integration with Google Drive:
- Google Docs seamlessly integrates with Google Drive, allowing users to
store, organize, and share their documents.
- Documents created in Google Docs are automatically saved to Google
Drive, where they can be accessed alongside other files and folders.
- Integration with Google Drive enables easy sharing and collaboration,
with customizable sharing settings to control access permissions.
8. Formatting Tools:
- Google Docs offers a wide range of formatting options for text, including
font styles, sizes, colors, and alignment.
- Users can apply formatting to individual characters, paragraphs, or
entire sections of text, using familiar tools such as bold, italics, underline,
and more.
- In addition to text formatting, Google Docs provides tools for inserting
and formatting images, tables, hyperlinks, and other elements to enhance
the visual appeal and functionality of documents.
9. Add-ons and Extensions:
- Google Docs supports a variety of add-ons and extensions that extend its
functionality beyond basic word processing.
- Add-ons can be installed from the Google Workspace Marketplace and
provide additional features such as citation management, language
translation, document signing, and more.
- These add-ons allow users to tailor Google Docs to their specific needs
and workflow, enhancing productivity and efficiency.
10. Accessibility Features:
- Google Docs includes features designed to improve accessibility for
users with disabilities.
- This includes support for screen readers, keyboard shortcuts, and
adjustable text size and contrast settings.
- Accessibility features help ensure that documents created in Google
Docs are accessible to all users, regardless of their abilities.

Google Scholar

What does Google Scholar include?

Google Scholar (www.scholar.google.com) provides a simple way to broadly
search for scholarly literature. From one place, you can search across many
disciplines and sources: articles, theses, books, abstracts and court
opinions, from academic publishers, professional societies, online
repositories, universities and other web sites. Google Scholar helps you find
relevant work across the world of scholarly research.

Google Scholar indexes only scholarly material such as books, journal
articles, and similar resources; unlike the general Google Search engine, it
does not index ordinary websites.

It is most complete for the STEM (science, technology, engineering, and
mathematics) and medical literature. It is also fairly comprehensive in the
Social Sciences (such as Education and Counseling).

It has the fewest indexed articles in the Humanities, including Religion and
Biblical Studies. (See the “Metrics” link at the top to show the major
disciplines and the most highly indexed journals in each discipline.) It also
tends to include more recent literature rather than pre-1990 literature
because this older literature has often never been digitized and put on the
web.

Although it also contains patent records, court cases, and legal documents,
those are not discussed here.

What are some advantages of using Google Scholar?

1. In addition to showing resources like journal articles in our subscription
databases, it also shows free “open access” and gray literature items (like
conference proceedings, organization white papers, etc.) found on the web.
The open access movement is increasing in popularity (e.g., Liberty’s
Digital Commons). Some of the items found in Google Scholar are not
available in our subscription databases (such as EbscoHost or ProQuest
platforms).

2. If you choose the Liberty University Library in your initial settings it will
point to journal articles (Get it @ LU) and search for books in WorldCat
(Library Search). (If you don’t see Get it @ LU, check under the “More”
links.)

3. The default sort for results is by relevance ranking. Articles that are
cited the most by others show up higher in the rankings. The relevance
ranking in our subscription databases is often determined by the number of
times the search term(s) is found in the metadata. Thus Google Scholar can
be helpful in finding key or seminal authors on a topic because they will be
the most cited.

4. It shows who has cited each work so that you can trace patterns of
research. If the older, original article is helpful, it is likely that at least some
of the more recent articles that cite the older article will also be helpful in
your research.

5. It provides suggested machine-generated citations in the three format
styles used by Liberty University (APA, MLA, and Turabian
Notes/Bibliography style).

6. If you are a published author (even in Digital Commons) you can trace
those who cite your work.

7. Like regular Google, it can be more “forgiving” than our subscription
databases. So, if you are looking for a particular article but know only
partial information, it may still find the article from those incomplete
details.

Finding recent papers

Your search results are normally sorted by relevance, not by date.

To find newer articles, try the following options in the left sidebar:

1. Click "Since Year" to show only recently published papers, sorted by
relevance;

2. Click "Sort by date" to show just the new additions, sorted by date;

3. Click the envelope icon to have new results periodically delivered by
email as they are added to Google Scholar.

Locating the full text of an article

Abstracts are freely available for most of the articles. However, reading the
entire article may require a subscription database.

Google Sheets

Google Sheets is a free, web-based spreadsheet application that is provided
by Google within the Google Drive service. Google Sheets allows users to
edit, organize, and analyze different types of information. It supports
collaboration: multiple users can edit and format files in real time, and
any changes made to the spreadsheet can be tracked through the revision
history.

Features of Google Sheets

1. Editing

One of the key features of Google Sheets is that it allows collaborative
editing of spreadsheets in real time. Rather than emailing one document to
multiple people, a single document can be opened and edited by multiple
users simultaneously. Users can see every change made by other
collaborators, and all changes are automatically saved to Google servers.

Google Sheets also includes a sidebar chat feature that allows collaborators
to discuss edits in real-time and make recommendations on certain
changes. Any changes that the collaborators make can be tracked using the
Revision History feature. An editor can review past edits and revert any
unwanted changes.

2. Explore

The Explore feature in Google Sheets was first introduced in September
2016, and it uses machine learning to provide additional functionality. It
generates insights from the data added to the spreadsheet, and it updates
automatically depending on the selected data.

With the Explore feature, users can ask questions, build charts, visualize
data, create pivot tables, and format the spreadsheet with different colors.
For example, if you are preparing a monthly budget and you’ve added all
the expenses to the spreadsheet, you can use the Explore feature to get the
cost of specific expenses such as food, travel, clothing, etc.

On the sidebar, there is a box where you can type the question, and it will
return the answer. When you scroll down further in the Explore panel,
there is a list of suggested graphs that are representative of the data
entered in the spreadsheet, and you can choose between a pivot table, pie
chart, or bar chart.

3. Offline editing

Google Sheets supports offline editing, and users can edit the spreadsheet
offline either on desktop or mobile apps. On the desktop, users need to use
the Chrome browser and install the “Google Docs Offline” Chrome
extension to enable offline editing for Google Sheets and other Google
applications. On mobile, users need the Google Sheets
mobile app for Android or iOS, which supports offline editing.

4. Supported file formats

Google Sheets supports multiple spreadsheet file formats and file types.
Users can open, edit, save or export spreadsheets and document files into
Google Sheets. Some of the formats that can be viewed and converted to
Google Sheets include:

• .xlsx
• .xls
• .xlsm
• .xlt
• .xltx
• .xltm
• .ods
• .csv
• .tsv
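
The last two formats in the list are plain text: .csv separates values with commas and .tsv separates them with tabs. As a minimal sketch of that difference, the Python snippet below converts tab-separated text to comma-separated text using only the standard library. The function name `tsv_to_csv` and the sample expense data are illustrative assumptions and are not part of Google Sheets.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-separated text into comma-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    # The writer quotes any field that itself contains a comma.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Hypothetical sample data: two columns, one value containing a comma.
sample_tsv = "item\tcost\nfood\t120\ntravel, local\t45\n"
print(tsv_to_csv(sample_tsv))
```

Either form of the data can then be uploaded and opened as a spreadsheet.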

5. Integration with other Google products

Google Sheets can be integrated with other Google products such as Google
Forms, Google Finance, Google Translate, and Google Drawings. For
example, if you want to create a poll or questionnaire, you can enter the
questions in Google Forms and then collect the responses in Google
Sheets.

How to Use Google Sheets

Google Sheets is a free-to-use application that can be accessed in a web
browser such as Chrome or through the Google Sheets app on Android or iOS.
Users need a free Google account to get started. To create a new Google
Sheets spreadsheet, follow these steps:

1. Go to the Google Drive Dashboard, and click the “New” button on the
top left corner, and select Google Sheets.
2. Open the menu bar in the spreadsheet window, go to File then New. It
will create a blank spreadsheet.

To rename the spreadsheet, click on the field on the top left corner, which
is titled “Untitled spreadsheet” and type in your preferred name. When a
new Google spreadsheet is created, it is automatically saved in the root
folder of your drive. To move the spreadsheet to a different folder, click and
hold the file, and drag it to the preferred folder.

Common Terms

The following are some of the common terms associated with Google
spreadsheets:

• Cell: A single data point, located at the intersection of a column and a
row.
• Column: A vertical range of cells that runs down from the top of the
sheet.
• Row: A horizontal range of cells that runs across from the left side of
the sheet.
• Range: A selection of multiple cells that runs across a column, row, or
a combination of both.
• Function: A built-in feature in Google Sheets that is used to calculate
values and manipulate data.
• Formula: A combination of functions, columns, rows, cells, and
ranges that is used to obtain a specific end result.
• Worksheet: A set of columns and rows that makes up one sheet within a
spreadsheet.
• Spreadsheet: The entire document, which contains one or more worksheets.
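
To make these terms concrete: in the formula =SUM(A1:A3), SUM is the function and A1:A3 is the range, where each cell is named by its column letter followed by its row number. The short Python sketch below shows how such references map to row and column positions. It illustrates the naming scheme only and is not how Google Sheets is implemented; all function names are hypothetical.

```python
import re


def column_index(letters: str) -> int:
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_cell(ref: str) -> tuple[int, int]:
    """Split a cell reference such as 'B3' into (row, column), both 1-based."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    return int(match.group(2)), column_index(match.group(1))


def cells_in_range(ref: str) -> list[tuple[int, int]]:
    """List every (row, column) pair covered by a range such as 'A1:B2'."""
    start, end = ref.split(":")
    (r1, c1), (r2, c2) = parse_cell(start), parse_cell(end)
    return [
        (r, c)
        for r in range(min(r1, r2), max(r1, r2) + 1)
        for c in range(min(c1, c2), max(c1, c2) + 1)
    ]


print(parse_cell("B3"))         # (3, 2): row 3, column 2
print(cells_in_range("A1:B2"))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```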

Google Meet
Google Meet is a video conferencing platform developed by Google, offering
various features for online meetings and collaboration.
1. Video Conferencing: Google Meet allows users to conduct video meetings
with participants from around the world. It supports both one-on-one
meetings and group meetings with up to hundreds of participants
(depending on the plan).
2. High-Quality Video and Audio: Meet offers high-definition video and clear
audio quality for a smooth meeting experience. It automatically adjusts the
video resolution based on your network connection to ensure the best
possible quality.
3. Screen Sharing: Users can share their entire screen or specific
applications or tabs with other participants during a meeting. This feature
is useful for presentations, demonstrations, and collaborative work.
4. Real-Time Captions: Google Meet provides real-time captions during
meetings, helping participants follow along even if they have difficulty
hearing or understanding.

5. Live Streaming: With Google Meet, you can live stream your meetings to a
large audience. This feature is particularly useful for webinars, virtual
events, and presentations.
6. Recording: Google Meet allows users to record their meetings for later
reference or for participants who couldn't attend. The recordings are
automatically saved to Google Drive and can be shared with others.
7. Integration with Google Workspace: Meet is fully integrated with other
Google Workspace apps like Gmail, Google Calendar, and Google Drive. This
integration makes it easy to schedule meetings, join calls directly from
Calendar events, and access shared files during meetings.
8. Security and Privacy: Google Meet offers various security features to
protect your meetings, including encryption of data in transit and in
storage, meeting codes to prevent unauthorized access, and controls for
meeting hosts to manage participants and permissions.
9. Virtual Backgrounds: Users can choose virtual backgrounds to customize
their appearance during meetings. This feature allows you to hide your
actual background or add a professional touch to your video calls.
10. Participant Management: Meeting hosts have control over participants,
with options to mute/unmute participants, remove participants, and
control who can present during the meeting.
11. Breakout Rooms: Google Meet supports breakout rooms,
allowing meeting hosts to split participants into smaller groups for focused
discussions or activities, then easily bring them back to the main meeting.
12. Polls and Q&A: Google Meet offers built-in polling and Q&A features,
enabling meeting hosts to gather feedback or engage participants in
interactive sessions.
Other Features:
Video Quality and Bandwidth Management:
Google Meet automatically adjusts the video resolution and frame rate
based on your network connection to ensure optimal video quality while
minimizing bandwidth usage.
It supports up to 1080p resolution for video calls, depending on the user's
device and network capabilities.
Audio Enhancements:
Meet uses advanced audio processing algorithms to reduce background
noise and echo, providing clear and crisp audio during meetings.
It supports stereo audio for a more immersive audio experience, especially
beneficial for larger meetings or presentations.
Advanced Settings for Presenters:
Presenters in Google Meet have access to advanced settings, such as
adjusting the layout to focus on the presenter's video or content being
shared.
They can also pin or spotlight specific participants to ensure their video
remains prominent throughout the meeting.
Interactive Whiteboard:
Google Meet offers an integrated digital whiteboard feature, allowing
participants to collaborate in real-time by drawing, annotating, and
brainstorming ideas together.
This feature is particularly useful for remote teams or educational
purposes, facilitating visual communication and collaboration.
Live Reactions and Emojis:
Participants can use live reactions and emojis during meetings to express
themselves non-verbally, providing a way to convey emotions or feedback
without interrupting the flow of the conversation.
This feature enhances engagement and interaction among participants,
making meetings more dynamic and interactive.
Hand Raise and Attendance Tracking:
Google Meet includes a hand raise feature, allowing participants to indicate
when they have something to say or ask without interrupting the speaker.
Meeting hosts can also track attendance and participation, making it easier
to manage large meetings or track engagement over time.
Third-Party Integrations:
Google Meet integrates seamlessly with third-party applications and
services, such as productivity tools, project management platforms, and
learning management systems.
This allows users to streamline workflows, access additional features, and
enhance collaboration within the Google Meet environment.

Accessibility Features:
Google Meet is committed to accessibility, with features such as keyboard
shortcuts, screen reader support, and adjustable captions to accommodate
users with disabilities.
It also offers live captioning in multiple languages, making meetings more
inclusive and accessible to participants from diverse backgrounds.
Customization and Branding:
Google Meet allows organizations to customize their meeting experience
with branded backgrounds, logos, and custom meeting URLs.
This feature enables businesses to maintain brand consistency and
professionalism during video calls, enhancing their corporate identity.
Advanced Security Controls:
Google Meet offers advanced security controls for meeting hosts, including
the ability to lock meetings, prevent participants from joining before the
host, and restrict screen sharing to specific users.
Meeting data is encrypted in transit by default, and some Google Workspace
editions additionally offer client-side encryption, helping protect the
confidentiality and privacy of sensitive information shared during calls.

MS Teams
Microsoft Teams was released in 2017 and has since proven to be an
exceptionally popular addition to Microsoft's suite of online services. Teams
is an online collaboration service available as a part of Microsoft 365 and as a
free service.
Left sidebar menu
The left sidebar menu of the Teams interface holds icons for different major
work areas in Teams. By default, you’ll have Activity, Chat, Teams,
Calendar, Calls, and Files. These are all called “apps” for want of a better
word.
This menu is customizable and can be changed by you or by IT. Changes that
you make are persistent and won’t be seen by others.
You can remove any icon you don’t want by right-clicking it and then
selecting Unpin, or move an icon by clicking and holding it and then
dragging it to the desired location. You can add many applications, such as
OneNote, to the left sidebar by clicking on the Apps icon at the bottom of
the sidebar.
Teams
Clicking on Teams on the left-sidebar lists teams where you are a member.
You can contribute right away to any of these teams by clicking on Posts and
typing in the text box at the bottom. You are automatically made a member
of some Teams created by the company or Teams administrators.

Discover and join Teams
To discover Teams that you can join, click Join or create a team at the
bottom of the Teams list. You may see a lock icon next to some channels,
meaning that they are invitation only. Other Teams may be hidden by the Team
owner and won’t be visible in the interface.

Add your picture
Personalize your profile picture by clicking on the user profile icon in the
upper right corner and adding a profile picture.
Features:
Chat: Teams allows users to communicate via text-based chat in individual
or group conversations. Users can send messages, emojis, GIFs, and
stickers. Conversations are threaded, making it easy to follow discussions.

Meetings: Teams provides robust meeting features, including scheduled
and ad-hoc video conferences. Meetings support video, audio, and screen
sharing. Users can schedule meetings directly within Teams and invite
participants.
Channels: Teams organizes conversations into channels based on topics,
projects, or departments. Channels can be public or private, allowing for
focused discussions among specific team members.
File Sharing and Collaboration: Users can share files within Teams, allowing
team members to collaborate on documents, spreadsheets, presentations,
and other files in real-time. Integration with OneDrive and SharePoint
ensures seamless file access and version control.
Integration with Office 365: Teams integrates with other Office 365
applications such as Word, Excel, PowerPoint, and Outlook. Users can
create, edit, and share Office documents directly within Teams, enhancing
productivity and collaboration.
Apps and Integrations: Teams supports a wide range of third-party apps
and integrations, allowing users to customize their workspace and
streamline workflows. Popular integrations include Trello, Asana,
Salesforce, and more.
Meeting Recording: Teams allows users to record meetings for later
reference or sharing with team members who couldn't attend. Recordings
are saved to OneDrive or SharePoint (older versions of Teams saved them to
Microsoft Stream) for easy access.
Collaborative Whiteboard: Teams includes a collaborative whiteboard
feature, allowing participants to brainstorm, sketch ideas, and visualize
concepts together in real-time during meetings.
Presence and Status: Users can set their availability status to indicate
whether they are available, busy, away, or offline. Presence indicators help
team members know when colleagues are online and accessible.
Security and Compliance: Teams prioritizes security and compliance with
features such as data encryption, multi-factor authentication, compliance
with regulations like GDPR and HIPAA, and built-in data loss prevention
(DLP) capabilities.
Customizable Notifications: Users can customize notification settings to
control how they receive alerts for new messages, mentions, or activity in
channels and conversations.

Mobile and Desktop Apps: Teams is available as a desktop application for
Windows and Mac, as well as mobile apps for iOS and Android devices,
ensuring users can stay connected and productive from anywhere.
Guest Access: Teams allows external users to be invited as guests, enabling
collaboration with clients, vendors, or partners outside the organization
while maintaining control over permissions and data security.

Zoom Scheduling
Zoom Video Conferencing is a popular platform for hosting virtual
meetings, webinars, and online events. It offers a range of features that
make communication and collaboration easy, including:
1. High-quality video and audio: Zoom provides high-definition video and
crystal-clear audio to ensure a seamless meeting experience.
2. Screen sharing: Participants can share their screens with others, making
it easy to present slideshows, documents, or software demonstrations.
3. Chat: Zoom includes a chat feature that allows participants to send text
messages to the entire group or privately to individuals.
4. Recording: Meetings can be recorded for later reference or sharing with
those who couldn't attend. The recordings can include video, audio, and
screen sharing.
5. Virtual backgrounds: Zoom allows users to set virtual backgrounds,
which can help maintain privacy or add a touch of professionalism to
meetings.
6. Gallery view and speaker view: Participants can choose between gallery
view, which displays multiple participants simultaneously, or speaker view,
which highlights the person currently speaking.
7. Host controls: Hosts have access to a variety of controls, including muting
participants, disabling video, and managing screen sharing permissions.
8. Security features: Zoom offers several security features to ensure the
safety of meetings, such as meeting passwords, waiting rooms, and end-to-
end encryption.
9. Integration: Zoom integrates with various other tools and platforms, such
as calendars, messaging apps, and productivity software.
10. Accessibility: Zoom provides features to improve accessibility, such as
closed captioning, screen reader support, and keyboard shortcuts.
Advantages:
1. User-friendly interface: Zoom's interface is intuitive and easy to navigate,
making it simple for users to join meetings, access features, and manage
settings.
2. Cross-platform support: Zoom is available on various devices and
operating systems, including Windows, macOS, Linux, iOS, and Android.
This ensures that participants can join meetings from their preferred
devices.
3. Multiple meeting formats: In addition to standard video meetings, Zoom
offers various meeting formats, including webinars, breakout rooms, and
virtual events. This versatility makes it suitable for different types of
gatherings and collaborations.
4. Interactive features: Zoom provides interactive features such as polling,
Q&A sessions, and hand-raising, enabling engaging and participatory
meetings.
5. Integration with productivity tools: Zoom integrates with popular
productivity tools like Google Calendar, Microsoft Outlook, Slack, and
others, streamlining the scheduling and joining process.
6. Customizable settings: Hosts can customize meeting settings according to
their preferences and requirements. This includes adjusting audio and
video settings, enabling or disabling specific features, and controlling
participant permissions.
7. Cloud storage: Zoom offers cloud storage for recorded meetings, making
it convenient to access and share recordings securely.
8. Advanced features for business users: For businesses and organizations,
Zoom provides advanced features such as single sign-on (SSO), enterprise-
level security controls, and reporting and analytics.
9. Global availability: Zoom has data centers located worldwide, ensuring
reliable performance and low-latency connections for users across different
geographic regions.
10. Customer support: Zoom provides customer support through various
channels, including email, chat, and phone, to assist users with any
questions or issues they may encounter.

To schedule a meeting on Zoom, follow these steps:
1. Sign in to your Zoom account: Go to the Zoom website and log in with
your credentials.
2. Schedule a meeting: Once you're logged in, click on the "Schedule a
Meeting" option.
3. Enter meeting details: Fill in the meeting details such as topic,
description, date, time, duration, and other settings according to your
preferences.
4. Invite participants: You can invite participants by adding their email
addresses in the "Invitees" field. You can also copy the meeting invitation
link and send it to participants via email or other communication channels.
5. Set meeting options: Configure meeting options such as requiring a
password to join, enabling waiting room, allowing participants to join
before the host, etc.
6. Save and send invitations: After entering all the necessary information,
click on the "Save" or "Schedule" button. You will then have the option to
send email invitations directly from Zoom or copy the invitation details to
share through other means.
7. Manage scheduled meetings: You can view, edit, or delete scheduled
meetings from your Zoom account dashboard.
Remember to send reminders to participants closer to the meeting time
and ensure that you have all the necessary equipment and settings
configured for a smooth meeting experience.
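
For readers who prefer to schedule meetings from a script, the details entered in steps 3 and 5 can be collected into a single request body. The Python sketch below only builds and prints that body; it does not contact Zoom. The field names (`topic`, `type`, `start_time`, `duration`, `password`, and the `waiting_room` and `join_before_host` settings) are assumed to match the meeting-creation request of Zoom's REST API and should be checked against the current Zoom documentation before use. The function name and the sample values are hypothetical.

```python
import json
from datetime import datetime


def build_meeting_request(topic, start, duration_minutes, passcode,
                          waiting_room=True, join_before_host=False):
    """Collect the scheduling details into a dictionary (assumed field names)."""
    return {
        "topic": topic,
        "type": 2,  # assumption: 2 denotes a scheduled meeting
        "start_time": start.strftime("%Y-%m-%dT%H:%M:%S"),
        "duration": duration_minutes,  # in minutes
        "password": passcode,
        "settings": {
            "waiting_room": waiting_room,
            "join_before_host": join_before_host,
        },
    }


# Hypothetical example values.
request_body = build_meeting_request(
    topic="Weekly lab meeting",
    start=datetime(2025, 1, 15, 10, 0),
    duration_minutes=45,
    passcode="abc123",
)
print(json.dumps(request_body, indent=2))
```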

LIFE SCIENCE ORIENTED LABORATORY SOFTWARE
In the life sciences, technology and scientific inquiry have become closely
intertwined. Laboratories that once relied on manual processes now
integrate advanced software solutions to enhance experimental
workflows, data analysis, and collaboration. This section explores life
science-oriented laboratory software: its significance, its diverse
applications, and its impact on research methodologies.
MATLAB
MATLAB, developed by MathWorks, is a high-performance programming
language and environment designed for numerical computing, data
analysis, and visualization. Its versatility extends across a spectrum of
scientific disciplines, making it a preferred choice for researchers and
scientists in the biological sciences. MATLAB's intuitive syntax and
extensive library of built-in functions empower users to perform intricate
computations with ease. The interactive environment, encompassing a
command-line interface and a script editor, facilitates seamless exploration
and manipulation of data.

MICROSCOPY IMAGE BROWSER:
Microscopy Image Browser (MIB) is a high-performance MATLAB-based
software package for advanced image processing, segmentation and
visualization of multi-dimensional (2D-4D) light and electron microscopy
datasets.
•Multi-Dimensional Image Stacks: The browser facilitates the visualization
of multi-dimensional image stacks, allowing researchers to explore
volumetric data over time or different imaging channels. This is particularly
useful in techniques such as confocal microscopy and time-lapse imaging.
•Interactive Navigation: Researchers can interactively navigate through
large datasets, zooming in on specific regions of interest and dynamically
adjusting parameters such as contrast and brightness. This aids in the
detailed examination of cellular and subcellular structures.
•Denoising and Filtering: The browser integrates image processing
algorithms for tasks like denoising and filtering. This is crucial for
enhancing the quality of images by reducing noise and improving signal-to-
noise ratios, especially in low-light conditions.

•Image Registration: Aligning images from different time points or imaging
modalities is simplified through the browser's image registration
capabilities. This is essential for longitudinal studies and for merging data
from multiple sources.
•3D Reconstruction: Researchers can create three-dimensional
reconstructions of biological structures based on microscopy data. This is
particularly useful for visualizing complex cellular architectures and spatial
relationships between different structures.
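
To make the denoising step above concrete, the following is a minimal sketch of a 3x3 median filter, one common noise-reduction technique, written in plain Python. It illustrates the general idea only and is not taken from MIB; the function name and the toy image are assumptions made for this example.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    `image` is a 2D list of numbers. Pixels on the border use only the
    neighbours that exist; for an even number of neighbours the lower of
    the two middle values is taken.
    """
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            new_row.append(median_low(neighbours))
        result.append(new_row)
    return result


# Toy image: a uniform background with one bright "noise" pixel in the centre.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
print(median_filter_3x3(noisy))  # the isolated bright pixel is removed
```

A median filter is used here because it removes isolated outlier pixels while preserving edges better than simple averaging.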

LOBSTER:
LOBSTER is an image analysis environment to identify biological objects in
microscopy images, and measure their spatial location, geometry, dynamics
and intensity distribution. The objects can be exported as 2D/3D models,
for instance for exploration and editing in external software, or for the
simulation of the processes under study. Multiple images can be processed
in one go and image size is not limited by the main memory of the
workstation.
• Wound Healing Assays: LOBSTER facilitates the tracking of individual cells
during wound healing assays, providing quantitative data on migration
rates, directionality, and cell interactions. This is crucial for understanding
the mechanisms involved in tissue repair and regeneration.
•Immune Cell Dynamics: In immunology studies, LOBSTER can track
immune cell movements within tissues or in response to stimuli. This is
valuable for investigating immune response dynamics, cell trafficking, and
interactions with pathogens.
•Mitosis Analysis: LOBSTER's tracking capabilities enable the study of
mitotic events, tracking individual cells through the stages of mitosis. This
is essential for understanding cell division dynamics, identifying
aberrations, and exploring potential targets for anti-cancer therapies.
•Cell Cycle Progression: Researchers can utilize LOBSTER to analyze the
progression of cells through the cell cycle. Quantitative data on cell cycle
duration, synchronization, and checkpoints contribute to a deeper
understanding of cell biology.
•Cell-Based Drug Screening: LOBSTER is instrumental in cell-based drug
screening assays, tracking the effects of compounds on cellular behavior.
This is essential for identifying potential drug candidates, assessing drug
toxicity, and understanding cellular responses to pharmacological agents.
•Phenotypic Profiling: Researchers can use LOBSTER for phenotypic
profiling, analyzing the morphological changes induced by drugs. This
approach aids in characterizing drug effects on cellular structures and
identifying compounds with specific phenotypic outcomes.

NIRFAST:
NIRFAST is a computational tool that leverages the principles of near-
infrared light propagation through tissues. Near-infrared light penetrates
biological tissues more effectively than visible light, making it suitable for
non-invasive imaging applications. NIRFAST provides researchers with a
platform to simulate and reconstruct optical properties of tissues, enabling
the visualization and quantification of biological structures and functions.
•Optical Property Simulation: NIRFAST allows users to simulate the
propagation of near-infrared light through complex tissue geometries. This
simulation is based on mathematical models that take into account the
optical properties of tissues, such as absorption and scattering coefficients.
•Fluorescence Tomography: One of NIRFAST's significant applications is in
fluorescence tomography. It enables the modeling and reconstruction of
fluorescent markers within tissues. This is particularly useful in molecular
imaging and studying the distribution of contrast agents or fluorescent
probes.
•Oncology: In oncology research, NIRFAST plays a crucial role in
characterizing tumors based on their optical properties. It aids in tumor
detection, monitoring treatment responses, and guiding surgical
interventions.
•Molecular Imaging: In molecular imaging applications, NIRFAST is utilized
for studying the distribution and concentration of fluorescent molecular
probes. This is valuable in research focused on biomarker detection and
molecular targeting.
•Brain Imaging: NIRFAST has applications in functional brain imaging,
where it helps map changes in cerebral blood flow and oxygenation. This is
relevant for understanding brain activity and for potential clinical
applications in neurology.
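
The role of the absorption and scattering coefficients mentioned above can be illustrated with a short sketch. This is not NIRFAST's own interface; it only evaluates the textbook diffusion-approximation formula for an isotropic point source in an infinite homogeneous medium. The function names and the tissue-like coefficient values are assumptions chosen for illustration.

```python
import math


def effective_attenuation(mu_a, mu_s_prime):
    """Effective attenuation coefficient (1/mm) in the diffusion approximation."""
    return math.sqrt(3.0 * mu_a * (mu_a + mu_s_prime))


def fluence_point_source(r_mm, mu_a, mu_s_prime, power=1.0):
    """Fluence rate at distance r_mm from an isotropic point source
    in an infinite homogeneous medium (diffusion approximation)."""
    diffusion_coeff = 1.0 / (3.0 * (mu_a + mu_s_prime))  # in mm
    mu_eff = effective_attenuation(mu_a, mu_s_prime)
    return power * math.exp(-mu_eff * r_mm) / (4.0 * math.pi * diffusion_coeff * r_mm)


# Assumed, illustrative optical properties (1/mm); not measured values
MU_A = 0.01        # absorption coefficient
MU_S_PRIME = 1.0   # reduced scattering coefficient

for r in (5.0, 10.0, 20.0):
    phi = fluence_point_source(r, MU_A, MU_S_PRIME)
    print(f"r = {r:4.1f} mm   relative fluence = {phi:.3e}")
```

The printed values fall off rapidly with distance, which is why reconstruction software has to model both absorption and scattering rather than assume straight-line light paths.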

PYTHON
Python, a versatile and user-friendly programming language, has become
an integral tool in life science laboratories for data analysis,
modeling, and automation. This section explores the applications of
Python in life science-oriented laboratories, highlighting its significance in
various domains such as bioinformatics, computational biology, and
experimental design. Python's popularity in life science laboratories stems
from its simplicity, readability, and extensive libraries. Its open-source
nature and large community support make it an ideal choice for
researchers aiming to harness computational capabilities in their
experiments and analyses.
•Sequence Analysis: Python is widely used in
bioinformatics for DNA, RNA, and protein sequence analysis. Libraries like
Biopython offer tools for reading, manipulating, and analyzing biological
sequences, facilitating tasks such as sequence alignment, motif searching,
and the analysis of macromolecular structure files.
•Genomic Data Mining: Life scientists utilize Python for mining and
analyzing large genomic datasets. Pandas and NumPy libraries provide
efficient data manipulation and statistical analysis tools, enabling
researchers to derive meaningful insights from genomics data.
•Systems Biology: Python serves as a powerful language for building and
simulating complex biological models in systems biology. Libraries like
SciPy and BioSimPy enable the creation of mathematical models to simulate
biological processes, helping researchers understand dynamic interactions
within cellular systems.
•Biological Network Analysis: With the NetworkX library, Python facilitates
the analysis of biological networks such as protein-protein interaction
networks or metabolic pathways. Researchers can explore network
topology, identify key nodes, and study the dynamics of interconnected
biological systems.
•Pandas and NumPy: These libraries are fundamental for handling and
analyzing large genomic datasets. Researchers use Python to manipulate,
clean, and analyze genomic data efficiently, enabling tasks like variant
calling, gene expression analysis, and genomic association studies.
• Biopython: Python's Biopython library is extensively used for
bioinformatics tasks. It supports the analysis of biological sequences (DNA,
RNA, and protein) and provides tools for tasks such as sequence alignment,
motif searching, and the analysis of macromolecular structure files.
•SciPy and BioSimPy: Python facilitates systems biology modeling and
simulations. Researchers use Python to create mathematical models that
simulate biological processes, aiding in the understanding of complex
interactions within biological systems.
•Matplotlib and Seaborn: Python's data visualization libraries are applied
for creating clear and informative visualizations of biological data.
Researchers use these tools to generate plots, charts, and graphs that aid in
the interpretation and communication of results.
• Automation Scripts: Python scripts are employed for laboratory
automation, helping automate repetitive tasks, data collection, and
instrument control. Python's ease of integration with various devices and
instruments makes it a valuable tool for improving laboratory workflows.
•LIMS Integration: Python is used to integrate Laboratory Information
Management Systems (LIMS), providing a unified platform for managing
experimental data, sample information, and workflows within the
laboratory.
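
As a minimal sketch of the sequence analysis tasks listed above, the following uses only the Python standard library; Biopython provides more complete equivalents. The function names and the short DNA string are made up for illustration. The codon table is the standard genetic code written in the usual T, C, A, G order.

```python
from itertools import product

# Standard genetic code, listed in TCAG order ("*" marks a stop codon)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate from the first base, stopping at the first stop codon."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTAA"  # hypothetical toy sequence
print(round(gc_content(dna), 3))
print(reverse_complement(dna))
print(translate(dna))
```

The sketch assumes sequences contain only A, C, G, and T; real data often includes ambiguity codes such as N, which a library like Biopython handles.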

PRIMER3
Primer3 is a widely used bioinformatics tool designed for the automated
design of PCR (Polymerase Chain Reaction) primers. It aims to select
optimal primers for DNA amplification reactions, ensuring high specificity
and efficiency in various molecular biology applications.
1.PCR Experiments: Primer3 is extensively used for designing primers for
various PCR experiments, including endpoint PCR, quantitative PCR (qPCR),
and reverse transcription PCR (RT-PCR).
2.Molecular Cloning: Researchers use Primer3 to design primers for cloning
experiments, ensuring the specificity and efficiency of DNA amplification
for subsequent molecular manipulations.
3.Targeted Sequencing: In applications such as Sanger sequencing or
next-generation sequencing (NGS), Primer3 is employed to design primers for
amplifying specific target regions.
4.Genotyping Assays: Primer3 is valuable in designing primers for
genotyping studies, where specific DNA regions are amplified for
subsequent analysis.
5.Site-Directed Mutagenesis: Primer3 can be employed to design primers
for site-directed mutagenesis experiments. Researchers use this approach
to introduce specific mutations into a DNA sequence, allowing the study of
gene function and protein structure.
6.Quantitative PCR (qPCR) Assays: Primer3 is widely utilized for designing
primers for qPCR assays. The tool ensures the selection of primer pairs that
exhibit high specificity, efficiency, and minimal interference from non-
specific amplification.
7.SNP Genotyping: For studying single nucleotide polymorphisms (SNPs),
Primer3 assists in designing primers for genotyping assays. This is crucial
for population genetics studies, association studies, and identification of
genetic variations.
8.Bisulfite PCR: In epigenetic studies involving bisulfite treatment for DNA
methylation analysis, Primer3 aids in designing primers for bisulfite PCR.
This technique allows the investigation of DNA methylation status at the
single-nucleotide level.
9.Genome Walking: Primer3 can assist in designing primers for genome
walking experiments. Genome walking is a technique used to sequence
unknown regions adjacent to known sequences in a genome.
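
The kind of screening a primer design tool performs can be sketched with two simple checks: GC percentage and an approximate melting temperature. The sketch below uses the Wallace rule (2 °C per A/T, 4 °C per G/C), a rough estimate for short oligonucleotides; it is not the thermodynamic model Primer3 itself uses. The thresholds and the primer sequence are assumed rule-of-thumb values for illustration, not Primer3 defaults.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature (deg C) by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_basic_checks(primer, min_len=18, max_len=25,
                        min_gc=40.0, max_gc=60.0):
    """Assumed rule-of-thumb filter on length and GC percentage."""
    return (min_len <= len(primer) <= max_len
            and min_gc <= gc_percent(primer) <= max_gc)


candidate = "AGCTGACCTGAAGCTGATCG"  # hypothetical 20-mer
print(gc_percent(candidate), wallace_tm(candidate), passes_basic_checks(candidate))
```

A real design run also checks primer-pair compatibility, self-complementarity, and specificity against the template, which this sketch leaves out.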

LIFE SCIENCE ORIENTED HEALTHCARE SOFTWARE


AZURE
Azure is a cloud computing service created by Microsoft. It provides a
variety of cloud services, including those for compute, analytics, storage,
and networking. Azure allows users to build, deploy, and manage
applications and services through Microsoft-managed data centers.
Some key features and services offered by Azure include:
1. Virtual Machines: Azure provides scalable virtual machines for running
applications and workloads.
2. App Services: A platform-as-a-service (PaaS) offering for building,
deploying, and scaling web applications and APIs.
3. Azure SQL Database: A fully managed relational database service in the
cloud.
4. Azure Functions: Serverless computing service that allows developers to
run event-triggered code without provisioning or managing infrastructure.
5. Azure Blob Storage: Object storage service for storing large amounts of
unstructured data, such as text or binary data.
6. Azure Cognitive Services: A set of APIs and services for adding machine
learning, computer vision, natural language processing, and other AI
capabilities to applications.
7. Azure DevOps: Tools and services for development teams to plan,
collaborate, and ship software across the entire development lifecycle.
8. Azure Kubernetes Service (AKS): Managed Kubernetes container
orchestration service for deploying, managing, and scaling containerized
applications.
9. Azure Active Directory (AD, since renamed Microsoft Entra ID): Identity
and access management service for controlling user access to resources.
10. Azure Virtual Network: Allows users to create private networks in the
cloud, connect to on-premises data centers, and secure applications and
services.

Azure plays a significant role in healthcare by providing cloud-based
solutions that enable healthcare organizations to improve patient care,
streamline operations, enhance data security, and facilitate research and
innovation. Here are some ways Azure is utilized in the healthcare industry:
1. Health Data Storage and Management: Azure provides secure and
compliant cloud storage solutions for healthcare data, including electronic
health records (EHRs), medical images, and genomic data. Azure's robust
data management capabilities ensure data integrity, availability, and privacy
while complying with industry regulations such as HIPAA (Health
Insurance Portability and Accountability Act) and GDPR (General Data
Protection Regulation).
2. Healthcare Analytics and Insights: Azure's advanced analytics tools and
services enable healthcare providers to derive valuable insights from vast
amounts of clinical and operational data. By leveraging machine learning,
predictive analytics, and data visualization, organizations can identify
patterns, trends, and correlations to improve patient outcomes, optimize
resource utilization, and reduce costs.
3. Telemedicine and Remote Patient Monitoring: Azure supports
telemedicine initiatives by providing scalable infrastructure and
collaboration tools for virtual consultations, remote monitoring, and
telehealth applications. Healthcare providers can leverage Azure's secure
video conferencing, messaging, and IoT (Internet of Things) capabilities to
deliver timely and personalized care to patients, regardless of their
location.
4. Medical Imaging and Diagnostics: Azure offers specialized services for
processing, analyzing, and storing medical images, such as X-rays, MRIs, and
CT scans. Healthcare organizations can leverage Azure's AI-powered image
recognition, natural language processing (NLP), and deep learning
algorithms to automate diagnosis, detect abnormalities, and assist
radiologists in interpreting complex imaging studies.
5. Genomics and Precision Medicine: Azure provides scalable compute and
storage resources for genomic research, personalized medicine, and
population health management. By leveraging Azure's high-performance
computing (HPC) capabilities and bioinformatics tools, researchers can
analyze large genomic datasets, identify genetic variations, and develop
targeted therapies for individual patients based on their genetic makeup.
6. Drug Discovery and Clinical Trials: Azure accelerates drug discovery and
development by providing computational resources for virtual screening,
molecular modeling, and simulation-based drug design. Pharmaceutical
companies and research institutions can leverage Azure's cloud-based
infrastructure to conduct large-scale in silico experiments, optimize drug
candidates, and expedite the drug discovery process.
7. Healthcare IoT and Wearables: Azure IoT services enable the integration
of medical devices, wearables, and sensors into healthcare systems,
allowing real-time monitoring of patient vital signs, medication adherence,
and wellness metrics. By collecting and analyzing IoT data in the cloud,
healthcare providers can proactively manage chronic conditions, prevent
adverse events, and deliver personalized interventions to patients.

HOLOLENS
Microsoft HoloLens is an augmented reality (AR) headset developed and
manufactured by Microsoft. It overlays digital content onto the physical
world, enabling users to interact with holograms in their real environment.
1. Hardware: HoloLens features see-through holographic lenses, sensors
(including depth-sensing cameras and spatial mapping sensors), speakers,
microphones, and a built-in processor. The device is untethered, meaning
users can move freely without being connected to a PC.
2. Spatial Mapping and Tracking: HoloLens uses advanced sensors to map
the surrounding environment in real-time and accurately track the user's
movements and gestures. This enables precise placement and interaction
with holographic content anchored to specific physical locations.
3. Holographic Display: HoloLens projects holographic images directly onto
the user's field of view, creating a mixed reality experience where virtual
objects appear to coexist with the real world. Users can view and interact
with holograms without obstructing their view of the physical
environment.
4. Gesture and Voice Input: HoloLens supports natural interaction through
gestures, voice commands, and gaze tracking. Users can manipulate
holographic objects using hand gestures, speak voice commands to control
applications, and engage in gaze-based interactions by focusing on specific
elements.
5. Development Platform: Microsoft provides a comprehensive
development platform for creating mixed reality applications for HoloLens,
including tools, APIs, and SDKs (Software Development Kits). Developers
can build immersive experiences leveraging Unity 3D, Visual Studio, and the
HoloLens Emulator for testing and debugging.
6. Enterprise and Industrial Applications: HoloLens is used in various
industries for enterprise applications such as remote assistance, training
and simulation, 3D visualization, design and prototyping, maintenance and
repair, and medical imaging. It enables workers to access contextual
information, instructions, and guidance hands-free, improving productivity
and efficiency.
7. Education and Entertainment: HoloLens is also utilized in education and
entertainment sectors for immersive learning experiences, interactive
storytelling, virtual tours, and gaming. It provides new opportunities for
engaging and immersive content delivery, fostering creativity and
exploration.
8. Collaboration and Communication: HoloLens facilitates remote
collaboration and communication through mixed reality conferencing and
shared experiences. Users can collaborate on projects, share 3D models,
and interact with holographic content together in real-time, regardless of
their physical location.
Applications:
1. Medical Education and Training: HoloLens enables immersive and
interactive medical education experiences. Medical students can visualize
complex anatomical structures in 3D, manipulate virtual organs, and
simulate medical procedures in a realistic manner. This hands-on approach
enhances learning outcomes and helps trainees develop essential clinical
skills.
2. Surgical Planning and Visualization: Surgeons use HoloLens to visualize
patient anatomy in 3D before surgeries. By overlaying holographic images
onto the patient's body, surgeons can better understand anatomical
relationships, plan incisions and approaches, and simulate procedures. This
improves surgical precision, reduces risks, and enhances patient safety.
3. Guided Surgery and Navigation: During surgical procedures, HoloLens
provides real-time guidance and navigation assistance. Surgeons can view
patient data, imaging scans, and virtual overlays directly within their field
of view, without having to divert their attention to external screens or
monitors. This helps improve intraoperative decision-making and surgical
outcomes.
4. Remote Assistance and Consultation: HoloLens facilitates remote
assistance and collaboration among healthcare professionals. Specialists
can provide real-time guidance and support to clinicians in different
locations through mixed reality conferencing. This is particularly valuable
for telemedicine, remote consultations, and medical training programs.
5. Patient Education and Engagement: HoloLens enhances patient
education and engagement by visualizing medical conditions, treatment
options, and procedures in a personalized and interactive manner. Patients
can better understand their diagnoses, participate in shared decision-
making, and adhere to treatment plans with greater confidence.
6. Rehabilitation and Physical Therapy: HoloLens is used in rehabilitation
and physical therapy programs to create engaging and interactive exercises
for patients. By overlaying virtual objects and providing real-time feedback,
HoloLens helps motivate patients, track progress, and facilitate recovery
from injuries or disabilities.
7. Medical Imaging and Visualization: Radiologists and clinicians use
HoloLens to visualize and interact with medical imaging data, such as MRI
and CT scans, in 3D. This enables them to explore volumetric datasets,
identify abnormalities, and plan interventions with greater accuracy and
efficiency.
8. Research and Development: HoloLens supports medical research and
development efforts by enabling scientists and researchers to visualize
complex biological processes, conduct simulations, and analyze large
datasets in a collaborative environment. This accelerates discoveries,
innovation, and the development of new treatments and therapies.

UNIT 3
Forms of biological information and the need for storage. General
introduction to biological databases; nucleic acid databases (NCBI, DDBJ,
and EMBL). Protein databases (PIR, Uniprot). Specialized genome
databases: (SGD and microbial genome database-MGDB). Structure
database-PDB.

DATABASE
What is a database:
A database is a computerized archive used to store and organize data in
such a way that information can be retrieved easily via a variety of search
criteria. Databases are composed of computer hardware and software for
data management. The chief objective of the development of a database is
to organize data in a set of structured records to enable easy retrieval of
information. Each record, also called an entry, should contain a number of
fields that hold the actual data items, for example, fields for names, phone
numbers, addresses, dates. To retrieve a particular record from the
database, a user can specify a particular piece of information, called value,
to be found in a particular field and expect the computer to retrieve the
whole data record. This process is called making a query.
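
The ideas of records, fields, values, and queries described above can be sketched in a few lines. The records, field names, and the query helper below are hypothetical and chosen only for illustration.

```python
# Each record (entry) is a set of named fields holding data items
records = [
    {"name": "Person A", "phone": "555-0101", "city": "City X"},
    {"name": "Person B", "phone": "555-0102", "city": "City Y"},
    {"name": "Person C", "phone": "555-0103", "city": "City X"},
]


def query(records, field, value):
    """Return every whole record whose given field holds the given value."""
    return [record for record in records if record.get(field) == value]


matches = query(records, "city", "City X")
for record in matches:
    print(record)
```

The user supplies only a field and a value, and the whole matching records come back, which is the essence of making a query.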
Although data retrieval is the main purpose of all databases, biological
databases often have a higher level of requirement, known as knowledge
discovery, which refers to the identification of connections between pieces
of information that were not known when the information was first
entered. For example, databases containing raw sequence information can
perform extra computational tasks to identify sequence homology or
conserved motifs. These features facilitate the discovery of new biological
insights from raw data.
Types of databases:
Originally, databases all used a flat file format, which is a long text file that
contains many entries separated by a delimiter, a special character such as
a vertical bar (|). Within each entry are a number of fields separated by tabs
or commas. Except for the raw values in each field, the entire text file does
not contain any hidden instructions for computers to search for specific
information or to create reports based on certain fields from each record.
The text file can be considered a single table. Thus, to search a flat file for a
particular piece of information, a computer has to read through the entire
file, an obviously inefficient process. This is manageable for a small
database, but as database size increases or data types become more
complex, this database style can become very difficult for information
retrieval. Indeed, searches through such files often cause crashes of the
entire computer system because of the memory-intensive nature of the
operation. To facilitate the access and retrieval of data, sophisticated
computer software programs for organizing, searching, and accessing data
have been developed. They are called database management systems.
These systems contain not only raw data records but also operational
instructions to help identify hidden connections among data records. The
purpose of establishing a data structure is for easy execution of the
searches and to combine different records to form final search reports.
Depending on the types of data structures, these database management
systems can be classified into two types: relational database management
systems and object-oriented database management systems. Consequently,
databases employing these management systems are known as relational
databases or object-oriented databases, respectively.
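
The flat file layout described above, and why searching it is slow, can be sketched as follows. The file content, field names, and function are hypothetical; entries are separated by a vertical bar and fields by commas, as in the description.

```python
# Hypothetical flat file: entries separated by "|", fields by commas
FLAT_FILE = "seq1,organism A,120|seq2,organism B,95|seq3,organism A,240"
FIELD_NAMES = ("seq_id", "organism", "length")


def search_flat_file(text, field, value):
    """Linear scan: every entry must be read and split before it is tested."""
    column = FIELD_NAMES.index(field)
    hits = []
    examined = 0
    for entry in text.split("|"):
        fields = entry.split(",")
        examined += 1
        if fields[column] == value:
            hits.append(dict(zip(FIELD_NAMES, fields)))
    return hits, examined


hits, examined = search_flat_file(FLAT_FILE, "organism", "organism A")
print(f"{len(hits)} hit(s) after reading {examined} entries")
```

Because the file carries no index, the number of entries examined always equals the total number of entries, however few of them match.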
→ Relational databases:
Instead of using a single table as in a flat file database, relational databases
use a set of tables to organize data. Each table, also called a relation, is
made up of columns and rows. Columns represent individual fields. Rows
represent values in the fields of records. The columns in a table are indexed
according to a common feature called an attribute, so they can be cross-
referenced in other tables. To execute a query in a relational database, the
system selects linked data items from different tables and combines the
information into one report. Therefore, specific information can be found
more quickly from a relational database than from a flat file database.
Relational databases can be created using a special programming language
called structured query language (SQL). The creation of this type of
databases can take a great deal of planning during the design phase. After
creation of the original database, a new data category can be easily added
without requiring all existing tables to be modified. The subsequent
database searching and data gathering for reports are relatively
straightforward.
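
A small sketch of the relational approach, using Python's built-in sqlite3 module with an in-memory database. The table names, columns, and rows are hypothetical; the point is that two tables share an attribute (organism_id) and a query combines linked rows into one report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database held in memory
cur = conn.cursor()

# Two tables cross-referenced through the shared attribute organism_id
cur.execute("CREATE TABLE organisms (organism_id INTEGER PRIMARY KEY, species TEXT)")
cur.execute(
    "CREATE TABLE sequences ("
    "seq_id TEXT PRIMARY KEY, length INTEGER, "
    "organism_id INTEGER REFERENCES organisms(organism_id))"
)

cur.executemany("INSERT INTO organisms VALUES (?, ?)",
                [(1, "organism A"), (2, "organism B")])
cur.executemany("INSERT INTO sequences VALUES (?, ?, ?)",
                [("seq1", 120, 1), ("seq2", 95, 2), ("seq3", 240, 1)])

# The query selects linked items from both tables and combines them
rows = cur.execute(
    "SELECT s.seq_id, s.length, o.species "
    "FROM sequences AS s JOIN organisms AS o ON s.organism_id = o.organism_id "
    "WHERE o.species = ? ORDER BY s.seq_id",
    ("organism A",),
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Adding a new data category later would mean creating another table keyed on an existing attribute, leaving the two tables above unchanged.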

→ Object-oriented databases:
One of the problems with relational databases is that the tables used do not
describe complex hierarchical relationships between data items. To
overcome the problem, object-oriented databases have been developed that
store data as objects. In an object-oriented programming language, an
object can be considered as a unit that combines data and mathematical
routines that act on the data. The database is structured such that the
objects are linked by a set of pointers defining predetermined relationships
between the objects. Searching the database involves navigating through
the objects with the aid of the pointers linking different objects.
Programming languages like C++ are used to create object-oriented
databases.
The object-oriented database system is more flexible; data can be
structured based on hierarchical relationships. By doing so, programming
tasks can be simplified for data that are known to have complex
relationships, such as multimedia data. However, this type of database
system lacks the rigorous mathematical foundation of the relational
databases. There is also a risk that some of the relationships between
objects may be misrepresented. Some current databases have therefore
incorporated features of both types of database programming, creating the
object–relational database management system.
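
The object-oriented idea can be sketched with plain Python classes rather than a real object database. Each object bundles data with routines that act on it, and objects hold references (the "pointers") to related objects. The class names and coordinates are hypothetical.

```python
class Exon:
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def length(self):
        return self.end - self.start + 1


class Gene:
    def __init__(self, name, exons):
        self.name = name
        self.exons = exons  # references to Exon objects

    def coding_length(self):
        return sum(exon.length() for exon in self.exons)


class Genome:
    def __init__(self, organism, genes):
        self.organism = organism
        self.genes = genes  # references to Gene objects

    def find_gene(self, name):
        """Search by navigating from the genome through its linked genes."""
        for gene in self.genes:
            if gene.name == name:
                return gene
        return None


genome = Genome("organism A", [
    Gene("geneA", [Exon(1, 90), Exon(201, 260)]),
    Gene("geneB", [Exon(500, 619)]),
])

gene = genome.find_gene("geneA")
print(gene.name, gene.coding_length())
```

The genome, gene, and exon hierarchy is expressed directly in the object links, which is the kind of relationship a set of flat tables describes less naturally.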

Biological databases:
Current biological databases use all three types of database structures: flat
files, relational, and object oriented. Despite the obvious drawbacks of
using flat files in database management, many biological databases still use
this format. The justification for this is that this system involves a minimum
amount of database design and the search output can be easily understood
by working biologists. Based on their contents, biological databases can be
roughly divided into three categories: primary databases, secondary
databases, and specialized databases.
Primary databases contain original biological data. They are archives of raw
sequence or structural data submitted by the scientific community.
GenBank and the Protein Data Bank (PDB) are examples of primary databases.
Secondary databases contain computationally processed or manually
curated information, based on original information from primary
databases. Translated protein sequence databases containing functional
annotation belong to this category. Examples are Swiss-Prot and the Protein
Information Resource (PIR), the successor of Margaret Dayhoff’s Atlas of
Protein Sequence and Structure.
Specialized databases are those that cater to a particular research interest.
For example, FlyBase, the HIV sequence database, and the Ribosomal Database
Project are databases that specialize in a particular organism or a
particular type of data.

1. Primary databases:
There are three major public sequence databases that store raw nucleic
acid sequence data produced and submitted by researchers worldwide:
GenBank, the European Molecular Biology Laboratory (EMBL) database, and
the DNA Data Bank of Japan (DDBJ), all of which are freely available on the
internet. Most of the data in the databases are contributed directly by
authors with a minimal level of annotation. A small number of sequences,
especially those published in the 1980s, were entered manually from
published literature by database management staff. Presently, sequence
submission to GenBank, EMBL, or DDBJ is a precondition for
publication in most scientific journals, which ensures that the fundamental
molecular data are made freely available. These three public databases closely
collaborate and exchange new data daily. Together they constitute the
International Nucleotide Sequence Database Collaboration (INSDC). This means
that by connecting to any one of the three databases, one should have access to
the same nucleotide sequence data.
Although the three databases all contain the same sets of raw data, each of
the individual databases uses a slightly different format to represent
the data.
2. Secondary databases:
Secondary databases contain computationally processed sequence
information derived from the primary databases. The amount of
computational processing work varies greatly among the secondary
databases; some are simple archives of translated sequence data from
identified open reading frames in DNA, whereas others provide additional
annotation and higher-level information regarding structure and
function.
3. Specialized databases:
Specialized databases normally serve a specific research community or
focus on a particular organism. The content of these databases may be
sequences or other types of information. The sequences in these databases
may overlap with a primary database, but may also include new data
submitted directly by authors. Because they are often curated by experts in
the field, they may have unique organizations and additional annotations
associated with the sequences. Many taxonomy-specific genome databases
fall within this category. Examples include FlyBase, WormBase,
AceDB, and TAIR. In addition, there are also specialized databases that
contain original data derived from functional analysis. For example, the
GenBank EST database and the microarray gene expression database at the
European Bioinformatics Institute (EBI) are some of the gene expression
databases available.

Nucleic acid databases:


→ NCBI:
The National Center for Biotechnology Information (NCBI) provides a large
suite of online resources for biological information and data, including the
GenBank nucleic acid sequence database and the PubMed database of
citations and abstracts published in life science journals. The Entrez system
provides search and retrieval operations for most of these data from 38
distinct databases. The E-utilities serve as the programming interface for
the Entrez system. Augmenting many of the web applications are custom
implementations of the BLAST program optimized to search specialized data
sets. New resources released in the past year include PubMed Labs and a
new sequence database search. Resources that were updated in the past
year include PubMed, PMC, Bookshelf, Genome Data Viewer, Assembly,
prokaryotic genomes, Genome, BioProject, dbSNP, dbVar, BLAST databases,
IgBLAST, iCn3D and PubChem. All of these resources can be accessed through
the NCBI home page at www.ncbi.nlm.nih.gov.
Data sources and collaborations: NCBI receives data from three sources:
direct submissions from researchers, national and international
collaborations or agreements with data providers and research consortia,
and internal curation efforts. For example, NCBI manages the GenBank
database and participates with the EMBL-EBI European Nucleotide Archive
(ENA) and the DNA Data Bank of Japan (DDBJ) as a partner in the
International Nucleotide Sequence Database Collaboration (INSDC). Details
about direct submission processes are available from the NCBI submit page
(www.ncbi.nlm.nih.gov/home/submit.shtml) and from the resource home
pages (e.g. the GenBank page, www.ncbi.nlm.nih.gov/GenBank/). NCBI
staff provide identifiers to submitters for their data usually within 2–5
business days, depending on the destination database and the complexity
of the submission. More information about the various collaborations,
agreements, and curation efforts is also available through the home pages
of the individual resources.
The GenBank entry for a sequence can be viewed by searching the NCBI
database for the accession number for that sequence.
To view the GenBank entry for the DEN-1 dengue virus, follow these steps:
1. Go to the NCBI website (www.ncbi.nlm.nih.gov).
2. Search for the accession number NC_001477. Because the search is for a
particular accession, only a single main result is returned, titled
“nucleotide sequence: dengue virus 1, complete genome.”
3. Click on “dengue virus 1, complete genome” to go to the GenBank entry.
The GenBank entry for an accession contains a lot of information about the
sequence, such as papers describing it, features in the sequence, etc. The
definition field gives a short description for the sequence. The organism
field in the NCBI entry identifies the species that the sequence came from.
The reference field contains scientific publications describing the sequence.
The features field contains information about the location of features of
interest inside the sequence, such as regulatory sequences or genes that lie
inside the sequence. The origin field gives the sequence itself.
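
To make the field layout concrete, here is a rough sketch that splits a GenBank-style flat-file record into its named fields. The record is a made-up toy entry, not the real NC_001477 record, and it omits the reference and features fields for brevity. The parser is a simplification that assumes the field keyword occupies the first 12 columns of a line.

```python
# Hypothetical toy record in GenBank-like layout (not a real entry)
RECORD = """\
LOCUS       TOY0001                   30 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy sequence for illustration.
ACCESSION   TOY0001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt
//
"""


def parse_genbank(text):
    """Collect header fields and the sequence from one simple record."""
    fields = {}
    sequence_parts = []
    in_origin = False
    last_key = None
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:                       # keep letters, drop numbers/spaces
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key == "ORIGIN":
            in_origin = True
        elif key:
            fields[key] = value
            last_key = key
        elif last_key and value:            # continuation of the previous field
            fields[last_key] += " " + value
    fields["ORIGIN"] = "".join(sequence_parts).upper()
    return fields


entry = parse_genbank(RECORD)
print(entry["DEFINITION"])
print(entry["ORGANISM"])
print(entry["ORIGIN"])
```

For real records, an established parser such as the one in Biopython is a safer choice, since actual entries contain nested feature tables and multi-line references.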
The FASTA file format is a simple file format commonly used to store and
share sequence information. When you download sequences from
databases such as NCBI you usually want FASTA files.
The first line of a FASTA file starts with the “greater than” character (>)
followed by a name and/or description for the sequence. Subsequent lines
contain the sequence itself.
A FASTA file can contain the sequence for a single gene, an entire genome, or
more than one sequence. If a FASTA file contains many sequences, then for
each sequence it will have a header line starting with a greater than
character followed by the sequence itself.
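
The FASTA layout just described is simple enough to read with a few lines of code. The sketch below parses FASTA text holding more than one sequence; the two sequences and their header names are made up for illustration.

```python
# Hypothetical FASTA text with two sequences
FASTA_TEXT = """\
>toy_seq_1 first hypothetical sequence
ATGGCC
ATTGTA
>toy_seq_2 second hypothetical sequence
GGGCCGCTGA
"""


def parse_fasta(text):
    """Map each header line (without the '>') to its joined sequence."""
    parts = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):        # a header starts a new sequence
            header = line[1:]
            parts[header] = []
        elif header is not None:        # sequence lines belong to the last header
            parts[header].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


sequences = parse_fasta(FASTA_TEXT)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

Sequence lines are joined because a single sequence is usually wrapped across many lines in the file.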
You can easily retrieve DNA or protein sequence data by hand from the
NCBI sequence database via its website www.ncbi.nlm.nih.gov.
The dengue DEN-1 virus genome has the NCBI accession number NC_001477.
(Dengue is an RNA virus, but GenBank stores its genome sequence in the DNA
alphabet.) To retrieve the sequence for the dengue DEN-1 virus
from NCBI, go to the NCBI website, type “NC_001477” in the search box at
the top of the webpage, and press the “search” button beside the search
box.
On the results page of a normal NCBI search you will see the number of hits
to “NC_001477” in each of the NCBI databases on the NCBI website. There
are many databases on the NCBI website: for example, PubMed and
PubMed Central contain abstracts and full-text articles from scientific
papers, the genes and genomes databases contain DNA and RNA sequence data,
the proteins database contains protein sequence data, and so on.
Most biologists would do this type of work by hand from within their web
browser, but it can also be done by writing small programs in scripting
languages such as Python or R. In R, the rentrez package is a powerful tool
for interacting with NCBI resources. In this tutorial we’ll focus on the web
interface. It’s good to remember, though, that almost anything done via the
webpage can be automated using a computer script.
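As one possible illustration of such a script, the sketch below builds a request for the NCBI E-utilities efetch service using only the Python standard library. The helper names efetch_url and fetch_fasta are hypothetical. The download step needs an internet connection, so it is defined here but not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build the efetch URL that returns one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download a nucleotide record in FASTA format (requires network access)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")


url = efetch_url("NC_001477")
print(url)
# To download the record itself: text = fetch_fasta("NC_001477")
```

Pasting the printed URL into a browser should return the same FASTA text that the “Send to” menu produces on the website.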
A challenge when learning to use NCBI resources is that there is a
tremendous amount of sequence information available and you need to
learn how to sort through what the search results provide. As you are
looking for the DNA sequence of the dengue DEN-1 virus genome, you
expect to see a hit in the NCBI Nucleotide database. This is indicated at the
top of the page where it says “Nucleotide sequence” and lists “Dengue virus
1, complete genome.”
When you click on the link for the Nucleotide database, it will bring you to
the record for NC_001477 in the NCBI Nucleotide database. This will
contain the name and NCBI accession of the sequence, as well as other
details such as any papers describing the sequence. If you scroll down you’ll
also see the sequence itself.
If you need it, you can retrieve the DNA sequence for the DEN-1 dengue
virus genome as a FASTA format sequence file in a couple of ways.
The easiest is just to copy and paste it into a text, .R, or other file. You can
also click on “Send to” at the top right of the NC_001477 sequence record
webpage.
After you click on “Send to” you can pick from several options. Choose
“File” in the menu that appears, then choose FASTA from the “Format”
menu that appears, and click on “Create file”. The sequence will then
download. The default file name is sequence.fasta, so you’ll probably want
to change it.
You can now open the FASTA file containing the DEN-1 dengue virus genome
sequence using a text editor like Notepad, WordPad, Notepad++, or even
RStudio on your computer. To find a text editor on your computer, search for
“text” from the Start menu (Windows) and usually one will come up.
→ GenBank:
The GenBank sequence database is an open-access, annotated collection
of nucleotide sequences and their protein translations, including mRNA
sequences with coding regions, segments of genomic DNA with a single
gene or multiple genes, and ribosomal RNA gene clusters. GenBank is
produced and maintained by the National Center for Biotechnology
Information (NCBI) as part of the international collaboration
with the EMBL Data Library from the EBI and the DNA Data Bank of Japan
(DDBJ).
Individual laboratories can submit sequence data, and large-scale sequencing
centres can make bulk submissions, directly to GenBank by using BankIt
or Sequin. BankIt is a web-based form and Sequin is a stand-alone
software tool developed by the NCBI for submitting and updating sequences
in the GenBank, EMBL and DDBJ databases. After sequence submission, the
GenBank staff assign an accession number to the newly entered sequence
and perform quality assurance checks. The newly submitted
sequence is then released to the database. Data stored in GenBank can
be retrieved through Entrez or downloaded by File Transfer Protocol (FTP).
GenBank also holds Expressed Sequence Tag (EST),
Sequence Tagged Site (STS), Genome Survey Sequence (GSS) and High
Throughput Genome Sequence (HTGS) data, as well as complete microbial
genome sequences.
Information of GenBank can be accessed through the server
http://www.ncbi.nlm.nih.gov/GenBank/.
There are several ways to search and retrieve data from GenBank, as given
below –
• Search GenBank for sequence identifiers and annotations with Entrez
Nucleotide, which is divided into three divisions: CoreNucleotide (the main
collection), dbEST (expressed sequence tags), and dbGSS (genome survey
sequences).
• Search and align GenBank sequences to a query sequence using BLAST.
• Search, link, and download sequences programmatically using NCBI
E-utilities.

→ DDBJ:
DDBJ (DNA Data Bank of Japan) is a nucleotide sequence data bank
that receives nucleotide sequences from researchers and assigns
accession numbers to data submitters. DDBJ collects sequence data mainly
from Japanese researchers; however, it also receives data from, and assigns
accession numbers to, researchers in other countries. DDBJ began data
bank activities in 1986 at the National Institute of Genetics (NIG). Currently,
DDBJ is in operation at NIG in Mishima, Japan.
The main activities of DDBJ are –
i) as a member of INSDC, DDBJ collects nucleotide sequence data from
researchers, assigns accession numbers to the data submitters and
exchanges the collected data with EMBL-Bank and GenBank on a daily basis,
ii) DDBJ manages bioinformatics tools for data submission and retrieval,
iii) DDBJ develops tools for analysis of biological data, and
iv) DDBJ organizes bioinformatics training courses in Japanese to teach how
to analyse biological data.
Information of DDBJ can be accessed through the server
http://www.ddbj.nig.ac.jp.
→ EBI-EMBL:
The European Bioinformatics Institute (EBI) is part of the European Molecular
Biology Laboratory (EMBL). Its nucleotide sequence database, now known as
EMBL-Bank, was established in 1980 at the EMBL in Heidelberg, Germany. It was
the world’s first nucleotide sequence database. EMBL-EBI provides freely
available data from life science experiments, performs basic research in
computational biology and offers an extensive user training programme for
researchers. EMBL-EBI stores data on DNA and RNA (genes, genomes
and variation), gene expression (RNA, protein and metabolite expression),
protein (sequence, families and motifs), structure (molecular and cellular
structures), systems (reaction, interaction, pathways), chemical biology
(chemogenomics and metabolomics), ontologies (taxonomies and
controlled vocabularies) and literature (scientific publications and patents).
EMBL-EBI can be accessed through the server http://www.ebi.ac.uk.
→ Ensembl:
Ensembl is a joint project between EBI, EMBL and the Wellcome Trust
Sanger Institute to develop a software system that produces and maintains
automatic annotation on selected eukaryotic genomes. Ensembl was
started in 1999 with the aim of automatically annotating genomes, integrating
this annotation with other available biological data and releasing the
information to researchers via the web. Ensembl produces genome
databases for vertebrates and other eukaryotic species and makes this
information freely available online. Ensembl can be freely accessed
through the server http://www.asia.ensembl.org. Various research projects
around the world contribute DNA sequences and their assembly data to
Ensembl. The database emphasizes two areas of comparative
genomics – the creation of gene trees using representative proteins from
each gene in a species, and the alignment of DNA sequences to infer
synteny, conservation, etc. The Ensembl Variation database stores
data on the regions of the genome that differ between individual genomes,
together with associated disease and phenotype information. Ensembl
Regulation stores data on the mechanisms of gene regulation in human and
mouse cells, including transcriptional and post-transcriptional mechanisms.

Protein databases:
→ PIR:
PIR (Protein Information Resource) was developed by the National
Biomedical Research Foundation (NBRF) in 1984 to assist researchers in
the identification and interpretation of protein sequence information. It is
an integrated public resource of protein informatics that supports genomic
and proteomic research and scientific discovery. PIR has four distinct
sections – PIR1 contains fully classified and annotated entries, PIR2
contains preliminary entries that have not been thoroughly reviewed and
may contain redundancy, PIR3 contains unverified entries, and PIR4 contains
entries in one of the following categories –
i. Conceptual translations of artefactual sequences,
ii. Conceptual translations of sequences that are not transcribed or
translated,
iii. Protein sequences or conceptual translations that are genetically
engineered,
iv. Sequences that are not genetically encoded and not produced on
ribosomes.
PIR maintains the Protein Sequence Database (PSD), which stores over
283 000 sequences.
For over four decades PIR has been providing protein databases and
analysis tools that are freely accessible to researchers, including the
Protein Sequence Database (PSD). PIR has a bibliography system for
literature searching, mapping, and user submission. PIR also maintains a
non-redundant reference (NREF) database, and iProClass, an integrated
database of protein family, function and structure information. PIR-NREF
provides non-redundant protein sequence information. Currently it consists
of more than 1 000 000 entries from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq,
GenPept, and PDB.
The PIR web site is http://www.pir.georgetown.edu.
→ UniProtKB/Swiss-Prot:
UniProtKB/Swiss-Prot is the manually annotated and reviewed section of
the UniProt Knowledgebase (UniProtKB). It is an annotated and non-redundant
(meaning that few identical sequences are present in the database) protein
sequence database. Swiss-Prot was established in 1986 and is maintained
collaboratively by the EMBL outstation (EBI) and the Swiss Institute of
Bioinformatics (SIB). It provides information on the domain structure of a
protein, its function, post-translational modifications, variants, etc. The
Swiss-Prot database distinguishes itself from other protein sequence
databases by three distinct criteria –
I. Annotation,
II. Minimal redundancy and
III. Integration with other databases.
In 1996, a computer-annotated supplement to Swiss-Prot was created and
named TrEMBL (translation of EMBL nucleotide sequence database).
TrEMBL consists of computer-annotated entries derived from the
translation of all coding sequences (CDSs) in the EMBL database except for
CDSs already present in Swiss-Prot.
The servers for Swiss-Prot and TrEMBL are www.ebi.ac.uk/uniprot and
www.ebi.ac.uk respectively.

Structure database:
→ PDB:
The Protein Data Bank (PDB) was established at Brookhaven National Laboratory
(BNL) in 1971. PDB contains 3D structures of proteins that are determined by
X-ray crystallographic and nuclear magnetic resonance (NMR) studies, and
is maintained by the Research Collaboratory for Structural Bioinformatics
(RCSB) at Rutgers University.
As of December 24, 2013 there were 96596 protein structures available
at PDB, which provide information on the atomic coordinates of amino acids in
proteins, protein fragments or proteins bound to substrates or inhibitors.
Protein structure data can be deposited in the PDB using a web-based
AutoDep Input Tool (ADIT).
The molecular structures of proteins in PDB can be displayed by molecular
graphics programs such as RasMol, Chime and Cn3D.
PDB’s URL is http://www.rcsb.org/pdb/.

Specialized genome database:


→ SGD:
The Saccharomyces Genome Database (SGD) provides internet access to the
complete Saccharomyces cerevisiae genomic sequence, its genes and their
products, the phenotypes of its mutants, and the literature supporting these
data. The amount of information and the number of features provided by
SGD have increased greatly following the release of the S. cerevisiae
genomic sequence, which was the first complete sequence of a
eukaryotic genome. SGD aids researchers by providing not only basic
information, but also tools such as sequence similarity searching that lead
to detailed information about features of the genome and relationships
between genes. SGD presents information using a variety of user-friendly,
dynamically created graphical displays illustrating physical, genetic and
sequence feature maps. SGD can be accessed via the World Wide Web at
http://genome-www.stanford.edu/saccharomyces/.
SGD is not a primary sequence database, but instead collects DNA and
protein sequence information from primary providers (GenBank, EMBL,
DDBJ, Swiss-Prot and PIR). It then assembles this information into datasets
that make the sequence information more useful to molecular biologists. These
datasets are available from SGD through the World Wide Web and anonymous FTP.
SGD provides a variety of sequence similarity search services, including
BLAST, FASTA, pattern matching, sequence similarity view and stripe view.
Pattern matching, sequence similarity view and stripe view are all
programs created by SGD staff. Pattern matching allows users to perform a
variety of motif searches, using degenerate search sequences. Sequence
similarity view and stripe view provide a visual display of sequence
similarities within the yeast genome.
A variety of datasets are available for searching with BLAST, FASTA and
pattern matching: the genomic sequence, all GenBank S. cerevisiae
sequences, a non-redundant set of S. cerevisiae protein sequences
combined from NCBI’s GenPept, PIR and Swiss-Prot, the DNA coding
sequence of all hypothetical ORFs identified by the systematic sequencing
project, the hypothetical translation of all ORF sequences, and the non-ORF
DNA sequences. In addition to the protein sequence datasets, SGD also
includes the topic category protein information, which contains information
curated by the YPD resource.
Pattern matching allows the user to query the nucleic acid or amino acid
databases maintained at SGD for a regular expression that defines a motif
or sequence pattern of interest. The results are displayed in a tabular
format with hyperlinks that provide direct access to detailed biological
information via the locus page.
→ MGDB:
MGDB is a workbench system for comparative analysis of completely
sequenced microbial genomes. The central function of MGDB is to create an
orthologous gene classification table using precomputed all-against-all
similarity relationships among genes in multiple genomes. In MGDB, an
automated classification algorithm has been implemented so that users can
create their own classification table by specifying a set of organisms and
parameters. This feature is especially useful when the user's interest is
focused on some taxonomically related organisms. The created
classification table is stored in the database and can be explored
in combination with the data of individual genomes as well as similarity
relationships among genomes. Using these data, users can carry out
comparative analyses from various points of view, such as phylogenetic
pattern analysis, gene order comparison and detailed gene structure
comparison. MGDB is accessible at http://mbgd.genome.ad.jp/.

In MGDB, similarity relationships among all protein coding genes in the
genomes are precomputed and stored. Using these data and a user-specified
set of organisms and/or parameters, an ortholog classification
table is dynamically created. The created table is cached in the database
and one can compare multiple genomes using this table from various points of
view such as gene arrangement and phylogenetic relationship. The user can
also specify keywords or query sequences to retrieve a set of genes and see
the cluster table containing them. Creating the orthologous gene classification
table is the central function of MGDB. MGDB provides several methods to
retrieve specific orthologous groups from the default or created cluster
table. For example, users can specify keywords on the top page of MGDB.
The system first searches for the keywords in individual gene records,
and then finds the clusters containing the retrieved genes. In this case,
users need to pay less attention to differences in descriptions between
organisms, because they can finally get all of the orthologous genes. MGDB
also provides a usual genome map search interface for users to navigate an
individual genome to retrieve a particular gene. All information about a
particular gene such as homology relationships and motif hits is
summarized in a gene information page, which also includes a link to
retrieve neighboring orthologs.

Users can also specify query sequences for similarity search. Here, the
system calculates similarities between query and database sequences in
the same way as the all-against-all similarities in MGDB, i.e. by BLAST
searches followed by DP alignment, and then finds the clusters containing
the genes hit by the search. The results are listed in order of average
similarity score against the query.

UNIT 4
Retrieval methods for Nucleic acid and Protein Sequences. Use of
Bioinformatic Tools -sequence homology- substitution matrices- PAM and
BLOSUM. Pairwise alignment (global and local) using BLAST. Multiple
sequence alignment using Clustal omega.

RETRIEVAL METHODS FOR NUCLEIC ACIDS AND PROTEINS

You can easily retrieve DNA or protein sequence data by hand from
the NCBI Sequence Database via its website www.ncbi.nlm.nih.gov.
Dengue DEN-1 DNA is a viral DNA sequence and its NCBI accession
number is NC_001477. To retrieve the DNA sequence for the Dengue
DEN-1 virus from NCBI, go to the NCBI website, type “NC_001477” in the
Search box at the top of the webpage, and press the “Search” button
beside the Search box.
On the results page of a normal NCBI search you will see the number of
hits to “NC_001477” in each of the NCBI databases on the NCBI website.
There are many databases on the NCBI website, for
example, PubMed and PubMed Central contain abstracts from scientific
papers, the Genes and Genomes database contains DNA and RNA
sequence data, the Proteins database contains protein sequence data,
and so on.
Most biologists would do this type of work by hand from within their web
browser, but it can also be done by writing small programs in scripting
languages such as Python or R. In R, the rentrez package is a powerful
tool for interacting with NCBI resources. In this tutorial we’ll focus on the
web interface. It’s good to remember, though, that almost anything done
via the webpage can be automated using a computer script.
A challenge when learning to use NCBI resources is that there is a
tremendous amount of sequence information available and you need to
learn how to sort through what the search results provide. As you are
looking for the DNA sequence of the Dengue DEN-1 virus genome, you
expect to see a hit in the NCBI Nucleotide database. This is indicated at
the top of the page where it says “NUCLEOTIDE SEQUENCE” and lists
“Dengue virus 1, complete genome.”
When you click on the link for the Nucleotide database, it will bring you
to the record for NC_001477 in the NCBI Nucleotide database. This will
contain the name and NCBI accession of the sequence, as well as other
details such as any papers describing the sequence. If you scroll down
you’ll also see the sequence itself.
If you need it, you can retrieve the DNA sequence for the DEN-1 Dengue
virus genome as a FASTA format sequence file in a couple of
ways. The easiest is just to copy and paste it into a text, .R, or other file.
You can also click on “Send to” at the top right of the NC_001477
sequence record webpage.
After you click on “Send to” you can pick from several options. Choose
“File” in the menu that appears, then choose FASTA from the
“Format” menu that appears, and click on “Create file”. The sequence will
then download. The default file name is sequence.fasta, so you’ll
probably want to change it.
You can now open the FASTA file containing the DEN-1 Dengue virus
genome sequence using a text editor like Notepad, WordPad, Notepad++,
or even RStudio on your computer. To find a text editor on your computer
search for “text” from the start menu (Windows) and usually one will
come up.
SEQUENCE HOMOLOGY vs SEQUENCE SIMILARITY
When two sequences are descended from a common evolutionary origin,
they are said to have a homologous relationship or share homology. A
related but different term is sequence similarity, which is the percentage of
aligned residues that are similar in physicochemical properties such as size,
charge, and hydrophobicity. Sequence homology is an inference or a
conclusion about a common ancestral relationship drawn from sequence
similarity comparison when the two sequences share a high enough degree
of similarity.
On the other hand, similarity is a direct result of observation from the
sequence alignment. Sequence similarity can be quantified using
percentages; homology is a qualitative statement. For example, one may say
that two sequences share 40% similarity. It is incorrect to say that the two
sequences share 40% homology. They are either homologous or
nonhomologous.
Generally, if the sequence similarity level is high enough, a common
evolutionary relationship can be inferred. In dealing with real research
problems, however, the similarity level at which one can infer homologous
relationships is not always clear. The answer depends on the type of
sequences being examined and on the sequence lengths.

Nucleotide sequences consist of only four characters, and therefore
unrelated sequences have at least a 25% chance of being identical at any
given position.
For protein sequences, there are twenty possible amino acid residues, and
so two unrelated sequences can match at about 5% of the residues by random
chance. If gaps are allowed, the percentage could increase to 10–20%.
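The 25% and 5% figures can be checked with a short calculation. The sketch below assumes that all residues are equally frequent and that positions are compared without gaps; real sequences have unequal residue frequencies, so the values are only approximate. The function name chance_identity is hypothetical.

```python
def chance_identity(frequencies):
    """Probability that two independently drawn residues are identical.

    For residue frequencies p_1..p_n this is the sum of p_i squared.
    """
    return sum(p * p for p in frequencies)


# Assumption: every residue type is equally likely.
dna = chance_identity([0.25] * 4)        # four nucleotides
protein = chance_identity([0.05] * 20)   # twenty amino acids

print(f"DNA: {dna:.2%}  protein: {protein:.2%}")
```

With unequal frequencies the sum of squares is larger than with equal ones, which is one reason the text says "at least" 25% for nucleotide sequences.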
Sequence length is also a crucial factor. The shorter the sequence, the
higher the chance that some alignment is attributable to random chance.
The longer the sequence, the less likely the matching at the same level of
similarity is attributable to random chance. This suggests that shorter
sequences require higher cutoffs for inferring homologous relationships
than longer sequences.
To determine whether two protein sequences are homologous, for
example, if both sequences are aligned over their full length of 100 residues,
an identity of 30% or higher can be safely regarded as indicating close
homology. Such sequences are sometimes referred to as being in the “safe
zone”. If the identity level falls between 20% and 30%, determination of
homologous relationships becomes less certain. This is the
area often regarded as the “twilight zone,” where remote homologs mix
with randomly related sequences. Below 20% identity, where high
proportions of nonrelated sequences are present, homologous
relationships cannot be reliably determined; such sequences fall into the
“midnight zone.”
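The three zones can be expressed as a small lookup. The sketch below uses the thresholds quoted above, which apply to protein alignments of about 100 residues. Counting identity only over columns without gaps is an assumption made here, since other conventions exist, and both function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns containing a gap ("-") are left out of the count.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def homology_zone(identity):
    """Classify a percent identity using the 30% and 20% thresholds from the text."""
    if identity >= 30:
        return "safe zone"
    if identity >= 20:
        return "twilight zone"
    return "midnight zone"


identity = percent_identity("ACDE-F", "ACDQGF")   # toy aligned pair
print(identity, homology_zone(identity))
```

For shorter alignments the text's own caveat applies: higher cutoffs are needed, so these fixed thresholds should not be reused unchanged.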
Scoring Matrices
The scoring system is called a substitution matrix and is derived from
statistical analysis of residue substitution data from sets of reliable
alignments of highly related sequences.
Scoring matrices for nucleotide sequences are relatively simple. A positive
value or high score is given for a match and a negative value or low score
for a mismatch. This assignment is based on the assumption that the
frequencies of mutation are equal for all bases. However, this assumption
may not be realistic; observations show that transitions (substitutions
between purines and purines or between pyrimidines and pyrimidines)
occur more frequently than transversions (substitutions between purines
and pyrimidines). Therefore, a more sophisticated statistical model with
different probability values to reflect the two types of mutations is needed.
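A scoring scheme that separates the two mutation types might look like the sketch below. The score values (+1, -1, -2) are arbitrary illustrative choices, not values from any published matrix, and nucleotide_score is a hypothetical helper.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one pair of bases, penalising transversions more than transitions."""
    if a == b:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


bases = "ACGT"
matrix = {(a, b): nucleotide_score(a, b) for a in bases for b in bases}

# Print the 4 x 4 matrix as a small table.
print("   " + "  ".join(f"{b:>2}" for b in bases))
for a in bases:
    print(f"{a:>2} " + "  ".join(f"{matrix[(a, b)]:>2}" for b in bases))
```

Setting the transition and transversion scores to the same value gives back the simple match/mismatch scheme described first.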
Scoring matrices for amino acids are more complicated because scoring has
to reflect the physicochemical properties of amino acid residues, as well as
the likelihood of certain residues being substituted among true homologous
sequences. Certain amino acids with similar physicochemical properties
can be more easily substituted than those without similar characteristics.
Substitutions among similar residues are likely to preserve the essential
functional and structural features. However, substitutions between residues
of different physicochemical properties are more likely to cause disruptions
to the structure and function. This type of disruptive substitution is less
likely to be selected in evolution because it renders proteins nonfunctional.

PAM Matrix
The PAM matrices (also called Dayhoff PAM matrices) were first
constructed by Margaret Dayhoff, who compiled alignments of seventy-one
groups of very closely related protein sequences. PAM stands for “point
accepted mutation” (although “accepted point mutation” or APM may be a
more appropriate term, PAM is easier to pronounce). Because of the use of
very closely related homologs, the observed mutations were not expected
to significantly change the common function of the proteins.
Thus, the observed amino acid mutations are considered to be accepted by
natural selection. These protein sequences were clustered based on
phylogenetic reconstruction using maximum parsimony. The PAM matrices
were subsequently derived based on the evolutionary divergence between
sequences of the same cluster. One PAM unit is defined as 1% of the amino
acid positions that have been changed.
To construct a PAM1 substitution table, a group of closely related sequences
with mutation frequencies corresponding to one PAM unit is chosen. Based
on the collected mutational data from this group of sequences, a
substitution matrix can be derived.

Construction of the PAM1 matrix involves alignment of full-length
sequences and subsequent construction of phylogenetic trees using the
parsimony principle. This allows computation of ancestral sequences for
each internal node of the trees. Ancestral sequence information is used to
count the number of substitutions along each branch of a tree. The PAM
score for a particular residue pair is derived from a multistep procedure
involving –
i. calculation of relative mutability (the number of mutational changes
from a common ancestor for a particular amino acid residue divided by the
total number of such residues occurring in an alignment),
ii. normalization of the expected residue substitution frequencies by
random chance, and
iii. logarithmic transformation to the base of 10 of the normalized
mutability value divided by the frequency of a particular residue.
The resulting value is rounded to the nearest integer and entered
into the substitution matrix, which reflects the likelihood of amino acid
substitutions. This completes the log-odds score computation. After
compiling all substitution probabilities of possible amino acid mutations, a
20 × 20 PAM matrix is established.
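The final log-odds step can be written as a one-line function. In the sketch below, the scaling factor of 10 follows the convention usually attributed to Dayhoff and is an assumption here, since the text only specifies a base-10 logarithm and rounding. The input probabilities are made-up numbers, and log_odds_score is a hypothetical name.

```python
import math


def log_odds_score(substitution_prob, background_freq, scale=10):
    """Rounded log-odds score for one residue pair.

    substitution_prob: normalized probability of the substitution (toy value).
    background_freq:   frequency of the residue expected by chance (toy value).
    scale:             assumed scaling factor applied before rounding.
    """
    return round(scale * math.log10(substitution_prob / background_freq))


# Toy inputs: a substitution seen twice as often, as often, and half as often
# as expected by chance.
more_than_chance = log_odds_score(0.02, 0.01)
same_as_chance = log_odds_score(0.01, 0.01)
less_than_chance = log_odds_score(0.005, 0.01)
print(more_than_chance, same_as_chance, less_than_chance)
```

A ratio above 1 gives a positive score and a ratio below 1 gives a negative score, which is the sign convention used when reading a PAM matrix.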
Positive scores in the matrix denote substitutions occurring more
frequently than expected among evolutionarily conserved replacements.
Negative scores correspond to substitutions that occur less frequently than
expected. Other PAM matrices with increasing numbers for more divergent
sequences are extrapolated from PAM1 through matrix multiplication. For
example, PAM80 is produced by values of the PAM1 matrix multiplied by
itself eighty times. The mathematical transformation accounts for multiple
substitutions having occurred in an amino acid position during evolution.
For example, when a mutation is observed as F replaced by I, the
evolutionary changes may have actually undergone a number of
intermediate steps before becoming I, such as in a scenario of F → M → L →
I. For that reason, a PAM80 matrix only corresponds to 50% of observed
mutational rates.
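The extrapolation by repeated multiplication, and the reason the observed change is smaller than the PAM number, can be seen with a toy example. The sketch below uses an invented three-letter alphabet with a 1% chance of change per step in place of the real 20 × 20 PAM1 matrix, so the numbers illustrate the principle only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply matrix m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each residue stays the same with probability 0.99 (1% change).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam80 = mat_power(TOY_PAM1, 80)
observed_change = 1.0 - toy_pam80[0][0]   # fraction of positions that look changed
print(round(observed_change, 3))
```

Eighty steps of 1% change do not produce 80% observed difference in this toy model, because later substitutions can hit positions that have already changed or can restore the original residue.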
A PAM unit is defined as 1% amino acid change or one mutation per 100
residues.
The increasing PAM numbers correlate with increasing PAM units and thus
evolutionary distances of protein sequences. For example, PAM250, which
corresponds to 20% amino acid identity, represents 250 mutations per 100
residues. In theory, the number of evolutionary changes approximately
corresponds to an expected evolutionary span of 2,500 million years. Thus,
the PAM250 matrix is normally used for divergent sequences. Accordingly,
PAM matrices with lower serial numbers are more suitable for aligning
more closely related sequences.

BLOSUM Matrix
In the PAM matrix construction, the only direct observation of residue
substitutions is in PAM1, based on a relatively small set of extremely closely
related sequences.
Sequence alignment statistics for more divergent sequences are not
available. To fill this gap, a new set of substitution matrices has been
developed. This is the series of blocks amino acid substitution matrices
(BLOSUM), all of which are derived from direct observation of every possible
amino acid substitution in multiple sequence alignments. These were
constructed based on more than 2,000 conserved amino acid patterns
representing 500 groups of protein sequences. The sequence patterns, also
called blocks, are ungapped alignments of less than sixty amino acid residues
in length. The frequencies of amino acid substitutions of the residues in
these blocks are calculated to produce a numerical table, or block
substitution matrix.
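How substitution frequencies in a block become scores can be sketched on a tiny example. The block below is invented, and the scoring formula (twice the base-2 logarithm of observed over expected pair frequency, rounded) is the convention commonly described for BLOSUM rather than something stated in the text. The clustering of similar sequences used for the real matrices is left out, and blosum_scores is a hypothetical name.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_scores(block):
    """Log-odds scores for the residue pairs observed in one ungapped block."""
    # 1. Count every residue pair within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # 2. Background frequency of each residue, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # 3. Compare observed with expected pair frequency and take the log-odds.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Toy block: five ungapped sequences of five residues (invented data).
TOY_BLOCK = ["ACSAS", "ACSAS", "ACSAS", "ACSAS", "SCAAS"]
scores = blosum_scores(TOY_BLOCK)
print(scores)
```

Pairs that never occur in the block get no score here. A real matrix is built from thousands of blocks, so every pair is observed.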

Rather than being extrapolated, the numbers in the BLOSUM matrix names are
the actual percentage identity values of the sequences selected for
construction of the matrices. For example, BLOSUM62 indicates that the
sequences selected for constructing the matrix share an average identity
value of 62%. Other BLOSUM matrices based on sequence groups of various
identity levels have also been constructed. In reverse order to the PAM
numbering system, the lower the BLOSUM number, the more divergent the
sequences it represents.

PAIRWISE ALIGNMENT
The overall goal of pairwise sequence alignment is to find the best pairing
of two sequences, such that there is maximum correspondence among
residues. To achieve this goal, one sequence needs to be shifted relative to
the other to find the position where maximum matches are found.
There are two different alignment strategies that are often used: global
alignment and local alignment.
Global Alignment and Local Alignment
In global alignment, two sequences to be aligned are assumed to be
generally similar over their entire length. Alignment is carried out from
beginning to end of both sequences to find the best possible alignment
across the entire length between the two sequences. This method is more
applicable for aligning two closely related sequences of roughly the same
length. For divergent sequences and sequences of variable lengths, this
method may not be able to generate optimal results because it fails to
recognize highly similar local regions between the two sequences.
Local alignment, on the other hand, does not assume that the two
sequences in question have similarity over the entire length. It only finds
local regions with the highest level of similarity between the two sequences
and aligns these regions without regard for the alignment of the rest of the
sequence regions. This approach can be used for aligning more divergent
sequences with the goal of searching for conserved patterns in DNA or
protein sequences. The two sequences to be aligned can be of different
lengths. This approach is more appropriate for aligning divergent biological
sequences containing only modules that are similar, which are referred to as
domains or motifs.
Dot Matrix Method
Dot matrix method, also known as the dot plot method, is a graphical
method of sequence alignment that involves comparing two sequences by
plotting them in a two-dimensional matrix.
In a dot matrix, two sequences that must be compared are plotted along a
matrix’s horizontal and vertical axes. The method then scans each residue
of one sequence to identify similarities with all residues in the other
sequence.
If a residue in one sequence matches a residue in the other sequence, a dot
is placed in the corresponding position in the matrix. Otherwise, the matrix
position is left blank.
If the two sequences being compared are highly similar, the dot plot will
show a single line along the matrix’s main diagonal. However, when
the sequences are less similar, the dot plot will show more scattered dots
with fewer diagonal lines, indicating that the sequences share less
similarity.
Dot plots can also find repeat elements in a single sequence. Short parallel
lines above and below the main diagonal indicate the presence of repeats.
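The procedure described above can be sketched in a few lines. This is a minimal illustration without any window or threshold filtering; the function names and the use of "*" and "." as plot symbols are assumptions made for the example.

```python
def dot_matrix(seq_x, seq_y):
    """Build the dot matrix row by row: seq_x runs across, seq_y runs down.

    A '*' marks a position where the two residues match; '.' marks a blank.
    """
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


def print_dot_plot(seq_x, seq_y):
    """Print the matrix with the sequences written along the two axes."""
    print("  " + seq_x)
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        print(residue + " " + row)


# Identical sequences give one unbroken main diagonal.
print_dot_plot("ACGT", "ACGT")

# Comparing a sequence with itself exposes repeats as parallel diagonals.
print_dot_plot("ABAB", "ABAB")
```

In the second plot, the extra diagonals above and below the main one come from the repeated unit "AB".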
Dotmatcher (bioweb.pasteur.fr/seqanal/interfaces/dotmatcher.html) and
Dottup (bioweb.pasteur.fr/seqanal/interfaces/dottup.html) are two
programs of the EMBOSS package, which have been made available online.
Dotmatcher aligns and displays dot plots of two input sequences (DNA or
proteins) in FASTA format. A window of specified length and a scoring
scheme are used. Diagonal lines are only plotted over the position of the
windows if the similarity is above a certain threshold. Dottup aligns
sequences using the word method and is capable of handling genome-
length sequences. Diagonal lines are only drawn if exact matches of words
of specified length are found.
Dothelix (www.genebee.msu.su/services/dhm/advanced.html) is a dot matrix
program for DNA or protein sequences. The program has a number of options
for length threshold (similar to window size) and implements scoring
matrices for protein sequences. In addition to drawing diagonal lines with
similarity scores above a certain threshold, the program displays actual
pairwise alignment.

Dynamic Programming
Dynamic programming is used to find the optimal alignment between two
proteins or nucleic acid sequences by comparing all possible pairs of
characters in the sequences.
Dynamic programming can be used to produce both global and local
alignments. The global pairwise alignment algorithm using dynamic
programming is based on the Needleman-Wunsch algorithm, while the
dynamic programming in local alignment is based on the Smith-Waterman
algorithm.


This method works in the following three steps.


1. Initialization of the scoring matrix: The first step is to create a
two-dimensional matrix where the two sequences to be aligned are
written along the top and left sides. The matrix is initialized with
gap penalties and an initial score of zero at the top-left corner.

2. Matrix filling with maximum scores: The next step involves
filling the matrix with scores based on a scoring matrix. Scoring
matrices for nucleotide sequences are simple. A positive value is
given for a match, and a negative value for a mismatch. For amino
acids, BLOSUM and PAM scoring matrices are used.
To calculate the alignment scores, the algorithm starts at the
upper left corner of the matrix and proceeds one row at a time
toward the lower right corner. The algorithm fills each cell with
the maximum of three candidate scores: the diagonal neighbour plus
the match or mismatch score for the corresponding residues, or the
cell above or the cell to the left plus a gap penalty.

3. Traceback to identify optimal alignment:
After filling the matrix, the algorithm performs a traceback to find
the optimal alignment path. Starting from the bottom-right corner
and moving towards the top-left corner, adjacent cells are examined
in reverse order to determine the best path with the highest total
score. The optimal alignment path is the one with the maximum
score.
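The three steps can be sketched for the global (Needleman-Wunsch) case as follows. This is a minimal sketch, assuming a simple match/mismatch score and a linear gap penalty; the scoring values and the function name are illustrative assumptions, and a real program would use a substitution matrix for proteins.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: initialization. Zero at the top-left corner, gap penalties
    # accumulating along the first row and first column.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: matrix filling. Each cell takes the best of three moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Step 3: traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(row_a)
print(row_b)
```

For local (Smith-Waterman) alignment, the same filling step is used with an extra candidate score of zero, and the traceback starts from the highest-scoring cell instead of the bottom-right corner.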


Gap Penalties
Performing optimal alignment between sequences often involves applying
gaps that represent insertions and deletions. Because in natural
evolutionary processes insertions and deletions are relatively rare in
comparison to substitutions, introducing gaps should be made more
difficult computationally, reflecting the rarity of insertional and deletional
events in evolution. However, assigning penalty values can be more or less
arbitrary because there is no evolutionary theory to determine a precise
cost for introducing insertions and deletions. If the penalty values are set
too low, gaps can become too numerous to allow even nonrelated
sequences to be matched up with high similarity scores. If the penalty
values are set too high, gaps may become too difficult to appear, and
reasonable alignment cannot be achieved, which is also unrealistic.
Through empirical studies for globular proteins, a set of penalty values
has been developed that appears to suit most alignment purposes. They are
normally implemented as default values in most alignment programs.
Another factor to consider is the cost difference between opening a gap and
extending an existing gap. It is known that it is easier to extend a gap that
has already been started. Thus, gap opening should have a much higher
penalty than gap extension. This is based on the rationale that if insertions
and deletions ever occur, several adjacent residues are likely to have been
inserted or deleted together. These differential gap penalties are also
referred to as affine gap penalties. The normal strategy is to use preset gap
penalty values for introducing and extending gaps. For example, one may
use a −12/−1 scheme in which the gap opening penalty is −12 and the gap
extension penalty is −1. The total gap penalty (W) is a linear function of
gap length, which is calculated using the formula:
W = γ + δ × (k − 1)
where γ is the gap opening penalty, δ is the gap extension penalty, and k is
the length of the gap. Besides the affine gap penalty, a constant gap penalty
is sometimes also used, which assigns the same score for each gap position
regardless of whether it is opening or extending. However, this penalty
scheme has been found to be less realistic than the affine penalty.
Gaps at the terminal regions are often treated with no penalty because in
reality many true homologous sequences are of different lengths.
Consequently, end gaps can be allowed to be free to avoid getting
unrealistic alignments.
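As a worked example of the formula W = γ + δ × (k − 1), the sketch below evaluates the −12/−1 scheme for a few gap lengths and compares it with a scheme that charges the same amount for every gap position. The function names and the per-position value of −4 are assumptions made for illustration.

```python
def affine_gap_penalty(k, gap_open=-12, gap_extend=-1):
    """W = gap_open + gap_extend * (k - 1) for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (k - 1)


def per_position_gap_penalty(k, per_position=-4):
    """Same charge for every gap position, whether opening or extending."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return per_position * k


for k in (1, 2, 5, 10):
    print(k, affine_gap_penalty(k), per_position_gap_penalty(k))
```

With the affine scheme, one gap of length 5 costs −16, whereas five separate gaps of length 1 cost −60, which reflects the idea that adjacent residues tend to be inserted or deleted together.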


MULTIPLE SEQUENCE ALIGNMENT


Clustal Omega
Clustal Omega is a popular bioinformatics tool used for multiple sequence
alignment (MSA). It is an improved version of the widely used ClustalW and
ClustalX programs.
1. Purpose:
- Clustal Omega is primarily used for aligning multiple sequences of
biological macromolecules, such as protein or nucleotide sequences.
- It helps in identifying regions of similarity among sequences, which can
be indicative of functional, structural, or evolutionary relationships.
2. Features:
- Accuracy: Clustal Omega employs advanced algorithms to produce highly
accurate alignments.
- Speed: It is optimized for rapid alignment even with very large datasets.
- Scalability: Capable of handling large numbers of sequences efficiently.
- Flexibility: Offers various options for customization of alignment
parameters.
- Output Formats: Supports multiple output formats including FASTA,
Clustal, PHYLIP, and others.
- Command-line and Web Interface: Clustal Omega can be used both
through a command-line interface and a web server interface.
3. Algorithms:
- Progressive Alignment: Clustal Omega uses a progressive alignment
strategy, where sequences are first grouped into a guide tree based on their
similarity and then aligned progressively.
- Iterative Refinement: It employs iterative refinement techniques to
improve the alignment accuracy.
- HMM-based methods: Hidden Markov Model (HMM) profiles are used to
guide the alignment process, particularly helpful for remote homolog
detection.
4. Usage:
- Command-line Interface (CLI): Users can run Clustal Omega through the
command line by providing input sequences and desired parameters.


- Web Interface: Clustal Omega provides a user-friendly web interface
where users can upload sequences, configure alignment parameters, and
obtain results interactively.
5. Input Requirements:
- Clustal Omega accepts sequences in FASTA format, which is a widely
used format for representing nucleotide or protein sequences.
- It can handle a diverse range of sequences including proteins,
nucleotides, and even user-defined sequences.
6. Parameters:
- Gap Penalties and Substitution Matrix: Unlike ClustalW, Clustal Omega
does not let users set the gap opening and extension penalties or choose a
substitution matrix (e.g., BLOSUM, PAM); these are fixed internally.
- Iterations: Users can set the number of guide tree and HMM refinement
iterations.
- Output Format: Users can specify the desired output format for the
alignment results.
7. Output:
- Clustal Omega generates an alignment of input sequences, where
identical or similar residues are arranged in columns.
- The output also includes a guide tree depicting the evolutionary
relationships among the input sequences.
- Various output formats are supported to facilitate downstream analyses.
8. Applications:
- Phylogenetic Analysis: Clustal Omega alignments are often used as input
for phylogenetic tree construction to study evolutionary relationships.
- Structural Prediction: Alignments can aid in predicting protein
structures by identifying conserved regions.
- Functional Annotation: Identification of conserved motifs or domains
across related sequences can provide insights into their functions.
9. Limitations:
- Sensitivity to Input Parameters: The accuracy of the alignment may vary
depending on the chosen parameters and the characteristics of the input
sequences.
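To make the idea of identical residues arranged in columns concrete, the sketch below marks the fully conserved columns of a small alignment, in the spirit of the "*" symbols shown under Clustal-format output. The three aligned sequences are invented for illustration, the function name is hypothetical, and the weaker conservation symbols used by Clustal are not reproduced.

```python
def conservation_line(aligned):
    """Return a string with '*' under every column whose residues are all
    identical and contain no gap character, and a space elsewhere."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


# Hypothetical aligned sequences (gaps written as '-').
alignment = ["MKT-LLV", "MKTALLV", "MRT-LLI"]
for row in alignment:
    print(row)
print(conservation_line(alignment))
```

Columns marked in this way are candidates for the conserved motifs and domains mentioned under Applications.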


BLAST ANALYSIS
BLAST stands for Basic Local Alignment Search Tool. It is a widely used
bioinformatics program that was first introduced by Stephen Altschul and
colleagues in
1990 and has since become one of the most popular tools for sequence
similarity search. There are five types (variants) of BLAST that are
differentiated based on the type of sequence (DNA or protein) of the query
and database sequences.
1. BLASTN compares a nucleotide query sequence to a nucleotide
sequence database.
2. BLASTP compares a protein query sequence to a protein sequence
database.
3. BLASTX compares a nucleotide query sequence to a protein
sequence database by translating the query sequence into its six
possible reading frames and aligning them with the protein
sequences.


4. TBLASTN compares a protein query sequence to a nucleotide
sequence database by translating the nucleotide sequences in all six
reading frames and aligning them with the protein sequence.
5. TBLASTX compares a nucleotide query sequence to a nucleotide
sequence database by translating both the query and the database
sequences in all six reading frames and aligning the resulting
protein sequences with each other.
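The six-frame translation used by BLASTX, TBLASTN, and TBLASTX can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This is a minimal sketch using the standard genetic code; the function names and the frame labels (+1 to −3) are illustrative, and codons containing characters other than A, C, G, T are shown as "X" by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCC").items():
    print(label, protein)
```

Each of the six translated strings is then compared with protein sequences in the same way as in a BLASTP search.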

BLAST works by comparing a query sequence to a database of sequences to
find regions of similarity. It uses a heuristic approach to search for
similarities in the database, making it much faster than exhaustive
dynamic programming, although it is not guaranteed to find the optimal
alignment.
BLAST performs sequence alignment through the following steps.
Step 1: The first step is to create a lookup table or list of words from the
query sequence. This step is also called seeding. First, BLAST takes the
query sequence and breaks it into short segments called words. For protein
sequences, each word is usually three amino acids long, and for DNA
sequences, each word is usually eleven nucleotides long.
Step 2: The second step is to search a database of known sequences to find
any sequences that contain the same words as the query sequence. This is
done to identify database sequences containing the matching words.
Step 3: BLAST then scores the similarity of the matching words. The
matching of the words is scored by a given substitution matrix. If the
score of a word pair is above a certain threshold, it is considered a match.
Two commonly used substitution matrices for protein sequences are PAM
(Point Accepted Mutation) and BLOSUM (Blocks Substitution Matrix).
For nucleotide sequences, the scoring matrix is based on match-mismatch
scoring.
Step 4: The fourth step involves pairwise alignment by extending the
words in both directions while counting the alignment score using the
same substitution matrix. If the score drops below a certain threshold due
to differences in the sequences or mismatches, the alignment stops. The
resulting aligned segment pair without gaps is called the high-scoring
segment pair (HSP).
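The seed-and-extend idea behind these four steps can be sketched as follows. This is a simplified illustration and not the actual BLAST implementation: it assumes exact word matches only (real BLAST also accepts similar words that score above a threshold under the substitution matrix), a plain match/mismatch score, and a drop-off rule with an arbitrary value. All function and parameter names are hypothetical.

```python
def build_word_index(query, word_size):
    """Step 1 (seeding): map every word of the query to its start positions."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    return index


def extend_one_way(query, subject, q, s, step, match, mismatch, drop_off):
    """Ungapped extension from (q, s) in direction `step` (+1 or -1).

    Returns (score gained, number of residues kept). Extension stops when
    the running score falls `drop_off` below the best score seen so far.
    """
    score = best = 0
    length = kept = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, kept = score, length
        elif best - score >= drop_off:
            break
        q += step
        s += step
    return best, kept


def find_hsps(query, subject, word_size=3, match=1, mismatch=-1, drop_off=2):
    """Steps 2 to 4: find word hits in the subject and extend each one."""
    index = build_word_index(query, word_size)
    hsps = set()
    for s in range(len(subject) - word_size + 1):
        for q in index.get(subject[s:s + word_size], []):
            right, r_len = extend_one_way(
                query, subject, q + word_size, s + word_size, 1,
                match, mismatch, drop_off)
            left, l_len = extend_one_way(
                query, subject, q - 1, s - 1, -1, match, mismatch, drop_off)
            score = word_size * match + left + right
            hsps.add((score,
                      query[q - l_len:q + word_size + r_len],
                      subject[s - l_len:s + word_size + r_len]))
    return sorted(hsps, reverse=True)


print(find_hsps("ACGTTGCA", "TTACGTTGCATT", word_size=4))
```

Several word hits on the same diagonal extend to the same segment pair, which is why the result is collected in a set before sorting by score.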

BLAST also calculates a statistical significance value for each alignment.
It is called the E-value or Expect value. The E-value is the number of
matches with an equal or better score that would be expected to occur by
random chance in a database of the same size. A lower E-value indicates
that the sequence match is less likely to be a result of random occurrence.
Hence, the lower the E-value, the higher the level of significance.
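The dependence of the E-value on alignment score and search space can be illustrated with the Karlin-Altschul relation E = K × m × n × e^(−λS), where S is the raw alignment score, m is the query length, and n is the total database length. The sketch below is a simplified illustration: the values of λ and K are placeholder assumptions (in practice they depend on the scoring system), the lengths are invented, and the length corrections applied by BLAST are ignored.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Hypothetical search: a 200-residue query against 1,000,000 database residues.
for s in (30, 50, 70):
    print(s, round(bit_score(s), 1), expect_value(s, 200, 1_000_000))
```

The E-value falls exponentially as the score rises and grows in proportion to the database size, so the same alignment is less significant in a larger database.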


UNIT 5
Methods of Genome analysis- Shot gun and Hierarchical methods. Gene
Prediction using GENEMARK. Phylogenetic tree construction by MEGA
(Molecular Evolutionary Genetics Analysis). Protein Structure visualization
Tool- RasMol and MolMol.

SHOTGUN GENOME SEQUENCING


To sequence a clone longer than the average read length, it is possible to
use a shotgun approach. The idea is to pepper the DNA with sequence reads
such that they overlap, and when assembled, yield the complete sequence
of the clone.
The shotgun part comes from the way the clone is prepared for sequencing:
it is randomly sheared into small pieces (usually about 1 kb) and subcloned
into a "universal" cloning vector. The library of sub-fragments is sampled at
random, and a number of sequence reads generated (using a universal
primer directing sequencing from within the cloning vector). These
sequence reads are then assembled into contigs, and the complete
sequence of the clone generated.

→Making a shotgun library


Genomic DNA is sheared or restricted to yield random fragments of the
required size.


Below is a contig (for “contiguous sequence”). The red-green hybrid in the
center is the original dsDNA to be sequenced. It was broken up, and smaller
pieces cloned into plasmids. The inserts in various randomly chosen
plasmids were then sequenced to give the smaller fragments shown. Note
that it is important to sequence both strands. While this may seem a waste
of effort given the rules of Watson-Crick base pairing, the fact is that certain
areas on one strand may be difficult to sequence accurately (for example,
because of local secondary structure formation). The complementary
strand, however, may sequence well. Using primers from opposite ends will
give you sequence for both strands. Once you have sequenced a bunch of
small fragments, a computer can find regions of overlap (shown as hatch
marks) and properly align them into the complete original sequence.

Sequencing reactions are performed with a universal primer on a random
selection of the clones in the shotgun library. These sequencing reads are
assembled into contigs, identifying gaps (where there is no sequence
available) and single-stranded regions (where there is sequence for only
one strand). The gaps and single-stranded regions are then targeted for
additional sequencing to produce the full sequenced molecule.
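The way a computer finds regions of overlap and merges reads into a contig can be sketched with a simple greedy strategy: repeatedly merge the two sequences with the longest suffix-to-prefix overlap. This is a toy illustration under strong assumptions (error-free reads, all reads from the same strand, no repeats); the reads, the minimum overlap of 3 bases, and the function names are invented for the example, and real assemblers use far more elaborate algorithms.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no overlaps remain.

    Returns the list of contigs; more than one contig means that gaps remain.
    """
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Three hypothetical reads sampled from the sequence ATGCGTACGTTAGC.
print(greedy_assemble(["GTACGTTA", "ATGCGTAC", "GTTAGC"]))
```

If a region is not covered by any read, the function returns several contigs, which corresponds to the gaps that are targeted for additional sequencing.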


1. Sample Collection and DNA Extraction:


- The process begins with the collection of a biological sample containing
the DNA of interest. This sample can be obtained from various sources,
including tissue, blood, saliva, or cultured cells.
- DNA is extracted from the sample using various methods, such as
phenol-chloroform extraction, silica-based column purification, or magnetic
bead-based extraction. The goal is to isolate high-quality genomic DNA free
from contaminants.
2. Fragmentation of DNA:
- The extracted genomic DNA is fragmented into smaller, more
manageable pieces. This fragmentation can be achieved through physical
methods (e.g., sonication, nebulization) or enzymatic methods (e.g.,
restriction enzyme digestion).
- The resulting DNA fragments vary in size and collectively represent the
entire genome of the organism.
3. Library Preparation:
- Adapters or linkers with unique DNA sequences are ligated to the ends
of the fragmented DNA molecules. These adapters serve as recognition sites
for sequencing primers and facilitate the subsequent steps in the
sequencing process.
- The DNA fragments with attached adapters are then amplified through
PCR (Polymerase Chain Reaction) to generate a sequencing library. This
library contains a collection of DNA fragments ready for sequencing.
4. Sequencing:
- The sequencing library is loaded onto a high-throughput sequencing
platform, such as Illumina, PacBio, or Oxford Nanopore sequencing systems.
- The DNA fragments in the library are sequenced in parallel, generating
millions to billions of short sequence reads. Each read corresponds to a
fragment of the original genome.
- The sequencing technology used determines the length and quality of
the sequence reads produced. Short-read sequencing platforms like
Illumina typically produce reads of 100-300 base pairs, while long-read
sequencing platforms like PacBio and Nanopore can generate reads ranging
from thousands to tens of thousands of base pairs.


5. Sequence Assembly:
- After sequencing, the short sequence reads are processed and assembled
into longer contiguous sequences (contigs) using specialized bioinformatics
software and algorithms.
- Assembly involves aligning and overlapping the sequence reads to
reconstruct the original genome sequence. The process is facilitated by the
use of paired-end sequencing, which provides information about the
relative positions of sequence reads within the genome.
- Assembly software uses various algorithms to assemble contigs, resolve
repeats, and generate scaffolds that represent the linear order of contigs.
6. Genome Annotation and Analysis:
- Once assembled, the genome sequence is annotated to identify genes,
regulatory elements, and other functional elements.
- Annotation involves predicting gene locations, identifying coding regions
(exons) and non-coding regions (introns), and annotating regulatory
sequences such as promoters and enhancers.
- The annotated genome sequence is then analyzed to gain insights into
the genetic makeup and biological characteristics of the organism. This may
include studying genetic variation, evolutionary relationships, and the
genetic basis of traits or diseases.
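The random nature of the fragmentation and sequencing steps above can be imitated in a small simulation: reads of a fixed length are drawn from random positions on either strand of a known sequence. This is a toy sketch; the genome is a random string, the read length and read count are arbitrary, and the function names are assumptions made for illustration.

```python
import random


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def sample_reads(genome, n_reads, read_len, rng):
    """Draw reads from random start positions, half of them (on average)
    reported from the opposite strand."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        fragment = genome[start:start + read_len]
        if rng.random() < 0.5:
            fragment = reverse_complement(fragment)
        reads.append(fragment)
    return reads


rng = random.Random(0)  # fixed seed so that the example is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(200))
reads = sample_reads(genome, n_reads=40, read_len=25, rng=rng)
print(len(reads), reads[0])
```

Because the start positions are random, some regions are sampled many times and others not at all, which is why assembly relies on sequencing the genome to several-fold redundancy.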

Applications of Shotgun Sequencing:


1. Whole Genome Sequencing (WGS): Shotgun sequencing is used
extensively for whole genome sequencing projects across various
organisms, including humans, plants, animals, and microorganisms. It
enables researchers to obtain the complete nucleotide sequence of an
organism's genome, providing insights into its genetic makeup, variation,
and evolution.

2. Comparative Genomics: Shotgun sequencing facilitates comparative
genomic studies by sequencing and comparing the genomes of multiple
species or strains. It allows researchers to identify conserved regions, gene
families, and evolutionary changes across different organisms, aiding in the
understanding of genome structure, function, and evolution.


3. De Novo Genome Assembly: Shotgun sequencing is employed for de novo
genome assembly, where the genome of an organism is sequenced without
a reference genome. It involves sequencing fragmented DNA samples,
assembling overlapping reads into contiguous sequences (contigs), and
scaffolding contigs to reconstruct the complete genome sequence.

4. Metagenomics: Shotgun sequencing is used in metagenomics to study the
genetic composition and diversity of microbial communities in
environmental samples, such as soil, water, and the human gut. It enables
the identification and characterization of microbial species, functional
genes, and metabolic pathways present in complex microbial ecosystems.

5. Functional Genomics: Shotgun sequencing facilitates functional genomic
studies by sequencing and analyzing transcribed RNA molecules (RNA-seq).
It allows researchers to quantify gene expression levels, identify alternative
splicing events, and discover novel transcripts, providing insights into gene
regulation and functional annotation.

6. Population Genetics: Shotgun sequencing is applied in population
genetics to study genetic variation and diversity within and between
populations. It enables the identification of single nucleotide
polymorphisms (SNPs), insertion-deletion variants (indels), and structural
variants (SVs), aiding in population structure analysis, demographic
inference, and association studies.

7. Cancer Genomics: Shotgun sequencing is used in cancer genomics to
sequence tumor genomes and identify somatic mutations, chromosomal
aberrations, and driver mutations associated with cancer development and
progression. It aids in understanding the molecular mechanisms of cancer,
patient stratification, and personalized medicine.

8. Epigenomics: Shotgun sequencing is employed in epigenomics to study
epigenetic modifications, such as DNA methylation, histone modifications,
and chromatin accessibility. It enables the mapping of epigenetic marks
across the genome, identifying regulatory elements, and studying their
roles in gene regulation and cellular processes.


9. Phylogenomics: Shotgun sequencing is utilized in phylogenomics to
reconstruct the evolutionary relationships and phylogenetic trees of
organisms based on their genomic sequences. It allows researchers to
compare genome-wide sequence data, infer evolutionary distances, and
elucidate the evolutionary history of species and taxa.

10. Functional Annotation: Shotgun sequencing facilitates functional
annotation of genomes by identifying protein-coding genes, non-coding
RNAs, regulatory elements, and functional elements. It aids in gene
prediction, gene annotation, and pathway analysis, providing valuable
information on gene function, expression, and regulation.

Hierarchical Shotgun Sequencing vs. Whole Genome Shotgun Sequencing


1. Hierarchical Shotgun Sequencing (HGP method):


- Genomic DNA is initially cut into large fragments, typically around 150
kilobases (kb) in size.
- These large fragments are then inserted into bacterial artificial
chromosome (BAC) vectors and cloned into Escherichia coli (E. coli) cells,
where they are replicated and stored.
- The BAC inserts are isolated and mapped to determine the order of each
cloned 150 kb fragment, creating a "Golden Tiling Path."
- Each BAC fragment is further fragmented into smaller pieces, which are
then cloned into plasmids and sequenced on both strands.
- The resulting sequences are aligned to identify overlapping regions,
allowing contiguous pieces to be assembled into a finished sequence.
- Each strand is sequenced approximately four times to achieve 8X
coverage of high-quality data.
2. Shotgun Sequencing (Celera method):
- Shotgun sequencing involves randomly shearing genomic DNA into
smaller fragments, typically ranging from a few hundred to a few thousand
base pairs in length.
- These fragmented DNA pieces are then cloned into plasmids and
sequenced on both strands, without the intermediate step of inserting them
into BAC vectors.
- Once the sequences are obtained, they are aligned and assembled into
the finished genome sequence.
- This method was initially developed and optimized for prokaryotic
genomes, which are smaller in size and contain less repetitive DNA
compared to eukaryotic genomes.
Key Differences:
- In hierarchical shotgun sequencing, the genome is first divided into larger
fragments, which are then cloned into BAC vectors for storage and mapping.
This step is omitted in shotgun sequencing.
- Shotgun sequencing directly shears the genomic DNA into smaller
fragments without the intermediate BAC vector step, making it more
straightforward and efficient.


- Both methods involve sequencing the cloned fragments on both strands
and assembling the sequences into a finished genome, but the initial steps
differ.
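The notion of "8X coverage" used above can be made concrete with a short calculation. Coverage is the total number of sequenced bases divided by the length of the target: c = N × L / G, where N is the number of reads, L is the read length, and G is the target length. The sketch below is an idealized illustration: the 150 kb clone and 500-base reads are assumed figures, the function names are hypothetical, and the estimate of the uncovered fraction, e^(−c), holds only if reads fall uniformly at random.

```python
import math


def coverage(n_reads, read_len, target_len):
    """Average depth c = N * L / G."""
    return n_reads * read_len / target_len


def reads_needed(target_len, read_len, target_coverage):
    """Smallest number of reads that reaches the requested average depth."""
    return math.ceil(target_coverage * target_len / read_len)


def expected_uncovered_fraction(c):
    """Poisson estimate of the fraction of bases hit by no read at depth c."""
    return math.exp(-c)


n = reads_needed(target_len=150_000, read_len=500, target_coverage=8)
print(n, coverage(n, 500, 150_000), expected_uncovered_fraction(8))
```

Under these assumptions, about 2,400 reads give 8X coverage of a single BAC insert, and only a very small fraction of bases is expected to remain unsequenced.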


→Hierarchical Shotgun Sequencing


1. Construction of Large Insert Library:
- DNA Extraction: The process begins with isolating high-quality nuclear
DNA from the organism of interest. Various extraction methods can be used
depending on the organism and the quality of DNA required.
- Vector Selection: Historically, yeast artificial chromosomes (YACs) were
used, but due to technical challenges, bacterial artificial chromosomes
(BACs) or P1 artificial chromosomes (PACs) are preferred nowadays. These
vectors can accommodate large DNA inserts (around 100-200 kb) and are
more stable than YACs.
- Library Preparation: The isolated DNA is then ligated into the chosen
vector to create the large insert library. This involves fragmenting the DNA
into manageable sizes, ligating the fragments into the vector, and
transforming the vectors into suitable host organisms (e.g., bacteria for
BACs/PACs).
2. Creation of Ordered Clone Array:
- Molecular Mapping: A molecular map, consisting of DNA markers
aligned along the chromosomes, is essential for ordering the clones. These
markers are typically sequence tagged sites (STS) or genetic markers.
- Fingerprinting: Each clone in the library is fingerprinted using
restriction enzymes, such as HindIII, to generate a unique pattern of DNA
fragments. This fingerprinting helps identify overlapping clones.
- Clone Overlap Identification: Clones are arranged onto the molecular
map based on their fingerprint patterns and overlapping regions.
Overlapping clones are identified by comparing their fingerprint patterns,
allowing the construction of a physical map representing the genome's
organization.
3. Selection of Clones for Sequencing:
- Minimal Tiling Path: A minimal tiling path is selected from the physical
map, consisting of a minimally overlapping set of clones covering the
entire genome.
Clones are chosen to ensure comprehensive coverage and minimize gaps.
- Quality Control: Each selected clone undergoes thorough quality control
to ensure it is free from chimeric inserts or other artifacts. This typically
involves analyzing the fingerprint pattern to confirm that the clone contains
sequences from a single genomic region.


4. Sequencing and Assembly:


- Cloning and Sequencing: The selected clones are individually subcloned and
sequenced using high-throughput sequencing technologies, such as Sanger
sequencing or next-generation sequencing (NGS) platforms.
- Sequence Assembly: The raw sequence data is then processed and
assembled using specialized bioinformatics algorithms. This involves
aligning and overlapping the sequences to reconstruct the original genome
sequence.
- Accuracy: Sequencing accuracy is crucial, with the human genome
project adopting a high standard (99.99% accuracy) for full shotgun
sequences. Achieving sufficient coverage (e.g., 8-10 fold for a BAC clone)
ensures accurate sequence assembly.
5. Finishing the Sequence:
- Directed Sequencing: Additional sequencing efforts may be required to
address gaps or regions of low quality in the assembled sequence. This can
involve sequencing subclones of problematic regions or using PCR
amplification and direct sequencing with primers designed to flank the
troublesome regions.
- Quality Assessment: The finished sequence undergoes rigorous quality
assessment to ensure accuracy and completeness. This may involve
comparing the sequence to reference genomes and validating critical
regions experimentally.
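The selection of a minimal tiling path in step 3 can be sketched as an interval-covering problem. The sketch assumes that the physical map has already given each clone a start and end coordinate (here in kb); the clone names, coordinates, and function name are invented for illustration, and real projects also check fingerprint evidence for each overlap.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy selection of overlapping clones that cover the whole region.

    `clones` is a list of (name, start, end) tuples. At each step, the clone
    that starts within the covered part and reaches furthest is chosen.
    Raises ValueError if the clones leave a gap.
    """
    path = []
    covered_to = region_start
    while covered_to < region_end:
        candidates = [c for c in clones if c[1] <= covered_to < c[2]]
        if not candidates:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        best = max(candidates, key=lambda c: c[2])
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical BAC clones placed on a 450 kb region of the physical map.
clones = [
    ("A", 0, 150),
    ("B", 100, 250),
    ("C", 120, 300),
    ("D", 280, 450),
    ("E", 200, 320),
]
print(minimal_tiling_path(clones, 0, 450))
```

Clones B and E are left out because clones A, C, and D already cover the region with overlapping ends, so sequencing them would add only redundant data.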


→Whole Genome Shotgun Sequencing


1. Sample Collection:
- Whole genome sequencing begins with the collection of a biological
sample containing the DNA of interest. The sample can be obtained from
various sources such as blood, tissue, saliva, or cultured cells.
- The choice of sample depends on the organism being studied and the
specific research or clinical objectives.
2. DNA Extraction:
- Once the sample is collected, the DNA must be isolated and purified from
other cellular components.
- DNA extraction methods vary depending on the type of sample and the
desired quality and quantity of DNA.
- Common extraction techniques include phenol-chloroform extraction,
silica-based column purification, and magnetic bead-based extraction.
3. Library Preparation:
- The extracted DNA is fragmented into smaller pieces, typically ranging
from a few hundred to a few thousand base pairs in length.
- Adapters with unique DNA sequences are ligated to the ends of the DNA
fragments. These adapters serve as recognition sites for sequencing
primers and help in the subsequent sequencing process.
- The prepared DNA fragments with attached adapters are then amplified
using polymerase chain reaction (PCR) to generate a sequencing library.
4. Sequencing:
- The sequencing library is loaded onto a high-throughput sequencing
platform, such as Illumina, PacBio, or Oxford Nanopore sequencing systems.
- During sequencing, the DNA fragments in the library are sequenced in
parallel, generating millions to billions of short sequence reads.
- The sequencing technology used determines the length and quality of
the sequence reads produced.
5. Sequence Assembly:
- After sequencing, the short sequence reads are processed and assembled
into longer contiguous sequences (contigs) using specialized bioinformatics
software and algorithms.


- Assembly involves aligning and overlapping the sequence reads to
reconstruct the original genome sequence.
- Different assembly strategies may be employed depending on factors
such as genome size, complexity, and sequencing technology used.
6. Genome Annotation:
- Once assembled, the genome sequence is annotated to identify genes,
regulatory elements, and other functional elements.
- Annotation involves predicting gene locations, identifying coding regions
(exons) and non-coding regions (introns), and annotating regulatory
sequences such as promoters, enhancers, and transcription factor binding
sites.
- Functional annotation may also involve comparing the sequenced
genome to reference genomes and databases to identify known genes and
biological pathways.
7. Analysis and Interpretation:
- The annotated genome sequence is then analyzed to gain insights into
the genetic makeup and biological characteristics of the organism.
- Analysis may include identifying genetic variants (e.g., single nucleotide
polymorphisms, insertions, deletions), studying gene expression patterns,
and understanding the genetic basis of traits or diseases.
- Bioinformatics tools and statistical methods are often used to analyze
and interpret the genomic data.
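As a very simple illustration of predicting gene locations during annotation, the sketch below scans the three forward reading frames of an assembled sequence for open reading frames (ORFs), meaning a start codon followed in frame by a stop codon. This is a toy sketch and not the method of any particular gene prediction program: it ignores the reverse strand and introns, and the minimum length of three codons and the function name are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for each ORF on the forward strand.

    Coordinates are 0-based and `end` is exclusive; the stop codon is
    included in the reported sequence and in the codon count.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

Real annotation pipelines combine such signals with statistical models of coding sequence and with comparisons against known genes and proteins.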


Applications of Hierarchical Sequencing:


1. Large Genome Sequencing: Hierarchical genome sequencing is
particularly suitable for sequencing large and complex genomes, such as
those of mammals, plants, and certain microorganisms. It enables the
comprehensive sequencing and assembly of the entire genome by breaking
it down into smaller, manageable fragments.

2. De Novo Genome Assembly: Hierarchical sequencing facilitates de novo
genome assembly by generating a physical map of overlapping clones
covering the entire genome. This map serves as a scaffold for assembling
the individual sequences into contiguous genomic sequences, providing a
complete picture of the genome structure.

3. Structural Variation Analysis: Hierarchical sequencing allows for the
detection and characterization of structural variations in the genome,
including insertions, deletions, inversions, and translocations. It enables the
identification of genomic rearrangements associated with genetic diseases,
cancer, and evolutionary events.

4. Genetic Mapping and Linkage Analysis: Hierarchical sequencing aids in
genetic mapping and linkage analysis by generating a physical map of
clones along chromosomes. It facilitates the localization of genetic markers,
quantitative trait loci (QTLs), and disease-associated genes, providing
insights into the genetic basis of traits and diseases.

5. Comparative Genomics: Hierarchical sequencing enables comparative
genomic studies by sequencing and comparing the genomes of different
species or strains. It facilitates the identification of conserved regions, gene
families, and evolutionary changes across diverse organisms, shedding light
on genome evolution and adaptation.

6. Functional Genomics: Hierarchical sequencing supports functional
genomic studies by providing a comprehensive view of the genome
structure and organization. It aids in the annotation of protein-coding
genes, non-coding RNAs, regulatory elements, and functional elements,
enhancing our understanding of gene function and regulation.


7. Epigenomics: Hierarchical sequencing is used in epigenomic studies to
investigate DNA methylation patterns, histone modifications, and
chromatin accessibility. It enables the mapping of epigenetic marks across
the genome, identifying regulatory elements and studying their roles in
gene expression and cellular processes.

8. Genome-Wide Association Studies (GWAS): Hierarchical sequencing
facilitates GWAS by providing high-resolution genomic data for association
analysis. It enables the identification of genetic variants associated with
complex traits and diseases, aiding in the discovery of novel biomarkers
and therapeutic targets.

9. Population Genomics: Hierarchical sequencing is employed in population
genomics to study genetic variation and diversity within and between
populations. It enables the identification of single nucleotide
polymorphisms (SNPs), insertion-deletion variants (indels), and structural
variants (SVs), providing insights into population structure, demographic
history, and adaptation.

10. Biomedical Research and Precision Medicine: Hierarchical sequencing
has applications in biomedical research and precision medicine by
elucidating the genetic basis of diseases, predicting disease risk, and
guiding personalized treatment strategies. It facilitates the identification of
disease-causing mutations, pharmacogenomic markers, and therapeutic
targets, leading to improved diagnosis and treatment outcomes.


GENE PREDICTION USING GENEMARK


1. Algorithmic Approach:
- GeneMark employs a sophisticated algorithmic approach to gene
prediction, utilizing a combination of statistical models and computational
techniques.
- The algorithm iteratively scans the input DNA sequence, searching for
features indicative of gene structure such as open reading frames (ORFs),
start and stop codons, and splice sites.
- Hidden Markov Models (HMMs) form the backbone of GeneMark's gene
prediction algorithm, providing a probabilistic framework for modeling the
complex structure of genes and their constituent elements.
- By considering the statistical properties of gene features and their
arrangement within genomic sequences, GeneMark is able to infer the most
likely gene structures.


2. Hidden Markov Models (HMMs):


- HMMs are mathematical models used to represent sequences of
observations (e.g., nucleotides in DNA) in terms of a sequence of hidden
states (e.g., coding regions, introns).
- GeneMark's HMMs are trained on large datasets of annotated genes,
allowing them to learn the characteristic patterns and distributions of gene
features.
- These models incorporate transition probabilities between hidden states
and emission probabilities of observed nucleotides, enabling GeneMark to
probabilistically infer gene structures from DNA sequences.
3. Integration of Evidence:
- GeneMark integrates multiple lines of evidence to enhance the accuracy of
gene prediction.
- Sequence similarity to known genes or proteins is assessed using
computational tools like BLAST or profile Hidden Markov Models (pHMMs),
enabling GeneMark to leverage evolutionary conservation.
- Structural features indicative of gene elements, such as promoter
regions, splice sites, and consensus sequences for start and stop codons, are
identified and incorporated into the prediction process.
- By combining these diverse sources of evidence, GeneMark is able to
discriminate between true gene structures and background noise in
genomic sequences more effectively.
4. Model Training and Parameterization:
- GeneMark's performance is influenced by the quality of its underlying
models and parameter settings.
- The initial training of GeneMark's models involves large datasets of
annotated genes from well-studied organisms, allowing the algorithm to
learn the specific characteristics of genes in those organisms.
- Users have the flexibility to adjust various parameters, such as
thresholds for statistical significance and species-specific parameters, to
optimize gene prediction for different genomic contexts.
5. Output and Post-processing:
- GeneMark generates output files containing predicted gene structures,
including the locations of coding sequences, exon-intron boundaries, and
other relevant annotations.


- Post-processing steps may involve filtering out low-confidence
predictions, refining gene boundaries based on additional evidence, and
resolving overlapping gene annotations to produce a final set of
high-quality predictions.
6. Performance Evaluation:
- The accuracy of GeneMark's gene predictions is typically evaluated using
benchmark datasets and standard metrics such as sensitivity, specificity,
and precision.
- Comparative evaluations against other gene prediction methods provide
insights into the strengths and weaknesses of GeneMark in different genomic
contexts, helping researchers assess its suitability for specific applications
and organisms.
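
The evaluation metrics named above can be illustrated at the nucleotide level. The following is a minimal sketch, not part of any gene finder: the function names, the interval coordinates, and the genome length are all made up for illustration. Note that some gene-finding papers report "specificity" as TP/(TP+FP), which is the quantity called precision here.

```python
def positions(intervals):
    """Expand half-open (start, end) intervals into a set of 0-based positions."""
    return {p for start, end in intervals for p in range(start, end)}


def nucleotide_metrics(true_coding, predicted_coding, genome_length):
    """Compare predicted coding positions with annotated coding positions."""
    tp = len(true_coding & predicted_coding)   # coding, predicted coding
    fp = len(predicted_coding - true_coding)   # non-coding, predicted coding
    fn = len(true_coding - predicted_coding)   # coding, missed
    tn = genome_length - tp - fp - fn          # non-coding, predicted non-coding
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical annotation and prediction on a 120-nucleotide sequence
truth = positions([(10, 40), (60, 90)])
predicted = positions([(15, 40), (60, 100)])
metrics = nucleotide_metrics(truth, predicted, genome_length=120)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
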

Gene prediction is a fundamental task in bioinformatics, and numerous
methods and tools have been developed to address this challenge. These
methods employ various computational approaches, including statistical
models, machine learning algorithms, and comparative genomics
techniques, to identify gene structures within genomic sequences. Other
methods of gene prediction include the following:
1. Ab initio Gene Prediction:
- Ab initio gene prediction methods, also known as de novo gene
prediction, rely solely on the analysis of genomic sequences without relying
on external information such as homology to known genes or proteins.
- These methods typically utilize statistical models, such as Hidden
Markov Models (HMMs), to identify characteristic features of gene
structures, including open reading frames (ORFs), start and stop codons,
and splice sites.
- Examples of ab initio gene prediction tools include GENSCAN, FGENESH,
and Augustus.
2. Homology-based Gene Prediction:
- Homology-based gene prediction methods leverage sequence similarity
to known genes or proteins to infer gene structures in target genomic
sequences.
- These methods often involve searching the target genome against
databases of annotated genes or proteins using algorithms like BLAST or
profile Hidden Markov Models (pHMMs).


- By identifying regions of sequence similarity or conservation,
homology-based methods can predict gene structures based on the alignment
of homologous sequences.
- Tools such as GeneWise and Exonerate are commonly used for
homology-based gene prediction.
3. Comparative Genomics:
- Comparative genomics approaches exploit evolutionary conservation of
gene structures across related species to predict genes in target genomes.
- These methods involve aligning genomic sequences from multiple
species and identifying conserved regions that likely correspond to genes.
- Comparative genomics can reveal evolutionarily conserved elements
such as coding sequences, regulatory regions, and non-coding RNAs.
- Tools such as TWINSCAN and N-SCAN use cross-species sequence
conservation directly, while combiners like EVidenceModeler (EVM) can
merge such evidence with other predictions.
4. Transcript-based Gene Prediction:
- Transcript-based gene prediction methods utilize experimental data
such as RNA sequencing (RNA-seq) to identify transcribed regions within
genomic sequences.
- These methods involve aligning RNA-seq reads to the genome and
assembling transcripts, which can then be used to infer gene structures,
including exons, introns, and splice variants.
- Transcript-based approaches are particularly effective for detecting
novel or alternatively spliced genes and refining gene models.
- Popular tools for this workflow include TopHat (spliced alignment of
RNA-seq reads) and Cufflinks and StringTie (transcript assembly).
5. Integration of Multiple Approaches:
- Many gene prediction methods integrate multiple approaches,
combining ab initio predictions, homology-based evidence, comparative
genomics data, and transcriptomic evidence to improve prediction
accuracy.
- By leveraging complementary sources of information, integrated gene
prediction pipelines can produce more reliable and comprehensive gene
annotations.
- Tools such as MAKER and JIGSAW integrate various prediction methods
and evidence types to generate high-quality gene annotations.


ORF Determination:
An open reading frame (ORF) is a stretch of DNA or RNA that begins with a
start codon, ends with a stop codon, and could potentially be translated
into a protein by the cell's machinery.
1. Identify Start Codons: The most common start codon is AUG (encoding
methionine) in eukaryotes, archaea, and bacteria; bacteria also sometimes
use GUG and UUG. Look for these codons in the sequence.
2. Search for Stop Codons: Look for stop codons (UAA, UAG, or UGA in RNA
sequences) that could terminate the translation. An ORF typically extends
from a start codon to a stop codon.
3. Length Consideration: Not all ORFs are significant. ORFs shorter than a
certain threshold are often discarded as they may not encode functional
proteins. The minimum length considered significant varies depending on
the context, but common thresholds are around 100 codons.
4. Frame Selection: An ORF can lie in any of three reading frames on a
strand, depending on where translation starts relative to the sequence. For
double-stranded DNA, all six reading frames (three per strand) need to be
considered.
5. ORF Prediction Tools: There are several bioinformatics tools and
software available for ORF prediction, such as ORFfinder, GeneMark, and
Prodigal. These tools automate the process of identifying ORFs in
nucleotide sequences.
6. Comparative Analysis: Sometimes, comparative genomics can help
identify conserved ORFs, which are more likely to encode functional
proteins.
7. Experimental Verification: Finally, experimental methods such as gene
expression analysis, mass spectrometry, or functional assays are often used
to verify the functionality of predicted ORFs.
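
The length consideration above can be made concrete with a small calculation. This sketch assumes a random sequence in which all 64 codons are equally likely (real genomes deviate from this because of base composition); the function names are illustrative only.

```python
# 3 of the 64 possible codons are stop codons (UAA, UAG, UGA)
STOP_FRACTION = 3 / 64


def expected_codons_between_stops():
    """Average spacing of stop codons in random sequence (about 21 codons)."""
    return 1 / STOP_FRACTION


def prob_open_by_chance(n_codons):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1 - STOP_FRACTION) ** n_codons


print(f"expected spacing: {expected_codons_between_stops():.1f} codons")
for n in (30, 60, 100):
    print(f"P(no stop in {n} codons) = {prob_open_by_chance(n):.4f}")
```

The probability falls quickly with length, which is why longer stop-free frames are treated as more likely to be real coding regions.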
Steps:
1. Translation in All 6 Frames:
- DNA sequences can be translated in all six possible reading frames: three
on the given strand and three on the complementary strand, each read in the
5' to 3' direction. This accounts for the possibility of ORFs occurring in
any reading frame on either strand.
2. Stop Codons Every 20 Codons:
- In random or non-coding sequence, a stop codon (UAA, UAG, or UGA) is
expected roughly once every 20 codons by chance (3 of the 64 codons are
stops, about 1 in 21). A long run of codons without a stop is therefore
unlikely to arise by chance, which helps delineate potential ORFs within
the sequence.


3. Frames Longer Than Thirty Codons Without Interruption:


- ORFs that are longer than thirty codons and lack interruption by stop
codons are suggestive of gene-coding regions. This criterion helps filter out
shorter ORFs that are less likely to encode functional proteins.
4. Confirmation of Putative Frames:
- Putative ORFs are further confirmed by the presence of start codons
(e.g., AUG) and Shine-Dalgarno sequences in bacterial mRNA. These
sequences serve as ribosomal binding sites and are indicative of translation
initiation.
5. Translation and Homology Detection:
- Once putative ORFs are identified and confirmed, the corresponding
protein sequences are translated from the DNA sequence. These protein
sequences can then be used to search against protein sequence databases
(e.g., NCBI's BLAST) to detect homologous sequences, providing a strong
indicator of functional similarity.
6. Transcription Termination Signal:
- Transcription termination signals, particularly in bacteria, often involve
rho-independent terminators. These terminators are characterized by a
stem-loop secondary structure followed by a string of thymine (T) residues.
Identifying these signals aids in predicting the end of gene sequences.
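
The first steps above (six-frame scanning, start and stop codons, a minimum length) can be sketched in a few lines. This is a simplified illustration, not how ORFfinder or GeneMark work internally: it uses the DNA alphabet (ATG, TAA, TAG, TGA), accepts only ATG as a start, and reports coordinates on the minus strand relative to the reverse complement. All names are hypothetical.

```python
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for each ORF on one strand.

    start is the index of the start codon; end is exclusive and includes
    the stop codon. The ORF is counted from the first start codon seen.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Number of codons before the stop, start codon included
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None


def find_orfs(dna, min_codons=30):
    """Scan all six reading frames and return a list of ORF records."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame, start, end in orfs_in_strand(seq, min_codons):
            results.append((strand, frame, start, end, seq[start:end]))
    return results


# Toy sequence with one short ORF in forward frame 2
example = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
for record in find_orfs(example, min_codons=3):
    print(record)
```
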

Hidden Markov Model:


Hidden Markov Models (HMMs) are powerful statistical models used in
various fields, including bioinformatics, speech recognition, natural
language processing, and more. In bioinformatics, HMMs are widely
employed for tasks such as sequence alignment, gene prediction, protein
structure prediction, and motif finding.
1. Markov Models:
- Markov Models are mathematical models that describe a sequence of
events where the probability of each event depends only on the state
attained in the previous event. In other words, they capture dependencies
between sequential observations.
- In a Markov model, there is a set of states, and transitions between these
states are governed by transition probabilities.
2. Hidden Markov Models (HMMs):


- HMMs are a type of Markov model where the underlying system is
assumed to be a Markov process with unobservable (hidden) states that
emit observable symbols or outcomes.
- The model consists of two main components:
a. Hidden States: These are the unobservable states that represent the
underlying system.
b. Observations: These are the visible symbols or outcomes emitted by
each hidden state.
- The transitions between hidden states are governed by transition
probabilities, and each hidden state emits observations according to
emission probabilities.
3. Components of an HMM:
- Transition Probabilities: These describe the likelihood of transitioning
from one hidden state to another.
- Emission Probabilities: These describe the likelihood of emitting each
observation from each hidden state.
- Initial State Probabilities: These describe the probability distribution
over the initial hidden states.
- State Sequence: The sequence of hidden states that generate a given
sequence of observations is not directly observable and is referred to as the
"hidden" sequence.
4. Applications in Bioinformatics:
- Sequence Alignment: HMMs are used in profile Hidden Markov Models
(pHMMs) for sequence alignment tasks such as protein family identification
and multiple sequence alignment.
- Gene Prediction: HMMs are employed to model the structure of genes in
DNA sequences, identifying exons, introns, and regulatory regions.
- Protein Structure Prediction: HMMs can be used to predict protein
secondary structure, fold recognition, and protein domain identification.
- Motif Finding: HMMs are utilized to identify conserved motifs or
patterns within biological sequences, such as DNA binding sites or protein
domains.


5. Training HMMs:
- HMMs are typically trained using the Baum-Welch algorithm, which is the
Expectation-Maximization (EM) algorithm applied to HMMs.
- Training involves estimating the parameters (transition probabilities,
emission probabilities, and initial state probabilities) that maximize the
likelihood of the observed data.
6. Software and Tools:
- Several software packages and libraries are available for working with
HMMs, including HMMER, SAM (Sequence Alignment and Modeling System,
not to be confused with SAMtools), and the Biopython library. These tools
provide implementations for building, training, and applying HMMs in
various bioinformatics tasks.
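
To show how the components of an HMM fit together, here is a minimal sketch of Viterbi decoding, the standard dynamic-programming method for finding the most probable hidden state path. The two states ("C" for coding-like, "N" for non-coding) and every probability below are invented for illustration; they are not parameters of GeneMark or any other tool.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path as a string (log-space)."""
    # best[t][s]: log-probability of the best path that ends in state s at t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))


# Hypothetical toy model: GC-rich "coding" state, AT-rich "non-coding" state
states = ("C", "N")
start_p = {"C": 0.5, "N": 0.5}
trans_p = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
emit_p = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

print(viterbi("GCGCGCATATAT", states, start_p, trans_p, emit_p))
```

Working in log-space avoids numerical underflow on long sequences, where products of many small probabilities would otherwise round to zero.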

K-Order Markov Model:


1. K-order Markov Model:
- In gene prediction, K-order Markov models are utilized to capture the
statistical dependencies between nucleotides in DNA sequences.
- The order of the model, denoted by K, specifies the number of preceding
nucleotides considered when predicting the next nucleotide. For example:
- 0-order (K=0): Each nucleotide is assumed to occur independently,
regardless of surrounding nucleotides. This is typical of non-coding regions
where there is no specific pattern or dependency.
- 1st order (K=1): The probability of a nucleotide depends only on the
preceding nucleotide.
- 2nd order (K=2): The probability of a nucleotide depends on the
preceding two nucleotides.
2. Parameters and Training:
- Parameters of the K-order Markov model are estimated from a set of
DNA sequences with known gene locations.
- The main parameters include:
- Transition probabilities: Probabilities of transitioning from one
nucleotide (or sequence of nucleotides) to another.
- Initial probabilities: Probabilities of starting in a particular state (i.e., a
particular nucleotide or sequence of nucleotides).


- These parameters are typically estimated using maximum likelihood
estimation or similar techniques.
3. Frequency of Nucleotide Sequences in Coding Regions:
- In coding regions (i.e., regions that contain genes), certain nucleotide
sequences occur more frequently than expected by random chance.
- This is due to the presence of codons, which are sequences of three
nucleotides that encode specific amino acids in the genetic code.
- For example, because of codon usage bias, particular codons and codon
pairs (hexamers) occur more often in coding regions than in non-coding
regions, whereas in-frame stop codons (UAA, UAG, UGA) are absent from the
interior of a coding region.
- Additionally, the frequency of certain nucleotide motifs or patterns may
be indicative of regulatory elements or functional regions within genes.
4. Application in Gene Prediction:
- K-order Markov models are used in gene prediction algorithms to
distinguish between coding and non-coding regions of DNA sequences.
- By considering the probability distribution of nucleotides within coding
regions, these models can identify regions that are more likely to contain
genes.
- The models can also incorporate additional features such as the
presence of start and stop codons, splice sites, and other structural
characteristics of genes.
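
The idea of a K-order Markov model can be sketched as follows: estimate the probability of each nucleotide given the K preceding nucleotides from labelled training sequences, then score a new sequence by the log-ratio of its likelihood under a coding model and a non-coding model. The training sequences here are tiny invented examples, and add-one smoothing is an assumption made for this sketch; real tools train on large annotated datasets.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, k, pseudocount=1.0):
    """Estimate P(next base | previous k bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k):
    """Sum of log P(base | context) over the sequence, skipping the first k bases."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


def coding_score(seq, coding_prob, noncoding_prob, k):
    """Log-odds score: positive values favour the coding model."""
    return log_likelihood(seq, coding_prob, k) - log_likelihood(seq, noncoding_prob, k)


K = 2  # second-order model: each base depends on the two before it
coding_training = ["ATGGCCGCCGCCGCCTAA", "ATGGCGGCCGCGGCCTGA"]        # hypothetical
noncoding_training = ["ATATATTTAAATATTTAT", "TTTAAATATATATTTAAA"]     # hypothetical
coding_model = train_markov(coding_training, K)
noncoding_model = train_markov(noncoding_training, K)

for query in ("GCCGCCGCC", "ATATTTAAA"):
    score = coding_score(query, coding_model, noncoding_model, K)
    print(query, round(score, 3))
```
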

Interpolated Markov Model:


Interpolated Markov Models (IMMs) address some of the limitations of
higher-order Markov models, such as the scarcity of data for estimating
parameters accurately in short gene sequences.
1. Interpolated Markov Model (IMM):
- Interpolated Markov Models combine information from Markov models
of varying orders to provide more accurate predictions, especially when
dealing with limited data or short sequences.
- Instead of relying solely on a single high-order Markov model, an IMM
integrates predictions from multiple lower and higher-order Markov
models.


2. Sampling Sequence Patterns:


- The IMM samples a large number of sequence patterns, ranging from
dimers (2-mers) to ninemers (9-mers), covering a wide range of k-values (1
to 8 in this case).
- For each k-mer (sequence pattern), the IMM calculates its probability
based on the observed frequency in the training data.
3. Weighted Scheme:
- The IMM applies a weighted scheme to assign weights to each k-mer
based on its frequency in the training data.
- More weight is assigned to k-mers that occur frequently, indicating a
higher probability, while less weight is assigned to rare k-mers.
- The weights are typically determined using a predefined weighting
function or based on empirical observations.
4. Combining Probabilities:
- The final probability of a hexamer (6-mer) or any k-mer is calculated as
the weighted sum of the probabilities from the models of different orders.
- The weights serve as coefficients in this sum: a higher-order model
receives more weight when its k-mers are well represented in the training
data, and the lower-order models take over when they are rare.
- By combining predictions from multiple models, the IMM leverages
information from both short-range and long-range dependencies in the
sequence data.
5. Advantages of IMM:
- Improved Accuracy: By incorporating information from multiple Markov
models of varying orders, IMM provides more accurate predictions,
especially when dealing with limited data or short sequences.
- Robustness: IMM is more robust to overfitting compared to high-order
Markov models, as it balances between capturing local dependencies and
avoiding parameter estimation issues.
- Flexibility: IMM allows for flexibility in adjusting the weighting scheme
based on the characteristics of the data and the specific task at hand.
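
The interpolation idea can be sketched with a simple recursive back-off. This is a deliberately simplified scheme chosen for illustration: the weight `seen / (seen + damping)` and the `damping` constant are assumptions of this sketch, and production gene finders use more elaborate weighting rules.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(sequences, max_order):
    """Count every (context, next base) pair for context lengths 0..max_order."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for k in range(0, min(max_order, i) + 1):
                counts[(seq[i - k:i], seq[i])] += 1
    return counts


def imm_prob(counts, context, base, damping=5.0):
    """Interpolated P(base | context), blending in shorter contexts."""
    if context == "":
        # Order-0 model with add-one smoothing as the final fallback
        total = sum(counts[("", b)] for b in ALPHABET)
        return (counts[("", base)] + 1) / (total + len(ALPHABET))
    seen = sum(counts[(context, b)] for b in ALPHABET)
    lower = imm_prob(counts, context[1:], base, damping)
    if seen == 0:
        return lower  # context never observed: rely on the lower-order model
    weight = seen / (seen + damping)  # well-observed contexts get more weight
    return weight * counts[(context, base)] / seen + (1 - weight) * lower


training = ["ATGGCCGCCGCCGCCTAA"]  # hypothetical training sequence
kmer_counts = count_kmers(training, max_order=3)
for b in ALPHABET:
    print(b, round(imm_prob(kmer_counts, "GCC", b), 3))
```

Because each level is a convex combination of two probability distributions, the interpolated values for the four bases always sum to one.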


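Applications of GenMark:
The products listed here (ePlex, eSensor XT-8) belong to GenMark Diagnostics,
a clinical molecular diagnostics company; they are distinct from the
GeneMark gene prediction software.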
1. Infectious Disease Testing: GenMark's ePlex® system is utilized for rapid
and accurate detection of infectious diseases caused by bacteria, viruses,
and fungi. This includes respiratory infections (e.g., influenza, respiratory
syncytial virus), bloodstream infections (e.g., sepsis), gastrointestinal
infections (e.g., Clostridioides difficile), and sexually transmitted infections
(e.g., chlamydia, gonorrhea). The system allows for simultaneous testing of
multiple pathogens from a single patient sample, providing timely results to
guide patient management and treatment decisions.
2. Antimicrobial Resistance (AMR) Surveillance: GenMark's ePlex® Blood
Culture Identification (BCID) panels incorporate assays targeting
antimicrobial resistance genes, allowing for the rapid detection of antibiotic
resistance in pathogens causing bloodstream infections. This enables
clinicians to make informed decisions regarding antibiotic selection and
stewardship, contributing to more effective patient care and AMR
surveillance efforts.
3. Respiratory Pathogen Panel: GenMark's ePlex® Respiratory Pathogen
Panel (RPP) is designed to detect a comprehensive panel of respiratory
viruses and bacteria associated with acute respiratory infections. The panel
includes common respiratory pathogens such as influenza viruses,
respiratory syncytial virus (RSV), rhinovirus, adenovirus, and bacterial
pathogens like Streptococcus pneumoniae and Haemophilus influenzae.
Rapid and accurate detection of these pathogens aids in diagnosis,
treatment, and infection control measures.
4. Genetic Testing for Inherited Disorders: GenMark's eSensor® XT-8
system is utilized for genetic testing of inherited disorders, including
carrier screening, prenatal testing, and diagnosis of genetic conditions. The
system allows for the detection of specific genetic variants associated with
conditions such as cystic fibrosis, thalassemia, and familial
hypercholesterolemia. Genetic testing with the eSensor® XT-8 system
provides valuable information for family planning, reproductive counseling,
and personalized healthcare.
5. Oncology Testing: GenMark's oncology panels enable the detection of
genetic mutations and biomarkers associated with cancer diagnosis,
prognosis, and treatment response. These panels can be used for targeted
therapy selection, monitoring of minimal residual disease, and
identification of drug resistance mutations. The ePlex® system allows for
multiplexed testing of cancer-related genes.


PHYLOGENETIC TREE CONSTRUCTION BY MEGAX

→Opening an Alignment
The Alignment Explorer is the tool for building and editing multiple
sequence alignments in MEGA.
Example:
1. Launch the Alignment Explorer by selecting the Align | Edit/Build
Alignment on the launch bar of the main MEGA window.
2. Select Create New Alignment and click OK. A dialog will appear asking
“Are you building a DNA or Protein sequence alignment?” Click the
button labeled “DNA”.
3. From the Alignment Explorer main menu, select Data | Open |
Retrieve sequences from File. Select the "hsp20.fas" file from the
MEGA/Examples directory.

→Aligning Sequences by ClustalW


You can create a multiple sequence alignment in MEGA using either the
ClustalW or Muscle algorithms. Here we align a set of sequences using the
ClustalW option.
Example:
1. Select the Edit | Select All menu command to select all sites for every
sequence in the data set.
2. Select Alignment | Align by ClustalW from the main menu to align the
selected sequence data using the ClustalW algorithm. Click the “OK”
button to accept the default settings for ClustalW.
3. Once the alignment is complete, save the current alignment session
by selecting Data | Save Session from the main menu. Give the file an
appropriate name, such as "hsp20_Test.mas". This will allow the
current alignment session to be restored for future editing.
4. Exit the Alignment Explorer by selecting Data | Exit Aln Explorer
from the main menu.
→Obtaining Sequence Data from the Internet (GenBank)
Using MEGA’s integrated browser you can fetch GenBank sequence data
from the NCBI website if you have an active internet connection.
Example:
1. From the main MEGA window, select Align | Edit/Build Alignment
from the main menu.
2. When prompted, select Create New Alignment and click OK, then select
DNA. Activate MEGA’s integrated browser by selecting Web | Query
GenBank from the main menu.


3. When the NCBI: Nucleotide site is loaded, enter CFS as a search term
into the search box at the top of the screen. Press the Search button.
4. When the search results are displayed, check the box next to any
item(s) you wish to import into MEGA.
5. If you have checked one box: Locate the dropdown menu labeled
Display Settings (located near the top left hand side of the page
directly under the tab headings). Change its value to FASTA and then
click Apply. The page will reload with all the search results in
FASTA format.
6. If you have checked more than one box: locate the Display Settings
dropdown (located near the top left hand side of the page directly
under the tab headings).
7. Change the value to FASTA (Text) and click the Apply button. This
will output all the sequences you selected as a text in the FASTA
format.
8. Press the Add to Alignment button (with the red + sign) located
above the web address bar. This will import the sequences into the
Alignment Explorer.
9. With the data now displayed in the Alignment Explorer, you can close
the Web Browser window.
10. Align the new data using the steps detailed in the previous
examples.
11. Close the Alignment Explorer window by clicking Data | Exit
Aln Explorer. Select No when asked if you would like to save the
current alignment session to file.
→Constructing a phylogenetic tree
Step 1:
1. Install the MEGAX software on your device.
2. Launch the software.
Step 2:
1. Click the “Align” button.
2. Select “Edit/Build Alignment” option.
3. A dialogue box appears, select “Create a new alignment” and click
“OK”.
Step 3:
1. An Alignment Explorer window opens. A dialogue box asks whether you
are building a DNA or protein sequence alignment. Select
“Protein”.


2. A new window opens.


3. Copy the required sequences (5+) from any protein database like
NCBI in FASTA format and paste it in the alignment explorer window.
4. A default “sequence 1” appears, delete it.
Step 4:
1. Under the “Alignment” menu, select the MUSCLE alignment option and
then “Align Protein”.
2. In the options dialog, change the cluster method from UPGMA to
Neighbor Joining.
3. Select “OK”.
4. The sequences are aligned.
Step 5:
1. Open the “Data” menu and select “Export Alignment” > “FASTA format”.
2. Save the file as a FASTA file on your device.
Step 6:
1. Now, close the window and select the “Phylogeny” tab.
2. Select the “Construct/Test Neighbor-Joining Tree” option.
3. Select the FASTA format of the protein alignment sequence.
4. In the “input data options” select “Protein sequences” and click “OK”.
Step 7:
1. An “Analysis Preferences” dialogue box appears.
2. Select “Bootstrap method” as the test of phylogeny, “Poisson model” as
the substitution model, and “Complete deletion” for the treatment of gaps
and missing data.
3. Click “OK”.
Step 8:
1. Within a few seconds, a phylogenetic tree is constructed.
2. You can resize the tree, make adjustments, change the type of
branches, etc.
3. Finally, in the “Image” menu, select the format in which you want to
save the phylogenetic tree.
4. Save the file on your device in the required format.
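
The “Poisson model” and “Complete deletion” options chosen in Step 7 can be illustrated with a short calculation. The sketch below is not MEGA's code: it removes every alignment column containing a gap, computes the proportion of differing sites (the p-distance), and applies the Poisson correction d = -ln(1 - p). The three aligned protein fragments are invented, and the function names are hypothetical.

```python
import math


def complete_deletion(alignment):
    """Drop every column that contains a gap ('-') or unknown residue ('X')."""
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] not in "-X" for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def poisson_distance(a, b):
    """Poisson-corrected distance; undefined when every site differs (p = 1)."""
    return -math.log(1 - p_distance(a, b))


def distance_matrix(alignment):
    """Pairwise Poisson distances as a list of lists."""
    n = len(alignment)
    return [[poisson_distance(alignment[i], alignment[j]) for j in range(n)]
            for i in range(n)]


alignment = ["MKT-LLVA", "MKTALLIA", "MRTALHVA"]  # hypothetical aligned fragments
cleaned = complete_deletion(alignment)
matrix = distance_matrix(cleaned)
for row in matrix:
    print([round(d, 3) for d in row])
```

A distance matrix of this kind is the input that the neighbor-joining method uses to build the tree.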

→Printing the NJ Tree (For Windows users)


Example:
1. Select the File | Print option from the Tree Explorer main menu to
bring up a standard Print window. This will print the tree full-sized
and may take multiple sheets of paper.
2. Press Cancel.


3. To restrict the size of the printed tree to a single sheet of paper,
choose the File | Print in a Sheet command from the Tree Explorer
main menu. Press OK.
4. Select the File | Exit Tree Explorer command to exit the Tree
Explorer. Click the OK button to close the Tree Explorer without
saving the tree session.

Applications of MEGAX:

1. Phylogenetic Tree Construction: MEGA X allows users to construct
phylogenetic trees based on molecular sequence data, such as DNA, RNA, or
protein sequences. These trees help in visualizing the evolutionary
relationships between different species or taxa, aiding in evolutionary and
taxonomic studies.

2. Sequence Alignment: MEGA X facilitates multiple sequence alignment,
which is crucial for comparing sequences from different species or
individuals. It helps in identifying conserved regions, sequence motifs, and
variations within and between sequences.

3. Evolutionary Distance Estimation: MEGA X enables the estimation of
evolutionary distances between sequences using various models, such as
the Jukes-Cantor, Kimura, and Tamura-Nei models. These distances provide
insights into the genetic divergence and evolutionary rates of sequences.

4. Phylogenetic Inference Methods: MEGA X implements various
phylogenetic inference methods, including neighbor-joining, UPGMA,
minimum evolution, maximum parsimony, and maximum likelihood. These
methods allow users to infer the evolutionary history of sequences and
estimate branch lengths and node support values in phylogenetic trees.

5. Bootstrap Analysis: MEGA X performs bootstrap analysis to assess the
robustness of phylogenetic trees by generating multiple replicate datasets
through resampling. It helps in estimating the confidence levels or support
values for different branches in the tree.

6. Molecular Evolutionary Analysis: MEGA X provides tools for conducting
molecular evolutionary analyses, such as tests of selection, codon-based
analyses, and ancestral sequence reconstruction. These analyses help in
understanding the mechanisms of molecular evolution and adaptation at
the sequence level.


7. Population Genetics Studies: MEGA X can be used for population genetics
studies by analyzing genetic variation within and between populations. It
facilitates the estimation of genetic diversity, population structure, and
gene flow based on molecular sequence data.

8. Phylogenomic Analysis: MEGA X supports phylogenomic analysis by
integrating genomic data from multiple loci or genes to reconstruct
large-scale phylogenies. It helps in resolving complex evolutionary
relationships and inferring the evolutionary history of organisms using
genome-wide data.

9. Comparative Genomics: MEGA X enables comparative genomics studies
by analyzing sequence conservation, gene synteny, and evolutionary
patterns across different genomes. It helps in identifying orthologous and
paralogous genes, gene families, and evolutionary breakpoints.
10. Educational and Research Tool: MEGA X serves as an educational and
research tool for students, educators, and researchers in the fields of
evolutionary biology, genetics, and bioinformatics. Its user-friendly
interface, comprehensive features, and visualization capabilities make it
accessible for a wide range of users to explore and analyze molecular
evolutionary genetics data.
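
The Jukes-Cantor distance and the bootstrap resampling mentioned in the list above can be illustrated with a short Python sketch. This is not MEGA X's own code; the function names and the handling of gaps ('-' sites are skipped) are assumptions made for illustration. The Jukes-Cantor distance is computed from the proportion p of differing sites as d = -(3/4) ln(1 - 4p/3), and a bootstrap replicate is built by sampling alignment columns with replacement.

```python
import math
import random


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def bootstrap_replicate(alignment, rng):
    """Build one replicate by sampling alignment columns with replacement."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


if __name__ == "__main__":
    toy_alignment = ["ACGTACGT", "ACGAACGT", "ACGAACCT"]  # made-up data
    rng = random.Random(0)
    print(jukes_cantor_distance(toy_alignment[0], toy_alignment[1]))
    print(bootstrap_replicate(toy_alignment, rng))
```

In a full bootstrap analysis, a tree would be built from each replicate and the support for a branch would be the fraction of replicate trees that contain it.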

PROTEIN VISUALIZATION TOOLS

RASMOL
RasMol is a molecular graphics program intended for the visualization of
proteins, nucleic acids and small molecules. The program is aimed at
display, teaching and generation of publication quality images. The program
reads in molecular co-ordinate files and interactively displays the molecule
on the screen in a variety of representations and colour schemes.
Supported input file formats include Brookhaven Protein Databank (PDB),
Tripos Associates' Alchemy and Sybyl Mol2 formats, Molecular Design
Limited's (MDL) Mol file format, Minnesota Supercomputer Centre's (MSC)
XYZ (XMol) format and CHARMm format files.

→Running RasMol Under Microsoft Windows

To start RasMol under Microsoft Windows, double click on the RasMol icon
in the program manager. When RasMol first starts, the program displays a
single main window (the display window) with a black background on the
screen and provides the command line window minimized as a small icon
at the bottom of the screen. The command line or terminal window may be
opened by double clicking on this RasMol icon. It is possible to specify
either a co-ordinate filename, a script filename or both on the Windows
command line. The format for specifying a script file is to add the option
'-script <filename>' to the command line. A molecule co-ordinate file may be
specified by placing its name on the command line, optionally preceded by a
file format option. If no format option is given, the specified co-ordinate
file is assumed to be in PDB format. Valid format options include '-pdb',
'-mdl', '-mol2', '-xyz', '-alchemy' and '-charmm', which correspond to
Brookhaven, MDL Mol file, Sybyl Mol2, MSC's xyz, Alchemy and CHARMm formats
respectively. If both a co-ordinate file and a script file are specified on
the command line, the molecule is loaded first, then the script commands are
applied to it. If either file is not found, the program displays the error
message 'Error: File not found!' and the user is presented the RasMol prompt.
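
The command line rules above can be sketched as a small Python helper that assembles an argument list. The executable name passed in, the example file names and the position of the '-script' option after the co-ordinate file are assumptions made for illustration; only the option names themselves come from the text.

```python
# Format options listed in the text; '-pdb' is the default when none is given.
FORMAT_OPTIONS = ("pdb", "mdl", "mol2", "xyz", "alchemy", "charmm")


def build_rasmol_args(executable, coord_file=None, file_format=None,
                      script_file=None):
    """Return a RasMol argument list following the rules described above.

    `executable` is whatever name RasMol has on the local system (assumed).
    """
    args = [executable]
    if file_format is not None:
        if coord_file is None:
            raise ValueError("a format option needs a co-ordinate file")
        if file_format not in FORMAT_OPTIONS:
            raise ValueError("unknown format option: " + file_format)
        # The format option precedes the co-ordinate file name.
        args.append("-" + file_format)
    if coord_file is not None:
        args.append(coord_file)
    if script_file is not None:
        args.extend(["-script", script_file])
    return args


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting list.
    print(build_rasmol_args("rasmol", "protein.mol2", "mol2", "view.scr"))
```

The resulting list could be handed to `subprocess.run` on a system where RasMol is installed.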

→RasMol's Window

On all platforms RasMol displays two windows, the main graphics or canvas
window with a black background and a command line or terminal window. At the
top of the graphics window (or at the top of the screen for the Macintosh) is
the RasMol menu bar. The contents of the menu bar change from platform to
platform to support the local user interface guidelines; however, all
platforms support the 'File', 'Display', 'Colours', 'Export', 'Options' and
'Settings' pull-down menus. The main graphics window also has two scroll
bars, one on the right and one at the bottom, that may be used to rotate the
molecule interactively.

While the mouse pointer is located within the graphics area of the main
display window, the mouse pointer is drawn as a cross-hair cursor, to
enable the 'picking' of objects being displayed; otherwise the mouse
pointer is drawn as an arrowhead. Any characters that are typed at the
keyboard while the display window is in 'focus' (meaning active or
foreground) are redirected to the command line in the terminal window.
Hence you do not need continually to switch focus between the command
line and graphics windows.

→Mouse Controls

Action            Windows        Macintosh

Rotate X, Y       Left           Unmodified
Translate X, Y    Right          Command
Rotate Z          Shift-Right    Shift-Command
Zoom              Shift-Left     Shift
Slab Plane        Ctrl-Left      Ctrl

→Scroll Bars

The scroll bar across the bottom of the canvas area is used to rotate the
molecule about the y-axis, i.e. to spin the nearest point on the molecule left
or right; and the scroll bar to the right of the canvas rotates the molecule
about the x-axis, i.e. the nearest point up or down. Each scroll bar has an
'indicator' to denote the relative orientation of the molecule, which is
initially positioned in the centre of the scroll bar. These scroll bars may be
operated in either of two ways. The first is by clicking any mouse button on
the dotted scroll bar background to indicate a direct rotation relative to the
current indicator position; the second is by clicking one of the arrows at
either end of the scroll bar to rotate the molecule in fixed sized increments.
Rotating the molecule by the second method may cause the indicators on
the scroll bars to wrap around from one end of the bar to the other. A
complete revolution is indicated by the indicator travelling the length of the
scroll bar. The angle rotated by using the arrows depends upon the current
size of the display window.

→Picking

In order to identify a particular atom or bond being displayed, RasMol
allows users to 'pick' objects on the screen. The mouse is used to
position the cross-hair cursor over the appropriate item, and then any of
the mouse buttons is depressed. Provided that the pointer is located close
enough to a visible object, the program determines the identity of the
nearest atom to the point identified.

→Backbone

Syntax: backbone {<boolean>}
backbone <value>
backbone dash

The RasMol 'backbone' command permits the representation of a polypeptide
backbone as a series of bonds connecting the adjacent alpha carbons of each
amino acid in a chain. The display of these backbone 'bonds' is turned on and
off by the command parameter in the same way as with the 'wireframe' command.
The command 'backbone off' turns off the selected 'bonds', and 'backbone on'
or with a number turns them on. The number can be used to specify the
cylinder radius of the representation in either Angstrom or RasMol units. A
parameter value of 500 (2.0 Å) or above results in a "Parameter value too
large" error.
Backbone objects may be coloured using the RasMol 'colour backbone'
command.

The reserved word backbone is also used as a predefined set ("help sets")
and as a parameter to the 'set hbond' and 'set ssbond' commands. The
RasMol command 'trace' renders a smoothed backbone, in contrast to
'backbone' which connects alpha carbons with straight lines.

The backbone may be displayed with dashed lines by use of the 'backbone
dash' command.
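
The relation between Angstroms and RasMol units (1 RasMol unit = 1/250 of an Angstrom, so 500 units = 2.0 Å) can be made concrete with a short Python sketch. The function names are hypothetical, and the validity check only encodes the limit stated above for the 'backbone' radius.

```python
RASMOL_UNITS_PER_ANGSTROM = 250
BACKBONE_RADIUS_LIMIT = 500  # 500 (2.0 Angstrom) or above is rejected


def angstrom_to_rasmol_units(angstroms):
    """Convert a length in Angstroms to the nearest whole RasMol unit."""
    return round(angstroms * RASMOL_UNITS_PER_ANGSTROM)


def rasmol_units_to_angstrom(units):
    """Convert a length in RasMol units to Angstroms."""
    return units / RASMOL_UNITS_PER_ANGSTROM


def backbone_radius_is_valid(units):
    """True if the radius is below the 'Parameter value too large' limit."""
    return 0 <= units < BACKBONE_RADIUS_LIMIT


if __name__ == "__main__":
    radius = angstrom_to_rasmol_units(0.4)
    print(radius, backbone_radius_is_valid(radius))
```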

→Background
Syntax: background <colour>

The RasMol 'background' command is used to set the colour of the "canvas"
background. The colour may be given as either a colour name or a comma
separated triple of Red, Green and Blue (RGB) components enclosed in
square brackets. Typing the command 'help colours' will give a list of the
predefined colour names recognized by RasMol. When running under X
Windows, RasMol also recognizes colours in the X server's colour name
database.

The 'background' command is synonymous with the RasMol 'set
background' command.

→Cartoon
Syntax: cartoon {<number>}

The RasMol 'cartoon' command displays a molecule's 'ribbons' as
Richardson (MolScript) style protein 'cartoons', implemented as thick
(deep) ribbons. The easiest way to obtain a cartoon representation of a
protein is to use the 'Cartoons' option on the 'Display' menu. The 'cartoon'
command represents the currently selected residues as a deep ribbon with
width specified by the command's argument. Using the command without a
parameter results in the ribbon's width being taken from the protein's
secondary structure, as described in the 'ribbons' command. By default, the
C-termini of beta-sheets are displayed as arrow heads. This may be enabled
and disabled using the 'set cartoons' command. The depth of the cartoon
may be adjusted using the 'set cartoons <number>' command. The 'set
cartoons' command without any parameters returns these two options to
their default values.

→Centre
Syntax: centre {<expression>}
center {<expression>}

The RasMol 'centre' command defines the point about which the 'rotate'
command and the scroll bars rotate the current molecule. Without a
parameter the centre command resets the centre of rotation to be the
centre of gravity of the molecule. If an atom expression is specified, RasMol
rotates the molecule about the centre of gravity of the set of atoms
specified by the expression. Hence, if a single atom is specified by the
expression, that atom will remain 'stationary' during rotations.

Type 'help expression' for more information on RasMol atom expressions.

Alternatively, the centering may be given as a comma separated triple of
[CenX, CenY, CenZ] offsets in RasMol units (1/250 of an Angstrom) from
the centre of gravity. The triple must be enclosed in square brackets.
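
As a rough illustration of what a centre of rotation means here, the following Python sketch computes a centre of gravity for a set of atom co-ordinates and applies a [CenX, CenY, CenZ] offset given in RasMol units. It assumes an unweighted mean of the atom positions (no atomic masses) and co-ordinates in Angstroms; both are assumptions, not a statement of how RasMol computes it internally.

```python
RASMOL_UNITS_PER_ANGSTROM = 250


def centre_of_gravity(coords):
    """Unweighted mean position of a list of (x, y, z) tuples."""
    if not coords:
        raise ValueError("at least one atom is required")
    n = len(coords)
    return tuple(sum(atom[i] for atom in coords) / n for i in range(3))


def offset_centre(centre, offset_units):
    """Shift a centre (in Angstroms) by an offset given in RasMol units."""
    return tuple(c + o / RASMOL_UNITS_PER_ANGSTROM
                 for c, o in zip(centre, offset_units))


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)]  # made-up co-ordinates
    cog = centre_of_gravity(atoms)
    print(cog)
    print(offset_centre(cog, (250, 0, -500)))
```

With a single atom in the list, the centre of gravity is that atom's own position, which is why a single selected atom stays stationary during rotation.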

→Clipboard
Syntax: clipboard

The RasMol 'clipboard' command places a copy of the currently displayed
image on the local graphics 'clipboard'. Note: this command is not yet
supported on UNIX or VMS machines. It is intended to make transferring
images between applications easier under Microsoft Windows or on an
Apple Macintosh.

When using RasMol on a UNIX or VMS system this functionality may be
achieved by generating a raster image in a format that can be read by the
receiving program using the RasMol 'write' command.

→Colour
Syntax: colour {<object>} <colour>
color {<object>} <colour>

Colour the atoms (or other objects) of the selected region. The colour may
be given as either a colour name or a comma separated triple of Red, Green
and Blue (RGB) components enclosed in square brackets. Typing the
command 'help colors' will give a list of all the predefined colour names
recognised by RasMol.

Allowed objects are 'atoms', 'bonds', 'backbone', 'ribbons', 'labels', 'dots',
'hbonds' and 'ssbonds'. If no object is specified, the default keyword 'atom'
is assumed. Some colour schemes are defined for certain object types. The
colour scheme 'none' can be applied to all objects except atoms and dots,
stating that the selected objects have no colour of their own, but use the
colour of their associated atoms (i.e. the atoms they connect). 'Atom'
objects can also be coloured by 'alt', 'amino', 'chain', 'charge', 'cpk', 'group',
'model', 'shapely', 'structure', 'temperature' or 'user'. Hydrogen bonds can
also be coloured by 'type' and dot surfaces can also be coloured by
'electrostatic potential'.

→H Bonds
Syntax: hbonds {<boolean>}
hbonds <value>

The RasMol 'hbond' command is used to represent the hydrogen bonding
of the protein molecule's backbone. This information is useful in assessing
the protein's secondary structure. Hydrogen bonds are represented as
either dotted lines or cylinders between the donor and acceptor residues.
The first time the 'hbond' command is used, the program searches the
structure of the molecule to find hydrogen bonded residues and reports the
number of bonds to the user. The command 'hbonds on' displays the
selected 'bonds' as dotted lines, and the 'hbonds off' turns off their display.
The colour of hbond objects may be changed by the 'colour hbond'
command. Initially, each hydrogen bond has the colours of its connected
atoms.

By default, the dotted lines are drawn between the accepting oxygen and
the donating nitrogen. By using the 'set hbonds' command the alpha carbon
positions of the appropriate residues may be used instead. This is
especially useful when examining proteins in backbone representation.

→Stereo
Syntax: stereo on
stereo <number>
stereo off

The RasMol 'stereo' command provides side-by-side stereo display of
images. Stereo viewing of a molecule may be turned on (and off) either by
selecting 'Stereo' from the 'Options' menu, or by typing the commands
'stereo on' or 'stereo off'. The separation angle between the two views may
be adjusted with the 'set stereo [-] <number>' command, where positive
values result in crossed eye viewing and negative values in relaxed (wall-
eyed) viewing. The inclusion of '[-] <number>' in the 'stereo' command, as
for example in 'stereo 3' or 'stereo -5', also controls angle and direction.

The stereo command is only partially implemented. When stereo is turned
on, the image is not properly recentred. (This can be done with a 'translate
x -<number>' command.) It is not supported in vector PostScript output
files, is not saved by the 'write script' command, and in general is not yet
properly interfaced with several other features of the program.

→Zap
Syntax: zap

Deletes the contents of the current database and resets parameter
variables to their initial default state.

→Zoom
Syntax: zoom {<boolean>}
zoom <value>

Change the magnification of the currently displayed image. Boolean
parameters either magnify or reset the scale of the current molecule. An
integer parameter specifies the desired magnification as a percentage of
the default scale. The minimum parameter value is 10; the maximum
parameter value is dependent upon the size of the molecule being
displayed. For medium sized proteins this is about 500.

→Colour Schemes
The RasMol 'colour' command allows different objects (such as atoms,
bonds and ribbon segments) to be given a specified colour. Typically, this
colour is either a RasMol predefined colour name or an RGB triple.
Additionally RasMol also supports 'alt', 'amino', 'chain', 'charge', 'cpk',
'group', 'model', 'shapely', 'structure', 'temperature' or 'user' colour
schemes for atoms, and 'hbond type' colour scheme for hydrogen bonds
and 'electrostatic potential' colour scheme for dot surfaces. The 24
currently predefined colour names are listed below with their
corresponding RGB triplet and hexadecimal value.

Predefined colour   RGB Values       Hexadecimal

Black               [  0,  0,  0]    000000
Blue                [  0,  0,255]    0000FF
BlueTint            [175,214,255]    AFD7FF
Brown               [175,117, 89]    AF7559
Cyan                [  0,255,255]    00FFFF
Gold                [255,156,  0]    FC9C00
Grey                [125,125,125]    7D7D7D
Green               [  0,255,  0]    00FF00
GreenBlue           [ 46,139, 87]    2E8B57
GreenTint           [152,255,179]    98FFB3
HotPink             [255,  0,101]    FF0065
Magenta             [255,  0,255]    FF00FF
Orange              [255,165,  0]    FFA500
Pink                [255,101,117]    FF6575
PinkTint            [255,171,187]    FFABBB
Purple              [160, 32,240]    A020F0
Red                 [255,  0,  0]    FF0000
RedOrange           [255, 69,  0]    FF4500
SeaGreen            [  0,250,109]    00FA6D
SkyBlue             [ 58,144,255]    3A90FF
Violet              [238,130,238]    EE82EE
White               [255,255,255]    FFFFFF
Yellow              [255,255,  0]    FFFF00
YellowTint          [246,246,117]    F6F675
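
The link between an RGB triple and its hexadecimal value can be shown with a minimal Python sketch: each component (0 to 255) becomes two hexadecimal digits. The helper names are hypothetical; the bracketed form returned by `rasmol_colour_argument` follows the "comma separated triple enclosed in square brackets" described for the 'colour' and 'background' commands.

```python
def rgb_to_hex(red, green, blue):
    """Return the six-digit hexadecimal value of an RGB triple."""
    for component in (red, green, blue):
        if not 0 <= component <= 255:
            raise ValueError("RGB components must lie between 0 and 255")
    return f"{red:02X}{green:02X}{blue:02X}"


def hex_to_rgb(hex_value):
    """Return the (red, green, blue) triple of a six-digit hex value."""
    if len(hex_value) != 6:
        raise ValueError("expected six hexadecimal digits")
    return tuple(int(hex_value[i:i + 2], 16) for i in (0, 2, 4))


def rasmol_colour_argument(red, green, blue):
    """Format an RGB triple in square brackets, e.g. for 'colour [r,g,b]'."""
    return f"[{red},{green},{blue}]"


if __name__ == "__main__":
    print(rgb_to_hex(160, 32, 240))            # Purple row of the table
    print(hex_to_rgb("FF4500"))                # RedOrange row of the table
    print("background " + rasmol_colour_argument(255, 165, 0))
```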

Applications of RasMol:

1. Structural Biology Research: RasMol is extensively used by structural
biologists to visualize and analyze protein and nucleic acid structures
obtained from experimental methods like X-ray crystallography and NMR
spectroscopy. It aids in understanding the three-dimensional arrangement
of atoms and elucidating the structural basis of biological functions.

2. Drug Discovery and Design: RasMol is employed in drug discovery for
studying protein-ligand interactions, identifying potential drug binding
sites, and designing novel therapeutic agents. It facilitates virtual screening,
molecular docking, and structure-based drug design approaches, helping in
the development of new drugs and pharmaceuticals.

3. Protein Engineering: RasMol is utilized in protein engineering to study
structure-function relationships, predict the effects of mutations, and
design proteins with desired properties. It assists in rational protein
design, protein engineering experiments, and optimization for various
biotechnological applications.

4. Molecular Dynamics Simulations: RasMol supports the visualization of
molecular dynamics simulations, allowing researchers to animate the
movement of atoms and residues over time. It aids in studying
conformational changes, protein-ligand interactions, and dynamic
behaviors of biomolecules during simulations.

5. Bioinformatics Analysis: RasMol is integrated into bioinformatics
pipelines for analyzing and visualizing molecular structures in the context
of sequence data. It facilitates comparative structural analysis, structural
alignment, and the interpretation of sequence-structure relationships.

6. Educational Tools: RasMol serves as an educational tool for teaching
molecular visualization, structural biology principles, and bioinformatics
concepts. It provides students with hands-on experience in exploring
molecular structures and understanding biomolecular functions in
educational settings.

7. Structural Genomics: RasMol is used in structural genomics projects for
visualizing and analyzing the three-dimensional structures of proteins
encoded by genes. It aids in structural annotation, functional
characterization, and the prediction of protein structure and function from
genomic data.

8. Phylogenetic Analysis: RasMol is employed in phylogenetic analysis to
visualize and compare the structures of homologous proteins across
different species or evolutionary lineages. It helps in studying evolutionary
relationships, identifying conserved structural motifs, and inferring
functional divergence.

9. Enzyme Mechanisms: RasMol is utilized in enzymology research for
studying enzyme mechanisms and catalytic reactions. It aids in visualizing
enzyme active sites, substrate binding pockets, and transition states,
facilitating the elucidation of reaction mechanisms and the design of
enzyme inhibitors.

10. Biomedical Visualization: RasMol is applied in biomedical research for
visualizing molecular structures relevant to human health and disease. It
aids in studying the structural basis of genetic disorders, protein misfolding
diseases, and drug-target interactions, contributing to the understanding
and treatment of various medical conditions.

MOLMOL
MOLMOL is a program for displaying, analyzing, and manipulating
molecules. MOLMOL was first developed under the name COSMOS, but it
had to be renamed due to a name collision with a different program. The
program was designed to be as general as possible. However, there are some
functions that make it especially useful for studying structures of
macromolecules obtained by NMR.
MOLMOL has a graphical user interface with menus, dialog boxes, and on-
line help. The display possibilities include conventional presentation, as
well as novel schematic drawings, with the option of combining different
presentations in one view of a molecule. Covalent molecular structures can
be modified by addition or removal of individual atoms and bonds, and
three-dimensional structures can be manipulated by interactive rotation
about individual bonds. Special efforts were made to allow for appropriate
display and analysis of the sets of typically 20-40 conformers that are
conventionally used to represent the result of an NMR structure
determination, using functions for superimposing sets of conformers,
calculation of root mean square distance (RMSD) values, identification of
hydrogen bonds, checking and displaying violations of NMR constraints,
and identification and listing of short distances between pairs of hydrogen
atoms.
The following options are recognized by this shell script:

→Main Window Layout

→Executing commands
All commands can be executed by selecting them in a pulldown menu.
Commands that need arguments will ask for them using a dialog box.
Users that prefer keyboard input can enter commands on the command
line. Parts of commands that are unique are completed automatically. It is
possible to also enter the command arguments on the command line. If no
or only part of the arguments are given, a dialog box will appear.
Some frequently used commands can also be found in the popup menu
associated with the right mouse button.
Some commands have keyboard accelerators. They can be seen in the
pulldown menu. These commands can be executed by using the
corresponding key combination anywhere in the main window.
The program executes commands that it receives on standard input. This
can be used to couple MOLMOL with other programs, e.g. by writing a
program (shell script) that generates commands and then piping the output
of this program into MOLMOL.
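
The coupling through standard input described above can be sketched in Python as a small generator program whose output is piped into MOLMOL. The command strings below are deliberately hypothetical placeholders, not real MOLMOL commands, and the one-command-per-line layout is an assumption; real command names should be taken from the MOLMOL manual.

```python
import sys


def emit_commands(commands, stream=sys.stdout):
    """Write each non-empty command on its own line; return the count."""
    written = 0
    for command in commands:
        line = command.strip()
        if not line:
            continue  # skip blank entries
        stream.write(line + "\n")
        written += 1
    return written


if __name__ == "__main__":
    # Placeholder text only: replace with commands from the MOLMOL manual.
    placeholder_commands = [
        "FirstPlaceholderCommand some_argument",
        "SecondPlaceholderCommand",
    ]
    emit_commands(placeholder_commands)
```

Saved under a hypothetical name such as make_commands.py, the output could then be piped into the program from a shell, assuming the MOLMOL start-up script is available on the local system.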
→Interactive Manipulation
1. Rotation
Molecules can be rotated by pressing the left mouse button in the drawing
area and then moving the mouse. The virtual trackball model is used.
2. Moving
Molecules can be moved by pressing the middle mouse button in the
drawing area and then moving the mouse.

3. Moving/Resizing Text
Text annotations can be modified by pressing the middle mouse button
inside the box that is displayed when the text is selected, and then moving
it. The box is subdivided into different regions with dashed lines. The
central region is used for moving the text, the other regions are used for
resizing the text in the corresponding direction.
4. Zooming
Zooming is done by pressing the left and middle mouse buttons at the same
time. Moving the mouse to the right and/or the top will zoom in, moving it
to the left and/or bottom will zoom out.
→Selection
There are two ways to make selections:
1. Interactive selection:
Items (like atoms or bonds) can be selected by clicking on them with the
left or middle mouse button. Doing that will normally deselect all other
items of the same class. To prevent that, either the Shift or the Ctrl key must
be pressed on the keyboard while making the selection. For text
annotations there is a difference between the function of the left and
middle mouse buttons: the left mouse button only selects texts if their
bottom-left corner is the nearest item, while the middle mouse button gives
priority to texts and selects them whenever the position is within the text.
2. Use commands:
Items can be selected by using various commands. The most convenient
way to use these commands is the selection dialog box, which can be
switched on with the Dial Select command. All these commands take
expressions that specify whether an item is selected or not.

→Properties

→Command Overview
1. Input/Output

2. Movement

3. Display

4. User Interface

5. Figures

1. Loading Molecular Structures:
- MolMol allows users to load molecular structures from various file
formats, including PDB (Protein Data Bank), MOL, XYZ, and others. Users
can import single or multiple structures for analysis.
- Molecular structures can be retrieved from public databases like the
Protein Data Bank (PDB) or generated using computational methods such
as molecular modeling and simulation.
2. Visualization Modes:
- MolMol provides multiple visualization modes to represent molecular
structures, including wireframe, stick, ball-and-stick, and space-filling
models. Each mode offers different levels of detail and clarity for visualizing
atoms, bonds, and molecular surfaces.
- Users can switch between visualization modes and adjust parameters
such as atom size, bond thickness, and color scheme to customize the
appearance of the molecular structure.
3. Interactive Manipulation:
- MolMol offers interactive tools for manipulating molecular structures in
three dimensions. Users can rotate, translate, and zoom into the structure
using mouse controls or keyboard shortcuts.
- The interactive manipulation allows users to explore the structure from
different perspectives and orientations, facilitating detailed examination of
specific regions of interest.
4. Annotation and Labeling:
- MolMol enables users to annotate and label specific features of the
molecular structure for better visualization and interpretation. Annotations
can include secondary structure elements (e.g., helices, sheets), ligand-
binding sites, active sites, and other functional regions.

- Text labels, color-coded regions, and symbols can be added to the
structure to highlight important features and provide context for further
analysis.
5. Structure Analysis:
- MolMol offers a suite of analysis tools for studying various structural
properties of proteins and nucleic acids. Users can measure distances,
angles, and dihedral angles between atoms or residues to characterize bond
lengths, bond angles, and torsional angles within the structure.
- Tools for calculating solvent accessibility, molecular surfaces, and
electrostatic potentials provide insights into the structural and chemical
properties of the molecule, aiding in the interpretation of its function and
interactions.
6. Molecular Dynamics Visualization:
- MolMol supports the visualization of molecular dynamics simulations,
allowing users to animate the movement of atoms and residues over time.
Users can visualize trajectories, analyze conformational changes, and study
protein-ligand interactions during dynamic simulations.

7. Integration with External Tools and Databases:
- MolMol can interface with external tools and databases for additional
analysis and data retrieval. This includes tools for sequence alignment,
homology modeling, and structural comparison.
- Users can import structural data from public databases such as the
Protein Data Bank (PDB) and integrate it with their own datasets for
comparative analysis and validation.
8. Customization and Scripting:
- MolMol offers extensive customization options, allowing users to tailor
the visualization and analysis settings to their specific requirements.
Parameters such as rendering styles, color schemes, and display options
can be adjusted to enhance visualization clarity and detail.
- Advanced users can automate tasks with scripts made up of MOLMOL
commands, such as macro files or command streams piped to the program on
standard input, to customize visualizations and perform complex analyses.
Scripts can be used to generate publication-quality figures, conduct
large-scale analyses, and streamline repetitive tasks.
Applications of MolMol:

1. NMR Spectroscopy: MolMol is used in NMR spectroscopy studies to
visualize and analyze protein and nucleic acid structures determined by
solution NMR techniques. It aids in interpreting NMR data, refining
structural models, and characterizing biomolecular conformations in
solution.

2. X-ray Crystallography: MolMol facilitates the visualization and analysis of
protein structures determined by X-ray crystallography. It helps
researchers examine electron density maps, validate crystallographic
models, and identify structural features such as ligand-binding sites and
active sites.
3. Homology Modeling: MolMol is employed in homology modeling or
comparative modeling to predict the three-dimensional structure of
proteins based on their amino acid sequences and known homologous
structures. It aids in building structural models, assessing model quality,
and predicting protein-ligand interactions.

4. Molecular Docking: MolMol is utilized in molecular docking studies to
predict the binding mode and affinity of small molecules or ligands to
protein targets. It aids in visualizing docking poses, analyzing
intermolecular interactions, and prioritizing compounds for experimental
validation.

5. Structural Bioinformatics: MolMol serves as a valuable tool in structural
bioinformatics for analyzing and comparing protein structures, identifying
structural motifs, and inferring functional annotations. It aids in structural
classification, fold recognition, and the prediction of protein-protein
interactions.
6. Protein Dynamics: MolMol supports the visualization and analysis of
protein dynamics, including protein flexibility, domain motions, and
conformational changes. It aids in studying allosteric regulation, protein
folding pathways, and the effects of mutations on protein stability and
dynamics.

7. Molecular Evolution: MolMol is employed in molecular evolution studies
to visualize and compare the structures of homologous proteins across
different species or evolutionary lineages. It aids in studying sequence-
structure relationships, identifying conserved structural features, and
inferring evolutionary constraints.

8. Structure-Based Drug Design: MolMol is applied in structure-based drug
design to study protein-ligand interactions, identify druggable binding
sites, and design novel therapeutic agents. It aids in virtual screening,
fragment-based drug discovery, and lead optimization efforts.

9. Biophysical Chemistry: MolMol is used in biophysical chemistry research
for studying the structural and dynamic properties of biomolecules,
including protein folding, unfolding, and aggregation. It aids in analyzing
experimental data, interpreting biophysical measurements, and validating
theoretical models.

10. Protein Engineering and Design: MolMol facilitates protein engineering
and design efforts by visualizing protein structures, predicting the effects of
mutations, and designing proteins with desired properties. It aids in
rational protein design, directed evolution experiments, and the
optimization of protein function for biotechnological applications.
