Allied EssentialsOfComputers 0bioinformatics
UNIT 1
Introduction to Computer, user interface with the Operating System, binary coding
system and Network terminologies. Working with windows and MS office software
concerning word processing, spreadsheets and presentation software.
UNIT 2
Internet and ICT with its Applications, IT Act, System Security (virus/firewall).
Cloud computing- using Google docs, Google Scholar, Google sheets, Google meet,
MS Teams and Zoom scheduling. Overview of life science-oriented software and
its usage in laboratories (Python, MATLAB and others) and healthcare (Azure,
HoloLens, etc.)
UNIT 3
Forms of biological information and the need for storage. General introduction to
biological databases; nucleic acid databases (NCBI, DDBJ, and EMBL). Protein
databases (PIR, Uniprot). Specialized genome databases: (SGD and microbial
genome database-MGDB). Structure database-PDB.
UNIT 4
Retrieval methods for Nucleic acid and Protein Sequences. Use of Bioinformatic
Tools -sequence homology- substitution matrices- PAM and BLOSUM. Pairwise
alignment (global and local) using BLAST. Multiple sequence alignment using
Clustal omega.
UNIT 5
Methods of Genome analysis- Shotgun and Hierarchical methods. Gene Prediction
using GENEMARK. Phylogenetic tree construction by MEGA (Molecular Evolutionary
Genetics Analysis). Protein Structure visualization Tool- RasMol and MolMol.
ESSENTIALS OF COMPUTERS AND BIOINFORMATICS
UNIT 1
Introduction to Computer, user interface with the Operating System, binary
coding system and Network terminologies. Working with windows and MS
office software concerning word processing, spreadsheets and presentation
software.
Introduction to Computers
A computer is an electronic device that processes data, converting it into
information that is useful to people. Any computer—regardless of its
type—is controlled by programmed instructions, which give the machine a
purpose and tell it what to do.
The first mechanical computer, the Difference Engine, was designed by
Charles Babbage in the early 1820s.
There are 2 main types of computers. They are:
1. Digital computers: So called because they work “by the numbers.”
That is, they break all types of information into tiny units, and use
numbers to represent those pieces of information. Digital computers
also work in very strict sequences of steps, processing each unit of
information individually, according to the highly organized
instructions they must follow.
which is the case that houses the computer’s critical parts, such as its
processing and storage devices. There are two common designs for desktop
computers. The more traditional desktop model features a horizontally
oriented system unit, which usually lies flat on the top of the user’s desk.
2. Workstations:
A workstation is a specialized, single-user computer that typically has more
power and features than a standard desktop PC. These machines are
popular among scientists, engineers, and animators who need a system
with greater-than-average speed and the power to perform sophisticated
tasks. Workstations often have large, high-resolution monitors and
accelerated graphics-handling capabilities, making them suitable for
advanced architectural or engineering design, modeling, animation, and
video editing.
3. Notebook computers:
Notebook computers, as their name implies, approximate the shape of an
8.5-by-11-inch notebook and easily fit inside a briefcase. Because people
frequently set these devices on their lap, they are also called laptop
computers. Notebook computers can operate on alternating current or
special batteries. Because of their portability, notebook PCs fall into a
category of devices called mobile computers—systems small enough to be
carried by their user.
4. Tablet computers:
Tablet PCs offer all the functionality of a notebook PC, but they are lighter
and can accept input from a special pen—called a stylus or a digital pen—
that is used to tap or write directly on the screen. Many tablet PCs also have
a built-in microphone and special software that accepts input from the
user's voice. Tablet PCs run specialized versions of standard programs and
can be connected to a network. Some models also can be connected to a
keyboard and a full-size monitor.
5. Handheld computers:
Handheld personal computers are computing devices small enough to fit in
your hand. A popular type of handheld computer is the personal digital
assistant (PDA). A PDA is no larger than a small appointment book and is
normally used for special applications, such as taking notes, displaying
telephone numbers and addresses, and keeping track of dates or agendas.
Many PDAs can be connected to larger computers to exchange data. Most
PDAs come with a pen that lets the user write on the screen. Some
Generations of Computers
1st generation: 1946-1959 → vacuum tubes
2nd generation: 1959-1965 → transistors
3rd generation: 1965-1971 → integrated circuits
4th generation: 1971-1980 → microprocessors and VLSI (Very Large Scale Integration)
5th generation: 1980-present → AI and ULSI (Ultra Large Scale Integration)
Parts of a Computer
A computer system consists of 4 main parts:
1. Hardware:
The mechanical devices that make up the computer are called hardware.
Hardware is any part of the computer you can touch. A computer’s
hardware consists of interconnected electronic devices that you can use to
control the computer’s operation, input, and output.
» Storage:
In this step, the computer permanently stores the results of its processing
on a disk, tape, or some other kind of storage medium. As with output,
storage is optional and may not always be required by the user or program.
Memory Devices:
In a computer, memory is one or more sets of chips that store data and/or
program instructions, either temporarily or permanently. Memory is a
critical processing component in any computer. Personal computers use
several different types of memory, but the two most important are called
random access memory (RAM) and read-only memory (ROM).
These two types of memory work in very different ways and perform
distinct functions.
1. Random Access Memory
The most common type of memory is called random access memory (RAM).
As a result, the term memory is typically used to mean RAM. RAM is like an
electronic scratch pad inside the computer. RAM holds data and program
instructions while the CPU works with them. When a program is launched,
it is loaded into and run from memory. As the program needs data, it is
loaded into memory for fast access. As new data is entered into the
computer, it is also stored in memory—but only temporarily. Data is both
written to and read from this memory. (Because of this, RAM is also
sometimes called read/write memory.) Like many computer components,
RAM is made up of a set of chips mounted on a small circuit board.
RAM is volatile, meaning that it loses its contents when the computer is
shut off or if there is a power failure. Therefore, RAM needs a constant
supply of power to hold its data. For this reason, you should save your data
files to a storage device frequently, to avoid losing them in a power failure.
RAM has a tremendous impact on the speed and power of a computer.
Generally, the more RAM a computer has, the more it can do and the faster
it can perform certain tasks. The most common measurement unit for
describing a computer’s memory is the byte, the amount of memory it
takes to store a single character such as a letter of the alphabet or a
numeral. When referring to a computer's memory, the numbers are often so
large that it is helpful to use terms such as kilobyte (KB), megabyte (MB),
gigabyte (GB), and terabyte (TB) to describe the values.
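The relationship between bytes and the larger units can be illustrated with a short Python sketch. It assumes the convention, common when describing memory, that each unit is 1024 times the previous one; the function name human_readable is made up for this example.

```python
# Units in increasing size; each step is assumed to be a factor of 1024.
UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    unit_index = 0
    while value >= 1024 and unit_index < len(UNITS) - 1:
        value /= 1024
        unit_index += 1
    return f"{value:.1f} {UNITS[unit_index]}"


# A single ASCII character such as "A" occupies one byte.
print(len("A".encode("ascii")))        # 1
print(human_readable(1536))            # 1.5 KB
print(human_readable(8 * 1024 ** 3))   # 8.0 GB
```

Storage manufacturers often use factors of 1000 instead, so the same device can appear to have slightly different sizes depending on the convention.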
Types of RAM: SRAM, DRAM.
User Interface
A user interface refers to the part of an operating system, program, or
device that allows a user to enter and receive information.
A text based user interface displays text, and its commands are usually
typed on a command line using a keyboard.
With a graphical user interface, the functions are carried out by clicking or
moving buttons, icons, and menus using a pointing device.
Text User Interface (TUI):
Modern graphical user interfaces have evolved from text based user
interfaces.
Some operating systems, such as Linux, can still be used with a text-based
user interface. In this case, the commands are entered as text. To open the
text-based Command Prompt in Windows, open the Start menu, type cmd, and
press Enter on the keyboard; the Command Prompt launches in a separate
window. There you can type your commands from the keyboard instead of
using the mouse.
Graphical User Interface (GUI):
In most operating systems, the primary user interface is graphical, i.e.,
instead of typing commands, you can manipulate various graphical objects
with a pointing device.
The binary scheme of digital 1s and 0s offers a simple and elegant way for
computers to work. It also offers an efficient way to control logic circuits
and to detect whether an electrical signal is true (1) or false (0).
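How binary numbers work:
Take the binary number 01001111. Reading from the rightmost bit, each bit
is multiplied by the power of 2 for its position:
1 × 2^0 = 1 × 1 = 1
1 × 2^1 = 1 × 2 = 2
1 × 2^2 = 1 × 4 = 4
1 × 2^3 = 1 × 8 = 8
0 × 2^4 = 0 × 16 = 0
0 × 2^5 = 0 × 32 = 0
1 × 2^6 = 1 × 64 = 64
0 × 2^7 = 0 × 128 = 0
So, 1 + 2 + 4 + 8 + 64 = 79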
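The same positional calculation for 01001111 can be sketched in Python. The function name binary_to_decimal is chosen only for this illustration; Python's built-in int(text, 2) does the same job and is shown for comparison.

```python
def binary_to_decimal(bits: str) -> int:
    """Convert a string of 0s and 1s to its decimal value."""
    total = 0
    # Walk the bits from right to left; position 0 is the rightmost bit.
    for position, bit in enumerate(reversed(bits)):
        if bit not in "01":
            raise ValueError(f"not a binary digit: {bit!r}")
        total += int(bit) * 2 ** position
    return total


print(binary_to_decimal("01001111"))  # 79, matching the worked example
print(int("01001111", 2))             # 79, using Python's built-in conversion
```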
Network Terminologies
When two or more computers or devices are connected so that they can
transfer files and data or communicate with each other, the arrangement is
referred to as a computer network.
A network is a collection of interconnected devices, such as computers,
printers, and servers, that can communicate with each other. In a computer
network, each computer is called a node.
MAN:
A metropolitan area network (MAN) is a computer network that connects
computers within a metropolitan area, which could be a single large city,
multiple cities and towns, or any given large area with multiple buildings. A
MAN is larger than a local area network (LAN) but smaller than a wide area
network (WAN). MANs do not have to be in urban areas; the term
“metropolitan” implies the size of the network, not the demographics of the
area that it serves. Like WANs, a MAN is made up of interconnected LANs.
Because MANs are smaller, they are usually more efficient than WANs, as
data does not have to travel over large distances. MANs typically combine
the networks of multiple organizations, instead of being managed by a
single organization.
Most MANs use fiber optic cables to form connections between LANs. Often
a MAN will run on “dark fiber” — formerly unused fiber optic cables that
are able to carry traffic. These fiber optic cables may be leased from
private-sector Internet service providers (ISP). In some cases, this model is
reversed: a city government builds and maintains a metropolitan fiber optic
network, then leases dark fiber to private companies.
Advantages:
• Highly secure network.
• High speed, thanks to the fiber optic medium.
• Cost effective.
• Fast data transfer over long distances.
PAN:
A personal area network (PAN) connects electronic devices within a user’s
immediate area. The size of a PAN ranges from a few centimeters to a few
meters. One of the most common real-world examples of a PAN is the
connection between a Bluetooth earpiece and a smartphone. PANs can also
connect laptops, tablets, printers, keyboards, and other computerized
devices.
PAN network connections can either be wired or wireless. Wired
connection methods include USB and FireWire; wireless connection
methods include Bluetooth (the most common), Wi-Fi, IrDA, and Zigbee.
While devices within a PAN can exchange data with each other, PANs
typically do not include a router and thus do not connect to the Internet
directly. A device within a PAN, however, can be connected to a local area
network (LAN) that then connects to the Internet. For instance, a desktop
computer, a wireless mouse, and wireless headphones can all be connected
to each other, but only the computer can connect directly to the Internet.
A wireless personal area network (WPAN) is a group of devices connected
without the use of wires or cables. Today, most PANs for everyday use are
wireless. WPANs use close-range wireless connectivity protocols such as
Bluetooth.
The range of a WPAN is usually very small, as short-range wireless
protocols like Bluetooth are not efficient over distances larger than 5-10
meters.
WAN:
A wide area network (WAN) is a large computer network that connects
groups of computers over large distances. WANs are often used by large
businesses to connect their office networks; each office typically has its
own local area network, or LAN, and these LANs connect via a WAN. These
long connections may be formed in several different ways, including leased
lines, VPNs, or IP tunnels (see below). The definition of what constitutes a
WAN is fairly broad. Technically, any large network that spreads out over a
wide geographic area is a WAN. The Internet itself is considered a WAN.
Characteristics:
• Communicates over a large geographical area, up to the entire globe.
• Can transfer data to almost anywhere in the world.
• Typically the most expensive type of network to set up.
• High maintenance requirements.
• Speed depends on the link technology; high-capacity links can reach
100 Gbps.
MS Word
A word processor is a computer program for creating and editing text
documents.
Word processing software provides a general set of tools for entering,
editing, and formatting text.
A word processor can do everything a conventional typewriter can, and it
provides various useful features that a typewriter cannot offer.
To Launch Word:
To start Word 2019, click on the Windows Start button, and then select
Microsoft Word 2019 from the list of programs. The Microsoft Word icon can
be pinned to the Start menu or taskbar for quick access.
→ Title Bar: Displays the name of the file you are currently working on. It
also contains three buttons: the Minimize button reduces the window to
an icon while Word remains active, the Restore button returns the Word
window to its original maximum size, and the Close button takes us out
of Word.
→ Menu Bar: Consists of various commands that can be accessed by
clicking on the menu options under the menu headings.
→ Standard Toolbar: Displays icons for common operations, such as
Open, Print, Save, etc., which can be carried out by clicking on the
suitable tool.
→ Formatting Toolbar: Displays the options that can be used to format
our document.
→ Ruler: Indicates the width of the document, which can be increased or
decreased. You can also see how many lines you have written.
→ Work area: This is the area where you can enter the text.
→ Vertical Scroll Bar: For larger text in the document, you can scroll the
vertical bar to view the text in different positions.
→ Horizontal scrollbar: Used to move from left to right of the document
and vice versa if the document is too wide to fit on the screen.
→ Search Object Selection Button: This helps us select one of several
tools used to find something in a document.
→ Normal View button: Helps us view the document much as it will be
printed. It arranges the text so that no part of the document is hidden
on the screen.
→ Print Layout View: This option allows us to see how the document
will be printed. All headers, footers, and comments are displayed.
→ Draw Toolbar: One of several toolbars that may be available on the
screen. This toolbar is used to make drawings in the document.
→ Status bar: This bar always shows your current position in the text,
in terms of the page number, line number, etc.
Features:
• Fast Typing: Typing text in a word processor is fast since there is no
associated mechanical carriage movement.
• Editing functions: Any type of correction (insert, delete, change, etc.)
can be easily made as and when needed.
• Permanent storage: Documents can be stored indefinitely. The saved
document can be called up at any time.
• Formatting functions: Entered text can be formatted in any form and
style (bold, italic, underline, different fonts, etc.).
• Graphics: Provides the ability to insert drawings into documents,
making them more useful.
• OLE (Object Linking and Embedding): OLE is a program integration
technology used to exchange information between programs about
objects. Objects are entities stored as graphs, equations, video clips,
audio clips, images, and so on.
• Alignment: You can align your text as you like, for example, left, right,
or centered. You can even justify it, i.e., align it on both sides.
• Delete errors: You can remove a word, line, or paragraph in a single
stroke, and the rest of the text will reflow automatically.
• Line Spacing: You can set the line spacing from one to nine according
to your preference.
• Cursor movement: You can move the cursor from one word to another
or from one paragraph to another as needed.
MS Excel
Microsoft Excel is a powerful electronic spreadsheet program you can use
to automate accounting work, organize data, and perform a wide variety of
tasks. Excel is designed to perform calculations, analyze information, and
visualize data in a spreadsheet. Also this application includes database and
charting features.
To Launch Excel:
To launch Excel for the first time:
Features:
• Charts: Charts can be used to represent the data in richly detailed
graphical format.
• SmartArt: We can utilize SmartArt to express information by aligning
data in creative ways graphically.
• Clip Arts: We can include ready-to-use clip arts to convey our
message in a visual format.
• Shapes: We can use a variety of shapes to depict data in infographics
and shapes. With the help of the free form features we can draw any
shape.
MS PowerPoint
Microsoft PowerPoint, usually just called PowerPoint, is a software
program developed by Microsoft to produce effective presentations. It is
part of the Microsoft Office suite. It is a complete presentation graphics
package that gives you everything needed to create a professional-looking
presentation: slides together with word processing, drawing, outlining,
graphing, and presentation management tools.
PowerPoint was developed by Dennis Austin and Thomas Rudkin at a
software company named Forethought Inc. It was originally going to be
named Presenter, but due to trademark issues it was renamed PowerPoint in
1987. The first version was released that year for the Apple Macintosh, and
the first Windows version followed alongside Windows 3.0 in 1990. The
initial version of PowerPoint only allowed slide progression in one
direction, i.e., forward, and the amount of customization was somewhat
limited.
With every version, the program became more creative and more
interactive. Numerous other features were added in later versions, which
massively increased the demand for and use of this MS Office program. The
default file extension of a PowerPoint presentation is “.pptx” (“.ppt” in
PowerPoint 2003 and earlier). It is a slide-based presentation program
that uses graphics, videos, and other features to make a presentation more
interactive and interesting.
UNIT 2
INTERNET
The World Wide Web—usually called the Web for short—is a collection of
different websites you can access through the Internet. A website is made
up of related text, images, and other resources. Websites can resemble
other forms of media—like newspaper articles or television programs—or
they can be interactive in a way that's unique to computers.
At this point you may be wondering, how does the Internet work? The
exact answer is pretty complicated and would take a while to explain.
Instead, let's look at some of the most important things you should know.
When you visit a website, your computer sends a request over these wires
to a server. A server is where websites are stored, and it works a lot like
your computer's hard drive. Once the request arrives, the server retrieves
the website and sends the correct data back to your computer.
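The request-and-response exchange described above can be tried out with a small Python sketch. It is only an illustration under stated assumptions: the "server" is a tiny test server started on the same machine (address 127.0.0.1), and the page content and the names PAGE, PageHandler and fetch_once are made up for this example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# The "website" stored on our toy server (hypothetical content).
PAGE = b"<html><body>Hello from the server</body></html>"


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server answers every GET request by sending back the stored page.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once():
    """Start the toy server, send one request as a client, return the reply."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        client = HTTPConnection("127.0.0.1", port, timeout=5)
        client.request("GET", "/")        # the computer sends a request
        response = client.getresponse()   # the server sends the data back
        result = (response.status, response.read())
        client.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_once()
print(status, body.decode())
```

A real website works the same way in outline, except that the server is a separate machine reached over the Internet rather than a program on your own computer.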
Applications of Internet:
1. Communication
Communication refers to exchanging ideas and thoughts between or among
people to create understanding. The communication process involves the
elements of source, encoding, channel, receiver, decoding, and feedback. In
organizations, both formal and informal communications simultaneously
take place. Formal communications refer to official communications in
orders, notes, circulars, agenda, minutes, etc. Apart from formal
communications, informal grapevine communications also exist. Informal
communications are usually in the form of rumors, whispers, etc. They are
unofficial, unrecorded, and spread very fast.
2. Web Browsing
Web Browsing is one of the applications of the internet. A web browser is a
program that helps the user to interact with all the data in the WWW
(World Wide Web). There are many web browsers present in today's
world. Some of them are as follows:
• Google Chrome
• Firefox
• Safari
• Internet Explorer
• Microsoft Edge
3. Online Shopping
The era of the internet took shopping into a new market concept, where
many virtual shops are available 24x7. The shops provide all the necessary
details of a product on their website, so the user can choose as per their
needs.
4. Real-Time Update
The internet makes things easier. One can quickly get an update on the
things happening in real-time in any part of the world. For example, sports,
politics, business, finance, etc. The internet is very useful in many decisions
based on real-time updates.
5. Social Media
Many young people today spend much of their free time on social media,
which the internet makes possible. Social media is a place where the
user can communicate with anyone, like friends, family, classmates, etc.
Users can promote their businesses on social media as well. You can also
share your thoughts, pictures and videos with your friends on social media.
6. Job Search
The internet has brought a revolution in the field of jobs. Candidates can
search for their dream job and apply for it online. Companies nowadays
post their vacancies on the internet and hire candidates as per their
skills based on the job role.
There are many platforms which are primarily doing this. Some of them are
listed below.
• LinkedIn
• Monster.com
• Naukri.com
• Indeed
• Glassdoor
• Upwork
7. Education
The Internet has a vital role in the education field. It has become an
effective tool in both teaching and learning. Teachers can upload their
notes or learning videos to websites with the help of the internet. It has
made the learning process more diverse and joyful.
8. Travel
Users can easily search for their favorite tourist places worldwide and plan
their trips. One can book holiday trips, cabs, hotels, flight tickets, clubs, etc.,
with the help of the Internet. Some websites that provide these facilities are
as follows:
• goibibo.com
• makemytrip.com
• olacabs.com
ICT
Information and communications technology (ICT) is an extensional term
for information technology (IT) that stresses the role of unified
communications and the integration of telecommunications (telephone lines
and wireless signals) and computers, as well as necessary enterprise
software, middleware, storage and audiovisual, that enable users to access,
store, transmit, understand and manipulate information.
• Telehealth
• Artificial intelligence in healthcare
• Use and development of software for COVID-19 pandemic
mitigation
• mHealth
• Clinical decision support systems and expert systems
• Health administration and hospital information systems
• Other health information technology and health informatics
In science
Applications of ICTs in science, research and development, and academia
include:
• Internet research
• Online research methods
• Science communication and communication between scientists
• Scholarly databases
• Applied metascience
Models of access
Scholar Mark Warschauer defines a "models of access" framework for
analyzing ICT accessibility. In the second chapter of his book, Technology
and Social Inclusion: Rethinking the Digital Divide, he describes three
models of access to ICTs: devices, conduits, and literacy. Devices and
conduits are the most common descriptors for access to ICTs, but they are
insufficient for meaningful access to ICTs without the third model of access,
literacy. Combined, these three models roughly incorporate all twelve of
IT ACT
The Information Technology Act, 2000 (also known as ITA-2000, or the IT
Act) is an Act of the Indian Parliament (No 21 of 2000) notified on 17
October 2000. It is the primary law in India dealing with cybercrime and
electronic commerce. The Act provides a legal framework for electronic
governance by giving recognition to electronic records and digital
signatures. It also defines cybercrimes and prescribes penalties for
them. The Act directed the formation of a Controller of Certifying
Authorities to regulate the issuance of digital signatures. It also
established a Cyber Appellate Tribunal to resolve disputes arising from
this new law. The Act also amended various sections of the Indian Penal
Code, 1860, the Indian Evidence Act, 1872, the Banker's Books Evidence
Act, 1891, and
the Reserve Bank of India Act, 1934 to make them compliant with new
technologies.
SYSTEM SECURITY
System security encompasses a wide range of practices and technologies
designed to safeguard computer systems, including hardware, software,
networks, and data. It involves implementing various security controls,
such as firewalls, antivirus software, access controls, and encryption, to
defend against potential threats.
Firewalls, one of the fundamental components of system security, act as a
barrier between a trusted internal network and untrusted external
networks, filtering incoming and outgoing network traffic based on
predetermined security rules. They help prevent unauthorized access to a
network and protect against various types of cyber threats, such as
malware, hacking attempts, and denial-of-service attacks.
Antivirus software is another essential tool in system security. It scans files
and programs for known patterns of malicious code, preventing malware
from infecting a system. It also provides real-time protection by monitoring
system activity and blocking suspicious or potentially harmful activities.
Access controls are mechanisms that limit and control user access to
computer systems and resources. They ensure that only authorized
individuals can access sensitive data and perform specific actions. Access
controls can include password authentication, biometric identification, and
role-based access controls, among others.
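Role-based access control can be pictured as a lookup table from roles to permitted actions. The sketch below is a minimal illustration; the role names, the actions, and the function is_allowed are hypothetical and not taken from any particular system.

```python
# Hypothetical roles and the actions each one is permitted to perform.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "researcher": {"read", "write"},
    "guest": {"read"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role is known and grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("researcher", "write"))  # True
print(is_allowed("guest", "delete"))      # False
print(is_allowed("visitor", "read"))      # False: unknown roles get no access
```

Denying anything that is not explicitly granted is a common design choice, because a forgotten entry then results in too little access rather than too much.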
Encryption is a process of converting data into a format that is unreadable
to unauthorized individuals. It uses mathematical algorithms to scramble
data, making it unintelligible unless decrypted with the correct key.
Encryption is commonly used to protect sensitive information during
transmission over networks or when stored on devices.
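The idea that data is unreadable without the correct key can be demonstrated with a toy example. The sketch below uses a repeating-key XOR, which is chosen only because it is short; it is not secure and is not how real systems encrypt data. Real software should rely on vetted algorithms from established libraries. The message, the keys, and the name xor_bytes are made up for this illustration.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy cipher: XOR each data byte with the key, repeating the key as needed.

    Applying the same key a second time restores the original data.
    This is for illustration only and offers no real security.
    """
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


message = b"patient record 42"
key = b"secret"

ciphertext = xor_bytes(message, key)        # scrambled form of the message
recovered = xor_bytes(ciphertext, key)      # decrypting with the correct key
wrong = xor_bytes(ciphertext, b"guess!")    # decrypting with a wrong key

print(ciphertext != message)  # True: the stored or transmitted form differs
print(recovered == message)   # True: the correct key restores the data
print(wrong == message)       # False: a wrong key yields different bytes
```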
The security of a system can be threatened via two violations:
• Threat: A program that has the potential to cause serious
damage to the system.
• Attack: An attempt to break security and make unauthorized use
of an asset.
Security can be compromised via any of the breaches mentioned:
• Breach of confidentiality: This type of violation involves the
unauthorized reading of data.
Firewall
A firewall is a network security device, either hardware- or software-based,
that monitors all incoming and outgoing traffic and, based on a defined
set of security rules, accepts, rejects, or drops that traffic.
Accept : allow the traffic.
Reject : block the traffic but reply with an “unreachable error”.
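How a rule set leads to an accept, reject, or drop decision can be sketched in Python. This is a simplified model, not the behavior of any specific firewall product: the rules, the addresses, the first-match-wins ordering, and the default of dropping unmatched traffic (blocking it without any reply) are all assumptions made for the illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    source: str           # network the traffic must come from, in CIDR form
    port: Optional[int]   # destination port; None matches any port
    action: str           # "accept", "reject", or "drop"


def decide(rules, source_ip: str, port: int, default: str = "drop") -> str:
    """Return the action of the first rule matching the traffic."""
    for rule in rules:
        in_network = ip_address(source_ip) in ip_network(rule.source)
        port_matches = rule.port is None or rule.port == port
        if in_network and port_matches:
            return rule.action
    return default  # nothing matched: assume the traffic is silently dropped


# Hypothetical rule set for the example.
RULES = [
    Rule("192.168.1.0/24", 22, "accept"),  # remote login only from the local network
    Rule("0.0.0.0/0", 23, "reject"),       # refuse Telnet from anywhere, with a reply
    Rule("0.0.0.0/0", 80, "accept"),       # web traffic from anywhere
]

print(decide(RULES, "192.168.1.10", 22))  # accept
print(decide(RULES, "203.0.113.5", 23))   # reject
print(decide(RULES, "203.0.113.5", 22))   # drop
```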
CLOUD COMPUTING
Cloud computing is the on-demand availability of computing resources
(such as storage and infrastructure) as services over the internet. It
eliminates the need for individuals and businesses to manage physical
resources themselves, and they pay only for what they use.
Types of cloud computing deployment models
Public cloud: Public clouds are run by third-party cloud service providers.
They offer compute, storage, and network resources over the internet,
enabling companies to access shared on-demand resources based on their
unique requirements and business goals.
Private cloud: Private clouds are built, managed, and owned by a single
organization and privately hosted in their own data centers, commonly
known as “on-premises” or “on-prem.” They provide greater control,
security, and management of data while still enabling internal users to
benefit from a shared pool of compute, storage, and network resources.
Hybrid cloud: Hybrid clouds combine public and private cloud models,
allowing companies to leverage public cloud services and maintain the
security and compliance capabilities commonly found in private cloud
architectures.
Google Docs
Google Scholar
Has the fewest indexed articles in the Humanities, including Religion and
Biblical Studies. (See the “Metrics” link at the top to show the major
disciplines and the most highly indexed journals in each discipline.) It also
tends to include more recent literature rather than pre-1990 literature
because this older literature has often never been digitized and put on the
web.
2. If you choose the Liberty University Library in your initial settings it will
point to journal articles (Get it @ LU) and search for books in WorldCat
(Library Search). (If you don’t see Get it @ LU, check under the “More”
links.)
3. The default sort for results is by relevance ranking. Articles that are
cited the most by others show up higher in the rankings. The relevance
ranking in our subscription databases is often determined by the number of
times the search term(s) is found in the metadata. Thus Google Scholar can
be helpful in finding key or seminal authors on a topic because they will be
the most cited.
4. It shows who has cited each work so that you can trace patterns of
research. If the older, original article is helpful, it is likely that at least some
of the more recent articles that cite the older article will also be helpful in
your research.
6. If you are a published author (even in Digital Commons) you can trace
those who cite your work.
To find newer articles, try the following options in the left sidebar:
2. Click "Sort by date" to show just the new additions, sorted by date;
Locating the full text of an article
Abstracts are freely available for most of the articles. However, reading
the entire article may require access through a subscription database.
Google Sheets
1. Editing
Google Sheets also includes a sidebar chat feature that allows collaborators
to discuss edits in real-time and make recommendations on certain
changes. Any changes that the collaborators make can be tracked using the
Revision History feature. An editor can review past edits and revert any
unwanted changes.
2. Explore
With the Explore feature, users can ask questions, build charts, visualize
data, create pivot tables, and format the spreadsheet with different colors.
For example, if you are preparing a monthly budget and you’ve added all
the expenses to the spreadsheet, you can use the Explore feature to get the
cost of specific expenses such as food, travel, clothing, etc.
On the sidebar, there is a box where you can type the question, and it will
return the answer. When you scroll down further in the Explore panel,
there is a list of suggested graphs that are representative of the data
entered in the spreadsheet, and you can choose between a pivot table, pie
chart, or bar chart.
3. Offline editing
Google Sheets supports offline editing, and users can edit the spreadsheet
offline either on desktop or mobile apps. On the desktop, users need to use
the Chrome browser and install the “Google Docs Offline” Chrome
extension to enable offline editing for Google Sheets and other Google
applications. When using mobile, users need to use the Google Sheets
mobile app for Android and iOS, which support offline editing.
Google Sheets supports multiple spreadsheet file formats and file types.
Users can open, edit, save or export spreadsheets and document files into
Google Sheets. Some of the formats that can be viewed and converted to
Google Sheets include:
• .xlsx
• .xls
• .xlsm
• .xlt
• .xltx
• .xltm
• .ods
• .csv
• .tsv
Google Sheets can be integrated with other Google products such as Google
Forms, Google Finance, Google Translate, and Google Drawings. For
example, if you want to create a poll or questionnaire, you can enter the
questions in Google Forms and then collect the responses in Google
Sheets.
1. Go to the Google Drive Dashboard, and click the “New” button on the
top left corner, and select Google Sheets.
2. Open the menu bar in the spreadsheet window, go to File then New. It
will create a blank spreadsheet, and the interface will be as follows:
To rename the spreadsheet, click on the field on the top left corner, which
is titled “Untitled spreadsheet” and type in your preferred name. When a
new Google spreadsheet is created, it is automatically saved in the root
folder of your drive. To move the spreadsheet to a different folder, click and
hold the file, and drag it to the preferred folder.
Common Terms
The following are some of the common terms associated with Google
spreadsheets:
Google Meet
Google Meet is a video conferencing platform developed by Google, offering
various features for online meetings and collaboration.
1. Video Conferencing: Google Meet allows users to conduct video meetings
with participants from around the world. It supports both one-on-one
meetings and group meetings with up to hundreds of participants
(depending on the plan).
2. High-Quality Video and Audio: Meet offers high-definition video and clear
audio quality for a smooth meeting experience. It automatically adjusts the
video resolution based on your network connection to ensure the best
possible quality.
3. Screen Sharing: Users can share their entire screen or specific
applications or tabs with other participants during a meeting. This feature
is useful for presentations, demonstrations, and collaborative work.
4. Real-Time Captions: Google Meet provides real-time captions during
meetings, helping participants follow along even if they have difficulty
hearing or understanding.
5. Live Streaming: With Google Meet, you can live stream your meetings to a
large audience. This feature is particularly useful for webinars, virtual
events, and presentations.
6. Recording: Google Meet allows users to record their meetings for later
reference or for participants who couldn't attend. The recordings are
automatically saved to Google Drive and can be shared with others.
7. Integration with Google Workspace: Meet is fully integrated with other
Google Workspace apps like Gmail, Google Calendar, and Google Drive. This
integration makes it easy to schedule meetings, join calls directly from
Calendar events, and access shared files during meetings.
8. Security and Privacy: Google Meet offers various security features to
protect your meetings, including encryption of data in transit and in
storage, meeting codes to prevent unauthorized access, and controls for
meeting hosts to manage participants and permissions.
9. Virtual Backgrounds: Users can choose virtual backgrounds to customize
their appearance during meetings. This feature allows you to hide your
actual background or add a professional touch to your video calls.
10. Participant Management: Meeting hosts have control over participants,
with options to mute/unmute participants, remove participants, and
control who can present during the meeting.
11. Breakout Rooms: Google Meet recently introduced breakout rooms,
allowing meeting hosts to split participants into smaller groups for focused
discussions or activities, then easily bring them back to the main meeting.
12. Polls and Q&A: Google Meet offers built-in polling and Q&A features,
enabling meeting hosts to gather feedback or engage participants in
interactive sessions.
Other Features:
Video Quality and Bandwidth Management:
Google Meet automatically adjusts the video resolution and frame rate
based on your network connection to ensure optimal video quality while
minimizing bandwidth usage.
It supports up to 1080p resolution for video calls, depending on the user's
device and network capabilities.
Audio Enhancements:
Meet uses advanced audio processing algorithms to reduce background
noise and echo, providing clear and crisp audio during meetings.
Accessibility Features:
Google Meet is committed to accessibility, with features such as keyboard
shortcuts, screen reader support, and adjustable captions to accommodate
users with disabilities.
It also offers live captioning in multiple languages, making meetings more
inclusive and accessible to participants from diverse backgrounds.
Customization and Branding:
Google Meet allows organizations to customize their meeting experience
with branded backgrounds, logos, and custom meeting URLs.
This feature enables businesses to maintain brand consistency and
professionalism during video calls, enhancing their corporate identity.
Advanced Security Controls:
Google Meet offers advanced security controls for meeting hosts, including
the ability to lock meetings, prevent participants from joining before the
host, and restrict screen sharing to specific users.
It also encrypts all video meetings in transit; additional client-side
encryption is available only in certain Google Workspace editions, which helps
protect the confidentiality of sensitive information shared during calls.
MS Teams
Microsoft Teams was released in 2017 and has since proven to be an
exceptionally popular addition to Microsoft's suite of online services. Teams
is an online collaboration service available as a part of Microsoft 365 and as a
free service.
Left sidebar menu
The left sidebar menu of the Teams interface holds icons for different major
work areas in Teams. By default, you’ll have Activity, Chat, Teams,
Calendar, Calls, and Files. These are all called “apps” for want of a better
word.
This menu is customizable and can be changed by you or by IT. Changes that
you make are persistent and won’t be seen by others.
You can remove any icon you don’t want by right-clicking and then
selecting Unpin, or move icons around if you click and hold an icon then
drag it to the desired location. What is shown above is the result of adding
OneNote to the left sidebar. You can add many applications by clicking on
the Apps icon at the bottom of the sidebar.
Teams
Clicking on Teams on the left-sidebar lists teams where you are a member.
You can contribute right away to any of these teams by clicking on Posts and
typing in the text box at the bottom. You are automatically made a member
of some Teams created by the company or Teams administrators.
Zoom Scheduling
Zoom Video Conferencing is a popular platform for hosting virtual
meetings, webinars, and online events. It offers a range of features that
make communication and collaboration easy, including:
1. High-quality video and audio: Zoom provides high-definition video and
crystal-clear audio to ensure a seamless meeting experience.
2. Screen sharing: Participants can share their screens with others, making
it easy to present slideshows, documents, or software demonstrations.
3. Chat: Zoom includes a chat feature that allows participants to send text
messages to the entire group or privately to individuals.
4. Recording: Meetings can be recorded for later reference or sharing with
those who couldn't attend. The recordings can include video, audio, and
screen sharing.
5. Virtual backgrounds: Zoom allows users to set virtual backgrounds,
which can help maintain privacy or add a touch of professionalism to
meetings.
6. Gallery view and speaker view: Participants can choose between gallery
view, which displays multiple participants simultaneously, or speaker view,
which highlights the person currently speaking.
7. Host controls: Hosts have access to a variety of controls, including muting
participants, disabling video, and managing screen sharing permissions.
8. Security features: Zoom offers several security features to ensure the
safety of meetings, such as meeting passwords, waiting rooms, and optional
end-to-end encryption.
9. Integration: Zoom integrates with various other tools and platforms, such
as calendars, messaging apps, and productivity software.
10. Accessibility: Zoom provides features to improve accessibility, such as
closed captioning, screen reader support, and keyboard shortcuts.
Advantages:
1. User-friendly interface: Zoom's interface is intuitive and easy to navigate,
making it simple for users to join meetings, access features, and manage
settings.
2. Cross-platform support: Zoom is available on various devices and
operating systems, including Windows, macOS, Linux, iOS, and Android.
This ensures that participants can join meetings from their preferred
devices.
3. Multiple meeting formats: In addition to standard video meetings, Zoom
offers various meeting formats, including webinars, breakout rooms, and
virtual events. This versatility makes it suitable for different types of
gatherings and collaborations.
4. Interactive features: Zoom provides interactive features such as polling,
Q&A sessions, and hand-raising, enabling engaging and participatory
meetings.
5. Integration with productivity tools: Zoom integrates with popular
productivity tools like Google Calendar, Microsoft Outlook, Slack, and
others, streamlining the scheduling and joining process.
6. Customizable settings: Hosts can customize meeting settings according to
their preferences and requirements. This includes adjusting audio and
video settings, enabling or disabling specific features, and controlling
participant permissions.
7. Cloud storage: Zoom offers cloud storage for recorded meetings, making
it convenient to access and share recordings securely.
8. Advanced features for business users: For businesses and organizations,
Zoom provides advanced features such as single sign-on (SSO), enterprise-
level security controls, and reporting and analytics.
9. Global availability: Zoom has data centers located worldwide, ensuring
reliable performance and low-latency connections for users across different
geographic regions.
10. Customer support: Zoom provides customer support through various
channels, including email, chat, and phone, to assist users with any
questions or issues they may encounter.
LOBSTER:
LOBSTER is an image analysis environment to identify biological objects in
microscopy images, and measure their spatial location, geometry, dynamics
and intensity distribution. The objects can be exported as 2D/3D models,
for instance for their exploration/editing in external software, or for the
simulation of the processes under study. Multiple images can be processed
in one go and image size is not limited by the main memory of the
workstation.
• Wound Healing Assays: LOBSTER facilitates the tracking of individual cells
during wound healing assays, providing quantitative data on migration
rates, directionality, and cell interactions. This is crucial for understanding
the mechanisms involved in tissue repair and regeneration.
•Immune Cell Dynamics: In immunology studies, LOBSTER can track
immune cell movements within tissues or in response to stimuli. This is
valuable for investigating immune response dynamics, cell trafficking, and
interactions with pathogens.
•Mitosis Analysis: LOBSTER's tracking capabilities enable the study of
mitotic events, tracking individual cells through the stages of mitosis. This
NIRFAST:
NIRFAST is a computational tool that leverages the principles of near-infrared
light propagation through tissues. Near-infrared light penetrates
biological tissues more effectively than visible light, making it suitable for
non-invasive imaging applications. NIRFAST provides researchers with a
platform to simulate and reconstruct optical properties of tissues, enabling
the visualization and quantification of biological structures and functions.
•Optical Property Simulation: NIRFAST allows users to simulate the
propagation of near-infrared light through complex tissue geometries. This
simulation is based on mathematical models that take into account the
optical properties of tissues, such as absorption and scattering coefficients.
•Fluorescence Tomography: One of NIRFAST's significant applications is in
fluorescence tomography. It enables the modeling and reconstruction of
fluorescent markers within tissues. This is particularly useful in molecular
imaging and studying the distribution of contrast agents or fluorescent
probes.
•Oncology: In oncology research, NIRFAST plays a crucial role in
characterizing tumors based on their optical properties. It aids in tumor
detection, monitoring treatment responses, and guiding surgical
interventions.
•Molecular Imaging: In molecular imaging applications, NIRFAST is utilized
for studying the distribution and concentration of fluorescent molecular
PYTHON
Python, a versatile and user-friendly programming language, has become
an integral tool in life science laboratories for data analysis,
modeling, and automation. This section explores the applications of
Python in life science-oriented laboratories, highlighting its significance in
various domains such as bioinformatics, computational biology, and
experimental design. Python's popularity in life science laboratories stems
from its simplicity, readability, and extensive libraries. Its open-source
nature and large community support make it an ideal choice for
researchers aiming to harness computational capabilities in their
experiments and analyses.
•Sequence Analysis: Python is widely used in bioinformatics for DNA, RNA,
and protein sequence analysis. Libraries like Biopython offer tools for
reading, manipulating, and analyzing biological sequences, facilitating
tasks such as sequence alignment, motif searching, and structure
prediction.
•Genomic Data Mining: Life scientists utilize Python for mining and
analyzing large genomic datasets. Pandas and NumPy libraries provide
efficient data manipulation and statistical analysis tools, enabling
researchers to derive meaningful insights from genomics data.
•Systems Biology: Python serves as a powerful language for building and
simulating complex biological models in systems biology. Libraries like
SciPy and BioSimPy enable the creation of mathematical models to simulate
biological processes, helping researchers understand dynamic interactions
within cellular systems.
•Biological Network Analysis: With the NetworkX library, Python facilitates
the analysis of biological networks such as protein-protein interaction
networks or metabolic pathways. Researchers can explore network
topology, identify key nodes, and study the dynamics of interconnected
biological systems.
•Pandas and NumPy: These libraries are fundamental for handling and
analyzing large genomic datasets. Researchers use Python to manipulate,
clean, and analyze genomic data efficiently, enabling tasks like variant
calling, gene expression analysis, and genomic association studies.
• Biopython: Python's Biopython library is extensively used for
bioinformatics tasks. It supports the analysis of biological sequences (DNA,
RNA, and protein) and provides tools for tasks such as sequence alignment,
motif searching, and structure prediction.
•SciPy and BioSimPy: Python facilitates systems biology modeling and
simulations. Researchers use Python to create mathematical models that
simulate biological processes, aiding in the understanding of complex
interactions within biological systems.
•Matplotlib and Seaborn: Python's data visualization libraries are applied
for creating clear and informative visualizations of biological data.
Researchers use these tools to generate plots, charts, and graphs that aid in
the interpretation and communication of results.
• Automation Scripts: Python scripts are employed for laboratory
automation, helping automate repetitive tasks, data collection, and
instrument control. Python's ease of integration with various devices and
instruments makes it a valuable tool for improving laboratory workflows.
•LIMS Integration: Python is used to integrate with Laboratory Information
Management Systems (LIMS), providing a unified platform for managing
experimental data, sample information, and workflows within the
laboratory.
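The sequence-analysis tasks listed above can be illustrated with a small sketch in plain Python. It does not use Biopython, which offers far more complete tools; the function names and the example sequence are made up for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif(seq, motif):
    """0-based start positions of every (possibly overlapping) motif match."""
    seq, motif = seq.upper(), motif.upper()
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]


dna = "ATGCGATATAGC"  # made-up sequence, not taken from any database
print(gc_content(dna))
print(reverse_complement(dna))
print(find_motif(dna, "ATA"))
```

In practice a laboratory would normally rely on a tested library for these operations; the sketch only shows how little code the basic ideas require.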
PRIMER3
Primer3 is a widely used bioinformatics tool designed for the automated
design of PCR (Polymerase Chain Reaction) primers. It aims to select
optimal primers for DNA amplification reactions, ensuring high specificity
and efficiency in various molecular biology applications.
1. PCR Experiments: Primer3 is extensively used for designing primers for
various PCR experiments, including endpoint PCR, quantitative PCR (qPCR),
and reverse transcription PCR (RT-PCR).
2. Molecular Cloning: Researchers use Primer3 to design primers for cloning
experiments, ensuring the specificity and efficiency of DNA amplification
for subsequent molecular manipulations.
3. Targeted Sequencing: In applications such as Sanger sequencing or
next-generation sequencing (NGS), Primer3 is employed to design primers for
amplifying specific target regions.
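To give a feel for the kind of checks a primer-design tool automates, here is a rough sketch of two basic primer properties: GC percentage and an approximate melting temperature from the simple rule of thumb Tm = 2(A+T) + 4(G+C) °C, which is only meaningful for short oligonucleotides. This is not Primer3's method; Primer3 applies more elaborate thermodynamic models and many additional criteria. The thresholds and the primer below are illustrative assumptions.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def rule_of_thumb_tm(primer):
    """Rough melting temperature in deg C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_basic_checks(primer, min_len=18, max_len=25, min_gc=40.0, max_gc=60.0):
    """Length and GC window are illustrative thresholds, not Primer3 defaults."""
    length_ok = min_len <= len(primer) <= max_len
    gc_ok = min_gc <= gc_percent(primer) <= max_gc
    return length_ok and gc_ok


primer = "ATGCGTACGTTAGCCTAGGA"  # made-up 20-mer for illustration
print(gc_percent(primer), rule_of_thumb_tm(primer), passes_basic_checks(primer))
```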
HOLOLENS
Microsoft HoloLens is an augmented reality (AR) headset developed and
manufactured by Microsoft. It overlays digital content onto the physical
world, enabling users to interact with holograms in their real environment.
1. Hardware: HoloLens features see-through holographic lenses, sensors
(including depth-sensing cameras and spatial mapping sensors), speakers,
microphones, and a built-in processor. The device is untethered, meaning
users can move freely without being connected to a PC.
2. Spatial Mapping and Tracking: HoloLens uses advanced sensors to map
the surrounding environment in real-time and accurately track the user's
movements and gestures. This enables precise placement and interaction
with holographic content anchored to specific physical locations.
3. Holographic Display: HoloLens projects holographic images directly onto
the user's field of view, creating a mixed reality experience where virtual
objects appear to coexist with the real world. Users can view and interact
with holograms without obstructing their view of the physical
environment.
4. Gesture and Voice Input: HoloLens supports natural interaction through
gestures, voice commands, and gaze tracking. Users can manipulate
holographic objects using hand gestures, speak voice commands to control
applications, and engage in gaze-based interactions by focusing on specific
elements.
5. Development Platform: Microsoft provides a comprehensive
development platform for creating mixed reality applications for HoloLens,
including tools, APIs, and SDKs (Software Development Kits). Developers
can build immersive experiences leveraging Unity 3D, Visual Studio, and the
HoloLens Emulator for testing and debugging.
6. Enterprise and Industrial Applications: HoloLens is used in various
industries for enterprise applications such as remote assistance, training
and simulation, 3D visualization, design and prototyping, maintenance and
repair, and medical imaging. It enables workers to access contextual
information, instructions, and guidance hands-free, improving productivity
and efficiency.
7. Education and Entertainment: HoloLens is also utilized in education and
entertainment sectors for immersive learning experiences, interactive
storytelling, virtual tours, and gaming. It provides new opportunities for
engaging and immersive content delivery, fostering creativity and
exploration.
UNIT 3
Forms of biological information and the need for storage. General
introduction to biological databases; nucleic acid databases (NCBI, DDBJ,
and EMBL). Protein databases (PIR, Uniprot). Specialized genome
databases: (SGD and microbial genome database-MGDB). Structure
database-PDB.
DATABASE
What is a database:
A database is a computerized archive used to store and organize data in
such a way that information can be retrieved easily via a variety of search
criteria. Databases are composed of computer hardware and software for
data management. The chief objective of the development of a database is
to organize data in a set of structured records to enable easy retrieval of
information. Each record, also called an entry, should contain a number of
fields that hold the actual data items, for example, fields for names, phone
numbers, addresses, dates. To retrieve a particular record from the
database, a user can specify a particular piece of information, called value,
to be found in a particular field and expect the computer to retrieve the
whole data record. This process is called making a query.
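The idea of records, fields, and a query can be sketched in a few lines of Python. This is only a minimal illustration of the vocabulary, not how a real database engine works; the sample records and the helper `query` are hypothetical.

```python
# Each record (entry) is a dict; each key is a field name.
# The sample records are invented for illustration.
records = [
    {"name": "Asha", "phone": "555-0101", "city": "Chennai"},
    {"name": "Ravi", "phone": "555-0102", "city": "Madurai"},
    {"name": "Meena", "phone": "555-0103", "city": "Chennai"},
]


def query(records, field, value):
    """Return every whole record whose `field` holds `value`."""
    return [rec for rec in records if rec.get(field) == value]


matches = query(records, "city", "Chennai")
for rec in matches:
    print(rec["name"], rec["phone"])
```

Here the user supplies a value ("Chennai") for one field ("city") and gets back the complete matching records.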
Although data retrieval is the main purpose of all databases, biological
databases often have a higher level of requirement, known as knowledge
discovery, which refers to the identification of connections between pieces
of information that were not known when the information was first
entered. For example, databases containing raw sequence information can
perform extra computational tasks to identify sequence homology or
conserved motifs. These features facilitate the discovery of new biological
insights from raw data.
Types of databases:
Originally, databases all used a flat file format, which is a long text file that
contains many entries separated by a delimiter, a special character such as
a vertical bar (|). Within each entry are a number of fields separated by tabs
or commas. Except for the raw values in each field, the entire text file does
not contain any hidden instructions for computers to search for specific
information or to create reports based on certain fields from each record.
The text file can be considered a single table. Thus, to search a flat file for a
particular piece of information, a computer has to read through the entire
file, an obviously inefficient process. This is manageable for a small
database, but as database size increases or data types become more
complex, this database style can become very difficult for information
retrieval. Indeed, searches through such files often cause crashes of the
entire computer system because of the memory-intensive nature of the
operation. To facilitate the access and retrieval of data, sophisticated
computer software programs for organizing, searching, and accessing data
have been developed. They are called database management systems.
These systems contain not only raw data records but also operational
instructions to help identify hidden connections among data records. The
purpose of establishing a data structure is for easy execution of the
searches and to combine different records to form final search reports.
Depending on the types of data structures, these database management
systems can be classified into two types: relational database management
systems and object-oriented database management systems. Consequently,
databases employing these management systems are known as relational
databases or object-oriented databases, respectively.
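Why a flat file is slow to search can be seen in a small sketch. It assumes entries separated by a vertical bar and fields separated by commas, as described above; the identifiers and values are invented, and `scan_flat_file` is a hypothetical helper.

```python
# A tiny flat file: entries separated by "|", fields separated by commas.
# The IDs and sequences are invented for illustration.
FLAT_FILE = (
    "SEQ001,Homo sapiens,ATGGCC|"
    "SEQ002,Mus musculus,ATGTTT|"
    "SEQ003,Homo sapiens,GGGCCC"
)
FIELD_NAMES = ("id", "organism", "sequence")


def scan_flat_file(text, field, value):
    """Linear search: every entry is read and split, even non-matching ones."""
    column = FIELD_NAMES.index(field)
    hits = []
    entries_read = 0
    for entry in text.split("|"):
        entries_read += 1
        fields = entry.split(",")
        if fields[column] == value:
            hits.append(dict(zip(FIELD_NAMES, fields)))
    return hits, entries_read


hits, entries_read = scan_flat_file(FLAT_FILE, "organism", "Homo sapiens")
print(len(hits), "hits after reading", entries_read, "entries")
```

Because the file carries no index, the number of entries read always equals the size of the file, which is the inefficiency a database management system is designed to avoid.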
→ Relational databases:
Instead of using a single table as in a flat file database, relational databases
use a set of tables to organize data. Each table, also called a relation, is
made up of columns and rows. Columns represent individual fields. Rows
represent values in the fields of records. The columns in a table are indexed
according to a common feature called an attribute, so they can be cross-
referenced in other tables. To execute a query in a relational database, the
system selects linked data items from different tables and combines the
information into one report. Therefore, specific information can be found
more quickly from a relational database than from a flat file database.
Relational databases can be created using a special programming language
called structured query language (SQL). The creation of this type of
database can take a great deal of planning during the design phase. After
creation of the original database, a new data category can be easily added
without requiring all existing tables to be modified. The subsequent
database searching and data gathering for reports are relatively
straightforward.
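A minimal sketch of these ideas, using Python's built-in sqlite3 module and an in-memory database, is shown here. The table and column names are assumptions chosen for illustration; the point is that two tables share an attribute and a single SQL query combines linked rows into one report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Two tables (relations) linked by the shared attribute organism_id.
cur.execute("CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE gene ("
    "gene_id INTEGER PRIMARY KEY, symbol TEXT, "
    "organism_id INTEGER REFERENCES organism(organism_id))"
)
cur.executemany(
    "INSERT INTO organism VALUES (?, ?)",
    [(1, "Homo sapiens"), (2, "Saccharomyces cerevisiae")],
)
cur.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [(10, "TP53", 1), (11, "BRCA1", 1), (12, "GAL4", 2)],
)

# The query joins linked rows from both tables into one report.
cur.execute(
    "SELECT gene.symbol, organism.name FROM gene "
    "JOIN organism ON gene.organism_id = organism.organism_id "
    "WHERE organism.name = ? ORDER BY gene.symbol",
    ("Homo sapiens",),
)
report = cur.fetchall()
print(report)
conn.close()
```

A new data category, for example publications, could be added later as a third table that refers to gene_id, leaving the two existing tables unchanged.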
→ Object-oriented databases:
One of the problems with relational databases is that the tables used do not
describe complex hierarchical relationships between data items. To
overcome the problem, object-oriented databases have been developed that
store data as objects. In an object-oriented programming language, an
object can be considered as a unit that combines data and mathematical
routines that act on the data. The database is structured such that the
objects are linked by a set of pointers defining predetermined relationships
between the objects. Searching the database involves navigating through
the objects with the aid of the pointers linking different objects.
Programming languages like C++ are used to create object-oriented
databases.
The object-oriented database system is more flexible; data can be
structured based on hierarchical relationships. By doing so, programming
tasks can be simplified for data that are known to have complex
relationships, such as multimedia data. However, this type of database
system lacks the rigorous mathematical foundation of the relational
databases. There is also a risk that some of the relationships between
objects may be misrepresented. Some current databases have therefore
incorporated features of both types of database programming, creating the
object–relational database management system.
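The notion of objects linked by pointers can be sketched with ordinary Python classes. This is an illustration of the data model only, not an actual object-oriented database system; the class names, the organism, and the sequence are hypothetical.

```python
class Gene:
    """An object bundling data (symbol, sequence) with routines acting on it."""

    def __init__(self, symbol, sequence):
        self.symbol = symbol
        self.sequence = sequence
        self.products = []  # references ("pointers") to Protein objects

    def gc_fraction(self):
        gc = sum(self.sequence.count(base) for base in "GC")
        return gc / len(self.sequence)


class Protein:
    def __init__(self, name, gene):
        self.name = name
        self.gene = gene            # pointer back to the parent Gene
        gene.products.append(self)  # predetermined gene -> protein link


class Genome:
    def __init__(self, organism):
        self.organism = organism
        self.genes = []  # hierarchical link: genome -> genes


genome = Genome("hypothetical organism")
gene = Gene("geneA", "ATGGCGCCTA")
genome.genes.append(gene)
Protein("protein A1", gene)
Protein("protein A2", gene)

# Searching means navigating from object to object along the references.
names = [p.name for g in genome.genes for p in g.products]
print(names, gene.gc_fraction())
```

The genome, gene, and protein hierarchy is expressed directly by the references, which is the kind of relationship that is awkward to describe with flat tables.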
Biological databases:
Current biological databases use all three types of database structures: flat
files, relational, and object oriented. Despite the obvious drawbacks of
using flat files in database management, many biological databases still use
this format. The justification for this is that this system involves a minimum
amount of database design and the search output can be easily understood
by working biologists. Based on their contents, biological databases can be
roughly divided into three categories: primary databases, secondary
databases, and specialized databases.
Primary databases contain original biological data. They are archives of raw
sequence or structural data submitted by the scientific community.
GenBank and the Protein Data Bank (PDB) are examples of primary databases.
Secondary databases contain computationally processed or manually
curated information, based on original information from primary
databases. Translated protein sequence databases containing functional
annotation belong to this category. Examples are Swiss-Prot and protein
1. Primary databases:
There are three major public sequence databases that store raw nucleic
acid sequence data produced and submitted by researchers worldwide:
GenBank, the European Molecular Biology Laboratory (EMBL) database, and
the DNA Data Bank of Japan (DDBJ), all of which are freely available on the
internet. Most of the data in the databases are contributed directly by
authors with a minimal level of annotation. A small number of sequences,
especially those published in the 1980s, were entered manually from
published literature by database management staff. Presently, sequence
submission to GenBank, EMBL, or DDBJ is a precondition for
publication in most scientific journals, to ensure that the fundamental
molecular data are made freely available. These three public databases closely
collaborate and exchange new data daily. Together they constitute the
International Nucleotide Sequence Database Collaboration (INSDC). This means
that by connecting to any one of the three databases, one should have access to
the same nucleotide sequence data.
Although the three databases all contain the same sets of raw data, each of
the individual databases uses a slightly different format to represent
the data.
2. Secondary databases:
Secondary databases contain computationally processed sequence
information derived from the primary databases. The amount of
computational processing work varies greatly among the secondary
databases; some are simple archives of translated sequence data from
identified open reading frames in DNA, whereas others provide additional
annotation and higher-level information regarding structure and
function.
3. Specialized databases:
Specialized databases normally serve a specific research community or
focus on a particular organism. The content of these databases may be
the top of the webpage, and press the “search” button beside the search
box.
On the results page of a normal NCBI search you will see the number of hits
to “NC_001477” in each of the NCBI databases on the NCBI website. There
are many databases on the NCBI website: for example, PubMed and
PubMed Central contain abstracts and full texts of scientific papers, the
genes and genomes databases contain DNA and RNA sequence data, the proteins
database contains protein sequence data, and so on.
Most biologists would do this type of work by hand from within their web
browser, but it can also be done by writing small programs in scripting
languages such as Python or R. In R, the rentrez package is a powerful tool
for interacting with NCBI resources. In this tutorial we’ll focus on the web
interface. It’s good to remember, though, that almost anything done via the
webpage can be automated using a computer script.
A challenge when learning to use NCBI resources is that there is a
tremendous amount of sequence information available and you need to
learn how to sort through what the search results provide. As you are
looking for the DNA sequence of the dengue den-1 virus genome, you
expect to see a hit in the NCBI nucleotide database. This is indicated at the
top of the page where it says “nucleotide sequence” and lists “dengue virus
1, complete genome.”
When you click on the link for the nucleotide database, it will bring you to
the record for NC_001477 in the NCBI nucleotide database. This will
contain the name and NCBI accession of the sequence, as well as other
details such as any papers describing the sequence. If you scroll down you’ll
see the sequence also.
If you need it, you can retrieve the DNA sequence for the den-1 dengue
virus genome sequence as a FASTA format sequence file in a couple ways.
The easiest is just to copy and paste it into a text, .r, or other file. You can
also click on “send to” at the top right of the NC_001477 sequence record
webpage.
After you click on send to you can pick several options. And then choose
“file” in the menu that appears, and then choose FASTA from the “format”
menu that appears, and click on “create file”. The sequence will then
download. The default file name is sequence. FASTA so you’ll probably want
to change it.
You can now open the FASTA file containing the den-1 dengue virus genome
sequence using a text editor like notepad, WordPad, notepad++, or even
RStudio on your computer. To find a text editor on your computer, search for
“text” from the Start menu (Windows) and usually one will come up.
→ GenBank:
The GenBank sequence database is an open-access, annotated collection
of nucleotide sequences and their protein translations, including mRNA
sequences with coding regions, segments of genomic DNA with a single
gene or multiple genes, and ribosomal RNA gene clusters. GenBank is
produced and maintained by the National Center for Biotechnology
Information (NCBI) as part of an international collaboration
with the EMBL Data Library from the EBI and the DNA Data Bank of Japan
(DDBJ).
Individual laboratories can submit sequence data, and large-scale sequencing
centres can make bulk submissions, directly to GenBank by using BankIt
or Sequin. BankIt is a web-based form and Sequin is a stand-alone
software tool developed by the NCBI for submitting and updating sequences
in the GenBank, EMBL and DDBJ databases. After sequence submission the
GenBank staff assign an accession number to the newly entered sequence
and perform quality assurance checks. The newly submitted
sequence is then released to the database. Data stored in GenBank can
be retrieved through Entrez or downloaded by File Transfer Protocol (FTP).
GenBank also holds collections of Expressed Sequence Tags (ESTs),
Sequence Tagged Sites (STSs), Genome Survey Sequences (GSSs), High
Throughput Genome Sequences (HTGSs) and complete microbial genome
sequences.
GenBank can be accessed through the server
http://www.ncbi.nlm.nih.gov/GenBank/.
There are several ways to search and retrieve data from GenBank:
• Search GenBank for sequence identifiers and annotations with Entrez
Nucleotide, which is divided into three divisions: CoreNucleotide (the main
collection), dbEST (expressed sequence tags), and dbGSS (genome survey
sequences).
• Search and align GenBank sequences to a query sequence using BLAST.
• Search, link, and download sequences programmatically using NCBI
E-utilities.
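As an illustration of the last option, the following Python sketch builds an E-utilities EFetch request for the dengue accession NC_001477 used in these notes. The function names build_efetch_url and fetch_sequence are hypothetical helpers introduced here; only the EFetch address and its db, id, rettype and retmode parameters come from NCBI. It is a minimal sketch rather than a complete client, and the download step needs an internet connection.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Return the EFetch URL that requests one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_sequence(accession, db="nuccore", rettype="fasta"):
    """Download the record as text (hypothetical helper; needs network access)."""
    with urlopen(build_efetch_url(accession, db, rettype)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only the URL is printed here, so the example runs offline.
    print(build_efetch_url("NC_001477"))
```

Calling fetch_sequence("NC_001477") should return the same FASTA text that the “Send to” menu of the web page produces.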
→ DDBJ:
DDBJ (DNA Data Bank of Japan) is a nucleotide sequence data bank
that receives nucleotide sequences from researchers and issues
accession numbers to data submitters. DDBJ collects sequence data mainly
from Japanese researchers, but it also accepts data from, and issues
accession numbers to, researchers in any other country. DDBJ began data
bank activities in 1986 at the National Institute of Genetics (NIG). Currently,
DDBJ is in operation at NIG in Mishima, Japan.
The main activities of DDBJ are:
i) as a member of the INSDC, DDBJ collects nucleotide sequence data from
researchers, issues accession numbers to the data submitters, and exchanges
the collected data with EMBL-Bank and GenBank on a daily basis;
ii) DDBJ manages bioinformatics tools for data submission and retrieval;
iii) DDBJ develops tools for the analysis of biological data; and
iv) DDBJ organizes bioinformatics training courses in Japanese to teach how to
analyse biological data.
DDBJ can be accessed through the server
http://www.ddbj.nig.ac.jp.
→ EBI-EMBL:
The European Bioinformatics Institute (EBI) is part of the European Molecular
Biology Laboratory (EMBL). Its nucleotide sequence database, now known as
EMBL-Bank, was established in 1980 at the EMBL in Heidelberg, Germany. It was the
world’s first nucleotide sequence database. EMBL-EBI provides freely
available data from life science experiments, performs basic research in
computational biology and offers an extensive user training programme for
researchers. EMBL-EBI stores data on DNA and RNA (genes, genomes
and variation), gene expression (RNA, protein and metabolite expression),
proteins (sequences, families and motifs), structures (molecular and cellular
structures), systems (reactions, interactions, pathways), chemical biology
(chemogenomics and metabolomics), ontologies (taxonomies and
controlled vocabularies) and literature (scientific publications and patents).
EMBL-EBI can be accessed through the server http://www.ebi.ac.uk.
→ Ensembl:
Ensembl is a joint project between EMBL-EBI and the Wellcome Trust
Sanger Institute to develop a software system that produces and maintains
automatic annotation on selected eukaryotic genomes. Ensembl was
started in 1999 with the aim to automatically annotate the genome, integrate
this annotation with other available biological data and release the
information to researchers via the web. Ensembl produces genome
databases for vertebrates and other eukaryotic species and makes this
information freely available online. Ensembl can be freely accessed
through the server http://www.asia.ensembl.org. Various research projects
around the world contribute DNA sequences and their assembly data to
Ensembl. The database emphasizes two areas of comparative
genomics: the creation of gene trees using representative proteins from
each gene in a species, and the alignment of DNA sequences to infer
synteny, conservation, etc. The Ensembl Variation database stores
data on the regions of the genome that differ between individual genomes,
together with associated disease and phenotype information. Ensembl Regulation
stores data on the mechanisms of gene regulation in human and mouse cells,
both transcriptional and post-transcriptional.
Protein databases:
→ PIR:
PIR (Protein Information Resource) was developed by the National
Biomedical Research Foundation (NBRF) in 1984 to assist researchers in
the identification and interpretation of protein sequence information. It is
an integrated public resource of protein informatics that supports genomic
and proteomic research and scientific discovery. PIR has four distinct
sections: PIR1 contains fully classified and annotated entries; PIR2
contains preliminary entries that have not been thoroughly reviewed and
may contain redundancy; PIR3 contains unverified entries; and PIR4 contains
entries in one of the following categories:
i. Conceptual translations of artifactual sequences,
ii. Conceptual translations of sequences that are not transcribed or
translated,
iii. Protein sequences or conceptual translations that are genetically
engineered,
iv. Sequences that are not genetically encoded and not produced on
ribosomes.
PIR maintains the Protein Sequence Database (PSD),
which stores over 283 000 sequences.
For over four decades PIR has been providing protein databases and
analysis tools that are freely accessible to researchers, including the
PSD. PIR has a bibliography system for
literature searching, mapping, and user submission. PIR also maintains a
non-redundant reference (NREF) database, and iProClass, an integrated
Structure database:
→ PDB:
The Protein Data Bank (PDB) was established at Brookhaven National Laboratory
(BNL) in 1971. PDB contains 3D structures of proteins determined by
X-ray crystallography and nuclear magnetic resonance (NMR) studies and
is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB)
at Rutgers University.
As of December 24, 2013, there were 96,596 protein structures available
at PDB, which provide information on the atomic coordinates of amino acids in
organisms, because they can finally get all of the orthologous genes. MGDB
also provides a conventional genome map search interface for users to navigate an
individual genome to retrieve a particular gene. All information about a
particular gene, such as homology relationships and motif hits, is
summarized in a gene information page, which also includes a link to
retrieve neighboring orthologs.
Users can also specify query sequences for a similarity search. Here, the
system calculates similarities between query and database sequences in
the same way as the all-against-all similarities in MGDB, i.e. BLAST searches
followed by dynamic programming (DP) alignment, and then finds the clusters
containing the genes hit by the search. The results are listed in order of the
average similarity scores against the query.
UNIT 4
Retrieval methods for Nucleic acid and Protein Sequences. Use of
Bioinformatic Tools -sequence homology- substitution matrices- PAM and
BLOSUM. Pairwise alignment (global and local) using BLAST. Multiple
sequence alignment using Clustal omega.
You can easily retrieve DNA or protein sequence data by hand from
the NCBI Sequence Database via its website www.ncbi.nlm.nih.gov.
Dengue DEN-1 DNA is a viral DNA sequence and its NCBI accession
number is NC_001477. To retrieve the DNA sequence for the Dengue
DEN-1 virus from NCBI, go to the NCBI website, type “NC_001477” in the
Search box at the top of the webpage, and press the “Search” button
beside the Search box.
On the results page of a normal NCBI search you will see the number of
hits to “NC_001477” in each of the NCBI databases on the NCBI website.
There are many databases on the NCBI website, for
example, PubMed and PubMed Central contain abstracts from scientific
papers, the Genes and Genomes database contains DNA and RNA
sequence data, the Proteins database contains protein sequence data,
and so on.
Most biologists would do this type of work by hand from within their web
browser, but it can also be done by writing small programs in scripting
languages such as Python or R. In R, the rentrez package is a powerful
tool for interacting with NCBI resources. In this tutorial we’ll focus on the
web interface. It’s good to remember, though, that almost anything done
via the webpage can be automated using a computer script.
A challenge when learning to use NCBI resources is that there is a
tremendous amount of sequence information available and you need to
learn how to sort through what the search results provide. As you are
looking for the DNA sequence of the Dengue DEN-1 virus genome, you
expect to see a hit in the NCBI Nucleotide database. This is indicated at
the top of the page where it says “NUCLEOTIDE SEQUENCE” and lists
“Dengue virus 1, complete genome.”
When you click on the link for the Nucleotide database, it will bring you
to the record for NC_001477 in the NCBI Nucleotide database. This will
contain the name and NCBI accession of the sequence, as well as other
details such as any papers describing the sequence. If you scroll down
you’ll see the sequence also.
If you need it, you can retrieve the DNA sequence for the DEN-1 Dengue
virus genome sequence as a FASTA format sequence file in a couple
ways. The easiest is just to copy and paste it into a text, .R, or other file.
You can also click on “Send to” at the top right of the NC_001477
sequence record webpage.
After you click on “Send to” you can pick from several options. Choose
“File” in the menu that appears, then choose FASTA from the
“Format” menu, and click on “Create File”. The sequence will
then download. The default file name is sequence.fasta, so you’ll
probably want to change it.
You can now open the FASTA file containing the DEN-1 Dengue virus
genome sequence using a text editor like Notepad, WordPad, Notepad++,
or even RStudio on your computer. To find a text editor on your computer
search for “text” from the start menu (Windows) and usually one will
come up.
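A FASTA file can also be read by a short script instead of a text editor. The Python sketch below assumes the usual FASTA layout: a header line beginning with “>” followed by one or more lines of sequence. The function name parse_fasta and the demonstration record are hypothetical; the record is not the real NC_001477 sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for demonstration only.
example = """>demo_record hypothetical example sequence
AGTTGTTAGT
CTACGTGGAC
"""

for name, sequence in parse_fasta(example):
    print(name.split()[0], len(sequence))  # prints: demo_record 20
```

To read a downloaded file, pass its contents to the function, for example parse_fasta(open("sequence.fasta").read()).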
SEQUENCE HOMOLOGY vs SEQUENCE SIMILARITY
When two sequences are descended from a common evolutionary origin,
they are said to have a homologous relationship or share homology. A
related but different term is sequence similarity, which is the percentage of
aligned residues that are similar in physicochemical properties such as size,
charge, and hydrophobicity. Sequence homology is an inference or a
conclusion about a common ancestral relationship drawn from sequence
similarity comparison when the two sequences share a high enough degree
of similarity.
On the other hand, similarity is a direct result of observation from the
sequence alignment. Sequence similarity can be quantified using
percentages; homology is a qualitative statement. For example, one may say
that two sequences share 40% similarity. It is incorrect to say that the two
sequences share 40% homology. They are either homologous or
nonhomologous.
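Because similarity is a measured quantity, it can be computed directly from an alignment. The Python sketch below counts identical and similar residue pairs in two already-aligned sequences. The residue groups in SIMILAR_GROUPS are an illustrative assumption, not a standard; real programs decide similarity from a substitution matrix. Ignoring columns that contain a gap is likewise only one of several conventions in use.

```python
# Hypothetical grouping of amino acids by physicochemical property.
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic, hydrophobic
    set("FWY"),   # aromatic
    set("KRH"),   # positively charged
    set("DE"),    # negatively charged
    set("STNQ"),  # polar, uncharged
    set("AG"),    # small
]


def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # columns containing a gap are not counted here
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # an identical pair is also a similar pair
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100 * identical / compared, 100 * similar / compared


print(identity_and_similarity("ILKD", "IVRA"))  # prints: (25.0, 75.0)
```

The function reports percentages only. Whether the two sequences are homologous remains a separate inference drawn from those numbers.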
Generally, if the sequence similarity level is high enough, a common
evolutionary relationship can be inferred. In dealing with real research
problems, the similarity level at which one can infer a homologous
relationship is not always clear. The answer depends on the type of
sequences being examined and on the sequence lengths.
PAM Matrix
The PAM matrices (also called Dayhoff PAM matrices) were first
constructed by Margaret Dayhoff, who compiled alignments of seventy-one
groups of very closely related protein sequences. PAM stands for “point
accepted mutation” (although “accepted point mutation” or APM may be a
more appropriate term, PAM is easier to pronounce). Because of the use of
very closely related homologs, the observed mutations were not expected
to significantly change the common function of the proteins.
Thus, the observed amino acid mutations are considered to be accepted by
natural selection. These protein sequences were clustered based on
phylogenetic reconstruction using maximum parsimony. The PAM matrices
were subsequently derived based on the evolutionary divergence between
sequences of the same cluster. One PAM unit is defined as 1% of the amino
acid positions that have been changed.
To construct a PAM1 substitution table, a group of closely related sequences
with mutation frequencies corresponding to one PAM unit is chosen. Based
on the collected mutational data from this group of sequences, a
substitution matrix can be derived.
BLOSUM Matrix
In the PAM matrix construction, the only direct observation of residue
substitutions is in PAM1, based on a relatively small set of extremely closely
related sequences.
Sequence alignment statistics for more divergent sequences are not
available. To fill this gap, a new set of substitution matrices has been
developed. This is the series of blocks amino acid substitution matrices
(BLOSUM), all of which are derived from direct observation of every possible
amino acid substitution in multiple sequence alignments. These were constructed
based on more than 2,000 conserved amino acid patterns representing 500
groups of protein sequences. The sequence patterns, also called blocks, are
ungapped alignments of less than sixty amino acid residues in length. The
frequencies of amino acid substitutions of the residues in these blocks are
calculated to produce a numerical table, or block substitution matrix.
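How substitution frequencies become scores can be sketched in Python on a toy block. The sketch assumes the commonly described log-odds form, in which each score is twice the base-2 logarithm of the observed pair frequency divided by the frequency expected by chance, rounded to an integer. It leaves out the clustering of sequences by percentage identity that real BLOSUM construction applies, and the four-sequence block is invented for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Compute rounded log-odds scores from an ungapped block of sequences."""
    # Count every residue pair observed within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    observed = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


# Invented block: three columns, four sequences.
toy_block = ["ASA", "ASA", "ASA", "ASS"]
print(blosum_like_scores(toy_block))
```

For this block the conserved pairs A/A and S/S receive positive scores (1 and 2), while the rarely observed A/S substitution receives a negative score (-3).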
Instead of relying on an extrapolation function, the BLOSUM matrices are named
for the actual percentage identity values of the sequences selected for
construction of the matrices. For example, BLOSUM62 indicates that the
sequences selected for constructing the matrix share an average identity value
of 62%. Other BLOSUM matrices based on sequence groups of various identity
levels have also been constructed. In the reverse order of the PAM numbering
system, the lower the BLOSUM number, the more divergent the sequences the
matrix represents.
PAIRWISE ALIGNMENT
The overall goal of pairwise sequence alignment is to find the best pairing
of two sequences, such that there is maximum correspondence among
residues. To achieve this goal, one sequence needs to be shifted relative to
the other to find the position where maximum matches are found.
There are two different alignment strategies that are often used: global
alignment and local alignment.
Global Alignment and Local Alignment
In global alignment, two sequences to be aligned are assumed to be
generally similar over their entire length. Alignment is carried out from
beginning to end of both sequences to find the best possible alignment
across the entire length between the two sequences. This method is more
applicable for aligning two closely related sequences of roughly the same
length. For divergent sequences and sequences of variable lengths, this
method may not be able to generate optimal results because it fails to
recognize highly similar local regions between the two sequences.
Local alignment, on the other hand, does not assume that the two sequences
in question have similarity over the entire length. It only finds local
regions with the highest level of similarity between the two sequences and
aligns these regions without regard for the alignment of the rest of the
sequence regions. This approach can be used for aligning more divergent
sequences with the goal of searching for conserved patterns in DNA or protein
sequences. The two sequences to be aligned can be of different lengths. This
approach is more appropriate for aligning divergent biological sequences
containing only modules that are similar, which are referred to as domains or
motifs.
Dot Matrix Method
Dot matrix method, also known as the dot plot method, is a graphical
method of sequence alignment that involves comparing two sequences by
plotting them in a two-dimensional matrix.
In a dot matrix, two sequences that must be compared are plotted along a
matrix’s horizontal and vertical axes. The method then scans each residue
of one sequence to identify similarities with all residues in the other
sequence.
If a residue in one sequence matches a residue in the other sequence, a dot
is placed in the corresponding position in the matrix. Otherwise, the matrix
position is left blank.
If the two sequences being compared are highly similar, the dot plot will
display as a single line along the matrix’s main diagonal. However, when
the sequences are less similar, the dot plot will show more scattered dots
with fewer diagonal lines, indicating that the sequences share less
similarity.
Dot plots can also find repeat elements in a single sequence. Short parallel
lines above and below the main diagonal indicate the presence of repeats.
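The dot matrix idea can be sketched in a few lines of Python. The sketch marks every exact single-residue match, without the window and threshold filtering that practical programs add to reduce noise. The function names dot_matrix and render are hypothetical, and “.” stands for a blank cell so that the layout stays visible in plain text.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid of booleans: rows follow seq_y, columns follow seq_x."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Draw the dot matrix as text, '*' for a match and '.' for a blank."""
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


# Comparing a sequence with itself gives an unbroken main diagonal;
# the repeated letters add extra dots away from the diagonal.
print(render("GATTACA", "GATTACA"))
```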
Dotmatcher (bioweb.pasteur.fr/seqanal/interfaces/dotmatcher.html) and
Dottup (bioweb.pasteur.fr/seqanal/interfaces/dottup.html) are two
programs of the EMBOSS package, which have been made available online.
Dotmatcher aligns and displays dot plots of two input sequences (DNA or
proteins) in FASTA format. A window of specified length and a scoring
scheme are used. Diagonal lines are only plotted over the position of the
windows if the similarity is above a certain threshold. Dottup aligns
sequences using the word method and is capable of handling genome-
length sequences. Diagonal lines are only drawn if exact matches of words
of specified length are found.
Dothelix (www.genebee.msu.su/services/dhm/advanced.html) is a dot matrix
program for DNA or protein sequences. The program has a number of options for
length threshold (similar to window size) and implements scoring matrices
for protein sequences. In addition to drawing diagonal lines with similarity
scores above a certain threshold, the program displays actual pairwise
alignment.
Dynamic Programming
Dynamic programming is used to find the optimal alignment between two
proteins or nucleic acid sequences by comparing all possible pairs of
characters in the sequences.
Dynamic programming can be used to produce both global and local
alignments. The global pairwise alignment algorithm using dynamic
programming is based on the Needleman-Wunsch algorithm, while the
dynamic programming in local alignment is based on the Smith-Waterman
algorithm.
3. Traceback to identify optimal alignment:
After filling the matrix, the algorithm performs a traceback to find
the optimal alignment path. Starting from the bottom-right corner
and moving towards the top-left corner, adjacent cells are examined
in reverse order to determine the best path with the highest total
score. The optimal alignment path is the one with the maximum
score.
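The matrix fill and the traceback can be sketched together in Python for the global (Needleman-Wunsch) case. The scoring values used here (match +1, mismatch −1, and a simple linear gap penalty of −2 per position) are illustrative assumptions, not values prescribed by these notes. When several paths share the best score, this sketch returns only one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Traceback from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # prints: (1, 'ACGT', 'A-GT')
```

A local (Smith-Waterman) version differs mainly in that no cell score is allowed to fall below zero and the traceback starts from the highest-scoring cell rather than the corner.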
Gap Penalties
Performing optimal alignment between sequences often involves applying
gaps that represent insertions and deletions. Because in natural
evolutionary processes insertions and deletions are relatively rare in
comparison to substitutions, introducing gaps should be made more
difficult computationally, reflecting the rarity of insertional and deletional
events in evolution. However, assigning penalty values can be more or less
arbitrary because there is no evolutionary theory to determine a precise
cost for introducing insertions and deletions. If the penalty values are set
too low, gaps can become too numerous to allow even nonrelated
sequences to be matched up with high similarity scores. If the penalty
values are set too high, gaps may become too difficult to appear, and
reasonable alignment cannot be achieved, which is also unrealistic.
Through empirical studies of globular proteins, a set of penalty values
has been developed that appears to suit most alignment purposes. These values
are normally implemented as defaults in most alignment programs.
Another factor to consider is the cost difference between opening a gap and
extending an existing gap. It is known that it is easier to extend a gap that
has already been started. Thus, gap opening should have a much higher
penalty than gap extension. This is based on the rationale that if insertions
and deletions ever occur, several adjacent residues are likely to have been
inserted or deleted together. These differential gap penalties are also
referred to as affine gap penalties. The normal strategy is to use preset gap
penalty values for introducing and extending gaps. For example, one may
use a −12/−1 scheme in which the gap opening penalty is −12 and the gap
extension penalty is −1. The total gap penalty (W) is a linear function of gap
length, which is calculated using the formula:
W = γ + δ × (k − 1)
where γ is the gap opening penalty, δ is the gap extension penalty, and k is
the length of the gap. Besides the affine gap penalty, a constant gap penalty
is sometimes also used, which assigns the same score for each gap position
regardless of whether it is opening or extending. However, this penalty
scheme has been found to be less realistic than the affine penalty.
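The formula can be checked with a short Python function. The default values follow the −12/−1 example scheme above; the function name affine_gap_penalty is a hypothetical helper.

```python
def affine_gap_penalty(k, gap_open=-12, gap_extend=-1):
    """Total penalty W = gap_open + gap_extend * (k - 1) for a gap of length k."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (k - 1)


for length in (1, 2, 5):
    print(length, affine_gap_penalty(length))
# prints:
# 1 -12
# 2 -13
# 5 -16
```

Under this scheme a single gap of five positions costs −16, whereas five separate one-position gaps would cost −60, which reflects the assumption that adjacent residues tend to be inserted or deleted together.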
Gaps at the terminal regions are often treated with no penalty because in
reality many true homologous sequences are of different lengths.
Consequently, end gaps can be allowed to be free to avoid getting
unrealistic alignments.
BLAST ANALYSIS
BLAST stands for Basic Local Alignment Search Tool. It is a widely used
bioinformatics program that was first introduced by Stephen Altschul and
colleagues in 1990 and has since become one of the most popular tools for
sequence similarity searching. There are five types (variants) of BLAST that are
differentiated based on the type of sequence (DNA or protein) of the query
and database sequences.
1. BLASTN compares a nucleotide query sequence to a nucleotide
sequence database.
2. BLASTP compares a protein query sequence to a protein sequence
database.
3. BLASTX compares a nucleotide query sequence to a protein
sequence database by translating the query sequence into its six
possible reading frames and aligning them with the protein
sequences.
UNIT 5
Methods of Genome analysis- Shot gun and Hierarchical methods. Gene
Prediction using GENEMARK. Phylogenetic tree construction by MEGA
(molecular genetic evolutionary analysis). Protein Structure visualization
Tool- RasMol and MolMol.
5. Sequence Assembly:
- After sequencing, the short sequence reads are processed and assembled
into longer contiguous sequences (contigs) using specialized bioinformatics
software and algorithms.
- Assembly involves aligning and overlapping the sequence reads to
reconstruct the original genome sequence. The process is facilitated by the
use of paired-end sequencing, which provides information about the
relative positions of sequence reads within the genome.
- Assembly software uses various algorithms to assemble contigs, resolve
repeats, and generate scaffolds that represent the linear order of contigs.
6. Genome Annotation and Analysis:
- Once assembled, the genome sequence is annotated to identify genes,
regulatory elements, and other functional elements.
- Annotation involves predicting gene locations, identifying coding regions
(exons) and non-coding regions (introns), and annotating regulatory
sequences such as promoters and enhancers.
- The annotated genome sequence is then analyzed to gain insights into
the genetic makeup and biological characteristics of the organism. This may
include studying genetic variation, evolutionary relationships, and the
genetic basis of traits or diseases.
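The overlap-and-merge idea behind step 5 can be sketched in Python as a greedy assembler. This is a simplified illustration only: it assumes error-free reads from one strand and ignores repeats, reverse complements and paired-end information, all of which real assembly software has to handle. The function names and the three short reads are invented for the example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # prints: ['ATGGCGTACCTGA']
```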
ORF Determination:
ORFs are sequences of DNA or RNA that potentially encode proteins. They
are identified by their ability to be translated into proteins by the cell's
machinery.
1. Identify Start Codons: The most common start codons are AUG (encoding
methionine) in eukaryotes and archaea, and sometimes GUG and UUG in
bacteria. Look for these codons in the sequence.
2. Search for Stop Codons: Look for stop codons (UAA, UAG, or UGA in RNA
sequences) that could terminate the translation. An ORF typically extends
from a start codon to a stop codon.
3. Length Consideration: Not all ORFs are significant. ORFs shorter than a
certain threshold are often discarded as they may not encode functional
proteins. The minimum length considered significant varies depending on
the context, but common thresholds are around 100 codons.
4. Frame Selection: ORFs can lie in any of three reading frames on a strand,
depending on where translation starts relative to the sequence. Hence, for a
given strand, all three reading frames need to be considered (six in total
when both strands are searched).
5. ORF Prediction Tools: There are several bioinformatics tools and
software available for ORF prediction, such as ORFfinder, GeneMark, and
Prodigal. These tools automate the process of identifying ORFs in
nucleotide sequences.
6. Comparative Analysis: Sometimes, comparative genomics can help
identify conserved ORFs, which are more likely to encode functional
proteins.
7. Experimental Verification: Finally, experimental methods such as gene
expression analysis, mass spectrometry, or functional assays are often used
to verify the functionality of predicted ORFs.
Steps:
1. Translation in All 6 Frames:
- DNA sequences can be translated in all six possible reading frames: three
on the forward strand and three on the reverse-complement strand, each read 5'
to 3'. This accounts for the possibility of ORFs occurring in any reading frame.
2. Stop Codons Every 20 Codons:
- Because 3 of the 64 codons are stop codons (UAA, UAG, and UGA), a stop codon
occurs by chance about once every 21 codons in random, non-coding sequence. A
much longer run without a stop codon therefore helps delineate a potential ORF.
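The six-frame search can be sketched in Python. The sketch works on DNA, so the codons are written with T instead of U. It treats only ATG as a start codon and reports coordinates on whichever strand was scanned, both of which are simplifying assumptions. The function names are hypothetical, and the short test sequence is invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=100):
    """Return ORFs as (strand, frame, start, end, sequence) tuples.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon; min_codons counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos
                elif start is not None and codon in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, pos + 3, seq[start:pos + 3]))
                    start = None
    return orfs


# A tiny invented sequence with a lowered threshold so that one ORF is reported.
print(find_orfs("CCATGAAATTTTAGGG", min_codons=3))
# prints: [('+', 2, 2, 14, 'ATGAAATTTTAG')]
```

With the default threshold of 100 codons, which matches the common cut-off mentioned above, the same short sequence yields no ORFs.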
5. Training HMMs:
- HMMs are typically trained using the Baum-Welch algorithm, which is the
Expectation-Maximization (EM) algorithm applied to HMMs.
- Training involves estimating the parameters (transition probabilities,
emission probabilities, and initial state probabilities) that maximize the
likelihood of the observed data.
6. Software and Tools:
- Several software packages and libraries are available for working with
HMMs, including HMMER, SAM (Sequence Alignment and Modeling System), and the
Biopython library. These tools provide implementations for building, training,
and applying HMMs in various bioinformatics tasks.
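The likelihood of the observed data that training seeks to maximize (item 5 above) can be computed with the forward algorithm. The Python sketch below shows that calculation for a two-state model. The state names and all probability values are made up for illustration and are not taken from any trained gene-finding model.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Return P(observations | model) using the forward algorithm."""
    if not observations:
        raise ValueError("at least one observation is required")
    # alpha[s] is the probability of the symbols seen so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state model; every number below is an assumption.
states = ("coding", "noncoding")
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}
emit_p = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

print(forward_likelihood("GCGC", states, start_p, trans_p, emit_p))
```

Baum-Welch training repeatedly re-estimates the three probability tables so that this likelihood, summed over the training sequences, does not decrease from one round to the next.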
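Applications of GenMark:
GenMark here refers to GenMark Diagnostics, a maker of clinical molecular
testing systems; it is distinct from the GeneMark gene prediction program.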
1. Infectious Disease Testing: GenMark's ePlex® system is utilized for rapid
and accurate detection of infectious diseases caused by bacteria, viruses,
and fungi. This includes respiratory infections (e.g., influenza, respiratory
syncytial virus), bloodstream infections (e.g., sepsis), gastrointestinal
infections (e.g., Clostridioides difficile), and sexually transmitted infections
(e.g., chlamydia, gonorrhea). The system allows for simultaneous testing of
multiple pathogens from a single patient sample, providing timely results to
guide patient management and treatment decisions.
2. Antimicrobial Resistance (AMR) Surveillance: GenMark's ePlex® Blood
Culture Identification (BCID) panels incorporate assays targeting
antimicrobial resistance genes, allowing for the rapid detection of antibiotic
resistance in pathogens causing bloodstream infections. This enables
clinicians to make informed decisions regarding antibiotic selection and
stewardship, contributing to more effective patient care and AMR
surveillance efforts.
3. Respiratory Pathogen Panel: GenMark's ePlex® Respiratory Pathogen
Panel (RPP) is designed to detect a comprehensive panel of respiratory
viruses and bacteria associated with acute respiratory infections. The panel
includes common respiratory pathogens such as influenza viruses,
respiratory syncytial virus (RSV), rhinovirus, adenovirus, and bacterial
pathogens like Streptococcus pneumoniae and Haemophilus influenzae.
Rapid and accurate detection of these pathogens aids in diagnosis,
treatment, and infection control measures.
4. Genetic Testing for Inherited Disorders: GenMark's eSensor® XT-8
system is utilized for genetic testing of inherited disorders, including
carrier screening, prenatal testing, and diagnosis of genetic conditions. The
system allows for the detection of specific genetic variants associated with
conditions such as cystic fibrosis, thalassemia, and familial
hypercholesterolemia. Genetic testing with the eSensor® XT-8 system
provides valuable information for family planning, reproductive counseling,
and personalized healthcare.
5. Oncology Testing: GenMark's oncology panels enable the detection of
genetic mutations and biomarkers associated with cancer diagnosis,
prognosis, and treatment response. These panels can be used for targeted
therapy selection, monitoring of minimal residual disease, and
identification of drug resistance mutations. The ePlex® system allows for
multiplexed testing of cancer-related genes.
→Opening an Alignment
The Alignment Explorer is the tool for building and editing multiple
sequence alignments in MEGA.
Example:
1. Launch the Alignment Explorer by selecting the Align | Edit/Build
Alignment on the launch bar of the main MEGA window.
2. Select Create New Alignment and click OK. A dialog will appear asking
“Are you building a DNA or Protein sequence alignment?” Click the
button labeled “DNA”.
3. From the Alignment Explorer main menu, select Data | Open |
Retrieve sequences from File. Select the "hsp20.fas" file from the
MEGA/Examples directory.
3. When the NCBI: Nucleotide site is loaded, enter CFS as a search term
into the search box at the top of the screen. Press the Search button.
4. When the search results are displayed, check the box next to any
item(s) you wish to import into MEGA.
5. If you have checked one box: Locate the dropdown menu labeled
Display Settings (located near the top left-hand side of the page
directly under the tab headings). Change its value to FASTA and then
click Apply. The page will reload with all the search results in
FASTA format.
6. If you have checked more than one box: locate the Display Settings
dropdown (located near the top left-hand side of the page directly
under the tab headings).
7. Change the value to FASTA (Text) and click the Apply button. This
will output all the sequences you selected as a text in the FASTA
format.
8. Press the Add to Alignment button (with the red + sign) located
above the web address bar. This will import the sequences into the
Alignment Explorer.
9. With the data now displayed in the Alignment Explorer, you can close
the Web Browser window.
10. Align the new data using the steps detailed in the previous
examples.
11. Close the Alignment Explorer window by clicking Data | Exit
Aln Explorer. Select No when asked if you would like to save the
current alignment session to file.
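For reference, the FASTA (Text) output chosen in the steps above is plain text in which each record begins with a ">" header line followed by one or more lines of sequence. The following Python sketch shows one way such text could be read. The function name `parse_fasta` and the sample records are made up for illustration and are not part of MEGA or NCBI.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records, for illustration only.
sample = """>seq1 hypothetical example
ATGGCGTTA
GCTAA
>seq2 hypothetical example
ATGCCC
"""

for name, seq in parse_fasta(sample):
    print(name, len(seq))
```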
→Constructing a phylogenetic tree
Step 1:
1. Install the MEGAX software on your device.
2. Launch the software.
Step 2:
1. Click the “Align” button.
2. Select “Edit/Build Alignment” option.
3. A dialogue box appears, select “Create a new alignment” and click
“OK”.
Step 3:
1. An Alignment Explorer window opens. A dialogue box asks whether you
are building a DNA or protein sequence alignment. Select
“Protein”.
Applications of MEGAX:
RASMOL
RasMol is a molecular graphics program intended for the visualization of
proteins, nucleic acids and small molecules. The program is aimed at
display, teaching and generation of publication quality images. The program
reads in molecular co-ordinate files and interactively displays the molecule
on the screen in a variety of representations and colour schemes.
Supported input file formats include Brookhaven Protein Databank (PDB),
Tripos Associates' Alchemy and Sybyl Mol2 formats, Molecular Design
Limited's (MDL) Mol file format, Minnesota Supercomputer Centre's (MSC)
XYZ (XMol) format and CHARMm format files.
To start RasMol under Microsoft Windows, double click on the RasMol icon
in the program manager. When RasMol first starts, the program displays a
single main window (the display window) with a black background on the
screen and provides the command line window minimized as a small icon
at the bottom of the screen. The command line or terminal window may be
opened by double clicking on this RasMol icon. It is possible to specify
either a co-ordinate filename, a script filename, or both on the Windows
command line. A script file is specified by adding the option
'-script <filename>' to the command line. A molecule co-ordinate file may
be specified by placing its name on the command line, optionally preceded
by a file format option. If no format option is given, the specified
co-ordinate file is assumed to be in PDB format. Valid format options
include '-pdb', '-mdl', '-mol2', '-xyz', '-alchemy' and '-charmm', which
correspond to Brookhaven, MDL Mol file, Sybyl Mol2, MSC's xyz, Alchemy and
CHARMm formats respectively. If both a co-ordinate file and a script file
are specified on the command line, the molecule is loaded first, then the
script commands are applied to it. If either file is not found, the
program displays the error message 'Error: File not found!' and the user
is presented with the RasMol prompt.
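The command-line rules above can be summarised in a small Python helper that only assembles the argument list and does not start RasMol. It is a sketch under assumptions: the executable name "rasmol", the function name `rasmol_command` and the file names are hypothetical, and the placement of '-script' after the co-ordinate file is a choice made here, since the text does not state a required order.

```python
# Format options listed in the text above.
FORMAT_OPTIONS = ("-pdb", "-mdl", "-mol2", "-xyz", "-alchemy", "-charmm")


def rasmol_command(coord_file=None, file_format=None, script_file=None,
                   executable="rasmol"):
    """Build the argument list for starting RasMol (does not run it).

    executable is an assumed program name; adjust it for your installation.
    """
    args = [executable]
    if coord_file is not None:
        if file_format is not None:
            option = "-" + file_format.lower()
            if option not in FORMAT_OPTIONS:
                raise ValueError("unknown format option: " + option)
            args.append(option)  # the format option precedes the file name
        args.append(coord_file)  # no option means PDB format is assumed
    if script_file is not None:
        args += ["-script", script_file]
    return args


# Hypothetical file names, for illustration only.
print(" ".join(rasmol_command("molecule.pdb")))
print(" ".join(rasmol_command("ligand.mol2", "mol2", "view.scr")))
```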
→RasMol's Window
While the mouse pointer is located within the graphics area of the main
display window, the mouse pointer is drawn as a cross-hair cursor, to
enable the 'picking' of objects being displayed; otherwise the mouse
pointer is drawn as an arrowhead. Any characters that are typed at the
keyboard while the display window is in 'focus' (meaning active or
foreground) are redirected to the command line in the terminal window.
Hence you do not need to switch focus continually between the command
line and graphics windows.
→Mouse Controls
→Scroll Bars
The scroll bar across the bottom of the canvas area is used to rotate the
molecule about the y-axis, i.e. to spin the nearest point on the molecule left
or right; and the scroll bar to the right of the canvas rotates the molecule
about the x-axis, i.e. the nearest point up or down. Each scroll bar has an
'indicator' to denote the relative orientation of the molecule, which is
initially positioned in the centre of the scroll bar. These scroll bars may be
operated in either of two ways. The first is by clicking any mouse button on
the dotted scroll bar background to indicate a direct rotation relative to the
current indicator position; the second is by clicking one of the arrows at
either end of the scroll bar to rotate the molecule in fixed-size increments.
Rotating the molecule by the second method may cause the indicators on
the scroll bars to wrap around from one end of the bar to the other. A
complete revolution is indicated by the indicator travelling the length of the
scroll bar. The angle rotated by using the arrows depends upon the current
size of the display window.
→Picking
→Backbone
The reserved word backbone is also used as a predefined set ("help sets")
and as a parameter to the 'set hbond' and 'set ssbond' commands. The
RasMol command 'trace' renders a smoothed backbone, in contrast to
'backbone' which connects alpha carbons with straight lines.
The backbone may be displayed with dashed lines by use of the 'backbone
dash' command.
→Background
Syntax: background <colour>
The RasMol 'background' command is used to set the colour of the "canvas"
background. The colour may be given as either a colour name or a comma
separated triple of Red, Green and Blue (RGB) components enclosed in
square brackets. Typing the command 'help colours' will give a list of the
predefined colour names recognized by RasMol. When running under X
Windows, RasMol also recognizes colours in the X server's colour name
database.
→Cartoon
Syntax: cartoon {<number>}
→Centre
Syntax: centre {<expression>}
center {<expression>}
The RasMol 'centre' command defines the point about which the 'rotate'
command and the scroll bars rotate the current molecule. Without a
parameter the centre command resets the centre of rotation to be the
centre of gravity of the molecule. If an atom expression is specified, RasMol
rotates the molecule about the centre of gravity of the set of atoms
specified by the expression. Hence, if a single atom is specified by the
expression, that atom will remain 'stationary' during rotations.
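The effect of the centre of rotation can be illustrated numerically. The sketch below assumes that the centre of gravity is the unweighted average of the selected atoms' co-ordinates (whether RasMol weights atoms by mass is not stated here), and it uses an arbitrary sign convention for rotation about the y-axis. The function names and co-ordinates are made up for the example.

```python
import math


def centre_of_gravity(coords):
    """Unweighted mean position of a list of (x, y, z) atom co-ordinates."""
    if not coords:
        raise ValueError("at least one atom is required")
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def rotate_about_y(point, centre, degrees):
    """Rotate one (x, y, z) point about a vertical axis through centre."""
    a = math.radians(degrees)
    x, y, z = (point[k] - centre[k] for k in range(3))
    xr = x * math.cos(a) + z * math.sin(a)
    zr = -x * math.sin(a) + z * math.cos(a)
    return (xr + centre[0], y + centre[1], zr + centre[2])


# Made-up co-ordinates, for illustration only.
atoms = [(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)]
print(centre_of_gravity(atoms))

# A single selected atom is its own centre, so rotation leaves it in place.
atom = (1.5, -2.0, 0.5)
print(rotate_about_y(atom, centre_of_gravity([atom]), 37.0))
```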
→Clipboard
Syntax: clipboard
→Colour
Syntax: colour {<object>} <colour>
color {<object>} <colour>
Colour the atoms (or other objects) of the selected region. The colour may
be given as either a colour name or a comma separated triple of Red, Green
and Blue (RGB) components enclosed in square brackets. Typing the
command 'help colors' will give a list of all the predefined colour names
recognised by RasMol.
→H Bonds
Syntax: hbonds {<boolean>}
hbonds <value>
By default, the dotted lines are drawn between the accepting oxygen and
the donating nitrogen. By using the 'set hbonds' command the alpha carbon
positions of the appropriate residues may be used instead. This is
especially useful when examining proteins in backbone representation.
→Stereo
Syntax: stereo on
stereo <number>
stereo off
→Zap
Syntax: zap
→Zoom
Syntax: zoom {<boolean>}
zoom <value>
→Colour Schemes
The RasMol 'colour' command allows different objects (such as atoms,
bonds and ribbon segments) to be given a specified colour. Typically, this
colour is either a RasMol predefined colour name or an RGB triple.
RasMol also supports 'alt', 'amino', 'chain', 'charge', 'cpk',
'group', 'model', 'shapely', 'structure', 'temperature' or 'user' colour
schemes for atoms, and 'hbond type' colour scheme for hydrogen bonds
and 'electrostatic potential' colour scheme for dot surfaces. The 24
currently predefined colour names are listed below with their
corresponding RGB triplet and hexadecimal value.
Predefined colour    RGB Values     Hexadecimal
Black                [0, 0, 0]      000000
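The relationship between an RGB triple and its hexadecimal value, and the bracketed form in which a triple is given to the 'colour' or 'background' command, can be sketched in Python as follows. The function names are assumptions made for this example, and the code only formats text; it does not communicate with RasMol.

```python
def rgb_to_hex(rgb):
    """Convert an (R, G, B) triple of 0-255 integers to a 6-digit hex string."""
    if len(rgb) != 3 or any(not 0 <= c <= 255 for c in rgb):
        raise ValueError("expected three components in the range 0-255")
    return "".join("{:02X}".format(c) for c in rgb)


def colour_command(rgb, command="colour"):
    """Format a command that takes an RGB triple in square brackets."""
    return "{} [{},{},{}]".format(command, *rgb)


print(rgb_to_hex((0, 0, 0)))                        # Black, as in the table
print(colour_command((0, 0, 0), "background"))
```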
Applications of RasMol:
MOLMOL
MOLMOL is a program for displaying, analyzing, and manipulating
molecules. MOLMOL was first developed under the name COSMOS, but it
had to be renamed due to a name collision with a different program. The
program was kept as general as possible. However, some functions make it
especially useful for studying structures of macromolecules obtained by
NMR.
MOLMOL has a graphical user interface with menus, dialog boxes, and on-
line help. The display possibilities include conventional presentation, as
well as novel schematic drawings, with the option of combining different
presentations in one view of a molecule. Covalent molecular structures can
be modified by addition or removal of individual atoms and bonds, and
three-dimensional structures can be manipulated by interactive rotation
about individual bonds. Special efforts were made to allow for appropriate
display and analysis of the sets of typically 20-40 conformers that are
conventionally used to represent the result of an NMR structure
determination, using functions for superimposing sets of conformers,
calculation of root mean square distance (RMSD) values, identification of
hydrogen bonds, checking and displaying violations of NMR constraints,
and identification and listing of short distances between pairs of hydrogen
atoms.
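As an illustration of the RMSD values mentioned above, the following Python sketch computes the RMSD between two conformers given as equal-length lists of atom co-ordinates. It is a simplified example, not MOLMOL's implementation: it assumes the conformers have already been superimposed and that atoms are listed in the same order, and the function name and co-ordinates are made up.

```python
import math


def rmsd(conf_a, conf_b):
    """RMSD between two equally sized lists of (x, y, z) co-ordinates.

    No superposition is performed; the conformers are compared as given.
    """
    if len(conf_a) != len(conf_b) or not conf_a:
        raise ValueError("conformers must be non-empty and of equal length")
    total = sum((a[k] - b[k]) ** 2
                for a, b in zip(conf_a, conf_b) for k in range(3))
    return math.sqrt(total / len(conf_a))


# Made-up co-ordinates, for illustration only.
conformer_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conformer_2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(conformer_1, conformer_2))
```

For a set of 20-40 conformers, the same function could be applied to each conformer against a chosen reference or a mean structure.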
The following options are recognized by this shell script:
→Executing commands
All commands can be executed by selecting them in a pulldown menu.
Commands that need arguments will ask for them using a dialog box.
Users who prefer keyboard input can enter commands on the command
line. Parts of commands that are unique are completed automatically. It is
also possible to enter the command arguments on the command line. If none
or only some of the arguments are given, a dialog box will appear.
Some frequently used commands can also be found in the popup menu
associated with the right mouse button.
Some commands have keyboard accelerators. They can be seen in the
pulldown menu. These commands can be executed by using the
corresponding key combination anywhere in the main window.
The program executes commands that it receives on standard input. This
can be used to couple MOLMOL with other programs, e.g. by writing a
program (shell script) that generates commands and then piping the output
of this program into MOLMOL.
→Interactive Manipulation
1. Rotation
Molecules can be rotated by pressing the left mouse button in the drawing
area and then moving the mouse. The virtual trackball model is used.
2. Moving
Molecules can be moved by pressing the middle mouse button in the
drawing area and then moving the mouse.
3. Moving/Resizing Text
Text annotations can be modified by pressing the middle mouse button
inside the box that is displayed when the text is selected, and then moving
it. The box is subdivided into different regions by dashed lines. The central
region is used for moving the text; the other regions are used for resizing
the text in the corresponding direction.
4. Zooming
Zooming is done by pressing the left and middle mouse buttons at the same
time. Moving the mouse to the right and/or the top will zoom in, moving it
to the left and/or bottom will zoom out.
→Selection
There are two ways to make selections:
1. Interactive selection:
Items (like atoms or bonds) can be selected by clicking on them with the
left or middle mouse button. Doing that will normally deselect all other
items of the same class. To prevent that, either the Shift or the Ctrl key must
be pressed on the keyboard while making the selection. For text
annotations there is a difference between the function of the left and
middle mouse buttons: the left mouse button only selects texts if their
bottom-left corner is the nearest item, while the middle mouse button gives
priority to texts and selects them whenever the position is within the text.
2. Use commands:
Items can be selected by using various commands. The most convenient
way to use these commands is the selection dialog box, which can be
switched on with the Dial Select command. All these commands take
expressions that specify whether an item is selected or not.
→Properties
→Command Overview
1. Input/Output
2. Movement
3. Display
4. User Interface
5. Figures