Professional Documents
Culture Documents
Tao 2016
Tao 2016
BLAST® Help
Camacho Christiam
NCBI
camacho@ncbi.nlm.nih.gov
BLAST® Help
Lee Szilagyi
NCBI
szilagyl@ncbi.nlm.nih.gov
faster download, the service is also available through the Aspera client for those users with
the Aspera browser plug-in installed (https://www.ncbi.nlm.nih.gov/public/?blast/).
This document describes the subdirectories and file contents of the BLAST FTP directory.
Technical details on how to use certain files, especially those under the db (database)
subdirectory, are also provided.
Subdirectory File content
db Preformatted BLAST database files and FASTA sequence files (under the /FASTA subdirectory)
documents Preliminary documentation (mostly from developers) and pointers to other documentation
WGS_TOOLS A tool for generating WGS project-based database alias for vdb blast and its readme
BLAST® Help
A collection of windowmasker files for various organisms/genomes, each in its own subdirectory named using their
windowmasker_files
taxonomic ids
File name Contents
cdd_delta.tar.gz Conserved domain database for use with the standalone delta-blast program.
Protein sequences from large environmental sequencing projects, e.g., Sargasso Sea, Acid Mine Drainage. Its entries are
env_nr.##.tar.gz
EXCLUDED from the nr database.
A collection of protein sequences with entries from GenPept, Swissprot, PDB, PRF, PIR and NCBI Reference Sequence
nr.##.tar.gz
(RefSeq) project.
pataa.tar.gz Protein sequences from patents as supplied by USPTO. Its entries are EXCLUDED from the nr database.
swissprot.tar.gz Protein sequences from the swiss-prot sequence database (last major update).
Protein sequences from annotated on the transcriptome shotgun assembly. Its entries are EXCLUDED from the nr
tsa_nr.tar.gz
database.
File Contents
Nucleotide sequences from large environmental sequence projects, such as Sargasso Sea and Acid Mine Drainage. Its entries
env_nt.##.tar.gz
are EXCLUDED from the nt database.
An ALIAS file for the EST database to combine est_*.*.tar.gz files into a large virtual EST database to represent entries
est.tar.gz
BLAST® Help
from the EST division of GenBank, EMBL and DDBJ. Entries are EXCLUDED from the nt database.
est_human.##.tar.gz The human subset of the EST database. Entries are EXCLUDED from the nt database.
est_mouse.##.tar.gz The mouse subset of the EST database. Entries are EXCLUDED from the nt database.
The non-human and non-mouse subset of the EST database. Entries are excluded from nt. Entries are EXCLUDED from the
est_others.##.tar.gz
nt database.
The GSS database with entries from the GSS division of GenBank, EMBL and DDBJ. Entries are EXCLUDED from the nt
gss.##.tar.gz
database.
File Contents
The HTGS database with entries from the HTG division of GenBank, EMBL and DDBJ. Entries are EXCLUDED from the
htgs.##.tar.gz
BLAST® Help
nt database.
The nucleotide sequence database contains entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences
nt.##.tar.gz from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, and environmental sequences are excluded. RefSeq genomic entries are
also excluded.
The patent nucleotide sequence database contains entries from USPTO through GenBank or from EU/Japan Patent Agencies
patnt.##.tar.gz
through EMBL/DDBJ. Entries are EXCLUDED from the nt database.
pdbnt.##.tar.gz A set of ALIAS files for the nt database, pointing to the subset of nucleotide sequences in nt that are from PDB structures.
The sts database with entries from the STS division of GenBank, EMBL and DDBJ. Entries are EXCLUDED from the nt
sts.##.tar.gz
database.
A database with earlier non-project based Transcriptome Sequence Assembly entries. Entries are EXCLUDED from the nt
tsa_nt.##.tar.gz
database.
BLAST® Help
Whole genome shotgun sequence assemblies for various organisms. It is already superseded by the project-based datasets
wgs.##.tar.gz
stored in vdb format (ftp://ftp.ncbi.nlm.nih.gov/sra/wgs/).
File Contents
16SMicrobial.tar.gz Microbial 16S RNA sequences from the RefSeq Targeted Loci project (http://1.usa.gov/1jwzhZL).
A core set of refseq genomic sequences selected by their representative nature according to Assembly
Representative_Genomes.##.tar.gz
annotation, for reduced redundancy, in multiple volumes.
BLAST® Help
Assembled human chromosomes from Reference and Genbank for various individuals, along with their
human_genomic.##.tar.gz
scaffold components.
human_genomic_transcript.tar.gz Human genomic contigs (NC) and mapped/predicted transcripts (NM/XM) for the current annotation release.
Mouse genomic contigs entries (NC) and mapped/predicted transcripts (NM/XM) for the current annotation
mouse_genomic_transcript.tar.gz
release.
other_genomic.##.tar.gz Genomic records (NC/NT/NS/NZ accessioned) for lower eukaryotes, microbes and organelles.
refseq_genomic.##.tar.gz All genomic sequences from NCBI RefSeq project, including wgs scaffolds.
refseq_rna.##.tar.gz RNA sequences from NCBI RefSeq project, also included in the nt database.
Genomic sequences for selected genes records from NCBI RefSeq project (NG accessioned), containing
refseqgene.tar.gz
additional up- and downstream sequences beyond the transcribed region.
BLAST® Help
A non-sequence database file containing taxonomic information for sequences in the preformatted databases
taxdb.tar.gz
providing common and scientific names for each entry.
• Preformatted database files remove the formatting steps through makeblastdb and
saves valuable processing time
• Taxonomic information is encoded within the preformatted databases and can be
retrieved readily through blastdbcmd (taxdb.tar.gz installation required)
• Sequences in FASTA format can be generated from the preformatted databases by
BLAST® Help
Connected to NCBI
Downloading env_nr (2 volumes) ...
Downloading env_nr.00.tar.gz... done.
BLAST® Help
The complete options of this script (obtained using specific option "--help") are shown
below.
SYNOPSIS
BLAST® Help
OPTIONS
--decompress
Downloads, decompresses the archives in the current working directory,
and deletes the downloaded archive to save disk space, while
preserving the archive checksum files (default: false). Note: Using
this option may require more computing resources than using your
system's native decompression tool(s).
--showall
Show all available preformatted BLAST databases (default: false). The
BLAST® Help
output of this option lists the database names which should be used
when requesting downloads or updates using this script.
--passive
Use passive FTP, useful when behind a firewall (default: false).
--timeout
Timeout on connection to NCBI (default: 120 seconds).
--force
Force download even if there is a archive already on local directory
(default: false).
--verbose
Increment verbosity level (default: 1). Repeat this option multiple
times to increase the verbosity level (maximum 2).
BLAST® Help
--quiet
Produce no output (default: false). Overrides the --verbose option.
--version
Prints this script's version. Overrides all other options.
DESCRIPTION
This script will download the preformatted BLAST databases requested in
the command line from the NCBI ftp site.
EXIT CODES
BLAST® Help
BUGS
Please report them to <blast-help@ncbi.nlm.nih.gov>
COPYRIGHT
See PUBLIC DOMAIN NOTICE included at the top of this script.
decompress utilities. The working database files can then be extracted out of the resulting tar
file using tar program in Unix/Linux or WinZip and StuffIt Expander on Windows and
Macintosh platforms, respectively.
Large databases are formatted in multiple one-GB volumes, which are named using the
"name.##.tar.gz" convention. To reconstitute a given multi-volume database, all volume
files with the same base name are required. A database alias file, with ".pal" extension for
protein or ".nal" extension for nucleotide, is provided to tie the volumes together so that the
database can be called using the base database name. For example, binary programs from
the blast+ package can call the nt database using the command line option of "-db nt"
(without quotes).
BLAST® Help
Certain databases are subsets of a larger parent database (marked as alias in Tables 2, 3a and
3b). For those databases, alias and mask files, rather than actual databases, are provided. The
mask file points to a specific subset of the parent database and requires the parent database
to function properly. The parent databases should be generated at the same time as the mask
file. For example, to use pdbnt preformatted database (pdbnt.##.tar.gz), nt.##.tar.gz files
with the same date stamp are needed.
For proper setup of standalone BLAST, it is recommended that the database files be stored
in a centralized directory. Details are available for Windows and Mac/Linux/Unix.
Table 4a. Contents of protein database files under the /db/FASTA directory
File Content
nr.gz 2
BLAST® Help
Footnote:
1. These sequence files are provided for test purposes; their contents are not up-to-date.
2. It is recommended that their preformatted counterparts described in Table 2 be used whenever possible.
File Content
igSeqNt.gz Nucleotide sequences for human and mouse immunoglobulin variable regions
File Content
wgs.gz The FASTA equivalent of the wgs.##.tar.gz database files, to be decommissioned soon.
Footnote:
1. These sequence files are provided for test purposes; their contents are not up-to-date.
2. It is recommended that their preformatted counterparts described in Table 2 be used whenever possible.
3. This may offer some advantage of the pdbnt.##.tar.gz alias files due to their requirement of nt.##.tar.gz.
For local BLAST searches, the recommendation is to use the preformatted version given in
BLAST® Help
the parent directory. For those without preformatted counterparts, the FASTA sequence file
need to be inflated using gzip or other comparable utilities first, and then formatted using
makeblastdb from the blast+ package. Example command lines for formatting igSeqNt and
igSeqProt are given below.
The complete EST database is not available in FASTA or in preformatted form. All three
subsets, est_human, est_mouse and est_others, are required to reconstitute the EST database.
For screening of vector contamination, the UniVec database should be used instead. They
BLAST® Help
ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/
ftp://ftp.ncbi.nlm.nih.gov/pub/genomes/ ftp://ftp.ncbi.nlm.nih.gov/pub/
factsheets/HowTo_Downloading_Genomic_Data.pdf
Database update
BLAST® Help
In general, BLAST databases are updated daily. There is no established incremental update
scheme at this time, and it is recommended that databases be downloaded at regular intervals
to keep the content of local copy current. The update_blastdb.pl script can help streamline
this download process. If the original database.##.tar.gz files are kept, this utility can
automatically check the time stamps to determine if update is required or not.
For faster download, installation of the Aspera plug-in is recommended. This client can be
downloaded from the Aspera Soft site under the download tab. NCBI bookshelf also
provides a detailed guide on downloading data from NCBI via Aspera. A web interface for
the NCBI FTP site is at: https://www.ncbi.nlm.nih.gov/public/.
File/Dir Name Content
benchmark Package with sample database and query for gauging the performance of BLAST releases on different platforms
blast_demo.tar.gz blast_demo package on blast db, blast object, and reformating blast alignment from blastobj file
bmc Sequences used in the generation of BLAST search data for the blast+ paper published in BMC
igblast Directory with igblast related set of scripts, see its README for more details
Since NCBI is migrating to C++ code base provided by the C++ toolkit, some of the files
from this directory could become obsolete and may change without notice. Most of the
functions demonstrated here are incorporated in the blast_formatter utility distributed in the
blast+ package.
File/Dir Name Content
blast-sc2004.pdf A poster describing the NCBI splitd system behind NCBI BLAST service
web_blast.pl Example Perl script for doing blast searches through BLAST URLAPI
BLAST® Help
File Contents
All non-source code archives or packages are equivalent. They contain a standard collection
of standalone command line blast programs and accessory utilities for different platforms.
Installation of the package enables local BLAST searches, custom database preparation from
FASTA sequences, as well as sequence retrieval from existing databases formatted with the
"-parse_seqids" setting.
Details on individual programs from the package as well as installation procedures are
available for Windows and Mac/Linux/Unix.
BLAST® Help
Note that blast+ does not provide a separate netblast equivalent package since the client-
server functions are built into individual blast programs, i.e. blastn , blastp , blastx , tblastn
and tblastx , which can be invoked using the "-remote" option on the command line to
remotely search against databases at NCBI. In addition, blast+ does not provide wwwblast
equivalent packages.
This subdirectory archives the releases of legacy blast package based on the NCBI C
Toolkit. Each release is under its own directory with the version number as its name.
Available builds start at version 2.0.7. The "/LATEST" is a symbolic link mapping to the
newest release.
Packages with version number 2.2.10 or newer are packaged with a built-in directory
structure to better organize the distributed contents. Package content and brief installation
procedures are available for Windows and Mac/Linux/Unix.
is contained within a separate directory with the upload date as its name. The files available
vary from snapshot to snapshot depending on need. For example, the 2010-11-15 snapshot is
a complete package for blast+, while 2010-09-13 is for a single program.
File Contents
A handout describing the general usage of these vdb blast programs is available at:
ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Local_SRA_BLAST.pdf
BLAST® Help
Getting Help
For details, please refer to the BLAST Help Manual and other documents under the Help tab
of the BLAST homepage or the document directory under the BLAST FTP site.
Comments, questions and bug reports specifically relating to the BLAST programs and their
usage should be sent to blast-help@ncbi.nlm.nih.gov.
BLAST® Help