
Structural Genomics

Shuhaila Mat-Sharani, Doris HX Quay, Chyan L Ng, and Mohd Firdaus-Raih, Universiti Kebangsaan Malaysia, Selangor, Malaysia
© 2018 Elsevier Inc. All rights reserved.

Introduction

Computational means of assigning function have been crucial for the annotation of the numerous genomes that are being
sequenced. Without a computational function annotation system, much of the genome sequence data generated would remain
isolated, voluminous and uninterpretable digital output. The primary means of assigning a function to a new open reading frame (ORF)
or coding sequence (CDS) has been sequence alignment.
The function of a known protein sequence can be transferred to its homolog once the sequence similarity is considered adequate
to infer shared function. The basis of such a function inference is that the function of a protein depends on its
three-dimensional (3D) structure. The work of Anfinsen (1972) demonstrated that the information needed for a protein to fold into
its functional 3D structure is contained within the sequence of amino acids that make up the polypeptide. The entropic (disordered)
state of the different amino acids in a newly synthesized polypeptide chain drives it to fold into the 3D structure. Chothia and Lesk
(1986) determined that sequence similarity at 30% is enough for a protein to share similar folding and retain a similar function.
This has thus been used as a cutoff point for many sequence alignments when assigning function from similarity.
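The 30% cutoff can be illustrated with a minimal sketch that computes percent identity from a pairwise alignment. The function names, the toy sequences and the choice of denominator (alignment columns without gaps) are assumptions made for illustration; alignment programs may define identity differently.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        identical += a == b
    return 100.0 * identical / compared if compared else 0.0


def passes_identity_cutoff(aligned_a: str, aligned_b: str, cutoff: float = 30.0) -> bool:
    """True when the aligned pair reaches the identity cutoff."""
    return percent_identity(aligned_a, aligned_b) >= cutoff


# Toy alignment (made-up sequences): 4 of the 5 ungapped columns are identical.
example_identity = percent_identity("ACDE-F", "ACDQGF")
```

In practice the aligned strings would come from an alignment program, and alignment length and coverage would also need to be considered before transferring a function.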
Nevertheless, there remain vast numbers of sequences that have uncharacterized function. For many genome annotations,
predicted ORFs with no characterized homologs have been assigned the label ‘hypothetical protein’. Solving the structure of a
hypothetical protein can be enough to identify any folding similarities to structures already available in the Protein Data Bank
(PDB) (Dey et al., 2018; Laskowski et al., 2018) and thus point to potentially similar functions. The PDB is the central repository
for structural coordinates of biological macromolecules that have been made publicly accessible. Visual examination and
computational analysis of a protein’s structure may yield clues regarding its function. If function characterization could be potentially
achieved by acquiring the structures of the many proteins with yet to be determined functions, then a systematic high throughput
means of acquiring such structures would ultimately accelerate such efforts.
Developments and progress in macromolecular structure determination methods, especially in X-ray crystallography, and lately in
cryo-electron microscopy, have resulted in more than 130,000 structures being deposited in the PDB. The bulk of this number
(approximately 90% in November 2017) were determined using X-ray crystallography. The availability of numerous genome
sequences and the progress made in structural biology have therefore led to large scale projects with the specific objectives of
determining the 3D structures for proteins encoded in these genomes; such an approach has been termed structural genomics. For
many of these projects, emphasis was placed on the hypothetical proteins that had no known sequence homologs with characterized
functions. One of the earliest test cases for structural genomics was the MJ0577 hypothetical protein structure of the hyperthermophile
Methanococcus jannaschii that was solved to a high resolution (Zarembinski et al., 1998). The structure of MJ0577 was found to contain
an ATP molecule, which directed the function search of the protein that was later confirmed as an ATPase (Zarembinski et al., 1998).
In this article, we review and discuss computational strategies currently in use for structural genomics, specifically for structures
solved by X-ray diffraction of protein crystals. While the protein production and crystallography remain as molecular biology and
chemistry problems, the selection of targets to be fed into a structural genomics pipeline and the analyses of the resulting structures
are computational biology problems.

Background

Defining Structural Genomics


There are currently different definitions available for the term structural genomics. Two often used definitions are: (i) an effort to
describe the 3D structure of every protein encoded in a genome via experimental approaches, computational modelling or a
combination of both; and (ii) knowledge of the structural organization of the genome, thus allowing for all the genes in a genome
to be located and characterized. This article will touch on the former, but will only proceed in depth for a subset, which is the
solution of structures via experimental means. The second will not be discussed under this topic. The definitions
overlap and all are arguably correct; however, these differences and similarities will not be discussed in depth here.

The Extent and Realities of Structural Genomics in Practice


Methods used for determining the 3D structures of proteins have improved tremendously over the past decade. The bulk of the
PDB is composed of structures solved through X-ray crystallography with the remainder consisting of coordinates acquired
through nuclear magnetic resonance (NMR) spectroscopy and cryo-electron microscopy (cryo-EM). The X-ray crystallography
technique is the most common method used in structural genomics because it can be applied to a wide range of protein sizes and
types, including membrane proteins.

Encyclopedia of Bioinformatics and Computational Biology doi:10.1016/B978-0-12-809633-8.20155-6 1



An increase in the computational capacity to process the theoretical folding of protein sequences into their 3D structures means
that the generation of 3D structures for every protein in a genome is moving closer towards reality, although such a capability is
not expected in the foreseeable future. Even for a wholly computational structural genomics approach, it may not be
necessary to compute all the CDS in a target genome because, for some proteins, an experimentally solved structure may already
exist (Baugh et al., 2013).
Similarly, it may not be necessary to experimentally determine the structure for every CDS in a genome because some structures are already
available in the PDB, either for the same organism from another work, or as a homolog from another source. The constraints for an
experimental structural genomics approach are grounded in the reality of costs and other limitations that include but are not limited
to: the ability to clone, express, purify and crystallize the protein; followed by a requirement for the crystal to diffract and the data
to be processable. Because of this attrition, a starting set of thousands of CDS may end
with solved structures that number in the hundreds or even fewer (Franklin et al., 2015).
The structural genomics process gradually reduces the initial number of target genes acquired from the genome sequence via a
systematic process up to the point of coordinate data acquisition for the encoded proteins. As such, this process can be made into a
pipeline (Fig. 1) that begins by taking in all the predicted CDS from the genome sequence and proceeding with steps that can then
be defined specifically to fit a project’s requirements and objectives. For example, for many groups, a logical first step is to identify
CDS that code for possibly insoluble proteins; in practice, this is done by identifying membrane proteins (Fig. 1). The next step
may be to filter the CDS for those that have very similar examples already available in the PDB. The cut-off for similarity can vary
depending on the interests and requirements of the project.
The filtering process can also begin with a smaller subset; for example, only CDS that have been
annotated as hypothetical proteins can be selected for input into the structural genomics pipeline. Another example would be to
target a selection that is already a focus of the research group, such as potential drug targets (Franklin et al., 2015;
Fedorov et al., 2007). Nevertheless, it must be noted that even sequences that were initially targets in a structural genomics
pipeline, but do not end up as crystallographic structures, may have varying degrees of useful functional characterization
as a result of having passed through the pipeline (Ahmad et al., 2015; Yusof et al., 2016; Hashim et al., 2013; Ramli et al., 2013).
Although we do not discuss structures solved by NMR and cryo-EM, switching the experimental method that the pipeline feeds
from protein crystallography to either NMR, cryo-EM or a combination of all three methods can also be carried out. Thus structural
genomics in the present context can be an integration of all available methods to acquire 3D structure data. Similar to the case of
filtering the ORFs for the purpose of crystallography, the pipeline can be redirected to identify targets for these other methods –
such as identification of smaller molecular weight proteins that would be amenable to structure solution by NMR.

Fig. 1 The overall structural genomics process begins with computational filtering to obtain target genes and proceeds to solved structures via
experimental means. The general filtering begins by removing genes that have transmembrane regions, are insoluble, or have more than 20% of
low complexity regions and coils. This is followed by removing predicted CDS with lengths of less than 50 or more than 700 amino acids and those that have
more than 30% sequence identity with entries in the PDB.
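The filtering criteria in the Fig. 1 caption can be expressed as a short sketch. The record fields, function names and example values below are hypothetical; the thresholds are taken from the caption, and applying the 20% limit separately to low complexity regions and coils is an assumption. The predictions themselves would come from tools such as those listed in Table 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CdsAnnotation:
    cds_id: str
    length: int                     # amino acids
    has_transmembrane: bool         # from a transmembrane predictor
    predicted_soluble: bool         # from a solubility predictor
    low_complexity_fraction: float  # 0.0 to 1.0
    coil_fraction: float            # 0.0 to 1.0
    best_pdb_identity: float        # percent; 0.0 when there is no PDB hit


def rejection_reason(cds: CdsAnnotation) -> Optional[str]:
    """Return the first filter that removes the CDS, or None if it is kept."""
    if cds.has_transmembrane:
        return "transmembrane"
    if not cds.predicted_soluble:
        return "insoluble"
    if cds.low_complexity_fraction > 0.20:
        return "low_complexity"
    if cds.coil_fraction > 0.20:
        return "coils"
    if cds.length < 50 or cds.length > 700:
        return "length"
    if cds.best_pdb_identity > 30.0:
        return "pdb_identity"
    return None


def select_targets(annotations):
    """Identifiers of the CDS that pass every filter."""
    return [c.cds_id for c in annotations if rejection_reason(c) is None]


# Made-up records for illustration only.
examples = [
    CdsAnnotation("cds_ok", 320, False, True, 0.05, 0.02, 22.0),
    CdsAnnotation("cds_tm", 320, True, True, 0.05, 0.02, 22.0),
    CdsAnnotation("cds_short", 40, False, True, 0.0, 0.0, 0.0),
    CdsAnnotation("cds_known", 320, False, True, 0.0, 0.0, 45.0),
]
targets = select_targets(examples)
```

Returning the reason for rejection, rather than a plain yes or no, makes it straightforward to report how many CDS each step removes.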

Systems and/or Applications

A pipeline to process a genome sequence for the purpose of structural genomics can be assembled using proven tools (Table 1). Due
to the potential volume of sequences that will be analysed, a local or fit for purpose installation of the programs required would be
ideal. This would thus prevent the pipelines engineered for such projects from overloading publicly available servers at the expense of
other users. Nevertheless, when such a practice is not possible, care and due consideration of other users of the service should
be exercised when scheduling input submissions. As discussed previously (Section “The Extent and Realities of Structural Genomics in
Practice”), the starting point of the pipeline would vary depending on the requirements and capacity of the project.
A filter for insoluble proteins, which would be a logical feature and starting point for many projects, would obviously be
unsuitable if the project were to actually include the solution of membrane proteins among its objectives. To
maximize the return of the project in terms of finding novel folds or characterizing undescribed functions for hypothetical proteins,
some projects would opt to further restrict the targets to sequences that are unrepresented in the PDB. This is usually done by
employing a 30% sequence identity cutoff or dipping as low as 25% identity (Nair et al., 2009).
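A sketch of such a filter is given below. It assumes that the predicted CDS were searched against PDB sequences with BLAST+ and that the results were saved in the default tabular format (`-outfmt 6`), where the first column is the query identifier and the third is the percent identity. The function names, identifiers and hit lines are made up for illustration.

```python
def best_identity_per_query(blast_tabular: str) -> dict:
    """Highest percent identity seen for each query in BLAST tabular output."""
    best = {}
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, identity = fields[0], float(fields[2])
        best[query_id] = max(best.get(query_id, 0.0), identity)
    return best


def unrepresented(query_ids, blast_tabular: str, cutoff: float = 30.0):
    """Queries with no PDB hit above the identity cutoff (no hit counts as 0%)."""
    best = best_identity_per_query(blast_tabular)
    return [q for q in query_ids if best.get(q, 0.0) <= cutoff]


# Placeholder hit lines in the 12-column tabular layout.
example_hits = "\n".join([
    "cds_001\thit_A\t62.5\t200\t75\t0\t1\t200\t5\t204\t1e-80\t250",
    "cds_002\thit_B\t27.1\t180\t120\t4\t10\t189\t3\t180\t2e-10\t55.1",
    "cds_002\thit_C\t24.0\t150\t110\t2\t5\t154\t8\t157\t1e-05\t40.2",
])
example_targets = unrepresented(["cds_001", "cds_002", "cds_003"], example_hits)
```

Lowering `cutoff` to 25.0 gives the stricter variant mentioned above. The sketch ignores alignment coverage, which a real pipeline would probably also check.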
For many projects, these two initial steps of screening for membrane and membrane associated proteins would be
adequate to proceed on to the process of elimination from the pipeline by experimental attrition. However, it is also possible to
specifically filter for a subset of sequences that contain either N-terminal or C-terminal associated membrane insertions. This is
done by identifying whether the first and last thirty residues could potentially be membrane embedded by plotting hydrophobicity
plots or predicting transmembrane helices for these sections of the sequence for every CDS in the genome (Marsden et al., 2007).
Once identified, these sequences are then cloned and expressed without the predicted transmembrane anchor. Although this does
not guarantee solubility, this does increase the initial set of targets while at the same time allowing for the structures of membrane
anchored proteins to be solved.
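The terminal screening step can be sketched with a simple hydropathy calculation. The thirty-residue terminal length follows the text; the Kyte-Doolittle scale, the 19-residue window and the 1.6 threshold are conventional choices assumed here, not values taken from the cited work, and the function names and example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(segment: str) -> float:
    """Average hydropathy of a segment, ignoring non-standard residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in segment if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


def max_window_hydropathy(segment: str, window: int = 19) -> float:
    """Highest mean hydropathy over all sliding windows in the segment."""
    if len(segment) <= window:
        return mean_hydropathy(segment)
    return max(
        mean_hydropathy(segment[i:i + window])
        for i in range(len(segment) - window + 1)
    )


def terminal_anchor_flags(sequence: str, terminal_length: int = 30,
                          window: int = 19, threshold: float = 1.6) -> dict:
    """Flag termini whose most hydrophobic window reaches the threshold."""
    n_term = sequence[:terminal_length]
    c_term = sequence[-terminal_length:]
    return {
        "n_terminal": max_window_hydropathy(n_term, window) >= threshold,
        "c_terminal": max_window_hydropathy(c_term, window) >= threshold,
    }


# Made-up sequence: hydrophobic N-terminus followed by a charged repeat.
example_sequence = "M" + "L" * 25 + "DEKR" * 15
example_flags = terminal_anchor_flags(example_sequence)
```

A dedicated transmembrane helix predictor would normally be preferred over this simple plot-style calculation; the sketch only shows how the first and last thirty residues can be examined separately.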
At this stage, the project can then proceed to the non-computational steps highlighted in Fig. 1. A computational component
can then be integrated again at the point where the structure data is available in PDB format. This second phase of the
computational analysis will include structural similarity comparisons and in-depth structural analysis. It is common practice to identify
fold similarities for a solved structure once the final structure file has been acquired. Here we will discuss aspects of computational
analysis on the structure but will exclude discussion of in depth visual scrutiny of the structure using molecular graphics.
In most cases, a fold similarity search, such as with the DALI program (Holm and Laakso, 2016), is sufficient to identify other
structural homologs. There are, however, hypothetical proteins that have no similarity to known folds or are matched to structures
of proteins with uncharacterized functions (Nadzirin and Firdaus-Raih, 2012). In such cases, the lack of fold similarity does not
allow an association to be made with any known structural features that could in turn provide clues as to possible function.
Computer programs that can carry out structural analysis independent of any homology requirements can then be used.
Table 1 Selected programs and their applications that can be integrated into a structural genomics pipeline. A listing of the seven programs or
web servers for sequence alignment, transmembrane prediction, protein solubility prediction, coils region prediction, low complexity region
prediction and a protein structure database

| Function | Program/server | Method/content | URL |
|---|---|---|---|
| Sequence alignment | BLAST | Protein sequence alignment to protein sequences in the PDB database. | https://blast.ncbi.nlm.nih.gov/Blast.cgi |
| Sequence alignment | MUSCLE | Iterative alignment of protein sequences to get conserved residue/motif. | https://www.ebi.ac.uk/Tools/msa/muscle/ |
| Transmembrane prediction | TMHMM | Predicts transmembrane helices using hidden Markov models (HMMs). | http://www.cbs.dtu.dk/services/TMHMM/ |
| Protein solubility prediction | PROSOII | Predicts solubility using a two-layered classifier in which the output of a primary Parzen window model for sequence similarity and a logistic regression classifier of amino acid k-mer composition serve as input for a second-level logistic regression classifier, exploiting subtle differences between soluble proteins from TargetDB and the PDB and notoriously insoluble proteins from TargetDB. | http://mbiljj45.bio.med.uni-muenchen.de:8888/prosoII/prosoII.seam |
| Coils regions prediction | COILS/Coiled-coil prediction | Predicts coiled-coil regions using the MTK and MTIDK matrices, both of which assign high probabilities to known coiled-coil segments. | https://embnet.vital-it.ch/software/COILS_form.html/ and https://npsa-prabi.ibcp.fr/cgi-bin/primanal_lupas.pl |
| Low complexity region prediction | SEG | Predicts low complexity regions using the SEG algorithm, which represents compositionally biased regions based only on residue composition, taking into account the improbability of appearance of such sequences. | http://mendel.imp.ac.at/METHODS/seg.server.html |
| Protein structure database | Protein Data Bank (PDB) | A crystallographic database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. | https://www.rcsb.org/pdb/ |

There are several computer programs that work by identifying similarities of 3D substructures or motifs that can bypass the
necessity to have fold similarity in order to make a functional association. Many of these methods are able to search and/or
compare directly a query arrangement (Debret et al., 2009; Nadzirin et al., 2012, 2013) against a database or take a query structure
and compare it against a database of motifs (Nadzirin et al., 2012). There are also services that allow for annotation of specific sites
such as binding sites (Konc and Janezic, 2017).
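The general idea of matching a residue arrangement without relying on fold similarity can be sketched as a comparison of inter-residue distances. This is an illustrative brute-force approach and not the algorithm of any of the cited servers; each residue is reduced to a single representative point, and all names, coordinates and the tolerance are assumptions.

```python
from itertools import product
from math import dist


def find_motif(query, structure, tolerance=1.0):
    """Find residue sets in `structure` whose pairwise distances match `query`.

    query:     list of (residue_type, (x, y, z))
    structure: list of (residue_label, residue_type, (x, y, z))
    Returns a list of tuples of residue labels, one per match.
    """
    # For each query position, collect structure residues of the same type.
    candidates = [
        [res for res in structure if res[1] == q_type] for q_type, _ in query
    ]
    matches = []
    for combo in product(*candidates):
        labels = [res[0] for res in combo]
        if len(set(labels)) < len(labels):
            continue  # the same residue cannot fill two motif positions
        pairs = [(i, j) for i in range(len(query)) for j in range(i + 1, len(query))]
        if all(
            abs(dist(query[i][1], query[j][1]) - dist(combo[i][2], combo[j][2]))
            <= tolerance
            for i, j in pairs
        ):
            matches.append(tuple(labels))
    return matches


# Made-up coordinates: the query triangle (sides 3, 4, 5) reappears translated.
example_query = [("H", (0.0, 0.0, 0.0)), ("D", (3.0, 0.0, 0.0)), ("D", (0.0, 4.0, 0.0))]
example_structure = [
    ("H10", "H", (10.0, 10.0, 10.0)),
    ("D20", "D", (13.0, 10.0, 10.0)),
    ("D30", "D", (10.0, 14.0, 10.0)),
    ("D40", "D", (50.0, 50.0, 50.0)),
]
example_matches = find_motif(example_query, example_structure, tolerance=0.5)
```

Because only distances are compared, a mirror-image arrangement would also match, and the exhaustive search scales poorly with motif size; it is meant only to convey the principle.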

Analysis and Assessment

The decision to experimentally determine a structure can be based on several factors. In situations where detailed information on
differences in atomic level interactions is required, such as for drug development applications, an experimentally solved
structure is necessary. It is therefore not uncommon for structural genomics projects to target not only CDS that have less than 30%
sequence identity to examples in the PDB but also proteins that are deemed as important where mechanistic details at the atomic
level can still be different even though the proteins are homologous. However, for applications where the investigator would like
to select residues for site directed mutagenesis as part of a larger characterization effort, then high quality predicted models may
suffice (Bhattacharya et al., 2007).
We provide an example where the structure of an arginase encoded in the genome of Glaciozyma antarctica (Firdaus-Raih et al.,
2018), a psychrophilic yeast, was predicted from its sequence and compared to the structure that was solved by molecular replacement from X-ray
diffraction data (PDBID: 5ynl). The sequence identity between the G. antarctica arginase and the reference used for molecular
replacement, a human arginase (PDBID: 4hze), is 45%. In this particular case, the decision was made to solve the structure of the
arginase despite the sequence identity being higher than the 30% cutoff point used for most of the other targets in the project. It
was deemed necessary to acquire a structure that could provide atomic level details of the function as they may be specific to
yeasts, psychrophiles or simply unique to G. antarctica.
To make a comparison of the differences that could arise from a computationally predicted model, we compared the
solved structure to one that was predicted using SWISS-MODEL (Biasini et al., 2014). As expected, the prediction was able to
generate the correct fold with discernible disagreements between the two structures only in the loop regions with the compared
structures having an RMSD value of 1.06 Å. When the computed model was compared to the template, the differences were
much lower at an RMSD of 0.31 Å. It is clear that the model is closer to the template than to the experimentally determined
structure (Fig. 2).
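For reference, the RMSD over N paired atoms is the square root of the mean squared distance between equivalent atoms. The sketch below assumes the two coordinate sets have already been superposed (for example, by a molecular graphics program) and are listed in the same atom order; it does not perform the superposition itself, and the coordinates are made up.

```python
from math import sqrt


def rmsd(coords_a, coords_b) -> float:
    """Root mean square deviation between two equally sized, paired coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return sqrt(total / len(coords_a))


# Two made-up atom pairs, each displaced by exactly 1 Å along z.
example_rmsd = rmsd(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
)
```

Because the value depends on which atoms are paired, an RMSD is best reported together with the number of atoms compared, as is done in the Fig. 2 caption.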
We then examined how much of an effect the modelling had on the side chain arrangements when the model was compared to the
template and the experimental structure. For this comparison, eleven residues that have been reported to be involved in metal
coordination and arginine binding were selected. As observed for the folding, it is clear that the predicted model is more biased to the
reference used (Fig. 3). Nevertheless, with the exception of D194 and E288, two of the putative arginine binding residues, the
arrangement of side chains for the other nine residues was generally similar and could be seen as having acceptable overlaps. For E288,
the side chains ended up in opposite directions despite the C-alpha positions generally overlapping. For D194, the general position of
the residue in the experimental structure and model was the same, but neither the side chain nor the C-alpha position overlapped.
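One simple way to quantify whether two equivalent side chains point in the same direction is the angle between their C-alpha to side chain tip vectors. This is an illustrative measure and not the method used for the comparison described here; the choice of tip atom, the function names and the coordinates are assumptions.

```python
from math import acos, degrees, sqrt


def _unit(start, end):
    """Unit vector pointing from `start` to `end`."""
    vector = [e - s for s, e in zip(start, end)]
    norm = sqrt(sum(c * c for c in vector))
    if norm == 0.0:
        raise ValueError("side chain vector has zero length")
    return [c / norm for c in vector]


def side_chain_angle(ca_a, tip_a, ca_b, tip_b) -> float:
    """Angle in degrees between two C-alpha to side chain tip vectors."""
    ua, ub = _unit(ca_a, tip_a), _unit(ca_b, tip_b)
    cosine = sum(x * y for x, y in zip(ua, ub))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return degrees(acos(cosine))


# Made-up coordinates: C-alpha atoms nearly overlap, side chains point opposite ways.
example_opposite = side_chain_angle(
    (0.0, 0.0, 0.0), (2.5, 0.0, 0.0),
    (0.3, 0.0, 0.0), (-2.2, 0.0, 0.0),
)
```

An angle near 0 degrees indicates similarly oriented side chains, while a value near 180 degrees corresponds to the opposite orientation described for E288.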
These limitations of protein structure prediction are well understood. Here, we demonstrate that even when a
homologous PDB structure is available at 45% sequence identity, an experimentally determined structure will still have valuable atomic
level details that cannot be adequately transferred via comparative modelling. Despite the fold conservation, it is the
intricate details that could be the key factor in understanding mechanistic differences between homologous proteins. This in turn affects
the target selection criteria for the structural genomics pipeline. As a result, manual intervention in the pipeline may allow for a more
refined target list than one generated wholly from pre-determined values and computed without expert curation.

Fig. 2 Comparison of G. antarctica arginase structure generated from protein crystallography against a predicted model generated by
SWISS-MODEL. The superposition of G. antarctica arginase structures between (A) predicted 3D model (dark blue) with X-ray crystal
structure (gold) (RMSD: 1.06 Å/243 Cα) and (B) predicted 3D model (dark blue) with the template used for prediction (PDBID: 4ity) (cyan)
(RMSD: 0.31 Å/300 Cα). Superposition was done using CCP4MG.

Fig. 3 Superposition of active site residues between (a) G. antarctica arginase structure solved by X-ray crystallography (gold) (PDB template
used: 4hze) against the SWISS-MODEL predicted arginase (light blue) and (b) The SWISS-MODEL predicted arginase (light blue) against the
template used for prediction (PDBID: 4ity). Active site residues are labeled based on the numbering of G. antarctica arginase crystal structure. The
strictly conserved putative metal coordinating residues in G. antarctica arginase are H112, H137, D135, D139, D243 and D245 and several
conserved putative arginine binding residues are D139, N141, S148, H152, D194 and E288.

Illustrative Examples and Case Studies

Burkholderia pseudomallei
Burkholderia pseudomallei is a pathogenic soil bacterium that causes the disease melioidosis. At the initial time of writing
(circa November 2017), there were 237 entries for B. pseudomallei in the PDB. Filtering those for structures that have less than
95% sequence identity retrieved 99 PDB entries. Due to the use of mutant structures as a means to characterize functional
mechanisms, this reduced number is most likely the actual count for unique B. pseudomallei structures in the PDB. Of the 237
structures, 51% (121 entries) were deposited by the Seattle Structural Genomics Center for Infectious Disease (SSGCID) (Myler et al.,
2009). The number submitted by the SSGCID could be reduced to 63 entries when a sequence identity filter of 95% was used.
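The effect of a 95% identity filter on a set that contains mutants can be sketched with a greedy selection of representatives. The position-wise identity used here is a simplification that treats sequences of different lengths as distinct; it is not how the PDB computes its sequence clusters, and the entry names and sequences are made up.

```python
def positional_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity by position; sequences of different length score 0."""
    if len(seq_a) != len(seq_b) or not seq_a:
        return 0.0
    return 100.0 * sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def nonredundant(entries, cutoff: float = 95.0):
    """Greedy selection: keep an entry only if it is below the cutoff to all kept entries.

    entries: list of (entry_id, sequence) in the order they should be considered.
    """
    representatives = []
    for entry_id, sequence in entries:
        if all(positional_identity(sequence, kept) < cutoff for _, kept in representatives):
            representatives.append((entry_id, sequence))
    return [entry_id for entry_id, _ in representatives]


# Made-up 40-residue sequence, a single-point mutant of it, and an unrelated sequence.
wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
mutant = wild_type[:10] + "A" + wild_type[11:]
unrelated = "G" * 40
example_unique = nonredundant(
    [("entry_wt", wild_type), ("entry_mutant", mutant), ("entry_other", unrelated)]
)
```

The single-point mutant is 97.5% identical to the wild type and is therefore dropped, which mirrors how mutant structures inflate the raw entry count.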
The SSGCID is a National Institute of Allergy and Infectious Diseases (NIAID) consortium applying structural genomics approaches
for identifying potential drug targets (Stacy et al., 2011). In the search for drugs against B. pseudomallei, the SSGCID embarked on a
combined functional and structural genomics approach to identify essential proteins and solve their structures (Baugh et al., 2013). This
effort involved experimental identification of essential genes using transposon mutagenesis in Burkholderia thailandensis, a phylogenetically
similar member of the Burkholderia genus. With 406 putative essential genes identified, an ‘ortholog rescue’ strategy was
employed for difficult targets by integrating seven other Burkholderia species into the pipeline as sources for orthologous proteins.
These efforts resulted in 31 structures, of which 25 proteins have properties of a potential antimicrobial drug target. Although
the identification of essential genes was experimentally carried out using transposon mutagenesis, the genomes of the
Burkholderia species were also compared to examples of known essential genes from UniProtKB (see “Relevant Websites section”) and the
Database of Essential Genes, DEG (see “Relevant Websites section”) (Zhang et al., 2009). The identified targets were also
searched against human proteins using BLASTP; the results were used to further filter the target list to only proteins
that did not have any human homologs.
The work that reported the discovery of Burkholderia lethal factor 1 (BLF-1) (Cruz-Migoni et al., 2011) started on a similar
footing. The project was aimed at identifying B. pseudomallei essential genes, as well as novel pathogenicity factors. Despite some
similarities to the SSGCID initiative, the drafting of the target list was carried out differently. Instead of using transposon
mutagenesis, potential essential genes were computationally identified based on their homologs in the DEG resource. An additional
dataset of genes that were annotated to encode hypothetical proteins was also generated. The essential gene homologs and
the hypothetical proteins were then cross-referenced against the proteomic profiles of B. pseudomallei and B. thailandensis reported
by Wongtrakoongate et al. (2007). B. thailandensis is non-pathogenic; therefore, proteins present in the B. pseudomallei proteome but
absent in the B. thailandensis proteome were assumed to have potential roles in pathogenesis, and these were then selected as targets
for the pipeline. The BPSL1549 protein was found to be present in the 2D gel electrophoresis data of B. pseudomallei but absent in
that of B. thailandensis thus leading to its selection for structure determination.
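The cross-referencing logic described above amounts to a few set operations. The sketch assumes that both proteomes have already been mapped to a common set of identifiers (for example, B. pseudomallei locus tags via ortholog assignment); apart from BPSL1549, the identifiers are placeholders and the function name is hypothetical.

```python
def pathogenesis_candidates(essential_homologs, hypothetical_proteins,
                            pathogen_proteome, nonpathogen_proteome):
    """Candidates detected in the pathogen proteome but not in the non-pathogen one."""
    pool = set(essential_homologs) | set(hypothetical_proteins)
    detected_only_in_pathogen = set(pathogen_proteome) - set(nonpathogen_proteome)
    return sorted(pool & detected_only_in_pathogen)


# Placeholder identifiers, except BPSL1549, which is named in the text.
example_candidates = pathogenesis_candidates(
    essential_homologs={"gene_A", "gene_B"},
    hypothetical_proteins={"BPSL1549", "gene_C"},
    pathogen_proteome={"BPSL1549", "gene_A", "gene_B", "gene_D"},
    nonpathogen_proteome={"gene_A", "gene_D"},
)
```

Here `gene_C` is excluded because it was not detected in either proteome, and `gene_D` because it was not in the candidate pool to begin with.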
For decades, conclusive evidence for the pathogenicity and virulence factors of B. pseudomallei had eluded investigators (Stone,
2007). The BPSL1549 structure, which was renamed to BLF-1, had fold similarities to the catalytic domain of Escherichia coli

cytotoxic necrotizing factor 1 (CNF-1) (Cruz-Migoni et al., 2011). The solution of the BLF-1 structure led to the identification of its
substrate, the eukaryotic translation initiation factor eIF4A, hence characterizing the role of BLF-1 in the disruption of protein synthesis in its host.

Glaciozyma antarctica
Glaciozyma antarctica is an obligate psychrophilic yeast that is able to survive at temperatures below 20 °C. From its genome
annotation, a total of 7857 genes were predicted (Firdaus-Raih et al., 2018). Target genes for structural genomics were then chosen
by filtering out 1610 genes encoding transmembrane regions, 4874 proteins predicted to be insoluble, 198 proteins computed to
have low complexity regions and 23 proteins where coils make up more than 20% of the total sequence. These filtering steps also
selected genes that have lengths of between 50 and 700 amino acids and no matches in the PDB at a sequence identity above
30%, which produced 453 target genes with 53% of this number having been annotated as hypothetical proteins.
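The attrition can be followed with simple arithmetic. The sketch assumes that the four reported removal counts were applied sequentially and do not overlap, which the text does not state explicitly; the intermediate totals are therefore derived values and not figures reported by the project.

```python
def attrition(total: int, removed_per_step):
    """Return (step, remaining) pairs after subtracting each step's count in order."""
    remaining = total
    trail = []
    for step, count in removed_per_step:
        remaining -= count
        trail.append((step, remaining))
    return trail


predicted_genes = 7857
steps = [
    ("transmembrane", 1610),
    ("insoluble", 4874),
    ("low_complexity", 198),
    ("coils", 23),
]
trail = attrition(predicted_genes, steps)

final_targets = 453
# Derived under the assumption above: CDS removed by the length and PDB identity filters.
removed_by_length_and_pdb = trail[-1][1] - final_targets
# Approximate number of targets annotated as hypothetical proteins (53% of 453).
hypothetical_targets = round(0.53 * final_targets)
# Fraction of all predicted genes that became targets.
target_fraction = final_targets / predicted_genes
```

Under these assumptions, fewer than 6% of the predicted genes remain as targets before any experimental attrition takes place.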
Several target genes from G. antarctica were further crystallized and six of the structures had been successfully solved at the time
of writing with three having been deposited in the PDB (PDBIDs: 4xxf, 5yhp, 5ynl). A resource that tracks the progress of target
CDS for structure solution is available as part of the GlacIER (Glaciozyma antarctica Integrated Exploration Resource) database (see
“Relevant Websites section”). This is another aspect in which a computational and informatics based approach can be
integrated into structural genomics. In this case, a simple MySQL database was made accessible via a web interface that allowed
members of the project as well as the public varying levels of access to the data. Users of the database can thus track the progress of
the targets as they proceed through the pipeline. Undoubtedly, more complex enterprise level management of such pipelines is in
use by larger structural genomics consortia that work on multiple target organisms; however, the GlacIER database demonstrates
that a small group in extended collaboration with several other labs can also execute a structural genomics project.
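A minimal sketch of such a tracking table is shown below. It uses SQLite from the Python standard library in place of MySQL so that it is self-contained, and the table layout, stage names and target identifiers are hypothetical and do not reflect the actual GlacIER schema.

```python
import sqlite3

# Hypothetical pipeline stages, in the order a target would pass through them.
STAGES = ("selected", "cloned", "expressed", "purified",
          "crystallized", "diffraction", "solved", "deposited")


def create_tracker(connection: sqlite3.Connection) -> None:
    """Create the target tracking table if it does not exist yet."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS target ("
        "cds_id TEXT PRIMARY KEY, stage TEXT NOT NULL, pdb_id TEXT)"
    )


def set_stage(connection, cds_id: str, stage: str, pdb_id=None) -> None:
    """Insert a target or overwrite its current stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    connection.execute(
        "INSERT OR REPLACE INTO target (cds_id, stage, pdb_id) VALUES (?, ?, ?)",
        (cds_id, stage, pdb_id),
    )


def stage_counts(connection) -> dict:
    """Number of targets currently at each stage."""
    rows = connection.execute(
        "SELECT stage, COUNT(*) FROM target GROUP BY stage"
    ).fetchall()
    return dict(rows)


# In-memory example with placeholder target identifiers.
tracker = sqlite3.connect(":memory:")
create_tracker(tracker)
set_stage(tracker, "target_001", "cloned")
set_stage(tracker, "target_002", "deposited", pdb_id="5ynl")
set_stage(tracker, "target_001", "expressed")  # target_001 advances one stage
example_counts = stage_counts(tracker)
```

A web interface with different access levels, as described for GlacIER, would sit on top of a table like this; that layer is outside the scope of the sketch.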

Discussion and Future Directions

Although the bulk of the work done in structural genomics consists of experimental work to produce proteins and generate 3D
structure coordinate data, it is clear that extensive sequence database searching needs to be carried out during the identification of
targets for the pipeline. Computational approaches are once again called upon for analysis of the 3D structures once they become
available. While most structural genomics pipelines that can be found in the literature adopt a similar structure to that presented in
Fig. 1, this is expected to evolve as the resolutions of structures acquired using cryo-EM get higher.
It is not expected for cryo-EM to supplant protein crystallography in providing the bulk of the 3D structures in the PDB; however,
future structural genomics projects may employ all three approaches in order to provide a wider coverage of the structures that can be
acquired from an available genome sequence. This would thus require the target selection process to take into account the specific
requirements of each method and no longer be highly dependent solely on protein solubility. Informatic approaches would be
required to provide decision making tools to identify targets and estimate the requirement and/or necessity of solving specific
homologous structures. The pipeline may also evolve further to include computational methods of structure determination.

Closing Remarks

The structural genomics pipeline and its specific filters reduce the workload and cost of what can be an expensive endeavour.
This allows smaller and less well funded groups outside of large structural genomics consortia to also practice structural
genomics, as discussed in the case studies.

Acknowledgement

MFR is funded by the Ministry of Science, Technology and Innovation grant 02-01-02-SF1278 and the Universiti Kebangsaan
Malaysia grant GP-K011849.

References

Ahmad, L., Hung, T.L., Mat Akhir, N.A., et al., 2015. Characterization of Burkholderia pseudomallei protein BPSL1375 validates the Putative hemolytic activity of the COG3176
N-Acyltransferase family. BMC Microbiol. 15, 270.
Anfinsen, C.B., 1972. The formation and stabilization of protein structure. Biochem. J. 128, 737–749.
Baugh, L., Gallagher, L.A., Patrapuvich, R., et al., 2013. Combining functional and structural genomics to sample the essential Burkholderia structome. PLOS ONE 8, e53851.
Bhattacharya, A., Tejero, R., Montelione, G.T., 2007. Evaluating protein structures determined by structural genomics consortia. Proteins 66, 778–795.
Biasini, M., Bienert, S., Waterhouse, A., et al., 2014. SWISS-MODEL: Modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res. 42,
W252–W258.
Chothia, C., Lesk, A.M., 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826.

Cruz-Migoni, A., Ruzheinikov, S.N., Sedelnikova, S.E., et al., 2011. Cloning, purification and crystallographic analysis of a hypothetical protein, BPSL1549, from Burkholderia
pseudomallei. Acta. Crystallogr. Sect. F Struct. Biol. Cryst. Commun. 67, 1623–1626.
Debret, G., Martel, A., Cuniasse, P., 2009. RASMOT-3D PRO: A 3D motif search webserver. Nucleic Acids Res. 37, W459–W464.
Dey, S., Ritchie, D.W., Levy, E.D., 2018. PDB-wide identification of biological assemblies from conserved quaternary structure geometry. Nat. Methods 15, 67–72.
Fedorov, O., Sundstrom, M., Marsden, B., Knapp, S., 2007. Insights for the development of specific kinase inhibitors by targeted structural genomics. Drug Discov. Today 12,
365–372.
Firdaus-Raih, M., Hashim, N.H.F., Bharudin, I., et al., 2018. The Glaciozyma antarctica genome reveals an array of systems that provide sustained responses towards
temperature variations in a persistently cold habitat. PLOS ONE 13, e0189947.
Franklin, M.C., Cheung, J., Rudolph, M.J., et al., 2015. Structural genomics for drug design against the pathogen Coxiella burnetii. Proteins 83, 2124–2136.
Hashim, N.H., Bharudin, I., Nguong, D.L., et al., 2013. Characterization of Afp1, an antifreeze protein from the psychrophilic yeast Glaciozyma antarctica PI12. Extremophiles 17,
63–73.
Holm, L., Laakso, L.M., 2016. Dali server update. Nucleic Acids Res. 44, W351–W355.
Konc, J., Janezic, D., 2017. ProBiS tools (algorithm, database, and web servers) for predicting and modeling of biologically interesting proteins. Prog. Biophys. Mol. Biol. 128,
24–32.
Laskowski, R.A., Jablonska, J., Pravda, L., Varekova, R.S., Thornton, J.M., 2018. PDBsum: Structural summaries of PDB entries. Protein Sci. 27, 129–134.
Marsden, R.L., Lewis, T.A., Orengo, C.A., 2007. Towards a comprehensive structural coverage of completed genomes: A structural genomics viewpoint. BMC Bioinform. 8, 86.
Myler, P.J., Stacy, R., Stewart, L., et al., 2009. The Seattle Structural Genomics Center for Infectious Disease (SSGCID). Infect. Disord. Drug Targets 9, 493–506.
Nadzirin, N., Firdaus-Raih, M., 2012. Proteins of unknown function in the Protein Data Bank (PDB): An inventory of true uncharacterized proteins and computational tools for
their analysis. Int. J. Mol. Sci. 13, 12761–12772.
Nadzirin, N., Gardiner, E.J., Willett, P., Artymiuk, P.J., Firdaus-Raih, M., 2012. SPRITE and ASSAM: Web servers for side chain 3D-motif searching in protein structures.
Nucleic Acids Res. 40, W380–W386.
Nadzirin, N., Willett, P., Artymiuk, P.J., Firdaus-Raih, M., 2013. IMAAAGINE: A webserver for searching hypothetical 3D amino acid side chain arrangements in the Protein Data
Bank. Nucleic Acids Res. 41, W432–W440.
Nair, R., Liu, J., Soong, T.-T., et al., 2009. Structural genomics is the largest contributor of novel structural leverage. J. Struct. Funct. Genom. 10, 181–191.
Ramli, A.N., Azhar, M.A., Shamsir, M.S., et al., 2013. Sequence and structural investigation of a novel psychrophilic alpha-amylase from Glaciozyma antarctica PI12 for cold-
adaptation analysis. J. Mol. Model. 19, 3369–3383.
Stacy, R., Begley, D.W., Phan, I., et al., 2011. Structural genomics of infectious disease drug targets: The SSGCID. Acta. Crystallogr. Sect. F Struct. Biol. Cryst. Commun. 67,
979–984.
Stone, R., 2007. Infectious disease. Racing to defuse a bacterial time bomb. Science 317, 1022–1024.
Wongtrakoongate, P., Mongkoldhumrongkul, N., Chaijan, S., Kamchonwongpaisan, S., Tungpradabkul, S., 2007. Comparative proteomic profiles and the potential markers
between Burkholderia pseudomallei and Burkholderia thailandensis. Mol. Cell. Probes 21, 81–91.
Yusof, N.A., Hashim, N.H., Beddoe, T., et al., 2016. Thermotolerance and molecular chaperone function of an SGT1-like protein from the psychrophilic yeast, Glaciozyma
antarctica. Cell Stress Chaperones 21, 707–715.
Zarembinski, T.I., Hung, L.-W., Mueller-Dieckmann, H.-J., et al., 1998. Structure-based assignment of the biochemical function of a hypothetical protein: A test case of
structural genomics. Proc. Natl. Acad. Sci. USA 95, 15189–15193.
Zhang, Y., Lin, Z.A., Pan, J.J., et al., 2009. Concurrent control study of different radiotherapy for primary nasopharyngeal carcinoma: Intensity-modulated radiotherapy versus
conventional radiotherapy. Ai Zheng 28, 1143–1148.

Relevant Websites
http://mfrlab.org/glacier/
GlacIER: Glaciozyma antarctica Integrated Exploration Resource.
http://www.uniprot.org
The UniProt knowledgebase.
http://tubic.tju.edu.cn/deg/
Tianjin University BioInformatics Centre.
