Copyright: © EMBL-EBI 2004
Databases of protein domains and functional sites have become vital resources for the prediction of protein functions. During the last decade, several signature-recognition methods have evolved to address different sequence analysis problems, resulting in rather different and, for the most part, independent databases. Diagnostically, these resources have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods. Thus, for best results, search strategies should ideally combine all of them.
InterPro ([1]) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site.
The InterPro database integrates the PROSITE ([2]), PRINTS ([3]), Pfam ([4]), ProDom ([5]), SMART ([6]), TIGRFAMs ([11]), PIR superfamily ([13]), SUPERFAMILY ([14]), Gene3D ([15]) and PANTHER ([16]) databases, and the addition of others is scheduled. InterPro data is distributed in XML format and is freely available under the InterPro Consortium copyright. The InterPro project home page is at http://www.ebi.ac.uk/interpro.
Any queries should be emailed to interhelp@ebi.ac.uk.
PROSITE patterns:
Some biologically significant amino acid patterns can be summarised in the form of regular expressions.
ScanRegExp (by Wolfgang.Fleischmann@ebi.ac.uk)
PROSITE profiles:
There are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence, so the use of techniques based on weight matrices (also known as profiles) allows the detection of such proteins or domains. A profile is a table of position-specific amino acid weights and gap costs. The profile structure used in PROSITE is similar to but slightly more general (Bucher P. et al., 1996 [7]) than the one introduced by M. Gribskov and co-workers.
pfscan from the Pftools package (by Philipp.Bucher@isrec.unil.ch).
PRINTS:
The PRINTS database houses a collection of protein family fingerprints. These are groups of motifs that together are diagnostically more powerful than single motifs by making use of the biological context inherent in a multiple-motif method. The fingerprinting method arose from the need for a reliable technique for detecting members of large, highly divergent protein super-families.
FingerPRINTScan (Scordis P. et al., 1999 [8]).
PFAM: Pfam is a database of protein domain families. Pfam contains curated multiple sequence alignments for each family and corresponding hidden Markov models (HMMs) (Eddy S.R., 1998 [9]). Profile hidden Markov models are statistical models of the primary structure consensus of a sequence family. The construction and use of Pfam is tightly tied to the HMMER software package.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
PRODOM: ProDom is a database of protein domain families obtained by automated analysis of the SWISS-PROT and TrEMBL protein sequences. It is useful for analysing the domain arrangements of complex protein families and the homology relationships in modular proteins. ProDom families are built by an automated process based on a recursive use of PSI-BLAST homology searches.
ProDomBlast3i.pl (by Emmanuel Courcelle emmanuel.courcelle@toulouse.inra.fr and Yoann Beausse beausse@toulouse.inra.fr) (it is a wrapper for the Blast package (Altschul S.F. et al., 1997 [10])).
SMART: SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually, followed by construction of the corresponding hidden Markov models (HMMs).
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
TIGRFAMs: TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family (see below), where achievable, complements classification by orthologs, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large scale genome sequencing projects.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
PIR SuperFamily: PIR SuperFamily (PIRSF) is a classification system based on evolutionary relationship of whole proteins.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
SUPERFAMILY: SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure, based on SCOP.
hmmpfam/hmmsearch from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
GENE3D: Gene3D is supplementary to the CATH database. This protein sequence database contains proteins from complete genomes which have been clustered into protein families and annotated with CATH domains, Pfam domains and functional information from KEGG, GO, COG, Affymetrix and STRINGS.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
PANTHER: The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis.
hmmsearch from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
blastall from the Blast package (Altschul S.F. et al., 1997 [10]).
Optionally, predictions for coiled-coil, signal peptide cleavage sites (SignalP v3) and TM helices (TMHMM v2) are supported (See the FAQ for details of how to set these up).
InterProScan is a tool that combines different protein signature recognition methods into one resource. The number of signature databases and their associated scanning tools, as well as the further refinement procedures, increases the complexity of the problem.
InterProScan is more than just a simple wrapping of sequence analysis applications since it also performs a considerable amount of data look-up from various databases and program outputs. The Perl-based InterProScan is intended to be an extensible and scalable system optimised to cope with bulk data processing. The need for production scale efficiency and easy extensibility requires a robust and efficient (parallel) internal architecture that can benefit from network-distributed computing with the support of UNIX queuing systems.
In the package a Perl-based simple data retrieval system is used in order to provide the required data look-up efficiency and extensibility.
There are two ways you can use InterProScan, either via the EBI website (http://www.ebi.ac.uk/InterProScan/ - note the maximum number of protein sequences you may submit is 10 and nucleotide is 1) or by downloading and installing it locally on your computer. InterProScan can run stand-alone via a web user interface (GUI), via the command-line or via SRS.
InterProScan can take either nucleotide or protein sequences in a recognised sequence format (such as raw, FASTA or EMBL). It will reformat and, if necessary, translate the sequences before beginning its search tasks. If raw format (free text) is used, each sequence will be given the name "Sequence_n" by default, where n is the order in which it appeared in the input.
Nucleotide sequences will be translated and scanned in all 6 frames without any further assumptions except the transcript length cut-off (orfminsize) and/or the codon translation table of the EMBOSS sixpack tool:
0 | (Standard) |
1 | (Standard (with alternative initiation codons)) |
2 | (Vertebrate Mitochondrial) |
3 | (Yeast Mitochondrial) |
4 | (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma) |
5 | (Invertebrate Mitochondrial) |
6 | (Ciliate Macronuclear and Dasycladacean) |
9 | (Echinoderm Mitochondrial) |
10 | (Euplotid Nuclear) |
11 | (Bacterial) |
12 | (Alternative Yeast Nuclear) |
13 | (Ascidian Mitochondrial) |
14 | (Flatworm Mitochondrial) |
15 | (Blepharisma Macronuclear) |
16 | (Chlorophycean Mitochondrial) |
21 | (Trematode Mitochondrial) |
22 | (Scenedesmus obliquus) |
23 | (Thraustochytrium Mitochondrial) |
If you wish to use more sophisticated protein sequence predictions, you can replace or modify the conf/sixpack.sh script and edit translate.cmd in the stand-alone version's conf/iprscan.conf file. Please note that any non-standard single letter amino acid codes (such as an asterisk "*", signifying a stop codon) can cause problems when running the software.
During a run, the program prepares a temporary directory (something like 'tmp/20041011/iprscan-20041011-11123456') where 20041011 is today's date and iprscan-20041011-11123456 is the session directory name. The directory name is automatically generated to be unique and consists of "iprscan-" followed by the date (YYYYMMDD), followed by the time of the day (hhmmss) and a 2-digit random number (NN).
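The naming scheme described above can be sketched in a few lines of Python. This is only an illustration of the pattern "iprscan-" + YYYYMMDD + "-" + hhmmss + NN; the function names and the `base` argument are hypothetical, and the real Perl implementation may build the name differently.

```python
import random
from datetime import datetime


def session_dir_name(now=None, rng=None):
    """Build a session name of the form iprscan-YYYYMMDD-hhmmssNN."""
    if now is None:
        now = datetime.now()
    if rng is None:
        rng = random.Random()
    nn = rng.randint(0, 99)  # 2-digit random suffix (NN)
    return f"iprscan-{now:%Y%m%d}-{now:%H%M%S}{nn:02d}"


def session_dir_path(base="tmp", now=None, rng=None):
    """Place the session directory under <base>/<YYYYMMDD>/ (assumed layout)."""
    if now is None:
        now = datetime.now()
    return f"{base}/{now:%Y%m%d}/{session_dir_name(now, rng)}"


if __name__ == "__main__":
    print(session_dir_path())
```

Passing a fixed `now` and a seeded `random.Random` makes the name reproducible, which is convenient for testing.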
When the scanning is finished, the results are written to STDOUT unless you used the -o option on the command line to specify an output file for them.
InterProScan makes results available in five formats {raw ebixml xml txt html}. A line of raw output looks like this:
NF00181542 0A5FDCE74AB7C3AD 272 HMMPIR PIRSF001424 Prephenate dehydratase 1 270 6.5e-141 T 06-Aug-2005 IPR008237 Prephenate dehydratase with ACT region Molecular Function:prephenate dehydratase activity (GO:0004664), Biological Process:L-phenylalanine biosynthesis (GO:0009094)
NF00181542 | is the id of the input sequence. |
0A5FDCE74AB7C3AD | is the crc64 (checksum) of the protein sequence (supposed to be unique). |
272 | is the length of the sequence (in AA). |
HMMPIR | is the analysis method launched. |
PIRSF001424 | is the member database entry for this match. |
Prephenate dehydratase | is the database member description for the entry. |
1 | is the start of the domain match. |
270 | is the end of the domain match. |
6.5e-141 | is the e-value of the match (reported by the member database analysis method). |
T | is the status of the match (T: true, M: marginal). |
06-Aug-2005 | is the date of the run. |
IPR008237 | is the corresponding InterPro entry (if iprlookup requested by the user). |
Prephenate dehydratase with ACT region | is the description of the InterPro entry. |
Molecular Function:prephenate dehydratase activity (GO:0004664) | is the GO (gene ontology) description for the InterPro entry. |
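As an illustration of the field order described above, here is a minimal Python sketch that splits one line of raw output into named fields. It assumes the fields are tab-separated and that the last three fields (InterPro accession, InterPro description, GO description) may be absent when iprlookup or goterms were not requested; the `RawMatch` and `parse_raw_line` names are hypothetical and not part of InterProScan.

```python
from typing import NamedTuple, Optional


class RawMatch(NamedTuple):
    seq_id: str
    crc64: str
    length: int
    method: str
    member_acc: str
    member_desc: str
    start: int
    end: int
    evalue: str          # kept as text; not every method need report a number
    status: str          # T (true) or M (marginal)
    run_date: str
    ipr_acc: Optional[str] = None
    ipr_desc: Optional[str] = None
    go_terms: Optional[str] = None


def parse_raw_line(line: str) -> RawMatch:
    """Parse one raw-format line (assumed to be tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    # Optional trailing fields: pad with None and treat empty strings as missing.
    extra = [f or None for f in fields[11:14]]
    extra += [None] * (3 - len(extra))
    return RawMatch(
        fields[0], fields[1], int(fields[2]), fields[3], fields[4], fields[5],
        int(fields[6]), int(fields[7]), fields[8], fields[9], fields[10],
        *extra,
    )
```

Keeping the e-value as a string avoids guessing how each member database method formats it; convert it with `float()` only where you know the value is numeric.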
InterProScan is available for running via the EBI web-site. Alternatively, you can download a stand-alone version to your local server and run it there.
InterProScan and the underlying applications are freely available under the GNU licence agreement from the EBI's ftp server (ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/).
Program Name | Speed in v4.2 (s) |
---|---|
HMMPfam | 68 |
HMMPanther | 12 |
HMMPIR | 21 |
blastprodom | 18 |
coils | 7 |
gene3d | 18 |
HMMSmart | 7 |
HMMTigr | 23 |
FPRINTScan | 12 |
scanregexp | 7 |
profilescan | 12 |
superfamily | 38 |
seg | 6 |
signalp | 7 |
tmhmm | 7 |
total | 263 (4m23s) |
For detailed instructions on how to install InterProScan locally, please read the Installation instructions, on the ftp site here: ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/Installing_InterProScan.txt
InterProScan and InterPro version numbers are not related in any way. The only time the version number of InterProScan changes is if there has been a change in the underlying program code.
InterPro XML files are available for download from the FTP site; they are currently updated every few months and will likely be updated more frequently in the future. Download them and put them into the data directory of your InterProScan installation to have the most up-to-date data available.
You can update the member database information whenever you want. The tarballs on the FTP site will update whenever InterPro does.
PROSITE: ftp://ftp.isrec.isb-sib.ch/sib-isrec/profiles/prosite_prerelease.prf
PRODOM: http://prodes.toulouse.inra.fr/prodom/current/html/download.php
InterPro database: ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
InterPro matches: ftp://ftp.ebi.ac.uk/pub/databases/interpro/match.xml.gz
Pfam: ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Pfam_ls.gz
PRINTS: ftp://ftp.bioinf.man.ac.uk/pub/fingerPRINTScan/database/printsXXX.pval_blos62.gz (XXX is the highest version available)
TIGRFAMs: ftp://ftp.tigr.org/pub/data/TIGRFAMs
PIR: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/interpro/new/
PANTHER: https://panther.appliedbiosystems.com/downloads/ (we provide a compressed binary library).
GENE3D: ftp://ftp.biochem.ucl.ac.uk/pub/
TMHMM (v2.0): http://www.cbs.dtu.dk/services/TMHMM/ (under commercial license : contact software@cbs.dtu.dk)
SignalP v3.0: http://www.cbs.dtu.dk/services/SignalP/ (under commercial license : contact software@cbs.dtu.dk)
SMART: http://smart.embl-heidelberg.de/
SUPERFAMILY: http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ (under free license)
FingerPRINTScan: ftp://proline.sbc.man.ac.uk/pub/fingerPRINTScan/binaries/
ScanRegExp: ftp://ftp.ebi.ac.uk/pub/software/unix/
Pfscan: http://www.isrec.isb-sib.ch/ftp-server/pftools/
hmmpfam: http://hmmer.wustl.edu/
hmmsearch: http://hmmer.wustl.edu/
hmmconvert: http://hmmer.wustl.edu/
NCBI Blast: ftp://ftp.ncbi.nlm.nih.gov/blast/
EMBOSS tools: http://emboss.sourceforge.net/ (for seqret (sequence reformatter) and sixpack (nucleic sequence translator))
Ncoils: ftp://ftp.ebi.ac.uk/pub/software/unix/coils-2.2
Seg: http://blast.wustl.edu/pub/seg/
All are also available in tarballs (tar.gz) from the FTP site (see the installation instructions)
Should you wish to know the set up and features of InterProScan in-depth, this section details the features and architecture of the program.
For more information on how InterProScan works, see the next section.
As mentioned above, the Perl-based InterProScan was designed for bulk sequence analysis. The architecture does not have any internal limitations on the number of submitted sequences and has been tested on runs with more than 100 000 sequences. The general approach is to split the original input file into smaller parts (so-called "chunks"), each containing a pre-configured number of sequences.
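The chunking idea can be sketched as follows. This is a simplified Python illustration for FASTA input, not the InterProScan code itself; the function names and the `chunk_size` parameter are hypothetical stand-ins for the pre-configured number of sequences per chunk.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def split_into_chunks(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size records each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[str] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # last, possibly smaller, chunk
        yield chunk
```

Because both functions are generators, the input is streamed and only one chunk is held in memory at a time, which is the property that matters for very large submissions.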
InterProScan is more than just a simple wrapper of protein sequence analysis applications. In addition, it performs a considerable amount of data look-up from various databases and has the ability to parse and retrieve program outputs.
Each data description module defines the data schema of the source data and its parsing rules. The corresponding Perl module provides an object-oriented interface to the underlying entry attributes. The parsing of the output results happens only once, after searching is done and all the applications have finished. The parsing of the source data into in-memory objects also happens only once and is done upon request (so-called lazy parsing). Hierarchical parsing rules are implemented using the recursive-descent approach (Parse::RecDescent package). Fast data retrieval is implemented using Perl's native B-tree indexing (DB_File.pm, based on Berkeley DB).
The simple 'one Perl module per data source' organisation makes it possible to reuse the modules in other stand-alone ad-hoc solutions. The Perl-based InterProScan is capable of providing post-processed, integrated results in several formats and could also be used as a simple retrieval system for the underlying data.
Each installation has the following directories:
The main script is the one called "iprscan" in the bin/ directory. It acts as both a command-line script (when the -cli option is used) and as the CGI script if and when a user has installed the web interface to InterProScan. The iprscan script starts jobs by calling another script (iprscan_wrapper.pl) which in turn launches and tracks jobs for each application included in the program. Results are parsed and the output created. By default, when running on the command line, the results are written to stdout, unless the -o option is used to redirect the output to a file. You can also specify -verbose mode which shows the status of the main program as PENDING, RUNNING and DONE.
If a job crashes, iprscan warns you (either on the command line or in the web interface) and produces a report file containing information about the crashed jobs. In the best case, iprscan gives you the reason why the job(s) crashed so that you can fix the problem and restart the failed jobs: run ResubmitJobs.pl on the command line or click 'Resubmit failed jobs' in the web interface. InterProScan restarts only the jobs that failed (from scratch) and concatenates their results with the previous ones.
Below is a more detailed description of the various programs included in InterProScan's architecture:
-cli | Tells the script to run in command-line mode. The same script is also used as the CGI script when the web interface is configured. |
-i | Input sequence file. This file must exist and be readable. |
-o | Output file where to write results (default stdout). |
-iprlookup | Switch on look up of corresponding InterPro annotation |
-goterms | Switch on look up of corresponding Gene Ontology annotation (requires -iprlookup option to be used too) |
-trtable num | Specifies the translation table code; the related -trlen option specifies the transcript length threshold. Both are used for nucleic acid to protein sequence translation (based on CodonTable.pm by Heikki Lehvaslaiho). |
-appl | Application to use. Check the iprscan/conf/iprscan.conf file to see which applications you have configured, or type './iprscan -cli -h'. Use multiple -appl flags to specify multiple applications. |
-nocrc | Skips the crc64 check on your protein sequence(s) before launching any application. When the check is performed and all your sequences have a known crc64 according to the match.xml file, no applications are launched and the results are displayed directly. |
Specify an email address where to send email when the run is finished. | |
-format (raw|xml|txt|ebixml|html) | Output format (default xml). |
-seqtype (n|p) | The type of the input sequences (dna/rna (n) or protein (p)). |
-verbose | Displays status of the job. |
-taxo | Activate the Taxonomy filter for abbreviated taxonomy (e.g. -taxo Arthro -taxo Bact). Possible values: Arabidopsis thaliana (AraTh), Archaea (Arch), Arthropoda (Arthro), Bacteria (Bact), Caenorhabditis elegans (CaeEl), Chordata (Chor), Cyanobacteria (Cyan), Eukaryota (Euka), Fruit Fly (FrFly), Fungi (Fung), Green Plants (GrePl), Human (Huma), Metazoa (Meta), Mouse (Mous), Nematoda (Nema), Other Eukaryotes (OthEuk), Plastid Group (PlasGrp), Rice spp (Rice), Saccharomyces cerevisiae (SacCer), Synechocystis PCC 6803 (Synec), Unclassified (Unclass) and Virus (Vir) |
-txrule <0,1> | Make decision on the taxonomy: 0 -> AND, 1 -> OR. |
-help | Displays this help and exits. |
session_dir | The path of the session directory of your job |
-f | File(s) to index. By default indexes all the required files. Type ./index_data.pl -h to get the list of the supported files. |
-inx | Index given files. |
-iforce | Force script to reindex files even if they are reported as being up to date. This feature is needed most of the time when data is being updated. It first removes the old index and then builds a new one. |
-bin | Convert ascii hmm library file to binary file. Can speed up the hmmpfam search by up to 40%. (.bin is now required by default) |
-bforce | Force the binary file conversion even if the binary file already exists. Needed most of the time during data update. |
-v | Verbose mode, prints information during indexing. |
-h | Displays this help and exits. |
-format | format you want to convert to |
-input raw_file | the original output file of results |
-jobid | the interproscan job id (required for HTML and XML formats) |
-i infile | input file name (fasta sequences) |
-o outfile | output file name (collated results) |
-c cmd | command to execute on the sequences |
-h | help |
-r | Path to the report file. (e.g. tmp/20040302/iprscan-20040302-12355481/iprscan-20040302-12335481.report) |
-h | Displays this help and exits. |
-v [0,1,2] | Verbose mode with multiple levels. 0: no info at all (like without the -v option). 1: only prints main actions. 2: prints everything the script is doing. |
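To show how the iprscan options listed above fit together, here is a small Python sketch that assembles a command line from them. It only uses flags documented in the table (-cli, -i, -o, -format, -seqtype, -appl, -iprlookup, -goterms); the helper name and its parameters are hypothetical, and the application names in the usage example are assumptions that should be checked against your own iprscan.conf.

```python
from typing import List, Optional, Sequence

FORMATS = {"raw", "xml", "txt", "ebixml", "html"}


def build_iprscan_command(
    input_file: str,
    output_file: Optional[str] = None,
    fmt: str = "xml",                 # documented default output format
    seqtype: str = "p",               # n = dna/rna, p = protein
    applications: Sequence[str] = (),
    iprlookup: bool = False,
    goterms: bool = False,
    executable: str = "./iprscan",
) -> List[str]:
    """Return an argument list suitable for subprocess.run()."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    if seqtype not in {"n", "p"}:
        raise ValueError(f"unknown sequence type: {seqtype}")
    if goterms and not iprlookup:
        raise ValueError("-goterms requires -iprlookup")
    cmd = [executable, "-cli", "-i", input_file, "-format", fmt, "-seqtype", seqtype]
    if output_file:
        cmd += ["-o", output_file]
    for appl in applications:         # one -appl flag per application
        cmd += ["-appl", appl]
    if iprlookup:
        cmd.append("-iprlookup")
    if goterms:
        cmd.append("-goterms")
    return cmd


if __name__ == "__main__":
    # Application names here are illustrative only.
    print(" ".join(build_iprscan_command(
        "seqs.fasta", "out.raw", fmt="raw",
        applications=["HMMPfam", "HMMPIR"], iprlookup=True, goterms=True)))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.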
InterProScan is supplied with configuration files in the conf directory so that you can easily set up your installation exactly as you want. Configuration files are based on 'tags' that are expanded when InterProScan reads them. What do we mean by a 'tag'?
For example, in the following lines:
workserver=http://fido.ebi.ac.uk:4000
workurl=[%workserver]/iprscan/iprscan?tool=iprscan&jobid=....
Here workserver is a key and http://fido.ebi.ac.uk:4000 is its value. In the next line, [%workserver] is a tag. When InterProScan reads a configuration file, it reads it as a file of key-value pairs. When it sees something of the form '[%.....]' on a line, it treats it as a tag and tries to expand it by checking whether it has already seen that key. If so, it replaces the tag '[%...]' with the key's value; otherwise it replaces it with nothing. Because files are read from top to bottom, a key MUST be set before any line that uses it as a tag.
In that case, after expansion, workurl will be:
http://fido.ebi.ac.uk:4000/iprscan/iprscan?tool=iprscan&jobid=....
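The top-to-bottom expansion described above can be sketched in Python as follows. This is a simplified illustration, not a port of Config.pm: it handles only plain key tags of the form [%key], and the function name is hypothetical.

```python
import re
from typing import Dict, Iterable

# Matches a tag such as [%workserver] and captures the key name.
_TAG = re.compile(r"\[%([^\]]+)\]")


def expand_config(lines: Iterable[str]) -> Dict[str, str]:
    """Read key=value lines top to bottom, expanding [%key] tags as we go."""
    values: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if "=" not in line:
            continue  # not a key-value line
        key, raw_value = line.split("=", 1)  # split on the first '=' only
        # A tag is replaced by the value seen so far, or by nothing if unknown.
        values[key.strip()] = _TAG.sub(
            lambda m: values.get(m.group(1), ""), raw_value.strip()
        )
    return values


if __name__ == "__main__":
    cfg = expand_config([
        "workserver=http://fido.ebi.ac.uk:4000",
        "workurl=[%workserver]/iprscan/iprscan?tool=iprscan",
    ])
    print(cfg["workurl"])
```

Swapping the two input lines leaves the tag unresolved (it expands to an empty string), which demonstrates why a key has to be defined before the line that uses it.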
InterProScan supports conditions and other special tags, listed below. To learn how to write them correctly, you will have to read the code of the Config.pm module (iprscan/lib/Dispatcher/Config.pm).
%env | refers to the environment variable hash table in Perl (%ENV). |
%if | lets you put a condition into your tags. |
%switch | lets you have multiple choices (same as a basic switch condition in programming). |
%random | calls the srand Perl subroutine. |
%YYYY | translates it as the current year. |
%MM | translates it as the current month of the year. |
%DD | translates it as the current day of the month. |
%hh | translates it as the current hour of the day. |
%mm | translates it as the current minutes of the hour. |
%ss | translates it as the current second of the minute. |
%hostname | translates it as the hostname of the machine. |
%pid | translates it with the process id of the program. |
%uname | translates it with the operating system name. |
With all of these features, you should be able to modify/configure InterProScan and applications as you want.
InterProScan is based on scanning methods native to the InterPro member databases. It is distributed with pre-configured method cut-offs recommended by the member database experts and which are believed to report relevant matches. All cut-offs are defined in configuration files (see 'conf' directory). Matches obtained with the fixed cut-off are subject to the following filtering. (Please also see member database web pages for more information)
The taxonomy filtering is based on InterPro entry taxonomy. InterProScan first does its usual work, searching your sequence against the InterPro member databases. When all analyses are done, InterProScan post-processes the results based on the taxonomy of each InterPro entry found for your sequence(s).
For example, your sequence (RS16_ECOLI) is analysed against Pfam and ProDom. The hits returned are Pfam: PF00886 and ProDom: PD003791, and the associated InterPro entry is IPR000307.
Taxonomy for IPR000307 is listed below.
For each hit, one of two things can happen, depending on the rule in use.
AND rule: If any of the taxonomy names selected by the user is not present in the InterPro entry, then the hit (PF00886 or PD003791) is rejected and not shown in the results.
OR rule: If at least one of the taxonomy names selected by the user is present in the InterPro entry, then the hit (PF00886 or PD003791) is kept and shown to the user.
NOTE: If you do not wish to use taxonomy and make it unavailable for users, just edit iprscan.conf and set the value 0 for the tag 'taxonomy.use'.
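The AND and OR rules can be expressed as simple set operations. The Python sketch below is an illustration only: the function name is hypothetical, the taxonomy set in the usage example is made up rather than the real taxonomy of IPR000307, and keeping every hit when no taxonomy is selected is an assumption.

```python
from typing import Iterable


def keep_hit(selected: Iterable[str], entry_taxa: Iterable[str], txrule: int) -> bool:
    """Decide whether a hit survives the taxonomy filter.

    txrule 0 -> AND: every selected taxonomy must be present in the entry.
    txrule 1 -> OR:  at least one selected taxonomy must be present.
    """
    selected_set = set(selected)
    entry_set = set(entry_taxa)
    if not selected_set:
        return True  # assumption: no selection means no filtering
    if txrule == 0:
        return selected_set <= entry_set
    if txrule == 1:
        return bool(selected_set & entry_set)
    raise ValueError("txrule must be 0 (AND) or 1 (OR)")


if __name__ == "__main__":
    entry = {"Bact", "Arch", "Euka"}  # hypothetical taxonomy for an entry
    print(keep_hit(["Bact", "Vir"], entry, 0))  # AND rule: Vir is missing
    print(keep_hit(["Bact", "Vir"], entry, 1))  # OR rule: Bact is present
```

With the -taxo and -txrule options, `selected` corresponds to the abbreviations given on the command line and `txrule` to the 0/1 switch.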
1. The InterPro Consortium (R.Apweiler, T.K.Attwood, A.Bairoch, A.Bateman, E.Birney, M.Biswas, P.Bucher, L.Cerutti, F.Corpet, M.D.R.Croning, R.Durbin, L.Falquet, W.Fleischmann, J.Gouzy, H.Hermjakob, N.Hulo, I.Jonassen, D.Kahn, A.Kanapin, Y.Karavidopoulou, R.Lopez, B.Marx, N.J.Mulder, T.M.Oinn, M.Pagni, F.Servant, C.J.A.Sigrist, E.M.Zdobnov). "The InterPro database, an integrated documentation resource for protein families, domains and functional sites." Nucleic Acids Res, 2001, 29(1): 37-40.
2. Hofmann K., Bucher P., Falquet L., and Bairoch A. "The Prosite Database, Its Status in 1999." Nucleic Acids Res, 1999, 27(1): 215-9.
3. Attwood T.K., Croning M.D., Flower D.R., Lewis A.P., Mabey J.E., Scordis P., Selley J.N., and Wright W. "Prints-S: The Database Formerly Known as Prints." Nucleic Acids Res, 2000, 28(1): 225-7.
4. Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L. "The Pfam Protein Families Database." Nucleic Acids Res, 2000, 28(1): 263-6.
5. Corpet F., Gouzy J., and Kahn D. "Recent Improvements of the Prodom Database of Protein Domain Families." Nucleic Acids Res, 1999, 27(1): 263-7.
6. Schultz J., Copley R.R., Doerks T., Ponting C.P., and Bork P. "Smart: A Web-Based Tool for the Study of Genetically Mobile Domains." Nucleic Acids Res, 2000, 28(1): 231-4.
7. Bucher P., Karplus K., Moeri N., and Hofmann K. "A Flexible Motif Search Technique Based on Generalized Profiles." Comput Chem, 1996, 20(1): 3-23.
8. Scordis P., Flower D.R., and Attwood T.K. "Fingerprintscan: Intelligent Searching of the Prints Motif Database." Bioinformatics, 1999, 15(10): 799-806.
9. Eddy S.R. "Profile Hidden Markov Models." Bioinformatics, 1998, 14(9): 755-63.
10. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. "Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs." Nucleic Acids Res, 1997, 25(17): 3389-402.
11. Haft D.H., Loftus B.J., Richardson D.L., Yang F., Eisen J.A., Paulsen I.T., and White O. "TIGRFAMs: a protein family resource for the functional identification of proteins." Nucleic Acids Res, 2001, 29(1): 41-3.
12. Eddy S.R. "HMMER: Profile hidden Markov models for biological sequence analysis." WWW, 2001. http://hmmer.wustl.edu/
13. Wu C.H., Huang H., Yeh L.-S.L., and Barker W.C. "Protein family classification and functional annotation." Computational Biology and Chemistry, 2003, 27: 37-47.
14. Gough J., Karplus K., Hughey R., and Chothia C. "Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that represent all Proteins of Known Structure." J. Mol. Biol., 2001, 313(4): 903-919.
15. Buchan D., Pearl F., Lee D., Shepherd A., Rison S., Orengo C., and Thornton J. "Gene3D: Structural assignments for whole genes and genomes using the CATH domain structure database." Genome Research, 12(3): 503-514.
16. Mi H., Lazareva-Ulitsky B., Loo R., Kejariwal A., Vandergriff J., Rabkin S., Guo N., Muruganujan A., Doremieux O., Campbell M.J., Kitano H., and Thomas P.D. "The PANTHER database of protein families, subfamilies, functions and pathways." Nucleic Acids Res, 2005, 33(Database issue): D284-D288.
Zdobnov E.M. and Apweiler R.
"InterProScan - an integration platform for the signature-recognition methods in InterPro."
Bioinformatics, 2001, 17(9): 847-8.