InterProScan README

Version 4.2

Authors


Sarah Hunter <hunter (at) ebi.ac.uk>
Emmanuel Quevillon <tuco (at) ebi.ac.uk>
Ville Silventoinen <vsi (at) ebi.ac.uk>

Acknowledgments


Florence Servant <florence.servant (at) mcgill.ca>
Evgueni Zdobnov <evgueni.zdobnov (at) embl-heidelberg.de>

Copyright: © EMBL-EBI 2004


Contents

  1. Introduction to InterPro
  2. InterPro member databases and scanning methods
  3. InterProScan
  4. Stand-alone InterProScan
  5. In-depth
  6. References
  7. How to cite

Introduction to InterPro

Databases of protein domains and functional sites have become vital resources for the prediction of protein functions. During the last decade, several signature-recognition methods have evolved to address different sequence analysis problems, resulting in rather different and, for the most part, independent databases. Diagnostically, these resources have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods. Thus, for best results, search strategies should ideally combine all of them.

InterPro ([1]) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site.

The InterPro database integrates the PROSITE ([2]), PRINTS ([3]), Pfam ([4]), ProDom ([5]), SMART ([6]), TIGRFAMs ([11]), PIR superfamily ([13]), SUPERFAMILY ([14]), Gene3D ([15]) and PANTHER ([16]) databases, and the addition of others is scheduled. InterPro data is distributed in XML format and is freely available under the InterPro Consortium copyright. The InterPro project home page is at http://www.ebi.ac.uk/interpro.

Any queries should be emailed to interhelp@ebi.ac.uk.


InterPro member databases and scanning methods

PROSITE patterns: Some biologically significant amino acid patterns can be summarised in the form of regular expressions.
ScanRegExp (by Wolfgang.Fleischmann@ebi.ac.uk)
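To illustrate how a pattern can be read as a regular expression, here is a minimal Python sketch. It is not ScanRegExp and does not reproduce its behaviour; the function names prosite_to_regex and scan_pattern are hypothetical, and only the common pattern elements are handled (x, [..], {..}, repetition counts, and the < and > anchors at the ends of an element).

```python
import re

# One pattern element, optionally followed by a repeat count: x, [ST], {P}, x(2), x(2,4)
_ELEMENT = re.compile(r"^(?P<core>[^()]+)(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        # '<' anchors the match at the N-terminus, '>' at the C-terminus.
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        m = _ELEMENT.match(element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core = m.group("core")
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # Bracketed sets such as [ST] and single letters are already valid regex.
        repeat = ""
        if m.group("lo"):
            hi = m.group("hi")
            repeat = "{" + m.group("lo") + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan_pattern(pattern, sequence):
    """Return (start, end, matched_text) for each non-overlapping hit, 1-based inclusive."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))             # N[^P][ST][^P]
print(scan_pattern("N-{P}-[ST]-{P}.", "MKNGSAANPTL"))  # [(3, 6, 'NGSA')]
```

Note that re.finditer reports non-overlapping matches only, so overlapping occurrences of a pattern would need extra handling.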

PROSITE profiles: There are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence, so the use of techniques based on weight matrices (also known as profiles) allows the detection of such proteins or domains. A profile is a table of position-specific amino acid weights and gap costs. The profile structure used in PROSITE is similar to but slightly more general (Bucher P. et al., 1996 [7]) than the one introduced by M. Gribskov and co-workers.
pfscan from the Pftools package (by Philipp.Bucher@isrec.unil.ch).
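The idea of position-specific weights can be sketched in a few lines of Python. This is a deliberately simplified illustration, not pfscan: it scores ungapped windows only, so the gap costs that a real PROSITE profile carries are left out, and the toy weights, the default penalty and the function name best_ungapped_hit are all made up for the example.

```python
def best_ungapped_hit(profile, sequence, default=-1):
    """Slide the profile along the sequence without gaps.

    profile  -- list of dicts, one per position, mapping residue -> weight
    default  -- weight used for residues a column does not list (an assumption)
    Returns (best_score, 1-based start), or None if the sequence is too short.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the weight of each residue at its own profile position.
        score = sum(col.get(res, default) for col, res in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best


# Toy three-column profile; the weights are illustrative only.
toy_profile = [{"C": 5}, {"A": 1, "G": 1}, {"H": 4}]
print(best_ungapped_hit(toy_profile, "MKCGHLL"))  # (10, 3)
```

Because each column has its own weights, the same residue can score well at one position and badly at another, which is what lets profiles detect divergent family members that a single pattern would miss.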

PRINTS: The PRINTS database houses a collection of protein family fingerprints. These are groups of motifs that together are diagnostically more powerful than single motifs by making use of the biological context inherent in a multiple-motif method. The fingerprinting method arose from the need for a reliable technique for detecting members of large, highly divergent protein super-families.
FingerPRINTScan (Scordis P. et al., 1999 [8]).

PFAM: Pfam is a database of protein domain families. Pfam contains curated multiple sequence alignments for each family and corresponding hidden Markov models (HMMs) (Eddy S.R., 1998 [9]). Profile hidden Markov models are statistical models of the primary structure consensus of a sequence family. The construction and use of Pfam is tightly tied to the HMMER software package.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

PRODOM: ProDom is a database of protein domain families obtained by automated analysis of the SWISS-PROT and TrEMBL protein sequences. It is useful for analysing the domain arrangements of complex protein families and the homology relationships in modular proteins. ProDom families are built by an automated process based on a recursive use of PSI-BLAST homology searches.
ProDomBlast3i.pl (by Emmanuel Courcelle emmanuel.courcelle@toulouse.inra.fr and Yoann Beausse beausse@toulouse.inra.fr) (it is a wrapper for the Blast package (Altschul S.F. et al., 1997 [10])).

SMART: SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually, followed by construction of the corresponding hidden Markov models (HMMs).
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

TIGRFAMs: TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family (see below), where achievable, complements classification by orthologs, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large scale genome sequencing projects.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

PIR SuperFamily: PIR SuperFamily (PIRSF) is a classification system based on evolutionary relationship of whole proteins.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

SUPERFAMILY: SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure, based on SCOP.
hmmpfam/hmmsearch from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

GENE3D: Gene3D is supplementary to the CATH database. This protein sequence database contains proteins from complete genomes which have been clustered into protein families and annotated with CATH domains, Pfam domains and functional information from KEGG, GO, COG, Affymetrix and STRINGS.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

PANTHER: The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis.
hmmsearch from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
blastall from the Blast package (Altschul S.F. et al., 1997 [10]).

Optionally, predictions for coiled-coils, signal peptide cleavage sites (SignalP v3) and TM helices (TMHMM v2) are supported (see the FAQ for details of how to set these up).


InterProScan

InterProScan is a tool that combines different protein signature recognition methods into one resource. The number of signature databases and their associated scanning tools, as well as the further refinement procedures, increases the complexity of the problem.

InterProScan is more than just a simple wrapping of sequence analysis applications since it also performs a considerable amount of data look-up from various databases and program outputs. The Perl-based InterProScan is intended to be an extensible and scalable system optimised to cope with bulk data processing. The need for production scale efficiency and easy extensibility requires a robust and efficient (parallel) internal architecture that can benefit from network-distributed computing with the support of UNIX queuing systems.

The package uses a simple Perl-based data retrieval system to provide the required data look-up efficiency and extensibility.

There are two ways to use InterProScan: via the EBI website (http://www.ebi.ac.uk/InterProScan/ - note that you may submit at most 10 protein sequences or 1 nucleotide sequence) or by downloading and installing it locally on your computer. The stand-alone version can be run via a web user interface (GUI), from the command line or via SRS.

Input

InterProScan can take either nucleotide or protein sequences in a recognised sequence format (such as raw, FASTA or EMBL). It will reformat and, if necessary, translate the sequences before beginning its search tasks. If raw format (free text) is used, each sequence will be given the name "Sequence_n" by default, where n is the order in which it appeared in the input.

Nucleotide sequences will be translated and scanned in all 6 frames. No further assumptions are made apart from the transcript length cut-off (orfminsize) and the codon translation table, both of which are passed to the EMBOSS sixpack tool. The available translation tables are:
  0   Standard
  1   Standard (with alternative initiation codons)
  2   Vertebrate Mitochondrial
  3   Yeast Mitochondrial
  4   Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
  5   Invertebrate Mitochondrial
  6   Ciliate Macronuclear and Dasycladacean
  9   Echinoderm Mitochondrial
  10  Euplotid Nuclear
  11  Bacterial
  12  Alternative Yeast Nuclear
  13  Ascidian Mitochondrial
  14  Flatworm Mitochondrial
  15  Blepharisma Macronuclear
  16  Chlorophycean Mitochondrial
  21  Trematode Mitochondrial
  22  Scenedesmus obliquus
  23  Thraustochytrium Mitochondrial
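For readers unfamiliar with six-frame translation, the following Python sketch shows the principle using the standard genetic code only. It is not sixpack and makes no claim about sixpack's exact behaviour; the function names are hypothetical, and the min_aa parameter is only loosely analogous to the orfminsize cut-off (it is assumed here to count amino acids).

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. containing N) become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return {frame: protein} for frames 1, 2, 3 (forward) and -1, -2, -3 (reverse)."""
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
    reverse = reverse_complement(dna)
    for offset in range(3):
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def candidate_orfs(dna, min_aa=50):
    """Split each frame at stop codons ('*') and keep stretches of at least min_aa residues."""
    hits = []
    for frame, protein in six_frame_translations(dna).items():
        for segment in protein.split("*"):
            if len(segment) >= min_aa:
                hits.append((frame, segment))
    return hits


print(six_frame_translations("ATGGCCTAA"))
# {1: 'MA*', 2: 'WP', 3: 'GL', -1: 'LGH', -2: '*A', -3: 'RP'}
print(candidate_orfs("ATGGCCTAA", min_aa=3))  # [(-1, 'LGH')]
```

Splitting at stop codons before scanning also keeps the "*" character out of the protein sequences that are passed on.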

If you wish to use more sophisticated protein sequence predictions, you can replace or modify the conf/sixpack.sh script and edit translate.cmd in the stand-alone version's conf/iprscan.conf file. Please note that any non-standard single-letter amino acid codes (such as an asterisk "*", signifying a stop codon) can cause problems when running the software.
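A quick pre-flight check for such characters might look like the Python sketch below. It is an illustration only and not part of InterProScan: the function name find_nonstandard is hypothetical, and treating exactly the 20 canonical residues as "standard" is an assumption (individual scanning tools may tolerate codes such as X, B or Z).

```python
# Assumption: only the 20 canonical amino acid codes count as standard.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def find_nonstandard(sequence):
    """Return (1-based position, character) for every residue outside STANDARD_AA."""
    return [
        (position, residue)
        for position, residue in enumerate(sequence.upper(), start=1)
        if residue not in STANDARD_AA
    ]


print(find_nonstandard("MKT*LLA"))  # [(4, '*')]
print(find_nonstandard("MKTLLA"))   # []
```

Reporting positions rather than silently deleting characters leaves the decision (trim, split or reject the sequence) to the user.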

Output

During a run, the program prepares a temporary directory (something like 'tmp/20041011/iprscan-20041011-11123456') where 20041011 is today's date and iprscan-20041011-11123456 is the session directory name. The directory name is automatically generated to be unique and consists of "iprscan-" followed by the date (YYYYMMDD), followed by the time of the day (hhmmss) and a 2-digit random number (NN).
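The naming scheme described above can be sketched in Python as follows. This only mirrors the layout of the name as documented here; it is not the code InterProScan uses, the function name session_dir is hypothetical, and unlike the real program it does not check that the resulting name is unique.

```python
import random
from datetime import datetime


def session_dir(now=None, rng=random):
    """Build 'tmp/YYYYMMDD/iprscan-YYYYMMDD-hhmmssNN' for the given (or current) time."""
    now = now or datetime.now()
    date = now.strftime("%Y%m%d")
    # Time of day followed by a 2-digit random number, zero-padded.
    suffix = now.strftime("%H%M%S") + "%02d" % rng.randrange(100)
    return "tmp/%s/iprscan-%s-%s" % (date, date, suffix)


# For a run started on 11 October 2004 at 11:12:34 the name begins with
# 'tmp/20041011/iprscan-20041011-111234', followed by two random digits.
print(session_dir(datetime(2004, 10, 11, 11, 12, 34)))
```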

When scanning is finished, the results are written to STDOUT unless you used the -o option on the command line to specify an output file.

InterProScan makes results available in five formats {raw ebixml xml txt html}.


Stand-alone InterProScan

InterProScan is available for running via the EBI website. Alternatively, you can download a stand-alone version to your local server and run it there.

Availability

InterProScan and the underlying applications are freely available under the GNU licence agreement from the EBI's ftp server (ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/).

System requirements