InterProScan README

Version 4.2

Authors

Sarah Hunter <hunter (at) ebi.ac.uk>
Emmanuel Quevillon <tuco (at) ebi.ac.uk>
Ville Silventoinen <vsi (at) ebi.ac.uk>

Acknowledgments

Florence Servant <florence.servant (at) mcgill.ca>
Evgueni Zdobnov <evgueni.zdobnov (at) embil-heidelberg.de>

Introduction to InterPro
InterPro member databases and scanning methods
InterProScan
- Input
- Output
Stand-alone InterProScan
- Availability
- System Requirements
- Installation and Update
- DATA and applications distributed with InterProScan
In-depth
- Features
- Architecture review
- Implementation details
- Configuration files
- Results filtering / Match status
- Programs
References
How to cite

Introduction to InterPro

Databases of protein domains and functional sites have become vital resources for the prediction of protein functions. During the last decade, several signature-recognition methods have evolved to address different sequence analysis problems, resulting in rather different and, for the most part, independent databases. Diagnostically, these resources have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods. Thus, for best results, search strategies should ideally combine all of them.

InterPro ([1]) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site.

The InterPro database integrates the PROSITE ([2]), PRINTS ([3]), Pfam ([4]), ProDom ([5]), SMART ([6]), TIGRFAMs ([11]), PIR superfamily ([13]), SUPERFAMILY ([14]), Gene3D ([15]) and PANTHER ([16]) databases, and the addition of others is scheduled. InterPro data is distributed in XML format and is freely available under the InterPro Consortium copyright. The InterPro project home page is at http://www.ebi.ac.uk/interpro.

Any queries should be emailed to interhelp@ebi.ac.uk.

InterPro member databases and scanning methods

PROSITE patterns: Some biologically significant amino acid patterns can be summarised in the form of regular expressions.
ScanRegExp (by Wolfgang.Fleischmann@ebi.ac.uk)

PROSITE profiles: There are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence, so the use of techniques based on weight matrices (also known as profiles) allows the detection of such proteins or domains. A profile is a table of position-specific amino acid weights and gap costs. The profile structure used in PROSITE is similar to but slightly more general (Bucher P. et al., 1996 [7]) than the one introduced by M. Gribskov and co-workers.
pfscan from the Pftools package (by Philipp.Bucher@isrec.unil.ch).

PRINTS: The PRINTS database houses a collection of protein family fingerprints. These are groups of motifs that together are diagnostically more powerful than single motifs by making use of the biological context inherent in a multiple-motif method. The fingerprinting method arose from the need for a reliable technique for detecting members of large, highly divergent protein super-families.
FingerPRINTScan (Scordis P. et al., 1999 [8]).

PFAM: Pfam is a database of protein domain families. Pfam contains curated multiple sequence alignments for each family and corresponding hidden Markov models (HMMs) (Eddy S.R., 1998 [9]). Profile hidden Markov models are statistical models of the primary structure consensus of a sequence family. The construction and use of Pfam is tightly tied to the HMMER software package.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

PRODOM: ProDom is a database of protein domain families obtained by automated analysis of the SWISS-PROT and TrEMBL protein sequences. It is useful for analysing the domain arrangements of complex protein families and the homology relationships in modular proteins. ProDom families are built by an automated process based on a recursive use of PSI-BLAST homology searches.
ProDomBlast3i.pl (by Emmanuel Courcelle emmanuel.courcelle@toulouse.inra.fr and Yoann Beausse beausse@toulouse.inra.fr) (it is a wrapper for the Blast package (Altschul S.F. et al., 1997 [10])).

SMART: SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually, and the corresponding hidden Markov models (HMMs) are then constructed from them.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

TIGRFAMs: TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family (see below), where achievable, complements classification by orthologs, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large scale genome sequencing projects.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

PIR SuperFamily: PIR SuperFamily (PIRSF) is a classification system based on evolutionary relationship of whole proteins.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

SUPERFAMILY: SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure, based on SCOP.
hmmpfam/hmmsearch from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

GENE3D: Gene3D is supplementary to the CATH database. This protein sequence database contains proteins from complete genomes which have been clustered into protein families and annotated with CATH domains, Pfam domains and functional information from KEGG, GO, COG, Affymetrix and STRINGS.
hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

PANTHER: The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis.
hmmsearch from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
blastall from the Blast package (Altschul S.F. et al., 1997 [10]).

Optionally, predictions for coiled-coil, signal peptide cleavage sites (SignalP v3) and TM helices (TMHMM v2) are supported (See the FAQ for details of how to set these up).

InterProScan

InterProScan is a tool that combines different protein signature recognition methods into one resource. The number of signature databases and their associated scanning tools, as well as the further refinement procedures, increases the complexity of the problem.

InterProScan is more than just a simple wrapping of sequence analysis applications since it also performs a considerable amount of data look-up from various databases and program outputs. The Perl-based InterProScan is intended to be an extensible and scalable system optimised to cope with bulk data processing. The need for production scale efficiency and easy extensibility requires a robust and efficient (parallel) internal architecture that can benefit from network-distributed computing with the support of UNIX queuing systems.

In the package a Perl-based simple data retrieval system is used in order to provide the required data look-up efficiency and extensibility.

There are two ways you can use InterProScan, either via the EBI website (http://www.ebi.ac.uk/InterProScan/ - note the maximum number of protein sequences you may submit is 10 and nucleotide is 1) or by downloading and installing it locally on your computer. InterProScan can run stand-alone via a web user interface (GUI), via the command-line or via SRS.

Input

InterProScan can take either nucleotide or protein sequences in a recognised sequence format (such as raw, FASTA or EMBL). It will reformat and, if necessary, translate the sequences before beginning its search tasks. If raw format (free text) is used, it will be given the name "Sequence_n" by default, where n is the order in which it appeared in the input.

Nucleotide sequences will be translated and scanned in all 6 frames without any further assumptions except the transcript length cut-off (orfminsize) and/or the codon translation table of the EMBOSS sixpack tool:

0 (Standard)

1 (Standard (with alternative initiation codons))

2 (Vertebrate Mitochondrial)

3 (Yeast Mitochondrial)

4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)

5 (Invertebrate Mitochondrial)

6 (Ciliate Macronuclear and Dasycladacean)

9 (Echinoderm Mitochondrial)

10 (Euplotid Nuclear)

11 (Bacterial)

12 (Alternative Yeast Nuclear)

13 (Ascidian Mitochondrial)

14 (Flatworm Mitochondrial)

15 (Blepharisma Macronuclear)

16 (Chlorophycean Mitochondrial)

21 (Trematode Mitochondrial)

22 (Scenedesmus obliquus)

23 (Thraustochytrium Mitochondrial)
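
As a rough illustration of what six-frame translation with a minimum ORF size means, here is a minimal Python sketch. It is not the EMBOSS sixpack implementation: it assumes the Standard genetic code only, and the function names (translate, six_frames, orfs) are hypothetical. The orfminsize argument simply mirrors the cut-off name mentioned above.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return {frame: protein} for frames 1, 2, 3 and -1, -2, -3."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def orfs(dna, orfminsize=1):
    """Split each frame at stop codons ('*') and keep pieces of at
    least orfminsize residues (assumed meaning of the cut-off)."""
    found = []
    for frame, protein in sorted(six_frames(dna).items()):
        for piece in protein.split("*"):
            if len(piece) >= orfminsize:
                found.append((frame, piece))
    return found
```

Splitting at the stop codon also keeps the asterisk character out of the protein sequences that are passed on for scanning.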

If you wish to use more sophisticated protein sequence predictions, you can replace or modify the conf/sixpack.sh script and edit translate.cmd in the stand-alone version's conf/iprscan.conf file. Please note that any non-standard single letter amino acid codes (such as an asterisk "*", signifying a stop codon) can cause problems when running the software.

Output

During a run, the program prepares a temporary directory (something like 'tmp/20041011/iprscan-20041011-11123456') where 20041011 is today's date and iprscan-20041011-11123456 is the session directory name. The directory name is automatically generated to be unique and consists of "iprscan-" followed by the date (YYYYMMDD), followed by the time of the day (hhmmss) and a 2-digit random number (NN).
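
The naming scheme described above can be sketched in a few lines of Python. This is only an illustration of the pattern (date directory, then "iprscan-" plus date, time and a 2-digit random number); the function name session_dir and its arguments are hypothetical, and the real program is written in Perl.

```python
import random
from datetime import datetime


def session_dir(now=None, rng=random):
    """Build a session path such as
    'tmp/20041011/iprscan-20041011-11123456'."""
    now = now or datetime.now()
    day = now.strftime("%Y%m%d")        # YYYYMMDD
    clock = now.strftime("%H%M%S")      # hhmmss
    suffix = rng.randrange(100)         # 2-digit random number (NN)
    name = "iprscan-{}-{}{:02d}".format(day, clock, suffix)
    return "tmp/{}/{}".format(day, name)
```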

When the scanning is finished, the results are written to STDOUT unless you used the -o option on the command line to specify an output file.

InterProScan makes results available in five formats {raw ebixml xml txt html}:

raw format

is a basic tab-delimited format, useful for uploading the data into a relational database or for concatenating different runs.
Each match is all on one line.
Example here (with descriptions; the backslashes mark line wraps added for display):

NF00181542      0A5FDCE74AB7C3AD        272     HMMPIR  PIRSF001424     Prephenate dehydratase  1       270     6.5e-141        T       06-Aug-2005\
        IPR008237       Prephenate dehydratase with ACT region  Molecular Function:prephenate dehydratase activity (GO:0004664), Biological Process\
        :L-phenylalanine biosynthesis (GO:0009094)

Key:

NF00181542	is the id of the input sequence.
0A5FDCE74AB7C3AD	is the crc64 (checksum) of the protein sequence (supposed to be unique).
272	is the length of the sequence (in AA).
HMMPIR	is the analysis method launched.
PIRSF001424	is the member database entry for this match.
Prephenate dehydratase	is the member database description for the entry.
1	is the start of the domain match.
270	is the end of the domain match.
6.5e-141	is the evalue of the match (reported by the member database analysis method).
T	is the status of the match (T: true, M: marginal).
06-Aug-2005	is the date of the run.
IPR008237	is the corresponding InterPro entry (if iprlookup requested by the user).
Prephenate dehydratase with ACT region	is the description of the InterPro entry.
Molecular Function:prephenate dehydratase activity (GO:0004664)	is the GO (gene ontology) description for the InterPro entry.
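
Because the raw format is one tab-delimited line per match, it is easy to load into other tools. The Python sketch below splits a raw line into named fields following the key above. The field names are our own labels, not names used by InterProScan, and the last three columns are treated as optional on the assumption that they are only present when iprlookup/goterms were requested.

```python
# Column labels are hypothetical; the order follows the key above.
RAW_FIELDS = [
    "sequence_id", "crc64", "length", "method",
    "signature_id", "signature_desc", "start", "end",
    "evalue", "status", "date",
    "interpro_id", "interpro_desc", "go_terms",
]


def parse_raw_line(line):
    """Turn one raw-format line into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(RAW_FIELDS, values))  # shorter lines give fewer keys
    for key in ("length", "start", "end"):
        record[key] = int(record[key])
    # evalue is kept as text, since not every method need report a number
    return record
```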

xml format
- is a self descriptive computer readable format compatible with the distribution XML format of InterProMatches.
ebixml format
- is the xml format with an EBI header describing applications' databases and methods.
txt format
- is a condensed plain text representation of the results.
html format
- conforms to the html3 standard viewable by Internet Browsers.
- provides a graphical representation of the identified matches and a 'Table View' where hits are reported without any cartoons but with the evalue, range and status of the matches (you can also get this info by mousing over the cartoon)
- hyperlinks to the corresponding InterPro entries, the signature entries of the InterPro member databases, the scanned protein sequences, the original output of the underlying applications and provides links to the application's home pages.
- shows the InterPro entries and descriptions, InterPro hierarchy (Parents, children, contain and found in) and the GO terms annotation.

Stand-alone InterProScan

InterProScan is available for running via the EBI web-site. Alternatively, you can download a stand-alone version to your local server and run it there.

Availability

InterProScan and the underlying applications are freely available under the GNU licence agreement from the EBI's ftp server (ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/).

System requirements

The InterProScan package has been developed in Perl5 under UNIX. A list of the Perl modules required to run InterProScan is in the Installation notes on the FTP site.
Binaries of the signature recognition methods are provided for the following UNIX platforms: SGI IRIX64, Mac Darwin 10.2, Linux PC, DEC Alpha, Solaris/Sparc and AIX6.5
The installation step assumes that you are able to execute such commands as 'ls', 'pwd', 'rsh', 'uname'.
The full installation (with binaries & data for all platforms) takes about 9 Gb of disk space.
For distributed computing:
- You must be able to rsh to the hosts being used
- The installation should be on a shared filesystem (e.g. over NFS) accessible from all hosts that you are going to use.
Queueing systems currently supported: LSF 4.2, Sun GridEngine 6, PBS

Benchmarking information: Crude benchmarking was done for InterProScan running on P50750|CDK9_HUMAN (327aa). This will be repeated for each release so that users know what to expect as far as performance goes. Machine specifications: HP Compaq with 2x Pentium 4 CPUs (3.2GHz); 512Mb RAM. InterProScan was run on 1 CPU; each program was run separately.

Program Name	Speed in v4.2 (s)
HMMPfam	68
HMMPanther	12
HMMPIR	21
blastprodom	18
coils	7
gene3d	18
HMMSmart	7
HMMTigr	23
FPRINTScan	12
scanregexp	7
profilescan	12
superfamily	38
seg	6
signalp	7
tmhmm	7
--------------	-----------------
total	4m23s

Installation and Update

For detailed instructions on how to install InterProScan locally, please read the Installation instructions, on the ftp site here: ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/Installing_InterProScan.txt

InterProScan and InterPro version numbers are not related in any way. The only time the version number of InterProScan changes is if there has been a change in the underlying program code.

There are InterPro xml files available for download from the FTP site which are currently updated every few months or so and will likely update more frequently in the future. Download them and put them into the data directory of your InterProScan installation and you will have the most up-to-date data available.

You can update the member database information whenever you want. The tarballs on the FTP site will update whenever InterPro does.

DATA and Applications distributed with InterProScan

InterPro protein signature databases.

PROSITE: ftp://ftp.isrec.isb-sib.ch/sib-isrec/profiles/prosite_prerelease.prf
PRODOM: http://prodes.toulouse.inra.fr/prodom/current/html/download.php
InterPro database: ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz
InterPro matches: ftp://ftp.ebi.ac.uk/pub/databases/interpro/match.xml.gz
Pfam: ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Pfam_ls.gz
PRINTS: ftp://ftp.bioinf.man.ac.uk/pub/fingerPRINTScan/database/printsXXX.pval_blos62.gz (XXX is the highest version available)
TIGRFAMs: ftp://ftp.tigr.org/pub/data/TIGRFAMs
PIR: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/interpro/new/
PANTHER: https://panther.appliedbiosystems.com/downloads/ (we provide a compressed binary library).
GENE3D: ftp://ftp.biochem.ucl.ac.uk/pub/
TMHMM (v2.0): http://www.cbs.dtu.dk/services/TMHMM/ (under commercial license : contact software@cbs.dtu.dk)
SignalP v3.0: http://www.cbs.dtu.dk/services/SignalP/ (under commercial license : contact software@cbs.dtu.dk)
SMART: http://smart.embl-heidelberg.de/
SUPERFAMILY: http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ (under free license)

Scanning applications.

FingerPRINTScan: ftp://proline.sbc.man.ac.uk/pub/fingerPRINTScan/binaries/
ScanRegExp: ftp://ftp.ebi.ac.uk/pub/software/unix/
Pfscan: http://www.isrec.isb-sib.ch/ftp-server/pftools/
hmmpfam: http://hmmer.wustl.edu/
hmmsearch: http://hmmer.wustl.edu/
hmmconvert: http://hmmer.wustl.edu/
NCBI Blast: ftp://ftp.ncbi.nlm.nih.gov/blast/
EMBOSS tools: http://emboss.sourceforge.net/ (for seqret (sequence reformatter) and sixpack (nucleic sequence translator))
Ncoils: ftp://ftp.ebi.ac.uk/pub/software/unix/coils-2.2
Seg: http://blast.wustl.edu/pub/seg/

All are also available in tarballs (tar.gz) from the FTP site (see the installation instructions)

In-depth

Should you wish to know the set up and features of InterProScan in-depth, this section details the features and architecture of the program.

Package Features

The most important feature of InterProScan is that it is pure Perl: No dependencies are needed to use it. If jobs fail, reports are created detailing the failure and a resubmission script is automatically written which is then able to complete your failed jobs when the problem(s) is/are solved. (N.B. v4.x no longer uses gmake when launching and processing jobs).
Another important feature of InterProScan is the possibility for distributed execution of individual jobs. The integrated applications are executed using Unix rsh on the configured network hosts. The job can either be directly executed on a remote host or can be submitted from the host to a Unix queuing system like LSF, PBS or SGE which can redirect it further.
As a wrapper, InterProScan has a modular structure with a simple "one Perl module per database" organisation. This structure is based on Perl modules used at EBI to dispatch the jobs on the EBI network. This Perl library can be used for other projects needing a dispatcher for jobs.
Each of the Perl modules provides an object-oriented interface to the underlying database entry attributes. The parsing of the output results file happens only once and is done upon request, implementing so-called lazy parsing.
Parsing routines are implemented in a classic way. For each application, InterProScan reads the output results, stores the info into a hash table and returns it to the main program which then writes the raw file.
To speed up the required data look-up, InterProScan indexes the corresponding databases. Fast data retrieval is implemented, based on Perl native B-trees indexing (DB_File.pm by Paul Marquess, based on BerkeleyDB).
The InterProScan package includes optional support for a Web user interface with a script for basic retrieval of local data and check of indexes.
You can submit a nucleic acid sequence that will be translated in all 6 frames and piped into the analysis programs. InterProScan is designed to use the seqret and sixpack binaries from the EMBOSS (http://emboss.sourceforge.net) package to translate and reformat input sequences, but you can use your own translator/reformatter. See the FAQ for instructions.
The InterProScan package implements additional filtering of the results based on specific cut-offs and other post-processing steps.

For more information on how InterProScan works, see the next section.

Architecture review

As mentioned above, the Perl-based InterProScan was designed for bulk sequence analysis. The architecture does not have any internal limitations on the number of submitted sequences and has been tested on runs with more than 100 000 sequences. The general approach is to split the original input file into smaller parts with a pre-configured number of sequences in each (a so called "chunk").
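
The chunking idea can be sketched as follows. This is an illustrative Python outline rather than the Perl code shipped with InterProScan: the function names are hypothetical, the chunk names only imitate the 'chunk_NN' directories, and nothing is written to disk.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_into_chunks(records, chunk_size):
    """Group records into numbered chunks of at most chunk_size sequences."""
    chunks = {}
    for index, record in enumerate(records):
        name = "chunk_{}".format(index // chunk_size + 1)
        chunks.setdefault(name, []).append(record)
    return chunks
```

Each chunk can then be processed as an independent job, which is what makes distribution over several hosts or a queueing system straightforward.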

InterProScan is more than just a simple wrapper of protein sequence analysis applications. In addition, it performs a considerable amount of data look-up from various databases and has the ability to parse and retrieve program outputs.

Each data description module defines the data schema of the source data and its parsing rules. The corresponding Perl module provides an object-oriented interface to the underlying entry attributes. The parsing of the output results happens only once, after searching is done and all the applications have finished. The parsing of the source data into the memory objects happens only once and is done upon request, implementing so-called lazy parsing. Hierarchical parsing rules are implemented using the recursive-descent approach (Parse-RecDescent package). Fast data retrieval is implemented using the Perl native B-trees indexing (DB_File.pm, based on Berkeley DB).
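
To make the "lazy parsing" idea concrete, here is a small Python sketch of the general technique: the text is parsed the first time an attribute is requested and the result is cached for later requests. The class name LazyResult and its attributes are hypothetical and do not correspond to the InterProScan Perl modules.

```python
class LazyResult:
    """Holds raw tab-delimited output and parses it only on first access."""

    def __init__(self, raw_text):
        self._raw_text = raw_text
        self._parsed = None
        self.parse_count = 0  # only here to show that parsing happens once

    @property
    def matches(self):
        if self._parsed is None:
            self.parse_count += 1
            self._parsed = [line.split("\t")
                            for line in self._raw_text.splitlines() if line]
        return self._parsed
```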

The simple 'one Perl module per data source' organisation makes it possible to reuse the modules in other stand-alone ad-hoc solutions. The Perl-based InterProScan is capable of providing post-processed, integrated results in several formats and could also be used as a simple retrieval system for the underlying data.

Implementation details

Each installation has the following directories:

'data' directory contains all databases and required indices.
'tmp' directory is used to store temporary user sessions and temporary job outputs. This tmp directory also contains another tmp directory used by some applications to create temporary files during runs. Each session directory is created in a directory representing the day on which the jobs were launched. A new directory is therefore created each day.
'bin' directory contains some Perl scripts and platform specific binaries of scanning programs (in the binaries/ subdirectory).
'lib' directory contains all Perl modules necessary for iprscan to work properly. The main core of launching jobs, checking jobs, parsing results and creating the results page/output are located in the lib directory under a package developed at EBI (Dispatcher::*). These packages are the main core of iprscan. It also contains an Index directory used to index databases and output results. It is based on the index method present in previous versions of iprscan.
'conf' directory contains configuration files for each database/application used. It also contains configuration files for several queueing systems, translate/reformat tools, indexing and InterProScan.

Programs

The main script is the one called "iprscan" in the bin/ directory. It acts as both a command-line script (when the -cli option is used) and as the CGI script if and when a user has installed the web interface to InterProScan. The iprscan script starts jobs by calling another script (iprscan_wrapper.pl) which in turn launches and tracks jobs for each application included in the program. Results are parsed and the output created. By default, when running on the command line, the results are written to stdout, unless the -o option is used to redirect the output to a file. You can also specify -verbose mode which shows the status of the main program as PENDING, RUNNING and DONE.

If a job crashes, iprscan warns you (on the command line or in the web interface) and produces a report file containing information about crashed jobs. In a best case scenario, iprscan gives you the reason why the job(s) crashed so that you can try to fix the problem and restart any failed jobs (to do so, just run ResubmitJobs.pl on the command line or click 'Resubmit failed jobs' on the web interface). InterProScan will restart only those jobs which failed (from scratch) and concatenate the results with previous ones.

Below is a more detailed description of the various programs included in InterProScan's architecture:

Config.pl
- is provided to make the installation and reconfiguration of InterProScan easy.
- There are no command-line options - you will be prompted for information
- Most of the prompts have some explanations and provide default suggestions in []. If you later decide to change a perl path, queue system or queue name you just need to restart Config.pl - it will overwrite the old info with the new. Alternatively, you can directly change the configuration files in the conf directory at your own risk.

iprscan

is the program that initiates an InterProScan job.
It creates a temporary session directory and prepares all required infrastructure for the scanning.

The input file is checked to confirm FASTA format, reformatted to clean FASTA and split into portions of a configured size, each in its own 'chunk_NN' directory. The number of sequences contained in each 'chunk_NN' directory can be changed by changing the 'chunk' tag in the iprscan/conf/iprscan.conf file.
The session directory contains a parameters file which summarises the different options the user entered (needed later during parsing) and also information about the sequences, such as length, id and crc64 (checksum). It also contains the original reformatted sequence file.
Each 'chunk_NN' directory gets its own sequence file (.nocrc).

The options for this script are :

-cli	Run the script in command-line mode. The same script is also used as the CGI script when the web interface is configured.
-i	Input sequence file. This file must exist and be readable.
-o	Output file where to write results (default stdout).
-iprlookup	Switch on look up of corresponding InterPro annotation
-goterms	Switch on look up of corresponding Gene Ontology annotation (requires -iprlookup option to be used too)
-trtable num	Translation table code for nucleic acid to protein sequence translation; the -trlen option sets the transcript length threshold (based on CodonTable.pm by Heikki Lehvaslaiho).
-appl	Application to use. Check the iprscan/conf/iprscan.conf file to see which applications you have configured, or type './iprscan -cli -h'. Use multiple -appl flags to specify multiple applications.
-nocrc	Do not perform a crc64 check on your protein sequence(s) before launching any application. Without this option, if all your sequences have a known crc64 according to the match.xml file, no applications are launched and the stored results are displayed directly.
-email	Specify an email address where to send email when the run is finished.
-format (raw|xml|txt|ebixml|html)	Output format (default xml).
-seqtype (n|p)	The type of the input sequences (dna/rna (n) or protein (p)).
-verbose	Displays status of the job.
-taxo	Activate the Taxonomy filter for abbreviated taxonomy (e.g. -taxo Arthro -taxo Bact). Possible values: Arabidopsis thaliana (AraTh), Archaea (Arch), Arthropoda (Arthro), Bacteria (Bact), Caenorhabditis elegans (CaeEl), Chordata (Chor), Cyanobacteria (Cyan), Eukaryota (Euka), Fruit Fly (FrFly), Fungi (Fung), Green Plants (GrePl), Human (Huma), Metazoa (Meta), Mouse (Mous), Nematoda (Nema), Other Eukaryotes (OthEuk), Plastid Group (PlasGrp), Rice spp (Rice), Saccharomyces cerevisiae (SacCer), Synechosystis PCC 6803 (Synec), Unclassified (Unclass) and Virus (Vir)
-txrule <0,1>	Make decision on the taxonomy: 0 -> AND, 1 -> OR.
-help	Displays this help and exits.
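
For scripted use, the options above can be assembled programmatically. The following Python sketch only builds the argument list; it uses nothing but flags documented in this section. The helper name build_iprscan_command is hypothetical, and the application names you pass must match those configured in your own iprscan.conf (the names in the usage note are assumptions taken from the benchmark table).

```python
def build_iprscan_command(infile, outfile=None, fmt="raw", seqtype="p",
                          applications=(), iprlookup=False, goterms=False):
    """Return an argument list for a command-line iprscan run."""
    if goterms and not iprlookup:
        # -goterms requires -iprlookup, as stated in the option list
        raise ValueError("-goterms requires -iprlookup")
    cmd = ["./iprscan", "-cli", "-i", infile,
           "-format", fmt, "-seqtype", seqtype]
    if outfile:
        cmd += ["-o", outfile]
    for name in applications:
        cmd += ["-appl", name]      # one -appl flag per application
    if iprlookup:
        cmd.append("-iprlookup")
    if goterms:
        cmd.append("-goterms")
    return cmd
```

For example, build_iprscan_command("seqs.fasta", outfile="out.raw", applications=["HMMPfam", "HMMPIR"], iprlookup=True, goterms=True) gives a list that could be handed to subprocess.run from the bin/ directory of an installation.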

meter.pl
- reports the progress of a job
- You need to provide the full or relative path to the session directory (e.g. 'tmp/20040302/iprscan-20040302-12585481')
- The options for this script are:
  
  session_dir The path of the session directory of your job

index_data.pl

checks and updates all required indices (see DATA Update). This script is also used to format databases needed by blastall using formatdb binary located in 'iprscan/bin/binaries/YOURPLATFORM/blast/'.

The options for this script are:

-f	File(s) to index. By default indexes all the required files. Type ./index_data.pl -h to get the list of the supported files.
-inx	Index given files.
-iforce	Force script to reindex files even if they are reported as being up to date. This feature is needed most of the time when data is being updated. It first removes the old index and then builds a new one.
-bin	Convert ascii hmm library file to binary file. Can speed up the hmmpfam search by up to 40%. (.bin is now required by default)
-bforce	Force the binary file conversion even if it is already here. Needed most of the time during data update.
-v	Verbose mode, prints information during indexing.
-h	Displays this help and exits.

converter.pl
- Is used to reformat results from raw into [html, xml, ebixml, txt, gff3] format.
- NOTE: ebixml format just adds an EBI header to the top of the xml file.
- NOTE: to get gff3 format, you must first run iprscan and output raw format.
- The options for this script are:
  
  -format format you want to convert to
  
  -input raw_file the original output file of results
  
  -jobid the interproscan job id (required for HTML and XML formats)

iterator.pl
- Iterator reads fasta sequences from the input one at a time and executes the command for each sequence. The command MUST use stdout to print the output. Use an %infile tag on the command line to specify the location of the input file.
- The options for this script are:
  
  -i infile input file name (fasta sequences)
  
  -o outfile output file name (collated results)
  
  -c cmd command to execute on the sequences
  
  -h help

ResubmitJobs.pl

This script is used to relaunch jobs that failed during a run. If jobs crashed (if the size of the errors file is greater than 0), InterProScan creates a report with info about the chunk, the application that failed and the reason, if available. The report file is then used to relaunch the failed jobs only. The user/administrator must fix the errors before using this script; otherwise InterProScan cannot restart the failed application and will recreate the same report.

The options for this script are:

-r	Path to the report file. (e.g. tmp/20040302/iprscan-20040302-12355481/iprscan-20040302-12335481.report)
-h	Displays this help and exits.
-v [0,1,2]	Verbose mode with multiple levels. 0 prints no info at all (like without the -v option). 1 prints only the main actions. 2 prints everything that the script is doing.

iprscan_wrapper.pl
- This script is used by iprscan to launch, check, and parse results of all jobs. You should not need to use this script yourself.
- It takes the params file on stdin

Configuration files

InterProScan is supplied with configuration files in the conf directory so that you can easily set up your installation exactly as you want. Configuration files are based on 'tags' that are expanded when InterProScan reads them. What do we mean by a 'tag'?

For example in the following lines :

workserver=http://fido.ebi.ac.uk:4000
workurl=[%workserver]/iprscan/iprscan?tool=iprscan&jobid=....

workserver is a key and http://fido.ebi.ac.uk:4000 is the value. In the next line, [%workserver] is a tag. When InterProScan reads a configuration file, it reads it as a file of key-value pairs. When it sees something on a line that looks like '[%.....]', it treats it as a tag and tries to expand it by checking whether it has already seen that key. If so, it replaces the tag '[%...]' with the key's value; otherwise it replaces it with nothing. This is why, whenever you want to use a tag in a value to avoid repeating it, you HAVE TO SET the key earlier in the file, as files are read from top to bottom.
In this case, workurl will be:

http://fido.ebi.ac.uk:4000/iprscan/iprscan?tool=iprscan&jobid=....

after expanding.
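
The expansion rule can be summarised with a short Python sketch. It covers only plain '[%key]' tags referring to keys set earlier in the file; the special tags and conditions handled by Config.pm are not modelled, and the function name read_config is hypothetical.

```python
import re

TAG = re.compile(r"\[%(\w+)\]")


def read_config(lines):
    """Read key=value lines top to bottom, expanding [%key] tags.

    A tag whose key has not been seen yet expands to nothing."""
    values = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = TAG.sub(
            lambda m: values.get(m.group(1), ""), value.strip())
    return values
```

Because each line is expanded as it is read, swapping the order of the two example lines would leave workurl without its server prefix.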

InterProScan supports conditions in its tags. Here is a list. To know how to write them correctly, you will have to read the code of the Config.pm module (iprscan/lib/Dispatcher/Config.pm).

%env	refers to the environment variable hash table in Perl (%ENV).
%if	lets you put a condition into your tags.
%switch	lets you have multiple choices (same as a basic switch condition in programming).
%random	calls the Perl srand subroutine.
%YYYY	translates it as the current year.
%MM	translates it as the current month of the year.
%DD	translates it as the current day of the month.
%hh	translates it as the current hour of the day.
%mm	translates it as the current minutes of the hour.
%ss	translates it as the current second of the minute.
%hostname	translates it as the hostname of the machine.
%pid	translates it with the process id of the program.
%uname	translates it with the operating system name.
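
As a rough illustration of the date and system tags, the Python sketch below builds a table of values and substitutes them into a string. It assumes that these tags are written with the same '[%...]' bracket syntax as ordinary tags and that numbers are zero-padded; both points are assumptions, and Config.pm remains the authoritative reference. The names builtin_tags and expand are hypothetical.

```python
import os
import platform
import re
import socket
import time


def builtin_tags(now=None):
    """Return illustrative values for the date, host and process tags."""
    t = time.localtime(now)  # current local time when now is None
    return {
        "YYYY": time.strftime("%Y", t),
        "MM": time.strftime("%m", t),
        "DD": time.strftime("%d", t),
        "hh": time.strftime("%H", t),
        "mm": time.strftime("%M", t),
        "ss": time.strftime("%S", t),
        "hostname": socket.gethostname(),
        "pid": str(os.getpid()),
        "uname": platform.system(),
    }


def expand(text, tags):
    """Replace each [%name] by its value, or by nothing if it is unknown."""
    return re.sub(r"\[%(\w+)\]", lambda m: tags.get(m.group(1), ""), text)


print(expand("tmp/[%YYYY][%MM][%DD]/job-[%pid]", builtin_tags()))
```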

With all of these features, you should be able to modify/configure InterProScan and applications as you want.

Results filtering / Match status

Method cut-offs

InterProScan is based on scanning methods native to the InterPro member databases. It is distributed with pre-configured method cut-offs that are recommended by the member database experts and are believed to report relevant matches. All cut-offs are defined in configuration files (see the 'conf' directory). Matches obtained with the fixed cut-off are subject to the following filtering. (Please also see the member database web pages for more information.)

PFAM filtering:
- Each Pfam family is represented by 2 HMMs - ls and fs (full-length and fragment).
- Each HMM has bit score cut-offs (for each domain match and for the total model match), which are defined in the GA lines of the Pfam database. Initial results are obtained with quite a high common cut-off, and then matches of the signature that score lower than the family-specific cut-offs are dropped.
- If both the fs and ls models for a particular Pfam family hit the same region of a sequence, the ls model is always chosen.
- Another type of filtering has been implemented since release 4.1. It is based on Clan filtering and nested domains. Please check the Pfam website (http://www.sanger.ac.uk/Pfam) for more information on Clan filtering.
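
The first three Pfam rules can be sketched in Python as below. This is an illustration under stated assumptions, not the InterProScan implementation: only the per-domain cut-off is applied (the total model cut-off and Clan filtering are left out), 'the same region' is interpreted as any overlap, and the Hit class and the layout of ga_cutoff are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    family: str   # Pfam accession, e.g. "PF00886"
    mode: str     # "ls" (full-length) or "fs" (fragment)
    start: int
    end: int
    score: float  # domain bit score


def overlaps(a, b):
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def filter_pfam(hits, ga_cutoff):
    """Apply family-specific cut-offs, then prefer ls over fs.

    ga_cutoff maps (family, mode) to a domain bit score threshold.
    """
    kept = [h for h in hits if h.score >= ga_cutoff[(h.family, h.mode)]]
    ls_hits = [h for h in kept if h.mode == "ls"]
    # Drop an fs hit when an ls hit of the same family covers the region.
    return [
        h for h in kept
        if h.mode == "ls"
        or not any(l.family == h.family and overlaps(l, h) for l in ls_hits)
    ]
```

An fs hit that falls in a region not covered by any ls hit of the same family is kept, so fragment matches are not lost.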
TIGRFAMs filtering:
- Each TIGRFAMs HMM has its own cut-off scores for each domain match and for the total model match. These bit score cut-offs are defined in the TC lines of the database. Initial results are obtained with quite a high common cut-off, and then the matches (of the signature or of some of its domains) that score lower than the family-specific cut-offs are dropped.
PRINTS filtering:
- There is a test version of PRINTS family-specific p-value cut-offs. All matches with a p-value greater than p_min for the signature are dropped.
SMART filtering:
- The publicly distributed version of InterProScan has a common e-value cut-off corresponding to the reference database size. A more sophisticated scoring model is used on the SMART web server and in the production of pre-calculated InterPro match data.
- Exact scoring thresholds for domain assignments are proprietary data that can be obtained directly from the SMART team. The InterProMatches data production procedure uses these additional smart.thresholds and smart.desc data files (which are available from the SMART team) to filter out results obtained with the higher cut-off.
- PLEASE NOTE: the given cut-offs are e-values (i.e. the number of expected random hits) and are therefore valid only in the context of the reference database size.
- It implements the following logic:
  1. If the E-value of the found match is worse than the 'cut_low', the match is dropped.
  2. If the E-value of the found match is worse than the 'family' cut-off, it is reported as a family hit with unknown/marginal status ("M") and no description is given.
  3. If the E-value of the found match is better than the 'family' cut-off but worse than the 'cutoff', it is reported as a family member with marginal status ("M") but the family name is given.
  4. If the 'family' cut-off is undefined and the E-value of the match is worse than the 'cutoff' but better than the 'cut_low', it is reported as a domain match with marginal status ("M").
  5. If the E-value of the found match is better than the 'cutoff', it is reported as a domain match with true status ("T").
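
The five rules above can be written as a single decision function. The Python sketch below is an illustration only: the function name smart_status and its return values are assumptions, 'worse' is read as a strictly larger E-value, and a match whose E-value is exactly equal to a threshold falls into the better category, which the rules do not specify.

```python
def smart_status(evalue, cutoff, cut_low, family=None):
    """Classify a SMART match from its E-value.

    Returns None if the match is dropped, otherwise a pair
    (reported_as, status). family is None when the 'family'
    cut-off is undefined.
    """
    if evalue > cut_low:
        return None                                      # rule 1
    if family is not None:
        if evalue > family:
            return ("family hit (no description)", "M")  # rule 2
        if evalue > cutoff:
            return ("family member", "M")                # rule 3
    elif evalue > cutoff:
        return ("domain", "M")                           # rule 4
    return ("domain", "T")                               # rule 5
```

For example, with cutoff=1e-5, family=1e-2 and cut_low=1.0, an E-value of 1e-3 is reported as a family member with marginal status, while the same E-value with an undefined 'family' cut-off is reported as a marginal domain match.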
PROSITE patterns CONFIRMation:
- ScanRegExp is able to verify PROSITE matches using corresponding statistically-significant CONFIRM patterns.
- The default status of the PROSITE matches is unknown (?) and the true positive (T) status is assigned if the corresponding CONFIRM patterns match as well.
- The CONFIRM patterns were generated based on the true positive SWISS-PROT PROSITE matches using eMOTIF software with a stringency of 10e-9 P-value.
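
The status assignment can be sketched in Python as follows. This is not how ScanRegExp is implemented: the patterns are given here as ordinary Python regular expressions rather than in PROSITE syntax, the function name prosite_status is hypothetical, and a match is treated as confirmed when at least one of its CONFIRM patterns matches, which is an assumption.

```python
import re


def prosite_status(sequence, pattern, confirm_patterns):
    """Return None (no match), "?" (unconfirmed) or "T" (confirmed)."""
    if not re.search(pattern, sequence):
        return None
    # The default status is unknown; a matching CONFIRM pattern upgrades it.
    confirmed = any(re.search(c, sequence) for c in confirm_patterns)
    return "T" if confirmed else "?"
```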
PANTHER filtering:
- PANTHER has pre- and post-processing steps. The pre-processing step is intended to speed up the HMM-based searching of the sequence and involves blasting the HMM sequences with the query protein sequence in order to find the most similar models above a given e-value. The resulting HMM hits are then used in the HMM-based search.
- PANTHER consists of families and sub-families. When a sequence is found to match a family in the blast run, the family's sub-families are also scored using HMMER (unless there is only 1 sub-family, in which case the sequence is scored against the family alone).
- Any matches that are worse than the e-value cut-off are discarded. The remaining matches are searched to find the HMM with the best score and e-value, and this best hit is then reported (including any sub-family hit).
- For more information, please see the PANTHER website.
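
The post-processing step (discard weak matches, then report the best one) can be sketched in Python as below. It is an illustration only: the function name best_panther_hit and the tuple layout are assumptions, and 'best' is interpreted as the highest score with ties broken by the lowest e-value.

```python
def best_panther_hit(hits, evalue_cutoff):
    """Pick the best hit among family and sub-family matches.

    hits is a list of (model_id, score, evalue) tuples.
    Returns None if no hit passes the e-value cut-off.
    """
    passing = [h for h in hits if h[2] <= evalue_cutoff]
    if not passing:
        return None
    # Highest score first; on equal scores prefer the lower e-value.
    return max(passing, key=lambda h: (h[1], -h[2]))
```

Because family and sub-family models are ranked together, a sub-family is reported whenever it scores better than its parent family.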
GENE3D filtering:
- Gene3D also employs post-processing of results by using a program called DomainFinder.
- This program takes the output from searching the Gene3D HMMs against the query sequence and extracts all hits that are more than 10 residues long and have an e-value better than 0.01.
- If hits overlap at all, the match with the better e-value is chosen.
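
A simple way to picture these rules is the greedy Python sketch below. It is not the DomainFinder program itself, only an approximation of the three rules listed above: hits are filtered by length and e-value, then accepted from the best e-value downwards as long as they do not overlap a hit that has already been accepted. The function name and tuple layout are hypothetical.

```python
def domain_finder_sketch(hits, min_length=10, max_evalue=0.01):
    """Filter and resolve overlapping Gene3D hits.

    hits is a list of (model_id, start, end, evalue) tuples with
    1-based inclusive coordinates.
    """
    candidates = [
        h for h in hits
        if (h[2] - h[1] + 1) > min_length and h[3] < max_evalue
    ]
    chosen = []
    # Visit hits from the best (lowest) e-value to the worst.
    for hit in sorted(candidates, key=lambda h: h[3]):
        if all(hit[2] < kept[1] or hit[1] > kept[2] for kept in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h[1])
```

Note that a hit rejected because of an overlap no longer blocks other hits, so a third hit that overlaps only the rejected one can still be accepted.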

Taxonomy Filtering

Taxonomy filtering is based on InterPro entry taxonomy. InterProScan first does its usual work, searching your sequence against the InterPro member databases. When all analyses are done, InterProScan post-processes the results using the taxonomy of each InterPro entry found for your sequence(s).

For example, your sequence (RS16_ECOLI) is analysed against Pfam and ProDom. The hits returned are Pfam: PF00886 and ProDom: PD003791, and the associated InterPro entry is IPR000307.

Taxonomy for IPR000307 is listed below.

Arabidopsis thaliana
Archaea
Arthropoda
Bacteria
Caenorhabditis elegans
Chordata
Cyanobacteria
Eukaryota
Fruit Fly
Fungi
Green Plants
Human
Metazoa
Mouse
Nematoda
Other Eukaryotes
Plastid Group
Rice spp.
Saccharomyces cerevisiae
Synechocystis PCC 6803

For each hit, one of two rules is applied.
AND rule: If not every taxonomy name selected by the user is present in the InterPro entry, the hit (PF00886 or PD003791) is rejected and not shown in the results.
OR rule: If at least one of the taxonomy names selected by the user is present in the InterPro entry, the hit (PF00886 or PD003791) is kept and shown to the user.
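
The two rules map directly onto set operations, as the Python sketch below shows. It is an illustration only: the function name keep_hit is hypothetical, the taxonomy set is a subset of the IPR000307 list above, and keeping every hit when nothing is selected is an assumption.

```python
def keep_hit(selected, entry_taxonomy, rule="OR"):
    """Decide whether a hit survives taxonomy filtering."""
    selected, entry_taxonomy = set(selected), set(entry_taxonomy)
    if not selected:
        return True  # assumption: no selection means no filtering
    if rule == "AND":
        return selected <= entry_taxonomy       # every selected name present
    if rule == "OR":
        return bool(selected & entry_taxonomy)  # at least one name present
    raise ValueError("rule must be 'AND' or 'OR'")


ipr000307 = {"Archaea", "Bacteria", "Eukaryota", "Fungi", "Metazoa"}
print(keep_hit({"Bacteria", "Fungi"}, ipr000307, rule="AND"))
print(keep_hit({"Bacteria", "Viruses"}, ipr000307, rule="AND"))
print(keep_hit({"Bacteria", "Viruses"}, ipr000307, rule="OR"))
```

Here 'Viruses' stands for any selected name that is not attached to the entry: it makes the AND rule reject the hit, while the OR rule still keeps it because 'Bacteria' is present.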

NOTE: If you do not wish to use taxonomy filtering and want to make it unavailable to users, just edit iprscan.conf and set the tag 'taxonomy.use' to 0.

References

1. The InterPro Consortium (*R.Apweiler, T.K.Attwood, A.Bairoch, A.Bateman,
  E.Birney, M.Biswas, P.Bucher, L.Cerutti, F.Corpet, M.D.R.Croning, R.Durbin,
  L.Falquet, W.Fleischmann, J.Gouzy, H.Hermjakob, N.Hulo, I.Jonassen, D.Kahn,
  A.Kanapin, Y.Karavidopoulou, R.Lopez, B.Marx, N.J.Mulder, T.M.Oinn, M.Pagni,
  F.Servant, C.J.A.Sigrist, E.M.Zdobnov).
  "The InterPro database, an integrated documentation resource for protein
  families, domains and functional sites."
  Nucleic Acids Research, 2001, 29(1): 37-40.

2. Hofmann K., Bucher P., Falquet L., and Bairoch A.
  "The Prosite Database, Its Status in 1999."
  Nucleic Acids Res, 1999, 27(1): 215-9.

3. Attwood T.K., Croning M.D., Flower D.R., Lewis A.P., Mabey J.E., Scordis P.,
  Selley J.N., and Wright W.
  "Prints-S: The Database Formerly Known as Prints."
  Nucleic Acids Res, 2000, 28(1): 225-7.

4. Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L.
  "The Pfam Protein Families Database."
  Nucleic Acids Res, 2000, 28(1): 263-6.

5. Corpet F., Gouzy J., and Kahn D.
  "Recent Improvements of the Prodom Database of Protein Domain Families."
  Nucleic Acids Res, 1999, 27(1): 263-7.

6. Schultz J., Copley R.R., Doerks T., Ponting C.P., and Bork P.
  "Smart: A Web-Based Tool for the Study of Genetically Mobile Domains."
  Nucleic Acids Res, 2000, 28(1): 231-4.

7. Bucher P., Karplus K., Moeri N., and Hofmann K.
  "A Flexible Motif Search Technique Based on Generalized Profiles."
  Comput Chem, 1996, 20(1): 3-23.

8. Scordis P., Flower D.R., and Attwood T.K.
  "Fingerprintscan: Intelligent Searching of the Prints Motif Database."
  Bioinformatics, 1999, 15(10): 799-806.

9. Eddy S.R.
  "Profile Hidden Markov Models."
  Bioinformatics, 1998, 14(9): 755-63.

10. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W.,
  and Lipman D.J.
   "Gapped Blast and Psi-Blast: A New Generation of Protein Database Search
  Programs."
   Nucleic Acids Res, 1997, 25(17): 3389-402.

11. Haft,D.H., Loftus,B.J., Richardson,D.L., Yang,F., Eisen,J.A., Paulsen,I.T.,
  White,O.
   "TIGRFAMs: a protein family resource for the functional identification of
  proteins."
   Nucleic Acids Res, 2001, 29(1): 41-3.

12. Eddy, S.R.
   "HMMER: Profile hidden Markov models for biological sequence analysis".
    WWW, 2001. http://hmmer.wustl.edu/

13. Cathy H. Wu, Hongzhang Huang, Lai-Su L. Yeh, Winona C. Barker
   "Protein family classification and functional annotation."
   Computational Biology and Chemistry, 2003, 27: 37-47.

14. Gough, J., Karplus, K., Hughey, R. and Chothia, C.
   "Assignment of Homology to Genome Sequences using a Library of Hidden
  Markov Models that represent all Proteins of Known Structure."
   J. Mol. Biol., 2001, 313(4): 903-919.

15. D.Buchan, F.Pearl, D.Lee, A.Shepherd, S.Rison, C.Orengo, J.Thornton
   "Gene3D: Structural assignments for whole genes and genomes using the CATH
   domain structure database."
   Genome Research, 12(3): 503-514.

16. Huaiyu Mi, Betty Lazareva-Ulitsky, Rozina Loo, Anish Kejariwal, Jody
   Vandergriff, Steven Rabkin, Nan Guo, Anushya Muruganujan, Olivier
   Doremieux, Michael J. Campbell, Hiroaki Kitano and Paul D. Thomas
   "The PANTHER database of protein families, subfamilies, functions and
   pathways."
   Nucleic Acids Research, 2005, 33 (Database issue): D284-D288.

How to cite

Zdobnov E.M. and Apweiler R.
"InterProScan - an integration platform for the signature-recognition methods in InterPro."
Bioinformatics, 2001, 17(9): 847-8.

0	(Standard)
1	(Standard (with alternative initiation codons))
2	(Vertebrate Mitochondrial)
3	(Yeast Mitochondrial)
4	(Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5	(Invertebrate Mitochondrial)
6	(Ciliate Macronuclear and Dasycladacean)
9	(Echinoderm Mitochondrial)
10	(Euplotid Nuclear)
11	(Bacterial)
12	(Alternative Yeast Nuclear)
13	(Ascidian Mitochondrial)
14	(Flatworm Mitochondrial)
15	(Blepharisma Macronuclear)
16	(Chlorophycean Mitochondrial)
21	(Trematode Mitochondrial)
22	(Scenedesmus obliquus)
23	(Thraustochytrium Mitochondrial)

-format	format you want to convert to
-input raw_file	the original output file of results
-jobid	the interproscan job id (required for HTML and XML formats)

-i infile	input file name (fasta sequences)
-o outfile	output file name (collated results)
-c cmd	command to execute on the sequences
-h	help