InterProScan README
RELEASE v4.1 by Emmanuel Quevillon, Ville Silventoinen and Florence Servant


Introduction to InterPro:
InterPro member databases and scanning methods:
InterProScan:
Features:
Availability:
System requirements:
Installation:
DATA Update:
Distributed DATA and Applications:
Architecture review:
Implementation details:
Results filtering / Match status:
Programs:
What's new:
References:
How to cite:



==========================
Introduction to InterPro:
==========================
Databases of protein domains and functional sites have become vital
resources for the prediction of protein functions. During the last decade, several
signature-recognition methods have evolved to address different sequence
analysis problems, resulting in rather different and, for the most part,
independent databases. Diagnostically, these resources have different
areas of optimum application owing to the different strengths and weaknesses of
their underlying analysis methods. Thus, for best results, search strategies
should ideally combine all of them. InterPro ([1]) is a collaborative project
aimed at providing an integrated layer on top of the most commonly used signature
databases by creating a unique, non-redundant characterisation of a
given protein family, domain or functional site. The InterPro database
integrates PROSITE ([2]), PRINTS ([3]), Pfam ([4]), ProDom ([5]), SMART ([6]),
TIGRFAMs ([11]), PIR superfamily ([13]), SUPERFAMILY ([14]), Gene3D ([15]) and PANTHER ([16])
databases and the addition of others is scheduled. InterPro data is distributed in XML format
and it is freely available under the InterPro Consortium copyright. The InterPro
project home page is available at http://www.ebi.ac.uk/interpro.

Any queries should be emailed to interhelp@ebi.ac.uk.


================================================
InterPro member databases and scanning methods:
================================================
* PROSITE patterns.
  Some biologically significant amino acid patterns can be summarised in
  the form of regular expressions.
  ScanRegExp (by Wolfgang.Fleischmann@ebi.ac.uk).

* PROSITE profiles.
  There are a number of protein families as well as functional or
  structural domains that cannot be detected using patterns due to their extreme
  sequence divergence, so the use of techniques based on weight matrices
  (also known as profiles) allows the detection of such proteins or domains.
  A profile is a table of position-specific amino acid weights and gap costs.
  The profile structure used in PROSITE is similar to but slightly more general
  (Bucher P. et al., 1996 [7]) than the one introduced by M. Gribskov and
  co-workers.
  pfscan from the Pftools package (by Philipp.Bucher@isrec.unil.ch).

* PRINTS.
  The PRINTS database houses a collection of protein family fingerprints.
  These are groups of motifs that together are diagnostically more
  powerful than single motifs by making use of the biological context inherent in a
  multiple-motif method. The fingerprinting method arose from the need for
  a reliable technique for detecting members of large, highly divergent
  protein super-families.
  FingerPRINTScan (Scordis P. et al., 1999 [8]).

* PFAM.
  Pfam is a database of protein domain families. Pfam contains curated
  multiple sequence alignments for each family and corresponding hidden
  Markov models (HMMs) (Eddy S.R., 1998 [9]). 
  Profile hidden Markov models are statistical models of the primary
  structure consensus of a sequence family. The construction and use
  of Pfam is tightly tied to the HMMER software package.
  hmmpfam from the HMMER2.3.2 package (by Sean Eddy,
  eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* PRODOM.
  ProDom is a database of protein domain families obtained by automated
  analysis of the SWISS-PROT and TrEMBL protein sequences. It is useful
  for analysing the domain arrangements of complex protein families and the
  homology relationships in modular proteins. ProDom families are built by
  an automated process based on a recursive use of PSI-BLAST homology
  searches.
  ProDomBlast3i.pl (by Emmanuel Courcelle emmanuel.courcelle@toulouse.inra.fr
                    and Yoann Beausse beausse@toulouse.inra.fr)
  a wrapper on top of the Blast package (Altschul S.F. et al., 1997 [10]).

* SMART.
  SMART (a Simple Modular Architecture Research Tool) allows the
  identification and annotation of genetically mobile domains and the
  analysis of domain architectures. These domains are extensively
  annotated with respect to phyletic distributions, functional class, tertiary
  structures and functionally important residues. SMART alignments are
  optimised manually, and corresponding hidden Markov models (HMMs) are then constructed.
  hmmpfam from the HMMER2.3.2 package (by Sean Eddy,
  eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* TIGRFAMs.
  TIGRFAMs are a collection of protein families featuring curated multiple
  sequence alignments, Hidden Markov Models (HMMs) and associated
  information designed to support the automated functional identification
  of proteins by sequence homology. Classification by equivalog family
  (see below), where achievable, complements classification by orthologs,
  superfamily, domain or motif. It provides the information best suited
  for automatic assignment of specific functions to proteins from large
  scale genome sequencing projects.
  hmmpfam from the HMMER2.3.2 package (by Sean Eddy,
  eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* PIR SuperFamily.
  PIR SuperFamily (PIRSF) is a classification system based on evolutionary
  relationship of whole proteins.
  hmmpfam from the HMMER2.3.2 package (by Sean Eddy,
  eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* SUPERFAMILY.
  SUPERFAMILY is a library of profile hidden Markov models that represent
  all proteins of known structure, based on SCOP.
  hmmpfam/hmmsearch from the HMMER2.3.2 package (by Sean Eddy,
  eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

  Optionally, predictions for coiled-coil, signal peptide cleavage sites
  (SignalP v3) and TM helices (TMHMM v2) are supported (See the FAQs file
  for details).

* GENE3D
  Gene3D is supplementary to the CATH database. This protein sequence database
  contains proteins from complete genomes which have been clustered into protein
  families and annotated with CATH domains, Pfam domains and functional
  information from KEGG, GO, COG, Affymetrix and STRINGS.
  hmmpfam from the HMMER2.3.2 package (by Sean Eddy,
  eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* PANTHER
  The PANTHER (Protein ANalysis THrough Evolutionary Relationships)
  Classification System was designed to classify proteins (and their genes)
  in order to facilitate high-throughput analysis.
  hmmsearch from the HMMER2.3.2 package (by Sean Eddy,
  eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
  and blastall from the Blast package (Altschul S.F. et al., 1997 [10]).

==============
InterProScan:
==============
InterProScan is a tool that combines the different protein signature
recognition methods into one resource. The number of signature databases,
their associated scanning tools and the further refinement procedures
increase the complexity of the problem. InterProScan is more than a simple
wrapper around sequence analysis applications, since it also performs
considerable data look-ups in several databases and program outputs. The
Perl-based InterProScan is intended to be an extensible and scalable system
optimised to cope with bulk data processing. The need for production-scale
efficiency and easy extensibility requires a robust and efficient (parallel)
internal architecture that can benefit from network-distributed computing
with the support of UNIX queuing systems. The package includes a simple
Perl-based data retrieval system that provides the required data look-up
efficiency and easy extensibility.


==========
Features:
==========
1) The most important feature of InterProScan is that it is pure Perl.
   No dependencies are needed to use it.
   If jobs fail, reports are created and a resubmission script can
   complete the failed jobs once the problem(s) have been solved.

2) Another important feature of InterProScan is the distributed
   execution of scanning jobs. The integrated applications are executed using Unix
   rsh on the configured network hosts. The job can either be directly
   executed on a remote host or can be submitted from the host to a Unix
   queuing system like LSF, PBS or SGE which can redirect it further.
   If you need to set environment variables or run other commands before a queue
   command line is executed, you can add code to the *env.sh files. Each file
   corresponds to a queueing system.

3) As a wrapper InterProScan has a modular structure with a simple
   "one Perl module per database" organisation.
   This structure is based on Perl modules used at EBI to dispatch
   the jobs on the EBI network.
   This Perl library can be reused in other projects that need a job
   dispatcher.

4) Each of the Perl modules provides an object-oriented interface to the
   underlying database entry attributes. The parsing of output results
   file happens only once and is done upon request, implementing
   so-called lazy parsing.

5) Parsing routines are implemented in a classic way. For each application,
   InterProScan reads the output results, stores the information in a hash table
   and returns it to the main program, which writes the raw file.

6) To speed up the required data look-up, InterProScan indexes the corresponding
   databases. Fast data retrieval is implemented based on Perl native B-trees
   indexing (DB_File.pm by Paul Marquess, based on BerkeleyDB).

7) InterProScan makes the results available in five formats {raw ebixml xml txt html}:

	a) raw format - a basic tab-delimited format, useful for uploading the data
	into a relational database or for concatenating different runs. Each match is
	reported on one line (a small parsing sketch is given after this feature list).
	Example here (with descriptions):

------
NF00181542      0A5FDCE74AB7C3AD        272     HMMPIR  PIRSF001424     Prephenate dehydratase  1       270     6.5e-141        T       06-Oct-2004         IPR008237       Prephenate dehydratase with ACT region  Molecular Function:prephenate dehydratase activity (GO:0004664), Biological Process:L-phenylalanine biosynthesis (GO:0009094)
------

	Where: NF00181542:             is the id of the input sequence.
	       0A5FDCE74AB7C3AD:       is the crc64 (checksum) of the protein sequence (supposed to be unique).
	       272:                    is the length of the sequence (in AA).
	       HMMPIR:                 is the analysis method launched.
	       PIRSF001424:            is the member database entry for this match.
	       Prephenate dehydratase: is the member database description for the entry.
	       1:                      is the start of the domain match.
	       270:                    is the end of the domain match.
	       6.5e-141:               is the e-value of the match (reported by the member database's analysis method).
	       T:                      is the status of the match (T: true, ?: unknown).
	       06-Oct-2004:            is the date of the run.
	       IPR008237:              is the corresponding InterPro entry (if iprlookup was requested by the user).
	       Prephenate dehydratase with ACT region:                           is the description of the InterPro entry.
	       Molecular Function:prephenate dehydratase activity (GO:0004664):  is the GO (gene ontology) description for the InterPro entry.

	b) xml format - is a self descriptive computer readable format compatible
	   with the distribution XML format of InterProMatches.

	c) ebixml format - is a self descriptive computer readable format compatible
	   with the distribution XML format of InterProMatches. It includes an EBI header
	   describing the applications' databases and methods.

	d) txt format - is a condensed plain text representation of the results.

	e) html format - conforms to the HTML 3 standard and is viewable in web
	   browsers. This format is enhanced by a graphical representation of the
	   identified matches as well as by hyperlinks to the corresponding
	   InterPro entries, the signature entries of the InterPro member databases, the
	   scanned protein sequences and the original output of the underlying
	   applications. It also provides links to the applications' home pages.

	   It also produces a 'Table view' where all the hits are reported without any
	   cartoons. This view shows the start and end of each match, the
	   e-value and the status (this information is also available from the graphical view
	   by moving the mouse over the cartoons). It also shows the InterPro entries
	   and descriptions, the InterPro hierarchy (parents, children, contains and found in)
	   and the GO term annotation.

8) The InterProScan package includes optional support for a web user
   interface, with a script for basic retrieval of local data and checking of indexes.

9) You can submit nucleic acid sequences, which will be translated in all
   6 frames and piped into the analysis programs. InterProScan is designed to
   use the seqret and sixpack binaries from the EMBOSS package
   (http://emboss.sourceforge.net) to translate and reformat input sequences,
   but you can use your own translator/reformatter. See the FAQ for how to do this.

10) The InterProScan package implements additional filtering of the
    results based on family specific cut-offs.

11) The InterProScan package no longer uses gmake to launch, parse
    and concatenate results. InterProScan has been rewritten from scratch
    and is pure Perl.
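
    As an illustration of point 7a) above, here is a minimal Perl sketch (not part
    of the package) of how one line of the raw tab-delimited output could be split
    into named fields; the field names follow the description above and are
    otherwise arbitrary:

        use strict;
        use warnings;

        my @fields = qw(seq_id crc64 length method db_entry db_description
                        start end evalue status date ipr_id ipr_desc go_terms);

        while (my $line = <STDIN>) {
            chomp $line;
            next unless $line =~ /\t/;          # skip anything that is not a raw match line
            my %match;
            @match{@fields} = split /\t/, $line;
            printf "%s: %s (%s) %s-%s, e-value %s\n",
                   $match{seq_id}, $match{db_entry}, $match{method},
                   $match{start}, $match{end}, $match{evalue};
        }

    Saved as, say, parse_raw.pl (the name is arbitrary), it could be run as
    'perl parse_raw.pl < tmp/20041011/iprscan-20041011-11123456/merged.raw',
    using the session directory created for your own run.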

==============
Availability:
==============
InterProScan and the underlying applications are freely available under
the GNU licence agreement from the EBI's ftp server
(ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/).


=====================
System requirements:
=====================
* The InterProScan package has been developed in Perl5 under UNIX

* DB_File.pm (interface to Berkeley DB; part of the standard
  Perl5 distribution). If it is not installed, check CPAN (http://search.cpan.org).

* XML::Parser.pm needs to be installed, and therefore libexpat (1.95.5 or newer).
  These are needed for the new BlastProDom implementation and to parse XML outputs.

* Here is the list of Perl modules that InterProScan uses and that you will need to
  have installed (a small check script is sketched after the list):
  - XML::Quote
  - English
  - File::Basename
  - File::Copy
  - File::Path
  - File::Spec::Functions
  - Sys::Hostname
  - Mail::Send
  - FileHandle
  - IO::Scalar
  - CGI
  - URI::Escape
  - IO::String
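
  A minimal sketch (not part of the package) that tries to load each of the
  modules above and reports any that are missing:

      use strict;
      use warnings;

      my @modules = qw(DB_File XML::Parser XML::Quote English File::Basename
                       File::Copy File::Path File::Spec::Functions Sys::Hostname
                       Mail::Send FileHandle IO::Scalar CGI URI::Escape IO::String);

      for my $module (@modules) {
          if (eval "require $module; 1") {
              print "OK      $module\n";
          } else {
              print "MISSING $module (install it from CPAN)\n";
          }
      }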

* Binaries of signature recognition methods provided for the following
  UNIX platforms:

 iprscan_bin_IRIX64.tar.gz - the executables for SGI
 iprscan_bin_Darwin-10.2.tar.gz - ... for Mac Darwin 10.2 or newer
 iprscan_bin_Linux.tar.gz  - ... for Linux PC
 iprscan_bin_OSF1.tar.gz - ... for DEC Alpha
 iprscan_bin_SunOS.tar.gz - ... for Solaris/Sparc
 iprscan_bin_AIX6.5.tar.gz - ... for AIX 6.5

* The full installation (with binaries & data for all platforms) takes about
  4 Gb.
* For distributed computing:
  InterProScan relies on UNIX rsh. This means you have to be able to
  rlogin to the hosts you are going to use.
  The installation should be on a shared file system (e.g. over NFS)
  that is accessible from all hosts (in the queue) you are going to
  use.
* The installation step requires that you are able to execute
  commands such as 'ls', 'pwd', 'rsh' and 'uname'.

==============
Installation:
==============
1) Download, unzip and untar the following files
   from ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/ (e.g. for version 4.1):

       % ncftp ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/RELEASE/4.1/iprscan_v4.1.tar.gz
       % gunzip -c iprscan_v4.1.tar.gz | tar xvf -

	a) iprscan_vX.tar.gz - the InterProScan itself
	b) iprscan_bin_XXX.tar.gz - executables of signature scanning
	   applications specific to the UNIX platform(s) you are going to use
	   (see System requirements).
	   These binaries are precompiled. If you want to compile them yourself with
	   your own options, see 'InterPro member databases and scanning methods'
	   for the URL(s) where to download them.
	c) iprscan_data.tar.gz  - databases used by the InterProScan, including
	   all required indexes of the data.

2) From the root of the installation (where you created 'iprscan' directory) run

            % perl Config.pl

    and follow the prompts.
    If you agree with the suggestion given by the prompt, just press 'Enter'.

3) You can test the installation by querying the test sequence (test.seq) included:

            % cd bin; ./iprscan -cli -i ../test.seq -iprlookup -goterms 

   Where '-iprlookup' requests the look-up of the corresponding InterPro
   references for the output and '-goterms' looks up the corresponding entries
   in GO (Gene Ontology).
   For more options/information about what iprscan is able to do, type:
   ./iprscan -cli -h

   NOTE: the -cli option is mandatory for command line usage. It tells the script
	 that it is being used on the command line and not as a CGI script.


   The program will prepare a temporary directory (something like
   'tmp/20041011/iprscan-20041011-11123456') where:
   iprscan-20041011-11123456 is the session directory name, in the format:

   iprscan : tool name
   20041011: date of the day (YYYYMMDD)
   11123456: time of the day (hhmmssNN), NN=random number
   (a small sketch of how this name can be built is given at the end of this step)

   When the scanning is finished the results are displayed on STDOUT
   unless you used the -o option to specify an output file for the results.
   In any case the raw file is written in the session directory; what is displayed
   on the screen is just a conversion of this raw file.
   To check that everything works correctly you can compare your results with
   the 'merged.raw' file included in the root of the distribution.

   % diff merged.raw tmp/20041011/iprscan-20041011-11123456/merged.raw
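
   A minimal sketch (illustration only, not the package's actual code) of how a
   session directory name of this form can be built in Perl:

       use strict;
       use warnings;
       use POSIX qw(strftime);

       my $date    = strftime('%Y%m%d', localtime);   # e.g. 20041011
       my $time    = strftime('%H%M%S', localtime);   # e.g. 111234
       my $nn      = sprintf '%02d', int rand 100;    # NN = random number
       my $session = "tmp/$date/iprscan-$date-$time$nn";
       print "$session\n";                            # e.g. tmp/20041011/iprscan-20041011-11123456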


=============
DATA Update:
=============
1) If you are updating one of these files:
   interpro.xml, match.xml, Pfam, sf_hmm, sf_hmm_sub, smart.HMMs, Gene3D.hmm,
   smart.desc (not public), smart.thresholds (not public), superfamily.hmm,
   TIGRFAMs_HMM.LIB or prodom.ipr, do the following:

   Just put the new files in the 'data/' directory, either preserving the original
   names or changing the names in the corresponding 'conf/APPLICATIONNAME.conf'
   files to point to the right new file.
   Then go to your bin directory and launch:

      ./index_data.pl -inx -iforce -bin -bforce

   -inx    : Indexes files.
   -iforce : Forces reindexing, removing old indexes first.
   -bin    : Converts flat files to binary files (speeds up hmm applications by up to 40%).
   -bforce : Forces conversion, removing old binary files first.
   -v      : Verbose mode.

   Type './index_data.pl -h' for more information and the list of supported files.
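
   For example, to reindex only the Pfam HMM library after replacing it (using the
   -f option described under 'Programs' below; the file name must be one of those
   listed by './index_data.pl -h'):

      ./index_data.pl -f Pfam -inx -iforce -bin -bforce -v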


===================================
Distributed DATA and Applications:
===================================
InterPro protein signature databases.
-------------------------------------

PROFILE  : ftp://ftp.isrec.isb-sib.ch/sib-isrec/profiles/prosite_prerelease.prf
PRODOM   : http://prodes.toulouse.inra.fr/prodom/current/html/download.php
InterPro : ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz          (InterPro database)
	   ftp://ftp.ebi.ac.uk/pub/databases/interpro/match.xml.gz             (IprMatches database)
Pfam     : ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Pfam_ls.gz
PRINTS   : ftp://ftp.bioinf.man.ac.uk/pub/fingerPRINTScan/printsXXX.pval_blos62.gz (XXX is the highest version available)
TIGRFAMs : ftp://ftp.tigr.org/pub/data/TIGRFAMs
PIR      : ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/interpro/new/
PANTHER  : https://panther.appliedbiosystems.com/downloads/ (we provide a compressed binary library).

* Will be integrated in next release
  GENE3D   : ftp://ftp.biochem.ucl.ac.uk/pub/cathdata/v2.5.1   (Jan 2004 latest)

**************************
******* not public *******
**************************

TMHMM (v2.0) : http://www.cbs.dtu.dk/services/TMHMM/        (under commercial license : contact software@cbs.dtu.dk)
SignalP v3.0 : http://www.cbs.dtu.dk/services/SignalP-2.0/  (under commercial license : contact software@cbs.dtu.dk)
SMART        : http://smart.embl-heidelberg.de/
SUPERFAMILY  : http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ (under free license)


Scanning applications.
----------------------

FingerPRINTScan : ftp://proline.sbc.man.ac.uk/pub/fingerPRINTScan/binaries/
ScanRegExp      : ftp://ftp.ebi.ac.uk/pub/software/unix/
Pfscan          : http://www.isrec.isb-sib.ch/ftp-server/pftools/
hmmpfam         : http://hmmer.wustl.edu/
hmmsearch       : http://hmmer.wustl.edu/
hmmconvert      : http://hmmer.wustl.edu/
NCBI Blast      : ftp://ftp.ncbi.nlm.nih.gov/blast/
EMBOSS tools    : http://emboss.sourceforge.net    (for seqret (sequence reformater) and sixpack (nucleic sequences translator))
Ncoils          : ftp://ftp.ebi.ac.uk/pub/software/unix/coils-2.2
Seg             : http://blast.wustl.edu/pub/seg/


=====================
Architecture review:
=====================
  As mentioned above, the Perl-based InterProScan was designed for bulk
  sequence analysis. The architecture does not have any internal
  limitations on the number of submitted sequences and has been tested on runs with
  more than 100 000 sequences. The general approach is to split the original
  input file into smaller parts with a pre-configured number of sequences in
  each (a so-called 'chunk').
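
  To make the chunking idea concrete, here is a minimal Perl sketch (illustration
  only; the real implementation in the package differs) that splits a multi-FASTA
  file into 'chunk_NN' directories of at most a given number of sequences:

      use strict;
      use warnings;
      use File::Path qw(mkpath);

      my $infile     = shift or die "Usage: $0 <fasta file> [chunk size]\n";
      my $chunk_size = shift || 50;                  # pre-configured number of sequences per chunk

      open my $in, '<', $infile or die "Cannot open $infile: $!";
      my ($count, $chunk, $out) = (0, 0, undef);
      while (my $line = <$in>) {
          if ($line =~ /^>/) {                       # start of a new sequence
              if ($count % $chunk_size == 0) {       # current chunk is full (or this is the first sequence)
                  $chunk++;
                  my $dir = sprintf 'chunk_%02d', $chunk;
                  mkpath($dir);
                  # the file name inside each chunk directory is arbitrary here
                  open $out, '>', "$dir/sequences.fasta" or die "Cannot write in $dir: $!";
              }
              $count++;
          }
          print {$out} $line if $out;
      }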

  InterProScan is more than a simple wrapper of protein sequence analysis
  applications: in addition, it performs considerable data look-ups
  in several databases and parses and retrieves program
  outputs. Each of the data description modules defines the data schema of
  the source text data and the parsing rules. The corresponding Perl
  module provides an object-oriented interface to the underlying entry
  attributes.
  The parsing of the output results happens only once and is done when all
  the applications have finished.
  The parsing of the source data into memory objects happens only once
  and is done upon request, implementing so-called lazy parsing.
  Hierarchical parsing rules are implemented using the recursive-descent
  approach (Parse-RecDescent package). Fast data retrieval is implemented
  using the Perl native B-trees indexing (DB_File.pm, based on Berkeley
  DB).
  The simple 'one Perl module per data source' organisation makes it
  possible to reuse the modules in other stand-alone ad-hoc solutions. The
  Perl-based InterProScan is capable of providing post-processed, integrated results
  in several formats and it can be used as a simple retrieval system for
  the underlying data.
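
  A minimal sketch (illustration only) of the kind of B-tree look-up described
  above, using DB_File; the index file name, key and stored value are arbitrary
  examples:

      use strict;
      use warnings;
      use Fcntl qw(O_RDWR O_CREAT);
      use DB_File;

      my %index;
      tie %index, 'DB_File', 'example.inx', O_RDWR | O_CREAT, 0644, $DB_BTREE
          or die "Cannot tie index file: $!";

      # store, e.g., the byte offset of an entry in the source flat file ...
      $index{'IPR008237'} = 123_456;

      # ... and later jump straight to it instead of re-parsing the whole file
      my $offset = $index{'IPR008237'};
      print "IPR008237 starts at byte $offset\n" if defined $offset;

      untie %index;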


========================
Implementation details:
========================
Each installation has the following directories:
* 'data' directory contains all databases and required indices.
   Run './index_data.pl -h' from the bin directory of your iprscan
   root directory. (see DATA Update);

* 'tmp' directory is used to store temporary user sessions and
   temporary job outputs.
   This tmp directory also contains another tmp directory, used by
   some applications to create temporary files during runs.
   Each session directory is created inside a directory named after the
   day on which the jobs were launched, so a new directory is created
   each day.

* 'bin' directory contains some Perl scripts and platform specific
   binaries of scanning programs.

* 'lib' directory contains all the Perl modules necessary for iprscan to work
  properly. The main core for launching jobs, checking jobs, parsing results and
  creating the results pages/output is located in the lib directory, under a package
  developed at EBI (Dispather::*). These packages are the main core of iprscan.
  It also contains an Index directory used to index databases and output results,
  based on the indexing method present in previous versions of iprscan.

* 'conf' directory contains configuration files for each database/application
  used. It also contains configuration files for the queueing systems, the translate/reformat
  tools, indexing and InterProScan itself.


  The job itself is started by the script bin/iprscan. This script is used as a command
  line script as well as a CGI script when the user has installed a web server to run InterProScan
  through a web interface. This script starts jobs by using another one, called iprscan_wrapper.pl,
  which launches the jobs for each application and takes care of them until they are all finished.
  It then launches the parsing of the results and creates the output results.
  In case of command line usage, the results are displayed on the standard output by
  default unless you use the -o option to redirect the output to a file (see 'Programs' below).
  A verbose mode is available that reports the status of the main program.

  If a job crashes, iprscan warns you (both on the command line and in the web interface) and produces
  a report file containing information about the crashed jobs. In the best case, iprscan gives you
  the reason why the job(s) crashed, so that you can try to fix the problem and restart the failed
  jobs. To do so, just run ResubmitJobs.pl (command line) or click 'Resubmit failed jobs' (web
  interface). InterProScan will restart only the failed jobs (from scratch) and concatenate the
  results with the previous ones.

==================================
Results filtering / Match status:
==================================
Method cut-offs:
----------------

  InterProScan is based on scanning methods native to the InterPro member
  databases. It is distributed with pre-configured method cut-offs
  recommended by the member database experts, which are believed to report
  relevant matches. All cut-offs are defined in configuration files (see 'conf'
  directory). Matches of Pfam and Smart signatures obtained with the fixed
  cut-off are subject to the following filtering.

* PFAM filtering:
  each Pfam HMM model has its own cut-off scores for each domain match and
  for the total model match. These bit score cut-offs are defined in the GA
  lines of the Pfam database. Initial results are obtained with a fairly high
  common cut-off, and then the matches (of the signature or of some of its domains)
  with a score lower than the family-specific cut-offs are dropped (see the sketch below).
  Additional filtering, based on Pfam clans and nested domains, has been implemented
  in release 4.1. Please check the Pfam website (http://www.sanger.ac.uk/Pfam).
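
  A minimal sketch of the bit score filtering described above (the GA values
  would come from the model's GA line, the scores from the hmmpfam output;
  names are illustrative):

      # returns 1 if the match survives the family-specific cut-offs, 0 otherwise
      sub keep_pfam_match {
          my ($sequence_score, $domain_score, $seq_ga, $dom_ga) = @_;
          return 0 if $sequence_score < $seq_ga;   # whole-model score below the gathering cut-off
          return 0 if $domain_score   < $dom_ga;   # this domain's score below the domain cut-off
          return 1;
      }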

* TIGRFAMs filtering:
  each TIGRFAMs HMM model has its own cut-off scores for each domain match and
  for the total model match. These bit score cut-offs are defined in the TC
  lines of the TIGRFAMs database. Initial results are obtained with a fairly high
  common cut-off, and then the matches (of the signature or of some of its domains)
  with a score lower than the family-specific cut-offs are dropped.

* PRINTS filtering:
  there is a test version of PRINTS family-specific p-value cut-offs.
  All matches with a p-value greater than p_min for the signature are dropped.

* SMART filtering:
  The publicly distributed version of InterProScan has a common e-value
  cut-off corresponding to the reference database size. A more
  sophisticated scoring model is used on the SMART web server and in the production of
  pre-calculated InterProMatches data. Exact scoring thresholds for domain
  assignments are proprietary data that can be obtained directly from the
  SMART team.
  [The InterProMatches data production procedure uses the additional
  smart.thresholds (note that the given cut-offs are e-values - the number
  of expected random hits - and are valid only in the context of the reference
  database size) and smart.desc data files (which are available from
  the SMART team) to filter out results obtained with a higher cut-off. It
  implements the following logic:
  1. If the E-value of the found match is higher than the 'cut_low' the
     match is dropped.
  2. If the E-value of the found match is higher than the 'family' cut-off
     it is reported as a family hit with unknown status.
  3. If the E-value of the found match is less than the 'family' cut-off and
     higher than the 'cutoff' it is reported as a family member with true
     status.
  4. If the 'family' cut-off is undefined and the E-value of the match is
     higher than the 'cutoff' but less than the 'cut_low' it is reported as a
     domain match with unknown status.
  5. If the E-value of the found match is less than the 'cutoff' it is
     reported as a domain match with true status.]
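
  A minimal sketch of the decision logic listed above (the thresholds come from
  the non-public smart.thresholds file; variable names and return values are
  illustrative):

      # returns undef if the match is dropped, otherwise [match type, status]
      sub smart_status {
          my ($evalue, $cutoff, $cut_low, $family) = @_;

          return undef if $evalue > $cut_low;                  # 1. match dropped

          if (defined $family) {
              return [ 'family', '?' ] if $evalue > $family;   # 2. family hit, unknown status
              return [ 'family', 'T' ] if $evalue > $cutoff;   # 3. family member, true status
          }
          else {
              return [ 'domain', '?' ] if $evalue > $cutoff;   # 4. domain match, unknown status
          }
          return [ 'domain', 'T' ];                            # 5. domain match, true status
      }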

* PROSITE patterns CONFIRMation:
  ScanRegExp is able to verify PROSITE matches using corresponding
  statistically significant CONFIRM patterns.
  The default status of the PROSITE matches is unknown (?) and the true
  positive (T) status is assigned if the corresponding CONFIRM patterns
  match as well. The CONFIRM patterns were generated based on the true
  positive SWISS-PROT PROSITE matches using eMOTIF software with a
  stringency of 10e-9 P-value.


==========
Programs:
==========
* Config.pl is provided to make the installation and reconfiguration of InterProScan
	    easy. Most of the prompts have some explanations and provide default
	    suggestions in [].
	    If you decide to change the perl path, queue system, queue name, etc.,
	    just restart Config.pl; it will overwrite the old information with the new one. Alternatively,
	    you can directly change the configuration files in the conf directory
	    at your own risk.

* iprscan [-i input_file] [-iprlookup [-goterms]] [-trtable num] [-trlen num] 
	  [-nocrc] [-appl application to run (default all)] [-email @] [-seqtype n|p]
	  [-format raw|xml|html|txt (default xml)] [-verbose] [-h] [-cli]
	  [-taxo ] [-txrule <0,1>]

	  is the program that initiates an InterProScan job. It creates a temporary
	  session directory and prepares all required infrastructure for the scanning.

	  1. the input file is checked to confirm FASTA format, reformatted to clean FASTA
	  and split into configured portions, each in its own 'chunk_NN' directory.
	  The number of sequences contained in each 'chunk_NN' directory can be changed
	  via the 'chunk' tag in the iprscan/conf/iprscan.conf file.

	  2. the session directory contains a parameters file which summarises the different
	  options the user entered (needed later during parsing) and also information about the sequences,
	  such as length, id and crc64 (checksum).
	  It also contains the original reformatted sequence file.

	  3. each 'chunk_NN' directory gets its own sequence file (.nocrc).

	  The options for this script are (a combined usage example follows the list):
	  ---------------------------------
	  -i	       Input sequence file. This file must exist and be readable.

	  -o           Output file where to write results (default stdout).

	  -iprlookup   Switch on look up of corresponding InterPro annotation

	  -goterms     Switch on look up of corresponding Gene Ontology annotation (require -iprlookup option)

	  -trtable num Are used for specifying the Translation Table code and the
	  and -trlen   transcript length threshold respectively, for nucleic acid to protein
		       sequence translation (based on CodonTable.pm by Heikki Lehvaslaiho).

          -appl        Application to use. Check the iprscan/conf/iprscan.conf file to see which applications
		       you have configured, or type './iprscan -cli -h'.

	  -nocrc       Do not perform a crc64 check on your protein sequence(s) before launching the
		       applications. Without this option, if all your sequences have a known crc64 according to the
		       match.xml file, then no applications are launched and the pre-calculated results are displayed.

	  -email       Specify an email address to which a notification is sent when the run is finished.

	  -format      Output format [raw, xml, txt, html] (default xml).

	  -seqtype     The type of the input sequences (dna/rna (n) or protein (p)).
	
	  -verbose     Displays status of the job.

	  -cli         Tell the script that it is being used in command line mode. This same script is also
		       used as a CGI script when the web interface is configured.

	  -taxo        Activate the taxonomy filter for abbreviated taxonomies (e.g.: -taxo Arthro -taxo Bact).
                       Possible values: Arabidopsis thaliana (AraTh), Archaea (Arch), Arthropoda (Arthro), Bacteria (Bact), Caenorhabditis elegans (CaeEl),
                                        Chordata (Chor), Cyanobacteria (Cyan), Eukaryota (Euka), Fruit Fly (FrFly), Fungi (Fung), Green Plants (GrePl),
                                        Human (Huma), Metazoa (Meta), Mouse (Mous), Nematoda (Nema), Other Eukaryotes (OthEuk), Plastid Group (PlasGrp),
                                        Rice spp (Rice), Saccharomyces cerevisiae (SacCer), Synechocystis PCC 6803 (Synec), Unclassified (Unclass), Virus (Vir)
	  -txrule <0,1> Decision rule applied when several taxonomy filters are given: 0 -> AND, 1 -> OR.

	  -help        Displays this help and exit.
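
	  For example, a run restricted to a single application (the name must match one
	  configured in iprscan/conf/iprscan.conf; 'hmmpfam' here is only an example),
	  writing raw output to a file and keeping only bacterial matches:

	  ./iprscan -cli -i ../test.seq -appl hmmpfam -format raw -o ../test.raw -taxo Bact -txrule 1 -iprlookup -goterms -verbose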


* meter.pl ../tmp/20040302/iprscan-20040302-12585481 reports the progress of a job in 'tmp/20040302/iprscan-20040302-12585481' session.
	   You have to give the full path to the session directory if you are in the bin directory; otherwise type something like
	   '../../bin/meter.pl iprscan-20040302-12585481' if you are located in the iprscan/tmp/20040302 directory.

* index_data.pl [-f  -f <..> ... ] [-inx [-iforce]] [-bin [-bforce]] [-h] [-v]
		 checks and updates all required indices (see DATA Update). This script is also used to format the ProDom database needed
		 by blastall, using the formatdb binary located in 'iprscan/bin/binaries/YOURPLATFORM/blast/'.

		 The option for this script are:
		 -------------------------------
		 index_data.pl [-f  -f <..> ... ] [-inx [-iforce]] [-bin [-bforce]] [-h] [-v]

		 -f      file(s) to index. By default indexes all the required files. Type ./index_data.pl -h
			  to get the list of the supported files.
		 -inx    Index given files.
		 -iforce Force the script to reindex files even if they are up to date. Needed most of the time when
			  updating data. It first removes the old index and recreates a new one.
	         -bin    Convert an ascii hmm library file to a binary file. Can speed up the hmmpfam search by up to 40%.
		 -bforce Force the conversion to a binary file even if one already exists. Needed most of the time during
			  data updates.
		 -v      Verbose mode, prints information during indexing.
		 -h      Displays this help and exits.


* converter.pl -format  -input  -jobid   > output_file
	       Is used to reformat results from raw into [html, xml, ebixml or txt] format.
	       NOTE: the ebixml format just adds an EBI header on top of the xml file.


* iterator.pl -i  -o  -c  [-h display header]
	       Iterator reads FASTA sequences from the input one at a time and
	       executes the command for each sequence. The command must print its output
	       on stdout. Use the tag %infile on the command line to specify
	       the location of the input file (see the example below).
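
	       For example (illustration only; 'my_scanner' stands for any program that
	       reads one FASTA file and prints its result on stdout):

	       ./iterator.pl -i ../test.seq -o results.txt -c "my_scanner %infile"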

* ResubmitJobs.pl -r  -h -v [0,1,2]
		  This script is used to relaunch failed jobs that occurred during a run.
		  If jobs crashed, InterProScan (if the errors file is not empty) creates
		  a report with information about the chunk, the application that failed and,
		  if available, the reason.
		  The report file is then used to relaunch the failed jobs ONLY.
		  The user/administrator must fix the errors first before using this script,
		  otherwise InterProScan cannot restart the failed applications and will recreate
		  the same report, and so on.
		  When the errors are fixed, use this script as mentioned above.

		  The options for this script are:
		  --------------------------------

		  -r Path to the report file.
		     (e.g. /path/to/iprscan/tmp/20040302/iprscan-20040302-12355481/iprscan-20040302-12355481.report)
		  -h Displays this help and exit.
		  -v [0,1,2] Verbose mode with several levels:
		     0 no information at all (like without the -v option).
		     1 only prints the main actions.
		     2 prints everything the script is doing.
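
		  For example, after fixing the reported problem:

		  ./ResubmitJobs.pl -r /path/to/iprscan/tmp/20040302/iprscan-20040302-12355481/iprscan-20040302-12355481.report -v 1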


* iprscan_wrapper.pl < iprscan-xxxx-xxxx.params
		     This script is used by iprscan to launch, check, and parse the results of all jobs.
		     You should not need to use this script yourself.


============
What's new:
============
since v1.x
----------
* crc64 calculation for FastaSeq object (raw format changed: seq_CRC64
and seq_Length fields inserted after seq_ID);
* match status handling (raw format changed: status mark inserted after
match location, which is considered to be true (T) unless parser reports
something);
* NULL reference to InterPro reported if the corresponding Interpro
entry was not found;
* PRODOM cut-off fixed (E-value/dbsize are frozen);
* cleanup of Blast core dumps (into 'blast.core') on low complexity
sequences;
* FPrintScan partial match extrapolation trimmed to the submitted
sequence length;
* Smart HMM search;
* [Smart family specific cut-off filtering (data is not distributed);]
* progress report (see bin/meter.pl);
* indexer (allows indexing of a specified attribute of an available database);
* html output split into chunks;
* query html form;
* [getit.pl regexpr querying (commented out in 'lib/utils.ph' since it
can cause problems returning ambiguous results)]

since v2.0
----------
* restructured CONFIG.pl
* parsers and data updated to InterPro release v3.1

since v2.1
----------
* [Coiled-Coil search & display;]
* [TMHMM search & display (not distributed);]
* [SignalP search & display (not distributed);]
* Interpro2GO parsing;
* raw format changed (with +ipr switch): GO terms separated by ';' added
  at the end of lines;
* converter.pl - shows GO in txt & html formats;
* TIGRFAMs search;
* TIGRFAMs family specific cut-offs
* scores parsing;
* implemented PRINTS family specific cut-offs for FingerPRINTScan;
* 6 frame translation for nucleic acid sequence input added.

since v3.1
----------
* Data updated to InterPro release v5.1
* Data updated to InterPro release v5.2
* Data updated to InterPro release v5.3
* Data updated to InterPro release v6.0
* Data updated to InterPro release v6.1

since v3.2
----------
* Data updated to InterPro release v7.1.
* Support for https servers/urls.
* PIR module for PIR search.
* SuperFamily module for SuperFamily search. Using data version 1.63.
  You can choose to use hmmpfam or hmmsearch by uncommenting lines in the lib/superfamily.pm
* IprMatches module for CRC64 check on proteic sequences.
* Options available specifying '-' for InterProScan.pl instead of '+'.
* Possibility to choose specific database to launch on command line with InterProScan.pl
* Possibility to rescale the html output to a 700 pixels width (should fit a standard browser window, I hope so...).
* Possibility to add a specific name to your temporary session directory (command line and web interface).
  See the InterProScan.pl help by typing 'InterProScan.pl -h' for more information.
* Display of the InterPro classification including the InterPro parent, children, found_in and
  contain entries.
* Support for OpenPBS and SGE (Sun Grid Engine) queuing systems.
* New BlastProdom using xml parser (Expat 1.95.5) and blastp (2.2.6). Uses data from ProDom release 2003.1
  WARNING (MacOSX) : The xml parser was compiled on OSX 10.3 without '-static' option. 
  If it does not work you will probably have to install the expat (1.95.5) library on your machine.
* Hmmer version 2.3.2(use of precompiled binary from hmmer web site)
  SunOS hmmer binaries (hmmpfam and hmmsearch) have been compiled on a Solaris 8 machine. The native Solaris
  cc compiler was used to compile the sources as follows:
  ./configure --enable-threads --enable-lfs
  make -Bstatic -dn 
* New ProfileScan method using ps_scan.pl script. The output is in gff format and parsed by ProfileScan parser.

since 3.3
----------
* Rewritten from scratch.
* Data updated to InterPro release v8.1
* New implementation for PIRSF.
* Got rid of the gmake utility. It is pure Perl now. We tried to be as explicit as possible
  when an error occurs during a run. Some cases might have been forgotten, and we apologise
  for that. Any bugs or problems you encounter should be reported by email to:
  interhelp@ebi.ac.uk
* New indexing script. It can also convert an hmm library to a binary file, which can speed up
  execution time by 30-40% (measured on a single-processor machine).
  See 'DATA Update' for more information.
* Ability to resubmit failed jobs (both command line and web interface).
* You can query each database with keywords; the entry is displayed if found (web interface only).
  Check query page to know what kind of query you can do.
* You can check the status of your indexed data (web interface only).
* Indexing process improved. Much faster!
* HTML view redesigned. Now it is divided in two different pages.
  The default page (picture view) is the cartoon representation showing domains on the sequence.
  The other one (table view) displays information about the matches themselves, such as
  start, stop, e-value, InterPro hierarchy and GO terms.
* Possibility to launch jobs in two different ways (web interface only) :
      - 'parallel'      : not true parallel processing; it means that all the jobs for every
		          chunk are launched in one go, so you have to wait until all the jobs are
		          completed to see the results.
      - 'alternatively' : the jobs are launched chunk by chunk. This allows you to see the
			  results for the first chunk as soon as it is completed, while the others
			  are not yet finished. A convenient method for very long jobs and impatient people :-).
* Configuration of iprscan, the applications and the queueing systems is more flexible and easier. Check the FAQ
  to learn more about it.
* Added a new format for converter.pl.
* Documentation for the Perl scripts and modules in html and man formats (hopefully clear enough).
* Possibility to use SignalP versions from 1.x up to the latest one (3.0).
* And still all the same features as version 3.3!

since v4.0
----------
* Data updated to InterPro release v10.0 (5th anniversary!!)
* Bug fixing.
* Introduced a generic shell script to replace the 'ls' command used by local installations to check job status.
  We no longer have problems with language or user settings making the ls output differ from what
  is expected.
* Users can now filter their results using the taxonomy information found in the interpro.xml file for each InterPro entry.
  Two decision rules can be applied to this search: AND or OR.
* Integration of the PANTHER and GENE3D member databases.
* Users can set the number of CPUs to be used by applications running the hmmpfam/hmmsearch binaries (HMM model based).
* Users can also configure the permissions of the temporary sessions they create (tag 'usermode' in iprscan.conf).
* Improvement of the indexing modules. Fields to index can now be passed as an array reference. See iprscan/bin/index_data.pl and the documentation for Index.pm (doc/[html|man|text]).
* Querying flat files via the web interface is now case-insensitive.
* Precompiled seqret & sixpack binaries from EMBOSS 9.0 for all platforms. See iprscan/bin/binaries/README for the options
  that were used to compile them.
* Possibility to launch jobs 'alternatively' or in 'parallel' (see section 'since 3.3' above) from the command line.
* Retrieve data and application results from indexed flat files using wget.pl on the command line. NEW!
* New Pfam filtering for Clan and nested domains.
* Taxonomy filtering (based on InterPro taxonomy tags). It allows the user to display only hits for a particular species. Check 'Programs: iprscan' above
  for the list of available species.
  
Any comments and suggestions are very welcome (interhelp@ebi.ac.uk).


============
References:
============
1. The InterPro Consortium (*R.Apweiler, T.K.Attwood, A.Bairoch,
   A.Bateman, E.Birney, M.Biswas, P.Bucher, L.Cerutti, F.Corpet, M.D.R.Croning,
   R.Durbin, L.Falquet, W.Fleischmann, J.Gouzy, H.Hermjakob, N.Hulo, I.Jonassen,
   D.Kahn, A.Kanapin, Y.Karavidopoulou, R.Lopez, B.Marx, N.J.Mulder, T.M.Oinn,
   M.Pagni, F.Servant, C.J.A.Sigrist, E.M.Zdobnov).
   "The InterPro database, an integrated documentation resource for protein families,
   domains and functional sites."
   Nucleic Acids Research, 2001, 29(1): 37-40.

2. Hofmann K., Bucher P., Falquet L., and Bairoch A.
   "The Prosite Database, Its Status in 1999."
   Nucleic Acids Res, 1999, 27(1): 215-9.

3. Attwood T.K., Croning M.D., Flower D.R., Lewis A.P., Mabey J.E., Scordis P., Selley J.N., and Wright W.
   "Prints-S: The Database Formerly Known as Prints."
   Nucleic Acids Res, 2000, 28(1): 225-7.

4. Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L.
   "The Pfam Protein Families Database."
   Nucleic Acids Res, 2000, 28(1): 263-6.

5. Corpet F., Gouzy J., and Kahn D.
   "Recent Improvements of the Prodom Database of Protein Domain Families."
   Nucleic Acids Res, 1999, 27(1): 263-7.

6. Schultz J., Copley R.R., Doerks T., Ponting C.P., and Bork P.
   "Smart: A Web-Based Tool for the Study of Genetically Mobile Domains". 
   Nucleic Acids Res, 2000, 28(1): 231-4.

7. Bucher P., Karplus K., Moeri N., and Hofmann K.
   "A Flexible Motif Search Technique Based on Generalized Profiles."
   Comput Chem, 1996, 20(1): 3-23. 

8. Scordis P., Flower D.R., and Attwood T.K.
   "Fingerprintscan: Intelligent Searching of the Prints Motif Database>"
   Bioinformatics, 1999, 15(10): 799-806.

9. Eddy S.R.
   "Profile Hidden Markov Models."
   Bioinformatics, 1998, 14(9): p. 755-63.

10. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J.
    "Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs."
    Nucleic Acids Res, 1997, 25(17): p. 3389-402.

11. Haft,D.H., Loftus,B.J., Richardson,D.L., Yang,F., Eisen,J.A., Paulsen,I.T., White,O.
    "TIGRFAMs: a protein family resource for the functional identification of proteins."
    Nucleic. Acids. Res, 2001, 29 (1):41-3

12. Eddy, S.R. 
    "HMMER: Profile hidden Markov models for biological sequence analysis".
     WWW, 2001. http://hmmer.wustl.edu/

13. Cathy H. Wu, Hongzhang Huang, Lai-Su L. Yeh, Winona C. Barker
    "Protein family classification and functional annotation."
    Computational Biology and Chemistry, 2003, 27: 37-47.

14. Gough, J., Karplus, K., Hughey, R. and Chothia, C.
    "Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models
    that represent all Proteins of Known Structure."
    J. Mol. Biol., 2001, 313(4): 903-919.

15. Buchan D., Pearl F., Lee D., Shepherd A., Rison S., Orengo C., and Thornton J.
    "Gene3D: Structural assignments for whole genes and genomes using the CATH domain structure database."
    Genome Research, 12(3): 503-514.

16. Huaiyu Mi, Betty Lazareva-Ulitsky, Rozina Loo, Anish Kejariwal, Jody Vandergriff, Steven Rabkin, Nan Guo,
    Anushya Muruganujan, Olivier Doremieux, Michael J. Campbell, Hiroaki Kitano and Paul D. Thomas
    "The PANTHER database of protein families, subfamilies, functions and pathways."
    Nucleic Acids Research, 2005, Vol. 33, Database issue D284-D288

============
How to cite:
============
    Zdobnov E.M. and Apweiler R.
    "InterProScan - an integration platform for the signature-recognition methods in InterPro."
    Bioinformatics, 2001, 17(9): 847-8.