#===================================================
#>      InterProScan Perl-based version 4.2
#>
#> $Id: README.txt,v 1.5 2005/11/28 15:48:01 hunter Exp $
#>
#>README file.
#>
#>Authors:  Sarah Hunter <hunter@ebi.ac.uk>
#>          Emmanuel Quevillon <tuco@ebi.ac.uk>
#>          Ville Silventoinen <vsi@ebi.ac.uk>
#>
#>Acknowledgments: Florence Servant <florence.servant@mcgill.ca>
#>                 Evgueni Zdobnov  <evgueni.zdobnov@embl-heidelberg.de>
#>
#>
#>Copyright: EMBL-EBI 2004
#>
#>url : http://www.ebi.ac.uk/interpro
#>
#===================================================

==========
>Contents:
==========
1. Introduction to InterPro
2. InterPro member databases and scanning methods
3. InterProScan
       3.1 Input
       3.2 Output
4. Stand-alone InterProScan
       4.1 Availability
       4.2 System Requirements
       4.3 Installation and Update
       4.4 DATA and applications distributed with InterProScan
5. In-depth
       5.1 Features
       5.2 Architecture review
       5.3 Implementation details
       5.4 Configuration files
       5.5 Results filtering / Match status
       5.6 Programs
6. References
7. How to cite


==========================
>Introduction to InterPro:
==========================

Databases of protein domains and functional sites have become vital resources for
the prediction of protein functions. During the last decade, several signature-
recognition methods have evolved to address different sequence analysis problems,
 resulting in rather different and, for the most part, independent databases.
Diagnostically, these resources have different areas of optimum application owing
to the different strengths and weaknesses of their underlying analysis methods.
Thus, for best results, search strategies should ideally combine all of them.

InterPro ([1]) is a collaborative project aimed at providing an integrated layer
 on top of the most commonly used signature databases by creating a unique, non-
redundant characterisation of a given protein family, domain or functional site.
The InterPro database integrates PROSITE ([2]), PRINTS ([3]), Pfam ([4]), ProDom
([5]), SMART ([6]), TIGRFAMs ([11]), PIR superfamily ([13]), SUPERFAMILY ([14]),
Gene3D ([15]) and PANTHER ([16]) databases and the addition of others is
scheduled. InterPro data is distributed in XML format and it is freely available
under the InterPro Consortium copyright. The InterPro project home page is
available at http://www.ebi.ac.uk/interpro.

Any queries should be emailed to interhelp@ebi.ac.uk.


================================================
>InterPro member databases and scanning methods:
================================================

* PROSITE patterns.
 Some biologically significant amino acid patterns can be summarised in the form
 of regular expressions.
 ScanRegExp (by Wolfgang.Fleischmann@ebi.ac.uk).

* PROSITE profiles.
 There are a number of protein families as well as functional or structural
 domains that cannot be detected using patterns due to their extreme sequence
 divergence, so the use of techniques based on weight matrices (also known as
 profiles) allows the detection of such proteins or domains.  A profile is a
 table of position-specific amino acid weights and gap costs.  The profile
 structure used in PROSITE is similar to but slightly more general (Bucher P.
 et al., 1996 [7]) than the one introduced by M. Gribskov and co-workers.
 pfscan from the Pftools package (by Philipp.Bucher@isrec.unil.ch).

* PRINTS.
 The PRINTS database houses a collection of protein family fingerprints. These
 are groups of motifs that together are diagnostically more powerful than single
 motifs by making use of the biological context inherent in a multiple-motif
 method. The fingerprinting method arose from the need for a reliable technique
 for detecting members of large, highly divergent protein super-families.
 FingerPRINTScan (Scordis P. et al., 1999 [8]).

* PFAM.
 Pfam is a database of protein domain families. Pfam contains curated multiple
 sequence alignments for each family and corresponding hidden Markov models
 (HMMs) (Eddy S.R., 1998 [9]).  Profile hidden Markov models are statistical
 models of the primary structure consensus of a sequence family. The
 construction and use of Pfam is tightly tied to the HMMER software package.
 hmmpfam from the HMMER2.3.2 package
 (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* PRODOM.
 ProDom is a database of protein domain families obtained by automated analysis
 of the SWISS-PROT and TrEMBL protein sequences. It is useful for analysing the
 domain arrangements of complex protein families and the homology relationships
 in modular proteins. ProDom families are built by an automated process based
 on a recursive use of PSI-BLAST homology searches.
 ProDomBlast3i.pl (by Emmanuel Courcelle emmanuel.courcelle@toulouse.inra.fr
                       and Yoann Beausse beausse@toulouse.inra.fr)
 (it is a wrapper for the Blast package (Altschul S.F. et al., 1997 [10])).

* SMART.
 SMART (a Simple Modular Architecture Research Tool) allows the identification
 and annotation of genetically mobile domains and the analysis of domain
 architectures. These domains are extensively annotated with respect to phyletic
 distributions, functional class, tertiary structures and functionally important
 residues. SMART alignments are optimised manually, followed by construction
 of the corresponding hidden Markov models (HMMs).
 hmmpfam from the HMMER2.3.2 package
 (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* TIGRFAMs.
 TIGRFAMs are a collection of protein families featuring curated multiple
 sequence alignments, Hidden Markov Models (HMMs) and associated information
 designed to support the automated functional identification of proteins by
 sequence homology. Classification by equivalog family (see below), where
 achievable, complements classification by orthologs, superfamily, domain or
 motif. It provides the information best suited for automatic assignment of
 specific functions to proteins from large scale genome sequencing projects.
 hmmpfam from the HMMER2.3.2 package
 (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* PIR SuperFamily.
 PIR SuperFamily (PIRSF) is a classification system based on evolutionary
 relationship of whole proteins.
 hmmpfam from the HMMER2.3.2 package
 (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* SUPERFAMILY.
 SUPERFAMILY is a library of profile hidden Markov models that represent all
 proteins of known structure, based on SCOP.
 hmmpfam/hmmsearch from the HMMER2.3.2 package
 (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* GENE3D.
 Gene3D is supplementary to the CATH database. This protein sequence database
 contains proteins from complete genomes which have been clustered into protein
 families and annotated with CATH domains, Pfam domains and functional
 information from KEGG, GO, COG, Affymetrix and STRINGS.
 hmmpfam from the HMMER2.3.2 package
 (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).

* PANTHER.
 The PANTHER (Protein ANalysis THrough Evolutionary Relationships)
 Classification System was designed to classify proteins (and their genes) in
 order to facilitate high-throughput analysis.
 hmmsearch from the HMMER2.3.2 package
 (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
 blastall from the Blast package (Altschul S.F. et al., 1997 [10]).

 Optionally, predictions for coiled-coil, signal peptide cleavage sites
 (SignalP v3) and TM helices (TMHMM v2) are supported (See the FAQs file
 for details of how to set these up).
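
 The PROSITE patterns entry above notes that such patterns can be summarised as
 regular expressions. The following is a minimal sketch of that idea in Python.
 It is not the ScanRegExp implementation; the function name prosite_to_regex is
 hypothetical, and only the common syntax elements are handled (x for any
 residue, [..] for allowed residues, {..} for excluded residues, (n) and (n,m)
 for repeats, < and > for sequence ends).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Illustrative only: rarer syntax (e.g. '>' inside square brackets)
    is not handled.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        at_start = element.startswith("<")   # anchored to the N-terminus
        at_end = element.endswith(">")       # anchored to the C-terminus
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        if repeat:
            core += "{" + repeat + "}"
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
hit = re.search(regex, "AACPECGKAFSRSDELTRHIRTHTG")
print(regex)
print(hit.start() + 1, hit.end())  # 1-based start and end of the match
```

 Reporting 1-based start and end positions mirrors the way match locations
 are given in the raw output described later in this file.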


==============
>InterProScan:
==============

InterProScan is a tool that combines different protein signature recognition
methods into one resource. The number of signature databases and their associated
scanning tools, as well as the further refinement procedures, increases the
complexity of the problem.

InterProScan is more than just a simple wrapping of sequence analysis
applications since it also performs a considerable amount of data look-up
from various databases and program outputs.  The Perl-based InterProScan is
intended to be an extensible and scalable system optimised to cope with bulk data
processing. The need for production scale efficiency and easy extensibility
requires a robust and efficient (parallel) internal architecture that can
benefit from network-distributed computing with the support of UNIX queuing
systems.

In the package a Perl-based simple data retrieval system is used in order to
provide the required data look-up efficiency and extensibility.

There are two ways you can use InterProScan, either via the EBI website 
(http://www.ebi.ac.uk/InterProScan/ - note the maximum number of protein 
sequences you may submit is 10 and nucleotide is 1) or by downloading  and 
installing it locally on your computer.  InterProScan can run stand-alone via 
a web user interface (GUI), via the command-line or via SRS.

============
>Input:
============

InterProScan can take either nucleotide or protein sequences in a recognised
sequence format (such as raw, FASTA or EMBL).  It will reformat and, if necessary,
translate the sequences before beginning its search tasks.  If raw format (free text)
is used, each sequence will be given the name "Sequence_n" by default, where n is the
order in which it appeared in the input.

Nucleotide sequences will be translated and scanned in all 6 frames without
any further assumptions except the transcript length cut-off (orfminsize)
and/or the codon translation table of the EMBOSS sixpack tool:

     0         (Standard)
     1         (Standard (with alternative initiation codons))
     2         (Vertebrate Mitochondrial)
     3         (Yeast Mitochondrial)
     4         (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
     5         (Invertebrate Mitochondrial)
     6         (Ciliate Macronuclear and Dasycladacean)
     9         (Echinoderm Mitochondrial)
     10        (Euplotid Nuclear)
     11        (Bacterial)
     12        (Alternative Yeast Nuclear)
     13        (Ascidian Mitochondrial)
     14        (Flatworm Mitochondrial)
     15        (Blepharisma Macronuclear)
     16        (Chlorophycean Mitochondrial)
     21        (Trematode Mitochondrial)
     22        (Scenedesmus obliquus)
     23        (Thraustochytrium Mitochondrial)

If you wish to use more sophisticated protein sequence predictions, you can replace
or modify the conf/sixpack.sh script and edit the translation command line
(translate.cmd) in the stand-alone version's conf/iprscan.conf file.  Please note
that any non-standard single-letter amino acid codes (such as an asterisk *,
signifying a stop codon) can cause problems when running the software.
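
To make the 6-frame translation step concrete, here is a minimal Python sketch
using the standard genetic code only. It is not how sixpack works internally:
the names six_frame_orfs and min_size are hypothetical, the cut-off is assumed
to be counted in amino acids, and translations are simply split at stop codons
so that no asterisk ends up in the resulting protein sequences.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_orfs(dna, min_size=1):
    """Return stop-free peptides from all 6 frames, at least min_size long."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]   # reverse complement strand
    orfs = []
    for strand in (dna, reverse):
        for offset in range(3):
            protein = translate(strand[offset:])
            # Split at stop codons ('*') and apply the length cut-off.
            orfs.extend(p for p in protein.split("*") if len(p) >= min_size)
    return orfs


print(six_frame_orfs("ATGGCCTAAATG", min_size=2))
```

A real translation step would also need to honour the selected codon table
from the list above; this sketch hard-codes the standard table for brevity.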

=============
>Output:
=============

During a run, the program prepares a temporary directory (something like
'tmp/20041011/iprscan-20041011-11123456'), where 20041011 is today's date and
iprscan-20041011-11123456 is the session directory name. The directory name is
automatically generated to be unique and consists of "iprscan-" followed by the
date (YYYYMMDD), a hyphen, the time of the day (hhmmss) and a 2-digit random
number (NN).
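
If you need to work with session directories in your own scripts, the naming
scheme described above can be decoded as in the following Python sketch. The
function name parse_session_name is hypothetical and the pattern is derived
only from the description and examples in this README.

```python
import re
from datetime import datetime

# "iprscan-" + YYYYMMDD + "-" + hhmmss + 2-digit random number
SESSION_RE = re.compile(r"iprscan-(\d{8})-(\d{6})(\d{2})")


def parse_session_name(name):
    """Return (start time, random suffix) for a session directory name."""
    match = SESSION_RE.fullmatch(name)
    if match is None:
        raise ValueError("not an iprscan session directory name: %r" % name)
    date, time, suffix = match.groups()
    started = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return started, int(suffix)


started, suffix = parse_session_name("iprscan-20041011-11123456")
print(started.isoformat(), suffix)
```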

When the scanning is finished, the results are written to STDOUT unless you
used the -o option on the command line to specify an output file.

InterProScan makes results available in five formats {raw ebixml xml txt html}:

* raw format
       - is a basic tab-delimited format useful for uploading the data into a
         relational database or concatenation of different runs.
       - reports each match on a single line.
       - Example here (with descriptions):

--------------------------------------------------------------------------------
NF00181542      0A5FDCE74AB7C3AD        272     HMMPIR  PIRSF001424     Prephenate dehydratase  1       270     6.5e-141        T       06-Aug-2005         IPR008237       Prephenate dehydratase with ACT region  Molecular Function:prephenate dehydratase activity (GO:0004664), Biological Process:L-phenylalanine biosynthesis (GO:0009094)


       Where: NF00181542:             is the id of the input sequence.
              0A5FDCE74AB7C3AD:       is the crc64 (checksum) of the protein sequence (supposed to be unique).
              272:                    is the length of the sequence (in AA).
              HMMPIR:                 is the analysis method launched.
              PIRSF001424:            is the member database entry for this match.
              Prephenate dehydratase: is the member database description for the entry.
              1:                      is the start of the domain match.
              270:                    is the end of the domain match.
              6.5e-141:               is the e-value of the match (reported by the member database analysis method).
              T:                      is the status of the match (T: true, ?: unknown).
              06-Aug-2005:            is the date of the run.
              IPR008237:              is the corresponding InterPro entry (if iprlookup requested by the user).
              Prephenate dehydratase with ACT region:                           is the description of the InterPro entry.
              Molecular Function:prephenate dehydratase activity (GO:0004664):  is the GO (gene ontology) description for the InterPro entry.
--------------------------------------------------------------------------------
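
Because the raw format is tab-delimited with one match per line, it is easy to
load into your own tools. The Python sketch below splits one raw line into the
fields listed above. The class and field names (RawMatch, seq_id and so on)
are hypothetical, and the last three columns are treated as optional on the
assumption that they are absent when no InterPro look-up was requested.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawMatch:
    seq_id: str
    crc64: str
    length: int
    method: str
    member_id: str
    member_desc: str
    start: int
    end: int
    evalue: str        # kept as text in case a method reports no number
    status: str
    run_date: str
    interpro_id: Optional[str] = None
    interpro_desc: Optional[str] = None
    go_terms: Optional[str] = None


def parse_raw_line(line):
    """Split one tab-delimited raw output line into a RawMatch record."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return RawMatch(f[0], f[1], int(f[2]), f[3], f[4], f[5],
                    int(f[6]), int(f[7]), f[8], f[9], f[10], *f[11:14])


example = "\t".join([
    "NF00181542", "0A5FDCE74AB7C3AD", "272", "HMMPIR", "PIRSF001424",
    "Prephenate dehydratase", "1", "270", "6.5e-141", "T", "06-Aug-2005",
    "IPR008237", "Prephenate dehydratase with ACT region",
    "Molecular Function:prephenate dehydratase activity (GO:0004664)",
])
match = parse_raw_line(example)
print(match.method, match.member_id, match.start, match.end, match.interpro_id)
```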

* xml format
       - is a self descriptive computer readable format compatible with the
         distribution XML format of InterProMatches.

* ebixml format
       - is the xml format with an EBI header describing applications' databases
         and methods.

* txt format
       - is a condensed plain text representation of the results.

* html format
       - conforms to the html3 standard and is viewable in web browsers.
       - provides a graphical representation of the identified matches and a
         'Table View' where hits are reported without any cartoons but with the
         evalue, range and status of the matches (you can also get this info by
         mouse-over of the cartoon).
       - hyperlinks to the corresponding InterPro entries, the signature
         entries of the InterPro member databases, the scanned protein
         sequences, the original output of the underlying applications
         and the applications' home pages.
       - shows the InterPro entries and descriptions, InterPro hierarchy
         (parents, children, contains and found in) and the GO terms annotation.

==========================
>Stand-alone InterProScan:
==========================

InterProScan is available for running via the EBI website
(http://www.ebi.ac.uk/interpro/).  Alternatively, you can download a stand-
alone version to your local server and run it there.


==============
>Availability:
==============

InterProScan and the underlying applications are freely available under
the GNU licence agreement from the EBI's ftp server
(ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/).


=====================
>System requirements:
=====================

* The InterProScan package has been developed in Perl5 under UNIX. A list of Perl
 modules required to run InterProScan is in the Installation notes on the FTP
 site.

* Binaries of signature recognition methods are provided for the following UNIX
 platforms: SGI IRIX64, Mac Darwin 10.2, Linux PC, DEC Alpha, Solaris/Sparc and
 AIX6.5

* The installation step implies that you are able to execute such
 commands as 'ls', 'pwd', 'rsh', 'uname'.

* The full installation (with binaries & data for all platforms) takes about
 9 Gb of disk space.

* For distributed computing:
       - Must be able to rsh to hosts being used
       - Installation should be on a shared filesystem (e.g. over NFS)
         accessible from all hosts that you are going to use.

* Queueing systems currently supported: LSF 4.2, Sun GridEngine 6, PBS

* Benchmarking information:
Crude benchmarking was done for InterProScan running on P50750|CDK9_HUMAN
(327aa). This will be repeated for each release so that users know what to
expect as far as performance goes.
Machine specifications: HP Compaq with 2x Pentium 4 CPUs (3.2GHz); 512Mb RAM.
InterProScan was run on 1 CPU; each program was run separately.

	Program Name    Speed in v4.2 (s)
	--------------  -----------------
	HMMPfam         68
	HMMPanther      12
	HMMPIR          21
	blastprodom     18
	coils           7
	gene3d          18
	HMMSmart        7
	HMMTigr         23
	FPRINTScan      12
	scanregexp      7
	profilescan     12
	superfamily     38
	seg             6
	signalp         7
	tmhmm           7
	--------------  -----------------
	total           4m23s
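
As a quick check of the total, the per-program timings above can be summed
with a few lines of Python. The numbers are copied from the table; nothing
else is assumed.

```python
# Per-program run times in seconds, copied from the benchmark table.
timings = {
    "HMMPfam": 68, "HMMPanther": 12, "HMMPIR": 21, "blastprodom": 18,
    "coils": 7, "gene3d": 18, "HMMSmart": 7, "HMMTigr": 23,
    "FPRINTScan": 12, "scanregexp": 7, "profilescan": 12,
    "superfamily": 38, "seg": 6, "signalp": 7, "tmhmm": 7,
}

total = sum(timings.values())
minutes, seconds = divmod(total, 60)
print("%d s = %dm%02ds" % (total, minutes, seconds))  # 263 s = 4m23s
```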

=========================
>Installation and Update:
=========================

For detailed instructions on how to install InterProScan locally, please read
the Installation instructions, on the ftp site here:
ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/Installing_InterProScan.txt

InterProScan and InterPro version numbers are not related in any way. The only time 
the version number of InterProScan changes is if there has been a change in the 
underlying program code.  

There are InterPro xml files available for download from the FTP site
which are currently updated every few months or so and will likely update more 
frequently in the future.  Download them and put them into the data directory of 
your InterProScan installation and you will have the most up-to-date data available. 

You can update the member database information whenever you want.  The tarballs on the
FTP site will update whenever InterPro does.

=====================================================
>DATA and Applications distributed with InterProScan:
=====================================================

InterPro protein signature databases.
-------------------------------------

PROFILE  : ftp://ftp.isrec.isb-sib.ch/sib-isrec/profiles/prosite_prerelease.prf
PRODOM   : http://prodes.toulouse.inra.fr/prodom/current/html/download.php
InterPro : ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml.gz          (InterPro database)
        : ftp://ftp.ebi.ac.uk/pub/databases/interpro/match.xml.gz             (IprMatches database)
Pfam     : ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Pfam_ls.gz
PRINTS   : ftp://ftp.bioinf.man.ac.uk/pub/fingerPRINTScan/database/printsXXX.pval_blos62.gz (XXX is the latest version number)
TIGRFAMs : ftp://ftp.tigr.org/pub/data/TIGRFAMs
PIR      : ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/interpro/new/
PANTHER  : https://panther.appliedbiosystems.com/downloads/ (we provide a compressed binary library).
GENE3D   : ftp://ftp.biochem.ucl.ac.uk/pub/
******* NOT PUBLIC *******
TMHMM (v2.0) : http://www.cbs.dtu.dk/services/TMHMM/        (under commercial license : contact software@cbs.dtu.dk)
SignalP v3.0 : http://www.cbs.dtu.dk/services/SignalP/  (under commercial license : contact software@cbs.dtu.dk)
SMART        : http://smart.embl-heidelberg.de/
SUPERFAMILY  : http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ (under free license)
**************************

Scanning applications.
----------------------

FingerPRINTScan : ftp://proline.sbc.man.ac.uk/pub/fingerPRINTScan/binaries/
ScanRegExp      : ftp://ftp.ebi.ac.uk/pub/software/unix/
Pfscan          : http://www.isrec.isb-sib.ch/ftp-server/pftools/
hmmpfam         : http://hmmer.wustl.edu/
hmmsearch       : http://hmmer.wustl.edu/
hmmconvert      : http://hmmer.wustl.edu/
NCBI Blast      : ftp://ftp.ncbi.nlm.nih.gov/blast/
EMBOSS tools    : http://emboss.sourceforge.net    (for seqret (sequence reformatter) and sixpack (nucleic sequences translator))
Ncoils          : ftp://ftp.ebi.ac.uk/pub/software/unix/coils-2.2
Seg             : http://blast.wustl.edu/pub/seg/

All are also available in tarballs from the FTP site (see the installation instructions)

==========
>In-depth:
==========

Should you wish to know the set up and features of InterProScan in-depth,
this section details the features and architecture of the program.


==================
>Package Features:
==================

1) The most important feature of InterProScan is that it is pure Perl:  No
  dependencies are needed to use it.  If jobs fail, reports are created
  detailing the failure and a resubmission script is automatically written
  which is then able to complete your failed jobs when the problem(s) is/are
  solved. (N.B. v4.x no longer uses gmake when launching and processing jobs).

2) Another important feature of InterProScan is the possibility for distributed
  execution of individual jobs. The integrated applications are executed using
  Unix rsh on the configured network hosts. The job can either be directly
  executed on a remote host or can be submitted from the host to a Unix queuing
  system like LSF, PBS or SGE which can redirect it further.

3) As a wrapper, InterProScan has a modular structure with a simple "one Perl
  module per database" organisation.  This structure is based on Perl modules
  used at EBI to dispatch the jobs on the EBI network.  This Perl library can
  be used for other projects needing a dispatcher for jobs.

4) Each of the Perl modules provides an object-oriented interface to the
  underlying database entry attributes. The parsing of the output results file
  happens only once and is done upon request, implementing so-called lazy
  parsing.

5) Parsing routines are implemented in a classic way. For each application,
  InterProScan reads the output results, stores the info into a hash table and
  returns it to the main program which then writes the raw file.

6) To speed up the required data look-up, InterProScan indexes the corresponding
  databases. Fast data retrieval is implemented, based on Perl native B-trees
  indexing (DB_File.pm by Paul Marquess, based on BerkeleyDB).

7) The InterProScan package includes optional support for a Web user interface
  with a script for basic retrieval of local data and check of indexes.

8) You can submit a nucleic acid sequence that will be translated in all 6 frames
  and piped into the analysis programs. InterProScan is designed to use seqret
  and sixpack binaries from EMBOSS (http://emboss.sourceforge.net) package to
  translate and reformat input sequences. But you can use your own translator/
  reformatter. See FAQ for instructions.

9) The InterProScan package implements additional filtering of the results based
  on specific cut-offs and other post-processing steps.

For more information on how InterProScan works, see the next section.


=====================
>Architecture review:
=====================

As mentioned above, the Perl-based InterProScan was designed for bulk sequence
analysis. The architecture does not have any internal limitations on the number
of submitted sequences and has been tested on runs with more than 100 000
sequences. The general approach is to split the original input file into smaller
parts with a pre-configured number of sequences in each (a so-called "chunk").

InterProScan is more than just a simple wrapper of protein sequence analysis
applications. In addition, it performs a considerable amount of data look-up
from various databases and has the ability to parse and retrieve program outputs.

Each data description module defines the data schema of the source data and its
parsing rules. The corresponding Perl module provides an object-oriented
interface to the underlying entry attributes.  The parsing of the output results
happens only once, after searching is done and all the applications have finished.
The parsing of the source data into the memory objects happens only once and is
done upon request, implementing so-called lazy-parsing.  Hierarchical parsing rules
are implemented using the recursive-descent approach (Parse-RecDescent package).
Fast data retrieval is implemented using the Perl native B-trees indexing
(DB_File.pm, based on Berkeley DB).
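
The lazy-parsing idea mentioned above (parse once, and only when an attribute
is first requested) can be sketched in Python as follows. This is an
illustration of the general technique, not a translation of the InterProScan
Perl modules; the LazyEntry class and the "KEY: value" entry layout are
assumptions made for the example.

```python
class LazyEntry:
    """Holds the raw text of an entry and parses it on first access only."""

    def __init__(self, raw_text):
        self._raw = raw_text
        self._fields = None      # filled in by the first call to _parse()
        self.parse_count = 0     # kept only to demonstrate the behaviour

    def _parse(self):
        if self._fields is None:
            self.parse_count += 1
            self._fields = dict(
                line.split(": ", 1)
                for line in self._raw.splitlines()
                if ": " in line
            )
        return self._fields

    def get(self, key):
        """Return the value of one attribute, parsing the entry if needed."""
        return self._parse().get(key)


entry = LazyEntry("ID: IPR008237\nNAME: Prephenate dehydratase with ACT region")
print(entry.parse_count)                   # nothing has been parsed yet
print(entry.get("ID"), entry.get("NAME"))
print(entry.parse_count)                   # parsed exactly once
```

Entries that are indexed but never looked up cost no parsing time at all,
which matters when only a small fraction of a large database is needed.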

The simple 'one Perl module per data source' organisation makes it possible to
reuse the modules in other stand-alone ad-hoc solutions. The Perl-based
InterProScan is capable of providing post-processed, integrated results in
several formats and could also be used as a simple retrieval system for the
underlying data.

========================
>Implementation details:
========================

Each installation has the following directories:

* 'data' directory contains all databases and required indices.

* 'tmp' directory is used to store temporary user sessions and temporary job
  outputs.  This tmp directory also contains another tmp directory used by
  some applications to create temporary files during runs. Each session
  directory is created in a directory named after the date on which the
  jobs were launched. A new directory is therefore created each day.

* 'bin' directory contains some Perl scripts and platform specific binaries of
  scanning programs (in the binaries/ subdirectory).

* 'lib' directory contains all Perl modules necessary for iprscan to work
  properly. The main core of launching jobs, checking jobs, parsing results
  and creating the results page/output are located in the lib directory under a
  package developed at EBI (Dispatcher::*). These packages are the main core
  of iprscan.  It also contains an Index directory used to index databases and
  output results.  It is based on the index method present in previous
  versions of iprscan.

* 'conf' directory contains configuration files for each database/application
  used. It also contains configuration files for several queueing systems,
  translate/reformat tools, indexing and InterProScan.


==========
>Programs:
==========

The main script is the one called "iprscan" in the bin/ directory.  It acts as
both a command-line script (when the -cli option is used) and as the CGI script
if and when a user has installed the web interface to InterProScan.  The
iprscan script starts jobs by calling another script (iprscan_wrapper.pl) which
in turn launches and tracks jobs for each application included in the program.
Results are parsed and the output created.  By default, when running on the
command line, the results are written to stdout, unless the -o option is used
to redirect the output to a file.  You can also specify -verbose mode which
shows the status of the main program as PENDING, RUNNING and DONE.
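
If you launch iprscan from another program, the command line can be assembled
as in the Python sketch below. It uses only options listed in the iprscan
option descriptions in this README; the helper name build_iprscan_command and
its keyword arguments are hypothetical, and the sketch only builds the
argument list without running anything.

```python
def build_iprscan_command(input_file, output_file=None, fmt="raw",
                          iprlookup=False, goterms=False, seqtype=None,
                          verbose=False, executable="./iprscan"):
    """Return an argument list for a command-line iprscan run."""
    if fmt not in ("raw", "xml", "txt", "html"):
        raise ValueError("unsupported format: %r" % fmt)
    if seqtype not in (None, "n", "p"):
        raise ValueError("seqtype must be 'n' or 'p'")
    if goterms and not iprlookup:
        # The README states that -goterms requires -iprlookup.
        raise ValueError("-goterms requires -iprlookup")
    cmd = [executable, "-cli", "-i", input_file, "-format", fmt]
    if output_file is not None:
        cmd += ["-o", output_file]
    if iprlookup:
        cmd.append("-iprlookup")
    if goterms:
        cmd.append("-goterms")
    if seqtype is not None:
        cmd += ["-seqtype", seqtype]
    if verbose:
        cmd.append("-verbose")
    return cmd


cmd = build_iprscan_command("seqs.fasta", output_file="out.raw",
                            iprlookup=True, goterms=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) from the bin/
directory of an installation; passing a list rather than a single string
avoids shell quoting problems with file names.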

If the job crashed, iprscan warns you (both on the command line and in the
web interface) and produces a report file containing information about crashed
jobs. In a best case scenario, iprscan gives you the reason why the job(s)
crashed so that you can try to fix the problem and restart any failed jobs (To
do so, just run ResubmitJobs.pl on the command line or click 'Resubmit failed
jobs' on the web interface). InterProScan will restart only those jobs which
failed (from scratch) and concatenate the results with previous ones.

Below is a more detailed description of the various programs included in
InterProScan's architecture:

* Config.pl
       - is provided to make the installation and reconfiguration of
         InterProScan easy.
       - There are no command-line options - you will be prompted for information.
       - Most of the prompts have some explanations and provide default
         suggestions in [].  If you later decide to change a perl path,
         queue system or queue name you just need to restart Config.pl - it
         will overwrite the old info with the new.  Alternatively, you can
         directly change the configuration files in the conf directory at your
         own risk.

* iprscan [-i input_file] [-iprlookup [-goterms]] [-trtable num] [-trlen num]
         [-nocrc] [-appl application to run (default all)] [-email @]
         [-seqtype n|p] [-format raw|xml|html|txt (default xml)] [-verbose]
         [-h] [-cli] [-taxo <taxonomy>] [-txrule <0,1>]
       - is the program that initiates an InterProScan job.
       - It creates a temporary session directory and prepares all required
         infrastructure for the scanning.

       1) The input file is checked to confirm FASTA format, reformatted to
         clean FASTA and split into configured portions, each in its own
         'chunk_NN' directory.  The number of sequences contained in each
         'chunk_NN' directory can be changed by changing the 'chunk' tag in
         the iprscan/conf/iprscan.conf file.

       2) The session directory contains a parameters file which summarises the
         different options the user entered (needed later during parsing) and
         also info about sequences like length, id and crc64 (checksum).  It
         also contains the original reformatted sequences file.

       3) Each 'chunk_NN' directory gets its own sequence file (.nocrc).

       The options for this script are :
       ---------------------------------
       -cli          Tells the script to run in command-line mode.
                     The same script is also used as the CGI script when the
                     web interface is configured.

       -i            Input sequence file. This file must exist and be readable.

       -o            Output file to write results to (default stdout).

       -iprlookup    Switch on look up of corresponding InterPro annotation

       -goterms      Switch on look up of corresponding Gene Ontology
                     annotation (requires -iprlookup option to be used too)

       -trtable num  -trtable and -trlen specify the translation table code and
                     the transcript length threshold, respectively, for nucleic
                     acid to protein sequence translation (based on CodonTable.pm
                     by Heikki Lehvaslaiho <heikki at ebi.ac.uk>).

       -appl         Application to use. Check the iprscan/conf/iprscan.conf file
                     to see which applications you have configured, or type
                     './iprscan -cli -h'.

       -nocrc        Skips the crc64 check on your protein sequence(s) before
                     launching any application. When the check is performed and
                     all your sequences have a known crc64 according to the
                     match.xml file, no applications are launched and the
                     results are displayed directly.

       -email        Email address to notify when the run is finished.

       -format       Output format [raw, xml, txt, html] (default xml).

       -seqtype      The type of the input sequences (dna/rna (n) or protein (p)).

       -verbose      Displays status of the job.


       -taxo         Activate the Taxonomy filter for abbreviated taxonomy
                     (e.g. -taxo Arthro -taxo Bact)
                     Possible values: Arabidopsis thaliana (AraTh), Archaea (Arch),
                     Arthropoda (Arthro), Bacteria (Bact), Caenorhabditis elegans
                     (CaeEl), Chordata (Chor), Cyanobacteria (Cyan), Eukaryota (Euka),
                     Fruit Fly (FrFly), Fungi (Fung), Green Plants (GrePl), Human
                     (Huma), Metazoa (Meta), Mouse (Mous), Nematoda (Nema), Other
                     Eukaryotes (OthEuk), Plastid Group (PlasGrp), Rice spp (Rice),
                     Saccharomyces cerevisiae (SacCer), Synechocystis PCC 6803 (Synec),
                     Unclassified (Unclass) and Virus (Vir)

       -txrule <0,1> Make decision on the taxonomy: 0 -> AND, 1 -> OR.

       -help         Displays this help and exits.

* meter.pl <session-dir-name>
       - reports the progress of a job
       - You need to provide the full or relative path to the session directory (e.g.
         'tmp/20040302/iprscan-20040302-12585481')

* index_data.pl [-f <file> -f <..> ... ] [-inx [-iforce]] [-bin [-bforce]] [-h] [-v]
       - checks and updates all required indices (see DATA Update). This
         script is also used to format databases needed by blastall
         using formatdb binary located in 'iprscan/bin/binaries/YOURPLATFORM/blast/'.

       The options for this script are:
       --------------------------------
       index_data.pl [-f <file> -f <..> ... ] [-inx [-iforce]] [-bin [-bforce]] [-h] [-v]

       -f      file(s) to index. By default indexes all the required files.
               Type ./index_data.pl -h to get the list of the supported files.

       -inx    Index given files.

       -iforce Force script to reindex files even if they are reported as
               being up to date. This feature is needed most of the time when data is
               being updated. It first removes the old index and then builds a new one.

       -bin    Convert ascii hmm library file to binary file. Can speed up the
               hmmpfam search by up to 40%. (.bin is now required by default)

       -bforce Force the binary file conversion even if the binary file already
               exists. Needed most of the time during data update.

       -v      Verbose mode, prints information during indexing.

       -h      Displays this help and exits.


* converter.pl -format <format> -input <raw file> -jobid <jobid>  > output_file
       - Is used to reformat results from raw into [html, xml, ebixml, txt,
         gff3] format.
       - NOTE: ebixml format just adds an EBI header to the top of the xml file.
       - NOTE: to get gff3 format, you must first run iprscan and output raw
         format.


* iterator.pl -i <infile> -o <outfile> -c <cmd> [-h display header]
       - Iterator reads fasta sequences from the input one at a time and
         executes the command for each sequence. The command MUST use stdout
         to print the output. Use an %infile tag on the command line to
         specify the location of the input file.
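
         The per-sequence loop described above can be sketched as follows.
         This is a minimal Python illustration, not the iterator.pl code:
         the function names (read_fasta, iterate) and the use of a temporary
         file per sequence are assumptions made for the example.

```python
import os
import subprocess
import tempfile


def read_fasta(handle):
    """Yield one FASTA record (header line plus sequence lines) at a time."""
    record = []
    for line in handle:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line)
    if record:
        yield "".join(record)


def iterate(infile, outfile, cmd_template):
    """Run cmd_template once per sequence and append its stdout to outfile.

    Every '%infile' in cmd_template is replaced by the path of a temporary
    file holding the current sequence (an assumption of this sketch).
    """
    with open(infile) as src, open(outfile, "w") as out:
        for record in read_fasta(src):
            with tempfile.NamedTemporaryFile(
                "w", suffix=".fasta", delete=False
            ) as tmp:
                tmp.write(record)
            try:
                cmd = cmd_template.replace("%infile", tmp.name)
                result = subprocess.run(
                    cmd, shell=True, capture_output=True, text=True, check=True
                )
                out.write(result.stdout)
            finally:
                os.unlink(tmp.name)
```

         Because only stdout is collected, a command that writes its results
         elsewhere would produce an empty output file, which is why the
         command MUST print to stdout.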

* ResubmitJobs.pl -r <reportfile> -h -v [0,1,2]
       - This script is used to relaunch jobs that failed during a run.
         If jobs crashed (i.e. the size of the errors file is greater than 0),
         InterProScan creates a report with info about the chunk, the
         application that failed and the reason, if available.  The report
         file is then used to relaunch only the failed jobs. The user or
         administrator must fix the errors before using this script,
         otherwise InterProScan cannot restart the failed application and
         will simply recreate the same report.

       The options for this script are:
       --------------------------------

       -r Path to the report file. (e.g.
       tmp/20040302/iprscan-20040302-12355481/iprscan-20040302-12335481.report)

       -h Displays this help and exits.

       -v [0,1,2] Verbose mode with three levels.
          0 no info at all (like without -v option).
          1 only prints main actions.
          2 prints all that the script is doing.

* iprscan_wrapper.pl < iprscan-xxxx-xxxx.params
       - This script is used by iprscan to launch, check, and parse results of
         all jobs.  You should not need to use this script yourself.

=====================
>Configuration files:
=====================

InterProScan is supplied with configuration files in the conf directory so
that you can easily set up your installation exactly as you want.
Configuration files are based on 'tags' that are expanded when InterProScan
reads them. What do we mean by a 'tag'?

For example, in the following lines:

workserver=http://fido.ebi.ac.uk:4000
workurl=[%workserver]/iprscan/iprscan?tool=iprscan&jobid=....

workserver is a key and http://fido.ebi.ac.uk:4000 is its value. In the next
line, [%workserver] is a tag. InterProScan reads a configuration file as a
list of key-value pairs. When it sees something on a line that looks like
'[%.....]', it treats it as a tag and checks whether it has already seen a key
of that name. If so, it replaces the tag '[%...]' with the key's value;
otherwise it replaces the tag with nothing. Because files are read from top to
bottom, a key MUST BE SET before any line that uses it as a tag.

In that case, workurl will be :
http://fido.ebi.ac.uk:4000/iprscan/iprscan?tool=iprscan&jobid=....
after expanding.
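
The expansion rule described above can be sketched in a few lines of Python.
This is only an illustration of the top-to-bottom behaviour, not the code of
Config.pm; the function name parse_config and the treatment of lines starting
with '#' as comments are assumptions of the example.

```python
import re

# Matches a tag such as [%workserver] and captures the key name.
TAG = re.compile(r"\[%([^\]]+)\]")


def parse_config(lines):
    """Read key=value lines top to bottom, expanding [%key] tags as we go."""
    values = {}
    for line in lines:
        line = line.strip()
        # Skipping blank lines and '#' comments is an assumption of this sketch.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = line.split("=", 1)
        # A tag whose key has not been seen yet expands to the empty string.
        values[key.strip()] = TAG.sub(
            lambda m: values.get(m.group(1), ""), value.strip()
        )
    return values


config = parse_config([
    "workserver=http://fido.ebi.ac.uk:4000",
    "workurl=[%workserver]/iprscan/iprscan?tool=iprscan",
])
print(config["workurl"])
```

If the two lines were swapped, workurl would start with '/iprscan/...' because
workserver would not yet be known when the tag is expanded.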

InterProScan supports conditions and other special tags. Here is a list. To
learn how to write them correctly, you will have to read the code of the
Config.pm module (iprscan/lib/Dispatcher/Config.pm).

-%env      -> refers to the environment variable hash table in Perl (%ENV).
-%if       -> lets you put a condition in your tags.
-%switch   -> lets you choose between multiple cases (like a basic switch
              statement in programming).
-%random   -> calls the Perl srand subroutine.
-%YYYY     -> expands to the current year.
-%MM       -> expands to the current month of the year.
-%DD       -> expands to the current day of the month.
-%hh       -> expands to the current hour of the day.
-%mm       -> expands to the current minute of the hour.
-%ss       -> expands to the current second of the minute.
-%hostname -> expands to the hostname of the machine.
-%pid      -> expands to the process id of the program.
-%uname    -> expands to the operating system name.

With all of these features, you should be able to modify/configure InterProScan and applications
as you want.
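
To give an idea of what the date, time and system tags stand for, here is a
small Python sketch that computes comparable values. It is not taken from
Config.pm: the function name builtin_tags is hypothetical, and the zero-padded
two-digit format is an assumption (suggested by session directory names such
as 'tmp/20040302/...').

```python
import os
import platform
import socket
from datetime import datetime


def builtin_tags(now=None):
    """Return values comparable to the date, time and system tags."""
    now = now or datetime.now()
    return {
        "YYYY": now.strftime("%Y"),       # current year
        "MM": now.strftime("%m"),         # month of the year, zero-padded
        "DD": now.strftime("%d"),         # day of the month, zero-padded
        "hh": now.strftime("%H"),         # hour of the day (24h)
        "mm": now.strftime("%M"),         # minute of the hour
        "ss": now.strftime("%S"),         # second of the minute
        "hostname": socket.gethostname(),
        "pid": str(os.getpid()),
        "uname": platform.system(),       # operating system name
    }


tags = builtin_tags(datetime(2004, 3, 2, 12, 58, 54))
print("tmp/" + tags["YYYY"] + tags["MM"] + tags["DD"])
```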


==================================
>Results filtering / Match status:
==================================

Method cut-offs:
----------------

InterProScan is based on scanning methods native to the InterPro member
databases. It is distributed with pre-configured method cut-offs recommended by
the member database experts and which are believed to report relevant matches.
All cut-offs are defined in configuration files (see 'conf' directory). Matches
obtained with the fixed cut-off are subject to the following filtering.
(Please also see member database web pages for more information)

* PFAM filtering:
       - Each Pfam family is represented by 2 HMMs - ls and fs (full-length
         and fragment).
       - An HMM model has bit score cut-offs (for each domain match and the
         total model match) and these are defined in the GA lines of the Pfam
         database. Initial results are obtained with quite a high common cut-
         off and then the matches of the signature with a lower score than the
         family specific cut-offs are dropped.
       - If both the fs and ls models for a particular Pfam family hit the
         same region of a sequence, the ls model is always chosen.
       - Another type of filtering has been implemented since release 4.1. It
         is based on Clan filtering and nested domains. Please check the Pfam
         website (http://www.sanger.ac.uk/Pfam) for more information on Clan
         filtering.
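
         The first three points can be sketched as follows. This Python
         example is an illustration under stated assumptions, not the
         InterProScan code: the Hit class, the function names, the example
         scores and the reading of "the same region" as "any overlap" are
         all assumptions, and Clan / nested-domain filtering is not shown.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    family: str       # Pfam accession, e.g. "PF00886"
    mode: str         # "ls" (full-length) or "fs" (fragment)
    start: int
    end: int
    seq_score: float  # bit score of the total model match
    dom_score: float  # bit score of this domain match


def overlaps(a, b):
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def filter_pfam(hits, ga):
    """ga maps (family, mode) -> (total cut-off, domain cut-off) in bits."""
    # Step 1: drop matches scoring below the family-specific GA cut-offs.
    kept = [
        h for h in hits
        if h.seq_score >= ga[(h.family, h.mode)][0]
        and h.dom_score >= ga[(h.family, h.mode)][1]
    ]
    # Step 2: an fs hit is dropped when an ls hit of the same family
    # covers the same region.
    ls_hits = [h for h in kept if h.mode == "ls"]
    return [
        h for h in kept
        if h.mode == "ls"
        or not any(l.family == h.family and overlaps(l, h) for l in ls_hits)
    ]
```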

* TIGRFAMs filtering:
       - Each TIGRFAMs HMM has its own cut-off scores for each domain
         match and the total model match. These bit score cut-offs are defined
         in the TC lines of the database. Initial results are obtained with
         quite a high common cut-off and then the matches (of the signature or
         some of its domains) with a lower score than the family specific cut-
         offs are dropped.

* PRINTS filtering:
       - PRINTS filtering uses a test version of family-specific p-value
         cut-offs. All matches with a p-value greater than p_min for the
         signature are dropped.

* SMART filtering:
       - The publicly distributed version of InterProScan has a common e-value
         cut-off corresponding to the reference database size. A more
         sophisticated scoring model is used on the SMART web server and in
         the production of pre-calculated InterPro match data.
       - Exact scoring thresholds for domain assignments are proprietary data
         that can be obtained directly from the SMART team.  The
         InterProMatches data production procedure uses these additional
         smart.thresholds and smart.desc data files (which are available from
         the SMART team) to filter out results obtained with the higher
         cut-off.
       - PLEASE NOTE: the given cut-offs are e-values (i.e. the number of
         expected random hits) and are therefore valid only in the context of
         the reference database size.
       - It implements the following logic:
         1. If the E-value of found match is worse than the 'cut_low' the
            match is dropped.
         2. If the E-value of found match is worse than the 'family' cut-off
            it is reported as the family hit with unknown/marginal status
            ("M") and no description is given.
         3. If the E-value of found match is better than the 'family' cut-off
            but worse than the 'cutoff' it is reported as the family member
            with marginal status ("M") but the family name is given.
         4. If the 'family' cut-off is undefined and the E-value of the match
            is worse than the 'cutoff' but better than the 'cut_low' it is
            reported as a domain match with marginal status ("M").
         5. If the E-value of the found match is better than the 'cutoff' it is
            reported as a domain match with true status ("T").
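
         The five rules above can be written as one small decision function.
         This Python sketch is only an illustration, not the InterProScan
         code: the function name and the returned labels are hypothetical,
         "worse" is read as "numerically larger", and an E-value exactly
         equal to a cut-off is treated as passing it (an assumption).

```python
def classify_smart(evalue, cutoff, cut_low, family=None):
    """Return (status, kind) for a SMART match, or None if it is dropped.

    cutoff, cut_low and family are the thresholds named in the rules above;
    family is None when no 'family' cut-off is defined.
    """
    # Rule 1: worse than 'cut_low' -> dropped.
    if evalue > cut_low:
        return None
    if family is not None:
        # Rule 2: worse than 'family' -> marginal, no description.
        if evalue > family:
            return ("M", "family hit, no description")
        # Rule 3: better than 'family' but worse than 'cutoff'.
        if evalue > cutoff:
            return ("M", "family member, family name given")
    elif evalue > cutoff:
        # Rule 4: no 'family' cut-off, between 'cutoff' and 'cut_low'.
        return ("M", "domain match")
    # Rule 5: better than 'cutoff' -> true domain match.
    return ("T", "domain match")
```

         For example, with cutoff=1e-5, family=1e-2 and cut_low=1, an E-value
         of 1e-3 falls under rule 3 and an E-value of 1e-7 under rule 5.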

* PROSITE patterns CONFIRMation:
       - ScanRegExp is able to verify PROSITE matches using corresponding
         statistically-significant CONFIRM patterns.
       - The default status of the PROSITE matches is unknown (?) and the true
         positive (T) status is assigned if the corresponding CONFIRM patterns
         match as well.
       - The CONFIRM patterns were generated based on the true positive SWISS-
         PROT PROSITE matches using eMOTIF software with a stringency of 10e-9
         P-value.

* PANTHER filtering:
       - Panther has pre- and post- processing steps.  The pre-processing step
         is intended to speed up the HMM-based searching of the sequence and
         involves blasting the HMM sequences with the query protein sequence
         in order to find the most similar models above a given e-value. The
         resulting HMM hits are then used in the HMM-based search.
       - Panther consists of families and sub-families.  When a sequence is
         found to match a family in the blast run, the family's sub-families
         are also scored using HMMER (unless there is only one sub-family, in
         which case the family alone is scored).
       - Any matches that score worse than the e-value cut-off are discarded.
         The remaining matches are searched to find the HMM with the best
         score and e-value, and the best hit is then reported (including any
         sub-family hit).
       - For more information, please see the Panther website.
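
         The post-processing step (discard, then report the best hit) can be
         sketched as follows. This Python example is an illustration only:
         the function name, the tuple layout and the model identifiers in
         the usage note are hypothetical, and ranking first by score and
         then by e-value is an assumption about how "best" is decided.

```python
def best_panther_hit(hits, evalue_cutoff):
    """Pick the best remaining hit, or None if every hit is discarded.

    hits is a list of (model_id, bit_score, evalue) tuples covering the
    family HMM and any of its sub-family HMMs.
    """
    # Discard matches whose e-value is worse (larger) than the cut-off.
    passing = [h for h in hits if h[2] <= evalue_cutoff]
    if not passing:
        return None
    # Highest bit score wins; ties are broken by the smaller e-value.
    return max(passing, key=lambda h: (h[1], -h[2]))
```

         With the made-up hits ("PTHR00001", 210.5, 1e-60) and
         ("PTHR00001:SF2", 340.0, 1e-98) and a cut-off of 1e-3, the
         sub-family hit is the one returned.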

* GENE3D filtering:
       - Gene3D also employs post-processing of results by using a program
         called DomainFinder.
       - This program takes the output from searching the Gene3D HMMs against
         the query sequence and extracts all hits that are more than 10
         residues long and have an e-value better than 0.01.
       - If hits overlap at all, the match with the better e-value is chosen.
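
         The two criteria above can be sketched in Python as shown below.
         This is not DomainFinder itself, only a simplified illustration:
         the function name, the (start, end, evalue) tuple layout and the
         greedy best-e-value-first strategy for resolving overlaps are
         assumptions of the example.

```python
def filter_gene3d(hits, min_length=10, max_evalue=0.01):
    """Keep long, significant, non-overlapping hits.

    hits is a list of (start, end, evalue) tuples with 1-based inclusive
    coordinates on the query sequence.
    """
    # Keep hits longer than min_length residues with an e-value better
    # than max_evalue.
    candidates = [
        h for h in hits
        if (h[1] - h[0] + 1) > min_length and h[2] < max_evalue
    ]
    chosen = []
    # Visit hits from best to worst e-value; a hit is accepted only if it
    # does not overlap a hit that has already been accepted.
    for hit in sorted(candidates, key=lambda h: h[2]):
        if all(hit[1] < c[0] or hit[0] > c[1] for c in chosen):
            chosen.append(hit)
    return sorted(chosen)
```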


Taxonomy Filtering
------------------

The taxonomy filtering is based on InterPro entry taxonomy.
InterProScan first does its usual work, searching your sequence against
the InterPro member databases. When all analyses are done, InterProScan
post-processes the results based on the taxonomy of each InterPro entry
found for your sequence(s).

e.g.:

Your sequence (RS16_ECOLI) is analysed against Pfam and ProDom.
Hits returned are:

Pfam   : PF00886
ProDom : PD003791

and associated InterPro entries:

IPR000307

Taxonomy for IPR000307 is listed below.

Arabidopsis thaliana
Archaea
Arthropoda
Bacteria
Caenorhabditis elegans
Chordata
Cyanobacteria
Eukaryota
Fruit Fly
Fungi
Green Plants
Human
Metazoa
Mouse
Nematoda
Other Eukaryotes
Plastid Group
Rice spp.
Saccharomyces cerevisiae
Synechocystis PCC 6803

For each hit, one of two rules is applied, depending on the -txrule option
(0 -> AND, 1 -> OR).
AND rule :
----------
   If any taxonomy selected by the user is missing from the InterPro entry,
   then the hit (PF00886, or PD003791) is rejected and not shown in the results.
OR rule :
---------
   If at least one of the taxonomy names selected by the user is present in the
   InterPro entry, then the hit (PF00886, or PD003791) is kept and shown to the
   user.

NOTE:  If you do not wish to use taxonomy and want to make it unavailable to
      users, just edit iprscan.conf and set the value 0 for the tag
      'taxonomy.use'.
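
The AND and OR rules amount to two set operations. The Python sketch below is
only an illustration of that logic, not the InterProScan code; the function
name keep_hit is hypothetical and the taxonomy list is shortened for the
example.

```python
def keep_hit(entry_taxonomy, selected, txrule):
    """Decide whether a hit is kept, given its InterPro entry's taxonomy.

    txrule follows the -txrule option: 0 -> AND, 1 -> OR.
    """
    entry = set(entry_taxonomy)
    wanted = set(selected)
    if txrule == 0:
        # AND: every selected taxonomy must be present in the entry.
        return wanted <= entry
    # OR: at least one selected taxonomy must be present in the entry.
    return bool(wanted & entry)


# Shortened taxonomy list for IPR000307 (see the full list above).
ipr000307 = ["Archaea", "Arthropoda", "Bacteria", "Eukaryota", "Fungi"]

print(keep_hit(ipr000307, ["Bacteria", "Virus"], txrule=0))
print(keep_hit(ipr000307, ["Bacteria", "Virus"], txrule=1))
```

Since Virus is not in the taxonomy of IPR000307, the hits PF00886 and PD003791
would be rejected under the AND rule and kept under the OR rule.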


============
>References:
============

1. The InterPro Consortium (R.Apweiler, T.K.Attwood, A.Bairoch, A.Bateman,
  E.Birney, M.Biswas, P.Bucher, L.Cerutti, F.Corpet, M.D.R.Croning, R.Durbin,
  L.Falquet, W.Fleischmann, J.Gouzy, H.Hermjakob, N.Hulo, I.Jonassen, D.Kahn,
  A.Kanapin, Y.Karavidopoulou, R.Lopez, B.Marx, N.J.Mulder, T.M.Oinn, M.Pagni,
  F.Servant, C.J.A.Sigrist, E.M.Zdobnov).
  "The InterPro database, an integrated documentation resource for protein
  families, domains and functional sites."
  Nucleic Acids Research, 2001, 29(1): 37-40.

2. Hofmann K., Bucher P., Falquet L., and Bairoch A.
  "The Prosite Database, Its Status in 1999."
  Nucleic Acids Res, 1999, 27(1): 215-9.

3. Attwood T.K., Croning M.D., Flower D.R., Lewis A.P., Mabey J.E., Scordis P.,
  Selley J.N., and Wright W.
  "Prints-S: The Database Formerly Known as Prints."
  Nucleic Acids Res, 2000, 28(1): 225-7.

4. Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L.
  "The Pfam Protein Families Database."
  Nucleic Acids Res, 2000, 28(1): 263-6.

5. Corpet F., Gouzy J., and Kahn D.
  "Recent Improvements of the Prodom Database of Protein Domain Families."
  Nucleic Acids Res, 1999, 27(1): 263-7.

6. Schultz J., Copley R.R., Doerks T., Ponting C.P., and Bork P.
  "Smart: A Web-Based Tool for the Study of Genetically Mobile Domains."
  Nucleic Acids Res, 2000, 28(1): 231-4.

7. Bucher P., Karplus K., Moeri N., and Hofmann K.
  "A Flexible Motif Search Technique Based on Generalized Profiles."
  Comput Chem, 1996, 20(1): 3-23.

8. Scordis P., Flower D.R., and Attwood T.K.
  "Fingerprintscan: Intelligent Searching of the Prints Motif Database."
  Bioinformatics, 1999, 15(10): 799-806.

9. Eddy S.R.
  "Profile Hidden Markov Models."
  Bioinformatics, 1998, 14(9): p. 755-63.

10. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W.,
  and Lipman D.J.
   "Gapped Blast and Psi-Blast: A New Generation of Protein Database Search
  Programs."
   Nucleic Acids Res, 1997, 25(17): p. 3389-402.

11. Haft D.H., Loftus B.J., Richardson D.L., Yang F., Eisen J.A., Paulsen I.T.,
   and White O.
   "TIGRFAMs: a protein family resource for the functional identification of
   proteins."
   Nucleic Acids Res, 2001, 29(1): 41-3.

12. Eddy, S.R.
   "HMMER: Profile hidden Markov models for biological sequence analysis".
    WWW, 2001. http://hmmer.wustl.edu/

13. Cathy H. Wu, Hongzhang Huang, Lai-Su L. Yeh, Winona C. Barker
   "Protein family classification and functional annotation."
   Computational Biology and Chemistry, 2003, 27: 37-47.

14. Gough, J., Karplus, K., Hughey, R. and Chothia, C.
   "Assignment of Homology to Genome Sequences using a Library of Hidden
  Markov Models that represent all Proteins of Known Structure."
   J. Mol. Biol., 2001, 313(4): 903-919.

15. Buchan D., Pearl F., Lee D., Shepherd A., Rison S., Orengo C., and
   Thornton J.
   "Gene3D: Structural assignments for whole genes and genomes using the CATH
   domain structure database."
   Genome Res, 2002, 12(3): 503-14.

16. Mi H., Lazareva-Ulitsky B., Loo R., Kejariwal A., Vandergriff J.,
   Rabkin S., Guo N., Muruganujan A., Doremieux O., Campbell M.J., Kitano H.,
   and Thomas P.D.
   "The PANTHER database of protein families, subfamilies, functions and
   pathways."
   Nucleic Acids Res, 2005, 33(Database issue): D284-8.

=============
>How to cite:
=============

   Zdobnov E.M. and Apweiler R.
   "InterProScan - an integration platform for the signature-recognition
   methods in InterPro."
   Bioinformatics, 2001, 17(9): 847-8.

=====
>End
=====