InterProScan FAQ

Version 4.2

Authors

Sarah Hunter <hunter (at) ebi.ac.uk>

Acknowledgments

Emmanuel Quevillon <tuco (at) ebi.ac.uk>
Florence Servant <florence.servant (at) mcgill.ca>
Evgueni Zdobnov <evgueni.zdobnov (at) embil-heidelberg.de>
Ville Silventoinen <vsi (at) ebi.ac.uk>

This FAQ is split into 2 sections: one mainly useful to administrators, and one for users. Please also make sure you have read the README and Installation Instructions.

If you still haven't found a solution to your problem, please contact interhelp@ebi.ac.uk

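USERS:
1. Using InterProScan
2. Interpreting Results
ADMIN:
1. Configuring queues and servers
2. Configuring the web interface
3. Configuring applications
4. Configuring access and permissions
5. Common errors seen
6. Improving performance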

1) For USERS

1.1) Using InterProScan

What is InterProScan?

InterProScan is a tool that combines different protein signature recognition methods native to the InterPro member databases into one resource with look up of corresponding InterPro and GO annotation.

See & cite: Zdobnov E.M. and Apweiler R. "InterProScan - an integration platform for the signature-recognition methods in InterPro" Bioinformatics, 2001, 17(9): p. 847-8.

InterPro documentation available at: http://www.ebi.ac.uk/interpro/

What are the terms for commercial companies for using InterProScan?

As stated in the InterPro documentation, the manual and database may be copied and redistributed freely, without advance permission, provided that this Copyright statement is reproduced with each copy. The InterProScan software is distributed under the GNU license, as are the included scanning tools (except SignalP and TMHMM, see later). Therefore, you do not need a special license for commercial use, but please cite the resource and keep the following Copyright statement with your installation:

InterPro - Integrated Resource Of Protein Domains And Functional Sites
Copyright © 2001 The InterPro Consortium.

If we use your InterProScan web server, what is your policy regarding confidentiality of protein sequences that we submit?

We cannot necessarily guarantee confidentiality of your sequences submitted via the web server, and we suggest you install a local copy of the software at your site. Everything you need is available from the FTP site (ftp://ftp.ebi.ac.uk/pub/databases/interpro/) in the iprscan directory.

Is there a limit to the number of sequences I can search using InterProScan?

Unlike the EBI-web version, where you can only search single sequences, there is no limit to the number of sequences you can search using the stand-alone version of InterProScan. It is optimised so that large batch jobs can be "chunked" into smaller pieces and the searches parallelised.

Is there a maximum length for a sequence I can search using InterProScan?

We haven't tested whether there is a maximum sequence length that InterProScan can handle, although some users have reported problems which may be caused by very long sequences. This is something we will look into in further detail in the near future.

Can I choose which programs InterProScan runs (e.g. if I don't want to run my sequence against Pfam)? Will it affect my results?

If you remove any of the programs from InterProScan, that method will not be run on your sequence, meaning you will not get predictions for that method in your results. The methods are generally independent of each other and so those methods that remain should themselves produce the same results. There are a few ways to do this:

You can configure your InterProScan installation so that you never use particular programs. If you never want to (for example) use the HMM-based methods, when you run the iprscan CONFIG.pl script, answer "n" for HMM based methods (i.e. for Pfam, SMART and TIGRFAMs) when prompted.
If you are using the web GUI, you can select which tools you will run by checking the boxes next to the program name.
On the command-line, use the -appl option, repeated once per application (e.g. -appl APP1 -appl APP2), to specify which applications you want to run.

How can I retrieve my InterProScan results?

If you enter your email on the EBI web GUI for InterPro, you should receive an email with the results of your InterProScan submission. Alternatively, you can add the link displayed on the webpage whilst the job is running to your favourites ("bookmark it") and return to it later.

1.2) Interpreting Results

I've found that not all matches from .output files are parsed into .raw files. Are you using additional filtering?

Yes. InterProScan implements additional filtering for some of the databases. This is detailed in the README, in the "Results filtering/match status" section.

I know that my protein is transmembrane|secreted and so should have a TM domain|signal peptide predicted but it doesn't. Why?

Unfortunately, this is a consequence of how InterProScan works. In order to save time, InterProScan calculates a checksum for your sequence and uses it to look up pre-computed results in an XML file containing all the matches in the InterPro database. If the signature is not in the file for any reason (e.g. it is not stored by InterPro, or has not been integrated into the database yet), it will not be returned by the look-up, even if it would normally match your sequence if you ran a full search.

The TMHMM and SignalP prediction search algorithms are provided through the web interface at EBI (and as options for the stand-alone version, under license) for your convenience. However, they are not integrated into InterPro and so do not currently exist in the pre-computed results file.

We are working on solutions to this problem, one of which is a new version of the match.xml file which will contain all integrated and unintegrated methods (although initially excluding TMHMM and SignalP predictions) for UniProt proteins. We intend to release this file every couple of weeks to ensure the most up-to-date information for our users.

If you are running the stand-alone version of InterProScan, you can avoid this problem by forcing a search every time you run it, by using the "-nocrc" option.
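
The look-up-then-search behaviour described above can be pictured with a minimal Python sketch. It is an illustration only, not InterProScan code: zlib.crc32 stands in for whatever checksum InterProScan actually computes, and the names find_matches, precomputed and full_search, as well as the signature labels, are hypothetical.

```python
import zlib

def checksum(sequence: str) -> int:
    # Stand-in checksum (CRC-32); NOT the checksum InterProScan itself uses.
    return zlib.crc32(sequence.upper().encode("ascii"))

def find_matches(sequence, precomputed, full_search, nocrc=False):
    """Return pre-computed matches unless nocrc forces a full search."""
    key = checksum(sequence)
    if not nocrc and key in precomputed:
        # Only signatures stored in the pre-computed table can come back here.
        return precomputed[key]
    return full_search(sequence)

seq = "MKTAYIAKQR"                                   # hypothetical sequence
precomputed = {checksum(seq): ["SIG_A"]}             # hypothetical stored matches
full_search = lambda s: ["SIG_A", "SIG_B"]           # hypothetical full search

print(find_matches(seq, precomputed, full_search))
print(find_matches(seq, precomputed, full_search, nocrc=True))
```

In this toy model the second call returns the extra signature because the look-up is skipped, which is the effect the "-nocrc" option is described as having.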

I am getting different results to the website of a particular member database. Why?

If a member database has just released a new version of its database, it will take a while for the new models to be integrated into InterPro. This means that for a short while, the results may differ slightly. If you wish, you can update the individual databases on your stand-alone version, but be aware that this may lead to confusing results if any of the post-processing of the results has changed between versions.

(Also, see the answer above.)

I get fewer results with BlastProDom using my local InterProScan installation than when I use EBI InterProScan. What's wrong?

Actually, this is not an error. EBI InterProScan uses a larger file from ProDom (prodom.mul, available for download from the ProDom website) which contains all the ProDom entries. Stand-alone InterProScan uses a smaller file (prodom.ipr) which contains only the ProDom entries integrated into InterPro. That is why some entries marked as "unintegrated" appear on the EBI web site: they are not integrated into InterPro and so are not in prodom.ipr.

What does the status flag mean?

InterProScan reports two flags:
"T" means that we believe it is a true positive match and
"M" (marginal) means that we are not so sure :)
A third flag, "F" (false), is created during post-processing, but these matches are not reported in the final output.

As a result of manual curation InterPro supports more values for status. See: InterPro documentation.

The end position of the model is reported as being further than the length of the protein. Is this correct?

We have noticed that with some methods (FingerPRINTScan and Coils), the reported hit actually extends past the end of the protein.

e.g.
Q25520   91CA44BBFB791E45  79  FPrintScan       PR00967 ONCOGENEAML1        62          84

Probabilistically-speaking, this is actually OK, and the match is still true (it's just that there's no sequence there to match the rest of the model but the algorithm reports it anyway). We haven't seen this behaviour with any of the HMM-based methods.

So, whatever residue is reported as the "end" of the domain is usually part of the domain. However, if you see matches where the hit extends off the end of the sequence (i.e. the position of the end of the hit is reported as greater than the length of the protein), just ignore the reported "end" and use the last residue of the protein itself as the effective "end" of the domain.
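
As a small illustration of this rule, the following Python sketch caps the reported end at the protein length. It assumes, from the excerpt above only, that the third column is the sequence length and the eighth is the match end; the function name effective_end is hypothetical, and simple whitespace splitting would not work for descriptions that contain spaces.

```python
def effective_end(raw_line: str) -> int:
    """Return the match end, capped at the protein length."""
    fields = raw_line.split()
    length = int(fields[2])   # assumed: third column is the sequence length
    end = int(fields[7])      # assumed: eighth column is the match end
    return min(end, length)

line = ("Q25520   91CA44BBFB791E45  79  FPrintScan       "
        "PR00967 ONCOGENEAML1        62          84")
print(effective_end(line))  # 79: the hit is reported to 84 but the protein has 79 residues
```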

We hope this clarifies things. If you have further questions about this specific issue, please contact the authors of Coils (and FingerPRINTScan).

2) For ADMINISTRATORS

2.1) Configuring queues and servers

Can you help me configure InterProScan so that it works on my particular OS?

We currently can only compile and test the different programs contained in the InterProScan package on the operating systems that are listed as supported (see README). We are happy to provide assistance to compile source code on other, unsupported operating systems whenever possible.

Unfortunately, the InterPro member databases generally do not provide binaries for Windows, meaning that you cannot install InterProScan directly on a Windows server just yet, although we are considering distributing binaries for cygwin (http://www.cygwin.com/) at a later date. Some of our users have used a UNIX emulation layer on Windows in order to be able to install.

Can you help me configure InterProScan so that it works on my particular Queue System?

We currently only test InterProScan on LSF. However, we have provided configuration files for Sun Grid Engine 6 and PBS, and we may be able to help you set up an unsupported queue system if necessary.

I want to have my data and/or tmp directory in a different location - can I do this?

Yes, simply move the directory where you want it to be and create a soft link to the new location from the interproscan home directory (iprscan).
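
For illustration, the move-and-link step could be sketched in Python as below. The function name relocate_with_symlink and every path are hypothetical, and the demonstration runs in a throw-away temporary directory rather than a real installation.

```python
import os
import shutil
import tempfile

def relocate_with_symlink(original: str, new_location: str) -> None:
    """Move a directory and leave a soft link at its old path."""
    shutil.move(original, new_location)
    os.symlink(new_location, original, target_is_directory=True)

# Demonstration with made-up paths inside a temporary directory.
root = tempfile.mkdtemp()
home_tmp = os.path.join(root, "iprscan", "tmp")      # stands in for iprscan/tmp
os.makedirs(home_tmp)
big_disk = os.path.join(root, "bigdisk")             # stands in for the new disk
os.makedirs(big_disk)
target = os.path.join(big_disk, "iprscan_tmp")

relocate_with_symlink(home_tmp, target)
print(os.path.islink(home_tmp), os.path.isdir(home_tmp))
```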

Without having a queuing system is it possible to use several hosts to work on a batch of sequences submitted to iprscan?

Without a queuing system you can configure InterProScan to perform scanning of different methods on different hosts (1 scanning method per host) in parallel (which host is defined in the .conf file for each application). In the case of a queuing system you are asked for a submission host.

Do I have to start the jobs from the execution host or is it possible to start them from any host?

You can start it from any host which can do rsh to the "execution host". Check that you have access to each host you want to run the applications on, using "rsh THE_HOST hostname" for example. You may also need to edit the file called .rhosts on each host to allow connections from other machines.

(e.g. to allow connections from foo.bar.com to blah.co.uk as user john, your .rhosts on blah must contain something like: "foo.bar.com john").

I am using a queueing system. I looked at the configuration file but I don't understand the special tags "optqueue" and "optresource".

"optqueue" stands for the queue name to use for an application, and "optresource" for the name of a resource to use for this application. If you don't know what these terms mean, contact your system administrator.

Each time an application is launched, InterProScan reads its configuration file to find out how to launch the job: through a queue or locally. This is given by the "queue" tag of each application. If the queue is a queueing system such as LSF (lsf42.conf), InterProScan launches the application with the LSF command, adding the queue name and resource to the command line if they are specified in the application's configuration file (search for "resource" and "queue.name"). Thus, in the LSF command line, optqueue is replaced by the value of the "queue.name" tag for this application and optresource by the value of its "resource" tag.
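
The substitution can be pictured with the following Python sketch. It is not taken from InterProScan: the command template, the script name run_app.sh and the function build_submit_command are hypothetical, and the choice to emit the LSF options -q and -R only when a value is present is an assumption made for the example.

```python
def build_submit_command(template: str, app_conf: dict) -> str:
    """Replace the optqueue/optresource placeholders with per-application values."""
    queue = app_conf.get("queue.name", "")
    resource = app_conf.get("resource", "")
    # Assumed rendering: an option is emitted only when its tag has a value.
    command = template.replace("optqueue", f"-q {queue}" if queue else "")
    command = command.replace("optresource", f"-R {resource}" if resource else "")
    return " ".join(command.split())  # collapse gaps left by empty placeholders

template = "bsub optqueue optresource run_app.sh"     # hypothetical template
print(build_submit_command(template, {"queue.name": "normal", "resource": "linux"}))
print(build_submit_command(template, {}))
```

With both tags set, the sketch prints "bsub -q normal -R linux run_app.sh"; with neither set, it prints "bsub run_app.sh".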

2.2) Configuring the web interface

How do I change the url of the InterProScan server to my domain?

Go to your iprscan/conf directory and, in iprscan.conf, change the "workserver" tag from the old value (http://fido.ebi.ac.uk:4000) to your domain (http://foo.bar.com):

workserver=http://fido.ebi.ac.uk:4000

becomes

workserver=http://foo.bar.com

You can also specify a port to listen on, and use an https URL.

The images and logos are not displaying in the web interface

Make sure you put the correct path to your installation's image folder in the conf/iprscan.conf file so that it is visible from your webserver. All the necessary images are present in the images sub-directory of the iprscan installation.

2.3) Configuring applications

I have got the smart thresholds files under license. How do I configure iprscan to use them?

You don't need to do much. Just rename THRESHOLDS to smart.thresholds and DESCRIPTIONS to smart.desc and put them in your data directory. Then, in conf/hmmsmart.conf, edit the line which starts "evalue=" and remove the e-value specified (but not the tag).

How can I plug in SignalP / TMHMM predictions?

The InterProScan package provides all required scripts/parsers for the methods but you have to contact the authors (software@cbs.dtu.dk) to get the programs and data since they are not publicly available.

Installation:

SignalP :

Save the signalp shell script into iprscan/bin/binaries or make a soft link to your installation.
You should not have to do anything with the data, as they are in your signalp-vxx package by default and the shell script will refer to them directly.
Open the iprscan/conf/signalp.conf file and check that all the paths etc. are correct. You can use different versions of SignalP: v1.X, v2.X or newer. To do so, search for the signalp.version tag in iprscan/conf/signalp.conf and change its value (1 for v1.X, 2 for v2.X or newer).

NOTE: SignalP version 2.0 and newer limit the number of submitted sequences to 4000. If you don't want any restrictions you can try to hack the code by editing the signalp shell script. Search for:

    # Maximal number of sequences (command line and WWW):
    # Leave it empty if you don't want any limitation on the max number of input sequences (huge analysis).
    # Default (4000)
    #MAXSEQ=4000
    MAXSEQ=
    MAXWWWSEQ=4000

    #We check if the $MAXSEQ is set. If not it means we don't want any limitations in the number
    #of input sequences.
    if [ "$MAXSEQ" != "" ]
        then
        if [ $NSEQ -gt $MAXSEQ ]
            then
            echo signalp: too many sequences, the limit is $MAXSEQ
            exit 1
        elif [ "$WWW" -a \( "$NSEQ" -gt "$MAXWWWSEQ" \) ]
            then
            cat $SIGNALP/doc/wwwtoomany.html | sed 's/_NUM_/'$MAXWWWSEQ'/'
            exit
        fi
    fi

TMHMM :

Save decodeanhmm binary in iprscan/bin/binaries or make a soft link to your installation.
Save TMHMM2.X.X.model in your iprscan/data directory.
Uncomment the applications in the header of CONFIG.pl and run it.
Check which version of the model is in the tmhmm.conf file and alter it if necessary (alter the line beginning "modelfile=")

Finally, for both, edit the tags in the configuration files (signalp.conf, tmhmm.conf) to reflect your system: set "queue" (local/lsf42/pbs54/sge), and then either "host.exec" for a local implementation or "queue.name" and/or "resource" for a queueing system. To get an idea of what this should look like, have a look at the other applications' configuration files.

Please note that you must choose whether you will use the NN (neural network) or the HMM method of SignalP - you cannot run both together.

See: README for the relevant URLs and references.

Can I use more up to date source databases?

Yes. Just save the updated files under the same names and run index_data.pl manually. The only problem is that you will get more hits from signatures without corresponding InterPro records (reported as NULL, as they aren't integrated yet).

Also, please note that you must index most of the HMM databases so they are converted to binary format by default. If you wish to avoid this, you must edit the .conf file for that database and remove the ".bin" from the database filename.

Can I use a sequence formatting or translation tool other than the ones provided?

We developed the new InterProScan around two EMBOSS tools because they are fast, robust, free, maintained and used by a lot of people. However, you do not have to use them: you can use your own tools to reformat (not mandatory) or translate your sequences. If you don't reformat your sequences, the headers of certain sequences may cause errors with certain applications.

Formatting sequences:

Open the iprscan/conf/iprscan.conf file and search for "formatcmd". The original sequence formatting command is a shell script that calls the seqret tool from the EMBOSS package. This script reads the input sequences and writes them to the InterProScan output sequence file, i.e.

    formatcmd=[%env IPRSCAN_HOME]/conf/seqret.sh  $in > $out

So, to replace it with your own formatting script (let's call it myscript), you can wrap it into a shell script (like we did for seqret - have a look at the shell script to see what we did) or have a simple script taking options. You can put it in the bin directory together with the other scripts (iprscan/bin). Alternatively, [%env IPRSCAN_HOME] can be replaced by another path where your script is located. [%env IPRSCAN_HOME] refers to the IPRSCAN_HOME environment variable, which is the path where InterProScan is installed.

    formatcmd=[%env IPRSCAN_HOME]/[bin|conf]/myscript [options?] $in > $out

[options] : any options your script needs in order to receive the input sequence as a parameter (e.g. -i, -input, -seqfile ...).

NOTE: Leave "$in" as it is. InterProScan replaces it with the real path of the input sequence file. So the command would be: myscript -i $in > $out.

Your script MUST be able to write its results either to standard output or to a specified file. IN ALL CASES, you must leave "$out" as it is: this is the output file where the results will be written, and InterProScan will replace it with the right file name. So, the possible forms are: myscript -i $in [ > $out | -o $out | -output $out | -out $out ...]
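
To make the requirements concrete, here is a minimal Python sketch of what a replacement "myscript" might look like. It is a hypothetical example, not a copy of what seqret does: the clean-up it performs (keeping only the first word of each header and re-wrapping the residues) and the option names -i and -o are assumptions chosen for illustration.

```python
import argparse
import sys

def reformat_fasta(text: str, width: int = 60) -> str:
    """Keep only the first word of each header and re-wrap the residues."""
    records, header, residues = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(residues)))
            words = line[1:].split()
            header, residues = (words[0] if words else "unnamed"), []
        elif header is not None:
            residues.append(line)
    if header is not None:
        records.append((header, "".join(residues)))
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""

def main(argv=None):
    parser = argparse.ArgumentParser(description="hypothetical formatcmd replacement")
    parser.add_argument("-i", dest="infile", required=True)
    parser.add_argument("-o", dest="outfile")  # optional; default is standard output
    args = parser.parse_args(argv)
    with open(args.infile) as handle:
        result = reformat_fasta(handle.read())
    if args.outfile:
        with open(args.outfile, "w") as handle:
            handle.write(result)
    else:
        sys.stdout.write(result)

# Run as a command-line tool only when arguments are actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Following the pattern given above, such a script could be referenced as "myscript -i $in > $out" or "myscript -i $in -o $out".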

Translate sequences:

Open the iprscan/conf/iprscan.conf file and search for "translatecmd". The original sequence translation command is a shell script that calls the sixpack tool from the EMBOSS package. This script reads the input sequences and writes the translated sequences to the InterProScan translated output file.

      translatecmd=[%env IPRSCAN_HOME]/conf/sixpack.sh -table $table -orfminsize $trlen  -outseq $out $in

So, to use your own translation tool, you can either wrap it in a shell script which calls it with the right options (in case your script needs some) or just write out the whole command line literally. Your script can be installed in either the conf or the bin directory and should be executable. Also, [%env IPRSCAN_HOME] can be replaced by another path where your script is located. [%env IPRSCAN_HOME] refers to the IPRSCAN_HOME environment variable, which is the path where InterProScan is installed.

    translatecmd=[%env IPRSCAN_HOME]/[bin|conf]/myscript [options?]

[options] : any options your script needs in order to receive the input sequence as a parameter (e.g. -i, -input, -seqfile ...).

NOTE: Leave "$in" and "$out" as they are. InterProScan replaces them with the real paths of the input and output files.

Additionally, you can specify (or not) a translation table (see http://www.ebi.ac.uk/cgi-bin/mutations/trtables.cgi) and a minimum length for the translated sequence (-table and -orfminsize in our example). If your script has options for the table code and the minimum ORF length, you must use $table and $trlen as the values of those options, as InterProScan replaces them automatically when reading the configuration file (e.g. myscript -i $in -out $out -tablevalue $table -minlengthforORF $trlen).
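
The placeholder replacement described above can be mimicked with Python's string.Template, as in the sketch below. This only illustrates the idea; it is not how InterProScan itself is implemented, and the file paths and the values for the table and minimum length are made up.

```python
from string import Template

def expand_command(command: str, values: dict) -> str:
    """Fill in $in, $out, $table and $trlen; raises KeyError if one is missing."""
    return Template(command).substitute(values)

translatecmd = "myscript -i $in -out $out -tablevalue $table -minlengthforORF $trlen"
values = {
    "in": "/tmp/session/input.seq",        # hypothetical input path
    "out": "/tmp/session/translated.seq",  # hypothetical output path
    "table": 0,                            # hypothetical table code
    "trlen": 50,                           # hypothetical minimum ORF length
}
print(expand_command(translatecmd, values))
```

The sketch also shows why the placeholders must be left exactly as written: a misspelt placeholder would simply not be recognised and replaced.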

If you have problems, contact interhelp@ebi.ac.uk

I would like to use more than one cpu for my hmmer searches using InterProScan. Is it possible to configure it?

Yes, of course. Applications using hmmpfam or hmmsearch are configurable. You just need to change the "cpu_opt" tag in the configuration file of the application concerned.

Configuration files supporting this option are listed below :

gene3d.conf (GENE3D)
hmmpanther.conf (Panther)
hmmpir.conf (PIR superfamily)
hmmpfam.conf (Pfam)
hmmsmart.conf (Smart)
hmmtigr.conf (Tigr)
superfamily.conf (SCOP/SUPERFAMILY)

If this tag "cpu_opt" value is empty (default) the --cpu option is not used. NOTE: By default, PIR is set to --cpu 1.

2.4) Configuring access and permissions

I would like to prevent some of the session directories InterProScan created from being removed. Can I do it quickly?

Yes :) of course, by editing iprscan/conf/tooldefault.conf. Search for the "dirmode" tag and set the directory permissions you want (default is 775). You can change the umask value as well.

I would like to have different rights on the session directory to avoid other people looking in it.

The rights for the date and session directories are stored in the tooldefault.conf file under the "dirmode" tag. The default value for this tag is 777 and the umask is set to 000. This means that anybody can create/remove any directory under iprscan/tmp.

If you want to protect your session directory, open iprscan.conf, edit "usermode" and set the value you want. If no value is set, iprscan will use the default one stored in tooldefault.conf.
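
If the octal values are unfamiliar, the following Python sketch shows what a few modes mean for a directory. It assumes that the mode and umask combine in the usual Unix way (mode AND NOT umask); whether InterProScan applies them exactly like this is an assumption, and the function name describe is hypothetical.

```python
import stat

def describe(dirmode: int, umask: int = 0) -> str:
    """Show the permission string a directory gets for a given mode and umask."""
    effective = dirmode & ~umask
    return stat.filemode(stat.S_IFDIR | effective)

print(describe(0o777, 0o000))  # drwxrwxrwx : anybody can create/remove entries
print(describe(0o775, 0o000))  # drwxrwxr-x : others can look but not modify
print(describe(0o700, 0o000))  # drwx------ : owner only
print(describe(0o777, 0o027))  # drwxr-x--- : the umask strips group write and all "other" access
```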

I would like to configure limits to enforce a maximum number of input sequences allowed to be given by the user. Is this possible?

Yes. You can do it when you install InterProScan or, if you skipped it during installation, by manually editing the iprscan.conf file. You can limit:

Maximum number of protein input sequences, by changing the value of "maxinputseqs.aa".
Maximum number of nucleotide input sequences, by changing the value of "maxinputseqs.nt".
Maximum length of nucleotide input sequences, by changing the value of "maxseqlen.nt".
Minimum length of protein input sequences, by changing the value of "minseqlen.aa".

You can also set the default minimum length for an ORF ("minorfsize") used when nucleotide sequences are translated, as well as the default codon table to use for translation ("codon.table").
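
As a rough picture of what these limits do, the Python sketch below checks a batch of sequences against them. Only the tag names come from the text above; the numeric values, the function name check_input and the wording of the messages are hypothetical.

```python
limits = {  # hypothetical values; only the tag names come from iprscan.conf
    "maxinputseqs.aa": 1000,
    "maxinputseqs.nt": 100,
    "maxseqlen.nt": 10000,
    "minseqlen.aa": 5,
}

def check_input(sequences, seqtype, limits):
    """Return a list of problems; an empty list means the input is acceptable."""
    problems = []
    if len(sequences) > limits[f"maxinputseqs.{seqtype}"]:
        problems.append("too many sequences")
    for name, seq in sequences:
        if seqtype == "nt" and len(seq) > limits["maxseqlen.nt"]:
            problems.append(f"{name}: longer than maxseqlen.nt")
        if seqtype == "aa" and len(seq) < limits["minseqlen.aa"]:
            problems.append(f"{name}: shorter than minseqlen.aa")
    return problems

print(check_input([("p1", "MKTAYIAK"), ("p2", "MK")], "aa", limits))
```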

I would like to apply a time limit to running jobs. How can I do it?

It is quite simple. Edit iprscan.conf and set "job.time.limit" to 1. Then configure the two following tags: "pollinterval" (sleeping time in seconds between job checks) and "maxpollrounds" (number of times jobs are checked).

NOTE: BE AWARE THAT THIS CONFIGURATION IS NOT POSSIBLE WITH INSTALLATIONS USING THE "local" QUEUE!!! (MAY BE ADDED LATER).
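
A quick way to choose the two values is to work out the total waiting time they imply. The sketch below assumes that a job is given up on after the last polling round, so that the limit is roughly the product of the two tags; this reading of the tags is an assumption, and the function name and numbers are hypothetical.

```python
def max_wait_seconds(pollinterval: int, maxpollrounds: int) -> int:
    """Approximate upper bound on waiting time, assuming one sleep per round."""
    return pollinterval * maxpollrounds

print(max_wait_seconds(30, 120))  # 3600 seconds, i.e. about one hour
```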

2.5) Common errors seen

I am having problems with FingerPRINTScan on Linux.

Try replacing the binary with the correct one from ftp://bioinf.man.ac.uk/pub/fingerPRINTScan/binaries/Linux/ .

I am getting messages like: "Can't locate loadable object for module DB_File in @INC..." or "Can't locate auto/DB_File/autosplit.ix in @INC..." or "your libdb and db.h file are not compatible"

Check your installation of perl. You must ensure that it has all the necessary modules (e.g. DB_File.pm & BerkeleyDB) and that DynaLoader actually picks them up.

The index_data.pl gives errors when it is running. What should I do to stop it doing this?

Errors can occur for various reasons:

"ERROR: No action precised to do (indexing or binary conversion?)": You have to tell the script what you want it to do. It can either index data files for InterProScan (-inx option) or convert ASCII HMM model files to binary (-bin option), which speeds up the analysis by up to 40%.
"ERROR: No path set for configuration file": Either the "iprscan.home" tag is not set in your iprscan.conf file or the value "$ENV{IPRSCAN_HOME}" is not set. Check whether the "BEGIN" statement at the beginning of the script points to the directory where InterProScan is installed.
"ERROR: File [XX] not found : No such file or directory": Check that the file XX is in your data directory or, if you made a soft link, that the link is still correct.
"ERROR: This file [YYY] is not supported by InterProScan": This file is not in the list of indexed files that InterProScan uses. You can get a list of supported files by typing "./index_data.pl -h". If you want to index this file see section "How to index new file".
"ERROR: No tag input record delimeter [recdel.interpro] configured in index.conf": Check in the index.conf file that the tag is set. It should be set as ""
"ERROR: Index file and output file have different size, index out of date?": A previously indexed file already exists but does not match the original data file. Find out which of the two files (index or data file) is newer. If you want to recreate an index from an old data file, use the -iforce option to force the script to remove the old index file and create a new one.
"ERROR: Cannot convert XX, hmmconvert binary does not exist or not executable : No such file or directory": Check that your bin/binaries/hmmconvert binary exists, or that any soft link you created for it is still in place.
"Blast indexing ERROR on prodom.ipr : FormatBlastDB: Formatdb command [YYY] failed: No such file or directory at ./index_data.pl line 219.": This can happen even if you have the files in your data directory. Check that the file exists. If it does, check that you have a soft link called "binaries" in the bin directory which points to the directory containing the binaries for your platform (if you do not select any applications when installing, this soft link is not created automatically).

I have installed the EMBOSS package and also seqret and sixpack but I get errors from InterProScan about them. What is the problem?

Check the environment variables EMBOSS_ROOT and EMBOSS_ACDROOT in the seqret and sixpack shell scripts located in your iprscan/conf directory, and make sure that they point to the right directories. EMBOSS_ROOT is the root directory where your EMBOSS package is installed and EMBOSS_ACDROOT is the directory where the acd directory (needed by all EMBOSS applications) is located.

InterProScan gives me a report file containing some weird errors from FingerPRINTScan, such as: ERROR: Calculation has exceeded maximum allowed complexity Fingerprint PRICHEXTENSN matches this sequence..

This is not a real error; it is just a warning. Don't worry about it.

I get the following error: "supervise: doRawResults: failed to create raw result: Parsing Problem for Panther with location "

This error is usually caused when the Panther data has not been installed correctly. Please make sure you have downloaded the PANTHER-specific file from the DATA directory on the FTP site. Untar and unzip it as you would during install. This should fix the problem.

I am running Solaris and during InterProScan install, the following error is seen: "bin/index_data.pl -bin sh: /dev/null: bad number ERROR: Problem during the conversion of file /[myinstalldir]/Pfam : No such file or directory "

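During install, the configuration script now runs all file indexing (so that you don't need to download the indices - you can build them yourself). On some versions of Solaris, this call to index_data.pl causes InterProScan to crash. The error message suggests that the system shell (sh) does not accept the ">&" redirection used in the call. You can fix it by removing the redirection, i.e. by changing the following: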

if(system(\"$path/bin/binaries/hmmconvert -b $f $f.bin >& /dev/null\")){

to:

if(system(\"$path/bin/binaries/hmmconvert -b $f $f.bin\")){

I get the following error when trying to run TMHMM: "Errors : cat: output error (0/218 characters written) Broken pipe"

This is an error from TMHMM directly. Check that the model you have in your directory and the model specified in iprscan/conf/tmhmm.conf have the same name. If not, edit the .conf file.

2.6) Improving performance

Pfam searching is taking a really long time. Why?

In order to better mirror the results of the Pfam server, InterProScan now searches both LS and FS models (for more information on LS and FS models, see the Pfam website). This doubles the number of models to be searched, and hence the search time. If you do not wish to search the FS models (you will not miss too many hits if you don't), replace the file that comes with InterProScan ("Pfam") with a file from the Pfam ftp site (Pfam_ls): rename Pfam_ls to "Pfam" and put it in the iprscan/data directory. You will then need to run bin/index_data.pl -inx -bin -f Pfam in order to index the file for InterProScan and convert it into the HMMER binary format. The binary conversion uses hmmconvert; if it is not present in your bin archive, please download the hmmer package for your platform from http://hmmer.wustl.edu/ and copy it into iprscan/bin/binaries.

My searches are taking a really long time. Any ideas how I can improve the speed?

Try experimenting with the "chunk" size setting so that you minimise the number of jobs created by InterProScan and the memory footprint of the program.
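
To get a feel for the trade-off, the Python sketch below shows how the chunk size changes the number of jobs. It assumes a simple model of one job per chunk per application, which may not match exactly what InterProScan does; the function names and all the numbers are hypothetical.

```python
import math

def chunked(items, chunk_size):
    """Split a list into consecutive pieces of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def job_count(n_sequences, chunk_size, n_applications):
    """Assumed model: one job per chunk per application."""
    return math.ceil(n_sequences / chunk_size) * n_applications

# 5000 sequences and 12 applications are made-up figures.
for size in (10, 100, 1000):
    print(size, job_count(5000, size, 12))
```

Under this model a larger chunk means fewer jobs to schedule, but each job holds more sequences at once, so the best value depends on your machines and queue.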