InterProScan Help

  • INTRODUCTION

    Protein signature databases have become vital tools for identifying distant relationships in novel sequences and hence are used for the classification of protein sequences and for inferring their function. InterPro streamlines the analysis of newly determined sequences for the individual user and makes a significant contribution to the demanding task of automatic annotation of predicted proteins from genome sequencing projects.

    InterProScan is a tool that combines different protein signature recognition methods native to the InterPro member databases into one resource with look up of corresponding InterPro and GO annotation.

    This form allows you to query your sequence against InterPro. For more detailed information, see the documentation for the stand-alone Perl InterProScan package (Readme file or FAQs), the InterPro user manual, or the InterPro documentation. If you wish to use this facility during a course, or if you have any problems or suggestions, please contact us at http://www.ebi.ac.uk/support/.

    InterProScan Tutorial

  • INSTRUCTIONS

    1. Enter your email.
    2. Choose between an interactive run, where you wait for your results, or an email run, where you will receive your results by email when they are ready.
    3. Select the type of sequence you are going to input, i.e. protein or nucleotide. For nucleotide sequences, please read the section on translation.
    4. Either paste your sequence into the window, or attach a file containing your sequence.
    5. You can now perform an InterProScan query.

  • YOUR SEQUENCE

    Please either enter (or cut and paste) your protein sequence into the text box, or, if you have the sequence in a file on your computer, click the 'Browse' button to upload it directly (you will be given a file selection window if you choose this option). If you need help on sequence formats, this page details various common formats.

    For multiple or bulk protein sequences you can install InterProScan locally. Please download the InterProScan software from our FTP site.

    Enter or cut and paste a protein sequence, or set of sequences, here. Supported formats include Fasta and Swiss-Prot.

  • YOUR EMAIL

    Enter your email address in the box below if you want your results sent by email. Addresses must be in the standard form, e.g. user@ebi.ac.uk. You do not need to fill in the box if you are running your search interactively.

  • RESULTS

    You can either wait for the search results to be returned in the web browser window (interactive job), or choose to have them sent to your email address on completion. The latter may be useful, as some searches will take a considerable time to complete.

    The default is interactive.

  • SEQUENCE INPUT WINDOW

    You can cut and paste or type a nucleotide or protein sequence into the large text window. A free text (raw) sequence is simply a block of characters representing a DNA/RNA or protein sequence. You may also paste a sequence in Fasta, EMBL, Swiss-Prot or GenBank format.
    Partially formatted sequences will not be accepted. Copying and pasting directly from word processors may yield unpredictable results, as hidden/control characters may be present. Adding a return to the end of the sequence may help certain applications understand the input. Some examples of common sequence formats may be seen here.
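
    As an illustration of the difference between raw and Fasta input, the following is a minimal Python sketch that splits pasted text into records. The function name parse_sequences is hypothetical and is not part of InterProScan; the sketch handles only raw and Fasta text, not EMBL, Swiss-Prot or GenBank entries, and it does not reflect how the server itself parses input.

```python
def parse_sequences(text):
    """Split pasted text into (header, sequence) pairs.

    A line starting with '>' opens a new Fasta record. Text without any
    '>' line is treated as one raw sequence whose header is None.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Close the previous record, if there is one.
            if header is not None or chunks:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            # Remove embedded spaces so wrapped sequence lines join cleanly.
            chunks.append("".join(line.split()))
    if header is not None or chunks:
        records.append((header, "".join(chunks)))
    return records
```

    For raw input the header is None, so a caller can tell the two cases apart.
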
  • UPLOAD A FILE

    You may upload a file from your computer containing a valid sequence in any supported format (Raw, Fasta, EMBL, Swiss-Prot or GenBank) using this option. Please note that this option only works with Netscape browsers or Internet Explorer version 5 or later. Files saved by some word processors may yield unpredictable results, as hidden/control characters may be present in them. It is best to save files with the Unix format option to avoid hidden Windows characters. Some examples of common sequence formats may be seen here.
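
    If a file cannot be re-saved in Unix format, the hidden characters can be removed before upload. The following Python sketch shows one way to do this; the function name clean_sequence_text is hypothetical, and the sketch assumes the file holds plain ASCII sequence text (any non-ASCII bytes are dropped).

```python
def clean_sequence_text(raw_bytes):
    """Return text with Unix line endings and no hidden control characters."""
    # Assumption: sequence files are plain ASCII; other bytes are discarded.
    text = raw_bytes.decode("ascii", errors="ignore")
    # Convert Windows (CR LF) and bare CR line endings to Unix (LF).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Keep printable characters plus newline and tab; drop everything else.
    text = "".join(ch for ch in text if ch in "\n\t" or ch.isprintable())
    # Make sure the text ends with a return.
    if not text.endswith("\n"):
        text += "\n"
    return text
```

    The cleaned text can then be written to a new file and uploaded in place of the original.
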

  • APPLICATIONS TO RUN

    A number of different protein sequence applications are launched. These applications search against specific databases and have preconfigured cut off thresholds.

    • BlastProDom
      Scans the families in the ProDom database. ProDom is a comprehensive set of protein domain families automatically generated from the Swiss-Prot and TrEMBL sequence databases using PSI-BLAST. In InterProScan, the blastpgp program is used to scan the database. blastpgp performs gapped blastp searches and can be used to perform iterative searches in PSI-BLAST and PHI-BLAST mode.

    • FPrintScan
      Scans against the fingerprints in the PRINTS database. Fingerprints are groups of motifs that, taken together, are more powerful than single motifs because they make use of the biological context inherent in a multiple-motif method.
    • HMMPIR
      Scans the hidden Markov models (HMMs) that are present in the PIR Protein Sequence Database (PSD) of functionally annotated protein sequences, PIR-PSD.

    • HMMPfam
      Scans the hidden Markov models (HMMs) that are present in the Pfam protein families database.
    • HMMSmart
      Scans the hidden Markov models (HMMs) that are present in the SMART domain/domain families database.
    • HMMTigr
      Scans the hidden Markov models (HMMs) that are present in the TIGRFAMs protein families database.
    • ProfileScan
      Scans against PROSITE profiles. These profiles are based on weight matrices and are more sensitive for the detection of divergent protein families.
    • ScanRegExp
      Scans against the regular expressions in the PROSITE protein families and domains database.
    • SuperFamily
      SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure.


  • TRANSLATION & READING FRAMES

    N.B. Nucleotide input sequences need to be converted into hypothetical protein sequences. This occurs in 6 reading frames, i.e. it results in 6 possible protein sequences.

    Every 3 bases in the DNA sequence code for 1 amino acid. Because you may not know at which position to start when predicting the protein sequence, you can start at any of 3 positions from either end of the DNA sequence. There are therefore 6 possible predicted protein sequences for such a piece of DNA, known as the 6 possible reading frames: 3 forward frames and 3 reverse-sense frames.

    e.g.
    gcagccgggcggccgcagaagcgcccaggcccgcgcgccacccct      DNA

    Forward frames
           gca gcc ggg cgg ccg cag aag cgc cca ggc ccg cgc gcc acc cct  DNA
            A   A   G   R   P   Q   K   R   P   G   P   R   A   T   P   amino acids
    
         g cag ccg ggc ggc cgc aga agc gcc cag gcc cgc gcg cca ccc ct   DNA
            Q   P   G   G   R   R   S   A   Q   A   R   A   P   P       amino acids
    
        gc agc cgg gcg gcc gca gaa gcg ccc agg ccc gcg cgc cac ccc t    DNA
            S   R   A   A   A   E   A   P   R   P   A   R   H   P       amino acids
                  
    Reverse frames (read from the reverse complement of the DNA)
           agg ggt ggc gcg cgg gcc tgg gcg ctt ctg cgg ccg ccc ggc tgc  DNA
            R   G   G   A   R   A   W   A   L   L   R   P   P   G   C   amino acids
    
        ag ggg tgg cgc gcg ggc ctg ggc gct tct gcg gcc gcc cgg ctg c    DNA
            G   W   R   A   G   L   G   A   S   A   A   A   R   L   X   amino acids
    
         a ggg gtg gcg cgc ggg cct ggg cgc ttc tgc ggc cgc ccg gct gc   DNA
            G   V   A   R   G   P   G   R   F   C   G   R   P   A   A   amino acids
               

    Example of hypothetical proteins produced from a translation:

    			
    Genetic Code table used: [0] -> Standard Genetic Code
    Frames: All Six Frames 
    
    >_1
    AAGRPQKRPGPRATP
    >_2
    QPGGRRSAQARAPP
    >_3
    SRAAAEAPRPARHP
    >_4
    RGGARAWALLRPPGC
    >_5
    GWRAGLGASAAARLX
    >_6
    GVARGPGRFCGRPAA
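
    The six-frame translation described in this section can be sketched in Python as follows. This is only an illustration using the standard genetic code, not the translation program used by the service: the names translate, reverse_complement and six_frames are hypothetical, incomplete codons at the end of a frame are skipped rather than reported, and the reverse frames are simply numbered by their offset into the reverse complement.

```python
from itertools import product

# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons from `offset`; unknown codons become 'X'."""
    dna = dna.lower().replace("u", "t")
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return 3 forward translations followed by 3 reverse translations."""
    dna = dna.lower().replace("u", "t")
    rev = reverse_complement(dna)
    forward = [translate(dna, i) for i in range(3)]
    reverse = [translate(rev, i) for i in range(3)]
    return forward + reverse
```

    For the example DNA above, the first four results agree with >_1 to >_4. The last two correspond to >_6 and >_5 without their final residue, because this sketch ignores incomplete trailing codons and orders the reverse frames by offset.
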

    Translation into protein does not apply uniformly to all organisms: the same nucleotide sequence can code for a different set of amino acids in different organisms. You can therefore translate using the standard ('Universal') genetic code or one of a selection of non-standard codes that may predict the hypothetical protein sequence more accurately. Please select the most appropriate genetic code for the species from which the sequence was obtained. More about genetic codes.

  • MIN. OPEN READING FRAME SIZE

    For example, if you set this option to 100 and a stop codon is reached before 100 nucleotide bases have been translated, the hypothetical protein sequence is discarded and the application starts translating the next piece of nucleotide sequence. Any stretch of nucleotide sequence shorter than 100 bases before a stop codon (100 bases code for 33 amino acids) is therefore excluded from the analysis.
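
    The effect of this option can be sketched in Python as follows. The function name filter_orfs and the use of '*' to mark a stop codon are assumptions made for illustration, and the exact boundary rule (here a piece is kept when it spans at least the given number of bases) may differ from the one the service applies.

```python
def filter_orfs(protein, min_orf_nt=100):
    """Split a translated frame at stop codons and drop short pieces.

    `protein` is a translated reading frame in which '*' marks a stop codon.
    `min_orf_nt` is the minimum open reading frame size in nucleotide bases.
    """
    kept = []
    for segment in protein.split("*"):
        # Each amino acid corresponds to 3 nucleotide bases.
        if len(segment) * 3 >= min_orf_nt:
            kept.append(segment)
    return kept
```

    With the default of 100, a piece of 33 amino acids (99 bases) is discarded, while a piece of 34 amino acids (102 bases) is kept.
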

  • CRC (Internal use only)

    Every sequence has a CRC (cyclic redundancy check) checksum. When a sequence is submitted to InterProScan, its CRC is checked against a precomputed list of matches of protein sequences to InterPro entries (contained in the IPRMATCHES database). If the CRC of the query sequence matches one in the precomputed results, that result is returned to the user and InterProScan is not executed. If the CRC does not match anything, InterProScan is launched on the query sequence.
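
    The lookup logic described above can be sketched in Python as follows. This is an illustration only: zlib.crc32 stands in for whichever CRC variant InterProScan actually uses, a plain dictionary stands in for the IPRMATCHES database, and the names sequence_crc, lookup_or_run and run_scan are hypothetical.

```python
import zlib


def sequence_crc(sequence):
    """Checksum of a sequence, ignoring case and whitespace."""
    normalised = "".join(sequence.split()).upper()
    # Assumption: CRC-32 is used here purely as a stand-in checksum.
    return zlib.crc32(normalised.encode("ascii"))


def lookup_or_run(sequence, precomputed, run_scan):
    """Return precomputed matches if the CRC is known, else run the scan.

    `precomputed` maps a CRC value to previously computed matches.
    `run_scan` is a callable that performs the full search on a sequence.
    """
    crc = sequence_crc(sequence)
    if crc in precomputed:
        return precomputed[crc]
    return run_scan(sequence)
```

    The benefit of this design is that a previously analysed sequence is answered from the precomputed results without launching any of the search applications.
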

  • REFERENCES

    1. The InterPro Consortium (*R.Apweiler, T.K.Attwood, A.Bairoch, A.Bateman, E.Birney, M.Biswas, P.Bucher, L.Cerutti, F.Corpet, M.D.R.Croning, R.Durbin, L.Falquet, W.Fleischmann, J.Gouzy, H.Hermjakob, N.Hulo, I.Jonassen, D.Kahn, A.Kanapin, Y.Karavidopoulou, R.Lopez, B.Marx, N.J.Mulder, T.M.Oinn, M.Pagni, F.Servant, C.J.A.Sigrist, E.M.Zdobnov),
    " The InterPro database, an integrated documentation resource for protein families, domains and functional sites",
    Nucleic Acids Research, 2001. vol 29(1):37-40.

    2. Hofmann K., Bucher P., Falquet L., and Bairoch A.,
    " The Prosite Database, Its Status in 1999".
    Nucleic Acids Res, 1999. 27(1): p. 215-9.

    3. Attwood T.K., Croning M.D., Flower D.R., Lewis A.P., Mabey J.E., Scordis P., Selley J.N., and Wright W.,
    " Prints-S: The Database Formerly Known as Prints".
    Nucleic Acids Res, 2000. 28(1): p. 225-7.

    4. Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L., "The Pfam Protein Families Database". Nucleic Acids Res, 2000. 28(1): p. 263-6.

    5. Corpet F., Gouzy J., and Kahn D.,
    " Recent Improvements of the Prodom Database of Protein Domain Families".
    Nucleic Acids Res, 1999. 27(1): p. 263-7.

    6. Schultz J., Copley R.R., Doerks T., Ponting C.P., and Bork P.,
    " Smart: A Web-Based Tool for the Study of Genetically Mobile Domains".
    Nucleic Acids Res, 2000. 28(1): p. 231-4.

    7. Bucher P., Karplus K., Moeri N., and Hofmann K.,
    " A Flexible Motif Search Technique Based on Generalised Profiles".
    Comput Chem, 1996. 20(1): p. 3-23.

    8. Scordis P., Flower D.R., and Attwood T.K.,
    "Fingerprintscan: Intelligent Searching of the Prints Motif Database".
    Bioinformatics, 1999. 15(10): p. 799-806.

    9. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J.,
    "Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs".
    Nucleic Acids Res, 1997. 25(17): p. 3389-402.

    11. Haft,D.H., Loftus,B.J., Richardson,D.L., Yang,F., Eisen,J.A., Paulsen,I.T., White,O.,
    " TIGRFAMs: a protein family resource for the functional identification of proteins".
    Nucleic. Acids. Res, 2001. 29 (1):41-3

    12. Eddy, S.R. "HMMER:
    Profile hidden Markov models for biological sequence analysis".
    WWW, 2001. http://hmmer.wustl.edu/

  • OTHER SERVICES

    This service is also available as an application from the EBI's
    SRS server: http://srs.ebi.ac.uk/