ps_scan documentation
=====================

Reference:
  Alexandre Gattiker, Elisabeth Gasteiger, Amos Bairoch.
  ScanProsite: a reference implementation of a PROSITE scanning tool.
  Applied Bioinformatics 2002:1(2) 107-108.

Contact: prosite@isb-sib.ch

ps_scan is a Perl program used to scan one or several patterns, rules
and/or profiles from PROSITE against one or several protein sequences in
Swiss-Prot or FASTA format. It requires two compiled external programs
from the PFTOOLS package: "pfscan", used to scan a sequence against a
profile library, and "psa2msa", which is necessary for the "-o msa"
output format only.

Authors:
  Alexandre Gattiker
  Edouard de Castro; E-mail: ecastro@isb-sib.ch
  Elisabeth Gasteiger


Installation
------------

Download a binary package for your platform from
  ftp://ftp.expasy.org/databases/prosite/tools/ps_scan/ps_scan_<platform>.tar.gz
or the source packages from
  ftp://ftp.expasy.org/databases/prosite/tools/ps_scan/sources/
in which case you will need gcc or a compatible fortran compiler to
compile the pftools sources.

You may need to edit ps_scan.pl to provide absolute paths to the
directory where you have installed the pfscan and psa2msa executables,
unless you have stored them in a directory in your PATH.

A local copy of the PROSITE database is also needed.
You may download it in a single file from
  ftp://ftp.expasy.org/databases/prosite/release_with_updates/prosite.dat


Usage
-----

perl ps_scan.pl [options] sequence-file(s)

ps_scan version $VERSION
options:
  -h : this help screen

Input/Output:
  -e : specify the ID or AC of an entry in sequence-file
  -o : specify an output format : $formats_string
  -d : specify a prosite.dat file
  -p : specify a pattern or the AC of a prosite motif
  -f : specify a motif AC to scan against together with all its
       related post-processing motifs (but show only specified
       motif hits)
Selection:
  -r : do not scan profiles
  -m : only scan profiles
  -s : skip frequently matching (unspecific) patterns and profiles
  -l : cut-off level for profiles (default : 0)
Pattern match mode:
  -x : specify maximum number of accepted matches of X's in sequence
       (default=1)
  -g : Turn greediness off
  -v : Turn overlaps off
  -i : Allow included matches

The sequence-file may be in Swiss-Prot or FASTA format.

If no PROSITE file is submitted, it will be searched in the paths
$PROSITE/prosite.dat and $SPROT/prosite/prosite.dat.
There may be several -d, -p and -q arguments.

Pfsearch options:
  -w pfsearch : Compares a query profile against a protein sequence
                library. A profile file must be specified with
                option -d.

  $progname -w pfsearch [-C cutoff] [-R] -d profile-file seq-library-file(s)

  -R   : use raw scores rather than normalized scores for match
         selection
  -C=# : Cut-off value. Reports only matches with a score higher than
         the specified value. An integer argument is interpreted as a
         raw score value, a decimal argument as a normalized score
         value. An integer value forces option -R.


Output formats
--------------

An output format may be specified with the -o option. Here are the
available formats, with an example output with a profile :

-o scan : default ps_scan output format.

YAHD_ECOLI : PS50088 ANK_REPEAT Ankyrin repeat profile.
     12 - 37   -------LAAQQGDIDKVKTCLALGVDINTCDR  L=-1
     38 - 70   QGKTAITLASLYQQYACVQALIDAGADINKQDH  L=0
    138 - 175  VGWTPLLEAIVLNDGGikqqaIVQLLLEHGASPHLTDK  L=0
    176 - 201  YGKTPLELARERGFEEIAQLLIAAGA-------  L=0

-o fasta : matching sequences in FASTA format.

>YAHD_ECOLI/12-37 : PS50088 ANK_REPEAT L=-1 Ankyrin repeat profile.
LAAQQGDIDKVKTCLALGVDINTCDR
>YAHD_ECOLI/38-70 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
QGKTAITLASLYQQYACVQALIDAGADINKQDH
>YAHD_ECOLI/138-175 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
VGWTPLLEAIVLNDGGIKQQAIVQLLLEHGASPHLTDK
>YAHD_ECOLI/176-201 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
YGKTPLELARERGFEEIAQLLIAAGA

-o psa : profile-to-sequence-alignments with insertions in lowercase
         and deletions marked as dashes.

>YAHD_ECOLI/12-37 : PS50088 ANK_REPEAT L=-1 Ankyrin repeat profile.
-------LAAQQGDIDKVKTCLALGVDINTCDR
>YAHD_ECOLI/38-70 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
QGKTAITLASLYQQYACVQALIDAGADINKQDH
>YAHD_ECOLI/138-175 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
VGWTPLLEAIVLNDGGikqqaIVQLLLEHGASPHLTDK
>YAHD_ECOLI/176-201 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
YGKTPLELARERGFEEIAQLLIAAGA-------

-o msa : multiple sequence alignment of the matches in each sequence,
         built from the psa format with gaps marked as dots.

>YAHD_ECOLI/12-37 : PS50088 ANK_REPEAT L=-1 Ankyrin repeat profile.
-------LAAQQGDID.....KVKTCLALGVDINTCDR
>YAHD_ECOLI/38-70 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
QGKTAITLASLYQQYA.....CVQALIDAGADINKQDH
>YAHD_ECOLI/138-175 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
VGWTPLLEAIVLNDGGikqqaIVQLLLEHGASPHLTDK
>YAHD_ECOLI/176-201 : PS50088 ANK_REPEAT L=0 Ankyrin repeat profile.
YGKTPLELARERGFEE.....IAQLLIAAGA-------

-o pff : tabular format listing bounding positions on the sequence and
         the profile, the raw and normalized profile score, and the
         cut-off level.

YAHD_ECOLI   12   37  ANK_REPEAT  8  -1  196   6.653  -1
YAHD_ECOLI   38   70  ANK_REPEAT  1  -1  351  10.793   0
YAHD_ECOLI  138  175  ANK_REPEAT  1  -1  288   9.110   0
YAHD_ECOLI  176  201  ANK_REPEAT  1  -8  343  10.579   0

Pattern and rule matches are reported in the same formats, but cut-off
levels are not available. In scan, psa and msa output formats, amino
acids matching an "x" in the pattern are reported in lowercase. In pff
format only the first three columns are available.

The "-o matchlist" output format is specific to patterns and not meant
for public use.


Profile and pattern matching parameters
---------------------------------------

Several parameters can be fine-tuned.

- profile cut-off levels (-l option)

  Most profiles contain several cut-off levels. The level 0 cut-off is
  the trusted cut-off for positive matches, but a level -1 cut-off is
  usually defined as well: a match scoring above it is a potential
  match, especially if there are other matches in the sequence with
  the profile. By default, pfscan is run with the cut-off level 0, so
  that only trusted matches are reported. To retrieve potential (weak)
  matches as well, run ps_scan with the option "-l -1". Weak matches
  are then reported with the indication "L=-1".

- skip frequently matching patterns (-s option)

  Some PROSITE entries, such as those describing commonly found
  post-translational modifications (a typical example is
  N-glycosylation), are found in the majority of known protein
  sequences. While it is generally useful to note their presence, some
  programs may want, in some cases, to ignore those entries, which
  contain the line "CC /SKIP-FLAG=TRUE;".

Pattern match mode:

- pattern greediness off (-g option)

  A pattern-matching engine is said to be "greedy" if it tries to
  extend variable-length pattern elements as far as possible. The
  sequence "ABCDC" and the pattern "A-x(1,3)-C" will produce the match
  "ABCDC" with a greedy engine, and the match "ABC" with a non-greedy
  one. By default, PROSITE is scanned in greedy mode, unless the -g
  option is set.
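
  The greedy/non-greedy difference described above can be reproduced
  with an ordinary regular-expression engine. The following is only a
  minimal sketch, not the ps_scan implementation: the helper names
  prosite_to_regex and first_match are hypothetical, and the translation
  assumes a simplified PROSITE syntax (single residues, x, [..], {..}
  and (n) or (n,m) repeats, without "<" or ">" anchors).

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or range (n,m).
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)


def prosite_to_regex(pattern, greedy=True):
    """Translate a simple PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                      # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} lists forbidden residues
        if low is not None and high is not None:
            # Variable-length element: a trailing "?" makes it non-greedy.
            core += "{%s,%s}" % (low, high) + ("" if greedy else "?")
        elif low is not None:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


def first_match(pattern, sequence, greedy=True):
    """Return the first matching substring, or None if there is no match."""
    m = re.search(prosite_to_regex(pattern, greedy), sequence)
    return m.group(0) if m else None


print(first_match("A-x(1,3)-C", "ABCDC"))                # ABCDC
print(first_match("A-x(1,3)-C", "ABCDC", greedy=False))  # ABC
```

  Only the variable-length element changes between the two modes, so
  patterns without (n,m) ranges give the same result either way.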

- overlaps (-v option) and included matches (-i option)

  Some patterns may produce distinct but overlapping matches on a
  given sequence. Additionally, if the pattern contains
  variable-length elements, some of these matches may be completely
  included in another one. Different combinations of -g, -v and -i
  options may produce differences in match count and match positions
  with patterns that contain variable-length elements. An engine which
  allows overlaps should be greedy in order to reduce the number of
  multiple hits which are almost entirely overlapping except at the
  extremities.

- treatment of X characters in sequences (-x option)

  The PROSITE syntax describes how to treat ambiguities in the
  pattern, but not how to handle ambiguities in the sequence. In rare
  sequences in Swiss-Prot and other databases, the characters B and Z
  are used according to IUBMB nomenclature when a residue may be
  either Asp or Asn, or Glu or Gln, respectively. The ps_scan program
  will produce a match if the sequence has a "B" and the pattern
  allows either a "D" or a "N", or both (and similarly for Z).

  Whether the character X should be allowed to match any position of a
  pattern is more controversial. It is generally useful to accept a
  single pattern position to match X (unless that pattern position is
  an X itself, in which case we can accept more). The maximum number
  of X characters which are allowed to match a non-X position in a
  pattern can be specified with the -x option. The default value is 1.
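
  The rules for B, Z and X can be sketched as a position-by-position
  check. This is an illustration under stated assumptions, not the
  ps_scan code: the function matches_at and its max_x parameter are
  hypothetical names, the pattern is assumed to be fixed-length, and
  each pattern position is given either as None (an "x" in the
  pattern) or as the set of residues it allows.

```python
# Ambiguity codes in the sequence and the residues they may stand for.
AMBIGUOUS = {"B": {"D", "N"}, "Z": {"E", "Q"}}


def matches_at(elements, sequence, start, max_x=1):
    """Check a fixed-length pattern against sequence[start:].

    elements: list with one entry per pattern position, either None
    (pattern "x") or a set of allowed residues.
    max_x: how many sequence X characters may match non-x positions.
    """
    if start + len(elements) > len(sequence):
        return False
    x_used = 0
    for allowed, residue in zip(elements, sequence[start:]):
        if allowed is None:
            continue              # pattern x accepts anything, even X
        if residue in allowed:
            continue              # ordinary match
        if residue == "X":
            x_used += 1           # X on a non-x position is counted
            if x_used > max_x:
                return False
        elif residue in AMBIGUOUS:
            # B matches if D or N is allowed, Z if E or Q is allowed.
            if not (AMBIGUOUS[residue] & allowed):
                return False
        else:
            return False
    return True


# Pattern P-S-[QW], written out as one set per position.
pattern = [{"P"}, {"S"}, {"Q", "W"}]

print(matches_at(pattern, "PSZ", 0))           # True: Z may be Q
print(matches_at(pattern, "PXQ", 0))           # True: one X accepted
print(matches_at(pattern, "XXQ", 0))           # False: two X, max_x=1
print(matches_at(pattern, "XXQ", 0, max_x=2))  # True
```

  Raising max_x in this sketch plays the role that the text assigns to
  a larger -x value: more X characters are tolerated on non-x
  positions, at the price of less specific matches.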


Examples
--------

scan 1 sequence with 1 pattern
  perl ps_scan.pl -p PS00123 seq.dat
  perl ps_scan.pl -p "P-S-[QW]" seq.dat

scan 1 sequence with all prosite, including profiles
  perl ps_scan.pl seq.dat

scan 2 profiles or patterns against swiss-prot
  perl ps_scan.pl -p PS00123 -p PS00124 sprot.dat

scan all patterns and rules against swiss-prot
  perl ps_scan.pl -r sprot.dat

scan a profile file against a sequence file
  perl ps_scan.pl -d profile.prf sprot.dat


License
-------

The ps_scan program and the Prosite.pm module are Copyright (C)
2001-2006, Swiss Institute of Bioinformatics. They are released under
the terms of the GNU General Public License, available at
http://www.gnu.org/copyleft/gpl.html.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.