Prosite.pm -
functions to scan sequences against the PROSITE database
use Prosite;
use Prosite 1.0; #specify a version number
my $re = prositeToRegexp("D-[EFG]-{Q}");
my $seq = "FDEGGDEDFGDEDGDEQEEDEDGEGEG";
my $hits = scanPattern($re, $seq);
for my $hit (@$hits) {
    my ($subseq, $from, $to) = @$hit;
    print "$from - $to: $subseq\n";
}
This package supplies methods to scan amino acid sequences against the
PROSITE database of protein families and domains. PROSITE consists of
biologically significant sites, patterns and profiles that help to
reliably identify to which known protein family (if any) a new sequence
belongs.
PROSITE currently contains three classes of identification tools:
- Patterns
-
These are a subset of regular expressions. PROSITE defines the patterns
but not the way those patterns have to be matched. Therefore the
greediness of the match and the handling of overlapping matches is left
to the implementation. This module gives the user control over these
parameters.
- Profiles
-
Scanning a sequence with generalized profiles is nontrivial and
computationally intensive. Therefore, this module does not do the scan
itself, but calls the external program pfscan and parses the results.
- Rules
-
A small number of PROSITE entries have rules instead of, or
complementing, a pattern. These rules are hard-coded in this
module.
All methods are exported.
- scanPattern $pattern, $sequence[, $behavior[, $max_x]]
-
Scans a sequence with a pattern (a regular expression, NOT a pattern in
PROSITE format). Returns a reference to an array of arrays of [subsequence,
starting pos, ending pos], one per match. If each token of the pattern is
enclosed in parentheses, as is the case for the output of prositeToRegexp(),
residues of the subsequence that correspond to an ``x'' in the pattern are in
lowercase, and dashes are inserted in variable-length range matches so that
different subsequences obtained with a single pattern form a multiple
sequence alignment based on the pattern tokens.
-
$behavior controls whether the engine allows overlapping and included
matches. Allowed values are:
-
not set, 0 or undef - allow overlapping, but not included matches
-
1 - don't allow overlapping matches
-
2 - allow overlapping and included matches
-
See PATTERN MATCHING for additional details.
-
$max_x is the maximum number of X residues in the sequence that can match
a non-X position in the pattern. The default value is 1.
- scanRule $pattern, $sequence, $rule_ac
-
Scans a sequence with a PROSITE rule. Return type is the same as
scanPattern(). Known rules are PS00013, PS00002, PS00003 and PS00015.
- scanProfiles $filename
-
Reads an output file of the program pfscan from the pftools package. The pfscan
program should be run with the options -x, -z and -l so as to report the
matching sequence, the bounding positions on the profile, and the highest
matched cut-off level, respectively. The return value is a hash for which
the keys are PROSITE accession numbers and the values are arrays of arrays; for
each hit, the array contains (matching sequence, start on sequence, end on
sequence, profile identifier, start on profile, end on profile, raw score,
normalized score, cut-off level, level-tag, description text, array of
submatches). The sequence matching the profile is reported with insertions in
lowercase and deletions represented by dashes. Level-tag is not implemented.
-
The pftools package by Philipp Bucher is available at
ftp://ftp.ch.embnet.org/sib-isrec/pftools/
- prositeToRegexp $pspattern[, $notGreedy[, $preventX]]
-
Transforms a PROSITE pattern to a Perl regular expression. Note that
the syntax of the $pspattern is not checked, so that the result cannot
be guaranteed to be a valid regular expression. If $notGreedy is set
to 1, the matching will not be greedy (see PATTERN MATCHING).
If $preventX is set, no X characters in the sequence will be allowed to
match conserved positions in the pattern.
-
Returns a Perl regular expression, or `undef' if the pattern could
not be parsed. In the latter case the position and the message of
the error are found in the variables $Prosite::errpos and $Prosite::errstr.
- checkPatternSyntax $pspattern
-
Checks a pattern for syntax errors and obvious mistakes
(such as unreal ambiguities and too degenerate patterns). Returns undef
if no error was found, or a string containing a message describing
the first error that was found.
- parseProsite $psentry
-
Parses a PROSITE entry (text starting with ``ID'' and ending with ``//\n'')
and returns the following values in an array : AC, ID, type,
description, pattern, rule, PDOC, skip-flag, taxon, max-repeat, sites
(sites is an array of arrays).
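As a rough illustration of what parsing an entry involves, the sketch
below collects the lines of one entry by their two-letter line code and
extracts a few of the fields listed above. The helper name
sketch_parse_entry is hypothetical and this is not the module's
parseProsite(); it returns a hash reference rather than an array, handles
only the ID, AC, DE and PA lines, and assumes the usual PROSITE flat-file
layout. The entry shown is abridged.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's parseProsite(): groups the lines of
# one entry by their two-letter line code and extracts a few fields.
sub sketch_parse_entry {
    my ($entry) = @_;
    my %lines;
    for my $line (split /\n/, $entry) {
        last if $line =~ m{^//};                 # end-of-entry marker
        my ($code, $text) = $line =~ /^([A-Z0-9]{2})\s+(.*)$/ or next;
        $lines{$code} .= $text;                  # continuation lines are joined
    }
    my ($id, $type) = ($lines{ID} // '') =~ /^(\w+);\s*(\w+)\./;
    my ($ac)        = ($lines{AC} // '') =~ /^(\w+);/;
    (my $pattern = $lines{PA} // '') =~ s/\.$//; # drop the trailing period
    return {
        AC          => $ac,
        ID          => $id,
        type        => $type,
        description => $lines{DE},
        pattern     => $pattern,
    };
}

# Abridged entry, for illustration only.
my $entry = <<'END';
ID   ASN_GLYCOSYLATION; PATTERN.
AC   PS00001;
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
//
END

my $fields = sketch_parse_entry($entry);
print "$fields->{AC} $fields->{ID} ($fields->{type}): $fields->{pattern}\n";
```

The remaining values returned by parseProsite() (rule, PDOC, skip-flag,
taxon, max-repeat, sites) are deliberately left out of this sketch.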
The standard IUPAC one-letter codes for the amino acids are used.
The symbol ``x'' is used for a position where any amino acid is accepted.
Ambiguities are indicated by listing the acceptable amino acids for a
given position, between square brackets ``[ ]''. For example: [ALT]
stands for Ala or Leu or Thr.
Ambiguities are also indicated by listing between a pair of curly
brackets ``{ }'' the amino acids that are not accepted at a given
position. For example: {AM} stands for any amino acid except Ala and
Met.
Each element in a pattern is separated from its neighbor by a ``-''.
Repetition of an element of the pattern can be indicated by following
that element with a numerical value or a numerical range between
parentheses. Examples:
x(3) corresponds to x-x-x;
x(2,4) corresponds to x-x or x-x-x or x-x-x-x;
A(3) corresponds to A-A-A.
When a pattern is restricted to the N- or C-terminus of a sequence,
it starts with a ``<'' symbol or ends with a ``>'' symbol,
respectively.
In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside
square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means
that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' are considered.
The character * may be used to specify a range of ``zero or more''.
Thus, the pattern ``<{C}*>'' can be used to retrieve all sequences
which do not contain a cysteine. [This is not used in PROSITE, but is
supported in this module.]
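The syntax rules above map almost one-to-one onto Perl regular
expression syntax. The following is a minimal sketch of such a
translation, assuming a syntactically valid pattern. The helper name
sketch_prosite_to_regexp is hypothetical and this is not the code of
prositeToRegexp(): it does not enclose tokens in parentheses, and it
ignores greediness and the handling of X residues.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's prositeToRegexp(): translates a
# syntactically valid PROSITE pattern into a Perl regular expression string.
sub sketch_prosite_to_regexp {
    my ($pattern) = @_;
    $pattern =~ s/\.\s*$//;                      # trailing period, if any
    my $start = $pattern =~ s/^<// ? '^' : '';   # N-terminal anchor
    my $end   = $pattern =~ s/>$// ? '$' : '';   # C-terminal anchor
    my @parts;
    for my $element (split /-/, $pattern) {
        # Separate the residue specification from an optional repeat count.
        my ($residues, $repeat) =
            $element =~ /^([^(*]+)(\(\d+(?:,\d+)?\)|\*)?$/
            or return;                           # unexpected element
        my $re;
        if    ($residues eq 'x')               { $re = '.' }
        elsif ($residues =~ /^\[([A-Z]+)>\]$/) { $re = "(?:[$1]|\$)" }  # [G>]
        elsif ($residues =~ /^\[([A-Z]+)\]$/)  { $re = "[$1]" }
        elsif ($residues =~ /^\{([A-Z]+)\}$/)  { $re = "[^$1]" }
        elsif ($residues =~ /^[A-Z]$/)         { $re = $residues }
        else                                   { return }
        if (defined $repeat) {
            if ($repeat eq '*') {
                $re .= '*';
            }
            else {
                $repeat =~ tr/()/{}/;            # (2,4) becomes {2,4}
                $re .= $repeat;
            }
        }
        push @parts, $re;
    }
    return $start . join('', @parts) . $end;
}

for my $ps ('D-[EFG]-{Q}', 'A-x(1,3)-A', '<{C}*>', 'F-[GSTV]-P-R-L-[G>]') {
    print "$ps => ", sketch_prosite_to_regexp($ps), "\n";
}
```

The sketch returns undef (an empty list in list context) for an element
it does not recognize, which is only a convenience of this example.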
Three parameters allow fine-tuning of the behavior of
the pattern-matching engine:
- greed
-
extend variable-length pattern elements as far as possible
- overlap
-
allow partially overlapping matches
- include
-
allow matches included within one another (implies overlap)
The default behavior is greedy and allows overlaps but not included
matches. This means that, of two overlapping matches, one is rejected
if it is entirely contained within the other.
For example, consider the sequence ``ABACADAEAFA'' and the simple pattern
``A-x(1,3)-A''. The six possible combinations of the switches produce the
following results:
greed=1, overlap=1, include=0 (default) : 4 matches
ABACADAEAFA
ooooo......
..ooooo....
....ooooo..
......ooooo
greed=1, overlap=1, include=1 : 5 matches
ABACADAEAFA
ooooo......
..ooooo....
....ooooo..
......ooooo
........ooo
greed=1, overlap=0 : 2 matches
ABACADAEAFA
ooooo......
......ooooo
greed=0, overlap=1, include=0 or 1 : 5 matches
ABACADAEAFA
ooo........
..ooo......
....ooo....
......ooo..
........ooo
greed=0, overlap=0 : 3 matches
ABACADAEAFA
ooo........
....ooo....
........ooo
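The match counts above can be reproduced with plain Perl regular
expressions. The sketch below is one possible way to do it, not the
module's scanPattern() implementation. The helper name sketch_scan and
its options are hypothetical; greediness is selected by writing the
quantifier as {1,3} or {1,3}?, and at most one match is reported per
start position.

```perl
use strict;
use warnings;

# Hypothetical scanner, NOT the module's scanPattern(): returns a reference
# to an array of [start, end] pairs (1-based, inclusive).
sub sketch_scan {
    my ($regex, $seq, %opt) = @_;      # $regex is a string, e.g. 'A.{1,3}A'
    my @hits;
    my $prev_end = 0;                  # end of the last accepted match
    my $pos      = 0;                  # 0-based offset where the search resumes
    while ($pos < length $seq) {
        pos($seq) = $pos;
        last unless $seq =~ /$regex/g;
        my ($from, $to) = ($-[0] + 1, $+[0]);
        # Starts only increase, so a match is included in an earlier one
        # exactly when it does not extend past the last accepted end.
        if ($opt{include} or $to > $prev_end) {
            push @hits, [$from, $to];
            $prev_end = $to;
        }
        # Resume one residue after the match start (overlap allowed) or
        # right after the match end (no overlap); always move forward.
        my $next = $opt{overlap} ? $from : $to;
        $pos = $next > $pos ? $next : $pos + 1;
    }
    return \@hits;
}

my $seq = 'ABACADAEAFA';
for my $case (
    ['A.{1,3}A',  overlap => 1, include => 0],
    ['A.{1,3}A',  overlap => 1, include => 1],
    ['A.{1,3}A',  overlap => 0, include => 0],
    ['A.{1,3}?A', overlap => 1, include => 0],
    ['A.{1,3}?A', overlap => 0, include => 0],
) {
    my ($re, %opt) = @$case;
    my $hits = sketch_scan($re, $seq, %opt);
    printf "%-10s overlap=%d include=%d : %d matches\n",
        $re, $opt{overlap}, $opt{include}, scalar @$hits;
}
```

Run on the example sequence, this reports 4, 5, 2, 5 and 3 matches, in
the same order as the cases listed above.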
Thanks go to Marco Pagni for providing me with the pattern matching
example.
Alexandre Gattiker, gattiker@isb-sib.ch
Elisabeth Gasteiger, Elisabeth.Gasteiger@isb-sib.ch