NAME

Prosite.pm - functions to scan sequences against the PROSITE database


SYNOPSIS

 use Prosite;
 use Prosite 1.0; # alternatively, require at least version 1.0
 my $re = prositeToRegexp("D-[EFG]-{Q}");
 my $seq = "FDEGGDEDFGDEDGDEQEEDEDGEGEG";
 my $hits = scanPattern($re, $seq);
 for my $hit (@$hits){
   my ($subseq, $from, $to) = @$hit;
   print "$from - $to: $subseq\n";
 }


DESCRIPTION

This package supplies methods to scan amino acid sequences against the PROSITE database of protein families and domains. PROSITE consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

PROSITE currently contains three classes of identification tools:

Patterns
These are a subset of regular expressions. PROSITE defines the patterns but not the way those patterns have to be matched. The greediness of the match and the handling of overlapping matches are therefore left to the implementation. This module gives the user control over these parameters.

Profiles
Scanning a sequence with generalized profiles is nontrivial and computationally intensive. Therefore, this module does not perform the scan itself, but calls the external program pfscan and parses its results.

Rules
A small number of PROSITE entries have rules instead of, or in addition to, a pattern. These rules are hard-coded in this module.


METHODS

All methods are exported.

scanPattern $pattern, $sequence[, $behavior[, $max_x]]
Scans a sequence with a pattern (a regular expression, NOT a pattern in PROSITE format). Returns a reference to an array of arrays of [subsequence, starting pos, ending pos], one per match. If each token of the pattern is enclosed in parentheses, as is the case for the output of prositeToRegexp(), the residues of the subsequence that correspond to an ``x'' in the pattern are in lowercase, and dashes are inserted in variable-length range matches so that the different subsequences obtained with a single pattern form a multiple sequence alignment based on the pattern tokens.

$behavior controls whether the engine allows overlapping and included matches. Allowed values are:

 not set, 0 or undef - allow overlapping, but not included matches
 1 - don't allow overlapping matches
 2 - allow overlapping and included matches

See PATTERN MATCHING for additional details.

$max_x is the maximum number of X residues in the sequence that can match a non-X position in the pattern. The default value is 1.
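
The alignment-style formatting of the subsequence described above can be illustrated with a minimal sketch. The helper align_captures() and its token description (is_x, max) are hypothetical and not part of this module, and padding the dashes on the right of a short range match is an assumption; the module may place them differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Prosite.pm): format the capture groups of
# one match. Each element of @$tokens describes the pattern element that
# produced the capture at the same index:
#   is_x - true if the element is an ``x'' (any residue)
#   max  - maximum number of residues the element can match
sub align_captures {
    my ($tokens, $captures) = @_;
    my $aligned = '';
    for my $i (0 .. $#$tokens) {
        my $residues = $captures->[$i];
        # Residues matched by an ``x'' element are shown in lowercase
        $residues = lc $residues if $tokens->[$i]{is_x};
        # Pad short range matches with dashes (right padding is an assumption)
        my $missing = $tokens->[$i]{max} - length($residues);
        $residues .= '-' x $missing if $missing > 0;
        $aligned .= $residues;
    }
    return $aligned;
}

# Token description for the pattern A-x(1,3)-A, written by hand for the example
my @tokens = (
    { is_x => 0, max => 1 },
    { is_x => 1, max => 3 },
    { is_x => 0, max => 1 },
);
for my $subseq ('ABA', 'ABCDA') {
    my @captures = $subseq =~ /^(A)(.{1,3})(A)$/ or next;
    print align_captures(\@tokens, \@captures), "\n";
}
```

Both formatted strings have the same length, which is what lets hits of one pattern be stacked as an alignment.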

scanRule $pattern, $sequence, $rule_ac
Scans a sequence with a PROSITE rule. Return type is the same as scanPattern(). Known rules are PS00013, PS00002, PS00003 and PS00015.

scanProfiles $filename
Reads an output file of the program pfscan from the pftools package. The pfscan program should be run with the options -x, -z and -l so as to report the matching sequence, the bounding positions on the profile, and the highest matched cut-off level, respectively. The return value is a hash whose keys are PROSITE accession numbers and whose values are arrays of arrays; for each hit, the array contains (matching sequence, start on sequence, end on sequence, profile identifier, start on profile, end on profile, raw score, normalized score, cut-off level, level-tag, description text, array of submatches). The sequence matching the profile is reported with insertions in lowercase and deletions represented by dashes. Level-tag is not implemented.

The pftools package by Philipp Bucher is available at ftp://ftp.ch.embnet.org/sib-isrec/pftools/

prositeToRegexp $pspattern[, $notGreedy[, $preventX]]
Transforms a PROSITE pattern into a Perl regular expression. Note that the syntax of $pspattern is not fully checked, so the result is not guaranteed to be a valid regular expression. If $notGreedy is set to 1, the matching will not be greedy (see PATTERN MATCHING). If $preventX is set, no X characters in the sequence will be allowed to match conserved positions in the pattern.

Returns a Perl regular expression, or `undef' if the pattern could not be parsed. In the latter case, the position and the message of the error are found in the variables $Prosite::errpos and $Prosite::errstr.
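
To give an idea of what such a conversion involves, here is a minimal sketch. It is not the code of prositeToRegexp(): the function name simple_prosite_to_regexp() is hypothetical, it covers only the basic PROSITE elements (residue, [allowed set], {forbidden set}, x, repeat counts, and the < and > anchors), and it ignores $preventX and error reporting. Reporting 1-based positions in the usage loop is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Prosite.pm): translate a PROSITE pattern
# into a Perl regular expression string, one capture group per element.
# Returns undef if an element is not recognized.
sub simple_prosite_to_regexp {
    my ($pattern, $not_greedy) = @_;
    $pattern =~ s/\.\s*$//;    # a PROSITE pattern may end with a period
    my $regexp = '';
    for my $element (split /-/, $pattern, -1) {
        # Optional '<' anchor, residue specification, optional repeat, optional '>' anchor
        my ($nterm, $core, $repeat, $cterm) = $element =~
            /^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)$/
            or return;

        my $token;
        if    ($core eq 'x')              { $token = '.' }               # any residue
        elsif ($core =~ /^\{([A-Z]+)\}$/) { $token = '[^' . $1 . ']' }   # forbidden residues
        else                              { $token = $core }            # residue or [allowed set]

        if (defined $repeat) {
            $token .= '{' . $repeat . '}';
            # Only a range (n,m) can be made non-greedy
            $token .= '?' if $not_greedy && $repeat =~ /,/;
        }
        $regexp .= ($nterm ? '^' : '') . "($token)" . ($cterm ? '$' : '');
    }
    return $regexp;
}

my $re  = simple_prosite_to_regexp('D-[EFG]-{Q}');
my $seq = 'FDEGGDEDFGDEDGDEQEEDEDGEGEG';
print "regexp: $re\n";
while ($seq =~ /$re/g) {
    # @- and @+ hold the 0-based start and end offsets of the last match
    printf "%d - %d: %s\n", $-[0] + 1, $+[0], substr($seq, $-[0], $+[0] - $-[0]);
}
```

Wrapping every element in its own capture group mirrors the remark under scanPattern that each token of the pattern is enclosed in parentheses.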

checkPatternSyntax $pspattern
Checks a PROSITE pattern for syntax errors and obvious mistakes (such as unreal ambiguities and overly degenerate patterns). Returns undef if no error was found, or a string describing the first error encountered.

parseProsite $psentry
Parses a PROSITE entry (text starting with ``ID'' and ending with ``//\n'') and returns the following values in an array: AC, ID, type, description, pattern, rule, PDOC, skip-flag, taxon, max-repeat, sites (sites is an array of arrays).


PATTERN SYNTAX


PATTERN MATCHING

Three parameters allow fine-tuning of the behavior of the pattern-matching engine:

greed
extend variable-length pattern elements as far as possible

overlap
allow partially overlapping matches

include
allow matches included within one another (implies overlap)

By default, matching is greedy and allows overlapping matches but not included matches. This means that, of two overlapping matches, one is rejected if it is entirely contained within the other.
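
The effect of the three switches can be explored with a minimal sketch written in plain Perl. It is not the engine used by scanPattern(): the function scan_sketch() and its mode names are hypothetical, it reports at most one match per start position, and it treats a match as included when it ends no later than the previously kept match. The module may enumerate matches differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Prosite.pm). $mode is one of
#   'none'    - no overlapping matches
#   'overlap' - overlapping matches, but no included matches
#   'include' - overlapping and included matches
# Greed is controlled by the regular expression itself ({1,3} vs {1,3}?).
sub scan_sketch {
    my ($regexp, $sequence, $mode) = @_;
    my @hits;
    my $last_end = 0;    # end offset of the last match kept
    my $from     = 0;    # offset at which the next search starts
    while ($from < length $sequence) {
        pos($sequence) = $from;
        last unless $sequence =~ /$regexp/g;
        my ($start, $end) = ($-[0], $+[0]);
        # In 'overlap' mode, skip a match lying entirely inside the previous one
        if ($mode ne 'overlap' || $end > $last_end) {
            # [subsequence, 1-based start, 1-based end]; 1-based is an assumption
            push @hits, [ substr($sequence, $start, $end - $start), $start + 1, $end ];
            $last_end = $end;
        }
        # Resume after the match, or one residue past its start to allow overlaps
        $from = ($mode eq 'none' && $end > $start) ? $end : $start + 1;
    }
    return \@hits;
}

my $sequence = 'ABACADAEAFA';
for my $regexp ('A.{1,3}A', 'A.{1,3}?A') {    # greedy, then non-greedy
    for my $mode (qw(none overlap include)) {
        my @subseqs = map { $_->[0] } @{ scan_sketch($regexp, $sequence, $mode) };
        print "$regexp / $mode: @subseqs\n";
    }
}
```

Restarting the search one residue after the start of each match, rather than after its end, is what makes overlapping matches visible to an ordinary regular expression engine.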

For example, consider the sequence ``ABACADAEAFA'' and the simple pattern ``A-x(1,3)-A''. The six possible combinations of the switches produce the following results:


ACKNOWLEDGEMENTS

Thanks go to Marco Pagni for providing me with the pattern matching example.


AUTHORS

Alexandre Gattiker, gattiker@isb-sib.ch

Elisabeth Gasteiger, Elisabeth.Gasteiger@isb-sib.ch