Index - Indexation of files tool
use Index;
This module helps you index a file with key=seqid and value=position in the file. You need the DB_File module (Berkeley DB) installed to use it.
#e.g. Indexation of this file
#>1433_OENHO Wap_rat
#MATAPSPREENVYLAKLAEQAERYEEMVEFMEKVCAAADSEELTVEERNLLSVAYKNVIG
#...
#>41_HUMAN Wap_human
#MTTEKSLVTEAENSQHQQKEEGEEAINSGQQEPQQEESCQTAAEGDNWCEQKLKASNGDT
#...
#keys=1433_OENHO & 41_HUMAN
# examples here
#The Index module does not implement a 'new' method, so you need to call a specific module.
#Here we show how to index a Fasta file.
use Index::Fasta;
#For Fasta files:
my ($res, $index) = Index::Fasta->new($file);
#or
my ($res, $index) = Index::Fasta->new();
die $index unless $res;
my $msg;
($res, $msg) = $index->setFile($file);
die $msg unless $res;
#If you already have a hash table with key=word, value=position in the file
($res, $msg) = $index->indexOut(\%inx);
die $msg unless $res;
#otherwise
my $r_array = ['ac', 'name'];
($res, $msg) = $index->buildIndex($r_array); #array reference containing the list of keys to index. Keys are defined in the $Index::regexp hash table.
die $msg unless $res;
#or
my $r_hash = { 'id' => '>([^\s]+)', 'name' => '>[^\s]+\s+(\w+)' };
($res, $msg) = $index->buildIndex($r_hash); #hash reference with key => regexp to index. Will index id and name as described in the regexps.
die $msg unless $res;
#On success, buildIndex returns a reference to the index hash in $msg; pass it to indexOut
($res, $msg) = $index->indexOut($msg);
die $msg unless $res;
my $entry;
#Case sensitive
($res, $entry) = $index->getEntry('1433_OENHO');
die $entry unless $res;
#Case insensitive
($res, $entry) = $index->getEntry('1433_OENHO', 1);
die $entry unless $res;
print $entry->[0];
#should print:
#>1433_OENHO Wap_rat
#MATAPSPREENVYLAKLAEQAERYEEMVEFMEKVCAAADSEELTVEERNLLSVAYKNVIG
#Parse the entry using the default rules ($Index::regexp) and keep a hash table in memory
#whose keys are the keys of $Index::regexp and whose values are those found in the entry
($res, $msg) = $index->parseFields(\$entry);
die $msg unless $res;
#Equivalent to
my ($id, $name);
($res, $id) = $index->getField('id', \$entry); #Saves the user from calling parseFields first
die $id unless $res;
#Getting fields...
($res, $id) = $index->getField('id');
die $id unless $res;
($res, $name) = $index->getField('name');
die $name unless $res;
#equivalent to:
my $list;
($res, $list) = $index->getField(['id', 'name']); #Returns the id and name values from the entry in the same order as the array of keys ['id', 'name']
die $list unless $res;
#$list will be equal to ['1433_OENHO', 'Wap_rat']
#equivalent to the following two calls
($res, $id) = $index->get_id(); #uses the AUTOLOAD function to retrieve only id from the previously parsed entry
die $id unless $res;
($res, $name) = $index->get_name(); #uses the AUTOLOAD function to retrieve only name from the previously parsed entry
die $name unless $res;
#Can also be used as follows
($res, $name) = $index->get_name(\$entry); #Same behavior as AUTOLOAD, but first parses the "new" $entry if it differs from the previous one
die $name unless $res;
$Id: Index.pm.html,v 1.1.1.1 2005/08/18 13:18:25 hunter Exp $
Copyright (c) European Bioinformatics Institute 2004
Emmanuel Quevillon <tuco@ebi.ac.uk>
Description: Create a new Index object. The user needs to call an inheriting module.
Arguments:
Returns: 1, $self on success
         0, msg on failure
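The (status, value) return convention described above can be sketched as follows. This is only an illustration under stated assumptions: My::Index::Demo is a hypothetical stand-in, not the real Index::Fasta, and its file check is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an inheriting module; not the real Index::Fasta.
package My::Index::Demo;

sub new {
    my ($class, $file) = @_;
    # Report failure as (0, message) instead of dying.
    return (0, "File '$file' does not exist") if defined($file) && !-e $file;
    my $self = bless { file => $file }, $class;
    return (1, $self);
}

package main;

# No file given yet: construction succeeds and the second value is the object.
my ($res, $index) = My::Index::Demo->new();

# Missing file: the first value is 0 and the second value is the message.
my ($bad, $msg) = My::Index::Demo->new('/no/such/file.fasta');

print "created a ", ref($index), " object\n" if $res;
print "failed: $msg\n" unless $bad;
```

Because the second value is either the object or the error message, the caller can write `die $index unless $res;` as the synopsis does.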
Description: Set the original file to be indexed and the name of the future indexed file. These names are used later by other subroutines.
Arguments: $file Path to the file.
Returns: 1, on success
         0, msg on failure
Description: Return the path to the file to be indexed. Also checks that the file is set and exists.
Arguments:
Returns: 1, path to the file on success
         0, msg on failure
Description: Set the input record delimiter. You can store two keys (dumper and/or building). Sometimes two different record delimiters are needed: one to build the indexed file (building) and another to delimit an entry (dumper).
Arguments: $key (supported keys: dumper (used by getEntry) and building (used when creating the index file)).
           $recdel
Returns: 1, '' on success
         0, msg on failure
Description: Get the input record delimiter associated with a key (dumper or building).
Arguments: $key (supported keys: dumper (used by getEntry) and building (used when creating the index file)).
Returns: 1, $recdel on success
         0, error if no key is given or the key is not known.
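To illustrate why two record delimiters can be useful, here is a minimal sketch that reads whole FASTA entries by setting Perl's input record separator `$/`. The `%recdel` hash is a hypothetical way of storing the two keys named above; how the module stores them internally is not shown in this document.

```perl
use strict;
use warnings;

# Hypothetical storage for the two delimiters; only the key names come from the text.
my %recdel = (
    building => "\n",     # read line by line while building the index
    dumper   => "\n>",    # read one whole FASTA entry when dumping
);

my $fasta = ">1433_OENHO Wap_rat\nMATAPSPREENVY\n>41_HUMAN Wap_human\nMTTEKSLVTEAEN\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = $recdel{dumper};           # each read now stops after "\n>"
    while (defined(my $rec = <$fh>)) {
        chomp $rec;                       # chomp strips a trailing "\n>" if present
        push @records, $rec;
    }
}
close $fh;

printf "%d records\n", scalar @records;
```

Note that the leading '>' of every record after the first is consumed as part of the previous record's delimiter, so code that needs it has to add it back.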
Description: Check that the file to index is set and exists.
Arguments: $file path to file
Returns: 1, path to the file on success
         0, msg on failure
Description: Index the input sequence by description field.
Ex:
>wap_rat blah blah
rthgensvcdawwq....
......
>another sequence
...
Will index the position of the 'wap_rat' word.
Arguments: $inx reference to a hash table containing key=word, value=position in the file. To get the position in the file you can use the Perl built-in 'tell' function.
Returns: 1, on success
         0, msg on failure
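A hash of the kind described above (key=word, value=position) could be built with `tell` roughly as follows. This is a sketch with made-up sample data held in an in-memory file; it does not call the module, and the `%inx` name is borrowed from the synopsis only for readability.

```perl
use strict;
use warnings;

my $fasta = ">wap_rat blah blah\nrthgensvcdawwq\n>another sequence\nmkvl\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my %inx;
while (1) {
    my $pos  = tell $fh;                      # byte offset of the line about to be read
    my $line = <$fh>;
    last unless defined $line;
    $inx{$1} = $pos if $line =~ /^>(\S+)/;    # first word of the description line
}
close $fh;

print "$_ => $inx{$_}\n" for sort keys %inx;
```

Calling `tell` before each read matters: after the read, the handle already points past the description line.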
Description: Read a file and build a hash table reflecting the index file.
Arguments: $r_list reference to a hash table (key => regexp to index) or to an array containing a list of keys present in the $regexp hash table.
           $file file to index (optional). By default, the file set in the 'new' method is used.
Returns: 1, ref to a hash on success
         0, msg on failure
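The key => regexp form of the argument can be pictured with the sketch below, which applies each pattern to every line and records the offset of the matching line. The nested layout `$index{key}{word} = position` is an assumption made for the example; the structure actually returned by buildIndex is not documented here.

```perl
use strict;
use warnings;

# Same key => regexp pairs as the $r_hash example in the synopsis.
my %regexp = (
    id   => qr/>([^\s]+)/,
    name => qr/>[^\s]+\s+(\w+)/,
);

my $fasta = ">1433_OENHO Wap_rat\nMATAPS\n>41_HUMAN Wap_human\nMTTEKS\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my %index;    # assumed layout: $index{key}{word} = position
while (1) {
    my $pos  = tell $fh;
    my $line = <$fh>;
    last unless defined $line;
    for my $key (keys %regexp) {
        # The capture group of each pattern is the word to index.
        $index{$key}{$1} = $pos if $line =~ $regexp{$key};
    }
}
close $fh;

for my $key (sort keys %index) {
    print "$key: $_ => $index{$key}{$_}\n" for sort keys %{ $index{$key} };
}
```

With this layout an entry can be found either by its id or by its name, and both lookups lead to the same offset.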
Description: Check whether the index file is up to date by comparing the current size of the original (not indexed) file with the size stored in the indexed file (see indexOut above).
Arguments: $file Name of the file to check (not the indexed one) (optional).
Returns: 1, '', 'size & modification time' on success
         0, msg, 'size & modification time' on failure
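A freshness check of this kind can be sketched with `stat`. The helper name `is_up_to_date` and the "size mtime" string format are assumptions for illustration; the format the module actually stores is not given in this document.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns (1, '', stamp) when the stored stamp matches
# the file, and (0, msg, stamp) otherwise.
sub is_up_to_date {
    my ($file, $stored) = @_;
    my @st = stat $file;
    return (0, "Cannot stat '$file': $!", '') unless @st;
    my $current = "$st[7] $st[9]";    # size in bytes, modification time
    return (1, '', $current) if defined $stored && $stored eq $current;
    return (0, "Index is out of date for '$file'", $current);
}

my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} ">seq1\nACGT\n";
close $fh;

# Record the stamp at "indexing" time, then check against it.
my (undef, undef, $stamp) = is_up_to_date($file, undef);
my ($fresh) = is_up_to_date($file, $stamp);

# Appending an entry changes the size, so the stored stamp no longer matches.
open my $out, '>>', $file or die "Cannot append to $file: $!";
print {$out} ">seq2\nTTGA\n";
close $out;
my ($stale, $why) = is_up_to_date($file, $stamp);

print "fresh=$fresh stale=$stale ($why)\n";
```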
Description: Get the position in the flat file of the given word.
Arguments: $word word to search for.
           $insensitive if set to 1, perform a case-insensitive search.
           $file (optional) search in a specific file. This file needs to be indexed first.
Returns: 1, [ref to array of positions | position] on success
         0, msg on failure
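The difference between the case-sensitive and case-insensitive lookups, and the "position or array of positions" return value, can be sketched over a plain hash. The helper `get_positions` and the offsets in `%inx` are made up for the example and do not come from the module.

```perl
use strict;
use warnings;

# Made-up index: word => position in the file.
my %inx = ('1433_OENHO' => 0, '41_HUMAN' => 82, '41_human' => 170);

# Hypothetical helper mirroring the documented return shape.
sub get_positions {
    my ($inx, $word, $insensitive) = @_;
    my @hits;
    if ($insensitive) {
        # Compare lower-cased keys; several keys may match the same word.
        @hits = map { $inx->{$_} } grep { lc($_) eq lc($word) } sort keys %$inx;
    }
    elsif (exists $inx->{$word}) {
        @hits = ($inx->{$word});
    }
    return (0, "'$word' not found") unless @hits;
    return (1, @hits > 1 ? \@hits : $hits[0]);
}

my ($ok1, $pos)  = get_positions(\%inx, '41_HUMAN');
my ($ok2, $list) = get_positions(\%inx, '41_human', 1);
my ($ok3, $err)  = get_positions(\%inx, 'missing');

print "exact: $pos\n";
print "insensitive: @$list\n";
print "error: $err\n";
```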
Description: Get the indexed entry from the original file.
Arguments: $word What you want to retrieve.
           $insensitive Case-insensitive search?
           $recdel Input record delimiter. To specify a newline, write it as '%newline>' for a recdel like '\n>'.
           $posid Position of this word in the file (optional).
           $remove Remove the record delimiter from the retrieved entry (optional).
Returns: 1, array ref containing the entry or entries
         0, msg on failure
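Retrieving an entry from a known position amounts to a `seek` followed by one read with the record delimiter in effect. The sketch below assumes that '%newline' is simply replaced by a real newline; `read_entry` is a hypothetical helper, not the module's getEntry.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (1, array ref with the entry) or (0, msg).
sub read_entry {
    my ($fh, $pos, $recdel, $remove) = @_;
    $recdel =~ s/%newline/\n/g;               # assumed: '%newline>' becomes "\n>"
    seek($fh, $pos, 0) or return (0, "Cannot seek to $pos: $!");
    local $/ = $recdel;
    my $entry = <$fh>;
    return (0, "Nothing to read at position $pos") unless defined $entry;
    chomp $entry if $remove;                  # chomp strips the current delimiter
    return (1, [$entry]);
}

my $fasta = ">1433_OENHO Wap_rat\nMATAPS\n>41_HUMAN Wap_human\nMTTEKS\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my ($res, $entry)   = read_entry($fh, 0, '%newline>', 1);
my ($res2, $entry2) = read_entry($fh, index($fasta, '>41_HUMAN'), '%newline>', 1);
close $fh;

print $entry->[0], "\n";
```

Without `$remove`, the first entry would end with the "\n>" that belongs to the start of the next record.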
Description: Set a hash table with key -> regular expression to look for in the file to be indexed.
Arguments: $hash A reference to a hash table.
Returns: 1, '' on success
         0, msg on errors.
Description: Return the reference to a hash table containing the key/value pairs of expressions to look for in the file to be indexed.
Arguments:
Returns: 1, reference to hash table on success.
         0, msg on error
Description: Set a hash table with key -> regular expression to look for in the file to be indexed.
Arguments: $hash A reference to a hash table.
Returns: 1, '' on success
         0, msg on errors.
Description: Return the reference to a hash table containing the key/value pairs of expressions to look for in the file to be indexed.
Arguments:
Returns: 1, reference to hash table on success.
         0, msg on error
Description: Set a hash table with key -> values parsed from the entry using $rules.
Arguments: $hash A reference to a hash table.
Returns: 1, '' on success
         0, msg on errors.
Description: Return the reference to a hash table containing the key/value pairs parsed from the entry.
Arguments:
Returns: 1, reference to hash table on success.
         0, msg on error
Description: Remove the hash table containing the parsed entry from memory and set the parsed flag to 0.
Arguments:
Returns:
Description: Parse an entry (returned by getEntry) for all the keys known in $regexp. Keeps all the results in a hash table and also returns that hash table.
e.g. results from the InterPro IPR000001 entry:
{
  'short_name' => 'Kringle',
  'name' => 'Kringle',
  'taxonomy' => [ 'Arthropoda', 'Bacteria', 'Caenorhabditis elegans', 'Chordata', 'Eukaryota', 'Fruit Fly', 'Green Plants', 'Human', 'Metazoa', 'Mouse', 'Nematoda', 'Plastid Group' ],
  'ac' => 'IPR000001',
  'found' => [ 'IPR003966', 'IPR008293', 'IPR011357', 'IPR011358', 'IPR011359' ],
  'meth_ac' => [ 'PR00018', 'PS50070', 'PS00021', 'PF00051', 'PD000395', 'SM00130' ],
  'type' => 'Domain',
  'meth_name' => [ 'KRINGLE', 'KRINGLE_2', 'KRINGLE_1', 'Kringle', 'Kringle', 'KR' ]
}
When the user asks for a field (using getField), the information is picked up from this hash table.
Arguments: $entry reference to a scalar containing the entry.
Returns: 1, reference to hash table on success
         0, msg on error
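Parsing an entry against a hash of rules can be sketched as below. The flat AC/NAME/METH record format and the three patterns are invented for the example (the module's own $regexp table is not reproduced in this document), and `parse_fields` is a hypothetical helper. Keys with several matches become array references, as in the example output above.

```perl
use strict;
use warnings;

# Invented rules for an invented flat record format.
my %rules = (
    ac      => qr/^AC\s+(\S+)/m,
    name    => qr/^NAME\s+(.+)$/m,
    meth_ac => qr/^METH\s+(\S+)/m,
);

# Hypothetical helper: returns (1, hash ref of parsed fields) or (0, msg).
sub parse_fields {
    my ($entry_ref, $rules) = @_;
    return (0, 'Entry must be a scalar reference') unless ref($entry_ref) eq 'SCALAR';
    my %parsed;
    for my $key (keys %$rules) {
        # In list context, /g returns the capture of every match.
        my @hits = $$entry_ref =~ /$rules->{$key}/g;
        next unless @hits;
        $parsed{$key} = @hits > 1 ? \@hits : $hits[0];
    }
    return (1, \%parsed);
}

my $entry = "AC   IPR000001\nNAME Kringle\nMETH PR00018\nMETH PS50070\n";
my ($res, $parsed) = parse_fields(\$entry, \%rules);
die $parsed unless $res;

print "ac=$parsed->{ac} name=$parsed->{name} meth_ac=@{ $parsed->{meth_ac} }\n";
```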
Description: Check whether the entry is already parsed. The last entry parsed is kept in memory.
Arguments: $entry Scalar reference.
Returns: 1 if parsed
         0 if not parsed.
Description: Retrieve a specific field from the interpro.xml indexed file. The fields need to be defined in the regexp hash table at the beginning of the package (regular expression).
Arguments: $rule Field to search for. May be a simple key like 'ac' or a reference to an array containing a list of keys. In the latter case, the returned result is a reference to an array whose items are in the same order as the key list. If there are multiple results for a key, the result for that key is a reference to an array. e.g.: $interpro->getField(['ac', 'name', 'contains']); will return an array ref like:
[ 'IPR000001', 'Kringle', [ 'IPR003966', 'IPR008293', 'IPR011357', 'IPR011358', 'IPR011359' ] ]
$entry A reference to a scalar containing the whole entry. This saves the user from calling the parseFields method first, since parsing is then done on the fly. The last successfully parsed entry is kept in memory, so the user must be careful when calling this subroutine without an entry reference.
Returns: 1, entry for field, or reference to an array if multiple results.
         0, msg on failure
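The single-key and key-list forms can be sketched over an already parsed hash. `get_field` is a hypothetical helper and `%parsed` is sample data shortened from the example above; the module's own getField also handles parsing on the fly, which is left out here.

```perl
use strict;
use warnings;

# Sample parsed entry, shortened from the documented example.
my %parsed = (
    ac       => 'IPR000001',
    name     => 'Kringle',
    contains => [ 'IPR003966', 'IPR008293' ],
);

# Hypothetical helper: returns (1, value or array ref) or (0, msg).
sub get_field {
    my ($parsed, $rule) = @_;
    my @keys = ref($rule) eq 'ARRAY' ? @$rule : ($rule);
    for my $key (@keys) {
        return (0, "Unknown field '$key'") unless exists $parsed->{$key};
    }
    return (1, $parsed->{$rule}) unless ref $rule;
    # A hash slice keeps the values in the same order as the key list.
    return (1, [ @{$parsed}{@keys} ]);
}

my ($res, $ac)    = get_field(\%parsed, 'ac');
my ($res2, $list) = get_field(\%parsed, ['ac', 'name', 'contains']);
my ($res3, $err)  = get_field(\%parsed, 'type');

print "ac: $ac\n";
print "list: $list->[0], $list->[1], [@{ $list->[2] }]\n";
```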
Description: This generic method allows the user to retrieve any field from the entry if it is described in the $rules hash table of the module used. By default, $rules is loaded as the rules used to parse the entry. If no $rules hash table is defined, $regexp is loaded instead.
Arguments: Call it as get_id() or get_ac(), where id or ac is the key you want to retrieve. If the key is not defined in either the rules or the regexp hash table, an error is thrown. One additional argument may be given, as in get_ac(\$entries); $entries is parsed if that has not been done already (see the getField subroutine). This flexibility lets the user skip the parseFields subroutine and pass the entry to parse directly to getField.
Returns: 1, reference to an array on success.
         0, msg on error.
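Accessors such as get_id() can be provided through Perl's AUTOLOAD mechanism, roughly as in the sketch below. My::Entry::Demo is a hypothetical class holding an already parsed entry; it returns the stored value directly and does not reproduce the module's on-the-fly parsing or its rules/regexp fallback.

```perl
use strict;
use warnings;

package My::Entry::Demo;    # hypothetical class, not part of the Index module

our $AUTOLOAD;

sub new {
    my ($class, %fields) = @_;
    return bless { parsed => {%fields} }, $class;
}

sub AUTOLOAD {
    my ($self) = @_;
    my $method = $AUTOLOAD;
    $method =~ s/.*:://;                  # strip the package name
    return if $method eq 'DESTROY';       # object destruction is not a field lookup
    my ($key) = $method =~ /^get_(\w+)$/
        or return (0, "Unknown method '$method'");
    return (0, "Unknown field '$key'") unless exists $self->{parsed}{$key};
    return (1, $self->{parsed}{$key});
}

package main;

my $entry = My::Entry::Demo->new(id => '1433_OENHO', name => 'Wap_rat');

my ($res, $id)   = $entry->get_id();      # resolved by AUTOLOAD
my ($res2, $nm)  = $entry->get_name();
my ($bad, $msg)  = $entry->get_ac();      # 'ac' was never parsed

print "id=$id name=$nm\n";
print "error: $msg\n" unless $bad;
```

The advantage of this design is that adding a key to the rules makes a matching get_ accessor available without writing a new subroutine.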