******************DRAFT*******************************

This document describes how to call the 'readdb' functions.


readdb_get_descriptor:


Boolean LIBCALL 
readdb_get_descriptor (ReadDBFILEPtr rdfp, Int4 sequence_number, SeqIdPtr PNTR id, CharPtr PNTR description)


Obtains the definition line and SeqId for a sequence via the ordinal number of the
sequence in the database.  The caller must deallocate the definition and SeqId.


To call this function and format the ID and definition line as a FASTA line one might use
a function like (where the ReadDBFILEPtr has already been properly allocated):


void
function(ReadDBFILEPtr rdfp, Int4 sequence_number)
{
	Char buffer[100];
	CharPtr defline = NULL;
	SeqIdPtr sip = NULL;

	/* Obtains the SeqIdPtr and defline from database; give up on failure. */
	if (!readdb_get_descriptor(rdfp, sequence_number, &sip, &defline))
		return;

	/* Writes a FASTA ID into buffer. */
	SeqIdWrite(sip, buffer, PRINTID_FASTA_LONG, sizeof(buffer));

	printf("%s %s\n", buffer, defline);

	/* Deallocate entire chain of SeqIdPtr's. */
	sip = SeqIdSetFree(sip);

	/* Deallocate the definition line. */
	defline = MemFree(defline);

	return;
}


This section describes the format of the databases.
+++++++++++++++++++++++++++++++++++++++++++++++++++

For protein databases, formatdb creates three main files containing the indices, sequences,
and headers, with the extensions pin, psq, and phr, respectively (for nucleotide databases
these are nin, nsq, and nhr).  A number of other ISAM indices are created, but these are
described elsewhere.

FORMAT OF THE INDEX FILE (pin or nin extension)
------------------------

1.) formatdb version number 	[4 bytes].
2.) protein dump flag (1 for a protein database, 0 for a nucleotide database)	[4 bytes].
3.) length of the database title in bytes	[4 bytes].
4.) the database title		[length given in 3.)].
5.) length of the date/time string	[4 bytes].
6.) the date/time string	[length given in 5.)].
7.) the number of sequences in the database	[4 bytes].
8.) the total length of the database in residues/basepairs	[4/8 bytes]. (1)
9.) the length of the longest sequence in the database 		[4 bytes].
10.) a list of the offsets for definitions (one for each sequence) in the header file.
There are num_of_seq+1 of these, where num_of_seq is the number of sequences given in 7.).
11.) a list of the offsets for sequences (one for each sequence) in the sequence file.
There are num_of_seq+1 of these, where num_of_seq is the number of sequences given in 7.).
12.) a list of the offsets for the ambiguity characters (one for each sequence) in the 
sequence file.  This list is only present for nucleotide databases and, since the database
is compressed 4/1 for nucleotides, allows the ambiguity characters to be restored when
the sequence is generated.  There are num_of_seq+1 of these, where num_of_seq is the 
number of sequences given in 7.).

(1) Version 3 of the BLAST databases uses 4 bytes to store the total length of
the database, while version 4 of the BLAST databases uses 8 bytes for this
purpose.
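
As an illustration of the layout above, the following is a minimal sketch (not the
toolkit's own reader) that pulls fields 1.) through 9.) out of an index file that has
already been loaded into memory.  The struct IndexHeader and the functions read_be32 and
parse_index_header are hypothetical names invented for this example.  Two assumptions are
made that this document does not state: integers are stored most-significant byte first,
and the file is a version 3 database, so field 8.) is 4 bytes wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for fields 1.) through 9.) of the index file. */
typedef struct {
    uint32_t version;             /* 1.) formatdb version number */
    uint32_t is_protein;          /* 2.) 1 = protein, 0 = nucleotide */
    uint32_t title_len;           /* 3.) */
    const unsigned char *title;   /* 4.) points into the caller's buffer, not NUL-terminated */
    uint32_t date_len;            /* 5.) */
    const unsigned char *date;    /* 6.) points into the caller's buffer, not NUL-terminated */
    uint32_t num_seqs;            /* 7.) */
    uint32_t total_len;           /* 8.) ASSUMPTION: version 3, 4-byte field */
    uint32_t max_len;             /* 9.) */
} IndexHeader;

/* ASSUMPTION: integers are stored most-significant byte first. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parses fields 1.) through 9.) from buf.  Returns the number of bytes
 * consumed (the position where list 10.) would begin), or 0 if the buffer
 * is too short to hold the fields. */
size_t parse_index_header(const unsigned char *buf, size_t len, IndexHeader *h)
{
    size_t pos = 0;

    assert(buf != NULL && h != NULL);

    if (len < 12) return 0;
    h->version    = read_be32(buf + pos); pos += 4;
    h->is_protein = read_be32(buf + pos); pos += 4;
    h->title_len  = read_be32(buf + pos); pos += 4;

    if (len - pos < h->title_len) return 0;
    h->title = buf + pos; pos += h->title_len;

    if (len - pos < 4) return 0;
    h->date_len = read_be32(buf + pos); pos += 4;

    if (len - pos < h->date_len) return 0;
    h->date = buf + pos; pos += h->date_len;

    if (len - pos < 12) return 0;
    h->num_seqs  = read_be32(buf + pos); pos += 4;
    h->total_len = read_be32(buf + pos); pos += 4;
    h->max_len   = read_be32(buf + pos); pos += 4;

    return pos;
}
```

The returned position is where the offset lists 10.) through 12.) would start, each holding
num_seqs+1 entries.  A version 4 reader would need to widen field 8.) to 8 bytes.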


FORMAT OF THE SEQUENCE FILE (psq or nsq extension)
---------------------------

There are different formats for the protein and nucleotide sequence files.

The protein sequence file is quite simple.  The first byte in the file is
a NULL byte, followed by the sequence in ncbistdaa format (described in the
NCBI Software Development Toolkit documentation).  Following the sequence is
another NULL byte, followed by the next sequence.  The file ends with a NULL
byte, following the last sequence.
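
The NULL-delimited layout described above can be walked with a simple scan.  The sketch
below is for illustration only; find_protein_sequence is a hypothetical helper, not a
toolkit function.  It assumes the whole file is in memory and that no zero byte occurs
inside a sequence (a zero value would be indistinguishable from the separator).  A real
reader would more likely jump straight to a sequence using the offsets stored in the
index file rather than scanning.

```c
#include <assert.h>
#include <stddef.h>

/* Locates the n-th sequence (counting from zero) in an in-memory protein
 * sequence file.  On success stores a pointer to the first residue and the
 * number of residues, and returns 1.  Returns 0 if the buffer does not start
 * with a NULL byte or holds fewer than n+1 terminated sequences.
 * ASSUMPTION: no zero byte appears inside a sequence. */
int find_protein_sequence(const unsigned char *buf, size_t len, size_t n,
                          const unsigned char **start, size_t *seq_len)
{
    size_t pos, begin;

    assert(buf != NULL && start != NULL && seq_len != NULL);

    if (len == 0 || buf[0] != 0)
        return 0;               /* the file must begin with a NULL byte */

    pos = 1;
    for (;;) {
        begin = pos;
        while (pos < len && buf[pos] != 0)
            pos++;
        if (pos >= len)
            return 0;           /* ran off the end without a closing NULL */
        if (n == 0) {
            *start = buf + begin;
            *seq_len = pos - begin;
            return 1;
        }
        n--;
        pos++;                  /* step over the NULL separator */
    }
}
```

The residues found this way are still ncbistdaa codes and would need to be mapped to
letters before printing.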

The nucleotide sequence file contains the nucleotide sequence, with four basepairs compressed
into one byte.  The format used is NCBI2na, documented in the NCBI Software Development Toolkit
manual.  Any ambiguity characters present in the original sequence are replaced at
random by A, C, G or T.  The true values of the ambiguity characters are stored at the
end of each sequence to allow true reproduction of the original sequence.
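
To make the 4-to-1 packing concrete, here is a small sketch of expanding NCBI2na bytes
back into letters.  unpack_ncbi2na is a hypothetical helper written for this document.
It relies on the usual NCBI2na convention (A=0, C=1, G=2, T=3, two bits per base, with
the first base in the highest-order bits of each byte); consult the toolkit manual if in
doubt.  The caller is assumed to know the number of bases already, and the sketch does
not restore ambiguity characters.

```c
#include <assert.h>
#include <stddef.h>

/* Expands num_bases bases from NCBI2na-packed bytes into ASCII letters.
 * out must have room for num_bases + 1 characters (for the terminating NUL).
 * ASSUMPTION: first base of each byte sits in the two highest-order bits. */
void unpack_ncbi2na(const unsigned char *packed, size_t num_bases, char *out)
{
    static const char letters[4] = { 'A', 'C', 'G', 'T' };
    size_t i;

    assert(packed != NULL && out != NULL);

    for (i = 0; i < num_bases; i++) {
        /* Base i lives in byte i/4; shift is 6, 4, 2, 0 for i%4 = 0..3. */
        unsigned shift = 6u - 2u * (unsigned)(i % 4);
        out[i] = letters[(packed[i / 4] >> shift) & 3u];
    }
    out[num_bases] = '\0';
}
```

For example, the byte 0x1B (binary 00 01 10 11) expands to ACGT under this convention.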


FORMAT OF THE HEADER FILE (phr or nhr extension)
-------------------------

The format of the header file depends on whether the identifiers in the
original file were parsed.  If they were not, each entry has the
format:

gnl|BL_ORD_ID|entry_number my favorite yeast sequence...

Here entry_number gives the ordinal number of the sequence in the database (with
zero offset).  The identifier gnl|BL_ORD_ID|entry_number is used by the BLAST
software to identify the entry, if the user has not provided another identifier.
If the identifier was parsed, then gnl|BL_ORD_ID|entry_number is replaced by
the correct identifier, as described in ftp://ncbi.nlm.nih.gov/blast/db/README .

There are no separators between these deflines.
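
Because the deflines are stored back to back, the only way to tell where one entry ends is
the list of header offsets in the index file (item 10.) there), which has num_of_seq+1
entries.  A minimal sketch of that lookup follows; get_header_entry is a hypothetical
helper, and it assumes the header file and the offset list have already been read into
memory and converted to host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns a pointer to the header entry for sequence i (zero offset) and
 * stores its length in *entry_len.  offsets must hold num_seqs + 1 values.
 * Returns NULL if i is out of range or the offsets are inconsistent with
 * the size of the header buffer. */
const unsigned char *get_header_entry(const unsigned char *hdr, size_t hdr_len,
                                      const uint32_t *offsets, uint32_t num_seqs,
                                      uint32_t i, size_t *entry_len)
{
    assert(hdr != NULL && offsets != NULL && entry_len != NULL);

    if (i >= num_seqs)
        return NULL;
    if (offsets[i] > offsets[i + 1] || offsets[i + 1] > hdr_len)
        return NULL;

    /* Entry i runs from offsets[i] up to, but not including, offsets[i+1]. */
    *entry_len = (size_t)(offsets[i + 1] - offsets[i]);
    return hdr + offsets[i];
}
```

The extra (num_of_seq+1)-th offset is what makes the length of the last entry computable
without knowing the file size.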


ISAM INDEX FILES
----------------

There are also some ISAM indices that can be used to perform lookups of the
ordinal ID based upon a numerical or string index.  The numerical index 
contains only gi's and the index files are pni and pnd (or nni and nnd for
nucleotides).  The 'ni' file contains an index into the 'nd' (data) file.
The string ISAM index is used for all other identifiers and the index
files are psi and psd (or nsi and nsd).