******************DRAFT*******************************

This document describes how to call the 'readdb' functions.

readdb_get_descriptor:

Boolean LIBCALL
readdb_get_descriptor (ReadDBFILEPtr rdfp, Int4 sequence_number, SeqIdPtr PNTR id, CharPtr PNTR description)

Obtains the definition line and SeqId for a sequence via the ordinal
number of the sequence in the database.  The caller must deallocate
the definition and SeqId.

To call this function and format the ID and definition line as a
FASTA line, one might use a function like the following (where the
ReadDBFILEPtr has already been properly allocated):

void function(ReadDBFILEPtr rdfp, Int4 sequence_number)
{
        Char buffer[100];
        CharPtr defline = NULL;
        SeqIdPtr sip = NULL;

        /* Obtains the SeqIdPtr and defline from database.
           Assumes a FALSE return value means the lookup failed. */
        if (!readdb_get_descriptor(rdfp, sequence_number, &sip, &defline))
                return;

        /* Writes a FASTA ID into buffer. */
        SeqIdWrite(sip, buffer, PRINTID_FASTA_LONG, sizeof(buffer));

        printf("%s %s\n", buffer, defline);

        /* Deallocate entire chain of SeqIdPtr's. */
        sip = SeqIdSetFree(sip);

        /* Deallocate the definition line. */
        defline = MemFree(defline);

        return;
}

This section describes the format of the databases.
+++++++++++++++++++++++++++++++++++++++++++++++++++

Formatdb creates three main files for proteins, containing indices,
sequences, and headers, with the extensions pin, psq, and phr
respectively (for nucleotides these are nin, nsq, and nhr).  A number
of other ISAM indices are created, but these are described elsewhere.

FORMAT OF THE INDEX FILE (pin or nin extension)
------------------------

1.) formatdb version number [4 bytes].

2.) protein dump flag (1 for a protein database, 0 for a nucleotide
    database) [4 bytes].

3.) length of the database title in bytes [4 bytes].

4.) the database title [length given in 3.)].

5.) length of the date/time string [4 bytes].

6.) the date/time string [length given in 5.)].

7.) the number of sequences in the database [4 bytes].

8.) the total length of the database in residues/basepairs
    [4 or 8 bytes; see note (1)].

9.) the length of the longest sequence in the database [4 bytes].

10.) a list of the offsets for definitions (one for each sequence) in
     the header file.  There are num_of_seq+1 of these, where
     num_of_seq is the number of sequences given in 7.).

11.) a list of the offsets for sequences (one for each sequence) in
     the sequence file.  There are num_of_seq+1 of these, where
     num_of_seq is the number of sequences given in 7.).

12.) a list of the offsets for the ambiguity characters (one for each
     sequence) in the sequence file.  This list is only present for
     nucleotide databases.  Since nucleotide sequences are compressed
     4-to-1, it allows the ambiguity characters to be restored when
     the sequence is generated.  There are num_of_seq+1 of these,
     where num_of_seq is the number of sequences given in 7.).

(1) Version 3 of the BLAST databases uses 4 bytes to store the total
    length of the database, while version 4 uses 8 bytes.

FORMAT OF THE SEQUENCE FILE (psq or nsq extension)
---------------------------

The protein and nucleotide sequence files have different formats.

The protein sequence file is quite simple.  The first byte in the
file is a NULL byte, followed by the first sequence in ncbistdaa
format (described in the NCBI Software Development Toolkit
documentation).  Following the sequence is another NULL byte,
followed by the next sequence.  The file ends with a NULL byte,
following the last sequence.

The nucleotide sequence file contains the nucleotide sequence, with
four basepairs compressed into one byte.  The format used is NCBI2na,
documented in the NCBI Software Development Toolkit manual.  Any
ambiguity characters present in the original sequence are replaced at
random by A, C, G or T.  The true values of the ambiguity characters
are stored at the end of each sequence so that the original sequence
can be reproduced exactly.
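
As an illustration of the four-to-one nucleotide packing described
above, here is a minimal sketch in plain C that expands packed bytes
back into letters.  It is not part of readdb: the function name
unpack_ncbi2na is hypothetical, and the sketch assumes the usual
NCBI2na convention (A=0, C=1, G=2, T=3, with the first base in the two
most significant bits of each byte).  The caller is assumed to know
the number of bases already, and ambiguity characters are not
restored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a readdb function.
 * Expands n_bases bases from 2-bit packed data (four bases per byte).
 * Assumed encoding: A=0, C=1, G=2, T=3, first base in the two most
 * significant bits of each byte.
 * 'out' must have room for n_bases + 1 characters. */
void unpack_ncbi2na(const unsigned char *packed, size_t n_bases, char *out)
{
    static const char alphabet[4] = { 'A', 'C', 'G', 'T' };
    size_t i;

    assert(packed != NULL && out != NULL);

    for (i = 0; i < n_bases; i++) {
        unsigned char byte = packed[i / 4];
        /* Bases 0..3 within a byte sit at bit shifts 6, 4, 2, 0. */
        unsigned int shift = 6 - 2 * (unsigned int)(i % 4);
        out[i] = alphabet[(byte >> shift) & 0x3];
    }
    out[n_bases] = '\0';
}
```

Under these assumptions, the two bytes 0x1B and 0xE4 would expand to
the eight bases ACGTTGCA.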

FORMAT OF THE HEADER FILE (phr or nhr extension)
-------------------------

The format of the header file depends on whether the identifiers in
the original file were parsed.  If they were not, each entry has the
format:

gnl|BL_ORD_ID|entry_number my favorite yeast sequence...

Here entry_number gives the ordinal number of the sequence in the
database (counting from zero).  The BLAST software uses the
identifier gnl|BL_ORD_ID|entry_number to identify the entry if the
user has not provided another identifier.  If the identifier was
parsed, then gnl|BL_ORD_ID|entry_number is replaced by the correct
identifier, as described in ftp://ncbi.nlm.nih.gov/blast/db/README .

There are no separators between these deflines.

ISAM INDEX FILES
----------------

There are also some ISAM indices that can be used to look up the
ordinal ID of a sequence from a numerical or string index.  The
numerical index contains only gi's, and its files are pni and pnd (or
nni and nnd for nucleotides).  The 'ni' file contains an index into
the 'nd' (data) file.  The string ISAM index is used for all other
identifiers, and its files are psi and psd (or nsi and nsd).
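
Because the deflines in the header file carry no separators, a reader
has to rely on the header offsets stored in the index file to find
where each entry begins and ends.  The following is a minimal sketch
of that idea in plain C, not the readdb implementation.  The function
name copy_defline is hypothetical.  The sketch assumes that the header
file is already in memory, that the offsets have already been decoded
into host integers, and that entry i occupies the bytes from
offset[i] up to (but not including) offset[i+1], which would explain
why num_of_seq+1 offsets are stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not a readdb function.
 * Copies the defline of entry 'index' out of an in-memory header file.
 * hdr_offsets holds num_of_seq + 1 offsets; entry i is assumed to span
 * [hdr_offsets[i], hdr_offsets[i + 1]).
 * Returns a NUL-terminated copy that the caller must free(), or NULL
 * if the index is out of range, the offsets are inconsistent, or
 * memory cannot be allocated. */
char *copy_defline(const char *header_data, const unsigned long *hdr_offsets,
                   size_t num_of_seq, size_t index)
{
    size_t start, end, len;
    char *copy;

    assert(header_data != NULL && hdr_offsets != NULL);

    if (index >= num_of_seq)
        return NULL;

    start = (size_t)hdr_offsets[index];
    end = (size_t)hdr_offsets[index + 1];
    if (end < start)
        return NULL;

    len = end - start;
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, header_data + start, len);
    copy[len] = '\0';
    return copy;
}
```

The same subtraction of neighbouring offsets would give the stored
size of an entry in the sequence file, under the same assumption
about how the offset lists are laid out.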