BLAST XML output
----------------

The NCBI has started to provide BLAST results in XML format. This output option allows members of the community to use readily available XML parsing tools on the BLAST results. Users who wish to parse BLAST output are strongly encouraged to rely on the XML output rather than the standard BLAST report, because the format of the standard report is subject to change.

XML output may be generated by the stand-alone BLAST programs by using the '-m 7' option, e.g.:

    blastall -p blastp -d nr -i MYQUERY -m 7 -o xml.out

Users wishing to study the DTD may find it at ftp://ncbi.nlm.nih.gov/blast/documents/ and should look at NCBI_BlastOutput.dtd, which also refers to NCBI_BlastOutput.mod (the BLAST-specific portion of the specification) as well as NCBI_Entity.mod (definitions of some entities such as integer etc.). The rest of this document comprises an overview of the XML generated by BLAST.

The BLAST XML specification contains the information in the standard BLAST report and is organized into a number of different definitions. The top-level definition is the ELEMENT BlastOutput:

The first nine elements of BlastOutput as well as the last one are basic units (i.e., they do not refer to another ELEMENT). These elements provide much of the information seen at the top of the standard BLAST report. The names are mostly self-explanatory, but a short comment on each element is worthwhile:

    BlastOutput_program: BLAST program, e.g., blastp, blastn, etc.
    BlastOutput_version: version number of the BLAST engine (e.g., 2.1.2)
    BlastOutput_reference: a reference to the article describing the algorithm.
    BlastOutput_db: the database(s) searched.
    BlastOutput_query-ID: the identifier of the query
    BlastOutput_query-def: the definition line of the query
    BlastOutput_query-len: the length of the query
    BlastOutput_query-seq: the query sequence (optional)
    BlastOutput_iter-num: the psi-blast iteration number (optional)
    BlastOutput_message: error messages (optional)

The remaining three elements describe the search parameters, statistics, and the individual alignments between the query and database sequences and may be briefly summarized (detailed descriptions are below):

    BlastOutput_hits: hits to database sequences, one for every sequence.
    BlastOutput_param: BLAST search parameters.
    BlastOutput_stat: BLAST statistics.

The definition of BlastOutput_hits is:

Since there is one BlastOutput_hits for every database sequence, this ELEMENT will probably appear multiple times. The first five elements are basic and briefly identify the database sequence; the last one describes the alignment. A brief explanation of the fields is:

    Hit_num: ordinal number of the hit, one-offset (e.g., "1, 2...").
    Hit_id: identifier of the database sequence (e.g., "gi|7297267|gb|AAF52530.1|")
    Hit_def: definition line of the database sequence (e.g., "(AE003618) CG6717 gene...")
    Hit_accession: accession of the database sequence (e.g., "AAF57408")
    Hit_len: length of the database sequence.
    Hit_hsps: an element describing the individual alignments, discussed below.

The Hit_hsps describes the individual alignments (or HSPs) between the query and database sequence. Since one database sequence may have multiple alignments to the query, there may be multiple Hit_hsps for a BlastOutput_hits. The definition of Hit_hsps is:

These elements are briefly described:

    Hsp_num: ordinal number of the HSP, one-offset.
    Hsp_score: score (in bits) of the HSP
    Hsp_evalue: expect value of the HSP
    Hsp_query-from: offset of query at the start of the alignment (one-offset)
    Hsp_query-to: offset of query at the end of the alignment (one-offset)
    Hsp_hit-from: offset of database sequence at the start of the alignment (one-offset)
    Hsp_hit-to: offset of database sequence at the end of the alignment (one-offset)
    Hsp_pattern-from: start of phi-blast pattern on the query (one-offset)
    Hsp_pattern-to: end of phi-blast pattern on the query (one-offset)
    Hsp_query-frame: frame of the query if applicable
    Hsp_hit-frame: frame of the database sequence if applicable
    Hsp_identity: number of identities in the alignment
    Hsp_positive: number of positive (conservative) substitutions in the alignment
    Hsp_gaps: number of gaps in the alignment
    Hsp_density: score density
    Hsp_qseq: alignment string for the query
    Hsp_hseq: alignment string for the database sequence
    Hsp_midline: formatting middle line as normally seen in the BLAST report.

The BLAST parameters are described by the Parameters element:

Brief explanations of the meanings and the standalone command-line flags are:

    Parameters_matrix: matrix used (-M)
    Parameters_expect: expect value cutoff (-e)
    Parameters_include: inclusion threshold for a psi-blast iteration (-h)
    Parameters_sc-match: match score for nucleotide-nucleotide comparison (-r)
    Parameters_sc-mismatch: mismatch penalty for nucleotide-nucleotide comparison (-q)
    Parameters_gap-open: gap existence cost (-G)
    Parameters_gap-extend: gap extension cost (-E)
    Parameters_filter: filtering options (-F)
    Parameters_pattern: pattern used for phi-blast search
    Parameters_entrez-query: entrez query used to limit search.

The Statistics element provides the information normally found at the end of the BLAST report:

Brief descriptions of these fields are:

    Statistics_db-num: number of sequences in the database
    Statistics_db-len: number of letters in the database.
    Statistics_hsp-len: the effective HSP length.
    Statistics_eff-space: the effective search space
    Statistics_kappa: Karlin-Altschul parameter K
    Statistics_lambda: Karlin-Altschul parameter Lambda
    Statistics_entropy: Karlin-Altschul parameter H
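
As a rough illustration of how the elements described in this document might be read with a readily available XML parser, the following Python sketch uses the standard library's xml.etree.ElementTree. Several things here are assumptions rather than part of the specification: the sample document is hand-written for illustration (it is not real BLAST output), the helper names SAMPLE_XML and parse_hits are hypothetical, and the nesting is assumed to wrap each hit in a Hit element inside BlastOutput_hits and each alignment in an Hsp element inside Hit_hsps. The sketch also assumes the one-offset start and end positions are both inclusive. Consult NCBI_BlastOutput.dtd for the actual structure before relying on any of this.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only; NOT real BLAST output.
# Assumed nesting: BlastOutput_hits > Hit > Hit_hsps > Hsp.
SAMPLE_XML = """\
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-len>120</BlastOutput_query-len>
  <BlastOutput_hits>
    <Hit>
      <Hit_num>1</Hit_num>
      <Hit_id>example_id_1</Hit_id>
      <Hit_len>300</Hit_len>
      <Hit_hsps>
        <Hsp>
          <Hsp_num>1</Hsp_num>
          <Hsp_evalue>1e-20</Hsp_evalue>
          <Hsp_query-from>5</Hsp_query-from>
          <Hsp_query-to>104</Hsp_query-to>
          <Hsp_identity>80</Hsp_identity>
        </Hsp>
      </Hit_hsps>
    </Hit>
  </BlastOutput_hits>
</BlastOutput>
"""


def parse_hits(xml_text):
    """Return a list of dicts, one per hit, each carrying its list of HSPs."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("Hit"):
        hsps = []
        for hsp in hit.iter("Hsp"):
            q_from = int(hsp.findtext("Hsp_query-from"))
            q_to = int(hsp.findtext("Hsp_query-to"))
            hsps.append({
                "num": int(hsp.findtext("Hsp_num")),
                "evalue": float(hsp.findtext("Hsp_evalue")),
                # Assumes both one-offset endpoints are inclusive, so the
                # aligned query span covers q_to - q_from + 1 positions.
                "query_span": q_to - q_from + 1,
                "identity": int(hsp.findtext("Hsp_identity")),
            })
        hits.append({
            "num": int(hit.findtext("Hit_num")),
            "id": hit.findtext("Hit_id"),
            "length": int(hit.findtext("Hit_len")),
            "hsps": hsps,
        })
    return hits


if __name__ == "__main__":
    # Print the ordinal, identifier, and HSP count for every hit.
    for hit in parse_hits(SAMPLE_XML):
        print(hit["num"], hit["id"], len(hit["hsps"]))
```

To process a file produced with '-m 7' (such as xml.out from the command shown earlier), the same function could be given the file's contents, e.g. parse_hits(open("xml.out").read()), provided the assumed nesting matches the DTD in use.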