******************************************************************************

RefSeq-release2.txt       ftp://ftp.ncbi.nih.gov/refseq/release/release-notes/

		NCBI Reference Sequence (RefSeq) Database

			Release 2
			October 21, 2003

		Distribution Release Notes

Release Size: 
2124 organisms, 7745398573 nucleotide bases, 286957682 amino acids, 1097404 records

******************************************************************************

This document describes the format and content of the flat files that 
comprise releases of the NCBI Reference Sequence (RefSeq) database.

Additional information about RefSeq is available at:

1. NCBI Handbook:  
   http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=handbook.TOC&depth=2
 
2. RefSeq Web Site: 
   http://www.ncbi.nih.gov/RefSeq/

If you have any questions or comments about RefSeq, the RefSeq release files
or this document, please contact NCBI by email at:

   info@ncbi.nlm.nih.gov. 

To receive announcements of future RefSeq releases and large updates please
subscribe to NCBI's refseq-announce mail list:

 send email to refseq-announce-subscribe@ncbi.nlm.nih.gov
 with "subscribe" in the subject line (without quotes)

OR

 subscribe using web interface at:
 http://www.ncbi.nlm.nih.gov/mailman/listinfo/refseq-announce

=============================================================================
TABLE OF CONTENTS
=============================================================================

1. INTRODUCTION	
	1.1 Release 2
	1.2 Cutoff date
	1.3 RefSeq Project Background
		1.3.1 Sequence accessions, validation, and annotation
		1.3.2 Data assembly, curation, and collaboration 
		1.3.3 Biologically non-redundant data set
		1.3.4 RefSeq and DDBJ/EMBL/GenBank comparison
	1.4 Uses and applications of the RefSeq database

2. CONTENT
	2.1 Organisms included
	2.2 Molecule Types included
	2.3 Known Problems, Redundancies, and Inconsistencies
	2.4 Notes on select major organisms
	2.5 Release Catalog
	
3. ORGANIZATION OF DATA FILES
	3.1 FTP Site Organization
	3.2 File Names and Formats
	3.3 File Sizes
	3.4 Statistics
	3.5 Release Catalog Format
	3.6 RefSeq Accession Format
	3.7 Growth of RefSeq
        
4. FLAT FILE ANNOTATION
	4.1 Main features of RefSeq Flat File
		4.1.1 LOCUS, DEFLINE, ACCESSION, KEYWORDS, SOURCE, ORGANISM
		4.1.2 REFERENCE, DIRECT SUBMISSION, COMMENT
		4.1.3 FEATURE ANNOTATION (Gene, mRNA, CDS, Variation, Protein)
	4.2 Tracking Identifiers
		4.2.1 GeneID and LocusID
		4.2.2 Transcript ID
		4.2.3 Protein ID
		4.2.4 Conserved Domain Database (CDD) ID

5. REFSEQ ADMINISTRATION
	5.1 Citing RefSeq
	5.2 RefSeq Distribution Formats
	5.3 Other Methods of Accessing RefSeq Data
	5.4 Request for Corrections and Comments
	5.5 Credits and Acknowledgements
	5.6 Disclaimer



=============================================================================
1. INTRODUCTION
=============================================================================

The NCBI Reference Sequence Project (RefSeq) is an effort to provide the 
best single collection of naturally occurring biomolecules, representative
of the central dogma, for each major organism. Ideally this would include 
one sequence record for each chromosome, organelle, or plasmid linked on a 
residue by residue basis to the expressed transcripts, to the translated 
proteins, and to each mature peptide product. Depending on the organism, we 
may have some, but not all, of this information at any given time. We 
pragmatically include the best view we can from available data.

1.1 Release 2
-------------

The National Center for Biotechnology Information (NCBI) at the National
Library of Medicine (NLM), National Institutes of Health (NIH) is 
responsible for producing and distributing the RefSeq Sequence Database. 
Records are provided through a combination of collaboration and in-house 
processing, including some curation by in-house expert biologists.  

RefSeq Release 2 is a full release of all NCBI RefSeq records.
The RefSeq project is an ongoing effort to provide a curated, non-redundant
collection of sequences. This release includes all of the sequence data
that we have collected at this time. Although the RefSeq collection is not yet 
complete, its value as a non-redundant dataset has reached a level that
justifies providing full releases.  


1.2 Cutoff date
---------------

This full release, Release 2, incorporates data available as of October 21, 2003. 
For more recent data, users are advised to:
	
	. Download the RefSeq daily update files from the RefSeq FTP site
	     ftp://ftp.ncbi.nih.gov/refseq/daily/new/

	
	. Use the interactive web Entrez Query systems to query based on 
	  date
             http://www.ncbi.nih.gov/Entrez/

Notice of Change: a new directory has been created (daily/new/) to provide 
daily updates in the same file formats as are made available with the release. 
The original file formats provided in the 'daily' directory will be retained 
until January 2004. At that time, the daily updates will only be provided 
in the file formats consistent with the release and the 'new' directory 
will be removed.


1.3 RefSeq Project Background
-----------------------------

1.3.1 Sequence accessions, validation, and annotation
-----------------------------------------------------

Every sequence is assigned a stable accession, version, and gi and 
all older versions remain available over time. RefSeq accessions
have a distinct format (see section 3.6); the underscore ("_") is the 
primary distinguishing feature of a RefSeq accession. 
DDBJ/EMBL/GenBank accessions never include an underscore.

Sequences are validated in several ways. For example, validation confirms 
that the genomic sequence in the region of an mRNA feature matches the 
mRNA sequence itself, and that annotated coding region features can be 
translated into the protein sequences they refer to.
Validation also checks for valid ASN.1 format.  For genomes
included in the LocusLink database, validation ensures consistency
is maintained for descriptive information (symbols, gene and protein names)
between RefSeq and LocusLink records.

Each molecule is annotated as accurately as possible with the 
correct organism name, the correct gene symbol for that organism, 
and reasonable names for proteins where possible. When available, 
nomenclature provided by official nomenclature groups is used.  
Note that gene symbols are not required or expected to be unique 
either across species or within a species. 

1.3.2 Data assembly, curation, and collaboration 
------------------------------------------------

We welcome collaborations with authoritative groups outside NCBI 
who are willing to provide the sequences, annotations, or links 
to phenotypic or organism specific resources. Where such collaborations 
have not yet developed, NCBI staff have assembled the best view of 
the organism that we can put together ourselves. In some cases, as with the 
human genome, NCBI is an active participant in generating the 
genome assembly and in providing reference sequences to represent 
the annotated genome. For other genomes, we may compile the data 
ourselves from DDBJ/EMBL/GenBank or other public sources. For instance,  
we may simply select the "best" DDBJ/EMBL/GenBank record by automatic means, 
validate the data format (and correct if needed), and add an essentially 
unchanged copy to the RefSeq collection, attributed to the original 
DDBJ/EMBL/GenBank record. In other cases we may provide a record that is very 
similar to the DDBJ/EMBL/GenBank record, but to which experts at NCBI have added 
corrected or additional annotation. This latter process can range 
from minor technical repairs to a manually curated re-annotation of 
the sequence, often in collaboration with experts outside NCBI. 

Each record that has been curated, or that is in the pool for
future curation, is labeled with the level of curation it has received.  
Curation status information is provided primarily for transcript and 
protein records.  Curation is carried out on the whole genome level 
for some smaller genomes such as viral, organelle, and some microbial
genomes.  

Curation status codes are defined in section 3.5 below.



1.3.3 Biologically non-redundant data set
-----------------------------------------

RefSeq provides a biologically non-redundant set of sequences for database 
searching and gene characterization. It has the advantage of providing an 
objective and experimentally verifiable definition of "non-redundant" in 
supplying one example of each natural biomolecule per organism. The small 
amount of sequence redundancy introduced from close paralogs, alternate splicing 
products, and genome assembly intermediates is compensated for by the 
clarity of the model. RefSeq provides the substrate for a variety of 
conclusions about non-redundancy based on clustering identical sequences, 
or families of related sequences, without confounding the database itself 
with these more subjective assessments.


1.3.4 RefSeq and DDBJ/EMBL/GenBank comparison
---------------------------------------------

RefSeq is unique in providing a large curated database across many 
organisms, which precisely and explicitly links genetic (chromosome), 
expression (mRNA), and functional (protein) sequence data into an 
integrated whole. 

DDBJ/EMBL/GenBank also integrates DNA and protein information, and RefSeq is 
substantially based on sequence records contributed to DDBJ/EMBL/GenBank. 
However, RefSeq is similar to a review article in that it represents 
a synthesis and summary of information by a particular group (NCBI or 
other RefSeq contributors) that is based on the primary data gathered 
by many others and made part of the scientific record. Also, like a 
review article, it has the advantage of organizing a large body of 
diverse data into a single consistent framework with a uniform set of 
conventions and standards.

Note that while based on DDBJ/EMBL/GenBank, RefSeq is distinct from 
DDBJ/EMBL/GenBank. DDBJ/EMBL/GenBank represents the sequence and annotations 
supplied by the original authors and is never changed by NCBI or RefSeq staff. 
DDBJ/EMBL/GenBank remains the primary sequence archive while RefSeq is a 
summary and synthesis based on that essential primary data.


1.4 Uses and applications of the RefSeq database
------------------------------------------------

A stable, consistent, comprehensive, non-redundant database of genomes
and their products provides a valuable sequence resource for similarity 
searching, gene identification, protein classification, comparative 
genomics, and selection of probes for gene expression. It also acts as 
molecular "white pages" by providing a single, uniform point of access 
for searching at the sequence level, and by connecting the results with 
a diversity of organism-specific databases or resources unique to that 
organism or field. 


=============================================================================
2. CONTENT
=============================================================================


2.1 Organisms included
----------------------

This release includes records representing 2124 distinct taxonomic categories,
as measured by counting the number of distinct tax_ids included in the release.
Tax_ids are provided, for all species having any amount of sequence data, by the 
NCBI Taxonomy group. 

The release includes species ranging from viral to microbial to eukaryotic and 
includes organisms for which complete and incomplete genomic sequence data is 
available.  

The release does not include all species for which some sequence data is
available in DDBJ/EMBL/GenBank. The decision to generate RefSeq data for a 
species depends in part on the amount of sequence data available.  
Additional species will be represented in the RefSeq collection as 
more sequence data becomes available.


2.2 Molecule Types Included
---------------------------

The RefSeq release includes genomic, transcript, and protein sequence data; 
however, these molecule types are not provided for all organisms and the 
sequences provided may not be complete or comprehensive for some species.  

Transcript RefSeq records may represent protein-coding transcripts or 
non-coding RNA products; these records are currently only provided for 
eukaryotic species.

Genomic RefSeq records are provided when a sufficient quantity of genomic 
sequence data is available in DDBJ/EMBL/GenBank. Transcript and protein 
records may be provided for a species before genomic sequence data is available, 
as is the case with Danio rerio (zebrafish).


2.3 Known Problems, Redundancies, and Inconsistencies
------------------------------------------------------

The RefSeq collection is an ongoing project that is expected to grow
in scope and content over time.  Thus it is important to recognize that
it is not complete in that some genomes are not yet completely sequenced,
some incompletely sequenced genomes may not be included, or some gene 
products may not yet be represented. RefSeq records may be added, removed, 
or updated in future releases as new information becomes available and as 
a result of curation.

Genomes with pending updates:

Sequence updates are planned in the near future for the following species.
These updates may revise the genomic, transcript, and protein data provided 
in the RefSeq collection.  The updated RefSeqs will be available in a future
release.

	Mus musculus genome assembly and associated models



Known Data inconsistencies:

	[1] RefSeq status codes are not consistently provided for some species. 
	The goal is to consistently provide a status code for all RefSeq
	records. The release catalog indicates "UNKNOWN" if a status code
	was expected but not detected and "na" if a status code is not
	expected based on the original project plan for provision of this type
	of information. Status codes will be more consistently applied to all 
	records in the future.

	[2] The genomic, transcript, and protein collection is known to be 
	incomplete for many species. This is particularly true for those
	genomes for which a complete genome assembly is not yet available, 
	such as Danio rerio (zebrafish), Bos taurus (cow), and Leishmania 
	major. As additional sequence data becomes available, the RefSeq 
	representation for these, and other, organisms will increase. 
	
	[3] Although the goal is to provide a non-redundant collection, some
	redundancy is included in this release as follows:
	
	Known Duplication:
	Two versions of NM_172381 and NP_759013 are included in the release files
	due to a processing problem. 

	Redundant Protein records:
	    Alternate Splicing		When additional transcripts are provided
					to represent alternate splicing products, 
					and the alternate splice site occurs in 
					the UTR, then the protein is redundantly 
					provided.

	    Paralogs			The goal is to provide a RefSeq record 
					for each naturally occurring molecule. 
					Therefore, records are provided for all 
					genes identified including those produced 
					by more recent gene duplication events in 
					which the genes are nearly identical.
	
	Redundant Genomic records:				
	   Intermediate records		For some species, intermediate genomic 
					records are provided to support the 
					assembly and/or annotation of the genome.
	
					For example, for human, a chromosome may 
					be represented by a chromosome RefSeq 
					record with a NC_ accession prefix.
					The chromosome record may consist of 
					many contigs, each represented as a 
					separate record with a NT_ accession
					prefix. In addition, some curated gene
					region records, with NG_ accession
					prefix, may also be provided to support
					annotation of complex regions.
					 
				
	
	   Alternate assemblies		Genomic records are provided to represent 
					alternate assemblies of genomic sequence
					derived from different populations. These 
					records will have varying levels of 
					redundancy and represent polymorphic and
					haplotype differences in terms of the
					sequence and annotation.

					For example, alternate assemblies are
					provided for different mouse strains and
					for regions of the human major
					histocompatibility complex (MHC). The MHC
					is a highly variable region of chromosome
					6 which exhibits variation at the level 
					of both sequence polymorphism and gene 
					content. The alternate assemblies make it
					possible to represent this alternate gene
					content. 
					
 
	
2.4 Notes on select major organisms
-----------------------------------

Anopheles gambiae		Genomic sequence data is available as 
				whole genome shotgun (WGS).

Arabidopsis thaliana		An update of the annotated genome was provided 
				by TIGR in July, 2003. The RefSeq release includes 
				chromosomes, transcripts, and proteins.

Caenorhabditis elegans		The RefSeq release includes an annotation update, 
				released on October 16, 2003, for the genome
				version available on March 7, 2003. The release
				includes chromosome, transcript, and protein records.

Drosophila melanogaster		Release 3.1 of the assembled, annotated genome 
				was provided by FlyBase in March 2003. 

Homo sapiens			NCBI provides the human genome assembly in close
				collaboration with the sequencing centers. 
				RefSeq release 2 includes human genome build 34, 
				which is based on data available on July 30, 2003. 
				Release 2 includes RefSeq chromosomes, contigs, 
				known transcripts and proteins (as defined by 
				having a Locus ID), and derived model transcripts 
				and proteins predicted by the Genome Annotation 
				pipeline. See: 
				http://www.ncbi.nlm.nih.gov/genome/guide/build.html

Mus musculus			NCBI provides the mouse genome assembly in close
				collaboration with the sequencing centers. 
				This RefSeq release includes mouse genome build 30 
				which is based on data available in January, 2003. 
				Release 2 includes RefSeq contigs, known 
				transcripts and proteins, and derived model 
				transcripts and proteins predicted by the Genome
				Annotation pipeline. A mouse genome update is
				imminent at the time of this release processing
				and will be included in the next release.

Neurospora crassa 		This release adds representation for Neurospora. The 
				annotated genome data was supplied by the Whitehead
				Institute. RefSeqs were released on July 2, 2003
				and include WGS genomic contigs, predicted transcripts,
				and predicted proteins. The RefSeq data does not represent
				the subset of small WGS contigs that were not
				mapped to a chromosome position or do not include 
				annotation.
 
Oryza sativa			This release adds representation for rice. The
				genome is being sequenced by the International
				Rice Genome Sequencing Project. RefSeqs are provided 
				by NCBI processing to generate the annotated
				genomic contigs; annotation is propagated from 
				the submitted BAC clones. The rice RefSeq set does not
				represent annotation from BAC clones that didn't fall into 
				supercontigs, even though they are mapped to a chromosome 
				position. RefSeqs were first released on October 6, 2003 and 
				include genomic contigs, transcripts, and proteins.

Rattus norvegicus		NCBI uses the rat whole genome shotgun (WGS) 
				genome assembly provided by Baylor sequencing 
				center. RefSeq release 2 includes rat genome 
				build 2 which is based on the RGSC v3.1 assembly, 
				provided by the Rat Genome Sequencing Consortium 
				(RGSC).  Release 2 includes RefSeq contigs, known 
				transcripts and proteins, and derived model transcripts 
				and proteins predicted by the Genome Annotation 
				pipeline.  

Saccharomyces cerevisiae	Provided by the Saccharomyces Genome Database (SGD) 
				and last updated October 14, 2003. The RefSeq release 
				includes chromosome and protein records.

Microbial			This RefSeq release includes 140 complete microbial 
				genomes. Microbial genomes are annotated by 
				a collaborative automatic computation method, 
				followed by curation by NCBI staff. 

				Fifteen microbial genomes have become available in
				GenBank since the last release:	
				   Prochlorococcus marinus subsp. marinus str. CCMP1375
				   Chlamydophila pneumoniae TW-183
				   Candidatus Blochmannia floridanus
				   Bordetella pertussis
				   Bordetella parapertussis
				   Bordetella bronchiseptica
				   Prochlorococcus marinus subsp. pastoris str. CCMP1378
				   Prochlorococcus marinus str. MIT 9313
				   Synechococcus sp. WH 8102
				   Chromobacterium violaceum ATCC 12472
				   Porphyromonas gingivalis W83
				   Wolinella succinogenes
				   Gloeobacter violaceus
				   Photorhabdus luminescens subsp. laumondii TTO1
				   Vibrio vulnificus YJ016


				Seventeen microbial genomes have been curated:
				  Aeropyrum pernix
				  Archaeoglobus fulgidus
				  Buchnera sp. APS
				  Buchnera aphidicola Sg
				  Corynebacterium glutamicum
				  Escherichia coli K-12
				  Haemophilus influenzae
				  Lactococcus lactis subsp. lactis
				  Mycoplasma genitalium
				  Mycoplasma pneumoniae
				  Oceanobacillus iheyensis
				  Shewanella oneidensis
				  Pyrococcus abyssi
				  Pyrococcus furiosus
				  Pyrococcus horikoshii
				  Thermoplasma volcanium
				  Vibrio vulnificus CMCP6 
		


Viruses				This RefSeq release includes over 1231 distinct viral
				records which have been curated via an extensive
				collaboration between the international virologist
				community and NCBI staff virologists. A panel of
				viral genome advisors has been established. 

				

For more information please see:

RefSeq Collaborations:  http://www.ncbi.nih.gov/RefSeq/collaborators.html
Viral Genome Advisors:  http://www.ncbi.nih.gov/PMGifs/Genomes/viradvisors.html
Microbial Contributors: http://www.ncbi.nih.gov/RefSeq/microbialcontrib.html
				

2.5 Release Catalog
-------------------

The Release Catalog documents the full contents of the RefSeq Release.
The catalog can be used to identify data of interest.  See the format
description in section 3.5 for additional information.

The release catalog is available at:

  ftp://ftp.ncbi.nih.gov/refseq/release/release-catalog/RefSeq-release2.catalog


=============================================================================
3. ORGANIZATION OF DATA FILES
=============================================================================

3.1 FTP Site Organization
-------------------------

RefSeq releases are available on the NCBI FTP site at:

   ftp://ftp.ncbi.nih.gov/refseq/release/

The RefSeq collection is provided in a redundant fashion to best meet the needs
of those who want the full collection as well as those who want a specific
sub-set of the collection.  Therefore the collection is provided as: 
   1) the complete collection, and
   2) sections as defined by major taxonomic or other logical groupings. 

A subdirectory exists for each sub-section as follows:

   fungi						
   invertebrate	
   microbial	
   mitochondrion	
   plant
   plasmid		
   plastid		
   protozoa	
   vertebrate_mammalian	
   vertebrate_other	
   viral			

In addition, the complete collection is available without these
sub-groupings in the subdirectory:

   complete

Note that this directory structure intentionally provides the release 
data in a redundant fashion. We gave considerable thought to how to
package the release to meet the needs of different user groups. 
For instance, some groups may be interested in retrieving the complete
protein set, while other groups may be interested in retrieving data 
for a more limited number of organisms.  We decided to provide
logical groupings based on general taxonomic node (viral, mammalian etc)
as well as logical molecule type compartmentalization (e.g., plastid).
Thus, all records are provided at least twice, once in the "complete" 
directory, and a second time in one of the other directories. 
Some sequences may be provided three times when it is logical to 
include the record in more than one additional directory. For example, 
a sequence may be provided in the "complete", "mitochondrion", and 
"vertebrate_mammalian" directories.

We are interested in hearing if you find this structure useful or if
you would like information grouped in a different manner.

Send suggestions or comments to the NCBI Help Desk at:

	info@ncbi.nlm.nih.gov


3.2 File Names and Formats 
--------------------------

File names are informative, and indicate the content, molecule type,
and file format of each RefSeq release data file. Most filenames
utilize this structure:


	directoryfilenumber.molecule.format.gz
	1        2	    3        4	  

File Name Key:

	1. directory		directory level the file is provided in 
				(e.g., complete, viral, etc.)

	2. file number		large data sets are provided as incrementally 
				numbered files 

	3. molecule		type of molecule (genomic, rna, or protein); 
				not relevant for ASN.1 format files provided 
				in the "complete" sub-directory

	4. format		the data format provided in the file; see below


For example:
	complete1.genomic.bna.gz
        vertebrate_mammalian2.protein.gpff.gz

RefSeq Whole Genome Shotgun (WGS) data are provided in separate files for 
each WGS project.  Their filenames use a slightly different structure:

	directoryWGSproject.molecule.format.gz

For example:
	completeNZ_AAAU.bna.gz
        microbialNZ_AAAV.genomic.fna.gz

All RefSeq release files have been compressed with the gzip utility;
therefore, an invariant ".gz" suffix is present for all release files.

The data that comprise a RefSeq release are available in several
file formats, as indicated by the format component in the file name:

  bna	binary ASN.1 format; includes nucleotide and protein 
  gbff	GenBank flat file format; nucleotide records
  gpff	GenPept flat file format; protein records
  fna	FASTA format; nucleotide records
  faa	FASTA format; protein records
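
The naming scheme above can be sketched as a small parser. This is only an
illustration under stated assumptions: the function name and the returned
dictionary keys are hypothetical, the directory list is the one given in
section 3.1, and the code treats anything after the directory name that is
not a number as a WGS project label.

```python
DIRECTORIES = (
    "complete", "fungi", "invertebrate", "microbial", "mitochondrion",
    "plant", "plasmid", "plastid", "protozoa", "vertebrate_mammalian",
    "vertebrate_other", "viral",
)
MOLECULES = {"genomic", "rna", "protein"}
FORMATS = {"bna", "gbff", "gpff", "fna", "faa"}


def parse_release_filename(name):
    """Split a release file name into the components described above."""
    if not name.endswith(".gz"):
        raise ValueError("release files carry a .gz suffix: " + name)
    # "complete1.genomic.bna.gz" -> head "complete1", rest ["genomic", "bna"]
    head, *rest = name[: -len(".gz")].split(".")
    directory = next((d for d in DIRECTORIES if head.startswith(d)), None)
    if directory is None or not rest or rest[-1] not in FORMATS:
        raise ValueError("unrecognized release file name: " + name)
    # The molecule component may be absent (e.g. completeNZ_AAAU.bna.gz).
    molecule = rest[0] if len(rest) == 2 else None
    if len(rest) > 2 or (molecule is not None and molecule not in MOLECULES):
        raise ValueError("unrecognized release file name: " + name)
    tag = head[len(directory):]
    if not tag:
        raise ValueError("missing file number or WGS project: " + name)
    return {
        "directory": directory,
        # Numbered files carry digits; WGS files carry a project label.
        "file_number": int(tag) if tag.isdigit() else None,
        "wgs_project": None if tag.isdigit() else tag,
        "molecule": molecule,
        "format": rest[-1],
    }


print(parse_release_filename("vertebrate_mammalian2.protein.gpff.gz"))
print(parse_release_filename("completeNZ_AAAU.bna.gz"))
```

A parser of this kind may be useful for selecting, for example, only the
protein FASTA files from a directory listing before download.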

The comprehensive full release is deposited in the "complete"
directory and is available in all file types.

Binary ASN.1 format is only provided in the complete directory. The remaining
directories include all of the other file types.

The DDBJ/EMBL/GenBank and GenPept flat file format provided in this release 
matches that seen when accessing the records using the NCBI web site. 
Notably, some RefSeq records are in the CON division and do not instantiate 
the sequence in the flat file display; instead, a 'join' statement is provided 
to indicate the assembly instructions.  The FASTA files do include the 
assembled sequences for these CON division RefSeq records.  

For example, see NC_000022.

Suggestions regarding the structure of the RefSeq release product 
and the available formats may be sent to the NCBI Help Desk:

    info@ncbi.nlm.nih.gov


3.3 File Sizes
--------------

RefSeq release files are provided in a range of sizes. Most are
limited to several hundred megabytes. However, some of the genomic
FASTA files can exceed 2 GB.

Files are compressed to reduce file size and facilitate FTP retrieval.

The total size of release 2 (includes all directories) is as follows:


         Extension    Size (GB)          Type
         -----------------------------------------------------------
         bna          4.45                ASN.1
         gbff         7.89                GenBank flat file
         gpff         4.95                GenPept flat file
         fna          34.61               FASTA, nucleotide
         faa          0.69                FASTA, protein

Note: for release 2, the complete directory provides all file types. The ASN.1
format is only available in the complete directory; the file sizes reported for 
the remaining file formats represent the redundant total found in the complete 
plus other directories.


3.4 Statistics
---------------

RefSeq release 2 includes sequences from 2124 different organisms.

The number of species represented in each Release sub-directory,
determined by counting distinct tax IDs, is as follows:

	complete		2124
	fungi			34	
	invertebrate		81
	microbial		378	
	mitochondrion		437
	plant			33	
	plasmid			37	
	plastid			34	
	protozoa		40	
	vertebrate_mammalian	82	
	vertebrate_other	208	
	viral			1231
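
As a quick illustration of the redundancy described in section 3.1, the
per-directory counts above can be summed and compared with the count for the
complete directory. The variable names are hypothetical; the numbers are
copied from the table. The sum exceeds 2124, which is consistent with some
tax IDs being counted in more than one sub-directory (for example, an
organism represented in both "mitochondrion" and "vertebrate_mammalian").

```python
# Distinct tax ID counts per sub-directory, copied from the table above.
SPECIES_PER_DIRECTORY = {
    "fungi": 34,
    "invertebrate": 81,
    "microbial": 378,
    "mitochondrion": 437,
    "plant": 33,
    "plasmid": 37,
    "plastid": 34,
    "protozoa": 40,
    "vertebrate_mammalian": 82,
    "vertebrate_other": 208,
    "viral": 1231,
}
COMPLETE_COUNT = 2124

subdirectory_total = sum(SPECIES_PER_DIRECTORY.values())
# Tax ID counts in excess of the complete set come from organisms that are
# listed in more than one sub-directory.
excess = subdirectory_total - COMPLETE_COUNT

print("sum over sub-directories:", subdirectory_total)
print("excess over complete:", excess)
```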

Total number of accessions and total length (number of nucleotides or amino 
acids) per type of molecule:

   Type		Accessions        Length 
   ------------------------------------------
   Genomic:        64805	7399000384
   RNA:            201312	346398189
   Protein:        831287	286957682
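
The per-molecule totals above can be cross-checked against the release size
reported at the top of this document. The short sketch below uses
hypothetical variable names; all numbers are copied from this document.

```python
# (accessions, length) per molecule type, copied from the table above.
TOTALS = {
    "genomic": (64805, 7399000384),
    "rna": (201312, 346398189),
    "protein": (831287, 286957682),
}

# Release size as stated in the header of these notes.
HEADER_NUCLEOTIDES = 7745398573
HEADER_AMINO_ACIDS = 286957682
HEADER_RECORDS = 1097404

# Nucleotide bases are the genomic plus RNA lengths.
nucleotides = TOTALS["genomic"][1] + TOTALS["rna"][1]
amino_acids = TOTALS["protein"][1]
records = sum(accessions for accessions, _ in TOTALS.values())

print("nucleotides match:", nucleotides == HEADER_NUCLEOTIDES)
print("amino acids match:", amino_acids == HEADER_AMINO_ACIDS)
print("records match:", records == HEADER_RECORDS)
```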


Complete RefSeq release statistics for each directory are provided 
in a separate document. Please see:

   ftp://ftp.ncbi.nih.gov/refseq/release/release-statistics/
   
   file: RefSeq-release2.10212003.stats.txt


3.5 Release Catalog Format
--------------------------

The full non-redundant contents of the release are documented in the 
release catalog. 

The catalog includes the following columns:


  1. tax_id
  2. species name
  3. RefSeq accession.version
  4. gi
  5. FTP directories data is provided in
  6. RefSeq status code
  7. sequence length

Note: the molecule type for each catalog entry can be inferred from 
the accession prefix (see below).
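
A minimal sketch of reading one catalog line into typed fields follows. It
assumes the catalog is tab-delimited with exactly the seven columns listed
above; the delimiter is not stated in this document, so please check the
file before relying on it. The class and function names are hypothetical,
and the sample line is constructed for illustration rather than copied from
the catalog.

```python
from typing import NamedTuple


class CatalogEntry(NamedTuple):
    tax_id: int
    species: str
    accession: str
    version: int
    gi: int
    directories: str  # kept as raw text; its internal layout is not assumed
    status: str
    length: int


def parse_catalog_line(line):
    """Parse one catalog line, ASSUMING tab-delimited columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 columns, found %d" % len(fields))
    tax_id, species, acc_ver, gi, directories, status, length = fields
    # Column 3 holds accession.version, e.g. NC_004916.1
    accession, _, version = acc_ver.partition(".")
    return CatalogEntry(int(tax_id), species, accession, int(version),
                        int(gi), directories, status, int(length))


# Constructed sample line (illustrative values only).
sample = "5664\tLeishmania major\tNC_004916.1\t32189699\tcomplete|protozoa\tna\t384518\n"
entry = parse_catalog_line(sample)
print(entry.accession, entry.version, entry.status, entry.length)
```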
 

RefSeq Status Codes are documented on the RefSeq web site. The catalog for
release 2 includes the following terms:

  na			Not Applicable; 
			status codes are not provided for some genomic records

  UNKNOWN		The status code has not yet been applied 

  REVIEWED		The RefSeq record has been reviewed by NCBI  
			staff or by a collaborator. Some RefSeq records 
			may incorporate expanded sequence and annotation 
			information including additional publications 
			and features.

  VALIDATED		The RefSeq record has undergone an initial review 
			to provide the preferred sequence standard. The  
			record has not yet been subject to final review 
			at which time additional functional information 
			may be provided.	
			
  PROVISIONAL		The RefSeq record has not yet been subject to 
			individual review and is thought to be well 
			supported and to represent a valid transcript 
			and protein.
			
  PREDICTED		The RefSeq transcript may represent an ab initio 
			prediction or may be partially supported by other 
			transcript data; the protein is predicted.
			
  INFERRED		The RefSeq record is inferred by genome sequence 
			analysis.

  MODEL			The RefSeq record is provided via automated processing 
  			and is not subject to individual review or revision 
			between builds.


3.6 RefSeq Accession Format
---------------------------

RefSeq accessions are formatted as a two-letter prefix, followed by 
an underscore, followed by either six digits or four letters plus eight 
digits. For example, NM_020236 and NZ_AABC02000001.  

The underscore ("_") is the primary distinguishing feature of a RefSeq 
accession; DDBJ/EMBL/GenBank accessions never include an underscore.


The RefSeq accession prefix indicates the molecule type. 


  Molecule Type		Accession Prefix
  ----------------------------------------------
  protein		NP_; XP_; ZP_
  rna			NM_; NR_; XM_; XR_
  genomic		NC_; NG_; NT_; NW_; NZ_
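
The accession format and the prefix table above can be combined into a small
lookup. This is a sketch only: the function name is hypothetical, and the
pattern covers the two forms described in this section (six digits, or four
letters plus eight digits), with an optional ".version" suffix as used in
the release catalog.

```python
import re

# Prefix-to-molecule-type table, copied from the table above.
PREFIX_TYPES = {
    "NP": "protein", "XP": "protein", "ZP": "protein",
    "NM": "rna", "NR": "rna", "XM": "rna", "XR": "rna",
    "NC": "genomic", "NG": "genomic", "NT": "genomic",
    "NW": "genomic", "NZ": "genomic",
}

# Two-letter prefix, underscore, then six digits or four letters plus
# eight digits; a trailing ".N" version number is accepted but optional.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6}|[A-Z]{4}\d{8})(?:\.(\d+))?$")


def molecule_type(accession):
    """Return 'protein', 'rna', or 'genomic'; None if not recognized."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return None
    return PREFIX_TYPES.get(match.group(1))


for acc in ("NM_020236", "NZ_AABC02000001", "NP_759013", "AB123456"):
    print(acc, molecule_type(acc))
```

The last identifier in the loop has no underscore and is therefore not
treated as a RefSeq accession.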
        

Additional information is available on the RefSeq Web site:

  http://www.ncbi.nih.gov/RefSeq/key.html#accessions


NOTICE OF CHANGE:
The NP_ accession space will need to be expanded in the near future;
the new format will be NP_12345678. Existing accessions will remain
unchanged. That is, existing accessions, such as NP_013474, will not
be modified to 8 digits (NP_013474 and NP_00013474 will be distinct
accessions identifying different protein records).

As other accession series need to be expanded, they will also be
expanded by adding 2 digits with existing accessions remaining stable.


3.7 Growth of RefSeq
--------------------

Release	Date		Species	Nucleotides	Amino Acids	Records

1	Jun 30, 2003	2005	4672871949	263588685	1061675
2	Oct 21, 2003	2124	7745398573	286957682	1097404
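
The relative growth between releases can be derived from the table
above. This is a small illustrative calculation; the field names and
the percent_growth helper are assumptions, and the figures are copied
from the two rows shown.

```python
# Rows copied from the growth table:
# (release, date, species, nucleotides, amino acids, records)
releases = [
    (1, "Jun 30, 2003", 2005, 4672871949, 263588685, 1061675),
    (2, "Oct 21, 2003", 2124, 7745398573, 286957682, 1097404),
]
FIELDS = ("species", "nucleotides", "amino_acids", "records")


def percent_growth(previous, current):
    """Percent change of each count column between two release rows."""
    return {
        name: 100.0 * (cur - prev) / prev
        for name, prev, cur in zip(FIELDS, previous[2:], current[2:])
    }


growth = percent_growth(releases[0], releases[1])
for name in FIELDS:
    print(f"{name:12s} {growth[name]:+.1f}%")
```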




=============================================================================
4. FLAT FILE ANNOTATION
=============================================================================

4.1 Main features of RefSeq Flat File
-------------------------------------
Also see the RefSeq web site and the NCBI Handbook, RefSeq chapter.

   http://www.ncbi.nih.gov/RefSeq/
   http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=handbook.TOC&depth=2


4.1.1 LOCUS, DEFLINE, ACCESSION, KEYWORDS, SOURCE, ORGANISM 
--------------------------------------------------------------------

The beginning of each RefSeq record provides information about the accession,
length, molecule type, division, and last update date. This is followed by the 
descriptive DEFINITION line, then by the accession, version, and GI data, 
followed by detailed information about the organism and taxonomic lineage.

//
LOCUS       NC_004916             384518 bp    DNA     linear   INV 26-JUN-2003
DEFINITION  Leishmania major chromosome 3, complete sequence.
ACCESSION   NC_004916
VERSION     NC_004916.1  GI:32189699
KEYWORDS    .
SOURCE      Leishmania major
  ORGANISM  Leishmania major
            Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae;
            Leishmania.
//

Note: Both the GI and VERSION number increment when a sequence is updated, 
while the ACCESSION remains the same.  The GI and "ACCESSION.VERSION" 
identifiers provide the finest resolution reference to a sequence.
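
The header fields can be read with a few lines of code. The following
is a sketch under stated assumptions: it targets nucleotide records
laid out like the example above, splits the LOCUS line on whitespace
rather than by fixed columns, and uses hypothetical function names
(parse_version_line, parse_locus_line).

```python
import re

HEADER = """\
LOCUS       NC_004916             384518 bp    DNA     linear   INV 26-JUN-2003
DEFINITION  Leishmania major chromosome 3, complete sequence.
ACCESSION   NC_004916
VERSION     NC_004916.1  GI:32189699
KEYWORDS    .
"""

# VERSION line: ACCESSION.VERSION followed by the GI number.
VERSION_RE = re.compile(r"^VERSION\s+(\S+)\.(\d+)\s+GI:(\d+)", re.MULTILINE)


def parse_version_line(flat_file_text):
    """Return accession, version, and GI from the VERSION line, or None."""
    match = VERSION_RE.search(flat_file_text)
    if match is None:
        return None
    accession, version, gi = match.groups()
    return {"accession": accession, "version": int(version), "gi": int(gi)}


def parse_locus_line(flat_file_text):
    """Return the LOCUS fields of a nucleotide record, or None."""
    for line in flat_file_text.splitlines():
        if line.startswith("LOCUS"):
            # Assumed field order: LOCUS, name, length, unit, molecule,
            # topology, division, date.
            fields = line.split()
            return {
                "name": fields[1],
                "length": int(fields[2]),
                "molecule": fields[4],
                "topology": fields[5],
                "division": fields[6],
                "date": fields[7],
            }
    return None


print(parse_version_line(HEADER))
print(parse_locus_line(HEADER))
```

Because the ACCESSION is stable while the version and GI change, a
stored (accession, version) pair is enough to detect that a sequence
has been updated since it was last retrieved.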


4.1.2 REFERENCE, DIRECT SUBMISSION, COMMENT
-------------------------------------------

REFERENCE: 

While the majority of RefSeq records do include REFERENCE
data, this data is not required and some records do not include any
citations. Publications are propagated from the GenBank record(s) from which
the RefSeq is derived, provided by collaborating groups and NCBI staff
during the curation process, and provided by the National Library of
Medicine (NLM) PubMed MeSH indexing staff as they add new articles to PubMed.

Functionally relevant citations are added by individual scientists using
the LocusLink GeneRIF submission form, and a significant volume of citation
connections are supplied by the NLM MeSH indexing staff for human, 
mouse, rat, zebrafish, and cow. This functionality is expected to increase 
in the future to treat all organisms represented in the RefSeq collection.  
Citations supplied by the MeSH indexers and individual scientists can be 
identified by the presence of a REMARK beginning with the text string "GeneRIF".
This represents a significant method to keep sequence connections to the
literature up-to-date; GeneRIFs add considerable value to the RefSeq 
collection.

For more information on GeneRIFs please see:

    http://www.ncbi.nlm.nih.gov/LocusLink/GeneRIFhelp.html


For example, several GeneRIFs have been added to NM_000173.1 including:

// 
REFERENCE   13 (bases 1 to 2480)
  AUTHORS   Poujol,C., Ware,J., Nieswandt,B., Nurden,A.T. and Nurden,P.
  TITLE     Absence of GPIbalpha is responsible for aberrant membrane
            development during megakaryocyte maturation: ultrastructural study
            using a transgenic model
  JOURNAL   Exp. Hematol. 30 (4), 352-360 (2002)
  MEDLINE   21935100
   PUBMED   11937271
  REMARK    GeneRIF: Absence of GPIbalpha is responsible for aberrant membrane
            development during megakaryocyte maturation; leads to abnormal
            partitioning of the membrane systems and abnormal proplatelet
            production.
//
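
GeneRIF citations can be picked out by testing the REMARK text, as
described above. The sketch below assumes the usual flat file layout in
which the keyword occupies the first 12 columns and continuation lines
leave those columns blank; extract_generifs is a hypothetical name.

```python
REFERENCE_BLOCK = """\
REFERENCE   13 (bases 1 to 2480)
  AUTHORS   Poujol,C., Ware,J., Nieswandt,B., Nurden,A.T. and Nurden,P.
  TITLE     Absence of GPIbalpha is responsible for aberrant membrane
            development during megakaryocyte maturation: ultrastructural study
            using a transgenic model
  JOURNAL   Exp. Hematol. 30 (4), 352-360 (2002)
  MEDLINE   21935100
   PUBMED   11937271
  REMARK    GeneRIF: Absence of GPIbalpha is responsible for aberrant membrane
            development during megakaryocyte maturation; leads to abnormal
            partitioning of the membrane systems and abnormal proplatelet
            production.
"""


def extract_generifs(record_text):
    """Collect GeneRIF remark texts, joining their continuation lines."""
    generifs = []
    current = None  # parts of the GeneRIF remark being read, if any
    for line in record_text.splitlines():
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword == "REMARK" and value.startswith("GeneRIF"):
            current = [value]
            generifs.append(current)
        elif keyword == "" and value and current is not None:
            current.append(value)  # continuation of the open remark
        else:
            current = None  # any other keyword or blank line ends it
    return [" ".join(parts) for parts in generifs]


for remark in extract_generifs(REFERENCE_BLOCK):
    print(remark)
```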

DIRECT SUBMISSION: 

A Direct Submission field is provided on some RefSeq records but not all. It
is propagated from the underlying GenBank record from which the RefSeq is 
derived or provided on submissions from collaborating groups. Transcript
and protein RefSeqs for human, mouse, rat, zebrafish, and cow do not provide
this field as records often include additional data and are not necessarily
direct copies of the GenBank submission.


COMMENT: 

A COMMENT identifying the RefSeq Status is provided for the majority of the 
RefSeq records. This comment may include information about the RefSeq status, 
collaborating groups, and the GenBank record(s) from which the RefSeq is 
derived. The RefSeq COMMENT is not provided comprehensively in this release. 
We are working to supply this COMMENT more comprehensively in the future.

Additional COMMENTS are provided for some records to provide information 
about the sequence function, notes about the aspects of curation, or 
comments describing transcript variants.

A COMMENT is always provided if the GI has changed.
 
For example (from NM_133490):

//
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from BC008969.1.
            On Dec 31, 2002 this sequence version replaced gi:19424123.
            
            Summary: Voltage-gated potassium (Kv) channels represent the most
            complex class of voltage-gated ion channels from both functional
            and structural standpoints. Their diverse functions include
            regulating neurotransmitter release, heart rate, insulin secretion,
            neuronal excitability, epithelial electrolyte transport, smooth
            muscle contraction, and cell volume. This gene encodes a member of
            the potassium channel, voltage-gated, subfamily G. This member
            functions as a modulatory subunit. The gene has strong expression
            in brain. Alternative splicing results in two transcript variants
            encoding distinct isoforms.
            
            Transcript Variant: This variant (2) has an alternate 3' sequence,
            as compared to variant 1. It encodes isoform 2 that is shorter and
            has a distinct C-terminus as compared to isoform 1.
//
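
The status, source accession, and replaced GI can be recovered from the
first lines of such a COMMENT. This is a sketch that assumes the
wording shown in the example above ("<STATUS> REFSEQ:", "derived from",
"replaced gi:"); other records may word the COMMENT differently, and
parse_refseq_comment is a hypothetical name. The COMMENT keyword column
is assumed to have been stripped already.

```python
import re

COMMENT = """\
REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from BC008969.1.
On Dec 31, 2002 this sequence version replaced gi:19424123.
"""

STATUS_RE = re.compile(r"^([A-Z][A-Z ]*?) REFSEQ:")
# Only the first source accession is captured in this sketch.
DERIVED_RE = re.compile(r"derived from ([A-Z]+\d+\.\d+)")
REPLACED_RE = re.compile(
    r"On (\w{3} \d{1,2}, \d{4}) this sequence version replaced gi:(\d+)"
)


def parse_refseq_comment(comment):
    """Extract status, source accession, and replaced GI from a COMMENT."""
    text = " ".join(comment.split())  # undo line wrapping
    status = STATUS_RE.match(text)
    derived = DERIVED_RE.search(text)
    replaced = REPLACED_RE.search(text)
    return {
        "status": status.group(1) if status else None,
        "derived_from": derived.group(1) if derived else None,
        "replaced_on": replaced.group(1) if replaced else None,
        "replaced_gi": int(replaced.group(2)) if replaced else None,
    }


print(parse_refseq_comment(COMMENT))
```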



4.1.3 NUCLEOTIDE FEATURE ANNOTATION
-----------------------------------

Gene, mRNA, CDS:
Every effort is made to consistently provide the Gene and coding sequence (CDS)
feature (when relevant).  If a RefSeq is based on a GenBank record that is only 
annotated with the CDS, then a Gene feature is created.  mRNA features are 
provided for most eukaryotic records; this is not yet comprehensively provided
and will improve in future releases.

Gene Names: 
Gene symbols and names are provided by external official 
nomenclature groups for some organisms.  If official nomenclature is 
not available we may use a systematic name provided by the data submitter 
or apply a more functional name during curation. When official nomenclature 
is available we may provide additional alternate names for some organisms.

Variation:
Variation is computed by the dbSNP database staff and added via post-processing
to RefSeq records.

Miscellaneous:
For some records, additional annotation may be provided when identified by the 
curation staff or provided by a collaborating group. For example, the location 
of polyA signal and sites may be included.


4.1.4 PROTEIN FEATURE ANNOTATION
--------------------------------

Protein Names: 
Protein names may be provided by a collaborating group, may be based on the 
Gene Name, or for some records, the curation process may identify the 
preferred protein name based on that associated with a specific EC number 
or based on the literature.

Protein Products:
Signal peptide and mature peptide annotation is provided by propagation from 
the GenBank submission that the RefSeq is based on, when provided by a 
collaborating group, or when determined by the curation process.

Domains:
Domains are computed by alignment to the NCBI Conserved Domain Database 
for human, mouse, rat, zebrafish, nematode, and cow.  The best 
hits are annotated on the RefSeq. For some records, additional functionally 
significant regions of the protein may be annotated by the curation staff.
Domain annotation is not provided comprehensively at this time.

4.2 Tracking Identifiers
------------------------

Several identifiers are provided on RefSeq records that can be used to track 
relationships between annotated features, relationships between RefSeq records, 
and changes to RefSeq records over time. 

The LocusID identifies the related Gene, mRNA, and CDS features. Transcript IDs 
(RefSeq accessions) provide an explicit connection between a transcript feature 
annotated on a genomic RefSeq record, and the RefSeq transcript record itself. 
Likewise, the Protein ID (RefSeq accessions) provides the association between 
the annotated CDS feature on a genomic or transcript RefSeq record, and the 
protein record itself.

Changes to a RefSeq sequence over time can be identified by changes to the GI 
and version number.
		
4.2.1 GeneID and LocusID
------------------------
Release 2 includes a new identifier, the GeneID.  The GeneID is being added
to RefSeq records to support development of a new Entrez database, Entrez
Gene.  Entrez Gene will provide gene-oriented information for the entire
RefSeq collection. It represents a significant expansion of the LocusLink
database concept. Once Entrez Gene is publicly available, then the dbxref
that is already provided on the RefSeq records will be hotlinked, in the
web Entrez GenBank and GenPept record displays, to this new resource.

The GeneID was initially set to be equivalent to the LocusID; these IDs
will not be synchronized in the future as new identifiers are added.

A distinct tracking ID (the LocusID) is available for human, mouse, rat, 
zebrafish, and cow records. The LocusID is provided as a database cross-
reference qualifier (dbxref) on the gene, mRNA, and CDS features. The LocusID 
can be used to identify the set of related features; this is especially useful 
when multiple products are provided to represent alternate splicing events.
	
For example:
//
     gene            19683..104490
                     /gene="DLEC1"
                     /db_xref="GeneID:9940"  <<<--- GeneID
                     /db_xref="LocusID:9940" <<<--- LocusID
                     /db_xref="MIM:604050"
//
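
The db_xref qualifiers of a feature can be gathered into a lookup table
keyed by database name. The sketch below assumes each qualifier has the
form /db_xref="Database:identifier" and that the "<<<---" markers used
in this document are not present in the actual flat file;
collect_db_xrefs is a hypothetical name.

```python
import re

FEATURE = '''\
     gene            19683..104490
                     /gene="DLEC1"
                     /db_xref="GeneID:9940"
                     /db_xref="LocusID:9940"
                     /db_xref="MIM:604050"
'''

# Database name before the first colon, identifier after it.
DBXREF_RE = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def collect_db_xrefs(feature_text):
    """Map each cross-referenced database to its list of identifiers."""
    xrefs = {}
    for database, identifier in DBXREF_RE.findall(feature_text):
        xrefs.setdefault(database, []).append(identifier)
    return xrefs


xrefs = collect_db_xrefs(FEATURE)
print(xrefs)
```

Grouping gene, mRNA, and CDS features by their shared GeneID or LocusID
value is then a matter of using that identifier as a dictionary key.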
	

When viewing RefSeq records via the internet, the LocusID is hot-linked to the 
LocusLink Gene Report page where additional descriptive information may be 
available about the gene. In the future, the LocusID will be renamed "GeneID" 
and will be provided for all species included in the RefSeq collection.

4.2.2 Transcript ID
-------------------

The transcript_id qualifier found on a mRNA or other RNA feature annotation
provides an explicit correspondence between a feature annotation on a genomic 
record and the RefSeq transcript record.


For example:

NT_011523.9      Homo sapiens chromosome 22 genomic contig.

//
     mRNA            complement(231444..239103)
                     /gene="PKDREJ"
                     /product="polycystic kidney disease (polycystin) and REJ
                     (sperm receptor for egg jelly homolog, sea urchin)-like"
                     /note="Derived by automated computational analysis using
                     gene prediction method: BestRefseq,BLAST. Supporting
                     evidence includes similarity to: 3 mRNAs"
                     /transcript_id="NM_006071.1"  <<<--- linked RefSeq transcript
                     /db_xref="GI:5174632"
                     /db_xref="GeneID:10343"
                     /db_xref="LocusID:10343"
                     /db_xref="MIM:604670"
//
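
To follow the link from a genomic feature to its transcript record, a
parser needs the feature location and the transcript_id value. The
following sketch handles only simple "start..end" and
"complement(start..end)" locations and assumes the qualifier value is
enclosed in double quotes; joins and partial locations are deliberately
not handled, and both function names are hypothetical.

```python
import re

FEATURE = '''\
     mRNA            complement(231444..239103)
                     /gene="PKDREJ"
                     /transcript_id="NM_006071.1"
                     /db_xref="GI:5174632"
'''

TRANSCRIPT_ID_RE = re.compile(r'/transcript_id="([^"]+)"')


def parse_simple_location(location):
    """Parse a simple location into (start, end, strand), else None."""
    strand = "+"
    if location.startswith("complement(") and location.endswith(")"):
        strand = "-"
        location = location[len("complement("):-1]
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        return None  # join(...), <, > and remote locations are unsupported
    return int(match.group(1)), int(match.group(2)), strand


def transcript_for_feature(feature_text):
    """Return the linked RefSeq transcript ACCESSION.VERSION, or None."""
    match = TRANSCRIPT_ID_RE.search(feature_text)
    return match.group(1) if match else None


location = FEATURE.splitlines()[0].split()[1]
print(parse_simple_location(location), transcript_for_feature(FEATURE))
```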

		
4.2.3 Protein ID
----------------

The protein_id qualifier found on a coding region (CDS) feature provides an 
explicit correspondence between feature annotation on a genomic or transcript 
RefSeq record and the RefSeq protein record.

For example:

NC_001144.2      Saccharomyces cerevisiae chromosome XII, complete chromosome
                 sequence.
//      
     CDS             complement(16639..17613)
                     /gene="MHT1"
                     /locus_tag="YLL062C"
                     /note="Mht1p;
                     go_component: cellular_component unknown [goid 8372]
                     [evidence ND];
                     go_function: homocysteine S-methyltransferase activity
                     [goid 8898] [evidence IDA] [pmid 11013242];
                     go_process: sulfur amino acid metabolism [goid 96]
                     [evidence IMP] [pmid 11013242]"
                     /codon_start=1
                     /evidence=experimental
                     /product="S-Methylmethionine Homocysteine
                     methylTransferase"
                     /protein_id="NP_013038.1"	<<<--- linked RefSeq protein
                     /db_xref="GI:6322966"
                     /db_xref="SGD:S0003985"
                     /db_xref="GeneID:850664"
                     /translation="MKRIPIKELIVEHPGKVLILDGGQGTELENRGININSPVWSAAP
                     FTSESFWEPSSQERKVVEEMYRDFMIAGANILMTITYQANFQSISENTSIKTLAAYKR
                     FLDKIVSFTREFIGEERYLIGSIGPWAAHVSCEYTGDYGPHPENIDYYGFFKPQLENF
                     NQNRDIDLIGFETIPNFHELKAILSWDEDIISKPFYIGLSVDDNSLLRDGTTLEEISV
                     HIKGLGNKINKNLLLMGVNCVSFNQSALILKMLHEHLPGMPLLVYPNSGEIYNPKEKT
                     WHRPTNKLDDWETTVKKFVDNGARIIGGCCRTSPKDIAEIASAVDKYS"
//
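
Qualifier values such as /product and /translation wrap across several
lines, so they must be rejoined before the protein_id and sequence can
be used. This is a minimal sketch: it assumes every qualifier starts
with "/" at the beginning of its (stripped) line and that continuation
lines never do, which may not hold for free-text notes.
parse_qualifiers is a hypothetical name, and the sample is an excerpt
of the CDS feature shown above.

```python
FEATURE = '''\
     CDS             complement(16639..17613)
                     /gene="MHT1"
                     /codon_start=1
                     /product="S-Methylmethionine Homocysteine
                     methylTransferase"
                     /protein_id="NP_013038.1"
                     /db_xref="GI:6322966"
                     /translation="MKRIPIKELIVEHPGKVLILDGGQGTELENRGININSPVWSAAP
                     FTSESFWEPSSQERKVVEEMYRDFMIAGANILMTITYQANFQSISENTSIKTLAAYKR
                     FLDKIVSFTREFIGEERYLIGSIGPWAAHVSCEYTGDYGPHPENIDYYGFFKPQLENF
                     NQNRDIDLIGFETIPNFHELKAILSWDEDIISKPFYIGLSVDDNSLLRDGTTLEEISV
                     HIKGLGNKINKNLLLMGVNCVSFNQSALILKMLHEHLPGMPLLVYPNSGEIYNPKEKT
                     WHRPTNKLDDWETTVKKFVDNGARIIGGCCRTSPKDIAEIASAVDKYS"
'''


def parse_qualifiers(feature_text):
    """Return (name, value) pairs with wrapped values rejoined."""
    qualifiers = []
    for raw in feature_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            name, _, value = line[1:].partition("=")
            qualifiers.append([name, value])
        elif qualifiers:
            # Continuation line: sequence data is joined without spaces,
            # free text with a single space.
            separator = "" if qualifiers[-1][0] == "translation" else " "
            qualifiers[-1][1] += separator + line
        # Lines before the first qualifier (feature key and location)
        # are skipped.
    return [(name, value.strip('"')) for name, value in qualifiers]


qualifiers = dict(parse_qualifiers(FEATURE))
print(qualifiers["protein_id"], len(qualifiers["translation"]))
```

Converting the pairs to a dict is only safe when no qualifier repeats;
features with several /db_xref lines should keep the list form.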


4.2.4 Conserved Domain Database (CDD) ID
----------------------------------------

The CDD identifier found on protein records, and mapped to associated 
nucleotide records as a misc_feature, identifies protein domains that are
found on the record. CDD annotation is applied computationally. Initially
this annotation was provided for a subset of RefSeq; it will be applied
to the entire collection in the near future.

For example:

NP_000550.2	     A-gamma globin

//

     Region          5..147
                     /region_name="Globin"
                     /note="globin"
                     /db_xref="CDD:pfam00042"  <<<--- conserved domain database


//


=============================================================================
5. REFSEQ ADMINISTRATION
=============================================================================

The National Center for Biotechnology Information (NCBI), National Library
of Medicine, National Institutes of Health, is responsible for the production
and distribution of the NIH RefSeq Sequence Database. NCBI distributes
RefSeq sequence data by anonymous FTP. For more information, you may contact 
NCBI by email at info@ncbi.nlm.nih.gov or by phone at 301-496-2475.

5.1 Citing RefSeq
-----------------

When citing data in RefSeq, it is appropriate to give the sequence name,
and primary accession and version number (or GI). Note, the most accurate
citation of the sequence is provided by including the combined accession plus
version number or the GI number.

It is also appropriate to list a reference for the RefSeq project. The 
following on-line publication provides the most complete description and
should be cited when possible:

   The NCBI handbook [Internet]. Bethesda (MD): National Library of 
   Medicine (US), National Center for Biotechnology Information; 2002 
   Oct. Chapter 17, The Reference Sequence (RefSeq) Project. 
   Available from http://www.ncbi.nih.gov/entrez/query.fcgi?db=Books 

If on-line citations are not accepted by a journal, please use the following
citation:

   NCBI Reference Sequence Project: update and current status
   Pruitt KD, Tatusova T, Maglott DR
   Nucleic Acids Res 2003 Jan 1;31(1):34-37


5.2 RefSeq Distribution Formats
-------------------------------

Complete flat file releases of the RefSeq database are available via
NCBI's anonymous ftp server:

	ftp://ftp.ncbi.nih.gov/refseq/release/

Each release is cumulative, incorporating previous data plus new data.
Records that have been suppressed are not included in the release.

Incremental updates that become available between RefSeq releases
are available at:

ftp://ftp.ncbi.nih.gov/refseq/daily/new
ftp://ftp.ncbi.nih.gov/refseq/cumulative

Please refer to the README for additional information:
ftp://ftp.ncbi.nih.gov/refseq/README
ftp://ftp.ncbi.nih.gov/refseq/CHANGE_NOTICE

5.3 Other Methods of Accessing RefSeq Data
------------------------------------------

Entrez is a molecular biology database system that presents an integrated
view of DNA and protein sequence data, structure data, genome data, 
publications, and other data fields.  The Entrez query and retrieval
system is produced by the National Center for Biotechnology Information
(NCBI) and is available only via the internet.

Entrez is accessed at:

	http://www.ncbi.nih.gov/Entrez/

RefSeq entries are indexed for retrieval in the Entrez system. The web-based
filter restrictions can be used to restrict your query to RefSeq data or to 
specific subsets of the RefSeq database.

Additional specific property restrictions are provided to support querying
for RefSeq records with specific STATUS codes. Queries are defined on the
RefSeq web site at:

	http://www.ncbi.nih.gov/RefSeq/


5.4 Request for Corrections and Comments
----------------------------------------

We welcome your suggestions to improve the RefSeq collection; we invite 
groups interested in contributing toward the collection and curation 
of the RefSeq database to improve the representation of single genes, 
gene families, or complete genomes to contact us.

Please refer to RefSeq accession and version numbers (or GI) and the RefSeq
Release number (release 2) to which your comments apply; it is useful if you
indicate the source of data that you found to be problematic (e.g., data on
the FTP site, data retrieved on the web site), the entry DEFLINE, and the 
specific annotation field for which you are suggesting a change.

Suggestions and corrections can be sent to:

	info@ncbi.nlm.nih.gov


5.5 Credits and Acknowledgements
--------------------------------

This RefSeq release would not be possible without the support of numerous
collaborators and the primary sequence data that is submitted by thousands
of laboratories and available in GenBank.

The RefSeq project is ambitious in scope and we actively welcome opportunities
to work with other groups to provide this collection. We value all of our 
collaborators; they contribute information with a large range in scope and 
volume such as completely annotated genomes, advice to improve the sequence 
or annotation of individual RefSeq records, information about official 
nomenclature, and information about function.

In addition to the significant information collected by collaboration, 
numerous NCBI staff are involved in infrastructure support, programmatic 
support, and curation. RefSeq is supported by 3 primary work groups that 
are associated with LocusLink, Entrez Genomes, and the Genome Annotation 
Pipeline. 

See the RefSeq web site for a list of collaborating groups and in-house
staff.


5.6 Disclaimer
--------------

The United States Government makes no representations or warranties
regarding the content or accuracy of the information.  The United States
Government also makes no representations or warranties of merchantability
or fitness for a particular purpose or that the use of the sequences will
not infringe any patent, copyright, trademark, or other rights.  The
United States Government accepts no responsibility for any consequence
of the receipt or use of the information.

For additional information about RefSeq releases, please contact
NCBI by e-mail at info@ncbi.nlm.nih.gov or by phone at (301) 496-2475.