###################################################################
README for ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog

Last updated: December 21, 2022

###################################################################

_________________________________________________________________________

       
       National Center for Biotechnology Information (NCBI)
             National Library of Medicine
             National Institutes of Health
             8600 Rockville Pike
             Bethesda, MD 20894, USA
             tel: (301) 496-2475
             fax: (301) 480-9241
             e-mail: info@ncbi.nlm.nih.gov
             
_________________________________________________________________________

Updates:
  December 21, 2022
  Changed CRC checksum to MD5 checksum

  December 4, 2020
  The 'release#.files.installed' section was modified to add the names of
  the columns, CRC checksum and File name.

  January 11, 2018
  The following sections were updated to reflect the dropping of gi numbers
  and other changes.
      'RefSeq-release#.catalog' now contains 6 columns instead of 7
         - 'gi' deleted as column 4
      'release#.removed-records' now contains 7 columns instead of 8
         - 'gi' deleted as column 4
      'release#.AutonomousProtein2Genomic.gz' now contains 6 columns
      instead of 5
         - 'Protein gi' deleted as column 2
         - 'Nucleotide gi' deleted as column 4
         - 'Strain-level tax_id' added as column 3
         - 'BioSample' added as column 5
         - 'Species name' added as column 6
      'release#.MultispeciesAutonomousProtein2taxname.gz' now contains
      3 columns instead of 4
         - 'Protein gi' deleted as column 2
	 
  July 21, 2017
  The 'release#.files.installed' section was modified to clarify the 
  relationship between files with the same numerical increment. 
 
  October 21, 2016
  Updated two file names to match the file names in 
  ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
     release#.WP2genomic.mapping.txt.gz was updated to
        release#.AutonomousProtein2Genomic.gz
     release#.multispecies_WP_accession_to_taxname.txt.gz was updated to
        release#.MultispeciesAutonomousProtein2taxname.gz 
 	  
  November 21, 2013
  Added documentation for files:
     release#.WP2genomic.mapping.txt.gz
     release#.multispecies_WP_accession_to_taxname.txt.gz

  November 23, 2005
  New file provided with the release:
     release#.accession2geneid

  March 10, 2005
  New files provided with the release:
     release#.taxon.new
     release#.taxon.update

  April 5, 2004
  RefSeq-release4.txt was updated to remove redundancy
  The original is available as RefSeq-release4.originalwithduplicates.txt

_________________________________________________________________________

This directory includes files that document the contents of the RefSeq
release, both as an accession list and as a file list, and that report
records included in the previous release but not in the current release.


Files included are:

  RefSeq-release#.catalog 
  release#.files.installed
  release#.removed-records
  release#.taxon.new
  release#.taxon.update
  release#.accession2geneid
  release#.AutonomousProtein2Genomic.gz
  release#.MultispeciesAutonomousProtein2taxname.gz

  where '#' is the release number

Subdirectories:
  archive - previous release catalogs are available here

==========================================
RefSeq-release#.catalog 
==========================================
Content: Tab-delimited listing of all accessions included in the current 
RefSeq release.

Columns:
 1. taxonomy ID
 2. species name
 3. accession.version
 4. RefSeq release directories that include the accession
      complete + other directories
      '|' delimited
 5. RefSeq status
      na - not available; status codes are not applied to most genomic records
      INFERRED
      PREDICTED
      PROVISIONAL
      VALIDATED
      REVIEWED
      MODEL
      UNKNOWN - status code not provided, although one is usually provided
                for this type of record
 6. length     
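
As an illustration of the column layout above, the following Python sketch
parses one catalog line into named fields. The names CatalogRecord and
parse_catalog_line are hypothetical, and the sample line is a made-up
placeholder, not a record taken from a release.

```python
from typing import List, NamedTuple


class CatalogRecord(NamedTuple):
    tax_id: int
    species: str
    accession: str          # accession.version
    directories: List[str]  # column 4, split on '|'
    status: str
    length: int


def parse_catalog_line(line: str) -> CatalogRecord:
    """Split one tab-delimited catalog line into its 6 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}")
    tax_id, species, accession, directories, status, length = fields
    return CatalogRecord(
        int(tax_id), species, accession, directories.split("|"), status, int(length)
    )


# Placeholder line for illustration only (not real release data).
sample = "9606\tHomo sapiens\tNM_000000.1\tcomplete|vertebrate_mammalian\tREVIEWED\t1234\n"
record = parse_catalog_line(sample)
print(record.accession, record.directories, record.status)
```

Raising an error on an unexpected column count is a design choice that
makes a layout change, such as the removal of the gi column, fail loudly
rather than silently shifting fields.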

==========================================
release#.files.installed
==========================================
Content: Tab-delimited list of sequence files installed for the current release
and the corresponding MD5 checksum. 

Columns: 
1. MD5 checksum
2. File name
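
The two columns above can be used to check downloaded files. The following
Python sketch is one way to do that, under these assumptions: the checksum
is a hexadecimal MD5 digest of the file as distributed (compressed), the
list has no header row, and the files were downloaded to a local directory.
The function names read_installed, md5_of, and verify are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def read_installed(lines: Iterable[str]) -> Dict[str, str]:
    """Map file name -> MD5 checksum from 'release#.files.installed' lines."""
    checksums = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        md5, name = line.split("\t", 1)  # column 1: checksum, column 2: file name
        checksums[name] = md5.lower()
    return checksums


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(download_dir: Path, checksums: Dict[str, str]) -> Dict[str, bool]:
    """Return file name -> True/False for each listed file present locally."""
    return {
        name: md5_of(download_dir / name) == expected
        for name, expected in checksums.items()
        if (download_dir / name).is_file()
    }
```

Files that are listed but not present locally are left out of the result,
so a partial download can still be checked.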

The file name format indicates the directory node, the numerical increment,
the molecule type (where applicable), the format type, and the compression.

Name format:

 complete10.bna.gz
|-------|--|---|--|
   1     2   3  4

   1. directory location 
   2. numerical increment
   3. format type 
   4. compression 

Multiple files may be provided for any given molecule and format type, indicated 
by a numerical increment in the file names. Files of the same molecule type and 
increment are related by content. Files of different molecule type and the same 
increment may or may not have related content. For example:

    complete.1006.bna.gz
    complete.1006.1.genomic.fna.gz, complete.1006.2.genomic.fna.gz 
          -- genomic FASTA split into two sub-parts due to size 
    complete.1006.genomic.gbff.gz -- content related to the two 1006.#.genomic.fna.gz files 
    complete.1006.protein.faa.gz, complete.1006.protein.gpff.gz 
          -- contains proteins found in either genomic or rna files of this increment 
    complete.1006.rna.fna.gz
    complete.1006.rna.gbff.gz -- unrelated to the contents of the genomic files of this increment

If you are interested in a complete set of genomic, protein, and rna files for a given 
tax_id, you must scan all files from the directory. You may also want to consider using 
the per-assembly files provided at ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ instead. 
More information is available at:
    https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/

Note that for some molecule and format types, a number increment is skipped.
This is not an error.  The RefSeq release processing first produces a set of 
split ASN.1 files which are used to export the records by molecule and format 
type. If an ASN.1 file does not include any records for a given molecule type, 
such as genomic sequence data, then the corresponding 'genomic' FASTA and 
flat files are not produced.

For example:

 complete10.bna.gz
 complete10.genomic.fna.gz
 complete10.genomic.gbff.gz
 complete10.protein.faa.gz
 complete10.protein.gpff.gz
 complete10.rna.fna.gz
 complete10.rna.gbff.gz

If complete10.bna.gz includes genomic, RNA, and protein data, then the
full set of files is provided.

In contrast, if complete24.bna.gz includes only genomic and protein data,
then the corresponding rna files are not provided:

 complete24.bna.gz
 complete24.genomic.fna.gz
 complete24.genomic.gbff.gz
 complete24.protein.faa.gz
 complete24.protein.gpff.gz
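
The naming scheme and the skipped-increment behavior described above can be
sketched in Python as follows. The regular expression is inferred from the
example names in this section and is an assumption, not an official
specification; the names NAME_RE, parse_name, and molecules_by_increment
are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set

# Pattern inferred from the example names in this README (an assumption).
# It accepts both 'complete10.bna.gz' and 'complete.1006.1.genomic.fna.gz'.
NAME_RE = re.compile(
    r"^(?P<directory>[A-Za-z_]+)\.?"             # directory location
    r"(?P<increment>\d+)"                        # numerical increment
    r"(?:\.(?P<part>\d+))?"                      # optional sub-part of a split file
    r"(?:\.(?P<molecule>genomic|protein|rna))?"  # molecule type (absent for .bna)
    r"\.(?P<format>bna|fna|faa|gbff|gpff)"       # format type
    r"\.gz$"                                     # compression
)


def parse_name(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the named parts of a file name, or None if it does not match."""
    match = NAME_RE.match(name)
    return match.groupdict() if match else None


def molecules_by_increment(names: Iterable[str]) -> Dict[int, Set[str]]:
    """Collect the molecule types that have files for each increment."""
    found = defaultdict(set)
    for name in names:
        parts = parse_name(name)
        if parts and parts["molecule"]:
            found[int(parts["increment"])].add(parts["molecule"])
    return dict(found)


example_names = [
    "complete10.bna.gz", "complete10.genomic.fna.gz", "complete10.genomic.gbff.gz",
    "complete10.protein.faa.gz", "complete10.protein.gpff.gz",
    "complete10.rna.fna.gz", "complete10.rna.gbff.gz",
    "complete24.bna.gz", "complete24.genomic.fna.gz", "complete24.genomic.gbff.gz",
    "complete24.protein.faa.gz", "complete24.protein.gpff.gz",
]
for increment, molecules in sorted(molecules_by_increment(example_names).items()):
    missing = {"genomic", "protein", "rna"} - molecules
    print(increment, sorted(molecules), "missing:", sorted(missing))
```

A molecule type reported as missing for an increment is expected in the
situation described above and does not by itself indicate an incomplete
download.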

==========================================
release#.removed-records
==========================================
Content: Tab-delimited report of records that were included in the previous 
release but are not included in the current release.

Columns:
 1. taxonomy ID
 2. species name
 3. accession.version
 4. RefSeq release directories that included the accession
      complete + other directories
      '|' delimited
 5. RefSeq status
      na - not available; status codes are not applied to most genomic records
      INFERRED
      PREDICTED
      PROVISIONAL
      VALIDATED
      REVIEWED
      MODEL
      UNKNOWN - status code not provided, although one is usually provided
                for this type of record
 6. length     
 7. removed status
      dead protein: protein was removed when the genomic record was reloaded
                    and the protein was not found on the nucleotide update.
                    This is an implied permanent suppression.

      temporarily suppressed: record was temporarily removed and may be 
                              restored at a later date.

      permanently suppressed: record was permanently removed. It is possible 
                              to restore this type of record; however, at the 
                              time of removal that action is not anticipated.

      replaced by accession:  the accession in column 3 has become a secondary 
                              accession to the accession cited in column 7.
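
As an illustration of the 7-column layout above, the following Python sketch
groups removed accessions by the value of the removed status column. It
assumes the status value is compared as an exact string; the function name
accessions_by_removed_status is hypothetical, and the sample rows are
made-up placeholders, not real release data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def accessions_by_removed_status(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group accession.version (column 3) by removed status (column 7)."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        grouped[fields[6]].append(fields[2])
    return dict(grouped)


# Placeholder rows for illustration only.
sample_rows = [
    "9606\tHomo sapiens\tNM_000000.1\tcomplete\tREVIEWED\t1200\ttemporarily suppressed\n",
    "9606\tHomo sapiens\tNP_000000.1\tcomplete\tREVIEWED\t400\tdead protein\n",
    "9606\tHomo sapiens\tXM_000000.1\tcomplete\tMODEL\t900\ttemporarily suppressed\n",
]
for status, accessions in accessions_by_removed_status(sample_rows).items():
    print(status, accessions)
```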

==========================================
release#.taxon.new
==========================================
Content: Tab-delimited report of organisms that have been added to the RefSeq 
collection since the previous release.

Columns:
 1. taxonomy ID
 2. species name
 3. RefSeq release directories that include the data
      complete + other directories
      '|' delimited

==========================================
release#.taxon.update
==========================================
Content: Report of organisms for which either the NCBI taxonomy ID or the 
species name has been modified since the previous release.

Columns ('|' delimited):

 1. taxonomy ID, current
 2. species name, current
 3. taxonomy ID, previous
 4. species name, previous
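
Because a row is reported when either the taxonomy ID or the species name
has changed, it can be useful to tell the two cases apart. The following
Python sketch does so for a single '|' delimited line. The function name
describe_taxon_update is hypothetical, the sample lines are made-up
placeholders, and the handling of surrounding whitespace is an assumption.

```python
from typing import List


def describe_taxon_update(line: str) -> List[str]:
    """Report which of taxonomy ID / species name differ from the previous release."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns, found {len(fields)}")
    current_id, current_name, previous_id, previous_name = fields[:4]
    changes = []
    if current_id != previous_id:
        changes.append("taxonomy ID")
    if current_name != previous_name:
        changes.append("species name")
    return changes


# Placeholder lines for illustration only.
print(describe_taxon_update("100|Examplea nova|100|Examplea vetus"))
print(describe_taxon_update("200|Examplea nova|100|Examplea nova"))
```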

==========================================
release#.accession2geneid
==========================================
Content: Report of GeneIDs available at the time of the RefSeq release. 
Limited to GeneIDs that are associated with RNA or mRNA records with 
accession prefixes NM_, NR_, XM_, or XR_.

Columns (tab delimited):

    1: Taxonomic ID
    2: Entrez GeneID
    3: Transcript accession.version              
    4: Protein accession.version
       na if no data 
       --for example, the NR_ accession prefix is used for RNA
         so there is no corresponding protein record
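
As an illustration of the columns above, the following Python sketch builds
a lookup from transcript accession to GeneID and protein accession, turning
the 'na' placeholder into None. The function name load_accession2geneid is
hypothetical, and the sample rows are made-up placeholders, not real
release data.

```python
from typing import Dict, Iterable, Optional, Tuple


def load_accession2geneid(
    lines: Iterable[str],
) -> Dict[str, Tuple[int, Optional[str]]]:
    """Map transcript accession.version -> (GeneID, protein accession or None)."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _tax_id, gene_id, transcript, protein = fields
        mapping[transcript] = (int(gene_id), None if protein == "na" else protein)
    return mapping


# Placeholder rows for illustration only.
sample_rows = [
    "9606\t1\tNM_000000.1\tNP_000000.1\n",
    "9606\t2\tNR_000000.1\tna\n",  # non-coding RNA: no protein record
]
lookup = load_accession2geneid(sample_rows)
print(lookup["NR_000000.1"])
```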
               
==========================================
release#.AutonomousProtein2Genomic.gz
==========================================
Content: Report of the genomic accessions on which nonredundant WP
protein accessions are annotated. Multiple rows are provided
for nonredundant proteins that have been annotated on multiple
genomic records.

Columns:
    1: Protein accession.version
    2: Genomic nucleotide accession.version
    3: Strain-level tax_id
    4: Species-level tax_id
    5: BioSample
    6: Species name
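
Because one protein can appear on several rows, a common first step is to
collect all genomic accessions for each protein. The following Python
sketch does this directly from the compressed file. It assumes the columns
are tab delimited (the delimiter is not stated above) and skips any line
starting with '#' as a precaution in case a header is present. The function
name genomic_by_protein is hypothetical.

```python
import gzip
from collections import defaultdict
from typing import Dict, List


def genomic_by_protein(path: str) -> Dict[str, List[str]]:
    """Map protein accession.version -> list of genomic accession.version."""
    locations = defaultdict(list)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and possible header lines
            fields = line.rstrip("\n").split("\t")  # assumed tab delimited
            if len(fields) < 2:
                continue
            locations[fields[0]].append(fields[1])
    return dict(locations)
```

For example, genomic_by_protein("release#.AutonomousProtein2Genomic.gz"),
with '#' replaced by the release number, would return one entry per WP
accession. The full file is large, so holding every row in memory may not
be practical; filtering on the tax_id columns while reading is one option.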

=====================================================
release#.MultispeciesAutonomousProtein2taxname.gz
=====================================================
Content: Report of the NCBI TaxID and species names for the subset
of nonredundant proteins (WP accession) which are annotated on genomic
records from more than one species. 

Columns:
   1: Protein accession.version
   2: Species-level tax_id of genomic record(s) that the protein is annotated on
   3: Species name (string)