###################################################################
README for
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog

Last updated: December 21, 2022
###################################################################
_________________________________________________________________________

National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD 20894, USA
tel: (301) 496-2475
fax: (301) 480-9241
e-mail: info@ncbi.nlm.nih.gov
_________________________________________________________________________

Updates:

December 21, 2022
    Changed CRC checksum to MD5 checksum

December 4, 2020
    The 'release#.files.installed' section was modified to add the
    names of the columns, CRC checksum and File name

January 11, 2018
    The following sections were updated to reflect dropping of gi
    numbers, and other changes.

    'RefSeq-release#.catalog' now contains 6 columns instead of 7
      - 'gi' deleted as column 4
    'release#.removed-records' now contains 7 columns instead of 8
      - 'gi' deleted as column 4
    'release#.AutonomousProtein2Genomic.gz' now contains 6 columns
    instead of 5
      - 'Protein gi' deleted as column 2
      - 'Nucleotide gi' deleted as column 4
      - 'Strain-level tax_id' added as column 3
      - 'BioSample' added as column 5
      - 'Species name' added as column 6
    'release#.MultispeciesAutonomousProtein2taxname.gz' now contains
    3 columns instead of 4
      - 'Protein gi' deleted as column 2

July 21, 2017
    The 'release#.files.installed' section was modified to clarify the
    relationship between files with the same numerical increment.

October 21, 2016
    Updated two file names to match the file names in
    ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
      release#.WP2genomic.mapping.txt.gz was updated to
        release#.AutonomousProtein2Genomic.gz
      release#.multispecies_WP_accession_to_taxname.txt.gz was updated to
        release#.MultispeciesAutonomousProtein2taxname.gz

November 21, 2013
    Added documentation for files:
      release#.WP2genomic.mapping.txt.gz
      release#.multispecies_WP_accession_to_taxname.txt.gz

November 23, 2005
    New file provided with the release:
      release#.accession2geneid

March 10, 2005
    New files provided with the release:
      release#.taxon.new
      release#.taxon.update

April 5, 2004
    RefSeq-release4.txt was updated to remove redundancy.
    The original is available as
    RefSeq-release4.originalwithduplicates.txt
_________________________________________________________________________

This directory includes files documenting the contents of the RefSeq
release, both as an accession list and as a file list, and the records
that were included in the previous release but are not included in the
current release.

Files included are:
    RefSeq-release#.catalog
    release#.files.installed
    release#.removed-records
    release#.taxon.new
    release#.taxon.update
    release#.accession2geneid
    release#.AutonomousProtein2Genomic.gz
    release#.MultispeciesAutonomousProtein2taxname.gz

    where '#' is the release number

Subdirectories:
    archive - previous release catalogs are available here

==========================================
RefSeq-release#.catalog
==========================================
Content: Tab-delimited listing of all accessions included in the
         current RefSeq release.

Columns:
  1. taxonomy ID
  2. species name
  3. accession.version
  4. RefSeq release directory the accession is included in:
       complete + other directories
       '|' delimited
  5. RefSeq status
       na - not available; status codes are not applied to most
            genomic records
       INFERRED
       PREDICTED
       PROVISIONAL
       VALIDATED
       REVIEWED
       MODEL
       UNKNOWN - status code not provided, although one is usually
                 provided for this type of record
  6. length

==========================================
release#.files.installed
==========================================
Content: Tab-delimited list of sequence files installed for the
         current release and the corresponding MD5 checksum.

Columns:
  1. MD5 checksum
  2. File name

File name format indicates the directory node, molecule type, and
format type.

Name format:   complete10.bna.gz
              |-------|--|---|--|
                  1    2   3   4

  1. directory location
  2. numerical increment
  3. format type
  4. compression

Multiple files may be provided for any given molecule and format type,
indicated by a numerical increment in the file names. Files of the same
molecule type and increment are related by content. Files of different
molecule type and the same increment may or may not have related
content.

For example:
  complete.1006.bna.gz
  complete.1006.1.genomic.fna.gz,
  complete.1006.2.genomic.fna.gz
      -- genomic FASTA split into two sub-parts due to size
  complete.1006.genomic.gbff.gz
      -- content related to the two 1006.#.genomic.fna.gz files
  complete.1006.protein.faa.gz,
  complete.1006.protein.gpff.gz
      -- contains proteins found in either genomic or rna files of
         this increment
  complete.1006.rna.fna.gz,
  complete.1006.rna.gbff.gz
      -- unrelated to the contents of the genomic files of this
         increment

If you are interested in a complete set of genomic, protein, and rna
files for a given tax_id, you must scan all files from the directory.
You may also want to consider using the per-assembly files provided at
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ instead. More information is
available at: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/

Note that for some molecule and format types, a number increment is
skipped. This is not an error.
The RefSeq release processing first produces a set of split ASN.1 files,
which are then used to export the records by molecule and format type.
If an ASN.1 file does not include any records for a given molecule
type, such as genomic sequence data, then the corresponding 'genomic'
FASTA and flat files are not provided.

For example:
  complete10.bna.gz
  complete10.genomic.fna.gz
  complete10.genomic.gbff.gz
  complete10.protein.faa.gz
  complete10.protein.gpff.gz
  complete10.rna.fna.gz
  complete10.rna.gbff.gz

If complete10.bna.gz includes genomic, RNA, and protein data, then the
full set of files is provided. In contrast, if complete24.bna.gz
includes only genomic and protein data, then the corresponding rna
files are not provided:

  complete24.bna.gz
  complete24.genomic.fna.gz
  complete24.genomic.gbff.gz
  complete24.protein.faa.gz
  complete24.protein.gpff.gz

==========================================
release#.removed-records
==========================================
Content: Tab-delimited report of records that were included in the
         previous release but are not included in the current release.

Columns:
  1. taxonomy ID
  2. species name
  3. accession.version
  4. RefSeq release directory the accession is included in:
       complete + other directories
       '|' delimited
  5. RefSeq status
       na - not available; status codes are not applied to most
            genomic records
       INFERRED
       PREDICTED
       PROVISIONAL
       VALIDATED
       REVIEWED
       MODEL
       UNKNOWN - status code not provided, although one is usually
                 provided for this type of record
  6. length
  7. removed status
       dead protein: protein was removed when the genomic record was
         reloaded and the protein was not found on the nucleotide
         update. This is an implied permanent suppress.
       temporarily suppressed: record was temporarily removed and may
         be restored at a later date.
       permanently suppressed: record was permanently removed. It is
         possible to restore this type of record; however, at the
         time of removal that action is not anticipated.
       replaced by accession: the accession in column 3 has become a
         secondary accession of the accession cited in column 7.

==========================================
release#.taxon.new
==========================================
Content: Tab-delimited report of organisms that have been added to the
         RefSeq collection since the previous release.

Columns:
  1. taxonomy ID
  2. species name
  3. RefSeq release directory that the data is included in:
       complete + other directories
       '|' delimited

==========================================
release#.taxon.update
==========================================
Content: Report of organisms for which either the NCBI taxonomy ID or
         the species name has been modified since the previous release.

Columns ('|' delimited):
  1. taxonomy ID, current
  2. species name, current
  3. taxonomy ID, previous
  4. species name, previous

==========================================
release#.accession2geneid
==========================================
Content: Report of GeneIDs available at the time of the RefSeq release.
         Limited to GeneIDs that are associated with RNA or mRNA
         records with accession prefix N[M|R] and X[M|R].

Columns (tab delimited):
  1: Taxonomy ID
  2: Entrez GeneID
  3: Transcript accession.version
  4: Protein accession.version
       na if no data
       -- for example, the NR_ accession prefix is used for RNA, so
          there is no corresponding protein record

==========================================
release#.AutonomousProtein2Genomic.gz
==========================================
Content: Report of the genomic accessions on which nonredundant WP
         protein accessions are annotated. Multiple rows are provided
         for nonredundant proteins that have been annotated on
         multiple genomic records.

Columns:
  1: Protein accession.version
  2: Genomic nucleotide accession.version
  3: Strain-level tax_id
  4: Species-level tax_id
  5: BioSample
  6: Species name

=====================================================
release#.MultispeciesAutonomousProtein2taxname.gz
=====================================================
Content: Report of the NCBI TaxID and species names for the subset of
         nonredundant proteins (WP accession) that are annotated on
         genomic records from more than one species.

Columns:
  1: Protein accession.version
  2: Species-level tax_id of genomic record(s) that the protein is
     annotated on
  3: Species name (string)
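
==========================================
Example: reading RefSeq-release#.catalog
==========================================
The six-column, tab-delimited layout documented above for
RefSeq-release#.catalog can be read with a short script. The following
is a minimal Python sketch and is not an NCBI-provided tool. The names
CatalogRecord, parse_catalog_line, and SAMPLE_ROW are hypothetical, and
the sample row is an invented placeholder, not a real catalog entry.
It assumes that every row has exactly six fields and that column 4
holds one or more directory names separated by '|'.

```python
from typing import List, NamedTuple


class CatalogRecord(NamedTuple):
    """One row of RefSeq-release#.catalog (hypothetical helper type)."""
    tax_id: int             # column 1: taxonomy ID
    species: str            # column 2: species name
    accession: str          # column 3: accession.version
    directories: List[str]  # column 4: release directories, '|' delimited
    status: str             # column 5: RefSeq status (e.g. REVIEWED, na)
    length: int             # column 6: length


def parse_catalog_line(line: str) -> CatalogRecord:
    """Split one tab-delimited catalog row into named fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}")
    tax_id, species, accession, dirs, status, length = fields
    return CatalogRecord(
        tax_id=int(tax_id),
        species=species,
        accession=accession,
        # Drop empty pieces in case the field has a stray delimiter.
        directories=[d for d in dirs.split("|") if d],
        status=status,
        length=int(length),
    )


# Invented placeholder row for illustration only (not a real record).
SAMPLE_ROW = "9606\tHomo sapiens\tNM_000000.1\tcomplete|example_dir\tREVIEWED\t1234\n"

if __name__ == "__main__":
    record = parse_catalog_line(SAMPLE_ROW)
    print(record.accession, record.directories, record.status)
```

For a full catalog, the same function could be applied to each line of
the file as it is read, which avoids loading the whole listing into
memory. The other tab-delimited reports in this directory could be
handled the same way with their own column lists.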