README:  Gene                       Last modified:  January 2, 2025

NOTE: As files are added or modified in this ftp site, notification will be
sent via the Gene News RSS feed.

You may subscribe to the Gene News RSS feed here:
            https://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=genenews

A comparison of the files previously available from LocusLink to those now 
   available from Entrez Gene is provided here:

https://www.ncbi.nlm.nih.gov/entrez/query/static/help/LL2G.html#files

Files are provided in several directories and subdirectories.  
  This document is comprehensive, and subdivided according to the path in 
  which files are found. 

  Most of the files in this path are re-calculated daily. Gene does not,
  however, compare previous and current data, so the date on the file may 
  change without any change in content.

Changes not affecting use of the ftp site:
15 Dec 2009: removed references to LocusLink
             altered the labels in the interactions section, by appending 1 
             and 2

I.     DATA directory
II.    DATA directory, ASN_BINARY subdirectory
III.   DATA directory, GENE_INFO subdirectory
IV.    DATA directory, expression subdirectory
V.     GeneRIF directory (includes reports of interactions)
VI.    tools directory
VII.   gene-related files from genome annotation
===========================================================================
===========================================================================
I.     Files in the DATA directory
===========================================================================
===========================================================================
gene2accession                                  recalculated daily
---------------------------------------------------------------------------
           This file is a comprehensive report of the accessions that are 
           related to a GeneID.  It includes sequences from the international
           sequence collaboration, Swiss-Prot, and RefSeq. The RefSeq subset
           of this file is also available as gene2refseq.

           Because this file is updated daily, the RefSeq subset does not 
           reflect any RefSeq release. Versions of RefSeq RNA and protein 
           records may be more recent than those included in an annotation
           release (build) or those in the current RefSeq release.

           More notes about this file:

           tab-delimited
           one line per genomic/RNA/protein set of sequence accessions
           Column header line is the first line in the file.

           NOTE: Because this file is comprehensive, it may include
           some RefSeq accessions that are not current, because they are
           part of the annotation of the current genomic assembly. In other 
           words, the annotation of a genome is not continuous, but depends
           on a data freeze. Sub-genomic RefSeqs, however, are updated 
           continuously. Thus some RefSeqs may have been replaced or 
           suppressed after a data freeze associated with a genomic annotation. 
           Until the release of a new genomic annotation, all
           RefSeqs that are included in the current annotation are reported
           in this file.
---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the unique identifier for a gene

status:
           status of the RefSeq if a refseq, else '-'
           RefSeq values are: INFERRED, MODEL, NA, PREDICTED, PROVISIONAL,
           REVIEWED, SUPPRESSED, VALIDATED

RNA nucleotide accession.version:
           may be null (-) for some genomes

RNA nucleotide gi:
           the gi for an RNA nucleotide accession, '-' if not applicable

protein accession.version:
           will be null (-) for RNA-coding genes

protein gi:
           the gi for a protein accession, '-' if not applicable

genomic nucleotide accession.version:
           may be null (-) 

genomic nucleotide gi:
           the gi for a genomic nucleotide accession, '-' if not applicable

start position on the genomic accession:
            position of the gene feature on the genomic accession,
            '-' if not applicable
            position 0-based

            NOTE: this file does not report the position of each exon.
            For positions on RefSeq contigs and chromosomes, 
            use the gff3 file in the appropriate annotation directory.
            For example, for the human genome,
            ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/

            The gff3 file has one line for each feature.

            WARNING: Positions in gff3 files are 1-based, not 0-based

            NOTE: if genes are merged after an annotation is released, there 
            may be more than one location reported on a genomic sequence 
            per GeneID, each resulting from the annotation before the merge.

end position on the genomic accession:
            position of the gene feature on the genomic accession,
            '-' if not applicable
            position 0-based

            NOTE: this file does not report the position of each exon.
            For positions on RefSeq contigs and chromosomes, 
            use the gff3 file in the appropriate annotation directory.
            For example, for the human genome,
            ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/

            The gff3 file has one line for each feature.

            WARNING: Positions in gff3 files are 1-based, not 0-based

            NOTE: if genes are merged after an annotation is released, there 
            may be more than one location reported on a genomic sequence 
            per GeneID, each resulting from the annotation before the merge.

orientation:
            orientation of the gene feature on the genomic accession,
            '?' if not applicable

assembly:
            the name of the assembly
            '-' if not applicable

mature peptide accession.version:
           will be null (-) if absent

mature peptide gi:
           the gi for a mature peptide accession, '-' if not applicable

Symbol:
           the default symbol for the gene
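
           EXAMPLE: The start and end columns above are 0-based, while gff3
           files are 1-based. The Python sketch below shows one way to read
           a data line and shift the positions. The names COLUMNS,
           parse_line and to_gff3_coords, and the sample line, are
           hypothetical. The sketch assumes that the 16 columns appear in
           the order listed above and that orientation is reported as '+'
           or '-' when it is known.

```python
# Column names follow the order documented above; the Python names are illustrative.
COLUMNS = [
    "tax_id", "GeneID", "status",
    "RNA_nucleotide_accession.version", "RNA_nucleotide_gi",
    "protein_accession.version", "protein_gi",
    "genomic_nucleotide_accession.version", "genomic_nucleotide_gi",
    "start_position", "end_position", "orientation", "assembly",
    "mature_peptide_accession.version", "mature_peptide_gi", "Symbol",
]


def parse_line(line):
    """Split one tab-delimited data line into a dict; '-' becomes None."""
    fields = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(COLUMNS, fields):
        # Assumption: in the orientation column '-' is the minus strand, not a null.
        if name != "orientation" and value == "-":
            value = None
        record[name] = value
    return record


def to_gff3_coords(record):
    """Return (start, end) shifted to 1-based, or None if no position is reported."""
    if record["start_position"] is None or record["end_position"] is None:
        return None
    return int(record["start_position"]) + 1, int(record["end_position"]) + 1


# Made-up example line (not real data).
sample = "\t".join([
    "9606", "1", "REVIEWED", "NM_000000.1", "-", "NP_000000.1", "-",
    "NC_000000.1", "-", "99", "199", "+", "ExampleAssembly", "-", "-", "GENE1",
])
record = parse_line(sample)
coords = to_gff3_coords(record)
```

           Here coords holds the same interval expressed in 1-based
           coordinates, suitable for comparison with a gff3 file.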


===========================================================================
gene2ensembl                                    recalculated daily
---------------------------------------------------------------------------
           This file reports matches between NCBI and Ensembl annotation
           based on comparison of rna and protein features.

           Matches are collected as follows.
           For a protein to be identified as a match between RefSeq and
           Ensembl, there must be at least 80% overlap between the two.
           Furthermore, splice site matches must meet certain conditions:
           either 60% or more of the splice sites must match, or there may 
           be at most one splice site mismatch.

           For rna features, the best match between RefSeq and Ensembl is 
           selected based on splice site and overlap comparisons. For coding 
           transcripts, there is no minimum threshold for reporting other than 
           the protein comparison criteria above. For non-coding transcripts, 
           the splice site criteria are the same as for protein matching, but 
           the overlap threshold is reduced to 50%.

           Furthermore, both the rna and the protein features must meet these 
           minimum matching criteria to be considered a good match.  In 
           addition, only the best matches will be reported in this file.  
           Other matches that satisfied the matching criteria but were
           not the best matches will not be reported in this file.

           A summary report of species that have been compared is contained
           in another FTP file, README_ensembl (see below).

           Ensembl gene identifiers are also reported in the dbXrefs column
           in the gene_info FTP file.  Due to differences in how these files
           are processed, the Ensembl gene identifiers in these two files
           may not be in complete concordance.

           More notes about this file:

           tab-delimited
           one line per match between RefSeq and Ensembl rna/protein
           Column header line is the first line in the file.


---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the unique identifier for a gene

Ensembl_gene_identifier:
           the matching Ensembl identifier for the gene

RNA nucleotide accession.version:
           the identifier for the matching RefSeq rna
           will be null (-) if only the protein matched

Ensembl_rna_identifier:
           the identifier for the matching Ensembl rna
           may include a version number
           will be null (-) if only the protein matched

protein accession.version:
           the identifier for the matching RefSeq protein
           will be null (-) if only the mRNA matched

Ensembl_protein_identifier:
           the identifier for the matching Ensembl protein
           may include a version number
           will be null (-) if only the mRNA matched
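
           EXAMPLE: The matching thresholds described for this file can be
           written as simple predicates. This Python sketch is only an
           illustration of the stated rules, not the comparison code used
           by NCBI. The function names are hypothetical, and how overlap
           and splice sites are measured is not specified here, so both
           are taken as precomputed inputs. Treating a feature with no
           splice sites as passing the splice site test is an assumption.

```python
def splice_sites_ok(matched_sites, total_sites):
    """True if 60% or more of the splice sites match, or at most one mismatches."""
    if total_sites == 0:
        return True  # assumption: nothing to mismatch when there are no splice sites
    enough_matching = matched_sites * 100 >= 60 * total_sites
    at_most_one_mismatch = (total_sites - matched_sites) <= 1
    return enough_matching or at_most_one_mismatch


def protein_match(overlap_fraction, matched_sites, total_sites):
    """Protein criteria: at least 80% overlap plus the splice site conditions."""
    return overlap_fraction >= 0.80 and splice_sites_ok(matched_sites, total_sites)


def noncoding_rna_match(overlap_fraction, matched_sites, total_sites):
    """Non-coding transcript criteria: same splice site rule, 50% overlap threshold."""
    return overlap_fraction >= 0.50 and splice_sites_ok(matched_sites, total_sites)
```

           For example, protein_match(0.9, 1, 2) passes because only one
           splice site mismatches, while protein_match(0.9, 1, 4) fails
           both splice site conditions.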


===========================================================================
gene2vega                                    archived
---------------------------------------------------------------------------
           This file is no longer being updated.  The last update was on
           December 3, 2018.

           This file reports matches between NCBI and Vega annotation.

           Matches are derived from the comparisons between NCBI and
           Ensembl annotation (which are reported in the gene2ensembl FTP
           file).  That is, where there is a match between NCBI and 
           Ensembl annotation, and there is a correspondence between that 
           Ensembl annotation and Vega annotation, then the inferred 
           relationship between the NCBI and Vega annotations is reported 
           here.

           More notes about this file:

           tab-delimited
           one line per match between RefSeq and Vega rna/protein
           Column header line is the first line in the file.


---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the unique identifier for a gene

Vega_gene_identifier:
           the matching Vega identifier for the gene

RNA nucleotide accession.version:
           the identifier for the matching RefSeq rna
           will be null (-) if only the protein matched

Vega_rna_identifier:
           the identifier for the matching Vega rna
           may include a version number
           will be null (-) if only the protein matched

protein accession.version:
           the identifier for the matching RefSeq protein
           will be null (-) if only the mRNA matched

Vega_protein_identifier:
           the identifier for the matching Vega protein
           may include a version number
           will be null (-) if only the mRNA matched


===========================================================================
README_ensembl                                         recalculated weekly
---------------------------------------------------------------------------
           This file reports the overall status of comparison between 
           NCBI and Ensembl annotation.  The detailed report is contained
           in the gene2ensembl FTP file (see above).

           More notes about this file:

           tab-delimited
           one line per species
           Column header line is the first line in the file.

---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

ncbi_release:
           the NCBI release number

ncbi_assembly:
           the NCBI assembly name

ensembl_release:
           the Ensembl release number, or "Rapid" followed by the genebuild
           for a comparison with an Ensembl Rapid release

ensembl_assembly:
           the Ensembl assembly name

date_compared:
           the date when the comparison was performed, in YYYYMMDD format
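
           EXAMPLE: Dates in YYYYMMDD format, such as date_compared, can
           be converted with the Python standard library. The function
           name parse_yyyymmdd and the sample value are hypothetical.

```python
from datetime import datetime


def parse_yyyymmdd(value):
    """Convert a YYYYMMDD string to a datetime.date."""
    return datetime.strptime(value, "%Y%m%d").date()


compared = parse_yyyymmdd("20250102")  # made-up value, not taken from the file
```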


===========================================================================
gene2go                                         recalculated daily
---------------------------------------------------------------------------
           This file reports the GO terms that have been associated
           with Genes in Entrez Gene.

           Gene ontology annotations are imported from external sources
           by processing the gene_association files on the GO ftp site: 
           http://www.geneontology.org/GO.current.annotations.shtml
           and comparing the DB_Object_ID to annotation in Gene,
           as also reported in gene_info.gz. This process is limited to the 
           species listed in the go_process.xml file.

           For all other species, gene ontology terms are computed at the time
           of annotation by running InterProScan 
           (https://interproscan-docs.readthedocs.io/en/latest/), including analyses 
           against PANTHER trees on all annotated proteins and collating the 
           results by GeneID. These data are also provided in the GAF (GO 
           Annotation File) format in Genomes FTP. For example, see: 
           ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/949/774/975/GCF_949774975.1_mLagAlb1.1/

           Multiple gene_association files may be used for any genome.
           If so, duplicate information is not reported; but unique
           contributions of GO terms, evidence codes, and citations are.
 
           The file that is used to establish the rules for 
           the files and fields that are used for each taxon 
           is documented in go_process.xml


           MODIFIED: May 9, 2006 to include the category of the GO term.

           MODIFIED: May 21, 2007 to use '-' for empty fields.

           Data elements which are not applicable are shown as '-'.

           tab-delimited
           One line per GeneID/GO term/representative GO evidence code.
           Column header line is the first line in the file.
---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the unique identifier for a gene

GO ID:
           the GO ID, formatted as GO:0000000

Evidence:
           the evidence code in the gene_association file

Qualifier: 
           a qualifier for the relationship between the gene
           and the GO term

GO term:
           the term indicated by the GO ID

PubMed:
           pipe-delimited set of PubMed uids reported as evidence
           for the association

Category:
           the GO category (Function, Process, or Component)
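
           EXAMPLE: A minimal Python sketch of reading one data line of
           this file, assuming the eight columns appear in the order
           listed above. The PubMed column is split on the pipe character
           and '-' is treated as empty. The function name and the sample
           line are hypothetical.

```python
def parse_gene2go_line(line):
    """Split one tab-delimited gene2go data line into a dict."""
    (tax_id, gene_id, go_id, evidence,
     qualifier, go_term, pubmed, category) = line.rstrip("\n").split("\t")
    return {
        "tax_id": int(tax_id),
        "GeneID": int(gene_id),
        "GO_ID": go_id,
        "Evidence": evidence,
        # '-' marks a data element that is not applicable.
        "Qualifier": None if qualifier == "-" else qualifier,
        "GO_term": go_term,
        # Pipe-delimited set of PubMed uids; empty list when '-'.
        "PubMed": [] if pubmed == "-" else [int(p) for p in pubmed.split("|")],
        "Category": category,
    }


# Made-up example line (not real data).
sample = "9606\t1\tGO:0000000\tIEA\t-\texample term\t111|222\tFunction\n"
row = parse_gene2go_line(sample)
```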


===========================================================================
gene2pubmed                                     recalculated daily
---------------------------------------------------------------------------
           This file can be considered as the logical equivalent of
           what is reported as Gene/PubMed Links visible in Gene's 
           and PubMed's Links menus. Although gene2pubmed is re-calculated daily,
           some of the source documents (GeneRIFs, for example) are not
           updated that frequently, so timing depends on the update
           frequency of the data source.

           Documentation about how these links are maintained
           is provided here:

           https://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html#gene

           tab-delimited
           one line per set of tax_id/GeneID/PMID
           Column header line is the first line in the file.
---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the unique identifier for a gene


PubMed ID (PMID):
           the unique identifier in PubMed for a citation
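
           EXAMPLE: Because this file has one line per tax_id/GeneID/PMID,
           the citations for a gene are spread over several lines. The
           Python sketch below collects them per GeneID. The function
           name, the header text and the sample lines are hypothetical;
           the only assumption about the file is that its first line is
           the column header.

```python
import io
from collections import defaultdict


def pmids_by_gene(handle):
    """Map each GeneID to the set of PMIDs linked to it."""
    links = defaultdict(set)
    next(handle, None)  # skip the column header line
    for line in handle:
        tax_id, gene_id, pmid = line.rstrip("\n").split("\t")
        links[int(gene_id)].add(int(pmid))
    return dict(links)


# Made-up content standing in for an opened gene2pubmed file.
sample = io.StringIO(
    "tax_id\tGeneID\tPubMed_ID\n"
    "9606\t1\t111\n"
    "9606\t1\t222\n"
    "9606\t2\t111\n"
)
links = pmids_by_gene(sample)
```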


===========================================================================
gene2refseq                                     recalculated daily
---------------------------------------------------------------------------
           tab-delimited
           one line per genomic/RNA/protein set of RefSeqs
           Column header line is the first line in the file.

           Because this file is updated daily, the RefSeq subset does not 
           reflect any RefSeq release. Versions of RefSeq RNA and protein 
           records may be more recent than those included in an annotation
           release (build) or those in the current RefSeq release.


           NOTE: Because this file is comprehensive, it may include
           some RefSeq accessions that are not current, because they are
           part of the annotation of the current genomic assembly. In other 
           words, the annotation of a genome is not continuous, but depends
           on a data freeze. Sub-genomic RefSeqs, however, are updated 
           continuously. Thus some RefSeqs may have been replaced or 
           suppressed after a data freeze associated with a genomic annotation. 
           Until the release of a new genomic annotation, all
           RefSeqs included in the current annotation are reported
           in this file.


           NOTE: This file is the RefSeq subset of gene2accession.
---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the unique identifier for a gene

status:
           status of the RefSeq
           values are: INFERRED, MODEL, NA, PREDICTED, PROVISIONAL,
           REVIEWED, SUPPRESSED, VALIDATED

RNA nucleotide accession.version:
           may be null (-) for some genomes

RNA nucleotide gi:
           the gi for an RNA nucleotide accession, '-' if not applicable

protein accession.version:
           will be null (-) for RNA-coding genes

protein gi:
           the gi for a protein accession, '-' if not applicable

genomic nucleotide accession.version:
           may be null (-) if a RefSeq was provided after
           the genomic accession was submitted

genomic nucleotide gi:
           the gi for a genomic nucleotide accession, '-' if not applicable

start position on the genomic accession:
            position of the gene feature on the genomic accession,
            '-' if not applicable
            position 0-based

            NOTE: this file does not report the position of each exon.
            For positions on RefSeq contigs and chromosomes, 
            use the gff3 file in the appropriate annotation directory.
            For example, for the human genome,
            ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/

            WARNING: Positions in gff3 files are 1-based, not 0-based

            NOTE: if genes are merged after an annotation is released, there 
            may be more than one location reported on a genomic sequence 
            per GeneID, each resulting from the annotation before the merge.

end position on the genomic accession:
            position of the gene feature on the genomic accession,
            '-' if not applicable
            position 0-based

            NOTE: this file does not report the position of each exon.
            For positions on RefSeq contigs and chromosomes, 
            use the gff3 file in the appropriate annotation directory.
            For example, for the human genome,
            ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/annotation_releases/

            WARNING: Positions in gff3 files are 1-based, not 0-based

            NOTE: if genes are merged after an annotation is released, there 
            may be more than one location reported on a genomic sequence 
            per GeneID, each resulting from the annotation before the merge.

orientation:
            orientation of the gene feature on the genomic accession,
            '?' if not applicable

assembly:
            the name of the assembly
            '-' if not applicable

mature peptide accession.version:
           will be null (-) if absent

mature peptide gi:
           the gi for a mature peptide accession, '-' if not applicable

Symbol:
           the default symbol for the gene

===========================================================================
gene2sts                                        archived
---------------------------------------------------------------------------
           This file is no longer being updated.  The last update was on
           July 26, 2017.

           tab-delimited
           one line per GeneID, UniSTS ID pair
           Column header line is the first line in the file.
---------------------------------------------------------------------------
GeneID:
           the unique identifier for a gene

UniSTS ID:
           the unique identifier given to a primer pair by UniSTS


===========================================================================
gene2unigene                                    archived
---------------------------------------------------------------------------
           This file is no longer being updated.  The last update was on
           June 19, 2019.

           This file can be considered as the logical equivalent of
           what is reported as Gene/UniGene Links visible in Gene's 
           and UniGene's Links menus.

           Documentation about how these links are maintained
           is provided here:

           https://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
       
           tab-delimited
           Column header line is the first line in the file.
           Note: tax_id is not provided in a separate column.  The prefix 
                 of the UniGene cluster can be used to determine 
                 the species 
---------------------------------------------------------------------------

GeneID:
           the unique identifier for a gene


UniGene cluster:
           the UniGene cluster identifier; its prefix can be used to
           determine the species

===========================================================================
gene_group                                      recalculated daily
---------------------------------------------------------------------------
           report of genes and their relationships to other genes

           tab-delimited
           one line per pair of GeneIDs
           Column header line is the first line in the file.
  
           NOTE: This file is not comprehensive, and contains 
           a subset of information summarizing gene-gene relationships.  

           Relationships are reported symmetrically, where appropriate, and 
           currently include:

               Ortholog*
               Potential readthrough sibling
               Readthrough child
               Readthrough parent
               Readthrough sibling
               Region member
               Region parent
               Related functional gene
               Related pseudogene

           *Note that Ortholog records appear in the gene_orthologs
           file, and are excluded from the gene_group file.
           Note also that the gene_group and gene_orthologs files use the 
           same column format.

---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the current unique identifier for a gene

relationship:
           the type of relationship between the two genes, 
           e.g. GeneID has a 'relationship' to Other GeneID

Other tax_id:
           the related gene's tax_id

Other GeneID:
           the related gene's GeneID


===========================================================================
gene_orthologs                                  recalculated daily
---------------------------------------------------------------------------
           report of orthologous genes

           tab-delimited
           one line per pair of GeneIDs
           Column header line is the first line in the file.

           The column format is the same as the gene_group file.  In this 
           file, the relationship column value is always "Ortholog".

           Ortholog gene groups are calculated by NCBI's Eukaryotic Genome 
           Annotation Pipeline for the NCBI Gene dataset using a combination 
           of protein sequence similarity and local synteny information.  
           Orthologous gene relationships may additionally be assigned after 
           manual review by a RefSeq genome curator.  Orthology is determined 
           between a genome being annotated and a reference genome, such as the 
           cat and human genomes, respectively.  For instance, if a cat gene 
           is identified as an ortholog of a human gene, it is included in the 
           set of genes from various organisms recognized as orthologs to that 
           human gene.  Consequently, the "ortholog set" for the human gene 
           may contain genes from multiple organisms that can be regarded as 
           one-to-one orthologs of each other.  For a description of the NCBI 
           ortholog calculation procedure, see 
           https://www.ncbi.nlm.nih.gov/kis/info/how-are-orthologs-calculated/ 

           This file does not report relationships symmetrically.  Instead, a 
           primary organism/gene is identified for each ortholog, and is 
           represented in the first two columns.  All other orthologous 
           organisms/genes are represented in the last two columns.  Therefore, 
           each record in the file represents a pair of orthologous genes. If 
           there are N orthologous genes in a group, then there will be N-1 
           records (pairs) represented in the gene_orthologs file.  For 
           example, a human gene ortholog set that includes mouse, rat, cat 
           and dog genes will be represented by the following 4 rows in the 
           gene_orthologs file: human::mouse, human::rat, human::cat, and 
           human::dog. 
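
           EXAMPLE: Since each ortholog set of N genes is stored as N-1
           rows that share the same primary gene, the full set can be
           rebuilt by grouping rows on the first two columns. The Python
           sketch below does this for the human/mouse/rat/cat/dog case
           described above. The function name is hypothetical and the
           GeneIDs are made up; the tax_ids are those of NCBI Taxonomy
           for the five species.

```python
from collections import defaultdict


def ortholog_sets(rows):
    """Group (tax_id, GeneID, relationship, other_tax_id, other_GeneID) rows
    by primary gene, returning each set including the primary gene itself."""
    groups = defaultdict(set)
    for tax_id, gene_id, relationship, other_tax_id, other_gene_id in rows:
        primary = (tax_id, gene_id)
        groups[primary].add(primary)
        groups[primary].add((other_tax_id, other_gene_id))
    return dict(groups)


# Made-up GeneIDs; one human primary gene paired with four other species.
rows = [
    (9606, 100, "Ortholog", 10090, 200),  # human::mouse
    (9606, 100, "Ortholog", 10116, 300),  # human::rat
    (9606, 100, "Ortholog", 9685, 400),   # human::cat
    (9606, 100, "Ortholog", 9615, 500),   # human::dog
]
groups = ortholog_sets(rows)
```

           The four rows yield a single set of five genes, one per
           species.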


===========================================================================
gene_history                                    recalculated daily
---------------------------------------------------------------------------
           comprehensive information about GeneIDs that are no longer current

           tab-delimited
           one line per GeneID
           Column header line is the first line in the file.
---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the current unique identifier for a gene

Discontinued GeneID:
           the GeneID that is no longer current

Discontinued Symbol:
           the symbol that was assigned to the discontinued GeneID,
           if the discontinued record was not replaced with another

Discontinue Date:
           the date the gene record was discontinued or replaced, in 
           YYYYMMDD format
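
           EXAMPLE: A common use of this file is to map a discontinued
           GeneID to its current replacement. The Python sketch below
           assumes the five columns appear in the order listed above and
           that the GeneID column holds '-' when a discontinued record
           was not replaced; that null convention is an assumption, as it
           is not stated for this file. The function names and sample
           lines are hypothetical.

```python
def build_replacements(lines):
    """Map each discontinued GeneID to its current GeneID (None if not replaced)."""
    replacements = {}
    for line in lines:
        tax_id, gene_id, old_id, old_symbol, date = line.rstrip("\n").split("\t")
        # Assumption: '-' in the GeneID column means there is no replacement.
        replacements[old_id] = None if gene_id == "-" else gene_id
    return replacements


def current_gene_id(gene_id, replacements):
    """Return the current GeneID, or None if it was discontinued without replacement.

    GeneIDs that are not listed as discontinued are returned unchanged.
    """
    return replacements.get(gene_id, gene_id)


# Made-up data lines (header line already removed).
lines = [
    "9606\t10\t5\tOLD1\t20200101\n",
    "9606\t-\t6\tOLD2\t20210101\n",
]
replacements = build_replacements(lines)
```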


===========================================================================
gene_info                                       recalculated daily
---------------------------------------------------------------------------
           tab-delimited
           one line per GeneID
           Column header line is the first line in the file.
           Note: subsets of gene_info are available in the DATA/GENE_INFO
                 directory (described later)
           This file is identical in content to GENE_INFO/All_Data.gene_info.gz
                 even though their file sizes and timestamps may differ
                 slightly
---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the unique identifier for a gene
           ASN1:  geneid

Symbol:
           the default symbol for the gene
           ASN1:  gene->locus

LocusTag:
           the LocusTag value
           ASN1:  gene->locus-tag

Synonyms:
           bar-delimited set of unofficial symbols for the gene

dbXrefs:
           bar-delimited set of identifiers in other databases
           for this gene.  The unit of the set is database:value.
           Note that HGNC and MGI include 'HGNC' and 'MGI', respectively,
           in the value part of their identifier.  Consequently,
           dbXrefs for these databases will appear like:
             HGNC:HGNC:1100
             This would be interpreted as database='HGNC', value='HGNC:1100'
           Example for MGI:
             MGI:MGI:104537
             This would be interpreted as database='MGI', value='MGI:104537'

chromosome:
           the chromosome on which this gene is placed.
           for mitochondrial genomes, the value 'MT' is used.

map location:
           the map location for this gene

description:
           a descriptive name for this gene

type of gene:
           the type assigned to the gene according to the list of options
           provided in https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/entrezgene.asn


Symbol from nomenclature authority:
            when not '-', indicates that this symbol is from a
            nomenclature authority

Full name from nomenclature authority:
            when not '-', indicates that this full name is from a
            nomenclature authority

Nomenclature status:
            when not '-', indicates the status of the name from the 
            nomenclature authority (O for official, I for interim)

Other designations:
            pipe-delimited set of some alternate descriptions that
            have been assigned to a GeneID
            '-' indicates none is being reported.

Modification date:
            the last date a gene record was updated, in YYYYMMDD format

Feature type:
            pipe-delimited set of annotated features and their classes or 
            controlled vocabularies, displayed as feature_type:feature_class 
            or feature_type:controlled_vocabulary, when appropriate; derived 
            from select feature annotations on RefSeq(s) associated with the 
            GeneID
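
           EXAMPLE: As noted for the dbXrefs column above, the value part
           of an identifier may itself contain a colon (HGNC:HGNC:1100),
           so each database:value unit should be split on the first colon
           only. A minimal Python sketch follows. The function name is
           hypothetical, and treating '-' as an empty set is an
           assumption.

```python
def parse_dbxrefs(field):
    """Split a bar-delimited dbXrefs field into (database, value) pairs."""
    if field == "-":
        return []  # assumption: '-' means no cross-references are reported
    # maxsplit=1 keeps any further colons inside the value part.
    return [tuple(item.split(":", 1)) for item in field.split("|")]


hgnc = parse_dbxrefs("HGNC:HGNC:1100")
mgi = parse_dbxrefs("MGI:MGI:104537")
```

           With these inputs, hgnc is [("HGNC", "HGNC:1100")] and mgi is
           [("MGI", "MGI:104537")], matching the interpretation given in
           the column description.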

===========================================================================
gene_neighbors                                     recalculated daily
---------------------------------------------------------------------------
           This file reports neighboring genes for all genes placed on a 
           given genomic sequence.

           More notes about this file:

           tab-delimited
           one line per GeneID and genomic placement
           Column header line is the first line in the file.
           genomic sequences in scope for reporting include all top-level 
               sequences and curated genomic (NG_ accessions)

---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the unique identifier for a gene

genomic accession.version:

genomic gi:
           the gi for a genomic nucleotide accession

start position:
           start position of the gene feature on the genomic accession
           position value is 0-based

end position:
           end position of the gene feature on the genomic accession
           position value is 0-based

orientation:
           orientation of the gene feature on the genomic accession

chromosome:
           the chromosome on which this gene is placed.
           for mitochondrial genomes, the value 'MT' is used.
           '-' if not applicable

GeneIDs on left:
           bar-delimited set of GeneIDs for the nearest two non-overlapping
           non-biological-region genes on the left, or '-' if there are none
           additional GeneIDs may be included if the neighboring genes
           overlap each other

distance to left:
           distance to the nearest gene on the left, or '-' if there is none

GeneIDs on right:
           bar-delimited set of GeneIDs for the nearest two non-overlapping
           non-biological-region genes on the right, or '-' if there are none
           additional GeneIDs may be included if the neighboring genes
           overlap each other

distance to right:
           distance to the nearest gene on the right, or '-' if there is none

overlapping GeneIDs:
           bar-delimited set of GeneIDs for all overlapping genes, 
               or '-' if there are none

assembly:
           the name of the assembly
           '-' if not applicable
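
           As an illustration of the layout above, the following Python
           sketch parses one gene_neighbors line. It assumes that the columns
           appear in the order documented here, that "bar" means the '|'
           character, and that the record shown (including the assembly name)
           is invented and not real data.

```python
def split_ids(field):
    """Return the GeneIDs in a bar-delimited field; '-' means none."""
    return [] if field == "-" else field.split("|")


def parse_neighbors_line(line):
    """Parse one data line, assuming the documented column order."""
    cols = line.rstrip("\n").split("\t")
    return {
        "tax_id": cols[0],
        "gene_id": cols[1],
        "accession": cols[2],
        "start": int(cols[4]),   # 0-based, per the description above
        "end": int(cols[5]),     # 0-based, per the description above
        "left": split_ids(cols[8]),
        "right": split_ids(cols[10]),
        "overlapping": split_ids(cols[12]),
    }


# Invented record for illustration only.
sample = "\t".join([
    "9606", "1001", "NC_000001.11", "12345", "999", "1999", "+", "1",
    "1000|1002", "500", "-", "-", "-", "ExampleAssembly",
])
rec = parse_neighbors_line(sample)

# Convert the 0-based start to a 1-based coordinate if a tool requires it.
one_based_start = rec["start"] + 1
print(rec["left"], rec["right"], one_based_start)
```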


===========================================================================
gene_refseq_uniprotkb_collab            recalculated every month
---------------------------------------------------------------------------
           report of the relationship between NCBI Reference Sequence
           protein accessions and UniProtKB protein accessions

           tab-delimited
           one line per pair
           Column header line is the first line in the file.

           The NCBI RefSeq::UniProt accession pairs represented in this file 
           are sourced, as indicated by the term in "method" column, in one of 
           the following ways:
           1. "uniprot" -- NCBI RefSeq::UniProt matches imported from UniProt
           2. "identical" -- NCBI RefSeq::UniProt matches where the protein 
              sequence and assigned organism of the two accessions are identical 
              to each other
           3. "similar" -- NCBI RefSeq::UniProt matches where both of the 
              proteins have the same assigned organism and share >90% sequence 
              identity over >80% coverage


NCBI protein accession:
           the protein accession of the RefSeq

UniProtKB protein accession:
           the corresponding UniProtKB protein accession

NCBI_tax_id:
           the unique identifier provided by NCBI Taxonomy for the species or 
           strain/isolate, for the NCBI protein accession

UniProtKB_tax_id:
           the unique identifier provided by NCBI Taxonomy for the species or 
           strain/isolate, for the UniProtKB protein accession
           
method:
           method by which the NCBI RefSeq::UniProt accession pair was 
           determined; see above for explanation
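
           A minimal Python sketch of selecting pairs by the "method" column
           follows. It assumes that the columns appear in the order
           documented above. The header text and the accessions in the
           sample are placeholders invented for illustration.

```python
import csv
import io

# Invented rows; the header text and accessions are placeholders.
sample = (
    "#NCBI_protein_accession\tUniProtKB_protein_accession\t"
    "NCBI_tax_id\tUniProtKB_tax_id\tmethod\n"
    "XP_000001.1\tA0A000AAA1\t9606\t9606\tidentical\n"
    "XP_000002.1\tA0A000AAA2\t9606\t9606\tsimilar\n"
    "NP_000003.1\tP00003\t10090\t10090\tuniprot\n"
)


def pairs_by_method(handle, wanted):
    """Return (RefSeq, UniProtKB) pairs whose method is in `wanted`."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the column header line
    return [(row[0], row[1]) for row in reader if row[4] in wanted]


# Keep imported and identical matches; drop the looser "similar" matches.
exact = pairs_by_method(io.StringIO(sample), {"uniprot", "identical"})
print(exact)
```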


===========================================================================
gene_summary                                    recalculated daily
---------------------------------------------------------------------------
           extract of gene summary texts for live genes that have them

           tab-delimited
           one line per GeneID
           Column header line is the first line in the file.
---------------------------------------------------------------------------

tax_id:
           the unique identifier provided by NCBI Taxonomy
           for the species or strain/isolate

GeneID:
           the current unique identifier for a gene

Source:
           the name of the source that provided the summary, for selected 
           sources

Summary:
           the gene summary text
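
           Because the file has one line per GeneID, it can be loaded into a
           lookup table. The Python sketch below assumes the documented
           column order. The header text, source name, and summaries in the
           sample are invented for illustration.

```python
import csv
import io

# Invented sample; the header text and values are placeholders.
sample = (
    "#tax_id\tGeneID\tSource\tSummary\n"
    "9606\t1001\tExampleSource\tInvented summary text for gene 1001.\n"
    "10090\t2002\tExampleSource\tInvented summary text for gene 2002.\n"
)


def summaries_for_taxon(handle, tax_id):
    """Map GeneID to summary text for one tax_id."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the column header line
    return {row[1]: row[3] for row in reader if row[0] == tax_id}


human = summaries_for_taxon(io.StringIO(sample), "9606")
print(human)
```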


===========================================================================
go_process.xml
---------------------------------------------------------------------------
           Rules for mapping information in the gene_info file
           in this directory to the enumerated authority files

===========================================================================
mim2gene_medgen                                 recalculated daily
---------------------------------------------------------------------------
           report of the relationship between MIM numbers (OMIM), 
           GeneIDs, and Records in MedGen

           tab-delimited
           one line per MIM number
           Column header line is the first line in the file.
           Tax_id is not included because this file is relevant only for
             human, tax_id 9606.
           see also: http://omim.org/help/faq
           In June, 2015, this file was modified to add a Comment column,
             to qualify the relationship between a gene and a disorder as
             reported by OMIM. 

---------------------------------------------------------------------------
MIM number:
           a MIM number associated with a GeneID
GeneID:
           the current unique identifier for a gene
           the lack of a GeneID, for whatever reason, is represented as a '-'

type:
           type of relationship between the MIM number and the 
           GeneID
              current values are 
                'gene'      the MIM number associated with a Gene, 
                            or a GeneID that is assigned to a record
                            where the molecular basis of the disease
                            is not known
                'phenotype' the MIM number associated with a disease
                            that is associated with a gene

           If NCBI has no record of this MIM number in its databases yet, there 
           is a '-' provided in the type column

source:    
           This value is provided only when there is a report of a relationship
           between a MIM number that is a phenotype, and a GeneID.
           The current expected values are GeneMap (from OMIM), GeneReviews, 
           and NCBI.

MedGenCUI: 
           The accession assigned by MedGen to this phenotype.  If the accession starts
           with a C followed by digits, the identifier is a concept ID (CUI) from UMLS.
           https://www.nlm.nih.gov/research/umls/
 
           If it starts with a CN, no CUI in UMLS was identified, and NCBI created
           a placeholder. 

Comment:   
           optional value reporting the qualifiers OMIM provides when reporting
           a gene/phenotype relationship
           The values are based on the explanation of the
             symbols provided by OMIM: http://omim.org/help/faq
             nondisease: Brackets, "[ ]", indicate "nondiseases,"
                          mainly genetic variations that lead to 
                          apparently abnormal laboratory test values 
                          (e.g., dysalbuminemic euthyroidal hyperthyroxinemia).
             susceptibility: {} indicate mutations that contribute to 
                         susceptibility to multifactorial disorders
             QTL 1: {} and qtl
             QTL 2: [] and qtl
             somatic: somatic in the disease name
             question: A question mark, "?", before the disease name 
                         indicates an unconfirmed or possibly spurious mapping.
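
           The MedGenCUI rules described above can be expressed as a small
           classifier. This Python sketch assumes that a missing value is
           shown as '-' and that the columns appear in the documented order.
           The record shown is invented for illustration.

```python
import re


def classify_medgen_id(value):
    """Classify a MedGenCUI value following the rules described above."""
    if value == "-":  # assumption: '-' marks a missing value
        return "none"
    if value.startswith("CN"):
        return "ncbi_placeholder"  # no CUI in UMLS was identified
    if re.fullmatch(r"C\d+", value):
        return "umls_cui"
    return "unknown"


# Invented record: MIM number, GeneID, type, source, MedGenCUI, Comment
sample = "100001\t1001\tphenotype\tGeneMap\tC0000001\tsusceptibility"
mim, gene_id, rel_type, source, cui, comment = sample.split("\t")
kind = classify_medgen_id(cui)
print(mim, gene_id, kind)
```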


===========================================================================
stopwords_gene
---------------------------------------------------------------------------
           A list of stopwords that are automatically excluded from 
           searches in Gene.

see also:
https://www.ncbi.nlm.nih.gov/books/NBK3841/#EntrezGene.Words_Excluded_From_Queries


===========================================================================
===========================================================================
II.    Files in the DATA/ASN_BINARY directory
---------------------------------------------------------------------------
This directory and all its subdirectories contain complete extractions from 
Entrez Gene in binary ASN.1 format, as Entrezgene sets.

These files are in binary ASN.1 format, and can readily be converted to XML 
via the tool gene2xml documented below.
===========================================================================
===========================================================================
All_Data.ags.gz                    all records
Organelles.ags.gz                  Organelles only 
Plasmids.ags.gz                    Plasmids only 

Archaea_Bacteria                   directory for Genes from Archaea and
                                      Bacteria

    All_Archaea_Bacteria.ags.gz    all records from Archaea and Bacteria
    Archaea.ags.gz                 Archaea only
    Bacteria.ags.gz                Bacteria only
    Escherichia_coli_str._K-12_substr._MG1655.ags.gz
                                   Escherichia coli K-12 MG1655 only 
    Pseudomonas_aeruginosa_PAO1.ags.gz 
                                   Pseudomonas aeruginosa PAO1 only 

Fungi                              directory for Genes from Fungi

    All_Fungi.ags.gz               all records from Fungi, including
                                       organelles

    Ascomycota.ags.gz              Ascomycota only
    Microsporidia.ags.gz           Microsporidia only
    Penicillium_chrysogenum_Wisconsin_54-1255.ags.gz
                                   Penicillium chrysogenum Wisconsin 54-1255 only
    Saccharomyces_cerevisiae.ags.gz
                                   Saccharomyces cerevisiae only

Invertebrates                      directory for genes from invertebrates

    All_Invertebrates.ags.gz       all records from invertebrates
    Anopheles_gambiae.ags.gz       Anopheles gambiae only
    Caenorhabditis_elegans.ags.gz  Caenorhabditis elegans only
    Drosophila_melanogaster.ags.gz Drosophila melanogaster only

Mammalia                           directory for genes from mammals

    All_Mammalia.ags.gz            all records from mammals, including
                                     organelles
    Bos_taurus.ags.gz              Bos taurus only
    Canis_familiaris.ags.gz        Canis familiaris only
    Homo_sapiens.ags.gz            Homo sapiens only
    Mus_musculus.ags.gz            Mus musculus only
    Pan_troglodytes.ags.gz         Pan troglodytes only
    Rattus_norvegicus.ags.gz       Rattus norvegicus only
    Sus_scrofa.ags.gz              Sus scrofa only

Non-mammalian_vertebrates          directory for non-mammalian vertebrates

    All_Non-mammalian_vertebrates.ags.gz
                                   all records from non-mammalian vertebrates
    Danio_rerio.ags.gz             Danio rerio only
    Gallus_gallus.ags.gz           Gallus gallus only
    Xenopus_laevis.ags.gz          Xenopus laevis only
    Xenopus_tropicalis.ags.gz      Xenopus tropicalis only

Plants                             directory for plants

    All_Plants.ags.gz              all records from plants
    Arabidopsis_thaliana.ags.gz    Arabidopsis thaliana only
    Chlamydomonas_reinhardtii.ags.gz
                                   Chlamydomonas reinhardtii only
    Oryza_sativa.ags.gz            Oryza sativa only
    Zea_mays.ags.gz                Zea mays only

Protozoa                           directory for protozoa

    All_protozoa.ags.gz            all records from protozoa
    Plasmodium_falciparum.ags.gz   Plasmodium falciparum only

Viruses                            directory for viruses

    All_Viruses.ags.gz             all records from viruses
    Retroviridae.ags.gz            Retroviridae only


===========================================================================
===========================================================================
III.    Files in the DATA/GENE_INFO directory
---------------------------------------------------------------------------
This directory and all its subdirectories contain extractions from
Entrez Gene in the same format as the gene_info file (described earlier).
Each file contains a subset of data for the species or taxonomic group 
indicated by the file name.

The content and directory structure mirror the content and structure of the
ASN_BINARY directory. The file names in this directory are qualified to
distinguish them from the binary ASN.1 files. For example, the gene_info
subset file for human will be found in:

    DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz

The gene_info.gz file will continue to be updated in its original location
in the DATA directory.

The DATA/GENE_INFO/All_Data.gene_info.gz file is identical in content to 
DATA/gene_info.gz even though their file sizes and timestamps may differ
slightly.
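
Because the two directory trees mirror each other, the path of a subset file
in one tree can be derived from the same group and name used in the other.
The Python sketch below is illustrative only; the function name is
hypothetical, and the group and species names must match entries that
actually exist on the ftp site.

```python
import posixpath


def subset_paths(group, name):
    """Build the mirrored GENE_INFO and ASN_BINARY paths for one subset."""
    return {
        "gene_info": posixpath.join(
            "DATA", "GENE_INFO", group, name + ".gene_info.gz"),
        "asn_binary": posixpath.join(
            "DATA", "ASN_BINARY", group, name + ".ags.gz"),
    }


paths = subset_paths("Mammalia", "Homo_sapiens")
print(paths["gene_info"])
print(paths["asn_binary"])
```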

===========================================================================
===========================================================================
IV.    Files in the DATA/expression directory
---------------------------------------------------------------------------
This directory and all its subdirectories contain reports of normalized RNA
expression levels computed from RNA-seq data for human, mouse, and rat 
genes. 

For details, see the README_expression contained therein.


===========================================================================
===========================================================================
V.   Files in the GeneRIF directory (Gene References into Function)
===========================================================================
===========================================================================
generifs_basic.gz
---------------------------------------------------------------------------
            GeneRIFs describing a single Gene each
            (rather than interactions between two genes' products)

            Tab-delimited
            Sorted by Tax ID, Gene ID, and the first PubMed ID in the list
            
            For more information, please review:
            https://www.ncbi.nlm.nih.gov/gene/about-generif
---------------------------------------------------------------------------

Tax ID
            the unique identifier provided by NCBI Taxonomy
            for the species or strain/isolate

Gene ID
            the unique identifier for a gene

PubMed ID (PMID) list
            unique citation identifier(s) in PubMed;
            multiple values are comma-separated
            NOTE: if you open this file in Excel, be certain to treat
            this column as a string. Otherwise the comma-delimited
            PubMed UIDs may be converted to a single integer.
 
last update timestamp
            the last time this GeneRIF was modified,
            in ISO 8601 format "yyyy-mm-dd hh:mm"

GeneRIF text
            GeneRIF text string, length <= 425 characters
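
            When the file is processed by a script rather than a
            spreadsheet, the PMID list can be kept as strings and the
            timestamp parsed explicitly. This Python sketch assumes the
            five columns appear in the order documented above. The record
            shown is invented for illustration.

```python
from datetime import datetime


def parse_generif_line(line):
    """Parse one generifs_basic line, assuming the documented column order."""
    tax_id, gene_id, pmids, updated, text = line.rstrip("\n").split("\t", 4)
    return {
        "tax_id": tax_id,
        "gene_id": gene_id,
        "pmids": pmids.split(","),  # keep PMIDs as strings
        "updated": datetime.strptime(updated, "%Y-%m-%d %H:%M"),
        "text": text,
    }


# Invented record for illustration only.
sample = ("9606\t1001\t11111111,22222222\t2020-01-15 10:30\t"
          "Invented GeneRIF text for illustration.")
rec = parse_generif_line(sample)
print(rec["pmids"], rec["updated"].isoformat())
```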


===========================================================================
hiv_interactions.gz
---------------------------------------------------------------------------
            Descriptions of interactions between two genes' products --
            specifically, one from Human and one from Human Immunodeficiency
            Virus type 1 (HIV-1) -- from a collaboration with NIAID

            This file contains a subset of the interaction data reported in
            interactions.gz, described below.

            Tab-delimited
            Sorted by:  human Gene ID, human accession.version,
                        virus Gene ID, virus accession.version,
                        first PubMed ID in the list
---------------------------------------------------------------------------

First gene of interacting pair (virus interactant)

  Tax ID 1
            the unique identifier provided by NCBI Taxonomy
            for the species or strain/isolate

  Gene ID 1
            the unique identifier for a gene

  product accession.version 1

  product name 1

Interaction short phrase
            text string

Second gene of interacting pair (human interactant)

  Tax ID 2
            the unique identifier provided by NCBI Taxonomy
            for the species or strain/isolate

  Gene ID 2
            the unique identifier for a gene


  product accession.version 2

  product name 2

PubMed ID (PMID) list
            unique citation identifier(s) in PubMed;
            multiple values are comma-separated

            NOTE: if you open this file in Excel, be certain to treat
            this column as a string. Otherwise the comma-delimited
            PubMed UIDs may be converted to a single integer.


last update timestamp
            the last time this GeneRIF was modified,
            in ISO 8601 format "yyyy-mm-dd hh:mm"

GeneRIF text
            text string, length <= 425 characters
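
            One way to summarize this file is to collect the interaction
            short phrases reported for each virus/human gene pair. The
            Python sketch below assumes the twelve columns appear in the
            order documented above and that any line starting with '#' is
            a header to be skipped. The Gene IDs, accessions, and text in
            the sample are invented.

```python
from collections import defaultdict


def phrases_by_pair(lines):
    """Group interaction short phrases by (virus Gene ID, human Gene ID)."""
    pairs = defaultdict(set)
    for line in lines:
        if line.startswith("#"):  # assumption: header lines start with '#'
            continue
        cols = line.rstrip("\n").split("\t")
        virus_gene, phrase, human_gene = cols[1], cols[4], cols[6]
        pairs[(virus_gene, human_gene)].add(phrase)
    return pairs


# Invented records for illustration only.
sample = [
    "\t".join(["11676", "155001", "NP_000001.1", "example viral protein",
               "binds", "9606", "1001", "NP_000002.1",
               "example human protein", "11111111", "2020-01-15 10:30",
               "Invented text A."]),
    "\t".join(["11676", "155001", "NP_000001.1", "example viral protein",
               "upregulates", "9606", "1001", "NP_000002.1",
               "example human protein", "22222222", "2020-02-01 09:00",
               "Invented text B."]),
]
result = phrases_by_pair(sample)
print(dict(result))
```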


===========================================================================
hiv_siRNA_interactions.gz
---------------------------------------------------------------------------
            Descriptions of HIV-1 virus and human protein interactions that 
            regulate HIV-1 replication and infectivity.

            All interactions are with Human immunodeficiency virus 1
            (NC_001802.1, Tax ID 11676).

            Tab-delimited
---------------------------------------------------------------------------

Tax ID
            the unique identifier provided by NCBI Taxonomy
            for the species or strain/isolate

Gene ID
            the unique identifier for a gene

Interaction short phrase
            text string

product accession.version

product name

PubMed ID (PMID) list
            unique citation identifier(s) in PubMed;
            multiple values are comma-separated

            NOTE: if you open this file in Excel, be certain to treat
            this column as a string. Otherwise the comma-delimited
            PubMed UIDs may be converted to a single integer.

last update timestamp
            the last time this GeneRIF was modified,
            in ISO 8601 format "yyyy-mm-dd hh:mm"

GeneRIF text
            text string, length <= 425 characters


===========================================================================
interactions.gz
---------------------------------------------------------------------------
            Descriptions of interactions involving up to two interactants
            and a resulting complex, at least one of which is a gene product.

            If both interactants are associated with Gene IDs, the interacting
            pair is reported once, using the convention that the interactant
            with the smaller Gene ID is listed as the "first interactant",
            as defined below.

            This file includes the interaction data reported in
            hiv_interactions.gz and hiv_siRNA_interactions.gz, described above.

            Data elements which are not applicable are shown as "-".

            Tab-delimited
            Sorted by:  1st Tax ID, 1st Gene ID, 1st accession.version,
                        2nd Tax ID,              2nd accession.version,
                        first PubMed ID in the list
---------------------------------------------------------------------------

First interactant

  Tax ID 1
            the unique identifier provided by NCBI Taxonomy
            for the species or strain/isolate

  Gene ID 1
            the unique identifier for a gene

  interactant accession.version 1

  interactant name 1

Interaction short phrase
            text string

Second interactant

  Tax ID 2
            the unique identifier provided by NCBI Taxonomy
            for the species or strain/isolate

  interactant ID 2
            an identifier for this interactant, within the database
            specified by "interactant ID type" below
            --  note:  depending on the database, this ID may be
                either a numeric value or a character string

  interactant ID type
            the database within which the interactant ID may be found;
            if this interactant is a gene product, its interactant ID type
            is "GeneID", and the interactant ID is its numeric Gene ID.

  interactant accession.version 2

  interactant name 2

Resulting complex

  complex ID
            an identifier for this complex, within the database
            specified by "complex ID type" below
            --  note:  depending on the database, this ID may be
                either a numeric value or a character string

  complex ID type
            the database within which the complex ID may be found

  complex name

PubMed ID (PMID) list
            unique citation identifier(s) in PubMed;
            multiple values are comma-separated

            NOTE: if you open this file in Excel, be certain to treat
            this column as a string. Otherwise the comma-delimited
            PubMed UIDs may be converted to a single integer.

last update timestamp
            the last time this GeneRIF was modified,
            in ISO 8601 format "yyyy-mm-dd hh:mm"

GeneRIF text
            text string, length <= 425 characters

Interaction source

  interaction ID
            an identifier for this interaction, within the database
            specified by "interaction ID type" below
            --  note:  depending on the database, this ID may be
                either a numeric value or a character string

  interaction ID type
            the database within which the interaction ID may be found;
            if there is no interaction ID, no interaction ID type is reported
 
            additional information on interaction source databases is in
            the file interaction_sources, described below.
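
            The following Python sketch parses one interactions.gz line
            into named fields and converts "-" to None. It assumes the
            eighteen columns appear in the order documented above. The
            field names are labels chosen for this example, not headers
            from the file, and the record (including the source symbol
            ExampleDB) is invented.

```python
# Labels chosen for this sketch; they follow the documented column order.
FIELDS = [
    "tax_id_1", "gene_id_1", "accession_1", "name_1", "phrase",
    "tax_id_2", "interactant_id_2", "interactant_id_type_2",
    "accession_2", "name_2",
    "complex_id", "complex_id_type", "complex_name",
    "pmids", "updated", "generif_text",
    "interaction_id", "interaction_id_type",
]


def parse_interaction(line):
    """Parse one line; data elements shown as '-' become None."""
    values = line.rstrip("\n").split("\t")
    rec = {name: (None if value == "-" else value)
           for name, value in zip(FIELDS, values)}
    rec["pmids"] = rec["pmids"].split(",") if rec["pmids"] else []
    return rec


def second_is_gene(rec):
    """True when the second interactant is identified by a Gene ID."""
    return rec["interactant_id_type_2"] == "GeneID"


# Invented record for illustration only.
sample = "\t".join([
    "9606", "1001", "NP_000001.1", "example protein A", "binds",
    "9606", "2002", "GeneID", "NP_000002.1", "example protein B",
    "-", "-", "-",
    "11111111,22222222", "2020-01-15 10:30", "Invented text.",
    "12345", "ExampleDB",
])
rec = parse_interaction(sample)
print(second_is_gene(rec), rec["complex_id"], rec["pmids"])
```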


=========================================================================== 
interaction_sources
---------------------------------------------------------------------------
            Additional information on sources of interactions listed in
            interactions.gz, described above.

            Tag/value pairs, one per line, delimited by colon and whitespace
            Sources delimited by blank lines
            Sorted by symbol
---------------------------------------------------------------------------

Symbol
            the symbol used to represent this source in interactions.gz

Webpage URL
            the primary or general Web page for this source

Template URL
            a prefix which, when combined with the interaction ID from a
            specific interaction record in interactions.gz, produces a full
            URL which accesses further information on that interaction from
            the source's Web site
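
            A minimal Python sketch of reading this file and building a
            link for one interaction follows. It assumes the tags in the
            file match the names documented above, and that "combined"
            means appending the interaction ID to the Template URL. The
            sources and URLs in the sample are invented.

```python
import re


def parse_sources(text):
    """Parse tag/value blocks separated by blank lines, keyed by Symbol."""
    sources = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {}
        for line in block.splitlines():
            # Split on the first colon only; URLs contain further colons.
            tag, _, value = line.partition(":")
            entry[tag.strip()] = value.strip()
        if "Symbol" in entry:
            sources[entry["Symbol"]] = entry
    return sources


def interaction_url(sources, id_type, interaction_id):
    """Append the interaction ID to the source's Template URL, if any."""
    template = sources.get(id_type, {}).get("Template URL")
    return template + interaction_id if template else None


# Invented content for illustration only.
sample = (
    "Symbol: ExampleDB\n"
    "Webpage URL: http://example.org/\n"
    "Template URL: http://example.org/interaction?id=\n"
    "\n"
    "Symbol: OtherDB\n"
    "Webpage URL: http://example.net/\n"
)
sources = parse_sources(sample)
url = interaction_url(sources, "ExampleDB", "12345")
print(url)
```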


===========================================================================
===========================================================================
VI.     Files in the tools directory
===========================================================================
===========================================================================
i. taxidToGeneNames.pl
---------------------------------------------------------------------------
      A representative perl script, using ESearch and ESummary, to extract
GeneIDs and names for a species (i.e., by its Taxonomy ID). Usage notes 
provided when no arguments are supplied are:

Usage: taxidToGeneNames.pl [option] -t taxonomyId -o xml|tab
    Options:   -h     Display this usage help information
               -v     Verbose
               -o     output options
                        xml  - XML
                        tab  - tab-delimited

Output is written to STDOUT. 

Sample execution statement:

       taxidToGeneNames.pl -t 9615 -o xml > 9615_genes


==========================================================================
ii. gene2xml
---------------------------------------------------------------------------
    gene2xml is a standalone program that converts Entrez Gene ASN.1 into XML.
    It also interconverts different formats of Entrez Gene ASN.1.  It is 
    available for multiple platforms.


directory path:
ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/cmdline/

  gene2xml.linux64.gz
  gene2xml.mac.gz
  gene2xml.win64.zip

OR

ftp://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/gene2xml/

linux64.gene2xml.gz
mac.gene2xml.gz
win64.gene2xml.zip


For comprehensive documentation, use either of these sources:

ftp://ftp.ncbi.nlm.nih.gov/asn1-converters/documentation/gene2xml_readme.txt

OR

ftp://ftp.ncbi.nlm.nih.gov/gene/tools/README


===========================================================================
iii. geneDocSum.pl
---------------------------------------------------------------------------
      A representative perl script, using ESearch and ESummary, to extract
GeneIDs and other fields from the Document Summary (DocSum).  Usage notes
provided when no arguments are supplied are:

Usage: geneDocSum.pl [options] -q query -o xml|tab
    Options:   -h     Display this usage help information
               -v     Verbose
               -q     Query to run against Entrez Gene, e.g. "has summary[prop]"
               -o     Output options
                        xml  - XML
                        tab  - tab-delimited
               -t     Tag from eutils xml to extract, e.g. "Summary"
                        - is case sensitive
                        - may be specified multiple times to extract multiple 
                              tags & values
                        - used only with "-o tab" option
                        - to see all available xml tags in the DocSum, run first
                              with "-o xml" option


Output is written to STDOUT.

Sample execution statement:

       geneDocSum.pl -q "has_summary[prop] AND chimpanzee[orgn]" -o tab -t Name -t Summary


===========================================================================
iv. geneGoSummary.pl
---------------------------------------------------------------------------
      A representative perl script, using ESearch, to extract a summary of
GeneOntology information.  Usage notes provided when no arguments are 
supplied are:

Usage: geneGoSummary.pl [options]
    Options:   -h     Display this usage help information
               -v     Verbose
               -i     Input file (or - for stdin) with ranges
               -g     gene2go file
               -t     tax id


The tax id defaults to 9606 (Homo sapiens).

The gene2go file is specified with the -g argument, and the latest version of 
this file must be downloaded and decompressed before running the 
geneGoSummary.pl script.

Output is written to STDOUT.

Sample execution statement:

    echo chr12:1000000-2000000 | geneGoSummary.pl -g gene2go


==========================================================================
VII. Gene-related files from genome annotation
---------------------------------------------------------------------------

Genome annotation files are available at: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq