EMBL Outstation - The European Bioinformatics Institute
           European Nucleotide Archive annotated/assembled sequences

                                  User Manual

                           Release 139, April 2019
                             
                                EMBL Outstation
                       European Bioinformatics Institute
                          Wellcome Genome Campus
                                    Hinxton
                               Cambridge CB10 1SD
                                 United Kingdom
                           Telephone: +44-1223-494499
                           Telefax  : +44-1223-494468
                       Electronic mail: datasubs@ebi.ac.uk
                         URL: http://www.ebi.ac.uk/ena


               This manual and the database it accompanies may be
               copied  and  redistributed freely, without advance
               permission,  provided  that  this   statement   is
               reproduced with each copy.


                                   CONTENTS
        1       INTRODUCTION
        2       CONVENTIONS USED IN THE DATABASE 
        2.1       Sequence Data  
        2.2       Organism Identification and Classification 
        2.3       Literature References 
        3       FORMAT OF THE DATABASE
        3.1       Data Class 
        3.2       Taxonomic Division 
        3.3       Structure of an Entry  
        3.4       Line Structure
        3.4.1       The ID Line  
        3.4.2       The AC Line  
        3.4.3       The PR Line
        3.4.4       The DT Line 
        3.4.5       The DE Line  
        3.4.6       The KW Line  
        3.4.7       The OS Line  
        3.4.8       The OC Line 
        3.4.9       The OG Line  
        3.4.10      The Reference (RN, RC, RP, RX, RG, RA, RT, RL)
                    Lines
        3.4.10.1     The RN Line  
        3.4.10.2     The RC Line  
        3.4.10.3     The RP Line  
        3.4.10.4     The RX Line 
        3.4.10.5     The RG Line
        3.4.10.6     The RA Line  
        3.4.10.7     The RT Line  
        3.4.10.8     The RL Line  
        3.4.11      The DR Line  
        3.4.12      The AH Line
        3.4.13      The AS Line
        3.4.14      The CO Line
        3.4.15      The FH Line 
        3.4.16      The FT Line  
        3.4.17      The SQ Line  
        3.4.18      The Sequence Data Line 
        3.4.19      The CC Line  
        3.4.20      The XX Line  
        3.4.21      The // Line  

APPENDIX A      STANDARD BASE CODES
APPENDIX B      MODIFIED BASE CODES
APPENDIX C      REFERENCES FOR ABBREVIATIONS AND SYMBOLS
 
1  INTRODUCTION
This document describes the format and conventions used in ENA sequence
records. An attempt has been made to make the collected data as easily
accessible as possible without restricting their usefulness to any
particular type of computing environment. For this reason, the simplest
possible organisation ("flat file") has been chosen.
The main body of this User Manual describes the features of the database which
will remain stable, such as the flat file format and the use of line types 
to distinguish different kinds of information. Features of the
database more likely to require change (such as journal abbreviations)
are described in the appendices. Information which applies specifically
to the current release of the database is presented in the Release
Notes. The Release Notes also describe changes which are foreseen in future
releases.
It is likely that the need to represent new kinds of information in the
database will eventually necessitate changes or additions to the 
presentation of data.
Such changes will be made as far as possible in ways which have minimal impact
on user programs and procedures. For example, a new type of data could be 
added to the database as a new line type (see Section 3) without
affecting the processing of existing line types.
We would like to stress that both this manual and the database itself are free
from any copyright restrictions (please see the statement on the title 
page). While we would appreciate acknowledgement if our efforts have been 
useful to you, we want to ensure that the data are freely available to anyone
interested.

2  CONVENTIONS USED IN THE DATABASE
This section describes the general conventions which have been applied to
the information in the database in order to achieve uniformity of
presentation.
Specific abbreviations and symbol usage are summarized in the appendices.

2.1  Sequence Data
Nucleotide sequence data are generally presented in the database as they 
have been submitted or published, subject to certain conventions which have been
established for the database as a whole. The sequences are always listed in the 
5' to 3' direction, regardless of the published order. Bases are numbered 
sequentially beginning with 1 at the 5' end of the sequence.
The sequences are presented in the database in a form corresponding to the
biological state of the information in vivo. Thus, cDNA sequences are stored 
in the database as RNA sequences, even though they usually appear in the
literature as DNA. For genomic data, the coding strand is stored. Data
containing coding sequences on both strands are stored according to the
prevailing conventions in the literature. The stored data generally 
correspond to wild type sequences before mutation or genetic manipulation.
Sequences of tRNA molecules are stored as unmodified RNA sequences (equivalent
to the mature transcript before any base modification occurs). This form
(colinear with the genomic sequence) has been adopted to simplify both 
storage and analysis of the sequences. Thus, a modified base appears in 
the sequence as the corresponding unmodified base. However, each base
modification is noted in the feature table, so that the mature 
tRNA sequence can be restored automatically by a simple computer program
if this is desirable. The two-letter code used by Sprinzl and Gauss has
been adopted for abbreviation of modified bases in the feature table.

2.2  Organism Identification and Classification
A unified taxonomy is used by the collaborating databases of the International
Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org).
Based on the NCBI's 'Taxon' project, this constitutes a taxonomy database which
reflects current phylogenetic knowledge. It is a sequence-based taxonomy as
far as possible, and is based upon published authorities wherever appropriate.
Deciding criteria include a variety of physiological, ecological and
morphological characters, overall morphological similarity and common descent.
Evolutionary taxonomists tend to consider both overall similarity
and common descent when making and assigning a classification while
phylogeneticists attempt to reflect the branching pattern of the underlying
phylogenetic tree. There is of course no such thing as a single best method for
classifying organisms and the choice of one system over the other has to be made
with regard to the particular purpose of the classification. Because of the
inherent ambiguity of evolutionary classification and the specific needs of
database users (e.g. trying to track down the phylogenetic history of a group
of organisms or to elucidate the evolution of a molecule), the taxonomy strives
to reflect accurately current phylogenetic knowledge.
One of the major sources for classification is the set of phylogenetic insights
derived from molecular evolution studies. New taxonomic information is included
as soon as it becomes available, but at the same time, efforts are made to
ensure that the arguments and evidence provided are reliable in order to avoid
frequent (and possibly unnecessary) changes to the classification system.
The OS/OC lines of all entries reflect the up-to-date taxonomic classification.
This classification is intended to be informative and helpful; no claim is
made that it is necessarily the best or most exact. This information is subject
to change in future editions.
According to the Feature Table Definition, an entry's sequence span has 
to be covered by one source feature or a combination of several.  'Synthetic
constructs' are one type of sequence entry which typically contain several
source features. Here one of these source features spans the whole sequence
(/organism="synthetic construct"). The feature qualifier /focus is attached to
the preferred source feature and used to determine the taxonomic division. If
no translation table is specified, the organism with /focus will define the
translation table. Within an entry with several source features, only one will
exist with /focus on it. 
                            
2.3  Literature References
The references cited for an entry should be considered a pointer to the
literature and not a scientific credit for the elucidation of the
sequence. Although every effort is made to give complete reference 
information, occasionally only a secondary source has been cited. This
has happened most frequently in cases where a secondary reference has 
presented the data in a form easily entered. The speed and accuracy with
which data can be abstracted is very dependent on the form of presentation.
In such cases, we prefer to cite also the primary reference, and request
users who note such omissions to inform us so that the appropriate additions
may be made.

3  FORMAT OF THE DATABASE
The ENA assembled/annotated sequence release and update products are composed
of sequence entries. Each entry corresponds to a single contiguous molecule as
contributed to the database or reported in the literature. In some cases, entries
have been assembled from several papers reporting overlapping sequence regions.
Conversely a single paper often provides data for several entries, as when
homologous sequences from different organisms are compared.

3.1  Data Class
The data class of each entry, representing a methodological approach to the
generation of the data or a type of data, is indicated on the first (ID) line
of the entry. Each entry belongs to exactly one data class.
  Class          Definition
  -----------    -----------------------------------------------------------
  CON            Entry constructed from segment entry sequences; if unannotated,
                 annotation may be drawn from segment entries
  PAT            Patent
  EST            Expressed Sequence Tag
  GSS            Genome Survey Sequence
  HTC            High Throughput cDNA sequencing
  HTG            High Throughput Genome sequencing
  WGS            Whole Genome Shotgun
  TSA            Transcriptome Shotgun Assembly
  STS            Sequence Tagged Site
  STD            Standard (all entries not classified as above)

3.2  Taxonomic Division
The entries which constitute the database are grouped into taxonomic divisions,
the object being to create subsets of the database which reflect areas of
interest for many users.
In addition to the division, each entry contains a full taxonomic
classification of the organism that was the source of the stored sequence,
from kingdom down to genus and species (see below).
Each entry belongs to exactly one taxonomic division. The ID line of each entry
indicates its taxonomic division, using the three letter codes shown below:

                          Division                 Code
                          -----------------        ----
                          Bacteriophage            PHG
                          Environmental Sample     ENV
                          Fungal                   FUN
                          Human                    HUM
                          Invertebrate             INV
                          Other Mammal             MAM
                          Other Vertebrate         VRT
                          Mus musculus             MUS 
                          Plant                    PLN
                          Prokaryote               PRO
                          Other Rodent             ROD
                          Synthetic                SYN
                          Transgenic               TGN
                          Unclassified             UNC
                          Viral                    VRL

3.3  Structure of an Entry
The entries in the database are structured so as to be usable by human 
readers as well as by computer programs. The explanations, descriptions,
classifications and other comments are in ordinary English, and the symbols
and formatting employed for the base sequences themselves have been 
chosen for readability. Wherever possible, symbols familiar to molecular
biologists have been used. At the same time, the structure is systematic 
enough to allow computer programs easily to read, identify, and manipulate
the various types of data included.
Each entry in the database is composed of lines. Different types of lines,
each with its own format, are used to record the various types of data which
make up the entry. In general, fixed format items have been kept to a 
minimum, and a more syntax-oriented structure adopted for the lines. 
The two exceptions to this are the sequence data lines and the feature table
lines, for which a fixed format was felt to offer significant advantages
to the user. Users who write programs to process the database entries should
not make any assumptions about the column placement of items on lines other
than these two: all other line types are free-format. 
A sample entry is shown in Figure 1.
Note that each line begins with a two-character line code, which indicates
the type of information contained in the line. The currently used line 
types, along with their respective line codes, are listed below:
     ID - identification             (begins each entry; 1 per entry)
     AC - accession number           (>=1 per entry)
     PR - project identifier         (0 or 1 per entry)
     DT - date                       (2 per entry)
     DE - description                (>=1 per entry)
     KW - keyword                    (>=1 per entry)
     OS - organism species           (>=1 per entry)
     OC - organism classification    (>=1 per entry)
     OG - organelle                  (0 or 1 per entry)
     RN - reference number           (>=1 per entry)
     RC - reference comment          (>=0 per entry)
     RP - reference positions        (>=1 per entry)
     RX - reference cross-reference  (>=0 per entry)
     RG - reference group            (>=0 per entry)
     RA - reference author(s)        (>=0 per entry)
     RT - reference title            (>=1 per entry)
     RL - reference location         (>=1 per entry)
     DR - database cross-reference   (>=0 per entry)
     CC - comments or notes          (>=0 per entry)
     AH - assembly header            (0 or 1 per entry)   
     AS - assembly information       (0 or >=1 per entry)
     FH - feature table header       (2 per entry)
     FT - feature table data         (>=2 per entry)    
     XX - spacer line                (many per entry)
     SQ - sequence header            (1 per entry)
     CO - contig/construct line      (0 or >=1 per entry) 
     bb - (blanks) sequence data     (>=1 per entry)
     // - termination line           (ends each entry; 1 per entry)
Note that some entries will not contain all of the line types, and some line
types occur many times in a single entry. As indicated, each entry begins with
an identification line (ID) and ends with a terminator line (//). The various 
line types appear in entries in the order in which they are listed above 
(except for XX lines which may appear anywhere between the ID and SQ lines). A
detailed description of each line type is given in the following sections.

ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC   fabids; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   DOI; 10.1007/BF00039495.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the INSDC.
RL   Hughes M.A., University of Newcastle Upon Tyne, Medical School, Newcastle
RL   Upon Tyne, NE2 4HH, UK
XX
DR   EuropePMC; PMC99098; 11752244.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /organism="Trifolium repens"
FT                   /mol_type="mRNA"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT                   /tissue_type="leaves"
FT                   /db_xref="taxon:3899"
FT   mRNA            1..1859
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT   CDS             14..1495
FT                   /product="beta-glucosidase"
FT                   /EC_number="3.2.1.21"
FT                   /note="non-cyanogenic"
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="InterPro:IPR001360"
FT                   /db_xref="InterPro:IPR013781"
FT                   /db_xref="InterPro:IPR017853"
FT                   /db_xref="InterPro:IPR018120"
FT                   /db_xref="UniProtKB/Swiss-Prot:P26204"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata       240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta       300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc       360
     ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaatcaa       420
     atattacaac aaccttatca acgaactatt ggctaacggt atacaaccat ttgtaactct       480
     ttttcattgg gatcttcccc aagtcttaga agatgagtat ggtggtttct taaactccgg       540
     tgtaataaat gattttcgag actatacgga tctttgcttc aaggaatttg gagatagagt       600
     gaggtattgg agtactctaa atgagccatg ggtgtttagc aattctggat atgcactagg       660
     aacaaatgca ccaggtcgat gttcggcctc caacgtggcc aagcctggtg attctggaac       720
     aggaccttat atagttacac acaatcaaat tcttgctcat gcagaagctg tacatgtgta       780
     taagactaaa taccaggcat atcaaaaggg aaagataggc ataacgttgg tatctaactg       840
     gttaatgcca cttgatgata atagcatacc agatataaag gctgccgaga gatcacttga       900
     cttccaattt ggattgttta tggaacaatt aacaacagga gattattcta agagcatgcg       960
     gcgtatagtt aaaaaccgat tacctaagtt ctcaaaattc gaatcaagcc tagtgaatgg      1020
     ttcatttgat tttattggta taaactatta ctcttctagt tatattagca atgccccttc      1080
     acatggcaat gccaaaccca gttactcaac aaatcctatg accaatattt catttgaaaa      1140
     acatgggata cccttaggtc caagggctgc ttcaatttgg atatatgttt atccatatat      1200
     gtttatccaa gaggacttcg agatcttttg ttacatatta aaaataaata taacaatcct      1260
     gcaattttca atcactgaaa atggtatgaa tgaattcaac gatgcaacac ttccagtaga      1320
     agaagctctt ttgaatactt acagaattga ttactattac cgtcacttat actacattcg      1380
     ttctgcaatc agggctggct caaatgtgaa gggtttttac gcatggtcat ttttggactg      1440
     taatgaatgg tttgcaggct ttactgttcg ttttggatta aactttgtag attagaaaga      1500
     tggattaaaa aggtacccta agctttctgc ccaatggtac aagaactttc tcaaaagaaa      1560
     ctagctagta ttattaaaag aactttgtag tagattacag tacatcgttt gaagttgagt      1620
     tggtgcacct aattaaataa aagaggttac tcttaacata tttttaggcc attcgttgtg      1680
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc      1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac      1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859
//
                  Figure 1 - A sample entry from the database

3.4  Line Structure
This section describes in detail the format of each type of line used in 
the database. Each line begins with a two-character line type code. 
This code is always followed by three blanks, so that the actual information 
in each line begins in character position 6.
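As an informal illustration (not part of the database distribution), the
following Python sketch separates the line type code from the content of each
line. The function names split_line and group_by_code are hypothetical, and
reporting blank-prefixed sequence data lines under the code "bb" is an
assumption borrowed from the line type list in Section 3.3.

```python
def split_line(line):
    """Split one flat-file line into (line_code, content)."""
    line = line.rstrip("\n")
    code = line[:2]
    # Sequence data lines begin with blanks; label them "bb" (assumption,
    # following the line type list in Section 3.3).
    if code == "  ":
        code = "bb"
    # The content starts in character position 6, i.e. index 5.
    return code, line[5:].rstrip()


def group_by_code(lines):
    """Collect line contents by line code, in order of appearance."""
    grouped = {}
    for raw in lines:
        if not raw.strip():
            continue  # skip empty lines between entries
        code, content = split_line(raw)
        grouped.setdefault(code, []).append(content)
    return grouped


sample = [
    "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
    "XX",
    "AC   X56734; S46826;",
    "XX",
    "KW   beta-glucosidase.",
    "//",
]
entry = group_by_code(sample)
print(entry["AC"])
```

Because only the first five character positions are assumed to be fixed, this
approach does not depend on the column placement of items within free-format
lines.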

3.4.1  The ID Line
The ID (IDentification) line is always the first line of an entry. The
format of the ID line is:
ID   <1>; SV <2>; <3>; <4>; <5>; <6>; <7> BP.
The tokens represent:
   1. Primary accession number
   2. Sequence version number
   3. Topology: 'circular' or 'linear'
   4. Molecule type (see note 1 below)
   5. Data class (see section 3.1)
   6. Taxonomic division (see section 3.2)
   7. Sequence length (see note 2 below)

Note 1 - Molecule type: this represents the type of molecule as stored and can
be any value from the list of current values for the mandatory mol_type source
qualifier. This item should be the same as the value in the mol_type
qualifier(s) in a given entry.
Note 2 - Sequence length: The last item on the ID line is the length of the
sequence (the total number of bases in the sequence). This number includes 
base positions reported as present but undetermined (coded as "N").
An example of a complete identification line is shown below:
ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
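The seven tokens can be extracted by splitting on the semicolon separators
rather than by column position. The sketch below is one possible approach in
Python; the pattern ID_PATTERN, the function parse_id_line and the dictionary
keys are hypothetical names chosen for illustration, and the pattern has only
been checked against the example line above.

```python
import re

# Token-based pattern for the ID line; field names are illustrative only.
ID_PATTERN = re.compile(
    r"^ID\s+(?P<accession>[^;]+);\s*SV\s+(?P<version>\d+);\s*"
    r"(?P<topology>circular|linear);\s*(?P<mol_type>[^;]+);\s*"
    r"(?P<data_class>[^;]+);\s*(?P<division>[^;]+);\s*"
    r"(?P<length>\d+)\s+BP\.\s*$"
)


def parse_id_line(line):
    """Return the seven ID line tokens as a dictionary."""
    match = ID_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a recognised ID line: {line!r}")
    fields = match.groupdict()
    # Sequence version and sequence length are numeric tokens.
    fields["version"] = int(fields["version"])
    fields["length"] = int(fields["length"])
    return fields


record = parse_id_line(
    "ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP."
)
print(record["accession"], record["mol_type"], record["length"])
```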

3.4.2  The AC Line
The AC (ACcession number) line lists the accession numbers associated with 
the entry.
                              
Examples of accession number lines are shown below:
 AC   X56734; S46826;
 AC   Y00001; X00001-X00005; X00008; Z00001-Z00005;
Each accession number, or range of accession numbers, is terminated by a
semicolon. Where necessary, more than one AC line is used. Consecutive
secondary accession numbers in ENA flatfiles are shown in the form of 
inclusive accession number ranges.
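A program reading AC lines might collect the semicolon-terminated tokens as
sketched below in Python. The function names parse_ac_lines and split_range
are hypothetical. The sketch assumes each input line still carries its AC line
code, and it leaves ranges unexpanded because the rules for enumerating the
accession numbers inside a range are not stated here.

```python
def parse_ac_lines(lines):
    """Return accession tokens from one or more AC lines, in order."""
    tokens = []
    for line in lines:
        content = line.strip()
        if not content.startswith("AC"):
            raise ValueError(f"not an AC line: {line!r}")
        # Drop the line code, then split on the terminating semicolons.
        for token in content[2:].split(";"):
            token = token.strip()
            if token:
                tokens.append(token)
    return tokens


def split_range(token):
    """Return (first, last) for a range token, or (token, token)."""
    first, sep, last = token.partition("-")
    return (first, last) if sep else (token, token)


tokens = parse_ac_lines(
    ["AC   Y00001; X00001-X00005; X00008; Z00001-Z00005;"]
)
primary = tokens[0]  # the first accession number in the list
print(primary, [split_range(t) for t in tokens[1:]])
```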
Accession numbers are the primary means of identifying sequences, providing
a stable way of identifying entries from release to release. An accession
number always remains in the accession number list of the latest
version of the entry in which it first appeared.  Accession numbers allow
unambiguous citation of database entries. Researchers who wish to cite entries
in their publications should always cite the first accession number in the
list (the "primary" accession number) to ensure that readers can find the
relevant data in a subsequent release. Readers wishing to find the data thus
cited must look at all the accession numbers in each entry's list.
Secondary accession numbers: One reason for allowing the existence of several
accession numbers is to allow tracking of data when entries are merged
or split. For example, when two entries are merged into one, a "primary" 
accession number goes at the start of the list, and those from the 
merged entries are added after this one as "secondary" numbers.  

Example:        AC   X56734; S46826;

Similarly, if an existing entry is split into two or more entries (a rare 
occurrence), the original accession number list is retained in all the derived
entries.
An accession number is dropped from the database only when the data to
which it was assigned have been completely removed from the database.

3.4.3  The PR Line
The PR (PRoject) line shows the International Nucleotide Sequence Database
Collaboration (INSDC) Project Identifier that has been assigned to the entry.
Full details of INSDC Projects are available at
http://www.ebi.ac.uk/ena/about/page.php?page=project_guidelines.
Example:        PR   Project:17285;

3.4.4  The DT Line
The DT (DaTe) line shows when an entry first appeared in the database and
when it was last updated.  Each entry contains two DT lines, formatted
as follows:
DT   DD-MON-YYYY (Rel. #, Created)
DT   DD-MON-YYYY (Rel. #, Last updated, Version #)
The DT lines from the above example are:
DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
The date supplied on each DT line indicates when the entry was created or
last updated; that will usually also be the date when the new or modified
entry became publicly visible via the EBI network servers. The release
number indicates the first quarterly release made *after* the entry was 
created or last updated. The version number appears only on the "Last 
updated" DT line.
The absolute value of the version number is of no particular significance; its
purpose is to allow users to determine easily if the version of an entry 
which they already have is still the most up-to-date version. Version numbers
are incremented by one every time an entry is updated; since an entry may be
updated several times before its first appearance in a quarterly release, the
version number at the time of its first release appearance may be greater than
one. Note that because an entry may also be updated several times between
two quarterly releases, there may be gaps in the sequence of version numbers 
which appear in consecutive releases.
If an entry has not been updated since it was created, it will still have 
two DT lines and the "Last updated" line will have the same date (and 
release number) as the "Created" line.
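The two DT line formats can be read with a single pattern, as in the Python
sketch below. The names DT_PATTERN, MONTHS and parse_dt_line are hypothetical,
and the three-letter English month abbreviations in upper case are an
assumption based on the DT lines of Figure 1.

```python
import re
from datetime import date

# Assumed month abbreviations (upper-case English, as in Figure 1).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}

DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+\(Rel\.\s*(\d+),\s*"
    r"(Created|Last updated)(?:,\s*Version\s+(\d+))?\)\s*$"
)


def parse_dt_line(line):
    """Return date, release, kind and (optional) version of a DT line."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised DT line: {line!r}")
    day, month, year, release, kind, version = match.groups()
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "release": int(release),
        "kind": kind,
        # The version number appears only on the "Last updated" line.
        "version": int(version) if version is not None else None,
    }


created = parse_dt_line("DT   12-SEP-1991 (Rel. 29, Created)")
updated = parse_dt_line("DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)")
print(created["date"].isoformat(), updated["version"])
```

Comparing the version value of a locally held entry with that of a newly
retrieved one is then a simple integer comparison.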

3.4.5  The DE Line
The DE (Description) lines contain general descriptive information about the
sequence stored. This may include the designations of genes for which the
sequence codes, the region of the genome from which it is derived, or other
information which helps to identify the sequence. The format for a DE line is:
DE   description
The description is given in ordinary English and is free-format. Often, more
than one DE line is required; when this is the case, the text is divided only
between words. The description line from the example above is
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase      
The first DE line generally contains a brief description, which can stand
alone for cataloguing purposes.
 
3.4.6  The KW Line
The KW (KeyWord) lines provide information which can be used to generate
cross-reference indexes of the sequence entries based on functional,
structural, or other categories deemed important.
The format for a KW line is:
     KW   keyword[; keyword ...].
More than one keyword may be listed on each KW line; the keywords are 
separated by semicolons, and the last keyword is followed by a full
stop. Keywords may consist of more than one word, and they may contain
embedded blanks and stops. A keyword is never split between lines. 
An example of a keyword line is:
     KW   beta-glucosidase.
The keywords are ordered alphabetically; the ordering implies no hierarchy
of importance or function.  If an entry has no keywords assigned to it,
it will contain a single KW line like this:
     KW   .
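The rules above (semicolon separators, a terminating full stop, and a lone
full stop for entries without keywords) might be applied as in the following
Python sketch. The function name parse_kw_lines is hypothetical, and the
sketch assumes that each input line still carries its KW line code.

```python
def parse_kw_lines(lines):
    """Return the list of keywords from one or more KW lines."""
    # Drop the line code from each line and join the remaining text.
    content = " ".join(line.strip()[2:].strip() for line in lines).strip()
    # Only the final full stop terminates the list; stops embedded in a
    # keyword are left untouched.
    if content.endswith("."):
        content = content[:-1]
    return [kw.strip() for kw in content.split(";") if kw.strip()]


print(parse_kw_lines(["     KW   beta-glucosidase."]))
print(parse_kw_lines(["     KW   ."]))
```

Since a keyword is never split between lines, joining the lines with a blank
before splitting on semicolons does not alter any keyword.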

3.4.7  The OS Line
The OS (Organism Species) line specifies the preferred scientific name of
the organism which was the source of the stored sequence. In most 
cases this is done by giving the Latin genus and species designations, 
followed (in parentheses) by the preferred common name in English where
known. The format is:
     OS   Genus species (name)
In some cases, particularly for viruses and genetic elements, the only
accepted designation is a simple name such as "Canine adenovirus type 2".
In these cases only this designation is given. The species line from the 
example is:
     OS   Trifolium repens (white clover)
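A program may wish to separate the scientific name from the optional common
name. The Python sketch below shows one way of doing so; the names OS_PATTERN
and parse_os_line are hypothetical, and the sketch assumes that a common name,
when present, is the final parenthesised item on the line and contains no
nested parentheses.

```python
import re

# Scientific name, optionally followed by "(common name)" at the line end.
OS_PATTERN = re.compile(r"^OS\s+(.*?)(?:\s+\(([^()]*)\))?$")


def parse_os_line(line):
    """Return (scientific_name, common_name or None) for an OS line."""
    match = OS_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not an OS line: {line!r}")
    return match.group(1), match.group(2)


print(parse_os_line("     OS   Trifolium repens (white clover)"))
print(parse_os_line("     OS   Canine adenovirus type 2"))
```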
Hybrid organisms are classified in their own right. A rat/mouse hybrid,
for example, would appear as follows:
     OS   Mus musculus x Rattus norvegicus
     OC   (OC for mouse)
 
If the source organism is unknown but has been/will be cultured, the OS
line will contain a unique name derived from what is known of the
classification. The unique name serves to identify the database entry,
which will be updated once the full classification is known. In the
case of an unknown bacterium, for example:
     OS   unidentified bacterium B8
     OC   Bacteria.
For environmental samples where there is no intention to culture the
organism and complete taxonomy cannot be determined, collective names
are used in the OS line and the classification given extends down to
the most resolved taxonomic node possible, for example:
     OS   uncultured proteobacterium
     OC   Bacteria; Proteobacteria; environmental samples.
 
For naturally occurring plasmids the OS/OC lines will contain the 
source organism and the plasmid name will appear on the OG line. 
For example:
     OS   Escherichia coli
     OC   Prokaryota; ... Enterobacteriaceae.
     XX
     OG   Plasmid colE1
For artificial plasmids the OS line will be "OS Cloning vector" and the
sequence will be classified as an artificial sequence. For example:
     OS   Cloning vector M13plex17 
     OC   Artificial sequences; vectors.
 
Where only a naturally occurring part of a plasmid is reported, the plasmid
name will appear on the OG line and the OS/OC lines will describe the natural
source.
For example:
     OS   Escherichia coli
     OC   Prokaryota; ... Enterobacteriaceae.
     XX
     OG   Plasmid pUC8

3.4.8  The OC Line
The OC (Organism Classification) lines contain the taxonomic classification
of the source organism as described in Section 2.2 above. 
The classification is listed top-down as nodes in a taxonomic tree in which 
the most general grouping is given first.  The classification may be 
distributed over several OC lines, but nodes are not split or hyphenated 
between lines. The individual items are separated by semicolons and the
list is terminated by a full stop. The format for the OC line is:
     OC   Node[; Node...].
                                   
Example classification lines:
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons; Rosidae;
OC   Fabales; Fabaceae; Papilionoideae; Trifolium.


3.4.9  The OG Line
The OG (OrGanelle) linetype indicates the sub-cellular location of non-nuclear
sequences.  It is only present in entries containing non-nuclear sequences
and appears after the last OC line in such entries.
The OG line contains
a) one data item (title cased) from the controlled list detailed under the
/organelle qualifier definition in the Feature Table Definition document
that accompanies this release or
b) a plasmid name.
Examples include "Mitochondrion", "Plastid:Chloroplast" and "Plasmid pBR322".

For example, a chloroplast sequence from Euglena gracilis would appear as:
     OS   Euglena gracilis (green algae)
     OC   Eukaryota; Planta; Phycophyta; Euglenophyceae.
     OG   Plastid:Chloroplast

3.4.10  The Reference (RN, RC, RP, RX, RG, RA, RT, RL) Lines
These lines comprise the literature citations within the database.
The citations provide access to the papers from which the data has been 
abstracted. The reference lines for a given citation occur in a block, and
are always in the order RN, RC, RP, RX, RG, RA, RT, RL. Within each such 
reference block the RN line occurs once, the RC, RP and RX lines occur zero
or more times, and the RA (or RG), RT and RL lines each occur at least once.
If several references are given, there will be a reference block for each. 
Example of references:

RN   [5]
RP   1-1859
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17:209-219(1991).

The formats of the individual lines are explained in the following 
paragraphs. A second example, a submission reference that names a
consortium in an RG line, is:

RN   [2]
RP   1-1657990
RG   Prochlorococcus genome consortium
RA   Larimer F.;
RT   ;
RL   Submitted (03-JUL-2003) to the INSDC.
RL   Larimer F., DOE Joint Genome Institute, Production Genomics Facility, 
RL   2800 Mitchell Drive, Walnut Creek, CA 94598, USA, and the Genome 
RL   Analysis Group, Oak Ridge National Laboratory, 1060 Commerce Park Drive, 
RL   Oak Ridge, TN 37831, USA;
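
The block structure described above can be sketched in Python as follows.
This is a minimal illustration, not an official parser: the names
split_reference_blocks and has_required_lines are hypothetical, and the
sketch assumes that every reference block starts with an RN line.

```python
REFERENCE_CODES = ("RN", "RC", "RP", "RX", "RG", "RA", "RT", "RL")


def split_reference_blocks(lines):
    """Group the reference lines of one entry into blocks, one per RN line."""
    blocks = []
    for line in lines:
        code, data = line[:2], line[5:].rstrip()
        if code not in REFERENCE_CODES:
            continue
        if code == "RN":
            blocks.append({})  # an RN line opens a new reference block
        if not blocks:
            continue  # reference line seen before any RN line: ignored here
        blocks[-1].setdefault(code, []).append(data)
    return blocks


def has_required_lines(block):
    """Check that RN occurs once and that RA (or RG), RT and RL are present."""
    return (
        len(block.get("RN", [])) == 1
        and ("RA" in block or "RG" in block)
        and "RT" in block
        and "RL" in block
    )


lines = [
    "RN   [5]",
    "RP   1-1859",
    "RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;",
    "RT   ;",
    "RL   Plant Mol. Biol. 17:209-219(1991).",
    "RN   [2]",
    "RG   Prochlorococcus genome consortium",
    "RT   ;",
    "RL   Submitted (03-JUL-2003) to the INSDC.",
]
blocks = split_reference_blocks(lines)
print([block["RN"][0] for block in blocks])  # ['[5]', '[2]']
```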

3.4.10.1  The RN Line
The RN (Reference Number) line gives a unique number to each reference 
citation within an entry. This number is used to designate the reference
in comments and in the feature table. The format of the RN line is:
     RN   [n]                               
The reference number is always enclosed in square brackets. Note that the
set of reference numbers which appear in an entry does not necessarily form a
continuous sequence from 1 to n, where the entry contains "n" references. As
references are added to and removed from an entry, gaps may be introduced into
the sequence of numbers. The important point is that once an RN number has
been assigned to a reference within an entry it never changes. The reference
number line in the example above is:
     RN   [5]

3.4.10.2  The RC Line
The RC (Reference Comment) linetype is an optional linetype which appears if 
the reference has a comment. The comment is in English and as many RC lines as
are required to display the comment will appear. They are formatted thus:
     RC   comment

3.4.10.3  The RP Line
The RP (Reference Position) linetype is an optional linetype which appears if
one or more contiguous base spans of the presented sequence can be attributed
to the reference in question. As many RP lines as are required to display the
base span(s) will appear.
The base span(s) indicate which part(s) of the sequence are covered by the
reference.  Note that the numbering scheme is for the sequence as presented
in the database entry (i.e. from 5' to 3' starting at 1), not the scheme used
by the authors in the reference should the two differ. The RP line is
formatted thus:
     RP   i-j[, k-l...]
The RP line in the example above is:
     RP   1-1859
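
A minimal Python sketch of reading the base spans from RP lines is shown
here. The function name parse_rp is hypothetical; the sketch assumes the
"i-j[, k-l...]" layout given above and does no further validation.

```python
def parse_rp(lines):
    """Return a list of (start, end) base spans from the RP lines of one reference."""
    spans = []
    for line in lines:
        if not line.startswith("RP   "):
            continue
        # Spans on one line are separated by commas; each span is "start-end".
        for item in line[5:].split(","):
            item = item.strip()
            if item:
                start, end = item.split("-")
                spans.append((int(start), int(end)))
    return spans


print(parse_rp(["RP   1-1859"]))  # [(1, 1859)]
```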

3.4.10.4  The RX Line
The RX (Reference Cross-reference) linetype is an optional linetype which
contains a cross-reference to an external citation or abstract resource.
For example, if a journal citation exists in the PUBMED database, there will
be an RX line pointing to the relevant PUBMED identifier.
The format of the RX line is as follows:
     RX   resource_identifier; identifier.
The first item on the RX line, the resource identifier, is the abbreviated 
name of the data collection to which reference is made. The current
set of cross-referenced resources is:
     Resource ID    Fullname
     -----------    ------------------------------------
     PUBMED         PUBMED bibliographic database (NLM)
     DOI            Digital Object Identifier (International DOI Foundation)
     AGRICOLA       US National Agriculture Library (NAL) of the US Department
                    of Agriculture (USDA)
The second item on the RX line, the identifier, is a pointer to the entry in
the external resource to which reference is being made. The data item used as
the primary identifier depends on the resource being referenced.
For example:
RX   DOI; 10.1016/0024-3205(83)90010-3.
RX   PUBMED; 264242.
Note that further details of DOI are available at http://www.doi.org/. URLs
formulated in the following way are resolved to the correct full text URLs:
     http://dx.doi.org/<doi>
     e.g. http://dx.doi.org/10.1016/0024-3205(83)90010-3
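
As a sketch of how an RX line might be processed, the Python below splits
the line into its two items and builds the DOI resolver URL in the form
described above. The function names parse_rx and doi_url are hypothetical.

```python
def parse_rx(line):
    """Split an RX line into (resource_identifier, identifier)."""
    resource, identifier = line[5:].strip().split(";", 1)
    identifier = identifier.strip()
    # The line is terminated by a full stop that is not part of the identifier.
    if identifier.endswith("."):
        identifier = identifier[:-1]
    return resource.strip(), identifier


def doi_url(doi):
    """Build a resolver URL of the form http://dx.doi.org/<doi>."""
    return "http://dx.doi.org/" + doi


resource, identifier = parse_rx("RX   DOI; 10.1016/0024-3205(83)90010-3.")
if resource == "DOI":
    print(doi_url(identifier))
```

Only the final full stop is removed, so identifiers that themselves contain
full stops, as DOIs do, are left intact.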

3.4.10.5  The RG Line
The RG (Reference Group) lines list the working groups/consortia that 
produced the record. The RG line is mainly used in submission reference 
blocks, but may also be used in a paper reference if the working group is 
cited as an author in the paper.

3.4.10.6  The RA Line
The RA (Reference Author) lines list the authors of the paper (or other 
work) cited. All of the authors are included, and are listed in the order 
given in the paper. The names are listed surname first followed by a blank
followed by initial(s) with stops. Occasionally the initials may not 
be known, in which case the surname alone will be listed. The author names 
are separated by commas and terminated by a semicolon; they are not split 
between lines. The RA line in the example is:
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;    
As many RA lines as necessary are included for each reference.
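
The author list format can be illustrated with the Python sketch below. The
function name parse_ra is hypothetical. The sketch treats the last
blank-separated token as the initials only when it ends with a stop, which
is an assumption that will not hold for every name (suffixes such as "Jr."
are one example).

```python
def parse_ra(lines):
    """Return a list of (surname, initials) pairs from the RA lines of one reference."""
    text = " ".join(line[5:].strip() for line in lines if line.startswith("RA   "))
    text = text.rstrip(";")  # the author list is terminated by a semicolon
    authors = []
    for name in text.split(","):
        name = name.strip()
        if not name:
            continue
        parts = name.rsplit(" ", 1)
        if len(parts) == 2 and parts[1].endswith("."):
            authors.append((parts[0], parts[1]))
        else:
            authors.append((name, ""))  # initials not known
    return authors


print(parse_ra(["RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;"]))
```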

3.4.10.7  The RT Line
The RT (Reference Title) lines give the title of the paper (or other work) as
exactly as is possible given the limitations of computer character sets. Note
that the form used is that which would be used in a citation rather than that
displayed at the top of the published paper. For instance, where journals
capitalise major title words this is not preserved. The title is enclosed in
double quotes, and may be continued over several lines as necessary. The title
lines are terminated by a semicolon. The title lines from the example are:
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
Greek letters in titles are spelled out; for example, a title in an entry 
would contain "kappa-immunoglobulin" even though the letter itself may be
present in the original title. Similar simplifications have been made in 
other cases (e.g. subscripts and superscripts). Note that the RT line of
a citation which has no title (such as a submission to the database) contains
only a semicolon.
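
A minimal Python sketch for recovering the title from one or more RT lines
follows. The function name parse_rt is hypothetical; an empty string is
returned for a citation without a title.

```python
def parse_rt(lines):
    """Join the RT lines of one reference and strip the quotes and semicolon."""
    text = " ".join(line[5:].strip() for line in lines if line.startswith("RT   "))
    if text.endswith(";"):
        text = text[:-1]  # terminating semicolon
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1]  # enclosing double quotes
    return text


rt_lines = [
    'RT   "Nucleotide and derived amino acid sequence of the cyanogenic',
    'RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";',
]
print(parse_rt(rt_lines))
print(repr(parse_rt(["RT   ;"])))  # ''
```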

3.4.10.8  The RL Line
The RL (Reference Location) line contains the conventional citation 
information for the reference.  In general, the RL lines alone are 
sufficient to find the paper in question. They include the journal,
volume number, page range and year for each paper. 
Journal names are abbreviated according to existing ISO standards 
(International Standard Serial Number).
The format for the location lines is:
     RL   journal vol:pp-pp(year).
Thus, the reference location line in the example is:
     RL   Plant Mol. Biol. 17:209-219(1991).
Very occasionally a journal is encountered which does not consecutively 
number pages within a volume, but rather starts the numbering anew for
each issue number. In this case the issue number must be included, and the 
format becomes:
     RL   journal vol(no):pp-pp(year).
 
If a paper is in press, the RL line will appear with such information as 
we have available, the missing items appearing as zeros. For example:
     RL   Nucleic Acids Res. 0:0-0(2004).
This indicates a paper which will be published in Nucleic Acids Research at some
point in 2004, for which we have no volume or page information. Such references
are updated to include the missing information when it becomes available.
Another variation of the RL line is used for papers found in books 
or other similar publications, which are cited as shown below:
     RA   Birnstiel M., Portmann R., Busslinger M., Schaffner W.,
     RA   Probst E., Kressmann A.;
     RT   "Functional organization of the histone genes in the
     RT   sea urchin Psammechinus:  A progress report";
     RL   (in) Engberg J., Klenow H., Leick V. (Eds.);
     RL   SPECIFIC EUKARYOTIC GENES:117-132;
     RL   Munksgaard, Copenhagen (1979).
Note specifically that the line where one would normally encounter the 
journal location is replaced with lines giving the bibliographic citation
of the book. The first RL line in this case contains the designation "(in)",
which indicates that this is a book reference.
The following examples illustrate RL line formats that are used for data
submissions:
     RL   Submitted (19-NOV-1990) to the INSDC.
     RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW
     RL   CASTLE UPON TYNE, NE2  4HH, UK
The submitter address is always included in new entries, but some older 
submissions do not have this information. 
RL lines take another form for thesis references. 
For example:
     RL   Thesis (1999), Department of Genetics,
     RL   University of Cambridge, Cambridge, U.K.
For an unpublished reference, the RL line takes the following form:
     RL   Unpublished.
Patent references have the following form:
     RL   Patent number EP0238993-A/3, 30-SEP-1987.
     RL   BAYER AG.
The words "Patent number" are followed by the patent application number, the
patent type (separated by a hyphen), the sequence's serial number within the
patent (separated by a slash) and the patent application date. The subsequent RL
lines list the patent applicants, normally company names.
Finally, for journal publications where no ISSN number is available for the
journal (proceedings and abstracts, for example), the RL line contains the
designation "(misc)" as in the following example.
     RL   (misc) Proc. Vth Int. Symp. Biol. Terr. Isopods 2:365-380(2003).
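
The RL variants described above can be told apart from the first RL line of
a reference. The Python sketch below is an illustration only: the names
JOURNAL_RE, classify_rl and parse_journal_rl are hypothetical, the category
labels are chosen for this example, and the regular expression is an
assumption based solely on the formats shown in this section.

```python
import re

# journal vol:pp-pp(year).  or  journal vol(no):pp-pp(year).
JOURNAL_RE = re.compile(
    r"^(?P<journal>.+?) (?P<volume>\S+?)(?:\((?P<issue>[^)]+)\))?"
    r":(?P<first>\w+)-(?P<last>\w+)\((?P<year>\d{4})\)\.$"
)


def classify_rl(first_line):
    """Return a label for the kind of reference, based on the first RL line."""
    data = first_line[5:].strip()
    if data.startswith("Submitted ("):
        return "submission"
    if data.startswith("(in) "):
        return "book"
    if data.startswith("Thesis ("):
        return "thesis"
    if data == "Unpublished.":
        return "unpublished"
    if data.startswith("Patent number "):
        return "patent"
    if data.startswith("(misc) "):
        return "misc"
    if JOURNAL_RE.match(data):
        return "journal"
    return "unknown"


def parse_journal_rl(first_line):
    """Return the journal citation fields as a dict, or None if not a journal line."""
    data = first_line[5:].strip()
    if data.startswith("(misc) "):
        data = data[len("(misc) "):]
    match = JOURNAL_RE.match(data)
    return match.groupdict() if match else None


print(classify_rl("RL   Plant Mol. Biol. 17:209-219(1991)."))
print(parse_journal_rl("RL   Plant Mol. Biol. 17:209-219(1991)."))
```

A paper in press, such as "Nucleic Acids Res. 0:0-0(2004).", is also
reported as a journal reference by this sketch, with volume and pages of
zero.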

3.4.11  The DR Line
The DR (Database Cross-reference) line cross-references other databases which
contain information related to the entry in which the DR line appears. For
example, if an annotated/assembled sequence in ENA is cited in the IMGT/LIGM
database there will be a DR line pointing to the relevant IMGT/LIGM entry.
The format of the DR line is as follows:
     DR   database_identifier; primary_identifier; secondary_identifier.
The first item on the DR line, the database identifier, is the abbreviated 
name of the data collection to which reference is made.
The second item on the DR line, the primary identifier, is a pointer to 
the entry in the external database to which reference is being made.
The third item on the DR line is the secondary identifier, if available, from
the referenced database.
An example of a DR line is shown below:
DR   MGI; 98599; Tcrb-V4.
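
A DR line can be split into its items as in the following Python sketch. The
function name parse_dr is hypothetical; the secondary identifier is returned
as None when it is absent.

```python
def parse_dr(line):
    """Split a DR line into (database, primary_identifier, secondary_identifier)."""
    data = line[5:].strip()
    if data.endswith("."):
        data = data[:-1]  # terminating full stop
    fields = [field.strip() for field in data.split(";")]
    database, primary = fields[0], fields[1]
    secondary = fields[2] if len(fields) > 2 and fields[2] else None
    return database, primary, secondary


print(parse_dr("DR   MGI; 98599; Tcrb-V4."))  # ('MGI', '98599', 'Tcrb-V4')
```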

3.4.12   The AH Line (in TPA and TSA records only)
Third Party Annotation (TPA) and Transcriptome Shotgun Assembly (TSA) records
may include information on the composition of their sequences to show
which spans originated from which contributing primary sequences. The AH
(Assembly Header) line provides column headings for the assembly information.
The lines contain no data and may be ignored by computer programs.
The AH line format is:
AH   LOCAL_SPAN     PRIMARY_IDENTIFIER     PRIMARY_SPAN     COMP 

3.4.13   The AS Line (in TPA and TSA records)
The AS (ASsembly Information) lines provide information on the composition of 
a TPA or TSA sequence. These lines include information on local sequence spans
(those spans seen in the sequence of the entry showing the AS lines) plus
identifiers and base spans of contributing primary sequences (for ENA
primary entries only).
    
a) LOCAL_SPAN               base span on local sequence shown in entry
b) PRIMARY_IDENTIFIER       acc.version of contributing ENA sequence(s)
                            or trace identifier for ENA read(s)
c) PRIMARY_SPAN             base span on contributing ENA primary
                            sequence or not_available for ENA read(s)
d) COMP                     'c' is used to indicate that contributing sequence
                            originates from complementary strand in primary
                            entry
                                            
Example:
AH   LOCAL_SPAN     PRIMARY_IDENTIFIER     PRIMARY_SPAN     COMP
AS   1-426          AC004528.1             18665-19090         
AS   427-526        AC001234.2             1-100            c
AS   527-1000       TI55475028             not_available
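
The four columns can be read with a sketch such as the Python below. The
function name parse_as and the dictionary keys are hypothetical; the sketch
assumes the columns are separated by blanks and that COMP, when present, is
the fourth field.

```python
def parse_as(line):
    """Parse one AS line into a dict describing the local and primary spans."""
    fields = line[5:].split()
    local_start, local_end = (int(value) for value in fields[0].split("-"))
    if fields[2] == "not_available":
        primary_span = None  # no span is given for reads
    else:
        start, end = fields[2].split("-")
        primary_span = (int(start), int(end))
    return {
        "local_span": (local_start, local_end),
        "primary_identifier": fields[1],
        "primary_span": primary_span,
        "complement": len(fields) > 3 and fields[3] == "c",
    }


as_lines = [
    "AS   1-426          AC004528.1             18665-19090         ",
    "AS   427-526        AC001234.2             1-100            c",
    "AS   527-1000       TI55475028             not_available",
]
for line in as_lines:
    print(parse_as(line))
```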

3.4.14 The CO Line (in CON records only)
Con(structed) sequences in the CON data classes represent complete
chromosomes, genomes and other long sequences constructed from segment entries.
CON data class entries do not contain sequence data per se, but rather the
assembly information on all accession.versions and sequence locations relevant
to building the constructed sequence. The assembly information is represented in
the CO lines.
Example:
CO   join(Z99104.1:1..213080,Z99105.1:18431..221160,Z99106.1:13061..209100, 
CO   Z99107.1:11151..213190,Z99108.1:11071..208430,Z99109.1:11751..210440, 
CO   Z99110.1:15551..216750,Z99111.1:16351..208230,Z99112.1:4601..208780, 
CO   Z99113.1:26001..233780,Z99114.1:14811..207730,Z99115.1:12361..213680, 
CO   Z99116.1:13961..218470,Z99117.1:14281..213420,Z99118.1:17741..218410, 
CO   Z99119.1:15771..215640,Z99120.1:16411..217420,Z99121.1:14871..209510, 
CO   Z99122.1:11971..212610,Z99123.1:11301..212150,Z99124.1:11271..215534) 
 
Gaps of undefined length are represented using the expression 'gap(unk100)'.
These gaps contribute to the sequence length for the entry (as shown in the
ID line).
Example: CO   join(AL358912.1:1..39187,gap(unk100),AL137130.1:1..40815,... 
Gaps of defined length are represented via 'gap(#)' where # is the 
gap length. These gaps also contribute to the sequence length for the entry (as
shown in the ID line).
Example: CO   AE005330.1:61..14164,AE005331.1:61..3773,gap(4001),...
Below are the relevant sections of a Bacillus subtilis CON entry providing 
construct information for the assembly of the Bacillus subtilis genome.  
    
ID   AL009126; SV 2; circular; genomic DNA; CON; PRO; 4214630 BP.
XX
AC   AL009126;
XX
DT   18-JUL-2002 (Rel. 72, Created)
DT   07-JUL-2003 (Rel. 76, Last updated, Version 3)
XX
DE   Bacillus subtilis complete genome.
XX
KW   complete genome.
XX
OS   Bacillus subtilis subsp. subtilis str. 168
OC   Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
...
CITATION INFORMATION
...
FH   Key             Location/Qualifiers
FH
FT   source          1..4214630
FT                   /organism="Bacillus subtilis subsp. subtilis str. 168"
FT                   /strain="168"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:224308"
XX
CO   join(Z99104.2:1..213080,Z99105.2:51..202768,Z99106.2:31..195912,
CO   Z99107.2:51..202089,Z99108.2:51..197409,Z99109.2:41..198743,
CO   Z99110.2:41..201241,Z99111.2:41..191980,Z99112.2:41..204263,
CO   Z99113.2:41..207829,Z99114.2:41..192961,Z99115.2:51..201375,
CO   Z99116.2:31..204537,Z99117.2:31..199173,Z99118.2:31..200707,
CO   Z99119.2:51..199922,Z99120.2:51..201059,Z99121.2:51..194692,
CO   Z99122.2:51..200690,Z99123.2:31..201139,Z99124.2:51..203901)
//
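
Since segments and gaps both contribute to the sequence length, the length
of a constructed sequence can be recomputed from its CO lines. The Python
sketch below illustrates this. The names SEGMENT_RE and constructed_length
are hypothetical, the accession pattern is an assumption based on the
examples above, and gap(unk100) is assumed to contribute 100 bases. The
sample CO line is a shortened, made-up line built from the first segments of
the gap example above.

```python
import re

# Either a gap, "gap(unk100)" or "gap(4001)", or a segment "acc.version:start..end".
SEGMENT_RE = re.compile(r"gap\((?:unk)?(\d+)\)|([A-Za-z0-9_]+\.\d+):(\d+)\.\.(\d+)")


def constructed_length(co_lines):
    """Sum the lengths of all segments and gaps named in the CO lines of one entry."""
    text = "".join(line[5:].strip() for line in co_lines if line.startswith("CO   "))
    total = 0
    for match in SEGMENT_RE.finditer(text):
        if match.group(1) is not None:
            total += int(match.group(1))  # gap length
        else:
            total += int(match.group(4)) - int(match.group(3)) + 1  # inclusive span
    return total


sample = ["CO   join(AL358912.1:1..39187,gap(unk100),AL137130.1:1..40815)"]
print(constructed_length(sample))  # 80102
```

Applied to the 21 segments of the Bacillus subtilis CON entry above, the sum
of the spans is 4214630, which agrees with the length given in the ID line.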

3.4.15  The FH Line
The FH (Feature Header) lines are present only to improve readability of
an entry when it is printed or displayed on a terminal screen. The lines 
contain no data and may be ignored by computer programs. The format of these
lines is always the same:
     FH   Key             Location/Qualifiers
     FH
The first line provides column headings for the feature table, and the second
line serves as a spacer. If an entry contains no feature table 
(i.e. no FT lines - see below), the FH lines will not appear.

3.4.16 The FT Line
The FT (Feature Table) lines provide a mechanism for the annotation of the
sequence data. Regions or sites in the sequence which are of interest are
listed in the table. In general, the features in the feature table represent
signals or other characteristics reported in the cited references. In some
cases, ambiguities or features noted in the course of data preparation have 
been included.  The feature table is subject to expansion or change as more
becomes known about a given sequence.
Feature Table Definition Document:
A complete and definitive description of the feature table is given 
in the document "The DDBJ/ENA/GenBank Feature Table:  Definition". 
URL: ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/FT_current.txt
Much effort is expended in the design of the feature table to try to
ensure that it will be self-explanatory to the human reader, and we therefore
expect that the official definition document will be of interest mainly
to software developers rather than to end-users of the database.
A browser derived from the document is provided to assist users in navigating
and composing feature table representations at
http://www.ebi.ac.uk/ena/WebFeat/.

3.4.17  The SQ Line
The SQ (SeQuence header) line marks the beginning of the sequence data and 
gives a summary of its content. An example is:
     SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other; 
As shown, the line contains the length of the sequence in base pairs followed
by its base composition.  Bases other than A, C, G and T are grouped 
together as "other". (Note that "BP" is also used for single stranded RNA
sequences, which is not strictly accurate, but has been used for consistency
of format.) This information can be used as a check on accuracy or for
statistical purposes. The word "Sequence" is present solely as a marker for
readability.
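
As an example of using the SQ line as an accuracy check, the Python sketch
below parses the line and verifies that the base counts add up to the stated
length. The names SQ_RE, parse_sq and composition_is_consistent are
hypothetical, and the pattern is an assumption based on the example line
above.

```python
import re

SQ_RE = re.compile(
    r"^SQ   Sequence (\d+) BP; (\d+) A; (\d+) C; (\d+) G; (\d+) T; (\d+) other;"
)


def parse_sq(line):
    """Return the length and base composition given on an SQ line."""
    match = SQ_RE.match(line)
    if match is None:
        raise ValueError("not an SQ line: %r" % line)
    keys = ("length", "A", "C", "G", "T", "other")
    return dict(zip(keys, (int(value) for value in match.groups())))


def composition_is_consistent(sq):
    """True if the A, C, G, T and other counts sum to the sequence length."""
    return sq["A"] + sq["C"] + sq["G"] + sq["T"] + sq["other"] == sq["length"]


sq = parse_sq("SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;")
print(sq["length"], composition_is_consistent(sq))  # 1859 True
```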

3.4.18 The Sequence Data Line
The sequence data line has a line code consisting of two blanks. The sequence
is written 60 bases per line, in groups of 10 bases separated by a blank
character, beginning at position 6 of the line. The direction listed is 
always 5' to 3', and wherever possible the non-coding strand 
(homologous to the message) has been stored. Columns 73-80 of each 
sequence line contain base numbers for easier reading and quick 
location of regions of interest. The numbers are right justified and indicate
the number of the last base on each line.
An example of a data line is:
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
      
The characters used for the bases correspond to the IUPAC-IUB 
Commission recommendations (see appendices).
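
A sequence data line can be reduced to its bases with the Python sketch
below. The function name parse_sequence_line is hypothetical. Rather than
relying on fixed column positions, the sketch splits on blanks and treats
the final field as the base number, which assumes that the number is always
present.

```python
def parse_sequence_line(line):
    """Return (bases, last_base_number) for one sequence data line."""
    fields = line.split()
    last_base = int(fields[-1])   # right-justified number of the last base
    bases = "".join(fields[:-1])  # groups of 10 bases, blanks removed
    return bases, last_base


line = "     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60"
bases, last_base = parse_sequence_line(line)
print(len(bases), last_base)  # 60 60
```

Concatenating the bases of every sequence data line of an entry, in order,
gives the full sequence; the base number of the final line should then equal
the length given on the SQ line.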

3.4.19  The CC Line
CC lines are free text comments about the entry, and may be used to convey 
any sort of information thought to be useful that is unsuitable for
inclusion in other line types.
                             
3.4.20 The XX Line
The XX (spacer) line contains no data or comments. Its purpose is to make 
an entry easier to read on a page or terminal screen by setting off the 
various types of information in appropriate groupings. XX is used
instead of blank lines to avoid confusion with the sequence data lines. 
The XX lines can always be ignored by computer programs.

3.4.21 The // Line
The // (terminator) line also contains no data or comments. It designates 
the end of an entry.
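
Taken together, the line code, XX and // conventions allow a flat file to be
split into entries with very little code. The Python sketch below is one way
to do so. The function name iter_entries is hypothetical, the sample lines
are invented placeholders rather than real entry content, and any lines
after the last // line are discarded.

```python
def iter_entries(lines):
    """Yield each entry as a list of (line_code, data) pairs, skipping XX lines."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry  # the terminator line closes the current entry
            entry = []
            continue
        code, data = line[:2], line[5:]
        if code == "XX":
            continue  # spacer lines carry no data
        entry.append((code, data))


sample = [
    "ID   hypothetical entry one",
    "XX",
    "DE   first description",
    "//",
    "ID   hypothetical entry two",
    "//",
]
entries = list(iter_entries(sample))
print(len(entries))  # 2
```

Sequence data lines are kept with a line code of two blanks, so they can be
told apart from all other line types.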
                     
                                   APPENDIX A
                              STANDARD BASE CODES

These are the official IUPAC-IUB single-letter base codes (reference 1 below).

     Code      Base Description
     ----      --------------------------------------------------------------
     G         Guanine
     A         Adenine
     T         Thymine
     C         Cytosine
     R         Purine               (A or G)
     Y         Pyrimidine           (C or T or U)
     M         Amino                (A or C)
     K         Ketone               (G or T)
     S         Strong interaction   (C or G)
     W         Weak interaction     (A or T)
     H         Not-G                (A or C or T) H follows G in the alphabet
     B         Not-A                (C or G or T) B follows A
     V         Not-T (not-U)        (A or C or G) V follows U
     D         Not-C                (A or G or T) D follows C
     N         Any                  (A or C or G or T)
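
The table above translates directly into a lookup structure. The Python
sketch below is an illustration only: the names IUPAC_CODES, code_matches
and count_other are hypothetical, and U is left out of the lookup as a
simplifying assumption, so only A, C, G and T are used.

```python
# Each code maps to the unambiguous bases it can stand for.
IUPAC_CODES = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}


def code_matches(code, base):
    """True if the given IUPAC code can stand for the given base."""
    return base.upper() in IUPAC_CODES[code.upper()]


def count_other(sequence):
    """Count symbols other than A, C, G and T in a sequence without blanks."""
    return sum(1 for symbol in sequence.upper() if symbol not in "ACGT")


print(code_matches("R", "a"), code_matches("H", "g"))  # True False
print(count_other("acgtnnry"))  # 4
```

The count returned by count_other corresponds to the "other" figure on the
SQ line.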


                                APPENDIX B
                            MODIFIED BASE CODES

The following table is taken from Sprinzl M.  and Gauss D.H.
(reference 2 below). The codes appear in database entries as values for the
/mod_base qualifier in the feature table.

        Code            Modified Base
        ----            ------------------------------------------------------------
        ac4c            4-acetylcytidine
        chm5u           5-(carboxyhydroxylmethyl)uridine
        cm              2'-O-methylcytidine
        cmnm5s2u        5-carboxymethylaminomethyl-2-thiouridine
        cmnm5u          5-carboxymethylaminomethyluridine
        dhu             dihydrouridine
        fm              2'-O-methylpseudouridine
        gal q           beta-D-galactosylqueuosine
        gm              2'-O-methylguanosine
        i               inosine
        i6a             N6-isopentenyladenosine
        m1a             1-methyladenosine
        m1f             1-methylpseudouridine
        m1g             1-methylguanosine
        m1i             1-methylinosine
        m22g            2,2-dimethylguanosine
        m2a             2-methyladenosine
        m2g             2-methylguanosine
        m3c             3-methylcytidine
        m4c             N4-methylcytosine
        m5c             5-methylcytidine
        m6a             N6-methyladenosine
        m7g             7-methylguanosine
        mam5u           5-methylaminomethyluridine
        mam5s2u         5-methylaminomethyl-2-thiouridine
        man q           beta-D-mannosylqueuosine
        mcm5s2u         5-methoxycarbonylmethyl-2-thiouridine
        mcm5u           5-methoxycarbonylmethyluridine
        mo5u            5-methoxyuridine
        ms2i6a          2-methylthio-N6-isopentenyladenosine
        ms2t6a          N-((9-beta-D-ribofuranosyl-2-methylthiopurin-6-yl)carbamoyl)threonine
        mt6a            N-((9-beta-D-ribofuranosylpurine-6-yl)N-methyl-carbamoyl)threonine
        mv              uridine-5-oxyacetic acid methylester
        o5u             uridine-5-oxyacetic acid (v)
        osyw            wybutoxosine
        p               pseudouridine
        q               queuosine
        s2c             2-thiocytidine
        s2t             5-methyl-2-thiouridine
        s2u             2-thiouridine
        s4u             4-thiouridine
        m5u             5-methyluridine
        t6a             N-((9-beta-D-ribofuranosylpurine-6-yl)carbamoyl)threonine
        tm              2'-O-methyl-5-methyluridine
        um              2'-O-methyluridine
        yw              wybutosine
        x               3-(3-amino-3-carboxypropyl)uridine, (acp3)u
        OTHER           (requires /note= qualifier)
		               


                                   APPENDIX C
                    REFERENCES FOR ABBREVIATIONS AND SYMBOLS


     1.  Cornish-Bowden A., Nucl. Acids Res. 13:3021-3030(1985).
     2.  Sprinzl M., and Gauss D.H., "Compilation of tRNA Sequences",
         Nucl. Acids Res. 10:r1-r55(1982).

                                

Revised: 03-APRIL-2019