About Sequence Formats


Sequence formats are simply the ways in which an amino acid or DNA sequence is recorded in a computer file. Different programs expect different formats, so to submit a job successfully it is important to understand what the various formats look like and what their basic structure is. The job submission forms are fairly flexible but cannot cope with too much inconsistency.

You can submit sequences to the search and analysis programs in any of the formats listed in the options of your chosen tool.

If you are submitting several sequences to clustalw or pratt, you may use the normal format, as described below; just make sure that the sequences follow one another and are separated by the format's separator. In the case of EMBL format this would be '//'.

To help with converting sequences to an appropriate format, please use the following link: READSEQ.

Examples of Sequence Formats:

Click here to see a complete list of sequence formats supported by EMBOSS applications.

ALN/ClustalW format:

ALN format originated in the alignment program ClustalW. The file starts with the word "CLUSTAL", followed by some information about which Clustal program was run and the version of Clustal used.
e.g. "CLUSTAL W (1.82) multiple sequence alignment"
The type of clustal program is "W" and the version is 1.82.
The alignment is written in blocks of 60 residues.
Each line in a block starts with the sequence name, obtained from the input sequence, and ends with a count of the total number of residues written so far for that sequence.
The information about which residues match is shown below each block of residues:
"*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.
":" means that conserved substitutions have been observed.
"." means that semi-conserved substitutions are observed.

An example is shown below.
CLUSTAL W (1.82) multiple sequence alignment


FOSB_MOUSE      MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
FOSB_HUMAN      MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA 60
                ************************************************************

FOSB_MOUSE      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS 120
FOSB_HUMAN      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS 120
                ********************************.***************:*.**:******
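
The block structure described above can be read back into whole sequences with a few lines of Python. This is only a minimal sketch: the function name parse_clustal is hypothetical, and it assumes that match lines always begin with whitespace and that sequence names contain no spaces.

```python
def parse_clustal(text):
    """Return {sequence name: aligned residues} from ALN/ClustalW text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("an ALN file must start with the word CLUSTAL")
    alignment = {}
    for line in lines[1:]:
        # Blank lines separate blocks; match lines ("*", ":", ".") start
        # with whitespace because they carry no sequence name.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[0] is the name, fields[1] the residues of this block,
        # and fields[2] (if present) the running residue count.
        name, residues = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + residues
    return alignment
```

Called on the text of an ALN file, the function returns one entry per sequence name, with the residues of all blocks joined in order.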

AMPS Block file format:

The first part of a block-file contains the identifier codes of the sequences that are to follow. Each code is prefixed by the > symbol, and codes must not contain spaces, e.g.
>HAHU
>Trypsin
>A0046
>Seq1


etc.

The ">" symbols are counted from the beginning of the file until a * symbol is found. The * signals the beginning of the multiple alignment, which is stored VERTICALLY: columns are individual sequences, whilst rows are aligned positions. The * symbol must lie over the first sequence. A further * in the same column signals the end of the alignment. The software then uses the number of ">" symbols at the beginning of the file to work out how many columns to read from the * position. It is therefore important that the only ">" symbols in the file are those that define the identifiers, and that the only * symbols are those defining the start and end of the multiple alignment. A simple, small block-file is shown below.
>Seq_1
>A0231
>HAHU
>Four_Alpha
>Globin
>GLobin_C
*
ARNDLQ
AAAAAA
PPPPPP
PP PPP
WW WWW
LLLLLL
IIVVLL
*
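
Because the alignment is stored vertically, reading a block-file amounts to transposing its rows. The following Python sketch illustrates the idea; the function name parse_blockfile is hypothetical, and it assumes that a space in a row stands for a gap and that identifiers are unique.

```python
def parse_blockfile(text):
    """Return {identifier: sequence} from an AMPS block-file."""
    lines = text.splitlines()
    names = []
    i = 0
    # Identifier section: collect ">" codes until the first "*" line.
    while i < len(lines) and not lines[i].startswith("*"):
        if lines[i].startswith(">"):
            names.append(lines[i][1:].strip())
        i += 1
    if i == len(lines):
        raise ValueError("no '*' line marking the start of the alignment")
    i += 1  # step past the opening "*"
    columns = [[] for _ in names]
    # Each row is one aligned position; column k belongs to sequence k.
    while i < len(lines) and not lines[i].startswith("*"):
        row = lines[i].ljust(len(names))  # pad rows that end in gaps
        for k in range(len(names)):
            columns[k].append(row[k])
        i += 1
    return {name: "".join(col) for name, col in zip(names, columns)}
```

For the small block-file above, the entry for Seq_1 is built from the first character of every row and the entry for GLobin_C from the sixth.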

Codata Format:

The first line starts with the text "ENTRY". The end of a sequence is delineated by "///". The "SEQUENCE" line specifies the beginning of the sequence lines (starting on the next line), and no sequence is assumed to appear in the entry if the "SEQUENCE" line is missing.

			
ENTRY           IXI_234 
SEQUENCE        
                 5        10        15        20        25        30
      1 T S P A S I R P P A G P S S R P A M V S S R R T R P S P P G
     31 P R R P T G R P C C S A A P R R P Q A T G G W K T C S G T C
     61 T T S T S T R H R G R S G W S A R T T T A A C L R A S R K S
     91 M R A A C S R S A G S R P N R F A P T L M S S C I T S T T G
    121 P P A W A G D R S H E
///
ENTRY           IXI_235 
SEQUENCE        
                 5        10        15        20        25        30
      1 T S P A S I R P P A G P S S R - - - - - - - - - R P S P P G
     31 P R R P T G R P C C S A A P R R P Q A T G G W K T C S G T C
     61 T T S T S T R H R G R S G W - - - - - - - - - - R A S R K S
     91 M R A A C S R S A G S R P N R F A P T L M S S C I T S T T G
    121 P P A W A G D R S H E
///
ENTRY           IXI_236 
SEQUENCE        
                 5        10        15        20        25        30
      1 T S P A S I R P P A G P S S R P A M V S S R - - R P S P P P
     31 P R R P P G R P C C S A A P P R P Q A T G G W K T C S G T C
     61 T T S T S T R H R G R S G W S A R T T T A A C L R A S R K S
     91 M R A A C S R - - G S R P P R F A P P L M S S C I T S T T G
    121 P P P P A G D R S H E
///
ENTRY           IXI_237 
SEQUENCE        
                 5        10        15        20        25        30
      1 T S P A S L R P P A G P S S R P A M V S S R R - R P S P P G
     31 P R R P T - - - - C S A A P R R P Q A T G G Y K T C S G T C
     61 T T S T S T R H R G R S G Y S A R T T T A A C L R A S R K S
     91 M R A A C S R - - G S R P N R F A P T L M S S C L T S T T G
    121 P P A Y A G D R S H E
///
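
As an illustration of the rules above, the Python sketch below collects the residues of each entry. The function name parse_codata is hypothetical, and the sketch assumes that every purely numeric field (the ruler line and the position at the start of each data line) can be discarded.

```python
def parse_codata(text):
    """Return {entry name: sequence} from Codata-formatted text."""
    entries = {}
    name, in_sequence, residues = None, False, []
    for line in text.splitlines():
        if line.startswith("ENTRY"):
            name, in_sequence, residues = line.split()[1], False, []
        elif line.startswith("SEQUENCE"):
            in_sequence = True  # sequence lines start on the next line
        elif line.startswith("///"):
            if name is not None:
                entries[name] = "".join(residues)
            name, in_sequence = None, False
        elif in_sequence:
            # Drop numeric fields: the ruler and the leading position.
            residues.extend(f for f in line.split() if not f.isdigit())
    return entries
```

An entry without a "SEQUENCE" line ends up with an empty sequence, in line with the description above. Gap characters ("-") are kept as they appear.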






EMBL Format:

The entries in the EMBL database (such as the one below) are structured so as to be usable by human readers as well as by computer programs. Each entry is composed of lines. Different types of lines, each with its own format, are used to record the various types of data that make up the entry. Some entries do not contain all of the line types, and some line types occur many times in a single entry. Each entry begins with an identification line (ID) and ends with a terminator line (//). Consult the EMBL user manual for a more comprehensive guide.

  • The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

    Term   ID   entryname   dataclass   molecule   division   sequencelength (Base Pairs)
    e.g.   ID   MMFOSB      standard    RNA        MUS        4145 BP

  • The XX line contains no data or comments. It is used instead of blank lines to avoid confusion with the sequence data lines.

  • The AC (Accession Number) line lists the accession numbers associated with this entry.

  • The SV (Sequence Version) line contains the new format of the nucleotide sequence identifier.

  • The DT (DaTe) line shows when an entry first appeared in the database and when it was last updated.

  • The DE (DEscription) lines contain general descriptive information about the sequence stored.

  • The KW (KeyWord) lines provide information which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important. The keywords chosen for each entry serve as a subject reference for the sequence, and will be expanded as work with the database continues. Often several KW lines are necessary for a single entry.

  • The OS (Organism Species) line specifies the preferred scientific name of the organism which was the source of the stored sequence.

  • The OC (Organism Classification) lines contain the taxonomic classification of the source organism.

  • The RN (Reference Number) line gives a unique number to each reference citation within an entry.

  • The RC (Reference Comment) line type is an optional line type which appears if the reference has a comment.

  • The RP (Reference Position) line type is an optional line type which appears if one or more contiguous base spans of the presented sequence can be attributed to the reference in question.

  • The RX (Reference Cross-reference) line type is an optional line type which contains a cross-reference to an external citation or abstract database.

  • The RA (Reference Author) lines list the authors of the paper (or other work) cited.

  • The RT (Reference Title) lines give the title of the paper (or other work).

  • The RL (Reference Location) line contains the conventional citation information for the reference.


  • The DR (Database Cross-Reference) line cross-references other databases which contain information related to the entry in which the DR line appears.


  • The CC lines are free text comments about the entry, and may be used to convey any sort of information thought to be useful.


  • The FH (Feature Header) lines are present only to improve readability of an entry when it is printed or displayed on a terminal screen. The lines contain no data and may be ignored by computer programs.


  • The FT (Feature Table) lines provide a mechanism for the annotation of the sequence data. Regions or sites in the sequence which are of interest are listed in the table.
  • A complete and definitive description of the feature table is given here.

  • The SQ (SeQuence header) line marks the beginning of the sequence data and gives a summary of its content.


  • Each sequence data line has a line code consisting of two blanks. The sequence is written 60 bases per line, in groups of 10 bases separated by a blank character, beginning in position 6 of the line. The direction listed is always 5' to 3'.


  • The // (terminator) line also contains no data or comments. It designates the end of an entry.



ID   MMFOSB     standard; RNA; MUS; 4145 BP.
XX
AC   X14897;
XX
SV   X14897.1
XX
DT   23-NOV-1989 (Rel. 21, Created)
DT   12-SEP-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Mouse fosB mRNA
XX
KW   fos cellular oncogene; fosB oncogene; oncogene.
XX
OS   Mus musculus (house mouse)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
XX
RN   [1]
RP   1-4145
RX   MEDLINE; 89251612.
RA   Zerial M., Toschi L., Ryseck R.P., Schuermann M., Mueller R., Bravo R.;
RT   "The product of a novel growth factor activated gene, fos B, interacts with
RT   JUN proteins enhancing their DNA binding activity";
RL   EMBO J. 8:805-813(1989).
XX
DR   MGD; MGI:95575; Fosb.
DR   SWISS-PROT; P13346; FOSB_MOUSE.
DR   TRANSFAC; T00291; T00291.
XX
CC   clone=AC113-1; cell line=NIH3T3;
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..4145
FT                   /db_xref="taxon:10090"
FT                   /organism="Mus musculus"
FT   CDS             1202..2218
FT                   /db_xref="SWISS-PROT:P13346"
FT                   /note="fosB protein (AA 1-338)"
FT                   /protein_id="CAA33026.1"
FT                   /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECA
FT                   GLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSY
FT                   STPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRE
FT                   RNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGC
FT                   KIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLF
FT                   THSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL"
XX
SQ   Sequence 4145 BP; 960 A; 1186 C; 1007 G; 991 T; 1 other;
     ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca        60
     aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa       120
     actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt       180
     gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa       240
     aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta       300
     tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca       360
     gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata       420
     gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat       480
     tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga       540
     aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca       600
     ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa       660
     agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc       720
     attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca       780
     gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact       840
     ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca       900
     ccgcggcact gcccggcggg tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa       960
     accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt      1020
     gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg      1080
     agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc      1140
     catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga      1200
     aatgtttcaa gcttttcccg gagactacga ctccggctcc cggtgtagct catcaccctc      1260
     cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc      1320
     ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc      1380
     aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc      1440
     ccagtcccag gggcagccac tggcctccca gcctccagct gttgaccctt atgacatgcc      1500
     aggaaccagc tactcaaccc caggcctgag tgcctacagc actggcgggg caagcggaag      1560
     tggtgggcct tcaaccagca caaccaccag tggacctgtg tctgcccgtc cagccagagc      1620
     caggcctaga agaccccgag aagagacact taccccagaa gaagaagaaa agcgaagggt      1680
     tcgcagagag cggaacaagc tggctgcagc taagtgcagg aaccgtcgga gggagctgac      1740
     agatcgactt caggcggaaa ctgatcagct tgaagaggaa aaggcagagc tggagtcgga      1800
     gatcgccgag ctgcaaaaag agaaggaacg cctggagttt gtcctggtgg cccacaaacc      1860
     gggctgcaag atcccctacg aagaggggcc ggggccaggc ccgctggccg aggtgagaga      1920
     tttgccaggg tcaacatccg ctaaggaaga cggcttcggc tggctgctgc cgccccctcc      1980
     accacccccc ctgcccttcc agagcagccg agacgcaccc cccaacctga cggcttctct      2040
     ctttacacac agtgaagttc aagtcctcgg cgaccccttc cccgttgtta gcccttcgta      2100
     cacttcctcg tttgtcctca cctgcccgga ggtctccgcg ttcgccggcg cccaacgcac      2160
     cagcggcagc gagcagccgt ccgacccgct gaactcgccc tcccttcttg ctctgtaaac      2220
     tctttagaca aacaaaacaa acaaacccgc aaggaacaag gaggaggaag atgaggagga      2280
     gaggggagga agcagtccgg gggtgtgtgt gtggaccctt tgactcttct gtctgaccac      2340
     ctgccgcctc tgccatcgga catgacggaa ggacctcctt tgtgttttgt gctccgtctc      2400
     tggttttctg tgccccggcg agaccggaga gctggtgact ttggggacag ggggtggggc      2460
     ggggatggac acccctcctg catatctttg tcctgttact tcaacccaac ttctggggat      2520
     agatggctgg ctgggtgggt agggtggggt gcaacgccca cctttggcgt cttgcgtgag      2580
     gctggagggg aaagggtgct gagtgtgggg tgcagggtgg gttgaggtcg agctggcatg      2640
     cacctccaga gagacccaac gaggaaatga cagcaccgtc ctgtccttct tttcccccac      2700
     ccacccatcc accctcaagg gtgcagggtg accaagatag ctctgttttg ctccctcggg      2760
     ccttagctga ttaacttaac atttccaaga ggttacaacc tcctcctgga cgaattgagc      2820
     ccccgactga gggaagtcga tgcccccttt gggagtctgc taaccccact tcccgctgat      2880
     tccaaaatgt gaacccctat ctgactgctc agtctttccc tcctgggaaa actggctcag      2940
     gttggatttt tttcctcgtc tgctacagag ccccctccca actcaggccc gctcccaccc      3000
     ctgtgcagta ttatgctatg tccctctcac cctcaccccc accccaggcg cccttggccg      3060
     tcctcgttgg gccttactgg ttttgggcag cagggggcgc tgcgacgccc atcttgctgg      3120
     agcgctttat actgtgaatg agtggtcgga ttgctgggtg cgccggatgg gattgacccc      3180
     cagccctcca aaactttccc tgggcctccc cttcttccac ttgcttcctc cctccccttg      3240
     acagggagtt agactcgaaa ggatgaccac gacgcatccc ggtggccttc ttgctcaggc      3300
     cccagacttt ttctctttaa gtccttcgcc ttccccagcc taggacgcca acttctcccc      3360
     accctgggag ccccgcatcc tctcacagag gtcgaggcaa ttttcagaga agttttcagg      3420
     gctgaggctt tggctcccct atcctcgata tttgaatccc caaatatttt tggactagca      3480
     tacttaagag ggggctgagt tcccactatc ccactccatc caattccttc agtcccaaag      3540
     acgagttctg tcccttccct ccagctttca cctcgtgaga atcccacgag tcagatttct      3600
     attttttaat attggggaga tgggccctac cgcccgtccc ccgtgctgca tggaacattc      3660
     cataccctgt cctgggccct aggttccaaa cctaatccca aaccccaccc ccagctattt      3720
     atccctttcc tggttcccaa aaagcactta tatctattat gtataaataa atatattata      3780
     tatgagtgtg cgtgtgtgtg cgtgtgcgtg cgtgcgtgcg tgcgtgcgag cttccttgtt      3840
     ttcaagtgtg ctgtggagtt caaaatcgct tctggggatt tgagtcagac tttctggctg      3900
     tccctttttg tcaccttttt gttgttgtct cggctcctct ggctgttgga gacagtcccg      3960
     gcctctccct ttatcctttc tcaagtctgt ctcgctcaga ccacttccaa catgtctcca      4020
     ctctcaatga ctctgatctc cggtntgtct gttaattctg gatttgtcgg ggacatgcaa      4080
     ttttacttct gtaagtaagt gtgactgggt ggtagatttt ttacaatcta tatcgttgag      4140
     aattc                                                                  4145
//
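
Since every line starts with a two-character line code, an EMBL entry can be processed by dispatching on that code. The Python sketch below pulls out only the ID, AC, DE and sequence lines; the function name parse_embl and the dictionary keys are hypothetical, and all other line types are simply ignored.

```python
def parse_embl(text):
    """Return a list of dicts, one per EMBL entry."""
    entries, entry = [], None
    for line in text.splitlines():
        # The line code occupies the first two columns; data starts at column 6.
        code, data = line[:2], line[5:].strip()
        if code == "ID":
            entry = {"id": data.split()[0], "accessions": [],
                     "description": "", "sequence": ""}
        elif entry is None:
            continue  # ignore anything before the first ID line
        elif code == "AC":
            entry["accessions"] += data.replace(";", " ").split()
        elif code == "DE":
            entry["description"] = (entry["description"] + " " + data).strip()
        elif code == "//":
            entries.append(entry)
            entry = None
        elif code == "  ":
            # Sequence data line: drop the running base count at the end.
            entry["sequence"] += "".join(
                f for f in data.split() if not f.isdigit())
    return entries
```

XX and FH lines fall through without effect, which matches their role as spacers that carry no data.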




Fasta Format:

  • This format contains a one line header followed by lines of sequence data.
  • Sequences in fasta formatted files are preceded by a line starting with a ">" symbol.
  • The first word on this line is the name of the sequence. The rest of the line is a description of the sequence.

    Term   Entry Name   Molecule Type   Gene Name   Sequence Length
    e.g.   FOSB_MOUSE   Protein         fosB        338 bp

  • The remaining lines contain the sequence itself.
  • Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence.
  • Fasta files containing multiple sequences are just the same, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs.

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
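
The rules listed above translate into a short reader. The Python sketch below is one possible reading of them, not a reference implementation: the function name parse_fasta is hypothetical, and it strips spaces, tabs, dashes, underscores and periods from sequence lines because the list above says they are ignored.

```python
# Characters that the description above says are ignored inside a sequence.
_IGNORED = str.maketrans("", "", " \t-_.")

def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            # First word is the name, the rest of the line the description.
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line.translate(_IGNORED))
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records
```

A program that needs to keep alignment gaps would have to leave the dash and period out of the ignored set.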



GCG/MSF Format

  • The file may begin with as many lines of comment or description as required.
  • The comments are terminated with a line starting with two slashes.
  • The first mandatory line that is recognised as part of the MSF file is the line containing the text "MSF:". This line also includes the sequence length, type and date, plus an internal checksum value.
  • The next line is a mandatory blank line inserted before the sequence names.
  • There then follows one line per sequence giving the sequence name, length, checksum and a weight value. Only one name per line is allowed; the qualifier "Name: " is followed by the sequence name. Names are restricted to 10 characters or fewer. Extra characters between the sequence name and "Len: " are acceptable if they contain no blank characters. Another blank line is added, followed by a line starting with two slashes ("//"), which indicates the end of the name list.
  • There then follows another blank line.
  • Sequences are interleaved on separate lines with gaps represented by periods. Each sequence line starts with the sequence name which is separated from the aligned sequence residues by white space.
				
			
DNA_MULTIPLE_ALIGNMENT 1.0 
Four anthropoidea 
MSF: 50  Type: N  Check: 2666 .. 
// 


Name: Homo_sapiens     Len: 50   Check: 8318   Weight: 1.00 
Name: Pan_paniscus     Len: 50   Check: 7854   Weight: 1.00 
Name: Gorilla_gorilla  Len: 50   Check: 7778   Weight: 1.00 
Name: Pongo_pigmaeus   Len: 50   Check: 8716   Weight: 1.00
//


Homo_sapiens        AGUCGAGUC...GCAGAAAC 
Pan_paniscus        AGUCGCGUCG..GCAGAAAC 
Gorilla_gorilla     AGUCGCGUCG..GCAGAUAC 
Pongo_pigmaeus      AGUCGCGUCGAAGCAGA..C 
Homo_sapiens        GCAUGAC.GACCACAUUUU. 
Pan_paniscus        GCAUGACGGACCACAUCAU. 
Gorilla_gorilla     GCAUCACGGAC.ACAUCAUC 
Pongo_pigmaeus      GCAUGACGGACCACAUCAUC 

Homo_sapiens        CCUUGCAAAG 
Pan_paniscus        CCUUGCAAAG 
Gorilla_gorilla     CCUCGCAGAG 
Pongo_pigmaeus      CCUUGCAGAG 
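
To show how the name list and the interleaved blocks fit together, here is a small Python sketch. The function name parse_msf is hypothetical, and the sketch assumes that the alignment starts after the last "//" line in the file and that any line not beginning with a listed name can be skipped.

```python
def parse_msf(text):
    """Return {sequence name: aligned sequence} from GCG/MSF text."""
    lines = text.splitlines()
    dividers = [i for i, line in enumerate(lines) if line.strip() == "//"]
    if not dividers:
        raise ValueError("no '//' line found")
    body_start = dividers[-1] + 1
    # Name list: one "Name: <name> Len: ..." line per sequence.
    sequences = {}
    for line in lines[:body_start]:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "Name:":
            sequences[fields[1]] = ""
    # Alignment: each line is "<name> <residues>", interleaved in blocks.
    for line in lines[body_start:]:
        fields = line.split()
        if fields and fields[0] in sequences:
            sequences[fields[0]] += "".join(fields[1:])
    return sequences
```

Gaps are left as periods, exactly as written in the file.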



				
				



GDE Format:

GDE format is a tagged field format used for storing all available information about a sequence. The format matches very closely the GDE internal structures for sequence data. The format consists of text records starting and ending with braces ('{}'). Between the open and close braces are several tagged field lines specifying different pieces of information about a given sequence. The tag values can be wrapped with double quote characters ('""') as needed. If quotes are not used, the first white space delimited string is taken as the value. Any fields that are not specified are assumed to take their default values. Offsets can be negative as well as positive. GenBank entries written out in this format will have all (") converted to ('), and all ({}) converted to ([]) to avoid confusion in the parser. Leading and trailing gaps are removed prior to writing each sequence. This format is deliberately verbose in order to be simple to duplicate.


			
{ 
name "Short name for sequence" 
longname "Long (more descriptive) name for sequence" 
sequence-ID "Unique ID number" 
creation-date "mm/dd/yy hh:mm:ss" 
direction [-1|1] 
strandedness [1|2] 
type [DNA|RNA|PROTEIN|TEXT|MASK] 
offset (-999999,999999) 
group-ID (0,999) 
creator "Author's name" 
descrip "Verbose description" 
comments "Lines of comments that can be fairly arbitrary text about a 
sequence. Return characters are allowed, but no internal double quotes 
or brace characters. Remember to close with a double quote" 
sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct 
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg 
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattgc" 
}
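
The quoting rule described above (a quoted value may span lines; an unquoted value is the first whitespace-delimited string) can be sketched with a regular expression. The function name parse_gde is hypothetical, all values are returned as strings, and white space inside the sequence field is removed on the assumption that it is only line wrapping.

```python
import re

# A tag, then either a double-quoted value (may span lines) or a bare word.
_FIELD = re.compile(r'([A-Za-z][\w-]*)\s+(?:"([^"]*)"|(\S+))')

def parse_gde(text):
    """Return a list of {tag: value} dicts, one per brace-delimited record."""
    records = []
    # Records cannot nest, because braces are not allowed inside values.
    for body in re.findall(r"\{([^{}]*)\}", text):
        record = {}
        for tag, quoted, bare in _FIELD.findall(body):
            # Only one of the two groups matches; the other is empty.
            record[tag] = bare if bare else quoted
        if "sequence" in record:
            record["sequence"] = "".join(record["sequence"].split())
        records.append(record)
    return records
```

Fields that are missing from a record are simply absent from the dictionary; applying the defaults is left to the caller.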




GenBank Format:

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Although there is daily exchange of information with the EMBL Nucleotide Sequence Database, it has its own sequence format, shown below. Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, and a table of features that identifies coding regions and other sites of biological significance, such as transcription units, sites of mutations or modifications, and repeats. Protein translations for coding regions are included in the feature table. Bibliographic references are included along with a link to the Medline unique identifier for all published sequences. Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry.
  • LOCUS: Short name for this sequence (Maximum of 32 characters).
  • DEFINITION: Definition of sequence (Maximum of 80 characters).
  • ACCESSION: accession number of the entry.
  • VERSION: Version of the entry.
  • DBSOURCE: Shows the source, the date of creation and last modification of the database entry.
  • KEYWORDS: Keywords for the entry.
  • AUTHORS: Authors for the work.
  • TITLE: Title of the publication.
  • JOURNAL: Journal reference for the entry.
  • MEDLINE: Medline ID.
  • COMMENT: Lines of comments.
  • SOURCE: The organism from which the sequence was derived.
  • ORGANISM: Full name of organism (Maximum of 80 characters).
  • AUTHORS: Authors of this sequence (Maximum of 80 characters).
  • ACCESSION: ID Number for this sequence (Maximum of 80 characters).
  • FEATURES: Features of the sequence.
  • ORIGIN: Beginning of sequence data.
  • // End of sequence data.
LOCUS       MMFOSB                  4145 bp    mRNA    linear   ROD 12-SEP-1993
DEFINITION  Mouse fosB mRNA.
ACCESSION   X14897
VERSION     X14897.1  GI:50991
KEYWORDS    fos cellular oncogene; fosB oncogene; oncogene.
SOURCE      Mus musculus.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 4145)
  AUTHORS   Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and
            Bravo,R.
  TITLE     The product of a novel growth factor activated gene, fos B,
            interacts with JUN proteins enhancing their DNA binding activity
  JOURNAL   EMBO J. 8 (3), 805-813 (1989)
  MEDLINE   89251612
   PUBMED   2498083
COMMENT     clone=AC113-1; cell line=NIH3T3.
FEATURES             Location/Qualifiers
     source          1..4145
                     /organism="Mus musculus"
                     /db_xref="taxon:10090"
     CDS             1202..2218
                     /note="fosB protein (AA 1-338)"
                     /codon_start=1
                     /protein_id="CAA33026.1"
                     /db_xref="GI:50992"
                     /db_xref="MGD:95575"
                     /db_xref="SWISS-PROT:P13346"
                     /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC
                     AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT
                     SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV
                     RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH
                     KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL
                     TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS
                     LLAL"
BASE COUNT      960 a   1186 c   1007 g    991 t      1 others
ORIGIN      
        1 ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca
       61 aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa
      121 actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt
      181 gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa
      241 aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta
      301 tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca
      361 gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata
      421 gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat
      481 tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga
      541 aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca
      601 ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa
      661 agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc
      721 attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca
      781 gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact
      841 ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca
      901 ccgcggcact gcccggcggg tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa
      961 accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt
     1021 gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg
     1081 agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc
     1141 catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga
     1201 aatgtttcaa gcttttcccg gagactacga ctccggctcc cggtgtagct catcaccctc
     1261 cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc
     1321 ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc
     1381 aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc
     1441 c
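
The keyword layout above lends itself to a simple line-by-line reader. The Python sketch below extracts only LOCUS, DEFINITION, ACCESSION and the sequence after ORIGIN; the function name parse_genbank and the dictionary keys are hypothetical, and continuation lines of a multi-line DEFINITION are not handled.

```python
def parse_genbank(text):
    """Return a list of dicts, one per GenBank entry."""
    records, record, in_origin = [], None, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            record = {"locus": line.split()[1], "definition": "",
                      "accession": "", "sequence": ""}
            in_origin = False
        elif record is None:
            continue  # ignore anything before the first LOCUS line
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            fields = line.split()
            record["accession"] = fields[1] if len(fields) > 1 else ""
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence data starts on the next line
        elif line.startswith("//"):
            records.append(record)
            record, in_origin = None, False
        elif in_origin:
            # Drop the position number at the start of each data line.
            record["sequence"] += "".join(
                f for f in line.split() if not f.isdigit())
    if record is not None:
        records.append(record)  # tolerate a missing final "//"
    return records
```

The last step keeps an entry whose terminating "//" is missing, which is a leniency of this sketch rather than part of the format.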


 

NBRF/PIR Format:
  • The PIR format is similar to FASTA format.
  • The first line of each sequence entry begins with a "greater than" (>) sign.
  • This sign is followed by a two-character sequence type code (described in the table below), then a semi-colon, then the sequence identifier (FOSB_MOUSE in the example below).
  • On the next line the sequence name and a description appear.
  • The sequence starts on the following line and is ended with an asterisk (*).
Sequence type              Code
Protein (complete)         P1
Protein (fragment)         F1
DNA (linear)               DL
DNA (circular)             DC
RNA (linear)               RL
RNA (circular)             RC
tRNA                       N3
other functional RNA       N1

>P1;FOSB_MOUSE
FOSB_MOUSE 338 bases
MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA RPRRPREETL
TPEEEEKRRV RRERNKLAAA KCRNRRRELT DRLQAETDQL EEEKAELESE
IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSTSAKED
GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL*
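
A reader for this layout needs three steps per entry: split the first line at the semi-colon, take the next line as the description, and collect sequence lines up to the asterisk. The Python sketch below follows those steps; the function name parse_pir and the dictionary keys are hypothetical.

```python
# Type codes from the table above.
PIR_TYPES = {
    "P1": "Protein (complete)", "F1": "Protein (fragment)",
    "DL": "DNA (linear)", "DC": "DNA (circular)",
    "RL": "RNA (linear)", "RC": "RNA (circular)",
    "N3": "tRNA", "N1": "other functional RNA",
}

def parse_pir(text):
    """Return a list of dicts, one per NBRF/PIR entry."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        # ">P1;FOSB_MOUSE" splits into the type code and the identifier.
        type_code, _, identifier = line[1:].partition(";")
        description = next(lines, "").strip()
        chunks = []
        for seq_line in lines:  # continues from the same iterator
            chunks.append("".join(seq_line.split()))
            if "*" in seq_line:
                break
        sequence = "".join(chunks).split("*", 1)[0]
        records.append({"type": type_code, "id": identifier.strip(),
                        "description": description, "sequence": sequence})
    return records
```

The PIR_TYPES table can be used to turn the two-character code of a record into a readable label, for example PIR_TYPES.get(record["type"], "unknown").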


Pfam/Stockholm Format:

The "Pfam/Stockholm" format is a system for marking up features in a multiple alignment. These mark-up annotations are preceded by a 'magic' label, of which there are four types.

Header:
The first line in the file must contain a format and version identifier, currently:

# STOCKHOLM 1.0

The sequence alignment:

<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
.
.
//

<seqname> stands for "sequence name", typically in the form "name/start-end" or just "name".
The "//" line indicates the end of the alignment.
Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
Wrap-around alignments are allowed in principle, mainly for historical reasons, but are not used in e.g. Pfam. Wrapped alignments are discouraged since they are much harder to parse.

The alignment mark-up:

Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.

#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Sequence AND per-Column markup, exactly 1 char per column>

Example:

# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 67
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA 999887756453524252..55152525....36463774777
O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31699/88-139 AS ________________*__________________________
#=GR O31699/88-139 IN ____________1______________2__________0____
//
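
The four mark-up types differ only in how many fields precede the annotation, which the following Python sketch makes explicit. The function name parse_stockholm and the layout of the returned dictionary are hypothetical; the sketch raises ValueError on a mark-up line that has too few fields and stops at the first "//".

```python
def parse_stockholm(text):
    """Return sequences and mark-up from one Stockholm alignment."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("first line must be the '# STOCKHOLM' header")
    aln = {"GF": {}, "GC": {}, "GS": {}, "GR": {}, "seqs": {}}
    for raw in lines[1:]:
        line = raw.strip()
        if line == "//":
            break  # end of the alignment
        if not line:
            continue
        if line.startswith("#=GF"):    # per-file, free text
            _, feature, value = line.split(None, 2)
            aln["GF"].setdefault(feature, []).append(value)
        elif line.startswith("#=GC"):  # per-column, 1 char per column
            _, feature, markup = line.split(None, 2)
            aln["GC"][feature] = aln["GC"].get(feature, "") + markup
        elif line.startswith("#=GS"):  # per-sequence, free text
            _, name, feature, value = line.split(None, 3)
            aln["GS"].setdefault(name, {}).setdefault(feature, []).append(value)
        elif line.startswith("#=GR"):  # per-sequence and per-column
            _, name, feature, markup = line.split(None, 3)
            per_seq = aln["GR"].setdefault(name, {})
            per_seq[feature] = per_seq.get(feature, "") + markup
        elif line.startswith("#"):
            continue  # any other comment line
        else:
            name, residues = line.split(None, 1)
            aln["seqs"][name] = aln["seqs"].get(name, "") + residues
    return aln
```

Per-column mark-up and sequence lines are concatenated when a name repeats, so wrapped alignments are joined back together.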

Phylip Format:

  • The first line of the input file contains the number of species (that is, the number of sequences) and their length (in characters), separated by blanks.
  • The next line contains the sequence name, followed by the sequence in blocks of 10 characters.
 1 338 I 
FOSB_MOUSE MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA RPRRPREETL
TPEEEEKRRV RRERNKLAAA KCRNRRRELT DRLQAETDQL EEEKAELESE
IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSTSAKED
GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL
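
The following Python sketch reads a file laid out like the example above. The function name parse_phylip is hypothetical. It takes the first two numbers on the header line as the sequence count and length, assumes that names contain no spaces, and treats any lines after the named ones as continuation lines that cycle through the sequences in order.

```python
def parse_phylip(text):
    """Return {name: sequence} from Phylip-formatted text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    header = lines[0].split()
    n_seqs, length = int(header[0]), int(header[1])
    names, seqs = [], []
    # The first n_seqs lines carry a name followed by residues.
    for line in lines[1:1 + n_seqs]:
        name, residues = line.split(None, 1)
        names.append(name)
        seqs.append("".join(residues.split()))
    # Remaining lines hold residues only, one line per sequence in turn.
    for i, line in enumerate(lines[1 + n_seqs:]):
        seqs[i % n_seqs] += "".join(line.split())
    for name, seq in zip(names, seqs):
        if len(seq) != length:
            raise ValueError(f"{name}: expected {length} characters, got {len(seq)}")
    return dict(zip(names, seqs))
```

The length check at the end catches truncated files and header lines that do not match the data.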



 

Raw Format:

Like text/plain format, except that it removes any white space or digits, accepts only alphabetic characters and rejects anything else. This means that it is safer to use this format than plain format. If you have digits and spaces or TAB characters, these are removed and ignored. If you have other non-alphabetic characters (for example, punctuation characters), then the sequence will be rejected as erroneous.

			
     ataaattcttattttgacactcaccaaaatagtcacctggaaaacccgctttttgtgaca       
     aagtacagaaggcttggtcacatttaaatcactgagaactagagagaaatactatcgcaa       
     actgtaatagacattacatccataaaagtttccccagtccttattgtaatattgcacagt       
     gcaattgctacatggcaaactagtgtagcatagaagtcaaagcaaaaacaaaccaaagaa       
     aggagccacaagagtaaaactgttcaacagttaatagttcaaactaagccattgaatcta       
     tcattgggatcgttaaaatgaatcttcctacaccttgcagtgtatgatttaacttttaca            
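
The rules for raw format can be sketched in code. This is a hedged illustration of the behaviour described above, not the code used by any particular program; the function name clean_raw is made up.

```python
import string


def clean_raw(text):
    """Apply the raw-format rules: drop white space and digits,
    keep letters, reject everything else."""
    kept = []
    for ch in text:
        if ch.isspace() or ch in string.digits:
            continue  # spaces, TABs, newlines and digits are ignored
        if ch in string.ascii_letters:
            kept.append(ch)
        else:
            # punctuation and other symbols make the sequence invalid
            raise ValueError(f"invalid character {ch!r} in raw sequence")
    return "".join(kept)
```

For example, clean_raw("  atg 10\tcca\n") keeps only the six letters, while a sequence containing "-" or "*" is rejected.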
			


RSF Format:

RSF (rich sequence format) is created by the Editor in SeqLab. The format is recognised by the word !!RICH_SEQUENCE at the beginning of the file. A file contains one or more sequences that may or may not be related. In addition to the sequence data, each sequence can be annotated with descriptive information such as:
  • Creator/author of the sequence
  • Sequence weight
  • Creation date
  • One-line description of the sequence
  • Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
  • Known sequence features
			
!!RICH_SEQUENCE 1.0           
..                             
{
name  chkhba                  
type    DNA
longname  chkhba
checksum    980
creation-date  4/15/98 16:42:47
strand  1
sequence                      
  ACACAGAGGTGCAACCATGGTGCTGTCCGCTGCTGACAAGAACAACGTCAAGGGCATCTT
  CACCAAAATCGCCGGCCATGCTGAGGAGTATGGCGCCGAGACCTTGGAAAGGATGTTCAC
  CACCTACCCCCCAACCAAGACCTACTTCCCCCACTTCGATCTGTCACACGGCTCCGCTCA
  ...
}
{
name  davagl
type    DNA
longname  davagl
checksum    7399
creation-date  4/15/98 16:42:47
strand  1
sequence  
  GTGCTCTCGGATGCTGACAAGACTCACGTGAAAGCCATCTGGGGTAAGGTGGGAGGCCAC
  GCCGGTGCCTACGCAGCTGAAGCTCTTGCCAGAACCTTCCTCTCCTTCCCCACTACCAAA
  ...
}
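
The record structure shown above (one sequence per pair of braces, "key value" lines, then the residues after the word "sequence") can be read with a short routine. This is a minimal sketch based only on the layout of the example; it does not cover every field RSF allows, and the function name parse_rsf is made up for illustration.

```python
def parse_rsf(text):
    """Return one dictionary per { ... } record in an RSF file."""
    records = []
    current = None        # the record being filled, or None outside braces
    in_sequence = False   # True once the "sequence" keyword has been seen
    for raw in text.splitlines():
        line = raw.strip()
        if line == "{":
            current = {"sequence": ""}
            in_sequence = False
        elif line == "}":
            if current is not None:
                records.append(current)
            current = None
            in_sequence = False
        elif current is None or not line:
            continue  # file header (!!RICH_SEQUENCE, "..") and blank lines
        elif in_sequence:
            current["sequence"] += "".join(line.split())
        elif line == "sequence":
            in_sequence = True
        else:
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    return records
```

Each returned dictionary holds the annotation fields under their own names (name, type, checksum and so on) and the joined residues under "sequence".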

            
			



UniProt/Swiss-Prot Format:

UniProt/Swiss-Prot is an annotated protein sequence database. The UniProt/Swiss-Prot protein knowledgebase consists of sequence entries, and each entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry. For standardization purposes, the format of UniProt/Swiss-Prot follows that of the EMBL Nucleotide Sequence Database as closely as possible. The entries are structured so as to be usable by human readers as well as by computer programs: the explanations, descriptions, classifications and other comments are in ordinary English and, wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used. Full details of each line type are given in the UniProt/Swiss-Prot user manual.

  • The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

    ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
    e.g. ID FOSB_MOUSE STANDARD; PRT; 338 AA.

  • The AC (ACcession number) line lists the accession number(s) associated with an entry.

  • The DT (DaTe) lines show the date of creation and last modification of the database entry.

  • The DE (DEscription) lines contain general descriptive information about the sequence stored.

  • The GN (Gene Name) line contains the name(s) of the gene(s) that code for the stored protein sequence.

  • The OS (Organism Species) line specifies the organism(s) which was (were) the source of the stored sequence.

  • The OG (OrGanelle) line indicates if the gene coding for a protein originates from the mitochondria, the chloroplast, a cyanelle, or a plasmid.

  • The OC (Organism Classification) lines contain the taxonomic classification of the source organism.

  • The OX (Organism taxonomy Cross-Reference) line is used to indicate the identifier to a specific organism in a taxonomic database.

  • The RN (Reference Number) line gives a sequential number to each reference citation in an entry.

  • The RP (Reference Position) line describes the extent of the work carried out by the authors of the reference cited.

  • The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited.

  • The RX (Reference Cross-Reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database.


  • The RA (Reference Author) lines list the authors of the paper (or other work) cited.


  • The RT (Reference Title) lines give the title of the paper (or other work) cited.


  • The RL (Reference Location) lines contain the conventional citation information for the reference.


  • The CC lines are free text comments on the entry, and are used to convey any useful information.


  • The DR (Database cross-Reference) lines are used as pointers to information related to Swiss-Prot entries and found in other data collections.

  • The KW (KeyWord) lines provide information that can be used to generate indexes of the sequence entries based on functional, structural, or other categories.


  • The FT (Feature Table) lines provide a precise but simple means for the annotation of the sequence data. The table describes regions or sites of interest in the sequence. In general the feature table lists posttranslational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references.


  • The SQ (SeQuence header) line marks the beginning of the sequence data and gives a quick summary of its content.


  • The sequence data lines have a line code consisting of two blanks rather than the two-letter codes used by the other line types. The sequence is written 60 amino acids per line, in groups of 10 amino acids, beginning in position 6 of the line.


  • The // (terminator) line contains no data or comments and designates the end of an entry.


ID   FOSB_MOUSE     STANDARD;      PRT;   338 AA.
AC   P13346;
DT   01-JAN-1990 (Rel. 13, Created)
DT   01-JAN-1990 (Rel. 13, Last sequence update)
DT   15-JUN-2002 (Rel. 41, Last annotation update)
DE   Protein fosB.
GN   FOSB.
OS   Mus musculus (Mouse).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
OX   NCBI_TaxID=10090;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=89251612; PubMed=2498083;
RA   Zerial M., Toschi L., Ryseck R.-P., Schuermann M., Mueller R.,
RA   Bravo R.;
RT   "The product of a novel growth factor activated gene, fos B, interacts
RT   with JUN proteins enhancing their DNA binding activity.";
RL   EMBO J. 8:805-813(1989).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=92158623; PubMed=1741260;
RA   Lazo P.S., Dorfman K., Noguchi T., Mattei M.-G., Bravo R.;
RT   "Structure and mapping of the fosB gene. FosB downregulates the
RT   activity of the fosB promoter.";
RL   Nucleic Acids Res. 20:343-350(1992).
CC   -!- FUNCTION: FOSB INTERACTS WITH JUN PROTEINS ENHANCING THEIR DNA
CC       BINDING ACTIVITY.
CC   -!- SUBUNIT: HETERODIMER (BY SIMILARITY).
CC   -!- SUBCELLULAR LOCATION: NUCLEAR.
CC   -!- INDUCTION: BY GROWTH FACTORS.
CC   -!- SIMILARITY: BELONGS TO THE BZIP FAMILY. FOS SUBFAMILY.
CC   --------------------------------------------------------------------------
CC   This Swiss-Prot entry is copyright. It is produced through a collaboration
CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -
CC   the European Bioinformatics Institute.  There are no  restrictions on  its
CC   use  by  non-profit  institutions as long  as its content  is  in  no  way
CC   modified and this statement is not removed.  Usage  by  and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; X14897; CAA33026.1; -.
DR   EMBL; AF093624; AAD13196.1; -.
DR   PIR; S04108; TVMSFB.
DR   PIR; S35477; S35477.
DR   HSSP; P01100; 1FOS.
DR   TRANSFAC; T00291; -.
DR   MGD; MGI:95575; Fosb.
DR   InterPro; IPR000837; Leuzip_Fos.
DR   InterPro; IPR004827; TF_bZIP.
DR   Pfam; PF00170; bZIP; 1.
DR   PRINTS; PR00042; LEUZIPPRFOS.
DR   SMART; SM00338; BRLZ; 1.
DR   PROSITE; PS00036; BZIP_BASIC; 1.
KW   Nuclear protein; DNA-binding.
FT   DNA_BIND    161    179       BASIC MOTIF.
FT   DOMAIN      183    211       LEUCINE-ZIPPER.
SQ   SEQUENCE   338 AA;  35976 MW;  E9D031A4BEAE48EC CRC64;
     MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM PGSFVPTVTA
     ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP GTSYSTPGLS AYSTGGASGS
     GGPSTSTTTS GPVSARPARA RPRRPREETL TPEEEEKRRV RRERNKLAAA KCRNRRRELT
     DRLQAETDQL EEEKAELESE IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD
     LPGSTSAKED GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
     TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL
//
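
Because every line starts with a two-character line code and the data begin in position 6, an entry like the one above can be split into its line types with very little code. The following is a minimal sketch under those assumptions; the function name parse_swissprot_entry is made up for illustration, and the sketch reads a single entry only.

```python
def parse_swissprot_entry(text):
    """Group the lines of one entry by line code and join the sequence.

    Returns (fields, sequence), where fields maps each two-letter
    code (ID, AC, DE, ...) to the list of its data strings.
    """
    fields = {}
    seq_parts = []
    for line in text.splitlines():
        if line.startswith("//"):  # terminator line: end of the entry
            break
        code, data = line[:2], line[5:]  # data start in position 6
        if code == "  ":
            # sequence data line: remove the blanks between groups of 10
            seq_parts.append("".join(data.split()))
        elif code.strip():
            fields.setdefault(code, []).append(data.rstrip())
    return fields, "".join(seq_parts)
```

With the entry above, fields["OS"] would hold the organism line and the returned sequence could be checked against the length given on the ID and SQ lines.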


Known biosequence format Extensions

ID  Name            Read  Write  Int'leaf  Document  Content-type         Suffix
 1  IG|Stanford     yes   yes    --        --        biosequence/ig       .ig
 2  GenBank|GB      yes   yes    --        yes       biosequence/genbank  .gb
 3  NBRF            yes   yes    --        --        biosequence/nbrf     .nbrf
 4  EMBL            yes   yes    --        yes       biosequence/embl     .embl
 5  GCG             yes   yes    --        --        biosequence/gcg      .gcg
 6  DNAStrider      yes   yes    --        --        biosequence/strider  .strider
 7  Fitch           --    --     --        --        biosequence/fitch    .fitch
 8  Pearson|Fasta   yes   yes    --        --        biosequence/fasta    .fasta
 9  Zuker           --    --     --        --        biosequence/zuker    .zuker
10  Olsen           --    --     yes       --        biosequence/olsen    .olsen
11  Phylip3.2       yes   yes    yes       --        biosequence/phylip2  .phylip2
12  Phylip|Phylip4  yes   yes    yes       --        biosequence/phylip   .phylip
13  Plain|Raw       yes   yes    --        --        biosequence/plain    .seq
14  PIR|CODATA      yes   yes    --        --        biosequence/codata   .pir
15  MSF             yes   yes    yes       --        biosequence/msf      .msf
16  PAUP|NEXUS      yes   yes    yes       --        biosequence/nexus    .nexus
17  Pretty          --    yes    yes       --        biosequence/pretty   .pretty
18  XML             yes   yes    --        yes       biosequence/xml      .xml
19  BLAST           yes   --     yes       --        biosequence/blast    .blast
20  SCF             yes   --     --        --        biosequence/scf      .scf
21  ASN.1           --    --     --        --        biosequence/asn1     .asn
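
The suffixes in the table can be used to guess the format of a file from its name before submitting it. The lookup below is only a sketch: the suffix-to-name pairs are copied from the table, while the names SUFFIX_TO_FORMAT and guess_format are made up for illustration. A file name is only a hint, so the contents should still be checked.

```python
import os

# Suffix -> format name, copied from the table above.
SUFFIX_TO_FORMAT = {
    ".ig": "IG|Stanford", ".gb": "GenBank|GB", ".nbrf": "NBRF",
    ".embl": "EMBL", ".gcg": "GCG", ".strider": "DNAStrider",
    ".fitch": "Fitch", ".fasta": "Pearson|Fasta", ".zuker": "Zuker",
    ".olsen": "Olsen", ".phylip2": "Phylip3.2", ".phylip": "Phylip|Phylip4",
    ".seq": "Plain|Raw", ".pir": "PIR|CODATA", ".msf": "MSF",
    ".nexus": "PAUP|NEXUS", ".pretty": "Pretty", ".xml": "XML",
    ".blast": "BLAST", ".scf": "SCF", ".asn": "ASN.1",
}


def guess_format(filename):
    """Return the format name for a file suffix, or None if unknown."""
    suffix = os.path.splitext(filename)[1].lower()
    return SUFFIX_TO_FORMAT.get(suffix)
```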