UniGene

FILES IN THIS DIRECTORY
=======================
NOTE:  The files in this directory have been renamed so that "Hs"
(for Homo sapiens) appears in the filename.
Other organisms are abbreviated as listed below.

Each entry gives the abbreviation used as a
prefix for naming files within the organism's subdirectory,
the numerical identifier of the taxon,
and the organism's name, separated by colons, for each organism
for which UniGene datasets are provided.

Aae:7159:Aedes aegypti
Afp:338618:Aquilegia formosa x Aquilegia pubescens
Aga:7165:Anopheles gambiae
Ame:7460:Apis mellifera
At:3702:Arabidopsis thaliana
Bfl:7739:Branchiostoma floridae
Bmo:7091:Bombyx mori
Bna:3708:Brassica napus
Bt:9913:Bos taurus
Cel:6239:Caenorhabditis elegans
Cfa:9615:Canis familiaris
Cin:7719:Ciona intestinalis
Cpo:199306:Coccidioides posadasii
Cre:3055:Chlamydomonas reinhardtii
Csa:51511:Ciona savignyi
Csi:2711:Citrus sinensis
Ddi:44689:Dictyostelium discoideum
Dm:7227:Drosophila melanogaster
Dr:7955:Danio rerio
Fhe:8078:Fundulus heteroclitus
Fne:5207:Filobasidiella neoformans
Gac:69293:Gasterosteus aculeatus
Gga:9031:Gallus gallus
Ghi:3635:Gossypium hirsutum
Gma:3847:Glycine max
Gmo:117187:Gibberella moniliformis
Gra:29730:Gossypium raimondii
Han:4232:Helianthus annuus
Hma:6085:Hydra magnipapillata
Hs:9606:Homo sapiens
Hv:4513:Hordeum vulgare
Lco:47247:Lotus corniculatus
Les:4081:Lycopersicon esculentum
Lsa:4236:Lactuca sativa
Mdo:3750:Malus x domestica
Mfa:9541:Macaca fascicularis
Mgr:148305:Magnaporthe grisea
Mm:10090:Mus musculus
Mmu:9544:Macaca mulatta
Mte:30286:Molgula tectiformis
Mtr:3880:Medicago truncatula
Ncr:5141:Neurospora crassa
Oar:9940:Ovis aries
Ocu:9986:Oryctolagus cuniculus
Ola:8090:Oryzias latipes
Omy:8022:Oncorhynchus mykiss
Os:4530:Oryza sativa
Pba:73824:Populus balsamifera
Pgl:3330:Picea glauca
Pin:4787:Phytophthora infestans
Ppa:3218:Physcomitrella patens
Psi:3332:Picea sitchensis
Pta:3352:Pinus taeda
Ptp:47664:Populus tremula x Populus tremuloides
Rn:10116:Rattus norvegicus
Sbi:4558:Sorghum bicolor
Sja:6182:Schistosoma japonicum
Sma:6183:Schistosoma mansoni
Sof:4547:Saccharum officinarum
Spu:7668:Strongylocentrotus purpuratus
Ssa:8030:Salmo salar
Ssc:9823:Sus scrofa
Str:8364:Xenopus tropicalis
Stu:4113:Solanum tuberosum
Ta:4565:Triticum aestivum
Tca:7070:Tribolium castaneum
Tgo:5811:Toxoplasma gondii
Tru:31033:Takifugu rubripes
Vvi:29760:Vitis vinifera
Xl:8355:Xenopus laevis
Zm:4577:Zea mays
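
The entries above can be read mechanically.  The following is a minimal
Python sketch, not an official tool; the names Organism and
parse_organism_line are made up for this illustration, and the two
sample lines are copied from the list above.

```python
from typing import NamedTuple


class Organism(NamedTuple):
    prefix: str   # abbreviation used as the filename prefix, e.g. "Hs"
    taxid: int    # numerical identifier of the taxon
    name: str     # organism name


def parse_organism_line(line: str) -> Organism:
    # Split on the first two colons only, so the name is kept whole.
    prefix, taxid, name = line.strip().split(":", 2)
    return Organism(prefix, int(taxid), name)


sample = [
    "Hs:9606:Homo sapiens",
    "Ptp:47664:Populus tremula x Populus tremuloides",
]
by_prefix = {o.prefix: o for o in map(parse_organism_line, sample)}
print(by_prefix["Hs"].taxid)
```

A lookup keyed by prefix is convenient because the same prefix begins
every filename in the organism's subdirectory.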


The files in each subdirectory are as follows, with 
each organism's filename beginning with the organism abbreviation:


Hs.info
  
   Some statistics for the current build


Hs.seq.all.Z

   Human transcript sequences, derived from both known genes and ESTs,
   that have been partitioned into clusters.  The lines beginning with
   the # character delimit the clusters.  The cluster identifier,
   which is NOT guaranteed to remain stable across UG builds,
   appears as Xx.99999, with Xx the organism abbreviation
   (two or three letters).
   The extent of the coding sequence is indicated with /cds=[p,m](x,y),
   with p indicating a CDS annotated on the plus strand (the usual case)
   and m indicating a CDS on the minus strand.
   Otherwise, the sequences are shown in FASTA style.  The number
   following the # is the UniGene sequence ID; this number won't change
   from UG build to build, though the sequence may not remain in the
   same (or in any) cluster across UG builds.
   If the GB or dbEST sequence is updated, the UG sid remains the same.
   Note that individual clusters may be downloaded from the main 
   UniGene website.
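
   As an illustration of the /cds=[p,m](x,y) annotation described above,
   here is a small Python sketch that extracts the strand and the two
   coordinates from a header line.  It assumes x and y are plain
   integers; the function name parse_cds and the example header text
   are invented for this sketch and are not taken from a real file.

```python
import re
from typing import Optional, Tuple

# Matches /cds=p(x,y) or /cds=m(x,y) with integer coordinates (an assumption).
CDS_RE = re.compile(r"/cds=([pm])\((\d+),(\d+)\)")


def parse_cds(header: str) -> Optional[Tuple[str, int, int]]:
    """Return (strand, x, y), or None if no /cds= annotation is present."""
    match = CDS_RE.search(header)
    if match is None:
        return None
    strand = "+" if match.group(1) == "p" else "-"
    return strand, int(match.group(2)), int(match.group(3))


# Hypothetical header text, for illustration only.
example_header = "example sequence header /cds=p(58,1203)"
print(parse_cds(example_header))
```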

	

Hs.seq.uniq.Z

   One sequence selected from each UniGene cluster (the one with the 
   longest region of high-quality sequence data).  This file was 
   intended to be used for BLAST/FASTA searching.  

Hs.data.Z

   Send comments to Lukas Wagner (wagner@ncbi.nlm.nih.gov).

   Line types/qualifiers:

       ID           UniGene cluster ID
       TITLE        Title for the cluster
       GENE         Gene symbol
       CYTOBAND     Cytological band
       EXPRESS      Tissues of origin for ESTs in cluster
       RESTR_EXPR   Single tissue or development stage contributes 
                    more than half the total EST frequency for this gene.
       GNM_TERMINUS Genomic confirmation of presence of a 3' terminus;
                    T if a non-templated polyA tail is found among
                      a cluster's sequences; else
                    I if templated As are found in genomic sequence or
                    S if a canonical polyA signal is found on
                      the genomic sequence
       GENE_ID      Entrez Gene identifier associated with at least one
                    sequence in this cluster; to be used instead of
                    LocusLink.
       LOCUSLINK    LocusLink identifier associated with at least one
                    sequence in this cluster; deprecated in favor of
                    GENE_ID
       CHROMOSOME   Chromosome.  For plants, CHROMOSOME refers to mapping
                    on the Arabidopsis genome.
       STS          STS
            NAME=        Name of STS
            ACC=         GenBank/EMBL/DDBJ accession number of STS [optional field]
            DSEG=        GDB Dsegment number [optional field]
            UNISTS=      Identifier in NCBI's UniSTS database
       TXMAP        Transcript map interval
            MARKER=      Marker found on at least one sequence in this cluster
            RHPANEL=     Radiation Hybrid panel used to place marker
       PROTSIM      Protein Similarity data for the sequence with highest-scoring protein similarity in this cluster
            ORG=         Organism
            PROTGI=      Sequence GI of protein
            PROTID=      Sequence ID of protein
            PCT=         Percent alignment
            ALN=         length of aligned region (aa)
       SCOUNT       Number of sequences in the cluster
       SEQUENCE     Sequence
            ACC=         GenBank/EMBL/DDBJ accession number of sequence
            NID=         Unique nucleotide sequence identifier (gi)
            PID=         Unique protein sequence identifier (used for non-ESTs)
            CLONE=       Clone identifier (used for ESTs only)
            END=         End (5'/3') of clone insert read (used for ESTs only) 
            LID=         Library ID; see Hs.lib.info for library name and tissue
            MGC=         5' CDS-completeness indicator; if present,
                         the clone associated with this sequence
                         is believed CDS-complete. A value greater than 511
                         is the gi of the CDS-complete mRNA matched by the EST;
                         otherwise the value is an indicator of the reliability
                         of the test indicating CDS completeness;
                         higher values indicate more reliable
                         CDS-completeness predictions.
            SEQTYPE=     Description of the nucleotide sequence. Possible
                         values are mRNA, EST and HTC.
            TRACE=       The Trace ID of the EST sequence, as provided by
                         the NCBI Trace Archive
            PERIPHERAL=  Indicator that the sequence is a suboptimal
                         representative of the gene represented by this cluster.
                         Peripheral sequences are those that are in a cluster
                         which represents a spliced gene without sharing a
                         splice junction with any other sequence.  In many
                         cases, they are unspliced transcripts originating
                         from the gene.

       //           End of record
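
   The line types above lend themselves to a simple record parser.  The
   Python sketch below groups lines into records ending at "//" and
   splits the KEY=value qualifiers of a line such as SEQUENCE.  It
   assumes that the line type is the first whitespace-delimited token
   and that qualifiers are separated by semicolons; neither detail is
   stated in this README, so check both against a real file.  All
   function names and the sample record are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def parse_subfields(value: str) -> Dict[str, str]:
    """Split 'KEY=value; KEY=value' text (separator is an assumption)."""
    fields: Dict[str, str] = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key.strip()] = val.strip()
    return fields


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping line type to its values."""
    record: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):      # end-of-record marker
            if record:
                yield record
            record = {}
            continue
        parts = line.split(None, 1)    # line type, then the rest
        if not parts:
            continue                   # skip blank lines
        value = parts[1].strip() if len(parts) > 1 else ""
        record.setdefault(parts[0], []).append(value)
    if record:                         # tolerate a missing final "//"
        yield record


def mgc_is_gi(value: int) -> bool:
    """Per the MGC description: values greater than 511 are gi numbers."""
    return value > 511


# Hypothetical record; identifiers and accessions are placeholders.
sample = """\
ID          Hs.99999
TITLE       Hypothetical example cluster
SCOUNT      2
SEQUENCE    ACC=XX000001; NID=g1; SEQTYPE=mRNA
SEQUENCE    ACC=XX000002; NID=g2; END=5'; LID=1; SEQTYPE=EST
//
""".splitlines()

records = list(iter_records(sample))
seqs = [parse_subfields(s) for s in records[0]["SEQUENCE"]]
print(records[0]["ID"], len(seqs))
```

   Values are kept as lists because line types such as SEQUENCE and STS
   can occur more than once in a record.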

Hs.lib.info.Z

   Additional information regarding the LID field.  Note that
   libraries may be browsed and downloaded via the Library Browser
   page, accessible through the main UniGene website.



Hs.retired.lst.Z

   This file allows a comparison of the current UniGene release
   with the previous version.  It is a list of the previous UniGene
   clusters, their constituent sequences, and the current UniGene
   cluster for each sequence.

   As of April 30, 2001, the file includes a header with the organism
   name and UniGene release number and date.  The data are in 4
   columns.  The first column is the previous UniGene cluster ID,
   the second column is the current UniGene cluster ID, the third
   column is the UniGene sequence ID, and the fourth column is the
   GenBank accession number of the sequence.

   Previously, the file had 3 columns: the first was the original
   UniGene cluster number, the second was the GenBank accession number
   of the sequence, and the third was the current UniGene cluster ID.
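
   A minimal Python sketch of reading the four-column layout follows.
   It assumes the columns are whitespace-delimited, which this README
   does not state, and it does not try to recognize the header, whose
   exact form is not described here.  The names RetiredEntry and
   parse_retired_line and the sample rows are invented for illustration.

```python
from typing import NamedTuple, Optional


class RetiredEntry(NamedTuple):
    previous_cluster: str   # column 1: previous UniGene cluster ID
    current_cluster: str    # column 2: current UniGene cluster ID
    sequence_id: str        # column 3: UniGene sequence ID
    accession: str          # column 4: GenBank accession of the sequence


def parse_retired_line(line: str) -> Optional[RetiredEntry]:
    """Return an entry, or None if the line does not have four fields."""
    fields = line.split()
    if len(fields) != 4:
        return None
    return RetiredEntry(*fields)


# Hypothetical rows; the identifier formats are placeholders.
sample = [
    "Hs.11111\tHs.11111\t1000001\tXX000001",
    "Hs.11112\tHs.22222\t1000002\tXX000002",
]
entries = [e for e in map(parse_retired_line, sample) if e is not None]
moved = [e for e in entries if e.previous_cluster != e.current_cluster]
print(len(moved))
```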

Hs.profiles.gz


   This file summarizes the expression profile of ESTs in each
   cluster from libraries with a curated, controlled-vocabulary
   tissue, organ, or developmental stage of origin.  Libraries
   derived by normalized or subtracted laboratory protocols are not
   used for the expression profiles because they could bias the
   results.  Only clusters with 10 or more classified ESTs are
   reported.  The figures reported are fractions, with the numerator
   being the number of ESTs in the cluster from all qualifying RNA
   sources and the denominator being the total number of ESTs from the
   same set of RNA sources.  Reporting both allows absolute and
   relative expression levels to be assessed.  Where possible,
   separate classifications (e.g., tissue of origin and developmental
   stage) are reported.
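
   To make the numerator/denominator description concrete, the Python
   sketch below turns one such pair into a relative expression level.
   It does not read the file itself, since the file layout is not
   described here; the function names and the example counts are
   invented, and scaling to "per million" is only one convenient way of
   presenting the ratio.

```python
from fractions import Fraction

MIN_CLASSIFIED_ESTS = 10  # clusters below this count are not reported


def relative_expression(cluster_ests: int, total_ests: int) -> Fraction:
    """Fraction of ESTs from the qualifying sources that fall in the cluster."""
    if total_ests <= 0:
        raise ValueError("total_ests must be positive")
    return Fraction(cluster_ests, total_ests)


def per_million(cluster_ests: int, total_ests: int) -> float:
    """The same ratio scaled to ESTs per million."""
    if total_ests <= 0:
        raise ValueError("total_ests must be positive")
    return cluster_ests * 1_000_000 / total_ests


def is_reported(classified_ests: int) -> bool:
    """Apply the 10-or-more classified ESTs threshold."""
    return classified_ests >= MIN_CLASSIFIED_ESTS


# Made-up counts: 12 ESTs in the cluster out of 240000 in total.
print(relative_expression(12, 240000), per_million(12, 240000))
```

   The numerator alone gives the absolute level (12 ESTs in this made-up
   case), while the ratio gives the relative level.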