UniGene RETIREMENT NOTICE
=========================

UniGene was originally implemented as a gene-oriented grouping of transcript
sequences in the absence of a reference genome for a broad range of organisms.
We added genome-based grouping later. UniGene has since been used as a source
of approximate expression profiles, an index of available cDNA clones, and a
guide to transcript-oriented resource design.

However, with the advent of short read sequencing, fewer and fewer ESTs are
submitted to NCBI every year, and reference genomes are available for most
organisms with a sizable research community. Consequently, the usage of and
need for UniGene have dropped significantly.

Although we will retire the web interfaces, we will continue to have the most
recent UniGene builds available on NCBI's FTP site. Web traffic to UniGene
entries will redirect to relevant gene entries when those are available. When
that's not possible, web requests will be routed to either a representative
nucleotide sequence entry or a helpful Entrez query against nucleotide records.

UniGene cluster identifiers (e.g., Hs.2 or Dr.1068) are indexed in Entrez Gene.
Users interested in particular UniGene clusters may find them by searching
Gene. For best results, searches should be restricted to the field "UniGene
Cluster Number" rather than the default choice of searching against all fields
in Gene. The advanced search builder provides a convenient way to do this:

https://www.ncbi.nlm.nih.gov/gene/advanced

Note that Gene contains as representative mRNA sequences RefSeqs and selected
GenBank mRNA records, rather than the longer lists of representative sequences
in UniGene, which are still available via FTP here.

FILES IN THIS DIRECTORY
=======================

NOTE: The files in this directory have been renamed with "Hs" (for Homo
sapiens) appearing somewhere in the filename.
Other organisms are abbreviated as listed below. Each entry gives the
abbreviation used as a prefix for naming files within the organism's
subdirectory, the numerical identifier of the taxon, and the organism's name,
for each organism for which UniGene datasets are provided.

Aae:7159:Aedes aegypti
Afp:338618:Aquilegia formosa x Aquilegia pubescens
Aga:7165:Anopheles gambiae
Ame:7460:Apis mellifera
At:3702:Arabidopsis thaliana
Bfl:7739:Branchiostoma floridae
Bmo:7091:Bombyx mori
Bna:3708:Brassica napus
Bt:9913:Bos taurus
Cel:6239:Caenorhabditis elegans
Cfa:9615:Canis familiaris
Cin:7719:Ciona intestinalis
Cpo:199306:Coccidioides posadasii
Cre:3055:Chlamydomonas reinhardtii
Csa:51511:Ciona savignyi
Csi:2711:Citrus sinensis
Ddi:44689:Dictyostelium discoideum
Dm:7227:Drosophila melanogaster
Dr:7955:Danio rerio
Fhe:8078:Fundulus heteroclitus
Fne:5207:Filobasidiella neoformans
Gac:69293:Gasterosteus aculeatus
Gga:9031:Gallus gallus
Ghi:3635:Gossypium hirsutum
Gma:3847:Glycine max
Gmo:117187:Gibberella moniliformis
Gra:29730:Gossypium raimondii
Han:4232:Helianthus annuus
Hma:6085:Hydra magnipapillata
Hs:9606:Homo sapiens
Hv:4513:Hordeum vulgare
Lco:47247:Lotus corniculatus
Les:4081:Lycopersicon esculentum
Lsa:4236:Lactuca sativa
Mdo:3750:Malus x domestica
Mfa:9541:Macaca fascicularis
Mgr:148305:Magnaporthe grisea
Mm:10090:Mus musculus
Mmu:9544:Macaca mulatta
Mte:30286:Molgula tectiformis
Mtr:3880:Medicago truncatula
Ncr:5141:Neurospora crassa
Oar:9940:Ovis aries
Ocu:9986:Oryctolagus cuniculus
Ola:8090:Oryzias latipes
Omy:8022:Oncorhynchus mykiss
Os:4530:Oryza sativa
Pba:73824:Populus balsamifera
Pgl:3330:Picea glauca
Pin:4787:Phytophthora infestans
Ppa:3218:Physcomitrella patens
Psi:3332:Picea sitchensis
Pta:3352:Pinus taeda
Ptp:47664:Populus tremula x Populus tremuloides
Rn:10116:Rattus norvegicus
Sbi:4558:Sorghum bicolor
Sja:6182:Schistosoma japonicum
Sma:6183:Schistosoma mansoni
Sof:4547:Saccharum officinarum
Spu:7668:Strongylocentrotus purpuratus
Ssa:8030:Salmo salar
Ssc:9823:Sus scrofa
Str:8364:Xenopus tropicalis
Stu:4113:Solanum tuberosum
Ta:4565:Triticum aestivum
Tca:7070:Tribolium castaneum
Tgo:5811:Toxoplasma gondii
Tru:31033:Takifugu rubripes
Vvi:29760:Vitis vinifera
Xl:8355:Xenopus laevis
Zm:4577:Zea mays

The files in each subdirectory are as follows, with each organism's filename
beginning with the organism abbreviation:

Hs.info           Some statistics for the current build.

Hs.seq.all.Z      Human transcript sequences derived from both known genes
                  and ESTs that have been partitioned into clusters. The
                  lines beginning with the # character delimit the clusters.
                  The cluster identifier, which is NOT guaranteed to remain
                  stable across UG builds, appears as Xx.99999, with Xx
                  standing for the organism abbreviation (two or three
                  letters, as listed above). The extent of the coding
                  sequence is indicated with /cds=[p,m](x,y), with p
                  indicating a CDS annotated on the plus strand (the usual
                  case) and m indicating a CDS on the minus strand.
                  Otherwise, the sequences are shown in FASTA style.

                  The number following the # is the UniGene sequence ID.
                  This number won't change from UG build to build, though
                  the sequence may not remain in the same (or in any)
                  cluster across UG builds. If the GB or dbEST sequence is
                  updated, the UG sid remains the same.

                  Note that individual clusters may be downloaded from the
                  main UniGene website.

Hs.seq.uniq.Z     One sequence selected from each UniGene cluster (the one
                  with the longest region of high-quality sequence data).
                  This file was intended to be used for BLAST/FASTA
                  searching.

Hs.data.Z         Send comments to Lukas Wagner (wagner@ncbi.nlm.nih.gov).

                  Line types/qualifiers:

    ID            UniGene cluster ID
    TITLE         Title for the cluster
    GENE          Gene symbol
    CYTOBAND      Cytological band
    EXPRESS       Tissues of origin for ESTs in cluster
    RESTR_EXPR    Single tissue or development stage contributes more than
                  half the total EST frequency for this gene.
    GNM_TERMINUS  Genomic confirmation of presence of a 3' terminus; T if a
                  non-templated polyA tail is found among a cluster's
                  sequences; else I if templated As are found in genomic
                  sequence or S if a canonical polyA signal is found on the
                  genomic sequence
    GENE_ID       Entrez Gene identifier associated with at least one
                  sequence in this cluster; to be used instead of LocusLink.
    LOCUSLINK     LocusLink identifier associated with at least one sequence
                  in this cluster; deprecated in favor of GENE_ID
    CHROMOSOME    Chromosome. For plants, CHROMOSOME refers to mapping on the
                  Arabidopsis genome.
    STS           STS
                  NAME=        Name of STS
                  ACC=         GenBank/EMBL/DDBJ accession number of STS
                               [optional field]
                  DSEG=        GDB Dsegment number [optional field]
                  UNISTS=      Identifier in NCBI's UniSTS database
    TXMAP         Transcript map interval
                  MARKER=      Marker found on at least one sequence in this
                               cluster
                  RHPANEL=     Radiation Hybrid panel used to place marker
    PROTSIM       Protein similarity data for the sequence with the
                  highest-scoring protein similarity in this cluster
                  ORG=         Organism
                  PROTGI=      Sequence GI of protein
                  PROTID=      Sequence ID of protein
                  PCT=         Percent alignment
                  ALN=         Length of aligned region (aa)
    SCOUNT        Number of sequences in the cluster
    SEQUENCE      Sequence
                  ACC=         GenBank/EMBL/DDBJ accession number of sequence
                  NID=         Unique nucleotide sequence identifier (gi)
                  PID=         Unique protein sequence identifier (used for
                               non-ESTs)
                  CLONE=       Clone identifier (used for ESTs only)
                  END=         End (5'/3') of clone insert read (used for
                               ESTs only)
                  LID=         Library ID; see Hs.lib.info for library name
                               and tissue
                  MGC=         5' CDS-completeness indicator; if present, the
                               clone associated with this sequence is
                               believed CDS-complete. A value greater than
                               511 is the gi of the CDS-complete mRNA matched
                               by the EST; otherwise the value is an
                               indicator of the reliability of the test
                               indicating CDS completeness, with higher
                               values indicating more reliable
                               CDS-completeness predictions.
                  SEQTYPE=     Description of the nucleotide sequence.
                               Possible values are mRNA, EST and HTC.
                  TRACE=       The Trace ID of the EST sequence, as provided
                               by NCBI Trace Archive
                  PERIPHERAL=  Indicator that the sequence is a suboptimal
                               representative of the gene represented by
                               this cluster. Peripheral sequences are those
                               that are in a cluster which represents a
                               spliced gene without sharing a splice
                               junction with any other sequence. In many
                               cases, they are unspliced transcripts
                               originating from the gene.
    //            End of record

Hs.lib.info.Z     Additional information regarding the LID field. Note that
                  libraries may be browsed and downloaded via the Library
                  Browser page, accessible through the main UniGene website.

Hs.retired.lst.Z  This file allows a comparison of the current UniGene
                  release with the previous version. It is a list of the
                  previous UniGene clusters, their composite sequences, and
                  the current UniGene cluster for each sequence.

                  As of April 30, 2001, the file includes a header with the
                  organism name and the UniGene release number and date. The
                  data are in four columns: the first is the previous
                  UniGene cluster ID, the second is the current UniGene
                  cluster ID, the third is the UniGene sequence ID, and the
                  fourth is the GenBank accession ID of the sequence.

                  Previously, the file had three columns: the first was the
                  original UniGene cluster number, the second was the
                  GenBank accession code of the sequence, and the third was
                  the current UniGene cluster ID.

Hs.profiles.gz    This file summarizes the expression profile of ESTs in
                  each cluster from libraries with curated
                  controlled-vocabulary tissue, organ, or developmental
                  stage of origin. Libraries derived by normalized or
                  subtracted laboratory protocols are not used for the
                  expression profiles because they could bias the results.
                  Clusters with 10 or more classified ESTs are reported.
                  The figures reported are fractions, with the numerator
                  being the number of ESTs in the cluster from all
                  qualifying RNA sources and the denominator being the total
                  number of ESTs from the same set of RNA sources. By
                  reporting both, absolute and relative expression levels
                  can be assessed. Where possible, separate classifications
                  (e.g., tissue of origin and developmental stage) are
                  reported.
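
EXAMPLE: READING Hs.data RECORDS

The following is a minimal Python sketch, not an NCBI-provided tool, showing
one way to read the Hs.data.Z record layout described above once the file has
been decompressed. It relies on two assumptions that this README does not
state: each line starts with its line type followed by whitespace, and
qualifiers are written as KEY=value tokens separated by semicolons and/or
whitespace. The function name parse_records and the sample record are
hypothetical and exist only for illustration; the sample values are
placeholders, not real UniGene data.

```python
import re
from typing import Dict, Iterable, Iterator

# Line types that carry KEY=value qualifiers, per the list in this README.
QUALIFIED = {"STS", "TXMAP", "PROTSIM", "SEQUENCE"}

# Assumption: qualifiers look like KEY=value and are separated by
# semicolons and/or whitespace (values therefore contain neither).
QUALIFIER_RE = re.compile(r"(\w+)=([^;\s]+)")


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, list]]:
    """Yield one dict per record; each line type maps to a list of values."""
    record: Dict[str, list] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):  # end-of-record marker
            if record:
                yield record
            record = {}
            continue
        parts = line.split(None, 1)  # line type, then the rest of the line
        tag = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if tag in QUALIFIED:
            value = dict(QUALIFIER_RE.findall(rest))  # qualifiers as a dict
        else:
            value = rest  # plain text value
        record.setdefault(tag, []).append(value)
    if record:  # tolerate a missing final "//"
        yield record


# Placeholder record for illustration only (not real UniGene data).
SAMPLE = """\
ID          Xx.1
TITLE       Hypothetical example cluster
SCOUNT      2
SEQUENCE    ACC=EXAMPLE0001.1; NID=g1; SEQTYPE=mRNA
SEQUENCE    ACC=EXAMPLE0002.1; NID=g2; CLONE=EXAMPLE:1; END=5'; SEQTYPE=EST
//
"""

records = list(parse_records(SAMPLE.splitlines()))
for rec in records:
    ests = [s for s in rec.get("SEQUENCE", []) if s.get("SEQTYPE") == "EST"]
    print(rec["ID"][0], "sequences:", rec["SCOUNT"][0], "ESTs:", len(ests))
```

Every line type is stored as a list because SEQUENCE, STS and similar lines
can occur more than once per cluster. If clone names or other values in a real
build contain spaces, the qualifier pattern would need to be adjusted.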