Universal Protein Resource (UniProt)
====================================

The Universal Protein Resource (UniProt), a collaboration between the European
Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics, and
the Protein Information Resource (PIR), comprises three databases, each
optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensively curated protein information, including
function, classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed up
sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique identifiers and
sequences.

Genome annotation tracks
========================

UniProt genome annotation tracks represent the mapping of a species' reference
proteome and sequence annotations to the species' reference genome. The genome
track files provided can be loaded into a genome browser as custom tracks to
visualise UniProtKB annotations within the genomic context.

This directory, uniprot/current_release/knowledgebase/genome_annotation_tracks/,
contains the following directories and files, updated every eight weeks:

  README (this file)
  relnotes.txt
  __beds/
  __hub/
  __tracks.txt

File names:
-----------

- The UPID is the unique identifier of the UniProtKB reference proteome and the
  TaxID is the corresponding taxonomic identifier.
- The relnotes.txt file contains the corresponding species scientific name,
  UPID and TaxID.

File format:
------------

1. relnotes.txt
   A tab delimited file containing
   - Species: Name of the source organism.
   - UPID: The unique identifier of the UniProtKB reference proteome.
   - TaxID: Taxonomic identifier.

2. __.bed
   A BED detail formatted tab delimited file containing
   - Chromosome name.
   - Annotation start coordinate on the chromosome.
   - Annotation end coordinate on the chromosome.
   - UniProtKB accession, BED line name.
   - Score set to 0 as default.
   - DNA strand +/- for forward or reverse.
   - Thick start coordinate on the chromosome.
   - Thick end coordinate on the chromosome.
   - Annotation color (RGB).
   - Number of blocks representing the annotation.
   - Block sizes, a comma separated list.
   - Block starts, a comma separated list of block offsets relative to the
     annotation start.
   - Annotation identifier. (accession in proteome file)
   - Annotation description, a semi-colon (;) separated list that can consist of:
     1. amino acid or amino acid range the UniProt annotation covers or amino
        acid change (variants only).
     2. annotation description
     3. disease name and OMIM identifier (variants only)
     4. PubMed literature evidence if available.
     This column has a maximum of 254 characters.
   Missing values are represented by dots.

   Note: Bed files contain a basic header line for uploading into genome
   browsers. The Bed files can be uploaded into the UCSC genome browser if the
   mitochondrial records are deleted. UCSC uses chrM instead of MT for the
   mitochondrial DNA. The Ensembl genome browser currently does not support the
   Bed detail file format used by UniProt for its Bed files. To visualise
   UniProtKB annotations with the UCSC or Ensembl genome browsers, it is
   recommended to use the bigbed (.bb) binary compressed versions of the bed
   files directly, to upload the track definitions from the __tracks.txt file,
   or to use the track hub.

3. __.bb
   An extended binary compressed version of the bed detail file containing
   - Chromosome name.
   - Annotation start coordinate on the chromosome.
   - Annotation end coordinate on the chromosome.
   - UniProtKB annotation display name.
   - Score set to 0 as default.
   - DNA strand +/- for forward or reverse.
   - Thick start coordinate on the chromosome.
   - Thick end coordinate on the chromosome.
   - Annotation color (RGB).
   - Number of blocks representing the annotation.
   - Block sizes, a comma separated list.
   - Block starts, a comma separated list of block offsets relative to the
     annotation start.
   - UniProtKB accession
   - UniProtKB annotation detail description name.
   - Annotation identifier. (accession in proteome file)
   - Annotation position, the amino acid or amino acid range the UniProt
     annotation covers (not available in proteome file)
   - Disease name (variants only)
   - variant - HGVS Coding sequence mutation (variants only)
   - variant database cross references (variants only)
   - Annotation description (in proteome file this is a list of amino acids
     peptides in each exon)
   - Annotation PubMed literature evidence (not available in proteome file)
   Missing values are represented by dots.

4. __tracks.txt
   Track definitions for each UniProtKB annotation type, defining: genome
   version, URL for the annotation big bed file and initial visualisation
   settings. Currently only supported by the UCSC genome browser.

WARNING
-------

The Bed files and track hub are provided for use in track enabled genome
browsers. The genomic coordinates for protein annotations are defined with a one
base adjustment for the conversion between 1 based genomic coordinates and 0
based bed file block starts as defined by the UCSC Bed format. Users wishing to
have genomic coordinates for UniProt annotations are recommended to use the
Proteins API coordinates service (https://www.ebi.ac.uk/proteins/api). Users who
wish to use these files for the genomic coordinates of UniProt annotations have
to subtract one from the ChromEnd genomic coordinate of the annotation.

Directory format:
-----------------

__beds/
    Contains a bed detail formatted file for each UniProtKB sequence annotation
    type.

__hub/
    This directory contains track hub definition files and a directory.

Hub files:
----------

hub.txt     - track hub definition file
genomes.txt - genome and track database definitions

Hub directory:
--------------

/ - The genome directory.
    For example, hg38.

Hub directory files:
--------------------

/trackDB.txt - UniProtKB annotation files track definitions.
/__proteome.bb - Bigbed representation of the species proteome sequences.
/__act_site.bb - Bigbed representation of active site sequence annotations
    within the species.
/__binding.bb - Bigbed representation of binding site sequence annotations
    within the species.
/__ca_bind.bb - Bigbed representation of Calcium binding site sequence
    annotations within the species.
/__carbohyd.bb - Bigbed representation of glycosylation sequence annotations
    within the species.
/__chain.bb - Bigbed representation of chain sequence annotations within the
    species.
/__coiled.bb - Bigbed representation of coiled coil sequence annotations within
    the species.
/__crosslnk.bb - Bigbed representation of cross-link between proteins sequence
    annotations within the species.
/__disulfide.bb - Bigbed representation of disulfide bond sequence annotations
    within the species.
/__dna_bind.bb - Bigbed representation of DNA binding domain sequence
    annotations within the species.
/__domain.bb - Bigbed representation of modular protein domain sequence
    annotations within the species.
/__helix.bb - Bigbed representation of helical regions within the
    experimentally determined protein structure.
/__init_met.bb - Bigbed representation of initiator methionine cleavage during
    N-terminal direct protein sequencing annotations within the species.
/__intramem.bb - Bigbed representation of intra-membrane sequence annotations
    within the species.
/__lipid.bb - Bigbed representation of covalently attached lipid groups
    sequence annotations within the species.
/__metal.bb - Bigbed representation of metal ion binding site sequence
    annotations within the species.
/__mod_res.bb - Bigbed representation of modified residues, excluding lipids,
    glycans and protein cross-links, sequence annotations within the species.
/__motif.bb - Bigbed representation of short sequence motif of biological
    interest sequence annotations within the species.
/__non_std.bb - Bigbed representation of non-standard amino acids
    (selenocysteine and pyrrolysine) annotations within the species.
/__np_bind.bb - Bigbed representation of Nucleotide Phosphate binding region
    sequence annotations within the species.
/__peptide.bb - Bigbed representation of active peptide sequence annotations
    within the species.
/__propep.bb - Bigbed representation of a polypeptide that needs to be cleaved
    for the protein to mature or to be activated annotations within the species.
/__region.bb - Bigbed representation of region of interest sequence annotations
    within the species.
/__repeat.bb - Bigbed representation of repeated sequence motifs or repeated
    domains sequence annotations within the species.
/__signal.bb - Bigbed representation of signal sequence annotations within the
    species.
/__site.bb - Bigbed representation of interesting single amino acid site
    sequence annotations within the species.
/__strand.bb - Bigbed representation of Beta strand regions within the
    experimentally determined protein structure.
/__topo_dom.bb - Bigbed representation of topological domain sequence
    annotations within the species.
/__transit.bb - Bigbed representation of transit peptide for organelle
    targeting sequence annotations within the species.
/__transmem.bb - Bigbed representation of transmembrane sequence annotations
    within the species.
/__turn.bb - Bigbed representation of turns within the experimentally
    determined protein structure.
/__variants.bb - Bigbed representation of natural variant sequence annotations
    within the species.
/__zn_fing.bb - Bigbed representation of Zinc finger sequence annotations
    within the species.
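
Example: reading a BED detail record
------------------------------------

As an illustration of the BED detail columns listed under "File format", the
following Python sketch reads one tab delimited record, turns the block sizes
and block starts into per-block intervals, splits the description column on
semi-colons, and skips MT records as the note on UCSC uploads suggests. It is
not part of the UniProt distribution. The names BedDetail, parse_bed_detail and
non_mitochondrial_records, the sample values, and the assumption that header
lines begin with "track", "browser" or "#" are all hypothetical. Intervals are
returned exactly as stored in the file; the one base ChromEnd adjustment
described under WARNING is not applied.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class BedDetail(NamedTuple):
    """A subset of the 14 BED detail columns described in this README."""
    chrom: str
    start: int
    end: int
    accession: str
    strand: str
    blocks: List[Tuple[int, int]]  # (start, end) per block, as stored in the file
    annotation_id: str
    description: List[str]  # parts of the semi-colon separated description


def _int_list(field: str) -> List[int]:
    """Split a comma separated list, tolerating a trailing comma or a dot."""
    return [int(item) for item in field.split(",") if item and item != "."]


def parse_bed_detail(line: str) -> BedDetail:
    """Parse one tab delimited BED detail record (not a header line)."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 14:
        raise ValueError(f"expected 14 columns, found {len(cols)}")
    start = int(cols[1])
    sizes = _int_list(cols[10])
    offsets = _int_list(cols[11])
    # Block starts are offsets relative to the annotation start.
    blocks = [(start + off, start + off + size) for off, size in zip(offsets, sizes)]
    # Missing values are represented by dots.
    raw = cols[13]
    description = [] if raw == "." else [p.strip() for p in raw.split(";") if p.strip()]
    return BedDetail(cols[0], start, int(cols[2]), cols[3], cols[5],
                     blocks, cols[12], description)


def non_mitochondrial_records(lines: Iterable[str]) -> Iterator[BedDetail]:
    """Yield parsed records, skipping blank lines, header lines and MT records.

    Assumption: header lines start with "track", "browser" or "#".
    """
    for line in lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        record = parse_bed_detail(line)
        if record.chrom != "MT":
            yield record


# Made-up sample values, for illustration only.
SAMPLE = "\t".join([
    "1", "1000", "1100", "P12345", "0", "+", "1000", "1100", "255,0,0",
    "2", "30,40", "0,60", "EXAMPLE_1", "10-20; example annotation",
])

if __name__ == "__main__":
    print(parse_bed_detail(SAMPLE).blocks)  # [(1000, 1030), (1060, 1100)]
```

A generator is used so that a large Bed file can be streamed line by line
rather than loaded into memory at once.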
--------------------------------------------------------------------------------
LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution 4.0 International
(CC BY 4.0) License (https://creativecommons.org/licenses/by/4.0/) to all
copyrightable parts of our databases.

(c) 2002-2021 UniProt Consortium

--------------------------------------------------------------------------------
DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by patents
or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.