Universal Protein Resource (UniProt)
====================================

The Universal Protein Resource (UniProt), a collaboration between the
European Bioinformatics Institute (EBI), the SIB Swiss Institute of
Bioinformatics, and the Protein Information Resource (PIR), comprises
three databases, each optimized for different uses. The UniProt
Knowledgebase (UniProtKB) is the central access point for extensively
curated protein information, including function, classification and
cross-references. The UniProt Reference Clusters (UniRef) combine
closely related sequences into a single record to speed up sequence
similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique
identifiers and sequences.

UniProt Reference Clusters (UniRef)
=================================================

The UniProt Reference Clusters (UniRef) provide clustered sets
(UniRef100, UniRef90 and UniRef50 clusters) of sequences from the
UniProt Knowledgebase and selected UniParc records, in order to obtain
complete coverage of sequence space at several resolutions (100%, >90%
and >50%) while hiding redundant sequences (but not their descriptions)
from view.

UniRef100
=========

UniRef100 contains all records in the UniProt Knowledgebase and selected
UniParc records (see below). In UniRef100, identical sequences and
subfragments are placed into a single cluster using the CD-HIT
algorithm. The longest member of each cluster (the seed sequence) is
used to generate UniRef90. However, the longest sequence is not always
the most informative. There is often more biologically relevant
information and annotation (name, function, cross-references) available
on other cluster members. All the proteins in each cluster are ranked to
facilitate the selection of a biologically relevant representative for
the cluster. The proteins are ranked as follows:

  1. quality of annotation: the order of preference is a member from
     UniProtKB/Swiss-Prot, then UniProtKB/TrEMBL, and lastly UniParc
  2. annotation score: prefer entries that have a higher UniProtKB
     Annotation Score
  3. organism: prefer entries from Reference proteomes and Model
     Organisms
  4. sequence length: the longest sequence is preferred.

As new proteins are added to UniProtKB and UniParc, UniRef cluster
memberships and/or identifiers might change.

The UniRef100 identifier is generated by placing the "UniRef100_" prefix
before the UniProtKB accession number or UniParc identifier of the
representative UniProtKB or UniParc entry.

UniParc records in UniRef100
----------------------------

In addition to UniProtKB records, UniRef100 also includes selected
UniParc entries that are not covered by UniProtKB and contain
cross-references to the following databases:

  - RefSeq
  - PDB

FTP access
==========

Currently, UniRef100 is available from the UniProt FTP site:

  ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100

The UniRef100 files and their descriptions are as follows:

File Name         File Description
---------------   -----------------------------------------------------------
uniref100.fasta   This file contains all UniRef100 entries in FASTA format.
                  The definition line in the FASTA format includes
                  cluster-specific information such as the cluster name,
                  the number of members and the common taxonomy, as well as
                  the ID of the representative protein. The format is as
                  follows:

                  >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember

                  where:
                  - UniqueIdentifier is the primary accession number of
                    the UniRef cluster.
                  - ClusterName is the name of the UniRef cluster.
                  - Members is the number of UniRef cluster members.
                  - Taxon is the scientific name of the lowest common
                    taxon shared by all UniRef cluster members.
                  - RepresentativeMember is the entry name of the
                    representative member of the UniRef cluster.
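
The fields of the definition line described above can be pulled apart
with a single regular expression. The following is only a minimal Python
sketch: the names UniRefHeader and parse_uniref_header are hypothetical
and not part of any UniProt tool, and the pattern assumes exactly the
layout documented here, so lines carrying additional fields would need a
wider pattern.

```python
import re
from typing import NamedTuple


class UniRefHeader(NamedTuple):
    """Fields of one UniRef FASTA definition line (hypothetical container)."""
    cluster_id: str    # UniqueIdentifier
    cluster_name: str  # ClusterName (may contain spaces)
    members: int       # n=Members
    taxon: str         # Tax=Taxon (may contain spaces)
    rep_id: str        # RepID=RepresentativeMember


# Assumed layout, taken from the format line in this README:
# >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
_HEADER_RE = re.compile(
    r"^>(?P<cluster_id>\S+)\s+"
    r"(?P<cluster_name>.*?)\s+"
    r"n=(?P<members>\d+)\s+"
    r"Tax=(?P<taxon>.*?)\s+"
    r"RepID=(?P<rep_id>\S+)\s*$"
)


def parse_uniref_header(line: str) -> UniRefHeader:
    """Parse one definition line; raise ValueError if it does not match."""
    match = _HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniRef definition line: {line!r}")
    return UniRefHeader(
        cluster_id=match["cluster_id"],
        cluster_name=match["cluster_name"],
        members=int(match["members"]),
        taxon=match["taxon"],
        rep_id=match["rep_id"],
    )
```

Because ClusterName and Taxon may both contain spaces, the sketch anchors
on the literal "n=", "Tax=" and "RepID=" markers rather than splitting
the line on whitespace.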
                  For example:

                  >UniRef100_P99999 Cytochrome c n=5 Tax=Hominidae RepID=CYC_HUMAN

uniref100.xml     This file contains all UniRef100 entries in XML format.
                  Each entry is identified by the UniRef identifier, and
                  contains:
                  - a cross-reference to the representative UniProtKB or
                    UniParc entry, and its sequence
                  - a flag on the cluster member that served as the seed
                    sequence
                  - cross-references to member UniProtKB and/or UniParc
                    entries
                  - cross-references to UniRef50 and UniRef90 entries
                  - the member count
                  - the common taxon

Document type definition for uniref100.xml
------------------------------------------

]>

--------------------------------------------------------------------------------
LICENSE
--------------------------------------------------------------------------------

We have chosen to apply the Creative Commons Attribution (CC BY 4.0)
License (https://creativecommons.org/licenses/by/4.0/) to all
copyrightable parts of our databases.

(c) 2002-2022 UniProt Consortium

--------------------------------------------------------------------------------
DISCLAIMER
--------------------------------------------------------------------------------

We make no warranties regarding the correctness of the data, and
disclaim liability for damages resulting from its use. We cannot provide
unrestricted permission regarding the use of the data, as some data may
be covered by patents or other rights.

Any medical or genetic information is provided for research, educational
and informational purposes only. It is not in any way intended to be
used as a substitute for professional medical advice, diagnosis,
treatment or care.