Universal Protein Resource (UniProt)
====================================


The Universal Protein Resource (UniProt), a collaboration between the European
Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics, and
the Protein Information Resource (PIR), comprises three databases, each
optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensively curated protein information, including
function, classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed up
sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique identifiers and
sequences.

UniProt Reference Clusters (UniRef)
===================================

The UniProt Reference Clusters (UniRef) provide clustered sets (UniRef100, UniRef90
and UniRef50 clusters) of sequences from the UniProt Knowledgebase and selected UniParc
records, in order to obtain complete coverage of sequence space at several resolutions
(100%, >90% and >50%) while hiding redundant sequences (but not their descriptions)
from view.

UniRef50
========

UniRef50 clusters are generated from the UniRef90 seed sequences with a 50% sequence
identity threshold using the MMseqs2 algorithm. The seed sequences are the longest 
members of the UniRef90 cluster. However, the longest sequence is not always the 
most informative. There is often more biologically relevant information and annotation
(name, function, cross-references) available on other cluster members. All the proteins
in each cluster are ranked to facilitate the selection of a biologically relevant
representative for the cluster. The proteins are ranked as follows:

1. quality of annotation: a member from UniProtKB/Swiss-Prot is preferred, then
   UniProtKB/TrEMBL, and lastly UniParc
2. annotation score: entries with a higher UniProtKB annotation score are preferred
3. organism: entries from reference proteomes and model organisms are preferred
4. sequence length: the longest sequence is preferred

As new proteins are added to UniProtKB and UniParc, UniRef cluster memberships and/or
identifiers might change.
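
The ranking criteria above can be read as a composite sort key. The following is a
minimal sketch, not UniProt's implementation: it assumes the four criteria apply in
strict order of precedence, and the Member record, its field names, the numeric
source ranks and the accessions ACC1 to ACC4 are all hypothetical, introduced only
for illustration.

```python
from dataclasses import dataclass


# Hypothetical record; the field names are illustrative, not a UniProt schema.
@dataclass
class Member:
    accession: str
    source: str            # "Swiss-Prot", "TrEMBL" or "UniParc"
    annotation_score: int  # higher is better; 0 used here when there is none
    is_reference: bool     # from a reference proteome or model organism
    length: int            # sequence length in residues


# Lower rank means a more preferred source (criterion 1).
SOURCE_RANK = {"Swiss-Prot": 0, "TrEMBL": 1, "UniParc": 2}


def rank_key(member):
    # Tuples compare element by element, so earlier criteria dominate later ones.
    return (
        SOURCE_RANK[member.source],  # 1. quality of annotation
        -member.annotation_score,    # 2. higher annotation score first
        not member.is_reference,     # 3. reference/model organisms first
        -member.length,              # 4. longest sequence first
    )


def pick_representative(members):
    # The best-ranked member has the smallest key.
    return min(members, key=rank_key)


# Invented members, for demonstration only.
members = [
    Member("ACC1", "TrEMBL", 5, True, 400),
    Member("ACC2", "Swiss-Prot", 3, False, 105),
    Member("ACC3", "Swiss-Prot", 3, False, 104),
    Member("ACC4", "UniParc", 0, False, 900),
]
representative = pick_representative(members)
print(representative.accession)
```

In this invented cluster the longest sequence (ACC4) and the highest-scoring entry
(ACC1) both lose to a Swiss-Prot member, because the annotation source is compared
first; sequence length only breaks the tie between ACC2 and ACC3.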

UniRef50 cluster titles and identifiers are derived from the representative UniRef90
entry. The UniRef50 identifier is generated by replacing the "UniRef90_" prefix of
the representative's identifier with "UniRef50_".
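
The prefix replacement described above could be sketched as follows. The function
name is hypothetical, and rejecting identifiers that lack the expected prefix is an
assumption made for this example.

```python
def uniref50_id_from_uniref90(uniref90_id):
    """Derive a UniRef50 identifier from a representative UniRef90 identifier."""
    old_prefix, new_prefix = "UniRef90_", "UniRef50_"
    if not uniref90_id.startswith(old_prefix):
        # Assumption for this sketch: anything without the prefix is an error.
        raise ValueError("not a UniRef90 identifier: %r" % uniref90_id)
    # Replace only the leading prefix, leaving the rest of the identifier intact.
    return new_prefix + uniref90_id[len(old_prefix):]


uniref50_id = uniref50_id_from_uniref90("UniRef90_P99999")
print(uniref50_id)
```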

FTP access
==========

Currently, UniRef50 is available from the UniProt FTP site:

        ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50

The UniRef50 files and their descriptions are as follows:

File Name       File Description
-------------   -----------------------------------------------------------
uniref50.fasta  This file contains all UniRef50 entries in FASTA format. 
                The definition line in the FASTA format includes cluster-specific
                information such as the cluster name, the number of members, the
                common taxonomy, and the ID of the representative protein.
                The format is as follows:
                >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
                where:
                - UniqueIdentifier is the primary accession number of the UniRef cluster.
                - ClusterName is the name of the UniRef cluster.
                - Members is the number of UniRef cluster members.
                - Taxon is the scientific name of the lowest common taxon shared
                  by all UniRef cluster members.
                - RepresentativeMember is the entry name of the representative member
                  of the UniRef cluster.
                For example:
                >UniRef50_P99999 Cytochrome c n=33 Tax=Eukaryota RepID=CYC_HUMAN
 
uniref50.xml    This file contains all UniRef50 entries in XML format. Each entry is
                identified by the UniRef identifier, and contains:
                - cross-reference to representative UniProtKB or UniParc entry and its 
                  sequence
                - cluster member that served as the seed sequence is flagged 
                - cross-references to member UniProtKB and/or UniParc entries
                - cross-references to UniRef90 and UniRef100 entries
                - member count
                - common taxon
                        

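The definition line format described for uniref50.fasta can be split into its
fields with a regular expression. This is a minimal sketch that assumes the line
contains exactly the documented fields in the documented order; the function name
and the dictionary keys are hypothetical.

```python
import re

# Pattern for the documented definition line. It assumes the line holds exactly
# these fields in this order; extra fields would need a wider pattern.
HEADER_RE = re.compile(
    r"^>(?P<cluster_id>\S+) (?P<name>.+?) n=(?P<members>\d+) "
    r"Tax=(?P<taxon>.+?) RepID=(?P<rep_id>\S+)$"
)


def parse_header(line):
    """Return the fields of a UniRef FASTA definition line as a dict."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized UniRef definition line: %r" % line)
    fields = match.groupdict()
    fields["members"] = int(fields["members"])  # member count as an integer
    return fields


# The example definition line given in the file description.
header = parse_header(
    ">UniRef50_P99999 Cytochrome c n=33 Tax=Eukaryota RepID=CYC_HUMAN"
)
print(header["cluster_id"], header["members"], header["rep_id"])
```

Cluster names and taxon names may contain spaces, which is why the sketch anchors
on the " n=", " Tax=" and " RepID=" markers rather than splitting on whitespace.
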
Document type definition for uniref50.xml   
------------------------------------------
<?xml version="1.0" encoding="ASCII"?>
<!DOCTYPE UniRef50 [
<!ELEMENT UniRef50 (entry+)>
<!ATTLIST UniRef50 
                    xmlns CDATA #FIXED "http://uniprot.org/uniref"
                    xmlns:xsi CDATA #IMPLIED
                    xsi:schemaLocation CDATA #IMPLIED
                    releaseDate    CDATA #IMPLIED
                    version        CDATA #IMPLIED
>

<!-- entry: UniRef50 entry -->
<!ELEMENT entry (name,property*,representativeMember,member*)> 
<!ATTLIST entry  id             ID    #REQUIRED
                 updated        CDATA #IMPLIED 
>

<!-- name: UniRef50 cluster name derived from representative --> 
<!-- UniRef90 entry  -->
<!ELEMENT name  (#PCDATA)>


<!-- representativeMember: information for representative -->
<!-- UniRef100 entry  -->
<!ELEMENT representativeMember (dbReference,sequence)>

<!-- member: a member of the UniRef50 cluster other than the representative --> 
<!ELEMENT member (dbReference)>

<!-- dbReference: cross-reference to member UniRef100 entries  -->
<!-- of the UniRef50 cluster --> 
<!ELEMENT dbReference (property*)>
<!ATTLIST dbReference
    type CDATA #REQUIRED 
    id 	 CDATA #REQUIRED 
> 

<!-- property: properties of cross-references -->
<!ELEMENT property EMPTY>
<!ATTLIST property
    type CDATA #REQUIRED
    value CDATA #REQUIRED
>

<!ELEMENT sequence (#PCDATA ) >
<!ATTLIST sequence
    length CDATA #IMPLIED
    checksum CDATA #IMPLIED
>

]>
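
A file that follows this document type definition can be read entry by entry
rather than loaded whole. The sketch below uses Python's standard
xml.etree.ElementTree module on a small in-memory sample. The sample entry is
invented for illustration: its property type names, dbReference type values, member
ID and sequence are placeholders, not values taken from a UniRef50 release. Only
the element names, attribute names and namespace come from the definition above.

```python
import io
import xml.etree.ElementTree as ET

# Namespace fixed by the DTD, in ElementTree's "{uri}tag" notation.
NS = "{http://uniprot.org/uniref}"

# Invented sample shaped like the DTD; the attribute values are placeholders.
SAMPLE = b"""<UniRef50 xmlns="http://uniprot.org/uniref">
  <entry id="UniRef50_P99999">
    <name>Cytochrome c</name>
    <property type="member count" value="2"/>
    <representativeMember>
      <dbReference type="UniProtKB ID" id="CYC_HUMAN"/>
      <sequence length="5">MGDVE</sequence>
    </representativeMember>
    <member>
      <dbReference type="UniProtKB ID" id="EXAMPLE_MEMBER"/>
    </member>
  </entry>
</UniRef50>
"""


def iter_entries(stream):
    """Yield one dict per <entry>, freeing each element once it is consumed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        rep = elem.find(NS + "representativeMember")
        yield {
            "id": elem.get("id"),
            "name": elem.findtext(NS + "name"),
            # Entry-level properties only (direct children of <entry>).
            "properties": {
                p.get("type"): p.get("value")
                for p in elem.findall(NS + "property")
            },
            "representative": rep.find(NS + "dbReference").get("id"),
            "sequence": rep.findtext(NS + "sequence").strip(),
            "members": [
                m.find(NS + "dbReference").get("id")
                for m in elem.findall(NS + "member")
            ],
        }
        elem.clear()  # drop the entry's children to keep memory use low


entries = list(iter_entries(io.BytesIO(SAMPLE)))
print(entries[0]["id"], entries[0]["representative"], entries[0]["members"])
```

Passing an open file object for uniref50.xml in place of the in-memory sample
would process the real file in the same streaming fashion.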

--------------------------------------------------------------------------------
  LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution (CC BY 4.0) License
(https://creativecommons.org/licenses/by/4.0/) to all copyrightable parts of
our databases.

(c) 2002-2022 UniProt Consortium

--------------------------------------------------------------------------------
  DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by patents
or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.