Universal Protein Resource (UniProt)
====================================

The Universal Protein Resource (UniProt), a collaboration between the European
Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics, and
the Protein Information Resource (PIR), comprises three databases, each
optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensively curated protein information, including
function, classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed up
sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique identifiers and
sequences.


Proteomics mapping
==================

UniProt runs an analysis pipeline that maps publicly available mass
spectrometry-based data to UniProtKB sequences. The pipeline consists of the
following steps:

I) Species-specific lists of identified peptides are retrieved from the mass
spectrometry (MS) proteomics repositories indicated in the output files. These
peptides are then filtered based on the quality metrics metadata provided by
the MS proteomics repositories.

II) All sequences (canonical sequences plus isoforms) from the UniProtKB
reference proteome of a specific taxonomic identifier are digested in silico
with up to nine cleavage rules, allowing up to two missed cleavages, with or
without initiator methionine cleavage.

III) The in-silico generated peptides are then filtered to remove peptides
shorter than seven amino acids and peptides containing X, B or Z.

IV) Uniqueness of the in-silico generated peptides is evaluated according to
the gene groups underlying the UniProtKB reference proteomes, in which protein
sequences are grouped based on the gene(s) encoding them. Each gene group
consists of one or more UniProtKB sequences.
A peptide is considered unique if it belongs to only one gene group. Two types
of in-silico generated peptides are therefore obtained: unique and non-unique.

V) The two types of in-silico generated peptides are compared to the
corresponding species lists of identified peptides from the MS proteomics
repositories, and those with a 100% sequence match are exported to the
corresponding output files described below. UniProtKB identifiers with only
one mapped peptide of seven, eight or nine amino acids in length are removed.

This directory, uniprot/current_release/knowledgebase/proteomics_mapping/,
contains the following files, updated every eight weeks:

README (this file)
<UPID>_<TaxID>_uniquePeptides.tsv
<UPID>_<TaxID>_nonUniquePeptides.tsv
publications.txt
relnotes.txt


File names:
-----------

- The <UPID> is the unique identifier of the UniProtKB reference proteome and
  the <TaxID> is the corresponding taxonomic identifier.
- *uniquePeptides.tsv files contain the mapping of unique peptides.
- *nonUniquePeptides.tsv files contain the mapping of non-unique peptides.
- The publications.txt file contains species-specific lists of PubMed unique
  identifiers (PMIDs) associated with some of the original data sets used by
  the MS proteomics repositories.
- The relnotes.txt file contains species-specific statistics.


File format:
------------

The files are in tab-separated format. The columns they contain are described
below.

1. <UPID>_<TaxID>_uniquePeptides.tsv

- Peptide: Sequence of the peptide.
- Repository columns (e.g. PeptideAtlas, MaxQB, ...): 'N' denotes absence, 'Y'
  denotes presence of the peptide in the indicated repository. Note that the
  number and types of these columns depend on the species (<TaxID>) and might
  change over time with data availability from repositories.
- GeneGroupSymbol: Representative gene symbol for the gene group. A dash
  symbol ('-') is used when the gene encoding a protein is unknown.
- Accessions: Comma-separated list of UniProtKB identifiers (accession numbers
  and/or isoform identifiers).

Example:

Peptide        PeptideAtlas  MaxQB  GeneGroupSymbol  Accessions
TNFFVNGLLDLVK  Y             Y      TRIM56           C9JI91,Q9BRZ2,Q9BRZ2-2,Q9BRZ2-3
RVLSLGR        Y             N      TRIM56           Q9BRZ2
LDEAFEFVK      N             Y      DUSP1            P28562

2. <UPID>_<TaxID>_nonUniquePeptides.tsv

- Peptide: Sequence of the peptide.
- Repository columns (e.g. PeptideAtlas, MaxQB, ...): 'N' denotes absence, 'Y'
  denotes presence of the peptide in the indicated repository. Note that the
  number and types of these columns depend on the species (<TaxID>) and might
  change over time with data availability from repositories.
- GeneGroupSymbol: List of representative gene symbols, one per gene group,
  separated by a comma followed by a space. A dash symbol ('-') is used when
  the gene encoding a protein is unknown.
- Accessions: List of UniProtKB identifiers (accession numbers and/or isoform
  identifiers) separated by commas and semicolons. A comma separates
  identifiers from the same gene group; a semicolon separates different gene
  groups.

Example:

Peptide   PeptideAtlas  MaxQB  GeneGroupSymbol      Accessions
DGELWNK   Y             Y      BUB1B, PAK6          O60566,O60566-2,O60566-3;H3BTB9
IYVSDDGK  Y             N      AMY1A, AMY2A, AMY2B  P04745;P04746;P19961
IADFGFAR  N             Y      ULK2, ULK1           Q8IYT8;O75385

3. publications.txt

- UPID: The unique identifier of the UniProtKB reference proteome (<UPID>).
- TaxID: Taxonomic identifier (<TaxID>).
- PMID: Comma-separated list of PMIDs.

4. relnotes.txt

- Species: Name of the source organism.
- UPID: The unique identifier of the UniProtKB reference proteome (<UPID>).
- TaxID: Taxonomic identifier (<TaxID>).
- Sequences: Number of UniProtKB sequences processed by the pipeline.
- WithUniqueEvidence: Number of distinct UniProtKB identifiers (accession
  numbers and/or isoform identifiers) reported in the
  <UPID>_<TaxID>_uniquePeptides.tsv file.
- UniquePeps: Number of peptides reported in the
  <UPID>_<TaxID>_uniquePeptides.tsv file.
- WithNonUniqueEvidence: Number of distinct UniProtKB identifiers (accession
  numbers and/or isoform identifiers) reported in the
  <UPID>_<TaxID>_nonUniquePeptides.tsv file.
- NonUniquePeps: Number of peptides reported in the
  <UPID>_<TaxID>_nonUniquePeptides.tsv file.

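As an illustration of the peptide file layout described above, the following
Python sketch reads a mapping file and splits the GeneGroupSymbol and
Accessions columns into gene groups. It is not part of the UniProt pipeline:
the names parse_gene_groups and read_mapping are hypothetical, and the sketch
assumes that a header row is present and that every column other than Peptide,
GeneGroupSymbol and Accessions is a repository column holding 'Y' or 'N'. The
sample rows are taken from the non-unique example in this README.

```python
from __future__ import annotations

import csv
import io

# Columns with a fixed meaning; all other columns are treated as repositories.
FIXED_COLUMNS = {"Peptide", "GeneGroupSymbol", "Accessions"}


def parse_gene_groups(symbols: str, accessions: str) -> list[tuple[str, list[str]]]:
    """Pair each gene symbol with the identifiers of its gene group.

    Symbols are separated by ", "; gene groups by ";"; identifiers within a
    gene group by ",". A unique peptide simply yields a single pair.
    """
    symbol_list = symbols.split(", ")
    groups = [group.split(",") for group in accessions.split(";")]
    if len(symbol_list) != len(groups):
        raise ValueError(
            f"{len(symbol_list)} gene symbols but {len(groups)} gene groups"
        )
    return list(zip(symbol_list, groups))


def read_mapping(handle):
    """Yield one dictionary per peptide row of a tab-separated mapping file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Repository columns vary by species, so they are discovered from the header.
    repo_columns = [c for c in reader.fieldnames if c not in FIXED_COLUMNS]
    for row in reader:
        yield {
            "peptide": row["Peptide"],
            "repositories": [c for c in repo_columns if row[c] == "Y"],
            "gene_groups": parse_gene_groups(
                row["GeneGroupSymbol"], row["Accessions"]
            ),
        }


# Sample rows copied from the non-unique example in this README.
SAMPLE = (
    "Peptide\tPeptideAtlas\tMaxQB\tGeneGroupSymbol\tAccessions\n"
    "DGELWNK\tY\tY\tBUB1B, PAK6\tO60566,O60566-2,O60566-3;H3BTB9\n"
    "IYVSDDGK\tY\tN\tAMY1A, AMY2A, AMY2B\tP04745;P04746;P19961\n"
)

records = list(read_mapping(io.StringIO(SAMPLE)))
for record in records:
    print(record["peptide"], record["repositories"], record["gene_groups"])
```

The same function handles both file types, because a row from a
uniquePeptides.tsv file contains no semicolon and therefore parses as a single
gene group. To read a downloaded file, pass an open file handle instead of the
in-memory sample.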
--------------------------------------------------------------------------------
LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution (CC BY 4.0) License
(https://creativecommons.org/licenses/by/4.0/) to all copyrightable parts of
our databases.

(c) 2002-2021 UniProt Consortium

--------------------------------------------------------------------------------
DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by
patents or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.