--------------------------------------------------------------------------------
README for UniVec vector origin files:

    artificial_whole_UniVec_entries.txt
    artificial_intervals_5column.txt
    biological_intervals_5column.txt

Last Updated 24 October 2017
................................................................................

BACKGROUND
==========

Sequences in UniVec may be completely artificial or may have parts that are
derived from real biological sources. Many vector sequences in UniVec have
multiple origins for distinct sub-intervals.

The files described below contain information about the origins of vector
segments that we could confirm by comparing UniVec against the NCBI nr BLAST
database and various assembled genomes. The process of discovering vector
origins is ongoing, and we are likely to add more information on vector
origins without updating UniVec itself.

The information in the vector origin files is being used internally at NCBI to
evaluate whether matches to sequences in the UniVec database represent true
contamination or matches based on a common taxonomic origin. (Note: the files
used internally may be a slightly different version from those provided in
this FTP directory.)

Related software for running VecScreen and for interpreting and annotating its
output matches, taking taxonomy into account, is under development and can be
found at:

    https://github.com/aaschaffer/vecscreen_plus_taxonomy

If a sequence S matches a vector interval V and the two share their taxonomy
at genus level, then we judge that the match is a false positive. In some
cases, a similar rule of inference can be applied at higher taxonomic levels
(e.g., family or order), so we report vector segments at higher levels in such
cases. Further investigation is needed to decide which intervals should be
raised above genus level and by how many levels.

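The rule above can be sketched in a few lines of Python. This is only an
illustration, not the code used at NCBI or in vecscreen_plus_taxonomy. It
assumes that the columns of the five-column files (see FILE DESCRIPTIONS) are
separated by whitespace, that a match is relevant to an origin interval
whenever the two overlap by at least one base, and that the caller already
has the set of taxids on the taxonomic lineage of S. The function names, the
UniVec ID "gnl|uv|EXAMPLE.1", and both example rows are invented for this
sketch.

```python
from typing import Iterable, List, NamedTuple, Set


class OriginInterval(NamedTuple):
    univec_id: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    taxid: int
    name: str


def parse_five_column(lines: Iterable[str]) -> List[OriginInterval]:
    """Parse five-column lines (assumption: whitespace-separated columns)."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first four whitespace runs only, so that a taxonomy
        # name containing spaces stays intact in column 5.
        uid, start, end, taxid, name = line.split(None, 4)
        intervals.append(
            OriginInterval(uid, int(start), int(end), int(taxid), name)
        )
    return intervals


def shares_origin(univec_id: str, match_start: int, match_end: int,
                  lineage_taxids: Set[int],
                  intervals: List[OriginInterval]) -> bool:
    """Return True if an origin interval overlapping the match has a taxid
    that lies on the lineage of the matched sequence S."""
    for iv in intervals:
        if iv.univec_id != univec_id:
            continue
        overlaps = iv.start <= match_end and match_start <= iv.end
        if overlaps and iv.taxid in lineage_taxids:
            return True
    return False


# Invented rows in the five-column format; not taken from the real files.
example_rows = [
    "gnl|uv|EXAMPLE.1\t1\t1200\t561\tEscherichia",
    "gnl|uv|EXAMPLE.1\t1201\t1500\t81077\tartificial sequences",
]
intervals = parse_five_column(example_rows)

# Partial lineages, supplied by the caller (species, genus, family).
e_coli_lineage = {562, 561, 543}    # E. coli, Escherichia, Enterobacteriaceae
human_lineage = {9606, 9605, 9604}  # H. sapiens, Homo, Hominidae

# An E. coli sequence matching positions 100-400: shared genus, so the match
# would be judged a false positive under the rule above.
print(shares_origin("gnl|uv|EXAMPLE.1", 100, 400, e_coli_lineage, intervals))
# A human sequence matching the same positions: no shared origin.
print(shares_origin("gnl|uv|EXAMPLE.1", 100, 400, human_lineage, intervals))
```

Testing membership in the lineage of S, rather than comparing genus taxids
directly, lets the same check cover intervals that are reported above genus
level.
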
Vector intervals not included in these files are unlikely to be derived from
organisms that both have reliable assembled genomes and are commonly used in
laboratory experiments, such as E. coli, S. cerevisiae, D. melanogaster, and
M. musculus, but there are exceptions. For example, we are still finding
vector segments that come from E. coli phages or nonstandard E. coli strains.

All UniVec entry names and positions are with respect to UniVec build 10.0.

FILE DESCRIPTIONS
=================

artificial_whole_UniVec_entries.txt
-----------------------------------

This file contains UniVec entries that we found to be entirely artificial.

artificial_intervals_5column.txt
--------------------------------

This file contains intervals in UniVec entries that we found to be artificial.
The five-column format for this file is:

    Column 1: UniVec sequence ID starting with gnl|uv|
    Column 2: start of interval (1-based coordinates)
    Column 3: end of interval (1-based coordinates)
    Column 4: numeric taxid of node in NCBI's taxonomy tree
    Column 5: name of node in NCBI's taxonomy tree

By NCBI convention, the taxid for Artificial is 81077.

biological_intervals_5column.txt
--------------------------------

This file contains intervals in UniVec entries that we found to come from a
real biological source. The five-column format for this file is the same as
for the artificial_intervals_5column.txt file described above. Columns 4 and 5
describe the taxonomy of that source, usually at genus level. Sometimes the
taxonomic level in columns 4 and 5 is above genus because the same sequence
occurs in multiple genera.

An interval can appear multiple times in this file for biological reasons
including:

a) The genera Escherichia and Shigella are equivalent for the purpose of
   determining vector origins.
b) A vector segment that comes from a phage will be assigned to the genera of
   the virus and the bacterial host(s).
c) Sometimes a vector segment that comes from a retrovirus with a known
   natural vertebrate host can be assigned to the genera of both the
   retrovirus and the host.

................................................................................
Send any questions or comments to

    Alejandro Schaffer (schaffer@ncbi.nlm.nih.gov)
    Paul Kitts (kitts@ncbi.nlm.nih.gov)
--------------------------------------------------------------------------------