--------------------------------------------------------------------------------

README for UniVec vector origin files:
  artificial_whole_UniVec_entries.txt
  artificial_intervals_5column.txt
  biological_intervals_5column.txt

Last Updated 24 October 2017

................................................................................

BACKGROUND
==========

Sequences in UniVec may be completely artificial or may have parts that are 
derived from real biological sources. Many vector sequences in UniVec have 
multiple origins for distinct sub-intervals.

The files described below contain information about the origins of vector 
segments that we could confirm by comparing UniVec against the NCBI nr BLAST 
database and various assembled genomes. Discovering vector origins is an
ongoing process, and we are likely to add more origin information without
updating UniVec itself.

The information in the vector origin files is used internally at NCBI to
evaluate whether matches to sequences in the UniVec database represent true
contamination or matches based on a common taxonomic origin. (Note: the files
used internally may be a slightly different version from those provided in this
FTP directory.) Related software that runs VecScreen and then interprets and
annotates its output matches, taking taxonomy into account, is under
development and can be found at:
https://github.com/aaschaffer/vecscreen_plus_taxonomy

If a sequence S matches a vector interval V, and S and V share their taxonomy
at genus level, then we judge the match to be a false positive. In some cases,
a similar rule of inference can be applied at higher taxonomic levels (e.g.,
family or order), so we report vector segments at higher levels in such cases.
Further investigation is needed to decide which intervals should be raised above
genus level and by how many levels.
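
The rule above can be sketched in Python as follows. This is an illustration
only, not the code used at NCBI. It assumes the caller has already looked up
the lineage of the sequence (the taxids of its organism and all ancestors) from
NCBI taxonomy data, and it treats a match as a likely false positive when any
taxid reported for the vector interval appears in that lineage. The function
name is hypothetical.

```python
from typing import Iterable


def is_likely_false_positive(sequence_lineage: Iterable[int],
                             interval_taxids: Iterable[int]) -> bool:
    """Return True if the sequence and the vector interval share an origin.

    sequence_lineage: taxids of the sequence's organism and its ancestors
        (assumed to be obtained elsewhere, e.g. from NCBI taxonomy data).
    interval_taxids: the column-4 taxids reported for the matching interval;
        an interval may be listed under more than one taxonomy node.
    """
    lineage = set(sequence_lineage)
    # Because intervals are reported at genus level or above, a simple
    # membership test covers both the genus rule and the higher-level cases.
    return any(taxid in lineage for taxid in interval_taxids)
```

Testing membership in the lineage, rather than comparing genus to genus, is a
design choice of this sketch so that intervals reported at family or order
level need no special handling.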

Vector intervals not included in these files are unlikely to be derived from
organisms that both have reliable assembled genomes and are commonly used in
laboratory experiments, such as E. coli, S. cerevisiae, D. melanogaster, and
M. musculus, but there are exceptions. For example, we are still finding vector
segments that come from E. coli phages or nonstandard E. coli strains.

All UniVec entry names and positions are with respect to UniVec build 10.0.


FILE DESCRIPTIONS
=================

artificial_whole_UniVec_entries.txt
-----------------------------------
This file contains UniVec entries that we found to be entirely artificial.


artificial_intervals_5column.txt
--------------------------------
This file contains intervals in UniVec entries that we found to be artificial.
The five-column format for this file is:

Column 1: UniVec sequence ID starting with gnl|uv|
Column 2: start of interval (1-based coordinates)
Column 3: end of interval (1-based coordinates)
Column 4: numeric taxid of node in NCBI's taxonomy tree
Column 5: name of node in NCBI's taxonomy tree

By NCBI convention, the taxid for Artificial is 81077.
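
As an illustration only, one line of the five-column format could be read in
Python as sketched below. The sketch assumes that columns are separated by
whitespace (tabs or spaces) and that only column 5 may contain spaces; neither
is stated in this README, so check the assumption against the actual files.
The names Interval and parse_interval_line are hypothetical.

```python
from typing import NamedTuple

ARTIFICIAL_TAXID = 81077  # taxid for Artificial, as stated in this README


class Interval(NamedTuple):
    univec_id: str  # column 1, starts with gnl|uv|
    start: int      # column 2, 1-based
    end: int        # column 3, 1-based
    taxid: int      # column 4
    name: str       # column 5, may contain spaces (assumption)


def parse_interval_line(line: str) -> Interval:
    """Parse one line of a 5-column vector origin file."""
    # Split on whitespace at most 4 times so that a multi-word taxonomy
    # name in column 5 stays in one piece.
    fields = line.strip().split(None, 4)
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    univec_id, start, end, taxid, name = fields
    if not univec_id.startswith("gnl|uv|"):
        raise ValueError(f"unexpected sequence ID: {univec_id!r}")
    return Interval(univec_id, int(start), int(end), int(taxid), name)
```

Because the coordinates are 1-based and describe a start and an end, a Python
slice of the UniVec sequence for a parsed interval would be
seq[interval.start - 1:interval.end], assuming the end coordinate is inclusive.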


biological_intervals_5column.txt
--------------------------------
This file contains intervals in UniVec entries that we found to come from a real 
biological source. The five-column format for this file is the same as for the 
artificial_intervals_5column.txt file described above. Columns 4 and 5 describe
the taxonomy of that source, usually at genus level. Sometimes the taxonomic 
level in columns 4 and 5 is above genus because the same sequence occurs in
multiple genera. An interval can appear multiple times in this file for 
biological reasons including:
a) The genera Escherichia and Shigella are equivalent for the purpose of 
   determining vector origins.
b) A vector segment that comes from a phage will be assigned to the genera of 
   the virus and the bacterial host(s).
c) Sometimes a vector segment that comes from a retrovirus with a known natural
   vertebrate host can be assigned to the genera of both the retrovirus and the
   host.
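
Because an interval can appear on several lines, code that consumes this file
may want to collect all taxids per interval before applying any rule. A minimal
Python sketch follows. It assumes the rows have already been split into the
five columns described above, and the function name is hypothetical.

```python
from collections import defaultdict


def taxids_by_interval(rows):
    """Map (univec_id, start, end) to the sorted taxids listed for it.

    rows: iterable of 5-item sequences (ID, start, end, taxid, name),
        one per line of the file.
    """
    grouped = defaultdict(set)
    for univec_id, start, end, taxid, _name in rows:
        # The same interval may be repeated with different taxonomy nodes,
        # e.g. a phage genus and the genus of its bacterial host.
        grouped[(univec_id, int(start), int(end))].add(int(taxid))
    return {key: sorted(taxids) for key, taxids in grouped.items()}
```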

................................................................................

Send any questions or comments to
Alejandro Schaffer (schaffer@ncbi.nlm.nih.gov)
Paul Kitts (kitts@ncbi.nlm.nih.gov)

--------------------------------------------------------------------------------