README for ftp.ncbi.nlm.nih.gov/pub/medgen

Last updated: September 11, 2014

There are multiple files maintained by NCBI that are related to medical genetics. This README summarizes files available by ftp not only in the medgen path, but also on other paths.
================================================================================
MedGen
================================================================================
Files in this directory are updated weekly, on Wednesdays. Each .RRF file is structured according to the following conventions:

    A vertical bar (|) is used as the delimiter.
    The first line in each file begins with a hash (#) and provides the column names.

When appropriate, names of the columns are consistent with those used by UMLS. Many of the values come from UMLS as well (fields with names in lower case have no counterpart in UMLS). This document provides more information about the abbreviations used by UMLS.
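
The conventions above can be sketched as a small reader. This is only an illustration, not an NCBI-supplied tool: the function name read_rrf and the sample row are made up, and the code assumes each row ends with a trailing vertical bar (consistent with the '|\n' line separator described for MGDEF.RRF below).

```python
import io

def read_rrf(handle):
    """Yield one dict per data row of a pipe-delimited RRF file."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: each row ends with a trailing '|'; drop it if present.
        if line.endswith("|"):
            line = line[:-1]
        fields = line.split("|")
        if header is None:
            # The first line begins with '#' and provides the column names.
            fields[0] = fields[0].lstrip("#")
            header = fields
            continue
        yield dict(zip(header, fields))

# Made-up sample row using the NAMES.RRF column names.
sample = "#CUI|name|source|SUPPRESS|\nC0000001|Example condition|GTR|N|\n"
for row in read_rrf(io.StringIO(sample)):
    print(row["CUI"], row["name"])
```

This line-by-line approach is not suited to MGDEF.RRF, because values in its DEF column may contain internal line breaks.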
============================================================
MERGED.RRF
============================================================
Pairs of concept identifiers (CUI) that have been merged.
CUI 	concept unique identifier that has been replaced
to CUI 	current concept identifier
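
As an illustration of how the pairs in this file might be used, the sketch below maps a replaced CUI to its current CUI. The function name resolve_cui and the identifiers are made up. Following a chain of merges is an assumption: the file description does not say whether a current CUI can itself appear later as a replaced CUI.

```python
def resolve_cui(cui, merge_map):
    """Return the current CUI for `cui`, following merges if chained."""
    seen = set()
    # Stop when the CUI is not a replaced one, or if a loop is detected.
    while cui in merge_map and cui not in seen:
        seen.add(cui)
        cui = merge_map[cui]
    return cui

# Made-up pairs: replaced CUI -> current CUI.
merge_map = {"C0000001": "C0000002", "C0000002": "C0000003"}
print(resolve_cui("C0000001", merge_map))
```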

============================================================
MGCONSO.RRF
============================================================
Summary data for each concept identifier.
CUI 	concept unique identifier
TS 	term status:
P: preferred LUI (unique identifier for term i.e. lexically similar strings)
S: non-preferred LUI
STT 	string type:
PF: preferred form of term
VCW: case and word-order variant of the preferred form
VC: case variant of the preferred form
VO: variant of the preferred form
VW: word-order variant of the preferred form
ISPREF 	Is this term preferred in the set of terms from the source? (Y/N)
AUI 	atom unique identifier, where an atom is one term from a source
SAUI 	source-asserted atom unique identifier, i.e. the source's identifier for one term. Often null.
SCUI 	source-asserted concept unique identifier, i.e. the source's identifier for a concept that may include multiple terms
SDUI 	source-asserted descriptor unique identifier
SAB 	abbreviation for the source of the term (Defined here)
TTY 	type of term as defined by the source
CODE 	unique identifier or code for the term provided by the source
STR 	string, i.e. the term value
SUPPRESS 	suppressed by UMLS curators

============================================================
MGDEF.RRF
============================================================
Summary data for definitions and sources of concepts.
CUI 	concept unique identifier
DEF 	concept definition.
Please note that some values in the DEF column contain internal line breaks. The line separator for RRF files is '|\n'. Within the DEF column of MGDEF.RRF, a line break is encoded as a carriage return (CR, '\r', 0x0D, 13 in decimal). Unix/Linux and Windows tools sometimes handle these characters differently. If this format is problematic for you, consider using the comma-separated value (csv) files in the csv subdirectory.
source 	sources that contribute strings or relationships to the UMLS Metathesaurus
SUPPRESS 	suppressed by UMLS curators
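
One possible way to handle the DEF column is to split the file content on the '|\n' row separator rather than reading it line by line. The sketch below is an illustration under stated assumptions: the function name read_mgdef and the sample row are made up, and the column order is taken from the list above.

```python
def read_mgdef(text, def_newline="\n"):
    """Parse MGDEF.RRF content; rows end with a vertical bar plus line feed."""
    records = []
    for raw in text.split("|\n"):
        # Skip the empty tail after the last row and the '#' header line.
        if not raw or raw.startswith("#"):
            continue
        # Split CUI from the left and source/SUPPRESS from the right, so any
        # '|' inside DEF (not expected, but cheap to tolerate) stays in DEF.
        cui, rest = raw.split("|", 1)
        definition, source, suppress = rest.rsplit("|", 2)
        records.append({
            "CUI": cui,
            # Carriage returns mark line breaks inside DEF.
            "DEF": definition.replace("\r", def_newline),
            "source": source,
            "SUPPRESS": suppress,
        })
    return records

# Made-up sample with one internal carriage return in DEF.
sample = "#CUI|DEF|source|SUPPRESS|\nC0000001|First line.\rSecond line.|GTR|N|\n"
records = read_mgdef(sample)
print(len(records))
```

When reading from disk in Python, opening the file with newline="" keeps the carriage returns from being translated before the content reaches this function.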

============================================================
MGREL.RRF
============================================================
Summary data for relationships between concepts.
CUI1 	first concept unique identifier
AUI1 	first atom unique identifier, where an atom is one term from a source
STYPE1 	the name of the column in MRCONSO.RRF that contains the first identifier to which the relationship is attached
REL 	relationship label
CUI2 	second concept unique identifier
AUI2 	second atom unique identifier, where an atom is one term from a source
RELA 	additional relationship label
RUI 	relationship unique identifier
SAB 	abbreviation for the source of the term (Defined here)
SL 	source of relationship label
SUPPRESS 	suppressed by UMLS curators

============================================================
MGSAT.RRF
============================================================
Summary data for concepts' attributes.
CUI 	concept unique identifier
METAUI 	UMLS Metathesaurus asserted unique identifier
STYPE 	the name of the column in MRCONSO.RRF that contains the identifier to which the attribute is attached
CODE 	unique identifier or code for the term provided by the source
ATUI 	attribute unique identifier
ATN 	attribute name
SAB 	abbreviation for the source of the term (Defined here)
ATV 	attribute value
SUPPRESS 	suppressed by UMLS curators

============================================================
MGSTY.RRF
============================================================
Summary data for semantic types.
CUI 	concept unique identifier
TUI 	semantic type unique identifier
STN 	semantic type tree number
STY 	semantic type
ATUI 	attribute unique identifier

============================================================
NAMES.RRF
============================================================
Summary data for concept names and sources.
CUI 	concept unique identifier
name 	concept name
source 	sources that contribute strings or relationships to the UMLS Metathesaurus
SUPPRESS 	suppressed by UMLS curators

============================================================
medgen_pubmed
============================================================
Summary data for MedGen and PubMed links.
UID 	MedGen unique identifier
CUI 	concept unique identifier
NAME 	concept name
PMID 	PubMed unique identifier

============================================================
MedGen_HPO_Mapping.txt
============================================================
Report of MedGen's processing of terms from Human Phenotype Ontology (HPO)
CUI 	concept unique identifier
SDUI 	Identifier from HPO
HpoStr 	term from HPO
MedGenStr 	preferred term in MedGen
MedGenStr_SAB 	Source of the term in MedGen
STY 	semantic type

============================================================
MedGen_HPO_OMIM_Mapping.txt
============================================================
Report of MedGen's processing of terms from Human Phenotype Ontology (HPO) and their relationships to diagnostic terms from OMIM

OMIM_CUI 	concept unique identifier assigned to a record from OMIM
MIM_number 	MIM number defining the record from OMIM
OMIM_name 	preferred term from OMIM
relationship 	relationship of the term from HPO to the record from OMIM. Constructions like 'not_manifestation_of' are used to represent the 'not' qualifier for a relationship.
HPO_CUI 	Concept UID (CUI) assigned to the term from HPO
HPO_name 	preferred term from HPO
HPO_CUI 	Concept UID (CUI) assigned to the term from HPO
MedGen_name 	preferred term used in MedGen
MedGen_source 	source of the term used preferentially by MedGen
STY 	semantic type
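
The 'not' convention in the relationship column can be separated from the base label with a short helper. This is a sketch only: the function name split_relationship is made up, and it assumes the qualifier always appears as a leading 'not_', as in the 'not_manifestation_of' example above.

```python
def split_relationship(label):
    """Return (base_label, negated) for a value of the relationship column."""
    prefix = "not_"
    # Assumption: a leading 'not_' is the only form of the 'not' qualifier.
    if label.startswith(prefix):
        return label[len(prefix):], True
    return label, False

print(split_relationship("not_manifestation_of"))
```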

============================================================
MedGen_CUI_history.txt
============================================================
Tab-delimited report of changes in CUI in MedGen and the dates the changes were made.

Previous_CUI 	The CUI that was deprecated
Current_CUI 	The CUI that is now current
Date_Of_Action 	The month and year the change was made


================================================================================
================================================================================
Subdirectories
================================================================================
================================================================================

------------------------------------------------------------
ftp.ncbi.nlm.nih.gov/pub/medgen/csv/
------------------------------------------------------------
The csv subdirectory contains a set of comma-separated files (csv) corresponding to the RRF files in the main path. Some of the files are split to allow loading into spreadsheet software (maximum 1,000,000 lines per file). The csv files also facilitate processing of the MGDEF.RRF file because the DEF column may contain internal line feeds.


------------------------------------------------------------
ftp.ncbi.nlm.nih.gov/pub/medgen/presentations
------------------------------------------------------------
The presentations subdirectory contains presentations related to MedGen.

There is one file at present, Conditions_Phenotypes.pptx, which describes how data in MedGen and other resources can be used to identify terms and identifiers to include in submissions to ClinVar and GTR.
================================================================================
================================================================================
Other sites
================================================================================
================================================================================
------------------------------------------------------------
ftp.ncbi.nlm.nih.gov/pub/clinvar/
------------------------------------------------------------
The files discussed in the NAMES OF PHENOTYPES section of the README on that path contain information about condition names used by ClinVar and GTR.

These files are:

    disease_names
    gene_condition_source_id
    ConceptID_history.txt
------------------------------------------------------------
Gene's ftp site
------------------------------------------------------------
ftp://ftp.ncbi.nih.gov/gene/DATA/mim2gene_medgen

A report of identifiers from OMIM, whether they are genes or conditions, and corresponding data in Gene and MedGen. Described in ftp://ftp.ncbi.nih.gov/gene/README.