This document provides a BNF specification and mapping to OWL2 of the OBO Flat File Format, version 1.4.
This is an editor's draft, for comment by the community.
Comments should be sent to obo-format@lists.sourceforge.net (archive) to ensure wide visibility.
On completion, this document will provide a formalisation (using BNF) of the OBO Flat File Format Specification, version 1.4 (which we will refer to from now on as OBOF), and a semantics for the language defined via a translation to the OWL DL abstract syntax.
The first part of this document describes the syntax of OBO Format using BNF. OBO Files are parsed into abstract OBO documents. A mapping is provided between abstract OBO documents and OWL2 and the IAO ontology-metadata ontology.
This document is derived from and indebted to a previous document by Ian Horrocks [OBO-OWL Horrocks], later described in an accompanying paper [OBO-OWL Golbreich]. That earlier document provided a semantics for OBO Format in terms of OWL-DL and what was then OWL1.1. The syntax was underspecified and the mapping was incomplete, as there was no treatment of annotation properties.
The NCBO OboInOwl vocabulary attempted to fill in the gaps in the Horrocks translation by providing a full translation of all of obof1.2, including synonyms and definitions using the n-ary relations pattern [OBO-OWL NCBO]. This vocabulary has been implemented in many tools, including OboEdit2.1 and the Berkeley OBO converter. This vocabulary is also currently (Aug 2010) used in the official OBO Foundry conversions available from the Berkeley ontologies download site and the accompanying Neurocommons triplestore. However, whilst the n-ary relations pattern represented OBO metadata correctly, the mapping has undesirable side-effects, such as introducing modeling classes and individuals into an ontology, affecting both reasoning and usability. The mapping was never adopted by Protege4.
OBO Format 1.3 was accompanied by a grammar and mapping to FOL [Obolog]. This mapping was based on the 2005 version of the OBO Relation Ontology and OBOF1.3/Obolog attempted to do justice to dual instance/type level relations and ternary temporally-qualified instance-level relations. However, this introduced too large an impedance mismatch between OBOF and OWL, and as a consequence OBOF1.3 and Obolog have been deprecated.
More recent work by Tirmizi provides an implementation of a fully roundtrippable mapping from OBO to OWL and back [OBO-OWL SWAT4LS] [OBO-OWL JBMS]. This did not attempt to provide a full grammar, and relied on previous efforts for defining the semantics.
This document supersedes previous efforts. It is based on OBO-Format 1.4, which introduces extra constructs and is intended to be closely aligned to OWL and the IAO ontology metadata ontology. The grammar provided here is intended to be complete, formally specifying the syntax of an OBO format file at the character level. The semantics are also complete, and a translation is provided for all OBOF constructs, including new constructs which allow a wider range of expressivity within OWL2. The NCBO oboInOwl vocabulary and n-ary relations approach have been deprecated, and the intention is to use IAO in their place (although the mapping of all vocabulary elements is not complete), with OWL2 axiom annotations replacing the n-ary relations pattern. An additional section on OWL macros describes how to effectively increase the expressivity of OBOF whilst remaining within a simple OBOF environment.
OBO-Format 1.4 is defined using a standard BNF notation, which is summarized in the table below.
Construct | Syntax | Example |
---|---|---|
non-terminal symbols | boldface | QuotedString |
terminal symbols | single quoted | 'Term' |
zero or more | curly braces | { entity-frame } |
zero or one | square brackets | [ ws SynonymType-ID ] |
alternative | vertical bar | Class-ID \| Relation-ID |
grouping | parentheses | ( ) |
complementation | minus symbol | ( character - NewLineChar ) |
Because generic tag-value pairs occur in very many places in the syntax, to save space the grammar has meta-productions for basic tag-values and for the tags themselves:
<T>-TVP ::= <T>: {ws} UnquotedString
<T>-Tag ::= <T>: {ws}
Documents in the obo-format consist of sequences of Unicode characters [UNICODE] and are encoded in UTF-8 [RFC 3629]. TODO - language tags
WhiteSpaceChar ::= ' ' | \t | U+0020 | U+0009
ws ::= { WhiteSpaceChar }
NewLineChar ::= \r | \n | U+000A | U+000C | U+000D
nl ::= ws NewLineChar
nl* ::= { nl }
nl+ ::= nl { ws NewLineChar }
OBOChar ::= '\' Letter | ( Char - (NewLineChar | '\') )
Each tag-value clause in OBOF occupies a single line. The line can optionally end with a comment, introduced by the ! character, which is ignored by the parser. Each clause can also have zero or more tag-value qualifiers.
EOL ::= ws* [ QualifierBlock ] ws* [ HiddenComment ] ws* NewLineChar
QualifierBlock ::= '{' QualifierList '}'
HiddenComment ::= '!' { ( Char - NewLineChar ) }
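The end-of-line handling above can be illustrated with a minimal Python sketch. The function name `split_clause_line` is hypothetical, and the sketch makes two assumptions that the grammar does not state explicitly: any character preceded by a backslash is treated as escaped, and characters inside a double-quoted string are treated as literal. It does not parse the qualifier list itself.

```python
def split_clause_line(line):
    """Split one clause line into (clause, qualifiers, comment).

    Assumption (not from the spec): a backslash escapes the next
    character, and '!' or '{' inside double quotes are literal.
    """
    in_quotes = False
    escaped = False
    brace_at = None      # index of the first unquoted, unescaped '{'
    comment_at = None    # index of the first unquoted, unescaped '!'
    for i, ch in enumerate(line):
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif in_quotes:
            continue
        elif ch == "{" and brace_at is None:
            brace_at = i
        elif ch == "!":
            comment_at = i
            break

    comment = None
    if comment_at is not None:
        comment = line[comment_at + 1:].strip()
        line = line[:comment_at]
    line = line.rstrip()  # drops trailing whitespace and the newline

    qualifiers = None
    if brace_at is not None and line.endswith("}"):
        qualifiers = line[brace_at + 1:-1].strip()
        line = line[:brace_at].rstrip()
    return line, qualifiers, comment


clause, qualifiers, comment = split_clause_line(
    'is_a: GO:0000001 {source="x!y"} ! parent term\n'
)
print(clause)      # is_a: GO:0000001
print(qualifiers)  # source="x!y"
print(comment)     # parent term
```

Scanning once for the first unquoted `!` before looking for the qualifier block mirrors the order of the EOL production, in which the qualifier block precedes the hidden comment.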
QuotedString ::= DblQuote { ( OBOChar - DblQuote ) } DblQuote
UnquotedString ::= { OBOChar }
NonWsChar ::= ( OBOChar - WhiteSpaceChar )
Class-ID ::= ID
Rel-ID ::= ID
Instance-ID ::= ID
ID ::= Prefixed-ID | Unprefixed-ID
Unprefixed-ID ::= { ( NonWsChar - ':' ) }
Prefixed-ID ::= Canonical-Prefixed-ID | NonCanonical-Prefixed-ID
Canonical-Prefixed-ID ::= Canonical-IDPrefix ':' Canonical-LocalID
Alpha-Char ::= a |b |c |d |e |f |g |h |i |j |k |l |m |n |o |p |q |r |s |t |u |v |w |x |y |z | A |B |C |D |E |F |G |H |I |J |K |L |M |N |O |P |Q |R |S |T |U |V |W |X |Y |Z
Digit ::= 0 |1 |2 |3 |4 |5 |6 |7 |8 |9
Canonical-IDPrefix ::= Alpha-Char { ( '_' | Alpha-Char ) }
Canonical-LocalID ::= { Digit }
NonCanonical-Prefixed-ID ::= Any-IDPrefix ':' Any-LocalID
Any-IDPrefix ::= { ( NonWsChar - ':' ) } ':'
Any-LocalID ::= { NonWsChar }
(note: TODO - this grammar rule subsumes the Canonical-Prefixed-ID rule. Is there a standard way to denote that the former is a "greedy" rule, analogous to cut in LP?)
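One possible reading of the ID productions, sketched in Python: try the canonical form first and fall back to the more general forms only if it does not match, which is one way to resolve the overlap raised in the TODO note. The function name `classify_id` and the returned labels are hypothetical, and the sketch ignores backslash escapes for simplicity.

```python
import re

# Canonical-IDPrefix ':' Canonical-LocalID, as the grammar is written:
# a letter, then letters or underscores, a colon, then zero or more digits.
CANONICAL_ID = re.compile(r"[A-Za-z][A-Za-z_]*:[0-9]*")


def classify_id(identifier):
    """Return a label for the ID production that the identifier matches.

    Assumption: the canonical rule takes precedence over the
    non-canonical rule (an ordered choice); escapes are not handled.
    """
    if any(ch in " \t" for ch in identifier):
        raise ValueError("IDs may not contain unescaped whitespace")
    if CANONICAL_ID.fullmatch(identifier):
        return "canonical-prefixed"
    if ":" in identifier:
        return "non-canonical-prefixed"
    return "unprefixed"


print(classify_id("GO:0008150"))            # canonical-prefixed
print(classify_id("http://example.org/x"))  # non-canonical-prefixed
print(classify_id("part_of"))               # unprefixed
```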
An OBO document consists of a header frame followed by zero or more entity frames. Each frame has a set of clauses, or <tag,value> pairs. A tag is a token which may be drawn from the set of defined OBOF tags. The value can be atomic or multi-valued. In the OBO Abstract syntax a clause is written <Tag>(<V1>...<Vn>).
By convention, each frame should be separated from the next by an empty line, but this is not enforced in the syntax.
Note that the term "frame" replaces the previously used "stanza" in order to be more consistent with Manchester Syntax.
OBO-Doc ::= header-frame { entity-frame } nl*
header-frame ::= { header-clause nl* }
header-clause ::= format-version-TVP
  | ontology-TVP
  | data-version-TVP
  | date-Tag DD-MM-YYYY sp hh-mm
  | saved-by-TVP
  | auto-generated-by-TVP
  | import-Tag IRI | filepath
  | subsetdef-Tag ID sp QuotedString
  | synonymtypedef-Tag ID sp QuotedString [ SynonymScope ]
  | default-namespace-Tag OBONamespace
  | idspace-Tag IDPrefix sp IRI [ sp QuotedString ]
  | treat-xrefs-as-equivalent-Tag IDPrefix
  | treat-xrefs-as-genus-differentia-Tag IDPrefix ws Rel-ID ws Class-ID
  | treat-xrefs-as-relationship-Tag IDPrefix Rel-ID
  | treat-xrefs-as-is_a-Tag IDPrefix
  | treat-xrefs-as-has-subclass-Tag IDPrefix
  | remark-TVP
  | UnreservedToken ':' [ ws ] UnquotedString
entity-frame ::= term-frame | typedef-frame | instance-frame | annotation-frame
Term frames introduce and define the meaning of terms (AKA classes).
term-frame ::= nl* '[Term]' nl id-Tag Class-ID EOL { term-frame-clause EOL }
term-frame-clause ::= is_anonymous-BT
  | name-TVP
  | namespace-Tag OBONamespace
  | alt_id-Tag ID
  | def-Tag QuotedString ws XrefList
  | comment-TVP
  | subset-Tag Subset-ID
  | synonym-Tag QuotedString ws SynonymScope [ ws SynonymType-ID ] XrefList
  | xref-Tag Xref
  | builtin-BT
  | property_value-Tag Relation-ID ( QuotedString XSD-Type | ID )
  | is_a-Tag Class-ID
  | intersection_of-Tag Class-ID
  | intersection_of-Tag Relation-ID Class-ID
  | union_of-Tag Class-ID
  | equivalent_to-Tag Class-ID
  | disjoint_from-Tag Class-ID
  | relationship-Tag Relation-ID Class-ID
  | is_obsolete-BT
  | replaced_by-Tag Class-ID
  | consider-Tag ID
  | created_by-Tag Person-ID
  | creation_date-Tag ISO-8601-DateTime
Typedef frames introduce and define the meaning of relations (AKA properties).
typedef-frame ::= [ nl ] '[Typedef]' nl id-Tag Relation-ID EOL { typedef-frame-clause EOL }
typedef-frame-clause ::= is_anonymous-BT
  | name-TVP
  | namespace-Tag OBONamespace
  | alt_id-Tag ID
  | def-Tag QuotedString ws XrefList
  | comment-TVP
  | subset-Tag Subset-ID
  | synonym-Tag QuotedString ws SynonymScope [ ws SynonymType-ID ] XrefList
  | xref-Tag Xref
  | property_value-Tag Relation-ID ( QuotedString XSD-Type | ID )
  | domain-Tag Class-ID
  | range-Tag Class-ID
  | builtin-BT
  | holds_over_chain-Tag Relation-ID Relation-ID
  | is_anti_symmetric-BT
  | is_cyclic-BT
  | is_reflexive-BT
  | is_symmetric-BT
  | is_transitive-BT
  | is_functional-BT
  | is_inverse_functional-BT
  | is_a-Tag Rel-ID
  | intersection_of-Tag Rel-ID
  | union_of-Tag Rel-ID
  | equivalent_to-Tag Rel-ID
  | disjoint_from-Tag Rel-ID
  | inverse_of-Tag Rel-ID
  | transitive_over-Tag Relation-ID
  | equivalent_to_chain-Tag Relation-ID Relation-ID
  | disjoint_over-Tag Rel-ID
  | relationship-Tag Rel-ID Rel-ID
  | is_obsolete-BT
  | replaced_by-Tag Rel-ID
  | consider-Tag ID
  | created_by-Tag Person-ID
  | creation_date-Tag ISO-8601-DateTime
  | expand_assertion_to-Tag QuotedString ws XrefList
  | expand_expression_to-Tag QuotedString ws XrefList
  | is_metadata_tag-BT
  | is_class_level_tag-BT
Instance frames introduce and define the meaning of instances (AKA individuals).
instance-frame ::= [ nl ] '[Instance]' nl id-Tag Instance-ID EOL { instance-frame-clause EOL }
instance-frame-clause ::= is_anonymous-BT
  | name-TVP
  | namespace-Tag OBONamespace
  | alt_id-Tag ID
  | def-Tag QuotedString ws XrefList
  | comment-TVP
  | subset-Tag Subset-ID
  | synonym-Tag QuotedString ws SynonymScope [ ws SynonymType-ID ] XrefList
  | xref-Tag ID
  | property_value-Tag Relation-ID ID
  | instance_of-Tag Class-ID
  | PropertyValueTagValue
  | relationship-Tag Rel-ID ID
  | created_by-Tag Person-ID
  | creation_date-Tag ISO-8601-DateTime
  | is_obsolete-BT
  | replaced_by-Tag Rel-ID
  | consider-Tag ID
SynonymScope ::= 'EXACT' | 'BROAD' | 'NARROW' | 'RELATED'
This section defines structural constraints on an OBO Document. These structural constraints hold on an abstract OBO Document - the result of parsing a physical OBO document file.
Each frame in an abstract OBO document can be accessed by its identifier - the value of the id tag.
Although it is strongly recommended that an OBOF document does not contain two frames with the same identifier, this is syntactically valid. If two frames have the same identifier, then they are combined. Only frames of the same type can be combined - if a document uses the same ID for two frames of different types, the document is structurally invalid.
When two frames F1 and F2 are combined, the new frame F3 has a set of tag-values consisting of the union of the set of tag-values from F1 and the set of tag-values from F2. If two tag-values have identical tags and identical values, they are considered a single tag-value.
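The combination rule can be sketched as follows. The `Frame` class and `combine_frames` function are hypothetical names introduced for illustration; clauses are modelled here as (tag, values) tuples, which is an assumption about the in-memory representation rather than part of the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    frame_type: str   # e.g. "Term", "Typedef" or "Instance"
    frame_id: str     # the value of the id tag
    clauses: list = field(default_factory=list)  # (tag, values) tuples


def combine_frames(f1, f2):
    """Return a new frame holding the union of the clauses of f1 and f2."""
    if f1.frame_id != f2.frame_id:
        raise ValueError("only frames with the same identifier are combined")
    if f1.frame_type != f2.frame_type:
        raise ValueError("same ID used for frames of different types")
    merged = []
    for clause in f1.clauses + f2.clauses:
        # Identical tag and identical values count as a single tag-value.
        if clause not in merged:
            merged.append(clause)
    return Frame(f1.frame_type, f1.frame_id, merged)


f1 = Frame("Term", "GO:1", [("name", ("a",)), ("is_a", ("GO:2",))])
f2 = Frame("Term", "GO:1", [("is_a", ("GO:2",)), ("is_a", ("GO:3",))])
f3 = combine_frames(f1, f2)
print(f3.clauses)
# [('name', ('a',)), ('is_a', ('GO:2',)), ('is_a', ('GO:3',))]
```

A list with a membership check is used instead of a set so that clause order is preserved; the specification itself only requires set union.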
Note that OBO namespaces are not the same as OWL namespaces - the analog of OWL namespaces are OBO ID spaces. OBO namespaces are semantics-free properties of a frame that allow partitioning of an ontology into sub-ontologies. For example, the GO is partitioned into 3 ontologies (3 OBO namespaces, 1 OWL namespace).
Every frame must have exactly one namespace. However, namespaces do not need to be explicitly assigned. After parsing an OBO Document, any frame without a namespace is assigned the default-namespace from the OBO Document header. If this is not specified, the parser assigns a namespace arbitrarily. It is recommended that this namespace be equivalent to the URL or file path from which the document was retrieved.
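A minimal sketch of this defaulting step, assuming frames and the header are held as plain dictionaries (an illustrative representation, not one mandated by this document). The function name `assign_default_namespace` and the `source_location` parameter are hypothetical; the fallback follows the recommendation to use the location from which the document was retrieved.

```python
def assign_default_namespace(frames, header, source_location):
    """Give every frame without a namespace the document's default one."""
    # Fall back to the retrieval location if the header has no default.
    default = header.get("default-namespace") or source_location
    for frame in frames:
        if "namespace" not in frame:
            frame["namespace"] = default
    return frames


frames = [
    {"id": "GO:1", "namespace": "biological_process"},
    {"id": "GO:2"},
]
assign_default_namespace(frames, {"default-namespace": "gene_ontology"}, "file:///tmp/go.obo")
print(frames[1]["namespace"])  # gene_ontology
```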
Every OBODoc should have an "ontology" tag specified in the header. If this is not specified, then the parser should supply a default value. This value should be derived from the URL of the source of the ontology (typically using http or file schemes).
Every OBO-Doc is automatically populated with two header clauses:
The following table shows how header macros are used to insert new clauses into the ontology document. If the pre-condition is satisfied (middle column) and the header contains the specified macro (first column), then clauses are added to the ontology such that the post-condition is satisfied (right hand column). Note that the original xref tags are not replaced in the ontology.
Expansion of these clauses is optional. A typical use case is to expand clauses into a separate bridging ontology. NOTE: this part of the spec is unsatisfactory. A better solution will be to do this as an OWL Macro (section 7). Specifically, obo header macros will be translated to OWL macros, and then the generic OWL macro engine will translate these. We leave this section in the draft spec for now to show the intended semantics.
Note that the id tag is not specified in this table. Each frame has exactly one id tag, and the id tag uniquely identifies the frame. These constraints must hold after frames with identical ids have been combined (see above).
Any tag not mentioned has free cardinality (zero, one or many).
On completion, this section will define the semantics of the entirety of OBO via mappings to OWL2. The mappings could also be used to specify a translation procedure and/or an interface to OWL tools (such as OWL reasoners).
The translation is defined using a translation function T which translates (a fragment of) OBO into OWL DL. The definition of T is often recursive, but it will eventually "ground out" in (a fragment of) OWL DL.
An OBO-Doc translates to an OWL Ontology. The Ontology IRI is generated by extracting the value of the "ontology" tag in the header frame.
Term frames are mapped to Class declarations and Class-level axioms. We list the translations that are specific to Term frames here - generic translations making annotation assertions are applied later on.
The relationship and intersection_of tags can take pairs of values; these pairs map to OWL Class Expressions according to the following table.
Mappings are listed in order of precedence - an expression cannot match more than one production rule, only the first match counts.
Note that OWL2 does not have the ability to declare EquivalentProperties between a named object property and either an intersection or a union. We translate these OBO constructs to a weaker axiom and add an AnnotationAssertion - we will later provide an appendix for handling these in FOL. This is also true for equivalent_to_chain.
The OBO Format equivalent_to_chain clause can only be partially expressed in OWL2. We can do some inference as part of translation - see this prover9 file for a proof.
Note that we have no cognate of IAO:definition_source - instead, we would have a generic xref annotation on the definition annotation assertion.
Mapping of OBO Format synonym clauses is described in 5.6. These map to annotation properties without any semantics in OWL. This section provides semantics external to OWL.
In OBO Format, the term "synonym" is used loosely for any kind of alternative label for a class (the "name" tag is used for the community preferred label). A label is a synonym for a class if there exists some user or user community (existing or historic) for which this label unambiguously denotes the class.
Synonyms are always scoped into one of four disjoint categories: EXACT, BROAD, NARROW, RELATED. A synonym is EXACT if it is a "true" synonym - most members of the community served by the ontology regard the exact synonym as substitutable for the primary label. If an ontology contains two classes which share either a primary label (name) or an exact synonym, then these classes are equivalent. A synonym for a class C is BROAD if it denotes a broader class than C. Here, broader than is an informal notion that encompasses both subsumption and possibly mereological and temporal containment. For example, "skull" could conceivably be a BROAD synonym for the class cranium. Conversely, a synonym for a class C is NARROW if C denotes a broader class than the synonym. If a synonym is none of EXACT, NARROW or BROAD, then it is RELATED.
Scoping of synonyms has proven extremely useful for OBO format ontologies - most ontologies use this feature. The ability to designate a synonym as EXACT, in particular, is very useful for introducing additional terminological precision and reducing ambiguity in ontologies. These have also proven useful for NLP purposes.
Based on: OBO ID Policy
Every OBO-Document is automatically populated with two clauses:
TODO - some owl to obo translations may introduce owl:Thing - give this special status in OBO?
This section can also be considered in isolation from OBO Format altogether.
Macros are specified as the range of an AnnotationAssertion on a property. They take the form of a string encoding either an OWL axiom or an OWL expression in an extended Manchester Syntax format [BCP 47]. This extension allows the use of variable markers (written ?X or ?Y) in place of classes or class expressions. When these variable markers are replaced by actual OWL classes or expressions (again serialized in Manchester Syntax) as part of a Template Substitution Operation, the resulting string must be valid Manchester Syntax.
Examples are provided in [OWL Macros].
Macro | Annotation Property | Variables | Generates |
---|---|---|---|
expand_expression_to | obo:IAO_0000424 | ?Y | OWL2 Class Expression |
expand_assertion_to | obo:IAO_0000425 | ?X,?Y | OWL2 Axiom |
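The Template Substitution Operation can be sketched as plain string replacement over the variable markers. The function name `expand_macro` and the `ex:` identifiers in the example are hypothetical; the sketch does not check that the result is valid Manchester Syntax, which a real implementation would need to do with a Manchester Syntax parser.

```python
import re

# A variable marker is written as '?' followed by a name, e.g. ?X or ?Y.
VARIABLE = re.compile(r"\?[A-Za-z]\w*")


def expand_macro(template, bindings):
    """Replace each variable marker in the template with its bound text."""
    def substitute(match):
        marker = match.group(0)
        if marker not in bindings:
            raise KeyError("no binding for variable " + marker)
        return bindings[marker]

    # A single pass, so substituted text is never rescanned for markers.
    return VARIABLE.sub(substitute, template)


# Hypothetical expand_assertion_to template and bindings.
axiom = expand_macro(
    "?X SubClassOf ex:part_of some ?Y",
    {"?X": "ex:A", "?Y": "ex:B"},
)
print(axiom)  # ex:A SubClassOf ex:part_of some ex:B
```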