I have just started a new blog, "Ontological definitions and such", because I have decided to use this blog, "The Semantic Web in Life Sciences research", just for, um, talking about life science research.
Saturday, August 13, 2011
Thursday, July 7, 2011
Sensors and algorithms as Web services
I'm getting back into blogging after a year of serious personal strife. The period from June 2010 to May 2011 will probably go down as the toughest phase of my life, let alone my career. In my new position at IUPUI, I'm now looking at bringing more Semantic Web applications to life, this time with an emphasis on sensors and processing algorithms for visualization and classification.
From the little I have heard from the Semantic Web research community in the last year, I understand everyone is confronted with scalability and tractability issues when it comes to storing and reasoning with large volumes of data. This is a significant part of the challenge of my current project, which involves encapsulating geographically distributed sensors and sparse sensor arrays and their associated algorithms in a Web service framework to enable discovery, invocation, and dynamic configuration. This is not a life sciences project, although applications can be extended into the clinical research domain.
Friday, March 26, 2010
Modeling the Linnaean taxonomy in OWL: Where do specimens come in?
In my earlier post, I wrote about one way of modeling a Linnaean taxonomy in OWL, where the names of species, genera, classes, and orders could be modeled as instances of concepts from the taxonomic rank ontology. For example, Homo sapiens is modeled as an instance of the Species concept from the Taxonomy Rank ontology. This approach, however, did not consider actual biological specimens and their relationships with these taxa.
Last week, I was at the Phenoscape project all-hands meeting at the Field Museum in Chicago. Matt Yoder, co-PI on the Hymenoptera Anatomy Ontology project, spoke about the HAO's design of modeling the relationship between specimens and phenotypes, and modeling the names of the taxa as standalone concepts. This approach addresses two issues at once. One, the recurring problems with synonymy, homonymy, and polysemy are addressed directly instead of being relegated to "annotation property" status. Two, the relationships between specimens, names, and taxonomic ranks can be represented without recourse to meta-concepts. For the record, regular concepts are instances of meta-concepts. In my earlier post, where specimens were ignored, taxonomic ranks (e.g. Species, Genus, Rank etc.) would be meta-concepts, specific names of taxa (e.g. Brassica oleracea capitata, Canis lupus etc.) would be concepts or instances of the meta-concepts, and finally, specimens such as the big bad wolf and the head of cabbage I bought last evening at the grocery would be instances of these concepts. There could be workarounds for this philosophy, the most obvious one being to model the relationship between specimens and taxon names as something other than a type-token relationship.
Further, Chris Mungall was also at the project meeting, as a consultant. Chris is currently working on creating representations of homologies and he suggested using the "hasPart" relation from OBO relations to model the relationship between specimens and phenotypes.
Given the definition of a Phenotype concept, the relationship "hasPart" can be extended in an OWL framework to relate a Specimen concept (the domain) to a Phenotype concept (the range). An example RDF triple relating a specimen to a phenotype is shown in (1). Note the post-composed representation of the Phenotype instance.
'Specimen 1' 'has part' 'some(vertebra 1 and hasQuality some sigmoid)' --(1)
The relationship between a specimen and its taxon name would be represented as shown in (2). I have used "hasTaxonName" for want of a better label for this relation, which relates a Specimen concept to a Name concept.
'Specimen 1' 'has taxon name' 'Danio rerio' --(2)
Lastly, given a "hasRank" relation to model the relationship between a name and a taxonomic rank, the RDF triple (3) completes this paradigm. Note "hasRank" is used as an annotation property in its current avatar.
'Danio rerio' 'has rank' 'Species' --(3)
The following 'type' triples are necessary.
'Danio rerio' 'type' 'Name'
'Species' 'type' 'Taxonomic rank'
'Specimen 1' 'type' 'Specimen'
Alternative names for Danio rerio, such as Brachydanio rerio, can be represented using the RDF triple in (4), where synonym can be defined as a symmetric property between Name concepts.
'Brachydanio rerio' 'synonym' 'Danio rerio' --(4)
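To see how triples (2) through (4) and the 'type' triples hang together, they can be put into a tiny in-memory store. What follows is only a sketch in Python, not part of any Phenoscape or HAO code: the tuple layout and the helper names `objects` and `names_for` are invented for this illustration, and it assumes that 'synonym' may be followed in either direction.

```python
# Triples (2)-(4) and the 'type' triples from the post, as plain tuples.
# Triple (1) is left out because its object is a post-composed class expression.
TRIPLES = {
    ("Specimen 1", "has taxon name", "Danio rerio"),
    ("Danio rerio", "has rank", "Species"),
    ("Brachydanio rerio", "synonym", "Danio rerio"),
    ("Danio rerio", "type", "Name"),
    ("Species", "type", "Taxonomic rank"),
    ("Specimen 1", "type", "Specimen"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def names_for(specimen, triples=TRIPLES):
    """Return the specimen's taxon names plus their direct synonyms.

    Assumption of this sketch: 'synonym' is read in both directions,
    and only one synonym hop from an asserted name is followed.
    """
    direct = objects(specimen, "has taxon name", triples)
    names = set(direct)
    for s, p, o in triples:
        if p == "synonym":
            if o in direct:
                names.add(s)
            if s in direct:
                names.add(o)
    return names


if __name__ == "__main__":
    print(sorted(names_for("Specimen 1")))
    print(objects("Danio rerio", "has rank"))
```

Because names are ordinary nodes here rather than annotation strings, the synonym lookup is just another traversal over the same triples.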
In the interest of sound ontology design principles, each of these concepts can be extended from concepts from "higher-level" ontologies such as the Information Artifact Ontology.
In my next post, I shall look at use cases that can leverage these designs both from the point of view of Phenoscape (the project I currently work on) as well as other life science data integration and modeling projects.
As an aside, my days on the Phenoscape project are numbered and I'm currently looking for new positions. Wish me luck!
Friday, February 26, 2010
Modeling the Linnaean taxonomy in OWL
Following up on the Phenoscape beta release in July, I've worked primarily on warehousing the phenotype data and refactoring the data services for faster performance on the Phenoscape web interface. I'm also collaborating with Chris Mungall at the Lawrence Berkeley National Laboratory on a manuscript outlining the principles of OBD and its application to the Phenoscape knowledgebase. I hope to finish writing the first draft in the next couple of weeks.
I've been pondering over ways to create representations of phenotype annotations in RDF triples using OWL concepts, instances, and object properties. A phenotype annotation is a Subject-Predicate-Object triple that relates an evolutionary taxon from a Linnaean taxonomy to an exhibited phenotype. In the Phenoscape project, phenotype annotations relate species (and sometimes higher taxa from the Linnaean taxonomy) of fish to exhibited phenotypes.
To relate these two entities, we have defined a new binary relation exhibits. The exhibits relation has been defined in an OBO framework, where only a simple ID and label are required with a text description of the intended semantics. I have been thinking about a more formal treatment for this important relation, specifically in a Semantic Web framework. How do I create an object property definition of the exhibits relation? What concepts do I define as its domain and range?
In layman's terms, the exhibits relation relates a taxon (node) from a Linnaean taxonomy to a phenotype. The taxonomy rank ontology specifies partonomy relationships between the various ranks of a Linnaean taxonomy; each instance of a rank is also an instance of the higher ranks. The taxon concept in the taxonomy rank ontology has been defined as a subconcept of the continuant concept of the Basic Formal Ontology.
Genus, species, family, order, and class are subconcepts of taxon. Species such as Ictalurus furcatus, Oryza sativa and Esox americanus are instances of the species concept. The corresponding genera Ictalurus, Oryza and Esox are instances of the genus concept.
I have not addressed the relationship between actual living organisms and Linnaean taxa. Is my dog an instance of Canis familiaris, for example, or is this a different kind of relationship altogether? How about fossils that are being discovered in the various corners of the Earth even today, such as the fascinating Tiktaalik roseae? How about the preserved soft tissue specimens in various life science museums? Are these instances of specific taxa? This is the subject of a very old debate in the community of evolutionary biologists and systematists. Very often, evolutionary biologists cannot decide which part of the Tree of Life to assign a newly discovered specimen to. I shall defer a discussion on this relationship to a later post.
Now let us consider phenotypes. A phenotype is defined as an observable physical or biochemical characteristic of a living organism that is caused by its genetic makeup and also by the influence of its environment. For some time now, model organism databases have used the Entity-Quality formalism for modeling phenotypes, i.e. a phenotype is a quality that inheres in an anatomical or a behavioral entity. Phenoscape subscribes to this formalism. A phenotype concept in Phenoscape (and in OBD, from which it is inherited) is "post-composed" from previously defined concepts in an anatomical ontology or a behavioral ontology, such as the Foundational Model of Anatomy (FMA) or the GO biological process ontology, and from a quality ontology such as the Phenotype and Trait Ontology (PATO). This is a nifty way to create an RDF-style blank node with a Skolemized identifier, which identifies the origins of the node. The post-composed phenotype is related to the quality concept by a subsumption relationship ("a round fin is round after all") and to the corresponding anatomy or behavior concept by the inheres_in relation from OBO. Again, the comparison with RDF blank nodes is obvious. It's not the node itself, but its relationships that we care about.
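To make the post-composition idea concrete, here is a small Python sketch that builds and takes apart an identifier of the form used later in this post, `PATO:0000599^OBO_REL:inheres_in(TAO:0000656)`. The function names `compose`, `decompose`, and `relationships` are invented for this illustration and are not OBD functions, and the sketch handles only a single inheres_in differentia.

```python
import re

# Matches identifiers such as PATO:0000599^OBO_REL:inheres_in(TAO:0000656).
_POST_COMPOSED = re.compile(
    r"^(?P<quality>[^^()]+)\^OBO_REL:inheres_in\((?P<entity>[^()]+)\)$"
)


def compose(quality_id, entity_id):
    """Build a post-composed phenotype identifier from a quality and an entity."""
    return f"{quality_id}^OBO_REL:inheres_in({entity_id})"


def decompose(phenotype_id):
    """Recover the (quality, entity) pair that a post-composed identifier encodes."""
    match = _POST_COMPOSED.match(phenotype_id)
    if match is None:
        raise ValueError(f"not a post-composed phenotype identifier: {phenotype_id!r}")
    return match.group("quality"), match.group("entity")


def relationships(phenotype_id):
    """List the two relationships that give the post-composed node its meaning."""
    quality, entity = decompose(phenotype_id)
    return [
        (phenotype_id, "is_a", quality),               # subsumption by the quality
        (phenotype_id, "OBO_REL:inheres_in", entity),  # link to the bearer entity
    ]


if __name__ == "__main__":
    pid = compose("PATO:0000599", "TAO:0000656")
    print(pid)
    for triple in relationships(pid):
        print(triple)
```

The identifier carries its own provenance, so the node never needs to be looked up to find out which quality and entity it was composed from.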
So here goes putting it all together. I use Phenoscape as the namespace prefix here. I have eliminated the angle brackets from the tags so it can be displayed here. This is going into an ontology that will soon be posted on the Phenoscape site.
<owl:Class rdf:ID="Phenotype">
  <rdfs:subClassOf>
    <owl:Class>
      <owl:intersectionOf rdf:parseType="Collection">
        <owl:Class rdf:about="PATO:0000001"/> <!-- Quality -->
        <owl:Restriction>
          <owl:onProperty rdf:resource="OBO_REL:inheres_in"/>
          <owl:someValuesFrom>
            <owl:Class>
              <owl:unionOf rdf:parseType="Collection">
                <owl:Class rdf:about="GO:0007610"/> <!-- Behavior -->
                <owl:Class rdf:about="TAO:0100000"/> <!-- Anatomical entity from TAO -->
              </owl:unionOf>
            </owl:Class>
          </owl:someValuesFrom>
        </owl:Restriction>
        <owl:Restriction>
          <owl:onProperty rdf:resource="OBO_REL:towards"/>
          <owl:someValuesFrom>
            <owl:Class>
              <owl:unionOf rdf:parseType="Collection">
                <owl:Class rdf:about="GO:0007610"/> <!-- Behavior -->
                <owl:Class rdf:about="TAO:0100000"/> <!-- Anatomical entity from TAO -->
              </owl:unionOf>
            </owl:Class>
          </owl:someValuesFrom>
        </owl:Restriction>
      </owl:intersectionOf>
    </owl:Class>
  </rdfs:subClassOf>
</owl:Class>
Note how I use the root concept of the Teleost Anatomy Ontology as one of the concepts in the OWL union in the range of both the inheres_in property as well as the towards property. This is for the purposes of the Phenoscape project. For other subsets of the Tree of Life, concepts from equivalent anatomy ontologies such as the Amphibian Anatomy Ontology, the Foundational Model of Anatomy (FMA), or even the Common Anatomy Reference Ontology (CARO) can be used instead of this concept.
Now for the taxon concept. This is much simpler. I use the Continuant concept from BFO as the superconcept of taxon. I use TRO as the prefix for the Taxonomy Rank Ontology.
<owl:Class rdf:about="TRO:Taxon">
  <rdfs:subClassOf rdf:resource="BFO:Continuant"/>
</owl:Class>
Other concepts in the TRO can be defined as below in OWL.
<owl:Class rdf:about="TRO:Genus">
  <rdfs:subClassOf rdf:resource="TRO:Taxon"/>
</owl:Class>
<owl:Class rdf:about="TRO:Species">
  <rdfs:subClassOf rdf:resource="TRO:Taxon"/>
</owl:Class>
Lastly, the individual species, genera, and so on can be defined as OWL individuals as below. These are taken from Peter Midford's Teleost Taxonomy Ontology.
<TRO:Species rdf:about="TTO:1001979"/> <!-- Danio rerio -->
<TRO:Genus rdf:about="TTO:101040"/> <!-- Danio -->
Similarly, phenotypes with post-composed identifiers can be defined as instances of the OWL concept Phenotype defined earlier.
<phenoscape:Phenotype rdf:about="PATO:0000599^OBO_REL:inheres_in(TAO:0000656)"/>
Finally, we define the exhibits relation in OWL.
<owl:ObjectProperty rdf:ID="exhibits">
  <rdfs:domain rdf:resource="TRO:Taxon"/>
  <rdfs:range rdf:resource="#Phenotype"/>
</owl:ObjectProperty>
This definition is now the logical underpinning for RDF triples in N3 syntax that look like:
<tto:0001979> <phenoscape:exhibits> <pato:0000599^obo_rel:inheres_in(tao:0000656)>
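It is worth spelling out what the domain and range declarations on exhibits actually buy. In RDFS and OWL they are not validation constraints; they license inferences about the types of the subject and object of any exhibits triple. The Python sketch below illustrates that reading. The dictionaries and the function name `infer_types` are hypothetical, and this is a two-rule toy, not a real reasoner.

```python
# Domain and range declared for the exhibits property, as in the definition above.
DOMAIN = {"phenoscape:exhibits": "TRO:Taxon"}
RANGE = {"phenoscape:exhibits": "phenoscape:Phenotype"}


def infer_types(triples):
    """Apply the RDFS domain and range rules to a collection of triples.

    For every (s, p, o) whose predicate has a declared domain or range,
    conclude that s is an instance of the domain and o of the range.
    """
    inferred = set()
    for s, p, o in triples:
        if p in DOMAIN:
            inferred.add((s, "rdf:type", DOMAIN[p]))
        if p in RANGE:
            inferred.add((o, "rdf:type", RANGE[p]))
    return inferred


if __name__ == "__main__":
    annotation = (
        "tto:0001979",
        "phenoscape:exhibits",
        "pato:0000599^obo_rel:inheres_in(tao:0000656)",
    )
    for triple in sorted(infer_types([annotation])):
        print(triple)
```

So an exhibits triple whose subject was never explicitly typed is not rejected; a reasoner simply concludes that the subject is a Taxon and the object a Phenotype.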
I may be off on some of the syntax (I'm a bit rusty), but I hope the points I have made in this post have been reflected adequately in these definitions. As always, feedback and critique are welcome. This OWL ontology will soon be up on the Phenoscape site as I have mentioned earlier. I thank Peter Midford for his input and thoughts. In my next post, I will address the relationship between specimens and evolutionary taxa, a subject to which I have briefly alluded here. Until then, happy trails!
Saturday, November 14, 2009
Biological data integration with Phenoscape
It has been quite a long time since my last post in June 2008. I have since taken up a position working on an NSF funded project in Durham, NC. Phenoscape is an initiative to bring together model organism data with rich text descriptions of phenotypes exhibited by evolutionary species, which enables the discovery of new interesting and hitherto unknown relationships between mutants and evolutionary species. In this post, I outline the Phenoscape project. In my next posts, I will describe some of the issues I have come across in the Phenoscape project and some of my thoughts and ideas for resolving these issues.
I have spent most of my 16 months on the Phenoscape project developing an ontology based back-end repository (the Phenoscape knowledgebase) to store phenotype data from model organism databases and from rich text descriptions in scientific publications. I have developed data loader modules which translate data from a myriad different formats into a single, shareable syntax with ontology based semantics. Lastly, I have developed Web service endpoint interfaces to query this knowledgebase and output the results, which are displayed at a User Interface developed by my colleague, Jim Balhoff.
By ontology based semantics, I mean semantics defined in OBO ontologies. OBO predates the Semantic Web initiative, so this is rather a chicken and egg problem here. While the Semantic Web pedant in me is dismayed by the absolute lack of a mathematical framework in OBO definitions, which are nothing but rich text descriptions, the extent of knowledge annotation and reuse that biologists have achieved with OBO ontologies such as the Gene Ontology is very impressive.
The schema of the Phenoscape knowledgebase is based upon the Ontology-Based Database (OBD) schema developed by Chris Mungall at the Lawrence Berkeley National Laboratory for storing phenotype annotations. For the sake of lucidity, I briefly define some of the terminology that will be used in the rest of this post (and in the posts to follow as well).
Phenotype annotations are statements that relate evolutionary taxa to exhibited phenotypes. Taxa are the nodes that are part of a taxonomy. Species such as Homo sapiens (humans) are the leaf nodes (or leaf taxa) of a taxonomy devised by Linnaeus for the classification of living organisms. When talking about higher and lower taxa, I am referring to the relative positions of these taxa in the Linnaean taxonomy. Classes (concepts) are types or templates of real world "entities" (for want of a better word) with property based definitions. Instances are real world occurrences of the types. The classic example: car would be a class, my Porsche 911 GT would be an instance of the car class.
OBD is an intelligent, relational database that provides for the storage of phenotype annotations in triples format (not RDF). It also comes with a reasoner for extracting transitive subsumptions and partonomies as well as relation chains from asserted data. The inference capabilities (deductive closure) of the OBD reasoner exceed those of the RDF reasoner.
OBD allows for evolutionary species to be defined and treated as concepts. Annotations of exhibited phenotypes are existentially quantified. Since annotations to these leaf nodes are existentially quantified, I developed an extension to the OBD reasoner to associate higher level taxa in the Linnaean taxonomy with phenotypes exhibited by the lower level taxa. This makes it possible for biologists to query for all the phenotypes under a higher taxon, and see the phenotypes exhibited by all the lower, subsumed taxa as well as the query taxon itself.
For example, given an existentially quantified assertion that Homo sapiens exhibits a four chambered heart, my extension to the OBD reasoner infers that the genus Homo also exhibits a four chambered heart. In everyday terms, given an assertion that some instances of Homo sapiens exhibit a phenotype, an opposable thumb for example, it is reasonable to infer that some instances of Homo exhibit the same phenotype as well. The existential quantification is key to inferring up the hierarchy and adding inferences that do not lead to an inconsistent knowledge base. Existential quantifications are also convenient for assertions in the life sciences, which are rife with exceptions to the general rules.
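The upward propagation described here can be sketched in a few lines of Python. This is not the OBD reasoner extension itself, only an illustration of the idea: the child-to-parent map, the sample annotation, and the function names `propagate_up` and `phenotypes_under` are assumptions made for the example, and the taxonomy is assumed to be a tree with no cycles.

```python
# Hypothetical fragment of a Linnaean taxonomy: child taxon -> parent taxon.
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
}

# Asserted, existentially quantified annotations: (taxon, phenotype),
# read as "some members of the taxon exhibit the phenotype".
ASSERTED = {("Homo sapiens", "four chambered heart")}


def propagate_up(asserted, parent):
    """Copy each 'some members exhibit' annotation to every ancestor taxon."""
    inferred = set(asserted)
    for taxon, phenotype in asserted:
        current = parent.get(taxon)
        while current is not None:
            inferred.add((current, phenotype))
            current = parent.get(current)
    return inferred


def phenotypes_under(query_taxon, asserted, parent):
    """Phenotypes exhibited by the query taxon or by any taxon it subsumes."""
    closure = propagate_up(asserted, parent)
    return {phenotype for taxon, phenotype in closure if taxon == query_taxon}


if __name__ == "__main__":
    print(sorted(propagate_up(ASSERTED, PARENT)))
    print(phenotypes_under("Homo", ASSERTED, PARENT))
```

The inference is safe only because the annotations are existential: "some Homo sapiens exhibit X" entails "some Homo exhibit X", whereas a universally quantified annotation would not carry upward in the same way.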
Phenoscape is a well documented project, and more background information can be found on the Phenoscape informatics wiki and the Phenoscape blog. Phenoscape, it must be noted, addresses just one issue in the wide research area of evolutionary and biodiversity informatics, whose efforts at data integration and interoperability are confronted by problems very similar to those confronting medical informatics. On a parting note, I find it very satisfying that ontologies are crucial to addressing these issues in both research domains (and several others as well).
I have spent most of my 16 months on the Phenoscape project developing an ontology based back-end repository (the Phenoscape knowledgebase) to store phenotype data from model organism databases and from rich text descriptions in scientific publications. I have developed data loader modules which translate data from a myriad different formats into a single, shareable syntax with ontology based semantics. Lastly, I have developed Web service endpoint interfaces to query this knowledgebase and output the results, which are displayed at a User Interface developed by my colleague, Jim Balhoff.
By ontology based semantics, I mean semantics defined in OBO ontologies. OBO predates the Semantic Web initiative, so this is rather a chicken and egg problem here. While the Semantic Web pedant in me is dismayed by the absolute lack of a mathematical framework in OBO definitions, which are nothing but rich text descriptions, the extent of knowledge annotation and reuse that biologists have achieved with OBO ontologies such as the Gene Ontology is very impressive.
The schema of the Phenoscape knowledgebase is based upon the Ontology-Based Database (OBD) schema developed by Chris Mungall at the Lawrence Berkeley National Laboratory for storing phenotype annotations. For the sake of lucidity, I briefly define some of the terminology that will be used in the rest of this post (and in the posts to follow as well).
Phenotype annotations are statements that relate evolutionary taxa to exhibited phenotypes. Taxa are the nodes that are part of a taxonomy. Species such as Homo sapiens (humans) are the leaf nodes (or leaf taxa) of a taxonomy devised by Linnaeus for the classification of living organisms. When talking about higher and lower taxa, I am referring to the relative positions of these taxa in the Linnaean taxonomy. Classes (concepts) are types or templates of real world "entities" (for want of a better word) with property based definitions. Instances are real world occurrences of the types. The classic example: car would be a class, my Porsche 911 GT would be an instance of the car class.
OBD is an intelligent, relational database that provides for the storage of phenotype annotations in triples format (not RDF). It also comes with a reasoner for extracting transitive subsumptions and partonomies as well as relation chains from asserted data. The inference capabilities (deductive closure) of the OBD reasoner exceed that of the RDF reasoner.
OBD allows for evolutionary species to be defined and treated as concepts. Annotations of exhibited phenotypes are existentially quantified. Since annotations to these leaf nodes are existentially quantified, I developed an extension to the OBD reasoner to associate higher level taxa in the Linnaean taxonomy with phenotypes exhibited by the lower level taxa. This makes it possible for biologists to query for all the phenotypes under a higher taxon, and see the phenotypes exhibited by all the lower, subsumed taxa as well as the query taxon itself.
For example, given an existentially quantified assertion that Homo sapiens exhibit a four chambered heart, my extension to the OBD reasoner infers that the genus Homo also exhibits a four chambered heart. In every day terms, given an assertion that some instances of Homo sapiens exhibit a phenotype, an opposable thumb for example, it is reasonable to infer some instances of Homo exhibit the same phenotype as well. The existential quantification is key to inferring up the hierarchy, and add inferences that do not lead to an inconsistent knowledge base. Existential quantifications are also convenient for assertions in the life sciences, which are rife with exceptions to the general rules.
Phenoscape is a well-documented project, and more background information can be found on the Phenoscape informatics wiki and the Phenoscape blog. Phenoscape, it must be noted, addresses just one issue in the wide research area of evolutionary and biodiversity informatics, whose efforts at data integration and interoperability are confronted by problems very similar to those confronting medical informatics. On a parting note, I find it very satisfying that ontologies are crucial to addressing these issues in both research domains (and several others as well).
Monday, June 23, 2008
Introducing the Ontology-Based PubMed Annotator
Since my last post, I've pretty much kept my nose to the grindstone and I have something to show for it: the new PubMed Annotator on steroids, or ahem... the Ontology-Based PubMed Annotator, or the OBPA for short. The OBPA, like its predecessor, the PubMed Annotator, requires the user, a biologist, to annotate biomedical experiments in RDF triple format for storage, subsequent querying, summarizing, and comparison. The difference is that the OBPA prompts the user with matching terms from a few preselected ontologies, in auto-complete mode, even as she is filling in the fields. The user can choose to use terms from the ontologies for her annotation work, or she can use her own terms.
OBPA keeps track of the number of terms the user borrows from each ontology as a measure of the ontology's usefulness (cf. my previous blog entry). OBPA is definitely more advanced than the PubMed Annotator, with more features and better security. The following OWL ontologies are currently being used in the OBPA:
a) The Ontology for Biomedical Investigations (OBI)
b) The MGED Ontology
c) Barry Smith's Basic Formal Ontology
d) Heinrich Herre's General Formal Ontology
e) Barry Smith's Relation Ontology
f) Michel Dumontier's Relation Ontology
g) OWL 1.0 Ontology
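The prompting and bookkeeping described before the list can be sketched in Python as follows. This is only an illustration, not the OBPA code: the term labels, the TERMS index, and the function names suggest and record_choice are all invented for the example, and a real deployment would load the labels from the ontologies listed above.

```python
from collections import Counter

# Hypothetical term index: label -> source ontology (labels are made up).
TERMS = {
    "assay": "OBI",
    "assay objective": "OBI",
    "array design": "MGED",
    "part_of": "RO",
}

# How many terms the user has borrowed from each ontology so far.
usage = Counter()


def suggest(prefix, terms=TERMS, limit=5):
    """Return up to `limit` labels starting with `prefix`, case-insensitively."""
    p = prefix.strip().lower()
    if not p:
        return []
    matches = sorted(label for label in terms if label.lower().startswith(p))
    return matches[:limit]


def record_choice(term, terms=TERMS, counts=usage):
    """Count the chosen term against its ontology, or as a user-defined term."""
    source = terms.get(term, "user-defined")
    counts[source] += 1
    return source
```

Typing "as" would yield the two OBI labels; picking one of them increments the OBI counter, while a free-text term is counted under "user-defined".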
The OBPA in its current version cannot handle OBO syntax. I believe ontologies such as the Foundational Model of Anatomy, Reactome, and UniProt will also be relevant to the OBPA. The OBPA, however, suffers from a significant roadblock that prevents the incorporation of more ontologies into its scope. Terms (classes and properties) from ontologies are loaded into OBPA at deployment time. Given the slow performance of current versions of OWL-based APIs such as Jena and the OWL API, server-side deployment is a very tortuous process, with the server timing out frequently. Also, the terms are not updated periodically. With the current rate of progress on ontologies, OBPA runs the risk of using obsolete terminology from ontologies.
Ben Good's Entity Describer (E.D.), which works with ontologically defined terms, uses the interface provided by Freebase to dynamically extract terms from ontologies such as the Gene Ontology (GO). It prompts the user with a suggestion box complete with a text description of the term, the ontology it is extracted from, and sometimes, even a picture! Future revisions to the OBPA may incorporate this methodology to alleviate the problems with obsolete ontological terms. Another solution may be to create a service that periodically browses a selection of ontologies and presents the extracted terms on an interface accessible to applications such as the OBPA. An application such as the Ontology Lookup Service (OLS), which is also compatible with OWL ontologies, may help as well.
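One way to picture the "periodically refreshed" idea is a small cache that reloads its terms once they are older than a chosen age, instead of loading them only at deployment time. The sketch below is an assumption-laden illustration: TermCache, its parameters, and the loader callback are hypothetical, and the loader stands in for whatever would actually fetch terms (an OWL parser, OLS, or some other service).

```python
import time


class TermCache:
    """Holds ontology terms and reloads them when they become stale."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader          # hypothetical callable returning terms
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._terms = None
        self._loaded_at = None

    def terms(self):
        """Return cached terms, calling the loader first if they are stale."""
        now = self._clock()
        stale = (
            self._loaded_at is None
            or now - self._loaded_at >= self._max_age
        )
        if stale:
            self._terms = self._loader()
            self._loaded_at = now
        return self._terms
```

With a cache like this, the slow load happens lazily and at most once per refresh interval, so a request never sees terms older than max_age_seconds.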
On a tangent, Mark Wilkinson suggested a future area of work where one could browse through the nodes of an ontology and extract publications associated with every node. I'm putting it down here because it may be something for me to work on in the future, and also to ensure that you heard it first, from here!! In closing, I would like to thank Ed Kawas for his hands-on help with the jQuery part of the application, Luke McCarthy for his insightful tips on various aspects, and Ben Good for being the Dry Lab's own “thinker.”
UPDATE: It has been a while since the server for the Wilkinson lab was changed from bioinfo.icapture.ubc.ca to the new server. This is the reason why the link to the Pubmed Annotator Web UI is inactive. The WAR I had on my laptop was lost forever when the laptop was stolen from my house in Vancouver. The code for the Ontology Based Pubmed Annotator is available on the Wilkinson lab's code repository, and I will be moving this to a new project on SourceForge very soon.
Friday, May 23, 2008
Introducing the Pubmed Annotator
In my last post, I held forth on the divide between biologists and computer scientists. Since then, I have been actively collaborating with biologists trying to harness their knowledge of biomedical experiments through an annotation interface that is called the Pubmed Annotator.
The Pubmed Annotator is at an alpha state of development. At present, I'm addressing some advanced security issues brought up by Ed and Luke, two of the Wilkinson lab's own superhacks. In its present avatar, the Pubmed Annotator allows a user to register, log in, query Pubmed with a Pubmed Identifier to retrieve a publication, and then annotate the experiment described in the publication using Subject-Predicate-Object triple syntax. I have a recent publication describing the objectives of the Pubmed Annotator project.
The Pubmed Annotator hopes to elicit unique structured representations of biomedical experiments in SPO triple format. Each experiment can be stored as a collection of SPO triples, or by extension as RDF triples on the Semantic Web. First, this will enable easy querying of experiment details in a universally shareable syntax (RDF). Second, experiments can be compared for similarity. Third, logic-based reasoning mechanisms (one of the primary benefits of the Semantic Web) can be used to summarize experiments for the benefit of overworked biologists and the curators of biological knowledge bases. Lastly, raw annotations from users can be used to synthesize a controlled vocabulary (and an ontology) for the annotation of biomedical experiments. This constitutes a bottom-up approach to ontology synthesis, wherein raw data is used to create a template. Ontology development today is more often a top-down process where domain experts and knowledge engineers argue and, somehow, agree on a set of terms and logical definitions of these terms that are capable of representing the knowledge domain of interest.
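To make the first two benefits concrete, here is a small Python sketch of an experiment stored as SPO triples, queried by pattern, and compared with another experiment. The triples are invented, the names Triple, match, and jaccard are hypothetical, and Jaccard overlap is just one possible similarity measure picked for illustration; the post does not say which measure the Pubmed Annotator would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the given pattern; None is a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def jaccard(a, b):
    """Share of triples common to both experiments (an assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two made-up experiment annotations.
exp_a = {
    Triple("sample", "derived_from", "mouse liver"),
    Triple("assay", "measures", "gene expression"),
}
exp_b = {
    Triple("sample", "derived_from", "mouse liver"),
    Triple("assay", "measures", "protein abundance"),
}

print(match(exp_a, predicate="measures"))
print(jaccard(exp_a, exp_b))
```

Exact-match overlap like this only works when annotators use the same terms for the same things, which is one motivation for steering them toward a shared vocabulary.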
I attended the Vancouver workshop of the Ontology of Biomedical Investigations (OBI) Consortium in February. I was mostly a passive observer as some of the best ontology developers, domain experts, and philosophers went to work arguing over commonplace terms and their definitions, which most common folk would merely take for granted. It was fascinating, to say the least.
The Wilkinson lab, on the other hand, is fast becoming a hotbed of bottom-up approaches to ontology development. Ben Good took the lead a few years ago with the excellent iCAPTUREr. Very recently, he has developed the Entity Describer as an add-on to Connotea, as a means to help users use ontologically defined terms to annotate publications of their choice on Connotea. This is an example of Semantic Social Tagging.
Along these lines, the Pubmed Annotator is currently being upgraded to use ontologies and ontologically defined terms to annotate publications. I believe the usefulness of ontologies to experiment annotation in general can be evaluated by a simple metric. Given that a user could use terms of his choice, ontologically defined terms, or a combination of both to annotate an experiment, the metric is the ratio of the number of terms from an ontology that were used in annotating an experiment to the total number of terms used to annotate the same experiment. Let me put this in a simple mathematical formula. Given t is the total number of terms used by a user u to annotate an experiment, and n is the number of ontologically defined terms among them, the efficiency metric is defined as
e = n/t.
This is for one user u. By extension, the same metric can be used to quantitatively measure the effectiveness of an ontology for annotating several experiments by several users. All this is mere hypothesis, however. It is hoped that the data we gather will give us a realistic estimate of the correctness of this hypothesis.
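As a worked illustration of the metric, the short Python sketch below computes n/t for one annotation and a pooled value over several annotations. The function names and the sample terms are invented, and pooling by summing n and t across annotations is only one assumed way of extending the metric to several users and experiments.

```python
def ontology_term_ratio(terms_used, ontology_terms):
    """Return n/t for one annotation: ontology terms used over all terms used."""
    t = len(terms_used)
    if t == 0:
        raise ValueError("the annotation uses no terms, so n/t is undefined")
    n = sum(1 for term in terms_used if term in ontology_terms)
    return n / t


def pooled_ratio(annotations, ontology_terms):
    """Assumed extension: sum n and t over many annotations, then divide."""
    t = sum(len(terms) for terms in annotations)
    if t == 0:
        raise ValueError("no terms were used in any annotation")
    n = sum(
        1 for terms in annotations for term in terms if term in ontology_terms
    )
    return n / t


# Made-up example: 2 of the 4 terms come from the ontology.
ontology = {"assay", "assay objective"}
annotation = ["assay", "mouse", "my own term", "assay objective"]
print(ontology_term_ratio(annotation, ontology))
```

A value near 1 would suggest the ontology covers most of what annotators want to say, while a value near 0 would suggest they mostly fall back on their own terms.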