Saturday, November 14, 2009

Biological data integration with Phenoscape

It has been quite a long time since my last post in June 2008. I have since taken up a position on an NSF-funded project in Durham, NC. Phenoscape is an initiative to bring together model organism data with rich text descriptions of phenotypes exhibited by evolutionary species, enabling the discovery of previously unknown relationships between mutants and evolutionary species. In this post, I outline the Phenoscape project. In my next posts, I will describe some of the issues I have come across in the project and some of my ideas for resolving them.

I have spent most of my 16 months on the Phenoscape project developing an ontology-based back-end repository (the Phenoscape knowledgebase) to store phenotype data from model organism databases and from rich text descriptions in scientific publications. I have developed data loader modules that translate data from a myriad of formats into a single, shareable syntax with ontology-based semantics. Lastly, I have developed Web service endpoints that query this knowledgebase and return the results, which are displayed in a user interface developed by my colleague, Jim Balhoff.

By ontology-based semantics, I mean semantics defined in OBO ontologies. OBO predates the Semantic Web initiative, so this is something of a chicken-and-egg problem. While the Semantic Web pedant in me is dismayed by the lack of a mathematical framework in OBO definitions, which are nothing but rich text descriptions, the extent of knowledge annotation and reuse that biologists have achieved with OBO ontologies such as the Gene Ontology is very impressive.

The schema of the Phenoscape knowledgebase is based upon the Ontology-Based Database (OBD) schema developed by Chris Mungall at the Lawrence Berkeley National Laboratory for storing phenotype annotations. For clarity, I briefly define some of the terminology used in the rest of this post (and in the posts to follow).

Phenotype annotations are statements that relate evolutionary taxa to exhibited phenotypes. Taxa are the nodes of a taxonomy. Species such as Homo sapiens (humans) are the leaf nodes (or leaf taxa) of the taxonomy devised by Linnaeus for the classification of living organisms. When I talk about higher and lower taxa, I am referring to the relative positions of these taxa in the Linnaean taxonomy. Classes (concepts) are types or templates of real world "entities" (for want of a better word) with property-based definitions. Instances are real world occurrences of those types. The classic example: car would be a class, and my Porsche 911 GT would be an instance of the car class.

OBD is a relational database with built-in reasoning support that stores phenotype annotations as triples (though not as RDF). It comes with a reasoner that extracts transitive subsumptions and partonomies, as well as relation chains, from asserted data. The inference capabilities (deductive closure) of the OBD reasoner exceed those of the RDF reasoner.
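To make the idea of extracting transitive subsumptions and partonomies from asserted triples concrete, here is a minimal sketch in Python. It is not OBD's actual schema or reasoner: the triples, the relation names is_a and part_of, and the function transitive_closure are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical asserted triples (subject, relation, object).
# These are illustrative only and do not reflect OBD's real tables.
ASSERTED = [
    ("Homo sapiens", "is_a", "Homo"),
    ("Homo", "is_a", "Hominidae"),
    ("left ventricle", "part_of", "heart"),
    ("heart", "part_of", "cardiovascular system"),
]


def transitive_closure(triples, relation):
    """Return every (subject, relation, object) triple implied by
    treating `relation` as transitive, including the asserted ones."""
    # Index the directly asserted links for the chosen relation.
    direct = defaultdict(set)
    for subject, rel, obj in triples:
        if rel == relation:
            direct[subject].add(obj)

    inferred = set()
    for start in list(direct):
        # Depth-first walk over everything reachable from `start`.
        stack = list(direct[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(direct.get(node, ()))
        inferred.update((start, relation, target) for target in seen)
    return inferred


if __name__ == "__main__":
    for triple in sorted(transitive_closure(ASSERTED, "is_a")):
        print(triple)
    for triple in sorted(transitive_closure(ASSERTED, "part_of")):
        print(triple)
```

A real reasoner would typically materialize these inferred rows in the database so that queries need not recompute them; this sketch recomputes the closure in memory on every call.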

OBD allows evolutionary species to be defined and treated as concepts. Annotations of phenotypes exhibited by these leaf taxa are existentially quantified. Because of this, I was able to develop an extension to the OBD reasoner that associates higher level taxa in the Linnaean taxonomy with the phenotypes exhibited by lower level taxa. This makes it possible for biologists to query for all the phenotypes under a higher taxon, and see the phenotypes exhibited by all the lower, subsumed taxa as well as by the query taxon itself.

For example, given an existentially quantified assertion that Homo sapiens exhibits a four chambered heart, my extension to the OBD reasoner infers that the genus Homo also exhibits a four chambered heart. In everyday terms, given an assertion that some instances of Homo sapiens exhibit a phenotype, an opposable thumb for example, it is reasonable to infer that some instances of Homo exhibit the same phenotype as well. The existential quantification is key to inferring up the hierarchy and to adding inferences that do not lead to an inconsistent knowledge base. Existential quantifications are also convenient for assertions in the life sciences, which are rife with exceptions to the general rules.
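The upward inference described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of my reasoner extension: the PARENT and EXHIBITS dictionaries, the second taxon and its phenotype, and the function propagate_up are all hypothetical.

```python
# Hypothetical child -> parent links in a small slice of the taxonomy.
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Pan troglodytes": "Pan",
    "Pan": "Hominidae",
}

# Hypothetical existentially quantified annotations on leaf taxa:
# "some instances of this taxon exhibit this phenotype".
EXHIBITS = {
    "Homo sapiens": {"four chambered heart"},
    "Pan troglodytes": {"knuckle walking"},
}


def propagate_up(parent, exhibits):
    """Copy each 'some instances exhibit X' annotation to every ancestor.

    This is sound only because the annotations are existential: if some
    members of a subtaxon exhibit X, then some members of each enclosing
    taxon exhibit X as well. Universal claims could not be lifted this way.
    """
    inferred = {taxon: set(phenotypes) for taxon, phenotypes in exhibits.items()}
    for taxon, phenotypes in exhibits.items():
        seen = {taxon}  # guards against accidental cycles in the input
        ancestor = parent.get(taxon)
        while ancestor is not None and ancestor not in seen:
            inferred.setdefault(ancestor, set()).update(phenotypes)
            seen.add(ancestor)
            ancestor = parent.get(ancestor)
    return inferred


if __name__ == "__main__":
    for taxon, phenotypes in sorted(propagate_up(PARENT, EXHIBITS).items()):
        print(taxon, sorted(phenotypes))
```

Querying the result for a higher taxon such as Hominidae then returns the phenotypes of all the subsumed taxa, while sibling taxa such as Pan and Homo do not inherit each other's annotations.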

Phenoscape is a well-documented project, and more background information can be found on the Phenoscape informatics wiki and the Phenoscape blog. Phenoscape, it must be noted, addresses just one issue in the wide research area of evolutionary and biodiversity informatics, whose efforts at data integration and interoperability are confronted by problems very similar to those confronting medical informatics. On a parting note, I find it very satisfying that ontologies are crucial to addressing these issues in both research domains (and several others as well).