Monday, June 23, 2008

Introducing the Ontology-Based PubMed Annotator

Since my last post, I've pretty much kept my nose to the grindstone, and I have something to show for it: the PubMed Annotator on steroids, or ahem... the Ontology-Based PubMed Annotator, or OBPA for short. The OBPA, like its predecessor, the PubMed Annotator, allows the user, typically a biologist, to annotate biomedical experiments in RDF triple format for storage, subsequent querying, summarizing, and comparison. The difference is that the OBPA prompts the user with matching terms from a few preselected ontologies, in auto-complete mode, even as she fills in the fields. The user can choose to use terms from the ontologies for her annotation work, or she can use her own terms.
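As a rough illustration of the auto-complete behaviour described above (a sketch, not the OBPA's actual implementation), prefix matching against preloaded ontology term lists might look like this; the ontology names and term lists are invented for illustration:

```python
# Each preselected ontology contributes a list of term labels, keyed by
# ontology name. These lists are invented placeholders.
ONTOLOGY_TERMS = {
    "OBI": ["assay", "organism", "specimen collection"],
    "BFO": ["continuant", "occurrent", "process"],
}

def suggest(prefix, max_hits=10):
    """Return (ontology, term) pairs whose label starts with the prefix."""
    prefix = prefix.lower()
    hits = []
    for ontology, terms in ONTOLOGY_TERMS.items():
        for term in terms:
            if term.lower().startswith(prefix):
                hits.append((ontology, term))
    return sorted(hits)[:max_hits]
```

Typing "con" would surface ("BFO", "continuant") as a suggestion, while the user remains free to ignore the suggestions and type her own term.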

The OBPA keeps track of the number of terms the user borrows from each ontology, as a measure of the ontology's usefulness (cf. my previous blog entry). The OBPA is definitely more advanced than the PubMed Annotator, implementing more features and enhanced security. The following OWL ontologies are currently used in the OBPA:

a) The Ontology for Biomedical Investigations (OBI)
b) The MGED Ontology
c) Barry Smith's Basic Formal Ontology
d) Heinrich Herre's General Formal Ontology
e) Barry Smith's Relation Ontology
f) Michel Dumontier's Relation Ontology
g) OWL 1.0 Ontology
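The per-ontology borrowing count could be sketched as a simple tally over annotation records; the record format and the sample annotations below are invented for illustration:

```python
from collections import Counter

# Each annotation records the term used and, if it was borrowed, the
# source ontology; None means the user typed her own term. These sample
# records are invented.
annotations = [
    ("assay", "OBI"),
    ("my_own_label", None),
    ("continuant", "BFO"),
    ("process", "BFO"),
]

# Tally borrowed terms per ontology as a rough usefulness measure.
usage = Counter(src for _, src in annotations if src is not None)
```

For the sample records above, `usage` would credit BFO with two borrowed terms and OBI with one, while the user's own term is not counted against any ontology.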

The OBPA in its current version cannot handle OBO syntax. I believe ontologies such as the Foundational Model of Anatomy, Reactome, and UniProt will also be relevant to the OBPA. The OBPA, however, faces a significant roadblock that prevents the incorporation of more ontologies into its scope. Terms (classes and properties) from ontologies are loaded into the OBPA at deployment time. Given the slow performance of current versions of OWL-based APIs such as Jena and the OWL API, server-side deployment is a very tortuous process, with the server timing out frequently. Also, the terms are not updated periodically. At the current rate of progress on ontologies, the OBPA runs the risk of using obsolete terminology.
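For the narrow case of pulling class labels out of an OWL file in RDF/XML, the deployment-time extraction could in principle be done with plain XML parsing rather than a heavyweight OWL API. This is only a sketch with an invented RDF/XML snippet, not how the OBPA actually loads its terms:

```python
import xml.etree.ElementTree as ET

# A tiny, invented RDF/XML fragment standing in for a real OWL file
# read from disk at deployment time.
OWL_SNIPPET = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/onto#Assay">
    <rdfs:label>assay</rdfs:label>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/onto#Specimen">
    <rdfs:label>specimen</rdfs:label>
  </owl:Class>
</rdf:RDF>
"""

OWL = "{http://www.w3.org/2002/07/owl#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"

def extract_class_labels(rdf_xml):
    """Collect the rdfs:label of every owl:Class in an RDF/XML document."""
    root = ET.fromstring(rdf_xml)
    return [label.text
            for cls in root.iter(OWL + "Class")
            for label in cls.findall(RDFS + "label")]

print(extract_class_labels(OWL_SNIPPET))  # labels loaded once, at deployment
```

Of course, this handles only explicitly declared classes in RDF/XML; anonymous classes, imports, and reasoning are exactly what the full OWL APIs are for, which is why they remain the slow but necessary route.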

Ben Good's Entity Describer (E.D.), which works with ontologically defined terms, uses the interface provided by Freebase to dynamically extract terms from ontologies such as the Gene Ontology (GO), prompting the user with a suggestion box complete with a text description of the term, the ontology it is extracted from, and sometimes even a picture! Future revisions of the OBPA may incorporate this methodology to alleviate the problem of obsolete ontological terms. Another solution may be to create a service that periodically browses a selection of ontologies and presents the extracted terms on an interface accessible to applications such as the OBPA. An application such as the Ontology Lookup Service (OLS), which is also compatible with OWL ontologies, may help as well.

On a tangent, Mark Wilkinson suggested a future area of work where one could browse through the nodes of an ontology and extract the publications associated with every node. I'm putting it down here because it may be something for me to work on in the future, and also to ensure that you heard it here first!! In closing, I would like to thank Ed Kawas for his hands-on help with the jQuery part of the application, Luke McCarthy for his insightful tips on various aspects, and Ben Good for being the Dry Lab's own “thinker.”

UPDATE: It has been a while since the Wilkinson lab's server was changed from bioinfo.icapture.ubc.ca to the new server, which is why the link to the PubMed Annotator web UI is inactive. The WAR file I had on my laptop was lost forever when the laptop was stolen from my house in Vancouver. The code for the Ontology-Based PubMed Annotator is available in the Wilkinson lab's code repository, and I will be moving it to a new project on SourceForge very soon.

Friday, May 23, 2008

Introducing the Pubmed Annotator

In my last post, I held forth on the divide between biologists and computer scientists. Since then, I have been actively collaborating with biologists, trying to harness their knowledge of biomedical experiments through an annotation interface called the PubMed Annotator.

The PubMed Annotator is at an alpha state of development. At present, I'm addressing some advanced security issues brought up by Ed and Luke, two of the Wilkinson lab's own superhacks. In its present avatar, the PubMed Annotator allows a user to register, log in, query PubMed with a PubMed identifier to retrieve a publication, and then annotate the experiment described in the publication using Subject-Predicate-Object triple syntax. I have a recent publication describing the objectives of the PubMed Annotator project.

The PubMed Annotator hopes to elicit unique structured representations of biomedical experiments in SPO triple format. Each experiment can be stored as a collection of SPO triples, or by extension as RDF triples on the Semantic Web. For one, this will enable easy querying of experiment details in a universally shareable syntax (RDF). Second, experiments can be compared for similarity. Third, logic-based reasoning mechanisms (one of the primary benefits of the Semantic Web) can be used to summarize experiments for the benefit of overworked biologists and the curators of biological knowledge bases. Lastly, raw annotations from users can be used to synthesize a controlled vocabulary (and an ontology) for the annotation of biomedical experiments. This constitutes a bottom-up approach to ontology synthesis, wherein raw data is used to create a template. Ontology development today is more often a top-down process, where domain experts and knowledge engineers argue and, somehow, agree on a set of terms and logical definitions of these terms capable of representing the knowledge domain of interest.
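To make the SPO storage idea concrete, here is a minimal sketch (with invented URIs and example triples) of rendering an experiment's annotations as RDF in N-Triples syntax, a line-based serialization that is trivially shareable and queryable:

```python
# Invented namespace and triples, purely for illustration.
EX = "http://example.org/experiment#"

triples = [
    (EX + "exp1", EX + "usesOrganism", EX + "MusMusculus"),
    (EX + "exp1", EX + "hasAssay", EX + "MicroarrayAssay"),
]

def to_ntriples(spo_triples):
    """Render (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in spo_triples)

print(to_ntriples(triples))
```

Each annotation becomes one self-describing line, so a collection of experiments is just a concatenation of such lines, ready for RDF stores and query engines.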

I attended the Vancouver workshop of the Ontology of Biomedical Investigations (OBI) Consortium in February. I was mostly a passive observer as some of the best ontology developers, domain experts, and philosophers went to work arguing over commonplace terms and their definitions, which most common folk would merely take for granted. It was fascinating, to say the least.

The Wilkinson lab, on the other hand, is fast becoming a hotbed of bottom-up approaches to ontology development. Ben Good took the lead a few years ago with the excellent iCAPTUREr. Very recently, he has developed the Entity Describer as an add-on to Connotea, as a means of helping users apply ontologically defined terms when annotating publications of their choice. This is an example of Semantic Social Tagging.

Along these lines, the PubMed Annotator is currently being upgraded to use ontologies and ontologically defined terms to annotate publications. I believe the usefulness of ontologies for experiment annotation in general can be evaluated by a simple metric. Given that a user can annotate an experiment with terms of his choice, with ontologically defined terms, or with a combination of both, the metric is the ratio of the number of ontology terms used in annotating an experiment to the total number of terms used to annotate the same experiment. Let me put this in a simple mathematical formula. Given that t is the total number of terms used to annotate an experiment x by a user u, and n is the number of ontologically defined terms used by u to annotate x, the efficiency metric is defined as

E = n/t.

This is for one user u. By extension, the same metric can be used to quantitatively measure the effectiveness of an ontology in annotating several experiments by several users. All this is mere hypothesis, however. It is hoped that the data we gather will give us a realistic estimate of the correctness of this hypothesis.
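The ratio above is easy to compute, and extending it across users and experiments can be as simple as averaging the per-annotation ratios; a small sketch with invented sample counts:

```python
def efficiency(n_ontology_terms, n_total_terms):
    """The ratio n/t for a single annotated experiment."""
    return n_ontology_terms / n_total_terms

# Invented (n, t) pairs for several experiments annotated by several users:
# (ontology terms used, total terms used) per annotation.
samples = [(3, 10), (5, 5), (0, 4)]

# One simple extension: the mean per-annotation ratio for an ontology.
overall = sum(efficiency(n, t) for n, t in samples) / len(samples)
```

Whether a simple mean is the right aggregate (as opposed to, say, pooling the counts before dividing) is itself an open design choice that the gathered data should inform.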