Friday, May 23, 2008

Introducing the Pubmed Annotator

In my last post, I held forth on the divide between biologists and computer scientists. Since then, I have been actively collaborating with biologists, trying to harness their knowledge of biomedical experiments through an annotation interface called the Pubmed Annotator.

The Pubmed Annotator is in an alpha state of development. At present, I'm addressing some advanced security issues raised by Ed and Luke, two of the Wilkinson lab's own superhacks. In its present avatar, the Pubmed Annotator allows a user to register, log in, query Pubmed with a Pubmed Identifier (PMID) to retrieve a publication, and then annotate the experiment described in that publication using Subject-Predicate-Object (SPO) triple syntax. I have a recent publication describing the objectives of the Pubmed Annotator project.

The Pubmed Annotator aims to elicit unique structured representations of biomedical experiments in SPO triple format. Each experiment can be stored as a collection of SPO triples, or by extension as RDF triples on the Semantic Web. First, this will enable easy querying of experiment details in a universally shareable syntax (RDF). Second, experiments can be compared for similarity. Third, logic-based reasoning mechanisms (one of the primary benefits of the Semantic Web) can be used to summarize experiments for the benefit of overworked biologists and the curators of biological knowledge bases. Lastly, raw annotations from users can be used to synthesize a controlled vocabulary (and an ontology) for the annotation of biomedical experiments. This constitutes a bottom-up approach to ontology synthesis, wherein raw data is used to create a template. Ontology development today is more often a top-down process, in which domain experts and knowledge engineers argue and somehow agree on a set of terms and logical definitions of these terms, capable of representing the knowledge domain of interest.
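To make the idea concrete, here is a minimal sketch of how an experiment might look as a set of SPO triples, and how the querying and similarity comparison mentioned above could work over them. The experiment identifiers, predicates, and term names below are invented for illustration; they are not the Annotator's actual vocabulary.

```python
# Two hypothetical experiments, each stored as a set of
# (subject, predicate, object) triples. In the full vision,
# these strings would be URIs in an RDF store.
exp1 = {
    ("experiment_1", "has_method", "microarray"),
    ("experiment_1", "studies_organism", "Mus musculus"),
    ("experiment_1", "measures", "gene_expression"),
}
exp2 = {
    ("experiment_2", "has_method", "microarray"),
    ("experiment_2", "studies_organism", "Homo sapiens"),
    ("experiment_2", "measures", "gene_expression"),
}

def query(triples, predicate):
    """Easy querying: return every object attached to a predicate."""
    return {o for (_, p, o) in triples if p == predicate}

def similarity(a, b):
    """Compare experiments via Jaccard similarity over their
    (predicate, object) pairs, ignoring the experiment identifier."""
    pa = {(p, o) for (_, p, o) in a}
    pb = {(p, o) for (_, p, o) in b}
    return len(pa & pb) / len(pa | pb)

print(query(exp1, "has_method"))   # {'microarray'}
print(similarity(exp1, exp2))      # 0.5 — two shared pairs out of four distinct
```

Even this toy version shows why the triple format is attractive: querying and comparison fall out of set operations, with no per-experiment parsing logic.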

I attended the Vancouver workshop of the Ontology of Biomedical Investigations (OBI) Consortium in February. I was mostly a passive observer as some of the best ontology developers, domain experts, and philosophers went to work arguing over commonplace terms and their definitions, which most common folk would merely take for granted. It was fascinating, to say the least.

The Wilkinson lab, on the other hand, is fast becoming a hotbed of bottom-up approaches to ontology development. Ben Good took the lead a few years ago with the excellent iCAPTUREr. Very recently, he has developed the Entity Describer as an add-on to Connotea, to help users apply ontologically defined terms when annotating publications of their choice on Connotea. This is an example of Semantic Social Tagging.

Along these lines, the Pubmed Annotator is currently being upgraded to use ontologies and ontologically defined terms to annotate publications. I believe the usefulness of ontologies to experiment annotation in general can be evaluated by a simple metric. Given that a user may annotate an experiment with terms of his choice, with ontologically defined terms, or with a combination of both, the metric is the ratio of the number of ontology terms used in annotating an experiment to the total number of terms used to annotate that experiment. Let me put this in a simple mathematical formula. Given that t is the total number of terms used to annotate an experiment e by a user u, and n is the number of ontologically defined terms used by u to annotate e, the efficiency metric m is defined as

m = n/t.

This is for one user u. By extension, the same metric can be used to quantitatively measure the effectiveness of an ontology for annotating several experiments by several users. All this is mere hypothesis, however. It is hoped that the data we gather will give us a realistic estimate of the correctness of this hypothesis.
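The metric and its extension to many users can be sketched in a few lines. The annotation counts below are invented examples; the averaging step is one plausible way to aggregate per-user ratios, not the only one.

```python
def efficiency(n, t):
    """Ratio of ontologically defined terms (n) to total terms (t)
    for one user annotating one experiment."""
    if t == 0:
        raise ValueError("an annotation must contain at least one term")
    return n / t

def mean_efficiency(annotations):
    """Extend the metric across many annotation events, each an
    (n, t) pair, by averaging the per-annotation ratios."""
    return sum(efficiency(n, t) for n, t in annotations) / len(annotations)

# Hypothetical data: one user used 3 ontology terms out of 4,
# another used 1 out of 2.
print(efficiency(3, 4))                    # 0.75
print(mean_efficiency([(3, 4), (1, 2)]))   # 0.625
```

A design question hiding in the aggregation: averaging ratios weights every annotation event equally, whereas summing the n's and t's first would weight longer annotations more heavily. Which is preferable depends on what the gathered data is meant to show.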

1 comment:

bgood said...

Hi Cartik,

I think the step towards using concept-defining URIs as your nodes and edges was important, good show. Where are you going to get them from?

Also, how are you going to judge the quality of the statements added to this knowledge base? How will you know if it is 'working'?