Tuesday, October 2, 2007

So you think you know ontologies?

The need to share biological data has led to the development of several high-profile and oft-referenced ontologies in the life sciences domain. Soldatova and King have pointed out limitations of many biological ontologies that threaten to undermine their long-term usefulness in the life sciences. Based on my experience in ontology development over the last five years, and my interactions with other ontology developers and with organizations looking to invest in (or already involved in) ontology development, the findings of this paper do not come as a surprise.

Today, the word "ontology" is used in a variety of contexts. Very often, it is used to refer to vocabularies and taxonomies. While an ontology can be both a vocabulary and a taxonomy, the converse is not true. Throwing together a subsumption hierarchy is not ontology development. Nor is the curation of a carefully controlled vocabulary of concepts pertinent to a knowledge domain. Many biologists (and other professionals) dabbling in ontology development are blithely unaware of the mathematical underpinnings of ontologies.

Concepts and relations that are defined as part of an ontology need to be grounded in mathematical axioms. Ontology development toolkits such as Protege and Altova isolate ontology developers, or biologists in this case, from this reality. For all their usefulness in enabling the adoption of ontologies, ontology development tools that conveniently generate OWL syntax obscure the reality that every construct in OWL (at least, the decidable species of OWL) has its semantic underpinnings in a rigorous and formal logical framework.
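To make that concrete, here is a hedged illustration of the correspondence between an OWL construct and its description-logic semantics. The class names Virus and InfectiousAgent are hypothetical examples of my own, not drawn from any particular ontology:

```latex
% Hypothetical OWL axiom:   SubClassOf( Virus  InfectiousAgent )
% Description-logic form:   Virus \sqsubseteq InfectiousAgent
% Model-theoretic reading:  in every interpretation \mathcal{I},
% the extension of Virus is a subset of that of InfectiousAgent:
\mathrm{Virus} \sqsubseteq \mathrm{InfectiousAgent}
\quad\Longleftrightarrow\quad
\mathrm{Virus}^{\mathcal{I}} \subseteq \mathrm{InfectiousAgent}^{\mathcal{I}}
```

The point is that clicking "subClassOf" in a GUI is not just drawing an arrow; it commits the ontology to a set-theoretic claim that a reasoner will enforce.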

Ontology developers need to understand data: how it is used, accessed, and, most importantly, modeled. Familiarity with the philosophy of the Entity-Relationship (ER) model or with the Object-Oriented (OO) philosophy is a necessary prerequisite to ontology development. A second prerequisite is an understanding of mathematical logic, first-order logic at the very least.
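As a sketch of what "first-order logic at the very least" means in practice: declaring a relation such as partOf transitive in an ontology amounts to asserting a first-order axiom. The relation name here is a generic example, not tied to any specific ontology:

```latex
% A partOf relation declared transitive corresponds to the
% first-order axiom:
\forall x\, \forall y\, \forall z\;
\bigl( \mathit{partOf}(x,y) \wedge \mathit{partOf}(y,z) \bigr)
\rightarrow \mathit{partOf}(x,z)
```

A developer who cannot read this axiom cannot predict what a reasoner will infer from the ontology they are building.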

Ontology development by specialists or domain experts amounts to a waste of their skills, if not a serious threat to the quality of the developed ontology. While ontology engineers need not be experts in a specific knowledge domain, their skills are relevant to the distillation of expertise from any domain into a representational framework such as OWL. I would not want an expert virologist to develop an ontology pertinent to viruses, any more than I would want a machinist or hangar technician to design and develop a database of airplane spare parts. The hangar technician would be best employed working with spares, not describing them.

On a positive note, these deficiencies are symptomatic of any new technology, particularly in the information technology area. In the middle and late 90s, programmers accustomed to the procedural syntax of languages such as C were slow to adopt and master the object-oriented philosophy behind languages such as Smalltalk and the then newly introduced Java. Ontologies are in the same phase of adoption today: a newfangled technology that promises to change the world as we know it, with its attendant evangelists (such as yours truly) and skeptics. Believe!!

Wednesday, August 1, 2007

What will it be? Resource or metadata? How to know the answer without asking (and looking dumb in the process)

Several proposals are being discussed in the HCLS community for distinguishing URIs that identify resources from URIs that identify metadata about those resources. One of the most touted proposals involves the use of the HTTP 303 (See Other) redirect to resolve a URI request. Based on the content type requested in the Accept header, a URI request can be resolved to either RDF/XML or a simple HTML page.

Consider a scenario from a recent publication. A request URI cannot distinguish between a person's web page and the person himself. An example is the URI http://www.illuminae.org/home/MarkWilkinson, which may resolve either to Mark Wilkinson's Web page or to Mark Wilkinson himself, who may be described by a set of RDF triples. If the content type requested in the HTTP header is "application/rdf+xml", the RDF document with the URI http://www.illuminae.org/rdf/MarkWilkinson is retrieved. If it is "text/html", the Web page with the URI http://www.illuminae.org/html/MarkWilkinson is retrieved. Note the substitution of the "rdf" and "html" terms in the path of the resolved URI in place of the original "home" term, creating a URI hierarchy of sorts. This proposal again seems to be a temporary hack rather than a long-term solution.
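The redirect scheme above can be sketched in a few lines. This is a minimal simulation of the dispatch logic only, using the URIs from the example; it is not the actual server implementation, and a real server would of course issue a genuine 303 response rather than return a string:

```python
# A minimal sketch of 303-style content negotiation: map a generic
# resource URI to a representation URI based on the Accept header.
# The URI layout (/home/ vs /rdf/ vs /html/) follows the example
# discussed above; the function itself is hypothetical.

def resolve(request_uri, accept_header):
    """Return the URI a 303 (See Other) redirect would point to."""
    if accept_header == "application/rdf+xml":
        # Machine client: redirect to the RDF description of the person
        return request_uri.replace("/home/", "/rdf/")
    # Default (e.g. a browser sending text/html): the human-readable page
    return request_uri.replace("/home/", "/html/")

uri = "http://www.illuminae.org/home/MarkWilkinson"
print(resolve(uri, "application/rdf+xml"))
# -> http://www.illuminae.org/rdf/MarkWilkinson
print(resolve(uri, "text/html"))
# -> http://www.illuminae.org/html/MarkWilkinson
```

Even this toy version makes the objection visible: the scheme hard-wires a parallel URI hierarchy into every server that adopts it.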

In contrast, the XSLT-based approach discussed by InChi holds a lot of promise. Entire RDF documents can be translated into HTML pages for display in a Web browser, while the triples remain accessible in plain RDF. This dovetails with the idea of abstracting the details of the Semantic Web away from the end user, as discussed in my previous post here. No troublesome redirects. And this came through the HCLS mailing list only this morning. Moral of the story: never lose hope!
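One common way to wire this up is an xml-stylesheet processing instruction at the top of the RDF/XML document. The sketch below is my own illustration: the stylesheet name rdf2html.xsl is hypothetical, and the FOAF triple is just a plausible example reusing the URI from the earlier scenario:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="rdf2html.xsl"?>
<!-- rdf2html.xsl (hypothetical) renders the triples below as a
     human-readable HTML page in a browser; RDF-aware clients
     ignore the processing instruction and read the triples. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://www.illuminae.org/home/MarkWilkinson">
    <foaf:name>Mark Wilkinson</foaf:name>
  </foaf:Person>
</rdf:RDF>
```

One document, one URI: browsers see a rendered page, agents see triples, and no redirect machinery is needed.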

Thursday, July 5, 2007

John Doe and the Semantic Web

There has been a recent post on the HCLS Wiki comparing the relative strengths and weaknesses of the various proposed solutions to the Uniform Resource Identifier (URI) problem on the Semantic Web. A matrix of the various proposals with their desired features has been created. What got my noodle in this matrix was a "clickability" feature. Going by the definitions of clickability from the Human-Computer Interaction course back in graduate school, I suppose an end user should be able to click on a URI that identifies a node and access a human-friendly definition of the resource.

This is bothersome. Why would a naive end user want to access a node? At best, this would be analogous to looking through an XML file. Now how much of this would John Doe really "get"? In the early days of XML, customized tags with DTDs were touted as one of its main features. You could use CSS or XSLT for rendering it on a browser. That changed very quickly. Now we use XML transparently in databases and Web services without the need to view concept and message definitions on a User Interface such as a browser. If clickability were a desired feature of the message formats and protocols defined in SOAP and WSDL, how much of a WSDL definition would make sense to a naive end user? And why do we have to bother with writing an XSLT program to render WSDL on an interface for an end user? Would it make any sense?

I believe the Semantic Web will be transparent to the naive end user. John Doe will continue using seemingly unstructured raw text data that is formatted for his easy consumption on a Web user interface. The various concepts and relations in the text he reads will hook into logically grounded definitions in an ontology, after the fashion of hyperlinks. This is what will enable John Doe to run informed searches through the Web and invoke agent programs that will help him plan a vacation to Florida by using sophisticated reasoning and planning algorithms. The Semantic Web with its carefully curated ontologies will exist at a "higher level of abstraction" from the Web that is accessed by John Doe, but should not be readily visible to him.

Thursday, June 28, 2007

The battle for LSIDs and the obsession with browsers

I've been actively following the debate over resolvable URLs in the Health Care and Life Sciences (HCLS) community on the Semantic Web. At the workshop in Banff, held as part of the WWW 2007 conference, some leading thinkers in the community actually questioned the utility of resolvable URLs. Shocking!

Non-resolving namespaces are the bane of semantic interoperability. DIE, example.org, DIE!! If definitions of concepts cannot be resolved to specific nodes on the Semantic Web, the best antidote is to discard them altogether, IMHO. URIs can be location-specific, as in URLs, or location-independent, as in URNs. Again, location is given way too much importance at this juncture. As an ontology developer, I do not really care where a concept is defined, as long as I can access the definition. In other words, I NEED the definition, but I couldn't care less whether it came out of a location in Timbuktu or Flin Flon, Manitoba.

The emphasis on location stems from the desire of life scientists to view definitions of concepts in a browser, a reluctance to let go of the browser. Ironic indeed! The utility of the Web and the Web browser was not immediately apparent in the early days (circa 1990) to the scientific community. But once the benefits of Web pages became clear, they were embraced and have become a cornerstone of the scientific research community. As an HCI researcher from Microsoft said in a keynote address at WWW 2007, browsers are clearly on the way out. Tim Berners-Lee, in his seminal paper on the Semantic Web, describes a network of agents that can be invoked through interfaces (not necessarily browsers) and that can process machine-understandable content to make intelligent decisions.

The dependence upon browsers requires users to remember (if not bookmark) URLs. In its heyday, the AOL browser required users only to type in a keyword to locate a Web page. For example, typing in the keyword "NFL" would bring up the homepage of the National Football League. On a conventional browser, users had to remember the protocol (HTTP) as well as the complete URL to access the very same page. The use of URNs may very well follow the same approach as the AOL keyword: it frees the user from the need to remember or bookmark URLs.

LSIDs (Life Science Identifiers) are URNs that are location-independent and resolvable. To end users, LSIDs are transparent, allowing access to web services from registries such as BioMoby. LSIDs can handle versions of concept definitions. Because they uniquely identify concepts within an ontology, they can be used to extract specific concept definitions from ontologies without necessarily downloading the entire ontology. They also allow the capture of metadata about the concept definition: the identity of the authority that defined the concept, the version of the definition, and a timestamp, among other things. On the other hand, Ben Good has pointed out some of the crucial limitations of the LSID idea. These are temporary limitations, though.
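The structure that makes all this possible is the LSID's fixed shape: urn:lsid:&lt;authority&gt;:&lt;namespace&gt;:&lt;object&gt;[:&lt;revision&gt;]. Here is a minimal sketch of pulling those components apart; the parser is my own illustration, and the sample identifier is the illustrative PubMed-style LSID often used in discussions of the spec:

```python
# A sketch of decomposing an LSID into its parts, per the general
# shape urn:lsid:<authority>:<namespace>:<object>[:<revision>].
# This toy parser is illustrative, not a full LSID resolver.

def parse_lsid(lsid):
    """Split an LSID into (authority, namespace, object_id, revision);
    revision is None when absent. Raise ValueError on malformed input."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not a valid LSID: %r" % lsid)
    authority, namespace, object_id = parts[2], parts[3], parts[4]
    revision = parts[5] if len(parts) > 5 else None
    return authority, namespace, object_id, revision

print(parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434"))
# -> ('ncbi.nlm.nih.gov', 'pubmed', '12571434', None)
```

Note that nothing in the identifier says *where* to fetch the definition; the authority component is handed to a resolver, which is exactly the location independence argued for above.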

Of late, the HCLS community has been discussing the use of LSIDs and LSID resolvers to address the problem of non-standard naming protocols in life science ontologies. The Banff Manifesto is an initiative that hopes to address the same issue. These are very promising developments. I look forward to the day when example.org is consigned to the dustbin and lingers on only as a joke... Cheers!