
Annotations: What they are and why I want them

What they are

Annotation is the process of selecting portions of an original document and attaching additional information to them. These additions are not necessarily saved with the original and are commonly used by editors and scholars to include background citations. In computer science, annotations are used in Semantic Web and work collaboration technologies. The Comment function in MS Word is an example of using annotations for collaboration.

Business uses

Aside from these functions, I think there is a whole class of business processes that could benefit from annotation technology. For example, we often copy line items from a Request For Proposal (RFP) and paste them into our proposal. Later, these line items may get pasted into a Product Backlog, Production Readiness Checklist, or ticketing system.

As a programmer, I shudder whenever I see so much copy-and-pasting. If a block of text is important enough to be present in multiple documents, it should be extracted as its own entity and reused. This eliminates the problem of maintaining that block in multiple places and ensures that each instance of the block is accurate. More importantly, it helps you concentrate on adding value rather than repeating yourself (see the DRY principle).
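Here is a minimal Python sketch of what "extract it as its own entity and reuse it" could look like. Everything in it is an assumption made for illustration: the `{{ID}}` placeholder syntax, the `render` function, and the sample line item are hypothetical and do not come from any particular tool.

```python
import re

# Each line item lives in exactly one place, keyed by a stable ID.
# The ID and wording are made-up sample data.
LINE_ITEMS = {
    "RFP-4.2": "The system must support 500 concurrent users.",
}

# Documents refer to a line item as {{ID}} instead of pasting its text.
REFERENCE = re.compile(r"\{\{([A-Za-z0-9.\-]+)\}\}")


def render(template: str, items: dict) -> str:
    """Replace every {{ID}} placeholder with the single stored copy of its text.

    An unknown ID raises KeyError, so a stale reference fails loudly.
    """
    return REFERENCE.sub(lambda match: items[match.group(1)], template)


proposal = "We will meet this requirement: {{RFP-4.2}}"
backlog = "Story: {{RFP-4.2}}"
print(render(proposal, LINE_ITEMS))
print(render(backlog, LINE_ITEMS))
```

Both documents pick up any correction to the line item the next time they are rendered, because neither one holds its own copy.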

Categories of technologies

Annotation technologies can be split into two broad categories based on where the annotations are saved: on a server or in the document. Annotations that are saved centrally on a server lend themselves to collaboration because multiple people can update the annotations without having to modify a master document. On the other hand, document control may be more important than collaboration, in which case storing the annotations within the document would be more appropriate.
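The difference between the two categories can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Annotation` record, the character-offset scheme, and the `[text|note]` inline syntax are all hypothetical, and the sketch assumes annotations do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first annotated character
    end: int    # offset one past the last annotated character
    note: str


def annotated_text(document: str, ann: Annotation) -> str:
    """Server-style lookup: the document is untouched, the annotation points into it."""
    return document[ann.start:ann.end]


def embed_inline(document: str, annotations: list) -> str:
    """Document-style storage: return a copy with the notes written into the text."""
    result = document
    # Work from the end so earlier offsets stay valid as text grows.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        marker = f"[{result[ann.start:ann.end]}|{ann.note}]"
        result = result[:ann.start] + marker + result[ann.end:]
    return result


doc = "The system must support 500 users."
start = doc.index("500 users")
ann = Annotation(start, start + len("500 users"), "needs a load test")
print(annotated_text(doc, ann))
print(embed_inline(doc, [ann]))
```

In the first style the list of `Annotation` records could live anywhere, such as a shared server, while in the second the notes travel with the document and fall under whatever controls the document does.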

Example technologies

While Googling for what's around the web regarding annotations, I ran across Ian Lumb's blog. He's got a number of excellent posts on the subject. I also found a number of different technologies listed over at the Semantic Web portal, but the list appears to be dated (several were private projects, there were a couple of broken links, and a few of the projects appear to be unmaintained). However, there were enough working projects there for me to get an idea of what people are doing in the field of annotations. The promising projects I saw there were:

Annotea is more of a protocol than software, but there were a number of client and server implementations listed (Annozilla was one of them). The Zope server product ZAnnot was a breeze to install on top of a fresh download of Zope, and I was able to get it working with Annozilla pretty quickly. Annozilla itself, though, needs a little TLC before I can incorporate it into a working system.

First of all, Annozilla hasn't been updated in about nine months and requires Firefox 3.5 (the current version is 3.6, so I reinstalled 3.5 just to try it out). Secondly, the annotations themselves are free-form and cannot be constrained or reused (for example, by binding them to an ontology). This is a well-known problem in the world of tagging content, where search-as-you-type tagging helps you avoid multiple tags that are almost but not exactly the same (such as the tags “annotations” and “annotation”).
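For illustration, search-as-you-type against a fixed vocabulary can be as simple as a prefix match. The sketch below is in Python and is not how Annozilla or any tagging system actually works; the `suggest_tags` function and the sample vocabulary are hypothetical.

```python
def suggest_tags(prefix: str, vocabulary: list, limit: int = 5) -> list:
    """Return existing tags that start with what the user has typed so far."""
    needle = prefix.strip().lower()
    if not needle:
        return []  # nothing typed yet, so nothing to suggest
    matches = [tag for tag in sorted(vocabulary) if tag.lower().startswith(needle)]
    return matches[:limit]


# Made-up vocabulary; in practice it might come from an ontology.
vocabulary = ["ontology", "annotation", "requirement", "annex"]
print(suggest_tags("annot", vocabulary))
```

Offering the existing tag "annotation" as soon as the user types "annot" is what keeps a near-duplicate like "annotations" from being created.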

GATE (General Architecture for Text Engineering) is an amazing collection of projects centered around doing things with text. I found only one technology in there that was relevant to annotations, and that was for automatic annotation (examining text and feeding annotations into Annotea). There were a lot of other interesting libraries in there and I hope to check them out later. In particular, the GATE team has announced a forthcoming Teamware application that will incorporate annotation technology and workflows.

Likewise, the KIM Semantic Annotation Platform seemed to be oriented more toward automatic annotation than toward streamlining human annotation.

Nuxeo Document Management is a Java-based product very similar to Alfresco. It has a document preview feature that shows you an HTML version of the Word or PDF documents it stores, and it also has an annotation module that lets you annotate that preview. [edit: When I originally installed it I rushed through and misread some of the documentation. Stefane Fermigier, the founder of Nuxeo, saw this post and corrected me, but I have yet to revisit Nuxeo. In the interim I’ve edited this post to remove my misinterpretations. Hopefully soon I’ll be able to follow up with a more in-depth look at Nuxeo.]

Lastly, I gave SMW+ a try. I've used Semantic MediaWiki before and I think it has great promise. I especially like the rich report formats that you can generate with semantic queries (like showing a SIMILE timeline as the result of a time series query). There was a long list of extensions to install, but I eventually got to the point where I could copy and paste a whole RFP as a wiki page and annotate it. The annotation tool was AJAX-based and a little clunky, but once I was done I had a nice report of each annotation in the original document. It wasn't clear how to bind an ontology to it, so I have the same complaint with it as I do with Annozilla.

Parting Thoughts

The kind of annotation I want to do seems possible using some of the WYSIWYG editors embedded in most blogs and CMSs (like kupu or TinyMCE). A colleague of mine noted that you could simply supply a custom CSS style that would be included in the menu of available styles, but refrain from adding any style changes to it. So in theory any CMS that provides a document preview à la Nuxeo should be able to supply an annotation editor that wouldn't change the original. That, to me, is the most promising direction.
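The no-op style idea can be sketched in Python as follows. This is only an illustration under assumptions: the class name `annotation`, the `data-annotation-id` attribute, and both function names are hypothetical and are not taken from kupu, TinyMCE, or Nuxeo.

```python
import re
from html import escape, unescape

# Matches only the exact span format produced by mark_annotation below.
ANNOTATION_SPAN = re.compile(
    r'<span class="annotation" data-annotation-id="[^"]*">(.*?)</span>',
    re.DOTALL,
)


def mark_annotation(text: str, start: int, end: int, annotation_id: str) -> str:
    """Wrap text[start:end] in a span whose class carries no visual styling."""
    before, selected, after = text[:start], text[start:end], text[end:]
    return (
        f"{escape(before)}"
        f'<span class="annotation" data-annotation-id="{escape(annotation_id)}">'
        f"{escape(selected)}</span>"
        f"{escape(after)}"
    )


def strip_annotations(fragment: str) -> str:
    """Remove the annotation spans and recover the original plain text."""
    return unescape(ANNOTATION_SPAN.sub(r"\1", fragment))


text = "Must support 500 users"
start = text.index("500 users")
marked = mark_annotation(text, start, len(text), "req-1")
print(marked)
print(strip_annotations(marked) == text)
```

Because the class adds no style rules, the preview looks the same with or without the span, and stripping the spans gives back the text that was there before annotating.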

If you have any experience with similar technology please let me know in the comments.