Jul 27, 2005
- Maybe this discussion has already taken place somewhere. But if so, I haven't seen it, and pointers would be appreciated.
- Terminology used below is RSS terminology; however, the same points are intended to apply to Atom.
1. The Disappearing Reference
Back at the beginning, an RSS item had three major elements: title, description and link. What to put in the first two of these was always reasonably clear. However, an ambiguity existed with respect to the third.
The link element was taken to contain a URL for the item being described in the title and the description. This left two possibilities: the link could point to one's own article, or it could point to an article written by a third party.
For example, the RSS for the New York Times might contain a list of articles, and the link for each item would point to the URL of the New York Times article, everything being in the nytimes.com domain. However, the RSS for Fred's Big Links might point to articles from many newspapers, the link in one pointing to nytimes.com and another to wapost.com and so on.
When software such as Blogger and LiveJournal embraced RSS, they embraced the first model. Thus, every link in every item in the feed generated by halfanhour.blogspot.com pointed to an article within the halfanhour.blogspot.com domain. RSS, therefore, was thought of as a site summary document, rather than a linking document.
Over time this has become the dominant model for RSS; my own RSS feed is one of the very few feeds left in the world listing links outside the feed domain. Almost all feeds point to pages within their own domain. However, RSS aggregators, such as Daypop, PubSub and Syndicate, provide RSS feeds with external links in the link element (as one would expect).
But what if we want to do both? What if we want to, say, create a post in Blogger that talks about an external resource, such as an article in the NY Times? It seems that we must pick one of the two possible links - the blogspot.com link or the nytimes.com link - to put into the link element. Blogger, of course, makes the choice for us, placing the blogspot.com link into the link element. But now, crucially, the nytimes.com link disappears from the RSS (or the Atom, as both have this problem).
2. The Need for Reference
But who cares, right? After all, if we have the link to the content document on Blogger, then we have all the information we need. The author can simply write about the NY Times as part of the post content, and embed a link into what will eventually become the description. Problem solved.
But - not really. For one thing, in order to obtain that link to nytimes.com it is necessary to do extra parsing to extract the href from the anchors embedded in the description HTML. For another, various URLs may be embedded in the HTML, some of which may not actually be references but merely helpful links added to make navigation easier.
The point is, in order to achieve any expressive power beyond merely replicating the content of an HTML page in another format, RSS (and Atom) needs some sort of reference element. Consider the following cases:
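To make the parsing burden concrete, here is a minimal sketch in Python using only the standard library. The description text and both URLs are made up for illustration. Note that the parser returns every href it finds; nothing in the HTML says which one is the reference and which is mere navigation.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(description_html):
    """Return all link targets embedded in a description, in document order."""
    parser = HrefCollector()
    parser.feed(description_html)
    parser.close()
    return parser.hrefs


# Hypothetical description: one genuine reference and one navigation link.
description = (
    '<p>The <a href="http://nytimes.com/example-article">NY Times</a> '
    'reports on feeds. See also my '
    '<a href="http://halfanhour.blogspot.com/">home page</a>.</p>'
)

links = extract_hrefs(description)
print(links)
```

Both URLs come back on equal footing, so an aggregator would still have to guess which one the post is actually about.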
- a discussion list is expressed as a series of RSS items. Reference is used to keep track of which comment replies to which.
- a conference organizer divides the conference into themes, each of which is represented as an RSS item, and in addition lists each conference presentation as an item. Reference is used to associate each presentation with a theme.
- a person presents a paper at a conference and this presentation is listed as an item in an RSS feed. Another person blogs about that presentation. Reference is used to associate the blog commentary with the original paper.
- a person blogs about an article in the New York Times. Reference is used to associate the blog post with the NY Times article. An aggregator uses these references to create a collection of blog posts about this particular article.
- a taxonomy is created as an RSS feed. Reference is used to associate items at lower levels in the taxonomy with items at higher levels of the taxonomy.
- a large document is split into several parts, each of which is described as a separate item. Reference is used to associate each of those parts with a common title and table of contents page.
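The aggregator case in the list above (collecting blog posts about one article) might be sketched as follows in Python. This assumes each harvested item already carries an explicit reference URL; the item data and the function name `group_by_reference` are hypothetical.

```python
from collections import defaultdict

# Hypothetical harvested items: (link to the blog post, URL the post references).
items = [
    ("http://blog-a.example/post1", "http://nytimes.com/article-x"),
    ("http://blog-b.example/post7", "http://nytimes.com/article-x"),
    ("http://blog-c.example/post3", "http://nytimes.com/article-y"),
]


def group_by_reference(items):
    """Map each referenced URL to the list of posts that reference it."""
    groups = defaultdict(list)
    for post_link, reference_url in items:
        groups[reference_url].append(post_link)
    return dict(groups)


collections_by_article = group_by_reference(items)
for article, posts in collections_by_article.items():
    print(article, posts)
```

With an explicit reference the grouping is a plain lookup; without one, the aggregator would first have to mine each description for candidate links.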
RSS referencing essentially creates distributed structured metadata. Because of the desirability of this, various alternatives are already available. Each alternative, however, has limited applicability and therefore does not offer a consistent approach to RSS referencing.
Several RDF data elements can be used to accomplish some functions of referencing. For example, RDF subClassOf can be used to represent taxonomical relationships. However, no RDF data element implies referencing specifically.
Referencing may also be accomplished using the rdf:about attribute inside the item tag, as demonstrated in this column by Mark Pilgrim. However, this mechanism is available only to RSS 1.0 and derivatives. Moreover, it merely relocates the original problem; the example just cited uses the rdf:about tag to replicate the contents of the link element.
Dublin Core offers several alternatives, including ispartof, reference, relation and others (see http://dublincore.org/documents/dc-citation-guidelines/). While extremely useful, these tags do not specify links to external resources as the link element does, but rather contain citation information, such as (say) a bibliographical element.
Various aggregators have attempted to create RSS structure through the use of tag or category elements. For example, authors blogging a particular conference, say, NECC, are encouraged to use a given tag, say NECC.
Tagging is not an instance of additional metadata, but rather, the placement of specific HTML code within the content description (or the body of a blog post). The NECC tag is created using 'a href="http://technorati.com/tag/NECC" rel="tag"'. Tagging is therefore an instance of the original problem, wherein the extraction of structural information requires specialized parsing of embedded HTML links.
The RSS 2.0 category element is more useful in the sense that it is an actual XML element, and does not therefore require separate parsing. However, this element is used specifically for the purpose of categorization, and although a link reference could, in theory, be placed inside a category element, most aggregators are not going to expect to process this link.
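By way of contrast with embedded tags, a category can be read with an ordinary XML parser and no HTML scraping. A minimal sketch in Python follows; the item content and URL are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 item carrying a category element.
item_xml = """
<item>
  <title>Notes from the conference</title>
  <link>http://blog.example/necc-notes</link>
  <category>NECC</category>
</item>
""".strip()

item = ET.fromstring(item_xml)

# The category is a first-class child element, so it is found directly.
categories = [c.text for c in item.findall("category")]
print(categories)
```

The value is easy to obtain, but it is still only a label; nothing tells the aggregator to treat it as a link to another resource.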
- Conversion of RSS items to channels
Some blog engines have enabled comment RSS feeds. This is typically accomplished by creating a separate channel for comments; David Phillip's Movable Type comment feed template provides a good example.
What has happened here is that the original blog post, which began as an item in another RSS feed, is now represented as a channel. The value of the link element in the original item is now the value of the link element in the channel.
This allows association between comments and posts, but at the cost of multiplying channels and duplicating post information (once in the original item element, and once in the channel element).
Moreover, the creation of a separate RSS channel presupposes that all comments are known, or are located in the same place. Where comments are distributed - as in, say, blog posts responding to other people's blog posts - the requisite channel might never be created.
4. Specific Mechanisms
The precise mechanism settled on by the RSS community may vary; in this section, however, I propose a specific mechanism as a template.
Essentially, in order to encode reference in XML, one or more RSS (or Atom) elements need to be created. These may be core elements, or they may be extensions.
For now, I will treat these elements as extensions. Accordingly, they are prefaced with 'ssn' (which stands for 'Semantic Social Network').
- parent - this tag, encoded 'ssn:parent', is a generic parentage relation. That is to say, when placed inside an RSS item, it refers the reader (or aggregator) to a higher level entity. For example, a chapter in a book would use 'ssn:parent' to point to the book home page; a comment in a discussion would use 'ssn:parent' to point to the comment it is replying to. Strictly speaking, only 'ssn:parent' is required to satisfy the requirements outlined above.
- replyto - this tag, encoded 'ssn:replyto', is used specifically in the context of discussion lists, and is used to point to the comment to which the given comment replies.
- reference - this tag, encoded 'ssn:reference', is used by a blog post or similar piece of writing to point to an external resource being described or referenced by the blog post.
Inside the tag, RSS content is displayed as though in a typical RSS item. This allows a content provider to (optionally) include information over and above link information, for example, the title of the resource.
The intent of a reference is to provide information about an external entity within the context of the current entity. For example, the intent of a 'replyto' element in a comment item is to provide information about a different comment, specifically, the comment being replied to.
The reference element itself must contain one specific piece of information: the URL of the external entity. The intent here is that the URL serves double duty, both as an indication of the location of the external entity, and as an identifier for the external entity. It may also include additional information, such as the title.
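What such an item might look like can be sketched in Python with the standard ElementTree module. The namespace URI below is a placeholder I have made up (no URI has been assigned to 'ssn'), and the function name `build_item` and all URLs are likewise hypothetical. The reference carries the required link and, optionally, a title.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI; an assumption for illustration only.
SSN_NS = "http://example.org/ssn"
ET.register_namespace("ssn", SSN_NS)


def build_item(title, link, reference_url, reference_title=None):
    """Build an RSS item whose ssn:reference points at an external resource."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link

    # Inside the reference, content is laid out as in a typical RSS item.
    ref = ET.SubElement(item, f"{{{SSN_NS}}}reference")
    ET.SubElement(ref, "link").text = reference_url  # the one required piece
    if reference_title is not None:
        ET.SubElement(ref, "title").text = reference_title  # optional extra
    return item


item = build_item(
    title="Thoughts on an article",
    link="http://halfanhour.blogspot.com/example-post",
    reference_url="http://nytimes.com/example-article",
    reference_title="An Example Article",
)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

The item's own link still points to the blog post, so the external link no longer has to compete for the link element.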
In a typical use, when additional information is not provided, it is anticipated that an aggregator will have the rest of the information about the external entity - the title, description, and the like - already harvested and in the database. Therefore, the URL of the external entry serves as a search parameter, allowing this information to be retrieved and displayed with the current resource.
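A rough sketch of that lookup in Python, with a dictionary standing in for the aggregator's database. The stored record and the function name `resolve_reference` are hypothetical.

```python
# Hypothetical store of already-harvested items, keyed by their URL.
harvested = {
    "http://nytimes.com/example-article": {
        "title": "An Example Article",
        "description": "A summary harvested from the publisher's feed.",
    },
}


def resolve_reference(reference_url, database):
    """Use the reference URL as a search key; return None if it is unknown."""
    return database.get(reference_url)


known = resolve_reference("http://nytimes.com/example-article", harvested)
unknown = resolve_reference("http://unknown.example/page", harvested)
print(known)
print(unknown)
```

Because the URL doubles as an identifier, no extra key has to be agreed upon between the feed author and the aggregator.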
In other cases, however, this information will not be available - for example, a person using Blogger does not have access to this data, nor does a service that harvests from only a few content feeds. In this case, the reference, as described above, provides *only* the external link.
If the service displaying the resource does not have a database of links, several options remain open:
- use a generic link title, such as 'Reference', and provide the URL to the viewer as a link
- use the link to access the HTML page and scrape the title from the page, then display the title
- use the link to access the HTML page and scrape the 'link rel' tag to obtain RSS for the page
- (best) use the link as a search term to use at an aggregator that does have the full RSS or Atom description and will return that XML to you
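The first two options above could be combined along these lines. This is a sketch only: it assumes the HTML of the referenced page has already been fetched by some other means (no network access is shown), and the names `TitleScraper` and `display_title` are my own.

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Capture the text of the first <title> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first title element is recorded.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def display_title(page_html=None, generic="Reference"):
    """Prefer a scraped page title; otherwise fall back to a generic label."""
    if page_html:
        scraper = TitleScraper()
        scraper.feed(page_html)
        scraper.close()
        if scraper.title and scraper.title.strip():
            return scraper.title.strip()
    return generic


page = "<html><head><title> An Example Article </title></head><body></body></html>"
print(display_title(page))
print(display_title(None))
```

The last option in the list, querying an aggregator that holds the full description, would replace the scraping step where such a service is available.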
That said, the reference element is best used in an environment that is both a writing and a reading environment; for example, it is better used in an environment like Bloglines, which connects a blog-authoring service to an aggregator, than in Blogger, which does not offer a blog aggregation service.
Alternatively, it may be worth considering the embedding of external information locally.
6. Expanding Reference
The types of reference described in this document are for the most part content-specific. They describe relations between one type of content and another.
Not all entities are content entities. Other entities include people, events, companies and locations.
We have already begun working with reference to some of these other entities. For example, longitude and latitude data in RSS feeds and GeoURL convert a place-specific RSS element into a reference to an external resource.
Many entities can be described using the simple syntax of RSS with a minimum of extension. An event, for example, can be described in RSS with the addition of date and location elements. An organization can be described in RSS with the addition of (say) contact information and (say) references to organization staff.
Developers in the RSS community (and, for that matter, in other XML communities) have for the most part not considered seriously the utility of linkages between entities, being instead focused on describing the current entity. This focus should, over time, change.