Downes.ca ~ Dupes, Subs and Atom

Dupes, Subs and Atom

Apr 04, 2005
By Stephen Downes

Regarding Tim Bray, Why We Need Atom Now

Before commenting, let me first note that I am officially agnostic with respect to Atom. That is not to say I don't care; I care a great deal. But I do not take a stance favouring either Atom or RSS.

That said, it seems to me that your post, 'Why We Need Atom Now,' misses the point in some important respects, and brings into the mixture some elements that will not sit comfortably with the content syndication community.

Duplicates

I too am plagued with duplicates. As you know, aggregators usually attempt to eliminate duplicate listings (otherwise we would be subject to 24 copies of the same post every day). Some, such as my own aggregator, cross-check links against URLs; this fails with some sites, which add a new code to each page update. Others cross-check against content; this fails when bloggers like Dave Winer update each entry multiple times.
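To make the two strategies concrete, here is a minimal Python sketch of both checks and the way each one fails. It is an illustration only, not the code of any actual aggregator; the item dictionaries, the `dedupe_by` helper and the example.org URLs are all assumptions made for the example.

```python
import hashlib

def dedupe_by(items, key):
    """Keep the first item seen for each key value; drop later ones."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique

def by_link(item):
    # Strategy 1: two items are the same post if their URLs match.
    return item["link"]

def by_content(item):
    # Strategy 2: two items are the same post if their content matches.
    return hashlib.sha256(item["content"].encode("utf-8")).hexdigest()

# Hypothetical harvested items (all values invented for illustration).
items = [
    {"link": "http://example.org/post/1", "content": "First draft"},
    # Same post, but the site added a new code to the URL.
    {"link": "http://example.org/post/1?rev=2", "content": "First draft"},
    {"link": "http://example.org/post/2", "content": "Second post"},
    # Same URL, but the author edited the entry.
    {"link": "http://example.org/post/2", "content": "Second post, edited"},
]

url_result = dedupe_by(items, by_link)        # the changed URL slips through
content_result = dedupe_by(items, by_content) # the edited entry slips through
print(len(url_result), len(content_result))   # 3 3
```

Each check catches the duplicate the other misses, and neither catches both, which is the situation described above.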

The proposal to assign each resource a universally unique ID does not solve this problem. In fact, in some ways, it makes the problem worse.

First, it adds no information over and above that which was already available. Each resource already has a unique identifier - its URL. By definition, if two links point to the same URL, they are pointing to the same resource. Identifiers can be useful if there are multiple URLs for a given resource, but that is not the sort of case we are considering here; we are not getting duplicates because there are mirrors of the same post, we are getting duplicates because the URL or content has changed.

While it may appear that identifiers allow the URL or content to change while keeping the resource identifier constant, it should be noted that these changes are the result of a site policy. I have already seen sites changing URL identifiers to correspond with new advertising; this is a form of aggregator spamming and I have complained to the site author about it. In such cases, the site operator is as likely to create a new identifier for the resource as to create a new URL, because the same factors that motivate a change in URL also would motivate a change in identifier.

Second, even if an identifier addresses this issue, it is not unique to Atom. RSS 2.0 has, as you are aware, the GUID tag. This tag usually refers to the URL of the resource (it being a useful unique identifier) but could contain any unique identifier.
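As a rough sketch of how an aggregator might use the tag, the Python below reads an RSS 2.0 item's guid when one is present and falls back to the link when it is not. The sample feed, the opaque guid value and the `item_identity` helper are assumptions invented for the example; only the `guid` and `link` element names come from RSS 2.0.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written feed; every value here is invented.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example</title>
    <item>
      <title>With guid</title>
      <link>http://example.org/a?ad=7</link>
      <guid isPermaLink="false">post-a-2005-04-04</guid>
    </item>
    <item>
      <title>Without guid</title>
      <link>http://example.org/b</link>
    </item>
  </channel>
</rss>"""

def item_identity(item):
    """Return the guid if the item has one, otherwise its link."""
    guid = item.findtext("guid")
    if guid and guid.strip():
        return guid.strip()
    return (item.findtext("link") or "").strip()

root = ET.fromstring(RSS)
ids = [item_identity(item) for item in root.iter("item")]
print(ids)  # ['post-a-2005-04-04', 'http://example.org/b']
```

The second item has no guid and is still harvested, with its URL standing in as the identifier.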

But third, the process required by Atom may actually, as I suggested, make matters worse.

One way it could make matters worse is in bringing in the idea that the identifier tag is "required". My position is that the concept of a required tag runs contrary to the interests of the syndication community. If I publish an Atom feed without an identifier, is it not an Atom feed? How would this be enforced? Are you going to "harass them until they fix it" - really? I may be hearing a lot from you then. Sure, it might not validate - but who cares? The only relevant standard is whether aggregators will harvest my feed if it is missing this tag.

Another way it could make matters worse is that, if the globally unique identifier is not a URL, then each post must be submitted to a central registration process in order to acquire such an identifier (or at the very least, each post-producing site must do so). Aside from the inherent bottleneck and locus of control that this introduces, the use of identifiers - or let's just call them 'Handles' - commits the syndication community to a system in which publication and access are dependent on the will of a registrar, where such registrar may or may not be forthcoming.

Or we could return to the idea that the URL is the identifier, and identify the real problem, which is aggregator spamming. And leave it to individual aggregators and readers to choose whether or not to subscribe to such a feed, rather than place the entire blogosphere under a central authority.

Subscription

As you know, this has been much talked about recently. But it seems to me that this is a problem very easily solved at the browser level. Indeed, it can only be solved at the browser level.

As you note, "Atom feeds can (and should) contain a field named self that says 'here's my address', so that if you have a copy of the feed, you have everything you need to know to subscribe to it." I would point out that it is very rare to have only a copy of a feed; what one has usually is a direct link to the feed itself (that is, after all, what those orange buttons point to), in which case the address of the feed is contained in the link to the feed; no need even to look at the feed in order to subscribe to it.

As you know, browsers such as Firefox also use a separate application/rss+xml link element in the page header. And aggregators such as Yahoo and Bloglines have aggregator-specific links. None of this is new.
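For illustration, the header link can be read with a few lines of Python. This is a sketch under assumptions, not a description of how Firefox does it: the `FeedLinkFinder` class, the `find_feeds` helper, the sample page and the example.org address are invented, and the Atom media type is included alongside the RSS one only for completeness.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types treated as feeds in this sketch.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect feed addresses advertised by <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        media_type = (a.get("type") or "").lower()
        if "alternate" in rel and media_type in FEED_TYPES and a.get("href"):
            # Resolve relative addresses against the page's own URL.
            self.feeds.append(urljoin(self.base_url, a["href"]))

def find_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds

# Hypothetical page, invented for the example.
PAGE = """<html><head>
<title>Example</title>
<link rel="alternate" type="application/rss+xml" title="RSS" href="/index.xml">
<link rel="stylesheet" href="/style.css">
</head><body></body></html>"""

feeds = find_feeds(PAGE, "http://example.org/blog/")
print(feeds)  # ['http://example.org/index.xml']
```

The feed address is recovered from the page alone, without opening the feed.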

But no matter what system is used, the problem is not that the location of the feed is unknown; the problem is in sending this location to the appropriate aggregator. And it seems to me that if a browser can detect an RSS feed (which it could do by reading the second line of the document) then it can send the appropriate request either to a client-side application or to a server-side service. And it needs to do this whether it is getting the feed address from the link to the feed or from the content of the feed. In other words - the problem and the solution are the same in both Atom and RSS, and the solution outlined here hardly constitutes something unique in Atom.
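A minimal sketch of that hand-off, assuming the browser already knows the address it fetched the document from: it looks at the document's root element to decide whether it is a feed, then builds a subscription request for a server-side service. The function names and the aggregator.example address are hypothetical, and a real browser would also have to offer a choice of client-side applications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical server-side aggregator; not a real service.
AGGREGATOR = "http://aggregator.example/subscribe?url={url}"

def detect_feed_format(xml_text):
    """Return 'rss', 'atom', or None, judging only by the root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    # Strip any namespace, e.g. '{http://www.w3.org/2005/Atom}feed' -> 'feed'.
    local_name = root.tag.rsplit("}", 1)[-1].lower()
    if local_name == "rss":
        return "rss"
    if local_name == "feed":
        return "atom"
    return None

def subscription_request(doc_url, xml_text):
    """Build the subscribe URL if the document is a feed, else None."""
    if detect_feed_format(xml_text) is None:
        return None
    # The address comes from where the document was fetched,
    # not from anything inside the feed.
    return AGGREGATOR.format(url=quote(doc_url, safe=""))

rss_doc = '<rss version="2.0"><channel><title>Example</title></channel></rss>'
atom_doc = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title></feed>'

rss_request = subscription_request("http://example.org/index.xml", rss_doc)
atom_request = subscription_request("http://example.org/atom.xml", atom_doc)
```

The same code path serves both formats; only the root element name differs.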

As I said at the outset, I am agnostic about Atom. From my perspective, it offers some advantages, but has not in general been worth the extra work (as now every syndication application must deal with yet another syndication format). I understand why the Atom project emerged and am sympathetic with the founders.

But, you know, the more we wait for Atom, the more complex Atom becomes, the more it solves problems via 'requirements' rather than enablements, the less happy I become. Somebody could hand-code an RSS feed in notepad, slap it on any old web site, and it worked. Somebody could add tags or drop tags and it still worked. If Atom doesn't allow this, it's broken.