Persistent Identification and Public Policy

Nov 09, 2004
By Stephen Downes

Note: this article is a response to the discussions held at the Persistent Identifiers Seminar at University College Cork, Ireland, last June. The summary to which I refer is not yet available on the internet. When (and if) it is made available, I will link to it here. Presentations are available on the conference website.

Despite what the name implies, as has been widely observed, a persistent identifier (PI) does not guarantee that an object will be available - my observation is that a significant cause of broken URLs is that the resource has ceased to exist or has been removed from public access. I don't expect this to change in a post-PI world - one might recall that when the U.S. government changed hands four years ago many resources were removed from public access by the incoming government as being no longer current.

From my perspective, the ability of a PI system to associate multiple locations with a resource is a major asset (and, as the author noted, the lack of this ability is a significant weakness of the widely used PURL system). But the choice among multiple URLs must be resolved either by the address server or by the browser. If the former, there may be concerns regarding the opacity of the process; such a system could be used, for example, by governments or service providers to hinder access to information, to redirect to similar information offered by 'content partners', and more. But if the latter, then browsers themselves must choose which location to attempt to access. Browsers do not have this capacity - hence the discussion and concern about the need for a plug-in. It would be well worth observing how the Mozilla developers (who have released Firefox) approach this issue.
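To make the browser-side option concrete, here is a minimal sketch in Python of a client choosing among several locations registered for one identifier. It is an illustration under stated assumptions, not a description of any actual PI system: the identifier syntax, the example.org URLs, and the names LOCATIONS and choose_location are all hypothetical, and the reachability check is passed in as a function so that the policy stays visible to the user rather than hidden in an address server.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical resolver record: one persistent identifier, several locations.
LOCATIONS: Dict[str, List[str]] = {
    "pi:example/123": [
        "http://mirror-a.example.org/doc123",
        "http://mirror-b.example.org/doc123",
        "http://origin.example.org/doc123",
    ],
}

def choose_location(identifier: str,
                    is_reachable: Callable[[str], bool],
                    table: Dict[str, List[str]] = LOCATIONS) -> Optional[str]:
    """Return the first reachable location for the identifier, or None."""
    for url in table.get(identifier, []):
        if is_reachable(url):
            return url
    return None

# Simulated probe: pretend the first mirror is down.
down = {"http://mirror-a.example.org/doc123"}
print(choose_location("pi:example/123", lambda url: url not in down))
```

In a real client the probe would be a network request; the point of the sketch is only that the selection rule is one the user can inspect and replace.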

There was in addition discussion of the idea that a PI system should provide metadata about the resource in addition to location information. On the face of it this is a good idea, as metadata can be used to facilitate searching and to customize display. To date, while browsers have the capacity to display metadata (via CSS or XSLT, with limitations) they do not have the capacity to use metadata to alter the display. Moreover, the major use of metadata in searching is as a filtering mechanism; it allows users to select by language, for example, or select by format. However, to date the only location where such filtering could occur is in the address server, which then assumes the functions of a search engine. This would result in a fracturing of the search system, as no address server will serve locations for all resources (a problem exacerbated by the different types of address servers proposed).

The PI system therefore assumes a federated search system, in which a search is propagated to other (strategically allied) address servers, with the retrieval of search results depending on authentication. This raises numerous issues - such searches are significantly slower, result in a combinatorial explosion of search processes, and impose a burden on content providers. An alternative is to allow a service such as Google to harvest metadata from address servers, which separates the search from the address resolution. This results in a much faster and more open search process. However, although it was only touched upon in this paper, content providers expect to maintain control over metadata, making it available only to authenticated searchers. As it will not be possible to bar centralized search services (who will, after all, be indistinguishable from authenticated users), it will be necessary to create a legislative environment protecting metadata, or assume that some metadata - such as the location of a resource - is in the public domain, like street addresses or telephone numbers. It is not a stretch to imagine a Napster-like dispute arising over the ownership of resource locations, and legislators should be advised in advance to expect such an event.
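The difference between the two approaches can be sketched in a few lines of Python. This is a toy model under my own assumptions: the AddressServer class, its search and export_all methods, and the sample records are hypothetical and do not correspond to any real address server interface. It shows only that a federated search contacts every server for every query, while a harvested index is built once and then searched locally.

```python
class AddressServer:
    """Hypothetical address server holding identifier -> metadata records."""

    def __init__(self, records):
        self.records = records
        self.queries = 0  # how many searches this server has had to answer

    def search(self, language):
        self.queries += 1
        return [pid for pid, md in self.records.items()
                if md["language"] == language]

    def export_all(self):
        # What a harvester would collect in a single pass.
        return dict(self.records)

def federated_search(servers, language):
    """Propagate the query to every server at search time."""
    results = []
    for server in servers:
        results.extend(server.search(language))
    return sorted(results)

def harvest(servers):
    """Build one central index from all servers, once."""
    index = {}
    for server in servers:
        index.update(server.export_all())
    return index

def indexed_search(index, language):
    """Search the harvested index without contacting any server."""
    return sorted(pid for pid, md in index.items()
                  if md["language"] == language)

servers = [
    AddressServer({"pi:a": {"language": "en"}, "pi:b": {"language": "fr"}}),
    AddressServer({"pi:c": {"language": "en"}}),
]
print(federated_search(servers, "en"))          # contacts both servers
print(indexed_search(harvest(servers), "en"))   # contacts none at query time
```

Both calls return the same identifiers; what differs is who does the work at query time, and, as the paragraph above argues, who controls the metadata once it has been harvested.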

This becomes a particularly pointed issue when the opinions of third parties with respect to resources come to bear. Today, if I conduct a search for a resource, say, a particular journal article, I may obtain not only the article but also commentary on the article in my search result. However, if search is metadata driven, and if the only source of metadata is the publisher of the article, then my search results will not include commentary and criticism (or, for that matter, third party classifications, ratings, or even alternative sales points). I have discussed the need for third party metadata elsewhere. For now, it is sufficient merely to argue that if metadata about a resource is made a part of persistent identifiers, it is important that this metadata be collected from numerous sources.
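As a rough illustration of what collecting metadata from numerous sources might look like, the following Python sketch merges statements about the same identifier while keeping track of who said what. The statement format (source, identifier, field, value), the function name merge_metadata, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

def merge_metadata(statements):
    """Group (source, identifier, field, value) statements by identifier
    and field, preserving the source of every value."""
    merged = defaultdict(lambda: defaultdict(list))
    for source, identifier, field, value in statements:
        merged[identifier][field].append((source, value))
    return {pid: dict(fields) for pid, fields in merged.items()}

# Hypothetical statements from a publisher and two third parties.
statements = [
    ("publisher", "pi:article/1", "title", "A Sample Article"),
    ("publisher", "pi:article/1", "price", "30.00"),
    ("reviewer", "pi:article/1", "rating", "4/5"),
    ("reseller", "pi:article/1", "price", "25.00"),
]

record = merge_metadata(statements)["pi:article/1"]
for field, values in record.items():
    print(field, values)
```

Because each value carries its source, a searcher can still choose to trust only the publisher, but the commentary, ratings and alternative sales points are not lost.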

It is worth noting that a major driver behind authenticated access to metadata is that personal information may be used to drive business rules. This is one of the major reasons why commercial publishers, such as newspapers, are increasingly requiring reader registration. In addition to obvious applications, such as signing bulletin board comments and personalizing news selection, authentication allows advertisers not only to understand reader demographics but also to tailor advertising. Authentication will almost certainly be used to drive pricing differentiation - Lawrence Lessig recently commented on this.

While it was only briefly mentioned in the paper, the use of an address server that distributes metadata may (and probably will) be used to refer to non-digital resources. One person mentioned the use of such a system to identify people. There has already been progress in this area: SxIP (Simple eXtensible Identity Protocol) is a system that follows a hub-and-spoke model similar to the Handle System. In the SxIP system, users maintain ownership and control over personal information metadata, which they may distribute on request to various websites. All other things being equal, such a system has wide potential for personal identification with respect to government services, such as (for example) a single health care record. SxIP has not announced publicly any intent to enable third party enquiries based on SxIP identifiers, but this is a natural, and in my mind inevitable, consequence of such a system.

The paper commented that "persistent identification and resolving is fundamentally a social problem." I believe that this is true, but believe that it would be a mistake to say that this can be addressed by "philosophical changes within an organisation". The concept of persistent identification, particularly when extended to include reference to humans and other non-digital entities, is not limited by the bounds of organizations. Without recognition that persistent identification is a social good, and not merely an organizational good, it will be limited to piecemeal application and inconsistent implementation, creating a Byzantine arrangement of competing protocols and standards, disputes over ownership and sharing of metadata, disrupted access to resources, and fragmentation of the internet as a whole.

For example, it is well worth considering a question that did not arise in the seminar summary - the persistence of metadata. It is possible that this is not seen as an issue because the presumption is that there will be a sole source of metadata for any given resource. However, for numerous reasons, this presumption should be challenged. If it is successfully challenged, then there will be metadata extant beyond the immediate control of the resource publisher. Should this be the case, then any change in metadata - an author changes her name, for example, or the price of a popular resource is increased - will result in a duplication of the persistence issue at the metadata level. Indeed, in any case where string data is today used as a metadata value, there exists the possibility that metadata will change.

We see this issue today addressed by the requirement of taxonomies and canonical vocabularies for metadata contents. Consistency of reference is essential if metadata is to be reliable. There must be agreement on whether a resource is, for example, a 'picture', an 'image', a 'photograph' or a 'snapshot'. In my view, however, it is unlikely that there will be concord among such vocabularies (and likely that there will be a constant need for verification and enforcement), if only because the granularity of such description shifts across disciplines. For educators, for example, it is sufficient to know that a resource is a JPEG file. For a digital artist, however, the JPEG compression ratio provides an additional level of granularity necessary in content selection. It seems clear therefore that the contents of metadata should, where possible, point to other resources rather than to strings - this is the purpose of ontologies. But we need to understand that, for persistence of metadata to exist, this concept needs to extend to such entities as publishers, departments, and individuals.

In order to achieve genuine persistence, therefore, it is important that we consider the persistence of identification of documents not in isolation, but as a part of a more comprehensive strategy of persistent identification of resources in general, and consider the need for, and mechanisms for, the identification and delivery of metadata not only about documents but also about non-digital entities. And in order for such a system to be viable, there is a need for there to be a certain mobility of metadata - if metadata about a resource points to metadata describing an author, then the (current) name of the author must be accessible by the searcher at the time of the search. And just as in the case of document metadata, it is likely, and even desired, that some metadata about the author be owned by the author, and distributed only by the express permission of the author.
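The idea that metadata should point to other resources rather than hold strings can be shown with a small Python sketch. The identifiers, the ENTITIES table and the function display_record are hypothetical, chosen only to illustrate the principle: the document record stores a reference to the author, and the author's current name is looked up at the time of the search. Permission checks, which the paragraph above also calls for, are left out for brevity.

```python
# Hypothetical store of entity metadata, owned by the entities themselves.
ENTITIES = {"person:42": {"name": "A. Smith"}}

# The document's metadata points to the author; it does not copy her name.
DOCUMENT = {"title": "A Sample Article", "author": "person:42"}

def display_record(doc, entities):
    """Resolve the author reference at search time and return a copy of
    the record with the author's current name filled in."""
    resolved = dict(doc)
    author = entities.get(doc["author"])
    if author is not None:
        resolved["author"] = author["name"]
    return resolved

print(display_record(DOCUMENT, ENTITIES)["author"])

# The author changes her name; the document metadata is not touched.
ENTITIES["person:42"]["name"] = "A. Jones"
print(display_record(DOCUMENT, ENTITIES)["author"])
```

Had the document record held the string itself, every copy of that record, wherever it had been harvested, would now be out of date.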

In other words, the question of persistence, viewed this way, is going to involve a question of balance. For any resource, whether a document or a person, there will be a certain amount of metadata that ought to be private and a certain amount that ought to be public. Just as numerous public services, such as mail, taxis, fire response and police, cannot function without open access to street addresses, so also numerous internet functions cannot operate without open access to document addresses. But in the same way, access to some information about documents - and in some cases, the documents themselves - must be controlled, just as is the case for information about people.

I argue for consistency. I argue that the same principles underlying the right to share information about documents ought to be applied to information about people. For example, if it is permissible for individual companies to share the names and addresses of people, then it ought in the same way be permissible for people to share the names and addresses of resources. Or for example, if it is not permissible for a third party in the know to share information about the economic health of a company (for example, under insider trading regulations), so also it ought not be permissible for a third party in the know to share information about the health of a person. Just as access to information about a person conveys an advantage to a publisher, access to information about a publication conveys an advantage to a person. These advantages ought to be balanced, to ensure that the one does not enjoy an unfair advantage over the other.

Finally, given the argument above, I would like to touch briefly on the idea of there being a right to an identity. Numerous public and private services require that a person provide a persistent identity before access to services is granted. This identity is manifest in numerous, and not always reliable, modes. For example, the Social Insurance Number creates an identity for a person with respect to access to government services. A driver's license is required to enter public drinking establishments (and to serve on juries). A credit card is required to rent a car or book a hotel room. But none of these is sufficient, not simply because they do not guarantee identity, but because they are not universally distributed. Visitors and children do not have SINs or driver's licenses; people with poor credit do not have credit cards.

In a similar manner, under a system of persistent identification, large bodies of documents may be similarly disadvantaged. The paper discussed concerns about the financial overhead inherent in the Handle system and Digital Object Identifiers. These are genuine concerns, because they limit the capacity to assign an identity to persons or organizations of sufficient means, and they therefore serve as a disincentive to the persistent identification of free or open source content. Just as there ought to be balance between the personal and the corporate, so also ought there to be balance between the commercial and non-commercial. And just as one redress to this imbalance is the guarantee that any person may have an identity, so also must there be a concordant guarantee that any document may have an identity.

It seems to me that while discussion, even within government, of persistent identity has assumed an organizational scope, in order to ensure that the application of persistent identity does not result in a skewed information environment, this discussion must be conducted, especially within government, within a society-wide scope. While the selection of a mechanism for persistent identification may appear to be of implication only within government departments, it will in fact have an impact on the information environment adopted by society as a whole, especially insofar as individuals will by necessity adopt whatever scheme is endorsed by the government in order to access government information. Not only should the government take into account the access needs of individuals, it must take into account the needs of those individuals to create and distribute their own information.

It may be a bit early to say this, but the idea of persistent identification as a public service should not now be beyond consideration, with every citizen granted the right (but not the obligation) to establish a personal identity, company identities and document identities (to name a few) without charge and which will be recognized without hindrance or prejudice by the information exchange network of the future. The deployment of such mechanisms ought additionally to be accompanied by a legislative environment specifying the various rights to use and protect information accessible within this system, with enforcement provisions applying equally to government, industry and private citizens. The development and deployment of a governmental informational system, and in particular, a system of persistent identification of resources, ought not proceed in the absence of these social and informational frameworks.

That said, the establishment of a society-wide persistent identification framework has the inherent advantage of addressing the social factors that need to be addressed before we will see widespread adoption. Granting individuals and corporations membership in such a system and ownership over their own part of it provides the necessary incentive to contribute (as compared, for example, to the incentive provided to an employee cataloguing a document in which they have little or no interest). Providing a permanent record of a person's contributions to society generates in itself an incentive to contribute to that record - this is, indeed, part of the thinking behind the concept of e-portfolios, which has been much-discussed recently. And it fosters in individuals the attitude that a resource, any resource, is documented and processed in a certain way, making the task of document management within a corporate or institutional environment nothing over and above the obligations of daily life.