Trackforward – Following the Consequences with N’th Order Trackbacks

One of the nice things about blogging within the WordPress ecosystem is the way that trackbacks/pingbacks capture information about posts that link back to your posts, in much the same way that using the link: search limit on a web or blog search engine allows you to see what other webpages are linking back to a particular web page.

In the latter case, for example, searching for link:http://hedebate.jiscinvolve.org/on-line-higher-education-learning/ on Google blogsearch will turn up blog posts that link back to the original HE Debate blog post on On-Line Higher Education Learning.

(Actually, that’s not quite true. In an apparent tweak of the Google blogsearch algorithm last year, the Google blogsearch engine now seems to be indexing and returning results from complete web pages rather than indexing the content of RSS feeds i.e. blog posts – which means that as well as the useful links referred to in the body of a post, links are also indexed from blogrolls, twitter feeds and bookmark lists displayed in blog sidebars, blog comments etc etc. Which in turn is to say that Google blogsearch qua a web search of blog web pages is not much use as a blog search engine at all…)

By judicious linking back to your own blog posts, it’s possible to build up quite complex pathways between related posts that are navigable in two directions: from one post that links to another, previously published post, via an inline link; and “forwards” in time to a later post that has itself linked back to a post of interest and been picked up via a trackback/pingback.

(For examples of these emergent link structures, see Emergent Structure in the Digital Worlds Uncourse Blog Experiment, Uncovering a Little More Digital Worlds Structure and Trackback Graphs and Blog Categories.)

So the question arises – if I write a blog post that several other people link back to, and several further posts in turn link back to those posts that referred back to my post, but not my original post, how do I keep track of the conversation?

Keeping track of posts that cite my post is easy enough – if I have an effective pingback set-up, that will tell me who’s linking back to my posts; or I can simply run link: searches against the URLs of my posts every so often to see who the search engines think are linking back to me.

To follow the second-order links as well, the answer lies in a recursive algorithm of the form:

// getLinksto() stands for any lookup that returns the URLs linking to $url
function showInLinks($url){
  $links=getLinksto($url);
  foreach ($links as $link){
    print $link;
    showInLinks($link);
  }
}

This will then display URLs for the pages that link to an originally specified URL, the URLs of pages that link to those URLs, and so on…
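
As a minimal sketch of the same idea with an explicit exit condition, the PHP below walks a small hard-coded link graph rather than calling a search API. The graph, the example.com URLs, the function name collectInLinks and the signature of getLinksTo are all made up for illustration; a real version would replace getLinksTo with a link: search call. It also remembers which URLs it has already seen, so that posts linking to each other do not send it round in circles.

```php
<?php
// Hypothetical in-memory stand-in for a link: search.
// Keys are target URLs, values are the pages that link to them.
$inlinks = [
    'http://example.com/original' => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/c', 'http://example.com/original'],
    'http://example.com/b' => ['http://example.com/c'],
];

// Assumed helper: returns the pages linking to $url (empty array if none are known).
function getLinksTo(array $graph, string $url): array {
    return $graph[$url] ?? [];
}

// Collect inbound links up to $maxDepth hops away, reporting each URL only once.
function collectInLinks(array $graph, string $url, int $maxDepth, int $depth = 1, array &$seen = []): array {
    $seen[$url] = true;
    $found = [];
    if ($depth > $maxDepth) {
        return $found; // exit condition: too far from the starting URL
    }
    foreach (getLinksTo($graph, $url) as $link) {
        if (isset($seen[$link])) {
            continue; // already reported, or a link back to an ancestor
        }
        $seen[$link] = true;
        $found[] = ['depth' => $depth, 'url' => $link];
        foreach (collectInLinks($graph, $link, $maxDepth, $depth + 1, $seen) as $item) {
            $found[] = $item;
        }
    }
    return $found;
}

// Print the links, indented and numbered by depth.
foreach (collectInLinks($inlinks, 'http://example.com/original', 2) as $item) {
    echo str_repeat('  ', $item['depth']) . $item['depth'] . '. ' . $item['url'] . "\n";
}
```

With this sample graph, page c links to both a and b but is listed only once, under whichever of the two is visited first. Whether duplicates should be suppressed like this, or shown under every parent, is a design choice rather than something the original routine settles.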

So here for example is a quick test:

The items numbered “1.” are links that Google blogsearch thinks link back to the original URL. The items numbered “2.” are links that link to the links that link back to the original URL.

Here’s some minimal PHP code if you want to try it out:

<?php
// Google AJAX blog search endpoint; the query is "link:" plus the target URL.
$urlstub = "http://ajax.googleapis.com/ajax/services/search/blogs?scoring=d&v=1.0&rsz=large&q=link%3A";
$url="http://halfanhour.blogspot.com/2008/11/future-of-online-learning-ten-years-on_16.html";
// Allow the starting URL to be overridden with ?url=...
if (!empty($_GET['url'])) $url=$_GET['url'];
$testurl=$urlstub.urlencode($url);
echo "Starting with: ".htmlspecialchars($url)."<br/>";
echo "via: ".htmlspecialchars($testurl)."<br/><br/>";
$depth=0;

function handlelinks($url, $depth){
	$urlstub = "http://ajax.googleapis.com/ajax/services/search/blogs?v=1.0&rsz=large&q=link%3A";
	$depth++;
	// Exit condition: stop recursing (and querying) beyond two levels of inlinks.
	if ($depth>=3) return;
	$testurl=$urlstub.urlencode($url);
	//echo "testing ".$testurl."  ";
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $testurl);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	$body = curl_exec($ch);
	curl_close($ch);
	// now, process the JSON string
	$json = json_decode($body);
	//var_dump($json); echo "<br/>";
	// Skip failed requests and empty result sets.
	if (!$json || empty($json->responseData->results)) return;
	foreach ($json->responseData->results as $result) {
		// Indent each entry according to its depth.
		for ($i=0;$i<$depth;$i++) echo "&nbsp;&nbsp;";
		echo $depth.". ".$result->title." ";
		echo "<a href='".$result->postUrl."'>".$result->postUrl."</a><br/>";
		handlelinks($result->postUrl, $depth);
	}
}
handlelinks($url, $depth);
?>

By using this sort of algorithm to generate an RSS feed of links, it becomes possible to subscribe to a feed that will keep you updated on all the downstream posts (“blogversation” posts) that are contributing to a discussion that at some point referred to a URL you are interested in.
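
As a rough sketch of the feed idea, the PHP below turns a list of discovered links into a bare-bones RSS 2.0 document. The function name linksToRss, the item fields (depth, title, url), the channel title and description, and the example.com URLs are all assumptions for illustration; a real feed would probably also carry publication dates so that readers can order the items.

```php
<?php
// Build a minimal RSS 2.0 document from a list of discovered links.
// The channel title and description are placeholder assumptions.
function linksToRss(array $items, string $sourceUrl): string {
    // Escape text for safe inclusion in XML element content and attributes.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    };
    $xml  = '<' . '?xml version="1.0" encoding="UTF-8"?' . ">\n";
    $xml .= "<rss version=\"2.0\"><channel>\n";
    $xml .= '<title>Trackforward: ' . $esc($sourceUrl) . "</title>\n";
    $xml .= '<link>' . $esc($sourceUrl) . "</link>\n";
    $xml .= "<description>Posts downstream of the source URL</description>\n";
    foreach ($items as $item) {
        $xml .= "<item>\n";
        // Keep the depth number in the title so the feed shows how far downstream a post is.
        $xml .= '<title>' . $esc($item['depth'] . '. ' . $item['title']) . "</title>\n";
        $xml .= '<link>' . $esc($item['url']) . "</link>\n";
        $xml .= '<guid isPermaLink="true">' . $esc($item['url']) . "</guid>\n";
        $xml .= "</item>\n";
    }
    return $xml . "</channel></rss>\n";
}

// Hypothetical sample data standing in for the output of the recursive search.
$items = [
    ['depth' => 1, 'title' => 'A reply', 'url' => 'http://example.com/a?x=1&y=2'],
    ['depth' => 2, 'title' => 'A reply to the reply', 'url' => 'http://example.com/c'],
];
echo linksToRss($items, 'http://example.com/original');
```

Using each post URL as the item guid means a feed reader should treat a post it has already seen as old news, even if the post turns up again on a later run.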

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

9 thoughts on “Trackforward – Following the Consequences with N’th Order Trackbacks”

  1. That’s very clever, Tony…. I may need one, two more cups of coffee to process. Thoughts and questions:

    * As a recursive function, should there be some exit condition? Of course that assumes some insane popularity of a URL

    * Are there any implications for hitting the google api so many times?

    * What is the significance of the order of said list? Is it by search rank prevalence? Should it be by time? By?…

    * This produces a snapshot; what might be interesting is some way to record the growth over time, even visually, to show how a URL grows, spreads, dies…

    damn clever, thanks

  2. I’m loving your recent posts – they’ve prompted me to explore Pipes properly.

    I’m struggling with relevancy in the blog searches I’m running on Google Blogsearch and aggregating in my pipes. There’s the problem with sites like Blogspot which put tag mentions in the sidebar and hence return spurious results (as per your point above), but also lots of spam blogs which just re-publish random posts from elsewhere.

    A colleague pointed me to IceRocket as a better blog search engine recently, which might help. But have you got any further pointers on improving relevancy of this kind of approach?

  3. @alan

    “As a recursive function, should there be some exit condition? Of course that assumes some insane popularity of a URL”

    Arrgh – yes – i didn’t escape the code properly and WordPress ate a bit (corrected now) – I used a simple trap to limit the depth of the recursion (“if depth < 3”)

    “Are there any implications for hitting the google api so many times?”

    Maybe ;-) – the code was more proof of concept; would be good if this was taken up as a service by a proper blogsearch engine that didn’t index blogrolls etc and just limited itself to indexing feed content… ;-)

    “What is the significance of the order of said list? Is it by search rank prevalence? Should it be by time? By?…”

    the order is just the results from each query from the blogsearch api call (the search query can return top results or most recent results; limited to max 8 results returned). I guess i could postrank, get more than 8 results etc etc? ( https://ouseful.wordpress.com/2008/12/17/getting-lots-of-results-out-of-a-google-custom-search-engine-cse-via-rss/ )

    “This produces a snapshot; what might be interesting is some way to record the growth over time, even visually, to show how a URL grows, spreads, dies…”

    I had more in mind this routine producing a feed so that you would get the latest results; the reader could then aggregate the results over time (inefficient, I know, on the search calls; not sure if there is a search limit that can just find results SINCE a time, which could then be used to collect data according to a cron schedule?)

  4. @steph I’ve found the Google Reader search to be okay – plus it lets you search through the feeds you subscribe to, and the posts you have actually read, in effect providing you with various flavours of custom search engine, but so far I haven’t found an API or a way of subscribing to the results via an external feed.

    Rest assured, I’ll post about any effective blogsearch tools I come across.:-)
