Capturing an Entire Web Site

Posted to WWWDEV 06 August 98

On Thu, 6 Aug 1998, Duan vd Westhuizen wrote: "How can I capture an entire website...?" Jayne K Moore asked: "Why would you want to do this? Aren't you copying someone's work here?"

Of course the debates on whether it is appropriate to copy online resources continue unabated. Duan, of course, wanted to copy his own site. But what if he wanted to copy someone else's? Why would that be so bad?

Several scenarios present themselves:

1. Perhaps he wanted to copy a frequently used site so that he does not have to download it from the internet every time he accesses it. I do that sometimes with search results. It's really silly to regenerate a search each time I want to follow up a new link. I use one of two techniques: a. I open one browser window to the search results and, using 'copy link location' and 'paste', use another browser window to follow up the resource; b. I save the file to my own hard drive, then point my browser to that page on my hard drive. I have in the past found it useful to download entire sites to my hard drive. For example, the online Perl manual is a very useful site. But why tie up their resources on a daily basis when I can simply download it to disk, bookmark the disk file, and access it at my leisure? Much faster for me, more useful for them. All of these are clear cases of copying. But are they wrong?
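Mirroring a site for offline use amounts to saving each page to disk and rewriting its internal links so they point at the local copies rather than back at the server. As a rough sketch of the link-rewriting step (in modern Python rather than the Perl of the era; `local_path` and `rewrite_links` are illustrative names, not part of any tool mentioned here):

```python
import re
from urllib.parse import urlparse

def local_path(url):
    """Map a URL to a filename on the local disk,
    e.g. http://example.com/docs/intro.html -> example.com/docs/intro.html"""
    parts = urlparse(url)
    path = parts.path.lstrip("/") or "index.html"
    return "%s/%s" % (parts.netloc, path)

def rewrite_links(html, site):
    """Rewrite absolute links into this site so they point at the
    local copies instead, leaving off-site links untouched."""
    def repl(match):
        url = match.group(1)
        if url.startswith(site):
            return 'href="%s"' % local_path(url)
        return match.group(0)  # not our site: leave it alone
    return re.sub(r'href="([^"]+)"', repl, html)
```

A real mirroring tool would also fetch each rewritten page recursively, but the rewriting above is the step that turns a pile of saved files into a browsable local copy.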

2. We run networks in rural communities (Windows NT servers (ugh) running Wingate (double ugh)). These networks routinely cache files which have been downloaded from the internet. This way, when a file is used more than once, the subsequent user's browser reads the cached copy from Wingate rather than the original from the internet site. File caching is also happening on a larger scale. Yesterday, I read an article (I think on ZD Net) reporting that cable internet access providers are caching entire frequently used sites in order to speed access for their customers. Here we have a clear case not only of copying, but of software- and ISP-supported copying. Is this wrong?

3. At our college we have about 800 students, all of whom have internet access, but only a modest internet connection. Anything we could do to reduce traffic is worth our while. One solution I have been contemplating is downloading major regional sites and installing them on our college network, and rerouting requests for those sites to our downloaded versions. Is this wrong?

The assumptions - I think (apologies if I am wrong) - in the question were: that someone else's site would be copied, and that this was wrong. Yet above we have three instances of what I think is pretty legitimate site copying.

An additional assumption which may have been made (again, apologies if it wasn't) was that the planned action was to download someone else's site and present it on one's own server. Sounds like a straightforward case of plagiarism. But in all of the cases above, the material was copied, and the material was placed on one's own server.

Moreover, consider the following scenarios:

4. I have developed a Perl program I call the 'viewer'. This program integrates with our online courses and accepts, as input, the URL of a 'collection' file, which consists of an ordered list of URLs and titles. When a user activates the viewer, the first page in the list is displayed and navigation tools are generated to let the user page through the remaining pages. This is a variation on the 'frames' problem, wherein one provider places another's content in frames on the first provider's site. Is this wrong? It seems so, but...
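The original viewer was a Perl CGI program; what follows is only a hypothetical sketch of the same idea in Python. It assumes a collection file with one tab-separated URL and title per line, and renders one entry at a time with prev/next links (the ?page= parameter and the frame embedding are my assumptions, not the actual interface of the program described above):

```python
def parse_collection(text):
    """A 'collection' file: one resource per line,
    URL then title, separated by a tab."""
    pages = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            url, title = line.split("\t", 1)
            pages.append((url, title))
    return pages

def viewer_page(pages, i):
    """Render entry i of the collection: the title, the framed
    remote page, and navigation links to adjacent entries."""
    url, title = pages[i]
    nav = []
    if i > 0:
        nav.append('<a href="?page=%d">&lt; prev</a>' % (i - 1))
    if i < len(pages) - 1:
        nav.append('<a href="?page=%d">next &gt;</a>' % (i + 1))
    return '<p>%s</p>\n<iframe src="%s"></iframe>\n<p>%s</p>' % (
        title, url, " | ".join(nav))
```

Note that the remote pages are never copied to the viewer's server; they are fetched live from the original site and merely wrapped in navigation, which is precisely why this resembles the frames problem rather than outright copying.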

5. Several sites are now compilations of listings from other sites. The best known of these, probably, is the Drudge Report (or in Canada, Bourque). There is very little original content on these sites; what the authors do is check out other sites, capture the headlines, and print the headlines with links to the original article. I also receive an email newsletter which does this with technology news. Are these wrong?

6. My sites have been indexed by search engines across the world wide web. In most cases, I did not ask these agents to index my site, and they did not notify me. Yet when I do a search on my name or relevant topics, there is a link to my site. Some search engines even take the first few words off my site by way of description. Is this wrong?

Here we have some cases wherein the content is not copied, but is in some way used to benefit an external agency. Again, the waters are murky here. In some cases, in some jurisdictions, prosecutions of 'in frames' content have been successful. Headline and URL listings, however, appear to be universally acceptable (though one wonders what would happen were someone to clone Yahoo).

Why raise these instances? At the very least, I would like to make it clear that the debate surrounding use and abuse of online content is far from settled. We hear a lot from lawyers who tell us, usually at the behest of publishers, that it's business as usual online, that the ownership of online content will remain as concentrated as it was before. Good for publishers and even a few authors. Bad for the rest of us.

But online publishing involves new technology, and therefore, new ways of presenting information. A web page is in no way equivalent to a magazine or book page. Our understanding of what can be done with that content must change with the technology. The very concept of 'copying' takes on new meaning. It appears to be possible to (a) copy a site without actually making a reproduction of it, and (b) make a reproduction of a site without actually copying it.

And so far I have touched only on content. When we turn our attention to the area of internet programming, a whole new range of questions emerges. Suppose we see some HTML we like. Can we copy it, and substitute our own content? The first reaction seems to be 'no', but then some absurd consequences result:

1. Would it mean that the first person to use the list structure (<ul><li>item</li></ul>) owns that structure?

2. Or: Project Cool claims (probably validly) to be the first to use moving menus scrolling from the right hand side of the page (see http://www.projectcool.com ). Do they now own this method? But suppose (as I think may be the case) they built this with a built-in feature of Dreamweaver (which does some very sophisticated Javascript and DHTML scripting). Now - do they own the technique?

The examples multiply. And the web developers - like the publishers - claim ownership over these techniques (well, some of them anyways). But it isn't clear that this ownership can be sustained.

It's too easy and unthinking to launch an 'anti-copying' crusade. We need to consider more deeply the potential of the technology and the nature of the information in question. Probably, there are some clear-cut cases of ownership of and prohibitions regarding online materials. But more often, the fact that someone claims ownership or prohibitions does not mean that they are actually entitled to said ownership or prohibitions.



Copyright 2015 Stephen Downes ~ Contact: stephen@downes.ca
Last Updated: Apr 30, 2017 06:53 a.m.