Downes.ca ~ A New Website, Part Seven

A New Website, Part Seven - Converting The Data

Nov 22, 2006
By Stephen Downes

I confess, I am almost not sure how I want to proceed next. My website is a pretty complex website and it doesn't really map easily to what Drupal (or any other CMS) does. This is why I have resisted for all these years simply using a CMS.

Also, I want to make some changes to what I'm doing. First of all, I've decided I need a much faster-loading front page, and one that easily directs people new to my site to the information they may be seeking. Right now someone who visits me for the first time would never find my more important papers, for example.

As well, I want to try to implement some of the changes I've been planning. I want to use a proper mailing list tool, for example, instead of the home-made one I have been using all these years. I want to properly connect different content types. I want a better integration with remote data sources such as Flickr and Slideshare.

I am working, then, roughly according to a plan, but it's pretty loose. Basically, the first thing to do is to get the content off the NRC servers and into Drupal. Then I'll configure Drupal to roughly emulate my existing functionality. Finally, I will add the enhanced functionality.

The Content

OK then, let's take stock of what sort of content I have. This is a bit tricky because my website includes both harvested content from Edu_RSS and generated content such as my articles and posts. Tricky, but not impossible, for as we have seen Drupal also has an RSS aggregator.

I had always kept the two types of content - Edu_RSS links and my own content - separate, but recently I started putting all that into one big table, distinguished only by content 'type' and by the author. I did this so it would be easier to write templates and the like to display the content. I am not under this constraint here, as Drupal will manage that for me (though I will still have to do some custom work for each new type of content). So I will separate the two types of content again. Edu_RSS content will be managed separately from my own website content.

My own content consists mainly of two major content types: 'posts', which are essentially the links I post in OLDaily, and 'articles', which are the longer types of writing I do. I have more than 10,000 of the former, from eight years' worth of collecting, and they are all in the same place (happily). I have about five hundred of the latter, and these are stored both on my own system and in Blogger, where I have been depositing such work recently.

The posts are mostly stand-alone, though some are connected to files and others to resources such as my photo sets. My photo sets were stored on my website, but I have recently been moving them all to Flickr. The posts, additionally, have 'topics'. I do not use either taxonomies or folksonomies to manage my topics; happily, it is a flexible approach, and we can leave it until later.

Complicating matters is that each post is also associated with one or more links. This is an artifact of Edu_RSS, which managed the input of many different people about the same resource by creating a separate identity for each resource, then linking the posts to the resource.

The articles, meanwhile, have their own special features. Any given article may be associated with a file, such as an MS-Word or PDF document. Moreover, an article may also be associated with an event, such as a paper presentation. Articles may also have publication data, and some of them have been published more than once.

This introduces another major type of data, my presentations. I have more than a hundred PowerPoint slide shows, currently stored on SlideShare (but also on my existing website) as well as more than fifty MP3 audio recordings of my talks. They are also associated with events, and sometimes also with articles.

That's the basic set-up, pretty much, and it really isn't anything other people wouldn't have in their own data collections.

The way databases work is that each of these different types of entities is given its own table, where a table consists of a series of records, one for each item. Each record in each table is given its own identity, called a key. A record's own identity is called the 'primary key'. A record may be related to other entities, and when it is, these other entities are identified by their key; when such a key shows up in a record's table, it is called a 'secondary key' (more commonly, a 'foreign key').

Now there are different ways entities can be related to each other:

- one-to-one. This is pretty rare. Traditional marriage works this way; each spouse has one and only one spouse, and the other spouse in turn has one and only one spouse.

- one-to-many. This is pretty common. Mothers and children work this way. Each mother can have many children, but each child can have only one (biological) mother.

- many-to-many. This is very common. Cousins work this way. Each person may have many cousins, and each cousin may also have many cousins.

In the case of one-to-one and one-to-many relationships, keeping track of associations is simple. We simply put a column in the table such that each record has a field containing the key of the associated entity. Nothing to it.
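As a minimal sketch of that key column, here is a one-to-many relationship built with Python's standard sqlite3 module and an in-memory database. The table and column names (author, post, author_id) and the sample rows are made up for illustration; they are not the schema described in this post.

```python
import sqlite3

# In-memory database; all table and column names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (
        author_id INTEGER PRIMARY KEY,   -- the record's own identity
        name      TEXT NOT NULL
    );
    CREATE TABLE post (
        post_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER REFERENCES author(author_id)  -- key of the related record
    );
""")

# One author, many posts: each post row carries the author's key.
conn.execute("INSERT INTO author (author_id, name) VALUES (1, 'Example Author')")
conn.executemany(
    "INSERT INTO post (title, author_id) VALUES (?, ?)",
    [("First link", 1), ("Second link", 1)],
)


def titles_by_author(connection, author_id):
    """Return the titles of every post that carries the given author's key."""
    rows = connection.execute(
        "SELECT title FROM post WHERE author_id = ? ORDER BY post_id",
        (author_id,),
    )
    return [title for (title,) in rows]


print(titles_by_author(conn, 1))
```

The key column sits on the 'many' side (the post), so adding another post for the same author never requires touching the author table.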

In the case of many-to-many, however, we have to construct a separate table of data where the keys of each of the associated entities are paired. This table is sometimes called a 'lookup' table. Since a lot of my data is many-to-many, I will need to plan for that.
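A lookup table of this kind might be sketched as follows, again with Python's sqlite3 module and an in-memory database. The post, topic and post_topic tables and their contents are hypothetical examples, not the actual tables on this site.

```python
import sqlite3

# In-memory database; table names and sample data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post  (post_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE topic (topic_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    -- The lookup table holds nothing but pairs of keys.
    CREATE TABLE post_topic (
        post_id  INTEGER NOT NULL REFERENCES post(post_id),
        topic_id INTEGER NOT NULL REFERENCES topic(topic_id),
        PRIMARY KEY (post_id, topic_id)
    );
""")
conn.executemany("INSERT INTO post VALUES (?, ?)", [(1, "Post A"), (2, "Post B")])
conn.executemany("INSERT INTO topic VALUES (?, ?)", [(1, "RSS"), (2, "Drupal")])
# Post A has two topics; the 'Drupal' topic has two posts.
conn.executemany("INSERT INTO post_topic VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])


def topics_for_post(connection, post_id):
    """Follow the lookup table from one post to all of its topics."""
    rows = connection.execute(
        """SELECT topic.name FROM topic
           JOIN post_topic ON post_topic.topic_id = topic.topic_id
           WHERE post_topic.post_id = ?
           ORDER BY topic.name""",
        (post_id,),
    )
    return [name for (name,) in rows]


def posts_for_topic(connection, topic_id):
    """Follow the same lookup table in the other direction."""
    rows = connection.execute(
        """SELECT post.title FROM post
           JOIN post_topic ON post_topic.post_id = post.post_id
           WHERE post_topic.topic_id = ?
           ORDER BY post.title""",
        (topic_id,),
    )
    return [title for (title,) in rows]


print(topics_for_post(conn, 1))
print(posts_for_topic(conn, 2))
```

Because the pairs live in their own table, neither the post table nor the topic table needs a column for the other, and the relationship can be read from either side.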

In addition, I have attached to all of this my discussion board system, which over the years has collected hundreds of comments on things. This isn't as major a thing as NewsTrolls, which has thousands and thousands of posts. Ah, but one site at a time.

As much as possible, I would like to map my content types into existing Drupal tables. This means I don't have to create extra tables and associations and all that. It also means I can draw from the examples of Drupal designers when I do have to create some custom content.

What will help is that in addition to the standard modules, which I covered in the previous installment, Drupal has hundreds of additional modules. Of course, many of these were designed for Drupal version 4, and won't work in the Drupal 5.0 version I am testing. But many will, and more will each day.

OK then.

The Transfer

What I want to move first are the posts. They are the simplest type of content to move, and also form the heart of the services on my website.

Here's what the post records look like on my website.

Field	Type	Null	Default	Extra
post_id	int(15)			auto_increment
post_type	varchar(32)	Yes	NULL
post_pretext	text	Yes	NULL
post_title	varchar(255)	Yes	NULL
post_link	varchar(255)	Yes	NULL
post_linkid	int(15)	Yes	NULL
post_author	varchar(255)	Yes	NULL
post_journal	varchar(255)	Yes	NULL
post_authorid	int(15)	Yes	NULL
post_journalid	int(15)	Yes	NULL
post_description	text	Yes	NULL
post_quote	text	Yes	NULL
post_content	longtext	Yes	NULL
post_replies	int(15)	Yes
post_key	int(15)	Yes	NULL
post_hits	int(12)	Yes	NULL
post_thread	int(15)	Yes	NULL
post_dir	varchar(32)	Yes	NULL
post_crdate	varchar(36)	Yes	NULL
post_creator	varchar(36)	Yes	NULL
post_crip	varchar(24)	Yes	NULL
post_pub	varchar(10)	Yes	NULL

Leaving off the post_ prefix: id is the primary key. type indicates the type of post (in my system, articles, comments, posts, and any other type of content I upload are all posts, each with its own type; a post, for example, is a post of type 'link'). pretext is an extra content field (it's text that can be placed before the title of the item). title is the title. link is the actual URL of the item, while linkid is the secondary key from the Links table. author and journal are the names of the author and the journal for the link I am discussing, while authorid and journalid are the secondary keys from those tables, respectively. key is legacy, from my previous database; hits is the number of hits; thread is unused (it used to track comment threads); dir specifies where to put files associated with the post; crdate is the date it was created, in unix time; creator is the secondary key of the record creator; crip is the IP address from which it was created (used to track spammers); and pub is the publication date.

What I want to do is find or create, in Drupal, a type of content that most closely matches this. Drupal has by default the 'page' and 'story' content types. I need to first ask myself whether either of these will work for me. 'Page', probably not (and I will want to use it for website pages). What about 'story'?

What followed at this point was a couple of hours' worth of investigation into how Drupal stores its data. As this site observes, "An important concept in Drupal is that all content is stored as a node. They are the basic building blocks for the system, and provide a foundation from which content stored in Drupal can be extended. Creating new node modules allows developers to define and store additional fields in the database that are specific to your site's needs. Nodes are classified according to a type. Each type of node can be manipulated and rendered differently based on its use case."

Taking a look at the Drupal database itself, we can see that the content is stored in two separate tables. One table is just a list of all the contents. The other table is the actual content itself. Doing it this way keeps one of the tables really short, so you can do things like print lists of the titles or display the teasers. Also, by keeping the body of the item separate from the listings, you can have versions of the same item, which opens up all sorts of possibilities.
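To make the two-table idea concrete, here is a rough sketch using Python's sqlite3 module. It uses cut-down versions of Drupal's node and node_revisions tables, keeping only a handful of columns, and the sample rows are invented. The join on vid is an assumption about how the current version is looked up, not something taken from Drupal's code.

```python
import sqlite3

# Simplified stand-ins for Drupal's node and node_revisions tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        nid   INTEGER PRIMARY KEY,
        vid   INTEGER NOT NULL,      -- points at the current version
        type  TEXT NOT NULL,
        title TEXT NOT NULL
    );
    CREATE TABLE node_revisions (
        vid    INTEGER PRIMARY KEY,
        nid    INTEGER NOT NULL,
        title  TEXT NOT NULL,
        body   TEXT NOT NULL,
        teaser TEXT NOT NULL
    );
""")

# One node with two versions; the node row points at the second one.
conn.execute("INSERT INTO node VALUES (1, 2, 'story', 'Hello')")
conn.executemany(
    "INSERT INTO node_revisions VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "Hello", "First draft of the body.", "First draft"),
        (2, 1, "Hello", "Revised body.", "Revised"),
    ],
)


def list_titles(connection):
    """List titles from the short table only; no bodies are read."""
    return [t for (t,) in connection.execute("SELECT title FROM node ORDER BY nid")]


def current_body(connection, nid):
    """Fetch the body of whichever version the node currently points at."""
    row = connection.execute(
        """SELECT r.body FROM node n
           JOIN node_revisions r ON r.vid = n.vid
           WHERE n.nid = ?""",
        (nid,),
    ).fetchone()
    return row[0] if row else None


print(list_titles(conn))
print(current_body(conn, 1))
```

The first draft stays in node_revisions even though the listing never sees it, which is the versioning possibility mentioned above.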

OK then, so each one of my posts will be a node. I will create a new node type, called 'post'. Then I will attempt to populate these two tables:

node

Field	Type	Collation	Attributes	Null	Default	Extra
nid	int(10)		UNSIGNED	No		auto_increment
vid	int(10)		UNSIGNED	No	0
type	varchar(32)	utf8_general_ci		No
title	varchar(128)	utf8_general_ci		No
uid	int(11)			No	0
status	int(11)			No	1
created	int(11)			No	0
changed	int(11)			No	0
comment	int(11)			No	0
promote	int(11)			No	0
moderate	int(11)			No	0
sticky	int(11)			No	0

nid is the node id (i.e., the primary key) and will increment automatically. vid is the current version, and for us, will always be the same as the node id. type is the node type, in plain text (and not the key from the types table). title is the title of the item. status indicates whether it is published ('1') or not. created and changed are times when these happened, and they are in unix time (yay!), which is the number of seconds since the standard epoch began on January 1, 1970 (GMT). Here's a time converter. comment, promote, moderate and sticky are status flags, and we'll use the defaults.
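For anyone without a converter handy, unix time can be converted in a few lines with Python's standard datetime module. This is only a small sketch; the function names are made up, and the sample date is simply the date of this post at midnight GMT.

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Turn a count of seconds since January 1, 1970 (GMT) into a datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def utc_to_unix(moment):
    """Turn a timezone-aware datetime back into whole seconds since the epoch."""
    return int(moment.timestamp())


# Sample value: midnight GMT on the date of this post.
published = datetime(2006, 11, 22, tzinfo=timezone.utc)
stamp = utc_to_unix(published)
print(stamp, unix_to_utc(stamp).isoformat())
```

Pinning everything to UTC avoids the local-timezone surprises that come with naive datetime values.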

node_revisions

Field	Type	Collation	Attributes	Null	Default	Extra
nid	int(10)		UNSIGNED	No
vid	int(10)		UNSIGNED	No
uid	int(11)			No	0
title	varchar(128)	utf8_general_ci		No
body	longtext	utf8_general_ci		No
teaser	longtext	utf8_general_ci		No
log	longtext	utf8_general_ci		No
timestamp	int(11)			No	0
format	int(11)			No	0

nid and vid as above. uid is the identity of the owner of the node - which in our case will always be me, user number 1. title is the title and a repeat of what we saw before. body is the body, and as a 'long text' item can be very long. teaser is like an abstract or summary. Don't know what log is, but it's empty on all my test content. timestamp is self-explanatory. Don't know what format is, but 'story' used '1' and the poll used '0'.

So what I need to do is create a mapping from my table to Drupal's. Some of the fields I just won't copy over - the legacy database key, for example, and the thread. But others are bits of data for which there is no Drupal equivalent. Pretext, for example.
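One way such a mapping might look is sketched below in Python, working on plain dictionaries rather than a live database. Several choices here are assumptions made only for illustration: reusing post_id as the node id, putting post_content in body and post_description in teaser, using format 1 because that is what 'story' used, and cutting titles to Drupal's 128 characters. The function name map_post and the sample record are hypothetical.

```python
# Legacy fields that are not copied over (the database key and the thread).
DROPPED = {"post_key", "post_thread"}
# Fields this sketch maps onto Drupal columns (an assumed mapping).
MAPPED = {"post_id", "post_title", "post_description", "post_content", "post_crdate"}


def map_post(post, uid=1):
    """Split one post record into node, node_revisions and leftover fields."""
    nid = int(post["post_id"])                   # assumption: reuse the old id
    title = (post.get("post_title") or "")[:128]  # Drupal's title is varchar(128)
    created = int(post.get("post_crdate") or 0)  # crdate holds unix time as text

    node = {
        "nid": nid, "vid": nid, "type": "post", "title": title,
        "uid": uid, "status": 1, "created": created, "changed": created,
    }
    revision = {
        "nid": nid, "vid": nid, "uid": uid, "title": title,
        "body": post.get("post_content") or "",        # assumed mapping
        "teaser": post.get("post_description") or "",  # assumed mapping
        "log": "", "timestamp": created, "format": 1,
    }
    # Whatever is neither mapped nor dropped has no Drupal home yet.
    leftover = {k: v for k, v in post.items() if k not in MAPPED and k not in DROPPED}
    return node, revision, leftover


sample = {
    "post_id": 42, "post_type": "link", "post_pretext": "Video:",
    "post_title": "An example title", "post_link": "http://example.com/item",
    "post_description": "Short summary.", "post_content": "Full text.",
    "post_key": 7, "post_thread": None, "post_crdate": "1164153600",
}
node, revision, leftover = map_post(sample)
print(node)
print(revision)
print(leftover)
```

The leftover dictionary is the interesting part: it collects the fields, such as pretext and the link, that still need somewhere to live.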

But - importantly - one of the big differences between my system and Drupal (and every other blogging system) is that posts in my system are about something, and I store that essential data - the title, author, publisher and link - as separate data items in my database. This is important, because it allows me to connect my work to other people's, but it also allows me to index the content based on the author or publisher, as I do on my resources page. I have always wondered why other systems don't do this - when people talk about something else they just put the link into the text, in an almost random fashion.

Well. Time for lunch and a walk to think about all this.

After Lunch

Well, it is now after lunch, two hours later, in fact, and I can report a most frustrating afternoon.

While I was walking up to the Tim Horton's I decided that basically what I was looking at was something like microformats. After all, if my post is about something, then it is essentially a review. And you can create reviews in microformats.

You see, what I had been thinking originally is that I would have to add extra fields to the database, so that I could add the extra information. I also considered just adding another node-type database, to hold this info. But what creating the record as a microformat would do is actually put the structured data into the 'body' field in 'node_revisions'.
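As a rough sketch of what that could mean in practice, the Python function below builds an HTML fragment that could be stored in the 'body' field. The class names hreview, item, fn, url and description follow the hReview microformat draft as best I understand it; item-author and item-journal are invented for this example and are not part of any microformat. The function name and sample values are likewise hypothetical.

```python
from html import escape


def render_review(title, link, author, journal, commentary):
    """Build a review-style HTML fragment carrying the structured data."""
    return (
        '<div class="hreview">'
        '<div class="item">'
        # The thing being discussed: its title and URL.
        f'<a class="fn url" href="{escape(link, quote=True)}">{escape(title)}</a>'
        # Non-standard class names, used here only for illustration.
        f' by <span class="item-author">{escape(author)}</span>'
        f', <span class="item-journal">{escape(journal)}</span>'
        '</div>'
        # The commentary on the item.
        f'<div class="description">{escape(commentary)}</div>'
        '</div>'
    )


body = render_review(
    title="An example article",
    link="http://example.com/a?x=1&y=2",
    author="Example Author",
    journal="Example Journal",
    commentary="Worth a read, <b>mostly</b>.",
)
print(body)
```

The structured data travels inside the body text itself, so no extra columns are needed, at the cost of having to parse the markup to get the author or publisher back out.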

So when I got back to the office I decided to see whether there was anything on Drupal and Microformats. And ran smack into Drupal's brain-dead documentation again, finding this, a project in 'alpha' that consists of nothing but a place-holder (hey Spaghetti, an 'alpha' is still supposed to be something that works). And as far as I can tell, the plan is to implement it using some sort of pseudo-code. No, not good at all.

Some more searching revealed some more discussion, including more posts from Digital Spaghetti but also some from gusaus talking about structured blogging. Well, that would work too - after all, what I do with my website, when I fill in the form identifying the author and the publisher and all that, is structured blogging. There's an implementation, from GoingOn - but no, the site is down for maintenance (has been all day). Then an absolutely useless and misleading page that seems to be about table-less layout (folks - don't say your page is about one thing when it's really about another thing, ok?). Still searching - found another discussion on structured blogging - same participants, same references, different discussion. Some good outline on what structured blogging is, and how it compared to microformats, but no new information about Drupal.

I then hit on this post from D'Arcy Norman. He writes, "Support for custom formats and authoring templates is baked into the DNA of Drupal. Even for non-coders, anyone can make up new formats (and templates) on the fly using the flexinode module. And several other formats are already available as prepackaged modules (events, reviews, etc...)." I had looked at the flexinode module, gagged at the completely useless (but oh so typical) documentation, and decided to pass on it. Maybe now it was worth a revisit, even though it contained no installation instructions.

Installation for modules in Drupal seems straightforward. There's a 'modules' subdirectory in the Drupal installation. Create a subdirectory under the modules directory, and stuff your module code into it. A module release will consist of a few files (this one contained about seven files). The files and the directory have the same root name, and the types of files are indicated by the extension. So, say, if the module is called 'devel' then the directory is 'drupal/modules/devel' and the files might be 'devel.module' and 'devel.install' and 'devel.css'. Etc.
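The naming convention can be spelled out in a few lines of Python. This sketch only mirrors the layout described above; the function name is made up, the list of extensions is just the three examples mentioned, and it says nothing about which files a given module actually ships or what Drupal needs in order to list it.

```python
from pathlib import PurePosixPath


def module_paths(drupal_root, name, extensions=("module", "install", "css")):
    """Return the file paths the naming convention implies for a module."""
    # The directory and the files share the module's root name.
    module_dir = PurePosixPath(drupal_root) / "modules" / name
    return [str(module_dir / f"{name}.{ext}") for ext in extensions]


for path in module_paths("drupal", "devel"):
    print(path)
```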

So anyhow, I create a 'flexinode' module and stuff the files in, and then go to the 'Modules' administration page in order to enable the module, just like I enabled all those other modules. I look - but it's not there.

Hm. OK. What's happening then? The 'flexinode' module I'm working with was designed for Drupal 4.7 and so isn't really intended for Drupal 5.0. And when I look in my NewsTrolls Drupal installation, it doesn't even resemble the 5.0 installation - all the '.module' files are in the 'modules' directory, and there aren't any subdirectories (that puzzles me, but I set it aside).

OK, are there any modules that are ready for Drupal 5.0? When I search, I get this page, which isn't a list of modules, but rather, some discussion about modules. Some utterly useless advice from harrisben (if you aren't going to describe something fully, or provide a link, just don't comment, OK? Saying 'check the downloads section' without any indication of what you're looking for or where it is is really useless - and frustrating).

Anyhow, there is a list of Drupal 5.0 modules (but you won't, by the way, find it in the downloads section). OK, good, I'll install a 5.0 module, something that's firmly developed for 5.0, and see if that works. Then maybe I can fix flexinode (since it doesn't look like the author has looked at it since 2004). devel is a perfect candidate. "Fully ported and the DRUPAL-5 branch exists." No installation instructions once again (do they think we set up these sites by ESP?) but I follow the standard procedure. Then, over to the 'Modules' admin page to enable it. And... nothing. devel is nowhere to be seen.

So there's some magical procedure here that appears to be undocumented - at least, in a full day of searching about, I haven't encountered it. It's the end of the day, I'll transfer a few more photos to Flickr, read some email, and think about it overnight.