Scraping ugly HTML using ‘regular expressions’

Paul Bradshaw, Online Journalism Blog, Nov 08, 2012
Commentary by Stephen Downes

A lot of the magic that I work behind the scenes in this newsletter and the MOOCs we run is based on regular expressions. This is a two-part post (part one, part two) providing an overview. Ignore the references to 'OutWit Hub' - regular expressions work everywhere, not just in the one system (well, ok, not everywhere, but anywhere you're working with sufficiently powerful programming languages). Basically, regular expressions are pattern matchers - they are code that defines a type of pattern, which can then be used to match strings, to extract parts of strings, or to change strings. Why would this be useful? Well, suppose you have a huge pile of data, like, say, every blog post published today. Regular expressions can be used to zero in on those posts that talk about a certain thing, or class of things. They're also really useful for categorization - instead of using tags, which are labour intensive, I simply define a topic 'tag' as a shorthand for a regular expression.
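
To make the three uses (match, extract, change) and the topic 'tag' idea concrete, here is a minimal sketch in Python. It is an illustration only, not the code behind this newsletter: the tag names, the patterns, the `categorize` helper and the sample posts are all made up for the example.

```python
import re

# Hypothetical topic "tags": each tag is just shorthand for a regular expression.
TOPIC_PATTERNS = {
    "mooc": re.compile(r"\bMOOCs?\b", re.IGNORECASE),
    "open-access": re.compile(r"\bopen[\s-]access\b", re.IGNORECASE),
}


def categorize(text):
    """Return the sorted list of tags whose pattern matches somewhere in the text."""
    return sorted(
        tag for tag, pattern in TOPIC_PATTERNS.items() if pattern.search(text)
    )


# Made-up sample posts standing in for "every blog post published today".
posts = [
    "Our new MOOC starts on 2012-11-08.",
    "Why open access matters for MOOCs",
    "Weekend recipe: lentil soup",
]

# Match: zero in on the posts that talk about a given topic.
for post in posts:
    print(categorize(post), post)

# Extract: pull year, month and day out of an ISO-style date.
DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
dates = DATE.findall(posts[0])

# Change: collapse runs of whitespace into single spaces.
cleaned = re.sub(r"\s+", " ", "Scraping   ugly\n HTML").strip()
```

Adding a new category here means adding one line to `TOPIC_PATTERNS`, with no need to go back and tag individual posts by hand.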
