Downes.ca ~ Stephen's Web ~ Scraping ugly HTML using ‘regular expressions’

Stephen Downes

Knowledge, Learning, Community

A lot of the magic that I work behind the scenes in this newsletter and the MOOCs we run is based on regular expressions. This is a two-part post (part one, part two) providing an overview. Ignore the references to 'OutWit Hub' - regular expressions work everywhere, not just in the one system (well, ok, not everywhere, but anywhere you're working with a sufficiently powerful programming language). Basically, regular expressions are pattern matchers - they are code that defines patterns which can be matched against strings, used to extract parts of strings, or used to change strings. Why would this be useful? Well, suppose you have a huge pile of data, like, say, every blog post published today. Regular expressions can be used to zero in on those posts that talk about a certain thing, or class of things. They're also really useful for categorization - instead of using tags, which are labour intensive, I simply define a topic 'tag' as a shorthand for a regular expression.
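
As a rough illustration of the three uses described above (matching, extracting, changing) and of a topic 'tag' as shorthand for a pattern, here is a minimal Python sketch. The topic names, the patterns and the sample text are made up for the example and are not the ones actually used in the newsletter; the link pattern is deliberately crude and assumes double-quoted href attributes.

```python
import re

# Hypothetical topic definitions: each topic name is shorthand for a pattern.
TOPICS = {
    "mooc": re.compile(r"\bMOOCs?\b", re.IGNORECASE),
    "regex": re.compile(r"\bregular expressions?\b|\bregex(?:es)?\b", re.IGNORECASE),
}


def categorize(text):
    """Match: return the sorted topic names whose pattern occurs in the text."""
    return sorted(name for name, pattern in TOPICS.items() if pattern.search(text))


def extract_links(html):
    """Extract: pull href values out of anchor tags (a rough sketch, not an HTML parser)."""
    return re.findall(r'<a\s[^>]*href="([^"]+)"', html, flags=re.IGNORECASE)


def collapse_whitespace(text):
    """Change: replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    post = 'Scraping with   regular expressions in a <a class="x" href="https://example.com/a">MOOC</a>'
    print(categorize(post))
    print(extract_links(post))
    print(collapse_whitespace(post))
```

Adding a new category here means adding one entry to the dictionary, which is the sense in which a pattern can stand in for hand-applied tags.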



Stephen Downes, Casselman, Canada
stephen@downes.ca

Copyright 2024
Last Updated: Apr 19, 2024 01:00 a.m.

Creative Commons License.
