Building a web-crawler for fun and ... the hell of it

So I’m reading Toby Segaran’s Programming Collective Intelligence at the moment. It’s about things like search-engines, recommendation-engines, filtering/sorting results, data-mining and Web 2.0 social interactions generally. This is interesting in (at least!) three ways:

The examples are in Python, which is new to me, and outside my comfort-zone (I’m writing this as I wait for Xcode to come down, so I can get gcc with which to build pysqlite - apparently).
It’s making me dust off my maths/stats from way back, which is painful but probably ultimately healthy.
C’mon; teaching computers how to be smart? It just is, ok? Figuring out why people like things is both interesting and incredibly sellable; developing insight via some automated process is clearly a hugely useful tool!

I’ll be interviewing at a place that cites “experience building web-crawlers” on its job-spec, so I thought I’d have a go, while I’m recuperating at home. Chapter 4 covers this. It turns out that it’s possible to create a web-crawler that will go off from a starting point that you give it, pull down that page, parse it (BeautifulSoup appears to provide jQuery-selection-engine-like functionality), pull out the links, and start following them to subsequent pages in … ~30 lines of code (leveraging two libraries). This is … cool. I think the same kind of thing in C# would have needed … rather more work! I wonder whether I can get IronPython to play…?
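
For the curious, the fetch/parse/follow loop described above looks roughly like the sketch below. This is not the book's listing: to keep it runnable without installing anything, it swaps BeautifulSoup for the standard library's html.parser, and the names LinkParser, fetch and crawl (plus the max_pages and fetch_page parameters) are made up for illustration. It also assumes pages decode as UTF-8 and ignores niceties like robots.txt and rate-limiting, which a real crawler shouldn't.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its text; assumes UTF-8 content."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl from start_url; returns the URLs visited, in order."""
    seen = {start_url}            # every URL ever queued, to avoid loops
    queue = deque([start_url])    # URLs still waiting to be fetched
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page, drop #fragments
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling something like crawl("https://example.com/", max_pages=5) should return the list of pages it managed to fetch. The fetch_page parameter is only there so the downloader can be replaced with a fake one for testing without touching the network.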

Anyhow. Xcode is down now, and gcc with it, so I’m off back to play… I’ll review the book once I’ve finished it.