got a favorite screen scraper?

There are more of you reading this now, so I guess I might as well ask again... What's your favorite HTML scraper? I'm feeling the urge to write one, but of course, I'd rather not. It has already been done, again and again, and I'd rather use one of those, but it's not clear whether any of the existing ones actually, you know, work. Any experience with them?

Basically, there are a bunch of sites that I would read if they were on my friends list, but that I generally don't check very often otherwise. Most of them don't have RSS feeds. Rather than waiting for RSS to take over the world, I probably should just hack something up to parse HTML to RSS, with a buttload of special cases for each of the sites I'm interested in. But I keep hoping someone has done it (properly) already. Because I really don't want to.
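
For what it's worth, the "parse HTML to RSS with a special case per site" hack could look roughly like the Python sketch below. Everything in it is made up for illustration: the `SITE_RULES` table, the `example.com` entry, the `headline` class, and the function names are assumptions, not any existing scraper's interface. It uses only the standard library and treats each matching link on a page as one RSS item.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET

# Hypothetical special cases: hostname -> CSS class that marks an item link.
SITE_RULES = {
    "example.com": "headline",
}

class LinkScraper(HTMLParser):
    """Collect (title, absolute URL) for every <a> carrying a given class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.items = []
        self._href = None   # set while we are inside a matching <a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._href = urljoin(self.base_url, attrs.get("href") or "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None

def html_to_rss(html, base_url, feed_title):
    """Turn one page of HTML into a minimal RSS 2.0 document (as a string)."""
    css_class = SITE_RULES[urlparse(base_url).hostname]
    scraper = LinkScraper(base_url, css_class)
    scraper.feed(html)
    scraper.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = "Scraped from " + base_url
    for title, href in scraper.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")
```

Fetching the page (say, with `urllib.request`) and handling sites whose markup has no convenient class to key on are left out, and that second part is presumably where the buttload of special cases would actually pile up.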

6 Responses:

  1. hepkitten says:

    I really like Url Watcher, which jorm wrote. It emails me updates, which I think rules.

  2. bassfingers says:

    I started one that scoured Yahoo's news site a few years ago, and another that grabbed their weather feed for my area. (Actually, it grabbed the NOAA data directly from the government and a pretty map graphic from Yahoo...) But neither is in any shape to pass along...

  3. thelonious says:

    Before you go scraping any sites yourself, see if doesn't already do it for you.

  4. insomnia says:

    If you find a *good* scraper, you'll have to share, because I've yet to be truly impressed. You might find these links helpful, however.

    <lj user="markpasc"> has created the RSS scraper Stapler, and <lj user="deus_x"> mentioned creating a scraper as well, though I don't think he ever released it. I suspect they are the LJ users out there who would be most helpful as far as scrapers go.

    You might be interested in <lj user="syn_promo">... it was originally created for sharing and promoting new LJ RSS channels, but it's also where we share RSS-related tips on scraping, where to find new feeds, creating improved RSS feeds for free and paid LJ accounts, etc.

  5. waider says:

    I wrote snorq to generate a page I could feed to AvantGo, so I get a bunch of cartoons and stuff onto my PalmPilot. It's godawful, configured with Perl hashes, and uses regexps to clean the crap out of the result. On the plus side, it smashes up images into Palm-sized chunks, and it can burrow into a website using the aforementioned regexps. You can see the output at "Compilation Page for waider".