| Cheesegrater + Portalizer |
Back in 2002, there were still many web sites that did not provide RSS feeds of their updates, or that provided RSS feeds that included only the headlines and not the full articles.
That's much less true today, thankfully.
But back in those bad old days, I wrote this program to generate RSS feeds for certain sites that didn't provide them. It does this by parsing the HTML on various sites, and converting that to RSS. This kind of hack is known as "screen scraping", so I named this program "cheesegrater."
Yes, there is custom code for each site, since every site in the world has its own idiosyncratic HTML layout. Yes, this is fragile, and some day, possibly this afternoon, the maintainers of these sites will change their HTML slightly and I'll have to change my code too. Yes, this all sucks. This is the very problem that RSS is supposed to solve. The right way to fix this is to convince those remaining dinosaur webmasters to provide decent RSS feeds of their own. (Good luck with that.)
| cheesegrater.pl | Given a bunch of URLs on the command line, it downloads the content and converts it to RSS, writing that to the rss/ directory. It only works with the particular URLs that the source knows about. It's a small matter of hacking to make it know about more. Read the code for enlightenment. Let cut-and-paste be your guide. |
| portalizer.pl |
Given a bunch of RSS files on the command line, this creates a single HTML file showing their contents, putting each entry in a pretty little box. This should work on any RSS file, but I wouldn't swear to that.
The first time you run it, it's going to just dump the content of all of the RSS files into the document, one after the other. But on subsequent runs, it will be more clever: it does not add an entry to the HTML file if that entry is present already, and it adds new entries at the top. So, as new entries show up in subsequent versions of the RSS files, they will show up at the top of the HTML file in the order they were added. Items drop off the end of the HTML file when it gets too long (200 entries, by default). |
| cheesecast.pl | This takes an RSS file and translates it into something that Miro (and probably iTunes) will recognize as a "podcast". Basically, it takes the source feed, parses through the HTML in each entry, and generates various <itunes:xxx> and <enclosure> tags. This is useful if a site provides an old-style RSS feed, but you want to use a program like Miro to auto-download the content they link to. |
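The merge behavior described for portalizer.pl above (skip entries already on the page, put new ones at the top, drop the oldest past a cap) can be sketched roughly as follows. This is a Python illustration, not the Perl in portalizer.pl. The function name `merge_entries`, the dict-shaped entries, and the use of the `link` field to decide whether an entry is "present already" are all assumptions made for the example.

```python
def merge_entries(existing, incoming, max_entries=200):
    """Return the updated entry list for the page.

    existing: entries already on the page, newest first.
    incoming: entries read from the RSS files on this run.
    Entries are dicts; treating "link" as the identity of an
    entry is an assumption of this sketch.
    """
    seen = {entry["link"] for entry in existing}
    fresh = []
    for entry in incoming:
        if entry["link"] not in seen:
            seen.add(entry["link"])  # also skips duplicates within one run
            fresh.append(entry)
    # New entries go on top; anything past the cap falls off the end.
    return (fresh + existing)[:max_entries]


# First run: everything is dumped in. Second run: only "c" is new.
page = merge_entries([], [{"link": "a"}, {"link": "b"}])
page = merge_entries(page, [{"link": "b"}, {"link": "c"}])
print([e["link"] for e in page])  # ['c', 'a', 'b']
```

Keeping the page as a plain list, newest first, means the cap is a single slice and no timestamps are needed.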
I did this using very vanilla Perl, using regular expressions all over the place, and without any esoteric CPAN modules, without an XML parser, etc., etc. I did it this way because that was easier. Perhaps the code would be simpler and more correct if I had written it in another way, but I didn't care to take the time to research what APIs were available to do this differently. And the only standard I care about complying with is "does it work, or not?" So, it works for me. If it breaks, feel free to keep both pieces.
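For a feel of what the regular-expression approach looks like, here is a minimal sketch in Python rather than Perl. It is not taken from cheesegrater.pl. The `<h2 class="headline">` markup, the `ITEM_RE` pattern, and the `scrape_to_rss` function are hypothetical, invented for one imaginary site; a real site would need its own pattern, which is exactly the fragility described above.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical markup for one imaginary site; not from cheesegrater.pl.
ITEM_RE = re.compile(
    r'<h2 class="headline">\s*<a href="([^"]+)">(.*?)</a>\s*</h2>',
    re.S,
)


def scrape_to_rss(page_html, site_title, site_url):
    """Build a bare-bones RSS 2.0 document from headline links in page_html."""
    items = []
    for link, title in ITEM_RE.findall(page_html):
        # Drop nested tags, then decode HTML entities so that the
        # XML escaping below does not double-escape them.
        title = html.unescape(re.sub(r"<[^>]+>", "", title)).strip()
        link = html.unescape(link)
        items.append(
            "  <item>\n"
            f"   <title>{escape(title)}</title>\n"
            f"   <link>{escape(link)}</link>\n"
            "  </item>\n"
        )
    return (
        '<?xml version="1.0"?>\n'
        '<rss version="2.0">\n'
        " <channel>\n"
        f"  <title>{escape(site_title)}</title>\n"
        f"  <link>{escape(site_url)}</link>\n"
        f"  <description>{escape(site_title)}</description>\n"
        + "".join(items)
        + " </channel>\n</rss>\n"
    )


sample = '<h2 class="headline"><a href="http://example.com/1">Hello</a></h2>'
print(scrape_to_rss(sample, "Example", "http://example.com/"))
```

The output is assembled by string concatenation with no XML library, in the same spirit as the approach described above; the only concession is escaping `&`, `<`, and `>` so the result stays well-formed.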