hack hack

So yesterday I finally got around to writing a script that digs album cover images out of an online store by feeding it search strings and then parsing the resulting pages. I did this so that Gronk, my MP3 jukebox, could display album covers as well as track listings. The script seemed to work pretty well at first, but I ended up spending hours doing by hand the ones that it almost found. I pulled the images from Amazon (even though I don't shop there, because I don't approve of software patents): they sell used CDs as well, and so were likely to have more images.
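For the curious, the grovel-the-HTML step can be sketched in Python roughly as below. This is an illustration, not the script I actually wrote: the sample page, the image paths, and the names `ImageCollector` and `largest_image` are all made up, and a real results page is a good deal messier. It collects every `<img>` tag and keeps the one with the largest declared area.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (src, width, height) for every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if not src:
            return
        # Missing or non-numeric dimensions are treated as 0.
        try:
            width = int(attr.get("width") or 0)
            height = int(attr.get("height") or 0)
        except ValueError:
            width, height = 0, 0
        self.images.append((src, width, height))


def largest_image(html):
    """Return the src of the image with the largest declared area, or None."""
    parser = ImageCollector()
    parser.feed(html)
    if not parser.images:
        return None
    src, _, _ = max(parser.images, key=lambda img: img[1] * img[2])
    return src


# Hypothetical search-results page; the paths are invented for illustration.
SAMPLE_PAGE = """
<html><body>
  <img src="/images/spacer.gif" width="1" height="1">
  <img src="/images/cover-small.jpg" width="130" height="130">
  <img src="/images/cover-large.jpg" width="300" height="300">
</body></html>
"""

print(largest_image(SAMPLE_PAGE))
```

Picking by declared size is only a heuristic; it is exactly the kind of thing that "almost finds" the right image when a page has a big banner on it.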

I only ended up getting images for about half of the CDs I own. The compilation albums seem spottiest (they're hard to search for, and apparently don't stay in print long). Also, some of the images are 300x300, but most are 130x130, which is pretty illegible.

There must be a better source of these images out there somewhere....

In retrospect, wow, what a colossal waste of time that was.

11 Responses:

  1. loic says:

    I have a script I wrote that tries to grab images out of the HTML and does a /fairly/ good job. Its heuristics only sometimes fail, but now that Amazon has exposed both a SOAP interface and a POST+XML interface, it looks a lot easier to get images out that way. I haven't had a chance to hack on it yet, though.

    Which interface did you use?

    • jwz says:

      It didn't even occur to me that they might have exposed more reasonable interfaces; I just did HTTP POSTs to the search form on the main page and grovelled the HTML.

      • loic says:

        They have an API exposed through SOAP and a simple POST+XML interface. I couldn't get the SOAP interface working in Python because the SOAP library doesn't support WSDL interface definitions. The POST-based interface looks pretty trivial.

  2. waider says:

    You could try allmusic.com; you could then do even more pointless hacking by parsing the "similar/related music" section and displaying whatever bits of it you have as suggested playlist items.

  3. alsoravi says:

    After all the MP3 patenting chaos running around, are you still using mp3s instead of, say, Oggs or what?

    Just curious ...

    • jwz says:

      I'm still using MP3 because

      • there's no way in hell I'm going to re-rip ~2500 CDs;
      • transcoding from MP3 to Ogg is reputedly really bad news;
      • there is still no way to convert the DNA webcasts to Ogg because there is still no working version of Icecast for it.

      (I'm sure someone is going to try and argue that last point with me. They will base their argument on what they've heard rather than what they've tried, and they will lose.)

  4. darwinx0r says:

    I've spent a good bit of time trying to get Amazon to give me useful info in this regard. The easiest way I've found is ASIN-based, although I'm not sure if there is any publicly accessible database of ASIN information. For example:

    --- cut ---

    # album title, etc.
    Appetite for Destruction [EXPLICIT LYRICS]
    Guns N' Roses

    # ASIN
    Geffen Records; ASIN: B000000OQF

    # full url for the page

    # url for the large cover image

    # url for the thumbnail-sized cover image

    --- cut ---

    You may very well have already done something like this. If so, please excuse me. :) I haven't, however, found any great solution for getting the data that I expect in order to get the ASIN. Perhaps the "MARC" format that allmusic.com uses would be useful in some way? Please keep us (ljers) updated on your progress, because I'd love to see what you come up with.. :)
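    A minimal Python sketch of pulling the ASIN out of a line like the one quoted above. It assumes an ASIN is ten uppercase letters or digits following "ASIN:", and the names `ASIN_RE` and `find_asin` are made up for illustration:

```python
import re

# Assumption: an ASIN is ten uppercase letters or digits after "ASIN:".
ASIN_RE = re.compile(r"ASIN:\s*([A-Z0-9]{10})\b")


def find_asin(text):
    """Return the first ASIN found in text, or None if there is none."""
    match = ASIN_RE.search(text)
    return match.group(1) if match else None


# The product line quoted in the comment above.
line = "Geffen Records; ASIN: B000000OQF"
print(find_asin(line))
```

    This only covers the easy half; finding the right product page to read the ASIN from is still the hard part.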

    media d0mination