more winnage from CDDB

In case you didn't know, the file format that CDDB (and FreeDB) use is complete garbage. In addition to random idiotic crap like it being impossible to unambiguously represent a song title that has a slash in it, it's rocket science to figure out how long a song is supposed to be. I need this info not only to display it in Gronk, but also for some error-checking that my ripping scripts do, so that I don't end up with truncated files if there was a crash or a full disk or something.

So get this. CDDB files contain junk like this:

    # Track frame offsets:
    #       150
    #       18265
    #       32945
    #       49812
    ...
    # Disc length: 3603 seconds

(You'd think that the fact that it's in a comment would mean something, but no: you have to parse both comments and non-comments, begging the question of what they thought "comment" means.)

Those numbers are the starting sectors of each track on the disc. There are 75 sectors per second. So you convert those to seconds by dividing, and then find the length of each track by subtracting its start from the start of the track after it. Oh, but wait, they don't give you the sector address of the end of the last track: for that one, it's expressed in seconds, for no sensible reason. Still, the info is there, right?
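Here's a rough sketch of that arithmetic in Python. The function names are made up for illustration, the sample disc length of 900 seconds is invented, and it assumes the "Disc length" figure is measured from the same origin as the frame offsets. It also ignores the data-track wrinkle entirely.

```python
import re

FRAMES_PER_SECOND = 75  # CD sectors (frames) per second

# Hypothetical four-track disc; the 900-second length is invented.
SAMPLE = """\
# Track frame offsets:
#       150
#       18265
#       32945
#       49812
#
# Disc length: 900 seconds
"""


def parse_offsets_and_length(cddb_text):
    """Pull the frame offsets and disc length out of the '#' comment lines."""
    offsets = []
    disc_seconds = None
    in_offsets = False
    for line in cddb_text.splitlines():
        line = line.strip()
        if re.match(r"#\s*Track frame offsets:", line):
            in_offsets = True
            continue
        m = re.match(r"#\s*Disc length:\s*(\d+)", line)
        if m:
            disc_seconds = int(m.group(1))
            in_offsets = False
            continue
        if in_offsets:
            m = re.match(r"#\s*(\d+)\s*$", line)
            if m:
                offsets.append(int(m.group(1)))
            elif line not in ("#", ""):
                # Any other comment ends the offset list.
                in_offsets = False
    return offsets, disc_seconds


def naive_track_seconds(offsets, disc_seconds):
    """Length of each track in seconds, ignoring any data-track gap."""
    # The end of the last track is only given in seconds, so convert it
    # to frames to put it on the same footing as the offsets.
    ends = offsets[1:] + [disc_seconds * FRAMES_PER_SECOND]
    return [(end - start) / FRAMES_PER_SECOND
            for start, end in zip(offsets, ends)]
```

Calling `naive_track_seconds(*parse_offsets_and_length(SAMPLE))` gives one float per track; round however your display wants.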

Uh, almost.

It turns out that if the last track on a CD is a data track (an ISO9660 file system) then there is a gap between the last track (the data track) and the second-to-last track (the last audio track.) This gap is exactly 11400 sectors (152 seconds, 2:32.) On some discs, you can actually see this track, it's a differently-shiny ring. Why's it there? I don't know. Why's it that size? I don't know. What if the data track is not the last track on the CD? (Does that even work?) I don't know.

So what this means is, when computing the length that a track should be, you have to subtract 152 seconds from the length of the second-to-last track, only if the last track is a data track.

How do you tell whether the last track is a data track? By hoping that the CDDB file contains the words "data track" in the title of that track, I guess. Yeah, that's reliable.
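A minimal sketch of that correction, with the same caveat the paragraph above makes: the only signal is the title text, so this is a heuristic and nothing more. The helper names are hypothetical, and it assumes repeated `TTITLEn` lines for the same track are meant to be concatenated.

```python
import re

GAP_SECONDS = 152  # 11400 sectors / 75 sectors per second


def track_titles(cddb_text):
    """Collect TTITLEn values in track order, joining repeated keys."""
    titles = {}
    for line in cddb_text.splitlines():
        m = re.match(r"TTITLE(\d+)=(.*)$", line)
        if m:
            n = int(m.group(1))
            titles[n] = titles.get(n, "") + m.group(2)
    return [titles[n] for n in sorted(titles)]


def last_track_is_data(titles):
    """Heuristic only: hope the last title contains 'data track'."""
    return bool(titles) and "data track" in titles[-1].lower()


def expected_seconds(naive_lengths, titles):
    """Take the 152-second gap off the last audio track if a data track follows."""
    lengths = list(naive_lengths)
    if len(lengths) >= 2 and last_track_is_data(titles):
        lengths[-2] -= GAP_SECONDS
    return lengths
```

If the submitter called the last track "Enhanced CD" or "bonus stuff" instead, this silently does the wrong thing, which is rather the point.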

And, just to keep things interesting, it turns out that older versions of grip and cdparanoia didn't skip over this gap when ripping: instead, they would append 152 seconds of silence onto the end of the second-to-last track. So now my script that sanity-checks the lengths of the files has to consider two different lengths to be "right", since I now have CDs that were ripped both ways.
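The "two right answers" check might look something like this. It is a sketch, not the actual script: the function name and the two-second `tolerance` are assumptions picked for illustration.

```python
GAP_SECONDS = 152  # silence that older rips appended to the last audio track


def length_is_plausible(actual, expected, gap_follows=False, tolerance=2.0):
    """True if a ripped file's duration matches one of the accepted lengths.

    actual      -- measured duration of the ripped file, in seconds
    expected    -- length computed from the CDDB file, gap already removed
    gap_follows -- True for the last audio track before a data track
    """
    candidates = [expected]
    if gap_follows:
        # Rips made with the older tools include the gap as trailing silence.
        candidates.append(expected + GAP_SECONDS)
    return any(abs(actual - c) <= tolerance for c in candidates)
```

Anything that matches neither candidate is still flagged as truncated, so the crash and full-disk cases keep getting caught.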

Whee. Love love love supporting standards invented by 12-year-olds.

Of course the reason that I use CDDB files at all in Gronk is because of the mind-blowing worthlessness of ID3 tags (30-character limits on titles, etc.) Yay more standards invented by 12-year-olds. (Please don't even mention ID3v2 or Ogg. I laugh at you, you silly person.)

38 Responses:

  1. ralesk says:

    (Please don't even mention ID3v2 or Ogg. I laugh at you, you silly person.)

       Now I'm really interested in your opinion about those.

    • jlindquist says:

      Ogg is easy.

      "Not worth much 'till Apple supports it in the iPod."

      Which I'd honestly like to see, since it expands choice and allows people to completely swear off the Evil Fraunhofer Patent Nazis.

      • ralesk says:

           Well, that still is not a good reason to throw OGG out the window for desktop purposes.  I'm highly content with the quality the format provides --- for your ears, as opposed to MP3's way of giving similarity to the original for an oscilloscope (which barely matters for us humans, does it? :}).

           I do indeed want to see a portable OGG player.

  2. jerronimo says:

    I used to have a CD (hell if i can remember which one it is) that has the data track shoved at negative time from track 1. (much like the hidden track in TMBG's John Henry CD.)

    I seem to also remember another one in which the data track was completely invisible to the cd player I had at the time. Perhaps it was a session added on later? I don't remember...

    I always figured there were some sort of flags in the Disc's header that state whether parts are data or not.. hrm...

    • deviant_ says:

      The real trick is that the table of contents isn't actually required to bear any similarity to what's on the disk, and you can figure out all of the track info from subchannel data instead. The TOC is really just an optimization to find track start points.

      Hence, all the stupid copy protection games involving sticking shit in the TOC and watching ripping software screw up.

  3. pexor says:

    I made the mistake of reading the IRC protocols.

    I used to think there were a lot of developers out there who were a lot better than me. *laugh*

    • I once read the protocol documentation because I was interested in writing an IRC client. Afterwards, I was more interested in starting a new server network based on something else entirely.

  4. ivorjawa says:

    I've been writing, on and off, for the last two years or so, a small python script that only does two things:

    If it sees a track named NN_Artist_-_Trackname.mp3 where NN is the track number, and the MP3 has no ID3v2 tag, it will write the artist, track name, and track number to the tag. Conversely, if the track has artist, track name, and track number, but the filename is weird, it will normalize the filename into the above format. This way, I could make both of the players that I regularly use -- iTunes and XMMS -- happy.

    At first I was going to write it in straight python, to make it portable. (I fucking hate perl, it completely offends my sensibilities, I avoid it wherever possible. And besides, I was learning python, so it was a good toy problem to learn the language.)
    All of the python id3 libraries were a complete joke. Some asshole would code for a couple of days, put the nonfunctional mess on Sourceforge, and of course get linked from id3v2.org or whatever the site that pushes this crap is. Then never finish the fucker.

    Second attempt was in Jython, which is python spitting out Java bytecode. That way I could use a Java library, and still have the script be portable. (It has to run on both OS X and Linux.) No go there, either. I forget what the problem was, but it was stupid and meaningless and made me want to crush the responsible moron's nuts in a garlic press.

    Third attempt, most recently. Write it in standard, portable C++, because that's what the official version of id3lib is allegedly written in. HA. This attempt happened at 30,000 feet, between Boston and Salt Lake. I found that the allegedly portable library uses weird nonstandard I18N functions, which don't exist in OS X. I ponder breaking the glass of the nearest window with my TiBook and making myself a gruesome splatter in some farmer's corn field in Nebraska. Then I realize that, hey, I just paid Delta $100 to upgrade to first class, and booze is free. Problem solved.

    Roughly every six months, I bang on this script for a couple hours, get completely frustrated, and resign myself to looking at song titles like "MenWithoutHats-PopGoesTheWorld" in my iTunes window for the next several months.

    I've programmed Motif. I've programmed in win32. I've even programmed in AppleScript, and although that nearly wore my fingers down to bloody stubs, it still wasn't as fucking frustrating as attempting to deal with id3 tags.
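For reference, the filename half of the script described at the top of this comment needs no tag library at all. A minimal sketch, assuming underscores in the filename stand for spaces (the comment doesn't say) and using made-up function names:

```python
import re

# NN_Artist_-_Trackname.mp3, where NN is the track number.
FILENAME_RE = re.compile(r"^(\d+)_(.+?)_-_(.+)\.mp3$", re.IGNORECASE)


def parse_filename(name):
    """Return (track, artist, title), or None if the name is 'weird'."""
    m = FILENAME_RE.match(name)
    if not m:
        return None
    track, artist, title = m.groups()
    # Assumption: underscores were substituted for spaces.
    return int(track), artist.replace("_", " "), title.replace("_", " ")


def normalized_filename(track, artist, title):
    """Build the NN_Artist_-_Trackname.mp3 form from tag values."""
    def squash(s):
        return re.sub(r"\s+", "_", s.strip())
    return "%02d_%s_-_%s.mp3" % (track, squash(artist), squash(title))
```

Reading and writing the tags themselves is the part that hurts, and is deliberately left out here.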

    • ciphergoth says:

      Do you want skins with that?

      If only the sorts of people who write compilers and HTTP servers wrote MP3 players and CD rippers...

      • dingodonkey says:

        ...or vice versa... *shivers*

      • I've had reasonably good experiences with Paranoia and EAC. And MAD, mpg123, and all the Ogg tools.

        Note that compilers are extremely picky about their input, and HTTP servers don't really have any, beyond a config file. Whereas the whole point of an MP3 decoder, or a CD ripper, is to create approximate reproductions of random downloaded files produced by unknown encoders, or scratched discs accessed via fickle drive electronics. Just try writing an FTP client which has to parse listings someday, it will drive you nuts.

        Skins *are* just inexcusable, though :)

    • unabomber says:

      Those airplane windows are a lot thicker than you think, probably too thick to take out during flight with a good sledgehammer, let alone your TiBook. You'd have better luck with a pre-battery recall PB 5300 down in the cargo bay by the center fuel tank. I think the possible spontaneous battery fire might get you some TWA Flight 800 action if you're lucky.

      -FC

      • bzztbomb says:

        The windows are thick, but have you ever thought about how thin the metal wall part is? You could pick axe through that part really easy...

    • confuseme says:

      Write it in standard, portable C++

      Ha ha ha ha, ha. Ha, ha. Ha. ...how can you mention the words "standard" and "C++" in the same sentence like that and keep a straight face?

      • ivorjawa says:

        "Portable" means "portable to POSIX platforms that support g++", right? RIGHT???

        *sob*

        One of the last things I did at Akamai before The Great Layoffs was work on a program that was written on linux, and immediately ported to win32, for deployment on both platforms.

        It is ... stunning ... how standards-noncompliant Visual C++ is. The standard was adopted in 1998, for fuck's sake. You'd think that by October, 2002, Microsoft could deign to support the fucking language.

        • nerpdawg says:

          It's on purpose, though. Microsoft uses their compiler for all their products, so if they changed it to be standards compliant..

          *waves hands grandly and cackles*

        • In all fairness, the standard is very demanding. A full implementation is a huge task, and I believe G++ has a ways to go itself.

          The special on-by-default "scope variables defined in for loops wrong so that MFC headers work" option is aggravating, though.

          • jwz says:

            It's not easy to blame someone for implementing C++ "incorrectly", given what a mess the spec is. Which is why it is easy to blame someone for using that language at all.

            • "C++: an octopus made by nailing extra legs onto a dog" -- Steve Taylor, 1998

              On the other hand, I like being able to do systems programming without writing my own hash table every time I need one.

    • jwz says:

      See, I had a premonition of the pain you describe, which is why I decided right off the bat to not ever try to read ID3. What I do is, when I rip a CD, I keep the CDDB discid around (so that I can find its CDDB file) and my scripts parse the info out of the CDDB file, and unconditionally overwrite the ID3 tags in the MP3 files. That way, the players get reasonable (yet truncated) ID3 tags, but Gronk displays the full names from CDDB.

      The big weakness with Gronk is that it really wants you to have ripped the CDs yourself, so that you have the CDDB file and discid. If you have random MP3 files of questionable vintage, you have to find a CDDB file (by searching the FreeDB site) and then make sure the file names match those titles.

    • billemon says:

      erm ... use one of the programs that comes with id3lib, and a shell script? it works for me :) apart from song titles with a / in of course ...

  5. grahams says:

    I believe the data standard allows for a data track to come either before or after the audio session... AFAIK, you will never see audio, data, then some more audio, though.

    You don't often see the data track as the first track because it used to trip up older standalone CD players..

    • greyhame says:

      I remember when that did happen sometimes, though: some CDs which had a data track before the audio had labels warning that track 1 should be skipped when playing the disc in a stereo system, because it would just produce awful white noise (which is the same reason you don't see audio, then data, then more audio).

  6. Hello. I noticed you on my Friends list.

    Who are you? How did you find my LJ?

    xoxoxoxoxoxoxoxoxooxoxoxoxoxo

  7. darwinx0r says:

    ok, cddb has an excuse.. it was created by amateurs and then marketed.

    wtf is freedb's excuse? I remember when they started the project, I contacted them and said something like "is your new standard going to allow for various-artists albums?" because, as you mention, cddb is very very bad at this (increasingly common) case. I got a mail back that said "oh, if you'd like to participate in the process join this mailing list" .. I quickly became convinced that their intent was to clone cddb, warts and all. Needless to say, I didn't join the list..

    Imo this century is when meta-information becomes more important than the original information..

    =darwin

    • jwz says:

      Oh yes, FreeDB has taken the CDDB braindeadness and layered even more braindeadness on top of it, it's a thing of wonder.

      Like, go ahead and try to ever have the "genre" field be something approaching reality -- oops! The first person who ripped this CD said it was "folk" because that's number zero! So fix it and resubmit? Hah! That's the one field you can't ever edit after creation, since it determines which directory the file goes in on their server.

      But hey, at least it saves me a lot of typing when I rip CDs. It's getting pretty rare for me to find CDs that someone has not already entered.

    • billemon says:

      They do support "various artists" ... you have to enable the "use freedb extensions" box in various apps that use it :-D

      • darwinx0r says:

        That's great, but :

        #1) I can't find any documentation on these extensions, on freedb.org or through google.. although one or two pages mention that they do, in fact, exist.

        #2) why clone something, create a completely fucked up database full of crap and then "extend" it to "fix" the breakage? ie - these "extensions" may make various artists work in a sane way, but the FAQ still says "submit with 'artist/title' for various artists" ... so even if these extensions _do_ work, most of the various artists albums don't care.

        =darwin

        • inoshiro says:

          #1 -- yeah, I also tried to find docs for Gnome APIs when I did a bunch of development for it. Except for mostly empty, auto-generated docs and the odd tutorial on some effect I could get with Gnome canvas, there was no such thing. The lack of documentation in most open source projects is enough that I've stopped contributing to them in general. I've reduced my computer involvement (through quitting my programming job and such) because of how frustrating it is to tread water, rather than use computers as real tools.

          #2 -- you could ask the same thing about any number of systems where compatibility was the first priority. They always add extra extensions around the base system, and you have to take the time to detect and use them without also breaking the "standard" systems. And, yes, most are so half-assed extended, it's not a solution so much as additional loc that may or may not work.

        • jwz says:

          What extensions? I don't think you need any. Here's what one looks like: cd11ef0f. Normal albums have "Artist / Title" in DTITLE; multi-artist albums have "Artist / Title" in TTITLEn.

          Grip lets you edit multi-artist discs and then writes them with two fields, TTITLEn and TARTISTn, which CDDB and FreeDB don't understand. But that's a Grip problem, really. (Grip just made that second system up, as far as I can tell.)
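The "Artist / Title" convention above can be split like this. It's a sketch with a hypothetical helper name, and it assumes the separator is a slash with a space on each side and that the first one wins, which is exactly where a title containing a slash becomes ambiguous.

```python
def split_artist_title(value, default_artist=None):
    """Split a DTITLE or multi-artist TTITLEn value on the first ' / '.

    Returns (artist, title). With no separator, the whole value is the
    title and default_artist (e.g. the disc's artist) is used instead.
    """
    artist, sep, title = value.partition(" / ")
    if not sep:
        return default_artist, value.strip()
    return artist.strip(), title.strip()
```

For a value like `"AC / DC / Back in Black"` this returns `("AC", "DC / Back in Black")`; nothing in the file says which slash was the real separator.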

          • waider says:

            Well, as best I can tell much of Grip's CDDB support was written by empirical observation of CDDB files, resulting in some errant code here and there. I use a poorly-written emacs mode to rearrange the CDDB files to my own liking, and use a modified version of Grip which allows you to save and reload CDDB files willy-nilly.

            Oh, and I'm responsible for the initial gluing of id3lib to Grip. I'm so very, very sorry. It seemed like a good idea at the time.