when the database worms eat into your brain

Do any of you know of a command-line / scriptable way to extract urls and associated last-access-times from the Mozilla history.dat file?

It's a "Mork" file and I can't make any sense of crazyman's documentation, such as it is. I asked on IRC #mozilla, and all I got was some random guy writing a term paper on the computer industry who wanted to interview me about the "good old days."

(Just grepping it for URLs isn't enough, because I also want the date associated with loading them, and the stuff in that file appears to be in hash order or something. There seems to be a maze of half a dozen overlapping numerical namespaces, all alike.)

Update: I've got something that almost works now, but it's really slow: mork.pl.

57 Responses:

  1. loosechanj says:

    You're posting this on lj. That must mean you're really desperate.

  2. I don't know MORK but a quick look around and it seems like an easy text-based file format. Maybe you could just use Perl and use a regexp construct?

    • jwz says:

      WOW, I NEVER WOULD HAVE TRIED THAT! I always wave the "LJ peanut gallery KICK ME" sign earlier rather than later.

      Just because it's text doesn't mean it makes any damned sense. If it's clear to you from your "quick look around", please to be enlightening me.

  3. avva says:

    "zany mongrel Mork terse syntax"?? they oughta shoot this guy.

    I even tried to look at the source, gave up in disgust after a few minutes.

    How about this: the items all seem to be keyed by hex numbers. Immediately after the item with the URL, there follows an item with the last access date, stored as time() plus a bunch of other numbers (microseconds, probably).

    Here, look at what my file gives for the URL of this entry of yours:


    (1E534
    =http://www.livejournal.com/users/jwz/312657.html?style=mine)(1E535
    =1078320627077811)(1E536
    =j$00w$00z$00:$00 $00w$00h$00e$00n$00 $00t$00h$00e$00 $00d$00a$00t$00a$00b\
    $00a$00s$00e$00 $00w$00o$00r$00m$00s$00 $00e$00a$00t$00 $00i$00n$00t$00o$00 $00\\
    y$00o$00u$00r$00 $00b$00r$00a$00i$00n$00)>

    Ignoring whitespace, treat it as ($num=$value). The value for 1E535 is the last access date (compare to current time() to see how many decimal values at the end to throw out - 6 it seems). The value for 1E536 seems to be the title of the page, encoded in some weird shitty scheme.

    Usually the number of the item with the last access date is one more than the number of the item with the URL (all stored in hex), but it's not true for many old URLs in my file; it seems that rather these numbers are consecutive ids stored by the system, so that for example when you visit an old URL, its entry in the file is updated by storing a new access time with a new id, and leaving the same URL with the same id. Maybe that's what the whitespace is for, to allow doing it in-place? *shrug*
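
    A minimal sketch of that ($num=$value) reading, in Python for brevity. The regex and the names parse_pairs and to_datetime are illustrative assumptions, not Mork's real grammar: it assumes no value contains an unescaped ")" and that the timestamp is microseconds since the epoch.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: "(" hex id, optional whitespace, "=" value ")".
# Assumes a value never contains an unescaped ")"; Mork escapes are not handled.
PAIR_RE = re.compile(r"\(([0-9A-Fa-f]+)\s*=([^)]*)\)")

def parse_pairs(text):
    """Return {hex id: value} for every (id=value) cell found in text."""
    return {key.upper(): value for key, value in PAIR_RE.findall(text)}

def to_datetime(raw):
    """Treat raw as microseconds since the Unix epoch (throw out 6 digits)."""
    seconds = int(raw) // 1_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

sample = """(1E534
=http://www.livejournal.com/users/jwz/312657.html?style=mine)(1E535
=1078320627077811)"""

pairs = parse_pairs(sample)
print(pairs["1E534"])
print(to_datetime(pairs["1E535"]).isoformat())
```

    Looking values up by id like this only helps if the ids are unique within the file, which this sketch does not check.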

    • avva says:

      The value for 1E536 seems to be the title of the page, encoded in some weird shitty scheme.

      In fact, it's just the title stored with a gratuitous "$00" stored after every character. What were they thinking?? No, wait, I probably don't wanna know.
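
      If the title is 16-bit little-endian text with each zero byte written as $00, it can be undone roughly like this. A sketch only: decode_mork_title is a made-up name, and it assumes every literal character is a single ASCII byte and that a backslash right before a newline is a line continuation.

```python
import re

def decode_mork_title(raw):
    """Decode a value whose bytes are UTF-16LE, with some bytes written as $XX."""
    # Assumption: backslash + newline is a line continuation and carries no data.
    raw = re.sub(r"\\\r?\n", "", raw)
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == "$" and re.fullmatch(r"[0-9A-Fa-f]{2}", raw[i + 1:i + 3]):
            out.append(int(raw[i + 1:i + 3], 16))  # $XX is one escaped byte
            i += 3
        else:
            out.append(ord(raw[i]))  # literal character, assumed to be one ASCII byte
            i += 1
    return out.decode("utf-16-le")

sample = "j$00w$00z$00:$00 $00w$00h$00e$00n$00 $00t$00h$00e$00 $00d$00a$00t$00a$00b\\\n$00a$00s$00e$00"
print(decode_mork_title(sample))
```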

      • jwz says:

        Oh, that's just Unicrud.

      • kehoea says:

        Little endian stupid-ASCII-safe Unicode encoding? Oooh, innovative.

      • jwz says:

        In case the true comedy of this part isn't apparent: he's clearly going to all this trouble to hyper-compress this file, what with the lookup tables for every identifier, right? And then, when writing out strings, he doesn't write them as UTF-8: he writes them as raw wchar_t strings. If he was writing them as raw bytes, that would multiply his file size by either 2 or 4 (depending on how big wchar_t is) -- but wait! He doesn't write them raw, he writes them A) hex encoded and B) with a $ before each hex byte! So even on 16-bit wchar_t systems, he's tripled the file size.

        But hey, at least it's a "real" "database" and written in an "object oriented" language.
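
        The arithmetic can be checked against the page title from the sample upthread. A rough sketch, assuming 16-bit little-endian wchar_t and ignoring line continuations. "Fully escaped" is one reading of "a $ before each hex byte"; the sample itself keeps the ASCII byte literal and escapes only the zero byte, so both forms are measured. All variable names are made up.

```python
title = "jwz: when the database worms eat into your brain"

utf8_size = len(title.encode("utf-8"))
utf16_size = len(title.encode("utf-16-le"))  # raw 16-bit wchar_t, no byte-order mark

# Every raw byte written as "$XX": three characters per byte.
fully_escaped = "".join(f"${b:02X}" for b in title.encode("utf-16-le"))

# Form seen in the sample: ASCII byte kept literal, zero byte written as "$00".
sample_style = "".join(ch + "$00" for ch in title)

print("UTF-8:", utf8_size)
print("raw UTF-16:", utf16_size)
print("every byte as $XX:", len(fully_escaped))
print("sample style:", len(sample_style))
```

        For pure-ASCII titles that works out to 2x for raw 16-bit text, 3x on top of that if every byte is escaped, and 4x over UTF-8 in the form the sample actually shows.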

        • tlkh says:

          >If he was writing them as raw bytes,
          >that would multiply his file size by either 2 or 4

          Only for Western languages.
          And, unlike UTF-8, hexadecimal encoding allows using only 7-bit characters.

    • jwz says:

      See, I tried that: as far as I can tell, those key/value pairs are a dump of a hash table (I think they're interned strings or something.) The numbers don't line up; they don't appear to be in order. Also, the numbering scheme changes at least twice in different sections of the file.

      It's a complete clusterfuck.

      Incidentally, the guy who wrote this is the one I was trying to help here. It didn't work.

      • avva says:

        When I tried to unravel the idiotic look-ma-I-wrote-a-DB thing that Mork seems to be, I did get to the place where it writes the history file out (in db/mork/morkWriter.cpp, I got there after following three huge unnecessary intermediate classes), and it seems to go row after row, in whatever it means by "row". The order of rows seems to be random/hashy, and the key numbers keep changing. But the fact that the URL is *always* followed by time says that the time relates to that URL, and the hash order is on a row basis, not on a key/value basis. I *think* that if you ignore key numbers, grep for http:// URLs, and treat the value of the next pair as the last-access time, you're good. You could test it by visiting some URLs already stored there, exiting and seeing if the number changed properly.
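
        The heuristic in this comment can be sketched as follows, assuming cells look like (id=value), that no value contains ")", and that the time is a run of digits in microseconds. url_times and the regex are hypothetical and not taken from mork.pl.

```python
import re

# Hypothetical cell pattern: "(" hex id, optional whitespace, "=" value ")".
CELL_RE = re.compile(r"\(([0-9A-Fa-f]+)\s*=([^)]*)\)")

def url_times(text):
    """Yield (url, unix_seconds) for each URL cell directly followed by an all-digit cell."""
    values = [value.strip() for _key, value in CELL_RE.findall(text)]
    for value, following in zip(values, values[1:]):
        if value.startswith("http://") and following.isdigit():
            # Key numbers are ignored on purpose; only file order matters here.
            yield value, int(following) // 1_000_000

sample = (
    "(1E534\n=http://www.livejournal.com/users/jwz/312657.html?style=mine)"
    "(1E535\n=1078320627077811)"
)
for url, seconds in url_times(sample):
    print(seconds, url)
```

        This trusts file order completely, so it is only as good as the claim that the time cell always follows its URL.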

        • 9000 says:

          I wonder why a real, well-documented database was not used in the first place (any DBM would be better than this!)?

          And then, why not throw this away at the next release and just provide a tool (say, an extension) to look up and/or convert your old history.dat to the new format? 99% of people would not mind throwing their browsing history away on upgrade, anyway.

          I do know that somebody just has to undertake the task, but are there any issues but this obvious one?

  4. avva says:

    Hmm, yeah, I think I get the gist of it now.

    There's an index somewhere in the file (not necessarily at the beginning or end, it seems). It looks like this:


    [42(^82^14A)(^84^1DA93)(^85^14B)(^88^86)(^87^132)(^86=514)(^8A=1)]
    [BE(^82^29D)(^84^1D8C4)(^85^29E)(^88^86)(^87^132)(^86=234)]
    [C1(^82^2A2)(^84^1D7DB)(^85^2A3)(^88^86)(^87^132)(^86=138)]

    etc.

    The first number is a key. Then you get a bunch of stuff either of the form (^$a^$b) or (^$a=$b). Here $a is always the number of the field stored; its meaning is clear from the meta-row at the very beginning of the file (thus in my file at least, 82 is URL, 84 is last visit date, 85 is first visit date, 86 is visit count, etc.). If you have the form (^$a=$b), the value is in-place; thus ^86=138 means 138 visits to this URL. If you have (^$a^$b), then $b is the key under which the value is stored elsewhere in the file; thus ^82^14A means the URL is stored under key 14A, and I have (14A=http://....) somewhere in the file with the URL.

    It still seems to be true that actual rows are always written together, first the URL, then the last access, then other stuff. Sometimes the index has the same value for two keys, for instance, when first visit=last visit date, your index entry will have (^84^12345)(^85^12345), and the time will be stored under (12345=....) elsewhere in the file, for both occasions, but it'll still follow the URL for the same row. To be on the really safe side, read in the whole file, parse all the (key=value) stuff, and parse the index to match it all together, but I think just going by the rule "(key=URL)(key=last access date)" and ignoring the keys will give you the correct information.
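
    The safe-side approach could be sketched like this. The assumptions are many: the three regexes are guesses at the syntax shown above, the column ids are the ones reported for this one file, the field names are invented labels, and the (key=value) cells in the sample are made up for illustration (only the row itself is from the index above).

```python
import re

CELL_RE = re.compile(r"\(([0-9A-Fa-f]+)\s*=([^)]*)\)")        # (key=value)
ROW_RE = re.compile(r"\[([0-9A-Fa-f]+)((?:\(\^[^)]*\))*)\]")  # [rowkey(...)(...)]
FIELD_RE = re.compile(r"\(\^([0-9A-Fa-f]+)([\^=])([^)]*)\)")  # (^col^ref) or (^col=literal)

# Column ids as reported for one file; other files may number them differently.
COLUMNS = {"82": "url", "84": "last_visit", "85": "first_visit", "86": "visit_count"}

def resolve_rows(text):
    """Return {row key: {field: value}}, following ^ref fields into the cell table."""
    values = {key.upper(): value for key, value in CELL_RE.findall(text)}
    rows = {}
    for row_key, body in ROW_RE.findall(text):
        row = {}
        for col, kind, payload in FIELD_RE.findall(body):
            name = COLUMNS.get(col.upper(), col)
            # "^" means payload is a key to look up; "=" means payload is the value.
            row[name] = values.get(payload.upper()) if kind == "^" else payload
        rows[row_key] = row
    return rows

sample = (
    "(14A=http://example.com/)(1DA93=1078320627077811)(14B=1078000000000000)\n"
    "[42(^82^14A)(^84^1DA93)(^85^14B)(^88^86)(^87^132)(^86=514)(^8A=1)]"
)
print(resolve_rows(sample)["42"])
```

    References that cannot be resolved come back as None rather than raising, since the sample does not include every cell the row points at.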

    • I have nothing to add, I just want to say that this is the most hilariously stupid file format I've seen since I had to deal with the "Universal Computer Protocol" which is used to drive pager services.

    • gen_witt says:

      I'm going to ask the obvious question that everyone is thinking. What the hell is this for? I mean, what conceivable reason would you have for changing the field identifiers? The only thing that could make this worse is a healthy dose of XML to compound the rest of the braindamage.

      • coldacid says:

        Be quiet before any of the Mork coders hear you. They might get ideas...

      • rin3y says:

        At least if it were XML, you'd have some vague hope of being able to read it without having to decipher some freakish moonman patois. Of course, the file would be 300 gigs and take a week to parse, but that's beside the point.

        --riney

        • gen_witt says:

          Please, look at this guy's idea of a human readable text file. You can add just as much garbage and retardation into the xml format, in fact you can add way more, because that's the w3c way of doing things.

          • rin3y says:

            Point taken. I'd love to see what this guy could do with (to) xml. It could very well cause a singularity of suckitude that would collapse the universe.

            --riney

            • Actually, he actively compares this abomination to XML and finds favor on the side of his format. Seriously.

              It's hard for me to accept, but XML actually would be less painful. The only thing absurd compression like this MIGHT have going for it is that it kept a logical row on a single line of a text file... but he doesn't do that. He keeps it between ()s, which may or may not be adjoined with text for the previous, current, or next row.

    • That's noise on your modem line.

      Please hang up and try again.

    • jwz says:

      Almost documentation...

      The line noise at the bottom are columns and rows! Because the incessantly whispering database brain-worms insist that all the world's a spreadsheet!

      • naturalborn says:

        The guy who wrote that almost documentation is still around. You could probably ask him if you have any more questions. (Yes, it's the same person, despite the name change.)

        • jwz says:

          Yes, I know, and if I never speak to him again (and by him I mean any of his personalities) that will be fine.

          • bitpuddle says:

            Yikes. Looking at his resume, he jumped from Taligent, to Apple's OpenDoc team, to Netscape.

            Fear him. He is the destroyer of projects.

            • enochsmiles says:

              MDB abstraction under Mozilla, replaced the 4.x family third party database with a persistent storage abstraction named MDB, to isolate mail/news and address books from database details.

              mork db wrote a simple in-memory text based database named Mork which satisfies the MDB interface, and then helped integrate usage for mail/news summary files, address book stores, and browser history databases. Chose a Mork text format to be flexible and expressive as XML, but better tuned to MDB usage for concise text markup.

            • reddragdiva says:

              (five years later comment)

              After that he ended up at Chandler. If you ever wondered why that never came out.

              Did he moonlight on Vista?

          • cyeh says:

            I was wondering who the heck this person was that could inspire such an immediate upwelling of hate and bile. After finding out who it was, I can see why.

        • Oh, I get it now. He's a Discordian, and this file format must be an attempt at sowing discord.

        • kfringe says:

          Wow. That really explains the OSA Foundation's software choices.

    • jwz says:

      mork.pl

      It's really slow, and I found a way to make Perl 5.8.0 dump core. Whee. I'll have to optimize this a lot before it's actually usable; writing parsers for complex (basically) binary formats like this using regular expressions is just a colossal fucking pain in the ass.

      • edp says:

        i agree, this file format sux.
        and i would like to be able to process my mozilla history too!
        alas my perl skillz are not so 1337.
        so i used some C to preprocess...
        but maybe it is of some use:

        http://panix.com/~edp/mozhist/

        regards :)

      • violentbloom says:

        oh come on perl is fun.
        heh.
        I mean what's the point of even knowing perl if not to do some painful contortion right?

        or something.
        have I mentioned that once again I work on animation where life is fun and there is almost no perl in my life?
        though there is java and unfortunately I prefer perl, I very much and totally prefer perl.
        sigh.

      • avva says:

        Yeah, works well for my history.dat too. And is rather slow, true.

  5. taffer says:

    Argh, my fucking eyes!

  6. When I worked in Microsoft Office, I once cracked this file's format (for Netscape 3.x and 4.x, not Mozilla, but Mozilla's format should be the same). It is quite easy.

    • lherrera says:

      Jamie asked about Mozilla, not Netscape. Not to mention that he might just know a thing or two about older versions of Netscape file formats...

    • jwz says:

      In all versions of Netscape, it was a Berkeley DBM file. In Mozilla, it is, as I said, and as many discussed before you posted, a Mork file.

  7. billemon says:

    I'll try to write one, sometime. I imagine the solution you need is a bit like "sgrep" ?

  8. exiledbear says:

    http://www.erys.org/resume/

    Is this the guy you were talking about in the perl script?

  9. gnt says:

    Safari Browser History

    In ~/Library/Safari/History.plist...

  10. gnodal says:

    By way of context: my appreciation for files of the "history" and "bookmark" family dates back to Mosaic and, more significantly, the early Netscapes.

    With that in mind, and intending to document what I do when I drill down through a meme, I grabbed FF's history.dat thinking, "A human being programmed this, so really, I should be able to figure out how to parse this afternoon's peregrinations".

    heh

    FWIW: the mork primer at moz.org

  11. gnodal says:

    I couldn't track an item down using your link, but I've located two dealing with mork syntax; here's the Google Groups link for one that is cited on moz.org 19MARCH99. For the sake of entertainment here's an early rationale/apologia and, for the benefit and elucidation of those assembled here, your "[MORK] some summary file problem statements".

    • jwz says:

      Thank you Google, for totally screwing the pooch with the all-new all-shark-jumping Google Groups. I sure hope archive.org provides usenet archives soon, because Google has completely destroyed theirs.

      The thread you linked to is also on my site.

      • gnodal says:

        Yaa I found that, actually ... a coupla dupes and copies here and there.
        As for the google thing ... heh ... there's something philosophically fractal about following the thread relating to one botched data system and tripping over another one ... self-similarity or something. (Not all the shit floats to the top?)

        BTW: your quote of something like "Thing is about doing it right the first time is that nobody knows how hard it is." ... I've always wanted to hash that out with you; I keep getting the feeling you and I disagree fundamentally but I don't believe that's actually so. Meh *shrug* ... some other day, gawds willing. (Oh hey, I'm about 4500 miles closer to SF than I was this time last year ... it's been decades ... I'd love to check the psycho-acoustics in your club. Do you use any echo-sorb? I know this neat trick to turn wall panels into notch filters; they kill standing waves dead.)

        But this was fun: I failed to document "a typical afternoon" but just following this thread ... it brought me back here, for one, and then I ended up going from Jesse's bookmarklets (first time I see his blog) to one of his contributions to Keepers of Lists: "Top 38 Signs You Are Not Drunk Enough". *grin*

        [Ummmm ... would you consider unbanning my account, hfx_ben? That lava-lamp temperature / heat thing was a long time ago ... and I /was/ correct, if not right.]

        best of '05

  12. ronebofh says:

    I'm just surprised that, after all that, NOBODY made a damned "shazbot!" joke. I am underwhelmed.

  13. gemsling says:

    1. I reckon McCusker must be reading such comments on his lunacy with glee. Perhaps he made such an obscure format to taunt; not stupid, but a genius... just without the genius.

    2. Thanks for mork.pl. I was going nuts trying to figure out the timestamps in history.dat until I finally found this post and the line:

    $val = int ($val / 1000000); # we don't need milliseconds, dude.

  14. airmax says:

    Is there any more Morkgusting file format that Mozilla uses for the "Downloads" list out there? It is so freaking slow on loading the list, I bet it's something like Mork2.0.