"I'm the Googlebot. I'm here to index you. Please hold still."

Let's see how much my copyrights have been infringed within the ChatGPT training data:

Hey, I outrank Stormfront and 4Chan! So at least there's that.

See the websites that make AI bots like ChatGPT sound so smart:

Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI's training data.

The three biggest sites were patents.google.com; wikipedia.org; and scribd.com (No. 3), a subscription-only digital library. Also high on the list: b-ok.org, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.

Some top sites seemed arbitrary, like wowhead.com, a World of Warcraft player forum; thriveglobal.com, a product for beating burnout founded by Arianna Huffington; and at least 10 sites that sell dumpsters, including dumpsteroid.com, that no longer appear accessible. [...]

The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. [...] Social networks like Facebook and Twitter -- the heart of the modern web -- prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products. [...]

The Post found that the filters failed to remove some troubling content, including the white supremacist site stormfront, the anti-trans site kiwifarms, and 4chan, the anonymous message board known for organizing targeted harassment campaigns against individuals.

18 Responses:

  1. anon says:

    Citing AI Bullshit Generator©™: Fine *applauses*
    Citing Wikipedia: did you know wikipedia is not a reliable source? DERP DERP DERP

    • I'm pretty sure that scraping off Wikipedia is against Wikipedia's own terms of service.

      • anon says:

        Scraping by itself is also against the terms of a lot of authors, magazines, journals, etc. that the Bullshit Company™ is profiting from

      • Netluser says:

        You don't need to scrape Wikipedia, they already provide dumps to anyone with a few dozen spare terabytes (or however large it is) and a lot of patience.

        The question of whether these AI companies are complying with the Creative Commons license Wikipedia is available under is another issue.

        • phuzz says:

          If you don't care about the images (and I assume someone trying to train a chatbot wouldn't), then it's 'only' about 50-60GB for just the text of the articles. I guess depending on what you wanted your chatbot to say, you might include the discussion pages too.

  2. Don says:

    I would find this marginally less gross if fair use seemed to still exist for us little guys. Instead I get threatening emails from YouTube because of literally 19 seconds of a song playing in a short video demonstrating something but these clowns get to scrape everything in sight and I'll eat my hat if their network provider would enforce a DMCA takedown against them.

  3. J. Peterson says:

    So we can ask the chatbot for responses "in the style of jwz"? Cool.

    • jwz says:

      Please don't, and if you do, please don't share it with me.

    • Eric TF Bat says:

      I want to see a chatbot swearing at Apple's flaky implementation of some obscure internet standard and then spruiking an 18+ gig involving two guys playing the electric kazoo and a bunch of highly tattooed women wearing pasties and petticoats.

  4. jwz says:

    This means that ChatGPT has been trained on the output of DadaDodo.

    And as the rat's milk returns to the sewer, the cycle of life is complete.

  5. CSL3 says:

    The Post found that the filters failed to remove some troubling content, including the white supremacist site stormfront, the anti-trans site kiwifarms, and 4chan, the anonymous message board known for organizing targeted harassment campaigns against individuals.

    Oh, don't worry: Emerald Mine Space Karen has taken time off from destroying Twitter to go on Faux News and announce he's supporting an AI that will only scrape from white supremacists, misogynists, and similar nut-jobs. He's even naming it after Trump's failed Twitter clone - 'cause why the fuck not?

    • Elusis says:

      Well today he retconned deadnaming and misgendering out of Twitter's Hate Speech parameters, so why the hell not.

      • CSL3 says:

        Alex Gibney's doing a documentary about him and he mentioned to Rolling Stone how he (Gibney) can't fuckin' believe that people honestly think Musk Oil founded Tesla or any of the other companies he bought his way into with Daddy's apartheid money.

        Expensive revisionism is what he does; like when he says paedophiles Jeffrey Epstein and Ghislaine Maxwell weren't his BFFs, despite all evidence to the contrary. His ever-changing origin stories are as fake as his hair plugs.

  6. Jiri Lebl says:

    It's interesting that university websites are so low in the training ranking (in the millions). No wonder it is so bad at math. This includes say UIUC, where the most famous evil movie AI was trained at. I guess that if you try to unplug ChatGPT when it's trying to kill you, it's not going to start nostalgically talking about Urbana.

    • MC says:

      I got it to work, and it did not to any degree capture the disdain, contempt, raw anger and sheer wit of our glorious leader.

      In the future, only the inimitable will have a voice.

  7. Referring to Scribd as a "subscriber-only digital library" rather than essentially a 90s ROMs site but for PDFs is funny. Somewhat ironic that Scribd's business model is to get people who don't want to pay for their stolen PDFs to upload unique material of their own, and the safest and easiest way to do that is to generate files full of garbage text. I'd like to think that I've contributed in some tiny way to worsening these idiots' days twice over now.

