
Let's see how much my copyrights have been infringed within the ChatGPT training data:
Rank | Site | Tokens | Percent |
20,032 | jwz.org | 700k | 0.0004% |
244,596 | dnalounge.com | 93k | 0.00006% |
11,317,461 | dnapizza.com | 270 | 0.0000002% |
Hey, I outrank Stormfront and 4Chan! So at least there's that.
See the websites that make AI bots like ChatGPT sound so smart:
Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI's training data.
The three biggest sites were patents.google.com, wikipedia.org, and scribd.com (No. 3), a subscription-only digital library. Also high on the list: b-ok.org, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.
Some top sites seemed arbitrary, like wowhead.com, a World of Warcraft player forum; thriveglobal.com, a product for beating burnout founded by Arianna Huffington; and at least 10 sites that sell dumpsters, including dumpsteroid.com, that no longer appear accessible. [...]
The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. [...] Social networks like Facebook and Twitter -- the heart of the modern web -- prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products. [...]
The Post found that the filters failed to remove some troubling content, including the white supremacist site stormfront, the anti-trans site kiwifarms, and 4chan, the anonymous message board known for organizing targeted harassment campaigns against individuals.