incredibly funny how a bunch of people interpreted “ao3 was almost certainly scraped as part of the gpt training dataset because it’s a big easily accessible body of english language text, so you can prompt gpt with surprisingly vague stuff and it will autocomplete with snarry underage or wangxian a/b/o” as “elon musk Personally is Currently scraping ao3 and training an ai to plagiarize fic, going to go lock ALL my works on ao3 IMMEDIATELY”
its. its already in the dataset. how do you think these things work. “locking my works to registered users only until after the scraping stops!” my dude the ao3 team just needs to like add a robots.txt and check the useragent and stuff to prevent this from happening in the future*, and theyre already on it, but not only is the existing body of work presumably In the Dataset, the model has ALREADY BEEN TRAINED. that omelet isnt going to get unscrambled
(*im assuming that everyone gathering datasets for large language models is being reasonably Polite about it bc these are both very simple to circumvent — if this assumption is false then ao3 might need to graduate to Offensive Measures but also we would definitely need to bully the culprits off of hacker news)
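to make the “very simple to circumvent” part concrete, here is a minimal sketch of what a server-side useragent check could look like. this is an assumption about the general shape of such a check, not ao3's actual code: `BLOCKED_AGENT_TOKENS` and `is_blocked` are names made up for illustration, and “ExampleScraper” is a hypothetical bot.

```python
# Hypothetical blocklist of crawler name fragments (lowercase).
# "ccbot" is the name Common Crawl's crawler announces; "examplescraper" is made up.
BLOCKED_AGENT_TOKENS = ("ccbot", "examplescraper")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


# A polite crawler announces itself, so the check catches it.
print(is_blocked("CCBot/2.0 (https://commoncrawl.org/faq/)"))  # True

# A rude crawler just sends a browser-looking string and walks right past.
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"))  # False
```

the whole thing only works if the bot tells the truth about who it is, which is why it counts as a politeness mechanism and not a lock.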
anyway im not taking any Stance one way or the other on the “ai art debate” (other than maybe “none of you know what the hell you’re talking about”) but we’re definitely going to see a whole new world of copyright claims against the big art models and ml researchers developing new tools for “removing” stuff from a trained model, and i for one think that it will be SO entertaining to watch
right ok i did some cursory googling and the main two datasets gpt-3 got trained on that ao3 works might appear in are common crawl and webtext2. commoncrawl.org has a faq page that tells you exactly what to add to your robots.txt to stop the crawler, and their terms of use say you’re not allowed to use the dataset to violate ip rights, although in terms of actual legal force i think that probably has about as much oomph as a whiffle ball.
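for the curious, the rule for common crawl's crawler (which announces itself as `CCBot`) is a two-line robots.txt entry. the sketch below feeds that entry to python's standard `urllib.robotparser` to show how a well-behaved crawler would read it. the example.org url is a placeholder, and this only models a crawler that chooses to obey the file.

```python
from urllib.robotparser import RobotFileParser

# robots.txt entry that asks Common Crawl's crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.org/works/123"  # placeholder URL, not a real work

# CCBot is told to stay out; crawlers not named in the file are unaffected.
print(parser.can_fetch("CCBot", url))      # False
print(parser.can_fetch("Googlebot", url))  # True
```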
since the common crawl dataset is used for a broad range of internet research, not just ai training, i personally wouldn’t want to block their crawler — but ao3 might decide differently, as is their right. (it’d be really lovely if common crawl let you indicate to their crawler that you’re ok with your data being included in the full dataset for some purposes and not others, but i digress.) they also check the robots.txt regularly although i couldn’t find any info about what they do with previously scraped data when it gets updated (and at any rate i would be SHOCKED if openai hadn’t downloaded a copy of the dataset, independent of common crawl’s updates).
unlike common crawl which is provided by an independent organization, webtext2 is a dataset generated by openai themselves, composed of every outbound reddit link with at least 3 karma. i couldn’t immediately find any info on how to block their web scraper, but if you had a website that was actually being scraped by them, you could figure out what their bot is called and add it to your robots.txt, or just blacklist everything except googlebot etc. or block reddit as a referrer or something so people don’t link to your stuff from there, idk. for ao3 specifically the best solution is probably gonna be blocking the openai scraper bot, in my opinion as someone who only vaguely knows shit about websites.
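the “blacklist everything except googlebot” option could look roughly like the robots.txt below, again checked with python's standard `urllib.robotparser`. “SomeOtherBot” is a hypothetical stand-in since i don't know what the openai scraper calls itself, and the example.org url is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Allowlist-style robots.txt: Googlebot may crawl, every other crawler is asked not to.
ALLOWLIST_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str = "https://example.org/works/123") -> bool:
    """Return True if robots.txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ALLOWLIST_ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("Googlebot"))     # True
print(allowed("SomeOtherBot"))  # False (hypothetical bot, falls under the * rule)
```

the tradeoff is that this also shuts out every research or archival crawler you didn't think to list by name.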
but the most important takeaway if you didn’t understand ANY of that is that these are long term solutions for preventing future stuff from being included in the dataset; there’s really no point in locking your back catalog, it’s already in there and openai provides no tools for letting you take it out. go harass them about it if you wanna do something.
the other thing you may be interested in knowing is that github is in DEEP SHIT for doing similar stuff with code — to oversimplify, they trained a big language model called copilot on a dataset including all public github repositories (non-programmers: github repositories are one of the most popular places to keep source code), completely ignoring license and copyright, and they’ve started selling a subscription service that lets you use copilot to write code. unsurprisingly to anyone who knows anything about ml, copilot immediately started regurgitating verbatim snippets of code from those public repos.
now, a lot of repos on github are public because they are open-source projects licensed under gpl, which is a “copyleft” license that “infects” other projects — if you use gpl’d code in your commercially licensed projects, ooops! your project is now also gpl-licensed. when i interned at google i got a WHOLE spiel during orientation about how if you want to use an open source project you have to get approval before doing anything ESPECIALLY IF IT IS gpl-licensed, and i’ve gotten similar spiels at other internships.
and, uh, copilot can recite the entire gpl license word for word, so it… definitely has seen a lot of gpl’d code. this means if you use copilot in your proprietary project the chances are good that, eventually, it will start puking out code that was gpl licensed and that is now being blithely reproduced in complete and total violation of that license, and theres no way for you to tell this has happened because after all copilot doesn’t fucking know any better, and the original project can rightfully now force YOUR entire project to be free and open-source.
anyway i think that, Real Soon Now, ml dataset gatherers are going to have some nasty realizations about copyright law, but the field of battle is probably going to be software licensing (a field that has just SO much legal firepower to throw around) and not fanfiction or digital art (fields which, uh, don’t).
It's good to see you bring up Copilot's use of code, because there are some interesting things going on in code. Sometimes, there's only one way to write a specific function concisely. That same way of writing the function may have been independently derived by three different program authors. The first author released the code under a proprietary source-available license, so the source can be read but is neither free nor libre. The second author released the code under the WTFPL or the DAMAIL or the CC0 License, which are maximally permissive. The third author took the middle path and released the code under the GPL-3, a very infectious open-source license that is both free and libre while still reserving rights to the author.
If you looked at the output of Copilot, you would not be able to tell which license the copied code was from, or whether it came from a single source instead of all three sources.
Likewise, there are only so many ways to write a concise sentence about Steve slapping Bucky on the shoulder.
Human languages are large, but they are not infinite, and there is only a certain amount of complexity that can be expressed in a text string of a given length.
One of the open questions of law is at what point a work legally ceases to be derivative and becomes transformative. There are no clear guidelines in US law about this; if you want to be sure, you have to go to court.
The Organization for Transformative Works runs AO3, and they favor a broad interpretation of what counts as transformative. A fanfiction author's Stucky fanfic novelization of Captain America: The Winter Soldier is generally regarded as transformative, except maybe by Disney's lawyers, even though it has the same plot beats as the original film and the same spoken lines.
Is it transformative if I write a proprietary Drupal module that contains significant similarities to a GPL-2 WordPress plugin? Is it transformative if Copilot does the writing of the code?
Is it transformative if GPT-3 writes Stucky slashfic, or Rogers/Barnes/Danvers A/B/O?
These are the things that lawyers are paid to argue about.
But if you're someone who writes unlicensed fanfic, I think you should be cautious about too strongly endorsing an author's right to control derivative works. Don't vote for the Leopards Eating People's Faces party.
YEAH thanks for bringing this up! iirc this was also a major point of contention when oracle and google were fighting over android’s use of the java api, because 99% of google’s reimplementation was different but there was one nine-line function for something really simple that both oracle and google happened to implement identically. so part of the fight was “can you copyright an api” and part of it was “did this identical reimplementation infringe on oracle’s ip if nobody involved in writing it saw oracle’s code”.
but also if the ao3 legal team does decide to pick a fight over it i think thats one of the better options, because of their interest in preserving transformative works. i’ve been seeing a lot of fic authors and fanartists make very strong claims about the legitimacy of “ai” creative pursuits that im not sure they’d like turned around on them. law in its majestic equality and all that
OTW might block the simpler scrapers, but if anything, I think their mission statement is more in favor of preserving the rights of fans to create transformative works through machine generation than it is opposed to machine generation. Transformative works created with the aid of a machine are still transformative works, in my eyes.
If there's a legal judgement which preserves human ability to create transformative works but puts limits on computer-generated transformative works, that judgement will certainly be splitting some extremely fine hairs, especially with regard to human-prompted computer-generated works.
I worry that the specifics of any judgement against computer-generated transformative works will lead to restrictions on those human-composed transformative works in which a computer was in any way involved beyond stenographically reproducing the human's inputs. Possibly even on all human-composed transformative works. There's no way to do clean-room rederivation of cultural artifacts.
(Tangent: Part of Oracle v. Google was about whether Google could use the same names for functions as Oracle. Google could have used different names, but then their reimplementation of the Java API wouldn't have worked with existing third-party code. If you replace all the character names, your Twilight fanfic becomes Fifty Shades of Grey, which is no longer recognizable as Twilight fanfic and which doesn't attract Twilight readers who don't use a compatibility layer to change the names back.)