Copilot lawsuit

We're investigating a potential lawsuit against GitHub Copilot for violating its legal duties to open-source authors and end users:

Here again we find Microsoft getting handwavy. In 2021, Nat Friedman claimed that Copilot's "output belongs to the operator, just like with a compiler." But this is a mischievous analogy, because Copilot lays new traps for the unwary.

Microsoft characterizes the output of Copilot as a series of code "suggestions". Microsoft "does not claim any rights" in these suggestions. But neither does Microsoft make any guarantees about the correctness, security, or extenuating intellectual-property entanglements of the code so produced. Once you accept a Copilot suggestion, all that becomes your problem. [...]

What entanglements might arise? Copilot users -- here's one example, and another -- have shown that Copilot can be induced to emit verbatim code from identifiable repositories. Just this week, Texas A&M professor Tim Davis gave numerous examples of large chunks of his code being copied verbatim by Copilot, including when he prompted Copilot with the comment /* sparse matrix transpose in the style of Tim Davis */.

Use of this code plainly creates an obligation to comply with its license. But as a side effect of Copilot's design, information about the code's origin -- author, license, etc. -- is stripped away. How can Copilot users comply with the license if they don't even know it exists? [...]

Amidst this grand alchemy, Copilot interlopes. Its goal is to arrogate the energy of open-source to itself. We needn't delve into Microsoft's very checkered history with open source to see Copilot for what it is: a parasite.

The legality of Copilot must be tested before the damage to open source becomes irreparable. That's why I'm suiting up.



3 Responses:

  1. Jim says:

    Here's a good source describing how large language models (the kind usually behind the voice-assistant systems that produce unattributed content) actually contain the full text of the documents on which they were trained, which these days almost always includes the full text of the English Wikipedia, for example. See in particular the first paragraph of the Background and Related Work section on page 2. It's fascinating that document extraction is considered an "attack" against such systems, which may say something about how well the researchers understand that they are dealing with copyright issues on an enormous scale.
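
    As a toy illustration of that memorization (this is not the paper's method, and gpt2 is just a stand-in model), you can prompt a model with a prefix from a famous document and check whether it continues verbatim:

        # Toy memorization check, using the Hugging Face transformers library.
        from transformers import pipeline

        generator = pipeline("text-generation", model="gpt2")

        # A prefix the model has almost certainly seen during training.
        prefix = "We hold these truths to be self-evident, that all men"
        expected = "are created equal"

        out = generator(prefix, max_new_tokens=8, do_sample=False)[0]["generated_text"]

        # A verbatim continuation means the model is reproducing memorized
        # training text rather than composing anything new.
        print("verbatim:", expected in out)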

  2. eswan says:

    /* sparse matrix transpose in the style of Jim Davis */ yields much different results.

  3. Joe Buck says:

    I'm a bit late to comment on this one, but it occurs to me that one solution would be for Copilot to come with a search engine that would search the training data for close matches above some length threshold. So, if Copilot decides to insert the fast inverse square root routine from Quake, it would also show a link to the repository where it came from, with licensing information. This would protect the Copilot user from infringing: if they plan to release their code under a compatible license, it's cool, and they can even add any required notices. Otherwise they can just reject that code.

    This would cost money to build, of course, but perhaps source code contributors could insist on it as a condition for Copilot to continue using their code: either GitHub is a contributor to copyright infringement, or it mitigates the risk.
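
    A rough sketch of such a matcher, assuming the training data has been pre-indexed by token n-grams and each indexed file carries its repository URL and license (all names and thresholds here are hypothetical):

        from collections import defaultdict

        N = 20  # length threshold, in tokens, above which a match is reported

        def ngrams(tokens, n=N):
            return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

        class CorpusIndex:
            def __init__(self):
                self.index = defaultdict(set)  # n-gram -> source ids
                self.sources = {}              # source id -> (repo URL, license)

            def add(self, source_id, tokens, repo_url, license_name):
                self.sources[source_id] = (repo_url, license_name)
                for g in ngrams(tokens):
                    self.index[g].add(source_id)

            def find_matches(self, suggestion_tokens):
                # Any shared run of N tokens flags a close match.
                hits = set()
                for g in ngrams(suggestion_tokens):
                    hits |= self.index.get(g, set())
                return [(sid,) + self.sources[sid] for sid in sorted(hits)]

        # Before surfacing a suggestion, Copilot would run it through the
        # index and attach attribution, so the user can comply with the
        # license or reject the code:
        #   for sid, url, lic in corpus.find_matches(tokenize(suggestion)):
        #       print("close match:", url, "license:", lic)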
