natfriedman
4 years ago
0
53
In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler.

On the training question specifically, you can find OpenAI's position, as submitted to the USPTO here: https://www.uspto.gov/sites/default/files/documents/OpenAI_R...

We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

croes4 years ago
Fair use doesn't exist in every country, so it's US only?
krzykcroes4 years ago
It exists in EU also (and it much mire powerful here).
croeskrzyk4 years ago
The EU doesn't have a copyright related fair use. Quite the opposite, that why we are getting upload filters.
anthkcroes4 years ago
False. In Spain you have it as under "uso legitimo".
croesanthk4 years ago
Spain is only part of the EU not the EU.
krzykcroes4 years ago
One exception makes the whole "EU doesn't have" incorrect.

EU doesn't enforce it on the states, yes. But some (maybe all) countries that are in EU do have it.

jay_kyburzcroes4 years ago
Yes, my partner likes to remind me we don't have it here in Australia. You could never write a search engine here. You can't write code that scrapes websites.
joepie91_4 years ago
> training ML systems on public data is fair use

Uh, I very much doubt that. Is there any actual precedent on this?

> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

But apparently not eager enough to have this discussion with the community before deciding to train your proprietary for-profit system on billions of lines of code that undoubtedly are not all under CC0 or similar no-attribution-required licenses.

I don't see attribution anywhere. To me, this just looks like yet another case of appropriating the public commons.

patrickthebold4 years ago
What does "public" mean? Do you mean "public domain", or something else?
6gvONxR4sf7opatrickthebold4 years ago
Unfortunately, in ML "public data" typically means available to the public. Even if it's pirated, like much of the data available in the Books3 dataset, which is a big part of some other very prominent datasets.
kzrdude6gvONxR4sf7o4 years ago
So basically youtube all over again? I.e bootstrap and become popular by using widely available whatever media (pirated by crowdsourced piracy) and then many years later, when it gets popular, dominant, it has to turn around and "do things right" and guard copyrights.
qihqi4 years ago

   (1) training ML systems on public data is fair use 

This one is tricky considering that kNN is also a ML system.
visargaqihqi4 years ago
kNN needs to hold on to a complete copy of the dataset itself unlike a neural net where it's all mangled.
stwrong4 years ago
What about privacy. Does the AI send code to GitHub? This reminds me of Kite
avery42stwrong4 years ago
Yes, under "How does GitHub Copilot work?":

> [...] The GitHub Copilot editor extension sends your comments and code to the GitHub Copilot service, which then uses OpenAI Codex to synthesize and suggest individual lines and whole functions.

stefano4 years ago
How do you guarantee it doesn't copy a GPL-ed function line-by-line?
king_magicstefano4 years ago
I truly don't think they can guarantee that. Which is a massive concern.
ipsum2stefano4 years ago
Yup, this isn't a theoretical concern, but a major practical one. GPT models are known for memorizing their training data: https://towardsdatascience.com/openai-gpt-leaking-your-data-2dfb9e22c1b2

Edit: Github mentions the issue here: https://docs.github.com/en/github/copilot/research-recitation and here: https://copilot.github.com/#faq-does-github-copilot-recite-code-from-the-training-set though they neatly ignore the issue of licensing :)

visargaipsum24 years ago
> GPT models are known for memorizing their training data

Hash each function, store the hashes as a blacklist. Then you can ask the model to regenerate the function until it is copyright safe.

ipsum2visarga4 years ago
What if it copies only a few lines, but not an entire function? Or the function name is different, but the code inside is the same?
protealipsum24 years ago
If we could answer those questions definitively, we could also put lawyers out of a job. There’s always going to be a legal gray area around situations like this.
ipsum2proteal4 years ago
Matching on the abstract syntax tree might be sufficient, but might be complex to implement.
vlasevipsum24 years ago
You can probably tokenize the names so they become irrelevant. You can ignore non-functional whitespace, so that code C remains. Maybe one can hash all the training data D such that hash(C) is in hash(D). Some sort of Bloom filter...
bogwogipsum24 years ago
That second link says the following:

> We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set

That's kind of a useless stat when you consider that the code it generates makes use of your existing variable/class/function names when adapting the code it finds.

I'm not a lawyer, but I'm pretty sure I can't just bypass GPL by renaming some variables.

kitsune_ipsum24 years ago
It's not just about regurgitating training data during a beam search, it's also about being a derivative work, which it clearly is in my opinion.
abraaestefano4 years ago
Surprised not to see more mention of this. It would make sense for an AI to "copy" existing solutions. In the real world, we use clean room to avoid this.

In the AI world, unless all GPL (etc.) code is excluded from the training data, it's inevitable that some will be "copied" into other code.

Where lawyers decide what "copy" means.

kitsune_abraae4 years ago
It's not just about copying verbatim. They clearly use GPL code during training to create a derivative work.

Then you have the issue of attribution with more permissive licenses.

dkarrasstefano4 years ago
How do you know that when you write a simplish function for example, it is not identical to some GPL code somewhere? "Line by line" code does not exist anywhere in the neural network. It doesn't store or reference data in that way. Every character of code is in some sense "synthesized". If anything, this exposes the fragility of our concept of "copyright" in the realm of computer programs and source code. It has always been ridiculous. GPL is just another license that leverages the copyright framework (the enforcement of GPL cannot exist outside such a copyright framework after all) so in such weird "edge cases" GPL is bound to look stupid just like any other scheme. Remember that GPL also forbids "derivative" works to be relicensed (with a less "permissive" one). It is safe to say that you are writing code that is close enough to be considered "derivative" to some GPL code somewhere pretty much every day, and you can't possibly prove that you didn't cheat. So the whole framework collapses in the end anyways.
krainboltgreenedkarras4 years ago
> How do you know that when you write a simplish function for example, it is not identical to some GPL code somewhere?

I don't, but then I didn't go first look at the GPL code, memorize it completely, do some brain math, and then write it out character by character.

king_magic4 years ago
@Nat, these questions (all of them, not just the 2 you answered) are critical for anyone who is considering using this system. Please answer them?

I for one wouldn't touch this with a 10000' pole until I know the answers to these (very reasonable) questions.

stephen824 years ago
> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

Another question is this: let's hypothesize I work solo on a project; I have decided to enable Copilot and have reached a 50%-50% development with it after a period of time. One day the "hit by a bus" factor takes place; who owns the project after this incident?

lovichstephen824 years ago
Your estate? The compiler comparison upthread seems to be perfectly valid. If you work on a solo project in c# and die, Microsoft doesn’t automatically own your project because you used visual studio to produce it
breck4 years ago
You should look into:

https://breckyunits.com/the-intellectual-freedom-amendment.html

Great achievements like this only hammer home the point more about how illogical copyright and patent laws are.

Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

We need to ditch the term “IP”, it’s a lie.

Hopefully we can do that before it’s too late.

anmkbreck4 years ago
I'm sure natfriedman will be thrilled to abolish IP and also apply this to the Windows source code. We can expect it on GitHub any minute!
breckanmk4 years ago
I used to work at Microsoft and occasionally would email Satya the same idealistic pitch. I know they have to be more conservative , but some of us have to envision where the math can take us and shout out loud about it, and hope they steer well. When I started at MS, my first week was heckled for installing Ubuntu on my windows machine. When I left, windows was shipping with Ubuntu. What may seem impossible today can become real if enough people push the ball forward together. I even hold out hope that someday BG will see the truth and help reduce the ovarian lottery by legalizing intellectual freedom.
cnp10breck4 years ago
Talking about the ovarian lottery seems strange in a thread about an AI tool that will turn into a paid service.

No one will see the light at Microsoft. The "open" source babble is marketing and recruiting oriented, and some OSS projects infiltrated by Microsoft suffer and stagnate.

sombremesaanmk4 years ago
All I know is that if a lawsuit comes around for a company who tried to use this, Github et al won't accept an ounce of liability.
delanobreck4 years ago
In practical terms, IP could be referred to as unique advantages. What is the purpose of an organization that has no unique qualities?

In general, what is IP and how it's enforced are two separate things. Just because we've used copyright and patents to "protect" an organization's unique advantages, doesn't mean we need to keep using them in the same way. Or maybe it's the best we can do for now. That's why BSD style licences are so great.

orthecreedencebreck4 years ago
You can't abolish IP without completely restructuring the economic system (which I'm all for, BTW). But just abolishing IP and keeping everything the same is kind of myopic. Not saying that's what you're advocating for, but I've run into this sentiment before.
breckorthecreedence4 years ago
I agree that it would be a huge upheaval. Absolutely massive change to society. But I have every confidence we have a world filled with good bright people who can steer us through the transition. Step one now is just educating people that these laws are marketed dishonestly, are inequitable, and counter productive to the progress of ideas. As long as the market starts to get wind that the truth is spreading, I believe it will start to prepare for the transition.
TaylorAlexanderorthecreedence4 years ago
Sure, but usually I tend to think "abolish X" means "lets agree on an end goal of abolishing X and then work rapidly to transition to that world." So in that sense I tend to think the person is not advocating for the simple case of changing one law, but on the broader case of examining the legal system and making the appropriate changes to realize a functioning world where we can "abolish X".
bogwogbreck4 years ago
> Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

Copyright doesn't protect "ideas" it protects "works". If an artist spends a decade of his life painting a masterpiece, and then some asshole sells it on printed T-shirts, then copyright law protects the artist.

Likewise, an engineer who writes code should not have to worry about some asshole (or some for-profit AI) copy and pasting it into other peoples' projects. No copyright protections for code will just disincentivize open source.

Software patents are completely bullshit though, because they monopolize ideas which 99.999% of the time are derived from the ideas other people freely contributed to society (aka "standing on the shoulders of giants"). Those have to go, and I do not feel bad at all about labeling all patent-holders greedy assholes.

But copyright is fine and very important. Nothing is perfect, but it does work very well.

breckbogwog4 years ago
Copyrights are complete bullshit too though. In your 2 examples. First, the artist I assume is using paints and mediums developed arguably over thousands of years, at great cost. So just because she is assembling the leaf nodes of the tree, the far majority of the “work” was created by others. Shared creation.

Same goes for an engineer. Binary notation is at the root of all code, and in the intermediate nodes you have Boolean logic and microcode and ISAa and assembly and compilers and high level Lang’s and character sets. The engineer who assembles some leaf nodes that are copy and pasteable is by definition building a shared creation of which they’ve contributed the least.

Jcowellbreck4 years ago
The basis of copyright isn’t that the sum product is 100% original. That insane since nothing we do is ever original. It’ll always be a component ultimately of nature. The point is that your creation is protected for a set amount of time and then it too eventually becomes a component for future works.
arp242breck4 years ago
> the artist I assume is using paints and mediums developed arguably over thousands of years, at great cost.

And they went to the store and paid money for those things.

breckarp2424 years ago
And they handed the cashier money and then got to do whatever they want with those things. Now they want to sell their painting to the cashier AND control what the cashier does with it for the rest of the cashier's life. They want to make a slave out of the cashier by a million masters.
dylannorthrupbreck4 years ago
Remind me when GitHub handed anyone any money for the code they used?
tlamponi4 years ago
> the output belongs to the operator, just like with a compiler.

No it really is not that easy, as with compilers it depends on who owned the source and which license(s) they applied on it.

Or would you say I can compile the Linux kernel and the output belongs to me, as compiler operator, and I can do whatever I want with it without worrying about the GPL at all?

user-the-name4 years ago
> training ML systems on public data is fair use

So, to be clear, I am allowed to take leaked Windows source code and train an ML model on it?

dylannorthrupuser-the-name4 years ago
Or, take leaked Windows source code, run it through a compiler, and own it!
abn1204 years ago
(1) That may be so, but you are not training the models on public data like sports results. You are training it on copyright protected creations of humans that often took years to write.

So your point (1) is a distraction, and quite an offensive one to thousands of open source developers, who trusted GitHub with their creations.

dylannorthrup4 years ago
Fair Use is an affirmative defense (i.e. you must be sued and go to court to use it; once you're there, the judge/jury will determine if it applies). But taking in code with any sort of restrictive license (even if it's just attribution) and creating a model using it is definitely creating a derivative work. You should remember, this is why nobody at Ximian was able to look at the (openly viewable, but restrictively licensed) .NET code.

Looking at the four factors for fair use looks like Copilot will have these issues: - The model developed will be for a proprietary, commercial product - Even if it's a small part of the model, the all training data for that model are fully incorporated into the model - There is a substantial likelihood of money loss ("I can just use Copilot to recreate what a top tier programmer could generate; why should I pay them?")

I have no doubt that Microsoft has enough lawyers to keep any litigation tied up for years, if not decades. But your contention that this is "okay because it's fair use" based on a position paper by an organization supported by your employer... I find that reasoning dubious at best.

deepnash4 years ago
It is the end of copyright then. NNs are great at memorizing text. So I just train a large NN to memorize a repository and the code it outputs during "inferencing" is fair use ?

You can get past GPL, LGPL and other licenses this way. Microsoft can finally copy the linux kernel and get around GPL :-).