stefanos82
4 years ago
0
99
Lots of questions:

  - the generated code by AI belongs to me or GitHub?
  - under what license the generated code falls under?
  - if generated code becomes the reason for infringment, who gets the blame or legal action?
  - how can anyone prove the code was actually generated by Copilot and not the project owner?
  - if a project member does not agree with the usage of Copilot, what should we do as a team?
  - can Copilot copy code from other projects and use that excerpt code?
    - if yes, *WHY* ?!
    - who is going to deal with legalese for something he or she was not responsible in the first place?
    - what about conflicts of interest?
  - can GitHub guarantee that Copilot won't use proprietary code excerpts in FOSS-ed projects that could lead to new "Google vs Oracle" API cases?
natfriedman4 years ago
You should read the FAQ at the bottom of the page; I think it answers all of your questions: https://copilot.github.com/#faqs
rozabnatfriedman4 years ago
This page has a looping back button hijack for me
samtheprogramnatfriedman4 years ago
The most important question, whether you own the code, is sort of maybe vaguely answered under “How will GitHub Copilot get better over time?”

> You can use the code anywhere, but you do so at your own risk.

Something more explicit than this would be nice. Is there a specific license?

EDIT: also, there’s multiple sections to a FAQ, notice the drop down... under “Do I need to credit GitHub Copilot for helping me write code?”, the answer is also no.

Until a specific license (or explicit lack there-of) is provided, I can’t use this except to mess around.

netcraftnatfriedman4 years ago
I dont see the answer to a single one of their questions on that page - did you link to where you intended?

Edit: you have to click the things on the left, I didn't realize they were tabs.

viccuadnatfriedman4 years ago
> You should read the FAQ at the bottom of the page; I think it answers all of your questions: https://copilot.github.com/#faqs

Read it all, and the questions still stand. Could you, or any on your team, point me on where the questions are answered?

In particular, the FAQ doesn't assure that the "training set from publicly available data" doesn't contain license or patent violations, nor if that code is considered tainted for a particular use.

res0nat0rviccuad4 years ago
From the faq:

> GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set.

I'm guessing this covers it. I'm not sure if someone posting their code online, but explicitly saying you're not allowed to look at it, getting ingested into this system with billions of other inputs could somehow make you liable in court for some kind of infringement.

viccuadres0nat0r4 years ago
that doesn't include patent violations nor license violations or compatibility between licenses. Which would be the most numerous and non-trivial cases.
res0nat0rviccuad4 years ago
How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?

Does everyone in this thread contact their lawyers after cutting and pasting a mergesort example from Stackoverflow that they've modified to fit their needs? Seems folks are reaching a bit.

IncRndres0nat0r4 years ago
For that very reason, many companies have policies that forbid copying code from online (especially from StackOverflow).
dlubarovIncRnd4 years ago
That mitigates copyright concerns, but patent infringement can occur even if the idea was independently rediscovered.
IncRnddlubarov4 years ago
I was answering a specific question, "How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?" The answer is that many companies have forbidden that specific action in order to remove the risk from that action.

You are expanding the discussion, which is great, but that doesn't apply in answer to that specific question.

There are answers in response to your question, however. For example, many companies use software for scanning and composition analysis that determines the provenance and licensing requirements of software. Then, remediation steps are taken.

dlubarovIncRnd4 years ago
Not sure what you're getting at. Are you suggesting that independent discovery is a defense against patents? Or are you clear that it isn't a defense, but just arguing that something from the internet is more likely to be patented than something independently invented in-house? Maybe that's true, but it doesn't really answer the question of

> How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?

The only real answer is a patent search.

IncRnddlubarov4 years ago
There are different ways to handle risk, such as avoidance, reduction, transfersal, acceptance. I was answering a specific question as to how people manage risk in a given situation. In answer I related how companies will reduce the risk. I was not talking about general cases of how to defend against the risk of patents but a specific case as to reducing the risk of adding externally found code into a product.

My answer described literally what many companies do today. It was not a theoretical pie in the sky answer or a discussion about patent IP.

To restate, the real-world answer I gave for, "How is it possible to determine if you've violated a random patent from somewhere on the internet via a small snippet of customized auto-generated code?" is often "Do not take code from the Internet."

woahviccuad4 years ago
I think a patent violation with CoPilot is exactly the same scenario as if you violated a patent yourself without knowing it.
IncRndres0nat0r4 years ago
That doesn't cover it, since that is a technical answer for a non-technical question. The same questions remain.
notsureaboutpgres0nat0r4 years ago
Sounds like using CodePilot can introduct GPLd code into your project and make your project bound by GPL as a result...

0.1% is a lot when you use 100 suggestions a day.

dvaunnatfriedman4 years ago
None of the questions and answers in this section hold information about how the generated code affects licensing. None of the links in this section contain information about licensing, either.
kitsune_natfriedman4 years ago
Sorry Nat, but I don't think it really answers anything. I would argue that using GPL code during training falls under Copilot being a derivative work of said code. I mean if you look at how a language model works, than it's pretty straightforward. The word "code synthesizer" alone insinuates as much. I think this will probably ultimately tested in court.
gpm4 years ago
> - under what license the generated code falls under?

Is it even copyrighted? Generally my understand is that to be copyrightable it has to be the output of a human creative process, this doesn't seem to qualify (I am not a lawyer).

See also, monkeys can't hold copyright: https://en.wikipedia.org/wiki/Monkey_selfie_copyright_dispute#Naruto_et_al_v._David_Slater

lawtalkinghumangpm4 years ago
In the US, yes. Elsewhere, not necessarily.
tlamponigpm4 years ago
> Is it even copyrighted?

Isn't it subject to the licenses the model was created from, as the learning is basically just an automated transformation of the code, which would be still the original license - as else I could just run some minifier, or some other, more elaborate, code transformation, on some FOSS project, for example the Linux kernel, and relicense it under whatever?

Does not sound right to me, but IANAL and I also did not really look at how this specific model/s is/are generated.

If I did some AI on existing code I'd be quite cautious and group by compatible licences classes, asking the user what their projects licence is and then only use the compatible parts of the models.-Anything else seems not really ethical and rather uncharted territory in law to me, which may not mean much as IANAL and just some random voice on the internet, but FWIW at least I tried to understand quite a few FOSS licences to decide what I can use in projects and what not.

Anybody knows of some relevant cases of AI and their input data the model was from, ideally in jurisdictions being the US or any European Country ones?

gpmtlamponi4 years ago
(Not a lawyer, and only at all familiar with US law, definitely uncharted territory)

No, I don't believe it is, at least to the extent that the model isn't just copy and pasting code directly.

Creating the model implicates copyright law, that's creating a derivative work. It's probably fair use (transformative, not competing in the market place, etc), but whether or not it is fair use is github's problem and liability, and only if they didn't have a valid license (which they should have for any open source inputs, since they're not distributing the model).

I think the output of the model is just straight up not copyrighted though. A license is a grant of rights, you don't need to be granted rights to use code that is not copyrighted. Remember you don't sue for a license violation (that's not illegal), you sue for copyright infringement. You can't violate a copyright that doesn't exist in the first place.

Sometimes a "license" is interpreted as a contract rather than a license, in which you agreed to terms and conditions to use the code. But that didn't happen here, you didn't agree to terms and conditions, you weren't even told them, there was no meeting of minds, so that can't be held against you. The "worst case" here (which I doubt is the case - since I doubt this AI implicates any contract-like licenses), is that github violated a contract they agreed to, but I don't think that implicates you, you aren't a party to the contract, there was no meeting of minds, you have a code snippet free of copyright received from github...

birdyroostergpm4 years ago
So if I make AI that takes copyrighted material in one side, jumbles it about, and spits out the same copyrighted material on the other side, I have successfully laundered someone else's work as my own?

Wouldn't GitHub potentially be responsible for the infringement by distributing the copyrighted material knowing that it would be published?

gpmbirdyrooster4 years ago
I exempted copied segments at the start of my previous post for a reason, that reason is I don't really know, I doubt it works because judges tend to frown on absurd outcomes.
tlamponigpm4 years ago
Where does copying end though? If an AI "retypes" it, not only with some variable renaming but some transformations that are not just describable by a few code transformations (neural nets are really not transparent and can do weird stuff), it wouldn't seem like a copy when just comparing parts of it, but it effectively would be one, as it was an automated translation.
gpmtlamponi4 years ago
Probably, copying ends when the original creative elements are unrecognizable. Renaming variables actually goes a long way to that, also having different or standardized (and therefore not creative) whitespace conventions, not copying high level structure of files, etc.

The functional parts of code are not copyrightable, only the non functional creative elements.

(Not a lawyer...)

tlamponigpm4 years ago
> The functional parts of code are not copyrightable, only the non functional creative elements.

1. Depends heavily on the jurisdiction (e.g., Software patents are a thing in America but not really in basically all European ones)

2. A change to a copyrightable work, creative or not, would still mean that you created a derived work where you'd hold some additional rights, depending on the original license, but not that it would now be only in your creative possession. E.g., check §5 of https://www.gnu.org/licenses/gpl-3.0.en.html

3. What do you think of when saying "functional parts"? Some basic code structure like an `if () {} else {}` -> sure, but anything algorithmic like can be seen as copyrightable, and whatever (creative or not) transformation you apply, in its basics it is a derived work, that's just a fact and the definition of derived work.

Now, would that matter in courts? That depends not only on 1., but additionally to that also very much on the specific case, and for most trivial like it probably would be ruled out, but if an org would invest enough lawyer power, or suing in a for its case favourable court (OLG Hamburg anyone). Most small stuff would be thrown out as not substantial enough, or die even before reaching any court.

But, that actually scares me a bit when thinking about that in this context, as for me, it seems like when assuming you'd be right, this all would significantly erodes the power of copyleft licenses like (A)GPL.

Especially if a non-transparent (e.g., AI), lets call it, code laundry would be deemed as a lawful way to strip out copyright. As it is non-transparent it wouldn't be immediately clear if creative change or not, to use the criteria for copyright you used. This would break basically the whole FOSS community, and with its all major projects (Linux, coreutils, ansible, git, word press, just to name a few) basically 80% of core infrastructure.

sjygpm4 years ago
If the model is a derivative work, why wouldn’t works generated using the model also be derivative works?
gpmsjy4 years ago
Because a derivative work must "as a whole, represent an original work of authorship".

https://www.law.cornell.edu/uscode/text/17/101

(Not a lawyer...)

buu700tlamponi4 years ago
This is a great point. If I recall correctly, prior to Microsoft's acquisition of Xamarin, Mono had to go out of its way to avoid accepting contributions from anyone who'd looked at the (public but non-FOSS) source code of .NET, for fear that they might reproduce some of what they'd seen rather than genuinely reverse engineering.

Is this not subject to the same concern, but at a much greater scale? What happens when a large entity with a legal department discovers an instance of Copilot-generated copyright infringement? Is the project owner liable, is GitHub/Microsoft liable, or would a court ultimately tell the infringee to deal with it and eat whatever losses occur as a result?

In any case, I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar). Otherwise, I think it would probably be too much risk to use this for anything important/revenue-generating.

heavyset_gobuu7004 years ago
> In any case, I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar). Otherwise, I think it would probably be too much risk to use this for anything important/revenue-generating.

I'm going to assume that there is no sensible whitelist of licenses until someone at GitHub is willing to go on the record that this is the case.

fomine3buu7004 years ago
Interesting to see since Nat was a founder of Xamarin
CookieMonbuu7004 years ago
> I hope that GitHub is at least limiting any training data to a sensible whitelist of licenses (MIT, BSD, Apache, and similar)

Yes, and even those licences require preservation of the original copyright attribution and licence. MIT gives some wiggle room with the phrase "substantial portions", so it might just be MIT and WTFPL

croesgpm4 years ago
It is output of humans creative processes, just not yours. Like an automated stackoverflow snippet engine.
agilobgpm4 years ago
>Generally my understand is that to be copyrightable it has to be the output of a human creative process

https://en.wikipedia.org/wiki/Monkey_selfie_copyright_dispute

chuinard4 years ago
Some of your questions aren't easy to answer. Maybe the first two were OK to ask. Others would probably require lawyers and maybe even courts to decide. This is a pretty cool new product just being shared on an online discussion forum. If you are serious about using it for a company, talk to your lawyers, get in touch with Github's people, and maybe hash out these very specific details on the side. Your comment came off as super negative to me.
Tainnorchuinard4 years ago
> This is a pretty cool new product just being shared on an online discussion forum.

This is not one lone developer with a passion promoting their cool side-project. It's GitHub, which is an established brand and therefore already has a leg up, promoting their new project for active use.

I think in this case, it's very relevant to post these kinds of questions here, since other people will very probably have similar questions.

ericbarrettchuinard4 years ago
Regardless of tone, I thought it was chock full of great questions that raised all kinds of important issues, and I’m really curious to hear the answers.
peddling-brinkchuinard4 years ago
I think these are very important questions.

The commenter isn't interrogating some indy programmer. This is a product of a subsidiary of Microsoft, who I guarantee has already had a lawyer, or several, consider these questions.

king_magicchuinard4 years ago
No, they are all entirely reasonable questions. Yeah, they might require lawyers to answer - tough shit. Understanding the legal landscape that ones' product lives in is part of a company's responsibility.
amelius4 years ago
Does Copilot phone home?
gpmamelius4 years ago
When you sign up for the waitlist it asks permission for additional telemetry, so yes. Also the "how it works" image seems to show the actual model is on github's servers.
heavyset_goamelius4 years ago
Yes, and with the code you're writing/generating.
ameliusheavyset_go4 years ago
This obviously sucks.

Can't companies write code that runs on customer's premises these days? Are they too afraid somebody will extract their deep learning model? I have no other explanation.

And the irony is that these companies are effectively transferring their own fears to their customers.

jmmcdamelius4 years ago
It's a large and gpu-hungry model.
natfriedman4 years ago
In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler.

On the training question specifically, you can find OpenAI's position, as submitted to the USPTO here: https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf

We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

croesnatfriedman4 years ago
Fair use doesn't exist in every country, so it's US only?
krzykcroes4 years ago
It exists in EU also (and it much mire powerful here).
croeskrzyk4 years ago
The EU doesn't have a copyright related fair use. Quite the opposite, that why we are getting upload filters.
anthkcroes4 years ago
False. In Spain you have it as under "uso legitimo".
croesanthk4 years ago
Spain is only part of the EU not the EU.
krzykcroes4 years ago
One exception makes the whole "EU doesn't have" incorrect.

EU doesn't enforce it on the states, yes. But some (maybe all) countries that are in EU do have it.

jay_kyburzcroes4 years ago
Yes, my partner likes to remind me we don't have it here in Australia. You could never write a search engine here. You can't write code that scrapes websites.
joepie91_natfriedman4 years ago
> training ML systems on public data is fair use

Uh, I very much doubt that. Is there any actual precedent on this?

> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

But apparently not eager enough to have this discussion with the community before deciding to train your proprietary for-profit system on billions of lines of code that undoubtedly are not all under CC0 or similar no-attribution-required licenses.

I don't see attribution anywhere. To me, this just looks like yet another case of appropriating the public commons.

patricktheboldnatfriedman4 years ago
What does "public" mean? Do you mean "public domain", or something else?
6gvONxR4sf7opatrickthebold4 years ago
Unfortunately, in ML "public data" typically means available to the public. Even if it's pirated, like much of the data available in the Books3 dataset, which is a big part of some other very prominent datasets.
kzrdude6gvONxR4sf7o4 years ago
So basically youtube all over again? I.e bootstrap and become popular by using widely available whatever media (pirated by crowdsourced piracy) and then many years later, when it gets popular, dominant, it has to turn around and "do things right" and guard copyrights.
qihqinatfriedman4 years ago

   (1) training ML systems on public data is fair use 

This one is tricky considering that kNN is also a ML system.
visargaqihqi4 years ago
kNN needs to hold on to a complete copy of the dataset itself unlike a neural net where it's all mangled.
stwrongnatfriedman4 years ago
What about privacy. Does the AI send code to GitHub? This reminds me of Kite
avery42stwrong4 years ago
Yes, under "How does GitHub Copilot work?":

> [...] The GitHub Copilot editor extension sends your comments and code to the GitHub Copilot service, which then uses OpenAI Codex to synthesize and suggest individual lines and whole functions.

stefanonatfriedman4 years ago
How do you guarantee it doesn't copy a GPL-ed function line-by-line?
king_magicstefano4 years ago
I truly don't think they can guarantee that. Which is a massive concern.
ipsum2stefano4 years ago
Yup, this isn't a theoretical concern, but a major practical one. GPT models are known for memorizing their training data: https://towardsdatascience.com/openai-gpt-leaking-your-data-2dfb9e22c1b2

Edit: Github mentions the issue here: https://docs.github.com/en/github/copilot/research-recitation and here: https://copilot.github.com/#faq-does-github-copilot-recite-code-from-the-training-set though they neatly ignore the issue of licensing :)

visargaipsum24 years ago
> GPT models are known for memorizing their training data

Hash each function, store the hashes as a blacklist. Then you can ask the model to regenerate the function until it is copyright safe.

ipsum2visarga4 years ago
What if it copies only a few lines, but not an entire function? Or the function name is different, but the code inside is the same?
protealipsum24 years ago
If we could answer those questions definitively, we could also put lawyers out of a job. There’s always going to be a legal gray area around situations like this.
ipsum2proteal4 years ago
Matching on the abstract syntax tree might be sufficient, but might be complex to implement.
vlasevipsum24 years ago
You can probably tokenize the names so they become irrelevant. You can ignore non-functional whitespace, so that code C remains. Maybe one can hash all the training data D such that hash(C) is in hash(D). Some sort of Bloom filter...
bogwogipsum24 years ago
That second link says the following:

> We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set

That's kind of a useless stat when you consider that the code it generates makes use of your existing variable/class/function names when adapting the code it finds.

I'm not a lawyer, but I'm pretty sure I can't just bypass GPL by renaming some variables.

kitsune_ipsum24 years ago
It's not just about regurgitating training data during a beam search, it's also about being a derivative work, which it clearly is in my opinion.
abraaestefano4 years ago
Surprised not to see more mention of this. It would make sense for an AI to "copy" existing solutions. In the real world, we use clean room to avoid this.

In the AI world, unless all GPL (etc.) code is excluded from the training data, it's inevitable that some will be "copied" into other code.

Where lawyers decide what "copy" means.

kitsune_abraae4 years ago
It's not just about copying verbatim. They clearly use GPL code during training to create a derivative work.

Then you have the issue of attribution with more permissive licenses.

dkarrasstefano4 years ago
How do you know that when you write a simplish function for example, it is not identical to some GPL code somewhere? "Line by line" code does not exist anywhere in the neural network. It doesn't store or reference data in that way. Every character of code is in some sense "synthesized". If anything, this exposes the fragility of our concept of "copyright" in the realm of computer programs and source code. It has always been ridiculous. GPL is just another license that leverages the copyright framework (the enforcement of GPL cannot exist outside such a copyright framework after all) so in such weird "edge cases" GPL is bound to look stupid just like any other scheme. Remember that GPL also forbids "derivative" works to be relicensed (with a less "permissive" one). It is safe to say that you are writing code that is close enough to be considered "derivative" to some GPL code somewhere pretty much every day, and you can't possibly prove that you didn't cheat. So the whole framework collapses in the end anyways.
krainboltgreenedkarras4 years ago
> How do you know that when you write a simplish function for example, it is not identical to some GPL code somewhere?

I don't, but then I didn't go first look at the GPL code, memorize it completely, do some brain math, and then write it out character by character.

king_magicnatfriedman4 years ago
@Nat, these questions (all of them, not just the 2 you answered) are critical for anyone who is considering using this system. Please answer them?

I for one wouldn't touch this with a 10000' pole until I know the answers to these (very reasonable) questions.

stephen82natfriedman4 years ago
> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

Another question is this: let's hypothesize I work solo on a project; I have decided to enable Copilot and have reached a 50%-50% development with it after a period of time. One day the "hit by a bus" factor takes place; who owns the project after this incident?

lovichstephen824 years ago
Your estate? The compiler comparison upthread seems to be perfectly valid. If you work on a solo project in c# and die, Microsoft doesn’t automatically own your project because you used visual studio to produce it
brecknatfriedman4 years ago
You should look into:

https://breckyunits.com/the-intellectual-freedom-amendment.html

Great achievements like this only hammer home the point more about how illogical copyright and patent laws are.

Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

We need to ditch the term “IP”, it’s a lie.

Hopefully we can do that before it’s too late.

anmkbreck4 years ago
I'm sure natfriedman will be thrilled to abolish IP and also apply this to the Windows source code. We can expect it on GitHub any minute!
breckanmk4 years ago
I used to work at Microsoft and occasionally would email Satya the same idealistic pitch. I know they have to be more conservative , but some of us have to envision where the math can take us and shout out loud about it, and hope they steer well. When I started at MS, my first week was heckled for installing Ubuntu on my windows machine. When I left, windows was shipping with Ubuntu. What may seem impossible today can become real if enough people push the ball forward together. I even hold out hope that someday BG will see the truth and help reduce the ovarian lottery by legalizing intellectual freedom.
cnp10breck4 years ago
Talking about the ovarian lottery seems strange in a thread about an AI tool that will turn into a paid service.

No one will see the light at Microsoft. The "open" source babble is marketing and recruiting oriented, and some OSS projects infiltrated by Microsoft suffer and stagnate.

sombremesaanmk4 years ago
All I know is that if a lawsuit comes around for a company who tried to use this, Github et al won't accept an ounce of liability.
delanobreck4 years ago
In practical terms, IP could be referred to as unique advantages. What is the purpose of an organization that has no unique qualities?

In general, what is IP and how it's enforced are two separate things. Just because we've used copyright and patents to "protect" an organization's unique advantages, doesn't mean we need to keep using them in the same way. Or maybe it's the best we can do for now. That's why BSD style licences are so great.

orthecreedencebreck4 years ago
You can't abolish IP without completely restructuring the economic system (which I'm all for, BTW). But just abolishing IP and keeping everything the same is kind of myopic. Not saying that's what you're advocating for, but I've run into this sentiment before.
breckorthecreedence4 years ago
I agree that it would be a huge upheaval. Absolutely massive change to society. But I have every confidence we have a world filled with good bright people who can steer us through the transition. Step one now is just educating people that these laws are marketed dishonestly, are inequitable, and counter productive to the progress of ideas. As long as the market starts to get wind that the truth is spreading, I believe it will start to prepare for the transition.
TaylorAlexanderorthecreedence4 years ago
Sure, but usually I tend to think "abolish X" means "lets agree on an end goal of abolishing X and then work rapidly to transition to that world." So in that sense I tend to think the person is not advocating for the simple case of changing one law, but on the broader case of examining the legal system and making the appropriate changes to realize a functioning world where we can "abolish X".
bogwogbreck4 years ago
> Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

Copyright doesn't protect "ideas" it protects "works". If an artist spends a decade of his life painting a masterpiece, and then some asshole sells it on printed T-shirts, then copyright law protects the artist.

Likewise, an engineer who writes code should not have to worry about some asshole (or some for-profit AI) copy and pasting it into other peoples' projects. No copyright protections for code will just disincentivize open source.

Software patents are completely bullshit though, because they monopolize ideas which 99.999% of the time are derived from the ideas other people freely contributed to society (aka "standing on the shoulders of giants"). Those have to go, and I do not feel bad at all about labeling all patent-holders greedy assholes.

But copyright is fine and very important. Nothing is perfect, but it does work very well.

breckbogwog4 years ago
Copyrights are complete bullshit too though. In your 2 examples. First, the artist I assume is using paints and mediums developed arguably over thousands of years, at great cost. So just because she is assembling the leaf nodes of the tree, the far majority of the “work” was created by others. Shared creation.

Same goes for an engineer. Binary notation is at the root of all code, and in the intermediate nodes you have Boolean logic and microcode and ISAa and assembly and compilers and high level Lang’s and character sets. The engineer who assembles some leaf nodes that are copy and pasteable is by definition building a shared creation of which they’ve contributed the least.

Jcowellbreck4 years ago
The basis of copyright isn’t that the sum product is 100% original. That insane since nothing we do is ever original. It’ll always be a component ultimately of nature. The point is that your creation is protected for a set amount of time and then it too eventually becomes a component for future works.
arp242breck4 years ago
> the artist I assume is using paints and mediums developed arguably over thousands of years, at great cost.

And they went to the store and paid money for those things.

breckarp2424 years ago
And they handed the cashier money and then got to do whatever they want with those things. Now they want to sell their painting to the cashier AND control what the cashier does with it for the rest of the cashier's life. They want to make a slave out of the cashier by a million masters.
dylannorthrupbreck4 years ago
Remind me when GitHub handed anyone any money for the code they used?
tlamponinatfriedman4 years ago
> the output belongs to the operator, just like with a compiler.

No it really is not that easy, as with compilers it depends on who owned the source and which license(s) they applied on it.

Or would you say I can compile the Linux kernel and the output belongs to me, as compiler operator, and I can do whatever I want with it without worrying about the GPL at all?

user-the-namenatfriedman4 years ago
> training ML systems on public data is fair use

So, to be clear, I am allowed to take leaked Windows source code and train an ML model on it?

dylannorthrupuser-the-name4 years ago
Or, take leaked Windows source code, run it through a compiler, and own it!
abn120natfriedman4 years ago
(1) That may be so, but you are not training the models on public data like sports results. You are training it on copyright protected creations of humans that often took years to write.

So your point (1) is a distraction, and quite an offensive one to thousands of open source developers, who trusted GitHub with their creations.

dylannorthrupnatfriedman4 years ago
Fair Use is an affirmative defense (i.e. you must be sued and go to court to use it; once you're there, the judge/jury will determine if it applies). But taking in code with any sort of restrictive license (even if it's just attribution) and creating a model using it is definitely creating a derivative work. You should remember, this is why nobody at Ximian was able to look at the (openly viewable, but restrictively licensed) .NET code.

Looking at the four factors for fair use looks like Copilot will have these issues: - The model developed will be for a proprietary, commercial product - Even if it's a small part of the model, the all training data for that model are fully incorporated into the model - There is a substantial likelihood of money loss ("I can just use Copilot to recreate what a top tier programmer could generate; why should I pay them?")

I have no doubt that Microsoft has enough lawyers to keep any litigation tied up for years, if not decades. But your contention that this is "okay because it's fair use" based on a position paper by an organization supported by your employer... I find that reasoning dubious at best.

deepnashnatfriedman4 years ago
It is the end of copyright then. NNs are great at memorizing text. So I just train a large NN to memorize a repository and the code it outputs during "inferencing" is fair use ?

You can get past GPL, LGPL and other licenses this way. Microsoft can finally copy the linux kernel and get around GPL :-).