We should probably start thinking about AI rights at some point. Personally, I'll be crediting GPT-3 like any other contributor, mostly because it sounds cool, but maybe for moral reasons too in the future.
Compare: if you had only learned to write from, say, the Bible, you would probably write in a very Biblical manner, but would you reproduce the Psalms exactly? Most likely not.
In any case my original question was answered by the tweeter in a later tweet I missed https://twitter.com/eevee/status/1410049195067674625
I get where they're coming from but they are kinda just handwaving it back the other way with the "u fell for marketing idiot" vibe. I wish someone smarter than me could simplify the legal ramifications around this but we'll probably have to wait till it kills someone (or at least costs someone a bunch of money) to get any actual laws set up.
* Co-pilot fails to detect it, and you have a potential lawsuit/ethical concern when someone finds out. Although the devil on my shoulder says that if Co-pilot didn't detect it, what's to say another tool will?
* Co-pilot reuses code in a way that still violates copyright, but is difficult to detect: if you checked via a syntax tree, you'd notice that the code was the same, but if you looked at it as raw text, you wouldn't (see the sketch after this list).
* Purely ethical - is it right to take licensed code and condense it into a product, without having to take into account the wishes of the original creators? It might be treated as normal that other coders will read it, and pick up on it, but when these licenses were written no one saw products like this coming about. They never assumed that a single person could read all their code, memorise it, and quote it near-verbatim on command.
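Rough sketch of the syntax-tree point above, using Python's standard ast module (the two snippets are made up for illustration, and a real detector would also need to normalize literals, ordering, etc.): code that looks different as raw text can compare identical once identifiers are stripped from the tree.

```python
import ast

def normalized(source: str) -> str:
    """Dump a Python AST with identifiers blanked out, so code that differs
    only in variable/function names compares equal."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.FunctionDef):
            node.name = "_"
    return ast.dump(tree)

a = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc"
b = "def sum_all(values):\n    s = 0\n    for v in values:\n        s += v\n    return s"

print(a == b)                          # False: as raw text they look different
print(normalized(a) == normalized(b))  # True: structurally they are the same code
```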
It's gonna be really interesting to see how this plays out.
It's perfectly fine for me to develop programming skills by reading any code, regardless of the license. When a corp snatches an employee from competitors, they get to keep their skills even if they signed an NDA and can't talk about what they worked on. On the other hand, there are non-compete agreements, where you can't. Good luck making a non-compete agreement with a neural network.
Even if someone feeds stolen or illegal data as an input dataset to gain advantage in ML, how do we even prove it if we're only given the trained model and it generalizes well?
I hope you're actually reading those LICENSE files before using open source code in your projects.
I'd be inclined to agree with this, but whenever a high-profile leak of source code happens, reading that code can have dire consequences for reverse engineers. It turns clean-room reverse engineering into something derivative, as if the code that was read had the ability to infect whatever the programmer wrote later.
A situation involving the above developed in the ReactOS project https://en.wikipedia.org/wiki/ReactOS#Internal_audit
Someone's going to have to audit the model, the training process, and the data that goes into it. There's a documentary about black holes on Netflix that did something similar (no idea if it involved AI): each team wrote code to interpret the data independently, without collaboration, hints, or information leakage, and at the end they were all within a certain accuracy of one another in interpreting the raw data.
So, as an example, if I can't train something in parallel and get results similar to the already-trained model, we know something is up and there is missing or altered data (at least I think that's how it works).
What you can expect a person to do is understand the principles behind that GPL code and write something along the same lines. GitHub Co-Pilot is not a general AI, and it's not touted as one, so we shouldn't be asking whether it really understands code principles, only whether it can reliably output code that serves a similar function to what came before, which could reasonably include entire blocks of GPL code.
> "but eevee, humans also learn by reading open source code, so isn't that the same thing"
> - no
> - humans are capable of abstract understanding and have a breadth of other knowledge to draw from
> - statistical models do not
> - you have fallen for marketing
This is an excellent example of how the AI singularity/revolution/whatever is a total distraction, and of how a much bigger and more serious issue is how effective AI is becoming at turning the output of cheap/free human mental labour into capital. If AI keeps getting better and better and the status quo socio-economic structures don't change, trillions in capital will be captured by the 0.01%.
It would be quite a turn-up for the books if this AI co-pilot gets suddenly and dramatically better in 2030 and negatively impacts the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out, too late.
You really want to push for high productivity across all industries, even if that means sacrificing jobs in the short term, because history has demonstrated that, after such shifts, new and more human jobs emerge later.
New jobs in academic fields will not emerge. Even now, a significant percentage of degree holders are forced into bullshit jobs.
So the claim that this technological revolution will be different and that it will result in a broad social safety net, universal basic income, and substantive, well-paid part-time work is a joke but not a very good one. It will be more of the same - massive concentration of wealth among those who already hold enough capital to wield it effectively. A few lucky ones who manage to create their own wealth. And those left behind working more hours for less.
What part of printing trillions of dollars to stimulate economic productivity is somehow a free market system?
Doing what? Isn't the concern here that automation will push many people out of the workforce entirely?
Energy efficiency isn't relevant. When switchboard operators were replaced by automatic telephone exchanges, it wasn't to reduce energy consumption.
The question is whether an automated solution can perform satisfactorily while offering upfront and ongoing costs that make it an economically viable replacement for human workers (i.e. paid employees).
But yeah, the remaining 80-90% of the population will have quality of life and bullshit jobs, because that's how the world is right now outside the bubble of Western countries.
It would be a lot easier if more people on this website would just be honest with themselves and everyone else and simply admit they think feudalism is good and that serfs shouldn't be so uppity. But not me, of course; I won't be a serf. Now if you'll excuse me, someone gave me a really good deal on a bridge that I'm going to go buy...
You could call it endgame
And, of course, total surveillance helps to prevent any kind of unionization of those 99.99%.
By definition, that has always been true.
We have been in the endgame for a very long time.
How is that different from the current situation?
Is that anything new? That seems to be a repeating fact of life throughout history.
We need not totally avoid such changes (i.e. shun technological advancements entirely because of their social ramifications), but we need to be mindful of their effects if we want to improve our current situation regarding the distribution/concentration of wealth and power in the world.
In the hypothetical fully-automated future, there's no need for workers anymore; automated capital can generate wealth directly, and its owners can trade the output between each other to fully satisfy all their needs. The only reason to give anything to the 99.99% at that point would be to keep them content enough to prevent a revolution, and that's less than you need to pay people to actually come and work for you.
For Google, support employees cost too much.
I think programming is one of the many domains (including driving) that will never be totally solved by AI unless/until it's full AGI. The long tail of contextual understanding and messy edge-cases is intractable otherwise.
Will that happen one day? Maybe. Will some kinds of labor get fully automated before then? Probably. But I think the overall time-scale is longer than it seems.
The problem with floats-storing-money is (a) you have to know how many digits of precision you want (e.g. cents, dollars, a tenth of a cent), and (b) you need to watch out if you're adding values together.
Even if certain values can't be represented exactly, that's ok, because you'd want to round to two decimal places before doing anything.
Is there a monetary value that you can't represent with a 64-bit float? E.g. some specific example where quantization ends up throwing off the value by at least 1/100th of whatever currency you're using?
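FWIW, a single 64-bit float is never off by anywhere near 1/100th of a unit on its own; it's the accumulation from point (b) that bites. A minimal Python sketch (the exact drift depends on the runtime, but the final comparison always fails):

```python
from decimal import Decimal

# 0.10 has no exact binary representation, so repeated addition drifts
# slightly away from the exact decimal result.
total = 0.0
for _ in range(1_000_000):
    total += 0.10            # add ten cents a million times
print(total)                 # close to, but not exactly, 100000.0

# The same computation with a decimal type stays exact.
exact = sum(Decimal("0.10") for _ in range(1_000_000))
print(exact)                 # Decimal('100000.00')

# Even a single addition can fail a naive equality test:
print(0.10 + 0.20 == 0.30)   # False
```

Which is why integer cents (or a decimal type) are the usual fix, rather than rounding a float after every operation.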
This is absolutely one of the things that keeps me up at night.
Much of the structure of the modern world hinges on the balance between forces towards consolidation and forces towards fragmentation. We need organizations (by this I mean corporations, governments, unions, etc.) big enough to do big things (like fix climate change) but small enough to not become totalitarian or decrepit.
The forces of consolidation have been winning basically since the 50s with the rise of the military-industrial complex, death of unions, unlimited corporate funding of elections (!), regulatory capture, etc. A short linear extrapolation of the current corporate/government environment in the US is pretty close to Demolition Man's dystopian, "After the franchise wars, all restaurants are Taco Bell."
Big data is a huge force towards consolidation. It's essentially a new form of real estate that can be farmed to grow useful information crops. But it's a strange form of soil that is only productive if you have enough acres of it and whose yield scales superlinearly with the size of your farm.
Imagine doing a self-funded AI startup with just you and a few friends. The idea is nearly unthinkable. How do you bootstrap a data corporation that needs terabytes of information to produce anything of value?
If we don't figure out a "data socialism" movement where people have ownership over the data derived from their life, we will keep careening towards an eventuality where a few giant corporations own the world.
Are we in the software community not the ones who have frequently told other industries we have been disrupting to "adapt or die" along with smug remarks about others acting like buggy whip makers? Time to live up to our own words ... if we can.
No.
I'll politely clarify that for over a decade I - and many others - have been asking not to be lumped in with the lukewarm takes of west coast software bubble asshats. We do not live there, we do not like them, and I wish they would quit pretending to speak for us.
The idea that there is anything approaching a cohesive software "community" is a con people play on themselves.
Now, it's also not necessarily that bad of a state. That depends on a few ground elements being in place, like people being able to grow their own food (or supplemental food) or still being free to design and build things on their own. If corporations restrict that, then people will be at their mercy for all the essentials of life. My take from history is that I'd prefer to have been a peasant during much of the Middle Ages than a factory worker during the industrial revolution. [1] Then again, Chinese people have (seemingly) been willing to leave farms in droves over the last few decades to accept the modern version of factory life, so perhaps farming peasant life isn't as idyllic as it sounds. [2]
1: https://www.lovemoney.com/galleries/84600/how-many-hours-did-people-really-work-across-human-history
2: https://www.csmonitor.com/2004/0123/p08s01-woap.html
And that's why I won't be using it: why give it intelligence so it can work me out of a job?
1. Programmers will become teachers of the co-pilot through IDE / API feedback
2. Expect CI-like services for automated refactoring
OTOH if laundering through machine learning is a fair use, then licenses can't do anything about this. Licenses can't override the copyright law, so the law would have to change.
I'm curious as to why it seems persuasive. Open source licenses largely hinge on restrictions tied to distribution of the software, and training a model does not constitute distribution.
Umm, no it's not. It's possible we just have two problems: the economic problem you mention might be real, but the people who worry about the singularity might be right as well. The existence of one problem doesn't negate the existence of the other.
You'd just have to wrap it in a nice, complex model representation so it's a black box that you fed example OSes plus some metadata into, and it happens to output this very useful data.
After all, once you use something as input to a machine learning model apparently the license disappears. Sweet.
* Someone leaks Windows 10/11 source code
* Copilot picks it up in its training data
* Someone uses copilot to generate a Windows clone and starts selling it
I wonder how Microsoft would react to that. I wonder if they've manually blacklisted leaked source code from Windows (or other Microsoft products) so that it doesn't show up in Copilot's training data. If they have, that means Microsoft recognizes the IP risks of having your code in that data set, which would make this Copilot thing not just the result of poor planning/maybe a little incompetence, but something much more devious and malicious.
If Microsoft is going to defend this project, they should introduce all of their own source code into the training data.
Maybe if the signature matches perfectly, copilot will even pull in the exact implementation from the Windows source code.
You could test this with one of Microsoft's products that is already on GitHub - like VSCode. I doubt you would get anywhere with just copilot.
Why do you think it has to be source code? It could be the compiled code, after all.
If what we're talking/fantasizing about here works on something like `let x = 42`, it should work equally well on `loda 42` and the like, so source code be damned. It was only ever an intermediate step, inserted between the idea and the working bits, to let humans helpfully interfere. Dispensable.
If you tell it to generate "a function for calculating the barycentric coordinates of a ray-triangle intersection", you might get a working implementation of a popular algorithm, adapted to your language and existing class/function/variable names.
But if you tell it to generate "a smartphone operating system", it probably won't work...and if it does, it would most likely use giant chunks of Android's codebase.
And if that's true, it means that copilot isn't really generating anything. It's just a (high-tech) search engine that knows how to adapt the code it finds to fit your codebase. That's still a really cool technology and worth exploring, but it doesn't do enough to justify ignoring software licenses.
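To make the first case concrete, this is roughly the kind of short, well-known routine such a prompt maps onto - a hand-written Moller-Trumbore ray/triangle test returning barycentric coordinates (my own sketch for illustration, not actual Copilot output; the function name and vector helpers are invented):

```python
def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle_barycentric(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore: return (w, u, v, t) at the hit point, or None if the ray misses."""
    edge1, edge2 = sub(v1, v0), sub(v2, v0)
    h = cross(direction, edge2)
    a = dot(edge1, h)
    if abs(a) < eps:                 # ray is parallel to the triangle plane
        return None
    f = 1.0 / a
    s = sub(origin, v0)
    u = f * dot(s, h)
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, edge1)
    v = f * dot(direction, q)
    if v < 0.0 or u + v > 1.0:
        return None
    t = f * dot(edge2, q)
    if t <= eps:                     # intersection is behind the ray origin
        return None
    return (1.0 - u - v, u, v, t)    # barycentric weights for v0, v1, v2 and the ray parameter

# Example: a ray fired straight up from (1/3, 1/3, 0) hits the triangle's centroid.
print(ray_triangle_barycentric((1/3, 1/3, 0), (0, 0, 1),
                               (0, 0, 1), (1, 0, 1), (0, 1, 1)))
# -> roughly (1/3, 1/3, 1/3, 1.0)
```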
But since APIs are now unprotected, you could feed it all of the class structures and method signatures and have it fill in the blanks. I don't know if that gets you a working operating system, but it seems like it would get you quite a long way.
1) // The following code implements the functionality of <popular GPL'd library>
2) Have the library implemented magically for you
3) Delete the top comment if necessary :P
(It's pretty unlikely that this will actually work but the approach could well do.)
It does add some degree of plausible deniability (accidental violation, instead of intentional), but I don't think it would matter much.
Why not just copy it and then edit it? If a snippet is changed both logically and syntactically so that it no longer resembles the original, then it's no longer the original and you aren't in any licensing trouble. There is no meaningful difference between that manual washing and a clean-room implementation. All the ML changes here is accidental vs. deliberate. But it will be a worse wash than your manual one.
Thinking that this would conveniently bypass the fact that your goal was to copy the code seems to be the most common legal fallacy amongst software developers. The law will see straight through you, and you will be found to have infringed copyright. The reason is well explained in "What Colour are your bits?" (https://ansuz.sooke.bc.ca/entry/23).
EDIT: I can state my worry, semi-jokingly, as a conspiracy theory: Microsoft is using thousands of unsuspecting (and unwilling) developers to turn a huge copylefted corpus of algorithms into non-copylefted implementations. Even assuming the developers who use the co-pilot choose non-copyleft licenses only 50% of the time, there's still a constant trickle of un-copyleftization.
You can automate this process by feeding in existing GPL source code and seeing what CoPilot comes up with next.
I am sure that at some point it WILL produce exactly the same code snippet as some GPL project, provided you attempt it enough times.
Not sure what the legal interpretation would be, though; it is pretty grey in that regard.
There would always be a risk for CoPilot: had it digested certain PII and people found that out... it would be much more interesting to see the outcome.
Plus considering this is a legal issue ... good luck with "there is a statistically significant similarity in AST outputs related to the most unique sections of this code base" type arguments in court. We're currently at the "what's an API" stage of legal tech understanding.
Similarity can be used to prove derivation, but it's not the only way to do so. In this case, all the code that went into the model is (presumably) known, so you don't really need any sort of analysis to prove or disprove it. It is, rather, a legal question - whether the definition on the books applies here, or not.
What is confusing is that the neural net may take lots of small chunks and link them to one another, and then reproduce them in the same order verbatim.
So, the length of the samples being drawn is not necessarily small: the chunk size is based on its commonality. It could easily be long enough to trigger a copyright violation.
The code that has already been used for training should be problematic for them, not only new code in the future.
At least, as long as the system really learns concepts. If it just copies and pastes code, then that's a different story (same as with humans).
Licenses hold no power beyond what is granted to them by things being copyrighted by default.
But if I do it under a copyleft license like GPL, I expect those who copy to abide by the license and open source their own code too.
But sure, people shit on IP rights all the time, and I am guilty of it too. Let's say I didn't pay what I should have paid for every piece of software I have used.
It can become a massive (and unfair) competitive advantage.
Furthermore, Copilot will not work with less popular languages, and it will also keep popular languages from evolving.
I'm impressed. They did an amazing job from a corporate strategy standpoint. Also directionally things are getting interesting
BigQuery used to have a dataset updated weekly; it looks like it hasn't been updated since about a year after the acquisition by Microsoft.
[EDIT] actually, I suspect their play here will be to open up the public data but own the best and most low-friction implementation, then add terms that let them also feed their algo with proprietary code built using their editors. That part won't be freely available, and no free version will be able to provide that further-improved model, even assuming all the software to build it is open-source. Assuming using this thing ends up being a significant advantage (so, assuming this matters at all) your choice will be to either hamstring yourself in the market or to help Microsoft build their dataset.
GitHub repositories are open for the taking, GPT-XXX is cloneable (mostly, anyway) and VS Code is extensible.
They definitely have a good head-start, but I really don't think there's anything here that won't be generally available within 2 years.
I can think of only a handful of companies able to compete there. And they won't be OK with extending a Microsoft IDE, nor with breaking GitHub's TOS.
When you start competing on R&D costs the game changes.
There's always the chance that training costs will significantly decrease. But even at an order of magnitude less (i.e. tens of thousands of dollars), it's still beyond the reach of open projects and indie devs.
I took a look at their examples and they are not at all compelling. In one example it generated SQL and somehow knew the columns and tables in a database that it had no context on. So that's a lot of smoke and mirrors going on right there.
Do many developers actually want to work in this manner? That is, being interrupted every time they type with a robot interjection of some Frankenstein code that they now have to go through and review and understand. Personally, this is going to kick me out of the zone/flow too often to be useful. Coding isn't the hard part of my job. If this tool can somehow guess the business requirements of the task at hand, then I'll be impressed.
Even if the tool generates accurate code, if I don't fully understand what it wrote, then what? I'm still stuck digging through documentation and stackoverflow to verify that whatever is in my text editor is correct code. "Code confidently in unfamiliar territory" sounds like a Boeing 737 Max sized disaster in the making.
wanna see the source code of my AI model? oh, it's closed source
it's just coincidence that nearly 100% of my future linux-like kernel code looks the same as linux the kernel, bear in mind that my closed-source AI model takes inspiration from GitHub Copilot, there is no way that it will copy any source code
[1] https://www.plagiarismtoday.com/2008/03/25/iparadigms-wins-t...
Would it be a stretch to assert that GPL'd libraries have a market value for their creator in terms of reputation etc.?
Maybe code that is easily recreated by GPT with a simple prompt is not worth copyrighting. The future lies in making things more automated, not in protecting IP. If you're competing against a company that uses it, you can't ignore the advantage.
However, legally, the most recent Oracle vs. Google case has already settled a major point: APIs don't violate copyright. And as GitHub co-pilot is an API (a self-modifying one, but an API nonetheless), Microsoft has a good defense.
In the near-future, when we have AI-assisted reverse engineering along with Github co-pilot, then, with enough obfuscation there's nothing that can't be legally created or recreated on a computer, proprietary or not. This is simultaneously free software's greatest dream and worst nightmare.
Edit: changed Hilary Putnam to John Searle. Edit 2: spelling.
I’m just presuming we have a future where you can consume unique content indefinitely. Such as instead of binge watching Star Trek on Netflix you press play and new episodes are generated and played continuously, 24/7, and they are actually really good.
Thus intellectual property becomes a commodity.
That's a wild misconstrual of what the courts actually ruled in Oracle v. Google.
(And to the reader: don't take cues from people banging out poorly reasoned quasi-legal arguments in off-the-cuff comments.)
Pg. 2
'This case implicates two of the limits in the current Copyright Act. First, the Act provides that copyright protection cannot extend to “any idea, procedure, process, system, method of operation, concept, principle, or discovery . . . .” 17 U. S. C. §102(b). Second, the Act provides that a copyright holder may not prevent another person from making a “fair use” of a copyrighted work. §107. Google’s petition asks the Court to apply both provisions to the copying at issue here. To decide no more than is necessary to resolve this case, the Court assumes for argument’s sake that the copied lines can be copyrighted, and focuses on whether Google’s use of those lines was a “fair use.”
"any idea, procedure, process, system, method of operation, concept, principle, or discovery" sounds suspiciously like an API. Continuing:
Pg. 3-4
'To determine whether Google’s limited copying of the API here constitutes fair use, the Court examines the four guiding factors set forth in the Copyright Act’s fair use provision... '
(1) The nature of the work at issue favors fair use. The copied lines of code are part of a “user interface” that provides a way for programmers to access prewritten computer code through the use of simple commands. As a result, this code is different from many other types of code, such as the code that actually instructs the computer to execute a task. As part of an interface, the copied lines are inherently bound together with uncopyrightable ideas (the overall organization of the API) and the creation of new creative expression (the code independently written by Google)...
(2) The inquiry into the “the purpose and character” of the use turns in large measure on whether the copying at issue was “transformative,” i.e., whether it “adds something new, with a further purpose or different character.” Campbell, 510 U. S., at 579. Google’s limited copying of the API is a transformative use. Google copied only what was needed to allow programmers to work in a different computing environment without discarding a portion of a familiar programming language .... The record demonstrates numerous ways in which reimplementing an interface can further the development of computer programs. Google’s purpose was therefore consistent with that creative progress that is the basic constitutional objective of copyright itself.
(3) Google copied approximately 11,500 lines of declaring code from the API, which amounts to virtually all the declaring code needed to call up hundreds of different tasks. Those 11,500 lines, however, are only 0.4 percent of the entire API at issue, which consists of 2.86 million total lines. In considering “the amount and substantiality of the portion used” in this case, the 11,500 lines of code should be viewed as one small part of the considerably greater whole. As part of an interface, the copied lines of code are inextricably bound to other lines of code that are accessed by programmers. Google copied these lines not because of their creativity or beauty but because they would allow programmers to bring their skills to a new smartphone computing environment. The “substantiality” factor will generally weigh in favor of fair use where, as here, the amount of copying was tethered to a valid, and transformative, purpose.
(4) The fourth statutory factor focuses upon the “effect” of the copying in the “market for or value of the copyrighted work.” §107(4). Here the record showed that Google’s new smartphone platform is not a market substitute for Java SE. The record also showed that Java SE’s copyright holder would benefit from the reimplementation of its interface into a different market. Finally, enforcing the copyright on these facts risks causing creativity-related harms to the public. When taken together, these considerations demonstrate that the fourth factor—market effects—also weighs in favor of fair use.
'The fact that computer programs are primarily functional makes it difficult to apply traditional copyright concepts in that technological world. Applying the principles of the Court’s precedents and Congress’ codification of the fair use doctrine to the distinct copyrighted work here, the Court concludes that Google’s copying of the API to reimplement a user interface, taking only what was needed to allow users to put their accrued talents to work in a new and transformative program, constituted a fair use of that material as a matter of law. In reaching this result, the Court does not overturn or modify its earlier cases involving fair use.'
[1] https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf
That's... a mind-bendingly bad take. Google took an API definition and duplicated it; Copilot is taking general code and (allegedly) duplicating it. This was not done in order to enable any sort of interoperability or compatibility.
The "API defense" would apply if Copilot only produced API-related code, or (in a case against Copilot) if someone reproduced the interfaces Copilot exposes to consumers.
> Microsoft has a good defense.
MS has many good defenses (transformative work, github agreements, etc etc), but this is not one of them.
I get it that the derivative work might be more clear in an AI setting, but basically it boils down to the same thing.
Who is in favor of starting it? ;)
https://slate.com/technology/2021/01/dead-professor-teaching-online-class.html
Perhaps the only thing that is different today is the mentality. We take capitalism so much for granted that we cannot conceive of a world where collective funds are used to provide for the people (even though such a world existed not too long ago). And today we see it as a natural law that the means of production must belong in private hands; that is simply the order of things.
But, again, this is a construct. The only reason why it holds up is because most people support it. I very much doubt that's going to remain the case for long if we end up in a situation where the elites own all the (now automated) capital and don't need the workers to extract wealth from it anymore. The government doesn't even need to expropriate anything - just refuse to recognize such property rights, and withdraw its protection.
I hope that there are sufficiently many capitalists who are smart enough to understand this and to manage a smooth transition. Because if they don't, it'll come to torches and pitchforks eventually, and there's always a lot of collateral damage from that. But, one way or another, things will change. You can't just tell several billion people that they're not needed anymore and that they're welcome to starve to death.
Revolutions aren't great at building a sense of real community; there's a good reason that "successful" communist uprisings result in totalitarian monarchies.
What it means for the 0.01% to own the means of production is that they can offer access to privilege in a hierarchical manner. The same technology required for a techno-utopia can be used to implement a techno-dystopia which favors the 0.01% and their 0.1% cronies, and treats the rest of humanity as speedbumps.
There are already fully-automated murder drones, but my dishwasher still can't load or unload itself.
If they're not trading with the rest of the world, it doesn't mean they're the only ones with an economy. It means there are two different ones. And the one with the 99.9% is probably better; larger ones usually are.
That said, wrt "communist" revolutions specifically: they result in totalitarian dictatorships because the Bolshevik/Marxist-Leninist ideology underpinning them is highly conducive to that. Concepts like the dictatorship of the proletariat (esp. in Lenin's interpretation of it), the vanguard party, and democratic centralism all combine toward this inevitable end result.
But no other ideological strain of Marxism has ever carried out a successful revolution, perhaps because they simply weren't brutal enough. By way of example: the Bolsheviks violently suppressed the Russian Constituent Assembly within one day of its opening, as soon as they realized that they didn't have the majority there. In a similar way, despite all the talk of council democracy, they consistently suppressed councils controlled by their opposition (typically the peasant ones).
The Bolsheviks were the first ones who succeeded, and thereafter their support was crucial to the success of other revolutions, but that support came with ideological strings attached. So China, Korea, Vietnam, Cuba, etc. all hail from the same authoritarian tradition. Furthermore, where opposition leftist factions vied for dominance against Soviet-backed ones, the Soviets actively suppressed them: the campaign against "social fascism" in the 1930s, for example, or the persecution of anarchists in Republican Spain.
Anyway, we don't really know what a revolution that would stick to democratic governance would look like, long term. There were some figures and factions in the revolutionary Marxist communist movement that were much more serious about democracy than Bolsheviks - e.g. Rosa Luxemburg. They just didn't survive for long.
Correct me if I'm wrong, but is that even possible? I kind of thought that AI is just a set of fancy statistical models that require some (preferably huge) data set in order to infer the best fit. These models can only outperform humans in scenarios where the parameters are well defined.
Many (most?) tasks humans regularly perform don't have clean, well-defined parameters, and there is no AI we can conceive of that is theoretically able to perform those tasks better than an average human with adequate training.
Why should it be impossible? Arguing that it's impossible for an AI to outperform a human on almost all tasks is like arguing that it's impossible for flying machines to outperform birds.
There's nothing magical going on in our heads. It's just a set of chemical gradients and electrical signals that result in us doing or thinking particular things. Why can't we design a computer that does everything we do... only faster?
If not, do they require inputs to run? If so then you can provide them.
If not, then you apparently don't need a job since they can provide everything for you.
Similarly, there is no contradiction between AI being less efficient than a human brain, and AI being preferable to humans because it can deal with data sets that are two or three orders of magnitude too large for any human (or even team of humans).
That we can make an AI that outperforms humans at every task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher, and with more cargo than a flock of geese, but a flock of geese can reproduce, communicate with each other, digest grass, etc. An airplane will not outperform a flock of geese at every task, just at the tasks the airplane is optimized for.
I'm sorry, I confused the debate a little by talking about efficiency. My point was that there might be an inverse relation between the generality of a machine and its efficiency. This was my way of providing a mechanism by which building a machine that outperforms humans at every task could be impossible. This mechanism, if it exists, could be sufficient to prevent such machines from being theoretically possible, as at some point you would need all the energy in the universe to perform a task better than a specialized machine (such as an organism) does.
Perhaps this inverse relationship doesn't exist. The universe might conspire in a million other ways to make it impossible for us to build an AI that will outperform us at every task. The point is that "AI will outperform humans at any task" is far from inevitable.
Such an AI has absolutely been conceived of. In Superintelligence: Paths, Dangers, Strategies, Nick Bostrom goes over the ways such an AI could exist, and poses some scenarios about how a recursively self-improving AI could "take off" and exceed human intellectual capacity on its own.
Moreover, we're already building such AIs (in a limited fashion). Deepmind recently made an AI that can beat all Atari games [1]. The AI wasn't given "well defined parameters". It was just shown the game, and it figured out, on its own, how to map inputs to actions on the screen, and which actions resulted in progress towards winning the game. Then, the same AI went on to do this over and over again, eventually beating all 57 Atari games.
Yes, you can argue that this is still a limited example. However it is an example that shows that AIs are capable of generalized learning. There's nothing, in principle, that prevents a domain-specific AI from learning and improving at other problem domains. The AI that I'm conceiving of is a supersonic jet. This AI is closer to the Wright Flyer. However, once you have a Wright Flyer, supersonic jets aren't that far away.
> That we can make an AI that outperforms humans at every task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher, and with more cargo than a flock of geese, but a flock of geese can reproduce, communicate with each other, digest grass, etc. An airplane will not outperform a flock of geese at every task, just at the tasks the airplane is optimized for.
That's fair, but besides the point. The AI doesn't have to be better than humans at everything that humans can do. The AI just has to beat humans at everything that's economically valuable. When all jobs get eaten by the AI, it's cold comfort to me that the AI is still worse than humans at, say, enjoying a nice cup of tea.
I think the key word in that sentence might be "we". That is, you could hypothesize that while it's possible in principle for such a computer to exist, it might be beyond what humans and human civilization are capable of in this era. I don't know if this is true or not, but it's kind of intuitively plausible that it's difficult for a designer to design something as complex as the designer themselves, and the space of AI we can design is smaller than the space of theoretically conceivable AI.
AlphaGo ... hello? It beat its creators at Go, and a few months later the top players. I don't think supervised learning can ever surpass its creators in generalization capability, but RL can.
The key ingredient is learning in an environment, which is like a "dynamic dataset". Humans discovered science the same way - hypothesis, experiment, conclusion, rinse and repeat, all possible because we had access to the physical environment in all its glory.
It's like the difference between reading all books about swimming (supervised) and having a pool (RL). You learn to actually swim from the water, not the book.
A coding agent's environment is a compiler + CPU, pretty cheap and fast compared to robotics, which requires expensive hardware, and dialogue agents, which can't be evaluated outside their training data without humans in the loop. So I have high hopes for its future.
It's not possible because of comparative advantage - someone being better than you at literally everything isn't enough to stop you from having a job, because they have better things to do than replace you. Plus "being a human" is a task that people can be employed at.
- Is GitHub actually going to pay any attention to that, or are they just going to ingest the code and thus violate its license anyway?
- If they go ahead and violate the code's license, what are the legal repercussions for the resulting model? Can a model be "un-trained" from a particular piece of code, or would the whole thing need to be thrown out?
That effectively “overrides” any license or term that you’ve specified for your repository, since you’ve already licensed the content to GitHub under different terms. Of course, people who are not GitHub are beholden to the terms you specify.
[1] https://docs.github.com/en/github/site-policy/github-terms-of-service#g-intellectual-property-notice
> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
But, it goes on to say:
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
I'm not a lawyer, but it seems ambiguous to me if this ToS is sufficient to cover CoPilot's butt in corner cases; I bet at least one lawyer is going to make some money trying to answer the question.
Yes, the concept of a "private" repo is enforced only by GitHub's service. A bug in their auth code could lead to others having access. A warrant could lead to others having access. Etc.
It all comes down to the nuance of whether the usage counts as part of protecting or improving (or promoting) their services and what other terms are specified.
> GitHub may permit our partners to store and archive Your Content in public repositories in connection
Why are developers so myopic around big tech? Of course they can. Facebook can use your private photos. It's in their terms and services. Cloud providers have more generous terms.
The response has always been that they won't do that because they have a reputation to manage. The more they grow, the more they control the narrative, so the less this matters.
Wait until you find out they sell your data or use your data to sell products.
Why in 2021 are we giving Microsoft all of our code? It seems like the '90s and 2000s never happened and we all trust Microsoft. They have a free editor and a free operating system that send packets of user activity back to Microsoft, but that's okay... we want to help improve their products? We trust them.
These documents are structured to grant the service provider extremely broad rights, and then the rest of the document takes away portions of those rights. So in this case they claim the right to share any code in any repo with anyone, and then somewhere else they specify which code they won't share, and with whom they won't share it.
On the extreme end, "analysis" is so broad that it could arguably cover breaking down a file of code into its constituent methods and just saving the ASTs of those methods verbatim for Copilot to regurgitate. That's obviously not an acceptable outcome of these terms per se, but arguably isn't any different in principle from what they're already doing.
Ultimately, as I understand, courts tend to prefer a common sense outcome based on a reasonable human understanding of the law, rather than an outcome that may be defensible through some arcane technical logic but is absurd on its face and counter to the intent of the law. If a party were harmed by an instance of Copilot-generated copyright infringement, I don't see a court siding with this tenuous interpretation of the ToS over the explicit terms of the source code license. On the other hand, it would probably also be impossible to prove damages without something like a case of verbatim reproduction, similarly to how having a developer move from working on proprietary code for one company to another isn't automatically copyright infringement.
I doubt that GitHub is doing anything as blatantly malicious as copying snippets of (GPL or proprietary) code to explicitly reuse verbatim, but if they're learning from license-restricted code at all then I don't see how they wouldn't be subjecting themselves and/or consumers of Copilot to the same risk.
I do not upload my code to github, or give them any special permissions, and I am confident my code was included in the model's corpus.
That's nonsense because they could claim that for almost any reason.
E.g. assume Google put the source code of Google search in Github. Then Github copies that code and uses it in their own search, since that "improves the service". Would that be legal?
It's like selling a pen and claiming the rights to anything written with it.
In the US, maybe. In most of the rest of the world, these sorts of overreaching "we own everything you do anywhere" clauses are decidedly illegal.
We could end up in the same situation as the Hollywood movie even if you are also the one setting the original license on the work. Basically you have a right to change the license, but it doesn’t mean you do.
So if someone uploads a Hollywood movie to Youtube, Youtube doesn't get the rights to play that movie from them because they didn't have the rights in the first place. Of course, if the actual copyright owner uploads it, it's now permissible for Youtube to play it, even if it's the copy that someone else provided. [This has torpedoed a few filesharing lawsuits.]
From the definitions section in the same doc:
> "Your Content" is Content that you create or own.
That will definitely exclude any mirrored open-source projects, any open-source project that has ever migrated to Github from another platform, and also many forked projects.
The person uploading files to github is also not necessarily doing so with permission from the rights holder, which might be a violation of the terms of service, but would mean there's no agreement in place.
To be clear, I'm not suggesting this is some kind of loophole GitHub is using to trample on users' licenses, even though maybe they could. It's probably completely legal for GitHub to use even the most super-extra-double-GPL-licensed code because copyright law allows it.
The suggestion, by the author of the Twitter post, that Copilot's output must be a derivative work is based on a naive understanding of "derivative" as it's defined in copyright law. It's not hard to find clear explanations of how this stuff works, and it's obvious she didn't bother to do any homework. Several criteria would appear to rule out GitHub's use as infringement, e.g.:
'In essence, the comparison is an ad hoc determination of whether the protectable elements of the original program that are contained in the second work are significant or important parts of the original program.'
"All rights reserved" makes sense on final items, like books or physical records, that require no copy or change after owner-approved manufacturing has taken place. It doesn't really make sense on digital artefacts.
Also, in your example, the copyright for the book or dvd is for the content, not the physical item. You can do anything you want with that item but not the content. My code is similar, I'm licensing my provider to serve you a visual representation of the files so you can experience the content, not giving you a license to run that code or use it otherwise.
Considering how it works for personal data under the GDPR, I doubt that this is even needed?
Also copyright is something you have by default, no licence terms necessary.
OTOH, if they aren't human, then copyright barely applies to them anyway (consider search engine crawlers indexing your website, for instance), and I don't think that putting up a notice will legally change anything?
(You'll probably have better luck with robots.txt ...)
Then, let’s say the AI generates some new code for someone, and it is nearly identical to some bit of code that you wrote in your project.
If they didn’t use your code in the model, then the generated code is clearly not a copyright violation, since it was effectively a “clean room” recreation.
If your code was included in the model, is it therefore a violation?
But then again, it comes down to: how can someone prove whether their code was included or not?
What if the creators don't even know? If you wrote your model to, say, randomly grab 50% of all public repos to use in the model, then no one would know whether a specific repo was used in the training.
I suppose that for most open source licences this at the very least involves attribution for all the people who produced the code that the program was trained on?
> copyright does not only cover copying and pasting; it covers derivative works. github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this
Copyright law is very complicated (remember Google vs Oracle?) and involves a lot of balancing different factors [0]. Simply saying that something is a "derivative work" doesn't establish that it's copyright infringement. An important defense against infringement claims is arguing that the work is "transformative." Obviously "transformative" is a subjective term, but one example is the Supreme Court determining that Google copying Java's API's to a different platform is transformative [1]. There are a lot of other really interesting examples out there [2] involving things like if parodies are fair use (yes) or if satires are fair use (not necessarily). But one way or another, it's hard for me to believe that taking static code and using it to build a code-generating AI wouldn't meet that standard.
As I said, though, copyright law is really complicated, and I'm certainly not a lawyer. I'm sure someone out there could make an argument that Copilot is copyright infringement, but this thread isn't that argument.
[0] https://www.nolo.com/legal-encyclopedia/fair-use-the-four-factors.html
[1] https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.#Decision
[2] https://www.nolo.com/legal-encyclopedia/fair-use-what-transformative.html
Edit: Note that the other comments saying "I'm just going to wrap an entire operating system in 'AI' to do an end run around copyright" are proposing to do something that wouldn't be transformative and therefore probably wouldn't be fair use. Copyright law has a lot of shades of grey and balancing of factors that make it a lot less "hackable" than those of us who live in the world of code might imagine.
It is not uncommon to ask a person to "explain in your own words...", i.e. to use your own abstract internal representation of the learned concepts to demonstrate that you have developed such an abstract internal concept of the topic and are not merely regurgitating re-disorganized input snippets.
If you don't understand the difference...
edit: That said, if you can create a computer capable of such different abstract thought, congratulations, you've solved the problem of Artificial General Intelligence, and will be welcomed to the Trillionaires' Club
The ability to generalize actually seems to keep increasing with the number of parameters, which is the key interesting result in the GPT-* line of work that Copilot is based on.
Being able to predict the most likely succeeding string for a given input can be extremely useful. I've even used it with some success as a more sophisticated kind of search engine for some materials science questions.
But I'm under no illusions that it has the first shadow of a hint of minor understanding of the topics of materials science, never mind any general understanding.
It seems we're discussing different meanings of the word "generalize".
Sure, if I use some code as inspiration for solving a problem at work, that seems fine.
But if I copy verbatim some licensed code then put it in my commercial product, that's the issue.
It's a lot easier to imagine for other applications like generating music. If I trained a music model on publicly available Youtube music videos, then my model generates music identical to Interstellar Love by The Avalanches and I use the "generated" music in my product, that's clearly a use that is against the intent of the law.
If it's spitting out verbatim code 0.1% of the time, surely it's spitting out copied code where only trivial things are different at a much higher rate.
Trivial things meaning swapped order where order isn't important, variable/function names, equivalent ops like +=1 vs ++, etc.
Surely it's laundering some GPL code, for example, and effectively removing the license in a way that sounds fishy.
Whether or not the result is a license violation is a tricky legal question. As always, IANAL.
It seems to me an important question is: "Is this like a human who learns from examples, or is this really a derivative work in the copyright sense?" I'm not sure how to answer that. I'm not a lawyer. I don't know if many lawyers can answer that question either!
https://scholarship.law.cornell.edu/facpub/1481/
Copilot isn't human and therefore what it does isn't a "work".
The usual issues still apply to users of Copilot - unwitting violations of license terms of the code it was trained on (like non-attribution) are still violations.
I don't have photographic memory, so I largely don't memorize code. I learn general techniques, and memorize simple facts such as APIs. I can memorize some short snippets of code, but these probably aren't enough to be copyrightable anyway.
> The type of model they use isn't retrieving
How do we know? I think it's very likely that it is largely just retrieving code that it memoized and doing minor adjustments to make the retrieved pieces fit the context. That wouldn't differ much from finding code that matches the problem (whether on SO or GitHub), copy-pasting the interesting bits, and fixing it until it satisfies the constraints of the surrounding code. It's impressive that AI can do that, but it doesn't sound like it's producing original code.
I think the alternative to retrieving would actually require a higher level understanding of the world, and the ability to reason from first principles; that would be much closer to AGI.
For example, if I want to implement a linked list, I'm not going to retrieve an implementation from memory (although given that linked lists are so simple, I probably could). I know what a linked list is and how it works, and therefore I can produce working code from scratch... for any programming language, even ones for which no prior implementations exist. I doubt co-pilot has anything remotely as advanced as this ability. No, it is fully reliant on retrieving and reshaping pieces of memoized code; it needs a large corpus of code to memoize before it can do anything at all.
I don't need a large corpus of examples to copy, because I use my ability to reason in conjunction with some memoized general techniques and common APIs in order to produce original code.
Then a big corporation comes in, appropriates it, repackages it, and sells it as a new product.
It's shameful behaviour.
As GitHub is a Microsoft company, and OpenAI, although a non-profit, just got a massive one-billion-dollar investment from Microsoft (presumably not for free), will it start spitting out Windows kernel code once in a while? :-)
And if it was NOT trained on Microsoft source code because it could start suggesting some of it... is that not a validation that the results it produces are derivative works of the open source corpus it was trained on? IANAL...
It wasn't trained on internal Microsoft code because the training set is publicly available code. It has nothing to do with whether or not it suggests exactly identical, functionally identical, or similar code. MS internal isn't publicly available. Copilot is trained on publicly available code.
The question (and implication) is: why not train it on MS internal code, if the claim that the output isn't license-incompatible is true.
If the output doesn't conflict with any open-source license (i.e. it springs into existence from general principles, not from "copying" licensed code), then MS-internal code (in fact, any closed-source code) should be open season.
I can imagine a few of the non-obvious segments of code I've written being "recognizable" methods to solve certain problems. And, they are certainly licensed (GPL + Commercial, in my case).
I think, at the very least, that a set of AIs should be trained on different compatible sets of code, eg. GPL, AGPL, BSD, etc. Then, you could select what amount of license-overlap is compatible with your project.
“GitHub included Microsoft proprietary code in the training set because they view the results as non-derivative” and “GitHub didn’t include Microsoft proprietary code because they view the results as derivative” are clearly not the only options. They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.
Yet the equivalent problem for humans gets addressed by the clean-room approach. This seems unfair.
At some point it should be different enough to stand on its own, right? Then we have no problem with copyrights.
A more intelligent agent should be able to tell you where it learned all of its knowledge from. I personally would like my AI to be above "gut level instincts" otherwise it reinforces blind trust.
The intention is to autocomplete boilerplate, not to write a kernel.
Autocomplete, do you have anything to say to the commenter?
“This isn’t the best thing to say.”
Language processing research will not only help doctors, but will allow machine-based language translation, and eventually automated chat bots that can converse in our languages.
The next steps in human-machine collaboration are to allow people and machines to co-create. A recent Chinese report suggests that 50% of scientific papers in this field will be written without human intervention by 2033, compared with only 11% today.
One of the biggest challenges of machine learning is giving the machine what it lacks. This usually means gaining enough training data to teach the algorithm how to make inferences from data points it has never encountered before.
Many of the large organisations involved in advancing AI's ability to develop documents can improve how the algorithms learn by building on the knowledge and experience of human workers.”””
The above text was automatically written by https://app.inferkit.com/demo . It uses a language model to predict the next word in a sequence. In other words, to use your example, it not only architects, but builds, the entire building simply by predicting where to put the next brick.
So to answer your question: Yes. That’s exactly how it’s done.
You're trying to say architecting is some big woo idea that's somehow different from writing code. Kind of, maybe. But I bet you could build a functional kernel with central design. Given that's how biological systems work, I'm sure it could be done. Then what say you?
If you got the Microsoft codebase and Ctrl+F'd all the variable names and renamed them, I bet they would still argue that the compiled program was still a copy.
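For what it's worth, a rename-only copy is easy to see through if you compare structure instead of text. A rough sketch using Python's ast module (real plagiarism detection is more sophisticated than this):

    import ast

    def structural_fingerprint(source):
        """Dump the AST with every identifier name blanked out."""
        tree = ast.parse(source)
        for node in ast.walk(tree):
            # Blank out anything a Ctrl+F rename would change.
            for attr in ("id", "name", "arg", "attr"):
                if hasattr(node, attr):
                    setattr(node, attr, "_")
        return ast.dump(tree)

    a = "def total(items):\n    return sum(i.price for i in items)"
    b = "def sigma(rows):\n    return sum(r.cost for r in rows)"

    # Different names, identical structure:
    print(structural_fingerprint(a) == structural_fingerprint(b))  # True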
But if some of the code produced is covered by copyright, isn't Microsoft in trouble for distributing software that distributes copyrighted code without a license? How would it be different from giving out bootleg DVDs and trying to avoid blame by reminding everyone that the recipients don't own the copyright?
They don't claim they used an "open source corpus"; they say "public code", because they claim such use is "fair use" and therefore not subject to the exclusive rights under copyright.
There is plenty of leaked Windows source code on Github, so chances are that co-pilot would give quite good suggestions for implementing a Win32-compatible kernel. Then watch and see if Microsoft will try to argue that you are violating their copyright using code generated by their AI.
For example, the AI tool that Microsoft's lawyers use ("Co-Counsel") will be filing the DMCA notices and subsequent lawsuits against Co-Pilot generated code.
This will result in a massive caseload for the courts, so naturally they'll turn to their AI tool ("DocketPlus Pro") to adjudicate all the cases.
Only thing left is to enter these AI-generated judgements into Ethereum smart contracts. Then it's just computers suing other computers, and being ordered to send the fruits of their hashing to one another.
"Why are you typing all this stuff by hand? All your coworkers are much more efficient by using the AI!"
"But I need to actually understand ..."
"You should get more efficient! Look at how much time this costs us."
"Yeah but they are copying in mistakes from ..."
"No, the system works! Just do it like everyone else does it and do not waste more time!"
Or at the next code interview ...
The underlying question is whether the output is a derivative work of the training set? Sidestepping similar issues is why GCC and LLVM have compiler exemptions in their respective licenses.
If you have code from an independent origin, this issue doesn't apply. That's how clean room designs bypass copyright. Similarly if the upstream code waives its copyright in certain types of derived works (compiler/runtime exemptions), it doesn't apply.
Basically does reading GPL code pollute your brain and make it impossible to work for pay later?
If so you should only ever read BSD code, not GPL.
It seems to me that some people believe it does. Some of the "clean room" projects specifically instructed developers to not even look at GPL code. Specific examples not at hand.
They explicitly state "public" code so the answer is most certainly "no".
Literally, people need to quit Microsoft and join GitHub in order to take a role at GitHub.
They don't claim it wouldn't be a license violation, they claim licensing is irrelevant because copyright protection doesn't apply.
> And if it was NOT trained on Microsoft source code because it could start suggesting some of it... Is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on?
No, that would just show them to not want to expose their proprietary code. It doesn't prove anything about derivative works.
Also, their own claim is not that the results aren't a derivative work but that training an AI is fair use, which is an exception to the exclusive rights under copyright, including the exclusive right to create derivative works.
I do wonder, though, if GPL owners worried about their code being shanghaied for this purpose could file arbitration claims and exploit some particularly consumer-friendly laws in California which force companies to pay fees like when free speech dissidents filed arbitrations against Patreon.[0] Patreon is being forced to arbitrate 72 claims individually (per its own terms) and pay all fees per JAMS rules. IANAL, so I don't know the exact contours of these rules, or if copyright claims could be raised in this way, or even if GitHub's agreements are vulnerable to this loophole, but it'd be interesting.
[0]https://www.dailydot.com/debug/patreon-suing-owen-benjamin-fans/ (see second update from July 31).
Somehow p-hacking springs to mind
Under the right circumstances, Copilot will recite a GPL copyright header. It isn't a huge step from that to some other commonly repeated hunk of GPLed code -- I'd be particularly curious whether some protected portion of automake/autoconf code shows up often enough that it'd repeat that too.
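One naive way to catch that kind of recitation would be to normalize the output and look for known license-header text in it; a sketch, with the header abbreviated (a real check would carry the full canonical texts of the common licenses):

    import re

    def normalize(text):
        """Lowercase, drop punctuation and comment markers, collapse whitespace."""
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

    # Opening of the standard GPL header, abbreviated here.
    GPL_HEADER_SNIPPET = normalize(
        "This program is free software: you can redistribute it and/or modify "
        "it under the terms of the GNU General Public License"
    )

    def recites_gpl_header(generated):
        """True if the generated text contains the normalized GPL header opening."""
        return GPL_HEADER_SNIPPET in normalize(generated)

    print(recites_gpl_header(
        "# This program is free software: you can redistribute it\n"
        "# and/or modify it under the terms of the GNU General Public License\n"
    ))  # True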
The issue is with the users of Copilot potentially violating copyright and licences (non-attribution for instance) and with Microsoft facilitating it. (See also : A&M Records, Inc. v. Napster, Inc.)
Interesting copyright issues.
Anyone who thinks their profession will continue as-is for the long term is probably mistaken.
Anyway, there is another problem, patents, and it is much bigger. I think the Apache license has a provision about patents, but code under most other licenses may be covered by patents, and if the AI generates something similar it may fall under the patent as well.
People that use A/L/GPL usually like the virality and will complain more.
This seems pretty backwards to me. A GPL licensed data point is more permissive than an unlicensed data point.
That said, I’m glad that these data points do have explicit licenses that say “if you use this, you must do XYZ” so that it’s clear that our large ML projects are going counter to creators intent when they made it open.
I’d love to start seeing licenses about use as training data. Then maybe we’d see more open access to these models that benefit from the openness of the web. I’d personally use licenses that say if you want to train on my work, you must publish the model. That goes for my code, my writing, and my photography.
Anyways, GitHub is arguing that any use of publicly available data for training is fair use, but they also admit that it's all new and unprecedented territory as far as training data goes.
It's still amazing to me that (US-centric context here) it's well established that instructions for how to turn raw ingredients into a cake are not protectable, but code that transforms one set of numbers into another is protectable.
AI is just making the silliness of that distinction more obvious.
Why do you think that? A compiler uses human readable code to create machine code, with arbitrary optimizations and choices.
That is really quite debatable in some contexts. Declarative languages like Prolog, SQL, etc. declare what they want and the system figures out how to produce it. Much like a recipe, really.
These reductionist arguments lead nowhere. Fortunately, IP lawyers -- including Microsoft's who are fiercely pro IP when it suits them -- think in a more humanistic way and consider the years of work of the IP creator.
Food recipes are irrelevant; they often go back centuries and it's rather hard to identify individual creators. Not so in software.
That's not correct. Food recipes are created all the time and are attributed. From edible water bottles to impossible burgers, et al.
In other words, are these systems to be treated like students that learned to perform the task they do from a collection of source material, or are they to be viewed as sophisticated databases that "just" perform context-sensitive retrieval?
These are interesting and important questions and I'm glad someone is publicly asking them and that many of us at least think about them.
"But snippet proposals call out to GH, so they can know which bits of code they generated!". Sometimes; but after Bob does a co-pilot assisted session, and Alice refactors to change a snippet's location and rename some variables and some other minor changes and then commits, can you still tell if it's 95% codex-generated?
The real difference is that if one human can learn to code from public sources, then so can anyone else. Nobody is explicitly barred from accessing the same material. The AI, however, is kept proprietary. Nobody else can recreate it because people are explicitly barred from doing so. People cannot access the source code of the training algorithm; people cannot access enough hardware to perform the training; and most people cannot even access the training data. It may consist of repos that are technically all publicly available, but try downloading all of GitHub and see if they let you do that quickly, and/or whether you have enough disk space.
This puts the owners of the AI at a significant advantage over everyone else. I think this is the core of the concern.
It's not about progress or suppressing it; it's a fundamental question about whether it is OK for huge companies to profit from the work of others without so much as giving credit, and whether using AI this way represents an instance of doing so.
The latter aspect goes beyond productivity or licensing - the OP asserts that AI isn't equivalent to a student who learned from examples how to perform a task, but rather replicates (recalls) or reproduces the works of others (e.g. the training material).
It's a question that goes beyond this particular application: what about GAN-based generators? Do they merely reproduce slight variations of the training material? If so, wouldn't the authors of the training material have some kind of intellectual property rights to the generated works?
This doesn't just concern code snippets, it's a general question about AI, crediting creators, and circumventing licensing and intellectual property rights.
We already invented something for that a couple decades ago, and it's called a "library". And unlike this thing, libraries don't launder appropriation of the public commons with total disregard for those who have actually built that commons.
And even once that happens you shouldn't be worried about your job. Why? Because economically everything will be different and because your job isn't that important, it likely never was. The problems humanity faces are existential. Authoritarianism, ecosystem collapse and mass migration of billions of people.
So if you really want to "prepare", then try to make a difference in what actually matters.
Mind-blowingly hilarious armchair criticism.
By the way, code generated by Github co-pilot is likely incompatible with Microsoft's Contribution License Agreement [1]: "You represent that each of Your Submission is entirely Your original work".
This means that, for most open-source projects, code generated by Github co-pilot is, right now, NOT acceptable in the project.
[1] https://opensource.microsoft.com/pdf/microsoft-contribution-license-agreement.pdf
For this scenario, how is using Co-Pilot generated code different from using code based on sample code, Stack Overflow answers [1], etc.?
My point is: when I'm copying code from a source with an explicit license, I know whether I'm allowed to copy it. If I pick code from co-pilot, I have no idea (until tested by law in my jurisdiction) whether said code is public domain, AGPL, proprietary, infringing on some company's copyright.
[1] https://stackoverflow.com/legal/terms-of-service#licensing
I have recommended as such to the CTO and other senior engineers at the startup I work at, pending some clear legal guidance about the specific licensing.
My casual read of Copilot suggests that certain outputs would be clear and visible derivatives of GPL code, which would be _very bad_ in court- probably? Some other company can have fun in court and make case law. We have stuff to build.
(Reposting my comment from yesterday)
"Can't wait to see a case for this go in front of an 80 year old judge who rules something arbitrary and justifies it with an inaccurate comparison to something nontechnical."
Now, is my produced code also a GPL derivative, given that I certainly did read through the code base to be able to write larger programs?
"""
"but eevee, humans also learn by reading open source code, so isn't that the same thing" - no - humans are capable of abstract understanding and have a breadth of other knowledge to draw from - statistical models do not - you have fallen for marketing
this may be a matter of time and thus is not a fundamental objection.
If mankind should fail to answer the perennial question of exploitation of the other and the same, it will be doomed. And rightly so, for mankind must answer this question; it must answer to this question. Instead what we do is increase monetary output and then go and brag about efficiency. Neither is this efficient, nor is it about efficiency, nor has the Universe ever cared about efficiency. It just happens to coincide with the religion chosen by those whom Society has decided to look up to most.
It is not my religion to be sure.
All that said, I'm not confident that anyone will stop them in court anyway. This hasn't tended to be very easy when companies infringe other open source code copyright terms.
Until it is cleared up though, it would seem extremely unwise for anyone to use any code from it.
To the extent that GPT-3 / co-pilot is just an over-fitted neural net, its primary value is as an automated search, copy, and paste.
Reminder - software engineers, our code, and our GPL licenses are not special.
We have a guy that brought over his task manager codebase (he re-wrote it), but it's the same thing he used at 2 other companies.
I have written 3 MPIs (master person/patient index) at this point all with the same fundamental matching engine.
I mean, one thing we can all agree on is that ML is good at copying what we already do.
musicians, artists, all kinds of athletes, all grow by watching observing and learning from others. as if all these open source projects got to where they are without looking at how others did things.
i don't think a single function, similar syntax or basic check function is worth arguing about, it's not like co-pilot is stealing an entire code base and just plopping it out by reading your mind and knowing what you want. i know developers that have certainly stolen code and implementation details from past employers and that was just fine.
being able to produce valid code is not the bottleneck of any developer effort. no projects fail because code can't be typed quickly enough.
the bottleneck is understanding how the code works, how to design things correctly, how to make changes in accordance with the existing design, how to troubleshoot existing code, etc.
this tool doesn't make anything any easier! it makes things harder, because now you have running software that was written by no one and is understood by no one.
I use the React plugin for Webstorm to avoid having to write the boilerplate for FCs. Maybe in the future Copilot will replace that usage.
I think we should strive to improve our programming languages to make less of this boilerplate necessary, not to make generating boilerplate easier. The latter is just going to make software more and more unwieldy. Imagine the horror if, instead of (relatively) higher level programming languages like C, we were all just using assembly with code generation.
I really like your point on symptoms of insufficient abstraction. I do worry that we always see abstraction as belonging in language. Which in turn we treat as a precious singleton, and fight about.
At least in my own hacking, I'm surprised how infrequently I see programmers write programs that write programs. I'm surprised how infrequently I see programmers programming their shell, editor, or IDE.
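Even something trivial counts, e.g. generating the FC boilerplate mentioned a few comments up from a tiny template instead of typing it by hand (all names here are made up for illustration):

    # A tiny "program that writes a program": emit React functional-component
    # boilerplate from a component name.
    TEMPLATE = """\
    import React from 'react';

    const {name} = (props) => {{
      return <div>{{/* TODO: {name} */}}</div>;
    }};

    export default {name};
    """

    def make_component(name):
        return TEMPLATE.format(name=name)

    print(make_component("UserCard"))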
What we really need aren't tools that help us write code faster, but tools that help us understand the design of our systems and the interaction complexity of that design.
That doesn't help anyone!!
I am usually pretty pro-Microsoft, but this tool is a security nightmare and a bad idea all around. It will cause many (most? all?) who use it far more work than it saves them, long-term.
This implies that by just changing the variable names, the snippets are classed as non-verbatim.
I don't buy that this number is anywhere close to the actual figure if you assume that you can't just change function names and variable names and suddenly say you have escaped both the legality and the spirit of GPL.
Well, they were ignored and this is the result: a for-profit company built a proprietary system using all the code hosted on its platform, without respecting the code licenses.
There will be a lot of people saying this is not a license violation, but it is, and more than that, it is an exploitation of other people's work.
Right now I'm asking myself when people will stop supporting this kind of company, which exploits people's work without giving anything in return to people and society while making a huge amount of profit.
If we read a book and use its instructions to build a bicycle, is it an exploitation of people's work?
No, no it's not.
Of course we generate the world around us and its rules, but I get angry every time we compare people to machines and say that it's the same thing. No, it's not. We are constrained by time and space. I can't add more brain or more eyes to my body so I can read more books, can I? Microsoft can have a small city of servers somewhere and that could replace lots of people's jobs.
When you read a book and copy it partially or entirely to create a new book, or create a derivative work using it without citation, it's called plagiarism and copyright infringement. It is not only exploitation, it is against the law.
If you feed an entire library to an AI to generate new books without source citation and copyright agreements, it is not only exploitation, it is against the law. We can call this automated plagiarism and copyright infringement, and automated or not, it is against the law. The exception is if you use public domain books: that wouldn't be illegal, but it would still be highly unethical, considering there are powerful companies with big pockets bending public domain laws to keep their assets from becoming publicly available (I'm looking at you, Disney), but that is another story.
The top one reads just like an ad: https://news.ycombinator.com/item?id=27676845
Some posts that definitely aren't by shills (including the third one because I simply don't believe there's a person on the planet that "can't remember the last time Windows got in my way"): https://news.ycombinator.com/item?id=27678231 https://news.ycombinator.com/item?id=27686416 https://news.ycombinator.com/item?id=27682270
Very mild, yet negative sentiment opinion (downvoted quickly): https://news.ycombinator.com/item?id=27676942
As an example, my grandfather (an old school EE who got his start on radar systems in the 50s, and who then got his radiology MD when my Jewish grandmother berated him enough with "an engineer's not a doctor, though...") has some really cool patents around highlighting interesting parts of the frequency domain in MRIs that should make detection of cancer a whole lot easier. As an implementation he did a bunch of tensor calculus by hand to extract and highlight those features, because he's an incredibly smart old school EE with 70 years of experience cranking that kind of thing out with only his trusty slide rule. He hasn't gotten any uptake from MRI manufacturers, but they're all suddenly really into recurrent machine learning models to highlight the same sorts of stuff. Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand-written rather than back-propagated.
I'm personally pretty anti intellectual property (at least how it's implemented in the states), but a system where large entities that have the capital investment to compute the large ML models can launder IP violations, but little guys get stuck to the letter of the law certainly seems like the worst of both worlds to me.
How many models are back-propagated first and then hand-tuned?
Any chance some fantastic HNer could chime in there?
Two examples I can think of: doing linear regression on the square of your input, and, for deep learning, improving visual representations by taking samples of the colors at various frequencies. [1]
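To make the first one concrete: hand-craft the squared feature yourself, then let ordinary least squares do the rest (a minimal sketch with synthetic data):

    import numpy as np

    # Synthetic data that is quadratic in x; a plain linear fit on x alone would do poorly.
    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50)
    y = 2.0 * x**2 + 1.0 + rng.normal(scale=0.1, size=x.shape)

    # The "hand-tuned" part: we choose the feature x**2 ourselves.
    features = np.column_stack([x**2, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)

    print(coeffs)  # approximately [2.0, 1.0]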
And like I said, I'm pretty anti US structures around intellectual property (including software patents), but I'm not in favor of the only ones able to circumvent the legal process being entities with large banks of capital.
> "but eevee, humans also learn by reading open source code, so isn't that the same thing" - no - humans are capable of abstract understanding and have a breadth of other knowledge to draw from - statistical models do not - you have fallen for marketing
Machines will draw on other sources of knowledge besides the GPL code. Whether they have the capacity for "abstract thought" is probably up for debate. There's not much else said in those bullets. It's not a good argument.
If heart surgeons train an AI robot to do heart surgery ... shouldn't they be compensated (as passive income) for enabling that automation?
Shouldn't this all be accounted for? If my code helps you write better code (via AI) shouldn't I be compensated for the value generated?
We are being ripped off.
I don't understand the second sentence, i.e. where's the proof?
This does not mean that any GitHub Co-Pilot produced code is suddenly free of license or patent concerns. If the code produces something that matches too closely GPL or otherwise licensed code on a particularly notable algorithm (such as video encoder), you may still be in a difficult legal situation.
You are in essence using "not-your-own-code" by relying on CoPilot, which introduces a risk that the code may not be patent/license free, and you should be aware of the risk if you are using this tool to develop commercial software.
The main issue here is that many average developers may continue to stamp their libraries as MIT/BSD, even though the CoPilot-produced code may not adhere to that license. If the end result is that much of the OSS ecosystem becomes muddied and tainted, this could slowly erode trust in open licenses on GitHub (i.e. the implications would be that open source libraries could become less widely used in commercial applications).
Where do you draw the line? That's for the courts to decide!
And this GitHub co-pilot in no way infringes on full codebases.
I'm getting a lot of suggestions that make no sense. What's worse, the suggested code has invalid types and won't compile. I'm surprised they didn't prune the solution tree via compiler validation.
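The pruning idea is simple enough to sketch: generate several candidates, keep only those that at least parse (or type-check), and rank the survivors. Purely illustrative; nothing says this is how Copilot works internally:

    import ast

    def prune_candidates(candidates):
        """Keep only completions that are at least syntactically valid Python."""
        valid = []
        for source in candidates:
            try:
                ast.parse(source)  # a real system might also type-check or run tests
            except SyntaxError:
                continue
            valid.append(source)
        return valid

    suggestions = [
        "def area(r):\n    return 3.14159 * r ** 2\n",   # fine, kept
        "def area(r)\n    return 3.14159 * r ** 2\n",    # missing colon, dropped
    ]
    print(len(prune_candidates(suggestions)))  # 1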