Copilot and Snippet Scanning

By Kate Downing, with Input from Aaron Williamson

In the wake of Copilot’s release, I’ve seen an uptick in questions related to snippet scanning and whether or not that may be desirable for open source compliance purposes. I believe that the answer is still “no.”

First, GitHub has created filters that prevent Copilot from making suggestions that exactly match any public code on GitHub. I’m not aware of any open source scanning tool capable of identifying a non-exact snippet match, so I’m not sure what sort of snippet matches one might receive if these filters are activated – chances are the matches won’t be coming from Copilot suggestions. These filters are not hidden away or difficult to enable, and they must be turned on in order for an organization to be eligible for Copilot’s indemnity offer. They can be turned on for the entire organization.
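To make the exact-versus-non-exact distinction concrete, here is a toy sketch of exact-match fingerprinting. It is an assumption for illustration only: it does not describe how Copilot’s filter or any particular scanning tool actually works, and the function name, window size, and sample code are all hypothetical. It hashes every run of three whitespace-normalized lines and looks for shared hashes.

```python
import hashlib


def fingerprints(source: str, window: int = 3) -> set[str]:
    """Hash every run of `window` consecutive non-blank lines.

    Whitespace inside each line is collapsed first, so re-indented
    copies still match. Anything else (a renamed variable, a reordered
    statement) changes the hash. Hypothetical helper, not a real tool's API.
    """
    lines = [" ".join(line.split()) for line in source.splitlines()]
    lines = [line for line in lines if line]
    return {
        hashlib.sha256("\n".join(lines[i:i + window]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - window + 1)
    }


# Stand-in for a piece of public code (made up for this example).
public_code = (
    "def clamp(value, low, high):\n"
    "    if value < low:\n"
    "        return low\n"
    "    if value > high:\n"
    "        return high\n"
    "    return value\n"
)

# A verbatim copy with different indentation, and a copy with one name changed.
reindented_copy = public_code.replace("    ", "\t")
renamed_copy = public_code.replace("value", "v")

index = fingerprints(public_code)
exact_hits = fingerprints(reindented_copy) & index
renamed_hits = fingerprints(renamed_copy) & index

print(len(exact_hits) > 0)  # the re-indented copy is caught
print(len(renamed_hits))    # the renamed copy shares no fingerprints
```

Under these assumptions, a single renamed variable is enough to defeat the match, which is why an exact-match filter and an exact-match scanner tend to cover the same ground.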

It’s good practice for engineers to look skeptically at any lengthy Copilot suggestion, because the chance that a suggestion is copyrightable (and hence the possibility of copyright infringement from using it) increases with its length. When you receive a lengthy suggestion, it’s also worthwhile to consider whether it would be better to get the same functionality by adding an open source dependency instead. That’s because a piece of distinct open source code from an actively managed project will be updated and patched by someone else, whereas an unidentifiable suggestion from Copilot will not be. Likewise, distinct OSS can trigger security alerts from various open source monitoring tools, but those tools are less likely to identify a vulnerability in a file that just looks like company code rather than third party code. GitHub has announced that it is also working on a feature that provides references to OSS projects for certain suggestions, making it even easier to add OSS dependencies when extensive functionality is desired. If engineers are looking closely at lengthy suggestions and filters are also turned on, the chances of code that’s actually copyrightable ending up in a company’s product are quite low.

Second, even when a snippet scanner turns up an exact match, it might mean very little. The snippet may not be copyrightable, or may reflect a common code pattern used by many projects. Remember that Copilot is basically autocomplete for code and it biases toward producing code that appears in the training data most often. Open source scanners might identify the code as coming from a particular project, but they’re incapable of listing ALL the projects the same code appears in. That means that even if you attribute the project identified by the scanner, that project may not even be the originator of that code. Some other project could have written it first and the attribution made by the scanner may be incorrect, or the authors of multiple projects may have written the same code independently. I’ve personally seen code scanners attribute snippets to very large, very popular projects, when the snippet is actually found in a subcomponent owned by someone else entirely, written long before the popular project came into existence. And of course, the more often the code has appeared in various projects, the more likely it is that the code is purely functional (and not copyrightable) and it appears in multiple projects because that’s just how something is done in a particular language.

Third, if the concern is patents rather than copyrights, I’d argue that it’s extremely difficult to embody an entire patent in just a snippet of code.

Fourth, one has to look practically at the possibility of actual legal enforcement in this context. I’m not aware of any litigation based merely on snippets. Every open source-related lawsuit I’m aware of involved taking substantial portions of libraries, drivers, even operating systems without proper attribution or source code offers. Even if one were in the business of trolling, trolling merely on the basis of snippets and nothing more is just not profitable. So many companies out there are not doing even basic compliance for entire Linux distributions that there’s really no reason to spend time and money arguing about far grayer cases like snippets, which the plaintiff is less likely to win and which will be more costly because the plaintiff will need to bring in evidence and experts to defend the copyrightability of the snippet. There is far less dispute about the copyrightability of entire libraries and operating systems.

Infringing snippets are also hard to find, particularly if they’re embedded in a SaaS product or software that is distributed only in executable (as opposed to source code) form. Techniques for finding open source software in binary software distributions are limited. Often, enforcement efforts are based on the inclusion of complete open source components, where the components can be identified by their filenames, or by the output when they are run. Open source components may also be identified by strings (quoted text) that are unique to that component, because when source code is compiled into binary form, those strings can still be found in the binary. But a short snippet compiled into another piece of software is unlikely to be identified by either technique.
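The string-based technique described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a description of any real enforcement tool: the component name `examplelib`, its marker string, and the sample bytes are all made up, and `extract_strings` is a simplified stand-in for what the Unix `strings` utility does.

```python
import re


def extract_strings(data: bytes, min_len: int = 6) -> set[str]:
    """Return runs of printable ASCII at least `min_len` bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {match.group().decode("ascii") for match in pattern.finditer(data)}


def identify_components(data: bytes, signatures: dict[str, list[str]]) -> dict[str, list[str]]:
    """Report which known marker strings appear among a binary's strings."""
    found = extract_strings(data)
    hits: dict[str, list[str]] = {}
    for component, markers in signatures.items():
        matched = [m for m in markers if any(m in s for s in found)]
        if matched:
            hits[component] = matched
    return hits


# Hypothetical signature database: a string unique to one component.
SIGNATURES = {"examplelib": ["examplelib: invalid header checksum"]}

# Made-up bytes standing in for a binary that bundles the whole component:
# its error message survives compilation as readable text.
binary_with_component = (
    b"\x7fELF\x02\x01\x01\x00"
    + b"examplelib: invalid header checksum\x00"
    + b"\x55\x48\x89\xe5\x31\xc0\x5d\xc3"
)

# Made-up bytes standing in for a binary containing only a short compiled
# snippet: machine code with no distinctive strings to match.
binary_with_snippet_only = b"\x7fELF\x02\x01\x01\x00" + b"\x55\x48\x89\xe5\x31\xc0\x5d\xc3"

print(identify_components(binary_with_component, SIGNATURES))
print(identify_components(binary_with_snippet_only, SIGNATURES))
```

In this sketch the full component is flagged because its quoted text survives in the binary, while the snippet-only binary yields nothing to match against.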

In order to have standing to enforce a copyright license, a copyright holder has to register their copyrights. Most open source developers do not do this. Even many corporations do not do this. Back in 2018, there was a study about how many people actually complied with Stack Overflow’s Creative Commons ShareAlike 3.0 license. Stack Overflow is probably the single most common source of snippets picked up by open source scanners. The answer is that basically nobody complies with these licensing terms. In no small part, that’s because the people posting on Stack Overflow don’t bother to register the copyrights in their snippets. They also generally have no particular interest in enforcing those licensing terms. Expensive enforcement litigation makes sense for non-profits dedicated to enforcement, large corporations, and serial trolls, not everyday contributors, much less coders answering questions on public forums.

Fifth, snippet scanning is almost always a distraction from higher-priority compliance issues. For example, most organizations still don’t properly do open source compliance for virtualized or containerized images, failing to provide attribution or offer source code for entire containers, applications, and operating systems. So, spending time chasing down snippets while still not having figured out containerization is bad risk-management. And in my experience, the tools focused on the far less risky subject of snippets are also much worse at dealing with containerization.

Sixth, snippet scanning is not industry-standard. There are many open source scanning tools out there, but only a handful do snippet scanning, and only a subset of those tools’ customers are chasing snippet matches down. The entire tech industry has embraced Copilot – there are really only a few notable exceptions to my knowledge. Which means that in some ways we are back where we started – deeper pockets are at higher risk of enforcement and smaller companies continue to fly under the radar. The number of entities in a position to do OSS enforcement hasn’t changed, and whatever the total budget for that enforcement is, it remains the same. I don’t think Copilot is going to induce more people to enter the trolling business for the reasons laid out above (lawsuits against GitHub itself notwithstanding). So given that the actual risk here is the same, it does not make sense to reallocate company compliance budgets to spend time and money on the less risky issue of snippets, in lieu of other, more substantive potential violations.

Conclusion

When selecting tools, it doesn’t make sense to prioritize great snippet identification over things like a better ability to identify secondary licenses buried in source code, automated customer-facing attribution files that actually reproduce the copyright notices and licenses from the source code, identification of transitive dependencies, the ability to work with more programming languages and build systems, or good container handling (especially separation of the application layer from the operating system layer). For me, snippet identification is absolutely the least important feature of a software scanning tool.
