I've been writing Python since the 1.4 days — over 30 years now. I'm a Fellow of the Python Software Foundation, and the Python open source community has been a foundational part of my career. So when a licensing controversy erupts in one of the ecosystem's most venerable libraries, I pay attention.
Last week, a fascinating situation unfolded around chardet, a Python library for character encoding detection that's been around since 2006. The library was originally created by Mark Pilgrim and released under the LGPL. Dan Blanchard, who has maintained chardet for over a decade, released version 7.0.0 with a bombshell in the release notes: a "ground-up, MIT-licensed rewrite." Same package name, same API — but a completely new implementation, generated with the help of Claude Code.
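For readers who haven't used it: chardet answers the question "what encoding are these bytes in?" Its real algorithm uses statistical language models ported from Mozilla's universal charset detector; the toy sketch below is not chardet's approach, just a naive stdlib-only illustration of the problem — try candidate codecs in order and return the first that decodes cleanly.

```python
def naive_detect(data: bytes, candidates=("ascii", "utf-8", "cp1252")):
    """Toy encoding guesser: return the first candidate codec that
    decodes the bytes without error, or None if all of them fail.
    Far less accurate than a statistical detector like chardet."""
    for encoding in candidates:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None

print(naive_detect(b"plain ascii"))              # ascii
print(naive_detect("héllo".encode("utf-8")))     # utf-8
```

The ordering matters: permissive codecs like cp1252 decode almost any byte sequence, so they have to come last — one reason real detectors score candidates statistically instead of short-circuiting like this.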
Mark Pilgrim wasn't having it. He argued that Dan had no right to relicense the project, that having extensive exposure to the original codebase meant this couldn't be a legitimate clean-room implementation, and that the new code is a derivative work regardless of how it was produced.
Simon Willison wrote an excellent analysis of the situation, walking through the nuances and landing somewhere near "I think the rewrite is legitimate, but the arguments on both sides are credible." Armin Ronacher — creator of Flask and Jinja — took a similar position, arguing that if you throw away all the code and start from scratch, even if the result behaves the same, "it's a new ship."
Now, my hot take: I think the current maintainer made a serious error in judgment — but not for the reasons you might expect.
The problem isn't that he used AI to create a new implementation. The problem isn't even that he wanted to move from LGPL to MIT — I'm personally not a fan of GPL/LGPL for the vast majority of open source software, and I greatly prefer the much more liberal MIT license. The problem is that he piggybacked on the existing project. He used the same package name, the same repository, the same PyPI listing. He treated it as a new version of an existing thing rather than what it actually is: a new thing.
I agree with Mark Pilgrim that, given how this was done, the new implementation is a derivative work. Dan had over a decade of exposure to the original codebase. Claude itself was almost certainly trained on chardet's source. The "clean room" separation simply wasn't there. That said, I am not convinced that the "clean room" separation is a reasonable expectation in the first place.
If Dan had created a brand new project on GitHub with a different name — say, chardetect or pychardet-mit — this would be a big fat nothing-burger. As others have pointed out, the new code is structurally independent, with a max similarity of 1.29% compared to the previous release. It's clearly a brand new implementation with an altogether different design. A new project under a new name would have been a perfectly defensible exercise, regardless of the fact that Dan has been maintaining the current project for a decade.
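I don't know which tool produced that 1.29% figure, and difflib's ratio is not necessarily the same metric, but a rough stand-in for this kind of comparison using only the standard library might look like the following sketch — quantifying line-level overlap between two source texts.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two source texts,
    comparing them line by line with difflib's SequenceMatcher."""
    matcher = difflib.SequenceMatcher(
        a=a.splitlines(), b=b.splitlines(), autojunk=False
    )
    return matcher.ratio() * 100

old = "def detect(data):\n    return sniff(data)\n"
new = "class Detector:\n    def feed(self, chunk):\n        self.buf += chunk\n"
print(f"{similarity(old, new):.2f}% similar")  # no shared lines, so 0.00
```

Run over two whole release trees file by file, a score this low is strong evidence of structural independence — though, as the licensing argument shows, structural independence and legal independence are not the same thing.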
We're entering a world where the cost of creating software is plummeting. If you encounter a library you'd love to use but can't because of a restrictive license, it's increasingly feasible to just... generate a new implementation. The technical barrier that once made copyleft licenses practically enforceable — the sheer expense of rewriting code from scratch — is evaporating.
But this opens up a genuinely hard problem: how do you prove that AI-generated code isn't a derivative work? If the model was trained on the LGPL'd code (spoiler: it almost certainly was), and uses it as a reference, then the output is arguably tainted. But that's all behind the scenes, inside the model's weights. How in the world are you supposed to prove — or disprove — that the newly generated code was implemented without licensing violations?
This is a hard problem, and it's only going to get harder. I expect we'll see well-funded litigation in the commercial world soon, as companies realize that their closely held IP can be functionally replicated by anyone with access to a capable coding agent. The chardet situation is a small-scale preview of a much larger reckoning.
For now, my advice is simple: if you're going to use AI to create a new implementation of existing software, make it clearly, unambiguously new. New name, new repo, new identity. Don't try to relicense an existing project from within. That's where Dan went wrong, and it's a lesson the broader community would be wise to learn quickly.