As a civil matter, the publishing houses are more likely to get paid in full if Anthropic stays in business (and does well). So it might be bad, but I'm really skeptical about bankruptcy (and I'm not hearing anyone seriously floating it?)
Artisian
Plaintiffs made that argument and the judge shot it down pretty hard: that kind of competition isn't what copyright protects against. He makes an analogy with teachers teaching children to write fiction: they are using existing fantasy to create MANY more competitors on the fiction market. Could an author use copyright to challenge that use?
Would love to hear your thoughts on the ruling itself (it's linked by Reuters).
Depends on the content and the method. There are tons of ways to encrypt data, and under relevant law they may still count as copies. There are certainly weaker NN models where we can extract a lot of the training data, even if it's not easy, from the model parameters (even if we can't find a prompt that gets the model to regurgitate).
I also read through the judgement, and I think it's better for Anthropic than you describe. He distinguishes three issues:
A) Use any written material they get their hands on to train the model (and the resulting model doesn't just reproduce the works).
B) Buy a single copy of a print book, scan it, and retain the digital copy for a company library (for all sorts of future purposes).
C) Pirate a book and retain that copy for a company library (for all sorts of future purposes).
A and B were found to be fair use on summary judgement, meaning this judge thinks it's clear-cut in Anthropic's favor. C will go to trial.
I'm still looking for a good reason to believe critical thinking and intelligence are taking a dive. It's so very easy to claim the kids aren't all right. But I wish someone would check. An interview with the GPT cheaters? A survey checking that those brilliant essays aren't from people using better prompts? Let's hear from the kids! Everyone knows nobody asked us when we were being turned into ungrammatical zombies by spell check/grammar check/texting/video content/iPads/the calculator.
Idk how dystopian this really is. Once your data is explicitly your property, we can have a much better dialog about data brokers. Imagine the class action lawsuits against data breaches.
This sounds like something legitimately terrifying, but I'm struggling to make it concrete. Could you expand on the example a bit?
I felt this way until recently, but I'm becoming much more aware of how limited our collective attention is. Every honest belief probably deserves to have one (maybe 3) reasonable people listen to it. But they definitely aren't all worth national/state/city/expert attention.
If you wanna go the extra mile, skimming an ally guide for 10 minutes and looking up some terminology and concepts would reduce awkwardness by a fair bit. I certainly would have avoided a half dozen missteps if I had done some reading.
As I understand it, there are many, many such models, especially those made for academic use. Some common training corpora are listed here: https://www.tensorflow.org/datasets
Examples include Wikipedia edits and discussions, and open source scientific articles.
Almost all research models are going to be trained on stuff like this. Many of them have demos, open code, and local installation instructions. They generally don't have a marketing budget. Some of the models listed here certainly qualify: https://github.com/eugeneyan/open-llms?tab=readme-ov-file
Both of these are lists that are not so difficult to get on, so I imagine some of the entries have trouble with falsification or mislabeling, as you point out. But there's little reason for people to do so (beyond improving a paper's results, I guess?).
Art generation seems to have had a harder time, but there are Stable Diffusion equivalents that used only CC work. A few minutes of searching found Common Canvas, which claims to have been competitive.
I would love to see the source on this one. It sounds fascinating.