Wow, thank you, I think this is the first argument that clicked for me.
But it does raise two questions for me:
About your first point: think of it like inbreeding; you need fresh genes in the pool or harmful mutations accumulate.
A generative model will produce some relevant results and some irrelevant ones; it's the job of humans to curate them.
However, the more content the LLM generates, the more of it ends up on the web and thus becomes part of its own training data.
Imagine that 95% of its output is accurate, and that unchecked errors amounting to 1% of the total output slip past fact-checking and get released onto the internet. Other humans will complain about them, but they become LLM training input regardless. So the next round trains on input that is only 99% accurate, and the model's output will again be only 95% as accurate as its input: roughly 0.95 × 0.99 ≈ 94%.
It's literally a geometric sequence, and it reaches very inaccurate values very fast.
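As a rough sketch of that decay (assuming, for simplicity, a constant 95% per-generation retention rate instead of the exact figures above), accuracy after n rounds of self-training falls off like 0.95^n:

```python
# Toy model of recursive-training degradation ("model collapse").
# Assumption: each generation keeps 95% of the previous generation's
# accuracy, i.e. accuracy after n generations = 0.95 ** n.
accuracy = 1.0
for generation in range(1, 21):
    accuracy *= 0.95
    if generation % 5 == 0:
        print(f"generation {generation:2d}: accuracy = {accuracy:.2f}")

# generation  5: accuracy = 0.77
# generation 10: accuracy = 0.60
# generation 15: accuracy = 0.46
# generation 20: accuracy = 0.36
```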
You can mitigate it by not training on generated data, but as long as AI content replaces genuine content, especially images, AI will end up training on its own output and will degenerate fast.
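That mitigation amounts to a curation step like the sketch below. Note that `detect_ai_generated` is a hypothetical classifier, not a real library call, and in practice such detectors are unreliable, which is part of the problem:

```python
# Hypothetical curation step: drop suspected AI-generated documents
# before they enter the next training corpus. `detect_ai_generated`
# stands in for some classifier; real detectors are far from reliable.
def curate(corpus: list[str], detect_ai_generated) -> list[str]:
    # Keep only documents the detector judges to be human-written.
    return [doc for doc in corpus if not detect_ai_generated(doc)]
```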
About the second point: you can pay artists to train models, sure, but it's less clear how that works for text-based generative models, which depend on expert input to give relevant responses. The same goes for voice models: no amount of money would be enough for a voice actor, because recording the training data would effectively destroy their future jobs and thus their future income.
Can it ever get to the point where it isn't vulnerable to this? Maybe. But that would require an entirely different architecture from anything any contemporary AI company is working on. All of today's transformer-based LLMs are vulnerable to it.
That would be fine; it's what they should have done to train these models in the first place. Instead, they're all built on IP theft: the companies were simply too cheap to pay for training data. If they had hired their own artists to create it, I would certainly lament the commodification and corporatization of art, but that's something that's been happening since long before OpenAI.
Thank you, out of all of these replies I feel like you really hit the nail on the head for me.