this post was submitted on 24 Mar 2024
1226 points (97.4% liked)

Memes

Rules:

  1. Be civil and nice.
  2. Try not to repost excessively. As a rule of thumb, wait at least 2 months before reposting if you have to.

founded 5 years ago
[–] Natanael@slrpnk.net 19 points 3 months ago (3 children)

Training from scratch and retraining are expensive. Also, they want to avoid training on ML outputs; they want primarily human-made works as samples, and since the initial public release of LLMs it has become harder to build large datasets without ML-generated material in them.

[–] scrubbles@poptalk.scrubbles.tech 13 points 3 months ago* (last edited 3 months ago)

There was a good paper that came out recently saying that training on ML-generated data will result in a collapse of coherence. It's going to be real interesting. I don't know if they'll be able to train as easily ever again.
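
To make the idea concrete, here's a toy sketch of that kind of collapse. It is not the paper's method, just an illustration under loud assumptions: the "model" is a plain Gaussian, "training" means fitting a mean and spread to 10 samples, and each generation learns only from the previous generation's outputs. The function name `simulate_recursive_training` and all its parameters are made up for this example.

```python
import random
import statistics


def simulate_recursive_training(generations=200, sample_size=10, seed=0):
    """Toy model: each generation is fit only to samples drawn from the last one.

    Hypothetical illustration, not a real training pipeline.
    Returns the fitted spread (standard deviation) at every generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "human-made" data
    history = [sigma]
    for _ in range(generations):
        # "Generate" a small dataset from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model on that generated data alone.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_recursive_training()
print(f"generation 0 spread: {history[0]:.3f}")
print(f"generation {len(history) - 1} spread: {history[-1]:.3g}")
```

In this toy setup the spread tends to shrink generation after generation, because each fit on a small sample underestimates the variety in its source and nothing ever adds it back. Real models are far more complicated, so treat this as intuition only.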

[–] Iron_Lynx@lemmy.world 4 points 3 months ago

I recall spotting a few things about image generators having their training data contaminated with generated images, and the output becoming significantly worse. So yeah, I guess LLMs and IGAs need natural sources, or it gets more inbred than the Habsburgs.

[–] TurtleJoe@lemmy.world 0 points 3 months ago

I think it's telling that they acknowledge that the stuff their bots churn out is often such garbage that training their bots on it would ruin them.