this post was submitted on 01 Apr 2025
Technology
I disagree. Scaling might seem trivial now, but the state-of-the-art NLP architectures of a decade ago (LSTMs) could not scale to the degree that our current methods can. Designing new architectures that run better on GPUs (such as attention-based Transformers and Mamba) is a legitimate advancement. On top of that, the viability of this level of scaling wasn't really understood until phenomena like double descent (in which test error surprisingly goes down, rather than up, once model complexity increases past a certain point) were discovered.
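To make the GPU point concrete, here's a rough toy sketch (my own illustration, not taken from any real model). `rnn_states` is a plain tanh recurrence standing in for an LSTM, not a full LSTM cell, and `attention_row` is unprojected dot-product self-attention on scalars, with no learned query/key/value weights. All names and constants are made up for the example. The point is only the dependency structure: each recurrent step needs the previous one, while attention rows don't depend on each other.

```python
import math


def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Simple tanh recurrence (a stand-in for an LSTM, not a real LSTM cell)."""
    h, states = 0.0, []
    for x in xs:
        # Step t needs the hidden state from step t-1, so the loop is sequential.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def attention_row(i, xs):
    """Output for position i of toy dot-product self-attention on scalars."""
    scores = [xs[i] * x for x in xs]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, xs))


def attention_outputs(xs, order=None):
    """Compute every row; rows are independent, so any order gives the same result."""
    order = range(len(xs)) if order is None else order
    out = [None] * len(xs)
    for i in order:
        out[i] = attention_row(i, xs)
    return out


xs = [0.1, -0.4, 0.7, 0.2]
forward = attention_outputs(xs)
backward = attention_outputs(xs, order=[3, 2, 1, 0])
print(forward == backward)  # row order does not matter
print(rnn_states(xs))       # each value depends on all earlier steps
```

On real hardware the independent rows become one batched matrix multiply, which is what makes this kind of layer a good fit for GPUs.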
Furthermore, a lot of advancements were necessary to train deep networks at all. Better optimizers like Adam instead of pure SGD, and tricks like residual connections and batch normalization, were all needed to scale up even small ConvNets, because they work around issues such as vanishing gradients and internal covariate shift that tend to appear when naively training deep networks.
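If you want to see the vanishing gradient issue for yourself, here's a minimal pure-Python sketch (again my own toy, with hypothetical names). It backpropagates a unit gradient through a stack of random tanh layers, with and without skip connections. The weight scale `0.5 / sqrt(width)` is an assumption picked deliberately small so the effect is obvious; it is not a recommended initialization.

```python
import math
import random


def matvec(W, v):
    """Compute W @ v for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matvec_t(W, v):
    """Compute W.T @ v for a list-of-rows matrix."""
    return [sum(W[i][j] * v[i] for i in range(len(W))) for j in range(len(W[0]))]


def input_grad_norm(depth, width, residual, seed=0):
    """Norm of the gradient reaching the input of a random tanh stack."""
    rng = random.Random(seed)
    sigma = 0.5 / math.sqrt(width)  # assumed scale, chosen to expose the effect
    weights = [
        [[rng.gauss(0.0, sigma) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]

    # Forward pass: remember tanh'(z) = 1 - tanh(z)^2 for every layer.
    slopes = []
    for W in weights:
        a = [math.tanh(z) for z in matvec(W, h)]
        slopes.append([1.0 - t * t for t in a])
        h = [x + y for x, y in zip(h, a)] if residual else a

    # Backward pass: start from a unit-norm gradient at the output.
    g = [1.0 / math.sqrt(width)] * width
    for W, d in zip(reversed(weights), reversed(slopes)):
        back = matvec_t(W, [di * gi for di, gi in zip(d, g)])
        # The skip connection adds an identity path for the gradient.
        g = [gi + bi for gi, bi in zip(g, back)] if residual else back
    return math.sqrt(sum(x * x for x in g))


plain = input_grad_norm(depth=40, width=16, residual=False)
resid = input_grad_norm(depth=40, width=16, residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {resid:.3e}")
```

In the plain stack the gradient shrinks by roughly a constant factor per layer, so almost nothing reaches the early layers. The residual version keeps an identity path, so the signal survives the full depth.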