Technology

60052 readers

2780 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related content.
Be excellent to each another!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, to ask if your bot can be added please contact us.
Check for duplicates before posting, duplicates may be removed

Approved Bots

founded 2 years ago

MODERATORS

Alignment faking in large language models (www.anthropic.com)

submitted 11 hours ago by Joker@sh.itjust.works to c/technology@lemmy.world

9 comments fedilink hide all child comments

top 9 comments

sorted by: hot top controversial new old

[–] atrielienz@lemmy.world 4 points 4 hours ago* (last edited 4 hours ago)

This may not be factually wrong but it's not well written, and probably not written by a person with a good understanding of how Gen AI LLM'S actually work. This is an algorithm that generates the next most likely word or words based on its training data set using math. It doesn't think. It doesn't understand. It doesn't have dopamine receptors in order to "feel". It can't view "feedback" in a positive or negative way.

Now that I've gotten that out of the way, it is possible that what is happening here is that they trained the LLM on a data set that has a less than center bias. If it responds to a query with something generated statistically from that data set, and the people who own the LLM don't want it to respond with that particular response they will add a guardrail to prevent it from using that response again. But if they don't remove that information from the data set and retrain the model, then that bias may still show up in responses in other ways. And I think that's what we're seeing here.

You can't train a Harry Potter LLM on both the Harry Potter Books and Movies and the Harry Potter online fanfiction available and then tell it not to respond to questions about canon with fanfiction info if you don't either separate and quarantine that fanfiction info, or remove it and retrain the LLM on a more curated data set.

[–] Escew@lemm.ee 7 points 7 hours ago

The way they showed the reasoning of the AI using a scratchpad makes it very hard not to believe these large language models are not intelligent. This study seems to imply some self awareness/self preservation behaviors from the AI.

[–] hendrik@palaver.p3x.de 23 points 11 hours ago* (last edited 11 hours ago) (1 children)

Btw, since we're having a lot of very specific AI news in the technology community lately, I'd like to point to !localllama@sh.itjust.works

That's a very nice community with people interested in all the minor details of AI. Unfortunately it's not very active, because people keep posting everything to the larger tech communities, and some other people don't like it there because it's "too much AI news".

I think an article like this fits better in one of the communities dedicated to the topic. Just, please don't dump any random news there. It has to be a good paper, influential article. You should have read it and like it yourself. If it's just noise and the usual AI hype, it doesn't belong in a low volume community either.

[–] webghost0101@sopuli.xyz 8 points 8 hours ago* (last edited 7 hours ago) (1 children)

Llama is just one series of llm by meta specifically.

I agree we should have dedicated space for AI not to oversaturate technology spaces but this place does not feel like “it”

The reason llama may have their own community is because its by far the biggest model you can run locally on consumer (high end gamer) hardware. Which makes It somewhat of niche diy self hosting place.

I really liked r/singularity and when i joined Lemmy there where attempts to recreate it here but non of those took off.

[–] hendrik@palaver.p3x.de 7 points 7 hours ago* (last edited 6 hours ago)

The name came from Reddit's LocalLLaMa. But the community has been discussing other model series and papers as well. But you're right, focus is on "local" so most news related to OpenAI and the big service providers might be wrong there. In practice, it's also more about discussing than broadcasting news. I also know about !fosai@lemmy.world but I'm not aware of any ai_news or similar. Yeah maybe singularity or futurology. But those don't seem about scientific papers on niche details, but more about the broader picture.

I mainly wanted to point out the existence of other communities. It seems to be somewhat wrong here, since OP is getting a third downvotes, as most AI related post do. I think we better split it up, but someone might have to start !ainews

[–] eleitl@lemm.ee 1 points 6 hours ago (1 children)

Alignment is cargo cult lingo.

[–] Rhaedas@fedia.io 2 points 6 hours ago (1 children)

For LLMs specifically, or do you mean that goal alignment is some made up idea? I disagree on either, but if you infer there is no such thing as miscommunication or hiding true intentions, that's a whole other discussion.

[–] eleitl@lemm.ee 2 points 5 hours ago (1 children)

Cargo cult pretends to be the thing, but just goes through the motions. You say alignment, alignment with what exactly?

[–] Rhaedas@fedia.io 4 points 5 hours ago

Alignment is short for goal alignment. Some would argue that alignment suggests a need for intelligence or awareness and so LLMs can't have this problem, but a simple program that seems to be doing what you want it to do as it runs but then does something totally different in the end is also misaligned. Such a program is also much easier to test and debug than AI neural nets.