AI Is a Black Box. Anthropic Figured Out a Way to Look Inside (www.wired.com)

submitted 5 months ago by hedge@beehaw.org to c/technology@beehaw.org

archive.is link needed

astronaut_sloth@mander.xyz 9 points 5 months ago

The original paper itself, for those who are interested.

Overall, this is really interesting research and a really good "first step." I will be interested to see if this can be replicated on other models. One thing that really stood out, though, was that certain details are obfuscated because Sonnet is proprietary. Hopefully, follow-up work will be done on one of the open-source models to confirm the method.

One of the notable limitations is quantifying how an activation correlates with the meaning of the text, which will make any sort of control difficult. Sure, you can just massively increase or decrease a weight, and for some things that will be fine, but for real manual fine-tuning, that will prove difficult.
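
To make the "massively increase or decrease" idea a bit more concrete, here is a toy sketch in plain Python. It is not the paper's code. It assumes, as I understand the approach, that the thing being turned up or down is a learned feature's activation (via a sparse-autoencoder-style dictionary) rather than a raw model weight. The function names, the tiny encoder/decoder matrices, and all the numbers are made up for illustration.

```python
# Toy sketch: pushing one learned "feature" up or down in an activation vector.
# Everything here (names, matrices, values) is hypothetical and only illustrative.

def encode(x, enc_rows, enc_bias):
    """ReLU(W_enc @ x + b): one non-negative activation per feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(enc_rows, enc_bias)
    ]

def clamp_feature(x, feats, dec_rows, k, value):
    """Shift x along feature k's decoder direction so that the feature
    contributes `value` instead of its current activation feats[k]."""
    delta = value - feats[k]
    return [xi + delta * dk for xi, dk in zip(x, dec_rows[k])]

# Two made-up features living in a 3-dimensional activation space.
enc_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
enc_bias = [0.0, 0.0]
dec_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

x = [0.5, 2.0, -1.0]                       # pretend model activation
feats = encode(x, enc_rows, enc_bias)      # current feature activations
steered = clamp_feature(x, feats, dec_rows, k=0, value=10.0)

print(feats)    # [0.5, 2.0]
print(steered)  # [10.0, 2.0, -1.0]
```

The crude part is easy, as the sketch shows: pick a feature, pick a big number, shift the activation. The hard part is what the sketch does not answer: how much of a shift corresponds to how much change in meaning, which is exactly the quantification problem above.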

I suspect this method is likely generalizable (maybe with some tweaks?), and I'd really be interested to see how this type of analysis could be done on other neural networks.