Refusal in LLMs is mediated by a single direction

https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
{
"by": "veryluckyxyz",
"descendants": 20,
"id": 40242939,
"kids": [
40248418,
40251554,
40254915,
40249812,
40253092,
40248781
],
"score": 110,
"time": 1714697726,
"title": "Refusal in LLMs is mediated by a single direction",
"type": "story",
"url": "https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction"
}
{
"author": null,
"date": null,
"description": null,
"image": null,
"logo": null,
"publisher": "LessWrong",
"title": "Refusal in LLMs is mediated by a single direction",
"url": "https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction"
}