Semantic Structure in Large Language Model Embeddings
- #LLM embeddings
- #semantic structure
- #low-dimensional
- Human ratings of words across semantic scales can be compressed into a low-dimensional representation with minimal information loss.
- LLM embeddings exhibit similar semantic structure: projections onto directions defined by antonym pairs correlate highly with human ratings (see the projection sketch after this list).
- Semantic features in LLMs reduce to a roughly three-dimensional subspace, resembling the structure found in human survey responses (see the PCA sketch below).
- Shifting a token's embedding along one semantic direction causes off-target effects on other features, in proportion to the cosine similarity between the aligned feature directions (see the steering sketch below).
- Semantic features in LLMs are entangled much as they are in human language, and much of the semantic information they carry is surprisingly low-dimensional.
- Accounting for semantic structure is essential to avoid unintended consequences when steering features in LLMs.
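
The antonym-pair projection idea can be sketched in a few lines. This is a minimal illustration, not the paper's code: the embeddings are random stand-ins, and `emb`, `semantic_direction`, `project`, and the human ratings are hypothetical names and placeholder values; only numpy and scipy are assumed.

```python
# Sketch: score words on a "good-bad" scale via projection onto an
# antonym-pair direction, then correlate with human ratings.
# Embeddings and ratings below are random/placeholder stand-ins.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
dim = 768  # assumed embedding width

# Stand-in vectors; in practice these come from an LLM's embedding matrix.
vocab = ["good", "bad", "gift", "poison", "sunset", "tax"]
emb = {w: rng.normal(size=dim) for w in vocab}

def semantic_direction(pos: str, neg: str) -> np.ndarray:
    """Unit vector pointing from the negative pole to the positive pole."""
    d = emb[pos] - emb[neg]
    return d / np.linalg.norm(d)

def project(word: str, direction: np.ndarray) -> float:
    """Coordinate of a word's embedding along a semantic direction."""
    return float(emb[word] @ direction)

good_bad = semantic_direction("good", "bad")
words = ["gift", "poison", "sunset", "tax"]
scores = [project(w, good_bad) for w in words]
human_ratings = [4.5, 1.2, 4.1, 2.0]  # placeholder 1-5 survey means

rho, _ = spearmanr(scores, human_ratings)
print(f"Spearman rho vs. human ratings: {rho:.2f}")
```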
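The low-dimensionality claim amounts to running PCA on a words-by-scales matrix of such projections. The sketch below uses synthetic data with a hidden three-factor structure just to show the shape of the check; `X`, the factor count, and the noise level are assumptions, not results from the source.

```python
# Sketch: PCA on a (words x scales) matrix of semantic projections.
# X is synthetic, built from 3 hidden factors plus noise, so the first
# three components dominate by construction; real data would be the test.
import numpy as np

rng = np.random.default_rng(1)
n_words, n_scales = 500, 12

latent = rng.normal(size=(n_words, 3))       # hidden 3D structure
loadings = rng.normal(size=(3, n_scales))    # how scales mix the factors
X = latent @ loadings + 0.1 * rng.normal(size=(n_words, n_scales))

Xc = X - X.mean(axis=0)                      # center columns
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print("Variance explained by first 3 PCs:", round(explained[:3].sum(), 3))
```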
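The off-target effect has a simple geometric reading: adding `alpha * u` to an embedding moves its coordinate along any other unit direction `v` by exactly `alpha * cos(u, v)`, so correlated features move together. A minimal steering sketch, with `u`, `v`, `alpha`, and the dimensionality as hypothetical values:

```python
# Sketch: steering along direction u shifts the coordinate along any other
# unit direction v by exactly alpha * cos(u, v) -- the off-target effect.
import numpy as np

rng = np.random.default_rng(2)
dim = 768  # assumed embedding width

def unit(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x)

u = unit(rng.normal(size=dim))                  # steered feature direction
v = unit(0.6 * u + 0.8 * rng.normal(size=dim))  # a correlated second feature

token = rng.normal(size=dim)
alpha = 5.0
steered = token + alpha * u                     # shift token along u

delta_v = (steered - token) @ v                 # observed off-target shift
print(f"cos(u, v) = {u @ v:.3f}")
print(f"off-target shift = {delta_v:.3f}, predicted = {alpha * (u @ v):.3f}")
```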