persona-vectors
Extract persona vectors from language models, then compare those vectors across layers and prompt variants, probe them for attribute information, or use them for experimental steering.
This project is experimental.
What is a persona vector?
A persona vector is the mean hidden-state activation a model produces while answering as a given persona. Extraction saves one (num_layers, hidden_size) tensor per persona, prompt variant, model, and mask strategy. Every downstream tool (similarity, projection, probes, steering) reads those saved tensors back; nothing re-runs the model.
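As a rough illustration of the definition above, the sketch below averages hidden states over the tokens selected by a mask, giving one vector per layer. It is not this project's extraction code: the function name `persona_vector`, the NumPy input layout, and the choice to pool with a plain mean over masked tokens of a single prompt are all assumptions made for the example.

```python
import numpy as np


def persona_vector(hidden_states: np.ndarray, token_mask: np.ndarray) -> np.ndarray:
    """Mean hidden state over masked tokens, per layer (hypothetical helper).

    hidden_states: (num_layers, seq_len, hidden_size) activations for one prompt.
    token_mask:    (seq_len,) boolean array, True for tokens to include.
    Returns an array of shape (num_layers, hidden_size).
    """
    mask = np.asarray(token_mask, dtype=bool)
    if mask.shape[0] != hidden_states.shape[1]:
        raise ValueError("token_mask length must match seq_len")
    if not mask.any():
        raise ValueError("token_mask selects no tokens")
    # Keep only the masked token positions, then average over them.
    return hidden_states[:, mask, :].mean(axis=1)


# Toy input: 2 layers, 4 tokens, hidden size 3.
toy_hidden = np.arange(24, dtype=float).reshape(2, 4, 3)
toy_mask = np.array([False, True, True, False])
vec = persona_vector(toy_hidden, toy_mask)
print(vec.shape)  # (2, 3)
```

Averaging the per-prompt results over all of a persona's QA pairs would then give a single tensor for that persona under this assumed pooling scheme.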
Pipeline
personas + QA pairs -> prompts -> token masks -> hidden states -> saved vectors -> analysis
| Stage | What happens | Reference |
|---|---|---|
| Extraction | Format persona QA prompts, build token masks, run the model, save one vector per persona and prompt variant | Activation Extraction |
| Storage | Local PersonaVectorStore and read-only Hub HFPersonaVectorStore over one shared on-disk layout | Artifacts |
| Analysis | Aligned vector loading, centered cosine similarity, PCA / UMAP / Isomap, clustering, plots | Analysis |
| Probes | Linear probes that read a persona attribute out of the vectors | Probes |
| Steering | Experimental biography-minus-templated direction | Steering |
| Trait Vectors | Deconfounded per-attribute directions from minimal-pair swaps | Trait Vectors |
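The Analysis row mentions centered cosine similarity. A minimal sketch of one common reading of that term follows: subtract the mean vector across personas at a single layer, then take pairwise cosine similarity. The function name `centered_cosine_similarity` and the exact centering choice are assumptions for illustration, not the project's documented API.

```python
import numpy as np


def centered_cosine_similarity(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity after removing the shared mean (hypothetical helper).

    vectors: (num_personas, hidden_size) persona vectors taken from one layer.
    Returns a (num_personas, num_personas) similarity matrix.
    """
    # Remove the component common to all personas so it does not dominate.
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    # Guard against division by zero for a vector equal to the mean.
    unit = centered / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Three toy persona vectors that share a large common offset.
toy = np.array([[11.0, 10.0], [10.0, 11.0], [9.0, 9.0]])
sim = centered_cosine_similarity(toy)
print(sim.shape)  # (3, 3)
```

Without centering, the three toy vectors would all look nearly identical because of their shared offset; after centering, the first two are orthogonal and the third points away from both.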
Install
uv sync
cp .env.example .env
Requires Python >=3.12. Set NDIF_API_KEY to use remote extraction. See the README for quickstart commands and extraction scripts.