persona-vectors

Extract persona vectors from language models, then compare those vectors across layers and prompt variants, probe them for attribute information, or use them for experimental steering.

This project is experimental.

What is a persona vector?

A persona vector is the mean hidden-state activation a model produces while answering as a given persona. Extraction saves one tensor of shape (num_layers, hidden_size) for each combination of persona, prompt variant, model, and mask strategy. Every downstream tool (similarity, projection, probes, steering) reads those saved tensors back; nothing re-runs the model.
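
The definition above can be illustrated with a minimal NumPy sketch. It is not the project's extraction code: the function names `masked_mean_vector` and `persona_vector`, the boolean token mask, and the choice to average first over masked tokens and then over prompts are all assumptions made for illustration, and random arrays stand in for real hidden states.

```python
import numpy as np


def masked_mean_vector(hidden_states: np.ndarray, token_mask: np.ndarray) -> np.ndarray:
    """Average hidden states over the tokens selected by the mask.

    hidden_states: (num_layers, seq_len, hidden_size) activations for one prompt.
    token_mask:    (seq_len,) boolean array, True for tokens to include.
    Returns a (num_layers, hidden_size) array.
    """
    mask = token_mask.astype(bool)
    if not mask.any():
        raise ValueError("token_mask selects no tokens")
    # Boolean indexing keeps only the masked positions along the sequence axis.
    return hidden_states[:, mask, :].mean(axis=1)


def persona_vector(per_prompt_vectors: list[np.ndarray]) -> np.ndarray:
    """Average the per-prompt vectors of one persona (hypothetical aggregation)."""
    return np.mean(np.stack(per_prompt_vectors), axis=0)


# Toy data standing in for real model activations.
rng = np.random.default_rng(0)
num_layers, seq_len, hidden_size = 4, 10, 8
prompts = [rng.normal(size=(num_layers, seq_len, hidden_size)) for _ in range(3)]

# Hypothetical mask strategy: keep only the last four tokens of each prompt.
mask = np.zeros(seq_len, dtype=bool)
mask[6:] = True

vec = persona_vector([masked_mean_vector(h, mask) for h in prompts])
print(vec.shape)  # (4, 8)
```

The result has one row per layer, which is why downstream tools can compare layers without touching the model again.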

Pipeline

personas + QA pairs -> prompts -> token masks -> hidden states -> saved vectors -> analysis

Stage	What happens	Reference
Extraction	Format persona QA prompts, build token masks, run the model, save one vector per persona and prompt variant	Activation Extraction
Storage	Local `PersonaVectorStore` and read-only Hugging Face Hub `HFPersonaVectorStore`, both over one shared on-disk layout	Artifacts
Analysis	Aligned vector loading, centered cosine similarity, PCA / UMAP / Isomap, clustering, plots	Analysis
Probes	Linear probes that read a persona attribute out of the vectors	Probes
Steering	Experimental biography-minus-templated direction	Steering
Trait Vectors	Deconfounded per-attribute directions from minimal-pair swaps	Trait Vectors
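
As an illustration of the Analysis row, here is a minimal sketch of centered cosine similarity at a single layer. It assumes "centered" means subtracting the mean vector across personas before computing cosine similarity; the function names and toy data are hypothetical and do not reflect the project's API.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def centered_cosine_similarity(vectors: np.ndarray) -> np.ndarray:
    """Cosine similarity after removing the component shared by all personas.

    vectors: (num_personas, hidden_size) persona vectors taken from one layer.
    """
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    return cosine_similarity_matrix(centered)


# Three toy personas that share a large common offset.
shared = np.full(4, 10.0)
vectors = np.stack([
    shared + np.array([1.0, 0.0, 0.0, 0.0]),
    shared + np.array([0.9, 0.1, 0.0, 0.0]),
    shared + np.array([-1.0, 0.0, 0.0, 0.0]),
])

raw = cosine_similarity_matrix(vectors)
centered = centered_cosine_similarity(vectors)

# Raw similarities are all close to 1 because the shared offset dominates;
# centering separates the first two personas from the third.
print(np.round(raw, 3))
print(np.round(centered, 3))
```

Centering matters because hidden states often share a large common component, which pushes raw cosine similarity toward 1 for every pair and hides the differences between personas.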

Install

uv sync
cp .env.example .env

Requires Python >=3.12. Set NDIF_API_KEY to use remote extraction. See the README for quickstart commands and extraction scripts.
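
A quick way to check the two stated requirements before running anything is sketched below. The helper `check_prerequisites` is hypothetical and not part of the project. It inspects only the current process environment, so a key defined only in `.env` is visible to it only if something has already loaded that file.

```python
import os
import sys
from collections.abc import Mapping


def check_prerequisites(
    env: Mapping[str, str] = os.environ,
    version: tuple[int, int] = sys.version_info[:2],
) -> list[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if version < (3, 12):
        problems.append(f"Python >=3.12 is required, found {version[0]}.{version[1]}")
    if not env.get("NDIF_API_KEY"):
        problems.append("NDIF_API_KEY is not set; remote extraction will be unavailable")
    return problems


for problem in check_prerequisites():
    print(problem)
```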