Trait Vectors

A persona vector mixes every attribute a persona carries, so a population difference-of-means direction for one attribute absorbs whatever co-occurs with it. A trait vector isolates a single attribute with a minimal pair: swap only that attribute on a persona, extract both views, and average the within-pair activation delta. Everything that did not change cancels.

Core module: src/persona_vectors/traits.py

How it works

  1. persona_data.templated.swap_attribute edits one attribute and re-renders the whole templated view (so coupled sentences like age+sex stay coherent). Binary attributes auto-flip.
  2. For each persona, extract activations for the original and swapped views (reusing the extraction pipeline), then take act(value_to) − act(value_from). The delta is oriented by attribute value, not by which value the persona originally had, so a Male→Female swap and a Female→Male swap reinforce each other instead of cancelling in the average.
  3. Average the paired deltas over personas → the per-layer trait vector.
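
The orientation and averaging in steps 2 and 3 can be sketched with NumPy on synthetic data. This is only an illustration under stated assumptions, not the code in traits.py: the array shapes, the synthetic activation model, and every name below are hypothetical stand-ins for real extracted activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_personas, n_layers, d_model = 200, 4, 16

# Synthetic stand-ins (assumptions): one true trait direction per layer, plus
# large persona-specific content that is identical in both views of a persona.
true_direction = rng.normal(size=(n_layers, d_model))
persona_content = rng.normal(size=(n_personas, n_layers, d_model)) * 5.0

# Original value of a binary attribute: 0 = value_from, 1 = value_to.
# Alternating values give an exactly balanced population.
original_value = np.arange(n_personas) % 2

def activation(value):
    """Toy activation: shared persona content + trait shift + small noise."""
    noise = rng.normal(size=persona_content.shape) * 0.1
    return persona_content + value[:, None, None] * true_direction + noise

act_original = activation(original_value)
act_swapped = activation(1 - original_value)  # minimal pair: only the attribute flips

# Orient by attribute value: always act(value_to) - act(value_from),
# regardless of which view was the original one.
original_is_to = (original_value == 1)[:, None, None]
act_to = np.where(original_is_to, act_original, act_swapped)
act_from = np.where(original_is_to, act_swapped, act_original)
paired_deltas = act_to - act_from            # (n_personas, n_layers, d_model)

# Average over personas to get one vector per layer.
trait_vector = paired_deltas.mean(axis=0)    # (n_layers, d_model)

# For contrast: swapped - original without orientation cancels in a balanced population.
naive_vector = (act_swapped - act_original).mean(axis=0)
```

In this toy setup the persona-specific content drops out of every pair, so trait_vector closely tracks true_direction, while the unoriented average shrinks toward zero because the two swap directions cancel.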

The steps above build the description-level flavor (PERSONA_MEAN over the swapped templated view) for binary attributes. The answer-level flavor (ANSWER_MEAN, force-decoded explicit answer) and non-binary attributes reuse the same orientation logic.

Extract and build

from persona_vectors.extraction import MaskStrategy
from persona_vectors.traits import extract_trait_deltas, build_trait_direction

# Pair each persona with its train split (n_train=1), then drop personas whose split is empty.
runs = [(p, dataset.train_test_split(p.id, n_train=1)[0]) for p in dataset]
runs = [(p, qa) for p, qa in runs if qa]  # the QA only builds the prompt

deltas = extract_trait_deltas(
    model, dataset, "sex", runs,
    variant="templated", mask_strategy=MaskStrategy.PERSONA_MEAN, remote=False,
)
info = build_trait_direction(deltas, candidate_layers=[13])

build_trait_direction returns a steering-harness direction dict (layer, unit_direction, gap_norm, auc, positive, …), so a trait vector drops straight into generate_steered / generate_band_steered (see Steering).
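
To make the dict fields more concrete, here is a rough sketch of how fields with these names could be derived from one layer of paired activations. The field semantics are assumptions for illustration only (unit_direction as the normalised mean delta, gap_norm as its length, auc as how well projections separate the two values); build_trait_direction may compute them differently, and sketch_direction is a hypothetical helper, not part of the package.

```python
import numpy as np

def sketch_direction(act_to, act_from, layer):
    """Hypothetical helper: summarise one layer of paired activations.

    act_to, act_from: (n_personas, d_model) activations for the two values.
    """
    mean_delta = (act_to - act_from).mean(axis=0)
    gap_norm = float(np.linalg.norm(mean_delta))   # assumed meaning: length of the mean delta
    unit_direction = mean_delta / gap_norm

    # Assumed meaning of auc: probability that a value_to projection exceeds
    # a value_from projection, with ties counted as half.
    proj_to = act_to @ unit_direction
    proj_from = act_from @ unit_direction
    greater = (proj_to[:, None] > proj_from[None, :]).mean()
    ties = (proj_to[:, None] == proj_from[None, :]).mean()
    auc = float(greater + 0.5 * ties)

    return {"layer": layer, "unit_direction": unit_direction,
            "gap_norm": gap_norm, "auc": auc}

# Toy data: both views share persona content; value_to adds a shift on axis 0.
rng = np.random.default_rng(0)
shared = rng.normal(size=(50, 8))
shift = np.zeros(8)
shift[0] = 3.0
toy_from = shared + rng.normal(size=(50, 8)) * 0.1
toy_to = shared + shift + rng.normal(size=(50, 8)) * 0.1

info_sketch = sketch_direction(toy_to, toy_from, layer=13)
```

Under these assumptions the unit direction has length one and the gap norm recovers roughly the size of the injected shift.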

Storage

Trait vectors are persisted with TraitVectorStore. save_trait_deltas(store, deltas) writes the full per-layer mean delta plus contrast metadata, and load_trait_direction(store, attribute, layer=...) rebuilds a steering-ready dict at any layer. See the Trait Vector Store section of Artifacts for the layout and API.

Steering

A trait info dict feeds generate_steered / generate_band_steered unchanged; load_trait_band builds the {layer: info} band for multi-layer steering (see Steering).

Notebook

notebooks/notebook_extract_trait.py runs the full flow: extract a trait vector per binary attribute (saving each one), then compare the trait-cosine matrix against the co-occurrence (Cramér's V) matrix. A pair of attributes with high co-occurrence but low trait cosine indicates that the minimal-pair extraction deconfounded that pair.
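
The comparison can be sketched as follows. This is not the notebook's code: the helper names, the toy attributes, and the 0.5 / 0.2 thresholds are assumptions chosen for illustration, and Cramér's V is computed here without bias correction.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarity between row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def cramers_v(x, y):
    """Cramér's V between two categorical arrays (no bias correction)."""
    x_codes = np.unique(x, return_inverse=True)[1]
    y_codes = np.unique(y, return_inverse=True)[1]
    table = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(table, (x_codes, y_codes), 1)          # contingency counts
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

rng = np.random.default_rng(0)
n = 1000
attr_a = rng.integers(0, 2, size=n)
attr_b = np.where(rng.random(n) < 0.1, 1 - attr_a, attr_a)  # co-occurs strongly with attr_a
attr_c = rng.integers(0, 2, size=n)                         # independent of both
attrs = {"attr_a": attr_a, "attr_b": attr_b, "attr_c": attr_c}
names = list(attrs)

# Toy trait vectors (one per attribute), nearly orthogonal by construction.
trait_vectors = np.eye(3, 8) + 0.01 * rng.normal(size=(3, 8))

cos = cosine_matrix(trait_vectors)
v = np.array([[cramers_v(attrs[i], attrs[j]) for j in names] for i in names])

# Flag pairs that co-occur strongly yet have nearly orthogonal trait vectors.
deconfounded = [
    (names[i], names[j])
    for i in range(len(names)) for j in range(i + 1, len(names))
    if v[i, j] > 0.5 and abs(cos[i, j]) < 0.2
]
```

Here attr_a and attr_b co-occur strongly in the population while their toy trait vectors stay close to orthogonal, which is the pattern the notebook looks for.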