SynthPersona

SynthPersonaDataset loads persona profiles plus QA pairs from Hugging Face and exposes a small in-memory API for analysis and prompt generation.

Loader

from persona_data.synth_persona import SynthPersonaDataset

dataset = SynthPersonaDataset()
small = SynthPersonaDataset(sample_size=100)

The default dataset source is implicit-personalization/synth-persona. The loader reads:

  • dataset_personas.jsonl
  • dataset_qa.jsonl
  • implicit_shared_mc_bank.json (used to hydrate QAPair.related_frq_qids on implicit shared MCQs)
  • attribute_schema.json, when present (used to expose clean attribute names and ordinal encodings)
  • question_registry.jsonl, when present (used for semantic topic and evaluation-set filters)

sample_size keeps the leading personas and filters QA rows to the loaded persona IDs. The persona-less Assistant baseline is an ordinary persona row (id="baseline_assistant", exported as BASELINE_PERSONA_ID).
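The leading-slice behavior can be pictured with plain lists. This is a minimal sketch, not the loader's code: the `leading_slice` helper and the `persona_id` key on QA rows are assumptions made for illustration.

```python
from typing import Optional


def leading_slice(personas: list[dict], qa_rows: list[dict],
                  sample_size: Optional[int] = None):
    """Keep the first `sample_size` personas and only the QA rows that belong to them."""
    kept = personas if sample_size is None else personas[:sample_size]
    kept_ids = {p["id"] for p in kept}
    # QA rows for personas outside the slice are dropped.
    kept_qa = [row for row in qa_rows if row["persona_id"] in kept_ids]
    return kept, kept_qa


# Hypothetical toy data; "persona_id" is an assumed key name.
personas = [{"id": "p1"}, {"id": "p2"}, {"id": "p3"}]
qa_rows = [
    {"persona_id": "p1", "qid": "q1"},
    {"persona_id": "p3", "qid": "q2"},
    {"persona_id": "p2", "qid": "q3"},
]

kept, kept_qa = leading_slice(personas, qa_rows, sample_size=2)
print([p["id"] for p in kept])         # ['p1', 'p2']
print([row["qid"] for row in kept_qa])  # ['q1', 'q3']
```

Because the slice is positional, two runs with the same sample_size see the same personas.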

Retrieve the baseline directly via dataset.baseline (or dataset.get_persona(BASELINE_PERSONA_ID)):

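dataset = SynthPersonaDataset()
baseline = dataset.baseline  # the persona row whose id is BASELINE_PERSONA_ID
for persona in dataset:      # iterates persona records, baseline row included
    ...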

Records

  • PersonaData: top-level persona record
  • QAPair: question-answer pair with type, item_type, and optional multiple-choice fields
  • Statement: supporting claim record used by downstream tooling

Statement includes sid, category, claim, and support_turns.

QAPair fields:

| Field | Meaning |
| --- | --- |
| qid | Globally unique question id. |
| type | "explicit" (directly supported by a seed attribute / interview / statement) or "implicit" (inferred from the persona biography). |
| item_type | "frq" (free-response) or "mcq" (multiple-choice). |
| scope | "individual" (one persona) or "shared" (same item bank across personas). |
| question, answer | Question text and reference answer. |
| choices, correct_choice_index | MCQ-only; empty/None for FRQs. The final MCQ option is always "Not enough information from the context.". |
| bank_id | Stable item / source-slot identifier. For explicit rows, FRQ and MCQ rows that share a bank_id come from the same seed attribute, interview answer, or statement. Used for leakage-aware splits. |
| related_frq_qids | Implicit shared MCQs only: qids of individual implicit FRQs used as evidence when constructing the MCQ. Used for leakage-aware splits. |
| evidence_sids | Optional list of supporting Statement.sids. Empty when the dataset row carries no statement evidence. |
| tags | Optional free-form string tags for downstream slicing or analysis. |
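The field constraints above can be checked mechanically. The following is a rough sketch that works on plain dictionaries rather than the library's QAPair class; `check_qa_row` is a hypothetical helper and the sample rows are made up.

```python
ABSTAIN = "Not enough information from the context."


def check_qa_row(row: dict) -> list[str]:
    """Return a list of violations of the documented QAPair field rules."""
    problems = []
    if row.get("type") not in {"explicit", "implicit"}:
        problems.append("type must be 'explicit' or 'implicit'")
    if row.get("item_type") not in {"frq", "mcq"}:
        problems.append("item_type must be 'frq' or 'mcq'")
    if row.get("scope") not in {"individual", "shared"}:
        problems.append("scope must be 'individual' or 'shared'")

    choices = row.get("choices") or []
    index = row.get("correct_choice_index")
    if row.get("item_type") == "mcq":
        # The final MCQ option is always the abstain option.
        if not choices or choices[-1] != ABSTAIN:
            problems.append("final MCQ option must be the abstain option")
        if index is None or not 0 <= index < len(choices):
            problems.append("correct_choice_index must point into choices")
    elif row.get("item_type") == "frq" and (choices or index is not None):
        # MCQ-only fields stay empty/None on free-response rows.
        problems.append("FRQ rows must not carry MCQ fields")
    return problems


# Made-up rows for illustration only.
good_mcq = {
    "qid": "q1", "type": "explicit", "item_type": "mcq", "scope": "shared",
    "choices": ["Option A", "Option B", ABSTAIN], "correct_choice_index": 0,
}
bad_frq = {
    "qid": "q2", "type": "implicit", "item_type": "frq", "scope": "individual",
    "choices": ["Option A"], "correct_choice_index": None,
}

print(check_qa_row(good_mcq))  # []
print(check_qa_row(bad_frq))   # ['FRQ rows must not carry MCQ fields']
```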

Persona fields

PersonaData includes:

  • id
  • persona
  • templated_view
  • biography_view
  • statements_view
  • statements

It also exposes name as a derived property.

persona = dataset[0]

persona.name              # "Ethan Robinson"
persona.templated_view    # short attribute-based system prompt
persona.biography_view    # full biography text
persona.persona["state"]  # raw structured attributes
persona.statements        # list of Statement

Index by position with dataset[i], iterate the dataset directly, or look one up by id with dataset.get_persona(id) (returns None if the id is not loaded).

persona is the raw structured attribute dictionary. Use the dataset helpers when you need values aligned to a persona id list:

from persona_data.synth_persona import BASELINE_PERSONA_ID

persona_ids = [p.id for p in dataset if p.id != BASELINE_PERSONA_ID]

# Categorical labels aligned to persona_ids.
states = dataset.attribute_values("state", persona_ids)

# Numeric values for scalar coloring or analysis.
ages = dataset.attribute_values("age", persona_ids, encode=True)
politics = dataset.attribute_values("political_views", persona_ids, encode=True)

# Convenience for all loaded personas in dataset order.
all_states = dataset.attribute_values("state")

dataset.attribute_names lists non-identifier attributes. dataset.attribute_info(name) returns the raw schema entry when the Hugging Face schema is available. With encode=True, values are encoded as follows:

  • Ordinal attributes with schema ordered_values encode to 0.0, 1.0, ...
  • Binary No / Yes attributes encode to 0.0 / 1.0.
  • Numeric and boolean attributes encode to floats.
  • Nominal attributes and other non-ordered labels encode to None, so leave encode=False when you want raw labels.
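The encoding rules can be sketched as a single function. This illustrates the documented behavior under assumptions and is not the library's implementation: `encode_value` is a hypothetical name, and the ordered values in the example are invented.

```python
from typing import Optional, Sequence


def encode_value(value, ordered_values: Optional[Sequence[str]] = None) -> Optional[float]:
    """Map one raw attribute value to a float, or None when no order applies."""
    if value is None:
        return None
    # Numeric and boolean values become floats (bool is a subclass of int).
    if isinstance(value, (bool, int, float)):
        return float(value)
    # Ordinal attributes use their position in the schema's ordered_values.
    if ordered_values is not None:
        if value in ordered_values:
            return float(list(ordered_values).index(value))
        return None
    # Binary No / Yes labels.
    if value in ("No", "Yes"):
        return 1.0 if value == "Yes" else 0.0
    # Nominal labels have no meaningful numeric encoding.
    return None


levels = ["low", "medium", "high"]      # invented ordered_values
print(encode_value("medium", levels))   # 1.0
print(encode_value(34))                 # 34.0
print(encode_value("Yes"))              # 1.0
print(encode_value("Texas"))            # None
```

Returning None for nominal labels keeps them out of numeric analysis instead of assigning them an arbitrary order.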

Queries

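persona = dataset[0]

# All QA pairs for one persona.
all_qa = dataset.get_qa(persona.id)
# Explicit multiple-choice rows only.
explicit_mcq = dataset.get_qa(persona.id, type="explicit", item_type="mcq")
# Implicit rows on one semantic topic (requires question_registry.jsonl).
religion = dataset.get_qa(persona.id, type="implicit", topic_group_id="religion_spirituality_and_meaning")
# MCQs in a curated evaluation set (requires question_registry.jsonl).
study_eval = dataset.get_qa(persona.id, item_type="mcq", question_set="qa_eval_v1")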

get_qa() returns typed QAPair records.

type filters explicit versus implicit rows. topic_group_id filters the semantic topic of a question, independent of whether the row is explicit or implicit. question_set filters curated evaluation sets, such as a subset of multiple-choice questions selected for a benchmark run. Topic and question-set filters require question_registry.jsonl.

question_registry.jsonl holds one JSON object per question, one per line. Shared bank items should be keyed by bank_id; individual rows should be keyed by qid. topic_group_id is an optional string, and question_sets is an optional list of strings.
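As an illustration of the registry layout, the sketch below parses two made-up registry lines and applies topic and question-set filters to QA rows. The exact key names inside each registry object ("bank_id", "qid") and the helper functions are assumptions; only topic_group_id and question_sets are named by the description above.

```python
import json

# Two invented registry lines: one shared bank item, one individual row.
REGISTRY_JSONL = """\
{"bank_id": "bank_001", "topic_group_id": "religion_spirituality_and_meaning", "question_sets": ["qa_eval_v1"]}
{"qid": "q_individual_42", "topic_group_id": "religion_spirituality_and_meaning"}
"""


def load_registry(text: str):
    """Build lookups for shared items (by bank_id) and individual rows (by qid)."""
    by_bank, by_qid = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if "bank_id" in entry:
            by_bank[entry["bank_id"]] = entry
        elif "qid" in entry:
            by_qid[entry["qid"]] = entry
    return by_bank, by_qid


def matches(row: dict, by_bank: dict, by_qid: dict,
            topic_group_id=None, question_set=None) -> bool:
    """Apply the optional topic and question-set filters to one QA row."""
    if row.get("scope") == "shared":
        entry = by_bank.get(row.get("bank_id")) or {}
    else:
        entry = by_qid.get(row.get("qid")) or {}
    if topic_group_id is not None and entry.get("topic_group_id") != topic_group_id:
        return False
    if question_set is not None and question_set not in (entry.get("question_sets") or []):
        return False
    return True


by_bank, by_qid = load_registry(REGISTRY_JSONL)
shared_row = {"qid": "q_shared_1", "scope": "shared", "bank_id": "bank_001"}
individual_row = {"qid": "q_individual_42", "scope": "individual"}

print(matches(shared_row, by_bank, by_qid, question_set="qa_eval_v1"))      # True
print(matches(individual_row, by_bank, by_qid, question_set="qa_eval_v1"))  # False
```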

Train/test split

dataset.train_test_split(persona_id, n_train=None, seed=0) returns a (train, test) pair for one persona. The two slices and the seed parameter work as follows:

  • train: individual free-response questions (both explicit and implicit). Pass n_train=50 or another integer to cap the train slice, or n_train=None for no cap.
  • test: shared multiple-choice questions (both explicit and implicit), preserved in full. The shared bank is the same item set for every persona, so per-persona test scores are directly comparable.
  • seed: optional int that shuffles the train candidates before capping (reproducible). Test order is left untouched.

To avoid train→test leakage, train rows are dropped if their bank_id matches a test MCQ bank_id (explicit FRQ↔MCQ from the same source slot) or if their qid appears in a test MCQ's related_frq_qids (implicit MCQ built from that FRQ as evidence).
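The split and its two leakage rules can be sketched over plain dictionaries. This is a simplified illustration under assumptions, not the method's actual code: `split_rows` is a hypothetical helper and the rows are invented.

```python
import random


def split_rows(rows: list[dict], n_train=None, seed=0):
    """Shared MCQs form the test set; individual FRQs form train, minus leaking rows."""
    test = [r for r in rows if r["item_type"] == "mcq" and r["scope"] == "shared"]

    # Rule 1: same source slot as a test MCQ.
    blocked_bank_ids = {r["bank_id"] for r in test if r.get("bank_id")}
    # Rule 2: FRQ used as evidence when a test MCQ was built.
    blocked_qids = {qid for r in test for qid in (r.get("related_frq_qids") or [])}

    train = [
        r for r in rows
        if r["item_type"] == "frq" and r["scope"] == "individual"
        and r.get("bank_id") not in blocked_bank_ids
        and r["qid"] not in blocked_qids
    ]
    # Shuffle reproducibly before capping; test order is left untouched.
    random.Random(seed).shuffle(train)
    if n_train is not None:
        train = train[:n_train]
    return train, test


# Invented rows: f1 shares a bank_id with m1, f3 is evidence for m2.
rows = [
    {"qid": "f1", "item_type": "frq", "scope": "individual", "bank_id": "b1"},
    {"qid": "f2", "item_type": "frq", "scope": "individual", "bank_id": "b2"},
    {"qid": "f3", "item_type": "frq", "scope": "individual", "bank_id": "b3"},
    {"qid": "f4", "item_type": "frq", "scope": "individual", "bank_id": "b4"},
    {"qid": "m1", "item_type": "mcq", "scope": "shared", "bank_id": "b1"},
    {"qid": "m2", "item_type": "mcq", "scope": "shared", "bank_id": "b9",
     "related_frq_qids": ["f3"]},
]

train, test = split_rows(rows)
print(sorted(r["qid"] for r in train))  # ['f2', 'f4']
print([r["qid"] for r in test])         # ['m1', 'm2']
```

Here f1 is dropped by the bank_id rule and f3 by the related_frq_qids rule, so neither can leak into training.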

Notes

  • sample_size keeps a leading slice rather than sampling randomly.
  • Personas load eagerly; QA (~1 GB) is loaded and parsed lazily on the first get_qa() call, so attribute-only work never pays for it. The parse result is cached next to the immutable Hugging Face blob, so later runs skip the parse.
  • Call dataset.prefetch_qa() (safe from a background thread) to warm that cache ahead of time so the first get_qa() does not block.
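The lazy-load and prefetch pattern in these notes can be illustrated with a small stand-in class. This is a sketch of the general idea only: `LazyQAStore` and `toy_loader` are hypothetical, the on-disk cache is omitted, and nothing here reflects how SynthPersonaDataset is written internally.

```python
import threading


class LazyQAStore:
    """Hypothetical stand-in that loads QA rows once, on first use."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._rows = None
        self.load_count = 0  # how many times the loader actually ran

    def prefetch_qa(self):
        # The lock makes this safe to call from a background thread.
        with self._lock:
            if self._rows is None:
                self._rows = self._loader()
                self.load_count += 1

    def get_qa(self, persona_id):
        self.prefetch_qa()  # no-op when the rows are already loaded
        return [r for r in self._rows if r["persona_id"] == persona_id]


def toy_loader():
    # Stands in for the expensive download-and-parse step.
    return [{"persona_id": "p1", "qid": "q1"}, {"persona_id": "p2", "qid": "q2"}]


store = LazyQAStore(toy_loader)
warmup = threading.Thread(target=store.prefetch_qa)
warmup.start()
# ... attribute-only work could run here while QA loads ...
warmup.join()

rows = store.get_qa("p1")
print([r["qid"] for r in rows])  # ['q1']
print(store.load_count)          # 1
```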