SynthPersona
SynthPersonaDataset loads persona profiles plus QA pairs from Hugging Face and exposes a small in-memory API for analysis and prompt generation.
Loader
from persona_data.synth_persona import SynthPersonaDataset
dataset = SynthPersonaDataset()
small = SynthPersonaDataset(sample_size=100)
The default dataset source is implicit-personalization/synth-persona. The loader reads:
- dataset_personas.jsonl
- dataset_qa.jsonl
- implicit_shared_mc_bank.json (used to hydrate QAPair.related_frq_qids on implicit shared MCQs)
- attribute_schema.json, when present (used to expose clean attribute names and ordinal encodings)
- question_registry.jsonl, when present (used for semantic topic and evaluation-set filters)
sample_size keeps the first sample_size personas (a leading slice, not a random sample) and filters QA rows to the loaded persona IDs. The persona-less Assistant baseline is an ordinary persona row (id="baseline_assistant", exported as BASELINE_PERSONA_ID).
Retrieve the baseline directly via dataset.baseline (or dataset.get_persona(BASELINE_PERSONA_ID)):
dataset = SynthPersonaDataset()
baseline = dataset.baseline
for persona in dataset:
    ...
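The sample_size behavior described above can be illustrated with plain dictionaries. This is a minimal sketch, not the loader's code: the helper name take_leading and the persona_id key on QA rows are assumptions made for the example.

```python
def take_leading(personas, qa_rows, sample_size=None):
    """Keep the first `sample_size` personas and only their QA rows."""
    kept = personas if sample_size is None else personas[:sample_size]
    kept_ids = {p["id"] for p in kept}
    # "persona_id" is an assumed key name for the owning persona of a QA row.
    kept_qa = [row for row in qa_rows if row["persona_id"] in kept_ids]
    return kept, kept_qa


personas = [{"id": "p0"}, {"id": "p1"}, {"id": "p2"}]
qa_rows = [
    {"qid": "q0", "persona_id": "p0"},
    {"qid": "q1", "persona_id": "p2"},
    {"qid": "q2", "persona_id": "p1"},
]

kept, kept_qa = take_leading(personas, qa_rows, sample_size=2)
print([p["id"] for p in kept])      # ['p0', 'p1']
print([r["qid"] for r in kept_qa])  # ['q0', 'q2']
```

Because the slice is positional, two runs with the same sample_size see the same personas.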
Records
- PersonaData: top-level persona record
- QAPair: question-answer pair with type, item_type, and optional multiple-choice fields
- Statement: supporting claim record used by downstream tooling
Statement includes sid, category, claim, and support_turns.
QAPair fields:
| Field | Meaning |
|---|---|
| qid | Globally unique question id. |
| type | "explicit" (directly supported by a seed attribute / interview / statement) or "implicit" (inferred from the persona biography). |
| item_type | "frq" (free-response) or "mcq" (multiple-choice). |
| scope | "individual" (one persona) or "shared" (same item bank across personas). |
| question, answer | Question text and reference answer. |
| choices, correct_choice_index | MCQ-only; empty/None for FRQs. The final MCQ option is always "Not enough information from the context.". |
| bank_id | Stable item / source-slot identifier. For explicit rows, FRQ and MCQ rows that share a bank_id come from the same seed attribute, interview answer, or statement. Used for leakage-aware splits. |
| related_frq_qids | Implicit shared MCQs only: qids of individual implicit FRQs used as evidence when constructing the MCQ. Used for leakage-aware splits. |
| evidence_sids | Optional list of supporting Statement.sids. Empty when the dataset row carries no statement evidence. |
| tags | Optional free-form string tags for downstream slicing or analysis. |
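The MCQ-specific rules in the table can be checked mechanically. The sketch below works on plain dictionaries that use the field names from the table; check_row is a hypothetical helper written for this example, not part of the package.

```python
NOT_ENOUGH_INFO = "Not enough information from the context."


def check_row(row):
    """Return a list of problems found in one QA row (empty if none)."""
    problems = []
    choices = row.get("choices") or []
    index = row.get("correct_choice_index")
    if row["item_type"] == "mcq":
        # The final option is expected to be the fixed fallback answer.
        if not choices or choices[-1] != NOT_ENOUGH_INFO:
            problems.append("last choice is not the fallback option")
        if index is None or not 0 <= index < len(choices):
            problems.append("correct_choice_index out of range")
    elif choices or index is not None:
        # FRQ rows should leave the MCQ-only fields empty / None.
        problems.append("FRQ row carries MCQ fields")
    return problems


mcq = {
    "item_type": "mcq",
    "choices": ["Option A", "Option B", NOT_ENOUGH_INFO],
    "correct_choice_index": 1,
}
frq = {"item_type": "frq", "choices": [], "correct_choice_index": None}

print(check_row(mcq))  # []
print(check_row(frq))  # []
```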
Persona fields
PersonaData includes:
- id
- persona
- templated_view
- biography_view
- statements_view
- statements
It also exposes name as a derived property.
persona = dataset[0]
persona.name # "Ethan Robinson"
persona.templated_view # short attribute-based system prompt
persona.biography_view # full biography text
persona.persona["state"] # raw structured attributes
persona.statements # list of Statement
Index by position with dataset[i], iterate the dataset directly, or look one
up by id with dataset.get_persona(id) (returns None if the id is not
loaded).
persona is the raw structured attribute dictionary. Use the dataset helpers
when you need values aligned to a persona id list:
from persona_data.synth_persona import BASELINE_PERSONA_ID
persona_ids = [p.id for p in dataset if p.id != BASELINE_PERSONA_ID]
# Categorical labels aligned to persona_ids.
states = dataset.attribute_values("state", persona_ids)
# Numeric values for scalar coloring or analysis.
ages = dataset.attribute_values("age", persona_ids, encode=True)
politics = dataset.attribute_values("political_views", persona_ids, encode=True)
# Convenience for all loaded personas in dataset order.
all_states = dataset.attribute_values("state")
dataset.attribute_names lists non-identifier attributes. dataset.attribute_info(name)
returns the raw schema entry when the Hugging Face schema is available. Ordinal
attributes with schema ordered_values encode to 0.0, 1.0, ...; binary
No / Yes attributes encode to 0.0 / 1.0; numeric and boolean
attributes encode to floats. Nominal attributes and other non-ordered labels
encode to None, so leave encode=False when you want raw labels.
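The encoding rules above can be summarized as a small function. This is a sketch of the described behavior rather than the library's implementation: encode_value is a hypothetical name, only the ordered_values schema key comes from the text, and the example labels are invented.

```python
def encode_value(value, schema_entry=None):
    """Return a float for ordered or numeric values, else None."""
    ordered = (schema_entry or {}).get("ordered_values")
    if ordered is not None:
        # Ordinal attribute: encode as the position in the schema order.
        return float(ordered.index(value)) if value in ordered else None
    if value in ("No", "Yes"):
        # Binary No / Yes attribute.
        return 1.0 if value == "Yes" else 0.0
    if isinstance(value, (bool, int, float)):
        # Numeric and boolean attributes encode directly to floats.
        return float(value)
    # Nominal labels have no meaningful order.
    return None


# Hypothetical schema entry and labels, for illustration only.
schema = {"ordered_values": ["Low", "Medium", "High"]}

print(encode_value("Medium", schema))  # 1.0
print(encode_value("Yes"))             # 1.0
print(encode_value(34))                # 34.0
print(encode_value("Ohio"))            # None
```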
Queries
persona = dataset[0]
qa_pairs = dataset.get_qa(persona.id)
qa_pairs = dataset.get_qa(persona.id, type="explicit", item_type="mcq")
religion = dataset.get_qa(persona.id, type="implicit", topic_group_id="religion_spirituality_and_meaning")
study_eval = dataset.get_qa(persona.id, item_type="mcq", question_set="qa_eval_v1")
get_qa() returns typed QAPair records.
type filters explicit versus implicit rows. topic_group_id filters the
semantic topic of a question, independent of whether the row is explicit or
implicit. question_set filters curated evaluation sets, such as a subset of
multiple-choice questions selected for a benchmark run. Topic and question-set
filters require question_registry.jsonl.
question_registry.jsonl is one JSON object per question. Shared bank items
should be keyed by bank_id; individual rows should be keyed by qid.
topic_group_id is an optional string, and question_sets is an optional list
of strings.
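As an illustration of the registry layout, the sketch below parses two registry lines and applies topic and question-set filters to QA rows. The ids shared_mc_001 and p0_frq_017 and the helper names are hypothetical; the lookup rule (bank_id for shared rows, qid for individual rows) follows the description above.

```python
import json

# Two example registry lines; the ids are invented for illustration.
REGISTRY_JSONL = "\n".join([
    json.dumps({"bank_id": "shared_mc_001",
                "topic_group_id": "religion_spirituality_and_meaning",
                "question_sets": ["qa_eval_v1"]}),
    json.dumps({"qid": "p0_frq_017",
                "topic_group_id": "religion_spirituality_and_meaning"}),
])


def load_registry(text):
    """Index registry entries by bank_id (shared) or qid (individual)."""
    by_bank, by_qid = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if "bank_id" in entry:
            by_bank[entry["bank_id"]] = entry
        else:
            by_qid[entry["qid"]] = entry
    return by_bank, by_qid


def matches(row, by_bank, by_qid, topic_group_id=None, question_set=None):
    """Apply optional topic and question-set filters to one QA row."""
    if row["scope"] == "shared":
        entry = by_bank.get(row["bank_id"], {})
    else:
        entry = by_qid.get(row["qid"], {})
    if topic_group_id is not None and entry.get("topic_group_id") != topic_group_id:
        return False
    if question_set is not None and question_set not in (entry.get("question_sets") or []):
        return False
    return True


by_bank, by_qid = load_registry(REGISTRY_JSONL)
shared_row = {"qid": "m1", "bank_id": "shared_mc_001", "scope": "shared"}
individual_row = {"qid": "p0_frq_017", "bank_id": "b7", "scope": "individual"}

print(matches(shared_row, by_bank, by_qid, question_set="qa_eval_v1"))      # True
print(matches(individual_row, by_bank, by_qid, question_set="qa_eval_v1"))  # False
```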
Train/test split
dataset.train_test_split(persona_id, n_train=None, seed=0) returns (train, test) for one persona:
- train: individual free-response questions (both explicit and implicit). Pass n_train=50 or another integer to cap the train slice, or n_train=None for no cap.
- test: shared multiple-choice questions (both explicit and implicit), preserved in full. The shared bank is the same item set for every persona, so per-persona test scores are directly comparable.
- seed: optional int that shuffles the train candidates before capping (reproducible). Test order is left untouched.
To avoid train→test leakage, train rows are dropped if their bank_id matches a test MCQ bank_id (explicit FRQ↔MCQ from the same source slot) or if their qid appears in a test MCQ's related_frq_qids (implicit MCQ built from that FRQ as evidence).
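The split and its two leakage rules can be sketched on plain dictionaries as follows. This is an illustration under stated assumptions, not the package's code: split_rows is a hypothetical helper, rows use the QAPair field names, and skipping the shuffle when seed is None is a guess about the "optional" seed.

```python
import random


def split_rows(rows, n_train=None, seed=0):
    """Return (train, test) for one persona's QA rows."""
    # Test: shared MCQs, kept in full and in their original order.
    test = [r for r in rows if r["scope"] == "shared" and r["item_type"] == "mcq"]

    # Leakage rule 1: same source slot as a test MCQ.
    blocked_bank_ids = {r["bank_id"] for r in test}
    # Leakage rule 2: FRQ used as evidence for a test MCQ.
    blocked_qids = {q for r in test for q in r.get("related_frq_qids") or []}

    train = [
        r for r in rows
        if r["scope"] == "individual" and r["item_type"] == "frq"
        and r["bank_id"] not in blocked_bank_ids
        and r["qid"] not in blocked_qids
    ]
    if seed is not None:  # assumption: no shuffle without a seed
        random.Random(seed).shuffle(train)
    if n_train is not None:
        train = train[:n_train]
    return train, test


rows = [
    {"qid": "q1", "scope": "individual", "item_type": "frq", "bank_id": "b1"},
    {"qid": "q2", "scope": "individual", "item_type": "frq", "bank_id": "b2"},
    {"qid": "q3", "scope": "individual", "item_type": "frq", "bank_id": "b3"},
    {"qid": "q4", "scope": "individual", "item_type": "frq", "bank_id": "b4"},
    {"qid": "m1", "scope": "shared", "item_type": "mcq", "bank_id": "b1"},
    {"qid": "m2", "scope": "shared", "item_type": "mcq", "bank_id": "b9",
     "related_frq_qids": ["q3"]},
]

train, test = split_rows(rows)
print(sorted(r["qid"] for r in train))  # ['q2', 'q4']
print([r["qid"] for r in test])         # ['m1', 'm2']
```

Here q1 is dropped because it shares bank_id b1 with m1, and q3 is dropped because m2 lists it in related_frq_qids.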
Notes
- sample_size keeps a leading slice rather than sampling randomly.
- Personas load eagerly; QA (~1 GB) is loaded and parsed lazily on the first get_qa() call, so attribute-only work never pays for it. The parse result is cached next to the immutable Hugging Face blob, so later runs skip it.
- Call dataset.prefetch_qa() (safe from a background thread) to warm that cache ahead of time so the first get_qa() does not block.
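One way to warm the cache from a background thread is shown below. The helper prefetch_in_background is hypothetical, and fake_prefetch_qa stands in for dataset.prefetch_qa so the sketch runs without downloading anything; in real use you would pass dataset.prefetch_qa instead.

```python
import threading


def prefetch_in_background(prefetch):
    """Start `prefetch` on a daemon thread and return the thread."""
    thread = threading.Thread(target=prefetch, name="qa-prefetch", daemon=True)
    thread.start()
    return thread


# Stand-in for dataset.prefetch_qa (assumption for this example).
warmed = threading.Event()


def fake_prefetch_qa():
    warmed.set()


thread = prefetch_in_background(fake_prefetch_qa)
# ... attribute-only work can proceed here while QA loads ...
thread.join()  # optional: wait before the first get_qa() call
```

Joining is only needed if you want to be sure the cache is warm before calling get_qa(); otherwise the first call simply waits for whatever work remains.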