# SynthPersona
`SynthPersonaDataset` loads persona profiles plus QA pairs from Hugging Face and exposes a small in-memory API for analysis and prompt generation.
## Loader
```python
from persona_data.synth_persona import SynthPersonaDataset

dataset = SynthPersonaDataset()               # full dataset
small = SynthPersonaDataset(sample_size=100)  # first 100 personas only
```
The default dataset source is `implicit-personalization/synth-persona`. The loader reads two JSONL files:

- `dataset_personas.jsonl`
- `dataset_qa.jsonl`

`sample_size` keeps the first N personas (a leading slice, not a random sample) and filters QA rows to the loaded persona IDs.
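The docs above don't specify the fetch mechanism. As a rough sketch, the two files could be pulled with `huggingface_hub.hf_hub_download` (an assumption, not confirmed) and parsed line by line, with `sample_size` applied afterwards; the `persona_id` field name on QA rows is also an assumption:

```python
import json
from huggingface_hub import hf_hub_download  # assumed fetch mechanism

REPO = "implicit-personalization/synth-persona"

def load_jsonl(filename: str) -> list[dict]:
    path = hf_hub_download(REPO, filename, repo_type="dataset")
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

personas = load_jsonl("dataset_personas.jsonl")
qa_rows = load_jsonl("dataset_qa.jsonl")

# sample_size semantics: keep a leading slice, then filter QA to those persona IDs.
sample_size = 100
personas = personas[:sample_size]
kept_ids = {p["id"] for p in personas}
qa_rows = [row for row in qa_rows if row["persona_id"] in kept_ids]  # field name assumed
```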
## Records
- `PersonaData`: top-level persona record
- `QAPair`: question-answer pair with type, difficulty, and optional multiple-choice fields
- `BiographySection`: structured subsection of a persona biography
- `Statement`: supporting claim record used by downstream tooling
`Statement` includes `sid`, `category`, `claim`, `support`, and `confidence`.

`QAPair` includes `qid`, `type`, `question`, `answer`, `difficulty`, `answer_format`, `choices`, `correct_choice_index`, `evidence_sids`, and `tags`.
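The exact type declarations aren't given here; this is a minimal dataclass sketch of the two leaf records, with field names taken from the lists above and all types assumed:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Statement:
    sid: str
    category: str
    claim: str
    support: str
    confidence: float  # type assumed

@dataclass
class QAPair:
    qid: str
    type: str                                    # "explicit" or "implicit"
    question: str
    answer: str
    difficulty: int                              # type assumed
    answer_format: str
    choices: Optional[list[str]] = None          # multiple-choice only
    correct_choice_index: Optional[int] = None   # multiple-choice only
    evidence_sids: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
```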
## Persona fields
`PersonaData` includes:

- `id`
- `persona`
- `templated_view`
- `biography_view`
- `statements_view`
- `sections`
- `statements`
It also exposes convenience helpers:

- `get_section(section_id)`
- `get_sections_by_category(category)`
- `sections_by_id`
- `sections_by_category`
- `section_categories`

`name` is available as a derived property; a short usage sketch follows.
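A brief usage sketch of these helpers. The section ID `"bio_01"` and category `"hobbies"` are hypothetical values for illustration, not known identifiers from the dataset:

```python
persona = dataset[0]

print(persona.name)  # derived property

section = persona.get_section("bio_01")                        # hypothetical section id
hobby_sections = persona.get_sections_by_category("hobbies")   # hypothetical category

print(persona.section_categories)  # all categories present in this persona
```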
## Queries
```python
persona = dataset[0]

# All QA pairs for a persona, optionally filtered by type and difficulty.
qa_pairs = dataset.get_qa(persona.id)
qa_pairs = dataset.get_qa(persona.id, type="explicit", difficulty=[2, 3])

# Question strings only.
questions = dataset.questions(persona.id, type="implicit")

# Direct lookup by persona ID.
loaded_persona = dataset.get_persona("p1")
```
`get_qa()` returns typed `QAPair` records; `questions()` returns question strings only.
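Since the loader is aimed at prompt generation, here is one way to combine these queries, assuming `biography_view` is a plain-text string (not confirmed above):

```python
persona = dataset[0]

for qa in dataset.get_qa(persona.id, type="explicit", difficulty=[2, 3]):
    # biography_view assumed to be a plain-text string
    prompt = f"{persona.biography_view}\n\nQuestion: {qa.question}"
    print(prompt)
    print("Expected answer:", qa.answer)
```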
## Notes
- `difficulty` accepts a single level or a list of levels (see the sketch below).
- `type` can be `"explicit"` or `"implicit"`.
- `sample_size` limits how many personas are kept in memory; it keeps a leading slice rather than sampling randomly.
- The loader keeps the dataset eager and notebook-friendly rather than streaming.
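As a rough illustration of the flexible `difficulty` argument, one way the filter could normalize its input; this is a sketch, not the library's actual implementation:

```python
from typing import Iterable, Optional, Union

Difficulty = Union[int, Iterable[int]]

def _difficulty_set(difficulty: Optional[Difficulty]) -> Optional[set[int]]:
    """Normalize a single level or a list of levels into a set (None = no filter)."""
    if difficulty is None:
        return None
    if isinstance(difficulty, int):
        return {difficulty}
    return set(difficulty)

# Both calls express the same filter.
assert _difficulty_set(2) == {2}
assert _difficulty_set([2, 3]) == {2, 3}
```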