
SynthPersona

SynthPersonaDataset loads persona profiles plus QA pairs from Hugging Face and exposes a small in-memory API for analysis and prompt generation.

Loader

from persona_data.synth_persona import SynthPersonaDataset

dataset = SynthPersonaDataset()                # load every persona
small = SynthPersonaDataset(sample_size=100)   # keep only the first 100 personas

The default dataset source is implicit-personalization/synth-persona. The loader reads two JSONL files:

  • dataset_personas.jsonl
  • dataset_qa.jsonl

sample_size keeps the first sample_size personas in file order and drops every QA row whose persona is not among them.
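The slicing and filtering behaviour can be sketched in plain Python. This is not the library's implementation, only an illustration under assumptions: load_leading is a hypothetical helper, the in-memory strings stand in for the two JSONL files, and persona_id is an assumed name for the field that links a QA row to its persona.

```python
import io
import json


def load_leading(personas_lines, qa_lines, sample_size=None):
    """Parse two JSONL streams, keep the first `sample_size` personas,
    and drop QA rows that belong to personas that were not kept."""
    personas = [json.loads(line) for line in personas_lines if line.strip()]
    if sample_size is not None:
        personas = personas[:sample_size]  # leading slice, not a random sample
    kept_ids = {p["id"] for p in personas}
    qa_rows = [json.loads(line) for line in qa_lines if line.strip()]
    # "persona_id" is an assumed field name for the QA-to-persona link.
    qa_rows = [row for row in qa_rows if row["persona_id"] in kept_ids]
    return personas, qa_rows


# Made-up stand-ins for dataset_personas.jsonl and dataset_qa.jsonl.
personas_file = io.StringIO('{"id": "p1"}\n{"id": "p2"}\n{"id": "p3"}\n')
qa_file = io.StringIO(
    '{"qid": "q1", "persona_id": "p1"}\n'
    '{"qid": "q2", "persona_id": "p3"}\n'
    '{"qid": "q3", "persona_id": "p2"}\n'
)

personas, qa_rows = load_leading(personas_file, qa_file, sample_size=2)
print([p["id"] for p in personas])   # ['p1', 'p2']
print([r["qid"] for r in qa_rows])   # ['q1', 'q3']
```

Because the slice is taken before the QA filter, the result is deterministic for a given file order.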

Records

  • PersonaData: top-level persona record
  • QAPair: question-answer pair with type, difficulty, and optional multiple-choice fields
  • BiographySection: structured subsection of a persona biography
  • Statement: supporting claim record used by downstream tooling

Statement includes sid, category, claim, support, and confidence.

QAPair includes qid, type, question, answer, difficulty, answer_format, choices, correct_choice_index, evidence_sids, and tags.
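To make the field lists concrete, here is a rough sketch of the two records as dataclasses. The field names come from the lists above; the class names, types, defaults, and sample values are assumptions for illustration, as is the reading that evidence_sids refers to Statement sid values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StatementSketch:
    sid: str
    category: str
    claim: str
    support: str
    confidence: float  # assumed to be numeric


@dataclass
class QAPairSketch:
    qid: str
    type: str
    question: str
    answer: str
    difficulty: int
    answer_format: str = "free_text"            # assumed default value
    choices: Optional[List[str]] = None         # only set for multiple choice
    correct_choice_index: Optional[int] = None  # only set for multiple choice
    evidence_sids: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


def evidence_for(qa, statements):
    """Return the statements whose sid appears in qa.evidence_sids
    (assumes evidence_sids holds Statement sid values)."""
    by_sid = {s.sid: s for s in statements}
    return [by_sid[sid] for sid in qa.evidence_sids if sid in by_sid]


# Made-up sample values, not taken from the dataset.
statements = [
    StatementSketch("s1", "work", "Works as a nurse", "biography", 0.9),
    StatementSketch("s2", "hobby", "Enjoys hiking", "biography", 0.7),
]
qa = QAPairSketch(
    qid="q1",
    type="explicit",
    question="What is this person's job?",
    answer="Nurse",
    difficulty=1,
    evidence_sids=["s1"],
)
supporting = evidence_for(qa, statements)
```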

Persona fields

PersonaData includes:

  • id
  • persona
  • templated_view
  • biography_view
  • statements_view
  • sections
  • statements

It also exposes convenience helpers:

  • get_section(section_id)
  • get_sections_by_category(category)
  • sections_by_id
  • sections_by_category
  • section_categories

name is also available as a derived property.
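One plausible way to back lookups such as sections_by_id and sections_by_category is a pair of dictionaries built once from the section list. The sketch below is an assumption about the general shape, not the library's code: SectionSketch, its section_id, category, and text fields, and index_sections are all hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SectionSketch:
    section_id: str  # hypothetical field name
    category: str    # hypothetical field name
    text: str        # hypothetical field name


def index_sections(sections):
    """Build an id -> section map and a category -> [sections] map."""
    by_id = {s.section_id: s for s in sections}
    by_category = defaultdict(list)
    for s in sections:
        by_category[s.category].append(s)
    return by_id, dict(by_category)


# Made-up sections for illustration.
sections = [
    SectionSketch("sec1", "career", "Trained as a nurse."),
    SectionSketch("sec2", "family", "Has two siblings."),
    SectionSketch("sec3", "career", "Moved into hospital management."),
]
by_id, by_category = index_sections(sections)
categories = sorted(by_category)  # analogous to section_categories
```

With these maps, a lookup by ID is a single dictionary access, and a lookup by category returns sections in their original order.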

Queries

persona = dataset[0]

qa_pairs = dataset.get_qa(persona.id)
qa_pairs = dataset.get_qa(persona.id, type="explicit", difficulty=[2, 3])

questions = dataset.questions(persona.id, type="implicit")
loaded_persona = dataset.get_persona("p1")

get_qa() returns typed QAPair records. questions() returns question strings only.
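The filtering semantics shown above can be sketched without the library. filter_qa and question_texts are hypothetical stand-ins that operate on plain dictionaries; they only illustrate how a type filter and a difficulty filter that accepts one level or a list of levels might combine.

```python
def filter_qa(rows, type=None, difficulty=None):
    """Keep rows matching the optional type and difficulty filters.
    `difficulty` may be a single level or an iterable of levels."""
    if difficulty is None:
        levels = None
    elif isinstance(difficulty, int):
        levels = {difficulty}
    else:
        levels = set(difficulty)
    return [
        row
        for row in rows
        if (type is None or row["type"] == type)
        and (levels is None or row["difficulty"] in levels)
    ]


def question_texts(rows, **filters):
    """Return only the question strings of the matching rows."""
    return [row["question"] for row in filter_qa(rows, **filters)]


# Made-up rows for illustration.
rows = [
    {"question": "Where do they work?", "type": "explicit", "difficulty": 1},
    {"question": "What do they study?", "type": "explicit", "difficulty": 2},
    {"question": "What gift suits them?", "type": "implicit", "difficulty": 3},
]

explicit_hard = filter_qa(rows, type="explicit", difficulty=[2, 3])
implicit_questions = question_texts(rows, type="implicit")
print(implicit_questions)  # ['What gift suits them?']
```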

Notes

  • difficulty accepts one level or a list of levels.
  • type can be "explicit" or "implicit".
  • sample_size limits how many personas are kept in memory. It keeps a leading slice rather than sampling randomly.
  • The loader reads the whole dataset eagerly into memory, which suits notebook use. It does not stream.