Datasets

HallucinationSample

HallucinationSample(prompt, answer, labels, split, task_type, dataset, language, context_modality='prose', category=None, subcategory=None, context=None, question=None, metadata=dict()) dataclass

A single hallucination detection sample.

Attributes:

prompt: The full model input string. The question is placed at the front (User request: {question}\n\n{context}) so it is never lost to context truncation. Training tokenizes this directly.
context: The grounding evidence alone (passages / source files / tool output), without the question. None for legacy data.
question: The user request alone, or None for summarization-style tasks. Separated from context so it can be reformatted freely and so question tokens are addressable (e.g. for omission detection).
answer: The LLM-generated answer to check for hallucinations.
labels: List of span annotations. Each dict has start, end (character offsets within answer), label (native source label), and optionally category and subcategory (v2 unified taxonomy fields). Empty list for clean samples.
split: Dataset split (train, dev, or test).
task_type: Task type (e.g. summarization, qa, code_generation).
dataset: Source dataset name.
language: Language code.
context_modality: Modality of the retrieved context (prose, code, markdown).
category: Top-level taxonomy category for the sample (v2). None for clean samples.
subcategory: Optional fine-grained sub-type within the category (v2).
metadata: Arbitrary source-specific provenance fields (e.g. instance_id, repo).
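To make the field layout concrete, here is a minimal sketch that builds one sample by hand. SampleSketch and build_prompt are hypothetical stand-ins covering only a subset of the documented fields, the label value "hallucinated" is a placeholder, and treating end as an exclusive offset is an assumption the text above does not confirm.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleSketch:
    """Hypothetical stand-in mirroring a subset of the documented fields."""
    prompt: str
    answer: str
    labels: list
    context: Optional[str] = None
    question: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def build_prompt(question: Optional[str], context: str) -> str:
    """Put the question first so truncating the tail cannot drop it."""
    if question is None:
        # Assumption: summarization-style samples use the context alone.
        return context
    return f"User request: {question}\n\n{context}"


context = "The Eiffel Tower is in Paris. It was completed in 1889."
question = "When was the Eiffel Tower completed?"
answer = "It was completed in 1899."

# Character offsets are measured within `answer`, not within the prompt.
start = answer.index("1899")
span = {"start": start, "end": start + len("1899"), "label": "hallucinated"}

sample = SampleSketch(
    prompt=build_prompt(question, context),
    answer=answer,
    labels=[span],
    context=context,
    question=question,
)
```

Slicing sample.answer with a span's start and end recovers the annotated text, which is a convenient sanity check when converting labels from a new source dataset.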

to_json()

Serialize to a JSON-compatible dict.

from_json(json_dict) classmethod

Deserialize from a JSON dict.

HallucinationData

HallucinationData(samples) dataclass

A collection of hallucination detection samples.

Attributes: samples: List of HallucinationSample instances.

to_json()

Serialize all samples to a JSON-compatible list.

from_json(json_dict) classmethod

Deserialize from a list of JSON dicts.
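The two to_json / from_json pairs are documented only by their input and output types, so the following is a sketch of one way such a round trip could work, not the library's implementation. SampleSketch and DataSketch are hypothetical stand-ins, and the use of dataclasses.asdict is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SampleSketch:
    """Hypothetical stand-in with a subset of the documented fields."""
    prompt: str
    answer: str
    labels: list
    split: str
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> dict:
        # asdict copies nested lists and dicts, so the result is independent.
        return asdict(self)

    @classmethod
    def from_json(cls, json_dict: dict) -> "SampleSketch":
        return cls(**json_dict)


@dataclass
class DataSketch:
    """Hypothetical stand-in for a collection of samples."""
    samples: list

    def to_json(self) -> list:
        return [s.to_json() for s in self.samples]

    @classmethod
    def from_json(cls, json_dict: list) -> "DataSketch":
        return cls(samples=[SampleSketch.from_json(d) for d in json_dict])


data = DataSketch(samples=[
    SampleSketch(prompt="User request: q\n\nctx", answer="a", labels=[], split="train"),
    SampleSketch(
        prompt="ctx only",
        answer="b",
        labels=[{"start": 0, "end": 1, "label": "x"}],
        split="test",
        metadata={"instance_id": "demo-1"},
    ),
])

# Round trip through a JSON string, as one would when writing to disk.
text = json.dumps(data.to_json())
restored = DataSketch.from_json(json.loads(text))
```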

HallucinationDataset

HallucinationDataset(samples, tokenizer, max_length=4096)

Bases: Dataset

Dataset of hallucination detection samples.

Initialize the dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| samples | list[HallucinationSample] | List of HallucinationSample objects. | required |
| tokenizer | AutoTokenizer | Tokenizer to use for encoding the data. | required |
| max_length | int | Maximum length of the input sequence. | 4096 |

__len__()

Return the number of samples in the dataset.

prepare_tokenized_input(tokenizer, context, answer, max_length=4096) classmethod

Tokenize the context and answer, compute the token index at which the answer begins, and initialize a labels list with -100 for context tokens and 0 for answer tokens.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | AutoTokenizer | The tokenizer to use. | required |
| context | str | The context string. | required |
| answer | str | The answer string. | required |
| max_length | int | Maximum input sequence length. | 4096 |

Returns:

Type: tuple[dict[str, Tensor], list[int], Tensor, int]

A tuple containing:

- encoding: A dict of tokenized inputs without offset mapping.
- labels: A list of initial token labels.
- offsets: Offset mappings for each token (as a tensor of shape [seq_length, 2]).
- answer_start_token: The index where answer tokens begin.
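The label initialization described above can be sketched without a real tokenizer. The example below substitutes a whitespace tokenizer for AutoTokenizer, returns plain lists rather than tensors, and handles max_length by simple tail truncation; toy_tokenize and prepare_input_sketch are hypothetical names, and all three simplifications are assumptions rather than the documented behavior.

```python
import re

IGNORE_INDEX = -100  # value documented for context tokens


def toy_tokenize(text):
    """Whitespace stand-in for a tokenizer: returns (token, (start, end)) pairs."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def prepare_input_sketch(context, answer, max_length=4096):
    ctx = toy_tokenize(context)
    ans = toy_tokenize(answer)

    tokens = [tok for tok, _ in ctx] + [tok for tok, _ in ans]
    # Each offset is relative to the string its token came from,
    # so answer offsets index into `answer`, not into the joined input.
    offsets = [off for _, off in ctx] + [off for _, off in ans]
    # Context positions are ignored; answer positions start as 0.
    labels = [IGNORE_INDEX] * len(ctx) + [0] * len(ans)

    # Simplification: cut the tail if the input is too long.
    tokens = tokens[:max_length]
    offsets = offsets[:max_length]
    labels = labels[:max_length]
    answer_start_token = min(len(ctx), max_length)
    return tokens, labels, offsets, answer_start_token


tokens, labels, offsets, answer_start_token = prepare_input_sketch(
    "Paris is in France.", "Paris is in Spain."
)
```

Keeping answer offsets relative to the answer string means the character spans stored in a sample's labels can be compared against them directly.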

__getitem__(idx)

Get an item from the dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| idx | int | Index of the item to get. | required |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Tensor] | Dictionary with input IDs, attention mask, and labels. |
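The page does not say how a sample's character-level span annotations become the token-level labels returned here. One plausible approach, sketched below, marks every answer token that overlaps an annotated span. label_answer_tokens is a hypothetical helper, the positive label value 1 is an assumption, and span ends are assumed to be exclusive.

```python
def label_answer_tokens(labels, offsets, answer_start_token, spans):
    """Return a copy of `labels` with span-overlapping answer tokens set to 1."""
    out = list(labels)
    for i in range(answer_start_token, len(out)):
        tok_start, tok_end = offsets[i]
        for span in spans:
            # Half-open interval overlap test between token and span.
            if tok_start < span["end"] and tok_end > span["start"]:
                out[i] = 1  # assumption: 1 marks a hallucinated token
                break
    return out


# Context "Paris is in France." followed by answer "Paris is in Spain.",
# tokenized on whitespace; answer offsets are relative to the answer string.
initial_labels = [-100, -100, -100, -100, 0, 0, 0, 0]
offsets = [(0, 5), (6, 8), (9, 11), (12, 19), (0, 5), (6, 8), (9, 11), (12, 18)]
spans = [{"start": 12, "end": 17, "label": "hallucinated"}]  # covers "Spain"

token_labels = label_answer_tokens(initial_labels, offsets, 4, spans)
```

Context positions keep -100 throughout, so only answer tokens can ever receive a positive label.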