Datasets

HallucinationSample

`HallucinationSample(prompt, answer, labels, split, task_type, dataset, language, context_modality='prose', category=None, subcategory=None, context=None, question=None, metadata=dict())` `dataclass`

A single hallucination detection sample.

Attributes:

- `prompt`: The full model input string. The question is placed at the front (`User request: {question}\n\n{context}`) so that it is never lost to context truncation. Training tokenizes this string directly.
- `context`: The grounding evidence alone (passages, source files, or tool output), without the question. `None` for legacy data.
- `question`: The user request alone, or `None` for summarization-style tasks. It is kept separate from `context` so that it can be reformatted freely and so that question tokens are addressable (e.g. for omission detection).
- `answer`: The LLM-generated answer to check for hallucinations.
- `labels`: List of span annotations. Each dict has `start` and `end` (character offsets within `answer`), `label` (native source label), and optionally `category` and `subcategory` (v2 unified taxonomy fields). Empty list for clean samples.
- `split`: Dataset split (`train`, `dev`, or `test`).
- `task_type`: Task type (e.g. `summarization`, `qa`, `code_generation`).
- `dataset`: Source dataset name.
- `language`: Language code.
- `context_modality`: Modality of the retrieved context (`prose`, `code`, or `markdown`).
- `category`: Top-level taxonomy category for the sample (v2). `None` for clean samples.
- `subcategory`: Optional fine-grained sub-type within the category (v2).
- `metadata`: Arbitrary source-specific provenance fields (e.g. `instance_id`, `repo`).
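As a rough illustration of how these fields relate, the sketch below assembles a prompt in the documented layout and derives one span annotation. All values are made up, the `"hallucinated"` label string is hypothetical, and treating `end` as exclusive (as in Python slicing) is an assumption; the attribute list does not say.

```python
# Illustrative values only; the field names follow the attribute list above.
question = "What is the capital of France?"
context = "France is a country in Europe. Its capital is Paris."
answer = "The capital of France is Paris. It has 50 million residents."

# Question first, then the grounding context, per the documented layout.
prompt = f"User request: {question}\n\n{context}"

# Span offsets are character positions within `answer`, not within `prompt`.
unsupported = "It has 50 million residents."
start = answer.index(unsupported)
end = start + len(unsupported)  # assumption: `end` is exclusive

labels = [{"start": start, "end": end, "label": "hallucinated"}]  # hypothetical label value

# Slicing the answer with the offsets recovers the annotated text.
recovered = answer[labels[0]["start"]:labels[0]["end"]]
print(recovered)
```

A clean sample would carry the same `prompt` and `answer` fields with `labels` set to an empty list.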

`to_json()`

Serialize to a JSON-compatible dict.

`from_json(json_dict)` `classmethod`

Deserialize from a JSON dict.
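The round trip these two methods suggest can be sketched with a stand-in class. `SampleSketch` below is hypothetical and carries only a subset of the documented fields; its method bodies are an assumption about how such a pair is commonly written, not the library's implementation.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SampleSketch:
    """Hypothetical stand-in with a subset of the documented fields."""

    prompt: str
    answer: str
    labels: list
    split: str
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> dict:
        # asdict() recursively converts the dataclass into plain dicts and lists.
        return asdict(self)

    @classmethod
    def from_json(cls, json_dict: dict) -> "SampleSketch":
        # Keys of the dict map one-to-one onto constructor arguments.
        return cls(**json_dict)


original = SampleSketch(
    prompt="User request: Who wrote it?\n\nThe report was written by Ana.",
    answer="Ana wrote it in 1990.",
    labels=[{"start": 13, "end": 21, "label": "hallucinated"}],
    split="train",
)

# Serialize to a JSON string and back again.
line = json.dumps(original.to_json())
restored = SampleSketch.from_json(json.loads(line))
```

Because every field is a plain string, list, or dict, the restored object compares equal to the original.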

HallucinationData

`HallucinationData(samples)` `dataclass`

A collection of hallucination detection samples.

Attributes: samples: List of `HallucinationSample` instances.

`to_json()`

Serialize all samples to a JSON-compatible list.

`from_json(json_dict)` `classmethod`

Deserialize from a list of JSON dicts.

HallucinationDataset

`HallucinationDataset(samples, tokenizer, max_length=4096)`

Bases: Dataset

Dataset that wraps hallucination detection samples and tokenizes them for training.

Initialize the dataset.

Parameters:

Name	Type	Description	Default
`samples`	`list[HallucinationSample]`	List of HallucinationSample objects.	required
`tokenizer`	`AutoTokenizer`	Tokenizer to use for encoding the data.	required
`max_length`	`int`	Maximum length of the input sequence.	`4096`

`__len__()`

Return the number of samples in the dataset.

`prepare_tokenized_input(tokenizer, context, answer, max_length=4096)` `classmethod`

Tokenize the context and answer, compute the token index at which the answer starts, and initialize the labels list with -100 for context tokens and 0 for answer tokens.

Parameters:

Name	Type	Description	Default
`tokenizer`	`AutoTokenizer`	The tokenizer to use.	required
`context`	`str`	The context string.	required
`answer`	`str`	The answer string.	required
`max_length`	`int`	Maximum input sequence length.	`4096`

Returns:

Type	Description
`tuple[dict[str, Tensor], list[int], Tensor, int]`	A tuple `(encoding, labels, offsets, answer_start_token)`: `encoding` is a dict of tokenized inputs without the offset mapping; `labels` is the list of initial token labels; `offsets` holds the offset mapping for each token (a tensor of shape `[seq_length, 2]`); `answer_start_token` is the index where answer tokens begin.
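To make the label initialization concrete, here is a minimal sketch that uses a toy whitespace tokenizer in place of a real one. `toy_tokenize` is hypothetical, special tokens and truncation are ignored, and the final loop that marks tokens overlapping an annotated span is an assumed follow-up step rather than something this method is documented to do. The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, so context positions would not contribute to the loss.

```python
import re


def toy_tokenize(text):
    """Toy stand-in for a tokenizer: one (start, end) character offset pair per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


context = "User request: Who wrote it?\n\nThe report was written by Ana."
answer = "Ana wrote it in 1990."

context_offsets = toy_tokenize(context)
answer_offsets = toy_tokenize(answer)

# Context tokens come first, so the answer starts right after them.
answer_start_token = len(context_offsets)

# -100 marks context positions to ignore; 0 is the initial label for answer tokens.
labels = [-100] * len(context_offsets) + [0] * len(answer_offsets)

# In this toy, answer offsets are relative to the answer string itself.
offsets = context_offsets + answer_offsets

# Assumed follow-up: flag answer tokens that overlap an annotated character span.
span_start = answer.index("in 1990.")
span = {"start": span_start, "end": span_start + len("in 1990.")}
for i in range(answer_start_token, len(labels)):
    tok_start, tok_end = offsets[i]
    if tok_start < span["end"] and tok_end > span["start"]:
        labels[i] = 1
```

With a real subword tokenizer the token counts and offsets would differ, but the same overlap test on character offsets would apply.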

`__getitem__(idx)`

Get an item from the dataset.

Parameters:

Name	Type	Description	Default
`idx`	`int`	Index of the item to get.	required

Returns:

Type	Description
`dict[str, Tensor]`	Dictionary with input IDs, attention mask, and labels.
