
Datasets

HallucinationSample

HallucinationSample(prompt, answer, labels, split, task_type, dataset, language) dataclass

A single hallucination detection sample.

Attributes:

- prompt: Context text (source documents, code files, documentation, user query).
- answer: The LLM-generated answer to check for hallucinations.
- labels: List of span annotations; each dict has start, end (character offsets within answer), and label keys. An empty list marks a clean sample.
- split: Dataset split (train, dev, or test).
- task_type: Task type (e.g. summarization, qa, code_generation).
- dataset: Source dataset (ragtruth, ragbench, or swebench_code).
- language: Language code.

to_json()

Serialize to a JSON-compatible dict.

from_json(json_dict) classmethod

Deserialize from a JSON dict.
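The sample record and its JSON round trip can be sketched as a minimal dataclass. The field names and the to_json/from_json behavior follow the documentation above, but this is an illustrative stand-in (the span values are made up), not the library's implementation:

```python
from dataclasses import dataclass, asdict

@dataclass
class HallucinationSample:
    """Minimal sketch of the documented sample record."""
    prompt: str
    answer: str
    labels: list   # span dicts with "start", "end", "label"; empty when clean
    split: str
    task_type: str
    dataset: str
    language: str

    def to_json(self) -> dict:
        # Serialize to a JSON-compatible dict.
        return asdict(self)

    @classmethod
    def from_json(cls, json_dict: dict) -> "HallucinationSample":
        # Deserialize from a JSON dict.
        return cls(**json_dict)

sample = HallucinationSample(
    prompt="The Eiffel Tower is 330 m tall.",
    answer="The tower is 350 m tall.",
    labels=[{"start": 13, "end": 18, "label": "hallucination"}],  # marks "350 m"
    split="train",
    task_type="qa",
    dataset="ragtruth",
    language="en",
)
assert HallucinationSample.from_json(sample.to_json()) == sample
```

The character offsets in labels index into answer, so answer[13:18] recovers the annotated span.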

HallucinationData

HallucinationData(samples) dataclass

A collection of hallucination detection samples.

Attributes:

- samples: List of HallucinationSample instances.

to_json()

Serialize all samples to a JSON-compatible list.

from_json(json_dict) classmethod

Deserialize from a list of JSON dicts.
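The collection mirrors the per-sample serialization: to_json produces a list of dicts, and from_json rebuilds the samples from that list. A minimal sketch, with a stand-in sample type rather than the real HallucinationSample:

```python
from dataclasses import dataclass, asdict

@dataclass
class Sample:
    """Stand-in for HallucinationSample, trimmed to two fields."""
    answer: str
    labels: list

@dataclass
class HallucinationData:
    """Minimal sketch of the documented sample collection."""
    samples: list

    def to_json(self) -> list:
        # Serialize all samples to a JSON-compatible list.
        return [asdict(s) for s in self.samples]

    @classmethod
    def from_json(cls, json_list: list) -> "HallucinationData":
        # Deserialize from a list of JSON dicts.
        return cls(samples=[Sample(**d) for d in json_list])

data = HallucinationData(samples=[Sample(answer="ok", labels=[])])
assert data.to_json() == [{"answer": "ok", "labels": []}]
assert HallucinationData.from_json(data.to_json()) == data
```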

HallucinationDataset

HallucinationDataset(samples, tokenizer, max_length=4096)

Bases: Dataset

Dataset for hallucination data.

Initialize the dataset.

Parameters:

- samples (list[HallucinationSample]): List of HallucinationSample objects. Required.
- tokenizer (AutoTokenizer): Tokenizer to use for encoding the data. Required.
- max_length (int): Maximum length of the input sequence. Default: 4096.

__len__()

Return the number of samples in the dataset.

prepare_tokenized_input(tokenizer, context, answer, max_length=4096) classmethod

Tokenize context and answer, compute answer start index, and initialize labels.

Computes the answer start token index and initializes a labels list (using -100 for context tokens and 0 for answer tokens).
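The initialization scheme above can be sketched without a real tokenizer: context tokens get -100 (the value PyTorch's cross-entropy loss ignores), answer tokens get 0. A toy whitespace tokenizer stands in for AutoTokenizer here, so the token counts are illustrative:

```python
def prepare_labels(context: str, answer: str) -> tuple:
    """Sketch of the label-initialization step, using whitespace tokens."""
    context_tokens = context.split()
    answer_tokens = answer.split()
    # The answer starts right after the last context token.
    answer_start_token = len(context_tokens)
    # -100 masks context tokens from the loss; 0 is the initial answer label.
    labels = [-100] * len(context_tokens) + [0] * len(answer_tokens)
    return labels, answer_start_token

labels, start = prepare_labels(
    "Paris is the capital of France.",  # 6 whitespace tokens
    "Paris is in France.",              # 4 whitespace tokens
)
assert start == 6
assert labels == [-100] * 6 + [0] * 4
```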

Parameters:

- tokenizer (AutoTokenizer): The tokenizer to use. Required.
- context (str): The context string. Required.
- answer (str): The answer string. Required.
- max_length (int): Maximum input sequence length. Default: 4096.

Returns:

tuple[dict[str, Tensor], list[int], Tensor, int]: A tuple containing:

- encoding: A dict of tokenized inputs without offset mapping.
- labels: A list of initial token labels.
- offsets: Offset mappings for each token (as a tensor of shape [seq_length, 2]).
- answer_start_token: The index where answer tokens begin.
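One natural use of the returned offsets is projecting the character-level span annotations from labels onto token labels. Marking in-span answer tokens with 1 is an illustrative assumption; the documentation only specifies the initial 0 / -100 scheme:

```python
def apply_spans(labels, offsets, spans, answer_start_token):
    """Sketch: mark answer tokens that overlap an annotated character span."""
    labels = list(labels)
    for span in spans:
        for i, (tok_start, tok_end) in enumerate(offsets):
            if i < answer_start_token:
                continue  # context tokens keep the -100 ignore label
            # Half-open interval overlap between token and annotated span.
            if tok_start < span["end"] and tok_end > span["start"]:
                labels[i] = 1  # assumption: 1 flags a hallucinated token
    return labels

# Two context tokens followed by two answer tokens; the answer-token offsets
# are character positions within the answer string, matching the labels format.
offsets = [(0, 3), (4, 7), (0, 5), (6, 10)]
initial = [-100, -100, 0, 0]
final = apply_spans(initial, offsets, [{"start": 6, "end": 10}], answer_start_token=2)
assert final == [-100, -100, 0, 1]
```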

__getitem__(idx)

Get an item from the dataset.

Parameters:

- idx (int): Index of the item to get. Required.

Returns:

dict[str, Tensor]: Dictionary with input IDs, attention mask, and labels.