Datasets

HallucinationSample

HallucinationSample(prompt, answer, labels, split, task_type, dataset, language)

dataclass

A single hallucination detection sample.
Attributes:
prompt: Context text (source documents, code files, documentation, user query).
answer: The LLM-generated answer to check for hallucinations.
labels: List of span annotations. Each dict has start, end (character offsets within answer), and label keys. Empty list for clean samples.
split: Dataset split (train, dev, or test).
task_type: Task type (e.g. summarization, qa, code_generation).
dataset: Source dataset (ragtruth, ragbench, or swebench_code).
language: Language code.
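The attribute list above can be sketched as a dataclass. This is a minimal reconstruction from the documented fields, not the library's actual source; the type annotations and the example values are assumptions.

```python
from dataclasses import dataclass

# Sketch of the HallucinationSample dataclass, reconstructed from the
# documented attributes; annotations are assumed, not taken from source.
@dataclass
class HallucinationSample:
    prompt: str          # context text (source docs, code files, user query)
    answer: str          # LLM-generated answer to check
    labels: list[dict]   # each dict: {"start": int, "end": int, "label": str}
    split: str           # "train", "dev", or "test"
    task_type: str       # e.g. "summarization", "qa", "code_generation"
    dataset: str         # "ragtruth", "ragbench", or "swebench_code"
    language: str        # language code, e.g. "en"

sample = HallucinationSample(
    prompt="The Eiffel Tower is 330 metres tall.",
    answer="The Eiffel Tower is 500 metres tall.",
    labels=[{"start": 20, "end": 35, "label": "hallucination"}],
    split="train",
    task_type="qa",
    dataset="ragtruth",
    language="en",
)
# Character offsets are relative to `answer`, so the annotated span
# can be recovered by slicing:
span = sample.answer[sample.labels[0]["start"]:sample.labels[0]["end"]]
```

Note that `start` and `end` index into `answer`, not into the concatenated prompt-plus-answer text.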
HallucinationData

HallucinationData(samples)

dataclass
HallucinationDataset

HallucinationDataset(samples, tokenizer, max_length=4096)

Bases: Dataset

Dataset of hallucination samples.

Initialize the dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `samples` | `list[HallucinationSample]` | List of HallucinationSample objects. | *required* |
| `tokenizer` | `AutoTokenizer` | Tokenizer to use for encoding the data. | *required* |
| `max_length` | `int` | Maximum length of the input sequence. | `4096` |
__len__()

Return the number of samples in the dataset.
prepare_tokenized_input(tokenizer, context, answer, max_length=4096)

classmethod

Tokenize context and answer, compute the answer start index, and initialize labels.

Computes the answer start token index and initializes a labels list (using -100 for context tokens and 0 for answer tokens).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | `AutoTokenizer` | The tokenizer to use. | *required* |
| `context` | `str` | The context string. | *required* |
| `answer` | `str` | The answer string. | *required* |
| `max_length` | `int` | Maximum input sequence length. | `4096` |

Returns:

| Type | Description |
|---|---|
| `tuple[dict[str, Tensor], list[int], Tensor, int]` | A tuple containing: `encoding`, a dict of tokenized inputs without offset mapping; `labels`, a list of initial token labels; `offsets`, offset mappings for each token (as a tensor of shape `[seq_length, 2]`); and `answer_start_token`, the index where answer tokens begin. |
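The labelling scheme can be illustrated with a toy re-implementation. A whitespace splitter stands in for the real `AutoTokenizer`, and plain Python lists stand in for the HF encoding and torch tensors; the function name mirrors the documented method, but everything else is a simplified sketch of the described logic.

```python
# Toy sketch of prepare_tokenized_input: whitespace "tokenization" replaces
# the real AutoTokenizer, and lists replace torch tensors.
def prepare_tokenized_input(context: str, answer: str, max_length: int = 4096):
    text = context + " " + answer
    tokens, offsets = [], []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end
    tokens, offsets = tokens[:max_length], offsets[:max_length]
    # The answer begins right after the context (plus the joining space).
    answer_char_start = len(context) + 1
    # First token whose span starts at or after the answer's character offset.
    answer_start_token = next(
        (i for i, (s, _) in enumerate(offsets) if s >= answer_char_start),
        len(tokens),
    )
    # -100 (ignored by the loss) for context tokens, 0 (clean) for answer tokens.
    labels = [-100] * answer_start_token + [0] * (len(tokens) - answer_start_token)
    return tokens, labels, offsets, answer_start_token
```

Using -100 for context tokens follows the PyTorch convention of `ignore_index=-100` in the cross-entropy loss, so only answer tokens contribute to training.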
__getitem__(idx)

Get an item from the dataset.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idx` | `int` | Index of the item to get. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Tensor]` | Dictionary with input IDs, attention mask, and labels. |
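The docs do not spell out how `__getitem__` maps a sample's character-offset annotations onto the token labels. A plausible sketch, given the offsets and initial labels produced by `prepare_tokenized_input`, is to flag every answer token that overlaps an annotated span; the helper name and the choice of 1 as the hallucination label are assumptions, not confirmed by the documentation.

```python
# Hypothetical helper: project character-span annotations onto token labels.
# `labels` and `offsets` are as produced by prepare_tokenized_input;
# `spans` are the sample's label dicts with answer-relative offsets.
def apply_span_labels(labels, offsets, spans, answer_char_start):
    labels = list(labels)
    for span in spans:
        # Span offsets are relative to the answer; shift to full-input coordinates.
        s = span["start"] + answer_char_start
        e = span["end"] + answer_char_start
        for i, (tok_s, tok_e) in enumerate(offsets):
            # Only answer tokens (label != -100) overlapping the span are flagged.
            if labels[i] != -100 and tok_s < e and tok_e > s:
                labels[i] = 1  # assumed positive (hallucinated) label
    return labels
```

Under this sketch, a clean sample (empty `labels` list) keeps all answer tokens at 0, matching the "Empty list for clean samples" note above.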