Datasets
HallucinationSample
HallucinationSample(prompt, answer, labels, split, task_type, dataset, language, context_modality='prose', category=None, subcategory=None, context=None, question=None, metadata=dict())
dataclass
A single hallucination detection sample.
Attributes:
| Name | Description |
|---|---|
| prompt | The full model input string. The question is placed at the front (User request: {question}\n\n{context}) so it is never lost to context truncation. Training tokenizes this directly. |
| context | The grounding evidence alone (passages / source files / tool output), without the question. None for legacy data. |
| question | The user request alone, or None for summarization-style tasks. Separated from context so it can be reformatted freely and so question tokens are addressable (e.g. for omission detection). |
| answer | The LLM-generated answer to check for hallucinations. |
| labels | List of span annotations. Each dict has start and end (character offsets within answer), label (native source label), and optionally category and subcategory (v2 unified taxonomy fields). Empty list for clean samples. |
| split | Dataset split (train, dev, or test). |
| task_type | Task type (e.g. summarization, qa, code_generation). |
| dataset | Source dataset name. |
| language | Language code. |
| context_modality | Modality of the retrieved context (prose, code, markdown). |
| category | Top-level taxonomy category for the sample (v2). None for clean samples. |
| subcategory | Optional fine-grained sub-type within the category (v2). |
| metadata | Arbitrary source-specific provenance fields (e.g. instance_id, repo). |
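To make the field layout concrete, here is a minimal sketch of building one sample. It uses `SampleSketch`, a hypothetical stand-in dataclass that mirrors the documented signature, so the snippet runs without the library installed. The label value, the dataset name, and the treatment of `end` as an exclusive offset (Python slice convention) are assumptions made for illustration, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleSketch:
    """Hypothetical stand-in mirroring the documented HallucinationSample fields."""

    prompt: str
    answer: str
    labels: list
    split: str
    task_type: str
    dataset: str
    language: str
    context_modality: str = "prose"
    category: Optional[str] = None
    subcategory: Optional[str] = None
    context: Optional[str] = None
    question: Optional[str] = None
    metadata: dict = field(default_factory=dict)


question = "What is the capital of France?"
context = "France is a country in Europe. Its capital is Paris."
answer = "The capital of France is Lyon."

# Question first, then context, following the documented prompt layout.
prompt = f"User request: {question}\n\n{context}"

# Span offsets are character positions within `answer`, not within `prompt`.
# Assumption: `end` is exclusive, as in Python slicing.
start = answer.index("Lyon")
end = start + len("Lyon")

sample = SampleSketch(
    prompt=prompt,
    answer=answer,
    labels=[{"start": start, "end": end, "label": "hallucinated"}],  # illustrative label value
    split="train",
    task_type="qa",
    dataset="example",  # made-up dataset name
    language="en",
    context=context,
    question=question,
)

# Slicing the answer with the span offsets recovers the annotated text.
print(sample.answer[start:end])
```

Keeping `question` and `context` as separate fields alongside `prompt` means the prompt can be rebuilt in a different layout later without re-parsing the string.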
HallucinationData
HallucinationData(samples)
dataclass
HallucinationDataset
HallucinationDataset(samples, tokenizer, max_length=4096)
Bases: Dataset
Dataset for Hallucination data.
Initialize the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| samples | list[HallucinationSample] | List of HallucinationSample objects. | required |
| tokenizer | AutoTokenizer | Tokenizer to use for encoding the data. | required |
| max_length | int | Maximum length of the input sequence. | 4096 |
__len__()
Return the number of samples in the dataset.
prepare_tokenized_input(tokenizer, context, answer, max_length=4096)
classmethod
Tokenize the context and answer, then compute the answer start token index and initialize a labels list (-100 for context tokens, 0 for answer tokens).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | AutoTokenizer | The tokenizer to use. | required |
| context | str | The context string. | required |
| answer | str | The answer string. | required |
| max_length | int | Maximum input sequence length. | 4096 |
Returns:
| Type | Description |
|---|---|
| tuple[dict[str, Tensor], list[int], Tensor, int] | A tuple containing: encoding, a dict of tokenized inputs without the offset mapping; labels, a list of initial token labels; offsets, the offset mappings for each token (a tensor of shape [seq_length, 2]); and answer_start_token, the index where answer tokens begin. |
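The label initialization described above can be sketched in plain Python. The helper `init_labels` is hypothetical and only illustrates the documented -100 / 0 convention; it does not tokenize anything and ignores details such as special tokens, which this page does not specify. For reference, -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it normally do not contribute to the loss.

```python
IGNORE_INDEX = -100  # value the documentation gives for context tokens


def init_labels(seq_length: int, answer_start_token: int) -> list:
    """Return -100 for positions before the answer and 0 from the answer onward."""
    # Clamp so a start index past the end yields an all-context label list.
    answer_start_token = min(answer_start_token, seq_length)
    context_part = [IGNORE_INDEX] * answer_start_token
    answer_part = [0] * (seq_length - answer_start_token)
    return context_part + answer_part


labels = init_labels(seq_length=8, answer_start_token=5)
print(labels)  # [-100, -100, -100, -100, -100, 0, 0, 0]
```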
__getitem__(idx)
Get an item from the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index of the item to get. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Tensor] | Dictionary with input IDs, attention mask, and labels. |
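This page does not say how the returned labels are derived from a sample's span annotations. One plausible approach, shown here only as a sketch, is to mark every answer token whose character range overlaps an annotated span. The helper `label_answer_tokens` is hypothetical, and it assumes that label 1 means "hallucinated", that span `end` offsets are exclusive, and that offsets for answer tokens are relative to the answer string.

```python
def label_answer_tokens(labels, offsets, answer_start_token, spans):
    """Set label 1 on answer tokens that overlap any annotated span.

    labels:  initial labels (-100 for context tokens, 0 for answer tokens)
    offsets: (start, end) character offsets per token; for answer tokens
             these are assumed to be relative to the answer string
    spans:   dicts with "start" and "end" character offsets within the answer
    """
    out = list(labels)
    for i in range(answer_start_token, len(out)):
        tok_start, tok_end = offsets[i]
        if tok_start == tok_end:
            continue  # zero-width entry (e.g. a special token): leave as is
        for span in spans:
            # Half-open interval overlap test.
            if tok_start < span["end"] and tok_end > span["start"]:
                out[i] = 1
                break
    return out


# Toy example: three context tokens, then the answer "The capital is Lyon."
answer = "The capital is Lyon."
offsets = [(0, 4), (5, 9), (10, 14),                        # context tokens
           (0, 3), (4, 11), (12, 14), (15, 19), (19, 20)]   # answer tokens
initial = [-100, -100, -100, 0, 0, 0, 0, 0]
spans = [{"start": 15, "end": 19}]  # covers "Lyon"

token_labels = label_answer_tokens(initial, offsets, answer_start_token=3, spans=spans)
print(token_labels)  # [-100, -100, -100, 0, 0, 0, 1, 0]
```

Only the token for "Lyon" is flagged; the trailing period starts exactly where the span ends, so it falls outside the half-open interval.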