Dataset and code from the paper [The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers](https://arxiv.org/abs/2510.11218).
This repository contains the following:
- The gold short-/long-form dataset, in the `dataset` folder.
- Inference scripts for evaluating your LLM on the dataset.
- Evaluation scripts that use an LLM-as-a-judge (Gemini) to compute factual accuracy and alignment scores (a toy illustration follows this list).
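
For orientation, here is a minimal sketch of how the paired dataset might be loaded and inspected. The file name `dataset/slaq.jsonl` and the field names `question_short`, `question_long`, and `gold_answer` are illustrative assumptions; check the `dataset` folder for the actual file names and schema.

```python
import json
from pathlib import Path

# Hypothetical file name -- check the dataset folder for the real one.
DATASET_PATH = Path("dataset/slaq.jsonl")

def load_pairs(path: Path) -> list[dict]:
    """Load short-/long-form question pairs from a JSON Lines file."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

pairs = load_pairs(DATASET_PATH)
print(f"Loaded {len(pairs)} question pairs")

# Field names below are assumed: each record is expected to pair a
# short-form and a long-form question probing the same fact.
example = pairs[0]
print(example.get("question_short"), "|", example.get("question_long"))
```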
The image below shows SLAQ factual accuracy and alignment scores for the Gemma, Qwen, and Llama models. The underlying per-model results are in the `evaluation/raw_benchmarking_results` folder.
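
As a toy illustration of the two metrics, the sketch below computes factual accuracy and alignment from per-question judge verdicts. The verdict format and the definition of alignment as the rate at which short- and long-form correctness labels agree are assumptions made for this example; the paper and the scripts in the `evaluation` folder define the actual metrics.

```python
# Hypothetical judge verdicts: one (short_correct, long_correct) pair per
# question, as an LLM judge such as Gemini might label them.
verdicts = [
    (True, True),
    (True, False),   # short-form correct, long-form wrong: misaligned
    (False, False),
]

n = len(verdicts)
short_acc = sum(s for s, _ in verdicts) / n
long_acc = sum(l for _, l in verdicts) / n
# Assumed definition: alignment = fraction of questions on which the
# short- and long-form answers receive the same correctness label.
alignment = sum(s == l for s, l in verdicts) / n

print(f"short-form accuracy: {short_acc:.2f}")
print(f"long-form accuracy:  {long_acc:.2f}")
print(f"alignment:           {alignment:.2f}")
```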
If you use the dataset or code, please cite:

```bibtex
@misc{islam2025curiouscasefactualmisalignment,
  title={The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers},
  author={Saad Obaid ul Islam and Anne Lauscher and Goran Glavaš},
  year={2025},
  eprint={2510.11218},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.11218},
}
```