Dataset and code from the paper [The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers](https://arxiv.org/abs/2510.11218).
This repository contains the following:
- The gold short-/long-form dataset, in the `dataset` folder.
- Inference scripts for evaluating your LLM on the dataset.
- Evaluation scripts that use an LLM-as-a-judge (Gemini) to compute factual accuracy and alignment scores (a toy illustration follows this list).
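
For orientation, here is a minimal sketch of how the paired dataset might be loaded and inspected. The file name `dataset/slaq.jsonl` and the field names `question_short`, `question_long`, and `gold_answer` are illustrative assumptions; check the `dataset` folder for the actual file names and schema.

```python
import json
from pathlib import Path

# Hypothetical file name -- check the dataset folder for the real one.
DATASET_PATH = Path("dataset/slaq.jsonl")

def load_pairs(path: Path) -> list[dict]:
    """Load short-/long-form question pairs from a JSON Lines file."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

pairs = load_pairs(DATASET_PATH)
print(f"Loaded {len(pairs)} question pairs")

# Field names below are assumed: each record is expected to pair a
# short-form and a long-form question probing the same fact.
example = pairs[0]
print(example.get("question_short"), "|", example.get("question_long"))
```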
The image below shows SLAQ factual accuracy and alignment scores for the Gemma, Qwen, and Llama models. The underlying per-model results are in the `evaluation/raw_benchmarking_results` folder.
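
As a toy illustration of the two metrics, the sketch below computes factual accuracy and alignment from per-question judge verdicts. The verdict format and the definition of alignment as the rate at which short- and long-form correctness labels agree are assumptions made for this example; the paper and the scripts in the `evaluation` folder define the actual metrics.

```python
# Hypothetical judge verdicts: one (short_correct, long_correct) pair per
# question, as an LLM judge such as Gemini might label them.
verdicts = [
    (True, True),
    (True, False),   # short-form correct, long-form wrong: misaligned
    (False, False),
]

n = len(verdicts)
short_acc = sum(s for s, _ in verdicts) / n
long_acc = sum(l for _, l in verdicts) / n
# Assumed definition: alignment = fraction of questions on which the
# short- and long-form answers receive the same correctness label.
alignment = sum(s == l for s, l in verdicts) / n

print(f"short-form accuracy: {short_acc:.2f}")
print(f"long-form accuracy:  {long_acc:.2f}")
print(f"alignment:           {alignment:.2f}")
```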
If you use the dataset or code, please cite:

```bibtex
@misc{islam2025curiouscasefactualmisalignment,
  title={The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers},
  author={Saad Obaid ul Islam and Anne Lauscher and Goran Glavaš},
  year={2025},
  eprint={2510.11218},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.11218},
}
```