[2025.10] 🌟 AcademicEval was released.
[2025.09] 🎉 AcademicEval was accepted by TMLR 2025.
We proposed AcademicEval, a live benchmark for evaluating LLMs on long-context generation tasks. AcademicEval adopts papers from arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling.
Compared with existing long-context LLM benchmarks, our AcademicEval offers flexible length, automatic annotation, hierarchical abstraction, few-shot demonstrations, and live updates without data leakage risks.
Benchmark | Avg Len | Automatic Annotation | Hierarchical Abstraction | Few-shot Demonstrations | Live Update |
---|---|---|---|---|---|
ZeroSCROLLS (Shaham et al., 2023) | ~10K | ✓ | ✘ | ✘ | ✘ |
L-Eval (An et al., 2023) | ~8K | ✘ | ✘ | ✘ | ✘ |
BAMBOO (Dong et al., 2023) | ~16K | ✘ | ✘ | ✘ | ✘ |
LongBench (Bai et al., 2023) | ~8K | ✘ | ✘ | ✓ | ✘ |
LooGLE (Li et al., 2023) | ~20K | ✘ | ✘ | ✘ | ✘ |
∞Bench (Zhang et al., 2024) | ~200K | ✘ | ✘ | ✘ | ✘ |
AcademicEval (ours) | Flexible | ✓ | ✓ | ✓ | ✓ |
❗❗❗You can download our collected data at AcademicEval
# python==3.10
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install arxiv
pip install tqdm
pip install rouge_score
pip install textstat
pip install transformers
pip install langchain
pip install PyMuPDF
pip install faiss-gpu
pip install openai==0.28.0
We additionally need the tokenizer configuration files for LLMs to ensure correct and accurate truncation.
You only need to download the tokenizer configuration files for each LLM; no model weight files are needed, since we access the LLMs through an API. Please place the downloaded files in the "gemma", "llama", "qwen", "mixtral", and "hermes" directories, respectively.
❗We have integrated these files in our repository.
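For reference, truncating an input with a local tokenizer configuration might look like the sketch below. The directory name `llama` matches the layout shown later; the helper itself is only illustrative and is not the exact code in utils.py.

```python
from transformers import AutoTokenizer

# Load only the tokenizer from the local config directory (no model weights needed).
tokenizer = AutoTokenizer.from_pretrained("./llama")

def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Illustrative helper: keep at most `max_tokens` tokens of the input."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids[:max_tokens])
```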
❗Note: Since we use the LLM API provided by together.ai to access the LLMs, you need to set your own API key in the "get_llm_response_via_api" function in utils.py.
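If you are wiring up the API call yourself, a minimal sketch with `openai==0.28.0` pointed at Together's OpenAI-compatible endpoint might look like the following; the endpoint URL and parameters here are assumptions, so adapt them to "get_llm_response_via_api" in utils.py.

```python
import openai

# Assumed Together-compatible configuration; replace with your own key.
openai.api_key = "YOUR_TOGETHER_API_KEY"
openai.api_base = "https://api.together.xyz/v1"

response = openai.ChatCompletion.create(
    model="meta-llama/Llama-3-70b-chat-hf",
    messages=[{"role": "user", "content": "Write a title for this paper: ..."}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])
```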
❗Please ensure that the AcademicEval data is downloaded into the "AcademicEval" directory. The directory layout should look like the following:
├── README.md
├── abs_extractor.py
├── bart_score.py
├── construct_relation_graph.py
├── exp_comparison.py
├── main.py
├── model.png
├── refine_graph.py
├── related_extractor.py
├── retrieval.py
├── section_region_extractor.py
├── utils.py
├── gemma
│ ├── ...
├── llama
│ ├── ...
├── qwen
│ ├── ...
├── mixtral
│ ├── ...
├── hermes
│ ├── ...
├── AcademicEval
│ ├── abs_9K
│ ├── abs_28K
│ ├── abs_29K_G
│ ├── intro_8K
│ ├── intro_28K
│ ├── intro_28K_G
│ ├── related_34K
│ ├── related_53K
│ ├── related_53K_G
│ ├── title_10K
│ ├── title_30K
│ └── title_31K_G
Here are some example commands. You can run all the experiments by changing "--llm_model" and "--setting", or by adding "--rag" and "--retriever"; a rough sketch of the retrieval step appears after the command list below.
# Standard LLMs
python exp_comparison.py --setting title_10K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting title_10K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting title_10K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_10K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_10K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_10K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_10K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting title_30K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_30K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_30K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_30K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_30K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting title_31K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_31K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_31K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_31K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_31K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Standard LLMs
python exp_comparison.py --setting abs_9K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting abs_9K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting abs_9K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_9K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_9K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_9K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_9K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting abs_28K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_28K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_28K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_28K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_28K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting abs_29K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_29K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_29K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_29K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_29K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Standard LLMs
python exp_comparison.py --setting intro_8K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting intro_8K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting intro_8K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_8K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_8K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_8K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_8K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting intro_28K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_28K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_28K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_28K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_28K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting intro_28K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_28K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_28K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_28K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_28K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Standard LLMs
python exp_comparison.py --setting related_34K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting related_34K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting related_34K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting related_34K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting related_34K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting related_34K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_34K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# RALM
python exp_comparison.py --setting related_53K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_53K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# RALM
python exp_comparison.py --setting related_53K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_53K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
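When "--rag" is enabled, the long input is retrieved over rather than fed to the model in full. Below is a minimal sketch of contriever-style retrieval with FAISS; the chunking, pooling, and top-k choices are assumptions for illustration, and retrieval.py remains the authoritative implementation.

```python
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    # Mean-pool token embeddings, masking out padding (standard contriever usage).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Illustrative chunks of a long paper; in practice these come from the benchmark input.
chunks = ["chunk 1 of the paper ...", "chunk 2 ...", "chunk 3 ..."]
chunk_vecs = embed(chunks)
index = faiss.IndexFlatIP(chunk_vecs.shape[1])
index.add(chunk_vecs)

# Retrieve the top-2 chunks for the task query and use them as the LLM context.
_, top = index.search(embed(["write the title of this paper"]), 2)
context = "\n".join(chunks[i] for i in top[0])
```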
We give a general example of constructing the AcademicEval benchmark in this section.
Note: the initial collection process will be time-consuming.
We first collect a co-author graph via the arXiv API. You should set "YOUR START AUTHOR" in construct_relation_graph.py.
Then, run the following command to start BFS.
python construct_relation_graph.py
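The BFS over co-authors can be pictured roughly as follows. This is a sketch using the `arxiv` package; the query format, result limit, and depth limit are illustrative and not the exact logic of construct_relation_graph.py.

```python
import arxiv
from collections import deque

client = arxiv.Client()

def coauthors(author_name, max_results=20):
    """Fetch recent papers by an author and return their co-authors."""
    search = arxiv.Search(query=f'au:"{author_name}"', max_results=max_results)
    names = set()
    for paper in client.results(search):
        names.update(a.name for a in paper.authors)
    names.discard(author_name)
    return names

# Breadth-first expansion starting from a seed author, limited to a small depth.
start, max_depth = "YOUR START AUTHOR", 2
visited, queue, edges = {start}, deque([(start, 0)]), []
while queue:
    author, depth = queue.popleft()
    if depth >= max_depth:
        continue
    for co in coauthors(author):
        edges.append((author, co))
        if co not in visited:
            visited.add(co)
            queue.append((co, depth + 1))
```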
The collected graph may have many defects. Therefore, we provide a complete pipeline for refining the collected graph (including connectivity detection, chronological split, etc.)
python refine_graph.py
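As a rough picture of what the refinement involves, the sketch below shows connectivity detection and a chronological split. It is illustrative only: `networkx` is not in the dependency list above, the record fields are assumptions, and refine_graph.py is the authoritative implementation.

```python
import networkx as nx

# Illustrative co-author edges (author pairs) from the collection step.
edges = [("A. Author", "B. Author"), ("B. Author", "C. Author"), ("D. Author", "E. Author")]

# Connectivity detection: keep only the largest connected component.
G = nx.Graph(edges)
largest = max(nx.connected_components(G), key=len)
G = G.subgraph(largest).copy()

# Chronological split: papers before a cutoff date form one split, the rest are held out.
papers = [{"id": "2401.00001", "date": "2024-01-01"}, {"id": "2507.00002", "date": "2025-07-01"}]
cutoff = "2025-01-01"
train = [p for p in papers if p["date"] < cutoff]
test = [p for p in papers if p["date"] >= cutoff]
```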
You can refer to live_update.py for updating the collected co-author graph.
- GoR: Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs.
- Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts.
@article{AcademicEval,
title={AcademicEval: Live Long-Context LLM Benchmark},
author={Haozhen Zhang and Tao Feng and Pengrui Han and Jiaxuan You},
journal={arXiv preprint arXiv:2510.17725},
year={2025}
}