[2025.10] 🌟 AcademicEval was released.
[2025.09] 🎉 AcademicEval was accepted by TMLR 2025.
We proposed AcademicEval, a live benchmark for evaluating LLMs on long-context generation tasks. AcademicEval adopts papers from arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling.
Compared with existing long-context LLM benchmarks, our AcademicEval offers flexible length, automatic annotation, hierarchical abstraction, few-shot demonstrations, and live updates without data leakage risks.
Benchmark | Avg Len | Automatic Annotation | Hierarchical Abstraction | Few-shot Demonstrations | Live Update |
---|---|---|---|---|---|
ZeroSCROLLS (Shaham et al., 2023) | ~10K | ✓ | ✘ | ✘ | ✘ |
L-Eval (An et al., 2023) | ~8K | ✘ | ✘ | ✘ | ✘ |
BAMBOO (Dong et al., 2023) | ~16K | ✘ | ✘ | ✘ | ✘ |
LongBench (Bai et al., 2023) | ~8K | ✘ | ✘ | ✓ | ✘ |
LooGLE (Li et al., 2023) | ~20K | ✘ | ✘ | ✘ | ✘ |
∞Bench (Zhang et al., 2024) | ~200K | ✘ | ✘ | ✘ | ✘ |
AcademicEval (ours) | Flexible | ✓ | ✓ | ✓ | ✓ |
❗❗❗You can download our collected data at AcademicEval
# python==3.10
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install arxiv
pip install tqdm
pip install rouge_score
pip install textstat
pip install transformers
pip install langchain
pip install PyMuPDF
pip install faiss-gpu
pip install openai==0.28.0
We additionally need the tokenizer configuration files for LLMs to ensure correct and accurate truncation.
You only need to download the tokenizer configuration files for each LLM; no model weight files are needed, since we access the LLMs through an API. Please place the downloaded files in the "gemma", "llama", "qwen", "mixtral", and "hermes" directories, respectively.
❗We have integrated these files in our repository.
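For reference, truncating an input with a local tokenizer configuration might look like the sketch below. The directory name `llama` matches the layout shown later; the helper itself is only illustrative and is not the exact code in utils.py.

```python
from transformers import AutoTokenizer

# Load only the tokenizer from the local config directory (no model weights needed).
tokenizer = AutoTokenizer.from_pretrained("./llama")

def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Illustrative helper: keep at most `max_tokens` tokens of the input."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids[:max_tokens])
```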
❗Note: Since we use the LLM API provided by together.ai to access the LLMs, you need to set your own API key in the "get_llm_response_via_api" function in utils.py.
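If you are wiring up the API call yourself, a minimal sketch with `openai==0.28.0` pointed at Together's OpenAI-compatible endpoint might look like the following; the endpoint URL and parameters here are assumptions, so adapt them to "get_llm_response_via_api" in utils.py.

```python
import openai

# Assumed Together-compatible configuration; replace with your own key.
openai.api_key = "YOUR_TOGETHER_API_KEY"
openai.api_base = "https://api.together.xyz/v1"

response = openai.ChatCompletion.create(
    model="meta-llama/Llama-3-70b-chat-hf",
    messages=[{"role": "user", "content": "Write a title for this paper: ..."}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])
```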
❗Please ensure that the AcademicEval data is downloaded into the "AcademicEval" directory. The directory layout should look like the following:
├── README.md
├── abs_extractor.py
├── bart_score.py
├── construct_relation_graph.py
├── exp_comparison.py
├── main.py
├── model.png
├── refine_graph.py
├── related_extractor.py
├── retrieval.py
├── section_region_extractor.py
├── utils.py
├── gemma
│ ├── ...
├── llama
│ ├── ...
├── qwen
│ ├── ...
├── mixtral
│ ├── ...
├── hermes
│ ├── ...
├── AcademicEval
│ ├── abs_9K
│ ├── abs_28K
│ ├── abs_29K_G
│ ├── intro_8K
│ ├── intro_28K
│ ├── intro_28K_G
│ ├── related_34K
│ ├── related_53K
│ ├── related_53K_G
│ ├── title_10K
│ ├── title_30K
│ └── title_31K_G
Here are some example commands. You can run all the experiments by changing "--llm_model" and "--setting", or by adding "--rag" and "--retriever"; a rough sketch of the retrieval step appears after the command list below.
# Standard LLMs
python exp_comparison.py --setting title_10K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting title_10K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting title_10K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_10K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_10K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_10K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_10K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting title_30K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_30K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_30K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_30K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_30K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting title_31K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_31K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_31K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_31K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_31K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Standard LLMs
python exp_comparison.py --setting abs_9K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting abs_9K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting abs_9K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_9K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_9K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_9K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_9K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting abs_28K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_28K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_28K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_28K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_28K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting abs_29K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_29K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_29K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_29K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_29K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Standard LLMs
python exp_comparison.py --setting intro_8K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting intro_8K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting intro_8K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_8K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_8K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_8K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_8K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting intro_28K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_28K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_28K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_28K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_28K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Long-context LLMs
python exp_comparison.py --setting intro_28K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_28K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_28K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_28K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_28K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# Standard LLMs
python exp_comparison.py --setting related_34K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting related_34K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting related_34K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting related_34K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting related_34K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting related_34K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_34K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# RALM
python exp_comparison.py --setting related_53K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_53K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
# RALM
python exp_comparison.py --setting related_53K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_53K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
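When "--rag" is enabled, the long input is retrieved over rather than fed to the model in full. Below is a minimal sketch of contriever-style retrieval with FAISS; the chunking, pooling, and top-k choices are assumptions for illustration, and retrieval.py remains the authoritative implementation.

```python
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    # Mean-pool token embeddings, masking out padding (standard contriever usage).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Illustrative chunks of a long paper; in practice these come from the benchmark input.
chunks = ["chunk 1 of the paper ...", "chunk 2 ...", "chunk 3 ..."]
chunk_vecs = embed(chunks)
index = faiss.IndexFlatIP(chunk_vecs.shape[1])
index.add(chunk_vecs)

# Retrieve the top-2 chunks for the task query and use them as the LLM context.
_, top = index.search(embed(["write the title of this paper"]), 2)
context = "\n".join(chunks[i] for i in top[0])
```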
We give a general example of constructing the AcademicEval benchmark in this section.
Note: the initial collection process will be time-consuming.
We first collect a co-author graph via the arXiv API. You should set "YOUR START AUTHOR" in construct_relation_graph.py.
Then, run the following command to start BFS.
python construct_relation_graph.py
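The BFS over co-authors can be pictured roughly as follows. This is a sketch using the `arxiv` package; the query format, result limit, and depth limit are illustrative and not the exact logic of construct_relation_graph.py.

```python
import arxiv
from collections import deque

client = arxiv.Client()

def coauthors(author_name, max_results=20):
    """Fetch recent papers by an author and return their co-authors."""
    search = arxiv.Search(query=f'au:"{author_name}"', max_results=max_results)
    names = set()
    for paper in client.results(search):
        names.update(a.name for a in paper.authors)
    names.discard(author_name)
    return names

# Breadth-first expansion starting from a seed author, limited to a small depth.
start, max_depth = "YOUR START AUTHOR", 2
visited, queue, edges = {start}, deque([(start, 0)]), []
while queue:
    author, depth = queue.popleft()
    if depth >= max_depth:
        continue
    for co in coauthors(author):
        edges.append((author, co))
        if co not in visited:
            visited.add(co)
            queue.append((co, depth + 1))
```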
The collected graph may have many defects. Therefore, we provide a complete pipeline for refining the collected graph (including connectivity detection, chronological split, etc.)
python refine_graph.py
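As a rough picture of what the refinement involves, the sketch below shows connectivity detection and a chronological split. It is illustrative only: `networkx` is not in the dependency list above, the record fields are assumptions, and refine_graph.py is the authoritative implementation.

```python
import networkx as nx

# Illustrative co-author edges (author pairs) from the collection step.
edges = [("A. Author", "B. Author"), ("B. Author", "C. Author"), ("D. Author", "E. Author")]

# Connectivity detection: keep only the largest connected component.
G = nx.Graph(edges)
largest = max(nx.connected_components(G), key=len)
G = G.subgraph(largest).copy()

# Chronological split: papers before a cutoff date form one split, the rest are held out.
papers = [{"id": "2401.00001", "date": "2024-01-01"}, {"id": "2507.00002", "date": "2025-07-01"}]
cutoff = "2025-01-01"
train = [p for p in papers if p["date"] < cutoff]
test = [p for p in papers if p["date"] >= cutoff]
```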
You can refer to live_update.py for updating the collected co-author graph.
- GoR: Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs.
- Thought-Retriever: Don't Just Retrieve Raw Data, Retrieve Thoughts.
@article{AcademicEval,
title={AcademicEval: Live Long-Context LLM Benchmark},
author={Haozhen Zhang and Tao Feng and Pengrui Han and Jiaxuan You},
journal={arXiv preprint arXiv:2510.17725},
year={2025}
}