
AcademicEval: Live Long-Context LLM Benchmark


News

[2025.10] 🌟 AcademicEval was released.

[2025.09] 🎉 AcademicEval was accepted by TMLR 2025.

Introduction

We propose AcademicEval, a live benchmark for evaluating LLMs on long-context generation tasks. AcademicEval adopts papers from arXiv to introduce several academic writing tasks with long-context inputs, i.e., title, abstract, introduction, and related work writing, which cover a wide range of abstraction levels and require no manual labeling.

Compared with existing long-context LLM benchmarks, AcademicEval offers flexible length, automatic annotation, hierarchical abstraction, few-shot demonstrations, and live updates without data leakage risks.

| Benchmark | Avg Len | Automatic Annotation | Hierarchical Abstraction | Few-shot Demonstrations | Live Update |
| --- | --- | --- | --- | --- | --- |
| ZeroSCROLLS (Shaham et al., 2023) | ~10K | | | | |
| L-Eval (An et al., 2023) | ~8K | | | | |
| BAMBOO (Dong et al., 2023) | ~16K | | | | |
| LongBench (Bai et al., 2023) | ~8K | | | | |
| LooGLE (Li et al., 2023) | ~20K | | | | |
| ∞Bench (Zhang et al., 2024) | ~200K | | | | |
| AcademicEval (ours) | Flexible | ✓ | ✓ | ✓ | ✓ |

❗❗❗ You can download our collected data at AcademicEval.
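
If the linked data is hosted on the Hugging Face Hub, a download sketch along the following lines should work; the dataset id below is a placeholder (use the one linked above), and `huggingface_hub` is an extra dependency not listed in the setup section below.

```python
# Sketch: fetch the released data into the "AcademicEval" directory expected below.
# The repo id is a placeholder -- substitute the dataset id linked above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<ACADEMICEVAL_DATASET_ID>",  # placeholder, not the real id
    repo_type="dataset",
    local_dir="AcademicEval",
)
```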

📌Environment Setup

Python Package

# python==3.10
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install arxiv
pip install tqdm
pip install rouge_score
pip install textstat
pip install transformers
pip install langchain
pip install PyMuPDF
pip install faiss-gpu
pip install openai==0.28.0

LLM Tokenizers

We additionally need the tokenizer configuration files for LLMs to ensure correct and accurate truncation.

You only need to download the tokenizer configuration files for each LLM; no model weight files are needed, because we access the LLMs through an API. Please place the downloaded files in the "gemma", "llama", "qwen", "mixtral", and "hermes" directories, respectively.

❗We have integrated these files in our repository.
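
As a rough illustration of why only the tokenizer configs are needed, truncation can be done with `transformers` against a local config directory such as "llama"; the input file and the 8K token limit below are placeholders, and the repository's own truncation logic in utils.py may differ.

```python
# Sketch: token-accurate truncation using only a local tokenizer config directory.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llama")  # config files only, no weights
long_text = open("paper.txt").read()                # placeholder input
ids = tokenizer.encode(long_text, add_special_tokens=False)
truncated_text = tokenizer.decode(ids[:8000])       # keep roughly the first 8K tokens
```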

⭐Experiments

❗Note: Since we use the LLM API provided by together.ai to access LLMs, you need to set your own API key in the "get_llm_response_via_api" function in utils.py.
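
For reference, `openai==0.28.0` can talk to together.ai through its OpenAI-compatible endpoint, so "get_llm_response_via_api" likely needs something along these lines; the base URL, model name, and prompt here are assumptions, so adapt them to the actual function in utils.py.

```python
# Sketch: calling a together.ai-hosted model via the openai==0.28.0 client.
import openai

openai.api_key = "YOUR_TOGETHER_API_KEY"         # your own key goes here
openai.api_base = "https://api.together.xyz/v1"  # OpenAI-compatible endpoint (assumption)

response = openai.ChatCompletion.create(
    model="meta-llama/Llama-3-70b-chat-hf",
    messages=[{"role": "user", "content": "Write a title for this paper: ..."}],
)
print(response["choices"][0]["message"]["content"])
```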

❗Please ensure that the AcademicEval data is downloaded into the "AcademicEval" directory. The directory structure should look like the following:

├── README.md
├── abs_extractor.py
├── bart_score.py
├── construct_relation_graph.py
├── exp_comparison.py
├── main.py
├── model.png
├── refine_graph.py
├── related_extractor.py
├── retrieval.py
├── section_region_extractor.py
├── utils.py
├── gemma
│   ├── ...
├── llama
│   ├── ...
├── qwen
│   ├── ...
├── mixtral
│   ├── ...
├── hermes
│   ├── ...
├── AcademicEval
│   ├── abs_9K
│   ├── abs_28K
│   ├── abs_29K_G
│   ├── intro_8K
│   ├── intro_28K
│   ├── intro_28K_G
│   ├── related_34K
│   ├── related_53K
│   ├── related_53K_G
│   ├── title_10K
│   ├── title_30K
│   └── title_31K_G

Here are some command examples. You can run all the experiments by replacing "llm_model" and "setting", or by adding "--rag" and "--retriever".

Title Writing

title-10K

# Standard LLMs
python exp_comparison.py --setting title_10K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting title_10K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting title_10K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_10K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_10K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_10K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_10K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

title-30K

# Long-context LLMs
python exp_comparison.py --setting title_30K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_30K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_30K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_30K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_30K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

title-31K-G

# Long-context LLMs
python exp_comparison.py --setting title_31K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting title_31K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting title_31K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting title_31K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting title_31K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

Abstract Writing

abs-9K

# Standard LLMs
python exp_comparison.py --setting abs_9K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting abs_9K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting abs_9K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_9K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_9K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_9K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_9K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

abs-28K

# Long-context LLMs
python exp_comparison.py --setting abs_28K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_28K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_28K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_28K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_28K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

abs-29K-G

# Long-context LLMs
python exp_comparison.py --setting abs_29K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting abs_29K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting abs_29K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting abs_29K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting abs_29K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

Introduction Writing

intro-8K

# Standard LLMs
python exp_comparison.py --setting intro_8K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting intro_8K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting intro_8K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_8K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_8K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_8K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_8K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

intro-28K

# Long-context LLMs
python exp_comparison.py --setting intro_28K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_28K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_28K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_28K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_28K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

intro-28K-G

# Long-context LLMs
python exp_comparison.py --setting intro_28K_G --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting intro_28K_G --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting intro_28K_G --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting intro_28K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting intro_28K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

Related Work Writing

related-34K

# Standard LLMs
python exp_comparison.py --setting related_34K --llm_model google/gemma-7b-it --cuda 3
python exp_comparison.py --setting related_34K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3
# Long-context LLMs
python exp_comparison.py --setting related_34K --llm_model Qwen/Qwen1.5-72B-Chat --cuda 3
python exp_comparison.py --setting related_34K --llm_model mistralai/Mixtral-8x7B-Instruct-v0.1 --cuda 3
python exp_comparison.py --setting related_34K --llm_model NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO --cuda 3
# RALM
python exp_comparison.py --setting related_34K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_34K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

related-53K

# RALM
python exp_comparison.py --setting related_53K --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_53K --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever

related-53K-G

# RALM
python exp_comparison.py --setting related_53K_G --llm_model google/gemma-7b-it --cuda 3 --rag --retriever contriever
python exp_comparison.py --setting related_53K_G --llm_model meta-llama/Llama-3-70b-chat-hf --cuda 3 --rag --retriever contriever
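
If you want to sweep several of the above configurations in one go, a small driver script like the following simply re-issues the same commands; the setting and model lists are an illustrative subset.

```python
# Sketch: batch-run exp_comparison.py over a few settings and models.
import subprocess

settings = ["title_10K", "abs_9K", "intro_8K", "related_34K"]
models = ["google/gemma-7b-it", "Qwen/Qwen1.5-72B-Chat"]

for setting in settings:
    for model in models:
        cmd = ["python", "exp_comparison.py", "--setting", setting,
               "--llm_model", model, "--cuda", "3"]
        # Append ["--rag", "--retriever", "contriever"] for the RALM variant.
        subprocess.run(cmd, check=True)
```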

📍Benchmark Construction

This section gives a general example of constructing the AcademicEval benchmark.

Note: The initial collection process will be time-consuming.

Co-author Graph Construction

We first collect a co-author graph via the arXiv API. You should specify "YOUR START AUTHOR" in construct_relation_graph.py.

Then, run the following command to start the BFS.

python construct_relation_graph.py
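
For intuition, one BFS step over co-authors with the `arxiv` package looks roughly like the sketch below; the query format, result limit, and single-level stopping rule are simplifications, and the full pipeline lives in construct_relation_graph.py.

```python
# Sketch: expand one BFS frontier of the co-author graph via the arXiv API.
from collections import deque
import arxiv

start_author = "YOUR START AUTHOR"   # same placeholder as in construct_relation_graph.py
client = arxiv.Client()

queue, seen = deque([start_author]), {start_author}
while queue:
    author = queue.popleft()
    search = arxiv.Search(query=f'au:"{author}"', max_results=20)
    for paper in client.results(search):
        for co_author in paper.authors:
            if co_author.name not in seen:   # a co-authorship edge author -- co_author
                seen.add(co_author.name)
                queue.append(co_author.name)
    break  # this sketch stops after a single BFS level
```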

Graph Refinement

The collected graph may have many defects. Therefore, we provide a complete pipeline for refining the collected graph (including connectivity detection, chronological splitting, etc.):

python refine_graph.py
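
As a rough illustration of the refinement steps mentioned above, the sketch below keeps the largest connected component and splits papers chronologically; it relies on networkx (not in the dependency list), and the "published" field and the 80/20 split are assumptions rather than the actual logic in refine_graph.py.

```python
# Sketch: connectivity detection and chronological split on the collected graph.
import networkx as nx

def refine(graph: nx.Graph, papers: list[dict]):
    # Connectivity detection: restrict to the largest connected component.
    largest = max(nx.connected_components(graph), key=len)
    graph = graph.subgraph(largest).copy()

    # Chronological split: earlier papers as context, the newest held out for evaluation.
    papers = sorted(papers, key=lambda p: p["published"])
    cut = int(0.8 * len(papers))
    return graph, papers[:cut], papers[cut:]
```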

Live Update

You can refer to live_update.py for updating the collected co-author graph.

Other Awesome Works

  • GoR: Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs. [code]

  • Thought Retriever: Thought-Retriever: Don’t Just Retrieve Raw Data, Retrieve Thoughts. [code]

Citation

@article{AcademicEval,
  title={AcademicEval: Live Long-Context LLM Benchmark},
  author={Haozhen Zhang and Tao Feng and Pengrui Han and Jiaxuan You},
  journal={arXiv preprint arXiv:2510.17725},
  year={2025}
}
