
Add EVALS_CODE venv using EleutherAI's original lm-evaluation-harness repository #520


Open · wants to merge 3 commits into base: dev

Conversation

arobergeTT commented

Summary

Adds a dedicated virtual environment (EVALS_CODE) for code evaluation benchmarks (MBPP, HumanEval), built from EleutherAI's original lm-evaluation-harness repository.

Changes

  • workflows/workflow_types.py: Added the EVALS_CODE enum value to WorkflowVenvType
  • workflows/workflow_venvs.py: Added a setup_evals_code() function (sketched below) that:
      • installs lm-evaluation-harness with code evaluation extensions
      • pins critical packages (protobuf, pyjwt==2.7.0, pillow==11.0.0, datasets==3.1.0)
  • evals/run_evals.py: Auto-applies security flags (--trust_remote_code, --confirm_run_unsafe_code) for code evaluation tasks
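
For reference, a minimal sketch of what setup_evals_code() could look like. The venv path handling, helper structure, and the git install spec are assumptions; only the pinned package versions are taken from this PR's description.

# Hypothetical sketch only: the real setup_evals_code() in workflows/workflow_venvs.py
# may be structured differently; the install spec below is an assumption.
import subprocess
import sys
from pathlib import Path

def setup_evals_code(venv_dir: Path) -> None:
    """Create the EVALS_CODE venv and install EleutherAI's lm-evaluation-harness."""
    # Dedicated environment so code-eval dependencies cannot conflict with other venvs.
    subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
    venv_python = venv_dir / "bin" / "python"  # POSIX layout assumed

    # Install the original EleutherAI harness (the exact ref/extras used by the PR
    # are not shown in the description, so this URL-only spec is an assumption).
    subprocess.run(
        [str(venv_python), "-m", "pip", "install",
         "git+https://github.com/EleutherAI/lm-evaluation-harness.git"],
        check=True,
    )

    # Pin the packages called out in the PR description.
    subprocess.run(
        [str(venv_python), "-m", "pip", "install",
         "protobuf", "pyjwt==2.7.0", "pillow==11.0.0", "datasets==3.1.0"],
        check=True,
    )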

Key Features

  • Isolation: Dedicated environment prevents dependency conflicts
  • Security: Automatic security flags for code execution
  • Optimized: Tailored dependencies for code evaluation tasks

Ready for enabling code evaluation tasks in eval_config.py.
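
To illustrate the Security item above, here is a sketch of how run_evals.py might append the flags for EVALS_CODE tasks. The helper name build_lm_eval_args and the task attribute access are assumptions, not the PR's actual code; the flag names come from the PR description.

# Illustrative sketch: build_lm_eval_args is a hypothetical helper.
from workflows.workflow_types import WorkflowVenvType

def build_lm_eval_args(task, base_args):
    """Return lm_eval CLI args, adding the security flags required for code tasks."""
    args = list(base_args)
    if task.workflow_venv_type == WorkflowVenvType.EVALS_CODE:
        # HumanEval/MBPP execute model-generated code, so the harness requires
        # these flags to be passed explicitly.
        args += ["--trust_remote_code", "--confirm_run_unsafe_code"]
    return args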

stisiTT requested a review from Copilot on August 19, 2025 at 17:25

Copilot AI left a comment


Pull Request Overview

Adds a dedicated virtual environment for code evaluation benchmarks using EleutherAI's lm-evaluation-harness repository with automatic security configuration.

  • Added EVALS_CODE virtual environment type with specialized setup for code evaluation tasks
  • Implemented automatic security flags (--trust_remote_code, --confirm_run_unsafe_code) for code evaluation
  • Added environment variable setup for HuggingFace code evaluation permissions

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
workflows/workflow_types.py | Added the EVALS_CODE enum value to WorkflowVenvType
workflows/workflow_venvs.py | Implemented the setup_evals_code function and venv configuration
evals/run_evals.py | Added automatic security flags and environment variables for code evaluation


# Check if any tasks require code evaluation and set environment variable early
# This must be set in os.environ (not just env_vars) because lm_eval modules
# check for it during their import phase, before the subprocess is fully set up
has_code_eval_tasks = any(task.confirm_run_unsafe_code for task in eval_config.tasks)

Copilot AI Aug 19, 2025


The code checks the task.confirm_run_unsafe_code attribute, but this attribute is not defined in the task model. Based on the logic below, which checks task.workflow_venv_type == WorkflowVenvType.EVALS_CODE, this condition should likely use that comparison instead.

Suggested change:
- has_code_eval_tasks = any(task.confirm_run_unsafe_code for task in eval_config.tasks)
+ has_code_eval_tasks = any(task.workflow_venv_type == WorkflowVenvType.EVALS_CODE for task in eval_config.tasks)
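
For context, a sketch of how the corrected check could gate the early environment setup described in the excerpt above. The helper name is hypothetical, and HF_ALLOW_CODE_EVAL is an assumption: it is the variable HuggingFace's code_eval metric reads, but the excerpt does not name the variable being set.

# Hypothetical helper; the real run_evals.py may set the variable inline.
import os
from workflows.workflow_types import WorkflowVenvType

def maybe_allow_code_eval(eval_config) -> None:
    has_code_eval_tasks = any(
        task.workflow_venv_type == WorkflowVenvType.EVALS_CODE
        for task in eval_config.tasks
    )
    if has_code_eval_tasks:
        # Set in os.environ (not only the subprocess env_vars) because lm_eval
        # modules check it at import time. HF_ALLOW_CODE_EVAL is assumed here.
        os.environ["HF_ALLOW_CODE_EVAL"] = "1"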


stisiTT

@arobergeTT this did fail when I attempted to run it, and this is also the fix that Claude suggests.

arobergeTT (Author)
@stisiTT I made the appropriate changes. Could you try testing it again?

Labels: enhancement, evals
3 participants