
Add EVALS_CODE venv using EleutherAI's original lm-evaluation-harness repository #520


Open · wants to merge 3 commits into base: dev

Conversation

arobergeTT commented

Summary

Adds a dedicated virtual environment (EVALS_CODE) for code evaluation benchmarks (MBPP, HumanEval), built from EleutherAI's original lm-evaluation-harness repository.

Changes

  • workflows/workflow_types.py: Added the EVALS_CODE enum value to WorkflowVenvType
  • workflows/workflow_venvs.py: Added a setup_evals_code() function (sketched below) that:
      • installs lm-evaluation-harness with code evaluation extensions
      • pins critical packages (protobuf, pyjwt==2.7.0, pillow==11.0.0, datasets==3.1.0)
  • evals/run_evals.py: Auto-applies security flags (--trust_remote_code, --confirm_run_unsafe_code) for code evaluation tasks
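
For reference, a minimal sketch of what setup_evals_code() could look like. The venv path handling, helper structure, and the git install spec are assumptions; only the pinned package versions are taken from this PR's description.

# Hypothetical sketch only: the real setup_evals_code() in workflows/workflow_venvs.py
# may be structured differently; the install spec below is an assumption.
import subprocess
import sys
from pathlib import Path

def setup_evals_code(venv_dir: Path) -> None:
    """Create the EVALS_CODE venv and install EleutherAI's lm-evaluation-harness."""
    # Dedicated environment so code-eval dependencies cannot conflict with other venvs.
    subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
    venv_python = venv_dir / "bin" / "python"  # POSIX layout assumed

    # Install the original EleutherAI harness (the exact ref/extras used by the PR
    # are not shown in the description, so this URL-only spec is an assumption).
    subprocess.run(
        [str(venv_python), "-m", "pip", "install",
         "git+https://github.com/EleutherAI/lm-evaluation-harness.git"],
        check=True,
    )

    # Pin the packages called out in the PR description.
    subprocess.run(
        [str(venv_python), "-m", "pip", "install",
         "protobuf", "pyjwt==2.7.0", "pillow==11.0.0", "datasets==3.1.0"],
        check=True,
    )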

Key Features

  • Isolation: Dedicated environment prevents dependency conflicts
  • Security: Automatic security flags for code execution
  • Optimized: Tailored dependencies for code evaluation tasks

Ready for enabling code evaluation tasks in eval_config.py.
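
To illustrate the Security item above, here is a sketch of how run_evals.py might append the flags for EVALS_CODE tasks. The helper name build_lm_eval_args and the task attribute access are assumptions, not the PR's actual code; the flag names come from the PR description.

# Illustrative sketch: build_lm_eval_args is a hypothetical helper.
from workflows.workflow_types import WorkflowVenvType

def build_lm_eval_args(task, base_args):
    """Return lm_eval CLI args, adding the security flags required for code tasks."""
    args = list(base_args)
    if task.workflow_venv_type == WorkflowVenvType.EVALS_CODE:
        # HumanEval/MBPP execute model-generated code, so the harness requires
        # these flags to be passed explicitly.
        args += ["--trust_remote_code", "--confirm_run_unsafe_code"]
    return args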

stisiTT requested a review from Copilot on August 19, 2025 at 17:25

Copilot AI left a comment


Pull Request Overview

Adds a dedicated virtual environment for code evaluation benchmarks using EleutherAI's lm-evaluation-harness repository with automatic security configuration.

  • Added EVALS_CODE virtual environment type with specialized setup for code evaluation tasks
  • Implemented automatic security flags (--trust_remote_code, --confirm_run_unsafe_code) for code evaluation
  • Added environment variable setup for HuggingFace code evaluation permissions

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
workflows/workflow_types.py | Added the EVALS_CODE enum value to WorkflowVenvType
workflows/workflow_venvs.py | Implemented the setup_evals_code function and venv configuration
evals/run_evals.py | Added automatic security flags and environment variables for code evaluation


# Check if any tasks require code evaluation and set environment variable early
# This must be set in os.environ (not just env_vars) because lm_eval modules
# check for it during their import phase, before the subprocess is fully set up
has_code_eval_tasks = any(task.confirm_run_unsafe_code for task in eval_config.tasks)

Copilot AI Aug 19, 2025


The code checks the task.confirm_run_unsafe_code attribute, but this attribute is not defined in the task model. Based on the logic below, which checks task.workflow_venv_type == WorkflowVenvType.EVALS_CODE, this condition should likely use that comparison instead.

Suggested change:
- has_code_eval_tasks = any(task.confirm_run_unsafe_code for task in eval_config.tasks)
+ has_code_eval_tasks = any(task.workflow_venv_type == WorkflowVenvType.EVALS_CODE for task in eval_config.tasks)
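
For context, a sketch of how the corrected check could gate the early environment setup described in the excerpt above. The helper name is hypothetical, and HF_ALLOW_CODE_EVAL is an assumption: it is the variable HuggingFace's code_eval metric reads, but the excerpt does not name the variable being set.

# Hypothetical helper; the real run_evals.py may set the variable inline.
import os
from workflows.workflow_types import WorkflowVenvType

def maybe_allow_code_eval(eval_config) -> None:
    has_code_eval_tasks = any(
        task.workflow_venv_type == WorkflowVenvType.EVALS_CODE
        for task in eval_config.tasks
    )
    if has_code_eval_tasks:
        # Set in os.environ (not only the subprocess env_vars) because lm_eval
        # modules check it at import time. HF_ALLOW_CODE_EVAL is assumed here.
        os.environ["HF_ALLOW_CODE_EVAL"] = "1"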


stisiTT

@arobergeTT this did fail when I attempted to run it, and this is also the fix that Claude suggests.

arobergeTT (Author)
@stisiTT I made the appropriate changes. Could you try testing it again?

Labels: enhancement, evals
3 participants