Add EVALS_CODE venv using EleutherAI's original lm-evaluation-harness repository #520
base: dev
Conversation
Pull Request Overview
Adds a dedicated virtual environment for code evaluation benchmarks using EleutherAI's lm-evaluation-harness repository with automatic security configuration.
- Added `EVALS_CODE` virtual environment type with specialized setup for code evaluation tasks
- Implemented automatic security flags (`--trust_remote_code`, `--confirm_run_unsafe_code`) for code evaluation
- Added environment variable setup for HuggingFace code evaluation permissions
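As a rough illustration of the security-flag bullet above, here is a minimal sketch of how such flags might be appended for code-eval tasks; the helper name, signature, and argument handling are assumptions, not the PR's actual code.

```python
from workflows.workflow_types import WorkflowVenvType

# Hypothetical helper: only the two flags themselves are confirmed by this PR;
# the function name and task/argument handling are illustrative assumptions.
def append_code_eval_flags(task, lm_eval_args: list) -> list:
    args = list(lm_eval_args)
    if task.workflow_venv_type == WorkflowVenvType.EVALS_CODE:
        # lm-evaluation-harness requires explicit opt-in before executing
        # model-generated code or trusting remote model code.
        args += ["--trust_remote_code", "--confirm_run_unsafe_code"]
    return args
```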
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| workflows/workflow_types.py | Added `EVALS_CODE` enum value to `WorkflowVenvType` |
| workflows/workflow_venvs.py | Implemented `setup_evals_code` function and venv configuration |
| evals/run_evals.py | Added automatic security flags and environment variables for code evaluation |
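A minimal sketch of the `workflow_types.py` change summarized in the table; only the `EVALS_CODE` member is described by this PR, so the other member shown is a placeholder assumption.

```python
# workflows/workflow_types.py (sketch)
from enum import Enum, auto

class WorkflowVenvType(Enum):
    EVALS = auto()       # placeholder for an assumed pre-existing venv type
    EVALS_CODE = auto()  # new: dedicated venv for code evaluation benchmarks
```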
evals/run_evals.py (outdated)
# Check if any tasks require code evaluation and set environment variable early
# This must be set in os.environ (not just env_vars) because lm_eval modules
# check for it during their import phase, before the subprocess is fully set up
has_code_eval_tasks = any(task.confirm_run_unsafe_code for task in eval_config.tasks)
The code checks for the `task.confirm_run_unsafe_code` attribute, but this attribute is not defined in the task model. Based on the logic below that checks `task.workflow_venv_type == WorkflowVenvType.EVALS_CODE`, this should likely be `task.workflow_venv_type == WorkflowVenvType.EVALS_CODE` instead.
Suggested change:
-has_code_eval_tasks = any(task.confirm_run_unsafe_code for task in eval_config.tasks)
+has_code_eval_tasks = any(task.workflow_venv_type == WorkflowVenvType.EVALS_CODE for task in eval_config.tasks)
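For context, a sketch of the early-setup pattern this thread is discussing, with the suggested fix applied; it assumes the HuggingFace permission variable in question is `HF_ALLOW_CODE_EVAL`, and `eval_config` comes from the surrounding script.

```python
import os
from workflows.workflow_types import WorkflowVenvType

# Decide early whether any task runs code evaluation.
has_code_eval_tasks = any(
    task.workflow_venv_type == WorkflowVenvType.EVALS_CODE
    for task in eval_config.tasks
)

if has_code_eval_tasks:
    # Set the permission variable in os.environ (not just the subprocess
    # env_vars) because lm_eval modules look for it while they are imported.
    # HF_ALLOW_CODE_EVAL is an assumption about which variable the PR sets.
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"
```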
@arobergeTT this did fail when I attempted to run it, and this is also the fix that Claude suggests.
@stisiTT I made the appropriate changes. Could you try testing it again?
Summary

Adds a dedicated virtual environment (`EVALS_CODE`) for code evaluation benchmarks (`MBPP`, `HumanEval`) using EleutherAI's original lm-evaluation-harness repository.

Changes

- `workflows/workflow_types.py`: Added `EVALS_CODE` enum to `WorkflowVenvType`
- `workflows/workflow_venvs.py`: Added `setup_evals_code()` function and venv configuration
- `evals/run_evals.py`: Auto-applies security flags (`--trust_remote_code`, `--confirm_run_unsafe_code`) for code evaluation tasks

Key Features

Ready for enabling code evaluation tasks in `eval_config.py`.
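As a closing sketch of what `setup_evals_code()` in `workflows/workflow_venvs.py` might do (clone EleutherAI's upstream lm-evaluation-harness and install it into the dedicated venv); the signature, paths, and lack of version pinning here are assumptions rather than the PR's actual implementation.

```python
import subprocess
from pathlib import Path

HARNESS_URL = "https://github.com/EleutherAI/lm-evaluation-harness.git"

def setup_evals_code(venv_python: Path, work_dir: Path) -> None:
    """Sketch: prepare the EVALS_CODE venv from EleutherAI's original harness."""
    harness_dir = work_dir / "lm-evaluation-harness"
    if not harness_dir.exists():
        subprocess.run(["git", "clone", HARNESS_URL, str(harness_dir)], check=True)
    # Install the harness into the dedicated venv's interpreter so code-eval
    # tasks (e.g. MBPP, HumanEval) run against the upstream implementation.
    subprocess.run(
        [str(venv_python), "-m", "pip", "install", "-e", str(harness_dir)],
        check=True,
    )
```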