feat(code_executors): Add GkeCodeExecutor for sandboxed code execution on GKE #1629
@hangfei could you assign a reviewer for my PR?
@syangx39 what about blocking network access from the sandbox? Seems like we'd want to add that in. Or are you expecting the developer to create a network policy on his own and apply it to the jobs?
Two action items remain after this PR:
- Follow-up PR on input/output files support with _create_pvc().
- Regarding this comment: "@syangx39 what about blocking network access from the sandbox? Seems like we'd want to add that in. Or are you expecting the developer to create a network policy on his own and apply it to the jobs?"
gVisor's default network sandboxing in GKE specifically blocks access to sensitive host-level endpoints like the GKE metadata server. A pod running in the gVisor sandbox can still make network calls to public services on the internet.
If gke_code_executor blocks network access, the sandboxed code will not be able to pip install packages from the internet. So I'm thinking:
The best way to handle dependencies is ahead of time. A developer would build an image with the required packages already installed via a requirements.txt file. They would then push this image to Artifact Registry and configure the executor to use their custom image instead of the default python:3.11.
Anyway, I will add a block_network_access flag for flexibility and default it to True. But for readability, I will do it in a follow-up PR.
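For reference, the developer-managed alternative raised in the question could look roughly like the sketch below. It builds a deny-all-egress NetworkPolicy manifest as a plain dict. The policy name and the Pod label are hypothetical, and nothing here is part of this PR. The policy only takes effect on clusters with network policy enforcement enabled.

```python
import json


def build_deny_egress_policy(namespace: str, pod_labels: dict[str, str]) -> dict:
    """Return a NetworkPolicy manifest that denies all egress from matching Pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # Hypothetical name, chosen for this illustration only.
        "metadata": {"name": "deny-sandbox-egress", "namespace": namespace},
        "spec": {
            # Select the execution Pods by label (the label is an assumption).
            "podSelector": {"matchLabels": dict(pod_labels)},
            # Listing "Egress" with no egress rules blocks all outbound
            # traffic, including DNS lookups.
            "policyTypes": ["Egress"],
            "egress": [],
        },
    }


policy = build_deny_egress_policy("agent-sandbox", {"app": "adk-code-executor"})
print(json.dumps(policy, indent=2))
```

A dict like this can be passed as the body to NetworkingV1Api.create_namespaced_network_policy in the kubernetes Python client, or dumped to YAML and applied with kubectl.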
What's the startup time for GkeCodeExecutor? Should the user start it first, or can it be triggered on demand?
Please create an issue and a doc to update the docs in the adk-docs repo.
@@ -0,0 +1,50 @@
apiVersion: v1
should every user write this?
No, the default namespace is "default". The file deployment_rbac.yaml located in the contributing/samples/ directory is intended to be a sample. It demonstrates all the necessary RBAC components (Namespace, ServiceAccount, Role, RoleBinding) and how they connect. The name agent-sandbox was chosen to be descriptive for the sample, but it's not a required value.
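To illustrate how the four RBAC components connect, here is a rough sketch that builds equivalent manifests as Python dicts. It is not the contents of deployment_rbac.yaml. The ServiceAccount name, Role name, and the exact rules are assumptions based on what the executor needs to do (create Jobs and ConfigMaps, read Pod logs).

```python
def build_rbac_manifests(
    namespace: str = "agent-sandbox",
    service_account: str = "adk-agent-sa",  # hypothetical name
) -> list[dict]:
    """Return Namespace, ServiceAccount, Role and RoleBinding manifests."""
    role_name = "adk-code-executor-role"  # hypothetical name
    rbac_api = "rbac.authorization.k8s.io"
    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": namespace}},
        {
            "apiVersion": "v1",
            "kind": "ServiceAccount",
            "metadata": {"name": service_account, "namespace": namespace},
        },
        {
            "apiVersion": f"{rbac_api}/v1",
            "kind": "Role",
            "metadata": {"name": role_name, "namespace": namespace},
            # Assumed minimal permissions for a Job-per-execution workflow.
            "rules": [
                {"apiGroups": ["batch"], "resources": ["jobs"],
                 "verbs": ["create", "get", "watch", "delete"]},
                {"apiGroups": [""], "resources": ["configmaps"],
                 "verbs": ["create", "get", "delete"]},
                {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
                {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
            ],
        },
        {
            "apiVersion": f"{rbac_api}/v1",
            "kind": "RoleBinding",
            "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
            # The binding ties the ServiceAccount to the Role by name.
            "subjects": [{"kind": "ServiceAccount", "name": service_account,
                          "namespace": namespace}],
            "roleRef": {"apiGroup": rbac_api, "kind": "Role", "name": role_name},
        },
    ]


for manifest in build_rbac_manifests():
    print(manifest["kind"], manifest["metadata"]["name"])
```

The Role and RoleBinding are namespaced, so using a different namespace only requires changing the namespace argument consistently across all four objects.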
Great, could you add a README.md to this sample agent and put the information above in it, so that users understand the usage of this file?
logger = logging.getLogger(__name__)

class GkeCodeExecutor(BaseCodeExecutor):
I was wondering if there will be other variants of GkeCodeExecutor besides the gVisor-based one?
No. We've designed the GkeCodeExecutor to be exclusively gVisor-based because its core mission is to safely run untrusted, LLM-generated code. Offering a non-sandboxed variant would create a significant security liability, as it would expose the node's kernel to the code being executed.
I think hangfei meant: for the sandboxed solution, will there be other options besides gVisor (e.g. Kata Containers)? If so, should we add something related to gVisor to the name?
each execution request. The user's code is mounted via a ConfigMap, and the
Pod is hardened with a strict security context and resource limits.

Key Features:
no need to add this to source code.
I think this 'Key Features' section serves as a high-level summary so a future developer can immediately grasp the component's security posture and design without having to read the whole implementation.
Do you think we could keep it for that reason? I feel it adds a lot of value for future maintainability.
fair point.
file at: contributing/samples/gke_agent_sandbox/deployment_rbac.yaml
"""
namespace: str = "default"
image: str = "python:3.11-slim"
does it only work with 3.11?
Not necessarily 3.11, but it has to be Python: https://github.com/syangx39/adk-python/blob/5533fbf31085a98184b91bda48b36a515524ea84/src/google/adk/code_executors/gke_code_executor.py#L110
I can patch a follow-up PR to make the command configurable, turning it into a more generic script executor.
please. thanks.
_batch_v1: client.BatchV1Api
_core_v1: client.CoreV1Api

def __init__(self, **data):
does this work for AI Studio api key or EasyGCP?
If not, is it possible to throw an exception when it inits?
An AI Studio API key is irrelevant for this component. The GkeCodeExecutor's only job is to communicate with the Kubernetes API, either via a ServiceAccount token or a kubeconfig file.
If load_kube_config() fails, it raises a kubernetes.config.ConfigException. This exception will halt the initialization and cause the GkeCodeExecutor constructor to fail, which I think is the correct behavior when no valid Kubernetes credentials can be found.
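The credential lookup described here can be sketched as follows. This is a minimal illustration of the in-cluster-then-kubeconfig fallback using the kubernetes Python client, not the PR's actual constructor. The function name and its return values are made up for the example.

```python
def load_kubernetes_credentials() -> str:
    """Configure the kubernetes client and report which source was used.

    Tries the in-cluster ServiceAccount token first, then falls back to the
    local kubeconfig file. If both fail, the ConfigException raised by
    load_kube_config() propagates to the caller, so construction fails fast.
    """
    # Imported lazily so this module can be imported without the SDK installed.
    from kubernetes import config

    try:
        config.load_incluster_config()
        return "in-cluster"
    except config.ConfigException:
        # Not running inside a Pod: fall back to the developer's kubeconfig.
        config.load_kube_config()
        return "kubeconfig"
```

Calling this from an executor's __init__ would surface missing credentials at construction time rather than on the first execution request.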
How long does it take to spin up the code executor and finish it?
Please add unit tests. Please add a sample under contributing/samples
Please make sure you add corresponding docs in adk-docs before submission. Thanks.
Overall LGTM. Thanks! Let's do the following before merge:
@hangfei
thanks.
@@ -135,6 +135,10 @@ extensions = [
"toolbox-core>=0.1.0", # For tools.toolbox_toolset.ToolboxToolset
]

# For GKE-specific features, primarily the GkeCodeExecutor.
gke = [
let's put it into extensions.
@@ -49,3 +49,14 @@
' Executor with agents, please install it. If not, you can ignore this'
' warning.'
)

try:
The original code in this file is problematic. We've just fixed it, so let's follow the latest style.
close #2170
Summary
This PR introduces `GkeCodeExecutor`, a new code executor that provides a secure and scalable method for running LLM-generated code on GKE Sandbox, which uses gVisor for workload isolation. It serves as a robust alternative to local or standard containerized executors.
For each code execution request, it dynamically creates an ephemeral Kubernetes Job with a hardened Pod configuration, so that each execution runs in a clean, isolated environment.
Key Features of GkeCodeExecutor
- Uses the `batch/v1` API to create a new Job for each code snippet.
- Passes the code through a `ConfigMap`, which is mounted as a read-only file.
- Runs on the `gvisor` runtime for kernel-level isolation.
- Uses the `ttl_seconds_after_finished` feature on Jobs for robust, automatic garbage collection of completed Pods and Jobs.
- Sets `tolerations` in its Pod specification. This allows the k8s scheduler to place the execution Pod onto a pre-configured gVisor-enabled node.
- `GkeCodeExecutor` is registered in `code_executors/__init__.py`, making it available for use by agents. The `ImportError` handling is configured to check for the required `kubernetes` SDK.
Execution Flow:
1. The LlmAgent calls `GkeCodeExecutor` with the LLM-generated code.
2. `GkeCodeExecutor` will `execute_code`: it creates a temporary `ConfigMap`, and then creates a k8s `Job` to run it.
3. The Job runs a `python:3.11-slim` container. The image is pulled once to the node and cached. The Job will mount the ConfigMap as `/app/code.py`.
4. `GkeCodeExecutor` retrieves the `stdout/stderr` logs from the container, returns `CodeExecutionResult` to the LlmAgent, and ensures all temp resources are deleted.
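As a rough illustration of the flow above, the sketch below builds the ConfigMap and Job manifests as plain dicts. It is not the PR's implementation. The resource names, TTL, resource limits, and security context values are assumptions. The toleration matches the taint GKE Sandbox places on gVisor node pools, and `ttlSecondsAfterFinished` is the manifest field behind the Python client's `ttl_seconds_after_finished`.

```python
import uuid


def build_execution_manifests(
    code: str, namespace: str = "default", image: str = "python:3.11-slim"
) -> tuple[dict, dict]:
    """Return (ConfigMap, Job) manifests for one sandboxed execution."""
    name = f"adk-exec-{uuid.uuid4().hex[:10]}"  # hypothetical naming scheme
    config_map = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {"code.py": code},
    }
    container = {
        "name": "code-runner",
        "image": image,
        "command": ["python3", "/app/code.py"],
        # The ConfigMap key "code.py" appears as /app/code.py, read-only.
        "volumeMounts": [{"name": "code", "mountPath": "/app", "readOnly": True}],
        "securityContext": {  # assumed hardening settings
            "runAsNonRoot": True,
            "runAsUser": 1001,
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
        },
        "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},  # assumed
    }
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,  # do not retry failed untrusted code
            "ttlSecondsAfterFinished": 600,  # assumed TTL for garbage collection
            "template": {"spec": {
                "restartPolicy": "Never",
                "runtimeClassName": "gvisor",
                "tolerations": [{
                    "key": "sandbox.gke.io/runtime", "operator": "Equal",
                    "value": "gvisor", "effect": "NoSchedule",
                }],
                "containers": [container],
                "volumes": [{"name": "code", "configMap": {"name": name}}],
            }},
        },
    }
    return config_map, job


config_map, job = build_execution_manifests("print(6 * 7)")
print(job["metadata"]["name"], "mounts ConfigMap", config_map["metadata"]["name"])
```

With the kubernetes Python client, dicts like these can be passed as the body to CoreV1Api.create_namespaced_config_map and BatchV1Api.create_namespaced_job.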