feat(code_executors): Add GkeCodeExecutor for sandboxed code execution on GKE #1629

Open
wants to merge 31 commits into
base: main

@syangx39 syangx39 commented Jun 24, 2025

close #2170

Summary

This PR introduces GkeCodeExecutor, a new code executor that provides a secure and scalable way to run LLM-generated code. It serves as a robust alternative to local or standard containerized executors by leveraging the GKE Sandbox environment, which uses gVisor for workload isolation.

For each code execution request, it dynamically creates an ephemeral Kubernetes Job with a hardened Pod configuration, offering significant security benefits and ensuring that each code execution runs in a clean, isolated environment.

Key Features of GkeCodeExecutor

  • Dynamic Job Creation: Uses the Kubernetes batch/v1 API to create a new Job for each code snippet.
  • Secure Code Mounting: Injects code into the Pod via a temporary ConfigMap, which is mounted to a read-only file.
  • gVisor Sandboxing: Enforces execution within the gVisor runtime for kernel-level isolation.
  • Hardened Security Context: Pods run as non-root with all Linux capabilities dropped and a read-only root filesystem.
  • Resource Management: Applies configurable CPU and memory limits to prevent abuse.
  • Automatic Cleanup: Uses the ttl_seconds_after_finished feature on Jobs for robust, automatic garbage collection of completed Pods and Jobs.
  • Node Scheduling: The executor uses Kubernetes tolerations in its Pod specification. This allows the k8s scheduler to place the execution Pod onto a pre-configured gVisor-enabled node.
  • Module Integration: The GkeCodeExecutor is registered in the code_executors/__init__.py, making it available for use by agents. The ImportError handling is configured to check for the required kubernetes SDK.
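
As a rough illustration of the features listed above, the sketch below builds a Job manifest as a plain Python dict. It is not the code from this PR: the function name `build_job_manifest`, the container and volume names, the UID, the default limits, and the TTL value are all assumptions made for the example. The `gvisor` RuntimeClass name and the `sandbox.gke.io/runtime` taint key follow the usual GKE Sandbox conventions, and the field names are the camelCase names used by the Kubernetes REST API.

```python
def build_job_manifest(
    job_name: str,
    configmap_name: str,
    namespace: str = "default",
    image: str = "python:3.11-slim",
    cpu_limit: str = "500m",      # assumed default, not taken from the PR
    mem_limit: str = "512Mi",     # assumed default, not taken from the PR
    ttl_seconds: int = 600,       # assumed default, not taken from the PR
) -> dict:
    """Return a batch/v1 Job manifest as a plain dict (illustrative only)."""
    container = {
        "name": "code-runner",  # hypothetical name
        "image": image,
        "command": ["python3", "/app/code.py"],
        # The ConfigMap holding the code is mounted read-only under /app.
        "volumeMounts": [
            {"name": "code-volume", "mountPath": "/app", "readOnly": True}
        ],
        # Hardened security context: non-root, no capabilities, read-only rootfs.
        "securityContext": {
            "runAsNonRoot": True,
            "runAsUser": 1001,  # hypothetical UID
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
        },
        "resources": {"limits": {"cpu": cpu_limit, "memory": mem_limit}},
    }
    pod_spec = {
        "restartPolicy": "Never",
        # Ask for the gVisor runtime and tolerate the sandbox node taint.
        "runtimeClassName": "gvisor",
        "tolerations": [
            {
                "key": "sandbox.gke.io/runtime",
                "operator": "Equal",
                "value": "gvisor",
                "effect": "NoSchedule",
            }
        ],
        "containers": [container],
        "volumes": [
            {"name": "code-volume", "configMap": {"name": configmap_name}}
        ],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,  # do not retry failed code at the Job level
            "ttlSecondsAfterFinished": ttl_seconds,  # automatic cleanup
            "template": {"spec": pod_spec},
        },
    }
```

A dict like this can be passed as the `body` when creating a namespaced Job with the Kubernetes Python client; the PR itself may build the same structure from the client's typed model classes instead.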

Execution Flow:

  1. Agent invokes GkeCodeExecutor with the LLM-generated code.
  2. GkeCodeExecutor.execute_code creates a temporary ConfigMap holding the code, then creates a k8s Job to run it.
  3. This Job runs a standard python:3.11-slim container. The image is pulled once to the node and cached. The Job mounts the ConfigMap as /app/code.py.
  4. The GkeCodeExecutor will monitor the Job to completion, fetch stdout/stderr logs from the container, return CodeExecutionResult to the LlmAgent, and ensure all temp resources are deleted.
  5. The calling agent formats the result and provides a final response to the user. If the result contains an error, it will retry up to error_retry_attempts times.
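
Steps 2 and 4 of the flow above can be sketched without a cluster. This is a simplified illustration, not the PR's implementation: `build_code_configmap` and `wait_for_job` are hypothetical helpers, the status is assumed to be a dict shaped like the Job status counters (`active`, `succeeded`, `failed`), and the real executor may use the watch API rather than polling.

```python
import time
from typing import Callable


def build_code_configmap(name: str, code: str, namespace: str = "default") -> dict:
    """Wrap the generated code in a v1 ConfigMap under the key code.py."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # Mounted at /app, this key shows up as /app/code.py in the Pod.
        "data": {"code.py": code},
    }


def wait_for_job(
    get_status: Callable[[], dict],
    timeout_seconds: float = 300.0,
    poll_interval: float = 1.0,
) -> str:
    """Poll a status callback until the Job succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_status()
        if status.get("succeeded"):
            return "succeeded"
        if status.get("failed"):
            return "failed"
        time.sleep(poll_interval)
    raise TimeoutError(f"Job did not finish within {timeout_seconds} seconds")


# Usage with a fake status source standing in for the Kubernetes API.
fake_statuses = iter([{"active": 1}, {"active": 1}, {"succeeded": 1}])
outcome = wait_for_job(lambda: next(fake_statuses), poll_interval=0)
print(outcome)
```

Injecting the status callback keeps the waiting logic testable in isolation; in a real executor the callback would read the Job status from the batch/v1 API.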

@syangx39
Author

syangx39 commented Jul 7, 2025

@hangfei could you assign a reviewer for my PR

@eliaslevy

@syangx39 what about blocking network access from the sandbox? Seems like we'd want to add that in. Or are you expecting the developer to create a network policy on his own and apply it to the jobs?

Author

@syangx39 syangx39 left a comment

Two action items remain:

  1. Follow-up PR on input/output file support with _create_pvc().
  2. Regarding this comment: "@syangx39 what about blocking network access from the sandbox? Seems like we'd want to add that in. Or are you expecting the developer to create a network policy on his own and apply it to the jobs?"
    gVisor's default network sandboxing in GKE specifically blocks access to sensitive host-level endpoints like the GKE metadata server. A pod running in the gVisor sandbox can still make network calls to public services on the internet.
    If gke_code_executor blocks network access, the sandboxed code will not be able to pip install packages from the internet. So I'm thinking the best way to handle dependencies is ahead of time: a developer builds an image with the required packages already installed via a requirements.txt file, pushes it to Artifact Registry, and configures the executor to use that custom image instead of the default python:3.11-slim.
    I will still add a block_network_access flag for flexibility, defaulting to True. To keep this PR readable, I will do that in a follow-up PR.
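
For reference, one way a developer could block egress today is a standard Kubernetes NetworkPolicy. The sketch below is only an assumption about how that might look: `build_deny_egress_policy` is a hypothetical helper, the label used in the example is made up, and the proposed block_network_access flag is not part of this PR. The policy only takes effect on clusters where network policy enforcement is enabled.

```python
from typing import Optional


def build_deny_egress_policy(
    name: str,
    namespace: str = "default",
    pod_labels: Optional[dict] = None,
) -> dict:
    """Return a networking.k8s.io/v1 NetworkPolicy that denies all egress.

    Listing "Egress" in policyTypes while providing no egress rules means the
    selected Pods may not open any outbound connection (DNS included).
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # An empty matchLabels selects every Pod in the namespace.
            "podSelector": {"matchLabels": dict(pod_labels or {})},
            "policyTypes": ["Egress"],
            "egress": [],
        },
    }


# Example: restrict only Pods carrying a (hypothetical) sandbox label.
policy = build_deny_egress_policy(
    "deny-sandbox-egress",
    namespace="agent-sandbox",
    pod_labels={"app": "adk-code-sandbox"},
)
```

With a policy like this applied, pip install inside the sandbox would fail, which is why pre-built images with dependencies baked in pair naturally with blocked network access.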

@syangx39 syangx39 requested a review from eliaslevy July 11, 2025 05:54
@hangfei
Collaborator

hangfei commented Jul 16, 2025

What's the startup time for GkeCodeExecutor? Should the user start it first, or can it be triggered on demand?

Collaborator

@hangfei hangfei left a comment

Please create an issue and a doc to update the docs at the adk-docs repo.

@@ -0,0 +1,50 @@
apiVersion: v1
Collaborator

should every user write this?

Author

No, the default namespace is "default". The file deployment_rbac.yaml located in the contributing/samples/ directory is intended to be a sample. It demonstrates all the necessary RBAC components (Namespace, ServiceAccount, Role, RoleBinding) and how they connect. The name agent-sandbox was chosen to be descriptive for the sample, but it's not a required value.

Collaborator

Great. Could you add a README.md to this sample agent and put the information above in it, so that users understand how this file is used?


logger = logging.getLogger(__name__)

class GkeCodeExecutor(BaseCodeExecutor):
Collaborator

I was wondering whether there will be variants of GkeCodeExecutor other than the gVisor-based one.

Author

No. We've designed the GkeCodeExecutor to be exclusively gVisor-based because its core mission is to safely run untrusted, LLM-generated code. Offering a non-sandboxed variant would create a significant security liability, as it would expose the node's kernel to the code being executed.

Collaborator

I think hangfei meant: for a sandboxed solution, will there be options other than gVisor (e.g. Kata Containers)? If so, should we add something related to gVisor to the name?

each execution request. The user's code is mounted via a ConfigMap, and the
Pod is hardened with a strict security context and resource limits.

Key Features:
Collaborator

no need to add this to source code.

Author

I think this 'Key Features' section serves as a high-level summary, so a future developer can immediately grasp the component's security posture and design without having to read the whole implementation.

Do you think we could keep it for that reason? I feel it adds a lot of value for future maintainability.

Collaborator

fair point.

file at: contributing/samples/gke_agent_sandbox/deployment_rbac.yaml
"""
namespace: str = "default"
image: str = "python:3.11-slim"
Collaborator

does it only work with 3.11?

Author

Not necessarily 3.11, but it has to be Python: https://github.com/syangx39/adk-python/blob/5533fbf31085a98184b91bda48b36a515524ea84/src/google/adk/code_executors/gke_code_executor.py#L110

I can patch a follow-up PR to make the command configurable, turning it into a more generic script executor.
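
A configurable command could look roughly like the sketch below. This only illustrates the idea discussed here and is not code from this PR or the follow-up: `RunnerConfig`, its field names, and the Node.js example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RunnerConfig:
    """Hypothetical settings describing how the mounted script is executed."""

    image: str = "python:3.11-slim"
    # Interpreter (plus any flags) used to run the mounted script.
    command: list[str] = field(default_factory=lambda: ["python3"])
    script_path: str = "/app/code.py"

    def container_command(self) -> list[str]:
        """Full argv for the container: interpreter followed by script path."""
        return [*self.command, self.script_path]


# Default behaviour matches the current Python-only setup.
default_cmd = RunnerConfig().container_command()

# A different runtime would only need a different image and command.
node_cmd = RunnerConfig(
    image="node:20-slim", command=["node"], script_path="/app/code.js"
).container_command()
```

Keeping the interpreter separate from the script path means the ConfigMap mount logic would not need to change when the runtime does.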

Collaborator

please. thanks.

_batch_v1: client.BatchV1Api
_core_v1: client.CoreV1Api

def __init__(self, **data):
Collaborator

does this work for AI Studio api key or EasyGCP?

If not, is it possible to throw an exception when it inits?

Author

The AI Studio API key is irrelevant for this component. The GkeCodeExecutor's only job is to communicate with the Kubernetes API, either via a ServiceAccount token or a kubeconfig file.

If load_kube_config() fails, it raises a kubernetes.config.ConfigException. This exception will halt the initialization and cause the GkeCodeExecutor constructor to fail, which I think is the correct behavior when no valid Kubernetes credentials can be found.
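
The credential fallback described above can be sketched generically. This is an assumption about the control flow, not the PR's code: `load_first_config` is a hypothetical helper, and the order (in-cluster first, then kubeconfig) is a guess. The loaders are injected so the logic can be exercised without the kubernetes package installed.

```python
from typing import Callable, Sequence, Tuple, Type


def load_first_config(
    loaders: Sequence[Callable[[], None]],
    expected: Tuple[Type[BaseException], ...] = (Exception,),
) -> str:
    """Try each loader in order and return the name of the first that works.

    If every loader raises one of the expected exceptions, the last one is
    re-raised so that the caller's initialization fails loudly.
    """
    last_error = None
    for loader in loaders:
        try:
            loader()
        except expected as exc:
            last_error = exc
            continue
        return loader.__name__
    if last_error is None:
        raise ValueError("no loaders were provided")
    raise last_error


# With the kubernetes Python client installed, the call might look like:
#   from kubernetes import config
#   load_first_config(
#       [config.load_incluster_config, config.load_kube_config],
#       expected=(config.ConfigException,),
#   )


# Stand-in loaders for demonstration only.
def fake_incluster() -> None:
    raise RuntimeError("not running inside a cluster")


def fake_kubeconfig() -> None:
    return None


used = load_first_config([fake_incluster, fake_kubeconfig], expected=(RuntimeError,))
print(used)
```

Re-raising the last error mirrors the behaviour described in the reply: when no valid credentials are found, construction of the executor stops with the underlying configuration exception.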

@hangfei hangfei added the wip [Status] This issue is being worked on. Either there is a pending PR or is planned to be fixed label Jul 16, 2025
@hangfei
Collaborator

hangfei commented Jul 16, 2025

How long does it take to spin up the code executor and finish it?

@hangfei
Collaborator

hangfei commented Jul 16, 2025

Please add unit tests.

Please add a sample under contributing/samples

@hangfei
Collaborator

hangfei commented Jul 17, 2025

Please make sure you add corresponding docs in adk-docs before submission. Thanks.

@syangx39
Author

syangx39 commented Jul 25, 2025

@hangfei

  1. "Please add a sample under contributing/samples"
    Added. contributing/samples/gke_agent_sandbox/deployment_rbac.yaml is the permission file and contributing/samples/code_execution/gke_sandbox.agent.py is the agent app. Keeping them in different places might confuse readers, so maybe I should move both under contributing/samples/code_execution/?

  2. "Please add unit tests."
    added

tests/unittests/code_executors/test_gke_code_executor.py::TestGkeCodeExecutor::test_create_job_manifest_structure[VERTEX] PASSED [100%]

  3. "What's the startup time for GkeCodeExecutor?"
    We haven't really tested this, but I can give you a ballpark figure.
    Here's the breakdown of startup time:
  • API latency: the agent calls the k8s control plane to create the ConfigMap and the Job. This is negligible (milliseconds).
  • Pod scheduling: the k8s scheduler assigns the new Pod to a gVisor-enabled node. On a healthy cluster with available nodes, this takes milliseconds to seconds.
  • Pod initialization:
    • Image pull is the biggest factor. The node must have the container image. On the first run (cold start), it downloads the image from the registry; since python:3.11-slim is a lightweight image, this takes seconds. On later runs (cached), it is negligible.
    • gVisor sandbox creation: slight overhead (milliseconds?).
  • From my previous experience, it usually takes a few seconds on a healthy cluster for the first run; later runs are instant thanks to the cache. We're working on cold start optimization.

  4. "Should the user start it first, or can it be triggered on demand?"
    Triggered on demand. code_executor.execute_code(...) automatically runs analysis code on any new data files the user provides.

  5. "Please create an issue and a doc to update the docs at the adk-docs repo. Please make sure you add corresponding docs in adk-docs before submission."
    Sure. Is there a specific page where you want me to put it?
    Maybe https://google.github.io/adk-docs/tools/built-in-tools/#code-execution?

@syangx39 syangx39 requested a review from hangfei July 25, 2025 08:27
@hangfei
Collaborator

hangfei commented Jul 30, 2025

Overall LGTM. Thanks!

Let's do the following before merge:

  1. have an adk-docs PR for this change.
  2. fix the tests.

@syangx39
Author

syangx39 commented Aug 15, 2025

"Overall LGTM. Thanks!

Let's do the following before merge:

  1. have an adk-docs PR for this change.
  2. fix the tests."

@hangfei

  1. I added the "adk-docs" PR here: https://github.com/google/adk-docs/pull/621/files
  2. What do you mean by "fix the tests"? The unit tests are already added in tests/unittests/code_executors/test_gke_code_executor.py and they all pass. Here's the log:
$ pip install -e .[test,gke]
$ pytest tests/unittests/code_executors/test_gke_code_executor.py
========================================================== test session starts ==========================================================
platform linux -- Python 3.12.3, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/user/adk-python
configfile: pyproject.toml
plugins: asyncio-1.1.0, mock-3.14.1, anyio-4.10.0, langsmith-0.4.14, xdist-3.8.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 14 items                                                                                                                      

tests/unittests/code_executors/test_gke_code_executor.py ..............                                                           [100%]

========================================================== 14 passed in 7.94s ===========================================================

@syangx39
Author

@hangfei
The doc PR got merged. Please merge this PR as well for consistency.
https://google.github.io/adk-docs/tools/built-in-tools/#gke-code-executor

@hangfei
Collaborator

hangfei commented Aug 19, 2025

thanks.

  1. There are some formatting and test failures. Please fix them.

  2. Could you share the steps to test the code executor so we can test it?

@@ -135,6 +135,10 @@ extensions = [
"toolbox-core>=0.1.0", # For tools.toolbox_toolset.ToolboxToolset
]

# For GKE-specific features, primarily the GkeCodeExecutor.
gke = [
Collaborator

let's put it into extensions.

@@ -49,3 +49,14 @@
' Executor with agents, please install it. If not, you can ignore this'
' warning.'
)

try:
Collaborator

The original code in this file was problematic and we've just fixed it; let's follow the latest style.

@GWeale GWeale self-requested a review August 19, 2025 23:39
Successfully merging this pull request may close these issues.

FEAT: Add GkeCodeExecutor for Secure Code Execution on GKE
4 participants