Kubernetes Sandbox Backend¶

How Cognition runs agent commands in Kubernetes-native sandboxes using the agent-sandbox CRD and controller.

Why a K8s Backend¶

Cognition ships three sandbox backends:

Backend	Isolation	Works on K8s?
`local`	None — commands run as server process user	Yes, but no isolation
`docker`	Container per session	No — requires Docker socket + privileged mode
`kubernetes`	Sandbox pod per session	Yes — K8s-native, no special privileges needed

The Cognition Helm chart deploys the server with readOnlyRootFilesystem: true, capabilities.drop: ["ALL"], and runAsNonRoot: true. The Docker backend cannot work under these constraints because it needs access to the Docker socket and privileged mode. The K8s backend instead uses the agent-sandbox CRD, controller, and router to provide isolated sandbox pods without requiring any privileged access from the Cognition server.

Two-Package Split¶

┌─────────────────────────────────────────────────────┐
│  Cognition (this repo)                              │
│                                                     │
│  CognitionKubernetesSandboxBackend                  │
│  • Protected path enforcement (.cognition/)         │
│  • Scoping labels from CognitionContext             │
│  • Session-scoped lifecycle                         │
│  • Delegates all execution to K8sSandbox            │
│                                                     │
│  Wraps ──────────────────────────────────┐          │
│                                         │          │
└─────────────────────────────────────────┼──────────┘
                                          │
┌─────────────────────────────────────────┼──────────┐
│  langchain-k8s-sandbox (standalone pkg) │          │
│                                         ▼          │
│  K8sSandbox(BaseSandbox)                          │
│  • Lazy init on first execute()                   │
│  • Labels passthrough to Sandbox CR               │
│  • TTL via spec.shutdownTime patch                │
│  • BaseSandbox file ops via execute()             │
│                                                   │
│  Uses ─────────────────────────────────┐          │
│                                       │          │
└───────────────────────────────────────┼──────────┘
                                      │
┌──────────────────────────────────────┼────────────┐
│  k8s-agent-sandbox SDK (PyPI)       │            │
│                                     ▼            │
│  SandboxClient                                   │
│  • create_sandbox(template, namespace, labels)   │
│  • sandbox.commands.run(cmd, timeout)            │
│  • sandbox.terminate()                           │
│  • SandboxDirectConnectionConfig(router_url)     │
└──────────────────────────────────────────────────┘

The split follows the langchain-<provider> convention. langchain-k8s-sandbox is published as a standalone package with zero Cognition imports. Cognition wraps it with domain policy (protected paths, scoping labels, session lifecycle).
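As an illustration of the kind of domain policy the wrapper layers on top of the standalone package, here is a minimal sketch of a protected-path guard. Only the protected `.cognition/` directory comes from the diagram above; the workspace root, the function names, and the exception type are hypothetical and not taken from the Cognition codebase.

```python
import posixpath

# Assumption: the workspace root and helper names are illustrative only.
WORKSPACE_ROOT = "/workspace"
PROTECTED_DIRS = (".cognition",)


class ProtectedPathError(PermissionError):
    """Raised when a write or edit targets a protected path."""


def is_protected(path: str, root: str = WORKSPACE_ROOT) -> bool:
    """Return True if `path` resolves to a protected directory or anything under it."""
    # Resolve relative paths against the workspace and collapse "." and ".."
    # segments, so "src/../.cognition/x" cannot slip past the check.
    resolved = posixpath.normpath(posixpath.join(root, path))
    rel = posixpath.relpath(resolved, root)
    first_segment = rel.split("/", 1)[0]
    return first_segment in PROTECTED_DIRS


def guard_write(path: str) -> str:
    """Reject writes to protected paths; otherwise return the path unchanged."""
    if is_protected(path):
        raise ProtectedPathError(f"refusing to modify protected path: {path}")
    return path
```

This check is purely lexical. It does not follow symlinks inside the sandbox, so a real guard would likely need an additional check on the resolved target.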

Design doc for the standalone package: packages/langchain-k8s-sandbox/DESIGN.md

Execution Flow¶

User sends message
        │
        ▼
┌──────────────────┐
│  Cognition API   │  POST /sessions/{id}/messages
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  CognitionAgent  │  LangGraph ReAct loop
│  (runtime)       │  decides to call a shell tool
└──────┬───────────┘
       │
       ▼
┌──────────────────────────────────────────┐
│  CognitionKubernetesSandboxBackend       │
│  ┌────────────────────────────────────┐  │
│  │ Protected path guard (write/edit)  │  │
│  └──────────────┬─────────────────────┘  │
│                 ▼                         │
│  K8sSandbox.execute()                     │
│  (lazy creates Sandbox CR on first call) │
└─────────────────┬────────────────────────┘
                  │
    ┌─────────────┴─────────────┐
    ▼ (first call)              ▼ (subsequent calls)
┌──────────────────┐    ┌──────────────────┐
│ SDK:             │    │ SDK:             │
│ create_sandbox() │    │ commands.run()   │
│ + patch TTL      │    │                  │
└──────┬───────────┘    └────────┬─────────┘
       │                         │
       ▼                         │
┌──────────────────┐             │
│ K8s API Server   │             │
│ → SandboxClaim   │             │
│ → controller     │             │
│ → Sandbox CR     │             │
│ → Pod spawned    │             │
└──────────────────┘             │
                                 │
       ┌─────────────────────────┘
       ▼
┌──────────────────┐
│ sandbox-router   │  Routes by X-Sandbox-ID header
│ :8080            │
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│ Sandbox Pod      │  python-runtime :8888
│ (from template)  │  Executes command
│                  │  Returns stdout/stderr/exit_code
└──────────────────┘

The sandbox is never created for conversation-only messages. _ensure_sandbox() triggers pod creation only when the agent invokes a sandbox-backed tool (bash, write_file, read_file, etc.).

Shell Interpretation¶

The agent-sandbox SDK's commands.run() executes commands directly (like exec), not through a shell. This means heredocs, pipes, redirects, and variable expansion do not work with raw command strings.

K8sSandbox.execute() wraps every command in sh -c using shlex.quote():

import shlex
# Quote the whole command so the shell receives it as a single argument.
sh_command = f"sh -c {shlex.quote(command)}"
result = sandbox.commands.run(sh_command, timeout=effective_timeout)

This is the same pattern used by ModalSandbox (the reference deepagents integration, which wraps in bash -c). The wrapping is required because BaseSandbox's write(), read(), and edit() methods construct commands with heredoc syntax to pass base64-encoded payloads via stdin — these only work through a shell interpreter.
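The effect of the wrapping can be checked locally without a cluster. The sketch below assumes an exec-style runner that tokenizes the command string roughly the way `shlex.split` does; how the SDK actually splits the string is not stated here, so `shlex.split` and `subprocess.run` are stand-ins, and `wrap_for_shell` is a hypothetical helper name.

```python
import shlex
import subprocess


def wrap_for_shell(command: str) -> str:
    """Quote `command` so it arrives at `sh -c` as a single argument."""
    return f"sh -c {shlex.quote(command)}"


# A command that needs a shell: arithmetic expansion, a pipe, and inner quotes.
command = "echo $((2 + 3)) | tr -d '\\n'"
wrapped = wrap_for_shell(command)

# Tokenizing the wrapped string yields exactly three argv entries, with the
# original command intact as the third.
argv = shlex.split(wrapped)
assert argv == ["sh", "-c", command]

# Running that argv locally (no sandbox involved) shows the shell now
# interprets the expansion and the pipe.
result = subprocess.run(argv, capture_output=True, text=True, check=True)
print(result.stdout)
```

Without the `sh -c` wrapper, the same string would be passed as literal arguments to `echo`, and the pipe character would never be interpreted.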

Scoping Labels¶

CognitionContext.effective_scope fields are mapped to cognition.io/* labels on the Sandbox CR, enabling multi-tenant visibility:

labels = {
    "cognition.io/session": session_id,
}
# Add all effective_scope keys as labels
for key, value in context.effective_scope.items():
    labels[f"cognition.io/{key}"] = value

Queryable with kubectl:

kubectl get sandboxes -n cognition -l cognition.io/user=alice

Labels are set at SandboxClaim creation time and cannot be changed afterward. They are for observability and traceability only — not for K8s admission policy enforcement (that's a v2 hardening item).
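Because labels cannot be changed after creation, it can help to validate them before the claim is created. The following sketch builds the label map and a matching selector string. The helper names are hypothetical; the value check follows the Kubernetes label-value syntax (at most 63 characters, alphanumeric at both ends, with `-`, `_`, and `.` allowed in between), and whether Cognition performs such a check is not stated here.

```python
import re

# Kubernetes label values: empty, or up to 63 chars that start and end with an
# alphanumeric character, with "-", "_" and "." allowed in between.
_LABEL_VALUE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def build_scope_labels(session_id: str, effective_scope: dict[str, str]) -> dict[str, str]:
    """Map a session id and scope fields to cognition.io/* labels."""
    labels = {"cognition.io/session": session_id}
    for key, value in effective_scope.items():
        labels[f"cognition.io/{key}"] = str(value)
    for key, value in labels.items():
        if len(value) > 63 or not _LABEL_VALUE.fullmatch(value):
            raise ValueError(f"invalid label value for {key}: {value!r}")
    return labels


def to_label_selector(labels: dict[str, str]) -> str:
    """Render labels as an equality-based selector usable with `kubectl -l`."""
    return ",".join(f"{key}={value}" for key, value in sorted(labels.items()))


labels = build_scope_labels("s-123", {"user": "alice", "project": "demo"})
selector = to_label_selector(labels)
```

Scope values such as email addresses would fail this check because `@` is not a permitted label character, so they would need to be normalized before being used as labels.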

Settings¶

Five environment variables control the K8s sandbox backend:

Env Var	Default	Description
`COGNITION_K8S_SANDBOX_TEMPLATE`	`cognition-sandbox`	SandboxTemplate CR name
`COGNITION_K8S_SANDBOX_NAMESPACE`	`default`	K8s namespace for sandbox CRs
`COGNITION_K8S_SANDBOX_ROUTER_URL`	`http://sandbox-router-svc.default.svc.cluster.local:8080`	Router service URL
`COGNITION_K8S_SANDBOX_TTL`	`3600`	Auto-cleanup after N seconds
`COGNITION_K8S_SANDBOX_WARM_POOL`	(none)	SandboxWarmPool CR name (reserved)

Set COGNITION_SANDBOX_BACKEND=kubernetes to activate.
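To make the defaults concrete, here is a minimal sketch of reading these variables into a settings object. The class and function names are hypothetical and do not describe Cognition's actual settings module; only the variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_PREFIX = "COGNITION_K8S_SANDBOX_"


@dataclass(frozen=True)
class K8sSandboxSettings:
    """Hypothetical settings holder; defaults mirror the table above."""

    template: str = "cognition-sandbox"
    namespace: str = "default"
    router_url: str = "http://sandbox-router-svc.default.svc.cluster.local:8080"
    ttl_seconds: int = 3600
    warm_pool: Optional[str] = None  # reserved, unset by default

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "K8sSandboxSettings":
        """Build settings from environment variables, falling back to defaults."""
        env = os.environ if env is None else env
        defaults = cls()
        return cls(
            template=env.get(_PREFIX + "TEMPLATE", defaults.template),
            namespace=env.get(_PREFIX + "NAMESPACE", defaults.namespace),
            router_url=env.get(_PREFIX + "ROUTER_URL", defaults.router_url),
            ttl_seconds=int(env.get(_PREFIX + "TTL", defaults.ttl_seconds)),
            warm_pool=env.get(_PREFIX + "WARM_POOL") or None,
        )


def kubernetes_backend_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when COGNITION_SANDBOX_BACKEND selects the K8s backend."""
    env = os.environ if env is None else env
    return env.get("COGNITION_SANDBOX_BACKEND") == "kubernetes"
```

Note that the default router URL points at the `default` namespace, so deployments that place the router elsewhere need to override both the namespace and the router URL.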

Helm values under config.sandbox.k8s.*:

config:
  sandbox:
    backend: kubernetes
    k8s:
      template: cognition-sandbox
      namespace: cognition
      routerUrl: http://sandbox-router-svc.cognition.svc.cluster.local:8080
      ttl: 3600
      warmPool: ""

Session Lifecycle¶

Session created      →  create_sandbox_backend("kubernetes", labels={...})
                         No Sandbox CR yet. Backend stores config only.
                              │
                              ▼
First tool call      →  CognitionKubernetesSandboxBackend.execute()
                         → K8sSandbox._ensure_sandbox()
                         → SandboxClient.create_sandbox()
                         → Pod appears with scoping labels + TTL
                              │
                              ▼
Subsequent calls     →  execute() routes through existing sandbox
                              │
                              ▼
Session destroyed    →  backend.terminate()  ← Wired via SessionAgentManager.unregister_session()
                         → sandbox.terminate()
                         → SandboxClaim deleted
                         → Controller deletes Sandbox + Pod
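The lifecycle above can be sketched with in-memory stand-ins for the SDK. `FakeClient` and `FakeSandbox` are hypothetical test doubles that only mimic the call shapes named in this document (`create_sandbox(template, namespace, labels)`, `run(cmd, timeout)`, `terminate()`); `LazySandboxBackend` is an illustrative simplification, not the real `K8sSandbox`.

```python
class FakeSandbox:
    """Stand-in for an SDK sandbox handle."""

    def __init__(self, labels):
        self.labels = labels
        self.terminated = False
        self.history = []

    def run(self, cmd, timeout):
        self.history.append(cmd)
        return f"ran: {cmd}"

    def terminate(self):
        self.terminated = True


class FakeClient:
    """Stand-in for the SDK client; records every sandbox it creates."""

    def __init__(self):
        self.created = []

    def create_sandbox(self, template, namespace, labels):
        sandbox = FakeSandbox(labels)
        self.created.append(sandbox)
        return sandbox


class LazySandboxBackend:
    """Stores config at session creation; creates the sandbox on first execute()."""

    def __init__(self, client, template, namespace, labels):
        self._client = client
        self._template = template
        self._namespace = namespace
        self._labels = dict(labels)
        self._sandbox = None  # nothing exists in the cluster yet

    def _ensure_sandbox(self):
        if self._sandbox is None:
            self._sandbox = self._client.create_sandbox(
                self._template, self._namespace, self._labels
            )
        return self._sandbox

    def execute(self, command, timeout=60):
        return self._ensure_sandbox().run(command, timeout)

    def terminate(self):
        # Safe to call even if no tool call ever created a sandbox.
        if self._sandbox is not None:
            self._sandbox.terminate()
            self._sandbox = None
```

Making `terminate()` a no-op when nothing was created keeps session deletion cheap for conversation-only sessions.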

TTL safety net: If terminate() is never called (server crash, network partition), the controller deletes the Sandbox CR when spec.shutdownTime expires.
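As a rough illustration of how a TTL in seconds can translate into a shutdownTime value, the sketch below computes an absolute UTC deadline and formats it as an RFC 3339 timestamp, which is how Kubernetes serializes time fields. The function name and the exact shape of the patch body are assumptions; only the `spec.shutdownTime` field name comes from this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def shutdown_time_patch(ttl_seconds: int, now: Optional[datetime] = None) -> dict:
    """Build a merge-patch style body that sets spec.shutdownTime to now + TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalize to UTC so the formatted value always carries the "Z" suffix.
    deadline = now.astimezone(timezone.utc) + timedelta(seconds=ttl_seconds)
    return {"spec": {"shutdownTime": deadline.strftime("%Y-%m-%dT%H:%M:%SZ")}}


patch = shutdown_time_patch(
    3600, now=datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
)
```

Because the deadline is absolute, it is fixed at the moment the patch is computed and is not extended by later activity unless the field is patched again.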

Termination wiring: terminate() is called from SessionAgentManager.unregister_session() when a session is deleted via DELETE /sessions/{id}. This ensures sandbox pods are cleaned up when sessions are destroyed.

Deployment Prerequisites¶

When using config.sandbox.backend: kubernetes, the following must be installed before deploying Cognition:

Prerequisite	Install	Purpose
agent-sandbox controller	`kubectl apply -f .../v0.3.10/manifest.yaml`	Reconciles Sandbox CRs into pods
agent-sandbox extensions	`kubectl apply -f .../v0.3.10/extensions.yaml`	SandboxTemplate, SandboxClaim, SandboxWarmPool CRDs
sandbox-router	Deploy from agent-sandbox router	Proxies commands to sandbox pods
SandboxTemplate CR	User creates this	Defines sandbox pod spec (image, resources, security)

These are not bundled in Cognition's Helm chart. The agent-sandbox controller is cluster-scoped infrastructure, not per-application.

The Cognition Helm chart creates the required RBAC automatically when backend=kubernetes: a namespace-scoped Role and RoleBinding for sandbox lifecycle, plus a ClusterRole for startup CRD checks.

Helm Chart¶

RBAC (conditional on `backend=kubernetes`)¶

Namespace-scoped Role for sandbox lifecycle:

rules:
  - apiGroups: ["agents.x-k8s.io"]
    resources: ["sandboxes"]
    verbs: ["get", "list", "watch", "patch"]          # patch for shutdownTime
  - apiGroups: ["extensions.agents.x-k8s.io"]
    resources: ["sandboxclaims", "sandboxtemplates"]
    verbs: ["get", "list", "watch", "create", "delete"]  # SDK lifecycle

Cluster-scoped ClusterRole for startup validation (CRD existence checks):

rules:
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["get", "list"]
    resourceNames:
      - sandboxes.agents.x-k8s.io
      - sandboxclaims.extensions.agents.x-k8s.io
      - sandboxtemplates.extensions.agents.x-k8s.io

Both are created automatically by the Helm chart when backend=kubernetes.

Example SandboxTemplate¶

See deploy/examples/cognition-sandbox-template.yaml.

The template must include writable volume mounts for /tmp and /workspace. The sandbox container runs with readOnlyRootFilesystem: true for security, so any path without its own writable mount is read-only. Without these mounts, BaseSandbox file operations that write temporary data (e.g., heredoc payloads) fail with "Read-only file system" errors.

NetworkPolicy (optional)¶

Set config.sandbox.k8s.denyEgress: true in Helm values to deny all egress from sandbox pods. This is the K8s equivalent of Docker's network_mode: "none".

Startup Validation¶

When sandbox_backend=kubernetes, the server validates at startup that:

1. The sandboxes.agents.x-k8s.io CRD exists (fatal if missing)
2. The sandboxclaims.extensions.agents.x-k8s.io CRD exists (fatal if missing)
3. The router health endpoint (/healthz) is reachable (warning if not)

If CRDs are missing, the server fails to start with a clear error message including install commands.
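The fatal-versus-warning split can be sketched as follows. The two checks are injected as callables so the logic can be exercised without a cluster; the function name, exception type, and logger name are hypothetical, and the real server's messages and install hints are not reproduced here.

```python
import logging
from typing import Callable

logger = logging.getLogger("cognition.startup")

REQUIRED_CRDS = (
    "sandboxes.agents.x-k8s.io",
    "sandboxclaims.extensions.agents.x-k8s.io",
)


class MissingCRDError(RuntimeError):
    """Raised when a required agent-sandbox CRD is not installed."""


def validate_k8s_sandbox(
    crd_exists: Callable[[str], bool],
    router_healthy: Callable[[], bool],
) -> list[str]:
    """Fail on missing CRDs; return (and log) warnings for non-fatal problems."""
    missing = [name for name in REQUIRED_CRDS if not crd_exists(name)]
    if missing:
        raise MissingCRDError(
            "missing required CRDs: " + ", ".join(missing)
            + "; install the agent-sandbox controller and extensions first"
        )

    warnings = []
    if not router_healthy():
        message = "sandbox-router /healthz is not reachable; tool calls may fail"
        logger.warning(message)
        warnings.append(message)
    return warnings
```

Treating the router check as a warning lets the server start while the router is still rolling out, whereas a missing CRD can never resolve itself at runtime.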

Live Demo Results¶

Verified on a Talos Linux cluster (amd64, K8s v1.34.1):

Creating sandbox...
Sandbox sandbox-claim-a71288fe is ready.     # ~2s (image cached)
Running: echo Hello from K8s sandbox!
  stdout: Hello from K8s sandbox!
  exit_code: 0
Running: python3 platform check
  stdout: sandbox-claim-a71288fe 3.11.15
Running: uname -a
  stdout: Linux sandbox-claim-a71288fe 6.12.48-talos x86_64 GNU/Linux
Terminating sandbox...
Terminated SandboxClaim: sandbox-claim-a71288fe

First sandbox on a fresh node took ~17s (image pull). Subsequent sandboxes took ~2s.

E2E Tests¶

K8s sandbox integration tests are in tests/e2e/test_k8s_sandbox_e2e.py. They are skipped unless COGNITION_K8S_E2E=1 is set (same pattern as other e2e tests that require external infrastructure).

# Port-forward the sandbox-router and Cognition server
kubectl port-forward svc/sandbox-router-svc -n cognition 8081:8080 &
kubectl port-forward svc/cognition -n cognition 8000:8000 &

# Run tests
COGNITION_K8S_E2E=1 COGNITION_K8S_E2E_ROUTER_URL=http://localhost:8081 \
    uv run pytest tests/e2e/test_k8s_sandbox_e2e.py -v

Test coverage:

Class	Tests	What it verifies
`TestK8sSandboxLifecycle`	2	API-level session create/message/delete with sandbox cleanup
`TestK8sSandboxDirectBackend`	9	Direct `K8sSandbox` operations: execute, Python, write/read, edit, upload/download, labels, TTL, terminate, lazy init
`TestK8sSandboxStartupValidation`	2	Cluster prerequisites: CRDs exist, SandboxTemplate exists

Known Gaps¶

Gap	Impact	Priority
Warm pool not implemented	First tool call pays cold-start latency	Low
Native SDK file transfer not used	File operations rely on base64-through-execute() in v1	Low

Security Considerations¶

The K8s sandbox provides isolation equivalent to the Docker backend's, enforced by the K8s control plane:

Boundary	Mechanism
Process	Separate pod per session, own PID namespace
Network	Separate network namespace per pod (add a NetworkPolicy for egress denial)
Filesystem	`readOnlyRootFilesystem: true`, `emptyDir` for workspace
Capabilities	`capabilities.drop: ["ALL"]`
Resources	`resources.limits` on SandboxTemplate
Auto-cleanup	TTL-based deletion via agent-sandbox controller

Secrets: Never placed inside the sandbox. API keys and credentials stay in the Cognition server process. If an agent needs authenticated API access, define tools that run outside the sandbox.

For production, apply a NetworkPolicy to deny sandbox egress:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-deny-egress
spec:
  podSelector:
    matchLabels:
      agents.x-k8s.io/sandbox: "true"
  policyTypes:
    - Egress
  egress: []

Layer Assignment¶

Component	Layer
`langchain-k8s-sandbox` package	Layer 3 (Execution)
`CognitionKubernetesSandboxBackend`	Layer 4 (Agent Runtime) + Layer 3
Settings fields	Layer 1 (Foundation)
Helm RBAC + values	Layer 1 (Foundation)
SandboxTemplate	Layer 1 (Foundation, prerequisite)

No upward imports. langchain-k8s-sandbox (Layer 3) has no Cognition dependency. CognitionKubernetesSandboxBackend (Layer 4) imports from Layer 3 only.