Bundled agent deployments (Agent Builder/Fleet, Insights/Clio, Polly) remain permanently stuck in DEPLOY_FAILED despite all pods showing 1/1 Running and the operator reporting deployment_ready: true.
The issue is caused by a hairpin NAT networking limitation on EKS with ALB that prevents the health check from completing.
How to Identify
The deployment appears healthy at the infrastructure level but never transitions to DEPLOYED state. Check the following:
# Bootstrap job is still running or failed after 10+ minutes
kubectl get job langsmith-agent-bootstrap -n langsmith
# Pods are healthy
kubectl get pods -n langsmith | grep agent-builder
# Shows 1/1 Running
# Operator confirms ready
kubectl logs -n langsmith -l app.kubernetes.io/component=operator | grep deployment_ready
# Shows "deployment_ready":true
Error Messages
In the listener logs (kubectl logs -n langsmith -l app.kubernetes.io/component=listener):
"A retriable httpx error occurred when calling health check endpoint
for K8s service agent-builder-<hash>. Retrying..."
type: httpx.ConnectTimeout
"TimeoutError: Timeout: New revision health check did not succeed
after 600 seconds. Please see server logs for more information."
In the bootstrap logs (kubectl logs -n langsmith -l job-name=langsmith-agent-bootstrap):
Status: DEPLOYING
...
Failed: DEPLOY_FAILED
ERROR: Deployment failures:
agent-builder: Deployment failed or timed out
Root Cause
When config.deployment.ingressHealthCheckEnabled: true is set in the Helm values, the listener performs the final deployment health check by calling the external ingress URL of the deployment (e.g. https://<hostname>/lgp/agent-builder-<hash>/ok) rather than the internal Kubernetes ClusterIP service.
On EKS with an AWS ALB ingress, pods inside the cluster cannot route requests back out through the external ALB to reach themselves. This is a hairpin NAT limitation. Every health check attempt results in httpx.ConnectTimeout, and after 600 seconds the revision is marked DEPLOY_FAILED.
This affects any cluster where the ALB is not configured to support loopback routing from within the VPC.
Solution
Set config.deployment.ingressHealthCheckEnabled: false in the Helm values.
This forces the listener to health check via the internal Kubernetes service directly. For example:
config:
deployment:
enabled: true
ingressHealthCheckEnabled: false
Apply the change and trigger a new bootstrap run. The bootstrap will retry automatically. The deployment should transition to DEPLOYED within seconds once the health check can reach the internal service.