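Why is my LangSmith deployment entering a "zombie state" where it accepts requests but stops responding?

Context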

A LangSmith deployment may enter a "zombie state" where the /docs endpoint remains accessible and the service continues to accept user input, but it stops producing responses. When the service is restarted, all previously queued threads are processed, even though they correspond to past requests that are no longer relevant. This issue typically occurs due to infrastructure bottlenecks in self-hosted LangGraph environments.

Answer

The zombie state follows from an architectural separation: the /docs endpoint is served by the API layer (FastAPI health check), while graph execution happens in background workers that consume from Redis queues. The workers can therefore fail while the API layer remains healthy and keeps accepting requests.
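
The separation described above can be pictured with a toy model. This is only a sketch: `ToyDeployment` and its methods are hypothetical names, and an in-memory deque stands in for the Redis task queue. It shows the API layer answering while nothing drains the queue, and the backlog being processed once the worker comes back.

```python
from collections import deque


class ToyDeployment:
    """Toy model: an API layer that enqueues runs and a worker that drains them."""

    def __init__(self):
        self.queue = deque()      # stands in for the Redis task queue
        self.worker_alive = True  # flipped to False to simulate a crash or hang
        self.completed = []       # requests the worker has processed

    def docs(self):
        # The API layer answers regardless of worker state.
        return 200

    def submit(self, request):
        # Requests are accepted and queued even when no worker is consuming.
        self.queue.append(request)
        return 202

    def run_worker_once(self):
        # A dead or hung worker never drains the queue.
        if not self.worker_alive:
            return
        while self.queue:
            self.completed.append(self.queue.popleft())


deployment = ToyDeployment()
deployment.worker_alive = False  # worker crashes or hangs
deployment.submit("request-1")
deployment.submit("request-2")
deployment.run_worker_once()
print(deployment.docs(), len(deployment.queue), deployment.completed)

deployment.worker_alive = True   # service restart
deployment.run_worker_once()
print(len(deployment.queue), deployment.completed)
```

After the simulated restart, both old requests are processed, which mirrors the stale-request behavior described in the Context section.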

Root causes include:

Redis disk/memory exhaustion
PostgreSQL connection pool exhaustion (can affect individual replicas while others remain healthy)
Workers crashing or hanging silently
Client disconnections without proper cancellation

Stale requests persist because both Redis (task queue) and PostgreSQL (checkpoints/threads) store data durably with exactly-once semantics, and threads have no expiration by default.

To resolve this issue:

Upgrade LangGraph to the latest version - This includes worker cancellation fixes and PostgreSQL connection pool improvements.
Reduce PostgreSQL connection pool size for Azure environments:
```
LANGGRAPH_POSTGRES_POOL_MAX_SIZE=50  # Default is 150
ASYNCPG_POOL_MIN_SIZE=1              # Reduce idle connections
```
Warning: Ensure your pool size is adequate relative to N_JOBS_PER_WORKER. The pool is divided among workers (pool_size // n_jobs_per_worker). Setting the pool size too low can result in only 1 connection per worker, causing stability issues when connections go stale. If reducing pool size, monitor for connection‑related errors and increase if needed.
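
To make the division in the warning concrete, here is a small sketch of the arithmetic. It assumes the `pool_size // n_jobs_per_worker` formula quoted above is the whole story. The function names and the `minimum` threshold are hypothetical, and the job counts are illustrative values rather than documented defaults.

```python
def connections_per_job(pool_max_size: int, n_jobs_per_worker: int) -> int:
    """Integer division quoted in the warning: pool_size // n_jobs_per_worker."""
    if pool_max_size <= 0 or n_jobs_per_worker <= 0:
        raise ValueError("both values must be positive")
    return pool_max_size // n_jobs_per_worker


def pool_is_too_small(pool_max_size: int, n_jobs_per_worker: int, minimum: int = 2) -> bool:
    # `minimum` is a hypothetical threshold: a share of 1 leaves no spare
    # connection when one goes stale.
    return connections_per_job(pool_max_size, n_jobs_per_worker) < minimum


for jobs in (10, 25, 50):  # illustrative N_JOBS_PER_WORKER values
    share = connections_per_job(50, jobs)
    print(f"pool=50 jobs={jobs} -> share={share}, too small: {pool_is_too_small(50, jobs)}")
```

Running this kind of check before lowering LANGGRAPH_POSTGRES_POOL_MAX_SIZE shows whether the reduced pool still leaves more than one connection per share.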

Configure health checks to detect connection pool exhaustion:

```
GET /ok?check_db=1  # Add check_db parameter to verify database connectivity
```
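
A probe for this endpoint could look like the following sketch, which uses only the Python standard library. The helper names, the example base URL, and the timeout are assumptions. The sketch treats any 2xx response as healthy, because the response body format is not described here.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


def build_health_url(base_url: str) -> str:
    """Return the /ok URL with the check_db=1 query parameter."""
    return urljoin(base_url.rstrip("/") + "/", "ok?check_db=1")


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """True only if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(build_health_url(base_url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        # Connection refused, timeouts, and 4xx/5xx responses all count as unhealthy.
        return False


# Example (hypothetical address): is_healthy("http://localhost:8123")
```

A liveness or readiness probe that calls `is_healthy` restarts or drains a replica whose database connectivity is gone, even though /docs still responds.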

Enable automatic run cancellation on client disconnect:

```
async for chunk in client.runs.stream(
    thread_id=thread_id,
    assistant_id="agent",
    input={"messages": [{"role": "human", "content": "Hello"}]},
    on_disconnect="cancel",  # Enables cancellation on disconnect
):
    print(chunk.event, chunk.data)  # Handle each streamed event
```

Isolate background job event loops (only needed if workers share event loops with the API layer):
```
BG_JOB_ISOLATED_LOOPS=true  # Prevents sync operations from blocking workers
```
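
The problem this setting is described as addressing can be reproduced with plain asyncio. The sketch below does not show how BG_JOB_ISOLATED_LOOPS is implemented. It only demonstrates that a synchronous call on a shared event loop stalls every other task, using `asyncio.to_thread` (Python 3.9+) as a stand-in for moving the work off that loop. All function names are hypothetical.

```python
import asyncio
import time


async def count_ticks(stop: asyncio.Event, interval: float = 0.01) -> int:
    """Heartbeat: counts how often the loop gets a chance to run this task."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(interval)
        ticks += 1
    return ticks


async def measure(offload: bool, busy_seconds: float = 0.2) -> int:
    """Run a sync sleep next to the heartbeat and return the heartbeat's tick count."""
    stop = asyncio.Event()
    heartbeat = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # let the heartbeat start
    if offload:
        await asyncio.to_thread(time.sleep, busy_seconds)  # sync work off the loop
    else:
        time.sleep(busy_seconds)  # sync work on the loop: nothing else runs
    stop.set()
    return await heartbeat


ticks_blocked = asyncio.run(measure(offload=False))
ticks_offloaded = asyncio.run(measure(offload=True))
print(f"blocked loop: {ticks_blocked} tick(s), offloaded: {ticks_offloaded} tick(s)")
```

The heartbeat barely advances while the synchronous call holds the loop, which is the same starvation a worker would experience when sharing a loop with blocking code.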

Enable resumable streams for timeout handling:

```
RESUMABLE_STREAM_TTL_SECONDS=300  # 5 minutes (default is 120s)
```

Handle stale runs with interrupt strategy:

```
run = await client.runs.create(
    thread_id,
    assistant_id,
    input={"messages": [...]},
    multitask_strategy="interrupt",  # Cancels any previous pending run
)
```
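
When old runs should be cancelled explicitly rather than interrupted by a new one, a first step is deciding which runs count as stale. The helper below is a sketch under stated assumptions: each run is a dict with `run_id`, `status`, and an ISO 8601 `created_at` string that includes a UTC offset. The function name and the 10-minute cutoff are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def select_stale_runs(runs, max_age, now=None):
    """Return the run_ids of pending runs created more than max_age ago."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for run in runs:
        if run["status"] != "pending":
            continue  # only queued runs are candidates
        created = datetime.fromisoformat(run["created_at"])
        if now - created > max_age:
            stale.append(run["run_id"])
    return stale


# Illustrative data, not real API output.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    {"run_id": "old", "status": "pending", "created_at": "2024-01-01T11:00:00+00:00"},
    {"run_id": "new", "status": "pending", "created_at": "2024-01-01T11:59:00+00:00"},
    {"run_id": "done", "status": "success", "created_at": "2024-01-01T10:00:00+00:00"},
]
stale_ids = select_stale_runs(runs, timedelta(minutes=10), now=now)
print(stale_ids)
```

Each returned id could then be passed to the SDK's run cancellation call (for example `client.runs.cancel(thread_id, run_id)`). Check the exact signature and the shape of the run objects against your installed SDK version.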

As a last resort, clear Redis data. FLUSHDB deletes every key in the currently selected Redis database, so any queued runs are lost:
```
# Connect to your Redis instance and run:
FLUSHDB
```
Then restart the service after applying configuration changes.

Sources:

Environment Variables Documentation
