When working with LangGraph deployments, it’s important to understand how checkpoints, the database, API runtime memory, and TTL (time-to-live) interact. This article helps diagnose issues such as out-of-memory (OOM) errors, database growth, and pod crashes.
Quick Reference
| Issue | Symptom | Fix |
| --- | --- | --- |
| Pod memory growth | OOM errors, run cancellations, pod restarts | Optimize workflow code or increase pod memory |
| Database growth | Storage cap reached, deployment inoperable | Configure TTL or manually clean up old checkpoints |
| Large checkpoints | Slow runs, database bloat | Reduce application state size saved to checkpoints, or move large data objects to external storage/LangGraph Store |
| Connection timeouts | Long-running workflows fail with database connection errors | Use `ConnectionPool` instead of a raw `Connection` with `PostgresSaver` |
| MongoDB document size limit | "DocumentTooLarge" errors during checkpoint saves | Switch to PostgreSQL or reduce checkpoint size; avoid checkpointers in subgraphs |
1. Checkpoint Storage
For deployments:
Checkpoints are always stored in the database (Postgres in production deployments).
Connection Management: When using PostgresSaver directly (outside of LangGraph Platform deployments), the default behavior holds a database connection for the entire run duration. For long-running workflows, this can cause connection timeout issues. Use a `ConnectionPool` instead of a raw `Connection` object to manage connections automatically:

```python
from psycopg.rows import dict_row
from psycopg_pool import ConnectionPool
from langgraph.checkpoint.postgres import PostgresSaver

pool = ConnectionPool(
    conn_string,
    max_size=10,
    max_idle=300.0,  # Time (seconds) before an idle connection is closed
    kwargs={"autocommit": True, "row_factory": dict_row},
)
checkpointer = PostgresSaver(pool)
checkpointer.setup()
```

Warning: When defining custom context schemas, avoid converting UUID objects to strings in field defaults. PostgreSQL UUID columns expect `uuid.UUID` objects, not strings. For example, avoid `thread_id: str = Field(default_factory=lambda: str(uuid.uuid4()))`, as this can cause "bad argument type for built-in operation" and "'str' object has no attribute 'hex'" errors during checkpoint saves. Instead, let LangGraph Platform automatically inject the `thread_id`, or use `thread_id: str | None = Field(default=None)` if you need to reference it in your schema.
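The "'str' object has no attribute 'hex'" failure mode can be seen with nothing but the standard library; this sketch (not LangGraph code) just contrasts a `uuid.UUID` object with its string form:

```python
# Minimal stdlib sketch of why string-ified UUIDs break UUID-typed columns.
import uuid

good = uuid.uuid4()      # a uuid.UUID object — has a .hex attribute drivers rely on
bad = str(uuid.uuid4())  # a plain str — no .hex attribute

assert hasattr(good, "hex")
assert not hasattr(bad, "hex")  # code expecting uuid.UUID fails on this value
```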
Important: MongoDB checkpointers have a 16MB document size limit, while PostgreSQL supports up to 1GB per field. If you're experiencing "DocumentTooLarge" errors with MongoDB, consider switching to PostgreSQL or reducing your checkpoint size.
The checkpointer is the component that saves them.
Checkpoints are never stored in memory: they are written to the database at the end of each superstep. Application state is not retained after a run; server state is limited to your graph state, which Python/JS cleans up when it goes out of scope, unless your agent/graph code introduces a memory leak. Checkpoints do not persist in pod memory after runs finish.
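The lifecycle above can be sketched with a toy loop that has no LangGraph dependency (`run_graph` and the list standing in for the database are illustrative only):

```python
# Toy sketch of when checkpoints are written: state lives in process memory
# only for the duration of the run, and a snapshot is persisted to the
# "database" (here, a list) at the end of each superstep.
database = []

def run_graph(supersteps, state):
    for step in supersteps:
        state = step(state)           # state is held in process memory...
        database.append(dict(state))  # ...and checkpointed after the superstep
    return state  # after the run, only the persisted checkpoints remain

final = run_graph([lambda s: {**s, "a": 1}, lambda s: {**s, "b": 2}], {})
```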
For langgraph dev (the "in-memory" checkpointer that you can run on your local computer):
Checkpoints and other application state are stored in your process memory and periodically committed to local disk in the `.langgraph_api/` directory.
2. API Memory vs. Database Storage
API Memory
Pod memory usage comes from code running inside the workflow.
Causes of high memory use include:
Loading large data objects (such as spreadsheets, images, or videos).
Many parallel tasks creating large in-memory objects (you can limit per-pod parallelism using `N_JOBS_PER_WORKER`).
Memory leaks in user code that block garbage collection (in-memory caches, etc.).
Configuration considerations for server deployments:
Avoid setting `WEB_CONCURRENCY`, as this grows memory linearly and prevents efficient sharing of PostgreSQL/Redis connections across forked processes.
To isolate API requests from graph invocation while staying within a single process, consider using `BG_JOB_ISOLATED_LOOPS` to run multiple event loops in threads.
For production deployments, split the API server and queue into separate components and scale them independently using Kubernetes.
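These knobs are plain environment variables on the server pod; a sketch of a pod configuration (values are illustrative, not recommendations):

```shell
# Illustrative environment settings for a LangGraph server pod; tune to your workload.
N_JOBS_PER_WORKER=5         # cap per-pod parallelism to bound in-memory objects
BG_JOB_ISOLATED_LOOPS=true  # isolate graph invocation from API request handling
# WEB_CONCURRENCY intentionally left unset (see above)
```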
Typical symptoms: OOM errors, canceled runs, pod restarts.
Technical Note: LangGraph uses Python's ThreadPoolExecutor for JSON serialization (via loop.run_in_executor()) to avoid CPU blocking. This can create up to 32 background threads that remain in a sleep state after runs complete. These threads are bounded and don't indicate a memory leak, but contribute to overall memory usage. Use tools like py-spy to inspect thread creation if needed.
Database Storage
Database usage grows when:
Many runs are executed.
Checkpoints include large application state.
No TTL is configured, allowing old checkpoints to accumulate.
Subgraphs are compiled with their own checkpointers, creating separate checkpoint namespaces and duplicating storage.
Large data objects (images, PDFs, videos, documents) are stored directly in state as base64 or binary data, causing checkpoint bloat and database memory errors.
Mitigations:
For large data objects in state: instead of storing large payloads directly in graph state, use one of these patterns:
External storage + reference: upload files to external storage (S3, etc.) and store only the reference key and metadata in state, fetching the data on demand.
LangGraph Store: for data that needs to be accessible across threads, store large objects in the LangGraph Store and retrieve them by ID rather than checkpointing them.
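The external-storage-and-reference pattern can be sketched without any cloud SDK; here a dict stands in for the blob store, and `put_blob`/`get_blob` are hypothetical helpers you would back with real upload/download calls:

```python
# Minimal sketch of the external-storage-and-reference pattern.
# EXTERNAL_STORE stands in for a real blob store (S3, GCS, etc.).
EXTERNAL_STORE = {}

def put_blob(key: str, data: bytes) -> str:
    EXTERNAL_STORE[key] = data
    return key

def get_blob(key: str) -> bytes:
    return EXTERNAL_STORE[key]

# Instead of placing raw bytes in graph state (which gets checkpointed),
# state carries only a small reference and some metadata.
large_pdf = b"%PDF-..." * 1000
state = {
    "document_ref": put_blob("docs/report.pdf", large_pdf),
    "document_size": len(large_pdf),
}

# A node that needs the data fetches it on demand by reference.
data = get_blob(state["document_ref"])
```

The checkpoint now records a short key and an integer rather than the payload itself, so checkpoint size stays constant no matter how large the file is.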
Implement a TTL.
Use the "exit" durability mode rather than the "async" (default) or "sync" modes. This writes a checkpoint at the end of each run but omits checkpoints for intermediate steps, reducing data duplication at the cost of durability if a pod restarts mid-run.
Note that `durability="exit"` corresponds to `checkpoint_during=False` in langgraph versions < 0.6.
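LangGraph aside, the difference between the modes can be sketched with a toy loop (`run`, `supersteps`, and the list standing in for the database are illustrative names, not LangGraph APIs):

```python
# Toy illustration of durability modes: "sync"/"async" checkpoint after every
# superstep; "exit" writes a single checkpoint at the end of the run.
def run(supersteps, state, durability="async"):
    db = []
    for step in supersteps:
        state = step(state)
        if durability in ("sync", "async"):
            db.append(dict(state))  # checkpoint each superstep
    if durability == "exit":
        db.append(dict(state))      # single checkpoint at run end
    return state, db

steps = [lambda s: {**s, "i": s.get("i", 0) + 1} for _ in range(3)]
_, db_async = run(steps, {}, durability="async")
_, db_exit = run(steps, {}, durability="exit")
```

With three supersteps, "async" persists three checkpoints while "exit" persists one, which is exactly the data-duplication saving described above.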
3. Role of TTL (Time-to-Live)
TTL defines how long checkpoints and threads are retained in the database.
Without TTL:
Database tables grow indefinitely (unless you manually delete resources).
Disk usage increases until new writes are blocked.
With TTL:
Old checkpoints are automatically deleted after expiration.
Database growth is controlled.
Note: TTL only applies to checkpoints created after it is enabled. Older checkpoints must be cleaned up separately.
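On LangGraph Platform, TTL is configured in `langgraph.json`; a sketch based on the platform's `checkpointer.ttl` config (field names and units should be verified against the LangGraph docs for your version):

```json
{
  "checkpointer": {
    "ttl": {
      "strategy": "delete",
      "sweep_interval_minutes": 60,
      "default_ttl": 43200
    }
  }
}
```

Here `default_ttl` is in minutes (43200 = 30 days), and the sweep job deletes expired checkpoints on each interval.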
4. How These Issues Interrelate
Large application state increases pod memory usage and typically produces larger checkpoints, which accelerates database growth.
Memory leaks continuously increase pod memory usage. Increasing pod memory only delays the problem; leaks must be fixed in code.
Lack of TTL causes database disk usage to grow indefinitely.
How to resolve common issues:
For OOM errors: optimize workflow code or increase pod memory.
For disk space issues: configure TTL, clean up checkpoints, or use the "exit" durability mode.
For large checkpoints: reduce the amount of state being saved.
Summary:
Pod memory issues usually point to workflow code, while database growth usually indicates missing TTL. Addressing both ensures stable and scalable deployments.