LangChain Support

How do I disable thinking tokens in LangSmith API calls?

LangSmith Deployment · 1 min read · Jan 30, 2026

Context

When using LangSmith Deployments, you may want to disable thinking tokens in your API calls. Thinking tokens add verbosity to chat conversations and increase both latency and network overhead, so turning them off can make chat-style interactions faster and cleaner.

Answer

You can disable thinking tokens by making them configurable in your graph and creating assistants with thinking disabled. Here's how:

1. Add a configurable field to your graph's context schema:

from dataclasses import dataclass

@dataclass
class ContextSchema:
    thinking_enabled: bool = True  # default to enabled

2. Use the configurable field in your graph:

In your graph, check this value when calling your model:

from langgraph.runtime import Runtime

def call_model(state, runtime: Runtime[ContextSchema]):
    thinking_enabled = runtime.context.thinking_enabled
    # Call the model as you normally would, but pass `thinking_enabled`
    # through to the model's thinking parameter so it can be toggled at runtime.
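To make the idea concrete, here is a minimal, self-contained sketch of the branching logic. The model call is stubbed out, and the shape of the `thinking` parameter (an Anthropic-style `{"type": "enabled", "budget_tokens": ...}` dict) is an assumption about your provider; substitute whatever your model expects.

```python
from dataclasses import dataclass

@dataclass
class ContextSchema:
    thinking_enabled: bool = True  # default to enabled

# Stub standing in for your real chat model invocation.
def invoke_model(messages, **kwargs):
    return {"messages": messages, "kwargs": kwargs}

def call_model(state, thinking_enabled: bool):
    # Only pass the provider's thinking parameter when thinking is enabled;
    # the parameter shape shown here is an assumption, not a fixed API.
    extra = (
        {"thinking": {"type": "enabled", "budget_tokens": 1024}}
        if thinking_enabled
        else {}
    )
    return invoke_model(state["messages"], **extra)

chat_result = call_model({"messages": ["hi"]}, thinking_enabled=False)
# With thinking disabled, no thinking parameter reaches the model.
```

In your actual graph, `thinking_enabled` would come from `runtime.context.thinking_enabled` rather than being passed directly.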

3. Create a saved assistant with thinking disabled:

Once your graph has this configurable field, you can create an assistant that sets thinking_enabled: false in its context. This gives you a "chat mode" assistant without thinking tokens. You can do this via the SDK or in the LangSmith UI when creating or editing an assistant.
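As a sketch of the SDK path, the snippet below builds the context payload for such an assistant. The graph ID, assistant name, and the commented `client.assistants.create(...)` call are illustrative assumptions; check your `langgraph_sdk` version for the exact method signature and whether it accepts a `context` argument.

```python
# Hypothetical payload for a "chat mode" assistant with thinking disabled.
assistant_payload = {
    "graph_id": "agent",        # assumed graph ID; use your deployment's graph
    "name": "chat-mode",        # illustrative name
    "context": {"thinking_enabled": False},
}

# With the LangGraph SDK this would look roughly like (not executed here):
#
#   from langgraph_sdk import get_client
#   client = get_client(url="<your-deployment-url>")
#   assistant = await client.assistants.create(
#       graph_id=assistant_payload["graph_id"],
#       name=assistant_payload["name"],
#       context=assistant_payload["context"],
#   )
```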
