Constantly increasing ClickHouse storage can feel like a never-ending battle. This guide outlines practical steps to reduce storage usage safely and effectively.
Step 1: Try these before deleting data
1. Enable auto-expanding PVCs. This prevents the instance from running out of disk and refusing requests.
2. Reduce TTL and long-term retention, especially very long horizons (with a 400-day TTL, for example, nothing is purged for well over a year). See the TTL and data retention guide.
Note: TTL is set at ingestion time. Once data is in ClickHouse, you cannot retroactively change its TTL.
3. Enable blob storage. This moves large payloads to object storage and reduces the amount of data stored in ClickHouse. Follow the blob storage guide.
You can adjust the minimum object size threshold to send more or fewer payloads to object storage (lower threshold = less data in ClickHouse). See the values.yaml reference.
Tune settings from steps 2 and 3 until the rate of data ingestion roughly matches the purge rate.
If you are unable to tweak these settings, or they do not work, move on to step 2.
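A quick way to gauge whether the purge rate is keeping up is to watch free disk space over time. A minimal check against ClickHouse's built-in `system.disks` table (run it periodically and compare):

```sql
-- Free vs. total space per ClickHouse disk
SELECT
    name,
    formatReadableSize(free_space)  AS free,
    formatReadableSize(total_space) AS total,
    round((1 - free_space / total_space) * 100, 1) AS pct_used
FROM system.disks;
```

If `pct_used` keeps climbing after tuning TTLs and blob storage, ingestion is still outpacing purges.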
Step 2: Identify data for deletion
To connect to ClickHouse: exec into the langsmith-clickhouse pod, run clickhouse-client, and execute the queries below.
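For example (the pod and namespace names are illustrative; match them to your deployment):

```shell
# Open an interactive clickhouse-client session inside the ClickHouse pod
kubectl exec -it langsmith-clickhouse-0 -n langsmith -- clickhouse-client
```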
Before deleting anything, check whether backups exist in your ClickHouse pod; backups may be holding onto stale data in your PVC. See the Backups and data retention guide.
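It can also help to see how rows' TTLs are distributed before deciding what to delete. A sketch that groups on the `ttl_seconds` column the queries below rely on (shown for the `runs` table; adjust the table name as needed):

```sql
-- Rough distribution of per-row TTLs; very large values explain slow purging
SELECT
    ttl_seconds,
    formatReadableQuantity(count()) AS num_rows
FROM runs
GROUP BY ttl_seconds
ORDER BY count() DESC;
```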
2A. Find the largest tables
```sql
SELECT
    database,
    table,
    formatReadableQuantity(sum(rows)) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS bytes
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
```

Example output:
| database | table | rows | bytes |
| --- | --- | ---: | ---: |
| default | runs_feedbacks_rmt_wide | 2.84 billion | 3.28 TiB |
| default | feedbacks_rmt_id | 2.83 billion | 3.26 TiB |
| default | feedbacks | 3.21 billion | 3.19 TiB |
| default | feedbacks_rmt | 2.83 billion | 3.19 TiB |
| default | runs | 9.07 billion | 1.60 TiB |
| default | runs_metadata_kv | 58.74 billion | 1.06 TiB |
| default | runs_lite | 9.01 billion | 627.30 GiB |
| default | runs_history | 9.81 billion | 601.15 GiB |
| default | runs_inputs_kv | 21.30 billion | 447.80 GiB |
| default | runs_run_id_v2 | 9.21 billion | 310.92 GiB |
| default | runs_token_counts | 12.15 billion | 297.96 GiB |

Focus on the largest tables first. Start by checking for data that should have been TTL'd but hasn't been.
2B. Find parts with TTL‑expired rows that were not cleared
```sql
WITH currentDatabase() AS db
SELECT
    s._part,
    formatReadableQuantity(s.ttl_rows) AS ttl_rows_in_part,
    formatReadableQuantity(p.rows) AS total_rows_in_part,
    formatReadableSize(p.bytes_on_disk) AS num_bytes_in_part
FROM
(
    SELECT
        r._part,
        countIf(
            assumeNotNull(r.trace_first_received_at)
            + toIntervalSecond(assumeNotNull(r.ttl_seconds)) < now()
        ) AS ttl_rows
    FROM <table_name> AS r
    WHERE r.trace_first_received_at IS NOT NULL
      AND r.ttl_seconds IS NOT NULL
      AND r.inserted_at IS NOT NULL
    GROUP BY r._part
    HAVING ttl_rows > 0
) AS s
ANY INNER JOIN
(
    SELECT name, rows, bytes_on_disk
    FROM system.parts
    WHERE active
      AND database = db
      AND table = '<table_name>'
) AS p
ON p.name = s._part
ORDER BY ttl_rows_in_part DESC;
```

Example output:
| _part | ttl_rows_in_part | total_rows_in_part | num_bytes_in_part |
| --- | ---: | ---: | ---: |
| all_6930415_15479924_44_19219337 | 761.21 million | 1.73 billion | 147.28 GiB |
| all_0_2_2_19219337 | 73.50 million | 522.90 million | 136.90 GiB |
| all_10_10_1_19219337 | 7.00 | 301.15 million | 101.76 GiB |
| all_15_15_1_19219337 | 68.48 thousand | 962.30 million | 81.30 GiB |
| all_17471327_19114880_800_19219337 | 63.73 thousand | 181.48 million | 27.23 GiB |
| all_16_4153700_206_19219337 | 604.55 million | 1.70 billion | 114.70 GiB |
| all_4153701_6930414_39_19219337 | 309.27 million | 665.85 million | 48.49 GiB |
| all_15479925_17471326_39_19219337 | 193.62 million | 407.76 million | 37.64 GiB |

If this query does not return much data (likely because TTLs are set very high), you can instead remove data older than X days.
2C. Find parts with rows older than X days
```sql
SELECT
    s._part,
    formatReadableQuantity(s.rows_before_cutoff) AS expired_rows_in_part,
    formatReadableQuantity(p.rows) AS total_rows_in_part,
    formatReadableSize(p.bytes_on_disk) AS num_bytes_in_part
FROM
(
    SELECT
        r._part,
        countIf(r.inserted_at < now() - INTERVAL <num_days> DAYS) AS rows_before_cutoff
    FROM <table_name> AS r
    WHERE r.trace_first_received_at IS NOT NULL
      AND r.ttl_seconds IS NOT NULL
      AND r.inserted_at IS NOT NULL
    GROUP BY r._part
) AS s
INNER JOIN
(
    SELECT name, rows, bytes_on_disk
    FROM system.parts
    WHERE active
      AND database = currentDatabase()
      AND table = '<table_name>'
) AS p
ON p.name = s._part
ORDER BY rows_before_cutoff DESC;
```

Example output (use this to target parts with the most deletable rows):
| _part | expired_rows_in_part | total_rows_in_part | num_bytes_in_part |
| --- | ---: | ---: | ---: |
| all_0_2_2_19219337 | 450.31 million | 522.90 million | 136.90 GiB |
| all_4_4_1_19219337 | 427.57 million | 427.57 million | 141.51 GiB |
| all_5_5_1_19219337 | 387.64 million | 387.64 million | 139.36 GiB |
| all_9_9_1_19219337 | 368.29 million | 368.29 million | 126.07 GiB |
| all_11_11_1_19219337 | 320.79 million | 320.79 million | 110.59 GiB |
| all_10_10_1_19219337 | 301.15 million | 301.15 million | 101.76 GiB |
| all_12_12_1_19219337 | 300.99 million | 300.99 million | 102.44 GiB |
| all_13_13_1_19219337 | 224.36 million | 224.36 million | 76.45 GiB |

Adjust `<num_days>` until you meet your retention constraints while still freeing a meaningful number of rows.
Step 3: Delete
WARNING: Before deleting, we highly recommend taking a backup of your data.
Pick a part name and run the following to trigger a DELETE mutation:
```sql
ALTER TABLE default.<table_name>
DELETE WHERE _part = '<your_part_name_here>'
    AND trace_first_received_at IS NOT NULL
    AND ttl_seconds IS NOT NULL
    AND inserted_at IS NOT NULL
    AND <WHERE_CLAUSE>;
```

Replace `<WHERE_CLAUSE>` with the selection criteria you chose in Step 2.
Examples:
- For traces older than X days: `inserted_at < now() - INTERVAL <num_days> DAYS`
- For expired traces: `assumeNotNull(trace_first_received_at) + toIntervalSecond(assumeNotNull(ttl_seconds)) < now()`
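Putting it together, an age-based deletion for a single part might look like this (the table, part name, and 365-day cutoff are placeholders; substitute your own values from Step 2C):

```sql
ALTER TABLE default.runs
DELETE WHERE _part = 'all_0_2_2_19219337'
    AND trace_first_received_at IS NOT NULL
    AND ttl_seconds IS NOT NULL
    AND inserted_at IS NOT NULL
    AND inserted_at < now() - INTERVAL 365 DAYS;
```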
Mutations run asynchronously; the client returns immediately. Only run ONE mutation at a time.
Step 4: Track the mutation
Check mutation status:
```sql
SELECT * FROM system.mutations WHERE is_done = 0 AND is_killed = 0;
```

You should see one row (sometimes two). When the mutation completes, `is_done` becomes 1 and the query returns nothing.
WARNING: Wait for each mutation to complete before starting another, otherwise query performance may be impacted.
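To estimate progress or spot failures, `system.mutations` also exposes how many parts remain and the most recent error (these are standard ClickHouse columns; no LangSmith-specific schema is assumed):

```sql
-- parts_to_do shrinks toward 0 as the mutation progresses;
-- a non-empty latest_fail_reason means the mutation is stuck
SELECT mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;
```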
To stop an active mutation if it is causing CPU/memory pressure on queries:
```sql
KILL MUTATION WHERE mutation_id = '<mutation_id_from_select_query>';
```

Step 5: Repeat
After a mutation, parts may be renamed. Re-run Step 2 to fetch updated part names, identify a part to delete, execute the deletion, and track the mutation. Repeat until disk usage reaches your target.
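Note that disk space is reclaimed only after ClickHouse drops the old, now-inactive parts in the background, so usage may lag the mutation by several minutes. A sketch to see how much space is still held by inactive parts:

```sql
-- Space held by inactive parts awaiting background cleanup
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS pending_cleanup
FROM system.parts
WHERE NOT active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
```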
You can repeat this process with the following tables, using `trace_first_received_at` instead of `inserted_at`:
- `runs_metadata_kv`
- `runs_inputs_kv`
- `runs_outputs_kv`
- `runs_reference_example_id`
- `runs_run_id_v2`
- `runs_tags`
- `runs_token_counts`
- `runs_trace_id`
- `runs_lite`
- `root_runs_lite`
- `runs_feedbacks_rmt_wide`
- `runs_token_counts_sources_wide`
- `runs_token_counts_rolled_up_wide`

References
https://docs.langchain.com/langsmith/langsmith-managed-clickhouse