Constantly increasing ClickHouse storage can feel like a never-ending battle. This guide outlines practical steps to reduce storage usage safely and effectively.
Step 1: Try these before deleting data
1. Enable auto-expanding PVCs. This prevents the instance from running out of disk and refusing requests.
2. Reduce TTL and long-term retention, especially very long horizons (with a 400-day TTL, for example, nothing is purged for well over a year). See the TTL and data retention guide.
Note: TTL is set at ingestion time. Once data is in ClickHouse, you cannot retroactively change its TTL.
3. Enable blob storage. This moves large payloads to object storage and reduces the amount of data stored in ClickHouse. Follow the blob storage guide.
You can adjust the minimum object size threshold to send more or fewer payloads to object storage (lower threshold = less data in ClickHouse). See the values.yaml reference.
Tune settings from steps 2 and 3 until the rate of data ingestion roughly matches the purge rate.
If you are unable to tweak these settings, or they do not work, move on to step 2.
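A quick way to gauge whether the purge rate is keeping up is to watch free disk space over time. A minimal check against ClickHouse's built-in `system.disks` table (run it periodically and compare):

```sql
-- Free vs. total space per ClickHouse disk
SELECT
    name,
    formatReadableSize(free_space)  AS free,
    formatReadableSize(total_space) AS total,
    round((1 - free_space / total_space) * 100, 1) AS pct_used
FROM system.disks;
```

If `pct_used` keeps climbing after tuning TTLs and blob storage, ingestion is still outpacing purges.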
Step 2: Identify data for deletion
To connect to ClickHouse: exec into the langsmith-clickhouse pod, run clickhouse-client, and execute the queries below.
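For example (the pod and namespace names are illustrative; match them to your deployment):

```shell
# Open an interactive clickhouse-client session inside the ClickHouse pod
kubectl exec -it langsmith-clickhouse-0 -n langsmith -- clickhouse-client
```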
Before deleting anything, check whether backups exist in your ClickHouse pod; backups may be holding onto stale data in your PVC. See the Backups and data retention guide.
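It can also help to see how rows' TTLs are distributed before deciding what to delete. A sketch that groups on the `ttl_seconds` column the queries below rely on (shown for the `runs` table; adjust the table name as needed):

```sql
-- Rough distribution of per-row TTLs; very large values explain slow purging
SELECT
    ttl_seconds,
    formatReadableQuantity(count()) AS num_rows
FROM runs
GROUP BY ttl_seconds
ORDER BY count() DESC;
```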
2A. Find the largest tables
```sql
SELECT
    database,
    table,
    formatReadableQuantity(sum(rows)) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS bytes
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
```

Example output:
| database | table | rows | bytes |
| --- | --- | ---: | ---: |
| default | runs_feedbacks_rmt_wide | 2.84 billion | 3.28 TiB |
| default | feedbacks_rmt_id | 2.83 billion | 3.26 TiB |
| default | feedbacks | 3.21 billion | 3.19 TiB |
| default | feedbacks_rmt | 2.83 billion | 3.19 TiB |
| default | runs | 9.07 billion | 1.60 TiB |
| default | runs_metadata_kv | 58.74 billion | 1.06 TiB |
| default | runs_lite | 9.01 billion | 627.30 GiB |
| default | runs_history | 9.81 billion | 601.15 GiB |
| default | runs_inputs_kv | 21.30 billion | 447.80 GiB |
| default | runs_run_id_v2 | 9.21 billion | 310.92 GiB |
| default | runs_token_counts | 12.15 billion | 297.96 GiB |

Focus on the largest tables first. Start by checking for data that should have been TTL'd but hasn't been.
2B. Find parts with TTL‑expired rows that were not cleared
```sql
WITH currentDatabase() AS db
SELECT
    s._part,
    formatReadableQuantity(s.ttl_rows) AS ttl_rows_in_part,
    formatReadableQuantity(p.rows) AS total_rows_in_part,
    formatReadableSize(p.bytes_on_disk) AS num_bytes_in_part
FROM
(
    SELECT
        r._part,
        countIf(
            assumeNotNull(r.trace_first_received_at)
            + toIntervalSecond(assumeNotNull(r.ttl_seconds)) < now()
        ) AS ttl_rows
    FROM <table_name> AS r
    WHERE r.trace_first_received_at IS NOT NULL
      AND r.ttl_seconds IS NOT NULL
      AND r.inserted_at IS NOT NULL
    GROUP BY r._part
    HAVING ttl_rows > 0
) AS s
ANY INNER JOIN
(
    SELECT name, rows, bytes_on_disk
    FROM system.parts
    WHERE active
      AND database = db
      AND table = '<table_name>'
) AS p
ON p.name = s._part
ORDER BY ttl_rows_in_part DESC;
```

Example output:
| _part | ttl_rows_in_part | total_rows_in_part | num_bytes_in_part |
| --- | ---: | ---: | ---: |
| all_6930415_15479924_44_19219337 | 761.21 million | 1.73 billion | 147.28 GiB |
| all_0_2_2_19219337 | 73.50 million | 522.90 million | 136.90 GiB |
| all_10_10_1_19219337 | 7.00 | 301.15 million | 101.76 GiB |
| all_15_15_1_19219337 | 68.48 thousand | 962.30 million | 81.30 GiB |
| all_17471327_19114880_800_19219337 | 63.73 thousand | 181.48 million | 27.23 GiB |
| all_16_4153700_206_19219337 | 604.55 million | 1.70 billion | 114.70 GiB |
| all_4153701_6930414_39_19219337 | 309.27 million | 665.85 million | 48.49 GiB |
| all_15479925_17471326_39_19219337 | 193.62 million | 407.76 million | 37.64 GiB |

If this query does not return much data (likely because TTLs are set very high), you can instead remove data older than X days.
2C. Find parts with rows older than X days
```sql
SELECT
    s._part,
    formatReadableQuantity(s.rows_before_cutoff) AS expired_rows_in_part,
    formatReadableQuantity(p.rows) AS total_rows_in_part,
    formatReadableSize(p.bytes_on_disk) AS num_bytes_in_part
FROM
(
    SELECT
        r._part,
        countIf(r.inserted_at < now() - INTERVAL <num_days> DAYS) AS rows_before_cutoff
    FROM <table_name> AS r
    WHERE r.trace_first_received_at IS NOT NULL
      AND r.ttl_seconds IS NOT NULL
      AND r.inserted_at IS NOT NULL
    GROUP BY r._part
) AS s
INNER JOIN
(
    SELECT name, rows, bytes_on_disk
    FROM system.parts
    WHERE active
      AND database = currentDatabase()
      AND table = '<table_name>'
) AS p
ON p.name = s._part
ORDER BY rows_before_cutoff DESC;
```

Example output (use this to target parts with the most deletable rows):
| _part | expired_rows_in_part | total_rows_in_part | num_bytes_in_part |
| --- | ---: | ---: | ---: |
| all_0_2_2_19219337 | 450.31 million | 522.90 million | 136.90 GiB |
| all_4_4_1_19219337 | 427.57 million | 427.57 million | 141.51 GiB |
| all_5_5_1_19219337 | 387.64 million | 387.64 million | 139.36 GiB |
| all_9_9_1_19219337 | 368.29 million | 368.29 million | 126.07 GiB |
| all_11_11_1_19219337 | 320.79 million | 320.79 million | 110.59 GiB |
| all_10_10_1_19219337 | 301.15 million | 301.15 million | 101.76 GiB |
| all_12_12_1_19219337 | 300.99 million | 300.99 million | 102.44 GiB |
| all_13_13_1_19219337 | 224.36 million | 224.36 million | 76.45 GiB |

Adjust `<num_days>` until you meet your retention constraints while still freeing a meaningful number of rows.
Step 3: Delete
WARNING: Before deleting, we highly recommend taking a backup of your data.
Pick a part name and run the following to trigger a DELETE mutation:
```sql
ALTER TABLE default.<table_name>
DELETE WHERE _part = '<your_part_name_here>'
    AND trace_first_received_at IS NOT NULL
    AND ttl_seconds IS NOT NULL
    AND inserted_at IS NOT NULL
    AND <WHERE_CLAUSE>;
```

Replace `<WHERE_CLAUSE>` with the selection criteria you chose in Step 2.
Examples:
- For traces older than X days: `inserted_at < now() - INTERVAL <num_days> DAYS`
- For expired traces: `assumeNotNull(trace_first_received_at) + toIntervalSecond(assumeNotNull(ttl_seconds)) < now()`
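Putting it together, an age-based deletion for a single part might look like this (the table, part name, and 365-day cutoff are placeholders; substitute your own values from Step 2C):

```sql
ALTER TABLE default.runs
DELETE WHERE _part = 'all_0_2_2_19219337'
    AND trace_first_received_at IS NOT NULL
    AND ttl_seconds IS NOT NULL
    AND inserted_at IS NOT NULL
    AND inserted_at < now() - INTERVAL 365 DAYS;
```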
Mutations run asynchronously; the client returns immediately. Only run ONE mutation at a time.
Step 4: Track the mutation
Check mutation status:
```sql
SELECT * FROM system.mutations WHERE is_done = 0 AND is_killed = 0;
```

You should see one row (sometimes two). When the mutation completes, `is_done` becomes 1 and the query returns nothing.
WARNING: Wait for each mutation to complete before starting another, otherwise query performance may be impacted.
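To estimate progress or spot failures, `system.mutations` also exposes how many parts remain and the most recent error (these are standard ClickHouse columns; no LangSmith-specific schema is assumed):

```sql
-- parts_to_do shrinks toward 0 as the mutation progresses;
-- a non-empty latest_fail_reason means the mutation is stuck
SELECT mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;
```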
To stop an active mutation if it is causing CPU/memory pressure on queries:
```sql
KILL MUTATION WHERE mutation_id = '<mutation_id_from_select_query>';
```

Step 5: Repeat
After a mutation, parts may be renamed. Re-run Step 2 to fetch updated part names, identify a part to delete, execute the deletion, and track the mutation. Repeat until disk usage reaches your target.
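Note that disk space is reclaimed only after ClickHouse drops the old, now-inactive parts in the background, so usage may lag the mutation by several minutes. A sketch to see how much space is still held by inactive parts:

```sql
-- Space held by inactive parts awaiting background cleanup
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS pending_cleanup
FROM system.parts
WHERE NOT active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
```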
You can repeat this process with the following tables, using `trace_first_received_at` instead of `inserted_at`:
- `runs_metadata_kv`
- `runs_inputs_kv`
- `runs_outputs_kv`
- `runs_reference_example_id`
- `runs_run_id_v2`
- `runs_tags`
- `runs_token_counts`
- `runs_trace_id`
- `runs_lite`
- `root_runs_lite`
- `runs_feedbacks_rmt_wide`
- `runs_token_counts_sources_wide`
- `runs_token_counts_rolled_up_wide`

References
https://docs.langchain.com/langsmith/langsmith-managed-clickhouse