
How We Fixed OpenClaw's Compaction Death Loop (And Why Your SQLite DB Is Probably Bloated Too)
TL;DR: Our AI chief of staff Oscar got stuck in a non-stop "Compacting context..." loop, barely responsive, /reset doing nothing. Root causes: a 184MB embedding cache with no TTL, over-aggressive context pruning settings, 49 orphaned session files, and SQLite running in DELETE journal mode. Eleven steps later, it's fixed, automated, and never happening again. The critical gotcha: OpenClaw timestamps are milliseconds, not seconds. Every query needs a * 1000 correction or you'll delete nothing and wonder why.
The System Went Silent
It started the way these things always do: something felt off.
Oscar, our AI chief of staff running on OpenClaw, was sluggish. Then slower. Then the "Compacting context..." message started appearing. Not occasionally. Constantly. Every few seconds, the system was firing off a compaction cycle. Commands took forever. /reset did nothing. The agent that's supposed to run our content pipeline and manage our operations was essentially locked up, caught in a loop it couldn't break out of.
This wasn't a network issue or a model API problem. This was internal. Something in OpenClaw's memory management had gone sideways, and we needed to figure out what before Oscar became completely useless.
Here's what we found, and exactly how we fixed it.
What We Found
Pulling back the covers on OpenClaw's SQLite database revealed four compounding problems, any one of which would have caused slowdowns. Together, they were a disaster.
1. Embedding Cache Bloat - 184MB and Growing
The embedding_cache table had grown to roughly 184MB with zero expiration logic. No TTL. No pruning. Every embedding ever generated just sat there. The cache had become a hoard, not a cache.
2. Context Pruning Set Too Aggressively
The contextPruning config had a TTL of 30 minutes and keepLastAssistants set to 3. That combination means the system was wiping context extremely fast, then immediately needing to compact what was left. Prune, compact, prune, compact. The loop was baked into the settings.
3. Forty-Nine Orphaned Session Files
Scattered across the sessions directory were 49 .jsonl transcript files with no corresponding active sessions. They were still getting indexed on every search, adding noise and overhead to every context lookup.
4. SQLite in DELETE Journal Mode
The database was running in DELETE journal mode instead of WAL (Write-Ahead Logging). For a database getting hit this frequently, that's a performance anchor. WAL handles concurrent reads and writes far better and is the right choice for any live agent workload.
The Fix, Step by Step
We worked through this methodically. Don't skip steps, especially the backup.
Step 1: Safe Backup First
Do not use cp on a live SQLite database. You will get a corrupted copy. Use the built-in backup command instead:
mkdir -p ~/.openclaw/backups
sqlite3 ~/.openclaw/memory/main.sqlite ".backup '$HOME/.openclaw/backups/main.sqlite.backup-$(date +%Y%m%d)'"
This is hot-backup safe. It works while the database is live.
Step 2: Enable WAL Mode
sqlite3 ~/.openclaw/memory/main.sqlite "PRAGMA journal_mode=WAL;"
Do this before anything else. WAL mode improves concurrent access and reduces the write contention that was amplifying all of our other problems.
Step 3: Confirm No Active Writes
Check that nothing is actively writing before you start modifying data. If OpenClaw is running a session, let it idle first, then checkpoint the WAL so any pending writes are flushed into the main database file:
sqlite3 ~/.openclaw/memory/main.sqlite "PRAGMA wal_checkpoint;"
Step 4: Archive Orphan Transcripts
mkdir -p ~/.openclaw/archive/sessions/
# Move .jsonl files not referenced in active sessions to archive
Don't delete them - just move them out of the active index path. If you need to recover a session later, they're still there.
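OpenClaw doesn't ship a command for this step, so here's a minimal sketch of the move. It uses file age as the orphan heuristic, on the assumption that active sessions keep their transcripts recently modified; the 14-day window and the paths are ours, so adjust them to your install.

```shell
#!/bin/bash
# Sketch: move stale .jsonl transcripts out of the indexed sessions path.
# Assumption: any transcript untouched for 14+ days has no active session.
archive_old_transcripts() {
  local sessions="$1" archive="$2"
  mkdir -p "$archive"
  # Move, don't delete: archived transcripts stay recoverable
  find "$sessions" -maxdepth 1 -name '*.jsonl' -mtime +14 \
    -exec mv {} "$archive/" \;
}

# Usage:
# archive_old_transcripts "$HOME/.openclaw/agents/main/sessions" \
#                         "$HOME/.openclaw/archive/sessions"
```

If your install tracks active sessions in the database instead, match against that list rather than trusting mtimes.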
Step 5: The Critical Gotcha - Check Your Timestamp Format
Before writing any time-based DELETE query, run this check:
sqlite3 ~/.openclaw/memory/main.sqlite "SELECT updated_at FROM chunks LIMIT 1;"
If the value is 13 digits long, your timestamps are in milliseconds, not seconds. This matters enormously. The standard SQLite unixepoch() function returns seconds. If you use it directly in a WHERE clause against millisecond timestamps, you will never match any rows. Your "cleanup" query will silently delete nothing.
The tell: datetime(updated_at, 'unixepoch') returns NULL (shown as an empty string in the sqlite3 CLI), because a millisecond value read as seconds lands tens of thousands of years out of range. That's your confirmation you're dealing with milliseconds.
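You can reproduce the silent failure in miniature with Python's built-in sqlite3 module. The one-column schema is hypothetical, and the cutoffs mirror the (unixepoch() - 1209600) arithmetic from the queries in the next step, computed in Python for portability:

```python
import sqlite3
import time

# In-memory DB with a hypothetical one-column version of the chunks table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (updated_at INTEGER)")

now = int(time.time())
# One row "updated" 20 days ago, stored in milliseconds like OpenClaw does
db.execute("INSERT INTO chunks VALUES (?)", ((now - 20 * 86400) * 1000,))

cutoff_s = now - 1209600  # 14 days ago, in seconds

# Uncorrected cutoff (seconds) matches nothing: every millisecond
# timestamp is ~1000x larger than it, so nothing is "older" than it.
wrong = db.execute("SELECT COUNT(*) FROM chunks WHERE updated_at < ?",
                   (cutoff_s,)).fetchone()[0]
# Millisecond-corrected cutoff matches the 20-day-old row.
right = db.execute("SELECT COUNT(*) FROM chunks WHERE updated_at < ?",
                   (cutoff_s * 1000,)).fetchone()[0]
print(wrong, right)  # 0 1
```

The uncorrected DELETE would have reported success, deleted zero rows, and left the bloat untouched.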
Step 6: Trim Old Session Chunks
Delete session chunks older than 14 days, with the millisecond correction applied:
DELETE FROM chunks
WHERE source='sessions'
AND updated_at < ((unixepoch()-1209600)*1000);
1209600 is 14 days in seconds. Multiply by 1000 to match the millisecond timestamps.
Step 7: Trim the Embedding Cache
Same approach, 30-day window:
DELETE FROM embedding_cache
WHERE updated_at < ((unixepoch()-2592000)*1000);
2592000 is 30 days in seconds.
Step 8: VACUUM the Database
Before running VACUUM, check your available disk space. VACUUM rebuilds the database into a temporary copy, so SQLite can need free space of up to roughly twice the current database size to complete the operation.
df -h ~ # check free space first
sqlite3 ~/.openclaw/memory/main.sqlite "VACUUM;"
This reclaims all the space from deleted rows and defragments the database pages.
Step 9: Integrity Check
Always verify after major surgery:
sqlite3 ~/.openclaw/memory/main.sqlite "PRAGMA integrity_check;"
It must return ok. Anything else - restore from your backup immediately.
Step 10: Fix the Context Pruning Settings
Use gateway config.patch to update (never edit openclaw.json directly):
{
  "agents": {
    "defaults": {
      "contextPruning": {
        "ttl": "2h",
        "keepLastAssistants": 8
      }
    }
  }
}
The old settings (TTL 30m, keepLastAssistants 3) were burning down context so fast the system couldn't keep up. Two hours gives the agent enough working memory to operate across real tasks. Eight recent assistant messages keeps the conversation coherent.
Note: the correct config path is agents.defaults.contextPruning - not sessions.main.contextPruning. Using the wrong path gives a cryptic invalid config error.
Step 11: Build a Weekly Maintenance Script
One-time cleanup is not enough. We built a script that runs all of the above automatically:
#!/bin/bash
DB="$HOME/.openclaw/memory/main.sqlite"
DATE=$(date +%Y%m%d-%H%M%S)
# Backup (make sure the target directory exists)
mkdir -p "$HOME/.openclaw/backups"
sqlite3 "$DB" ".backup '$HOME/.openclaw/backups/main.sqlite.backup-$DATE'"
# Trim session chunks (14 days)
sqlite3 "$DB" "DELETE FROM chunks WHERE source='sessions' AND updated_at < ((unixepoch()-1209600)*1000);"
# Trim embedding cache (30 days)
sqlite3 "$DB" "DELETE FROM embedding_cache WHERE updated_at < ((unixepoch()-2592000)*1000);"
# VACUUM
sqlite3 "$DB" "VACUUM;"
# Integrity check
RESULT=$(sqlite3 "$DB" "PRAGMA integrity_check;")
echo "Integrity: $RESULT"
Cron schedule - every Sunday at 3AM:
0 3 * * 0 /path/to/db-maintenance.sh >> ~/.openclaw/logs/maintenance.log 2>&1
Signs You Need Emergency Maintenance
Watch for these warning signs:
- "Compacting context..." firing every few seconds
- /reset doesn't break the loop
- System response times measured in minutes, not seconds
- main.sqlite larger than 300MB
- A large number of .jsonl files in ~/.openclaw/agents/main/sessions/
If you see these, run the maintenance script manually before trying anything else.
Key Lessons
SQLite databases don't self-clean. If you're running an AI agent on OpenClaw or any similar system, your memory database is growing right now. Without a maintenance routine, you'll hit this problem eventually.
Milliseconds will get you. OpenClaw stores updated_at as millisecond epoch timestamps. Every time-based query needs * 1000 on the unixepoch side. The silent failure mode - where your DELETE matches zero rows - is hard to catch without knowing what to look for.
Context pruning settings are a dial, not a default. The aggressive defaults made sense for conserving tokens, but they created a compaction feedback loop. Tune your settings to match your actual workload. For a full-time agent handling real operations, 2 hours and 8 recent turns is a minimum, not a luxury.
WAL mode should be standard. If you're running SQLite under any kind of concurrent or frequent-access workload, turn on WAL mode. It's a one-line change with a significant performance impact.
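Easy to check for yourself. Here's a quick sketch with Python's built-in sqlite3 module against a throwaway database file (WAL doesn't apply to in-memory databases):

```python
import os
import sqlite3
import tempfile

# A fresh database file starts in SQLite's default rollback-journal mode
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
db = sqlite3.connect(path)

before = db.execute("PRAGMA journal_mode").fetchone()[0]       # "delete"
after = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]    # "wal"
print(before, after)
```

Unlike most pragmas, journal_mode=WAL is persistent: it's recorded in the database file itself, so the one-line change survives restarts and new connections.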
Always use sqlite3 .backup, not cp. This is non-negotiable for live databases.
Oscar is back to full speed. The compaction loop is gone. And every Sunday at 3AM, the system cleans itself while everyone's asleep.
If you're running OpenClaw and you haven't checked your database size lately, go check it now.
Update: What We Added the Day After
After publishing this post, we kept digging. Here's what we learned and what we added on top of the original fix.
Daily Size Monitor + Warning System
The weekly maintenance script tells you when things are bad. But "once a week" means you could spend 6 days with a bloated database before anything flags it. So we added a second layer — a daily size monitor that runs every morning at 9 AM and posts to our System Health channel:
#!/bin/bash
DB="$HOME/.openclaw/memory/main.sqlite"
SIZE_MB=$(du -m "$DB" | cut -f1)
if [ "$SIZE_MB" -ge 300 ]; then
echo "CRITICAL: DB at ${SIZE_MB}MB - run db-maintenance.sh immediately"
elif [ "$SIZE_MB" -ge 250 ]; then
echo "WARNING: DB at ${SIZE_MB}MB - maintenance recommended"
else
echo "OK: DB at ${SIZE_MB}MB"
fi
Three thresholds:
- ✅ Under 250MB — green, nothing to do
- ⚠️ 250-300MB — warning, schedule maintenance
- 🔴 300MB+ — critical, run the script now
Simple, cheap to run, and catches drift before it becomes a crisis.
We Consulted Gemini and ChatGPT on the Row Cap Question
After the fix, the embedding cache was sitting at 9,538 rows and 187MB. We considered adding an 8,000-row hard cap to trim it down. Before doing anything, we asked both Gemini and ChatGPT for their take.
Both said the same thing: wait, don't trim.
The reasoning was sound:
- 9,538 rows is not large in vector database terms — production systems routinely handle 50k-100k+
- Trimming 1,500 rows saves almost nothing and risks burning OpenAI API credits to rebuild embeddings you just deleted
- The 30-day TTL will fire naturally within days and handle it cleanly
- An 8,000-row cap is arbitrary — ChatGPT specifically called it out and suggested 20-50k is more realistic for semantic recall quality
The more interesting finding: the storage format itself is the problem.
The Real Culprit: JSON Storage
When we checked how embeddings are actually stored in the database, we found this:
typeof(embedding) → text
length(embedding) → 19,198 characters
The embeddings are stored as JSON text arrays, not binary BLOBs. That means each embedding takes up ~19KB of text instead of the ~6KB it would occupy as packed float32 binary. We're storing at roughly 3x the necessary size.
The math: 9,538 rows × 19KB = ~187MB. If embeddings were stored as BLOBs, the same data would be ~60MB.
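A back-of-envelope check of that 3x figure, assuming a 1536-dimension embedding (the dimension is our assumption, not something OpenClaw documents):

```python
import json
import random
import struct

random.seed(0)
dim = 1536  # assumed embedding dimension
vec = [round(random.uniform(-1, 1), 8) for _ in range(dim)]

as_json = json.dumps(vec)               # text storage, like OpenClaw's
as_blob = struct.pack(f"{dim}f", *vec)  # packed float32 alternative

# The JSON text runs ~3x the size of the 6,144-byte float32 blob
print(len(as_json), len(as_blob))
```

The exact ratio depends on how many decimal places the serializer keeps, but for any realistic precision the text form is several times the binary size.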
This is an OpenClaw core issue — not something you can fix in the database without breaking the application. But it's worth knowing: if your embedding cache looks bloated, the storage format is likely why, not the row count.
Current Status
As of March 1st:
- DB size: 228MB — in the green zone
- TTL trim fires Sunday March 8, will clear all February data
- VACUUM runs after trim to reclaim disk
- Daily monitor watching for drift above 250MB
The system is stable. The compaction loop is gone. The warning system is live. And now we know exactly why the cache is as large as it is — even if fixing it properly requires an upstream change.
We'll update this post again after the March 8 trim runs to show the actual size reduction.
