The corpus
A local SQLite database of your operational life and your conversations — git, plans, sessions, threads, messages.
The corpus is the operational pillar of aiperson — a local-first SQLite database at ~/.dotperson/corpus.db that indexes both what you did (events: commits, plans, tasks, sessions) and what you said (threads + messages: every conversation captured by an observer, harvester, or the browser extension).
Schema (V2)
| Table | Holds |
|---|---|
| entities | Named things: people, repos, projects, conversations. kind + name + attrs. |
| relations | Typed triples between entities, with temporal validity. |
| events | Append-only stream of things that happened, with provenance and content hashes (deduped per minute bucket). |
| threads | One conversation on one surface: Claude Code session, claude.ai chat, ChatGPT conversation, Cursor composer, voice memo. |
| messages | One turn within a thread. role + content + occurred_at + tool_calls. Deduped on (thread_id, content_hash). |
| embeddings | Vector store keyed by text_hash so two messages with identical content share the same row. |
| events_vec_768 / events_vec_384 | sqlite-vec indices for events, routed by embedder dim (Vertex 768 / fastembed 384). |
| messages_vec_768 / messages_vec_384 | sqlite-vec indices for messages. |
| events_fts / messages_fts | SQLite FTS5 over event payloads and message contents respectively. |
| provenance | captured_at + signed_by + signature per event. |
| entity_aliases | Alias index: "TK" and "Humphrey" resolve to the same canonical entity. |
Run personkit corpus health to see counts per table.
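The (thread_id, content_hash) dedupe on messages can be pictured as a uniqueness constraint plus an ignore-on-conflict insert. The sketch below is an illustration only: the column list is trimmed, the table runs in an in-memory database, hashing the content alone with SHA-256 is an assumption, and insert_message is a hypothetical helper, not part of personkit.

```python
import hashlib
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory table that mimics the messages dedupe rule."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE messages (
            id           INTEGER PRIMARY KEY,
            thread_id    TEXT NOT NULL,
            role         TEXT NOT NULL,
            content      TEXT NOT NULL,
            occurred_at  TEXT NOT NULL,
            content_hash TEXT NOT NULL,
            UNIQUE (thread_id, content_hash)  -- the dedupe key from the table above
        )
        """
    )
    return conn


def insert_message(conn, thread_id, role, content, occurred_at) -> bool:
    """Insert one turn; return False when the same turn was already stored."""
    # Assumption: the hash covers the content only. The real input may differ.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO messages "
        "(thread_id, role, content, occurred_at, content_hash) "
        "VALUES (?, ?, ?, ?, ?)",
        (thread_id, role, content, occurred_at, digest),
    )
    return cur.rowcount == 1
```

Because the key includes thread_id, identical text in two different threads is stored twice, while a re-captured turn in the same thread is dropped.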
Hybrid retrieval
A single corpus_query call ranks events from four signal sources and fuses the rankings with reciprocal rank fusion (RRF, k=60):
- Vec: dense vector similarity against the chosen embedder’s table. Routes by dim automatically.
- FTS: SQLite FTS5 BM25 over the JSON payload.
- Entity: restricts candidates to events whose payload references any of the supplied entity ids.
- Temporal: recency ranking within the supplied time window.
Every hit reports which sources contributed via matched_via.
A parallel suite for messages (FTS5 over messages.content, vec lookup against messages_vec_*) is wired into the V2 schema; the daily brief and personkit today use these directly.
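The fusion step can be sketched in a few lines. This is a minimal illustration of standard reciprocal rank fusion with k=60, assuming each source hands back an ordered list of event ids; rrf_fuse and the dictionary shape of a hit are hypothetical, and the entity source is treated here as one more ranked list for simplicity.

```python
from collections import defaultdict


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[dict]:
    """Fuse per-source rankings; each hit records the sources that returned it."""
    scores: dict[str, float] = defaultdict(float)
    matched_via: dict[str, list[str]] = defaultdict(list)

    for source, ranked_ids in rankings.items():
        for rank, event_id in enumerate(ranked_ids, start=1):
            # Standard RRF contribution: 1 / (k + rank), with rank starting at 1.
            scores[event_id] += 1.0 / (k + rank)
            matched_via[event_id].append(source)

    # Highest fused score first; ties broken by id to keep the order stable.
    ordered = sorted(scores, key=lambda eid: (-scores[eid], eid))
    return [
        {"id": eid, "score": scores[eid], "matched_via": matched_via[eid]}
        for eid in ordered
    ]
```

An event that appears in several lists accumulates one term per list, which is why agreement between sources tends to outrank a single strong match.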
Embedders
| Embedder | Model | Dim | When it activates |
|---|---|---|---|
| vertex | text-embedding-005 | 768 | Default when GCP_PROJECT_ID is set and a token source is resolvable (Workload Identity, ADC, or GOOGLE_ACCESS_TOKEN). |
| fastembed | BGE-small-en-v1.5 | 384 | Behind the fastembed Cargo feature; fully offline, no network. |
| none | n/a | n/a | Falls back to FTS + entity + temporal only. |
Backfill is on-demand: personkit corpus embed [--max 2000]. The daemon’s cycle drains pending events + messages automatically.
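The activation rules and the dim-based table routing can be summarised as two small functions. This is a sketch under assumptions: pick_embedder and vec_table are hypothetical names, the two boolean arguments stand in for checks the real code performs itself, and the precedence of vertex over fastembed is inferred from the word "Default" in the table.

```python
from typing import Mapping, Optional

# Dimensions as listed in the embedder table.
EMBEDDER_DIMS = {"vertex": 768, "fastembed": 384}


def pick_embedder(env: Mapping[str, str],
                  token_source_resolvable: bool,
                  fastembed_built: bool) -> str:
    """Choose an embedder name from the conditions in the table (assumed order)."""
    if env.get("GCP_PROJECT_ID") and token_source_resolvable:
        return "vertex"
    if fastembed_built:
        return "fastembed"
    return "none"


def vec_table(kind: str, embedder: str) -> Optional[str]:
    """Map ("events" | "messages", embedder) to its sqlite-vec table name."""
    if kind not in ("events", "messages"):
        raise ValueError(f"unknown kind: {kind!r}")
    dim = EMBEDDER_DIMS.get(embedder)
    if dim is None:
        # "none": no vector table; retrieval uses FTS + entity + temporal only.
        return None
    return f"{kind}_vec_{dim}"
```

Keeping one table per dimension means vectors from different embedders never share an index, so switching embedders does not require rewriting existing rows.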
Harvesters
Harvesters scrape structured local state on a cadence and write directly into the corpus:
| Harvester | Source | Default cadence |
|---|---|---|
| git | Configured git repos via libgit2 | 15 min |
| claude_plans | ~/.claude/plans/*.md, ~/TASKS.md, project memory | 15 min |
| claude_sessions | ~/.claude/projects/*/sessions/*.jsonl (full transcripts into messages since v0.3) | 15 min |
| config_state | AI-tool config files, redacted | 60 min |
| mcp_bridge | External MCP servers per ~/.dotperson/mcp_harvest.json | 5 min |
| reading_list | ~/.dotperson/reading-list/inbox.jsonl | 15 min |
| screenshots | macOS screenshot directory (or DOTPERSON_SCREENSHOTS_DIR) | 15 min |
Observers
Observers tap raw substrate logs, POST each observation to the relay’s /v1/observations endpoint, and (since v0.3) mirror it into the local messages table:
| Observer | Source |
|---|---|
| continue | ~/.continue/sessions/*.json |
| cursor | Cursor workspace SQLite |
| copilot_chat | VS Code Copilot Chat JSON |
| gemini_cli | ~/.gemini/history |
| windsurf | Codeium Windsurf chats |
| claude_desktop | ~/Library/Application Support/Claude/IndexedDB (presence-detect today; live drain in V2) |
Browser extension capture
The MV3 extension (v1.0) uses a per-surface SurfaceAdapter registry to cover the 21 chat surfaces TK uses today. Twelve are verified: ChatGPT, Claude, Gemini, Microsoft Copilot, Grok, Meta AI, Perplexity, Mistral Le Chat, HuggingChat, Poe, Pi, and Lmarena. Nine more are experimental and sit behind a popup toggle: Kimi, DeepSeek, Qwen, Doubao, ChatGLM, Wenxin (Ernie), Yi, Hailuo, and SenseChat.
Two capture paths run in parallel:
- DOM extractor: a MutationObserver on the message list. Each adapter declares its message-root selectors, conversation-id pattern, and role classifier. A 600 ms streaming-quiet debounce avoids emitting half-rendered partials.
- MAIN-world network interceptor (ChatGPT, Claude, Gemini, Copilot): a content script in the page’s own JavaScript context wraps window.fetch, parses the canonical SSE / JSON message payload as it arrives, and bridges via window.postMessage to the same dedupe queue. DOM remains the floor; the interceptor opportunistically supplements when it succeeds.
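The streaming-quiet debounce in the DOM path can be modelled independently of the browser. The sketch below is a language-neutral model written in Python rather than the extension's own code: QuietDebouncer is a hypothetical class, timestamps are passed in explicitly so the behaviour is easy to follow, and only the 600 ms window comes from the description above.

```python
class QuietDebouncer:
    """Hold back a message until it has stopped changing for quiet_ms."""

    def __init__(self, quiet_ms: int = 600):
        self.quiet_ms = quiet_ms
        # message key -> (latest text seen, time of the last mutation in ms)
        self._pending: dict[str, tuple[str, int]] = {}

    def on_mutation(self, key: str, text: str, now_ms: int) -> None:
        """Record a DOM change; every change restarts the quiet window."""
        self._pending[key] = (text, now_ms)

    def flush(self, now_ms: int) -> list[tuple[str, str]]:
        """Return (key, text) pairs that have been quiet long enough, once each."""
        ready = [
            (key, text)
            for key, (text, last_seen) in self._pending.items()
            if now_ms - last_seen >= self.quiet_ms
        ]
        for key, _ in ready:
            del self._pending[key]
        return ready
```

A message that is still streaming keeps resetting its timer, so only the final rendered text is emitted.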
Both paths emit kind: "thread_turn" POSTs to /v1/observations with the surface-normalised source_surface tag. The relay fans out into two paths:
- persona_observations: signal layer, fed to the synthesis worker.
- corpus_threads + corpus_messages: full thread mirror with deterministic UUIDv5 thread ids and sha256(content || role) dedupe (additive migration 0012). Survives across devices via corpus sync.
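Deterministic ids and content hashing are what let the same conversation captured on two devices land on the same rows. A minimal sketch, assuming the thread name is built from the surface tag plus the conversation id and that || means plain string concatenation; THREAD_NAMESPACE, thread_id, and message_hash are hypothetical, and the relay's real namespace UUID is not given here.

```python
import hashlib
import uuid

# Hypothetical namespace for illustration only.
THREAD_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/corpus-threads")


def thread_id(source_surface: str, conversation_id: str) -> uuid.UUID:
    """Derive a UUIDv5 so the same surface + conversation always maps to one id."""
    return uuid.uuid5(THREAD_NAMESPACE, f"{source_surface}:{conversation_id}")


def message_hash(content: str, role: str) -> str:
    """sha256 over content followed by role, read as simple concatenation."""
    return hashlib.sha256((content + role).encode("utf-8")).hexdigest()
```

Because neither value depends on capture time or device, replaying the same turn produces the same (thread id, hash) pair and can be dropped as a duplicate.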
Dignity controls
Capture is opt-in per surface via the popup; experimental Chinese-frontier adapters ship behind a second Experimental surfaces flag. On top of that, every cloud capture is gated by:
- Master kill switch: pauses all capture across every surface from the popup.
- Per-conversation skip: the in-page indicator (bottom-right of every captured surface) is a one-click toggle for the active conversation only.
- Regex redactor: user-defined patterns in the popup are applied before the turn leaves the page; invalid regex is flagged in the UI and never reaches the capture path.
- Per-surface traffic-light dots: green (≤5 min since last accepted turn), yellow (5–60 min), red (otherwise or on error). Visible in the popup and on the dashboard, fed by the rolling chrome.storage.local.captureStats client telemetry plus the relay’s /v1/me/capture-health endpoint.
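The traffic-light thresholds reduce to a small classifier. This sketch assumes the 60-minute boundary is inclusive for yellow and that a surface with no accepted turn yet counts as red; capture_light is a hypothetical name.

```python
from typing import Optional


def capture_light(minutes_since_last_accepted: Optional[float],
                  had_error: bool = False) -> str:
    """Classify a surface as "green", "yellow", or "red"."""
    # Errors, or no accepted turn at all (assumption), are always red.
    if had_error or minutes_since_last_accepted is None:
        return "red"
    if minutes_since_last_accepted <= 5:
        return "green"
    if minutes_since_last_accepted <= 60:
        return "yellow"
    return "red"
```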
A 401 from the relay wipes the cached ID token and surfaces a “Sign in again” CTA — capture never re-queues behind a permanently-failing handshake.
Secret redaction
Every harvested payload runs through secret_redact() before being written. Patterns matched today:
- Quoted env-style: API_KEY="…"
- Unquoted env-style: API_KEY=…
- Prefix tokens: gh[pousr]_…, sk-…, xox[abprs]-…, github_pat_…
- Authorization headers: Authorization: Bearer …
- PEM blocks: -----BEGIN […]-----…-----END […]-----
Privacy controls live in /dashboard/privacy.
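The pattern families listed above can be approximated with a handful of regular expressions. Everything in this sketch is an assumption made for illustration: the regexes are guesses at each family rather than the patterns secret_redact() ships with, the env-style rule only fires on names ending in KEY, TOKEN, SECRET, or PASSWORD, and secret_redact_sketch and the [REDACTED] marker are hypothetical.

```python
import re

REDACTED = "[REDACTED]"

# Illustrative patterns only; order matters (multi-line PEM blocks go first).
_PATTERNS = [
    # PEM blocks: everything between the BEGIN and END markers.
    (re.compile(r"-----BEGIN [^-]+-----.*?-----END [^-]+-----", re.DOTALL), REDACTED),
    # Authorization headers: keep the scheme, drop the credential.
    (re.compile(r"(Authorization:\s*Bearer\s+)\S+", re.IGNORECASE), r"\1" + REDACTED),
    # Quoted env-style: NAME="value" or NAME='value'; quotes are preserved.
    (re.compile(r"""\b([A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD)=)(["']).*?\2"""),
     r"\1\2" + REDACTED + r"\2"),
    # Unquoted env-style: NAME=value up to the next whitespace or quote.
    (re.compile(r"""\b([A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD)=)[^\s"']+"""),
     r"\1" + REDACTED),
    # Prefix tokens: gh[pousr]_, github_pat_, sk-, xox[abprs]-.
    (re.compile(r"\b(?:gh[pousr]_|github_pat_|sk-|xox[abprs]-)[A-Za-z0-9_\-]+"), REDACTED),
]


def secret_redact_sketch(text: str) -> str:
    """Apply each pattern in turn and return the redacted text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running the quoted rule before the unquoted one keeps the surrounding quotes intact, so a redacted payload still parses as the same kind of file.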
Snapshots & portability
- personkit export --format=markdown: emit the whole corpus as Obsidian-ready .md files (one per thread), with YAML front-matter.
- personkit corpus snapshot / personkit corpus restore: Ed25519-signed JSONL snapshot of the whole corpus (entities + relations + events + provenance + threads + messages). Round-trip is byte-faithful; signatures are verified on restore.
- personkit corpus surfaces [--since 7d]: per-surface roll-up of recent threads (browser-ext:chatgpt, claude_code, etc.). Pair with personkit corpus messages --surface=<id> --limit=20 to drill in.
- The same per-surface health rendered on the dashboard (Overview) is available via the MCP corpus_health tool; it surfaces the local roll-up alongside the schema/embedding state in one structured response.