The corpus

A local SQLite database of your operational life and your conversations — git, plans, sessions, threads, messages.


The corpus is the operational pillar of aiperson — a local-first SQLite database at ~/.dotperson/corpus.db that indexes both what you did (events: commits, plans, tasks, sessions) and what you said (threads + messages: every conversation captured by an observer, harvester, or the browser extension).

Schema (V2)

| Table | Holds |
| --- | --- |
| entities | Named things: people, repos, projects, conversations. kind + name + attrs. |
| relations | Typed triples between entities, with temporal validity. |
| events | Append-only stream of things that happened, with provenance and content hashes (deduped per minute bucket). |
| threads | One conversation on one surface: Claude Code session, claude.ai chat, ChatGPT conversation, Cursor composer, voice memo. |
| messages | One turn within a thread. role + content + occurred_at + tool_calls. Deduped on (thread_id, content_hash). |
| embeddings | Vector store keyed by text_hash so two messages with identical content share the same row. |
| events_vec_768 / events_vec_384 | sqlite-vec indices for events, routed by embedder dim (Vertex 768 / fastembed 384). |
| messages_vec_768 / messages_vec_384 | sqlite-vec indices for messages. |
| events_fts / messages_fts | SQLite FTS5 over event payloads and message contents respectively. |
| provenance | captured_at + signed_by + signature per event. |
| entity_aliases | Alias index: "TK" and "Humphrey" resolve to the same canonical entity. |

Run personkit corpus health to see counts per table.
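
The dedupe rule on messages can be illustrated with a minimal sketch. It assumes a simplified messages table (the real column set and constraints are not shown here), SHA-256 as the content hash, and hypothetical helper names open_demo_db and insert_message; none of these are confirmed by the actual schema.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    """Hash the message text. SHA-256 is an assumption, not the confirmed hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a simplified, hypothetical messages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE messages (
            id           INTEGER PRIMARY KEY,
            thread_id    TEXT NOT NULL,
            role         TEXT NOT NULL,
            content      TEXT NOT NULL,
            content_hash TEXT NOT NULL,
            occurred_at  TEXT NOT NULL,
            UNIQUE (thread_id, content_hash)
        )
        """
    )
    return conn


def insert_message(conn, thread_id, role, content, occurred_at) -> bool:
    """Insert one turn; return False when the same content already exists in the thread."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO messages "
        "(thread_id, role, content, content_hash, occurred_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (thread_id, role, content, content_hash(content), occurred_at),
    )
    # rowcount is 1 for a fresh row and 0 when the UNIQUE constraint ignored it.
    return cur.rowcount == 1
```

With this shape, two capture paths that see the same turn can both attempt the write; only the first one lands, while identical text in a different thread is still stored.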

Hybrid retrieval

A single corpus_query call ranks events from four signal sources and fuses the rankings with reciprocal rank fusion (RRF, k=60):

  1. Vec: dense vector similarity against the chosen embedder’s table, routed by dim automatically.
  2. FTS: SQLite FTS5 BM25 over the JSON payload.
  3. Entity: restricts candidates to those whose payload references any of the supplied entity ids.
  4. Temporal: recency ranking within the supplied time window.

Every hit reports which sources contributed via matched_via.
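
As a rough illustration of the fusion step, the sketch below applies the standard RRF formula, score = sum over sources of 1 / (k + rank), with k = 60. The function name rrf_fuse, the input shape (one ranked id list per source), and the tie-break on id are assumptions for illustration; how corpus_query represents rankings internally is not documented here.

```python
from collections import defaultdict

RRF_K = 60  # the k constant quoted above


def rrf_fuse(rankings: dict[str, list[str]], k: int = RRF_K) -> list[dict]:
    """Fuse per-source ranked id lists into one list, best first.

    `rankings` maps a source name (e.g. "vec", "fts") to ids ordered best-first.
    """
    scores: dict[str, float] = defaultdict(float)
    matched_via: dict[str, list[str]] = defaultdict(list)

    for source, ranked_ids in rankings.items():
        for rank, event_id in enumerate(ranked_ids, start=1):
            # Each source contributes 1 / (k + rank) for every id it returned.
            scores[event_id] += 1.0 / (k + rank)
            matched_via[event_id].append(source)

    hits = [
        {"id": eid, "score": scores[eid], "matched_via": matched_via[eid]}
        for eid in scores
    ]
    # Highest fused score first; ties broken by id so the order is deterministic.
    hits.sort(key=lambda h: (-h["score"], h["id"]))
    return hits
```

An id that appears in several lists accumulates score from each one, which is why a hit found by both Vec and FTS tends to outrank a hit found by only one of them.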

A parallel suite for messages (FTS5 over messages.content, vec lookup against messages_vec_*) is wired into the V2 schema; the daily brief and personkit today use these directly.

Embedders

| Embedder | Model | Dim | When it activates |
| --- | --- | --- | --- |
| vertex | text-embedding-005 | 768 | Default when GCP_PROJECT_ID is set + a token source is resolvable (Workload Identity, ADC, or GOOGLE_ACCESS_TOKEN). |
| fastembed | BGE-small-en-v1.5 | 384 | Behind the fastembed Cargo feature; pure offline, no network. |
| none | | | Falls back to FTS + entity + temporal only. |

Backfill is on-demand: personkit corpus embed [--max 2000]. The daemon’s cycle drains pending events + messages automatically.
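
The activation rules and dim-based routing can be sketched as follows. This is an illustrative model only: pick_embedder and vec_table are hypothetical names, the precedence of vertex over fastembed is an assumption drawn from vertex being described as the default, and token resolution is reduced to a boolean.

```python
from typing import Mapping, Optional

# Dims per embedder, as listed in the embedders table.
EMBEDDER_DIMS = {"vertex": 768, "fastembed": 384}


def pick_embedder(
    env: Mapping[str, str],
    token_resolvable: bool,
    fastembed_enabled: bool,
) -> Optional[str]:
    """Return the active embedder name, or None for the FTS-only fallback."""
    # Assumption: vertex takes precedence when both embedders are available.
    if env.get("GCP_PROJECT_ID") and token_resolvable:
        return "vertex"
    if fastembed_enabled:
        return "fastembed"
    return None


def vec_table(kind: str, embedder: Optional[str]) -> Optional[str]:
    """Map ("events" | "messages", embedder) to the vec table name for its dim."""
    if embedder is None:
        return None  # no vector signal; retrieval uses FTS + entity + temporal
    return f"{kind}_vec_{EMBEDDER_DIMS[embedder]}"
```

For example, with GCP_PROJECT_ID set and a resolvable token, event vectors would be looked up in events_vec_768; with only the fastembed feature compiled in, message vectors would go to messages_vec_384.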

Harvesters

Harvesters scrape structured local state on a cadence and write directly into corpus:

| Harvester | Source | Default cadence |
| --- | --- | --- |
| git | Configured git repos via libgit2 | 15 min |
| claude_plans | ~/.claude/plans/*.md, ~/TASKS.md, project memory | 15 min |
| claude_sessions | ~/.claude/projects/*/sessions/*.jsonl (full transcripts into messages since v0.3) | 15 min |
| config_state | AI-tool config files, redacted | 60 min |
| mcp_bridge | External MCP servers per ~/.dotperson/mcp_harvest.json | 5 min |
| reading_list | ~/.dotperson/reading-list/inbox.jsonl | 15 min |
| screenshots | macOS screenshot directory (or DOTPERSON_SCREENSHOTS_DIR) | 15 min |

Observers

Observers tap raw substrate logs and POST each observation to the relay’s /v1/observations endpoint; since v0.3 they also mirror each observation into the local messages table:

| Observer | Source |
| --- | --- |
| continue | ~/.continue/sessions/*.json |
| cursor | Cursor workspace SQLite |
| copilot_chat | VS Code Copilot Chat JSON |
| gemini_cli | ~/.gemini/history |
| windsurf | Codeium Windsurf chats |
| claude_desktop | ~/Library/Application Support/Claude/IndexedDB (presence-detect today; live drain in V2) |

Browser extension capture

The MV3 extension (v1.0) uses a per-surface SurfaceAdapter registry to cover the 21 chat surfaces TK uses today. Twelve are verified: ChatGPT, Claude, Gemini, Microsoft Copilot, Grok, Meta AI, Perplexity, Mistral Le Chat, HuggingChat, Poe, Pi, and LMArena. Nine are experimental and sit behind a popup toggle: Kimi, DeepSeek, Qwen, Doubao, ChatGLM, Wenxin (Ernie), Yi, Hailuo, and SenseChat.

Two capture paths run in parallel:

  1. DOM extractor — a MutationObserver on the message list. Each adapter declares its message-root selectors, conversation-id pattern, and role classifier. A 600 ms streaming-quiet debounce avoids emitting half-rendered partials.
  2. MAIN-world network interceptor (ChatGPT, Claude, Gemini, Copilot) — a content script in the page’s own JavaScript context wraps window.fetch, parses the canonical SSE / JSON message payload as it arrives, and bridges via window.postMessage to the same dedupe queue. DOM remains the floor; the interceptor opportunistically supplements when it succeeds.
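
The streaming-quiet debounce used by the DOM extractor can be modelled independently of the browser. The sketch below is written in Python for illustration only (the extension itself runs as JavaScript in the page); the class name QuietDebouncer and its methods are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
class QuietDebouncer:
    """Hold a message until its text has been unchanged for `quiet_ms`."""

    def __init__(self, quiet_ms: int = 600):
        self.quiet_ms = quiet_ms
        # message_id -> (latest_text, time_of_last_change_ms)
        self._pending: dict[str, tuple[str, int]] = {}

    def on_mutation(self, message_id: str, text: str, now_ms: int) -> None:
        """Record the newest text; every mutation restarts the quiet window."""
        self._pending[message_id] = (text, now_ms)

    def flush(self, now_ms: int) -> list[tuple[str, str]]:
        """Return and forget messages that have been quiet long enough."""
        ready = [
            (mid, text)
            for mid, (text, changed_at) in self._pending.items()
            if now_ms - changed_at >= self.quiet_ms
        ]
        for mid, _ in ready:
            del self._pending[mid]
        return ready
```

While a reply is still streaming, each new token resets the window, so only the final text is emitted once the surface has been quiet for 600 ms.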

Both paths emit kind: "thread_turn" POSTs to /v1/observations with the surface-normalised source_surface tag. The relay fans out into two paths:

Dignity controls

Capture is opt-in per surface via the popup; experimental Chinese-frontier adapters ship behind a second Experimental surfaces flag. On top of that, every cloud capture is gated by:

A 401 from the relay wipes the cached ID token and surfaces a “Sign in again” CTA — capture never re-queues behind a permanently-failing handshake.

Secret redaction

Every harvested payload runs through secret_redact() before being written. Patterns matched today:

  1. Quoted env-style: API_KEY="…"
  2. Unquoted env-style: API_KEY=…
  3. Prefix tokens: gh[pousr]_…, sk-…, xox[abprs]-…, github_pat_…
  4. Authorization headers: Authorization: Bearer …
  5. PEM blocks: -----BEGIN […]-----…-----END […]-----
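
A minimal sketch of how the five pattern families above might be expressed as regular expressions. This is not the actual secret_redact() implementation: the function name redact_secrets_sketch, the [REDACTED] placeholder, the set of env-var name suffixes treated as secret-bearing, and the rule ordering are all assumptions for illustration.

```python
import re

REDACTED = "[REDACTED]"  # placeholder text is an assumption

# Assumption: only env names ending in these suffixes are treated as secrets.
_ENV_NAME = r"[A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD)"

# Rules run in order: widest spans first, so later rules see placeholders.
_RULES = [
    # PEM blocks, matched across lines.
    (re.compile(r"-----BEGIN [^-]+-----.*?-----END [^-]+-----", re.DOTALL), REDACTED),
    # Authorization headers: keep the scheme, drop the credential.
    (re.compile(r"(Authorization:\s*Bearer\s+)\S+"), r"\1" + REDACTED),
    # Quoted env-style assignments.
    (re.compile(r"\b(" + _ENV_NAME + r')="[^"\n]*"'), r'\1="' + REDACTED + '"'),
    # Unquoted env-style assignments.
    (re.compile(r"\b(" + _ENV_NAME + r")=[^\s\"']+"), r"\1=" + REDACTED),
    # Well-known token prefixes.
    (re.compile(r"\b(?:gh[pousr]_|github_pat_|sk-|xox[abprs]-)[A-Za-z0-9_-]+"), REDACTED),
]


def redact_secrets_sketch(text: str) -> str:
    """Apply every rule in order and return the scrubbed text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```

Keeping the variable name or header scheme in the output preserves enough context to see that a secret was present without storing its value.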

Privacy controls live in /dashboard/privacy.

Snapshots & portability