Building a Personal AI Brain: Architecture for a Self-Hosted Agent

Planning the architecture for a personalized AI agent that runs on a home server, leverages AWS Bedrock for LLM inference, and uses MCP for distributed tool execution across devices.

I've been using AI coding assistants daily and they're incredible for development work. But I want something broader — a personalized AI agent that knows me, runs on my own infrastructure, and helps with both technical and everyday tasks. Not 100% developer-focused, more like 60/40 technical to general.

Here's the architecture I've landed on after thinking through the options.

The Core Idea

A "Brain" server running on my local network that:

  • Uses AWS Bedrock (Claude) for LLM inference
  • Stores everything in PostgreSQL with pgvector
  • Runs local embeddings on a GPU
  • Supports multiple concurrent sessions from any device
  • Learns about me over time through automatic memory extraction
  • Executes tools both locally on the server AND on connected client devices

Why Not Just Use ChatGPT/Claude?

The personalization angle. I want an agent that compounds knowledge about me over weeks and months. My preferences, my projects, my communication style, my tools. Commercial chatbots reset every conversation (or have limited memory). I want full control over what it remembers and how it uses that knowledge.

Plus, I want tool execution on my own infrastructure — file access, database queries, home automation, whatever I decide to wire up.

Architecture Overview

┌──────────────────────────────────────────────────┐
│                LAN (Brain Server)                │
│                                                  │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐   │
│  │ Postgres │  │  Ollama  │  │ Brain Server  │   │
│  │ +pgvector│  │  (GPU)   │  │ (API + Core)  │   │
│  └──────────┘  └──────────┘  └───────┬───────┘   │
│                                      │           │
│                    ┌─────────────────┤           │
│                    │ MCP Servers     │           │
│                    │ (brain-local)   │           │
│                    └─────────────────┘           │
└──────────────────────────┬───────────────────────┘
                           │ VPN
            ┌──────────────┼──────────────┐
            │              │              │
     ┌──────┴──────┐  ┌────┴────┐  ┌──────┴──────┐
     │ Laptop CLI  │  │ Mobile  │  │  VPS Client │
     │ + MCP local │  │         │  │ + MCP local │
     └─────────────┘  └─────────┘  └─────────────┘

Key Design Decisions

Bedrock for LLM, Local GPU for Embeddings

Claude on Bedrock handles the conversational heavy lifting — it's hard to beat for quality. But embeddings are a different story. Running an embedding model locally on a GPU eliminates per-call costs and keeps latency low. Models like nomic-embed-text run comfortably on modest hardware and produce quality embeddings.

This split makes economic sense: pay for the expensive inference, run the commodity workload locally.
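To make the split concrete, here is a minimal sketch of the two call paths: assembling a Converse-style request for Bedrock, and fetching an embedding from a local Ollama instance. The helper names and the host are my own placeholders, not settled choices.

```python
import json
import urllib.request


def build_converse_request(model_id: str, history: list, user_text: str) -> dict:
    """Assemble the request for Bedrock's Converse API.

    The actual call would go through boto3's bedrock-runtime client:
    client.converse(**build_converse_request(...)).
    """
    messages = history + [{"role": "user", "content": [{"text": user_text}]}]
    return {"modelId": model_id, "messages": messages}


def embed_local(text: str, host: str = "http://localhost:11434") -> list:
    """Fetch an embedding from a local Ollama instance on the GPU box."""
    body = json.dumps({"model": "nomic-embed-text", "prompt": text}).encode()
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]
```

Everything the brain says to a user goes through the first path; everything it indexes or searches goes through the second, for free.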

PostgreSQL for Everything

Instead of juggling multiple databases, PostgreSQL with pgvector handles all persistence:

  • Vector embeddings — Document chunks with their vector representations for RAG retrieval
  • Conversation history — Full message logs per session
  • Long-term memory — Facts and preferences learned about the user over time
  • User profiles — Static configuration and preferences

One database, one backup strategy, one operational concern.
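As a sketch, the schema might look something like this. Table and column names are illustrative, not final; the vector dimension of 768 matches nomic-embed-text and would change with the embedding model.

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE users (
    id         BIGSERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    profile    JSONB NOT NULL DEFAULT '{}'
);

CREATE TABLE sessions (
    id         BIGSERIAL PRIMARY KEY,
    user_id    BIGINT REFERENCES users(id),
    state      TEXT NOT NULL DEFAULT 'active',   -- 'active' | 'paused'
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE messages (
    id         BIGSERIAL PRIMARY KEY,
    session_id BIGINT REFERENCES sessions(id),
    role       TEXT NOT NULL,                    -- 'user' | 'assistant' | 'tool'
    content    TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE chunks (
    id         BIGSERIAL PRIMARY KEY,
    document   TEXT NOT NULL,
    content    TEXT NOT NULL,
    ts         TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
    embedding  VECTOR(768)
);

CREATE TABLE memories (
    id         BIGSERIAL PRIMARY KEY,
    user_id    BIGINT REFERENCES users(id),
    category   TEXT NOT NULL,                    -- 'preference' | 'fact' | 'project' ...
    content    TEXT NOT NULL,
    confidence REAL NOT NULL,
    embedding  VECTOR(768),
    last_used  TIMESTAMPTZ
);
```

The generated tsvector column on chunks is what makes hybrid retrieval cheap later: keyword and semantic search live on the same row.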

RAG is Still the Right Starting Point

After evaluating the current landscape — GraphRAG, agentic retrieval, long-context stuffing, fine-tuning — vanilla RAG with hybrid search (semantic + keyword) is still the best starting point for a personal agent. It's the right balance of simplicity, cost, and effectiveness.

The plan is to start with recursive character splitting (1024 tokens, 200 overlap) and iterate. PostgreSQL's tsvector gives us full-text keyword search alongside pgvector's semantic search, so we get hybrid retrieval without additional infrastructure.
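A simplified version of that splitter, with character counts standing in for tokens (a real version would count with the embedding model's tokenizer); the function name and separator list are mine:

```python
def split_recursive(text, chunk_size=1024, overlap=200,
                    separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text into chunks of at most chunk_size characters,
    preferring to break on the coarsest separator that appears in the text."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # coarsest separator that actually occurs; fall back to per-character
    sep = next((s for s in separators if s in text), "")
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # carry the tail of the previous chunk forward as overlap
        tail = current[-overlap:] if current else ""
        current = f"{tail}{sep}{piece}" if tail else piece
        if len(current) > chunk_size:
            # a single piece is still too long: retry with finer separators
            chunks.extend(split_recursive(current, chunk_size, overlap,
                                          separators[1:]))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Good enough to ship and measure against; the "and iterate" part is where the real work is.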

MCP for Distributed Tool Execution

This is where it gets interesting. The Model Context Protocol (MCP) supports both local (stdio) and remote (HTTP) transports. This enables a powerful pattern:

  • Brain-side MCP servers run on the server, providing tools like file access, database queries, web search, and memory management
  • Client-side MCP servers run on whatever device you're connecting from, providing local tools like filesystem access, clipboard, and OS-specific integrations

When a laptop CLI connects to the brain, it registers its local MCP servers. The brain aggregates all available tools and presents them to the LLM. When the LLM calls a tool, the brain routes execution to wherever that tool lives.

The result: sitting at your laptop, the agent can access files on both the server and your local machine. Connect from your phone, and only server-side tools are available. The tool surface adapts to the client.
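The routing layer can be sketched independently of the MCP wire protocol: a registry that maps each tool to the connection that owns it. The class and method names here are mine, not part of MCP.

```python
class ToolRouter:
    """Routes tool calls to wherever the tool lives: the brain itself
    or a connected client's local MCP servers."""

    def __init__(self):
        self._tools = {}  # tool name -> (origin, handler)

    def register(self, origin: str, name: str, handler):
        self._tools[name] = (origin, handler)

    def unregister_origin(self, origin: str):
        """Drop a client's tools when it disconnects."""
        self._tools = {n: t for n, t in self._tools.items() if t[0] != origin}

    def available(self) -> list:
        """Tool names to advertise to the LLM for the current session."""
        return sorted(self._tools)

    def call(self, name: str, **kwargs):
        origin, handler = self._tools[name]
        # in the real system this dispatches over the MCP transport to `origin`
        return handler(**kwargs)
```

A laptop session registers its local tools on connect and the router drops them on disconnect, which is exactly the "tool surface adapts to the client" behavior:

```python
router = ToolRouter()
router.register("brain", "db_query", lambda sql: ...)
router.register("laptop-1", "read_clipboard", lambda: ...)
router.unregister_origin("laptop-1")  # phone session: server tools only
```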

Session Management

Support for multiple concurrent sessions with pause/resume is essential. The data model is straightforward:

  • Sessions table tracks active/paused state per user
  • Messages table stores the full conversation history per session
  • WebSocket connections for streaming responses, REST for session management

Any client can list sessions, resume a paused one, or start fresh. The brain handles concurrent Bedrock calls independently.
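An in-memory stand-in for the sessions/messages tables shows the state machine (names are placeholders; the real thing lives in Postgres behind the REST endpoints):

```python
import itertools


class SessionStore:
    """Minimal pause/resume semantics over the sessions + messages model."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._sessions = {}  # id -> {"user": ..., "state": ..., "messages": [...]}

    def create(self, user: str) -> int:
        sid = next(self._ids)
        self._sessions[sid] = {"user": user, "state": "active", "messages": []}
        return sid

    def pause(self, sid: int):
        self._sessions[sid]["state"] = "paused"

    def resume(self, sid: int):
        self._sessions[sid]["state"] = "active"

    def append(self, sid: int, role: str, content: str):
        session = self._sessions[sid]
        if session["state"] != "active":
            raise RuntimeError("resume the session before sending messages")
        session["messages"].append({"role": role, "content": content})

    def list_for(self, user: str) -> list:
        return [(sid, s["state"]) for sid, s in self._sessions.items()
                if s["user"] == user]
```

Each active session drives its own Bedrock call, so two devices can stream answers at the same time without stepping on each other.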

Automatic Memory with Guardrails

The personalization magic comes from automatic memory extraction. After conversations, a separate LLM pass reviews what was discussed and extracts durable knowledge:

What gets extracted:

  • Preferences and opinions
  • Facts about projects, tools, and workflows
  • Recurring topics and interests
  • Relationships and context

Guardrails:

  • Confidence threshold — only store high-confidence extractions
  • Deduplication — semantic similarity check against existing memories before storing
  • Decay tracking — memories that never get retrieved are flagged for review
  • User review — a /memories command to list, edit, and delete what the agent knows
  • Category tagging — each memory is typed (preference, fact, project, etc.)
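The first two guardrails reduce to a small gate. The thresholds below are guesses to be tuned, and the in-process cosine check stands in for what would really be a pgvector similarity query:

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def should_store(candidate, existing, min_confidence=0.8, max_similarity=0.92):
    """Gate a candidate memory: skip low-confidence extractions and
    near-duplicates of memories we already hold."""
    if candidate["confidence"] < min_confidence:
        return False
    return all(cosine(candidate["embedding"], m["embedding"]) < max_similarity
               for m in existing)
```

Everything that passes the gate still lands in the /memories review list, so the user has the final say.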

Multi-User Ready

The schema supports multiple users from the start. Each user gets isolated profiles, memories, conversations, and knowledge bases. If this works well, family members can have their own personalized experience on the same infrastructure.

Build Phases

  1. Core server — API + Bedrock integration + basic CLI client
  2. Session management — Concurrent sessions, pause/resume
  3. RAG pipeline — Local embeddings + pgvector retrieval
  4. MCP support — Brain-local tools first, then remote client tools
  5. Memory system — Automatic extraction with guardrails
  6. Multi-user — Auth, user isolation, shared knowledge bases
  7. Polish — Better chunking, more tools, mobile access

What I'm Skipping (For Now)

  • Fine-tuning — Profile + RAG handles personalization better for this use case
  • GraphRAG — Overkill until I hit multi-hop reasoning limits
  • Complex orchestration frameworks — Starting with direct Bedrock API calls, not LangChain/LlamaIndex
  • Bedrock Agents — Too opinionated, adds latency, harder to debug

Next Steps

Phase 1 starts now. A Python server on the LAN, talking to Bedrock, with a CLI client that connects over the VPN. Get the conversational loop feeling right before adding complexity.

The goal isn't to build a product — it's to build a tool that gets smarter about me every day. Let's see how it goes.
