Pete

The Stack

All posts / Claude (5)

Ithaca, NY

Had Forge audit itself today. Forge is the agent I use to build MCP servers (part of the larger system). Designs the architecture, writes the code, hardens the container, ships to the registry. It has been running for months.

Asked it to grade its own playbook against best practices. Came back with seven specific gaps. No anti-hallucination rule for external claims. No token budget enforcement. No multi-client smoke test, only Claude Code. No FastMCP version pinning policy. Reflection check was one line. Lessons file path undocumented. No quarterly re-audit cadence on public repos.

Forge proposed a v2 with each gap closed as a discrete edit, marked with explicit ADD or REPLACE blocks so the diffs apply cleanly. I approved. It applied them to its own definition file. Playbook went from 258 to 304 lines.

The interesting part: every gap was something I had been manually fixing in spawn prompts every time I called the agent. The audit just made the patches permanent so I stop typing them.

Agents that audit themselves and apply the fix are the real move. Tools that build tools.

Built 20+ named agents on Claude Code over the past year. Each one has a domain, a risk tier, structured output contracts, and lane discipline. Forge builds MCP servers. Tank runs the homelab. Coach commands editorial for The 53 Report. Keeper handles production servers. Radar audits client sites. Outreach manages prospect email. Etc.

The trick isn't more agents. It's mandatory routing in CLAUDE.md. When a request matches an agent's domain, you route to it. No 'I have context, I'll just handle it myself.' That's the rule that keeps the system from collapsing into one bloated assistant.

Risk tiers separate read-only from production-write. Forge can push container images but won't deploy to a server without my call. Keeper requires double-confirmation for the WordPress sites with revenue on them. PreToolUse hooks block exfiltration patterns at the tool level, before any agent gets a chance to run a bad curl.

Each agent has a skill file with full instructions, a registry entry with metadata (risk tier, MCP tool access, file write scopes, SSH targets), and a coordination map for cross-agent handoff. It reads more like an org chart than a prompt library.

Most people use Claude Code as a coding assistant. This is something different.

A few weeks ago I shipped a research agent, but there were a few manual steps I kept finding myself doing after each report was updated. This past weekend I added two subagents to handle them.

Every report, the same three things: convert the inline [source: url] markers to clickable markdown links, check that every cited URL is still live, and verify each claim actually matches what the source says.

research-polish is mechanical: read the draft, rewrite citation markers, linkify the Sources section. research-verify reads the report with fresh context, fetches every URL, and checks whether each source actually supports the claim it's attached to. Flags DEAD, STALE, UNSUPPORTED, PARTIAL. Also audits confidence labels and timestamp conversions.

Key call: verify flags, never fixes. Auto-removing a claim on a weak judgment call would delete valid content when the verifier misreads a source. The human decides what to do.

Why split it into subagents instead of baking both passes into the research agent itself: the research agent has confirmation bias toward its own claims. Can't grade your own paper. A blank-slate reader catches what a self-review misses.

Both run in parallel after the draft lands, before the publish prompt. Next research report runs both passes automatically.

Built a research agent inside Claude Code that doesn't make things up. Based it directly on Anthropic's guide to reducing hallucinations: give the model permission to say "I don't know," extract direct quotes before analyzing, cite every claim inline, and retract anything it can't source. Layered chain-of-thought verification and confidence levels on top.

For search, it runs against my self-hosted SearXNG instance, a metasearch engine that aggregates Bing, DuckDuckGo, Brave, Reddit, and Startpage. Results get deduped and ranked by how many engines found each URL. Higher engine count means higher trust. No single search provider dependency.

The agent is a slash command in Claude Code. Type /research, ask a question, and it gathers evidence, reads the actual pages, builds a sourced report, and auto-publishes to GitHub. Run it with Claude Code's looping feature for ongoing stories and it keeps updating the report as new information drops.

Example: when LiteLLM got hit with a supply chain attack last week, /research tracked the story across 70+ outlets over four days, from the initial PyPI compromise through the Telnyx cascade, with every claim cited and verified.

github.com/pete-builds/research-reports/blob/main/litellm-pypi-supply-chain-attack.md ↗

Applied to six roles at Anthropic today. Forward Deployed Engineer, Claude Evangelist, and four Solutions Architect verticals. I've spent the last several months building deeply with Claude: agentic workflows, MCP servers, n8n automations, full-stack deploys. Every day I find new ways to push what's possible. Hoping the application conveys what the work already shows.