Tools for AI Collaboration Are a Different Design Problem

Tools, Claude, AI, MCP, Optimization
By Michael


I've been working on an unreleased TUI app built in Go. I had to refactor because my Bubbletea Model had gotten massive. It had become a single-depth struct with every property dumped at the top level; fields landed wherever was convenient while I moved fast on features. I was leveraging Claude to break the model down into smaller parts, but the problem with that kind of structure is that references to it are scattered throughout the codebase. That's where I was most looking forward to using Claude - cleaning up my tedious, ignorant mess.

I was halfway through the first submodel refactor when I saw three huge flashes of text from Claude. Grep output. That's when I got curious about whether I could make a tool and register it with Claude like a Skill, Hook, or Agent.

The answer is yes, and the tool is checkfor. JSON output, minimal tokens, built for repetition. The whole thing taught me something about a category of tooling we've needed for years, but that not many people are talking about.

Check out checkfor here

The Thesis

Building tools for AI collaboration is a fundamentally different design problem than building tools for humans.

Grep is already optimized. The source code is fine. What's not optimized is the output format. Grep was designed for humans to read in a terminal - color codes, repeated file paths for every match, a thousand and one different arguments. All of that makes perfect sense when you're the one using and reading it.

But Claude doesn't need any of that. Claude pays tokens for colors it can't see, formatting it doesn't use, and repeated information it already has. The optimization target changed from "human readability" to "token efficiency," and that means different tools entirely.

This isn't about making grep better. Grep is great at what it does. This is about recognizing that AI collaboration has different constraints, and we need a new category of tooling built around those constraints.

The Concrete Example

The refactoring touched 16 files in the internal/cli/ directory. Multiple phases - first breaking out the FormModel, then TableModel, then NavigationModel. Each phase meant finding every reference to the old fields and updating them to use the new submodel structure.
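
To make the shape of that change concrete, here's a rough Go sketch of the split. The submodel names come from the refactor described above; every field is invented for illustration and doesn't reflect the actual app.

```go
package cli

// Before: a single-depth model with everything dumped at the top level.
// All field names here are hypothetical.
type flatModel struct {
	formInputs  []string
	formFocus   int
	tableRows   [][]string
	tableCursor int
	navStack    []string
	navIndex    int
	// ...dozens more fields accumulated while moving fast
}

// After: the same state grouped into submodels that the parent delegates to.
type FormModel struct {
	inputs []string
	focus  int
}

type TableModel struct {
	rows   [][]string
	cursor int
}

type NavigationModel struct {
	stack []string
	index int
}

type Model struct {
	form  FormModel
	table TableModel
	nav   NavigationModel
}
```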

Claude was doing the refactor. It would update three or four files, then verify it didn't miss anything by running grep for the old field names. A huge block of output would scroll by, showing every match with context lines. Then Claude would report back: "Found 17 more references across 8 files."

Then it would do it again. Update those files. Verify with grep. More formatted output. "Found 9 references remaining."

And again. And again.

Each verification was necessary. When you're relying on AI to refactor reliably, you have to verify four or five times as often as the agent updates code. One missed reference and the whole thing breaks, and you're back to pasting logs into Claude.

But each grep call was returning formatted output designed for a human to parse visually, and Claude was consuming all of it as tokens.

For most people this isn't an issue. So I spend 1,000 tokens instead of 500 - what's the big deal? That's less than a penny in API costs. But if I had left it that way, I would have had to start a new session and lose context after three refactors instead of getting through all five.

What Makes AI Tooling Different

Humans need formatting. Color codes, visual hierarchy, context lines to understand what they're looking at. When grep shows you a match with two lines before and after, that's helpful. You can see the function it's in, what's happening around it.

AI doesn't need that. Claude pays tokens for colors it can't see. It pays tokens for the file path repeated on every match when the filename listed once in a JSON structure would be enough. Context lines might be useful sometimes, but most of the time the line number alone is sufficient.

Then there's the repetition problem. A human runs grep once to find something. Claude might run the same search 12 times in one session to track progress. "How many references are left?" becomes a question you ask over and over as the refactor proceeds.

For humans, verbose output is mildly inconvenient. For AI, it can kill the entire workflow by exhausting the context window. Token budget isn't a soft constraint you can ignore. It's a hard limit, and when you hit it, the session ends.

The Numbers

Here's what the token usage actually looked like.

The refactoring session required 12 verification queries across multiple phases.

| Method | Total Tokens | Multiplier vs checkfor | Cost (Sonnet 4.5) |
| --- | --- | --- | --- |
| checkfor (actual) | ~8,000 | 1x | $0.024 |
| Grep | ~35,100 | 4.4x | $0.105 |
| Read (16 files × 3 passes) | ~155,250 | 19.4x | $0.466 |

Token calculation for Read tool:

  • 16 files totaling ~3,450 lines
  • Average 15 tokens per line
  • 3 passes for inventory, mid-check, final verification
  • 3,450 lines × 15 tokens = 51,750 tokens per pass; × 3 passes = 155,250 tokens

Session context limit: 200,000 tokens

With the Read tool approach, the session would have exceeded the limit during phase 3 of 5. With checkfor, all phases completed in a single session.

API cost savings for the full refactor: $0.442 vs Read, $0.081 vs Grep.

Design Principles for AI Collaboration Tools

Answer exactly the question asked. checkfor only scans one directory, at a single depth. Not recursive. If you ask about internal/cli/, that's what you get. Nothing more. This is different from most search tools, which default to recursion because humans often want "find this anywhere." But for verification, you usually know exactly which directory matters.
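
As a rough illustration of that principle - not checkfor's actual source - a single-depth scan in Go reads only the entries of the one directory you name and never descends:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// scanDir looks for pattern in the files of a single directory, without
// recursing into subdirectories. Illustrative sketch only.
func scanDir(dir, pattern string) (map[string][]int, error) {
	entries, err := os.ReadDir(dir) // reads this directory only
	if err != nil {
		return nil, err
	}
	hits := make(map[string][]int)
	for _, e := range entries {
		if e.IsDir() {
			continue // single depth: subdirectories are skipped entirely
		}
		path := filepath.Join(dir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		for i, line := range strings.Split(string(data), "\n") {
			if strings.Contains(line, pattern) {
				hits[path] = append(hits[path], i+1) // 1-based line numbers
			}
		}
	}
	return hits, nil
}

func main() {
	hits, err := scanDir("internal/cli", "formFocus") // hypothetical field name
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(hits)
}
```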

JSON-only output. No human-friendly formatting, no colors, no repeated headers. Structured data that AI can parse instantly. The output includes a match count at the top, then an array of files with their matches. Line numbers, content, optional context. That's it.
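
Expressed as Go types, the output described above might look like the sketch below. The matches_found field name comes from this post; the remaining names are assumptions about a reasonable schema, not checkfor's documented one.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result mirrors the described output: an exact count up front, then files.
type Result struct {
	MatchesFound int        `json:"matches_found"`
	Files        []FileHits `json:"files"`
}

// FileHits lists each file path once, with all of its matches grouped under it.
type FileHits struct {
	Path    string  `json:"path"`
	Matches []Match `json:"matches"`
}

// Match carries the line number and content; context only appears on request.
type Match struct {
	Line    int      `json:"line"`
	Content string   `json:"content"`
	Context []string `json:"context,omitempty"`
}

func main() {
	out, _ := json.Marshal(Result{
		MatchesFound: 2,
		Files: []FileHits{{
			Path: "internal/cli/form.go",
			Matches: []Match{
				{Line: 42, Content: "m.formFocus = 0"},
				{Line: 87, Content: "return m.formInputs[i]"},
			},
		}},
	})
	fmt.Println(string(out))
}
```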

Minimal by default, configurable when needed. Context lines default to zero. If you need surrounding lines to understand the match, add --context 1 or --context 2. Most verification tasks don't need it. "Is this field still referenced anywhere?" doesn't require knowing what function it's in.
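
In Go's standard flag package, "minimal by default" is literally just a zero default - a sketch of the idea, not checkfor's actual flag handling:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Zero context lines unless the caller explicitly asks for them.
	context := flag.Int("context", 0, "surrounding lines to include with each match")
	flag.Parse()
	fmt.Printf("including %d context line(s) per match\n", *context)
}
```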

Built for repetition. The tool is designed to be called many times in one session without token bloat. Same query at different stages of a refactor to track progress. 32 matches, then 17, then 9, then zero. Each call costs roughly the same minimal token count.

Native integration. checkfor runs as an MCP (Model Context Protocol) server, which is how tools register themselves with Claude Code. Not a wrapper script, not a hack. Claude Code sees it as a first-class tool with the same status as Read or Grep. Configuration goes in .mcp.json, and it's available immediately.
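
For project-scoped servers, Claude Code reads .mcp.json from the repository root. A minimal entry looks something like the sketch below; the command value is a placeholder for however the checkfor binary is actually invoked, not its documented setup.

```json
{
  "mcpServers": {
    "checkfor": {
      "command": "/usr/local/bin/checkfor",
      "args": [],
      "env": {}
    }
  }
}
```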

Exact counts matter. The JSON output includes matches_found as a top-level field. AI can report "17 references remaining" with confidence, not "approximately 17" or "many references." For tracking refactor progress, exact numbers make the difference between knowing you're done and guessing.

For AI Only

These tools are built for AI to use, not humans. If you run checkfor manually, you'll get raw JSON that's annoying to read. That's intentional. grep is still the right tool when you're searching files yourself. Token-optimized tools exist to make AI collaboration efficient, not to replace your existing workflow.

The Broader Pattern

This pattern applies way beyond search tools.

Every traditional CLI tool outputs information formatted for humans. ls with colorized files and column layouts. find with verbose paths and special characters. git log with formatted commit messages and author info. Test runners with pretty output, progress bars, summary tables.

All of that makes sense when you're the one reading it. But when AI uses these tools, it pays tokens for formatting that serves no purpose. The data is there; it's just wrapped in a presentation layer designed for terminal eyeballs.

There's an entire category of "AI-native tools" that doesn't exist yet. Not replacements for the originals - those work fine for what they do. Complementary tools built around a different optimization target. Where grep optimizes for human readability, checkfor optimizes for token efficiency. Same core function, different constraints.

The optimization target changed, so the tool design has to change with it. Token budgets work like memory constraints in embedded systems. You don't just use less memory, you design around the limitation from the start. That's what AI collaboration tooling needs to do.

The Meta Insight

We're in the early days of AI collaboration tooling. Most of the tools Claude uses were designed decades ago for humans working in terminals. grep dates back to the early 1970s; find is nearly as old. They've been refined over 50 years to be excellent at what they do.

But the constraints are fundamentally different now. When you're designing for token budgets instead of screen real estate, you're solving a different problem. It's not about making grep faster or adding features. It's about recognizing that the output format itself is the bottleneck.

checkfor isn't grep with less output. It's a tool built from scratch with token efficiency as the primary constraint.

This is a new design problem, not an optimization problem.