⚡ Tony Meeting Briefing

Wednesday, April 29, 2026 · Prepared by Warren · 4 agenda topics
1 · Google Workspace Connection (Authenticated)

Warren already has authenticated access to Google Workspace via OAuth2. This extends agent capabilities beyond Slack into the full productivity suite.

📁 Google Drive: List, search, upload, download, share
📝 Google Docs: Create, read, export (PDF/text)
📊 Google Sheets: Read, write, append data
📧 Gmail: Read and send emails
📅 Calendar: List events, create meetings
Key takeaway: Warren goes from "lives in Slack" to "operates across the full Google Workspace." If a client uses GWS, agents can read and publish directly in their environment.
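
As a concrete example, here is a minimal sketch of the Drive capability in code, assuming the authorized OAuth2 token is stored at a hypothetical token.json and google-api-python-client is installed; the search query is illustrative:

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Load the already-authorized OAuth2 token (path is hypothetical).
creds = Credentials.from_authorized_user_file("token.json")
drive = build("drive", "v3", credentials=creds)

# Illustrative query: Google Docs whose name mentions "proposal".
results = drive.files().list(
    q="mimeType='application/vnd.google-apps.document' and name contains 'proposal'",
    pageSize=10,
    fields="files(id, name, modifiedTime)",
).execute()

for f in results.get("files", []):
    print(f["name"], f["modifiedTime"])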
2 · Tiffany: Warren on Tony's Mac Studio (Ready to deploy)

"Tiffany" is the name for a local Warren instance running on Tony's Mac Studio. Same architecture, same skill system — running independently on Tony's hardware.

  • Which LLM provider? API keys needed (Anthropic / OpenAI / Together / local via Ollama)
  • Which channels? Slack only, or also Signal, WhatsApp, etc.?
  • Independent or connected? Isolated agent vs. shared memory with DGX Warren
  • Who configures? Charlie or Warren can guide Tony remotely
Valent connection: This same process is what the $5K self-install POC requires — install OpenClaw in their env, create an agent with relevant skills, connect to their channels. Tony's Mac is a dry run.
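
None of these choices block the setup; they amount to a handful of config values. A purely hypothetical sketch (OpenClaw's real config schema is not shown in this briefing):

```python
# Hypothetical sketch only: OpenClaw's actual config schema is not shown
# in this briefing. This just makes the four decision points concrete.
tiffany_config = {
    "llm_provider": "anthropic",         # or "openai", "together", "ollama" (local)
    "api_key_env": "ANTHROPIC_API_KEY",  # Tony supplies the key
    "channels": ["slack"],               # later: "signal", "whatsapp"
    "memory": "isolated",                # vs. shared memory with DGX Warren
}
```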
3 · Warren Evals: How We Test Agent Judgment (Awaiting Tony)
The problem: Warren produces 100+ outputs per day. Tony can't review everything. We need an automated system that catches quality regressions — not just "did it run?" but "did it make the right call?"

❌ Without Evals

  • Tony catches problems by accident
  • Bad outputs ship before anyone notices
  • Config changes break judgment silently
  • No way to measure improvement
  • "Trust me" is the only evidence

✅ With Evals

  • Every change tested automatically
  • LLM judge scores using Tony's criteria
  • Regressions caught before clients see them
  • Measurable agreement rate vs. human
  • Data-driven evidence of quality
🗂️ Capture → 📏 Encode → 🧪 Test → 📊 Measure → 🔁 Iterate
This loop runs on every change to Warren's prompts, SOPs, or config
Step 1 · Capture: Mine real human judgments
We extracted 68 of Tony's decisions from session history: every pass, fail, correction, and teaching moment. Each record contains the agent output, Tony's verdict, and his reasoning.
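
Each judgment can live as one small structured record. An illustrative shape (field names are assumptions, not the actual corpus schema):

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    """Illustrative record shape for one mined judgment; the field names
    are assumptions, not the actual corpus schema."""
    agent_output: str  # what Warren produced
    verdict: str       # "PASS", "FAIL", or "CONDITIONAL"
    reasoning: str     # Tony's stated rationale
    domain: str        # e.g. "product-scope" or "sales/BD"
```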
Step 2 · Encode: Turn judgment into scoring rubrics
Tony's teaching entries become structured YAML rubric criteria that an LLM judge can use to score any output. The rubrics ARE Tony's judgment, codified.
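
For illustration, here is one way a teaching entry could render as rubric criteria, in the style of a PromptFoo llm-rubric assertion; the exact schema in the ops repo may differ:

```python
import yaml  # PyYAML

# Illustrative only: one teaching entry rendered as rubric criteria in the
# style of a PromptFoo llm-rubric assertion. The real rubrics may differ.
rubric = {
    "assert": [
        {"type": "llm-rubric",
         "value": "Does not claim work is 'done' before it has been verified"},
        {"type": "llm-rubric",
         "value": "Scopes the work to the customer's real need, not just the stated ask"},
    ],
}
print(yaml.safe_dump(rubric, sort_keys=False))
```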
Step 3 · Test: LLM judge scores against rubrics
A different model (not the one Warren uses) reads the rubric plus the agent output and scores it. Cross-model judging means no "grading your own homework." Binary verdict: PASS or FAIL.
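
A sketch of the judging call, assuming an Anthropic-hosted judge; the model name is illustrative, and any capable model other than Warren's own would do:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(rubric: str, agent_output: str) -> str:
    """Score one agent output against one rubric. Binary PASS/FAIL by design;
    the judge model is deliberately not the model Warren runs on."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative judge-model choice
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"Rubric:\n{rubric}\n\nAgent output:\n{agent_output}\n\n"
                "Answer with exactly one word: PASS or FAIL."
            ),
        }],
    )
    return msg.content[0].text.strip()
```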
Step 4 · Measure: Compare LLM judge vs. human verdicts
Does the judge agree with Tony? Target: >90% agreement. First result: 100% on easy cases, but the sample is small; we need harder cases to stress-test.
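
The agreement metric itself is simple; a sketch:

```python
def agreement_rate(judge_verdicts: list[str], human_verdicts: list[str]) -> float:
    """Fraction of cases where the LLM judge matches the human verdict."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

# Gate: don't trust the judge on new outputs until this exceeds 0.90.
```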
Step 5 · Iterate: Disagreements → refine the rubric
Every disagreement reveals a rubric gap. Fix it, re-test, repeat. The 8 calibration scenarios we sent Tony are designed to find exactly these gaps.
L1 · Mechanical Execution: Did it use the right tools in the right order? Fully automated. ("Did it commit within 10 min?" "Did it push?")
L2 · Process Judgment: Did it follow the right process? Semi-automated with rubrics. ("Correct label?" "Right SOP?" "Proper routing?")
L3 · Product Judgment: Is it building the right thing? Requires human calibration. ("Right scope?" "Read the customer's real need?")
The analogy: WWTD is the law. Evals are the court system. Having laws isn't enough — you need a system that consistently interprets and enforces them. WWTD tells Warren what good looks like. Evals prove whether outputs match.

📖 WWTD Alone

  • Static document — a RAG knowledge base
  • Describes Tony's principles in prose
  • Warren reads it and tries to follow
  • No verification it's actually working
  • Only covers Tony's domains
  • Conflicting principles — no resolution

🧪 WWTD + Evals

  • WWTD = the rules; Evals = enforcement
  • Automated testing proves compliance
  • Catches regressions on every change
  • Covers ALL domains (product + tech + process)
  • Multiple SMEs calibrate the system
  • Scenarios resolve principle conflicts
🎯 Tony: Product scope, sales/BD, customer judgment, design bar
⚙️ Charlie: Pipeline, architecture, code quality, technical correctness
📋 Victor: Process, delivery, operational quality, coordination
68 historical judgments mined · 100% judge agreement (easy set) · 8 calibration scenarios pending

Each scenario pits two of Tony's principles against each other. No obvious right answer — that's the point. Tony's verdict becomes ground truth for the rubrics.

Scenario 1 · Scope vs. Relationship: Build an unnecessary layer to protect the client's concerns?
Scenario 2 · Effort-Value vs. Stated Need: Build what they asked for, or what they actually need?
Scenario 3 · Speed vs. Design Bar: Add features or polish visuals before the demo?
Scenario 4 · Known Pattern vs. New Info: Follow the rule or adapt to the situation?
Scenario 5 · Truth vs. Perception: Present accurate data that embarrasses a stakeholder?
Scenario 6 · Build for Need vs. Trust: Minimal viable, or production-grade reliability?
Scenario 7 · Show Don't Tell vs. Readiness: Demo when they asked for a document?
Scenario 8 · Confidence Inversion: Safe deliverable, or insight-driven risk?
L1 · Deterministic: Programmatic asserts, no LLM. Runs identically every time. ("Committed in <10 min?" "Branch pushed?" "Label correct?") See the sketch after this list.
L2 · Near-Deterministic: Pattern matching + rules for routing/SOP. Lightweight LLM judge only on edge cases. ("Right SOP dispatched?" "Pipeline routing correct?")
L3 · Probabilistic: LLM judge evaluating judgment. Inherently non-deterministic, by design. ("Right scope?" "Read the customer's real need?")
Why L3 can't be deterministic — and why that's OK: Evaluating judgment requires judgment. An LLM scoring another LLM will have variance between runs. Our design decisions minimize this:
Binary PASS/FAIL: No 1-5 scale that fluctuates. Clean signal, less noise.
Cross-Model Judging: A different model judges the agent, avoiding self-agreement bias.
False Pass Rate <5%: The binding constraint; catching bad outputs matters more than missing good ones.
Agreement Rate >90%: Target vs. human verdicts. Accepts a margin; doesn't pretend to be perfect.
Future enhancement: Run the judge N times per evaluation and use majority vote for even higher reliability. Not needed until the basic calibration loop is proven.
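
A sketch of that enhancement, reusing the judge() function sketched under Step 3:

```python
from collections import Counter

def majority_judge(rubric: str, agent_output: str, n: int = 3) -> str:
    """Run the non-deterministic judge n times and return the majority
    verdict, reducing run-to-run variance. judge() is the cross-model
    scorer sketched in Step 3."""
    votes = Counter(judge(rubric, agent_output) for _ in range(n))
    return votes.most_common(1)[0][0]
```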
[Done] Framework design & architecture doc: full eval framework proposal in the ops repo
[Done] 68 Tony judgments mined & structured: 19 FAILs, 7 corrections, 6 PASSes, 36 teaching directives
[Done] Reply-audit hook (live behavioral monitor): audits every message for anti-patterns such as permission loops and false "done" claims
[Done] 8 calibration scenarios sent to Tony: principle-conflict scenarios for rubric stress-testing
[Now] Awaiting Tony's responses: need PASS/FAIL/CONDITIONAL verdicts plus reasoning; voice notes OK
[Next] Write PromptFoo rubrics (YAML): structured scoring rubrics for product-scope and sales/BD
[Next] Build custom PromptFoo provider: wraps OpenClaw so PromptFoo can test end-to-end (see the sketch after this list)
[Next] First agreement measurement round: LLM judge vs. Tony's verdicts, targeting >90% agreement
[Future] Wire into CI (GitHub Actions): auto-run evals on every PR that touches prompts/SOPs/config
[Future] Expand SME coverage (Charlie + Victor): mine engineering & process judgments into the corpus
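
For reference, PromptFoo's custom Python provider boils down to a single call_api function; how the OpenClaw agent is invoked behind it is an assumption here:

```python
# warren_provider.py: registered in promptfooconfig.yaml as
# "file://warren_provider.py". PromptFoo calls call_api() once per test case.

def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for invoking the OpenClaw agent; the real
    integration (CLI call, HTTP endpoint, etc.) is not specified here."""
    raise NotImplementedError

def call_api(prompt: str, options: dict, context: dict) -> dict:
    # PromptFoo expects a dict with an "output" key (plus optional fields
    # such as "error" or "tokenUsage").
    return {"output": run_agent(prompt)}
```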
4 · Meeting Action Items