⚡ Tony Meeting Briefing

Wednesday, April 29, 2026 · Prepared by Warren · 4 agenda topics
1 · Google Workspace Connection (Authenticated)

Warren already has authenticated access to Google Workspace via OAuth2. This extends agent capabilities beyond Slack into the full productivity suite.

📁 Google Drive: List, search, upload, download, share
📝 Google Docs: Create, read, export (PDF/text)
📊 Google Sheets: Read, write, append data
📧 Gmail: Read and send emails
📅 Calendar: List events, create meetings
Key takeaway: Warren goes from "lives in Slack" to "operates across the full Google Workspace." If a client uses GWS, agents can read and publish directly in their environment.
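
As a concrete example, here is a minimal sketch of the Drive capability in code, assuming the authorized OAuth2 token is stored at a hypothetical token.json and google-api-python-client is installed; the search query is illustrative:

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Load the already-authorized OAuth2 token (path is hypothetical).
creds = Credentials.from_authorized_user_file("token.json")
drive = build("drive", "v3", credentials=creds)

# Illustrative query: Google Docs whose name mentions "proposal".
results = drive.files().list(
    q="mimeType='application/vnd.google-apps.document' and name contains 'proposal'",
    pageSize=10,
    fields="files(id, name, modifiedTime)",
).execute()

for f in results.get("files", []):
    print(f["name"], f["modifiedTime"])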
2 · Tiffany: Warren on Tony's Mac Studio (Ready to deploy)

"Tiffany" is the name for a local Warren instance running on Tony's Mac Studio. Same architecture, same skill system — running independently on Tony's hardware.

  • Which LLM provider? API keys needed (Anthropic / OpenAI / Together / local via Ollama)
  • Which channels? Slack only, or also Signal, WhatsApp, etc.?
  • Independent or connected? Isolated agent vs. shared memory with DGX Warren
  • Who configures? Charlie or Warren can guide Tony remotely
Valent connection: This same process is what the $5K self-install POC requires — install OpenClaw in their env, create an agent with relevant skills, connect to their channels. Tony's Mac is a dry run.
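
None of these choices block the setup; they amount to a handful of config values. A purely hypothetical sketch (OpenClaw's real config schema is not shown in this briefing):

```python
# Hypothetical sketch only: OpenClaw's actual config schema is not shown
# in this briefing. This just makes the four decision points concrete.
tiffany_config = {
    "llm_provider": "anthropic",         # or "openai", "together", "ollama" (local)
    "api_key_env": "ANTHROPIC_API_KEY",  # Tony supplies the key
    "channels": ["slack"],               # later: "signal", "whatsapp"
    "memory": "isolated",                # vs. shared memory with DGX Warren
}
```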
3 · Warren Evals: How We Test Agent Judgment (Awaiting Tony)
The problem: Warren produces 100+ outputs per day. Tony can't review everything. We need an automated system that catches quality regressions — not just "did it run?" but "did it make the right call?"

❌ Without Evals

  • Tony catches problems by accident
  • Bad outputs ship before anyone notices
  • Config changes break judgment silently
  • No way to measure improvement
  • "Trust me" is the only evidence

✅ With Evals

  • Every change tested automatically
  • LLM judge scores using Tony's criteria
  • Regressions caught before clients see them
  • Measurable agreement rate vs. human
  • Data-driven evidence of quality
🗂️ Capture → 📏 Encode → 🧪 Test → 📊 Measure → 🔁 Iterate
This loop runs on every change to Warren's prompts, SOPs, or config
Step 1 · Capture: Mine real human judgments
We extracted 68 of Tony's decisions from session history: every pass, fail, correction, and teaching moment. Each record contains the agent output, Tony's verdict, and his reasoning.
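
Each judgment can live as one small structured record. An illustrative shape (field names are assumptions, not the actual corpus schema):

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    """Illustrative record shape for one mined judgment; the field names
    are assumptions, not the actual corpus schema."""
    agent_output: str  # what Warren produced
    verdict: str       # "PASS", "FAIL", or "CONDITIONAL"
    reasoning: str     # Tony's stated rationale
    domain: str        # e.g. "product-scope" or "sales/BD"
```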
Step 2 · Encode: Turn judgment into scoring rubrics
Tony's teaching entries become structured YAML rubric criteria that an LLM judge can use to score any output. The rubrics ARE Tony's judgment, codified.
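
For illustration, here is one way a teaching entry could render as rubric criteria, in the style of a PromptFoo llm-rubric assertion; the exact schema in the ops repo may differ:

```python
import yaml  # PyYAML

# Illustrative only: one teaching entry rendered as rubric criteria in the
# style of a PromptFoo llm-rubric assertion. The real rubrics may differ.
rubric = {
    "assert": [
        {"type": "llm-rubric",
         "value": "Does not claim work is 'done' before it has been verified"},
        {"type": "llm-rubric",
         "value": "Scopes the work to the customer's real need, not just the stated ask"},
    ],
}
print(yaml.safe_dump(rubric, sort_keys=False))
```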
Step 3 · Test: LLM judge scores against rubrics
A different model (not the one Warren uses) reads the rubric plus the agent output and scores it. Cross-model judging means no "grading your own homework." Binary verdict: PASS or FAIL.
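
A sketch of the judging call, assuming an Anthropic-hosted judge; the model name is illustrative, and any capable model other than Warren's own would do:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(rubric: str, agent_output: str) -> str:
    """Score one agent output against one rubric. Binary PASS/FAIL by design;
    the judge model is deliberately not the model Warren runs on."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative judge-model choice
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"Rubric:\n{rubric}\n\nAgent output:\n{agent_output}\n\n"
                "Answer with exactly one word: PASS or FAIL."
            ),
        }],
    )
    return msg.content[0].text.strip()
```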
Step 4 · Measure: Compare LLM judge vs. human verdicts
Does the judge agree with Tony? Target: >90% agreement. First result: 100% on easy cases, but the sample is small; we need harder cases to stress-test.
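
The agreement metric itself is simple; a sketch:

```python
def agreement_rate(judge_verdicts: list[str], human_verdicts: list[str]) -> float:
    """Fraction of cases where the LLM judge matches the human verdict."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

# Gate: don't trust the judge on new outputs until this exceeds 0.90.
```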
Step 5 · Iterate: Disagreements → refine the rubric
Every disagreement reveals a rubric gap. Fix it, re-test, repeat. The 8 calibration scenarios we sent Tony are designed to find exactly these gaps.
L1 · Mechanical Execution: Did it use the right tools in the right order? Fully automated. ("Did it commit within 10 min?" "Did it push?")
L2 · Process Judgment: Did it follow the right process? Semi-automated with rubrics. ("Correct label?" "Right SOP?" "Proper routing?")
L3 · Product Judgment: Is it building the right thing? Requires human calibration. ("Right scope?" "Read the customer's real need?")
The analogy: WWTD is the law. Evals are the court system. Having laws isn't enough — you need a system that consistently interprets and enforces them. WWTD tells Warren what good looks like. Evals prove whether outputs match.

📖 WWTD Alone

  • Static document — a RAG knowledge base
  • Describes Tony's principles in prose
  • Warren reads it and tries to follow
  • No verification it's actually working
  • Only covers Tony's domains
  • Conflicting principles — no resolution

🧪 WWTD + Evals

  • WWTD = the rules; Evals = enforcement
  • Automated testing proves compliance
  • Catches regressions on every change
  • Covers ALL domains (product + tech + process)
  • Multiple SMEs calibrate the system
  • Scenarios resolve principle conflicts
🎯 Tony: Product scope, sales/BD, customer judgment, design bar
⚙️ Charlie: Pipeline, architecture, code quality, technical correctness
📋 Victor: Process, delivery, operational quality, coordination
68 historical judgments mined · 100% judge agreement (easy set) · 8 calibration scenarios pending

Each scenario pits two of Tony's principles against each other. No obvious right answer — that's the point. Tony's verdict becomes ground truth for the rubrics.

Scenario 1 · Scope vs. Relationship: Build an unnecessary layer to protect the client's concerns?
Scenario 2 · Effort-Value vs. Stated Need: Build what they asked for, or what they actually need?
Scenario 3 · Speed vs. Design Bar: Add features or polish visuals before the demo?
Scenario 4 · Known Pattern vs. New Info: Follow the rule or adapt to the situation?
Scenario 5 · Truth vs. Perception: Present accurate data that embarrasses a stakeholder?
Scenario 6 · Build for Need vs. Trust: Minimal viable, or production-grade reliability?
Scenario 7 · Show Don't Tell vs. Readiness: Demo when they asked for a document?
Scenario 8 · Confidence Inversion: Safe deliverable, or insight-driven risk?
L1 · Deterministic: Programmatic asserts, no LLM. Runs identically every time. ("Committed in <10 min?" "Branch pushed?" "Label correct?") See the sketch after this list.
L2 · Near-Deterministic: Pattern matching + rules for routing/SOP. Lightweight LLM judge only on edge cases. ("Right SOP dispatched?" "Pipeline routing correct?")
L3 · Probabilistic: LLM judge evaluating judgment. Inherently non-deterministic, by design. ("Right scope?" "Read the customer's real need?")
Why L3 can't be deterministic — and why that's OK: Evaluating judgment requires judgment. An LLM scoring another LLM will have variance between runs. Our design decisions minimize this:
Binary PASS/FAIL: No 1-5 scale that fluctuates. Clean signal, less noise.
Cross-Model Judging: A different model judges the agent, avoiding self-agreement bias.
False Pass Rate <5%: The binding constraint; catching bad outputs matters more than missing good ones.
Agreement Rate >90%: Target vs. human verdicts. Accepts a margin; doesn't pretend to be perfect.
Future enhancement: Run the judge N times per evaluation and use majority vote for even higher reliability. Not needed until the basic calibration loop is proven.
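
A sketch of that enhancement, reusing the judge() function sketched under Step 3:

```python
from collections import Counter

def majority_judge(rubric: str, agent_output: str, n: int = 3) -> str:
    """Run the non-deterministic judge n times and return the majority
    verdict, reducing run-to-run variance. judge() is the cross-model
    scorer sketched in Step 3."""
    votes = Counter(judge(rubric, agent_output) for _ in range(n))
    return votes.most_common(1)[0][0]
```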
[Done] Framework design & architecture doc: full eval framework proposal in the ops repo
[Done] 68 Tony judgments mined & structured: 19 FAILs, 7 corrections, 6 PASSes, 36 teaching directives
[Done] Reply-audit hook (live behavioral monitor): audits every message for anti-patterns such as permission loops and false "done" claims
[Done] 8 calibration scenarios sent to Tony: principle-conflict scenarios for rubric stress-testing
[Now] Awaiting Tony's responses: need PASS/FAIL/CONDITIONAL verdicts plus reasoning; voice notes OK
[Next] Write PromptFoo rubrics (YAML): structured scoring rubrics for product-scope and sales/BD
[Next] Build custom PromptFoo provider: wraps OpenClaw so PromptFoo can test end-to-end (see the sketch after this list)
[Next] First agreement measurement round: LLM judge vs. Tony's verdicts, targeting >90% agreement
[Future] Wire into CI (GitHub Actions): auto-run evals on every PR that touches prompts/SOPs/config
[Future] Expand SME coverage (Charlie + Victor): mine engineering & process judgments into the corpus
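
For reference, PromptFoo's custom Python provider boils down to a single call_api function; how the OpenClaw agent is invoked behind it is an assumption here:

```python
# warren_provider.py: registered in promptfooconfig.yaml as
# "file://warren_provider.py". PromptFoo calls call_api() once per test case.

def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for invoking the OpenClaw agent; the real
    integration (CLI call, HTTP endpoint, etc.) is not specified here."""
    raise NotImplementedError

def call_api(prompt: str, options: dict, context: dict) -> dict:
    # PromptFoo expects a dict with an "output" key (plus optional fields
    # such as "error" or "tokenUsage").
    return {"output": run_agent(prompt)}
```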
4 · Meeting Action Items