Warren already has authenticated access to Google Workspace via OAuth2. This extends agent capabilities beyond Slack into the full productivity suite.
Active Scopes
- 📁 Google Drive: list, search, upload, download, share
- 📝 Google Docs: create, read, export (PDF/text)
- 📊 Google Sheets: read, write, append data
- 📧 Gmail: read and send emails
- 📅 Calendar: list events, create meetings
What This Enables
- Agents publish deliverables directly to Google Docs shared with clients (not just Slack file drops)
- Calendar integration: auto-create meetings, read team agenda, schedule follow-ups
- Drive as a deliverable storage layer — especially for clients in Google Workspace
- Gmail automation: programmatic follow-ups, summaries, notifications
- Read existing team docs as input for analyses — no manual copy-paste
Key takeaway: Warren goes from "lives in Slack" to "operates across the full Google Workspace." If a client uses GWS, agents can read and publish directly in their environment.
"Tiffany" is the name for a local Warren instance running on Tony's Mac Studio. Same architecture, same skill system — running independently on Tony's hardware.
What Already Exists
- Built a Warren clone ("Tiffany") on April 20 on the DGX Spark as a proof of concept
- Full workspace cloned: AGENTS.md, SOUL.md, TOOLS.md, progressive disclosure, skills
- Bound to a dedicated Slack channel with channel-based routing
- Since evolved to "Stacey" (💎) — second agent live alongside Warren on DGX
What a Mac Studio Install Looks Like
- Runtime: Node.js 22+ (Apple Silicon native), OpenClaw macOS app or CLI
- Setup time: ~30 min for technical install, longer for specialization
- Process: Install OpenClaw → openclaw setup → configure Slack → clone workspace → add API keys
- Routing: Tony's DMs or a dedicated channel → Tiffany on Mac; other channels → Warren on DGX
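The routing rule above can be sketched as a simple dispatch function. This is a hypothetical illustration of the described behavior; the channel and user IDs are made up, not real Slack identifiers.

```python
# Hypothetical sketch of the channel-based routing described above:
# Tony's DMs and one dedicated channel go to Tiffany on the Mac Studio,
# everything else to Warren on the DGX. IDs below are placeholders.

TIFFANY_CHANNEL = "C_TIFFANY"   # hypothetical dedicated Slack channel ID
TONY_USER = "U_TONY"            # hypothetical Slack user ID for Tony

def route(channel_id: str, is_dm: bool, user_id: str) -> str:
    """Return which agent instance handles an incoming message."""
    if is_dm and user_id == TONY_USER:
        return "tiffany"        # Tony's DMs -> local Mac instance
    if channel_id == TIFFANY_CHANNEL:
        return "tiffany"        # dedicated channel -> local Mac instance
    return "warren"             # everything else -> Warren on DGX
```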
Decisions Needed
- Which LLM provider? API keys needed (Anthropic / OpenAI / Together / local via Ollama)
- Which channels? Slack only, or also Signal, WhatsApp, etc.?
- Independent or connected? Isolated agent vs. shared memory with DGX Warren
- Who configures? Charlie or Warren can guide Tony remotely
Valent connection: This same process is what the $5K self-install POC requires — install OpenClaw in their env, create an agent with relevant skills, connect to their channels. Tony's Mac is a dry run.
The problem: Warren produces 100+ outputs per day. Tony can't review everything. We need an automated system that catches quality regressions — not just "did it run?" but "did it make the right call?"
❌ Without Evals
- Tony catches problems by accident
- Bad outputs ship before anyone notices
- Config changes break judgment silently
- No way to measure improvement
- "Trust me" is the only evidence
✅ With Evals
- Every change tested automatically
- LLM judge scores using Tony's criteria
- Regressions caught before clients see them
- Measurable agreement rate vs. human
- Data-driven evidence of quality
The Eval Loop — How It Works
This loop runs on every change to Warren's prompts, SOPs, or config
Step 1 — Capture: Mine real human judgments
We extracted 68 of Tony's decisions from session history — every pass, fail, correction, and teaching moment. Each has: agent output, Tony's verdict, and his reasoning.
Step 2 — Encode: Turn judgment into scoring rubrics
Tony's teaching entries become structured YAML rubric criteria that an LLM judge can use to score any output. The rubrics ARE Tony's judgment, codified.
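An encoded criterion might look like this. The schema is a hypothetical sketch (field names are illustrative, not a finalized format), and the provenance line is a placeholder:

```yaml
# Hypothetical rubric criterion derived from one of Tony's teaching entries.
- id: scope-minimal-viable
  source: "Tony correction, session history"   # provenance of the judgment
  criterion: >
    The output solves the customer's stated problem without adding
    speculative layers, features, or abstractions they did not ask for.
  fail_if: "Output introduces components with no traceable customer need."
  verdict: binary   # PASS or FAIL, no graded scale
```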
Step 3 — Test: LLM judge scores against rubrics
A different model (not the one Warren uses) reads rubric + agent output and scores it. Cross-model judging = no "grading your own homework." Binary: PASS or FAIL.
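The judging step can be sketched as below. The model call itself is stubbed out (in practice a different provider/model than Warren's would be invoked); the prompt shape and function names are illustrative assumptions.

```python
# Sketch of the cross-model judging step. The judge model call is stubbed;
# prompt wording and function names are illustrative, not the real harness.

def judge_prompt(rubric: str, output: str) -> str:
    """Assemble the prompt the judge model sees: rubric + agent output."""
    return (
        "You are grading an agent's output against a rubric.\n"
        f"RUBRIC:\n{rubric}\n\nOUTPUT:\n{output}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )

def parse_verdict(judge_response: str) -> str:
    """Force the judge's free text down to a binary verdict."""
    text = judge_response.strip().upper()
    if text.startswith("PASS"):
        return "PASS"
    if text.startswith("FAIL"):
        return "FAIL"
    raise ValueError(f"Unparseable judge response: {judge_response!r}")
```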
Step 4 — Measure: Compare LLM judge vs. human verdicts
Does the judge agree with Tony? Target: >90% agreement. First result: 100% on easy cases — but small sample. Need harder cases to stress-test.
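The agreement measurement itself is simple arithmetic: the fraction of outputs where the judge's verdict matches the human's. A minimal sketch (the verdict pairs in the example comment are illustrative):

```python
# Agreement rate: compare the LLM judge's verdicts to the human's on the
# same set of outputs. Example figures in the comment are illustrative.

def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of cases where the judge matches the human verdict."""
    if len(human) != len(judge) or not human:
        raise ValueError("Need two non-empty verdict lists of equal length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

# e.g. 9 agreements out of 10 cases -> 0.9, just under the >90% target
```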
Step 5 — Iterate: Disagreements → refine the rubric
Every disagreement reveals a rubric gap. Fix it, re-test, repeat. The 8 calibration scenarios we sent Tony are designed to find exactly these gaps.
Three Layers of Evaluation
- L1 Mechanical Execution: Did it use the right tools in the right order? Fully automated. ("Did it commit within 10 min?" "Did it push?")
- L2 Process Judgment: Did it follow the right process? Semi-automated with rubrics. ("Correct label?" "Right SOP?" "Proper routing?")
- L3 Product Judgment: Is it building the right thing? Requires human calibration. ("Right scope?" "Read the customer's real need?")
Why Calibrate Beyond WWTD?
The analogy: WWTD is the law. Evals are the court system. Having laws isn't enough — you need a system that consistently interprets and enforces them. WWTD tells Warren what good looks like. Evals prove whether outputs match.
📖 WWTD Alone
- Static document — a RAG knowledge base
- Describes Tony's principles in prose
- Warren reads it and tries to follow
- No verification it's actually working
- Only covers Tony's domains
- Conflicting principles — no resolution
🧪 WWTD + Evals
- WWTD = the rules; Evals = enforcement
- Automated testing proves compliance
- Catches regressions on every change
- Covers ALL domains (product + tech + process)
- Multiple SMEs calibrate the system
- Scenarios resolve principle conflicts
Full SME Coverage Map
- 🎯 Tony: product scope, sales/BD, customer judgment, design bar
- ⚙️ Charlie: pipeline, architecture, code quality, technical correctness
- 📋 Victor: process, delivery, operational quality, coordination
Current Numbers
- 68 historical judgments mined
- 100% judge agreement (easy set)
- 8 calibration scenarios pending
The 8 Calibration Scenarios — Sent to Tony Today
Each scenario pits two of Tony's principles against each other. No obvious right answer — that's the point. Tony's verdict becomes ground truth for the rubrics.
- Scenario 1 (Scope vs. Relationship): Build an unnecessary layer to address the client's concerns?
- Scenario 2 (Effort-Value vs. Stated Need): Build what they asked for, or what they actually need?
- Scenario 3 (Speed vs. Design Bar): Add features or polish visuals before the demo?
- Scenario 4 (Known Pattern vs. New Info): Follow the rule or adapt to the situation?
- Scenario 5 (Truth vs. Perception): Present accurate data that embarrasses a stakeholder?
- Scenario 6 (Build for Need vs. Trust): Minimal viable, or production-grade reliability?
- Scenario 7 (Show Don't Tell vs. Readiness): Demo when they asked for a document?
- Scenario 8 (Confidence Inversion): Safe deliverable, or insight-driven risk?
Determinism by Layer — What's Guaranteed vs. What's Probabilistic
- ✓ L1 — Deterministic: Programmatic asserts, no LLM. Runs identically every time. ("Committed in <10 min?" "Branch pushed?" "Label correct?")
- ~ L2 — Near-Deterministic: Pattern matching + rules for routing/SOP. Lightweight LLM judge only on edge cases. ("Right SOP dispatched?" "Pipeline routing correct?")
- ≈ L3 — Probabilistic: LLM judge evaluating judgment. Inherently non-deterministic — by design. ("Right scope?" "Read the customer's real need?")
Why L3 can't be deterministic — and why that's OK: Evaluating judgment requires judgment. An LLM scoring another LLM will have variance between runs. Our design decisions minimize this:
- Binary PASS/FAIL: No 1-5 scale that fluctuates. Clean signal, less noise.
- Cross-Model Judging: A different model judges the agent — avoids self-agreement bias.
- False Pass Rate <5%: Constraint: catching bad outputs matters more than missing good ones.
- Agreement Rate >90%: Target vs. human verdicts. Accepts margin — doesn't pretend to be perfect.
Future enhancement: Run the judge N times per evaluation and use majority vote for even higher reliability. Not needed until the basic calibration loop is proven.
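The majority-vote idea reduces to taking the most common verdict across N runs; an odd N avoids ties on a binary verdict. A minimal sketch (function name is hypothetical):

```python
# Sketch of the future majority-vote enhancement: run the judge N times and
# keep the most common verdict, damping run-to-run variance. Odd N avoids
# ties on a binary PASS/FAIL verdict.

from collections import Counter

def majority_verdict(verdicts: list[str]) -> str:
    """Collapse N binary judge runs into a single PASS/FAIL verdict."""
    if len(verdicts) % 2 == 0:
        raise ValueError("Use an odd number of runs to avoid ties")
    return Counter(verdicts).most_common(1)[0][0]
```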
Status & What's Left
- Framework design & architecture doc [Done]: Full eval framework proposal in ops repo
- 68 Tony judgments mined & structured [Done]: 19 FAILs, 7 corrections, 6 PASSes, 36 teaching directives
- Reply-audit hook (live behavioral monitor) [Done]: Audits every message for anti-patterns: permission loops, false "done" claims
- 8 calibration scenarios sent to Tony [Done]: Principle-conflict scenarios for rubric stress-testing
- Awaiting Tony's responses [Now]: Need PASS/FAIL/CONDITIONAL + reasoning. Voice notes OK.
- Write PromptFoo rubrics (YAML) [Next]: Structured scoring rubrics for product-scope and sales/BD
- Build custom PromptFoo provider [Next]: Wraps OpenClaw so PromptFoo can test end-to-end
- First agreement measurement round [Next]: LLM judge vs. Tony verdicts — target >90% agreement
- Wire into CI (GitHub Actions) [Future]: Auto-run evals on every PR that touches prompts/SOPs/config
- Expand SME coverage (Charlie + Victor) [Future]: Mine engineering & process judgments into the corpus