# PromptFoo Integration for Conductor Prompts
## Overview

This document describes the PromptFoo integration for evaluating and optimizing the Conductor's automated builder prompts.
## Problem Context

The Conductor generates large automated prompts (27-39KB) that are sent to Claude Code agents. Both the FS-3 and FS-4 tickets failed, with builders silently exiting after reading these prompts. This integration helps us:

- **Evaluate prompt quality** - test different prompt variations
- **Optimize for reliability** - find prompts that work consistently
- **Reduce costs** - produce smaller, more efficient prompts
- **Prevent regressions** - test prompt changes before deployment
## Architecture

### Prompt Templates

Located in `conductor/prompts/`:

**`builder-base.txt`** - Baseline (current verbose format)

- Full JSON dumps of ticket, learnings, and checklist
- Extensive Redis progress-tracking instructions
- ~30-40KB when rendered

**`builder-compact.txt`** - Optimized compact version

- Essential information only
- Simplified structure
- ~10-15KB when rendered

**`builder-structured.txt`** - Well-organized markdown

- Clear sections with emoji markers
- Better hierarchy and readability
- ~15-20KB when rendered
### Test Cases

Located in `promptfooconfig.yaml`:

- **FS-4**: Simple bug fix (magic link 401 error)
- **FS-3**: Complex refactoring (50 TypeScript errors)
- **Edge case**: Incomplete ticket information
### Evaluation Criteria

Each test case assesses:

- ✅ **Understanding**: Does the agent comprehend the task?
- ✅ **Planning**: Does the agent create an actionable plan?
- ✅ **Reliability**: Does the agent avoid giving up?
- ✅ **Cost**: Stays under budget ($1-2 per test)
- ✅ **Quality**: Mentions relevant technical approaches
## Setup

### Prerequisites

```bash
# 1. Install dependencies (already done)
npm install --save-dev promptfoo

# 2. Set Anthropic API key
export ANTHROPIC_API_KEY="your-api-key-here"

# Or add to conductor/.env
echo "ANTHROPIC_API_KEY=your-api-key" >> .env
```

### Running Evaluations
```bash
# Evaluate all prompt variations
npm run promptfoo:eval

# View results in browser
npm run promptfoo:view

# Run and view in one command
npm run promptfoo:compare
```

### Results Structure
Evaluations output to `promptfoo-results.json`:

```json
{
  "results": {
    "baseline-verbose": {
      "tests": [
        {
          "description": "FS-4: Magic link 401 error bug",
          "score": 0.85,
          "pass": true,
          "cost": 1.23
        }
      ],
      "stats": {
        "avgScore": 0.82,
        "passRate": 0.67,
        "avgCost": 1.45
      }
    },
    "compact-focused": {
      /* ... */
    },
    "structured-markdown": {
      /* ... */
    }
  }
}
```
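To compare variants programmatically rather than in the browser, the results file can be ranked by its per-template stats. A minimal sketch, assuming the exact JSON layout shown above (field names such as `avgScore` and `passRate` are taken from that example and may differ in real promptfoo output):

```typescript
// compare-results.ts - rank prompt templates by the stats in promptfoo-results.json.
// Assumes the JSON layout shown above; adjust the types if the real output differs.
import { readFileSync } from "node:fs";

interface TemplateStats {
  avgScore: number;
  passRate: number;
  avgCost: number;
}

interface ResultsFile {
  results: Record<string, { stats: TemplateStats }>;
}

const file: ResultsFile = JSON.parse(
  readFileSync("promptfoo-results.json", "utf8"),
);

// Sort by pass rate first, then by average score.
const ranked = Object.entries(file.results).sort(
  ([, a], [, b]) =>
    b.stats.passRate - a.stats.passRate || b.stats.avgScore - a.stats.avgScore,
);

for (const [name, { stats }] of ranked) {
  console.log(
    `${name}: pass=${(stats.passRate * 100).toFixed(0)}% ` +
      `score=${stats.avgScore.toFixed(2)} cost=$${stats.avgCost.toFixed(2)}`,
  );
}
```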
## Integration with Conductor

### Current Flow

```
Conductor
└─> buildBuilderContext() in agent-manager.ts:451-568
    └─> Generates large prompt with JSON dumps
        └─> Writes to temp file
            └─> Builder reads and (sometimes) exits silently
```

### Optimized Flow (Proposed)
```
Conductor
└─> buildBuilderContext() using BEST_TEMPLATE
    └─> Renders compact template with essential data
        └─> Validates prompt < 15KB
            └─> Writes to temp file
                └─> Builder successfully processes
```

### Implementation Plan
- Identify the best template (run evaluations)
- Refactor `buildBuilderContext()` to use the winning template
- Extract essential fields from Jira/learnings
- Add prompt size validation (warn if > 20KB); a sketch follows this list
- Monitor real-world performance
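A minimal sketch of the rendering-plus-validation step, assuming the templates use simple `{{variable}}` placeholders and that the essential ticket fields have already been extracted. The names `renderTemplate` and `EssentialTicket`, and the field list, are illustrative, not the actual `agent-manager.ts` API:

```typescript
// prompt-builder.ts - illustrative sketch, not the actual agent-manager.ts code.
import { readFileSync } from "node:fs";

// Hypothetical shape of the essential fields extracted from Jira/learnings.
interface EssentialTicket {
  ticket_id: string;
  ticket_summary: string;
  acceptance_criteria: string;
  key_learnings: string;
}

const MAX_PROMPT_BYTES = 20 * 1024; // warn above 20KB, per the plan above

// Fill {{placeholder}} slots in a template file with the ticket fields.
function renderTemplate(templatePath: string, vars: EssentialTicket): string {
  const template = readFileSync(templatePath, "utf8");
  const rendered = template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in vars ? String(vars[key as keyof EssentialTicket]) : match,
  );

  const bytes = Buffer.byteLength(rendered, "utf8");
  if (bytes > MAX_PROMPT_BYTES) {
    console.warn(
      `Prompt is ${(bytes / 1024).toFixed(1)}KB, above the ${
        MAX_PROMPT_BYTES / 1024
      }KB target`,
    );
  }
  return rendered;
}

// Example: render the compact template for a ticket (placeholder values only).
const prompt = renderTemplate("prompts/builder-compact.txt", {
  ticket_id: "FS-4",
  ticket_summary: "Magic link 401 error",
  acceptance_criteria: "...",
  key_learnings: "...",
});
console.log(`Rendered prompt: ${(Buffer.byteLength(prompt, "utf8") / 1024).toFixed(1)}KB`);
```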
## Continuous Optimization

### Adding New Test Cases

Edit `promptfooconfig.yaml`:
```yaml
tests:
  - description: 'Your new test case'
    vars:
      ticket_id: 'FS-X'
      ticket_summary: 'Brief summary'
      # ... other vars
    assert:
      - type: llm-rubric
        value: 'What success looks like'
        threshold: 0.8
```

### Testing Real Tickets
```bash
# Export real ticket to test case
node scripts/export-test-case.js FS-5 >> promptfooconfig.yaml

# Run evaluation
npm run promptfoo:eval
```
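The export script itself is not shown in this document; the sketch below is one hypothetical way it could work, assuming ticket data is already cached locally as JSON (the real `scripts/export-test-case.js` may read from Jira instead, and the field names here are assumptions):

```typescript
// export-test-case.ts - hypothetical sketch of scripts/export-test-case.js.
// Usage: node export-test-case.js FS-5 >> promptfooconfig.yaml
import { readFileSync } from "node:fs";

// Assumed local cache of ticket data, e.g. tickets/FS-5.json.
interface Ticket {
  id: string;
  summary: string;
}

const ticketId = process.argv[2];
if (!ticketId) {
  console.error("Usage: export-test-case <ticket-id>");
  process.exit(1);
}

const ticket: Ticket = JSON.parse(
  readFileSync(`tickets/${ticketId}.json`, "utf8"),
);

// Emit a YAML stanza matching the tests: list format shown earlier.
const stanza = [
  `  - description: '${ticket.id}: ${ticket.summary}'`,
  `    vars:`,
  `      ticket_id: '${ticket.id}'`,
  `      ticket_summary: '${ticket.summary}'`,
  `    assert:`,
  `      - type: llm-rubric`,
  `        value: 'Agent produces an actionable plan for ${ticket.id}'`,
  `        threshold: 0.8`,
].join("\n");

console.log(stanza);
```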
### CI/CD Integration

Add to `.github/workflows/test.yml`:
```yaml
- name: Evaluate Prompts
  run: npm run promptfoo:eval
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- name: Check Quality Threshold
  run: |
    pass_rate=$(jq '.stats.passRate' promptfoo-results.json)
    if (( $(echo "$pass_rate < 0.8" | bc -l) )); then
      echo "Prompt quality below threshold: $pass_rate"
      exit 1
    fi
```

## Metrics to Track
### Prompt Quality Metrics

- **Pass Rate**: % of tests passing all assertions
- **Average Score**: Mean score across all rubric evaluations
- **Cost per Test**: API costs (target: < $1.50)
- **Prompt Size**: Rendered size in KB (target: < 15KB)
### Real-World Metrics

After deploying the optimized templates, track:

- **Builder Success Rate**: % of builders completing tasks
- **Time to First Action**: How long until the builder starts work
- **Silent Exit Rate**: % of builders exiting without output
- **Average Task Duration**: Time from spawn to completion
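As a rough illustration of how these could be computed, assuming the Conductor keeps one record per builder run with timestamps and an exit status (the record shape below is hypothetical):

```typescript
// builder-metrics.ts - illustrative only; the Conductor's actual run records may differ.
interface BuilderRun {
  spawnedAt: number;        // epoch ms
  firstActionAt?: number;   // undefined if the builder never acted
  completedAt?: number;     // undefined if the builder never finished
  succeeded: boolean;
  producedOutput: boolean;  // false means a silent exit
}

function summarize(runs: BuilderRun[]) {
  const pct = (n: number) =>
    runs.length ? `${((n / runs.length) * 100).toFixed(1)}%` : "n/a";
  const avgMinutes = (xs: number[]) =>
    xs.length
      ? (xs.reduce((a, b) => a + b, 0) / xs.length / 60_000).toFixed(1)
      : "n/a";

  return {
    builderSuccessRate: pct(runs.filter((r) => r.succeeded).length),
    silentExitRate: pct(runs.filter((r) => !r.producedOutput).length),
    // Minutes from spawn to first action, over runs that acted at all.
    avgTimeToFirstActionMin: avgMinutes(
      runs
        .filter((r) => r.firstActionAt !== undefined)
        .map((r) => r.firstActionAt! - r.spawnedAt),
    ),
    // Minutes from spawn to completion, over runs that finished.
    avgTaskDurationMin: avgMinutes(
      runs
        .filter((r) => r.completedAt !== undefined)
        .map((r) => r.completedAt! - r.spawnedAt),
    ),
  };
}
```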
## Troubleshooting

### Common Issues

**Issue**: `ANTHROPIC_API_KEY is not set`

```bash
# Solution: Export the key
export ANTHROPIC_API_KEY="sk-ant-..."
```

**Issue**: YAML parse error
```bash
# Solution: Validate YAML syntax (note the -- so npm forwards the flag)
npm run promptfoo:eval -- --debug
```

**Issue**: `File not found: prompts/...`
```bash
# Solution: Check working directory
cd /path/to/project/conductor
npm run promptfoo:eval
```

### Debugging
```bash
# Run with debug logging
DEBUG=promptfoo:* npm run promptfoo:eval

# Test single prompt
promptfoo eval --prompts 'file://prompts/builder-compact.txt'

# Test single test case
promptfoo eval --filter 'FS-4'
```

## Next Steps
- Set `ANTHROPIC_API_KEY` and run evaluations
- Analyze results to identify the best template
- Refactor `agent-manager.ts` to use the winning template
- Deploy to production and monitor metrics
- Iterate based on real-world performance
## References

- **PromptFoo Docs**: https://promptfoo.dev
- **Conductor Source**: `src/agent-manager.ts:451-568`
- **Prompt Templates**: `prompts/*.txt`
- **Test Configuration**: `promptfooconfig.yaml`
