
PromptFoo Integration for Conductor Prompts

Overview

This document describes the PromptFoo integration for evaluating and optimizing the Conductor's automated builder prompts.

Problem Context

The Conductor generates large automated prompts (27-39KB) that are sent to Claude Code agents. Both the FS-3 and FS-4 tickets failed because their builders exited silently after reading these prompts. This integration helps us:

  1. Evaluate prompt quality - Test different prompt variations
  2. Optimize for reliability - Find prompts that work consistently
  3. Reduce costs - Smaller, more efficient prompts
  4. Prevent regressions - Test prompt changes before deployment

Architecture

Prompt Templates

Located in conductor/prompts/ (a hypothetical excerpt of the compact template follows this list):

  • builder-base.txt - Baseline (current verbose format)

    • Full JSON dumps of ticket, learnings, checklist
    • Extensive Redis progress tracking instructions
    • ~30-40KB when rendered
  • builder-compact.txt - Optimized compact version

    • Essential information only
    • Simplified structure
    • ~10-15KB when rendered
  • builder-structured.txt - Well-organized markdown

    • Clear sections with emoji markers
    • Better hierarchy and readability
    • ~15-20KB when rendered
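
The compact template keeps only what the builder needs to start work. A hypothetical excerpt of what builder-compact.txt might look like (promptfoo fills the {{ ... }} placeholders from each test case's vars; progress_instructions is an assumed variable name, not taken from the actual template):

txt
You are an automated builder working on ticket {{ ticket_id }}.

## Summary
{{ ticket_summary }}

## What to do
1. Read the summary and plan your changes.
2. Implement the fix and run the relevant tests.
3. Report progress as instructed below.

## Progress reporting
{{ progress_instructions }}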

Test Cases

Located in promptfooconfig.yaml (a config sketch follows this list):

  1. FS-4: Simple bug fix (magic link 401 error)
  2. FS-3: Complex refactoring (50 TypeScript errors)
  3. Edge case: Incomplete ticket information
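
A minimal sketch of how the templates and test cases might be wired together in promptfooconfig.yaml (the provider id, the ticket_summary wording, and the rubric text are illustrative assumptions, not copied from the actual config):

yaml
prompts:
  - file://prompts/builder-base.txt
  - file://prompts/builder-compact.txt
  - file://prompts/builder-structured.txt

providers:
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - description: 'FS-4: Magic link 401 error bug'
    vars:
      ticket_id: 'FS-4'
      ticket_summary: 'Magic link sign-in returns 401 for valid tokens'
    assert:
      - type: llm-rubric
        value: 'Agent identifies the 401 as a token validation issue and proposes a concrete fix'
        threshold: 0.8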

Evaluation Criteria

Each test case assesses the following (an assertion sketch follows this list):

  • Understanding: Does the agent comprehend the task?
  • Planning: Does the agent create an actionable plan?
  • Reliability: Does the agent avoid giving up?
  • Cost: Stays under budget ($1-2 per test)
  • Quality: Mentions relevant technical approaches
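
These criteria map onto promptfoo assertion types; here is a hedged sketch of what one test's assert block could contain (rubric wording and thresholds are illustrative):

yaml
assert:
  - type: llm-rubric      # Understanding + Planning
    value: 'Agent restates the task correctly and lays out an actionable plan'
    threshold: 0.8
  - type: llm-rubric      # Reliability
    value: 'Agent commits to doing the work rather than giving up or deferring'
    threshold: 0.8
  - type: cost            # Cost: stays under the $1-2 budget
    threshold: 2.00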

Setup

Prerequisites

bash
# 1. Install dependencies (already done)
npm install --save-dev promptfoo

# 2. Set Anthropic API key
export ANTHROPIC_API_KEY="your-api-key-here"

# Or add to conductor/.env
echo "ANTHROPIC_API_KEY=your-api-key" >> conductor/.env

Running Evaluations

bash
# Evaluate all prompt variations
npm run promptfoo:eval

# View results in browser
npm run promptfoo:view

# Run and view in one command
npm run promptfoo:compare
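
These npm scripts are assumed to be defined in conductor/package.json along the following lines (the exact flags are a sketch, not the actual scripts):

json
{
  "scripts": {
    "promptfoo:eval": "promptfoo eval -c promptfooconfig.yaml -o promptfoo-results.json",
    "promptfoo:view": "promptfoo view",
    "promptfoo:compare": "npm run promptfoo:eval && npm run promptfoo:view"
  }
}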

Results Structure

Evaluations output to promptfoo-results.json:

json
{
  "results": {
    "baseline-verbose": {
      "tests": [
        {
          "description": "FS-4: Magic link 401 error bug",
          "score": 0.85,
          "pass": true,
          "cost": 1.23
        }
      ],
      "stats": {
        "avgScore": 0.82,
        "passRate": 0.67,
        "avgCost": 1.45
      }
    },
    "compact-focused": {
      /* ... */
    },
    "structured-markdown": {
      /* ... */
    }
  }
}

Integration with Conductor

Current Flow

Conductor
  └─> buildBuilderContext() in agent-manager.ts:451-568
      └─> Generates large prompt with JSON dumps
          └─> Writes to temp file
              └─> Builder reads and (sometimes) exits silently

Optimized Flow (Proposed)

Conductor
  └─> buildBuilderContext() using BEST_TEMPLATE
      └─> Renders compact template with essential data
          └─> Validates prompt < 15KB
              └─> Writes to temp file
                  └─> Builder successfully processes

Implementation Plan

  1. Identify best template (run evaluations)
  2. Refactor buildBuilderContext() to use the winning template (sketched after this list)
  3. Extract essential fields from Jira/learnings
  4. Add prompt size validation (warn if > 20KB)
  5. Monitor real-world performance
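
A hedged sketch of what steps 2-4 could look like in agent-manager.ts, assuming a simple placeholder-substitution helper (renderTemplate and the variable names are illustrative, not the Conductor's actual API):

typescript
import { readFileSync } from 'fs';

const MAX_PROMPT_BYTES = 20 * 1024; // warn above 20KB, per step 4

// Hypothetical helper: substitute {{ var }} placeholders with essential ticket fields.
function renderTemplate(templatePath: string, vars: Record<string, string>): string {
  let rendered = readFileSync(templatePath, 'utf8');
  for (const [key, value] of Object.entries(vars)) {
    rendered = rendered.replaceAll(`{{ ${key} }}`, value);
  }
  return rendered;
}

// Hypothetical shape of the refactored context builder (step 2): essential fields only.
function buildBuilderContext(ticket: { id: string; summary: string; description: string }): string {
  const prompt = renderTemplate('prompts/builder-compact.txt', {
    ticket_id: ticket.id,
    ticket_summary: ticket.summary,
    ticket_description: ticket.description,
  });

  // Step 4: size validation, warning rather than failing.
  const sizeBytes = Buffer.byteLength(prompt, 'utf8');
  if (sizeBytes > MAX_PROMPT_BYTES) {
    console.warn(`Builder prompt is ${(sizeBytes / 1024).toFixed(1)}KB (> 20KB); consider trimming`);
  }
  return prompt;
}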

Continuous Optimization

Adding New Test Cases

Edit promptfooconfig.yaml:

yaml
tests:
  - description: 'Your new test case'
    vars:
      ticket_id: 'FS-X'
      ticket_summary: 'Brief summary'
      # ... other vars
    assert:
      - type: llm-rubric
        value: 'What success looks like'
        threshold: 0.8

Testing Real Tickets

bash
# Export real ticket to test case
node scripts/export-test-case.js FS-5 >> promptfooconfig.yaml

# Run evaluation
npm run promptfoo:eval
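
If the export script is not available, a rough bash equivalent could look like this (JIRA_BASE_URL, JIRA_EMAIL, and JIRA_API_TOKEN are assumed environment variables; the emitted test entry mirrors the shape shown above):

bash
TICKET="FS-5"

# Fetch the ticket summary from the Jira REST API (v2 issue endpoint)
summary=$(curl -s -u "$JIRA_EMAIL:$JIRA_API_TOKEN" \
  "$JIRA_BASE_URL/rest/api/2/issue/$TICKET?fields=summary" | jq -r '.fields.summary')

# Append a minimal test entry to the promptfoo config
cat >> promptfooconfig.yaml <<EOF
  - description: '$TICKET: $summary'
    vars:
      ticket_id: '$TICKET'
      ticket_summary: '$summary'
    assert:
      - type: llm-rubric
        value: 'Agent produces an actionable plan for $TICKET'
        threshold: 0.8
EOF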

CI/CD Integration

Add to .github/workflows/test.yml:

yaml
- name: Evaluate Prompts
  run: npm run promptfoo:eval
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- name: Check Quality Threshold
  run: |
    # Take the worst per-template pass rate from the results structure shown above
    pass_rate=$(jq '[.results[].stats.passRate] | min' promptfoo-results.json)
    if (( $(echo "$pass_rate < 0.8" | bc -l) )); then
      echo "Prompt quality below threshold: $pass_rate"
      exit 1
    fi

Metrics to Track

Prompt Quality Metrics

  • Pass Rate: % of tests passing all assertions
  • Average Score: Mean score across all rubric evaluations
  • Cost per Test: API costs (target: < $1.50)
  • Prompt Size: Rendered size in KB (target: < 15KB)
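
Given the results structure shown above, these per-template numbers can be pulled with jq, for example:

bash
# Print pass rate, average score, and average cost for each template
jq -r '.results | to_entries[]
  | "\(.key): pass rate \(.value.stats.passRate), avg score \(.value.stats.avgScore), avg cost $\(.value.stats.avgCost)"' \
  promptfoo-results.json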

Real-World Metrics

After deploying optimized templates:

  • Builder Success Rate: % of builders completing tasks
  • Time to First Action: How long until builder starts work
  • Silent Exit Rate: % of builders exiting without output
  • Average Task Duration: Time from spawn to completion

Troubleshooting

Common Issues

Issue: ANTHROPIC_API_KEY is not set

bash
# Solution: Export the key
export ANTHROPIC_API_KEY="sk-ant-..."

Issue: YAML parse error

bash
# Solution: Validate YAML syntax
npm run promptfoo:eval -- --debug

Issue: File not found: prompts/...

bash
# Solution: Check working directory
cd /path/to/project/conductor
npm run promptfoo:eval

Debugging

bash
# Run with debug logging
LOG_LEVEL=debug npm run promptfoo:eval

# Test a single prompt
npx promptfoo eval --prompts 'file://prompts/builder-compact.txt'

# Test a single test case
npx promptfoo eval --filter-pattern 'FS-4'

Next Steps

  1. Set ANTHROPIC_API_KEY and run evaluations
  2. Analyze results to identify best template
  3. Refactor agent-manager.ts to use winning template
  4. Deploy to production and monitor metrics
  5. Iterate based on real-world performance

References

  • PromptFoo Docs: https://promptfoo.dev
  • Conductor Source: src/agent-manager.ts:451-568
  • Prompt Templates: prompts/*.txt
  • Test Configuration: promptfooconfig.yaml
