
PromptFoo Integration for Conductor Prompts

Overview

This document describes the PromptFoo integration for evaluating and optimizing the Conductor's automated builder prompts.

Problem Context

The Conductor generates large automated prompts (27-39KB) that are sent to Claude Code agents. Both the FS-3 and FS-4 tickets failed because their builders exited silently after reading these prompts. This integration helps us:

  1. Evaluate prompt quality - Test different prompt variations
  2. Optimize for reliability - Find prompts that work consistently
  3. Reduce costs - Smaller, more efficient prompts
  4. Prevent regressions - Test prompt changes before deployment

Architecture

Prompt Templates

Located in conductor/prompts/ (a hypothetical excerpt of the compact template follows this list):

  • builder-base.txt - Baseline (current verbose format)

    • Full JSON dumps of ticket, learnings, checklist
    • Extensive Redis progress tracking instructions
    • ~30-40KB when rendered
  • builder-compact.txt - Optimized compact version

    • Essential information only
    • Simplified structure
    • ~10-15KB when rendered
  • builder-structured.txt - Well-organized markdown

    • Clear sections with emoji markers
    • Better hierarchy and readability
    • ~15-20KB when rendered
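
The compact template keeps only what the builder needs to start work. A hypothetical excerpt of what builder-compact.txt might look like (promptfoo fills the {{ ... }} placeholders from each test case's vars; progress_instructions is an assumed variable name, not taken from the actual template):

txt
You are an automated builder working on ticket {{ ticket_id }}.

## Summary
{{ ticket_summary }}

## What to do
1. Read the summary and plan your changes.
2. Implement the fix and run the relevant tests.
3. Report progress as instructed below.

## Progress reporting
{{ progress_instructions }}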

Test Cases

Located in promptfooconfig.yaml (a config sketch follows this list):

  1. FS-4: Simple bug fix (magic link 401 error)
  2. FS-3: Complex refactoring (50 TypeScript errors)
  3. Edge case: Incomplete ticket information
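
A minimal sketch of how the templates and test cases might be wired together in promptfooconfig.yaml (the provider id, the ticket_summary wording, and the rubric text are illustrative assumptions, not copied from the actual config):

yaml
prompts:
  - file://prompts/builder-base.txt
  - file://prompts/builder-compact.txt
  - file://prompts/builder-structured.txt

providers:
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - description: 'FS-4: Magic link 401 error bug'
    vars:
      ticket_id: 'FS-4'
      ticket_summary: 'Magic link sign-in returns 401 for valid tokens'
    assert:
      - type: llm-rubric
        value: 'Agent identifies the 401 as a token validation issue and proposes a concrete fix'
        threshold: 0.8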

Evaluation Criteria

Each test case assesses the following (an assertion sketch follows this list):

  • Understanding: Does the agent comprehend the task?
  • Planning: Does the agent create an actionable plan?
  • Reliability: Does the agent avoid giving up?
  • Cost: Stays under budget ($1-2 per test)
  • Quality: Mentions relevant technical approaches
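
These criteria map onto promptfoo assertion types; here is a hedged sketch of what one test's assert block could contain (rubric wording and thresholds are illustrative):

yaml
assert:
  - type: llm-rubric      # Understanding + Planning
    value: 'Agent restates the task correctly and lays out an actionable plan'
    threshold: 0.8
  - type: llm-rubric      # Reliability
    value: 'Agent commits to doing the work rather than giving up or deferring'
    threshold: 0.8
  - type: cost            # Cost: stays under the $1-2 budget
    threshold: 2.00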

Setup

Prerequisites

bash
# 1. Install dependencies (already done)
npm install --save-dev promptfoo

# 2. Set Anthropic API key
export ANTHROPIC_API_KEY="your-api-key-here"

# Or add to conductor/.env
echo "ANTHROPIC_API_KEY=your-api-key" >> conductor/.env

Running Evaluations

bash
# Evaluate all prompt variations
npm run promptfoo:eval

# View results in browser
npm run promptfoo:view

# Run and view in one command
npm run promptfoo:compare
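
These npm scripts are assumed to be defined in conductor/package.json along the following lines (the exact flags are a sketch, not the actual scripts):

json
{
  "scripts": {
    "promptfoo:eval": "promptfoo eval -c promptfooconfig.yaml -o promptfoo-results.json",
    "promptfoo:view": "promptfoo view",
    "promptfoo:compare": "npm run promptfoo:eval && npm run promptfoo:view"
  }
}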

Results Structure

Evaluations output to promptfoo-results.json:

json
{
  "results": {
    "baseline-verbose": {
      "tests": [
        {
          "description": "FS-4: Magic link 401 error bug",
          "score": 0.85,
          "pass": true,
          "cost": 1.23
        }
      ],
      "stats": {
        "avgScore": 0.82,
        "passRate": 0.67,
        "avgCost": 1.45
      }
    },
    "compact-focused": {
      /* ... */
    },
    "structured-markdown": {
      /* ... */
    }
  }
}

Integration with Conductor

Current Flow

Conductor
  └─> buildBuilderContext() in agent-manager.ts:451-568
      └─> Generates large prompt with JSON dumps
          └─> Writes to temp file
              └─> Builder reads and (sometimes) exits silently

Optimized Flow (Proposed)

Conductor
  └─> buildBuilderContext() using BEST_TEMPLATE
      └─> Renders compact template with essential data
          └─> Validates prompt < 15KB
              └─> Writes to temp file
                  └─> Builder successfully processes

Implementation Plan

  1. Identify best template (run evaluations)
  2. Refactor buildBuilderContext() to use the winning template (sketched after this list)
  3. Extract essential fields from Jira/learnings
  4. Add prompt size validation (warn if > 20KB)
  5. Monitor real-world performance
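
A hedged sketch of what steps 2-4 could look like in agent-manager.ts, assuming a simple placeholder-substitution helper (renderTemplate and the variable names are illustrative, not the Conductor's actual API):

typescript
import { readFileSync } from 'fs';

const MAX_PROMPT_BYTES = 20 * 1024; // warn above 20KB, per step 4

// Hypothetical helper: substitute {{ var }} placeholders with essential ticket fields.
function renderTemplate(templatePath: string, vars: Record<string, string>): string {
  let rendered = readFileSync(templatePath, 'utf8');
  for (const [key, value] of Object.entries(vars)) {
    rendered = rendered.replaceAll(`{{ ${key} }}`, value);
  }
  return rendered;
}

// Hypothetical shape of the refactored context builder (step 2): essential fields only.
function buildBuilderContext(ticket: { id: string; summary: string; description: string }): string {
  const prompt = renderTemplate('prompts/builder-compact.txt', {
    ticket_id: ticket.id,
    ticket_summary: ticket.summary,
    ticket_description: ticket.description,
  });

  // Step 4: size validation, warning rather than failing.
  const sizeBytes = Buffer.byteLength(prompt, 'utf8');
  if (sizeBytes > MAX_PROMPT_BYTES) {
    console.warn(`Builder prompt is ${(sizeBytes / 1024).toFixed(1)}KB (> 20KB); consider trimming`);
  }
  return prompt;
}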

Continuous Optimization

Adding New Test Cases

Edit promptfooconfig.yaml:

yaml
tests:
  - description: 'Your new test case'
    vars:
      ticket_id: 'FS-X'
      ticket_summary: 'Brief summary'
      # ... other vars
    assert:
      - type: llm-rubric
        value: 'What success looks like'
        threshold: 0.8

Testing Real Tickets

bash
# Export real ticket to test case
node scripts/export-test-case.js FS-5 >> promptfooconfig.yaml

# Run evaluation
npm run promptfoo:eval
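
If the export script is not available, a rough bash equivalent could look like this (JIRA_BASE_URL, JIRA_EMAIL, and JIRA_API_TOKEN are assumed environment variables; the emitted test entry mirrors the shape shown above):

bash
TICKET="FS-5"

# Fetch the ticket summary from the Jira REST API (v2 issue endpoint)
summary=$(curl -s -u "$JIRA_EMAIL:$JIRA_API_TOKEN" \
  "$JIRA_BASE_URL/rest/api/2/issue/$TICKET?fields=summary" | jq -r '.fields.summary')

# Append a minimal test entry to the promptfoo config
cat >> promptfooconfig.yaml <<EOF
  - description: '$TICKET: $summary'
    vars:
      ticket_id: '$TICKET'
      ticket_summary: '$summary'
    assert:
      - type: llm-rubric
        value: 'Agent produces an actionable plan for $TICKET'
        threshold: 0.8
EOF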

CI/CD Integration

Add to .github/workflows/test.yml:

yaml
- name: Evaluate Prompts
  run: npm run promptfoo:eval
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- name: Check Quality Threshold
  run: |
    # Take the worst per-template pass rate from the results structure shown above
    pass_rate=$(jq '[.results[].stats.passRate] | min' promptfoo-results.json)
    if (( $(echo "$pass_rate < 0.8" | bc -l) )); then
      echo "Prompt quality below threshold: $pass_rate"
      exit 1
    fi

Metrics to Track

Prompt Quality Metrics

  • Pass Rate: % of tests passing all assertions
  • Average Score: Mean score across all rubric evaluations
  • Cost per Test: API costs (target: < $1.50)
  • Prompt Size: Rendered size in KB (target: < 15KB)
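
Given the results structure shown above, these per-template numbers can be pulled with jq, for example:

bash
# Print pass rate, average score, and average cost for each template
jq -r '.results | to_entries[]
  | "\(.key): pass rate \(.value.stats.passRate), avg score \(.value.stats.avgScore), avg cost $\(.value.stats.avgCost)"' \
  promptfoo-results.json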

Real-World Metrics

After deploying optimized templates:

  • Builder Success Rate: % of builders completing tasks
  • Time to First Action: How long until builder starts work
  • Silent Exit Rate: % of builders exiting without output
  • Average Task Duration: Time from spawn to completion

Troubleshooting

Common Issues

Issue: ANTHROPIC_API_KEY is not set

bash
# Solution: Export the key
export ANTHROPIC_API_KEY="sk-ant-..."

Issue: YAML parse error

bash
# Solution: Validate YAML syntax
npm run promptfoo:eval -- --debug

Issue: File not found: prompts/...

bash
# Solution: Check working directory
cd /path/to/project/conductor
npm run promptfoo:eval

Debugging

bash
# Run with debug logging
LOG_LEVEL=debug npm run promptfoo:eval

# Test a single prompt
npx promptfoo eval --prompts 'file://prompts/builder-compact.txt'

# Test a single test case
npx promptfoo eval --filter-pattern 'FS-4'

Next Steps

  1. Set ANTHROPIC_API_KEY and run evaluations
  2. Analyze results to identify best template
  3. Refactor agent-manager.ts to use winning template
  4. Deploy to production and monitor metrics
  5. Iterate based on real-world performance

References

  • PromptFoo Docs: https://promptfoo.dev
  • Conductor Source: src/agent-manager.ts:451-568
  • Prompt Templates: prompts/*.txt
  • Test Configuration: promptfooconfig.yaml
