
Conductor Intelligence & Self-Healing Capabilities

Overview

Conductor has evolved from a simple orchestrator into a self-healing system that learns from incidents and avoids duplicate work. This document records the capabilities and lessons that came out of real production incidents.

Part of Zeron Platform

Conductor's intelligence layer will integrate with Nexus (the Zeron AI learning system) to share learnings across the platform and improve over time.

Key Intelligence Features

1. Work Deduplication & Retry Loop Prevention

Problem Solved: Conductor kept retrying work that was already complete but whose Jira status was incorrect (e.g., the October 2025 production incident).

Solution Implemented:

  • GitHub PR status checking before processing any ticket
  • Auto-transition tickets to Done when a merged PR is detected
  • Cross-reference multiple data sources (Jira, GitHub, Redis)

How It Works:

typescript
// Before processing any ticket
const { hasMerged, pr } = await githubClient.hasTicketMergedPR(ticketId);
if (hasMerged && pr) {
  // Skip processing and update Jira
  await jiraClient.updateTicketStatus(ticketId, 'Done');
  await redisState.completeTicket(ticketId);
  return; // Prevent retry loop
}

2. State Reconciliation & Management

Problem Solved: Redis state became stale and showed completed work as "active" in the dashboard.

Intelligence Principles:

  • Redis is a cache, not a source of truth
  • Jira status can lag behind actual work
  • GitHub PR status is authoritative for code changes
  • Always validate state before taking action

Best Practices:

  • Implement TTL on Redis entries (planned)
  • Periodic reconciliation with external systems
  • Clear stale state proactively
  • Dashboard should distinguish stale vs active work
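The staleness check behind these practices can be sketched as a pure function. This is a minimal illustration, not Conductor's actual implementation: the `ActiveEntry` shape, the `findStaleEntries` name, and the default threshold (borrowed from the planned 24-hour TTL) are all assumptions.

```typescript
// Hypothetical shape of a Redis "active work" entry; field names are assumptions.
interface ActiveEntry {
  ticketId: string;
  lastUpdatedMs: number; // epoch milliseconds of the last state update
}

// Mirrors the planned 24-hour expiry; adjust to whatever TTL is finally chosen.
const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000;

// Returns the ticket IDs whose entries are older than the TTL. These are the
// candidates to re-check against Jira and GitHub before clearing them or
// showing them as active in the dashboard.
function findStaleEntries(
  entries: ActiveEntry[],
  nowMs: number,
  ttlMs: number = DEFAULT_TTL_MS,
): string[] {
  return entries
    .filter((entry) => nowMs - entry.lastUpdatedMs > ttlMs)
    .map((entry) => entry.ticketId);
}
```

Passing `nowMs` in rather than reading the clock inside the function keeps the check deterministic, which makes a periodic reconciliation job easier to test.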

3. Path Confusion Prevention

Problem Solved: Agents worked in the wrong directories (3 PRs failed due to path confusion).

Multi-Layered Solution:

  1. Configuration Time: Real-time path validation UI
  2. Dashboard Time: Prominent warnings for invalid paths
  3. Agent Context Time: Explicit path verification warnings
  4. Documentation Time: Clear process guides

Key Learning: Common confusion patterns must be explicitly documented:

  • Similar directory names (e.g., "feedback-service" vs "zeron-feedback-service")
  • Monorepo vs standalone projects
  • Deleted vs active codebases
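The "similar directory names" pattern lends itself to a simple automated check. The sketch below is an assumption about how such a check could look, not the validation Conductor actually ships: `findConfusablePaths` is a hypothetical helper, and the containment rule (one directory name contains the other) is only one possible heuristic.

```typescript
// Hypothetical helper: flags known directories whose names could be mistaken
// for the configured project directory (same name elsewhere, or one name
// contained in the other).
function findConfusablePaths(configured: string, knownDirs: string[]): string[] {
  // Last non-empty path segment, lowercased for comparison.
  const baseName = (path: string): string => {
    const last = path.split("/").filter((part) => part.length > 0).pop();
    return last ? last.toLowerCase() : "";
  };

  const target = baseName(configured);
  if (target === "") return [];

  return knownDirs.filter((dir) => {
    if (dir === configured) return false; // the configured path itself
    const name = baseName(dir);
    if (name === "") return false;
    return name.includes(target) || target.includes(name);
  });
}
```

A non-empty result would be a reason to show a warning at configuration time and to repeat it in the agent's context, rather than to block the configuration outright.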

4. Self-Healing Mechanisms

Validator Retry with Corrections:

typescript
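while (validatorRetryCount <= maxValidatorRetries) {
  if (validatorReport.recommendation !== 'needs_rework') {
    break; // Nothing to correct: leave the retry loop
  }
  // Turn the validator's findings into a correction prompt for the builder
  const correctionPrompt = generateValidatorCorrectionPrompt(validatorReport);
  await resumeBuilder(correctionPrompt);
  validatorRetryCount++;
  // Re-run validation here so validatorReport reflects the reworked output
}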

External System Fallback:

typescript
try {
  await jiraClient.updateTicketStatus(ticketId, 'In Review');
} catch (error) {
  logger.warn('Jira transition failed, but work is complete - continuing');
  // Work completion continues regardless
}

Critical Operating Principles

1. "Work First, Bureaucracy Second"

The most important lesson from production incidents: Never let external system failures prevent work completion.

Application:

  • ✅ Continue if Jira transitions fail
  • ✅ Continue if Slack notifications fail
  • ✅ Continue if email notifications fail
  • ❌ Do NOT continue if core validation fails
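One way to encode this split is to wrap only the non-critical steps in a best-effort runner and leave core validation unwrapped. The following is a sketch under assumptions: `runBestEffort`, `completeWork`, and the `SideEffectResult` shape are hypothetical names introduced for illustration.

```typescript
// Outcome of a best-effort step; failures are recorded, never rethrown.
interface SideEffectResult {
  name: string;
  ok: boolean;
  error?: string;
}

interface SideEffectStep {
  name: string;
  run: () => Promise<void>;
}

// Hypothetical helper: runs non-critical steps (Jira transition, Slack, email)
// one by one and collects failures instead of aborting on the first one.
async function runBestEffort(steps: SideEffectStep[]): Promise<SideEffectResult[]> {
  const results: SideEffectResult[] = [];
  for (const step of steps) {
    try {
      await step.run();
      results.push({ name: step.name, ok: true });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ name: step.name, ok: false, error: message });
    }
  }
  return results;
}

// Core validation is deliberately NOT wrapped: if it throws, the error
// propagates and no side effects run.
async function completeWork(
  validate: () => Promise<void>,
  sideEffects: SideEffectStep[],
): Promise<SideEffectResult[]> {
  await validate();
  return runBestEffort(sideEffects);
}
```

The returned results give the caller something concrete to log or surface on the dashboard as non-blocking external failures.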

2. "Verify Before Action"

Always check if work is already done before starting:

  • Check for existing PRs
  • Check for existing branches
  • Check Redis state
  • Check Jira status
  • Query the AI learning system (Nexus, once the planned integration exists)

3. "Multiple Sources of Truth"

Never trust a single data source:

  • Cross-reference Jira, GitHub, and Redis
  • GitHub PR status overrides Jira for code changes
  • Redis is temporary cache only
  • Log all discrepancies for debugging
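The precedence rules above can be written down as a small reconciliation function. This is a minimal sketch for code-change tickets only; the `TicketSnapshot` and `Reconciliation` shapes and the `reconcile` name are assumptions, not part of Conductor's API.

```typescript
// Hypothetical snapshot of what each system reports for one ticket.
interface TicketSnapshot {
  jiraStatus: string;   // e.g. "Done" or "In Review"
  hasMergedPR: boolean; // reported by GitHub
  redisActive: boolean; // Redis still lists the ticket as active
}

interface Reconciliation {
  isComplete: boolean;
  discrepancies: string[]; // human-readable mismatches to log
}

function reconcile(snapshot: TicketSnapshot): Reconciliation {
  const discrepancies: string[] = [];

  // GitHub is treated as authoritative for code changes:
  // a merged PR means the work is done, whatever Jira or Redis say.
  const isComplete = snapshot.hasMergedPR;

  if (isComplete && snapshot.jiraStatus !== "Done") {
    discrepancies.push(`Jira is "${snapshot.jiraStatus}" but a merged PR exists`);
  }
  if (isComplete && snapshot.redisActive) {
    discrepancies.push("Redis lists the ticket as active but a merged PR exists");
  }
  if (!isComplete && snapshot.jiraStatus === "Done") {
    discrepancies.push('Jira is "Done" but no merged PR was found');
  }

  return { isComplete, discrepancies };
}
```

Returning the discrepancies instead of logging them inside the function leaves it to the caller to decide where they go (log file, dashboard, or manual review queue).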

4. "Fail Gracefully"

When things go wrong:

  • Log detailed error with context
  • Skip the problematic ticket
  • Continue processing other tickets
  • Mark state for manual review
  • Never enter infinite retry loops
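A batch loop that follows these rules might look like the sketch below. The names (`processBatch`, `BatchReport`) and the default of two attempts per ticket are assumptions chosen for illustration; the essential points are the bounded attempt count and the per-ticket isolation.

```typescript
interface BatchReport {
  completed: string[];
  needsManualReview: Array<{ ticketId: string; reason: string }>;
}

// Hypothetical batch loop: each ticket gets a bounded number of attempts.
// A ticket that keeps failing is set aside for manual review and the
// remaining tickets are still processed.
async function processBatch(
  ticketIds: string[],
  processTicket: (ticketId: string) => Promise<void>,
  maxAttempts: number = 2,
): Promise<BatchReport> {
  const report: BatchReport = { completed: [], needsManualReview: [] };

  for (const ticketId of ticketIds) {
    let lastError = "unknown error";
    let done = false;

    // Bounded loop: never more than maxAttempts tries per ticket.
    for (let attempt = 1; attempt <= maxAttempts && !done; attempt++) {
      try {
        await processTicket(ticketId);
        done = true;
      } catch (err) {
        lastError = err instanceof Error ? err.message : String(err);
      }
    }

    if (done) {
      report.completed.push(ticketId);
    } else {
      report.needsManualReview.push({ ticketId, reason: lastError });
    }
  }

  return report;
}
```

In a real run, the `catch` branch is also where the detailed error and its context would be logged before moving on.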

Monitoring & Observability

Key Events to Monitor

bash
# Self-healing events
grep "Self-healing validator retry" conductor.log
grep "Failed to transition.*but work is complete" conductor.log
grep "ticket_completed_manual" conductor.log

# Deduplication events
grep "Ticket already has merged PR" conductor.log
grep "Skipping duplicate work" conductor.log

# State reconciliation
grep "Stale state detected" conductor.log
grep "State reconciliation" conductor.log

Dashboard Intelligence

The dashboard should clearly show:

  • Active vs stale tickets
  • External system failures (non-blocking)
  • Retry attempts and reasons
  • Path validation status
  • PR status for tickets

Future Intelligence Enhancements

Short Term (Next Sprint)

  • [ ] Redis TTL implementation (24-hour expiry)
  • [ ] Pre-validation checklist phase
  • [ ] Dashboard stale work indicators
  • [ ] Automated state cleanup job

Medium Term (Next Month)

  • [ ] Nexus Integration - Connect to Zeron AI learning system for work history
  • [ ] Pattern detection for common failures
  • [ ] Predictive issue detection
  • [ ] Auto-remediation for known issues

Long Term (Next Quarter)

  • [ ] Machine learning for optimal retry strategies
  • [ ] Intelligent workload distribution
  • [ ] Predictive capacity planning
  • [ ] Cross-project learning and knowledge transfer

Incident Response Pattern

When incidents occur, follow this pattern:

  1. Investigate Thoroughly

    • Don't make hasty decisions
    • Gather all relevant logs
    • Understand root cause
  2. Implement Multi-Layered Prevention

    • Configuration validation
    • Runtime checks
    • Agent warnings
    • Documentation
  3. Document Everything

    • Create incident report
    • Update prevention guide
    • Add to this intelligence document
    • Update Claude.md
  4. Verify Prevention Works

    • Test the fix
    • Monitor for recurrence
    • Adjust if needed

Lessons Learned

From Production Retry Loop Incident (October 2025)

  • Lesson: Work completion status can diverge across systems
  • Intelligence Added: GitHub PR checking before processing
  • Result: Prevented infinite retry loops

From Path Confusion Incident (October 2025)

  • Lesson: Agents can work in wrong directories without explicit warnings
  • Intelligence Added: Multi-layered path validation
  • Result: No more contaminated PRs with deleted files

From State Staleness Issues

  • Lesson: Redis state can become orphaned
  • Intelligence Added: State reconciliation and cleanup
  • Result: Dashboard accurately reflects active work

Conclusion

Conductor's intelligence is not just about automation - it's about learning from failures, preventing known issues, and handling the unexpected gracefully. Each incident should leave the system more resilient than before.

Remember: The goal is not perfection, but continuous improvement through intelligent adaptation.
