forge-lcdl

Repair loops (`forge_lcdl.repair`)

Purpose

Cheap model runs often fail in predictable ways (bad JSON, schema drift, flaky tests). Random retries waste tokens. This module adds deterministic building blocks:

  1. classify_failure — turn a VerificationResult (FAIL), ParseFailure, SchemaFailure, ConfigFailure, or common exceptions into a FailureRecord with a FailureKind.
  2. RetryMemory + RetryPolicy — remember signatures of failures and decide should_retry / should_stop_retrying so identical failures do not loop forever.
  3. reduce_failure_to_repair — produce a RepairInstruction (action, prompt_suffix, minimal_next_step) without an LLM.

Task execution still uses Ok / Err (result.py). Repair is advisory for orchestrators, until_ok callbacks, or graph drivers.

Local verification

Use python3 (many systems have no python symlink). From the forge-lcdl repo root, set PYTHONPATH=src for ad-hoc snippets unless you use pip install -e ".[dev]".

./scripts/verify-repair-sprint.sh

Manual shortcuts:

export PYTHONPATH=src
python3 -m pytest -q tests/test_repair_loop.py
python3 -m compileall -q src/forge_lcdl/repair

Flow (conceptual)

  1. Run task or verifier → failure payload.
  2. record = classify_failure(..., source="task" | "verify" | "graph", ...)
  3. If policy.should_retry(record, memory) → update model/context per reduce_failure_to_repair(record) and retry; else escalate/block.
  4. memory = memory.record(record) after each failure you count toward limits.

Failure kinds

Wire strings match enum values (e.g. schema_invalid). Each kind is also available as a value alias on the class (FailureKind.schema_invalid is the same as FailureKind.SCHEMA_INVALID).
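The alias behaviour falls out of standard Enum semantics: members that repeat a value become aliases of the first member with that value. A minimal sketch (not the actual FailureKind definition):

```python
from enum import Enum


class FailureKind(str, Enum):
    # Repeating a value creates an alias, so the lowercase wire-string
    # name resolves to the same member object as the canonical one.
    SCHEMA_INVALID = "schema_invalid"
    schema_invalid = "schema_invalid"  # alias of SCHEMA_INVALID


print(FailureKind.schema_invalid is FailureKind.SCHEMA_INVALID)  # True
print(FailureKind("schema_invalid") is FailureKind.SCHEMA_INVALID)  # True
```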

| Kind | Typical source |
| --- | --- |
| schema_invalid | Contract / schema verifier, SchemaFailure |
| json_invalid | JSON-object verifier, ParseFailure, JSONDecodeError |
| test_failed | Pytest subprocess verifier (non-timeout) |
| command_failed | Timeouts, subprocess failures |
| missing_context | Heuristic string / metadata |
| reasoning_error | ConfigFailure, mis-routing |
| repeated_failure | Explicit classification or dict hook |
| unsafe_request | Policy strings / dict hook |
| tool_error | Transport, gateway, unknown verifier failures |
| unknown | Unclassified |
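The exception side of this mapping can be sketched with a type dispatch. `kind_for_exception` is a hypothetical helper illustrating the table, not the real `classify_failure` (which also handles verifier payloads and message heuristics):

```python
import json
import subprocess


def kind_for_exception(exc: BaseException) -> str:
    # Hypothetical sketch: exception type -> failure-kind wire string.
    if isinstance(exc, json.JSONDecodeError):
        return "json_invalid"
    if isinstance(exc, subprocess.TimeoutExpired):
        return "command_failed"
    if isinstance(exc, OSError):  # covers ConnectionError subclasses too
        return "tool_error"
    return "unknown"


try:
    json.loads("{not valid")
except json.JSONDecodeError as exc:
    print(kind_for_exception(exc))  # json_invalid
```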

Retry policy

  • max_attempts: record.attempt greater than this ⇒ stop (1-based attempts in FailureRecord).
  • same_failure_limit: if the count of an identical classification_signature already in memory before the next try is ≥ this limit, stop (default 2: two identical signatures in memory block a third attempt without new context).
  • escalate_on: FailureKind set for which should_retry is always False.
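The three rules compose into a decision like the following. The field and method names mirror the ones described above, but the real RetryPolicy / RetryMemory API may differ; this is a sketch:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    max_attempts: int = 3
    same_failure_limit: int = 2
    escalate_on: frozenset = frozenset({"unsafe_request"})

    def should_retry(self, kind: str, signature: str, attempt: int,
                     memory: Counter) -> bool:
        if kind in self.escalate_on:
            return False  # these kinds never retry
        if attempt > self.max_attempts:
            return False  # attempts are 1-based
        if memory[signature] >= self.same_failure_limit:
            return False  # identical failure already at the limit
        return True


policy = RetryPolicy()
memory = Counter()
print(policy.should_retry("json_invalid", "sig-a", 1, memory))  # True
memory["sig-a"] += 2  # two identical failures recorded
print(policy.should_retry("json_invalid", "sig-a", 3, memory))  # False
```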

Verification integration

On VerificationResult, only status == FAIL may be passed to classify_failure (otherwise ValueError). Mapping uses verifier_id (contract.schema, json.object, pytest.subprocess, …). See VERIFICATION.md.
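The FAIL-only precondition and the verifier_id mapping could look like this sketch (names and dict contents assumed; the real mapping lives in classify_failure):

```python
# Hypothetical verifier_id -> failure-kind mapping, per the kinds table.
VERIFIER_KIND = {
    "contract.schema": "schema_invalid",
    "json.object": "json_invalid",
    "pytest.subprocess": "test_failed",
}


def classify_verification(status: str, verifier_id: str) -> str:
    # Guard: only FAIL results are classifiable.
    if status != "FAIL":
        raise ValueError("only status == FAIL may be classified")
    return VERIFIER_KIND.get(verifier_id, "tool_error")


print(classify_verification("FAIL", "json.object"))  # json_invalid
```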

Graph / operators

  • LcdlNode.attempts / error can populate FailureRecord.attempt and node_id when classifying executor errors (no executor changes in this sprint).
  • operators.until_ok can call reduce_failure_to_repair in on_retry later; see operators.py.

Model routing

The max_retries setting in models/routing.py is separate from the RetryPolicy here; compose them in the consumer when both apply.

Serialization

Use failure_record_to_dict / repair_instruction_to_dict for JSON-friendly structures.
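If FailureRecord is a dataclass, the helper can be as thin as dataclasses.asdict. The class shape below is illustrative, and the real helpers may normalise more fields (e.g. flattening the FailureKind enum to its wire string):

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FailureRecord:
    # Illustrative shape, not the real class.
    kind: str
    message: str
    attempt: int


def failure_record_to_dict(record: FailureRecord) -> dict:
    return asdict(record)  # every field here is already a JSON scalar


record = FailureRecord(kind="json_invalid", message="bad payload", attempt=1)
print(json.dumps(failure_record_to_dict(record)))
```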

Risks

Classification uses English heuristics on messages; non-English logs may bucket into unknown. Tighten mappings as you observe real failures.