Repair loops (`forge_lcdl.repair`)
Purpose
Cheap model runs often fail in predictable ways (bad JSON, schema drift, flaky tests). Random retries waste tokens. This module adds deterministic building blocks:
- `classify_failure`: turn a `VerificationResult` (FAIL), `ParseFailure`, `SchemaFailure`, `ConfigFailure`, or common exceptions into a `FailureRecord` with a `FailureKind`.
- `RetryMemory` + `RetryPolicy`: remember signatures of failures and decide `should_retry` / `should_stop_retrying` so identical failures do not loop forever.
- `reduce_failure_to_repair`: produce a `RepairInstruction` (`action`, `prompt_suffix`, `minimal_next_step`) without an LLM.
Task execution still uses `Ok` / `Err` (`result.py`). Repair is advisory for orchestrators, `until_ok` callbacks, or graph drivers.
Local verification
Use `python3` (many systems have no `python` symlink). From the forge-lcdl repo root, set `PYTHONPATH=src` for ad-hoc snippets unless you use `pip install -e ".[dev]"`.
./scripts/verify-repair-sprint.sh
Manual shortcuts:
export PYTHONPATH=src
python3 -m pytest -q tests/test_repair_loop.py
python3 -m compileall -q src/forge_lcdl/repair
Flow (conceptual)
- Run task or verifier → failure payload.
- `record = classify_failure(..., source="task" | "verify" | "graph", ...)`
- If `policy.should_retry(record, memory)` → update model/context per `reduce_failure_to_repair(record)` and retry; else escalate/block.
- `memory = memory.record(record)` after each failure you count toward limits.
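The flow above can be sketched as a small driver loop. Everything here is a stand-in: `Record`, `Memory`, and `run_with_repair` are hypothetical names, the task is modelled as a callable that returns `None` on success or a failure-kind string, and checking `should_retry` before recording the failure is only one reading of the steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Record:
    kind: str
    attempt: int  # 1-based, as in the handbook


@dataclass(frozen=True)
class Memory:
    seen: Tuple[str, ...] = ()

    def record(self, rec: Record) -> "Memory":
        # Returns a new object, mirroring `memory = memory.record(record)`.
        return Memory(self.seen + (rec.kind,))


def run_with_repair(
    task: Callable[[str], Optional[str]],
    should_retry: Callable[[Record, Memory], bool],
    repair_suffix: Callable[[Record], str],
    prompt: str,
) -> Tuple[bool, int, Memory]:
    """Return (succeeded, attempts_used, memory)."""
    memory = Memory()
    attempt = 1
    while True:
        failure_kind = task(prompt)  # None means the task succeeded
        if failure_kind is None:
            return True, attempt, memory
        rec = Record(kind=failure_kind, attempt=attempt)
        if not should_retry(rec, memory):
            return False, attempt, memory  # escalate/block is the caller's job
        memory = memory.record(rec)
        prompt = prompt + " " + repair_suffix(rec)  # update context, then retry
        attempt += 1
```

The loop itself holds no retry rules; it only asks the injected `should_retry` callable, which keeps the repair logic advisory.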
Failure kinds
Wire strings match enum values (e.g. `schema_invalid`). Each kind is also available as a value alias on the class (`FailureKind.schema_invalid` is the same as `FailureKind.SCHEMA_INVALID`).
| Kind | Typical source |
|---|---|
| `schema_invalid` | Contract / schema verifier, `SchemaFailure` |
| `json_invalid` | JSON-object verifier, `ParseFailure`, `JSONDecodeError` |
| `test_failed` | Pytest subprocess verifier (non-timeout) |
| `command_failed` | Timeouts, subprocess failures |
| `missing_context` | Heuristic string / metadata |
| `reasoning_error` | `ConfigFailure`, mis-routing |
| `repeated_failure` | Explicit classification or dict hook |
| `unsafe_request` | Policy strings / dict hook |
| `tool_error` | Transport, gateway, unknown verifier failures |
| `unknown` | Unclassified |
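One way to get both spellings of a kind is Python's standard enum aliasing, where a second name bound to an existing value becomes an alias of the first member. The sketch below uses a hypothetical `SketchKind` with only three of the kinds; whether `FailureKind` is implemented this way is an assumption.

```python
from enum import Enum


class SketchKind(Enum):
    # Canonical members: the value is the wire string.
    SCHEMA_INVALID = "schema_invalid"
    JSON_INVALID = "json_invalid"
    UNKNOWN = "unknown"
    # Re-using a value makes the new name an alias of the member above.
    schema_invalid = "schema_invalid"
    json_invalid = "json_invalid"
    unknown = "unknown"


# Both spellings resolve to the same member object.
same_member = SketchKind.schema_invalid is SketchKind.SCHEMA_INVALID
# A wire string read from JSON maps back to the canonical member.
from_wire = SketchKind("json_invalid")
# Aliases are skipped during iteration, so each kind is listed once.
canonical_names = [kind.name for kind in SketchKind]
```

With this pattern, serialized payloads carry only the wire string, and either spelling can be used in code without creating a second member.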
Retry policy
- `max_attempts`: `record.attempt` greater than this ⇒ stop (1-based attempts in `FailureRecord`).
- `same_failure_limit`: `classification_signature` count already in memory before the next try ≥ limit ⇒ stop (default 2: two identical signatures in memory block a third without new context).
- `escalate_on`: `FailureKind` set for which `should_retry` is always `False`.
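The three rules can be sketched as a single predicate. The classes below are hypothetical stand-ins rather than the module's `RetryPolicy` and `RetryMemory`; the `max_attempts` default of 3 is an assumption, while the `same_failure_limit` default of 2 follows the description above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Record:
    kind: str
    signature: str  # stands in for classification_signature
    attempt: int    # 1-based


@dataclass(frozen=True)
class Memory:
    signatures: Tuple[str, ...] = ()

    def record(self, rec: Record) -> "Memory":
        return Memory(self.signatures + (rec.signature,))

    def count(self, signature: str) -> int:
        return self.signatures.count(signature)


@dataclass(frozen=True)
class Policy:
    max_attempts: int = 3        # assumed default, not stated in the handbook
    same_failure_limit: int = 2  # default given in the handbook
    escalate_on: FrozenSet[str] = frozenset()

    def should_retry(self, rec: Record, memory: Memory) -> bool:
        if rec.kind in self.escalate_on:
            return False  # these kinds never retry
        if rec.attempt > self.max_attempts:
            return False  # attempt budget exhausted
        if memory.count(rec.signature) >= self.same_failure_limit:
            return False  # same failure already seen too often
        return True
```

For example, once two records with signature `"sig-a"` are in memory, a third record with that signature is refused, while a record with a different signature is still allowed.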
Verification integration
Only a `VerificationResult` with `status == FAIL` may be passed to `classify_failure`; any other status raises `ValueError`. Mapping uses `verifier_id` (`contract.schema`, `json.object`, `pytest.subprocess`, …). See VERIFICATION.md.
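A minimal sketch of this mapping, assuming a simplified result object: the verifier ids and kind strings are taken from this page, but `Verification`, its `timed_out` flag, and `kind_for` are hypothetical and do not reflect the real `VerificationResult` fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verification:
    verifier_id: str
    status: str              # "PASS" or "FAIL" in this sketch
    timed_out: bool = False  # hypothetical flag for subprocess timeouts


# Verifier ids with a fixed kind, per the failure-kinds table.
_BY_VERIFIER = {
    "contract.schema": "schema_invalid",
    "json.object": "json_invalid",
}


def kind_for(result: Verification) -> str:
    if result.status != "FAIL":
        # Mirrors the documented rule: only FAIL results are classified.
        raise ValueError(f"cannot classify status {result.status!r}")
    if result.verifier_id == "pytest.subprocess":
        # Timeouts count as command failures; other pytest failures as tests.
        return "command_failed" if result.timed_out else "test_failed"
    # Unknown verifiers fall back to tool_error, as in the table.
    return _BY_VERIFIER.get(result.verifier_id, "tool_error")
```

Raising on non-FAIL input makes a mis-wired caller fail loudly instead of producing a misleading record.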
Graph / operators
- `LcdlNode.attempts` / `error` can populate `FailureRecord.attempt` and `node_id` when classifying executor errors (no executor changes in this sprint).
- `operators.until_ok` can call `reduce_failure_to_repair` in `on_retry` later; see `operators.py`.
Model routing
`max_retries` in `models/routing.py` is separate from `RetryPolicy` here; compose the two in the consumer when both apply.
Serialization
Use `failure_record_to_dict` / `repair_instruction_to_dict` for JSON-friendly structures.
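To illustrate what "JSON-friendly" might involve, the sketch below flattens a dataclass record into a dict whose enum field is replaced by its wire string. `SketchKind`, `SketchRecord`, and `record_to_dict` are hypothetical; the real `failure_record_to_dict` may include other fields or use a different layout.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class SketchKind(Enum):
    JSON_INVALID = "json_invalid"


@dataclass(frozen=True)
class SketchRecord:
    kind: SketchKind
    message: str
    attempt: int


def record_to_dict(rec: SketchRecord) -> dict:
    data = asdict(rec)
    # Enum members are not JSON-serializable; emit the wire string instead.
    data["kind"] = rec.kind.value
    return data


record = SketchRecord(SketchKind.JSON_INVALID, "Expecting value", 1)
payload = json.dumps(record_to_dict(record), sort_keys=True)
```

Emitting the wire string rather than the member name keeps the payload aligned with the enum values described under failure kinds.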
Risks
Classification uses English heuristics on messages; non-English logs may bucket into unknown. Tighten mappings as you observe real failures.
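The risk is easy to see in a minimal keyword classifier. The patterns and the function name `classify_message` below are hypothetical and are not the module's actual heuristics; the point is only that a message with no matching English keyword falls through to `unknown`.

```python
# Hypothetical keyword table: (kind, substrings to look for).
_PATTERNS = (
    ("command_failed", ("timed out", "timeout")),
    ("json_invalid", ("expecting value", "invalid json")),
    ("missing_context", ("missing context", "not enough context")),
)


def classify_message(message: str) -> str:
    text = message.lower()
    for kind, needles in _PATTERNS:
        if any(needle in text for needle in needles):
            return kind
    # Nothing matched, including any non-English message.
    return "unknown"
```

Here `classify_message("Command timed out after 30s")` matches the first pattern, while a German log line such as `"Zeitüberschreitung beim Testlauf"` matches nothing and is bucketed as `unknown`, so real failure logs are the right input for tightening the table.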