forge-lcdl

Context packs (`forge_lcdl.context`)


Purpose

Cheap models need smaller, relevant context, not whole-repository dumps. build_context_pack turns a task description and a repository path into a bounded, ordered ContextPack: scanned text files, keyword-ranked with fixed boosts, trimmed to a UTF-8 byte budget, with reasons and provenance. No embeddings, no external search, and no network in this layer.

Flow

  1. Scan (context/repo_scan.py): walk the tree safely; skip denylisted directories; do not read obvious secret filenames; read a capped UTF-8 prefix per file; skip binary or invalid UTF-8.
  2. Rank (context/rank.py): deterministic keyword overlap on path and first line of preview; boosts for src/forge_lcdl, tests/ when the task mentions tests or verification, contracts/ when the task mentions contracts or task_id.
  3. Pack / trim (context/pack.py, context/trim.py): emit ContextItem rows (file vs excerpt), then cap total content UTF-8 bytes using truncate_utf8_bytes.
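The ranking step can be illustrated with a minimal sketch. It is not the code in context/rank.py: the function names (tokenize, score_candidate, rank), the boost values, and the exact trigger words are assumptions chosen only to show deterministic keyword overlap with fixed boosts and a stable tie-break.

```python
import re

# Hypothetical boost values; the real constants in context/rank.py may differ.
SRC_BOOST = 2
TESTS_BOOST = 3
CONTRACTS_BOOST = 3


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens (underscores kept), as a set."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def score_candidate(task: str, path: str, first_line: str) -> int:
    """Keyword overlap on path and first preview line, plus fixed boosts."""
    task_tokens = tokenize(task)
    score = len(task_tokens & (tokenize(path) | tokenize(first_line)))
    if path.startswith("src/forge_lcdl"):
        score += SRC_BOOST
    # Boost tests/ only when the task talks about tests or verification.
    if path.startswith("tests/") and task_tokens & {"test", "tests", "verification"}:
        score += TESTS_BOOST
    # Boost contracts/ only when the task talks about contracts or task_id.
    if path.startswith("contracts/") and task_tokens & {"contract", "contracts", "task_id"}:
        score += CONTRACTS_BOOST
    return score


def rank(task: str, candidates: list[tuple[str, str]]) -> list[str]:
    """Sort (path, first_line) pairs by descending score, then by path."""
    scored = [(score_candidate(task, path, line), path) for path, line in candidates]
    return [path for _, path in sorted(scored, key=lambda sp: (-sp[0], sp[1]))]
```

Sorting on (-score, path) rather than on score alone is what keeps the order reproducible when two files tie.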

Budget

Despite its name, the budget_chars argument to build_context_pack is measured in UTF-8 bytes, not characters: it is the maximum total UTF-8 byte length of all item content strings (the same unit as truncate_utf8_bytes). The requested value is stored on the pack as token_or_char_budget; actual_content_utf8_bytes records the final total after trimming.
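Because the budget counts UTF-8 bytes, a string of non-ASCII text uses it up faster than its character count suggests. The sketch below shows one way to cut text to a byte budget without splitting a multi-byte character. The helpers truncate_to_utf8_budget and apply_budget are hypothetical illustrations, not the library's truncate_utf8_bytes, and spending the budget item by item in ranked order is an assumption.

```python
def truncate_to_utf8_budget(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    if max_bytes <= 0:
        return ""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes can split a multi-byte character; errors="ignore"
    # drops the incomplete trailing sequence instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def apply_budget(contents: list[str], budget_bytes: int) -> list[str]:
    """Spend the byte budget on contents in order, trimming the last one that fits."""
    remaining = budget_bytes
    kept: list[str] = []
    for content in contents:
        if remaining <= 0:
            break
        piece = truncate_to_utf8_budget(content, remaining)
        if piece:
            kept.append(piece)
            remaining -= len(piece.encode("utf-8"))
    return kept
```

For example, "héllo" is five characters but six bytes, so a budget of 2 keeps only "h": the two-byte "é" does not fit in the one remaining byte.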

Skips and safety

Directories not descended: among others, .git, __pycache__, .venv, venv, dist, build, node_modules, .tox, reports, .pytest_cache, .mypy_cache, and *.egg-info directories.

Paths not read (content never loaded): examples include .env, *.env, names matching *secret*, id_rsa, *.pem, credentials.json, forge-certificator-secrets.env. These appear in excluded_files as (path, reason) pairs.
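The skip rules above amount to name and glob matching. The following sketch uses only a subset of the listed names; the constants, the function names, the case-insensitive filename comparison, and the reason string format are all assumptions rather than the behaviour of context/repo_scan.py.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Optional

# Illustrative subsets of the lists above, not the library's full denylists.
SKIP_DIR_NAMES = {".git", "__pycache__", ".venv", "venv", "node_modules"}
SKIP_DIR_PATTERNS = ("*.egg-info",)
SECRET_NAME_PATTERNS = (".env", "*.env", "*secret*", "id_rsa", "*.pem", "credentials.json")


def is_skipped_dir(name: str) -> bool:
    """True when a directory name should not be descended into."""
    return name in SKIP_DIR_NAMES or any(fnmatchcase(name, p) for p in SKIP_DIR_PATTERNS)


def secret_reason(path: str) -> Optional[str]:
    """Return a reason string if the filename looks like a secret, else None."""
    # Lowercasing before matching is an assumption made for this sketch.
    name = PurePosixPath(path).name.lower()
    for pattern in SECRET_NAME_PATTERNS:
        if fnmatchcase(name, pattern):
            return f"secret_filename:{pattern}"  # hypothetical reason format
    return None
```

The check looks only at the filename and never opens the file, which matches the guarantee that the content of such paths is never loaded.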

Per-file preview cap: MAX_PREVIEW_UTF8_BYTES (128 KiB). Larger files contribute only an excerpt, and the reason string may note preview_capped_bytes.
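A capped read along these lines could look like the sketch below. The function read_preview is hypothetical, and treating NUL bytes as a binary marker is an assumption. Only the 128 KiB value comes from the text above.

```python
from pathlib import Path
from typing import Optional

MAX_PREVIEW_UTF8_BYTES = 128 * 1024  # 128 KiB, as stated above


def read_preview(path: Path, cap: int = MAX_PREVIEW_UTF8_BYTES) -> Optional[tuple[str, bool]]:
    """Return (preview_text, was_capped), or None for binary or invalid UTF-8."""
    with path.open("rb") as handle:
        raw = handle.read(cap + 1)  # one extra byte reveals whether the file exceeds the cap
    capped = len(raw) > cap
    raw = raw[:cap]
    if b"\x00" in raw:
        return None  # NUL bytes taken as a sign of binary content (assumption)
    try:
        return raw.decode("utf-8"), capped
    except UnicodeDecodeError as exc:
        # A cap that lands inside a multi-byte character is not a real encoding
        # error: keep the valid part before the incomplete tail (at most 3 bytes).
        tail_only = exc.start >= len(raw) - 3 and exc.reason == "unexpected end of data"
        if capped and tail_only:
            return raw[: exc.start].decode("utf-8"), capped
        return None
```

Reading cap + 1 bytes keeps memory bounded for very large files while still letting the caller record that the preview was capped.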

Non-text files are listed in `warnings` (capped list length) and are not ranked into the pack.

Usage

From the repo root, with src on PYTHONPATH (or after pip install -e ".[dev]"):

from pathlib import Path
from forge_lcdl.context import build_context_pack, context_pack_to_dict

pack = build_context_pack(
    "Update tests for task_id pw_page_kind_route",
    Path("/path/to/forge-lcdl"),
    budget_chars=40_000,
)
d = context_pack_to_dict(pack)  # JSON-serializable dict

The optional now_iso argument (for example now_iso=lambda: "…") fixes provenance["built_at"], which keeps the output deterministic in tests.

Serialization

Use context_pack_to_dict (and context_item_to_dict) for JSON-friendly structures: enums as strings, tuples as lists, excluded_files as {"path", "reason"} objects.
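The conversion rules can be sketched on stand-in types. DemoKind, DemoItem, DemoPack, to_jsonable, and demo_pack_to_dict below are hypothetical and do not reproduce the real ContextPack fields; they only show enums becoming strings, tuples becoming lists, and excluded_files pairs becoming {"path", "reason"} objects.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from enum import Enum
from typing import Any


class DemoKind(Enum):  # hypothetical stand-in for the item kind enum
    FILE = "file"
    EXCERPT = "excerpt"


@dataclass(frozen=True)
class DemoItem:
    path: str
    kind: DemoKind
    reasons: tuple[str, ...]


@dataclass(frozen=True)
class DemoPack:
    items: tuple[DemoItem, ...]
    excluded_files: tuple[tuple[str, str], ...]


def to_jsonable(value: Any) -> Any:
    """Recursively convert enums to their values and tuples to lists."""
    if isinstance(value, Enum):
        return value.value
    if is_dataclass(value) and not isinstance(value, type):
        return {f.name: to_jsonable(getattr(value, f.name)) for f in fields(value)}
    if isinstance(value, (tuple, list)):
        return [to_jsonable(v) for v in value]
    return value


def demo_pack_to_dict(pack: DemoPack) -> dict[str, Any]:
    """Build a JSON-friendly dict, with excluded_files as path/reason objects."""
    data = to_jsonable(pack)
    data["excluded_files"] = [
        {"path": path, "reason": reason} for path, reason in pack.excluded_files
    ]
    return data


pack = DemoPack(
    items=(DemoItem("a.py", DemoKind.FILE, ("keyword:a",)),),
    excluded_files=((".env", "secret_filename"),),
)
print(json.dumps(demo_pack_to_dict(pack), indent=2))
```

Named objects for excluded_files are easier for downstream consumers to read than positional two-element lists, which is presumably why the real serializer uses that shape.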

Composition with messages

messages.py defines LlmMessage / FileRef for transports. A future bridge can turn a ContextPack into a system or user string; Sprint 4 stops at the structured pack.

Model routing and contracts: MODEL-ROUTING.md, CONTRACT-SPEC.md.

Limits

Ranking is keyword-only (deterministic, no semantic search). For semantically richer retrieval, a later sprint might add embeddings while keeping the same ContextPack shape.