forge-lcdl

Extraction playbook: LLM plus deterministic convergence

This document summarizes how forge-lcdl tasks fit into staged document and DOM pipelines: the LLM proposes bounded JSON or code artifacts; deterministic steps validate, execute, and compare to oracle or golden data…

Governance alignment

The numbered stages below implement the library’s policy. Freeze inputs and run the deterministic prelude and segmentation before any LLM classify or synthesize step. Keep payloads capped and run one task per contract. Run Playwright, parsers, and comparisons outside LCDL, in the ingest/runtime layer; LCDL supplies only bounded classify, synthesize, and diagnose JSON. Authorized use, prohibited directions, and contributor norms are spelled out in CONTRIBUTING.md.

Shared pattern

  1. Freeze the input (snapshot URL, frozen HTML, or PDF bytes plus hash).
  2. Deterministic prelude — no LLM: scroll/wait, heading inventory, text-density checks, layout coordinates.
  3. Segment in code — fixed chunk strategy (DOM chunks, PDF pages, paragraph windows).
  4. LLM classify / route — small JSON per batch (e.g. pw_chunk_classify v1).
  5. LLM synthesize — only exemplar payloads plus a probe summary, not full raw source (pw_extractor_synthesize_exemplar v1, or pw_extractor_synthesize_probe v1 for probe-only).
  6. Deterministic execute — run generated Playwright extractor or PDF parser; capture exceptions and row counts.
  7. Compare — prefix of oracle rows, schema, or checksum; feed compact facts to diagnose (pw_incremental_diagnose v1).
  8. Hint feedback — append hint_additions to operator hints; retry with budget and MC sampling.
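The loop formed by steps 5 to 8 can be sketched as follows. This is a minimal illustration, not forge-lcdl code: `converge`, `RunState`, `prefix_matches`, and the `synthesize` / `execute` / `diagnose` callables are hypothetical names, and the real task calls would go through `run_task` with whatever signature the library defines.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class RunState:
    """State carried across retries (illustrative names, not library API)."""
    operator_hints: list = field(default_factory=list)
    llm_calls: int = 0


def freeze(source: bytes) -> str:
    """Step 1: identify the frozen input by its SHA-256 digest."""
    return hashlib.sha256(source).hexdigest()


def prefix_matches(rows, oracle) -> int:
    """Step 7: count how many leading rows equal the oracle rows."""
    matched = 0
    for got, want in zip(rows, oracle):
        if got != want:
            break
        matched += 1
    return matched


def converge(source, oracle, synthesize, execute, diagnose, budget):
    """Steps 5-8: synthesize, execute, compare, then feed hints back."""
    state = RunState()
    digest = freeze(source)
    while state.llm_calls < budget:
        extractor = synthesize(state.operator_hints)  # LLM step (assumed callable)
        state.llm_calls += 1
        try:
            rows = execute(extractor, source)  # deterministic step
            error = None
        except Exception as exc:  # capture the failure as a compact fact
            rows, error = [], repr(exc)
        matched = prefix_matches(rows, oracle)
        if error is None and matched == len(oracle):
            return {"ok": True, "digest": digest,
                    "llm_calls": state.llm_calls, "rows": rows}
        if state.llm_calls >= budget:
            break
        facts = {"error": error, "row_count": len(rows),
                 "matched_prefix": matched}
        for hint in diagnose(facts):  # LLM step returning hint_additions
            if hint not in state.operator_hints:
                state.operator_hints.append(hint)
        state.llm_calls += 1
    return {"ok": False, "digest": digest,
            "llm_calls": state.llm_calls, "rows": []}
```

Only compact facts (error text, row count, matched prefix length) reach the diagnose step, and the retry loop stops as soon as the call budget is spent.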

Rules: cap every field sent to the model; one run_task per JSON contract; deterministic parsers and selectors before LLM on ambiguous spans; declare explicit success metrics and global_llm_budget.
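As an illustration of the capping and budget rules, the sketch below truncates every string and list in a payload and counts model calls against a global limit. The helper names and the default caps are assumptions chosen for the example; only `global_llm_budget` comes from this playbook.

```python
MAX_FIELD_CHARS = 2000  # assumed cap; tune per task contract
MAX_LIST_ITEMS = 20     # assumed cap


def cap_payload(value, max_chars=MAX_FIELD_CHARS, max_items=MAX_LIST_ITEMS):
    """Recursively truncate strings and lists so no field is unbounded."""
    if isinstance(value, str):
        return value[:max_chars]
    if isinstance(value, list):
        return [cap_payload(v, max_chars, max_items) for v in value[:max_items]]
    if isinstance(value, dict):
        return {k: cap_payload(v, max_chars, max_items) for k, v in value.items()}
    return value  # numbers, booleans, None pass through unchanged


class LlmBudget:
    """Counts model calls against a global limit (hypothetical helper)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def spend(self):
        """Record one call, or raise once the budget is exhausted."""
        if self.used >= self.limit:
            raise RuntimeError("global_llm_budget exhausted")
        self.used += 1
```

Calling `budget.spend()` immediately before each model call keeps the limit enforced in one place.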

Websites (DOM / HTML)

| Practice | Why |
| --- | --- |
| Put a DOM contract in operator_hints (container, option pattern, stem walk, graded-state cues). | Anchors synthesis; fewer hallucinated selectors. |
| Put selector/tag inventory from page.evaluate into page_probe_summary, not raw HTML dumps. | Saves context; lists valid hooks. |
| Keep chunk strategy in code; use the LLM only to classify or rank chunks. | Stable segmentation. |
| Run the first execute pass on the same frozen URL as gather. | Tight feedback loop. |
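A selector/tag inventory can be condensed before it goes into page_probe_summary. The sketch below assumes the inventory arrives as a list of dicts with `tag`, `role`, and `classes` keys, for example as returned by a page.evaluate script. That shape and the function name are assumptions made for illustration, not the library's schema.

```python
from collections import Counter


def summarize_inventory(elements, top_n=15):
    """Collapse a raw element inventory into a compact selector summary."""
    counts = Counter()
    for el in elements:
        tag = el.get("tag", "").lower()
        if not tag:
            continue  # skip malformed entries
        counts[tag] += 1
        role = el.get("role")
        if role:
            counts[f"{tag}[role={role}]"] += 1
        for cls in el.get("classes", []):
            counts[f"{tag}.{cls}"] += 1
    # Keep only the most frequent hooks so the summary stays bounded.
    return [{"selector": sel, "count": n}
            for sel, n in counts.most_common(top_n)]
```

The result lists selectors that actually exist on the page, which is what lets synthesis avoid invented hooks.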

PDFs (parallel idea)

| Stage | Deterministic | LLM (optional future tasks) |
| --- | --- | --- |
| Ingest | Text plus layout (pypdf, pdfminer); text-density gate for OCR | Page or region kind |
| Segment | Pages, blocks, table bbox heuristics | Region labels on text windows only |
| Tables | Grid from positions / rules | Small local prompt for ambiguous merged cells only |
| QA | Row prefix vs golden | Diagnose JSON mirroring incremental MC |

Ground every LLM answer in page and char offset or bbox ids from the preprocessor when possible. Avoid base64 page images unless using an explicit vision task.
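The text-density gate and offset grounding can be sketched in plain Python, assuming the preprocessor has already produced one text string per page (for example via pypdf or pdfminer). The function names, the window size, and the 50-character threshold are illustrative assumptions.

```python
def needs_ocr(page_text, min_chars=50):
    """Text-density gate: flag pages with too little extractable text."""
    visible = "".join(page_text.split())  # drop all whitespace
    return len(visible) < min_chars


def make_windows(pages, size=400):
    """Cut per-page text into windows tagged with (page, char_offset)."""
    windows = []
    for page_no, text in enumerate(pages, start=1):
        for offset in range(0, len(text), size):
            windows.append({"page": page_no,
                            "char_offset": offset,
                            "text": text[offset:offset + size]})
    return windows


def resolve_span(pages, page, char_offset, length):
    """Return the text an answer cites, or None if the citation is invalid."""
    if not 1 <= page <= len(pages):
        return None
    text = pages[page - 1]
    if char_offset < 0 or length <= 0 or char_offset + length > len(text):
        return None
    return text[char_offset:char_offset + length]
```

An answer whose citation resolves to `None` can be rejected deterministically, without another model call.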

Page mechanics and Playwright discovery

For source-ingest flows that combine run_task with Playwright probes and mechanics validation, see:

  • PLAYWRIGHT-DISCOVERY.md — runtime vs LCDL responsibilities, end-to-end loop, example payloads, safety, testing.
  • PAGE-MECHANICS.md — checklist vs discover schema conventions and mechanics-focused task IDs.

forge-lcdl task index (Playwright builder)

| Task id | Version | Role |
| --- | --- | --- |
| pw_chunk_classify | v1 | Classify DOM chunks for MCQ-like blocks |
| pw_extractor_synthesize_exemplar | v1 | Synthesize extract_questions from exemplars + probe summary |
| pw_extractor_synthesize_probe | v1 | Synthesize from full page probe only |
| pw_incremental_diagnose | v1 | Hint text from compact failure payload |
| pw_page_kind_route | v1 | Classify page_kind from bounded probe; suggest strategies and follow-up probes |
| pw_quiz_mechanics_discover | v1 | Infer interactive-quiz mechanics (schema_version page_mechanics.v1; object-shaped actions) |
| pw_static_mcq_mechanics_discover | v1 | Infer static HTML/text MCQ mechanics (declarative strategies; no Python) |
| pw_grading_signal_infer | v1 | Infer correct_signal from submit deltas / feedback (no Playwright) |
| pw_selector_harden | v1 | Rank or reject selectors using inventories + constraints |
| pw_mechanics_repair | v1 | Propose repaired_mechanics after deterministic validation failure |
| pw_network_api_route_infer | v1 | Infer API-backed quiz hints from bounded network_events only |
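A caller might guard against typos in task ids by checking them against the index above before dispatching. The sketch below is hypothetical: the request envelope it builds is an illustrative shape, not the actual forge_lcdl.run_task input, which should be taken from the library itself.

```python
# Task ids and versions copied from the index above.
KNOWN_TASKS = {
    "pw_chunk_classify": "v1",
    "pw_extractor_synthesize_exemplar": "v1",
    "pw_extractor_synthesize_probe": "v1",
    "pw_incremental_diagnose": "v1",
    "pw_page_kind_route": "v1",
    "pw_quiz_mechanics_discover": "v1",
    "pw_static_mcq_mechanics_discover": "v1",
    "pw_grading_signal_infer": "v1",
    "pw_selector_harden": "v1",
    "pw_mechanics_repair": "v1",
    "pw_network_api_route_infer": "v1",
}


def build_request(task_id, payload, version="v1"):
    """Build one request per contract (hypothetical envelope shape)."""
    if KNOWN_TASKS.get(task_id) != version:
        raise KeyError(f"unknown task contract: {task_id} {version}")
    return {"task_id": task_id, "version": version, "payload": payload}
```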

Optional future tasks could mirror the same contract style for PDF regions (pdf_region_classify, etc.).

Operator hint skeletons (paste into operator_hints / env)

Blog-style MCQ: “Stem and four options live in sibling <p> after each <h2>; correct line may read The answer is B — map to correct_index 0–3; iterate page.locator('h2').all().”

App-like quiz DOM: “Use main as root; options are button[role=radio]; map green/emerald styling or a check icon to correct_index; never call a Locator as a function; use .all() for multiples.”

Linear PDF Q bank: “The extractor is Playwright-only for this playbook item. For the PDF path, extract text per page in Python first, then (if needed) make small LLM calls on spans identified by (page, char_offset) only.”
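The blog-style hint's mapping from an answer line to correct_index 0–3 is a case where a deterministic parser can run before any LLM call. The sketch below is one possible parser under stated assumptions: the function name and the exact phrasing it accepts are illustrative, and lines it cannot parse return `None` so they can be deferred to a later step.

```python
import re

# "the answer is" is matched case-insensitively; the option letter must be
# an uppercase A-D standing alone, so words like "because" do not match.
ANSWER_RE = re.compile(r"(?i:\bthe answer is)\s*\(?([A-D])\b")


def correct_index_from_line(line):
    """Map 'The answer is B' style text to a 0-based index, else None."""
    match = ANSWER_RE.search(line)
    if match is None:
        return None
    return ord(match.group(1)) - ord("A")
```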

  • PLAYWRIGHT-DISCOVERY.md — Consumer-facing orchestration for probes, routing, mechanics, repair, and testing.
  • PAGE-MECHANICS.md — Mechanics schema split (page_mechanics_v1 vs page_mechanics.v1) and focused tasks.
  • forge-certificators playwright_llm_page_discovery delegates these calls to forge_lcdl.run_task and read_certificator_profile() (LLM_* env vars).
  • OEP incremental MC pack defaults hints/oep-pmp-dom.txt into --page-hints when OEP_INCREMENTAL_PAGE_HINTS is unset.