Extraction playbook: LLM plus deterministic convergence
This document summarizes how forge-lcdl tasks fit into staged document and DOM pipelines: the LLM proposes bounded JSON or code artifacts; deterministic steps validate, execute, and compare to oracle or golden data…
Governance alignment
The numbered stages below implement the library’s policy. Freeze inputs, then run the deterministic prelude and segmentation before any LLM classify or synthesize step. Keep payloads capped and handle one contract per task. Run Playwright, parsers, and comparisons outside LCDL, in the ingest/runtime layer; LCDL supplies only bounded classify, synthesize, and diagnose JSON. Authorized use, prohibited directions, and contributor norms are spelled out in CONTRIBUTING.md.
Shared pattern
1. Freeze the input (snapshot URL, frozen HTML, or PDF bytes plus hash).
2. Deterministic prelude (no LLM): scroll/wait, heading inventory, text-density checks, layout coordinates.
3. Segment in code: fixed chunk strategy (DOM chunks, PDF pages, paragraph windows).
4. LLM classify / route: small JSON per batch (e.g. pw_chunk_classify v1).
5. LLM synthesize: only exemplar payloads plus a probe summary, not full raw source (pw_extractor_synthesize_exemplar v1, or pw_extractor_synthesize_probe v1 for probe-only).
6. Deterministic execute: run the generated Playwright extractor or PDF parser; capture exceptions and row counts.
7. Compare: prefix of oracle rows, schema, or checksum; feed compact facts to diagnose (pw_incremental_diagnose v1).
8. Hint feedback: append hint_additions to operator hints; retry with budget and MC sampling.
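The stages above can be sketched as a small retry loop. This is a minimal illustration under stated assumptions, not forge-lcdl's actual orchestration: `freeze`, `converge`, and the injected `synthesize`, `execute`, and `diagnose` callables are hypothetical names, `hint_additions` is assumed to be a list of strings, and MC sampling is left out for brevity.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenInput:
    source: str     # snapshot URL or file path
    content: bytes  # frozen HTML or PDF bytes
    sha256: str     # hash recorded at freeze time


def freeze(source: str, content: bytes) -> FrozenInput:
    """Stage 1: pin the input and record its hash."""
    return FrozenInput(source, content, hashlib.sha256(content).hexdigest())


def converge(frozen, synthesize, execute, diagnose, oracle_prefix, hints, budget):
    """Stages 5-8: synthesize, execute, compare, then feed hints back.

    All three callables are hypothetical stand-ins: `synthesize` and
    `diagnose` would wrap LLM tasks, `execute` is deterministic.
    """
    hints = list(hints)
    for attempt in range(1, budget + 1):
        artifact = synthesize(frozen, hints)
        try:
            rows, error = execute(artifact, frozen), None
        except Exception as exc:  # capture, do not crash the loop
            rows, error = [], repr(exc)

        # Compare: the first rows must equal the oracle prefix.
        if error is None and rows[: len(oracle_prefix)] == oracle_prefix:
            return {"ok": True, "attempts": attempt, "rows": rows, "hints": hints}

        # Only compact facts go to diagnose, never the raw source.
        facts = {
            "error": error,
            "row_count": len(rows),
            "expected_prefix_len": len(oracle_prefix),
        }
        hints.extend(diagnose(facts).get("hint_additions", []))
    return {"ok": False, "attempts": budget, "rows": [], "hints": hints}
```

Passing the LLM-backed steps in as callables keeps the loop itself deterministic, so it can be tested with fakes before any model is involved.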
Rules: cap every field sent to the model; one run_task per JSON contract; deterministic parsers and selectors before LLM on ambiguous spans; declare explicit success metrics and global_llm_budget.
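Two of these rules, capping fields and enforcing a global budget, are easy to make mechanical. The sketch below is one possible shape, assuming the budget is counted in LLM calls; `cap_payload`, `LlmBudget`, and the default limits are hypothetical and not part of forge-lcdl.

```python
def cap_payload(payload: dict, max_chars: int = 2000, max_items: int = 20) -> dict:
    """Return a copy with strings truncated and lists shortened.

    Limits are illustrative defaults, not values taken from the library.
    """
    capped = {}
    for key, value in payload.items():
        if isinstance(value, str):
            capped[key] = value[:max_chars]
        elif isinstance(value, list):
            capped[key] = [
                item[:max_chars] if isinstance(item, str) else item
                for item in value[:max_items]
            ]
        else:
            capped[key] = value
    return capped


class LlmBudget:
    """Hypothetical call counter standing in for global_llm_budget."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def spend(self, calls: int = 1) -> None:
        # Refuse before the call is made, so the budget is never exceeded.
        if self.used + calls > self.limit:
            raise RuntimeError(f"LLM budget exhausted ({self.used}/{self.limit})")
        self.used += calls
```

Calling `budget.spend()` immediately before each model call, and `cap_payload(...)` on every payload, turns both rules into checks that fail loudly instead of conventions that drift.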
Websites (DOM / HTML)
| Practice | Why |
|---|---|
| Put a DOM contract in operator_hints (container, option pattern, stem walk, graded-state cues). | Anchors synthesis; fewer hallucinated selectors. |
| Put selector/tag inventory from page.evaluate into page_probe_summary, not raw HTML dumps. | Saves context; lists valid hooks. |
| Keep chunk strategy in code; use the LLM only to classify or rank chunks. | Stable segmentation. |
| Run the first execute pass on the same frozen URL as gather. | Tight feedback loop. |
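As an illustration of the inventory practice in the table, the sketch below builds a compact summary from a single `page.evaluate` call. It assumes a Playwright sync-API `page` object; `build_probe_summary`, `INVENTORY_JS`, and the summary layout are hypothetical, and the real page_probe_summary schema may differ.

```python
# JavaScript run in the page: count tags and class names, return plain objects.
INVENTORY_JS = """
() => {
  const tags = {};
  const classes = {};
  for (const el of document.querySelectorAll('*')) {
    const tag = el.tagName.toLowerCase();
    tags[tag] = (tags[tag] || 0) + 1;
    for (const cls of el.classList) {
      classes[cls] = (classes[cls] || 0) + 1;
    }
  }
  return {tags, classes};
}
"""


def build_probe_summary(page, top_n: int = 15) -> dict:
    """Return the most frequent tags and classes instead of raw HTML.

    `page` only needs an `evaluate(expression)` method, as a Playwright
    page provides. The output layout is an assumption for illustration.
    """
    raw = page.evaluate(INVENTORY_JS)

    def top(counts: dict) -> list:
        # Most frequent first; ties broken by name for stable output.
        ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [[name, count] for name, count in ordered[:top_n]]

    return {"tags": top(raw["tags"]), "classes": top(raw["classes"])}
```

Because the summary is a bounded list of names and counts, its size stays roughly constant however large the page is.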
PDFs (parallel idea)
| Stage | Deterministic | LLM (optional future tasks) |
|---|---|---|
| Ingest | Text plus layout (pypdf, pdfminer); text-density gate for OCR | Page or region kind |
| Segment | Pages, blocks, table bbox heuristics | Region labels on text windows only |
| Tables | Grid from positions / rules | Small local prompt for ambiguous merged cells only |
| QA | Row prefix vs golden | Diagnose JSON mirroring incremental MC |
Ground every LLM answer in page and char offset or bbox ids from the preprocessor when possible. Avoid base64 page images unless using an explicit vision task.
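A minimal sketch of the grounding idea, assuming page text has already been extracted by the preprocessor (for example with pypdf or pdfminer). `needs_ocr`, `text_windows`, and the thresholds are hypothetical names and values chosen for illustration.

```python
from typing import Iterator


def needs_ocr(page_text: str, min_chars: int = 50) -> bool:
    """Text-density gate: very little extractable text suggests a scanned page."""
    return len(page_text.strip()) < min_chars


def text_windows(pages: list, window: int = 400, overlap: int = 50) -> Iterator[dict]:
    """Yield fixed-size text windows tagged with page number and char offset.

    Each window carries its own (page, char_offset), so any LLM label on it
    can be traced back to an exact span of the preprocessor output.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    for page_no, text in enumerate(pages, start=1):
        for start in range(0, len(text), step):
            chunk = text[start : start + window]
            if chunk.strip():
                yield {"page": page_no, "char_offset": start, "text": chunk}
```

A label returned for a window can then be stored alongside its `page` and `char_offset`, rather than as free text that has to be re-located in the PDF.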
Page mechanics and Playwright discovery
For source-ingest flows that combine run_task with Playwright probes and mechanics validation, see:
- PLAYWRIGHT-DISCOVERY.md — runtime vs LCDL responsibilities, end-to-end loop, example payloads, safety, testing.
- PAGE-MECHANICS.md — checklist vs discover schema conventions and mechanics-focused task IDs.
forge-lcdl task index (Playwright builder)
| Task id | Version | Role |
|---|---|---|
| pw_chunk_classify | v1 | Classify DOM chunks for MCQ-like blocks |
| pw_extractor_synthesize_exemplar | v1 | Synthesize extract_questions from exemplars + probe summary |
| pw_extractor_synthesize_probe | v1 | Synthesize from full page probe only |
| pw_incremental_diagnose | v1 | Hint text from compact failure payload |
| pw_page_kind_route | v1 | Classify page_kind from bounded probe; suggest strategies and follow-up probes |
| pw_quiz_mechanics_discover | v1 | Infer interactive-quiz mechanics (schema_version page_mechanics.v1; object-shaped actions) |
| pw_static_mcq_mechanics_discover | v1 | Infer static HTML/text MCQ mechanics (declarative strategies; no Python) |
| pw_grading_signal_infer | v1 | Infer correct_signal from submit deltas / feedback (no Playwright) |
| pw_selector_harden | v1 | Rank or reject selectors using inventories + constraints |
| pw_mechanics_repair | v1 | Propose repaired_mechanics after deterministic validation failure |
| pw_network_api_route_infer | v1 | Infer API-backed quiz hints from bounded network_events only |
Optional future tasks could mirror the same contract style for PDF regions (pdf_region_classify, etc.).
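One way to keep callers honest about the index above is a small local registry that rejects unknown task ids before anything reaches the model. The sketch assumes nothing about the real run_task signature: `TASK_INDEX`, `build_request`, and the request envelope are hypothetical, and only the task ids and versions come from the table.

```python
# Task ids and versions copied from the index table above.
TASK_INDEX = {
    "pw_chunk_classify": "v1",
    "pw_extractor_synthesize_exemplar": "v1",
    "pw_extractor_synthesize_probe": "v1",
    "pw_incremental_diagnose": "v1",
    "pw_page_kind_route": "v1",
    "pw_quiz_mechanics_discover": "v1",
    "pw_static_mcq_mechanics_discover": "v1",
    "pw_grading_signal_infer": "v1",
    "pw_selector_harden": "v1",
    "pw_mechanics_repair": "v1",
    "pw_network_api_route_infer": "v1",
}


def build_request(task_id: str, payload: dict) -> dict:
    """Build one request per JSON contract; the envelope shape is hypothetical."""
    if task_id not in TASK_INDEX:
        raise ValueError(f"unknown task id: {task_id!r}")
    return {"task_id": task_id, "version": TASK_INDEX[task_id], "payload": payload}
```

For example, `build_request("pw_chunk_classify", {"chunks": []})` yields an envelope with version `"v1"`, while a misspelled id fails immediately instead of producing a malformed call.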
Operator hint skeletons (paste into operator_hints / env)
Blog-style MCQ: “Stem and four options live in sibling <p> after each <h2>; correct line may read The answer is B — map to correct_index 0–3; iterate page.locator('h2').all().”
App-like quiz DOM: “Use main as root; options are button[role=radio]; map green/emerald styling or a check icon to the matching correct_index; never call a Locator as a function; use .all() for multiples.”
Linear PDF Q bank: “Extractor is Playwright-only for this playbook item — PDF path: extract text per page in Python first, then (if needed) small LLM calls on spans with (page, char_offset) only.”
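The letter-to-index mapping in the blog-style hint is a good candidate for deterministic code rather than an LLM call. The sketch below assumes answer lines follow the "The answer is B" pattern and that options run from A to D; `answer_line_to_index` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches "The answer is B" (case-insensitive); the letter must stand alone.
_ANSWER_RE = re.compile(r"\bThe answer is\s+([A-D])\b", re.IGNORECASE)


def answer_line_to_index(line: str) -> Optional[int]:
    """Map an answer line to correct_index 0-3, or None if it does not match."""
    match = _ANSWER_RE.search(line)
    if match is None:
        return None
    return ord(match.group(1).upper()) - ord("A")
```

Returning `None` for non-matching lines lets the caller treat those spans as ambiguous and route only them to an LLM step.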
Related
- PLAYWRIGHT-DISCOVERY.md: consumer-facing orchestration for probes, routing, mechanics, repair, and testing.
- PAGE-MECHANICS.md: mechanics schema split (page_mechanics_v1 vs page_mechanics.v1) and focused tasks.
- forge-certificators playwright_llm_page_discovery delegates these calls to forge_lcdl.run_task and read_certificator_profile() (LLM_* env vars).
- OEP incremental MC pack defaults hints/oep-pmp-dom.txt into --page-hints when OEP_INCREMENTAL_PAGE_HINTS is unset.