Handbook
Playwright source-ingest discovery with forge-lcdl
This guide describes how a consumer repo should orchestrate Playwright (probes, validation, execution) while delegating bounded JSON inference to forge_lcdl.run_task or TaskRunner. It references real task IDs and schema…
1. Responsibility split
| Responsibility | Owner |
|---|---|
| Open pages, scroll, click, capture DOM snapshots, network traces, screenshots (if allowed) | Source-ingest runtime |
| Cap and serialize page_probe, interaction_probe, network_events, chunk summaries | Runtime (deterministic) |
| Call run_task(task_id, "v1", payload, profile=profile, chat=…) | Runtime |
| Infer page_kind, mechanics specs, repairs, grading signals, selector ranks, API-route hints | LCDL tasks (LLM behind injectable chat) |
| Validate mechanics shape and semantics; retry or abort | Runtime (deterministic; may import forge_lcdl.schemas.page_mechanics_v1) |
| Execute mechanics via your generic runner (clicks, reads, loop policy) | Runtime |
LCDL does not run Playwright, perform live navigation inside tasks, or execute outbound HTTP for these contracts. It classifies and synthesizes JSON from provided probes only.
LCDL should not replace your full browser extraction loop (budgeted retries, bank persistence, deduplication). Tasks are single-shot JSON contracts with capped user payloads (typically 100000 UTF-8 bytes—see each contract).
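The payload cap mentioned above can be checked deterministically before a task is called. The following is a minimal sketch, not forge-lcdl's own check: `MAX_PAYLOAD_BYTES`, `payload_size_bytes`, and `fits_cap` are hypothetical names, the 100000 figure is only the typical value quoted above, and how forge-lcdl itself serializes the payload before measuring it is an assumption here.

```python
import json

# Assumption: the typical cap quoted in this guide; read the real limit
# from each task contract.
MAX_PAYLOAD_BYTES = 100_000


def payload_size_bytes(payload: dict) -> int:
    """Return the UTF-8 size of the payload serialized as compact JSON."""
    text = json.dumps(payload, ensure_ascii=False, separators=(",", ":"))
    return len(text.encode("utf-8"))


def fits_cap(payload: dict, cap: int = MAX_PAYLOAD_BYTES) -> bool:
    """True when the serialized payload stays within the byte cap."""
    return payload_size_bytes(payload) <= cap
```

A runtime could call `fits_cap` after capping probes and trim excerpts further when it returns False, rather than letting the task reject the payload.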
2. End-to-end flow (recommended)
- Collect page_probe: Bounded snapshot aligned with page_probe_v1 (title, landmarks, selector inventory, text excerpts, whatever your schema defines). Avoid dumping unrestricted HTML; prefer inventories and short excerpts.
- pw_page_kind_route: Pass url, page_probe, optional operator_hints. Read page_kind, confidence, supported_strategies, evidence, and next_probe_needed (interaction_probe, network_probe, static_chunk_probe booleans).
- Collect follow-up probes as flags suggest:
  - Interaction probe: Scripted clicks relevant to quiz flow; emit interaction_probe_v1 (or your bounded variant).
  - Network probe: List of observed requests/responses as rows (method, url_path, status, content_type, request_excerpt, response_shape summary, not raw secrets).
  - Static chunk probe: Chunk summaries for pw_chunk_classify if the page is mostly static MCQ HTML.
- Infer mechanics (pick one primary path per surface):
  - Interactive quiz DOM → pw_quiz_mechanics_discover (requires allowed_action_kinds, source_constraints).
  - Static MCQ page → pw_static_mcq_mechanics_discover.
  - Optional pw_network_api_route_infer when traces suggest API-backed flows; candidate URLs must appear only in supplied network_events (validated in forge-lcdl).
- Validate deterministically: Enums, forbidden action kinds, required keys; if you normalize to checklist page_mechanics_v1 with array actions, call validate_page_mechanics_shape where appropriate.
- Repair if needed: On validation or runner failure, build validation_failure (stage, expected, actual, short exception, optional UI hints) and call pw_mechanics_repair. Feed repaired_mechanics back into your validator before execution.
- Execute: Generic runner applies declarative steps only (no arbitrary JS execution).
Optional refinements (same ingest pass, separate run_task calls):
- pw_grading_signal_infer: When grading signals are unclear from probes alone but submit deltas exist.
- pw_selector_harden: When you have selector_candidates plus element_inventory / snippets.
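The route, discover, validate, and repair steps of this flow can be sketched as one bounded loop. This is an illustration under stated assumptions, not the forge-lcdl API: `call_task` is a hypothetical adapter that a consumer would write around run_task and that returns the decoded task output, `validate` is a hypothetical validator returning None or a validation_failure dict, and the `DISCOVER_TASK_BY_KIND` mapping and the shared payload keys are assumptions.

```python
from typing import Callable, Optional

# Hypothetical adapter: (task_id, payload) -> decoded task output dict.
# A real consumer would wrap run_task(task_id, "v1", payload, ...) here.
TaskCall = Callable[[str, dict], dict]
# Hypothetical validator: None when valid, else a validation_failure dict.
Validator = Callable[[dict], Optional[dict]]

# Assumption: which discovery task serves which page_kind.
DISCOVER_TASK_BY_KIND = {
    "interactive_quiz": "pw_quiz_mechanics_discover",
    "static_mcq_page": "pw_static_mcq_mechanics_discover",
}


def discover_mechanics(
    call_task: TaskCall,
    validate: Validator,
    url: str,
    page_probe: dict,
    extra_payload: dict,
    max_repairs: int = 1,
) -> Optional[dict]:
    """Route, discover, validate, and repair at most max_repairs times."""
    route = call_task("pw_page_kind_route", {"url": url, "page_probe": page_probe})
    task_id = DISCOVER_TASK_BY_KIND.get(route.get("page_kind"))
    if task_id is None:
        return None  # e.g. login_or_blocked or unknown: abort
    mechanics = call_task(task_id, {"url": url, "page_probe": page_probe, **extra_payload})
    for attempt in range(max_repairs + 1):
        failure = validate(mechanics)
        if failure is None:
            return mechanics  # validated; safe to hand to the runner
        if attempt == max_repairs:
            break  # repair budget exhausted
        repaired = call_task(
            "pw_mechanics_repair",
            {"url": url, "previous_mechanics": mechanics, "validation_failure": failure},
        )
        mechanics = repaired["repaired_mechanics"]
    return None
```

Injecting both callables keeps the loop deterministic under test and leaves retry budgets in the runtime, where this guide places them.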
3. Example payloads
Values below are illustrative; keep real probes bounded and strip secrets.
3.1 page_probe (minimal shape)
Your runtime should align with page_probe_v1. Example skeleton:
{
"schema_version": "page_probe_v1",
"probe_id": "probe-001",
"title_text": "Practice quiz",
"landmarks": ["main", "navigation"],
"selector_inventory": ["button.submit", "[data-testid='quiz-root']"]
}
3.2 interaction_probe (minimal shape)
Example skeleton (interaction_probe_v1):
{
"schema_version": "interaction_probe_v1",
"probe_id": "probe-001",
"action_index": 0,
"outcome_summary": "Submitted dummy answer; options gained correct/incorrect classes."
}
3.3 pw_page_kind_route — input / output
Input (task_id: pw_page_kind_route, version v1):
{
"url": "https://example.com/quiz",
"operator_hints": "Authorized practice bank; prefer interactive_quiz if submit reveals answers.",
"page_probe": {
"schema_version": "page_probe_v1",
"probe_id": "probe-001",
"title_text": "Week 3 review"
}
}
Output (Ok.value):
{
"page_kind": "interactive_quiz",
"confidence": 0.72,
"supported_strategies": ["interactive_reveal"],
"evidence": ["Multiple radio options inside main", "Submit button visible"],
"next_probe_needed": {
"interaction_probe": true,
"network_probe": false,
"static_chunk_probe": false
}
}
Allowed page_kind strings are fixed in the contract (e.g. static_mcq_page, interactive_quiz, api_backed_quiz, login_or_blocked, unknown, …).
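The next_probe_needed flags in the output above can be turned into a list of follow-up probes with a few lines of deterministic code. This is a minimal sketch; `PROBE_FLAGS` and `probes_to_collect` are hypothetical names, and treating anything other than a literal true as "not needed" is a design assumption.

```python
from typing import List

# Flag names taken from the next_probe_needed object shown above.
PROBE_FLAGS = ("interaction_probe", "network_probe", "static_chunk_probe")


def probes_to_collect(route_output: dict) -> List[str]:
    """Return the follow-up probe names whose flag is exactly True."""
    flags = route_output.get("next_probe_needed") or {}
    return [name for name in PROBE_FLAGS if flags.get(name) is True]
```

Requiring a strict boolean means a malformed value such as the string "true" does not trigger an extra probe.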
3.4 pw_quiz_mechanics_discover — input / output (abbreviated)
Input:
{
"url": "https://example.com/quiz",
"operator_hints": "Single-choice; four options per question.",
"page_probe": {
"schema_version": "page_probe_v1",
"probe_id": "probe-001"
},
"interaction_probe": {
"schema_version": "interaction_probe_v1",
"probe_id": "probe-001",
"action_index": 0
},
"source_constraints": {
"expected_option_count": 4,
"expected_choice_mode": "single",
"allowed_modes": ["interactive_reveal", "bank_backed_pass"]
},
"allowed_action_kinds": [
"click",
"click_if_present",
"click_option_by_index",
"wait_for_selector",
"read_text",
"read_class"
]
}
Output (required keys per contract; actions is an object of phases):
{
"schema_version": "page_mechanics.v1",
"page_kind": "interactive_quiz",
"confidence": 0.84,
"question": {
"container_selector": ".quiz-question",
"stem_selector": ".question-text",
"options_selector": ".answer-option",
"choice_mode": "single",
"expected_option_count": 4
},
"actions": {
"submit": { "kind": "click", "selector": "button.submit" },
"select_answer": { "kind": "click_option_by_index", "options_selector": ".answer-option" }
},
"grading": {
"mode": "reveal_after_submit",
"correct_signal": {
"kind": "css_class_or_feedback_text",
"correct_selector": ".correct",
"incorrect_selector": ".incorrect",
"text_patterns": []
}
},
"loop": {
"termination": "no_next_button_or_score_page",
"duplicate_question_policy": "stop",
"max_questions_default": 200
},
"safety": { "forbid_freeform_js": true, "max_clicks_per_question": 4 },
"notes": "Spec synthesized from probes; validate selectors live."
}
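Before execution, the runtime can confirm that every action in a spec like the one above uses only the kinds it allowed in the request. The sketch below assumes the phase-object form of actions shown in this example; `find_forbidden_action_kinds` is a hypothetical helper and does not replace validate_page_mechanics_shape or any other forge-lcdl validator.

```python
from typing import List


def find_forbidden_action_kinds(mechanics: dict, allowed_kinds: List[str]) -> List[str]:
    """Return 'phase:kind' entries whose kind is not in allowed_kinds."""
    actions = mechanics.get("actions", {})
    if not isinstance(actions, dict):
        # The array (checklist) form is out of scope for this sketch.
        return ["actions:not_an_object"]
    allowed = set(allowed_kinds)
    problems = []
    for phase, step in sorted(actions.items()):
        kind = step.get("kind") if isinstance(step, dict) else None
        if kind not in allowed:
            problems.append(f"{phase}:{kind}")
    return problems
```

An empty result means no forbidden kinds were found; any entry is a reason to build a validation_failure and repair or abort.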
3.5 pw_mechanics_repair — input / output
Input:
{
"url": "https://example.com/quiz",
"operator_hints": "Grading read failed after submit.",
"previous_mechanics": {
"schema_version": "page_mechanics.v1",
"page_kind": "interactive_quiz",
"grading": {
"correct_signal": {
"kind": "css_class_or_feedback_text",
"correct_selector": ".correct",
"incorrect_selector": ".incorrect"
}
}
},
"validation_failure": {
"stage": "read_grade",
"expected": "correct_index 0..3",
"actual": null,
"exception_one_liner": "no correct selector matched",
"after_submit_text": "Try again.",
"class_delta": ["is-right", "is-wrong"],
"button_inventory": ["Next"]
}
}
Output:
{
"repaired_mechanics": {
"schema_version": "page_mechanics.v1",
"page_kind": "interactive_quiz",
"grading": {
"correct_signal": {
"kind": "css_class_or_feedback_text",
"correct_selector": ".is-right",
"incorrect_selector": ".is-wrong"
}
}
},
"patch_summary": "Updated grading selectors to match post-submit classes.",
"confidence": 0.79,
"issues": []
}
Post-validation rejects code-carrier keys and forbidden kind values; if repaired_mechanics.actions is a list, validate_page_mechanics_shape must pass.
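The validation_failure object in the repair input can be assembled from a runner exception plus whatever UI hints the runtime captured. This is a minimal sketch: the key names come from the example input above, while `build_validation_failure` and the 200-character truncation limit are hypothetical choices.

```python
from typing import Any, Optional, Sequence


def build_validation_failure(
    stage: str,
    expected: str,
    actual: Any,
    exc: Optional[BaseException] = None,
    after_submit_text: Optional[str] = None,
    class_delta: Optional[Sequence[str]] = None,
    button_inventory: Optional[Sequence[str]] = None,
    max_len: int = 200,  # assumption: keeps the payload small
) -> dict:
    """Build a bounded validation_failure dict; optional hints are omitted when absent."""
    failure = {"stage": stage, "expected": expected, "actual": actual}
    if exc is not None:
        text = str(exc)
        # Keep only the first line so tracebacks never reach the payload.
        first_line = text.splitlines()[0] if text else type(exc).__name__
        failure["exception_one_liner"] = first_line[:max_len]
    if after_submit_text is not None:
        failure["after_submit_text"] = after_submit_text[:max_len]
    if class_delta is not None:
        failure["class_delta"] = list(class_delta)
    if button_inventory is not None:
        failure["button_inventory"] = list(button_inventory)
    return failure
```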
3.6 Network trace task (optional)
pw_network_api_route_infer accepts network_events rows with method, url_path, status, content_type, request_excerpt, response_shape. Output candidate arrays must reference only observed url_path values (see contract).
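A consumer can repeat the "observed paths only" rule on its side as a cheap cross-check. The sketch below assumes the caller has already extracted the candidate paths from the task output, since the exact output shape is defined by the contract and not reproduced here; `unobserved_url_paths` is a hypothetical helper.

```python
from typing import Iterable, List


def unobserved_url_paths(candidate_paths: Iterable[str], network_events: List[dict]) -> List[str]:
    """Return candidate paths that never appear as url_path in network_events."""
    observed = {row.get("url_path") for row in network_events}
    return [path for path in candidate_paths if path not in observed]
```

A non-empty result indicates a candidate the probes never saw, which the runtime should treat as a validation failure.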
4. Safety notes
- Scope: Authorized practice, QA, and owned content only.
- Never instruct or automate proctored exam bypass, credential theft, CAPTCHA solving, paywall evasion, or anti-bot circumvention. Task system prompts encode this; consumers must enforce policy in operators and URLs.
- Secrets: Do not place tokens, cookies, or Authorization headers in operator_hints, probe JSON, or notes. Truncate or hash diagnostic payloads in logs.
- Determinism: Prefer validated mechanics and explicit timeouts over “best effort” loops driven solely by the LLM.
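The secrets rule above can be partly enforced in code by redacting known sensitive keys before a probe is serialized. This is a minimal sketch; `SENSITIVE_KEYS` and `redact` are hypothetical, the key list is an assumption to extend per source, and key-based redaction does not catch secrets embedded in free-text values.

```python
from typing import Any

# Assumption: key names treated as secret-bearing (compared case-insensitively).
SENSITIVE_KEYS = {"authorization", "cookie", "set-cookie", "token", "x-api-key"}


def redact(value: Any) -> Any:
    """Recursively replace values stored under sensitive keys with a placeholder."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```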
5. Testing guidance
| Technique | Use |
|---|---|
| Fake chat | Implement chat(messages, **kwargs) -> ChatResult(True, '{"…json…"}') and pass it to run_task or TaskRunner(chat=…). Matches forge-lcdl unit tests. |
| Frozen probe fixtures | Serialize page_probe / interaction_probe / network_events from golden pages you own; version them beside golden HTML snapshots or hashes. |
| Live LLM integration | Optional; forge-lcdl marks gateway tests granite, and a default pytest run skips them unless the required environment is set (see README.md). |
Consumer repos should run contract-level golden tests (payload → expected validator outcome) and smoke Playwright tests separately from LCDL.
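The fake chat technique from the table above can be sketched as follows. The `ChatResult` class here is a hypothetical stand-in defined only so the example runs on its own; in a real test, import the type from forge-lcdl, whose field names may differ from the `ok` and `text` assumed here. `make_fake_chat` is also a hypothetical helper.

```python
import json
from dataclasses import dataclass


@dataclass
class ChatResult:
    # Hypothetical stand-in for the forge-lcdl result type; field names assumed.
    ok: bool
    text: str


def make_fake_chat(canned: dict):
    """Build a chat callable that always answers with canned, encoded as JSON."""
    calls = []

    def chat(messages, **kwargs):
        calls.append(messages)  # record inputs so tests can assert on prompts
        return ChatResult(True, json.dumps(canned))

    chat.calls = calls
    return chat
```

A test would pass the returned callable as the chat argument and then assert both on the task result and on `chat.calls`, so no network access is needed.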
Related documents
- PAGE-MECHANICS.md — Mechanics artifact versions, focused tasks, validation split.
- EXTRACTION-CONVERGENCE.md — Staged convergence playbook and task index.
- CONTRIBUTING.md — Governance and layering expectations.
- ADOPTION.md — Dependency and rollout notes for certificator/workbench.