Handbook

Playwright source-ingest discovery with forge-lcdl

This guide describes how a consumer repo should orchestrate Playwright (probes, validation, execution) while delegating bounded JSON inference to forge_lcdl.run_task or TaskRunner. It references real task IDs and schema…

1. Responsibility split

Responsibility	Owner
Open pages, scroll, click, capture DOM snapshots, network traces, screenshots (if allowed)	Source-ingest runtime
Cap and serialize page_probe, interaction_probe, network_events, chunk summaries	Runtime (deterministic)
Call `run_task(task_id, "v1", payload, profile=profile, chat=…)`	Runtime
Infer `page_kind`, mechanics specs, repairs, grading signals, selector ranks, API-route hints	LCDL tasks (LLM behind injectable `chat`)
Validate mechanics shape and semantics; retry or abort	Runtime (deterministic; may import `forge_lcdl.schemas.page_mechanics_v1`)
Execute mechanics via your generic runner (clicks, reads, loop policy)	Runtime

LCDL does not run Playwright, perform live navigation inside tasks, or execute outbound HTTP for these contracts. It classifies and synthesizes JSON from provided probes only.

LCDL should not replace your full browser extraction loop (budgeted retries, bank persistence, deduplication). Tasks are single-shot JSON contracts with capped user payloads (typically 100000 UTF-8 bytes—see each contract).

2. End-to-end flow (recommended)

Collect page_probe — Bounded snapshot aligned with page_probe_v1 (title, landmarks, selector inventory, text excerpts—whatever your schema defines). Avoid dumping unrestricted HTML; prefer inventories and short excerpts.
pw_page_kind_route — Pass url, page_probe, optional operator_hints. Read page_kind, confidence, supported_strategies, evidence, and next_probe_needed (interaction_probe, network_probe, static_chunk_probe booleans).
Collect follow-up probes as flags suggest:
Interaction probe — Scripted clicks relevant to quiz flow; emit interaction_probe_v1 (or your bounded variant).
Network probe — List of observed requests/responses as rows (method, url_path, status, content_type, request_excerpt, response_shape summary—not raw secrets).
Static chunk probe — Chunk summaries for pw_chunk_classify if the page is mostly static MCQ HTML.
Infer mechanics (pick one primary path per surface):
Interactive quiz DOM → pw_quiz_mechanics_discover (requires allowed_action_kinds, source_constraints).
Static MCQ page → pw_static_mcq_mechanics_discover.
Optional pw_network_api_route_infer when traces suggest API-backed flows—candidate URLs must appear only in supplied network_events (validated in forge-lcdl).
Validate deterministically — Enums, forbidden action kinds, required keys; if you normalize to checklist page_mechanics_v1 with array actions, call validate_page_mechanics_shape where appropriate.
Repair if needed — On validation or runner failure, build validation_failure (stage, expected, actual, short exception, optional UI hints) and call pw_mechanics_repair. Feed repaired_mechanics back into your validator before execution.
Execute — Generic runner applies declarative steps only (no arbitrary JS execution).

Optional refinements (same ingest pass, separate run_task calls):

pw_grading_signal_infer — When grading signals are unclear from probes alone but submit deltas exist.
pw_selector_harden — When you have selector_candidates plus element_inventory / snippets.

3. Example payloads

Values below are illustrative; keep real probes bounded and strip secrets.

3.1 `page_probe` (minimal shape)

Your runtime should align with page_probe_v1. Example skeleton:

{
  "schema_version": "page_probe_v1",
  "probe_id": "probe-001",
  "title_text": "Practice quiz",
  "landmarks": ["main", "navigation"],
  "selector_inventory": ["button.submit", "[data-testid='quiz-root']"]
}

3.2 `interaction_probe` (minimal shape)

Example skeleton (interaction_probe_v1):

{
  "schema_version": "interaction_probe_v1",
  "probe_id": "probe-001",
  "action_index": 0,
  "outcome_summary": "Submitted dummy answer; options gained correct/incorrect classes."
}

3.3 `pw_page_kind_route` — input / output

Input (task_id: pw_page_kind_route, version v1):

{
  "url": "https://example.com/quiz",
  "operator_hints": "Authorized practice bank; prefer interactive_quiz if submit reveals answers.",
  "page_probe": {
    "schema_version": "page_probe_v1",
    "probe_id": "probe-001",
    "title_text": "Week 3 review"
  }
}

Output (Ok.value):

{
  "page_kind": "interactive_quiz",
  "confidence": 0.72,
  "supported_strategies": ["interactive_reveal"],
  "evidence": ["Multiple radio options inside main", "Submit button visible"],
  "next_probe_needed": {
    "interaction_probe": true,
    "network_probe": false,
    "static_chunk_probe": false
  }
}

Allowed page_kind strings are fixed in the contract (e.g. static_mcq_page, interactive_quiz, api_backed_quiz, login_or_blocked, unknown, …).

3.4 `pw_quiz_mechanics_discover` — input / output (abbreviated)

Input:

{
  "url": "https://example.com/quiz",
  "operator_hints": "Single-choice; four options per question.",
  "page_probe": {
    "schema_version": "page_probe_v1",
    "probe_id": "probe-001"
  },
  "interaction_probe": {
    "schema_version": "interaction_probe_v1",
    "probe_id": "probe-001",
    "action_index": 0
  },
  "source_constraints": {
    "expected_option_count": 4,
    "expected_choice_mode": "single",
    "allowed_modes": ["interactive_reveal", "bank_backed_pass"]
  },
  "allowed_action_kinds": [
    "click",
    "click_if_present",
    "click_option_by_index",
    "wait_for_selector",
    "read_text",
    "read_class"
  ]
}

Output (required keys per contract; actions is an object of phases):

{
  "schema_version": "page_mechanics.v1",
  "page_kind": "interactive_quiz",
  "confidence": 0.84,
  "question": {
    "container_selector": ".quiz-question",
    "stem_selector": ".question-text",
    "options_selector": ".answer-option",
    "choice_mode": "single",
    "expected_option_count": 4
  },
  "actions": {
    "submit": { "kind": "click", "selector": "button.submit" },
    "select_answer": { "kind": "click_option_by_index", "options_selector": ".answer-option" }
  },
  "grading": {
    "mode": "reveal_after_submit",
    "correct_signal": {
      "kind": "css_class_or_feedback_text",
      "correct_selector": ".correct",
      "incorrect_selector": ".incorrect",
      "text_patterns": []
    }
  },
  "loop": {
    "termination": "no_next_button_or_score_page",
    "duplicate_question_policy": "stop",
    "max_questions_default": 200
  },
  "safety": { "forbid_freeform_js": true, "max_clicks_per_question": 4 },
  "notes": "Spec synthesized from probes; validate selectors live."
}

3.5 `pw_mechanics_repair` — input / output

Input:

{
  "url": "https://example.com/quiz",
  "operator_hints": "Grading read failed after submit.",
  "previous_mechanics": {
    "schema_version": "page_mechanics.v1",
    "page_kind": "interactive_quiz",
    "grading": {
      "correct_signal": {
        "kind": "css_class_or_feedback_text",
        "correct_selector": ".correct",
        "incorrect_selector": ".incorrect"
      }
    }
  },
  "validation_failure": {
    "stage": "read_grade",
    "expected": "correct_index 0..3",
    "actual": null,
    "exception_one_liner": "no correct selector matched",
    "after_submit_text": "Try again.",
    "class_delta": ["is-right", "is-wrong"],
    "button_inventory": ["Next"]
  }
}

Output:

{
  "repaired_mechanics": {
    "schema_version": "page_mechanics.v1",
    "page_kind": "interactive_quiz",
    "grading": {
      "correct_signal": {
        "kind": "css_class_or_feedback_text",
        "correct_selector": ".is-right",
        "incorrect_selector": ".is-wrong"
      }
    }
  },
  "patch_summary": "Updated grading selectors to match post-submit classes.",
  "confidence": 0.79,
  "issues": []
}

Post-validation rejects code-carrier keys and forbidden kind values; if repaired_mechanics.actions is a list, validate_page_mechanics_shape must pass.

3.6 Network trace task (optional)

pw_network_api_route_infer accepts network_events rows with method, url_path, status, content_type, request_excerpt, response_shape. Output candidate arrays must reference only observed url_path values (see contract).

4. Safety notes

Scope: Authorized practice, QA, and owned content only.
Never instruct or automate proctored exam bypass, credential theft, CAPTCHA solving, paywall evasion, or anti-bot circumvention. Task system prompts encode this; consumers must enforce policy in operators and URLs.
Secrets: Do not place tokens, cookies, or Authorization headers in operator_hints, probe JSON, or notes. Truncate or hash diagnostic payloads in logs.
Determinism: Prefer validated mechanics and explicit timeouts over “best effort” loops driven solely by the LLM.

5. Testing guidance

Technique	Use
Fake chat	Implement `chat(messages, kwargs) -> ChatResult(True, '{"…json…"}')` and pass it to `run_task` or `TaskRunner(chat=…)`**. Matches forge-lcdl unit tests.
Frozen probe fixtures	Serialize `page_probe` / `interaction_probe` / `network_events` from golden pages you own; version them beside golden HTML snapshots or hashes.
Live LLM integration	Optional; forge-lcdl marks gateway tests `granite`—default `pytest` skips them without env (see README.md).

Consumer repos should run contract-level golden tests (payload → expected validator outcome) and smoke Playwright tests separately from LCDL.

PAGE-MECHANICS.md — Mechanics artifact versions, focused tasks, validation split.
EXTRACTION-CONVERGENCE.md — Staged convergence playbook and task index.
CONTRIBUTING.md — Governance and layering expectations.
ADOPTION.md — Dependency and rollout notes for certificator/workbench.

forge-lcdl

1. Responsibility split

2. End-to-end flow (recommended)

3. Example payloads

3.1 page_probe (minimal shape)

3.2 interaction_probe (minimal shape)

3.3 pw_page_kind_route — input / output

3.4 pw_quiz_mechanics_discover — input / output (abbreviated)

3.5 pw_mechanics_repair — input / output