forge-lcdl

Task `pw_chunk_classify` v1

Summary

Classifies each page chunk (plain text / HTML excerpt) as likely containing a single four-option multiple-choice question or not. Results are merged back onto the input chunk dicts as is_question_block, confidence, and classify_reason (mapped from the model field reason).

Inputs

| Field | Type | Notes |
| --- | --- | --- |
| `url` | string | Page URL for model context |
| `chunks` | list of dict | Each dict should include `chunk_id`; optional `heading_text`, `text_snippet`, `html_snippet` |
| `temperature` | number | Optional; default 0.05 |
| `timeout_sec` | int | Optional; default 180 |

Size limits (truncation before the LLM call): heading_text 400 chars, text_snippet 2500, html_snippet 1200; total user JSON UTF-8 truncated to 100000 bytes.
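The truncation described above can be sketched as follows. This is a minimal illustration of the stated limits, not the actual implementation; the helper names are assumptions.

```python
# Limits taken from the spec: per-field character caps, plus a total
# UTF-8 byte cap on the serialized user message.
HEADING_MAX = 400
TEXT_MAX = 2500
HTML_MAX = 1200
TOTAL_BYTES_MAX = 100_000

def truncate_chunk(chunk: dict) -> dict:
    """Return a copy of the chunk with per-field character limits applied."""
    out = dict(chunk)
    for field, limit in (("heading_text", HEADING_MAX),
                         ("text_snippet", TEXT_MAX),
                         ("html_snippet", HTML_MAX)):
        value = out.get(field)
        if isinstance(value, str) and len(value) > limit:
            out[field] = value[:limit]
    return out

def truncate_user_json(payload: str) -> str:
    """Cap the serialized user message at 100 000 UTF-8 bytes, dropping
    any multi-byte sequence cut in half at the boundary."""
    data = payload.encode("utf-8")
    if len(data) <= TOTAL_BYTES_MAX:
        return payload
    return data[:TOTAL_BYTES_MAX].decode("utf-8", errors="ignore")
```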

Output

JSON object (wrapped in task Ok.value):

{
  "chunks": [ { "... original fields ...", "is_question_block": true, "confidence": 0.9, "classify_reason": "..." } ]
}
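The merge step (model `chunk_results` entries joined back onto the input chunks, with `reason` renamed to `classify_reason`) might look like the sketch below. Matching by `chunk_id` and the defaults for chunks the model skipped are assumptions.

```python
def merge_results(chunks: list[dict], chunk_results: list[dict]) -> list[dict]:
    """Merge model verdicts back onto the input chunk dicts by chunk_id.

    Chunks without a matching result fall back to a negative verdict
    (an assumption; the spec does not state the fallback).
    """
    by_id = {r.get("chunk_id"): r for r in chunk_results}
    merged = []
    for chunk in chunks:
        result = by_id.get(chunk.get("chunk_id"), {})
        merged.append({
            **chunk,
            "is_question_block": bool(result.get("is_question_block", False)),
            "confidence": float(result.get("confidence", 0.0)),
            "classify_reason": result.get("reason", ""),
        })
    return merged
```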

Policy

  • Prefer OpenAI `response_format: json_object` when `profile.prefer_json_object_mode` is true; on transport failure, perform a blocking retry without JSON mode (same pattern as the certificator Playwright discovery).
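The retry policy can be sketched as below. `call_llm` is a hypothetical stand-in for the real client call, and treating `ConnectionError` as the transport-failure signal is an assumption.

```python
def classify_with_fallback(call_llm, messages: list[dict]) -> str:
    """First attempt requests OpenAI JSON mode; on a transport-level
    failure, retry once without response_format (blocking)."""
    try:
        return call_llm(messages, response_format={"type": "json_object"})
    except ConnectionError:
        # Some gateways reject the response_format parameter at the
        # transport level; retry the same messages without it.
        return call_llm(messages)
```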

Ambiguity handling

  • Empty assistant text → ParseFailure.
  • Top-level JSON error object with message and no chunk_results → GatewayFailure.
  • Missing chunk_results array → SchemaFailure.
  • Lenient parse: tolerates markdown fences and leading prose before the JSON object.
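A minimal sketch of the lenient parse: strip markdown fences, then parse from the first `{` to the last `}`. This is an illustration of the tolerance described above, not the actual parser (e.g. a `{` inside the leading prose would defeat it).

```python
import json
import re

def parse_lenient(text: str) -> dict:
    """Tolerate markdown fences and leading prose around the JSON object."""
    cleaned = re.sub(r"```(?:json)?", "", text)
    start = cleaned.find("{")
    end = cleaned.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in assistant text")
    return json.loads(cleaned[start:end + 1])
```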

Changelog

  • v1 — Initial port of semantics aligned with forge_certificators.source_ingest.playwright_llm_page_discovery.llm_classify_question_chunks.