Handbook
Task `pw_chunk_classify` v1
Summary
Classifies each page chunk (plain text or HTML excerpt) as likely containing a single four-option multiple-choice question or not. The output is merged back onto the input chunk dicts as `is_question_block`, `confidence`, and `classify_reason` (from the model field `reason`).
Inputs
| Field | Type | Notes |
|---|---|---|
| `url` | string | Page URL for model context |
| `chunks` | list of dict | Each dict should include `chunk_id`; optional `heading_text`, `text_snippet`, `html_snippet` |
| `temperature` | number | Optional; default 0.05 |
| `timeout_sec` | int | Optional; default 180 |
Size limits (truncation before the LLM call): `heading_text` 400 chars, `text_snippet` 2500, `html_snippet` 1200; the serialized user JSON is truncated to 100,000 UTF-8 bytes in total.
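The truncation step can be sketched as follows. This is a minimal illustration, not the task's actual implementation: the helper names (`truncate_chunks`, `build_user_json`) are hypothetical; only the limit values come from the spec above.

```python
import json

# Per-field character limits and the overall payload byte budget
# (values from the spec; helper and constant names are illustrative).
FIELD_LIMITS = {"heading_text": 400, "text_snippet": 2500, "html_snippet": 1200}
MAX_PAYLOAD_BYTES = 100_000

def truncate_chunks(chunks: list[dict]) -> list[dict]:
    """Clip oversized text fields before building the LLM prompt."""
    out = []
    for chunk in chunks:
        clipped = dict(chunk)
        for field, limit in FIELD_LIMITS.items():
            value = clipped.get(field)
            if isinstance(value, str) and len(value) > limit:
                clipped[field] = value[:limit]
        out.append(clipped)
    return out

def build_user_json(url: str, chunks: list[dict]) -> str:
    """Serialize the request and enforce the total UTF-8 byte cap."""
    payload = json.dumps({"url": url, "chunks": truncate_chunks(chunks)},
                         ensure_ascii=False)
    raw = payload.encode("utf-8")
    if len(raw) > MAX_PAYLOAD_BYTES:
        # Truncate on a byte boundary, dropping any partial multibyte char.
        payload = raw[:MAX_PAYLOAD_BYTES].decode("utf-8", errors="ignore")
    return payload
```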
Output
JSON object (wrapped in the task's `Ok.value`):

```json
{
  "chunks": [
    { "... original fields ...", "is_question_block": true, "confidence": 0.9, "classify_reason": "..." }
  ]
}
```
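The merge of model verdicts back onto the input chunks might look like the sketch below. It is an assumption-laden illustration: the function name and the defaults for unmatched chunks are hypothetical; the field names and the `reason` → `classify_reason` rename come from the spec.

```python
def merge_classifications(chunks: list[dict], chunk_results: list[dict]) -> dict:
    """Merge model verdicts onto the input chunk dicts by chunk_id."""
    by_id = {r.get("chunk_id"): r for r in chunk_results}
    merged = []
    for chunk in chunks:
        result = by_id.get(chunk.get("chunk_id"), {})
        merged.append({
            **chunk,
            "is_question_block": bool(result.get("is_question_block", False)),
            "confidence": float(result.get("confidence", 0.0)),
            # The model's `reason` field is surfaced as `classify_reason`.
            "classify_reason": result.get("reason", ""),
        })
    return {"chunks": merged}
```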
Policy
- Prefer OpenAI `response_format: json_object` when `profile.prefer_json_object_mode` is true; on transport failure, retry (blocking) without JSON mode (same pattern as the certificator Playwright discovery).
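The JSON-mode preference with a blocking fallback retry could be sketched like this. The wrapper name, the `call_llm` callable, and the use of `ConnectionError` as the transport failure are assumptions for illustration; only the policy (prefer JSON mode when the profile says so, retry without it on transport failure) is from the spec.

```python
def classify_with_fallback(call_llm, messages, profile: dict) -> str:
    """Try OpenAI JSON mode first when the profile asks for it; on a
    transport failure, retry once (blocking) without response_format.
    `call_llm` is a hypothetical transport callable."""
    if profile.get("prefer_json_object_mode"):
        try:
            return call_llm(messages, response_format={"type": "json_object"})
        except ConnectionError:
            # Transport failure: fall through and retry without JSON mode.
            pass
    return call_llm(messages, response_format=None)
```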
Ambiguity handling
- Empty assistant text → `ParseFailure`.
- Top-level JSON `error` object with `message` and no `chunk_results` → `GatewayFailure`.
- Missing `chunk_results` array → `SchemaFailure`.
- Lenient parse: tolerates markdown fences and leading prose before the JSON object.
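A lenient parser following the rules above might look like this sketch. The failure names map onto the taxonomy above, but the exception classes used here are illustrative stand-ins, not the task's real error types.

```python
import json
import re

def parse_classifier_reply(text: str) -> list[dict]:
    """Leniently parse the assistant reply per the ambiguity rules.
    Exception types are stand-ins for the real failure classes."""
    if not text or not text.strip():
        raise ValueError("ParseFailure: empty assistant text")
    # Tolerate markdown fences and leading prose: strip fences, then
    # take the first top-level JSON object in the reply.
    stripped = re.sub(r"```(?:json)?", "", text)
    start = stripped.find("{")
    if start < 0:
        raise ValueError("ParseFailure: no JSON object found")
    obj = json.loads(stripped[start:stripped.rfind("}") + 1])
    if "chunk_results" not in obj:
        if "message" in obj.get("error", {}):
            raise RuntimeError("GatewayFailure: " + obj["error"]["message"])
        raise TypeError("SchemaFailure: missing chunk_results array")
    return obj["chunk_results"]
```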
Changelog
- v1 — Initial port; semantics aligned with `forge_certificators.source_ingest.playwright_llm_page_discovery.llm_classify_question_chunks`.