Handbook
Task `pw_chunk_classify` v1
Summary
Classifies each page chunk (plain text or HTML excerpt) as likely containing a single four-option multiple-choice question or not. The output is merged back onto the input chunk dicts as `is_question_block`, `confidence`, and `classify_reason` (from the model field `reason`).
Inputs
| Field | Type | Notes |
|---|---|---|
| `url` | string | Page URL for model context |
| `chunks` | list of dict | Each dict should include `chunk_id`; optional `heading_text`, `text_snippet`, `html_snippet` |
| `temperature` | number | Optional; default 0.05 |
| `timeout_sec` | int | Optional; default 180 |
Size limits (truncation before the LLM call): `heading_text` 400 chars, `text_snippet` 2500, `html_snippet` 1200; the serialized user JSON is truncated to 100,000 UTF-8 bytes in total.
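The truncation step can be sketched as follows. This is a minimal illustration, not the task's actual implementation: the helper names (`truncate_chunks`, `build_user_json`) are hypothetical; only the limit values come from the spec above.

```python
import json

# Per-field character limits and the overall payload byte budget
# (values from the spec; helper and constant names are illustrative).
FIELD_LIMITS = {"heading_text": 400, "text_snippet": 2500, "html_snippet": 1200}
MAX_PAYLOAD_BYTES = 100_000

def truncate_chunks(chunks: list[dict]) -> list[dict]:
    """Clip oversized text fields before building the LLM prompt."""
    out = []
    for chunk in chunks:
        clipped = dict(chunk)
        for field, limit in FIELD_LIMITS.items():
            value = clipped.get(field)
            if isinstance(value, str) and len(value) > limit:
                clipped[field] = value[:limit]
        out.append(clipped)
    return out

def build_user_json(url: str, chunks: list[dict]) -> str:
    """Serialize the request and enforce the total UTF-8 byte cap."""
    payload = json.dumps({"url": url, "chunks": truncate_chunks(chunks)},
                         ensure_ascii=False)
    raw = payload.encode("utf-8")
    if len(raw) > MAX_PAYLOAD_BYTES:
        # Truncate on a byte boundary, dropping any partial multibyte char.
        payload = raw[:MAX_PAYLOAD_BYTES].decode("utf-8", errors="ignore")
    return payload
```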
Output
JSON object (wrapped in the task's `Ok.value`):

```json
{
  "chunks": [
    { "... original fields ...", "is_question_block": true, "confidence": 0.9, "classify_reason": "..." }
  ]
}
```
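The merge of model verdicts back onto the input chunks might look like the sketch below. It is an assumption-laden illustration: the function name and the defaults for unmatched chunks are hypothetical; the field names and the `reason` → `classify_reason` rename come from the spec.

```python
def merge_classifications(chunks: list[dict], chunk_results: list[dict]) -> dict:
    """Merge model verdicts onto the input chunk dicts by chunk_id."""
    by_id = {r.get("chunk_id"): r for r in chunk_results}
    merged = []
    for chunk in chunks:
        result = by_id.get(chunk.get("chunk_id"), {})
        merged.append({
            **chunk,
            "is_question_block": bool(result.get("is_question_block", False)),
            "confidence": float(result.get("confidence", 0.0)),
            # The model's `reason` field is surfaced as `classify_reason`.
            "classify_reason": result.get("reason", ""),
        })
    return {"chunks": merged}
```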
Policy
- Prefer OpenAI `response_format: json_object` when `profile.prefer_json_object_mode` is true; on transport failure, retry (blocking) without JSON mode (same pattern as the certificator Playwright discovery).
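The JSON-mode preference with a blocking fallback retry could be sketched like this. The wrapper name, the `call_llm` callable, and the use of `ConnectionError` as the transport failure are assumptions for illustration; only the policy (prefer JSON mode when the profile says so, retry without it on transport failure) is from the spec.

```python
def classify_with_fallback(call_llm, messages, profile: dict) -> str:
    """Try OpenAI JSON mode first when the profile asks for it; on a
    transport failure, retry once (blocking) without response_format.
    `call_llm` is a hypothetical transport callable."""
    if profile.get("prefer_json_object_mode"):
        try:
            return call_llm(messages, response_format={"type": "json_object"})
        except ConnectionError:
            # Transport failure: fall through and retry without JSON mode.
            pass
    return call_llm(messages, response_format=None)
```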
Ambiguity handling
- Empty assistant text → `ParseFailure`.
- Top-level JSON `error` object with `message` and no `chunk_results` → `GatewayFailure`.
- Missing `chunk_results` array → `SchemaFailure`.
- Lenient parse: tolerates markdown fences and leading prose before the JSON object.
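A lenient parser following the rules above might look like this sketch. The failure names map onto the taxonomy above, but the exception classes used here are illustrative stand-ins, not the task's real error types.

```python
import json
import re

def parse_classifier_reply(text: str) -> list[dict]:
    """Leniently parse the assistant reply per the ambiguity rules.
    Exception types are stand-ins for the real failure classes."""
    if not text or not text.strip():
        raise ValueError("ParseFailure: empty assistant text")
    # Tolerate markdown fences and leading prose: strip fences, then
    # take the first top-level JSON object in the reply.
    stripped = re.sub(r"```(?:json)?", "", text)
    start = stripped.find("{")
    if start < 0:
        raise ValueError("ParseFailure: no JSON object found")
    obj = json.loads(stripped[start:stripped.rfind("}") + 1])
    if "chunk_results" not in obj:
        if "message" in obj.get("error", {}):
            raise RuntimeError("GatewayFailure: " + obj["error"]["message"])
        raise TypeError("SchemaFailure: missing chunk_results array")
    return obj["chunk_results"]
```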
Changelog
- v1 — Initial port; semantics aligned with `forge_certificators.source_ingest.playwright_llm_page_discovery.llm_classify_question_chunks`.