Recover Qwen3 XML Tool Calls Emitted Inside <think>
Problem Statement
This change addresses a parsing compatibility bug for the following setup:
--reasoning-parser qwen3 --tool-call-parser qwen3_coder
Qwen3/Qwen3.5 models can emit XML tool-call markup, that is, a <tool_call>...</tool_call> block, inside the reasoning region delimited by <think> ... </think>.
The issue is not that vLLM causes the model to generate tool calls inside <think>. That is model output behavior.
The actual bug is that, when this output happens, vLLM loses the tool call during non-streaming parsing.
Root Cause
In the affected path:
- qwen3_reasoning_parser extracts everything before </think> into reasoning.
- downstream tool parsing inspects only content.
- any <tool_call>...</tool_call> block that remains inside reasoning never reaches qwen3_coder.
As a result, the OpenAI-compatible response can contain:
- populated reasoning
- empty tool_calls
even though the model actually produced a valid XML tool call.
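The failure mode described above can be reproduced with a minimal sketch. The two functions below are hypothetical stand-ins, not the actual vLLM parsers, and the function/parameter markup inside the tool call is illustrative only. The sketch assumes the reasoning parser splits on </think> and the tool parser looks at content alone.

```python
import re

THINK_END = "</think>"
TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def split_reasoning(model_output: str) -> tuple[str, str]:
    """Hypothetical stand-in for the reasoning parser.

    Text before </think> becomes reasoning, text after it becomes content.
    """
    text = model_output.removeprefix("<think>")
    reasoning, _, content = text.partition(THINK_END)
    return reasoning.strip(), content.strip()


def find_tool_calls(content: str) -> list[str]:
    """Hypothetical stand-in for the tool parser: it only ever sees content."""
    return TOOL_CALL_RE.findall(content)


# Illustrative model output with the tool call placed inside <think>.
model_output = (
    "<think>I should look up the weather.\n"
    "<tool_call>\n<function=get_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "</function>\n</tool_call>\n"
    "</think>"
)

reasoning, content = split_reasoning(model_output)
tool_calls = find_tool_calls(content)

# The tool call is swallowed by reasoning, so the tool parser finds nothing.
print("tool call in reasoning:", "<tool_call>" in reasoning)
print("content:", repr(content))
print("tool_calls:", tool_calls)
```

In this sketch the tool call ends up in reasoning, content is empty, and the tool-call list is empty, which matches the response shape described above.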
Minimal Fix
The patch changes only:
vllm/reasoning/qwen3_reasoning_parser.py
During non-streaming reasoning extraction:
- detect XML tool-call blocks embedded in the extracted reasoning text
- remove those blocks from the returned reasoning
- prepend them to the returned content
This allows the existing qwen3_coder tool parser to parse them normally, without changing the generic OpenAI serving pipeline.
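The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in the patch: the helper name promote_tool_calls and the regular expression are hypothetical, and the real parser may detect tool-call blocks differently.

```python
import re

# Assumed pattern for a complete XML tool-call block.
TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def promote_tool_calls(reasoning: str, content: str) -> tuple[str, str]:
    """Move <tool_call> blocks out of reasoning and to the front of content."""
    blocks = TOOL_CALL_RE.findall(reasoning)
    if not blocks:
        # Normal case: no embedded tool call, so nothing changes.
        return reasoning, content
    cleaned = TOOL_CALL_RE.sub("", reasoning).strip()
    promoted = "\n".join(blocks)
    new_content = f"{promoted}\n{content}" if content else promoted
    return cleaned, new_content


reasoning = (
    "Check the weather first.\n"
    "<tool_call>\n<function=get_weather>\n</function>\n</tool_call>"
)
content = "Let me check."

new_reasoning, new_content = promote_tool_calls(reasoning, content)
print(new_reasoning)
print(new_content)
```

Because the promoted block is placed ahead of any existing content, a tool parser that reads only content sees it without further changes downstream.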
Why This Is The Right Scope
This patch intentionally fixes recovery/parsing, not model generation.
It does not try to force Qwen3.5 models to stop emitting tool calls inside <think>. Instead, it makes vLLM robust when that output pattern appears.
This is the smallest change that fixes the observed benchmark failure while keeping the rest of the tool-calling stack unchanged.
Validation Scope
The added tests validate that:
- normal reasoning extraction remains unchanged
- embedded tool calls are promoted from reasoning into content
- promoted content remains parseable by qwen3_coder
- truncated reasoning without </think> still recovers embedded tool calls
- post-</think> content is preserved
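The cases listed above might look like the following when written as plain assertion-style tests. These are not the tests added by the patch: extract is a hypothetical, simplified stand-in for the non-streaming extraction, and a regular-expression match stands in for a real qwen3_coder parse.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
# Illustrative tool-call block; the inner markup is an assumption.
CALL = "<tool_call>\n<function=f>\n</function>\n</tool_call>"


def extract(model_output: str) -> tuple[str, str]:
    """Hypothetical non-streaming extraction with tool-call promotion."""
    text = model_output.removeprefix("<think>")
    # Without </think>, partition leaves everything in reasoning.
    reasoning, _, content = text.partition("</think>")
    blocks = TOOL_CALL_RE.findall(reasoning)
    reasoning = TOOL_CALL_RE.sub("", reasoning).strip()
    content = "\n".join(blocks + [content.strip()]).strip()
    return reasoning, content


def test_normal_extraction_unchanged():
    assert extract("<think>plan</think>answer") == ("plan", "answer")


def test_embedded_call_promoted_and_still_matchable():
    reasoning, content = extract(f"<think>plan\n{CALL}</think>")
    assert reasoning == "plan"
    assert TOOL_CALL_RE.findall(content) == [CALL]


def test_truncated_reasoning_recovers_call():
    assert extract(f"<think>plan\n{CALL}") == ("plan", CALL)


def test_post_think_content_preserved():
    _, content = extract(f"<think>plan\n{CALL}</think>done")
    assert content == f"{CALL}\ndone"


# Run the cases directly so the sketch works without a test runner.
for case in (
    test_normal_extraction_unchanged,
    test_embedded_call_promoted_and_still_matchable,
    test_truncated_reasoning_recovers_call,
    test_post_think_content_preserved,
):
    case()
```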
Limitation
This patch fixes the non-streaming path.
Streaming recovery is not addressed here, because the streaming path would require additional stateful changes in the serving layer to forward tool markup delivered as reasoning into the tool parser before </think> is observed.
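A small sketch shows why streaming needs state. It is not a proposed implementation and does not reflect how the vLLM serving layer is structured. The delta boundaries are invented for illustration, and the point is only that a tool-call block split across deltas cannot be found by inspecting each delta on its own.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)

# Illustrative reasoning deltas; the tool call is split across three of them.
deltas = [
    "I need data. <tool_",
    "call>\n<function=f>\n</func",
    "tion>\n</tool_call>",
    " waiting",
]

# Stateless: each delta is inspected in isolation.
stateless_hits = [m for d in deltas for m in TOOL_CALL_RE.findall(d)]

# Stateful: deltas accumulate in a buffer that is scanned after every update.
buffer = ""
stateful_hits = []
for delta in deltas:
    buffer += delta
    match = TOOL_CALL_RE.search(buffer)
    if match:
        stateful_hits.append(match.group(0))
        # Drop everything up to the end of the recovered block.
        buffer = buffer[match.end():]

print("stateless:", stateless_hits)
print("stateful:", stateful_hits)
```

The stateless scan finds nothing, while the buffered scan recovers the block once its closing tag arrives. Keeping that buffer and deciding what to forward as reasoning in the meantime is the kind of state the non-streaming fix does not need.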