Update (November 20, 2025): Google DeepMind's newly released Nano Banana Pro (built on Gemini-3 Pro) can in fact already think in images, and it brings a huge leap in image quality and related capabilities. I do slightly suspect the underlying system is a dual-model architecture, e.g., Gemini-3 Pro calling an image generation model during its reasoning, but either way it is still quite impressive. I'm even more curious to see how this capability will extend to other modalities in the future.
For a while now, I've been thinking about how to push the frontier of current reasoning models further: not just performance, but also new use cases and capabilities. Two questions came to mind that I haven't seen anyone else raise (or maybe someone has?):
- How can the current reasoning-model paradigm influence or enhance multimodal models?
- How do so-called "hybrid reasoning models" (possibly) work?
I have some personal thoughts on both topics. They don't aim to be absolutely correct; I'm just sharing my own views.
1. How can the current reasoning-model paradigm influence or enhance multimodal models?
Gemini-2.5 Flash Image (a.k.a. Nano-Banana) stunned many people with its image generation, editing, and many other capabilities. Quite a few models now output not only text but also audio and images. So I started wondering: could we combine multimodal models with reasoning models, so that they "think" not just in words but also in images and audio? If a reasoning model could use multimodal content not only in its final response but also in its internal thinking process, the results might be much better. This sounds convoluted, so a few examples should make it clear:
- Audio modality: A user is preparing a speech and wants the model to generate a sample audio. gpt-realtime or gemini-2.5 can help, but the quality is not guaranteed. If the model could reason about which emotions to convey at which points, generate draft audio clips to listen to, and then iterate, the final quality could be much better. For example:

```markdown
user:
I'll have a speech about the future of our planet Earth next week at the UN. Could you please give me a sample for it?

model (CoT):
The user is asking: "..." So they want a sample speech.
I could give them a text script, but that would not be very helpful. Giving them an example audio would be better. It may also require some emotions. I should work out the structure and contents first.
...
Alright, I am done with the initial composing. Let me explore what the emotions should be and make some drafts; then it will be good to go.
Well, I should have hope, urgency, inspiration, responsibility, solidarity, empathy, ...
OK, so for hope, I think I should use a rising tone, just like this:
[a piece of voice with a rising tone]
Cool, sounds good, let's continue:
...
Alright, I think I have all the emotions done. Let me make a first draft:
[full composed speech with the right emotions]
...

model (response):
Here's the speech I prepared for you:
[full final speech audio]
```
- Image modality: Although the latest Nano-Banana can already create and edit stunning images, it still fails in complex scenarios. That is understandable, since the model produces the image in a single shot. If it could take some time to think and sketch a draft first, the way a human designer does, the results could be much better. For example:

```markdown
user:
Please create a cinematic image of an interstellar war. It needs to be epic.

model (CoT):
The user is asking me to generate an image of ...
So let me first imagine the scene: a massive fleet of sleek starships engaged in battle over a luminous nebula, laser beams and plasma torpedoes streaking across space, explosions creating bright fireballs, a giant battlecruiser in the foreground with a detailed hull, and a background with distant stars and a partially visible planet ...
Cool, let me create a sketch first:
[a sketch of the scene with only a few lines]
...
I see the planet in the image doesn't look real. Let me change its position:
[a final-prepared version of the image with all elements and colors]
...
Great, it seems perfect. Let me give this to the user.

model (response):
Here's the image you want:
[the final version of the image]
```
- Even the video modality (I know no language model can output both text and video yet, but maybe it is a direction worth considering): current video generation models such as Veo-3 are already decent, but they share some obvious shortcomings: clips are too short, they cannot sustain complex storytelling, and so on. If the model could think in modalities like audio and images along the way, its performance might be greatly enhanced. For example:

```markdown
user:
Please give me a short movie about space exploration.

model (CoT):
The user asks for a short movie ...
I plan to write the script first, then draw certain key frames, and create some suitable BGMs. This seems good. Let me proceed with some ideas.
...
Phew. I think this is amazing; the user is gonna love it. So next, frames.
For the opening, I think this would be good:
[a sketch frame for the first part of the story]
...
That's great. I'm all done. Now, music.
[some background audio for specific frames]
...
Sounds good. Then let me go for the final video part.
...

model (response):
Here's the movie you want:
[final movie]
```
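To make the idea of "thinking in multiple modalities" concrete as data, here is a minimal sketch of what an interleaved multimodal chain of thought might look like. Everything here is hypothetical: the `Segment`/`Trace` names and the string payloads are stand-ins for real token sequences, audio latents, or image tensors.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical: one typed segment of a chain of thought. "payload" would
# hold text tokens, audio latents, or image latents in a real system.
@dataclass
class Segment:
    modality: Literal["text", "audio", "image"]
    payload: object

# Hypothetical: a full generation, with hidden thinking separated from
# the visible response, both allowed to interleave modalities freely.
@dataclass
class Trace:
    thinking: list[Segment] = field(default_factory=list)
    response: list[Segment] = field(default_factory=list)

    def think(self, modality: str, payload: object) -> None:
        self.thinking.append(Segment(modality, payload))

    def respond(self, modality: str, payload: object) -> None:
        self.response.append(Segment(modality, payload))

# The image example above, encoded as one trace: the model drafts a sketch
# inside its thinking, critiques it in text, then emits only the final image.
trace = Trace()
trace.think("text", "Imagine the scene: a fleet over a nebula ...")
trace.think("image", "<sketch: rough line drawing>")
trace.think("text", "The planet looks off; move it and refine colors.")
trace.respond("image", "<final rendered image>")

print([s.modality for s in trace.thinking])  # ['text', 'image', 'text']
```

The point of the structure is that draft images and audio live in `thinking`, where they can be revised, while `response` carries only the polished artifact.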
2. How do so-called "hybrid reasoning models" (possibly) work?
A hybrid reasoning model, by definition, can either respond directly or think deeply before responding (and can even decide on its own when more thinking is needed). Only a few models currently have this ability: Claude-3.7, Claude-4, DeepSeek-v3.1, Qwen-3, and a handful of others (GPT-5 doesn't count for now, since it uses a router).
Claude is closed-source, so we don't know exactly how its thinking switch is implemented (perhaps it is similar to the others). DeepSeek-v3.1 and Qwen-3 implement non-thinking mode by directly prefilling an empty thinking block (something like <think> </think>). This is a decent quick fix to make the model skip thinking and respond immediately, but... it seems the Qwen team wasn't satisfied with the results and soon split thinking and non-thinking into two separate models (link).
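The prefilling trick can be sketched as a tiny prompt-assembly helper. Note this is a simplified illustration, not the actual DeepSeek or Qwen chat template; the role markers and tag spelling are assumptions.

```python
def build_prompt(system: str, user: str, thinking: bool) -> str:
    """Assemble a chat prompt; in non-thinking mode, prefill an empty
    think block so the model's next tokens land directly in the visible
    response instead of a chain of thought."""
    prompt = f"<|system|>{system}<|user|>{user}<|assistant|>"
    if not thinking:
        # An already-closed, empty block signals "thinking is over".
        prompt += "<think>\n\n</think>\n"
    return prompt

p = build_prompt("You are helpful.", "What is 1+1?", thinking=False)
print(p.endswith("</think>\n"))  # True
```

Because the closing tag is already in the context, the model is steered away from opening a new thinking block; it never learned to think "after" `</think>`.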
But what if we let the model always think, just with different goals depending on the setting? What I mean is this:
Fundamentally, we can train the model to react differently under different settings (thinking mode on, off, or auto). At inference time we tell the model which mode the user chose through some channel, such as the system prompt. In each mode, the model first glances at the current setting and then decides for itself what to do. This requires less manual intervention; the model itself knows what to do.
Example behaviors:
- Thinking mode on:

```markdown
system: ... <thinking_mode>on</thinking_mode>

user:
...

model (CoT):
Let me see. The thinking mode is set to on, which means I should take more time to keep thinking afterwards.
The user asks ...

model (response):
...
```
- Thinking mode off:

```markdown
system: ... <thinking_mode>off</thinking_mode>

user:
...

model (CoT):
Hmmm... I see the thinking mode is off. This means I should respond directly. Yes, no more thinking. Start responding now.

model (response):
...
```
- Thinking mode auto:

```markdown
system: ... <thinking_mode>auto</thinking_mode>

user:
What is 1+1?

model (CoT):
I see the thinking mode is set to auto. So basically I just need to decide how much I need to think. Hmmm... Let me see.
The user asks: "What is 1+1?" This is trivial: 2. Anything else? No, just the number is fine. Respond now.

model (response):
2.
```

Or:
```markdown
system: ... <thinking_mode>auto</thinking_mode>

user:
...

model (CoT):
... Oh my god, this is hard. According to my setting, I guess I should take more time to think further about this message.
The user asks ...

model (response):
...
```
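On the serving side, this design could look something like the sketch below: the mode tag is always injected into the system prompt, the model always emits a thinking block, and the server strips it before display. The tag names follow the examples above; the helper functions and output format are hypothetical.

```python
import re

def inject_mode(system: str, mode: str) -> str:
    """Always advertise the user's choice; the model decides how to react."""
    if mode not in ("on", "off", "auto"):
        raise ValueError(f"unknown thinking mode: {mode}")
    return f"{system} <thinking_mode>{mode}</thinking_mode>"

def split_output(raw: str) -> tuple[str, str]:
    """Separate the (always present) CoT from the visible response."""
    m = re.match(r"<think>(.*?)</think>(.*)", raw, re.DOTALL)
    if m is None:
        return "", raw.strip()
    return m.group(1).strip(), m.group(2).strip()

# Even in "off" mode the model thinks, just very briefly.
raw = "<think>Mode is off. Respond directly.</think>2."
cot, answer = split_output(raw)
print(answer)  # 2.
```

The difference from the prefilling approach is that nothing is forced at the template level: the short-circuit behavior has to come from training, not from a prefilled `</think>`.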
We could also use RL to teach the model these behaviors:
- For the general "react to the setting" behavior, we can set up a verifier that checks whether the model reacted to the setting correctly, since the model tends to use similar wording or semantics when it does so.
- For whether the model actually follows the setting (e.g., stops thinking when it is off, keeps thinking when it is on), we can reward or penalize based on the model's subsequent behavior. For example, if the mode is set to off but the model insists on thinking anyway, we apply a penalty.
- For auto mode, I think we could use a dataset pre-labeled with easy/hard tags and reward outputs that handle the question appropriately under the auto setting.
This scheme may not actually work, but it could be a path worth exploring. Why do I say that? Look at OpenAI's o-series and GPT-5-thinking models: their reasoning effort is controlled by an internal parameter called "juice" (GPT-5 even has another parameter called "oververbosity" that controls how verbose the response is). Likewise, Anthropic uses something like <max_thinking_length> to tell the model how long it should think in total. So with carefully curated data and RL training, models might acquire more adaptive and efficient thinking abilities.
