2025-09-03 · 1 min read

Thinking Beyond Current Reasoning Models

By 涂津豪

Update, November 20, 2025: Google DeepMind's newly released Nano Banana Pro (built on their latest Gemini-3 Pro model) can in fact already think in images, and this has brought a genuine leap in image quality and related areas. I slightly suspect it is actually a two-model architecture underneath, e.g. Gemini-3 Pro calling an image-generation model during its reasoning, but either way it is still quite impressive. I am even more eager to see how this capability extends to other modalities in the future.


For a while now, I have been thinking about how we can push the frontier of current reasoning models forward: not just their performance, but also other "use cases" or "capabilities". So I came up with two questions that no one else seems to have raised yet (maybe?):

1. How does the current reasoning-model paradigm affect or enhance multimodal models?
2. How might the so-called "hybrid reasoning models" work?

I have some personal thoughts on both topics. I am not aiming to be absolutely correct; I just want to share my own views.


1. How does the current reasoning-model paradigm affect or enhance multimodal models?

Gemini-2.5 Flash Image (a.k.a. Nano-Banana) stunned everyone with its image generation, editing, and many other capabilities. Many models today can output not only text but also audio and images. So I have been wondering: can we combine these multimodal models with reasoning models, so that they "think" not only in text but also in images and audio? If reasoning models could use multimodal content not just in their final responses but also inside their internal thinking process, the results might be much better. This may sound convoluted, so here are a few examples:

• Audio modality: suppose a user is preparing a speech and wants the model to generate a sample audio clip. gpt-realtime or gemini-2.5 can help, but the quality is not guaranteed. If the model could reason about what emotion to convey at a given point, generate draft audio to simulate it, and iterate repeatedly, the quality of the result might improve. For example:

```markdown
user:
I'll have a speech about the future of our planet earth next week at the UN. Could you please give me a sample for it?
model (CoT):
The user is asking: "..." So they want a sample speech.
I could give a text script, but it would not be that helpful. I guess giving them an example audio would be better. It may also require some emotions. I should get the structure and content down first.
...
Alright, I am done with the initial composing. Let me explore what the emotions should be, do some drafts, and then it will be good to go.
Well, I should have hope, urgency, inspiration, responsibility, solidarity, empathy, ...
OK, so for hope, I think I should use a rising tone, just like this:
[a piece of voice with a rising tone]
Cool, sounds good, let's continue:
...
Alright, I think I got all the emotions done. Let me make a first draft:
[full composed speech with correct emotions]
...
model (response):
Here's the speech I prepared for you:
[full final speech audio]
```
• Image modality: although the latest Nano-Banana can create or edit stunning images, it still struggles with complex scenes. That is understandable, since the model emits the image directly in one shot. If it could take some time to think and draft first, as humans do, the result might be better. For example:

```markdown
user:
Please create a cinematic image of an interstellar war. It needs to be epic.
model (CoT):
The user is asking me to generate an image of ...
So let me first imagine the scene: a massive fleet of sleek starships engaged in battle over a luminous nebula, laser beams and plasma torpedoes streaking across space, explosions creating bright fireballs, a giant battlecruiser in the foreground with a detailed hull, a background with distant stars and a partially visible planet ...
Cool, let me create a sketch first:
[a sketch of the scene with only a few lines]
...
I see the planet in the image doesn't seem real, let me change its position:
[a final-prepared version of the image with all elements and colors]
...
Great, it seems perfect, let me give this to the user.
model (response):
Here's the image you want:
[the final version of the image]
```
• Even the video modality (I know no language model can output both text and video yet, but maybe this is a good direction)

Today's video generation models, such as Veo-3, are also doing quite well; however, they all share some obvious shortcomings: generations are too short, they cannot handle complex storytelling, and so on. So granting the model the ability to think in audio, images, and other modalities might greatly enhance its performance. For example:

```markdown
user:
Please give me a short movie about space exploration.
model (CoT):
The user asks for a short movie ...
I plan to write the script first, then draw certain frames, and create some suitable BGMs. This seems good. Let me proceed with some ideas.
...
Phew. I think this is amazing, the user is gonna love it. So next, frames.
For the opening, I think this would be good:
[a sketch frame for the first part of the story]
...
That's great. I'm all done. Now, music.
[some background audio for specific frames]
...
Sounds good, then let me go for the final video part.
...
model (response):
Here's the movie you want:
[final movie]
```

2. How might the so-called "hybrid reasoning models" work?

A hybrid reasoning model is one that can either respond directly or think deeply before responding (and can even decide on its own when more thinking is needed). Currently, only a handful of models have this capability: Claude-3.7, Claude-4, DeepSeek-v3.1, Qwen-3, and a few others (GPT-5 does not count for now, since it uses a router).

Claude is closed-source, so we do not know how its thinking switch is actually implemented (perhaps similarly to the others). For the non-thinking modes of DeepSeek-v3.1 and Qwen-3, they simply prefill an empty thinking block (something like `<think></think>`). This is a nice and fast way to make the model skip thinking and respond directly, but... well, the results apparently did not satisfy the Qwen team, who soon split the thinking and non-thinking modes into two separate models (link).
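The prefill trick amounts to closing the thinking block before the model ever gets to fill it. A minimal sketch, assuming a DeepSeek-style template where reasoning lives inside `<think>…</think>`; the `<|user|>`/`<|assistant|>` role markers are simplified placeholders, not the real chat template:

```python
def build_prompt(user_msg: str, thinking: bool) -> str:
    """Assemble a chat prompt; in non-thinking mode, prefill an
    already-closed think block so the model skips straight to the answer."""
    prompt = f"<|user|>{user_msg}<|assistant|>"
    if not thinking:
        # The model sees the reasoning block opened and immediately
        # closed, so it continues with the final response directly.
        prompt += "<think></think>"
    return prompt

print(build_prompt("What is 1+1?", thinking=False))
```

No training-time switch is strictly required for this: it is purely an inference-time intervention on the template, which is part of why it is fast to ship and also why it can degrade quality.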

But what if we let the model always think, just for different purposes? What I mean is:

Fundamentally, we can train the model to react differently under different settings (thinking mode on, off, or auto). At inference time, we tell the model which mode the user selected through some channel (for example, the system prompt). In every mode, the model always looks at the current setting first and then decides for itself what to do. This means less manual intervention; the model itself knows how to proceed.

Example behaviors:

• Thinking mode on:

```markdown
system: ... <thinking_mode>on</thinking_mode>
user:
...
model (CoT):
Let me see. I see the thinking mode has been set to on, which means I should take more time to keep thinking afterwards.
The user asks ...
model (response):
...
```
• Thinking mode off:

```markdown
system: ... <thinking_mode>off</thinking_mode>
user:
...
model (CoT):
Hmmm... I see the thinking mode is off. This means I should start responding directly. Yes, no more thinking. Start to respond now.
model (response):
...
```
• Thinking mode auto:

```markdown
system: ... <thinking_mode>auto</thinking_mode>
user:
What is 1+1?
model (CoT):
I see the thinking mode has been set to auto. So basically I just need to decide how much I need to think. Hmmm... Let me see.
The user asks: "What is 1+1?" This is trivial: 2. Anything else? No. Just a number is fine. Respond now.
model (response):
2.
```

or

```markdown
system: ... <thinking_mode>auto</thinking_mode>
user:
...
model (CoT):
... Oh my god, this is hard. According to my setting, I guess I should take more time to think further about this message.
The user asks ...
model (response):
...
```
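Serving such a model would then mostly come down to injecting the selected mode into the system prompt at inference time. A minimal sketch, reusing the `<thinking_mode>` tag format from the examples above (the helper function itself is hypothetical):

```python
from typing import Literal

Mode = Literal["on", "off", "auto"]

def with_thinking_mode(system_prompt: str, mode: Mode) -> str:
    """Append the user's chosen thinking mode to the system prompt.
    The model, not the serving stack, decides how to react to the tag."""
    return f"{system_prompt}\n<thinking_mode>{mode}</thinking_mode>"

print(with_thinking_mode("You are a helpful assistant.", "auto"))
```

Compared with the prefill trick, nothing in the template is forced: the off mode is an instruction the model has learned to follow, not a structural constraint.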

And we can also use RL to teach the model these behaviors:

1. For the general "acknowledge the setting" behavior, we can set up a verifier that checks whether the model actually responded to the setting correctly, since the model usually uses similar wording or semantics when it does so.
2. For whether the model obeyed the setting (e.g., stopping thinking when it is off, continuing when it is on), we can reward or penalize based on the model's subsequent behavior; for example, penalize it if it insists on thinking when the setting is off.
3. For the auto thinking mode, I think we could use a pre-annotated dataset with "easy/hard" labels, and reward outputs that handle the question appropriately under the auto setting.

This scheme may not actually work, but it could be a path worth exploring. Why do I say that? Look at OpenAI's o-series models and the GPT-5-thinking model: their reasoning effort is controlled by an internal parameter called "juice" (GPT-5 even has an "oververbosity" parameter controlling how verbose the final response is). Besides, Anthropic uses something like <max_thinking_length> to tell the model how long it should think in total. So with carefully curated data and RL training, models might acquire more adaptive and efficient thinking abilities.