Update (November 20, 2025): Google DeepMind's newly released Nano Banana Pro (built on Gemini-3 Pro) can in fact already think in images, and it brings a huge leap in image quality and related capabilities. I do slightly suspect the underlying system is a dual-model architecture, e.g., Gemini-3 Pro calling an image generation model during its reasoning, but either way it is still quite impressive. I'm even more curious to see how this capability will extend to other modalities in the future.
For a while now, I've been thinking about how to push the frontier of current reasoning models further: not just performance, but also new use cases and capabilities. Two questions came to mind that I haven't seen anyone else raise (or maybe someone has?):
- How can the current reasoning-model paradigm influence or enhance multimodal models?
- How do so-called "hybrid reasoning models" (possibly) work?
I have some personal thoughts on both topics. They don't aim to be absolutely correct; I'm just sharing my own views.
1. How can the current reasoning-model paradigm influence or enhance multimodal models?
Gemini-2.5 Flash Image (a.k.a. Nano-Banana) stunned many people with its image generation, editing, and many other capabilities. Quite a few models now output not only text but also audio and images. So I started wondering: could we combine multimodal models with reasoning models, so that they "think" not just in words but also in images and audio? If a reasoning model could use multimodal content not only in its final response but also in its internal thinking process, the results might be much better. This sounds convoluted, so a few examples should make it clear:
- Audio modality: A user is preparing a speech and wants the model to generate a sample audio. gpt-realtime or gemini-2.5 can help, but the quality is not guaranteed. If the model could reason about which emotions to convey at which points, generate draft audio clips to listen to, and then iterate, the final quality could be much better. For example:

```markdown
user:
I'll have a speech about the future of our planet Earth next week at the UN. Could you please give me a sample for it?

model (CoT):
The user is asking: "..." So they want a sample speech.
I could give them a text script, but that would not be very helpful. Giving them an example audio would be better. It may also require some emotions. I should work out the structure and contents first.
...
Alright, I am done with the initial composing. Let me explore what the emotions should be and make some drafts; then it will be good to go.
Well, I should have hope, urgency, inspiration, responsibility, solidarity, empathy, ...
OK, so for hope, I think I should use a rising tone, just like this:
[a piece of voice with a rising tone]
Cool, sounds good, let's continue:
...
Alright, I think I have all the emotions done. Let me make a first draft:
[full composed speech with the right emotions]
...

model (response):
Here's the speech I prepared for you:
[full final speech audio]
```
- Image modality: Although the latest Nano-Banana can already create and edit stunning images, it still fails in complex scenarios. That is understandable, since the model produces the image in a single shot. If it could take some time to think and sketch a draft first, the way a human designer does, the results could be much better. For example:

```markdown
user:
Please create a cinematic image of an interstellar war. It needs to be epic.

model (CoT):
The user is asking me to generate an image of ...
So let me first imagine the scene: a massive fleet of sleek starships engaged in battle over a luminous nebula, laser beams and plasma torpedoes streaking across space, explosions creating bright fireballs, a giant battlecruiser in the foreground with a detailed hull, and a background with distant stars and a partially visible planet ...
Cool, let me create a sketch first:
[a sketch of the scene with only a few lines]
...
I see the planet in the image doesn't look real. Let me change its position:
[a final-prepared version of the image with all elements and colors]
...
Great, it seems perfect. Let me give this to the user.

model (response):
Here's the image you want:
[the final version of the image]
```
- Even the video modality (I know no language model can output both text and video yet, but maybe it is a direction worth considering): current video generation models such as Veo-3 are already decent, but they share some obvious shortcomings: clips are too short, they cannot sustain complex storytelling, and so on. If the model could think in modalities like audio and images along the way, its performance might be greatly enhanced. For example:

```markdown
user:
Please give me a short movie about space exploration.

model (CoT):
The user asks for a short movie ...
I plan to write the script first, then draw certain key frames, and create some suitable BGMs. This seems good. Let me proceed with some ideas.
...
Phew. I think this is amazing; the user is gonna love it. So next, frames.
For the opening, I think this would be good:
[a sketch frame for the first part of the story]
...
That's great. I'm all done. Now, music.
[some background audio for specific frames]
...
Sounds good. Then let me go for the final video part.
...

model (response):
Here's the movie you want:
[final movie]
```
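To make the idea of "thinking in multiple modalities" concrete as data, here is a minimal sketch of what an interleaved multimodal chain of thought might look like. Everything here is hypothetical: the `Segment`/`Trace` names and the string payloads are stand-ins for real token sequences, audio latents, or image tensors.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical: one typed segment of a chain of thought. "payload" would
# hold text tokens, audio latents, or image latents in a real system.
@dataclass
class Segment:
    modality: Literal["text", "audio", "image"]
    payload: object

# Hypothetical: a full generation, with hidden thinking separated from
# the visible response, both allowed to interleave modalities freely.
@dataclass
class Trace:
    thinking: list[Segment] = field(default_factory=list)
    response: list[Segment] = field(default_factory=list)

    def think(self, modality: str, payload: object) -> None:
        self.thinking.append(Segment(modality, payload))

    def respond(self, modality: str, payload: object) -> None:
        self.response.append(Segment(modality, payload))

# The image example above, encoded as one trace: the model drafts a sketch
# inside its thinking, critiques it in text, then emits only the final image.
trace = Trace()
trace.think("text", "Imagine the scene: a fleet over a nebula ...")
trace.think("image", "<sketch: rough line drawing>")
trace.think("text", "The planet looks off; move it and refine colors.")
trace.respond("image", "<final rendered image>")

print([s.modality for s in trace.thinking])  # ['text', 'image', 'text']
```

The point of the structure is that draft images and audio live in `thinking`, where they can be revised, while `response` carries only the polished artifact.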
2. How do so-called "hybrid reasoning models" (possibly) work?
A hybrid reasoning model, by definition, can either respond directly or think deeply before responding (and can even decide on its own when more thinking is needed). Only a few models currently have this ability: Claude-3.7, Claude-4, DeepSeek-v3.1, Qwen-3, and a handful of others (GPT-5 doesn't count for now, since it uses a router).
Claude is closed-source, so we don't know exactly how its thinking switch is implemented (perhaps it is similar to the others). DeepSeek-v3.1 and Qwen-3 implement non-thinking mode by directly prefilling an empty thinking block (something like <think> </think>). This is a decent quick fix to make the model skip thinking and respond immediately, but... it seems the Qwen team wasn't satisfied with the results and soon split thinking and non-thinking into two separate models (link).
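The prefilling trick can be sketched as a tiny prompt-assembly helper. Note this is a simplified illustration, not the actual DeepSeek or Qwen chat template; the role markers and tag spelling are assumptions.

```python
def build_prompt(system: str, user: str, thinking: bool) -> str:
    """Assemble a chat prompt; in non-thinking mode, prefill an empty
    think block so the model's next tokens land directly in the visible
    response instead of a chain of thought."""
    prompt = f"<|system|>{system}<|user|>{user}<|assistant|>"
    if not thinking:
        # An already-closed, empty block signals "thinking is over".
        prompt += "<think>\n\n</think>\n"
    return prompt

p = build_prompt("You are helpful.", "What is 1+1?", thinking=False)
print(p.endswith("</think>\n"))  # True
```

Because the closing tag is already in the context, the model is steered away from opening a new thinking block; it never learned to think "after" `</think>`.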
But what if we let the model always think, just with different goals depending on the setting? What I mean is this:
Fundamentally, we can train the model to react differently under different settings (thinking mode on, off, or auto). At inference time we tell the model which mode the user chose through some channel, such as the system prompt. In each mode, the model first glances at the current setting and then decides for itself what to do. This requires less manual intervention; the model itself knows what to do.
Example behaviors:
- Thinking mode on:

```markdown
system: ... <thinking_mode>on</thinking_mode>

user:
...

model (CoT):
Let me see. The thinking mode is set to on, which means I should take more time to keep thinking afterwards.
The user asks ...

model (response):
...
```
- Thinking mode off:

```markdown
system: ... <thinking_mode>off</thinking_mode>

user:
...

model (CoT):
Hmmm... I see the thinking mode is off. This means I should respond directly. Yes, no more thinking. Start responding now.

model (response):
...
```
- Thinking mode auto:

```markdown
system: ... <thinking_mode>auto</thinking_mode>

user:
What is 1+1?

model (CoT):
I see the thinking mode is set to auto. So basically I just need to decide how much I need to think. Hmmm... Let me see.
The user asks: "What is 1+1?" This is trivial: 2. Anything else? No, just the number is fine. Respond now.

model (response):
2.
```

Or:
```markdown
system: ... <thinking_mode>auto</thinking_mode>

user:
...

model (CoT):
... Oh my god, this is hard. According to my setting, I guess I should take more time to think further about this message.
The user asks ...

model (response):
...
```
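On the serving side, this design could look something like the sketch below: the mode tag is always injected into the system prompt, the model always emits a thinking block, and the server strips it before display. The tag names follow the examples above; the helper functions and output format are hypothetical.

```python
import re

def inject_mode(system: str, mode: str) -> str:
    """Always advertise the user's choice; the model decides how to react."""
    if mode not in ("on", "off", "auto"):
        raise ValueError(f"unknown thinking mode: {mode}")
    return f"{system} <thinking_mode>{mode}</thinking_mode>"

def split_output(raw: str) -> tuple[str, str]:
    """Separate the (always present) CoT from the visible response."""
    m = re.match(r"<think>(.*?)</think>(.*)", raw, re.DOTALL)
    if m is None:
        return "", raw.strip()
    return m.group(1).strip(), m.group(2).strip()

# Even in "off" mode the model thinks, just very briefly.
raw = "<think>Mode is off. Respond directly.</think>2."
cot, answer = split_output(raw)
print(answer)  # 2.
```

The difference from the prefilling approach is that nothing is forced at the template level: the short-circuit behavior has to come from training, not from a prefilled `</think>`.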
We could also use RL to teach the model these behaviors:
- For the general "react to the setting" behavior, we can set up a verifier that checks whether the model reacted to the setting correctly, since the model tends to use similar wording or semantics when it does so.
- For whether the model actually follows the setting (e.g., stops thinking when it is off, keeps thinking when it is on), we can reward or penalize based on the model's subsequent behavior. For example, if the mode is set to off but the model insists on thinking anyway, we apply a penalty.
- For auto mode, I think we could use a dataset pre-labeled with easy/hard tags and reward outputs that handle the question appropriately under the auto setting.
This scheme may not actually work, but it could be a path worth exploring. Why do I say that? Look at OpenAI's o-series and GPT-5-thinking models: their reasoning effort is controlled by an internal parameter called "juice" (GPT-5 even has another parameter called "oververbosity" that controls how verbose the response is). Likewise, Anthropic uses something like <max_thinking_length> to tell the model how long it should think in total. So with carefully curated data and RL training, models might acquire more adaptive and efficient thinking abilities.
