
Growing a Little Strawberry


Background

Back when ChatGPT burst onto the scene, we wrote an article, “Reflections on Transformers”, showing what we had tried. More than a year later, we have indeed taken part in some new-drug R&D and pathology-understanding work. Now that o1 is out, we want to show a new attempt: teaching a large language model to reason, so that it can face long-horizon, open-ended tasks. Let's first demo what the current model can do:

Question: prove the lemma that n-th order polynomials on Z/p, where p is a prime, has at most n roots.

Model answer:

Plan 1: Direct Proof via Polynomial Division

  1. Assume a polynomial has more than n roots.
  2. Choose n+1 distinct roots.
  3. Construct a polynomial of degree n using these roots (e.g., using the product of linear factors).
  4. Divide the original polynomial by this constructed polynomial. Show that the remainder must be zero, leading to a contradiction.

proving the lemma via polynomial division:

Step 1: Assume a polynomial has more than n roots.

Let’s say we have a polynomial p(x) of degree n over Z/p, and assume it has n+1 distinct roots: r₁, r₂, …, rₙ₊₁.

Step 2: Construct a polynomial of degree n using these roots.

We can construct a polynomial q(x) of degree n using the product of linear factors corresponding to these roots:

q(x) = (x - r₁) (x - r₂) … (x - rₙ₊₁)

Step 3: Divide the original polynomial by the constructed polynomial.

Perform polynomial division of p(x) by q(x). Let the quotient be s(x) and the remainder be r(x):

p(x) = q(x) * s(x) + r(x)

Step 4: Show that the remainder must be zero.

Since each rᵢ is a root of p(x), we know p(rᵢ) = 0 for all i = 1 to n+1. Substituting x = rᵢ into the division equation, we get:

0 = q(rᵢ) * s(rᵢ) + r(rᵢ)

Because q(x) is constructed to have roots at r₁, r₂, …, rₙ₊₁, we have q(rᵢ) = 0 for all i. Therefore:

0 = 0 * s(rᵢ) + r(rᵢ)

This implies r(rᵢ) = 0 for all i = 1 to n+1. Since the remainder r(x) is zero at n+1 distinct points, and the degree of r(x) cannot exceed n (it’s the remainder of a division where the divisor has degree n), the remainder r(x) must be the zero polynomial.

Conclusion:

Therefore, p(x) = q(x) * s(x). This means p(x) is divisible by q(x), a polynomial of degree n. Since p(x) itself has degree n, it follows that p(x) cannot have more than n roots.

This looks like it has everything it should. Two models took part in this reasoning run: 1) a larger plan model, and 2) a regular-sized execution model. The first step shown is the guidance given by the plan model; the subsequent steps are the concrete answer produced by the execution model. One obvious difference between our current reasoning topology and the web version of o1 is that the execution model's (EM's) answers are never verified by the plan model (PM): in the current version, even if our PM produces a perfect CoT plan, the EM may still execute it incorrectly, and the result is shown to the user anyway. This is fairly easy to change, but correspondingly we would also have to budget the extra time and inference cost. Now let's reproduce an o1 in a somewhat different way.

Good answers via more compute

The way we call LLMs has already gone through the following evolution. In the earliest days, the user handed the task directly to the LLM:

```mermaid
flowchart LR
  AV[User's question] --> CV((Execution Model)) --> DV[Final answer]
```

Later came chain-of-thought, where the task is broken into several stages for the model to complete:

```mermaid
flowchart LR
  AR[User's question] -->|Think step by step!| CR((Execution Model)) --> DR[Final answer]
```

And today we can see that by making the CoT dynamic, we directly obtain better outputs:

```mermaid
flowchart LR
  AR[User's question] --> BR((Planner Model)) -->|CoT: do steps one to four to answer| CR((Execution Model)) --> DR[Final answer]
```
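As a minimal sketch of this last topology (the `llm_generate` helper below is hypothetical, standing in for whatever inference backend serves each model; it is not our production code):

```python
# Minimal sketch of the Planner -> Executor topology described above.
# `llm_generate(model, prompt)` is a hypothetical helper standing in for
# whatever inference backend serves each model; it returns the completion text.

def llm_generate(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your inference backend")


def answer(question: str) -> str:
    # 1) The larger plan model writes a question-specific CoT plan.
    plan = llm_generate(
        model="planner",
        prompt=f"Write a concrete, step-by-step plan to solve:\n{question}",
    )
    # 2) The regular-sized execution model follows that plan. Note that in
    #    the current version the result is NOT verified by the planner,
    #    so execution errors reach the user.
    return llm_generate(
        model="executor",
        prompt=f"Question: {question}\nPlan:\n{plan}\nExecute the plan step by step.",
    )
```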

Chain of thoughts

Chain of Thought (CoT) [1] is one of the most important inventions in today's LLM reasoning workloads. Through a chain of thought, a large problem can be decomposed into several sub-problems whose per-step accuracy is higher and whose intermediate steps are easier to monitor, and thereby be solved correctly by the LLM. On micro tasks (generation length < 100), take text classification as an example: having the LLM first locate the relevant wording and then summarize it into the corresponding category significantly improves classification accuracy. On mid-sized tasks (generation length < 1000), having the LLM first explain how it would write a piece of code before actually writing it likewise significantly improves code quality. On larger tasks (generation length > 1000), through the interaction of multiple chains, we have also seen the emergence of agents.
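For instance, the micro-task pattern above (first surface the relevant wording, then classify) can be written as two chained calls. A sketch, reusing the hypothetical `llm_generate` helper from the earlier sketch; the prompt wording is illustrative:

```python
# Two-step CoT for text classification: extract evidence first, then decide.
# Reuses the hypothetical `llm_generate` helper from the earlier sketch.

def classify(text: str, labels: list[str]) -> str:
    # Step 1: have the model surface the wording relevant to the topic.
    evidence = llm_generate(
        model="executor",
        prompt=f"List the words and phrases that hint at the topic of this text:\n{text}",
    )
    # Step 2: summarize the evidence into one of the allowed categories.
    return llm_generate(
        model="executor",
        prompt=(
            f"Text: {text}\nRelevant phrases: {evidence}\n"
            f"Based on these phrases, answer with exactly one label from {labels}."
        ),
    )
```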

CoT is effective, but writing high-quality CoT is hard.

  1. At the syntactic level, CoT sentences come close to another programming language: they are assemblies of consecutive verb-object clauses, do A, calculate B, summarize into C; the language simply has no explicit grammar. Most LLM vendors today train a separate same-parameter-count version of their model for coding (and even for math), suffixed coder, partly because of the huge difference in token distributions. The example below illustrates the point: on an open-ended question, our model's answer comes close to what OpenAI's o1 gives (we must have done something right).

    Question: measuring extend lifespan could take a long time. how to test a new anti-aging drug in a faster way?

    Realm:

    Focus on Cellular Senescence
    1. Isolate Senescent Cells: Culture cells from the target species and induce senescence using stressors (e.g., oxidative stress, radiation).
    2. Treat with Drug: Expose senescent cells to the anti-aging drug and monitor changes in senescence markers (e.g., beta-galactosidase activity, p16INK4a expression).
    3. Assess Cellular Function: Evaluate the drug’s impact on cellular functions affected by senescence, such as proliferation, DNA repair, and inflammatory cytokine production.
    4. Compare to Controls: Compare the drug-treated senescent cells to untreated controls and cells treated with known senolytics (drugs that eliminate senescent cells).

    OpenAI o1:

    Immune System Function: Assess immunosenescence by measuring responses to vaccines or infections.
    Short-Term Clinical Trials with Surrogate Endpoints
    Healthspan Indicators: Focus on quality-of-life measures rather than lifespan.
    Disease Progression Metrics: Track the impact on age-related diseases like diabetes, cardiovascular conditions, or osteoporosis.
    Metabolic Health: Monitor changes in glucose tolerance, lipid profiles, and insulin sensitivity.
  2. At the semantic level, writing CoT demands highly precise wording and an extremely broad knowledge base (see the Stokes' theorem example below). Today's open-source agents all work off a series of hand-polished prompts, which limits their accuracy on open-ended problems, or on long-horizon problems (whose recursion depth exceeds the preset prompts). The example below shows the difference between a vague, generic CoT and a CoT written by Realm, when both are executed by the same open-source execution model:

    Question: Prove \(\oint_{C}Pdx+Qdy=\iint_D\left( \frac{\partial Q}{\partial x} -\frac{\partial P}{\partial y}\right)dxdy\)

    Realm (CoT generated):

    Plan 1: Green’s Theorem as a Special Case

    1. Start with Stokes’ Theorem: Recall Stokes’ Theorem, which relates a line integral around a closed curve to a surface integral over the surface bounded by the curve.
    2. Choose a suitable surface: Let the surface be the region D itself, with the normal vector pointing upwards.
    3. Express Stokes’ Theorem in 2D: Since D is a planar region, the curl of the vector field (P, Q, 0) simplifies to (∂Q/∂x - ∂P/∂y, 0, 0).
    4. Reduce to Green’s Theorem: The surface integral in Stokes’ Theorem becomes the double integral of (∂Q/∂x - ∂P/∂y) over D, while the line integral reduces to the line integral in Green’s Theorem.

    Other models of the same parameter count (CoT generated):

    Let \(P(x,y)\) and \(Q(x,y)\) be functions of \(x\) and \(y\). Apply Green’s Theorem: \(\oint_{C}Pdx+Qdy=\iint_{D}\left(\frac{\partial Q}{\partial x}+\frac{\partial P}{\partial y}\right)dxdy\). Rearrange the equation and simplify into the desired form.

Here we show the result of executing the Realm plan above on an ordinary open-source model; we won't include the other side's execution for comparison, because it is wrong.

  1. Start with Stokes’ Theorem:

Stokes’ Theorem states:

$$ \iint_S \nabla \times \mathbf{F} \cdot d\mathbf{S} = \oint_C \mathbf{F} \cdot d\mathbf{r}$$

where:

  • $S$ is a smooth oriented surface bounded by a simple, closed, piecewise smooth curve $C$.
  • $\mathbf{F}$ is a continuously differentiable vector field.
  • $\nabla \times \mathbf{F}$ is the curl of $\mathbf{F}$.
  • $d\mathbf{S}$ is the outward-pointing normal vector to the surface element.
  • $d\mathbf{r}$ is the tangent vector to the curve element.
  2. Choose a suitable surface:

Let the surface $S$ be the region $D$ itself. Since $D$ is a planar region, we can choose the normal vector to point upwards (in the positive $z$ direction).

  3. Express Stokes’ Theorem in 2D:

We need to find the curl of the vector field $\mathbf{F} = (P, Q, 0)$. In 3D, the curl is given by:

$$\nabla \times \mathbf{F} = \left( \frac{\partial R}{\partial y} - \frac{\partial Q}{\partial z}, \frac{\partial P}{\partial z} - \frac{\partial R}{\partial x}, \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right)$$

Since $R = 0$, the curl simplifies to:

$$\nabla \times \mathbf{F} = \left( 0, 0, \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right)$$

Therefore, the surface integral in Stokes’ Theorem becomes:

$$\iint_D \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dxdy$$

  4. Reduce to Green’s Theorem:

The line integral in Stokes’ Theorem reduces to the line integral around the boundary curve $C$:

$$\oint_C \mathbf{F} \cdot d\mathbf{r} = \oint_C (Pdx + Qdy)$$

Therefore, Stokes’ Theorem in this case gives us:

$$\iint_D \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dxdy = \oint_C (Pdx + Qdy)$$

This is precisely Green’s Theorem.

Clearly, a mentor who can tell a student “you should go look up XYZ” is more likely to produce excellent students than one who can only say “you should go look something up.” The two examples above amply show how important a high-quality, concrete CoT is; we need to teach a model to write one.

Data scarcity

High-quality CoT data is scarce, for three reasons:

  1. What we see in internet corpora are finished thoughts; the intermediate steps and revisions have already been pruned away. Top scholars do not scan and upload their scratch paper alongside the published paper.
  2. The CoT that high-quality reasoning needs is precise, not generic; it changes with the problem, rather than being static; it is ranked by probability of successful execution, rather than naively sampled. It is no longer the one-sentence magic of "Let's think step by step." That is: generating CoT is itself an open problem.
  3. Open problems come with scant references; here we need to call on the power of synthetic data. A good synthesis scheme needs the following cold-start ingredients:
    • Seed data
    • A workshop that generates: the Generator
    • A quality inspector that judges: the Discriminator

Seed data

Our seed data is generated in reverse: given a QA pair, we ask the model, reflectively, what thought process the answer just shown demonstrates, and how that process could be abstracted to handle this class of problems. Even ordinary models perform this step well. Moreover, because a correct answer usually mentions the key terms, the summarizing model carries that information over (this is how our planner wrote the word Stokes the moment it saw that kind of formula, handing the execution model an enormous head start).

Reflective questioning, further distillation where needed, and finally rephrasing: this yields a series of seed CoTs. Since the answers in QA data are usually correct, we can treat the seed data as uniformly good data.
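A sketch of this reverse generation, again with the hypothetical `llm_generate` helper; the prompts are illustrative, not our production prompts:

```python
# Reverse seed generation: given a QA pair, ask the model what thought
# process the answer demonstrates, then rephrase it into a reusable CoT.

def seed_cot(question: str, answer: str) -> str:
    reflection = llm_generate(
        model="executor",
        prompt=(
            f"Question: {question}\nAnswer: {answer}\n"
            "What thought process does this answer demonstrate? "
            "Abstract it into steps that would solve this class of problems."
        ),
    )
    return llm_generate(
        model="executor",
        prompt=f"Rephrase the following as a concise, imperative, step-by-step plan:\n{reflection}",
    )
```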

We also applied direct evaluation to quality-control the seed data, as follows:

  • Each extracted CoT is fed back into the model and executed N times.
  • We compute the similarity, overlap, and related metrics between those answers and the actual QA answer (see the sketch after the diagram).

```mermaid
graph LR
  A[Question]
  B[Answer]
  A --> C["What thought process does this show?"]
  B --> C
  C -->|summarize with the answer in hand| D["Seed CoT"]
  A --> G["How would you solve this?"]
  G -->|direct reasoning| F[Ref CoT]
  D -->|training| M[Discriminator]
  F -->|training| M
```
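A sketch of that check (the `similarity` function here is a simple stand-in; in practice we combined several similarity and overlap metrics):

```python
# Quality control for a seed CoT: execute it N times and measure how close
# the resulting answers land to the reference answer from the QA pair.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in metric; swap in whatever overlap measure you prefer.
    return SequenceMatcher(None, a, b).ratio()


def score_seed_cot(question: str, cot: str, reference: str, n: int = 8) -> float:
    scores = []
    for _ in range(n):
        out = llm_generate(
            model="executor",
            prompt=f"Question: {question}\nPlan:\n{cot}\nExecute the plan.",
        )
        scores.append(similarity(out, reference))
    return sum(scores) / len(scores)  # mean closeness over N executions
```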

Discriminator

As the diagram above shows, we can throw the questions (no answers needed) at open-source models and have them attempt similar CoTs; these CoTs we can take to be fairly poor. So for one and the same question we now have a good answer and a bad answer: that is preference data. I suspect that by this point many readers have realized the key step of RLHF is already complete.
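Concretely, each question then yields one preference record; the shape below is illustrative:

```python
# One preference record per question: the seed CoT is "chosen", the
# open-source model's generic CoT is "rejected". Field names are illustrative.
preference_pair = {
    "prompt":   "Prove Green's theorem ...",
    "chosen":   "Plan 1: Start with Stokes' Theorem; choose the region D as the surface; ...",
    "rejected": "Let's think step by step.",
}
```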

Generator

We can gather high-quality scientific papers or code segments in bulk, take chunks, and, following the HyDE [2] approach, enumerate the questions each chunk could be asked. For example, given a passage introducing DNA sequencing methods, have the model ask: “Since DNA is double-stranded, how do I know which strand each fragment came from during sequencing?” This way we also obtain a large supply of good questions.
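A sketch of this question mining, once more with the hypothetical helper; chunking by character count is a simplification:

```python
# HyDE-style question mining: chunk high-quality documents and ask the
# model which questions each chunk could answer.

def mine_questions(document: str, chunk_size: int = 1000) -> list[str]:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    questions = []
    for chunk in chunks:
        out = llm_generate(
            model="executor",
            prompt=f"List questions a curious reader could ask that this passage answers:\n{chunk}",
        )
        # One question per line, possibly bulleted.
        questions.extend(line.strip("-• ").strip() for line in out.splitlines() if line.strip())
    return questions
```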

RL or SimPO

By this point, whether to train the model with reinforcement learning or some equivalent method becomes a matter of taste. We use the SimPO method [3], whose basic procedure is:

  • Feed the Planner model being fine-tuned two text strings: the question plus a good CoT, and the question plus a bad CoT.
  • Compute the average per-token log-probability over the CoT portion.
  • Maximize the gap between the good CoT's and the bad CoT's scores (a sketch follows this list).
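A minimal PyTorch sketch of that objective, following the length-normalized reward from the SimPO paper (β is the reward scale, γ the target margin; it assumes the logits are already aligned so that `logits[:, t]` predicts `ids[:, t]`):

```python
import torch
import torch.nn.functional as F


def simpo_loss(logits_good, logits_bad, good_ids, bad_ids, good_mask, bad_mask,
               beta: float = 2.0, gamma: float = 1.0):
    """SimPO loss on one batch of (good CoT, bad CoT) pairs.

    logits_*: [batch, seq, vocab] planner logits over question+CoT;
    *_ids:    [batch, seq] token ids; *_mask: 1.0 on CoT tokens only,
    so the average is taken over the CoT portion, not the question.
    """
    def avg_logp(logits, ids, mask):
        logp = torch.log_softmax(logits, dim=-1)
        tok = logp.gather(-1, ids.unsqueeze(-1)).squeeze(-1)  # [batch, seq]
        return (tok * mask).sum(-1) / mask.sum(-1)            # mean over CoT tokens

    r_good = beta * avg_logp(logits_good, good_ids, good_mask)
    r_bad = beta * avg_logp(logits_bad, bad_ids, bad_mask)
    # Push the good CoT's score above the bad one's by at least gamma.
    return -F.logsigmoid(r_good - r_bad - gamma).mean()
```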

Because we use SimPO, we also have to sample new questions during training, generate new CoTs, and use the Discriminator to evaluate them and pick out the better ones. Looking back, it would probably have been better to train both the Discriminator and the Generator online.

To summarize, our training process looks like this:

```mermaid
graph TB
  A[Sample Questions] --> G((Generator))
  G -->|Use lemma1, do this| D((Discriminator))
  G -.->|Let's think step by step| D
  D --> GS[\Good CoTs/]
  D -.-> BS[\Bad CoTs/]
  GS --> SimPO(((SimPO Trainer)))
  BS -.-> SimPO
  SimPO -->|backprop| G
```

Thoughts on the reproduction

  • Scaling by compute is a new road. Previously, when using an LLM as scratch paper for trial and error, the cost was bounded by the user's time spent decomposing the task and judging the results; now that bound, too, can be progressively broken.
  • The current preview is nowhere near the level this class of models should reach. By accumulating data, we can save well-executed CoTs as references for later runs, and open up search to fold real-time data in, steering the direction in which the CoT is written.

References