
Growing a Little Strawberry


Background

Back when ChatGPT burst onto the scene, we wrote an article, “Reflections on Transformers”, showing what we had tried. More than a year later, we have indeed taken part in some new-drug R&D and pathology-understanding work. Now that o1 is out, we want to show a new attempt: teaching a large language model to reason, so that it can face long-horizon, open-ended tasks. Let's first demo what the current model can do:

Question: prove the lemma that n-th order polynomials on Z/p, where p is a prime, has at most n roots.

Model answer:

Plan 1: Direct Proof via Polynomial Division

  1. Assume a polynomial has more than n roots.
  2. Choose n+1 distinct roots.
  3. Construct a polynomial of degree n using these roots (e.g., using the product of linear factors).
  4. Divide the original polynomial by this constructed polynomial. Show that the remainder must be zero, leading to a contradiction.

proving the lemma via polynomial division:

Step 1: Assume a polynomial has more than n roots.

Let’s say we have a polynomial p(x) of degree n over Z/p, and assume it has n+1 distinct roots: r₁, r₂, …, rₙ₊₁.

Step 2: Construct a polynomial of degree n using these roots.

We can construct a polynomial q(x) of degree n using the product of linear factors corresponding to these roots:

q(x) = (x - r₁) (x - r₂) … (x - rₙ₊₁)

Step 3: Divide the original polynomial by the constructed polynomial.

Perform polynomial division of p(x) by q(x). Let the quotient be s(x) and the remainder be r(x):

p(x) = q(x) * s(x) + r(x)

Step 4: Show that the remainder must be zero.

Since each rᵢ is a root of p(x), we know p(rᵢ) = 0 for all i = 1 to n+1. Substituting x = rᵢ into the division equation, we get:

0 = q(rᵢ) * s(rᵢ) + r(rᵢ)

Because q(x) is constructed to have roots at r₁, r₂, …, rₙ₊₁, we have q(rᵢ) = 0 for all i. Therefore:

0 = 0 * s(rᵢ) + r(rᵢ)

This implies r(rᵢ) = 0 for all i = 1 to n+1. Since the remainder r(x) is zero at n+1 distinct points, and the degree of r(x) cannot exceed n (it’s the remainder of a division where the divisor has degree n), the remainder r(x) must be the zero polynomial.

Conclusion:

Therefore, p(x) = q(x) * s(x). This means p(x) is divisible by q(x), a polynomial of degree n. Since p(x) itself has degree n, it follows that p(x) cannot have more than n roots.

This looks like it has everything it should. Two models took part in this reasoning run: 1) a larger plan model, and 2) a regular-sized execution model. The first step shown is the guidance given by the plan model; the subsequent steps are the concrete answer produced by the execution model. One obvious difference between our current reasoning topology and the web version of o1 is that the execution model's (EM's) answers are never verified by the plan model (PM): in the current version, even if our PM produces a perfect CoT plan, the EM may still execute it incorrectly, and the result is shown to the user anyway. This is fairly easy to change, but correspondingly we would also have to budget the extra time and inference cost. Now let's reproduce an o1 in a somewhat different way.

Good answers via more compute

The way we call LLMs has already gone through the following evolution. In the earliest days, the user handed the task directly to the LLM:

```mermaid
flowchart LR
  AV[User's question] --> CV((Execution Model)) --> DV[Final answer]
```

Later came chain-of-thought, where the task is broken into several stages for the model to complete:

```mermaid
flowchart LR
  AR[User's question] -->|Think step by step!| CR((Execution Model)) --> DR[Final answer]
```

And today we can see that by making the CoT dynamic, we directly obtain better outputs:

```mermaid
flowchart LR
  AR[User's question] --> BR((Planner Model)) -->|CoT: do steps one to four to answer| CR((Execution Model)) --> DR[Final answer]
```
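As a minimal sketch of this last topology (the `llm_generate` helper below is hypothetical, standing in for whatever inference backend serves each model; it is not our production code):

```python
# Minimal sketch of the Planner -> Executor topology described above.
# `llm_generate(model, prompt)` is a hypothetical helper standing in for
# whatever inference backend serves each model; it returns the completion text.

def llm_generate(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your inference backend")


def answer(question: str) -> str:
    # 1) The larger plan model writes a question-specific CoT plan.
    plan = llm_generate(
        model="planner",
        prompt=f"Write a concrete, step-by-step plan to solve:\n{question}",
    )
    # 2) The regular-sized execution model follows that plan. Note that in
    #    the current version the result is NOT verified by the planner,
    #    so execution errors reach the user.
    return llm_generate(
        model="executor",
        prompt=f"Question: {question}\nPlan:\n{plan}\nExecute the plan step by step.",
    )
```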

Chain of thoughts

Chain of Thought (CoT) [1] is one of the most important inventions in today's LLM reasoning workloads. Through a chain of thought, a large problem can be decomposed into several sub-problems whose per-step accuracy is higher and whose intermediate steps are easier to monitor, and thereby be solved correctly by the LLM. On micro tasks (generation length < 100), take text classification as an example: having the LLM first locate the relevant wording and then summarize it into the corresponding category significantly improves classification accuracy. On mid-sized tasks (generation length < 1000), having the LLM first explain how it would write a piece of code before actually writing it likewise significantly improves code quality. On larger tasks (generation length > 1000), through the interaction of multiple chains, we have also seen the emergence of agents.
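For instance, the micro-task pattern above (first surface the relevant wording, then classify) can be written as two chained calls. A sketch, reusing the hypothetical `llm_generate` helper from the earlier sketch; the prompt wording is illustrative:

```python
# Two-step CoT for text classification: extract evidence first, then decide.
# Reuses the hypothetical `llm_generate` helper from the earlier sketch.

def classify(text: str, labels: list[str]) -> str:
    # Step 1: have the model surface the wording relevant to the topic.
    evidence = llm_generate(
        model="executor",
        prompt=f"List the words and phrases that hint at the topic of this text:\n{text}",
    )
    # Step 2: summarize the evidence into one of the allowed categories.
    return llm_generate(
        model="executor",
        prompt=(
            f"Text: {text}\nRelevant phrases: {evidence}\n"
            f"Based on these phrases, answer with exactly one label from {labels}."
        ),
    )
```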

CoT is effective, but writing high-quality CoT is hard.

  1. At the syntactic level, CoT sentences come close to another programming language: they are assemblies of consecutive verb-object clauses, do A, calculate B, summarize into C; the language simply has no explicit grammar. Most LLM vendors today train a separate same-parameter-count version of their model for coding (and even for math), suffixed coder, partly because of the huge difference in token distributions. The example below illustrates the point: on an open-ended question, our model's answer comes close to what OpenAI's o1 gives (we must have done something right).

    Question: measuring extend lifespan could take a long time. how to test a new anti-aging drug in a faster way?

    Realm:

    Focus on Cellular Senescence
    1. Isolate Senescent Cells: Culture cells from the target species and induce senescence using stressors (e.g., oxidative stress, radiation).
    2. Treat with Drug: Expose senescent cells to the anti-aging drug and monitor changes in senescence markers (e.g., beta-galactosidase activity, p16INK4a expression).
    3. Assess Cellular Function: Evaluate the drug’s impact on cellular functions affected by senescence, such as proliferation, DNA repair, and inflammatory cytokine production.
    4. Compare to Controls: Compare the drug-treated senescent cells to untreated controls and cells treated with known senolytics (drugs that eliminate senescent cells).

    OpenAI o1:

    Immune System Function: Assess immunosenescence by measuring responses to vaccines or infections.
    Short-Term Clinical Trials with Surrogate Endpoints
    Healthspan Indicators: Focus on quality-of-life measures rather than lifespan.
    Disease Progression Metrics: Track the impact on age-related diseases like diabetes, cardiovascular conditions, or osteoporosis.
    Metabolic Health: Monitor changes in glucose tolerance, lipid profiles, and insulin sensitivity.
  2. At the semantic level, writing CoT demands highly precise wording and an extremely broad knowledge base (see the Stokes' theorem example below). Today's open-source agents all work off a series of hand-polished prompts, which limits their accuracy on open-ended problems, or on long-horizon problems (whose recursion depth exceeds the preset prompts). The example below shows the difference between a vague, generic CoT and a CoT written by Realm, when both are executed by the same open-source execution model:

    Question: Prove \(\oint_{C}Pdx+Qdy=\iint_D\left( \frac{\partial Q}{\partial x} -\frac{\partial P}{\partial y}\right)dxdy\)

    Realm (CoT generated):

    Plan 1: Green’s Theorem as a Special Case

    1. Start with Stokes’ Theorem: Recall Stokes’ Theorem, which relates a line integral around a closed curve to a surface integral over the surface bounded by the curve.
    2. Choose a suitable surface: Let the surface be the region D itself, with the normal vector pointing upwards.
    3. Express Stokes’ Theorem in 2D: Since D is a planar region, the curl of the vector field (P, Q, 0) simplifies to (∂Q/∂x - ∂P/∂y, 0, 0).
    4. Reduce to Green’s Theorem: The surface integral in Stokes’ Theorem becomes the double integral of (∂Q/∂x - ∂P/∂y) over D, while the line integral reduces to the line integral in Green’s Theorem.

    Other models of the same parameter count (CoT generated):

    Let \(P(x,y)\) and \(Q(x,y)\) be functions of \(x\) and \(y\). Apply Green’s Theorem: \(\oint_{C}Pdx+Qdy=\iint_{D}\left(\frac{\partial Q}{\partial x}+\frac{\partial P}{\partial y}\right)dxdy\). Rearrange the equation and simplify into the desired form.

Here we show the result of executing the Realm plan above on an ordinary open-source model; we won't include the other side's execution for comparison, because it is wrong.

  1. Start with Stokes’ Theorem:

Stokes’ Theorem states:

$$ \iint_S \nabla \times \mathbf{F} \cdot d\mathbf{S} = \oint_C \mathbf{F} \cdot d\mathbf{r}$$

where:

  • $S$ is a smooth oriented surface bounded by a simple, closed, piecewise smooth curve $C$.
  • $\mathbf{F}$ is a continuously differentiable vector field.
  • $\nabla \times \mathbf{F}$ is the curl of $\mathbf{F}$.
  • $d\mathbf{S}$ is the outward-pointing normal vector to the surface element.
  • $d\mathbf{r}$ is the tangent vector to the curve element.
  2. Choose a suitable surface:

Let the surface $S$ be the region $D$ itself. Since $D$ is a planar region, we can choose the normal vector to point upwards (in the positive $z$ direction).

  3. Express Stokes’ Theorem in 2D:

We need to find the curl of the vector field $\mathbf{F} = (P, Q, 0)$. In 3D, the curl is given by:

$$\nabla \times \mathbf{F} = \left( \frac{\partial R}{\partial y} - \frac{\partial Q}{\partial z}, \frac{\partial P}{\partial z} - \frac{\partial R}{\partial x}, \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right)$$

Since $R = 0$, the curl simplifies to:

$$\nabla \times \mathbf{F} = \left( 0, 0, \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right)$$

Therefore, the surface integral in Stokes’ Theorem becomes:

$$\iint_D \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dxdy$$

  4. Reduce to Green’s Theorem:

The line integral in Stokes’ Theorem reduces to the line integral around the boundary curve $C$:

$$\oint_C \mathbf{F} \cdot d\mathbf{r} = \oint_C (Pdx + Qdy)$$

Therefore, Stokes’ Theorem in this case gives us:

$$\iint_D \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dxdy = \oint_C (Pdx + Qdy)$$

This is precisely Green’s Theorem.

Clearly, a mentor who can tell a student “you should go look up XYZ” is more likely to produce excellent students than one who can only say “you should go look something up.” The two examples above amply show how important a high-quality, concrete CoT is; we need to teach a model to write one.

Data scarcity

High-quality CoT data is scarce, for three reasons:

  1. What we see in internet corpora are finished thoughts; the intermediate steps and revisions have already been pruned away. Top scholars do not scan and upload their scratch paper alongside the published paper.
  2. The CoT that high-quality reasoning needs is precise, not generic; it changes with the problem, rather than being static; it is ranked by probability of successful execution, rather than naively sampled. It is no longer the one-sentence magic of "Let's think step by step." That is: generating CoT is itself an open problem.
  3. Open problems come with scant references; here we need to call on the power of synthetic data. A good synthesis scheme needs the following cold-start ingredients:
    • Seed data
    • A workshop that generates: the Generator
    • A quality inspector that judges: the Discriminator

Seed data

Our seed data is generated in reverse: given a QA pair, we ask the model, reflectively, what thought process the answer just shown demonstrates, and how that process could be abstracted to handle this class of problems. Even ordinary models perform this step well. Moreover, because a correct answer usually mentions the key terms, the summarizing model carries that information over (this is how our planner wrote the word Stokes the moment it saw that kind of formula, handing the execution model an enormous head start).

Reflective questioning, further distillation where needed, and finally rephrasing: this yields a series of seed CoTs. Since the answers in QA data are usually correct, we can treat the seed data as uniformly good data.
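A sketch of this reverse generation, again with the hypothetical `llm_generate` helper; the prompts are illustrative, not our production prompts:

```python
# Reverse seed generation: given a QA pair, ask the model what thought
# process the answer demonstrates, then rephrase it into a reusable CoT.

def seed_cot(question: str, answer: str) -> str:
    reflection = llm_generate(
        model="executor",
        prompt=(
            f"Question: {question}\nAnswer: {answer}\n"
            "What thought process does this answer demonstrate? "
            "Abstract it into steps that would solve this class of problems."
        ),
    )
    return llm_generate(
        model="executor",
        prompt=f"Rephrase the following as a concise, imperative, step-by-step plan:\n{reflection}",
    )
```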

We also applied direct evaluation to quality-control the seed data, as follows:

  • Each extracted CoT is fed back into the model and executed N times.
  • We compute the similarity, overlap, and related metrics between those answers and the actual QA answer (see the sketch after the diagram).

```mermaid
graph LR
  A[Question]
  B[Answer]
  A --> C["What thought process does this show?"]
  B --> C
  C -->|summarize with the answer in hand| D["Seed CoT"]
  A --> G["How would you solve this?"]
  G -->|direct reasoning| F[Ref CoT]
  D -->|training| M[Discriminator]
  F -->|training| M
```
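A sketch of that check (the `similarity` function here is a simple stand-in; in practice we combined several similarity and overlap metrics):

```python
# Quality control for a seed CoT: execute it N times and measure how close
# the resulting answers land to the reference answer from the QA pair.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in metric; swap in whatever overlap measure you prefer.
    return SequenceMatcher(None, a, b).ratio()


def score_seed_cot(question: str, cot: str, reference: str, n: int = 8) -> float:
    scores = []
    for _ in range(n):
        out = llm_generate(
            model="executor",
            prompt=f"Question: {question}\nPlan:\n{cot}\nExecute the plan.",
        )
        scores.append(similarity(out, reference))
    return sum(scores) / len(scores)  # mean closeness over N executions
```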

Discriminator

As the diagram above shows, we can throw the questions (no answers needed) at open-source models and have them attempt similar CoTs; these CoTs we can take to be fairly poor. So for one and the same question we now have a good answer and a bad answer: that is preference data. I suspect that by this point many readers have realized the key step of RLHF is already complete.
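Concretely, each question then yields one preference record; the shape below is illustrative:

```python
# One preference record per question: the seed CoT is "chosen", the
# open-source model's generic CoT is "rejected". Field names are illustrative.
preference_pair = {
    "prompt":   "Prove Green's theorem ...",
    "chosen":   "Plan 1: Start with Stokes' Theorem; choose the region D as the surface; ...",
    "rejected": "Let's think step by step.",
}
```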

Generator

We can gather high-quality scientific papers or code segments in bulk, take chunks, and, following the HyDE [2] approach, enumerate the questions each chunk could be asked. For example, given a passage introducing DNA sequencing methods, have the model ask: “Since DNA is double-stranded, how do I know which strand each fragment came from during sequencing?” This way we also obtain a large supply of good questions.
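A sketch of this question mining, once more with the hypothetical helper; chunking by character count is a simplification:

```python
# HyDE-style question mining: chunk high-quality documents and ask the
# model which questions each chunk could answer.

def mine_questions(document: str, chunk_size: int = 1000) -> list[str]:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    questions = []
    for chunk in chunks:
        out = llm_generate(
            model="executor",
            prompt=f"List questions a curious reader could ask that this passage answers:\n{chunk}",
        )
        # One question per line, possibly bulleted.
        questions.extend(line.strip("-• ").strip() for line in out.splitlines() if line.strip())
    return questions
```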

RL or SimPO

By this point, whether to train the model with reinforcement learning or some equivalent method becomes a matter of taste. We use the SimPO method [3], whose basic procedure is:

  • Feed the Planner model being fine-tuned two text strings: the question plus a good CoT, and the question plus a bad CoT.
  • Compute the average per-token log-probability over the CoT portion.
  • Maximize the gap between the good CoT's and the bad CoT's scores (a sketch follows this list).
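A minimal PyTorch sketch of that objective, following the length-normalized reward from the SimPO paper (β is the reward scale, γ the target margin; it assumes the logits are already aligned so that `logits[:, t]` predicts `ids[:, t]`):

```python
import torch
import torch.nn.functional as F


def simpo_loss(logits_good, logits_bad, good_ids, bad_ids, good_mask, bad_mask,
               beta: float = 2.0, gamma: float = 1.0):
    """SimPO loss on one batch of (good CoT, bad CoT) pairs.

    logits_*: [batch, seq, vocab] planner logits over question+CoT;
    *_ids:    [batch, seq] token ids; *_mask: 1.0 on CoT tokens only,
    so the average is taken over the CoT portion, not the question.
    """
    def avg_logp(logits, ids, mask):
        logp = torch.log_softmax(logits, dim=-1)
        tok = logp.gather(-1, ids.unsqueeze(-1)).squeeze(-1)  # [batch, seq]
        return (tok * mask).sum(-1) / mask.sum(-1)            # mean over CoT tokens

    r_good = beta * avg_logp(logits_good, good_ids, good_mask)
    r_bad = beta * avg_logp(logits_bad, bad_ids, bad_mask)
    # Push the good CoT's score above the bad one's by at least gamma.
    return -F.logsigmoid(r_good - r_bad - gamma).mean()
```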

Because we use SimPO, we also have to sample new questions during training, generate new CoTs, and use the Discriminator to evaluate them and pick out the better ones. Looking back, it would probably have been better to train both the Discriminator and the Generator online.

To summarize, our training process looks like this:

```mermaid
graph TB
  A[Sample Questions] --> G((Generator))
  G -->|Use lemma1, do this| D((Discriminator))
  G -.->|Let's think step by step| D
  D --> GS[\Good CoTs/]
  D -.-> BS[\Bad CoTs/]
  GS --> SimPO(((SimPO Trainer)))
  BS -.-> SimPO
  SimPO -->|backprop| G
```

Thoughts on the reproduction

  • Scaling by compute is a new road. Previously, when using an LLM as scratch paper for trial and error, the cost was bounded by the user's time spent decomposing the task and judging the results; now that bound, too, can be progressively broken.
  • The current preview is nowhere near the level this class of models should reach. By accumulating data, we can save well-executed CoTs as references for later runs, and open up search to fold real-time data in, steering the direction in which the CoT is written.

References