
Making a Strawberry in-house

12 mins · LLM

1. Background

When ChatGPT first emerged, we wrote an article titled “Reflections on Transformers” to showcase our early attempts. Over a year later, we have indeed participated in the development of new drugs and in the understanding of pathology. Today, with the release of o1, we want to demonstrate a new attempt: teaching a large language model to reason, so that it can tackle long-range, open-ended tasks.

We will first demonstrate the capabilities of our current model through a demo:

Problem: Prove the lemma that n-th order polynomials on Z/p, where p is a prime, have at most n roots.

Model prediction:

Plan 1: Direct Proof via Polynomial Division

  1. Assume a polynomial has more than n roots.
  2. Choose n+1 distinct roots.
  3. Construct a polynomial of degree n using these roots (e.g., using the product of linear factors).
  4. Divide the original polynomial by this constructed polynomial. Show that the remainder must be zero, leading to a contradiction.

Proving the lemma via polynomial division:

Step 1: Assume a polynomial has more than n roots.

Let’s say we have a polynomial p(x) of degree n over Z/p, and assume it has n+1 distinct roots: r₁, r₂, …, rₙ₊₁.

Step 2: Construct a polynomial of degree n using these roots.

We can construct a polynomial q(x) of degree n using the product of linear factors corresponding to these roots:

q(x) = (x - r₁) (x - r₂) … (x - rₙ₊₁)

Step 3: Divide the original polynomial by the constructed polynomial.

Perform polynomial division of p(x) by q(x). Let the quotient be s(x) and the remainder be r(x):

p(x) = q(x) * s(x) + r(x)

Step 4: Show that the remainder must be zero.

Since each rᵢ is a root of p(x), we know p(rᵢ) = 0 for all i = 1 to n+1. Substituting x = rᵢ into the division equation, we get:

0 = q(rᵢ) * s(rᵢ) + r(rᵢ)

Because q(x) is constructed to have roots at r₁, r₂, …, rₙ₊₁, we have q(rᵢ) = 0 for all i. Therefore:

0 = 0 * s(rᵢ) + r(rᵢ)

This implies r(rᵢ) = 0 for all i = 1 to n+1. Since the remainder r(x) is zero at n+1 distinct points, and the degree of r(x) cannot exceed n (it’s the remainder of a division where the divisor has degree n), the remainder r(x) must be the zero polynomial.

Conclusion:

Therefore, p(x) = q(x) * s(x). This means p(x) is divisible by q(x), a polynomial of degree n. Since p(x) itself has degree n, it follows that p(x) cannot have more than n roots.

This answer appears to have all the necessary components. Two models cooperate in this reasoning process: 1) a larger “plan model” and 2) an average-sized “execution model”. The plan model first provides the CoT, “the plan”; the execution model then generates the specific answer for each subsequent step.

The current reasoning topology differs from the web version of o1 in a noticeable way: the answers provided by the execution model (EM) are not verified by the plan model (PM). This means that even if our PM provides a perfect chain-of-thought (CoT) plan, the EM might still execute it incorrectly, and the error will be displayed to the user.

Of course, this can easily be changed by introducing a time and reasoning-cost budget, just like o1’s. Now, let’s replicate our own o1 in a slightly different way.
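As a concrete reference, here is a minimal sketch of this two-model topology. It assumes an OpenAI-compatible endpoint and uses illustrative model names (`plan-model`, `execution-model`); it is not our production stack.

```python
# Minimal sketch of the plan-model / execution-model topology.
# Model names and the endpoint are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible server hosting both models

def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer(question: str) -> str:
    # Step 1: the larger plan model writes the CoT ("the plan").
    plan = chat("plan-model", f"Write a step-by-step plan to solve:\n{question}")
    # Step 2: the average-sized execution model fills in each step.
    # As noted above, the plan model never verifies this output.
    return chat("execution-model", f"{question}\n\nFollow this plan:\n{plan}")
```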

2. Good Answers via More Compute

The way we interact with LLMs has undergone the following evolution:

Initially, users directly issued tasks to the LLM:

```mermaid
flowchart LR
  AV[User's Question] --> CV((Execution Model)) --> DV[Final Answer]
```

Later, the concept of a thinking chain emerged, breaking down tasks into multiple steps for the model to complete:

```mermaid
flowchart LR
  AR[User's Question] -->|Think step by step!| CR((Execution Model)) --> DR[Final Answer]
```

Today, we see that by dynamically generating CoT, we can directly obtain better output results:

```mermaid
flowchart LR
  AR[User's Question] --> BR((Planner Model)) -->|CoT: Complete steps one through four to answer| CR((Execution Model)) --> DR[Final Answer]
```

Chain of Thoughts

The Chain of Thoughts (CoT) 1 is one of the most important inventions in the current field of LLM reasoning. Through a CoT, a large problem can be broken down into multiple sub-problems whose answers are more accurate and whose intermediate steps are easier to monitor. This is what allows LLMs to solve such problems correctly.

For micro-tasks (generation length < 100), such as text classification, letting the LLM first identify relevant keywords and then summarize them into corresponding categories can significantly improve classification accuracy. For mid-level tasks (generation length < 1000), letting the LLM first explain how it will write code and then actually write it can likewise significantly improve code quality. Even for larger tasks (generation length > 1000), we have seen intelligent agents emerge through multiple chained interactions.
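As an illustration of the micro-task case, the keyword-first pattern can be a simple two-turn prompt. This sketch reuses the hypothetical `chat` helper from the earlier topology sketch; the prompt wording is an assumption.

```python
# Two-step CoT for a micro-task (text classification): first extract
# keywords, then classify using them as intermediate evidence.
def classify(text: str, labels: list[str]) -> str:
    keywords = chat(
        "execution-model",
        f"List the keywords in this text that hint at its topic:\n{text}",
    )
    return chat(
        "execution-model",
        f"Text: {text}\nKeywords: {keywords}\n"
        f"Based on the keywords, answer with exactly one label from {labels}.",
    )
```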

CoT is effective, but writing high-quality CoT is challenging.

  1. From a grammatical perspective, CoT sentences resemble another, syntax-less programming language. They are assembled from consecutive imperative phrases like “do A, calculate B, summarize into C”. Most language-model vendors train a separate version of the model (often suffixed with “coder”) with the same number of parameters specifically for coding or mathematical tasks, partly because of the significant difference in token distribution.

    The following example illustrates how our model, when answering open-ended questions, provides answers similar to those given by OpenAI’s o1 (indicating we are on the right track).

Question: Measuring extended lifespan could take a long time. How can we test a new anti-aging drug in a faster way?

Realm (codename):

Focus on Cellular Senescence

  1. Isolate Senescent Cells: Culture cells from the target species and induce senescence using stressors (e.g., oxidative stress, radiation).
  2. Treat with Drug: Expose senescent cells to the anti-aging drug and monitor changes in senescence markers (e.g., beta-galactosidase activity, p16INK4a expression).
  3. Assess Cellular Function: Evaluate the drug’s impact on cellular functions affected by senescence, such as proliferation, DNA repair, and inflammatory cytokine production.
  4. Compare to Controls: Compare the drug-treated senescent cells to untreated controls and cells treated with known senolytics (drugs that eliminate senescent cells).

…Immune System Function: Assess immunosenescence by measuring responses to vaccines or infections.

OpenAI o1:

Short-Term Clinical Trials with Surrogate Endpoints

  • Healthspan Indicators: Focus on quality-of-life measures rather than lifespan.
  • Disease Progression Metrics: Track the impact on age-related diseases like diabetes, cardiovascular conditions, or osteoporosis.
  • Metabolic Health: Monitor changes in glucose tolerance, lipid profiles, and insulin sensitivity.
  2. Semantically, CoT writing requires highly precise expression and an extremely broad knowledge base (e.g., the Stokes’ theorem example below). Currently, open-source agents rely on a series of manually refined prompts, which limits their accuracy on open-ended or long-range questions (questions whose answers require a recursive thinking depth exceeding the pre-set prompts).

The following example illustrates the difference in results obtained from the same open-source execution model when provided with a generic CoT versus a CoT written by Realm:

Question: Prove \(\oint_{C}Pdx+Qdy=\iint_D\left( \frac{\partial Q}{\partial x} -\frac{\partial P}{\partial y}\right)dxdy\)

Realm:

Plan 1: Green’s Theorem as a Special Case

  1. Start with Stokes’ Theorem: Recall Stokes’ Theorem, which relates a line integral around a closed curve to a surface integral over the surface bounded by the curve.
  2. Choose a suitable surface: Let the surface be the region D itself, with the normal vector pointing upwards.
  3. Express Stokes’ Theorem in 2D: Since D is a planar region, the curl of the vector field (P, Q, 0) simplifies to (∂Q/∂x - ∂P/∂y, 0, 0).
  4. Reduce to Green’s Theorem: The surface integral in Stokes’ Theorem becomes the double integral of (∂Q/∂x - ∂P/∂y) over D, while the line integral reduces to the line integral in Green’s Theorem.

Equiv-param other LLMs:

… Let \(P(x,y)\) and \(Q(x,y)\) be functions of \(x\) and \(y\). Apply Green’s Theorem: \(\oint_{C}Pdx+Qdy=\iint_{D}\left(\frac{\partial Q}{\partial x}+\frac{\partial P}{\partial y}\right)dxdy\), rearrange the equation and simplify into the desired form.

Here we present the execution result of the Realm plan with an off-the-shelf Llama 3.1. We didn’t include the comparison because the other plan was wrong to start with.

  1. Start with Stokes’ Theorem:

Stokes’ Theorem states:

$$ \iint_S \nabla \times \mathbf{F} \cdot d\mathbf{S} = \oint_C \mathbf{F} \cdot d\mathbf{r}$$

where:

  • $S$ is a smooth oriented surface bounded by a simple, closed, piecewise smooth curve $C$.
  • $\mathbf{F}$ is a continuously differentiable vector field.
  • $\nabla \times \mathbf{F}$ is the curl of $\mathbf{F}$.
  • $d\mathbf{S}$ is the outward-pointing normal vector to the surface element.
  • $d\mathbf{r}$ is the tangent vector to the curve element.
  2. Choose a suitable surface:

Let the surface $S$ be the region $D$ itself. Since $D$ is a planar region, we can choose the normal vector to point upwards (in the positive $z$ direction).

  3. Express Stokes’ Theorem in 2D:

We need to find the curl of the vector field $\mathbf{F} = (P, Q, 0)$. In 3D, the curl is given by:

$$\nabla \times \mathbf{F} = \left( \frac{\partial R}{\partial y} - \frac{\partial Q}{\partial z}, \frac{\partial P}{\partial z} - \frac{\partial R}{\partial x}, \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right)$$

Since $R = 0$, the curl simplifies to:

$$\nabla \times \mathbf{F} = \left( 0, 0, \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right)$$

Therefore, the surface integral in Stokes’ Theorem becomes:

$$\iint_D \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dxdy$$

  4. Reduce to Green’s Theorem:

The line integral in Stokes’ Theorem reduces to the line integral around the boundary curve $C$:

$$\oint_C \mathbf{F} \cdot d\mathbf{r} = \oint_C (Pdx + Qdy)$$

Therefore, Stokes’ Theorem in this case gives us:

$$\iint_D \left( \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y} \right) dxdy = \oint_C (Pdx + Qdy)$$

This is precisely Green’s Theorem.

It’s clear that a tutor who can tell a student “You need to look into XYZ” is more likely to produce excellent students than one who can only say “You need to look into that.” These two examples effectively demonstrate the importance of a high-quality, specific CoT. We need to teach a model how to write one.

3. Data Scarcity

High-quality CoT data is scarce. This scarcity stems from three main reasons:

  1. Internet data primarily reflects thinking outcomes, with intermediate processes and revisions already simplified or removed. Top scholars rarely upload scanned drafts of their papers when publishing.

  2. High-quality reasoning requires precise CoTs, not generic ones; CoTs that change with the problem, not static ones; and CoTs ranked by execution success probability, not simply sampled. It’s no longer a matter of a one-size-fits-all “Let’s think step by step” magic phrase. Generating CoT itself is an open problem.

  3. Open problems with limited references necessitate synthetic data. A good synthetic-data solution requires the following cold-start elements:

  • Seed data
  • A generative workshop: Generator
  • A quality control inspector: Discriminator

3.1 Seed Data

Our seed data is reverse-generated. Given a QA pair, we reflectively ask the model what thought process led to the answer and how that process could be abstracted to handle similar problems. Any typical model can execute this process well. Additionally, since correct answers often mention key terms, the summarization model incorporates this information as well (this is how our planner recognized the term “Stokes” when encountering the relevant formula, giving the execution model a significant jump-start).

Through reflective questioning, further refinement, and final rephrasing, we obtain a series of seed CoTs. Since the answers in question-answer data are typically correct, we can assume that the seed data is of high quality.
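A condensed sketch of this reverse-generation step, reusing the hypothetical `chat` helper from earlier; the prompt wording is an assumption, and the real pipeline adds the refinement and rephrasing passes described above.

```python
# Reverse-generate a seed CoT from an existing QA pair by reflective questioning.
def seed_cot(question: str, reference_answer: str) -> str:
    reflection = chat(
        "plan-model",
        f"Question: {question}\nAnswer: {reference_answer}\n"
        "What thought process leads from this question to this answer? "
        "Mention the key terms the answer relies on.",
    )
    return chat(
        "plan-model",
        f"{reflection}\n\nAbstract this thought process into a reusable, "
        "step-by-step plan for similar problems.",
    )
```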

We also employ direct evaluation methods to ensure the quality of our seed data. This involves:

  • Inputting the extracted CoTs into the model and executing them N times.
  • Calculating the similarity and overlap between the generated answers and the actual QA answers, using BERTScore or BLEU, for example (see the sketch after the diagram below).
```mermaid
graph LR
  A[Question]
  B[Answer]
  A --> C["What thought process was demonstrated?"]
  B --> C
  C -->|Summarizing with the answer| D["Seed CoT"]
  A --> G["How would you answer?"]
  G -->|Direct inference| F[Ref CoT]
  D -->|Training| M[Discriminator]
  F -->|Training| M
```
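The direct evaluation can be sketched as follows, again reusing the hypothetical `chat` helper; `bert-score` is one off-the-shelf scorer (BLEU would slot in the same way).

```python
# Score a seed CoT by executing it N times with the execution model and
# comparing the generated answers to the reference answer from the QA pair.
from bert_score import score as bert_score  # pip install bert-score

def cot_quality(question: str, cot: str, reference: str, n: int = 5) -> float:
    candidates = [
        chat("execution-model", f"{question}\n\nFollow this plan:\n{cot}")
        for _ in range(n)
    ]
    _, _, f1 = bert_score(candidates, [reference] * n, lang="en")
    return f1.mean().item()  # average BERTScore F1 over the N executions
```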

3.2 Discriminator

As shown in the diagram above, we can directly feed problems (without answers) to open-source models to attempt generating similar CoTs. These CoTs would likely be of lower quality. Now, for the same problem, we have both high-quality and low-quality CoTs, creating a preference dataset. I suspect many readers have already grasped the key steps of RLHF at this point.
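Concretely, assembling one preference pair might look like this, reusing the sketches above; the record layout and the `open-source-model` name are illustrative assumptions.

```python
# Build one preference pair: the reverse-generated seed CoT is "chosen",
# the CoT a model writes from the question alone is "rejected".
def preference_pair(question: str, reference_answer: str) -> dict:
    chosen = seed_cot(question, reference_answer)  # high quality: sees the answer
    rejected = chat(                               # low quality: question only
        "open-source-model",
        f"Write a step-by-step plan to solve:\n{question}",
    )
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```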

3.3 Generator

What we need is a large quantity of high-quality “deep” questions to serve as pseudo user input. We can extensively search high-quality scientific papers or code snippets, extract chunks, and use the HyDE 2 method to list potential questions related to those chunks. For example, given a passage explaining DNA sequencing methods, we could prompt the model to ask: “Since DNA has a double helix, how do we determine which strand each fragment belongs to during sequencing?” This approach allows us to generate a large number of high-quality questions.
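A sketch of this chunk-to-questions step, once more reusing the hypothetical `chat` helper; the prompt wording is an assumption.

```python
# Generate "deep" pseudo-user questions from a paper or code chunk,
# in the spirit of HyDE: ask what questions the chunk could answer.
def questions_from_chunk(chunk: str, k: int = 5) -> list[str]:
    raw = chat(
        "plan-model",
        f"Here is a passage:\n{chunk}\n\n"
        f"List {k} deep, open-ended questions this passage could help "
        "answer, one per line.",
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]
```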

4. RL or SimPO

At this point, whether to use reinforcement learning or an equivalent method to train the model becomes a matter of preference. We opted for the SimPO method 3, whose basic workflow is as follows:

  • The Planner model being fine-tuned receives two text strings as input: one consisting of the question plus a high-quality CoT, and the other consisting of the question plus a low-quality CoT.
  • The average log-probability of the CoT portion is calculated for each string.
  • The margin between the high-quality and low-quality CoT scores is maximized (a minimal sketch follows below).
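Here is a minimal PyTorch sketch of this objective, following the SimPO paper’s length-normalized reward with scaling β and target margin γ; the masks selecting the CoT tokens are assumed to be prepared by the data pipeline.

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_good: torch.Tensor, logp_bad: torch.Tensor,
               mask_good: torch.Tensor, mask_bad: torch.Tensor,
               beta: float = 2.0, gamma: float = 1.0) -> torch.Tensor:
    """logp_*: per-token log-probs from the Planner; mask_*: 1 on CoT tokens."""
    # Length-normalized log-likelihood of the CoT portion, scaled by beta
    # (this is the "average" score described above).
    r_good = beta * (logp_good * mask_good).sum(-1) / mask_good.sum(-1)
    r_bad = beta * (logp_bad * mask_bad).sum(-1) / mask_bad.sum(-1)
    # Push the good CoT's score above the bad one's by at least gamma.
    return -F.logsigmoid(r_good - r_bad - gamma).mean()
```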

Due to the use of SimPO, we also need to sample new questions during training, generate new CoTs, and then use the Discriminator to evaluate and select the better ones. Looking back, it might be beneficial to train both the Discriminator and Generator online.

Our training process is illustrated in the following diagram:

```mermaid
graph LR
  A[Sample Questions] --> G((Generation Model))
  G -->|Use lemma1, do this| D((Discrimination Model))
  G -.->|Let's think step by step| D
  D --> GS[\Good CoTs/]
  D -.-> BS[\Bad CoTs/]
  GS --> SimPO(((SimPO Trainer)))
  BS -.-> SimPO
  SimPO -->|backprop| G
```

5. Reflections on the Implementation

  • Scaling by Compute: This approach opens up new possibilities. Previously, using LLMs for drafting and trial-and-error was limited by the time users spent breaking down tasks and evaluating results. Now, this limitation can gradually be overcome.
  • Room for Improvement: This preview version is far from the full potential of such models. By accumulating use-case data, we can save well-executed CoTs as references for future use. Open search can also incorporate real-time data to steer the direction of CoT writing.

References