ICLR 2026 Accepted Paper

Thoth

Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism

Thoth achieves 52.10% average performance, surpassing ChatGPT-4o by 3.69 points.

  • Average Performance: 52.10% (+3.69% vs ChatGPT-4o)
  • Protocols in SciRecipe: 12K+
  • Biological Subfields: 27

Four Core Contributions

Thoth achieves a breakthrough in bio-experimental protocol generation through an innovative dataset, reasoning paradigm, evaluation mechanism, and training strategy.

SciRecipe Dataset

Large-scale multi-task dataset covering 27 biological subfields with 12K+ structured protocols.


Sketch-and-Fill

Novel reasoning paradigm that transforms open-ended queries into verifiable structured protocols.

SCORE Mechanism

Structured reward framework jointly measuring step granularity, order consistency, and semantic fidelity.

Thoth Models

Protocol generation models with strong reasoning capabilities, achieving SOTA on multiple benchmarks.


Methodology

Thoth couples the Sketch-and-Fill reasoning paradigm with the fine-grained SCORE reward mechanism and a three-stage training strategy to ensure the accuracy and executability of generated protocols.

Sketch-and-Fill Paradigm

<think> Think Stage

Decompose objectives, identify dependencies, justify steps

<key> Key Stage

Convert strategy to atomic, machine-readable JSON steps

<orc> Orchestrate Stage

Expand structured steps into fluent natural language

<note> Note Stage

Add critical safety information and warnings
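
To make the four components concrete, here is a minimal sketch of parsing a Sketch-and-Fill style response in Python. The tag names follow the stages above; the protocol content and the exact JSON schema inside <key> are illustrative assumptions, not the released format.

```python
import json
import re

# Hypothetical Sketch-and-Fill output. The four tags mirror the stages above;
# the plasmid-prep content and the step schema are invented for illustration.
response = """
<think>Goal: extract plasmid DNA. Lysis must precede neutralization; alkaline
lysis is justified because it separates plasmid from genomic DNA.</think>
<key>[
  {"step": 1, "action": "Pellet 5 mL overnight culture", "params": {"speed": "4000 g", "time": "10 min"}},
  {"step": 2, "action": "Resuspend pellet in resuspension buffer", "params": {"volume": "250 uL"}},
  {"step": 3, "action": "Add lysis buffer and invert gently", "params": {"invert": "4-6 times"}}
]</key>
<orc>First, pellet 5 mL of the overnight culture at 4000 g for 10 minutes. Then
resuspend the pellet in 250 uL of resuspension buffer, add lysis buffer, and
invert gently 4-6 times.</orc>
<note>Do not vortex after adding lysis buffer; shearing releases genomic DNA.</note>
"""

# The <key> block is the machine-readable backbone that downstream checks
# (format gate, consistency gate, step rewards) can verify automatically.
key_block = re.search(r"<key>(.*?)</key>", response, re.DOTALL).group(1)
steps = json.loads(key_block)
print(f"{len(steps)} atomic steps parsed; first action: {steps[0]['action']}")
```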

SCORE Mechanism

1. Format Gate

Ensures output contains all four components with correct JSON structure

2. Consistency Gate

Verifies step-by-step correspondence between <key> and <orc>

3. Step Scale Reward

Measures the gap between the generated and ground-truth step counts: f(d) = cos(π·d / (2M)), where d is the step-count gap

4. Step Semantics Reward

Evaluates order consistency (LCS or strict matching) and semantic alignment: r = r_order · r_semantic
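
As a toy illustration of how these two rewards could be computed, the sketch below implements the quoted formulas; the tolerance M, the zero floor on f(d), the LCS normalization, and the stand-in semantic score are assumptions, not the paper's exact definitions.

```python
import math

def step_scale_reward(n_generated: int, n_reference: int, M: int = 10) -> float:
    """Step Scale Reward: f(d) = cos(pi * d / (2M)), with d the step-count gap.
    Assumption: the reward is floored at 0 once d exceeds the tolerance M."""
    d = abs(n_generated - n_reference)
    return max(0.0, math.cos(math.pi * d / (2 * M)))

def lcs_length(a: list, b: list) -> int:
    """Classic dynamic-programming longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def step_semantics_reward(pred_order: list, ref_order: list, semantic_sim: float) -> float:
    """r = r_order * r_semantic. Here r_order is the LCS variant, normalized by
    the reference length, and semantic_sim stands in for whatever matcher scores
    the aligned steps (e.g. an embedding similarity); both are assumptions."""
    r_order = lcs_length(pred_order, ref_order) / max(len(ref_order), 1)
    return r_order * semantic_sim

# Example: 9 generated vs 8 reference steps, one pair of steps out of order.
print(step_scale_reward(9, 8))                                   # ~0.99
print(step_semantics_reward([1, 2, 4, 3], [1, 2, 3, 4], 0.80))   # 0.75 * 0.80 = 0.60
```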

Three-Stage Training Strategy

Stage 1: Pre-training (PT)

Learn semantic structure from protocol text

Duration: ~50K steps | LR: 1e-4

Stage 2: Supervised Fine-tuning (SFT)

Align with Sketch-and-Fill paradigm

Duration: ~30K steps | LR: 5e-5

Stage 3: Reinforcement Learning (RL)

Optimize with SCORE rewards

Duration: ~20K steps | LR: 1e-5
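
For reference, the schedule above can be summarized in a small config; only the step counts and learning rates come from the table, while the field names and any other hyperparameters are assumptions.

```python
# Illustrative three-stage schedule; step counts and learning rates follow the
# summary above, everything else (field names, objectives wording) is assumed.
TRAINING_STAGES = [
    {"stage": "PT",  "objective": "learn protocol text structure",          "steps": 50_000, "lr": 1e-4},
    {"stage": "SFT", "objective": "align with Sketch-and-Fill outputs",     "steps": 30_000, "lr": 5e-5},
    {"stage": "RL",  "objective": "policy optimization with SCORE rewards", "steps": 20_000, "lr": 1e-5},
]

for cfg in TRAINING_STAGES:
    print(f"{cfg['stage']}: ~{cfg['steps']:,} steps @ lr={cfg['lr']:.0e} ({cfg['objective']})")
```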

Experimental Results

Thoth achieves state-of-the-art performance on SciRecipe-Eval and multiple scientific benchmarks, outperforming leading existing models.

  • Average Performance: 52.10% (+3.69% vs ChatGPT-4o)
  • Semantic Alignment: 46.60% (+6.56% vs ChatGPT-4o)
  • Step Matching: 53.00% (+9.00% vs ChatGPT-4o)
  • Order Consistency: 75.34% (+2.07% vs ChatGPT-4o)

SciRecipe-Eval Benchmark Results

| Model | Semantic-A | Order-LCS | Order-S | Step-M | Avg |
|---|---|---|---|---|---|
| GPT-5 | 27.79 | 58.12 | 11.35 | 18.79 | 32.84 |
| ChatGPT-4o | 40.04 | 73.27 | 24.00 | 44.00 | 48.41 |
| Claude Opus 4.1 | 41.32 | 71.70 | 21.80 | 34.59 | 45.21 |
| DeepSeek-V3 | 41.72 | 73.97 | 21.44 | 41.71 | 48.16 |
| Thoth-mini | 44.28 | 74.68 | 25.33 | 52.67 | 51.10 |
| Thoth | 46.60 | 75.34 | 25.50 | 53.00 | 52.10 |

Performance Comparison Across Models

[Bar chart: Semantic-A, Step-M, and Average scores for GPT-5, ChatGPT-4o, Claude Opus 4.1, DeepSeek-V3, Thoth-mini, and Thoth]

SciRecipe Dataset

A large-scale, multi-task dataset designed to improve and evaluate LLMs' understanding and generation of experimental protocols.

  • Expert-curated protocols: 12,000+
  • Biological subfields: 27
  • Task types: 8
  • Data splits: 3 (train/val/test)

Protocol-Comprehension Tasks

  • Overview: Global protocol summarization
  • Specific: Fine-grained component analysis

Problem-Solving Tasks

  • Retrieval, Planning, Troubleshooting
  • Constraint, Scaling, Safety
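
For programmatic access, a minimal loading sketch with the Hugging Face `datasets` library is shown below; the repository ID and column names are placeholders, so check the dataset page for the actual identifiers and schema.

```python
from datasets import load_dataset

# Placeholder repo ID -- substitute the actual SciRecipe repository from the
# project's dataset page. The three splits follow the train/val/test split
# listed above; the exact column names are assumptions.
scirecipe = load_dataset("your-org/SciRecipe")  # hypothetical identifier

print(scirecipe)               # expected: a DatasetDict with train/validation/test splits
example = scirecipe["train"][0]
print(example.keys())          # e.g. task type, query, reference protocol
```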

Available Models

Download our pre-trained models from the Hugging Face Hub.

Thoth-mini

4B Parameters
Base Model: Qwen3-4B
GPU Memory: 8GB
Avg Performance: 51.10%
Recommended

Thoth

8B Parameters
Base Model: Qwen3-8B
GPU Memory: 17GB
Avg Performance: 52.10%
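
A minimal inference sketch with Hugging Face `transformers` is shown below; the repository ID is a placeholder and the prompt is only an example, so consult the model page for the actual identifiers and the recommended prompting format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ID -- substitute the actual Thoth / Thoth-mini repository
# from the project's Hugging Face page.
model_id = "your-org/Thoth-mini"  # hypothetical; ~8 GB of GPU memory per the table above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "Generate a step-by-step protocol for plasmid DNA extraction from E. coli."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```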

Citation

If you find this work useful, please cite our paper:

@article{sun2025unleashing,
  title={Unleashing Scientific Reasoning for Bio-experimental 
         Protocol Generation via Structured Component-based 
         Reward Mechanism},
  author={Sun, Haoran and Jiang, Yankai and Tang, Zhenyu and 
          Pan, Yaning and Gu, Shuang and Lin, Zekai and 
          Wang, Lilong and Lou, Wenjie and Liu, Lei and 
          Bai, Lei and others},
  journal={arXiv preprint arXiv:2510.15600},
  year={2025}
}