Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism. Thoth reaches an average score of 52.10, surpassing ChatGPT-4o by 3.69 points.
Thoth advances bio-experimental protocol generation through a new dataset, reasoning paradigm, evaluation mechanism, and training strategy.
Large-scale multi-task dataset covering 27 biological subfields with 12K+ structured protocols.
Novel reasoning paradigm that transforms open-ended queries into verifiable structured protocols.
Structured reward framework jointly measuring step granularity, order consistency, and semantic fidelity.
Protocol generation models with strong reasoning capabilities, achieving SOTA on multiple benchmarks.
Thoth employs a three-stage reasoning paradigm and fine-grained reward mechanism to ensure accuracy and executability of generated protocols.
<think> Think Stage: Decompose objectives, identify dependencies, justify steps
<key> Key Stage: Convert strategy to atomic, machine-readable JSON steps
<orc> Orchestrate Stage: Expand structured steps into fluent natural language
<note> Note Stage: Add critical safety information and warnings
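As a minimal sketch of how a response in this tagged layout could be split into its four components, consider the Python below. It rests on several assumptions not stated on this page: that each component is wrapped in a paired closing tag such as `</think>`, that `<key>` holds a JSON list, and that each step uses hypothetical `action` and `object` fields. The helper names `split_components` and `parse_key_steps` are illustrative only.

```python
import json
import re

# Tag names taken from the four stages listed above.
TAGS = ("think", "key", "orc", "note")


def split_components(response: str) -> dict:
    """Return the text inside each tag; raise ValueError if one is missing.

    Assumes paired tags like <think>...</think> (an assumption, not confirmed
    by the source).
    """
    parts = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{tag}> component")
        parts[tag] = match.group(1).strip()
    return parts


def parse_key_steps(key_text: str) -> list:
    """Parse the <key> payload, assumed here to be a JSON list of steps."""
    steps = json.loads(key_text)
    if not isinstance(steps, list):
        raise ValueError("<key> must hold a JSON list of steps")
    return steps


# Hypothetical response; the step schema is invented for illustration.
example = (
    "<think>Lysis must precede centrifugation.</think>"
    '<key>[{"action": "lyse", "object": "cells"},'
    ' {"action": "centrifuge", "object": "lysate"}]</key>'
    "<orc>1. Lyse the cells. 2. Centrifuge the lysate.</orc>"
    "<note>Keep samples on ice.</note>"
)

parts = split_components(example)
steps = parse_key_steps(parts["key"])
```

A parser of this kind would also serve as a simple format gate: a response that omits a component or carries malformed JSON raises an error before any scoring happens.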
Ensures output contains all four components with correct JSON structure
Verifies step-by-step correspondence between <key> and <orc>
Measures the gap d between generated and ground-truth step counts: f(d) = cos(π·d/(2M))
Evaluates order consistency (LCS/strict) and semantic alignment: r = r_order · r_semantic
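The step-count and order terms above can be illustrated with a short Python sketch. It is not the authors' implementation. The assumptions are: `d` is the absolute difference in step counts, `M` (here `max_gap`) is a maximum tolerated gap beyond which the reward is clamped to 0, the LCS variant is normalized by the reference length, the strict variant requires an exact sequence match, and the semantic score is supplied by some external scorer.

```python
import math


def step_scale_reward(n_generated: int, n_reference: int, max_gap: int) -> float:
    """f(d) = cos(pi * d / (2 * M)) with d = |n_generated - n_reference|.

    Clamping to 0 once d >= M is an assumption made for this sketch.
    """
    d = abs(n_generated - n_reference)
    if d >= max_gap:
        return 0.0
    return math.cos(math.pi * d / (2 * max_gap))


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_reward_lcs(generated, reference) -> float:
    """LCS-based order score, normalized by reference length (assumed)."""
    if not reference:
        return 0.0
    return lcs_length(generated, reference) / len(reference)


def order_reward_strict(generated, reference) -> float:
    """Strict order score: 1 only for an exact sequence match (assumed)."""
    return 1.0 if list(generated) == list(reference) else 0.0


def step_reward(r_order: float, r_semantic: float) -> float:
    """r = r_order * r_semantic, as stated above."""
    return r_order * r_semantic


# Hypothetical action sequences for illustration.
generated = ["lyse", "wash", "spin"]
reference = ["lyse", "spin", "wash"]
r_order = order_reward_lcs(generated, reference)
r = step_reward(r_order, r_semantic=0.9)  # 0.9 is a placeholder semantic score
```

Because the two factors are multiplied, a protocol with the right steps in the wrong order, or the right order with poorly matched content, scores low on both counts.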
Learn semantic structure from protocol text
Align with Sketch-and-Fill paradigm
Optimize with SCORE rewards
Thoth achieves SOTA performance on SciRecipe-Eval and multiple scientific benchmarks, significantly outperforming existing top models.
| Model | Semantic-A | Order-LCS | Order-S | Step-M | Avg |
|---|---|---|---|---|---|
| GPT-5 | 27.79 | 58.12 | 11.35 | 18.79 | 32.84 |
| ChatGPT-4o | 40.04 | 73.27 | 24.00 | 44.00 | 48.41 |
| Claude Opus 4.1 | 41.32 | 71.70 | 21.80 | 34.59 | 45.21 |
| DeepSeek-V3 | 41.72 | 73.97 | 21.44 | 41.71 | 48.16 |
| Thoth-mini | 44.28 | 74.68 | 25.33 | 52.67 | 51.10 |
| Thoth | 46.60 | 75.34 | 25.50 | 53.00 | 52.10 |
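As a quick check of the headline margin, the snippet below recomputes the gap between Thoth and the strongest non-Thoth model from the Avg column of the table above. The dictionary simply copies those values; the variable names are illustrative.

```python
# Avg column copied from the results table.
avg = {
    "GPT-5": 32.84,
    "ChatGPT-4o": 48.41,
    "Claude Opus 4.1": 45.21,
    "DeepSeek-V3": 48.16,
    "Thoth-mini": 51.10,
    "Thoth": 52.10,
}

# Strongest baseline among models that are not Thoth variants.
best_baseline = max((m for m in avg if not m.startswith("Thoth")), key=avg.get)
margin = round(avg["Thoth"] - avg[best_baseline], 2)
print(best_baseline, margin)
```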
A large-scale, multi-task dataset designed to improve and evaluate LLMs in experimental protocol understanding and generation.
If you find this work useful, please cite our paper:
@article{sun2025unleashing,
title={Unleashing Scientific Reasoning for Bio-experimental
Protocol Generation via Structured Component-based
Reward Mechanism},
author={Sun, Haoran and Jiang, Yankai and Tang, Zhenyu and
Pan, Yaning and Gu, Shuang and Lin, Zekai and
Wang, Lilong and Lou, Wenjie and Liu, Lei and
Bai, Lei and others},
journal={arXiv preprint arXiv:2510.15600},
year={2025}
}