组会 · 2026-05-20

Lab Talk · 2026-05-20

AI 时代下的
研究操作系统

A research operating system
for the AI era

从一个信号说起：Fields 奖得主已经在用 GPT-5.5 Pro 推数学猜想。我们这种做 AI 的实验室，应该怎么调整自己的工作方式。

Start with a signal: Fields medalists are already using GPT-5.5 Pro to advance real math conjectures. For a lab that works on AI, that should change how we work.

今天讲三件事：方法论 · 已经验证能跑的管线 · 对实验室组织形态的一些建议。

Three things today: the methodology · pipelines already running in practice · a few suggestions for how the lab could organize.

Lexa · X @DayShuai · GitHub @AlyciaBHZ · 走在前面，比解释自己更省力

Lexa · X @DayShuai · GitHub @AlyciaBHZ · it is cheaper to run ahead than to explain yourself

一个信号

One signal

Fields 奖得主用 AI 解决了真实的数学难题

Fields medalists are solving real math problems with AI

这不是 demo，不是 benchmark，是实际推进了某些猜想的进展。最顶级的头脑都已经把 AI 当工具，我们就更没有理由还停在“AI 会不会取代我”这一类话题上。

Not a demo, not a benchmark: real progress on real conjectures. If the strongest mathematicians already treat AI as a tool, we have no reason to stay stuck on "will AI replace me."

Fields 奖得主 · Timothy Gowers · 2026

Fields medalist · Timothy Gowers · 2026

GPT-5.5 Pro 不到两小时做出 PhD 级数学研究

GPT-5.5 Pro did PhD-level math research in under two hours

Gowers 在自己博客上写：模型在不到两小时内独立完成了一段博士级别的数学研究， 他自己的数学贡献是零。原话来自 The Decoder 等多家报道。

Gowers wrote on his own blog that the model produced a self-contained piece of PhD-level math research in under two hours, with zero mathematical contribution from him. Reported by The Decoder and other outlets.

Fields 奖得主 · Terence Tao · 2026-04

Fields medalist · Terence Tao · 2026-04

GPT-5.4 Pro 解决 Erdős #1196，Tao 24 小时内验证

GPT-5.4 Pro solved Erdős #1196, Tao verified it within 24 hours

一道悬了 90 年、没人想到方法的 Erdős 猜想，GPT-5.4 Pro 用一个 prompt 80 分钟拿下。 Tao 评论这是“对整数解剖学的有意义贡献，远超过这个具体问题本身”，随后把它扩展成一个新数学理论的种子。

A 90-year-old Erdős conjecture, with no known viable method; GPT-5.4 Pro solved it in 80 minutes from one prompt. Tao called it "a meaningful contribution to the anatomy of integers, well beyond the specific problem," then developed it into the seed of a new mathematical theory.

2026 上半年其它信号

Other signals from H1 2026

· 2026-01：GPT-5.2 自主解决 Erdős #397，Tao 验证。
· 2026-01：一周内三道 Erdős 问题被 AI 攻克，全部由 Tao 亲自验证。
· 2026 IPAM：Tao 公开说当下 AI 模型已经“ready for primetime”：在数学和理论物理里，AI 现在节省的时间多于浪费的时间。
· 2026-04：一位 23 岁的业余研究者用 ChatGPT 解决了一道 60 年的老问题，Tao 评论说之前的研究者从一开始就走错了路。

· 2026-01: GPT-5.2 solved Erdős #397 on its own, verified by Tao.
· 2026-01: three Erdős problems were solved by AI in one week, all verified by Tao himself.
· 2026 IPAM: Tao said publicly that current AI models are "ready for primetime": in math and theoretical physics, AI now saves more time than it wastes.
· 2026-04: a 23-year-old amateur used ChatGPT to solve a 60-year-old open problem; Tao said prior researchers had been on the wrong path from the start.

这意味着什么

What this means

A · AI 已经能介入到“前沿数学”这种最难、最抽象的脑力劳动。
B · 顶级 CEO 和科学家亲自在写代码，因为模型已经能帮他们完成大部分实现层的事情。
C · “会不会用 AI”已经是新一道分水岭，而不是加分项。

A · AI can now reach into frontier math, the hardest and most abstract intellectual work.
B · Top CEOs and scientists are writing code themselves, because models can now cover much of the implementation layer.
C · "Can you use AI well?" is now a dividing line, not a bonus.

对我们实验室的含义

What it means for our lab

→ 我们是一个在做 AI 的实验室，更应该走在世界的前沿。
→ 这是范式转移，不是赶时髦：科研流程会在 AI 时代被整轮洗牌，我们实验室的内涵就该长在新流程里。
→ 别人在用，我们更要把它用好、用得有方法、用得能复现。

→ We are an AI lab. We should be near the frontier, not behind it.
→ This is a paradigm shift, not a trend: the research workflow itself is being reorganized in the AI era, and the lab should be shaped by that new workflow.
→ Others are already using it. We should use it better: with method, and reproducibly.

我的方法论

My methodology

核心：不让 AI 自由发挥，让它执行有边界的任务

Core idea: don't let AI improvise — give it bounded tasks

最近我做下来最稳的一条结论： AI 不是不会犯错，是当你给它一个明确、结构化、可验证的任务，它就能做得比人快、比人稳。管线的全部价值，就是把“研究”这件事拆成它能稳定完成的形状。

The most stable conclusion I've reached: AI does make mistakes — but when you give it a clear, structured, verifiable task, it can work faster and more steadily than a human. The whole point of the pipeline is to break "research" into forms AI can complete reliably.

原则 1

Principle 1

能结构化的全部结构化

Structure everything that can be structured

把 registries / TaskSpec / schemas / gates 这些结构用起来，把研究流程拆成机器可读的对象。 AI 看的是结构，不是聊天历史。

registries / TaskSpec / schemas / gates — split the research process into machine-readable objects. AI reads structure, not chat history.

原则 2

Principle 2

能加 gate 的全部加 gate

Gate everything that can be gated

Lean build pass、axiom audit、claim-vs-evidence 校验、确定性 gatekeeper。没过 gate 的产出，不进主线。

Lean build pass, axiom audit, claim-vs-evidence checks, deterministic gatekeepers. Output that fails a gate does not reach main.
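A deterministic gatekeeper of this kind can be sketched as a plain text scan. The snippet below is only an illustration under stated assumptions: it works on Lean source text and flags `sorry` and `axiom` declarations, whereas a real axiom audit would more likely rely on Lean's own `#print axioms` command. The names `audit_lean_source` and `GateVerdict` are hypothetical, and block comments (`/- ... -/`) are not handled.

```python
import re
from dataclasses import dataclass

# Text-level patterns: an assumption for illustration, not the lab's real audit.
SORRY_RE = re.compile(r"\bsorry\b")
AXIOM_RE = re.compile(r"^\s*axiom\s+\S+", re.MULTILINE)
LINE_COMMENT_RE = re.compile(r"--.*$", re.MULTILINE)


@dataclass(frozen=True)
class GateVerdict:
    passed: bool
    reasons: tuple[str, ...]


def audit_lean_source(source: str) -> GateVerdict:
    """Block any Lean source that contains `sorry` or declares an `axiom`."""
    code = LINE_COMMENT_RE.sub("", source)  # ignore `--` line comments
    reasons = []
    if SORRY_RE.search(code):
        reasons.append("contains sorry")
    if AXIOM_RE.search(code):
        reasons.append("declares an axiom")
    # No AI involved: the same input always yields the same verdict.
    return GateVerdict(passed=not reasons, reasons=tuple(reasons))
```

Because the verdict is a pure function of the file contents, it can sit in front of the merge step and be rerun at any time with the same result.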

原则 3

Principle 3

幻觉是任务边界没设清楚

Hallucination = unclear task boundary

AI 出幻觉，几乎都是任务太宽。具体到一个 TaskSpec、允许动哪些文件、允许做哪些 claim，幻觉基本就会消失。

When AI hallucinates, the task boundary is almost always too broad. Pin it to one TaskSpec, the files it may touch, and the claims it may make — hallucination largely disappears.
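One way to picture such a boundary is a small record plus a check that runs on whatever the agent produced. This is a minimal sketch, not the actual TaskSpec schema: the field names, the glob patterns, and the paths used in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    allowed_files: tuple[str, ...]   # glob patterns the agent may edit (assumed format)
    allowed_claims: frozenset[str]   # claim ids the agent may assert (assumed format)


def boundary_violations(spec, touched_files, claims):
    """Return every touched file or asserted claim that falls outside the TaskSpec."""
    violations = []
    for path in touched_files:
        if not any(fnmatchcase(path, pattern) for pattern in spec.allowed_files):
            violations.append(f"file outside boundary: {path}")
    for claim in claims:
        if claim not in spec.allowed_claims:
            violations.append(f"claim outside boundary: {claim}")
    return violations
```

For example, with `TaskSpec("T-001", ("BEDC/Core/*.lean",), frozenset({"lemma_a"}))`, an output that edits `paper/intro.tex` or asserts `lemma_b` yields two violations, and the output can be rejected before any reviewer reads it.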

原则 4

Principle 4

agent 分层：监督 / 执行 / 评审

Layered agents: supervise / execute / review

Claude 负责监督 · Codex 负责执行 · ChatGPT Pro extended thinking 负责主推力 · Claude 再做一轮对抗性评审。

Claude as supervisor · Codex as executor · ChatGPT Pro extended thinking as the driving oracle · Claude runs one more pass as adversarial review.

权威背书 · OpenAI 后训练核心成员

Authority signal · OpenAI post-training core team

Jiayi Weng（@Trinkle23897）刚刚证明的是同一个方向

Jiayi Weng (@Trinkle23897) just demonstrated the same direction

让 Codex (GPT-5.4) 反复迭代 Atari 游戏策略代码—— 神经网络一次都没重训，但策略代码从 387 涨到 864（Breakout 满分）， MuJoCo Ant 跑到 6000+（深度 RL 级别），Atari57 逼近 PPO 基准。知识不压在参数里，而是写成可读、可改、可加测试锁住的代码—— 灾难性遗忘消失，人能锁定任何一步、审计任何一步。这正好是我管线一直在做的事：AI 写代码，人定边界。

Letting Codex (GPT-5.4) iteratively rewrite Atari game policy code — the neural net was never retrained, yet the policy code climbed from 387 to 864 (max on Breakout), cleared 6000+ on MuJoCo Ant (deep-RL level), and approached PPO baseline across the full Atari57 suite. Knowledge is not compressed into weights — it is written as code you can read, edit, and lock with tests — catastrophic forgetting disappears, and every step can be audited. This is exactly what my pipeline has been doing: AI writes code, humans set the boundaries.

"Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it's the next paradigm." — Jiayi Weng · 2026-05-08 · 3.1M views

展开看 BEDC 管线真实形态 · subprocess 链 + gate 评分 + 共识 supervisor

Expand: real BEDC pipeline · subprocess chain + scored gates + consensus supervisor

Claude /loop  ·  wakes up every 30 min   (the heartbeat)
│
└─►  Supervisor   always-on monitor   (script, not agent)
     watches every step  ·  catches failures  ·  restarts inner loops
     │
     └─►  task queue  ──►  structured payload  ──►  event-driven flow
          │
          ▼
    ┌─────────────────────────────────────────┐
    │  Fast attempt                           │   writer agent + reviewer agent
    │  N short cycles                         │
    └──────────────────┬──────────────────────┘
                       │
   gate-1  :  reviewer.verdict
        if   accept            →  write draft, mark done_via_fast  ───┐
        elif fix-up            →  back to cycle, count++              │
        else escalate          →  ▼ deep reasoning                    │
                                                                      │
    ┌─────────────────────────────────────────┐                       │
    │  Deep reasoning                         │   long-form model     │
    │  multi-turn, per-turn checkpoint        │                       │
    └──────────────────┬──────────────────────┘                       │
                       │                                              │
   gate-2  :  controller.progress_signal                                │
        if   positive               →  continue next turn             │
        elif 3 consecutive low      →  stop · stuck                   │
        elif wall-clock cap hit     →  stop · timeout                 │
                                                                      │
                       ▼ done                                         │
    ┌─────────────────────────────────────────┐                       │
    │  Auto-discovery                         │   scan transcript     │
    │  mine adjacent ideas                    │                       │
    └──────────────────┬──────────────────────┘                       │
                       │                                              │
   gate-3  :  fit_score ≥ X   AND   novelty ≥ Y                         │
        if   both pass         →  append idea to task queue           │
        else                   →  drop                                │
                                                                      │
                       ◄────  draft lands here for review  ───────────┘
                       ▼
   gate-4  :  deterministic check       (pure code, no AI)
        if   schema / invariants / format valid   →  continue
        else                                      →  BLOCK
                       ▼
   gate-5  :  AI hygiene checklist      (fixed list, same every time)
        if   all N checks pass     →  continue
        else                       →  BLOCK
                       ▼
   gate-6  :  real compile / build / test   (not a lint)
        if   pass                  →  merge to main
        elif first fail            →  retry once
        else (second fail)         →  BLOCK, never force-push
                       ▼
            Output  ·  auto-release on cadence  ·  final state recorded
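The stop rule at gate-2 in the diagram (continue on a positive signal, stop after three consecutive low signals, stop at the wall-clock cap) can be sketched as a small controller. This is an illustration only: `next_turn` is a hypothetical callable returning `(progress_signal, done)`, and the numeric threshold and cap are placeholder values, not the pipeline's real settings.

```python
import time


def run_deep_reasoning(next_turn, low_threshold=0.2, max_low_streak=3,
                       wall_clock_cap_s=3600.0, clock=time.monotonic):
    """Run turns until done, stuck (consecutive low signals), or timeout."""
    started = clock()
    low_streak = 0
    turns = 0
    while True:
        # The wall-clock cap is checked before each turn (a choice made in this sketch).
        if clock() - started >= wall_clock_cap_s:
            return {"status": "timeout", "turns": turns}
        signal, done = next_turn()
        turns += 1
        if done:
            return {"status": "done", "turns": turns}
        if signal >= low_threshold:
            low_streak = 0                      # positive signal: continue next turn
        else:
            low_streak += 1
            if low_streak >= max_low_streak:    # three low signals in a row: stuck
                return {"status": "stuck", "turns": turns}
```

Passing `clock` as a parameter keeps the controller testable without waiting on real time.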

核心一句话：研究里凡是能写成 while 循环 + if/else 分支的事，AI 都能稳定接管。写不成 while/if-else 的部分——品味、方向、关键决策——人来管。

Core idea in one line: anything in research that fits a while loop and a chain of if / else branches, AI can run reliably. What can't be expressed that way — taste, direction, key decisions — that's where humans sit.

while · 外层循环固定：从队列拿任务，跑一轮，拿到结果，入审，入主线，循环到队列空或墙钟到。所有 worker 都共用这一个外壳。
if · 每个 gate 都是显式分支：上图 6 个 gate 都是写死的 if / elif / else，没有“AI 自己看着办”。新规则就是加一个 elif，不靠改 prompt。
disk · 所有状态落盘：每一步的 verdict 都是文件。软重启就是从某个 verdict 重跑，不需要保活内存里的对话历史。
role · 角色写在脚本里：流程由脚本承载；agent 只填一个 bounded 槽位；reviewer 永远是另一个 agent，不是写 draft 的那个。
stop · 失败不强推：任何 gate 拦下都触发回滚或封禁。管线不靠“再试一下”撑产出。

while · One outer loop: pull a task from the queue, run a pass, land the result, review, merge. Repeat until the queue empties or the wall-clock cap fires. Every worker shares this shell.
if · Every gate is an explicit branch: each of the six gates above is a written-down if / elif / else. No "AI figures it out." A new rule means a new elif, never a prompt tweak.
disk · All state lives on disk: every step's verdict is a file. Soft-restart = pick up from a verdict. No need to keep chat history alive in memory.
role · Roles live in scripts: the flow is carried by the script; agents only fill bounded slots; the reviewer is always a different agent from the one who drafted.
stop · Never force-push: any gate failure triggers rollback or block. The pipeline never carries output forward on hope.
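The five points above can be condensed into one small loop. What follows is a sketch under assumptions, not the real BEDC code: the callables `write_draft`, `review`, `deterministic_check`, and `build` are hypothetical slots, only gate-1, gate-4, and gate-6 are modeled, and a task that exhausts its fix-up cycles is treated as escalated.

```python
import json
from collections import deque
from pathlib import Path


def run_pipeline(tasks, write_draft, review, deterministic_check, build,
                 verdict_dir, max_fixups=3):
    """One outer loop; every gate is an explicit if / elif / else.

    Hypothetical slots: write_draft(task, attempt) -> str,
    review(draft) -> "accept" | "fix-up" | "escalate",
    deterministic_check(draft) -> bool, build(draft) -> bool.
    """
    queue = deque(tasks)
    verdict_dir = Path(verdict_dir)
    verdict_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    while queue:                                   # the single outer loop
        task = queue.popleft()
        attempt, draft, state = 0, None, "escalated"
        while attempt < max_fixups:                # fast attempt, gate-1
            draft = write_draft(task, attempt)
            verdict = review(draft)                # reviewer is a separate slot
            if verdict == "accept":
                state = "drafted"
                break
            elif verdict == "fix-up":
                attempt += 1                       # back to cycle, count++
            else:
                break                              # escalate to deep reasoning
        if state == "drafted":
            if not deterministic_check(draft):     # gate-4: pure code, no AI
                state = "blocked"
            elif build(draft) or build(draft):     # gate-6: retry once
                state = "merged"
            else:
                state = "blocked"                  # second failure: block
        # All state lives on disk: one verdict file per task.
        (verdict_dir / f"{task}.json").write_text(
            json.dumps({"task": task, "state": state, "fixups": attempt}))
        results[task] = state
    return results
```

Adding a rule means adding another branch here, and a soft restart can resume from the verdict files instead of from chat history.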

想法变具体 · 入门

vague → concrete · onboarding

vision 模块：把宽泛的想法变成具体的事情

The vision module: turning vague ideas into concrete work

刚讲的管线是骨架。骨架里流的“想法”怎么变成具体的、能被管线消化的形状——这是 vision 模块管的事。用 AI 去找想法之间的内在关联，去判断什么时候它已经成熟到可以加进主线。想法本身是宽泛的，AI 的工作是把它和已有结构连起来，再决定它现在能不能落进去。

The pipeline above is the skeleton. How vague "ideas" become a shape the pipeline can digest — that's what the vision module handles. AI finds the internal links between ideas and judges when an idea is mature enough to enter main. Ideas are vague by nature; AI's job is to connect them to the existing structure, then decide whether they are ready to land.

输入

Input

一个宽泛的想法

A vague idea

一句感觉、一个方向、一个不一定能说清楚的直觉。

A feeling, a direction, an intuition you can't fully articulate.

中间

Middle

找内在关联

Find the internal links

AI 把它放进现有 repo 的上下文里，看它和已有定理 / 模块 / target 的接口在哪里。

AI places it inside the existing repo context and looks for interfaces with existing theorems / modules / targets.

判断

Judgment

什么时候可以加

When it's ready to land

不是有想法就加——是 AI 帮我们判断它现在是不是已经具体到可以落地，然后才进 claim packet → 形式化 → 入主线。

Not every idea gets in — AI helps judge whether it is concrete enough to land, and only then does it move into claim packet → formalization → main.

我们自己也是 vision 模块的用户。比如 NotebookLM 把我们的论文合成深度解读音频，听的过程里会冒出新的模糊想法，再闭环回 vision。这也是为什么下一页讲的 4 条 lane 里有一条专门做 NotebookLM 自动合成：既给外面看，也给我们自己听。

We use the vision module ourselves too: NotebookLM auto-synthesizes deep-dive audio from our own papers, new vague ideas surface while listening, and they loop back into vision. That is also why one of the four lanes on the next slide is dedicated to NotebookLM synthesis, for outside readers and for us.

vision → pipeline → vision · the loop

已经验证的管线

A pipeline already proven

newmath / automath：稳定产出的证据

newmath / automath: evidence of stable output

下面这些不是计划，是已经在 GitHub 上跑了几个月、有 commit、有 release、有论文产出、有自动宣发的事实。

The numbers below are not a plan — they have been live on GitHub for months: real commits, real releases, real papers, real automated distribution.

3,427+ 个形式化验证定理 · 0 个 axiom · 0 sorry · mathlib-free Lean 4 · 5×/周自动 release · 42 篇论文在管线（含合作） · 3 篇 P7 待投稿 · 16 个 AI agent 角色 · 4 条 lane 并行

3,427+ formally verified theorems · 0 axioms · 0 sorry · mathlib-free Lean 4 · auto-release 5×/week · 42 papers in pipeline (incl. collabs) · 3 P7 papers ready to submit · 16 AI agent roles · 4 parallel lanes

本地真实在跑的 4 条 lane

Four lanes actually running locally

lane 1 · BEDC 深推进

lane 1 · BEDC deep push

核心理论自己往前长

The core theory keeps advancing

面向 BEDC 这套主理论的本地推进。任务形态：“已有定理 → 下一步还能逼出什么”。每天往主线增量。上一页的 pipeline 就是这条 lane 的实际实现。

Local work on the BEDC core theory. Task shape: "given existing theorems, what is the next thing to extract?" Increments land on main daily. The pipeline on the previous slide is exactly this lane's real implementation.

ChatGPT extended thinking 负责 oracle · Codex 负责调度和 Lean 落地 · Claude 负责 supervisor 和对抗性评审

ChatGPT extended thinking = oracle · Codex = orchestrator + lands Lean · Claude = supervisor + adversarial review

lane 2 · 开放问题靶向

lane 2 · open target

靶向已知未解问题

Aim at a known open problem

选定一个外部开放问题作为靶子，整条 lane 围绕它推进、形式化和写作。它和 BEDC 这条 lane 同构——BEDC 是“内部前沿”，open target 是“外部靶心”。

Pick an external open problem as the target; the whole lane pushes / formalizes / writes around it. Same shape as the BEDC lane — BEDC is the "internal frontier," open target is an "external bullseye."

模型分工和 lane 1 一样。区别只在目标来源—— oracle 先读完整的开放问题上下文，提出攻击方案；Codex 执行每一步；Claude 决定回退或继续。

Same model split as lane 1. Only the target source differs — oracle ingests the open-problem context and proposes an attack plan, Codex executes each step, and Claude decides whether to roll back or continue.

lane 3 · dev-automation-integration

论文宣发自动化 · 从定理到 preprint

Paper dissemination automation · theorem → preprint

主线每过一批新定理，这条 lane 自动把它们拼成完整论文：引言 / related work / 实验段 / bib / figure 全部 AI 起草，按 paper-series 模板装订。 外部合作论文走同一管线接入——合作者贡献 claim，我们这边自动完成整合和宣发。

Each new batch of theorems on main gets auto-stitched by this lane into a full paper: intro / related work / experiments / bib / figures all AI-drafted, bound by the paper-series template. External collaborations plug into the same pipeline — collaborators contribute claims, we automate the stitching and dissemination.

输出 → arxiv-ready preprint，进 release ledger；目前 3 篇 P7 待投稿、42 篇在管线（含合作）。

Output → arxiv-ready preprints, into the release ledger; currently 3 P7 papers ready to submit, 42 in the pipeline (including collaborations).

lane 4 · omega-paper-series 自动宣发

lane 4 · omega-paper-series automated distribution

NotebookLM + 抖音 + 小红书一键合成

NotebookLM + Douyin + Xiaohongshu, one-click synthesis

preprint 一发布，这条 lane 自动开跑： NotebookLM 合成深度解读音频 → 切短视频投抖音 → 生成图文卡片投小红书。宣发不再是事后人工跑一遍——它是和论文产出同一管线里的下一道工序。

The moment a preprint lands, this lane fires: NotebookLM auto-synthesizes a deep-dive audio digest → clipped into short video for Douyin → text+image card auto-posted to Xiaohongshu. Distribution is no longer an afterthought — it is the next stage in the same pipeline.

附带价值：我们自己也听 NotebookLM 的深度解读——听自己写的论文，中间会冒出新的模糊想法，再闭环回 vision 模块。

Side benefit: we listen to our own NotebookLM digests — new vague ideas surface while listening, then loop back into the vision module.

怎么“平衡”三个模型——各自的强项做各自的事

How to balance the three models — each does what it is best at

推 · ChatGPT Pro extended thinking 作为推理引擎：目前长链推理深度最够，“下一步往哪推”这种创造性判断交给它。
做 · Codex 作为执行层：最擅长在边界明确的代码 / Lean / LaTeX 改写任务里稳定产出，不要让它做开放性推理。
控 · Claude 作为监督和最终评审：最擅长读上下文、判断 claim 有没有越界，最适合卡边界。
分 · 关键不是“用哪个模型最强”，而是让每个模型只做它最擅长的那一步。lane 3-4 同一逻辑：NotebookLM 做合成、剪辑模型做切片、Claude 把关最终输出，不让任何一个越界。

push · ChatGPT Pro extended thinking as oracle: it currently has the strongest long-chain reasoning depth, so give it the creative judgment of "where do we push next?"
do · Codex as the execution layer: best at producing reliably inside bounded code / Lean / LaTeX rewriting tasks. Don't let it do open-ended reasoning.
hold · Claude as supervisor + final review: best at reading context and judging whether a claim crossed the line, so well suited to guarding boundaries.
split · The point is not "which model is strongest"; it is letting each model do only the step it is best at. Same logic on lanes 3-4: NotebookLM synthesizes, a clipping model trims, Claude gates the final output, and no model crosses its boundary.

展开看：一个 release 周期里 4 条 lane 的实际节奏

Expand: the actual rhythm of all four lanes inside one release cycle

06:00 · 每日 release 自动触发：Lean 全量构建 + axiom audit + 覆盖率统计，产物入 release ledger。
日间 · supervisor 看昨天的 release，给 lane 1（BEDC）和 lane 2（open target）各起一个目标，生成 claim packet。
日间 · oracle 推数学，Codex 落 Lean / LaTeX，gate 卡边界，三个模型在同一条 lane 上交接。
傍晚 · 过 gate 的进入对抗性评审，Claude 走最后一遍。
夜间 · lane 3（dev-automation-integration）从主线抽取新通过的定理，自动重写章节、更新 bib、补 figure，上传 preprint。
夜间 · lane 4（omega-paper-series）接力：NotebookLM 合成音频 → 切短视频投抖音 → 图文卡片投小红书。
第二天 · 我复核一遍，否掉不对的 claim，让 supervisor 重新分发。

06:00 · Daily release fires automatically: full Lean build + axiom audit + coverage stats, artifacts land in the release ledger.
day · The supervisor reads yesterday's release, picks one target each for lane 1 (BEDC) and lane 2 (open target), and generates the claim packet.
day · The oracle pushes the math, Codex lands Lean / LaTeX, and gates hold the boundary; three models hand off inside the same lane.
evening · What passes the gate goes into adversarial review. Claude does the final pass.
night · lane 3 (dev-automation-integration) pulls newly passed theorems off main, auto-rewrites sections, updates the bib, adds figures, uploads the preprint.
night · lane 4 (omega-paper-series) takes over: NotebookLM synthesizes audio → clipped into Douyin shorts → image+text card auto-posted to Xiaohongshu.
next day · I review, reject wrong claims, and ask the supervisor to redistribute the work.

同一套方法论换一个域

Same methodology, different domain

PhD 项目：方法论可迁移的实物证据

PhD project: evidence the methodology transfers

我把上面验证过的这套结构搬到了我自己的博士项目上。最近也开始重新迭代、有产出了。重点不是项目内容本身，而是——同一套研究 OS，换一个完全不同的域，照样跑得起来。

I moved the structure validated above onto my own PhD project. It is iterating again and producing again. The point is not the project content itself — it's that the same research OS, in a completely different domain, still runs.

结构层面直接复用的部分

What gets reused at the structural level

骨 · 同样的 supervisor + subprocess + event bus 结构。
任 · 同样的 TaskSpec / gate / claim packet 协议。
证 · 同样的“过 gate 才入主线”的产出原则。
迁 · 从一个域换到另一个域，控制面几乎不用动，只换 worker 里的领域逻辑。

bone · Same supervisor + subprocess + event bus structure.
task · Same TaskSpec / gate / claim packet protocol.
rule · Same output principle: nothing reaches main unless it passes the gate.
port · Swap the domain and the control plane barely moves; only the domain logic inside the workers changes.

多机并行 loop

Multi-machine parallel loop

节 · 多个本地节点 + 一个 server 节点，分工注册在 node ledger 里。
心 · 每个节点跑心跳脚本，自取任务、自动产出。
独 · 每个任务对应一个独立工作单元，互不污染。
收 · 失败 / 心跳全部进 events ledger，过 gate 的才进入合流。

node · Several local nodes + one server node, roles registered in the node ledger.
beat · Every node runs a heartbeat script, pulls its own tasks, and produces on its own.
iso · Each task gets an isolated work unit, with no cross-contamination.
log · Failures and heartbeats all go to the events ledger; only what passes the gate flows into the merge.
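The heartbeat and events-ledger idea can be sketched with an append-only JSON Lines file. This is a hedged illustration: the file format, the field names (`ts`, `node`, `kind`, `payload`), and the staleness rule are assumptions, not the actual ledger layout.

```python
import json
import time
from pathlib import Path


def append_event(ledger_path, node, kind, payload=None, now=time.time):
    """Append one event as a JSON line; the ledger is append-only."""
    event = {"ts": now(), "node": node, "kind": kind, "payload": payload or {}}
    with Path(ledger_path).open("a", encoding="utf-8") as ledger:
        ledger.write(json.dumps(event) + "\n")
    return event


def stale_nodes(ledger_path, max_silence_s, now=time.time):
    """Return nodes whose latest heartbeat is older than max_silence_s."""
    last_beat = {}
    with Path(ledger_path).open(encoding="utf-8") as ledger:
        for line in ledger:
            event = json.loads(line)
            if event["kind"] == "heartbeat":
                last_beat[event["node"]] = event["ts"]
    current = now()
    return sorted(node for node, ts in last_beat.items()
                  if current - ts > max_silence_s)
```

A supervisor script could call `stale_nodes` on each wake-up and restart any node that has gone silent, with the restart itself logged as another event.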

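[图：GitHub branches 截图 · 多节点 daemon loop 并行运行 · 满屏的 worker 分支都是不同 lane 在 24×7 推进]

[Figure: GitHub branches screenshot · multi-node daemon loops running in parallel · the screen full of worker branches is different lanes advancing 24×7]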

PhD 这个项目最近重新有产出，但故事不是“AI 帮我写了代码”，而是：同一套研究 OS，换一个域，照样跑得起来。这是我想呈现的“方法论可复用”的部分。

The PhD project is producing again, but the story is not "AI wrote my code." It is that the same research OS, dropped into a different domain, still runs. This is the "methodology is reusable" part I want to show.

基础设施

Infrastructure

GitHub 是天生为 AI 准备的

GitHub was already shaped for AI collaboration

AI 会犯错，这是事实。但 GitHub 的所有机制——commit、branch、rollback、PR、actions—— 全部都是为“会犯错的执行者 + 需要审计的协作”设计的。我们实验室即使保持半开源传统，private repo 也已经足够吃满 AI 协作的好处。

AI makes mistakes. That is a given. But every GitHub mechanism — commit, branch, rollback, PR, actions — was designed for "fallible executors + auditable collaboration." Even if our lab keeps its half-open tradition, private repos are enough to capture the benefits of AI collaboration.

commit / rollback

每一步都可回滚

Every step can be rolled back

AI 改一点我就 push 一次。出问题就用 git reset 回到任意一个安全点，不再有“AI 把我代码搞乱了”这种事。

AI changes a little; I push once. If something breaks, git reset to any safe point. "AI broke my code" stops being a real failure mode.

branch-per-task

任务 = 分支

Task = branch

每个 task 都在一个独立分支跑，互不污染。过 gate 才合并，不过就丢掉。

Every task runs on its own branch, with no cross-contamination. If it passes the gate, it merges; if not, it gets dropped.
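The task = branch rule maps onto a handful of standard git commands. The sketch below only builds the command lists and does not run them; the `task/<id>` branch naming scheme and the function names are assumptions made for illustration.

```python
def branch_name(task_id):
    """Hypothetical naming scheme: one branch per task id."""
    return f"task/{task_id}"


def plan_task_branch(task_id, base="main"):
    """Commands that start a task on its own branch, created from `base`."""
    return [["git", "switch", "-c", branch_name(task_id), base]]


def plan_after_gate(task_id, gate_passed, base="main"):
    """Merge the task branch if the gate passed; otherwise drop it."""
    branch = branch_name(task_id)
    if gate_passed:
        return [["git", "switch", base],
                ["git", "merge", "--no-ff", branch]]
    else:
        return [["git", "switch", base],
                ["git", "branch", "-D", branch]]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` by the worker script; keeping planning separate from execution makes the branch policy easy to test and audit.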

PR + review

AI 对 AI 评审

AI reviews AI

worker 出 PR，另一个 reviewer agent 负责评审。我只在最后卡一环。审计链全留痕。

The worker opens a PR; another reviewer agent audits it. I only hold the last gate. The full audit trail is on the record.

actions / CI

gate 自动化

Gates run themselves

daily-build / pr-gate 跑 Lean、跑 axiom audit、跑论文 sync。失败的不放进主线。

daily-build / pr-gate run Lean, run axiom audit, run paper sync. Failures don't reach main.

contributions

所有贡献都可视化

Every contribution is visible

每个人改了什么、什么时候改的、有没有过 review——全部留有轨迹。协作不再依赖人情或口头汇报。

Who changed what, when, whether it passed review — all on the record. Collaboration stops depending on social capital and verbal updates.

private repo OK

不开源也能适合 AI 协作

You can be AI-friendly without going open-source

知识产权不冲突。AI 协作的全部好处，private 一样可以拿到。

No IP conflict. The benefits of AI collaboration are still available in private repos.

这件事对新 intern 极其友好

This is unusually friendly to new interns

问 · 新 intern 进来，最快上手的方式不是读 50 篇论文，是直接问 AI：“这个实验室现在在做什么”。
说 · 他们有什么自己的想法，可以丢给 AI 让 AI 帮他们结构化、关联到现有项目。
参 · 更进一步：他们自己的 AI 可以直接参与到我们的 repo 里（PR + review）。
学 · 他们学到的不只是“做研究”，而是“在 AI 协作环境里做研究”。这才是未来 5 年的核心能力。

ask · For a new intern, the fastest onramp is not reading 50 papers. It is asking AI: "what is this lab actually working on right now?"
tell · If they have an idea of their own, they can hand it to AI and ask it to structure the idea and link it to existing projects.
join · One step further: their own AI can participate directly in our repos (PR + review).
learn · What they pick up isn't just "how to do research"; it's "how to do research inside an AI-collaboration environment." That's the core skill of the next 5 years.

我现在的实际工作形态

How my work actually looks now

4 台电脑、常开、loop 不间断、自动汇合

4 machines, always on, continuous loops, automatic merge flow

不是同时控制 4 个键盘，是 4 台机器各跑一个 daemon loop，各自推进自己的 lane，靠 git 实时 commit / handoff 通信。所有 loop 实时同步到 autodev 这一层（双向 sync），过审核才 merge 回主线。我每天真正投入注意力的，只有最有判断价值的环节。

Not me sitting at 4 keyboards. Each of 4 machines runs a daemon loop, pushes its own lane, and the machines talk to each other through live git commits / handoffs. Every loop syncs in real time into the autodev layer (both ways); only what passes review merges back to main. Each day, my attention goes only to the steps where judgment matters.

daemon · 每台机器跑一个 daemon loop：常驻进程，事件驱动地推进任务，不需要人启动。
独立 · 每台机器在自己的 loop 里推进自己的 lane，互不干涉。
通信 · 各机器之间通过 git 实时 commit / handoff 通信：一个 lane 卡住的时候另一个 lane 直接接手，不冲突。
同步 · autodev 是实时同步层，不只是中转。所有 worker 分支双向 sync：pull 最新 main + push 所有 worker commit，4 台机器始终看到一致状态。
审 · autodev 上跑 CI、跑自动 review，再加一道人 / adversarial 审核，通过了才 merge 回 main。
回 · 没通过的留在 autodev 继续修；我的判断回写成下一轮 prompt 喂给 supervisor。

daemon · Each machine runs one daemon loop: long-lived, event-driven, no human start button required.
iso · Every machine pushes its own lane inside its own loop, with no interference.
talk · Machines talk to each other through live git commits / handoffs; when one lane gets stuck, another can pick it up without collision.
sync · autodev is a real-time sync layer, not just staging. Every worker branch syncs both ways (pull latest main + push all worker commits), so all 4 machines see a consistent state.
review · autodev runs CI and auto-review, plus a human / adversarial review pass; only then does work merge back to main.
loop · What does not pass stays in autodev for fixes; my judgment is written back as the next prompt for the supervisor.

超过自身经验的经验

Experience beyond your own experience

把“office hour”蒸馏成 AI 可调用的资源

Distilling "office hours" into an AI-callable resource

先讲一下 gstack——Y Combinator 现任 CEO Garry Tan 自己写的开源 Claude Code skill 包

First, gstack — the open-source Claude Code skill pack written by Garry Tan, current CEO of Y Combinator

YC（Y Combinator）是硅谷孵化了 Airbnb / Stripe / OpenAI 等几千家公司的那家加速器。 Garry Tan 现在是它的 President & CEO，自己又是 Posterous 的 co-founder、写了 YC 内部 Bookface 的第一版。他在 2026 年初亲自下场写代码，把自己同时管理多个项目的方法封装成了 gstack——三周冲到 70k+ stars。

YC (Y Combinator) is the Silicon Valley accelerator that produced Airbnb / Stripe / OpenAI and thousands of others. Garry Tan is its current President & CEO, also a Posterous co-founder, and wrote the first version of YC's internal Bookface. In early 2026, he sat down and wrote the code himself, packaging how he manages multiple projects into gstack — 70k+ stars in three weeks.

它最被推崇的就是 /office-hours：把 16 个 YC partner 给 founder 做 office hour 时的判断风格蒸馏成 6 个逼问式问题，在你写任何代码之前先逼你把事情想清楚。 Garry 自己的原话是：你在这里得到的大概是真去 YC 做 office hour 价值的 10%，但这 10% 已经能让很多 founder 直接放弃错的想法。

The piece people praise most is /office-hours: it distills how 16 YC partners conduct office hours into 6 forcing questions that make you think clearly before writing any code. Garry's own line: what you get here may be 10% of the value of an actual YC office hour, but that 10% is already enough to make a lot of founders drop a wrong idea.

它为什么好用：在你让 AI 干活之前，先帮你把事情理解到底。这正好踩在 AI 协作最容易出问题的那一步上—— 大多数人会跳过理解、直接让 AI 跑，然后一边跑一边发现“哦不是这样”。 office-hours 把“理解”这一步前置了。

Why it works: it helps you understand the thing before you let AI act on it. That hits exactly the step where AI collaboration most often goes wrong — most people skip understanding, kick off AI, and only mid-run realize "oh that's not what I meant." office-hours moves "understanding" to the front.

我现在所有项目都会先用它过一遍

I now run every project through it first

只有自己（和 AI）先把整个事情理解清楚，自动化管线才拆得出来，才有用。顺序不能反——没有理解就上自动化，只是在快速放大自己的盲点。 office-hours 这种“理解优先”的工具，是研究 OS 的入口环节，不是边角料。

Only once you (and AI) actually understand the thing can the automation pipeline be decomposed cleanly and become useful. The order can't flip — automation without understanding just amplifies your blind spots faster. "Understanding first" tools like office-hours are the entry point of the research OS, not a side feature.

在这个基础上我自己做了一版“私人 office-hour”

On top of that I built my own "private office hour"

做了什么

What I built

爬 · 把老师的论文、公开 talk、给我和别人的 comment 全部爬下来。
蒸 · 交给 AI 蒸馏：他喜欢的方法、判断风格、踩过的坑是什么形状。
合 · 整合我跟不少前辈交流过的 office hour 笔记。
用 · 结果就是一个本地版的“私人 office hour”，我做模型的时候随时能问。

crawl · Pulled my advisor's papers, public talks, and the comments he has written to me and to others.
distill · Let AI distill it: the methods he prefers, his judgment style, the shape of the traps he has seen.
mix · Merged in office-hour notes from conversations with several senior researchers.
use · Result: a local "private office hour" I can summon any time I'm designing a model.

真实价值

Real value

→ 我设计一个模型时，AI 会主动告诉我“如果是他，会怎么看这个 trade-off”。
→ 它能挑出“你这个想法里他会立刻反驳的盲点”。
→ 把多年导师指导的密度，压缩进一个可以随时召唤的对话框。
→ 不是替代真正的 office hour，是让真正的 office hour 变得更有质量。

→ When I'm designing a model, AI proactively tells me "here's how he'd look at this trade-off."
→ It can flag "the blind spot in your idea he'd push back on immediately."
→ Years of advisor guidance, compressed into a chat box I can call at any time.
→ Not a replacement for real office hours; it makes the real office hour higher quality.

研究里有很多“不可言说的言说”：品味、判断、踩过的坑。很多人在做世界模型、做具身智能，本质都是想捕获这种东西。在科研协作这个域里，AI 已经可以帮我们做这一步了。这是“超过自身经验的经验”。

Research contains a lot that is hard to state directly: taste, judgment, the shape of past mistakes. Many people building world models or embodied AI are, in effect, trying to capture this kind of thing. Inside research collaboration, AI can already do this step for us. This is "experience beyond your own experience."

这件事会泛化

This generalizes

你不需要第三方 app——你可以自己搭

You don't need third-party apps — you can build your own

office-hour 只是一个例子。规律是：你日常用的任何“工具型 app”，现在都可以让 agent 在本地帮你重做一份——更贴合你自己的 workflow、更可控、更便宜。

office-hours is just one example. The pattern is that any "utility app" you use day to day can now be rebuilt locally by an agent — better fit to your workflow, more controllable, cheaper.

现在你不需要第三方软件或平台了 —— 任何 app 能帮你做的事，你都可以用 agent 在本地快速搭一套。有 API 的，直接调用；没有 API 的，coding agent 用 Chrome / Playwright 去访问、截图，把你要的东西拿回来。我自己日常就这样在用 arxiv 的函数和其它工具。

You don't need third-party software or platforms anymore — whatever an app does for you, you can stand up a local version with your agents. If it has an API, the agent calls it directly; if not, a coding agent drives Chrome / Playwright to visit and screenshot whatever you need. I run my daily arxiv tooling and many others this way.

API · 有 API：agent 直接调用。arXiv / GitHub / OpenReview / Vercel / ...
无 · 没 API（例如 Google Scholar，没有官方公开 API）：coding agent + Chrome / Playwright，去访问、截图、解析、回写。
果 · 结果：你的工具栈每天都在自己长。不是订阅 SaaS，是自己长。

API · Has API: the agent calls it directly. arXiv / GitHub / OpenReview / Vercel / ...
no-API · No API (for example Google Scholar, which has no official public API): coding agent + Chrome / Playwright to visit, screenshot, parse, write back.
grow · Result: your toolchain grows on its own every day. It is not a SaaS subscription, but something you grow yourself.
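As an illustration of the "has API" case, the sketch below builds a query URL for the public arXiv API and parses the Atom feed it returns, using only the Python standard library. The helper names are hypothetical, and this is not the author's actual arXiv tooling; only the endpoint and its `search_query`, `start`, and `max_results` parameters come from the arXiv API itself.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"   # Atom XML namespace used by the feed


def build_query_url(search_query, max_results=5):
    """Build a query URL, e.g. for search_query='all:lean'."""
    params = {"search_query": search_query, "start": 0, "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Extract (id, title) pairs from an Atom feed returned by the API."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.iter(ATOM + "entry"):
        paper_id = entry.findtext(ATOM + "id", default="").strip()
        # Titles can wrap across lines in the feed; collapse the whitespace.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        entries.append((paper_id, title))
    return entries
```

An agent would fetch the feed with something like `urllib.request.urlopen(build_query_url("all:lean")).read()` and pass the bytes to `parse_feed`. The no-API path (browser automation) is deliberately left out of this sketch.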

“工具靠订阅”是上一个时代的逻辑。在 agent 时代，你的工具栈是自己长出来的，长得最贴合你自己。这是 OPC（One-Person-Company）的硬基础。

"Subscribe to your tools" was the logic of the last era. In the agent era, your toolchain grows from your own work, shaped around you. This is the hard foundation of OPC (One-Person-Company).

可视化把控

Visual control

所有研究路线都可以做成可视化 DAG

Every research roadmap can become a visual DAG

每个项目都有一个 roadmap，每个节点有明确的依赖关系。 DAG 让“我们到了哪一步”变成肉眼可见的事情——人把控方向，让 AI 不断把节点推到下一阶段。下面这张是 BEDC 项目现在真实在跑的 project map。

Every project has a roadmap, every node has explicit dependencies. A DAG makes "where are we now" visible — humans steer direction, and AI keeps pushing nodes to the next stage. The map below is the BEDC project's actual live project map.

[Figure: BEDC Project Map · 15,829 theorems · 1,357 regions · 7,745 edges]

[Figure labels: L0–3: 83 · L4–8: 114 · L9–13: 194 · L14–17: 186 · L18–21: 129 · L22–25: 196]

每个节点点开都有：constructive story（这个对象是怎么从底层一步步搭出来的）、 Lean 验证状态、被依赖的上下游、当前的 stub / checked 状态。这张图本身就是整个项目的“共同视野”——谁在哪一格、卡在哪条边、下一步往哪推，看一眼就清楚。

Click any node and you get: the constructive story (how this object is built up from the bottom step by step), Lean verification status, upstream / downstream dependencies, and current stub / checked state. The map itself is the project's shared field of view — who is on which cell, which edge is stuck, where to push next. One look is enough.
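The "where to push next" reading of such a map can be sketched as a query over a dependency graph: a node is ready when it is still a stub and everything it depends on is already checked. This is a minimal sketch using Python's standard `graphlib`; the node names and the two-state `stub` / `checked` status dictionary are assumptions, not the project map's real data model.

```python
from graphlib import TopologicalSorter


def ready_nodes(dependencies, status):
    """Nodes still at `stub` whose dependencies are all `checked`.

    `dependencies` maps each node to the set of nodes it depends on.
    """
    # static_order() raises graphlib.CycleError if the map is not a DAG.
    order = TopologicalSorter(dependencies).static_order()
    return [node for node in order
            if status.get(node) == "stub"
            and all(status.get(dep) == "checked"
                    for dep in dependencies.get(node, ()))]
```

With `lemma_a` checked and `lemma_b`, `thm_c` (which needs both lemmas), and `thm_d` still stubs, the ready set is `lemma_b` and `thm_d`: those are the nodes a worker could pick up next, while `thm_c` waits on `lemma_b`.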

实验室协作 · 一个想法

A thought on lab coordination

我建议实验室建立一个 org，统一协作面

I suggest the lab create a GitHub org — one shared working surface

就算保持半开源传统，把实验室的项目集中到一个 GitHub organization 下（可以全 private），会让 AI 协作的所有好处加成到实验室级别。我在自己的一个 side project 里已经跑通了这个模式，可以作为参考。

Even if we keep the half-open tradition, pulling the lab's projects under one GitHub organization (fully private is fine) would lift the benefits of AI collaboration to lab scale. I have already run this pattern in a side project of mine, and it can serve as a reference.

side project 上的实践模板

The template, proven on a side project

组 · 把项目下的几个仓库放在同一个 org 下，统一权限和上下文。
命 · 所有人每天进来只输一个 /daily 命令。
果 · AI 自动告诉他/她：项目今天进展、他负责的部分、可以做什么。
本 · 哪怕协作者不写代码、不熟 AI，也能秒进入状态。AI 把所有 repo 当上下文。

group · Put the project's several repos under one org, with unified permissions and context.
cmd · Everyone comes in each day and types a single /daily.
out · AI tells them: today's project progress, the part they own, and what they can do next.
low · Even a collaborator who does not code or know AI can get into the loop instantly. AI treats every repo as context.
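
To make the shape of a /daily answer concrete, here is a minimal Python sketch of the three things it reports. It is an assumption-heavy illustration, not the command I actually run: the `Commit` and `Task` records, the `daily_brief` function, and the sample names are all hypothetical, and in practice an agent would read the repos itself rather than receive prepared lists.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Commit:
    """Hypothetical record of one commit in one repo of the org."""
    repo: str
    author: str
    day: date
    summary: str


@dataclass(frozen=True)
class Task:
    """Hypothetical record of one open or finished task."""
    repo: str
    owner: str
    title: str
    done: bool = False


def daily_brief(user: str, today: date,
                commits: list[Commit], tasks: list[Task]) -> dict[str, list[str]]:
    """Collect today's progress, the repos the user owns work in, and next steps."""
    progress = [f"{c.repo}: {c.summary} ({c.author})"
                for c in commits if c.day == today]
    owned = sorted({t.repo for t in tasks if t.owner == user})
    next_up = [f"{t.repo}: {t.title}"
               for t in tasks if t.owner == user and not t.done]
    return {"progress": progress, "owned": owned, "next": next_up}


# Made-up sample data, for illustration only.
commits = [
    Commit("proofs", "ana", date(2026, 5, 20), "close lemma 3"),
    Commit("paper", "bo", date(2026, 5, 19), "fix figure 1"),
]
tasks = [
    Task("paper", "bo", "draft section 2"),
    Task("paper", "bo", "fix figure 1", done=True),
    Task("proofs", "ana", "formalize lemma 4"),
]

if __name__ == "__main__":
    for key, lines in daily_brief("bo", date(2026, 5, 20), commits, tasks).items():
        print(key, "->", lines)
```

The point of the sketch is that the brief depends only on what is already recorded across the org's repos, which is why a collaborator who does not code can still use it.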

放大到实验室是什么样

What it looks like at lab scale

入 · 每个新 intern 进来，第一件事就是 /daily。
联 · AI 自己能看到实验室两个 repo 之间的关联，自动牵线。
想 · 每个人的想法 + AI 的想法 → 都可以无限多。AI 来挑能落地的、去试。
迹 · 每个人的贡献、每个想法的来源，GitHub 全部留有时间线。

in · Every new intern starts with /daily on day one.
link · AI can spot connections between any two lab repos on its own and link them automatically.
idea · Everyone's ideas + AI's ideas → effectively unlimited. AI selects the ones that can land and tries them.
trace · Every contribution and every idea's origin: GitHub keeps the full timeline.

实验室不一定要全开源，但建议有一个 org——即使是 private 的，也能让我们整个实验室的产能上一个数量级。

The lab does not have to go fully open-source, but I suggest creating an org: even a fully private one could raise the whole lab's output by an order of magnitude.

资源配置

Allocating resources

再招一个 postdoc，还是把现有人全部武装到顶？

Hire another postdoc, or fully equip everyone we already have?

重点不是 AI 取代谁。重点是：同样一份预算，能让现有的每个人产能上一个台阶，还是只多一个人、其他人不动。

The question isn't who AI replaces. It's: for the same budget, do we lift the output of every existing person, or do we add one more person and leave everyone else unchanged?

线性

Linear

一个 postdoc

One postdoc

×1 · 多一个人，多一份产能。
磨 · 入组 / 适配 / visa / 编制 / 沟通成本。
不 · 不会让现有的人也变快。

×1 · One more person, one more unit of output.
cost · Onboarding / adaptation / visa / headcount / communication overhead.
no · Doesn't make any existing person faster.

乘数

Multiplier

同等预算 · 把所有人都拉到顶配 AI

Same budget · top-tier AI stack for everyone

栈 · Claude Max 20x + ChatGPT Pro 20x，每人一套。
×N · 实验室现有的每个人产能上一个台阶。
即 · 当天能用，没有入组磨合，没有 visa，没有编制。

stack · Claude Max 20x + ChatGPT Pro 20x, one full set per person.
×N · Every existing person in the lab moves up a level of output.
now · Usable the same day: no onboarding, no visa, no headcount.

不是说不要 postdoc。是说在 2026 年， “补一个 AI 协作基础设施”在边际上比“再加一个人”杠杆更高—— 而且两件事不互斥，一个 postdoc 的预算，也足够同时把现有所有人的工具栈拉到最强。

This is not an argument against postdocs. The point is that in 2026, "filling in the AI-collaboration infrastructure" has higher marginal leverage than "adding one more person" — and the two are not mutually exclusive: one postdoc's budget is enough to also max out everyone's toolchain.
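
The budget comparison can be written as simple arithmetic. The Python sketch below is only a template: the plan prices and the postdoc cost are placeholder figures I made up for illustration, not quotes, and should be replaced with current subscription pricing and the lab's real fully loaded postdoc cost.

```python
def annual_seat_cost(monthly_plans: list[float]) -> float:
    """Yearly cost of one person's full tool stack."""
    return 12 * sum(monthly_plans)


def max_seats(budget: float, monthly_plans: list[float]) -> int:
    """How many people a yearly budget can fully equip."""
    return int(budget // annual_seat_cost(monthly_plans))


# Placeholder inputs (assumptions, not real figures).
PLAN_PRICES_MONTHLY = [200.0, 200.0]   # two top-tier subscriptions per person
POSTDOC_ANNUAL_COST = 80_000.0         # one postdoc, fully loaded, per year

if __name__ == "__main__":
    seat = annual_seat_cost(PLAN_PRICES_MONTHLY)
    seats = max_seats(POSTDOC_ANNUAL_COST, PLAN_PRICES_MONTHLY)
    print(f"one seat per year: {seat:.0f}")
    print(f"seats covered by one postdoc budget: {seats}")
```

With these placeholder inputs one seat costs 4,800 per year and the budget covers 16 seats, so the claim holds for a lab up to that size. With real numbers the break-even headcount will differ, which is the part worth checking before deciding.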

顺便提一个

One more, on the side

组会的形式本身也可以优化

The lab meeting format itself can be improved

坦白说，纯讲话的组会信息密度是低的。如果我们真的能让 AI 深入到每个人的代码库里，很多“现在做到哪了”这件事其实可以直接看，不用每周再口头复述一遍。

Honestly, pure-talk lab meetings are low-density. If AI can really reach into everyone's repos, a lot of "where are we now" can be inspected directly — no need to recount it verbally every week.

现在

Now

纯口头组会

Pure verbal lab meeting

每个人讲 10–15 分钟，信息密度低，进展可视化弱，互相不了解的部分只能问。

Everyone talks for 10–15 minutes: low information density, weak progress visualization, and gaps in mutual understanding that can only be filled by asking.

可以并行的方式

A parallel option

异步 + DAG + AI 总结

Async + DAG + AI summary

每个人把进度 push 到自己的 repo；AI 每周自动汇总一个 lab-wide digest；组会本身只讨论卡点和方向。

Everyone pushes progress to their own repo; AI auto-aggregates a lab-wide digest each week; the meeting itself only discusses blockers and direction.

底层

Underneath

信任的来源是留痕

Trust comes from the record

GitHub 记录一切，谁在哪个时间做了什么、做对了什么——不再有“靠口头汇报”的不安。

GitHub logs everything — who did what, when, and what was correct — so we no longer rely on verbal updates for trust.

这一点我不强推。但如果我们建了 org，这个变化几乎是顺手就完成的。

I won't push this hard. But if we set up the org, this change almost comes for free.

收尾

Closing

这套 OS 已经能跑 · 剩下的是把它打开

The OS already runs · what remains is turning it on for the lab

我讲这一整套，核心不是想展示我做了多少。是——这套 OS 已经验证能稳定产出，剩下的就是让它在实验室里跑起来，让大家一起更快产出、更好产出。

I am pitching this whole system not to show how much I have done. The point is that this OS has been validated and produces reliably. What remains is to get it running inside the lab, so everyone can produce faster and better together.

一句话回顾

Recap in one line

01 信号 · Fields 奖得主都在用 AI 攻克猜想了，分水岭已经过去。
02 方法 · 能结构化的全部结构化，能 gate 的全部 gate，AI 在边界里干活就稳。
03 验证 · newmath / automath 上 3,427+ 个 0-axiom 定理 · 42 篇论文管线 · 5×/周自动 release。
04 可迁移 · 同一套 OS 搬到 PhD 项目，最近重新有产出。
05 更深一步 · AI 蒸馏 office-hour——“超过自身经验的经验”。
06 建议 · 建 org · onboarding 走 /daily · 同等预算把现有人配满 Pro，放大产能。

01 Signal · Fields medalists are already working on conjectures with AI; the dividing line is behind us.
02 Method · structure everything that can be structured, gate everything that can be gated; AI works reliably inside boundaries.
03 Validation · newmath / automath: 3,427+ 0-axiom theorems · 42-paper pipeline · 5×/week auto-release.
04 Transferable · the same OS moved onto the PhD project, which has recently started producing output again.
05 Deeper · AI-distilled office hours: "experience beyond your own experience."
06 Proposal · build the org · onboarding via /daily · same budget, give everyone Pro-level tools, multiply output.

我希望我们一起跑这个 loop。现有的人 + AI + 一套共享的 OS，本来就该有 10× 的产能。我做这些不是想自己跑得快——是想让我们整个实验室一起更快、更好。这是我做这一切的核心目的。

I want us to run this loop together. The people we have + AI + a shared OS should give us 10× output. I am not doing this so I can run faster alone; I am doing it so the whole lab can move faster and better, together. That is the real point of all of it.

想接着跟管线细节 · X 上有

For more pipeline details, follow along on X

x.com/DayShuai →

真实管线、踩过的坑、每次判断背后是怎么想的——都在那条 timeline 上。

The real pipelines, the traps I have hit, and the reasoning behind each call are all on that timeline.

几条代表性的帖子

A few representative posts

2026-03-31 · 164 ❤ · 33 🔁 · 189 🔖

AutoMath 项目开源贴

AutoMath open-source announcement

“AI 不止会解题，它还能发现全新数学结构！仅从一个方程 x² = x + 1，零额外公理，用 Lean 4 形式化验证，AI + 人类协作推导出 9 大数学分支……核心方法：Derive · Discover · Name。”

"AI doesn't just solve problems — it can discover entirely new mathematical structures. Starting from a single equation x² = x + 1, zero extra axioms, formally verified in Lean 4, AI + human collaboration derived 9 major branches of math… Core method: Derive · Discover · Name."

2025-11-23 · 207 ❤ · 34 🔁

AI 时代一个人也能做理论物理

In the AI era, one person can do theoretical physics

“过去半年，我和 Auric 用 ChatGPT、Gemini、Grok 等各种 AI，从 0 开始做理论物理研究。没有导师，没有体系。我们用 AI 压缩了文献阅读、推导比对、结构重建，把原本需要十年的积累，在半年里做完。”

"Over the last six months, Auric and I used ChatGPT, Gemini, Grok and other AI systems to do theoretical physics research from scratch. No advisor, no system. AI compressed literature reading, derivation comparison, and structural reconstruction; work that would normally require ten years of accumulation, we did in six months."

2026-02-15 · #1stProof 系列

1stProof：验证我们一直在做的事

1stProof: validating what we've been doing all along

“我们参加 #1stProof，不是为了证明我们有多强。更是在验证我们一直在做的事情：用 AI 协助科研。我们搭建了一套自动化推理 agent 工作流——拿到问题先 plan，把问题拆成结构化子任务，再由 reasoning agent 推下一步推理路径。”

"We're in #1stProof not to prove how strong we are. We're validating what we've been doing all along: AI-assisted research. We built an automated reasoning agent workflow — take a problem, plan first, decompose it into structured subtasks, then let a reasoning agent push the next step."

2025-11-01 · 方法论

我们正在充当 MCP 里的“人工搬运工”

We're acting as the "human couriers" inside MCP

“在模型与工具之间贴胶水、搬运上下文。最小闭环：意图 → 假设 → 推理 → 证据 → 反驳 → 收敛。未来生产力 = 意图清晰度 × 推理闭环质量。”

"Gluing models to tools, carrying context by hand. Minimal closed loop: intent → hypothesis → reasoning → evidence → rebuttal → convergence. Future productivity = clarity of intent × quality of the reasoning loop."

谢谢 · Lexa · 2026-05-20

Thank you · Lexa · 2026-05-20

AI 时代下的研究操作系统

A research operating system for the AI era