AI 时代下的
研究操作系统
A research operating system
for the AI era
从一个信号说起 — Fields 奖得主已经在用 GPT-5.5 Pro 推数学猜想。 我们这种做 AI 的 lab,应该怎么调整自己的工作方式。
Start with a signal — Fields medalists are already using GPT-5.5 Pro to push real math conjectures. A lab like ours, that does AI, should rethink how it actually works.
今天讲三件事:方法论 · 已经验证能跑的管线 · 对实验室组织形态的一些建议。
Three things today: the methodology · pipelines already running in production · some suggestions for how the lab could organize.
Fields 奖得主用 AI 解决了真实的数学难题
Fields medalists are solving real math problems with AI
这不是 demo,不是 benchmark,是实际推进了某些猜想的进展。 最顶级的头脑都已经把 AI 当工具,我们就更没有理由还停在"AI 会不会取代我"这一类话题上。
Not a demo, not a benchmark — real progress on real conjectures. If the top minds in math already treat AI as a tool, we have no excuse to still be debating "will AI replace me."
GPT-5.5 Pro 两小时做出 PhD 级数学研究
GPT-5.5 Pro did PhD-level math research in under two hours
Gowers 在自己博客上写:模型在不到两小时内独立完成了一段博士级别的数学研究, 他自己的数学贡献是零。原话来自 the-decoder 等多家报道。
Gowers wrote on his own blog: the model did a self-contained piece of PhD-level math research in under two hours, with zero math contribution from him. Reported by the-decoder and others.
GPT-5.4 Pro 解决 Erdős #1196,Tao 24 小时内验证
GPT-5.4 Pro solved Erdős #1196, Tao verified it within 24 hours
一道悬了 90 年方法没人想到的 Erdős 猜想,GPT-5.4 Pro 一个 prompt 80 分钟拿下。 Tao 评论这是"对整数解剖学的有意义贡献,远超过这个具体问题本身", 随后把它扩展成一个新数学理论的种子。
A 90-year-old Erdős conjecture nobody had the right method for — GPT-5.4 Pro cracked it in 80 minutes from one prompt. Tao called it "a meaningful contribution to the anatomy of integers, well beyond the specific problem," then grew it into the seed of a new mathematical theory.
- 2026-01:GPT-5.2 自主解决 Erdős #397,Tao 验证。
- 2026-01:一周内三道 Erdős 问题被 AI 攻克,全部由 Tao 亲自验证。
- 2026 IPAM:Tao 公开说当下 AI 模型已 "ready for primetime":在数学和理论物理里,AI 现在节省的时间多于浪费的时间。
- 2026-04:一位 23 岁的 amateur 用 ChatGPT 解决了一道 60 年的老问题,Tao 评论说之前的研究者从一开始就走错了路。
- 2026-01: GPT-5.2 solved Erdős #397 on its own, verified by Tao.
- 2026-01: three Erdős problems cracked by AI in one week, all verified by Tao himself.
- 2026 IPAM: Tao said publicly that current AI models are "ready for primetime": in math and theoretical physics, AI now saves more time than it wastes.
- 2026-04: a 23-year-old amateur used ChatGPT to solve a 60-year-old open problem; Tao said prior researchers had been on the wrong path from the start.
这意味着什么
What this means
- A · AI 已经能介入到"前沿数学"这种最难、最抽象的脑力劳动。
- B · 顶级 CEO 和科学家亲自在写代码,因为模型已经能帮他们完成大部分实现层的事情。
- C · "会不会用 AI" 已经是新一道分水岭,而不是加分项。
- A · AI can now reach into frontier math, the hardest and most abstract intellectual work.
- B · Top CEOs and scientists are writing code themselves, because the model handles most of the implementation layer for them.
- C · "Can you use AI well" is now a dividing line, not a bonus.
对我们 lab 的含义
What it means for our lab
- → 我们是一个在做 AI 的实验室,更应该走在世界的前沿。
- → 这是范式转移,不是赶时髦:科研流程会在 AI 时代被整轮洗牌,我们 lab 的内涵就该长在新流程里。
- → 别人在用,我们更要把它用好、用得有方法、用得能复现。
- → We're an AI lab, so we should be ahead of the world, not behind it.
- → This is a paradigm shift, not a trend: the research workflow itself is getting reshuffled in the AI era, and our lab's substance should grow inside the new workflow.
- → Others are already using it. We should use it better: with method, reproducibly.
核心:不让 AI 自由发挥,让它执行有边界的任务
Core idea: don't let AI freelance — give it bounded tasks
最近我做下来最稳的一条结论: AI 不是不会犯错,是当你给它一个明确、结构化、可验证的任务,它就能做得比人快、比人稳。 管线的全部价值,就是把"研究"这件事拆成它能稳定完成的形状。
The most stable conclusion I've reached: AI does make mistakes — but when you give it a clear, structured, verifiable task, it runs faster and steadier than a human. The whole point of the pipeline is to cut "research" into shapes AI can finish reliably.
能结构化的全部结构化
Structure everything that can be structured
registries / TaskSpec / schemas / gates — 把研究流程拆成机器可读的对象。 AI 看的是结构,不是聊天历史。
registries / TaskSpec / schemas / gates — split the research process into machine-readable objects. AI reads structure, not chat history.
能加 gate 的全部加 gate
Gate everything that can be gated
Lean build pass、axiom audit、claim-vs-evidence 校验、确定性 gatekeeper。 没过 gate 的产出,不进主线。
Lean build pass, axiom audit, claim-vs-evidence checks, deterministic gatekeepers. Output that doesn't pass the gate doesn't reach main.
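As a rough illustration of what an axiom audit gate looks at, Lean 4's `#print axioms` command lists the axioms a proof depends on. The two theorems below are hypothetical stand-ins, not taken from the newmath / automath repos; a gate script would presumably parse this output and reject anything that reports a dependency.

```lean
-- Hypothetical example: a proof by computation, expected to need no axioms.
theorem two_add_two : 2 + 2 = 4 := rfl

-- Ask Lean which axioms the proof depends on.
#print axioms two_add_two

-- Hypothetical counter-example: excluded middle is classical,
-- so the audit should report classical axioms such as Classical.choice.
theorem em_demo (p : Prop) : p ∨ ¬p := Classical.em p

#print axioms em_demo
```

Under a "0-axiom" policy, the first theorem would pass the gate and the second would be blocked.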
幻觉是任务边界没设清楚
Hallucination = unclear task boundary
AI 出幻觉,几乎都是任务太宽。具体到一个 TaskSpec、允许动哪些文件、 允许做哪些 claim — 幻觉直接消失。
When AI hallucinates, the task was almost always too wide. Pin it down to one TaskSpec, which files it can touch, which claims it's allowed to make — hallucination just disappears.
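A minimal sketch of what a bounded task could look like as a machine-readable object. The field names (`allowed_paths`, `allowed_claims`) and the `check_output` helper are assumptions made for illustration; the deck does not show the actual TaskSpec schema.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class TaskSpec:
    """A bounded task: what the worker may touch and what it may claim (hypothetical schema)."""
    task_id: str
    allowed_paths: tuple       # glob patterns the worker may edit
    allowed_claims: frozenset  # claim ids the worker may assert


def check_output(spec, touched_files, claims):
    """Return a list of violations; an empty list means the output stayed in bounds."""
    violations = []
    for path in touched_files:
        if not any(fnmatch(path, pattern) for pattern in spec.allowed_paths):
            violations.append(f"file out of scope: {path}")
    for claim in claims:
        if claim not in spec.allowed_claims:
            violations.append(f"claim out of scope: {claim}")
    return violations


# Illustrative values only.
spec = TaskSpec(
    task_id="demo-001",
    allowed_paths=("lean/Demo/*.lean",),
    allowed_claims=frozenset({"lemma_a"}),
)
in_bounds = check_output(spec, ["lean/Demo/Basic.lean"], ["lemma_a"])
out_of_bounds = check_output(spec, ["paper/intro.tex"], ["lemma_a", "theorem_b"])
```

Because the check is deterministic, it can run as a gate before any model-based review: output with a non-empty violation list never reaches main.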
分层 agent:监督 / 执行 / 评审
Layered agents: supervise / execute / review
Claude as supervisor · Codex as executor · ChatGPT Pro extended thinking 主推力 · Claude 再过一遍做 adversarial review。
Claude as supervisor · Codex as executor · ChatGPT Pro extended thinking as the driving oracle · Claude runs one more pass as adversarial review.
Jiayi Weng(@Trinkle23897)刚刚证明的是同一个方向
Jiayi Weng (@Trinkle23897) just demonstrated the same direction
让 Codex (GPT-5.4) 反复迭代 Atari 游戏策略代码:神经网络一次都没重训,但 Breakout 得分从 387 涨到 864(满分), MuJoCo Ant 跑到 6000+(深度 RL 级别),Atari57 逼近 PPO 基准。 知识不压在参数里,而是写成可读、可改、可加 test 锁住的代码:灾难性遗忘消失,人能 lock 任何一步、能审计任何一步。 这正好是我管线一直在做的事:AI 写代码,人定边界。
Codex (GPT-5.4) was left to iteratively rewrite Atari game policy code. The neural net was never retrained, yet the Breakout score climbed from 387 to 864 (the maximum), MuJoCo Ant cleared 6000+ (deep-RL level), and results approached the PPO baseline across the full Atari57 suite. Knowledge isn't pressed into weights; it's written as code you can read, edit, and lock with tests. Catastrophic forgetting disappears, and every step can be locked and audited. This is exactly what my pipeline has been doing: AI writes code, humans set the boundaries.
"Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it's the next paradigm." — Jiayi Weng · 2026-05-08 · 3.1M views
展开看 BEDC pipeline 真实形态 · subprocess 链 + gate 评分 + 共识 supervisor
Expand: real BEDC pipeline · subprocess chain + scored gates + consensus supervisor
- 入 · BOARD → packet:从 BOARD.md 拾 target,先调研一轮,产出 _packet_*.md 这个结构化 handoff payload。下个 worker 不读自然语言,只消费这个对象。
- 推 · Codex orchestrator 驱动 Oracle:每轮 Codex 写 prompt、ChatGPT Pro 推一段、Codex 判 progress_delta。不是 Codex 卡了再升级到 Oracle,而是 Codex 和 Oracle 每一轮都在线协作。
- 停 · 四种 stop signal:BREAKTHROUGH / Q.E.D. / STUCK / 3 轮连续低 progress,外加 12h 墙钟硬上限。终止由 Codex 自己判,不由人介入。
- 现 · Stage 1.5 topic discovery:Codex 扫整段 transcript 抽相邻 claim,过一道小 gate(fit_score ≥ 7, novelty ≥ 6),新候选自动追加进 BOARD,pipeline 自己长 target。
- 审 · Stage 2 Claude 独立审阅:先过 logic_packet_gate(纯代码确定性检查),再走 10-point hygiene checklist,最后真跑一遍 pdflatex 编译;失败 retry 一次,再失败标 BLOCKED,绝不强推。
- 共 · supervisor 是共识机制,不是 dispatcher:失败即拦截 + 多轨保护 + oracle 生命周期托管(崩溃 30s backoff 自重启)+ 周期 Claude tier-3 review。没有单点失败,没有"人盯着才能跑"的环节。
- 省 · 早期每个产出 agent 都配一个 reviewer agent,后来发现:只要流程足够明确,根本不需要这一层。流程定义本身就是 review 的来源。
- in · BOARD → packet: pick a target from BOARD.md, do one research pass, emit _packet_*.md, a structured handoff payload. The next worker doesn't read natural language; it consumes this object.
- drive · Codex orchestrates Oracle: every turn Codex writes a prompt, ChatGPT Pro reasons, Codex judges progress_delta. It's not "Codex gets stuck, then escalates to Oracle"; Codex and Oracle work together on every single turn.
- stop · Four stop signals: BREAKTHROUGH / Q.E.D. / STUCK / 3 consecutive low-progress turns, plus a 12h wall-clock hard cap. Termination is decided by Codex itself, with no human in the loop.
- grow · Stage 1.5 topic discovery: Codex scans the full transcript for adjacent claims, passes them through a small gate (fit_score ≥ 7, novelty ≥ 6), and auto-appends new candidates back to BOARD, so the pipeline grows its own targets.
- review · Stage 2 Claude independent review: first the deterministic logic_packet_gate (pure code, no AI), then a 10-point hygiene checklist, then a real pdflatex compile. Fail → retry once → BLOCKED if still failing. Nothing is ever forced through.
- consensus · supervisor is a consensus mechanism, not a dispatcher: fail-as-block + multi-channel protection + oracle lifecycle management (auto-restart with 30s backoff after a crash) + periodic Claude tier-3 review. No single point of failure, no "only runs while a human is watching."
- cut · Early on, every producer agent had a paired reviewer agent. Then I found out: if the process is sharp enough, you don't need that layer at all. The process definition is the review.
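The stop rule and the topic-discovery gate described above can be sketched as two small deterministic functions. The thresholds 3 turns, 12h, fit_score ≥ 7 and novelty ≥ 6 come from the text; the `LOW_PROGRESS` cutoff, the `Turn` record, and the returned labels `WALL_CLOCK` / `LOW_PROGRESS` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

LOW_PROGRESS = 0.1        # assumed cutoff for a "low progress" turn (not given in the text)
MAX_LOW_TURNS = 3         # from the text: 3 consecutive low-progress turns
WALL_CLOCK_CAP_H = 12.0   # from the text: 12h wall-clock hard cap
TERMINAL = {"BREAKTHROUGH", "Q.E.D.", "STUCK"}


@dataclass
class Turn:
    signal: Optional[str]    # terminal signal emitted by the orchestrator, if any
    progress_delta: float    # orchestrator's judgment of progress made this turn


def stop_reason(turns: List[Turn], elapsed_hours: float) -> Optional[str]:
    """Return why the loop should stop, or None to keep going."""
    if elapsed_hours >= WALL_CLOCK_CAP_H:
        return "WALL_CLOCK"
    if turns and turns[-1].signal in TERMINAL:
        return turns[-1].signal
    recent = turns[-MAX_LOW_TURNS:]
    if len(recent) == MAX_LOW_TURNS and all(t.progress_delta < LOW_PROGRESS for t in recent):
        return "LOW_PROGRESS"
    return None


def passes_discovery_gate(fit_score: int, novelty: int) -> bool:
    """Stage 1.5 gate: only candidates scoring high enough are appended to the board."""
    return fit_score >= 7 and novelty >= 6
```

Keeping both checks as plain code means the model only supplies the scores and signals; whether the loop stops or a candidate lands on the board is decided outside the model.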
vision 模块:把宽泛的想法变成具体的事情
The vision module: turning vague ideas into concrete work
刚讲的管线是骨架。骨架里流的"想法"怎么变成具体的、能被管线消化的形状 —— 这是 vision 模块管的事。 用 AI 去找想法之间的内在关联,去判断什么时候它已经成熟到可以加进主线。 想法本身是宽泛的,AI 的工作是把它和已有结构连起来,再决定它现在能不能落进去。
The pipeline above is the skeleton. How vague "ideas" become a shape the pipeline can digest — that's what the vision module handles. Use AI to find the internal links between ideas, and to judge when an idea is mature enough to enter main. Ideas are vague by nature; AI's job is to tie them to the existing structure, then decide if they're ready to land.
一个宽泛的想法
A vague idea
一句感觉、一个方向、一个不一定能说清楚的直觉。
A feeling, a direction, an intuition you can't fully articulate.
找内在关联
Find the internal links
AI 把它放进现有 repo 的上下文里,看它和已有定理 / 模块 / target 的接口在哪。
AI drops it into the existing repo context, looks for the interfaces with existing theorems / modules / targets.
什么时候可以加
When it's ready to land
不是有想法就加 —— 是 AI 帮我们判断它现在是不是已经具体到可以落地, 然后才进 claim packet → 形式化 → 入主线。
Not every idea gets in — AI helps judge whether it's concrete enough to ship, and only then does it move into claim packet → formalization → main.
这件事对新 intern 极其友好
This is wildly friendly to new interns
- 问 · 新 intern 进来,最快上手的方式不是读 50 篇论文,是直接问 AI:"这个 lab 现在在做什么"。
- 说 · 他们有什么自己的想法,可以丢给 AI 让 AI 帮他们结构化、关联到现有项目。
- 参 · 更进一步:他们自己的 AI 可以直接参与到我们的 repo 里(PR + review)。
- 学 · 他们学到的不只是"做研究",而是"在 AI 协作环境里做研究":这才是未来 5 年的核心能力。
- ask · For a new intern, the fastest onramp isn't reading 50 papers; it's asking AI: "what is this lab actually working on right now."
- tell · If they have their own idea, they can hand it to AI and let AI structure it and link it to existing projects.
- join · One step further: their own AI can participate directly in our repos (PR + review).
- learn · What they pick up isn't just "how to do research" but "how to do research inside an AI-collaboration environment." That's the core skill of the next 5 years.
我们自己也是 vision 模块的用户:比如 NotebookLM 把我们的论文合成 deep-dive audio, 听的过程里会冒出新的 vague idea,闭环回 vision。 这也是为什么下一页讲的 4 条 lane 里有一条专门做 NotebookLM 自动合成:既给外面看,也给我们自己听。
We use the vision module ourselves too: NotebookLM auto-synthesizes deep-dive audio from our own papers, and fresh vague ideas pop up mid-listen and loop back into vision. That's also why one of the four lanes on the next slide is dedicated to NotebookLM synthesis, both for outside readers and for us to listen to.
vision → pipeline → vision · the loop
newmath / automath:稳定产出的证据
newmath / automath: evidence of stable output
下面这些不是计划,是已经在 GitHub 上跑了几个月、有 commit、有 release、有论文产出、有自动宣发的事实。
The numbers below aren't a plan — they've been live on GitHub for months: real commits, real releases, real papers, real auto-promotion.
四条本地真实在跑的 lane
Four lanes actually running locally
核心理论自己往前长
The core theory grows itself forward
面向 BEDC 这套主理论的本地推进。任务形态:"已有定理 → 下一步还能逼出什么"。 每天往主线增量。上一张 slide 的 pipeline 就是这条 lane 的实际实现。
Local push on the BEDC core theory. Task shape: "given existing theorems, what's the next thing to extract." Increments land on main daily. The pipeline on the previous slide is exactly this lane's real implementation.
ChatGPT extended thinking = oracle · Codex = orchestrator + 落 Lean · Claude = supervisor + adversarial review
ChatGPT extended thinking = oracle · Codex = orchestrator + lands Lean · Claude = supervisor + adversarial review
靶向已知未解问题
Aim at a known open problem
选定一个外部 open problem 作为靶子,全 lane 围绕它推进 / 形式化 / 写。 和 BEDC lane 同构 —— BEDC 是"内部 frontier",open target 是"外部靶心"。
Pick an external open problem as the target; the whole lane pushes / formalizes / writes around it. Same shape as the BEDC lane — BEDC is the "internal frontier," open target is an "external bullseye."
模型分工和 lane 1 一样。区别只在 target 来源 —— oracle 吃整段 open-problem 上下文先 propose attack plan,Codex 执行每一步,Claude 决定回退或继续。
Model split same as lane 1. Only target source differs — oracle ingests the open-problem context and proposes an attack plan, Codex executes each step, Claude decides rollback or continue.
论文 outreach 自动 · 从定理到 preprint
Paper outreach automation · theorem → preprint
主线每过一批新定理,这条 lane 自动把它们拼成完整论文: 引言 / related work / 实验段 / bib / figure 全部 AI 起草,按 paper-series 模板装订。 外部合作论文走同一管线接入 —— 合作者贡献 claim,stitch + outreach 我们这边自动跑。
Every batch of new theorems on main, this lane auto-stitches them into a full paper: intro / related work / experiments / bib / figures all AI-drafted, bound by the paper-series template. External collaborations plug into the same pipeline — collaborators contribute claims, we automate the stitching and outreach.
输出 → arxiv-ready preprint,进 release ledger;目前 3 篇 P7 待投稿、42 篇在管线(含合作)。
Output → arxiv-ready preprints, into the release ledger; currently 3 P7 papers ready to submit, 42 in pipeline (including collaborations).
NotebookLM + 抖音 + 小红书 一键合成
NotebookLM + Douyin + Xiaohongshu, one-click synthesis
preprint 一发布,这条 lane 自动开跑: NotebookLM 合成 deep-dive audio → 切短视频投 抖音 → 生成图文卡片投 小红书。 宣发不再是事后人工跑一遍 —— 是和论文产出同一管线里的下一道工序。
The moment a preprint lands, this lane fires: NotebookLM auto-synthesizes a deep-dive audio digest → clipped to short video for Douyin → text+image card auto-posted to Xiaohongshu. Promotion isn't an afterthought any more — it's the next stage in the same pipeline.
附带价值:我们自己也听 NotebookLM 的 deep dive —— 听自己写的论文,中间会冒出新 vague idea,闭环回 vision 模块。
Side benefit: we listen to our own NotebookLM digests — fresh vague ideas pop up mid-listen, loop back into the vision module.
怎么"平衡"三个模型 —— 各自的强项做各自的事
How to balance the three models — each one does what it's best at
- 推 · ChatGPT Pro extended thinking 当 oracle 用:长链推理深度目前最够,"下一步该往哪推"这种创造性判断交给它。
- 做 · Codex 当执行层用:最擅长在有明确边界的代码 / Lean / latex 改写任务里稳定产出,不能让它做开放性推理。
- 控 · Claude 当 supervisor + 最终 review:最擅长读上下文、判 claim 是不是过界,最适合卡边界。
- 分 · 关键不是"用哪个模型最强",是让每个模型只做它最擅长的那一步。lane 3-4 同一逻辑:NotebookLM 做合成、剪辑模型做切片、Claude 把关最终输出,不让任何一个越界。
- push · ChatGPT Pro extended thinking as oracle: its long-chain reasoning is currently the deepest, so hand it the creative call of "where do we push next."
- do · Codex as the execution layer: best at producing reliably inside bounded code / Lean / latex rewriting tasks. Don't let it do open-ended reasoning.
- hold · Claude as supervisor + final review: best at reading context and judging whether a claim crossed the line, which makes it the natural boundary guard.
- split · The point isn't "which model is strongest"; it's letting each model do only the step it's best at. Same logic on lanes 3-4: NotebookLM synthesizes, an editor model clips, Claude gates the final output, and none of them steps outside its role.
展开看:一个 release 周期里四条 lane 的实际节奏
Expand: the actual rhythm of all four lanes inside one release cycle
- 06:00 · Daily release 自动触发:Lean 全量 build + axiom audit + 覆盖率统计,产物入 release ledger。
- 日间 · supervisor 看昨天 release,给 lane 1 (BEDC) 和 lane 2 (open target) 各起 target,生成 claim packet。
- 日间 · oracle 推数学、Codex 落 Lean / latex、gate 卡边界,三个模型在同一条 lane 上交接。
- 傍晚 · 过 gate 的进入 adversarial review,Claude 走最后一遍。
- 夜间 · lane 3 (dev-automation-integration) 从主线抽新通过的定理,自动重写章节、更新 bib、补 figure,preprint 上传。
- 夜间 · lane 4 (omega-paper-series) 接力:NotebookLM 合 audio → 切短视频投抖音 → 图文卡片投小红书。
- 第二天 · 我 review 一遍,否掉不对的 claim,让 supervisor 重新分发。
- 06:00 · Daily release fires automatically: full Lean build + axiom audit + coverage stats, artifacts land in the release ledger.
- day · supervisor reads yesterday's release, picks one target each for lane 1 (BEDC) and lane 2 (open target), generates the claim packet.
- day · oracle pushes the math, Codex lands Lean / latex, gates hold the boundary; three models hand off inside the same lane.
- evening · What passes the gate goes into adversarial review. Claude does the final pass.
- night · lane 3 (dev-automation-integration) pulls newly passed theorems off main, auto-rewrites sections, updates the bib, adds figures, uploads the preprint.
- night · lane 4 (omega-paper-series) takes the baton: NotebookLM synthesizes audio → clipped into Douyin shorts → image+text card auto-posted to Xiaohongshu.
- next day · I review, reject wrong claims, ask the supervisor to redispatch.
PhD 项目:方法论可迁移的实物证据
PhD project: live evidence the methodology transfers
我把上面验证过的这套结构搬到了我自己的博士项目上。 最近也开始重新迭代、有产出了。重点不是项目内容本身, 而是 —— 同一套研究 OS,换一个完全不同的域,照样跑得起来。
I moved the structure validated above onto my own PhD project. It's iterating again, producing again. The point isn't the project itself — it's that the same research OS, in a completely different domain, still runs.
结构层面直接复用的部分
What gets reused at the structural level
- 骨 · 同样的 supervisor + subprocess + event bus 结构。
- 任 · 同样的 TaskSpec / gate / claim packet 协议。
- 证 · 同样的 "过 gate 才入主线" 的产出原则。
- 迁 · 从一个域换到另一个域,控制面几乎不用动,只换 worker 里的领域逻辑。
- bone · Same supervisor + subprocess + event bus structure.
- task · Same TaskSpec / gate / claim packet protocol.
- rule · Same "no main without passing the gate" output principle.
- port · Swap the domain and the control plane barely moves; only the domain logic inside the workers changes.
多机并行 loop
Multi-machine parallel loop
- 节 · 多个本地节点 + 一个 server 节点,分工注册在 node ledger 里。
- 心 · 每个节点跑心跳脚本,自取任务、自动产出。
- 独 · 每个任务对应一个独立工作单元,互不污染。
- 收 · 失败 / 心跳全部进 events ledger,过 gate 的才进入合流。
- node · Several local nodes + one server node, roles registered in the node ledger.
- beat · Every node runs a heartbeat script, pulls its own tasks, produces on its own.
- iso · Each task gets an isolated work unit, with no cross-contamination.
- log · Failures and heartbeats all go to the events ledger; only what passes the gate flows into the merge.
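One way the events ledger above could work is an append-only JSON Lines file that every node writes to, with the merge step reading back only gate-passing results. The file name, event fields, and node names here are assumptions for illustration, not the project's actual format.

```python
import json
import tempfile
import time
from pathlib import Path


def append_event(ledger, node, kind, task_id="", gate_passed=False):
    """Append one JSON line; nothing in the ledger is ever rewritten."""
    event = {"ts": time.time(), "node": node, "kind": kind,
             "task_id": task_id, "gate_passed": gate_passed}
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def mergeable_tasks(ledger):
    """Task ids whose result events passed the gate, in ledger order."""
    passed = []
    for line in ledger.read_text(encoding="utf-8").splitlines():
        event = json.loads(line)
        if event["kind"] == "result" and event["gate_passed"]:
            passed.append(event["task_id"])
    return passed


# Demo with hypothetical nodes and tasks, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / "events.jsonl"
    append_event(ledger, "local-1", "heartbeat")
    append_event(ledger, "local-1", "result", task_id="t1", gate_passed=True)
    append_event(ledger, "server", "failure", task_id="t2")
    append_event(ledger, "server", "result", task_id="t3", gate_passed=False)
    ready = mergeable_tasks(ledger)
```

Heartbeats and failures stay in the ledger for auditing, but only `t1` would be offered to the merge step.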
PhD 这个项目最近重新有产出,但故事不是"AI 帮我写了代码"。 而是:同一套研究 OS,换一个域,照样跑得起来。 这是我想呈现的"方法论可复用"的部分。
The PhD project is producing again, but the story isn't "AI wrote my code." It's that the same research OS, dropped into a different domain, still runs. This is the "methodology is reusable" part I want to show.
GitHub 是天生为 AI 准备的
GitHub was built for AI before AI knew it
AI 会犯错,这是事实。但 GitHub 的所有机制 — commit、branch、rollback、PR、actions — 全部都是为"会犯错的执行者 + 需要审计的协作"设计的。 我们 lab 即使保持半开源的传统,private repo 也已经足够把 AI 协作的好处吃满。
AI makes mistakes. Fine. But every GitHub mechanism — commit, branch, rollback, PR, actions — was designed for "fallible executors + auditable collaboration." Even if our lab keeps its half-open tradition, private repos alone are enough to capture all the AI collaboration upside.
每一步都可回滚
Every step is rollback-able
AI 改一点我就 push 一次。出问题,git reset 到任意一个安全点,没有"AI 把我代码搞乱了"这种事。
Every small change the AI makes gets its own push. If something breaks, git reset to any safe point; "AI broke my code" just doesn't happen.
任务 = 分支
Task = branch
每个 task 在一个独立分支跑,互不污染。过 gate 才合并,不过就丢掉。
Every task runs on its own branch, no cross-contamination. Passes the gate, it merges; doesn't, it gets dropped.
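A small sketch of the "task = branch" rule as a function that plans the git commands for one finished task. The `task/<id>` branch naming is an assumption; the git subcommands themselves (`switch`, `merge --no-ff`, `branch -d` / `-D`) are standard. The function only builds the command lists, so it can be inspected or logged before anything runs.

```python
def branch_plan(task_id, gate_passed, main="main"):
    """Return the git commands to run once a task's gate result is known."""
    branch = f"task/{task_id}"  # hypothetical naming scheme
    if gate_passed:
        return [
            ["git", "switch", main],
            ["git", "merge", "--no-ff", branch],  # keep an explicit merge commit for the audit trail
            ["git", "branch", "-d", branch],      # safe delete: branch is already merged
        ]
    return [
        ["git", "switch", main],
        ["git", "branch", "-D", branch],          # force delete: the work is dropped, never merged
    ]


merged = branch_plan("demo-001", gate_passed=True)
dropped = branch_plan("demo-002", gate_passed=False)
```

To execute a plan, each command list can be passed to `subprocess.run(cmd, check=True)` inside the repository.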
AI 对 AI 评审
AI reviews AI
worker 出 PR,另一个 reviewer agent 审。我只在最后卡一环。审计链全留痕。
Worker opens a PR, another reviewer agent audits it. I only hold the last gate. The full audit trail is on the record.
gate 自动化
Gates run themselves
daily-build / pr-gate 跑 Lean、跑 axiom audit、跑论文 sync。失败的不放进主线。
daily-build / pr-gate run Lean, run axiom audit, run paper sync. Failures don't reach main.
所有贡献都可视化
Every contribution is visible
每个人改了什么、什么时候改的、有没有过 review — 全部留有轨迹。 协作不再依赖人情或口头汇报。
Who changed what, when, whether it passed review — all on the record. Collaboration stops depending on social capital and verbal updates.
不开源也能 AI-friendly
You can be AI-friendly without going open-source
知识产权不冲突。AI 协作的全部好处,private 一样可以拿到。
No IP conflict. Every benefit of AI collaboration is still available on private.
4 台电脑、常开、loop 不间断、自动汇合
4 machines, always on, loops never stop, merges happen by themselves
不是同时控制 4 个键盘,是 4 台机器各跑一个 daemon loop, 各自推进自己的 lane,靠 git 实时 commit / handoff 通信。 所有 loop 实时同步到 autodev 这一层(双向 sync),过审核才 merge 回主线。 我每天的实际 attention 只放在最有判断价值的环节。
Not me at 4 keyboards. Each of 4 machines runs a daemon loop, pushes its own lane, and the machines talk to each other through live git commits / handoffs. Every loop syncs in real time into the autodev layer (both ways); only what passes review merges back to main. My actual attention each day only goes to the steps where judgment really matters.
- daemon · 每台机器跑一个 daemon loop:常驻进程,事件驱动地推进任务,不需要人启动。
- 独立 · 每台机器在自己的 loop 里推进自己的 lane,互不干涉。
- 通信 · 各机器之间通过 git 实时 commit / handoff 通信:一个 lane 卡住的时候另一个 lane 直接接手,不冲突。
- 同步 · autodev 是实时同步层,不只是中转:所有 worker 分支双向 sync(pull 最新 main + push 所有 worker commit),4 台机器始终看到一致状态。
- 审 · autodev 上跑 CI、跑自动 review,再加一道人 / adversarial 审核,通过了才 merge 回 main。
- 回 · 没通过的留在 autodev 继续修;我的判断回写成下一轮 prompt 喂给 supervisor。
- daemon · Each machine runs one daemon loop: long-lived, event-driven, no human start button.
- iso · Every machine pushes its own lane inside its own loop, no interference.
- talk · Machines talk to each other through live git commits / handoffs; when one lane gets stuck, another picks up without collision.
- sync · autodev is a real-time sync layer, not just staging: every worker branch syncs both ways (pull latest main + push all worker commits), so all 4 machines see consistent state.
- review · autodev runs CI, runs auto-review, plus a human / adversarial review pass; only then does it merge back to main.
- loop · What doesn't pass stays in autodev for fixes; my judgment gets written back as the next round of prompt for the supervisor.
把"office hour"蒸馏成 AI 可调用的资源
Distilling "office hours" into something AI can call
先讲一下 gstack —— Y Combinator 现任 CEO Garry Tan 自己写的开源 Claude Code skill 包
First, gstack — the open-source Claude Code skill pack written by Garry Tan, current CEO of Y Combinator
YC(Y Combinator)是硅谷孵化了 Airbnb / Stripe / Dropbox 等几千家公司的那个 accelerator。 Garry Tan 现在是它的 President & CEO,自己又是 Posterous 的 co-founder、写了 YC 内部 Bookface 的第一版。 他在 2026 年初亲自下场写代码,把自己 hold 多个项目的方法封装成了 gstack,三周冲到 70k+ stars。
YC (Y Combinator) is the Silicon Valley accelerator that produced Airbnb / Stripe / Dropbox and thousands of others. Garry Tan is its current President & CEO, also a Posterous co-founder, and wrote the first version of YC's internal Bookface. In early 2026 he sat down and wrote the code himself, packaging how he juggles many projects into gstack, which reached 70k+ stars in three weeks.
它最被推崇的就是 /office-hours: 把 16 个 YC partner 给 founder 做 office hour 时的判断风格蒸馏成 6 个 forcing question, 在你写任何代码之前先逼你把事情想清楚。 Garry 自己的原话是:你在这里得到的大概是真去 YC 做 office hour 价值的 10%, 但这 10% 已经能让很多 founder 直接放弃错的想法。
The piece people praise most is /office-hours: it distills how 16 YC partners conduct office hours into 6 forcing questions that make you think clearly before writing any code. Garry's own line: what you get here is maybe 10% of the value of an actual YC office hour, but that 10% is already enough to make a lot of founders drop a wrong idea.
它为什么好用:在你让 AI 干活之前,先帮你把事情理解到底。 这正好踩在 AI 协作最容易出问题的那一步上 —— 大多数人会跳过理解、直接让 AI 跑,然后一边跑一边发现"哦不是这样"。 office-hours 把"理解"这一步前置了。
Why it works: it helps you understand the thing before you turn AI loose on it. That hits exactly the step where AI collaboration most often goes wrong — most people skip understanding, kick off AI, and only mid-run realize "oh that's not what I meant." office-hours moves "understanding" to the front.
我现在所有项目都会先用它过一遍
I now run every project through it first
只有自己(和 AI)先把整个事情理解清楚,自动化管线才拆得出来,才有用。 顺序不能反 —— 没有理解就上自动化,只是在快速放大自己的盲点。 office-hours 这种"理解优先"的工具,是研究 OS 的入口环节,不是边角料。
Only once you (and AI) actually understand the thing can the automation pipeline be cut cleanly and be useful. The order can't flip — automation without understanding just amplifies your blind spots faster. "Understanding first" tools like office-hours are the entry point of the research OS, not an extra.
在这个基础上我自己做了一版"私人 office-hour"
On top of that I built my own "private office hour"
做了什么
What I built
- 爬 · 把老师的论文、公开 talk、给我和别人的 comment 全部爬下来。
- 蒸 · 交给 AI 蒸馏:他喜欢的方法、判断风格、踩过的坑是什么形状。
- 合 · 整合我跟不少前辈交流过的 office hour 笔记。
- 用 · 结果就是一个本地版的"私人 office hour",我做模型的时候随时能问。
- crawl · Pulled my advisor's papers, public talks, and every comment he ever wrote to me and to others.
- distill · Let AI distill it: the methods he prefers, his judgment style, the shape of the traps he's hit.
- mix · Merged in office-hour notes from conversations with a number of senior researchers.
- use · Result: a local "private office hour" I can summon any time I'm designing a model.
真实价值
Real value
- → 我设计一个模型时,AI 会主动告诉我"如果是他会怎么看这个 trade-off"。
- → 它能挑出"你这个想法里他会立刻反驳的盲点"。
- → 把多年导师指导的密度,压缩进一个可以随时召唤的对话框。
- → 不是替代真正的 office hour,是让真正的 office hour 变得更有质量。
- → When I'm designing a model, AI proactively tells me "here's how he'd look at this trade-off."
- → It can flag "the blind spot in your idea he'd push back on immediately."
- → Years of advisor guidance, compressed into a chat box I can summon any time.
- → Not a replacement for real office hours; it makes the real office hours higher quality.
研究里有很多"不可言说的言说":品味、判断、踩过的坑。 很多人在做世界模型、做具身智能,本质都是想捕获这种东西。 在科研协作这个域里,AI 已经可以帮我们做这一步了。 这是"超过自身经验的经验"。
Research has a lot of "unsayable speech": taste, judgment, the shape of past mistakes. A lot of people building world models and embodied AI are really trying to capture this kind of thing. Inside research collaboration, AI can already do this step for us. This is "experience beyond your own experience."
你不需要第三方 app —— 你可以自己搭
You don't need third-party apps — you can just build them
office-hour 只是一个例子。规律是:你日常用的任何"工具型 app", 现在都可以让 agent 在本地帮你重做一份 —— 更贴合你自己的 workflow、更可控、更便宜。
office-hours is just one example. The pattern: any "utility app" you use day-to-day can now be rebuilt locally by an agent — better fit to your own workflow, more controllable, cheaper.
中文
现在你不需要第三方软件或平台了 —— 任何 app 能帮你做的事, 你都可以用 agent 在本地快速搭一套。 有 API 的,直接访问; 没有 API 的,coding agent 用 Chrome / Playwright 去访问、截图, 把你要的东西拿回来。 我自己日常就这么用着 arxiv 的 function 和好多别的。
English
Right now you don't need a third-party software or platform — anything an app can do for you, you can quickly set up something local with your agents. Anything with an API can be accessed directly; for those without, coding agents can still use Chrome / Playwright to visit and screenshot to give you what you want. I'm using arxiv functions and many others in my daily life.
- API · 有 API · agent 直接 call。arxiv / Google Scholar / GitHub / OpenReview / Vercel / ...
- 无 · 没 API · coding agent + Chrome / Playwright,去访问、截图、解析、回写。
- 果 · 结果 · 你的工具栈每天都在自己长:不是订阅 SaaS,是自己长。
- API · Has API · the agent calls it directly. arxiv / Google Scholar / GitHub / OpenReview / Vercel / ...
- no-API · No API · coding agent + Chrome / Playwright: visit, screenshot, parse, write back.
- grow · Result · your toolchain grows on its own every day; it isn't a SaaS subscription, it actually grows.
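For the "has API" case, a minimal sketch of what an agent-built arXiv helper might look like, using only the standard library. The endpoint and the `search_query` / `start` / `max_results` parameters are from the public arXiv API, which returns an Atom feed; the helper names and the sample feed below are made up for illustration, and the example parses an inline string instead of hitting the network.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(terms, max_results=5):
    """Build an arXiv API query URL that searches all fields for the given terms."""
    params = {"search_query": f"all:{terms}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_titles(atom_xml):
    """Extract entry titles from an Atom feed, collapsing internal whitespace."""
    root = ET.fromstring(atom_xml)
    return [" ".join(entry.findtext("atom:title", default="", namespaces=ATOM).split())
            for entry in root.findall("atom:entry", ATOM)]


# Made-up sample feed so the example runs offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>A  sample
  title</title></entry>
  <entry><title>Another title</title></entry>
</feed>"""

url = build_query_url("lean theorem")
titles = parse_titles(SAMPLE_FEED)
```

To query the live service, the response body can be fetched with `urllib.request.urlopen(url).read().decode("utf-8")` and passed to `parse_titles`.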
"工具靠订阅"是上一个时代的逻辑。 在 agent 时代,你的工具栈是自己长出来的 —— 长得最贴合你自己。 OPC(One-Person-Company)的硬基础
"Subscribe to your tools" is last era's logic. In the agent era, your toolchain grows itself — shaped exactly to you. The hard floor of OPC (One-Person-Company)
所有研究路线都可以做成可视化 DAG
Every research roadmap can become a visual DAG
每个项目都有一个 roadmap,每个节点有明确的依赖关系。 DAG 让"我们到了哪一步"变成肉眼可见的事情 —— 人把控方向,让 AI 不断把节点推到下一阶段。 下面这张是 BEDC 项目现在真实在跑的 project map。
Every project has a roadmap, every node has explicit dependencies. A DAG makes "where are we now" something you can see — humans steer direction, AI keeps pushing nodes to the next stage. The map below is the BEDC project's actual live project map.
每个节点点开都有:constructive story(这个对象是怎么从底层一步步搭出来的)、 Lean 验证状态、被依赖的上下游、当前的 stub / checked 状态。 这张图本身就是整个项目的"共同视野"—— 谁在哪一格、卡在哪条边、下一步往哪推,看一眼就清楚。
Click any node and you get: the constructive story (how this object is built up from the bottom step by step), Lean verification status, upstream / downstream dependencies, and current stub / checked state. The map itself is the project's "shared field of view" — who is on which cell, which edge is stuck, where to push next: one look and it's clear.
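The "where to push next" reading of the map can be sketched with the standard library's `graphlib`: a node is ready when it is still a stub and everything it depends on is already checked. The node names and statuses here are hypothetical, not the real BEDC map.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (all names are hypothetical)
deps = {
    "axioms": set(),
    "lemma_a": {"axioms"},
    "lemma_b": {"axioms"},
    "theorem_c": {"lemma_a", "lemma_b"},
}
status = {
    "axioms": "checked",
    "lemma_a": "checked",
    "lemma_b": "stub",
    "theorem_c": "stub",
}


def ready_nodes(deps, status):
    """Stub nodes whose dependencies are all checked: the next things to push."""
    # static_order() raises graphlib.CycleError if the roadmap is not a DAG.
    order = list(TopologicalSorter(deps).static_order())
    return [node for node in order
            if status[node] == "stub"
            and all(status[dep] == "checked" for dep in deps[node])]


frontier = ready_nodes(deps, status)
```

Here `theorem_c` is not on the frontier because `lemma_b` is still a stub, which is the "which edge is stuck" information the map is meant to surface.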
我建议实验室建立一个 org,统一协作面
I'm proposing the lab create a GitHub org — one shared surface
就算保持半开源传统,把 lab 的项目集中到一个 GitHub organization 下(可以全 private), 会让 AI 协作的所有好处加成到实验室级别。我在自己的一个 side project 里已经跑通了这个模式, 可以作为参考。
Even if we keep the half-open tradition, pulling the lab's projects under one GitHub organization (fully private is fine) would lift every AI-collaboration benefit up to lab scale. I've already run this pattern in a side project of mine — happy to use it as a reference.
side project 上的实践模板
The template, proven on a side project
- 组 · 把项目下的几个仓库放在同一个 org 下,统一权限和上下文。
- 命 · 所有人每天进来只输一个 /daily 命令。
- 果 · AI 自动告诉他/她:项目今天进展、他负责的部分、可以做什么。
- 本 · 哪怕协作者不写代码、不熟 AI,也能秒进入状态。AI 把所有 repo 当上下文。
- group · Put the project's several repos under one org, with unified permissions and context.
- cmd · Everyone comes in each day and types a single /daily.
- out · AI tells them: today's progress on the project, the part they own, what they can do.
- low · Even a collaborator who doesn't code or doesn't know AI gets to "in the loop" instantly. AI treats every repo as context.
放大到实验室是什么样
What it looks like at lab scale
- 入 · 每个新 intern 进来,第一件事就是 /daily。
- 联 · AI 自己能看到 lab 两个 repo 之间的关联,自动牵线。
- 想 · 每个人的想法 + AI 的想法 → 都可以无限多。AI 来挑能落地的、去试。
- 迹 · 每个人的贡献、每个想法的来源,GitHub 全部留有时间线。
- in · Every new intern starts with /daily on day one.
- link · AI can see connections between any two lab repos by itself, and wire them up automatically.
- idea · Every person's ideas + AI's ideas → effectively unlimited. AI picks the ones that can ship and tries them.
- trace · Every contribution, every idea's origin: GitHub keeps the full timeline.
实验室不一定要全开源,但建议有一个 org —— 即使是 private 的, 也能让我们整个 lab 的产能上一个数量级。
The lab doesn't have to go open-source, but I'd suggest having an org — even fully private, it can push the whole lab's output up by an order of magnitude.
再招一个 postdoc,还是把现有人全部武装到顶?
Hire another postdoc, or kit out everyone we already have?
重点不是 AI 取代谁。重点是:同样一份预算, 能让现有的每个人产能上一个台阶,还是只多一个人、其他人不动。
The question isn't who AI replaces. It's: for the same budget, do we step up every existing person, or do we just add one extra and leave the rest where they are.
一个 postdoc
One postdoc
- ×1 · 多一个人,多一份产能。
- 磨 · onboard / 适配 / visa / 编制 / 沟通成本。
- 不 · 不会让现有的人也变快。
- ×1 · One more person, one more unit of output.
- cost · onboarding / adaptation / visa / headcount / communication overhead.
- no · Doesn't make any existing person faster.
同等预算 · 把所有人都拉到顶配 AI
Same budget · top-tier AI stack for everyone
- 栈 · Claude Max 20x + ChatGPT Pro 20x,每人一套。
- ×N · 实验室现有的每个人产能上一个台阶。
- 即 · 当天能用,没有 onboard,没有 visa,没有编制。
- stack · Claude Max 20x + ChatGPT Pro 20x, one full set per person.
- ×N · Every existing person in the lab steps up a level of output.
- now · Usable the same day: no onboarding, no visa, no headcount.
不是说不要 postdoc。是说在 2026 年, "补一个 AI 协作基础设施"在边际上比"再加一个人"杠杆更高 —— 而且两件事不互斥,一个 postdoc 的预算,也足够同时把现有所有人的工具栈拉到最强。
Not arguing against postdocs. The point is that in 2026, "filling in the AI-collaboration infrastructure" has higher marginal leverage than "adding one more person" — and the two aren't mutually exclusive: one postdoc's budget is enough to also max out everyone's toolchain at the same time.
组会的形式本身也可以优化
The format of the lab meeting itself can be optimized
坦白说,纯讲话的组会信息密度是低的。如果我们真的能让 AI 深入到每个人的代码库里, 很多"现在做到哪了"这件事其实可以直接看,不用每周再口头复述一遍。
Honestly, pure-talk lab meetings are low-density. If AI can really reach into everyone's repos, a lot of the "where are we now" can just be looked at — no need to verbally recap it every week.
纯口头组会
Pure verbal lab meeting
每个人讲 10–15 分钟,信息密度低,进展可视化弱,互相不了解的部分只能问。
Everyone talks 10–15 minutes, low information density, weak progress visualization, gaps in mutual understanding can only be filled by asking.
异步 + DAG + AI 总结
Async + DAG + AI summary
每个人 push 进度到自己 repo;AI 每周自动汇总一个 lab-wide digest; 组会本身只讨论卡点和方向。
Everyone pushes progress to their own repo; AI auto-aggregates a lab-wide digest each week; the meeting itself only discusses blockers and direction.
信任的来源是留痕
Trust comes from the record
GitHub 记录一切,谁在哪个时间做了什么、做对了什么 — 不再有"靠口头汇报"的不安。
GitHub logs everything — who did what, when, what was right — no more anxiety about relying on verbal updates.
这一点我不强推。但如果我们建了 org,这个变化几乎是顺手就完成的。
Not pushing this one hard. But if we set up the org, this change pretty much happens for free.
这套 OS 已经能跑 · 剩下的是把它打开
The OS already runs · what's left is turning it on for the lab
我 pitch 这一整套,核心不是想 show 我做了多少。 是 — 这套 OS 已经验证能稳定产出, 剩下的就是让它在 lab 里跑起来,让大家一起更快产出、更好产出。
I'm pitching all of this not to show how much I've done. It's — this OS has been validated, it produces reliably, and what's left is getting it running inside the lab, so everyone produces faster and better together.
一句话回顾
One-line recap
- 01 · 信号 · Fields 奖得主都在用 AI 攻克猜想了,分水岭已经过去。
- 02 · 方法 · 能结构化的全部结构化,能 gate 的全部 gate,AI 在边界里干活就稳。
- 03 · 验证 · newmath / automath 上 3,427+ 个 0-axiom 定理 · 42 篇论文管线 · 5×/周自动 release。
- 04 · 可迁移 · 同一套 OS 搬到 PhD 项目,最近重新有产出。
- 05 · 更深一步 · AI 蒸馏 office-hour:"超过自身经验的经验"。
- 06 · 建议 · 建 org · onboarding 走 /daily · 同等预算把现有人配满 Pro,放大产能。
- 01 · Signal · Fields medalists are already cracking conjectures with AI; the watershed is behind us.
- 02 · Method · structure everything that can be structured, gate everything that can be gated; AI is steady inside boundaries.
- 03 · Validation · newmath / automath: 3,427+ 0-axiom theorems · 42-paper pipeline · 5×/week auto-release.
- 04 · Transferable · same OS dropped onto the PhD project, which is producing again.
- 05 · Deeper · AI-distilled office-hours: "experience beyond your own experience."
- 06 · Proposal · build the org · onboarding via /daily · same budget, max out everyone with Pro, multiply output.
我希望我们一起跑这个 loop。 现有的人 + AI + 一套共享的 OS,本来就该有 10× 的产能。 我做这些不是想自己跑得快,是想让我们整个 lab 一起更快、更好。 这是我做这一切的核心目的。
I want us to run this loop together. The people we already have + AI + a shared OS: this should already be 10× output. I'm not doing this to run faster on my own; I'm doing it so our whole lab runs faster and better, together. That's the real point of all of it.
想接着跟管线细节 · X 上有
Want to follow the pipeline details · they're on X
真实管线、踩过的坑、每次判断背后是怎么想的 —— 都在那条 timeline 上。
The real pipelines, the traps I've hit, the reasoning behind each call — all on that timeline.
几条代表性的帖子
A few representative posts
AutoMath 项目开源贴
AutoMath open-source announcement
"AI 不止会解题,它还能发现全新数学结构!仅从一个方程 x² = x + 1,零额外公理,用 Lean 4 形式化验证,AI + 人类协作推导出 9 大数学分支……核心方法:Derive · Discover · Name。"
"AI doesn't just solve problems — it can discover entirely new mathematical structures. Starting from a single equation x² = x + 1, zero extra axioms, formally verified in Lean 4, AI + human collaboration derived 9 major branches of math… Core method: Derive · Discover · Name."
AI 时代一个人也能做理论物理
In the AI era, one person can do theoretical physics
"过去半年,我和 Auric 用 ChatGPT、Gemini、Grok 等各种 AI,从 0 开始做理论物理研究。没有导师,没有体系。我们用 AI 压缩了文献阅读、推导比对、结构重建,把原本需要十年的积累,在半年里做完。"
"Over the last half year, Auric and I used ChatGPT, Gemini, Grok and others to do theoretical physics research from scratch. No advisor, no system. AI compressed the literature reading, derivation comparison, and structural reconstruction — what would normally take ten years of accumulation, we did in six months."
1stProof:验证我们一直在做的事
1stProof: validating what we've been doing all along
"我们参加 #1stProof,不是为了证明我们有多强。更是在验证我们一直在做的事情:用 AI 协助科研。我们搭建了一套自动化推理 agent 工作流 —— 拿到问题先 plan,把问题拆成结构化子任务,再由 reasoning agent 推下一步推理路径。"
"We're in #1stProof not to prove how strong we are. We're validating what we've been doing all along: AI-assisted research. We built an automated reasoning agent workflow — take a problem, plan first, cut it into structured subtasks, then let a reasoning agent push the next step of the reasoning path."
我们正在充当 MCP 里的"人工搬运工"
We're acting as the "human couriers" inside MCP
"在模型与工具之间贴胶水、搬运上下文。最小闭环:意图 → 假设 → 推理 → 证据 → 反驳 → 收敛。未来生产力 = 意图清晰度 × 推理闭环质量。"
"Pasting glue between models and tools, carrying context by hand. Minimal closed loop: intent → hypothesis → reasoning → evidence → rebuttal → convergence. Future productivity = clarity of intent × quality of the reasoning loop."