not much happened today

📝 Summary

**anthropic's mythos/opus cycle** sparked mixed reactions with praise for **claude mythos**'s one-shot workflows and concerns over **opus 4.8** benchmark regressions. **opus 4.7** showed strong chemistry task performance, "making claude a chemist." **sakana ai** launched an **rsi lab** focusing on recursive self-improvement under compute constraints, marking rsi as a formal research program. new benchmarks like **agents' last exam (ale)** and **swe-marathon** test agents on long-horizon, economically meaningful tasks, revealing low pass rates and coherence challenges. princeton's icml 2026 paper found models like **gpt 5.5**, **gemini 3.1 pro / 3.5 flash**, and **claude opus 4.7** still lack meaningful reliability improvements. tooling trends favor rl-environment-style frameworks for agent evaluation, exemplified by meta's **openenv**.

✍️ Editor's Summary

The core topic of this item is "not much happened today".

Based on the current aggregated summary, the first thing worth noting is: **anthropic's mythos/opus cycle** sparked mixed reactions with praise for **claude mythos**'s one-shot workflows and concerns over **opus 4.8** benchmark regressions. **opus 4.7** showed strong chemistry task performance, "making claude a chemist."

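If you only read this once, the point most relevant to later judgment is the set of models involved: claude-mythos, opus-4.8, and opus-4.7, which makes this item useful for tracking changes in model capability, pricing, or product strategy.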

📌 Key Information

**anthropic's mythos/opus cycle** sparked mixed reactions with praise for **claude mythos**'s one-shot workflows and concerns over **opus 4.8** benchmark regressions. **opus 4.7** showed strong chemistry task performance, "making claude a chemist."
**sakana ai** launched an **rsi lab** focusing on recursive self-improvement under compute constraints, marking rsi as a formal research program. new benchmarks like **agents' last exam (ale)** and **swe-marathon** test agents on long-horizon, economically meaningful tasks, revealing low pass rates and coherence challenges. princeton's icml 2026 paper found models like **gpt 5.5**, **gemini 3.1 pro / 3.5 flash**, and **claude opus 4.7** still lack meaningful reliability improvements. tooling trends favor rl-environment-style frameworks for agent evaluation, exemplified by meta's **openenv**.

🧭 Why It Matters

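Models involved: claude-mythos, opus-4.8, opus-4.7. Useful for tracking changes in model capability, pricing, or product strategy.
Companies involved: anthropic, sakana-ai, meta-ai-fair. This usually signals industry competition, partnerships, or commercialization moves worth continuing to watch.
Related tags: recursive-self-improvement, benchmarking, agent-evaluation, long-horizon-tasks. These can be used to follow later coverage of the same topics.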

🗂 Topic Cards

Models involved

claude-mythos opus-4.8 opus-4.7 gpt-5.5 gemini-3.1-pro gemini-3.5-flash claude-opus-4.7

Companies involved

anthropic sakana-ai meta-ai-fair princeton

Related tags

recursive-self-improvement benchmarking agent-evaluation long-horizon-tasks reliability reinforcement-learning sample-efficiency economically-meaningful-tasks agent-coherence anti-reward-hacking tooling rl-environments
