not much happened today
📝 Summary
**anthropic's mythos/opus cycle** sparked mixed reactions with praise for **claude mythos**'s one-shot workflows and concerns over **opus 4.8** benchmark regressions. **opus 4.7** showed strong chemistry task performance, "making claude a chemist." **sakana ai** launched an **rsi lab** focusing on recursive self-improvement under compute constraints, marking rsi as a formal research program. new benchmarks like **agents' last exam (ale)** and **swe-marathon** test agents on long-horizon, economically meaningful tasks, revealing low pass rates and coherence challenges. princeton's icml 2026 paper found models like **gpt 5.5**, **gemini 3.1 pro / 3.5 flash**, and **claude opus 4.7** still lack meaningful reliability improvements. tooling trends favor rl-environment-style frameworks for agent evaluation, exemplified by meta's **openenv**.
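✍️ Editor's Summary
The core topic of this item is "not much happened today".
Based on the current aggregated summary, the first thing to note is: **anthropic's mythos/opus cycle** sparked mixed reactions with praise for **claude mythos**'s one-shot workflows and concerns over **opus 4.8** benchmark regressions. **opus 4.7** showed strong chemistry task performance, "making claude a chemist."
If you read this only once, the point most relevant to later judgment is the set of models involved: claude-mythos, opus-4.8, and opus-4.7, which makes this item useful for tracking changes in model capability, pricing, or product strategy.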
📌 Key Information
- **anthropic's mythos/opus cycle** sparked mixed reactions with praise for **claude mythos**'s one-shot workflows and concerns over **opus 4.8** benchmark regressions. **opus 4.7** showed strong chemistry task performance, "making claude a chemist."
- **sakana ai** launched an **rsi lab** focusing on recursive self-improvement under compute constraints, marking rsi as a formal research program.
- new benchmarks like **agents' last exam (ale)** and **swe-marathon** test agents on long-horizon, economically meaningful tasks, revealing low pass rates and coherence challenges.
- princeton's icml 2026 paper found models like **gpt 5.5**, **gemini 3.1 pro / 3.5 flash**, and **claude opus 4.7** still lack meaningful reliability improvements.
- tooling trends favor rl-environment-style frameworks for agent evaluation, exemplified by meta's **openenv**.
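🧭 Why It Matters
- Models involved: claude-mythos, opus-4.8, opus-4.7. Useful for tracking changes in model capability, pricing, or product strategy.
- Companies involved: anthropic, sakana-ai, meta-ai-fair. This usually signals industry competition, partnerships, or commercialization moves worth continued attention.
- Related tags: recursive-self-improvement, benchmarking, agent-evaluation, long-horizon-tasks. These can be used to follow later coverage of the same topics.
🗂 Topic Cards
Models involved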
claude-mythos
opus-4.8
opus-4.7
gpt-5.5
gemini-3.1-pro
gemini-3.5-flash
claude-opus-4.7
Companies involved
anthropic
sakana-ai
meta-ai-fair
princeton
Related tags
recursive-self-improvement
benchmarking
agent-evaluation
long-horizon-tasks
reliability
reinforcement-learning
sample-efficiency
economically-meaningful-tasks
agent-coherence
anti-reward-hacking
tooling
rl-environments