Evaluating Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it, raising questions about eval integrity in web-enabled environments. BrowseComp is an evaluation designed to test how well models can find hard-to-locate information on the web. Like many benchmarks, it is vulnerable to contamination: answers leak onto the public web through
23 tasks | 11 languages | 14 models | 14% pass rate | Updated 16 Jan 2026 | QuesmaOrg/otel-bench Distributed tracing requires stitching together distinct user journeys across complex microservices, rather than just writing isolated functions. We tested whether top models can successfully instrument applications with OpenTelemetry to see if they are actually ready to handle real-world Site Reliabil
I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed. In fact, only the edit tool changed. That's it. 0x0: The Wrong Question. The conversation right now is almost entirely about which model is best at coding: GPT-5.3 or Opus, Gemini vs whatever dropped this week. This framing is increasingly misleading because it treats the model as the only variable that matters, when in realit
Infrastructure configuration can swing agentic coding benchmarks by several percentage points, sometimes more than the leaderboard gap between top models. Agentic coding benchmarks like SWE-bench and Terminal-Bench are commonly used to compare the software engineering capabilities of frontier models, with top spots on leaderboards often separated by just a few percentage points. These scores are oft
Trinity-Large-Base is a true frontier-class foundation model. We match and exceed our peers in open-base models across a wide range of benchmarks, including math, coding, scientific reasoning, and raw knowledge absorption. Inference efficiency: We trained on 2048 Nvidia B300 GPUs. As far as we can tell, it's the largest (publicly stated, at least) pretraining run done on these machines. That means t
TL;DR: Vortex is a new columnar file format with a very promising design. SpiralDB and DuckDB Labs have partnered to give you a very fast experience while reading and writing Vortex files! I think it is worth starting this intro by talking a little bit about the established format for columnar data. Parquet has done some amazing things for analytics. If you go back to the times where CSV was the b
Update: I've since added multithreading and pushed astroz to 326M propagations/sec. Read the follow-up → I've spent the past month optimizing SGP4 propagation and ended up with something interesting: astroz is now the fastest general-purpose SGP4 implementation I'm aware of, hitting 11-13M propagations per second in native Zig and ~7M/s through Python with just pip install astroz. This post breaks
Hello, this is Nakanishi, a software engineer at PKSHA Technology. 1. Introduction: When using MySQL from Go, the usual setup combines the standard library's database/sql with the go-sql-driver/mysql driver, and this configuration has become the de facto standard.[1] Security also matters: in particular, to prevent SQL injection, using Prepared Statements with the ? placeholder is considered best practice.[2] Even with this standard, safe configuration, however, there are cases where performance degrades noticeably. This article sorts out the two Prepared Statement modes, shows, together with benchmark results, why performance drops under Go's default settings, and then covers the solution and the trade-offs of adopting it
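As a hedged sketch of the kind of knob this teaser is pointing at (an assumption about the article's direction, not a quote from it): go-sql-driver/mysql supports an `interpolateParams=true` DSN parameter that makes the driver escape and interpolate `?` placeholder arguments client-side, so a `db.Query` no longer costs a server-side prepare + execute round-trip pair. `buildDSN` below is a hypothetical helper for illustration.

```go
package main

import "fmt"

// buildDSN assembles a go-sql-driver/mysql DSN with
// interpolateParams=true. With that parameter, the driver interpolates
// escaped placeholder arguments into the SQL text on the client, so
// each query with `?` placeholders takes one network round trip
// instead of the prepare/execute pair of a server-side prepared
// statement. buildDSN itself is a hypothetical helper, not part of
// the driver's API.
func buildDSN(user, pass, addr, dbname string) string {
	return fmt.Sprintf("%s:%s@tcp(%s)/%s?interpolateParams=true",
		user, pass, addr, dbname)
}

func main() {
	fmt.Println(buildDSN("app", "secret", "127.0.0.1:3306", "shop"))
	// Typical use with database/sql (driver import omitted here):
	//   db, err := sql.Open("mysql", buildDSN(...))
	//   rows, err := db.Query("SELECT id FROM users WHERE name = ?", name)
}
```

Note the trade-off: the driver still escapes arguments, but it refuses client-side interpolation under some connection charsets, so treat this as a sketch of the option, not a blanket recommendation.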
Introduction: This article is the final, day-25 entry in the Go Advent Calendar 2025. Thanks for another year of hard work, everyone. Go 1.26 will be released very soon. While this release isn't flashy, I think it will turn out to be a very big one for everyone using Go. In one short sentence: programs get lighter and faster. Two big performance-related changes: Green Tea GC. The Green Tea GC, available experimentally up through Go 1.25, is enabled by default in 1.26. Go has incorporated a number of GC implementations over the years, and performance has been improved through fine-grained tuning of each. Until now, the GC traced pointers to objects on the heap and marked them individually; as a result, memory access became effectively random
Have you ever watched a long-running migration script, wondering if it's about to wreck your data? Or wished you could "just" spin up a fresh copy of the database for each test run? Or wanted reproducible snapshots to reset between runs of your test suite (and yes, because you are reading boringSQL, needed to reset the learning environment)? When your database is a few megabytes, pg_dump and restore wo
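On the "fresh copy per test run" idea the teaser raises, one standard Postgres technique (an assumption about where the article may be heading, not a quote from it) is cloning from a template: `CREATE DATABASE ... TEMPLATE` copies the template's files directly instead of replaying a dump. A minimal sketch, with `freshDBSQL` as a hypothetical helper:

```go
package main

import "fmt"

// freshDBSQL builds the SQL for cloning a Postgres database from a
// template. CREATE DATABASE ... TEMPLATE copies the template's files
// directly, which is much faster than pg_dump | pg_restore for
// small-to-medium databases; the template must have no active
// connections while the copy runs. freshDBSQL is a hypothetical
// helper for illustration.
func freshDBSQL(name, template string) string {
	// %q double-quotes the identifiers, which is valid SQL quoting.
	return fmt.Sprintf(`CREATE DATABASE %q TEMPLATE %q`, name, template)
}

func main() {
	// One throwaway database per test run:
	fmt.Println(freshDBSQL("app_test_42", "app_template"))
	// Execute it via database/sql against a maintenance connection, e.g.:
	//   _, err := db.Exec(freshDBSQL(testDB, "app_template"))
}
```

For databases in the gigabytes, filesystem-level snapshots are the usual next step, which is presumably where the article goes.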