Tencent improves testing of AI models with new benchmark

EmmettRog LV1
Posted: 2025-08-11
Getting it right, like a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, ranging from building data visualisations and web apps to making interactive mini-games.

Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
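As a rough illustration of the "build and run in isolation" step, here is a minimal Python sketch. It is not ArtifactsBench's actual harness: the function name `run_generated_code`, the result dictionary, and the choice of a Python script as the artifact are all assumptions made for the example. A child process with a timeout and a throwaway working directory is also much weaker than a real sandbox, which would normally add container or VM isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 5.0) -> dict:
    """Run untrusted Python source in a child process inside a scratch directory.

    Hypothetical helper for illustration only; this limits run time and keeps
    files out of the caller's directory, but it is NOT a security boundary.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "artifact.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs the interpreter in isolated mode (no user site
                # packages, no PYTHON* environment variables).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # The child is killed by subprocess.run before this is raised.
            return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

Calling `run_generated_code("print('hello')")` returns a dictionary whose `status` is `"ok"`, while an endless loop is cut off after `timeout_s` seconds and reported as `"timeout"`.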

To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
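The idea of sampling screenshots over time can be sketched as follows. This is a simplified illustration, not the benchmark's capture code: `capture` stands in for whatever actually takes a screenshot (for example a headless browser call), and `capture_over_time` and `change_points` are hypothetical names. Comparing raw bytes is a crude stand-in for real image comparison.

```python
import time
from typing import Callable, List, Sequence, Tuple

Frame = Tuple[float, bytes]  # (time offset in seconds, screenshot bytes)


def capture_over_time(
    capture: Callable[[], bytes],
    offsets_s: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> List[Frame]:
    """Call `capture` at each requested offset and return the timed frames."""
    frames: List[Frame] = []
    elapsed = 0.0
    for offset in sorted(offsets_s):
        # Wait only for the gap between the previous offset and this one.
        sleep(max(0.0, offset - elapsed))
        elapsed = offset
        frames.append((offset, capture()))
    return frames


def change_points(frames: Sequence[Frame]) -> List[float]:
    """Return the offsets at which a frame differs from the one before it."""
    return [
        current[0]
        for previous, current in zip(frames, frames[1:])
        if current[1] != previous[1]
    ]
```

A frame sequence that never changes yields an empty list from `change_points`, which would be a hint that an expected animation or click response did not happen.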

Finally, it hands over all this evidence (the original request, the AI’s code, and the screenshots) to a Multimodal LLM (MLLM), which acts as a judge.

This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
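To make the checklist idea concrete, here is a small Python sketch of how per-item checklist scores could be rolled up into per-metric and overall scores. The 0 to 10 scale, the plain averaging, and the function name `score_artifact` are assumptions for illustration; the post does not say how ArtifactsBench combines its ten metrics, and only three of them are named.

```python
from statistics import mean
from typing import Dict, List


def score_artifact(
    checklist_scores: Dict[str, List[float]],
    max_score: float = 10.0,
) -> Dict[str, float]:
    """Average checklist item scores per metric, then average the metrics.

    `checklist_scores` maps a metric name to the scores the judge gave each
    checklist item under that metric (assumed scale: 0 to `max_score`).
    """
    per_metric: Dict[str, float] = {}
    for metric, item_scores in checklist_scores.items():
        if not item_scores:
            raise ValueError(f"no checklist items scored for {metric!r}")
        for score in item_scores:
            if not 0.0 <= score <= max_score:
                raise ValueError(f"score {score} out of range for {metric!r}")
        per_metric[metric] = mean(item_scores)

    # Compute the overall score before adding it to the result dictionary.
    overall = mean(list(per_metric.values()))
    return {**per_metric, "overall": overall}
```

For example, a judge that returns two checklist scores for functionality and one each for user experience and aesthetic quality would be passed in as `{"functionality": [8.0, 6.0], "user_experience": [9.0], "aesthetic_quality": [5.0]}`.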

The big question is, does this automated judge actually have good taste? The results suggest it does.

When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
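The post does not say how "consistency" between two rankings is computed. One common way to compare rankings, shown here purely as an assumed example, is pairwise agreement: the fraction of model pairs that both rankings put in the same order. The function name `pairwise_consistency` and the model names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_consistency(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> float:
    """Fraction of item pairs ordered the same way in both rankings.

    Each ranking lists items from best to worst. Only items present in both
    rankings are compared.
    """
    pos_a = {item: index for index, item in enumerate(ranking_a)}
    pos_b = {item: index for index, item in enumerate(ranking_b)}
    shared = [item for item in ranking_a if item in pos_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two shared items to compare rankings")
    agreements = sum(
        1
        for x, y in pairs
        if (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
    )
    return agreements / len(pairs)


# Hypothetical rankings: only the middle two models swap places.
benchmark_ranking = ["model_1", "model_2", "model_3", "model_4"]
human_ranking = ["model_1", "model_3", "model_2", "model_4"]
print(pairwise_consistency(benchmark_ranking, human_ranking))
```

With four models there are six pairs; the two rankings above disagree on one of them, so the agreement is five out of six.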

On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
https://www.artificialintelligence-news.com/