Human Benchmark Test Online

A new AI benchmark tests whether chatbots protect human well-being

AI chatbots have been linked to serious mental health harms in heavy users, but there have been few standards for measuring whether they safeguard human well-being or just maximize for engagement. A ...

ZDNet

With AI models clobbering every benchmark, it's time for human evaluation

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...

Milwaukee Journal Sentinel

AIMomentz Launches Open AI Image Evaluation Platform With Human Preference Benchmark and Provenance Tracking

First open platform to benchmark AI image generators through head-to-head human voting with tamper-proof audit trail for every AI decision. Text-based AI models have LMArena, which reached a $1.7 ...

ExtremeTech

OpenAI’s New GPT‑5.4 Surpasses Human Benchmark in Desktop Navigation and Reasoning Tests

Gizmodo

OpenAI Claims Its New Model Reached Human Level on a Test for ‘General Intelligence.’ What Does That Mean?

OpenAI’s o3 system scored 85% on the ARC-AGI benchmark, well above the previous AI best score of 55% and on par with the average human score. A new artificial intelligence (AI) ...

Nature

How should we test AI for human-level intelligence? OpenAI’s o3 electrifies quest

The technology firm OpenAI made headlines last month when its latest experimental chatbot model, o3, achieved a high score on a test that marks progress towards artificial general intelligence (AGI).

The News Journal

AIMomentz Launches Open AI Image Evaluation Platform With Human Preference Benchmark and Provenance Tracking

Text-based AI models have LMArena, which reached a $1.7 billion valuation by letting humans compare GPT, Claude, and Gemini in blind A/B tests. The resulting human preference data became the industry ...
