Open-weight models are competing on agent benchmarks like SWE-bench Verified and Terminal-Bench. Here’s what that changes for builders and v
AI Could Beat Every Human Expert Within a Year — Scientists Are Stunned
AI is closing in on expert-level knowledge fast — could it surpass all human experts within a year? The data from Humanity's Last Exam says
Keep CALM: New model design could fix high enterprise AI costs
New Post has been published on https://thedigitalinsider.com/keep-calm-new-model-design-could-fix-high-enterprise-ai-costs/
Enterprise leaders grappling with the steep costs of deploying AI models could find a reprieve thanks to a new architecture design.
While the capabilities of generative AI are attractive, their immense computational demands for both training and inference result in prohibitive expenses and mounting environmental concerns. At the centre of this inefficiency is the models’ “fundamental bottleneck” of an autoregressive process that generates text sequentially, token-by-token.
For enterprises processing vast data streams, from IoT networks to financial markets, this limitation makes generating long-form analysis both slow and economically challenging. However, a new research paper from Tencent AI and Tsinghua University proposes an alternative.
A new approach to AI efficiency
The research introduces Continuous Autoregressive Language Models (CALM). This method re-engineers the generation process to predict a continuous vector rather than a discrete token.
A high-fidelity autoencoder “compress[es] a chunk of K tokens into a single continuous vector,” which holds a much higher semantic bandwidth.
Instead of processing something like “the”, “cat”, “sat” in three steps, the model compresses them into one. This design directly “reduces the number of generative steps,” attacking the computational load.
The experimental results show a better performance-compute trade-off. A CALM model that groups four tokens per step delivered performance “comparable to strong discrete baselines, but at a significantly lower computational cost”.
One CALM model, for instance, required 44 percent fewer training FLOPs and 34 percent fewer inference FLOPs than a baseline Transformer of similar capability. This points to a saving on both the initial capital expense of training and the recurring operational expense of inference.
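The arithmetic behind the step reduction can be sketched in a few lines. This is an illustration only: the function names and the 1,000-token example are hypothetical, and the 44 and 34 percent figures are the ones reported above. Fewer generative steps does not translate one-to-one into fewer FLOPs, which is why the reported savings are smaller than the fourfold drop in steps.

```python
import math


def generative_steps(num_tokens: int, chunk_size: int) -> int:
    """Autoregressive steps needed when each step emits chunk_size tokens."""
    if num_tokens < 0 or chunk_size < 1:
        raise ValueError("num_tokens must be >= 0 and chunk_size >= 1")
    return math.ceil(num_tokens / chunk_size)


def remaining_fraction(percent_saved: float) -> float:
    """Fraction of the baseline cost left after a percentage saving."""
    return 1.0 - percent_saved / 100.0


# Hypothetical 1,000-token output: token-by-token vs. chunks of K = 4.
baseline_steps = generative_steps(1000, chunk_size=1)  # 1000
calm_steps = generative_steps(1000, chunk_size=4)      # 250

# Reported savings from the article, expressed as remaining cost.
training_cost = remaining_fraction(44)   # about 0.56 of baseline training FLOPs
inference_cost = remaining_fraction(34)  # about 0.66 of baseline inference FLOPs

print(baseline_steps, calm_steps, round(training_cost, 2), round(inference_cost, 2))
```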
Rebuilding the toolkit for the continuous domain
Moving from a finite, discrete vocabulary to an infinite, continuous vector space breaks the standard LLM toolkit. The researchers had to develop a “comprehensive likelihood-free framework” to make the new model viable.
For training, the model cannot use a standard softmax layer or maximum likelihood estimation. To solve this, the team used a “likelihood-free” objective with an Energy Transformer, which rewards the model for accurate predictions without computing explicit probabilities.
This new training method also required a new evaluation metric. Standard metrics like perplexity are inapplicable because they rely on the same likelihoods the model no longer computes.
The team proposed BrierLM, a novel metric based on the Brier score that can be estimated purely from model samples. Validation confirmed BrierLM as a reliable alternative, showing a “Spearman’s rank correlation of -0.991” with traditional loss metrics.
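To make "estimated purely from model samples" concrete, here is a minimal sketch of a sample-based Brier-style estimate for a single prediction. The estimator form and the "higher is better" sign convention are assumptions for illustration, and the paper's full BrierLM construction may differ. The idea is that two independent samples match the target with probability P(y) each and match each other with probability equal to the sum of squared probabilities, so no explicit probabilities are needed.

```python
import random
from typing import Callable, Dict, Hashable


def brier_exact(probs: Dict[Hashable, float], target: Hashable) -> float:
    """Reference value 2*P(target) - sum_x P(x)^2 (higher is better, max 1)."""
    return 2.0 * probs.get(target, 0.0) - sum(p * p for p in probs.values())


def brier_from_samples(sample: Callable[[], Hashable], target: Hashable,
                       num_pairs: int) -> float:
    """Estimate the same quantity using only draws from the model."""
    if num_pairs < 1:
        raise ValueError("num_pairs must be >= 1")
    total = 0
    for _ in range(num_pairs):
        x1, x2 = sample(), sample()
        # Unbiased per-pair term: 1[x1 = y] + 1[x2 = y] - 1[x1 = x2].
        total += int(x1 == target) + int(x2 == target) - int(x1 == x2)
    return total / num_pairs


# Toy "model": a fixed distribution we can sample from (hypothetical data).
toy_probs = {"cat": 0.7, "dog": 0.2, "bird": 0.1}
rng = random.Random(0)


def toy_sampler() -> str:
    return rng.choices(list(toy_probs), weights=list(toy_probs.values()))[0]


exact = brier_exact(toy_probs, "cat")
estimate = brier_from_samples(toy_sampler, "cat", num_pairs=20000)
print(round(exact, 3), round(estimate, 3))
```

The estimate converges to the exact value as the number of sample pairs grows, which is what lets a metric of this kind be computed for a model that only exposes a sampler.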
Finally, the framework restores controlled generation, a key feature for enterprise use. Standard temperature sampling is impossible without a probability distribution. The paper introduces a new “likelihood-free sampling algorithm,” including a practical batch approximation method, to manage the trade-off between output accuracy and diversity.
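One way to see how sampling can be sharpened without a probability distribution is rejection sampling: draw n samples and accept only if they all agree. The accepted value then follows a distribution proportional to P(x)^n, which corresponds to temperature T = 1/n. The sketch below illustrates that general idea under the assumption of an integer n; it is not the paper's algorithm, and it does not reproduce the batch approximation mentioned above. All names are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Hashable


def sample_sharpened(sample: Callable[[], Hashable], n: int,
                     max_tries: int = 100_000) -> Hashable:
    """Draw from a distribution proportional to P(x)**n using only sample()."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for _ in range(max_tries):
        first = sample()
        # Accept only when all n independent draws agree.
        if all(sample() == first for _ in range(n - 1)):
            return first
    raise RuntimeError("no agreement within max_tries; n may be too large")


# Toy sampler standing in for a model (hypothetical distribution).
rng = random.Random(0)


def toy_sampler() -> str:
    return rng.choices(["cat", "dog"], weights=[0.6, 0.4])[0]


num_draws = 5000
counts = Counter(sample_sharpened(toy_sampler, n=2) for _ in range(num_draws))
cat_share = counts["cat"] / num_draws
# With n = 2 the target share for "cat" is 0.36 / (0.36 + 0.16), about 0.692,
# up from 0.6 under plain sampling.
print(round(cat_share, 3))
```

The cost is wasted draws: the acceptance rate falls quickly as n grows or as the distribution flattens, which is presumably why a practical approximation is needed.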
Reducing enterprise AI costs
This research offers a glimpse into a future where generative AI is not defined purely by ever-larger parameter counts, but by architectural efficiency.
The current path of scaling models is hitting a wall of diminishing returns and escalating costs. The CALM framework establishes a “new design axis for LLM scaling: increasing the semantic bandwidth of each generative step”.
While this is a research framework and not an off-the-shelf product, it points to a powerful and scalable pathway towards ultra-efficient language models. When evaluating vendor roadmaps, tech leaders should look beyond model size and begin asking about architectural efficiency.
The ability to reduce FLOPs per generated token will become a defining competitive advantage, enabling AI to be deployed more economically and sustainably across the enterprise, from the data centre to data-heavy edge applications.
Navigating the Next Frontier: Open-Weight Models, Benchmark Realities, and the Future of Trustworthy AI
By Dr. Saurabh Katiyar, Ph.D.
Technology & Platform Leader | AI-First Ecosystems & Digital Transformation
Published on SaurabhNotes.com
Introduction
Recent headlines have claimed that Z.ai (GLM-4.x) has become the world’s No. 1 open-weight AI model, based on LM Arena rankings. Such claims make for compelling news, but as scientists and technologists, we must separate signal from noise. In this…
The Sequence Knowledge # 555: Not All Benchmark are that Simple: An Intro to Multiturn Benchmarks
New Post has been published on https://thedigitalinsider.com/the-sequence-knowledge-555-not-all-benchmark-are-that-simple-an-intro-to-multiturn-benchmarks/
A review of one of the most promising areas of AI evaluations.
Today we will discuss:
An intro to multiturn benchmarks.
A review of MT-Bench, a benchmark for open ended conversations.
💡 AI Concept of the Day: Learning About Multiturn AI Benchmarks
Multi-turn benchmarks represent a critical evolution in the evaluation of language models, particularly as LLMs transition from static prompt completion engines to interactive agents capable of sustained dialogue and reasoning. Unlike single-turn tasks, which assess performance in isolation, multi-turn benchmarks simulate dynamic, evolving contexts that require models to maintain coherence, track goals, and adapt their responses over extended interactions. This shift aligns more closely with real-world deployment scenarios, where users expect LLMs to function not just as oracles but as collaborators.
At the heart of multi-turn evaluation lies the challenge of contextual consistency. Models must not only remember prior turns but also reconcile conflicting information, resolve ambiguities, and revise earlier statements when presented with new evidence. This is non-trivial. Standard instruction tuning and next-token prediction objectives often fall short in encouraging persistent internal state representations or memory management strategies, both of which are essential for effective multi-turn performance.
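A multi-turn evaluation can be pictured as a loop that feeds the growing conversation history back to the model and checks each reply. The sketch below is a toy harness under stated assumptions: the `run_multiturn` function, the message format, and both toy models are hypothetical and do not correspond to any particular benchmark's API.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]                  # (role, text)
Turn = Tuple[str, Callable[[str], bool]]   # (user message, check on the reply)
Model = Callable[[List[Message]], str]     # maps full history to a reply


def run_multiturn(model: Model, turns: List[Turn]) -> List[bool]:
    """Play the turns in order and record whether each reply passes its check."""
    history: List[Message] = []
    results: List[bool] = []
    for user_text, check in turns:
        history.append(("user", user_text))
        reply = model(history)
        history.append(("assistant", reply))
        results.append(check(reply))
    return results


def remembering_model(history: List[Message]) -> str:
    """Toy model that searches earlier turns for the user's name."""
    last = history[-1][1]
    if last.lower().startswith("what is my name"):
        for role, text in history:
            if role == "user" and text.lower().startswith("my name is "):
                return "Your name is " + text[len("my name is "):].rstrip(".")
        return "I don't know."
    return "Noted."


def forgetful_model(history: List[Message]) -> str:
    """Toy model that only looks at the latest message."""
    last = history[-1][1]
    return "I don't know." if last.lower().startswith("what is my name") else "Noted."


turns: List[Turn] = [
    ("My name is Ada.", lambda reply: True),
    ("What is my name?", lambda reply: "Ada" in reply),
]

print(run_multiturn(remembering_model, turns))  # [True, True]
print(run_multiturn(forgetful_model, turns))    # [True, False]
```

Both toy models would look identical if each question were scored in isolation; only the second turn, which depends on the first, separates them.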
Claude 4 AI Models Launched With Best-in-Class Coding Power
Introduction
Anthropic has raised the bar in the generative AI landscape with the release of its Claude 4 series, including Claude Opus 4 and Claude Sonnet 4. Designed with a sharp focus on coding performance, tool integration, and reasoning depth, these large language models (LLMs) are setting new standards. The release was announced at Anthropic’s inaugural developer conference, and it’s…
Beyond Benchmarks: Why AI Evaluation Needs a Reality Check
New Post has been published on https://thedigitalinsider.com/beyond-benchmarks-why-ai-evaluation-needs-a-reality-check/
If you have been following AI lately, you have likely seen headlines about AI models setting new benchmark records. From ImageNet image recognition to superhuman scores in translation and medical image diagnostics, benchmarks have long been the gold standard for measuring AI performance. However, as impressive as these numbers may be, they don’t always capture the complexity of real-world applications. A model that performs flawlessly on a benchmark can still fall short when put to the test in real-world environments. In this article, we will delve into why traditional benchmarks fall short of capturing the true value of AI, and explore alternative evaluation methods that better reflect the dynamic, ethical, and practical challenges of deploying AI in the real world.
The Appeal of Benchmarks
For years, benchmarks have been the foundation of AI evaluation. They offer static datasets designed to measure specific tasks like object recognition or machine translation. ImageNet, for instance, is a widely used benchmark for testing object classification, while BLEU and ROUGE score the quality of machine-generated text by comparing it to human-written reference texts. These standardized tests allow researchers to compare progress and create healthy competition in the field. They have also driven major advances: the ImageNet competition, for example, helped trigger the deep learning revolution by showcasing large accuracy improvements.
However, benchmarks often simplify reality. Because AI models are typically tuned to improve on a single well-defined task under fixed conditions, they can over-optimize for it. To achieve high scores, models may rely on dataset patterns that don’t hold beyond the benchmark. A famous example is a vision model trained to distinguish wolves from huskies. Instead of learning distinguishing animal features, the model relied on the presence of snowy backgrounds commonly associated with wolves in the training data. As a result, when the model was presented with a husky in the snow, it confidently mislabeled it as a wolf. This showcases how overfitting to a benchmark can lead to faulty models. As Goodhart’s Law states, “When a measure becomes a target, it ceases to be a good measure.” Thus, when benchmark scores become the target, AI models illustrate Goodhart’s Law: they produce impressive scores on leaderboards but struggle with real-world challenges.
Human Expectations vs. Metric Scores
One of the biggest limitations of benchmarks is that they often fail to capture what truly matters to humans. Consider machine translation. A model may score well on the BLEU metric, which measures the overlap between machine-generated translations and reference translations. While the metric can gauge how plausible a translation is in terms of word-level overlap, it doesn’t account for fluency or meaning. A translation could score poorly despite being more natural or even more accurate, simply because it used different wording from the reference. Human users, however, care about the meaning and fluency of translations, not just the exact match with a reference. The same issue applies to text summarization: a high ROUGE score doesn’t guarantee that a summary is coherent or captures the key points that a human reader would expect.
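The overlap problem is easy to reproduce. The sketch below computes clipped n-gram precision, which is only one ingredient of BLEU (the full metric also combines several n-gram orders and applies a brevity penalty), so treat it as a simplified illustration. The sentences are invented for the example.

```python
from collections import Counter
from typing import Tuple


def clipped_ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference (the "clipping" used by BLEU-style metrics).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams: Counter[Tuple[str, ...]] = Counter(
        tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams: Counter[Tuple[str, ...]] = Counter(
        tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


reference = "the cat is on the mat"
literal = "the cat is on the mat"
paraphrase = "a feline rests upon the rug"

print(clipped_ngram_precision(literal, reference))               # 1.0
print(round(clipped_ngram_precision(paraphrase, reference), 3))  # 0.167
```

The paraphrase conveys the same meaning but shares only the word "the" with the reference, so an overlap-based score treats it as a poor translation.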
For generative AI models, the issue becomes even more challenging. For instance, large language models (LLMs) are typically evaluated on the MMLU benchmark to test their ability to answer questions across multiple domains. While the benchmark helps measure how well LLMs answer questions, it does not guarantee reliability. These models can still “hallucinate,” presenting false yet plausible-sounding facts. This gap is not easily detected by benchmarks that focus on correct answers without assessing truthfulness, context, or coherence. In one well-publicized case, an AI assistant used to draft a legal brief cited entirely bogus court cases. The output looked convincing on paper but failed basic human expectations for truthfulness.
Challenges of Static Benchmarks in Dynamic Contexts
Adapting to Changing Environments
Static benchmarks evaluate AI performance under controlled conditions, but real-world scenarios are unpredictable. For instance, a conversational AI might excel on scripted, single-turn questions in a benchmark, but struggle in a multi-step dialogue that includes follow-ups, slang, or typos. Similarly, self-driving cars often perform well in object detection tests under ideal conditions but fail in unusual circumstances, such as poor lighting, adverse weather, or unexpected obstacles. For example, a stop sign altered with stickers can confuse a car’s vision system, leading to misinterpretation. These examples highlight that static benchmarks do not reliably measure real-world complexities.
Ethical and Social Considerations
Traditional benchmarks often fail to assess AI’s ethical performance. An image recognition model might achieve high accuracy but misidentify individuals from certain ethnic groups due to biased training data. Likewise, language models can score well on grammar and fluency while producing biased or harmful content. These issues, which are not reflected in benchmark metrics, have significant consequences in real-world applications.
Inability to Capture Nuanced Aspects
Benchmarks are great at checking surface-level skills, like whether a model can generate grammatically correct text or a realistic image. But they often struggle with deeper qualities, like common sense reasoning or contextual appropriateness. For example, a model might excel at a benchmark by producing a perfect sentence, but if that sentence is factually incorrect, it’s useless. AI needs to understand when and how to say something, not just what to say. Benchmarks rarely test this level of intelligence, which is critical for applications like chatbots or content creation.
Contextual Adaptation
AI models often struggle to adapt to new contexts, especially when faced with data outside their training set. Benchmarks are usually designed with data similar to what the model was trained on. This means they don’t fully test how well a model can handle novel or unexpected input, a critical requirement in real-world applications. For example, a chatbot might perform well on benchmarked questions but struggle when users stray from them, for instance by using slang or asking about niche topics.
Reasoning and Inference
While benchmarks can measure pattern recognition or content generation, they often fall short on higher-level reasoning and inference. AI needs to do more than mimic patterns. It should understand implications, make logical connections, and infer new information. For instance, a model might generate a factually correct response but fail to connect it logically to a broader conversation. Current benchmarks may not fully capture these advanced cognitive skills, leaving us with an incomplete view of AI capabilities.
Beyond Benchmarks: A New Approach to AI Evaluation
To bridge the gap between benchmark performance and real-world success, a new approach to AI evaluation is emerging. Here are some strategies gaining traction:
Human-in-the-Loop Feedback: Instead of relying solely on automated metrics, involve human evaluators in the process. This could mean having experts or end-users assess the AI’s outputs for quality, usefulness, and appropriateness. Humans can better assess aspects like tone, relevance, and ethical consideration in comparison to benchmarks.
Real-World Deployment Testing: AI systems should be tested in environments as close to real-world conditions as possible. For instance, self-driving cars could undergo trials on simulated roads with unpredictable traffic scenarios, while chatbots could be deployed in live environments to handle diverse conversations. This ensures that models are evaluated in the conditions they will actually face.
Robustness and Stress Testing: It’s crucial to test AI systems under unusual or adversarial conditions. This could involve testing an image recognition model with distorted or noisy images or evaluating a language model with long, complicated dialogues. By understanding how AI behaves under stress, we can better prepare it for real-world challenges.
Multidimensional Evaluation Metrics: Instead of relying on a single benchmark score, evaluate AI across a range of metrics, including accuracy, fairness, robustness, and ethical considerations. This holistic approach provides a more comprehensive understanding of an AI model’s strengths and weaknesses.
Domain-Specific Tests: Evaluation should be customized to the specific domain in which the AI will be deployed. Medical AI, for instance, should be tested on case studies designed by medical professionals, while an AI for financial markets should be evaluated for its stability during economic fluctuations.
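As a small illustration of the robustness-testing and multidimensional-metrics ideas above, the sketch below reports several numbers for one model instead of a single score. Everything here is hypothetical: the keyword classifier, the four examples, and the upper-casing perturbation are toy stand-ins, and a real scorecard would add dimensions such as fairness across user groups.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the model's label matches the expected one."""
    if not examples:
        raise ValueError("examples must not be empty")
    return sum(model(text) == label for text, label in examples) / len(examples)


def scorecard(model: Callable[[str], str], examples: List[Example],
              perturb: Callable[[str], str]) -> Dict[str, float]:
    """Report clean accuracy, accuracy on perturbed inputs, and the gap."""
    clean = accuracy(model, examples)
    perturbed = accuracy(model, [(perturb(text), label) for text, label in examples])
    return {
        "clean_accuracy": clean,
        "perturbed_accuracy": perturbed,
        "robustness_gap": clean - perturbed,
    }


def keyword_model(text: str) -> str:
    """Toy sentiment classifier that is (deliberately) case-sensitive."""
    return "positive" if "good" in text else "negative"


examples: List[Example] = [
    ("good service", "positive"),
    ("bad food", "negative"),
    ("really good", "positive"),
    ("not great", "negative"),
]

report = scorecard(keyword_model, examples, perturb=str.upper)
print(report)
# {'clean_accuracy': 1.0, 'perturbed_accuracy': 0.5, 'robustness_gap': 0.5}
```

A leaderboard that showed only the clean accuracy would rank this model as perfect; the perturbed score exposes the weakness.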
The Bottom Line
While benchmarks have advanced AI research, they fall short in capturing real-world performance. As AI moves from labs to practical applications, AI evaluation should be human-centered and holistic. Testing in real-world conditions, incorporating human feedback, and prioritizing fairness and robustness are critical. The goal is not to top leaderboards but to develop AI that is reliable, adaptable, and valuable in the dynamic, complex world.
The Sequence Knowledge #527: Let's Learn About Math Benchmarks
New Post has been published on https://thedigitalinsider.com/the-sequence-knowledge-527-lets-learn-about-math-benchmarks/
What are the benchmarks that push the boundaries of foundation models in mathematical reasoning?
Today we will discuss:
An introduction to math benchmarks.
A review of Frontier Math, one of the most challenging math benchmarks ever built.
💡 AI Concept of the Day: An Intro to Math Benchmarks
In today’s installment of our series about AI benchmarks, we are going to discuss one of the most fascinating areas of evaluation. Mathematical reasoning has rapidly emerged as one of the key vectors for evaluating foundation models, prompting the development of sophisticated benchmarks to evaluate AI systems’ capabilities. These benchmarks serve as crucial tools for measuring progress and identifying areas for improvement in AI’s mathematical prowess, pushing the boundaries of what machines can achieve in complex problem-solving scenarios.
One of the most notable benchmarks is the MATH (Mathematics Aptitude Test of Heuristics) dataset, which presents a diverse array of complex mathematical problems ranging from basic arithmetic to advanced calculus and algebra. This benchmark is designed to assess AI models in zero-shot and few-shot settings, providing a comprehensive evaluation of their mathematical understanding and problem-solving abilities. The MATH benchmark has become increasingly saturated for state-of-the-art models, with leading systems achieving impressive accuracy rates.