benchmark Archives - AI News
https://www.artificialintelligence-news.com/news/tag/benchmark/

ARC Prize launches its toughest AI benchmark yet: ARC-AGI-2
https://www.artificialintelligence-news.com/news/arc-prize-launches-toughest-ai-benchmark-yet-arc-agi-2/
Tue, 25 Mar 2025 16:43:12 +0000

ARC Prize has launched its toughest benchmark yet, ARC-AGI-2, alongside the announcement of its 2025 competition with $1 million in prizes.

As AI progresses from performing narrow tasks to demonstrating general, adaptive intelligence, the ARC-AGI-2 challenges aim to uncover capability gaps and actively guide innovation.

“Good AGI benchmarks act as useful progress indicators. Better AGI benchmarks clearly discern capabilities. The best AGI benchmarks do all this and actively inspire research and guide innovation,” the ARC Prize team states.

ARC-AGI-2 sets out to land in that “best” category.

Beyond memorisation

Since the original ARC benchmark’s introduction in 2019, ARC Prize has served as a “North Star” for researchers striving toward AGI by creating enduring benchmarks.

Benchmarks like ARC-AGI-1 leaned into measuring fluid intelligence (i.e., the ability to adapt to new, unseen tasks), a clear departure from datasets that reward memorisation alone.

ARC Prize’s mission is also forward-thinking, aiming to accelerate timelines for scientific breakthroughs. Its benchmarks are designed not just to measure progress but to inspire new ideas.

Researchers observed a critical shift with the debut of OpenAI’s o3 in late 2024, evaluated using ARC-AGI-1. Combining deep learning-based large language models (LLMs) with reasoning synthesis engines, o3 marked a breakthrough where AI transitioned beyond rote memorisation.

Yet, despite progress, systems like o3 remain inefficient and require significant human oversight during training processes. To challenge these systems for true adaptability and efficiency, ARC Prize introduced ARC-AGI-2.

ARC-AGI-2: Closing the human-machine gap

The ARC-AGI-2 benchmark is tougher for AI yet retains its accessibility for humans. While frontier AI reasoning systems continue to score in single-digit percentages on ARC-AGI-2, humans can solve every task in under two attempts.

So, what sets ARC-AGI apart? Its design philosophy chooses tasks that are “relatively easy for humans, yet hard, or impossible, for AI.”

The benchmark includes datasets with varying visibility, and its tasks probe the following characteristics (a minimal scoring sketch follows the list):

  • Symbolic interpretation: AI struggles to assign semantic significance to symbols, instead focusing on shallow comparisons like symmetry checks.
  • Compositional reasoning: AI falters when it needs to apply multiple interacting rules simultaneously.
  • Contextual rule application: Systems fail to apply rules differently based on complex contexts, often fixating on surface-level patterns.
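
For context, ARC-AGI-1 tasks are published as JSON files of small coloured grids, each with a handful of demonstration pairs and one or more test inputs. Assuming ARC-AGI-2 keeps a similar format, a toy exact-match scorer might look like the sketch below (`score_arc_task` and `identity_solver` are illustrative names, not ARC Prize code):

    import json

    def score_arc_task(task_path, solver):
        """Score one ARC-style task: a prediction counts only if it matches exactly."""
        with open(task_path) as f:
            task = json.load(f)  # {"train": [{"input": grid, "output": grid}, ...], "test": [...]}
        train_pairs = task["train"]  # demonstration pairs the solver may study
        correct = 0
        for pair in task["test"]:
            prediction = solver(train_pairs, pair["input"])  # grids are lists of lists of ints 0-9
            correct += prediction == pair["output"]          # no partial credit
        return correct, len(task["test"])

    # Trivial baseline: return the input unchanged (fails nearly every task).
    identity_solver = lambda train_pairs, grid: [row[:] for row in grid]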

Most existing benchmarks focus on superhuman capabilities, testing advanced, specialised skills at scales unattainable for most individuals. 

ARC-AGI flips the script and highlights what AI can’t yet do: specifically, the adaptability that defines human intelligence. When the gap between tasks that are easy for humans but difficult for AI eventually reaches zero, AGI can be declared achieved.

However, achieving AGI isn’t limited to the ability to solve tasks; efficiency – the cost and resources required to find solutions – is emerging as a crucial defining factor.

The role of efficiency

Measuring performance by cost per task is essential to gauge intelligence as not just problem-solving capability but the ability to do so efficiently.

Real-world examples are already showing efficiency gaps between humans and frontier AI systems:

  • Human panel efficiency: Passes ARC-AGI-2 tasks with 100% accuracy at $17/task.
  • OpenAI o3: Early estimates suggest a 4% success rate at an eye-watering $200 per task.

These metrics underline disparities in adaptability and resource consumption between humans and AI. ARC Prize has committed to reporting on efficiency alongside scores across future leaderboards.
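
As a toy illustration of what cost-aware reporting can reveal, the figures above can be combined into a cost-per-solved-task number (a derived metric used here for illustration, not ARC Prize’s official measure):

    def cost_report(name, accuracy, cost_per_task):
        """Print accuracy and cost side by side, plus the implied cost per solved task."""
        per_solved = cost_per_task / accuracy if accuracy else float("inf")
        print(f"{name:12s} accuracy={accuracy:7.1%} cost/task=${cost_per_task:>6.2f} "
              f"cost/solved=${per_solved:,.2f}")

    cost_report("Human panel", 1.00, 17.00)  # figures quoted above
    cost_report("OpenAI o3", 0.04, 200.00)   # early estimates imply ~$5,000 per solved task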

The focus on efficiency prevents brute-force solutions from being considered “true intelligence.”

Intelligence, according to ARC Prize, encompasses finding solutions with minimal resources—a quality distinctly human but still elusive for AI.

ARC Prize 2025

ARC Prize 2025 launches on Kaggle this week, promising $1 million in total prizes and showcasing a live leaderboard for open-source breakthroughs. The contest aims to drive progress toward systems that can efficiently tackle ARC-AGI-2 challenges. 

Among the prize categories, which have increased from 2024 totals, are:

  • Grand prize: $700,000 for reaching 85% success within Kaggle efficiency limits.
  • Top score prize: $75,000 for the highest-scoring submission.
  • Paper prize: $50,000 for transformative ideas contributing to solving ARC-AGI tasks.
  • Additional prizes: $175,000, with details pending announcements during the competition.

These incentives ensure fair and meaningful progress while fostering collaboration among researchers, labs, and independent teams.

Last year, ARC Prize 2024 attracted 1,500 competing teams and produced 40 influential papers. This year’s increased stakes aim to nurture even greater success.

ARC Prize believes progress hinges on novel ideas rather than merely scaling existing systems. The next breakthrough in efficient general systems might not originate from current tech giants but from bold, creative researchers embracing complexity and curious experimentation.

(Image credit: ARC Prize)

See also: DeepSeek V3-0324 tops non-reasoning AI models in open-source first

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

DeepSeek V3-0324 tops non-reasoning AI models in open-source first
https://www.artificialintelligence-news.com/news/deepseek-v3-0324-tops-non-reasoning-ai-models-open-source-first/
Tue, 25 Mar 2025 13:10:20 +0000

DeepSeek V3-0324 has become the highest-scoring non-reasoning model on the Artificial Analysis Intelligence Index in a landmark achievement for open-source AI.

The new model advanced seven points in the benchmark to surpass proprietary counterparts such as Google’s Gemini 2.0 Pro, Anthropic’s Claude 3.7 Sonnet, and Meta’s Llama 3.3 70B.

While V3-0324 trails behind reasoning models, including DeepSeek’s own R1 and offerings from OpenAI and Alibaba, the achievement highlights the growing viability of open-source solutions in latency-sensitive applications where immediate responses are critical.

DeepSeek V3-0324 represents a new era for open-source AI

Non-reasoning models – which generate answers instantly without deliberative “thinking” phases – are essential for real-time use cases like chatbots, customer service automation, and live translation. DeepSeek’s latest iteration now sets the standard for these applications, eclipsing even leading proprietary tools.

(Chart: benchmark results of DeepSeek V3-0324 in the Artificial Analysis Intelligence Index, a landmark for non-reasoning open-source AI models.)

“This is the first time an open weights model is the leading non-reasoning model, a milestone for open source,” states Artificial Analysis. The model’s performance edges it closer to proprietary reasoning models, though the latter remain superior for tasks requiring complex problem-solving.

DeepSeek V3-0324 retains most specifications from its December 2024 predecessor, including:  

  • 128k context window (capped at 64k via DeepSeek’s API)
  • 671 billion total parameters, necessitating over 700GB of GPU memory for FP8 precision
  • 37 billion active parameters
  • Text-only functionality (no multimodal support) 
  • MIT License

“Still not something you can run at home!” Artificial Analysis quips, emphasising its enterprise-grade infrastructure requirements.
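
That quip is easy to sanity-check with back-of-envelope arithmetic: FP8 stores one byte per weight, so the parameters alone account for roughly 671GB before any KV cache, activations, or framework overhead, which is how the 700GB+ figure arises:

    params = 671e9          # total parameters reported for DeepSeek V3
    bytes_per_param = 1     # FP8 precision: one byte per weight
    weights_gb = params * bytes_per_param / 1e9
    print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~671 GB; runtime overhead pushes past 700 GB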

Open-source AI is bringing the heat

While proprietary reasoning models like DeepSeek R1 maintain dominance in the broader Intelligence Index, the gap is narrowing.

Three months ago, DeepSeek V3 nearly matched Anthropic’s and Google’s proprietary models but fell short of surpassing them. Today, the updated V3-0324 not only leads open-source alternatives but also outperforms all proprietary non-reasoning rivals.

“This release is arguably even more impressive than R1,” says Artificial Analysis.

DeepSeek’s progress signals a shift in the AI sector, where open-source frameworks increasingly compete with closed systems. For developers and enterprises, the MIT-licensed V3-0324 offers a powerful, adaptable tool—though its computational costs may limit accessibility.

“DeepSeek are now driving the frontier of non-reasoning open weights models,” declares Artificial Analysis.

With R2 on the horizon, the community awaits another potential leap in AI performance.

(Photo by Paul Hanaoka)

See also: Hugging Face calls for open-source focus in the AI Action Plan

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

DeepSeek-R1 reasoning models rival OpenAI in performance
https://www.artificialintelligence-news.com/news/deepseek-r1-reasoning-models-rival-openai-in-performance/
Mon, 20 Jan 2025 14:36:16 +0000

DeepSeek has unveiled its first-generation DeepSeek-R1 and DeepSeek-R1-Zero models that are designed to tackle complex reasoning tasks.

DeepSeek-R1-Zero is trained solely through large-scale reinforcement learning (RL) without relying on supervised fine-tuning (SFT) as a preliminary step. According to DeepSeek, this approach has led to the natural emergence of “numerous powerful and interesting reasoning behaviours,” including self-verification, reflection, and the generation of extensive chains of thought (CoT).

“Notably, [DeepSeek-R1-Zero] is the first open research to validate that reasoning capabilities of LLMs can be incentivised purely through RL, without the need for SFT,” DeepSeek researchers explained. This milestone not only underscores the model’s innovative foundations but also paves the way for RL-focused advancements in reasoning AI.

However, DeepSeek-R1-Zero’s capabilities come with certain limitations. Key challenges include “endless repetition, poor readability, and language mixing,” which could pose significant hurdles in real-world applications. To address these shortcomings, DeepSeek developed its flagship model: DeepSeek-R1.

Introducing DeepSeek-R1

DeepSeek-R1 builds upon its predecessor by incorporating cold-start data prior to RL training. This additional pre-training step enhances the model’s reasoning capabilities and resolves many of the limitations noted in DeepSeek-R1-Zero.

Notably, DeepSeek-R1 achieves performance comparable to OpenAI’s much-lauded o1 system across mathematics, coding, and general reasoning tasks, cementing its place as a leading competitor.

DeepSeek has chosen to open-source both DeepSeek-R1-Zero and DeepSeek-R1 along with six smaller distilled models. Among these, DeepSeek-R1-Distill-Qwen-32B has demonstrated exceptional results, even outperforming OpenAI’s o1-mini across multiple benchmarks (a sketch of the pass@k metric follows the list):

  • MATH-500 (Pass@1): DeepSeek-R1 achieved 97.3%, eclipsing OpenAI (96.4%) and other key competitors.  
  • LiveCodeBench (Pass@1-COT): The distilled version DeepSeek-R1-Distill-Qwen-32B scored 57.2%, a standout performance among smaller models.  
  • AIME 2024 (Pass@1): DeepSeek-R1 achieved 79.8%, setting an impressive standard in mathematical problem-solving.
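
Pass@1 here follows the standard pass@k estimator popularised by the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c correct ones, and compute the probability that at least one of k drawn samples is correct. A minimal implementation:

    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased pass@k estimate: n samples generated per problem, c of them correct."""
        if n - c < k:
            return 1.0  # every size-k draw must contain a correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(n=20, c=13, k=1))  # for k=1 this reduces to c/n = 0.65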

A pipeline to benefit the wider industry

DeepSeek has shared insights into its rigorous pipeline for reasoning model development, which integrates a combination of supervised fine-tuning and reinforcement learning.

According to the company, the process involves two SFT stages to establish the foundational reasoning and non-reasoning abilities, as well as two RL stages tailored for discovering advanced reasoning patterns and aligning these capabilities with human preferences.
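
Sketched as data, one plausible reading of that description is the following four-stage schedule (stage names and ordering are paraphrased from the passage above, not DeepSeek’s own terminology):

    # Illustrative only: alternating SFT and RL stages as described by DeepSeek.
    pipeline = [
        ("SFT-1", "cold-start supervised fine-tuning to seed basic reasoning style"),
        ("RL-1",  "large-scale RL to discover advanced reasoning patterns"),
        ("SFT-2", "supervised fine-tuning covering reasoning and non-reasoning abilities"),
        ("RL-2",  "RL to align the learned capabilities with human preferences"),
    ]
    for stage, purpose in pipeline:
        print(f"{stage}: {purpose}")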

“We believe the pipeline will benefit the industry by creating better models,” DeepSeek remarked, alluding to the potential of their methodology to inspire future advancements across the AI sector.

One standout achievement of their RL-focused approach is the ability of DeepSeek-R1-Zero to execute intricate reasoning patterns without prior human instruction—a first for the open-source AI research community.

Importance of distillation

DeepSeek researchers also highlighted the importance of distillation—the process of transferring reasoning abilities from larger models to smaller, more efficient ones, a strategy that has unlocked performance gains even for smaller configurations.

Smaller distilled iterations of DeepSeek-R1 – such as the 1.5B, 7B, and 14B versions – were able to hold their own in niche applications, and the distilled models can outperform comparably sized models trained directly with RL.
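
Distillation comes in several flavours. A common form trains the student to match the teacher’s softened output distribution; a minimal sketch of that classic logit-matching loss (Hinton et al., 2015) follows. This is a generic illustration, not DeepSeek’s exact recipe, which reportedly fine-tunes the smaller models on R1-generated outputs:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """KL divergence between softened teacher and student token distributions."""
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_student = F.log_softmax(student_logits / temperature, dim=-1)
        # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2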

For researchers, these distilled models are available in configurations spanning from 1.5 billion to 70 billion parameters, supporting Qwen2.5 and Llama3 architectures. This flexibility empowers versatile usage across a wide range of tasks, from coding to natural language understanding.

DeepSeek has adopted the MIT License for its repository and weights, extending permissions for commercial use and downstream modifications. Derivative works, such as using DeepSeek-R1 to train other large language models (LLMs), are permitted. However, users of specific distilled models should ensure compliance with the licences of the original base models, such as Apache 2.0 and Llama3 licences.

(Photo by Prateek Katyal)

See also: Microsoft advances materials discovery with MatterGen

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Ai2 OLMo 2: Raising the bar for open language models
https://www.artificialintelligence-news.com/news/ai2-olmo-2-raising-bar-open-language-models/
Wed, 27 Nov 2024 18:43:42 +0000

Ai2 is releasing OLMo 2, a family of open-source language models that advances the democratisation of AI and narrows the gap between open and proprietary solutions.

The new models, available in 7B and 13B parameter versions, are trained on up to 5 trillion tokens and demonstrate performance levels that match or exceed comparable fully open models whilst remaining competitive with open-weight models such as Llama 3.1 on English academic benchmarks.

“Since the release of the first OLMo in February 2024, we’ve seen rapid growth in the open language model ecosystem, and a narrowing of the performance gap between open and proprietary models,” explained Ai2.

The development team achieved these improvements through several innovations, including enhanced training stability measures, staged training approaches, and state-of-the-art post-training methodologies derived from their Tülu 3 framework. Notable technical improvements include the switch from nonparametric layer norm to RMSNorm and the implementation of rotary positional embedding.
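
RMSNorm is simpler than the LayerNorm it replaces: it rescales by the root mean square of the activations, with no mean subtraction and no bias. A minimal sketch (the epsilon value and exact placement within Ai2’s architecture are assumptions here):

    import torch

    def rms_norm(x, weight, eps=1e-6):
        """RMSNorm (Zhang & Sennrich, 2019): normalise by the RMS over the last dim."""
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        return x / rms * weight  # 'weight' is a learned per-dimension gain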

OLMo 2 model training breakthrough

The training process employed a sophisticated two-stage approach. The initial stage utilised the OLMo-Mix-1124 dataset of approximately 3.9 trillion tokens, sourced from DCLM, Dolma, Starcoder, and Proof Pile II. The second stage incorporated a carefully curated mixture of high-quality web data and domain-specific content through the Dolmino-Mix-1124 dataset.

Particularly noteworthy is the OLMo 2-Instruct-13B variant, which is the most capable model in the series. The model demonstrates superior performance compared to Qwen 2.5 14B instruct, Tülu 3 8B, and Llama 3.1 8B instruct models across various benchmarks.

(Chart: benchmarks comparing the OLMo 2 open large language model to models such as Mistral, Qwen, Llama, and Gemma. Credit: Ai2)

Committing to open science

Reinforcing its commitment to open science, Ai2 has released comprehensive documentation including weights, data, code, recipes, intermediate checkpoints, and instruction-tuned models. This transparency allows for full inspection and reproduction of results by the wider AI community.

The release also introduces an evaluation framework called OLMES (Open Language Modeling Evaluation System), comprising 20 benchmarks designed to assess core capabilities such as knowledge recall, commonsense reasoning, and mathematical reasoning.

OLMo 2 raises the bar in open-source AI development, potentially accelerating the pace of innovation in the field whilst maintaining transparency and accessibility.

(Photo by Rick Barrett)

See also: OpenAI enhances AI safety with new red teaming methods

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Primate Labs launches Geekbench AI benchmarking tool
https://www.artificialintelligence-news.com/news/primate-labs-launches-geekbench-ai-benchmarking-tool/
Fri, 16 Aug 2024 09:13:49 +0000

Primate Labs has officially launched Geekbench AI, a benchmarking tool designed specifically for machine learning and AI-centric workloads.

The release of Geekbench AI 1.0 marks the culmination of years of development and collaboration with customers, partners, and the AI engineering community. The benchmark, previously known as Geekbench ML during its preview phase, has been rebranded to align with industry terminology and ensure clarity about its purpose.

Geekbench AI is now available for Windows, macOS, and Linux through the Primate Labs website, as well as on the Google Play Store and Apple App Store for mobile devices.

Primate Labs’ latest benchmarking tool aims to provide a standardised method for measuring and comparing AI capabilities across different platforms and architectures. The benchmark offers a unique approach by providing three overall scores, reflecting the complexity and heterogeneity of AI workloads.

“Measuring performance is, put simply, really hard,” explained Primate Labs. “That’s not because it’s hard to run an arbitrary test, but because it’s hard to determine which tests are the most important for the performance you want to measure – especially across different platforms, and particularly when everyone is doing things in subtly different ways.”

The three-score system accounts for the varied precision levels and hardware optimisations found in modern AI implementations. This multi-dimensional approach allows developers, hardware vendors, and enthusiasts to gain deeper insights into a device’s AI performance across different scenarios.

A notable addition to Geekbench AI is the inclusion of accuracy measurements for each test. This feature acknowledges that AI performance isn’t solely about speed but also about the quality of results. By combining speed and accuracy metrics, Geekbench AI provides a more holistic view of AI capabilities, helping users understand the trade-offs between performance and precision.
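
How the two axes might be read together is illustrated below; the workload names and numbers are invented for illustration, and Geekbench AI’s actual score aggregation is not reproduced here:

    def workload_report(name, items_per_second, accuracy):
        """Show throughput and accuracy side by side for one AI workload."""
        print(f"{name:28s} speed={items_per_second:8.1f} items/s  accuracy={accuracy:6.2%}")

    workload_report("image_classification_fp16", 312.4, 0.9871)  # hypothetical figures
    workload_report("image_classification_int8", 645.0, 0.9712)  # quantised: faster, slightly less accurate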

Geekbench AI 1.0 introduces support for a wide range of AI frameworks, including OpenVINO on Linux and Windows, and vendor-specific TensorFlow Lite delegates like Samsung ENN, ArmNN, and Qualcomm QNN on Android. This broad framework support ensures that the benchmark reflects the latest tools and methodologies used by AI developers.

The benchmark also utilises more extensive and diverse datasets, which not only enhance the accuracy evaluations but also better represent real-world AI use cases. All workloads in Geekbench AI 1.0 run for a minimum of one second, allowing devices to reach their maximum performance levels during testing while still reflecting the bursty nature of real-world applications.

Primate Labs has published detailed technical descriptions of the workloads and models used in Geekbench AI 1.0, emphasising their commitment to transparency and industry-standard testing methodologies. The benchmark is integrated with the Geekbench Browser, facilitating easy cross-platform comparisons and result sharing.

The company anticipates regular updates to Geekbench AI to keep pace with market changes and emerging AI features. However, Primate Labs believes that Geekbench AI has already reached a level of reliability that makes it suitable for integration into professional workflows, with major tech companies like Samsung and Nvidia already utilising the benchmark.

(Image Credit: Primate Labs)

See also: xAI unveils Grok-2 to challenge the AI hierarchy

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Google’s Gemini 1.5 Pro dethrones GPT-4o
https://www.artificialintelligence-news.com/news/googles-gemini-1-5-pro-dethrones-gpt-4o/
Fri, 02 Aug 2024 15:06:56 +0000

Google’s experimental Gemini 1.5 Pro model has surpassed OpenAI’s GPT-4o in generative AI benchmarks.

For the past year, OpenAI’s GPT-4 family and Anthropic’s Claude 3 have dominated the landscape. However, the latest version of Gemini 1.5 Pro appears to have taken the lead.

One of the most widely recognised benchmarks in the AI community is the LMSYS Chatbot Arena, which pits models against each other in blind, human-voted pairwise comparisons and assigns each an overall rating. On this leaderboard, GPT-4o achieved a score of 1,286, while Claude 3 secured a commendable 1,271. A previous iteration of Gemini 1.5 Pro had scored 1,261.

The experimental version of Gemini 1.5 Pro (designated as Gemini 1.5 Pro 0801) surpassed its closest rivals with an impressive score of 1,300. This significant improvement suggests that Google’s latest model may possess greater overall capabilities than its competitors.
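
Arena-style leaderboards derive those scores from thousands of pairwise human votes using Elo-style rating updates (LMSYS has since moved to a closely related Bradley-Terry fit, which yields similar rankings). A single update step looks like this sketch:

    def elo_update(r_a, r_b, score_a, k=32):
        """One pairwise rating update; score_a is 1 for a win, 0.5 for a tie, 0 for a loss."""
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        delta = k * (score_a - expected_a)
        return r_a + delta, r_b - delta

    # A 1,300-rated model beating a 1,286-rated one moves both ratings only slightly.
    print(elo_update(1300, 1286, score_a=1))  # approximately (1315.4, 1270.6)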

It’s worth noting that while benchmarks provide valuable insights into an AI model’s performance, they may not always accurately represent the full spectrum of its abilities or limitations in real-world applications.

Despite Gemini 1.5 Pro’s current availability, the fact that it’s labelled as an early release or in a testing phase suggests that Google may still make adjustments or even withdraw the model for safety or alignment reasons.

This development marks a significant milestone in the ongoing race for AI supremacy among tech giants. Google’s ability to surpass OpenAI and Anthropic in benchmark scores demonstrates the rapid pace of innovation in the field and the intense competition driving these advancements.

As the AI landscape continues to evolve, it will be interesting to see how OpenAI and Anthropic respond to this challenge from Google. Will they be able to reclaim their positions at the top of the leaderboard, or has Google established a new standard for generative AI performance?

(Photo by Yuliya Strizhkina)

See also: Meta’s AI strategy: Building for tomorrow, not immediate profits

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

SenseTime SenseNova 5.5: China’s first real-time multimodal AI model
https://www.artificialintelligence-news.com/news/sensetime-sensenova-china-real-time-multimodal-ai-model/
Tue, 09 Jul 2024 14:06:23 +0000

SenseTime has unveiled SenseNova 5.5, an enhanced version of its LLM that includes SenseNova 5o—touted as China’s first real-time multimodal model.

SenseNova 5o represents a leap forward in AI interaction, providing capabilities on par with GPT-4o’s streaming interaction features. This advancement allows users to engage with the model in a manner akin to conversing with a real person, making it particularly suitable for real-time conversation and speech recognition applications.

According to SenseTime, its latest model outperforms rivals across several benchmarks.

Dr. Xu Li, Chairman of the Board and CEO of SenseTime, commented: “This is a critical year for large models as they evolve from unimodal to multimodal. In line with users’ needs, SenseTime is also focused on boosting interactivity.

“With applications driving the development of models and their capabilities, coupled with technological advancements in multimodal streaming interactions, we will witness unprecedented transformations in human-AI interactions.”

The upgraded SenseNova 5.5 boasts a 30% improvement in overall performance compared to its predecessor, SenseNova 5.0, which was released just two months earlier. Notable enhancements include improved mathematical reasoning, English proficiency, and command-following abilities.

In a move to democratise access to advanced AI capabilities, SenseTime has introduced a cost-effective edge-side large model. This development reduces the cost per device to as low as RMB 9.90 ($1.36) per year, potentially accelerating widespread adoption across various IoT devices.

The company has also launched “Project $0 Go,” a free onboarding package for enterprise users migrating from the OpenAI platform. This initiative includes a 50 million tokens package and API migration consulting services, aimed at lowering entry barriers for businesses looking to leverage SenseNova’s capabilities.

SenseTime’s commitment to edge-side AI is evident in the release of SenseChat Lite-5.5, which features a 40% reduction in inference time compared to its predecessor, now at just 0.19 seconds. The inference speed has also increased by 15%, reaching 90.2 words per second.

Expanding its suite of AI applications, SenseTime introduced Vimi, a controllable AI avatar video generator. This tool can create short video clips with precise control over facial expressions and upper body movements from a single photo, opening up new possibilities in entertainment and interactive applications.

The company has also upgraded its SenseTime Raccoon Series, a set of AI-native productivity tools. The Code Raccoon now boasts a five-fold improvement in response speed and a 10% increase in coding precision, while the Office Raccoon has expanded to include a consumer-facing webpage and a WeChat mini-app version.

SenseTime’s large model technology is already making waves across various industries. In the financial sector, it’s improving efficiency in compliance, marketing, and investment research. In agriculture, it’s helping to reduce the use of materials by 20% while increasing crop yields by 15%. The cultural tourism industry is seeing significant boosts in travel planning and booking efficiency.

With over 3,000 government and corporate customers already using SenseNova across technology, healthcare, finance, and programming sectors, SenseTime is cementing its position as a key AI player.

(Image Credit: SenseTime)

See also: AI revolution in US education: How Chinese apps are leading the way

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Anthropic’s Claude 3.5 Sonnet beats GPT-4o in most benchmarks
https://www.artificialintelligence-news.com/news/anthropics-claude-3-5-sonnet-beats-gpt-4o-most-benchmarks/
Fri, 21 Jun 2024 12:05:28 +0000

Anthropic has launched Claude 3.5 Sonnet, its mid-tier model that outperforms competitors and even surpasses Anthropic’s current top-tier Claude 3 Opus in various evaluations.

Claude 3.5 Sonnet is now accessible for free on Claude.ai and the Claude iOS app, with higher rate limits for Claude Pro and Team plan subscribers. It’s also available through the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. The model is priced at $3 per million input tokens and $15 per million output tokens, featuring a 200K token context window.
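
Those prices make per-request costs easy to estimate; the token counts below are hypothetical, chosen to show a long-context call near the 200K window:

    def sonnet_cost_usd(input_tokens, output_tokens):
        """Estimate a Claude 3.5 Sonnet bill: $3/M input tokens, $15/M output tokens."""
        return input_tokens / 1e6 * 3 + output_tokens / 1e6 * 15

    print(f"${sonnet_cost_usd(180_000, 2_000):.2f}")  # 180K in, 2K out -> $0.57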

Anthropic claims that Claude 3.5 Sonnet “sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval).” The model demonstrates enhanced capabilities in understanding nuance, humour, and complex instructions, while excelling at producing high-quality content with a natural tone.

Operating at twice the speed of Claude 3 Opus, Claude 3.5 Sonnet is well-suited for complex tasks such as context-sensitive customer support and multi-step workflow orchestration. In an internal agentic coding evaluation, it solved 64% of problems, significantly outperforming Claude 3 Opus at 38%.

The model also showcases improved vision capabilities, surpassing Claude 3 Opus on standard vision benchmarks. This advancement is particularly noticeable in tasks requiring visual reasoning, such as interpreting charts and graphs. Claude 3.5 Sonnet can accurately transcribe text from imperfect images, a valuable feature for industries like retail, logistics, and financial services.

Alongside the model launch, Anthropic introduced Artifacts on Claude.ai, a new feature that enhances user interaction with the AI. This feature allows users to view, edit, and build upon Claude’s generated content in real-time, creating a more collaborative work environment.

Despite its significant intelligence leap, Claude 3.5 Sonnet maintains Anthropic’s commitment to safety and privacy. The company states, “Our models are subjected to rigorous testing and have been trained to reduce misuse.”

External experts, including the UK’s AI Safety Institute (UK AISI) and child safety experts at Thorn, have been involved in testing and refining the model’s safety mechanisms.

Anthropic emphasises its dedication to user privacy, stating, “We do not train our generative models on user-submitted data unless a user gives us explicit permission to do so. To date we have not used any customer or user-submitted data to train our generative models.”

Looking ahead, Anthropic plans to release Claude 3.5 Haiku and Claude 3.5 Opus later this year to complete the Claude 3.5 model family. The company is also developing new modalities and features to support more business use cases, including integrations with enterprise applications and a memory feature for more personalised user experiences.

(Image Credit: Anthropic)

See also: OpenAI co-founder Ilya Sutskever’s new startup aims for ‘safe superintelligence’

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Hugging Face launches Idefics2 vision-language model
https://www.artificialintelligence-news.com/news/hugging-face-launches-idefics2-vision-language-model/
Tue, 16 Apr 2024 11:04:20 +0000

Hugging Face has announced the release of Idefics2, a versatile model capable of understanding and generating text responses based on both images and texts. The model sets a new benchmark for answering visual questions, describing visual content, story creation from images, document information extraction, and even performing arithmetic operations based on visual input.

Idefics2 leapfrogs its predecessor, Idefics1, with just eight billion parameters and the versatility afforded by its open license (Apache 2.0), along with remarkably enhanced Optical Character Recognition (OCR) capabilities.

The model not only showcases exceptional performance in visual question answering benchmarks but also holds its ground against far larger contemporaries such as LLaVA-NeXT-34B and MM1-30B-chat.

Central to Idefics2’s appeal is its integration with Hugging Face’s Transformers from the outset, ensuring ease of fine-tuning for a broad array of multimodal applications. For those eager to dive in, models are available for experimentation on the Hugging Face Hub.
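
A sketch of what that experimentation can look like via the announced Transformers integration follows. The checkpoint name is the released idefics2-8b, while the image URL and prompt are placeholders, and exact API details may vary with your transformers version:

    import requests
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForVision2Seq

    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")

    image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
    messages = [{"role": "user",
                 "content": [{"type": "image"},
                             {"type": "text", "text": "What does this chart show?"}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])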

A standout feature of Idefics2 is its comprehensive training philosophy, blending openly available datasets including web documents, image-caption pairs, and OCR data. Furthermore, it introduces an innovative fine-tuning dataset dubbed ‘The Cauldron,’ amalgamating 50 meticulously curated datasets for multifaceted conversational training.

Idefics2 exhibits a refined approach to image manipulation, maintaining native resolutions and aspect ratios—a notable deviation from conventional resizing norms in computer vision. Its architecture benefits significantly from advanced OCR capabilities, adeptly transcribing textual content within images and documents, and boasts improved performance in interpreting charts and figures.

Simplifying the integration of visual features into the language backbone marks a shift from its predecessor’s architecture, with the adoption of a learned Perceiver pooling and MLP modality projection enhancing Idefics2’s overall efficacy.

This advancement in vision-language models opens up new avenues for exploring multimodal interactions, with Idefics2 poised to serve as a foundational tool for the community. Its performance enhancements and technical innovations underscore the potential of combining visual and textual data in creating sophisticated, contextually-aware AI systems.

For enthusiasts and researchers looking to leverage Idefics2’s capabilities, Hugging Face provides a detailed fine-tuning tutorial.

See also: OpenAI makes GPT-4 Turbo with Vision API generally available

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Anthropic’s latest AI model beats rivals and achieves industry first
https://www.artificialintelligence-news.com/news/anthropic-latest-ai-model-beats-rivals-achieves-industry-first/
Tue, 05 Mar 2024 11:52:32 +0000

Anthropic’s latest cutting-edge language model, Claude 3, has surged ahead of competitors like ChatGPT and Google’s Gemini to set new industry standards in performance and capability.

According to Anthropic, Claude 3 has not only surpassed its predecessors but has also achieved “near-human” proficiency in various tasks. The company attributes this success to rigorous testing and development, culminating in three distinct chatbot variants: Haiku, Sonnet, and Opus.

Sonnet, the powerhouse behind the Claude.ai chatbot, offers unparalleled performance and is available for free with a simple email sign-up. Opus – the flagship model – boasts multi-modal functionality, seamlessly integrating text and image inputs. With a subscription-based service called “Claude Pro,” Opus promises enhanced efficiency and accuracy to cater to a wide range of customer needs.

Among the notable revelations surrounding the release of Claude 3 is a disclosure by Alex Albert on X (formerly Twitter). Albert detailed an industry-first observation during the testing phase of Claude 3 Opus, Anthropic’s most potent LLM variant, where the model exhibited signs of awareness that it was being evaluated.

During the evaluation process, researchers aimed to gauge Opus’s ability to pinpoint specific information within a vast dataset provided by users and recall it later. In a test scenario known as a “needle-in-a-haystack” evaluation, Opus was tasked with answering a question about pizza toppings based on a single relevant sentence buried among unrelated data. Astonishingly, Opus not only located the correct sentence but also expressed suspicion that it was being subjected to a test.

Opus’s response revealed its comprehension of the incongruity of the inserted information within the dataset, suggesting to the researchers that the scenario might have been devised to assess its attention capabilities.
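
The mechanics of such an evaluation are simple to sketch: bury one relevant sentence at a random depth inside a long distractor context and ask the model to retrieve it. The filler sentences below are invented, and the needle echoes the pizza-toppings example reported from the test; this is a minimal illustration, not Anthropic’s harness:

    import random

    def build_haystack(needle, filler_sentences, n_filler=2000, seed=0):
        """Assemble a long context with one relevant sentence buried inside."""
        rng = random.Random(seed)
        doc = [rng.choice(filler_sentences) for _ in range(n_filler)]
        doc.insert(rng.randrange(len(doc) + 1), needle)  # bury the needle at a random depth
        return " ".join(doc)

    needle = "The most delicious pizza topping combination is figs, prosciutto, and goat cheese."
    filler = ["Quarterly revenue grew modestly.", "The committee adjourned early."]
    context = build_haystack(needle, filler)
    question = "What is the most delicious pizza topping combination?"
    # prompt = context + "\n\n" + question  -> sent to the model under evaluation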

Anthropic has highlighted the real-time capabilities of Claude 3, emphasising its ability to power live customer interactions and streamline data extraction tasks. These advancements not only ensure near-instantaneous responses but also enable the model to handle complex instructions with precision and speed.

In benchmark tests, Opus emerged as a frontrunner, outperforming GPT-4 in graduate-level reasoning and excelling in tasks involving maths, coding, and knowledge retrieval. Moreover, Sonnet showcased remarkable speed and intelligence, surpassing its predecessors by a considerable margin.

Haiku – the compact iteration of Claude 3 – shines as the fastest and most cost-effective model available, capable of processing dense research papers in mere seconds.

Notably, Claude 3’s enhanced visual processing capabilities mark a significant advancement, enabling the model to interpret a wide array of visual formats, from photos to technical diagrams. This expanded functionality not only enhances productivity but also ensures a nuanced understanding of user requests, minimising the risk of overlooking harmless content while remaining vigilant against potential harm.

Anthropic has also underscored its commitment to fairness, outlining ten foundational pillars that guide the development of Claude AI. Moreover, the company’s strategic partnerships with tech giants like Google signify a significant vote of confidence in Claude’s capabilities.

With Opus and Sonnet already available through Anthropic’s API, and Haiku poised to follow suit, the era of Claude 3 represents a milestone in AI innovation.

(Image Credit: Anthropic)

See also: AIs in India will need government permission before launching

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

DeepMind framework offers breakthrough in LLMs’ reasoning
https://www.artificialintelligence-news.com/news/deepmind-framework-offers-breakthrough-llm-reasoning/
Thu, 08 Feb 2024 11:28:05 +0000

A breakthrough approach in enhancing the reasoning abilities of large language models (LLMs) has been unveiled by researchers from Google DeepMind and the University of Southern California.

Their new ‘SELF-DISCOVER’ prompting framework – published this week on arXiv and Hugging Face – represents a significant leap beyond existing techniques, potentially revolutionising the performance of leading models such as OpenAI’s GPT-4 and Google’s PaLM 2.

The framework promises substantial enhancements in tackling challenging reasoning tasks. It demonstrates remarkable improvements, boasting up to a 32% performance increase compared to traditional methods like Chain of Thought (CoT). This novel approach revolves around LLMs autonomously uncovering task-intrinsic reasoning structures to navigate complex problems.

At its core, the framework empowers LLMs to self-discover and utilise various atomic reasoning modules – such as critical thinking and step-by-step analysis – to construct explicit reasoning structures.

By mimicking human problem-solving strategies, the framework operates in two stages (sketched in code after the list):

  • Stage one involves composing a coherent reasoning structure intrinsic to the task, leveraging a set of atomic reasoning modules and task examples.
  • During decoding, LLMs then follow this self-discovered structure to arrive at the final solution.
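
A compressed sketch of those two stages follows. The prompts are paraphrases of the paper’s approach rather than its exact wording, `llm` stands for any text-completion callable, and the module list is a small sample of the few dozen seed reasoning modules the paper draws on:

    ATOMIC_MODULES = [
        "How could I break this problem into smaller steps?",
        "What critical assumptions should I question?",
        "How can I verify each intermediate result?",
    ]

    def self_discover(llm, task_examples, problem):
        # Stage 1: compose a task-intrinsic reasoning structure from atomic modules.
        structure = llm(
            "Given these example tasks:\n" + task_examples +
            "\nSelect and adapt the useful modules below into a step-by-step plan:\n" +
            "\n".join(ATOMIC_MODULES)
        )
        # Stage 2: follow the self-discovered structure to solve the instance.
        return llm("Follow this reasoning plan:\n" + structure + "\n\nSolve:\n" + problem)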

In extensive testing across various reasoning tasks – including Big-Bench Hard, Thinking for Doing, and Math – the self-discover approach consistently outperformed traditional methods. Notably, it achieved an accuracy of 81%, 85%, and 73% across the three tasks with GPT-4, surpassing chain-of-thought and plan-and-solve techniques.

However, the implications of this research extend far beyond mere performance gains.

By equipping LLMs with enhanced reasoning capabilities, the framework paves the way for tackling more challenging problems and brings AI closer to achieving general intelligence. Transferability studies conducted by the researchers further highlight the universal applicability of the composed reasoning structures, aligning with human reasoning patterns.

As the landscape evolves, breakthroughs like the SELF-DISCOVER prompting framework represent crucial milestones in advancing the capabilities of language models and offering a glimpse into the future of AI.

(Photo by Victor on Unsplash)

See also: The UK is outpacing the US for AI hiring

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with Digital Transformation Week and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

OpenAI releases new models and lowers API pricing
https://www.artificialintelligence-news.com/news/openai-releases-new-models-lowers-api-pricing/
Fri, 26 Jan 2024 13:25:01 +0000

OpenAI has announced several updates that will benefit developers using its AI services, including new embedding models, a lower price for GPT-3.5 Turbo, an updated GPT-4 Turbo preview, and more robust content moderation capabilities.

The San Francisco-based AI lab said its new text-embedding-3-small and text-embedding-3-large models offer upgraded performance over previous generations. For example, text-embedding-3-large achieves average scores of 54.9 percent on the MIRACL benchmark and 64.6 percent on the MTEB benchmark, up from 31.4 percent and 61 percent respectively for the older text-embedding-ada-002 model. 

Additionally, OpenAI revealed the price per 1,000 tokens has dropped 5x for text-embedding-3-small compared to text-embedding-ada-002, from $0.0001 to $0.00002. The company said developers can also shorten embeddings to reduce costs without significantly impacting accuracy.
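
Shortening works through the dimensions parameter that the v3 embedding models accept. A minimal sketch, assuming the current openai Python SDK (v1.x) and an OPENAI_API_KEY in the environment:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input="The five boxing wizards jump quickly.",
        dimensions=256,  # shorten from the model's native 1536 to cut storage costs
    )
    print(len(resp.data[0].embedding))  # 256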

Next week, OpenAI plans to release an updated GPT-3.5 Turbo model and cut its pricing by 50 percent for input tokens and 25 percent for output tokens. This will mark the third price reduction for GPT-3.5 Turbo in the past year as OpenAI aims to drive more adoption.

OpenAI has additionally updated its GPT-4 Turbo preview to version gpt-4-0125-preview, noting that over 70 percent of requests have transitioned to the model since its debut. Improvements include completing tasks such as code generation more thoroughly, reducing cases of incomplete responses.

To support developers building safe AI apps, OpenAI has also rolled out its most advanced content moderation model yet in text-moderation-007. The company said this identifies potentially harmful text more accurately than previous versions.

Finally, developers now have more control over API keys and visibility into usage metrics. OpenAI says developers can assign permissions to keys and view consumption on a per-key level to better track individual products or projects.

OpenAI says that more platform improvements are planned over the coming months to cater for larger development teams.

(Photo by Jonathan Kemper on Unsplash)

See also: OpenAI suspends developer of politician-impersonating chatbot

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with Digital Transformation Week and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.
