inference Archives - AI News

DeepSeek’s AIs: What humans really want

Chinese AI startup DeepSeek has solved a problem that has frustrated AI researchers for several years. Its breakthrough in AI reward models could dramatically improve how AI systems reason and respond to questions.

In partnership with Tsinghua University researchers, DeepSeek has developed a technique detailed in a research paper titled “Inference-Time Scaling for Generalist Reward Modeling.” The paper outlines how the new approach outperforms existing methods and how the team “achieved competitive performance” compared with strong public reward models.

The innovation focuses on enhancing how AI systems learn from human preferences – an important aspect of creating more useful and aligned artificial intelligence.

What are AI reward models, and why do they matter?

AI reward models are important components in reinforcement learning for large language models. They provide feedback signals that help guide an AI’s behaviour toward preferred outcomes. In simpler terms, reward models are like digital teachers that help AI understand what humans want from their responses.
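
To make the “digital teacher” analogy concrete, here is a minimal sketch of how a scalar reward model is typically queried during reinforcement learning from human feedback. The checkpoint name and the single-output scoring head are assumptions for illustration; this is not DeepSeek’s model.

```python
# A minimal sketch of how a reward model is queried during RLHF.
# Assumptions: a preference-trained sequence-classification checkpoint
# with a single scalar output head; the model name is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "some-org/reward-model"  # hypothetical checkpoint, not DeepSeek's
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

def reward(prompt: str, response: str) -> float:
    """Return a scalar score; higher means closer to human preferences."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

# An RL algorithm such as PPO would push the policy towards responses
# that this function scores highly.
print(reward("What is the capital of France?", "Paris."))
```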

“Reward modeling is a process that guides an LLM towards human preferences,” the DeepSeek paper states. Reward modeling becomes important as AI systems get more sophisticated and are deployed in scenarios beyond simple question-answering tasks.

The innovation from DeepSeek addresses the challenge of obtaining accurate reward signals for LLMs in different domains. While current reward models work well for verifiable questions or artificial rules, they struggle in general domains where criteria are more diverse and complex.

The dual approach: How DeepSeek’s method works

DeepSeek’s approach combines two methods:

  1. Generative reward modeling (GRM): This approach enables flexibility in different input types and allows for scaling during inference time. Unlike previous scalar or semi-scalar approaches, GRM provides a richer representation of rewards through language.
  2. Self-principled critique tuning (SPCT): A learning method that fosters scalable reward-generation behaviours in GRMs through online reinforcement learning, generating principles adaptively.

One of the paper’s authors from Tsinghua University and DeepSeek-AI, Zijun Liu, explained that the combination of methods allows “principles to be generated based on the input query and responses, adaptively aligning reward generation process.”

The approach is particularly valuable for its potential to enable “inference-time scaling” – improving performance by increasing computational resources during inference rather than only during training.

The researchers found that their methods could achieve better results with increased sampling, letting models generate better rewards with more computing.
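
A simplified sketch of that idea appears below: sample several independent reward judgments for the same query and response, then aggregate them. The `generate_judgment` stub stands in for a generative reward model that writes principles and critiques before emitting a score; it illustrates the sampling-and-aggregation pattern, not DeepSeek’s actual SPCT implementation.

```python
# Sketch of inference-time scaling for reward generation: sample several
# independent judgments and aggregate them. generate_judgment is a
# stand-in for a GRM critique that ends in a score; illustrative only.
import random
import statistics

def generate_judgment(query: str, response: str) -> float:
    """Stand-in for one sampled GRM critique that ends in a 0-10 score."""
    return random.gauss(7.0, 1.0)

def scaled_reward(query: str, response: str, k: int = 8) -> float:
    """Spend more inference compute (larger k) for a more reliable reward."""
    samples = [generate_judgment(query, response) for _ in range(k)]
    return statistics.mean(samples)  # simple aggregation by averaging

# Increasing k reduces the variance of the reward estimate without any
# additional training - the essence of inference-time scaling.
print(scaled_reward("Explain photosynthesis.", "Plants convert light...", k=32))
```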

Implications for the AI industry

DeepSeek’s innovation comes at an important time in AI development. The paper states “reinforcement learning (RL) has been widely adopted in post-training for large language models […] at scale,” leading to “remarkable improvements in human value alignment, long-term reasoning, and environment adaptation for LLMs.”

The new approach to reward modelling could have several implications:

  1. More accurate AI feedback: By creating better reward models, AI systems can receive more precise feedback about their outputs, leading to improved responses over time.
  2. Increased adaptability: The ability to scale model performance during inference means AI systems can adapt to different computational constraints and requirements.
  3. Broader application: Systems can perform better in a broader range of tasks by improving reward modelling for general domains.
  4. More efficient resource use: The research shows that inference-time scaling with DeepSeek’s method can outperform scaling up model size at training time, potentially allowing smaller models to match larger ones when given appropriate inference-time resources.

DeepSeek’s growing influence

The latest development adds to DeepSeek’s rising profile in global AI. Founded in 2023 by entrepreneur Liang Wenfeng, the Hangzhou-based company has made waves with its V3 foundation and R1 reasoning models.

The company recently upgraded its V3 model (DeepSeek-V3-0324), which it said offers “enhanced reasoning capabilities, optimised front-end web development and upgraded Chinese writing proficiency.” DeepSeek has also committed to open-source AI, releasing five code repositories in February that allow developers to review and contribute to its development.

While speculation continues about the potential release of DeepSeek-R2 (the successor to R1) – Reuters has reported on possible release dates – DeepSeek has not commented through its official channels.

What’s next for AI reward models?

According to the researchers, DeepSeek intends to make its GRM models open-source, although no specific timeline has been provided. Open-sourcing could accelerate progress in the field by allowing broader experimentation with reward models.

As reinforcement learning continues to play an important role in AI development, advances in reward modelling like those from DeepSeek and Tsinghua University will likely influence the abilities and behaviour of AI systems.

The work on AI reward models demonstrates that innovations in how and when models learn can be as important as increasing their size. By focusing on feedback quality and scalability, DeepSeek is addressing one of the fundamental challenges in creating AI that better understands and aligns with human preferences.

See also: DeepSeek disruption: Chinese AI innovation narrows global technology divide

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

NVIDIA Dynamo: Scaling AI inference with open-source efficiency

NVIDIA has launched Dynamo, an open-source inference software designed to accelerate and scale reasoning models within AI factories.

Efficiently managing and coordinating AI inference requests across a fleet of GPUs is critical to ensuring that AI factories operate cost-effectively and maximise token revenue.

As AI reasoning becomes increasingly prevalent, each AI model is expected to generate tens of thousands of tokens with every prompt, essentially representing its “thinking” process. Enhancing inference performance while simultaneously reducing its cost is therefore crucial for accelerating growth and boosting revenue opportunities for service providers.

A new generation of AI inference software

NVIDIA Dynamo, which succeeds the NVIDIA Triton Inference Server, represents a new generation of AI inference software specifically engineered to maximise token revenue generation for AI factories deploying reasoning AI models.

Dynamo orchestrates and accelerates inference communication across potentially thousands of GPUs. It employs disaggregated serving, a technique that separates the processing and generation phases of large language models (LLMs) onto distinct GPUs. This approach allows each phase to be optimised independently, catering to its specific computational needs and ensuring maximum utilisation of GPU resources.

“Industries around the world are training AI models to think and learn in different ways, making them more sophisticated over time,” stated Jensen Huang, founder and CEO of NVIDIA. “To enable a future of custom reasoning AI, NVIDIA Dynamo helps serve these models at scale, driving cost savings and efficiencies across AI factories.”

Using the same number of GPUs, Dynamo has demonstrated the ability to double the performance and revenue of AI factories serving Llama models on NVIDIA’s current Hopper platform. Furthermore, when running the DeepSeek-R1 model on a large cluster of GB200 NVL72 racks, NVIDIA Dynamo’s intelligent inference optimisations have been shown to boost the number of tokens generated per GPU by more than 30 times.

To achieve these improvements in inference performance, NVIDIA Dynamo incorporates several key features designed to increase throughput and reduce operational costs.

Dynamo can dynamically add, remove, and reallocate GPUs in real-time to adapt to fluctuating request volumes and types. The software can also pinpoint specific GPUs within large clusters that are best suited to minimise response computations and efficiently route queries. Dynamo can also offload inference data to more cost-effective memory and storage devices while retrieving it rapidly when required, thereby minimising overall inference costs.

NVIDIA Dynamo is being released as a fully open-source project, offering broad compatibility with popular frameworks such as PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM. This open approach supports enterprises, startups, and researchers in developing and optimising novel methods for serving AI models across disaggregated inference infrastructures.

NVIDIA expects Dynamo to accelerate the adoption of AI inference across a wide range of organisations, including major cloud providers and AI innovators like AWS, Cohere, CoreWeave, Dell, Fireworks, Google Cloud, Lambda, Meta, Microsoft Azure, Nebius, NetApp, OCI, Perplexity, Together AI, and VAST.

NVIDIA Dynamo: Supercharging inference and agentic AI

A key innovation of NVIDIA Dynamo lies in its ability to map the knowledge that inference systems hold in memory from serving previous requests, known as the KV cache, across potentially thousands of GPUs.

The software then intelligently routes new inference requests to the GPUs that possess the best knowledge match, effectively avoiding costly recomputations and freeing up other GPUs to handle new incoming requests. This smart routing mechanism significantly enhances efficiency and reduces latency.
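
A toy sketch of this routing idea, under the assumption of worker caches keyed by prompt prefixes, is shown below. Production systems like Dynamo match hashed token blocks rather than raw characters; the simplified version only illustrates the principle.

```python
# A toy sketch of KV-cache-aware routing: send each request to the worker
# whose cached prompts share the longest prefix with it. Production
# systems match hashed token blocks, not raw characters.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request: str, worker_caches: dict[str, list[str]]) -> str:
    """Pick the worker holding the best KV-cache match for the request."""
    def best_match(prompts: list[str]) -> int:
        return max((shared_prefix_len(request, p) for p in prompts), default=0)
    return max(worker_caches, key=lambda w: best_match(worker_caches[w]))

caches = {
    "gpu-0": ["Translate to French: good morning"],
    "gpu-1": ["Summarise this article: ..."],
}
print(route("Translate to French: good evening", caches))  # -> gpu-0
```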

“To handle hundreds of millions of requests monthly, we rely on NVIDIA GPUs and inference software to deliver the performance, reliability and scale our business and users demand,” said Denis Yarats, CTO of Perplexity AI.

“We look forward to leveraging Dynamo, with its enhanced distributed serving capabilities, to drive even more inference-serving efficiencies and meet the compute demands of new AI reasoning models.”

AI platform Cohere is already planning to leverage NVIDIA Dynamo to enhance the agentic AI capabilities within its Command series of models.

“Scaling advanced AI models requires sophisticated multi-GPU scheduling, seamless coordination and low-latency communication libraries that transfer reasoning contexts seamlessly across memory and storage,” explained Saurabh Baji, SVP of engineering at Cohere.

“We expect NVIDIA Dynamo will help us deliver a premier user experience to our enterprise customers.”

Support for disaggregated serving

The NVIDIA Dynamo inference platform also features robust support for disaggregated serving. This advanced technique assigns the different computational phases of LLMs – including the crucial steps of understanding the user query and then generating the most appropriate response – to different GPUs within the infrastructure.

Disaggregated serving is particularly well-suited for reasoning models, such as the new NVIDIA Llama Nemotron model family, which employs advanced inference techniques for improved contextual understanding and response generation. By allowing each phase to be fine-tuned and resourced independently, disaggregated serving improves overall throughput and delivers faster response times to users.
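
Below is a schematic sketch of the disaggregated pattern, with stub functions standing in for the two phases; it shows the shape of the split, not NVIDIA’s implementation.

```python
# A schematic sketch of disaggregated serving: the prompt-processing
# (prefill) phase runs on one GPU pool and the token-generation (decode)
# phase on another, with the KV cache handed over between them.
# The function bodies are stubs; only the phase split is the point.

def prefill(prompt: str, gpu: str) -> str:
    """Compute-bound phase: process the full prompt, build the KV cache."""
    return f"<kv-cache for {len(prompt)} chars, built on {gpu}>"

def decode(kv_cache: str, gpu: str, max_tokens: int = 4) -> list[str]:
    """Memory-bandwidth-bound phase: generate tokens one at a time."""
    return [f"<token-{i} from {gpu}>" for i in range(max_tokens)]

kv = prefill("Why is the sky blue?", gpu="prefill-pool/gpu-0")
print(decode(kv, gpu="decode-pool/gpu-7"))
```

The split pays off because the two phases stress hardware differently: prefill is compute-bound, while decode is bound by memory bandwidth, so each pool can be sized and tuned for its own bottleneck.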

Together AI, a prominent player in the AI Acceleration Cloud space, is also looking to integrate its proprietary Together Inference Engine with NVIDIA Dynamo. This integration aims to enable seamless scaling of inference workloads across multiple GPU nodes. Furthermore, it will allow Together AI to dynamically address traffic bottlenecks that may arise at various stages of the model pipeline.

“Scaling reasoning models cost effectively requires new advanced inference techniques, including disaggregated serving and context-aware routing,” stated Ce Zhang, CTO of Together AI.

“The openness and modularity of NVIDIA Dynamo will allow us to seamlessly plug its components into our engine to serve more requests while optimising resource utilisation—maximising our accelerated computing investment. We’re excited to leverage the platform’s breakthrough capabilities to cost-effectively bring open-source reasoning models to our users.”

Four key innovations of NVIDIA Dynamo

NVIDIA has highlighted four key innovations within Dynamo that contribute to reducing inference serving costs and enhancing the overall user experience:

  • GPU Planner: A sophisticated planning engine that dynamically adds and removes GPUs based on fluctuating user demand. This ensures optimal resource allocation, preventing both over-provisioning and under-provisioning of GPU capacity (a simplified sketch of the idea follows this list).
  • Smart Router: An intelligent, LLM-aware router that directs inference requests across large fleets of GPUs. Its primary function is to minimise costly GPU recomputations of repeat or overlapping requests, thereby freeing up valuable GPU resources to handle new incoming requests more efficiently.
  • Low-Latency Communication Library: An inference-optimised library designed to support state-of-the-art GPU-to-GPU communication. It abstracts the complexities of data exchange across heterogeneous devices, significantly accelerating data transfer speeds.
  • Memory Manager: An intelligent engine that manages the offloading and reloading of inference data to and from lower-cost memory and storage devices. This process is designed to be seamless, ensuring no negative impact on the user experience.
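
As promised above, here is a toy autoscaling rule in the spirit of the GPU Planner: grow the pool when demand builds up, shrink it when utilisation drops. The thresholds and the doubling policy are illustrative assumptions, not NVIDIA’s.

```python
# A toy autoscaling rule in the spirit of the GPU Planner. Thresholds
# and the doubling/halving policy are illustrative, not NVIDIA's.

def plan_gpu_count(current: int, queue_depth: int, utilisation: float,
                   min_gpus: int = 1, max_gpus: int = 64) -> int:
    """Return the target number of GPUs for the next planning interval."""
    if queue_depth > 100 or utilisation > 0.85:   # under-provisioned
        return min(current * 2, max_gpus)
    if queue_depth == 0 and utilisation < 0.30:   # over-provisioned
        return max(current // 2, min_gpus)
    return current                                # steady state

print(plan_gpu_count(current=8, queue_depth=250, utilisation=0.92))  # -> 16
```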

NVIDIA Dynamo will be made available within NIM microservices and will be supported in a future release of the company’s AI Enterprise software platform. 

See also: LG EXAONE Deep is a maths, science, and coding buff

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Cerebras vs Nvidia: New inference tool promises higher performance

AI hardware startup Cerebras has created a new AI inference solution that could potentially rival Nvidia’s GPU offerings for enterprises.

The Cerebras Inference tool is based on the company’s Wafer-Scale Engine and promises to deliver staggering performance. According to sources, the tool has achieved speeds of 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B. Cerebras claims these speeds not only exceed what the usual hyperscale cloud offerings built around Nvidia’s GPUs can deliver, but are also more cost-efficient.

As Gartner analyst Arun Chandrasekaran put it, this taps into a major shift in the generative AI market. While the market’s focus had previously been on training, it is now shifting towards the cost and speed of inferencing. The shift is driven by the growth of AI use cases in enterprise settings and gives vendors of AI products and services, like Cerebras, an opportunity to compete on performance.

According to Micah Hill-Smith, co-founder and CEO of Artificial Analysis, Cerebras shone in these AI inference benchmarks: its measurements exceeded 1,800 output tokens per second on Llama 3.1 8B and 446 output tokens per second on Llama 3.1 70B, setting new records in both benchmarks.

Cerebras introduces AI inference tool with 20x speed at a fraction of GPU cost.
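
For context, output tokens-per-second figures like those above are typically derived by streaming a completion, counting the generated tokens, and dividing by elapsed time. The sketch below shows that calculation with a stand-in token stream; the actual measurement harnesses used by Cerebras and Artificial Analysis are more involved.

```python
# A sketch of how output tokens-per-second is measured: stream a
# completion, count generated tokens, divide by elapsed time.
# The token stream below is a stand-in for a real API stream.
import time
from typing import Iterable

def output_tokens_per_second(stream: Iterable[str]) -> float:
    """Consume a token stream and return the observed throughput."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in stream)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

fake_stream = iter(["The", " quick", " brown", " fox"] * 250)  # 1,000 tokens
print(f"{output_tokens_per_second(fake_stream):,.0f} tokens/sec")
```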

However, despite the potential performance advantages, Cerebras faces significant challenges in the enterprise market. Nvidia’s software and hardware stack dominates the industry and is widely adopted by enterprises. David Nicholson, an analyst at Futurum Group, points out that while Cerebras’ wafer-scale system can deliver high performance at a lower cost than Nvidia, the key question is whether enterprises are willing to adapt their engineering processes to work with Cerebras’ system.

The choice between Nvidia and alternatives such as Cerebras depends on several factors, including the scale of operations and available capital. Smaller firms are likely to choose Nvidia, since it offers established, widely supported solutions, while larger businesses with more capital may opt for Cerebras to increase efficiency and save on costs.

As the AI hardware market continues to evolve, Cerebras will also face competition from specialised cloud providers, hyperscalers like Microsoft, AWS, and Google, and dedicated inferencing providers such as Groq. The balance between performance, cost, and ease of implementation will likely shape enterprise decisions in adopting new inference technologies.

The emergence of high-speed AI inference capable of exceeding 1,000 tokens per second is comparable to the arrival of broadband internet: it could open a new frontier for AI applications. Cerebras’ 16-bit precision and faster inference capabilities may enable future AI applications in which entire AI agents must operate rapidly, repeatedly, and in real time.

With the growth of the AI field, the market for AI inference hardware is also expanding. Accounting for around 40% of the total AI hardware market, this segment is becoming an increasingly lucrative target within the broader AI hardware industry. However, with more prominent companies occupying the majority of this segment, newcomers should weigh the competitive landscape carefully, given the significant resources required to navigate the enterprise space.

(Photo by Timothy Dykes)

See also: Sovereign AI gets boost from new NVIDIA microservices

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with other leading events including Intelligent Automation Conference, BlockX, Digital Transformation Week, and Cyber Security & Cloud Expo.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Google expands partnership with Anthropic to enhance AI safety

Google has announced the expansion of its partnership with Anthropic to work towards achieving the highest standards of AI safety.

The collaboration between Google and Anthropic dates back to Anthropic’s founding in 2021. Since then, the two companies have worked closely together, with Anthropic building one of the largest Google Kubernetes Engine (GKE) clusters in the industry.

“Our longstanding partnership with Google is founded on a shared commitment to develop AI responsibly and deploy it in a way that benefits society,” said Dario Amodei, co-founder and CEO of Anthropic.

“We look forward to our continued collaboration as we work to make steerable, reliable and interpretable AI systems available to more businesses around the world.”

Anthropic utilises Google’s AlloyDB, a fully managed PostgreSQL-compatible database, for handling transactional data with high performance and reliability. Additionally, Google’s BigQuery data warehouse is employed to analyse vast datasets, extracting valuable insights for Anthropic’s operations.

As part of the expanded partnership, Anthropic will leverage Google’s latest generation Cloud TPU v5e chips for AI inference. Anthropic will use the chips to efficiently scale its powerful Claude large language model, which ranks only behind GPT-4 in many benchmarks.

The announcement comes on the heels of both companies participating in the inaugural AI Safety Summit (AISS) at Bletchley Park, hosted by the UK government. The summit brought together government officials, technology leaders, and experts to address concerns around frontier AI.

Google and Anthropic are also engaged in collaborative efforts with the Frontier Model Forum and MLCommons, contributing to the development of robust measures for AI safety.

To enhance security for organisations deploying Anthropic’s models on Google Cloud, Anthropic is now utilising Google Cloud’s security services. This includes Chronicle Security Operations, Secure Enterprise Browsing, and Security Command Center, providing visibility, threat detection, and access control.

“Anthropic and Google Cloud share the same values when it comes to developing AI–it needs to be done in both a bold and responsible way,” commented Thomas Kurian, CEO of Google Cloud. 

“This expanded partnership with Anthropic – built on years of working together – will bring AI to more people safely and securely, and provides another example of how the most innovative and fastest growing AI startups are building on Google Cloud.”

Google and Anthropic’s expanded partnership promises to be a critical step in advancing AI safety standards and fostering responsible development.

(Photo by charlesdeluvio on Unsplash)

See also: Amazon is building a LLM to rival OpenAI and Google

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with Digital Transformation Week.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

Dave Barnett, Cloudflare: Delivering speed and security in the AI era

AI News sat down with Dave Barnett, Head of SASE at Cloudflare, during Cyber Security & Cloud Expo Europe to delve into how the firm uses its cloud-native architecture to deliver speed and security in the AI era.

According to Barnett, Cloudflare’s cloud-native approach allows the company to continually innovate in the digital space. Notably, a significant portion of their services are offered to consumers for free.

“We continuously reinvent, we’re very comfortable in the digital space. We’re very proud that the vast majority of our customers actually consume our services for free because it’s our way of giving back to society,” said Barnett.

Barnett also revealed Cloudflare’s focus on AI during their anniversary week. The company aims to enable organisations to consume AI securely and make it accessible to everyone. Barnett says that Cloudflare achieves those goals in three key ways.

“One, as I mentioned, is operating AI inference engines within Cloudflare close to consumers’ eyeballs. The second area is securing the use of AI within the workplace, because, you know, AI has some incredibly positive impacts on people … but the problem is there are some data protection requirements around that,” explains Barnett.

“Finally, is the question of, ‘Could AI be used by the bad guys against the good guys?’ and that’s an area that we’re continuing to explore.”

Just a day earlier, AI News heard from Raviv Raz, Cloud Security Manager at ING, during a session at the expo that focused on the alarming potential of AI-powered cybercrime.

Regarding security models, Barnett discussed the evolution of the zero-trust concept, emphasising its practical applications in enhancing both usability and security. Cloudflare’s own journey with zero-trust began with a focus on usability, leading to the development of its own zero-trust network access products.

“We have servers everywhere and engineers everywhere that need to reboot those servers. In 2015, that involved VPNs and two-factor authentication… so we built our own zero-trust network access product for our own use that meant the user experiences for engineers rebooting servers in far-flung places was a lot better,” says Barnett.

“After 2015, the world started to realise that this approach had great security benefits so we developed that product and launched it in 2018 as Cloudflare Access.”

Cloudflare’s innovative strides also include leveraging NVIDIA GPUs to accelerate machine learning AI tasks on an edge network. This technology enables organisations to run inference tasks – such as image recognition – close to end-users, ensuring low latency and optimal performance.

“We launched Workers AI, which means that organisations around the world – in fact, individuals as well – can run their inference tasks at a very close place to where the consumers of that inference are,” explains Barnett.

“You could ask a question, ‘Cat or not cat?’, to a trained cat detection engine very close to the people that need it. We’re doing that in a way that makes it easily accessible to organisations looking to use AI to benefit their business.”
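
The sketch below illustrates that “cat or not cat” workflow: post an image to a nearby edge inference endpoint and read back a classification. The URL and response shape are hypothetical placeholders for illustration, not Cloudflare’s actual Workers AI API.

```python
# A sketch of the "cat or not cat" idea: send an image to a nearby edge
# inference endpoint and read back a classification. The URL and the
# response shape are hypothetical, not Cloudflare's actual API.
import requests

def is_cat(image_path: str) -> bool:
    with open(image_path, "rb") as f:
        resp = requests.post(
            "https://edge.example.com/classify",  # hypothetical edge endpoint
            data=f.read(),
            headers={"Content-Type": "application/octet-stream"},
            timeout=5,  # low latency is the point of running at the edge
        )
    resp.raise_for_status()
    top = resp.json()["predictions"][0]  # e.g. {"label": "cat", "score": 0.97}
    return top["label"] == "cat"

print(is_cat("photo.jpg"))
```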

For developers interested in AI, Barnett outlined Cloudflare’s role in supporting the deployment of machine learning models. While machine learning training is typically conducted outside Cloudflare, the company excels in providing low-latency inference engines that are essential for real-time applications like image recognition.

Our conversation with Barnett shed light on Cloudflare’s commitment to cloud-native architecture, AI accessibility, and cybersecurity. As the industry continues to advance, Cloudflare remains at the forefront of delivering speed and security in the AI era.

(Photo by ryan baker on Unsplash)

See also: JPMorgan CEO: AI will be used for ‘every single process’

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with Cyber Security & Cloud Expo, Edge Computing Expo, and Digital Transformation Week.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

MLPerf Inference v3.1 introduces new LLM and recommendation benchmarks

The latest release of MLPerf Inference introduces new LLM and recommendation benchmarks, marking a leap forward in the realm of AI testing.

The v3.1 iteration of the benchmark suite has seen record participation, boasting over 13,500 performance results and delivering up to a 40 percent improvement in performance. 

What sets this achievement apart is the diverse pool of 26 different submitters and over 2,000 power results, demonstrating the broad spectrum of industry players investing in AI innovation.

Among the list of submitters are tech giants like Google, Intel, and NVIDIA, as well as newcomers Connect Tech, Nutanix, Oracle, and TTA, who are participating in the MLPerf Inference benchmark for the first time.

David Kanter, Executive Director of MLCommons, highlighted the significance of this achievement:

“Submitting to MLPerf is not trivial. It’s a significant accomplishment, as this is not a simple point-and-click benchmark. It requires real engineering work and is a testament to our submitters’ commitment to AI, to their customers, and to ML.”

MLPerf Inference is a critical benchmark suite that measures the speed at which AI systems can execute models in various deployment scenarios. These scenarios span from the latest generative AI chatbots to the safety-enhancing features in vehicles, such as automatic lane-keeping and speech-to-text interfaces.

The spotlight of MLPerf Inference v3.1 shines on the introduction of two new benchmarks:

  • An LLM benchmark utilising the GPT-J reference model to summarise CNN news articles, which garnered submissions from 15 different participants – showcasing the rapid adoption of generative AI (a toy illustration of the summarisation task follows below).
  • An updated recommender benchmark – refined to align more closely with industry practices – employing the DLRM-DCNv2 reference model and larger datasets, which attracted nine submissions.

These new benchmarks are designed to push the boundaries of AI and ensure that industry-standard benchmarks remain aligned with the latest trends in AI adoption, serving as a valuable guide for customers, vendors, and researchers alike.
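
Here is what that summarisation workload looks like in miniature, using the public GPT-J checkpoint via Hugging Face Transformers. This sketch shows the task only; the MLPerf harness, datasets, and test scenarios are separate.

```python
# A toy illustration of the task behind the new LLM benchmark: prompting
# the public GPT-J checkpoint to summarise a news article. This shows
# the workload only, not the MLPerf measurement harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

article = "..."  # a CNN/DailyMail-style news article goes here
prompt = f"Summarize the following article:\n\n{article}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens (the summary itself).
summary = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
print(summary)
```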

Mitchelle Rasquinha, co-chair of the MLPerf Inference Working Group, commented: “The submissions for MLPerf Inference v3.1 are indicative of a wide range of accelerators being developed to serve ML workloads.

“The current benchmark suite has broad coverage among ML domains, and the most recent addition of GPT-J is a welcome contribution to the generative AI space. The results should be very helpful to users when selecting the best accelerators for their respective domains.”

MLPerf Inference benchmarks primarily focus on datacenter and edge systems. The v3.1 submissions showcase various processors and accelerators across use cases in computer vision, recommender systems, and language processing.

The benchmark suite encompasses both open and closed submissions in the performance, power, and networking categories. Closed submissions employ the same reference model to ensure a level playing field across systems, while participants in the open division are permitted to submit a variety of models.

As AI continues to permeate various aspects of our lives, MLPerf’s benchmarks serve as vital tools for evaluating and shaping the future of AI technology.

Find the detailed results of MLPerf Inference v3.1 here.

(Photo by Mauro Sbicego on Unsplash)

See also: GitLab: Developers view AI as ‘essential’ despite concerns

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with Digital Transformation Week.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

NVIDIA sets another AI inference record in MLPerf

NVIDIA has set yet another record for AI inference in MLPerf with its A100 Tensor Core GPUs.

MLPerf consists of five inference benchmarks covering the three main AI applications today: image classification, object detection, and translation.

“Industry-standard MLPerf benchmarks provide relevant performance data on widely used AI networks and help make informed AI platform buying decisions,” said Rangan Majumder, VP of Search and AI at Microsoft.

Last year, NVIDIA led all five benchmarks for both server and offline data centre scenarios with its Turing GPUs. A dozen companies participated.

23 companies participated in this year’s MLPerf, but NVIDIA maintained its lead, with the A100 outperforming CPUs by up to 237x in data centre inference.

For perspective, NVIDIA notes that a single NVIDIA DGX A100 system – with eight A100 GPUs – provides the same performance as nearly 1,000 dual-socket CPU servers on some AI applications.

“We’re at a tipping point as every industry seeks better ways to apply AI to offer new services and grow their business,” said Ian Buck, Vice President of Accelerated Computing at NVIDIA.

“The work we’ve done to achieve these results on MLPerf gives companies a new level of AI performance to improve our everyday lives.”

The widespread availability of NVIDIA’s AI platform through every major cloud and data centre infrastructure provider is unlocking huge potential for companies across various industries to improve their operations.

Interested in hearing industry leaders discuss subjects like this? Attend the co-located 5G Expo, IoT Tech Expo, Blockchain Expo, AI & Big Data Expo, and Cyber Security & Cloud Expo World Series with upcoming events in Silicon Valley, London, and Amsterdam.

NVIDIA’s AI-focused Ampere GPUs are now available in Google Cloud

Google Cloud users can now harness the power of NVIDIA’s Ampere GPUs for their AI workloads.

The specific GPU added to Google Cloud is the NVIDIA A100 Tensor Core which was announced just last month. NVIDIA says the A100 “has come to the cloud faster than any NVIDIA GPU in history.”

NVIDIA claims the A100 boosts training and inference performance by up to 20x over its predecessors. Large AI models like BERT can be trained in just 37 minutes on a cluster of 1,024 A100s.

For those who enjoy their measurements in teraflops (TFLOPS), the A100 delivers around 19.5 TFLOPS in single-precision performance and 156 TFLOPS for Tensor Float 32 workloads.

Manish Sainani, Director of Product Management at Google Cloud, said:

“Google Cloud customers often look to us to provide the latest hardware and software services to help them drive innovation on AI and scientific computing workloads.

With our new A2 VM family, we are proud to be the first major cloud provider to market NVIDIA A100 GPUs, just as we were with NVIDIA T4 GPUs. We are excited to see what our customers will do with these new capabilities.”

The announcement couldn’t have arrived at a better time – with many looking to harness AI for solutions to the COVID-19 pandemic, in addition to other global challenges such as climate change.

Aside from AI training and inference, other things customers will be able to achieve with the new capabilities include data analytics, scientific computing, genomics, edge video analytics, and 5G services.

The new Ampere-based data centre GPUs are now available in alpha on Google Cloud. Users can access instances of up to 16 A100 GPUs, providing a total of 640GB of GPU memory and 1.3TB of system memory.

You can register your interest for access here.

Interested in hearing industry leaders discuss subjects like this? Attend the co-located 5G Expo, IoT Tech Expo, Blockchain Expo, AI & Big Data Expo, and Cyber Security & Cloud Expo World Series with upcoming events in Silicon Valley, London, and Amsterdam.
