‘A virtual DPU within a GPU’: Could clever hardware hack be behind DeepSeek’s groundbreaking AI efficiency?

A new approach called DualPipe seems to be the key to DeekSeek’s success
One expert describes it as an on-GPU virtual DPU that maximizes bandwidth efficiency
While DeepSeek has used Nvidia GPUs only, one wonders how AMD’s Instinct would fare

China’s DeepSeek AI chatbot has stunned the tech industry, representing a credible alternative to OpenAI’s ChatGPT at a fraction of the cost.

A recent paper revealed DeepSeek V3 was trained on a cluster of 2,048 Nvidia H800 GPUs – crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). It reportedly required 2.79 million GPU-hours for pretraining, fine-tuning on 14.8 trillion tokens, and cost – according to calculations made by The Next Platform – a mere $5.58 million.

But exactly how DeepSeek’s developers managed this feat is likely down to a clever hack.

A virtual DPU on the GPU itself

First, some background. DeepSeek is an advanced Mixture-of-Experts (MoE) language model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, features a total of 671 billion parameters, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational costs while maintaining high performance and accuracy – which you’ll see if you try it.

It’s easy to be skeptical of DeepSeek and the claims made regarding its training, but the paper reveals some of the magic the developers came up with to make the most of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.

According to the information published by DeepSeek, DualPipe overlaps forward and backward computation, reduces latency, and optimizes data movement across GPUs. By efficiently managing communication, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (Streaming Multiprocessors) between computation and communication, preventing data transfer bottlenecks as the model scales.

A commenter on The Next Platform describes DualPipe as “essentially creating a virtual DPU on the GPU itself to handle all-to-all communication,” which highlights its role in optimizing data transfer efficiency.

The paper goes into further detail, “In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink.”

Example DualPipe scheduling

Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication. (Image credit: DeekSeek)

https://cdn.mos.cms.futurecdn.net/f6no4XW3TzUhwgzgJVGbBJ-1200-80.jpg

Source link
waynewilliams@onmail.com (Wayne Williams)

I turned the Artemis II mission’s most stunning Earth photo into an iPhone wallpaper — but I needed a little help from AI

Save over $500 on this powerful Alienware 16X Aurora gaming laptop with an RTX 5060

Save over $500 on this powerful Alienware 16X Aurora gaming laptop with an RTX 5060

IKEA just launched the perfect balcony gardening starter kit — 12 essentials from $1.99

Ex-Israeli Intelligence Official: Shockwaves of Trump’s “Take Over Gaza” Heard, Felt Across Region

What UK political parties are promising in the 2019 general election

Otto Warmbier’s parents want North Korea to suffer for their son’s death

Could a ‘youthquake’ cause Boris Johnson to lose the general election?

Anthony Norman Reveals What the Show Left Out

Perez Hilton Has Emergency Surgery for Blood Clot

Big-Name Celebrities Headed to Astronomicon 9 in Ypsilanti

3 Best Movies To Watch On Netflix This Weekend (#1 Is A Scarlett Johansson Sci-Fi Hit That Made 11x Its Budget)

Down Arrow Button Icon

Cuba begins releasing prisoners under scrutiny of rights groups, U.S. govt

The Benefits of Red Light Therapy: Expert-Approved Advice

For some around Trump, war on Iran is a Christian calling

The YouTuber who has become one of Gen Z’s most beloved celebrities

26 last-minute holiday gifts that are still thoughtful and unique

Practicing gratitude regularly can make you less stressed and sleep better

8 things millennials wish you would just stop getting them for the holidays

‘A virtual DPU within a GPU’: Could clever hardware hack be behind DeepSeek’s groundbreaking AI efficiency?

Down Arrow Button Icon

I turned the Artemis II mission’s most stunning Earth photo into an iPhone wallpaper — but I needed a little help from AI

Anthony Norman Reveals What the Show Left Out

Cuba begins releasing prisoners under scrutiny of rights groups, U.S. govt