Alibaba unveils the network and datacenter design it uses for large language model training



Alibaba has revealed its datacenter design for LLM training, which consists of an Ethernet-based network in which each host contains eight GPUs and nine NICs, each with two 200 Gb/sec ports.

The tech giant, which also offers one of the best large language models (LLMs) around via its 110-billion-parameter Qwen model, says this design has been used in production for eight months. It aims to maximize the utilization of a GPU's PCIe capabilities, increasing the send/receive capacity of the network.
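To put those figures in perspective, a quick back-of-envelope calculation (a sketch based only on the numbers stated above: nine NICs per host, two 200 Gb/sec ports per NIC) shows the aggregate network bandwidth available to each eight-GPU host:

```python
# Back-of-envelope aggregate NIC bandwidth per host, using the figures
# reported in the article: 9 NICs per host, 2 ports per NIC, 200 Gb/s per port.
NICS_PER_HOST = 9
PORTS_PER_NIC = 2
PORT_SPEED_GBPS = 200

total_gbps = NICS_PER_HOST * PORTS_PER_NIC * PORT_SPEED_GBPS
print(f"Aggregate per-host bandwidth: {total_gbps} Gb/s "
      f"({total_gbps / 1000:.1f} Tb/s)")
# Aggregate per-host bandwidth: 3600 Gb/s (3.6 Tb/s)
```

That works out to 3.6 Tb/sec of raw network capacity per host, or 400 Gb/sec per NIC, matching the dual-port design described above.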
