By TechThop Team
Posted on: 17 Aug, 2022
InfiniBand is not something that cloud giants like Amazon, Google, and Meta want to be locked into, says Broadcom. Standard Ethernet is what they want.
There has long been talk of a second network emerging in computer networking. The most familiar network is the LAN, which connects client computers to servers. Now a scale-out network has emerged 'behind' that main network to handle AI tasks such as training deep learning programs on thousands of GPUs.
That has led to a critical impasse, according to Broadcom, a vendor of switch silicon. Nvidia, the dominant vendor of GPU chips for deep learning, acquired Mellanox, a vendor of InfiniBand networking technology for interconnecting chips, in 2020.
As Ram Velaga, senior vice president and general manager of Broadcom's Core Switching Group, said in an interview with ZDNet, Nvidia is saying, 'I can sell a GPU for a couple of thousand dollars or I can sell an integrated system for half a million to a million plus.'
The cloud providers, including Amazon and Alphabet's Google, aren't responding well at all, Velaga told ZDNet. As the cloud giants scale their computing resources, their economics dictate cutting costs and avoiding dependence on a single source.
To address this tension, Broadcom proposes embracing Ethernet technology instead of InfiniBand's proprietary approach. On Tuesday, the company announced its Tomahawk 5 switch chip, which can interconnect endpoints at up to 51.2 terabits per second.
'As a result, there is an engagement with us, in which they say, "Look, if the Ethernet ecosystem can offer all the benefits InfiniBand can offer to a GPU interconnect and put it on a mainstream technology like Ethernet, so it can be widely available, and create a large network of connections, it will help people win by utilizing the GPU instead of proprietary networks,"' Velaga said.
The Tomahawk 5 follows the Tomahawk 4, a 25.6-terabit-per-second chip released two years ago. The new part aims to level the playing field by adding capabilities that had previously been reserved for InfiniBand. The key difference is latency, the average time it takes for the first bit of data to travel from point A to point B.
InfiniBand has had an advantage in latency, which matters when going from the GPU to memory and back again to retrieve input data or parameter data for large neural networks. RDMA over Converged Ethernet, or RoCE, now brings InfiniBand-style remote direct memory access to Ethernet. In Broadcom's view, the open RoCE standard can compete with the tight coupling of Nvidia GPUs and InfiniBand.
'As soon as you get RoCE, the InfiniBand advantage disappears. The performance of Ethernet is actually comparable to that of InfiniBand,' said Velaga. 'The big cloud guys are saying, "We want to build our own GPUs, but we don't have an InfiniBand network. If you guys can provide us with an Ethernet-equivalent fabric, the rest can be done ourselves."'
The company believes that as the latency issue resolves, InfiniBand's weaknesses will become apparent, such as its limited ability to support a high number of GPUs. 'As a result of its lack of a distributed architecture, InfiniBand was always limited in scale, maybe to a thousand GPUs,' said Velaga.
The Ethernet switch can also serve Intel and AMD CPUs, so collapsing the networking technology into one approach has certain economic benefits.
The GPU interconnect market is expected to grow the fastest, and over time Velaga expects the split between GPU and CPU interconnect to be 50/50. 'The same technology can be used for both the CPU interconnect and the GPU interconnect. The fact that CPUs are sold much more than GPUs will normalize the volume of the market.' On an Ethernet switch, GPUs will consume the majority of bandwidth, whereas CPUs may consume more ports.
According to Velaga, the switch chip enables AI processing via 256 ports of 200-gigabit-per-second Ethernet, the most of any switch chip. Broadcom claims that such dense 200-gigabit port configurations give AI/ML clusters low latency and flat performance.
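The port figures above line up with the headline bandwidth number. As a quick sanity check, the arithmetic can be sketched in a few lines of Python. The function name `aggregate_tbps` is made up for illustration, and the sketch assumes decimal units (1 terabit = 1,000 gigabits).

```python
def aggregate_tbps(ports: int, gbps_per_port: int) -> float:
    """Total switch bandwidth in terabits per second.

    Assumes decimal units: 1 Tb/s = 1000 Gb/s.
    """
    return ports * gbps_per_port / 1000


# Figures quoted in the article: 256 ports of 200 Gb/s Ethernet on Tomahawk 5.
tomahawk5_tbps = aggregate_tbps(256, 200)

# The article gives 25.6 Tb/s for the previous-generation Tomahawk 4.
tomahawk4_tbps = 25.6
generation_ratio = tomahawk5_tbps / tomahawk4_tbps

print(f"Tomahawk 5 aggregate: {tomahawk5_tbps} Tb/s")
print(f"Increase over Tomahawk 4: {generation_ratio:.1f}x")
```

In other words, 256 ports at 200 gigabits per second each account for the full 51.2 terabits per second, twice the quoted capacity of the Tomahawk 4.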
'It is the big cloud guys who want this,' said Velaga. 'The massive clouds with a lot of buying power have demonstrated their ability to force vendors to disaggregate, and that's the momentum we're riding. These clouds absolutely do not want this [lock-in] and require an Ethernet-capable NIC interface to sell GPUs to them.'