At Supercomputing 2024 (SC24), Enfabrica Corporation unveiled a milestone in AI data center networking: the Accelerated Compute Fabric (ACF) SuperNIC chip. This 3.2 Terabit-per-second (Tbps) Network Interface Card (NIC) SoC redefines large-scale AI and machine learning (ML) operations by enabling massive scalability, supporting clusters of over 500,000 GPUs. Enfabrica, which has raised $115 million in funding, expects to ship the ACF SuperNIC in Q1 2025.

Addressing AI Networking Challenges

As AI models grow increasingly large and sophisticated, data centers face mounting pressures to connect large numbers of specialized processing units, such as GPUs. These GPUs are crucial for high-speed computation in training and inference but are often left idle due to inefficient data movement across existing network architectures. The challenge lies in effectively interconnecting thousands of GPUs to ensure optimal data transfer without bottlenecks or performance degradation.

Traditional networking approaches can link approximately 100,000 AI computing chips in a data center before inefficiencies and slowdowns become significant. According to Enfabrica’s CEO, Rochan Sankar, the company’s new technology supports up to 500,000 chips in a single AI/ML system, enabling larger and more reliable AI model computations. By overcoming the constraints of conventional NIC designs, Enfabrica’s ACF SuperNIC maximizes GPU utilization and minimizes downtime.

Key Innovations in the ACF SuperNIC

The ACF SuperNIC boasts several industry-first features tailored to modern AI data center needs:

  1. High-Bandwidth, Multi-Port Connectivity: The ACF SuperNIC delivers multi-port 800-Gigabit Ethernet to GPU servers, quadrupling the bandwidth compared to other GPU-attached NICs. This setup provides unprecedented throughput and enhances multipath resiliency, ensuring robust communication across AI clusters.
  2. Efficient Two-Tier Network Design: With a high-radix configuration of 32 network ports and up to 160 PCIe lanes, the ACF SuperNIC simplifies the overall architecture of AI data centers. This efficiency allows operators to construct massive clusters using fewer tiers, reducing latency and improving data transfer efficiency across GPUs.
  3. Scaling Up and Scaling Out: With its high-radix, high-bandwidth design and concurrent PCIe/Ethernet multipathing and data-mover capabilities, the ACF SuperNIC can scale both up and out across four to eight latest-generation GPUs per server system. This significantly increases the performance, scale, and resiliency of AI clusters, ensuring optimal resource utilization and network efficiency.
  4. Integrated PCIe Interface: The chip supports 128 to 160 PCIe lanes, delivering speeds over 5 Tbps. This design allows multiple GPUs to connect to a single CPU while maintaining high-speed communication with data center spine switches. The result is a more efficient and flexible layout that supports large-scale AI workloads.
  5. Resilient Message Multipathing (RMM): Enfabrica’s proprietary RMM technology boosts the reliability of AI clusters. By mitigating the impact of network link failures or flaps, RMM prevents job stalls, ensuring smoother and more efficient AI training processes. Sankar notes the importance of this feature, especially in large deployments where failures of links to switches become frequent.
  6. Software-Defined RDMA Networking: This unique feature empowers data center operators with full-stack programmability and debuggability, bringing the benefits of software-defined networking (SDN) into Remote Direct Memory Access (RDMA) setups. It allows customization of the transport layer, which can optimize cloud-scale network topologies without sacrificing performance.
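The failover behavior described in the RMM item above can be illustrated with a short sketch. This is a generic, conceptual model of multipath failover (spreading messages across healthy links and rerouting when one fails), not Enfabrica's actual RMM protocol, whose internals are proprietary; the function and link names here are invented for illustration.

```python
def send_with_multipath(message_id: int, links: dict) -> str:
    """Pick a healthy link for a message; reroute away from failed links.

    `links` maps link name -> True (up) / False (down). Conceptual sketch
    only -- not Enfabrica's RMM implementation.
    """
    healthy = [name for name, up in links.items() if up]
    if not healthy:
        raise RuntimeError("no healthy links: the training job would stall")
    # Hash-based spreading keeps the per-message path choice deterministic.
    return healthy[message_id % len(healthy)]

# Four GPU-to-switch links, all initially up.
links = {"eth0": True, "eth1": True, "eth2": True, "eth3": True}
first_path = send_with_multipath(5, links)   # -> "eth1"

links["eth1"] = False                        # simulate a link flap
# Traffic transparently shifts to the remaining healthy links,
# so the job keeps running instead of stalling.
rerouted = send_with_multipath(5, links)     # -> one of eth0/eth2/eth3
```

The design point this models is that a job only stalls when *every* path fails, rather than when any single link does.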

Enhanced Resiliency and Efficiency

Traditional systems often require one-to-one connections between GPUs and various components, such as PCIe switches and RDMA NICs. However, as the number of GPUs in a system increases, the risk of failures in the links to switches grows, with potential disruptions occurring as often as every 23 minutes in setups with over 100,000 GPUs, according to Sankar.
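The 23-minute figure is consistent with a simple back-of-envelope calculation, assuming roughly one GPU-to-switch link per GPU and independent failures (both assumptions are ours, not stated by Enfabrica):

```python
# If the whole cluster sees one link failure every 23 minutes,
# what does that imply about each individual link?
num_links = 100_000                       # ~one link per GPU (assumption)
cluster_failure_interval_min = 23         # one failure somewhere every 23 min

per_link_mtbf_min = cluster_failure_interval_min * num_links
per_link_mtbf_years = per_link_mtbf_min / (60 * 24 * 365)

print(round(per_link_mtbf_years, 1))      # ≈ 4.4 years per link
```

In other words, even quite reliable individual links (a mean time between failures of several years) produce near-constant failures at 100,000-GPU scale, which is why per-link redundancy matters.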

The ACF SuperNIC addresses this issue by enabling multiple connections from GPUs to switches. This redundancy minimizes the impact of individual component failures, boosting system uptime and reliability.

The SuperNIC also introduces the Collective Memory Zoning feature, which supports zero-copy data transfers and optimizes host memory management. By reducing latency and enhancing memory efficiency, this technology maximizes the floating-point operations per second (FLOPs) utilization of GPU server fleets.
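"Zero-copy" here means data moves between devices without being duplicated through intermediate host buffers. The snippet below illustrates the general semantics using Python's `memoryview`; it is a teaching analogy only and has nothing to do with Enfabrica's Collective Memory Zoning implementation, which is not publicly documented in detail.

```python
import sys

# Pretend this bytearray is a large tensor sitting in host memory.
payload = bytearray(64 * 1024 * 1024)     # 64 MiB

copied = bytes(payload)                   # classic path: duplicates the data
view = memoryview(payload)                # zero-copy path: shares the buffer

assert sys.getsizeof(copied) > len(payload)   # a full second copy exists
assert view.obj is payload                    # the view moved no data

view[0] = 1                               # writes through to the original
assert payload[0] == 1
```

The efficiency argument is the same at data-center scale: every avoided copy is memory bandwidth returned to the GPUs, which is what lifts effective FLOPs utilization.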

Scalability and Operational Benefits

The ACF SuperNIC’s design is not only about scale but also about operational efficiency. It provides a software stack that integrates with standard communication libraries, existing interfaces, and RDMA networking operations. This compatibility ensures efficient deployment across diverse AI compute environments composed of GPUs and accelerators (AI chips) from different vendors. Data center operators benefit from streamlined networking infrastructure, reducing complexity and enhancing the flexibility of their AI data centers.

Availability and Future Prospects

Enfabrica’s ACF SuperNIC will be available in limited quantities in Q1 2025, with both the chips and pilot systems now open for orders through Enfabrica and selected partners. As AI models demand higher performance and larger scales, Enfabrica’s innovative approach could play a pivotal role in shaping the next generation of AI data centers designed to support Frontier AI models.

Filed in Computers.
