The NVIDIA DGX POD reference architecture combines DGX A100 systems, networking, and storage into fully integrated offerings that are verified and ready to deploy. NetApp and NVIDIA have partnered to deliver industry-leading AI solutions. As an NVIDIA partner, NetApp offers two solutions for DGX A100 systems: one based on NetApp® ONTAP® AFF and the other based on the NetApp EF-Series EF600 with BeeGFS. If your enterprise plans to run many distributed jobs that use GPUs or partial vGPUs on an as-needed basis, and you plan to use NFS and the rich data management functionality available in ONTAP, the AFF solutions are a great fit. If you run fewer jobs that use GPUs for long-running training operations and require the extreme performance of a parallel file system, consider the E-Series solutions. Each solution is accompanied by a reference architecture that reports observed bandwidth, IOPS, and training performance under defined test conditions.
NVIDIA® DGX POD™
The system architecture
First, a deeper dive into the DGX A100 system architecture. Recall that the system has eight ConnectX-6 HCAs dedicated to the compute fabric that links DGX A100 systems to one another, plus one or two dual-ported ConnectX-6 HCAs dedicated to external storage traffic. In a real-world deployment, therefore, the upper limit on a single DGX A100 system's storage throughput is the line rate of those one or two dual-ported ConnectX-6 HCAs. NetApp's approach to validation in the lab is based on these real-world use cases.
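To make that ceiling concrete, here is a minimal sketch of the arithmetic, assuming each ConnectX-6 port runs HDR InfiniBand at 200 Gb/s (a typical line rate for this adapter, not a figure stated above):

```python
# Hedged sketch: theoretical upper bound on single-DGX-A100 storage
# throughput, assuming 200 Gb/s HDR InfiniBand per ConnectX-6 port.
# The line rate is an assumption for illustration.

PORT_GBPS = 200       # assumed line rate per port, in gigabits per second
PORTS_PER_HCA = 2     # dual-ported ConnectX-6 storage HCA

def max_storage_throughput_GBps(num_storage_hcas: int) -> float:
    """Aggregate line-rate ceiling across storage HCAs, in gigabytes/s."""
    total_gbits = num_storage_hcas * PORTS_PER_HCA * PORT_GBPS
    return total_gbits / 8  # convert gigabits to gigabytes

print(max_storage_throughput_GBps(1))  # one storage HCA  -> 50.0 GB/s
print(max_storage_throughput_GBps(2))  # two storage HCAs -> 100.0 GB/s
```

Real-world throughput will land below these line-rate ceilings once protocol overhead and the storage system itself are accounted for, which is why the reference architectures report observed results rather than theoretical maxima.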
Deep learning training
In this process, an external storage system serves data to multiple DGX A100 systems that work in concert, keeping NVLinks (and ultimately GPUs) busy using NVIDIA Collective Communications Library (NCCL) for distributed deep learning (DL) training and high-performance computing.
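The collective at the heart of NCCL-based distributed training is all-reduce: every worker contributes its local gradients, and every worker receives the combined result. A minimal pure-Python illustration of the sum-all-reduce semantics (worker count and gradient values are illustrative, not from the text; NCCL itself implements this with optimized ring and tree algorithms over NVLink and the network):

```python
# Illustrative sketch of all-reduce (sum) semantics, the collective NCCL
# provides for synchronizing gradients across workers in distributed
# deep learning training. Values and worker count are hypothetical.

def all_reduce_sum(worker_grads):
    """Element-wise sum across all workers; every worker gets the full result."""
    n = len(worker_grads[0])
    summed = [sum(g[i] for g in worker_grads) for i in range(n)]
    return [list(summed) for _ in worker_grads]  # one identical copy per worker

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 workers, 2 parameters each
print(all_reduce_sum(grads))  # [[9.0, 12.0], [9.0, 12.0], [9.0, 12.0]]
```

Because every training step ends with such a collective, the GPUs stall whenever gradient exchange or input data delivery lags, which is why both the compute fabric and the external storage path are sized so carefully in these architectures.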

Modular solutions for multiple users and diverse workloads.
Reference architectures for quick deployment and quick updates.
Easy-to-manage configurations that allow workload orchestration.