NVIDIA® DGX POD™

The NVIDIA DGX POD reference architecture combines DGX A100 systems, networking, and storage solutions into fully integrated offerings that are verified and ready to deploy. NetApp and NVIDIA have partnered to deliver industry-leading AI solutions. As an NVIDIA partner, NetApp offers two solutions for DGX A100 systems: one based on NetApp® ONTAP® AFF and the other based on NetApp EF-Series EF600 with BeeGFS. If your enterprise plans to run many distributed jobs using GPUs or partial vGPUs on an as-needed basis, and if you plan to use NFS and the rich data management functionality available in ONTAP, the AFF solution is a great fit. If you have fewer jobs using GPUs for long-running training operations and require the extreme performance of a parallel file system, consider the E-Series solution. Both solutions are accompanied by a reference architecture that includes observed bandwidth, IOPS, and training performance results under defined testing conditions.

The system architecture

First, a deeper dive into the DGX A100 system architecture. Recall that each DGX A100 system has 8 ConnectX-6 HCAs intended for inter-DGX compute traffic, plus 1 or 2 dual-ported ConnectX-6 HCAs intended for external storage traffic. This means that for a real-world use case, the upper limit of single DGX A100 system storage throughput is the throughput of 1 or 2 dual-ported ConnectX-6 HCAs. NetApp's approach to validation in the lab is based on real-world use cases.
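The storage ceiling described above can be estimated with simple arithmetic. The sketch below assumes HDR InfiniBand at 200 Gb/s per port (the actual link speed depends on the configuration) and dual-ported HCAs; it ignores protocol overhead, so real observed throughput will be lower.

```python
# Back-of-the-envelope ceiling for single DGX A100 storage throughput.
# Assumptions (not from the source text): HDR InfiniBand at 200 Gb/s per
# port, dual-ported ConnectX-6 HCAs, no protocol overhead.

HDR_GBPS = 200      # gigabits per second per port (assumed HDR)
PORTS_PER_HCA = 2   # dual-ported ConnectX-6

def storage_ceiling_gb_per_s(num_storage_hcas: int) -> float:
    """Theoretical line-rate ceiling in gigabytes per second."""
    total_gigabits = num_storage_hcas * PORTS_PER_HCA * HDR_GBPS
    return total_gigabits / 8  # bits -> bytes

for hcas in (1, 2):
    print(f"{hcas} storage HCA(s): ~{storage_ceiling_gb_per_s(hcas):.0f} GB/s line rate")
```

With these assumptions, one dual-ported HCA caps out around 50 GB/s of line rate and two around 100 GB/s, which is why the validation focuses on that range rather than the compute fabric's aggregate bandwidth.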

Deep learning training

In this process, an external storage system serves data to multiple DGX A100 systems that work in concert, keeping NVLinks (and ultimately GPUs) busy using NVIDIA Collective Communications Library (NCCL) for distributed deep learning (DL) training and high-performance computing.
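The core collective that NCCL runs during distributed DL training is an all-reduce: every worker contributes its local gradient and receives the element-wise average back. The following is a pure-Python simulation of that semantics for illustration only; it does not use the real NCCL library, and all names in it are hypothetical.

```python
# Illustrative simulation (pure Python, no real NCCL or GPUs) of the
# averaging all-reduce used in distributed data-parallel training:
# every worker contributes a local gradient vector, and every worker
# receives the same element-wise average.

def allreduce_average(worker_grads):
    """Return each worker's gradient list after an averaging all-reduce."""
    n = len(worker_grads)
    summed = [sum(vals) for vals in zip(*worker_grads)]   # element-wise sum
    averaged = [s / n for s in summed]                    # divide by worker count
    return [list(averaged) for _ in range(n)]             # identical copy per worker

# Four hypothetical workers, each holding a local 3-element gradient.
grads = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0],
         [2.0, 2.0, 2.0],
         [2.0, 2.0, 2.0]]

result = allreduce_average(grads)
print(result[0])  # every worker ends up with [2.0, 2.0, 2.0]
```

In production, NCCL implements this collective over NVLink within a system and over the InfiniBand compute fabric between systems, which is why keeping the storage path fast enough to feed the GPUs is the limiting design concern here.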


Access our AI lab

Experience our HPC and AI solutions: modular solutions for multiple users and diverse workloads, reference architectures for quick deployment and quick updates, and easy-to-manage configurations that allow workload orchestration.
