Resource configuration for multi-node multi-GPU simulations

The FDTD solver in Ansys Lumerical FDTD™ supports multi-node multi-GPU simulations for Business and Enterprise licenses. Multi-node multi-GPU simulations distribute the simulation over multiple compute nodes, which allows simulations with large memory requirements to run and shortens simulation time by using the compute power of multiple GPUs.

This article discusses requirements and additional setup instructions for multi-node multi-GPU simulations. For general instructions on setting up GPU simulations, see the Knowledge Base article on Resource configuration for single node GPU simulations.

Requirements

Multi-node multi-GPU requires the following:

Ansys Lumerical Business or Enterprise license
Compatible Linux operating system running CUDA-aware OpenMPI

Note: Starting in 2026 R1.1, it is possible to run multi-node multi-GPU simulations on all MPIs compatible with Lumerical. However, a CUDA-aware OpenMPI, which is only available on Linux, is strongly recommended to maximize the performance.

Resource configuration – non-CUDA aware MPI

Multi-node multi-GPU simulation using a non-CUDA-aware MPI follows the exact same procedure as distributed simulations using CPU, except with GPU as the resource type.

Please see the knowledge base article Distributed computing for further information on setting up your MPI.

Resource configuration – CUDA-aware OpenMPI

Simulation with a CUDA-aware OpenMPI requires further installation and configuration.

First, ensure that you have set up a version of OpenMPI that is CUDA-aware. General instructions on setting up OpenMPI are in the article Running simulations with MPI on Linux, but you need to adapt them to ensure that the OpenMPI version you configure is built with CUDA-aware support.
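One way to confirm that an OpenMPI build is CUDA-aware is to inspect the output of its ompi_info utility, which reports an mpi_built_with_cuda_support parameter. The sketch below is a minimal illustration, not part of Lumerical: the function names are hypothetical, and it assumes the ompi_info binary of the OpenMPI installation you intend to use is on the PATH (or that you pass its full path).

```python
import subprocess


def is_cuda_aware(ompi_info_output):
    """Return True if parsable ompi_info output reports CUDA support."""
    for line in ompi_info_output.splitlines():
        # Expected form:
        # mca:mpi:base:param:mpi_built_with_cuda_support:value:true
        if "mpi_built_with_cuda_support:value:" in line:
            return line.strip().endswith(":true")
    # Parameter not reported at all: treat the build as not CUDA-aware.
    return False


def query_ompi_info(ompi_info_path="ompi_info"):
    """Run ompi_info and return its parsable output (requires OpenMPI)."""
    result = subprocess.run(
        [ompi_info_path, "--parsable", "--all"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example, on a machine with OpenMPI installed:
#     print(is_cuda_aware(query_ompi_info()))
```

Run the check on every compute node, since each node may have a different OpenMPI installation.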

After setting up your OpenMPI, you can add it to the Resource Configuration window using the following instructions:

1. Open the Resource Configuration window from the File tab in Lumerical.
2. Add a GPU or GPU Custom resource.
3. Open the Resource advanced option window by clicking Edit, and set up OpenMPI by changing the Job launching preset field to Remote: OpenMPI. See this Knowledge Base article for further information.

Note: Various fields, such as the MPI executable and MPI library path, need to be set up differently than outlined in the article linked above. These options are covered in Step 5. For this step, only change the Job launching preset.
4. Set the Nodes and Processes in the window. The list of processes corresponds to slots for each node. For multi-node multi-GPU computation, ensure that the number of GPUs on each node is equal to or larger than the number of slots assigned to that node. Further information on distributed computing, including how to set up each node and general MPI resource configuration, can be found in this Knowledge Base article.
5. Set up the executable, library, and Additional environment variables so that they point to the locations of the CUDA-aware MPI.
6. Set up extra command line options as needed for optimal performance. These depend on the OpenMPI setup, and the section below discusses the details.
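The slot rule in the steps above (no more slots on a node than that node has GPUs) can be checked before launching a job. The sketch below is illustrative only: the function names and hostnames are hypothetical, and it assumes you collect the output of nvidia-smi -L on each node yourself, where each physical GPU appears on a line starting with "GPU ".

```python
def count_gpus(nvidia_smi_list_output):
    """Count GPUs in `nvidia-smi -L` output (one 'GPU n: ...' line each)."""
    return sum(
        1
        for line in nvidia_smi_list_output.splitlines()
        if line.startswith("GPU ")
    )


def oversubscribed_nodes(slots, gpus):
    """Return the nodes whose assigned slots exceed their GPU count."""
    return sorted(
        node
        for node, n_slots in slots.items()
        if n_slots > gpus.get(node, 0)
    )


# Hypothetical two-node setup: node2 is assigned more slots than it has GPUs.
slots = {"node1": 2, "node2": 4}
gpus = {"node1": 2, "node2": 2}
print(oversubscribed_nodes(slots, gpus))
```

An empty result means every node has at least as many GPUs as assigned slots.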

Basic resource configuration for multi-node multi-GPU simulation is now complete, and you can proceed to set up the extra command line options as needed.

Note: When you use the license estimation utility for multi-node multi-GPU simulations, ensure that the total number of SMs across all nodes is entered into the SM estimate box.
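As a small worked example of the total in the note above, the sketch below sums the streaming multiprocessor (SM) counts of every GPU on every node. The hostnames and per-GPU SM counts are illustrative assumptions; substitute the values for your own hardware.

```python
def total_sm_count(sms_per_gpu_by_node):
    """Sum the SM counts of all GPUs across all nodes."""
    return sum(sum(sm_counts) for sm_counts in sms_per_gpu_by_node.values())


# Hypothetical cluster: two nodes, each with two GPUs of 108 SMs.
cluster = {
    "node1": [108, 108],
    "node2": [108, 108],
}
print(total_sm_count(cluster))  # 4 GPUs x 108 SMs = 432
```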

Additional command line flags for CUDA-aware OpenMPI setups

When running FDTD GPU simulations distributed across multiple nodes, it may be necessary to set additional command line flags, depending on how the MPI is set up on your compute nodes. We provide a few configuration examples below.

Open MPI functions through the Modular Component Architecture (MCA), which allows you to select different messaging layers, transport layers and protocols.
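To see which MCA components a given OpenMPI build actually provides, you can inspect the component list printed by ompi_info. The sketch below is a hedged illustration: the function name is hypothetical, and it assumes the usual human-readable ompi_info layout in which each component appears on a line such as "MCA btl: tcp (MCA v2.1.0, ...)". Check this against the output of your own installation.

```python
import re

# Matches lines like: "    MCA btl: tcp (MCA v2.1.0, API v3.1.0, ...)"
_COMPONENT_LINE = re.compile(r"^\s*MCA\s+(\S+):\s+(\S+)\s+\(")


def mca_components(ompi_info_output, framework):
    """Return the set of component names listed for one MCA framework."""
    found = set()
    for line in ompi_info_output.splitlines():
        match = _COMPONENT_LINE.match(line)
        if match and match.group(1) == framework:
            found.add(match.group(2))
    return found


# Example, given the text printed by `ompi_info` on a compute node:
#     mca_components(output, "pml")   # available point-to-point layers
#     mca_components(output, "btl")   # available byte transport layers
```

If a component you plan to select is missing from the result, that OpenMPI build was most likely configured without it.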

UCX point-to-point message layer

We recommend using the Unified Communication – X Framework (UCX) library as the Point-to-point Message Layer (PML), which offers high inter-process communication efficiency. The setup process is more complex than for the “ob1” PML (discussed below); however, the UCX PML offers the highest performance potential for FDTD multi-node multi-GPU simulations.

Prior to using UCX, carefully review the NVIDIA Unified Communication – X Framework Library documentation and the Open UCX documentation for limitations and required flags.

An example command line flag list to use UCX is shown below.

--mca pml ucx -x UCX_MEMTYPE_CACHE=0 -x HCOLL_GPU_CUDA_MEMTYPE_CACHE_ENABLE=0 -x HCOLL_GPU_ENABLE=1
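In the flag list above, --mca pml ucx selects the UCX PML, and each -x NAME=VALUE pair exports an environment variable to the launched MPI processes. The sketch below shows one way to assemble such a list as a single line from a table of variables. The helper name is hypothetical, and the variable values are simply those from the example above rather than a recommendation for every cluster.

```python
def ucx_flags(env):
    """Build a single-line OpenMPI flag list selecting the UCX PML."""
    parts = ["--mca", "pml", "ucx"]
    for name, value in env.items():
        # -x exports NAME=VALUE to the environment of every MPI process.
        parts += ["-x", f"{name}={value}"]
    return " ".join(parts)


flags = ucx_flags({
    "UCX_MEMTYPE_CACHE": 0,
    "HCOLL_GPU_CUDA_MEMTYPE_CACHE_ENABLE": 0,
    "HCOLL_GPU_ENABLE": 1,
})
print(flags)
```

The resulting string must be entered as one line, with no line break inside it.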

Warning: When using UCX, ensure that the number of processes on each node is equal to the number of GPUs on that node; otherwise, an error may occur.

“ob1” point-to-point message layer

As an alternative to UCX, you can use the “ob1” PML and a set of Byte Transport Layers (BTL) when launching the MPI application. The ob1 PML is easier to set up than UCX and should always be available. However, its efficiency is limited by the capabilities of its BTLs, and its performance ceiling is lower than that of UCX.

An example command line flag list is provided below for the ob1 PML setting.

--mca pml ob1 --mca btl self,vader,smcuda,tcp