In this article I will describe a few simple strategies you can use to improve your simulation performance. The basic strategies might require a little testing and some forethought, but these techniques are straightforward and should pay off for massive jobs, multi-parameter sweeps, and various optimization schemes.
For more information on high performance computing, how hardware impacts simulation performance, and how to optimize AWS instances, see these posts:
- High Performance Computing
- FDTD Simulation Benchmark
- Information on Hardware Specification
- Resource configuration elements and controls
More Efficient Simulations
1. Improve Simulation Set-up
This means reducing the simulation requirements by adjusting the mesh size (the largest mesh step, and therefore time step, that gives reasonable results), employing available symmetry, or reducing the amount of data that monitors collect. By doing so we can ensure unnecessary operations are eliminated or at least minimized. The most effective thing to consider is whether you can reduce the spatial and temporal resolution of the simulation, as the solve time scales as

$$ \text{time} \propto \frac{V}{dx^{\,D+1}} $$

where D is the dimension, dx is the mesh size, and V is the simulation volume. These are usually set automatically based on the shortest wavelength and the mesh accuracy setting. Reducing the highest source frequency, using a lower mesh accuracy, or shrinking the simulation volume will improve performance, but this must be balanced against accuracy requirements. Perform convergence testing to find the right tradeoff between accuracy and performance, as sketched below.
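As a rough illustration, a convergence sweep over the mesh accuracy setting might look like the following. This is a minimal sketch: the project file name, monitor name, and the choice of transmission as the figure of merit are placeholders for your own set-up.

```lsf
# sweep the mesh accuracy setting and record a figure of merit
accs = 1:5;                                # mesh accuracy values to test
T = matrix(length(accs));                  # figure of merit for each run

for (i = 1:length(accs)) {
    load("my_simulation.fsp");             # placeholder project file
    setnamed("FDTD", "mesh accuracy", accs(i));
    run;
    T(i) = mean(transmission("T_monitor")); # placeholder monitor name
    switchtolayout;
}
plot(accs, T, "mesh accuracy", "transmission");
```

Once the figure of merit stops changing appreciably, the lower accuracy settings are safe to use for the rest of the study.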
If you can reduce the amount of data monitors collect, by eliminating some monitors, making them smaller, or minimizing the number of frequency points, this can help. Advanced settings allow you to specify which fields to collect and whether to down sample the spatial resolution. Point frequency and time monitors are rarely prohibitive, but give some thought to what is actually necessary. Movie monitors can be quite useful for building intuition and debugging, but they add work at every time step; they should not be used when performance is important.
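For example, trimming a frequency-domain monitor from script might look like this. The property names below are typical for frequency-domain field and power monitors, but they vary by monitor type and product version, so treat them as assumptions and check the monitor's property list.

```lsf
# reduce the data collected by a frequency-domain monitor
select("T_monitor");           # placeholder monitor name
set("frequency points", 5);    # only the frequencies you actually need
set("output Hx", 0);           # drop unneeded field components
set("output Hy", 0);
set("output Hz", 0);
set("down sample x", 2);       # coarsen the spatial resolution of saved data
set("down sample y", 2);
```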
2. Use CPU Resources Effectively
Distributed computing allows us to split large FDTD simulation jobs across separate processors or cores using a message passing interface (MPI).
- Slice a simulation into multiple spatial units that can be run in parallel, with field values at the boundaries exchanged at each time step.
- Supports two different concurrency mechanisms:
i) launching multiple processes (separate copies of the solver executable);
ii) having each process spawn multiple threads.
If you click on the resource button in the top menu bar of FDTD Solutions, the resource configuration window will open; there you will find the concurrency settings for each machine. As you can see here, the FDTD solver will decompose the simulation into 4 processes with 4 threads each.
This would run a single simulation on a 16-core machine. The important thing to remember is that the number of threads times the number of processes (times the number of concurrent simulations) should equal the total number of CPU cores available on a given machine. This will ensure that all the CPU cores are kept busy.
threads × processes × simulations = N cores
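The same settings can be applied from script with setresource, which is convenient for the benchmarking below. A minimal sketch for the 16-core example above (resource 1 is the first entry in the resource list):

```lsf
# first FDTD resource: 4 processes x 4 threads = 16 cores
setresource("FDTD", 1, "processes", 4);
setresource("FDTD", 1, "threads", 4);
?getresource("FDTD", 1, "processes");   # verify the setting
```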
3. Running Independent Simulations in Parallel
Parallelization here means using either separate processors, or separate CPU cores within a single processor, to run multiple simulations at the same time. This is very helpful when running sweeps, or if you want to use the Lumerical job manager to queue work. The following resource configuration provides a good example.
Here we have a 32-core local machine which is set up to run 2 simulations in parallel, each using 4 processes with 4 threads. Additionally, there is a remote resource 'compute_node' with 64 cores on the local network. It is set up with the same number of threads and processes, and will launch 4 simulations. This configuration will run 6 simulations in parallel and consume 3 FDTD engine licenses, thanks to the license sharing feature.
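A script version of this configuration might look like the following sketch. The hostname 'compute_node' is taken from the example above, and how remote resources are addressed depends on your MPI and job-launching set-up, so treat the hostname property name as an assumption.

```lsf
# local 32-core machine: 2 concurrent jobs, 4 processes x 4 threads each
setresource("FDTD", 1, "processes", 4);
setresource("FDTD", 1, "threads", 4);
setresource("FDTD", 1, "capacity", 2);

# remote 64-core node: same split, 4 concurrent jobs
addresource("FDTD");                                  # appends a new resource
setresource("FDTD", 2, "hostname", "compute_node");   # assumed property name
setresource("FDTD", 2, "processes", 4);
setresource("FDTD", 2, "threads", 4);
setresource("FDTD", 2, "capacity", 4);
```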
The discussion here is limited to Lumerical's job manager; however, it is also possible to use a third-party job scheduler to distribute jobs across a cluster.
4. Optimize Resource Configuration and Hardware
At a certain point, distributing your simulation across more cores will not result in significant increases in the solving rate. There tends to be a plateau, which is particular to your simulation set-up. Once you have reached the point of diminishing returns, you can safely use the excess cores for other tasks or for parallelization. Having multiple processors and processor cores defined allows the operating system to run multiple jobs on separate cores without task switching.
For some workflows you may want to test various configurations, MPI implementations, and machines to maximize throughput.
Benchmarking
The most efficient resource configuration will depend on your simulation set-up, simulation volume, and the hardware it is running on; therefore, there is no general rule to maximize throughput. That being said, you can easily perform your own benchmarks using the script files attached to this article. For workflows that require a significant number of simulations, running a few benchmarks will help you decide how to distribute the workload to save time and use licenses effectively.
As of 2022 R1.3, performance metrics are available as a result in the FDTD solver object. This allows you to easily access the results from script. If you are testing resource configurations in previous releases, the information is only available in the simulation log file.
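In 2022 R1.3 or later, accessing these metrics from script might look like the following sketch. The result name used here is an assumption; list the available results first and adjust accordingly.

```lsf
run;
?getresult("FDTD");            # list the results the solver object exposes
# read the performance metrics (result name assumed; 2022 R1.3 or later)
metrics = getresult("FDTD", "performance metrics");
?metrics;                      # inspect solve rate and timing breakdown
```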
The simulation log file provides a lot of helpful information. It tells you exactly how the FDTD volume was partitioned by MPI, and gives the solving rate in mega-nodes per second, i.e. how many millions of grid nodes are updated per second. You can also find a breakdown of the time spent on various processes, as well as debugging information.
1. Increase Core Count by Adding Processes
The simplest and most straightforward way to improve performance is to increase the number of processes while keeping the threads fixed at 1. By default, FDTD will use all available cores. If we run the script FDTD_bench_core.lsf with an example file, we get the following results.
As expected, the solving rate increases and the simulation time decreases as we add cores. Notice, however, that the scaling is not linear: increasing the core count 4x, from 6 to 24, roughly doubles the solving rate from 120 to 260 Mnodes/s. You will eventually see diminishing returns as you add more cores, but where that point lies depends on the simulation. Since users are more often limited by licenses than by cores, using all available cores is typically the best practice.
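The idea behind FDTD_bench_core.lsf can be sketched as a simple loop; this is not the attached script itself, and the benchmark file name is a placeholder:

```lsf
# sweep the MPI process count with threads fixed at 1
procs = [2, 4, 8, 16, 24];
for (i = 1:length(procs)) {
    load("benchmark.fsp");                      # placeholder file name
    setresource("FDTD", 1, "processes", procs(i));
    setresource("FDTD", 1, "threads", 1);
    run;
    # record the solve rate: from the performance metrics result in
    # 2022 R1.3+, or from the simulation log file in earlier releases
    switchtolayout;
}
```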
2. Threads vs Cores
With a given number of cores, we can then vary the threads and processes to see if we can improve performance. In this case we use all 28 cores on the machine, but you could choose any number of cores for this test, ensuring that threads times processes equals this fixed number. If we do that with the attached benchmark file, running the script FDTD_bench_thread.lsf, we get the following results.
In this case the maximum solving rate is achieved with 28 processes and 1 thread. Slightly worse but comparable results are achieved with 7 processes of 4 threads each, while 14 processes with 2 threads each performs worse. You may see different results, but ultimately this is a bit of fine tuning that doesn't usually achieve the same gains as simply adding cores.
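FDTD_bench_thread.lsf follows the same pattern, pairing process and thread counts that multiply to the fixed core budget (28 in this example); a sketch with a placeholder file name:

```lsf
# fixed budget of 28 cores: (processes, threads) pairs with p*t = 28
procs   = [28, 14, 7];
threads = [ 1,  2, 4];
for (i = 1:length(procs)) {
    load("benchmark.fsp");                      # placeholder file name
    setresource("FDTD", 1, "processes", procs(i));
    setresource("FDTD", 1, "threads", threads(i));
    run;
    switchtolayout;             # compare the solve rates between runs
}
```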
3. Better Throughput with Concurrency
By running jobs in parallel, i.e. concurrently, we can get through more jobs in the same amount of time. This can be helpful for large sweeps or optimizations. As we saw in step 1, we don't get a 4x improvement with 4x the cores. If you run 4 simulations in parallel, each using a quarter of the available cores, in many cases you will get through the 4 simulations faster than running them sequentially with all the cores.
In the following plot, we use all available cores, but increase the capacity and reduce the number of cores per simulation by the corresponding amount. An example script, FDTD_bench_capacity.lsf, is included.
We see that the per-simulation performance is worse, but the concurrency effect is stronger, resulting in better overall throughput.
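The idea behind FDTD_bench_capacity.lsf can be sketched with the job queue; a 32-core machine and a placeholder file name are assumed:

```lsf
# keep all 32 cores busy while trading cores per job for concurrency
caps = [1, 2, 4, 8];
for (i = 1:length(caps)) {
    setresource("FDTD", 1, "capacity", caps(i));
    setresource("FDTD", 1, "processes", 32 / caps(i));
    setresource("FDTD", 1, "threads", 1);
    for (j = 1:caps(i)) { addjob("benchmark.fsp"); }  # queue identical jobs
    runjobs;                    # runs the queued jobs concurrently
}
```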
Additionally, you may want to experiment with various hardware configurations or MPI implementations. In the cloud the possible permutations are immense, and it is easy to experiment with different instance types using Ansys Cloud. You may want to compare your results with the existing FDTD Performance Benchmarks.