Overview
In this article, we demonstrate a basic auto-scaling cluster setup on AWS using ParallelCluster and show how to run an FDTD parameter sweep using the lumslurm Python module provided with Lumerical.
Prerequisites:
This example requires Lumerical 2023 R2.2 or later, an Amazon Web Services (AWS) account, and access to the AWS Management Console.
A Lumerical license must be available, and the Ansys License Manager must be reachable from your AWS instances. Here, we installed the license manager on the head node.
ParallelCluster Setup
We will use AWS ParallelCluster UI to set up the cluster. We will use the default Virtual Private Cloud (VPC) set by ParallelCluster. More advanced configuration is possible using a custom VPC.
Note: here we used the EU West 1 (Ireland) region. The instance types available will depend on the region you select.
Create a key pair
Amazon secures access to all instances with a private key. You will need to create a key pair before you can launch any instances, and when you launch an instance, make sure you choose the correct key.
On the AWS Management Console, go to EC2 dashboard. In “Network & Security”, click “Key Pairs”.
Click "Create key pair".
Name the key pair, set "Key pair type" to RSA and format to ".pem".
Click "Create key pair" and save the private key to your computer.
We will use this key to connect to the cluster head node via SSH.
Install ParallelCluster UI
The ParallelCluster UI is a web interface that lets you create, monitor, and manage clusters. To install it, use the “AWS ParallelCluster UI quick-create” link for the region you want to use.
Set the name of the stack.
Enter the e-mail address of the admin user. Once the stack creation is complete, you will receive an e-mail with a temporary password at this address.
In "Capabilities", agree to the CloudFormation capabilities required to create the stack.
Click "Create stack". Creating the stack for ParallelCluster UI will take a few minutes.In the created stack, go to the "Outputs" tab, get the URL from "ParallelClusterUIURL"
Open the URL, use the e-mail address and temporary password to log in, then set a new password.
The AWS ParallelCluster UI is now set up. From this web interface, you can create and manage your clusters:
Create your cluster
In AWS ParallelCluster UI web interface, click “Create cluster” and select “Step by step”.
This will walk us through the different steps required to create our cluster.
Cluster properties
First, name the cluster, select the operating system (here we use Amazon Linux 2) and the VPC in the drop list.
Head node instance
Next, select the instance type for the head node. The head node will not be used to run any simulations, so a small instance type can be selected. Here we will use an m6a.xlarge instance and check the boxes “Add SSM session” and “Add DCV” to enable virtual desktop access. You need to specify which IP address(es) will be allowed to connect through the DCV session.
Select the key pair you created previously.
Queues
You can create as many queues as you want. Here, we will set up one queue that will be used to run the simulations and post-process the simulation results.
Name the queue, select the instance type (we will use c7a.8xlarge) and define the number of static and dynamic nodes. Note that static nodes will always be running (unless you stop the instances manually) while dynamic nodes will only be running when needed.
Lastly, check “Turn off multithreading”: simultaneous multithreading does not bring any performance improvement for the FDTD solver.
When dimensioning the partition(s), it might be useful to consider license availability. By default, Slurm is not aware of which licenses are available, so jobs can be started when there are not enough licenses.
One option is to set the number of nodes and the instance type to what your license allows. Alternatively, Slurm can manage licenses as a resource: once the cluster is created, you can add this new resource. The lumslurm Python package can check how many licenses are required (only for business/standard licenses) and will add the requirement to the jobs so they start only when licenses are available.
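As an illustration (this is not configured by default), Slurm's own license accounting can be used: declare a license count in the Slurm configuration and request it at submission time. The feature name “lum_fdtd_solve” below is a placeholder, and on ParallelCluster such settings are typically applied through the cluster's custom Slurm settings:

# slurm.conf (example): declare 4 locally managed FDTD solve licenses
Licenses=lum_fdtd_solve:4
# request one of those licenses when submitting a job
sbatch --licenses=lum_fdtd_solve:1 job.sh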
Storage settings
We need to define storage that will be shared by all the nodes (head node and compute nodes). In “Storage types”, select “Amazon Elastic File System (EFS)” and click “Add storage”.
Name the storage, set the mount path. Check “External file system” if you want to use an already created EFS storage.
Note the deletion policy can be set to “Retain” (the storage is kept if you update or delete the cluster) or to “Delete” (the storage is removed when you delete the cluster).
Create
In this last step, we can review the cluster configuration YAML file. You can click “Dry run” to test the configuration and detect potential issues.
Click “Create” to start the cluster creation process. This will take a few minutes.
Once the creation is complete, the head node is running. You can connect via SSH using the key pair previously created or via DCV.
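For example, assuming the key pair created earlier and the public IP address shown in the cluster “Details” tab:

chmod 400 LumHPCkey.pem
ssh -i LumHPCkey.pem ec2-user@<head-node-public-IP>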
To stop the head node, use the AWS Management Console and select “Instances” in the EC2 dashboard.
Click "Instance state" to start, stop, or reboot the selected instance.
Additional steps
Now that the cluster is created and the head node is running, we can install Lumerical and any other software that may be required.
Install Lumerical
We can use “scp” to upload the Lumerical installation package to the head node. The public IP address is visible in the “Details” tab of the cluster.
scp -i [path to]/LumHPCkey.pem [path to]/Lumerical-2024-R2-38xx-xxxxxxxxxx.tar.gz ec2-user@34.245.100.244:Lumerical-2024-R2-xxxx-xxxxxxxxxx.tar.gz
This will copy the file to the ec2-user home directory.
An alternative is to use the file transfer function of DCV:
With this, you can directly upload your files to the ec2-user home folder.
To install Lumerical, follow the instructions provided in Installing on a shared filesystem on Linux. Note that Amazon Linux is currently not supported by the “extract-rpm.sh” script, so you have to extract the files manually. Copy the Lumerical installation to the shared storage “efs0” so all compute nodes can access it.
The license configuration can be done using the ANSYSLMD_LICENSE_FILE variable or by adding/editing the “License.ini” file (see Lumerical license configuration from the command line).
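For example, a minimal sketch assuming the license manager runs on the head node and uses the default Ansys FlexNet port (1055); replace the hostname with your own:

export ANSYSLMD_LICENSE_FILE=1055@<license-server-hostname>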
Resolve dependencies
For Lumerical to run properly, we have to make sure all required libraries are installed on all the compute nodes. We will use a bootstrap script, run every time a new node is created, to install the missing libraries. The script will be stored in an S3 bucket, “lumhpc-bootstrap”, that we created using the default configuration.
First, we generate the list of libraries required by Lumerical using the following command in the folder containing the rpm file:
rpm -qRp Lumerical-2024R2-R2-xxxxxxxxxx.el7.x86_64.rpm | grep -v '^rpmlib' > lumerical-deps
Copy “lumerical-deps” to the shared folder.
Create a script file containing:
#!/bin/bash
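# install every package listed in lumerical-deps on this compute node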
xargs yum -y install < /shared/apps/lumerical/v242/lumerical-deps
Upload the script to your S3 bucket. Then, in the ParallelCluster UI, edit the cluster configuration.
In the “Advanced options” of the queue configuration, check “Run script on node configured” and specify the path to the bootstrap script (here “start.sh”).
In “Cluster configuration YAML file”, you need to give access to the S3 bucket to the compute nodes. Add the “Iam” section under the queue configuration:
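A minimal excerpt is sketched below; the queue name and the rest of the queue settings are whatever you already have in your configuration, and “lumhpc-bootstrap” is the bucket created earlier:

Scheduling:
  SlurmQueues:
    - Name: queue0   # your existing queue name
      # ... existing queue settings (ComputeResources, Networking, ...) ...
      Iam:
        S3Access:
          - BucketName: lumhpc-bootstrap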
To apply the new settings and update the cluster configuration, click “Stop fleet” and then “Edit cluster”.
Once the cluster update is completed, click “Start fleet”. Your cluster is now ready to use!
The partition will show as “idle~”, meaning it is ready to accept jobs, but the nodes are not running.
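You can check the partition and node states from the head node with the standard Slurm sinfo command:

sinfo   # dynamic nodes that are powered down show the “~” suffix, e.g. “idle~”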
Submit jobs using lumslurm
In this section, we will use lumslurm (see Getting Started with lumslurm - Python API) to run an FDTD parameter sweep based on the Micro-LED example.
lumslurm configuration
The lumslurm Python module requires a configuration file to locate mpirun, the FDTD engine, the FDTD GUI, Python, and the Lumerical Python API.
Open MPI is installed by default on the compute nodes in “/opt/amazon/openmpi/bin/”. For Python, you can use the interpreter installed with Lumerical.
In our system, we will use the following:
- mpirun: /opt/amazon/openmpi/bin/mpirun
- mpilib: /opt/amazon/openmpi/lib64
- fdtd_engine: /shared/apps/lumerical/v242/bin/fdtd-engine-ompi-lcl
- fdtd_gui: /shared/apps/lumerical/v242/bin/fdtd-solutions
- pythonpath: /shared/apps/lumerical/v242/api/python
- python: /shared/apps/lumerical/v242/python/bin/python
The configuration file can be set either at the system level or at the user level. We created a .lumslurm.config file with these settings in the ec2-user home directory.
Example: Micro LED workflow
To run this example, we need to upload the simulation file “micro_LED_cylindrical.fsp” to the shared folder. We copied the file to “/shared/data/microLED/”. In this simulation file, a nested parameter sweep is already set up. We will use lumslurm to:
- run the parameter sweep on the cluster, with each simulation submitted to Slurm as an individual job
- analyze the results (run the far field projections), with each analysis queued as a job that only starts when the corresponding simulation is done
- load the results from the sweep when all simulations and analyses are completed
Here, we will run this flow from a Jupyter notebook. Note you need to install Jupyter and a browser for this. You can also run this from any Python IDE you install.
First, we import the required modules and set paths and simulation file:
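A minimal sketch is shown below; the variable names are ours, and the paths correspond to the shared-storage layout used in this article:

import lumslurm

# simulation file uploaded to the shared storage
project_dir = '/shared/data/microLED'
fsp_file = project_dir + '/micro_LED_cylindrical.fsp'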
Then we can run the parameter sweep. Since we’re running a nested sweep, we need to specify the full hierarchy, “position::polarization”:
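A sketch of the call, assuming lumslurm.run_sweep() takes the project file and the sweep name; it also accepts keyword arguments to control partitions, processes, and threads (see the lumslurm documentation):

# submit one solve job per sweep point, plus the processing and collection jobs
lumslurm.run_sweep(fsp_file, 'position::polarization')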
As only one partition was set in this cluster, it will be used for both running the simulation and analyzing the results.
Once “lumslurm.run_sweep()” has been called, you can monitor the jobs with the standard Slurm commands (for example, squeue).
You can identify the solve jobs with the “S:” prefix, the processing jobs with “P:” and the collection job with “C:”. At first, the compute nodes will be started and configured (“CF” status), then the simulations are run (“R” status):
When the solve jobs are done, the corresponding processing jobs can run.
Once all the jobs are completed, you can reopen the simulation file; the parameter sweep will contain the results.
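If you prefer to load the results from Python instead of the GUI, here is a hedged sketch using the Lumerical Python API (lumapi) at the pythonpath configured above; run it where a display is available (for example, in the DCV session):

import sys
sys.path.append('/shared/apps/lumerical/v242/api/python')   # pythonpath from the lumslurm configuration
import lumapi

# open the completed project; the parameter sweep now contains the results
fdtd = lumapi.FDTD(filename=fsp_file, hide=True)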