Overview
AWS ParallelCluster is an AWS-supported, open-source cluster management tool that makes it easy to deploy and manage High-Performance Computing (HPC) clusters in AWS.
This guide sets up a compute cluster in AWS to which you can submit engine jobs using qsub; the cluster dynamically starts and stops compute instances depending on the queue size, giving much faster simulation turnaround times.
Notes
- This workflow or use case has been replaced by Ansys Gateway powered by AWS. Contact us or your Ansys Account manager for more information about this service.
- This process/use case is provided as-is. Further documentation and support can be found in the AWS ParallelCluster documentation at aws-parallelcluster.readthedocs.io.
Prerequisites
- Experience with AWS CLI
- Experience with AWS Console (for auditing/cost monitoring)
- Experience with Job Schedulers/Clusters
- 2-8 hours
Architecture
Pricing
AWS provides its own pricing guide for AWS ParallelCluster here.
AWS charges are based on the amount of compute time and storage used. The majority of the costs incurred when running Lumerical products on AWS, following our most common use cases, are:
- EC2 (Elastic Compute Cloud) Instances: For EC2 you only pay for what you use, billed per second while the instance is in the "Running" state. Our use cases use on-demand instances; however, cheaper rates are available through longer-term contracts and Spot Instances. See Amazon EC2 pricing for details.
- EBS (Elastic Block Storage): These volumes are used by each EC2 instance as a system drive and are a fixed size set when the instance is launched. Charges are incurred whether the instances are in the "Running" or "Stopped" state. See Amazon EBS Pricing for details.
- EFS (Elastic File System): These volumes are dynamically sized and you only pay for the storage you use. These volumes are used as a shared filesystem between multiple instances. See Amazon EFS pricing for details.
Example on-demand estimate while testing (USD):
- License server (t3.nano, 10 GB storage): $5/month
- Master instances (t3.micro, 20 GB storage each): $9.60/month, $0.013/hour
- Compute instances (2 x c5.large, 20 GB storage each): $128/month, $0.18/hour
- Shared storage (EFS, 5 GB usage): $5/month
Example on-demand estimate for production (USD):
- License server (t3.nano, 10 GB storage): $5/month
- Master instances (t3.micro, 20 GB storage each): $9.60/month, $0.013/hour
- Compute instances (4 x c5n.18xlarge, 20 GB storage each): $11,360/month, $16/hour
- Shared storage (EFS, 1 TB usage): $300/month
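For reference, the production compute figure follows directly from the on-demand hourly rate: assuming roughly $3.89/hour per c5n.18xlarge (us-east-1 on-demand pricing at the time of writing), 4 x $3.89/hour is about $15.6/hour, or roughly $11,360/month if the fleet ran continuously; the actual compute cost is lower because idle instances are terminated automatically.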
We always recommend that you use AWS Cost Explorer to monitor your usage closely. You can also use the AWS Pricing Calculator to create a detailed cost estimate for your specific use case.
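If you prefer the command line, a cost summary grouped by service can also be pulled with the AWS CLI (the dates are placeholders, and the Cost Explorer API is served from the us-east-1 region):
aws ce get-cost-and-usage --region us-east-1 --time-period Start=<YYYY-MM-DD>,End=<YYYY-MM-DD> --granularity MONTHLY --metrics UnblendedCost --group-by Type=DIMENSION,Key=SERVICE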
1. Pre-Configuration
- Create and configure your VPC and Security Group
- Create a key-pair and an IAM Role for "EC2 Services" with "Programmatic access" using the AWS ParallelCluster documentation on necessary permissions (example CLI commands are shown after this list). Take note of the Access key ID and Secret access key.
IAM Quick Permissions (Note: use the detailed permissions here)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "sns:*",
        "application-autoscaling:*",
        "s3:*",
        "ec2:*",
        "cloudformation:*",
        "sqs:*",
        "ssm:*",
        "iam:*",
        "autoscaling:*"
      ],
      "Resource": "*"
    }
  ]
}
- Create and Configure a License Server
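As a sketch, the key-pair and the license-port firewall rule from the steps above can also be created with the AWS CLI; the key name, security group ID, and CIDR are placeholders, and 27011 is the FlexNet port used later in this guide:
aws ec2 create-key-pair --key-name <Key-pair Name> --query 'KeyMaterial' --output text > <Key-pair Name>.pem
chmod 600 <Key-pair Name>.pem
aws ec2 authorize-security-group-ingress --group-id <Security Group> --protocol tcp --port 27011 --cidr <VPC CIDR>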
2. AWS ParallelCluster CLI Configuration
- Install the AWS ParallelCluster Python module (requires: python, pip):
python -m pip install aws-parallelcluster
- Configure AWS ParallelCluster with your Access key ID, Secret access key, and AWS Region
pcluster configure
- Move your .pem to ~/.parallelcluster/
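To confirm the module installed correctly and can reach your account, you can check the CLI version and list any existing clusters (a quick sanity check, not a required step):
pcluster version
pcluster list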
3. Cluster Configuration
- Using the below config as a template, modify the file ~/.parallelcluster/config with your own keys/IDs. For documentation of all the options, see: https://docs.aws.amazon.com/parallelcluster/latest/ug/configuration.html
~/.parallelcluster/config
[aws]
aws_access_key_id = <Access key ID>
aws_secret_access_key = <Secret access key>
aws_region_name = <Region>
[global]
cluster_template = default
update_check = true
sanity_check = true
[vpc default]
vpc_id = <VPC ID>
master_subnet_id = <Subnet ID>
additional_sg = <Security Group>
use_public_ips = true
# you can specify up to 5 EBS Shared Directories.
# Note: /home/ is already shared with the compute instances
# (but it is located on the master instances drive)
[ebs lumericalproducts]
# Products installed on the master node will be shared by all compute nodes.
shared_dir = opt/lumerical
volume_size = 25
[cluster default]
key_name = <Key-pair Name>
vpc_settings = default
# OS Options: centos6, centos7, rhel6, rhel7 are supported.
# alinux is not supported by Lumerical; we only support alinux2,
# which AWS ParallelCluster does not offer.
base_os = centos7
# the t3.micro is recommended by AWS ParallelCluster;
# if you wish to do pre- and post-processing on the master instance,
# select something larger
master_instance_type = t3.micro
# the c5.large is the smallest instance that supports placement groups
# (needed for distributed MPI jobs), great for testing.
compute_instance_type = c5.large
# scheduler options: sge, slurm, torque
# (the qsub/qrsh examples in this guide assume sge)
scheduler = sge
# When DYNAMIC is set, a unique placement group will be created
# and deleted as part of the cluster stack.
placement_group = DYNAMIC
# only the compute nodes are in a placement group of type "cluster"
placement = compute
# The below settings can be updated later by running:
# pcluster update aws-pcluster-demo
initial_queue_size = 0
max_queue_size = 2
# To let all compute nodes terminate when there are no jobs in the queue, set to false
maintain_initial_size = false
# can be a comma-separated list of up to 5 EBS settings
ebs_settings = lumericalproducts
# EFA = Elastic Fabric Adapter, a high-speed network interface device for HPC.
# Use with c5n.18xlarge compute instances.
# enable_efa = compute
# disable hyperthreading on all instances (AWS PC >= 2.5)
disable_hyperthreading = true
# leave hyperthreading enabled, but do not use them for scheduled jobs
# extra_json = { "cfncluster" : { "cfn_scheduler_slots" : "cores" } }
scaling_settings = custom_scaling
[scaling custom_scaling]
# Default: 10m
scaledown_idletime = 5
[aliases]
ssh = ssh -i ~/.parallelcluster/<Key-pair>.pem {CFN_USER}@{MASTER_IP} {ARGS}
- Create your cluster:
pcluster create aws-pcluster-demo
Upon successful completion, make note of the Master node public IP (this can also be found in the EC2 Console).
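If you need to re-check the cluster state or the Master node public IP later, the ParallelCluster CLI should report them; and after editing ~/.parallelcluster/config (for example to change max_queue_size), the changes are applied with an update. The commands below assume the ParallelCluster 2.x CLI used in this guide:
pcluster status aws-pcluster-demo
pcluster update aws-pcluster-demo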
Master node configuration
- Download the products: https://www.lumerical.com/downloads/
- Copy installation files to the master node:
scp -i ~/.parallelcluster/<Key-pair Name>.pem <product>.tar.gz centos@<Master IP>:
- SSH to the master node:
pcluster ssh aws-pcluster-demo
- Install the graphics libraries needed to run the CAD and to configure the license server over X11 (not necessary on CentOS 7):
sudo yum groupinstall -y "X Window System"
- Reconnect to the master node with X11 forwarding (or on Windows: MobaXterm):
pcluster ssh aws-pcluster-demo -Y
- Unzip the installation files:
tar -zxf <product>.tar.gz
- Install the Lumerical suite to the default location (/opt/lumerical; for custom locations, see here):
sudo yum install Lumerical-[[ver]]/Lumerical-[[ver]].rpm
- If you need the GUI: Patch the product using instructions here: Graphics related problems
- Create a configuration file to set your license manager using the following command, substituting the PRIVATE IP address of your license manager server.
For a local user configuration (License.ini created in ~/.config/Lumerical):
mkdir -p ~/.config/Lumerical; printf "[license]\nflexserver\host=27011@<server_private_ip>\n" > ~/.config/Lumerical/License.ini
OR for a system-wide configuration (License.ini created in /opt/lumerical/[[verpath]]/Lumerical/):
sudo mkdir -p /opt/lumerical/[[verpath]]/Lumerical; printf "[license]\nflexserver\host=27011@<server_private_ip>\n" | sudo tee /opt/lumerical/[[verpath]]/Lumerical/License.ini
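Before launching the CAD, it is worth verifying that the master node can reach the license manager on the FlexNet port (this assumes nc/ncat is available on the master node; note that the FlexNet vendor daemon also uses a second port that must be open in the license server's security group):
nc -zv <server_private_ip> 27011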
4. Running Jobs
- Run the CAD environment from the master node:
pcluster ssh aws-pcluster-demo -Y "/opt/lumerical/[[verpath]]/bin/fdtd-solutions"
- Submit engine jobs to the compute nodes from the master node:
qsub -b y /opt/lumerical/[[verpath]]/bin/fdtd-engine -t 0 <project file>
- Submit engine jobs to the compute nodes from your local computer:
scp -i ~/.parallelcluster/<Key-pair Name>.pem <project file> centos@<master IP>:
pcluster ssh aws-pcluster-demo "qrsh -b y /opt/lumerical/[[verpath]]/bin/fdtd-engine -t 0 <project file>"
scp -i ~/.parallelcluster/<Key-pair Name>.pem centos@<master IP>:<project file> .
- For instructions on how to run jobs on a cluster seamlessly from our CAD using Sun Grid Engine, see: Running FDTD Jobs on a Cluster
- Job Scheduler Submission Scripts (SGE, Slurm, Torque)
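As a starting point for the submission scripts mentioned above, below is a sketch of an SGE batch script for a distributed FDTD job; the parallel environment name, slot count, and engine binary (fdtd-engine-ompi-lcl, the engine built against the bundled OpenMPI) are assumptions and may differ on your cluster:
#!/bin/bash
# Example SGE submission script; submit with: qsub run_fdtd.sh
#$ -N fdtd_job
#$ -cwd
#$ -j y
#$ -pe mpi 36          # parallel environment and slot count; list available PEs with: qconf -spl
# Engine path assumes the default install location used in this guide.
ENGINE=/opt/lumerical/[[verpath]]/bin/fdtd-engine-ompi-lcl
# Use the mpirun that matches the engine build (system OpenMPI or Lumerical's bundled MPI).
mpirun -np $NSLOTS $ENGINE -t 1 <project file>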