ParaTools Pro for E4S™ Getting Started with AWS Parallel Computing Service¶
General Background Information¶
In this tutorial we will show you how to set up and launch an HPC cluster using AWS Parallel Computing Service (PCS).
You will use command-line tools, the AWS CLI, and the AWS console to create a cluster.
The process uses a number of .yaml templates that describe the stacks and serve as inputs to AWS CloudFormation.
We will then launch a GPU-accelerated head node that can spawn EC2 compute node instances connected via EFA (Elastic Fabric Adapter) networking.
For the purposes of this tutorial, we make the following assumptions:
- You have created an AWS account and are an administrative user.
Tutorial¶
Please refer to the official AWS PCS Getting Started guide for more information. This tutorial follows that guide with a few minor changes; if anything is unclear, consult the official guide.
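Several of the steps below can also be performed with the AWS CLI, so it is worth confirming that the CLI is installed and configured before starting. A minimal check (assuming you deploy to us-east-1, as in the examples in this tutorial) might look like:

```bash
# Confirm the AWS CLI is installed and which identity it will act as
aws --version
aws sts get-caller-identity

# Set the default region used by the examples in this tutorial (assumption: us-east-1)
aws configure set region us-east-1
```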
1. Create VPC and Subnets¶
You can skip this step by reusing previously created resources
If you have already created the VPC and subnets, you can reuse them, and skip this step. Use this link to search for VPC stacks in us-east-1 that contain the text "PTPro".
To create a new stack for the cluster's VPC and Subnets using the CloudFormation console, please use the following template:
0-pcs-cluster-cloudformation-vpc-and-subnets.yaml
Show template contents (click to expand)
# source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/net/hpc_large_scale/assets/main.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: HPC-scale VPC with Multi-AZ Architecture.
This template creates a highly available VPC infrastructure optimized for HPC workloads across multiple Availability Zones.
It provisions both public and private subnets in two or optionally three AZs, with each subnet configured for 4096 IP addresses.
The template sets up NAT Gateways and Internet Gateway for secure outbound connectivity from private subnets.
VPC Flow Logs are enabled and directed to CloudWatch for comprehensive network traffic monitoring.
An S3 VPC Endpoint is configured to allow private subnet resources to access S3 without traversing the internet.
A VPC-wide security group is created to enable communication between resources within the VPC.
Use this template as a foundation for building scalable, secure networking infrastructure for HPC workloads.
Refer to the Outputs tab of the deployed stack for important resource identifiers including VPC ID, subnet IDs, security group ID, and internet gateway ID.
Metadata:
AWS::CloudFormation::Interface:
ParameterGroups:
- Label:
default: VPC
Parameters:
- CidrBlock
- Label:
default: Subnets A
Parameters:
- CidrPublicSubnetA
- CidrPrivateSubnetA
- Label:
default: Subnets B
Parameters:
- CidrPublicSubnetB
- CidrPrivateSubnetB
- Label:
default: Subnets C
Parameters:
- ProvisionSubnetsC
- CidrPublicSubnetC
- CidrPrivateSubnetC
Parameters:
CidrBlock:
AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
Default: 10.3.0.0/16
Description: VPC CIDR Block (eg 10.3.0.0/16)
Type: String
CidrPublicSubnetA:
AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
Default: 10.3.0.0/20
Description: VPC CIDR Block for the Public Subnet A
Type: String
CidrPublicSubnetB:
AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
Default: 10.3.16.0/20
Description: VPC CIDR Block for the Public Subnet B
Type: String
CidrPublicSubnetC:
AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
Default: 10.3.32.0/20
Description: VPC CIDR Block for the Public Subnet C
Type: String
CidrPrivateSubnetA:
AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
Default: 10.3.128.0/20
Description: VPC CIDR Block for the Private Subnet A
Type: String
CidrPrivateSubnetB:
AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
Default: 10.3.144.0/20
Description: VPC CIDR Block for the Private Subnet B
Type: String
CidrPrivateSubnetC:
AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
Default: 10.3.160.0/20
Description: VPC CIDR Block for the Private Subnet C
Type: String
ProvisionSubnetsC:
Type: String
Description: Provision optional 3rd set of subnets
Default: "True"
AllowedValues:
- "True"
- "False"
Mappings:
RegionMap:
us-east-1:
ZoneId1: use1-az6
ZoneId2: use1-az4
ZoneId3: use1-az5
us-east-2:
ZoneId1: use2-az2
ZoneId2: use2-az3
ZoneId3: use2-az1
us-west-1:
ZoneId1: usw1-az1
ZoneId2: usw1-az3
ZoneId3: usw1-az2
us-west-2:
ZoneId1: usw2-az1
ZoneId2: usw2-az2
ZoneId3: usw2-az3
eu-central-1:
ZoneId1: euc1-az3
ZoneId2: euc1-az2
ZoneId3: euc1-az1
eu-west-1:
ZoneId1: euw1-az1
ZoneId2: euw1-az2
ZoneId3: euw1-az3
eu-west-2:
ZoneId1: euw2-az2
ZoneId2: euw2-az3
ZoneId3: euw2-az1
eu-west-3:
ZoneId1: euw3-az1
ZoneId2: euw3-az2
ZoneId3: euw3-az3
eu-north-1:
ZoneId1: eun1-az2
ZoneId2: eun1-az1
ZoneId3: eun1-az3
ca-central-1:
ZoneId1: cac1-az2
ZoneId2: cac1-az1
ZoneId3: cac1-az3
eu-south-1:
ZoneId1: eus1-az2
ZoneId2: eus1-az1
ZoneId3: eus1-az3
ap-east-1:
ZoneId1: ape1-az3
ZoneId2: ape1-az2
ZoneId3: ape1-az1
ap-northeast-1:
ZoneId1: apne1-az4
ZoneId2: apne1-az1
ZoneId3: apne1-az2
ap-northeast-2:
ZoneId1: apne2-az1
ZoneId2: apne2-az3
ZoneId3: apne2-az2
ap-south-1:
ZoneId1: aps1-az2
ZoneId2: aps1-az3
ZoneId3: aps1-az1
ap-southeast-1:
ZoneId1: apse1-az1
ZoneId2: apse1-az2
ZoneId3: apse1-az3
ap-southeast-2:
ZoneId1: apse2-az3
ZoneId2: apse2-az1
ZoneId3: apse2-az2
us-gov-west-1:
ZoneId1: usgw1-az2
ZoneId2: usgw1-az1
ZoneId3: usgw1-az3
us-gov-east-1:
ZoneId1: usge1-az3
ZoneId2: usge1-az2
ZoneId3: usge1-az1
ap-northeast-3:
ZoneId1: apne3-az3
ZoneId2: apne3-az2
ZoneId3: apne3-az1
sa-east-1:
ZoneId1: sae1-az3
ZoneId2: sae1-az2
ZoneId3: sae1-az1
af-south-1:
ZoneId1: afs1-az3
ZoneId2: afs1-az2
ZoneId3: afs1-az1
ap-south-2:
ZoneId1: aps2-az3
ZoneId2: aps2-az2
ZoneId3: aps2-az1
ap-southeast-3:
ZoneId1: apse3-az3
ZoneId2: apse3-az2
ZoneId3: apse3-az1
ap-southeast-4:
ZoneId1: apse4-az3
ZoneId2: apse4-az2
ZoneId3: apse4-az1
ca-west-1:
ZoneId1: caw1-az3
ZoneId2: caw1-az2
ZoneId3: caw1-az1
eu-central-2:
ZoneId1: euc2-az3
ZoneId2: euc2-az2
ZoneId3: euc2-az1
eu-south-2:
ZoneId1: eus2-az3
ZoneId2: eus2-az2
ZoneId3: eus2-az1
il-central-1:
ZoneId1: ilc1-az3
ZoneId2: ilc1-az2
ZoneId3: ilc1-az1
me-central-1:
ZoneId1: mec1-az3
ZoneId2: mec1-az2
ZoneId3: mec1-az1
Conditions:
DoProvisionSubnetsC: !Equals [!Ref ProvisionSubnetsC, "True"]
Resources:
VPC:
Type: AWS::EC2::VPC
Properties:
CidrBlock: !Ref CidrBlock
EnableDnsHostnames: true
EnableDnsSupport: true
Tags:
- Key: "Name"
Value: !Sub '${AWS::StackName}:Large-Scale-HPC'
VPCFlowLog:
Type: AWS::EC2::FlowLog
Properties:
ResourceId: !Ref VPC
ResourceType: VPC
TrafficType: ALL
LogDestinationType: cloud-watch-logs
LogGroupName: !Sub '${AWS::StackName}-VPCFlowLogs'
DeliverLogsPermissionArn: !GetAtt FlowLogRole.Arn
FlowLogRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Principal:
Service:
- vpc-flow-logs.amazonaws.com
Action:
- "sts:AssumeRole"
ManagedPolicyArns:
- !Ref AWS::NoValue
Policies:
- PolicyName: FlowLogPolicy
PolicyDocument:
Version: "2012-10-17"
Statement:
- Effect: Allow
Action:
- "logs:CreateLogGroup"
- "logs:CreateLogStream"
- "logs:PutLogEvents"
- "logs:DescribeLogGroups"
- "logs:DescribeLogStreams"
Resource: !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:${AWS::StackName}-VPCFlowLogs:*"
PublicSubnetA:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
CidrBlock: !Ref CidrPublicSubnetA
AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
MapPublicIpOnLaunch: true
Tags:
- Key: Name
Value: !Sub
- '${StackName}:PublicSubnetA-${AvailabilityZone}'
- StackName: !Ref AWS::StackName
AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
PublicSubnetB:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
CidrBlock: !Ref CidrPublicSubnetB
AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
MapPublicIpOnLaunch: true
Tags:
- Key: Name
Value: !Sub
- '${StackName}:PublicSubnetB-${AvailabilityZone}'
- StackName: !Ref AWS::StackName
AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
PublicSubnetC:
Type: AWS::EC2::Subnet
Condition: DoProvisionSubnetsC
Properties:
VpcId: !Ref VPC
CidrBlock: !Ref CidrPublicSubnetC
AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
MapPublicIpOnLaunch: true
Tags:
- Key: Name
Value: !Sub
- '${StackName}:PublicSubnetC-${AvailabilityZone}'
- StackName: !Ref AWS::StackName
AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
InternetGateway:
Type: AWS::EC2::InternetGateway
AttachGateway:
Type: AWS::EC2::VPCGatewayAttachment
Properties:
VpcId: !Ref VPC
InternetGatewayId: !Ref InternetGateway
PublicRouteTable:
Type: AWS::EC2::RouteTable
Properties:
VpcId: !Ref VPC
Tags:
- Key: Name
Value: !Sub '${AWS::StackName}:PublicRoute'
PublicRoute1:
Type: AWS::EC2::Route
Properties:
RouteTableId: !Ref PublicRouteTable
DestinationCidrBlock: 0.0.0.0/0
GatewayId: !Ref InternetGateway
PublicSubnetARouteTableAssociation:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
SubnetId: !Ref PublicSubnetA
RouteTableId: !Ref PublicRouteTable
PublicSubnetBRouteTableAssociation:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
SubnetId: !Ref PublicSubnetB
RouteTableId: !Ref PublicRouteTable
PublicSubnetCRouteTableAssociation:
Condition: DoProvisionSubnetsC
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
SubnetId: !Ref PublicSubnetC
RouteTableId: !Ref PublicRouteTable
PrivateSubnetA:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
CidrBlock: !Ref CidrPrivateSubnetA
MapPublicIpOnLaunch: false
Tags:
- Key: Name
Value: !Sub
- '${StackName}:PrivateSubnetA-${AvailabilityZone}'
- StackName: !Ref AWS::StackName
AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
PrivateSubnetB:
Type: AWS::EC2::Subnet
Properties:
VpcId: !Ref VPC
AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
CidrBlock: !Ref CidrPrivateSubnetB
MapPublicIpOnLaunch: false
Tags:
- Key: Name
Value: !Sub
- '${StackName}:PrivateSubnetB-${AvailabilityZone}'
- StackName: !Ref AWS::StackName
AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
PrivateSubnetC:
Type: AWS::EC2::Subnet
Condition: DoProvisionSubnetsC
Properties:
VpcId: !Ref VPC
AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
CidrBlock: !Ref CidrPrivateSubnetC
MapPublicIpOnLaunch: false
Tags:
- Key: Name
Value: !Sub
- '${StackName}:PrivateSubnetC-${AvailabilityZone}'
- StackName: !Ref AWS::StackName
AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
NatGatewayAEIP:
Type: AWS::EC2::EIP
DependsOn: AttachGateway
Properties:
Domain: vpc
NatGatewayBEIP:
Type: AWS::EC2::EIP
DependsOn: AttachGateway
Properties:
Domain: vpc
NatGatewayCEIP:
Condition: DoProvisionSubnetsC
Type: AWS::EC2::EIP
DependsOn: AttachGateway
Properties:
Domain: vpc
NatGatewayA:
Type: AWS::EC2::NatGateway
Properties:
AllocationId: !GetAtt NatGatewayAEIP.AllocationId
SubnetId: !Ref PublicSubnetA
NatGatewayB:
Type: AWS::EC2::NatGateway
Properties:
AllocationId: !GetAtt NatGatewayBEIP.AllocationId
SubnetId: !Ref PublicSubnetB
NatGatewayC:
Type: AWS::EC2::NatGateway
Condition: DoProvisionSubnetsC
Properties:
AllocationId: !GetAtt NatGatewayCEIP.AllocationId
SubnetId: !Ref PublicSubnetC
PrivateRouteTableA:
Type: AWS::EC2::RouteTable
Properties:
VpcId: !Ref VPC
Tags:
- Key: Name
Value: !Sub '${AWS::StackName}:PrivateRouteA'
PrivateRouteTableB:
Type: AWS::EC2::RouteTable
Properties:
VpcId: !Ref VPC
Tags:
- Key: Name
Value: !Sub '${AWS::StackName}:PrivateRouteB'
PrivateRouteTableC:
Type: AWS::EC2::RouteTable
Condition: DoProvisionSubnetsC
Properties:
VpcId: !Ref VPC
Tags:
- Key: Name
Value: !Sub '${AWS::StackName}:PrivateRouteC'
DefaultPrivateRouteA:
Type: AWS::EC2::Route
Properties:
RouteTableId: !Ref PrivateRouteTableA
DestinationCidrBlock: 0.0.0.0/0
NatGatewayId: !Ref NatGatewayA
DefaultPrivateRouteB:
Type: AWS::EC2::Route
Properties:
RouteTableId: !Ref PrivateRouteTableB
DestinationCidrBlock: 0.0.0.0/0
NatGatewayId: !Ref NatGatewayB
DefaultPrivateRouteC:
Type: AWS::EC2::Route
Condition: DoProvisionSubnetsC
Properties:
RouteTableId: !Ref PrivateRouteTableC
DestinationCidrBlock: 0.0.0.0/0
NatGatewayId: !Ref NatGatewayC
PrivateSubnetARouteTableAssociation:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
RouteTableId: !Ref PrivateRouteTableA
SubnetId: !Ref PrivateSubnetA
PrivateSubnetBRouteTableAssociation:
Type: AWS::EC2::SubnetRouteTableAssociation
Properties:
RouteTableId: !Ref PrivateRouteTableB
SubnetId: !Ref PrivateSubnetB
PrivateSubnetCRouteTableAssociation:
Type: AWS::EC2::SubnetRouteTableAssociation
Condition: DoProvisionSubnetsC
Properties:
RouteTableId: !Ref PrivateRouteTableC
SubnetId: !Ref PrivateSubnetC
AvailabiltyZone1:
Type: Custom::AvailabiltyZone
DependsOn: LogGroupGetAZLambdaFunction
Properties:
ServiceToken: !GetAtt GetAZLambdaFunction.Arn
ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId1]
AvailabiltyZone2:
Type: Custom::AvailabiltyZone
DependsOn: LogGroupGetAZLambdaFunction
Properties:
ServiceToken: !GetAtt GetAZLambdaFunction.Arn
ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId2]
AvailabiltyZone3:
Type: Custom::AvailabiltyZone
Condition: DoProvisionSubnetsC
DependsOn: LogGroupGetAZLambdaFunction
Properties:
ServiceToken: !GetAtt GetAZLambdaFunction.Arn
ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId3]
LogGroupGetAZLambdaFunction:
Type: AWS::Logs::LogGroup
DeletionPolicy: Delete
UpdateReplacePolicy: Delete
Properties:
LogGroupName: !Sub /aws/lambda/${GetAZLambdaFunction}
RetentionInDays: 7
GetAZLambdaFunction:
Type: AWS::Lambda::Function
Properties:
Description: GetAZLambdaFunction
Timeout: 60
Runtime: python3.12
Handler: index.handler
Role: !GetAtt GetAZLambdaRole.Arn
Code:
ZipFile: |
import cfnresponse
from json import dumps
from boto3 import client
EC2 = client('ec2')
def handler(event, context):
if event['RequestType'] in ('Create', 'Update'):
print(dumps(event, default=str))
data = {}
try:
response = EC2.describe_availability_zones(
Filters=[{'Name': 'zone-id', 'Values': [event['ResourceProperties']['ZoneId']]}]
)
print(dumps(response, default=str))
data['ZoneName'] = response['AvailabilityZones'][0]['ZoneName']
except Exception as error:
cfnresponse.send(event, context, cfnresponse.FAILED, {}, reason=error)
finally:
cfnresponse.send(event, context, cfnresponse.SUCCESS, data)
else:
cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
Tags:
- Key: Name
Value: !Sub ${AWS::StackName}GetAZLambdaFunction
GetAZLambdaRole:
Type: AWS::IAM::Role
Properties:
Path: /
Description: GetAZLambdaFunction
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- sts:AssumeRole
Principal:
Service:
- !Sub 'lambda.${AWS::URLSuffix}'
ManagedPolicyArns:
- !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
Policies:
- PolicyName: GetAZLambdaFunction
PolicyDocument:
Version: '2012-10-17'
Statement:
- Sid: ec2
Effect: Allow
Action:
- ec2:DescribeAvailabilityZones
Resource:
- '*'
Tags:
- Key: Name
Value: !Sub ${AWS::StackName}-GetAZLambdaFunction
S3Endpoint:
Type: 'AWS::EC2::VPCEndpoint'
Properties:
VpcEndpointType: 'Gateway'
ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3'
RouteTableIds:
- !Ref PublicRouteTable
- !Ref PrivateRouteTableA
- !Ref PrivateRouteTableB
VpcId: !Ref VPC
SecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Allow all traffic from resources in VPC
VpcId:
Ref: VPC
SecurityGroupIngress:
- IpProtocol: -1
CidrIp: !Ref CidrBlock
SecurityGroupEgress:
- IpProtocol: -1
CidrIp: !Ref CidrBlock
Outputs:
VPC:
Value: !Ref VPC
Description: ID of the VPC
Export:
Name: !Sub ${AWS::StackName}-VPC
PublicSubnets:
Value: !Join
- ','
- - !Ref PublicSubnetA
- !Ref PublicSubnetB
- !If
- DoProvisionSubnetsC
- !Ref PublicSubnetC
- !Ref AWS::NoValue
Description: ID of the public subnets
Export:
Name: !Sub ${AWS::StackName}-PublicSubnets
PrivateSubnets:
Value: !Join
- ','
- - !Ref PrivateSubnetA
- !Ref PrivateSubnetB
- !If
- DoProvisionSubnetsC
- !Ref PrivateSubnetC
- !Ref AWS::NoValue
Description: ID of the private subnets
Export:
Name: !Sub ${AWS::StackName}-PrivateSubnets
DefaultPrivateSubnet:
Description: The ID of a default private subnet
Value: !Ref PrivateSubnetA
Export:
Name: !Sub "${AWS::StackName}-DefaultPrivateSubnet"
DefaultPublicSubnet:
Description: The ID of a default public subnet
Value: !Ref PublicSubnetA
Export:
Name: !Sub "${AWS::StackName}-DefaultPublicSubnet"
InternetGatewayId:
Description: The ID of the Internet Gateway
Value: !Ref InternetGateway
Export:
Name: !Sub "${AWS::StackName}-InternetGateway"
SecurityGroup:
Description: The ID of the local security group
Value: !Ref SecurityGroup
Export:
Name: !Sub "${AWS::StackName}-SecurityGroup"
Give the stack a name, like AWSPCS-PTPro-cluster, and leave the remaining options at their defaults.
Alternatively, you can use this AWS CloudFormation quick-create link to provision these resources with default settings.
Under Capabilities, check the box for "I acknowledge that AWS CloudFormation might create IAM resources."
Once the stack has been created, find the new VPC in the Amazon VPC console: select "VPCs" and search for the name you picked above.
If you chose the suggested stack name, search for PTPro; if you are deploying in us-east-1, you can use this link.
Make a note of the VPC ID once you have found it.
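If you prefer the command line, the stack can be created and its VPC ID read from the stack outputs with the AWS CLI instead of the console. The stack name below assumes the suggested AWSPCS-PTPro-cluster; adjust it to whatever you chose:

```bash
# Create the VPC stack from the downloaded template
# (the IAM capability is needed for the flow-log and Lambda roles it creates)
aws cloudformation create-stack \
  --stack-name AWSPCS-PTPro-cluster \
  --template-body file://0-pcs-cluster-cloudformation-vpc-and-subnets.yaml \
  --capabilities CAPABILITY_IAM

# Wait for the stack to finish, then read the VPC ID from the "VPC" output key
aws cloudformation wait stack-create-complete --stack-name AWSPCS-PTPro-cluster
aws cloudformation describe-stacks \
  --stack-name AWSPCS-PTPro-cluster \
  --query "Stacks[0].Outputs[?OutputKey=='VPC'].OutputValue" \
  --output text
```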
2. Create Security Groups¶
Summary
In this section we will create three security groups:
- A cluster security group enabling communication between the compute nodes, the login node, and the AWS PCS controller
- An inbound SSH group that can optionally be enabled to allow SSH logins to the login node
- A DCV group that can optionally be enabled to allow DCV remote desktop connections to the login node
If you have already created these security groups you can reuse them and skip this step.
Using CloudFormation, create a new stack for the security groups using the following template:
1-pcs-cluster-cloudformation-security-groups.yaml
Show template contents (click to expand)
# source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/getting_started/assets/pcs-cluster-sg.yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Security group for AWS PCS clusters.
This template creates a self-referencing security group that enables communications between AWS PCS controller, compute nodes, and client nodes.
Optionally, it can also create a security group to enable SSH access to the cluster, and DCV remote desktop access to the login node.
Check the Outputs tab of this stack for useful details about resources created by this template.
Metadata:
AWS::CloudFormation::Interface:
ParameterGroups:
- Label:
default: Network
Parameters:
- VpcId
- Label:
default: Security group configuration
Parameters:
- CreateInboundSshSecurityGroup
- CreateInboundDcvSecurityGroup
- ClientIpCidr
Parameters:
VpcId:
Description: VPC where the AWS PCS cluster will be deployed
Type: 'AWS::EC2::VPC::Id'
ClientIpCidr:
Description: IP address(s) allowed to connect to nodes using SSH
Default: '0.0.0.0/0'
Type: String
AllowedPattern: (\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})
ConstraintDescription: Value must be a valid IP or network range of the form x.x.x.x/x.
CreateInboundSshSecurityGroup:
Description: Create an inbound security group to allow SSH access to nodes.
Type: String
Default: 'True'
AllowedValues:
- 'True'
- 'False'
CreateInboundDcvSecurityGroup:
Description: Create an inbound security group to allow DCV access to login nodes on TCP/UDP 8443.
Type: String
Default: 'False'
AllowedValues:
- 'True'
- 'False'
Conditions:
CreateSshSecGroup: !Equals [!Ref CreateInboundSshSecurityGroup, 'True']
CreateDcvSecGroup: !Equals [!Ref CreateInboundDcvSecurityGroup, 'True']
Resources:
ClusterSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Supports communications between AWS PCS controller, compute nodes, and client nodes
VpcId: !Ref VpcId
GroupName: !Sub 'cluster-${AWS::StackName}'
ClusterAllowAllInboundFromSelf:
Type: AWS::EC2::SecurityGroupIngress
Properties:
GroupId: !Ref ClusterSecurityGroup
IpProtocol: '-1'
SourceSecurityGroupId: !Ref ClusterSecurityGroup
ClusterAllowAllOutboundToSelf:
Type: AWS::EC2::SecurityGroupEgress
Properties:
GroupId: !Ref ClusterSecurityGroup
IpProtocol: '-1'
DestinationSecurityGroupId: !Ref ClusterSecurityGroup
# This allows all outbound comms, which enables HTTPS calls and connections to networked storage
ClusterAllowAllOutboundToWorld:
Type: AWS::EC2::SecurityGroupEgress
Properties:
GroupId: !Ref ClusterSecurityGroup
IpProtocol: '-1'
CidrIp: 0.0.0.0/0
# Attach this to login nodes to enable inbound SSH access.
InboundSshSecurityGroup:
Condition: CreateSshSecGroup
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Allows inbound SSH access
GroupName: !Sub 'inbound-ssh-${AWS::StackName}'
VpcId: !Ref VpcId
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 22
ToPort: 22
CidrIp: !Ref ClientIpCidr
# Attach this to login nodes to enable inbound DCV access on TCP/UDP 8443.
InboundDcvSecurityGroup:
Condition: CreateDcvSecGroup
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Allows inbound DCV access on TCP/UDP 8443
GroupName: !Sub 'inbound-dcv-${AWS::StackName}'
VpcId: !Ref VpcId
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 8443
ToPort: 8443
CidrIp: !Ref ClientIpCidr
- IpProtocol: udp
FromPort: 8443
ToPort: 8443
CidrIp: !Ref ClientIpCidr
Outputs:
ClusterSecurityGroupId:
Description: Supports communication between PCS controller, compute nodes, and login nodes
Value: !Ref ClusterSecurityGroup
InboundSshSecurityGroupId:
Condition: CreateSshSecGroup
Description: Enables SSH access to login nodes
Value: !Ref InboundSshSecurityGroup
InboundDcvSecurityGroupId:
Condition: CreateDcvSecGroup
Description: Enables DCV access to login nodes on TCP/UDP 8443
Value: !Ref InboundDcvSecurityGroup
- Under stack name, use something like AWSPCS-PTPro-sg.
- Select the VPC ID noted in step 1.
- Enable SSH, and optionally enable DCV access.
Use a Quick create link
You can use this AWS CloudFormation quick-create link to provision these security groups in us-east-1; however, be sure to change the VPC ID to the one created in step 1.
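As with step 1, this stack can also be created from the command line. The parameter values below are placeholders; replace vpc-0123456789abcdef0 with the VPC ID you noted in step 1:

```bash
# Create the security-group stack, enabling the optional inbound SSH group
aws cloudformation create-stack \
  --stack-name AWSPCS-PTPro-sg \
  --template-body file://1-pcs-cluster-cloudformation-security-groups.yaml \
  --parameters \
      ParameterKey=VpcId,ParameterValue=vpc-0123456789abcdef0 \
      ParameterKey=CreateInboundSshSecurityGroup,ParameterValue=True \
      ParameterKey=CreateInboundDcvSecurityGroup,ParameterValue=False

# After creation, list the outputs (ClusterSecurityGroupId, InboundSshSecurityGroupId)
# for use in later steps
aws cloudformation wait stack-create-complete --stack-name AWSPCS-PTPro-sg
aws cloudformation describe-stacks --stack-name AWSPCS-PTPro-sg \
  --query "Stacks[0].Outputs" --output table
```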
3. Create PCS Cluster¶
If you have already created a cluster in this manner, you can skip this step.
Go to the AWS PCS console and create a new cluster.
- Under Cluster setup, choose a name like AWSPCS-PTPro-cluster.
- Set the controller size to Small.
- Use the version of Slurm compatible with the ParaTools Pro for E4S™ image. This is usually the latest version available, 25.05 as of December 2025.
- Under Networking, select the VPC created in step 1, a subnet, and the cluster security group created in step 2.
- Click "Create Cluster" to begin creating the cluster.
4. Create shared filesystem using EFS¶
- Go to EFS console and create a new filesystem.
- Ensure it is in the same region as the PCS cluster you are setting up.
- Create a file system
- For the name choose something like
AWSPCS-PTPro-fs. - Under "Virtual Private Cloud", use the VPC ID created in step 1.
- Click "Create File System"
- Note the FS ID.
- For the name choose something like
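If you prefer the CLI, the equivalent calls are shown below. Unlike the console quick-create flow, you must add a mount target in the subnet where your nodes will run; the file system, subnet, and security group IDs are placeholders for the values from steps 1 and 2:

```bash
# Create the EFS file system and tag it with a recognizable name
aws efs create-file-system \
  --creation-token ptpro-efs \
  --tags Key=Name,Value=AWSPCS-PTPro-fs

# Create a mount target in the private subnet used by the node groups,
# allowing NFS traffic from the cluster security group
aws efs create-mount-target \
  --file-system-id fs-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0
```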
5. Create an Instance Profile¶
Go to the IAM console. Under Access Management -> Policies, check whether a policy matching the one below already exists (try searching for pcs). If no such policy exists, create a new one and specify the permissions using the JSON editor as follows:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"pcs:RegisterComputeNodeGroupInstance"
],
"Resource": "*",
"Effect": "Allow"
}
]
}
Name the new policy something like AWS-PCS-policy and note the name that you chose.
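The same policy can be created from the command line. Save the JSON above to a file (for example pcs-register-policy.json, a name chosen here for illustration) and run:

```bash
# Create the customer-managed policy that lets instances register with PCS
aws iam create-policy \
  --policy-name AWS-PCS-policy \
  --policy-document file://pcs-register-policy.json
```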
Additional optional steps to enable DCV remote desktop access
If you plan to access the login node with DCV, you will need to create an additional policy granting access to the DCV license server. If a matching policy already exists you can reuse it (try searching for DCV). If no such policy exists, create a new one, specifying the permissions with the JSON editor as follows:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::dcv-license.region/us-*"
}
]
}
Give it a name like EC2AccessDCVLicenseS3. If you are deploying in a region other than us-east-1, replace the region in the Resource ARN accordingly.
Next, in the IAM console go to Access Management -> Roles and check whether a role starting with AWSPCS- exists with the following policies attached.
If not, follow these instructions to create it.
- Select "Create Role"
- Select Trusted Entity Type: "AWS Service"
- Service or use case: "EC2"
- Use Case: "EC2"
- Click Next
- Add permissions
- Add the policy created earlier in step 5.
- If planning to use DCV to access the login node, also add the `EC2AccessDCVLicenseS3` policy.
- Add the `AmazonSSMManagedInstanceCore` policy.
- Click Next
- Give the role a name that starts with `AWSPCS-` (it must start with `AWSPCS-`)
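The role and its instance profile can also be created with the CLI. The role name below is an example, the trust policy file name is arbitrary, and the ARN of the AWS-PCS-policy created earlier is a placeholder (replace 123456789012 with your account ID):

```bash
# Trust policy allowing EC2 instances to assume the role
cat > ec2-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Principal": { "Service": "ec2.amazonaws.com" }, "Action": "sts:AssumeRole" }
  ]
}
EOF

# Create a role whose name starts with AWSPCS- and attach the required policies
aws iam create-role --role-name AWSPCS-PTPro-role \
  --assume-role-policy-document file://ec2-trust-policy.json
aws iam attach-role-policy --role-name AWSPCS-PTPro-role \
  --policy-arn arn:aws:iam::123456789012:policy/AWS-PCS-policy
aws iam attach-role-policy --role-name AWSPCS-PTPro-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

# Unlike the console flow, the CLI does not create an instance profile automatically,
# so wrap the role in one with the same name
aws iam create-instance-profile --instance-profile-name AWSPCS-PTPro-role
aws iam add-role-to-instance-profile \
  --instance-profile-name AWSPCS-PTPro-role \
  --role-name AWSPCS-PTPro-role
```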
6. Create EFA Placement Group¶
If such a placement group already exists you may simply reuse it.
Under the EC2 Console, navigate to Network & Security -> Placement Groups -> "Create placement group"
- Name it something like AWSPCS-PTPro-cluster
- Set strategy = "cluster"
- Click "Create group"
7. Create node Launch Template¶
Using CloudFormation, create a new stack for the node launch templates using the following template:
2-pcs-cluster-cloudformation-launch-templates.yaml
Show template contents (click to expand)
# original source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/getting_started/assets/pcs-lt-efs-fsxl.yaml
# has been modified
AWSTemplateFormatVersion: 2010-09-09
Description: EC2 launch templates for AWS PCS login and compute node groups.
This template creates EC2 launch templates for AWS PCS login and compute node groups.
It demonstrates mounting EFS and FSx for Lustre file systems, configuring EC2 instance tags, enabling Instance Metadata Service Version 2 (IMDSv2), and setting up the cluster security group for communication with the AWS PCS controller.
Additionally, it shows how to configure inbound SSH access to the login nodes.
Use this template as a starting point to create custom launch templates tailored to your specific requirements.
Check the Outputs tab of this stack for useful details about resources created by this template.
Metadata:
AWS::CloudFormation::Interface:
ParameterGroups:
- Label:
default: Security
Parameters:
- VpcDefaultSecurityGroupId
- ClusterSecurityGroupId
- SshSecurityGroupId
- EnableDcvAccess
- DcvSecurityGroupId
- SshKeyName
- Label:
default: Networking
Parameters:
- VpcId
- PlacementGroupName
- NodeGroupSubnetId
- Label:
default: File systems
Parameters:
- EfsFilesystemId
- FSxLustreFilesystemId
- FSxLustreFilesystemMountName
Parameters:
VpcId:
Type: 'AWS::EC2::VPC::Id'
Description: Cluster VPC where EFA-enabled instances will be launched
NodeGroupSubnetId:
Type: AWS::EC2::Subnet::Id
Description: Subnet within cluster VPC where EFA-enabled instances will be launched
VpcDefaultSecurityGroupId:
Type: AWS::EC2::SecurityGroup::Id
Description: Cluster VPC 'default' security group. Make sure you choose the one from your cluster VPC!
ClusterSecurityGroupId:
Type: AWS::EC2::SecurityGroup::Id
Description: Security group for PCS cluster controller and nodes.
SshSecurityGroupId:
Type: AWS::EC2::SecurityGroup::Id
Description: Security group for SSH into login nodes
EnableDcvAccess:
Type: String
Description: Enable DCV access to login nodes? When set to True, the DcvSecurityGroupId parameter below will be required.
Default: 'True'
AllowedValues:
- 'True'
- 'False'
DcvSecurityGroupId:
Type: AWS::EC2::SecurityGroup::Id
Description: Security group for DCV access to login nodes (only used if EnableDcvAccess is True)
SshKeyName:
Type: AWS::EC2::KeyPair::KeyName
Description: SSH key name for access to login nodes
EfsFilesystemId:
Type: String
Description: Amazon EFS file system Id
FSxLustreFilesystemId:
Type: String
Description: Amazon FSx for Lustre file system Id
FSxLustreFilesystemMountName:
Type: String
Description: Amazon FSx for Lustre mount name
PlacementGroupName:
Type: String
    Description: Placement group name for compute nodes (leave blank to create a new one)
Default: "AWSPCS-PTPro-cluster"
Conditions:
HasDcvAccess: !Equals [!Ref EnableDcvAccess, 'True']
Resources:
EfaSecurityGroup:
Type: 'AWS::EC2::SecurityGroup'
Properties:
GroupDescription: Support EFA
GroupName: !Sub 'efa-${AWS::StackName}'
VpcId: !Ref VpcId
EfaSecurityGroupOutboundSelfRule:
Type: 'AWS::EC2::SecurityGroupEgress'
Properties:
IpProtocol: '-1'
GroupId: !Ref EfaSecurityGroup
Description: Allow outbound EFA traffic to SG members
DestinationSecurityGroupId: !Ref EfaSecurityGroup
EfaSecurityGroupInboundSelfRule:
Type: 'AWS::EC2::SecurityGroupIngress'
Properties:
IpProtocol: '-1'
GroupId: !Ref EfaSecurityGroup
Description: Allow inbound EFA traffic to SG members
SourceSecurityGroupId: !Ref EfaSecurityGroup
LoginLaunchTemplate:
Type: AWS::EC2::LaunchTemplate
Properties:
LaunchTemplateName: !Sub 'login-${AWS::StackName}'
LaunchTemplateData:
TagSpecifications:
- ResourceType: instance
Tags:
- Key: HPCRecipes
Value: "true"
MetadataOptions:
HttpEndpoint: enabled
HttpPutResponseHopLimit: 4
HttpTokens: required
KeyName: !Ref SshKeyName
SecurityGroupIds:
- !Ref ClusterSecurityGroupId
- !Ref SshSecurityGroupId
- !If [HasDcvAccess, !Ref DcvSecurityGroupId, !Ref "AWS::NoValue"]
- !Ref VpcDefaultSecurityGroupId
UserData:
Fn::Base64: !Sub |
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="
--==MYBOUNDARY==
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
packages:
- amazon-efs-utils
runcmd:
# Mount EFS filesystem as /home
- mkdir -p /tmp/home
- rsync -aA /home/ /tmp/home
- echo "${EfsFilesystemId}:/ /home efs tls,_netdev" >> /etc/fstab
- mount -a -t efs defaults
- if [ "enabled" == "$(sestatus | awk '/^SELinux status:/{print $3}')" ]; then setsebool -P use_nfs_home_dirs 1; fi
- rsync -aA --ignore-existing /tmp/home/ /home
- rm -rf /tmp/home/
# If provided, mount FSxL filesystem as /shared
- if [ ! -z "${FSxLustreFilesystemId}" ]; then amazon-linux-extras install -y lustre=latest; mkdir -p /shared; chmod a+rwx /shared; mount -t lustre ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /shared; chmod 777 /shared; fi
--==MYBOUNDARY==
ComputeLaunchTemplate:
Type: AWS::EC2::LaunchTemplate
Properties:
LaunchTemplateName: !Sub 'compute-${AWS::StackName}'
LaunchTemplateData:
TagSpecifications:
- ResourceType: instance
Tags:
- Key: HPCRecipes
Value: "true"
MetadataOptions:
HttpEndpoint: enabled
HttpPutResponseHopLimit: 4
HttpTokens: required
Placement:
GroupName: !Ref PlacementGroupName
NetworkInterfaces:
- Description: Primary network interface
DeviceIndex: 0
InterfaceType: efa
NetworkCardIndex: 0
SubnetId: !Ref NodeGroupSubnetId
Groups:
- !Ref EfaSecurityGroup
- !Ref ClusterSecurityGroupId
- !Ref VpcDefaultSecurityGroupId
KeyName: !Ref SshKeyName
UserData:
Fn::Base64: !Sub |
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="
--==MYBOUNDARY==
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
packages:
- amazon-efs-utils
runcmd:
# Mount EFS filesystem as /home
- mkdir -p /tmp/home
- rsync -aA /home/ /tmp/home
- echo "${EfsFilesystemId}:/ /home efs tls,_netdev" >> /etc/fstab
- mount -a -t efs defaults
- if [ "enabled" == "$(sestatus | awk '/^SELinux status:/{print $3}')" ]; then setsebool -P use_nfs_home_dirs 1; fi
- rsync -aA --ignore-existing /tmp/home/ /home
- rm -rf /tmp/home/
# If provided, mount FSxL filesystem as /shared
- if [ ! -z "${FSxLustreFilesystemId}" ]; then amazon-linux-extras install -y lustre=latest; mkdir -p /shared; chmod a+rwx /shared; mount -t lustre ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /shared; fi
--==MYBOUNDARY==
Outputs:
LoginLaunchTemplateId:
Description: "Login nodes template ID"
Value: !Ref LoginLaunchTemplate
LoginLaunchTemplateName:
Description: "Login nodes template name"
Value: !Sub 'login-${AWS::StackName}'
ComputeLaunchTemplateId:
Description: "Compute nodes template ID"
Value: !Ref ComputeLaunchTemplate
ComputeLaunchTemplateName:
Description: "Compute nodes template name"
Value: !Sub 'compute-${AWS::StackName}'
EfaSecurityGroupId:
Description: Security group created to support EFA communications
Value: !Ref EfaSecurityGroup
Set the following values:
- VpcDefaultSecurityGroupId = value of "default" security group obtained in step 1
- ClusterSecurityGroupId = get value from output of step 2 key = "ClusterSecurityGroupId"
- SshSecurityGroupId = get value from output of step 2 key = "InboundSshSecurityGroupId"
- SshKeyName = select an existing EC2 key pair (you will use this key to SSH to the login node in step 10)
- VpcId = get value from output of step 1 key = "VPC"
- PlacementGroupName = use name chosen in step 6
- NodeGroupSubnetId = select the subnet labeled with PrivateSubnetA created in step 1
- EfsFilesystemId = EFS ID of FS created in step 4
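As in earlier steps, this stack can also be created with the CLI by passing the parameters explicitly. The stack name and all IDs below are placeholders for the values gathered in steps 1–6; the FSx for Lustre parameters may be left empty if you are not using FSx:

```bash
# Create the launch-template stack; substitute the IDs collected in earlier steps
aws cloudformation create-stack \
  --stack-name AWSPCS-PTPro-lt \
  --template-body file://2-pcs-cluster-cloudformation-launch-templates.yaml \
  --parameters \
      ParameterKey=VpcId,ParameterValue=vpc-0123456789abcdef0 \
      ParameterKey=VpcDefaultSecurityGroupId,ParameterValue=sg-0aaaaaaaaaaaaaaaa \
      ParameterKey=ClusterSecurityGroupId,ParameterValue=sg-0bbbbbbbbbbbbbbbb \
      ParameterKey=SshSecurityGroupId,ParameterValue=sg-0cccccccccccccccc \
      ParameterKey=DcvSecurityGroupId,ParameterValue=sg-0dddddddddddddddd \
      ParameterKey=SshKeyName,ParameterValue=my-key \
      ParameterKey=NodeGroupSubnetId,ParameterValue=subnet-0123456789abcdef0 \
      ParameterKey=EfsFilesystemId,ParameterValue=fs-0123456789abcdef0 \
      ParameterKey=FSxLustreFilesystemId,ParameterValue= \
      ParameterKey=FSxLustreFilesystemMountName,ParameterValue= \
      ParameterKey=PlacementGroupName,ParameterValue=AWSPCS-PTPro-cluster
```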
8. Create node groups¶
In the PCS console, select the cluster created in step 3
- Create one node group for compute nodes
- Compute node groups -> Create compute node group
- Group name = compute-1
- EC2 Launch Template = `compute-<name>`, where `<name>` is the stack name chosen in step 7
- IAM instance profile = the instance profile for the `AWSPCS-` role created in step 5
- Subnets = PrivateSubnetA from step 1
- Instance types = g4dn.8xlarge (or another EFA-capable instance type)
- min count = 0, max count = 2
- AMI ID = Select a PCS-compatible AMI
- Create one node group for the login node
- Compute node groups -> Create compute node group
- Group name = login
- EC2 Launch Template = `login-<name>`, where `<name>` is the stack name chosen in step 7
- IAM instance profile = the same instance profile created in step 5
- Subnets = PublicSubnetA from step 1
- Instance types = g4dn.4xlarge (or other instance type)
- min count = 1, max count = 1
- AMI ID = Select a PCS-compatible AMI
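If you are unsure whether a given instance type supports EFA, you can check from the CLI before creating the compute node group; the instance type below is the one suggested above:

```bash
# Returns "true" if the instance type supports EFA networking
aws ec2 describe-instance-types \
  --instance-types g4dn.8xlarge \
  --query "InstanceTypes[0].NetworkInfo.EfaSupported" \
  --output text
```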
9. Create queue¶
In the PCS console, select the cluster created in step 3
- Queues -> Create queue
- name = compute-1
- Add the compute node group created in step 8.1
10. Connect to login node¶
In the PCS console, select the cluster created in step 3
- Compute node groups -> select login node group created in step 8.2
- Copy the "compute node group ID"
- Go to EC2 console -> Instances
- In the search bar "Find instances by attribute or tag (case sensitive)" search for the "compute node group ID"
- Select the resulting instance -- this is the login node
- Copy "Public IPv4 Address"
- SSH to that IP (allow the login node at least five minutes after launch to finish preparing itself before connecting)
- username = "ubuntu" (for our Ubuntu-based images; the username will vary depending on the image type)
- SSH key = use the key chosen in step 7
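The lookup in the EC2 console can also be done from the CLI. The filter below matches on the tag value only, so you do not need to know the exact tag key that PCS applies; the node group ID, key path, and address are placeholders:

```bash
# Find the running instance tagged with the login compute node group ID and print its public IP
NODE_GROUP_ID=pcs_0123456789   # placeholder: the compute node group ID copied from the PCS console
aws ec2 describe-instances \
  --filters "Name=tag-value,Values=${NODE_GROUP_ID}" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" \
  --output text

# SSH to the login node with the key pair selected in step 7
ssh -i ~/.ssh/my-key.pem ubuntu@<public-ip>
```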
11. Run sample job¶
Once connected to the login node, run `sinfo` to see Slurm queue information.
You should see the queue created in step 9
Submit a job: `sbatch -p <queue-name> script.sbatch` (a minimal example script is shown at the end of this step).
Since compute nodes are launched on demand, the first job submitted to a queue will cause the nodes to be spun up.
`squeue` will show the job state as `CF` while the nodes are provisioned.
Compute nodes will be brought down automatically after a period of inactivity, called `ScaledownIdletime`.
- This can be configured in step 3 during cluster creation by changing the "Slurm configuration" settings.
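For reference, a minimal script.sbatch along the lines of the submit command above might look like the following; the queue name, node count, and workload are examples only:

```bash
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute-1     # the queue created in step 9
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --output=hello-%j.out

# Print where each task ran; replace with your real workload (e.g. srun ./my_mpi_app)
srun hostname
```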
12. Shut nodes down¶
In the PCS console, select the cluster created in step 3