Looking for AWS ParallelCluster (PC)?
This guide covers AWS Parallel Computing Service (PCS), the AWS-managed Slurm service. For the open-source self-managed alternative, see Getting Started with AWS ParallelCluster.
This tutorial configures AWS Parallel Computing Service (PCS) with the matching ParaTools Pro for E4S™ on AWS PCS AMI from the AWS Marketplace.
You will use command-line tools, the AWS CLI, and the AWS console to create a cluster. The workflow uses several .yaml files that describe the stack and serve as inputs for AWS CloudFormation. The result is a GPU-accelerated head node that can spawn EC2 compute node instances linked with EFA networking.
This tutorial assumes that you have already created an AWS account and have administrative user access.
Tutorial
For additional context, see the official AWS PCS Getting Started guide. This tutorial follows the official guide with a few minor changes; refer to it if anything is unclear.
1. Create VPC and Subnets
It is possible to reuse an existing VPC and subnets
If a compatible VPC and subnets already exist, skip this step and use them in place of the VpcId, PrivateSubnetA, and PublicSubnetA references in later steps. Search for existing PTPro VPC stacks in us-east-1 with this link.
Create a new stack for the cluster's VPC and subnets using the CloudFormation console with the following template:
0-pcs-cluster-cloudformation-vpc-and-subnets.yaml
# source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/net/hpc_large_scale/assets/main.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: HPC-scale VPC with Multi-AZ Architecture.
  This template creates a highly available VPC infrastructure optimized for HPC workloads across multiple Availability Zones.
  It provisions both public and private subnets in two or optionally three AZs, with each subnet configured for 4096 IP addresses.
  The template sets up NAT Gateways and Internet Gateway for secure outbound connectivity from private subnets.
  VPC Flow Logs are enabled and directed to CloudWatch for comprehensive network traffic monitoring.
  An S3 VPC Endpoint is configured to allow private subnet resources to access S3 without traversing the internet.
  A VPC-wide security group is created to enable communication between resources within the VPC.
  Use this template as a foundation for building scalable, secure networking infrastructure for HPC workloads.
  Refer to the Outputs tab of the deployed stack for important resource identifiers including VPC ID, subnet IDs, security group ID, and internet gateway ID.
Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: VPC
        Parameters:
          - CidrBlock
      - Label:
          default: Subnets A
        Parameters:
          - CidrPublicSubnetA
          - CidrPrivateSubnetA
      - Label:
          default: Subnets B
        Parameters:
          - CidrPublicSubnetB
          - CidrPrivateSubnetB
      - Label:
          default: Subnets C
        Parameters:
          - ProvisionSubnetsC
          - CidrPublicSubnetC
          - CidrPrivateSubnetC
Parameters:
  CidrBlock:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.0.0/16
    Description: VPC CIDR Block (eg 10.3.0.0/16)
    Type: String
  CidrPublicSubnetA:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.0.0/20
    Description: VPC CIDR Block for the Public Subnet A
    Type: String
  CidrPublicSubnetB:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.16.0/20
    Description: VPC CIDR Block for the Public Subnet B
    Type: String
  CidrPublicSubnetC:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.32.0/20
    Description: VPC CIDR Block for the Public Subnet C
    Type: String
  CidrPrivateSubnetA:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.128.0/20
    Description: VPC CIDR Block for the Private Subnet A
    Type: String
  CidrPrivateSubnetB:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.144.0/20
    Description: VPC CIDR Block for the Private Subnet B
    Type: String
  CidrPrivateSubnetC:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.160.0/20
    Description: VPC CIDR Block for the Private Subnet C
    Type: String
  ProvisionSubnetsC:
    Type: String
    Description: Provision optional 3rd set of subnets
    Default: "True"
    AllowedValues:
      - "True"
      - "False"
Mappings:
  RegionMap:
    us-east-1:
      ZoneId1: use1-az6
      ZoneId2: use1-az4
      ZoneId3: use1-az5
    us-east-2:
      ZoneId1: use2-az2
      ZoneId2: use2-az3
      ZoneId3: use2-az1
    us-west-1:
      ZoneId1: usw1-az1
      ZoneId2: usw1-az3
      ZoneId3: usw1-az2
    us-west-2:
      ZoneId1: usw2-az1
      ZoneId2: usw2-az2
      ZoneId3: usw2-az3
    eu-central-1:
      ZoneId1: euc1-az3
      ZoneId2: euc1-az2
      ZoneId3: euc1-az1
    eu-west-1:
      ZoneId1: euw1-az1
      ZoneId2: euw1-az2
      ZoneId3: euw1-az3
    eu-west-2:
      ZoneId1: euw2-az2
      ZoneId2: euw2-az3
      ZoneId3: euw2-az1
    eu-west-3:
      ZoneId1: euw3-az1
      ZoneId2: euw3-az2
      ZoneId3: euw3-az3
    eu-north-1:
      ZoneId1: eun1-az2
      ZoneId2: eun1-az1
      ZoneId3: eun1-az3
    ca-central-1:
      ZoneId1: cac1-az2
      ZoneId2: cac1-az1
      ZoneId3: cac1-az3
    eu-south-1:
      ZoneId1: eus1-az2
      ZoneId2: eus1-az1
      ZoneId3: eus1-az3
    ap-east-1:
      ZoneId1: ape1-az3
      ZoneId2: ape1-az2
      ZoneId3: ape1-az1
    ap-northeast-1:
      ZoneId1: apne1-az4
      ZoneId2: apne1-az1
      ZoneId3: apne1-az2
    ap-northeast-2:
      ZoneId1: apne2-az1
      ZoneId2: apne2-az3
      ZoneId3: apne2-az2
    ap-south-1:
      ZoneId1: aps1-az2
      ZoneId2: aps1-az3
      ZoneId3: aps1-az1
    ap-southeast-1:
      ZoneId1: apse1-az1
      ZoneId2: apse1-az2
      ZoneId3: apse1-az3
    ap-southeast-2:
      ZoneId1: apse2-az3
      ZoneId2: apse2-az1
      ZoneId3: apse2-az2
    us-gov-west-1:
      ZoneId1: usgw1-az2
      ZoneId2: usgw1-az1
      ZoneId3: usgw1-az3
    us-gov-east-1:
      ZoneId1: usge1-az3
      ZoneId2: usge1-az2
      ZoneId3: usge1-az1
    ap-northeast-3:
      ZoneId1: apne3-az3
      ZoneId2: apne3-az2
      ZoneId3: apne3-az1
    sa-east-1:
      ZoneId1: sae1-az3
      ZoneId2: sae1-az2
      ZoneId3: sae1-az1
    af-south-1:
      ZoneId1: afs1-az3
      ZoneId2: afs1-az2
      ZoneId3: afs1-az1
    ap-south-2:
      ZoneId1: aps2-az3
      ZoneId2: aps2-az2
      ZoneId3: aps2-az1
    ap-southeast-3:
      ZoneId1: apse3-az3
      ZoneId2: apse3-az2
      ZoneId3: apse3-az1
    ap-southeast-4:
      ZoneId1: apse4-az3
      ZoneId2: apse4-az2
      ZoneId3: apse4-az1
    ca-west-1:
      ZoneId1: caw1-az3
      ZoneId2: caw1-az2
      ZoneId3: caw1-az1
    eu-central-2:
      ZoneId1: euc2-az3
      ZoneId2: euc2-az2
      ZoneId3: euc2-az1
    eu-south-2:
      ZoneId1: eus2-az3
      ZoneId2: eus2-az2
      ZoneId3: eus2-az1
    il-central-1:
      ZoneId1: ilc1-az3
      ZoneId2: ilc1-az2
      ZoneId3: ilc1-az1
    me-central-1:
      ZoneId1: mec1-az3
      ZoneId2: mec1-az2
      ZoneId3: mec1-az1
Conditions:
  DoProvisionSubnetsC: !Equals [!Ref ProvisionSubnetsC, "True"]
Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: !Ref CidrBlock
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: "Name"
          Value: !Sub '${AWS::StackName}:Large-Scale-HPC'
  VPCFlowLog:
    Type: AWS::EC2::FlowLog
    Properties:
      ResourceId: !Ref VPC
      ResourceType: VPC
      TrafficType: ALL
      LogDestinationType: cloud-watch-logs
      LogGroupName: !Sub '${AWS::StackName}-VPCFlowLogs'
      DeliverLogsPermissionArn: !GetAtt FlowLogRole.Arn
  FlowLogRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - vpc-flow-logs.amazonaws.com
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - !Ref AWS::NoValue
      Policies:
        - PolicyName: FlowLogPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "logs:CreateLogGroup"
                  - "logs:CreateLogStream"
                  - "logs:PutLogEvents"
                  - "logs:DescribeLogGroups"
                  - "logs:DescribeLogStreams"
                Resource: !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:${AWS::StackName}-VPCFlowLogs:*"
  PublicSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetA
      AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub
            - '${StackName}:PublicSubnetA-${AvailabilityZone}'
            - StackName: !Ref AWS::StackName
              AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
  PublicSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetB
      AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub
            - '${StackName}:PublicSubnetB-${AvailabilityZone}'
            - StackName: !Ref AWS::StackName
              AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
  PublicSubnetC:
    Type: AWS::EC2::Subnet
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetC
      AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub
            - '${StackName}:PublicSubnetC-${AvailabilityZone}'
            - StackName: !Ref AWS::StackName
              AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
  InternetGateway:
    Type: AWS::EC2::InternetGateway
  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway
  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PublicRoute'
  PublicRoute1:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway
  PublicSubnetARouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetA
      RouteTableId: !Ref PublicRouteTable
  PublicSubnetBRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetB
      RouteTableId: !Ref PublicRouteTable
  PublicSubnetCRouteTableAssociation:
    Condition: DoProvisionSubnetsC
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetC
      RouteTableId: !Ref PublicRouteTable
  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetA
      MapPublicIpOnLaunch: false
      Tags:
        - Key: Name
          Value: !Sub
            - '${StackName}:PrivateSubnetA-${AvailabilityZone}'
            - StackName: !Ref AWS::StackName
              AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
  PrivateSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetB
      MapPublicIpOnLaunch: false
      Tags:
        - Key: Name
          Value: !Sub
            - '${StackName}:PrivateSubnetB-${AvailabilityZone}'
            - StackName: !Ref AWS::StackName
              AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
  PrivateSubnetC:
    Type: AWS::EC2::Subnet
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetC
      MapPublicIpOnLaunch: false
      Tags:
        - Key: Name
          Value: !Sub
            - '${StackName}:PrivateSubnetC-${AvailabilityZone}'
            - StackName: !Ref AWS::StackName
              AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
  NatGatewayAEIP:
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc
  NatGatewayBEIP:
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc
  NatGatewayCEIP:
    Condition: DoProvisionSubnetsC
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc
  NatGatewayA:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatGatewayAEIP.AllocationId
      SubnetId: !Ref PublicSubnetA
  NatGatewayB:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatGatewayBEIP.AllocationId
      SubnetId: !Ref PublicSubnetB
  NatGatewayC:
    Type: AWS::EC2::NatGateway
    Condition: DoProvisionSubnetsC
    Properties:
      AllocationId: !GetAtt NatGatewayCEIP.AllocationId
      SubnetId: !Ref PublicSubnetC
  PrivateRouteTableA:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteA'
  PrivateRouteTableB:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteB'
  PrivateRouteTableC:
    Type: AWS::EC2::RouteTable
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteC'
  DefaultPrivateRouteA:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableA
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayA
  DefaultPrivateRouteB:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableB
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayB
  DefaultPrivateRouteC:
    Type: AWS::EC2::Route
    Condition: DoProvisionSubnetsC
    Properties:
      RouteTableId: !Ref PrivateRouteTableC
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayC
  PrivateSubnetARouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTableA
      SubnetId: !Ref PrivateSubnetA
  PrivateSubnetBRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTableB
      SubnetId: !Ref PrivateSubnetB
  PrivateSubnetCRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: DoProvisionSubnetsC
    Properties:
      RouteTableId: !Ref PrivateRouteTableC
      SubnetId: !Ref PrivateSubnetC
  AvailabiltyZone1:
    Type: Custom::AvailabiltyZone
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId1]
  AvailabiltyZone2:
    Type: Custom::AvailabiltyZone
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId2]
  AvailabiltyZone3:
    Type: Custom::AvailabiltyZone
    Condition: DoProvisionSubnetsC
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId3]
  LogGroupGetAZLambdaFunction:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      LogGroupName: !Sub /aws/lambda/${GetAZLambdaFunction}
      RetentionInDays: 7
  GetAZLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      Description: GetAZLambdaFunction
      Timeout: 60
      Runtime: python3.12
      Handler: index.handler
      Role: !GetAtt GetAZLambdaRole.Arn
      Code:
        ZipFile: |
          import cfnresponse
          from json import dumps
          from boto3 import client
          EC2 = client('ec2')
          def handler(event, context):
              if event['RequestType'] in ('Create', 'Update'):
                  print(dumps(event, default=str))
                  data = {}
                  try:
                      # Resolve the zone ID (e.g. use1-az6) to this account's zone name
                      response = EC2.describe_availability_zones(
                          Filters=[{'Name': 'zone-id', 'Values': [event['ResourceProperties']['ZoneId']]}]
                      )
                      print(dumps(response, default=str))
                      data['ZoneName'] = response['AvailabilityZones'][0]['ZoneName']
                  except Exception as error:
                      cfnresponse.send(event, context, cfnresponse.FAILED, {}, reason=str(error))
                  else:
                      # Report SUCCESS only when the lookup succeeded
                      cfnresponse.send(event, context, cfnresponse.SUCCESS, data)
              else:
                  cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
      Tags:
        - Key: Name
          Value: !Sub ${AWS::StackName}GetAZLambdaFunction
  GetAZLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      Path: /
      Description: GetAZLambdaFunction
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - sts:AssumeRole
            Principal:
              Service:
                - !Sub 'lambda.${AWS::URLSuffix}'
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
      Policies:
        - PolicyName: GetAZLambdaFunction
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: ec2
                Effect: Allow
                Action:
                  - ec2:DescribeAvailabilityZones
                Resource:
                  - '*'
      Tags:
        - Key: Name
          Value: !Sub ${AWS::StackName}-GetAZLambdaFunction
  S3Endpoint:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      VpcEndpointType: 'Gateway'
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3'
      RouteTableIds:
        - !Ref PublicRouteTable
        - !Ref PrivateRouteTableA
        - !Ref PrivateRouteTableB
      VpcId: !Ref VPC
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow all traffic from resources in VPC
      VpcId:
        Ref: VPC
      SecurityGroupIngress:
        - IpProtocol: -1
          CidrIp: !Ref CidrBlock
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: !Ref CidrBlock
Outputs:
  VPC:
    Value: !Ref VPC
    Description: ID of the VPC
    Export:
      Name: !Sub ${AWS::StackName}-VPC
  PublicSubnets:
    Value: !Join
      - ','
      - - !Ref PublicSubnetA
        - !Ref PublicSubnetB
        - !If
          - DoProvisionSubnetsC
          - !Ref PublicSubnetC
          - !Ref AWS::NoValue
    Description: ID of the public subnets
    Export:
      Name: !Sub ${AWS::StackName}-PublicSubnets
  PrivateSubnets:
    Value: !Join
      - ','
      - - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
        - !If
          - DoProvisionSubnetsC
          - !Ref PrivateSubnetC
          - !Ref AWS::NoValue
    Description: ID of the private subnets
    Export:
      Name: !Sub ${AWS::StackName}-PrivateSubnets
  DefaultPrivateSubnet:
    Description: The ID of a default private subnet
    Value: !Ref PrivateSubnetA
    Export:
      Name: !Sub "${AWS::StackName}-DefaultPrivateSubnet"
  DefaultPublicSubnet:
    Description: The ID of a default public subnet
    Value: !Ref PublicSubnetA
    Export:
      Name: !Sub "${AWS::StackName}-DefaultPublicSubnet"
  InternetGatewayId:
    Description: The ID of the Internet Gateway
    Value: !Ref InternetGateway
    Export:
      Name: !Sub "${AWS::StackName}-InternetGateway"
  SecurityGroup:
    Description: The ID of the local security group
    Value: !Ref SecurityGroup
    Export:
      Name: !Sub "${AWS::StackName}-SecurityGroup"
Give the stack a name like AWSPCS-PTPro-cluster and leave the options at their defaults.
Alternatively, use this AWS CloudFormation quick-create link to provision these resources with default settings.
Under Capabilities, check the box for I acknowledge that AWS CloudFormation might create IAM resources.
After the VPC is created, find its ID in the Amazon VPC Console by selecting VPCs and searching for the stack name. If the suggested stack name was used, search for PTPro. For deployments in us-east-1, use this link. Note the VPC ID for use in later steps.
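If you change any of the CIDR parameters instead of keeping the defaults, it can help to sanity-check them before deploying. The following is a minimal sketch using only Python's standard `ipaddress` module; the helper name `check_subnets` is hypothetical and not part of the template, and the values are the defaults copied from the template's Parameters section.

```python
import ipaddress
from itertools import combinations

# Default values copied from the template's Parameters section.
VPC_CIDR = "10.3.0.0/16"
SUBNET_CIDRS = {
    "PublicSubnetA": "10.3.0.0/20",
    "PublicSubnetB": "10.3.16.0/20",
    "PublicSubnetC": "10.3.32.0/20",
    "PrivateSubnetA": "10.3.128.0/20",
    "PrivateSubnetB": "10.3.144.0/20",
    "PrivateSubnetC": "10.3.160.0/20",
}


def check_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    # Every subnet must sit inside the VPC range.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    # No two subnets may share addresses.
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems


if __name__ == "__main__":
    for name, cidr in SUBNET_CIDRS.items():
        # A /20 block spans 4096 addresses, matching the template description.
        print(name, cidr, ipaddress.ip_network(cidr).num_addresses)
    print(check_subnets(VPC_CIDR, SUBNET_CIDRS) or "no problems found")
```

With the default values the check reports no problems. Note that the 4096 figure is the size of the block; AWS reserves a few addresses in every subnet, so slightly fewer are usable by instances.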
2. Create Security Groups
Summary
In this section, you will create three security groups:
- A cluster security group enabling communication between the compute nodes, login node, and AWS PCS controller.
- An inbound SSH group that can optionally be enabled to allow SSH logins on the login node.
- A DCV group that can optionally be enabled to allow DCV remote desktop connections to the login node.
It is possible to reuse existing security groups
If compatible security groups already exist, skip this step and substitute their IDs for the cluster-*-sg, InboundSshSecurityGroupId, and InboundDcvSecurityGroupId references in later steps.
Using CloudFormation, create a new stack for the security groups with the following template:
1-pcs-cluster-cloudformation-security-groups.yaml
# source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/getting_started/assets/pcs-cluster-sg.yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Security group for AWS PCS clusters.
  This template creates a self-referencing security group that enables communications between AWS PCS controller, compute nodes, and client nodes.
  Optionally, it can also create a security group to enable SSH access to the cluster, and DCV remote desktop access to the login node.
  Check the Outputs tab of this stack for useful details about resources created by this template.
Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Network
        Parameters:
          - VpcId
      - Label:
          default: Security group configuration
        Parameters:
          - CreateInboundSshSecurityGroup
          - CreateInboundDcvSecurityGroup
          - ClientIpCidr
Parameters:
  VpcId:
    Description: VPC where the AWS PCS cluster will be deployed
    Type: 'AWS::EC2::VPC::Id'
  ClientIpCidr:
    Description: IP address(s) allowed to connect to nodes using SSH or DCV.
    Default: '0.0.0.0/0'
    Type: String
    AllowedPattern: (\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})
    ConstraintDescription: Value must be a valid IP or network range of the form x.x.x.x/x.
  CreateInboundSshSecurityGroup:
    Description: Create an inbound security group to allow SSH access to nodes.
    Type: String
    Default: 'True'
    AllowedValues:
      - 'True'
      - 'False'
  CreateInboundDcvSecurityGroup:
    Description: Create an inbound security group to allow DCV access to login nodes on TCP/UDP 8443.
    Type: String
    Default: 'False'
    AllowedValues:
      - 'True'
      - 'False'
Conditions:
  CreateSshSecGroup: !Equals [!Ref CreateInboundSshSecurityGroup, 'True']
  CreateDcvSecGroup: !Equals [!Ref CreateInboundDcvSecurityGroup, 'True']
Resources:
  ClusterSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Supports communications between AWS PCS controller, compute nodes, and client nodes
      VpcId: !Ref VpcId
      GroupName: !Sub 'cluster-${AWS::StackName}'
  ClusterAllowAllInboundFromSelf:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol: '-1'
      SourceSecurityGroupId: !Ref ClusterSecurityGroup
  ClusterAllowAllOutboundToSelf:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol: '-1'
      DestinationSecurityGroupId: !Ref ClusterSecurityGroup
  # This allows all outbound comms, which enables HTTPS calls and connections to networked storage
  ClusterAllowAllOutboundToWorld:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol: '-1'
      CidrIp: 0.0.0.0/0
  # Attach this to login nodes to enable inbound SSH access.
  InboundSshSecurityGroup:
    Condition: CreateSshSecGroup
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allows inbound SSH access
      GroupName: !Sub 'inbound-ssh-${AWS::StackName}'
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: !Ref ClientIpCidr
  # Attach this to login nodes to enable inbound DCV access on TCP/UDP 8443.
  InboundDcvSecurityGroup:
    Condition: CreateDcvSecGroup
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allows inbound DCV access on TCP/UDP 8443
      GroupName: !Sub 'inbound-dcv-${AWS::StackName}'
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8443
          ToPort: 8443
          CidrIp: !Ref ClientIpCidr
        - IpProtocol: udp
          FromPort: 8443
          ToPort: 8443
          CidrIp: !Ref ClientIpCidr
Outputs:
  ClusterSecurityGroupId:
    Description: Supports communication between PCS controller, compute nodes, and login nodes
    Value: !Ref ClusterSecurityGroup
  InboundSshSecurityGroupId:
    Condition: CreateSshSecGroup
    Description: Enables SSH access to login nodes
    Value: !Ref InboundSshSecurityGroup
  InboundDcvSecurityGroupId:
    Condition: CreateDcvSecGroup
    Description: Enables DCV access to login nodes on TCP/UDP 8443
    Value: !Ref InboundDcvSecurityGroup
- Under Stack name, use something like AWSPCS-PTPro-sg.
- Set VpcId to the VPC ID noted in step 1.
- Enable SSH, and optionally enable DCV access.
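The template's ClientIpCidr parameter defaults to 0.0.0.0/0, which admits SSH and DCV connections from any address. If you prefer to restrict it, the sketch below shows one way to check a candidate value before entering it. It uses only the Python standard library; `check_client_cidr` is a hypothetical helper name, the regular expression is copied from the template's AllowedPattern, and the example address comes from a documentation-only range.

```python
import ipaddress
import re

# AllowedPattern copied from the template's ClientIpCidr parameter.
ALLOWED_PATTERN = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})")


def check_client_cidr(value):
    """Return (ok, message) for a candidate ClientIpCidr value."""
    # The whole value has to match the pattern, hence fullmatch.
    if not ALLOWED_PATTERN.fullmatch(value):
        return False, "does not match the template's x.x.x.x/x pattern"
    try:
        # strict=False accepts host addresses such as 203.0.113.7/24.
        net = ipaddress.ip_network(value, strict=False)
    except ValueError as err:
        # The pattern alone accepts out-of-range octets such as 999.
        return False, f"not a valid IPv4 network: {err}"
    if net.prefixlen == 0:
        return True, "valid, but open to the whole internet"
    return True, f"valid, allows {net.num_addresses} address(es)"


if __name__ == "__main__":
    for candidate in ("0.0.0.0/0", "203.0.113.7/32", "999.0.0.1/32"):
        print(candidate, check_client_cidr(candidate))
```

A /32 value limits access to a single client address, which is usually the tightest practical setting for a personal workstation.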
Use a Quick create link
Use this AWS CloudFormation quick-create link to provision these security groups in us-east-1. Change the VPC ID to the one created in step 1.
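For other regions or a different VPC, a quick-create link can be assembled by hand. The sketch below assumes the URL layout described in the AWS CloudFormation documentation for quick-create links (`templateURL`, `stackName`, and `param_<Name>` query parameters); verify the result in your browser before sharing it. The function name, the bucket URL, and the VPC ID are hypothetical placeholders, not values from this tutorial.

```python
from urllib.parse import urlencode


def quick_create_url(region, template_url, stack_name, parameters):
    """Build a CloudFormation quick-create URL (assumed layout, see lead-in)."""
    query = {"templateURL": template_url, "stackName": stack_name}
    # Template parameters are passed with a "param_" prefix.
    query.update({f"param_{name}": value for name, value in parameters.items()})
    return (
        "https://console.aws.amazon.com/cloudformation/home"
        f"?region={region}#/stacks/create/review?{urlencode(query)}"
    )


if __name__ == "__main__":
    print(
        quick_create_url(
            region="us-east-1",
            # Hypothetical location of the security-group template.
            template_url="https://example-bucket.s3.amazonaws.com/sg.yaml",
            stack_name="AWSPCS-PTPro-sg",
            # Placeholder: replace with the VPC ID noted in step 1.
            parameters={"VpcId": "vpc-0123456789abcdef0"},
        )
    )
```

Pre-filling VpcId this way avoids having to edit the parameter in the console after the link opens.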
3. Create PCS Cluster
It is possible to reuse an existing PCS cluster
If a compatible PCS cluster already exists, skip this step and reference its name in later steps.
Go to the AWS PCS console and create a new cluster.
- Under Cluster setup, choose a name like AWSPCS-PTPro-cluster.
- Set the Controller size to Small.
- Use the version of Slurm compatible with the ParaTools Pro for E4S™ image. This is usually the latest version available (25.05 as of December 2025).
- Under Networking:
  - Use the VPC created in step 1 (e.g., AWSPCS-PTPro-cluster...).
  - Select the subnet labeled PrivateSubnetA created in step 1.
  - Under Security groups, choose Select an existing security group.
  - Use the security group cluster-*-sg created in step 2 (e.g., cluster-AWSPCS-PTPro-sg).
- Click Create Cluster to begin creating the cluster.
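The console choices above can also be expressed as a request for the AWS SDK. The following is a rough sketch, assuming that boto3's `pcs` client exposes `create_cluster` with `clusterName`, `scheduler`, `size`, and `networking` arguments; check the current boto3 documentation before relying on it. The helper `build_create_cluster_request` and the subnet and security group IDs are hypothetical placeholders. The sketch only builds the request, so it runs without AWS credentials.

```python
def build_create_cluster_request(name, slurm_version, subnet_id, security_group_id, size="SMALL"):
    """Collect the step 3 console choices into one request dictionary."""
    return {
        "clusterName": name,
        # Slurm is the scheduler; pick the version matching the AMI.
        "scheduler": {"type": "SLURM", "version": slurm_version},
        # Controller size: Small in this tutorial.
        "size": size,
        "networking": {
            "subnetIds": [subnet_id],                  # PrivateSubnetA from step 1
            "securityGroupIds": [security_group_id],   # cluster security group from step 2
        },
    }


if __name__ == "__main__":
    request = build_create_cluster_request(
        name="AWSPCS-PTPro-cluster",
        slurm_version="25.05",
        subnet_id="subnet-0123456789abcdef0",      # placeholder
        security_group_id="sg-0123456789abcdef0",  # placeholder
    )
    print(request)
    # With credentials configured, the request could then be submitted, e.g.:
    #   import boto3
    #   boto3.client("pcs").create_cluster(**request)
```

Keeping the request in a plain dictionary makes it easy to review the values against steps 1 and 2 before anything is created.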
4. Create shared filesystem using EFS
- Go to the EFS console and ensure the region matches the region where the PCS cluster is being set up.
- Click Create file system:
  - Name: something like AWSPCS-PTPro-fs.
  - Virtual Private Cloud (VPC): the VPC ID from step 1.
  - Click Create.
- Note the File system ID (e.g., fs-0123456789abcdef0); it is needed in step 7.
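Because the file system ID is pasted by hand into a later step, a quick format check can catch copy errors early. This is a small sketch under two assumptions: that EFS IDs follow the `fs-` plus lowercase hexadecimal pattern given in the EFS API reference, and that the file system's DNS name follows the usual `<id>.efs.<region>.amazonaws.com` form. Both helper names are hypothetical.

```python
import re

# Assumed ID format: "fs-" followed by 8 to 40 lowercase hex characters.
EFS_ID_PATTERN = re.compile(r"fs-[0-9a-f]{8,40}")


def is_efs_id(value):
    """Return True if the value looks like an EFS file system ID."""
    return EFS_ID_PATTERN.fullmatch(value.strip()) is not None


def efs_dns_name(file_system_id, region):
    """Return the regional DNS name commonly used to mount the file system."""
    if not is_efs_id(file_system_id):
        raise ValueError(f"not an EFS file system ID: {file_system_id!r}")
    return f"{file_system_id.strip()}.efs.{region}.amazonaws.com"


if __name__ == "__main__":
    print(is_efs_id("fs-0123456789abcdef0"))
    print(efs_dns_name("fs-0123456789abcdef0", "us-east-1"))
```

The check only validates the shape of the ID; it does not confirm that the file system exists in your account.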
5. Create an Instance Profile
Recommended: use the CloudFormation template
The fastest and least error-prone path is to deploy the CloudFormation template below, which creates the policy, role, and instance profile in one step, including the DCV license policy correctly parameterized for the stack's region.
3-pcs-cluster-cloudformation-iam.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: >-
  IAM role, policies, and instance profile for AWS PCS cluster nodes
  (login + compute). Creates the role required by the AWS PCS service
  (name must start with "AWSPCS-"), attaches the minimum PCS policy,
  the AWS Systems Manager managed instance policy, and optionally the
  Amazon DCV license-bucket read policy needed for remote-desktop
  access on the login node.
Parameters:
  RoleNameSuffix:
    Type: String
    Default: PCS-cluster
    Description: >-
      Suffix appended to the required "AWSPCS-" prefix to form the role
      and instance-profile name. Must be unique in the account. The
      final name will be "AWSPCS-<RoleNameSuffix>".
    AllowedPattern: '[A-Za-z0-9+=,.@_-]+'
    MinLength: 1
    MaxLength: 50
  EnableDcvLicenseAccess:
    Type: String
    Default: 'true'
    AllowedValues: ['true', 'false']
    Description: >-
      If "true", attach a policy granting s3:GetObject on the Amazon
      DCV license bucket for this region, required for DCV remote-
      desktop licensing on EC2 instances.
Conditions:
  AttachDcvPolicy: !Equals [!Ref EnableDcvLicenseAccess, 'true']
Resources:
  PcsRegisterNodePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub '${AWS::StackName}-pcs-register-node'
      Description: Allow EC2 instances to register as AWS PCS compute node group members.
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - pcs:RegisterComputeNodeGroupInstance
            Resource: '*'
  DcvLicenseAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Condition: AttachDcvPolicy
    Properties:
      ManagedPolicyName: !Sub '${AWS::StackName}-dcv-license-access'
      Description: >-
        Allow EC2 instances to read the Amazon DCV license bucket for
        the stack's deployment region, required for DCV licensing.
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action: s3:GetObject
            Resource: !Sub 'arn:${AWS::Partition}:s3:::dcv-license.${AWS::Region}/*'
  PcsNodeRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub 'AWSPCS-${RoleNameSuffix}'
      Description: >-
        Instance role for AWS PCS cluster nodes. Name prefix "AWSPCS-"
        is required by the AWS PCS service.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: !Sub 'ec2.${AWS::URLSuffix}'
            Action: sts:AssumeRole
      ManagedPolicyArns: !If
        - AttachDcvPolicy
        - - !Ref PcsRegisterNodePolicy
          - !Ref DcvLicenseAccessPolicy
          - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore'
        - - !Ref PcsRegisterNodePolicy
          - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore'
  PcsNodeInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: !Sub 'AWSPCS-${RoleNameSuffix}'
      Roles:
        - !Ref PcsNodeRole
Outputs:
  RoleName:
    Description: Name of the PCS node IAM role.
    Value: !Ref PcsNodeRole
    Export:
      Name: !Sub '${AWS::StackName}-RoleName'
  RoleArn:
    Description: ARN of the PCS node IAM role.
    Value: !GetAtt PcsNodeRole.Arn
    Export:
      Name: !Sub '${AWS::StackName}-RoleArn'
  InstanceProfileName:
    Description: Name of the PCS node instance profile (pass this to node group / launch template).
    Value: !Ref PcsNodeInstanceProfile
    Export:
      Name: !Sub '${AWS::StackName}-InstanceProfileName'
  InstanceProfileArn:
    Description: ARN of the PCS node instance profile.
    Value: !GetAtt PcsNodeInstanceProfile.Arn
    Export:
      Name: !Sub '${AWS::StackName}-InstanceProfileArn'
Parameters:
- RoleNameSuffix (default PCS-cluster): the final role and instance-profile name is AWSPCS-<RoleNameSuffix>. The AWSPCS- prefix is required by AWS PCS.
- EnableDcvLicenseAccess (default true): attaches the DCV license read policy for remote-desktop use.
After the stack completes, reference the InstanceProfileName output in the node launch template in step 7. Skip to step 6.
To create the policy and role manually via the IAM console, follow the rest of this section.
Go to the IAM console. Under Access Management → Policies, check whether a policy matching this one already exists (search for pcs).
If none exists, create a new one and specify the permissions using the JSON editor as follows:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "pcs:RegisterComputeNodeGroupInstance"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
Name the new policy something like AWS-PCS-policy and note the name you chose.
Additional optional steps to enable DCV remote desktop access
To access the login node via DCV, create an additional policy granting read access to the DCV license bucket in Amazon S3.
If a matching policy already exists, reuse it (search for DCV).
Otherwise, create a new one, specifying the permissions with the JSON editor as follows:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::dcv-license.*/*"
    }
  ]
}
Give it a name like EC2AccessDCVLicenseS3.
Tighter region scope (optional)
The wildcard dcv-license.* matches only AWS-owned DCV license buckets (bucket name is reserved by AWS), so it is safe. For an explicit allowlist, enumerate the regions you deploy in, for example:
"Resource": [
  "arn:aws:s3:::dcv-license.us-east-1/*",
  "arn:aws:s3:::dcv-license.us-east-2/*",
  "arn:aws:s3:::dcv-license.us-west-1/*",
  "arn:aws:s3:::dcv-license.us-west-2/*"
]
In a CloudFormation template the policy Resource can be parameterized with !Sub 'arn:${AWS::Partition}:s3:::dcv-license.${AWS::Region}/*' so it substitutes the stack's region automatically. IAM policy JSON itself has no built-in variable for the EC2 instance's region.
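When the policy is created by hand, the explicit allowlist can be generated rather than typed. The sketch below builds the same policy document as above with one Resource entry per region, using only the standard library. The function name `dcv_license_policy` is hypothetical, and the region list is an example to adapt to your own deployments.

```python
import json


def dcv_license_policy(regions, partition="aws"):
    """Build the DCV license read policy scoped to the given regions."""
    resources = [f"arn:{partition}:s3:::dcv-license.{region}/*" for region in regions]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": resources,
            }
        ],
    }


if __name__ == "__main__":
    # Example regions; list only the regions you deploy in.
    policy = dcv_license_policy(["us-east-1", "us-west-2"])
    # Paste the printed JSON into the IAM console's JSON editor.
    print(json.dumps(policy, indent=2))
```

The `partition` argument mirrors the `${AWS::Partition}` substitution used in the CloudFormation template, so the same helper can produce ARNs for partitions other than `aws`.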
Next, in the IAM Console, go to Access Management → Roles and check whether a role starting with AWSPCS- already exists with the required policies attached.
If no such role exists, create one as follows:
- Select Create Role.
- For Trusted Entity Type, choose AWS Service.
- For Service or use case, choose EC2; for Use Case, choose EC2.
- Click Next.
- Under Add permissions:
  - Add the policy created earlier in this step (e.g., AWS-PCS-policy).
  - If planning to use DCV to access the login node, also add the EC2AccessDCVLicenseS3 policy.
  - Add the AmazonSSMManagedInstanceCore policy.
- Click Next.
- Give the role a name that starts with AWSPCS-, as AWS PCS requires (e.g., AWSPCS-PTPro-role).
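The naming rule is easy to get wrong when typing the role name by hand. The following sketch checks a candidate name against the AWSPCS- prefix required by this tutorial and against the general IAM limits on role names (at most 64 characters, drawn from letters, digits, and `+=,.@_-`). The helper name `check_role_name` is hypothetical.

```python
import re

REQUIRED_PREFIX = "AWSPCS-"
MAX_ROLE_NAME_LENGTH = 64  # IAM limit on role names
ROLE_NAME_CHARS = re.compile(r"[A-Za-z0-9+=,.@_-]+")


def check_role_name(name):
    """Return a list of problems; an empty list means the name is acceptable."""
    problems = []
    if not name.startswith(REQUIRED_PREFIX):
        problems.append(f'name must start with "{REQUIRED_PREFIX}"')
    if len(name) > MAX_ROLE_NAME_LENGTH:
        problems.append(f"name is longer than {MAX_ROLE_NAME_LENGTH} characters")
    if not ROLE_NAME_CHARS.fullmatch(name):
        problems.append("name contains characters IAM does not allow")
    return problems


if __name__ == "__main__":
    for candidate in ("AWSPCS-PTPro-role", "PTPro-role", "AWSPCS-bad name"):
        print(candidate, check_role_name(candidate) or "ok")
```

The same check applies to names produced by the CloudFormation template, where the name is the prefix followed by the RoleNameSuffix parameter.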
6. Create EFA Placement Group
It is possible to reuse an existing placement group
If a compatible cluster placement group already exists, skip this step and reference its name in later steps.
Under the EC2 Console, navigate to Network & Security → Placement Groups → Create placement group.
- Name: something like AWSPCS-PTPro-cluster.
- Placement strategy: Cluster.
- Click Create group.
7. Create node Launch Templates
This step creates two EC2 launch templates -- one for the login node and one for compute nodes -- both wired up for EFA networking and the shared EFS filesystem.
Using CloudFormation, create a new stack using the following template:
2-pcs-cluster-cloudformation-launch-templates.yaml
| # original source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/getting_started/assets/pcs-lt-efs-fsxl.yaml
# has been modified
AWSTemplateFormatVersion: 2010-09-09
Description: EC2 launch templates for AWS PCS login and compute node groups.
  This template creates EC2 launch templates for AWS PCS login and compute node groups.
  It demonstrates mounting EFS and FSx for Lustre file systems, configuring EC2 instance tags, enabling Instance Metadata Service Version 2 (IMDSv2), and setting up the cluster security group for communication with the AWS PCS controller.
  Additionally, it shows how to configure inbound SSH access to the login nodes, and optionally sets an initial password for the `ubuntu` user so DCV web sessions can sign in out of the box.
  Use this template as a starting point to create custom launch templates tailored to your specific requirements.
  Check the Outputs tab of this stack for useful details about resources created by this template.
Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Security
        Parameters:
          - VpcDefaultSecurityGroupId
          - ClusterSecurityGroupId
          - SshSecurityGroupId
          - EnableDcvAccess
          - DcvSecurityGroupId
          - DcvUbuntuPassword
          - SshKeyName
      - Label:
          default: Networking
        Parameters:
          - VpcId
          - PlacementGroupName
          - NodeGroupSubnetId
      - Label:
          default: File systems
        Parameters:
          - EfsFilesystemId
          - FSxLustreFilesystemId
          - FSxLustreFilesystemMountName
Parameters:
  VpcId:
    Type: 'AWS::EC2::VPC::Id'
    Description: Cluster VPC where EFA-enabled instances will be launched
  NodeGroupSubnetId:
    Type: AWS::EC2::Subnet::Id
    Description: Subnet within cluster VPC where EFA-enabled instances will be launched
  VpcDefaultSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Cluster VPC 'default' security group. Make sure you choose the one from your cluster VPC!
  ClusterSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for PCS cluster controller and nodes.
  SshSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for SSH into login nodes
  EnableDcvAccess:
    Type: String
    Description: Enable DCV access to login nodes? When set to True, the DcvSecurityGroupId parameter below will be required.
    Default: 'True'
    AllowedValues:
      - 'True'
      - 'False'
  DcvSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for DCV access to login nodes (only used if EnableDcvAccess is True)
  DcvUbuntuPassword:
    Type: String
    NoEcho: true
    Default: ''
    MinLength: 0
    MaxLength: 128
    AllowedPattern: "^$|^[A-Za-z0-9!@#%^&*()_+=,.:;/?<>{}\\[\\]~-]{8,128}$"
    Description: >-
      Optional initial password for the `ubuntu` user on the login node, used
      to sign in to DCV web sessions at https://<login-public-ip>:8443/. Leave
      blank to skip password setup (DCV will then require manually setting a
      password on the instance before the web UI can be used). 8-128 chars;
      avoid quotes, backticks, backslash, and whitespace.
  SshKeyName:
    Type: AWS::EC2::KeyPair::KeyName
    Description: SSH key name for access to login nodes
  EfsFilesystemId:
    Type: String
    Description: Amazon EFS file system Id
  FSxLustreFilesystemId:
    Type: String
    Description: Amazon FSx for Lustre file system Id
  FSxLustreFilesystemMountName:
    Type: String
    Description: Amazon FSx for Lustre mount name
  PlacementGroupName:
    Type: String
    Description: Placement group name for compute nodes (leave blank to create a new one)
    Default: "AWSPCS-PTPro-cluster"
Conditions:
  HasDcvAccess: !Equals [!Ref EnableDcvAccess, 'True']
Resources:
  EfaSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: Support EFA
      GroupName: !Sub 'efa-${AWS::StackName}'
      VpcId: !Ref VpcId
  EfaSecurityGroupOutboundSelfRule:
    Type: 'AWS::EC2::SecurityGroupEgress'
    Properties:
      IpProtocol: '-1'
      GroupId: !Ref EfaSecurityGroup
      Description: Allow outbound EFA traffic to SG members
      DestinationSecurityGroupId: !Ref EfaSecurityGroup
  EfaSecurityGroupInboundSelfRule:
    Type: 'AWS::EC2::SecurityGroupIngress'
    Properties:
      IpProtocol: '-1'
      GroupId: !Ref EfaSecurityGroup
      Description: Allow inbound EFA traffic to SG members
      SourceSecurityGroupId: !Ref EfaSecurityGroup
  LoginLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub 'login-${AWS::StackName}'
      LaunchTemplateData:
        TagSpecifications:
          - ResourceType: instance
            Tags:
              - Key: HPCRecipes
                Value: "true"
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 4
          HttpTokens: required
        KeyName: !Ref SshKeyName
        SecurityGroupIds:
          - !Ref ClusterSecurityGroupId
          - !Ref SshSecurityGroupId
          - !If [HasDcvAccess, !Ref DcvSecurityGroupId, !Ref "AWS::NoValue"]
          - !Ref VpcDefaultSecurityGroupId
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

            --==MYBOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"
            MIME-Version: 1.0

            packages:
            - amazon-efs-utils
            # Enable GDM autologin for `ubuntu` BEFORE gdm3 starts, so DCV's "console" session attaches
            # to the user's Xorg without a greeter->user-session transition. A late (`runcmd`) edit +
            # `systemctl restart gdm3` would segfault the mode=system dcvagent mid-transition.
            # Idempotent: sed regex won't match on reboot after first edit. Remove once the underlying
            # AMI bakes this config. Canonical per AWS DCV Ubuntu + GNOME setup.
            bootcmd:
            - sed -i 's/^# AutomaticLoginEnable = true/AutomaticLoginEnable = true/' /etc/gdm3/custom.conf
            - sed -i 's/^# AutomaticLogin = user1/AutomaticLogin = ubuntu/' /etc/gdm3/custom.conf
            write_files:
            - path: /etc/update-motd.d/99-dcv-url
              permissions: '0755'
              owner: root:root
              content: |
                #!/bin/sh
                TOKEN=$(curl -fsSL --max-time 2 -X PUT -H 'X-aws-ec2-metadata-token-ttl-seconds: 60' http://169.254.169.254/latest/api/token 2>/dev/null)
                IP=$(curl -fsSL --max-time 2 -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/public-ipv4 2>/dev/null)
                [ -n "$IP" ] || IP=$(hostname -I | awk '{print $1}')
                printf '\n DCV remote desktop: https://%s:8443/ (user: ubuntu)\n\n' "$IP"
            runcmd:
            # Mount EFS filesystem as /home
            - mkdir -p /tmp/home
            - rsync -aA /home/ /tmp/home
            - echo "${EfsFilesystemId}:/ /home efs tls,_netdev" >> /etc/fstab
            - mount -a -t efs defaults
            - if command -v sestatus >/dev/null 2>&1 && [ "X$(sestatus | awk '/^SELinux status:/{print $3}')" = "Xenabled" ]; then setsebool -P use_nfs_home_dirs 1; fi
            - rsync -aA --ignore-existing /tmp/home/ /home
            - rm -rf /tmp/home/
            # If provided, mount FSxL filesystem as /shared
            - if [ ! -z "${FSxLustreFilesystemId}" ]; then amazon-linux-extras install -y lustre=latest; mkdir -p /shared; chmod a+rwx /shared; mount -t lustre ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /shared; chmod 777 /shared; fi
            # If provided, set an initial password for the `ubuntu` user so DCV web sessions can sign in.
            - if [ -n '${DcvUbuntuPassword}' ]; then echo 'ubuntu:${DcvUbuntuPassword}' | chpasswd; fi

            --==MYBOUNDARY==--
  ComputeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub 'compute-${AWS::StackName}'
      LaunchTemplateData:
        TagSpecifications:
          - ResourceType: instance
            Tags:
              - Key: HPCRecipes
                Value: "true"
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 4
          HttpTokens: required
        Placement:
          GroupName: !Ref PlacementGroupName
        NetworkInterfaces:
          - Description: Primary network interface
            DeviceIndex: 0
            InterfaceType: efa
            NetworkCardIndex: 0
            SubnetId: !Ref NodeGroupSubnetId
            Groups:
              - !Ref EfaSecurityGroup
              - !Ref ClusterSecurityGroupId
              - !Ref VpcDefaultSecurityGroupId
        KeyName: !Ref SshKeyName
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

            --==MYBOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"
            MIME-Version: 1.0

            packages:
            - amazon-efs-utils
            runcmd:
            # Mount EFS filesystem as /home
            - mkdir -p /tmp/home
            - rsync -aA /home/ /tmp/home
            - echo "${EfsFilesystemId}:/ /home efs tls,_netdev" >> /etc/fstab
            - mount -a -t efs defaults
            - if command -v sestatus >/dev/null 2>&1 && [ "X$(sestatus | awk '/^SELinux status:/{print $3}')" = "Xenabled" ]; then setsebool -P use_nfs_home_dirs 1; fi
            - rsync -aA --ignore-existing /tmp/home/ /home
            - rm -rf /tmp/home/
            # If provided, mount FSxL filesystem as /shared
            - if [ ! -z "${FSxLustreFilesystemId}" ]; then amazon-linux-extras install -y lustre=latest; mkdir -p /shared; chmod a+rwx /shared; mount -t lustre ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /shared; fi

            --==MYBOUNDARY==--
Outputs:
  LoginLaunchTemplateId:
    Description: "Login nodes template ID"
    Value: !Ref LoginLaunchTemplate
  LoginLaunchTemplateName:
    Description: "Login nodes template name"
    Value: !Sub 'login-${AWS::StackName}'
  ComputeLaunchTemplateId:
    Description: "Compute nodes template ID"
    Value: !Ref ComputeLaunchTemplate
  ComputeLaunchTemplateName:
    Description: "Compute nodes template name"
    Value: !Sub 'compute-${AWS::StackName}'
  EfaSecurityGroupId:
    Description: Security group created to support EFA communications
    Value: !Ref EfaSecurityGroup
  DcvWebAccessHint:
    Description: >-
      Once the login node reaches "Active", point a browser at
      https://<login-public-ipv4>:8443/ and sign in as user `ubuntu`
      with the password supplied via DcvUbuntuPassword (if set). The
      login node also prints the resolved DCV URL in its MOTD on SSH /
      Session Manager login.
    Value: "https://<login-public-ipv4>:8443/ (user: ubuntu)"
|
Give the stack a name (e.g., AWSPCS-PTPro-lt). Populate the parameters as follows:
| Parameter | Value |
| --- | --- |
| VpcId | Output VPC from step 1 |
| VpcDefaultSecurityGroupId | The "default" security group of the VPC created in step 1 |
| ClusterSecurityGroupId | Output ClusterSecurityGroupId from step 2 |
| SshSecurityGroupId | Output InboundSshSecurityGroupId from step 2 |
| SshKeyName | An existing EC2 key pair you control |
| PlacementGroupName | Name chosen in step 6 |
| NodeGroupSubnetId | PrivateSubnetA from step 1 |
| EfsFilesystemId | EFS filesystem ID from step 4 |
| DcvUbuntuPassword | Optional initial password for the ubuntu user, used to sign in to DCV web sessions (see DCV remote desktop in step 10). Leave blank to skip and set a password manually later. Marked NoEcho so it is not shown in the console or stack events. |
After the stack reaches CREATE_COMPLETE, note the launch template names from the stack outputs. They will be named login-<stack-name> and compute-<stack-name>, and are referenced in step 8.
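The stack outputs can also be read from a terminal, which is handy because the template exports the launch template IDs (LoginLaunchTemplateId, ComputeLaunchTemplateId) alongside the names. A minimal sketch, assuming the AWS CLI is configured for the stack's region; `stack_output` is an illustrative helper name and AWSPCS-PTPro-lt is the example stack name from above.

```shell
# Sketch: print a single output value from a CloudFormation stack.
#   $1 = stack name (for example AWSPCS-PTPro-lt)
#   $2 = output key as declared in the template's Outputs section
stack_output() {
    aws cloudformation describe-stacks \
        --stack-name "$1" \
        --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue" \
        --output text
}

# Usage:
#   stack_output AWSPCS-PTPro-lt ComputeLaunchTemplateName
#   stack_output AWSPCS-PTPro-lt LoginLaunchTemplateId
```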
8. Create node groups
This tutorial's cluster uses two compute node groups: a statically scaled group for the interactive login node and an elastic group for the compute nodes that run jobs.
In the AWS PCS console, select the cluster created in step 3, navigate to Compute node groups, and click Create.
AMI selection
For the AMI ID field, use a ParaTools Pro for E4S™ PCS-compatible AMI from the AWS Marketplace. Use the same AMI for both node groups so the login and compute environments stay in sync. Pick the product matching your cluster's target architecture:
Obtaining the AMI ID after subscribing:
- Open the marketplace product page above and click View purchase options / Continue to Subscribe.
- Accept the terms and wait for the subscription to be processed.
- Click Continue to Configuration.
- Select the delivery method, software version, and AWS region matching your cluster.
- Copy the AMI ID shown on the configuration page (format: ami-0123456789abcdef0). Use this value in the AMI ID field when creating the compute node groups below.
Alternatively, after subscribing, find the AMI in the EC2 console under Images → AMIs, filtered by Owner alias = aws-marketplace and searching for ParaTools.
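The EC2 console search described above has a CLI counterpart. This is a minimal sketch, assuming the AWS CLI is configured for the cluster's region and that the AMI name contains the string ParaTools; that name pattern is an assumption based on the console search term, and `find_marketplace_amis` is an illustrative helper name.

```shell
# Sketch: list AWS Marketplace AMIs whose name matches a pattern,
# oldest first, as "ImageId  Architecture  Name" rows.
#   $1 = optional name pattern (default: *ParaTools*, an assumed pattern)
find_marketplace_amis() {
    aws ec2 describe-images \
        --owners aws-marketplace \
        --filters "Name=name,Values=${1:-*ParaTools*}" \
        --query 'sort_by(Images, &CreationDate)[].[ImageId,Architecture,Name]' \
        --output text
}

# Usage:
#   find_marketplace_amis
#   find_marketplace_amis '*ParaTools*'
```

Check the Architecture column against the instance types chosen below before copying an AMI ID.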
Recommended instance types
Choose instance types that match the AMI architecture. EFA is required for tightly-coupled MPI on compute nodes; GPU login nodes enable DCV/interactive visualization without EFA.
| Role | x86_64 | arm64 |
| --- | --- | --- |
| Compute node group | g4dn.8xlarge (NVIDIA T4, EFA) | hpc7g.8xlarge (Graviton3E, 200 Gbps EFA, no GPU) |
| Login node group (~4xlarge) | g4dn.4xlarge (NVIDIA T4) | g5g.4xlarge (Graviton2 + NVIDIA T4G) |
g5g has no EFA and is suited only for login / interactive visualization, not for compute.
8.1 Compute node group (compute-1)
This is a dynamic node group: instances are launched when jobs are submitted and terminated after the configured idle time, scaling down to zero when the queue is empty.
- Under Compute node group details:
  - Compute node group name: compute-1.
- Under Compute configuration:
  - EC2 launch template: compute-<stack-name> from step 7.
  - Version: select the latest version of the launch template.
  - IAM instance profile: select the Use an existing profile radio, then under Selected profile choose the AWSPCS-* role created in step 5.
  - Subnets: PrivateSubnetA from step 1.
  - Instance types: g4dn.8xlarge (for arm64 clusters, see the Recommended instance types tip above).
  - Scaling configuration: select the Dynamic node group radio. Set Minimum instance count to 0 and Maximum instance count to 2.
  - AMI ID: select the Custom AMI radio, then paste the ParaTools Pro for E4S™ AMI ID obtained from the marketplace subscription (see the AMI selection note above).
- Leave Capacity purchase option at its default (On-Demand). Skip Scheduler configuration and Tags.
- Click Create compute node group and wait for the Status field to show Active before proceeding.
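The console form above corresponds roughly to one `aws pcs create-compute-node-group` call. The following is a hedged sketch, not the tutorial's official method: it assumes a recent AWS CLI with the `pcs` commands, the function name is illustrative, and every argument is a value you collected in earlier steps (the CLI expects the launch template ID and an explicit version number, and the instance profile ARN instead of the role name).

```shell
# Sketch: create the dynamic compute-1 node group from the CLI.
# Nothing is looked up automatically; all values come from earlier steps.
#   $1 = cluster name or ID            $2 = PrivateSubnetA subnet ID
#   $3 = compute launch template ID    $4 = launch template version number
#   $5 = IAM instance profile ARN      $6 = ParaTools Pro for E4S AMI ID
create_compute_group() {
    aws pcs create-compute-node-group \
        --cluster-identifier "$1" \
        --compute-node-group-name compute-1 \
        --subnet-ids "$2" \
        --custom-launch-template "id=$3,version=$4" \
        --iam-instance-profile-arn "$5" \
        --ami-id "$6" \
        --instance-configs instanceType=g4dn.8xlarge \
        --scaling-configuration minInstanceCount=0,maxInstanceCount=2
}
```

The login group in 8.2 would differ only in the group name, subnet, launch template, instance type, and a scaling configuration of 1 and 1.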
8.2 Login node group (login)
This is a static node group: a single long-running instance you SSH into (or access via Session Manager) to submit jobs.
- Navigate back to Compute node groups and click Create.
- Under Compute node group details:
  - Compute node group name: login.
- Under Compute configuration:
  - EC2 launch template: login-<stack-name> from step 7.
  - Version: select the latest version of the launch template.
  - IAM instance profile: select the Use an existing profile radio, then under Selected profile choose the same AWSPCS-* role used for compute-1.
  - Subnets: PublicSubnetA from step 1.
  - Instance types: g4dn.4xlarge (for arm64 clusters, see the Recommended instance types tip above).
  - Scaling configuration: select the Static node group radio. Set both Minimum instance count and Maximum instance count to 1.
  - AMI ID: select the Custom AMI radio and paste the same ParaTools Pro for E4S™ AMI ID used for compute-1.
- Leave Capacity purchase option, Scheduler configuration, and Tags at their defaults.
- Click Create compute node group.
Wait for Active status
Wait for the login group to reach Active before attempting to connect in step 10. The login instance needs several minutes after activation for cloud-init and Slurm configuration to complete.
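Instead of refreshing the console, the status can be polled from a terminal. A minimal sketch, assuming a recent AWS CLI with the `pcs` commands; `wait_for_node_group` and the `POLL_SECONDS` variable are illustrative names, not part of PCS.

```shell
# Sketch: poll a PCS compute node group until it reports ACTIVE.
#   $1 = cluster name or ID, $2 = compute node group name or ID
# Returns 0 on ACTIVE and 1 on a failed or deleted state or a CLI error.
wait_for_node_group() {
    while :; do
        status=$(aws pcs get-compute-node-group \
            --cluster-identifier "$1" \
            --compute-node-group-identifier "$2" \
            --query 'computeNodeGroup.status' --output text) || return 1
        echo "node group $2: $status"
        case "$status" in
            ACTIVE) return 0 ;;
            *FAILED*|DELETING|DELETED) return 1 ;;
        esac
        sleep "${POLL_SECONDS:-30}"   # poll interval, overridable
    done
}

# Usage:
#   wait_for_node_group <cluster-name> login
```

ACTIVE only means PCS has finished creating the group; cloud-init on the login instance may still be running.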
9. Create queue
A queue exposes a compute node group to Slurm as a partition. Jobs submitted with sbatch -p <queue-name> will land on the attached compute node group.
Before creating the queue, ensure the compute-1 group from step 8.1 has reached Active status.
In the AWS PCS console, select the cluster created in step 3, navigate to Queues, and click Create queue.
- Under Queue configuration:
  - Queue name: compute-1 (this becomes the Slurm partition name).
  - Compute node groups: select compute-1 from step 8.1.
- Click Create queue and wait for the Status field to show Active.
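For reference, the queue can also be created with the AWS CLI. This is a sketch under the assumption that a recent AWS CLI with the `pcs` commands is available; `create_queue` is an illustrative name, and the compute node group ID is the ID PCS assigned to compute-1, which you need to look up first.

```shell
# Sketch: create a PCS queue backed by one compute node group.
#   $1 = cluster name or ID
#   $2 = compute node group ID of compute-1
#   $3 = optional queue name (default: compute-1)
create_queue() {
    aws pcs create-queue \
        --cluster-identifier "$1" \
        --queue-name "${3:-compute-1}" \
        --compute-node-group-configurations "computeNodeGroupId=$2"
}

# Usage:
#   create_queue <cluster-name> <compute-node-group-id>
```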
10. Connect to login node
Once the login compute node group has reached Active, locate its EC2 instance and connect.
- Find the login instance.
  - In the AWS PCS console, select the cluster from step 3.
  - Go to Compute node groups and select the login group from step 8.2.
  - Copy the Compute node group ID (e.g., cng-abc123def456...).
- Locate the instance in EC2.
  - In the EC2 Console, choose Instances.
  - In the Find instances by attribute or tag (case sensitive) search bar, filter by the PCS tag:
    tag:aws:pcs:compute-node-group-id = <compute-node-group-id>
    There should be exactly one running instance matching the login group's ID.
  - Select the instance and copy its Public IPv4 address.
- Connect. Use either SSH or AWS Systems Manager Session Manager.
Allow time for cluster bootstrap
Wait about 2 minutes after the login node reaches Active before connecting, so cloud-init can finish.
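The lookup and connection steps above can be scripted. The following is a minimal sketch, assuming the AWS CLI is configured for the cluster's region and that the `ubuntu` user from the launch template is the login user; `login_node_ip` is an illustrative helper name and the key path is a placeholder.

```shell
# Sketch: print the public IPv4 address of the running instance that
# belongs to a given PCS compute node group.
#   $1 = compute node group ID copied from the PCS console
login_node_ip() {
    aws ec2 describe-instances \
        --filters "Name=tag:aws:pcs:compute-node-group-id,Values=$1" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].PublicIpAddress' \
        --output text
}

# Usage with SSH (key pair from step 7):
#   ssh -i /path/to/key.pem "ubuntu@$(login_node_ip <compute-node-group-id>)"
# Usage with Session Manager (needs the Session Manager plugin):
#   aws ssm start-session --target <instance-id>
```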
DCV remote desktop (optional)
The ParaTools Pro for E4S™ AMI ships with NICE DCV configured to serve a GPU-accelerated Linux desktop on TCP 8443. The DCV license is granted to the node via the IAM policy from step 5, and inbound access is allowed by the DCV security group from step 2.
- Open the DCV URL. Browse to the login node's public IPv4 (located via the same steps used to SSH in above):
https://<login-public-ipv4>:8443/
The browser warns about a self-signed certificate; accept to continue.
Shortcut: grab the URL from the MOTD
The login node's cloud-init installs a MOTD drop-in that prints the fully-resolved DCV URL on every SSH / Session Manager login, e.g.:
DCV remote desktop: https://54.81.250.30:8443/ (user: ubuntu)
Copy-paste that URL into your browser instead of hunting for the instance IP in the EC2 console.
- Sign in.
  - Username: ubuntu.
  - Password: the value supplied for DcvUbuntuPassword when creating the launch-template stack in step 7.
If DcvUbuntuPassword was left blank, set a password on the login node before connecting by running sudo passwd ubuntu in an SSH or Session Manager session.
Rotate or set the password later
DcvUbuntuPassword is only consumed once during cloud-init on first boot. To change the password later (or to set one when the parameter was left blank), SSH into the login node and run sudo passwd ubuntu.
11. Verify the Slurm environment
Once connected to the login node, confirm Slurm can see the queue and partition you created:
sinfo
sinfo lists the Slurm partitions, their node states, and the compute node groups backing them. You should see the queue from step 9 listed as a partition in the idle~ state (the ~ suffix indicates dynamically-provisioned nodes that are currently powered down).
Compute nodes are automatically terminated after a period of inactivity governed by the cluster's scale-down idle time (scaleDownIdleTimeInSeconds in the PCS Slurm configuration). Set it in step 3 during cluster creation by adjusting the Slurm configuration settings.
The ParaTools Pro for E4S™ AMI ships with a set of MPI/HPC example programs pre-copied into your home directory at ~/examples.
Examples missing from ~/examples?
If ~/examples is empty or missing, first check /opt/demo -- the source copies live there and may not have been propagated to your home directory:
ls /opt/demo
cp -R /opt/demo ~/examples
If neither exists (for instance, on a fresh EFS mount that masked the AMI's /home contents), clone the ParaTools E4S Cloud Examples repository directly from GitHub:
git clone https://github.com/ParaToolsInc/e4s-cloud-examples.git ~/examples
Move into the examples directory:
cd ~/examples
NVIDIA NeMo™ and BioNeMo™ live in a dedicated Python environment
NeMo and BioNeMo are installed in a separate virtual environment to avoid dependency conflicts with other GPU/ML packages. Activate it before running NeMo or BioNeMo workloads (or source it from your sbatch script):
source /usr/local/py-env/nemo/bin/activate
Other Python packages (including vLLM) are available in the default system Python and require no activation.
12.1 Run the mpi-procname example
mpi-procname is a tiny MPI program that prints the rank and hostname of each process. It is a quick sanity check that MPI launches and that EFA is reachable between nodes.
cd ~/examples/mpi-procname
./clean.sh
./compile.sh
sbatch -p compute-1 mpiprocname.sbatch
Because compute nodes in this partition are provisioned on demand, the first sbatch submission will trigger an EC2 launch. Expect a few minutes of delay before the job starts; subsequent jobs on the same warm nodes will start almost immediately.
Monitor the job state with:
squeue
squeue lists the pending and running jobs. While nodes are being provisioned, the state column shows CF (configuring); once the nodes are up, it transitions to R (running), and the job disappears from the list when it completes. For node-level detail, run sinfo -N -l.
Once the job completes, the output file (e.g., slurm-<jobid>.out) will contain one line per MPI rank, showing rank/host placement.
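Submission and monitoring can be combined into one helper when you would rather not re-run squeue by hand. This is a minimal sketch using standard Slurm options (`sbatch --parsable`, `squeue -h -j`); the function name `submit_and_wait` and the `POLL_SECONDS` variable are illustrative, and the slurm-<jobid>.out name assumes the sbatch script does not set its own output file.

```shell
# Sketch: submit a batch script, then block until the job has left the queue.
# All arguments are passed straight through to sbatch.
submit_and_wait() {
    job_id=$(sbatch --parsable "$@") || return 1
    job_id=${job_id%%;*}              # --parsable may append ";<cluster>"
    echo "submitted job $job_id"
    # squeue prints a state (PENDING, CONFIGURING, RUNNING, ...) while the
    # job is queued or running, and nothing once it has finished.
    while [ -n "$(squeue -h -j "$job_id" -o %T 2>/dev/null)" ]; do
        sleep "${POLL_SECONDS:-15}"
    done
    echo "job $job_id left the queue; check slurm-$job_id.out"
}

# Usage:
#   submit_and_wait -p compute-1 mpiprocname.sbatch
```

For a single job, `sbatch --wait` gives similar blocking behavior without a helper.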
12.2 Run the OSU Micro-Benchmarks
The OSU Micro-Benchmarks measure point-to-point MPI performance over EFA. The latency, bw (bandwidth), and bibw (bi-directional bandwidth) benchmarks are pre-built in the image and driven by the sbatch scripts in osu-benchmarks/:
cd ~/examples/osu-benchmarks
./clean.sh
sbatch -p compute-1 latency.sbatch
sbatch -p compute-1 bw.sbatch
sbatch -p compute-1 bibw.sbatch
Since the compute nodes were warmed up by the mpi-procname run, these three jobs should start back-to-back without further provisioning delay. Track them with squeue as before.
Each job writes to its own log file (osu-latency.log, osu-bw.log, osu-bibw.log) in the current directory.
13. Shut nodes down
To stop incurring EC2 charges, tear down the queue and node groups. The cluster, VPC, and CloudFormation stacks can be kept around for future use.
In the AWS PCS console, select the cluster created in step 3 and, in order:
- Delete the queue created in step 9 (Queues → select queue → Delete).
- Delete the login node group from step 8.2 (Compute node groups → select group → Delete).
- Delete the compute-1 node group from step 8.1 (Compute node groups → select group → Delete).
Deletion order matters
The queue must be deleted before its attached compute node group, otherwise the node group delete will fail.
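The deletion order can be enforced in a small script. The following is a hedged sketch, assuming a recent AWS CLI with the `pcs` commands and assuming that `aws pcs get-queue` starts failing once the queue is fully deleted; the function name and `POLL_SECONDS` are illustrative.

```shell
# Sketch: delete a queue first, then the node groups that backed it.
#   $1 = cluster name or ID, $2 = queue name or ID,
#   remaining arguments = compute node group names or IDs
teardown_pcs_compute() {
    cluster="$1"; queue="$2"; shift 2
    aws pcs delete-queue \
        --cluster-identifier "$cluster" \
        --queue-identifier "$queue" || return 1
    # Assumption: get-queue returns an error once the queue no longer exists.
    while aws pcs get-queue --cluster-identifier "$cluster" \
            --queue-identifier "$queue" >/dev/null 2>&1; do
        sleep "${POLL_SECONDS:-15}"
    done
    for group in "$@"; do
        aws pcs delete-compute-node-group \
            --cluster-identifier "$cluster" \
            --compute-node-group-identifier "$group" || return 1
    done
}

# Usage:
#   teardown_pcs_compute <cluster-name> compute-1 login compute-1
```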