
ParaTools Pro for E4S™ Getting Started with AWS Parallel Computing Service

General Background Information

In this tutorial we will show you how to set up and launch an HPC cluster using AWS Parallel Computing Service (PCS). You will use the AWS CLI and the AWS console to create a cluster from a set of .yaml templates that describe the stack and serve as inputs to AWS CloudFormation. You will then launch a GPU-accelerated head node that can spawn EC2 compute node instances connected by EFA networking.

For the purposes of this tutorial, we make the following assumptions:

Tutorial

This tutorial follows the official AWS PCS Getting Started guide, with a few minor changes. If anything is unclear, please check the official guide for more detail.

1. Create VPC and Subnets

You can skip this step by reusing previously created resources

If you have already created the VPC and subnets, you can reuse them and skip this step. Use this link to search for VPC stacks in us-east-1 that contain the text "PTPro".

To create a new stack for the cluster's VPC and Subnets using the CloudFormation console, please use the following template:

0-pcs-cluster-cloudformation-vpc-and-subnets.yaml

Show template contents (click to expand)
# source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/net/hpc_large_scale/assets/main.yaml

AWSTemplateFormatVersion: '2010-09-09'
Description: HPC-scale VPC with Multi-AZ Architecture.
  This template creates a highly available VPC infrastructure optimized for HPC workloads across multiple Availability Zones.
  It provisions both public and private subnets in two or optionally three AZs, with each subnet configured for 4096 IP addresses.
  The template sets up NAT Gateways and Internet Gateway for secure outbound connectivity from private subnets.
  VPC Flow Logs are enabled and directed to CloudWatch for comprehensive network traffic monitoring.
  An S3 VPC Endpoint is configured to allow private subnet resources to access S3 without traversing the internet.
  A VPC-wide security group is created to enable communication between resources within the VPC.
  Use this template as a foundation for building scalable, secure networking infrastructure for HPC workloads.
  Refer to the Outputs tab of the deployed stack for important resource identifiers including VPC ID, subnet IDs, security group ID, and internet gateway ID.

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: VPC
        Parameters:
          - CidrBlock
      - Label:
          default: Subnets A
        Parameters:
          - CidrPublicSubnetA
          - CidrPrivateSubnetA
      - Label:
          default: Subnets B
        Parameters:
          - CidrPublicSubnetB
          - CidrPrivateSubnetB
      - Label:
          default: Subnets C
        Parameters:
          - ProvisionSubnetsC
          - CidrPublicSubnetC
          - CidrPrivateSubnetC

Parameters:
  CidrBlock:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.0.0/16
    Description: VPC CIDR Block (eg 10.3.0.0/16)
    Type: String
  CidrPublicSubnetA:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.0.0/20
    Description: VPC CIDR Block for the Public Subnet A
    Type: String
  CidrPublicSubnetB:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.16.0/20
    Description: VPC CIDR Block for the Public Subnet B
    Type: String
  CidrPublicSubnetC:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.32.0/20
    Description: VPC CIDR Block for the Public Subnet C
    Type: String
  CidrPrivateSubnetA:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.128.0/20
    Description: VPC CIDR Block for the Private Subnet A
    Type: String
  CidrPrivateSubnetB:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.144.0/20
    Description: VPC CIDR Block for the Private Subnet B
    Type: String
  CidrPrivateSubnetC:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.160.0/20
    Description: VPC CIDR Block for the Private Subnet C
    Type: String
  ProvisionSubnetsC:
    Type: String
    Description: Provision optional 3rd set of subnets
    Default: "True"
    AllowedValues:
         - "True"
         - "False"

Mappings: 
  RegionMap: 
    us-east-1:
      ZoneId1: use1-az6
      ZoneId2: use1-az4
      ZoneId3: use1-az5
    us-east-2:
      ZoneId1: use2-az2
      ZoneId2: use2-az3
      ZoneId3: use2-az1
    us-west-1:
      ZoneId1: usw1-az1
      ZoneId2: usw1-az3
      ZoneId3: usw1-az2
    us-west-2:
      ZoneId1: usw2-az1
      ZoneId2: usw2-az2
      ZoneId3: usw2-az3
    eu-central-1:
      ZoneId1: euc1-az3
      ZoneId2: euc1-az2
      ZoneId3: euc1-az1
    eu-west-1:
      ZoneId1: euw1-az1
      ZoneId2: euw1-az2
      ZoneId3: euw1-az3
    eu-west-2:
      ZoneId1: euw2-az2
      ZoneId2: euw2-az3
      ZoneId3: euw2-az1
    eu-west-3:
      ZoneId1: euw3-az1
      ZoneId2: euw3-az2
      ZoneId3: euw3-az3
    eu-north-1:
      ZoneId1: eun1-az2
      ZoneId2: eun1-az1
      ZoneId3: eun1-az3
    ca-central-1:
      ZoneId1: cac1-az2
      ZoneId2: cac1-az1
      ZoneId3: cac1-az3
    eu-south-1:
      ZoneId1: eus1-az2
      ZoneId2: eus1-az1
      ZoneId3: eus1-az3
    ap-east-1:
      ZoneId1: ape1-az3
      ZoneId2: ape1-az2
      ZoneId3: ape1-az1
    ap-northeast-1:
      ZoneId1: apne1-az4
      ZoneId2: apne1-az1
      ZoneId3: apne1-az2
    ap-northeast-2:
      ZoneId1: apne2-az1
      ZoneId2: apne2-az3
      ZoneId3: apne2-az2
    ap-south-1:
      ZoneId1: aps1-az2
      ZoneId2: aps1-az3
      ZoneId3: aps1-az1
    ap-southeast-1:
      ZoneId1: apse1-az1
      ZoneId2: apse1-az2
      ZoneId3: apse1-az3
    ap-southeast-2:
      ZoneId1: apse2-az3
      ZoneId2: apse2-az1
      ZoneId3: apse2-az2
    us-gov-west-1:
      ZoneId1: usgw1-az2
      ZoneId2: usgw1-az1
      ZoneId3: usgw1-az3
    us-gov-east-1:
      ZoneId1: usge1-az3
      ZoneId2: usge1-az2
      ZoneId3: usge1-az1
    ap-northeast-3:
      ZoneId1: apne3-az3
      ZoneId2: apne3-az2
      ZoneId3: apne3-az1
    sa-east-1:
      ZoneId1: sae1-az3
      ZoneId2: sae1-az2
      ZoneId3: sae1-az1
    af-south-1:
      ZoneId1: afs1-az3
      ZoneId2: afs1-az2
      ZoneId3: afs1-az1
    ap-south-2:
      ZoneId1: aps2-az3
      ZoneId2: aps2-az2
      ZoneId3: aps2-az1
    ap-southeast-3:
      ZoneId1: apse3-az3
      ZoneId2: apse3-az2
      ZoneId3: apse3-az1
    ap-southeast-4:
      ZoneId1: apse4-az3
      ZoneId2: apse4-az2
      ZoneId3: apse4-az1
    ca-west-1:
      ZoneId1: caw1-az3
      ZoneId2: caw1-az2
      ZoneId3: caw1-az1
    eu-central-2:
      ZoneId1: euc2-az3
      ZoneId2: euc2-az2
      ZoneId3: euc2-az1
    eu-south-2:
      ZoneId1: eus2-az3
      ZoneId2: eus2-az2
      ZoneId3: eus2-az1
    il-central-1:
      ZoneId1: ilc1-az3
      ZoneId2: ilc1-az2
      ZoneId3: ilc1-az1
    me-central-1:
      ZoneId1: mec1-az3
      ZoneId2: mec1-az2
      ZoneId3: mec1-az1

Conditions:
     DoProvisionSubnetsC: !Equals [!Ref ProvisionSubnetsC, "True"]

Resources:

  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: !Ref CidrBlock
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: "Name"
          Value: !Sub '${AWS::StackName}:Large-Scale-HPC'

  VPCFlowLog:
    Type: AWS::EC2::FlowLog
    Properties:
      ResourceId: !Ref VPC
      ResourceType: VPC
      TrafficType: ALL
      LogDestinationType: cloud-watch-logs
      LogGroupName: !Sub '${AWS::StackName}-VPCFlowLogs'
      DeliverLogsPermissionArn: !GetAtt FlowLogRole.Arn

  FlowLogRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - vpc-flow-logs.amazonaws.com
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - !Ref AWS::NoValue
      Policies:
        - PolicyName: FlowLogPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "logs:CreateLogGroup"
                  - "logs:CreateLogStream"
                  - "logs:PutLogEvents"
                  - "logs:DescribeLogGroups"
                  - "logs:DescribeLogStreams"
                Resource: !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:${AWS::StackName}-VPCFlowLogs:*"

  PublicSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetA
      AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PublicSubnetA-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName

  PublicSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetB
      AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PublicSubnetB-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName

  PublicSubnetC:
    Type: AWS::EC2::Subnet
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetC
      AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PublicSubnetC-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName

  InternetGateway:
    Type: AWS::EC2::InternetGateway

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
      - Key: Name
        Value: !Sub '${AWS::StackName}:PublicRoute'
  PublicRoute1:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicSubnetARouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetA
      RouteTableId: !Ref PublicRouteTable

  PublicSubnetBRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetB
      RouteTableId: !Ref PublicRouteTable

  PublicSubnetCRouteTableAssociation:
    Condition: DoProvisionSubnetsC
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetC
      RouteTableId: !Ref PublicRouteTable

  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetA
      MapPublicIpOnLaunch: false
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PrivateSubnetA-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName

  PrivateSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetB
      MapPublicIpOnLaunch: false
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PrivateSubnetB-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName

  PrivateSubnetC:
    Type: AWS::EC2::Subnet
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetC
      MapPublicIpOnLaunch: false
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PrivateSubnetC-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName

  NatGatewayAEIP:
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc

  NatGatewayBEIP:
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc

  NatGatewayCEIP:
    Condition: DoProvisionSubnetsC
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc

  NatGatewayA:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatGatewayAEIP.AllocationId
      SubnetId: !Ref PublicSubnetA

  NatGatewayB:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatGatewayBEIP.AllocationId
      SubnetId: !Ref PublicSubnetB

  NatGatewayC:
    Type: AWS::EC2::NatGateway
    Condition: DoProvisionSubnetsC
    Properties:
      AllocationId: !GetAtt NatGatewayCEIP.AllocationId
      SubnetId: !Ref PublicSubnetC

  PrivateRouteTableA:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteA'

  PrivateRouteTableB:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteB'

  PrivateRouteTableC:
    Type: AWS::EC2::RouteTable
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteC'

  DefaultPrivateRouteA:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableA
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayA

  DefaultPrivateRouteB:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableB
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayB

  DefaultPrivateRouteC:
    Type: AWS::EC2::Route
    Condition: DoProvisionSubnetsC
    Properties:
      RouteTableId: !Ref PrivateRouteTableC
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayC

  PrivateSubnetARouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTableA
      SubnetId: !Ref PrivateSubnetA

  PrivateSubnetBRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTableB
      SubnetId: !Ref PrivateSubnetB

  PrivateSubnetCRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: DoProvisionSubnetsC
    Properties:
      RouteTableId: !Ref PrivateRouteTableC
      SubnetId: !Ref PrivateSubnetC

  AvailabiltyZone1:
    Type: Custom::AvailabiltyZone
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId1]

  AvailabiltyZone2:
    Type: Custom::AvailabiltyZone
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId2]

  AvailabiltyZone3:
    Type: Custom::AvailabiltyZone
    Condition: DoProvisionSubnetsC
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId3]

  LogGroupGetAZLambdaFunction:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      LogGroupName: !Sub /aws/lambda/${GetAZLambdaFunction}
      RetentionInDays: 7

  GetAZLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      Description: GetAZLambdaFunction
      Timeout: 60
      Runtime: python3.12
      Handler: index.handler
      Role: !GetAtt GetAZLambdaRole.Arn
      Code:
        ZipFile: |
          import cfnresponse
          from json import dumps
          from boto3 import client
          EC2 = client('ec2')
          def handler(event, context):
              if event['RequestType'] in ('Create', 'Update'):
                  print(dumps(event, default=str))
                  data = {}
                  try:
                      response = EC2.describe_availability_zones(
                          Filters=[{'Name': 'zone-id', 'Values': [event['ResourceProperties']['ZoneId']]}]
                      )
                      print(dumps(response, default=str))
                      data['ZoneName'] = response['AvailabilityZones'][0]['ZoneName']
                  except Exception as error:
                      cfnresponse.send(event, context, cfnresponse.FAILED, {}, reason=str(error))
                  else:
                      # Send SUCCESS only when the zone lookup succeeded
                      cfnresponse.send(event, context, cfnresponse.SUCCESS, data)
              else:
                  cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
      Tags:
        - Key: Name
          Value: !Sub ${AWS::StackName}GetAZLambdaFunction

  GetAZLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      Path: /
      Description: GetAZLambdaFunction
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - sts:AssumeRole
            Principal:
              Service:
                - !Sub 'lambda.${AWS::URLSuffix}'
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
      Policies:
        - PolicyName: GetAZLambdaFunction
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: ec2
                Effect: Allow
                Action:
                  - ec2:DescribeAvailabilityZones
                Resource:
                  - '*'
      Tags:
        - Key: Name
          Value: !Sub ${AWS::StackName}-GetAZLambdaFunction

  S3Endpoint:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      VpcEndpointType: 'Gateway'
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3'
      RouteTableIds:
        - !Ref PublicRouteTable
        - !Ref PrivateRouteTableA
        - !Ref PrivateRouteTableB
      VpcId: !Ref VPC

  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
        GroupDescription: Allow all traffic from resources in VPC
        VpcId:
          Ref: VPC
        SecurityGroupIngress:
        - IpProtocol: -1
          CidrIp: !Ref CidrBlock
        SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: !Ref CidrBlock

Outputs:
  VPC:
    Value: !Ref VPC
    Description: ID of the VPC
    Export:
      Name: !Sub ${AWS::StackName}-VPC
  PublicSubnets:
    Value: !Join
      - ','
      - - !Ref PublicSubnetA
        - !Ref PublicSubnetB
        - !If
          - DoProvisionSubnetsC
          - !Ref PublicSubnetC
          - !Ref AWS::NoValue
    Description: ID of the public subnets
    Export:
      Name: !Sub ${AWS::StackName}-PublicSubnets
  PrivateSubnets:
    Value: !Join
      - ','
      - - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
        - !If
          - DoProvisionSubnetsC
          - !Ref PrivateSubnetC
          - !Ref AWS::NoValue
    Description: ID of the private subnets
    Export:
      Name: !Sub ${AWS::StackName}-PrivateSubnets
  DefaultPrivateSubnet:
    Description: The ID of a default private subnet
    Value: !Ref PrivateSubnetA
    Export:
      Name: !Sub "${AWS::StackName}-DefaultPrivateSubnet"
  DefaultPublicSubnet:
    Description: The ID of a default public subnet
    Value: !Ref PublicSubnetA
    Export:
      Name: !Sub "${AWS::StackName}-DefaultPublicSubnet"
  InternetGatewayId:
    Description: The ID of the Internet Gateway
    Value: !Ref InternetGateway
    Export:
      Name: !Sub "${AWS::StackName}-InternetGateway"
  SecurityGroup:
    Description: The ID of the local security group
    Value: !Ref SecurityGroup
    Export:
      Name: !Sub "${AWS::StackName}-SecurityGroup"

Give the stack a name, like AWSPCS-PTPro-cluster, and leave the remaining options at their defaults.

Use this AWS CloudFormation quick-create link to provision these resources with default settings.

Under Capabilities, check the box for "I acknowledge that AWS CloudFormation might create IAM resources."

Once you have created this new VPC, find its VPC ID in the Amazon VPC Console: select "VPCs" and search for the name you picked above. If you chose the suggested stack name, search for PTPro; if you are deploying in us-east-1, you can use this link. Make a note of the VPC ID once you have found it.
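
If you prefer the command line, the same stack can be created with the AWS CLI. This is a minimal sketch assuming the template above has been downloaded to the current directory and the suggested stack name is used; adjust the region to match your deployment.

# Create the VPC stack from the downloaded template.
aws cloudformation create-stack \
  --region us-east-1 \
  --stack-name AWSPCS-PTPro-cluster \
  --template-body file://0-pcs-cluster-cloudformation-vpc-and-subnets.yaml \
  --capabilities CAPABILITY_IAM

# Wait for the stack to finish, then read the VPC ID from its outputs.
aws cloudformation wait stack-create-complete --stack-name AWSPCS-PTPro-cluster
aws cloudformation describe-stacks --stack-name AWSPCS-PTPro-cluster \
  --query "Stacks[0].Outputs[?OutputKey=='VPC'].OutputValue" --output text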

2. Create Security Groups

Summary

In this section we will create three security groups:

  • A cluster security group enabling communication between the compute nodes, the login node, and the AWS PCS controller
  • An inbound SSH security group that can optionally be enabled to allow SSH logins to the login node
  • A DCV security group that can optionally be enabled to allow DCV remote desktop connections to the login node

If you have already created these security groups you can reuse them and skip this step.

Using CloudFormation, create a new stack for the security groups using the following template:

1-pcs-cluster-cloudformation-security-groups.yaml

Show template contents (click to expand)
# source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/getting_started/assets/pcs-cluster-sg.yaml

AWSTemplateFormatVersion: 2010-09-09
Description: Security group for AWS PCS clusters.
  This template creates a self-referencing security group that enables communications between AWS PCS controller, compute nodes, and client nodes.
  Optionally, it can also create a security group to enable SSH access to the cluster, and DCV remote desktop access to the login node.
  Check the Outputs tab of this stack for useful details about resources created by this template.

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Network
        Parameters:
          - VpcId
      - Label:
          default: Security group configuration
        Parameters:
          - CreateInboundSshSecurityGroup
          - CreateInboundDcvSecurityGroup
          - ClientIpCidr

Parameters:
  VpcId:
    Description: VPC where the AWS PCS cluster will be deployed
    Type: 'AWS::EC2::VPC::Id'
  ClientIpCidr:
    Description: IP address(es) allowed to connect to nodes using SSH
    Default: '0.0.0.0/0'
    Type: String
    AllowedPattern: (\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})
    ConstraintDescription: Value must be a valid IP or network range of the form x.x.x.x/x.
  CreateInboundSshSecurityGroup:
    Description: Create an inbound security group to allow SSH access to nodes.
    Type: String
    Default: 'True'
    AllowedValues:
      - 'True'
      - 'False'
  CreateInboundDcvSecurityGroup:
    Description: Create an inbound security group to allow DCV access to login nodes on TCP/UDP 8443.
    Type: String
    Default: 'False'
    AllowedValues:
      - 'True'
      - 'False'

Conditions:
  CreateSshSecGroup: !Equals [!Ref CreateInboundSshSecurityGroup, 'True']
  CreateDcvSecGroup: !Equals [!Ref CreateInboundDcvSecurityGroup, 'True']

Resources:

  ClusterSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Supports communications between AWS PCS controller, compute nodes, and client nodes
      VpcId: !Ref VpcId
      GroupName: !Sub 'cluster-${AWS::StackName}'

  ClusterAllowAllInboundFromSelf:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol: '-1'
      SourceSecurityGroupId: !Ref ClusterSecurityGroup

  ClusterAllowAllOutboundToSelf:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol:  '-1'
      DestinationSecurityGroupId: !Ref ClusterSecurityGroup

  # This allows all outbound comms, which enables HTTPS calls and connections to networked storage
  ClusterAllowAllOutboundToWorld:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol: '-1'
      CidrIp: 0.0.0.0/0

  # Attach this to login nodes to enable inbound SSH access.
  InboundSshSecurityGroup:
    Condition: CreateSshSecGroup
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allows inbound SSH access
      GroupName: !Sub 'inbound-ssh-${AWS::StackName}'
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: !Ref ClientIpCidr

  # Attach this to login nodes to enable inbound DCV access on TCP/UDP 8443.
  InboundDcvSecurityGroup:
    Condition: CreateDcvSecGroup
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allows inbound DCV access on TCP/UDP 8443
      GroupName: !Sub 'inbound-dcv-${AWS::StackName}'
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8443
          ToPort: 8443
          CidrIp: !Ref ClientIpCidr
        - IpProtocol: udp
          FromPort: 8443
          ToPort: 8443
          CidrIp: !Ref ClientIpCidr

Outputs:
  ClusterSecurityGroupId:
    Description: Supports communication between PCS controller, compute nodes, and login nodes
    Value: !Ref ClusterSecurityGroup
  InboundSshSecurityGroupId:
    Condition: CreateSshSecGroup
    Description: Enables SSH access to login nodes
    Value: !Ref InboundSshSecurityGroup
  InboundDcvSecurityGroupId:
    Condition: CreateDcvSecGroup
    Description: Enables DCV access to login nodes on TCP/UDP 8443
    Value: !Ref InboundDcvSecurityGroup

Set the following values:

  • Under stack name, use something like AWSPCS-PTPro-sg.
  • Select the VPC ID noted in step 1.
  • Enable SSH, and optionally enable DCV access.
Use a Quick create link

You can use this AWS CloudFormation quick-create link to provision these security groups in us-east-1; however, make sure you change the VPC ID to the one created in step 1.
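
Alternatively, the security group stack can be created from the CLI. A minimal sketch, assuming the template was saved locally and the example stack name is used; replace the VpcId value (a placeholder below) with the VPC ID from step 1.

# Create the security groups stack; the VPC ID below is a placeholder.
aws cloudformation create-stack \
  --region us-east-1 \
  --stack-name AWSPCS-PTPro-sg \
  --template-body file://1-pcs-cluster-cloudformation-security-groups.yaml \
  --parameters \
    ParameterKey=VpcId,ParameterValue=vpc-01234567890abcdef \
    ParameterKey=CreateInboundSshSecurityGroup,ParameterValue=True \
    ParameterKey=CreateInboundDcvSecurityGroup,ParameterValue=False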

3. Create PCS Cluster

If you have already created a cluster in this manner, you can skip this step.

Go to the AWS PCS console and create a new cluster (a CLI sketch follows this list).

  • Under Cluster setup, choose a name like AWSPCS-PTPro-cluster
  • Set the controller size to small.
  • Use the version of Slurm compatible with the ParaTools Pro for E4S™ image. This is usually the latest version available (25.05 as of December 2025).
  • Under Networking:
    • Use the VPC created in step 1 (e.g., AWSPCS-PTPro-cluster...).
    • Use the subnet labeled as PrivateSubnetA created in step 1.
    • Under "Security groups" choose "Select an existing security group"
      • Use the security group cluster-*-sg created in step 2 (e.g., cluster-AWSPCS-PTPro-sg)
  • Click "Create Cluster" to begin creating the cluster.

4. Create shared filesystem using EFS

  • Go to the EFS console and create a new file system.
  • Ensure it is in the same region as the PCS cluster you are setting up.
  • Create the file system (a CLI sketch follows this list):
    • For the name, choose something like AWSPCS-PTPro-fs.
    • Under "Virtual Private Cloud", use the VPC ID created in step 1.
    • Click "Create File System".
    • Note the FS ID.
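
The same file system can be created from the CLI. A sketch with placeholder IDs; the mount target should live in PrivateSubnetA and use a security group that permits NFS (TCP 2049) from the cluster nodes.

# Create the file system and note the FileSystemId in the response.
aws efs create-file-system \
  --region us-east-1 \
  --tags Key=Name,Value=AWSPCS-PTPro-fs

# Create a mount target in PrivateSubnetA (placeholder IDs).
aws efs create-mount-target \
  --file-system-id fs-01234567890abcdef \
  --subnet-id subnet-01234567890abcdef \
  --security-groups sg-01234567890abcdef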

5. Create an Instance Profile

Go to the IAM console. Under Access Management -> Policies, check whether a policy matching the one below already exists (try searching for pcs). If no such policy exists, create a new one, specifying the permissions in the JSON editor as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "pcs:RegisterComputeNodeGroupInstance"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Name the new policy something like AWS-PCS-policy and note the name that you chose.
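
From the CLI, assuming the JSON above has been saved as pcs-register-policy.json (a hypothetical file name), the same policy can be created with:

# The policy name matches the example chosen above.
aws iam create-policy \
  --policy-name AWS-PCS-policy \
  --policy-document file://pcs-register-policy.json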

Additional optional steps to enable DCV remote desktop access

If you plan to access the login node over DCV, you will need to create an additional policy granting access to the DCV license server. If a matching policy exists you can reuse it (try searching for DCV to check). If no such policy exists, create a new one, specifying the permissions with the JSON editor as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::dcv-license.region/us-*"
        }
    ]
}

Replace <region> with the region where your cluster runs (e.g., us-east-1), and give the policy a name like EC2AccessDCVLicenseS3.

Next, in the IAM console, go to Access Management -> Roles and check whether a role starting with AWSPCS- exists with the following policies attached. If not, follow these instructions to create it.

  • Select "Create Role"
  • Select Trusted Entity Type: "AWS Service"
  • Service or use case: "EC2"
  • Use Case: "EC2"
  • Click Next
  • Add permissions
    • Add the policy created earlier in step 5.
    • If planning to use DCV to access the login node, also add the EC2AccessDCVLicenseS3 policy.
    • Add the AmazonSSMManagedInstanceCore policy.
  • Click Next
  • Give the role a name that starts with AWSPCS- (it must start with AWSPCS-).
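
The role and its instance profile can also be created from the CLI. A sketch: the role name, trust policy file, and account ID are placeholders, and the role name must start with AWSPCS-.

# Trust policy letting EC2 instances assume the role.
cat > ec2-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role --role-name AWSPCS-PTPro-role \
  --assume-role-policy-document file://ec2-trust.json
aws iam attach-role-policy --role-name AWSPCS-PTPro-role \
  --policy-arn arn:aws:iam::123456789012:policy/AWS-PCS-policy
aws iam attach-role-policy --role-name AWSPCS-PTPro-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

# The console's EC2 role flow creates an instance profile automatically;
# from the CLI it must be created and the role added explicitly.
aws iam create-instance-profile --instance-profile-name AWSPCS-PTPro-role
aws iam add-role-to-instance-profile \
  --instance-profile-name AWSPCS-PTPro-role --role-name AWSPCS-PTPro-role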

6. Create EFA Placement Group

If such a placement group already exists you may simply reuse it.

In the EC2 console, navigate to Network & Security -> Placement Groups -> "Create placement group" (a CLI equivalent is sketched after this list)

  • Name it something like AWSPCS-PTPro-cluster
  • Set strategy = "cluster"
  • Click "Create group"

7. Create Node Launch Templates

Using CloudFormation, create a new stack for the node launch templates using the following template:

2-pcs-cluster-cloudformation-launch-templates.yaml

Show template contents (click to expand)
# original source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/getting_started/assets/pcs-lt-efs-fsxl.yaml
# has been modified

AWSTemplateFormatVersion: 2010-09-09
Description: EC2 launch templates for AWS PCS login and compute node groups.
  This template creates EC2 launch templates for AWS PCS login and compute node groups. 
  It demonstrates mounting EFS and FSx for Lustre file systems, configuring EC2 instance tags, enabling Instance Metadata Service Version 2 (IMDSv2), and setting up the cluster security group for communication with the AWS PCS controller. 
  Additionally, it shows how to configure inbound SSH access to the login nodes. 
  Use this template as a starting point to create custom launch templates tailored to your specific requirements.
  Check the Outputs tab of this stack for useful details about resources created by this template.

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Security
        Parameters:
          - VpcDefaultSecurityGroupId
          - ClusterSecurityGroupId
          - SshSecurityGroupId
          - EnableDcvAccess
          - DcvSecurityGroupId
          - SshKeyName
      - Label:
          default: Networking
        Parameters:
          - VpcId
          - PlacementGroupName
          - NodeGroupSubnetId
      - Label:
          default: File systems
        Parameters:
          - EfsFilesystemId
          - FSxLustreFilesystemId
          - FSxLustreFilesystemMountName

Parameters:

  VpcId:
    Type: 'AWS::EC2::VPC::Id'
    Description: Cluster VPC where EFA-enabled instances will be launched
  NodeGroupSubnetId:
    Type: AWS::EC2::Subnet::Id
    Description: Subnet within cluster VPC where EFA-enabled instances will be launched
  VpcDefaultSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Cluster VPC 'default' security group. Make sure you choose the one from your cluster VPC!
  ClusterSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for PCS cluster controller and nodes.
  SshSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for SSH into login nodes
  EnableDcvAccess:
    Type: String
    Description: Enable DCV access to login nodes? When set to True, the DcvSecurityGroupId parameter below will be required.
    Default: 'True'
    AllowedValues:
      - 'True'
      - 'False'
  DcvSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for DCV access to login nodes (only used if EnableDcvAccess is True)
  SshKeyName:
    Type: AWS::EC2::KeyPair::KeyName
    Description: SSH key name for access to login nodes
  EfsFilesystemId:
    Type: String
    Description: Amazon EFS file system Id
  FSxLustreFilesystemId:
    Type: String
    Description: Amazon FSx for Lustre file system Id
  FSxLustreFilesystemMountName:
    Type: String
    Description: Amazon FSx for Lustre mount name
  PlacementGroupName:
    Type: String
    Description: Placement group name for compute nodes (leave blank to create a new one)
    Default: "AWSPCS-PTPro-cluster"


Conditions:
  HasDcvAccess: !Equals [!Ref EnableDcvAccess, 'True']

Resources:

  EfaSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: Support EFA
      GroupName: !Sub 'efa-${AWS::StackName}'
      VpcId: !Ref VpcId
  EfaSecurityGroupOutboundSelfRule:
    Type: 'AWS::EC2::SecurityGroupEgress'
    Properties:
      IpProtocol: '-1'
      GroupId: !Ref EfaSecurityGroup
      Description: Allow outbound EFA traffic to SG members
      DestinationSecurityGroupId: !Ref EfaSecurityGroup

  EfaSecurityGroupInboundSelfRule:
    Type: 'AWS::EC2::SecurityGroupIngress'
    Properties:
      IpProtocol: '-1'
      GroupId: !Ref EfaSecurityGroup
      Description: Allow inbound EFA traffic to SG members
      SourceSecurityGroupId: !Ref EfaSecurityGroup

  LoginLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub 'login-${AWS::StackName}'

      LaunchTemplateData:
        TagSpecifications:
          - ResourceType: instance
            Tags:
              - Key: HPCRecipes
                Value: "true"
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 4
          HttpTokens: required
        KeyName: !Ref SshKeyName
        SecurityGroupIds:
          - !Ref ClusterSecurityGroupId
          - !Ref SshSecurityGroupId
          - !If [HasDcvAccess, !Ref DcvSecurityGroupId, !Ref "AWS::NoValue"]
          - !Ref VpcDefaultSecurityGroupId
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

            --==MYBOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"
            MIME-Version: 1.0

            packages:
            - amazon-efs-utils

            runcmd:
            # Mount EFS filesystem as /home
            - mkdir -p /tmp/home
            - rsync -aA /home/ /tmp/home
            - echo "${EfsFilesystemId}:/ /home efs tls,_netdev" >> /etc/fstab
            - mount -a -t efs defaults
            - if [ "enabled" == "$(sestatus | awk '/^SELinux status:/{print $3}')" ]; then setsebool -P use_nfs_home_dirs 1; fi
            - rsync -aA --ignore-existing /tmp/home/ /home
            - rm -rf /tmp/home/
            # If provided, mount FSxL filesystem as /shared
            - if [ ! -z "${FSxLustreFilesystemId}" ]; then amazon-linux-extras install -y lustre=latest; mkdir -p /shared; chmod a+rwx /shared; mount -t lustre ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /shared; chmod 777 /shared; fi

            --==MYBOUNDARY==

  ComputeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub 'compute-${AWS::StackName}'
      LaunchTemplateData:
        TagSpecifications:
          - ResourceType: instance
            Tags:
              - Key: HPCRecipes
                Value: "true"
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 4
          HttpTokens: required
        Placement:
          GroupName: !Ref PlacementGroupName
        NetworkInterfaces:
          - Description: Primary network interface
            DeviceIndex: 0
            InterfaceType: efa
            NetworkCardIndex: 0
            SubnetId: !Ref NodeGroupSubnetId
            Groups:
            - !Ref EfaSecurityGroup
            - !Ref ClusterSecurityGroupId
            - !Ref VpcDefaultSecurityGroupId
        KeyName: !Ref SshKeyName
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

            --==MYBOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"
            MIME-Version: 1.0

            packages:
            - amazon-efs-utils

            runcmd:
            # Mount EFS filesystem as /home
            - mkdir -p /tmp/home
            - rsync -aA /home/ /tmp/home
            - echo "${EfsFilesystemId}:/ /home efs tls,_netdev" >> /etc/fstab
            - mount -a -t efs defaults
            - if [ "enabled" == "$(sestatus | awk '/^SELinux status:/{print $3}')" ]; then setsebool -P use_nfs_home_dirs 1; fi
            - rsync -aA --ignore-existing /tmp/home/ /home
            - rm -rf /tmp/home/
            # If provided, mount FSxL filesystem as /shared
            - if [ ! -z "${FSxLustreFilesystemId}" ]; then amazon-linux-extras install -y lustre=latest; mkdir -p /shared; chmod a+rwx /shared; mount -t lustre ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /shared; fi

            --==MYBOUNDARY==

Outputs:
  LoginLaunchTemplateId:
    Description: "Login nodes template ID"
    Value: !Ref LoginLaunchTemplate
  LoginLaunchTemplateName:
    Description: "Login nodes template name"
    Value: !Sub 'login-${AWS::StackName}'
  ComputeLaunchTemplateId:
    Description: "Compute nodes template ID"
    Value: !Ref ComputeLaunchTemplate
  ComputeLaunchTemplateName:
    Description: "Compute nodes template name"
    Value: !Sub 'compute-${AWS::StackName}'
  EfaSecurityGroupId:
    Description: Security group created to support EFA communications
    Value: !Ref EfaSecurityGroup

Set the following values (the CLI sketch after this list shows how to look most of them up):

  • VpcDefaultSecurityGroupId = value of "default" security group obtained in step 1
  • ClusterSecurityGroupId = get value from output of step 2 key = "ClusterSecurityGroupId"
  • SshSecurityGroupId = get value from output of step 2 key = "InboundSshSecurityGroupId"
  • SshKeyName = pick a key
  • VpcId = get value from output of step 1 key = "VPC"
  • PlacementGroupName = use name chosen in step 6
  • NodeGroupSubnetId = select the subnet labeled with PrivateSubnetA created in step 1
  • EfsFilesystemId = EFS ID of FS created in step 4
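
Most of these values can be read back from the earlier stacks rather than hunted down in the console. A sketch using the example stack names from steps 1 and 2 and a placeholder VPC ID:

# List all outputs of the security groups stack (step 2).
aws cloudformation describe-stacks --stack-name AWSPCS-PTPro-sg \
  --query "Stacks[0].Outputs" --output table

# Find the VPC's automatic 'default' security group by VPC ID.
aws ec2 describe-security-groups \
  --filters Name=vpc-id,Values=vpc-01234567890abcdef \
            Name=group-name,Values=default \
  --query "SecurityGroups[0].GroupId" --output text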

8. Create node groups

In the PCS console, select the cluster created in step 3 and create two node groups, one for the compute nodes and one for the login node (a CLI sketch for the compute group follows this list).

  1. Create a node group for the compute nodes:
     • Compute node groups -> Create compute node group
     • Group name = compute-1
     • EC2 launch template = compute-<name>, where <name> is the stack name chosen in step 7
     • IAM instance profile = the instance profile created in step 5 (its name starts with AWSPCS-)
     • Subnets = PrivateSubnetA from step 1
     • Instance types = g4dn.8xlarge (or another EFA-capable instance type)
     • Min count = 0, max count = 2
     • AMI ID = select a PCS-compatible AMI
  2. Create a node group for the login node:
     • Compute node groups -> Create compute node group
     • Group name = login
     • EC2 launch template = login-<name>, where <name> is the stack name chosen in step 7
     • IAM instance profile = the instance profile created in step 5
     • Subnets = PublicSubnetA from step 1
     • Instance types = g4dn.4xlarge (or another instance type)
     • Min count = 1, max count = 1
     • AMI ID = select a PCS-compatible AMI
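
For reference, the compute node group can also be created from the CLI. This is a hedged sketch: the option names reflect our recollection of the aws pcs CLI and all IDs/ARNs are placeholders, so confirm the exact flags with aws pcs create-compute-node-group help before relying on it.

# Hypothetical IDs/ARNs throughout; option names unverified -- check
# `aws pcs create-compute-node-group help`.
aws pcs create-compute-node-group \
  --cluster-identifier AWSPCS-PTPro-cluster \
  --compute-node-group-name compute-1 \
  --subnet-ids subnet-01234567890abcdef \
  --custom-launch-template id=lt-01234567890abcdef,version=1 \
  --iam-instance-profile-arn arn:aws:iam::123456789012:instance-profile/AWSPCS-PTPro-role \
  --scaling-configuration minInstanceCount=0,maxInstanceCount=2 \
  --instance-configs instanceType=g4dn.8xlarge \
  --ami-id ami-01234567890abcdef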

9. Create queue

In the PCS console, select the cluster created in step 3

  • Queues -> Create queue
  • name = compute-1
  • Add the compute node group created in step 8.1 (a CLI sketch follows this list)
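
The queue can likewise be created from the CLI. As above, this is a sketch with placeholder IDs, and the option names should be verified with aws pcs create-queue help.

# The compute node group ID (cng-...) comes from the group created in step 8.1.
aws pcs create-queue \
  --cluster-identifier AWSPCS-PTPro-cluster \
  --queue-name compute-1 \
  --compute-node-group-configurations computeNodeGroupId=cng-01234567890abcdef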

10. Connect to login node

In the PCS console, select the cluster created in step 3

  • Compute node groups -> select login node group created in step 8.2
  • Copy the "compute node group ID"
  • Go to EC2 console -> Instances
  • In the search bar "Find instances by attribute or tag (case sensitive)" search for the "compute node group ID"
  • Select the resulting instance -- this is the login node
  • Copy "Public IPv4 Address"
  • SSH to that IP, allowing the login node at least 5 minutes to finish preparing itself before you connect (see the sketch after this list)
    • username = "ubuntu" (for our Ubuntu-based images; the username will vary depending on the image type)
    • ssh key = use the key chosen in step 7
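
A connection sketch, with a placeholder key path and a documentation IP address in place of the login node's real public IPv4 address:

# Give the login node ~5 minutes after launch before connecting.
ssh -i ~/.ssh/my-pcs-key.pem ubuntu@203.0.113.10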

11. Run sample job

Once connected to the login node, run sinfo to see Slurm queue information; you should see the queue created in step 9. Submit a job with sbatch -p <queue-name> script.sbatch (a sample script is sketched below). Since compute nodes are launched on demand, the first job submitted to a queue will cause the nodes to be spun up.
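
A minimal script.sbatch sketch; the job name and node counts are illustrative, and the queue name matches the one created in step 9.

cat > script.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Each allocated node prints its hostname.
srun hostname
EOF

# Submit to the queue created in step 9.
sbatch -p compute-1 script.sbatch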

  • squeue will show the job state as CF (configuring) while the nodes are provisioned

Compute nodes will be brought down automatically after a period of inactivity called ScaledownIdletime

  • This can be configured in step 3 during cluster creation by changing the "Slurm configuration" settings.

12. Shut nodes down

In the PCS console, select the cluster created in step 3

  1. Delete the queue by going to "Queues" and deleting the queue created in step 9
  2. Delete the login node group by going to "Compute node groups" and deleting the node group created in step 8.2
  3. Delete the compute node group by going to "Compute node groups" and deleting the node group created in step 8.1