ParaTools Pro for E4S™ Getting Started with AWS Parallel Computing Service

Looking for AWS ParallelCluster (PC)?

This guide covers AWS Parallel Computing Service (PCS), the AWS-managed Slurm service. For the open-source self-managed alternative, see Getting Started with AWS ParallelCluster.

General Background Information

This tutorial configures AWS Parallel Computing Service (PCS) with the matching ParaTools Pro for E4S™ on AWS PCS AMI from the AWS Marketplace:

Architecture        AWS Marketplace product
x86_64              ParaTools Pro for E4S™ on AWS PCS (x86)
arm64 (Graviton)    ParaTools Pro for E4S™ on AWS PCS (ARM64)

This tutorial uses command-line tools, the AWS CLI, and the AWS console to create a cluster. The workflow relies on several .yaml files that describe each stack and serve as inputs to AWS CloudFormation. The result is a GPU-accelerated head node that can spawn EC2 compute node instances linked with EFA networking.

This tutorial assumes that you have already created an AWS account and are signed in as an administrative user.

Tutorial

For additional context, see the official AWS PCS Getting Started guide. This tutorial follows the official guide with a few minor changes; refer to it if anything is unclear.

1. Create VPC and Subnets

It is possible to reuse an existing VPC and subnets

If a compatible VPC and subnets already exist, skip this step and use them in place of the VpcId, PrivateSubnetA, and PublicSubnetA references in later steps. Search for existing PTPro VPC stacks in us-east-1 with this link.

Create a new stack for the cluster's VPC and subnets using the CloudFormation console with the following template:

0-pcs-cluster-cloudformation-vpc-and-subnets.yaml

# source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/net/hpc_large_scale/assets/main.yaml

AWSTemplateFormatVersion: '2010-09-09'
Description: HPC-scale VPC with Multi-AZ Architecture.
  This template creates a highly available VPC infrastructure optimized for HPC workloads across multiple Availability Zones.
  It provisions both public and private subnets in two or optionally three AZs, with each subnet configured for 4096 IP addresses.
  The template sets up NAT Gateways and Internet Gateway for secure outbound connectivity from private subnets.
  VPC Flow Logs are enabled and directed to CloudWatch for comprehensive network traffic monitoring.
  An S3 VPC Endpoint is configured to allow private subnet resources to access S3 without traversing the internet.
  A VPC-wide security group is created to enable communication between resources within the VPC.
  Use this template as a foundation for building scalable, secure networking infrastructure for HPC workloads.
  Refer to the Outputs tab of the deployed stack for important resource identifiers including VPC ID, subnet IDs, security group ID, and internet gateway ID.

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: VPC
        Parameters:
          - CidrBlock
      - Label:
          default: Subnets A
        Parameters:
          - CidrPublicSubnetA
          - CidrPrivateSubnetA
      - Label:
          default: Subnets B
        Parameters:
          - CidrPublicSubnetB
          - CidrPrivateSubnetB
      - Label:
          default: Subnets C
        Parameters:
          - ProvisionSubnetsC
          - CidrPublicSubnetC
          - CidrPrivateSubnetC

Parameters:
  CidrBlock:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.0.0/16
    Description: VPC CIDR Block (eg 10.3.0.0/16)
    Type: String
  CidrPublicSubnetA:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.0.0/20
    Description: VPC CIDR Block for the Public Subnet A
    Type: String
  CidrPublicSubnetB:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.16.0/20
    Description: VPC CIDR Block for the Public Subnet B
    Type: String
  CidrPublicSubnetC:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.32.0/20
    Description: VPC CIDR Block for the Public Subnet C
    Type: String
  CidrPrivateSubnetA:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.128.0/20
    Description: VPC CIDR Block for the Private Subnet A
    Type: String
  CidrPrivateSubnetB:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.144.0/20
    Description: VPC CIDR Block for the Private Subnet B
    Type: String
  CidrPrivateSubnetC:
    AllowedPattern: '((\d{1,3})\.){3}\d{1,3}/\d{1,2}'
    Default: 10.3.160.0/20
    Description: VPC CIDR Block for the Private Subnet C
    Type: String
  ProvisionSubnetsC:
    Type: String
    Description: Provision optional 3rd set of subnets
    Default: "True"
    AllowedValues:
         - "True"
         - "False"

Mappings: 
  RegionMap: 
    us-east-1:
      ZoneId1: use1-az6
      ZoneId2: use1-az4
      ZoneId3: use1-az5
    us-east-2:
      ZoneId1: use2-az2
      ZoneId2: use2-az3
      ZoneId3: use2-az1
    us-west-1:
      ZoneId1: usw1-az1
      ZoneId2: usw1-az3
      ZoneId3: usw1-az2
    us-west-2:
      ZoneId1: usw2-az1
      ZoneId2: usw2-az2
      ZoneId3: usw2-az3
    eu-central-1:
      ZoneId1: euc1-az3
      ZoneId2: euc1-az2
      ZoneId3: euc1-az1
    eu-west-1:
      ZoneId1: euw1-az1
      ZoneId2: euw1-az2
      ZoneId3: euw1-az3
    eu-west-2:
      ZoneId1: euw2-az2
      ZoneId2: euw2-az3
      ZoneId3: euw2-az1
    eu-west-3:
      ZoneId1: euw3-az1
      ZoneId2: euw3-az2
      ZoneId3: euw3-az3
    eu-north-1:
      ZoneId1: eun1-az2
      ZoneId2: eun1-az1
      ZoneId3: eun1-az3
    ca-central-1:
      ZoneId1: cac1-az2
      ZoneId2: cac1-az1
      ZoneId3: cac1-az3
    eu-south-1:
      ZoneId1: eus1-az2
      ZoneId2: eus1-az1
      ZoneId3: eus1-az3
    ap-east-1:
      ZoneId1: ape1-az3
      ZoneId2: ape1-az2
      ZoneId3: ape1-az1
    ap-northeast-1:
      ZoneId1: apne1-az4
      ZoneId2: apne1-az1
      ZoneId3: apne1-az2
    ap-northeast-2:
      ZoneId1: apne2-az1
      ZoneId2: apne2-az3
      ZoneId3: apne2-az2
    ap-south-1:
      ZoneId1: aps1-az2
      ZoneId2: aps1-az3
      ZoneId3: aps1-az1
    ap-southeast-1:
      ZoneId1: apse1-az1
      ZoneId2: apse1-az2
      ZoneId3: apse1-az3
    ap-southeast-2:
      ZoneId1: apse2-az3
      ZoneId2: apse2-az1
      ZoneId3: apse2-az2
    us-gov-west-1:
      ZoneId1: usgw1-az2
      ZoneId2: usgw1-az1
      ZoneId3: usgw1-az3
    us-gov-east-1:
      ZoneId1: usge1-az3
      ZoneId2: usge1-az2
      ZoneId3: usge1-az1
    ap-northeast-3:
      ZoneId1: apne3-az3
      ZoneId2: apne3-az2
      ZoneId3: apne3-az1
    sa-east-1:
      ZoneId1: sae1-az3
      ZoneId2: sae1-az2
      ZoneId3: sae1-az1
    af-south-1:
      ZoneId1: afs1-az3
      ZoneId2: afs1-az2
      ZoneId3: afs1-az1
    ap-south-2:
      ZoneId1: aps2-az3
      ZoneId2: aps2-az2
      ZoneId3: aps2-az1
    ap-southeast-3:
      ZoneId1: apse3-az3
      ZoneId2: apse3-az2
      ZoneId3: apse3-az1
    ap-southeast-4:
      ZoneId1: apse4-az3
      ZoneId2: apse4-az2
      ZoneId3: apse4-az1
    ca-west-1:
      ZoneId1: caw1-az3
      ZoneId2: caw1-az2
      ZoneId3: caw1-az1
    eu-central-2:
      ZoneId1: euc2-az3
      ZoneId2: euc2-az2
      ZoneId3: euc2-az1
    eu-south-2:
      ZoneId1: eus2-az3
      ZoneId2: eus2-az2
      ZoneId3: eus2-az1
    il-central-1:
      ZoneId1: ilc1-az3
      ZoneId2: ilc1-az2
      ZoneId3: ilc1-az1
    me-central-1:
      ZoneId1: mec1-az3
      ZoneId2: mec1-az2
      ZoneId3: mec1-az1

Conditions:
     DoProvisionSubnetsC: !Equals [!Ref ProvisionSubnetsC, "True"]

Resources:

  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: !Ref CidrBlock
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: "Name"
          Value: !Sub '${AWS::StackName}:Large-Scale-HPC'

  VPCFlowLog:
    Type: AWS::EC2::FlowLog
    Properties:
      ResourceId: !Ref VPC
      ResourceType: VPC
      TrafficType: ALL
      LogDestinationType: cloud-watch-logs
      LogGroupName: !Sub '${AWS::StackName}-VPCFlowLogs'
      DeliverLogsPermissionArn: !GetAtt FlowLogRole.Arn

  FlowLogRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - vpc-flow-logs.amazonaws.com
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - !Ref AWS::NoValue
      Policies:
        - PolicyName: FlowLogPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "logs:CreateLogGroup"
                  - "logs:CreateLogStream"
                  - "logs:PutLogEvents"
                  - "logs:DescribeLogGroups"
                  - "logs:DescribeLogStreams"
                Resource: !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:${AWS::StackName}-VPCFlowLogs:*"

  PublicSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetA
      AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PublicSubnetA-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName

  PublicSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetB
      AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PublicSubnetB-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName

  PublicSubnetC:
    Type: AWS::EC2::Subnet
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      CidrBlock: !Ref CidrPublicSubnetC
      AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
      MapPublicIpOnLaunch: true
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PublicSubnetC-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName

  InternetGateway:
    Type: AWS::EC2::InternetGateway

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
      - Key: Name
        Value: !Sub '${AWS::StackName}:PublicRoute'
  PublicRoute1:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicSubnetARouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetA
      RouteTableId: !Ref PublicRouteTable

  PublicSubnetBRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetB
      RouteTableId: !Ref PublicRouteTable

  PublicSubnetCRouteTableAssociation:
    Condition: DoProvisionSubnetsC
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnetC
      RouteTableId: !Ref PublicRouteTable

  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetA
      MapPublicIpOnLaunch: false
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PrivateSubnetA-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone1.ZoneName

  PrivateSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetB
      MapPublicIpOnLaunch: false
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PrivateSubnetB-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone2.ZoneName

  PrivateSubnetC:
    Type: AWS::EC2::Subnet
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName
      CidrBlock: !Ref CidrPrivateSubnetC
      MapPublicIpOnLaunch: false
      Tags:
      - Key: Name
        Value: !Sub
          - '${StackName}:PrivateSubnetC-${AvailabilityZone}'
          - StackName: !Ref AWS::StackName
            AvailabilityZone: !GetAtt AvailabiltyZone3.ZoneName

  NatGatewayAEIP:
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc

  NatGatewayBEIP:
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc

  NatGatewayCEIP:
    Condition: DoProvisionSubnetsC
    Type: AWS::EC2::EIP
    DependsOn: AttachGateway
    Properties:
      Domain: vpc

  NatGatewayA:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatGatewayAEIP.AllocationId
      SubnetId: !Ref PublicSubnetA

  NatGatewayB:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatGatewayBEIP.AllocationId
      SubnetId: !Ref PublicSubnetB

  NatGatewayC:
    Type: AWS::EC2::NatGateway
    Condition: DoProvisionSubnetsC
    Properties:
      AllocationId: !GetAtt NatGatewayCEIP.AllocationId
      SubnetId: !Ref PublicSubnetC

  PrivateRouteTableA:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteA'

  PrivateRouteTableB:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteB'

  PrivateRouteTableC:
    Type: AWS::EC2::RouteTable
    Condition: DoProvisionSubnetsC
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}:PrivateRouteC'

  DefaultPrivateRouteA:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableA
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayA

  DefaultPrivateRouteB:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTableB
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayB

  DefaultPrivateRouteC:
    Type: AWS::EC2::Route
    Condition: DoProvisionSubnetsC
    Properties:
      RouteTableId: !Ref PrivateRouteTableC
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGatewayC

  PrivateSubnetARouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTableA
      SubnetId: !Ref PrivateSubnetA

  PrivateSubnetBRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTableB
      SubnetId: !Ref PrivateSubnetB

  PrivateSubnetCRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: DoProvisionSubnetsC
    Properties:
      RouteTableId: !Ref PrivateRouteTableC
      SubnetId: !Ref PrivateSubnetC

  AvailabiltyZone1:
    Type: Custom::AvailabiltyZone
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId1]

  AvailabiltyZone2:
    Type: Custom::AvailabiltyZone
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId2]

  AvailabiltyZone3:
    Type: Custom::AvailabiltyZone
    Condition: DoProvisionSubnetsC
    DependsOn: LogGroupGetAZLambdaFunction
    Properties:
      ServiceToken: !GetAtt GetAZLambdaFunction.Arn
      ZoneId: !FindInMap [RegionMap, !Ref "AWS::Region", ZoneId3]

  LogGroupGetAZLambdaFunction:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      LogGroupName: !Sub /aws/lambda/${GetAZLambdaFunction}
      RetentionInDays: 7

  GetAZLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      Description: GetAZLambdaFunction
      Timeout: 60
      Runtime: python3.12
      Handler: index.handler
      Role: !GetAtt GetAZLambdaRole.Arn
      Code:
        ZipFile: |
          import cfnresponse
          from json import dumps
          from boto3 import client
          EC2 = client('ec2')
          def handler(event, context):
              if event['RequestType'] in ('Create', 'Update'):
                  print(dumps(event, default=str))
                  data = {}
                  try:
                      response = EC2.describe_availability_zones(
                          Filters=[{'Name': 'zone-id', 'Values': [event['ResourceProperties']['ZoneId']]}]
                      )
                      print(dumps(response, default=str))
                      data['ZoneName'] = response['AvailabilityZones'][0]['ZoneName']
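                  except Exception as error:
                      # Report the failure once; str() keeps the reason JSON-serializable.
                      cfnresponse.send(event, context, cfnresponse.FAILED, {}, reason=str(error))
                  else:
                      # Only signal success when the lookup did not raise.
                      cfnresponse.send(event, context, cfnresponse.SUCCESS, data)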
              else:
                  cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
      Tags:
        - Key: Name
          Value: !Sub ${AWS::StackName}GetAZLambdaFunction

  GetAZLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      Path: /
      Description: GetAZLambdaFunction
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - sts:AssumeRole
            Principal:
              Service:
                - !Sub 'lambda.${AWS::URLSuffix}'
      ManagedPolicyArns:
        - !Sub 'arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
      Policies:
        - PolicyName: GetAZLambdaFunction
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: ec2
                Effect: Allow
                Action:
                  - ec2:DescribeAvailabilityZones
                Resource:
                  - '*'
      Tags:
        - Key: Name
          Value: !Sub ${AWS::StackName}-GetAZLambdaFunction

  S3Endpoint:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      VpcEndpointType: 'Gateway'
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3'
      RouteTableIds:
        - !Ref PublicRouteTable
        - !Ref PrivateRouteTableA
        - !Ref PrivateRouteTableB
      VpcId: !Ref VPC

  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
        GroupDescription: Allow all traffic from resources in VPC
        VpcId:
          Ref: VPC
        SecurityGroupIngress:
        - IpProtocol: -1
          CidrIp: !Ref CidrBlock
        SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: !Ref CidrBlock

Outputs:
  VPC:
    Value: !Ref VPC
    Description: ID of the VPC
    Export:
      Name: !Sub ${AWS::StackName}-VPC
  PublicSubnets:
    Value: !Join
      - ','
      - - !Ref PublicSubnetA
        - !Ref PublicSubnetB
        - !If
          - DoProvisionSubnetsC
          - !Ref PublicSubnetC
          - !Ref AWS::NoValue
    Description: ID of the public subnets
    Export:
      Name: !Sub ${AWS::StackName}-PublicSubnets
  PrivateSubnets:
    Value: !Join
      - ','
      - - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
        - !If
          - DoProvisionSubnetsC
          - !Ref PrivateSubnetC
          - !Ref AWS::NoValue
    Description: ID of the private subnets
    Export:
      Name: !Sub ${AWS::StackName}-PrivateSubnets
  DefaultPrivateSubnet:
    Description: The ID of a default private subnet
    Value: !Ref PrivateSubnetA
    Export:
      Name: !Sub "${AWS::StackName}-DefaultPrivateSubnet"
  DefaultPublicSubnet:
    Description: The ID of a default public subnet
    Value: !Ref PublicSubnetA
    Export:
      Name: !Sub "${AWS::StackName}-DefaultPublicSubnet"
  InternetGatewayId:
    Description: The ID of the Internet Gateway
    Value: !Ref InternetGateway
    Export:
      Name: !Sub "${AWS::StackName}-InternetGateway"
  SecurityGroup:
    Description: The ID of the local security group
    Value: !Ref SecurityGroup
    Export:
      Name: !Sub "${AWS::StackName}-SecurityGroup"

Give the stack a name like AWSPCS-PTPro-cluster and leave the options at their defaults.
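Leaving the options at their defaults produces the CIDR layout listed in the template's Parameters section. As a quick sanity check, the sketch below uses Python's standard ipaddress module to confirm that each default /20 subnet sits inside the 10.3.0.0/16 VPC block and that no two subnets overlap. The helper name check_layout is hypothetical and used only for illustration; the CIDR values are copied from the template.

```python
import ipaddress
from itertools import combinations

# Default values copied from the template's Parameters section.
VPC_CIDR = "10.3.0.0/16"
SUBNET_CIDRS = {
    "PublicSubnetA": "10.3.0.0/20",
    "PublicSubnetB": "10.3.16.0/20",
    "PublicSubnetC": "10.3.32.0/20",
    "PrivateSubnetA": "10.3.128.0/20",
    "PrivateSubnetB": "10.3.144.0/20",
    "PrivateSubnetC": "10.3.160.0/20",
}


def check_layout(vpc_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    # Every subnet must fall inside the VPC block.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside the VPC block {vpc}")
    # No pair of subnets may share addresses.
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return problems


if __name__ == "__main__":
    for name, cidr in SUBNET_CIDRS.items():
        print(name, cidr, ipaddress.ip_network(cidr).num_addresses, "addresses")
    print(check_layout(VPC_CIDR, SUBNET_CIDRS) or "layout OK")
```

If any of the CIDR parameters are changed from their defaults, re-running the check with the new values can catch overlaps before the stack is created.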

Use this AWS CloudFormation quick-create link to provision these resources with default settings.

Under Capabilities, check the box for I acknowledge that AWS CloudFormation might create IAM resources.

After the VPC is created, find its ID in the Amazon VPC Console by selecting VPCs and searching for the stack name. If the suggested stack name was used, search for PTPro. For deployments in us-east-1, use this link. Note the VPC ID for use in later steps.
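As an alternative to searching the console, the IDs can be read from the stack's Outputs section, which the template above populates with keys such as VPC, DefaultPrivateSubnet, and DefaultPublicSubnet. The following is a minimal sketch, assuming boto3 is installed, AWS credentials are configured, and the stack uses the suggested name AWSPCS-PTPro-cluster. The helper names outputs_to_dict and fetch_stack_outputs are hypothetical.

```python
def outputs_to_dict(stack):
    """Flatten one stack entry of a DescribeStacks response into {OutputKey: OutputValue}."""
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def fetch_stack_outputs(stack_name, region="us-east-1"):
    """Query CloudFormation for a stack's outputs (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helper above works without boto3

    cfn = boto3.client("cloudformation", region_name=region)
    response = cfn.describe_stacks(StackName=stack_name)
    return outputs_to_dict(response["Stacks"][0])


# Example (requires credentials), using the stack name suggested in this step:
#   outputs = fetch_stack_outputs("AWSPCS-PTPro-cluster")
#   print("VPC ID:", outputs["VPC"])
#   print("PrivateSubnetA:", outputs["DefaultPrivateSubnet"])
#   print("PublicSubnetA:", outputs["DefaultPublicSubnet"])
```

Reading the outputs avoids depending on name searches, since DefaultPrivateSubnet and DefaultPublicSubnet in the template resolve directly to PrivateSubnetA and PublicSubnetA.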

2. Create Security Groups

Summary

In this section, you will create three security groups:

  • A cluster security group enabling communication between the compute nodes, login node, and AWS PCS controller.
  • An inbound SSH group that can optionally be enabled to allow SSH logins on the login node.
  • A DCV group that can optionally be enabled to allow DCV remote desktop connections to the login node.

It is possible to reuse existing security groups

If compatible security groups already exist, skip this step and substitute their IDs for the cluster-*-sg, InboundSshSecurityGroupId, and InboundDcvSecurityGroupId references in later steps.

Using CloudFormation, create a new stack for the security groups with the following template:

1-pcs-cluster-cloudformation-security-groups.yaml

# source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/getting_started/assets/pcs-cluster-sg.yaml

AWSTemplateFormatVersion: 2010-09-09
Description: Security group for AWS PCS clusters.
  This template creates a self-referencing security group that enables communications between AWS PCS controller, compute nodes, and client nodes.
  Optionally, it can also create a security group to enable SSH access to the cluster, and DCV remote desktop access to the login node.
  Check the Outputs tab of this stack for useful details about resources created by this template.

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Network
        Parameters:
          - VpcId
      - Label:
          default: Security group configuration
        Parameters:
          - CreateInboundSshSecurityGroup
          - CreateInboundDcvSecurityGroup
          - ClientIpCidr

Parameters:
  VpcId:
    Description: VPC where the AWS PCS cluster will be deployed
    Type: 'AWS::EC2::VPC::Id'
  ClientIpCidr:
    Description: IP address(s) allowed to connect to nodes using SSH or DCV.
    Default: '0.0.0.0/0'
    Type: String
    AllowedPattern: (\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})
    ConstraintDescription: Value must be a valid IP or network range of the form x.x.x.x/x.
  CreateInboundSshSecurityGroup:
    Description: Create an inbound security group to allow SSH access to nodes.
    Type: String
    Default: 'True'
    AllowedValues:
      - 'True'
      - 'False'
  CreateInboundDcvSecurityGroup:
    Description: Create an inbound security group to allow DCV access to login nodes on TCP/UDP 8443.
    Type: String
    Default: 'False'
    AllowedValues:
      - 'True'
      - 'False'

Conditions:
  CreateSshSecGroup: !Equals [!Ref CreateInboundSshSecurityGroup, 'True']
  CreateDcvSecGroup: !Equals [!Ref CreateInboundDcvSecurityGroup, 'True']

Resources:

  ClusterSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Supports communications between AWS PCS controller, compute nodes, and client nodes
      VpcId: !Ref VpcId
      GroupName: !Sub 'cluster-${AWS::StackName}'

  ClusterAllowAllInboundFromSelf:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol: '-1'
      SourceSecurityGroupId: !Ref ClusterSecurityGroup

  ClusterAllowAllOutboundToSelf:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol:  '-1'
      DestinationSecurityGroupId: !Ref ClusterSecurityGroup

  # This allows all outbound comms, which enables HTTPS calls and connections to networked storage
  ClusterAllowAllOutboundToWorld:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      GroupId: !Ref ClusterSecurityGroup
      IpProtocol: '-1'
      CidrIp: 0.0.0.0/0

  # Attach this to login nodes to enable inbound SSH access.
  InboundSshSecurityGroup:
    Condition: CreateSshSecGroup
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allows inbound SSH access
      GroupName: !Sub 'inbound-ssh-${AWS::StackName}'
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: !Ref ClientIpCidr

  # Attach this to login nodes to enable inbound DCV access on TCP/UDP 8443.
  InboundDcvSecurityGroup:
    Condition: CreateDcvSecGroup
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allows inbound DCV access on TCP/UDP 8443
      GroupName: !Sub 'inbound-dcv-${AWS::StackName}'
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8443
          ToPort: 8443
          CidrIp: !Ref ClientIpCidr
        - IpProtocol: udp
          FromPort: 8443
          ToPort: 8443
          CidrIp: !Ref ClientIpCidr

Outputs:
  ClusterSecurityGroupId:
    Description: Supports communication between PCS controller, compute nodes, and login nodes
    Value: !Ref ClusterSecurityGroup
  InboundSshSecurityGroupId:
    Condition: CreateSshSecGroup
    Description: Enables SSH access to login nodes
    Value: !Ref InboundSshSecurityGroup
  InboundDcvSecurityGroupId:
    Condition: CreateDcvSecGroup
    Description: Enables DCV access to login nodes on TCP/UDP 8443
    Value: !Ref InboundDcvSecurityGroup
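
Configure the stack as follows:

  • Under Stack name, use something like AWSPCS-PTPro-sg.
  • Set VpcId to the VPC ID noted in step 1.
  • Leave CreateInboundSshSecurityGroup set to True to enable SSH, and optionally set CreateInboundDcvSecurityGroup to True to enable DCV access.

Use a Quick create link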

Use this AWS CloudFormation quick-create link to provision these security groups in us-east-1. Change the VPC ID to the one created in step 1.
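The template's ClientIpCidr parameter defaults to 0.0.0.0/0, which allows any address to reach port 22 (and port 8443 if DCV is enabled). The sketch below shows one way to validate a narrower CIDR and assemble the stack parameters in the list-of-dicts shape that the CloudFormation CreateStack API accepts. The function name security_group_parameters is hypothetical, and 203.0.113.7 is a documentation-only example address, not a recommendation.

```python
import ipaddress
import re

# AllowedPattern copied from the ClientIpCidr parameter in the template.
CLIENT_CIDR_PATTERN = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})")


def security_group_parameters(vpc_id, client_cidr="0.0.0.0/0", enable_ssh=True, enable_dcv=False):
    """Build the Parameters list for the security-group stack."""
    if not CLIENT_CIDR_PATTERN.fullmatch(client_cidr):
        raise ValueError(f"{client_cidr!r} does not match the template's AllowedPattern")
    # The pattern alone accepts values such as 999.0.0.0/99, so also parse the
    # CIDR to reject out-of-range octets and prefix lengths.
    ipaddress.ip_network(client_cidr, strict=False)
    return [
        {"ParameterKey": "VpcId", "ParameterValue": vpc_id},
        {"ParameterKey": "ClientIpCidr", "ParameterValue": client_cidr},
        # str(True) / str(False) yield the 'True' / 'False' strings the template allows.
        {"ParameterKey": "CreateInboundSshSecurityGroup", "ParameterValue": str(enable_ssh)},
        {"ParameterKey": "CreateInboundDcvSecurityGroup", "ParameterValue": str(enable_dcv)},
    ]
```

For example, security_group_parameters("vpc-0123456789abcdef0", "203.0.113.7/32") restricts SSH to a single client address; the returned list could be passed as the Parameters argument of boto3's cloudformation create_stack call together with the template body.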

3. Create PCS Cluster

It is possible to reuse an existing PCS cluster

If a compatible PCS cluster already exists, skip this step and reference its name in later steps.

Go to the AWS PCS console and create a new cluster.

  • Under Cluster setup, choose a name like AWSPCS-PTPro-cluster.
  • Set the Controller size to Small.
  • Use the version of Slurm compatible with the ParaTools Pro for E4S™ image. This is usually the latest version available (25.05 as of December 2025).
  • Under Networking:
    • Select the VPC created in step 1 (e.g., AWSPCS-PTPro-cluster...).
    • Select the subnet labeled PrivateSubnetA created in step 1.
    • Under Security groups choose Select an existing security group.
      • Use the security group cluster-*-sg created in step 2 (e.g., cluster-AWSPCS-PTPro-sg).
  • Click Create Cluster to begin creating the cluster.
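The console choices above can also be expressed as a request to the AWS PCS API. The following is a hedged sketch rather than the tutorial's method: it assumes the boto3 "pcs" client and a CreateCluster request with clusterName, scheduler, size, and networking fields, which should be checked against the current AWS PCS API reference before use. The helper names and the placeholder IDs in the usage note are hypothetical.

```python
def pcs_cluster_request(cluster_name, subnet_id, security_group_id,
                        slurm_version="25.05", size="SMALL"):
    """Build keyword arguments mirroring the console choices in this step."""
    if size not in {"SMALL", "MEDIUM", "LARGE"}:
        raise ValueError(f"unexpected controller size: {size}")
    return {
        "clusterName": cluster_name,
        # Slurm version compatible with the ParaTools Pro for E4S image.
        "scheduler": {"type": "SLURM", "version": slurm_version},
        "size": size,
        "networking": {
            "subnetIds": [subnet_id],                  # PrivateSubnetA from step 1
            "securityGroupIds": [security_group_id],   # cluster security group from step 2
        },
    }


def create_pcs_cluster(request, region="us-east-1"):
    """Submit the request (needs boto3 with PCS support and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    pcs = boto3.client("pcs", region_name=region)
    return pcs.create_cluster(**request)
```

A call such as pcs_cluster_request("AWSPCS-PTPro-cluster", "subnet-0123456789abcdef0", "sg-0123456789abcdef0") builds the request without contacting AWS, which makes it easy to review the values before passing them to create_pcs_cluster.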

4. Create Shared Filesystem Using EFS

  • Go to the EFS console and ensure the region matches the region where the PCS cluster is being set up.
  • Click Create file system:
    • Name: something like AWSPCS-PTPro-fs.
    • Virtual Private Cloud (VPC): the VPC ID from step 1.
  • Click Create.
  • Note the File system ID (e.g., fs-0123456789abcdef0); it is needed in step 7.

5. Create an Instance Profile

Recommended: use the CloudFormation template

The fastest and least error-prone path is to deploy the CloudFormation template below, which creates the policy, role, and instance profile in one step, including the DCV license policy correctly parameterized for the stack's region.

3-pcs-cluster-cloudformation-iam.yaml

AWSTemplateFormatVersion: '2010-09-09'
Description: >-
  IAM role, policies, and instance profile for AWS PCS cluster nodes
  (login + compute). Creates the role required by the AWS PCS service
  (name must start with "AWSPCS-"), attaches the minimum PCS policy,
  the AWS Systems Manager managed instance policy, and optionally the
  Amazon DCV license-bucket read policy needed for remote-desktop
  access on the login node.

Parameters:
  RoleNameSuffix:
    Type: String
    Default: PCS-cluster
    Description: >-
      Suffix appended to the required "AWSPCS-" prefix to form the role
      and instance-profile name. Must be unique in the account. The
      final name will be "AWSPCS-<RoleNameSuffix>".
    AllowedPattern: '[A-Za-z0-9+=,.@_-]+'
    MinLength: 1
    MaxLength: 50

  EnableDcvLicenseAccess:
    Type: String
    Default: 'true'
    AllowedValues: ['true', 'false']
    Description: >-
      If "true", attach a policy granting s3:GetObject on the Amazon
      DCV license bucket for this region, required for DCV remote-
      desktop licensing on EC2 instances.

Conditions:
  AttachDcvPolicy: !Equals [!Ref EnableDcvLicenseAccess, 'true']

Resources:
  PcsRegisterNodePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub '${AWS::StackName}-pcs-register-node'
      Description: Allow EC2 instances to register as AWS PCS compute node group members.
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - pcs:RegisterComputeNodeGroupInstance
            Resource: '*'

  DcvLicenseAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Condition: AttachDcvPolicy
    Properties:
      ManagedPolicyName: !Sub '${AWS::StackName}-dcv-license-access'
      Description: >-
        Allow EC2 instances to read the Amazon DCV license bucket for
        the stack's deployment region, required for DCV licensing.
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action: s3:GetObject
            Resource: !Sub 'arn:${AWS::Partition}:s3:::dcv-license.${AWS::Region}/*'

  PcsNodeRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub 'AWSPCS-${RoleNameSuffix}'
      Description: >-
        Instance role for AWS PCS cluster nodes. Name prefix "AWSPCS-"
        is required by the AWS PCS service.
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: !Sub 'ec2.${AWS::URLSuffix}'
            Action: sts:AssumeRole
      ManagedPolicyArns: !If
        - AttachDcvPolicy
        - - !Ref PcsRegisterNodePolicy
          - !Ref DcvLicenseAccessPolicy
          - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore'
        - - !Ref PcsRegisterNodePolicy
          - !Sub 'arn:${AWS::Partition}:iam::aws:policy/AmazonSSMManagedInstanceCore'

  PcsNodeInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: !Sub 'AWSPCS-${RoleNameSuffix}'
      Roles:
        - !Ref PcsNodeRole

Outputs:
  RoleName:
    Description: Name of the PCS node IAM role.
    Value: !Ref PcsNodeRole
    Export:
      Name: !Sub '${AWS::StackName}-RoleName'

  RoleArn:
    Description: ARN of the PCS node IAM role.
    Value: !GetAtt PcsNodeRole.Arn
    Export:
      Name: !Sub '${AWS::StackName}-RoleArn'

  InstanceProfileName:
    Description: Name of the PCS node instance profile (pass this to node group / launch template).
    Value: !Ref PcsNodeInstanceProfile
    Export:
      Name: !Sub '${AWS::StackName}-InstanceProfileName'

  InstanceProfileArn:
    Description: ARN of the PCS node instance profile.
    Value: !GetAtt PcsNodeInstanceProfile.Arn
    Export:
      Name: !Sub '${AWS::StackName}-InstanceProfileArn'

Parameters:

  • RoleNameSuffix (default PCS-cluster) -- final role and instance-profile name is AWSPCS-<RoleNameSuffix>. The AWSPCS- prefix is required by AWS PCS.
  • EnableDcvLicenseAccess (default true) -- attach the DCV license read policy for remote-desktop use.
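
To see what name a given RoleNameSuffix produces before deploying the stack, the constraints from the template can be mirrored locally. This is a sketch only: pcs_role_name is a hypothetical helper, and CloudFormation remains the authority on whether a value is accepted.

```shell
# Hypothetical helper mirroring the RoleNameSuffix constraints in
# 3-pcs-cluster-cloudformation-iam.yaml (AllowedPattern, MinLength 1, MaxLength 50).
pcs_role_name() {
    suffix=$1
    if [ "${#suffix}" -lt 1 ] || [ "${#suffix}" -gt 50 ]; then
        echo "suffix must be 1-50 characters long" >&2
        return 1
    fi
    if ! printf '%s\n' "$suffix" | LC_ALL=C grep -Eq '^[A-Za-z0-9+=,.@_-]+$'; then
        echo "suffix contains characters outside [A-Za-z0-9+=,.@_-]" >&2
        return 1
    fi
    # The template prepends the AWSPCS- prefix that AWS PCS requires.
    printf 'AWSPCS-%s\n' "$suffix"
}

pcs_role_name PCS-cluster
```

With the default suffix this prints AWSPCS-PCS-cluster, which is the name to look for in the node group's IAM instance profile selector.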

After the stack completes, select the instance profile named in the InstanceProfileName output as the IAM instance profile when creating the node groups in step 8. Skip to step 6.

To create the policy and role manually via the IAM console, follow the rest of this section.

Go to the IAM console. Under Access Management → Policies, check whether a policy matching this one already exists (search for pcs). If none exists, create a new one and specify the permissions using the JSON editor as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "pcs:RegisterComputeNodeGroupInstance"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Name the new policy something like AWS-PCS-policy and note the name you chose.

Additional optional steps to enable DCV remote desktop access

To access the login node via DCV, create an additional policy granting read access to the DCV license bucket in S3. If a matching policy already exists, reuse it (search for DCV). Otherwise, create a new one, specifying the permissions with the JSON editor as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::dcv-license.*/*"
        }
    ]
}

Give it a name like EC2AccessDCVLicenseS3.

Tighter region scope (optional)

The wildcard dcv-license.* matches only AWS-owned DCV license buckets (bucket name is reserved by AWS), so it is safe. For an explicit allowlist, enumerate the regions you deploy in, for example:

"Resource": [
    "arn:aws:s3:::dcv-license.us-east-1/*",
    "arn:aws:s3:::dcv-license.us-east-2/*",
    "arn:aws:s3:::dcv-license.us-west-1/*",
    "arn:aws:s3:::dcv-license.us-west-2/*"
]

In a CloudFormation template the policy Resource can be parameterized with !Sub 'arn:${AWS::Partition}:s3:::dcv-license.${AWS::Region}/*' so it substitutes the stack's region automatically. IAM policy JSON itself has no built-in variable for the EC2 instance's region.
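
When the allowlist covers more than a few regions, generating the Resource array avoids typos in hand-written ARNs. The sketch below is illustrative: dcv_license_resources is a hypothetical helper, and the hard-coded aws partition is an assumption that does not hold for GovCloud or China regions.

```shell
# Hypothetical helper: print a JSON array of DCV license bucket ARNs,
# one per region given on the command line.
dcv_license_resources() {
    partition="aws"   # assumption: standard commercial partition
    first=1
    printf '[\n'
    for region in "$@"; do
        # Separate entries with a comma, but emit none after the last one.
        [ "$first" -eq 1 ] || printf ',\n'
        printf '    "arn:%s:s3:::dcv-license.%s/*"' "$partition" "$region"
        first=0
    done
    printf '\n]\n'
}

dcv_license_resources us-east-1 us-west-2
```

Paste the printed array as the value of "Resource" in the policy JSON.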

Next, in the IAM Console go to Access Management → Roles and check whether a role starting with AWSPCS- already exists with the required policies attached. Otherwise, create it as follows:

  • Select Create Role.
  • For Trusted Entity Type, choose AWS Service.
  • For Service or use case, choose EC2; for Use Case, choose EC2.
  • Click Next.
  • Under Add permissions:
    • Add the policy created earlier in step 5.
    • If planning to use DCV to access the login node, also add the EC2AccessDCVLicenseS3 policy.
    • Add the AmazonSSMManagedInstanceCore policy.
  • Click Next.
  • Give the role a name starting with AWSPCS- (e.g., AWSPCS-PTPro-role); AWS PCS requires this prefix.

6. Create EFA Placement Group

It is possible to reuse an existing placement group

If a compatible cluster placement group already exists, skip this step and reference its name in later steps.

Under the EC2 Console, navigate to Network & Security → Placement Groups → Create placement group.

  • Name: something like AWSPCS-PTPro-cluster.
  • Placement strategy: Cluster.
  • Click Create group.
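
The same placement group can be created from the command line. The wrapper below is a sketch that assumes the AWS CLI is installed and configured for the cluster's region; create_cluster_placement_group is a hypothetical function name, while the create-placement-group subcommand and its --group-name and --strategy options belong to the AWS CLI.

```shell
# Sketch: create the cluster placement group from step 6 with the AWS CLI.
# Assumes credentials and the default region are already configured.
create_cluster_placement_group() {
    group_name=${1:-AWSPCS-PTPro-cluster}
    aws ec2 create-placement-group \
        --group-name "$group_name" \
        --strategy cluster
}

# Usage (uncomment to run against your account):
# create_cluster_placement_group AWSPCS-PTPro-cluster
```

The default name matches the PlacementGroupName default in the step 7 launch-template stack, so the two stay in sync unless you override one of them.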

7. Create node Launch Templates

This step creates two EC2 launch templates -- one for the login node and one for compute nodes -- both wired up for EFA networking and the shared EFS filesystem.

Using CloudFormation, create a new stack using the following template:

2-pcs-cluster-cloudformation-launch-templates.yaml

# original source: https://aws-hpc-recipes.s3.amazonaws.com/main/recipes/pcs/getting_started/assets/pcs-lt-efs-fsxl.yaml
# has been modified

AWSTemplateFormatVersion: 2010-09-09
Description: EC2 launch templates for AWS PCS login and compute node groups.
  This template creates EC2 launch templates for AWS PCS login and compute node groups.
  It demonstrates mounting EFS and FSx for Lustre file systems, configuring EC2 instance tags, enabling Instance Metadata Service Version 2 (IMDSv2), and setting up the cluster security group for communication with the AWS PCS controller.
  Additionally, it shows how to configure inbound SSH access to the login nodes, and optionally sets an initial password for the `ubuntu` user so DCV web sessions can sign in out of the box.
  Use this template as a starting point to create custom launch templates tailored to your specific requirements.
  Check the Outputs tab of this stack for useful details about resources created by this template.

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Security
        Parameters:
          - VpcDefaultSecurityGroupId
          - ClusterSecurityGroupId
          - SshSecurityGroupId
          - EnableDcvAccess
          - DcvSecurityGroupId
          - DcvUbuntuPassword
          - SshKeyName
      - Label:
          default: Networking
        Parameters:
          - VpcId
          - PlacementGroupName
          - NodeGroupSubnetId
      - Label:
          default: File systems
        Parameters:
          - EfsFilesystemId
          - FSxLustreFilesystemId
          - FSxLustreFilesystemMountName

Parameters:

  VpcId:
    Type: 'AWS::EC2::VPC::Id'
    Description: Cluster VPC where EFA-enabled instances will be launched
  NodeGroupSubnetId:
    Type: AWS::EC2::Subnet::Id
    Description: Subnet within cluster VPC where EFA-enabled instances will be launched
  VpcDefaultSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Cluster VPC 'default' security group. Make sure you choose the one from your cluster VPC!
  ClusterSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for PCS cluster controller and nodes.
  SshSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for SSH into login nodes
  EnableDcvAccess:
    Type: String
    Description: Enable DCV access to login nodes? When set to True, the DcvSecurityGroupId parameter below will be required.
    Default: 'True'
    AllowedValues:
      - 'True'
      - 'False'
  DcvSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
    Description: Security group for DCV access to login nodes (only used if EnableDcvAccess is True)
  DcvUbuntuPassword:
    Type: String
    NoEcho: true
    Default: ''
    MinLength: 0
    MaxLength: 128
    AllowedPattern: "^$|^[A-Za-z0-9!@#%^&*()_+=,.:;/?<>{}\\[\\]~-]{8,128}$"
    Description: >-
      Optional initial password for the `ubuntu` user on the login node, used
      to sign in to DCV web sessions at https://<login-public-ip>:8443/. Leave
      blank to skip password setup (DCV will then require manually setting a
      password on the instance before the web UI can be used). 8-128 chars;
      avoid quotes, backticks, backslash, and whitespace.
  SshKeyName:
    Type: AWS::EC2::KeyPair::KeyName
    Description: SSH key name for access to login nodes
  EfsFilesystemId:
    Type: String
    Description: Amazon EFS file system Id
  FSxLustreFilesystemId:
    Type: String
    Description: Amazon FSx for Lustre file system Id
  FSxLustreFilesystemMountName:
    Type: String
    Description: Amazon FSx for Lustre mount name
  PlacementGroupName:
    Type: String
    Description: Placement group name for compute nodes (leave blank to create a new one)
    Default: "AWSPCS-PTPro-cluster"


Conditions:
  HasDcvAccess: !Equals [!Ref EnableDcvAccess, 'True']

Resources:

  EfaSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: Support EFA
      GroupName: !Sub 'efa-${AWS::StackName}'
      VpcId: !Ref VpcId
  EfaSecurityGroupOutboundSelfRule:
    Type: 'AWS::EC2::SecurityGroupEgress'
    Properties:
      IpProtocol: '-1'
      GroupId: !Ref EfaSecurityGroup
      Description: Allow outbound EFA traffic to SG members
      DestinationSecurityGroupId: !Ref EfaSecurityGroup

  EfaSecurityGroupInboundSelfRule:
    Type: 'AWS::EC2::SecurityGroupIngress'
    Properties:
      IpProtocol: '-1'
      GroupId: !Ref EfaSecurityGroup
      Description: Allow inbound EFA traffic to SG members
      SourceSecurityGroupId: !Ref EfaSecurityGroup

  LoginLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub 'login-${AWS::StackName}'

      LaunchTemplateData:
        TagSpecifications:
          - ResourceType: instance
            Tags:
              - Key: HPCRecipes
                Value: "true"
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 4
          HttpTokens: required
        KeyName: !Ref SshKeyName
        SecurityGroupIds:
          - !Ref ClusterSecurityGroupId
          - !Ref SshSecurityGroupId
          - !If [HasDcvAccess, !Ref DcvSecurityGroupId, !Ref "AWS::NoValue"]
          - !Ref VpcDefaultSecurityGroupId
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

            --==MYBOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"
            MIME-Version: 1.0

            packages:
            - amazon-efs-utils

            # Enable GDM autologin for `ubuntu` BEFORE gdm3 starts, so DCV's "console" session attaches
            # to the user's Xorg without a greeter->user-session transition. A late (`runcmd`) edit +
            # `systemctl restart gdm3` would segfault the mode=system dcvagent mid-transition.
            # Idempotent: sed regex won't match on reboot after first edit. Remove once the underlying
            # AMI bakes this config. Canonical per AWS DCV Ubuntu + GNOME setup.
            bootcmd:
            - sed -i 's/^#  AutomaticLoginEnable = true/AutomaticLoginEnable = true/' /etc/gdm3/custom.conf
            - sed -i 's/^#  AutomaticLogin = user1/AutomaticLogin = ubuntu/' /etc/gdm3/custom.conf

            write_files:
            - path: /etc/update-motd.d/99-dcv-url
              permissions: '0755'
              owner: root:root
              content: |
                #!/bin/sh
                TOKEN=$(curl -fsSL --max-time 2 -X PUT -H 'X-aws-ec2-metadata-token-ttl-seconds: 60' http://169.254.169.254/latest/api/token 2>/dev/null)
                IP=$(curl -fsSL --max-time 2 -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/public-ipv4 2>/dev/null)
                [ -n "$IP" ] || IP=$(hostname -I | awk '{print $1}')
                printf '\n  DCV remote desktop: https://%s:8443/  (user: ubuntu)\n\n' "$IP"

            runcmd:
            # Mount EFS filesystem as /home
            - mkdir -p /tmp/home
            - rsync -aA /home/ /tmp/home
            - echo "${EfsFilesystemId}:/ /home efs tls,_netdev" >> /etc/fstab
            - mount -a -t efs defaults
            - if command -v sestatus >/dev/null 2>&1 && [ "X$(sestatus | awk '/^SELinux status:/{print $3}')" = "Xenabled" ]; then setsebool -P use_nfs_home_dirs 1; fi
            - rsync -aA --ignore-existing /tmp/home/ /home
            - rm -rf /tmp/home/
            # If provided, mount FSxL filesystem as /shared
            - if [ ! -z "${FSxLustreFilesystemId}" ]; then amazon-linux-extras install -y lustre=latest; mkdir -p /shared; chmod a+rwx /shared; mount -t lustre ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /shared; chmod 777 /shared; fi
            # If provided, set an initial password for the `ubuntu` user so DCV web sessions can sign in.
            - if [ -n '${DcvUbuntuPassword}' ]; then echo 'ubuntu:${DcvUbuntuPassword}' | chpasswd; fi

            --==MYBOUNDARY==--

  ComputeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub 'compute-${AWS::StackName}'
      LaunchTemplateData:
        TagSpecifications:
          - ResourceType: instance
            Tags:
              - Key: HPCRecipes
                Value: "true"
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 4
          HttpTokens: required
        Placement:
          GroupName: !Ref PlacementGroupName
        NetworkInterfaces:
          - Description: Primary network interface
            DeviceIndex: 0
            InterfaceType: efa
            NetworkCardIndex: 0
            SubnetId: !Ref NodeGroupSubnetId
            Groups:
            - !Ref EfaSecurityGroup
            - !Ref ClusterSecurityGroupId
            - !Ref VpcDefaultSecurityGroupId
        KeyName: !Ref SshKeyName
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

            --==MYBOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"
            MIME-Version: 1.0

            packages:
            - amazon-efs-utils

            runcmd:
            # Mount EFS filesystem as /home
            - mkdir -p /tmp/home
            - rsync -aA /home/ /tmp/home
            - echo "${EfsFilesystemId}:/ /home efs tls,_netdev" >> /etc/fstab
            - mount -a -t efs defaults
            - if command -v sestatus >/dev/null 2>&1 && [ "X$(sestatus | awk '/^SELinux status:/{print $3}')" = "Xenabled" ]; then setsebool -P use_nfs_home_dirs 1; fi
            - rsync -aA --ignore-existing /tmp/home/ /home
            - rm -rf /tmp/home/
            # If provided, mount FSxL filesystem as /shared
            - if [ ! -z "${FSxLustreFilesystemId}" ]; then amazon-linux-extras install -y lustre=latest; mkdir -p /shared; chmod a+rwx /shared; mount -t lustre ${FSxLustreFilesystemId}.fsx.${AWS::Region}.amazonaws.com@tcp:/${FSxLustreFilesystemMountName} /shared; fi

            --==MYBOUNDARY==--

Outputs:
  LoginLaunchTemplateId:
    Description: "Login nodes template ID"
    Value: !Ref LoginLaunchTemplate
  LoginLaunchTemplateName:
    Description: "Login nodes template name"
    Value: !Sub 'login-${AWS::StackName}'
  ComputeLaunchTemplateId:
    Description: "Compute nodes template ID"
    Value: !Ref ComputeLaunchTemplate
  ComputeLaunchTemplateName:
    Description: "Compute nodes template name"
    Value: !Sub 'compute-${AWS::StackName}'
  EfaSecurityGroupId:
    Description: Security group created to support EFA communications
    Value: !Ref EfaSecurityGroup
  DcvWebAccessHint:
    Description: >-
      Once the login node reaches "Active", point a browser at
      https://<login-public-ipv4>:8443/ and sign in as user `ubuntu`
      with the password supplied via DcvUbuntuPassword (if set). The
      login node also prints the resolved DCV URL in its MOTD on SSH /
      Session Manager login.
    Value: "https://<login-public-ipv4>:8443/  (user: ubuntu)"

Give the stack a name (e.g., AWSPCS-PTPro-lt). Populate the parameters as follows:

| Parameter | Value |
| --- | --- |
| VpcId | Output VPC from step 1 |
| VpcDefaultSecurityGroupId | The "default" security group of the VPC created in step 1 |
| ClusterSecurityGroupId | Output ClusterSecurityGroupId from step 2 |
| SshSecurityGroupId | Output InboundSshSecurityGroupId from step 2 |
| SshKeyName | An existing EC2 key pair you control |
| PlacementGroupName | Name chosen in step 6 |
| NodeGroupSubnetId | PrivateSubnetA from step 1 |
| EfsFilesystemId | EFS filesystem ID from step 4 |
| DcvUbuntuPassword | Optional initial password for the ubuntu user, used to sign in to DCV web sessions (see DCV remote desktop in step 10). Leave blank to skip and set a password manually later. Marked NoEcho so it is not shown in the console or stack events. |

After the stack reaches CREATE_COMPLETE, note the launch template names from the stack outputs. They will be named login-<stack-name> and compute-<stack-name>, and are referenced in step 8.
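
A DcvUbuntuPassword value that violates the template's AllowedPattern is rejected when the stack is created, so it can save a round trip to check a candidate locally first. The sketch below is a hypothetical pre-flight check: valid_dcv_password is not part of the tutorial's tooling, and its bracket expression is a hand translation of the AllowedPattern into POSIX extended regular expression syntax.

```shell
# Hypothetical pre-flight check for DcvUbuntuPassword. The bracket expression
# is a POSIX ERE translation of the template's AllowedPattern.
valid_dcv_password() {
    pw=$1
    # An empty value is allowed: it skips password setup.
    [ -z "$pw" ] && return 0
    # Reject embedded newlines, which grep would treat as separate lines.
    case $pw in
        *'
'*) return 1 ;;
    esac
    # 8-128 characters, all drawn from the allowed set.
    printf '%s\n' "$pw" |
        LC_ALL=C grep -Eq '^[]A-Za-z0-9!@#%^&*()_+=,.:;/?<>{}[~-]{8,128}$'
}

if valid_dcv_password 'Example-Pass1'; then
    echo "password matches the allowed pattern"
fi
```

The function returns 0 for an acceptable value (including a blank one) and non-zero otherwise, so it can gate a deployment script.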

8. Create node groups

This tutorial uses two compute node groups: one for interactive login nodes (statically scaled) and one for elastic compute nodes that run jobs.

In the AWS PCS console, select the cluster created in step 3, navigate to Compute node groups, and click Create.

AMI selection

For the AMI ID field, use a ParaTools Pro for E4S™ PCS-compatible AMI from the AWS Marketplace. Use the same AMI for both node groups so the login and compute environments stay in sync. Pick the product matching your cluster's target architecture:

| Architecture | AWS Marketplace product |
| --- | --- |
| x86_64 | ParaTools Pro for E4S™ on AWS PCS (x86) |
| arm64 (Graviton) | ParaTools Pro for E4S™ on AWS PCS (ARM64) |

Obtaining the AMI ID after subscribing:

  1. Open the marketplace product page above and click View purchase options / Continue to Subscribe.
  2. Accept the terms and wait for the subscription to be processed.
  3. Click Continue to Configuration.
  4. Select the delivery method, software version, and AWS region matching your cluster.
  5. Copy the AMI ID shown on the configuration page (format: ami-0123456789abcdef0). Use this value in the AMI ID field when creating the compute node groups below.

Alternatively, after subscribing, find the AMI in the EC2 console under Images → AMIs, filtered by Owner alias = aws-marketplace and searching for ParaTools.

Recommended instance types

Choose instance types that match the AMI architecture. EFA is required for tightly-coupled MPI on compute nodes; GPU login nodes enable DCV/interactive visualization without EFA.

| Role | x86_64 | arm64 |
| --- | --- | --- |
| Compute node group | g4dn.8xlarge (NVIDIA T4, EFA) | hpc7g.8xlarge (Graviton3E, 200 Gbps EFA, no GPU) |
| Login node group (~4xlarge) | g4dn.4xlarge (NVIDIA T4) | g5g.4xlarge (Graviton2 + NVIDIA T4G) |

g5g has no EFA and is suited only for login / interactive visualization, not for compute.

8.1 Compute node group (compute-1)

This is a dynamic node group: instances are launched when jobs are submitted and terminated after the configured idle time, scaling down to zero when the queue is empty.

  • Under Compute node group details:
    • Compute node group name: compute-1.
  • Under Compute configuration:
    • EC2 launch template: compute-<stack-name> from step 7.
    • Version: select the latest version of the launch template.
    • IAM instance profile: select the Use an existing profile radio, then under Selected profile choose the AWSPCS-* instance profile created in step 5.
    • Subnets: PrivateSubnetA from step 1.
    • Instance types: g4dn.8xlarge (for arm64 clusters, see the Recommended instance types tip above).
    • Scaling configuration: select the Dynamic node group radio. Set Minimum instance count to 0 and Maximum instance count to 2.
    • AMI ID: select the Custom AMI radio, then paste the ParaTools Pro for E4S™ AMI ID obtained from the marketplace subscription (see the AMI selection note above).
  • Leave Capacity purchase option at its default (On-Demand). Skip Scheduler configuration and Tags.
  • Click Create compute node group and wait for the Status field to show Active before proceeding.

8.2 Login node group (login)

This is a static node group: a single long-running instance you SSH into (or access via Session Manager) to submit jobs.

  • Navigate back to Compute node groups and click Create.
  • Under Compute node group details:
    • Compute node group name: login.
  • Under Compute configuration:
    • EC2 launch template: login-<stack-name> from step 7.
    • Version: select the latest version of the launch template.
    • IAM instance profile: select the Use an existing profile radio, then under Selected profile choose the same AWSPCS-* instance profile used for compute-1.
    • Subnets: PublicSubnetA from step 1.
    • Instance types: g4dn.4xlarge (for arm64 clusters, see the Recommended instance types tip above).
    • Scaling configuration: select the Static node group radio. Set both Minimum instance count and Maximum instance count to 1.
    • AMI ID: select the Custom AMI radio and paste the same ParaTools Pro for E4S™ AMI ID used for compute-1.
  • Leave Capacity purchase option, Scheduler configuration, and Tags at their defaults.
  • Click Create compute node group.

Wait for Active status

Wait for the login group to reach Active before attempting to connect in step 10. The login instance needs several minutes after activation for cloud-init and Slurm configuration to complete.

9. Create queue

A queue exposes a compute node group to Slurm as a partition. Jobs submitted with sbatch -p <queue-name> will land on the attached compute node group.

Before creating the queue, ensure the compute-1 group from step 8.1 has reached Active status.

In the AWS PCS console, select the cluster created in step 3, navigate to Queues, and click Create queue.

  • Under Queue configuration:
    • Queue name: compute-1 (this becomes the Slurm partition name).
    • Compute node groups: select compute-1 from step 8.1.
  • Click Create queue and wait for the Status field to show Active.

10. Connect to login node

Once the login compute node group has reached Active, locate its EC2 instance and connect.

  1. Find the login instance.
    • In the AWS PCS console, select the cluster from step 3.
    • Go to Compute node groups and select the login group from step 8.2.
    • Copy the Compute node group ID (e.g., cng-abc123def456...).
  2. Locate the instance in EC2.

    • In the EC2 Console, choose Instances.
    • In the Find instances by attribute or tag (case sensitive) search bar, filter by the PCS tag:

      tag:aws:pcs:compute-node-group-id = <compute-node-group-id>
      

      There should be exactly one running instance matching the login group's ID.

    • Select the instance and copy its Public IPv4 address.

  3. Connect. Use either SSH or AWS Systems Manager Session Manager.

    SSH: use the key pair specified in step 7. For the ParaTools Pro for E4S™ Ubuntu-based AMIs, the default user is ubuntu:

    ssh -i <path-to-key.pem> ubuntu@<public-ipv4-address>

    Session Manager:

    • In the EC2 console, select the instance and click Connect.
    • Choose the Session Manager tab and click Connect.
    • An interactive browser-based terminal opens as user ssm-user.
    • Switch to the default user to pick up the cluster environment:

      sudo -i -u ubuntu
      

Allow time for cluster bootstrap

Wait about 2 minutes after the login node reaches Active before connecting, so cloud-init can finish.
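
The console lookup of the login instance can also be scripted with the same PCS tag filter. This is a sketch, assuming the AWS CLI is configured for the cluster's region and the login node has a public IPv4 address; login_node_ip is a hypothetical function name.

```shell
# Sketch: look up the login node's public IPv4 address by its PCS tag.
# Assumes the AWS CLI is configured for the cluster's region.
login_node_ip() {
    node_group_id=$1   # the Compute node group ID copied from the PCS console
    aws ec2 describe-instances \
        --filters "Name=tag:aws:pcs:compute-node-group-id,Values=$node_group_id" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].PublicIpAddress' \
        --output text
}

# Usage (the ID and key path are placeholders):
# ssh -i <path-to-key.pem> "ubuntu@$(login_node_ip <compute-node-group-id>)"
```

Filtering on the running state keeps terminated instances from earlier node group launches out of the result.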

DCV remote desktop (optional)

The ParaTools Pro for E4S™ AMI ships with Amazon DCV (formerly NICE DCV) configured to serve a GPU-accelerated Linux desktop on TCP 8443. The DCV license is granted to the node via the IAM policy from step 5, and inbound access is allowed by the DCV security group from step 2.

  1. Open the DCV URL. Browse to the login node's public IPv4 (located via the same steps used to SSH in above):

    https://<login-public-ipv4>:8443/
    

    The browser warns about a self-signed certificate; accept to continue.

    Shortcut: grab the URL from the MOTD

    The login node's cloud-init installs a MOTD drop-in that prints the fully-resolved DCV URL on every SSH / Session Manager login, e.g.:

    DCV remote desktop: https://54.81.250.30:8443/  (user: ubuntu)
    

    Copy-paste that URL into your browser instead of hunting for the instance IP in the EC2 console.

  2. Sign in.

    • Username: ubuntu.
    • Password: the value supplied for DcvUbuntuPassword when creating the launch-template stack in step 7.

    If DcvUbuntuPassword was left blank, set a password on the login node before connecting:

    sudo passwd ubuntu
    

Rotate or set the password later

DcvUbuntuPassword is only consumed once during cloud-init on first boot. To change the password later (or to set one when the parameter was left blank), SSH into the login node and run sudo passwd ubuntu.

11. Verify the Slurm environment

Once connected to the login node, confirm Slurm can see the queue and partition you created:

sinfo

sinfo lists the Slurm partitions, their node states, and the compute node groups backing them. You should see the queue from step 9 listed as a partition in the idle~ state (the ~ suffix indicates dynamically-provisioned nodes that are currently powered down).
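
The ~ is one of several suffixes sinfo appends to a node state. As a rough aid for reading the output, the hypothetical helper below (describe_node_state is not a Slurm command) decodes the suffixes most relevant to dynamically provisioned nodes; it covers only a subset of those listed in the sinfo man page.

```shell
# Hypothetical helper: explain the suffix on a Slurm node state from sinfo.
describe_node_state() {
    case $1 in
        *'~') printf '%s: powered down, launched on demand\n' "${1%?}" ;;
        *'#') printf '%s: powering up or being configured\n' "${1%?}" ;;
        *'%') printf '%s: powering down\n' "${1%?}" ;;
        *'*') printf '%s: not responding\n' "${1%?}" ;;
        *)    printf '%s: no suffix\n' "$1" ;;
    esac
}

describe_node_state 'idle~'

# On the login node, per-node states can be fed in like this:
# sinfo -h -N -o '%N %t' | while read -r node state; do
#     printf '%s -> ' "$node"; describe_node_state "$state"
# done
```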

Compute nodes are automatically terminated after a period of inactivity, governed by the cluster's scale-down idle time (scaleDownIdleTimeInSeconds in the PCS API; ScaledownIdletime is the equivalent AWS ParallelCluster setting). Set it in step 3 during cluster creation by adjusting the Slurm configuration settings.

12. Run sample jobs from ParaTools E4S Cloud Examples

The ParaTools Pro for E4S™ AMI ships with a set of MPI/HPC example programs pre-copied into your home directory at ~/examples.

Examples missing from ~/examples?

If ~/examples is empty or missing, first check /opt/demo -- the source copies live there and may not have been propagated to your home directory:

ls /opt/demo
mkdir -p ~/examples
cp -R /opt/demo/. ~/examples/

If neither exists (for instance, on a fresh EFS mount that masked the AMI's /home contents), clone the ParaTools E4S Cloud Examples repository directly from GitHub:

git clone https://github.com/ParaToolsInc/e4s-cloud-examples.git ~/examples

Move into the examples directory:

cd ~/examples

NVIDIA NeMo™ and BioNeMo™ live in a dedicated Python environment

NeMo and BioNeMo are installed in a separate virtual environment to avoid dependency conflicts with other GPU/ML packages. Activate it before running NeMo or BioNeMo workloads (or source it from your sbatch script):

source /usr/local/py-env/nemo/bin/activate

Other Python packages (including vLLM) are available in the default system Python and require no activation.

12.1 Run the mpi-procname example

mpi-procname is a tiny MPI program that prints the rank and hostname of each process. It is a quick sanity check that MPI launches and that EFA is reachable between nodes.

cd ~/examples/mpi-procname
./clean.sh
./compile.sh
sbatch -p compute-1 mpiprocname.sbatch

Because compute nodes in this partition are provisioned on demand, the first sbatch submission will trigger an EC2 launch. Expect a few minutes of delay before the job starts; subsequent jobs on the same warm nodes will start almost immediately.

Monitor the job state with:

squeue

squeue lists the pending and running jobs. While nodes are being provisioned, the state column shows CF (configuring); once the nodes are up, it transitions to R (running), and the job disappears from the list when it completes. For node-level detail, run sinfo -N -l.
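
For scripts that poll the queue, the compact state codes can be translated into the stages described above. The helper below is a sketch: describe_job_state is a hypothetical name, and only the codes likely to appear during this tutorial are covered.

```shell
# Hypothetical helper: translate a compact squeue state code.
describe_job_state() {
    case $1 in
        PD) echo "pending: waiting for resources or priority" ;;
        CF) echo "configuring: nodes are being provisioned" ;;
        R)  echo "running" ;;
        CG) echo "completing: job is finishing up" ;;
        *)  echo "other state: $1 (see the squeue man page)" ;;
    esac
}

describe_job_state CF

# On the login node, the code for one job can be read with:
# describe_job_state "$(squeue -h -j <jobid> -o %t)"
```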

Once the job completes, the output file (e.g., slurm-<jobid>.out) will contain one line per MPI rank, showing rank/host placement.

12.2 Run the OSU Micro-Benchmarks

The OSU Micro-Benchmarks measure point-to-point MPI performance over EFA. The latency, bw (bandwidth), and bibw (bi-directional bandwidth) benchmarks are pre-built in the image and driven by the sbatch scripts in osu-benchmarks/:

cd ~/examples/osu-benchmarks
./clean.sh
sbatch -p compute-1 latency.sbatch
sbatch -p compute-1 bw.sbatch
sbatch -p compute-1 bibw.sbatch

Since the compute nodes were warmed up by the mpi-procname run, these three jobs should start back-to-back without further provisioning delay. Track them with squeue as before.

Each job writes to its own log file (osu-latency.log, osu-bw.log, osu-bibw.log) in the current directory.

13. Shut nodes down

To stop incurring EC2 charges, tear down the queue and node groups. The cluster, VPC, and CloudFormation stacks can be kept around for future use.

In the AWS PCS console, select the cluster created in step 3 and, in order:

  1. Delete the queue created in step 9 (Queues → select queue → Delete).
  2. Delete the login node group from step 8.2 (Compute node groups → select group → Delete).
  3. Delete the compute-1 node group from step 8.1 (Compute node groups → select group → Delete).

Deletion order matters

The queue must be deleted before its attached compute node group, otherwise the node group delete will fail.
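
For repeated teardowns, the deletion order can be captured in a small script. The sketch below only prints the commands and does not execute them. It assumes the aws pcs delete-queue and delete-compute-node-group subcommands accept resource names as identifiers (confirm with aws pcs help); pcs_teardown_plan and the example names are hypothetical.

```shell
# Sketch: print the AWS CLI delete commands in the order step 13 requires.
# Nothing is executed. Run each printed command yourself and wait for the
# deletion to finish before running the next one.
pcs_teardown_plan() {
    cluster=$1; queue=$2; login_group=$3; compute_group=$4
    # The queue goes first so the node groups are no longer attached.
    printf 'aws pcs delete-queue --cluster-identifier %s --queue-identifier %s\n' \
        "$cluster" "$queue"
    for group in "$login_group" "$compute_group"; do
        printf 'aws pcs delete-compute-node-group --cluster-identifier %s --compute-node-group-identifier %s\n' \
            "$cluster" "$group"
    done
}

pcs_teardown_plan my-cluster compute-1 login compute-1
```

Printing instead of executing leaves room to review the plan and to confirm in the PCS console that each deletion has completed.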