# SLURM Scheduler Cluster Blueprint for GCP
## General Info
Below is an example Google HPC-Toolkit blueprint for using ParaTools Pro for E4S™. Once you have access to ParaTools Pro for E4S™ through the GCP Marketplace, we recommend following the "quickstart tutorial" from the Google HPC-Toolkit project if you are new to GCP and/or HPC-Toolkit. The ParaTools Pro for E4S™ blueprint provided below can be copied, with some small modifications, and used for the tutorial or in production.
Areas of the blueprint that require your attention, and that may need to be changed, are highlighted and carry annotations (listed below the blueprint) offering further guidance.
## ParaTools Pro for E4S™ Slurm Cluster Blueprint Example
`e4s-23.11-cluster-slurm-gcp-5-9-hpc-rocky-linux-8.yaml` (full 104-line YAML listing not reproduced here)
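Since the full blueprint listing is not reproduced above, the heavily abbreviated sketch below illustrates the general shape of such a Slurm blueprint and where the annotated settings live. The module sources, IDs, machine types, and image placeholders are assumptions drawn from the public Google HPC-Toolkit `schedmd-slurm-gcp-v5` examples, not an excerpt from the actual ParaTools Pro for E4S™ file; refer to the blueprint you obtain through the marketplace for the authoritative contents.

```yaml
# Abbreviated, illustrative sketch only; not the actual ParaTools Pro for E4S blueprint.
# Module sources and settings follow the public Google HPC-Toolkit slurm-gcp-v5 examples.
blueprint_name: e4s-pro-slurm-cluster

vars:
  # Either set project_id here, or leave it unset and pass it at deploy time with
  # `ghpc create <blueprint>.yaml --vars project_id="${PROJECT_ID}"` (see the Warning annotation below).
  # project_id: my-gcp-project
  deployment_name: e4s-pro-demo
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      # Choose the machine type and the maximum number of dynamically created nodes;
      # your vCPU quota must cover (cores per node) * node_count_dynamic_max.
      machine_type: c3-highcpu-88
      node_count_dynamic_max: 4
      # The image family must match the ParaTools Pro for E4S image family
      # from the GCP Marketplace (placeholder values shown here).
      instance_image:
        family: <marketplace-image-family>
        project: <marketplace-image-project>

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [network1, compute_node_group]
    settings:
      partition_name: compute

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use: [network1, compute_partition]

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use: [network1, slurm_controller]
```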
- Warning: Either uncomment this line and ensure that it matches the name of your project on GCP, or invoke `ghpc` with the `--vars project_id="${PROJECT_ID}"` flag.
- Info: Ensure that this matches the image family from the GCP Marketplace.
- Danger: `0.0.0.0/0` exposes TCP port 22 to the entire world. This is fine for testing ephemeral clusters, but for persistent clusters you should limit traffic to your organization's IP range or to a hardened bastion server.
- Info: The `machine_type` and `node_count_dynamic_max` should be set to reflect the instance types and number of nodes you would like to use; these nodes are spun up dynamically. You must ensure that you have sufficient quota to run with the number of vCPUs = (cores per node) * (`node_count_dynamic_max`). For compute-intensive, tightly coupled jobs, C3 or H3 instances have shown good performance.
- Info: This example includes an additional SLURM partition containing H3 nodes (see the fragment after this list). At the time of this writing, access to H3 instances was limited and you may need to request access via a quota increase request. You do not need multiple SLURM partitions and may consider removing this one.
- Info: To access the full high-speed per-VM Tier_1 networking capabilities on supported instance types, gVNIC must be enabled (see the fragment after this list).
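For the last two annotations, the fragment below sketches how an additional H3 node group and partition might be expressed with the same `schedmd-slurm-gcp-v5` modules. The `bandwidth_tier` values come from the public node-group module documentation; treat the whole fragment as an assumption-laden illustration rather than an excerpt from the actual blueprint.

```yaml
# Illustrative fragment only: additional modules that would sit alongside the
# compute node group and partition in the deployment_groups list sketched above.
  - id: h3_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      machine_type: h3-standard-88
      node_count_dynamic_max: 2
      # Enable gVNIC; on instance types that support per-VM Tier_1 networking,
      # `tier_1_enabled` can be used instead of `gvnic_enabled`.
      bandwidth_tier: gvnic_enabled

  - id: h3_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [network1, h3_node_group]
    settings:
      partition_name: h3
```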