Slurm Cluster Blueprint for GCP¶
General Info¶
This page provides an example Cluster Toolkit blueprint for use with ParaTools Pro for E4S™. Once you have subscribed to ParaTools Pro for E4S™ on the GCP Marketplace, we recommend working through the "Deploy an HPC cluster with Slurm" quickstart from the Cluster Toolkit project if you are new to GCP or to the Cluster Toolkit. The blueprint below can be copied with small modifications and used either for the tutorial or in production.
Areas of the blueprint that require your attention, and that may need to be changed, are highlighted and have expandable annotations offering further guidance.
ParaTools Pro for E4S™ Slurm Cluster Blueprint Example¶
Blueprint file: e4s-25.11-cluster-slurm-gcp-v6.yaml
- Use your own GCP project ID
  Set this to your own GCP project ID, or comment out this line and invoke gcluster with the --vars project_id="${PROJECT_ID}" flag instead. The value shown here is the ParaTools development project and will not work for you.
- Image family
  This must match the image family of the ParaTools Pro for E4S™ marketplace image you have subscribed to. The current GCluster (x86-64) image family is paratools-gcluster-e4s-2511-nvidia89-x86-64.
- Default VPC firewall behavior
  The modules/network/vpc module with no settings: block creates a VPC with two firewall rules by default: one allowing SSH from Identity-Aware Proxy (IAP) (range 35.235.240.0/20) and one allowing all intra-VPC traffic. The SSH button in the GCP Console for the login node uses IAP, so it works with this default. If you want to SSH directly to the login node's public IP from your workstation, you must add a firewall rule that allows your workstation's IP -- see Allowing Direct SSH from Your Workstation.
- Node count and instance type
  The machine_type and node_count_dynamic_max settings set the instance type and the maximum number of nodes Slurm can spin up dynamically in this nodeset. Tune these to match your usage and quotas (vCPUs = cores-per-node x node_count_dynamic_max). The same settings also appear in the compute_nodeset (line 74) and h3_nodeset (line 94) blocks. For compute-intensive, tightly coupled jobs, C3 or H3 instances perform well.
- GPU on the debug nodeset
  The debug_nodeset attaches one NVIDIA Tesla T4 GPU per node by default, providing a low-cost path to test CUDA, NeMo, or PyTorch GPU workloads before scaling up. Set count: 0 and remove the guest_accelerator block, or change to a CPU-only instance type, if you do not need a GPU on the debug partition.
- bandwidth_tier: gVNIC vs. Tier 1
  The bandwidth_tier setting controls the network adapter and the per-VM egress bandwidth ceiling. Valid values are platform_default, virtio_enabled, gvnic_enabled (the gVNIC adapter without Tier 1), and tier_1_enabled (gVNIC adapter plus per-VM Tier 1 high-bandwidth networking). The compute_nodeset uses tier_1_enabled because c3-standard-88 is a C3 shape with 88 vCPUs, which clears the Tier 1 minimum (44 vCPUs for C3) and unlocks the full bandwidth ceiling for tightly coupled MPI and collective-heavy workloads. The h3_nodeset (line 99) uses gvnic_enabled because the H3 family is not in the Tier 1 supported list (despite its 88 vCPUs); H3 is limited to gVNIC speeds. If you change the compute_nodeset machine_type to a shape that does not support Tier 1 (for example any N1, E2, or H3 shape, or a supported family below its vCPU minimum), drop tier_1_enabled back to gvnic_enabled to avoid a deployment error.
- Optional H3 partition
  This example includes an additional Slurm partition containing H3 nodes. Access to H3 instances may require a quota-increase request. You do not need multiple Slurm partitions, so you may remove the h3_nodeset and h3_partition modules (and the - h3_partition reference in slurm_controller.use) if you do not have H3 access.
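As a rough aid for the quota check mentioned in the node count annotation, the shell sketch below multiplies the vCPUs per node by node_count_dynamic_max. The function name nodeset_vcpus and the node count of 4 are illustrative assumptions, not values from the blueprint; 88 is the vCPU count of c3-standard-88.

```shell
# Peak vCPU demand of one nodeset:
#   vCPUs per node x node_count_dynamic_max
# $1: vCPUs per node, $2: node_count_dynamic_max
nodeset_vcpus() {
  echo $(( $1 * $2 ))
}

# c3-standard-88 provides 88 vCPUs per node. The node count of 4 is a
# hypothetical value chosen for illustration only.
echo "compute_nodeset peak vCPUs: $(nodeset_vcpus 88 4)"
# prints: compute_nodeset peak vCPUs: 352
```

Compare the result for each nodeset against the matching per-region vCPU quota of your project before deploying.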
Allowing Direct SSH from Your Workstation¶
The default firewall rules created by modules/network/vpc permit SSH
from Identity-Aware Proxy (IAP) only. The SSH button in the GCP Console (which
uses IAP) works without any extra configuration. If you want to SSH directly from your
workstation to the login node's public IP -- for example, to use scp for large file
transfers, or because you prefer a local terminal over the browser-based IAP SSH session
-- you must allow your workstation's IP address through the firewall.
There are two equivalent options:
Option 1: Add the Rule to the Blueprint¶
Edit the network module in your blueprint and add a settings: block listing the
firewall rule. Replace 203.0.113.42/32 with your workstation's public IP followed by
/32 (find the IP with curl -s ifconfig.me). To allow a whole home or office network
instead, use the commented-out ranges line and replace 203.0.113.0/24 with a CIDR
block covering that network:

      - id: network
        source: modules/network/vpc
        settings:
          firewall_rules:
          - name: ssh-from-workstation
            direction: INGRESS
            ranges: [203.0.113.42/32]  # single IP -- replace with your workstation's IP
            # ranges: [203.0.113.0/24]  # or a CIDR block covering your network
            allow:
            - protocol: tcp
              ports: [22]

Then re-run gcluster create -w ... and gcluster deploy ... to apply the change.
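The two commands can be wrapped in a small helper so that they can be previewed before anything is changed. This is a minimal sketch, not part of the Cluster Toolkit: the run and redeploy functions and the DRY_RUN variable are hypothetical helpers, and the deployment folder name is an assumption (gcluster typically names the folder after the blueprint's deployment name).

```shell
# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Regenerate the deployment folder from the edited blueprint (-w overwrites
# the existing folder), then apply the result.
# $1: blueprint file, $2: deployment folder
redeploy() {
  run gcluster create -w "$1" && run gcluster deploy "$2"
}

# Preview first. Both arguments are assumed names; substitute your own:
#   DRY_RUN=1 redeploy e4s-25.11-cluster-slurm-gcp-v6.yaml ppro-e4s-25-11-cluster
```

Running the same redeploy call without DRY_RUN executes both steps and stops if gcluster create fails.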
Option 2: Add the Rule Out of Band¶
After the cluster is deployed, add the firewall rule with gcloud directly:

    # Replace YOUR_IP with your workstation's public IP, or a CIDR covering your network.
    # Replace VPC_NAME with the name of the VPC created by your deployment
    # (typically "${deployment_name}-net0", e.g., "ppro-e4s-25-11-cluster-net0").
    gcloud compute firewall-rules create ssh-from-workstation \
      --network=VPC_NAME \
      --direction=INGRESS \
      --action=ALLOW \
      --rules=tcp:22 \
      --source-ranges=YOUR_IP/32

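To avoid typing the address by hand, the value for --source-ranges can be built from a detected IP. The sketch below checks that a string is a dotted-quad IPv4 address and appends /32; the helper names is_ipv4 and to_host_cidr are hypothetical, and the lookup through ifconfig.me assumes that service returns your IPv4 address.

```shell
# Succeed only for a dotted-quad IPv4 address with every octet in 0-255.
is_ipv4() {
  echo "$1" | awk -F. '
    NF != 4 { exit 1 }
    { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i > 255) exit 1 }'
}

# Append /32 so the firewall rule matches exactly one address.
to_host_cidr() {
  if is_ipv4 "$1"; then
    echo "$1/32"
  else
    echo "not an IPv4 address: $1" >&2
    return 1
  fi
}

# Usage (needs network access; -4 asks curl to connect over IPv4):
#   YOUR_CIDR="$(to_host_cidr "$(curl -4 -s ifconfig.me)")" || exit 1
#   ...then pass --source-ranges="$YOUR_CIDR" to the gcloud command above.
```

The validation step matters because a lookup service may return an IPv6 address or an error page, neither of which should reach the firewall rule.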
Do not use 0.0.0.0/0
Opening TCP port 22 to the entire internet is a serious security risk. Always restrict
--source-ranges (or ranges:) to a single IP (/32) or a small CIDR block under
your control. Prefer Console (IAP) SSH whenever possible.
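One way to enforce the warning above in a deployment script is to check the range before it is passed to gcloud or written into the blueprint. This is a sketch under stated assumptions: check_source_range and MIN_PREFIX are hypothetical names, and the default threshold of /24 is only one possible reading of "a small CIDR block". The function inspects the prefix length only and does not validate the address part.

```shell
# Reject 0.0.0.0/0 and any range broader than MIN_PREFIX (default /24).
# $1: CIDR range such as 203.0.113.42/32
check_source_range() {
  case "$1" in
    */*) ;;
    *) echo "missing prefix length: $1" >&2; return 1 ;;
  esac
  prefix="${1##*/}"   # text after the last slash
  case "$prefix" in
    ''|*[!0-9]*) echo "invalid prefix length: $1" >&2; return 1 ;;
  esac
  if [ "$prefix" -gt 32 ]; then
    echo "invalid prefix length: $1" >&2
    return 1
  fi
  if [ "$prefix" -lt "${MIN_PREFIX:-24}" ]; then
    echo "range too broad for SSH: $1" >&2
    return 1
  fi
}

# Usage: abort before creating the rule if the range is too wide.
#   check_source_range "$YOUR_CIDR" || exit 1
```

Because 0.0.0.0/0 has a prefix length of 0, it is rejected by the same comparison that rejects any other overly broad range.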