
SLURM Scheduler Cluster Blueprint for GCP

General Info

Below is an example Google HPC-Toolkit blueprint for using ParaTools Pro for E4S™. Once you have access to ParaTools Pro for E4S™ through the GCP Marketplace, and if you are new to GCP and/or HPC-Toolkit, we recommend following the "quickstart tutorial" from the Google HPC-Toolkit project to get started. The ParaTools Pro for E4S™ blueprint provided below can be copied, with some small modifications, and used for the tutorial or in production.

Areas of the blueprint that require your attention and may need to be changed for your deployment are highlighted and have expandable annotations offering further guidance.
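
For orientation, a typical HPC-Toolkit workflow with this blueprint looks roughly like the sketch below. It assumes the blueprint is saved under the filename shown in the next section and that ghpc is on your PATH; whether you finish with ghpc deploy or by running Terraform inside the generated deployment folder depends on your HPC-Toolkit version, so treat this as a sketch rather than the canonical procedure.

# Sketch of a typical HPC-Toolkit workflow; adjust names and flags to your setup.
PROJECT_ID=my-gcp-project  # assumption: replace with your GCP project ID

# Expand the blueprint into a deployment folder, overriding project_id (see annotation 1).
ghpc create e4s-23.11-cluster-slurm-gcp-5-9-hpc-rocky-linux-8.yaml --vars project_id="${PROJECT_ID}"

# Newer HPC-Toolkit releases can deploy the generated folder directly; older
# releases run terraform init/apply inside <deployment>/primary instead.
ghpc deploy ppro-4-e4s-23-11-cluster-slurm-rocky8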

ParaTools Pro for E4S™ Slurm Cluster Blueprint Example

e4s-23.11-cluster-slurm-gcp-5-9-hpc-rocky-linux-8.yaml
blueprint_name: ppro-4-e4s-23-11-cluster-slurm-gcp-5-9-hpc-rocky-linux-8

vars:
#  project_id:  "paratools-pro" #(1)!
  deployment_name: ppro-4-e4s-23-11-cluster-slurm-rocky8
  image_family: ppro-4-e4s-23-11-slurm-gcp-5-9-hpc-rocky-linux-8 #(2)!
  region: us-central1
  zone: us-central1-a
  disk_size_gb: 130

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc
    settings:
      firewall_rules:
        - name: ssh-login
          direction: INGRESS
          ranges: [0.0.0.0/0] #(3)!
          allow:
            - protocol: tcp
              ports: [22]

  - id: homefs
    source: modules/file-system/filestore
    use: [network1]
    settings:
      local_mount: /home

  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 2 #(4)!
      machine_type: c3-standard-88
      disk_type: pd-balanced
      enable_smt: false
      disk_size_gb: $(vars.disk_size_gb)
      instance_image:
        family: $(vars.image_family)
        project: $(vars.project_id)
      instance_image_custom: true
      bandwidth_tier: gvnic_enabled #(6)!

  - id: compute_partition
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    - compute_node_group
    settings:
      partition_name: compute
      is_default: true
      enable_placement: true

  - id: h3_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    settings:
      node_count_dynamic_max: 2
      machine_type: h3-standard-88
      disk_type: pd-balanced
      instance_image:
        family: $(vars.image_family)
        project: $(vars.project_id)
      instance_image_custom: true
      bandwidth_tier: gvnic_enabled

  - id: h3_partition #(5)!
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use:
    - network1
    - homefs
    - h3_node_group
    settings:
      partition_name: h3

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    use:
    - network1
    - compute_partition
    - h3_partition
    - homefs
    settings:
      disable_controller_public_ips: false
      disk_size_gb: $(vars.disk_size_gb)
      instance_image:
        family: $(vars.image_family)
        project: $(vars.project_id)
      instance_image_custom: true

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    use:
    - network1
    - slurm_controller
    settings:
      machine_type: n2-standard-4
      disable_login_public_ips: false
      disk_size_gb: $(vars.disk_size_gb)
      instance_image:
        family: $(vars.image_family)
        project: $(vars.project_id)
      instance_image_custom: true
  1. Warning

    Either uncomment this line and set it to your GCP project ID, or invoke ghpc with the --vars project_id="${PROJECT_ID}" flag, as in the workflow sketch near the top of this page.
  2. Info

    Ensure that this matches the image family of the ParaTools Pro for E4S™ image from the GCP Marketplace; see the image-family check after this list.
  3. Danger

    0.0.0.0/0 exposes TCP port 22 to the entire world. This is fine for testing ephemeral clusters, but for persistent clusters you should limit SSH traffic to your organization's IP range or to a hardened bastion host.
  4. Info

    The machine_type and node_count_dynamic_max should be set to reflect the instance type and number of nodes you would like to use; these nodes are spun up dynamically. You must ensure that you have sufficient quota to run with the number of vCPUs = (vCPUs per node) * (node_count_dynamic_max); a quota-check sketch follows this list. For compute-intensive, tightly coupled jobs, C3 or H3 instances have shown good performance.
  5. Info

    This example includes an additional SLURM partition containing H3 nodes. At the time of this writing, access to H3 instances was limited, and you may need to request access via a quota increase request. You do not need multiple SLURM partitions and may consider removing this one; a job-submission sketch showing how to target a specific partition follows this list.
  6. Info

    To access the full high-speed per-VM Tier_1 networking capabilities on supported instance types, gVNIC must be enabled (bandwidth_tier: gvnic_enabled).
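
To double-check the image family referenced in annotation 2, you can ask GCP which image the family currently resolves to. The sketch below assumes the image is visible from your own project after subscribing via the Marketplace; if it is hosted in a different project, adjust --project accordingly.

# Show the newest image in the ParaTools Pro for E4S™ family (hosting project is an assumption).
gcloud compute images describe-from-family ppro-4-e4s-23-11-slurm-gcp-5-9-hpc-rocky-linux-8 --project "${PROJECT_ID}"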
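
For the quota check mentioned in annotation 4, regional vCPU quotas can be inspected from the CLI as sketched below; the per-family metric names (e.g. C3_CPUS, H3_CPUS) are assumptions, so confirm the exact names in the Cloud Console quota page.

# List all quotas for the region used in this blueprint and look for the
# per-family vCPU metrics as well as the aggregate CPUS metric.
gcloud compute regions describe us-central1 --project "${PROJECT_ID}" --format="yaml(quotas)"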
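
Once the deployment finishes and you have connected to the login node, jobs are submitted with the usual Slurm commands; the partition names below come straight from this blueprint.

# Two-node sanity check on the default "compute" partition.
srun -N 2 -p compute hostname
# Target the optional H3 partition explicitly (only if you kept it).
srun -N 2 -p h3 hostname
# Show partitions and node states.
sinfo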