Microsoft High Performance Computing (HPC) Pack 2019 Service Resilience

A quick post to help make the services on the Microsoft High Performance Computing (HPC) 2019 Head Nodes more resilient by setting their recovery options.

Run the following PowerShell code on each of your Head Nodes which helps avoid failures for the users when starting the HPC Job Manager, or Admins when starting HPC Cluster Manager.

$HPCServices = Get-Service | where-object {$_.DisplayName  -like '*HPC *'}
ForEach ($HPCService in $HPCServices) {
  $ServiceDisplayName = $HPCService.DisplayName
  $ServiceName = $HPCService.Name
  write-verbose "Setting the `"$ServiceDisplayName`" service recovery options to:" -verbose
  write-verbose "- Restart the service 1 minute (60000 ms) after the first failure" -verbose
  write-verbose "- Restart the service 1 minute (60000 ms) after the second failure" -verbose
  write-verbose "- Restart the service 3 minute (180000 ms) after subsequent failures" -verbose
  write-verbose "- Reset the failure count every 1 day (86400 seconds)" -verbose
  Invoke-Command {cmd /c sc failure "$ServiceName" reset= 86400 actions= restart/60000/restart/60000/restart/180000} | out-null
}

HPC Pack 2019 logo

Read more