SLURM User and Administrator Guide for Cray systems

NOTE: As of January 2009, the SLURM interface to Cray systems is incomplete.

User Guide

This document describes the unique features of SLURM on Cray computers. You should be familiar with SLURM's mode of operation on Linux clusters before studying the relatively few differences in Cray system operation described in this document.

SLURM's primary mode of operation is designed for use on clusters with nodes configured in a one-dimensional space. Minor changes were required for the smap and sview tools to map nodes in a three-dimensional space. Some changes are also desirable to optimize job placement in three-dimensional space.

SLURM has added an interface to Cray's Application Level Placement Scheduler (ALPS). The ALPS aprun command must be used for task launch rather than SLURM's srun command. You should create a resource allocation using SLURM's salloc or sbatch command and execute aprun from within that allocation.

Administrator Guide

Cray/ALPS configuration

Node names must have a three-digit suffix describing their zero-origin position in the X-, Y-, and Z-dimensions, respectively (e.g. "tux000" for X=0, Y=0, Z=0; "tux123" for X=1, Y=2, Z=3). Rectangular prisms of nodes can be specified in SLURM commands and configuration files using the system name prefix with the end-points enclosed in square brackets and separated by an "x". For example, "tux[620x731]" represents the eight nodes in a block with endpoints at "tux620" and "tux731" (tux620, tux621, tux630, tux631, tux720, tux721, tux730, tux731). NOTE: We anticipate that Cray will provide node coordinate information via the ALPS interface in the future, which may result in a more flexible node naming convention.
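For illustration, the rectangular-prism expression above can be expanded mechanically: each of the three digits in the two endpoints bounds one dimension of the block. The following sketch (expand_prism is a hypothetical helper for this document, not part of SLURM) shows the expansion:

```python
# Illustrative sketch of the rectangular-prism node naming convention.
# "expand_prism" is a hypothetical helper, not part of SLURM itself.

def expand_prism(expr):
    """Expand an expression such as "tux[620x731]" into its node names."""
    prefix, _, rest = expr.partition("[")
    start, end = rest.rstrip("]").split("x")
    lo = [int(c) for c in start]   # (X, Y, Z) of one corner
    hi = [int(c) for c in end]     # (X, Y, Z) of the opposite corner
    return [
        "%s%d%d%d" % (prefix, x, y, z)
        for x in range(lo[0], hi[0] + 1)
        for y in range(lo[1], hi[1] + 1)
        for z in range(lo[2], hi[2] + 1)
    ]

print(expand_prism("tux[620x731]"))
# tux620, tux621, tux630, tux631, tux720, tux721, tux730, tux731
```

This yields exactly the eight nodes listed in the example above.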

In ALPS, configure each node to be scheduled using SLURM as type BATCH.

SLURM configuration

Four variables must be defined in the config.h file: APBASIL_LOC (location of the apbasil command), HAVE_FRONT_END, HAVE_CRAY_XT and HAVE_3D. The apbasil command should be found automatically. If that is not the case, please notify us of its location on your system and we will add that to the search paths tested at configure time. The other variables can be defined in any of several ways, depending upon how SLURM is being built:

  1. Execute the configure command with the option --enable-cray-xt OR
  2. Execute the rpmbuild command with the option --with cray_xt OR
  3. Add %with_cray_xt 1 to your ~/.rpmmacros file.

One slurmd will be used to run all of the batch jobs on the system. It is from here that users will execute aprun commands to launch tasks. This is specified in the slurm.conf file by using the NodeName field to identify the compute nodes and both the NodeAddr and NodeHostname fields to identify the computer where slurmd runs (normally some sort of front-end node), as seen in the examples below.

Next you need to select from two options for the resource selection plugin (the SelectType option in SLURM's slurm.conf configuration file):

  1. select/cons_res - Performs a best-fit algorithm based upon a one-dimensional space to allocate whole nodes, sockets, or cores to jobs based upon other configuration parameters.
  2. select/linear - Performs a best-fit algorithm based upon a one-dimensional space to allocate whole nodes to jobs.

In order for select/cons_res or select/linear to allocate resources that are physically nearby in three-dimensional space, the nodes must be specified in SLURM's slurm.conf configuration file in such a fashion that those nearby in slurm.conf (one-dimensional space) are also nearby in the physical three-dimensional space. If the definition of the nodes in slurm.conf appears on a single line (e.g. NodeName=tux[000x333]), SLURM will automatically perform that conversion using a Hilbert curve. Otherwise you may construct your own node name ordering and list the nodes one per line in slurm.conf.

Note that each node must be listed exactly once and consecutive nodes should be nearby in three-dimensional space. Also note that each node must be defined individually rather than using a hostlist expression in order to preserve the ordering (there is no problem using a hostlist expression in the partition specification after the nodes have already been defined). The open source code used by SLURM to generate the Hilbert curve is included in the distribution at contribs/skilling.c in the event that you wish to experiment with it to generate your own node ordering. Two examples of SLURM configuration files are shown below:

# slurm.conf for Cray XT system of size 4x4x4
# Parameters removed here
SelectType=select/linear
NodeName=DEFAULT Procs=8 RealMemory=2048 State=Unknown
NodeName=tux[000x333] NodeAddr=front_end NodeHostname=front_end
PartitionName=debug Nodes=tux[000x333] Default=Yes State=UP
# slurm.conf for Cray XT system of size 4x4x4
# Parameters removed here
SelectType=select/linear
NodeName=DEFAULT Procs=8 RealMemory=2048 State=Unknown
NodeName=tux000 NodeAddr=front_end NodeHostname=front_end
NodeName=tux100 NodeAddr=front_end NodeHostname=front_end
NodeName=tux110 NodeAddr=front_end NodeHostname=front_end
NodeName=tux010 NodeAddr=front_end NodeHostname=front_end
NodeName=tux011 NodeAddr=front_end NodeHostname=front_end
NodeName=tux111 NodeAddr=front_end NodeHostname=front_end
NodeName=tux101 NodeAddr=front_end NodeHostname=front_end
NodeName=tux001 NodeAddr=front_end NodeHostname=front_end
PartitionName=debug Nodes=tux[000x111] Default=Yes State=UP

In both of the examples above, the node names output by the scontrol show nodes command will be ordered as defined (sequentially along the Hilbert curve or per the ordering in the slurm.conf file) rather than in numeric order (e.g. "tux001" follows "tux101" rather than "tux000"). For optimal performance, SLURM partitions should contain nodes which are defined sequentially by that ordering.
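As an aside, the hand-written ordering in the second example happens to follow a reflected Gray code over the 2x2x2 grid, so each node differs from its predecessor in exactly one coordinate. The sketch below (not part of SLURM; the full Hilbert curve code lives in contribs/skilling.c) reproduces that ordering:

```python
# Generate a node ordering for a 2x2x2 grid in which consecutive nodes
# are physical neighbors, using a reflected Gray code. This reproduces
# the hand-written node list in the second slurm.conf example above.
# (SLURM itself uses the Hilbert curve code in contribs/skilling.c.)

def gray_order_2x2x2(prefix="tux"):
    names = []
    for i in range(8):
        g = i ^ (i >> 1)      # reflected Gray code of i
        x = g & 1             # low bit -> X coordinate
        y = (g >> 1) & 1      # middle bit -> Y coordinate
        z = (g >> 2) & 1      # high bit -> Z coordinate
        names.append("%s%d%d%d" % (prefix, x, y, z))
    return names

print(gray_order_2x2x2())
# tux000, tux100, tux110, tux010, tux011, tux111, tux101, tux001
```

Because consecutive names differ in only one coordinate, a contiguous run of nodes taken from this list is always compact in the physical three-dimensional space, which is the property the resource selection plugins exploit.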

Last modified 9 January 2009

Lawrence Livermore National Laboratory
7000 East Avenue • Livermore, CA 94550
Operated by Lawrence Livermore National Security, LLC, for the Department of Energy's
National Nuclear Security Administration