Ubuntu Server 12.04 + SLURM 2.5.7 fat nodes

2013-06-18
#ubuntu #server #slurm #fatnode

SLURM (Simple Linux Utility for Resource Management) is a job scheduler and resource manager typically installed on supercomputers. For example, it runs on the Lomonosov supercomputer at MSU in Moscow, Russia.

Usually one physical or virtual computer (a physical node) is one logical node in SLURM. If one physical node serves more than one logical node, it is called a «fat node». Fat nodes are needed when a physical node has a lot of memory or several GPUs; sometimes they are simply convenient.

The system is Ubuntu 12.04 Server x64. We need an additional configure option, so SLURM will be built from source.

§ Prerequisites

Munge for node authentication and build-essential for building from source are needed:

sudo apt-get install -y libmunge-dev munge build-essential

§ Building SLURM

Download, unpack, and cd into the SLURM source directory:

wget http://www.schedmd.com/download/latest/slurm-2.5.7.tar.bz2
tar xvf slurm-2.5.7.tar.bz2
cd slurm-2.5.7/

Configure SLURM with the option that allows running several slurmd daemons on one host (this is what makes fat nodes possible), then build and install it:

./configure --enable-multiple-slurmd
make
sudo make install
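
make install puts the libraries into /usr/local/lib, so if the daemons later fail to find libslurm, refreshing the linker cache usually helps. A quick sanity check that the binaries are in place (-V prints the version):

sudo ldconfig
slurmctld -V    # should print: slurm 2.5.7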

§ Configuring system

We need to add a slurm user and add it to the group of the same name (adduser will interactively ask for a password and account details; the defaults are fine):

sudo adduser slurm
sudo adduser slurm slurm
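
A quick way to verify that the user and group exist:

id slurm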

Create a munge key and start the munge daemon:

sudo /usr/sbin/create-munge-key
sudo service munge start
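
To check that munge actually works, encode and decode a credential locally; the unmunge output should end with STATUS: Success (0):

munge -n | unmunge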

§ Configuring SLURM

Now create the configuration file /usr/local/etc/slurm.conf:

ClusterName=ubuntu #< any name for your cluster
ControlMachine=ubuntu #< change this to your hostname

SlurmUser=slurm
SlurmctldPort=6817
AuthType=auth/munge

StateSaveLocation=/tmp
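# %n below is substituted with the node name, so each logical node
# gets its own spool directory and PID file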
SlurmdSpoolDir=/tmp/slurmd%n/
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd%n.pid
ProctrackType=proctrack/pgid
CacheGroups=0
ReturnToService=0

# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/linear
FastSchedule=1

# LOGGING
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/none

# COMPUTE NODES
# control node
NodeName=ubuntu NodeAddr=127.0.0.1 Port=17000 State=UNKNOWN

# all logical nodes live on the same physical node, so each needs its own port
# the node-[*] names are arbitrary
NodeName=node-0 NodeAddr=127.0.0.1 Port=17001 State=UNKNOWN
NodeName=node-1 NodeAddr=127.0.0.1 Port=17002 State=UNKNOWN
NodeName=node-2 NodeAddr=127.0.0.1 Port=17003 State=UNKNOWN
NodeName=node-3 NodeAddr=127.0.0.1 Port=17004 State=UNKNOWN
NodeName=node-4 NodeAddr=127.0.0.1 Port=17005 State=UNKNOWN

# PARTITIONS
# partition name is arbitrary
PartitionName=cpu Nodes=node-[0-4] Default=YES MaxTime=INFINITE State=UP

§ Starting SLURM

Start the SLURM control daemon (the -c flag makes it start with clean state, ignoring any previously saved state):

sudo slurmctld -c

Start a SLURM daemon for each logical node; the -N flag sets the logical node name (it is available because we built with --enable-multiple-slurmd):

sudo slurmd -c -N node-0
sudo slurmd -c -N node-1
sudo slurmd -c -N node-2
sudo slurmd -c -N node-3
sudo slurmd -c -N node-4
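
These five commands differ only in the node name, so an equivalent shell loop is:

for i in 0 1 2 3 4; do
    sudo slurmd -c -N node-$i
done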

Check if everything is alright:

sinfo

You should see that the nodes are ready for work:

PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up   infinite      5   idle node-[0-4]
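
To make sure jobs actually run, submit a trivial one. srun -N5 requests all five logical nodes; since they all share one physical machine, hostname should print the same name five times:

srun -N5 hostname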