All users must submit jobs to reserve resources and run their code on the cluster. There are two kinds of jobs: batch jobs and interactive jobs.

1) Batch Jobs:

Jobs are not run directly from the command line. Instead, the user creates a job script that specifies the resources, the libraries and the application to be run.

The script is submitted to SLURM (the queueing system). If the requested resources are available on the system, the job runs; if not, it is placed in a queue until the resources become available.

Users need to know how to use the queueing system, how to create a job script, and how to check a job's progress or delete a job from the queue.

Job Scripts:

You can find sample job scripts in the /kuacc/jobscripts folder. Copy one of these scripts into your home directory (/kuacc/users/username/) and modify it according to your needs.
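For example, a template can be copied and edited as follows (the file name example_job.sh is illustrative; use whichever script in /kuacc/jobscripts fits your workload):

cp /kuacc/jobscripts/example_job.sh /kuacc/users/username/myjob.sh
nano /kuacc/users/username/myjob.sh       # edit the requested resources, modules and commands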

This is an example job script for the KUACC HPC cluster. Note that a job script should start with #!/bin/bash.

#!/bin/bash

#SBATCH --job-name=Test        	
#SBATCH --nodes=1		
#SBATCH --ntasks-per-node=1	
#SBATCH --partition=short		
#SBATCH --qos=users		
#SBATCH --account=users	
#SBATCH --gres=gpu:tesla_t4:1	
#SBATCH --time=1:0:0		
#SBATCH --output=test-%j.out	
#SBATCH --mail-type=ALL
#SBATCH --mail-user=foo@bar.com 	

module load python/3.6.1
module load cuda/11.4
module load cudnn/8.2.2/cuda-11.4

python code.py

 

A job script can be divided into three sections:

  • Requesting resources
  • Loading library and application modules
  • Running your codes

Requesting Resources:

This section is where resources are requested and SLURM parameters are set. Each request is a separate line that starts with "#SBATCH" followed by a flag.

#SBATCH <flag>

#SBATCH --job-name=Test                     #Setting a job name
#SBATCH --nodes=1                           #Asking for only one node
#SBATCH --ntasks-per-node=1                 #Asking for one core on each node
#SBATCH --partition=short                   #Running on the short queue (max 2 hours)
#SBATCH --qos=users                         #Running with the users qos (rules and limits)
#SBATCH --account=users                     #Running on the users account (group of nodes)
#SBATCH --gres=gpu:tesla_t4:1               #Asking for one tesla_t4 GPU
#SBATCH --time=1:0:0                        #Reserving a one-hour time limit
#SBATCH --output=test-%j.out                #Setting the output file name
#SBATCH --mail-type=ALL                     #Which events trigger emails (BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=foo@bar.com             #Where to send emails

 

Note that a line can be disabled by adding a second # at the beginning. In the same way, you can add your own comments on lines starting with #.
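For example (the flag values are illustrative):

# This line is a comment and is ignored
##SBATCH --gres=gpu:tesla_t4:1       # this request is disabled by the second #
#SBATCH --time=1:0:0                 # this request is active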

 

Note that the KUACC HPC partitions are listed below. You can see the active partitions with the sinfo command.

Name     MaxTimeLimit  Nodes     MaxJobs   MaxSubmitJob
short    2 hours       50 nodes  50        300
mid      1 day         45 nodes  35        200
long     7 days        5 nodes   25        100
longer   30 days       3 nodes   5         50
ai       7 days        16 nodes  8         100
ilac     Infinite      12 nodes  Infinite  Infinite
cosmos   Infinite      8 nodes   Infinite  Infinite
biyofiz  Infinite      4 nodes   Infinite  Infinite
cosbi    Infinite      1 node    Infinite  Infinite
kutem    Infinite      1 node    Infinite  Infinite
iui      Infinite      1 node    Infinite  Infinite
hamsi    Infinite      1 node    Infinite  Infinite
lufer    Infinite      1 node    Infinite  Infinite
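To check the partitions on the command line, sinfo can be used, for example (the output format string is optional; the exact columns depend on your sinfo version):

sinfo                        # list partitions, their state and their nodes
sinfo -p short               # show only the short partition
sinfo -o "%P %l %D %a"       # partition name, time limit, node count, availability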

 

Note that the following flags can be used in your job scripts.

  • partition (--partition=short): Partition is a queue for jobs. Default on kuacc is short.
  • qos (--qos=users): QOS is a quality of service value (limits or priority boost). Default on kuacc is users.
  • time (--time=01:00:00): Time limit for the job (here 1 hour). Default is 2 hours.
  • nodes (--nodes=1): Number of compute nodes for the job. Default is 1.
  • cpus/cores (--ntasks-per-node=4): Number of cores on each compute node. Default is 1.
  • resource feature (--gres=gpu:1): Request use of GPUs on compute nodes. Default is no feature.
  • memory (--mem=4096): Memory limit per compute node for the job. Do not use together with the mem-per-cpu flag. Default limit is 4096 MB per core.
  • memory (--mem-per-cpu=14000): Per-core memory limit. Do not use together with the mem flag. Default limit is 4096 MB per core.
  • account (--account=users): Users may belong to groups or accounts. Default is the user's primary group.
  • job name (--job-name="hello_test"): Name of the job. Default is the JobID.
  • constraint (--constraint=gpu): Restrict the job to nodes with a given feature (see the AVAIL_FEATURES of the kuacc nodes).
  • output file (--output=test.out): Name of the file for stdout. Default is based on the JobID.
  • email address (--mail-user=username@ku.edu.tr): User's email address. Required.
  • email notification (--mail-type=ALL or --mail-type=END): When email is sent to the user. Omit for no email.

Note on the --mem and --mem-per-cpu flags:

--mem-per-cpu: memory per core. If N cores are reserved, N × mem-per-cpu is reserved in total. Default units are megabytes.

#SBATCH --ntasks=5
#SBATCH --mem-per-cpu=20000

In total, 5 × 20000 = 100000 MB is reserved. For GB requests, append G (e.g., 20G).

--mem: total memory requested per node. If you request more than one node (N), N × mem is reserved. Default units are megabytes.
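For comparison, a sketch of a per-node memory request with illustrative values:

#SBATCH --nodes=2
#SBATCH --mem=10G

In total, 2 × 10G = 20G is reserved, 10G on each of the two nodes.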

Loading library and application modules:

Users need to load the application and library modules required by their code, as in the sample job script:

module load python/3.6.1
module load cuda/11.4
module load cudnn/8.2.2/cuda-11.4

For more information see the installing software modules page.
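To see which modules are available and which ones are currently loaded, the standard module commands can be used (the name used for filtering below is just the one from the example above):

module avail             # list all available modules
module avail cuda        # list modules whose names match "cuda"
module list              # show the modules loaded in the current session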

Running Code:

In this section of the job script, users run their code.

python code.py

 

After preparing the job script, submit it with the sbatch command.

sbatch jobscript.sh
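If the submission is accepted, sbatch prints the ID of the new job; you can use this ID later with squeue and scancel (the job ID below is illustrative):

$ sbatch jobscript.sh
Submitted batch job 123456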

 

Command           Syntax              Description                                Example
sbatch            sbatch [script]     Submit a batch job                         $ sbatch job.sub
scancel           scancel [job_id]    Kill a running job or cancel a queued one  $ scancel 123456
squeue            squeue              List running or pending jobs               $ squeue
squeue -u userid  squeue -u [userid]  List running or pending jobs of a user     $ squeue -u john
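Beyond the commands in the table, a few other standard SLURM commands are useful for managing jobs (a sketch; the job ID and username are illustrative):

scancel -u username          # cancel all of your own queued and running jobs
squeue -j 123456             # show the status of a single job
scontrol show job 123456     # show detailed information about a job (requested resources, nodes, state)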

2) Interactive Jobs:

Batch jobs are submitted to the SLURM queueing system and run when the requested resources become available. However, they cannot be used to test and troubleshoot code in real time. Interactive jobs allow users to interact with applications in real time: users can run graphical user interface (GUI) applications, execute scripts, or run other commands directly on a compute node.

Using srun command:

srun submits your resource request to the queue. When the resources are available, a new bash session starts on the reserved compute node. The same SLURM flags are used with the srun command.

Example:

srun -N 1 -n 4 -A users -p short --qos=users --gres=gpu:1 --mem=64G --time 1:00:00 --constraint=tesla_v100 --pty bash

With this command, SLURM reserves 1 node, 4 cores, 64 GB RAM and 1 GPU for one hour in the short queue; the constraint flag limits the GPU type to tesla_v100. SLURM then opens a terminal on the compute node. If that terminal is closed, the job is killed and removed from the queue.
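Once the interactive shell opens on the compute node, you can quickly check what has been allocated (a sketch; nvidia-smi is only meaningful if a GPU was requested):

hostname                 # confirm that you are on a compute node, not the login node
nvidia-smi               # check that the requested GPU is visible
echo $SLURM_JOB_ID       # print the ID of the interactive job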

Using salloc command:

salloc works similarly to srun --pty bash. It submits your resource request to the queue. When the resources are available, it opens a shell on the login node, but you are then allowed to ssh into the reserved compute node.

Example (same flags as the srun example):

salloc -N 1 -n 4 -A users -p short --qos=users --gres=gpu:1 --mem=64G --time 1:00:00 --constraint=tesla_v100

When the resources are granted, you need to find out which node has been reserved:

squeue -u username    or    kuacc-queue | grep username

 

ssh username@computenode_name

⚠️⚠️⚠️  Interactive jobs die when you disconnect from the login node, whether by choice or because of internet connection problems. To keep a job alive, you can use a terminal multiplexer like tmux, emacs, or screen.
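For example, with tmux the workflow could look like the sketch below (the session name "interactive" is arbitrary):

tmux new -s interactive                               # on the login node: start a named tmux session
srun -N 1 -n 1 -p short --time 1:00:00 --pty bash     # request the interactive job from inside tmux
# detach with Ctrl-b d; the session and the job keep running if your connection drops
tmux attach -t interactive                            # after reconnecting to the login node, re-attach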

How to run a GUI by using interactive jobs?

If you want to run graphical user interface (GUI) applications (Matlab, Mathematica, RStudio, Ansys), you need to connect to the login nodes with X11 forwarding and submit an interactive job with the --x11 parameter.

Procedure for GUI:

-Connect to the cluster with X11 forwarding enabled.

ssh login.kuacc.ku.edu.tr -Xl username

⚠️⚠️⚠️  If you are using a Mac, you may need to use -Y instead of -X, depending on your X11 forwarding settings. You also need to install XQuartz (https://www.xquartz.org/).

-Submit an interactive job with the --x11 parameter and reserve a compute node.

srun --x11 -N 1 -n 1 -p short --time=1:00:00 --pty bash

or

salloc --x11 -N 1 -n 1 -p short --time=1:00:00

then,

squeue -u username
ssh computenodename -Xl username

-Then, you can load the application module and start the GUI on the compute node.
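For example, for Matlab the last step might look like the sketch below (the module name and version are assumptions; check module avail first):

module avail matlab          # find the Matlab modules installed on the cluster
module load matlab           # hypothetical module name; use the exact name/version listed above
matlab &                     # start the Matlab GUI; it is displayed through X11 forwarding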

Note: The user must keep open the X11-forwarded login node session from which they submitted the job.

⚠️⚠️⚠️  A GUI session can be slow because of the network connection, and much slower over VPN. It will also be killed if the internet connection drops. You can avoid these issues by using a VNC connection and running your GUI application on the login nodes. Give link