After submitting jobs, users need to monitor them to check resource usage.

squeue: shows information about jobs in the Slurm scheduling queue.

squeue -u username
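Beyond listing your own jobs, a couple of standard squeue options may also be useful (job_id is a placeholder):

squeue -u username -t PENDING        # list only your pending jobs
squeue -j job_id --start             # show the estimated start time of a pending job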

kuacc-queue: shows job details for the whole queue. It is a more detailed version of the squeue command.

 kuacc-queue|grep username

or

kuacc-queue|more
(base) [yakarken18@login02 ~]$ kuacc-queue
JOBID    PARTITION  NAME             USER            TIME_LEFT    TIME_LIMIT   START_TIME           ST  NODES  CPUS  NODELIST(REASON)
2579092  short      run_custom.sh    bbiner21        59:29        1:00:00      2021-12-07T12:43:32  CA  1      1     it01
2579094  short      run_custom.sh    bbiner21        1:00:00      1:00:00      2021-12-07T12:45:32  CD  1      1     it01
2579090  mid        bash             igenel19        23:59:36     1-00:00:00   2021-12-07T12:41:29  CD  1      2     buyukliman
2579034  long       Il-gas           phaslak         6-22:57:38   7-00:00:00   2021-12-07T11:44:06  CD  1      4     rk01
2577763  cosmos     test             hakdemir        4-22:26:02   5-00:00:00   2021-12-07T11:10:06  CD  1      1     ke01
2577644  cosmos     test             hakdemir        4-17:55:23   5-00:00:00   2021-12-07T06:39:24  CD  1      1     ke03
2577553  cosmos     test             hakdemir        4-09:57:50   5-00:00:00   2021-12-06T22:39:30  CD  1      1     ke04
2577892  cosmos     test             hakdemir        4-05:11:55   5-00:00:00   2021-12-06T17:57:04  CD  1      1     ke08
2576427  mid        hph_D119V        sefenti19       23:59:00     23:59:00     2021-12-15T05:43:42  PD  1      8     (Resources)
2576424  mid        hph_G120V        sefenti19       23:59:00     23:59:00     2021-12-10T13:42:41  PD  1      8     (ReqNodeNotAvail, UnavailableNodes:)
2577787  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Resources)
2577788  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577789  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577790  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577791  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577792  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577793  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577794  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577795  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577796  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577797  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577798  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577799  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577800  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577801  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577802  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577803  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577804  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577805  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577806  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577807  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577808  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577809  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577810  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577811  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577812  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577813  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577814  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577815  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577816  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577817  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577818  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577819  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577820  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577821  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577822  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577823  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577824  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577825  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
2577826  cosmos     test             hakdemir        5-00:00:00   5-00:00:00   2021-12-10T11:08:13  PD  1      1     (Priority)
The command output lists the following information for each job:

JOBID, PARTITION, NAME, USER, TIME_LEFT, TIME_LIMIT, START_TIME, ST, NODES, CPUS, NODELIST(REASON)

ST: job state. The possible state codes are:

STATE        CODE  MEANING
PENDING      PD    Job is awaiting resource allocation.
RUNNING      R     Job currently has an allocation.
SUSPENDED    S     Job has an allocation, but execution has been suspended.
COMPLETING   CG    Job is in the process of completing. Some processes on some nodes may still be active.
COMPLETED    CD    Job has terminated all processes on all nodes.
CONFIGURING  CF    Job has been allocated resources, but is waiting for them to become ready for use.
CANCELLED    CA    Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
FAILED       F     Job terminated with a non-zero exit code or other failure condition.
TIMEOUT      TO    Job terminated upon reaching its time limit.
PREEMPTED    PR    Job has been suspended by a higher-priority job on the same resource.
NODE_FAIL    NF    Job terminated due to failure of one or more allocated nodes.

 

Note: The preempted state means that your job was killed and re-queued because of priority. Research groups donate compute nodes and have priority on them; if they submit a job and the node has no free resources, jobs belonging to general users are killed and requeued. To avoid this, you may use the IT nodes (it01-it04), which have no priority owner. You can also exclude specific nodes with Slurm's exclude parameter, as in the example below.
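For example, in a job script or on the sbatch command line (the node names and the script name job.sh are only illustrative):

#SBATCH --exclude=rk01,ke01          # the job will not be scheduled on these nodes

or

sbatch --exclude=rk01,ke01 job.sh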

The NODELIST(REASON) column lists the nodes a job is running on. For pending jobs, it instead gives the reason for the job's state.

REASON                 MEANING
InvalidQOS             The job's QOS is invalid.
Priority               One or more higher-priority jobs are ahead of this one in the queue. Your job will eventually run.
Resources              The job is waiting for resources to become available and will eventually run.
PartitionNodeLimit     The number of nodes required by this job is outside its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit     The job's time limit exceeds its partition's current time limit.
QOSJobLimit            The job's QOS has reached its maximum job count.
QOSResourceLimit       The job's QOS has reached some resource limit.
QOSTimeLimit           The job's QOS has reached its time limit.
QOSMaxCpuPerUserLimit  The maximum number of CPUs per user for your job's QOS has been reached; the job will run eventually.
QOSGrpMaxJobsLimit     The maximum number of jobs for your job's QOS has been reached; the job will run eventually.
QOSGrpCpuLimit         All CPUs assigned to your job's QOS are in use; the job will run eventually.
QOSGrpNodeLimit        All nodes assigned to your job's QOS are in use; the job will run eventually.

 

All reason codes are listed at the following link: https://slurm.schedmd.com/squeue.html#lbAF

Note: The reasons you will most often encounter are Priority, Resources, and QOSMaxCpuPerUserLimit.

scontrol:

This command shows everything about a job. You can find the job ID with the squeue or kuacc-queue commands.

scontrol show jobid  job_id

Note: This shows, in more detail, the information you can also access with kuacc-queue.

[root@login03 ~]# scontrol show jobid 2535745
JobId=2535745 JobName=JupiterNotebook
   UserId=xxxxx GroupId=domainusers(200513) MCS_label=N/A
   Priority=1669 Nice=0 Account=ai QOS=ai
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=06:03:50 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2021-10-26T11:34:35 EligibleTime=2021-10-26T11:34:35
   StartTime=2021-10-26T11:34:40 EndTime=2021-10-31T11:34:40 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=ai AllocNode:Sid=login02:19426
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dy02
   BatchHost=dy02
   NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=8G,node=1,gres/gpu:tesla_k80=1
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:tesla_k80:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/users/xxxxx/myjupyter/jupyter_submit.sh
   WorkDir=/scratch/users/xxxxx
   StdErr=/scratch/users/xxxxxjupyter-%J.log
   StdIn=/dev/null
   StdOut=/scratch/users/xxxxx/jupyter-%J.log
   Power=

 

The command output shows all details of the job. The fields are documented at https://slurm.schedmd.com/scontrol.html

kuacc-summary: lists a summary of cluster jobs (group, user, active jobs, active cores, pending jobs, and pending cores).

GROUP                  USER              ACTIVE_JOBS  ACTIVE_CORES  PENDING_JOBS  PENDING_CORES
----------------------------------------------------------------------------------------------
ai                                           50           277             0             0
                      aanees20                2            10             0             0
                      abaykal20               1             8             0             0
                      akutuk21                1             2             0             0
                      alperdogan              3             3             0             0
                      asafaya19               1            32             0             0
                      ashah20                 2             2             0             0
                      baristopal20            1             2             0             0
                      bbozkurt15              1             2             0             0
                      ccoban20                1            16             0             0
                      ckorkmaz14              3            15             0             0
                      ckorkmaz16              4            20             0             0
                      dyuret                  1             1             0             0
                      ecetin17                1             5             0             0
                      emunal                  1             1             0             0
                      gsoykan20               1             4             0             0
                      hcoban15                3            25             0             0
                      hguven20                1             1             0             0
                      hpc-okeskin             1            20             0             0
                      hpc-tkerimoglu          1             4             0             0
                      ishoer20                2             8             0             0
                      mali18                  3            12             0             0
                      nnayal17                1            20             0             0
                      nrahimi19               1             5             0             0
                      oulas15                 1             2             0             0
                      oyapici17               1            12             0             0
                      rji19                   7            35             0             0
                      shamdan17               1             2             0             0
                      skoc21                  1             2             0             0
                      sozcelik19              1             2             0             0
                      tanjary21               1             4             0             0
----------------------------------------------------------------------------------------------
biyofiz                                       4            31             0             0
                      akabakcioglu            1             1             0             0
                      ebahceci20              3            30             0             0
----------------------------------------------------------------------------------------------
cosmos                                      200           288             8            14
                      caltintas              50            50             0             0
                      gaksu                  16            16             0             0
                      gatekeeper             11            48             0             0
                      hakdemir               17            68             2             8
                      hdaglar17               0             0             6             6
                      hgulbalkan14            1             1             0             0
                      phaslak                38            38             0             0
                      saydin20               67            67             0             0
----------------------------------------------------------------------------------------------
ilac                                          1            12             0             0
                      hpc-spiepoli            1            12             0             0
----------------------------------------------------------------------------------------------
lufer                                         2            16             0             0
                      eyurtsev                2            16             0             0
----------------------------------------------------------------------------------------------
users                                       133           413           102           429
                      adasdemir16             1             8             0             0
                      alperdogan              1             4             0             0
                      ebahceci20             75            75             0             0
                      eyurtsev                1            12             0             0
                      fberber20               1             1             0             0
                      hakdemir               37           148           101           404
                      hnaseer19               1            25             1            25
                      phaslak                 4            16             0             0
                      scoskun17               1            32             0             0
                      sefenti19               1             8             0             0
                      yaydin20                8            80             0             0
                      ycan                    1             1             0             0
                      zabali16                1             3             0             0
----------------------------------------------------------------------------------------------
Totals:                                     390          1037           110           443
CPU and Memory Usage:

After a job is submitted, the user is allowed to ssh into the compute node on which the job is running.

ssh username@compute_node
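The node name is shown in the NODELIST column of squeue and kuacc-queue; for a single-node job you can also combine the lookup and the login in one step (job_id is a placeholder):

ssh $(squeue -h -j job_id -o %N)     # -h: no header, -o %N: print only the node list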

After connecting to the compute node, the user can use the following commands to check the memory and CPU usage of the submitted job.

 

ps: lists all processes.

ps -u username -o %cpu,rss,args

This gives an instantaneous snapshot each time you run the command. Memory usage (RSS) is reported in kilobytes.
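To keep polling instead of re-running the command by hand, you can wrap it in watch (the 5-second interval is only an example):

watch -n 5 "ps -u username -o %cpu,rss,args"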

 

top/htop: run interactively and show live usage statistics.

htop -u username

Note: Usage guides for top and htop:

https://gridpane.com/kb/how-to-use-the-top-command-to-monitor-system-processes-and-resource-usage/

https://gridpane.com/kb/how-to-use-the-htop-command-to-monitor-system-processes-and-resource-usage/

 

nvidia-smi: shows GPU parameters and GPU memory usage.

 

[root@it04 ~]# nvidia-smi
Wed Oct 27 09:14:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   37C    P0    59W / 250W |    607MiB / 32510MiB |     24%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    390761      C   namd2                             603MiB |
+-----------------------------------------------------------------------------+

607MiB / 32510MiB shows how much GPU memory your job is using; consider whether you really need this much GPU memory.
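To watch GPU utilization and memory over time instead of taking a single snapshot, nvidia-smi can repeat a reduced query (the 5-second interval is only an example):

nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5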

 

Job Log Files: When a job is submitted, the output file defined in the job script logs all output related to the job, including errors.

In job scripts,

#SBATCH --output=test-%j.out

Please check the log file if you have any issue with your jobs.
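If you prefer errors in a separate file, Slurm also accepts an error directive, and you can follow a log while the job is still running (the file names and job ID below are only illustrative):

#SBATCH --error=test-%j.err          # send stderr to its own file
tail -f test-123456.out              # follow the output log of a running job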

 

Most Common Issues:

- Segmentation fault: your code is trying to access memory outside its allocation. Sometimes increasing the memory request solves the issue.

- slurmstepd error: reports an exceeded memory limit, i.e. not enough memory was allocated (default memory per core: 4GB). Increasing memory with the Slurm memory parameters (mem, mem-per-cpu) solves the issue; see the example after this list.

- Requesting far more resources than are available and therefore waiting in the queue. Please test your code to find out its CPU and memory needs, and do not request more resources than you need.

- Reserving far more resources than needed, which keeps other jobs queued.

- User-installed software issues. Please try the modules provided on the cluster first.

- Anaconda virtual environments can be tricky. Please search for the error online and check solutions shared by users who had the same issue.
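For example, to raise the memory request in a job script (the values are only illustrative; use one of the two directives, not both):

#SBATCH --mem=16G                    # total memory per node
#SBATCH --mem-per-cpu=8G             # or: memory per allocated CPU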

For any problem, contact hpc@support.ku.edu.tr .

 

Cancelling Jobs: 

A user can cancel their own jobs with the scancel command.

scancel job_id                             # kills the job with the given job_id
scancel -u username                        # kills all of the user's jobs. Users can only kill their own jobs.
scancel -t pending -u username             # kills all of the user's pending jobs
scancel -t running -u username             # kills all of the user's running jobs

The job ID can be found with:

squeue -u username

or

kuacc-queue|grep username
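You can also cancel jobs by job name (jobname is a placeholder):

scancel -u username -n jobname             # kills the user's jobs with the given name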