squeue: shows information about jobs in the Slurm scheduling queue.
squeue -u username
kuacc-queue: shows job details on the queue. It is a more detailed version of the squeue command.
kuacc-queue|grep username
or
kuacc-queue|more
(base) [yakarken18@login02 ~]$ kuacc-queue
JOBID    PARTITION  NAME           USER       TIME_LEFT   TIME_LIMIT  START_TIME           ST  NODES  CPUS  NODELIST(REASON)
2579092  short      run_custom.sh  bbiner21   59:29       1:00:00     2021-12-07T12:43:32  CA  1      1     it01
2579094  short      run_custom.sh  bbiner21   1:00:00     1:00:00     2021-12-07T12:45:32  CD  1      1     it01
2579090  mid        bash           igenel19   23:59:36    1-00:00:00  2021-12-07T12:41:29  CD  1      2     buyukliman
2579034  long       Il-gas         phaslak    6-22:57:38  7-00:00:00  2021-12-07T11:44:06  CD  1      4     rk01
2577763  cosmos     test           hakdemir   4-22:26:02  5-00:00:00  2021-12-07T11:10:06  CD  1      1     ke01
2577644  cosmos     test           hakdemir   4-17:55:23  5-00:00:00  2021-12-07T06:39:24  CD  1      1     ke03
2577553  cosmos     test           hakdemir   4-09:57:50  5-00:00:00  2021-12-06T22:39:30  CD  1      1     ke04
2577892  cosmos     test           hakdemir   4-05:11:55  5-00:00:00  2021-12-06T17:57:04  CD  1      1     ke08
2576427  mid        hph_D119V      sefenti19  23:59:00    23:59:00    2021-12-15T05:43:42  PD  1      8     (Resources)
2576424  mid        hph_G120V      sefenti19  23:59:00    23:59:00    2021-12-10T13:42:41  PD  1      8     (ReqNodeNotAvail, UnavailableNodes:)
2577787  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Resources)
2577788  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577789  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577790  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577791  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577792  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577793  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577794  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577795  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577796  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577797  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577798  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577799  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577800  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577801  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577802  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577803  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577804  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577805  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577806  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577807  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577808  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577809  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577810  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577811  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577812  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577813  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577814  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577815  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577816  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577817  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577818  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577819  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577820  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577821  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577822  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577823  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577824  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577825  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
2577826  cosmos     test           hakdemir   5-00:00:00  5-00:00:00  2021-12-10T11:08:13  PD  1      1     (Priority)
Columns: JOBID, PARTITION, NAME, USER, TIME_LEFT, TIME_LIMIT, START_TIME, ST, NODES, CPUS, NODELIST(REASON)
ST: status of the job (see the table below)
State | Code | Meaning
PENDING | PD | Job is awaiting resource allocation.
RUNNING | R | Job currently has an allocation.
SUSPENDED | S | Job has an allocation, but execution has been suspended.
COMPLETING | CG | Job is in the process of completing. Some processes on some nodes may still be active.
COMPLETED | CD | Job has terminated all processes on all nodes.
CONFIGURING | CF | Job has been allocated resources, but is waiting for them to become ready for use.
CANCELLED | CA | Job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated.
FAILED | F | Job terminated with a non-zero exit code or other failure condition.
TIMEOUT | TO | Job terminated upon reaching its time limit.
PREEMPTED | PR | Job has been suspended by a higher-priority job on the same resource.
NODE_FAIL | NF | Job terminated due to the failure of one or more allocated nodes.
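If you only want to see your jobs in a particular state, squeue can filter on these codes with the -t/--states option; for example:
squeue -u username -t PD      # show only your pending jobs
squeue -u username -t R,CG    # show your running and completing jobs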
Note: The preempted state means that your job was killed and re-queued because of priority. Research groups donate compute nodes and have priority on them. If a group member submits a job and there is no free resource on their node, jobs belonging to general users are killed and requeued. To avoid this, you may use the IT nodes (it01-it04), which have no donor priority. You can also exclude specific nodes with the exclude parameter in Slurm, as sketched below.
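As a sketch, a job script can steer away from donated nodes with --exclude, or target the IT nodes directly with --nodelist (the node names below are only examples; use nodes that exist in your partition):
#SBATCH --exclude=ke01,ke03    # never schedule this job on these donated nodes
#SBATCH --nodelist=it01        # or: request a specific non-priority IT node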
The NODELIST(REASON) column lists the nodes on which a job is running, or, for a pending job, the reason for its state.
Reason | Meaning
InvalidQOS | The job's QOS is invalid.
Priority | One or more higher-priority jobs are queued ahead of this one. Your job will eventually run.
Resources | The job is waiting for resources to become available and will eventually run.
PartitionNodeLimit | The number of nodes required by this job is outside its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit | The job's time limit exceeds its partition's current time limit.
QOSJobLimit | The job's QOS has reached its maximum job count.
QOSResourceLimit | The job's QOS has reached some resource limit.
QOSTimeLimit | The job's QOS has reached its time limit.
QOSMaxCpuPerUserLimit | The maximum number of CPUs per user for your job's QOS has been reached; the job will run eventually.
QOSGrpMaxJobsLimit | The maximum number of jobs for your job's QOS has been reached; the job will run eventually.
QOSGrpCpuLimit | All CPUs assigned to your job's specified QOS are in use; the job will run eventually.
QOSGrpNodeLimit | All nodes assigned to your job's specified QOS are in use; the job will run eventually.
All reason codes are listed at https://slurm.schedmd.com/squeue.html#lbAF
Note: The reasons you will encounter most often are Priority, Resources, and QOSMaxCpuPerUserLimit.
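To see the state and reason for your own jobs at a glance, you can give squeue a custom output format (the field widths below are arbitrary):
squeue -u username -o "%.10i %.9P %.20j %.4t %.12M %.25R"    # job id, partition, name, state, run time, nodelist/reason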
The scontrol command shows everything about a job. You can find the job ID with the squeue and kuacc-queue commands.
scontrol show jobid job_id
Note: You can access this information in more detail with kuacc-queue.
[root@login03 ~]# scontrol show jobid 2535745
JobId=2535745 JobName=JupiterNotebook
   UserId=xxxxx GroupId=domainusers(200513) MCS_label=N/A
   Priority=1669 Nice=0 Account=ai QOS=ai
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=06:03:50 TimeLimit=5-00:00:00 TimeMin=N/A
   SubmitTime=2021-10-26T11:34:35 EligibleTime=2021-10-26T11:34:35
   StartTime=2021-10-26T11:34:40 EndTime=2021-10-31T11:34:40 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=ai AllocNode:Sid=login02:19426
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dy02
   BatchHost=dy02
   NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=8G,node=1,gres/gpu:tesla_k80=1
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:tesla_k80:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/users/xxxxx/myjupyter/jupyter_submit.sh
   WorkDir=/scratch/users/xxxxx
   StdErr=/scratch/users/xxxxxjupyter-%J.log
   StdIn=/dev/null
   StdOut=/scratch/users/xxxxx/jupyter-%J.log
   Power=
The command output shows all details about the job. The parameters are documented at https://slurm.schedmd.com/scontrol.html
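If you only need a few of these fields, you can filter the output with grep; for example (field names taken from the output above):
scontrol show jobid 2535745 | grep -E "JobState|RunTime|TRES"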
A per-group and per-user summary of active and pending jobs and cores:
GROUP        USER             ACTIVE_JOBS   ACTIVE_CORES   PENDING_JOBS   PENDING_CORES
----------------------------------------------------------------------------------------------
ai                            50            277            0              0
             aanees20         2             10             0              0
             abaykal20        1             8              0              0
             akutuk21         1             2              0              0
             alperdogan       3             3              0              0
             asafaya19        1             32             0              0
             ashah20          2             2              0              0
             baristopal20     1             2              0              0
             bbozkurt15       1             2              0              0
             ccoban20         1             16             0              0
             ckorkmaz14       3             15             0              0
             ckorkmaz16       4             20             0              0
             dyuret           1             1              0              0
             ecetin17         1             5              0              0
             emunal           1             1              0              0
             gsoykan20        1             4              0              0
             hcoban15         3             25             0              0
             hguven20         1             1              0              0
             hpc-okeskin      1             20             0              0
             hpc-tkerimoglu   1             4              0              0
             ishoer20         2             8              0              0
             mali18           3             12             0              0
             nnayal17         1             20             0              0
             nrahimi19        1             5              0              0
             oulas15          1             2              0              0
             oyapici17        1             12             0              0
             rji19            7             35             0              0
             shamdan17        1             2              0              0
             skoc21           1             2              0              0
             sozcelik19       1             2              0              0
             tanjary21        1             4              0              0
----------------------------------------------------------------------------------------------
biyofiz                       4             31             0              0
             akabakcioglu     1             1              0              0
             ebahceci20       3             30             0              0
----------------------------------------------------------------------------------------------
cosmos                        200           288            8              14
             caltintas        50            50             0              0
             gaksu            16            16             0              0
             gatekeeper       11            48             0              0
             hakdemir         17            68             2              8
             hdaglar17        0             0              6              6
             hgulbalkan14     1             1              0              0
             phaslak          38            38             0              0
             saydin20         67            67             0              0
----------------------------------------------------------------------------------------------
ilac                          1             12             0              0
             hpc-spiepoli     1             12             0              0
----------------------------------------------------------------------------------------------
lufer                         2             16             0              0
             eyurtsev         2             16             0              0
----------------------------------------------------------------------------------------------
users                         133           413            102            429
             adasdemir16      1             8              0              0
             alperdogan       1             4              0              0
             ebahceci20       75            75             0              0
             eyurtsev         1             12             0              0
             fberber20        1             1              0              0
             hakdemir         37            148            101            404
             hnaseer19        1             25             1              25
             phaslak          4             16             0              0
             scoskun17        1             32             0              0
             sefenti19        1             8              0              0
             yaydin20         8             80             0              0
             ycan             1             1              0              0
             zabali16         1             3              0              0
----------------------------------------------------------------------------------------------
Totals:                       390           1037           110            443
After a job is submitted, the user is allowed to use ssh to connect to the compute node on which the job is running.
ssh username@compute_node
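The compute node name is shown in the NODELIST column of squeue and kuacc-queue; a quick way to list your jobs together with their nodes (the node name below is just an example):
squeue -u username -h -o "%i %N"   # print job id and allocated node(s), without the header
ssh username@it01                  # then connect to the node running your job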
After connecting to the compute node with ssh, the user can use the following commands to check the memory and CPU usage of the submitted job.
ps: lists all processes.
ps -u username -o %cpu,rss,args
This gives a snapshot of usage each time you run the command. Memory usage (RSS) is reported in kilobytes.
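To follow usage over time rather than taking a single snapshot, you can combine ps with watch, or sort the snapshot by memory; a small sketch:
watch -n 5 "ps -u username -o %cpu,rss,args"        # refresh the snapshot every 5 seconds
ps -u username -o %cpu,%mem,rss,args --sort=-rss    # one snapshot, largest memory users first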
top/htop: run interactively and show live usage statistics.
htop -u username
Note: https://gridpane.com/kb/how-to-use-the-top-command-to-monitor-system-processes-and-resource-usage/
https://gridpane.com/kb/how-to-use-the-htop-command-to-monitor-system-processes-and-resource-usage/
nvidia-smi: shows GPU parameters and GPU memory usage.
[root@it04 ~]# nvidia-smi
Wed Oct 27 09:14:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   37C    P0    59W / 250W |    607MiB / 32510MiB |     24%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    390761      C   namd2                             603MiB |
+-----------------------------------------------------------------------------+
607MiB / 32510MiB shows how much GPU memory is in use; consider whether your job really needs this much GPU memory.
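To follow GPU usage while your job runs rather than checking once, nvidia-smi can repeat its report; for example:
nvidia-smi -l 5          # re-print the report every 5 seconds (Ctrl+C to stop)
watch -n 5 nvidia-smi    # alternative: refresh the full report every 5 seconds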
Job Log Files: When a job is submitted, the output file defined in the job script logs all output related to the job, including errors.
In job scripts,
#SBATCH --output=test-%j.out
Please check the log file if you have any issue with your jobs.
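For reference, a minimal job script with separate output and error logs might look like the sketch below (%j expands to the job ID; the partition, resource values, and command are only placeholders, adjust them to your job):
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=1:00:00
#SBATCH --output=test-%j.out    # normal output goes here
#SBATCH --error=test-%j.err     # errors go here; check this file first when a job fails

python my_script.py             # placeholder for your actual command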
Most Common Issues:
– Segmentation fault: occurs when your code tries to access memory outside its allocation. Sometimes increasing the requested memory solves the issue.
– slurmstepd error: reports an exceeded memory limit, i.e. not enough memory was allocated (default memory per core: 4 GB). Increasing memory with the Slurm memory parameters (--mem, --mem-per-cpu) solves the issue; see the example after this list.
– Requesting more resources than are available and waiting in the queue. Please test your code to determine its CPU and memory needs, and do not request more resources than you need.
– Reserving far more resources than needed, which keeps other users' jobs queued.
– User-installed software issues. Please try the modules provided on the cluster first.
– Anaconda virtual environments can be tricky. Please search the error online and check solutions shared by users who had the same issue.
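As referenced in the slurmstepd item above, the memory request can be raised in the job script; the values below are only examples:
#SBATCH --mem-per-cpu=8G    # memory per allocated CPU core
#SBATCH --mem=32G           # alternative: total memory for the whole node allocation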
For any problem, contact hpc@support.ku.edu.tr.
Cancelling Jobs:
Users can cancel their own jobs with the scancel command.
scancel job_id                   # kills the job with the given job_id
scancel -u username              # kills all of the user's jobs (users can only kill their own jobs)
scancel -t pending -u username   # kills all of the user's pending jobs
scancel -t running -u username   # kills all of the user's running jobs
The job ID can be found by using:
squeue -u username
or
kuacc-queue|grep username
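scancel can also target jobs by name or partition instead of ID (the job name and partition below are illustrative):
scancel -u username -n test      # kills the user's jobs named "test"
scancel -u username -p short     # kills the user's jobs in the short partition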