You can use the following commands to monitor the KUACC cluster.

df:

Disk free: reports the total, used, and available disk space of each mounted file system.

df -kh

k: report sizes in 1024-byte blocks

h: print sizes in human-readable units (K, M, G, T), as in the example below

Example usage:

[root@login03 ~]# df -kh
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/centos-root       200G   70G  131G  35% /
devtmpfs                      252G     0  252G   0% /dev
tmpfs                         252G  5.3M  252G   1% /dev/shm
tmpfs                         252G  422M  252G   1% /run
tmpfs                         252G     0  252G   0% /sys/fs/cgroup
/dev/sda2                    1016M  247M  770M  25% /boot
/dev/sda1                     200M  9.8M  191M   5% /boot/efi
beegfs_home2                  219T  164T   55T  75% /home2
beegfs_scratch                175T   77T   99T  44% /scratch
172.20.239.200:/vol_home2      82T   72T   11T  88% /userfiles
172.20.239.202:/VOL_KUTTAM     49T   34T   16T  69% /kuttam
172.20.239.201:/vol_datasets   50T   45T  5.7T  89% /datasets

Mount points:

/: server root disk
/scratch: User home disk
/datasets: General datasets disk
/userfiles: User datasets disk
/kuttam: Kuttam disk
/home2: System disk

Link: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-sysinfo-filesystems-df
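
As a minimal sketch of how the df output can be checked automatically, the function below prints the mount points whose Use% is at or above a given limit. The function name over_threshold and the limit of 85 are illustrative assumptions and not part of the KUACC tooling; the sample lines are copied from the output above.

```shell
#!/bin/bash
# over_threshold LIMIT: read df-style output on stdin and print
# "mount use%" for every file system at or above LIMIT percent.
# Hypothetical helper for illustration; not a KUACC command.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                     # skip the header line
            use = $5                 # Use% column, e.g. "88%"
            sub(/%/, "", use)
            if (use + 0 >= limit) print $6, $5
        }'
}

# Sample lines copied from the df -kh output above.
sample='Filesystem                    Size  Used Avail Use% Mounted on
beegfs_home2                  219T  164T   55T  75% /home2
beegfs_scratch                175T   77T   99T  44% /scratch
172.20.239.200:/vol_home2      82T   72T   11T  88% /userfiles
172.20.239.201:/vol_datasets   50T   45T  5.7T  89% /datasets'

printf '%s\n' "$sample" | over_threshold 85
```

With the sample data this reports /userfiles and /datasets. On a login node the function could be fed live data with "df -kh | over_threshold 85"; the sketch assumes each file system is printed on a single line.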

du:

Disk usage: reports the disk space used by a file or folder.

du -sh

s: summary; print only the grand total for each file or directory given

h: human readable

du -sh filename

To list the disk usage of every file and folder in the current directory:

du -sh *

Link: https://www.redhat.com/sysadmin/du-command-options
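
A common follow-up is finding which entries take the most space. The sketch below wraps du and sort in a small function. The name largest_entries is hypothetical, and the function uses -k instead of -h so that a plain numeric sort works.

```shell
#!/bin/bash
# largest_entries DIR [N]: print the N largest entries directly under DIR
# (default 5) as "size-in-KiB<TAB>path", largest first.
# Hypothetical helper for illustration; hidden files are not included.
largest_entries() {
    local dir=${1:-.} n=${2:-5}
    # -s: one total per argument, -k: sizes in 1024-byte blocks
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

# Example: the three largest entries in the current directory.
largest_entries . 3
```

Running du over a large directory tree can take a while and puts load on the file system, so it is better pointed at a specific folder than at a whole home directory.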

sinfo:

Lists information about Slurm nodes and partitions. The command output has six columns:

Partition: Slurm partition name
Avail: Partition availability
Timelimit: Maximum time limit of the partition
Nodes: Number of nodes in the listed state
State: State of the nodes (idle, down, draining, drained, mix, alloc)

  • idle: the node is free,
  • down: the node is down,
  • draining: the node has a problem and is closed to new jobs; it will be taken out of the cluster once its running jobs finish,
  • drained: the node is closed to all jobs,
  • mix: some of the resources on the node are in use,
  • alloc: the resources on the node are fully allocated; nothing is available.

Nodelist: Nodes in the partition that are in the listed state

sinfo shows all partitions and nodes in the cluster. To look at the ai partition, you can use “sinfo|grep ai”. Because grep matches any line that contains “ai”, partitions whose node lists include ai nodes are printed as well.

[root@login02 ~]# sinfo|grep ai
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
admin          up   infinite     30    mix ag01,ai[01,03-14],be[04,06-07],buyukliman,da[01-04],dy03,it[02-04],ke[02,04],rk01,sm01
admin          up   infinite      3   idle ai02,be[11-12]
short*         up    2:00:00     29    mix ag01,ai[01,03-14],be[04,06-07],buyukliman,da[01-04],dy03,it[01-04],rk01,sm01
short*         up    2:00:00      3   idle ai02,be[11-12]
mid            up 1-00:00:00     18    mix ai[05-11],be[04,06-07],buyukliman,da[03-04],dy03,it[03-04],rk01,sm01
long           up 7-00:00:00     19    mix ag01,ai[06-14],be[04,06-07],buyukliman,da04,dy03,it04,rk01,sm01
ai             up 7-00:00:00     14    mix ai[01,03-14],dy03
ai             up 7-00:00:00      1  alloc dy02
ai             up 7-00:00:00      1   idle ai02

In the command output:

Partition: ai
Availability: the partition is up
Time limit: maximum 7 days
Nodes-State-Nodelist:
- 14 nodes are in the mix state: some of the resources on these nodes (ai[01,03-14],dy03) are in use, and some are still available.
- One node (dy02) is fully allocated.
- One node (ai02) is idle, that is, fully available.
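
The reading above can be reproduced with a short filter. This is a sketch only: partition_states is a hypothetical helper, and the sample lines are taken from the sinfo output above. It matches the partition name exactly, so lines that merely mention ai nodes in their node list are not picked up.

```shell
#!/bin/bash
# partition_states PARTITION: read sinfo-style output on stdin and print
# "state node-count" for each line of the given partition.
# Hypothetical helper for illustration.
partition_states() {
    awk -v part="$1" '
        {
            name = $1
            sub(/\*$/, "", name)     # the default partition is marked with "*"
            if (name == part) print $5, $4
        }'
}

# Sample lines copied from the sinfo output above.
sample='PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
short*         up    2:00:00      3   idle ai02,be[11-12]
ai             up 7-00:00:00     14    mix ai[01,03-14],dy03
ai             up 7-00:00:00      1  alloc dy02
ai             up 7-00:00:00      1   idle ai02'

printf '%s\n' "$sample" | partition_states ai
```

On the cluster the function would be used as "sinfo | partition_states ai". sinfo also accepts -p ai to restrict its own output to one partition.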

⚠️⚠️⚠️ The sinfo command is useful for a short listing of resources. The kuacc-info command, described next, gives a more detailed and more useful view.

kuacc-info:

Gives a summary of cluster usage (CPUs, memory, nodes).

The command output has two parts. The first part lists the users together with their numbers of running and pending jobs and their node, CPU, and memory usage on the KUACC (IT) nodes (kcc), cosmos nodes (csms), ilac nodes (ilc), and ai nodes.

The second part is a detailed version of the sinfo output. It is shown below and lists, for each node, its name, its partition (TYPE column), its CPU usage, its memory usage, and its status.

------------------------------------------------------------------------------------------------------------
                NAME          TYPE            CPU_USAGE     MEM_USAGE     STATUS
------------------------------------------------------------------------------------------------------------
                ag01         COSBI             7 ( 87%)    55.25 ( 45%)   BUSY
                ai01            AI            38 ( 95%)   186.88 ( 38%)   BUSY
                ai02            AI             0 (  0%)     0.00 (  0%)   FREE
                ai03            AI            36 ( 90%)   319.00 ( 64%)   BUSY
                ai04            AI            20 ( 50%)   211.00 ( 42%)   BUSY
                ai05            AI            19 ( 47%)   176.00 ( 35%)   BUSY
                ai06            AI             5 ( 12%)    60.00 ( 12%)   BUSY
                ai07            AI            38 ( 95%)   164.00 ( 33%)   BUSY
                ai08            AI            26 ( 65%)   148.00 ( 30%)   BUSY
                ai09            AI            29 ( 72%)   134.00 ( 27%)   BUSY
                ai10            AI            34 ( 85%)   256.00 ( 52%)   BUSY
                ai11            AI            35 ( 87%)   484.00 ( 98%)   FULL(MEM)
                ai12            AI            37 ( 92%)   406.30 ( 82%)   BUSY
                ai13            AI            26 ( 65%)   468.14 ( 95%)   FULL(MEM)
                ai14            AI            24 ( 60%)   378.00 ( 76%)   BUSY
                be01          ILAC            12 (100%)    48.00 ( 39%)   FULL(CPU)
                be02          ILAC            12 (100%)    48.00 ( 39%)   FULL(CPU)
                be03          ILAC            12 (100%)    48.00 ( 39%)   FULL(CPU)
                be04          ILAC            11 ( 91%)    44.00 ( 35%)   BUSY
                be05          ILAC            12 (100%)    48.00 ( 39%)   FULL(CPU)
                be06          ILAC            11 ( 91%)    44.00 ( 35%)   BUSY
                be07          ILAC             8 ( 66%)    32.00 ( 26%)   BUSY
                be08          ILAC            12 (100%)   123.05 (100%)   FULL(CPU)
                be09          ILAC            12 (100%)    56.00 ( 45%)   FULL(CPU)
                be10          ILAC            12 (100%)    56.00 ( 45%)   FULL(CPU)
                be11          ILAC             0 (  0%)     0.00 (  0%)   FREE
                be12          ILAC             0 (  0%)     0.00 (  0%)   FREE
          buyukliman         HAMSI            18 ( 25%)   108.00 ( 44%)   BUSY
                da01       BIYOFIZ            10 ( 50%)    40.00 ( 16%)   BUSY
                da02       BIYOFIZ            10 ( 50%)    40.00 ( 16%)   BUSY
                da03       BIYOFIZ            10 ( 50%)    40.00 ( 16%)   BUSY
                da04       BIYOFIZ             3 ( 15%)   140.00 ( 57%)   BUSY
                dy02            AI            36 (100%)   144.00 ( 29%)   FULL(CPU)
                dy03            AI             2 (  8%)    78.12 ( 17%)   BUSY
                it01         KUACC            26 ( 65%)   104.00 ( 21%)   BUSY
                it02         KUACC            10 ( 25%)    40.00 (  8%)   BUSY
                it03         KUACC            35 ( 87%)   140.00 ( 28%)   BUSY
                it04         KUACC            38 ( 95%)   192.00 ( 39%)   BUSY
                ke01        COSMOS            36 (100%)   217.66 ( 44%)   FULL(CPU)
                ke02        COSMOS            35 ( 97%)   242.48 ( 49%)   FULL(CPU)
                ke03        COSMOS            36 (100%)   225.66 ( 45%)   FULL(CPU)
                ke04        COSMOS            36 (100%)   231.66 ( 47%)   FULL(CPU)
                ke05        COSMOS            36 (100%)   198.83 ( 40%)   FULL(CPU)
                ke06        COSMOS            36 (100%)   190.83 ( 38%)   FULL(CPU)
                ke07        COSMOS            36 (100%)   198.83 ( 40%)   FULL(CPU)
                ke08        COSMOS            35 ( 97%)   160.00 ( 32%)   FULL(CPU)
                rk01         KUTEM            40 ( 55%)   240.00 ( 48%)   BUSY
                sm01           IUI             4 ( 20%)    24.00 ( 38%)   BUSY

⚠️⚠️⚠️ This output is useful when submitting jobs. You can find free resources here and direct your job to those nodes with the nodelist or constraint flags.
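
As a sketch of how free nodes could be picked out of this table, the function below prints the names of the nodes whose last column has a given status. The name nodes_with_status is hypothetical, and the function assumes the column layout shown above; the sample lines are copied from that output.

```shell
#!/bin/bash
# nodes_with_status STATUS: read the second part of kuacc-info output on
# stdin and print the names of nodes whose last column equals STATUS.
# Hypothetical helper; assumes the column layout shown above.
nodes_with_status() {
    awk -v want="$1" '$NF == want { print $1 }'
}

# Sample lines copied from the kuacc-info output above.
sample='                ai01            AI            38 ( 95%)   186.88 ( 38%)   BUSY
                ai02            AI             0 (  0%)     0.00 (  0%)   FREE
                ai11            AI            35 ( 87%)   484.00 ( 98%)   FULL(MEM)
                be11          ILAC             0 (  0%)     0.00 (  0%)   FREE
                be12          ILAC             0 (  0%)     0.00 (  0%)   FREE'

printf '%s\n' "$sample" | nodes_with_status FREE
```

A free node found this way could then be requested with, for example, "sbatch --nodelist=ai02 job.sh", where job.sh is a placeholder for your own job script.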

kuacc-nodes:

Lists the cluster nodes with their specifications (CPU types, GPU list, memory, features, etc.). It is a detailed version of the sinfo command.

(base) [yakarken18@login02 ~]$ kuacc-nodes
NODELIST        STATE       CPUS    S:C:T    MEMORY   TMP_DISK GRES                          AVAIL_FEATURES                                    
ai[01,04-06,09] mixed         40   2:20:1    503000          0 gpu:tesla_t4:8                ai,ib,compute,40cpu,gpu                                 
ai[11-14]       mixed         40   2:20:1    503000          0 gpu:tesla_v100:8              ai,ib,compute,40cpu,gpu                                          
buyukliman      mixed         72   2:18:2    250000          0 (null)                        hamsi,ib,compute,72cpu                                     
da[03-04]       mixed         20   2:10:1    250000          0 gpu:tesla_k20m:1              biyofiz,ib,compute,20cpu                                          
dy02            mixed         36   2:18:1    504000          0 gpu:tesla_k80:4               ai,ib,compute,36cpu,gpu                                         
dy03            mixed         24   2:12:1    450000          0 gpu:tesla_k80:8               ai,ib,compute,24cpu,gpu                                  
it[01-02]       mixed         40   2:20:1    500000          0 gpu:tesla_v100:1              IT,ib,compute,tesla_v100                                          
sm01            mixed         20   2:10:1     64000          0 gpu:tesla_k40m:2,gpu:tesla_k80:1 iui,ib,compute,20cpu                                      
ag01            allocated      8    2:4:1    124000          0 gpu:gtx_1080ti:2              cosbi,ib,compute,8cpu,gpu                                       
ai[02-03,07-08, allocated     40   2:20:1    503000          0 gpu:tesla_t4:8                ai,ib,compute,40cpu,gpu                                  
be[01-12]       allocated     12    2:6:1    126000          0 gpu:tesla_k20m:1              ilac,ib,compute,12cpu                                                 
da02            allocated     20   2:10:1    250000          0 gpu:tesla_k20m:1              biyofiz,ib,compute,20cpu                                         
it[03-04]       allocated     40   2:20:1    500000          0 gpu:tesla_v100:1              IT,ib,compute,tesla_v100                                          
rk01            allocated     72   2:18:2    504000          0 (null)                        kutem,ib,compute,72cpu,HT                                         
da01            idle          20   2:10:1    250000          0 gpu:tesla_k20m:1              biyofiz,ib,compute,20cpu                                         
ke[01-08]       allocated     36   2:18:1    504000          0 (null)                        cosmos,ib,compute,36cpu                                                   

============================================================================================================
KUACC NODES CPU LIST
============================================================================================================
login02-login03  |model name      : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
ag01             |model name      : Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
be01 - be14      |model name      : Intel(R) Xeon(R) CPU E5-2640    @ 2.50GHz
buyukliman       |model name      : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
da01 - da04      |model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
dy02             |model name      : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
dy03             |model name      : Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.10GHz
ke01 - ke08      |model name      : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
rk01             |model name      : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
sm01             |model name      : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
it01 - it04      |model name      : Intel(R) Xeon(R) Gold 6148      @ 2.40GHz
ai01 - ai14      |model name      : Intel(R) Xeon(R) Gold 6248      @ 2.50GHz
============================================================================================================

 

Output columns: NODELIST, STATE, CPUS, S:C:T (Sockets:CoresPerSocket:ThreadsPerCore), MEMORY, TMP_DISK (temporary disk), GRES, AVAIL_FEATURES

⚠️⚠️⚠️ This command lists the specifications of all nodes in the cluster. It is very useful when you need a specific resource. For example, the GRES column shows all GPU types in the cluster. You can request a specific GPU type by passing the value from this column to the gres flag in your job script, in the form gpu:type:count:

#SBATCH --gres=gpu:tesla_k80:1
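
To see which GPU types can be requested, the GRES column can be reduced to a list of distinct types. The sketch below is an illustration under assumptions: gpu_types is a hypothetical helper, it expects GRES to be the seventh column as in the output above, and the generated directives each ask for one GPU using the gpu:type:count form.

```shell
#!/bin/bash
# gpu_types: read kuacc-nodes (sinfo-style) output on stdin and print the
# distinct GPU types found in the GRES column (7th field), one per line.
# Hypothetical helper; assumes the column layout shown above.
gpu_types() {
    awk '
        $7 ~ /^gpu:/ {
            n = split($7, entries, ",")          # a node may list several GPU types
            for (i = 1; i <= n; i++) {
                split(entries[i], parts, ":")    # gpu:<type>:<count>
                print parts[2]
            }
        }' | sort -u
}

# Sample lines copied from the kuacc-nodes output above.
sample='NODELIST        STATE       CPUS    S:C:T    MEMORY   TMP_DISK GRES                          AVAIL_FEATURES
ai[11-14]       mixed         40   2:20:1    503000          0 gpu:tesla_v100:8              ai,ib,compute,40cpu,gpu
buyukliman      mixed         72   2:18:2    250000          0 (null)                        hamsi,ib,compute,72cpu
dy02            mixed         36   2:18:1    504000          0 gpu:tesla_k80:4               ai,ib,compute,36cpu,gpu
sm01            mixed         20   2:10:1     64000          0 gpu:tesla_k40m:2,gpu:tesla_k80:1 iui,ib,compute,20cpu'

# Print a ready-to-paste directive requesting one GPU of each type.
printf '%s\n' "$sample" | gpu_types | while read -r gpu; do
    echo "#SBATCH --gres=gpu:${gpu}:1"
done
```

Only one of the printed directives belongs in a given job script; the loop is there to show the form of the value for each type.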

⚠️⚠️⚠️ The AVAIL_FEATURES column lists all features of each node. These features are set by the system admins and are used with the constraint flag in a Slurm job script.

For example,

(base) [yakarken18@login02 ~]$ kuacc-nodes |grep ai
ai[01,04-06,09] mixed         40   2:20:1    503000          0 gpu:tesla_t4:8   ai,ib,compute,40cpu,gpu,tesla_t4,6248,molpro,vnc                                                       
ai[11-14]       mixed         40   2:20:1    503000          0 gpu:tesla_v100:8 ai,ib,compute,40cpu,gpu,tesla_v100,6248,molpro,vnc                                                      
dy02            mixed         36   2:18:1    504000          0 gpu:tesla_k80:4  ai,ib,compute,36cpu,gpu,tesla_k80,e52695,e52695v4,                                                      
dy03            mixed         24   2:12:1    450000          0 gpu:tesla_k80:8  ai,ib,compute,24cpu,gpu,tesla_k80,e52695,e52695v2,                                                       
ai[02-03,07-08, allocated     40   2:20:1    503000          0 gpu:tesla_t4:8   ai,ib,compute,40cpu,gpu,tesla_t4,6248,molpro,vnc                                                        
ai01 - ai14      |model name      : Intel(R) Xeon(R) Gold 6248      @ 2.50GHz
ai[01-04,06-10] mixed 40 2:20:1 503000 0 gpu:tesla_t4:8                ai,ib,compute,40cpu,gpu,tesla_t4,6248,molpro,vnc

Features for ai nodes:

ai[01-10]: Feature=ai,ib,compute,40cpu,gpu,tesla_t4,6248,molpro,vnc
ai[11-14]: Feature=ai,ib,compute,40cpu,gpu,tesla_v100,6248,molpro,vnc
dy02: Feature=ai,ib,compute,36cpu,gpu,tesla_k80,e52695,e52695v4,vnc
dy03: Feature=ai,ib,compute,24cpu,gpu,tesla_k80,e52695,e52695v2,vnc

ai: ai partition
ib: node with InfiniBand network
compute: compute node
24cpu: node with 24 cores
36cpu: node with 36 cores
40cpu: node with 40 cores
tesla_XX: node with a tesla_XX GPU
molpro: node with 1TB local disk
6248: node with Intel Gold 6248 CPU
e52695: node with Intel E5-2695 CPU

⚠️⚠️⚠️ Some users need to run their jobs on a specific kind of node, for example on nodes with Intel Gold 6248 CPUs. Passing the 6248 feature to the constraint flag limits the job to those nodes.
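
Before using a feature in the constraint flag, it can help to check which nodes actually carry it. The following sketch assumes that the feature list is the last column of the kuacc-nodes output; nodes_with_feature is a hypothetical helper, and the sample lines are based on the output and feature lists above with the whitespace condensed.

```shell
#!/bin/bash
# nodes_with_feature FEATURE: read kuacc-nodes (sinfo-style) output on stdin
# and print the NODELIST entries whose feature list contains FEATURE exactly.
# Hypothetical helper; assumes the features are the last column.
nodes_with_feature() {
    awk -v want="$1" '
        {
            n = split($NF, feats, ",")
            for (i = 1; i <= n; i++)
                if (feats[i] == want) { print $1; break }
        }'
}

# Sample lines based on the output above (whitespace condensed).
sample='ai[01,04-06,09] mixed 40 2:20:1 503000 0 gpu:tesla_t4:8   ai,ib,compute,40cpu,gpu,tesla_t4,6248,molpro,vnc
ai[11-14]       mixed 40 2:20:1 503000 0 gpu:tesla_v100:8 ai,ib,compute,40cpu,gpu,tesla_v100,6248,molpro,vnc
dy02            mixed 36 2:18:1 504000 0 gpu:tesla_k80:4  ai,ib,compute,36cpu,gpu,tesla_k80,e52695,e52695v4,vnc
dy03            mixed 24 2:12:1 450000 0 gpu:tesla_k80:8  ai,ib,compute,24cpu,gpu,tesla_k80,e52695,e52695v2,vnc'

printf '%s\n' "$sample" | nodes_with_feature 6248
```

The comparison is done on whole feature names, so asking for e52695 does not also match e52695v4 by accident.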

⚠️⚠️⚠️ Any feature can be added to a node's list of available features.

Examples:

  • Use the constraint flag with the tesla_k80 feature to request only tesla_k80 nodes.
#SBATCH --constraint=tesla_k80
  • Some applications create too many temporary files, which causes problems on the shared file system. Such applications should use the node's local disk for scratch data instead. Nodes with the molpro feature have local disks larger than 1TB, and the tmp2 folder on these nodes can be used as a scratch folder.

Use the constraint flag with the molpro feature, and use the tmp2 folder as scratch:

#SBATCH --constraint=molpro
export TMP=/tmp2

At the end of the job, the scratch data should be cleaned up.
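
One way to make sure scratch data is removed is to give each job its own folder and delete it from an exit trap. The script below is a sketch under assumptions, not a KUACC-provided template: the folder naming scheme and the fallback to /tmp (used only so the script can be tried on a machine without /tmp2) are illustrative choices.

```shell
#!/bin/bash
#SBATCH --constraint=molpro

# Assumption: /tmp2 exists on molpro nodes, as described above. Fall back to
# /tmp elsewhere so the sketch can be tried on any machine.
scratch_base=/tmp2
[ -d "$scratch_base" ] || scratch_base=/tmp

# Create a private per-job folder. SLURM_JOB_ID is set by Slurm inside a job;
# "manual" is used when the script is run by hand.
job_scratch=$(mktemp -d "$scratch_base/${USER:-user}_${SLURM_JOB_ID:-manual}_XXXXXX")
export TMP="$job_scratch"

# Remove the scratch folder when the script exits.
cleanup() { rm -rf -- "$job_scratch"; }
trap cleanup EXIT

# ... run the application here; it should write its temporary files to "$TMP" ...
echo "scratch directory: $TMP"
```

Using a separate folder per job keeps the cleanup from touching scratch data that belongs to other jobs running on the same node.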

scontrol show node node_name:

This command shows detailed information about a compute node.

[root@login03 ~]# scontrol show node ai12
NodeName=ai12 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=23 CPUErr=0 CPUTot=40 CPULoad=13.54
   AvailableFeatures=ai,ib,compute,40cpu,gpu,tesla_v100,6248,molpro,vnc
   ActiveFeatures=ai,ib,compute,40cpu,gpu,tesla_v100,6248,molpro,vnc
   Gres=gpu:tesla_v100:8
   NodeAddr=ai12 NodeHostName=ai12 Version=17.02
   OS=Linux RealMemory=503000 AllocMem=487188 FreeMem=109049 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=90 Owner=N/A MCS_label=N/A
   Partitions=admin,short,mid,ai
   BootTime=2021-10-20T14:17:44 SlurmdStartTime=2021-10-25T21:10:44
   CfgTRES=cpu=40,mem=503000M,gres/gpu:tesla_v100=8
   AllocTRES=cpu=23,mem=487188M,gres/gpu:tesla_v100=3
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Information: NodeName, Arch (architecture), CoresPerSocket, CPUAlloc, CPUErr, CPUTot, CPULoad, AvailableFeatures, ActiveFeatures, Gres, NodeAddr, NodeHostName, Version, OS, RealMemory, AllocMem, FreeMem, Sockets, Boards, State, ThreadsPerCore, TmpDisk, Weight, Owner, MCS_label, Partitions, BootTime, SlurmdStartTime, CfgTRES, AllocTRES, CapWatts, CurrentWatts, LowestJoules, ConsumedJoules, etc.

AvailableFeatures: You can pick a node from the kuacc-info list, then run scontrol show node node_name to check which features it offers for the constraint flag.

Gres: shows the GPU type and count on the node

Partitions: shows the partitions the node belongs to

CfgTRES: Resources configured on the node

AllocTRES: Resources currently allocated on the node. (This output has a bug: it only counts GPUs requested with an explicit GPU type, such as --gres=gpu:tesla_v100:x. If a user submits a job with --gres=gpu:1, that GPU is not listed in the command output.)
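
Individual fields can be pulled out of this output with a small filter, for instance to work out how many CPUs are still free on a node. The sketch below uses a hypothetical helper, node_field, and sample lines copied from the ai12 output above.

```shell
#!/bin/bash
# node_field KEY: read "scontrol show node" output on stdin and print the
# value of the first KEY=value token. Hypothetical helper; values that
# contain spaces are not handled.
node_field() {
    awk -v key="$1" '
        {
            for (i = 1; i <= NF; i++)
                if (index($i, key "=") == 1) {
                    print substr($i, length(key) + 2)
                    exit
                }
        }'
}

# Sample lines copied from the scontrol output above.
sample='NodeName=ai12 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=23 CPUErr=0 CPUTot=40 CPULoad=13.54
   AvailableFeatures=ai,ib,compute,40cpu,gpu,tesla_v100,6248,molpro,vnc
   Gres=gpu:tesla_v100:8'

total=$(printf '%s\n' "$sample" | node_field CPUTot)
alloc=$(printf '%s\n' "$sample" | node_field CPUAlloc)
echo "free CPUs on ai12: $((total - alloc))"
printf '%s\n' "$sample" | node_field AvailableFeatures
```

For the sample, CPUTot is 40 and CPUAlloc is 23, so 17 CPUs are free. On the cluster the input would come from "scontrol show node ai12" instead of the sample text.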

⚠️⚠️⚠️ You can look up the other parameters in the manual with “man scontrol”.