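Background:
In day-to-day LSF cluster administration, administrators want to see the memory utilization of each machine so that they can spot users who are abusing resources.
If you have a reporting or monitoring system, feed the data into it; if not, capturing it with the script below works too.
Result:
Run the script as a scheduled job and use the output as a reference for the administrator.
The full script follows. The text-processing pipelines are dense, but it can be copied and used as is: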
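As a rough illustration of the scheduled job, a crontab entry along the following lines could be used. This is only a sketch: the script path /path/lsf_mem_report.sh and the 30-minute interval are hypothetical, and the location of profile.lsf depends on where LSF is installed on your site. Sourcing the profile is included because cron starts jobs with a minimal environment, so bhosts and bjobs may otherwise not be on the PATH.

```
# Hypothetical crontab entry: load the LSF environment, then run the report every 30 minutes.
# Both paths are placeholders; adjust them to your installation.
*/30 * * * * . /path/to/lsf/conf/profile.lsf && /path/lsf_mem_report.sh >/dev/null 2>&1
```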
## Write the header row of the report
echo "Hostname mem_sum maxmem_sum reservemem_sum mem_utilization" > /path/mem_report1

## Select the compute nodes that currently have jobs running
for hostname in `bhosts -X | awk '$5 > 0' | grep ok | awk '{print $1}' | tr '\n' ' '`
## Loop body
do
    ## Sum mem
    mem_sum=`bjobs -u all -m $hostname -o 'mem' | awk 'NR>1' | awk '{print $1}' | xargs echo -n | tr ' ' '+' | xargs echo | bc`
    ## Sum max_mem (jobs that report "-" are skipped)
    maxmem_sum=`bjobs -u all -m $hostname -o 'max_mem' | awk 'NR>1' | awk '{print $1}' | grep -v "-" | xargs echo -n | tr ' ' '+' | xargs echo | bc`
    ## Sum reserved memory, taken from the rusage section of the effective resource requirement
    reservemem_sum=`bjobs -u all -m $hostname -o 'eresreq' | grep rusage | cut -d "=" -f 4 | cut -d "]" -f 1 | xargs echo -n | tr ' ' '+' | xargs echo | bc`
    ## Compute utilization (reserve / maxmem)
    mem_utilization=`echo "scale=2; $reservemem_sum/$maxmem_sum" | bc`
    ## Append the row to the report
    echo "$hostname $mem_sum $maxmem_sum $reservemem_sum $mem_utilization" >> /path/mem_report1
done

## Sort by utilization, descending and numerically, once all rows are written; the header row stays on top
(head -n 1 /path/mem_report1; tail -n +2 /path/mem_report1 | sort -k5,5 -rn) > /path/mem_report2
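The summing pipeline in the script (xargs, tr, bc) produces an empty string when a host has no usable values, and the division then fails in bc. The sketch below shows one possible alternative that does the summation and the division in awk. The helper names sum_first_field and safe_ratio are made up for this example, and the sample input only imitates the "number unit" shape that the script assumes bjobs prints; it is not real bjobs output.

```shell
#!/bin/sh
# sum_first_field: add up the leading numeric field of every input line.
# Lines whose first field is not a plain number (for example "-") are skipped,
# and empty input yields 0.00 instead of an empty string.
sum_first_field() {
    awk '$1 ~ /^[0-9]+(\.[0-9]+)?$/ { total += $1 } END { printf "%.2f\n", total + 0 }'
}

# safe_ratio NUM DEN: print NUM/DEN with two decimals, or "n/a" when DEN is 0.
safe_ratio() {
    awk -v n="$1" -v d="$2" 'BEGIN { if (d + 0 == 0) print "n/a"; else printf "%.2f\n", n / d }'
}

# Demo with made-up values
printf '512 Mbytes\n-\n1536 Mbytes\n' | sum_first_field   # prints 2048.00
safe_ratio 4096 2048                                       # prints 2.00
safe_ratio 4096 0                                          # prints n/a
```

In the script, a line such as the mem_sum assignment could then end in "| awk 'NR>1' | sum_first_field" instead of the xargs/tr/bc chain, and the utilization could be computed with safe_ratio "$reservemem_sum" "$maxmem_sum".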
Reader comment:
./lsfmen.sh
bhosts: invalid option -- 'X'
Usage:
bhosts [-h] [-V] [-R res_req] [-w | -l] [host_name ... | cluster_name]
or
bhosts [-h] [-V] -s [ resource_name ]
By comparison, a regular bhosts accepts the following options:
bhosts -help
bhosts: illegal option -- h
Usage:
bhosts [-R res_req] [-w | -l | -e | -o "field_name[:[-][output_width]] ... [delimiter='character']" [-json]] [-a] [-alloc] [-x] [-X] [-cname] [-rc] [host_name ... | cluster_name] [-alloc]
or
bhosts [-R res_req] [-w | -e | -o "field_name[:[-][output_width]] ... [delimiter='character']" ] [-a] [-x] [-X] [-cname] [host_name ... | cluster_name] [-alloc] [-noheader]
or
bhosts [-e] [-cname] [-a] [-noheader] -s [ resource_name ... ]
or
bhosts [-l] [-aff] [host_name ... | cluster_name]
or
bhosts [-l] [-gpu] [host_name ... | cluster_name]
or
bhosts [-w] [-rconly]
or
bhosts [-h] [-V]
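The two usage messages above show that -X is available in some bhosts builds and missing in others. One way to make the host-selection step tolerate both is to probe for the flag first, sketched below. The function names supports_flag and list_busy_hosts are hypothetical, the probe simply looks for the "invalid option" / "illegal option" wording seen in the logs above, and the fallback assumes that a bhosts without -X already lists individual hosts rather than condensed host groups.

```shell
#!/bin/sh
# supports_flag CMD FLAG: succeed unless "CMD FLAG" complains on stderr
# with "invalid option" or "illegal option".
supports_flag() {
    if "$1" "$2" 2>&1 >/dev/null | grep -Eq 'invalid option|illegal option'; then
        return 1
    fi
    return 0
}

# list_busy_hosts: print the names of hosts whose STATUS column is "ok"
# and whose NJOBS column (field 5) is greater than 0.
list_busy_hosts() {
    if supports_flag bhosts -X; then
        bhosts -X
    else
        bhosts
    fi | awk 'NR > 1 && $2 == "ok" && $5 > 0 { print $1 }'
}
```

The loop in the script could then read "for hostname in $(list_busy_hosts)". Note that matching the STATUS column exactly is slightly stricter than the original grep ok, which matches "ok" anywhere on the line.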