Configuring the maximum memory limit for jobs in LSF

Category: Cluster management · Author: anonymous


Enforce job-level memory limits in LSF


LSF enforcement of job memory limits with job termination:
This gives LSF control over how much memory jobs can use. LSF terminates any job that reaches the configured memory limit. To decide whether a job has reached the limit, LSF looks at the sum of the memory consumed by all of the job's processes.

    1. Add the memory limit parameters to lsf.conf:

          LSB_MEMLIMIT_ENFORCE=Y
          LSB_JOB_MEMLIMIT=Y

    2. Specify a memory limit in lsb.queues or lsb.applications:

          MEMLIMIT = 5000 #Memory limit of 5000 KB

    3. Reconfigure LSF:

          lsadmin reconfig
          badmin reconfig
          badmin hrestart all
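
For the queue-level case, the MEMLIMIT line goes inside a queue definition in lsb.queues. The fragment below is a minimal sketch of what that might look like; the queue name `normal` is only an illustrative assumption and does not come from the original text.

```
# lsb.queues (sketch; the queue name is hypothetical)
Begin Queue
QUEUE_NAME = normal
MEMLIMIT   = 5000        # per-job memory limit, 5000 KB
End Queue
```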

You can specify a memory limit at the queue level (lsb.queues), at the application profile level (lsb.applications), or at job submission. Use the -M option when submitting jobs to specify a memory limit. For example,

bsub -M 50000 myjob.sh

LSF will allow this job to consume a maximum of 50000 KB of memory before terminating it.
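
Because the limit is given in KB here, larger limits are easy to mistype. The following sketch shows one way to compute the value; `gb_to_kb` is a hypothetical helper, not an LSF command, and it assumes the cluster expresses limits in KB (the LSF_UNIT_FOR_LIMITS parameter in lsf.conf can change the unit).

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): convert gigabytes to kilobytes
# for use with bsub -M, assuming limits are expressed in KB.
gb_to_kb() {
    echo $(( $1 * 1024 * 1024 ))
}

# Example: request a 4 GB per-job memory limit.
# bsub -M "$(gb_to_kb 4)" myjob.sh
gb_to_kb 4
```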

The difference between LSB_JOB_MEMLIMIT set to y and LSB_MEMLIMIT_ENFORCE set to y is that with LSB_JOB_MEMLIMIT, only the per-job memory limit enforced by LSF is enabled. The per-process memory limit enforced by the OS is disabled. With LSB_MEMLIMIT_ENFORCE set to y, both the per-job memory limit enforced by LSF and the per-process memory limit enforced by the OS are enabled.


LSB_JOB_MEMLIMIT disables per-process memory limit enforced by the OS and enables per-job memory limit enforced by LSF. When the total memory allocated to all processes in the job exceeds the memory limit, LSF sends the following signals to kill the job: SIGINT first, then SIGTERM, then SIGKILL.
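
Since SIGINT and SIGTERM arrive before SIGKILL, a job script has a chance to clean up after itself. The sketch below illustrates the idea with a shell trap; the scratch directory and the `cleanup` function are illustrative assumptions, not something the original describes.

```shell
#!/bin/sh
# Sketch of a job script that cleans up when LSF begins terminating it.
# The scratch directory and cleanup function are illustrative assumptions.
scratch=$(mktemp -d)

cleanup() {
    # Remove temporary data so a terminated job leaves nothing behind.
    rm -rf "$scratch"
}

# SIGINT and SIGTERM can be trapped; SIGKILL cannot, so cleanup has to
# finish before the final signal arrives.
trap 'cleanup; exit 130' INT
trap 'cleanup; exit 143' TERM

# The real workload would run here, for example:
# ./my_simulation > "$scratch/out.log"
```

Any cleanup done this way has to be short, because it only has the interval between signals to complete.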


On UNIX, the time interval between SIGINT, SIGTERM, and SIGKILL can be configured with the parameter JOB_TERMINATE_INTERVAL in lsb.params.
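
As a rough sketch, the setting might look like this in lsb.params. The value of 20 seconds is an arbitrary example chosen for illustration, not a recommendation from the original text.

```
# lsb.params (sketch; the value is an example only)
Begin Parameters
JOB_TERMINATE_INTERVAL = 20    # seconds between termination signals
End Parameters
```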



Reader comments:

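  1. charles.cl
    A question: our servers all have 512 GB of memory. We submit jobs in batches through scripts, and jobs very often fail because the execution server runs out of memory. We also tried the loadStop load threshold to suspend jobs when available memory drops below a certain value, but that suspends every job on the server, and someone has to step in manually and kill some programs to free memory before the jobs can continue. We have not been able to solve this.
    1. 团子精英
      I suggest migrating the jobs on one node to other nodes, closing that node with bclose, and then changing the loadStop threshold for that node alone.
      If it cannot be changed for a single node, try putting that node in a queue of its own and changing the threshold there. Then run test jobs.
      In the end, though, the better approach is probably to classify the jobs, identify the job types, and then limit the number of jobs.
      1. charles.cl
        Job count limits are generally based on CPU slots. Is it possible to limit the number of jobs per host without regard to CPU slots? Many thanks.
        1. charles.cl
          I have tried the MXJ parameter. It also depends on CPU slots: as soon as I submit one job that takes 2 slots, the second job pends.
        1. 团子精英
          The number of jobs per node can be limited by changing the MXJ parameter in lsb.hosts. See the manual for details.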
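  1. raychade
    Scenario: a user submits a job that is below the memory limit, and while it runs the software hangs and consumes a large amount of memory. Will LSF limit the memory or kill the job automatically in that case?
    1. 团子精英
      Feedback from the author:
      You may run into a bug here.
      We used to run vcs and it got stuck in an infinite loop; in the end we gave up.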