
Building a Slurm 22.05.3 job scheduling cluster on CentOS 7, with support for ordinary users running jobs


Slurm is an open-source workload scheduler for Linux and Unix, used on many of the world's supercomputers. Its main functions are:

1. Allocate compute-node resources to users so they can execute work;

2. Provide a framework for starting, executing, and monitoring work (typically parallel jobs) on the set of allocated nodes;

3. Arbitrate contention for resources by managing a queue of pending jobs.


Slurm Architecture


Environment

 Role            Server IP     Hostname  OS          Spec
 Control node    172.18.7.31   master    CentOS 7.9  8 cores / 16 GB
 Compute node 1  172.18.7.32   node01    CentOS 7.9  8 cores / 32 GB
 Compute node 2  172.18.7.33   node02    CentOS 7.9  8 cores / 32 GB


1. Base environment (run on all machines unless stated otherwise)

Disable the firewall and SELinux

systemctl stop firewalld
systemctl disable firewalld
sed -i -e  's/^SELINUX=.*/SELINUX=disabled/g' /etc/selinux/config
setenforce 0
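
A quick sanity check that both are off (setenforce only affects the running system; the config edit applies after a reboot):

systemctl is-active firewalld   # should print "inactive"
getenforce                      # "Permissive" now, "Disabled" after a reboot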


Switch to the Aliyun repositories

rm -rf /etc/yum.repos.d/*
curl -o /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
curl -o /etc/yum.repos.d/epel.repo https://mirrors.aliyun.com/repo/epel-7.repo

yum clean all
yum makecache fast -y


Internal (company) CentOS 7 repositories

rm -rf /etc/yum.repos.d/*
cat >  /etc/yum.repos.d/centos7.repo << EOF
[base]
name=base
baseurl=http://172.18.0.61/centos7/base
enabled=1
gpgcheck=0

[extras]
name=extras
baseurl=http://172.18.0.61/centos7/extras
enabled=1
gpgcheck=0

[updates]
name=updates
baseurl=http://172.18.0.61/centos7/updates
enabled=1
gpgcheck=0

[epel]
name=epel
baseurl=http://172.18.0.61/centos7/epel
enabled=1
gpgcheck=0
EOF

yum clean all
yum makecache fast -y


Set the hostnames; they must be unique (run the matching command on each node)

hostnamectl set-hostname master
hostnamectl set-hostname node01
hostnamectl set-hostname node02


Configure /etc/hosts

cat >>  /etc/hosts << EOF
172.18.7.31 master
172.18.7.32 node01
172.18.7.33 node02
EOF


Speed up SSH logins

echo "UseDNS no" >> /etc/ssh/sshd_config
systemctl restart sshd


Install required packages

yum -y install net-tools wget vim ntpdate chrony htop glances nfs-utils rpcbind python3


Time synchronization with ntpdate

# Public NTP server
ntpdate time1.aliyun.com
echo "*/5 * * * * /usr/sbin/ntpdate time1.aliyun.com" >> /var/spool/cron/root
timedatectl set-timezone Asia/Shanghai
hwclock --systohc

# Internal NTP server
ntpdate 172.18.0.162
echo "*/5 * * * * /usr/sbin/ntpdate 172.18.0.162" >> /var/spool/cron/root
timedatectl set-timezone Asia/Shanghai
hwclock --systohc


Configure passwordless SSH

# Run on the control node
echo y| ssh-keygen -t rsa -P '' -f  ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub  -o  StrictHostKeyChecking=no root@node01
ssh-copy-id -i ~/.ssh/id_rsa.pub  -o  StrictHostKeyChecking=no root@node02
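
A quick loop to confirm passwordless login works (hostnames from the /etc/hosts entries above):

# Each line should print a node's hostname without a password prompt
for h in node01 node02; do ssh -o BatchMode=yes "$h" hostname; done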


2. Configure Munge (run on all machines unless stated otherwise)


Create the munge user

The munge user must have the same UID and GID on the master node and every compute node, and Munge must be installed on all nodes:

groupadd -g 1108 munge
useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
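
With the user created on every node, a check like the following (run from the control node, using the passwordless SSH set up earlier) confirms the IDs match:

# Every line should report uid=1108 gid=1108 for munge
for h in master node01 node02; do ssh "$h" id munge; done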


Seed the entropy pool

# Install
yum install -y rng-tools

# Use /dev/urandom as the entropy source
rngd -r /dev/urandom
 
sed -i 's#^ExecStart.*#ExecStart=/sbin/rngd -f -r /dev/urandom#g'  /usr/lib/systemd/system/rngd.service

systemctl daemon-reload
systemctl start rngd
systemctl enable rngd
systemctl status rngd


Install Munge. Munge is an authentication service that validates the UID and GID of processes on local or remote hosts.

yum install munge munge-libs munge-devel -y


Create the global key on the master node

# Run on the control node (either command generates a key;
# the dd line overwrites the key written by create-munge-key)
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key


Sync the key to all compute nodes

# Run on the control node
scp -p /etc/munge/munge.key root@node01:/etc/munge
scp -p /etc/munge/munge.key root@node02:/etc/munge

# Run on the compute nodes
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key


Start Munge on all nodes

systemctl restart munge
systemctl enable munge
systemctl status munge


Test the Munge service: verify that each compute node can authenticate against the control node

# Generate and view a credential locally
munge -n

# Decode locally
munge -n | unmunge

# Verify a compute node: decode remotely
munge -n | ssh node01 unmunge

# Benchmark Munge credential generation
remunge


3. Configure Slurm (run on all machines unless stated otherwise)


Create the slurm user

groupadd -g 1109 slurm
useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm


Install Slurm build dependencies

yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel http-parser-devel json-c-devel libjwt  libjwt-devel -y


Build and install Slurm

# Download page
# https://download.schedmd.com/slurm/

wget https://download.schedmd.com/slurm/slurm-22.05.3.tar.bz2

rpmbuild -ta --with mysql --with slurmrestd --with jwt slurm-22.05.3.tar.bz2
cd /root/rpmbuild/RPMS/x86_64/
yum localinstall -y slurm-*

The --with slurmrestd option enables the RESTful API.

 

Configure Slurm on the control node

# Run on the control node
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf

cat >  /etc/slurm/slurm.conf << EOF

ClusterName=cluster
# SlurmctldHost=master

ControlMachine=master
ControlAddr=172.18.7.31
 
#
SlurmctldDebug=info
SlurmdDebug=debug3
GresTypes=gpu
 
MpiDefault=none
ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
# Fix Mentioned Error
# TaskPluginParam=Sched
TaskPluginParam=verbose
 
# TIMERS
#InactiveLimit=0
#KillWait=30
#ResumeTimeout=600
MinJobAge=172800
#OverTimeLimit=0
#SlurmctldTimeout=12
#SlurmdTimeout=300
#Waittime=0

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# LOGGING AND ACCOUNTING
AccountingStorageEnforce=limits
AccountingStorageHost=master
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd

# Fix Mentioned Error
# AccountingStoreJobComment=YES
AccountingStoreFlags=job_comment


#JobCompHost=localhost
#JobCompPass=123456
#JobCompPort=3306
#JobCompType=jobcomp/mysql
#JobCompUser=root
#JobAcctGatherFrequency=1
#JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/ctld/jwt_hs256.key

MaxNodeCount=1000
TreeWidth=65533
 
# COMPUTE NODES 
NodeName=master,node[01-02] CPUs=4 RealMemory=6000 State=UNKNOWN
PartitionName=compute Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP AllowAccounts=zkxy,root

EOF
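
Note that CPUs=4 RealMemory=6000 above are lower than the 8-core machines in the environment table; whatever values are used must not exceed the real hardware, or slurmd will drain the node. To print a NodeName line derived from a node's actual hardware, run:

# Outputs a ready-made NodeName=... line for slurm.conf
slurmd -C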


Copy the control node's configuration files to the compute nodes

# Run on the control node
scp /etc/slurm/*.conf node01:/etc/slurm/
scp /etc/slurm/*.conf node02:/etc/slurm/


Create the Slurm directories and set ownership (control and compute nodes)

mkdir -p /var/spool/slurm
chown slurm: /var/spool/slurm
mkdir -p /var/log/slurm
chown slurm: /var/log/slurm


Configure Slurm accounting on the control node. Accounting records capture information about jobs and job steps; they can be written to a plain text file or to a database. A text file grows without bound, so the simplest approach is to store the records in MySQL.


MySQL 5.7 can be installed on CentOS 7 via yum (optionally with a custom data directory).


Create the slurm database user

# MySQL 5.7
grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'Slurm*1234' with grant option;

# MySQL 8.0
CREATE USER 'slurm'@'%' identified with mysql_native_password  by 'Slurm*1234';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'%';
flush privileges;
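
A quick connectivity check from the control node (a sketch; assumes the MySQL server at 172.18.0.191 configured in slurmdbd.conf below, and the mysql client installed):

# Should list the grants created above; an access-denied error means the
# user, password, or host restriction needs another look
mysql -h 172.18.0.191 -u slurm -p'Slurm*1234' -e 'SHOW GRANTS;'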


Configure slurmdbd.conf

# Run on the control node
cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
 
cat >  /etc/slurm/slurmdbd.conf << 'EOF'

AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=172.18.7.31
DbdHost=master
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=172.18.0.191
StorageUser=slurm
StoragePass=Slurm*1234
StorageLoc=slurm_acct_db  # database name; slurmdbd creates it automatically
StoragePort=3306
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/ctld/jwt_hs256.key

EOF


Set permissions

# Run on the control node
chown slurm: /etc/slurm/slurmdbd.conf
chown slurm: /etc/slurm/slurm.conf


Add the JWT key on the controller (inside the StateSaveLocation directory)

mkdir -p /var/spool/slurm/ctld

dd if=/dev/random of=/var/spool/slurm/ctld/jwt_hs256.key bs=32 count=1
chown slurm:slurm /var/spool/slurm/ctld/jwt_hs256.key
chmod 0600 /var/spool/slurm/ctld/jwt_hs256.key
# chown root:root /etc/slurm
chmod 0755 /var/spool/slurm/ctld
chown slurm:slurm /var/spool/slurm/ctld

 

Start the services

# Start slurmdbd on the control node
systemctl restart slurmdbd
systemctl enable slurmdbd
systemctl status slurmdbd

# Start slurmctld on the control node
systemctl restart slurmctld
systemctl enable slurmctld
systemctl status slurmctld

# Start slurmd on the compute nodes
systemctl restart slurmd
systemctl enable slurmd
systemctl status slurmd

# If a service fails to start, run it in the foreground for verbose output
slurmdbd -Dvvv
slurmctld -Dvvv
slurmd -Dvvv


4. Verify the Slurm cluster

Create a test user

useradd zkxy
echo 123456 | passwd --stdin zkxy


Check the cluster

# Can be run on the control node or any compute node

# View the cluster
sinfo
scontrol show partition
scontrol show node

# Submit a job
srun -N2 hostname
scontrol show jobs

# View the job queue
squeue -a
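
Beyond srun, a minimal batch job exercises the normal submission path (a sketch; the compute partition name comes from slurm.conf above):

# Submit a two-node batch job, then check the queue and accounting
cat > test.sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --nodes=2
srun hostname
EOF

sbatch test.sbatch
squeue
sacct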


Create a user to run slurmrestd

useradd whq
echo whq | passwd --stdin whq


Run the Slurm REST API (slurmrestd cannot run as root or as the SlurmUser)

cat > /etc/slurm/slurmrestd.conf << 'EOF'
include /etc/slurm/slurm.conf
AuthType=auth/jwt
EOF

chown slurm:slurm /etc/slurm/slurmrestd.conf

su - whq
slurmrestd -f /etc/slurm/slurmrestd.conf 0.0.0.0:6688 -a jwt -s openapi/v0.0.36 
slurmrestd -f /etc/slurm/slurmrestd.conf -a rest_auth/jwt -s openapi/v0.0.36 -vvv 0.0.0.0:6688


Create a systemd service

cat > /usr/lib/systemd/system/slurmrestd.service <<EOF
[Unit]
Description=slurmrestd service
After=network.service

[Service]
Type=simple  

User=whq
Group=whq
WorkingDirectory=/usr/sbin
ExecStart=/usr/sbin/slurmrestd -f /etc/slurm/slurmrestd.conf -a rest_auth/jwt -s openapi/v0.0.36 -vvv 0.0.0.0:6688

Restart=always
 
ProtectSystem=full
PrivateDevices=yes
PrivateTmp=yes
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl stop slurmrestd
systemctl restart slurmrestd
systemctl enable slurmrestd
systemctl status slurmrestd


Obtain a token (the default lifespan is 1800 seconds; the maximum is 99999999999)

scontrol token lifespan=999999999 username=whq
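
The command prints SLURM_JWT=<token>; that value goes in the X-SLURM-USER-TOKEN header. A quick curl check of the ping endpoint (a sketch; assumes slurmrestd is listening on master:6688 as started above):

# Paste the token value printed by scontrol
export SLURM_JWT="paste-token-here"

curl -s \
  -H "X-SLURM-USER-NAME: whq" \
  -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
  http://master:6688/slurm/v0.0.36/ping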


If a node is in the down state with Reason=Not responding and restarting the services does not help, try:

# scontrol accepts hostlist ranges, so one command covers all nodes
scontrol update NodeName=node[01-08] State=RESUME


Test the API from Python

import requests

# Ping endpoint of the REST API (host/port of the slurmrestd instance)
url = 'http://172.18.0.115:6688/slurm/v0.0.36/ping'

headers = {
    'X-SLURM-USER-NAME': 'whq',
    'X-SLURM-USER-TOKEN': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2MzQxOTM4NzMsImlhdCI6MTYzNDE5MjA3Mywic3VuIjoid2hxIn0.82HpB4ss96Iw7o9JAzDp8WGRfFWDOCbPzx-J3Y5nK_U',
}

response = requests.get(url, headers=headers)
print(response.text)


Slurm REST API references

https://app.swaggerhub.com/apis/rherrick/slurm-rest_api/0.0.35#/


https://slurm.schedmd.com/SLUG20/REST_API.pdf
https://slurm.schedmd.com/SLUG19/REST_API.pdf




Remote calls to the REST API were occasionally slow while calls from inside the cluster were fast, so nginx is used to forward port 6688 to 16688, which speeds up external API calls.

yum install nginx -y

cat > /etc/nginx/conf.d/slurm.conf << 'EOF'
upstream backend {
    server 127.0.0.1:6688; 
}

server {
    listen 16688;
    server_name localhost;
    location / {
        proxy_pass http://backend;
    }
}
EOF

systemctl restart nginx
systemctl enable nginx
systemctl status nginx
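
A quick check that the proxy path works (same headers as before, now against port 16688):

# Ping through the nginx listener instead of hitting slurmrestd directly
curl -s \
  -H "X-SLURM-USER-NAME: whq" \
  -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
  http://master:16688/slurm/v0.0.36/ping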


References

https://www.cnblogs.com/liu-shaobo/p/13285839.html
https://blog.csdn.net/kongxx/article/details/52550653
https://www.jianshu.com/p/c7cf800656dc
https://www.jianshu.com/p/e560b19dbd3e


JWT references

https://slurm.schedmd.com/jwt.html
https://elwe.rhrk.uni-kl.de/documentation/jwt.html


Slurm user manual (Chinese translation)

https://docs.slurm.cn/users/


Allowing ordinary users to run jobs: two mechanisms are available, AllowAccounts and AllowGroups (see the caveat on AllowGroups below).


AllowAccounts: account-based access control for the partition

AccountingStorageEnforce=limits
…
PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP AllowAccounts=zkxy


The account names after AllowAccounts= must be created by hand; the steps follow, with a quick end-to-end check after them.

# List clusters
sacctmgr list cluster

# The cluster name must match ClusterName in slurm.conf; if that cluster
# already exists, there is no need to create it again
sacctmgr add cluster cluster

# Add an account; it must be created in the cluster named by ClusterName
# in slurm.conf. Note: adding root as an account was not enough by itself;
# the partition had to be set to AllowAccounts=zkxy,root
sacctmgr add account name=zkxy cluster=cluster

# List accounts
sacctmgr list account

# Add a user to the account and assign a QOS
sacctmgr add user name=admin account=zkxy qos=normal cluster=cluster

# List associations
sacctmgr list assoc

# Run a job
srun -N2 hostname

# The OS user must be created on every node
useradd admin
echo 123456 | passwd --stdin admin
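
With the account, the association, and the OS user all in place, a submission as that user confirms the AllowAccounts setting end to end:

# Run a job as the new user; it should land on the compute partition
su - admin -c 'srun -N2 hostname'

# The job should also show up in the accounting records
sacct -u admin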


AllowGroups: group-based access control for the partition. (In testing, this mechanism is not independent: it only takes effect together with AllowAccounts, which greatly limits its usefulness.)

AccountingStorageEnforce=limits
…
PartitionName=compute Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP AllowGroups=edauser AllowAccounts=zkxy

The edauser group after AllowGroups= is an ordinary group name from /etc/group.
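
A minimal sketch of wiring a user into such a group (edauser stands for whatever group the partition lists; the admin user comes from the account steps above):

# Create the group and add an existing user to it (repeat on every node
# unless groups are managed centrally, e.g. via LDAP)
groupadd edauser
usermod -aG edauser admin

# Make slurmctld re-read slurm.conf after editing the partition line
scontrol reconfigure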


Mount shared NFS storage

mkdir /data2

mount -t nfs -o nolock,nfsvers=3 172.18.0.21:/mnt/UserDataTemp/UserTemp /data2

echo "172.18.0.21:/mnt/UserDataTemp/UserTemp /data2 nfs defaults 0 0" >> /etc/fstab


If slurmdbd fails to start with the error:

slurmdbd: error: Database settings not recommended values: innodb_lock_wait_timeout

add the following to the [mysqld] section of /etc/my.cnf and restart MySQL:

[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
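
After editing my.cnf, restart MySQL and confirm the value took effect (the service name mysqld matches the yum-installed MySQL 5.7):

systemctl restart mysqld
mysql -e "SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"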


Reference

https://www.cnblogs.com/dahu-daqing/p/12693334.html

Last updated: 2022-10-17 14:49:24