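This article sets up a two-node cluster based on Slurm, Munge, MariaDB, OpenMPI, and LBNL-NHC node health checks. Node u24u04s1 is the master (management node) and u24u04s2 is node1 (compute node); both run Ubuntu Server 24.04 LTS, a Debian-family distribution: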
| Node name | IP address | Role |
|---|---|---|
| u24u04s1 | 192.168.122.125 | master |
| u24u04s2 | 192.168.122.126 | node1 |
The Slurm architecture is as follows (figure omitted):

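1. Preparation
1.1. Network configuration
- Make sure the two nodes can reach each other over the network. Static IPs are recommended (for example, master: 192.168.122.125, node1: 192.168.122.126).
- Configure /etc/hosts, adding the following lines on both nodes:
192.168.122.125 u24u04s1
192.168.122.126 u24u04s2
- Disable the firewall (or open the required ports, such as Slurm's 6817-6819). Also disable SELinux where it is in use; Ubuntu uses AppArmor by default rather than SELinux.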
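The /etc/hosts requirement above can be checked with a small helper. This is a minimal sketch, not part of the original procedure: the function name `hosts_has_entry` is made up for illustration, and it only confirms that a file maps a given IP to a given name.

```shell
# Return 0 if FILE maps IP to NAME (comments and blank lines are skipped).
# hosts_has_entry is a hypothetical helper name, not a system command.
hosts_has_entry() {
    local file=$1 ip=$2 name=$3
    awk -v ip="$ip" -v name="$name" '
        /^[[:space:]]*#/ || NF == 0 { next }   # skip comments and blank lines
        $1 == ip {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break           # stop at a trailing comment
                if ($i == name) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Usage on either node (addresses and names as used in this article):
#   hosts_has_entry /etc/hosts 192.168.122.125 u24u04s1 && echo "master ok"
#   hosts_has_entry /etc/hosts 192.168.122.126 u24u04s2 && echo "node1 ok"
#   ping -c 1 u24u04s2    # then confirm the name actually resolves and answers
```

If the firewall stays enabled, the Slurm port range would need to be opened separately; the exact rule depends on which firewall front end is in use.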
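1.2. Passwordless SSH login
- Generate a key pair on u24u04s1 and copy the public key to u24u04s2 (and to u24u04s1 itself). This step was completed earlier, so it is only summarized here:
$ ssh-keygen -t rsa # press Enter at every prompt
$ ssh-copy-id u24u04s1
$ ssh-copy-id u24u04s2
- Verify: each server can run commands on the other over ssh without a password.
# Run a remote command from master
root@u24u04s1:~# ssh u24u04s2 hostname
u24u04s2
root@u24u04s1:~#
# Run a remote command from node1
root@u24u04s2:~# ssh u24u04s1 hostname
u24u04s1
root@u24u04s2:~#
The output above shows that passwordless SSH works in both directions.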
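For more than a couple of hosts, the same verification can be scripted. The sketch below assumes the hostnames from this article; `check_passwordless_ssh` is an illustrative name. It relies on the OpenSSH option `BatchMode=yes`, which makes ssh fail instead of prompting when key authentication does not work.

```shell
# Print "ok <host>" or "FAIL <host>" for each host given as an argument;
# return non-zero if any host could not be reached without a password.
# check_passwordless_ssh is a hypothetical helper name.
check_passwordless_ssh() {
    local rc=0 host
    for host in "$@"; do
        # BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "ok $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (run on each node in turn):
#   check_passwordless_ssh u24u04s1 u24u04s2
```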
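2. Install base dependencies
The build environment is already installed on both nodes. The check looks like this (shown for u24u04s2):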
root@u24u04s2:~# dpkg -l | egrep 'gcc|make|gcc-c++|openssh-client'
ii gcc 4:13.2.0-7ubuntu1 amd64 GNU C compiler
ii gcc-13 13.3.0-6ubuntu2~24.04 amd64 GNU C compiler
ii gcc-13-base:amd64 13.3.0-6ubuntu2~24.04 amd64 GCC, the GNU Compiler Collection (base package)
ii gcc-13-x86-64-linux-gnu 13.3.0-6ubuntu2~24.04 amd64 GNU C compiler for the x86_64-linux-gnu architecture
ii gcc-14-base:amd64 14.2.0-4ubuntu2~24.04 amd64 GCC, the GNU Compiler Collection (base package)
ii gcc-x86-64-linux-gnu 4:13.2.0-7ubuntu1 amd64 GNU C compiler for the amd64 architecture
ii libgcc-13-dev:amd64 13.3.0-6ubuntu2~24.04 amd64 GCC support library (development files)
ii libgcc-s1:amd64 14.2.0-4ubuntu2~24.04 amd64 GCC support library
ii make 4.3-4.1build2 amd64 utility for directing compilation
ii openssh-client 1:9.6p1-3ubuntu13.11 amd64 secure shell (SSH) client, for secure access to remote machines
ii xxd 2:9.1.0016-1ubuntu7.8 amd64 tool to make (or reverse) a hex dump
root@u24u04s2:~#
PostgreSQL was compiled and installed on these nodes earlier, so no additional build tools need to be installed in this step.
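A check by command name can complement the package listing. Note that the `gcc-c++` alternative in the pattern above is the RHEL-family package name; on Ubuntu the C++ compiler is packaged as g++, which does not appear in the listing, so it may be worth checking for it explicitly. The helper below is a sketch with an illustrative name, `missing_commands`.

```shell
# Print each command from the argument list that is not found in PATH.
# missing_commands is a hypothetical helper name.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Usage on each node:
#   missing_commands gcc g++ make ssh
# If anything is printed, one way to install the usual toolchain on Ubuntu is:
#   apt install -y build-essential openssh-client
```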
3. Install Munge (authentication)
3.1. Install on both nodes
Check whether munge is installed:
# Check the munge installation status on the master node
root@u24u04s1:~# apt list | egrep munge
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libdata-munge-perl/noble 0.097-3 all
libmoosex-mungehas-perl/noble 0.011-2 all
libmunge-dev/noble 0.5.15-4build1 amd64
libmunge-maven-plugin-java/noble 1.0-2 all
libmunge2/noble 0.5.15-4build1 amd64
libpod-elemental-perlmunger-perl/noble 0.200007-1 all
munge/noble 0.5.15-4build1 amd64
root@u24u04s1:~# dpkg -l | egrep munge
root@u24u04s1:~#
# Check the munge installation status on node1
root@u24u04s2:~# apt list | egrep munge
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libdata-munge-perl/noble 0.097-3 all
libmoosex-mungehas-perl/noble 0.011-2 all
libmunge-dev/noble 0.5.15-4build1 amd64
libmunge-maven-plugin-java/noble 1.0-2 all
libmunge2/noble 0.5.15-4build1 amd64
libpod-elemental-perlmunger-perl/noble 0.200007-1 all
munge/noble 0.5.15-4build1 amd64
root@u24u04s2:~# dpkg -l | egrep munge
root@u24u04s2:~#
The output above shows that Munge is not installed on either node. The installation steps are as follows:
# Install on the master node
root@u24u04s1:~# apt install -y munge libmunge-dev libmunge2
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
ant ant-optional antlr4 clamav clamav-base clamav-freshclam default-jdk default-jdk-doc
...
Need to get 153 kB of archives.
After this operation, 532 kB of additional disk space will be used.
Get:1 https://mirrors.tuna.tsinghua.edu.cn/ubuntu noble/universe amd64 libmunge2 amd64 0.5.15-4build1 [14.7 kB]
Get:2 https://mirrors.tuna.tsinghua.edu.cn/ubuntu noble/universe amd64 munge amd64 0.5.15-4build1 [102 kB]
Get:3 https://mirrors.tuna.tsinghua.edu.cn/ubuntu noble/universe amd64 libmunge-dev amd64 0.5.15-4build1 [35.6 kB]
Fetched 153 kB in 3s (52.9 kB/s)
Selecting previously unselected package libmunge2:amd64.
(Reading database ... 247214 files and directories currently installed.)
Preparing to unpack .../libmunge2_0.5.15-4build1_amd64.deb ...
Unpacking libmunge2:amd64 (0.5.15-4build1) ...
Selecting previously unselected package munge.
Preparing to unpack .../munge_0.5.15-4build1_amd64.deb ...
Unpacking munge (0.5.15-4build1) ...
Selecting previously unselected package libmunge-dev.
Preparing to unpack .../libmunge-dev_0.5.15-4build1_amd64.deb ...
Unpacking libmunge-dev (0.5.15-4build1) ...
Setting up libmunge2:amd64 (0.5.15-4build1) ...
Setting up munge (0.5.15-4build1) ...
invoke-rc.d: policy-rc.d denied execution of start.
Created symlink /etc/systemd/system/multi-user.target.wants/munge.service → /usr/lib/systemd/system/munge.service.
/usr/sbin/policy-rc.d returned 101, not running 'start munge.service'
Setting up libmunge-dev (0.5.15-4build1) ...
Processing triggers for libc-bin (2.39-0ubuntu8.5) ...
Processing triggers for man-db (2.12.0-4build2) ...
Scanning processes...
Scanning candidates...
Scanning linux images...
Running kernel seems to be up-to-date.
Restarting services...
Service restarts being deferred:
systemctl restart NetworkManager.service
/etc/needrestart/restart.d/dbus.service
systemctl restart networkd-dispatcher.service
systemctl restart systemd-logind.service
systemctl restart unattended-upgrades.service
systemctl restart wpa_supplicant.service
No containers need to be restarted.
User sessions running outdated binaries:
root @ session #166: login[1029]
root @ session #168: sshd[21203]
root @ user manager service: systemd[21019]
No VM guests are running outdated hypervisor (qemu) binaries on this host.
root@u24u04s1:~#
# Check the Munge installation and service status
root@u24u04s1:~# systemctl list-unit-files --type service | egrep munge
munge.service enabled enabled
root@u24u04s1:~# systemctl status munge
○ munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: enabled)
Active: inactive (dead)
Docs: man:munged(8)
root@u24u04s1:~#
# Install on node1
root@u24u04s2:~# apt install -y munge libmunge-dev libmunge2
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
munge is already the newest version (0.5.15-4build1).
libmunge-dev is already the newest version (0.5.15-4build1).
libmunge2 is already the newest version (0.5.15-4build1).
0 upgraded, 0 newly installed, 0 to remove and 172 not upgraded.
root@u24u04s2:~# dpkg -l | egrep 'munge|libmunge'
ii libmunge-dev 0.5.15-4build1 amd64 authentication service for credential -- development package
ii libmunge2:amd64 0.5.15-4build1 amd64 authentication service for credential -- library package
ii munge 0.5.15-4build1 amd64 authentication service to create and validate credentials
root@u24u04s2:~#
# Check the Munge installation and service status
root@u24u04s2:~# systemctl list-unit-files --type service | egrep munge
munge.service enabled enabled
root@u24u04s2:~# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-08-01 15:58:15 UTC; 5min ago
Docs: man:munged(8)
Process: 951 ExecStart=/usr/sbin/munged $OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 963 (munged)
Tasks: 4 (limit: 4605)
Memory: 816.0K (peak: 1.2M)
CPU: 8ms
CGroup: /system.slice/munge.service
└─963 /usr/sbin/munged
Aug 01 15:58:15 u24u04s2 systemd[1]: Starting munge.service - MUNGE authentication service...
Aug 01 15:58:15 u24u04s2 (munged)[951]: munge.service: Referenced but unset environment variable evaluates to an empty string: OPTIONS
Aug 01 15:58:15 u24u04s2 systemd[1]: Started munge.service - MUNGE authentication service.
root@u24u04s2:~#
Munge is now installed. Installing these packages automatically creates the munge user and group on the system:
# Check the munge user and group on the master node
root@u24u04s1:~# id munge
uid=111(munge) gid=113(munge) groups=113(munge)
root@u24u04s1:~#
# Check the munge user and group on node1
root@u24u04s2:~# id munge
uid=110(munge) gid=112(munge) groups=112(munge)
root@u24u04s2:~#
3.2. Generate the key on master and distribute it
# Generate the key on the master node
root@u24u04s1:~# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
1024+0 records in
1024+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.00272008 s, 376 kB/s
root@u24u04s1:~#
root@u24u04s1:~# ls -lh /etc/munge/munge.key
-rw------- 1 munge munge 1.0K Aug 1 16:08 /etc/munge/munge.key # Note: owner and group are both munge; the mode shown is 600 (owner read/write only)
root@u24u04s1:~#
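The key file of random bytes must be owned by user and group munge and must not be accessible to group or others (mode 600 as shown above, or the stricter 400). If it is not, fix it with the following commands: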
$ chown munge:munge /etc/munge/munge.key
$ chmod 400 /etc/munge/munge.key
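The ownership and mode requirement can be checked mechanically before starting the service. The following is a sketch under the assumption that GNU `stat` is available (as on Ubuntu); `key_perms_ok` is an illustrative name, and it accepts either mode 400 or 600.

```shell
# Return 0 if the key file's mode is 400 or 600. With a second argument,
# also require that owner:group string (for example munge:munge).
# key_perms_ok is a hypothetical helper name; stat -c is GNU coreutils syntax.
key_perms_ok() {
    local file=$1 want_owner=${2:-}
    local mode owner
    mode=$(stat -c '%a' "$file") || return 1
    owner=$(stat -c '%U:%G' "$file") || return 1
    case $mode in
        400|600) ;;
        *) echo "bad mode $mode on $file" >&2; return 1 ;;
    esac
    if [ -n "$want_owner" ] && [ "$owner" != "$want_owner" ]; then
        echo "bad owner $owner on $file" >&2
        return 1
    fi
}

# Usage on each node:
#   key_perms_ok /etc/munge/munge.key munge:munge && echo "key file looks fine"
```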
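Distribute the key file to the other nodes; here that is only node1. When copying as root to a node where /etc/munge/munge.key does not exist yet, the new file is owned by root, so re-check ownership and permissions after the copy:
# Copy to node1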
root@u24u04s1:~# scp /etc/munge/munge.key u24u04s2:/etc/munge/
munge.key 100% 1024 1.7MB/s 00:00
root@u24u04s1:~# ssh u24u04s2 'ls -lh /etc/munge' # check the file's owner, group, and permissions; they meet the requirements here
total 4.0K
-rw------- 1 munge munge 1.0K Aug 1 16:11 munge.key
root@u24u04s1:~#
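Both nodes must hold an identical key, so comparing checksums after the copy is a cheap sanity check. This is a sketch using `sha256sum`; `file_digest` is an illustrative helper name, and the hostname is the one used in this article.

```shell
# Print the SHA-256 digest of a file (the first field of sha256sum's output).
# file_digest is a hypothetical helper name.
file_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Usage on the master:
#   local_sum=$(file_digest /etc/munge/munge.key)
#   remote_sum=$(ssh u24u04s2 sha256sum /etc/munge/munge.key | awk '{ print $1 }')
#   if [ "$local_sum" = "$remote_sum" ]; then echo "keys match"; else echo "keys differ"; fi
```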
3.3. Start Munge on both nodes and enable it at boot
# Start munge on the master node and enable it at boot
root@u24u04s1:~# systemctl enable --now munge
Synchronizing state of munge.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install enable munge
root@u24u04s1:~# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-08-01 16:13:57 UTC; 3s ago
Docs: man:munged(8)
Process: 120745 ExecStart=/usr/sbin/munged $OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 120747 (munged)
Tasks: 4 (limit: 4605)
Memory: 664.0K (peak: 1.2M)
CPU: 6ms
CGroup: /system.slice/munge.service
└─120747 /usr/sbin/munged
Aug 01 16:13:57 u24u04s1 systemd[1]: Starting munge.service - MUNGE authentication service...
Aug 01 16:13:57 u24u04s1 (munged)[120745]: munge.service: Referenced but unset environment variable evaluates to an empty string: OPTIONS
Aug 01 16:13:57 u24u04s1 systemd[1]: Started munge.service - MUNGE authentication service.
root@u24u04s1:~#
# Start munge on node1 and enable it at boot
root@u24u04s2:~# systemctl enable --now munge
Synchronizing state of munge.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install enable munge
root@u24u04s2:~# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-08-01 15:58:15 UTC; 16min ago
Docs: man:munged(8)
Main PID: 963 (munged)
Tasks: 4 (limit: 4605)
Memory: 816.0K (peak: 1.2M)
CPU: 9ms
CGroup: /system.slice/munge.service
└─963 /usr/sbin/munged
Aug 01 15:58:15 u24u04s2 systemd[1]: Starting munge.service - MUNGE authentication service...
Aug 01 15:58:15 u24u04s2 (munged)[951]: munge.service: Referenced but unset environment variable evaluates to an empty string: OPTIONS
Aug 01 15:58:15 u24u04s2 systemd[1]: Started munge.service - MUNGE authentication service.
root@u24u04s2:~# systemctl restart munge # munge was already running on this node after the package install, so it must be restarted to pick up the new key file
root@u24u04s2:~# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-08-01 16:14:54 UTC; 3s ago
Docs: man:munged(8)
Process: 1516 ExecStart=/usr/sbin/munged $OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 1518 (munged)
Tasks: 4 (limit: 4605)
Memory: 652.0K (peak: 1.5M)
CPU: 6ms
CGroup: /system.slice/munge.service
└─1518 /usr/sbin/munged
Aug 01 16:14:54 u24u04s2 systemd[1]: Starting munge.service - MUNGE authentication service...
Aug 01 16:14:54 u24u04s2 (munged)[1516]: munge.service: Referenced but unset environment variable evaluates to an empty string: OPTIONS
Aug 01 16:14:54 u24u04s2 systemd[1]: Started munge.service - MUNGE authentication service.
root@u24u04s2:~#
Munge is now installed and running on both nodes.
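A running service does not by itself prove that the two nodes share a working key. A common check is to encode a credential with `munge -n` and decode it with `unmunge`, first locally and then on the other node over ssh. The wrappers below are a sketch; the function names are illustrative.

```shell
# Encode a credential locally and decode it locally.
# munge_local_ok and munge_remote_ok are hypothetical helper names.
munge_local_ok() {
    munge -n | unmunge >/dev/null
}

# Encode a credential locally and decode it on a remote host over ssh.
# This only succeeds if both hosts use the same key.
munge_remote_ok() {
    munge -n | ssh "$1" unmunge >/dev/null
}

# Usage on the master:
#   munge_local_ok && echo "local credential ok"
#   munge_remote_ok u24u04s2 && echo "remote credential ok"
```

If the remote check fails while the local one passes, mismatched keys or clock skew between the nodes are the usual suspects.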
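4. Install MariaDB (Slurm accounting information)
The database steps are performed on the master node only:
4.1. Install and start
# General steps on Ubuntu (the full session follows)
$ apt install -y mariadb-server
$ systemctl start mariadb
$ systemctl enable mariadb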
root@u24u04s1:~# apt install -y mariadb-server-core mariadb-server
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
ant ant-optional antlr4 clamav clamav-base clamav-freshclam default-jdk default-jdk-doc
...
Unpacking pv (1.8.5-2build1) ...
Setting up libconfig-inifiles-perl (3.000003-2) ...
Setting up galera-4 (26.4.16-2build4) ...
Setting up socat (1.8.0.0-4build3) ...
Setting up libdbd-mysql-perl:amd64 (4.052-1ubuntu3) ...
Setting up libmariadb3:amd64 (1:10.11.13-0ubuntu0.24.04.1) ...
Setting up pv (1.8.5-2build1) ...
Setting up liburing2:amd64 (2.5-1build1) ...
Setting up mariadb-client-core (1:10.11.13-0ubuntu0.24.04.1) ...
Setting up mariadb-server-core (1:10.11.13-0ubuntu0.24.04.1) ...
Setting up mariadb-client (1:10.11.13-0ubuntu0.24.04.1) ...
Setting up mariadb-server (1:10.11.13-0ubuntu0.24.04.1) ...
invoke-rc.d: policy-rc.d denied execution of stop.
invoke-rc.d: policy-rc.d denied execution of start.
Created symlink /etc/systemd/system/multi-user.target.wants/mariadb.service → /usr/lib/systemd/system/mariadb.service.
/usr/sbin/policy-rc.d returned 101, not running 'start mariadb.service'
Setting up mariadb-plugin-provider-bzip2 (1:10.11.13-0ubuntu0.24.04.1) ...
Setting up mariadb-plugin-provider-lzma (1:10.11.13-0ubuntu0.24.04.1) ...
Setting up mariadb-plugin-provider-lzo (1:10.11.13-0ubuntu0.24.04.1) ...
Setting up mariadb-plugin-provider-lz4 (1:10.11.13-0ubuntu0.24.04.1) ...
Setting up mariadb-plugin-provider-snappy (1:10.11.13-0ubuntu0.24.04.1) ...
Processing triggers for man-db (2.12.0-4build2) ...
Processing triggers for doc-base (0.11.2) ...
Processing 1 added doc-base file...
Processing triggers for libc-bin (2.39-0ubuntu8.5) ...
Processing triggers for mariadb-server (1:10.11.13-0ubuntu0.24.04.1) ...
Scanning processes...
Scanning candidates...
Scanning linux images...
Running kernel seems to be up-to-date.
Restarting services...
Service restarts being deferred:
systemctl restart NetworkManager.service
/etc/needrestart/restart.d/dbus.service
systemctl restart networkd-dispatcher.service
systemctl restart systemd-logind.service
systemctl restart unattended-upgrades.service
systemctl restart wpa_supplicant.service
No containers need to be restarted.
User sessions running outdated binaries:
root @ session #166: login[1029]
root @ session #168: sshd[21203]
root @ user manager service: systemd[21019]
No VM guests are running outdated hypervisor (qemu) binaries on this host.
root@u24u04s1:~#
root@u24u04s1:~# systemctl list-unit-files --type service | egrep mariadb
mariadb.service enabled enabled
mariadb@.service disabled enabled
root@u24u04s1:~#
root@u24u04s1:~# systemctl enable --now mariadb
Synchronizing state of mariadb.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install enable mariadb
root@u24u04s1:~# systemctl status mariadb
● mariadb.service - MariaDB 10.11.13 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; preset: enabled)
Active: active (running) since Fri 2025-08-01 16:20:48 UTC; 3s ago
Docs: man:mariadbd(8)
https://mariadb.com/kb/en/library/systemd/
Process: 121649 ExecStartPre=/usr/bin/install -m 755 -o mysql -g root -d /var/run/mysqld (code=exited, status=0/SUCCESS)
Process: 121651 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
Process: 121653 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (>
Process: 121727 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
Process: 121729 ExecStartPost=/etc/mysql/debian-start (code=exited, status=0/SUCCESS)
Main PID: 121714 (mariadbd)
Status: "Taking your SQL requests now..."
Tasks: 14 (limit: 30393)
Memory: 79.0M (peak: 82.3M)
CPU: 257ms
CGroup: /system.slice/mariadb.service
└─121714 /usr/sbin/mariadbd
Aug 01 16:20:48 u24u04s1 mariadbd[121714]: 2025-08-01 16:20:48 0 [Note] Plugin 'FEEDBACK' is disabled.
Aug 01 16:20:48 u24u04s1 mariadbd[121714]: 2025-08-01 16:20:48 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
Aug 01 16:20:48 u24u04s1 mariadbd[121714]: 2025-08-01 16:20:48 0 [Warning] You need to use --log-bin to make --expire-logs-days or --binlog-expire-logs-seconds work.
Aug 01 16:20:48 u24u04s1 mariadbd[121714]: 2025-08-01 16:20:48 0 [Note] InnoDB: Buffer pool(s) load completed at 250801 16:20:48
Aug 01 16:20:48 u24u04s1 mariadbd[121714]: 2025-08-01 16:20:48 0 [Note] Server socket created on IP: '127.0.0.1'.
Aug 01 16:20:48 u24u04s1 mariadbd[121714]: 2025-08-01 16:20:48 0 [Note] /usr/sbin/mariadbd: ready for connections.
Aug 01 16:20:48 u24u04s1 mariadbd[121714]: Version: '10.11.13-MariaDB-0ubuntu0.24.04.1' socket: '/run/mysqld/mysqld.sock' port: 3306 Ubuntu 24.04
Aug 01 16:20:48 u24u04s1 systemd[1]: Started mariadb.service - MariaDB 10.11.13 database server.
Aug 01 16:20:48 u24u04s1 /etc/mysql/debian-start[121732]: Upgrading MariaDB tables if necessary.
Aug 01 16:20:48 u24u04s1 /etc/mysql/debian-start[121743]: Checking for insecure root accounts.
root@u24u04s1:~#
MariaDB is now installed on the master node and the database service is running.
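In a scripted install, the next step should not run before the database is actually up. The sketch below polls `systemctl is-active` until the unit reports active; `wait_for_unit` is an illustrative name and the retry count is an arbitrary assumption.

```shell
# Poll `systemctl is-active` once per second until the unit is active
# or the tries run out. wait_for_unit is a hypothetical helper name.
wait_for_unit() {
    local unit=$1 tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        systemctl is-active --quiet "$unit" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage on the master:
#   wait_for_unit mariadb 15 && mariadb -e 'SELECT VERSION();'
```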
4.2. Configure the database
- Initialize the database (set the root password):
root@u24u04s1:~# mysql_secure_installation
NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MariaDB SERVERS IN PRODUCTION USE! PLEASE READ EACH STEP CAREFULLY!
In o





