ls and find hang on a folder. What to do?

Question

I have a Proxmox cluster with two nodes (s1 and s2). On s2, listing a certain directory hangs forever (as in this question):

$> strace -vf ls -l /etc/pve/nodes/s2
[...]
open("/etc/pve/nodes/s2", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_dev=makedev(0, 48), st_ino=5, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=0, st_gid=33, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2017-06-19T18:59:35+0300, st_mtime=2017-06-19T18:59:35+0300, st_ctime=2017-06-19T18:59:35+0300}) = 0
getdents(3,

find also hangs:

$> cd /etc/pve/nodes/s2
$> strace -vf find .
[...]
openat(AT_FDCWD, ".", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 4
fcntl(4, F_GETFD)                       = 0
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
fstat(4, {st_dev=makedev(0, 48), st_ino=5, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=0, st_gid=33, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2017-06-19T18:59:35+0300, st_mtime=2017-06-19T18:59:35+0300, st_ctime=2017-06-19T18:59:35+0300}) = 0
fcntl(4, F_GETFL)                       = 0x38800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
newfstatat(AT_FDCWD, ".", {st_dev=makedev(0, 48), st_ino=5, st_mode=S_IFDIR|0755, st_nlink=2, st_uid=0, st_gid=33, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2017-06-19T18:59:35+0300, st_mtime=2017-06-19T18:59:35+0300, st_ctime=2017-06-19T18:59:35+0300}, AT_SYMLINK_NOFOLLOW) = 0
fcntl(4, F_DUPFD, 3)                    = 5
fcntl(5, F_GETFD)                       = 0
fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
getdents(4,
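
When a listing blocks in getdents like this, it can help to probe the directory with a time limit so the interactive shell stays usable. The following is a minimal sketch, not part of the original post: the function name `probe_dir` and the 5-second default are assumptions, and a process stuck in uninterruptible sleep may not die even on SIGKILL, in which case this probe would hang as well.

```shell
# Probe a directory listing with a time limit (hypothetical helper).
# Prints "ok", "timeout", or "error"; returns 0 only for "ok".
probe_dir() {
    dir=$1
    limit=${2:-5}    # seconds; 5 is an arbitrary default
    status=0
    # coreutils timeout sends SIGKILL after $limit seconds
    timeout -s KILL "$limit" ls -A -- "$dir" >/dev/null 2>&1 || status=$?
    case $status in
        0)       echo "ok" ;;
        124|137) echo "timeout"; return 1 ;;   # 137 = 128+9 when KILL was used
        *)       echo "error"; return 1 ;;
    esac
}
```

For example, `probe_dir /etc/pve/nodes/s2 10` would report `timeout` instead of blocking the terminal, assuming the kernel lets the `ls` process be killed.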

The part about LVM is not relevant

I have one LVM physical volume:

$> pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda3
  VG Name               pve
  PV Size               1.82 TiB / not usable 3.07 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              476859
  Free PE               4039
  Allocated PE          472820
  PV UUID               fcuPa5-Wscw-wQI2-YXjI-SoMc-nQPe-1orltO

that is part of the pve group

$> pvs
  PV         VG  Fmt  Attr PSize PFree
  /dev/sda3  pve lvm2 a--  1.82t 15.78g

that has some logical volumes:

$> lvscan
  ACTIVE            '/dev/pve/swap' [8.00 GiB] inherit
  ACTIVE            '/dev/pve/root' [96.00 GiB] inherit
  ACTIVE            '/dev/pve/data' [1.70 TiB] inherit
  ACTIVE            '/dev/pve/vm-401-disk-1' [4.00 GiB] inherit
  [...]

The part about LVM is not relevant

df shows that /dev/fuse is mounted on /etc/pve:

$> df /etc/pve/nodes/s2
/dev/fuse          30720    36     30684   1% /etc/pve

I see some errors in dmesg like this:

[  483.990347] INFO: task lxc-pve-prestar:4588 blocked for more than 120 seconds.
[  483.990554]       Tainted: P          IO     4.15.18-16-pve #1
[  483.990721] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  483.990943] lxc-pve-prestar D    0  4588   4587 0x00000000
[  483.990945] Call Trace:
[  483.990947]  __schedule+0x3e0/0x870
[  483.990949]  ? path_parentat+0x3e/0x80
[  483.990951]  schedule+0x36/0x80
[  483.990953]  rwsem_down_write_failed+0x208/0x390
[  483.990955]  call_rwsem_down_write_failed+0x17/0x30
[  483.990957]  ? call_rwsem_down_write_failed+0x17/0x30
[  483.990959]  down_write+0x2d/0x40
[  483.990961]  filename_create+0x7e/0x160
[  483.990963]  SyS_mkdir+0x51/0x100
[  483.990965]  do_syscall_64+0x73/0x130
[  483.990967]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  483.990968] RIP: 0033:0x7ff84077a687
[  483.990969] RSP: 002b:00007fff343b4a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[  483.990971] RAX: ffffffffffffffda RBX: 000055ab07c8d010 RCX: 00007ff84077a687
[  483.990972] RDX: 0000000000000014 RSI: 00000000000001ff RDI: 000055ab0b26de70
[  483.990973] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[  483.990974] R10: 000055ab0b0e1f38 R11: 0000000000000246 R12: 000055ab084ced58
[  483.990975] R13: 000055ab0b222fd0 R14: 000055ab0b26de70 R15: 00000000000001ff

Apparently Proxmox uses the Proxmox Cluster File System, which gets mounted on /etc/pve, so this must be a network issue. I can ping between the two nodes in both directions.

root@s1:~# pvecm status
Quorum information
------------------
Date:             Sun Jun 23 07:11:24 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1/267728
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          4 10.0.0.5 (local)

root@s2:~# pvecm status
Quorum information
------------------
Date:             Sun Jun 23 07:14:11 2019
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2/192400
Quorate:          No

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 10.0.0.6 (local)
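
In the two outputs above, each node lists only itself under membership (Nodes: 1), and s2 reports Quorate: No. A small sketch for pulling those two fields out of `pvecm status` output so they can be compared across nodes or checked in a script; the function names are assumptions, and the parsing assumes the field layout shown above.

```shell
# Read `pvecm status` output on stdin and print the "Quorate:" value.
quorate_field() {
    awk -F': *' '$1 == "Quorate" { print $2; exit }'
}

# Read `pvecm status` output on stdin and print the "Nodes:" count.
nodes_field() {
    awk -F': *' '$1 == "Nodes" { print $2; exit }'
}
```

Usage would be along the lines of `pvecm status | quorate_field` on each node.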

root@s1:~# pveversion --verbose
proxmox-ve: 5.4-1 (running kernel: 4.15.18-16-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-4
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.10.15-1-pve: 4.10.15-15
pve-kernel-4.10.11-1-pve: 4.10.11-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-52
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

I have tested the connectivity between the two nodes. The result was as follows, so I guess we can conclude that multicast works.

root@s1:~# omping -m 239.192.109.7 -c 600 -i 1 -F -q s2 s1
s2 : waiting for response msg
s2 : waiting for response msg
s2 : joined (S,G) = (*, 239.192.109.7), pinging
s2 : given amount of query messages was sent

s2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.185/0.265/0.387/0.018
s2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.192/0.273/0.400/0.019

root@s2:~# omping -m 239.192.109.7 -c 600 -i 1 -F -q s2 s1
s1 : waiting for response msg
s1 : joined (S,G) = (*, 239.192.109.7), pinging
s1 : given amount of query messages was sent

s1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.164/0.345/0.390/0.020
s1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.183/0.369/0.410/0.020
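
To check the loss figures without reading every summary line by eye, the omping summary can be reduced to host, traffic type, and loss percentage. This is a sketch under the assumption that the summary lines look exactly like the ones above; `omping_loss` is a made-up name.

```shell
# Read omping output on stdin; for each summary line print
# "<host> <unicast|multicast> <loss percent>".
omping_loss() {
    awk '$2 == ":" && ($3 == "unicast," || $3 == "multicast,") && $4 == "xmt/rcv/%loss" {
        n = split($6, f, "/")        # $6 looks like "600/600/0%,"
        loss = f[n]
        gsub(/[%,]/, "", loss)       # keep only the number
        kind = $3
        sub(/,$/, "", kind)
        print $1, kind, loss
    }'
}
```

Any line where the last column is not 0 would point at lost packets for that host and traffic type.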

The hosts file reads

root@s1:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.0.0.5 s1 pvelocalhost
10.0.0.6 s2
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

and

root@s2:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.0.0.6 s2 pvelocalhost
10.0.0.5 s1
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

The corosync service is running (same story on s2):

root@s1:~# journalctl -u corosync.service --no-pager
-- Logs begin at Sat 2019-06-22 17:05:48 EEST, end at Sat 2019-06-22 17:47:20 EEST. --
Jun 22 17:05:53 s1 systemd[1]: Starting Corosync Cluster Engine...
Jun 22 17:05:53 s1 corosync[2713]:  [MAIN  ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Jun 22 17:05:53 s1 corosync[2713]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Jun 22 17:05:53 s1 corosync[2713]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.4-dirty'): started and ready to provide service.
Jun 22 17:05:53 s1 corosync[2713]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog systemd xmlconf qdevices qnetd snmp pie relro bindnow
Jun 22 17:05:54 s1 corosync[2713]:  [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jun 22 17:05:54 s1 corosync[2713]: warning [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Jun 22 17:05:54 s1 corosync[2713]: warning [MAIN  ] Please migrate config file to nodelist.
Jun 22 17:05:54 s1 corosync[2713]:  [MAIN  ] Please migrate config file to nodelist.
Jun 22 17:05:54 s1 corosync[2713]: notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 22 17:05:54 s1 corosync[2713]: notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun 22 17:05:54 s1 corosync[2713]:  [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 22 17:05:54 s1 corosync[2713]:  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Jun 22 17:05:54 s1 corosync[2713]: notice  [TOTEM ] The network interface [10.0.0.5] is now up.
Jun 22 17:05:54 s1 corosync[2713]: notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun 22 17:05:54 s1 corosync[2713]: info    [QB    ] server name: cmap
Jun 22 17:05:54 s1 corosync[2713]: notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Jun 22 17:05:54 s1 corosync[2713]: info    [QB    ] server name: cfg
Jun 22 17:05:54 s1 corosync[2713]: notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 22 17:05:54 s1 corosync[2713]: info    [QB    ] server name: cpg
Jun 22 17:05:54 s1 corosync[2713]: notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun 22 17:05:54 s1 corosync[2713]:  [TOTEM ] The network interface [10.0.0.5] is now up.
Jun 22 17:05:54 s1 corosync[2713]: notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun 22 17:05:54 s1 corosync[2713]: warning [WD    ] Watchdog not enabled by configuration
Jun 22 17:05:54 s1 corosync[2713]: warning [WD    ] resource load_15min missing a recovery key.
Jun 22 17:05:54 s1 corosync[2713]: warning [WD    ] resource memory_used missing a recovery key.
Jun 22 17:05:54 s1 corosync[2713]: info    [WD    ] no resources configured.
Jun 22 17:05:54 s1 corosync[2713]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun 22 17:05:54 s1 corosync[2713]: notice  [QUORUM] Using quorum provider corosync_votequorum
Jun 22 17:05:54 s1 corosync[2713]: notice  [QUORUM] This node is within the primary component and will provide service.
Jun 22 17:05:54 s1 corosync[2713]: notice  [QUORUM] Members[0]:
Jun 22 17:05:54 s1 corosync[2713]: notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 22 17:05:54 s1 corosync[2713]: info    [QB    ] server name: votequorum
Jun 22 17:05:54 s1 corosync[2713]: notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 22 17:05:54 s1 corosync[2713]: info    [QB    ] server name: quorum
Jun 22 17:05:54 s1 corosync[2713]: notice  [TOTEM ] A new membership (10.0.0.5:182116) was formed. Members joined: 1
Jun 22 17:05:54 s1 corosync[2713]:  [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun 22 17:05:54 s1 systemd[1]: Started Corosync Cluster Engine.
Jun 22 17:05:54 s1 corosync[2713]: warning [CPG   ] downlist left_list: 0 received
Jun 22 17:05:54 s1 corosync[2713]: notice  [QUORUM] Members[1]: 1
Jun 22 17:05:54 s1 corosync[2713]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 22 17:05:54 s1 corosync[2713]:  [QB    ] server name: cmap
Jun 22 17:05:54 s1 corosync[2713]:  [SERV  ] Service engine loaded: corosync configuration service [1]
Jun 22 17:05:54 s1 corosync[2713]:  [QB    ] server name: cfg
Jun 22 17:05:54 s1 corosync[2713]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 22 17:05:54 s1 corosync[2713]:  [QB    ] server name: cpg
Jun 22 17:05:54 s1 corosync[2713]:  [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun 22 17:05:54 s1 corosync[2713]:  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun 22 17:05:54 s1 corosync[2713]:  [WD    ] Watchdog not enabled by configuration
Jun 22 17:05:54 s1 corosync[2713]:  [WD    ] resource load_15min missing a recovery key.
Jun 22 17:05:54 s1 corosync[2713]:  [WD    ] resource memory_used missing a recovery key.
Jun 22 17:05:54 s1 corosync[2713]:  [WD    ] no resources configured.
Jun 22 17:05:54 s1 corosync[2713]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun 22 17:05:54 s1 corosync[2713]:  [QUORUM] Using quorum provider corosync_votequorum
Jun 22 17:05:54 s1 corosync[2713]:  [QUORUM] This node is within the primary component and will provide service.
Jun 22 17:05:54 s1 corosync[2713]:  [QUORUM] Members[0]:
Jun 22 17:05:54 s1 corosync[2713]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 22 17:05:54 s1 corosync[2713]:  [QB    ] server name: votequorum
Jun 22 17:05:54 s1 corosync[2713]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 22 17:05:54 s1 corosync[2713]:  [QB    ] server name: quorum
Jun 22 17:05:54 s1 corosync[2713]:  [TOTEM ] A new membership (10.0.0.5:182116) was formed. Members joined: 1
Jun 22 17:05:54 s1 corosync[2713]:  [CPG   ] downlist left_list: 0 received
Jun 22 17:05:54 s1 corosync[2713]:  [QUORUM] Members[1]: 1
Jun 22 17:05:54 s1 corosync[2713]:  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 22 17:26:40 s1 corosync[2713]: notice  [TOTEM ] A new membership (10.0.0.5:184780) was formed. Members
Jun 22 17:26:40 s1 corosync[2713]:  [TOTEM ] A new membership (10.0.0.5:184780) was formed. Members
Jun 22 17:26:40 s1 corosync[2713]: warning [CPG   ] downlist left_list: 0 received
Jun 22 17:26:40 s1 corosync[2713]: notice  [QUORUM] Members[1]: 1
Jun 22 17:26:40 s1 corosync[2713]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Jun 22 17:26:40 s1 corosync[2713]:  [CPG   ] downlist left_list: 0 received
Jun 22 17:26:40 s1 corosync[2713]:  [QUORUM] Members[1]: 1
Jun 22 17:26:40 s1 corosync[2713]:  [MAIN  ] Completed service synchronization, ready to provide service.

tcpdump shows activity on port 5404, so my conclusion is that the two nodes communicate:

root@s1:~# tcpdump port 5404 | grep -v "192\.168\.0\.7"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmbr0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:54:05.306075 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:05.609111 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:05.912145 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:06.014427 IP s2.5404 > 239.192.109.7.5405: UDP, length 296
17:54:06.215173 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:06.518208 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:06.821242 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:07.124277 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:07.427312 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:07.730347 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:07.875423 IP s1.5404 > 239.192.109.7.5405: UDP, length 88
17:54:08.076147 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:08.316885 IP s2.5404 > 239.192.109.7.5405: UDP, length 296
17:54:08.379755 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:08.682792 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:08.985856 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:54:09.288923 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
^C121 packets captured
133 packets received by filter
0 packets dropped by kernel

root@s2:~# tcpdump port 5404 | grep -v "192\.168\.0\.7"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp2s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:53:31.114024 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:31.413210 IP s2.5404 > 239.192.109.7.5405: UDP, length 296
17:53:31.417049 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:31.720082 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:32.023114 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:32.326150 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:32.629171 IP s1.5404 > 239.192.109.7.5405: UDP, length 136
17:53:32.883822 IP s1.5404 > 239.192.109.7.5405: UDP, length 88
^C86 packets captured
110 packets received by filter
0 packets dropped by kernel
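
To see at a glance which hosts actually show up as senders in a capture, the tcpdump lines can be tallied per source. A minimal sketch, assuming one packet per line in the format shown above (source as the third field, with a trailing `.port`); `count_sources` is a hypothetical name.

```shell
# Read tcpdump text output on stdin and print "<source host> <packet count>",
# sorted by host name.
count_sources() {
    awk '$2 == "IP" {
        src = $3
        sub(/\.[0-9]+$/, "", src)    # drop the trailing ".port"
        n[src]++
    }
    END { for (h in n) print h, n[h] }' | sort
}
```

Fed with a saved capture, for example `count_sources < capture.txt` (the file name is only an illustration), it shows whether both s1 and s2 appear as sources on each node.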

The pve-cluster service shows some errors on s2:

root@s1:~# journalctl -u pve-cluster  --no-pager
-- Logs begin at Sat 2019-06-22 17:05:48 EEST, end at Sat 2019-06-22 18:00:20 EEST. --
Jun 22 17:05:51 s1 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jun 22 17:05:51 s1 pmxcfs[2637]: [quorum] crit: quorum_initialize failed: 2
Jun 22 17:05:51 s1 pmxcfs[2637]: [quorum] crit: can't initialize service
Jun 22 17:05:51 s1 pmxcfs[2637]: [confdb] crit: cmap_initialize failed: 2
Jun 22 17:05:51 s1 pmxcfs[2637]: [confdb] crit: can't initialize service
Jun 22 17:05:51 s1 pmxcfs[2637]: [dcdb] crit: cpg_initialize failed: 2
Jun 22 17:05:51 s1 pmxcfs[2637]: [dcdb] crit: can't initialize service
Jun 22 17:05:51 s1 pmxcfs[2637]: [status] crit: cpg_initialize failed: 2
Jun 22 17:05:51 s1 pmxcfs[2637]: [status] crit: can't initialize service
Jun 22 17:05:53 s1 systemd[1]: Started The Proxmox VE cluster filesystem.
Jun 22 17:05:57 s1 pmxcfs[2637]: [status] notice: update cluster info (cluster name  AdvaitaCluster1, version = 8)
Jun 22 17:05:57 s1 pmxcfs[2637]: [status] notice: node has quorum
Jun 22 17:05:57 s1 pmxcfs[2637]: [dcdb] notice: members: 1/2637
Jun 22 17:05:57 s1 pmxcfs[2637]: [dcdb] notice: all data is up to date
Jun 22 17:05:57 s1 pmxcfs[2637]: [status] notice: members: 1/2637
Jun 22 17:05:57 s1 pmxcfs[2637]: [status] notice: all data is up to date

root@s2:~# journalctl -u pve-cluster  --no-pager
[...]
Jun 22 18:01:46 s2 pmxcfs[15830]: [status] crit: cpg_send_message failed: 6
Jun 22 18:01:47 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 10
Jun 22 18:01:48 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 20
Jun 22 18:01:49 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 30
Jun 22 18:01:50 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 40
Jun 22 18:01:51 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 50
Jun 22 18:01:52 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 60
Jun 22 18:01:53 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 70
Jun 22 18:01:54 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 80
Jun 22 18:01:54 s2 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Jun 22 18:01:54 s2 pmxcfs[15830]: [main] notice: teardown filesystem
Jun 22 18:01:55 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 90
Jun 22 18:01:56 s2 pmxcfs[15830]: [status] notice: cpg_send_message retry 100
Jun 22 18:01:56 s2 pmxcfs[15830]: [status] notice: cpg_send_message retried 100 times
Jun 22 18:01:56 s2 pmxcfs[15830]: [status] crit: cpg_send_message failed: 6
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: Killing process 15830 (pmxcfs) with signal SIGKILL.
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Jun 22 18:02:04 s2 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: Unit entered failed state.
Jun 22 18:02:04 s2 systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Jun 22 18:02:04 s2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jun 22 18:02:04 s2 pmxcfs[30809]: [status] notice: update cluster info (cluster name  AdvaitaCluster1, version = 7)
Jun 22 18:02:06 s2 systemd[1]: Started The Proxmox VE cluster filesystem.
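
One detail visible in the two journals is that pmxcfs reports cluster info version = 8 on s1 and version = 7 on s2. Whether that is related to the hang is not established here, but the number is easy to extract for comparison. A sketch, assuming the log line format shown above; `cluster_info_version` is a made-up name.

```shell
# Read pve-cluster journal lines on stdin and print the version from the
# most recent "update cluster info" message.
cluster_info_version() {
    sed -n 's/.*update cluster info (cluster name  *[^,]*, version = \([0-9][0-9]*\)).*/\1/p' | tail -n 1
}
```

Running `journalctl -u pve-cluster --no-pager | cluster_info_version` on each node would print the two numbers side by side for comparison.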

pve-firewall is not enabled.

proxmox

debian-stretch

fuse

asked on Server Fault Jun 22, 2019 by Nicu Tofan • edited Jun 23, 2019 by Michael Hampton

1 Answer

This is what I did to get things working. There must be a better way.

1. Get rid of the old cluster

I followed the instructions here for removing a node

Stop services

systemctl stop corosync
systemctl stop pve-cluster

Start in local mode

pmxcfs -l

Create a backup folder and back up the things they say to delete on BOTH nodes

cd ~
mkdir backup-pve-2019-06-23-07-34
mv /etc/pve/corosync.conf backup-pve-2019-06-23-07-34/

mkdir backup-pve-2019-06-23-07-34/etc/corosync -p
mv /etc/corosync/* backup-pve-2019-06-23-07-34/etc/corosync/
mkdir backup-pve-2019-06-23-07-34/var/lib/corosync/ -p
mv /var/lib/corosync/* backup-pve-2019-06-23-07-34/var/lib/corosync/
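
The mkdir/mv pairs above all follow one pattern: recreate the original path under the backup folder, then move the item there. That pattern can be sketched as a small helper. The name `backup_move` is an assumption, and unlike the commands above it expects an absolute path and moves one item per call.

```shell
# Move SRC (absolute path) into BACKUP_ROOT, recreating its original
# directory path underneath, e.g. /etc/corosync/authkey ends up in
# $BACKUP_ROOT/etc/corosync/authkey.
backup_move() {
    backup_root=$1
    src=$2
    [ -e "$src" ] || { echo "skip: $src does not exist" >&2; return 0; }
    dest_dir="$backup_root$(dirname "$src")"
    mkdir -p "$dest_dir" && mv -- "$src" "$dest_dir/"
}

# Example use (same items as above):
#   b=~/backup-pve-2019-06-23-07-34
#   backup_move "$b" /etc/pve/corosync.conf
#   for f in /etc/corosync/* /var/lib/corosync/*; do backup_move "$b" "$f"; done
```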

For the next operations, /etc/pve needs to be mounted

killall pmxcfs
systemctl start pve-cluster
pvecm expected 1

mkdir backup-pve-2019-06-23-07-34/etc/pve/nodes -p
mv /etc/pve/nodes/s1  backup-pve-2019-06-23-07-34/etc/pve/nodes/

A node cannot be added to a cluster if it has containers, so on ONE of the nodes (say, s2) back up and destroy all containers.

root@s2:~# vzdump 100 101 ...
root@s2:~# pct destroy 100
root@s2:~# pct destroy 101
root@s2:~# ...
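
Since destroying containers is irreversible, it may be worth printing the full command list for review before running anything. The sketch below is a dry run only: it prints the `vzdump` and `pct destroy` commands from above for the given container IDs and executes nothing. `plan_ct_backup` is a hypothetical name.

```shell
# Print (not run) the backup and destroy commands for the given container IDs.
# All backups are listed before any destroy.
plan_ct_backup() {
    for vmid in "$@"; do
        case $vmid in
            ''|*[!0-9]*) echo "not a numeric container ID: $vmid" >&2; return 1 ;;
        esac
    done
    for vmid in "$@"; do echo "vzdump $vmid"; done
    for vmid in "$@"; do echo "pct destroy $vmid"; done
}
```

Calling `plan_ct_backup 100 101` lists two `vzdump` lines followed by two `pct destroy` lines, which can be checked and then run by hand once the dumps are confirmed to exist.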

2. Create a new cluster

On one of the nodes (the one that kept its containers) create the cluster

root@s1:~# pvecm create NewClusterName

Add the other one

root@s2:~# pvecm add 10.0.0.5

Like the good piece of software that it is, it will get stuck at "waiting for quorum...", so CTRL+C out of it and restart both nodes.

Take a look at storage status so that you know which one has enough space

root@s2:~# pvesm status

Now restore the containers (replace local with the storage you chose in the previous step; the file names will be different, of course)

root@s2:~# pct restore 100 /var/lib/vz/dump/vzdump-lxc-100-2019_06_23-07_51_29.tar -storage local
root@s2:~# pct restore 101 /var/lib/vz/dump/vzdump-lxc-101-2019_06_23-07_51_29.tar -storage local
root@s2:~# ...
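
With many dumps, the container ID can be read from the dump file name instead of being typed for each restore. A sketch that only prints the `pct restore` command for a given dump, assuming file names of the form `vzdump-lxc-<ID>-<timestamp>.tar` as above; `restore_cmd` is a made-up name and `local` is just the default storage used in the example.

```shell
# Print (not run) the pct restore command for one vzdump archive.
# $1 = path to the dump, $2 = target storage (defaults to "local").
restore_cmd() {
    dump=$1
    storage=${2:-local}
    vmid=$(basename "$dump" | sed -n 's/^vzdump-lxc-\([0-9][0-9]*\)-.*/\1/p')
    [ -n "$vmid" ] || { echo "cannot parse container ID from: $dump" >&2; return 1; }
    echo "pct restore $vmid $dump -storage $storage"
}
```

A loop such as `for f in /var/lib/vz/dump/vzdump-lxc-*.tar; do restore_cmd "$f" local; done` would then print one restore command per archive for review.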

answered on Server Fault Jun 23, 2019 by Nicu Tofan

User contributions licensed under CC BY-SA 3.0