I've installed a completely fresh OS with Proxmox on 4 nodes. Every node has 2x NVMe and 1x HDD, one public NIC and one private NIC. On the public network there is an additional WireGuard interface running for the PVE cluster communication. The private interface should be used only for the upcoming distributed storage.
# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
link/ether 6c:b3:11:07:f1:18 brd ff:ff:ff:ff:ff:ff
inet 10.255.255.2/24 brd 10.255.255.255 scope global enp3s0
valid_lft forever preferred_lft forever
inet6 fe80::6eb3:11ff:fe07:f118/64 scope link
valid_lft forever preferred_lft forever
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether b4:2e:... brd ff:ff:ff:ff:ff:ff
inet 168..../26 brd 168....127 scope global eno1
valid_lft forever preferred_lft forever
inet6 2a01:.../128 scope global
valid_lft forever preferred_lft forever
inet6 fe80::b62e:99ff:fecc:f5d0/64 scope link
valid_lft forever preferred_lft forever
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether a2:fd:6a:c7:f0:be brd ff:ff:ff:ff:ff:ff
inet6 2a01:....::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::..:f0be/64 scope link
valid_lft forever preferred_lft forever
6: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
link/none
inet 10.3.0.10/32 scope global wg0
valid_lft forever preferred_lft forever
inet6 fd01:3::a/128 scope global
valid_lft forever preferred_lft forever
The nodes are fine and the PVE cluster is running as expected.
# pvecm status
Cluster information
-------------------
Name: ac-c01
Config Version: 4
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Tue Dec 15 22:36:44 2020
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 1.11
Quorate: Yes
Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.3.0.4
0x00000002 1 10.3.0.10 (local)
0x00000003 1 10.3.0.13
0x00000004 1 10.3.0.16
The PVE firewall is active in the cluster, but there is a rule that allows all PVE nodes to talk to each other on any protocol, on any port, on any interface. This works - I can ping, ssh, etc. between all nodes on all IPs.
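For reference, that rule lives in /etc/pve/firewall/cluster.fw; the following is only a rough sketch of it (the IPSet name and the member list are placeholders, the real set contains the public, private and WireGuard addresses of every node):
[IPSET pve-nodes]
10.3.0.4
10.3.0.10
10.3.0.13
10.3.0.16
10.255.255.0/24

[RULES]
IN ACCEPT -source +pve-nodes
OUT ACCEPT -dest +pve-nodes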
Then I installed ceph.
pveceph install
On the first node I've initialized ceph with
pveceph init -network 10.255.255.0/24
pveceph createmon
That works.
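As far as I can tell, pveceph init mainly generates /etc/pve/ceph.conf; with the -network option from above the relevant part of the [global] section should look roughly like this (trimmed, the fsid is generated per cluster, and whether cluster_network is also written when only -network is given is an assumption on my part):
[global]
     public_network = 10.255.255.0/24
     cluster_network = 10.255.255.0/24
     fsid = <generated uuid>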
On the second node I tried the same (I'm not sure if I need to set the -network option - I tried with and without). That works too. But pveceph createmon fails on every node except the first with:
# pveceph createmon
got timeout
I can also reach port 6789 on 10.255.255.1 from any node. Whatever I try, I get a "got timeout" on every node other than node1. Disabling the firewall doesn't have any effect either.
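For completeness, this is roughly how I checked reachability (the exact commands are just an illustration). On node1, verify the monitor is listening on the private address:
# ss -tlnp | grep 6789
From any other node, test the TCP connection and the path:
# nc -vz 10.255.255.1 6789
# ping -c 3 10.255.255.1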
When I remove the -network option, I can run all commands. It looks like Ceph cannot talk via the second interface. But the interface is fine.
When I set network to 10.3.0.0/24 and cluster-network to 10.255.255.0/24, it works too, but I want all ceph communication running via 10.255.255.0/24. What is wrong?
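For reference, the split variant that does work was initialized roughly like this (option spelling from memory, see man pveceph for the exact syntax):
# pveceph init -network 10.3.0.0/24 -cluster-network 10.255.255.0/24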
It turned out that the MTU of 9000 was the problem. Even when I ran the complete Proxmox cluster over the private network, there were errors. Setting the MTU back to 1500 fixed it:
ip link set enp3s0 mtu 1500
So, Ceph has a problem with jumbo frames in this setup.
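If you want to check whether jumbo frames actually make it end-to-end over the private link, a don't-fragment ping is a quick test (8972 bytes payload plus 28 bytes IP/ICMP header gives exactly 9000) - run it between two nodes while the MTU is still 9000:
# ping -M do -s 8972 -c 3 10.255.255.1
If that fails while a normal ping works, the 9000 MTU itself is the culprit on that path. To make the lower MTU permanent, set it in /etc/network/interfaces (assuming the interface is configured statically there), e.g.:
iface enp3s0 inet static
        address 10.255.255.2/24
        mtu 1500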