GlusterFS – migration woes

So I’ve got a GlusterFS cluster: a simple 1×2 distribute-replicate volume with two bricks (one on each server). I need to migrate it from one data centre to another with little or no downtime, which means moving one server, letting the volume heal, and then shutting down and moving the other one too.

Unfortunately I got stuck in the middle, with the moved server refusing to talk to its peer again. As it turns out, changing the IPs is … hard. Fortunately I had used private (RFC 1918) IP addresses in the original setup, on a dedicated cable between the two servers.

After reading about the woes other people had changing the IPs, I decided to tunnel the private IPs instead. There are a few options here: build a tunnel using L2TP, IP-in-IP, OpenVPN, or many others; I’ll discuss a couple of them. Let’s call our servers SrvA and SrvB, where SrvA had IP 192.168.0.1 and SrvB had 192.168.0.2. Simple enough. Assume the new routable IPs are s.r.v.a and s.r.v.b.

ip-ip

This is probably the simplest, on SrvA:

# Create an IP-in-IP tunnel pointing at SrvB's new public IP
ip link add name ipip-srvb type ipip remote s.r.v.b
# Keep our old private IP, now on the tunnel endpoint
ip address add 192.168.0.1/32 dev ipip-srvb
ip link set dev ipip-srvb up
# Reach SrvB's old private IP through the tunnel
ip route add 192.168.0.2/32 dev ipip-srvb

And on SrvB:

ip link add name ipip-srva type ipip remote s.r.v.a
ip address add 192.168.0.2/32 dev ipip-srva
ip link set dev ipip-srva up
ip route add 192.168.0.1/32 dev ipip-srva
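
At this point a quick sanity check (assuming ICMP isn’t filtered anywhere along the path) is to ping the peer’s old private IP from our own:

ping -c 3 -I 192.168.0.1 192.168.0.2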

The disadvantage here is that every packet carries an additional 20 bytes of encapsulation overhead (which also eats into the effective MTU). The MAJOR advantage is that it is very simple to set up.
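
As a consequence of those 20 bytes, the tunnel’s MTU ends up below that of the underlying interface. Assuming a standard 1500-byte underlay, something like this will confirm (and, if needed, pin) it:

# The kernel normally derives the tunnel MTU itself (1500 - 20 = 1480)
ip link show dev ipip-srvb
# Pin it explicitly if anything on the path disagrees
ip link set dev ipip-srvb mtu 1480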

And the reason I didn’t use this is that one of the kernels, for some obscure reason, did not have CONFIG_IPIP set.
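
If you want to check for that up front, something like the following should tell you (the config file path varies per distro, and the module may simply be built in):

grep CONFIG_IPIP /boot/config-$(uname -r)
modprobe ipip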

Route and NAT it

This one is a little more involved and requires iptables NAT support. Additionally you need to know what the local gateways are, so I’m just going to assume that they are g.w.s.a and g.w.s.b. (You can figure them out by either dumping the routing table or, more conveniently in the case of complex routing tables, with ip route get d.e.s.t.)
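
For example, to find the gateway SrvA would use towards SrvB’s new public IP (eth0 here is just a placeholder for whatever interface your routing table picks):

ip route get s.r.v.b
# s.r.v.b via g.w.s.a dev eth0 src s.r.v.a

With the gateways known, on SrvA: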

# Bring our old private IP up locally (on loopback)
ip address add 192.168.0.1/32 dev lo
# Route the peer's old private IP towards the local gateway
ip route add 192.168.0.2/32 via g.w.s.a
# Egress: rewrite the peer's private IP to its new public one
iptables -t nat -A OUTPUT -d 192.168.0.2 \
    -j DNAT --to s.r.v.b
# Ingress: packets from SrvB arrive addressed to s.r.v.a;
# rewrite the destination back to our private IP ...
iptables -t nat -A PREROUTING -s s.r.v.b \
    -j DNAT --to 192.168.0.1
# ... and the source back to SrvB's private IP
iptables -t nat -A INPUT -s s.r.v.b \
    -j SNAT --to-source 192.168.0.2

And on SrvB:

ip address add 192.168.0.2/32 dev lo
ip route add 192.168.0.1/32 via g.w.s.b
iptables -t nat -A OUTPUT -d 192.168.0.1 \
    -j DNAT --to s.r.v.a
iptables -t nat -A PREROUTING -s s.r.v.a \
    -j DNAT --to 192.168.0.2
iptables -t nat -A INPUT -s s.r.v.a \
    -j SNAT --to-source 192.168.0.1

Yea, that is definitely more complex. Much more complex. Essentially what we do is first assign the original IP to the loopback interface. Then we add a route for the other side’s private IP via the gateway towards the new IP address (this may not actually be required). At this point we can receive traffic on our old private IP, which is what we want. We can also transmit to the private IP of the other device – but those packets will most likely get dropped by the routers along the way.
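
To convince yourself of those first two steps on SrvA, check that the old IP is bound locally and that the peer’s private IP now routes via the gateway:

ip addr show dev lo          # should list 192.168.0.1/32
ip route get 192.168.0.2     # should report "via g.w.s.a"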

In order to combat the routing problems on the routers, we DNAT egress packets so that the peer’s private IP becomes its public address; in other words, instead of transmitting to 192.168.0.1 we transmit to s.r.v.a, and instead of 192.168.0.2 we transmit to s.r.v.b. However, due to the connection tracking state, the application (GlusterFS in this case) still thinks the communication is between the private IPs.

The final bit is that on receipt of a packet we need to rewrite both s.r.v.a to 192.168.0.1 and s.r.v.b to 192.168.0.2 before delivering it to the application; this is achieved by the last two rules on each server.
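
You can actually watch these translations in the connection tracking table (this needs the conntrack tool from conntrack-tools; 24007 is glusterd’s management port):

conntrack -L -p tcp | grep 24007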

The main advantage here is that there is zero packet overhead (the private IPs are simply swapped out for the public ones and GlusterFS is none the wiser). The disadvantage is that the solution is much harder to understand. I’m not sure the added complexity is really worth it, but it’s good to know that it can be done nonetheless.

Interestingly enough, on one of the two servers I had to add one more rule, say on SrvB:

iptables -t nat -A POSTROUTING -s 192.168.0.2 -d s.r.v.a \
    -j SNAT --to-source s.r.v.b

So I just added it on both, to be on the safe side (better safe than sorry).

After all of this I do get “State: Peer in Cluster (Connected)” for each peer on the other server, even though they are no longer directly attached by a physical LAN. For now. In two weeks’ time I can create a VLAN on the switch for the private IPs and revert to the much simpler configuration.
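
For reference, that state comes from gluster peer status; on SrvA the (abbreviated, illustrative) output looks something like:

gluster peer status
# Number of Peers: 1
#
# Hostname: 192.168.0.2
# Uuid: <peer-uuid>
# State: Peer in Cluster (Connected)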

That is officially the second nastiest (though not the most complex) thing I’ve done with routing and iptables. (DNAT’ing UDP traffic to a loopback IP on which OpenVPN was bound, in order to create a connection tracking entry and thus be able to load balance inbound OpenVPN – or UDP traffic in general – I suspect still takes the gold.)

At this point I simply triggered a heal (gluster volume heal ${VOLNAME} … as per the Red Hat documentation) to make sure things got back in sync before I update the DNS names to re-route large-scale traffic to the new data centre.
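
A minimal sketch of that step, with ${VOLNAME} standing in for the actual volume name; the info subcommand lets you watch the heal progress:

gluster volume heal ${VOLNAME}
gluster volume heal ${VOLNAME} info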
