February 16, 2018

OpenShift HA Routing uses haproxy application routers to get traffic into the cluster. These application routers are made redundant by running ipfailover (keepalived) pods to maintain a set of Virtual IPs on each infrastructure node where the application routers run. These VIPs are then referenced by round robin DNS records to enable a measure of load balancing.

OK, so now you are load balancing at the network layer, but what about the link layer? Did you know that even if you somehow manage to perfectly balance traffic among the VIPs using RR DNS, you could still be using only one of your application routers? Well, you could be!

Example Environment

Here is an environment with 3 infrastructure nodes and 3 primary, or application, nodes in the 192.0.2.0/24 address space. The application domain is os.example.com.

The 3 infrastructure nodes participate in the HA routing by running ipfailover pods that implement keepalived to keep IPs 192.0.2.101-103 alive. Alongside those ipfailover pods, an haproxy pod binds to each IP.

The DNS response for lookups in *.os.example.com will look like this. The order of the addresses in the response should be varied by the DNS server.

$ dig +short foo.os.example.com
192.0.2.103
192.0.2.101
192.0.2.102

Now looking deeper into the network interface on the nodes, let’s enumerate the MAC address and the IP addresses on eth0:

Infra Node 1

[root@ose-prod-node-01 ~]# ethtool -P eth0
Permanent address: 00:1a:4a:48:BE:4B

[root@ose-prod-node-01 ~]# ip -4 -o a show eth0
2: eth0    inet 192.0.2.1/24 brd 192.0.2.255 scope global dynamic eth0\       valid_lft 62345sec preferred_lft 62345sec
2: eth0    inet 192.0.2.102/32 scope global eth0\       valid_lft forever preferred_lft forever
2: eth0    inet 192.0.2.101/32 scope global eth0\       valid_lft forever preferred_lft forever
2: eth0    inet 192.0.2.103/32 scope global eth0\       valid_lft forever preferred_lft forever

Infra Node 2

[root@ose-prod-node-02 ~]# ethtool -P eth0
Permanent address: 00:1a:4a:48:BE:EF

[root@ose-prod-node-02 ~]# ip -4 -o a show eth0
2: eth0    inet 192.0.2.2/24 brd 192.0.2.255 scope global dynamic eth0\       valid_lft 72864sec preferred_lft 72864sec
2: eth0    inet 192.0.2.103/32 scope global eth0\       valid_lft forever preferred_lft forever
2: eth0    inet 192.0.2.101/32 scope global eth0\       valid_lft forever preferred_lft forever
2: eth0    inet 192.0.2.102/32 scope global eth0\       valid_lft forever preferred_lft forever

Infra Node 3

[root@ose-prod-node-03 ~]# ethtool -P eth0
Permanent address: 00:1a:4a:48:BE:4C

[root@ose-prod-node-03 ~]# ip -4 -o a show eth0
2: eth0    inet 192.0.2.3/24 brd 192.0.2.255 scope global dynamic eth0\       valid_lft 81509sec preferred_lft 81509sec
2: eth0    inet 192.0.2.101/32 scope global eth0\       valid_lft forever preferred_lft forever
2: eth0    inet 192.0.2.102/32 scope global eth0\       valid_lft forever preferred_lft forever
2: eth0    inet 192.0.2.103/32 scope global eth0\       valid_lft forever preferred_lft forever

Node Summary

Node	Eth0 MAC	Eth0 IPs
ose-prod-node-01	00:1a:4a:48:BE:4B	192.0.2.1, 192.0.2.101, 192.0.2.102, 192.0.2.103
ose-prod-node-02	00:1a:4a:48:BE:EF	192.0.2.2, 192.0.2.101, 192.0.2.102, 192.0.2.103
ose-prod-node-03	00:1a:4a:48:BE:4C	192.0.2.3, 192.0.2.101, 192.0.2.102, 192.0.2.103

Client Traffic

Web browsers will look up the IP address for app.os.example.com and will get back three A records (192.0.2.101, 192.0.2.102, 192.0.2.103) in the response. The DNS server will shuffle the order of the IPs in the response and the client will typically choose the first IP in the list to connect to.

All three of these IPs are always available, even if only one node is alive, and the client has no idea which node is serving its traffic.

Keep in mind it is the network router attached to the nodes that will decide how to relay the packet from the client to those 3 IP addresses. Also remember that ultimately the packet will be sent to a MAC address, not an IP address. How? Address Resolution Protocol (ARP). Since the router is on the same layer 2 segment as these VIPs, it will maintain an ARP table that maps IP addresses to MAC addresses. Over time, or after some event like a network split or node reboot, the ARP table could wind up thwarting your efforts to balance the traffic.

Network Router

Let’s check the ARP table on the router to see where traffic will be sent at the link layer.

core(s1)# sh ip arp 192.0.2.101
Address         Age (min)  Hardware Addr   Interface
192.0.2.101          0  001a.4a48.BE4B  Vlan176, Port-Channel391

core(s1)# sh ip arp 192.0.2.102
Address         Age (min)  Hardware Addr   Interface
192.0.2.102          0  001a.4a48.BE4C  Vlan176, Port-Channel391

core(s1)# sh ip arp 192.0.2.103
Address         Age (min)  Hardware Addr   Interface
192.0.2.103          0  001a.4a48.BE4C  Vlan176, Port-Channel391

You can see that IP address 192.0.2.101 is known to the router by MAC (Hardware) address 001a.4a48.BE4B, which is eth0 on Node 1.

However, both addresses 192.0.2.102 and 192.0.2.103 are known to the router by the same MAC address 001a.4a48.BE4C, which is eth0 on Node 3.
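The router prints MAC addresses in dotted notation (001a.4a48.BE4B) while ethtool prints them colon-separated (00:1a:4a:48:BE:4B), so matching them by eye is error-prone. A minimal sketch of a helper that brings both to the same form follows; normalize_mac is a hypothetical name, not part of any tool used in this post.

```shell
# Normalize a MAC address to lowercase, colon-separated form so router
# output and ethtool output can be compared directly.
# normalize_mac is a hypothetical helper name.
normalize_mac() {
    # Drop separators (dots, colons, dashes) and lowercase the hex digits.
    hex=$(printf '%s' "$1" | tr -d '.:-' | tr 'A-F' 'a-f')
    # Reject anything that contains a non-hex character.
    case $hex in
        *[!0-9a-f]*) echo "not a MAC address: $1" >&2; return 1 ;;
    esac
    # Reject anything that is not exactly 12 hex digits long.
    if [ "${#hex}" -ne 12 ]; then
        echo "not a MAC address: $1" >&2
        return 1
    fi
    # Insert a colon after every two digits, then trim the trailing one.
    printf '%s\n' "$hex" | sed 's/../&:/g; s/:$//'
}

normalize_mac 001a.4a48.BE4B      # router notation
normalize_mac 00:1a:4a:48:BE:4B   # ethtool notation
```

Both calls print the same string, which confirms that the router entry for 192.0.2.101 and eth0 on Node 1 are the same interface.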

Router Summary

Router ARP Table

IP	MAC Address
192.0.2.101	00:1a:4a:48:BE:4B
192.0.2.102	00:1a:4a:48:BE:4C
192.0.2.103	00:1a:4a:48:BE:4C

This means that any time the network router is sending traffic to those 3 IPs it will always send it to nodes 1 and 3. Node 2 will never get any traffic for those VIPs. Things could easily be worse. You might have all 3 IPs associated with the same MAC address. Go look now!
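If you would rather not compare the entries by eye, the check can be sketched as a small filter that reads "IP MAC" pairs and reports any MAC address that answers for more than one IP. The function name shared_macs is hypothetical, and the sample input is simply the router ARP table from this example.

```shell
# Read "IP MAC" pairs on stdin and print each MAC that is mapped to more
# than one IP, as: MAC COUNT IP,IP,...
# shared_macs is a hypothetical helper name.
shared_macs() {
    awk '
        NF >= 2 {
            mac = tolower($2)          # compare MACs case-insensitively
            count[mac]++
            if (mac in ips)
                ips[mac] = ips[mac] "," $1
            else
                ips[mac] = $1
        }
        END {
            for (mac in count)
                if (count[mac] > 1)
                    print mac, count[mac], ips[mac]
        }
    ' | sort
}

# The three VIPs as the router in this example sees them.
shared_macs <<'EOF'
192.0.2.101 00:1a:4a:48:BE:4B
192.0.2.102 00:1a:4a:48:BE:4C
192.0.2.103 00:1a:4a:48:BE:4C
EOF
```

With this input the only line reported is the Node 3 MAC, carrying both 192.0.2.102 and 192.0.2.103. On a Linux host in the same layer 2 segment, something like `ip -4 neigh show | awk '$4 == "lladdr" {print $1, $5}' | shared_macs` could feed it live data, assuming the usual `ip neigh` output layout. That shows the host's own neighbor table, though, which may differ from the router's.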

The Fix

So how do you fix that? GARP. A gratuitous ARP from the node can inform the router, “Hey! I am MAC X:X and I answer to IP Y.Y. Remember that!” and the router will do as it is told and update its ARP table.

ssh root@ose-prod-node-01 arping -c 4 -A -I eth0 192.0.2.101
ssh root@ose-prod-node-02 arping -c 4 -A -I eth0 192.0.2.102
ssh root@ose-prod-node-03 arping -c 4 -A -I eth0 192.0.2.103

The Caveat

Of course, if one of the nodes were to reboot, its MAC address would become unreachable and fall out of the router's ARP table. The router would then send an ARP request, and another node's MAC address would fill in and do double duty. Things will stay like that indefinitely.

One approach would be to perform the above arping on a regular basis or after maintenance events that take down infrastructure nodes.
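A rough sketch of what that could look like is a small wrapper around the same arping commands. The node-to-VIP mapping mirrors the commands in The Fix; the names VIP_MAP, announce_vips and DRY_RUN are made up for this sketch. It only prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical node=VIP mapping, mirroring the arping commands in The Fix.
VIP_MAP="ose-prod-node-01=192.0.2.101
ose-prod-node-02=192.0.2.102
ose-prod-node-03=192.0.2.103"

# Print (DRY_RUN=1, the default) or run one gratuitous ARP per node.
announce_vips() {
    printf '%s\n' "$VIP_MAP" | while IFS='=' read -r node vip; do
        [ -n "$node" ] || continue
        # ssh -n keeps ssh from swallowing the rest of the loop's input.
        set -- ssh -n "root@$node" arping -c 4 -A -I eth0 "$vip"
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "$*"
        else
            "$@"
        fi
    done
}

announce_vips
```

Saved as a script, this could be run from cron or by hand after maintenance with `DRY_RUN=0`. It assumes the same passwordless root ssh access that the original commands rely on.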

Mind the GARP!
