<title>How to do simple loadbalancing with Linux without a single point of failure</title>
bert hubert &lt;<a href=mailto:ahu@ds9a.nl>ahu@ds9a.nl</a>&gt;
This page reflects some experiments I did that show promise in providing
loadbalancing, which can be very interesting in some situations.

This is most useful for services which are CPU bound and not network bound.
The goal is to loadbalance a service on one IP address over multiple Linux
servers without generating a new single point of failure.
Excellent projects like <a href=http://linux-vs.org>The Linux Virtual
Server</a> or machines like the <a href=http://www.alteonwebsystems.com>Alteon
Acedirector</a> already provide loadbalancing. However, these all entail
either an additional single point of failure, or need the loadbalancing
machine itself to be implemented redundantly (i.e., two boxes).

Doing so is expensive and often not needed. It is however a very good way of
scaling to enormous bandwidths - because of the tricks these solutions
employ, they are able to do gigabits of traffic.
We want to be able to provide loadbalancing for hosts that do not saturate
their ethernet, but do need more CPU or IO horsepower than a single box can
deliver.
<H2>Intended audience</h2>
Do not interpret this document as a HOWTO. Everything here is very new and
very lightly tested. Play around, let me know what happens, but don't
complain that your 1024-server deployment just does not do what I promised.

Even if you are confident that you are savvy enough to fool around, only use
what we describe here if your service is CPU or IO bound, and if you are
not saturating your network. If you are saturating it, doing loadbalancing
like this will only hurt performance!
<h2>How it normally works</h2>
We'll assume that you have four servers, 192.168.0.10 to 192.168.0.13, and
that the service you want to provide will live on the virtual IP address
192.168.0.2. We also assume that your subnet is 192.168.0.0/24
(192.168.0.0-192.168.0.255), and that your default gateway is 192.168.0.1,
which need not be a Linux machine. Furthermore, you are using a hub and not
a switch.
[Internet] - 192.168.0.1 --[HUB]---+---------+-----+-----+
Ok - now a customer on the internet wants to access your webserver on
192.168.0.10, and a SYN packet (which starts a TCP/IP session) arrives at
your default gateway, which then needs to access a host that feels
responsible for 192.168.0.10.

In order to find the right host, the router sends out an Address Resolution
Protocol (ARP) 'who-has 192.168.0.10? tell 192.168.0.1'-query. Normally then
one of your servers responds with its MAC address: '00:10:D7:01:20:11 has
192.168.0.10'. Your router then uses this information to route the SYN
packet to the proper MAC address, which is then accepted by your webserver.
<b>It is vital that you understand this before proceeding!</b> The MAC
address can be likened to the address of your building, '12 Router Avenue'.
The destination IP address is like the name of your company. The router is
the mailperson that stands in your street and shouts 'Where do I deliver
mail for Evil Linux Routing Tricks INC?'. Your receptionist would then shout
back 'Give it to the people over at 12 Router Avenue', which would prompt
the mailperson to deliver mail at that building.
Router -> mailperson<br>
Destination IP address -> company name<br>
MAC Address (also Hardware Address, Ethernet Address) -> house number +
street name<br>
ARP query -> mailperson shouting 'Where do I deliver..'<br>
ARP response -> receptionist that replies 'Over at 12 Router Avenue'
<h2>How we subvert this for our purposes</h2>
Each IP address can have only one MAC address: the router remembers only a
single MAC address per IP address. So we need to give all our webservers the
same MAC address! Yes, this is the icky bit. Also, all webservers need to get
an IP alias so they feel responsible for the service we want to offer on
192.168.0.2.
This is achieved by executing the following on 192.168.0.10 to 192.168.0.13:
# ip link set eth0 down
# ip link set eth0 address 02:00:00:00:00:01   # locally administered, unicast
# ip link set eth0 up
# ip route add default via 192.168.0.1
# ip addr add dev eth0 192.168.0.2
FIXME: There are MAC addresses reserved for stunts like these, but I haven't
yet looked them up - please let me know.
The first three commands are self-explanatory. The fourth is needed to
reestablish the default route that went down together with the interface.
The last command then adds 192.168.0.2 to the list of addresses the host
feels responsible for.

If you execute this remotely, make sure you do so from a script, as you
might lose contact after 'ip link set eth0 down'! You might even wish to use
'nohup' to make sure your script survives. If you haven't yet tried the
wonderful 'ip' tool, please install iproute2 - it is far superior to
ifconfig and friends for configuring the kernel.
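A minimal sketch of such a script, assuming the example network above (eth0,
gateway 192.168.0.1, virtual IP 192.168.0.2). The function name
emit_mac_setup and the file names are made up for illustration. It only
prints the commands, so you can read them over before running anything:

```shell
# emit_mac_setup MAC: print the five commands from the text for one host.
# MAC is whatever shared address you picked; nothing is executed here.
emit_mac_setup() {
    mac=$1
    echo "ip link set eth0 down"
    echo "ip link set eth0 address $mac"
    echo "ip link set eth0 up"
    echo "ip route add default via 192.168.0.1"
    echo "ip addr add dev eth0 192.168.0.2"
}

# Possible use (as root, on each server), with SHARED_MAC set by you:
#   emit_mac_setup "$SHARED_MAC" > /tmp/shared-mac.sh
#   nohup sh /tmp/shared-mac.sh > /tmp/shared-mac.log 2>&1 &
```

Writing the commands to a file first and running that file detached means the
whole sequence completes even if your session drops halfway through.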
[Internet] - 192.168.0.1 --[HUB]---+---------+-----+-----+
                        192.168.0.10        11    12    13
             additional: 192.168.0.2         2     2     2
                        all have same MAC address
What then happens is that the SYN packet for 192.168.0.2 comes along, the
router does an ARP query to get the MAC address, and gets 4 identical
responses. This in itself is not a problem - it would be neater if only one
machine responded, but hey.
Now comes the problem. The SYN packet gets transmitted over the network, and
again all four machines respond with a SYN|ACK! The router doesn't care
about this, it is an IP device and has no clue what a SYN|ACK packet is. So
it sends all four packets back to the client that initiated the connection.

But the client now does get confused and swiftly drops the connection. Four
almost, but not quite, identical SYN|ACK packets are too much for it to deal
with.

The solution is simple: for each SYN packet, only one host should respond.
Now the problem is how to achieve that.
<h2>Making sure only one host gets the connection</h2>
First concentrate on the SYN packet, then we'll deal with the rest later. The
solution is pretty obvious - all machines need to be able to calculate if
they want to deal with a connection or not. To do this, we look at the IP
address of the client and do some bit fiddling on it.
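To make the idea concrete, here is a small sketch of the test each machine
has to perform: mask the client address and compare the result with the value
this machine is responsible for. The helper names ip_to_int and
wants_connection are hypothetical, and the client addresses are just
examples:

```shell
# ip_to_int A.B.C.D: print a dotted-quad IPv4 address as one integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# wants_connection CLIENT VALUE MASK: succeed if (CLIENT & MASK) equals
# (VALUE & MASK), i.e. if this machine should handle that client.
wants_connection() {
    client=$(ip_to_int "$1")
    value=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( client & mask )) -eq $(( value & mask )) ]
}

# A machine that takes the even source addresses (value 0, mask 0.0.0.1):
for c in 10.1.2.4 10.1.2.7; do
    if wants_connection "$c" 0.0.0.0 0.0.0.1; then
        echo "$c: mine"        # printed for 10.1.2.4
    else
        echo "$c: not mine"    # printed for 10.1.2.7
    fi
done
```

Because every machine applies the same calculation with a different value,
exactly one of them accepts any given client.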
First let's do this for two hosts. We want all even IP addresses to go to
192.168.0.10, all odd ones to 192.168.0.11. We do so with the following
commands:
[192.168.0.10]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.0/0.0.0.1 -j DROP
[192.168.0.11]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.1/0.0.0.1 -j DROP
[192.168.0.12]# iptables -A INPUT -d 192.168.0.2 -j DROP
[192.168.0.13]# iptables -A INPUT -d 192.168.0.2 -j DROP
The IP addresses between brackets denote on which hosts the commands need to
be executed. We expressed the 'even/odd' constraint by using the rather
unconventional 0.0.0.1 netmask, '-1' in /-notation. The other two hosts
simply drop everything sent to 192.168.0.2 for now.

Basically we say 'drop all traffic to 192.168.0.2 unless the source IP
address is even' (or odd, in case of 192.168.0.11). More explicitly: on
192.168.0.10, 'drop all traffic to 192.168.0.2 if the last bit of the source
address is not 0'; on 192.168.0.11, 'drop it if that bit is 0'.
Well, we're nearly there :-) If you now connect from the outside world to
192.168.0.2, depending on the even/oddness of your source IP address, you'll
get connected to either 192.168.0.10 or to 192.168.0.11!
<H2>Scaling to four or more hosts</h2>
Two is not that interesting: we started loadbalancing because we needed more
horsepower than one machine can deliver, so by definition we cannot deal with
the failure of one box.

To include all four hosts, we need to look at the last 2 bits of the source
IP address. Together these two bits have the value 1+2=3, which gives the
netmask 0.0.0.3:
[192.168.0.10]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.0/0.0.0.3 -j DROP
[192.168.0.11]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.1/0.0.0.3 -j DROP
[192.168.0.12]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.2/0.0.0.3 -j DROP
[192.168.0.13]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.3/0.0.0.3 -j DROP
This reads like 'drop all traffic to 192.168.0.2 *unless* the last 2 bits of
the source IP address are {00,01,10,11}', with one value per host.

If you have 8 hosts this starts to look something like this:
[192.168.0.10]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.0/0.0.0.7 -j DROP
[192.168.0.11]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.1/0.0.0.7 -j DROP
[192.168.0.12]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.2/0.0.0.7 -j DROP
...
[192.168.0.17]# iptables -A INPUT -d 192.168.0.2 \! -s 0.0.0.7/0.0.0.7 -j DROP
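The pattern in these rule sets is mechanical, so it can be generated. A
sketch, assuming the number of hosts is a power of two (2^BITS, with BITS at
most 8) and that the servers are numbered consecutively from 192.168.0.10 as
in the example. gen_rules is a made-up name, and the function only prints the
commands:

```shell
# gen_rules VIP BITS: print one DROP rule per host for 2^BITS hosts.
# Host number i (0-based) is assumed to live on 192.168.0.(10+i).
# The '!' is printed unescaped; type it as \! in an interactive shell
# that does history expansion.
gen_rules() {
    vip=$1
    bits=$2
    count=$(( 1 << bits ))   # number of hosts
    mask=$(( count - 1 ))    # low BITS bits set, e.g. 3 for four hosts
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '[192.168.0.%d]# iptables -A INPUT -d %s ! -s 0.0.0.%d/0.0.0.%d -j DROP\n' \
            "$(( 10 + i ))" "$vip" "$i" "$mask"
        i=$(( i + 1 ))
    done
}

# Four hosts, i.e. the last 2 bits of the source address:
gen_rules 192.168.0.2 2
```

Called with BITS set to 3, the same function prints the eight-host variant.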
If your number of servers is not a power of 2, things get a lot more
interesting! See also the 'Where to go from here' chapter.

There are some problems with the setup so far. Most notable:
<li>ICMP traffic that is related to TCP/IP sessions may get delivered to the
wrong server as it may have a different source IP address (any router on
your path can send ICMP messages!)</li>
<li>If you connect to 192.168.0.10,11,12,13, the other machines with the
same MAC address respond with ICMP redirects 'don't send this to me'.</li>
<li>Unless you switch off ip forwarding on the hosts, they will even forward
the packet right back for you!</li>
Luckily, all these problems can be resolved by expanding our iptables rules
a bit, and tweaking some files in /proc.

A suggested (and partly untested) set would be:
<li># echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects</li>
<li># echo 0 > /proc/sys/net/ipv4/ip_forward</li>
<li># iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT</li>
<li># iptables -A INPUT -m state --state NEW -p tcp -d 192.168.0.2 -s 0.0.0.X/0.0.0.3 -j ACCEPT</li>
<li># iptables -A INPUT -p udp -d 192.168.0.2 -s 0.0.0.X/0.0.0.3 -j ACCEPT</li>
<li># iptables -A INPUT -p icmp -d 192.168.0.2 -j DROP</li>
<li># iptables -A INPUT -d 192.168.0.1X -j ACCEPT</li>
<li># iptables -A INPUT -j DROP</li>
Where X goes from 0 to 3 for the different hosts.
This prevents the servers from routing stuff back to the network and enables
them to receive TCP and UDP traffic meant for them. All machines receive
ICMP traffic for the virtual IP address, but iptables stateful filtering
makes sure that the kernel stack only sees relevant ICMP messages.

We also make sure that traffic to the non-virtual IP address *is* accepted
properly. The line by line summary:
<li>Stop the server from sending out redirects for traffic it doesn't want</li>
<li>Stop the server from forwarding back traffic it doesn't want</li>
<li>Accept already running TCP/IP sessions - this is great for when you
change which new connections (even, odd, whatever) you want to accept,
without hurting existing ones.</li>
<li>Allow new incoming TCP sessions from selected IP addresses</li>
<li>Allow incoming UDP packets from selected IP addresses</li>
<li>Kill any remaining icmp traffic for the virtual IP - either it already
got accepted by the first iptables line ('RELATED'), or it is not for us</li>
<li>Accept traffic for our real IP address</li>
<li>Drop the rest</li>
If you want your machine to ping back, add this after line 5:
# iptables -A INPUT -p icmp --icmp-type echo-request -d 192.168.0.2 -j ACCEPT
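Since the suggested rule set only differs in X between hosts, it can be
printed by a small function. This is a sketch under the same assumptions as
the list above (eth0, virtual IP 192.168.0.2, real addresses 192.168.0.1X,
four hosts). emit_ruleset is a hypothetical name, and an optional second
argument 'ping' places the echo-request rule before the ICMP DROP rule:

```shell
# emit_ruleset X [ping]: print the suggested commands for host number X (0..3).
# Nothing is executed; pipe the output to a file and review it first.
emit_ruleset() {
    x=$1
    vip=192.168.0.2
    sel="0.0.0.$x/0.0.0.3"   # the source addresses this host accepts
    echo "echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects"
    echo "echo 0 > /proc/sys/net/ipv4/ip_forward"
    echo "iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT"
    echo "iptables -A INPUT -m state --state NEW -p tcp -d $vip -s $sel -j ACCEPT"
    echo "iptables -A INPUT -p udp -d $vip -s $sel -j ACCEPT"
    if [ "${2:-}" = ping ]; then
        # must come before the ICMP DROP rule below to have any effect
        echo "iptables -A INPUT -p icmp --icmp-type echo-request -d $vip -j ACCEPT"
    fi
    echo "iptables -A INPUT -p icmp -d $vip -j DROP"
    echo "iptables -A INPUT -d 192.168.0.1$x -j ACCEPT"
    echo "iptables -A INPUT -j DROP"
}

# Commands for 192.168.0.12, answering pings on the virtual IP:
emit_ruleset 2 ping
```

Keep in mind that with the ping rule in place every host answers an echo
request to the virtual IP, so expect duplicate replies.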
<H2>Where to go from here</h2>
Besides loadbalancing, you may need redundancy. To get it, we need tools
that keep the iptables rules in sync over multiple hosts. These haven't been
written yet, but they could be.

Such a tool would also calculate and insert the right iptables rules.
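The core calculation of such a tool is small. As a purely hypothetical
sketch: divide the source addresses into 2^BITS buckets by their last bits
and hand the buckets out round-robin to whichever hosts are currently alive.
This also covers a number of servers that is not a power of 2. The name
assign_buckets and the output format are invented for illustration:

```shell
# assign_buckets BITS HOST...: spread the 2^BITS source-address buckets
# round-robin over the listed (live) hosts. BITS must be at most 8.
assign_buckets() {
    bits=$1
    shift
    [ "$#" -gt 0 ] || return 1      # need at least one live host
    count=$(( 1 << bits ))
    mask=$(( count - 1 ))
    bucket=0
    while [ "$bucket" -lt "$count" ]; do
        idx=$(( bucket % $# + 1 ))  # 1-based position in the host list
        eval "host=\${$idx}"
        printf '%s accepts 0.0.0.%d/0.0.0.%d\n' "$host" "$bucket" "$mask"
        bucket=$(( bucket + 1 ))
    done
}

# Three live hosts sharing eight buckets (two hosts get 3, one gets 2):
assign_buckets 3 192.168.0.10 192.168.0.11 192.168.0.12
```

Each host would then need one ACCEPT rule per bucket assigned to it. When a
host fails, rerunning the calculation with the remaining hosts redistributes
its buckets.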
<H2>And if I have a switch?</H2>
Two possible solutions - either configure your switch to act as a hub, or
employ additional tricks to confuse the switch so it acts as a hub. The
latter option entails sending from a different MAC address than the one we
listen on. Doing so is, as far as I know, not possible with off the shelf
Linux tools. I doubt if it should be.
Solutions might be to get netfilter in a position where it can change source
MAC addresses on outgoing packets. This should also happen on ARP queries
and replies. As far as I know this is a hot item currently.

Another solution would be to teach Linux that a card can have two addresses, a
'listen address' and a 'send address'.

I will be discussing this with the relevant people. If you feel that you are
one of those people, please contact <a href=mailto:ahu@ds9a.nl>me</a>.
<H2>I think you should be locked up!</H2>
I admit that having multiple hosts with identical MAC addresses is pretty
evil. I also know that there are cleaner solutions. But these all need
additional hardware and create new points of failure. I'm not advocating the
use of this trick for all services, but it would work *very* well for
nameservers. And <a href=http://www.powerdns.com>nameserving</a> is my trade.
<H2>Doesn't Microsoft do something like this with W2K?</H2>
People tell me so - I have never worked with Windows, so I wouldn't know.