ARP Spoofing – a lost art? Maybe not.

Just over a month back we had an incident where the default gateway on our servers would sporadically stop responding.  We first observed this as the servers themselves sporadically not responding.  Only once we realized that during these “outages” we could still log on to our other servers and reach the affected ones from there (ie, we could access them from the local LAN but not from anywhere else) did we start pointing fingers at the gateway.

At the time we figured this was a misconfigured or faulty gateway (seeing as Internet Solutions is using HSRP we figured these outages were just the “fail over” time).  Quintin, however, was told that our gateways were misconfigured and that he should be using x.y.z.253 and NOT x.y.z.1 as we were told during the installation of these servers.  Since then the Windows server he was working on was not affected again, and seeing as most of my Linux servers just keep running I didn’t think too much about this until last weekend, when I was actively working on one of them.  I re-reported the issue, making it clear that the explanations I was given previously were inadequate and not acceptable (I was given answers such as someone accidentally bumped a cable etc …).

After running mtr traces from my DSL into the DC to three of the servers (web/mail, voip and windows) since reporting it at 15:54 (first response at 16:48 – reasonable response time for ONCE, thanks IS) I realized two things:

  1. ~ 30% round-trip packet loss on the final hop, and around 0.4 – 0.7 % loss on the hops before that (meaning, probably the DSL side of things).
  2. The Windows server was not affected at all!  This made no sense whatsoever to me until I realized it was using a different gateway.

Either way, I sent these traces off to IS at 17:25, showing even higher loss on the last hop than the above 30 % (67 % and 57 % respectively), and later (18:40) passed on an update showing that my servers could still reach each other internally in spite of not being able to reach the outside world.  At 7:38 on the Sunday I looked at the traces again, and since the loss had decreased to 5 % and 6 % I figured they must’ve restored sanity and fixed the problem.  I was wrong, very very wrong.

One of my clients phoned me around 8:45ish and said I should look at my email: they thought I might want to look at it, as it looked like my admin site had been compromised.  Of course I immediately loaded the site, just to see a funny bar at the top-left and all my fonts looking rather funky.  A view of the page source showed a <script> tag embedded right at the beginning of my HTML document, even before the <!DOCTYPE declaration!  If you’ve ever seen me fire up a shell and ssh into a server you know it can be done quickly, and I believe I broke all known records that morning … the grep was fired off so quickly on the php scripts I barely had time to think about it … nothing.  mysqldump with the grep … nothing, on /etc … nothing, /home … nothing (this took a while), / … nothing (this one took a LONG time).

So off I went to GLUG … ah, gotta love GLUG … and this thread ensued:  http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00009.html

As per the thread my initial thought was a compromised proxy … however, seeing as I described the gateway problem first you can hopefully already guess that this was NOT the case.  The post at http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00013.html was the breakthrough.  Specifically, Quintin mentioned that he had issues loading certain pages, so I tried to load pages from other servers in the DC, and this excerpt in particular should explain:

Come to think of it - there is some correlation between the servers
that's available via our gateway and those that aren't.  I can reproduce
this "page hack" on the web pages that sporadically goes awol, but not
on those that doesn't (In our particular little subnet).  I wonder
whether those two are not perhaps related.  ARP spoofing anyone?  I
suspect this issue is going to be handed off to IS... sorry for the IS
guys on the list, but there is some work coming your way.

After this realization it became much easier to look for what I feared:  A router compromise.  So I started sniffing for ARP traffic (inside of screen, writing to a file), and it wasn’t long before I found sequences like this:

12:57:07.871279 ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46
12:57:07.884269 ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46
12:57:07.889516 ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46
12:57:08.433296 ARP, Reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco), length 46
12:57:08.433350 ARP, Reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco), length 46

That actually tells us three things, not just one:

  1. There is definitely somebody spoofing the router address, specifically, some Dell device which thinks it’s the router.
  2. There is likely a loop on the physical network seeing as we’re receiving the same packet multiple times (it’s actually an IS engineer that pointed this one out to me).
  3. IS is using HSRP for their routers (which I already knew, but the Cisco MAC address confirms this).
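
The first and third observations can be checked mechanically.  What follows is a minimal sketch, assuming Python 3 and tcpdump text output in the two formats shown in this post; the function names are made up for illustration and are not part of any tool.  It groups the MAC addresses claimed for each IP and reports any IP with more than one claimant.  It also recognises the HSRP version 1 virtual MAC pattern (00:00:0c:07:ac:XX, where XX is the group number in hex).

```python
import re
from collections import defaultdict

# Matches both tcpdump styles seen in this post:
#   "ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46"
#   "arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)"
ARP_REPLY = re.compile(
    r"arp,?\s+reply\s+(?P<ip>\S+)\s+is-at\s+"
    r"(?P<mac>[0-9a-f]{2}(?::[0-9a-f]{2}){5})",
    re.IGNORECASE,
)

# HSRP version 1 virtual MACs look like 00:00:0c:07:ac:XX (XX = group number).
HSRP_V1_PREFIX = "00:00:0c:07:ac:"


def macs_per_ip(lines):
    """Collect the set of MAC addresses claimed for each IP in ARP replies."""
    seen = defaultdict(set)
    for line in lines:
        match = ARP_REPLY.search(line)
        if match:
            seen[match.group("ip")].add(match.group("mac").lower())
    return dict(seen)


def conflicts(lines):
    """Return only the IPs that more than one MAC address claims to own."""
    return {ip: macs for ip, macs in macs_per_ip(lines).items() if len(macs) > 1}


def hsrp_group(mac):
    """Return the HSRP v1 group number encoded in a virtual MAC, else None."""
    mac = mac.lower()
    if len(mac) == 17 and mac.startswith(HSRP_V1_PREFIX):
        return int(mac[-2:], 16)
    return None


if __name__ == "__main__":
    capture = [
        "12:57:07.871279 ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46",
        "12:57:08.433296 ARP, Reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco), length 46",
    ]
    for ip, macs in conflicts(capture).items():
        for mac in sorted(macs):
            print(ip, mac, "HSRP group:", hsrp_group(mac))
```

A conflict on its own does not say which MAC is the impostor.  Here the HSRP pattern is what singles out the Cisco address as the legitimate one.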

At 19:50 I once more sent an email to IS, asking them to get serious about this since it was a security issue.  I also had to spell out what exactly was going on (not something that comes overly easy to me).  So for those that have been raising their eyebrows at the above, the best analogy that I could come up with is something along these lines:

A LAN is essentially like a room full of people, where each person represents a computer.  Just about none of these people know each other, and they cannot communicate directly with anybody outside of the room.  In order to communicate with the outside you’ve got to speak to a special person who stands in a door.  You only have his name, and you’ve got no way to confirm who’s standing in a door and who isn’t, not even by asking them.  So if I want to send a message to somebody not in the room I basically call out:  Hey Mr Router, where are you? and then Mr Router is supposed to call back, hey you, here I am!  What’s happening above is that two people are calling back, so which one is the real Mr Router and which one is the impersonator?  You’ve got to make a choice and follow through.  If you pick the right one, your message goes where it’s supposed to; if you don’t, well, nasty things can happen, as in this story:

The computer that pretended to be the router took the message, looked for certain things in it, modified it and then passed it on to the real Mr Router, pretending to be us.  Very, very naughty.
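
To make the “which Mr Router do I believe?” problem concrete, here is a toy model in Python 3.  It is deliberately naive and is an assumption for illustration only: real operating systems apply their own rules about when an ARP reply may overwrite a cache entry.  The class and function names are hypothetical.  The point is only that a reply carries no proof of identity, so a cache that trusts replies ends up pointing at whoever spoke last.

```python
class NaiveArpCache:
    """Toy ARP cache that trusts every reply; the most recent answer wins."""

    def __init__(self):
        self.table = {}

    def on_reply(self, ip, mac):
        # Nothing in an ARP reply proves who sent it, so there is nothing
        # to verify here: the entry is simply overwritten.
        self.table[ip] = mac.lower()

    def lookup(self, ip):
        return self.table.get(ip)


def replay(replies):
    """Feed (ip, mac) replies to a fresh cache; record its belief after each."""
    cache = NaiveArpCache()
    history = []
    for ip, mac in replies:
        cache.on_reply(ip, mac)
        history.append(cache.lookup(ip))
    return history


if __name__ == "__main__":
    # Same order as the capture earlier in this post: impostor, then router.
    replies = [
        ("x.y.z.1", "00:11:43:dc:15:24"),
        ("x.y.z.1", "00:00:0c:07:ac:0a"),
        ("x.y.z.1", "00:11:43:dc:15:24"),
    ]
    for step, mac in enumerate(replay(replies), start=1):
        print("after reply", step, "gateway is believed to be", mac)
```

In this model the gateway entry flips back and forth with every competing reply, which would fit the sporadic nature of the outages described above.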

Even after all of this non-circumstantial evidence and definite proof (well beyond speculation) I received this shortly before 21:00 on the Sunday evening:

Network engineer found that there is a problem with the available uplink bandwidth from the switch and that this is entering planning for rectification. We will get our Netops engineers to assist, but will not be able to resolve it tonight.

I thought I was going to kill someone. My reply to this was simple, yet effective:

What does available uplink bandwidth have to do with a machine that’s spoofing ARP responses?

Yes, the uplink bandwidth is a problem, I’ve been saying that for a while now … but this issue is happening over a WEEKEND (A LONG WEEKEND I MAY ADD) when it’s dead quiet, so there is a better chance of hell freezing over than that it’s an uplink bandwidth issue.

I also immediately called in to the GSC to try and speak with whichever engineer made the assessment and to hammer it into his head what was going on.  As they tried to connect me he fortunately called me.  This helped somewhat with my mood.

This is fortunately where things turned.  Within a few minutes of him calling I managed to explain what was actually happening, and he took action and disabled the compromised host’s port.  All went back to normal by 21:30, confirmed by 21:34 with ARP traces that finally looked normal:

21:22:16.089287 arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)
21:22:51.888723 arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)
21:23:27.688410 arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)
21:24:14.599464 arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)

Much better, no duplicates, no more issues with ssh just going awol.  No more ping timeouts, no more infected HTTP responses.

The HTTP infections themselves were also pretty ingenious.  In particular, a typical HTTP response looks something like this:

HTTP/1.1 200 OK
Date: Sun, 03 May 2009 08:49:07 GMT
Server: Apache
Content-Length: 2698
Keep-Alive: timeout=15, max=98
Connection: Keep-Alive
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>

As you can see, there are a number of unneeded headers.  What would happen if we were to strip some of these out, eg, the Server:, Keep-Alive: and Connection: headers?  Firstly, the lack of Keep-Alive would cause the connection to be closed and a new one to be opened for subsequent requests (slight slowdown … whoopdy do).  The lack of the Server: header makes no difference at all.  But now (due to the technical requirement of not disturbing the packet length) the space that opened up has to be filled again, so it gets padded with spaces and the Content-Length: header gets adjusted to match.

So basically, if we rip X bytes out of the headers we pre-pad the content (from <!DOCTYPE onwards) with X spaces, which doesn’t change the meaning of the content, and we increase the Content-Length by X (being careful around magnitude boundaries, eg, going from 3 digit values to 4 digit values, in which case the extra digit consumes one of our added spaces again).

Since we now have a bunch of spaces at the beginning of the content, we’ve got room to add some stuff in there (this broke the HTML standard in my page’s case, but many HTML devs don’t use <!DOCTYPE declarations at all, and by the time anybody notices it’s already too late anyway).  In this case the “payload” was a <script> tag with a src attribute pointing to a .gif file which actually contained some javascript.  This then proceeded to exploit some Windows bug and compromise the visitor’s machine for whatever nefarious purpose the hacker had in mind.
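
The byte accounting above is easier to follow in code.  This is a minimal sketch in Python 3 under loud assumptions: it treats the whole response as one buffer (the real attack worked on packets in flight), and it uses a harmless HTML comment as a stand-in payload.  The function names are invented for illustration, and this is not the attacker’s actual code, which I never saw.  One detail not described above is an assumption of mine: if the numbers don’t divide cleanly around a digit boundary, the leftover byte is parked as trailing whitespace on the Content-Length line.  The second function is the check I would have liked to have had running: it reports anything sitting in front of the <!DOCTYPE.

```python
def rewrite_response(response, payload,
                     drop=(b"server", b"keep-alive", b"connection")):
    """Drop some headers and reuse the freed bytes at the start of the body,
    keeping the total length of the response unchanged."""
    head, sep, body = response.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")

    kept, freed, old_len, cl_index = [], 0, None, None
    for line in head.split(b"\r\n"):
        name = line.split(b":", 1)[0].strip().lower()
        if name in drop:
            freed += len(line) + 2  # the header line plus its CRLF
            continue
        if name == b"content-length":
            old_len = int(line.split(b":", 1)[1].decode("ascii"))
            cl_index = len(kept)
        kept.append(line)
    if old_len is None:
        raise ValueError("no Content-Length header to adjust")

    def extra_digits(room):
        # Digits the Content-Length value gains if the body grows by `room`.
        return len(str(old_len + room)) - len(str(old_len))

    # Every extra digit in the header eats one of the freed bytes.
    room = freed
    while room + extra_digits(room) > freed:
        room -= 1
    if len(payload) > room:
        raise ValueError("payload does not fit in the freed space")
    slack = freed - room - extra_digits(room)  # 0 or a stray byte at a boundary

    old_digits = str(old_len).encode()
    new_digits = str(old_len + room).encode()
    kept[cl_index] = kept[cl_index].replace(old_digits, new_digits, 1) + b" " * slack

    new_body = payload + b" " * (room - len(payload)) + body
    return b"\r\n".join(kept) + b"\r\n\r\n" + new_body


def content_before_doctype(body):
    """Return whatever non-whitespace content precedes <!DOCTYPE, if any."""
    index = body.upper().find(b"<!DOCTYPE")
    return body[:index].strip() if index > 0 else b""
```

Calling `content_before_doctype()` on the body of a page that is known to start with its <!DOCTYPE gives an empty result when the page is clean, and the injected bytes when it is not.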

There are more detailed explanations in the GLUG thread for some of these things.  In particular you may also want to read these posts:

  • http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00018.html
  • http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00022.html
  • http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00026.html

I really need to learn how to write this kind of garbage in English and not my variant of techbabble.

Some further discussion regarding possible fixes ensued, and after discussion with various people, including the original person asking the question, I made this post.  I quote from http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00028.html:

> So we have a neat and devilishly cunning way of getting content onto a browser
> machine virtually anonymously. Ouch.
>
> What's the defense?

You don't want the answer to this.  In a word:  NONE.

I've been thinking really hard, and I can come up with only a handful of
"solutions":

1.  Hardcode the router's MAC in /etc/ethers.  This still doesn't
prevent the injection from happening with MAC spoofing (some host can
confuse the switch into overflowing it's MAC tables and thus effectively
sending all traffic everywhere, but getting the switch to only send it
to that host would be significantly harder, but not impossible, and
HSRVP could in this case actually provide a stepping stone for actually
making the exploit possible again.

2.  In this particular setup there are a few routers high in the range,
I could hard code my router to one of them seeing as only .1 was
targeted in this case.

3.  There is apparently some options on the CISCO switches which can
guess as to whom the attacker is and shut down that port, or at a
minimum detect and report it.  This will probably be quite effective in
the subnet.

4.  It's possible to configure CISCO switches to only allow
communication with the router port, but then you need to also configure
exceptions between servers that are allowed to communicate.

And all of those, without exception, only protects the local subnet.  It
DOES NOT prevent this attack from happening between other hops further
down the route.

As a friend of mine (IPv6 fanatic) would say:  This is a fundamental
flaw in IPv4.  We need IPv6 with build in security mechanisms and HMACs
which guarantees authenticity (Ok, we have it in IPv4 in IPSec, which is
basically the "security" features from IPv6 back-ported).  Not that I
understand how you can authenticate a packet unless you have a shared
key to run the HMAC with, which also implies that we're going to need
X509 certificates on a per-IP basis to be issued to each and every
machine on the Internet.  Which brings me to my next question:  How much
will Thawte and/or Verisign ask for such a certificate?  And how much
are they going to put into MS's pockets to prevent other parties from
entering the arena?

Also, considering that you now need to run two cryptographic hashes per
packet coming in off the wire, or going out on the wire, what
implications will this have on already loaded servers?

Just some things to ponder.

Jaco

The option referenced in 4 is called PVLAN (An extension of the VLAN aka virtual LAN concept).  Re option three, Simeon Miteff mentioned there is a tool called arpwatch which does this same thing on your Linux servers.  At a minimum this will allow one to detect this crap much quicker, and take appropriate action in a shorter time without having to go through the trouble of running sniffers and the like.
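
For the detection side, a rough idea of what such a check boils down to, as a minimal Python 3 sketch.  This is NOT how arpwatch itself is implemented (it watches ARP traffic on the wire); it only compares the kernel’s current ARP table, in the column layout Linux exposes in /proc/net/arp, against addresses you have pinned by hand.  The pinned address below is a documentation placeholder, and the function names are made up for illustration.

```python
def parse_proc_net_arp(text):
    """Parse /proc/net/arp style text into an {ip: mac} dictionary."""
    entries = {}
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) >= 6:
            # Columns: IP address, HW type, Flags, HW address, Mask, Device
            entries[fields[0]] = fields[3].lower()
    return entries


def check_pinned(entries, pinned):
    """Return (ip, expected_mac, actual_mac) for every pinned IP that the
    ARP table currently maps to a different MAC address."""
    alerts = []
    for ip, expected in pinned.items():
        actual = entries.get(ip)
        if actual is not None and actual != expected.lower():
            alerts.append((ip, expected.lower(), actual))
    return alerts


if __name__ == "__main__":
    # Hypothetical pin: replace with your real gateway IP and its known MAC.
    pinned = {"192.0.2.1": "00:00:0c:07:ac:0a"}
    try:
        with open("/proc/net/arp") as handle:
            table = parse_proc_net_arp(handle.read())
    except FileNotFoundError:
        table = {}  # not on Linux, nothing to check
    for ip, expected, actual in check_pinned(table, pinned):
        print("WARNING:", ip, "expected", expected, "but ARP table says", actual)
```

Run from cron every minute or so this would at least have raised a flag the first time the gateway entry flipped, although a poll can obviously miss a flip that happens between two runs.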
