Just over a month ago we had an incident where the default gateway on our servers would sporadically stop responding. We first observed this as the servers themselves appearing to go down intermittently; only once we realized that we could log on to other servers, and that during these “outages” we could still reach the affected servers via those other servers (i.e. they were reachable from the local LAN but not from anywhere else), did we start pointing fingers at the gateway.
At the time we figured this was a misconfigured or faulty gateway (seeing as Internet Solutions is using HSRP, we figured these outages were just the fail-over time). Quintin, however, was told that our gateways were misconfigured and that he should be using x.y.z.253 and NOT x.y.z.1 as we were told during the installation of these servers. Since then the Windows server he was working on was not affected again, and seeing as most of my Linux servers just keep running, I didn’t think too much about this again until last weekend, when I was actively working on one of them. I re-reported the issue, making it clear that the explanations I was given previously were inadequate and not acceptable (I was given answers such as “someone accidentally bumped a cable”, etc …).
After running mtr traces from my DSL into the DC to three of the servers (web/mail, VoIP and Windows) since reporting it at 15:54 (first response at 16:48 – a reasonable response time for ONCE, thanks IS), I realized two things:
- ~30% round-trip packet loss on the final hop, and around 0.4–0.7% loss on the hops before that (meaning that small loss was probably the DSL side of things).
- The Windows server was not affected at all! This made no sense whatsoever to me until I realized it was using a different gateway.
Either way, I sent these traces off to IS at 17:25 – showing even higher loss on the last hop than the 30% above (67% and 57% respectively) – and later (18:40) passed on an update showing that my servers could still reach each other internally in spite of not being able to reach the outside world. At 7:38 on the Sunday I looked at the traces again, and since the loss had decreased to 5% and 6% I figured they must’ve restored sanity and fixed the problem. I was wrong, very very wrong.
One of my clients phoned me around 8:45-ish and said I should look at my email – they thought I might want to see it, since it looked like my admin site had been compromised. Of course I immediately loaded the site, just to see a funny bar at the top-left and all my fonts looking rather funky. A quick view of the page source showed a <script> tag embedded right at the beginning of my HTML document, even before the <!DOCTYPE declaration! If you’ve ever seen me fire up a shell and ssh into a server you know it can be done quickly – I believe I broke all known records that morning … the grep was fired off on the php scripts so quickly I barely had time to think about it … nothing. mysqldump piped through the same grep … nothing. On /etc … nothing, /home … nothing (this took a while), / … nothing (this one took a LONG time).
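For the curious, the check itself is mechanically simple. Here is a minimal sketch of the kind of test involved (the find_injected_script helper is my own after-the-fact invention, not the literal grep I ran that morning):

```python
def find_injected_script(html: str) -> bool:
    """Return True if a <script> tag appears before the <!DOCTYPE
    declaration -- a strong hint that something was prepended to the
    document rather than authored into it."""
    lowered = html.lower()
    doctype = lowered.find("<!doctype")
    script = lowered.find("<script")
    # A script tag before (or in the absence of) the doctype is suspicious.
    return script != -1 and (doctype == -1 or script < doctype)

# A clean page: doctype first, scripts later.
clean = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
         '<html><script src="app.js"></script></html>')
# An infected page: a script tag injected ahead of everything.
infected = '<script src="http://evil.example/x.js"></script>' + clean

print(find_injected_script(clean))     # False
print(find_injected_script(infected))  # True
```

Since the grep over the filesystem found nothing, the injection clearly wasn’t sitting in any file on disk – which is exactly what made it so puzzling at first.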
So off I went to GLUG … ah, gotta love GLUG … and this thread ensued: http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00009.html
As per the thread, my initial thought was a compromised proxy … however, seeing as I described the gateway problem first, you can hopefully already guess that this was NOT the case. The post at http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00013.html was the breakthrough. Specifically, Quintin mentioned that he had issues loading certain pages, and then I tried to load pages from other servers in the DC. This excerpt in particular should explain:
Come to think of it - there is some correlation between the servers that's available via our gateway and those that aren't. I can reproduce this "page hack" on the web pages that sporadically goes awol, but not on those that doesn't (In our particular little subnet). I wonder whether those two are not perhaps related. ARP spoofing anyone? I suspect this issue is going to be handed off to IS... sorry for the IS guys on the list, but there is some work coming your way.
After this realization it became much easier to look for what I feared: a router compromise. So I started sniffing for ARP traffic (inside a screen session, writing to a file), and it wasn’t long before I found sequences like this:
12:57:07.871279 ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46
12:57:07.884269 ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46
12:57:07.889516 ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46
12:57:08.433296 ARP, Reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco), length 46
12:57:08.433350 ARP, Reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco), length 46
That actually tells us three things, not just one:
- There is definitely somebody spoofing the router address – specifically, some device with a Dell MAC address pretending to be the router.
- There is likely a loop on the physical network, seeing as we’re receiving the same packet multiple times (it was actually an IS engineer that pointed this one out to me).
- IS is using HSRP for their routers (which I already knew, but the Cisco MAC address confirms it).
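You don’t need a fancy tool to spot this pattern in a capture. A small sketch (my own post-hoc helper, not what I ran at the time) that flags any IP answered for by more than one MAC in tcpdump-style output:

```python
def arp_claims(tcpdump_lines):
    """Map each claimed IP to the set of MAC addresses seen answering
    for it in tcpdump-style 'ARP, Reply <ip> is-at <mac>' output."""
    claims = {}
    for line in tcpdump_lines:
        if "is-at" not in line:
            continue
        parts = line.split()
        ip = parts[parts.index("Reply") + 1]
        mac = parts[parts.index("is-at") + 1]
        claims.setdefault(ip, set()).add(mac)
    return claims

# Two lines from the actual trace above (addresses anonymized as in the post).
trace = [
    "12:57:07.871279 ARP, Reply x.y.z.1 is-at 00:11:43:dc:15:24 (oui Unknown), length 46",
    "12:57:08.433296 ARP, Reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco), length 46",
]
suspicious = {ip for ip, macs in arp_claims(trace).items() if len(macs) > 1}
print(suspicious)  # {'x.y.z.1'}
```

Any IP in that suspicious set has two devices fighting over it – for a gateway address, that is about as red as a red flag gets.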
At 19:50 I once more sent an email to IS, asking them to get serious about this since it was a security issue. I also had to spell out what exactly was going on (not something that comes easily to me). So for those who have been raising their eyebrows at the above – the best analogy I could come up with is something along these lines:
A LAN is essentially like a room full of people, where each person represents a computer. Just about none of these people know each other, and they cannot communicate with anybody outside of the room. In order to communicate with the outside you’ve got to speak with a special person who stands in a doorway – you only have his name, and you’ve got no way to confirm who’s standing in a doorway and who isn’t – not even by asking them. So if I want to send a message to somebody not in the room I basically call out: Hey Mr Router – who are you? And then Mr Router is supposed to call back: Hey you, here I am! What’s happening above is that two people are calling back, so which one is the real Mr Router and which one is the impersonator? You’ve got to make a choice and follow through. If you pick the right one, your message goes where it’s supposed to; if you don’t, well, nasty things can happen, as in this story:
The computer that pretended to be the router took the message, looked for certain things in it, modified it, and then passed it on to the real Mr Router – pretending to be us. Very, very naughty.
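In protocol terms the trick works because a host’s ARP cache tends to believe whoever spoke last. A toy model of that behaviour (deliberately simplified – real stacks vary in exactly when they accept cache updates):

```python
class ArpCache:
    """Toy model of a host's ARP cache: a later reply simply overwrites
    whatever mapping was there before. ARP has no authentication, so in
    this simplified model the last speaker wins."""

    def __init__(self):
        self.table = {}

    def on_reply(self, ip, mac):
        # No verification of the sender whatsoever.
        self.table[ip] = mac

    def lookup(self, ip):
        return self.table.get(ip)

cache = ArpCache()
cache.on_reply("x.y.z.1", "00:00:0c:07:ac:0a")  # legitimate HSRP virtual MAC
cache.on_reply("x.y.z.1", "00:11:43:dc:15:24")  # spoofed reply arrives later
print(cache.lookup("x.y.z.1"))  # 00:11:43:dc:15:24 -- traffic now goes to the impostor
```

Once the cache holds the spoofed entry, every frame destined for the gateway is addressed to the attacker’s NIC, and the attacker is free to inspect, modify and forward at leisure.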
Even after all of this hard, non-circumstantial evidence and definite proof (well beyond speculation), I received this shortly before 21:00 on the Sunday evening:
Network engineer found that there is a problem with the available uplink bandwidth from the switch and that this is entering planning for rectification. We will get our Netops engineers to assist, but will not be able to resolve it tonight.
I thought I was going to kill someone. My reply to this was simple, yet effective:
What does available uplink bandwidth have to do with a machine that’s spoofing ARP responses?
Yes, the uplink bandwidth is a problem – I’ve been saying that for a while now – but this issue is happening over a WEEKEND (A LONG WEEKEND, I MAY ADD) when it’s dead quiet, so there is a better chance of hell freezing over than of this being an uplink bandwidth issue.
I also immediately called the GSC to try and speak with whichever engineer made that assessment and to try and hammer into his head what was going on; as they were trying to connect me, he fortunately called me himself. This helped somewhat with my mood.
This is fortunately where things turned: within a few minutes of him calling me I managed to explain what was actually happening, he took action and disabled the compromised host’s port, and all went back to normal by 21:30. Confirmed at 21:34 with ARP traces that finally looked normal:
21:22:16.089287 arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)
21:22:51.888723 arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)
21:23:27.688410 arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)
21:24:14.599464 arp reply x.y.z.1 is-at 00:00:0c:07:ac:0a (oui Cisco)
Much better, no duplicates, no more issues with ssh just going awol. No more ping timeouts, no more infected HTTP responses.
The HTTP infections themselves were also pretty ingenious. In particular, a typical HTTP response looks something like this:
HTTP/1.1 200 OK
Date: Sun, 03 May 2009 08:49:07 GMT
Server: Apache
Content-Length: 2698
Keep-Alive: timeout=15, max=98
Connection: Keep-Alive
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
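Whether the injector in our case bothered to fix up the Content-Length header I can’t say for certain, but a length mismatch is one mechanical symptom worth checking for. A sketch (the response text here is invented for illustration, not from my captures):

```python
def split_response(raw: str):
    """Split a raw HTTP response into (headers dict, body string)."""
    head, _, body = raw.partition("\r\n\r\n")
    headers = {}
    for line in head.split("\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(": ")
        headers[name] = value
    return headers, body

raw = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 16\r\n"
    "\r\n"
    "<!DOCTYPE HTML>\n"
)
headers, body = split_response(raw)
# If an on-path injector prepends a script tag without touching the
# headers, the advertised length no longer matches what arrived.
tampered = '<script src="http://evil.example/x.js"></script>' + body
print(len(body), headers["Content-Length"])             # 16 16
print(len(tampered) == int(headers["Content-Length"]))  # False
```

In our case the injected <script> landed before the <!DOCTYPE line, which is why the browser rendered the page so strangely in the first place.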
There are more detailed explanations for some of these things in the GLUG thread; in particular you may want to read the follow-up posts there.
I really need to learn how to write this kind of garbage in english and not my variant of techbabble.
Some further discussion regarding possible fixes ensued, and I made this post after discussion with various people, including the original person asking the question. I quote from http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00028.html:
> So we have a neat and devilishly cunning way of getting content onto a browser
> machine virtually anonymously. Ouch.
>
> What's the defense?

You don't want the answer to this. In a word: NONE. I've been thinking really hard, and I can come up with only a handful of "solutions":

1. Hardcode the router's MAC in /etc/ethers. This still doesn't prevent the injection from happening with MAC spoofing (some host can confuse the switch into overflowing its MAC tables and thus effectively sending all traffic everywhere; getting the switch to only send it to that host would be significantly harder, but not impossible, and HSRP could in this case actually provide a stepping stone for making the exploit possible again).

2. In this particular setup there are a few routers high in the range; I could hard-code my router to one of them, seeing as only .1 was targeted in this case.

3. There are apparently some options on the Cisco switches which can guess who the attacker is and shut down that port, or at a minimum detect and report it. This will probably be quite effective in the subnet.

4. It's possible to configure Cisco switches to only allow communication with the router port, but then you need to also configure exceptions between servers that are allowed to communicate.

And all of those, without exception, only protect the local subnet. They DO NOT prevent this attack from happening between other hops further down the route. As a friend of mine (IPv6 fanatic) would say: this is a fundamental flaw in IPv4. We need IPv6 with built-in security mechanisms and HMACs which guarantee authenticity (OK, we have it in IPv4 in IPsec, which is basically the "security" features from IPv6 back-ported). Not that I understand how you can authenticate a packet unless you have a shared key to run the HMAC with, which also implies that we're going to need X.509 certificates on a per-IP basis issued to each and every machine on the Internet.
Which brings me to my next question: How much will Thawte and/or Verisign ask for such a certificate? And how much are they going to put into MS's pockets to prevent other parties from entering the arena? Also, considering that you now need to run two cryptographic hashes per packet coming in off the wire, or going out on the wire, what implications will this have on already loaded servers? Just some things to ponder. Jaco
The option referenced in 4 is called PVLAN (an extension of the VLAN, aka virtual LAN, concept). Re option 3, Simeon Miteff mentioned there is a tool called arpwatch which does the same kind of detection on your Linux servers. At a minimum this will allow one to detect this crap much quicker and take appropriate action sooner, without having to go through the trouble of running sniffers and the like.
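To make options 1 and 3 concrete on the server side, here is a toy sketch of the check that arpwatch automates (the helper is mine, and real arpwatch sniffs the interface directly and maintains a database; this just shows the principle of alerting when the gateway’s MAC deviates from a pinned value, as one would pin it in /etc/ethers):

```python
def watch_gateway(expected_mac, replies):
    """arpwatch-style check (a toy sketch, not arpwatch itself): yield an
    alert for every observed ARP reply where the gateway's MAC differs
    from the value we pinned for it."""
    for ip, mac in replies:
        if mac.lower() != expected_mac.lower():
            yield f"ALERT: {ip} claimed by {mac}, expected {expected_mac}"

# Observed (ip, mac) pairs, as in the traces earlier in this post.
replies = [
    ("x.y.z.1", "00:00:0c:07:ac:0a"),  # legitimate HSRP virtual MAC
    ("x.y.z.1", "00:11:43:dc:15:24"),  # the spoofer
]
alerts = list(watch_gateway("00:00:0c:07:ac:0a", replies))
for alert in alerts:
    print(alert)
```

It won’t stop the attack, but it turns hours of manual sniffing into an immediate notification – which in this story would have saved a whole weekend.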