So I’ve been seeing real weirdness recently; ssh doesn’t normally do this:
jkroon@dijkstra ~ $ ssh root@othala.uls.co.za
Read from socket failed: Connection reset by peer
jkroon@dijkstra ~ $
Working like this is NOT fun. In fact, it’s rather aggravating, to say the least: around 5 to 10 % of my connections end up like the above, and up to about 80 % of them get terminated within the first minute.
The specific network provider has refused to assist because I’m not using Windows, until I made a huge fuss about the issues outlined in this blog. Their responses to this point will be discussed near the end of this blog. I’ll also be withholding their name for the moment as they are a big corporate and might still come to the party, albeit I have my doubts.
UPDATE (2008/09/20): Vodacom Responded!
The short (English) version:
It seems that the provider has equipment on its network that is modifying TCP connections in some rather obscure ways. Specifically, it is adjusting the sequence numbers on the outbound data stream, and from prior experience (and me just being a paranoid person) I’m rather annoyed, and intrigued to know what’s actually going on.
The long (technical) version:
Being annoyed for overly long (anything more than a day is highly annoying, and this has been running for all of last week) and then being pointed to an article entitled Packet Forgery By ISPs: A Report on the Comcast Affair did not help to calm me down Saturday morning. Nor did the calls to the call centre, which all ended with something along the lines of “Sorry sir, but we absolutely do not support anything but Windows Vista and XP”. Apologies to any of the call centre agents that might happen to read this – I know you’re just doing your work.
Figuring out where connections are being reset from is not easy. The problem with tear-downs like this is that the RST packet has to appear to come from the remote peer, so there are three possible explanations:
- My server is resetting connections – not the case since it works flawlessly from every other network; or
- The network provider is resetting connections; or
- The modem, or the software used to connect the modem, suddenly developed an urge to RST connections (highly, extremely unlikely).
Now, the provider is essentially telling me to get lost since I’m not using a “supported operating system” (Windows Vista and XP), so I’m screwed in that regard (no, I really don’t own a copy of Windows I can use to test with).
We can eliminate the first, since if that were the case I’d be seeing this problem when ssh’ing to the server over DSL as well.
We can eliminate the software, as the same software is being used for establishing connections over DSL, 56K V.92, iBurst as well as any other PPP-based connectivity you can think of. We can also eliminate the hardware (E220 modem), since it hasn’t had firmware updates in ages (not unless it updates its own firmware) and it hasn’t done this previously. The provider was in a way insinuating that it’s either my modem or the software I’m using – I’m 99.99 % sure this is NOT the case. Accurately generating RSTs for established connections takes effort; it isn’t something that pure routers or switches normally do.
Also, if a gateway can’t route a packet it should respond with an ICMP destination host/network unreachable, with its OWN IP as source (granted, Windows does not handle this particularly well), not an RST packet that seems to come from the peer. So whatever is causing this is breaking the standards anyway. I don’t think Linux would do that (at least, not purposefully), and it wouldn’t bother spoofing it in such a way that tcpdump would see it coming in off ppp2; it would simply send it inward off of ethlan towards my laptop. Of course, all of this is assuming that it’s an RST that’s causing the connection to be dropped and not some other fault in the TCP data stream.
If it were a problem with the connection to the tower, as the support guy also suggested, then I would see a dangling connection on the server. Yet the connection is torn down on that side too, thus: ON BOTH ENDS OF THE CONNECTION IT’S LOOKING LIKE THE OTHER PARTY IS TEARING DOWN THE CONNECTION.
Bringing out the big guns: tcpdump. Fire it up on eth0 on the server, sniffing all traffic for port 22, and on ppp2 on my gateway at home, also all port 22 traffic. Time to get a few ssh connections between the two endpoints RSTed. I have to admit here, I got lucky with the first connection: it got disconnected immediately. Sometimes it takes a few seconds, and sometimes I get really (un)lucky and it takes minutes. The fact that the first connection died means there wasn’t a lot of raw data to sift through. This resulted in two packet traces, one on each side of the suspect network. After isolating the specific connection (wireshark came in handy here) I’m left with all packets for the connection as seen at both ends, one trace on the gateway and one on the server.
Now, if nobody is breaking standards and everyone is playing by the rules, we should see the exact same packets on both ends of the network. What we should see is an initial SYN packet, followed by a SYN/ACK, and then an ACK. That’s the initial three-way handshake. Then we want to see some data flowing, as well as some ACKs. A clean teardown will be initiated by a FIN packet. There are about 20 packets in each dump, but the lines tcpdump prints are long, so I’m not going to handle all of them (use tcpdump -r on the above files). The initial three packets on both sides are exactly what we expect (well, almost, but the difference wasn’t discovered until later). The server then sends 20 bytes of data, which gets ACKed, and the client also sends 20 bytes, which also gets ACKed. So far so good. Directly after this each end transmits ~790 bytes towards the other end; these cross in transit. The data from the server to the client arrives correctly. However, what we RECEIVE at the server is a re-transmit of the initial 20-byte packet (even though the client never did re-transmit).
The implication should be noted here: some 3rd party device between the two networks is buffering data. This is NOT NORMAL. Packets DO NOT simply get duplicated (unless perhaps there’s a bug on some switch, and then they’ll probably arrive at the other peer within milliseconds of one another).
And that is part of what I’m trying to wrap my head around. This is the twenty byte packet that LEFT the client:
0x0000: 0004 0200 0000 0000 0000 0000 0000 0800 ................
0x0010: 4500 0048 1f25 4000 3f06 9742 c42e adaf E..H.%@.?..B....
0x0020: 48e8 ca82 d2a7 0016 b1db 2827 099f c8df H.........('....
0x0030: 8018 002e f692 0000 0101 080a 000b 37db ..............7.
0x0040: 6dc2 1c0a 5353 482d 322e 302d 4f70 656e m...SSH-2.0-Open
0x0050: 5353 485f 342e 370a SSH_4.7.
And in English (somewhat) this basically means:
IP (tos 0x0, ttl 63, id 7973, offset 0, flags [DF], proto TCP (6), length 72) 1-2-3-4.provider.co.za.53927 > othala.uls.co.za.ssh: P, cksum 0xf692 (correct), 1:21(20) ack 21 win 46 <nop,nop,timestamp 735195 1841437706>
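The translation above can be reproduced by hand from the hex dump. The following is a minimal sketch in Python, assuming the first 16 bytes of the dump are a Linux "cooked" capture header (that length is my assumption, based on the dump layout) and that the rest is a plain IPv4 packet carrying TCP. The function name decode_frame and the dictionary keys are made up for illustration.

```python
import struct

# Hex dump of the captured frame, copied from the trace above.
FRAME_HEX = """
0004 0200 0000 0000 0000 0000 0000 0800
4500 0048 1f25 4000 3f06 9742 c42e adaf
48e8 ca82 d2a7 0016 b1db 2827 099f c8df
8018 002e f692 0000 0101 080a 000b 37db
6dc2 1c0a 5353 482d 322e 302d 4f70 656e
5353 485f 342e 370a
"""

# Assumption: 16-byte Linux "cooked" link-layer header precedes the IP packet.
LINK_HEADER_LEN = 16


def decode_frame(frame_hex, link_len=LINK_HEADER_LEN):
    """Pull the fields tcpdump prints out of a raw hex dump."""
    raw = bytes.fromhex("".join(frame_hex.split()))
    ip = raw[link_len:]
    ihl = (ip[0] & 0x0F) * 4                      # IP header length in bytes
    total_len, ident, frag = struct.unpack("!HHH", ip[2:8])
    tcp = ip[ihl:total_len]
    sport, dport, seq, ack, off_flags, window, checksum = struct.unpack(
        "!HHIIHHH", tcp[:18]
    )
    tcp_header_len = (off_flags >> 12) * 4        # includes TCP options
    return {
        "tos": ip[1],
        "ttl": ip[8],
        "id": ident,
        "df": bool(frag & 0x4000),
        "proto": ip[9],
        "length": total_len,
        "sport": sport,
        "dport": dport,
        "seq": seq,
        "ack": ack,
        "flags": off_flags & 0x3F,                # 0x18 == PSH|ACK
        "window": window,
        "checksum": checksum,
        "payload": tcp[tcp_header_len:],
    }


if __name__ == "__main__":
    for key, value in decode_frame(FRAME_HEX).items():
        print(key, value)
```

Note that the raw sequence number comes out as 2983929895, whereas tcpdump prints 1:21 because it shows sequence numbers relative to the initial sequence number of the connection.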
Other than the checksum and ttl values being different, this is exactly what’s received by the server (ttl goes down to 45, and as a consequence the checksum has to adjust) – or so I thought, please read on.
When the 20-byte segment arrives for the second time, the server responds with another ACK. This is normal: the assumption the operating system has to make is that the ACK got lost, hence the re-transmit, so no problem here yet. The ACK includes a SACK (selective ACK) option for some reason, which in this case serves no purpose as far as I can tell, since the SACK is marked for {1,21} and the ACK value is 21, so the SACK is redundant as it’s implied by the ACK value. Not illegal though, just strange. Tcpdump writes it like this:
. ack 21 win 91 <nop,nop,timestamp 1841437760 735195,nop,nop,sack 1 {1:21}>
When this arrives at the client, tcpdump reports it like this:
. ack 21 win 91 <nop,nop,timestamp 1841437760 735195,nop,nop,sack 1 {264810543:264810563}>
Which means that the packets differ with respect to the SACK options. A raw dump of the transmitted packet (excluding link-layer headers) looks like:
0x0000: 4500 0040 d4c5 4000 4006 e0a9 48e8 ca82 E..@..@.@...H...
0x0010: c42e adaf 0016 d2a7 099f cbef c1a3 d869 ...............i
0x0020: b010 005b 82b8 0000 0101 080a 6dc2 1c40 ...[........m..@
0x0030: 000b 37db 0101 050a c1a3 d855 c1a3 d869 ..7........U...i
And on the receiving side it looks like this:
0x0000: 4574 0040 d4c5 4000 2906 f735 48e8 ca82 Et.@..@.)..5H...
0x0010: c42e adaf 0016 d2a7 099f cbef b1db 283b ..............(;
0x0020: b010 005b 42af 0000 0101 080a 6dc2 1c40 ...[B.......m..@
0x0030: 000b 37db 0101 050a c1a3 d855 c1a3 d869 ..7........U...i
Ok, so the IP header is the first 20 bytes (header length field == 5, 5 * 4 == 20) and the TCP header consumes the next 20 bytes. There will be some options, but we’ll consume them a tad later.
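In the IP header the TTL (8 bits at 0x0008) and checksum (16 bits at 0x000A) values changed, so that’s all good and well. (The TOS byte at 0x0001 also differs in the dumps above, 0x00 versus 0x74; re-marking that field in transit is not unusual.)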
Now, if the network provider was not messing with data the TCP header would have remained the same. However, at packet offset 0x001C there are 32 bits that are different. Taking 0x14 off there leaves us with TCP header offset 0x08 … checking the RFCs, that turns out to be the ACK value (only valid if the ACK bit in the flags is set). The TCP checksum at 0x0024 differs too, as a consequence.
If this hasn’t yet rung an alarm bell in your head, it should: Someone is modifying packets in transit!
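The comparison of the two raw dumps can also be done mechanically. Here is a minimal sketch that diffs the transmitted and received packets byte by byte and names the header fields the differing offsets fall into. It assumes a 20-byte IP header without options (which is what the dumps show), and the field table and function names are my own labels for illustration.

```python
# Raw IP packets (no link-layer header), copied from the two dumps above.
SENT_HEX = """
4500 0040 d4c5 4000 4006 e0a9 48e8 ca82
c42e adaf 0016 d2a7 099f cbef c1a3 d869
b010 005b 82b8 0000 0101 080a 6dc2 1c40
000b 37db 0101 050a c1a3 d855 c1a3 d869
"""
RECEIVED_HEX = """
4574 0040 d4c5 4000 2906 f735 48e8 ca82
c42e adaf 0016 d2a7 099f cbef b1db 283b
b010 005b 42af 0000 0101 080a 6dc2 1c40
000b 37db 0101 050a c1a3 d855 c1a3 d869
"""

# (first offset, last offset, label); assumes a 20-byte IP header.
FIELDS = [
    (1, 1, "IP TOS"),
    (8, 8, "IP TTL"),
    (10, 11, "IP header checksum"),
    (24, 27, "TCP sequence number"),
    (28, 31, "TCP acknowledgment number"),
    (36, 37, "TCP checksum"),
]


def to_bytes(hex_dump):
    return bytes.fromhex("".join(hex_dump.split()))


def differing_offsets(a_hex, b_hex):
    """Return the byte offsets at which the two packets differ."""
    a, b = to_bytes(a_hex), to_bytes(b_hex)
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def field_name(offset):
    for first, last, name in FIELDS:
        if first <= offset <= last:
            return name
    return "other"


def changed_fields(a_hex, b_hex):
    """Return the distinct field labels touched by the differences, in order."""
    names = []
    for offset in differing_offsets(a_hex, b_hex):
        name = field_name(offset)
        if name not in names:
            names.append(name)
    return names


if __name__ == "__main__":
    print(differing_offsets(SENT_HEX, RECEIVED_HEX))
    print(changed_fields(SENT_HEX, RECEIVED_HEX))
```

The SACK edges in the last eight bytes do not show up in the diff at all, which is exactly the problem: the acknowledgment number was rewritten in transit, the SACK option was not.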
Right, so SOMEONE IS MESSING WITH MY DATA! Yet, tcpdump is reporting both of these as 21 (it prints ACK values relative to the initial sequence number it saw on that side)! Going back to confirm the sequence numbers in the opposite direction, and going back all the way to the initial SYN packet, reveals that someone is in fact offsetting the sequence numbers by almost 265M! To be precise, the initially sent sequence number was 2983929894 and this arrived as 3248740436.
This confirms that someone (or something, more exactly) is doing something with sequence numbers. I’m almost willing to bet a month’s salary on the fact that whatever is making these modifications is also the cause of the duplicate delivery. Either way, this is bizarre. A normal NAT gateway wouldn’t need these modifications as it only needs to be concerned with the IP and port-number mappings and has nothing to do with sequence numbers.
Once this now invalid (due to the sequence number adjustment) SACK option reaches the client, it retaliates with an RST packet (as can be seen in the dump). Since the sequence value on the RST packet is in range, tear-down follows. From this point on things go downhill and all communication between the endpoints is done for; all further packets are between one of the endpoints and the midway device.
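The arithmetic behind those numbers is easy to check. This is a small sketch using the two initial sequence numbers quoted above and the SACK edges from the raw dump. Sequence numbers are 32-bit and wrap around, so all differences are taken modulo 2^32; the helper name seq_delta is just for illustration.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around

ISN_SENT = 2983929894      # initial sequence number as it left the client
ISN_RECEIVED = 3248740436  # the same SYN as it arrived at the server

# SACK block edges, as found in the raw dump on both sides.
SACK_LEFT = 0xC1A3D855
SACK_RIGHT = 0xC1A3D869


def seq_delta(start, end):
    """Distance from start to end in 32-bit sequence space."""
    return (end - start) % MOD


# How far the middle device shifted the client's sequence numbers.
OFFSET = seq_delta(ISN_SENT, ISN_RECEIVED)

if __name__ == "__main__":
    print("offset:", OFFSET)
    # Relative to what the server saw, the SACK block is sane ...
    print("server view:", seq_delta(ISN_RECEIVED, SACK_LEFT),
          seq_delta(ISN_RECEIVED, SACK_RIGHT))
    # ... but relative to the client's real ISN it is far out of range.
    print("client view:", seq_delta(ISN_SENT, SACK_LEFT),
          seq_delta(ISN_SENT, SACK_RIGHT))
```

The server view comes out as 1 and 21, and the client view as 264810543 and 264810563, which are the two forms tcpdump printed for the same SACK option on either side.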
What we know:
This is a tough question to really answer as there is no evidence beyond observations here. Not to mention that the internet is a big, big place and this can be happening anywhere along the path. What I can say is that I’m very sure it’s the specific provider in question, as I’m not experiencing this problem on any other networks, and I can’t get out to any other networks from the provider without being struck by this problem.
Consequences:
There may be others, but currently I’m aware of the following:
- All traffic for a specific connection HAS to flow through the device or alternatively some packets may escape being adjusted – the results will be similar to those that initiated this “investigation”, or alternatively, all adjustment information HAS to be shared between a set of devices (probably not highly scalable).
- Connections that attempt to make use of selective acknowledgments will get destroyed due to the nature of TCP and the fact that the TCP SACK options aren’t being adjusted along with the sequence numbers.
If you can think of others, please let me know so that I can add them here. These are tangible consequences, not speculation.
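To make the second consequence concrete, here is a rough sketch of what a device that shifts sequence numbers would additionally have to do to keep SACK working: walk the TCP options and shift every SACK block edge by the same amount. This is purely illustrative and not a description of what the provider's equipment does. The function name is made up, the offset is the one derived from the initial sequence numbers above, and a real device would also have to recompute the TCP checksum afterwards.

```python
import struct

MOD = 2 ** 32

# TCP options of the ACK as sent by the server, copied from the raw dump:
# nop, nop, timestamp, nop, nop, SACK with one block.
SENT_OPTIONS = bytes.fromhex(
    "0101080a6dc21c40000b37db" "0101050ac1a3d855c1a3d869"
)

# Shift applied to the client's sequence numbers (received ISN minus sent ISN).
OFFSET = (3248740436 - 2983929894) % MOD

OPT_END, OPT_NOP, OPT_SACK = 0, 1, 5


def shift_sack_blocks(options, delta):
    """Return a copy of the TCP options with every SACK edge shifted by delta."""
    out = bytearray(options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == OPT_END:
            break
        if kind == OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(out):
            break                      # truncated option, stop parsing
        length = out[i + 1]
        if length < 2 or i + length > len(out):
            break                      # malformed option, stop parsing
        if kind == OPT_SACK and (length - 2) % 8 == 0:
            for j in range(i + 2, i + length, 4):
                (edge,) = struct.unpack_from("!I", out, j)
                struct.pack_into("!I", out, j, (edge + delta) % MOD)
        i += length
    return bytes(out)


if __name__ == "__main__":
    fixed = shift_sack_blocks(SENT_OPTIONS, -OFFSET)
    print(fixed.hex())
```

Applied to the option bytes from the dump, the SACK edges become b1db2827 and b1db283b, which lines up with the acknowledgment number b1db283b that actually arrived at the client.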
Workarounds:
Currently there are no proper solutions, but the problem can be worked around by switching selective ACK support off in Linux. Windows users – I’m not even sure Windows is SACK-enabled, but I’ve had some Windows users mention to me that they too have some problems (one in particular kept on complaining that for the last week or two he was unable to game online – not that I’d know how you game online with 3G). Anyway, to switch off SACK support you can add the following to /etc/sysctl.conf:
net.ipv4.tcp_sack = 0
And then run:
sysctl -p /etc/sysctl.conf
And then of course, if you’d prefer to temporarily switch it off (until reboot):
echo 0 > /proc/sys/net/ipv4/tcp_sack
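To confirm which setting is currently active, the same /proc file can simply be read back. A minimal sketch, assuming a Linux machine; the function name sack_enabled is made up, and the path parameter exists only so the check can be pointed at a different file.

```python
from pathlib import Path

# The same file the echo command above writes to.
TCP_SACK_PATH = "/proc/sys/net/ipv4/tcp_sack"


def sack_enabled(path=TCP_SACK_PATH):
    """Return True/False for the current setting, or None if it can't be read."""
    try:
        return Path(path).read_text().strip() != "0"
    except OSError:
        return None  # not Linux, or /proc is not available


if __name__ == "__main__":
    state = sack_enabled()
    if state is None:
        print("could not read", TCP_SACK_PATH)
    else:
        print("selective ACK is", "enabled" if state else "disabled")
```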
Thanks go to the folks on the GLUG mailing list. There may also be a mechanism in iptables to enforce no-SACK on a gateway, but the above is good enough for me as I only have three machines on my small LAN that suffer from this.
Speculations:
There are many speculations that can be made, and being human we tend to like the conspiracy theory. Especially in light of technology like that developed by Netronome (check out their SSL Inspector), which uses underlying techniques that would exhibit the same behavior as above should there be a bug in the transparent TCP/IP stack.
I’m not going to say that something is or is not going on here, but I am going to state that it looks fishy. Other, non-conspiracy theories are currently limited to the following:
Firstly, there could be a “security” fix on a router that adjusts the sequence numbers in order to protect operating systems with buggy PRNGs. This comes from a friend who has rather extensive knowledge of the routers and firewall blades often used in high-end networks, but he cautions that he hasn’t seen something like this yet.
The network operator themselves suggested that it “could be a CISCO protocol fixup specific to ssh” … this sounds like a load of bull if you ask me, but I could be wrong.
Then, from the GLUG mailing list, and this makes the most sense of all:
“Besides the NAT box breaking your TCP connections – if your TCP traffic is on a wireless network, then chances are that there is a TCP proxy between the wireless network and the Internet. Traffic from the TCP proxy to the wireless devices use TCP congestion control algorithms that are optimised for a wireless environment to improve the performance, throughput and user experience on the wireless network side. TCP from the proxy to the Internet uses your standard congestion control mechanism. If this TCP proxy can’t handle your connections then it is another point in the path that is a problem.”
The only flaw I can find in that is the adjustment of the sequence numbers, which would not be required for simple congestion control enforcement. That, and the fact that TCP congestion control is generally (and probably to an extent incorrectly so) a function of the endpoints (imagine an internet backbone router having to track all “current” connections over it just to do a bit of congestion control – resources that could have been better spent routing packets).
The fact is that nobody (except a few network administrators at the provider) can say for sure, and that no matter how much evidence points one way or another there is no conclusive proof of anything. The traffic looks very dodgy, and having written TCP splicing code myself I can state that sequence number adjustments are not required, and the only reason (based on background) that I can think of to make that adjustment is to ensure that if a packet does escape the intercept point the connection will be torn down (Once we start intercepting we want to keep on intercepting). The adjustment of the sack-options is something I also missed.
My guess at this point in time is simple: The SACK values aren’t being adjusted (As can be observed by the SACK value being in the , and due to the skewed ISN values the resulting ACKed segment is out of range, my laptop generates an RST because the packet is invalid, the peer sees that as tear-down initiation (oops, I sent to a non-connected port on the peer type of thing), sends back a FIN. This FIN never arrives btw, so I take it the network provider is blocking that as the connection has already been effectively torn down by the RST. This causes a flurry of spurious RST packets, but causes no additional harm. These RSTs are magically appearing though as neither of the tcp end-points is generating them.
That’s all for the moment. More news will hopefully follow.
First post! The easiest way of implementing a proxy is to redirect and accept the connection, and make your own connection with its own sequence numbers. You can usually catch a transparent proxy with tcptraceroute. It might also affect only a few ports.
Yes, that is probably the simplest, however, with a redirect you’re relying on the kernel to essentially do a re-mapping, and the user-space application at this point has no idea about the original destination of the incoming SYN packet.
To further complicate the issue, we don’t really want to let the recipient of the connection know what the proxy IP is either, so whilst your approach is perfectly acceptable for HTTP in the general case, or if the stuff is outbound over NAT as well it’s unacceptable for most other protocols.
Take ssh, whilst it’s easy enough to redirect all port 22 traffic to the local machine, how do you now know where to connect to? Thus the need for devices such as the “network accelerators” produced by Netronome.
You’ll catch the general transparent proxy with tcptraceroute, but not a well-written TCP splice implementation. If properly done it’s pretty much impossible to detect, except if the data is being modified in transit. I can even re-segment TCP segments and you will not be able to detect it by simply looking at the traffic on one end only; you will only be able to detect it by comparing the traffic at the source and the destination.
Think about the advantage for an HTTP transparent proxy – the connection gets “terminated” (or buffered, more accurately) on the proxy, now, if we don’t have the data in the cache, we re-originate the connection to the server as if from the original IP, and whilst splicing the data back to the client we make a copy for caching purposes.
Or at least, that’s the reasoning. Personally I don’t mind transparent proxies on HTTP, however, I don’t like people snooping on my ssh connections. And the symptoms here look a lot like snooping.
As a matter of fact – the transparent proxies may well be why most people don’t observe this problem: they essentially terminate the outbound connection, the OS probably doesn’t do SACK, or it’s located between the anomaly and the towers and thus the traffic between it and the 3G clients isn’t passing through it, and on the other end HTTP is probably passing through a different (dedicated) link. Who knows. Again, only a few network engineers that run the network will be able to tell us. And they haven’t communicated a single word with me yesterday or today.
A GLUG user has pointed out section 4.10 of RFC 2757 (ftp://ftp.ietf.org/rfc/rfc2757). Whilst this would certainly explain the problem, it also doesn’t. The description there describes full termination; the fact that I’m not seeing sequence numbers adjusted in both directions makes me think that this is not what I’m seeing. Either way – this is probably the most logical explanation I’ve seen yet.
The user also states that if this is the case then the TCP proxy (relay?) is rather buggy. The same person also points out that employing IPSec as an encapsulation protocol between end-points would prevent the transparent TCP-proxy from being able to mangle the connections, this should indeed work, however, the deployment of this kind of infrastructure to all my servers would take weeks.
You scared the guy away…