Subject: OpenVPN Server Performance (real experience) and Virtual IP address management
Newsgroups: gmane.network.openvpn.user
Date: 2006-02-21 02:15:09 GMT

Hi:

I thought I'd write to share my experience with a network which I (well, my colleague and I) am putting together, so pull up a chair and sit back for a long read.

Here's some background:

- 400+ remote (client) locations, Linux based systems
- 2 servers as gateways between the core network and the remote tunnels
- The servers are Dell 2800s, each with 1 Xeon 3.0GHz CPU and 1GB of memory
- 3 LAN interfaces: 1 to the core LAN, 1 to a frame relay network and 1 (currently not used) to the big bad Internet
- The servers are running Mandriva Linux 2006 with the x86_64 SMP kernel

All remote sites use pre-shared keys distributed via a separate secure link, and all remotes currently connect to the VPN gateways via the frame relay network. The Internet is not yet used but is coming online soon.

My plan is (was) to operate the 2 servers as main and standby, using VRRP or UCARP to share and manage a virtual IP address. More on this topic later.

When we started implementing the network the population of remotes was low, about 30. On the server we are using the following IP related parameters to keep the remote VPN IP addresses as static as possible:

    server 10.101.0.0 255.255.0.0
    ifconfig-pool-persist /etc/openvpn/ipp.txt
    persist-remote-ip
    ifconfig-pool-linear

Some other server config options:

    daemon
    mssfix
    fragment 1200
    mode server
    passtos
    proto udp
    dev tun
    keepalive 60 300
    comp-lzo
    persist-key
    persist-tun

One server side behavior we didn't expect was that the server would itself update the ipp.txt file every so often. So in the wee hours we would stop the openvpn process, update the ipp.txt file with the static IPs we wanted the new sites to have, and restart the server. It would read the file and life would be good again. Until... last week.
I did the usual: stop the process, update the file and start openvpn. I checked the status file and I could see connections establishing (marked UNDEF) and some established with the assigned IP address from the ipp.txt file. All seemed fine until I came in the next morning and found that I had over 800 UNDEF connections (many duplicate entries) and no tunnels up and running.

So now I'm in PANIC MODE. Restarting the process didn't work, restarting the entire server didn't work, transferring traffic to the backup server didn't work, and I was pretty much at a loss for ideas. The logs kept repeating the same message for every connection attempt from the remotes:

    TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)

So I thought, maybe I should "check your network connectivity". What I saw was a saturated inbound link, which would be all the remotes trying to connect, and virtually no traffic outbound from the server. I looked at the interface using tcpdump and saw the same thing: lots of traffic in from the remotes and only VRRP advertisements going out from my server. There were no TLS responses, which matched what the logs indicated. Something is strange with that.

Next I pulled up top to look at the utilization of the system. What I saw was 99.9% USER utilization, and it was all going to the openvpn process. So now I'm thinking that I'll never get the remote network back up because the openvpn process is so saturated; unless I can block some of the traffic, I have a major outage on my hands.

Luckily, through that "other" secure link I have, I was able to shut down about half of the remote network systems. This allowed the server to catch up with the connection requests. Later we informed the users to power their remote servers back on, and they were also able to reconnect.

Okay, the network is working again and I can breathe now... So what happened?
Well, all I know is that I don't have 400+ remotes installed yet; we are only at about 200, and that was enough to kill this server. I took a look at the specs for the Nokia SSL based VPN server (500s) and they only support 500 clients, but they have a dedicated encryption accelerator in them. I don't have one of those. I suspect that all the key hashing/calculation and verification for each tunnel request is the problem. That's at the TLS layer, right?

So I scoured the mailing list for similar occurrences from other users. Although I didn't look for very long, I didn't see anything except a reference someone made to 10,000+ clients and multi-threading support which may go into the 2.1 version.

This kind of sparked a thought for me, so I got permission to do an experiment with the network over a weekend: break it again and get some more data, now that I have an idea about what is going on and what to look for. I just needed a reliable way of getting the network back up again. Further investigation into the man page revealed a connect-freq option, so this would be one of the items to test.

What I was looking for was:

1) How many connections per second can this server handle?
2) How much of the dual CPU (Hyper-threading) was being used?
3) Would more RAM and/or more CPU help? (The server can accommodate dual Xeon CPUs.)

Now into the investigation we go.

Step 1: take down the server and allow the clients to time out and start sending new connection requests. The first change was to implement the connect-freq option with 2 clients per 1 second. While restarting the openvpn process we watched the CPU usage in detailed mode. What we saw was a little startling: the openvpn process was using 99.9% of only one CPU, while the other was idling. After a few minutes, and many log messages indicating that a new connection would exceed the connection rate, we started to see established tunnels.
Pings through them ranged from 800 ms to 8 seconds while the server was recovering the network, but it did recover, and all 200 sites were connected after 10 minutes, with openvpn CPU usage hovering at about 85%.

So we thought we'd try a little higher rate. We tried 20/sec, 15/sec, 5/sec, 4/sec and 3/sec, but only 2/sec would allow network recovery.

So we've answered two of the questions we wanted to answer. The last was to add the second Xeon CPU, but by this point we knew the answer before dropping the CPU in. When we brought the 4-CPU server (two Xeons with Hyper-threading) back up, we also reset connect-freq back up to 20/sec. Again we saw no successful connections. We dropped the parameter back to 3/sec and still no luck. The openvpn process was still maxed out, and only one of the 4 available CPUs was being used. Resetting connect-freq back to 2/sec allowed the remotes to connect again. Needless to say we pulled the extra CPU out of the server; why waste the money?

So it would seem (and I'm not a programmer) that the openvpn process/application needs to be multi-threaded, or to spawn off processes, to utilize more processors?

Along this line of thought, what about artificially splitting the client load by running two openvpn daemons, each set to a different UDP port bound to the same IP address, and then splitting the remote network in half (via IP subnet) and assigning each subnet half to one of the processes? I have plans to test this theory... maybe next weekend. If anyone has any useful thoughts or comments on this approach, please do share them.

Now, like I promised, that VRRP/UCARP virtual IP thing I was planning to use.

First VRRP: I compiled the 0.4 version, the only one I could find on the net.
I also tried to compile the version in "keepalived", but it seemed that I was missing many libraries, or the config script couldn't find or use the lib64 stuff on the server, so I gave up and went with the VRRP 0.4 version from: https://sourceforge.net/projects/vrrpd/

The biggest problem I'm dealing with (mentally) is how to get my internal (core LAN) routers (redundant with HSRP) to forward return traffic to the gateway which is carrying the active tunnels. So I thought that I'd run VRRP on both the internal and external physical interfaces. I configured two instances of VRRP, one for each of the internal and external NICs. This seems to work, as one server was MASTER and the other BACKUP.

There are problems though:

1) When the servers switch roles the routing table is cleared.
2) openvpn has to be started after the BACKUP takes the MASTER role and manages the virtual IP (VIP) address. Until it is MASTER, the VIP on the backup is not active and openvpn cannot bind to it. This is a problem with just about all the apps I need to bind to the VIP on the BACKUP: they can only be started after the BACKUP becomes MASTER, and then the routes have to be re-applied.
3) Then there's that whole out-of-sync problem: what if the external interface switches but not the internal? Now I have traffic coming into one server and the core routers sending responses back to the other server. I'd need to dynamically re-route the traffic to the other server if this happened.
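
As a side note on the earlier idea of running two openvpn daemons with half of the remote network each: the split of the 10.101.0.0/16 pool can be worked out mechanically. The following is only a minimal sketch in Python, not something that was tested on the servers described here. The function name split_pool is made up for illustration, and the starting port 1194 (OpenVPN's registered default) is an assumption, since the post does not say which UDP ports would be used. It prints the `port` and `server` lines that would differ between the two daemon configs.

```python
import ipaddress


def split_pool(pool: str, first_port: int = 1194):
    """Split one VPN address pool into two equal halves, one per daemon.

    Returns a list of (udp_port, network, netmask) tuples.
    first_port is an assumed value; pick whatever ports suit your site.
    """
    # prefixlen_diff=1 cuts the network into exactly two subnets
    halves = ipaddress.ip_network(pool).subnets(prefixlen_diff=1)
    return [
        (first_port + i, str(net.network_address), str(net.netmask))
        for i, net in enumerate(halves)
    ]


if __name__ == "__main__":
    for port, network, netmask in split_pool("10.101.0.0/16"):
        # One config file per daemon would carry these two lines.
        print(f"port {port}")
        print(f"server {network} {netmask}")
        print()
```

Each daemon would presumably also need its own ifconfig-pool-persist file, and each remote would have to be pointed at the port that serves its half of the address space. Whether this actually spreads the TLS load over both CPUs is exactly what the planned experiment would have to show.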