From: Dale <d.schultz <at> telesat.ca>
Subject: OpenVPN Server Performance (real experience) and Virtual IP address managment
Newsgroups: gmane.network.openvpn.user
Date: Tuesday 21st February 2006 02:15:09 UTC
Hi:
I thought I'd write to share my experience with a network which my
colleague and I are putting together (so pull up a chair and sit back
for a long read).

Here's some background:
- 400+ remote (client) locations, Linux based systems
- 2 servers as gateways between the core network and the remote tunnels
- The servers are Dell 2800 (with 1 Xeon 3.0GHz CPU) and 1GB of memory.
- 3 LAN interfaces: 1 to the core LAN, 1 to a frame relay network and 1
  (currently not used) to the big bad Internet
- The servers are running Mandriva Linux 2006 with the x86_64 SMP kernel

All remote sites use pre-shared keys distributed via a separate secure
link, and all remotes currently connect to the VPN gateways via the frame
relay network.  The Internet is not yet used but is coming online soon.

My plan is (was) to operate the 2 servers as main and standby, using VRRP
or UCARP to share and manage a virtual IP address; more on this topic
later.

When we started implementing the network the population of remotes was
low, about 30.  On the server we are using the following IP-related
parameters to keep the remote VPN IP addresses as static as possible:
server 10.101.0.0 255.255.0.0
ifconfig-pool-persist /etc/openvpn/ipp.txt
persist-remote-ip
ifconfig-pool-linear
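
A note on the ifconfig-pool-persist line: as far as I can tell from the
man page, the directive takes an optional second argument giving the
number of seconds between rewrites of the file, and a value of 0 makes
the server treat the file as read-only.  A minimal sketch, assuming the
same file path as above (the 0 is the only change, and it is not something
we have tested on this network):

```conf
# Optional second argument: seconds between rewrites of the pool file.
# The documented default is 600; 0 means the file is treated as
# read-only, so hand-edited assignments are not overwritten.
ifconfig-pool-persist /etc/openvpn/ipp.txt 0
```

The man page also describes the entries in this file as suggestions
rather than guarantees, and points to ifconfig-push for assignments that
must never change.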

Some other server config options:
daemon
mssfix
fragment 1200
mode server
passtos
proto udp
dev tun
keepalive 60 300
comp-lzo
persist-key
persist-tun

One server-side behavior we didn’t expect was that the server would
rewrite the ipp.txt file every so often.  So in the wee hours we would
stop the openvpn process, update the ipp.txt file with the new static IPs
we wanted the sites to have, and restart the server.  It would read the
file and life would be good again.  Until…last week.  I did the usual:
stop the process, update the file and start openvpn.  I checked the
status file and I could see connections establishing (marked UNDEF) and
some established with the assigned IP address from the ipp.txt file.  All
seemed fine until I came in the next morning and found that I had over
800 UNDEF connections (many duplicate entries) and no tunnels up and
running.

So now I’m in PANIC MODE.  Restarting the process didn’t work, restarting
the entire server didn’t work.  Transferring traffic to the backup server
didn’t work and I was pretty much at a loss for ideas.  The logs kept
repeating the same message for every connection attempt from the remotes:
TLS Error: TLS key negotiation failed to occur within 60 seconds (check
your network connectivity).  So I thought, maybe I should “check your
network connection”.  What I saw was a saturated inbound link, which would
be all the remotes trying to connect, and virtually no traffic outbound
from the server.  I looked at the interface using tcpdump and saw the
same thing: lotsa traffic in from the remotes and only VRRP
advertisements going out from my server, no TLS responses, just as the
logs indicated.  Something is strange with that.  Next I pulled up top to
look at the utilization of the system.  What I saw was 99.9% USER
utilization and it was all going to the openvpn process.

So now I’m thinking that I’ll never get the remote network back up
because the openvpn process is so saturated; unless I can block some of
the traffic I have a major outage on my hands.  Luckily, through that
“other” secure link I have, I was able to shut down about half of the
remote network systems.  This allowed the server to catch up with the
connection requests.  Later we informed the users to power their remote
servers back on and they were also able to reconnect.

Okay, the network is working again and I can breathe now…

So what happened?  Well, all I know is that I don’t have 400+ remotes
installed yet; we are only at about 200 and that was enough to kill this
server.  I took a look at the specs for the Nokia SSL-based VPN server
(500s) and they only support 500 clients, but they have a dedicated
encryption accelerator in them.  I don’t have one of those.  I suspect
that all the key hashing/calculation and verification for each tunnel
request is the problem; that’s at the TLS layer, right?

So I scoured the mailing list for similar occurrences from other users
and, although I didn’t look for very long, I didn’t see anything except a
reference someone made to 10,000+ clients and multi-threading support
which may go into the 2.1 version.  This kinda sparked a thought for me,
so I got permission to do an experiment with the network over a weekend:
break it again and get some more data, now that I have an idea about what
is going on and what to look for.  I just needed a reliable way of
getting the network back up again.  Further investigation into the man
page revealed a connect-freq option.  So this would be one of the items
to test.

What I was looking for was:
1)	How many connections per second can this server handle?
2)	How much of the dual CPU (Hyper-threading) was being used?
3)	Would more RAM and/or more CPU help? (The server can accommodate
dual Xeon CPUs.)

Now into the investigation we go.  Step 1: take down the server and allow
the clients to time out and start sending new connection requests.  The
first change was to implement the connect-freq option with 2 clients per
1 second.  While restarting the openvpn process we watched the CPU usage
in detailed mode.  What we saw was a little startling: the openvpn
process was using 99.9% of only one CPU; the other was idling.  After a
few minutes, and many log messages indicating that a new connection would
exceed the connection rate, we started to see established tunnels.  Pings
through them ranged between 800 msec and 8 seconds while the server was
recovering the network, but it did recover and all 200 sites were
connected after 10 minutes, while openvpn CPU usage was hovering about
85%.

So we thought we’d try a little higher rate.  We tried 20/sec, 15/sec,
5/sec, 4/sec and 3/sec, but only 2/sec would allow network recovery.  So
we’ve answered two of the questions we wanted to answer.

The last was to add the second Xeon CPU, but by this point we knew the
answer before dropping the CPU in.  When we brought the server back up
with 4 CPUs visible (two Xeons with Hyper-threading) we also reset the
connect-freq back up to 20/sec.  Again we saw no successful connections.
We dropped the parameter back to 3/sec and still no luck.  The openvpn
process still maxed out a single CPU; only one of the 4 available was
being used.  Resetting the connect-freq back to 2/sec allowed the remotes
to connect again.  Needless to say we pulled the extra CPU out of the
server; why waste the money?
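
For reference, the rate limit that allowed recovery looks like this in
the server config.  A minimal sketch: only the two numbers come from the
tests described above, and the comments are my reading of the man page:

```conf
# Allow at most 2 new client connections per 1-second window.
# Connection attempts beyond that rate are refused and the clients
# simply retry later.
connect-freq 2 1
```

At 2 connections per second, 200 remotes need at least 100 seconds to get
through the limiter even in the best case; the roughly 10 minutes we
observed is the figure to plan around on this hardware.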

So it would seem (and I’m not a programmer) that the openvpn
process/application needs to be multi-threaded, or to spawn off
processes, to utilize more processors?  Along this line of thought, what
about artificially splitting the client load by running two openvpn
daemons, each set to a different UDP port bound to the same IP address,
and then splitting the remote network in half (via IP subnet) and
assigning each subnet half to one of the processes?  I have plans to test
this theory…maybe next weekend.
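
To make the idea concrete, here is a rough, untested sketch of what the
two daemon configs might look like.  Everything specific in it is an
assumption for illustration: the file names, the port numbers, the
192.0.2.10 placeholder for the shared gateway address, and the choice to
split 10.101.0.0/16 into two /17 halves.  The remaining options (keys,
keepalive, comp-lzo and so on) would stay as listed earlier.

```conf
# --- /etc/openvpn/server-a.conf (hypothetical file name) ---
# Placeholder for the shared gateway address.
local 192.0.2.10
# Assumed port for the first daemon.
port 1194
proto udp
dev tun
# First half of 10.101.0.0/16.
server 10.101.0.0 255.255.128.0
# Separate pool file so the two daemons never write the same file.
ifconfig-pool-persist /etc/openvpn/ipp-a.txt
connect-freq 2 1

# --- /etc/openvpn/server-b.conf (hypothetical file name) ---
local 192.0.2.10
# Assumed port for the second daemon.
port 1195
proto udp
dev tun
# Second half of 10.101.0.0/16.
server 10.101.128.0 255.255.128.0
ifconfig-pool-persist /etc/openvpn/ipp-b.txt
connect-freq 2 1
```

Each half of the remotes would then point its remote line at the matching
port.  Whether two processes actually end up on separate CPUs is exactly
what the test is meant to find out.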

If anyone has any useful thoughts or comments on this approach, please do 
share them.

Now, like I promised, that VRRP/UCARP virtual IP thing I was planning to
use.
First VRRP:
I compiled the 0.4 version, the only one I could find on the net.  I also
tried to compile the version in “keepalived” but it seemed that I was
missing many libraries, or the config script couldn’t find or use the
lib64 stuff on the server, so I gave up and went with the VRRP 0.4
version from:
https://sourceforge.net/projects/vrrpd/

The biggest problem I’m dealing with (mentally) is how to get my internal
(core LAN) routers (redundant with HSRP) to forward return traffic to the
gateway which is carrying the active tunnels.  So I thought that I’d run
VRRP on both the internal and external physical interfaces.  I configured
two instances of VRRP, one for each of the internal and external NICs.
This seems to work, as one server was MASTER and the other BACKUP.  There
are problems though:
1)	When the servers switch roles the routing table is cleared.
2)	openvpn has to be started after the BACKUP takes the MASTER role and
manages the virtual IP (VIP) address; until it is MASTER the VIP on the
backup is not active and openvpn cannot bind to it.  This is a problem
with just about all the apps I need to bind to the VIP on the BACKUP:
they can only be started after the BACKUP becomes MASTER, and then the
routes have to be re-applied.
3)	Then there’s that whole out-of-sync problem: what if the external IF
switches but not the internal?  Now I have traffic coming into one server
and the core routers sending responses back to the other server.  I’d
need to dynamically re-route the traffic to the other server if this
happened :-(
4)	No scripting to run other apps when becoming MASTER or BACKUP.
5)	UCARP seemed more promising in this respect.

UCARP:
This seemed to be the solution to my VRRP scripting woes described above.
It supports scripts for when the VIP goes up and for when it goes down.
So I built the latest version.  But when I tried it out on the server,
openvpn could NOT see ANY connection requests from the external network.
tcpdump showed that the requests were coming in, and to the correct MAC
and VIP address, but the openvpn log was empty.  I could ping the VIP to
death, use SNMP against it, ssh to it.  Every *other* app I bound to the
VIP worked but not openvpn.  Does anyone with experience of this setup
have any assistance to offer?  Right now I’m back to VRRP as at least
openvpn can bind to the VIP it creates.  The switchover is now completely
manual, so there’s little point in using VRRP at all in the present
state.

So this is my long sad tale, send tissues to…   ;-)

Dale



