Accelerated SSL
There are any number of ways to improve the performance of SSL, ranging in difficulty from a simple config change to installing dedicated hardware or desperately trying to get the protocol modified.
This article focuses on accelerating SSL by improving the performance of OpenSSL's hash, bulk cipher and key exchange algorithms. By increasing the amount of data you can transmit and the number of SSL handshakes you can perform, you reduce the overhead of SSL on your current server hardware.
Beyond configuration and implementation improvements, I will also present two additional solutions: adding an SSL accelerator card and modifying OpenSSL to utilize Intel's IPP library.
OpenSSL version
OpenSSL 1.0.0d (or newer) provides a few important performance improvements over the older and more commonly deployed 0.9.8 series:
SSL_MODE_RELEASE_BUFFERS -- if your application supports it (or you patch it to do so), memory utilization per SSL connection is drastically reduced
RSA performance -- significantly increased, particularly on the 32-bit x86 arch
As a result, you should strongly consider upgrading if you haven't already done so. RHEL 6 already provides OpenSSL 1.0.0; older versions ship 0.9.8. However, you will need to take some further steps, outlined below, to correct performance degradations with RC4 and AES on Nehalem and newer processors.
Architecture
Many of the changes presented below depend on the system architecture you use. Only 32-bit x86 and 64-bit x86_64 have been compared; other architectures (SPARC, MIPS, etc.) will have very different performance characteristics.
If possible, you should ensure that your OS, application and OpenSSL are all 64-bit. RSA handshake performance is 2x better on x86_64, and for any high traffic site you will need the larger address space that a 64-bit process provides.
Because of the specialized AES-NI instructions added in Intel Xeon Westmere (56xx series) and newer processors, it is highly desirable to utilize these processors in your hosts serving SSL traffic.
Ciphersuite selection
Based on a previous article, two ciphersuites have been selected:
TLS_RSA_WITH_AES_128_CBC_SHA: RSA + AES-128-CBC + SHA1
TLS_RSA_WITH_RC4_128_SHA: RSA + RC4 + SHA1
You may also choose to permit TLS_RSA_WITH_AES_256_CBC_SHA and TLS_RSA_WITH_RC4_128_MD5. All of the other current TLS ciphersuites are inferior from a performance or security standpoint, or are not sufficiently widely supported to consider now.
These ciphersuites contain three algorithms, each of which needs to be accelerated to improve overall performance:
Hash function (MAC)
Block cipher (bulk encryption)
Key exchange
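To make the three-part structure concrete, here is a minimal sketch that splits a ciphersuite name into its key exchange, bulk cipher and MAC components. The `Ciphersuite` type and the `split_ciphersuite` helper are hypothetical names introduced for illustration, and the parsing rule only assumes the `TLS_<kx>_WITH_<cipher>_<mac>` naming used by the suites listed above.

```python
from typing import NamedTuple


class Ciphersuite(NamedTuple):
    key_exchange: str
    bulk_cipher: str
    mac: str


def split_ciphersuite(name: str) -> Ciphersuite:
    """Split a TLS_<kx>_WITH_<cipher>_<mac> name into its three parts.

    Illustrative only: assumes the naming scheme of the suites in this
    article, where the MAC is the last underscore-separated token.
    """
    prefix, _, rest = name.partition("_WITH_")
    if not prefix.startswith("TLS_") or not rest:
        raise ValueError(f"unexpected ciphersuite name: {name!r}")
    key_exchange = prefix[len("TLS_"):]
    bulk_cipher, _, mac = rest.rpartition("_")
    return Ciphersuite(key_exchange, bulk_cipher, mac)


for suite in ("TLS_RSA_WITH_AES_128_CBC_SHA", "TLS_RSA_WITH_RC4_128_SHA"):
    print(suite, "->", split_ciphersuite(suite))
```

Both selected suites share the same key exchange (RSA) and MAC (SHA), so only the bulk cipher differs between them.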
Hash function
The two ciphersuites we chose both use the same hash function: SHA-1. MD5 and SHA-256 are also available in other ciphersuites; MD5 only for RC4, and SHA-256 is not widely supported yet.
SHA-1
IPP provides an 18% performance improvement, approximately matching the SSL accelerator card with a single CPU core. A single core can hash 3.6 Gbps of data.
Block cipher
Two bulk ciphers were selected: AES, a block cipher, and RC4, which is strictly speaking a stream cipher. Together they should provide sufficient coverage for all legacy and modern clients to successfully negotiate a ciphersuite with the server.
AES
AES is the preferred block cipher from a security perspective. It is also preferable from a performance standpoint if you are using a Westmere or newer CPU.
Moving from OpenSSL 0.9.8o to 1.0.0d results in a 48% decrease in performance on Nehalem and newer processors due to an undiagnosed issue with the x86_64 ASM in OpenSSL that means it is no longer properly optimized for this architecture. The easy fix is to simply prevent the ASM from being built and used when compiling OpenSSL.
IPP is a better solution though, as it enables OpenSSL to utilize the AES-NI instruction set on Westmere processors and yields a 280% increase in performance (above the fixed 1.0.0, for a total of 647%!). Performance is equivalent to the AES-NI OpenSSL engine, with the added benefit of lesser performance gains on older CPUs.
AES-256 has similar performance characteristics, although it is 25% slower.
Using AES-128 and IPP, a single core can encrypt 5.6 Gbps of data.
RC4
While not as secure as AES, RC4 is important for two reasons:
Older clients may not support AES, notably selected browsers on Windows XP
It offers higher performance than AES on pre-Westmere CPUs
You can easily obtain a 32% performance increase by applying this patch, if you are using a Nehalem or newer CPU in 64-bit mode.
If you don't want to apply this patch, another option is to prevent OpenSSL from building and using the ASM version of RC4 in Configure.
It would take 5 CPU cores to equal the performance of the SSL accelerator card, which makes RC4 the only clear win for the accelerator card. It is also the only algorithm tested here for which I did not replace the OpenSSL native logic with IPP functions, although IPP does include Arcfour functions to accelerate this cipher.
A single core can encrypt 3.2 Gbps of data.
Key exchange
While there are a variety of available key exchange algorithms, only RSA is well supported at this time.
RSA
RSA keys come in various sizes, and the key size largely determines the number of handshakes that can be performed per second. A 1024-bit RSA key is considered less secure, but provides the best performance currently available (sizes below 1024 bits should not be considered). Operations on a 2048-bit RSA key are approximately 5 times slower. Key sizes above 2048 bits are not necessary unless you have specialized needs; they will substantially impair performance.
The easiest way to improve RSA performance is to upgrade to OpenSSL 1.0.0, which increases performance by 25%.
The SSL accelerator card yields a 195% increase in performance over a single unaccelerated CPU core; IPP yields a 116% increase over an unaccelerated core. On two or more cores, the IPP version eclipses the SSL accelerator card.
Using RSA-1024 and IPP, a single core can process 4,238 handshakes per second.
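The relative figures above can be turned into rough absolute handshake rates. The sketch below assumes that the 116% and 195% increases are both measured against the same unaccelerated core, and that throughput scales linearly with the number of cores. The baseline and card rates it derives are estimates, not measurements, and `cores_to_beat` is a hypothetical helper.

```python
import math

IPP_CORE = 4238.0   # handshakes/s, RSA-1024, one core with IPP (from the text)
IPP_GAIN = 1.16     # +116% over an unaccelerated core
CARD_GAIN = 1.95    # +195% over an unaccelerated core

# Derived estimates (assumption: both gains share the same baseline).
baseline_core = IPP_CORE / (1 + IPP_GAIN)
card = baseline_core * (1 + CARD_GAIN)


def cores_to_beat(per_core: float, target: float) -> int:
    """Smallest whole number of cores whose combined rate exceeds target."""
    return math.floor(target / per_core) + 1


print(f"unaccelerated core: ~{baseline_core:.0f} handshakes/s")
print(f"accelerator card:   ~{card:.0f} handshakes/s")
print("IPP cores needed to beat the card:", cores_to_beat(IPP_CORE, card))
print("plain cores needed to beat the card:", cores_to_beat(baseline_core, card))
```

Under these assumptions two IPP cores are enough to outpace the card, which is consistent with the statement above, while three unaccelerated cores would be needed.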
Testing methodology
All of the benchmarks in this article were obtained by running openssl speed -elapsed on a quiet host with the scaling_governor set to performance. Unless noted, the results are for a single core's performance (specifically an Intel Xeon X5650, which is a Westmere CPU) using a 64-bit OpenSSL binary. Performance on multiple cores will scale linearly, provided that your application can utilize multiple cores efficiently.
Performance characteristics will be different for other CPUs (such as pre-Nehalem Xeons), as well as 32-bit OpenSSL.
SSL card results were obtained by running openssl speed -elapsed -multi 12. The card used is a Cavium Nitrox, model CN1620-400-NHB.
AES: openssl speed -elapsed -evp aes-128-cbc
RC4: openssl speed -elapsed -evp rc4
SHA-1: openssl speed -elapsed -evp sha1
RSA: openssl speed -elapsed rsa1024
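openssl speed reports cipher and hash throughput in thousands of bytes per second, while the figures in this article are quoted in Gbps. The following sketch converts between the two, assuming decimal units throughout (1 Gbps = 10^9 bits per second). The function names are hypothetical, and the kB/s values it prints are back-calculated from the article's Gbps figures rather than copied from benchmark output.

```python
def kbytes_per_s_to_gbps(kbytes_per_s: float) -> float:
    """Convert 'thousands of bytes per second' to gigabits per second."""
    return kbytes_per_s * 1000 * 8 / 1e9


def gbps_to_kbytes_per_s(gbps: float) -> float:
    """Inverse conversion: gigabits per second to thousands of bytes per second."""
    return gbps * 1e9 / 8 / 1000


# Single-core figures quoted in this article, in Gbps.
quoted = [("SHA-1", 3.6), ("AES-128", 5.6), ("RC4", 3.2)]

for name, gbps in quoted:
    print(f"{name}: {gbps} Gbps ~ {gbps_to_kbytes_per_s(gbps):,.0f} kB/s")
```

This makes it easier to compare a local openssl speed run against the numbers quoted here.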
Acceleration options
The SSL accelerator card used for this benchmark retails for approximately $500. There are a few downsides to using a hardware device:
Must maintain a spare pool for quick replacement
Not general purpose
Constrained supply, may be difficult to procure in some regions
Requires some work to integrate with OpenSSL
May not have a free PCIe x4 slot in some servers
Intel's IPP library is not free software, and requires a developer license at $200 per user. However, you may freely redistribute the product utilizing IPP (e.g. OpenSSL) without royalties, and there is no per-server cost. There are some downsides:
Not free or open source
Requires some work to integrate with OpenSSL
Utilizes the system CPU
Consider that upgrading the system CPU may be cheaper (and more beneficial for other purposes) than a dedicated accelerator card. A hex-core Intel Xeon X5650 is only $300 more than a quad-core Intel Xeon E5640, and the two additional cores will easily eclipse the performance of the SSL accelerator card in every algorithm except RC4.
IPP patch
The IPP cryptography samples include a patch against OpenSSL 0.9.8j, which I have modified to work with OpenSSL 1.0.0d. My changes simply make the patch apply cleanly; the code is otherwise identical to Intel's original patch. That patch is available here.
When building this patched version of OpenSSL, you'll need to keep a few things in mind:
It expects IPP to be installed in /opt/intel/composerxe-2011.3.174/ -- if your IPP is somewhere else, simply modify the path listed in Configure to point to the right place.
You must use the build targets linux-elf-ipp (32-bit) and linux-x86_64-ipp (64-bit), instead of linux-elf and linux-x86_64.
You should still apply the RC4 patch as well, unless you're using a version of OpenSSL that already includes it (none do as of this writing) -- while IPP does have functions to accelerate RC4, they are not utilized in Intel's patch.
Summary
Obtain IPP and the cryptography library
Upgrade to OpenSSL 1.0.0d, patched to utilize IPP
Modify your application (or upgrade to a patched version) so that it sets SSL_MODE_RELEASE_BUFFERS
Only offer AES-128 and RC4 ciphersuites
Deploy on Intel Xeon Westmere and newer processors
OpenSSL: Cipher Selection
OpenSSL supports a number of different SSL ciphersuites, which can have a huge impact on the overhead that SSL imposes on HTTP traffic. Below I approach SSL from the performance side of things.

Sites deploy HTTPS for one of two reasons:

1. Financial or highly sensitive data is involved, there are regulations mandating security policies, and it would be extremely advantageous to someone if they obtained the data you're transferring (even if it takes a year to crack the encryption).
2. To avoid [Firesheep](http://codebutler.com/firesheep) style attacks or for integration with SSL secured services (e.g. Twitter). Nefarious persons might like to acquire the data, but they are probably not going to spend millions of dollars to crack your encryption.

The second use case is far more common. For this type of deployment performance is key, and security is a bit of a secondary concern (which isn't to say you should be insecure). If you fall into the first category, I presume you are already well aware of the procedures required of you.

### Ciphers

The following ciphers are supported in TLSv1:

* [3DES](http://en.wikipedia.org/wiki/Triple-DES) has historically been the preferred cipher for high security HTTPS transactions. It has been slow since the dawn of time, although it has ubiquitous browser support.
* [RC4](http://en.wikipedia.org/wiki/RC4) is considered less secure than 3DES, but is similarly very well supported.
* [AES](http://en.wikipedia.org/wiki/Advanced_Encryption_Standard) is a newer cipher, formerly Rijndael, which won a contest to become the standardized replacement for DES.
* [Camellia](http://en.wikipedia.org/wiki/Camellia_(cipher))
* [SEED](http://en.wikipedia.org/wiki/SEED)

Other ciphers exist, but I don't feel that they're suitable for various reasons.

#### SSLv2

It hopefully goes without saying, but under no circumstances should you allow SSLv2 connections in 2011 -- SSLv3 was standardized in [1996](http://www.mozilla.org/projects/security/pki/nss/ssl/draft302.txt)! SSLv3 and TLSv1 are both acceptable.

## Cipher performance

This test was run with a single 64-bit instance of **openssl speed** on an Intel Xeon E5620. Assembly instructions were disabled for AES and RC4 ciphers per [OpenSSL: Outmoded Assembly](http://zombe.es/post/4059999783/openssl-outmoded-asm).

#### Results

100,000 Kbyte/s is my threshold for acceptable performance. This represents 1 CPU core (of 8 in my case) running at 100% utilization to transfer 780Mbit/s of data (which is a reasonable saturation point for a gigabit Ethernet link). This eliminates Camellia-256, SEED, and 3DES.

RC4 is the fastest cipher, if you are using a processor which does not support [AESNI](http://en.wikipedia.org/wiki/AES_instruction_set).

AES-128 is the next fastest cipher, and much faster than RC4 if you have AESNI support. It's about 54% slower if you don't.

AES-256 is slower still, and unless explicitly configured otherwise, any browser that supports AES-128 will also support AES-256.

Camellia-128 is a tad slower still. It's quite unlikely that any browser would **only** support Camellia so it is currently not essential.

**NOTE:** openssl-0.9.x has different cipher performance: Camellia-128 is 4x slower, and RC4 is slower in very old 0.9.x releases. For this and other reasons, you should be using openssl-1.0.x.

### Cipher negotiation

By default, OpenSSL negotiates the cipher to use like this:

1. The user's browser sends its cipherlist.
2. This is compared with the server's cipherlist[^1].
3. The first cipher in the user's list that the server supports is utilized for the connection.

The downside here is that it places the end user in control. From a performance standpoint, we want to be calling the shots here. This behavior can be changed by passing the flag **SSL_OP_CIPHER_SERVER_PREFERENCE** to [SSL_CTX_set_options](http://www.openssl.org/docs/ssl/SSL_CTX_set_options.html#NOTES) in your application. Apache's mod_ssl supports it as the option _SSLHonorCipherOrder_.

This changes the behavior to prefer the first shared cipher using the order specified in the server's cipherlist. For example:

    client: 3des, aes-256-cbc, aes-128-cbc, rc4
    server: aes-128-cbc, aes-256-cbc, rc4, 3des

Without this flag, **3des** is selected. With it set, **aes-128-cbc** is selected. That client connection might use 97.5% less CPU to encrypt the data stream as a result.

#### Cipher ordering

For servers with AESNI (Intel Westmere, AMD Bulldozer, and some others), I recommend:

>AES-128, RC4, AES-256, Camellia-128

For servers without AESNI (or without an OpenSSL patched to support AESNI):

>RC4, AES-128, AES-256, Camellia-128

In order to do this, you'll need to set your cipher string in your application's configuration. You can see what ciphers will be matched by a string with **openssl ciphers -v 'STRING'**, e.g.:

>openssl ciphers -v 'AES128-SHA:RC4:AES:CAMELLIA128-SHA:!ADH:!aNULL:!DH:!EDH:!eNULL:!LOW:!SSLv2:!EXP:!NULL'

will select AES-128, RC4, AES-256, Camellia-128 in that order.

    AES128-SHA       SSLv3 Kx=RSA Au=RSA Enc=AES(128)      Mac=SHA1
    RC4-SHA          SSLv3 Kx=RSA Au=RSA Enc=RC4(128)      Mac=SHA1
    RC4-MD5          SSLv3 Kx=RSA Au=RSA Enc=RC4(128)      Mac=MD5
    AES256-SHA       SSLv3 Kx=RSA Au=RSA Enc=AES(256)      Mac=SHA1
    CAMELLIA128-SHA  SSLv3 Kx=RSA Au=RSA Enc=Camellia(128) Mac=SHA1

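The difference between client-ordered and server-ordered selection, as described under Cipher negotiation, can be illustrated with a small simulation. This is only a sketch of the selection rule, not OpenSSL's actual implementation; the `negotiate` function and its `server_preference` parameter are hypothetical names that stand in for the effect of SSL_OP_CIPHER_SERVER_PREFERENCE.

```python
from typing import Optional, Sequence


def negotiate(client: Sequence[str], server: Sequence[str],
              server_preference: bool = False) -> Optional[str]:
    """Return the first shared cipher, walking either the client's or
    the server's list in order. Returns None if nothing is shared."""
    if server_preference:
        ordered, other = server, set(client)
    else:
        ordered, other = client, set(server)
    for cipher in ordered:
        if cipher in other:
            return cipher
    return None


# The example lists from the Cipher negotiation section.
client = ["3des", "aes-256-cbc", "aes-128-cbc", "rc4"]
server = ["aes-128-cbc", "aes-256-cbc", "rc4", "3des"]

print(negotiate(client, server))                          # 3des
print(negotiate(client, server, server_preference=True))  # aes-128-cbc
```

With server preference enabled, the order of the server's cipher string is what decides the outcome, which is why the ordering recommendations above matter.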
The cipher string can be configured in Apache with _SSLCipherSuite_.

### How widely is AES supported?

All modern browsers[^2] support AES, however there is one big caveat: any browser which uses the system SSL services on Windows XP or older does not support AES. This includes:

* Chrome
* Internet Explorer
* Safari

but not Firefox or Opera. Windows Vista and newer do support AES at the OS level.

This is a big motivator to continue permitting RC4, because the only other cipher in common use that these browsers are likely to support is 3DES.

### Why disable 3DES?

You may be wondering: why not just permit 3DES as a fallback cipher at the end of my cipher string? Consider the following potential DoS attack:

1. A large botnet of (unknowingly) controlled PCs is configured to permit **only** 3DES
2. These users continue using your service as they normally would, and nothing appears out of the ordinary to them (and there is no compromise in over-the-wire security)
3. Your servers are now processing this traffic 95% slower than normal

How would you handle a sudden need for 2x the processing power to serve the same amount of traffic? While this scenario is unlikely to occur, it does highlight the extreme potential for denial or degradation of service simply as a result of the processing overhead of 3DES vs. any other cipher.

With that said, if you know that you have clients that cannot support any cipher but 3DES or if RC4 is not secure enough for your needs, then you should add it to your cipher list.

### SSL session resume

If you have a VIP with Nehalem and Westmere servers behind it where any of the hosts might answer an HTTPS request to the same public IP, you must have a consistent SSL cipher list across all of them. This is somewhat mitigated by the standard practice of assigning SSL connections in a sticky fashion on load balancers, but ideally you would have eliminated stickiness by deploying a shared SSL session cache between all of the hosts.

Clients may open a new HTTPS connection (SSL session resume) which re-uses the previously negotiated session options, including the cipher. If all of the servers can resume the session (because they hold the original session id and data in cache), then you run the risk of a client who negotiated on a Westmere host and thus agreed to AES-128 being sent to a Nehalem host for its next connection where it should have agreed to RC4. Either prefer RC4 across all hosts behind this VIP, or divide your hosts between multiple public facing IPs.

## Conclusion

1. Use equipment that supports AESNI (or prefer RC4)
2. Enforce server side cipher preference
3. Pare down your cipher list, eliminate extremely slow ciphers like 3DES and insecure ciphers like export versions
4. Use openssl-1.0.x, in 64-bit mode

[^1]: openssl-1.0.0d default: AES-256, Camellia-256, 3DES, AES-128, SEED, Camellia-128, RC2, RC4, DES, various export ciphers
[^2]: Check with https://www.fortify.net/cgi/ssl_2.pl