OpenSSL 3.x performance issues
Open, High, Public

Description

Upgrading to bookworm implies moving from OpenSSL 1.1.x to OpenSSL 3.x. HAProxy upstream is quite vocal about OpenSSL 3.x performance, as can be seen at https://github.com/haproxy/wiki/wiki/SSL-Libraries-Support-Status#openssl and https://github.com/openssl/openssl/issues/17627#issuecomment-1060123659.

We need to take a deeper look at these issues before upgrading cp servers to bookworm.

Event Timeline

JFTR, if these turn out to be relevant for our use of haproxy as well, we also have the option to move to a dual library approach. I already did this in the past with Debian jessie, which only had openssl 1.0 while we needed 1.1 for our TLS termination (done with nginx at the time). Essentially, a co-installable backport was created that provides a special libssl1.1-dev (which haproxy would then build against instead of the regular libssl-dev).

The perf issues are definitely relevant for traffic's use of haproxy (in a couple of different roles). Your option (making a libssl1.1-dev for bookworm that tracks the sec fixes that are still done for the bullseye case, and packaging our haproxy to build against it) would be the easiest path from our POV, for these cases.

More broadly though, I'm worried this is gonna bite us in a lot of other "minor" cases where we only notice it when we're under some cachebusting attack that increases internal connections to various $applayers that may terminate TLS using OpenSSL 3.0 as they migrate to bookworm, too. Apparently it affects client sides even worse than it does server sides, too :/, and we can't realistically relink/rebuild all such cases against 1.1. At least envoyproxy for mediawiki, which is one of the more-important cases, links boringssl instead and is immune.

Almost anything relevant internally uses envoy to mediate TLS both client and server side, so it's probably useful to list the oddballs.

Off the top of my head, I am sure changeprop doesn't use envoy, and I'm not sure the flink pipelines do, either. And ofc there's kafka, and the databases, and cross-dc mcrouter, which are going to be a potential issue.

I think ms-* swift will fall foul of this too, via the wmf-rewrite middleware (which is using python's urllib.request.build_opener to talk to e.g. thumbor.svc.codfw.wmnet:8800 ) [I'm not 100% sure, that might be http rather than https?]

Yep, HTTP at the moment (but we'd like to fix that soon).
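
For Python-based clients like the wmf-rewrite middleware mentioned above, a quick way to see whether the interpreter would pick up OpenSSL 3.x is to ask the ssl module which library it is linked against. This is only a minimal sketch: the helper name uses_openssl3 is made up for illustration, and it says nothing about whether a given service actually talks TLS.

```python
import ssl


def uses_openssl3(version_info=ssl.OPENSSL_VERSION_INFO) -> bool:
    """Return True if the given OpenSSL version tuple is 3.x or later.

    Defaults to the library this Python interpreter is linked against.
    (uses_openssl3 is a hypothetical helper name, not an existing API.)
    """
    return version_info[0] >= 3


if __name__ == "__main__":
    # Human-readable version string of the linked TLS library.
    print(ssl.OPENSSL_VERSION)
    print("OpenSSL 3.x or later:", uses_openssl3())
```

Running this on a bullseye and a bookworm host should make it obvious which Python services change libraries with the OS upgrade.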

Mentioned in SAL (#wikimedia-operations) [2023-12-06T14:21:03Z] <fabfur> repooling cp4052 after reimage (bookworm -> bullseye) due to possible impacting T352744

FYI, Elasticsearch doesn't use envoy either. The Flink pipelines @Joe mentioned are all in k8s, and wdqs (consumer of flink) uses envoy, so I think we're OK there?

Recent flink-app based deployments should use envoy. Not sure about the older rdf-streaming-updater, but there are plans to move that to flink-app chart.

I'll prepare the respective OpenSSL 1.1 forward ports. I'm optimistic I'll have something ready before the holiday break. Given haproxy's importance for our DDoS resiliency this seems like the least risky way forward while we wait for further upstream work in OpenSSL 3.x.

I'm wondering though if we reproduced this with the pilot bookworm cp installation? It would be useful to have some data so that we can also revisit this with OpenSSL 3.1/3.2 or some future OpenSSL release. There's no bug report in Debian about this, even though haproxy is an immensely popular package in Debian and the haproxy.debian.net repository also defaults to just using Bookworm's OpenSSL 3.0 for the Bookworm packages. As such, the upstream statement "OpenSSL 3.x is basically only usable for personal sites" seems a little exaggerated as a general statement.

I'm wondering though if we reproduced this with the pilot bookworm cp installation?

The pilot cp bookworm installation on cp4052 (upload@ulsfo) didn't experience the issues described in the OpenSSL GH issue. IMHO that's kinda expected, considering the traffic patterns in that cluster/DC and the nature of the performance issue.
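
On the point about having some data to compare across OpenSSL releases: one very rough client-side probe is to time how long it takes to build a default TLS client context, which includes loading the system CA store. This is a sketch under stated assumptions, not a reproduction of the upstream report: it runs in a single thread, so it does not exercise the locking contention described in the OpenSSL GH issue, and the function name and iteration count are arbitrary choices made for illustration.

```python
import ssl
import time


def time_context_creation(iterations: int = 20) -> float:
    """Return the mean time in seconds to build a default client SSLContext.

    Hypothetical helper for illustration; single-threaded, so it only
    measures context setup cost (including CA store loading).
    """
    if iterations < 1:
        raise ValueError("iterations must be >= 1")
    start = time.perf_counter()
    for _ in range(iterations):
        ssl.create_default_context()
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    mean = time_context_creation()
    print(f"{ssl.OPENSSL_VERSION}: {mean * 1000:.3f} ms per context")
```

Comparing the output from a bullseye host and a bookworm host would give one data point per library version; actual haproxy behaviour under load needs a proper multi-threaded test.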

HAProxy 2.9 has been released, introducing AWS-LC support, with some interesting mentions of OpenSSL in its release notes:

Integration with other components

  • QUIC: a limited compatibility layer allowing to use OpenSSL despite its lack of QUIC support was implemented and backported to 2.8.4. It does not support 0-RTT and I think everyone agrees that we should not have to hack around this. But for a while, users didn't have the choice but to use OpenSSL, so at least these ones can have some QUIC support now. The best solution of course, is to get rid of OpenSSL which is now the last SSL stack not supporting QUIC, and with horrible performance since 3.x.
  • Speaking of getting rid of OpenSSL, new serious contenders are now available, that anyone not afraid of making their own packages, who is concerned about performance so as not to pay 10 vCPUs when only 2 should be needed, and who doesn't depend on OpenSSL-centric features, should really evaluate. Don't get me wrong, we don't have much feedback yet on these options, so there may still be some rough edges, but sometimes energy savings and cost cutting can deserve living on the bleeding edge. The first option, wolfSSL, continued to make progress and their version 5.6.4 integrates pretty well with HAProxy now. Please don't use any older version. The second option, AWS-LC, is AWS's libcrypto. It's between BoringSSL and OpenSSL, is more similar to OpenSSL than wolfSSL but is lacking certain algos used by QUIC. It is particularly fast on ARM machines such as Graviton3 instances (AWS c7gn etc), where it can even be 15% faster than wolfSSL on RSA! The support status of such alternatives is regularly updated on the HAProxy wiki page at:

    https://github.com/haproxy/wiki/wiki/SSL-Libraries-Support-Status

wolfssl is packaged in Debian, so that may be a possible option longer term, https://tracker.debian.org/pkg/wolfssl.

wolfssl isn't fully supported in Debian, though: it's basically only included for some applications stuck with GPLv2 where specific concerns with OpenSSL exist (for some upstreams which don't want to rely on the "system library clause" of the GPL). It's not something we can comfortably use haproxy with IMO.

I created a PoC forward port of OpenSSL 1.1.1w which is co-installable with the OpenSSL packages from Bookworm. The following binary packages are built:

dpkg-deb: building package 'libssl1.1-dbgsym' in '../libssl1.1-dbgsym_1.1.1w-0+deb11u1+wmf1_amd64.deb'.
dpkg-deb: building package 'libssl1.1' in '../libssl1.1_1.1.1w-0+deb11u1+wmf1_amd64.deb'.
dpkg-deb: building package 'libssl11-dev' in '../libssl11-dev_1.1.1w-0+deb11u1+wmf1_amd64.deb'.

You can grab it from build2001 in /var/cache/pbuilder/result/bookworm-amd64/*1.1.1w*

IOW, when you build haproxy against it, we only need to change the build dep from libssl-dev to libssl11-dev.

If that updated package works fine with haproxy on bookworm, as a next step I'd properly import it to Gitlab (and will rebase it to all future OpenSSL 1.1 updates in Bullseye going forward).
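
To check that a rebuilt haproxy really links against the forward-ported library rather than Bookworm's OpenSSL 3.0, one option is to look at the libssl soname in the ldd output (libssl.so.1.1 versus libssl.so.3). The sketch below is an assumption-laden illustration: the helper names are made up, and the binary path in the usage note is the usual Debian location, not something confirmed in this task.

```python
import re
import subprocess

# Matches sonames such as "libssl.so.1.1" or "libssl.so.3".
_LIBSSL_RE = re.compile(r"\b(libssl\.so\.[0-9.]+)")


def libssl_sonames(ldd_output: str) -> list:
    """Extract the distinct libssl sonames listed in ldd output."""
    found = []
    for line in ldd_output.splitlines():
        # Only inspect the left-hand side of "soname => resolved path".
        match = _LIBSSL_RE.search(line.split("=>")[0])
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


def ldd(binary: str) -> str:
    """Run ldd on a binary and return its output (hypothetical helper)."""
    result = subprocess.run(
        ["ldd", binary], check=True, capture_output=True, text=True
    )
    return result.stdout
```

For example, libssl_sonames(ldd("/usr/sbin/haproxy")) would be expected to list libssl.so.1.1 for a build against the forward port, assuming the binary lives at that path.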

Change 1002580 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] package_builder: add hook for building HAProxy 2.6 component

https://gerrit.wikimedia.org/r/1002580

Change 1002580 merged by Ssingh:

[operations/puppet@production] package_builder: add hook for building HAProxy 2.6 component

https://gerrit.wikimedia.org/r/1002580

Mentioned in SAL (#wikimedia-operations) [2024-02-13T14:33:26Z] <moritzm> imported openssl 1.1.1w-0+deb11u1+wmf1 to component/haproxy26 T352744

Mentioned in SAL (#wikimedia-operations) [2024-02-14T09:49:12Z] <moritzm> imported openssl11 1.1.1w-0+deb11u1+wmf1 to component/haproxy26 T352744

Mentioned in SAL (#wikimedia-operations) [2024-02-15T15:24:53Z] <moritzm> imported openssl11 1.1.1w-0+deb11u1+wmf2 to component/haproxy26 T352744 (with fix for libssl11-dev file contents)

Change 1004126 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:cache::haproxy: add boolean to install component/haproxy26

https://gerrit.wikimedia.org/r/1004126

Change 1004128 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp4052: install haproxy 2.6 from component/haproxy

https://gerrit.wikimedia.org/r/1004128

10:01:40 < sukhe> !log reprepro -C component/haproxy26 include bookworm-wikimedia haproxy_2.6.16-1~bpo11+2_source.changes: T352744

Change 1004126 merged by Ssingh:

[operations/puppet@production] P:cache::haproxy: add boolean to install component/haproxy26

https://gerrit.wikimedia.org/r/1004126

Thanks to @MoritzMuehlenhoff, we have imported the forward port of OpenSSL 1.1.1 and have built haproxy 2.6 against it. We will be reimaging a cp host to bookworm.

Sharing in case others also need this: if you need to build something against openssl11 1.1.1w-0+deb11u1, please pass WIKIMEDIA=yes and haproxy26=YES during the build to pick up the haproxy26 component.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm executed with errors:

  • cp4052 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4052.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm completed:

  • cp4052 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402201657_sukhe_2010679_cp4052.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 1004128 merged by Ssingh:

[operations/puppet@production] cp4052: install haproxy 2.6 from component/haproxy

https://gerrit.wikimedia.org/r/1004128

Mentioned in SAL (#wikimedia-operations) [2024-02-20T18:22:33Z] <sukhe> reprepro -C component/haproxy26 include bookworm-wikimedia haproxy_2.6.16-1~bpo12+1_amd64.changes: T352744

Mentioned in SAL (#wikimedia-operations) [2024-02-20T18:31:10Z] <sukhe> pool cp4052: bookworm cp host with haproxy 2.6 built against OpenSSL 1.1.1: T352744