haproxy: work on systemd unit hardening (cp hosts)
Closed, Resolved (Public)

Description

Similar to the existing systemd unit hardening of varnish and ATS services, we should also harden the haproxy service, at least starting with the unit used by the cp hosts.

Currently, systemd-analyze security haproxy.service rates the unit as UNSAFE, and we should improve that incrementally.
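A minimal sketch of how that rating can be checked, assuming a host with systemd. The helper name overall_line is hypothetical; it only filters the report down to its summary line:

```shell
# Keep only the summary line of `systemd-analyze security` output (read on stdin).
overall_line() {
    grep 'Overall exposure level'
}

# On a host with systemd, the full check would be:
#   systemd-analyze security --no-pager haproxy.service | overall_line
```

Running the command without the filter lists each sandboxing directive with its individual exposure contribution, which helps when deciding what to tighten next.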

Event Timeline

ssingh triaged this task as Medium priority. Nov 28 2022, 6:57 PM

Change 861445 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:cache::haproxy: harden systemd unit

https://gerrit.wikimedia.org/r/861445

Change 861445 merged by Ssingh:

[operations/puppet@production] P:cache::haproxy: harden systemd unit

https://gerrit.wikimedia.org/r/861445

Change 863332 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: enable haproxy systemd hardening on cp4045

https://gerrit.wikimedia.org/r/863332

We have enabled the hardened haproxy unit on traffic-cache-bullseye.traffic.eqiad1.wikimedia.cloud to start with, before rolling it out to the production cp hosts.

The hardened haproxy unit has been running for a while on traffic-cache-bullseye.traffic.eqiad1.wikimedia.cloud without any issues. Barring further comments or issues, we will deploy the changes to a depooled and then a pooled ulsfo host in the week of 16 Jan.

Looks like this still isn't rolled out, based on my check of a random cp node. Do we still intend to roll this out?

BCornwall changed the task status from Open to Stalled. Feb 21 2023, 11:15 PM

Mentioned in SAL (#wikimedia-operations) [2023-02-27T08:54:57Z] <vgutierrez> test haproxy hardening in cp4045 - T323944

Change 863332 merged by Vgutierrez:

[operations/puppet@production] hiera: enable haproxy systemd hardening on cp4045

https://gerrit.wikimedia.org/r/863332

Vgutierrez changed the task status from Stalled to In Progress. Feb 27 2023, 3:37 PM
Vgutierrez added a subscriber: Vgutierrez.

Yes, it's currently running on cp4045, and I'm planning to extend the experiment to ulsfo tomorrow morning (EU time).

Change 892484 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable haproxy systemd hardening in ulsfo

https://gerrit.wikimedia.org/r/892484

Change 892484 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable haproxy systemd hardening in ulsfo

https://gerrit.wikimedia.org/r/892484

Mentioned in SAL (#wikimedia-operations) [2023-02-28T08:43:51Z] <vgutierrez> enable system hardening for haproxy in ulsfo - T323944

Change 894484 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Disable HAProxy systemd hardening in ulsfo

https://gerrit.wikimedia.org/r/894484

Change 894484 merged by Vgutierrez:

[operations/puppet@production] hiera: Disable HAProxy systemd hardening in ulsfo

https://gerrit.wikimedia.org/r/894484

Mentioned in SAL (#wikimedia-operations) [2023-03-06T09:02:08Z] <vgutierrez> disabling haproxy systemd service unit hardening in ulsfo - T323944

I've disabled the systemd hardening after confirming issues in ulsfo:

vgutierrez@cp4041:~$ ps auxww |grep haproxy |wc -l
49

HAProxy is unable to terminate old processes properly with the systemd service unit hardening in place. These old processes keep handling incoming connections, and as time passes their OCSP response data expires, which triggers the flapping Icinga alerts.
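Note that the ps | grep | wc pipeline above typically also counts the grep process itself. A sketch of a stricter count, assuming a procps-style ps; the helper name count_haproxy is hypothetical:

```shell
# Count processes whose command name is exactly "haproxy".
# Expects one command name per line on stdin, as printed by `ps -eo comm=`.
count_haproxy() {
    grep -c -x 'haproxy'
}

# On a cache host:
#   ps -eo comm= | count_haproxy
```

A count that keeps growing across reloads would point to the same symptom described here: old processes that were never terminated.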

Change 894544 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache::haproxy: Grant CAP_KILL on hardened mode

https://gerrit.wikimedia.org/r/894544

Change 894544 merged by Vgutierrez:

[operations/puppet@production] cache::haproxy: Grant CAP_KILL on hardened mode

https://gerrit.wikimedia.org/r/894544

Change 894545 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable HAProxy systemd hardening in cp4045

https://gerrit.wikimedia.org/r/894545

Change 894545 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable HAProxy systemd hardening in cp4045

https://gerrit.wikimedia.org/r/894545

Mentioned in SAL (#wikimedia-operations) [2023-03-06T10:29:32Z] <vgutierrez> enable haproxy systemd service unit hardening in cp4045 - T323944

@ssingh this could be as easy to fix as granting CAP_KILL; I'm currently testing that on cp4045.
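For illustration only: the actual change was made in Puppet (change 894544) and its contents are not shown in this task. The sketch below assumes the hardened unit restricts capabilities with CapabilityBoundingSet=; the function name, drop-in path, and file name are hypothetical. One plausible reading of the failure is that HAProxy needs CAP_KILL to signal old processes running under a different user.

```shell
# Write a drop-in that adds CAP_KILL to the unit's capability bounding set.
# $1 is the drop-in directory; the default below is an assumed location.
write_cap_kill_dropin() {
    dir="${1:-/etc/systemd/system/haproxy.service.d}"
    mkdir -p "$dir"
    cat > "$dir/cap-kill.conf" <<'EOF'
[Service]
# Repeated CapabilityBoundingSet= lines are merged, so this adds CAP_KILL
# to whatever set the hardened unit already defines.
CapabilityBoundingSet=CAP_KILL
EOF
}

# After writing the drop-in on a real host:
#   systemctl daemon-reload && systemctl restart haproxy.service
#   systemctl show -p CapabilityBoundingSet haproxy.service
```

Using a drop-in rather than editing the unit keeps the rest of the hardening untouched, which matches the incremental approach taken in this task.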

Change 894961 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable HAProxy systemd hardening in cp4044

https://gerrit.wikimedia.org/r/894961

Change 894961 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable HAProxy systemd hardening in cp4044

https://gerrit.wikimedia.org/r/894961

Mentioned in SAL (#wikimedia-operations) [2023-03-07T07:34:00Z] <vgutierrez> enable haproxy systemd service unit hardening in cp4044 - T323944

Change 895692 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable HAProxy systemd hardening in ulsfo

https://gerrit.wikimedia.org/r/895692

Change 895692 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable HAProxy systemd hardening in ulsfo

https://gerrit.wikimedia.org/r/895692

Mentioned in SAL (#wikimedia-operations) [2023-03-08T08:50:03Z] <vgutierrez> re-enable HAProxy systemd service unit hardening in ulsfo - T323944

Change 897803 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable haproxy hardening globally for cp hosts

https://gerrit.wikimedia.org/r/897803

Change 897803 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable haproxy hardening globally for cp hosts

https://gerrit.wikimedia.org/r/897803

Mentioned in SAL (#wikimedia-operations) [2023-03-13T09:55:04Z] <vgutierrez> Enable haproxy hardening in cp hosts globally - T323944

Vgutierrez assigned this task to ssingh.

Thanks to @Vgutierrez for taking care of the rollout. For posterity, here is the current result, before we make further enhancements:

===== NODE GROUP =====
(96) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT of 'systemd-analyze ...e | grep Overall' -----
→ Overall exposure level for haproxy.service: 3.5 OK 🙂
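To track the score across further enhancements, the number can be pulled out of that summary line. A small sketch, where exposure_score is a hypothetical helper that prints the field following the "haproxy.service:" label:

```shell
# Print the numeric exposure score from an "Overall exposure level" line (stdin).
exposure_score() {
    awk '/Overall exposure level/ {
        # The score is the field right after the one ending in ":".
        for (i = 1; i < NF; i++)
            if ($i ~ /:$/) { print $(i + 1); exit }
    }'
}

# On a host with systemd:
#   systemd-analyze security --no-pager haproxy.service | exposure_score
```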