
Phabricator downtime due to aphlict and websockets (aphlict currently disabled)
Closed, Resolved (Public)

Description

On Friday November 15th we had a short Phabricator downtime.

It was related to the aphlict service (the realtime notification server), which uses websockets (T112765) behind the cache layer (formerly cache_misc, T134870).

The service tied up all php-fpm child processes.

Disabling the aphlict service (and puppet) was the immediate fix and brought Phabricator back to normal.

For now aphlict is disabled. Users will still get normal notifications but no realtime pop-up notifications.

There is a theory this is related to recent changes on the caching (ATS) side because we did not have this problem before and it's not a new setup.

Details

Project            Branch      Lines +/-   Subject
operations/puppet  production  +12 -2
operations/puppet  production  +2 -2
operations/puppet  production  +1 -0
operations/puppet  production  +1 -1
operations/puppet  production  +6 -8
operations/puppet  production  +21 -20
operations/puppet  production  +6 -8
operations/puppet  production  +46 -6
operations/dns     master      +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +15 -2
operations/puppet  production  +12 -7
operations/puppet  production  +13 -4
operations/puppet  production  +4 -0
operations/puppet  production  +1 -1
operations/puppet  production  +2 -1
labs/private       master      +3 -0
operations/puppet  production  +23 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -1
operations/puppet  production  +2 -0
operations/puppet  production  +1 -0
operations/puppet  production  +0 -1
operations/puppet  production  +28 -8
operations/puppet  production  +11 -4
operations/puppet  production  +44 -6
operations/puppet  production  +2 -0
operations/puppet  production  +12 -12
operations/puppet  production  +129 -39
operations/puppet  production  +1 -1
operations/puppet  production  +0 -21
operations/puppet  production  +0 -15
operations/puppet  production  +8 -6
operations/puppet  production  +4 -0
operations/puppet  production  +6 -2
operations/puppet  production  +11 -1
operations/puppet  production  +1 -1

Event Timeline


questions:

why are there 2 YAML files for Apache Traffic Server?

  • backend.yaml 22280
  • tls.yaml
  • what is port 3120?
  • protocol is wss://
  • acme cert

Change 569104 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ATS: directly talk wss:// to aphlict

https://gerrit.wikimedia.org/r/569104

Aklapper removed a subscriber: GerritBot.Mar 22 2020, 9:32 PM

Change 586338 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: re-enable aphlict service

https://gerrit.wikimedia.org/r/586338

Change 586338 merged by Dzahn:
[operations/puppet@production] phabricator: re-enable aphlict service

https://gerrit.wikimedia.org/r/586338

Change 586461 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/puppet@production] ATS/phabricator: configure aphlict certificate

https://gerrit.wikimedia.org/r/586461

Dzahn added a comment.Apr 7 2020, 9:25 AM

The aphlict service has been re-enabled on phab1001.

The plan is to have ATS (the caching layer) talk directly to the nodejs (aphlict) service using wss:// (TLS) on port 22280, without going through Apache first.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/569104

For this we can use the existing certificate created for envoy. As an alternative, we can go via envoy for TLS termination and from there to nodejs, which also keeps Apache (httpd) out of the mix. That matters because the Apache module caused the issue that took down all of Phabricator.

The existing certificate has these SANs on it:

alt_names: ["phabricator.discovery.wmnet","phabricator.svc.eqiad.wmnet","phabricator.svc.codfw.wmnet","phabricator.wikimedia.org","phab.wmfusercontent.org","bugs.wikimedia.org","bugzilla.wikimedia.org","git.wikimedia.org"]

and exists in this location:

certificate_chain: { filename: "/etc/ssl/localcerts/phabricator.discovery.wmnet.crt" }
private_key: { filename: "/etc/ssl/private/phabricator.discovery.wmnet.key" }

If we can tell aphlict to find it in this path we should be good to go without further action.
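
Whether an existing certificate can be reused for a given endpoint depends on the endpoint's name appearing in the SAN list. As a minimal sketch, assuming a simplified matcher (exact match plus a single leftmost-label wildcard, not a full RFC 6125 implementation), the check against the list quoted above could look like this. The helper name `san_covers` and the second test hostname are hypothetical.

```python
# SAN list copied from the certificate description above.
ALT_NAMES = [
    "phabricator.discovery.wmnet",
    "phabricator.svc.eqiad.wmnet",
    "phabricator.svc.codfw.wmnet",
    "phabricator.wikimedia.org",
    "phab.wmfusercontent.org",
    "bugs.wikimedia.org",
    "bugzilla.wikimedia.org",
    "git.wikimedia.org",
]

def san_covers(hostname, alt_names):
    """Return True if hostname matches any SAN entry (simplified matching)."""
    host = hostname.lower().rstrip(".")
    for san in alt_names:
        san = san.lower()
        if san == host:
            return True
        if san.startswith("*."):
            # A wildcard covers exactly one leftmost label.
            _, _, rest = host.partition(".")
            if rest and rest == san[2:]:
                return True
    return False

print(san_covers("phabricator.wikimedia.org", ALT_NAMES))
print(san_covers("some-other-name.wmnet", ALT_NAMES))  # hypothetical name
```

A name that is not in the list would need either a new certificate or an updated SAN list before clients connecting under that name could validate it.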

Change 586461 merged by Dzahn:
[operations/puppet@production] ATS/phabricator: configure aphlict certificate

https://gerrit.wikimedia.org/r/586461

Change 587224 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/puppet@production] ATS/phabricator: enable aphlict certificate in hiera.

https://gerrit.wikimedia.org/r/587224

mmodell added a comment.EditedApr 7 2020, 10:28 AM

So we have just one last remaining issue to deal with:

Unable to open file ("/etc/ssl/private/phabricator.discovery.wmnet.key"). Check that permissions are set correctly.

Error: EACCES: permission denied, open '/etc/ssl/private/phabricator.discovery.wmnet.key'

That file is owned by user root and group envoy, and is readable only by them:

$ ls -la /etc/ssl/private/phabricator.discovery.wmnet.key
-r--r----- 1 root envoy 227 Dec  5 02:58 /etc/ssl/private/phabricator.discovery.wmnet.key

But aphlict runs as user aphlict and group aphlict, so it cannot read the key.
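
The listing accounts for the EACCES: mode `-r--r-----` (0440) grants read access to the owner and the group only. The following is a minimal sketch of the classic Unix permission check, ignoring ACLs and capabilities. The numeric IDs are placeholders for illustration, not the real values on the host, and `can_read` is a hypothetical helper.

```python
import stat

def can_read(mode, file_uid, file_gid, uid, gids):
    """Classic Unix read check: owner bits, else group bits, else other bits."""
    if uid == 0:
        return True  # root bypasses the permission bits for reads
    if uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

KEY_MODE = 0o440                        # -r--r----- from the listing above
ROOT_UID, ENVOY_GID = 0, 1001           # envoy gid is a placeholder
APHLICT_UID, APHLICT_GID = 2001, 2001   # placeholders

# aphlict with only its own group
print(can_read(KEY_MODE, ROOT_UID, ENVOY_GID, APHLICT_UID, {APHLICT_GID}))
# aphlict if it were also a member of the envoy group
print(can_read(KEY_MODE, ROOT_UID, ENVOY_GID, APHLICT_UID, {APHLICT_GID, ENVOY_GID}))
```

In this model, read access for the aphlict user would require membership in the envoy group, or different ownership or mode on the key file.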

Change 587225 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: enable TLS for aphlict

https://gerrit.wikimedia.org/r/587225

Change 587224 abandoned by 20after4:
ATS/phabricator: enable aphlict certificate in hiera.

https://gerrit.wikimedia.org/r/587224

Change 587233 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: add envoy TLS terminator for aphlict

https://gerrit.wikimedia.org/r/587233

Dzahn added a comment.Apr 7 2020, 12:06 PM

The new plan is to do TLS termination in envoy rather than in nodejs itself. The new patch above does that by adding a second TLS listener alongside the existing one for main Phabricator. That way we still avoid the buggy Apache module for websockets.

mmodell changed the task status from Open to Stalled.May 14 2020, 3:09 PM
mmodell removed mmodell as the assignee of this task.

I am currently unable to drive this forward, as all the changes are pending merge in Gerrit (operations/puppet).

@Dzahn: Do you think you will have any time to work on this in the foreseeable future? I can make myself available at any time that works best for your schedule, even at midnight CST or later.

Change 587233 merged by Dzahn:
[operations/puppet@production] phabricator: add envoy TLS terminator for aphlict

https://gerrit.wikimedia.org/r/587233

Change 603883 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: add non-global cert/key path for aphlict envoy terminator

https://gerrit.wikimedia.org/r/603883

Change 603883 merged by Dzahn:
[operations/puppet@production] phabricator: add non-global cert/key path for aphlict envoy terminator

https://gerrit.wikimedia.org/r/603883

Change 603895 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: add envoy TLS terminator for aphlict (DO NOT MERGE)

https://gerrit.wikimedia.org/r/603895

Change 587225 abandoned by Dzahn:
phabricator: enable TLS for aphlict

https://gerrit.wikimedia.org/r/587225

Change 615797 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ATS: add backend for aphlict

https://gerrit.wikimedia.org/r/615797

Change 615796 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: set aphlict to disabled in eqiad

https://gerrit.wikimedia.org/r/615796

Change 615842 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: add scap deploy target and missing parameters

https://gerrit.wikimedia.org/r/615842

Change 615842 merged by Dzahn:
[operations/puppet@production] aphlict: add scap deploy target and missing parameters

https://gerrit.wikimedia.org/r/615842

Change 615879 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: add phab_deploy_finalize and rollback scripts

https://gerrit.wikimedia.org/r/615879

Change 615879 merged by Dzahn:
[operations/puppet@production] aphlict: remove requirement of the phab_deploy_finalize script

https://gerrit.wikimedia.org/r/615879

Change 616154 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator/aphlict: set base_dir parameter when using aphlict

https://gerrit.wikimedia.org/r/616154

Change 616154 merged by Dzahn:
[operations/puppet@production] phabricator/aphlict: set base_dir parameter when using aphlict

https://gerrit.wikimedia.org/r/616154

Change 616159 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] access: let existing phabricator admins get on aphlict machine

https://gerrit.wikimedia.org/r/616159

Change 616160 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: fix the basedir/base_dir parameter name for aphlict

https://gerrit.wikimedia.org/r/616160

Change 616159 merged by Dzahn:
[operations/puppet@production] access: let existing phabricator-root-admins get on aphlict machine

https://gerrit.wikimedia.org/r/616159

Change 616160 merged by Dzahn:
[operations/puppet@production] phabricator: fix the basedir/base_dir parameter name for aphlict

https://gerrit.wikimedia.org/r/616160

Change 616165 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: require php-cli package

https://gerrit.wikimedia.org/r/616165

Change 616165 merged by Dzahn:
[operations/puppet@production] aphlict: require php-cli package

https://gerrit.wikimedia.org/r/616165

Change 616171 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ssl: add certificate for aphlict.discovery.wmnet

https://gerrit.wikimedia.org/r/616171

Change 616171 merged by Dzahn:
[operations/puppet@production] ssl: add certificate for aphlict.discovery.wmnet

https://gerrit.wikimedia.org/r/616171

Change 616173 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[labs/private@master] add fake key for aphlict.discovery.wmnet

https://gerrit.wikimedia.org/r/616173

Change 616173 merged by Dzahn:
[labs/private@master] add fake key for aphlict.discovery.wmnet

https://gerrit.wikimedia.org/r/616173

Change 616184 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: add envoy for TLS termination

https://gerrit.wikimedia.org/r/616184

Change 616184 merged by Dzahn:
[operations/puppet@production] aphlict: add envoy for TLS termination

https://gerrit.wikimedia.org/r/616184

Change 615796 merged by Dzahn:
[operations/puppet@production] phabricator: set aphlict to disabled in eqiad

https://gerrit.wikimedia.org/r/615796

Change 616630 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: set envoy-proxy upstream port to 22280, no SNI

https://gerrit.wikimedia.org/r/616630

Change 616630 merged by Dzahn:
[operations/puppet@production] aphlict: set envoy-proxy upstream port to 22280, no SNI

https://gerrit.wikimedia.org/r/616630

mmodell changed the task status from Stalled to Open.Jul 28 2020, 5:41 PM

Change 616890 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: add parameter/ferm rule to let phab server talk to admin port

https://gerrit.wikimedia.org/r/616890

Change 616890 merged by Dzahn:
[operations/puppet@production] aphlict: add parameter/ferm rule to let phab server talk to admin port

https://gerrit.wikimedia.org/r/616890

Change 616894 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: move phabricator_name key to common.yaml in Hiera

https://gerrit.wikimedia.org/r/616894

Change 616894 merged by Dzahn:
[operations/puppet@production] phabricator: move phabricator_name key to common.yaml in Hiera

https://gerrit.wikimedia.org/r/616894

Change 616896 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: make port and IP for the admin interface configurable

https://gerrit.wikimedia.org/r/616896

Change 616896 merged by Dzahn:
[operations/puppet@production] aphlict: make port and IP for the admin interface configurable

https://gerrit.wikimedia.org/r/616896

Change 616906 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: let the service listen on %{::ipaddress}

https://gerrit.wikimedia.org/r/616906

Change 616906 merged by Dzahn:
[operations/puppet@production] aphlict: let the service listen on %{::ipaddress}

https://gerrit.wikimedia.org/r/616906

Change 616909 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add aphlict.discovery.wmnet with CNAME aphlict1001

https://gerrit.wikimedia.org/r/616909

Change 616909 merged by Dzahn:
[operations/dns@master] add aphlict.discovery.wmnet with CNAME aphlict1001

https://gerrit.wikimedia.org/r/616909

Dzahn added a subscriber: 20after4.EditedJul 28 2020, 9:13 PM

@20after4

Current status:

  • aphlict1001.eqiad.wmnet up and running
  • phabricator-roots admin group has access
  • aphlict.service Active: active (running)
  • admin IP and port now configurable in Hiera with defaults 127.0.0.1 and 22281
  • nodejs listening on port 22280 for clients and port 22281 for the connection from phabricator (aka admin port)
  • admin port listening on actual IP on the first interface and not just 127.0.0.1 on the standalone server
tcp        0      0 0.0.0.0:22280           0.0.0.0:*               LISTEN      499        6634586    3091/nodejs         
tcp        0      0 10.64.48.39:22281       0.0.0.0:*               LISTEN      499        6634587    3091/nodejs
  • name of the active phabricator server moved to "common.yaml" in Hiera so aphlict class can use it as well
  • ferm / iptables opens admin port only for phabricator server but client port for all
[aphlict1001:~] $ sudo iptables -L | grep 222
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:22280
ACCEPT     tcp  --  phab1001.eqiad.wmnet  anywhere             tcp dpt:22281
  • aphlict.discovery.wmnet created in DNS with CNAME to aphlict1001
  • aphlict.discovery.wmnet TLS cert for envoy created, added in private and public repos
  • envoy-proxy added to aphlict1001 for TLS termination and is up and running
  • 1 envoy TLS terminator configured that does "port 443 -> port 22280" for the client port and uses the cert
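
The firewall policy in the list above (client port open to everyone, admin port open only to the Phabricator server) can be modelled in a few lines. This is a sketch of the two ACCEPT rules shown in the iptables output only. A default drop policy is assumed because the grep output does not show it, and `is_accepted` is a hypothetical helper.

```python
# (source, destination port) pairs taken from the iptables output above.
# None as the source means "anywhere". A default DROP policy is assumed.
RULES = [
    (None, 22280),                    # client port: open to all
    ("phab1001.eqiad.wmnet", 22281),  # admin port: Phabricator server only
]

def is_accepted(source, dport, rules=RULES):
    """Return True if some ACCEPT rule matches; otherwise the packet is dropped."""
    return any(
        port == dport and (src is None or src == source)
        for src, port in rules
    )

print(is_accepted("client.example.org", 22280))    # hypothetical client host
print(is_accepted("phab1001.eqiad.wmnet", 22281))
print(is_accepted("client.example.org", 22281))
```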

Change 616917 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: add second envoy TLS terminator for client port

https://gerrit.wikimedia.org/r/616917

So I think that this is all that remains:

  • cache layer proxy wss://phabricator.wikimedia.org to aphlict1001
  • enable notifications in phabricator
  • test

Change 603895 abandoned by Dzahn:
[operations/puppet@production] phabricator: add envoy TLS terminator for aphlict (DO NOT MERGE)

Reason:
done on a dedicated VM now with separate role aphlict

https://gerrit.wikimedia.org/r/603895

Change 569104 abandoned by Dzahn:
[operations/puppet@production] ATS/phabricator: directly talk wss:// to aphlict

Reason:
basically duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/615797 but now aphlict is on a separate VM

https://gerrit.wikimedia.org/r/569104

Change 618392 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ssl: update aphlict TLS cert, add phabricator to SANs

https://gerrit.wikimedia.org/r/618392

Change 618392 merged by Dzahn:
[operations/puppet@production] ssl: update aphlict TLS cert, add phabricator to SANs

https://gerrit.wikimedia.org/r/618392

Change 615797 merged by Ema:
[operations/puppet@production] ATS: add new backend for phabricator aphlict

https://gerrit.wikimedia.org/r/615797

hashar added a subscriber: hashar.

I filed a dupe of this task. The admin interface states that the notification server is not reachable. At https://phabricator.wikimedia.org/config/issue/aphlict.connect/ it says it cannot connect to the service on aphlict1001.eqiad.wmnet due to a curl error (CURLE_COULDNT_CONNECT).

Change 619036 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ATS: set caching to 'websockets' for Phabricator

https://gerrit.wikimedia.org/r/619036

Change 619036 merged by Ema:
[operations/puppet@production] ATS: set caching to 'websockets' for Phabricator

https://gerrit.wikimedia.org/r/619036

hashar removed a subscriber: hashar.Aug 10 2020, 1:45 PM

Went to https://phabricator.wikimedia.org/config/issue/aphlict.connect/ again and it states:

Issue Resolved
This setup issue has been resolved. Return to Open Issue List

Seems like setting caching to websockets solved the connection issue.

mmodell added a subscriber: hashar.Aug 10 2020, 5:08 PM

"Seems like setting caching to websockets solved the connection issue."

Unfortunately it didn't actually fix it. That setup issue was related to the connection from phabricator to aphlict. The connection from users to aphlict still isn't working.

Concretely, when I attempt to connect from outside to https://phabricator.wikimedia.org:443 with appropriate websocket upgrade headers, I get the phabricator homepage.

Command:

curl --include \
     --no-buffer \
     --header "Connection: keep-alive, Upgrade" \
     --header "Upgrade: websocket" \
     --header "Cache-Control: no-cache" \
     --header "Accept: */*" \
     --header "Host: phabricator.wikimedia.org" \
     --header "Origin: https://phabricator.wikimedia.org" \
     --header "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIZ==" \
     --header "Sec-WebSocket-Version: 13" \
     https://phabricator.wikimedia.org:443/

Response

HTTP/2 200 
date: Mon, 10 Aug 2020 16:45:21 GMT
server: Apache
x-frame-options: Deny
content-security-policy: default-src https://phab.wmfusercontent.org; img-src https://phab.wmfusercontent.org data:; style-src https://phab.wmfusercontent.org 'unsafe-inline'; script-src https://phab.wmfusercontent.org; connect-src 'self'; frame-src 'self' https://commons.wikimedia.org; frame-ancestors 'none'; object-src 'none'; form-action 'self'  https://www.mediawiki.org https://m.mediawiki.org; base-uri 'none'
referrer-policy: no-referrer
cache-control: private, max-age=0, s-maxage=0
expires: Sat, 01 Jan 2000 00:00:00 GMT
x-content-type-options: nosniff
set-cookie: phsid=A%2Fkjiwjmnitjzy47cgqr3m745k2ushb73d4keee2r7; expires=Sat, 09-Aug-2025 16:45:21 GMT; Max-Age=157680000; path=/; domain=phabricator.wikimedia.org; secure; HttpOnly
set-cookie: next_uri=1597077921%2C%2Fws%2F; path=/; domain=phabricator.wikimedia.org; secure; HttpOnly
set-cookie: phcid=rpk5istax2wuzdfa; path=/; domain=phabricator.wikimedia.org; secure; HttpOnly
vary: Accept-Encoding
content-type: text/html; charset=UTF-8
age: 0
x-cache: cp2037 miss, cp2031 pass
x-cache-status: pass
server-timing: cache;desc="pass"
strict-transport-security: max-age=106384710; includeSubDomains; preload
set-cookie: WMF-Last-Access=10-Aug-2020;Path=/;HttpOnly;secure;Expires=Fri, 11 Sep 2020 12:00:00 GMT
x-client-ip: ****
accept-ranges: bytes

<!DOCTYPE html><html><head><meta charset="UTF-8" /><title>Login</title><meta name="viewport" content="width=device-width, initial-scale=1,
...

Interestingly, if I tell curl to use HTTP 1.1 with the --http1.1 argument, then the front-end proxy gets me to envoy which returns 403:

Command:

curl --include \
     --no-buffer \
     --http1.1 \
     --header "Connection: keep-alive, Upgrade" \
     --header "Upgrade: websocket" \
     --header "Cache-Control: no-cache" \
     --header "Accept: */*" \
     --header "Host: phabricator.wikimedia.org" \
     --header "Origin: https://phabricator.wikimedia.org" \
     --header "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkID==" \
     --header "Sec-WebSocket-Version: 13" \
     https://phabricator.wikimedia.org:443/

Response:

HTTP/1.1 403 Forbidden
date: Mon, 10 Aug 2020 17:00:01 GMT
server: envoy
content-length: 0
Age: 0
X-Cache-Int: cp2033 pass
Connection: close

So apparently there is something related to HTTP 2 which breaks the websocket forwarding?
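
For reference when reading the two responses: a successful HTTP/1.1 WebSocket upgrade is answered with `101 Switching Protocols` and a `Sec-WebSocket-Accept` header derived from the request's `Sec-WebSocket-Key` (RFC 6455). HTTP/2 does not carry the HTTP/1.1 `Upgrade` header mechanism (RFC 7540), so the two curl invocations are not strictly equivalent requests. The following is a minimal sketch of the key and accept computation, useful for checking a handshake by hand. The function names are illustrative.

```python
import base64
import hashlib
import os

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def make_key():
    """Random 16-byte nonce, base64-encoded, for the Sec-WebSocket-Key header."""
    return base64.b64encode(os.urandom(16)).decode("ascii")

def expected_accept(key):
    """Sec-WebSocket-Accept value a compliant server returns for this key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

key = make_key()
print("Sec-WebSocket-Key:", key)
print("expected Sec-WebSocket-Accept:", expected_accept(key))
```

If a proxy in the path answers with anything other than a 101 carrying the matching accept value, the upgrade did not reach a WebSocket-capable backend.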

Hmm, that's interesting. Please note that this is not the first time we have used websockets: etherpad.wm.o already uses them successfully (even when HTTP/2 is used to perform the upgrade request).

Change 619465 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] enable envoy websockets support for role::aphlict

https://gerrit.wikimedia.org/r/619465

Change 619465 merged by CDanis:
[operations/puppet@production] enable envoy websockets support for role::aphlict

https://gerrit.wikimedia.org/r/619465

CDanis added a subscriber: CDanis.Aug 11 2020, 1:48 PM

The Envoy TLS terminator is now configured to allow websocket upgrades -- however, it's improperly configured. It's trying to connect via IPv6 address to the nodejs process, but the nodejs process is not listening there -- it listens only on the IPv4 address.

I'm not familiar with this setup so I'll leave it to @Dzahn to fix.
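
A mismatch like this (proxy connecting over IPv6, backend listening on IPv4 only) can be confirmed by attempting a TCP connection once per address family. Below is a minimal sketch demonstrated on loopback with a throwaway listener. The helper name `can_connect` is hypothetical, and on a real host the addresses and port would be those of the service being checked.

```python
import socket

def can_connect(family, address, port, timeout=2.0):
    """Try a TCP connect using one address family; return True on success."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect((address, port))
        return True
    except OSError:
        # Covers refused connections, timeouts, and unsupported families.
        return False

if __name__ == "__main__":
    # Demo: a listener bound to IPv4 only, as in the situation described.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        srv.listen(1)
        port = srv.getsockname()[1]
        print("IPv4:", can_connect(socket.AF_INET, "127.0.0.1", port))
        print("IPv6:", can_connect(socket.AF_INET6, "::1", port))
```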

Change 619560 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aphlict: listen on IPv6 instead IPv4 for client and admin ports

https://gerrit.wikimedia.org/r/619560

Dzahn added a comment.Aug 11 2020, 8:10 PM

test comment

Change 619560 merged by Dzahn:
[operations/puppet@production] aphlict: listen on IPv6 instead IPv4 for client and admin ports

https://gerrit.wikimedia.org/r/619560

Will it blend?

it appears to blend.

Dzahn added a comment.EditedAug 11 2020, 8:36 PM

"The Envoy TLS terminator is now configured to allow websocket upgrades"

Thank you for adding that line. Was aware of it but somehow lost it when splitting aphlict away from phab.

"-- however, it's improperly configured. It's trying to connect via IPv6 address to the nodejs process, but the nodejs process is not listening there -- it listens only on the IPv4 address."

This is fixed now. It is now listening on IPv6 only, for both the client port and the "admin" port (where Phabricator connects):

tcp6 0 0 2620:0:861:107:10:22280 :::* LISTEN 499 16341385 20751/nodejs
tcp6 0 0 2620:0:861:107:10:22281 :::* LISTEN 499 16341386 20751/nodejs

"I'm not familiar with this setup so I'll leave it to @Dzahn to fix."

Done:) Confirmed with Mukunda this works now. We both see pop-up notifications and realtime changes when things move on workboards. Yay!

Dzahn closed this task as Resolved.Aug 11 2020, 8:43 PM
Dzahn claimed this task.

We are seeing realtime notifications again and aphlict is now separated from Phabricator on a dedicated VM.

We are also no longer using mod_ws, which caused the instability that led to this ticket.

The traffic flow is now:

client->ATS->Varnish->ATS->envoy-proxy->aphlict

and

phabricator->aphlict

with envoy-proxy doing TLS termination on the aphlict1001 server.