Page MenuHomePhabricator

Phabricator downtime due to aphlict and websockets (aphlict current disabled)
Open, MediumPublic

Description

On Friday November 15th we had a short Phabricator downtime.

It was related to the aphlict service (realtime notification server) which uses websockets (T112765) behind cache (formerly cache_misc T134870).

All php-fpm child processes were used by it.

Disabling the aphlict service (and puppet) was the immediate fix and brought Phabricator back normally.

For now aphlict is disabled. Users will still get normal notifications but no realtime pop-up notifications.

There is a theory this is related to recent changes on the caching (ATS) side because we did not have this problem before and it's not a new setup.

Event Timeline

Dzahn created this task.Mon, Nov 18, 8:47 PM

Change 551271 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: disable aphlict

https://gerrit.wikimedia.org/r/551271

Change 551271 merged by Dzahn:
[operations/puppet@production] phabricator: disable aphlict

https://gerrit.wikimedia.org/r/551271

Mentioned in SAL (#wikimedia-operations) [2019-11-18T20:59:11Z] <mutante> phab1003 - re-enabling puppet after merging gerrit::551271 - making sure aphlict stays disabled incl. the apache config ProxyPass lines using mod_proxy_wstunnel (T238593)

Change 551731 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable websocket remap rules for phabricator

https://gerrit.wikimedia.org/r/551731

Qgil removed a subscriber: Qgil.Tue, Nov 19, 7:27 AM
ema triaged this task as Medium priority.Tue, Nov 19, 8:50 AM
ema added a project: Traffic.
Restricted Application added a project: Operations. · View Herald TranscriptTue, Nov 19, 8:51 AM
ema moved this task from Triage to Caching on the Traffic board.Tue, Nov 19, 8:51 AM

Change 551777 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] prometheus,ATS: track current active http/http2/websocket connections

https://gerrit.wikimedia.org/r/551777

ema added a comment.Tue, Nov 19, 9:59 AM

By going through SAL and the irc logs on #wikimedia-operations I've reconstructed the events as follows. There are some parts I don't understand so please fill the gaps.

  • 2019-11-15 17:23 phabricator paged. From irc: PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
  • 2019-11-15 17:30 SAL: mutante: phabricator - -started phd service. @Dzahn It's not clear which command was run, on which server, and why
  • 2019-11-15 21:21 SAL: _joe_: disabling proxying to ws on phabricator1003 @Joe: why was this necessary? What was the issue observed at that point?

Change 551777 merged by Vgutierrez:
[operations/puppet@production] prometheus,ATS: track current active http/http2/websocket connections

https://gerrit.wikimedia.org/r/551777

Joe added a comment.EditedTue, Nov 19, 12:47 PM

The problem is most apache workers ended up being stuck talking to aphlict via proxy_wstunnel which has stability issues (we tried to use it on etherpad, and it created issues there too, apparently).

By going through SAL and the irc logs on #wikimedia-operations I've reconstructed the events as follows. There are some parts I don't understand so please fill the gaps.

  • 2019-11-15 16:31 phab1003 has 169 busy apache workers according to this grafana dashboard which seems unusual. The number of busy workers is usually lower (below 20), but in the past it had been occasionally about as high (eg: it was 159 on 2019-09-21 at 00:00).

Note that phab was intermittently unavailable (albeit without alerting the 1st time) happened twice on 2019-11-15. Once on 17:12 to 17:20 and from 21:03 to 21:12.

That's while the apache scoreboard is full and prometheus can't query apache for that data. There is a gap as well on 21:03 to 21:12

  • 2019-11-15 17:23 phabricator paged. From irc: PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator
  • 2019-11-15 17:30 SAL: mutante: phabricator - -started phd service. @Dzahn It's not clear which command was run, on which server, and why

phab1001, operator mistake.

  • 2019-11-15 21:21 SAL: _joe_: disabling proxying to ws on phabricator1003 @Joe: why was this necessary? What was the issue observed at that point?

As @Joe says, we believed at that point in time for this to be the best course of action to restore the service and maintain stability during the weekend. It's not necessarily the best course of action overall, although apache and websockets has bit us in the past

Change 551731 merged by Ema:
[operations/puppet@production] ATS: Disable websocket remap rules for phabricator

https://gerrit.wikimedia.org/r/551731

Dzahn added a comment.Tue, Nov 19, 4:14 PM
  • 2019-11-15 17:30 SAL: mutante: phabricator - -started phd service. @Dzahn It's not clear which command was run, on which server, and why

You can entirely disregard that, i was on phab1001 and not phab1003 by accident. So not the prod server.

Also that is why phd was not running and phd is unrelated to this issue. fwiw the commands are:

systemctl (status|start) phd (the main phabricator daemon)

systemctl (status|start) aphlict (the realtime notifications service)

systemctl (status|start) ssh-phab (the phabricator ssh service)

Mentioned in SAL (#wikimedia-operations) [2019-11-19T16:17:38Z] <mutante> phab1003 - systemctl stop aphlict (proxy config in apache is disabled as well as disabled in ATS) (T238593)

ema added a comment.Tue, Nov 19, 4:20 PM

You can entirely disregard that, i was on phab1001 and not phab1003 by accident. So not the prod server.

Right, yeah that makes sense. Thanks for the clarification! Was there any other admin action between the page and when @Joe disabled proxy_wstunnel? Is there any metric we can add to https://grafana.wikimedia.org/d/000000587/phabricator that could help tracking when apache workers "are stuck"?

Mentioned in SAL (#wikimedia-operations) [2019-11-19T16:21:21Z] <mutante> phab1003 - puppet restarts aphlict service even with "phabricator_aphlict_enabled: false" in Hiera. But it does properly remove the proxy config lines from apache. so service is running but not used. (T238593)

Dzahn added a comment.Tue, Nov 19, 4:25 PM

Was there any other admin action between the page and when @Joe disabled proxy_wstunnel?

Not that i know of. As noted above there were 2 events. The first time we got alerts and then everything recovered again (almost exactly after i started that service but on the wrong server). So that was without any admin action that i know of but it recovered. Then a little while later it happened again and Alex and Joe started debugging before i was aware of it. So also no admin action from me there.

Dzahn added a comment.Tue, Nov 19, 4:42 PM

Regarding the puppetization: There is hiera('phabricator_aphlict_enabled'. which is now set to false. What this does:

  • remove ferm hole for port 22280/tcp (confirmed)
  • remove all config values from the $notification_servers variable (this influences the apache erb template)
  • stops actively loading proxy_wstunnel (but does not actually unload it, just httpd::mod_conf is skipped, module still loaded)
  • In phabricator-default.conf.erb, the apache config, the "ProxyPass "/ws/" lines are skipped (and removed) because $notification_servers is empty. confirmed.
  • The part that the ProxyPass lines are also inside a "<IfModule mod_proxy_wstunnel.c>" would not have mattered because puppet does not unload the module.

So the systemd service is running and the httpd module is loaded but also the firewall is closed, apache does not proxy to it anymore and Valentin disabled it in ATS. Room for improvement to also stop service and unload module though with that single Hiera setting, i guess.

Change 551910 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: stop aphlict service if disabled in Hiera

https://gerrit.wikimedia.org/r/551910

Change 551910 merged by Dzahn:
[operations/puppet@production] phabricator: stop aphlict service if disabled in Hiera

https://gerrit.wikimedia.org/r/551910

Mentioned in SAL (#wikimedia-operations) [2019-11-19T20:09:51Z] <mutante> phab1003 after merging gerrit:551910 puppet now also stopped the actual aphlict service and removed the systemd unit file. had to manually run 'systemctl reset-failed' though to clean systemd status and avoid icinga alert (T238593)

Change 551915 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: if aphlict is disabled also unload wstunnel httpd module

https://gerrit.wikimedia.org/r/551915

Change 551915 merged by Dzahn:
[operations/puppet@production] phabricator: if aphlict is disabled also unload wstunnel httpd module

https://gerrit.wikimedia.org/r/551915

Dzahn added a comment.Tue, Nov 19, 9:22 PM

The Hiera key now does all the things, also stopped the service and unloaded the httpd module wstunnel.

After that Phabricator started to show a warning icon to admins that it can't talk to notification servers.

I clicked in the web UI to remove notification servers from the settings to make that go away.

The content before was

1[
2 {
3 "host": "phabricator.wikimedia.org",
4 "port": 443,
5 "protocol": "https",
6 "path": "/ws/",
7 "type": "client"
8 },
9 {
10 "host": "localhost",
11 "port": 22281,
12 "protocol": "http",
13 "type": "admin"
14 }
15]
.

Joe added a comment.Wed, Nov 20, 6:51 AM

I don't think the solution is removing aphlict, but instead proxying to it directly from envoy or ATS, our choice. @ema @Dzahn your choice :)

Dzahn added a comment.Wed, Nov 20, 1:24 PM

Either way when switching the Hiera key it should either enable or disable all the things and not just some of them.

As Mukunda pointed out the aphlict service does not even have to run on the same machine as main phabricator.

Change 552122 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] varnish: remove config for disabled phab_aphlict

https://gerrit.wikimedia.org/r/552122

As Mukunda pointed out the aphlict service does not even have to run on the same machine as main phabricator.

In order for aphlict to run on a separate server I think the phabricator config needs to be changed to point to the IP of the host running aphlict.

Change 552122 merged by Dzahn:
[operations/puppet@production] varnish: remove config for disabled phab_aphlict

https://gerrit.wikimedia.org/r/552122

Change 554219 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: remove aphlict from webserver config

https://gerrit.wikimedia.org/r/554219

Change 554219 merged by Dzahn:
[operations/puppet@production] phabricator: remove wstunnel for aphlict from webserver config

https://gerrit.wikimedia.org/r/554219