ProbeDown - Etherpad
Closed, Resolved (Public)

Description

See T343646: ProbeDown (etherpad1003) and T345464: ProbeDown - Etherpad for previous occurrences.

Common information

  • alertname: ProbeDown
  • job: probes/custom
  • prometheus: ops
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: serviceops-collab

Event Timeline

LSobanski renamed this task from ProbeDown to ProbeDown - Etherpad. Oct 9 2023, 6:07 AM

Possibly relevant log from https://logstash.wikimedia.org/app/dashboards#/view/syslog?_g=h@323090d&_a=h@566182a:

[2023-10-07 21:42:44.967] [ERROR] console - Error: Request aborted
    at onaborted (/usr/share/etherpad-lite/src/node_modules/express/lib/response.js:1025:15)
    at Immediate.<anonymous> (/usr/share/etherpad-lite/src/node_modules/express/lib/response.js:1067:9)
    at processImmediate (internal/timers.js:461:21)

Failures started at Oct 7, 2023 @ 21:31:25 and ended at Oct 7, 2023 @ 21:37:10. Looking at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/blackbox/check/http.pp#191, the default timeout is 2 minutes.
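For reference, the length of the failure window can be worked out from the two timestamps above. This is only a small sketch: the format string is an assumption matching how the timestamps are written in this comment, and `outage_duration` is a hypothetical helper name, not part of any tooling used here.

```python
from datetime import datetime, timedelta

# Assumed format, matching the timestamps as written in the comment above.
# Note: %b parses English month abbreviations in the default C/English locale.
FMT = "%b %d, %Y @ %H:%M:%S"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the time between the first and the last failed probe."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


duration = outage_duration("Oct 7, 2023 @ 21:31:25", "Oct 7, 2023 @ 21:37:10")
print(duration)  # 0:05:45
```

So the probe was failing for a little under six minutes.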

Jelto triaged this task as Medium priority.
Jelto subscribed.

According to the blackbox checks in logstash, etherpad returned a few timeouts, followed by roughly 5 minutes of 503 responses: https://logstash.wikimedia.org/goto/dcf2bc8ef4c5394a4325d833e16181c3

The syslog in logstash shows an error message followed by 12 minutes of silence: https://logstash.wikimedia.org/goto/42e2c743edeb183291e1df59f824e13d

The error message is:

[WARN] client - TypeError: a[t] is undefined -- {
  errorId: 'ZJj5X4r3cepvOxlrWObD',
  type: 'Uncaught exception',
  msg: 'TypeError: a[t] is undefined',
   ...

As far as I can see, the error appears multiple times on a single pad. I found a related task, T126379, where a pad got corrupted. However, I visited the pad from the most recent incident and etherpad works normally, with no sign of corruption or a restart.
I'll do some more research and check if that pad is corrupted.

After some more digging I found an interesting log entry (which was present on the host only and not in logstash):

[2023-10-07 21:31:15.506] [WARN] ImportEtherpad - (pad WRN202310) unsupported attributes (try installing a plugin): list, start, strikethrough
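Since this entry only exists on the host, a small script can help pull the pad name and the unsupported attributes out of the local log. The following is a rough sketch, assuming all such lines follow the exact layout of the one quoted above; `parse_unsupported_attributes` and the regular expression are illustrative and not part of etherpad.

```python
import re

# Pattern derived from the single log line quoted above (an assumption:
# other ImportEtherpad warnings may be laid out differently).
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] ImportEtherpad - "
    r"\(pad (?P<pad>[^)]+)\) unsupported attributes "
    r"\(try installing a plugin\): (?P<attrs>.+)"
)


def parse_unsupported_attributes(line: str):
    """Return timestamp, pad and attribute list, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    attrs = [a.strip() for a in match.group("attrs").split(",")]
    return {
        "timestamp": match.group("ts"),
        "pad": match.group("pad"),
        "attributes": attrs,
    }


example = (
    "[2023-10-07 21:31:15.506] [WARN] ImportEtherpad - (pad WRN202310) "
    "unsupported attributes (try installing a plugin): list, start, strikethrough"
)
print(parse_unsupported_attributes(example))
```

Running this over the host log would show which pads triggered the warning and how close each one was to the start of the failures.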

I found two issues describing etherpad being down when installing a plugin: https://github.com/ether/etherpad-lite/issues/4774 and https://github.com/ether/etherpad-lite/issues/5583

The pad mentioned above also contains a warning:

Warning: There is an Etherpad bug that causes copypastes of longer text to fail (probably https://github.com/ether/etherpad-lite/issues/4951 and/or https://github.com/ether/etherpad-lite/issues/5544). When copypasting text into this pad, one may need to reload the pad in the browser and check the result, and possibly split up the pasted text into smaller chunks.

So I guess pasting some special text/characters triggered the installation (attempt) of the plugins list, start, strikethrough. I have to check if this can be disabled and which plugins are installed/not installed.

After some more tests I noticed etherpad becomes unavailable when opening/importing/exporting larger pads (text size around 300 kB and 2300 lines of text, so not uncommon I'd say). This happened again in T349076.
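To check whether a given pad export falls into this size range, something like the following could be used. It is only a sketch: the thresholds are taken from the rough numbers above (about 300 kB and 2300 lines), and `pad_stats` and `is_large_pad` are hypothetical helpers, not etherpad functions or settings.

```python
def pad_stats(text: str) -> dict:
    """Return the UTF-8 size in bytes and the number of lines of a pad's text."""
    return {
        "bytes": len(text.encode("utf-8")),
        "lines": len(text.splitlines()),
    }


def is_large_pad(text: str, max_bytes: int = 300_000, max_lines: int = 2300) -> bool:
    """Flag pads at or above the size where the outages were observed.

    The default thresholds are assumptions based on the figures in this task.
    """
    stats = pad_stats(text)
    return stats["bytes"] >= max_bytes or stats["lines"] >= max_lines


# Example: a synthetic pad with 2300 short lines is flagged, a short one is not.
print(is_large_pad("some text\n" * 2300))  # True
print(is_large_pad("a short pad"))         # False
```

Feeding it the plain-text export of a pad would give a quick indication of whether that pad is likely to trigger the problem.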

Before tweaking certain buffer and import limits in the etherpad config I'd try to bump the resources of the VM a bit. Currently etherpad has 1 CPU and 1 GB of memory. I'd like to use 2 CPUs and 2 GB of memory. @MoritzMuehlenhoff does that work for you? The ganeti eqiad / C cluster should have another 1 CPU and 1 GB of memory free according to my gnt-node list queries.

Resource usage (especially memory) is also quite high at the moment: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=etherpad1003&var-datasource=thanos&var-cluster=misc&from=now-24h&to=now

Mentioned in SAL (#wikimedia-operations) [2023-10-20T07:21:25Z] <jelto> increase etherpad1003 CPU and memory (1CPU,1GB -> 2CPU,2GB) - T348386

Icinga downtime and Alertmanager silence (ID=4c5627b3-2925-474a-b57a-39f23d290560) set by jelto@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Reboot to use new CPU and memory config

etherpad1003.eqiad.wmnet

The instance has been rebooted and is now at 2 CPUs and 2 GB of memory, cc @MoritzMuehlenhoff

I'll close the task; the issue should be fixed. If we see more ProbeDown alerts, we will have to adjust the buffer and rate limit settings in the etherpad config file.