
Thumbor units failing / service general slowness
Open, Medium, Public

Description

Today Thumbor paged because its blackbox probes were failing: the service was slow and breached the probe timeout (10s, configured in service::catalog IIRC).

Additionally, it looks like thumbor units are being killed (and restarted) with SIGABRT, and haproxy-reported latencies have gone up over the past few days: https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=now-7d&to=now

2022-07-10-155841_1353x573_scrot.png (573×1 px, 163 KB)

Event Timeline

I found a lot of the following logs:

firejail: util.c:906: create_empty_dir_as_root: Assertion `(s.st_mode & 07777) == (mode)' failed.

It is also reproducible by sudoing as user thumbor and executing the systemd ExecStart command of a failed unit.
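
For readers unfamiliar with the check in the log line above: the assertion compares a directory's permission bits (masked with 07777) against an expected mode, and an unmet assertion aborts the process, which would be consistent with the SIGABRT restarts. Below is a minimal Python sketch of an equivalent check. It is not firejail's code; the helper names (`mode_matches`, `make_dir`, `demo`) are hypothetical, and the umask scenario is only one illustrative way such a comparison can come out false, not a diagnosis of what happened on the thumbor hosts.

```python
import os
import tempfile


def mode_matches(path, mode):
    """Return True if the permission bits of path equal mode.

    Mirrors the shape of the asserted expression:
    (s.st_mode & 07777) == (mode)
    """
    st = os.stat(path)
    return (st.st_mode & 0o7777) == mode


def make_dir(parent, name, mode, umask):
    """Create parent/name with the requested mode under a given umask (hypothetical helper)."""
    old = os.umask(umask)
    try:
        path = os.path.join(parent, name)
        os.mkdir(path, mode)
    finally:
        # Always restore the previous umask.
        os.umask(old)
    return path


def demo():
    """Create two directories and report whether each passes the mode check."""
    with tempfile.TemporaryDirectory() as parent:
        # umask 000: the requested mode is applied unchanged.
        plain = make_dir(parent, "plain", 0o755, 0o000)
        # umask 027: the kernel clears some bits, so the result differs from 0755.
        masked = make_dir(parent, "masked", 0o755, 0o027)
        return mode_matches(plain, 0o755), mode_matches(masked, 0o755)


if __name__ == "__main__":
    print(demo())
```

On a typical Linux filesystem the first directory should pass the check and the second should not. In C, the failing case is where `assert` would abort the process.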

Found the following exception, sending as NDA, as I suspect it is user-traffic related:

{P30998}

I downgraded firejail on all thumbor servers and that stopped, at least for now, the flurry of restarts we were seeing. More investigation is needed.

@fgiunchedi you OK with this being "Medium" priority? I'm trying to reduce the vast pile of untriaged tasks on the clinic duty board...

fgiunchedi triaged this task as Medium priority. Sep 8 2022, 8:47 AM

@MatthewVernon yes medium works, {{done}}

@Joe @fgiunchedi I wrote a rough draft based on the above. Feel free to expand or correct accordingly:

https://wikitech.wikimedia.org/wiki/Incidents/2022-07-10_thumbor

Grafana dashboard: Thumbor
{F35513584 height=300}