WS Export downtime has been excessive recently
Closed, ResolvedPublic3 Estimated Story Points

Description

The web service keeps going offline.

CPU usage on Grafana:

Screenshot 2022-06-21 at 08-44-20 Cloud VPS Project Board - Grafana.png

UptimeRobot recent data:

Screenshot 2022-06-21 at 08-48-55 WS Export - Community Tech tools.png

Previously, excessive CPU usage has come from Calibre's ebook-convert process (i.e. creating PDFs etc.) rather than from epub generation.

Event Timeline


The main filesystem was full, because Apache's tmp directory contained 11 GB of Calibre tmp files, e.g.: /tmp/systemd-private-5763bd0587ee42a89076febcaafe42cf-apache2.service-ekYWlY/tmp/calibre_4.13.0_tmp_zyI6jZ

Cleaned up, and now:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2        19G  9.5G  8.2G  54% /

(It also looks like the https://ws-export.wmcloud.org/logs/2022.sql.gz log dump is incomplete.)
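A generic sketch of how a culprit like this can be found (the function name and paths here are illustrative, not the exact commands run on the server): list the largest entries under a directory, which in this case pointed straight at the Calibre temp files under Apache's private `/tmp`.

```shell
#!/bin/sh
# Illustrative helper: print the five largest entries under a directory,
# sorted by size in KB, to spot what is eating the disk.
biggest_dirs() {
    root="$1"
    du -sk "$root"/* 2>/dev/null | sort -rn | head -n 5
}

# e.g. biggest_dirs /tmp
```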

I've moved the calibre temp directory to /ws-export/var/calibre-temp, so that should at least stop it filling up the main filesystem.
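A sketch of one way to do that relocation, assuming Calibre's documented CALIBRE_TEMP_DIR environment variable is what was used (it would need to be set in the web service's environment, e.g. Apache's envvars, so that ebook-convert inherits it):

```shell
# Assumed setup, not necessarily the exact change made on the server:
# point Calibre's temp files at a dedicated directory on a larger volume.
export CALIBRE_TEMP_DIR=/ws-export/var/calibre-temp
mkdir -p "$CALIBRE_TEMP_DIR"
```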

It does look like ebook-convert is the actual problem, though: it keeps running out of memory, e.g.:

[ 1188.592783] Out of memory: Kill process 11869 (QtWebEngineProc) score 301 or sacrifice child
[ 1188.594473] Killed process 11869 (QtWebEngineProc) total-vm:1074240kB, anon-rss:15220kB, file-rss:0kB, shmem-rss:120kB
[ 1193.639564] Out of memory: Kill process 11798 (QtWebEngineProc) score 301 or sacrifice child
[ 1193.641230] Killed process 11798 (QtWebEngineProc) total-vm:1074060kB, anon-rss:15160kB, file-rss:0kB, shmem-rss:128kB
[ 1198.614985] Out of memory: Kill process 11823 (QtWebEngineProc) score 301 or sacrifice child
[ 1198.616777] Killed process 11823 (QtWebEngineProc) total-vm:1074348kB, anon-rss:14764kB, file-rss:0kB, shmem-rss:128kB
[ 1207.971253] Out of memory: Kill process 11834 (QtWebEngineProc) score 301 or sacrifice child
[ 1207.974551] Killed process 11834 (QtWebEngineProc) total-vm:1074348kB, anon-rss:14832kB, file-rss:0kB, shmem-rss:120kB
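One possible mitigation, sketched here purely as an assumption (not what was actually deployed): run each conversion in a subshell with a virtual-memory cap, so that an oversized conversion fails with its own error instead of waking the kernel OOM killer.

```shell
# Illustrative wrapper: cap the virtual memory of a single command.
# The helper name and the 1 GiB figure are assumptions for the sketch.
run_capped() {
    limit_kb="$1"; shift
    # the ulimit applies only inside this subshell, not to the caller
    ( ulimit -v "$limit_kb" 2>/dev/null; exec "$@" )
}

# e.g. run_capped 1048576 ebook-convert book.epub book.pdf
```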

It looks like there are definitely multiple requests for the same PDFs, so I think caching generated PDFs (or rather, any converted format) for an hour will help with this.
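The one-hour cache idea can be sketched as follows; this is a hypothetical wrapper, not WS Export's implementation, and the converter command, cache directory, and key scheme are all assumptions. The cache key is a hash of the source file, and an entry is reused only while it is under 60 minutes old.

```shell
# Illustrative cache wrapper: cached_convert CONVERTER SRC FMT CACHE_DIR
# prints the path of the converted file, converting only on a cache miss.
cached_convert() {
    converter="$1"; src="$2"; fmt="$3"; cache_dir="$4"
    mkdir -p "$cache_dir"
    # key the cache on the source file's content
    key=$(sha256sum "$src" | cut -d' ' -f1 | cut -c1-16)
    out="$cache_dir/$key.$fmt"
    # regenerate only when no fresh (< 60 min old) cached copy exists
    if [ ! -f "$out" ] || [ -z "$(find "$out" -mmin -60 2>/dev/null)" ]; then
        "$converter" "$src" "$out" || return 1
    fi
    printf '%s\n' "$out"
}

# e.g. cached_convert ebook-convert book.epub pdf /ws-export/var/convert-cache
```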

I've also reduced the non-logged-in rate limit to 10 requests per minute (it was 20).
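For illustration, a per-minute rate limit like this can be modelled as a fixed-window counter; the sketch below is an assumption about the general technique, not the app's real limiter, and the function name and file-based bookkeeping are invented for the example.

```shell
# Illustrative fixed-window limiter: allow_request CLIENT LIMIT DIR [WINDOW]
# succeeds (exit 0) while the client is under LIMIT requests in the window,
# and fails (exit 1) once the limit is reached.
allow_request() {
    client="$1"; limit="$2"; dir="$3"
    window="${4:-$(date +%Y%m%d%H%M)}"   # default: the current minute
    f="$dir/$client.$window"
    mkdir -p "$dir"
    count=$(cat "$f" 2>/dev/null || echo 0)
    [ "$count" -ge "$limit" ] && return 1
    echo $((count + 1)) > "$f"
}
```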

In preparation for fixing up a better PDF-generation cache, I've done some upgrading of a few things: https://github.com/wikimedia/ws-export/pull/412 (ready for review, if anyone's got time).

Things seem to have stabilised now, although only with the rate limit now down at 2 requests per minute. I'll raise it incrementally and keep an eye on things, and also keep an ear out for anyone complaining that they're having to log in.

Samwilson set the point value for this task to 3.

It seems like downtime has decreased a fair bit with the current rate limit (of 4 requests per minute before users are made to log in). No one has said that this is annoying, so I think we can probably leave things as they are now and not worry about adding more aggressive caching of generated PDFs etc.

There's nothing to QA here, so moving to product sign-off.

NRodriguez subscribed.

Thanks for all your work on this @Samwilson