The tool has had more downtime recently than is usual:
Most recently it was down for 18 hours and 38 minutes from 2023-04-27 05:23:01.
Samwilson, Apr 28 2023, 12:18 AM
This is still happening. I've not had much of a chance to look into it, but it does seem like it's the PDF conversion process that's routinely using 100% CPU. I'm not sure what the best fix is for this (a better conversion queue would be ideal, with the web server not being responsible for doing all the converting!), but a quick fix might be to simply limit the number of pages allowed in a PDF conversion. (EPUB is always available for large books, and people can convert them locally with Calibre or whatever they like.)
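A minimal sketch of the page-limit idea: a guard that runs before conversion and rejects PDF requests for books above a cap. The cap value, `TooManyPagesError` and `check_conversion_allowed` are hypothetical; the format names are ws-export's PDF formats (`pdf-a4`, `pdf-a5`, `pdf-a6`, `pdf-letter`).

```python
# Hypothetical cap; a real value would need tuning against actual load.
MAX_PDF_PAGES = 500

PDF_FORMATS = {"pdf-a4", "pdf-a5", "pdf-a6", "pdf-letter"}


class TooManyPagesError(Exception):
    """Raised when a book is too large to convert to PDF on the web server."""


def check_conversion_allowed(fmt: str, page_count: int, limit: int = MAX_PDF_PAGES) -> None:
    # Only PDF formats are capped; EPUB and other formats pass through.
    if fmt in PDF_FORMATS and page_count > limit:
        raise TooManyPagesError(
            f"{page_count} pages exceeds the PDF limit of {limit}; "
            "download the EPUB and convert it locally (e.g. with Calibre)."
        )
```

Rejecting early, before any pages are fetched, is what would keep large jobs off the CPU entirely.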
There are also a bunch of failures related to showing the current version number in the footer, which can be fixed via T334454 (I've now created a patch for that).
Hello, @Samwilson -- I'm having this issue (T345025) when trying to export any file type from Wikisource, including epub files.
As of today, stats show 80% uptime over the last 90 days (i.e. 20% downtime). Since December 6th, cumulative uptime has been roughly 20%, with whole days of 0% uptime in the mix.
Stats are well and good, but is there any alerting when it falls down so someone with a pager can go kick it?
Alternatively, at this level of (lack of) availability it needs to be actually disabled (the "Download" button removed from the UI on the Wikisources).
I've been wondering if we should just turn off PDF generation as an emergency measure. That'd at least keep it online for EPUB generation, and if anyone wants a PDF (or any other format) they can use Calibre to convert it. I suspect people would be fairly disappointed, but maybe there's no other option for the time being.
As for alerting, I use the emails from UptimeRobot to some extent, but I get so many from so many different tools that I've become rather blind to them.
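One common way to fight alert fatigue is to alert on state changes rather than on every failed check. Here is a minimal sketch, assuming some monitor calls `record()` with each probe result. The class and its threshold are hypothetical and are not part of UptimeRobot or Prometheus.

```python
from dataclasses import dataclass, field


@dataclass
class OutageAlerter:
    """Emit one message after `threshold` consecutive failed checks and one
    on recovery, instead of one email per failed probe."""
    threshold: int = 3
    failures: int = 0
    down: bool = False
    sent: list = field(default_factory=list)

    def record(self, ok: bool) -> None:
        if ok:
            if self.down:
                self.sent.append("RECOVERED")
            self.failures = 0
            self.down = False
            return
        self.failures += 1
        # Alert only once per outage, when the failure streak crosses the threshold.
        if not self.down and self.failures >= self.threshold:
            self.down = True
            self.sent.append(f"DOWN after {self.failures} failed checks")
```

A single transient failure produces no message, and a day-long outage produces exactly two.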
I've been looking at ws-export a bit lately, and have been hoping to a) upgrade to the latest Symfony, so we can use some of the newer job-queue features, and b) upgrade the VPS (which I've started; the test site is now done).
A config like this would do it, I think:
```
# The PDF formats are `pdf-a4`, `pdf-a5`, `pdf-a6`, and `pdf-letter`.
RewriteCond %{QUERY_STRING} format=pdf [NC]
RewriteRule .* - [L,R=451]
ErrorDocument 451 "Ebooks in formats other than EPUB are not available at the moment. See <a href='https://phabricator.wikimedia.org/T335553'>T335553</a> for details."
```
(I don't suppose it's an appropriate use of HTTP 451 Unavailable For Legal Reasons, but you get the idea.)
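To sanity-check which query strings the RewriteCond in the config catches, it can be emulated with Python's `re`. This is an approximation: mod_rewrite uses PCRE, but for a plain literal pattern like this the behaviour agrees. `[NC]` means case-insensitive, and the pattern is unanchored. The function name and the `epub-3` query are illustrative.

```python
import re

# Same pattern as the RewriteCond; [NC] maps to re.IGNORECASE.
PDF_COND = re.compile(r"format=pdf", re.IGNORECASE)


def would_be_blocked(query_string: str) -> bool:
    # search(), not match(): the condition can match anywhere in the query string.
    return PDF_COND.search(query_string) is not None
```

Because the pattern is unanchored, it also matches things like `myformat=pdfx`. Something like `(^|&)format=pdf` would be stricter, if that ever matters.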
As of right now, ws-export has been down for 32 hours straight. Recorded uptime over the last 30 days is 60%, i.e. 40% downtime.
I think it's past time to disable PDF exports.
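For scale, the uptime percentages quoted in this task convert to absolute downtime as follows. The helper name is made up.

```python
def downtime_days(uptime_percent: float, window_days: int) -> float:
    """Convert an uptime percentage over a window into days of downtime."""
    return window_days * (100 - uptime_percent) / 100
```

80% uptime over 90 days is 18 days down, and 60% over 30 days is 12 days down.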
Okay, I've disabled PDFs for now. Let's give it a few days, and see if it does improve uptime.
You may also consider disabling HTMLZ. I noticed massive downloads from ws.en and ws.fr over the past two months. I don’t think this format is very useful for individuals. Adding --disable-font-rescaling could help with PDFs.
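If the tool shells out to Calibre's `ebook-convert`, which takes an input file, an output file, then options, adding the flag is a one-line change. Below is a sketch of building the argument list. The function and file names are hypothetical. Passing a list to `subprocess.run` avoids shell quoting.

```python
def ebook_convert_args(src: str, dest: str, disable_font_rescaling: bool = True) -> list[str]:
    """Build an argv list for Calibre's ebook-convert."""
    args = ["ebook-convert", src, dest]
    if disable_font_rescaling:
        # Tells Calibre not to rescale font sizes during conversion.
        args.append("--disable-font-rescaling")
    return args
```

Usage would be along the lines of `subprocess.run(ebook_convert_args("book.epub", "book.pdf"), check=True)`.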
I've merged and deployed @Tpt's semaphore patch and removed the block on PDFs, so ebook-convert calls should now be limited to four at once.
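The throttling idea in outline: a counting semaphore with four slots, so a fifth conversion waits until one finishes. This is a Python sketch, not @Tpt's patch, which isn't shown here. `threading.BoundedSemaphore` only coordinates threads within one process. Apache or php-fpm workers are separate processes, so a real server needs a cross-process lock or semaphore instead.

```python
import threading

# Four concurrent conversions, as described above. Single-process only.
MAX_CONCURRENT_CONVERSIONS = 4
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_CONVERSIONS)


def convert_throttled(convert, *args):
    # Blocks until one of the slots is free, then runs `convert`;
    # the slot is released even if `convert` raises.
    with _slots:
        return convert(*args)
```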
Good point about HTMLZ, @Denis_Gagne52; I was also wondering why that format had suddenly become so popular. Do you know what might be causing it? Conversion to HTMLZ will also be throttled now.
The instance is not responding by SSH and prometheus-alerts says it's been down for a day. I dumped its console log, and it shows some failures during boot.
```
Begin: Running /scripts/local-bottom ...
GROWROOT: /sbin/growpart: 824: /sbin/growpart: grep: not found
/sbin/growpart: 853: /sbin/growpart: sed: not found
WARN: unknown label
/sbin/growpart: 354: /sbin/growpart: sed: not found
FAILED: sed failed on dump output
/sbin/growpart: 83: /sbin/growpart: rm: not found
done.
FAILED Failed to listen on SSSD NSS Service responder socket.
FAILED Failed to listen on Service responder private socket.
FAILED Failed to listen on SSD Sudo Service responder socket.
FAILED Failed to listen on SSSD SSH Service responder socket.
```
Then later
[422452.260395] Out of memory: Killed process 1445204 (apache2) total-vm:2053832kB, anon-rss:878592kB, file-rss:4kB, shmem-rss:15864kB, UID:33 pgtables:2024kB oom_score_adj:0
One thing I noticed is that Google is crawling the site, despite the efforts to stop it from doing so.
I confirmed using Google Search Console that we're blocking access to robots.txt, while the actual crawl by the "GoogleOther" user agent is not blocked. I think we should allow access to robots.txt.
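This may explain the crawling. Google's robots.txt documentation says its crawlers treat a robots.txt that returns a 4xx status (other than 429) as if no robots.txt existed, meaning no crawl restrictions. Once robots.txt is reachable, its rules apply. The sketch below checks rules offline with Python's `urllib.robotparser`. The robots.txt contents and URL are hypothetical, since the tool's actual file isn't shown.

```python
from urllib import robotparser

# Hypothetical robots.txt: block every crawler from everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    parser = robotparser.RobotFileParser()
    # parse() takes the file's lines directly, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A `User-agent: *` group covers GoogleOther. In this parser, however, a group naming only `Googlebot` does not match the `GoogleOther` token.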
Data from Grafana and the syslog indicate that the downtime was primarily due to an out-of-memory condition. It was in swap death, or whatever you call the panic swapping that Linux does these days. Look at the timestamps on these log messages:
```
2024-02-21T01:06:40.793764+00:00 wsexport-prod02 kernel: [505035.522390] Tasks state (memory values in pages):
2024-02-21T01:06:43.444104+00:00 wsexport-prod02 kernel: [505035.522391] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
2024-02-21T01:06:44.820534+00:00 wsexport-prod02 kernel: [505035.522397] [ 269] 0 269 5983 334 69632 0 -1000 systemd-udevd
2024-02-21T01:06:46.505921+00:00 wsexport-prod02 kernel: [505035.522400] [ 417] 998 417 4500 299 77824 0 0 systemd-network
2024-02-21T01:06:48.730643+00:00 wsexport-prod02 kernel: [505035.522402] [ 495] 997 495 22514 235 77824 0 0 systemd-timesyn
2024-02-21T01:06:50.617373+00:00 wsexport-prod02 kernel: [505035.522404] [ 501] 101 501 7530 530 69632 0 -900 dbus-daemon
2024-02-21T01:06:52.081834+00:00 wsexport-prod02 kernel: [505035.522406] [ 504] 107 504 4958 293 77824 0 0 lldpd
2024-02-21T01:06:53.551169+00:00 wsexport-prod02 kernel: [505035.522408] [ 505] 106 505 439663 6913 442368 0 0 prometheus-node
2024-02-21T01:06:55.545466+00:00 wsexport-prod02 kernel: [505035.522410] [ 508] 0 508 111125 2731 131072 0 0 rsyslogd
2024-02-21T01:06:57.412726+00:00 wsexport-prod02 kernel: [505035.522412] [ 515] 0 515 8515 555 106496 0 0 sssd
2024-02-21T01:06:58.599984+00:00 wsexport-prod02 kernel: [505035.522413] [ 542] 107 542 4958 292 65536 0 0 lldpd
2024-02-21T01:07:00.592946+00:00 wsexport-prod02 kernel: [505035.522415] [ 548] 0 548 3883 329 69632 0 -1000 sshd
2024-02-21T01:07:02.299532+00:00 wsexport-prod02 kernel: [505035.522417] [ 551] 995 551 79902 847 114688 0 0 polkitd
2024-02-21T01:07:03.644227+00:00 wsexport-prod02 kernel: [505035.522419] [ 559] 0 559 56183 3253 184320 0 0 apache2
2024-02-21T01:07:04.934097+00:00 wsexport-prod02 kernel: [505035.522420] [ 590] 0 590 9500 1210 110592 0 0 sssd_pam
2024-02-21T01:07:05.945739+00:00 wsexport-prod02 kernel: [505035.522422] [ 591] 0 591 8293 441 98304 0 0 sssd_ssh
2024-02-21T01:07:07.375760+00:00 wsexport-prod02 kernel: [505035.522424] [ 592] 0 592 8675 840 102400 0 0 sssd_sudo
2024-02-21T01:07:08.674805+00:00 wsexport-prod02 kernel: [505035.522425] [ 595] 0 595 6475 274 77824 0 0 systemd-logind
2024-02-21T01:07:10.785148+00:00 wsexport-prod02 kernel: [505035.522427] [ 602] 0 602 723 22 40960 0 0 agetty
2024-02-21T01:07:11.974763+00:00 wsexport-prod02 kernel: [505035.522429] [ 603] 0 603 4458 147 57344 0 0 login
2024-02-21T01:07:13.829025+00:00 wsexport-prod02 kernel: [505035.522431] [ 609] 0 609 7257 2076 94208 0 0 unattended-upgr
2024-02-21T01:07:15.336010+00:00 wsexport-prod02 kernel: [505035.522432] [ 658] 0 658 4683 379 77824 0 100 systemd
2024-02-21T01:07:16.637250+00:00 wsexport-prod02 kernel: [505035.522434] [ 674] 0 674 1414 391 57344 0 0 bash
2024-02-21T01:07:18.129040+00:00 wsexport-prod02 kernel: [505035.522436] [1495723] 0 1495723 10359 288 102400 0 -250 systemd-journal
2024-02-21T01:07:19.942141+00:00 wsexport-prod02 kernel: [505035.522438] [1496337] 0 1496337 344422 3402 204800 0 0 prometheus-rsys
2024-02-21T01:07:21.789299+00:00 wsexport-prod02 kernel: [505035.522442] [1497644] 33 1497644 261636 52877 675840 0 0 apache2
2024-02-21T01:07:22.951738+00:00 wsexport-prod02 kernel: [505035.522444] [1498124] 33 1498124 244740 52335 667648 0 0 apache2
2024-02-21T01:07:24.091859+00:00 wsexport-prod02 kernel: [505035.522447] [1498292] 33 1498292 244741 52630 667648 0 0 apache2
2024-02-21T01:07:25.792442+00:00 wsexport-prod02 kernel: [505035.522449] [1499087] 33 1499087 228356 52638 663552 0 0 apache2
2024-02-21T01:07:27.611843+00:00 wsexport-prod02 kernel: [505035.522451] [1499319] 33 1499319 260921 52731 671744 0 0 apache2
2024-02-21T01:07:28.860919+00:00 wsexport-prod02 kernel: [505035.522452] [1499630] 33 1499630 243728 51907 659456 0 0 apache2
2024-02-21T01:07:30.861453+00:00 wsexport-prod02 kernel: [505035.522456] [1499782] 33 1499782 326148 52656 688128 0 0 apache2
2024-02-21T01:07:32.102094+00:00 wsexport-prod02 kernel: [505035.522460] [1499941] 33 1499941 228868 53139 667648 0 0 apache2
2024-02-21T01:07:33.715219+00:00 wsexport-prod02 kernel: [505035.522462] [1500441] 33 1500441 277509 52640 684032 0 0 apache2
2024-02-21T01:07:35.785600+00:00 wsexport-prod02 kernel: [505035.522466] [1500829] 33 1500829 228356 52627 663552 0 0 apache2
2024-02-21T01:07:38.237898+00:00 wsexport-prod02 kernel: [505035.522470] [1501168] 33 1501168 261113 52635 675840 0 0 apache2
2024-02-21T01:07:39.959260+00:00 wsexport-prod02 kernel: [505035.522472] [1501322] 33 1501322 228869 53141 667648 0 0 apache2
2024-02-21T01:07:41.712915+00:00 wsexport-prod02 kernel: [505035.522474] [1501563] 33 1501563 227848 52313 659456 0 0 apache2
2024-02-21T01:07:42.960290+00:00 wsexport-prod02 kernel: [505035.522476] [1501740] 33 1501740 228356 52629 663552 0 0 apache2
2024-02-21T01:07:44.530280+00:00 wsexport-prod02 kernel: [505035.522480] [1502218] 33 1502218 228868 53138 667648 0 0 apache2
2024-02-21T01:07:46.870584+00:00 wsexport-prod02 kernel: [505035.522481] [1502477] 33 1502477 276997 52656 679936 0 0 apache2
2024-02-21T01:07:48.549924+00:00 wsexport-prod02 kernel: [505035.522484] [1502712] 33 1502712 261124 52631 675840 0 0 apache2
2024-02-21T01:07:50.528512+00:00 wsexport-prod02 kernel: [505035.522486] [1502858] 33 1502858 244740 52640 667648 0 0 apache2
2024-02-21T01:07:51.778460+00:00 wsexport-prod02 kernel: [505035.522488] [1503026] 33 1503026 261124 52655 692224 0 0 apache2
2024-02-21T01:07:53.812760+00:00 wsexport-prod02 kernel: [505035.522490] [1503518] 33 1503518 228868 53146 667648 0 0 apache2
2024-02-21T01:07:55.522776+00:00 wsexport-prod02 kernel: [505035.522492] [1503858] 33 1503858 260612 52129 671744 0 0 apache2
2024-02-21T01:07:57.726343+00:00 wsexport-prod02 kernel: [505035.522493] [1504003] 33 1504003 278021 53156 684032 0 0 apache2
2024-02-21T01:07:59.849291+00:00 wsexport-prod02 kernel: [505035.522495] [1504146] 33 1504146 310777 53357 692224 0 0 apache2
2024-02-21T01:08:02.065346+00:00 wsexport-prod02 kernel: [505035.522498] [1504313] 33 1504313 228869 52879 667648 0 0 apache2
2024-02-21T01:08:03.865581+00:00 wsexport-prod02 kernel: [505035.522500] [1504770] 33 1504770 244740 52339 667648 0 0 apache2
2024-02-21T01:08:05.924014+00:00 wsexport-prod02 kernel: [505035.522502] [1505206] 33 1505206 227653 51635 663552 0 0 apache2
2024-02-21T01:08:07.831493+00:00 wsexport-prod02 kernel: [505035.522503] [1505540] 33 1505540 211972 52321 659456 0 0 apache2
2024-02-21T01:08:09.636051+00:00 wsexport-prod02 kernel: [505035.522505] [1505684] 33 1505684 228868 52845 667648 0 0 apache2
2024-02-21T01:08:11.428323+00:00 wsexport-prod02 kernel: [505035.522506] [1505838] 33 1505838 228868 52844 667648 0 0 apache2
2024-02-21T01:08:13.301684+00:00 wsexport-prod02 kernel: [505035.522511] [1506529] 33 1506529 228356 52330 663552 0 0 apache2
2024-02-21T01:08:15.692589+00:00 wsexport-prod02 kernel: [505035.522513] [1506671] 33 1506671 228165 52447 663552 0 0 apache2
2024-02-21T01:08:17.408311+00:00 wsexport-prod02 kernel: [505035.522514] [1506809] 33 1506809 211781 52009 659456 0 0 apache2
2024-02-21T01:08:19.747583+00:00 wsexport-prod02 kernel: [505035.522516] [1507140] 33 1507140 228356 52506 663552 0 0 apache2
2024-02-21T01:08:22.550883+00:00 wsexport-prod02 kernel: [505035.522520] [1507285] 33 1507285 260926 52340 671744 0 0 apache2
2024-02-21T01:08:24.848754+00:00 wsexport-prod02 kernel: [505035.522522] [1507582] 33 1507582 244740 52208 667648 0 0 apache2
2024-02-21T01:08:27.052923+00:00 wsexport-prod02 kernel: [505035.522524] [1508267] 33 1508267 228869 52716 667648 0 0 apache2
2024-02-21T01:08:28.836714+00:00 wsexport-prod02 kernel: [505035.522525] [1508612] 33 1508612 228356 52196 667648 0 0 apache2
2024-02-21T01:08:31.604261+00:00 wsexport-prod02 kernel: [505035.522527] [1509049] 33 1509049 261124 52202 671744 0 0 apache2
2024-02-21T01:08:33.079865+00:00 wsexport-prod02 kernel: [505035.522530] [1509259] 33 1509259 103570 31936 442368 0 0 apache2
2024-02-21T01:08:36.015444+00:00 wsexport-prod02 kernel: [505035.522532] [1509691] 33 1509691 81135 9555 286720 0 0 apache2
2024-02-21T01:08:38.416952+00:00 wsexport-prod02 kernel: [505035.522533] [1511413] 33 1511413 77502 5688 229376 0 0 apache2
2024-02-21T01:08:40.143473+00:00 wsexport-prod02 kernel: [505035.522535] [1511840] 33 1511840 77500 5071 217088 0 0 apache2
2024-02-21T01:08:42.577363+00:00 wsexport-prod02 kernel: [505035.522537] [1513752] 33 1513752 77500 5684 229376 0 0 apache2
2024-02-21T01:08:44.090978+00:00 wsexport-prod02 kernel: [505035.522539] [1513960] 33 1513960 77500 5282 225280 0 0 apache2
2024-02-21T01:08:45.608731+00:00 wsexport-prod02 kernel: [505035.522544] [1514176] 33 1514176 77500 5113 225280 0 0 apache2
2024-02-21T01:08:47.183089+00:00 wsexport-prod02 kernel: [505035.522546] [1514395] 33 1514395 59063 4969 204800 0 0 apache2
2024-02-21T01:08:49.801012+00:00 wsexport-prod02 kernel: [505035.522547] [1514602] 33 1514602 59063 4405 204800 0 0 apache2
2024-02-21T01:08:51.990212+00:00 wsexport-prod02 kernel: [505035.522549] [1514819] 33 1514819 59063 4406 204800 0 0 apache2
2024-02-21T01:08:54.091577+00:00 wsexport-prod02 kernel: [505035.522550] [1515595] 33 1515595 59063 4406 204800 0 0 apache2
2024-02-21T01:08:56.232447+00:00 wsexport-prod02 kernel: [505035.522552] [1515814] 33 1515814 59063 4406 204800 0 0 apache2
2024-02-21T01:08:58.869767+00:00 wsexport-prod02 kernel: [505035.522556] [1516027] 33 1516027 59063 4406 204800 0 0 apache2
2024-02-21T01:09:00.229466+00:00 wsexport-prod02 kernel: [505035.522557] [1516244] 33 1516244 59063 4405 204800 0 0 apache2
2024-02-21T01:09:02.071304+00:00 wsexport-prod02 kernel: [505035.522559] [1516454] 33 1516454 59063 4405 204800 0 0 apache2
2024-02-21T01:09:04.476558+00:00 wsexport-prod02 kernel: [505035.522561] [1516682] 33 1516682 59063 4405 204800 0 0 apache2
2024-02-21T01:09:06.408662+00:00 wsexport-prod02 kernel: [505035.522563] [1516888] 33 1516888 59065 4407 204800 0 0 apache2
2024-02-21T01:09:08.114316+00:00 wsexport-prod02 kernel: [505035.522564] [1517103] 33 1517103 59063 4406 204800 0 0 apache2
2024-02-21T01:09:09.826114+00:00 wsexport-prod02 kernel: [505035.522568] [1517321] 33 1517321 59109 5802 212992 0 0 apache2
2024-02-21T01:09:11.859289+00:00 wsexport-prod02 kernel: [505035.522570] [1517533] 33 1517533 59065 4406 204800 0 0 apache2
2024-02-21T01:09:13.533954+00:00 wsexport-prod02 kernel: [505035.522571] [1517943] 33 1517943 59063 4405 204800 0 0 apache2
2024-02-21T01:09:14.897073+00:00 wsexport-prod02 kernel: [505035.522573] [1518159] 33 1518159 59063 4406 204800 0 0 apache2
2024-02-21T01:09:17.185731+00:00 wsexport-prod02 kernel: [505035.522575] [1518368] 33 1518368 59065 4406 204800 0 0 apache2
2024-02-21T01:09:19.315331+00:00 wsexport-prod02 kernel: [505035.522576] [1518594] 33 1518594 59063 4406 204800 0 0 apache2
2024-02-21T01:09:22.117960+00:00 wsexport-prod02 kernel: [505035.522578] [1518810] 33 1518810 59063 4405 204800 0 0 apache2
2024-02-21T01:09:23.819382+00:00 wsexport-prod02 kernel: [505035.522580] [1519018] 33 1519018 59065 4406 204800 0 0 apache2
2024-02-21T01:09:26.101424+00:00 wsexport-prod02 kernel: [505035.522581] [1519235] 33 1519235 59065 4409 204800 0 0 apache2
2024-02-21T01:09:27.956194+00:00 wsexport-prod02 kernel: [505035.522583] [1519855] 33 1519855 59065 4408 204800 0 0 apache2
2024-02-21T01:09:29.843908+00:00 wsexport-prod02 kernel: [505035.522585] [1520223] 33 1520223 59063 4405 204800 0 0 apache2
2024-02-21T01:09:31.768784+00:00 wsexport-prod02 kernel: [505035.522586] [1520546] 33 1520546 59065 4407 204800 0 0 apache2
2024-02-21T01:09:33.322841+00:00 wsexport-prod02 kernel: [505035.522588] [1520762] 33 1520762 59065 4409 204800 0 0 apache2
2024-02-21T01:09:36.085902+00:00 wsexport-prod02 kernel: [505035.522592] [1520969] 33 1520969 59063 4405 204800 0 0 apache2
2024-02-21T01:09:37.990519+00:00 wsexport-prod02 kernel: [505035.522594] [1521188] 33 1521188 59063 4405 204800 0 0 apache2
2024-02-21T01:09:40.187977+00:00 wsexport-prod02 kernel: [505035.522595] [1521404] 33 1521404 59063 4405 204800 0 0 apache2
2024-02-21T01:09:42.987700+00:00 wsexport-prod02 kernel: [505035.522597] [1521615] 33 1521615 59063 4405 204800 0 0 apache2
2024-02-21T01:09:44.670900+00:00 wsexport-prod02 kernel: [505035.522598] [1521832] 33 1521832 59063 4406 204800 0 0 apache2
2024-02-21T01:09:46.369386+00:00 wsexport-prod02 kernel: [505035.522600] [1522047] 33 1522047 59063 4406 204800 0 0 apache2
2024-02-21T01:09:48.005094+00:00 wsexport-prod02 kernel: [505035.522602] [1522459] 33 1522459 77736 5959 229376 0 0 apache2
2024-02-21T01:09:50.008466+00:00 wsexport-prod02 kernel: [505035.522604] [1522671] 33 1522671 59063 4405 204800 0 0 apache2
2024-02-21T01:09:52.129037+00:00 wsexport-prod02 kernel: [505035.522605] [1522888] 33 1522888 59063 4406 204800 0 0 apache2
2024-02-21T01:09:54.278183+00:00 wsexport-prod02 kernel: [505035.522607] [1523102] 33 1523102 59065 4406 204800 0 0 apache2
2024-02-21T01:09:56.013956+00:00 wsexport-prod02 kernel: [505035.522608] [1523313] 33 1523313 59063 4405 204800 0 0 apache2
2024-02-21T01:09:57.534805+00:00 wsexport-prod02 kernel: [505035.522610] [1523530] 33 1523530 59063 4405 204800 0 0 apache2
2024-02-21T01:10:00.450857+00:00 wsexport-prod02 kernel: [505035.522611] [1523746] 33 1523746 59063 4406 204800 0 0 apache2
2024-02-21T01:10:02.324503+00:00 wsexport-prod02 kernel: [505035.522613] [1523777] 33 1523777 59063 4405 204800 0 0 apache2
2024-02-21T01:10:02.979753+00:00 wsexport-prod02 kernel: [505035.522615] [1524563] 33 1524563 59065 4406 204800 0 0 apache2
2024-02-21T01:10:04.798130+00:00 wsexport-prod02 kernel: [505035.522619] [1524770] 33 1524770 59063 4406 204800 0 0 apache2
2024-02-21T01:10:07.131847+00:00 wsexport-prod02 kernel: [505035.522621] [1524988] 33 1524988 59063 4406 204800 0 0 apache2
2024-02-21T01:10:08.625830+00:00 wsexport-prod02 kernel: [505035.522623] [1525204] 33 1525204 59063 4406 204800 0 0 apache2
2024-02-21T01:10:10.775688+00:00 wsexport-prod02 kernel: [505035.522624] [1525414] 33 1525414 77767 6475 229376 0 0 apache2
2024-02-21T01:10:12.406934+00:00 wsexport-prod02 kernel: [505035.522626] [1525630] 33 1525630 59063 4405 204800 0 0 apache2
2024-02-21T01:10:14.348056+00:00 wsexport-prod02 kernel: [505035.522627] [1525851] 33 1525851 59065 4407 204800 0 0 apache2
2024-02-21T01:10:16.432095+00:00 wsexport-prod02 kernel: [505035.522629] [1526059] 33 1526059 59065 4407 204800 0 0 apache2
2024-02-21T01:10:18.038547+00:00 wsexport-prod02 kernel: [505035.522630] [1526277] 33 1526277 59063 4406 204800 0 0 apache2
2024-02-21T01:10:19.392591+00:00 wsexport-prod02 kernel: [505035.522632] [1526491] 33 1526491 59065 4407 204800 0 0 apache2
2024-02-21T01:10:22.004137+00:00 wsexport-prod02 kernel: [505035.522634] [1526901] 33 1526901 59063 4405 204800 0 0 apache2
2024-02-21T01:10:23.565845+00:00 wsexport-prod02 kernel: [505035.522635] [1527111] 33 1527111 59063 4406 204800 0 0 apache2
2024-02-21T01:10:25.422057+00:00 wsexport-prod02 kernel: [505035.522637] [1527325] 33 1527325 77736 5967 229376 0 0 apache2
2024-02-21T01:10:27.445012+00:00 wsexport-prod02 kernel: [505035.522638] [1527543] 33 1527543 59063 4405 204800 0 0 apache2
2024-02-21T01:10:28.541553+00:00 wsexport-prod02 kernel: [505035.522640] [1527752] 33 1527752 59063 4406 204800 0 0 apache2
2024-02-21T01:10:31.289039+00:00 wsexport-prod02 kernel: [505035.522641] [1527969] 33 1527969 59063 4406 204800 0 0 apache2
2024-02-21T01:10:33.246344+00:00 wsexport-prod02 kernel: [505035.522642] [1528182] 33 1528182 59063 4405 204800 0 0 apache2
2024-02-21T01:10:34.746986+00:00 wsexport-prod02 kernel: [505035.522644] [1528700] 33 1528700 59063 4406 204800 0 0 apache2
2024-02-21T01:10:36.511488+00:00 wsexport-prod02 kernel: [505035.522645] [1529168] 33 1529168 59063 4405 204800 0 0 apache2
2024-02-21T01:10:38.370518+00:00 wsexport-prod02 kernel: [505035.522647] [1529383] 33 1529383 78926 8367 262144 0 0 apache2
2024-02-21T01:10:40.165795+00:00 wsexport-prod02 kernel: [505035.522648] [1529604] 33 1529604 59063 4405 204800 0 0 apache2
2024-02-21T01:10:42.092809+00:00 wsexport-prod02 kernel: [505035.522650] [1529821] 33 1529821 59063 4406 204800 0 0 apache2
2024-02-21T01:10:43.377917+00:00 wsexport-prod02 kernel: [505035.522652] [1530047] 33 1530047 136102 9989 303104 0 0 apache2
2024-02-21T01:10:45.575691+00:00 wsexport-prod02 kernel: [505035.522654] [1530267] 33 1530267 59063 4405 204800 0 0 apache2
2024-02-21T01:10:47.258033+00:00 wsexport-prod02 kernel: [505035.522662] [1530482] 33 1530482 59063 4406 204800 0 0 apache2
2024-02-21T01:10:50.412334+00:00 wsexport-prod02 kernel: [505035.522664] [1530698] 33 1530698 59063 4406 204800 0 0 apache2
2024-02-21T01:10:51.785146+00:00 wsexport-prod02 kernel: [505035.522665] [1530909] 33 1530909 59063 4405 204800 0 0 apache2
2024-02-21T01:10:55.205901+00:00 wsexport-prod02 kernel: [505035.522667] [1531318] 33 1531318 59065 4406 204800 0 0 apache2
2024-02-21T01:10:56.926600+00:00 wsexport-prod02 kernel: [505035.522669] [1531534] 33 1531534 59063 4405 204800 0 0 apache2
2024-02-21T01:10:59.039554+00:00 wsexport-prod02 kernel: [505035.522670] [1531742] 33 1531742 78887 8040 262144 0 0 apache2
2024-02-21T01:11:01.252741+00:00 wsexport-prod02 kernel: [505035.522672] [1531962] 33 1531962 78887 7582 262144 0 0 apache2
2024-02-21T01:11:03.045591+00:00 wsexport-prod02 kernel: [505035.522673] [1532192] 33 1532192 59063 4405 204800 0 0 apache2
2024-02-21T01:11:05.074182+00:00 wsexport-prod02 kernel: [505035.522675] [1532399] 33 1532399 59063 4406 204800 0 0 apache2
2024-02-21T01:11:07.280543+00:00 wsexport-prod02 kernel: [505035.522678] [1532623] 33 1532623 78887 7582 262144 0 0 apache2
2024-02-21T01:11:09.655732+00:00 wsexport-prod02 kernel: [505035.522680] [1532903] 33 1532903 59065 4406 204800 0 0 apache2
2024-02-21T01:11:11.160238+00:00 wsexport-prod02 kernel: [505035.522681] [1533672] 33 1533672 59063 4406 204800 0 0 apache2
2024-02-21T01:11:12.823341+00:00 wsexport-prod02 kernel: [505035.522683] [1533893] 33 1533893 77816 6220 221184 0 0 apache2
2024-02-21T01:11:14.342816+00:00 wsexport-prod02 kernel: [505035.522685] [1534105] 33 1534105 59063 4406 204800 0 0 apache2
2024-02-21T01:11:15.718870+00:00 wsexport-prod02 kernel: [505035.522686] [1534322] 33 1534322 77736 6433 229376 0 0 apache2
2024-02-21T01:11:17.096069+00:00 wsexport-prod02 kernel: [505035.522688] [1534540] 33 1534540 59065 4406 204800 0 0 apache2
2024-02-21T01:11:18.239012+00:00 wsexport-prod02 kernel: [505035.522689] [1534748] 33 1534748 59063 4406 204800 0 0 apache2
2024-02-21T01:11:20.296357+00:00 wsexport-prod02 kernel: [505035.522691] [1534964] 33 1534964 59063 4406 204800 0 0 apache2
2024-02-21T01:11:22.532454+00:00 wsexport-prod02 kernel: [505035.522693] [1535180] 33 1535180 59063 4405 204800 0 0 apache2
2024-02-21T01:11:24.570047+00:00 wsexport-prod02 kernel: [505035.522694] [1535389] 33 1535389 59063 4406 204800 0 0 apache2
2024-02-21T01:11:26.118457+00:00 wsexport-prod02 kernel: [505035.522696] [1535797] 33 1535797 78887 8040 262144 0 0 apache2
2024-02-21T01:11:28.430256+00:00 wsexport-prod02 kernel: [505035.522697] [1536020] 33 1536020 59063 4406 204800 0 0 apache2
2024-02-21T01:11:30.830034+00:00 wsexport-prod02 kernel: [505035.522699] [1536233] 33 1536233 59063 4406 204800 0 0 apache2
2024-02-21T01:11:33.232261+00:00 wsexport-prod02 kernel: [505035.522700] [1536449] 33 1536449 59063 4405 204800 0 0 apache2
2024-02-21T01:11:35.181393+00:00 wsexport-prod02 kernel: [505035.522702] [1536663] 33 1536663 59063 4406 204800 0 0 apache2
2024-02-21T01:11:37.667488+00:00 wsexport-prod02 kernel: [505035.522704] [1536871] 33 1536871 59065 4406 204800 0 0 apache2
2024-02-21T01:11:39.336359+00:00 wsexport-prod02 kernel: [505035.522705] [1537096] 33 1537096 59063 4405 204800 0 0 apache2
2024-02-21T01:11:40.896462+00:00 wsexport-prod02 kernel: [505035.522707] [1537312] 33 1537312 78588 7253 266240 0 0 apache2
2024-02-21T01:11:42.633230+00:00 wsexport-prod02 kernel: [505035.522708] [1538148] 33 1538148 59065 4409 204800 0 0 apache2
2024-02-21T01:11:44.499293+00:00 wsexport-prod02 kernel: [505035.522710] [1538373] 33 1538373 59063 4405 204800 0 0 apache2
2024-02-21T01:11:46.039475+00:00 wsexport-prod02 kernel: [505035.522711] [1541268] 105 1541268 9266 2057 102400 0 0 exim4
2024-02-21T01:11:48.255432+00:00 wsexport-prod02 kernel: [505035.522714] [1549391] 0 1549391 644 23 40960 0 0 sessionclean
2024-02-21T01:11:50.378310+00:00 wsexport-prod02 kernel: [505035.522715] [1549398] 0 1549398 644 33 40960 0 0 sessionclean
2024-02-21T01:11:53.634047+00:00 wsexport-prod02 kernel: [505035.522718] [1549400] 0 1549400 3849 25 45056 0 0 sort
2024-02-21T01:11:56.071626+00:00 wsexport-prod02 kernel: [505035.522722] [1549401] 0 1549401 3849 26 49152 0 0 sort
2024-02-21T01:11:58.387673+00:00 wsexport-prod02 kernel: [505035.522724] [1549402] 0 1549402 644 29 40960 0 0 sessionclean
2024-02-21T01:12:01.528141+00:00 wsexport-prod02 kernel: [505035.522725] [1549528] 0 1549528 1086 55 45056 0 0 puppet-run
2024-02-21T01:12:03.522898+00:00 wsexport-prod02 kernel: [505035.522727] [1549529] 0 1549529 19470 2717 184320 0 0 puppet
2024-02-21T01:12:05.923978+00:00 wsexport-prod02 kernel: [505035.522729] [1549567] 33 1549567 59063 4855 204800 0 0 apache2
2024-02-21T01:12:08.023849+00:00 wsexport-prod02 kernel: [505035.522730] [1549742] 33 1549742 59067 5076 212992 0 0 apache2
2024-02-21T01:12:09.683414+00:00 wsexport-prod02 kernel: [505035.522732] [1549744] 33 1549744 59067 4979 212992 0 0 apache2
2024-02-21T01:12:12.501534+00:00 wsexport-prod02 kernel: [505035.522734] [1549776] 33 1549776 59063 4855 204800 0 0 apache2
2024-02-21T01:12:15.619380+00:00 wsexport-prod02 kernel: [505035.522735] [1549783] 33 1549783 59063 4855 204800 0 0 apache2
2024-02-21T01:12:16.971651+00:00 wsexport-prod02 kernel: [505035.522737] [1549784] 33 1549784 59063 4855 204800 0 0 apache2
2024-02-21T01:12:18.884153+00:00 wsexport-prod02 kernel: [505035.522739] [1549823] 33 1549823 59067 5077 212992 0 0 apache2
2024-02-21T01:12:22.812117+00:00 wsexport-prod02 kernel: [505035.522741] [1549862] 33 1549862 59067 4316 208896 0 0 apache2
2024-02-21T01:12:24.322130+00:00 wsexport-prod02 kernel: [505035.522742] [1549883] 33 1549883 59067 4859 204800 0 0 apache2
2024-02-21T01:12:26.757584+00:00 wsexport-prod02 kernel: [505035.522744] [1549896] 33 1549896 59069 4323 208896 0 0 apache2
2024-02-21T01:12:30.755541+00:00 wsexport-prod02 kernel: [505035.522746] [1549918] 33 1549918 59067 4356 208896 0 0 apache2
2024-02-21T01:12:32.256911+00:00 wsexport-prod02 kernel: [505035.522748] [1549928] 0 1549928 5493 349 81920 0 0 sssd_be
2024-02-21T01:12:33.687008+00:00 wsexport-prod02 kernel: [505035.522749] [1549949] 33 1549949 59069 4260 204800 0 0 apache2
2024-02-21T01:12:35.269794+00:00 wsexport-prod02 kernel: [505035.522751] [1549970] 33 1549970 59067 3928 200704 0 0 apache2
2024-02-21T01:12:36.625806+00:00 wsexport-prod02 kernel: [505035.522752] [1549987] 0 1549987 1193 38 45056 0 0 find
2024-02-21T01:12:39.154564+00:00 wsexport-prod02 kernel: [505035.522754] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=systemd,pid=658,uid=0
2024-02-21T01:12:41.503955+00:00 wsexport-prod02 kernel: [505035.522791] Out of memory: Killed process 658 (systemd) total-vm:18732kB, anon-rss:1516kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:100
```
The messages hit the kernel ring buffer quickly (their kernel timestamps span less than a millisecond), but it took 6 minutes for rsyslogd to receive them and put timestamps on them. And then does it kill some of the 150 Apache workers? No, it kills systemd. Good job, Linux.
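The six-minute lag can be read straight off the two ends of the report. Here is a small sketch using the first and last rsyslog timestamps copied from the log. The function name is illustrative.

```python
from datetime import datetime


def syslog_lag_seconds(first_ts: str, last_ts: str) -> float:
    """Seconds between the first and last rsyslog timestamps of one OOM report.

    Every line of the report carries kernel time ~505035.52 s, so this gap
    is delay introduced after the kernel emitted the messages."""
    start = datetime.fromisoformat(first_ts)
    end = datetime.fromisoformat(last_ts)
    return (end - start).total_seconds()
```

For this report the gap is just over 360 seconds. The kernel's own timestamps (505035.522390 to 505035.522791) span about 0.4 ms.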
Mentioned in SAL (#wikimedia-cloud) [2024-02-22T07:20:46Z] <TimStarling> on wsexport-prod02 reduced MaxRequestWorkers to 20 T335553
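The task dump above reports `rss` in pages. Assuming the usual 4 KiB page size, the busy apache2 workers (about 52,000 pages each) were roughly 206 MiB apiece, compared with about 17 MiB for an idle one (about 4,400 pages). Below is a sketch of that conversion and of a worker-count estimate. The memory budget used in the example is hypothetical, because the VM's RAM isn't stated here. Also, summing RSS overcounts pages shared between workers.

```python
PAGE_SIZE = 4096  # bytes; assumes the usual 4 KiB pages on x86-64


def rss_mib(rss_pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert an rss value from the OOM task dump (in pages) to MiB."""
    return rss_pages * page_size / 2**20


def max_workers_for(budget_mib: float, per_worker_mib: float) -> int:
    """How many workers of a given size fit in a memory budget."""
    return int(budget_mib // per_worker_mib)
```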
Despite reducing MaxRequestWorkers, there were three more instances of downtime in the last 4 hours, each time with a drop in available memory to near-zero, a load spike, and missed prometheus reports. No oom-killer and I haven't been able to find any useful related logs.
I found an Apache process with 1.4 GB of RSS remaining after its request had terminated, and several other Apache processes had RSS of over 500 MB. I see that the tool downloads the Parsoid HTML of a page and loads it into a DOMDocument. That memory is allocated by libxml rather than by PHP's own allocator, so it isn't counted against PHP's memory limit; this allows the process to exceed the limit, and memory is not returned to the OS when it is freed. So I set MaxConnectionsPerChild to 1 so that workers exit after each request is served.
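With MaxConnectionsPerChild set to 1, every worker process handles one request and is then replaced. Whatever memory the allocator kept is returned to the OS when the process exits. Python's `multiprocessing.Pool` has the same knob, `maxtasksperchild`. Here is a sketch of the idea. It is not the tool's code, and the `fork` start method makes it POSIX-only.

```python
import multiprocessing
import os


def handle_request(mib: int) -> int:
    # Stand-in for a memory-hungry request; the allocation is released
    # to the OS when this worker process exits.
    _buffer = bytearray(mib * 1024 * 1024)
    return os.getpid()


def serve(requests, workers: int = 2) -> list[int]:
    # maxtasksperchild=1 replaces each worker after a single task, the same
    # idea as Apache's MaxConnectionsPerChild 1.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=workers, maxtasksperchild=1) as pool:
        return pool.map(handle_request, requests, chunksize=1)
```

Each request is served by a different process ID. The cost is a fork per request, which is cheap next to a PDF conversion.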
Reducing MaxConnectionsPerChild was definitely very helpful in turning the RAM available graph from a scary sawtooth to a more peaceful looking kelp forest. However, available memory dropped by 1.6GB 36 hours ago and did not recover.
With no wall clock time limit, requests can run for as long as they like, doing HTTP client requests.
```
(gdb) source /root/php8.2-8.2.7/.gdbinit
(gdb) zbacktrace
[0x7f2f53619e60] curl_multi_select(object[0x7f2f53619eb0], 1) [internal function]
[0x7f2f53619db0] GuzzleHttp\Handler\CurlMultiHandler->tick() /var/www/tool/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php:165
[0x7f2f53619d30] GuzzleHttp\Handler\CurlMultiHandler->execute(object[0x7f2f53619d80]) /var/www/tool/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php:189
[0x7f2f53619c90] GuzzleHttp\Promise\Promise->invokeWaitFn() /var/www/tool/vendor/guzzlehttp/promises/src/Promise.php:251
[0x7f2f53619c10] GuzzleHttp\Promise\Promise->waitIfPending() /var/www/tool/vendor/guzzlehttp/promises/src/Promise.php:227
[0x7f2f53619b80] GuzzleHttp\Promise\Promise->invokeWaitList() /var/www/tool/vendor/guzzlehttp/promises/src/Promise.php:272
[0x7f2f53619b00] GuzzleHttp\Promise\Promise->waitIfPending() /var/www/tool/vendor/guzzlehttp/promises/src/Promise.php:229
[0x7f2f53619a70] GuzzleHttp\Promise\Promise->wait() /var/www/tool/vendor/guzzlehttp/promises/src/Promise.php:69
[0x7f2f53619950] App\BookProvider->getPages(array(1233)[0x7f2f536199a0]) /var/www/tool/src/BookProvider.php:203
[0x7f2f53619780] App\BookProvider->getMetadata("\37777777730\37777777652\37777777730\37777777660\37777777731\37777777603\37777777730\37777777661\37777777730\37777777651_\37777777730\37777777647\37777777731\37777777604\37777777730\37777777655\37777777731\37777777601\37777777730\37777777647\37777777730\37777777670", false, object[0x7f2f536197f0]) /var/www/tool/src/BookProvider.php:131
[0x7f2f536196f0] App\BookProvider->get("\37777777730\37777777652\37777777730\37777777660\37777777731\37777777603\37777777730\37777777661\37777777730\37777777651_\37777777730\37777777647\37777777731\37777777604\37777777730\37777777655\37777777731\37777777601\37777777730\37777777647\37777777730\37777777670") /var/www/tool/src/BookProvider.php:48
[0x7f2f53619670] App\BookCreator->create("\37777777730\37777777652\37777777730\37777777660\37777777731\37777777603\37777777730\37777777661\37777777730\37777777651 \37777777730\37777777647\37777777731\37777777604\37777777730\37777777655\37777777731\37777777601\37777777730\37777777647\37777777730\37777777670") /var/www/tool/src/BookCreator.php:49
[0x7f2f53619500] App\Controller\ExportController->export(object[0x7f2f53619550], object[0x7f2f53619560], object[0x7f2f53619570], object[0x7f2f53619580], object[0x7f2f53619590], object[0x7f2f536195a0]) /var/www/tool/src/Controller/ExportController.php:157
[0x7f2f53619360] App\Controller\ExportController->home(object[0x7f2f536193b0], object[0x7f2f536193c0], object[0x7f2f536193d0], object[0x7f2f536193e0], object[0x7f2f536193f0], object[0x7f2f53619400]) /var/www/tool/src/Controller/ExportController.php:105
[0x7f2f53619280] Symfony\Component\HttpKernel\HttpKernel->handleRaw(object[0x7f2f536192d0], 1) /var/www/tool/vendor/symfony/http-kernel/HttpKernel.php:163
[0x7f2f536191b0] Symfony\Component\HttpKernel\HttpKernel->handle(object[0x7f2f53619200], 1, true) /var/www/tool/vendor/symfony/http-kernel/HttpKernel.php:75
[0x7f2f536190f0] Symfony\Component\HttpKernel\Kernel->handle(object[0x7f2f53619140]) /var/www/tool/vendor/symfony/http-kernel/Kernel.php:202
[0x7f2f53619020] (main) /var/www/tool/public/index.php:33
(gdb) print sapi_globals
$1 = {server_context = 0x7f2f519a7030, request_info = {request_method = 0x7f2f52be25e8 "GET", query_string = 0x7f2f519a7f00 "lang=ar&page=%D8%AA%D8%B0%D9%83%D8%B1%D8%A9+%D8%A7%D9%84%D8%AD%D9%81%D8%A7%D8%B8&format=pdf-a5", cookie_data = 0x0, content_length = 0, ...
```
It's patiently downloading a list of 1233 chapters from Wikisource. If it successfully builds this book, it will cache the result for 1 day.
I migrated the tool to php-fpm and set a wall clock time limit of 120s.
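A php-fpm limit kills a request from outside once it runs too long. The task doesn't say which setting was used; php-fpm's `request_terminate_timeout` is one option. An application can also enforce the same budget itself and fail cleanly. Below is a hypothetical sketch with an injectable clock for testing. `fetch_all`, `DeadlineExceeded` and the `fetch` callback are made up for illustration.

```python
import time


class DeadlineExceeded(Exception):
    pass


def fetch_all(titles, fetch, budget_s: float = 120.0, clock=time.monotonic):
    """Fetch each title, giving up once the whole request has used
    `budget_s` seconds of wall-clock time."""
    deadline = clock() + budget_s
    results = []
    for title in titles:
        remaining = deadline - clock()
        if remaining <= 0:
            raise DeadlineExceeded(f"gave up after {len(results)} of {len(titles)} pages")
        # Each individual fetch gets at most the time left in the budget.
        results.append(fetch(title, timeout=remaining))
    return results
```

With one-second fetches, a 1233-chapter book like the one in the backtrace would stop after 120 pages instead of running indefinitely.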
Now some of my previous test cases are hitting the limit of 1024 open files. I strace'd a request and found that it is indeed trying to open over a thousand concurrent connections to text-lb.eqiad.wikimedia.org. This will need a PR.
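The usual fix is to cap how many requests are in flight rather than firing them all at once. Guzzle's `Pool` class, for example, accepts a `concurrency` option. The same pattern in Python asyncio is shown below, with a hypothetical `fetch` coroutine and a limit of 10 chosen arbitrarily (well under the 1024 open-file limit).

```python
import asyncio


async def fetch_bounded(titles, fetch, limit: int = 10):
    """Fetch all titles with at most `limit` requests in flight at once,
    instead of opening one connection per chapter."""
    sem = asyncio.Semaphore(limit)

    async def one(title):
        async with sem:
            return await fetch(title)

    # gather() returns results in the same order as the input titles.
    return await asyncio.gather(*(one(t) for t in titles))
```

All 1233 chapters still get fetched, but open sockets stay bounded by `limit`.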
The service went down another two times in the last few hours, due to OOMs. So I'm going to try setting a memory limit for the whole of php-fpm, using a systemd unit override file.
I was able to induce an OOM with concurrent requests in order to confirm that my solution is working. Hopefully this will mean the service will stay up now.
After two whole days of uninterrupted uptime, I am declaring this done.
Note followup T358634.