Page MenuHomePhabricator

Investigate recent increase in downtime (May 2021) [8 HR]
Closed, ResolvedPublicSpike

Description

Since the recent bout of updates to the tool, it has been offline more than it used to be. Last 90 days:

ia-upload_downtime.png (141ร—951 px, 10 KB)

The code changes of interest are probably those between 29 March and 14 April.

Event Timeline

There's a core dump that was created at 2021-05-10 15:54. (The web service was unavailable on that day for 23 minutes from 04:55. The two things are probably not related.)

Core was generated by `djvused -u /mnt/nfs/labstore-secondary-tools-project/ia-upload/ia-upload/jobque'.
Program terminated with signal SIGSEGV, Segmentation fault.

(That djvused command is just truncated; it doesn't look like it could've been malformed like that.)

The access log for the time recently before that was (POST only):

192.168.58.136 ia-upload.toolforge.org - [10/May/2021:13:23:49 +0000] "POST /commons/save HTTP/1.1" 200 3813 "https://ia-upload.toolforge.org/commons/fill?iaId=translationsrepr03pennuoft&commonsName=UPenn-Translations+and+Reprints+New+Series-vol3&format=djvu" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"
192.168.58.136 ia-upload.toolforge.org - [10/May/2021:15:04:19 +0000] "POST /commons/save HTTP/1.1" 200 3812 "https://ia-upload.toolforge.org/commons/fill?iaId=americanjournalo14austuoft&commonsName=The+American+Journal+of+Psychology+Volume+14&format=djvu" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
192.168.8.3 ia-upload.toolforge.org - [10/May/2021:15:07:21 +0000] "POST /commons/save HTTP/1.1" 200 3812 "https://ia-upload.toolforge.org/commons/fill?iaId=americanjournalo15austuoft&commonsName=The+American+Journal+of+Psychology+Volume+15&format=djvu" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
192.168.8.3 ia-upload.toolforge.org - [10/May/2021:15:17:33 +0000] "POST /commons/save HTTP/1.1" 200 3812 "https://ia-upload.toolforge.org/commons/fill?iaId=americanjournalo17austuoft&commonsName=The+American+Journal+of+Psychology+Volume+17&format=djvu" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
192.168.95.8 ia-upload.toolforge.org - [10/May/2021:15:21:49 +0000] "POST /commons/save HTTP/1.1" 302 0 "https://ia-upload.toolforge.org/commons/fill?iaId=lettersofdoctorr00smit&commonsName=Letters+of+Doctor+Richard+Hill+and+His+Children&format=djvu" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
192.168.8.3 ia-upload.toolforge.org - [10/May/2021:18:24:32 +0000] "POST /commons/save HTTP/1.1" 200 3828 "https://ia-upload.toolforge.org/commons/fill?iaId=accountofliverpo00boot&commonsName=An+account+of+the+Liverpool+and+Manchester+Railway&format=djvu" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0"

However, all of those files were successfully uploaded to Commons. Jobs are run every 5 minutes, so it seems unlikely that it'd be from earlier.

I'll keep watching the web service closely and investigate when it next goes down.

MusikAnimal renamed this task from Investigate recent increase in downtime (May 2021) to Investigate recent increase in downtime (May 2021) [8 HR].May 12 2021, 4:08 PM
MusikAnimal added a project: Spike.
Restricted Application changed the subtype of this task from "Task" to "Spike". ยท View Herald TranscriptMay 12 2021, 4:08 PM

The auto-restarting of the web service seems to be helping, it's had 99.951% in the last 7 days.

The most recent long downtime was 1 h, 28 min on May 19 from 04:23 UTC. The error log for that time looks like this:

2021-05-19 04:00:16: (server.c.1464) server started (lighttpd/1.4.53)
2021-05-19 04:49:49: (mod_fastcgi.c.421) FastCGI-stderr: PHP Fatal error:  Maximum execution time of 30 seconds exceeded in /data/project/ia-upload/ia-upload/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php on line 408
2021-05-19 05:19:44: (mod_fastcgi.c.421) FastCGI-stderr: PHP Fatal error:  Maximum execution time of 30 seconds exceeded in /data/project/ia-upload/ia-upload/src/ApiClient/IaClient.php on line 72
2021-05-19 05:48:47: (mod_fastcgi.c.421) FastCGI-stderr: PHP Fatal error:  Maximum execution time of 30 seconds exceeded in /data/project/ia-upload/ia-upload/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php on line 408
2021-05-19 06:18:41: (mod_fastcgi.c.421) FastCGI-stderr: PHP Fatal error:  Maximum execution time of 30 seconds exceeded in /data/project/ia-upload/ia-upload/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php on line 408
2021-05-19 06:49:37: (mod_fastcgi.c.421) FastCGI-stderr: PHP Fatal error:  Maximum execution time of 30 seconds exceeded in /data/project/ia-upload/ia-upload/vendor/guzzlehttp/psr7/src/Stream.php on line 247
2021-05-19 07:16:56: (mod_fastcgi.c.421) FastCGI-stderr: PHP Fatal error:  Maximum execution time of 30 seconds exceeded in /data/project/ia-upload/ia-upload/vendor/addwiki/mediawiki-api-base/src/MediawikiApi.php on line 243
2021-05-19 07:47:51: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  unlink(/data/project/ia-upload/ia-upload/jobqueue/dli.ministry.05500/820.182%20Ac%20940.9.pdf): No such file or directory in /data/project/ia-upload/ia-upload/src/Controller/UploadController.php on line 415
2021-05-19 08:00:16: (server.c.2059) server stopped by UID = 0 PID = 0

We talked about the downtime issue in a CommTech engineering meeting, and think that there's likely to be slowness coming from the large amount of filesystem IO. We could reduce this by two things: moving the job queue to a database; and moving the tool to a VPS (so it's got its own filesystem and doesn't use NFS). This is similar to what we've done for Wikimedia OCR where we saw massive increase of speed just because tesseract could access the files quicker (in the case of IA Upload, it'd be djvused, djvm, djvuxmlparser, etc.).

Moving to a VPS wouldn't be too hard. Is it worth it? (It wouldn't help with the request timeouts to external services, of course, but those are mostly outside the control of the tool).

I think we can resolve this ticket and proceed with moving to a VPS: T284528

dom_walden subscribed.

No code change here, so I will move straight to Done.