[4 hour spike] Investigate wsexport webservice crashes/restarts
Closed, Resolved · Public

Description

The tail of service.log for this tool:

2019-03-12T21:08:18.862317 No running webservice job found, attempting to start it
2019-03-12T21:32:01.172806 No running webservice job found, attempting to start it
2019-03-12T21:45:33.550991 Throttled for 3 restarts in last 3600 seconds
2019-03-12T21:46:39.976910 Throttled for 3 restarts in last 3600 seconds
2019-03-12T21:47:46.339390 Throttled for 3 restarts in last 3600 seconds
2019-03-12T21:48:56.241504 No running webservice job found, attempting to start it
2019-03-13T03:37:22.164262 No running webservice job found, attempting to start it
2019-03-13T04:21:58.587285 No running webservice job found, attempting to start it
2019-03-13T05:20:42.041824 No running webservice job found, attempting to start it
2019-03-13T05:38:44.908925 No running webservice job found, attempting to start it
2019-03-13T07:37:27.752477 No running webservice job found, attempting to start it
2019-03-13T08:13:12.609819 No running webservice job found, attempting to start it
2019-04-17T11:39:30.323182 No running webservice job found, attempting to start it
2019-04-17T11:39:45.916116 Timed out attempting to start webservice (15s)

As we can see, there used to be quite a lot of problems, but the situation has improved. This needs investigation.

Event Timeline

Another thingie:

2019-04-17 16:39:19: (mod_fastcgi.c.2424) unexpected end-of-file (perhaps the fastcgi process died): pid: 11226 socket: unix:/var/run/lighttpd/php.socket.wsexport-0
2019-04-17 16:39:19: (mod_fastcgi.c.3175) response not received, request sent: 1317 on socket: unix:/var/run/lighttpd/php.socket.wsexport-0 for /wsexport/tool/book.php?lang=fr&format=pdf-a5&page=Armorial_des_principales_maisons_et_familles_du_royaume, closing connection

This, from error.log, looks exactly like a continuation of the problem above; however, it doesn't match any of the service.log errors.

jmatazzoni renamed this task from Investigate wsexport webservice crashes/restarts to [4 hour spike] Investigate wsexport webservice crashes/restarts. Apr 30 2019, 11:23 PM
jmatazzoni moved this task from Backlog to In Progress on the Toolforge board.
jmatazzoni moved this task from Discussion needed to In sprint on the WS Export board.

We're not running the latest code in production at the moment. T222330 needs to be done before that, and I think then we can look at how often it's crashing.

Hey @Samwilson, is that a task you can please turn into a ticket, so we can estimate it today?

@Samwilson, also, are there any other E-Book tasks that you can write up out of this investigation? Or is the one you mention basically a precursor to doing the investigation?

The only error logged in the last ~6 hours (since upgrading to the latest codebase) is:

2019-05-07 08:17:32: (mod_fastcgi.c.2543) FastCGI-stderr: PHP Fatal error: Maximum execution time of 30 seconds exceeded in /mnt/nfs/labstore-secondary-tools-project/wsexport/tool/utils/utils.php on line 124

Which corresponds to this URL from the access log:

https://tools.wmflabs.org/wsexport/tool/book.php?lang=it&format=pdf-a4&page=Pensieri_di_varia_filosofia_e_di_bella_letteratura

and it turns out that https://it.wikisource.org/wiki/Pensieri_di_varia_filosofia_e_di_bella_letteratura has 4,586 subpages — so it's not surprising that we fail to render it.

I'll keep monitoring the logs for the errors in this task's description.

I've also created T222684: Disallow robots from scraping ebooks.

Another thing I'm wondering about is caching of generated epubs and PDFs. Quite a few works seem to be requested more than once: of the last 5,000 access requests, 2,740 distinct works were requested, and 471 of these were requested more than once.

If I've got these commands right:

  • Unique works: tail -n 5000 access.log | grep -oP 'page=([^ ]+)' | cut -c6- | sort | uniq | wc -l
  • Multiple requests: tail -n 5000 access.log | grep -oP 'page=([^ ]+)' | cut -c6- | sort | uniq -c | sort -nr | awk '$1 > 1 {print $0}' | wc -l

The most-requested works were:

878 test
55 Una+and+the+Lion
23 Il+Re+dell%27Aria
18 L%E2%80%99Am%C3%A9rique_au_XVIIIe_si%C3%A8cle_d%27apr%C3%A8s_un_voyageur_fran%C3%A7ais
18 Il_cinque_maggio

Anyway, I'm not really sure how we can invalidate any cached works we might create. Ideas?
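
One option (just a sketch; none of the function or path names below exist in the tool, they're assumptions) would be to key the cache on the work's latest revision ID, fetched from the MediaWiki API, so that an edit to the work produces a new key and the stale file simply stops being looked up:

<?php
// Sketch only: cache generated ebooks keyed on the work's latest revision ID.
// The function names, cache directory, and API usage here are assumptions,
// not the tool's actual code.

function latestRevId(string $lang, string $page): ?int {
    $url = "https://$lang.wikisource.org/w/api.php?action=query&prop=revisions"
        . "&rvprop=ids&format=json&formatversion=2&titles=" . urlencode($page);
    $json = @file_get_contents($url);
    if ($json === false) {
        return null;
    }
    $data = json_decode($json, true);
    return $data['query']['pages'][0]['revisions'][0]['revid'] ?? null;
}

function cachedBookPath(string $lang, string $page, string $format): ?string {
    $revId = latestRevId($lang, $page);
    if ($revId === null) {
        return null; // can't determine the revision, so skip caching
    }
    // The revision ID is part of the key, so an edited work never hits the old file.
    $key = sha1("$lang|$page|$format|$revId");
    return "/data/project/wsexport/cache/$key.$format"; // hypothetical cache directory
}

// Usage: only generate when no cached copy exists for the current revision.
$path = cachedBookPath('it', 'Il_cinque_maggio', 'epub');
if ($path !== null && !file_exists($path)) {
    // ... generate the ebook and write it to $path ...
}

Old files would still need a periodic clean-up (e.g. a cron job deleting anything not touched for a week), but at least correctness wouldn't depend on explicit invalidation.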

Just to expand on the duplicate-request numbers: of the last 10,000 requests (from 06/May/2019:10:42:54 to now), 1,263 of 5,267 works have been requested more than once.

The latest downtime was from 2019-05-09 04:41:45 to 04:44:06 (according to uptimerobot). There's nothing in the error log for this time (the latest was an out-of-memory error at 01:35:41). It looks like there's a large number of requests from SEMrushBot, so I've added that to the list of denied bots.
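
For what it's worth (I don't know exactly how the tool's denied-bots list is implemented, so the names and placement below are assumptions), a server-side user-agent check along these lines would be one way to do it:

<?php
// Minimal sketch of a crawler denylist; the variable names and the placement
// (e.g. at the top of book.php) are assumptions, not the tool's actual code.

$deniedBots = ['SemrushBot']; // add further crawler user-agent substrings as needed

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
foreach ($deniedBots as $bot) {
    if (stripos($userAgent, $bot) !== false) {
        http_response_code(403);
        exit('Automated scraping of generated ebooks is not permitted.');
    }
}

This only catches bots that identify themselves in the User-Agent header, of course.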

The last task to come out of this is T222936: Wikisource Ebooks: Investigate cache generated ebooks [8H]

@jmatazzoni is there any extra information that should be included here?

Niharika moved this task from Product sign-off to Done on the Community-Tech-Sprint board.
Niharika subscribed.

I don't know much about this project, but it seems like I can close this as done.