
wsexport tool writing output to $HOME/tool/temp puts load on Tool Labs NFS server
Closed, Resolved · Public

Description

From https://github.com/wsexport/tool/issues/127

It would be nicer to write temporary files to the actual /tmp directory rather than to a directory in the tool's $HOME. You may need to write to $HOME eventually for delivery of results via the HTTP service, but doing all of the work on the NFS filesystem is causing high I/O load spikes. See https://phabricator.wikimedia.org/T161898#3180464 for some examination.
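The fix amounts to creating work files in the system temp directory (local disk) instead of under the NFS-mounted $HOME. A minimal sketch in Python (the wsexport tool itself is PHP, where `sys_get_temp_dir()` and `tempnam()` would be the rough equivalents; the prefix and suffix here are made up for illustration):

```python
import os
import tempfile

# mkstemp() creates the file in the system temp dir (normally local
# disk, e.g. /tmp) rather than under the NFS-mounted $HOME.
fd, path = tempfile.mkstemp(prefix="wsexport-", suffix=".epub")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"...generated ebook bytes...")
    # All of the heavy I/O happens on local disk; only the finished
    # file would need to touch NFS (if at all) for delivery.
    assert path.startswith(tempfile.gettempdir())
finally:
    os.unlink(path)
```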

Ideally the web interface would use a queuing mechanism to limit the number of parallel conversion jobs, avoiding work spikes that overload I/O and processing power on the shared job grid. This could be accomplished with the Redis server as a queue and a small number of dedicated jobs that poll Redis for work to do. I'd actually suggest starting with just a single worker and monitoring the queue depth for a week or two to determine whether wait times warrant more than one worker.
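The single-worker pattern suggested above can be sketched with Python's standard-library queue standing in for Redis (a real deployment would use LPUSH/BRPOP on a Redis list instead; the `convert` function and job names below are placeholders, not part of the actual tool):

```python
import queue
import threading

# Stand-in for the Redis list; a real deployment would LPUSH jobs
# from the web frontend and BRPOP them in the worker.
job_queue = queue.Queue()
results = {}

def convert(title, fmt):
    # Placeholder for the real ebook conversion step.
    return f"{title}.{fmt}"

def worker():
    """Single dedicated worker: pull one job at a time off the queue."""
    while True:
        job = job_queue.get()
        if job is None:  # sentinel: shut down
            job_queue.task_done()
            break
        title, fmt = job
        results[(title, fmt)] = convert(title, fmt)
        job_queue.task_done()

# Start exactly one worker, as suggested in the task.
t = threading.Thread(target=worker)
t.start()

# The web frontend enqueues jobs instead of spawning grid jobs directly,
# so at most one conversion runs at a time no matter how many arrive.
job_queue.put(("Heart_of_Darkness", "epub"))
job_queue.put(("Heart_of_Darkness", "pdf"))
job_queue.join()
job_queue.put(None)
t.join()
```

Monitoring the depth of `job_queue` (or `LLEN` on the Redis key) over a week or two would show whether a second worker is actually needed.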

Event Timeline

chasemp_freenode_#wikimedia-labs_20170418.log

1 tools-exec-1430
1 tools-exec-1437
1 tools-exec-1439
1 tools-exec-1442
2 tools-exec-1435
3 tools-exec-1434
3 tools-exec-1441
4 tools-exec-1432
5 tools-exec-1436
6 tools-exec-1433

chasemp_freenode_#wikimedia-labs_20170419.log

1 tools-exec-1435
1 tools-exec-1437
1 tools-exec-1439
1 tools-exec-1442
2 tools-exec-1430
2 tools-exec-1441
3 tools-exec-1432

@chasemp has suggested that we try an experiment to isolate the impact of these jobs to a single exec node on the grid. This would involve adding an -l hostname=$exec_host specification to the jsub command. He would like me/us to pick a host that is not in the set of hosts listed in T163208#3194367.

Mentioned in SAL (#wikimedia-labs) [2017-04-19T20:59:42Z] <bd808> Pinning jsub jobs to tools-exec-1426 for T163208

Here's the list of exec nodes with puppet failures so far this month from my IRC logs, sorted by frequency:

$ grep 'Puppet run on tools-exec-' 2017-04-*|awk '{print $8}'|sort|uniq -c|sort -rn
     42 tools-exec-1432
     26 tools-exec-1430
     16 tools-exec-1434
     16 tools-exec-1433
     12 tools-exec-1441
     12 tools-exec-1436
     10 tools-exec-1442
     10 tools-exec-1435
     10 tools-exec-1431
      6 tools-exec-gift-trusty-01
      6 tools-exec-1439
      6 tools-exec-1438
      6 tools-exec-1421
      6 tools-exec-1416
      6 tools-exec-1415
      4 tools-exec-1437
      4 tools-exec-1428
      4 tools-exec-1423
      4 tools-exec-1420
      4 tools-exec-1418
      4 tools-exec-1412
      4 tools-exec-1410
      4 tools-exec-1409
      4 tools-exec-1407
      4 tools-exec-1406
      4 tools-exec-1405
      4 tools-exec-1404
      4 tools-exec-1401
      2 tools-exec-1440
      2 tools-exec-1429
      2 tools-exec-1417
      2 tools-exec-1414
      2 tools-exec-1413
      2 tools-exec-1411
      2 tools-exec-1408
      2 tools-exec-1403
      2 tools-exec-1402

Based on this I arbitrarily picked tools-exec-1426, which does not appear in the list and is lightly loaded at this exact moment. The pinning was done with this edit:

tools.wsexport@tools-bastion-02:~/tool$ git diff http/book.php
diff --git a/http/book.php b/http/book.php
index c719304..cb91c3a 100644
--- a/http/book.php
+++ b/http/book.php
@@ -3,7 +3,7 @@ $wsexportConfig = [
        'basePath' => '..',
        'tempPath' => __DIR__ . '/../temp',
        'stat' => true,
-       'ebook-convert' => 'jsub -mem 2g -l release=trusty -sync y xvfb-run -a e
+       'ebook-convert' => 'jsub -mem 2g -l release=trusty -l hostname=tools-exec-1426 -s
 ];

 include_once __DIR__  . '/../book/init.php';

The long-term awesome solution for the wsexport tool might be to move it to a project of its own and put the web frontend on a VM that also has enough CPU/RAM to run the conversion jobs. The big problem is getting the files back to the requesting users. There is really no way to do that in Tool Labs without using NFS as an intermediary file store.

The tool works by having the webservice select the proper script to generate the desired output format (EPUB, PDF, etc.) to a temp file, then spawning a synchronous grid job to download the content from Wikisource and repackage it into that file. When the job finishes, the webservice streams the file back to the requesting client. The job runs somewhere on the grid (pinned to tools-exec-1426 at the moment), but always on a different exec node than the one running the web frontend, because we segregate webservices from the general-purpose exec nodes that handle the conversion jobs. I can't think of a way for the webservice to hand off the socket to the spawned job so that the result could be streamed directly back to the client, or at least streamed from a file written to local disk rather than NFS.
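The flow described above (run a synchronous job that writes a temp file, then stream that file back to the client) looks roughly like this Python sketch. The real tool does this in PHP via `jsub -sync y`; the shell command below is a made-up stand-in for the conversion job, not the actual invocation:

```python
import os
import subprocess
import tempfile

def generate_and_stream(title, fmt, out_stream):
    """Run a conversion synchronously, then stream the result back."""
    fd, path = tempfile.mkstemp(suffix=f".{fmt}")
    os.close(fd)
    try:
        # Stand-in for `jsub -sync y ...`: any synchronous command
        # that writes the converted book to `path`. Because the job
        # runs on a different exec node than the webservice, `path`
        # must live on shared storage (NFS) in the real setup.
        subprocess.run(
            ["sh", "-c", f"echo 'fake {fmt} for {title}' > {path}"],
            check=True,
        )
        # Stream the finished file back to the requesting client.
        with open(path, "rb") as f:
            while chunk := f.read(64 * 1024):
                out_stream.write(chunk)
    finally:
        os.unlink(path)
```

The comment about NFS is the crux: since the writer (grid job) and the reader (webservice) are different hosts, `path` cannot be local disk, which is exactly the constraint lamented above.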

taavi edited projects, added Toolforge, Tools; removed Cloud-Services.
taavi subscribed.

Looks like wsexport is now in Cloud VPS and not Toolforge, so closing this (old) task.