
tools-bastion-05 is super slow
Closed, Duplicate (Public)

Description

tools-bastion-05 is super slow right now.

mzmcbride@tools-bastion-05:~$ time cd database-reports/

real	0m17.471s
user	0m0.001s
sys	0m0.000s
mzmcbride@tools-bastion-05:~/database-reports$ time ls
build-aux		cronietab.submit  data	  general  README	    setup.py
createconfiguration.py	crontab.tools	  dbreps  INSTALL  reports	    TODO
CREDITS			crontab.yarrow	  enwiki  LICENSE  settings.sample  wikidatawiki

real	0m1.130s
user	0m0.002s
sys	0m0.003s
mzmcbride@tools-bastion-05:~/database-reports$ time cd reports/

real	0m1.965s
user	0m0.001s
sys	0m0.000s
mzmcbride@tools-bastion-05:~/database-reports/reports$ time ls
commonswiki  enwiki  general  __init__.py  plwiki  tests

real	0m3.051s
user	0m0.001s
sys	0m0.003s
mzmcbride@tools-bastion-05:~/database-reports/reports$ time cd general/

real	0m15.499s
user	0m0.009s
sys	0m0.004s

Maybe an NFS issue or similar? Disk usage looks fine, but disk read speed feels like the bottleneck.
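One rough way to separate NFS/disk latency from CPU load is to time a small uncached read in the home directory. This is only a sketch (the probe file name is made up); on a healthy mount both `dd` calls finish in well under a second:

```shell
# Rough read-latency probe; run from the NFS-mounted home directory.
dd if=/dev/zero of=nfs-probe.tmp bs=1M count=1 2>/dev/null  # write a 1 MiB file
sync                                                        # flush it to the server
time dd if=nfs-probe.tmp of=/dev/null bs=1M 2>/dev/null     # time reading it back
rm -f nfs-probe.tmp                                         # clean up the probe file
```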

Event Timeline

Luke081515 moved this task from Triage to Backlog on the Toolforge board.

It's responsive now; this seems to be coming and going. When the bastion seizes up, it is typically the result of someone running a super-expensive job that eats up all the resources.

Maybe -05 is more vulnerable to this than the old -01 was? I can't imagine how, but I will try to look into it. Unfortunately the Labs team is getting a bit stretched today.

I noticed @marcmiquel running big jobs on the bastion the other day that ran at 95%+ CPU for hours. He should probably read the grid engine manual ( https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Grid )
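For reference, the manual linked above covers `jsub`, which is the intended way to run this kind of work. A minimal, hedged sketch (the job name, memory request, and script path are invented examples, and `jsub` only exists on the Tool Labs hosts, so elsewhere this just prints the command it would run):

```shell
# Submit long-running work to the grid instead of running it on the bastion.
# -N names the job; -mem requests memory for it.
submit() {
    if command -v jsub >/dev/null 2>&1; then
        jsub "$@"                      # on a Tool Labs submit host, actually submit
    else
        echo "would run: jsub $*"      # elsewhere, just show the intended command
    fi
}
submit -N example-report -mem 512m python "$HOME/reports/run.py"
```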

My apologies, I ran a script on the bastion when I should have sent it to the grid instead. They pointed this out to me in the IRC channel and now it is clear.

> I noticed @marcmiquel running big jobs on the bastion the other day that ran on 95%+ CPU for hours.

Is there any way to prevent this type of resource hogging more strictly? Whether the underlying excessive resource usage is malicious or accidental, when the host takes more than 15 seconds to change directories, the host is effectively unusable.

> Is there any way to prevent this type of resource hogging more strictly? Whether the underlying excessive resource usage is malicious or accidental, when the host takes more than 15 seconds to change directories, the host is effectively unusable.

In general, our current largest issue (IMO) is NFS utilization. We have a non-trivial number of bots and tools that spew gigabytes of logs to NFS each day, read 50G data files from NFS into a shell pipeline of some sort and pipe the 50G of output back to the same NFS filesystem, and perform other tasks that are heavy on I/O operations per second (IOPS). I've heard people say that the Toolserver didn't suffer from such issues, but honestly, if that was true, it was only because there were fewer users doing fewer things. @yuvipanda, @Andrew, and @chasemp are currently working diligently to salvage what is salvageable of the current infrastructure and to implement new systems that are better prepared to deal with the growth in utilization that Labs and Tool Labs are experiencing.
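As a first step against the log-spew problem described above, something like the following can show where a tool's log bytes are going. The directory layout here is a throwaway demo so the sketch is self-contained; point `find` at your tool's real home instead:

```shell
# Demo: locate large *.log / *.err / *.out files under a directory, largest first.
demo=$(mktemp -d)                                                 # throwaway demo dir
dd if=/dev/zero of="$demo/bot.log" bs=1024 count=200 2>/dev/null  # ~200 KiB fake log
: > "$demo/bot.err"                                               # empty error log
find "$demo" -type f \( -name '*.log' -o -name '*.err' -o -name '*.out' \) \
    -exec du -k {} + | sort -rn                                   # sizes in KiB, biggest first
rm -rf "$demo"
```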

As a user community, there are a few things we can do to help out in the short term:

  1. We can check our own tools and bots for unnecessary error and logging output.
    1. If you have a PHP script that writes 25 lines of "Undefined index foo at line 75" each time the script is run, fix that code. Add [[https://secure.php.net/isset|isset()]] checks, lower the global error reporting level, or as a last-ditch hack add [[https://secure.php.net/manual/en/language.operators.errorcontrol.php|@]] error suppression to offending lines.
    2. If you write a log file with a list of every page or line of input you process so that you know where things broke when the job stops unexpectedly, figure out how to update a single key in redis instead.
  2. Check the data-processing pipelines that our scripts use. If you are selecting 40G of data out of MySQL into a file and then processing that file with sort and uniq to produce a second 30G file, rethink how you are moving data around and try to find a way to process smaller slices of data. (Even reduced by an order of magnitude, this is still a problem.) Ideally, figure out how to process a stream of data that never hits NFS at all. I totally understand that not everyone knows how to make optimizations to pipelines like this, but anyone can write an email to labs-l and ask for advice.
  3. Document things that we have done to make our tools faster and less resource intensive. Wikitech has quite a bit of documentation on how to execute particular commands, but I haven't run into many descriptions of how to process data efficiently for task X.
  4. Stop running things that are redundant and collaborate with another project instead. I'd be willing to bet that there are multiple tools and bots out there that pick up the same source data, make 95% of the same intermediate changes and produce a report. Find the other tools that are likely to have a high overlap with yours and figure out how to combine the 50-90% of processing that is duplicated into a source feed that both reports can use.
  5. Don't run anything other than a text editor, less, or tail on the submit nodes. If you need to do more than that, figure out how to do it via a grid job. This probably won't save NFS IOPS, but it will free up CPU for others.
  6. In general think of the server kitties and treat the gratis shared resources that you have access to in Labs and Tool Labs like you are actually paying for them, because you are paying with your time and effort and the time and effort of others.
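Point 2 above can be illustrated with a small, hypothetical pipeline (the query and file names are invented). Instead of dumping query output to a file on NFS and then post-processing that file, pipe the output straight through the processing steps so nothing large ever lands on disk:

```shell
# Anti-pattern: two large intermediate files on NFS.
#   mysql ... -e 'SELECT page FROM bigtable' > dump.txt
#   sort dump.txt | uniq -c > counts.txt
#
# Streaming alternative: only the small final result touches the filesystem.
#   mysql ... -e 'SELECT page FROM bigtable' | sort | uniq -c | sort -rn > counts.txt
#
# Self-contained demo of the streaming shape, with printf standing in
# for the database client; prints "3 foo", "2 bar", "1 baz".
printf '%s\n' foo bar foo baz foo bar | sort | uniq -c | sort -rn
```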

</rant>