Page MenuHomePhabricator

On deployment-prep scap cache_git_info takes 12 minutes (that is too slow)
Closed, ResolvedPublic

Description

The job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ takes a while, notably when generating the cache_git_info:

00:00:08.393 19:39:20 Started cache_git_info
00:12:20.924 19:51:33 Finished cache_git_info (duration: 12m 12s)

12 minutes :(

Event Timeline

hashar created this task.Sep 18 2018, 8:02 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 18 2018, 8:02 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.Sep 18 2018, 8:23 PM
thcipriani added a subscriber: thcipriani.

Takes a while on beta because of all the extensions (plus disks are slower than in production where it takes like 20 seconds). IIRC we haven't done much to parallelize any of this, it serially walks the extensions directory. There are probably some easy wins here, but, again, it's been a beta-only problem so it hasn't been a high priority.

I gave it a quick try on deployment-deploy01 reproducing scap.tasks.cache_git_info(). Namely getting extensions and skins git information with scap.git.info and writing the json to a dummy file take roughly 40 seconds:

/home/hashar/scap_git.py
#!/usr/bin/python

import json

from scap.git import info as git_info
from scap.utils import iterate_subdirectories

for kind in ['extensions', 'skins']:
    for stuff in iterate_subdirectories('/srv/mediawiki-staging/php-master/%s' % kind):
        print('Stuff: %s' % stuff)
        info = git_info(stuff)
        with open('/tmp/foo.json', 'w') as f:
            json.dump(info, f)
$ time python scap_git.py
real	0m40.602s
user	0m13.600s
sys	0m26.824s
hashar added a comment.EditedSep 24 2018, 9:19 AM

I ended up manually hacking scap on deployment-deploy01 https://integration.wikimedia.org/ci/job/beta-scap-eqiad/222901/console:

00:00:06.979 09:11:21 cache_git_info( version=master ) wikidb is dewiktionary
00:00:07.705 09:11:22 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info.json
00:00:08.447 09:11:23 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-WikibaseQualityExternalValidation.json
00:00:09.177 09:11:23 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-SubpageFun.json
00:00:09.934 09:11:24 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-Ads.json
00:00:10.697 09:11:25 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-HierarchyBuilder.json
00:00:11.434 09:11:26 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-FixedHeaderTable.json
...

00:12:30.765 09:23:45 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-skins-Slate.json
00:12:31.562 09:23:46 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-skins-Refreshed.json
00:12:31.565 09:23:46 Finished cache_git_info (duration: 12m 24s)

Using iwatch shows that is quite slow:

[24/Sep/2018 09:17:51] IN_MODIFY ./info-extensions-ExtensionDistributor.json
[24/Sep/2018 09:17:51] IN_OPEN ./info-extensions-ExtensionDistributor.json
[24/Sep/2018 09:17:51] IN_MODIFY ./info-extensions-ExtensionDistributor.json
[24/Sep/2018 09:17:51] IN_CLOSE_WRITE ./info-extensions-ExtensionDistributor.json
[24/Sep/2018 09:17:51] * ./info-extensions-ExtensionDistributor.json is closed
[24/Sep/2018 09:17:52] IN_MODIFY ./info-extensions-WikiLabels.json
[24/Sep/2018 09:17:52] IN_OPEN ./info-extensions-WikiLabels.json
[24/Sep/2018 09:17:52] IN_MODIFY ./info-extensions-WikiLabels.json
[24/Sep/2018 09:17:52] IN_CLOSE_WRITE ./info-extensions-WikiLabels.json
[24/Sep/2018 09:17:52] * ./info-extensions-WikiLabels.json is closed
[24/Sep/2018 09:17:53] IN_MODIFY ./info-extensions-SacredText.json
[24/Sep/2018 09:17:53] IN_OPEN ./info-extensions-SacredText.json
[24/Sep/2018 09:17:53] IN_MODIFY ./info-extensions-SacredText.json
[24/Sep/2018 09:17:53] IN_CLOSE_WRITE ./info-extensions-SacredText.json
[24/Sep/2018 09:17:53] * ./info-extensions-SacredText.json is closed
[24/Sep/2018 09:17:53] IN_MODIFY ./info-extensions-BoilerPlate.json
[24/Sep/2018 09:17:53] IN_OPEN ./info-extensions-BoilerPlate.json
[24/Sep/2018 09:17:53] IN_MODIFY ./info-extensions-BoilerPlate.json
[24/Sep/2018 09:17:53] IN_CLOSE_WRITE ./info-extensions-BoilerPlate.json
[24/Sep/2018 09:17:53] * ./info-extensions-BoilerPlate.json is closed

Then strace shows a spam of close(x) = -1 EBADF (Bad file descriptor), that might be a reason for the slow down. That also happens with my dummy scap_git.py script though it is still WAY faster.

Maybe that is related to writes to /srv/mediawiki-staging/php-master/cache/gitinfo being slow.

Mentioned in SAL (#wikimedia-releng) [2018-09-24T09:24:08Z] <hashar> Live hacked scap code on deployment-deploy01 for T204762 and reverted hack changes

Mentioned in SAL (#wikimedia-releng) [2018-09-24T09:51:48Z] <hashar> deployment-deploy01 : backed up /srv/mediawiki-staging/php-master/cache/gitinfo and created a new. Its size of 69632 bytes might cause slow writes?? | T204762

The spam of close(x) = -1 EBADF (Bad file descriptor) comes from a call to os.closerange

scap/sh.py
1679 class OProc(object):

1691     def __init__(self, command, parent_log, cmd, stdin, stdout, stderr,

1862         self.pid = os.fork()
1863 
1864         # child
1865         if self.pid == 0: # pragma: no cover

1955                 # don't inherit file descriptors
1956                 max_fd = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
1957                 os.closerange(3, max_fd)

That is https://amoffat.github.io/sh/ which looks like syntaxic sugar to run shell commands so one can sh.git.show("HEAD"). It seems to reimplement subprocess methods, which is probably not a good idea :] But that is orthogonal to the disk slowness.

Even funnier, I wanted to trace the execution of scap sync using the python trace module. The git caching took less than a minute!

$ ssh deployment-deploy01.deployment-prep.eqiad.wmflabs
$ sudo su - jenkins-deploy
$ cd /srv/mediawiki-staging
$ scap sync
10:44:34 Started scap: (no justification provided)
...
10:44:40 Started cache_git_info
10:45:35 Finished cache_git_info (duration: 00m 54s)
...
10:47:00 Finished scap: (no justification provided) (duration: 02m 26s)

I wanted some historical build durations. I took the IRC logs from https://wm-bot.wmflabs.org/logs/%23wikimedia-releng/ and looked for the irc notification when beta-scap-eqiad is fixed.

DateBuild #Duration
20180701 03:49:13#2140475 min 18 sec
20180704 05:37:09#2145034 min 13 sec
20180705 11:17:56#2145934 min 8 sec
20180707 03:56:23#21485812 min
20180717 23:29:25#21639247 min
20180719 21:28:19#2166294 min 30 sec
20180723 02:58:01#2171084 min 18 sec
20180824 22:43:06#22016248 min
20180827 21:44:55#22042431 min
20180828 21:13:24#22050129 min
20180829 12:42:51#22055515 min
20180830 22:49:50#22067716 min
20180902 13:33:44#22090715 min
20180903 08:15:54#22097914 min
20180904 03:39:51#22105116 min
20180904 17:43:15#22110215 min
20180905 17:10:59#22117917 min
20180906 08:21:04#22122915 min
20180907 21:19:51#22136249 min
20180913 14:38:40#22183913 min
20180914 11:51:19#22192514 min
20180914 14:38:12#22193814 min
20180914 21:36:33#22196614 min
20180916 17:55:32#22215113 min
20180916 22:26:51#22216916 min
20180920 01:47:10#22247413 min
20180920 11:17:15#22251313 min
20180921 01:39:15#22257013 min
20180922 18:21:48#22273713 min
20180923 15:36:11#22282513 min
20180924 02:47:59#22287214 min
20180924 08:39:05#22289715 min
20180924 16:37:12#22293213 min
20180925 06:38:30#22298813 min
20180925 10:57:25#22300615 min
20180925 11:16:23#22300813 min

There are holes in the data set, but July 23th 2018 and they are definitely slow since at least August 30th.

hashar added a comment.EditedSep 25 2018, 11:44 AM

If that is from scap, that might be one of:

f7cb839114333a5b00ed99f44f13fbfb64157857 from August 28th already notes that cache_git_info is slow on beta.

Or (most probably) it is entirely unrelated to scap itself.

dduvall triaged this task as Low priority.Oct 22 2018, 4:53 PM
dduvall raised the priority of this task from Low to Normal.

This is still the slowest step of beta-scap-eqiad. Taking 9 minutes of the total 10 minutes and 5 seconds at https://integration.wikimedia.org/ci/job/beta-scap-eqiad/246396/console, for example.

It is also the reason beta deploys don't happen at the scheduled interval because the parent job (beta-code-update-eqiad) waits, correctly, for the this child job to finish. With the parent and child job setup/teardown included its about 16-20 minutes, whereas the job is scheduled for every 10 minutes.

Still pretty quick, but those 10 extra minutes every time do add up over the day when QA is waiting for it to confirm fixes or react to on-going issues during development. It's long enough to be affecting the work flow, but short enough that you can't realistically do something useful with it. Oh well :)

Krenair added a comment.EditedApr 23 2019, 6:29 PM

Takes a while on beta because of all the extensions (plus disks are slower than in production where it takes like 20 seconds). IIRC we haven't done much to parallelize any of this, it serially walks the extensions directory.

So parallelising it might speed things up (though if the whole thing is just slow disk access there might not be much to win here?)
However, in beta we have *all* extensions that exist in gerrit cloned. Is there a way of making it only have the extensions used in prod? I know extension-list-labs went away almost a year ago so I'm not sure we need anything extra.

Edit: Ugh, I'm guessing we currently rely on the gerrit auto-submodule feature for each branch, whereas we'd like to track only prod's subset of extensions - just using the master branches... Maybe we can avoid initialising all of them or something.

To the best I could tell when I investigated it:

  • that seems to be a regression from end of July 2018, potentially tracing back to Scap
  • I could not reproduce the issue either by:
    • writing my own python script to do the same execution
    • running scap as the jenkins-deploy user

Eg scap sync with a sub minute cache_git_info:

Even funnier, I wanted to trace the execution of scap sync using the python trace module. The git caching took less than a minute!

$ ssh deployment-deploy01.deployment-prep.eqiad.wmflabs
$ sudo su - jenkins-deploy
$ cd /srv/mediawiki-staging
$ scap sync
10:44:34 Started scap: (no justification provided)
...
10:44:40 Started cache_git_info
10:45:35 Finished cache_git_info (duration: 00m 54s)
...
10:47:00 Finished scap: (no justification provided) (duration: 02m 26s)

I hacked scap to output some logging info to better show the cache_git_info progress https://integration.wikimedia.org/ci/job/beta-scap-eqiad/246645/console

I also tried again directly on the machine using:

sudo su - jenkins-deploy
cd /srv/mediawiki-staging/
scap sync

And it takes 33 seconds:

17:51:54 Started cache_git_info
17:52:28 Finished cache_git_info (duration: 00m 33s)

And I don't think it is due to a potentially cold disk cache, a jenkins build occurring immediately after is still super slow.
Or in short, the issue only occurs when being harnessed/wrapped up with the Jenkins job.

@hashar Perhaps an ENV variable is interfering? Try with sudo -i instead of sudo -s and maybe also check which environment variables that Jenkins sets when establishing the SSH connection before the job/scap command starts.

Takes a while on beta because of all the extensions (plus disks are slower than in production where it takes like 20 seconds). IIRC we haven't done much to parallelize any of this, it serially walks the extensions directory.

So parallelising it might speed things up (though if the whole thing is just slow disk access there might not be much to win here?)
However, in beta we have *all* extensions that exist in gerrit cloned. Is there a way of making it only have the extensions used in prod? I know extension-list-labs went away almost a year ago so I'm not sure we need anything extra.
Edit: Ugh, I'm guessing we currently rely on the gerrit auto-submodule feature for each branch, whereas we'd like to track only prod's subset of extensions - just using the master branches... Maybe we can avoid initialising all of them or something.

Both are valids (parallelism and less repositories) though they are workarounds and I would rather track those as other tasks ;)

hashar added a comment.May 2 2019, 9:22 AM

I went with a very basic script

/mnt/home/jenkins-deploy/scap_git.py
#!/usr/bin/python

import json
import os
import os.path

import scap.log as log
from scap.utils import iterate_subdirectories
from scap.sh.contrib import git

import logging

logging.basicConfig(level=logging.DEBUG)

with log.Timer('scap_git infos'):
    for kind in ['extensions', 'skins']:
        for stuff in iterate_subdirectories('/srv/mediawiki-staging/php-master/%s' % kind):
            os.chdir(stuff)

            with log.Timer(stuff):
                print('Url: %s' % git('ls-remote', '--get-url').strip())

Took more than 4 minutes on https://integration.wikimedia.org/ci/job/beta-hashar-cache-git-info/18/ . Each calls to git ls-remote --get-url takes roughly 250 ms. An example:

00:04:00.531 INFO:timer:Started /srv/mediawiki-staging/php-master/skins/DeepSea
00:04:00.531 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url'>: starting process
00:04:00.532 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: started process
00:04:00.532 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 7406>: process started
00:04:00.533 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: acquiring wait lock to wait for completion
00:04:00.533 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: got wait lock
00:04:00.534 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: exit code not set, waiting on pid
00:04:00.775 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: <scap.sh.StreamReader object at 0x7f3a48078ad0> ready to be read from
00:04:00.776 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: got chunk size 61: 'https://gerrit.wikimedia.org/r'
00:04:00.777 DEBUG:scap.sh.stream_bufferer:acquiring buffering lock to process chunk (buffering: 1)
00:04:00.777 DEBUG:scap.sh.stream_bufferer:got buffering lock to process chunk (buffering: 1)
00:04:00.778 DEBUG:scap.sh.stream_bufferer:released buffering lock for processing chunk (buffering: 1)
00:04:00.778 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: putting chunk onto pipe: 'https://gerrit.wikimedia.org/r'
00:04:00.779 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: <scap.sh.StreamReader object at 0x7f3a48078ad0> ready to be read from
00:04:00.780 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: got no chunk, done reading
00:04:00.780 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: <scap.sh.StreamReader object at 0x7f3a48081410> ready to be read from
00:04:00.781 DEBUG:scap.sh.streamreader:got no chunk, done reading
00:04:00.781 DEBUG:scap.sh.stream_bufferer:acquiring buffering lock for flushing buffer
00:04:00.782 DEBUG:scap.sh.stream_bufferer:got buffering lock for flushing buffer
00:04:00.782 DEBUG:scap.sh.stream_bufferer:released buffering lock for flushing buffer
00:04:00.782 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: got chunk size 0 to flush: ''
00:04:00.783 DEBUG:scap.sh.stream_bufferer:acquiring buffering lock for flushing buffer
00:04:00.783 DEBUG:scap.sh.stream_bufferer:got buffering lock for flushing buffer
00:04:00.784 DEBUG:scap.sh.stream_bufferer:released buffering lock for flushing buffer
00:04:00.784 DEBUG:scap.sh.streamreader:got chunk size 0 to flush: ''
00:04:00.784 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 7406>: process completed
00:04:00.785 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 7406>: process completed
00:04:00.785 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 7406>: process completed
00:04:00.785 Url: https://gerrit.wikimedia.org/r/p/mediawiki/skins/DeepSea.git
00:04:00.786 INFO:timer:Finished /srv/mediawiki-staging/php-master/skins/DeepSea (duration: 0.253926992416)

Notably:

00:04:00.532 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: started process
00:04:00.532 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 7406>: process started
00:04:00.533 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: acquiring wait lock to wait for completion
00:04:00.533 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: got wait lock
00:04:00.534 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: exit code not set, waiting on pid
00:04:00.775 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>: <scap.sh.StreamReader object at 0x7f3a48078ad0> ready to be read from
00:04:00.776 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 7406 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: got chunk size 61: 'https://gerrit.wikimedia.org/r'

The same script being run from manually is dramatically faster to less than 8 seconds (from 4 minutes):

7.639760 INFO:timer:Started /srv/mediawiki-staging/php-master/skins/DeepSea
7.639984 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url'>: starting process
7.643368 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>: started process
7.644402 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 29648>: process started
7.644476 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>: acquiring wait lock to wait for completion
7.644533 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>: got wait lock
7.644583 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>: exit code not set, waiting on pid
7.645534 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>: <scap.sh.StreamReader object at 0x7f3454020550> ready to be read from
7.645644 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: got chunk size 61: 'https://gerrit.wikimedia.org/r'
7.645968 DEBUG:scap.sh.stream_bufferer:acquiring buffering lock to process chunk (buffering: 1)
7.646060 DEBUG:scap.sh.stream_bufferer:got buffering lock to process chunk (buffering: 1)
7.646284 DEBUG:scap.sh.stream_bufferer:released buffering lock for processing chunk (buffering: 1)
7.646343 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: putting chunk onto pipe: 'https://gerrit.wikimedia.org/r'
7.646426 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>: <scap.sh.StreamReader object at 0x7f3454020550> ready to be read from
7.646482 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: got no chunk, done reading
7.646538 DEBUG:scap.sh.command.process:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>: <scap.sh.StreamReader object at 0x7f3454088050> ready to be read from
7.646587 DEBUG:scap.sh.streamreader:got no chunk, done reading
7.646641 DEBUG:scap.sh.stream_bufferer:acquiring buffering lock for flushing buffer
7.646686 DEBUG:scap.sh.stream_bufferer:got buffering lock for flushing buffer
7.646733 DEBUG:scap.sh.stream_bufferer:released buffering lock for flushing buffer
7.646786 DEBUG:scap.sh.command.process.streamreader:<Command u'/usr/bin/git ls-remote --get-url'>.<Process 29648 ['/usr/bin/git', 'ls-remote', '--get-url']>.stdout: got chunk size 0 to flush: ''
7.646849 DEBUG:scap.sh.stream_bufferer:acquiring buffering lock for flushing buffer
7.646893 DEBUG:scap.sh.stream_bufferer:got buffering lock for flushing buffer
7.646939 DEBUG:scap.sh.stream_bufferer:released buffering lock for flushing buffer
7.646984 DEBUG:scap.sh.streamreader:got chunk size 0 to flush: ''
7.647147 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 29648>: process completed
7.647233 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 29648>: process completed
7.647291 INFO:scap.sh.command:<Command u'/usr/bin/git ls-remote --get-url', pid 29648>: process completed
7.647365 Url: https://gerrit.wikimedia.org/r/p/mediawiki/skins/DeepSea.git
7.647504 INFO:timer:Finished /srv/mediawiki-staging/php-master/skins/DeepSea (duration: 0.00769400596619)
hashar added a comment.EditedMay 2 2019, 9:45 AM

The issue lays somewhere into scap/sh.py :/ One would notice the slowdown in log messages:

00:04:00.534 ['/usr/bin/git', 'ls-remote', '--get-url']>: exit code not set, waiting on pid
00:04:00.775 ['/usr/bin/git', 'ls-remote', '--get-url']>: <scap.sh.StreamReader object at 0x7f3a48078ad0> ready to be read from

Which would be scap.sh output_thread() function:

poller = Poller()
if stdout is not None:
    poller.register_read(stdout)

while poller:
    changed = no_interrupt(poller.poll, 0.1)

And that matches a bunch of poll() system calls happening when I ran the script with strace :-/

When running from Jenkins, the script is slightly slower for some reason and causes scap to wait with that 100ms poll.

I don't think there is any proper fix to it short of hunting down the cause of slowdown or fixing scap.sh. Most probably we should just switch to use python subprocess and get rid of all that custom logic. **Or it is an artfact of git being slow for some other reason :-\

Back in September I have noticed a spam of close() calls ( T204762#4610309 ). Before executing the process scap/sh closes all file descriptors (beside stdin/out/err) from the parent using:

max_fd = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
os.closerange(3, max_fd)

Running from the command line I get 1024 maximum file desriptors which are fairly fast to close:

$ python closefds.py |ts -s %.s 
0.000029 Max fd: 1024
0.000823 Closing them all
0.000849 Done closing fds

The same script when run in Jenkins:

00:00:00.031 + python -u closefds.py
00:00:00.045 Max fd: 1048576
00:00:00.045 Closing them all
00:00:00.289 Done closing fds
00:00:00.365 Finished: SUCCESS

So when run under Jenkins, there is up to 1 million file descriptors available and closing them all takes ~ 240 miliseconds.

ulimit -n does yield 1048576 under Jenkins.

Change 507773 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] beta: limit scap to 512 max file descriptors

https://gerrit.wikimedia.org/r/507773

hashar added a comment.May 2 2019, 3:18 PM

Courtesy of Tyler:

antoine-approve

Change 507773 merged by jenkins-bot:
[integration/config@master] beta: limit scap to 512 max file descriptors

https://gerrit.wikimedia.org/r/507773

hashar closed this task as Resolved.May 2 2019, 3:26 PM

T222372 would be able dropping scap/sh.git entirely, but I don't think it will be much of a priority :]