On deployment-prep scap cache_git_info takes 12 minutes (that is too slow)
Open, NormalPublic

Description

The job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ takes a while, notably when generating the cache_git_info:

00:00:08.393 19:39:20 Started cache_git_info
00:12:20.924 19:51:33 Finished cache_git_info (duration: 12m 12s)

12 minutes :(

hashar created this task.Sep 18 2018, 8:02 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 18 2018, 8:02 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.Sep 18 2018, 8:23 PM
thcipriani added a subscriber: thcipriani.

Takes a while on beta because of all the extensions (plus disks are slower than in production where it takes like 20 seconds). IIRC we haven't done much to parallelize any of this, it serially walks the extensions directory. There are probably some easy wins here, but, again, it's been a beta-only problem so it hasn't been a high priority.

I gave it a quick try on deployment-deploy01 reproducing scap.tasks.cache_git_info(). Namely getting extensions and skins git information with scap.git.info and writing the json to a dummy file take roughly 40 seconds:

/home/hashar/scap_git.py
#!/usr/bin/python

import json

from scap.git import info as git_info
from scap.utils import iterate_subdirectories

for kind in ['extensions', 'skins']:
    for stuff in iterate_subdirectories('/srv/mediawiki-staging/php-master/%s' % kind):
        print('Stuff: %s' % stuff)
        info = git_info(stuff)
        with open('/tmp/foo.json', 'w') as f:
            json.dump(info, f)
$ time python scap_git.py
real	0m40.602s
user	0m13.600s
sys	0m26.824s
hashar added a comment.EditedSep 24 2018, 9:19 AM

I ended up manually hacking scap on deployment-deploy01 https://integration.wikimedia.org/ci/job/beta-scap-eqiad/222901/console:

00:00:06.979 09:11:21 cache_git_info( version=master ) wikidb is dewiktionary
00:00:07.705 09:11:22 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info.json
00:00:08.447 09:11:23 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-WikibaseQualityExternalValidation.json
00:00:09.177 09:11:23 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-SubpageFun.json
00:00:09.934 09:11:24 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-Ads.json
00:00:10.697 09:11:25 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-HierarchyBuilder.json
00:00:11.434 09:11:26 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-FixedHeaderTable.json
...

00:12:30.765 09:23:45 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-skins-Slate.json
00:12:31.562 09:23:46 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-skins-Refreshed.json
00:12:31.565 09:23:46 Finished cache_git_info (duration: 12m 24s)

Using iwatch shows that is quite slow:

[24/Sep/2018 09:17:51] IN_MODIFY ./info-extensions-ExtensionDistributor.json
[24/Sep/2018 09:17:51] IN_OPEN ./info-extensions-ExtensionDistributor.json
[24/Sep/2018 09:17:51] IN_MODIFY ./info-extensions-ExtensionDistributor.json
[24/Sep/2018 09:17:51] IN_CLOSE_WRITE ./info-extensions-ExtensionDistributor.json
[24/Sep/2018 09:17:51] * ./info-extensions-ExtensionDistributor.json is closed
[24/Sep/2018 09:17:52] IN_MODIFY ./info-extensions-WikiLabels.json
[24/Sep/2018 09:17:52] IN_OPEN ./info-extensions-WikiLabels.json
[24/Sep/2018 09:17:52] IN_MODIFY ./info-extensions-WikiLabels.json
[24/Sep/2018 09:17:52] IN_CLOSE_WRITE ./info-extensions-WikiLabels.json
[24/Sep/2018 09:17:52] * ./info-extensions-WikiLabels.json is closed
[24/Sep/2018 09:17:53] IN_MODIFY ./info-extensions-SacredText.json
[24/Sep/2018 09:17:53] IN_OPEN ./info-extensions-SacredText.json
[24/Sep/2018 09:17:53] IN_MODIFY ./info-extensions-SacredText.json
[24/Sep/2018 09:17:53] IN_CLOSE_WRITE ./info-extensions-SacredText.json
[24/Sep/2018 09:17:53] * ./info-extensions-SacredText.json is closed
[24/Sep/2018 09:17:53] IN_MODIFY ./info-extensions-BoilerPlate.json
[24/Sep/2018 09:17:53] IN_OPEN ./info-extensions-BoilerPlate.json
[24/Sep/2018 09:17:53] IN_MODIFY ./info-extensions-BoilerPlate.json
[24/Sep/2018 09:17:53] IN_CLOSE_WRITE ./info-extensions-BoilerPlate.json
[24/Sep/2018 09:17:53] * ./info-extensions-BoilerPlate.json is closed

Then strace shows a spam of close(x) = -1 EBADF (Bad file descriptor), that might be a reason for the slow down. That also happens with my dummy scap_git.py script though it is still WAY faster.

Maybe that is related to writes to /srv/mediawiki-staging/php-master/cache/gitinfo being slow.

Mentioned in SAL (#wikimedia-releng) [2018-09-24T09:24:08Z] <hashar> Live hacked scap code on deployment-deploy01 for T204762 and reverted hack changes

Mentioned in SAL (#wikimedia-releng) [2018-09-24T09:51:48Z] <hashar> deployment-deploy01 : backed up /srv/mediawiki-staging/php-master/cache/gitinfo and created a new. Its size of 69632 bytes might cause slow writes?? | T204762

The spam of close(x) = -1 EBADF (Bad file descriptor) comes from a call to os.closerange

scap/sh.py
1679 class OProc(object):

1691     def __init__(self, command, parent_log, cmd, stdin, stdout, stderr,

1862         self.pid = os.fork()
1863 
1864         # child
1865         if self.pid == 0: # pragma: no cover

1955                 # don't inherit file descriptors
1956                 max_fd = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
1957                 os.closerange(3, max_fd)

That is https://amoffat.github.io/sh/ which looks like syntaxic sugar to run shell commands so one can sh.git.show("HEAD"). It seems to reimplement subprocess methods, which is probably not a good idea :] But that is orthogonal to the disk slowness.

Even funnier, I wanted to trace the execution of scap sync using the python trace module. The git caching took less than a minute!

$ ssh deployment-deploy01.deployment-prep.eqiad.wmflabs
$ sudo su - jenkins-deploy
$ cd /srv/mediawiki-staging
$ scap sync
10:44:34 Started scap: (no justification provided)
...
10:44:40 Started cache_git_info
10:45:35 Finished cache_git_info (duration: 00m 54s)
...
10:47:00 Finished scap: (no justification provided) (duration: 02m 26s)

I wanted some historical build durations. I took the IRC logs from https://wm-bot.wmflabs.org/logs/%23wikimedia-releng/ and looked for the irc notification when beta-scap-eqiad is fixed.

DateBuild #Duration
20180701 03:49:13#2140475 min 18 sec
20180704 05:37:09#2145034 min 13 sec
20180705 11:17:56#2145934 min 8 sec
20180707 03:56:23#21485812 min
20180717 23:29:25#21639247 min
20180719 21:28:19#2166294 min 30 sec
20180723 02:58:01#2171084 min 18 sec
20180824 22:43:06#22016248 min
20180827 21:44:55#22042431 min
20180828 21:13:24#22050129 min
20180829 12:42:51#22055515 min
20180830 22:49:50#22067716 min
20180902 13:33:44#22090715 min
20180903 08:15:54#22097914 min
20180904 03:39:51#22105116 min
20180904 17:43:15#22110215 min
20180905 17:10:59#22117917 min
20180906 08:21:04#22122915 min
20180907 21:19:51#22136249 min
20180913 14:38:40#22183913 min
20180914 11:51:19#22192514 min
20180914 14:38:12#22193814 min
20180914 21:36:33#22196614 min
20180916 17:55:32#22215113 min
20180916 22:26:51#22216916 min
20180920 01:47:10#22247413 min
20180920 11:17:15#22251313 min
20180921 01:39:15#22257013 min
20180922 18:21:48#22273713 min
20180923 15:36:11#22282513 min
20180924 02:47:59#22287214 min
20180924 08:39:05#22289715 min
20180924 16:37:12#22293213 min
20180925 06:38:30#22298813 min
20180925 10:57:25#22300615 min
20180925 11:16:23#22300813 min

There are holes in the data set, but July 23th 2018 and they are definitely slow since at least August 30th.

hashar added a comment.EditedSep 25 2018, 11:44 AM

If that is from scap, that might be one of:

f7cb839114333a5b00ed99f44f13fbfb64157857 from August 28th already notes that cache_git_info is slow on beta.

Or (most probably) it is entirely unrelated to scap itself.

dduvall triaged this task as Low priority.Oct 22 2018, 4:53 PM
dduvall raised the priority of this task from Low to Normal.