Based in Nantes, France, CET/CEST (UTC+1, UTC+2)
Main IRC channel is #wikimedia-releng
Solved with the assistance of @Dzahn
I filed that task solely because the time went up. That indicates there is most probably an issue on the beta cluster somewhere rather than in the framework itself. So I would like to reproduce the run locally, see whether it is slow as well, and find out the root cause.
Upstream has released 2.15.17.
From a quick chat with @MoritzMuehlenhoff:
Re declining: we did a quick experiment two years ago, but it never materialized. Maybe later we can revisit, using Docker containers and pairing with people knowledgeable about R and its test/package infrastructure.
Eventually I wanted to reuse the exact same Docker package on Stretch (T226236), which got rejected. After some madness that seems to work (so far). The Stretch instances would receive Docker 18.09.7 from thirdparty/ci instead of the 18.06.2 used on Jessie.
@hashar I guess the CI servers should have more relaxed thresholds? Is it even possible to configure Gerrit to whitelist some hosts?
I have upgraded Zuul on contint2001 (T203846), which eventually got the zuul-server to start and establish two connections to the Gerrit server. I have stopped the service, freeing the extra connections.
Gerrit has two connections from each contint server, for a total of four connections, and we have Gerrit restricted to a total of four SSH connections. Hence the zuul-merger is no longer able to fetch:
Session    User         Remote Host
--------------------------------------------------------------
9bb66493   jenkins-bot  contint2001.wikimedia.org
7b67f043   jenkins-bot  contint2001.wikimedia.org
836929be   jenkins-bot  contint1001.wikimedia.org
e36ee520   jenkins-bot  contint1001.wikimedia.org
$ sudo su - zuul
$ cd /srv/zuul/git/operations/puppet
$ git fetch -v
Received disconnect from 2620:0:861:3:208:80:154:85: 12: Too many concurrent connections (4) - max. allowed: 4
fatal: Could not read from remote repository.
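For reference, the session listing above looks like the output of Gerrit's `gerrit show-connections` SSH command. A purely illustrative sketch of counting the jenkins-bot sessions from a host with SSH access (host, port and the required viewConnections capability are assumptions based on the Wikimedia Gerrit setup):

    import subprocess

    # Ask Gerrit for its current SSH sessions and count the jenkins-bot ones.
    # This needs an account with the 'viewConnections' capability (or admin).
    out = subprocess.check_output(
        ['ssh', '-p', '29418', 'gerrit.wikimedia.org', 'gerrit', 'show-connections'],
    ).decode()
    jenkins = [line for line in out.splitlines() if 'jenkins-bot' in line]
    print('jenkins-bot sessions: %d' % len(jenkins))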
Thank you for the explanation and triage :]
I think last time I synced with @aborrero to have the instance moved.
The network is barely capped anymore; it got bumped to 800 Mbit/s for egress traffic.
Solved by applying labstore::traffic_shaping::egress: 100mbps to the instance hiera configuration.
The jobs have been corrected. Quibble would then need to fatal out as soon as a repository cannot be cloned/fetched, etc.
with ThreadPoolExecutor(max_workers=workers) as executor:
    for project, dest in dests.items():
        # Copy and hijack the logger
        project_cloner = copy.copy(zuul_cloner)
        project_cloner.log = project_cloner.log.getChild(project)
I think that is due to "Support to clone repositories in parallel" (5f58fd252e499a37f19da753c064b7e34fc35028), released with 0.0.30. Passing `
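Coming back to the fatal-out point above, a minimal sketch of how a parallel clone can surface the first failure immediately. The clone_one/clone_all helpers, the plain `git clone` via subprocess, and the Gerrit base URL are illustrative assumptions, not Quibble's actual cloner:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def clone_one(url, dest):
        # Clone a single repository; raises CalledProcessError on failure.
        subprocess.run(['git', 'clone', url, dest], check=True)

    def clone_all(dests, workers=4, base_url='https://gerrit.wikimedia.org/r'):
        # Clone repositories in parallel and fail fast on the first error.
        with ThreadPoolExecutor(max_workers=workers) as executor:
            futures = {
                executor.submit(clone_one, '%s/%s' % (base_url, project), dest): project
                for project, dest in dests.items()
            }
            for future in as_completed(futures):
                # result() re-raises any exception from the worker, so the
                # whole run aborts as soon as one clone fails.
                future.result()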
Will do the Stretch upgrade later on, when I can also handle the upgrade to a more recent Docker daemon. Let's stick to the current stack for now.
It is not about downgrading Docker, but rather about keeping the same version we are currently using on the Jessie instances. My primary intent was just to migrate to Stretch, not to have to deal with a Docker migration and more puppet work. containerd, for example, is no longer managed by Docker but by systemd, and the 18.09 Docker package is no longer provided for Jessie. It is just too risky/long to migrate both the OS and the Docker engine at the same time.
@Krinkle did it get resolved? I messed something up and we had delays in processing metadata updates last night.
@Legoktm did the optimizations for mediawiki/core. I guess we can revisit what should be run for mediawiki/core and maybe drop some of the optimizations that have been made.
00:00:31.608 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/npm-test
00:00:31.609 Traceback (most recent call last):
00:00:31.609   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 51, in __init__
00:00:31.609     self._ensure_cloned()
00:00:31.610   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
00:00:31.610     git.Repo.clone_from(self.remote_url, self.local_path)
00:00:31.610   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 925, in clone_from
00:00:31.611     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
00:00:31.611   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 880, in _clone
00:00:31.611     finalize_process(proc, stderr=stderr)
00:00:31.611   File "/usr/lib/python3/dist-packages/git/util.py", line 341, in finalize_process
00:00:31.612     proc.wait(**kwargs)
00:00:31.612   File "/usr/lib/python3/dist-packages/git/cmd.py", line 291, in wait
00:00:31.612     raise GitCommandError(self.args, status, errstr)
00:00:31.612 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
00:00:31.613   cmdline: git clone -v https://gerrit.wikimedia.org/r/npm-test /workspace/src/npm-test
00:00:31.613   stderr: 'Cloning into '/workspace/src/npm-test'...
00:00:31.613 fatal: remote error: npm-test unavailable
00:00:31.613 '
The build was for https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/537445/2 and there are no extension dependencies injected into it.
Thank you @Krinkle :-]
@Reedy might know more about the magic of the Fancy captchas.
Hi @herron, others have pointed me to you for this task since you are on ops clinic duty. The package can't really be rebuilt on the SRE box since it requires network access and would vary due to the installation of Python dependencies from https://pypi.org/ . It is legacy and a bad practice, but that predates a lot of changes we have done since (such as using scap, components in apt.wikimedia.org, etc.).
That has been done by @Paladox on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#new_package and I updated the doc.
I had not checked which instances were used for logstash-beta.wmflabs.org. One sure thing: they are both broken in the same way, due to the rsyslog / logstash UDP input conflict on port 11514 :]
Looking at wmf-quibble-core-vendor-mysql-php72-docker build timing ( https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/buildTimeTrend ):
Also confirmed from mw-debug-www.log:
wfClientAcceptsGzip: client accepts gzip. MediaWiki\OutputHandler::handleGzip() is compressing output
Change 528933 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[integration/quibble@master] Set cache directory
Released and deployed on September 17th with Quibble 0.0.35.
Well done, and thank you for the verification.
I'm sorry I didn't get to this! It sounds like you are (probably) all set.
I would recommend pushing 1.34.0-wmf.23 to testwiki and attempting to reproduce there. If it works in production, then it is probably beta that is to blame somehow.
Most probably related to the imported scap/sh.py, which should probably be removed (T222372).
@Zoranzoki21 that is just a trace coming from Gerrit. It represents a queue of tasks the server has to do, in this case indexing changes (whatever indexing means, I have no idea). :)
Spotted on Sep. 17 at 19:10. Roughly 1700 of them, producing a nice spike of threads. But that is probably harmless.
So the logstash UDP input tries to bind on port 11514, but rsyslogd is already listening there.
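Purely as an illustration (not part of the task), a throwaway script run on the instance shows the same conflict the logstash UDP input hits:

    import socket

    # Try to bind UDP port 11514 the way the logstash input would.
    # With rsyslogd already listening there, bind() raises
    # "[Errno 98] Address already in use".
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind(('0.0.0.0', 11514))
        print('port 11514 is free')
    except OSError as err:
        print('port 11514 is already taken:', err)
    finally:
        sock.close()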
OK, I gave it a try by creating a phpinfo file at the root of the mediawiki/core checkout and running something like:
rm src/LocalSettings.php
quibble --skip-zuul --skip-deps --db sqlite -c 'xdg-open http://127.0.0.1:9412/phpinfo.php'
Eventually I dug into the puppet log. I found out that all WMCS instances have an nfsclient puppet class applied, which ends up invoking labstore::traffic_shaping. That class creates a file, /usr/local/sbin/tc-setup, which holds various shaping parameters.
To work around the insanely slow puppet run from T228056, I commented out base::resolving::labs_additional_domains in hiera, and the provisioning has been super fast.
I could not figure it out, so I guess we just have to wait for a while on the initial puppet run. After that the instance seems to behave properly.
I have checked in logstash; the error is gone since we deployed the fix for T232613#5494695 (we had to upgrade php-memcached).
Given we now have .messagebox and .warningbox, and if the shared.css styles are applied to the installer pages, I see no reason not to use them. So the latter option, I guess: migrate to the new standard and deprecate the dated, barely used method that nobody uses anyway.
Maybe someday. I have just logged this task for later, to be able to look it up whenever I encounter the issue. The hot fix is trivial (hack git-pbuilder to keep the envs).