Based in Nantes, France CET/CEST (UTC+1, UTC+2)
Main IRC channel is #wikimedia-releng
If that is from scap, that might be one of:
I wanted some historical build durations. I took the IRC logs from https://wm-bot.wmflabs.org/logs/%23wikimedia-releng/ and looked for the IRC notification emitted when beta-scap-eqiad goes back to fixed.
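For illustration, a rough sketch of how such durations can be pulled out of the wm-bot logs; the notification format assumed by the regex is a guess and would need adjusting to whatever the Jenkins IRC bot actually emits:

import re
import sys

# Assumed (hypothetical) notification format, e.g.
# "... beta-scap-eqiad build #1234: FIXED in 12 min 3 sec: https://..."
PATTERN = re.compile(r'beta-scap-eqiad build #(?P<build>\d+): FIXED in (?P<duration>[^:]+):')

for line in sys.stdin:
    match = PATTERN.search(line)
    if match:
        print(match.group('build'), match.group('duration'))

Feeding one of the daily log files on stdin then gives build number / duration pairs.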
From http://webdriver.io/guide/getstarted/upgrade.html webdriver.io 3 ran tests asynchronously. webdriver.io 4 now runs them synchronously. The sync: false option is a feature flag to ease the transition/upgrade process, with the doc stating:
Even funnier, I wanted to trace the execution of scap sync using the Python trace module. The git caching took less than a minute!
$ ssh deployment-deploy01.deployment-prep.eqiad.wmflabs
$ sudo su - jenkins-deploy
$ cd /srv/mediawiki-staging
$ scap sync
10:44:34 Started scap: (no justification provided)
...
10:44:40 Started cache_git_info
10:45:35 Finished cache_git_info (duration: 00m 54s)
...
10:47:00 Finished scap: (no justification provided) (duration: 02m 26s)
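As an aside, a minimal sketch of the kind of tracing the stdlib trace module offers; slow_step() is just a stand-in function here, not actual scap code:

import time
import trace

def slow_step():
    # stand-in for the slow step being investigated
    time.sleep(0.5)
    return sum(range(10000))

# print every executed line, prefixed with the elapsed time since start
tracer = trace.Trace(count=False, trace=True, timing=True)
tracer.runfunc(slow_step)

The command-line equivalent is python3 -m trace --trace --timing yourscript.py.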
Train report published on Wikitech: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180918-train
I think that followed one of our 1:1 discussions, with me being confused by all the jobs that are named alike but serve different purposes (per patch or daily) or use different frameworks (mediawiki_selenium or wdio).
Yup, that works most of the time, but --depth 2 is not sufficient when a chain of patchsets is being tested. I guess that is what prompted the revert.
The test has been backed out; the reason is that Quibble does not include tests/browser/LocalSettings.php, and it also runs wdio tests from mediawiki/core. There is work in progress to phase out tests/browser/LocalSettings.php and to run wdio directly from the root of the extension.
For now CI has sql_mode = 'TRADITIONAL'; unfortunately I lack time to think about and babysit the addition of ONLY_FULL_GROUP_BY.
Meanwhile, in Gerrit I have granted permissions to project owners to submit changes ( https://gerrit.wikimedia.org/r/#/c/labs/tools/+/462420/ ).
The spam of close(x) = -1 EBADF (Bad file descriptor) comes from a call to os.closerange.
1679 class OProc(object):
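For illustration, a minimal sketch of why this shows up as spam under strace: os.closerange() attempts close() on every descriptor in the range and ignores the errors, so each fd that was never open fails with EBADF (older CPython loops over close(); newer versions may use the close_range() syscall instead).

import os

# close() is attempted on every fd in [100, 200), open or not; run this
# under `strace -f -e trace=close` to see the close(x) = -1 EBADF lines
os.closerange(100, 200)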
I ended up manually hacking scap on deployment-deploy01 https://integration.wikimedia.org/ci/job/beta-scap-eqiad/222901/console:
00:00:06.979 09:11:21 cache_git_info( version=master ) wikidb is dewiktionary
00:00:07.705 09:11:22 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info.json
00:00:08.447 09:11:23 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-WikibaseQualityExternalValidation.json
00:00:09.177 09:11:23 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-SubpageFun.json
00:00:09.934 09:11:24 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-Ads.json
00:00:10.697 09:11:25 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-HierarchyBuilder.json
00:00:11.434 09:11:26 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-FixedHeaderTable.json
00:00:12.197 09:11:26 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-WikiCategoryTagCloud.json
00:00:12.942 09:11:27 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-BlueSpiceNamespaceManager.json
00:00:13.667 09:11:28 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-CloneDiff.json
...
I gave it a quick try on deployment-deploy01, reproducing scap.tasks.cache_git_info(). Namely, getting extensions and skins git information with scap.git.info and writing the JSON to a dummy file takes roughly 40 seconds.
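For reference, a rough timing sketch of that kind of work; it uses plain git commands rather than scap.git.info, and the metadata fields collected are illustrative only:

import glob
import json
import os
import subprocess
import time

def git_info(repo):
    # collect a few pieces of git metadata for one checkout
    def git(*args):
        return subprocess.check_output(('git', '-C', repo) + args, text=True).strip()
    return {
        'head': git('rev-parse', 'HEAD'),
        'branch': git('rev-parse', '--abbrev-ref', 'HEAD'),
        'date': git('log', '-1', '--format=%ci'),
    }

base = '/srv/mediawiki-staging/php-master'
repos = [d for d in glob.glob(base + '/extensions/*') + glob.glob(base + '/skins/*')
         if os.path.exists(os.path.join(d, '.git'))]

start = time.time()
infos = {os.path.basename(r): git_info(r) for r in repos}
with open('/tmp/gitinfo-test.json', 'w') as fh:  # dummy output file
    json.dump(infos, fh)
print(f'{len(infos)} repos in {time.time() - start:.1f}s')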
Timeout wrapping solved it I guess. Thank you!
That is no longer accurate. The background job processing has been overhauled in the meantime.
Flow is in maintenance mode anyway, so that is unlikely to ever be fixed. I have no interest in it anymore.
That is still valid. https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats sums all CPU metrics, but guest (at least) is already counted as part of user.
After some years, I don't think it will serve much purpose. Some MediaWiki extensions are incompatible with each other, and some only make sense for specific wikis.
We don't really monitor Swift on beta.
The extension does not seem to have received many code updates over the last 5 years, and HHVM is legacy.
The Code of Conduct is already listed everywhere (e.g. Phabricator, soon Gerrit, etc). It has been added to all MediaWiki extensions and skins.
Patched to throttle. The script is probably no longer running nowadays.
I had filed this task to understand what high steal CPU is. It is out of the control of the instances. There is nothing actionable here :]
I have added a link to the task description to check the busiest webgrid-lighttpd instance: https://graphite-labs.wikimedia.org/render/?width=648&height=396&_salt=1537559933.891&hideLegend=false&target=cactiStyle(highestAverage(tools.*webgrid-lighttpd*.cpu.total.user%2C8))&from=-1hours
That is not a blocker to the train per se. I have filed it as a subtask because the train triggers the issue. Seems it is transient and self-resolves after a short amount of time.
Another note from a discussion with Zeljko: we have a third party gathering git/gerrit metrics. They store it in ElasticSearch and use Kibana for dashboarding: https://wikimedia.biterg.io/
A quick and dirty way to get the busiest repos per number of commits is to use git log on the mediawiki/extensions aggregate repository; a rough sketch is below.
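This is not the exact command used, just an illustration: assuming a local clone of mediawiki/extensions in a directory named extensions, it counts how often each submodule path shows up in the last year of history.

import subprocess
from collections import Counter

# each submodule bump is a commit touching that submodule's path,
# so counting path occurrences approximates commits per extension
log = subprocess.run(
    ['git', 'log', '--since=1 year ago', '--name-only', '--format='],
    cwd='extensions', capture_output=True, text=True, check=True)

counts = Counter(line for line in log.stdout.splitlines() if line)
for path, num in counts.most_common(15):
    print(f'{num:6d}  {path}')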
It seems Stephen and Joaquin did the last patches on marvin.git, the last ones being from ~ January 2018. Is Marvin still a thing or should we look at archiving the git repository / phabricator project etc?
1.32.0-wmf.22 has been rolled to all wikis and seems all fine.
TL;DR: Python's grp.getgrall() uses POSIX getgrent(), which in libc has not been made aware of nscd. Hence there is no cache and an LDAP query is made each time....
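A quick way to observe the cost (a sketch; only meaningful on a host whose group lookups go through nslcd/LDAP):

import time
import grp

# each call walks the whole group database again via getgrent(), so on an
# LDAP-backed host every iteration triggers a fresh round of LDAP queries
for _ in range(3):
    start = time.time()
    groups = grp.getgrall()
    print(f'{len(groups)} groups in {time.time() - start:.2f}s')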
Based on Grafana, it seems some ORES servers got overloaded and some requests did time out. That was a brief breakage and apparently everything works fine now. Still to be investigated though.
Though if I try a reported URL manually, it seems to work. Tried on deploy1001, deploy2001 and mw2219
Same happened with group2:
After discussion with Alexandros and Giuseppe, for now we have just updated the list of hosts in the dsh files. Now we have:
14:28:03 Finished Canaries Synced (duration: 00m 03s)
14:28:03 Executing check 'Check endpoints for mw2218.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2217.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2226.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2225.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mwdebug2002.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2216.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2215.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mwdebug2001.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2227.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2224.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2228.codfw.wmnet'
14:28:08 Finished Canary Endpoint Check Complete (duration: 00m 04s)
14:28:08 Waiting for canary traffic...
14:28:23 Executing check 'Logstash Error rate for mw2218.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2217.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2226.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2225.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mwdebug2002.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2216.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2215.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mwdebug2001.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2227.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2224.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2228.codfw.wmnet'
14:28:24 Finished sync-check-canaries (duration: 00m 24s)
And after the script has been run with --fix and --prefix=T201675, what is left is:
$ mwscript namespaceDupes.php zhwikiversity
0 pages to fix, 0 were resolvable.
/tmp was mounted as a tmpfs which comes with the noexec flag. I guess autoconf compiles a binary and then tries to execute it to validate that the C compiler works, but the noexec flag prevents the execution and the test fails.
contint1001$ grep -l '\--tmpfs /tmp' /var/lib/jenkins/jobs/*/config.xml|cut -d\/ -f1
I have abandoned the change https://gerrit.wikimedia.org/r/#/c/integration/config/+/457070/ which was passing --tmpfs /tmp to docker run, but forgot to redeploy the Jenkins job. That has side effects besides mjolnir, such as T204919, which is pycrypto not being installable.
The --tmpfs /tmp comes from the abandoned change https://gerrit.wikimedia.org/r/#/c/integration/config/+/457070/ we did for T203181.
I could not reproduce the issue, so I went with live debugging in an attempt to get the autoconf log file, which is somewhere under /tmp.
The logic is in Quibble's quibble/cmd.py; roughly, what it does is:
At least the pycrypto install works:
$ docker pull docker-registry.wikimedia.org/releng/tox:0.1.1
$ docker run --rm -it --entrypoint=/bin/bash docker-registry.wikimedia.org/releng/tox:0.1.1
$ pip3 install --target . pycrypto
While promoting 1.32.0-wmf.22 yesterday and today, I noticed a spike of web requests that took longer than 60 seconds and timed out. I suspect it always happened and is only showing up now thanks to the patches above. I have filed the issue as T204871.
The web request timeouts have been enforced since September 10th (see T97192#4561879 and https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090803.html ). We have not deployed a train since then.
I noticed a large spike of web request timeouts when promoting group1. That resolved itself eventually. Filed as T204871.
T204758 is similar and got fixed earlier.
It is gone from logstash. The last entry was at 2018-09-19T16:43:21 UTC.
That is from keyholder apparently. At some point it retrieves all groups with grp.getgrall(), which requests everything from LDAP.
AuthorizedKeysCommand is for sshd, so that would indeed trigger a run of /usr/sbin/ssh-key-ldap-lookup on each of the clients. But the nslcd CPU usage is on the cumin instance integration-cumin; I will dig in to figure out what is causing all those LDAP queries.
It still happens. The CI job uses the container docker-registry.wikimedia.org/releng/npm-browser-test:0.1.2, which has PhantomJS. The npm package phantomjs-prebuilt apparently finds it:
> firstname.lastname@example.org install /src/node_modules/phantomjs-prebuilt
> node install.js
I have removed subtasks related to ActorTableSchemaMigrationStage, which was enabled on Monday 09/17. That is part of a feature-flagged migration (T188327) and is not train related.
The error no longer shows up; the ActorTableSchemaMigrationStage change deployed on 09/17 has been rolled back: