hashar (Antoine "hashar" Musso (WMF))
WMF Software developer - Release Engineering

User Details

User Since
Oct 3 2014, 2:31 PM (207 w, 4 d)
Availability
Available
IRC Nick
hashar
LDAP User
Hashar
MediaWiki User
Unknown

https://www.mediawiki.org/wiki/User:Hashar

Based in Nantes, France CET/CEST (UTC+1, UTC+2)

Main IRC channel is #wikimedia-releng

Recent Activity

Today

hashar added a watcher for Gerrit: hashar.
Wed, Sep 26, 1:05 AM

Yesterday

hashar added a comment to T204762: On deployment-prep scap cache_git_info takes 12 minutes (that is too slow).

If that is from scap, that might be one of:

Tue, Sep 25, 11:44 AM · Release-Engineering-Team (Kanban), Scap, Beta-Cluster-Infrastructure
hashar added a comment to T204762: On deployment-prep scap cache_git_info takes 12 minutes (that is too slow).

I wanted some historical build durations. I took the IRC logs from https://wm-bot.wmflabs.org/logs/%23wikimedia-releng/ and looked for the irc notification when beta-scap-eqiad is fixed.

Tue, Sep 25, 11:42 AM · Release-Engineering-Team (Kanban), Scap, Beta-Cluster-Infrastructure
hashar closed T182412: Investigate if WebdriverIO `sync: false` would be useful to us and document how to use it as Declined.

According to http://webdriver.io/guide/getstarted/upgrade.html, webdriver.io 3 ran tests asynchronously, while webdriver.io 4 now runs them synchronously. The sync: false option is a feature flag to ease the transition/upgrade process, the doc stating:

Tue, Sep 25, 11:08 AM · MediaWiki-Core-Tests, User-zeljkofilipin
hashar closed T182412: Investigate if WebdriverIO `sync: false` would be useful to us and document how to use it, a subtask of T190995: Someday/maybe Selenium framework improvements, as Declined.
Tue, Sep 25, 11:08 AM · Release-Engineering-Team (Someday), MediaWiki-Core-Tests, User-zeljkofilipin
zeljkofilipin awarded T198201: Quibble must include tests/selenium/LocalSettings.php (was ORES webdriver.io selenium test fail on CI due to lack of ORES server) a Insectivore token.
Tue, Sep 25, 11:06 AM · MW-1.32-release-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Scoring-platform-team, Release-Engineering-Team (Kanban), Quibble, User-Ladsgroup, User-zeljkofilipin, Patch-For-Review, Browser-Tests, ORES
hashar added a comment to T204762: On deployment-prep scap cache_git_info takes 12 minutes (that is too slow).

Even funnier, I wanted to trace the execution of scap sync using the python trace module. The git caching took less than a minute!

$ ssh deployment-deploy01.deployment-prep.eqiad.wmflabs
$ sudo su - jenkins-deploy
$ cd /srv/mediawiki-staging
$ scap sync
10:44:34 Started scap: (no justification provided)
...
10:44:40 Started cache_git_info
10:45:35 Finished cache_git_info (duration: 00m 54s)
...
10:47:00 Finished scap: (no justification provided) (duration: 02m 26s)
Tue, Sep 25, 10:49 AM · Release-Engineering-Team (Kanban), Scap, Beta-Cluster-Infrastructure
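
For readers unfamiliar with the Python trace module mentioned in the comment above, here is a minimal sketch of how it can wrap a function call. The slow_step function is a hypothetical stand-in and is not scap code; how scap sync was actually traced is not shown in the feed.

```python
import trace


def slow_step(n):
    """Hypothetical stand-in for a step one wants to inspect."""
    total = 0
    for i in range(n):
        total += i
    return total


# count=True records how often each line runs; trace=False avoids
# printing every executed line to stdout.
tracer = trace.Trace(count=True, trace=False)
result = tracer.runfunc(slow_step, 5)

# counts maps (filename, lineno) to the number of executions.
counts = tracer.results().counts
executed_lines = sum(counts.values())
print(result, executed_lines)
```

From the command line, a whole script can be traced in a similar way with python -m trace --trace script.py.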
hashar added a comment to T191068: 1.32.0-wmf.22 deployment blockers.

Train report published on Wikitech: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180918-train

Tue, Sep 25, 10:24 AM · User-zeljkofilipin, Release-Engineering-Team (Kanban), Release, Train Deployments
hashar moved T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.
Tue, Sep 25, 10:23 AM · Patch-For-Review, Wikimedia-Incident, User-Ladsgroup, Scoring-platform-team (Current), ORES
hashar added a project to T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out: Wikimedia-Incident.
Tue, Sep 25, 10:23 AM · Patch-For-Review, Wikimedia-Incident, User-Ladsgroup, Scoring-platform-team (Current), ORES
hashar moved T204907: Scap is checking canary servers in dormant instead of active-dc from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.
Tue, Sep 25, 10:22 AM · Release-Engineering-Team (Watching / External), Wikimedia-Incident, Operations, Datacenter-Switchover-2018, Scap
hashar edited projects for T204907: Scap is checking canary servers in dormant instead of active-dc , added: Operations, Wikimedia-Incident; removed Patch-For-Review.
Tue, Sep 25, 10:22 AM · Release-Engineering-Team (Watching / External), Wikimedia-Incident, Operations, Datacenter-Switchover-2018, Scap
hashar updated the task description for T204907: Scap is checking canary servers in dormant instead of active-dc .
Tue, Sep 25, 10:21 AM · Release-Engineering-Team (Watching / External), Wikimedia-Incident, Operations, Datacenter-Switchover-2018, Scap
hashar renamed T204871: Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out" from Promoting group1 to 1.32.0-wmf.22 caused a spam of web request took longer than 60 seconds and timed out to Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out".
Tue, Sep 25, 10:18 AM · Wikimedia-Incident, Release-Engineering-Team, Wikimedia-production-error
hashar moved T204871: Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out" from To Triage to Active Situation on the Wikimedia-Incident board.
Tue, Sep 25, 10:18 AM · Wikimedia-Incident, Release-Engineering-Team, Wikimedia-production-error
hashar added a project to T204871: Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out": Wikimedia-Incident.
Tue, Sep 25, 10:17 AM · Wikimedia-Incident, Release-Engineering-Team, Wikimedia-production-error

Mon, Sep 24

hashar added a comment to T198412: Document which per-patch/daily Jenkins job is running for repositories with Ruby/Node.js Selenium tests.

I think that followed one of our 1:1 discussions, in which I was confused by all the jobs that are named alike but serve different purposes (per patch or daily) or use different frameworks (mediawiki_selenium or wdio).

Mon, Sep 24, 5:47 PM · Release-Engineering-Team (Kanban), User-zeljkofilipin, MediaWiki-Core-Tests
hashar added a comment to T179963: Workspaces for mwgate-php55lint / mwgate-php70lint are getting huge.

Yup, that works most of the time, but --depth 2 is not sufficient when a chain of patchsets is being tested. I guess that is what prompted the revert.

Mon, Sep 24, 5:43 PM · Wikimedia-production-error (Shared Build Failure), Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure
hashar moved T180171: ArticlePlaceholder should use MediaWiki qunit runner from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.
Mon, Sep 24, 4:54 PM · Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Technical-Debt, Wikidata, ArticlePlaceholder
hashar placed T198201: Quibble must include tests/selenium/LocalSettings.php (was ORES webdriver.io selenium test fail on CI due to lack of ORES server) up for grabs.

The test has been backed up; the reason is that Quibble does not include tests/browser/LocalSettings.php, and it also runs wdio tests from mediawiki/core. There is work in progress to phase out tests/browser/LocalSettings.php and to run wdio directly from the root of the extension.

Mon, Sep 24, 4:53 PM · MW-1.32-release-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Scoring-platform-team, Release-Engineering-Team (Kanban), Quibble, User-Ladsgroup, User-zeljkofilipin, Patch-For-Review, Browser-Tests, ORES
hashar placed T119371: Enable MariaDB/MySQL strict mode on CI db hosts up for grabs.

For now CI has sql_mode = 'TRADITIONAL'. Unfortunately I lack the time to think through and babysit the addition of ONLY_FULL_GROUP_BY.

Mon, Sep 24, 4:51 PM · Release-Engineering-Team (Kanban), Quibble, MediaWiki-Database, DBA, Continuous-Integration-Infrastructure
hashar placed T183513: Draft requirements for a Kubernetes based solution for CI up for grabs.
Mon, Sep 24, 4:44 PM · Kubernetes, Continuous-Integration-Infrastructure (shipyard), Release-Engineering-Team (Kanban)
hashar moved T203084: Migrate DonationInterface CI job to Docker from Backlog to Blocked (externally) on the Release-Engineering-Team (Kanban) board.
Mon, Sep 24, 4:43 PM · Patch-For-Review, MediaWiki-extensions-DonationInterface, Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure (shipyard)
hashar moved T196347: Quibble may need to rebuild localization cache before running tests from Backlog to Done (within RelEng) on the Release-Engineering-Team (Kanban) board.
Mon, Sep 24, 4:43 PM · Release-Engineering-Team (Kanban), Patch-For-Review, MediaWiki-Core-Tests, Quibble
hashar added a comment to T196347: Quibble may need to rebuild localization cache before running tests.

Pending on MediaWiki-Installer patch https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/446899/

Mon, Sep 24, 4:42 PM · Release-Engineering-Team (Kanban), Patch-For-Review, MediaWiki-Core-Tests, Quibble
hashar placed T179963: Workspaces for mwgate-php55lint / mwgate-php70lint are getting huge up for grabs.
Mon, Sep 24, 4:41 PM · Wikimedia-production-error (Shared Build Failure), Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure
hashar placed T165540: Add CODE_OF_CONDUCT.md to Wikimedia repositories up for grabs.
Mon, Sep 24, 4:40 PM · Release-Engineering-Team (Watching / External), User-Tgr, MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), Repository-Admins, Gerrit, GitHub-Mirrors, Developer-Advocacy
hashar committed rQUIBBLE1ffb798b2e8d: Parallelize mediawiki/core linter (authored by hashar).
Parallelize mediawiki/core linter
Mon, Sep 24, 3:57 PM
hashar committed rQUIBBLEafc03b9b00e1: Parallelize ext/skin linter (authored by hashar).
Parallelize ext/skin linter
Mon, Sep 24, 3:57 PM
hashar committed rQUIBBLEfb0171d084ae: Decouple composer/npm run for extskin (authored by hashar).
Decouple composer/npm run for extskin
Mon, Sep 24, 3:57 PM
hashar committed rQUIBBLE10b115e57ffb: Early return in extskin composer/npm test commands (authored by hashar).
Early return in extskin composer/npm test commands
Mon, Sep 24, 3:57 PM
hashar updated the task description for T205170: Archive marvin.
Mon, Sep 24, 1:56 PM · Readers-Web-Backlog (Tracking), Patch-For-Review, User-MarcoAurelio, Marvin, GitHub-Mirrors, Repository-Admins, Cleanup
hashar archived Marvin.
Mon, Sep 24, 1:54 PM
hashar edited Description on Marvin.
Mon, Sep 24, 1:54 PM
hashar added a comment to T180318: Add CI to all labs/tools/* repositories and archive obsolete ones.

Meanwhile, in Gerrit I have granted permissions to project owners to submit changes ( https://gerrit.wikimedia.org/r/#/c/labs/tools/+/462420/ ).

Mon, Sep 24, 10:56 AM · Tools, Patch-For-Review, Gerrit, Continuous-Integration-Config
hashar updated the task description for T180318: Add CI to all labs/tools/* repositories and archive obsolete ones.
Mon, Sep 24, 10:47 AM · Tools, Patch-For-Review, Gerrit, Continuous-Integration-Config
hashar added a comment to T204762: On deployment-prep scap cache_git_info takes 12 minutes (that is too slow).

The spam of close(x) = -1 EBADF (Bad file descriptor) comes from a call to os.closerange

scap/sh.py
1679 class OProc(object):
Mon, Sep 24, 10:39 AM · Release-Engineering-Team (Kanban), Scap, Beta-Cluster-Infrastructure
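
The os.closerange behaviour described in the comment above can be sketched as follows. This is an illustration only, not the code from scap/sh.py: the range bounds and the helper names is_open and free_range are assumptions chosen so the example only touches descriptors it created itself. Newer Python interpreters may batch the operation into a single system call, in which case the per-descriptor EBADF lines would not appear under strace.

```python
import os


def is_open(fd):
    """Return True if fd refers to an open file descriptor."""
    try:
        os.fstat(fd)
    except OSError:
        return False
    return True


def free_range(start=100, length=16):
    """Find a run of descriptor numbers that are all currently unused."""
    while any(is_open(fd) for fd in range(start, start + length)):
        start += length
    return start, start + length


read_fd, write_fd = os.pipe()
low, high = free_range()
os.dup2(read_fd, low)  # one valid descriptor inside an otherwise unused range

# closerange() closes every descriptor in [low, high) and ignores errors,
# so numbers that were never open are simply skipped without an exception.
os.closerange(low, high)

still_open = is_open(low)
os.close(read_fd)
os.close(write_fd)
print(still_open)
```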
hashar added a comment to T204762: On deployment-prep scap cache_git_info takes 12 minutes (that is too slow).

I ended up manually hacking scap on deployment-deploy01 https://integration.wikimedia.org/ci/job/beta-scap-eqiad/222901/console:

00:00:06.979 09:11:21 cache_git_info( version=master ) wikidb is dewiktionary
00:00:07.705 09:11:22 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info.json
00:00:08.447 09:11:23 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-WikibaseQualityExternalValidation.json
00:00:09.177 09:11:23 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-SubpageFun.json
00:00:09.934 09:11:24 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-Ads.json
00:00:10.697 09:11:25 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-HierarchyBuilder.json
00:00:11.434 09:11:26 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-FixedHeaderTable.json
00:00:12.197 09:11:26 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-WikiCategoryTagCloud.json
00:00:12.942 09:11:27 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-BlueSpiceNamespaceManager.json
00:00:13.667 09:11:28 Writing cache file /srv/mediawiki-staging/php-master/cache/gitinfo/info-extensions-CloneDiff.json
...
Mon, Sep 24, 9:19 AM · Release-Engineering-Team (Kanban), Scap, Beta-Cluster-Infrastructure
hashar added a comment to T204762: On deployment-prep scap cache_git_info takes 12 minutes (that is too slow).

I gave it a quick try on deployment-deploy01, reproducing scap.tasks.cache_git_info(). Namely, getting the extensions and skins git information with scap.git.info and writing the JSON to a dummy file takes roughly 40 seconds:

/home/hashar/scap_git.py
#!/usr/bin/python
Mon, Sep 24, 8:55 AM · Release-Engineering-Team (Kanban), Scap, Beta-Cluster-Infrastructure
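
The script referenced in the comment above is truncated in this feed. As a rough sketch of the same kind of measurement, and not a reconstruction of that script, the harness below times writing one JSON file per repository. The function git_head_info is a hypothetical stand-in for scap.git.info that only shells out to git rev-parse, the file naming is simplified, and the repository paths in the demo are made up.

```python
import json
import os
import subprocess
import tempfile
import time


def git_head_info(repo_dir):
    """Hypothetical stand-in for scap.git.info: only the HEAD sha1."""
    sha1 = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()
    return {"head": sha1, "directory": repo_dir}


def time_cache_write(repo_dirs, out_dir, info_fn=git_head_info):
    """Write one JSON file per repository and return the elapsed seconds."""
    start = time.monotonic()
    for repo_dir in repo_dirs:
        name = os.path.basename(os.path.normpath(repo_dir))
        with open(os.path.join(out_dir, "info-%s.json" % name), "w") as f:
            json.dump(info_fn(repo_dir), f)
    return time.monotonic() - start


# Demo with a fake info function so no git checkout is needed.
with tempfile.TemporaryDirectory() as out_dir:
    fake_repos = ["/srv/example/extensions/Foo", "/srv/example/skins/Bar"]
    elapsed = time_cache_write(
        fake_repos, out_dir, info_fn=lambda d: {"directory": d}
    )
    written = sorted(os.listdir(out_dir))
print(written, "%.3fs" % elapsed)
```

Passing real checkout directories and leaving info_fn at its default would time the git calls as well.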

Fri, Sep 21

hashar added a comment to T204077: maintenance-disconnect-full-disks gets stuck.

Timeout wrapping solved it I guess. Thank you!

Fri, Sep 21, 9:40 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure
hashar awarded T153281: webgrid-lighttpd queues kill OOM jobs with SIGKILL leaving php-cgi processes behind a Love token.
Fri, Sep 21, 9:35 PM · Toolforge
hashar closed T128357: Beta cluster job queue is unmonitored / potentially not running all jobs as Invalid.

That is no longer accurate. Background job processing has been overhauled in the meantime.

Fri, Sep 21, 9:29 PM · monitoring, WMF-JobQueue, Beta-Cluster-Infrastructure
hashar closed T128357: Beta cluster job queue is unmonitored / potentially not running all jobs, a subtask of T53497: Setup monitoring for Beta Cluster (tracking), as Invalid.
Fri, Sep 21, 9:29 PM · monitoring, Tracking, Beta-Cluster-Infrastructure
hashar closed T75455: .flow-post color override div#content color as Declined.

Flow is in maintenance mode anyway, so that is unlikely to ever be fixed. I am no longer interested in it.

Fri, Sep 21, 9:28 PM · Growth-Team, Collaboration-Team-Triage, StructuredDiscussions
hashar edited projects for T161227: Prometheus graph incorrectly sums CPU user and CPU guest, added: monitoring; removed Prometheus-metrics-monitoring.

That is still valid. https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats sums all CPU metrics but guest (at least) is part of user.

Fri, Sep 21, 9:27 PM · monitoring, Graphite
hashar closed T188950: Blubber should support ruby/rake as Declined.

We can reopen / refile later on. For now we barely have any Ruby code; the few pieces I am aware of are jsduck, which is being migrated to JSDoc (JavaScript based), and the mediawiki_selenium tests, which are obsolete and to be replaced by wdio.

Fri, Sep 21, 9:23 PM · Release Pipeline (Blubber)
hashar closed T88239: Use dedicated jobs for mediawiki-core wmf branches that honour submodules as Declined.

After some years, I don't think it will serve much purpose. Some MediaWiki extensions are incompatible with each other, and some only make sense for specific wikis.

Fri, Sep 21, 9:22 PM · Continuous-Integration-Infrastructure
hashar closed T154482: scap3 allows plugins to declare dependencies we do not support as Invalid.
Fri, Sep 21, 9:19 PM · Release-Engineering-Team (Someday), Scap, Deployments, scap2
hashar closed T174584: On contint-operations-puppet apt-get yields about locale as Declined.
Fri, Sep 21, 9:16 PM · Release-Engineering-Team (Someday), Continuous-Integration-Infrastructure
hashar closed T161084: On beta enable swift statsd metric, a subtask of T160990: deployment-ms-be03.deployment-prep and deployment-ms-be04.deployment-prep have high load / system CPU, as Declined.
Fri, Sep 21, 9:13 PM · RelEng-Archive-FY201718-Q1, Patch-For-Review, media-storage, Beta-Cluster-Infrastructure
hashar closed T161084: On beta enable swift statsd metric as Declined.

We don't really monitor Swift on beta.

Fri, Sep 21, 9:13 PM · media-storage, Beta-Cluster-Infrastructure
hashar closed T51881: NaturalLanguageList parser tests do not pass on HHVM as Declined.

The extension does not seem to have received many code updates over the last 5 years, and HHVM is legacy.

Fri, Sep 21, 9:12 PM · HHVM, MediaWiki-extensions-Other
hashar closed T165540: Add CODE_OF_CONDUCT.md to Wikimedia repositories as Resolved.

The Code of Conduct is already listed everywhere (eg Phabricator, soon Gerrit etc). It has been added to all mediawiki extensions and skins.

Fri, Sep 21, 9:07 PM · Release-Engineering-Team (Watching / External), User-Tgr, MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)), Repository-Admins, Gerrit, GitHub-Mirrors, Developer-Advocacy
hashar closed T166531: A tool queries urwiki recentchanges 6 times per second as Resolved.

Patched to throttle. The script is probably no longer running nowadays.

Fri, Sep 21, 8:15 PM · Tools, Toolforge, Cloud-Services
hashar closed T161118: Investigate instances with high "steal" CPU as Resolved.

I had filed this task to understand what high steal CPU is. It is outside the control of the instances. There is nothing actionable here :]

Fri, Sep 21, 8:09 PM · Cloud-VPS, Cloud-Services
hashar closed T161118: Investigate instances with high "steal" CPU, a subtask of T161006: Convince nova-scheduler to pay attention to CPU metrics, as Resolved.
Fri, Sep 21, 8:09 PM · Patch-For-Review, Cloud-VPS, Cloud-Services
hashar added a comment to T132880: tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly.

I have added a link to the task description to check the busiest webgrid-lighttpd instance: https://graphite-labs.wikimedia.org/render/?width=648&height=396&_salt=1537559933.891&hideLegend=false&target=cactiStyle(highestAverage(tools.*webgrid-lighttpd*.cpu.total.user%2C8))&from=-1hours

Fri, Sep 21, 8:02 PM · Tools
hashar updated the task description for T132880: tools.jembot PHP processes run out of memory and leave orphan php-cgi processes regularly.
Fri, Sep 21, 7:59 PM · Tools
hashar committed rQUIBBLE4d52686213ab: Early return in extskin composer/npm test commands (authored by hashar).
Early return in extskin composer/npm test commands
Fri, Sep 21, 5:14 PM
hashar committed rQUIBBLEe2ffec874d39: Decouple composer/npm run for extskin (authored by hashar).
Decouple composer/npm run for extskin
Fri, Sep 21, 5:14 PM
hashar added a comment to T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out.

That is not a blocker to the train per se. I have filed it as a subtask because the train triggers the issue. It seems to be transient and to resolve itself after a short amount of time.

Fri, Sep 21, 10:53 AM · Patch-For-Review, Wikimedia-Incident, User-Ladsgroup, Scoring-platform-team (Current), ORES
hashar added a comment to T199133: Find top 15 target projects that could use Selenium tests to prevent incidents.

Another note from a discussion with Zeljko: we have a third party gathering git/gerrit metrics. They store it in ElasticSearch and use Kibana for dashboarding: https://wikimedia.biterg.io/

Fri, Sep 21, 10:25 AM · Release-Engineering-Team (Kanban), User-zeljkofilipin
hashar added a comment to T199133: Find top 15 target projects that could use Selenium tests to prevent incidents.

A quick and dirty way to get the busiest repos per number of commits is to use git log on mediawiki/extensions:

Fri, Sep 21, 10:24 AM · Release-Engineering-Team (Kanban), User-zeljkofilipin
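
The command itself is elided in the comment above (it lives in paste P7578). As a hedged sketch of the general idea, and not the content of that paste, the snippet below counts how often each path shows up in git log --name-only output. It assumes that every extension appears as one path in the mediawiki/extensions repository, so that a change to the extension shows up as a change to that path; the function names and the sample data are made up.

```python
import subprocess
from collections import Counter


def count_changed_paths(log_output):
    """Count how often each path appears in `git log --name-only` output."""
    return Counter(
        line.strip() for line in log_output.splitlines() if line.strip()
    )


def busiest_paths(repo_dir, since="1 year ago", top=15):
    """Run git log in repo_dir and return the most frequently changed paths."""
    output = subprocess.check_output(
        ["git", "log", "--since=" + since, "--format=", "--name-only"],
        cwd=repo_dir,
        text=True,
    )
    return count_changed_paths(output).most_common(top)


# Made-up sample of what the git output could look like.
sample = "Echo\n\nFlow\nEcho\n\nVisualEditor\nEcho\n"
sample_counts = count_changed_paths(sample)
print(sample_counts.most_common(1))
```

Calling busiest_paths on a local checkout would produce the ranking for the last year.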
hashar created P7578 Busiest extensions by number of commits over one year - T199133.
Fri, Sep 21, 10:22 AM
hashar updated subscribers of T203749: marvin: critical severity security vulnerability detected in macaddress < 0.2.9 defined in package-lock.json. .

It seems Stephen and Joaquin did the last patches on marvin.git, the last ones being from ~ January 2018. Is Marvin still a thing or should we look at archiving the git repository / phabricator project etc?

Fri, Sep 21, 9:26 AM · Marvin, Security

Thu, Sep 20

hashar closed T191068: 1.32.0-wmf.22 deployment blockers as Resolved.

1.32.0-wmf.22 has been rolled to all wikis and seems all fine.

Thu, Sep 20, 9:15 PM · User-zeljkofilipin, Release-Engineering-Team (Kanban), Release, Train Deployments
hashar added a comment to T191921: mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5).

Great! \o/

Thu, Sep 20, 8:14 PM · Patch-For-Review, Operations, Release-Engineering-Team (Kanban), Scap
hashar added a comment to T204681: Keyholder retrieves all groups on each connection which is slow with LDAP.

TLDR: Python grp.getgrall uses POSIX getgrent, which in libc has not been made aware of nscd. Hence there is no cache and an LDAP query is made each time....

Thu, Sep 20, 7:28 PM · Operations-Software-Development
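
To make the difference concrete, here is a small sketch contrasting a full enumeration with grp.getgrall() against a lookup limited to one user via os.getgrouplist(). It is only an illustration of the two access patterns: it is not keyholder's code, the function names are made up, and whether the targeted lookup is actually served from a cache depends on the NSS configuration of the host.

```python
import grp
import os
import pwd


def gids_via_getgrall(username):
    """Enumerate every group (getgrent under the hood) and filter locally."""
    primary = pwd.getpwnam(username).pw_gid
    gids = {g.gr_gid for g in grp.getgrall() if username in g.gr_mem}
    gids.add(primary)
    return gids


def gids_via_getgrouplist(username):
    """Ask only for this user's groups instead of enumerating all of them."""
    primary = pwd.getpwnam(username).pw_gid
    return set(os.getgrouplist(username, primary))


user = pwd.getpwuid(0).pw_name  # uid 0 exists on typical Unix systems
print(sorted(gids_via_getgrall(user)), sorted(gids_via_getgrouplist(user)))
```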
hashar added a comment to T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out.

Based on Grafana, it seems some ORES servers got overloaded and some requests timed out. That was a short outage though, and apparently everything works fine now. To be investigated though.

Thu, Sep 20, 3:03 PM · Patch-For-Review, Wikimedia-Incident, User-Ladsgroup, Scoring-platform-team (Current), ORES
hashar added a comment to T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out.

Though if I try a reported URL manually, it seems to work. Tried on deploy1001, deploy2001 and mw2219

Thu, Sep 20, 2:55 PM · Patch-For-Review, Wikimedia-Incident, User-Ladsgroup, Scoring-platform-team (Current), ORES
hashar added a subtask for T191068: 1.32.0-wmf.22 deployment blockers: T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out.
Thu, Sep 20, 2:54 PM · User-zeljkofilipin, Release-Engineering-Team (Kanban), Release, Train Deployments
hashar added a parent task for T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out: T191068: 1.32.0-wmf.22 deployment blockers.
Thu, Sep 20, 2:54 PM · Patch-For-Review, Wikimedia-Incident, User-Ladsgroup, Scoring-platform-team (Current), ORES
hashar created T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out.
Thu, Sep 20, 2:53 PM · Patch-For-Review, Wikimedia-Incident, User-Ladsgroup, Scoring-platform-team (Current), ORES
hashar added a comment to T204871: Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out".

Same happened with group2:

Thu, Sep 20, 2:44 PM · Wikimedia-Incident, Release-Engineering-Team, Wikimedia-production-error
hashar added a comment to T204907: Scap is checking canary servers in dormant instead of active-dc .

After discussion with Alexandros and Giuseppe, for now we have just updated the list of hosts in the dsh files. Now we have:

14:28:03 Finished Canaries Synced (duration: 00m 03s)
14:28:03 Executing check 'Check endpoints for mw2218.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2217.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2226.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2225.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mwdebug2002.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2216.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2215.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mwdebug2001.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2227.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2224.codfw.wmnet'
14:28:03 Executing check 'Check endpoints for mw2228.codfw.wmnet'
14:28:08 Finished Canary Endpoint Check Complete (duration: 00m 04s)
14:28:08 Waiting for canary traffic...
14:28:23 Executing check 'Logstash Error rate for mw2218.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2217.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2226.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2225.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mwdebug2002.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2216.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2215.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mwdebug2001.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2227.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2224.codfw.wmnet'
14:28:23 Executing check 'Logstash Error rate for mw2228.codfw.wmnet'
14:28:24 Finished sync-check-canaries (duration: 00m 24s)
Thu, Sep 20, 2:29 PM · Release-Engineering-Team (Watching / External), Wikimedia-Incident, Operations, Datacenter-Switchover-2018, Scap
hashar added a comment to T201675: Create new namespaces in zhwikiversity.

And after the script has been run with --fix and --prefix=T201675, what is left is:

$ mwscript namespaceDupes.php zhwikiversity
0 pages to fix, 0 were resolvable.
Thu, Sep 20, 12:26 PM · Patch-For-Review, User-Urbanecm, Wikimedia-Site-requests, Chinese-Sites
hashar closed T204919: tox-docker fails on keyholder due to pycrypto as Resolved.

/tmp was mounted as a tmpfs, which comes with the noexec flag. I guess autoconf compiles a binary and then tries to execute it to validate that the C compiler works, but the noexec flag prevents the execution and the test fails.

Thu, Sep 20, 11:27 AM · Operations-Software-Development, Continuous-Integration-Config
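
The failure mode guessed at in the comment above can be probed with a short script that writes an executable file into a directory and tries to run it. This is a sketch of the idea only; the function name can_execute_in is made up, and it uses a shell script rather than a compiled binary as autoconf would.

```python
import os
import stat
import subprocess
import tempfile


def can_execute_in(directory):
    """Return True if a freshly written script can be run from directory."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".sh")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("#!/bin/sh\nexit 0\n")
        os.chmod(path, stat.S_IRWXU)
        try:
            return subprocess.run([path]).returncode == 0
        except OSError:
            # On a noexec mount the kernel refuses to execute the file.
            return False
    finally:
        os.unlink(path)


tmp_is_executable = can_execute_in(tempfile.gettempdir())
print(tmp_is_executable)
```

Inside a container started with a noexec /tmp, the probe would be expected to print False.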
hashar added a comment to T203181: Quibble MariaDB should use a tmpfs as a datadir.

contint1001$ grep -l '\--tmpfs /tmp' /var/lib/jenkins/jobs/*/config.xml|cut -d\/ -f1
contint1001$

Thu, Sep 20, 11:22 AM · Patch-For-Review, Regression, Release-Engineering-Team, Continuous-Integration-Infrastructure (shipyard), Quibble
hashar added a comment to T203181: Quibble MariaDB should use a tmpfs as a datadir.

I have abandoned the change https://gerrit.wikimedia.org/r/#/c/integration/config/+/457070/ which was passing --tmpfs /tmp to docker run, but forgot to redeploy the Jenkins job. That has side effects besides mjolnir, such as T204919, which is pycrypto not being installable.

Thu, Sep 20, 11:20 AM · Patch-For-Review, Regression, Release-Engineering-Team, Continuous-Integration-Infrastructure (shipyard), Quibble
hashar added a comment to T204919: tox-docker fails on keyholder due to pycrypto.

The --tmpfs /tmp comes from abandoned change https://gerrit.wikimedia.org/r/#/c/integration/config/+/457070/ we did for T203181

Thu, Sep 20, 11:18 AM · Operations-Software-Development, Continuous-Integration-Config
hashar added a comment to T204919: tox-docker fails on keyholder due to pycrypto.

I could not reproduce the issue, so I went with live debugging in an attempt to get the autoconf log file, which is somewhere under /tmp.

Thu, Sep 20, 11:01 AM · Operations-Software-Development, Continuous-Integration-Config
hashar added a comment to T199116: Quibble should run `npm install` and `npm run selenium-test` for each extension/skin that has Selenium tests.

The logic is in Quibble's quibble/cmd.py; roughly what it does is:

Thu, Sep 20, 10:11 AM · MediaWiki-Core-Tests, Release-Engineering-Team (Kanban), Quibble, User-zeljkofilipin
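
The description of the current logic is elided in the comment above, so the following does not reproduce quibble/cmd.py. It is only a sketch of what the task title asks for: find the projects whose package.json declares a selenium-test script and list the npm commands to run for each. All function names and the demo directories are hypothetical.

```python
import json
import os
import tempfile


def has_npm_script(project_dir, script="selenium-test"):
    """Return True if project_dir/package.json declares the given script."""
    package_json = os.path.join(project_dir, "package.json")
    if not os.path.exists(package_json):
        return False
    with open(package_json) as f:
        return script in json.load(f).get("scripts", {})


def selenium_commands(project_dirs):
    """List (directory, command) pairs for projects with Selenium tests."""
    commands = []
    for project_dir in project_dirs:
        if has_npm_script(project_dir):
            commands.append((project_dir, ["npm", "install"]))
            commands.append((project_dir, ["npm", "run", "selenium-test"]))
    return commands


# Demo with two made-up projects, only one of which has the script.
with tempfile.TemporaryDirectory() as root:
    with_tests = os.path.join(root, "WithTests")
    without_tests = os.path.join(root, "WithoutTests")
    os.mkdir(with_tests)
    os.mkdir(without_tests)
    with open(os.path.join(with_tests, "package.json"), "w") as f:
        json.dump({"scripts": {"selenium-test": "wdio"}}, f)
    planned = selenium_commands([with_tests, without_tests])
    planned_names = [
        (os.path.basename(directory), command)
        for directory, command in planned
    ]
print(planned_names)
```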
hashar added a comment to T204919: tox-docker fails on keyholder due to pycrypto.

At least the pycrypto install works:

$ docker pull docker-registry.wikimedia.org/releng/tox:0.1.1
$ docker run --rm -it --entrypoint=/bin/bash docker-registry.wikimedia.org/releng/tox:0.1.1
$ pip3 install --target . pycrypto
Thu, Sep 20, 9:04 AM · Operations-Software-Development, Continuous-Integration-Config
hashar created T204919: tox-docker fails on keyholder due to pycrypto.
Thu, Sep 20, 8:43 AM · Operations-Software-Development, Continuous-Integration-Config

Wed, Sep 19

hashar added a comment to T97192: HHVM request timeouts not working; support lowering the API request timeout per request.

While promoting 1.32.0-wmf.22 yesterday and today, I noticed a spike of "web request took longer than 60 seconds and timed out" errors. I suspect it always happened and is only showing up now thanks to the patches above. I have filed the issue as T204871.

Wed, Sep 19, 7:57 PM · User-notice, Performance-Team (Radar), Patch-For-Review, User-Joe, Operations, Services (watching), Wikimedia-Incident, HHVM, Availability, MediaWiki-API
hashar added a project to T204871: Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out": Release-Engineering-Team.

Web request timeouts have been enforced since September 10th (see T97192#4561879 and https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090803.html ). We have not deployed a train since then.

Wed, Sep 19, 7:56 PM · Wikimedia-Incident, Release-Engineering-Team, Wikimedia-production-error
hashar added a comment to T191068: 1.32.0-wmf.22 deployment blockers.

I noticed a large spike of web request timeouts when promoting group1. It eventually resolved by itself. Filed as T204871.

Wed, Sep 19, 7:40 PM · User-zeljkofilipin, Release-Engineering-Team (Kanban), Release, Train Deployments
hashar created T204871: Deployments of MediaWiki with scap cause a spam of "web request took longer than 60 seconds and timed out".
Wed, Sep 19, 7:39 PM · Wikimedia-Incident, Release-Engineering-Team, Wikimedia-production-error
hashar created T204864: [1.32.0-wmf.22] includes/GlobalPreferencesFactory.php: PHP Notice: Undefined index: section.
Wed, Sep 19, 7:00 PM · MW-1.32-release-notes (WMF-deploy-2018-09-18 (1.32.0-wmf.22)), Wikimedia-production-error, Community-Tech, MediaWiki-extensions-GlobalPreferences
hashar added a comment to T204826: EchoForeignWikiRequest emits PHP error "Undefined index: query".

T204758 is similar and got fixed earlier.

Wed, Sep 19, 6:39 PM · Notifications, Wikimedia-production-error, Growth-Team
hashar closed T204758: EchoForeignWikiRequest emits PHP error "Undefined index: csrftoken" as Resolved.

It is gone from logstash. The last entry was at 2018-09-19T16:43:21 UTC.

Wed, Sep 19, 6:37 PM · MW-1.32-release-notes (WMF-deploy-2018-09-18 (1.32.0-wmf.22)), Growth-Team (Current Sprint), Patch-For-Review, Notifications, Wikimedia-production-error
hashar closed T204758: EchoForeignWikiRequest emits PHP error "Undefined index: csrftoken", a subtask of T191068: 1.32.0-wmf.22 deployment blockers, as Resolved.
Wed, Sep 19, 6:37 PM · User-zeljkofilipin, Release-Engineering-Team (Kanban), Release, Train Deployments
hashar renamed T204681: Keyholder retrieves all groups on each connection which is slow with LDAP from Keyholder retrives all groups on each connection which is slow with LDAP to Keyholder retrieves all groups on each connection which is slow with LDAP.
Wed, Sep 19, 5:13 PM · Operations-Software-Development
hashar renamed T204681: Keyholder retrieves all groups on each connection which is slow with LDAP from [cumin] on labs project, nslcd causes cumin to be very slow to Keyholder retrives all groups on each connection which is slow with LDAP.
Wed, Sep 19, 5:13 PM · Operations-Software-Development
hashar reopened T204681: Keyholder retrieves all groups on each connection which is slow with LDAP as "Open".

That is from keyholder apparently. At some point it retrieves all groups with grp.getgrall(), which requests everything from LDAP.

Wed, Sep 19, 5:07 PM · Operations-Software-Development
hashar added a comment to T204681: Keyholder retrieves all groups on each connection which is slow with LDAP.

AuthorizedKeysCommand is for sshd, so that would indeed trigger a run of /usr/sbin/ssh-key-ldap-lookup on each of the clients. But the nslcd CPU usage is on the cumin instance integration-cumin; I will dig in to figure out what is causing all those LDAP queries.

Wed, Sep 19, 11:29 AM · Operations-Software-Development
hashar added a comment to T183831: wikidata/query/gui CI job lacks PhantomJS / proper browsers.

It still happens. The CI job uses the container docker-registry.wikimedia.org/releng/npm-browser-test:0.1.2, which has PhantomJS. The npm package phantomjs-prebuilt apparently finds it:

> phantomjs-prebuilt@2.1.16 install /src/node_modules/phantomjs-prebuilt
> node install.js
Wed, Sep 19, 7:52 AM · Wikidata-Campsite, Wikidata, Patch-For-Review, Continuous-Integration-Infrastructure, Wikidata Query UI, Jenkins
hashar added a comment to T191068: 1.32.0-wmf.22 deployment blockers.

I have removed subtasks related to ActorTableSchemaMigrationStage which was enabled on Monday 09/17. That is part of a feature flagged migration (T188327) and is not train related.

Wed, Sep 19, 7:31 AM · User-zeljkofilipin, Release-Engineering-Team (Kanban), Release, Train Deployments
hashar updated subscribers of T204767: Special:ActiveUsers fails with "Error: 1054 Unknown column 'rc_actor' in 'on clause' (10.192.32.110)".

The error no longer shows up; ActorTableSchemaMigrationStage, deployed on 09/17, has been rolled back:

Wed, Sep 19, 7:31 AM · Patch-For-Review, Wikimedia-production-error, MediaWiki-Special-pages