Based in Nantes, France (CET/CEST, UTC+1/UTC+2)
Thanks! As far as I know, we will continue on Monday.
I noticed this task yesterday and wanted to elaborate but was busy with other things. I believe it should be reopened because Phabricator has an issue in how it translates dates. I gave it a try by setting the language to French and explicitly setting all the date/time preferences:
Rolled back due to an incompatibility in the ParserOutput class signature between 1.36.0-wmf.18 and 1.36.0-wmf.20. That issue has often hit us in the past and is tricky enough that we keep running into it from time to time. The task is being worked on by all the appropriate people, so it is on track, though I anticipate it will take a bit of time to address it or find a workaround and get the train unblocked.
Just for context, serializing/unserializing with different class signatures has been crippling us for a while:
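To illustrate the failure mode (a toy sketch, not MediaWiki's actual ParserOutput class or its cache keys): PHP's unserialize() will happily restore a blob written against an older class signature, leaving the object in an inconsistent shape:

```php
<?php
// Version A of the class (say, what wmf.18 shipped) had a $mText property.
// A cache entry serialized back then hardcodes that property name:
$blob = 'O:12:"ParserOutput":1:{s:5:"mText";s:5:"hello";}';

// Version B (say, what wmf.20 ships) renamed the property:
class ParserOutput {
	public $text;
}

$restored = unserialize( $blob );
// No error is raised: $restored carries a stray dynamic $mText property
// while the expected $text is null. Cached objects written by one train
// version silently break when read by the other.
var_dump( $restored );
```

The breakage shows up whenever two train versions read each other's cached objects, in either direction, which is why a mid-train rollback surfaces it.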
There was one occurrence of a logging issue on a request to a Parsoid server (T269386), but that does not seem worth rolling back for. Nothing else is showing in the logs, at least.
Another one from Parsoid T269386
The blockers got solved or were ruled out as preexisting conditions. I am going to promote all wikis.
Validated on mwdebug1001 using https://commons.wikimedia.org/wiki/Special:AbuseFilter/history/28/item/127 as the example.
What I did this week was:
The lack of output surprised me and I was left wondering whether the patches actually applied.
https://gerrit.wikimedia.org/r/644966 proposes a fix, though it is not ready yet. Is there a change we can easily revert to unblock the train, or should we hold until the fix is completed?
Thank you Lucas and Amir for the investigation.
Turns out our Jenkins servers are using Java 8, notably because of the Gearman plugin. We should eventually switch them to Java 11 but that is a different task really: T269354
I guess the easiest is probably to revert the Monolog update in master and 1.36.0-wmf.20?
@Pchelolo I probably should have filed a single task for those two. I will be out for the rest of the day, but you have my blessing to have those fixes deployed today!
So yesterday I promoted the "testwikis" and today I promoted the group1 wikis. The group0 wikis have NOT been promoted.
Another one spotted during the train: T269235
With https://gerrit.wikimedia.org/r/c/mediawiki/core/+/579668, the messages formatted for Logstash now have the Monolog level as an integer even though we already have the level name:
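As a hedged sketch of the duplication and one possible mitigation (Monolog 2 array records; this is not the actual MediaWiki formatter nor the change under review):

```php
<?php
require 'vendor/autoload.php';

use Monolog\Formatter\JsonFormatter;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

$handler = new StreamHandler( 'php://stdout' );
$handler->setFormatter( new JsonFormatter() );

// Hypothetical mitigation: drop the integer once the name is present.
// Handler processors run after the handler's level check, so unsetting
// the key here does not break level-based filtering.
$handler->pushProcessor( static function ( array $record ): array {
	if ( isset( $record['level_name'] ) ) {
		unset( $record['level'] );
	}
	return $record;
} );

$logger = new Logger( 'demo' );
$logger->pushHandler( $handler );

// Without the processor this record would be emitted with both
// "level": 300 and "level_name": "WARNING".
$logger->warning( 'duplicate level fields' );
```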
Gerrit 3.3.0 has been released: https://www.gerritcodereview.com/3.3.html
It seems that new Gerrit users have the "Set new changes to 'work in progress' by default" option turned on by default.
On the first search it queries https://codesearch.wmcloud.org/core/api/v1/search?stats=fosho&repos=*&rng=:20&q=namespace&files=&i=nope and the response has:
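For reference, a minimal PHP sketch to replay that request (the top-level Results/Stats keys are an assumption based on the Hound search API that backs codesearch):

```php
<?php
// Replays the first search request quoted above.
$url = 'https://codesearch.wmcloud.org/core/api/v1/search'
	. '?stats=fosho&repos=*&rng=:20&q=namespace&files=&i=nope';

$response = json_decode( file_get_contents( $url ), true );

// Assumed shape: per-repository matches under "Results", timing under "Stats".
var_dump( array_keys( $response ) );
```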
Thanks for the cleanup!
Based on dmesg, the OOM hasn't happened since the beginning of October, so that is an improvement. Apparently some request to Postgres required more memory than was available, which triggered the OOM killer.
It was broken due to some certificate issue. I tried switching to puppetmaster.cloudinfra.wmflabs.org because that sometimes fixes the cert. Puppet eventually set the previous puppet master back. Maybe there is an issue related to the instance FQDN used in the certificate? Anyway, I haven't touched anything else after my short investigation.
@Volans @akosiaris can we let the latest releng images be included in Debmonitor? I really need that functionality to see the packages, and it seems to be straightforward: https://gerrit.wikimedia.org/r/c/operations/puppet/+/624096/1/modules/profile/files/docker/reporter/releng_rules.ini
There is something broken with the Puppet master configuration for phabricator-stage-1001 and I can't quite figure it out :-\
I have manually edited /etc/puppet/puppet.conf to point to the generic puppet master: puppetmaster.cloudinfra.wmflabs.org
The CI Docker image failed to build:
Creating /tmp/helm/repository
Creating /tmp/helm/repository/cache (image.py:179)
Creating /tmp/helm/repository/local
Creating /tmp/helm/plugins (image.py:179)
Creating /tmp/helm/starters (image.py:179)
Creating /tmp/helm/cache/archive (image.py:179)
Creating /tmp/helm/repository/repositories.yaml
Adding stable repo with URL: https://charts.helm.sh/stable (image.py:179)
Adding local repo with URL: http://127.0.0.1:8879/charts (image.py:179)
$HELM_HOME has been configured at /tmp/helm.
Not installing Tiller due to 'client-only' flag having been set (image.py:179)
"wmf-stable" has been added to your repositories (image.py:179)
"incubator" has been added to your repositories (image.py:179)
"wmf-stable" has been added to your repositories (image.py:179)
Error: repo "https://kubernetes-charts-incubator.storage.googleapis.com" is no longer available; try "https://charts.helm.sh/incubator" instead (image.py:179)
It's https://github.com/MirahezeBots/bots-web that's now canonical
We previously had T215658, which eventually led to the addition of a table of CI results on Gerrit changes. There was some code at https://gerrit.wikimedia.org/r/c/operations/software/gerrit/plugins/wikimedia/+/489483/14/gr-wikimedia/gr-wikimedia-prettify-ci-comments.html
With the Gerrit 3.2 upgrade, the CI reports are no longer pretty (T256575). So maybe we can reuse some parts of https://gerrit.wikimedia.org/r/c/operations/software/gerrit/plugins/wikimedia/+/489483/14/gr-wikimedia/gr-wikimedia-prettify-ci-comments.html to prettify them ;)
The repository has already been read-only since August 11, 2020. That was done by @MacFan4000: https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/ZppixBot/+/820f147fb851a0c0dffef930aef2b03994842bc5%5E%21/#F0
Projects-Cleanup is the way to go ;)
Amazing, thank you for the detailed information. I can confirm the HTTP requests are entirely gone. I don't think they were causing much load on Gerrit anyway :]
The current build is still failing because Selenium is reporting errors. These errors are related to the test user not being able to log in to the wiki. This is to be expected, since WSOAuth is not configured. I am not sure how these errors could be fixed. Can those specific tests be disabled?
For wikifarm.pluggableauth.eqiad.wmflabs I have filed T268759
That has been more or less fixed by raising the newish IO quota for the integration instances. It is still not ideal and enhancements will be made later on via the infrastructure task T266777.
The count of all errors we had:
Deployed to all wikis. Congratulations everyone.
/etc/apt/sources.list.d/project-aptly.list comes from the Puppet class aptly::client. The class is applied on every instance of the project via the Horizon project Puppet configuration (https://horizon.wikimedia.org/project/puppet/):
classes:
  - apt::unattendedupgrades
  - role::aptly::client
  - role::prometheus::node_exporter
It seems to work properly and T267668 has not been reopened. If all goes well overnight, I will deploy it on the rest of the wikis during the European time slot at 13:00 UTC (14:00 CET). Checking with SRE before pushing it, obviously!
1.36.0-wmf.18 has been rolled out to commons.wikimedia.org on Nov 23 at 20:09 UTC.
GitLab CE comes with a search system backed by the database (https://docs.gitlab.com/ee/user/search/#basic-search). Searches for commits, code, or comments are limited to the current project.
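As an illustration of that project scoping, a hedged PHP sketch against GitLab's REST search endpoint (the host, project ID, and token are placeholders; the scope/endpoint names come from GitLab's public API docs):

```php
<?php
// Basic search in GitLab CE is scoped to a single project.
$gitlab = 'https://gitlab.example.org/api/v4'; // placeholder host
$projectId = 42;                               // placeholder project
$token = getenv( 'GITLAB_TOKEN' );

$url = "$gitlab/projects/$projectId/search?scope=commits&search="
	. urlencode( 'parser cache' );

$context = stream_context_create( [
	'http' => [ 'header' => "PRIVATE-TOKEN: $token" ],
] );

var_dump( json_decode( file_get_contents( $url, false, $context ), true ) );
```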
What is the status of the decommissioning of Graphoid? The reason I ask is that CI still has to maintain images based on Debian Jessie (EOL since April 2020) with Node.js 6 (long past EOL), and we would really like to be able to dispose of those legacy bits.
I have poked the last three remaining tasks:
Is the migration of Maps (Kartotherian) / Maps (Tilerator) to Kubernetes / Node 10 on anyone's radar? The reason I ask is that CI still has to maintain images based on Debian Jessie (EOL since April 2020) with Node.js 6 (long past EOL), and we would really like to be able to dispose of those legacy bits.
Is the migration of #3d2png to Kubernetes / Node 10 on anyone's radar? The reason I ask is that CI still has to maintain images based on Debian Jessie (EOL since April 2020) with Node.js 6 (long past EOL), and we would really like to be able to dispose of those legacy bits.
Thank you to everyone that acted on this task.
That must have been fixed at some point; it is no longer happening.
Most probably that was due to T267561: the instances were running Varnish 5 while Puppet deployed VCL files intended for Varnish 6. That has probably been broken for more than a few months, and we were not receiving any notification of the Puppet failures due to another bug somewhere.
Thank you @Jdforrester-WMF for having taken care of the cleanup!
T268009 is now public. That was a nasty security issue in Gerrit, which is addressed by our upgrade to Gerrit 3.2.5. There were still a couple of hits after the upgrade, though, so the issue is not fully solved.
Gerrit has been upgraded. The next thing is to find out whether we can drop the workarounds we applied to All-Projects / All-Users. Notably, the READ permission for refs/* was quite helpful.
Thank you for the very quick fix, and thank you @QChris for finding the upstream issue that had the fix. That saved us from having to roll back the Gerrit upgrade.
An aftermath reported by some people on IRC is that a URL such as https://gerrit.wikimedia.org/r/123 ends up redirected to a URL including :80, which breaks it. See T268260 and https://bugs.chromium.org/p/gerrit/issues/detail?id=13705
We have deployed Gerrit 3.2.5 and applied the All-Projects and All-Users workaround. This task can now be made public, given that upstream has disclosed the security issues.
I am going to deploy the Gerrit release 3.2.5.