1.38.0-wmf.9 deployment blockers
Closed, Resolved | Public | 5 Estimated Story Points | Release

Details

Backup Train Conductor
dduvall
Release Version
1.38.0-wmf.9
Release Date
Mon, Nov 15, 12:00 AM

2021 week 46 1.38-wmf.9 Changes wmf/1.38.0-wmf.9

This MediaWiki Train Deployment is scheduled for the week of Monday, November 15th:

  • Monday, November 15th: Backports only.
  • Tuesday, November 16th: Branch wmf.9 and deploy to Group 0 Wikis.
  • Wednesday, November 17th: Deploy wmf.9 to Group 1 Wikis.
  • Thursday, November 18th: Deploy wmf.9 to all Wikis.
  • Friday: No deployments on Fridays.

How this works

  • Any serious bugs affecting wmf.9 should be added as subtasks beneath this one.
  • Any open subtask(s) block the train from moving forward. This means no further deployments until the blockers are resolved.
  • If something is serious enough to warrant a rollback then you should bring it to the attention of deployers on the #wikimedia-operations IRC channel.
  • If you have a risky change in this week's train, add a comment to this task using the Risky patch template.
  • For more info about deployment blockers, see Holding the train.

Related Links

Other Deployments

Previous: 1.38.0-wmf.8
Next: 1.38.0-wmf.10

Related Objects

Event Timeline

thcipriani triaged this task as Medium priority.
thcipriani updated Other Assignee, added: dduvall.
thcipriani set the point value for this task to 5.
thcipriani changed Release Date from Nov 9 2020, 12:00 AM to Tue, Nov 9, 12:00 AM. (Oct 21 2021, 12:30 AM)
thcipriani changed Release Date from Tue, Nov 9, 12:00 AM to Mon, Nov 15, 12:00 AM. (Oct 21 2021, 12:40 AM)
Risky Patch! 🚂🔥
  • Change: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/699067/
  • Summary:
    • Rewrite of how we do in-process caching for information from the page table in Title objects. Bugs could go two ways: failure to cache may cause extra load on database servers, and failure to purge the cache during write operations (like edits, deletions, page moves) can cause inconsistencies which trigger exceptions.
  • Test plan:
    • The relevant classes (PageStore and LinkCache, as well as the relevant methods in Title) have good test coverage. The code in question is constantly exercised by other tests as well as by production code. Bugs are bound to surface quickly.
  • Places to monitor:
  • Revert plan: Fix the individual error; revert the patch if that isn't possible; or roll back the train if a revert isn't possible either due to merge conflicts.
  • Affected wikis: All.
  • IRC contact: duesen, pchelolo. But best raise the alarm on Slack in the #platform-engineering-team channel.
  • UBN Task Projects/tags: Platform Engineering
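
To make the two failure modes in the summary above concrete, here is a minimal sketch in Python. It is not the MediaWiki code: `FakePageTable` and `PageInfoCache` are hypothetical names invented for illustration, and the real PageStore, LinkCache and Title classes are written in PHP and work differently. The sketch only shows that a missing cache entry costs an extra database read, and that a write path which forgets to purge would leave stale data behind.

```python
from dataclasses import dataclass, field


@dataclass
class FakePageTable:
    """Hypothetical stand-in for the page table; counts reads."""
    rows: dict = field(default_factory=dict)
    reads: int = 0

    def select(self, title):
        # Every call here represents one query against the database.
        self.reads += 1
        return self.rows.get(title)


class PageInfoCache:
    """Hypothetical in-process cache in front of the page table."""

    def __init__(self, db):
        self.db = db
        self._cache = {}

    def get_page_id(self, title):
        # Failure mode 1: if entries are never stored (or never hit),
        # every lookup falls through to the database.
        if title not in self._cache:
            self._cache[title] = self.db.select(title)
        return self._cache[title]

    def move_page(self, old, new):
        # The write itself goes to the database.
        self.db.rows[new] = self.db.rows.pop(old)
        # Failure mode 2: without these purges, the old title would keep
        # returning a page ID and a cached "missing" result for the new
        # title would persist, leaving the cache inconsistent.
        self._cache.pop(old, None)
        self._cache.pop(new, None)


db = FakePageTable({"Foo": 1})
cache = PageInfoCache(db)

cache.get_page_id("Foo")
cache.get_page_id("Foo")          # served from the cache, no second read
reads_after_two_lookups = db.reads

cache.get_page_id("Bar")          # caches the fact that "Bar" is missing
cache.move_page("Foo", "Bar")
old_id = cache.get_page_id("Foo")
new_id = cache.get_page_id("Bar")
```

In this sketch, watching `db.reads` plays the role that the database query-rate graphs play in production: a caching regression shows up as more reads for the same workload.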

Change 739316 had a related patch set uploaded (by Jeena Huneidi; author: Jeena Huneidi):

[operations/mediawiki-config@master] testwikis wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739316

Change 739316 merged by jenkins-bot:

[operations/mediawiki-config@master] testwikis wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739316

Mentioned in SAL (#wikimedia-operations) [2021-11-16T18:34:59Z] <jhuneidi@deploy1002> Started scap: testwikis wikis to 1.38.0-wmf.9 refs T293950

Mentioned in SAL (#wikimedia-operations) [2021-11-16T19:11:31Z] <jhuneidi@deploy1002> Finished scap: testwikis wikis to 1.38.0-wmf.9 refs T293950 (duration: 36m 32s)

jeena changed the task status from Open to In Progress. (Tue, Nov 16, 7:14 PM)

Change 739338 had a related patch set uploaded (by Jeena Huneidi; author: Jeena Huneidi):

[operations/mediawiki-config@master] group0 wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739338

Change 739338 merged by jenkins-bot:

[operations/mediawiki-config@master] group0 wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739338

Mentioned in SAL (#wikimedia-operations) [2021-11-16T20:09:26Z] <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.9 refs T293950

Mentioned in SAL (#wikimedia-operations) [2021-11-17T20:22:40Z] <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.9 refs T293950

Mentioned in SAL (#wikimedia-operations) [2021-11-17T20:23:44Z] <jhuneidi@deploy1002> Synchronized php: group1 wikis to 1.38.0-wmf.9 refs T293950 (duration: 01m 03s)

Change 739621 had a related patch set uploaded (by Jeena Huneidi; author: Jeena Huneidi):

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739621

Change 739621 merged by jenkins-bot:

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739621

f7febb6754d6aa86562fd219c47b3e8909e69573 reverted group1 wikis to wmf.7, without any explanation (and unfortunately without this task’s number in the commit message footer). @jeena, could you please elaborate on why and what to expect for the rest of the week?

The train was rolled back because there are open blockers (see the open subtasks of this task) that would negatively affect group1 wikis if the train was left there. The train will continue when the blockers have been resolved, unless that happens on a Friday or similar no-deploy time.

Sorry for the miscommunication. What @Majavah wrote is correct. In the future I will try to update this task with more information.
I am currently backporting the remaining blocker and will proceed to deploy to group1. If that goes well I will also deploy to all wikis today.

Change 739931 had a related patch set uploaded (by Jeena Huneidi; author: Jeena Huneidi):

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739931

Change 739931 merged by jenkins-bot:

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739931

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:30:55Z] <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.9 refs T293950

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:31:58Z] <jhuneidi@deploy1002> Synchronized php: group1 wikis to 1.38.0-wmf.9 refs T293950 (duration: 01m 03s)

Change 739934 had a related patch set uploaded (by Jeena Huneidi; author: Jeena Huneidi):

[operations/mediawiki-config@master] all wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739934

Change 739934 merged by jenkins-bot:

[operations/mediawiki-config@master] all wikis to 1.38.0-wmf.9 refs T293950

https://gerrit.wikimedia.org/r/739934

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:43:07Z] <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.9 refs T293950

Okay, thanks to both of you for the explanation (and also for letting the train ride to group2 wikis).

I was reading Daniel's comment and opened the graph board:
https://grafana.wikimedia.org/d/GpL5R8CGz/mysql-query-rate?orgId=1

Query throughput over the last 7 days (now-7d) shows a deviation after the deploy, and it has not returned to baseline. It's different, but I'm not sure whether it is different enough to be a concern.

Screenshot 2021-11-19 at 02.19.04.png (546×1 px, 194 KB)

Marostegui raised the priority of this task from Medium to Unbreak Now!. (Fri, Nov 19, 7:23 AM)
Marostegui added a subscriber: Marostegui.

This has caused a huge increase in queries. Here is an example from an enwiki replica:

Captura de pantalla 2021-11-19 a las 8.21.15.png (696×1 px, 178 KB)

https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=16&orgId=1&from=1637139798447&to=1637306390674&var-job=All&var-server=db1163&var-port=9104

This really needs to be investigated and/or reverted: enwiki is seeing 4x the number of queries it used to have.

Ladsgroup lowered the priority of this task from Unbreak Now! to Medium. (Fri, Nov 19, 7:40 AM)
Ladsgroup added a subscriber: Ladsgroup.

Acknowledging this. It's 00:34 local and I am not in much of a state to be doing a revert (or, indeed, remaining conscious for much longer).

I am wondering what impact T295930 - PHP Notice: Array to string conversion - will have on a rollback. wmf.9 has been on all wikis for some time. cc: @Majavah, @matej_suchanek on that question.

It'll cause some log entries on meta to look weird. Definitely a smaller issue than the other UBN.

Noting here @jeena and @Ladsgroup reverted https://sal.toolforge.org/log/VSVNN30B8Fs0LHO5-XnK to fix T296063: thank you for the note about the risky patch!

Next week is Thanksgiving in the US. I am volunteering to run any follow-up train actions that need to happen starting next Monday, since it is a regular week for me.

As for today, I am unfortunately already on my weekend.

Mentioned in SAL (#wikimedia-operations) [2021-11-19T16:42:41Z] <thcipriani@deploy1002> rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.38.0-wmf.9 refs T293950 T296098"

The fix for CentralNotice not showing up on mobile web view (T296077) is scheduled for backport at 19:00 UTC; I have pinged the task to tentatively deploy it at 12:00 UTC (one hour from now). Then I guess we can roll the train again and try to find out what the memory leak issue in T296098 is.

The memory leak (T296098) no longer appears. That was the last step needed to declare this train complete.