
1.38.0-wmf.12 deployment blockers
Closed, Resolved | Public | 5 Estimated Story Points | Release

Details

Backup Train Conductor
brennen
Release Version
1.38.0-wmf.12
Release Date
Dec 6 2021, 12:00 AM

2021 week 49 | 1.38-wmf.12 Changes | wmf/1.38.0-wmf.12

This MediaWiki Train Deployment is scheduled for the week of Monday, December 6th:

  • Monday, December 6th: Backports only.
  • Tuesday, December 7th: Branch wmf.12 and deploy to Group 0 Wikis.
  • Wednesday, December 8th: Deploy wmf.12 to Group 1 Wikis.
  • Thursday, December 9th: Deploy wmf.12 to all Wikis.
  • Friday: No deployments on Fridays.
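
Each of those promotion steps ("Group 0", "Group 1", "all Wikis") boils down to updating which MediaWiki branch a set of wikis points at in wikiversions.json in operations/mediawiki-config; that is what the "rebuilt and synchronized wikiversions files" entries further down this task refer to. Below is a minimal sketch in Python of the data change for the group0 step, assuming the repository's standard wikiversions.json and dblists/ layout; the real promotion is done with scap and the mediawiki-config tooling, not this script.

```
# Sketch only: point every wiki in the group0 dblist at the new branch.
# Assumes it is run from an operations/mediawiki-config checkout.
import json

NEW_BRANCH = "php-1.38.0-wmf.12"

with open("dblists/group0.dblist") as f:
    group0 = [line.strip() for line in f if line.strip() and not line.startswith("#")]

with open("wikiversions.json") as f:
    wikiversions = json.load(f)

for dbname in group0:  # e.g. testwiki, test2wiki, mediawikiwiki, ...
    wikiversions[dbname] = NEW_BRANCH

with open("wikiversions.json", "w") as f:
    json.dump(wikiversions, f, indent=4, sort_keys=True)
```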

How this works

  • Any serious bugs affecting wmf.12 should be added as subtasks beneath this one.
  • Any open subtask(s) block the train from moving forward. This means no further deployments until the blockers are resolved.
  • If something is serious enough to warrant a rollback then you should bring it to the attention of deployers on the #wikimedia-operations IRC channel.
  • If you have a risky change in this week's train, add a comment to this task using the Risky patch template.
  • For more info about deployment blockers, see Holding the train.

Related Links

Other Deployments

Previous: 1.38.0-wmf.11
Next: 1.38.0-wmf.13

Related Objects

Event Timeline


T297066 seems to be a UI regression in WikibaseMediaInfo; I’ll let the SDC developers or train conductors decide whether it should block the train. (It would only affect Commons, i.e. group1.)

Edit: This was resolved in time for the wmf.12 branch cut; no action necessary.

Just so this doesn't get lost in the comment above: if the train is rolled back after deploy, we will need to purge RESTBase content (see T296425) to ensure VE edits on metawiki and mediawiki on pages with <translate> tags don't break.

This is admittedly an edge case, since VE editing on pages with <translate> isn't well supported currently, so editors are unlikely to use VE on those pages. If a train is going to be rolled back and then rolled forward relatively quickly, there is probably some breathing room to leave this broken for a while. But if the train is going to stay rolled back for a long time, then it is worth purging RESTBase storage via T296425.

Thanks for this note! It seems likely that we will roll back at least once during this train, as (a) we didn't run the train last week and (b) there are so many risky patches on this train.

Is there any way to hide this incompatible behavior until the rollout of this train is complete (e.g., via a feature flag or similar)?

I'll note that this train will have a large number of patches in the CentralAuth extension. The extension in general is risky (old and complicated codebase, little to no tests, tasked with critical things like authentication), so if you see or hear about any strange behaviour, please feel free to roll back / revert things at a low threshold. Thanks!

Is there any way to hide this incompatible behavior until the rollout of this train is complete (e.g., via a feature flag or similar)?

I'll chat with others on the team about this and see what seems feasible.

We had that as a fallback for next week, for scenarios where the rollback was caused by a bug in our code (the plan was to revert our code, roll the train forward, and try again next week). But, yes, there could be rollbacks for reasons not related to our code.

I also dropped a note on metawiki to alert translate admins to keep an eye out.

The volume of VE edits on metawiki is quite low. That seems to be true for mediawiki.org VE edits as well.

So, given all that, at this time I feel comfortable that we can handle this by keeping an eye on VE edits and reverting / fixing them appropriately, and relying on a RESTBase content purge only if we really need it - for example, if we find that Parsoid's support for translate itself is broken, which might require a rollback, a Parsoid revert, and a train roll-forward.

I looked at wmf-config to make sure there weren't other affected wikis; Commons is the other major wiki that uses translate tags. I looked at the namespaces that commonly receive VE edits (File, Category, User), and there are only 69 pages with translate tags in those namespaces.
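
(If anyone wants to re-check a number like that later, one rough way is CirrusSearch's insource: operator via the API. The Python sketch below is one possible query, not necessarily how the 69-page figure above was obtained, and the plain insource match is tokenized, so the count is only approximate.)

```
# Approximate count of pages containing "<translate>" in the User (2), File (6)
# and Category (14) namespaces on Commons, via the search API. Sketch only.
import requests

API = "https://commons.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "list": "search",
    "srsearch": 'insource:"<translate>"',
    "srnamespace": "2|6|14",
    "srlimit": 1,            # we only want the total, not the result list
    "srinfo": "totalhits",
    "format": "json",
    "formatversion": "2",
}
data = requests.get(API, params=params, timeout=30).json()
print(data["query"]["searchinfo"]["totalhits"])
```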

Overall, as @Arlolra noted elsewhere, it would have been cleaner and better to have rolled out HTML->WT support for Parsoid HTML version 2.4.0, so that we had forward compatibility for that version, *before* the Parsoid HTML version bump to 2.4.0 in this train.

But, all things considered, between metawiki, mediawiki, and commons, as noted in these comments, the intersection of

  • pages with translate tags on them
  • pages being edited with VE

is small enough that it feels acceptable to move ahead without needing to rely on RESTBase purges if the train rolls back.

All that said, if train operators prefer that we mitigate that for your sanity, I will happily work with my team to make those fixes.

From what you've said the number of impacted pages seems small, but I don't quite grok the impact on those pages: would editing be broken on those pages while we're in a rolled-back state, and would they be fixed when we roll forward? Or does this cause some kind of corruption?

If it's possible to mitigate this, that'd be ideal. As an alternative, monitoring the rollout and having someone on hand for problems would work as well.

From what you've said the number of impacted pages seems small, but I don't quite grok the impact on those pages: would editing be broken on those pages while we're in a rolled-back state, and would they be fixed when we roll forward? Or does this cause some kind of corruption?

VE edits on those pages can cause page corruption *if* the page has been reparsed (or purged) between the train rollout and the train rollback.

If it's possible to mitigate this, that'd be ideal. As an alternative, monitoring the rollout and having someone on hand for problems would work as well.

Yes, someone will be around.

Note that the blocker is in a weird state; it seems there are several issues there. I think the DB part is fixed, but the memory part is not. It needs better investigation.

If it's possible to mitigate this, that'd be ideal.

We are working to take Parsoid out of the risky-patches set and should have a backport to vendor before the train rolls out. We decided to split the 2.3.0 -> 2.4.0 bump across two trains. That will eliminate the need for RESTBase purges altogether, without needing to keep an eye on individual edits on those three wikis.

The Parsoid "backport" is https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/744800. I assume that since the train hasn't rolled yet it's safe to merge that onto wmf.12, but I recall ops needing to specially sync stuff to make everything work correctly. Again: what we've done is turn off Translate annotations in Parsoid 0.15.0-a12 in order to make the Parsoid deploy "not risky" this week. We'll wait to turn on our risky feature, either next week or next year.

EDIT: Got confirmation from @dancy that it was safe to merge on the wmf.12 branch because the train hadn't been checked out yet; also updated the docs at https://wikitech.wikimedia.org/wiki/Parsoid#If_the_train_branch_has_already_been_cut to record that we should always get explicit clearance in #wikimedia-operations before merging a backport "after the cut" like this.

Note that the blocker is in a weird state; it seems there are several issues there. I think the DB part is fixed, but the memory part is not. It needs better investigation.

I have no idea how to investigate this... I propose to roll forward and to look out for the memory issue showing up again.

Hm... I made LinkCache cache some more fields from the page table. Most of that cache is only in-process, so it should not persist. Some of them are, however, written to WANObjectCache. But that shouldn't affect memory usage on app servers, right?

We are working to take Parsoid out of the risky-patches set and should have a backport to vendor before the train rolls out. We decided to split the 2.3.0 -> 2.4.0 bump across two trains. That will eliminate the need for RESTBase purges altogether, without needing to keep an eye on individual edits on those three wikis.

Thanks for raising the concern and for the extra care you're taking. I appreciate it <3

Change 744878 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] testwikis wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/744878

Change 744878 merged by jenkins-bot:

[operations/mediawiki-config@master] testwikis wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/744878

Mentioned in SAL (#wikimedia-operations) [2021-12-07T21:23:43Z] <dancy@deploy1002> Started scap: testwikis wikis to 1.38.0-wmf.12 refs T293953

brennen moved this task from Backlog to Doing on the User-brennen board.
brennen subscribed.

Mentioned in SAL (#wikimedia-operations) [2021-12-07T22:07:58Z] <dancy@deploy1002> Finished scap: testwikis wikis to 1.38.0-wmf.12 refs T293953 (duration: 44m 14s)

Change 744886 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] group0 wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/744886

Change 744886 merged by jenkins-bot:

[operations/mediawiki-config@master] group0 wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/744886

Mentioned in SAL (#wikimedia-operations) [2021-12-07T22:15:24Z] <dancy@deploy1002> rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.12 refs T293953

Change 745313 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/745313

Change 745313 merged by jenkins-bot:

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/745313

Mentioned in SAL (#wikimedia-operations) [2021-12-08T20:16:25Z] <dancy@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.12 refs T293953

Mentioned in SAL (#wikimedia-operations) [2021-12-08T20:17:30Z] <dancy@deploy1002> Synchronized php: group1 wikis to 1.38.0-wmf.12 refs T293953 (duration: 01m 05s)

Change 745315 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.9 refs T293953

https://gerrit.wikimedia.org/r/745315

Change 745315 merged by jenkins-bot:

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.9 refs T293953

https://gerrit.wikimedia.org/r/745315

Mentioned in SAL (#wikimedia-operations) [2021-12-08T20:21:27Z] <dancy@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.9 refs T293953

Mentioned in SAL (#wikimedia-operations) [2021-12-08T20:22:31Z] <dancy@deploy1002> Synchronized php: group1 wikis to 1.38.0-wmf.9 refs T293953 (duration: 01m 04s)

Change 745331 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/745331

Change 745331 merged by jenkins-bot:

[operations/mediawiki-config@master] group1 wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/745331

Mentioned in SAL (#wikimedia-operations) [2021-12-08T21:41:52Z] <dancy@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.12 refs T293953

Mentioned in SAL (#wikimedia-operations) [2021-12-08T21:42:56Z] <dancy@deploy1002> Synchronized php: group1 wikis to 1.38.0-wmf.12 refs T293953 (duration: 01m 04s)

Hi everybody,

There is an alert for Eventgate Analytics External event validation errors for the schema mediawiki.mediasearch_interaction that lines up with the deployment timings:

https://sal.toolforge.org/log/xYr_m30B8Fs0LHO53sGB
Grafana dashboard

Logstash shows the same error over and over: '.search_result_page_id' should be integer
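
That message is a standard JSON Schema type failure: the event carries a value for search_result_page_id that is not an integer (most likely a string). A minimal illustration with the Python jsonschema library - the schema fragment below is an assumption for illustration, not the actual mediawiki/mediasearch_interaction schema:

```
# Minimal illustration of the validation failure above, using jsonschema.
# The schema fragment is assumed; the real schema has many more fields.
from jsonschema import Draft7Validator

schema_fragment = {
    "type": "object",
    "properties": {
        "search_result_page_id": {"type": "integer"},
    },
}

bad_event = {"search_result_page_id": "12345"}  # string where an integer is expected

for err in Draft7Validator(schema_fragment).iter_errors(bad_event):
    print(list(err.absolute_path), err.message)
    # ['search_result_page_id'] '12345' is not of type 'integer'
```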

Thanks for the report. I filed a separate task for that (T297400) and set it as a train blocker.

Change 745594 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] group2 wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/745594

Change 745594 merged by jenkins-bot:

[operations/mediawiki-config@master] group2 wikis to 1.38.0-wmf.12 refs T293953

https://gerrit.wikimedia.org/r/745594

Mentioned in SAL (#wikimedia-operations) [2021-12-09T20:04:19Z] <dancy@deploy1002> rebuilt and synchronized wikiversions files: group2 wikis to 1.38.0-wmf.12 refs T293953

Change 745955 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] all wikis to 1.38.0-wmf.9 refs T293953

https://gerrit.wikimedia.org/r/745955

Change 745955 merged by jenkins-bot:

[operations/mediawiki-config@master] all wikis to 1.38.0-wmf.9 refs T293953

https://gerrit.wikimedia.org/r/745955

Mentioned in SAL (#wikimedia-operations) [2021-12-10T22:09:48Z] <dancy@deploy1002> rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.9 refs T293953

dancy added a subscriber: RLazarus.

The train has been rolled back to wmf.9 due to appserver and parsoid server memory growth. @RLazarus and others are investigating in IRC #wikimedia-operations.

Mentioned in SAL (#wikimedia-operations) [2021-12-13T18:25:56Z] <dancy> dancy@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.12 T293953

There's a possible regression at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Edit_from_%22What_links_here%22; maybe someone familiar with recent changes to editing stuff might have an idea whether that's a real regression or not.

Filed as T297744. I'm not sure if this is necessarily a blocker. It's inconvenient but probably not a huge issue.

Agreed, I don't think we should roll back wmf.12 or hold up wmf.13 on this issue, but fixing it would be good.