1.40.0-wmf.23 deployment blockers
Closed, Resolved · Public · Release

Details

Backup Train Conductor: demon
Release Version: 1.40.0-wmf.23
Release Date: Feb 13 2023, 12:00 AM

2023 week 07 · 1.40-wmf.23 Changes · wmf/1.40.0-wmf.23

This MediaWiki Train Deployment is scheduled for the week of Monday, February 13th:

  • Monday, February 13th: Backports only.
  • Tuesday, February 14th: Branch wmf.23 and deploy to Group 0 Wikis.
  • Wednesday, February 15th: Deploy wmf.23 to Group 1 Wikis.
  • Thursday, February 16th: Deploy wmf.23 to all Wikis.
  • Friday: No deployments on Fridays.

How this works

  • Any serious bugs affecting wmf.23 should be added as subtasks beneath this one.
  • Any open subtask(s) block the train from moving forward. This means no further deployments until the blockers are resolved.
  • If something is serious enough to warrant a rollback then you should bring it to the attention of deployers on the #wikimedia-operations IRC channel.
  • If you have a risky change in this week's train, add a comment to this task using the Risky patch template.
  • For more info about deployment blockers, see Holding the train.

Related Links

Other Deployments

Previous: 1.40.0-wmf.22
Next: 1.40.0-wmf.24

Event Timeline

thcipriani triaged this task as Medium priority.
thcipriani updated Other Assignee, added: demon.
Jdlrobson subscribed.

Adding T329548 per https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#Error-rate_increases_(See_#Logspam) as a blocker. I'm not sure about the impact on users, but it should definitely be fixed this week even if it doesn't block the train.

Change 888729 had a related patch set uploaded (by TrainBranchBot; author: trainbranchbot):

[mediawiki/core@wmf/1.40.0-wmf.23] Branch commit for wmf/1.40.0-wmf.23

https://gerrit.wikimedia.org/r/888729

Change 888729 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.23] Branch commit for wmf/1.40.0-wmf.23

https://gerrit.wikimedia.org/r/888729

Change 889191 had a related patch set uploaded (by TrainBranchBot; author: Dduvall):

[operations/mediawiki-config@master] testwikis wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889191

Change 889191 merged by jenkins-bot:

[operations/mediawiki-config@master] testwikis wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889191

Mentioned in SAL (#wikimedia-operations) [2023-02-14T18:17:21Z] <dduvall@deploy1002> Started scap: testwikis wikis to 1.40.0-wmf.23 refs T325586

Mentioned in SAL (#wikimedia-operations) [2023-02-14T19:08:24Z] <dduvall@deploy1002> Finished scap: testwikis wikis to 1.40.0-wmf.23 refs T325586 (duration: 51m 03s)

Mentioned in SAL (#wikimedia-operations) [2023-02-14T19:31:48Z] <dduvall> scap sync-world failed due to lack of disk space on deploy1002 /srv (cc T325586)

Mentioned in SAL (#wikimedia-operations) [2023-02-14T19:41:37Z] <dduvall@deploy1002> Started scap: testwikis wikis to 1.40.0-wmf.23 refs T325586

Mentioned in SAL (#wikimedia-operations) [2023-02-14T19:50:51Z] <dduvall@deploy1002> Finished scap: testwikis wikis to 1.40.0-wmf.23 refs T325586 (duration: 09m 14s)

Change 889217 had a related patch set uploaded (by TrainBranchBot; author: Dduvall):

[operations/mediawiki-config@master] group0 wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889217

Change 889217 merged by jenkins-bot:

[operations/mediawiki-config@master] group0 wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889217

Mentioned in SAL (#wikimedia-operations) [2023-02-14T20:12:34Z] <dduvall@deploy1002> rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.23 refs T325586

The Content-Transform-Team has continued to work on the Table of Contents. Potentially risky changes:

In addition, the slow march to deprecate old non-BCP-47-compliant language codes marches on:

I don't expect these are going to cause any issues in practice, but better safe than sorry. There should be no problem doing a rollback if issues do arise, although cherry-picking a revert for the TOC patches could be tricky due to the amount of code that landed in that area.

Change 889598 had a related patch set uploaded (by TrainBranchBot; author: Dduvall):

[operations/mediawiki-config@master] group1 wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889598

Change 889598 merged by jenkins-bot:

[operations/mediawiki-config@master] group1 wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889598

Mentioned in SAL (#wikimedia-operations) [2023-02-15T19:16:58Z] <dduvall@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.23 refs T325586

Mentioned in SAL (#wikimedia-operations) [2023-02-15T19:17:20Z] <dduvall> large spike in undefined property errors. rolling back (T325586)

Mentioned in SAL (#wikimedia-operations) [2023-02-15T19:18:26Z] <dduvall> correction: spike may be temporary. holding (T325586)

Mentioned in SAL (#wikimedia-operations) [2023-02-15T19:23:35Z] <dduvall@deploy1002> Synchronized php: group1 wikis to 1.40.0-wmf.23 refs T325586 (duration: 06m 36s)

Mentioned in SAL (#wikimedia-operations) [2023-02-15T19:23:50Z] <dduvall> rolling back due to spike in parsoid errors (T325586)

@cscott I'm seeing a large spike in errors from Parsoid today. See https://logstash.wikimedia.org/goto/a7eb8ccc8cf7a123b9577e32e98f1b1c

Any chance these are related to the risky changes, specifically old cache entries?

Sidenote: please try to use the risky patch template in the future, as it includes possible contacts, which helps us triage quickly.

@dduvall, we are looking. The ones from T329740 can all be simply suppressed since they don't have any user impact right now (we are computing information that is discarded -- it will be plugged into the output at a later point), and we can fix the notices for next week's train. If you want to go that route, https://phabricator.wikimedia.org/T329740#8618948 has info on what to suppress.
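
For illustration only: the authoritative list of what to suppress is in the linked comment, and production log routing lives in wmf-config's Monolog setup rather than in a plain LocalSettings.php, so treat this as a rough sketch of the general mechanism in stock MediaWiki, with an assumed channel name.

```php
<?php
// LocalSettings.php-style sketch, NOT the actual wmf-config change.
// In stock MediaWiki, mapping a debug log group to false discards its
// messages; the 'Parsoid' group name here is an assumption for
// illustration, not the channel named in T329740#8618948.
$wgDebugLogGroups['Parsoid'] = false;
```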

We are also looking at other possible issues arising from the TOC work in Parsoid and core and seeing what is needed there.

We may also consider fixing these all and doing a fresh release of Parsoid, vendor, etc. which can be backported.

The spike seems related to the transition between wmf.22 and wmf.23, perhaps? There are logs like https://logstash.wikimedia.org/goto/a21be3de9c29c8905f2621f63dbb0c92 which are against wmf.22 and refer to properties which were removed from wmf.23, so it's possible the front end machines saw an inconsistent vendor directory briefly during the deploy?

The "bad UTF-8" errors are a known issue most likely related to editors providing bad content through an API (aka not /actually/ on the Main Page). We want to be aggressive about pouncing on bad UTF-8 data coming from the database, but we should probably tone down these logs if the data is user-supplied.

This one is more concerning: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2023.02.15?id=TLKOVoYBtuN2AbPYdyqu and refers to wmf.23, but it seems impossible to leave that field uninitialized in the current codebase, so I'm wondering if this is also a transient of some kind.

Our conclusion at this point is that (a) the UTF-8 error is known and a small # of logs, (b) the T329740 notices are also known but harmless -- we can roll a new release of Parsoid, or you could just suppress them and they will be gone next week -- and (c) the "Typed property Wikimedia\Parsoid\Core\SectionMetadata::$codepointOffset must not be accessed before initialization" and the related "$byteOffset" one *are* related to TOC work (the "byteOffset" field was renamed "codepointOffset" in https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/885882), but we think they are transients related to deploy/rollback, not actual bugs in the code.
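
For readers unfamiliar with that error class: a minimal, standalone PHP sketch (class and property names are illustrative, not Parsoid's actual code) showing why a typed property declared without a default throws in PHP 7.4+ when it is read before being assigned.

```php
<?php
// Illustration only -- not Wikimedia\Parsoid\Core\SectionMetadata itself.
class SectionMetadataExample {
    // Declared without a default, so it starts out "uninitialized".
    public ?int $codepointOffset;
}

$meta = new SectionMetadataExample();

try {
    // Any read before an assignment -- e.g. by code built against an
    // older field layout -- trips over this.
    var_dump( $meta->codepointOffset );
} catch ( Error $e ) {
    // "Typed property SectionMetadataExample::$codepointOffset must not
    // be accessed before initialization"
    echo $e->getMessage(), "\n";
}
```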

Ya, so, 3 things:

  • Suppressing notices as in T329740#8618948 gets rid of the noise (and suppresses the spike)
  • The other errors Scott referenced, and the one I was concerned about, are only a handful, seem to show up during rollback (afaict), and look like transients.
  • The UTF-8 errors are also only a handful, and all old known issues that could be bad wikitext being posted -- nothing specific to this train.

So, in terms of decision, what we need to figure out is whether we suppress logs as in bullet point #1 OR if we should try to fix the code and try to get it on the train. Since this is in the Parsoid repo, it is more involved and requires a new tag, vendor release, patch to update vendor in core, etc.

Thank you very much for summarizing.

I'm usually fine with suppressing notices, but this seemed like a pretty high volume and I'm a little worried that it will increase significantly with tomorrow's deployment. While filtering helps keep the dashboard clear (very important to deployers et al.), it's not going to reduce the volume of errors being sent through logstash. Do you anticipate an increase in these errors with the all-wikis deployment?

> So, in terms of decision, what we need to figure out is whether we suppress logs as in bullet point #1 OR if we should try to fix the code and try to get it on the train. Since this is in the Parsoid repo, it is more involved and requires a new tag, vendor release, patch to update vendor in core, etc.

If there's a clear fix to be made in Parsoid (albeit with release overhead), I would prefer that to suppression. However, if you think it's a larger refactor to Parsoid and we're not in for too many messages with tomorrow's deployment, let's suppress.

> I'm usually fine with suppressing notices, but this seemed like a pretty high volume and I'm a little worried that it will increase significantly with tomorrow's deployment. While filtering helps keep the dashboard clear (very important to deployers et al.), it's not going to reduce the volume of errors being sent through logstash.

That sounds reasonable.

> Do you anticipate an increase in these errors with the all-wikis deployment?

Hard to tell. It depends on the specific wikitext patterns being seen (headings generated by parser functions -- rare, and seen on some mediawiki.org pages; headings generated by some extensions -- seemingly not that uncommon on some Wikisource wikis and Commons).

Let me consult with Scott and we'll respond here with a proposal.

> Let me consult with Scott and we'll respond here with a proposal.

Sounds good. Thanks for the quick response to this.

> If there's a clear fix to be made in Parsoid (albeit with release overhead), I would prefer that to suppression. However, if you think it's a larger refactor to Parsoid and we're not in for too many messages with tomorrow's deployment, let's suppress.

The fix in Parsoid is very easy -- a one-liner, even. The question is more whether we have backported a new vendor release before.
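
Purely as an illustration of the shape such a one-line fix usually takes (the real change is the Parsoid patch linked in the next comment, and the names here are again made up): giving the typed property a default value makes reads before explicit assignment return null instead of throwing.

```php
<?php
// Illustration only -- not the actual Parsoid patch.
class SectionMetadataExample {
    // was: public ?int $codepointOffset;   (uninitialized, throws on read)
    public ?int $codepointOffset = null;
}

$meta = new SectionMetadataExample();
var_dump( $meta->codepointOffset ); // NULL, no Error thrown
```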

I'm starting the patch-and-tag-and-release-to-vendor process with https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/889637 and we'll get that merged to mainline mediawiki-vendor, then I'll post here and leave it up to you all on ops whether you want to backport that mediawiki-vendor patch to wmf.23.

Ok, https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/889641 is merged to mediawiki-vendor and https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/889607 is the cherry-pick to the wmf.23 branch. We'll leave it to ops' discretion whether to merge that to wmf.23 and deploy the backport, or leave that cherry-pick unmerged and just suppress the errors.

This addresses T329740. The UTF-8 error is low-volume. Let us know if the other one (byteOffset/codepointOffset TypeError) reoccurs outside the time window of rolling the train; our assumption is that it reflects a transient inconsistency between vendor and core.

Thanks! I will merge and deploy the backport. I might wait until tomorrow to re-roll the train to group1, depending on my own time constraints.

Change 889648 had a related patch set uploaded (by TrainBranchBot; author: Dduvall):

[operations/mediawiki-config@master] group1 wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889648

Change 889648 merged by jenkins-bot:

[operations/mediawiki-config@master] group1 wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889648

Mentioned in SAL (#wikimedia-operations) [2023-02-15T23:23:38Z] <dduvall@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.23 refs T325586

Mentioned in SAL (#wikimedia-operations) [2023-02-15T23:30:22Z] <dduvall@deploy1002> Synchronized php: group1 wikis to 1.40.0-wmf.23 refs T325586 (duration: 06m 43s)

Thanks very much for your help today, @cscott and @ssastry. I've re-rolled group1.

> Let us know if the other one (byteOffset/codepointOffset TypeError) reoccurs outside the time window of rolling the train; our assumption is that it reflects a transient inconsistency between vendor and core.

So far, your assumption looks right.

Change 889845 had a related patch set uploaded (by TrainBranchBot; author: Dduvall):

[operations/mediawiki-config@master] all wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889845

Change 889845 merged by jenkins-bot:

[operations/mediawiki-config@master] all wikis to 1.40.0-wmf.23

https://gerrit.wikimedia.org/r/889845

Mentioned in SAL (#wikimedia-operations) [2023-02-16T19:29:06Z] <dduvall@deploy1002> rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.23 refs T325586