
1.35.0-wmf.32 deployment blockers
Closed, Resolved · Public · Release

Details

Backup Train Conductor: mmodell
Release Version: 1.35.0-wmf.32
Release Date: May 11 2020, 12:00 AM

2020 week 20 · 1.35-wmf.32 Changes · wmf/1.35.0-wmf.32

This MediaWiki Train Deployment is scheduled for the week of Monday, May 11th:

  • Monday, May 11th: Backports only.
  • Tuesday, May 12th: Branch wmf.32 and deploy to Group 0 Wikis.
  • Wednesday, May 13th: Deploy wmf.32 to Group 1 Wikis.
  • Thursday, May 14th: Deploy wmf.32 to all Wikis.
  • Friday: No deployments on Fridays.

How this works

  • Any serious bugs affecting wmf.32 should be added as subtasks beneath this one.
  • Any open subtask(s) block the train from moving forward. This means no further deployments until the blockers are resolved.
  • If something is serious enough to warrant a rollback then you should bring it to the attention of deployers on the #wikimedia-operations IRC channel.
  • If you have a risky change in this week's train, add a comment to this task using the Risky patch template.
  • For more info about deployment blockers, see Holding the train.

Related Links

Other Deployments

Previous: 1.35.0-wmf.31
Next: 1.35.0-wmf.33

Event Timeline

@DannyS712 @Pchelolo: hi both! Last train there were a few blockers related to ongoing revision work (thank you for the responsiveness there @DannyS712). I see a few more patches are in master since the 1.35.0-wmf.31 branch cut.

Are there any additional precautions that the train operator could take to lessen the risk of these patches? That is: are there groups or wikis we should be especially cautious about? Should we ping someone when we're about to roll out? Would it make sense to roll the revision changes separately (somehow, tbd :)) when one or both of you are available, or take some other reasonable precautions? Thanks for your help!

Yeah, sorry about all of the UBNs :(
As for lessening risk, I started looking into requesting logstash access so I could help monitor new Revision issues.
The biggest upcoming patch is probably https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/593272/ - hopefully it'll go out with the next train; perhaps having some others review it could help?

As for lessening risk, I started looking into requesting logstash access so I could help monitor new Revision issues.

I'm not sure it works like that. When the issues are hitting production logstash, they're already in production...

Yes, but it would allow noticing them sooner (e.g. while only deployed to group0) and on the beta cluster (I think, right?). Anyway, it was just an idea.

Not necessarily; probably not much quicker than Brennen has been solving them. As the train rolls out to more wikis, more edge cases and issues are found. This often happens; it's fine on group0 and group1, but breaks when it hits enwiki, etc.

Hey @thcipriani, yeah... somehow the issues stacked up on this train.

I've also been thinking about why that happened and can't find anything specific to blame - all the individual patches were as small and confined as always, and I didn't even think I should give a heads-up on the blockers task about any of them, since they all looked like easy, straightforward replacements. Nor was there an especially large number of patches. I apologize for the disturbance.

As for moving forward, I can promise to do deeper reviews, plus I'm planning to bring more reviewers into this cleanup, but I don't think that would necessarily make us completely safe. Adding a test covering every place the code is changed would be ideal, but that would likely expand the scope of the task at hand beyond a reasonable size.

I guess one possible approach would be to have a manual testing day on group0 - we would compile the list of patches going out, look into whether we can test them manually on test.wiki, and do that on Tuesday.

Yes, but it would allow noticing them sooner (e.g. while only deployed to group0) and on the beta cluster (I think, right?). Anyway, it was just an idea.

Also https://wikitech.wikimedia.org/wiki/Logstash#Beta_Cluster_Logstash

Yeah, sorry about all of the UBNs :(

Thanks for all the work you've been doing! It seems like it's in an area of code that's tricky to work with — kudos for all your work and responsiveness, it's greatly appreciated.

The biggest upcoming patch is probably https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/593272/ - hopefully it'll go out with the next train; perhaps having some others review it could help?

That's a good idea, extra review never hurts :)

Hey @thcipriani, yeah... somehow the issues stacked up on this train.

I've also been thinking about why that happened and can't find anything specific to blame - all the individual patches were as small and confined as always, and I didn't even think I should give a heads-up on the blockers task about any of them, since they all looked like easy, straightforward replacements. Nor was there an especially large number of patches. I apologize for the disturbance.

No worries, this is our method of progressively de-risking: we roll out to successively larger groups, and finding bugs during rollout is part of the process. I want to make sure you all have everything you need to deploy these changes confidently.

As for moving forward, I can promise to do deeper reviews, plus I'm planning to bring more reviewers into this cleanup, but I don't think that would necessarily make us completely safe. Adding a test covering every place the code is changed would be ideal, but that would likely expand the scope of the task at hand beyond a reasonable size.

Fair.

I guess one possible approach would be to have a manual testing day on group0 - we would compile the list of patches going out, look into whether we can test them manually on test.wiki, and do that on Tuesday.

We can sync up with you all before rolling to group0 -- if there are some smoke tests we could run there to gain some confidence, that'd be great! Thanks for thinking this through with me :)

After the mess I made last time, I thought it would be helpful to list my patches against core that are going out with this train. So far:

Changes backported to wmf.31

Changes going live with wmf.32

I believe that this is all of the patches, but I may have missed some. This does not include patches against extensions, just core.

Hmm. Not sure if it will cause issues, but RevisionItemBase::getId currently returns string despite being documented to return int. Discovered while adding tests in T252076: Cover RevisionList/RevisionItem classes with tests; patch for that also fixes the method to properly return int. Would it be possible to include this before the branch is cut?
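
For context, here is a minimal, hypothetical PHP sketch of the kind of mismatch described above. This is not the actual patch from T252076, and the class name and field name are invented for illustration: the ID is read from a database row, where values arrive as strings, so an explicit cast is needed for the method to match its documented int return type.

  <?php
  // Hypothetical, simplified example - not the actual MediaWiki code or patch.
  class ExampleRevisionItem {
      /** @var stdClass Database row (field name invented for this sketch) */
      private $row;

      public function __construct( stdClass $row ) {
          $this->row = $row;
      }

      /** @return string Name of the field holding the ID (assumed) */
      protected function getIdField() {
          return 'rev_id';
      }

      /** @return int The ID, cast so the value matches the documented type */
      public function getId() {
          $idField = $this->getIdField();
          return (int)$this->row->$idField;
      }
  }

  // Database layers typically hand back string values, e.g. "12345".
  $item = new ExampleRevisionItem( (object)[ 'rev_id' => '12345' ] );
  var_dump( $item->getId() ); // int(12345) rather than string(5) "12345"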

@DannyS712 thank you for the summary ;)

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/594807/ is worth adding, but I don't have any confidence in +2'ing anything in mediawiki/core nowadays.

Mentioned in SAL (#wikimedia-operations) [2020-05-12T12:08:13Z] <hashar> Cutting branch 1.35.0-wmf.32 # T249964

Mentioned in SAL (#wikimedia-operations) [2020-05-12T14:38:03Z] <hashar> 1.35.0-wmf.32 is on test wikis. Will be pushed to group0 later today during the American window (19:00 - 21:00 UTC) # T249964

Haven't seen any new log messages.

@DannyS712 : there is nothing related to Revision :]]]

I have just found out that Monday the 25th is a no-deploy day due to a holiday in the US. I will thus push 1.35.0-wmf.32 to all wikis on Tuesday the 26th at:

  • 13:00 UTC
  • 06:00 PDT
  • 15:00 CEST

No new errors occurring so far \o/

We still got hit by the INSERT rate doubling (T247028), but that is not related to the code being deployed; the issue predates it.

I am keeping this blocker task open in the meantime; I will call it a success if we don't have to roll back due to T247028.

There are barely any new errors.

There is a regression with Vector that causes some features to be missing in the language bar, but it has been decided that it is not worth blocking on (T252800).

The database INSERT rate surged again, but it is not related to the code being deployed (T252800). At least there was no database lag this time.