
Add section for long-running tasks on the Deployment page (especially for database maintenance)
Closed, ResolvedPublic

Description

I discussed this with @hashar as a potential way of discovering interacting maintenance scripts and getting a better picture of what is going on with MediaWiki and its database servers. The spark for this proposal was T136150, which was started with little awareness and created some minor issues due to its long-running nature. In particular, I would not be confident running a schema change on enwiki while that particular maintenance is running. Schema changes cannot interact with deployments (unless, obviously, one of them breaks something), but they interact a lot with the increased activity caused by long-running scripts, and they make things like DC and master DB failovers really difficult.

So the proposal would be to have some way of announcing long-running ongoing changes (especially those related to the database), such as:

  • Schema changes: they do not block deployments, but can interact with long-running batch jobs. They can take weeks to be applied, such as T139090, so they cannot really be reduced to a single deployment slot (and it wouldn't be fair to block regular deployments). No schema change I have made has gone wrong badly enough to create large problems, but there can always be a first time.
  • Long-running maintenance jobs: e.g. update collation jobs such as T136150. I am not talking about "I deploy and then run this script that takes 10 minutes", "i18n updates", or anything that takes less than a deployment window. I am talking about those that happen in the background and can take hours or days to execute.
  • High-impact Operations tasks such as DC failovers, network maintenance, and rolling upgrades of application servers.

@hashar mentioned the possibility of adding a section at the top of Deployments where developers and I can record those ongoing tasks. I would like to hear @greg's opinion on that, and I would also like help communicating and trying to enforce this (even if the only thing we can do is update the written policies and send an email asking for it). The section would not be maintained by Releng; each individual would update it, and I would keep an eye on it and try to keep it up to date.

[https://wikitech.wikimedia.org/wiki/Deployments/Inclusion_criteria | The inclusion criteria] say that "Database schema changes" should be included on the Deployment calendar, but that does not take into account that most schema changes are no longer high-impact (requiring read-only) while, on the other hand, they can take weeks to be deployed. I want to fix that.

Event Timeline

jcrespo created this task.Sep 3 2016, 10:02 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 3 2016, 10:02 AM

I am definitely a fan of having a list of operations that might impact the infrastructure one way or another. From discussion with @jcrespo, it seemed the Deployments wiki page would make the most sense.

I have added this task to Monday Release-Engineering-Team weekly meeting agenda.

jcrespo moved this task from Triage to Backlog on the DBA board.Sep 6 2016, 1:21 PM
greg moved this task from Backlog to Next on the User-greg board.
greg added a comment.Sep 7 2016, 9:20 PM

I'm also a big +1 on having those long running maint scripts explicitly listed (and not just in the SWAT window where the script was initially deployed and ran, but that too).

A trick I used a while ago for things that took a long time but didn't inherently block other deployments was to list them in the "Week of..." section for each week. See: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_September_12th (it says "nothing yet" right now).

That was used for things just like you describe (mostly Operations team doing things that others should be aware of).

It's kind of easy to miss if you (the general 'you') are just looking at specific days/slots, but I'm not sure what else would make it more obvious.

Would using that mechanism (the "week of" section) make sense?

Also, re the inclusion criteria part: if we list them in that "week of" section that should be sufficient (if that is the right model for the solution here).

> Also, re the inclusion criteria part: if we list them in that "week of" section that should be sufficient (if that is the right model for the solution here).

Yes

> It's kind of easy to miss if you (the general 'you') are just looking at specific days/slots, but I'm not sure what else would make it more obvious.

Maybe use the table syntax when there are items active, with more explicit columns (start of the maintenance, potential ending, etc.)?

Also, could we send a reminder recommending its use beyond operations? As I commented before, I am not blocked by deployments, but I am by long-running maintenance scripts.

greg added a subscriber: bd808.Sep 9 2016, 10:09 PM

> It's kind of easy to miss if you (the general 'you') are just looking at specific days/slots, but I'm not sure what else would make it more obvious.
>
> Maybe use the table syntax when there are items active, with more explicit columns (start of the maintenance, potential ending, etc.)?

So, here's a proposal that depends on one piece of unmerged code in jouncebot, but it should be doable soon :) (cc @bd808 )

https://gerrit.wikimedia.org/r/#/c/308086/ adds a now command to jouncebot.

With that and simply adding a window to the deployment calendar that starts at whenever someone plans to start the long running process and is set to last for the estimated time (with some buffer in case it takes longer than planned) then anyone could:
A) See the items in the calendar, where they think they'd be
B) Do jouncebot now in -operations to see if any are currently (probably) running with a link to the event window that has who/what is going on.
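
A minimal sketch of the window idea in Python may make the proposal concrete. The names `Window` and `make_window`, the 25% buffer, and the example dates are assumptions for illustration only; they are not part of the deployment calendar or of jouncebot.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Window:
    """A hypothetical calendar entry for a long-running task."""
    title: str
    start: datetime
    end: datetime

    def active_at(self, when: datetime) -> bool:
        # Half-open interval: active from start up to, but not including, end.
        return self.start <= when < self.end

    def overlaps(self, other: "Window") -> bool:
        # Two half-open intervals overlap when each starts before the other ends.
        return self.start < other.end and other.start < self.end


def make_window(title, start, estimate, buffer_ratio=0.25):
    """Build a window lasting the estimate plus a safety buffer (assumed 25%)."""
    return Window(title, start, start + estimate * (1 + buffer_ratio))


# Example: an 8-hour script gets a 10-hour window, which overlaps a
# week-long schema change that starts four hours later.
script = make_window("Long-running script", datetime(2016, 9, 12, 16, 0),
                     timedelta(hours=8))
schema = Window("Schema change", datetime(2016, 9, 12, 20, 0),
                datetime(2016, 9, 19, 20, 0))
print(script.end, script.overlaps(schema))
```

With entries like these, checking whether a planned schema change would coincide with a running script reduces to a single `overlaps` call.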

> Also, could we send a reminder recommending its use beyond operations? As I commented before, I am not blocked by deployments, but I am by long-running maintenance scripts.

Most definitely. I think this would make sense as an announcement to engineering@ and ops@ lists (where all deployers are subscribed).

bd808 added a comment.Sep 9 2016, 10:39 PM

> So, here's a proposal that depends on one piece of unmerged code in jouncebot, but it should be doable soon :) (cc @bd808 )
>
> https://gerrit.wikimedia.org/r/#/c/308086/ adds a now command to jouncebot.
>
> With that and simply adding a window to the deployment calendar that starts at whenever someone plans to start the long running process and is set to last for the estimated time (with some buffer in case it takes longer than planned) then anyone could:
> A) See the items in the calendar, where they think they'd be
> B) Do jouncebot now in -operations to see if any are currently (probably) running with a link to the event window that has who/what is going on.

jouncebot: now is actually live; I cherry-picked it but haven't self-merged yet.

[16:34]  <    bd808>	jouncebot: now
[16:34]  <jouncebot>	No deployments scheduled for the next 62 hour(s) and 25 minute(s)

In addition to this, it would be possible to add some /topic management to jouncebot to help keep track of what is going on at any given time according to the schedule.

I would like some opinions from other ops or developers. I can promise to maintain the section for schema changes, because normally only one is running at a time and it takes ~1 week; I do not know whether it would create a lot of overhead for ops maintenance or long-running scripts.

And precisely, this only works if people other than me use it :-). I would like to promote it by convincing people of its usefulness (I hope). 0:-)

greg added a comment.Sep 12 2016, 5:11 PM

Here's what the output looks like from jouncebot when there are two overlapping events in the calendar:

17:03 <    greg-g> jouncebot: now
17:03 < jouncebot> For the next 0 hour(s) and 26 minute(s): Weekly Wikidata query service deployment window 
                   (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1700)
17:03 < jouncebot> For the next 1 hour(s) and 56 minute(s): Test long running operation 
                   (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1600)
17:03 <    greg-g> neat

That's from https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1600 and https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1700
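
As a rough illustration of how such a report could be produced, here is a sketch in Python. It is not jouncebot's actual implementation: the function names, the tuple layout, and the fallback message when nothing is upcoming are assumptions, and the window end times in the example are inferred from the remaining times shown in the log above. Only the message wording is modelled on the IRC output.

```python
from datetime import datetime, timedelta


def format_remaining(delta):
    """Render a timedelta as 'H hour(s) and M minute(s)', dropping seconds."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hour(s) and {minutes} minute(s)"


def now_report(events, now):
    """Return one line per event in progress, or the time until the next one.

    events is a list of (title, start, end, url) tuples (hypothetical layout).
    """
    active = [e for e in events if e[1] <= now < e[2]]
    if active:
        return [
            f"For the next {format_remaining(end - now)}: {title} ({url})"
            for title, _start, end, url in active
        ]
    upcoming = [start for _title, start, _end, _url in events if start > now]
    if not upcoming:
        # Assumed fallback; the log above does not show this case.
        return ["No deployments scheduled"]
    return [
        "No deployments scheduled for the next "
        f"{format_remaining(min(upcoming) - now)}"
    ]


base = "https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-"
events = [
    ("Weekly Wikidata query service deployment window",
     datetime(2016, 9, 12, 17, 0), datetime(2016, 9, 12, 17, 30),
     base + "20160912T1700"),
    ("Test long running operation",
     datetime(2016, 9, 12, 16, 0), datetime(2016, 9, 12, 19, 0),
     base + "20160912T1600"),
]
for line in now_report(events, datetime(2016, 9, 12, 17, 3, 30)):
    print(line)
```

Because every active window is reported, a long-running task that overlaps a regular deployment slot shows up alongside it rather than being hidden by it.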

So yeah, I think just having people add a window for the long running task is sufficient? Then we can both look at the deploy calendar for the details and use jouncebot to make sure something isn't still running that might be important.

@bd808: it might also make sense to have jouncebot's announcement/reminder at the start of windows include something about overlapping events?

bd808 added a comment.Sep 12 2016, 5:25 PM

> @bd808: it might also make sense to have jouncebot's announcement/reminder at the start of windows include something about overlapping events?

Could do. I think either that or having jouncebot maintain a "| deploys: ...." section at the end of the channel topic would be reasonable.
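
One way the topic suffix could be kept up to date is sketched below in Python. The helper name `with_deploys_section` and the exact marker string are assumptions for illustration; the thread does not describe how jouncebot would actually manage the topic.

```python
# Assumed marker, modelled on the "| deploys: ...." wording above.
TOPIC_MARKER = " | deploys: "


def with_deploys_section(topic, titles):
    """Return topic with its trailing deploys section replaced or removed."""
    # Drop any existing section; everything before the marker is kept as-is.
    base = topic.split(TOPIC_MARKER, 1)[0].rstrip()
    if not titles:
        return base
    return base + TOPIC_MARKER + ", ".join(titles)


topic = with_deploys_section("Status: up", ["Test long running operation"])
print(topic)
print(with_deploys_section(topic, []))
```

Replacing the section rather than appending to it keeps the operation idempotent, so the topic can be recomputed whenever a window starts or ends.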

greg added a comment.Sep 12 2016, 5:26 PM

yeah, I'd prefer both because I know some people ignore /topic changes (and /topics themselves) ;)

greg added a comment.Sep 20 2016, 10:19 PM

For the task at hand, I've added https://wikitech.wikimedia.org/w/index.php?title=Deployments&action=historysubmit&type=revision&diff=850923&oldid=850244

; Long running tasks/scripts
: While not strictly a deployment, performing long running (>1 hour) tasks (eg: migration scripts) can encounter issues when code is updated while a script is being run. For this reason it is required to add an entry in the calendar for the task with a window that accounts for the anticipated start time and estimated length for the task.

I'll email ops/wikitech-l/engineering shortly with a heads up.

greg added a comment.Sep 20 2016, 10:30 PM

Ok, emailed. Resolving.

Thanks @jcrespo for the suggestion!

greg closed this task as Resolved.Sep 20 2016, 10:30 PM
greg claimed this task.
greg triaged this task as Medium priority.
greg moved this task from Next to Done on the User-greg board.