reboot of rcs servers (stream.wikimedia.org)
Closed, ResolvedPublic

Description

we need to reboot rcs servers for an upgrade

this will affect users of http://wikitech.wikimedia.org/wiki/stream.wikimedia.org

some bot and tool operators may have to restart their clients, and anti-vandalism tools depend on it

like we did last time: tell users about it via the mailing lists, schedule a time window, then reboot servers rcs1001 and rcs1002

Event Timeline

Dzahn renamed this task from reboot of rcs servers to reboot of rcs servers (stream.wikimedia.org).Mar 15 2016, 4:37 PM
Dzahn updated the task description.

I don't really get why a normal operation that doesn't affect the availability or uptime of rcstream should keep being held up because the clients using it are buggy.

It's setting bad expectations and encouraging bad coding practices among client writers: any sane client-side implementation of any streaming protocol (or, more generally, of any long-running TCP connection) will allow reconnecting.

@Joe Yea, i don't disagree, i have just been asked to announce it like last time. You are probably right about setting bad expectations.

Well, for several reasons?

a) Because if anti-vandalism tools stop working, that's a real problem for the Wikimedians who depend on them but have never touched the code in question, nor would know how to.

b) Because we're to a very large degree a volunteer organization with volunteer developers, many of whom are not professional programmers, who sometimes make silly mistakes, and it would make sense for us to have a bit of patience with that and appreciate that they write code to make the projects better in their spare time. I'll be happy to include a sentence in Tech News saying that they should, if possible, rewrite their code if they have a problem with this (a link to an explanation of how would probably be beneficial, we have bot operators who aren't very good at this – do you have a suggestion?), but what's the benefit in letting them find out their bot has stopped working with no announcement telling them why, instead of announcing it beforehand and telling them what the problem is and how they can fix it? (That's a real question, I'm not being rhetorical.)

Or maybe I misunderstand what you're referring to? If so, my apologies. (:

Separately from the announcement discussion, the reboots happened earlier today, right before you added the last 2 comments.

Dzahn claimed this task.

i'm closing the ticket as resolved, not to stop the discussion about announcing it, but because the reboots are technically done and the follow-up issues are in the other tickets

@Johan my point was that a connection can drop for any number of reasons, of which a system reboot is probably the least frequent; if the tools are that important - and we all know they are - detecting and managing disconnections is vital, in particular when the service is still going to be available in general. Announcing the restart of an individual service on an individual server is completely unusual, especially given the service will still be available. There is no other system we manage that requires this type of precaution, which is technically pretty unjustified.

Avoiding disconnections in this specific case is thus just a way to slow down our (already ridiculously overbooked) schedule, and I thought it was adding unneeded overhead. If @Dzahn has been asked to announce it, the number of clients that don't handle disconnects must be non-negligible, which honestly wasn't my expectation (such clients are bound to fail on any software crash or network blip, which should be enough of an annoyance).

Your suggestion to write some documentation showing how to write an rcstream client that handles disconnections is of course good, and in fact the documentation we have at https://wikitech.wikimedia.org/wiki/RCStream doesn't explicitly explain how to do that. When I have some spare time, I'll update the docs accordingly, at least for the Python clients.
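For illustration only, here is a minimal sketch of the kind of reconnect handling discussed above: a loop that reopens the stream whenever it drops and waits a little longer after each consecutive failure. It is not taken from the RCStream documentation and does not use any RCStream or socket.io API. The names `connect`, `handle`, `backoff_delays` and `run_with_reconnect` are hypothetical, and `connect` is assumed to be whatever callable opens the stream in a given client and returns an iterable of events.

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, jitter=0.1):
    """Yield an endless series of delays that double each time, up to `cap`.

    A small random jitter is added so that many clients do not all
    reconnect at the same instant after a server restart.
    """
    delay = base
    while True:
        yield delay + random.uniform(0, jitter * delay)
        delay = min(cap, delay * 2)


def run_with_reconnect(connect, handle, sleep=time.sleep, max_failures=None):
    """Consume events from a stream, reconnecting whenever it drops.

    connect      -- hypothetical callable that opens the stream and returns
                    an iterable of events; assumed to raise OSError (which
                    includes ConnectionError) when the connection fails
    handle       -- callable invoked once per received event
    max_failures -- stop after this many consecutive failed attempts;
                    None means keep trying forever
    """
    failures = 0
    delays = backoff_delays()
    while max_failures is None or failures < max_failures:
        try:
            for event in connect():
                handle(event)
                # Data is flowing again, so start the backoff from scratch.
                failures = 0
                delays = backoff_delays()
        except OSError:
            pass  # connection refused, reset, timed out, and so on
        # Reaching this point means the stream ended or failed: wait, retry.
        failures += 1
        sleep(next(delays))
```

A bot would pass its own connection function and event handler, for example `run_with_reconnect(open_stream, process_change)`, where both names stand in for the bot's existing code. A stream that ends without an error is treated like a disconnect, so a server reboot simply leads to a new connection attempt a moment later.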

I wasn't intending to show disrespect or a lack of appreciation for the work of volunteers, and I think you read too much into my comment in that direction.

@Joe: Yes, I realized the possibility of me having done so, thus apologizing for possibly misreading you. (: Mea culpa.

@Johan Under normal circumstances we should be able to do these reboots with minimal user impact. Nevertheless, because T130147 happened, there might have been a bit more impact for a few minutes. I apologize for that.