Have fallback communication channel when freenode has problems
Closed, Declined · Public

Description

Not just operations but engineering folks need to have a space to communicate e.g. during deploys, outages etc when freenode is down or is netsplitting or what have you.

This has been discussed in the past but without resolution. One option could be another irc network, or an internal server (which we hope isn't down due to whatever outage we're investigating).

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 23 2016, 11:23 PM
greg added a subscriber: greg. Feb 23 2016, 11:28 PM

Can't tell if trolling or not ;) But, that's actually not a bad idea. They can be publicly viewable/joinable, and everyone working on an outage will have a Phab account. It's better than, e.g., a Google Hangouts chatroom.

ArielGlenn triaged this task as High priority. Feb 23 2016, 11:29 PM
ArielGlenn added a project: Deployments.

I slapped deployment-systems on here because people doing deployments will be one of the main users of such a fallback setup. Yeah, that's not exactly scap3 development but I didn't find anything better.

Can't tell if trolling or not ;) But, that's actually not a bad idea.

Yep. Despite my trolling earlier, this is something in our control and works reasonably well. Other ideas would be Etherpad or Google Hangouts. Hangouts seems fine too for private discussions. It recently got a standalone https://hangouts.google.com/ web interface as well (no longer just via GMail or Google Plus).

Etherpad is public, which might not be cool; a chunk of what we do might want to wind up in a private space. But it could replace _operations temporarily if we had a set name to use.

greg added a comment. Feb 23 2016, 11:52 PM

I slapped deployment-systems on here because people doing deployments will be one of the main users of such a fallback setup. Yeah, that's not exactly scap3 development but I didn't find anything better.

Deployments is kind of turning into a "things related to deploying but not actually trebuchet or scap or salt" place, so yeah, that's fine (if not confusing).

Yep. Despite my trolling earlier, this is something in our control and works reasonably well. Other ideas would be Etherpad or Google Hangouts. Hangouts seems fine too for private discussions. It recently got a standalone https://hangouts.google.com/ web interface as well (no longer just via GMail or Google Plus).

Yeah. Etherpad is bad due to accidental (or on purpose, by a jerk/cause of the outage, for instance) deletion. And Hangouts means people wanting to follow along (a la -operations now) have to have a Google account. At least with Conpherence we control the accounts (and it's just a mw.org account, really).

greg added a comment. Feb 23 2016, 11:54 PM

Etherpad is public, which might not be cool; a chunk of what we do might want to wind up in a private space. But it could replace _operations temporarily if we had a set name to use.

Yeah, and Conpherence rooms can be either private or public.

Dzahn added a subscriber: Dzahn. Aug 10 2016, 8:59 PM

WMF used to run a freenode server (T82958), but we are now decom'ing it (T120752). That would have been the perfect fallback: if there are netsplits or all of freenode is down, we'd still have our local server working, just not linked into the freenode network.

greg added a comment. Aug 10 2016, 9:22 PM

I'm inclined to just have an official Conpherence room for this. It'd need to be clear that this room (or any solution, really) is only for backup purposes when Freenode is down.

Would we want:

  • 1 room that is public to everyone and joinable by anyone in Phab
  • 1 room that is private and only joinable by those who are in WMF-NDA? (or some other manual list? that's no bueno though)
  • 2 rooms, one private and one public

TBF: the self-hosted Freenode server solution wouldn't have made us make that choice: we'd still have the public and private channels that we do now (just people would need to potentially reconnect their IRC clients to that server explicitly if they weren't already on it during a netsplit/other Freenode outage).

Do we prefer a fallback that cannot be impacted by a Wikimedia outage of any kind? Conpherence is an option, but it is not off-site; a network outage affecting Wikimedia will render Conpherence useless.

Another IRC network (there are dozens of them), Google Hangouts and Slack are alternatives which do not have this problem. A Phabricator installation on a Rackspace VM should not be impacted either if we have some trouble, though it is important that you can find its IP relatively easily somewhere (even if DNS is down).

Dzahn added a comment. Aug 10 2016, 9:45 PM

There is also the external VM that runs wikitech-static. It is outside WMF infra for this reason.

What we can do too: I set up a completely independent ircd in the past (for ircd-related tests) on Labs. Currently it allows at least 1000 users (I think you can raise that limit). It's not hard to install ircd-seven (the IRC software freenode uses) and atheme (the services software freenode uses too). Of course it is normally easier to use freenode, but as a fallback it makes sense (that would be a WMF-maintained server I guess, not my install on Labs, but if you need somebody to maintain that, I can do that).

Marostegui lowered the priority of this task from High to Normal. Jun 13 2017, 12:48 PM
Dzahn added a comment. Jun 13 2017, 5:30 PM

We could just agree on something like "if freenode is down we all switch to efnet, same channel names" and be done with it, instead of installing our own ircd internally.

http://www.efnet.org/
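
For illustration only, the "switch to the next network, same channel names" rule could be sketched as below. This is a minimal sketch, not an agreed plan: the host names, ports, and the names FALLBACK_ORDER, tcp_reachable and pick_network are all assumptions made up for the example. Note that a plain TCP check only tells you a server accepts connections; it cannot detect a netsplit.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Server = Tuple[str, int]

# Hypothetical priority list. Host names and ports are illustrative
# assumptions, not an agreed Wikimedia configuration.
FALLBACK_ORDER = [
    ("chat.freenode.net", 6667),  # primary network
    ("irc.efnet.org", 6667),      # fallback network, same channel names
]


def tcp_reachable(server: Server, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection(server, timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def pick_network(
    servers: Iterable[Server],
    is_up: Callable[[Server], bool] = tcp_reachable,
) -> Optional[Server]:
    """Return the first server in priority order that is_up accepts, or None."""
    for server in servers:
        if is_up(server):
            return server
    return None
```

Calling pick_network(FALLBACK_ORDER) would return the first entry that accepts a TCP connection; passing a different is_up function lets the rule be tried without touching the network.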

Dzahn added a comment. Edited Jun 22 2017, 12:02 AM

alternative idea: we can do T168579 and then say "if there is a netsplit, we all connect to our local freenode server". It would still be up even if the rest of freenode is somehow gone and not linked to it anymore.

faidon closed this task as Declined. Jun 22 2017, 3:13 AM
faidon added a subscriber: faidon.

I've been using freenode for maybe 15 years now, 1/3 of which for Wikimedia, and I can probably count the number of times the entire network went down on the fingers of one hand. In the rare case this happens, we can just stop deployments, and if we need an emergency deployment or something, we can always coordinate via alternative mediums (Hangouts chat between e.g. opsens/releng, or that Conpherence room above). If it becomes a major downtime where it lasts hours/days, we can then explore our alternatives and set up e.g. our channels on OFTC or something.

Anything else, like setting up our own IRC server, or preparing a detailed fallback plan, sounds a bit overengineered to me and a waste of our time. I'm going to be bold and decline this, but if you disagree, feel free to reopen and we can discuss further :)

Another IRC network (there are dozens of them), Google Hangouts and Slack are alternatives which do not have this problem. A Phabricator installation on a Rackspace VM should not be impacted either if we have some trouble, though it is important that you can find its IP relatively easily somewhere (even if DNS is down).

Depending on the type of outage (e.g., if it's only freenode that's having problems), we have the wikis and mailing lists such as wikitech-l@lists.wikimedia.org as alternate communication media as well, of course. I agree that freenode has been and is pretty stable. Having at least a default agreed upon contingency venue may be useful, which may have been the purpose of filing this task.