Migrate irc.wikimedia.org/kraz to Buster
Closed, Resolved (Public)

Description

This will be transferred to a new service in the future (per T185319), but depending on the time scale it might need another update to Stretch/Buster beforehand. kraz uses a custom, patched version of ircd-ratbox.

Event Timeline

ArielGlenn triaged this task as Medium priority. Jun 11 2019, 7:54 AM
MoritzMuehlenhoff renamed this task from Migrate irc.wikimedia.org/kraz to Stretch/Buster to Migrate irc.wikimedia.org/kraz to Buster. May 29 2020, 8:47 AM

Hi, is this going to happen? The Beta cluster also has an IRC server running Jessie, and in an effort to get rid of Jessie on beta, I'm offering it as a test opportunity for the production upgrade.

In T224579#6887456, @Majavah wrote:

Hi, is this going to happen? The Beta cluster also has an IRC server running Jessie, and in an effort to get rid of Jessie on beta, I'm offering it as a test opportunity for the production upgrade.

That sounds like a good idea! There's a handful of packages which need to be made available for Buster. I'll update the task when that's done.

Mentioned in SAL (#wikimedia-operations) [2021-03-09T15:56:57Z] <moritzm> imported prometheus-ircd-exporter 0.2 to apt.wikimedia.org T224579

I've imported all the custom Buster packages needed by the mw_rc_irc role. If you need anything merged to puppet.git (Hiera or so), please ping me on IRC (moritzm).

Hi, thanks! Copying from the subtask:

In T277081#6902189, @Majavah wrote:

deployment-ircd02 is now working on Buster and from a very quick look it seems to be working properly.

Change 670829 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Assign mw_rc_irc role to irc2001

https://gerrit.wikimedia.org/r/670829

Change 670913 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[operations/mediawiki-config@master] Support having multiple IRC feed servers

https://gerrit.wikimedia.org/r/670913

Change 670914 had a related patch set uploaded (by Krinkle; owner: Legoktm):
[operations/mediawiki-config@master] Define IRC feed servers as an array in {Production,Labs}Services.php

https://gerrit.wikimedia.org/r/670914

Change 670915 had a related patch set uploaded (by Krinkle; owner: Legoktm):
[operations/mediawiki-config@master] Remove back-compat from when IRC feed servers was a string

https://gerrit.wikimedia.org/r/670915

Change 670829 merged by Muehlenhoff:
[operations/puppet@production] Assign mw_rc_irc role to irc2001

https://gerrit.wikimedia.org/r/670829

Change 670913 merged by jenkins-bot:
[operations/mediawiki-config@master] Support having multiple IRC feed servers

https://gerrit.wikimedia.org/r/670913

Change 670914 merged by jenkins-bot:
[operations/mediawiki-config@master] Define IRC feed servers as an array in {Production,Labs}Services.php

https://gerrit.wikimedia.org/r/670914

Mentioned in SAL (#wikimedia-operations) [2021-03-15T23:23:04Z] <legoktm@deploy1002> Synchronized wmf-config/CommonSettings.php: Support having multiple IRC feed servers (T224579) (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2021-03-15T23:24:41Z] <legoktm@deploy1002> Synchronized wmf-config/: Define IRC feed servers as an array in {Production,Labs}Services.php (T224579) (duration: 00m 59s)

Change 670915 merged by jenkins-bot:
[operations/mediawiki-config@master] Remove back-compat from when IRC feed servers was a string

https://gerrit.wikimedia.org/r/670915

Mentioned in SAL (#wikimedia-operations) [2021-03-15T23:31:27Z] <legoktm@deploy1002> Synchronized wmf-config/CommonSettings.php: Remove back-compat from when IRC feed servers was a string (T224579) (duration: 00m 59s)

Should be all set on the MW side now.

Change 672687 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/mediawiki-config@master] Add irc2001.wikimedia.org (running buster) as second irc server

https://gerrit.wikimedia.org/r/672687

Since yesterday, the Prometheus jobs reduced availability alert has been firing about ircd on irc2001. Looking at the logs, there appears to be some breakdown in communication between prometheus-ircd-exporter and ircd:

Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: ERROR:__main__:Failed to connect to IRC server
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: ERROR:__main__:Failed to close connection to IRC server
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: Traceback (most recent call last):
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 599, in process_request_thread
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     self.finish_request(request, client_address)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     self.RequestHandlerClass(request, client_address, self)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     self.handle()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     self.handle_one_request()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     method()
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     output = encoder(registry)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     for metric in registry.collect():
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/registry.py", line 75, in collect
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     for metric in collector.collect():
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/bin/prometheus-ircd-exporter", line 55, in collect
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/socket.py", line 191, in __init__
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]:     _sock = _realsocket(family, type, proto)
Mar 16 15:28:31 irc2001 prometheus-ircd-exporter[13830]: error: [Errno 24] Too many open files
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: Traceback (most recent call last):
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 599, in process_request_thread
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     self.finish_request(request, client_address)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     self.RequestHandlerClass(request, client_address, self)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     self.handle()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     self.handle_one_request()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     method()
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     output = encoder(registry)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     for metric in registry.collect():
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/dist-packages/prometheus_client/registry.py", line 75, in collect
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     for metric in collector.collect():
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/bin/prometheus-ircd-exporter", line 55, in collect
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:   File "/usr/lib/python2.7/socket.py", line 191, in __init__
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]:     _sock = _realsocket(family, type, proto)
Mar 16 15:29:31 irc2001 prometheus-ircd-exporter[13830]: error: [Errno 24] Too many open files
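
As an aside, "[Errno 24] Too many open files" means the process has reached its RLIMIT_NOFILE soft limit. A minimal sketch for checking how close a Python process is to that limit, assuming Linux (it reads /proc/self/fd); the helper name fd_usage is hypothetical and is not part of the exporter:

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process (Linux only)."""
    # Each entry in /proc/self/fd is one open file descriptor.
    # The listing itself briefly holds one descriptor open, so the count
    # is consistently one higher than the "steady state" number.
    open_fds = len(os.listdir("/proc/self/fd"))
    # The soft limit is the value that triggers EMFILE (Errno 24).
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"{used} of {limit} file descriptors in use")
```

A count that climbs steadily between scrapes would point at descriptors that are opened but never closed.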

It looks like the exporter was stuck in a loop:

[pid 18567] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18486] <... futex resumed> )       = 0
[pid 18567] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18486] recvfrom(1020,  <unfinished ...>
[pid 18567] <... futex resumed> )       = 0
[pid 18486] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18567] recvfrom(1022,  <unfinished ...>
[pid 18486] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18567] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18486] <... futex resumed> )       = 0
[pid 18567] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18486] recvfrom(1020,  <unfinished ...>
[pid 18567] <... futex resumed> )       = 0
[pid 18486] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18567] recvfrom(1022,  <unfinished ...>
[pid 18486] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18567] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18486] <... futex resumed> )       = 0
[pid 18567] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18486] recvfrom(1020,  <unfinished ...>
[pid 18567] <... futex resumed> )       = 0
[pid 18486] <... recvfrom resumed> "", 8192, 0, NULL, NULL) = 0
[pid 18567] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18366] <... futex resumed> )       = 0
[pid 18567] <... futex resumed> )       = 0
[pid 18486] futex(0x55a8bff89fb0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>

And connections from Prometheus kept piling up. AFAIK the service/exporter is not owned ATM, I've restarted the exporter but this is obviously bound to happen again.
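
For illustration: in the strace output above, recvfrom keeps returning an empty string (0 bytes), which is how a closed peer shows up on a stream socket. One plausible reading is that a read loop never treats that as end-of-stream, so it spins and its socket is never released. The following is only a sketch of the general pattern in Python 3, not the exporter's actual code; read_reply, scrape, and their parameters are hypothetical names:

```python
import socket


def read_reply(sock, terminator=b"\r\n", max_bytes=65536):
    """Read until terminator is seen, the peer closes, or max_bytes is reached."""
    buf = bytearray()
    while terminator not in buf and len(buf) < max_bytes:
        chunk = sock.recv(8192)
        if not chunk:
            # recv() returning b"" means the peer closed the connection;
            # looping again would only return b"" forever.
            break
        buf.extend(chunk)
    return bytes(buf)


def scrape(host, port, timeout=5.0):
    """Open a connection, read one reply, and always release the socket."""
    # The with-block closes the descriptor even if recv() raises or times out.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return read_reply(sock)
```

The timeout bounds how long a scrape can hold a descriptor, and the with-block guarantees the close, so failed scrapes cannot accumulate open sockets.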

Sure enough, the exporter is out of FDs again. I'm +1 to just remove the exporter since the service doesn't have an owner, the exporter is python2, and afaict we don't use the metrics anyway. Thoughts?

+1, let's kill it with fire. It was never really used (and was only added to replace an old Diamond collector), and in the near future it will be replaced with something new anyway (which will have its own custom metrics). Fixing this seems like flogging a dead horse.

Change 673972 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: remove ircd-exporter

https://gerrit.wikimedia.org/r/673972

Change 673972 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: remove ircd-exporter

https://gerrit.wikimedia.org/r/673972

Change 672687 merged by jenkins-bot:
[operations/mediawiki-config@master] Add irc2001.wikimedia.org (running buster) as second irc server

https://gerrit.wikimedia.org/r/672687

Mentioned in SAL (#wikimedia-operations) [2021-03-23T18:10:38Z] <legoktm@deploy1002> Synchronized wmf-config/ProductionServices.php: Add irc2001.wikimedia.org (running buster) as second irc server (T224579) (duration: 01m 08s)

Events are now going to irc2001.wikimedia.org. I watched #en.wikipedia on both kraz and irc2001 for a few minutes and saw identical output (note that channels won't exist on the new server until an edit/log entry comes through).

From T123729: Migrate irc.wikimedia.org to Jessie:

  • Announce in Tech/News, wikitech-l, wikitech-ambassadors that we'll be switching irc.wikimedia.org over to a new server on XX. Include a reminder that clients should switch to eventstreams if possible.
  • On XX, switch irc.wikimedia.org DNS to point to irc2001. Clients can switch to the new server manually if they want. All new connections will go to irc2001.
  • On XX + 1 week (or shorter?), shut down kraz.wikimedia.org. All clients should automatically reconnect to irc2001.

When should XX be?

Moritz is going to switch DNS and reboot kraz "Thursday during the European morning", announcement to be sent shortly.


For Tech News / User-notice:

  • The Wikimedia IRC RC feeds have been switched to a new server. Make sure all tools automatically reconnect to irc.wikimedia.org and not the name of any specific server. Users should also consider switching to EventStreams, a more modern alternative.
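
A minimal sketch of what "reconnect to irc.wikimedia.org and not the name of any specific server" can look like in a Python client. The port number, the backoff values, and the function names are assumptions for illustration; connect and sleep are injectable only so the logic can be exercised without a network:

```python
import socket
import time

SERVICE_NAME = "irc.wikimedia.org"  # the service alias, never a specific host
PORT = 6667  # assumption: the standard plaintext IRC port


def backoff_delays(base=1.0, cap=60.0):
    """Yield exponentially growing delays, capped at cap seconds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


def connect_with_retry(connect=socket.create_connection, sleep=time.sleep,
                       max_attempts=5):
    """Connect to SERVICE_NAME, passing the hostname on every attempt."""
    delays = backoff_delays()
    last_error = None
    for attempt in range(max_attempts):
        try:
            # Passing the name (not a cached IP address) lets each attempt
            # go through the resolver again, so a DNS change is picked up.
            return connect((SERVICE_NAME, PORT), timeout=10)
        except OSError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(next(delays))
    raise ConnectionError(f"could not reach {SERVICE_NAME}") from last_error
```

A client would call connect_with_retry() again whenever its connection drops, rather than reusing an address it resolved at startup.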

Change 674617 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/dns@master] Point irc.wikimedia.org to irc2001

https://gerrit.wikimedia.org/r/674617

Mentioned in SAL (#wikimedia-operations) [2021-03-24T15:42:12Z] <moritzm> reduce RAM for irc2001 to 2G, was originally created with 8 G T224579

Change 674617 merged by Muehlenhoff:
[operations/dns@master] Point irc.wikimedia.org to irc2001

https://gerrit.wikimedia.org/r/674617

I've rebooted kraz to force the remaining bots still connected to kraz to reconnect to irc2001.w.o.

Those connections are quite long-lived. I sampled some stats for bots connected to #de.wikipedia: a day after the CNAME failover, half of the bots had moved to irc2001, but two weeks later a third of the bots were still connected to the old IP (until kraz was rebooted).

Change 677806 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/mediawiki-config@master] Broadcase IRC events to irc1001 instead of kraz

https://gerrit.wikimedia.org/r/677806

Change 677806 merged by jenkins-bot:

[operations/mediawiki-config@master] Broadcast IRC events to irc1001 instead of kraz

https://gerrit.wikimedia.org/r/677806

Mentioned in SAL (#wikimedia-operations) [2021-04-13T23:27:26Z] <legoktm@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:677806|Broadcast IRC events to irc1001 instead of kraz (T224579)]] (duration: 01m 06s)

kraz is ready for decom now \o/

cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: kraz.wikimedia.org

  • kraz.wikimedia.org (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
MoritzMuehlenhoff claimed this task.

kraz has been replaced by two Buster instances (irc1001.wikimedia.org and irc2001.wikimedia.org) and was eventually removed.