
udpmxircecho spam/not working if unable to connect to irc server
Open, MediumPublic

Description

Upon reboot, ircecho on kraz started, but the code simply prints errors and doesn't try to reconnect. A restart fixes it, but it should really either exit or retry by itself.

May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:15 kraz udpmxircecho.py[931]: Not connected.
May 10 12:32:25 kraz systemd[1]: Stopping IRC bot for the MW RC IRCD...
May 10 12:32:25 kraz udpmxircecho.py[931]: Not
May 10 12:32:25 kraz systemd[1]: Stopped IRC bot for the MW RC IRCD.
May 10 12:32:35 kraz systemd[1]: Starting IRC bot for the MW RC IRCD...
May 10 12:32:35 kraz systemd[1]: Started IRC bot for the MW RC IRCD.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.May 10 2016, 12:48 PM

A "Requires=ircd.service" in the ircecho unit would probably fix this.

Dzahn claimed this task.May 10 2016, 2:44 PM

yeah I think it would, we should test ircd restarts while ircecho is running

Dzahn added a comment.May 10 2016, 4:26 PM

yep, i'll test it in labs. (we know setup works there now. i have added fake private data in labs/private, Krenair has tested it and fixed more. so that should not be hard now :)

Dzahn added a comment.May 10 2016, 4:28 PM

(but don't restart ircd in prod, users hate it :)

thanks @Dzahn @Krenair! I think ircecho should react better to exceptions. Generally, a blanket except Exception is a bad sign in Python. Possibly it could just log the exception and exit and have systemd restart it, or even better, reconnect.
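
A minimal sketch of the log-and-exit idea, assuming Python 3 and a unit configured with Restart=on-failure (a standard systemd option; whether the ircecho unit sets it is not confirmed in this task). The names run_bot, connect and serve are hypothetical and are not taken from udpmxircecho.py:

```python
import logging
import sys

log = logging.getLogger("ircecho-sketch")


def run_bot(connect, serve):
    """Run the bot once and return a process exit code.

    connect and serve are hypothetical callables standing in for the
    real connection and main-loop code. Only network errors (OSError and
    its subclasses, e.g. ConnectionRefusedError) are caught; anything
    else propagates so that real bugs are not hidden.
    """
    try:
        conn = connect()
        serve(conn)
    except OSError as exc:
        log.error("IRC connection failed, exiting: %s", exc)
        return 1
    return 0


def main(connect, serve):
    # A non-zero exit status lets systemd treat the run as a failure
    # and restart the service (assuming Restart=on-failure).
    logging.basicConfig(level=logging.INFO)
    sys.exit(run_bot(connect, serve))
```

Catching only OSError rather than Exception means a failed connection is logged once and ends the process, instead of looping on "Not connected." forever.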

Krenair renamed this task from ircecho spam/not working if unable to connect to irc server to udpmxircecho spam/not working if unable to connect to irc server.May 11 2016, 9:50 AM

for the record: created labs project "ircd" with instance "udpmx-01", added puppet group / role class

before reboot, things all work now in labs

root@udpmx-01:/etc/systemd/system# systemctl status ircd
● ircd.service - IRCd for Mediawiki RecentChanges feed
   Loaded: loaded (/etc/systemd/system/ircd.service; disabled)
   Active: active (running) since Tue 2016-05-10 21:58:50 UTC; 1 day 1h ago
 Main PID: 1474 (ircd)
   CGroup: /system.slice/ircd.service
           └─1474 /usr/bin/ircd -foreground

May 10 21:58:50 udpmx-01 systemd[1]: Started IRCd for Mediawiki RecentChanges feed.
root@udpmx-01:/etc/systemd/system# systemctl status ircecho
● ircecho.service - IRC bot for the MW RC IRCD
   Loaded: loaded (/etc/systemd/system/ircecho.service; disabled)
   Active: active (running) since Tue 2016-05-10 21:58:48 UTC; 1 day 1h ago
 Main PID: 1177 (python)
   CGroup: /system.slice/ircecho.service
           └─1177 python /usr/local/bin/udpmxircecho.py

May 10 21:58:48 udpmx-01 systemd[1]: Started IRC bot for the MW RC IRCD.
Info: Caching catalog for udpmx-01.ircd.eqiad.wmflabs
Info: Applying configuration version '1463008283'
Notice: Finished catalog run in 7.02 seconds

..but the bot is not on the irc server. after restarting the service it is connected. checked with irssi /whois rc-pmtpa, so it's as reported.

then added the Requires=ircd.service into the ircecho unit file, disabled puppet, rebooted instance..

but now both services are " Active: inactive (dead)" ...:p looking ...

Dzahn added a comment.May 12 2016, 1:33 AM

I tried a couple of different combinations of Requires=, After=, Before= etc. in both unit files, but I haven't yet found one that results in both services being up after reboot AND the bot being logged in without a manual service restart. I also tried letting udpmxircecho.py die if it can't connect or join channels, but it just sits there and keeps trying to connect even when ircd is not running.
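
For the "retry by itself" option, a bounded retry with exponential backoff could look roughly like the sketch below. It assumes Python 3 and a plain TCP connection; connect_with_backoff and all of its parameters are hypothetical and do not come from udpmxircecho.py. After the last failed attempt the error is re-raised, so the process can die instead of sitting there indefinitely:

```python
import socket
import time


def connect_with_backoff(host, port, attempts=5, base_delay=1.0,
                         max_delay=30.0,
                         connect=socket.create_connection,
                         sleep=time.sleep):
    """Try to open a TCP connection, doubling the wait after each failure.

    connect and sleep are injectable so the retry logic can be tested
    without a real IRC server. Raises the last OSError once all
    attempts are used up.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect((host, port), timeout=10)
        except OSError:
            if attempt == attempts:
                raise  # give up; the caller can log and exit
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Something like connect_with_backoff("localhost", 6667) would then either return a connected socket or raise after the final attempt; the host and port here are placeholders, not values taken from the production config.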

Change 290588 had a related patch set uploaded (by Dzahn):
irecho: add systemd require/after to start after ircd

https://gerrit.wikimedia.org/r/290588

Change 290588 merged by Dzahn:
ircecho: add systemd require/after to start after ircd

https://gerrit.wikimedia.org/r/290588

Dzahn added a comment.May 25 2016, 1:41 AM

so this solved it insofar as, after a reboot, the next puppet run will start both the IRC server and the IRC bot, and the bot will be logged in! tested in labs on the udpmx-01 instance

May 25 01:34:15 udpmx-01 systemd[1]: Starting IRCd for Mediawiki RecentChanges feed...
May 25 01:34:15 udpmx-01 systemd[1]: Started IRCd for Mediawiki RecentChanges feed.
May 25 01:34:15 udpmx-01 systemd[1]: Starting IRC bot for the MW RC IRCD...
May 25 01:34:15 udpmx-01 systemd[1]: Started IRC bot for the MW RC IRCD.

Dzahn added a comment.May 25 2016, 1:43 AM

puppet just starts the ircecho service and then both are up:

Notice: /Stage[main]/Mw_rc_irc::Irc_echo/Service[ircecho]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Mw_rc_irc::Irc_echo/Service[ircecho]: Unscheduling refresh on Service[ircecho]
Notice: Finished catalog run in 12.31 seconds

● ircd.service - IRCd for Mediawiki RecentChanges feed
   Loaded: loaded (/etc/systemd/system/ircd.service; disabled)
   Active: active (running) since Wed 2016-05-25 01:39:58 UTC; 21s ago

● ircecho.service - IRC bot for the MW RC IRCD
   Loaded: loaded (/etc/systemd/system/ircecho.service; disabled)
   Active: active (running) since Wed 2016-05-25 01:39:58 UTC; 28s ago

Dzahn added a comment.Jun 10 2016, 5:12 AM

@fgiunchedi btw since we once talked about this on IRC. meanwhile i added you to the labs project for this.

https://wikitech.wikimedia.org/wiki/Nova_Resource:Ircd

Dzahn removed Dzahn as the assignee of this task.Jun 10 2016, 5:13 AM

i call this partially resolved.

and now giving this back to the pool for the moment. going to be on vacation and would be happy if anyone wants to take a look

Dzahn lowered the priority of this task from High to Medium.Jun 10 2016, 7:30 PM

also lowering priority because it is much less severe now.

before: if a reboot happens, the bot does not come back at all until a human intervenes and restarts it

now: if a reboot happens, the bot comes back after the next puppet run (just not right after boot, before that run)