Page MenuHomePhabricator

tcpircbot-logmsgbot was not able to deliver messages
Open, MediumPublic

Description

Update

After a Libera.Chat IRC netsplit tcpircbot-logmsgbot on alert1001 was still seen as connected to the channels but was not able to deliver !log messages to the #wikimedia-operations channel.

tcpircbot-logmsgbot should be more resilient to IRC netsplits/disconnect/re-authentication and/or systemd should restart the whole unit if we detect the bot is not fully operational.

Original title: dbctl !log doesn't work

Original task description

All of a sudden dbctl config commit isn't logging things:

root@cumin1001:/home/marostegui/T282373_T282372_T282371# dbctl instance db2071 pool
root@cumin1001:/home/marostegui/T282373_T282372_T282371# dbctl config commit -m "Repool db2071"
codfw/groupLoadsBySection/s1 live                                  codfw/groupLoadsBySection/s1 generated
{                                                                  {
    "api": {                                                           "api": {
                                                                           "db2071": 100,
        "db2072": 100,                                                     "db2072": 100,
        "db2092": 100,                                                     "db2092": 100,
        "db2130": 100                                                      "db2130": 100
    },                                                                 },
    "contributions": {                                                 "contributions": {
codfw/sectionLoads/s1 live                                         codfw/sectionLoads/s1 generated
[                                                                  [
    {                                                                  {
        "db2112": 0                                                        "db2112": 0
    },                                                                 },
    {                                                                  {
                                                                           "db2071": 200,
        "db2072": 200,                                                     "db2072": 200,
        "db2085:3311": 50,                                                 "db2085:3311": 50,
        "db2088:3311": 50,                                                 "db2088:3311": 50,
        "db2092": 200,                                                     "db2092": 200,
        "db2103": 400,                                                     "db2103": 400,

Enter y or yes to confirm: y
Previous configuration saved. To restore it run: dbctl config restore /var/cache/conftool/dbconfig/20210602-045736-marostegui.json
WARNING:conftool.announce:dbctl commit (dc=all): 'Repool db2071', diff saved to https://phabricator.wikimedia.org/P16248 and previous config saved to /var/cache/conftool/dbconfig/20210602-045736-marostegui.json

The !log part doesn't seem to be arriving to IRC.
The bot is on -operations as I just tried:

[06:56:09]  <marostegui> !log Test
[06:56:12]  <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

Event Timeline

Volans triaged this task as High priority.Wed, Jun 2, 5:15 AM

I think it was the IRC Libera netsplit of yesterday, from its logs on alert1001 (journalctl -u tcpircbot-logmsgbot.service)

Jun 01 15:23:36 alert1001 python3[5995]: 2021-06-01 15:23:36,007 TCP ('::ffff:10.64.32.25', 39558, 0, 0): "[....SNIP...]
Jun 01 17:24:13 alert1001 python3[5995]: Traceback (most recent call last):
Jun 01 17:24:13 alert1001 python3[5995]:   File "tcpircbot.py", line 146, in <module>
Jun 01 17:24:13 alert1001 python3[5995]:     readable, _, _ = select.select([bot.connection.socket] + files, [], [])
Jun 01 17:24:13 alert1001 python3[5995]: TypeError: argument must be an int, or have a fileno() method.
Jun 01 17:24:13 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Main process exited, code=exited, status=1/FAILURE
Jun 01 17:24:13 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Failed with result 'exit-code'.
Jun 01 17:24:18 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Service RestartSec=5s expired, scheduling restart.
Jun 01 17:24:18 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Scheduled restart job, restart counter is at 27.
Jun 01 17:24:18 alert1001 systemd[1]: Stopped TCP socket to IRC bot: tcpircbot-logmsgbot.
Jun 01 17:24:18 alert1001 systemd[1]: Started TCP socket to IRC bot: tcpircbot-logmsgbot.
Jun 01 17:24:19 alert1001 python3[31392]: 2021-06-01 17:24:19,173 platinum.libera.chat ['*** Checking Ident']
Jun 01 17:24:19 alert1001 python3[31392]: 2021-06-01 17:24:19,173 platinum.libera.chat ['*** Looking up your hostname...']
Jun 01 17:24:19 alert1001 python3[31392]: 2021-06-01 17:24:19,173 platinum.libera.chat ['*** Found your hostname: alert1001.wikimedia.org']
Jun 01 17:24:24 alert1001 python3[31392]: 2021-06-01 17:24:24,605 platinum.libera.chat ['*** No Ident response']
Jun 01 17:24:24 alert1001 python3[31392]: 2021-06-01 17:24:24,606 welcome [platinum.libera.chat -> logmsgbot]
Jun 01 17:24:54 alert1001 python3[31392]: 2021-06-01 17:24:54,972 NickServ!NickServ@services.libera.chat ['This nickname is registered. Please choose a diffe
Jun 01 17:24:55 alert1001 python3[31392]: 2021-06-01 17:24:55,078 NickServ!NickServ@services.libera.chat ['You have 30 seconds to identify to your nickname b
Jun 01 17:24:55 alert1001 python3[31392]: 2021-06-01 17:24:55,078 NickServ!NickServ@services.libera.chat ['You are now identified for \x02logmsgbot\x02.']
Jun 01 17:24:55 alert1001 python3[31392]: 2021-06-01 17:24:55,078 NickServ!NickServ@services.libera.chat ['Last login from: \x02~logmsgbot@alert1001.wikime

I'm restarting it

Mentioned in SAL (#wikimedia-operations) [2021-06-02T05:15:27Z] <volans|off> restart tcpircbot-logmsgbot on alert1001 - T284123

The restart seems to have solved the issue, logs actions are back working. We might want to investigate if we can make the bot more resilient to netsplit. I'm updating the task description and related tags to reflect that.

I've re-played all !log messages that got lost with a [REPLAY FROM $timestamp] message suffix. See https://sal.toolforge.org/production?p=0&q=%22REPLAY+FROM%22&d=2021-06-02

For reference I got the !log messages from the logs on alert1001 and then I used on the same host:

>>> import socket
>>> sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> sock.settimeout(10)
>>> sock.connect(('icinga.wikimedia.org', 9200))
>>> sock.sendall('$ORIGINAL_MESSAGE [REPLAY FROM $TIMESTAMP]'.encode())
[...REPEAT...]
Volans renamed this task from dbctl !log doesn't work to tcpircbot-logmsgbot was not able to deliver messages.Wed, Jun 2, 5:43 AM
Volans lowered the priority of this task from High to Medium.
Volans edited projects, added SRE; removed conftool.
Volans updated the task description. (Show Details)