Page MenuHomePhabricator

alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc)
Closed, ResolvedPublic

Description

The issue was actually with tcpircbot on alert1001, not with dbctl (which is merely a client of tcpircbot).

Followups noted in-thread; tasks to be opened for them.

Original report below.


I just realised that when depooling/repooling hosts, dbctl is no longer doing a !log action on irc.
The change does work and the depooling/repooling happens fine.

This is an example of an action that worked but never got looged on irc:

root@cumin1001:/home/marostegui# dbctl instance db1112 depool
root@cumin1001:/home/marostegui# dbctl config commit -m "Depool db1112 for schema change"
eqiad/groupLoadsBySection/DEFAULT live                             eqiad/groupLoadsBySection/DEFAULT generated
{                                                                  {
    "contributions": {
        "db1112": 100
    },
    "dump": {                                                          "dump": {
        "db1166": 100                                                      "db1166": 100
    },                                                                 },
    "logpager": {
        "db1112": 100
    },
    "recentchanges": {
        "db1112": 100
    },
    "recentchangeslinked": {
        "db1112": 100
    },
    "vslow": {                                                         "vslow": {
        "db1166": 100                                                      "db1166": 100
    },
    "watchlist": {
        "db1112": 100
    }                                                                  }
}                                                                  }
eqiad/sectionLoads/DEFAULT live                                    eqiad/sectionLoads/DEFAULT generated
[                                                                  [
    {                                                                  {
        "db1123": 0                                                        "db1123": 0
    },                                                                 },
    {                                                                  {
        "db1112": 400,
        "db1157": 500,                                                     "db1157": 500,
        "db1166": 300,                                                     "db1166": 300,
        "db1175": 500                                                      "db1175": 500
    }                                                                  }
]                                                                  ]

Enter y or yes to confirm: y
Previous configuration saved. To restore it run: dbctl config restore /var/cache/conftool/dbconfig/20210303-063846-marostegui.json
WARNING:conftool.announce:dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P14580 and previous config saved to /var/cache/conftool/dbconfig/20210303-063846-marostegui.json

I manually tested the !log action and it did work: https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&type=revision&diff=1901444&oldid=1901443

[07:41:20]  <marostegui> !log Testing log
[07:41:25]  <+stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log

Looks like the last successful logging from dbctl was this:

Related Objects

Event Timeline

logmsgbot should be running from alert* servers, for some reason it's connected from a WMCS address:

06:51:24  -- | [logmsgbot] (~logmsgbot@nat.openstack.eqiad1.wikimediacloud.org): logmsgbot
06:51:24  -- | [logmsgbot] kornbluth.freenode.net (Frankfurt, Germany)
06:51:24  -- | [logmsgbot] is using a secure connection

It's code doesn't seem to authenticate at all with NickServ

It's code doesn't seem to authenticate at all with NickServ

correction, it does, just using the server password method which I didn't immediately realise

This is an issue that mutante had tried to figure out earlier, IIRC is not connected from cloud, it just appears to be that way due to NAT... (there was some discussion in -operations about it but i forget) it should infact be running on an alert* server therefore, it wouldn't even be possible for it to be on cloud as AFAIK cloud is not allowed to communicate with backend production

If this helps in anyways, I just got this error when pushing:

root@cumin1001:/home/marostegui# dbctl instance db1164 pool -p5
root@cumin1001:/home/marostegui# dbctl config commit -m "Increase weight for db1164 in s1 T258361"
eqiad/groupLoadsBySection/s1 live                                  eqiad/groupLoadsBySection/s1 generated
    "api": {                                                           "api": {
        "db1084": 100,                                                     "db1084": 100,
        "db1119": 100,                                                     "db1119": 100,
        "db1134": 100,                                                     "db1134": 100,
        "db1135": 100,                                                     "db1135": 100,
        "db1164": 3,                                                       "db1164": 5,
        "db1169": 100                                                      "db1169": 100
    },                                                                 },
    "contributions": {                                                 "contributions": {
        "db1099:3311": 100,                                                "db1099:3311": 100,
        "db1105:3311": 100                                                 "db1105:3311": 100
eqiad/sectionLoads/s1 live                                         eqiad/sectionLoads/s1 generated
        "db1118": 325,                                                     "db1118": 325,
        "db1119": 200,                                                     "db1119": 200,
        "db1134": 200,                                                     "db1134": 200,
        "db1135": 200,                                                     "db1135": 200,
        "db1163": 200,                                                     "db1163": 200,
        "db1164": 6,                                                       "db1164": 10,
        "db1169": 200                                                      "db1169": 200
    }                                                                  }
]                                                                  ]

Enter y or yes to confirm: y
Previous configuration saved. To restore it run: dbctl config restore /var/cache/conftool/dbconfig/20210303-070530-marostegui.json
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/conftool/__init__.py", line 56, in emit
    sock.connect(self.addr)
ConnectionRefusedError: [Errno 111] Connection refused
Call stack:
  File "/usr/bin/dbctl", line 11, in <module>
    load_entry_point('conftool==1.3.1', 'console_scripts', 'dbctl')()
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/__init__.py", line 156, in main
    return cli.run_action()
  File "/usr/lib/python3/dist-packages/conftool/extensions/dbconfig/cli.py", line 37, in run_action
    self.irc.warning(result.announce_message)
  File "/usr/lib/python3.7/logging/__init__.py", line 1395, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.7/logging/__init__.py", line 1519, in _log
    self.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1529, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1591, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 905, in handle
    self.emit(record)
  File "/usr/lib/python3/dist-packages/conftool/__init__.py", line 60, in emit
    self.handleError(record)
Message: "dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14583 and previous config saved to /var/cache/conftool/dbconfig/20210303-070530-marostegui.json"
Arguments: ()
WARNING:conftool.announce:dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14583 and previous config saved to /var/cache/conftool/dbconfig/20210303-070530-marostegui.json

The issue is more general: tcpircbot crashes on every invokation in the following way:

Mar 03 07:18:00 alert1001 python[2522]: Traceback (most recent call last):
Mar 03 07:18:00 alert1001 python[2522]:   File "tcpircbot.py", line 147, in <module>
Mar 03 07:18:00 alert1001 python[2522]:     readable, _, _ = select.select([bot.connection.socket] + files, [], [])
Mar 03 07:18:00 alert1001 python[2522]: TypeError: argument must be an int, or have a fileno() method
Mar 03 07:18:00 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Main process exited, code=exited, status=1/FAILURE
Mar 03 07:18:00 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Failed with result 'exit-code'.
Mar 03 07:18:05 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Service RestartSec=5s expired, scheduling restart.
Mar 03 07:18:05 alert1001 systemd[1]: tcpircbot-logmsgbot.service: Scheduled restart job, restart counter is at 432.
Mar 03 07:18:05 alert1001 systemd[1]: Stopped TCP socket to IRC bot: tcpircbot-logmsgbot.
Mar 03 07:18:05 alert1001 systemd[1]: Started TCP socket to IRC bot: tcpircbot-logmsgbot.

Mentioned in SAL (#wikimedia-operations) [2021-03-03T08:32:15Z] <godog> stop/mask tcpircbot-logmsgbot on pontoon-icinga-01 - T276299

We found a few other issues:

  • The nick has no enforce, thus a random instance running in labs is connecting (obviously without password)
  • Nickserv says it saw the user the last time at when the issues started:
NickServ (NickServ@services.): Last seen  : Mar 03 00:10:46 2021 (7h 45m 43s ago)
NickServ (NickServ@services.): User seen  : Mar 03 07:47:30 2021 (8m 59s ago) [this is me ghosting the user]
  • The nick/password still work, I tried ghosting the rogue labs user from an irc client and it works
Joe claimed this task.
Joe triaged this task as High priority.

After more analysis, this is my understanding of the outstanding problems:

  • pontoon was connecting without password as logmsgbot, and given the nick has no enforce on, it will just lie around, causing tcpircbot to be unable to connect
  • this also caused problems with freenode's spam protection so even after ghosting the user, we had issues coming from that
  • any attempt to !log while the bot is reconnecting (it can take up to 2 minutes) will crash the bot.

I have solved the immediate issue, but I guess the points above need to be fixed. But that's matter for another task.

CDanis renamed this task from dbctl not sending !log to IRC to alert1001's tcpircbot down for all internal clients (spicerack, helmfile, dbctl, klaxon, etc).Mar 3 2021, 4:17 PM
CDanis edited projects, added observability; removed conftool.
CDanis updated the task description. (Show Details)
CDanis added a subscriber: lmata.