Page MenuHomePhabricator

Frequent "Redis listener crashed; restarting in a few seconds." errors logged
Closed, ResolvedPublicBUG REPORT

Description

Typical event:

2024-03-04T00:10:10Z wikibugs2.irc ERROR: Redis listener crashed; restarting in a few seconds.
Traceback (most recent call last):
  File "/data/project/wikibugs/wikibugs2/src/wikibugs2/irc.py", line 109, in redisrunner
    await redislistener(bot)
  File "/data/project/wikibugs/wikibugs2/src/wikibugs2/irc.py", line 136, in redislistener
    bot.create_task(
  File "/usr/lib/python3.9/asyncio/base_events.py", line 433, in create_task
    task = tasks.Task(coro, loop=self, name=name)
TypeError: a coroutine was expected, got None
024-03-04T00:10:15Z asyncio_redis INFO: Connecting to redis
2024-03-04T00:10:15Z asyncio_redis INFO: Redis connection made
$ grep 'Redis listener crashed; restarting in a few seconds.' irc.log|wc -l
841

That log file covered ~19 hours of activity when grepped.

Details

TitleReferenceAuthorSource BranchDest Branch
irc: Mark Redis2Irc.privmsg_many as asynctoolforge-repos/wikibugs2!9bd808work/bd808/T359097-redismain
Customize query in GitLab

Event Timeline

  • What actually is the TypeError: a coroutine was expected, got None exception telling us about the state of the system?
  • Is connectivity to Redis actually lost this often?
  • Is there some other expected scenario that could explain the frequency of this error?
  • What actually is the TypeError: a coroutine was expected, got None exception telling us about the state of the system?

Here's where the exception comes from, the asyncio.Task constructor:

def __init__(self, coro, *, loop=None, name=None):
    super().__init__(loop=loop)
    if self._source_traceback:
        del self._source_traceback[-1]
    if not coroutines.iscoroutine(coro):
        # raise after Future.__init__(), attrs are required for __del__
        # prevent logging for pending task in __del__
        self._log_destroy_pending = False
        raise TypeError(f"a coroutine was expected, got {coro!r}")

The call that leads to this in wikibugs2.irc.Redis2Irc is:

if useful_info.get("raw"):
    # Send a pre-formatted Gerrit message.
    bot.create_task(
        bot.privmsg_many(
            useful_info["channels"],
            useful_info["msg"],
        ),
    )

bot.privmsg_many is not marked as async, thus it should always fail the coroutines.iscoroutine test. I think this error happens every time that this branch is entered. Now the burning question is if the proper fix is marking bot.privmsg_many as async so that it returns a coroutine, or if instead the bot.create_task wrapper should be removed.

  • Is connectivity to Redis actually lost this often?

Yes, in the past 15 minutes the worker node where the irc task is running on seems to have created 80 Redis connections.

bd808 changed the task status from Open to In Progress.Mar 5 2024, 5:14 PM
bd808 claimed this task.
bd808 moved this task from Backlog to Doing on the Wikibugs board.

Mentioned in SAL (#wikimedia-cloud) [2024-03-05T18:46:36Z] <wmbot~bd808@tools-sgebastion-11> Restarting irc job to pick up fix for T359097

bd808 moved this task from To Do to Done on the User-bd808 board.

Now the burning question is if the proper fix is marking bot.privmsg_many as async so that it returns a coroutine, or if instead the bot.create_task wrapper should be removed.

I went with making it async as that seemed generally to be the intent even if the implementation had never been correct.