Page MenuHomePhabricator

wikibugs needs restart almost everyday
Closed, ResolvedPublic

Description

During the last few days (sorry I cannot be more concrete) wikibugs has stopped showing comments from phabricator tasks on IRC (confirmed at least on #wikimedia-operations and DBA)
The only way to get it working has been to manually restart it. After the restart it immediately comes to back and start working normally.

All the time I have done the restart, the output of qstat has showed jobs working, so they were there but apparently not working.
I have not been able to find any error on the error log, apart from this - which happened after a restart:

2019-12-19 06:07:58,792 - wikibugs.wb2-grrrrit - ERROR - Error, probably SSH connection dropped.

Event Timeline

First of all thank you for restarting wikibugs when this happens!

Judging by the behavior, it sounds like the issue is in wb2-phab: the irc bot is still there, but phabricator messages are not coming through.

It is quite strange though: the requests to Phab do seem to come through:

wikibugs.log.2019-12-19-2019-12-19 06:07:37,665 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:38,759 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 159                                wikibugs.log.2019-12-19-2019-12-19 06:07:41,804 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:44,821 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 159                                wikibugs.log.2019-12-19-2019-12-19 06:07:45,928 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:47,033 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:48,134 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 159                                wikibugs.log.2019-12-19-2019-12-19 06:07:49,237 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:50,335 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 159                                wikibugs.log.2019-12-19-2019-12-19 06:07:51,436 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:52,534 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:53,630 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:54,724 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:55,817 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:56,920 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:07:58,020 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 159                                wikibugs.log.2019-12-19-2019-12-19 06:07:59,122 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 159                                wikibugs.log.2019-12-19-2019-12-19 06:08:00,223 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:08:01,327 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:08:01,342 - wikibugs.wb2-phab - INFO - Shutting down                                                                                                      wikibugs.log.2019-12-19:2019-12-19 06:12:36,086 - wikibugs.wb2-phab - INFO - Starting polling cycle                                                                                             wikibugs.log.2019-12-19-2019-12-19 06:12:36,091 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): phabricator.wikimedia.org:443                                             wikibugs.log.2019-12-19-2019-12-19 06:12:36,460 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/conduit.connect HTTP/1.1" 200 None                          wikibugs.log.2019-12-19-2019-12-19 06:12:36,574 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:12:36,680 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:12:37,785 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:12:38,884 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                               wikibugs.log.2019-12-19-2019-12-19 06:12:39,986 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None

However, note that the request sizes are all 'None' or 159. That seems suspicious.

The last reported tasks before that:

2019-12-19 02:18:29,229 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
2019-12-19 02:18:30,336 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 58
2019-12-19 02:18:31,454 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 280
2019-12-19 02:18:31,455 - wikibugs.wb2-phab - DEBUG - Processing {"class": "PhabricatorApplicationTransactionFeedStory", "epoch": 1576721911, "authorPHID": "PHID-USER-fn7qnpccfbitivgtw2rt", "chronologicalKey": "6771969043218032437", "data": {"objectPHID": "PHID-TASK-kmxfdsqkp5wmibd5fzqw", "transactionPHIDs": {"PHID-XACT-TASK-xgt6xm3yn72p2ww": "PHID-XACT-TASK-xgt6xm3yn72p2ww"}}}
[... task retrieval logs ...]
2019-12-19 02:18:33,093 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 58
2019-12-19 02:18:34,207 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
2019-12-19 02:18:35,324 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
[...]
2019-12-19 02:18:47,810 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 159
2019-12-19 02:18:48,920 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 159
2019-12-19 02:18:50,024 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
2019-12-19 02:18:51,139 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None

Note that 'no new info' went from 58 to 159 bytes.

My hunch is that at some point we start sending requests with an invalid 'after' parameter, and because we never get a new value to use, we keep sending that one.

The last working event has timestamp 6771969043218032437. However, running https://phabricator.wikimedia.org/conduit/method/feed.query/ with after=6771969043218032437 returns

{"result":null,"error_code":"ERR-CONDUIT-CORE","error_info":"Cursor \"6771969043218032437\" does not identify a valid object in query \"PhabricatorFeedQuery\"."}

which (probably not coincidentally) is 157 bytes long (content-length=159, probably due to a \r\n).

Not entirely sure how to fix this yet, though. We can probably handle that error and just resynchronize as we would do on start. That does run the risk of missing a few messages but I suppose that is acceptable. The other option would be to keep track of 'old after values', but then we run the risk of repeating the same change over and over again... which really isn't acceptable.

I have had to restart wikibugs today.
The log only showed:

019-12-31 07:11:02,879 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:03,945 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:05,009 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:06,090 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:07,155 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:08,223 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:09,286 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:10,351 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:11,421 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:12,486 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:13,554 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:14,619 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:15,687 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:16,742 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:17,888 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:18,954 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:20,019 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:21,084 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:22,143 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:23,209 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:24,266 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:25,330 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:26,393 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:27,456 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:28,518 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:29,591 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:30,650 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:31,714 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:32,774 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:33,837 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:34,905 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:35,962 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:37,019 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:38,086 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:39,147 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:40,217 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:41,277 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:42,338 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:43,400 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:44,469 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:45,541 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160
2019-12-31 07:11:46,599 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160

FTR, on 2019-12-29 I also restarted wikibugs due to phab comments not showing up on irc. I did not check the logs at the time.

ema triaged this task as Medium priority.Dec 31 2019, 8:06 AM

Not entirely sure how to fix this yet, though. We can probably handle that error and just resynchronize as we would do on start. That does run the risk of missing a few messages but I suppose that is acceptable.

I think that would be a good start, missing a few messages is probably acceptable, considering the fact that right now it misses all of them till a restart is issued. I don't think it would be a big deal
Thanks for the troubleshooting!

Change 561294 had a related patch set uploaded (by Merlijn van Deen; owner: Merlijn van Deen):
[labs/tools/wikibugs2@master] T241109: Throw exception when Phabricator returns an error

https://gerrit.wikimedia.org/r/561294

I have deployed this patchset; it needs some work (formatting etc) and the change should really go upstream to Fab; I'll discuss that with @Legoktm. For now I just want to have the code running so we have some more information on how often we get errors back from the API. Based on that we could either have somewhat extensive error handling, or just let it be handled by the crash-and-restart loop.

Change 561294 merged by jenkins-bot:
[labs/tools/wikibugs2@master] T241109: Throw exception when Phabricator returns an error

https://gerrit.wikimedia.org/r/561294

valhallasw claimed this task.

From what I can see the bot crashes and restarts a few times per day, which -- although not great -- I think is acceptable. About 2/3rd of those are errors retrieving anchors, the rest are about 50-50 503s from the API and cursor errors that would have caused hangs.

Mentioned in SAL (#wikimedia-operations) [2020-02-12T08:38:19Z] <marostegui> Restart wikibugs as it doesn't show phab comments on irc - T241109

I have given this another restart and it started now to show comments again

I have given this another restart and it started now to show comments again

Reopening given that manual restarts are still needed.

Mentioned in SAL (#wikimedia-operations) [2020-02-12T13:21:31Z] <marostegui> Restart wikibugs as phab comments aren't showing up on irc - T241109

I have had to restart it again just now :(

Hmm, odd.

2020-02-12 12:43:01,182 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 58
2020-02-12 12:43:02,311 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 58
2020-02-12 12:43:03,429 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
2020-02-12 12:43:04,547 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
2020-02-12 12:43:05,651 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160

I'm not entirely sure what the 'None' entries are (no response? empty response?). But note that we switch from a 58 byte response to a 160 byte response again, which is consistent with the error response mentioned above. By why doesn't it crash?

This continues until the restart:

2020-02-12 13:21:01,788 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160                                                        2020-02-12 13:21:02,881 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 160                                                        2020-02-12 13:21:03,969 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                                                       2020-02-12 13:21:05,070 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                                                       2020-02-12 13:21:06,166 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                                                       2020-02-12 13:21:07,274 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None                                                       2020-02-12 13:21:08,363 - wikibugs.wb2-phab - INFO - Shutting down
2020-02-12 13:21:56,098 - wikibugs.wb2-phab - INFO - Starting polling cycle
2020-02-12 13:21:56,102 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): phabricator.wikimedia.org:443
2020-02-12 13:21:56,472 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/conduit.connect HTTP/1.1" 200 None
2020-02-12 13:21:56,596 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
2020-02-12 13:21:56,697 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
2020-02-12 13:21:57,811 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 None
2020-02-12 13:21:58,927 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 58
2020-02-12 13:22:00,036 - urllib3.connectionpool - DEBUG - https://phabricator.wikimedia.org:443 "POST /api/feed.query HTTP/1.1" 200 58

after which we're back to 58 bytes.

The task that triggered this is T245000 which is now private -- which also means the chronologicalKey in the feed is no longer valid.

However, from what I can see this is a regular error response:

{"result":null,"error_code":"ERR-CONDUIT-CORE","error_info":"Cursor \"6792539612548286608\" does not identify a valid object in query \"PhabricatorFeedQuery\"."}

which should just result in an exception.

....except that I forgot to update the dependencies! Did that and restarted wb2-phab... I think that should now really fix it.

Thank you again for noticing the broken bot and restarting it :-)