httpbb shouldn't alert when large pages are occasionally slow
Closed, Resolved (Public)

Description

I noticed that every now and then the httpbb_hourly_appserver.service unit fails on cumin2002 with "Read timed out. (read timeout=10)" for the test on https://meta.wikimedia.org/wiki/List_of_Wikipedias.

Here is a full log:

Started Run httpbb appserver/ tests hourly on mw2271.codfw.wmnet.
Sending to mw2271.codfw.wmnet...
https://meta.wikimedia.org/wiki/List_of_Wikipedias (/srv/deployment/httpbb-tests/appserver/test_main.yaml:212)
    ERROR: HTTPSConnectionPool(host='mw2271.codfw.wmnet', port=443): Read timed out. (read timeout=10)
===
ERRORS: 124 requests attempted to mw2271.codfw.wmnet. Errors connecting to 1 host.
httpbb_hourly_appserver.service: Main process exited, code=exited, status=1/FAILURE
httpbb_hourly_appserver.service: Failed with result 'exit-code'.
httpbb_hourly_appserver.service: Consumed 2.410s CPU time.

The occurrences in the current journal:

Nov 13 06:38:40
Nov 13 15:38:40
Nov 17 18:40:42
Nov 18 10:36:59
Nov 19 01:37:41
Nov 19 19:37:41
Nov 20 19:38:41
Nov 23 09:03:41
Nov 23 14:03:41

It seems transient, and it seems to happen only in codfw: the same unit on cumin1001 doesn't have any of those read timeout failures.
If the time it takes to generate that page from codfw is expected, then maybe the timeout should be increased a bit.

Event Timeline

Volans triaged this task as Medium priority. Wed, Nov 23, 2:19 PM
Volans created this task.
Restricted Application added a subscriber: Aklapper.

Good find, thanks.

It looks like this page is just a slow parse (in the HTML comment I see "Real time usage: 9.438 seconds"). Usually we get lucky and it's in parsercache, but when we get unlucky the page has to be reparsed and the request times out.
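For anyone who wants to check how close a page's parse time is to the timeout, here is a minimal sketch that pulls that figure out of the page source. It assumes only that the HTML contains a line of the form quoted above; the function name `parse_real_time_usage` and the sample HTML are made up for illustration and are not part of httpbb.

```python
import re
from typing import Optional

# Matches the figure quoted above, e.g. "Real time usage: 9.438 seconds".
_REAL_TIME = re.compile(r"Real time usage:\s*([0-9]+(?:\.[0-9]+)?)\s*seconds")


def parse_real_time_usage(html: str) -> Optional[float]:
    """Return the reported parse time in seconds, or None if it is absent."""
    match = _REAL_TIME.search(html)
    return float(match.group(1)) if match else None


# Hypothetical page source; only the "Real time usage" line comes from this task.
sample_html = """<html><body>...</body></html>
<!--
Real time usage: 9.438 seconds
-->"""

READ_TIMEOUT_SECONDS = 10.0  # the read timeout shown in the log above

parse_time = parse_real_time_usage(sample_html)
if parse_time is not None:
    headroom = READ_TIMEOUT_SECONDS - parse_time
    print(f"parse took {parse_time:.3f}s, headroom before timeout: {headroom:.3f}s")
```

With well under a second of headroom, any extra latency on an uncached request is enough to cross the 10-second limit.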

We actually looked at this previously in T289202#7310903 -- back then, MediaWiki was fully active/passive, and the timeouts were only in the passive DC. We fixed it by just running httpbb in the active DC only.

When we started serving reads from both DCs, we restored the other httpbb job as well, but apparently this is a thing again, even in the active/active world.

I'll be out for the U.S. long weekend, but next week I'll look into some options including adding a retry, or the "run twice as often and alert only on two consecutive failures" strategy mentioned at T289202#7812998, which would also help us out here. I'm hesitant to just bump the deadline, but it's definitely worth considering as the quickest solution.
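To make the two options concrete, here is a rough sketch of both, written independently of httpbb's actual code. The names `fetch_with_retry` and `should_alert` are hypothetical, and the sketch catches Python's built-in `TimeoutError`; a real HTTP client would need to catch its own library's timeout exception instead.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def fetch_with_retry(fetch: Callable[[], T], attempts: int = 2) -> T:
    """Call fetch() up to `attempts` times, retrying only on TimeoutError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts:
                raise  # out of retries: let the caller see the timeout
    raise AssertionError("unreachable")


def should_alert(results: Sequence[bool], consecutive: int = 2) -> bool:
    """Decide whether to alert, given oldest-first run outcomes (True = pass).

    Alerts only when the most recent `consecutive` runs all failed.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(results) < consecutive:
        return False
    return not any(results[-consecutive:])
```

The retry hides a single slow response within one run, while the consecutive-failure rule tolerates a whole failed run as long as the next one passes. Either would mask an occasional uncached parse, but both also delay detection of a real outage slightly.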

RLazarus renamed this task from "httpbb random read timeout on cumin2002" to "httpbb shouldn't alert when large pages are occasionally slow". Thu, Nov 24, 1:13 AM

Change 860136 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] httpbb: Bump the timeout for meta:List_of_Wikipedias, at least for now

https://gerrit.wikimedia.org/r/860136

Change 860136 merged by RLazarus:

[operations/puppet@production] httpbb: Bump the timeout for meta:List_of_Wikipedias, at least for now

https://gerrit.wikimedia.org/r/860136

Changed my mind on this -- still going to look into other solutions, but I did bump the deadline to 60s so that it doesn't spuriously alert in the meantime.

Maybe if the page we're trying to fetch is that cumbersome, we should switch to a different, lighter one?

> Changed my mind on this -- still going to look into other solutions, but I did bump the deadline to 60s so that it doesn't spuriously alert in the meantime.

Thanks for the quick workaround.

Change 861497 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] httpbb: Replace URL for metawiki test

https://gerrit.wikimedia.org/r/861497

Change 861497 merged by RLazarus:

[operations/puppet@production] httpbb: Replace URL for metawiki test

https://gerrit.wikimedia.org/r/861497

> Maybe if the page we're trying to fetch is that cumbersome, we should switch to a different, lighter one?

Yeah, agreed -- this doesn't need to test anything except "Meta URLs are actually routed to Meta." I swapped it to another page and we should be good.

All that other stuff about retries is still worth thinking about someday, but not necessary here.