Page MenuHomePhabricator

httpbb appserver test breaks deployment of the week due to a timeout parsing page
Closed, ResolvedPublic

Description

When doing the backport this morning, I got scap complaing with firstly a 503 error from both mwdebug hosts in codfw:

08:23:56 Executing check 'check_testservers_baremetal'
08:24:46 Check 'check_testservers_baremetal' failed: Sending to 4 hosts...
https://en.wikipedia.org/wiki/Barack_Obama (/srv/deployment/httpbb-tests/appserver/test_main.yaml:15)
  mwdebug2001.codfw.wmnet
    ERROR: HTTPSConnectionPool(host='mwdebug2001.codfw.wmnet', port=443): Read timed out. (read timeout=10)
  mwdebug2002.codfw.wmnet
    ERROR: HTTPSConnectionPool(host='mwdebug2002.codfw.wmnet', port=443): Read timed out. (read timeout=10)
===
ERRORS: 128 requests attempted to each of 4 hosts. Errors connecting to 2 hosts.

Looking in the debug log I see:

Mar 25, 2024 @ 08:24:13.202 mwdebug2001 Parsing Barack Obama was slow, took 12.10 seconds
Mar 25, 2024 @ 08:24:13.837 mwdebug2002 Parsing Barack Obama was slow, took 12.71 seconds

The httpb test has a 10 seconds read timeout but parsing the page took 12 seconds hence the timeout. It might because of a cold cache on the host since the next check did pass.

I don't think we should probe using Barack Obama, I have suggested that page a decade ago as a performance tests for the Parser since had the time it had the most references (<ref>) and that was a good indication at how crowed/bottlenecked our parser worked. I'd suggest changing the URL to a smaller page or alternatively raise the timeout.

Event Timeline

That got followed by a 503 which I haven't found the root cause for:

08:28:14 Executing check 'check_testservers_baremetal'
08:28:45 Check 'check_testservers_baremetal' failed: Sending to 4 hosts...
https://boardgovcom.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_remnant.yaml:43)
  mwdebug2002.codfw.wmnet
    Status code: expected 200, got 503.
    Body: expected to contain 'Board Governance Committee', got '<!DOCTYPE html>\n<html lang="en">\n<meta charset="ut'... (1638 characters total).
===
FAIL: 128 requests sent to each of 4 hosts. 1 request with failed assertions.
Joe subscribed.

I don't think httpbb tests should really break deployment, but rather ask for confirmation to the deployer if they intend to continue in case of issues.

I know this can be overcome using --force but I don't think that's great UX - httpbb fails and you have to re-run scap.

As far as the tests in httpbb, they're fine like they are, they're functional tests of our configuration and they've been crafted with very different goals than checking deployments specifically, so I'm not inclined to modify them.

Removing the serviceops tag as a consequence.

I don't think httpbb tests should really break deployment, but rather ask for confirmation to the deployer if they intend to continue in case of issues.

I know this can be overcome using --force but I don't think that's great UX - httpbb fails and you have to re-run scap.

I haven't pasted it above, but when run interactively, scap does indeed ask the developer whether they want to cancel, retry the check or continue.

As far as the tests in httpbb, they're fine like they are, they're functional tests of our configuration and they've been crafted with very different goals than checking deployments specifically, so I'm not inclined to modify them.

Removing the serviceops tag as a consequence.

I have filed this task specially because one of the test has an issue. The timeout is set to 10 seconds when the article took 12 seconds which was a red herring. My guess is either pick another article or raise the timeout

From the log server, the page routinely takes more than 10 seconds to parse :/

zgrep 'Parsing Barack Obama was slow,' /srv/mw-log/archive/slow-parsoid.log-2024032* | grep -o 'took ..\.'|sort -n|uniq -c
      2 took 12.
      1 took 13.
      2 took 15.
      6 took 16.
     10 took 17.
     10 took 18.
     13 took 19.
      9 took 20.
      8 took 21.
      6 took 22.
      4 took 23.
      9 took 30.
      5 took 31.
      3 took 33.
      1 took 38.

Change #1014425 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] httpbb: raise timeout for Barack Obama

https://gerrit.wikimedia.org/r/1014425

Some context given by @RLazarus from the CR:

At the time we added this test, the Barack Obama page did consistently load within the default timeout, and we wanted a test to make sure that remained true. Being "notoriously slow" is exactly the reason we picked it.
Have we decided it's okay for that page to take longer now? If so, we might as well just delete this test rather than bumping the timeout; there's no other reason to keep it around. If not, we should keep the test and fix it so that it passes.

I did a little maths on the 7 days of Barack Obama parsing on enwiki before the switchover.
I used the data from this logstash query

❯ cat barack.json| jq '[.hits.hits[]._source.time | tonumber] | add / length' 
17.763209876543215
~ 
❯ cat barack.json| jq '[.hits.hits[]._source.time | tonumber] | max'          
33.34
~ 
❯ cat barack.json| jq '[.hits.hits[]._source.time | tonumber] | min'          
6.46

I also went back a little further in time (between 4 months and 3 months ago) to see if it could be a result of moving more traffic to k8s, or the migration of parsoid to mw-onk8s, and the results were actually worse:

~ 
❯ cat barack2.json| jq '[.hits.hits[]._source.time | tonumber] | add / length' 
20.878611111111113

~ 
❯ cat barack2.json| jq '[.hits.hits[]._source.time | tonumber] | max' 
48.83

~
❯ cat barack2.json| jq '[.hits.hits[]._source.time | tonumber] | min'                                                            
13.3

So we haven't hit that target for a while now.

The reason we're catching it now, and not with the regular httpbb checks being done on the cumin nodes is that the scap httpbb checks are being run without --retry_on_timeout.

The reason we're catching it now, and not with the regular httpbb checks being done on the cumin nodes is that the scap httpbb checks are being run without --retry_on_timeout.

This is a good point. @hashar, if you update scap to use --retry_on_timeout when it runs httpbb, then the test will fail only when the same request exceeds the timeout twice in a row, which is much more likely to indicate a real regression. That's what we use to deflake this in the production environment, when we alert on the same set of tests.

Change #1015071 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] scap: Add retry_on_timeout to scap httpbb checks

https://gerrit.wikimedia.org/r/1015071

Change #1015071 merged by Clément Goubert:

[operations/puppet@production] scap: Add retry_on_timeout to scap httpbb checks

https://gerrit.wikimedia.org/r/1015071

Clement_Goubert claimed this task.

--retry_on_timeout merged and deployed, hopefully this makes deployments easier and closer to the tests we actually want to run.

Change #1014425 abandoned by Hashar:

[operations/puppet@production] httpbb: raise timeout for Barack Obama

Reason:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1015071 went to get httpbb to retry on timeout (`--retry_on_timeout`)

https://gerrit.wikimedia.org/r/1014425

Great, thank you for the patches! -:]

Note, it is still happening

15:22:29 Executing check 'check_testservers_k8s-2_of_2'
15:23:14 Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...
https://transitionteam.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_wikimania_wikimedia.yaml:26)
  mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 503.
    Body: expected to contain 'Transition Team Wiki', got '<!DOCTYPE html>\n<html lang="en">\n<meta charset="ut'... (1953 characters total).
===
FAIL: 145 requests sent to each of 4 hosts. 1 request with failed assertions.

That has happening again when doing the backport this morning. I have filed another task T389169 since I don't know whether that is the same root cause or an issue with the patch I was deploying or something else.