WebPageTest is down alert 2022-02-17
Closed, ResolvedPublic

Description

Early this morning an alert fired that WebPageTest was down. Looking at http://wpt.wmftest.org/testlog.php?days=1&filter=&all=on I can see that tests are coming through. But looking in Grafana I can see that tests for some URLs do not work, for example the Barack Obama page on desktop. The result page for that test in WPT is just empty: http://wpt.wmftest.org/result/220217_WK_5B/

I'll dig into the logs.

Event Timeline

I can see that Chrome is timing out for some tests:

[2022-02-17 08:31:27] ERROR: The test for WebPageTest timed out. Is your WebPageTest agent overloaded with work? You can try to increase how long time to wait for tests to finish by configuring --webpagetest.timeout to a higher value (default is 600 and is in seconds).  {"error":{"code":"TIMEOUT","testId":"220217_M5_4M","message":"timeout"}}

But the main page works: http://wpt.wmftest.org/results.php?test=220217_CC_4K

Also, for the tests that work, I can see that the exact same Chrome ran before and after the other tests stopped working.

Looking at the agent, there are a lot of errors that look like this:

Error appending bodies
Traceback (most recent call last):
  File "/home/ubuntu/wptagent/internal/webpagetest.py", line 1339, in get_bodies
    json.loads('"' + data.replace('"', '\\"') + '"')
ValueError: Trailing data

I changed the branch but that didn't help. I remembered https://github.com/WPO-Foundation/wptagent/issues/391 and increased the php.ini post/upload size to 100 MB, but that didn't fix it either. I'm gonna disable getting the bodies for now.

Change 763515 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Do not get the body for Chrome desktop tests.

https://gerrit.wikimedia.org/r/763515

Change 763515 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Do not get the body for Chrome desktop tests.

https://gerrit.wikimedia.org/r/763515

That didn't help.

The log keeps looking like this:

Traceback (most recent call last):
  File "/home/ubuntu/wptagent/internal/webpagetest.py", line 1339, in get_bodies
    json.loads('"' + data.replace('"', '\\"') + '"')
ValueError: Unrecognized escape sequence when decoding 'string'
Error matching requests to bodies
Traceback (most recent call last):
  File "/home/ubuntu/wptagent/internal/webpagetest.py", line 1258, in get_bodies
    with zipfile.ZipFile(bodies_zip, 'r') as zip_file:
  File "/usr/lib/python2.7/zipfile.py", line 779, in __init__
    self.fp = open(file, modeDict[mode])
IOError: [Errno 2] No such file or directory: '/home/ubuntu/wptagent/work/ip-172-31-5-55-172.31.5.55/220217_6R_8A.11.0/11_bodies.zip'
Error appending bodies
Traceback (most recent call last):
  File "/home/ubuntu/wptagent/internal/webpagetest.py", line 1339, in get_bodies
    json.loads('"' + data.replace('"', '\\"') + '"')
ValueError: Unrecognized escape sequence when decoding 'string'
Error appending bodies
Traceback (most recent call last):
  File "/home/ubuntu/wptagent/internal/webpagetest.py", line 1339, in get_bodies
    json.loads('"' + data.replace('"', '\\"') + '"')
ValueError: Unrecognized escape sequence when decoding 'string'
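
The failing line in the traceback escapes double quotes, wraps the body text in quotes and parses the result as a JSON string. Below is a minimal sketch of why that pattern can reject some bodies. It assumes Python 3 and the standard library `json` module, so the error wording differs from the agent log, and `wrap_and_parse` and `classify` are hypothetical helpers, not wptagent functions.

```python
import json


def wrap_and_parse(data):
    """Mimic the quoting pattern from the traceback: escape double quotes,
    wrap the text in quotes, and parse the result as a JSON string."""
    return json.loads('"' + data.replace('"', '\\"') + '"')


def classify(data):
    """Return 'ok' if the wrapped text parses, otherwise the error message."""
    try:
        wrap_and_parse(data)
        return "ok"
    except ValueError as err:  # json.JSONDecodeError subclasses ValueError
        return str(err)


# Illustrative inputs only, not taken from the failing tests.
samples = {
    "plain text with quotes": 'say "hi"',
    "backslash before a letter": "C:\\qux",
    "already escaped quote": 'a\\"b',
    "raw newline": "line one\nline two",
}

for label, text in samples.items():
    print(label, "->", classify(text))

# Letting the encoder do the escaping round-trips the same text losslessly.
assert json.loads(json.dumps("C:\\qux")) == "C:\\qux"
```

Only the first sample parses. A body that contains a backslash, an already escaped quote or a raw control character fails, which is consistent with the errors showing up for some URLs and not others.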

Hmm, I looked at the code in our wptagent checkout and saw that my switch to the apache branch had failed. Now it is switched.

The good news is that we run the correct branch now. The bad news is that the problem is still there for some URLs.

The problem is that I revert to the apache branch, but there's a job that fetches from the release branch, so my changes are overwritten. I think I disabled all the places where the updates happen this time.
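
A small check like the one below could catch the checkout silently moving back to another branch. It is only a sketch: `current_branch` and `check_branch` are hypothetical helpers, the branch name `apache` is assumed from the wording above, and the path comes from the tracebacks.

```python
import subprocess


def current_branch(repo_dir, run=subprocess.check_output):
    """Return the name of the branch checked out in repo_dir."""
    out = run(["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd=repo_dir)
    return out.decode("utf-8").strip()


def check_branch(repo_dir, expected, run=subprocess.check_output):
    """Return None if repo_dir is on the expected branch, else a message."""
    actual = current_branch(repo_dir, run=run)
    if actual == expected:
        return None
    return "expected branch %r but found %r in %s" % (expected, actual, repo_dir)


# Example call (assumed path and branch name):
#   problem = check_branch("/home/ubuntu/wptagent", "apache")
#   if problem:
#       print(problem)
```

The `run` parameter exists only so the check can be exercised without a real repository.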

Finally it works again:
http://wpt.wmftest.org/result/220218_AC_4E0/

After switching to the Apache branch, the line:

self.modify_hosts(task, task['dns_override'])

in wptagent/internal/desktop_browser.py broke every test. I've just removed it for now.

I'm gonna turn on getting the body of the HTML/CSS/JS responses again and test that it works too.

Change 763686 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Enable get the response body for WPT.

https://gerrit.wikimedia.org/r/763686

Change 763686 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Enable get the response body for WPT.

https://gerrit.wikimedia.org/r/763686

Change 763688 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Disable getting bodies for WPT.

https://gerrit.wikimedia.org/r/763688

Change 763688 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Disable getting bodies for WPT.

https://gerrit.wikimedia.org/r/763688

Change 763691 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Disable getting response bodies for Firefox WPT.

https://gerrit.wikimedia.org/r/763691

Change 763691 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Disable getting response bodies for Firefox WPT.

https://gerrit.wikimedia.org/r/763691

We still get some errors in the WebPageTest agent log but the tests go through. This is what happened:

  1. The WPT server used the Apache branch and the agent used the master branch, and something was pushed that broke the tests.
  2. I switched to the Apache branch on the agent but there's a job that takes the latest updates from the release branch.
  3. I disabled the job that gets the changes from the release branch.
  4. There's a bug in the Apache branch with dns_override. I removed that code since we don't use it.
  5. The tests come through (but there are still errors in the WPT agent log).

Closing and will update T278164