
MobileFrontend Chrome browser test job has become unstable
Closed, Resolved (Public)

Description

There have been lots of false positives/random failures on the MobileFrontend browser test job since 12th June. The tests that fail are not consistent:
https://integration.wikimedia.org/ci/view/Reading-Web/job/selenium-MobileFrontend/

This job is very important to us, so the added noise gives us great concern.

Has anything changed in the stack this week?

Event Timeline

@Jdlrobson - lots of tests fail with "The Sauce VMs failed to start the browser or device. For more info, please check https://wiki.saucelabs.com/display/DOCS/Common+Error+Messages". This is already tracked in T152963.

As far as I know, nothing has changed recently. @hashar could know more.

I took a look at the last 3 failed builds:

  • 453 has 2 unexpected HTTP response (500) (MediawikiApi::HttpError) failures
  • 454 has 2 Sauce could not start your job. For more information on what happened, please visit https://saucelabs.com/jobs/... (Selenium::WebDriver::Error::UnknownError) failures
  • 455 has 1 unexpected HTTP response (503) (MediawikiApi::HttpError) failure

I do not think anything special is happening. Selenium::WebDriver::Error::UnknownError is tracked as T152963, and MediawikiApi::HttpError might be just a temporary problem with the beta cluster.
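For triaging builds like these, a rough sketch of how the failure messages could be bucketed into known infrastructure causes. The category names and the `classify` / `summarize` helpers are hypothetical; the patterns are taken only from the messages quoted in this task, not from any existing tooling.

```python
import re
from collections import Counter

# Hypothetical categories, built only from the error messages quoted in this task.
PATTERNS = [
    # Sauce Labs failed to start the browser/VM (tracked in T152963).
    ("sauce", re.compile(r"Sauce could not start your job|The Sauce VMs failed to start")),
    # The target wiki answered the API request with a 5xx status.
    ("http_5xx", re.compile(r"unexpected HTTP response \(5\d\d\)")),
]


def classify(message):
    """Return the first matching category name, or "other" if none match."""
    for name, pattern in PATTERNS:
        if pattern.search(message):
            return name
    return "other"


def summarize(messages):
    """Count how many failure messages fall into each category."""
    return Counter(classify(m) for m in messages)
```

Anything that lands in "other" would be the part worth reading by hand, since it is not one of the already known infrastructure problems.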

unexpected HTTP response (503) (MediawikiApi::HttpError) means the target wiki has returned a 5xx server error. So most probably an issue on the beta cluster itself? Unfortunately mediawiki_selenium / mediawiki_api do not show the URL :(

By looking at the time the error occurred, one can potentially find the error in logstash on https://logstash-beta.wmflabs.org/app/kibana.
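A small sketch of turning a failure time into a time range to paste into the Kibana time picker. The `search_window` helper, the default margins, and the example timestamp are all assumptions for illustration; nothing here talks to logstash itself.

```python
from datetime import datetime, timedelta, timezone


def search_window(failure_time, before_minutes=5, after_minutes=1):
    """Return (start, end) ISO 8601 strings surrounding a test failure.

    The margins are arbitrary defaults: the server-side error is logged
    at or slightly before the moment the test reports the failure.
    """
    start = failure_time - timedelta(minutes=before_minutes)
    end = failure_time + timedelta(minutes=after_minutes)
    return start.isoformat(), end.isoformat()


# Hypothetical failure time, not taken from any of the builds in this task.
failure = datetime(2017, 6, 14, 10, 30, tzinfo=timezone.utc)
start, end = search_window(failure)
print(start, end)
```

The failure time itself still has to be read from the Jenkins console output of the failed build.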

Most probably beta had issues?

greg changed the task status from Open to Stalled. Jul 7 2017, 10:55 PM

It's definitely improved. However, the error at https://integration.wikimedia.org/ci/view/Reading-Web/job/selenium-MobileFrontend/BROWSER=chrome,MEDIAWIKI_ENVIRONMENT=beta,PLATFORM=Linux,label=BrowserTests/479/ was not real; it was probably related to a slowdown on the beta cluster. Has anything happened to improve beta cluster stability?

Generally, a single failure due to infrastructure instability for the past week seems pretty decent. Not perfect nor great but it's a virtualized environment :) I wish we could make Beta Cluster 100% stable, but... it can't be and also be a testing environment.

Beta Logstash around that time doesn't show anything obvious: https://logstash-beta.wmflabs.org/goto/c0328da778b655e76c1df8009e3ee82c (for the record, that was the Fatal Monitor view, but I removed the jobrunner noise from ORES...)

Jdlrobson claimed this task.

Good enough! :)