Where to trigger WebPageTest jobs?
Closed, ResolvedPublic

Description

@Krinkle told me that the WebPageTest jobs block the ci-jessie instances, so they sometimes get shut down manually. Either we get a new dedicated instance where it doesn't matter (so we don't block others), or we just move the jobs back to the crontab of the WebPageTest server (we used that in the beginning). To move them to the crontab we first need to fix queuing (a run sometimes goes over our one-hour interval, and we don't want to kill the current run, just queue the next one) and send alerts to IRC if something fails.
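
For reference, the crontab-with-queueing variant could be as simple as wrapping the run in flock(1), which makes a new run block until the previous one releases the lock instead of killing it. A minimal sketch, assuming hypothetical paths (the IRC alerting on failure would still need a separate notifier):

# Hypothetical crontab entry on the WebPageTest server; flock(1) queues
# the hourly run behind any run that is still in progress.
0 * * * * flock /var/lock/webpagetest.lock /srv/webpagetest/run-tests.sh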

Event Timeline

During the hackathon, I briefly mentioned to @Krinkle how the web performance jobs consume instances from the small pool of disposable instances (Nodepool).

We can probably set up a dedicated instance solely meant to run those tests. We would just need nodejs installed, as I understand it, and could maybe allow up to 4 jobs in parallel on a 2-CPU instance. That is absolutely trivial to put in place.

The job is apparently scheduled hourly and sometimes takes more than an hour to run. I haven't checked, but Jenkins is potentially smart enough to avoid piling up the builds. That can be verified with a job that is scheduled every minute and runs sleep 120 (see the probe sketch below). Hopefully Jenkins magically aggregates the builds.
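
If we want to verify that, a throwaway probe job would do (everything below is hypothetical): set "Build periodically" to every minute and give the job this shell step. If Jenkins coalesces queued builds, at most one build will wait while another runs.

# Shell step for a hypothetical probe job scheduled with: * * * * *
echo "started build ${BUILD_NUMBER} at $(date -u +%FT%TZ)"
sleep 120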

Finally, the jobs seem like good candidates for conversion to matrix jobs. One defines a set of variables and their values in Jenkins; Jenkins then composes builds dynamically for each combination of those parameters, and each build ends up running for a shorter time. For the daily browser tests, we even have Jenkins read the values from a YAML file kept in the source repository, so the job no longer has to be changed when new parameters are added :-]
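
As a rough illustration of the expansion (axes and values below are made up, not our actual configuration), a matrix over two axes yields one shorter build per combination:

# Jenkins would create one build per combination, e.g.
# webpagetest-chrome-desktop, webpagetest-firefox-emulatedMobile, ...
for browser in chrome firefox; do
  for profile in desktop emulatedMobile; do
    echo "combination: browser=${browser} profile=${profile}"
  done
done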

greg subscribed.

Looks like the answer here is "dedicated Jenkins worker" and "have Jenkins deal with the concurrency". Let's JFDI :)

Mentioned in SAL (#wikimedia-releng) [2017-07-24T14:40:11Z] <hashar> Booting integration-webperf instance 2CPU / 2GB RAM / 40G disk. Intended to host webperformance long running jobs . T166756

Change 367411 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: webperf Jenkins slave

https://gerrit.wikimedia.org/r/367411

Mentioned in SAL (#wikimedia-releng) [2017-07-24T14:57:46Z] <hashar> recreating integration-webperf instance as simply "webperformance". Same 2CPU / 2GB RAM / 40G disk - T166756

webperformance:~$ nodejs --version
v6.11.0
webperformance:~$ npm -version
2.15.2

Pooled in Jenkins with ONE build executor (feel free to bump to 2 or more) https://integration.wikimedia.org/ci/computer/webperformance/

The Jenkins slave has the label WebPerformance.

Change 367416 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Move WebPageTest to a dedicated slave

https://gerrit.wikimedia.org/r/367416

Change 367416 merged by jenkins-bot:
[integration/config@master] Move WebPageTest to a dedicated slave

https://gerrit.wikimedia.org/r/367416

I'm not sure what happened, but it seems nodejs and npm aren't actually (properly) installed on the slave in question. The job is now failing:

00:00:07.756
npm: command not found
/usr/bin/env: node: No such file or directory

Confirmed via https://integration.wikimedia.org/ci/computer/webperformance/script

# Using `println ":command".execute().text`

> which npm
""
> which node
""
> which nodejs
""
> which env
"/usr/bin/env"

Running uname -a shows that the host self-identifies as saucelabs-03, not as webperformance. Checking the slave configuration in Jenkins does show the correct IP (verified manually via horizon.wikimedia.org; it matches the IP used by integration/webperformance, not integration/saucelabs-03).

Presumably the slave config in Jenkins was copied from saucelabs-03 and then the IP changed. Maybe it still had a connection from before, so I restarted the slave agent, but it was unable to launch: https://integration.wikimedia.org/ci/computer/webperformance/log

[07/25/17 01:46:05] [SSH] Opening SSH connection to 10.68.20.166:22.
[07/25/17 01:46:06] [SSH] WARNING: The SSH key for this host is not currently trusted. Connections will be denied until this new key is authorised.
Key exchange was not finished, connection is closed.

Figured it out. The problem was with the "Host Key Verification Strategy". The setting most slaves use now is "Manually trusted key", and it seems the slave entry still remembered the key copied from saucelabs-03. I noticed a "Trust Host Key" button appearing in the sidebar for the webperformance slave (never seen that before). After pressing it, the connection worked.
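
For future reference, the host key can also be checked out of band before pressing that button; a sketch using the IP from the log above:

# Print the host keys the node actually offers, then compare them with
# what Jenkins shows on the "Trust Host Key" page before accepting.
ssh-keyscan -t rsa,ecdsa,ed25519 10.68.20.166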

Change 367411 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: webperf Jenkins slave

https://gerrit.wikimedia.org/r/367411

I'll leave this task open given the above patch is still cherry-picked.

@Krinkle regarding the bad slave being connected, that is entirely my fault. I created the new slave in Jenkins by copying saucelabs02, and apparently Jenkins connected to that host and kept that host's ssh key. That sounds like a bug in Jenkins.

Eventually the webperformance slave got disconnected and reconnected to the proper IP address. But then the SSH Host key mismatched and Jenkins refused to pool the slave.

I guess it is something to keep in mind for later :-( Sorry for the mess up.

Change 367411 merged by Dzahn:
[operations/puppet@production] contint: webperformance Jenkins slave

https://gerrit.wikimedia.org/r/367411

I've increased to two build executors through the Jenkins GUI and could see that the two jobs had collided and made one test wait.

Looks good @Peter and I guess later we can add another instance in the loop and run the various tests in parallel.

@hashar That's good to hear, but for now one instance is enough for us. All the Jenkins jobs do is:

  • Submit tests to the (external) WebPageTest agent.
  • Wait for the results.
  • Submit metrics to Graphite, exit build with proper exit code based on results, and notify over IRC.

The majority of the build time is the 10-50 min idle loop waiting for the external results (roughly the shape sketched below). This can happen from multiple jobs in parallel on a single instance without issue :)
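
For the curious, the shape of that loop against the public WebPageTest HTTP API is roughly as follows (the server URL and API key are placeholders, and the real jobs drive this through nodejs tooling rather than curl):

# Submit a test run and extract the test id (placeholder server/key).
WPT=https://wpt.example.org
TEST=$(curl -s "${WPT}/runtest.php?url=https://en.wikipedia.org/wiki/Main_Page&f=json&k=${WPT_KEY}" | jq -r .data.testId)
# Most of the build time is this idle wait for the remote agent.
until curl -s "${WPT}/jsonResult.php?test=${TEST}" | jq -e '.statusCode == 200' >/dev/null; do
  sleep 60
done
# ...then report metrics to Graphite, set the exit code, and notify IRC.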

Indeed, and I confirmed this afternoon that the instance is mostly idle with two jobs running in parallel. Feel free to add more executors to it as you add more jobs to run in parallel.