@Krinkle told me that the WebPageTest jobs block the ci-jessie instances, so they sometimes get shut down manually. Either we get a new instance where that doesn't matter (we don't block others), or we just move the jobs to the crontab of the WebPageTest server (we used that in the beginning). Before we can move to the crontab we need to fix queuing (we sometimes go over our one-hour interval, and we don't want to kill the currently running job; instead we should just queue the next one) and send alerts to IRC if something fails.
During the hackathon, I briefly mentioned to @Krinkle how the web performance jobs consume instances from the small pool of disposable instances (Nodepool).
We can probably set up a dedicated instance solely meant to run those tests. We would just need nodejs installed, as I understand it, and maybe allow up to 4 jobs in parallel on a 2-CPU instance. That is absolutely trivial to put in place.
The job is apparently scheduled hourly and sometimes takes more than an hour to run. I haven't checked, but Jenkins is potentially smart enough to avoid piling up the builds. That can be verified with a job that is scheduled every minute and runs `sleep 120`. Hopefully Jenkins magically aggregates the queued builds.
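The "every minute, sleep 120" experiment could be sketched as a throw-away Jenkins Job Builder job. The job name is made up and this is only a sketch of the idea, assuming JJB as used in integration/config:

```yaml
# Hypothetical throw-away job to check whether Jenkins piles up
# builds when a run outlasts the next scheduled trigger.
- job:
    name: throttle-sanity-check   # made-up name
    node: WebPerformance          # slave label from this task
    concurrent: false             # only one build of this job at a time
    triggers:
      - timed: '* * * * *'        # fire every minute
    builders:
      - shell: 'sleep 120'        # each build outlasts the interval
```

As far as I know, Jenkins collapses duplicate queue entries for the same non-concurrent job, so triggers that fire during a long build should coalesce into a single queued build rather than piling up; the experiment above would confirm that.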
Finally, the jobs seem like good candidates for conversion to matrix jobs. One defines a set of variables and their values in Jenkins; Jenkins then composes jobs dynamically based on the combinations of those parameters. The jobs end up running for a shorter time. For the daily browser tests, we even have Jenkins read the values from a YAML file in the source repository, so the job no longer has to be changed when new parameters are added :-]
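For the YAML-driven variant, the file in the source repository could look something like this (the file name, keys, and values here are purely illustrative, not the actual browser-tests file):

```yaml
# Hypothetical matrix-parameter file; Jenkins would generate one
# (shorter) build per combination of these axes.
browsers:
  - chrome
  - firefox
pages:
  - Main_Page
  - Barack_Obama
```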
Mentioned in SAL (#wikimedia-releng) [2017-07-24T15:03:42Z] <hashar> Added webperformance Jenkins slave https://integration.wikimedia.org/ci/computer/webperformance/ with a single executor - T166756
webperformance:~$ nodejs --version
v6.11.0
webperformance:~$ npm -version
2.15.2
Pooled in Jenkins with ONE build executor (feel free to bump to 2 or more) https://integration.wikimedia.org/ci/computer/webperformance/
The Jenkins slave has the label WebPerformance.
I'm not sure what happened, but it seems those packages aren't actually (properly) installed on the slave in question. The job is now failing:
00:00:07.756 npm: command not found
/usr/bin/env: node: No such file or directory
# Using `println ":command".execute().text`
> which npm
""
> which node
""
> which nodejs
""
> which env
"/usr/bin/env"
Running uname -a shows that the host self-identifies as saucelabs-03, not as webperformance. Checking the slave configuration in Jenkins does show the correct IP (verified manually via horizon.wikimedia.org; it matches the IP used by integration/webperformance, not integration/saucelabs-03).
Presumably the slave config in Jenkins was copied from saucelabs-03 and then the IP changed. Maybe it still had a connection from before, so I restarted the slave agent, but it was unable to launch: https://integration.wikimedia.org/ci/computer/webperformance/log
[07/25/17 01:46:05] [SSH] Opening SSH connection to 10.68.20.166:22.
[07/25/17 01:46:06] [SSH] WARNING: The SSH key for this host is not currently trusted. Connections will be denied until this new key is authorised.
Key exchange was not finished, connection is closed.
Figured it out. The problem was with the "Host Key Verification Strategy". The setting most slaves use now is "Manually trusted key". And it seems it still remembered the copied key from saucelabs-03. I noticed a "Trust Host Key" button appearing in the sidebar for the webperformance slave (never seen that before). After pressing that, the connection worked.
@Krinkle regarding the bad slave being connected, that is entirely my fault. I created the new slave in Jenkins by copying saucelabs02, and apparently Jenkins connected to that host and kept that host's SSH key. That sounds like a bug in Jenkins.
Eventually the webperformance slave got disconnected and reconnected to the proper IP address. But then the SSH host key mismatched and Jenkins refused to pool the slave.
I guess it is something to keep in mind for later :-( Sorry for the mess up.
@hashar That's good to hear, but for now one instance is enough for us. All the Jenkins jobs do is:
- Submit tests to the (external) WebPageTest agent.
- Wait for the results.
- Submit metrics to Graphite, exit build with proper exit code based on results, and notify over IRC.
The majority of the build time is a 10-50 minute idle loop waiting for the external results. That can happen from multiple jobs in parallel on a single instance without issue :)
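The submit-and-poll pattern above can be sketched roughly like this. The function names and statuses are illustrative, not the actual job's code or the real WebPageTest API:

```javascript
// Minimal sketch of the idle loop described above.
// fetchResult stands in for the real "ask the WebPageTest agent
// for this test's result" call; it is hypothetical.

function waitFor(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Poll until the external agent reports a finished test, then return
// the result so metrics can be pushed to Graphite and the build can
// exit with the proper exit code.
async function pollResult(fetchResult, intervalMs, maxAttempts) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await fetchResult();
    if (result.status === 'complete') {
      return result;
    }
    await waitFor(intervalMs); // most of the build time is spent here
  }
  throw new Error('Timed out waiting for WebPageTest results');
}
```

Since the loop is almost entirely idle waiting, many such builds can share one instance without contending for CPU.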