
deploy new snapshot hosts in eqiad
Closed, ResolvedPublic

Description

These all have first puppet run done but no roles applied. Set them up for use before April dump run.

Event Timeline

I wondered why I hadn't put my prep work anywhere and that was because I had no ticket to put it on.

The production dump scripts all pass flake8 and do reasonably well under pylint; next up is to get those scripts (but not the rest of the dirs) into master.

The new snapshot hosts all have the dumps prod scripts and config/dblists/etc deployed, and cron is set to run at the appropriate time. Still to go: move the monitoring script, and move all misc cron jobs off to one of these hosts. Also need to make sure the cron job actually starts up properly tomorrow.
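For reference, the cron entry would look something like this. This is a hypothetical sketch only: the real schedule, user, and script path are managed by puppet and aren't shown in this task.

```
# Hypothetical system crontab fragment; actual values live in puppet.
# m  h  dom mon dow  user      command
0 4 * * *  datasets  python /srv/dumps/worker.py --configfile /etc/dumps/wikidump.conf
```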

relevant changesets: https://gerrit.wikimedia.org/r/281530 plus minor fixups following.

aaand I get to redo all of this tomorrow, because these hosts were installed with jessie, which mw has never run on and doesn't even have all of its packages built for. So tomorrow will be a reinstall with trusty and then a puppet run with the manifests already in place. Definitely bedtime.

Interestingly enough, some kind soul has built and pushed hhvm 3.12 for jessie about half an hour ago. Still missing are php-luasandbox and php5-fss. But it does make me think again about the re-install....

HHVM extensions not ready yet, and with the other dependencies I'm going to back off to trusty and look at the state of things again before next month's run.

Very bad news: the servers do not install with trusty. NO disk is detected, and the driver that runs on jessie (hpsa) doesn't find them when selected out of the menu. Neither does cciss, which I tried on the basis of a random Google search as a fallback.

This system is supposed to be certified for trusty (http://www.ubuntu.com/certification/hardware/201409-15509/), and the disk controller is supposedly verified by HP for trusty as well (http://www8.hp.com/h20195/v2/getpdf.aspx/c04346310.pdf?ver=2).

So it seems that these ONLY install with jessie unless HW raid is enabled. A close look at the Canonical certification shows that the system was certified with 'HP LOGICAL VOLUME', which is their way of saying they had hw raid enabled. snapshot1005 is now set up with hw raid1 and a different partman recipe, and the install went smoothly. Running hpsa.
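The installer-side change can be sketched roughly as below. This is a hypothetical d-i preseed fragment, not the actual WMF partman recipe: the point is only that with the controller's hardware raid1 enabled, the OS sees a single logical disk, so no software-raid (partman-auto-raid) setup is needed.

```
# Hypothetical preseed sketch, assuming hw raid1 is configured on the
# Smart Array controller, which then presents one logical disk to the OS.
d-i partman-auto/method string lvm
d-i partman-auto/disk string /dev/sda
```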

Chris won't be able to set raid on the other two until tomorrow morning his time (requires physical presence).

operations/dumps/scap repo requested for scap config files.

snapshot1005 will kick off the monthly run with enwiki tomorrow and I'll be checking it. The other two new hosts are due to come online later tomorrow and will join in with dumps of regular wikis the next day. I'll be watching them too and updating here.

Note that a test run on snapshot1005 of worker.py for bewikibooks (with a test output directory, so no, you don't get your dumps ahead of time :-P) ran properly and the output looked good.

All three hosts are deployed. I'll see tonight if the one cron job starts properly; if it does I'll add the other two as cron runners and they'll pick up tomorrow morning.

The dump run is proceeding nicely, with one glitch from hhvm, which occasionally spits out a log message ('Lost parent', etc.), causing the dump jobs to believe there is an error. In fact the jobs run fine to completion.
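One way to work around this kind of glitch is to whitelist known-benign stderr chatter before deciding a job failed. The sketch below is purely illustrative (the function names and patterns are assumptions, not the actual dumps code):

```python
# Hypothetical sketch: ignore known-benign HHVM stderr noise (such as the
# occasional 'Lost parent' message) when judging whether a dump job failed.
BENIGN_PATTERNS = ("Lost parent",)


def real_errors(stderr_lines):
    """Return only the stderr lines that indicate a genuine failure."""
    return [
        line for line in stderr_lines
        if line.strip() and not any(p in line for p in BENIGN_PATTERNS)
    ]


def job_failed(returncode, stderr_lines):
    """A job failed only if it exited nonzero or produced real errors."""
    return returncode != 0 or bool(real_errors(stderr_lines))
```

With this, a run that exits 0 but logs only 'Lost parent' noise is treated as successful, while a nonzero exit or any other stderr output still counts as a failure.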

My, it's been a while since I commented here. The three hosts are in production, but without hhvm (see blocking subtask). The misc cron jobs have yet to move over.

Monitor job moved off of snapshot1004 (old host) to snapshot1007 (new host): https://gerrit.wikimedia.org/r/#/c/300006/

This is now complete; old hosts snapshot1001 through 1004 are unused.