As mentioned in the ops meeting today and requested by Giuseppe: reinstall the OCG servers, one by one, with a new partitioning scheme that uses LVM and leaves some free extents.
Also see: T134773#2277591 (comment by hashar about ImageMagick)
Status | Subtype | Assigned | Task
---|---|---|---
Declined | | None | T130591 Increase size of root partition on ocg* servers
Declined | | None | T84723 reinstall OCG servers
Resolved | | Dzahn | T90839 pybal issue?
Resolved | | cscott | T120077 Implement flag to tell an OCG machine not to take new tasks from the redis task queue
Resolved | | Joe | T120078 OCG checks should be CRITICAL when reading from the server times out
Resolved | | cscott | T120079 The OCG cleanup cache script doesn't work properly
Declined | | None | T135034 make ocg role work on labs instances (install deployment-pdf instance with jessie)
I basically gave up on this task after trying, and asked for help, so simply re-assigning it to me may not be the most effective choice.
It's possible that nobody knows how to decommission a host.
I read through the source (lib/threads/frontend.js) and it looked to me like the only way ocg1003 should be pulling jobs is if someone is hitting it explicitly with render requests on its web API (see handleRender()). If pybal is working, nobody should be sending it render jobs, although it may still be running gc jobs on a timer.
BUT I just noticed that the backend thread (lib/threads/backend.js) is configured to fetch jobs from the redis queue in a loop, using the redis blpop command.
So I guess we need to add some code on each iteration of that loop to check whether the service should do a safe shutdown, and if so do so. What's the standard way to do that? Trap SIGTERM and set a flag?
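Something along these lines might do it — a minimal sketch only, assuming a node_redis client; the queue key and the processJob() helper here are hypothetical, and this is not the actual lib/threads/backend.js code:

```
// Hypothetical sketch of a SIGTERM-aware blpop loop, not the real OCG backend.
var redis = require('redis');
var client = redis.createClient();

var shuttingDown = false;
process.on('SIGTERM', function () {
    // Just set a flag; the loop below checks it before asking redis for more work.
    shuttingDown = true;
});

function processJob(job, done) {
    // Stand-in for the real render-job handler.
    console.log('processing', job);
    done();
}

function fetchNextJob() {
    if (shuttingDown) {
        // Stop taking new jobs and close the redis connection cleanly.
        client.quit();
        return;
    }
    // Block for up to 5 seconds waiting for a job, then re-check the flag and loop.
    client.blpop('render_job_queue', 5, function (err, reply) {
        if (!err && reply) {
            // reply is [listName, value]
            processJob(reply[1], fetchNextJob);
        } else {
            fetchNextJob();
        }
    });
}

fetchNextJob();
```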
It should also be safe just to sudo service ocg stop on ocg1003. That only risks orphaning a single job in progress, which is probably what we've been doing every time we've restarted the service to date (sigh). There does seem to be a SIGINT handler present, but it doesn't appear to take any care to do a clean shutdown.
Filed T105372 to implement clean shut down.
In the interim, just service ocg stop when load seems low, and we'll hope the (at most one) affected user whose render job hangs doesn't get too upset with us.
@cscott thank you very much for the additional comments and clarification. I'll move forward with this.
"Unbreak Now!" priority ("needs to be fixed immediately, setting anything else aside") for six weeks.
@Dzahn: Any news here / is the priority correct?
@Dzahn ocg1003 is in 'decommission mode', where it would respond to front-end requests for cached files, but won't start any new backend jobs. I've also confirmed that the cache doesn't have any remaining files stored on ocg1003, and I'm pretty sure @Joe removed it from the front end round-robin pool. I confirmed via ganglia that it doesn't seem to be using any CPU or network any more (except for a brief spike of network to redis when I was clearing the cache). So it should be okay to shut down now, although it hasn't actually been shutdown yet.
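For reference, 'decommission mode' is just the host's OCG config; based on the config diff quoted later in this task, the relevant lines in /etc/ocg/mw-ocg-service.js look roughly like this (a sketch of my reading, not verified against the source):

```
// Sketch based on the ocg1003 config diff later in this task: the host keeps
// answering front-end requests, but listing itself under decommission means
// it should not pick up new backend jobs from the queue.
config.coordinator.hostname = "ocg1003.eqiad.wmnet";
config.coordinator.decommission = "ocg1003.eqiad.wmnet";
```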
Change 288049 had a related patch set uploaded (by Dzahn):
install_server/ocg: let ocg1003 use jessie installer
Change 288053 had a related patch set uploaded (by Dzahn):
install_server/ocg: let ocg1003 use raid1-lvm partman
Mentioned in SAL [2016-05-10T20:24:35Z] <mutante> scheduled icinga downtime for ocg1003 and all services on it, rebooting to PXE (T84723)
I got the jessie installer and it finished, but then the system doesn't detect the disk/controller and I am dropped to a BusyBox prompt.
14:01 <mutante> like you had on another server recently
14:01 <mutante> and the last error is:
14:01 <mutante> mdadm: No devices listed in conf file were found. Volume group "ocg1003-vg" not found Skipping volume group ocg1003-vg
14:01 <mutante> Unable to find LVM volume ocg1003-vg/swap
14:19 <papaul> 1KWKJ
14:19 <papaul> 2
14:19 <papaul> HARD DRIVE, 500GB, EXPANDABLE SYSTEM, 7.2, 3.5, W-SU, E/C
14:24 <mutante> +----------------------------------------------------------------------------+
14:24 <mutante> |*Debian GNU/Linux, with Linux 4.4.0-1-amd64 |
14:24 <mutante> | Debian GNU/Linux, with Linux 4.4.0-1-amd64 (recovery mode) |
14:24 <mutante> | Debian GNU/Linux, with Linux 3.16.0-4-amd64 |
14:24 <mutante> | Debian GNU/Linux, with Linux 3.16.0-4-amd64 (recovery mode) |
14:25 <mutante> - Check rootdelay= (did the system wait long eno[ 34.725095] uhci_hcd: USB Universal Host Controller Interface driver
14:25 <mutante> ugh?)
14:25 <mutante> - Check root= (did the[ 34.735485] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
14:25 <mutante> system wait for the right device?)
14:25 <mutante> - Missing modules (cat /proc/modules; ls /dev)
14:25 <mutante> ALERT! /dev/disk/by-uuid/eb1384fc-afcd-404f-a16f-a7e52abd3ac0 does not exist. Dropping to a shell!
14:25 <mutante> modprobe: module ehci-orion not found in modules.dep
14:25 <mutante> BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
14:25 <mutante> Enter 'help' for a list of built-in commands.
14:25 <mutante> /bin/sh: can't access tty; job control turned off
14:25 <mutante> (initramfs)
14:26 <mutante> should i try booting with the old kernel ?
14:27 <mutante> to confirm that it detects the disk then?
14:27 <mutante> is this like a problem you had on others?
14:27 <papaul> yes
14:27 <papaul> the install complete for the problem is withing the kernel
14:28 <mutante> i am trying to boot again, this time 3.x kernel
14:30 <mutante> Loading 3.16-0-4-amd64
14:30 <mutante> Loading initial ramdisk ...
14:30 <mutante> [ 2.149082] i8042: No controller found
14:30 <mutante> Loading, please wait...
14:30 <mutante> mdadm: No devices listed in conf file were found. Volume group "ocg1003-vg" not found Skipping volume group ocg1003-vg
14:30 <mutante> Unable to find LVM volume ocg1003-vg/swap
Thank you very much @Papaul for fixing that.
I could continue with the install. Re-added to puppet, signed the new cert, added a new salt key, etc.
Initial puppet run done; user accounts have been created. @cscott you should be able to log in again and check it out.
the root partition is now: /dev/md0 46G 4.9G 39G 12% /
Next we are running into some puppet issues due to the distro change.
Package ttf-devanagari-fonts is not available
etc.
This is similar to the MW font package changes.
This is probably what Hashar said earlier and what I linked in the task description.
Error: /Stage[main]/Ocg/File[/srv/deployment/ocg/log]/group: change from root to syslog failed: Could not find group syslog
needs systemd unit files
Change 288112 had a related patch set uploaded (by Dzahn):
ocg: make it work on systemd
Change 288132 had a related patch set uploaded (by Dzahn):
ocg: install the right font packages on jessie
Change 288139 had a related patch set uploaded (by Dzahn):
ocg: don't try to use syslog group on jessie
Change 288142 had a related patch set uploaded (by Dzahn):
ocg: set correct ImageMagick conf dir on jessie
NOW: "Notice: /Stage[main]/Ocg/Service[ocg]/ensure: ensure changed 'stopped' to 'running'
We still see some failures in puppet output related to apparmor but this part is fixed :)
We are now down to this remaining issue:
May 11 01:20:23 ocg1003 apparmor[11585]: Starting AppArmor profiles:AppArmor not available as kernel LSM..
Could not open configuration file /etc/ocg/mw-ocg-service.js! Error: Module did not self-register.
i added "apparmor=1 security=apparmor" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and rebooted the server. issue is fixed.
@Joe @cscott see all the comments above and now :)
[ocg1003:~] $ puppet agent -tv
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for ocg1003.eqiad.wmnet
Info: Applying configuration version '1462930343'
Notice: /Stage[main]/Ocg/Package[libjpeg-progs]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Ocg/Service[ocg]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ocg/Service[ocg]: Unscheduling refresh on Service[ocg]
Notice: Finished catalog run in 16.59 seconds

root@ocg1003:~# systemctl status apparmor
● apparmor.service - LSB: AppArmor initialization
   Loaded: loaded (/etc/init.d/apparmor)

-- Unit ocg.service has begun starting up.
May 11 01:41:21 ocg1003 systemd[1]: Started MediaWiki Collection Offline Content Generator.
-- Subject: Unit ocg.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ocg.service has finished starting up.
--
-- The start-up result is done.
May 11 01:41:21 ocg1003 nodejs-ocg[2704]: Could not open configuration file /etc/ocg/mw-ocg-service.js! Error: Module did not self-register.
May 11 01:41:21 ocg1003 systemd[1]: ocg.service: main process exited, code=exited, status=1/FAILURE
May 11 01:41:21 ocg1003 systemd[1]: Unit ocg.service entered failed state.
....
Maybe we should be doing this on a labs machine first so I can log in and take a look at things. I don't have root on the production machines, so poking around is hard.
It *seems* that the "Error: Module did not self-register" is actually coming from node, and maybe it's complaining that the /etc/ocg/mw-ocg-service.js file is malformed? Could you take a look at that and compare it to the one on ocg1001/2?
This error happens whenever you have binary dependencies that have been compiled with a node version different than the one you're executing.
the diff is just this:
14,16c14
< config.coordinator.hostname = "ocg1003.eqiad.wmnet";
<
< config.coordinator.decommission = "ocg1003.eqiad.wmnet";
---
> config.coordinator.hostname = "ocg1001.eqiad.wmnet";
What Marko said; I googled around a bit and this sounds like the binary has to be recompiled for the newer node version. Ack.
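For reference, a quick way to check which native-addon ABI the running node expects — a hypothetical one-liner, not anything from the OCG repo:

```
// Print the node version and its native-addon ABI ("modules") version; a
// compiled dependency built against a different ABI fails to load with
// "Module did not self-register." until it is rebuilt for this node.
console.log('node', process.version, 'ABI', process.versions.modules);
```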
Re-installed ocg1003 with trusty for now, so that it can be used until we have a package for jessie.
All icinga services are green again, incl. ocg itself
OK: ocg_job_status 397469 msg: ocg_render_job_queue 0 msg
We still keep the progress that the partitioning is now better (T130591): larger /, RAID, LVM...
I have not re-pooled it yet.
Change 347781 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001
Change 347781 merged by Dzahn:
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001
Mentioned in SAL (#wikimedia-operations) [2017-04-11T22:36:34Z] <mutante> ocg1003 started picking up jobs (mw-ocg-latexer) after it was enabled with gerrit:347781, ocg1001 was disabled in the same change. Also ganglia graphs confirm it. T84723 T161158