reinstall OCG servers
Closed, Declined (Public)

Description

as mentioned in the ops meeting today and requested by Giuseppe:
reinstall the OCG servers, one by one, with a new partitioning scheme: use LVM
and leave some free extents

also see: T134773#2277591 (comment by hashar about imagemagick)

Details

Reference
rt8839
Related Gerrit Patches:
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001
[operations/puppet@production] ocg: set correct ImageMagick conf dir on jessie
[operations/puppet@production] ocg: don't try to use syslog group on jessie
[operations/puppet@production] ocg: install the right font packages on jessie
[operations/puppet@production] ocg: make it work on systemd
[operations/puppet@production] install_server/ocg: let ocg1003 use raid1-lvm partman
[operations/puppet@production] install_server/ocg: let ocg1003 use jessie installer
[operations/puppet@production] Decommission ocg1003.

Event Timeline

Joe assigned this task to Dzahn. Jul 9 2015, 2:20 PM
Joe raised the priority of this task from Medium to Unbreak Now!.
Dzahn added a comment. Jul 9 2015, 4:27 PM

I basically gave up on this task after trying and asked for help. So just re-assigning it to me may not be the most effective choice.

cscott added a comment. Jul 9 2015, 5:30 PM

It's possible that nobody knows how to decommission a host.

I read through the source (lib/threads/frontend.js) and it looked to me like the only way ocg1003 should be pulling jobs is if someone is hitting it explicitly with render requests on its web API (see handleRender()). If pybal is working, nobody should be sending it render jobs, although it may still be running gc jobs on a timer.

BUT I just noticed that the backend thread (lib/threads/backend.js) is configured to fetch jobs from the redis queue in a loop, using the redis blpop command.

So I guess we need to add some code on each iteration of that loop to check whether the service should do a safe shutdown, and if so do so. What's the standard way to do that? Trap SIGTERM and set a flag?
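The "trap SIGTERM and set a flag" idea could be sketched roughly as follows. This is in Python rather than OCG's Node.js and is not the actual backend code: `run_worker`, `fetch_job`, `handle_job`, `poll_timeout` and `install_handlers` are hypothetical names used only for illustration. It assumes the blocking pop is given a timeout (Redis BLPOP accepts one), so the loop gets a chance to re-check the flag even when the queue is empty.

```python
import signal

# Set by the signal handler, read by the worker loop.
shutdown_requested = False


def request_shutdown(signum, frame):
    """Signal handler: only record the request; the loop acts on it."""
    global shutdown_requested
    shutdown_requested = True


def install_handlers():
    """Register the handler for SIGTERM (call once, from the main thread)."""
    signal.signal(signal.SIGTERM, request_shutdown)


def run_worker(fetch_job, handle_job, poll_timeout=1):
    """Process jobs until a shutdown is requested.

    fetch_job(timeout) is a hypothetical stand-in for a blocking queue pop
    with a timeout; it returns None when nothing arrived in time.
    """
    processed = 0
    while not shutdown_requested:
        job = fetch_job(poll_timeout)
        if job is None:
            continue  # timed out: loop around and re-check the flag
        handle_job(job)  # the job in progress always runs to completion
        processed += 1
    return processed
```

With this shape, a stop request arriving mid-job lets that job finish and the loop then exits before fetching another one, so nothing is orphaned.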

It should also be safe just to sudo service ocg stop on ocg1003. That only risks orphaning a single job in progress, which is probably what we've been doing every time we've restarted the service to date (sigh), although there does seem to be a SIGINT handler present (but the SIGINT handler doesn't seem to take any care to do a clean shutdown).

Filed T105372 to implement clean shut down.

In the interim, just service ocg stop when load seems low and we'll hope the (at most one) affected user whose render job hangs doesn't get too upset with us.

Dzahn added a comment. Aug 5 2015, 11:49 PM

@cscott thank you very much for the additional comments and clarification. i'll move forward with this

"Unbreak Now!" priority ("needs to be fixed immediately, setting anything else aside") for six weeks.

@Dzahn: Any news here / is the priority correct?

Dzahn lowered the priority of this task from Unbreak Now! to High. Aug 22 2015, 1:05 AM
Dzahn changed the task status from Open to Stalled. Dec 10 2015, 10:09 PM

Change 286070 had a related patch set uploaded (by Cscott):
Decommission ocg1003.

https://gerrit.wikimedia.org/r/286070

Change 286070 merged by Giuseppe Lavagetto:
Decommission ocg1003.

https://gerrit.wikimedia.org/r/286070

I believe this is unblocked now.

Dzahn changed the task status from Stalled to Open. May 10 2016, 7:28 PM

@cscott cool, so currently ocg1003 is depooled already?

@Dzahn ocg1003 is in 'decommission mode', where it would respond to front-end requests for cached files, but won't start any new backend jobs. I've also confirmed that the cache doesn't have any remaining files stored on ocg1003, and I'm pretty sure @Joe removed it from the front end round-robin pool. I confirmed via ganglia that it doesn't seem to be using any CPU or network any more (except for a brief spike of network to redis when I was clearing the cache). So it should be okay to shut down now, although it hasn't actually been shut down yet.
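The behaviour described for decommission mode can be modelled roughly like this. It is a toy sketch in Python, not OCG's implementation: the `Host` class and its `serve_cached` and `should_pull_jobs` methods are hypothetical names, and the real service may split these responsibilities differently.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy model of one render host (hypothetical, for illustration only)."""
    name: str
    decommissioned: bool = False
    cache: dict = field(default_factory=dict)

    def serve_cached(self, key):
        """Front end: cached files are still served in decommission mode."""
        return self.cache.get(key)

    def should_pull_jobs(self):
        """Back end: a decommissioned host must not start new jobs."""
        return not self.decommissioned
```

Once the cache holds nothing for a decommissioned host, `serve_cached` has nothing left to return and `should_pull_jobs` is false, which matches the point at which the host is safe to shut down.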

Change 288049 had a related patch set uploaded (by Dzahn):
install_server/ocg: let ocg1003 use jessie installer

https://gerrit.wikimedia.org/r/288049

Change 288053 had a related patch set uploaded (by Dzahn):
install_server/ocg: let ocg1003 use raid1-lvm partman

https://gerrit.wikimedia.org/r/288053

Change 288049 merged by Dzahn:
install_server/ocg: let ocg1003 use jessie installer

https://gerrit.wikimedia.org/r/288049

Change 288053 merged by Dzahn:
install_server/ocg: let ocg1003 use raid1-lvm partman

https://gerrit.wikimedia.org/r/288053

Mentioned in SAL [2016-05-10T20:24:35Z] <mutante> scheduled icinga downtime for ocg1003 and all services on it, rebooting to PXE (T84723)

Dzahn updated the task description. (Show Details) May 10 2016, 9:27 PM
Dzahn added a comment. May 10 2016, 9:31 PM

I got a jessie installer and it finished, but then it doesn't detect the disk/controller and i am prompted with BusyBox.

14:01 <mutante> like you had on another server recently
14:01 <mutante> and the last error is:
14:01 <mutante> mdadm: No devices listed in conf file were found. Volume group "ocg1003-vg" not found Skipping volume group ocg1003-vg
14:01 <mutante> Unable to find LVM volume ocg1003-vg/swap

14:19 <papaul> 1KWKJ
14:19 <papaul> 2
14:19 <papaul> HARD DRIVE, 500GB, EXPANDABLE SYSTEM, 7.2, 3.5, W-SU, E/C

14:24 <mutante> +----------------------------------------------------------------------------+
14:24 <mutante> |*Debian GNU/Linux, with Linux 4.4.0-1-amd64 |
14:24 <mutante> | Debian GNU/Linux, with Linux 4.4.0-1-amd64 (recovery mode) |
14:24 <mutante> | Debian GNU/Linux, with Linux 3.16.0-4-amd64 |
14:24 <mutante> | Debian GNU/Linux, with Linux 3.16.0-4-amd64 (recovery mode) |

14:25 <mutante> - Check rootdelay= (did the system wait long eno[ 34.725095] uhci_hcd: USB Universal Host Controller Interface driver
14:25 <mutante> ugh?)
14:25 <mutante> - Check root= (did the[ 34.735485] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
14:25 <mutante> system wait for the right device?)
14:25 <mutante> - Missing modules (cat /proc/modules; ls /dev)
14:25 <mutante> ALERT! /dev/disk/by-uuid/eb1384fc-afcd-404f-a16f-a7e52abd3ac0 does not exist. Dropping to a shell!
14:25 <mutante> modprobe: module ehci-orion not found in modules.dep
14:25 <mutante> BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
14:25 <mutante> Enter 'help' for a list of built-in commands.
14:25 <mutante> /bin/sh: can't access tty; job control turned off
14:25 <mutante> (initramfs)
14:26 <mutante> should i try booting with the old kernel ?
14:27 <mutante> to confirm that it detects the disk then?
14:27 <mutante> is this like a problem you had on others?
14:27 <papaul> yes
14:27 <papaul> the install complete for the problem is withing the kernel
14:28 <mutante> i am trying to boot again, this time 3.x kernel

14:30 <mutante> Loading 3.16-0-4-amd64
14:30 <mutante> Loading initial ramdisk ...
14:30 <mutante> [ 2.149082] i8042: No controller found
14:30 <mutante> Loading, please wait...
14:30 <mutante> mdadm: No devices listed in conf file were found. Volume group "ocg1003-vg" not found Skipping volume group ocg1003-vg
14:30 <mutante> Unable to find LVM volume ocg1003-vg/swap

Papaul added a subscriber: Papaul. May 10 2016, 10:10 PM

The problem was fixed by inserting the following line into GRUB:
acpi=off irqpoll

Dzahn added a comment. Edited May 10 2016, 10:27 PM

thank you very much @Papaul for fixing that

I could continue with the install. Re-added to puppet, signed new cert, added new salt-key, etc.

Initial puppet run is done and user accounts have been created. @cscott you should be able to log in again and check it out

the root partition is now: /dev/md0 46G 4.9G 39G 12% /

Next we are running into some puppet issues due to the distro change.

  • E: Unable to locate package ttf-indic-fonts-core

Package ttf-devanagari-fonts is not available
etc..

This is similar to the MW font package changes.


  • apparmor.service: Job for apparmor.service failed.

This is probably what Hashar said earlier and i linked in the task description.


  • Error: Could not find group syslog

Error: /Stage[main]/Ocg/File[/srv/deployment/ocg/log]/group: change from root to syslog failed: Could not find group syslog


  • (/Stage[main]/Ocg/Service[ocg]) Provider upstart is not functional on this host

needs systemd unit files

Change 288112 had a related patch set uploaded (by Dzahn):
ocg: make it work on systemd

https://gerrit.wikimedia.org/r/288112

Change 288112 merged by Dzahn:
ocg: make it work on systemd

https://gerrit.wikimedia.org/r/288112

Change 288132 had a related patch set uploaded (by Dzahn):
ocg: install the right font packages on jessie

https://gerrit.wikimedia.org/r/288132

Change 288132 merged by Dzahn:
ocg: install the right font packages on jessie

https://gerrit.wikimedia.org/r/288132

Change 288139 had a related patch set uploaded (by Dzahn):
ocg: don't try to use syslog group on jessie

https://gerrit.wikimedia.org/r/288139

Change 288139 merged by Dzahn:
ocg: don't try to use syslog group on jessie

https://gerrit.wikimedia.org/r/288139

Change 288142 had a related patch set uploaded (by Dzahn):
ocg: set correct ImageMagick conf dir on jessie

https://gerrit.wikimedia.org/r/288142

Change 288142 merged by Dzahn:
ocg: set correct ImageMagick conf dir on jessie

https://gerrit.wikimedia.org/r/288142

Dzahn added a comment. May 11 2016, 1:21 AM

NOW: "Notice: /Stage[main]/Ocg/Service[ocg]/ensure: ensure changed 'stopped' to 'running'"

We still see some failures in puppet output related to apparmor but this part is fixed :)

Dzahn added a comment. Edited May 11 2016, 1:23 AM

We are now down to this remaining issue:

May 11 01:20:23 ocg1003 apparmor[11585]: Starting AppArmor profiles:AppArmor not available as kernel LSM..

Could not open configuration file /etc/ocg/mw-ocg-service.js! Error: Module did not self-register.

Dzahn added a comment. May 11 2016, 1:42 AM

i added "apparmor=1 security=apparmor" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and rebooted the server. issue is fixed.
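For repeating this edit on other hosts, the change to /etc/default/grub could be scripted along these lines. This is only a sketch: `add_kernel_params` is a hypothetical helper, it works on the file's text rather than touching the file itself, and it assumes the value is written in double quotes on a single line. On Debian the new command line takes effect only after update-grub regenerates the GRUB configuration and the host is rebooted.

```python
import re


def add_kernel_params(grub_text, params, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with any missing params appended to key's value.

    Parameters already present are left alone, so running this twice
    gives the same result as running it once.
    """
    pattern = re.compile(
        r'^(%s=")([^"]*)(")[ \t]*$' % re.escape(key), re.MULTILINE
    )

    def merge(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, count = pattern.subn(merge, grub_text)
    if count == 0:
        raise ValueError("%s not found" % key)
    return new_text
```

The same helper would cover the earlier "acpi=off irqpoll" fix by passing those two parameters instead.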

@Joe @cscott see all the comments above and now :)

[ocg1003:~] $ puppet agent -tv
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for ocg1003.eqiad.wmnet
Info: Applying configuration version '1462930343'
Notice: /Stage[main]/Ocg/Package[libjpeg-progs]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Ocg/Service[ocg]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ocg/Service[ocg]: Unscheduling refresh on Service[ocg]
Notice: Finished catalog run in 16.59 seconds

root@ocg1003:~# systemctl status apparmor
● apparmor.service - LSB: AppArmor initialization
   Loaded: loaded (/etc/init.d/apparmor)


-- Unit ocg.service has begun starting up.
May 11 01:41:21 ocg1003 systemd[1]: Started MediaWiki Collection Offline Content Generator.
-- Subject: Unit ocg.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- **Unit ocg.service has finished starting up**.
-- 
-- The start-up result is done.
May 11 01:41:21 ocg1003 nodejs-ocg[2704]: Could not open configuration file /etc/ocg/mw-ocg-service.js! Error: Module did not self-register.
May 11 01:41:21 ocg1003 systemd[1]: ocg.service: main process exited, code=exited, status=1/FAILURE
May 11 01:41:21 ocg1003 systemd[1]: Unit ocg.service entered failed state.

....

Maybe we should be doing this on a labs machine first so I can log in and take a look at things. I don't have root on the production machines, so poking around is hard.

It *seems* that the "Error: Module did not self-register" is actually coming from node, and maybe it's complaining that the /etc/ocg/mw-ocg-service.js file is malformed? Could you take a look at that and compare it to the one on ocg1001/2?

This error happens whenever you have binary dependencies that have been compiled with a node version different than the one you're executing.

Dzahn added a comment. Edited May 11 2016, 4:41 PM

the /etc/ocg/mw-ocg-service.js file is malformed? Could you take a look at that and compare it to the one on ocg1001/2?

the diff is just this:

14,16c14
< 	config.coordinator.hostname = "ocg1003.eqiad.wmnet";
< 
< 	config.coordinator.decommission = "ocg1003.eqiad.wmnet";
---
> 	config.coordinator.hostname = "ocg1001.eqiad.wmnet";

What Marko said, i googled around a bit and this sounds like the binary has to be recompiled for the newer node version. ack.

Dzahn removed Dzahn as the assignee of this task. May 11 2016, 9:19 PM
Dzahn added a comment. May 24 2016, 8:40 PM

re-installed ocg1003 with trusty for now so that it can be used until we have a package for jessie

All icinga services are green again, incl. ocg itself

OK: ocg_job_status 397469 msg: ocg_render_job_queue 0 msg

We still keep the progress on partitioning, which is now better (T130591): larger /, RAID, LVM...

I have not re-pooled it yet.

Pchelolo moved this task from Backlog to watching on the Services board. Oct 12 2016, 7:24 PM
Pchelolo edited projects, added Services (watching); removed Services.
fgiunchedi closed this task as Declined. Nov 29 2016, 11:02 PM
fgiunchedi added a subscriber: fgiunchedi.

We're sunsetting OCG

Change 347781 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001

https://gerrit.wikimedia.org/r/347781

Change 347781 merged by Dzahn:
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001

https://gerrit.wikimedia.org/r/347781

Mentioned in SAL (#wikimedia-operations) [2017-04-11T22:36:34Z] <mutante> ocg1003 started picking up jobs (mw-ocg-latexer) after it was enabled with gerrit:347781, ocg1001 was disabled in the same change. Also ganglia graphs confirm it. T84723 T161158