reinstall OCG servers
Closed, Declined (Public)

Assigned To

None

Authored By

	Dzahn
	Nov 10 2014, 7:54 PM

Description

as mentioned in ops meeting today and requested by Giuseppe:
reinstall OCG servers with a new partitioning scheme, use LVM, leave some free
extents, one by one

also see: T134773#2277591 (comment by hashar about imagemagick)

Details

Reference: rt8839

Subject	Repo	Branch	Lines +/-
ocg: enable ocg1003, disable ocg1001	operations/puppet	production	+0 -0
ocg: set correct ImageMagick conf dir on jessie	operations/puppet	production	+7 -1
ocg: don't try to use syslog group on jessie	operations/puppet	production	+3 -1
ocg: install the right font packages on jessie	operations/puppet	production	+9 -3
ocg: make it work on systemd	operations/puppet	production	+41 -12
install_server/ocg: let ocg1003 use raid1-lvm partman	operations/puppet	production	+2 -2
install_server/ocg: let ocg1003 use jessie installer	operations/puppet	production	+2 -0
Decommission ocg1003.	operations/puppet	production	+5 -0


Related Objects

Status	Assigned	Task
Declined	None	T130591 Increase size of root partition on ocg* servers
Declined	None	T84723 reinstall OCG servers
Resolved	Dzahn	T90839 pybal issue?
Resolved	cscott	T120077 Implement flag to tell an OCG machine not to take new tasks from the redis task queue
Resolved	Joe	T120078 OCG checks should be CRITICAL when reading from the server times out
Resolved	cscott	T120079 The OCG cleanup cache script doesn't work properly
Declined	None	T135034 make ocg role work on labs instances (install deployment-pdf instance with jessie)

Event Timeline


Joe assigned this task to Dzahn.Jul 9 2015, 2:20 PM

Joe raised the priority of this task from Medium to Unbreak Now!.

I basically gave up on this task after trying and asked for help. So just re-assigning it to me may not be the most effective choice.

It's possible that nobody knows how to decommission a host.

I read through the source (lib/threads/frontend.js) and it looked to me like the only way ocg1003 should be pulling jobs is if someone is hitting it explicitly with render requests on its web API (see handleRender()). If pybal is working, nobody should be sending it render jobs, although it may still be running gc jobs on a timer.

BUT I just noticed that the backend thread (lib/threads/backend.js) is configured to fetch jobs from the redis queue in a loop, using the redis blpop command.

So I guess we need to add some code on each iteration of that loop to check whether the service should do a safe shutdown, and if so do so. What's the standard way to do that? Trap SIGTERM and set a flag?

It should also be safe just to sudo service ocg stop on ocg1003. That only risks orphaning a single job in progress, which is probably what we've been doing every time we've restarted the service to date (sigh), although there does seem to be a SIGINT handler present (but the SIGINT handler doesn't seem to take any care to do a clean shutdown).

Filed T105372 to implement clean shut down.

In the interim, just service ocg stop when load seems low and we'll hope the (at most one) affected user whose render job hangs doesn't get too upset with us.
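
For reference, the "trap SIGTERM and set a flag" idea raised above could look roughly like the sketch below. This is not the OCG code: `workerLoop`, `fetchJob` and `runJob` are hypothetical names, and `fetchJob` stands in for a blpop call with a timeout, assumed to resolve to null when nothing arrives in time.

```javascript
// Sketch only: `fetchJob` and `runJob` are hypothetical stand-ins for the
// blocking redis pop (blpop with a timeout) and the render step.
let shuttingDown = false;

// Set a flag instead of exiting, so the job in progress can finish.
process.on('SIGTERM', () => {
  shuttingDown = true;
});

async function workerLoop(fetchJob, runJob) {
  let completed = 0;
  // The flag is checked once per iteration, before taking another job.
  while (!shuttingDown) {
    // fetchJob is assumed to resolve to null on timeout, so the flag is
    // rechecked periodically even when the queue is empty.
    const job = await fetchJob();
    if (job === null) {
      continue;
    }
    await runJob(job);
    completed += 1;
  }
  return completed;
}
```

With a loop like this, a stop request waits for at most one job to complete rather than orphaning it, provided the init system gives the process enough time before sending SIGKILL.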

@cscott thank you very much for the additional comments and clarification. i'll move forward with this

"Unbreak Now!" priority ("needs to be fixed immediately, setting anything else aside") for six weeks.

@Dzahn: Any news here / is the priority correct?

Dzahn lowered the priority of this task from Unbreak Now! to High.Aug 22 2015, 1:05 AM

Krenair subscribed.Aug 22 2015, 1:40 AM

Joe mentioned this in T120077: Implement flag to tell an OCG machine not to take new tasks from the redis task queue.Dec 2 2015, 11:33 AM

Dzahn changed the task status from Open to Stalled.Dec 10 2015, 10:09 PM

cscott mentioned this in T120079: The OCG cleanup cache script doesn't work properly.Dec 15 2015, 5:30 PM

Dzahn added a parent task: T130591: Increase size of root partition on ocg* servers.Mar 29 2016, 10:50 PM

Change 286070 had a related patch set uploaded (by Cscott):
Decommission ocg1003.

https://gerrit.wikimedia.org/r/286070

gerritbot added a project: Patch-For-Review.Apr 28 2016, 9:55 PM

Change 286070 merged by Giuseppe Lavagetto:
Decommission ocg1003.

https://gerrit.wikimedia.org/r/286070

cscott closed subtask T120077: Implement flag to tell an OCG machine not to take new tasks from the redis task queue as Resolved.May 5 2016, 6:00 PM

cscott closed subtask T120079: The OCG cleanup cache script doesn't work properly as Resolved.May 10 2016, 6:03 PM

I believe this is unblocked now.

@cscott cool, so currently ocg1003 is depooled already?

@Dzahn ocg1003 is in 'decommission mode', where it would respond to front-end requests for cached files, but won't start any new backend jobs. I've also confirmed that the cache doesn't have any remaining files stored on ocg1003, and I'm pretty sure @Joe removed it from the front end round-robin pool. I confirmed via ganglia that it doesn't seem to be using any CPU or network any more (except for a brief spike of network to redis when I was clearing the cache). So it should be okay to shut down now, although it hasn't actually been shutdown yet.
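
To illustrate what 'decommission mode' means in practice, here is a minimal sketch. It assumes the mode is driven by the `config.coordinator.decommission` setting that shows up in the config diff further down this task; `mayStartBackendJobs` is a hypothetical helper, not a function from the OCG source.

```javascript
// Sketch only, not the actual OCG source. The assumption is that a host whose
// name matches config.coordinator.decommission keeps serving cached files but
// stops pulling new jobs from the queue.
function mayStartBackendJobs(config, localHostname) {
  return config.coordinator.decommission !== localHostname;
}

const config = {
  coordinator: {
    hostname: 'ocg1003.eqiad.wmnet',
    decommission: 'ocg1003.eqiad.wmnet',
  },
};

console.log(mayStartBackendJobs(config, 'ocg1003.eqiad.wmnet')); // false
console.log(mayStartBackendJobs(config, 'ocg1001.eqiad.wmnet')); // true
```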

Change 288049 had a related patch set uploaded (by Dzahn):
install_server/ocg: let ocg1003 use jessie installer

https://gerrit.wikimedia.org/r/288049

Change 288053 had a related patch set uploaded (by Dzahn):
install_server/ocg: let ocg1003 use raid1-lvm partman

https://gerrit.wikimedia.org/r/288053

Change 288049 merged by Dzahn:
install_server/ocg: let ocg1003 use jessie installer

https://gerrit.wikimedia.org/r/288049

Dzahn mentioned this in rOPUPb7af1a24ab73: install_server/ocg: let ocg1003 use jessie installer.May 10 2016, 8:17 PM

Change 288053 merged by Dzahn:
install_server/ocg: let ocg1003 use raid1-lvm partman

https://gerrit.wikimedia.org/r/288053

Dzahn mentioned this in rOPUP04c9e9a297f6: install_server/ocg: let ocg1003 use raid1-lvm partman.May 10 2016, 8:21 PM

Mentioned in SAL [2016-05-10T20:24:35Z] <mutante> scheduled icinga downtime for ocg1003 and all services on it, rebooting to PXE (T84723)

Dzahn updated the task description.May 10 2016, 9:27 PM

I got a jessie installer and it finished, but then it doesn't detect the disk/controller and i am prompted with BusyBox.

14:01 <mutante> like you had on another server recently
14:01 <mutante> and the last error is:
14:01 <mutante> mdadm: No devices listed in conf file were found. Volume group "ocg1003-vg" not found Skipping volume group ocg1003-vg
14:01 <mutante> Unable to find LVM volume ocg1003-vg/swap

14:19 <papaul> 1KWKJ
14:19 <papaul> 2
14:19 <papaul> HARD DRIVE, 500GB, EXPANDABLE SYSTEM, 7.2, 3.5, W-SU, E/C

14:24 <mutante> +----------------------------------------------------------------------------+
14:24 <mutante> |*Debian GNU/Linux, with Linux 4.4.0-1-amd64 |
14:24 <mutante> | Debian GNU/Linux, with Linux 4.4.0-1-amd64 (recovery mode) |
14:24 <mutante> | Debian GNU/Linux, with Linux 3.16.0-4-amd64 |
14:24 <mutante> | Debian GNU/Linux, with Linux 3.16.0-4-amd64 (recovery mode) |

14:25 <mutante> - Check rootdelay= (did the system wait long eno[ 34.725095] uhci_hcd: USB Universal Host Controller Interface driver
14:25 <mutante> ugh?)
14:25 <mutante> - Check root= (did the[ 34.735485] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
14:25 <mutante> system wait for the right device?)
14:25 <mutante> - Missing modules (cat /proc/modules; ls /dev)
14:25 <mutante> ALERT! /dev/disk/by-uuid/eb1384fc-afcd-404f-a16f-a7e52abd3ac0 does not exist. Dropping to a shell!
14:25 <mutante> modprobe: module ehci-orion not found in modules.dep
14:25 <mutante> BusyBox v1.22.1 (Debian 1:1.22.0-9+deb8u1) built-in shell (ash)
14:25 <mutante> Enter 'help' for a list of built-in commands.
14:25 <mutante> /bin/sh: can't access tty; job control turned off
14:25 <mutante> (initramfs)
14:26 <mutante> should i try booting with the old kernel ?
14:27 <mutante> to confirm that it detects the disk then?
14:27 <mutante> is this like a problem you had on others?
14:27 <papaul> yes
14:27 <papaul> the install complete for the problem is withing the kernel
14:28 <mutante> i am trying to boot again, this time 3.x kernel

14:30 <mutante> Loading 3.16-0-4-amd64
14:30 <mutante> Loading initial ramdisk ...
14:30 <mutante> [ 2.149082] i8042: No controller found
14:30 <mutante> Loading, please wait...
14:30 <mutante> mdadm: No devices listed in conf file were found. Volume group "ocg1003-vg" not found Skipping volume group ocg1003-vg
14:30 <mutante> Unable to find LVM volume ocg1003-vg/swap

The problem was fixed by inserting the following line into GRUB:
acpi=off irqpoll

thank you very much @Papaul for fixing that

I could continue with the install. Re-added to puppet, signed new cert, added new salt-key, etc.

Initial puppet run, user accounts have been created.. @cscott you should be able to login again and check it out

the root partition is now: /dev/md0 46G 4.9G 39G 12% /

Next we are running into some puppet issues due to the distro change.

E: Unable to locate package ttf-indic-fonts-core

Package ttf-devanagari-fonts is not available
etc..

This is similar to the MW font package changes.

Job for apparmor.service failed.

This is probably what Hashar said earlier and i linked in the task description.

Error: Could not find group syslog

Error: /Stage[main]/Ocg/File[/srv/deployment/ocg/log]/group: change from root to syslog failed: Could not find group syslog

(/Stage[main]/Ocg/Service[ocg]) Provider upstart is not functional on this host

needs systemd unit files

Change 288112 had a related patch set uploaded (by Dzahn):
ocg: make it work on systemd

https://gerrit.wikimedia.org/r/288112

Change 288112 merged by Dzahn:
ocg: make it work on systemd

https://gerrit.wikimedia.org/r/288112

Dzahn mentioned this in rOPUPdd3e0b1c835e: ocg: make it work on systemd.May 10 2016, 11:26 PM

Change 288132 had a related patch set uploaded (by Dzahn):
ocg: install the right font packages on jessie

https://gerrit.wikimedia.org/r/288132

Change 288132 merged by Dzahn:
ocg: install the right font packages on jessie

https://gerrit.wikimedia.org/r/288132

Dzahn mentioned this in rOPUPc67983505bae: ocg: install the right font packages on jessie.May 11 2016, 12:02 AM

Change 288139 had a related patch set uploaded (by Dzahn):
ocg: don't try to use syslog group on jessie

https://gerrit.wikimedia.org/r/288139

Change 288139 merged by Dzahn:
ocg: don't try to use syslog group on jessie

https://gerrit.wikimedia.org/r/288139

Dzahn mentioned this in rOPUP9a65ddeb9581: ocg: don't try to use syslog group on jessie.May 11 2016, 12:31 AM

Change 288142 had a related patch set uploaded (by Dzahn):
ocg: set correct ImageMagick conf dir on jessie

https://gerrit.wikimedia.org/r/288142

Change 288142 merged by Dzahn:
ocg: set correct ImageMagick conf dir on jessie

https://gerrit.wikimedia.org/r/288142

NOW: "Notice: /Stage[main]/Ocg/Service[ocg]/ensure: ensure changed 'stopped' to 'running'"

We still see some failures in puppet output related to apparmor but this part is fixed :)

Dzahn mentioned this in rOPUP5a381c6fc917: ocg: set correct ImageMagick conf dir on jessie.May 11 2016, 1:22 AM

We are now down to this remaining issue:

May 11 01:20:23 ocg1003 apparmor[11585]: Starting AppArmor profiles:AppArmor not available as kernel LSM..

Could not open configuration file /etc/ocg/mw-ocg-service.js! Error: Module did not self-register.

i added "apparmor=1 security=apparmor" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and rebooted the server. issue is fixed.

@Joe @cscott see all the comments above and now :)

[ocg1003:~] $ puppet agent -tv
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for ocg1003.eqiad.wmnet
Info: Applying configuration version '1462930343'
Notice: /Stage[main]/Ocg/Package[libjpeg-progs]/ensure: ensure changed 'purged' to 'present'
Notice: /Stage[main]/Ocg/Service[ocg]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ocg/Service[ocg]: Unscheduling refresh on Service[ocg]
Notice: Finished catalog run in 16.59 seconds

root@ocg1003:~# systemctl status apparmor
● apparmor.service - LSB: AppArmor initialization
   Loaded: loaded (/etc/init.d/apparmor)


-- Unit ocg.service has begun starting up.
May 11 01:41:21 ocg1003 systemd[1]: Started MediaWiki Collection Offline Content Generator.
-- Subject: Unit ocg.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- **Unit ocg.service has finished starting up**.
-- 
-- The start-up result is done.
May 11 01:41:21 ocg1003 nodejs-ocg[2704]: Could not open configuration file /etc/ocg/mw-ocg-service.js! Error: Module did not self-register.
May 11 01:41:21 ocg1003 systemd[1]: ocg.service: main process exited, code=exited, status=1/FAILURE
May 11 01:41:21 ocg1003 systemd[1]: Unit ocg.service entered failed state.

....

hashar mentioned this in T134773: imagemagick::install refers to directory /etc/ImageMagic which does not exist on Jessie.May 11 2016, 8:00 AM

Maybe we should be doing this on a labs machine first so I can log in and take a look at things. I don't have root on the production machines, so poking around is hard.

It *seems* that the "Error: Module did not self-register" is actually coming from node, and maybe it's complaining that the /etc/ocg/mw-ocg-service.js file is malformed? Could you take a look at that and compare it to the one on ocg1001/2?

In T84723#2285492, @cscott wrote:

It *seems* that the "Error: Module did not self-register" is actually coming from node, and maybe it's complaining that the /etc/ocg/mw-ocg-service.js file is malformed? Could you take a look at that and compare it to the one on ocg1001/2?

This error happens whenever you have binary dependencies that have been compiled with a node version different than the one you're executing.
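
A quick way to see which native-module ABI the running node expects is `process.versions.modules`. The sketch below compares it with the ABI a binary dependency was built against; `abiMatches` is a hypothetical helper, and how the build-time ABI gets recorded is left as an assumption.

```javascript
// Sketch only: `abiMatches` is a hypothetical helper. process.versions.modules
// is the native-module ABI version (NODE_MODULE_VERSION) of the running node.
function abiMatches(builtAbi, runtimeAbi = process.versions.modules) {
  return String(builtAbi) === String(runtimeAbi);
}

console.log(`node ${process.version}, module ABI ${process.versions.modules}`);
```

If the two values differ, rebuilding the binary dependencies under the node version that will run the service (for example with `npm rebuild`) is the usual fix.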

In T84723#2285492, @cscott wrote:

the /etc/ocg/mw-ocg-service.js file is malformed? Could you take a look at that and compare it to the one on ocg1001/2?

the diff is just this:

14,16c14
< 	config.coordinator.hostname = "ocg1003.eqiad.wmnet";
< 
< 	config.coordinator.decommission = "ocg1003.eqiad.wmnet";
---
> 	config.coordinator.hostname = "ocg1001.eqiad.wmnet";

What Marko said, i googled around a bit and this sounds like the binary has to be recompiled for the newer node version. ack.

Dzahn created subtask T135034: make ocg role work on labs instances (install deployment-pdf instance with jessie).May 11 2016, 7:19 PM

Dzahn removed Dzahn as the assignee of this task.May 11 2016, 9:19 PM

Dzahn mentioned this in rOPUP60d1fa038479: ocg: ocg1003 back to trusty installer.May 24 2016, 6:29 PM

re-installed ocg1003 with trusty for now so that it can be used until we have a package for jessie

All icinga services are green again, incl. ocg itself

OK: ocg_job_status 397469 msg: ocg_render_job_queue 0 msg

We did keep one improvement: the partitioning is now better (T130591), with a larger /, RAID and LVM.

I have not re-pooled it yet.

RobH removed a project: Patch-For-Review.May 25 2016, 9:49 PM

Dzahn mentioned this in rOPUPda63c30e485d: ocg: ocg1003 back to trusty installer.Jun 17 2016, 6:08 PM

Dzahn mentioned this in rOPUP08988e89ff6f: ocg: ocg1003 back to trusty installer.

Dzahn mentioned this in rOPUP201869220f8a: ocg: set correct ImageMagick conf dir on jessie.

Dzahn mentioned this in rOPUPfec2b7646b10: ocg: set correct ImageMagick conf dir on jessie.

Dzahn mentioned this in rOPUPb467e69a3e55: ocg: set correct ImageMagick conf dir on jessie.

Dzahn mentioned this in rOPUPff2b4325ffcd: ocg: set correct ImageMagick conf dir on jessie.

Dzahn mentioned this in rOPUPfbbe5ba67829: ocg: set correct ImageMagick conf dir on jessie.

Dzahn mentioned this in rOPUP7a24dcec811d: ocg: set correct ImageMagick conf dir on jessie.

Dzahn mentioned this in rOPUPf8bcbf48cc38: ocg: set correct ImageMagick conf dir on jessie.

Dzahn mentioned this in rOPUP9245f624d557: ocg: don't try to use syslog group on jessie.

Dzahn mentioned this in rOPUP740364eed83d: ocg: don't try to use syslog group on jessie.

Dzahn mentioned this in rOPUP4c7e9e48d3c5: ocg: install the right font packages on jessie.

Dzahn mentioned this in rOPUPf3b1550de2ab: install_server/ocg: let ocg1003 use jessie installer.

Dzahn mentioned this in rOPUP09a65f65f52a: install_server/ocg: let ocg1003 use raid1-lvm partman.

Dzahn mentioned this in rOPUPbd9e5c96bbb2: ocg: make it work on systemd.

Dzahn mentioned this in rOPUP92d3a54c64c9: ocg: make it work on systemd.

Dzahn mentioned this in rOPUP0645e5cd2ca2: ocg: install the right font packages on jessie.

Dzahn mentioned this in rOPUP7d3b102b9026: ocg: make it work on systemd.

Dzahn mentioned this in rOPUP4c6f31573736: install_server/ocg: let ocg1003 use raid1-lvm partman.

cscott mentioned this in rOPUP9c5b5cbacf11: Decommission ocg1003..Jun 17 2016, 6:10 PM

cscott mentioned this in rOPUP02b7cd60fab4: Decommission ocg1003..

cscott mentioned this in rOPUPe9fd6d2b4109: Decommission ocg1003..

cscott mentioned this in rOPUPb4ae67f0fd48: Decommission ocg1003..

• Pchelolo moved this task from Backlog to watching on the Services board.Oct 12 2016, 7:24 PM

• Pchelolo edited projects, added Services (watching); removed Services.

We're sunsetting OCG

fgiunchedi closed subtask T135034: make ocg role work on labs instances (install deployment-pdf instance with jessie) as Declined.Dec 1 2016, 11:37 PM

Change 347781 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001

https://gerrit.wikimedia.org/r/347781

gerritbot added a project: Patch-For-Review.Apr 11 2017, 10:25 PM

Change 347781 merged by Dzahn:
[operations/puppet@production] ocg: enable ocg1003, disable ocg1001

https://gerrit.wikimedia.org/r/347781

Mentioned in SAL (#wikimedia-operations) [2017-04-11T22:36:34Z] <mutante> ocg1003 started picking up jobs (mw-ocg-latexer) after it was enabled with gerrit:347781, ocg1001 was disabled in the same change. Also ganglia graphs confirm it. T84723 T161158

Stashbot mentioned this in T161158: Degraded RAID on ocg1001.Apr 11 2017, 10:36 PM
