
Migrate dataset1001 and ms1001 to jessie
Closed, Resolved · Public

Description

These hosts are still on precise; migrate them to jessie.

Event Timeline

MoritzMuehlenhoff raised the priority of this task to Needs Triage.
MoritzMuehlenhoff updated the task description. (Show Details)
Restricted Application added subscribers: StudiesWorld, Aklapper. · Jan 15 2016, 11:55 AM
ArielGlenn triaged this task as Normal priority.
ArielGlenn set Security to None.
ArielGlenn moved this task from Backlog to Up Next on the Dumps-Generation board. · Feb 9 2016, 3:49 PM

This will happen after the current dump run completes. ETA 5-6 days from now.

hoo added a subscriber: hoo. · Feb 15 2016, 11:58 AM

Please let me know if there are disruptions to be expected Monday or early Tuesday.

mark raised the priority of this task from Normal to High. · Feb 15 2016, 2:28 PM
mark added a subscriber: mark.

I'll be doing this over the next two days; the monthly full run completed on Friday. ms1001 will be first, with all rsyncs from dataset1001 to it paused.

ArielGlenn moved this task from Up Next to Active on the Dumps-Generation board. · Feb 29 2016, 12:46 PM
ArielGlenn added a comment. · Edited · Feb 29 2016, 12:52 PM

Plan for ms1001 (proceeding now, no downtime notice to us or to users needed):

DONE - Disable all rsyncs to/from ms1001 and any cron jobs that run there
DONE - Make sure the ms1001 filesystem is not mounted anywhere (unmount if so)
DONE - Do a final rsync from dataset1001 to ms1001, including private information
DONE (manual) - Double check the partitioning recipe in use for ms1001, update if needed
DONE (cfg saved too) - Record raid setup and the order in which disks were added to the virtual disks
DONE (robh) - Check with chrisj whether there are any other precautions to take so that the arrays don't get touched
DONE - Update ms1001
DONE - Make sure files are retrievable via the web
DONE - Re-enable all rsyncs to/from ms1001
DONE - Re-enable all cron jobs

Change 273892 had a related patch set uploaded (by ArielGlenn):
disable rsync between ms1001 and dataset1001, prep for jessie upgrade

https://gerrit.wikimedia.org/r/273892

Change 273892 merged by ArielGlenn:
disable rsync between ms1001 and dataset1001, prep for jessie upgrade

https://gerrit.wikimedia.org/r/273892

The rsync between dataset1001 and ms1001 is in progress. I'll rerun it tomorrow morning and then proceed with the upgrade.
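
Rerunning the sync is cheap because rsync only transfers what changed since the previous pass. A minimal sketch of how such a repeatable sync command could be assembled follows. The function name, the paths, and the choice of flags (including --delete and the host:path form rather than an rsync daemon module) are illustrative assumptions, not the commands actually used here.

```python
import shlex


def build_rsync_cmd(src_host, src_path, dest_path, dry_run=True):
    """Build an rsync argv that copies src_path on src_host into dest_path.

    All arguments are hypothetical examples; nothing here is taken from
    the real dataset1001/ms1001 configuration.
    """
    cmd = ["rsync", "--archive", "--delete"]
    if dry_run:
        # Report what would change without touching the destination.
        cmd.append("--dry-run")
    # A trailing slash on the source means "copy the directory contents".
    cmd.append(f"{src_host}:{src_path.rstrip('/')}/")
    cmd.append(f"{dest_path.rstrip('/')}/")
    return cmd


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(shlex.join(build_rsync_cmd("dataset1001", "/data", "/data")))
```

Defaulting to a dry run keeps the first pass harmless; the same call with dry_run=False gives the command for the real pass and for any later rerun.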

ms1001 is now upgraded to jessie and back in service.

ArielGlenn added a comment. · Edited · Mar 1 2016, 2:01 PM

Tomorrow's plan for dataset1001 looks much like today's for ms1001. Window set for 1 to 4 pm UTC.

DONE Record raid setup and what order disks have been added in virtual disks, save cfg, save fstab too for good measure
DONE Double check partitioning recipe in use for dataset1001, should be none
DONE Disable all rsyncs to/from dataset1001 except ms1001, disable any cron jobs that run there
DONE Disable cron jobs on snapshot1003 (these use the dataset1001 filesystem)
DONE Unmount dataset1001 filesystem from stat100x, snapshots
DONE One last rsync to ms1001 of /data
Update, test, re-enable services on snapshot1003 and other client hosts
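
The unmount step above can be double-checked on each client host before the server goes down. Below is a minimal sketch, assuming a Linux client where /proc/mounts lists NFS mounts as "host:/export mountpoint fstype ...". The helper name, the example.org domain, and the /mnt/data mount point are made up for illustration.

```python
def nfs_mounts_from(mounts_text, server):
    """Return the mount points of NFS filesystems exported by `server`.

    mounts_text is the content of /proc/mounts; server is a short
    hostname such as "dataset1001" (a fully qualified source also matches).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        source, mount_point, fs_type = fields[0], fields[1], fields[2]
        if not fs_type.startswith("nfs"):
            continue  # only nfs / nfs4 entries are of interest
        host = source.split(":", 1)[0]
        if host == server or host.startswith(server + "."):
            found.append(mount_point)
    return found


# Hypothetical /proc/mounts excerpt, for illustration only.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
dataset1001.example.org:/data /mnt/data nfs rw,vers=3 0 0
"""

if __name__ == "__main__":
    print(nfs_mounts_from(SAMPLE, "dataset1001"))
```

On a real client one would pass open("/proc/mounts").read() instead of SAMPLE; an empty list means nothing from that server is still mounted.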

ArielGlenn added a comment. · Edited · Mar 2 2016, 3:42 PM

https://phabricator.wikimedia.org/P2698

Result of trying to PXE boot. Have tried:

  1. disable disk boot (now it just loops through failed PXE boots)
  2. check that the NIC in the boot sequence with PXE enabled is the one listed in dhcp on carbon (it is).
  3. enable PXE for the second NIC just in case.
  4. turn on Option ROM in both nics (broadcom config) in hopes someone turned it off and it was related (pretty random, and it didn't help).

Now out of ideas.

Note that it never contacts carbon, or at least never seems to; I don't get the spinning line that indicates it's trying. I get an immediate error exit.
Of course no entries from either nic are in the logs on carbon.

Note also that this host does boot just fine and run... i.e. there isn't an issue with network cables or the card or whatever. And it's got to be on the right vlan or how would any of its services work?

We never get to the debian installer. See the paste above.

Dzahn added a subscriber: Dzahn. · Mar 2 2016, 3:53 PM

Tried to get it to PXE boot as well: enabled/disabled additional NICs, tried UEFI mode instead of BIOS:

PXE boot - Embedded NIC 1: Broadcom NetXtreme II Gigabit Ethernet (BCM5716C)
PXE-E12: Could not detect network connection.


PXE boot - Embedded NIC 2: Broadcom NetXtreme II Gigabit Ethernet (BCM5716C)
PXE-E12: Could not detect network connection.


PXE boot - Slot 2: Broadcom NetXtreme II 10 Gigabit Ethernet (BCM57810)
.
RobH added a subscriber: RobH. · Edited · Mar 2 2016, 5:36 PM

Is that the same ixgbe hardware as in https://phabricator.wikimedia.org/T128068 ?

Nope, those use the Intel 10Gbit cards (memcached) where everything else uses broadcom. Also the changes that fixed that issue are live, so if it were similar they should fix this as well, right? (So it seems to be unrelated, I think...)

The issue that I see is that we don't get the bios splash screen for the 10Gbit NIC. Since we cannot access the 10Gbit nic bios, we cannot set its mode to pxe boot on the primary slot. Many times, the mode for boot isn't set in these add-on cards.

I'll outline what I've tried, all with no joy:

  • disabling onboard NICs
  • setting all nics to PXE boot in bios (only shows onboard)
  • attempting uefi boot to configure the 10G nic
    • uefi mode is an attended boot mode: on every reboot it asks what to boot from. Attempting to boot the 10G card from here shows the slot, and it will even hit carbon for a lease. However, the lease ack shows up on carbon, but dataset1001 doesn't seem to accept it. No other systems use uefi mode, so this was simply for troubleshooting.

I'm not sure why we cannot see the 10Gb nic bios splash screen on boot. This is an older R510, so unlike the new dells, the only way I know into the 10Gb nic bios is via the ctrl+s option during POST. Unfortunately, that option on dataset1001 only shows the onboard broadcom 1Gbit NIC.

We are well over our originally slotted maintenance window, so I have rolled back everything and all services are operational again. The nfs mount should show up on snapshot and stats hosts after the next puppet run on those hosts.

The upshot is that, even with the onboard nics disabled, there is no prompt that gives one access to configure the external Broadcom NIC. It's not configurable from within UEFI or from the BIOS either, which only deals with integrated devices.

Boot in UEFI from the 10G NIC (slot 2 port 1, I guess this is a 2-port card) shows entries in the log on carbon but that's as far as it gets.

Mar 2 18:25:14 carbon dhcpd: DHCPREQUEST for 208.80.154.11 (208.80.154.10) from 00:0a:f7:5d:a0:20 via eth0
Mar 2 18:25:14 carbon dhcpd: DHCPACK on 208.80.154.11 to 00:0a:f7:5d:a0:20 via eth0
Mar 2 18:25:14 carbon dhcpd: DHCPREQUEST for 208.80.154.11 (208.80.154.10) from 00:0a:f7:5d:a0:20 via 208.80.154.2
Mar 2 18:25:14 carbon dhcpd: DHCPACK on 208.80.154.11 to 00:0a:f7:5d:a0:20 via 208.80.154.2
Mar 2 18:25:14 carbon dhcpd: DHCPREQUEST for 208.80.154.11 (208.80.154.10) from 00:0a:f7:5d:a0:20 via 208.80.154.3
Mar 2 18:25:14 carbon dhcpd: DHCPACK on 208.80.154.11 to 00:0a:f7:5d:a0:20 via 208.80.154.3
Mar 2 18:25:14 carbon atftpd[8523]: Serving jessie-installer/debian-installer/amd64/pxelinux.0 to 208.80.154.11:1064
Mar 2 18:25:14 carbon atftpd[8523]: Serving jessie-installer/debian-installer/amd64/pxelinux.0 to 208.80.154.11:1065
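
For anyone comparing similar logs later, here is a small sketch that tallies DHCP and TFTP activity per client, so that "lease acked, bootloader served, then nothing" is easy to spot. The regular expressions are only fitted to the dhcpd and atftpd line shapes quoted above and are an assumption about the format in general; the function name is made up.

```python
import re
from collections import Counter

# Patterns fitted to the log lines quoted in this task, not to every
# message dhcpd or atftpd can emit.
DHCP_RE = re.compile(
    r"dhcpd: (?P<kind>DHCP[A-Z]+) (?:for|on) (?P<ip>\d+(?:\.\d+){3})"
    r".*? (?:from|to) (?P<mac>[0-9a-f]{2}(?::[0-9a-f]{2}){5})"
)
TFTP_RE = re.compile(
    r"atftpd\[\d+\]: Serving (?P<path>\S+) to (?P<ip>\d+(?:\.\d+){3}):\d+"
)


def summarize(log_text):
    """Count DHCP messages per (mac, kind) and TFTP transfers per (ip, path)."""
    dhcp = Counter()
    tftp = Counter()
    for line in log_text.splitlines():
        m = DHCP_RE.search(line)
        if m:
            dhcp[(m["mac"], m["kind"])] += 1
            continue
        m = TFTP_RE.search(line)
        if m:
            tftp[(m["ip"], m["path"])] += 1
    return dhcp, tftp


# Four of the lines from the excerpt above.
SAMPLE = """\
Mar 2 18:25:14 carbon dhcpd: DHCPREQUEST for 208.80.154.11 (208.80.154.10) from 00:0a:f7:5d:a0:20 via eth0
Mar 2 18:25:14 carbon dhcpd: DHCPACK on 208.80.154.11 to 00:0a:f7:5d:a0:20 via eth0
Mar 2 18:25:14 carbon atftpd[8523]: Serving jessie-installer/debian-installer/amd64/pxelinux.0 to 208.80.154.11:1064
Mar 2 18:25:14 carbon atftpd[8523]: Serving jessie-installer/debian-installer/amd64/pxelinux.0 to 208.80.154.11:1065
"""

if __name__ == "__main__":
    dhcp_counts, tftp_counts = summarize(SAMPLE)
    print(dhcp_counts)
    print(tftp_counts)
```

In this excerpt the only file served is pxelinux.0; no later transfers to the client appear, which matches the boot stalling right after the bootloader download.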

So I'm open to suggestion. Or help. Or anything that will make this box PXE boot. If I have to I'll dist-upgrade but that's a really last resort.

Fallback plan: chris is cabling up the 1gb nic, setting up a port for it, I'll install to that and then move to the 10gb nic once the upgrade is done. Hate hate hate.

He's set up the port with the public vlan. Getting email ready to go out now; the new schedule will be tomorrow 3 to 6 pm UTC. Meh.

ArielGlenn added a comment. · Edited · Mar 4 2016, 11:51 AM

Here we go again, in about one hour:

DONE Disable all rsyncs to/from dataset1001 except ms1001, disable any cron jobs that run there
DONE Disable cron jobs on snapshot1003 (these use the dataset1001 filesystem)
DONE Unmount dataset1001 filesystem from stat100x, snapshots
DONE One last rsync to ms1001 of /data
DONE Update
DONE Switch from 1gb nic to 10gb nic
DONE Test
DONE Re-enable services on snapshot1003 and other client hosts

ArielGlenn closed this task as Resolved. · Mar 4 2016, 2:52 PM

And that's that. For future reference, anyone else needing to upgrade a host that uses an external nic may need to use the embedded nic for the upgrade and switch afterwards.

ArielGlenn moved this task from Active to Done on the Dumps-Generation board. · Mar 4 2016, 2:52 PM