
Move the stat1004-6-7 hosts to Debian Buster
Open, Needs Triage, Public

Event Timeline

elukey created this task. Jun 10 2020, 2:51 PM
elukey moved this task from Backlog to Q1 2020/2021 on the Analytics-Clusters board.

This needs to be coordinated with our users since it will be invasive; things like Jupyter venvs will need to be re-created, etc.

Aklapper removed a project: Analytics. Jul 4 2020, 7:59 AM
Isaac added a subscriber: Isaac. Jul 24 2020, 4:59 PM

@elukey thanks for the heads up -- any expectation that any Python packages will be problematic to reinstall? The one that generally gives me the most headache btw is fasttext.

Also, for anyone who wants to better prepare, $ pip freeze > requirements.txt is a nice way to get a quick listing of packages that can then be easily reinstalled in the new environments (pip install -r requirements.txt).
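
For example, a minimal round trip looks something like this (the venv path is just a placeholder):

# On the current host, before the reimage:
pip freeze > requirements.txt

# On the reimaged host, inside a fresh venv:
python3 -m venv ~/my-venv
source ~/my-venv/bin/activate
pip install -r requirements.txt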

@Isaac in theory there shouldn't be many issues, but if you want to make sure, you can try to install them on stat1005/stat1008, which are already running Debian 10 (just to double check that nothing explodes, etc.). Let me know if you find something weird!
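
For example, a throwaway venv on one of the Buster hosts would look something like this (paths and the package choice are just placeholders; fasttext is the one mentioned above):

# On stat1005 or stat1008 (already on Debian 10):
python3 -m venv ~/buster-test-venv
source ~/buster-test-venv/bin/activate
pip install --upgrade pip
pip install fasttext                 # or: pip install -r requirements.txt
python -c "import fasttext"          # quick sanity check that it imports
deactivate && rm -rf ~/buster-test-venv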

Isaac added a comment. Jul 27 2020, 1:55 PM

but if you want to make sure, you can try to install them on stat1005/stat1008, which are already running Debian 10 (just to double check that nothing explodes, etc.)

Ahh good point -- done and no issues. Thanks!

@elukey: I'm checking with my team right now for which days/weeks would be bad and will let you know soon, but I did want to highlight @Iflorez's request for now:

Downtime in August on stat6 would be problematic on my end. I'm wrapping up GLOW year one analysis over the next weeks. Can stat6 updates be carried out after September 10th?

Yes, definitely, let's define a schedule together that is a good compromise for all of you!

Okay, the team primarily works on stat1005 & stat1008. @SNowick_WMF uses stat1007 but is flexible and can switch to 5/8 during downtime. No restrictions or special requests from Product Analytics, just that one Sept 10 request from Irene.

elukey added a comment. Aug 4 2020, 2:20 PM

@mpopov thanks a lot for collecting thoughts, really appreciated. How about something like:

  • stat1004 reimaged during this week or the next
  • stat100[6,7] after Sept 10th
Isaac added a comment. Aug 4 2020, 2:23 PM

stat1004 reimaged during this week or the next

@elukey just a heads up that I'm running some long-running SWAP notebooks on stat1004, but it's okay to kill those processes as part of the reimaging if they're still going when you proceed. They're long-running because they run a number of sequential pyspark queries, and it's easy for me to pick up from where they left off if they get killed. No need to check with me in advance.

mpopov added a comment. Aug 4 2020, 3:04 PM
  • stat1004 reimaged during this week or the next
  • stat100[6,7] after Sept 10th

Sounds good!

Change 624061 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: add reuse recipe for stat100x hosts with 4 disks

https://gerrit.wikimedia.org/r/624061

Change 624061 merged by Elukey:
[operations/puppet@production] install_server: add reuse recipe for stat100x hosts with 4 disks

https://gerrit.wikimedia.org/r/624061

Change 624077 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: fix reuse-analytics-stat-4dev.cfg recipe

https://gerrit.wikimedia.org/r/624077

Change 624077 merged by Elukey:
[operations/puppet@production] install_server: fix reuse-analytics-stat-4dev.cfg recipe

https://gerrit.wikimedia.org/r/624077

klausman claimed this task. Sep 14 2020, 3:18 PM
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. Sep 14 2020, 4:09 PM

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['stat1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009180821_klausman_29009.log.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['stat1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009180859_klausman_611.log.

Completed auto-reimage of hosts:

['stat1004.eqiad.wmnet']

Of which those FAILED:

['stat1004.eqiad.wmnet']

The first puppet run on stat1004 highlighted some issues that might need some work before the stat1006 and stat1007 reimages:

Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install oozie-client' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package oozie-client
Error: /Stage[main]/Cdh::Oozie/Package[oozie-client]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install oozie-client' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package oozie-client
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install mahout' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package mahout
Error: /Stage[main]/Cdh::Mahout/Package[mahout]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install mahout' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package mahout

In this case, the packages are not installed after applying the apt configuration for Buster, which is contained in profile::hadoop::common -> ::profile::cdh::apt. This is not a big deal, but let's see if we can fix it.
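
A quick way to double check on the host whether apt can see those packages at all (the exact component/repo layout here is an assumption):

apt-cache policy oozie-client mahout   # no Candidate line means no configured repo provides them
ls /etc/apt/sources.list.d/            # verify that the CDH apt component was actually written out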

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed getting spark version via facter. (file: /etc/puppet/modules/profile/manifests/hadoop/spark2.pp, line: 101, column: 9) on node stat1004.eqiad.wmnet

This is due to this snippet in profile::hadoop::spark2:

# Get spark_version from facter.  Fail if not set.
$spark_version = $::spark_version
if !$spark_version or $spark_version == '' {
    fail('Failed getting spark version via facter.')
}

The above uses modules/profile/lib/facter/spark_version.rb, which relies on dpkg to figure out the version of Spark deployed. During the first puppet run spark2 is not installed yet, so puppet fails. We could extend the .rb snippet with an extra step that checks apt-cache policy or similar if dpkg comes up empty. This bug needs to be fixed since it causes puppet to fail completely (in the earlier case above, puppet keeps going even though it reports issues).
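
A possible shape for that fallback, expressed as the shell-level check the facter could wrap (the package name spark2 comes from the profile above; the exact commands are just a sketch):

# Version from dpkg if spark2 is already installed, otherwise fall back to the
# candidate version apt would install:
dpkg-query -W -f='${Version}' spark2 2>/dev/null \
  || apt-cache policy spark2 | awk '/Candidate:/ {print $2}'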

Change 628330 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::spark2: use the package resource instead of require_package()

https://gerrit.wikimedia.org/r/628330

Change 628330 merged by Elukey:
[operations/puppet@production] profile::hadoop::spark2: use the package resource instead of require_package()

https://gerrit.wikimedia.org/r/628330

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['stat1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009240731_klausman_8593.log.

Completed auto-reimage of hosts:

['stat1006.eqiad.wmnet']

Of which those FAILED:

['stat1006.eqiad.wmnet']

Reimaging complete. The failure above is the failed first puppet run, due to Spark not being installed. I installed it manually, ran puppet, rebooted for the kernel opts, and the machine is now back in service.
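
For reference, the manual recovery roughly amounts to the following (a sketch; the exact package name and local wrapper scripts may differ):

sudo apt-get install spark2   # install the package the spark_version facter looks for
sudo puppet agent -t          # re-run puppet; the catalog should now compile
sudo reboot                   # pick up the new kernel options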

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['stat1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009280844_klausman_784.log.

Completed auto-reimage of hosts:

['stat1007.eqiad.wmnet']

Of which those FAILED:

['stat1007.eqiad.wmnet']

Change 630578 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::statistics: remove stretch bits

https://gerrit.wikimedia.org/r/630578

With these new upgrades happening, I wanted to move my Jupyter notebooks from stat1008 to stat1006 as stat1008 has been very busy lately. After rsync'ing my files, I started reinstalling my R libraries and had them error out because one of them wasn't available for R v3.3. That surprised me, because Debian Buster ships with R v3.5 (as can be found on stat1005 and stat1008).

lsb_release -a on stat1004, stat1006, and stat1007 all list the Debian version as 9.13 (stretch). So to me it looks like all of these had Stretch reinstalled, rather than being upgraded to Buster. Would be great to get that confirmed.
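
For anyone who wants to check this themselves, a quick loop over the hosts from this task works (just a sketch):

for h in stat1004 stat1006 stat1007; do
  echo -n "$h: "
  ssh "$h.eqiad.wmnet" lsb_release -sc   # prints "stretch" or "buster"
done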

Change 631544 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set debian buster for stat100[467]

https://gerrit.wikimedia.org/r/631544

With these new upgrades happening, I wanted to move my Jupyter notebooks from stat1008 to stat1006 as stat1008 has been very busy lately. After rsync'ing my files, I started reinstalling my R libraries and had them error out because one of them wasn't available for R v3.3. That surprised me, because Debian Buster ships with R v3.5 (as can be found on stat1005 and stat1008).

lsb_release -a on stat1004, stat1006, and stat1007 all list the Debian version as 9.13 (stretch). So to me it looks like all of these had Stretch reinstalled, rather than being upgraded to Buster. Would be great to get that confirmed.

This is really embarrassing; I can confirm 100% what you are saying. We concentrated on explaining all the new processes and tools to new people joining the team, and we forgot to double check basics like this. As you can see from https://gerrit.wikimedia.org/r/631544, this is the change I should have introduced to Tobias before starting the reimaging journey, which took a lot of his energy and time to get a good set of backups in place in a timely manner. We'll have to redo the process. Thanks a lot for pinging us on this, I hadn't realized :(

Will merge the change and set up dates again, probably for next week.

/me cries in a corner

Change 631544 merged by Elukey:
[operations/puppet@production] Set debian buster for stat100[467]

https://gerrit.wikimedia.org/r/631544

/me cries in a corner

Mistakes happen! You're still taking excellent care of our beloved analytics clients, and our work couldn't happen without you 😊

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['stat1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010060909_klausman_20404.log.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['stat1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010080849_klausman_3923.log.

Completed auto-reimage of hosts:

['stat1006.eqiad.wmnet']

Of which those FAILED:

['stat1006.eqiad.wmnet']

The reimages of stat1006 and stat1007 were successful.

Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts:

['stat1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202010130756_klausman_14796.log.

Completed auto-reimage of hosts:

['stat1004.eqiad.wmnet']

and were ALL successful.

klausman moved this task from In Progress to Done on the Analytics-Kanban board. Tue, Oct 13, 9:15 AM

This should be complete, finally!

This should be complete, finally!

Thanks @klausman! 🏆

Isaac added a comment. Tue, Oct 13, 2:56 PM

Agreed! Thanks all for these updates (and I'm on the wrong task but thank you for the RAM upgrades too)! Exciting (and very useful) to see the machines getting stronger :)

Change 630578 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::statistics: remove stretch bits

https://gerrit.wikimedia.org/r/630578