
Reimage thorium to Debian Stretch
Closed, Resolved | Public | 3 Estimated Story Points

Description

Reimage thorium to Debian Stretch. It will require careful planning since most of our websites will be unavailable.

Event Timeline

fdans triaged this task as Medium priority. Apr 23 2018, 3:53 PM
fdans raised the priority of this task from Medium to High.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
fdans lowered the priority of this task from High to Medium. Apr 23 2018, 3:55 PM

Thinking out loud :)

During the last offsite we were wondering whether our websites could be served by more than one host, so that they tolerate failures. As far as I can remember, everything on thorium is stateless, so we could think about:

  1. replacing it with two ganeti instances not running in the Analytics VLAN, with an LVS endpoint in front of them (which will be called by Varnish).
  2. repurposing thorium for other needs, like analytics1003's standby db, etc.

If these ideas are too crazy I'll shut up :)

Hmm, good idea in general! The only issue is:

https://analytics.wikimedia.org/datasets/ and also wikistats 1.0. Both need
a lot of space.

Good points. In theory wikistats 1.0 should go away soon, right? The datasets are indeed a problem; I'll try to think of a solution :)

I don't think wikistats 1.0 will ever go away, will it? Erik might stop updating it, but I think it will stay online forever.

Luca and I just discussed this and decided that we should upgrade thorium to stretch anyway, and think about moving sites elsewhere later.

elukey removed elukey as the assignee of this task. Jul 4 2018, 1:09 PM

FYI, this is scheduled to happen tomorrow Wed Sept 5 at about 13:30 UTC

Change 458174 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Temporarily removing thorium from netboot.cfg

https://gerrit.wikimedia.org/r/458174

Change 458174 merged by Ottomata:
[operations/puppet@production] Temporarily removing thorium from netboot.cfg

https://gerrit.wikimedia.org/r/458174

Mentioned in SAL (#wikimedia-analytics) [2018-09-05T13:40:28Z] <ottomata> reimaging thorium to debian stretch (this will cause an announced {stats,analytics}.wm.org downtime!) - T192641

Mentioned in SAL (#wikimedia-operations) [2018-09-05T13:40:35Z] <ottomata> reimaging thorium to debian stretch (this will cause an announced {stats,analytics}.wm.org downtime!) - T192641

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

thorium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201809051340_otto_10146_thorium_eqiad_wmnet.log.

Change 458176 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use stretch for thorium

https://gerrit.wikimedia.org/r/458176

Change 458176 merged by Ottomata:
[operations/puppet@production] Use stretch for thorium

https://gerrit.wikimedia.org/r/458176

Completed auto-reimage of hosts:

['thorium.eqiad.wmnet']

Of which those FAILED:

['thorium.eqiad.wmnet']

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

thorium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201809051351_otto_12066_thorium_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['thorium.eqiad.wmnet']

Of which those FAILED:

['thorium.eqiad.wmnet']

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

thorium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201809051352_otto_12248_thorium_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['thorium.eqiad.wmnet']

and were ALL successful.

Ottomata set the point value for this task to 3.

There is cronspam from: Cron <root@stat1006> /usr/local/bin/published-datasets-sync -q

rsync: stat "/published-datasets-rsynced/stat1006/archive/public-datasets/all/cross_wiki/.editor_month.tsv.gz.gkVHiO" (in srv) failed: No such file or directory (2)

and earlier it included:

rsync: failed to connect to thorium.eqiad.wmnet (10.64.53.26): No route to host (113)

That made me search for thorium in Phab and end up here.

I think that was failing during the reinstall; it looks fine now.

Yep, looks like it stopped. No more mails so far. Thanks!

Cron <root@thorium> /usr/local/bin/hardsync -t /srv /srv/published-datasets-rsynced/* /srv/analytics.wikimedia.org/datasets 2>&1 > /dev/null
From Cron Daemon root@thorium.eqiad.wmnet via wikimedia.org, 6:30 PM, to root:
cp: cannot stat '/srv/published-datasets-rsynced/stat1006/periodic/reports/metrics/page-creation/pagecreations_main_bots/.mrwiktionary.tsv.b028qZ': No such file or directory
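
A possible reading of both cron errors: the missing paths (`.editor_month.tsv.gz.gkVHiO`, `.mrwiktionary.tsv.b028qZ`) look like the temporary files rsync writes by default while a transfer is in progress (a leading dot, the final file name, then a random six-character suffix). Those files get renamed as soon as the transfer finishes, so a job that lists the directory and copies a moment later can find them gone. Below is a minimal Python sketch of a hard-link sync that skips such names and tolerates files that vanish. It is not the real `hardsync` script; the names `is_rsync_temp` and `hardlink_tree` are hypothetical, and the pattern is only a heuristic.

```python
import os
import re

# Heuristic for rsync's default temp names: ".<final name>.<6 random chars>".
# Assumption: a genuine dotfile with a six-character alphanumeric extension
# would also match, so this is a sketch rather than a precise filter.
RSYNC_TEMP = re.compile(r"^\..+\.[A-Za-z0-9]{6}$")


def is_rsync_temp(name):
    """Return True if a file name looks like an in-progress rsync temp file."""
    return RSYNC_TEMP.match(name) is not None


def hardlink_tree(src, dst):
    """Hard-link regular files from src into dst (same filesystem required).

    Skips rsync-style temp files and files that disappear mid-run.
    Returns the sorted list of relative paths now linked in dst.
    """
    linked = []
    for root, _dirs, files in os.walk(src):
        rel_root = os.path.relpath(root, src)
        target_dir = os.path.normpath(os.path.join(dst, rel_root))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            if is_rsync_temp(name):
                continue
            source = os.path.join(root, name)
            target = os.path.join(target_dir, name)
            rel_path = os.path.normpath(os.path.join(rel_root, name))
            try:
                if os.path.lexists(target):
                    if os.path.samefile(source, target):
                        linked.append(rel_path)  # already linked
                        continue
                    os.remove(target)  # stale copy, replace it
                os.link(source, target)
            except FileNotFoundError:
                # Source vanished between listing and linking; skip it.
                continue
            linked.append(rel_path)
    return sorted(linked)
```

Skipping by name avoids most of the noise, and catching `FileNotFoundError` covers the remaining race; either alone would probably quiet the cron mail.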