Upgrade stat1001 to Debian Jessie
Closed, Resolved · Public · 21 Story Points

Description

@elukey, this one could be an easy intro to using the mgmt console and doing a PXE boot install. It would need a little bit of communication with folks on the Analytics list, and a downtime announcement, but the actual reinstall should be fairly easy.

Event Timeline

Restricted Application added a subscriber: Matanya. · Jul 4 2015, 12:58 AM
Ottomata renamed this task from Upgrade stat1001 to Ubuntu Trusty to Upgrade stat1001 to Debian Jessie.Jul 31 2015, 2:06 PM
Restricted Application added a subscriber: Aklapper. · Jul 31 2015, 2:06 PM
Ottomata updated the task description. · Jan 15 2016, 10:12 PM
Ottomata added a subscriber: elukey.
Dzahn raised the priority of this task from Low to Normal.Apr 11 2016, 6:37 PM
Dzahn added a subscriber: Dzahn.

could we raise the prio slightly to normal? are there other services here besides Apache?

elukey claimed this task.Apr 12 2016, 7:14 AM
elukey edited projects, added Analytics-Kanban; removed Analytics.

@Dzahn: agreed! Moved the ticket to our "Next up" column and assigned to me, I believe it was something I should have done in the beginning as "training" :)

I double checked and the only thing that worries me a bit is all the /home directories with stuff in them, which will probably need to be backed up somewhere before proceeding.

I am going to talk with @Ottomata and then update this task. ETA should probably be the end of this week!

Dzahn added a comment.Apr 12 2016, 2:40 PM

the only thing that worries me a bit is all the /home directories with stuff in them, which will probably need to be backed up somewhere before proceeding.

That is exactly what i wanted to say.. many users and it's just about the /home dirs and copying them back.

There are basically 2 options. Either a Bacula backup, which would mean adding "include role::backup::host" and "backup::set {'home': }" and then using bconsole on helium to restore from the backup, or setting up rsyncd on another host (not hard, since there are puppetized examples from when we did it before) and then using rsync (not over ssh) to copy the files over. Plain scp or rsync over ssh won't work because we disable agent forwarding.

I'll upload a change to Gerrit for the second option, just have to figure out where we can put the files temporarily.

@Dzahn another solution (more a hack) is to open a port / disable ferm for a bit (/me hides from @MoritzMuehlenhoff) and use netcat/openssl to copy files between hosts.

Change 282963 had a related patch set uploaded (by Dzahn):
rsync home directories for stat1001 upgrade

https://gerrit.wikimedia.org/r/282963

Ok, there are two places I know have important files on stat1001.

  1. The files served at datasets.wikimedia.org/. These are stored in three places: one is in /var/www and the other two are symlinked from /var/www. Those files should probably be backed up in case anything happens in the upgrade.
  2. The files that are served as stats.wikimedia.org (wikistats). These are in /a/stats.wikimedia.org. There may be other supporting setup around hosting and updating wikistats that I'm not aware of. We should definitely wait to hear from @ezachte before risking losing un-puppetized files.

Change 282963 merged by Dzahn:
rsync home directories for stat1001 upgrade

https://gerrit.wikimedia.org/r/282963

Change 282967 had a related patch set uploaded (by Dzahn):
don't use fluorine for rsyncd, conflicts

https://gerrit.wikimedia.org/r/282967

Change 282967 merged by Dzahn:
don't use fluorine for rsyncd, conflicts

https://gerrit.wikimedia.org/r/282967

Dzahn added a comment.Apr 12 2016, 4:54 PM

@Dzahn another solution (more a hack) is to open a port / disable ferm for a bit (/me hides from @MoritzMuehlenhoff) and use netcat/openssl to copy files between hosts.

Yeah, let's not get into the habit of disabling puppet and ferm though. It opens up all other services on the target host, live hacks tend to be forgotten, disabled puppet will cause Icinga notifications, and so on. Let's set up rsyncd and ferm rules like this:

https://gerrit.wikimedia.org/r/#/c/282963/

I wanted to use fluorine, but that already has an rsyncd from another role, "udplog", which causes conflicts.

The problem is disk space though. wikistats is huge..

Dzahn added a comment.Apr 12 2016, 5:01 PM
  1. The files served at datasets.wikimedia.org/ These are stored in three places, one is in /var/www and the other two are symlinked from /var/www. Those files should probably be backed up in case anything happens in the upgrade.

Ok, so /var/www and the other 2 are /srv/aggregate-datasets and /srv/public-datasets. Adding them would be no problem, and it made me think we should just back up the entire /srv/ since there is more stuff in there.

  2. The files that are served as stats.wikimedia.org (wikistats). These are in /a/stats.wikimedia.org.

It was a bit confusing at first, but it turns out /a is a link to /srv, so /a/ and /srv/ are the same thing. So yea, the entire /srv should do it for sure... BUT the sheer size of it.. check this out:

19G	aggregate-datasets
0.	datasets.wikimedia.org
2.4G	geowiki
3.5G	org.wikimedia.community-analytics
231G	public-datasets
450M	redis
49G	stats.wikimedia.org
2.0T	wikistats

Those 2T are kind of a problem to copy ...
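
To put the numbers above in perspective, here is a rough sketch of the copy time. It sums the du figures (treating 2.0T as 2048G and 450M as roughly 0.45G) and divides by an assumed sustained throughput of 100 MiB/s. That rate is a placeholder for illustration, not a measurement from stat1001, and the helper name estimate_hours is made up for this example.

```shell
# estimate_hours SIZE_GIB RATE_MIBPS
# Prints the transfer time in hours for SIZE_GIB at RATE_MIBPS (MiB per second).
estimate_hours() {
  awk -v gib="$1" -v mibps="$2" \
    'BEGIN { printf "%.1f\n", gib * 1024 / mibps / 3600 }'
}

# Per-directory sizes in GiB, copied from the du listing above.
total_gib=$(printf '%s\n' 19 0 2.4 3.5 231 0.45 49 2048 |
  awk '{ sum += $1 } END { printf "%.2f\n", sum }')

echo "total: ${total_gib} GiB"
# 100 MiB/s is an ASSUMED rate; substitute a measured one.
echo "hours at 100 MiB/s: $(estimate_hours "$total_gib" 100)"
```

Under that assumption the whole of /srv comes to roughly 6.7 hours of transfer, almost all of it wikistats.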

There may be other supporting setup around hosting and updating wikistats that I'm not aware of. We should definitely wait to hear from @ezachte before risking losing un-puppetized files.

And if the 2T include unpuppetized things, yes, we definitely need to wait.

Maybe we have to get somebody local in the DC to connect an external hard disk?

Change 282974 had a related patch set uploaded (by Dzahn):
stat1001: setup to rsync /srv/ and /var/www

https://gerrit.wikimedia.org/r/282974

Change 282974 merged by Dzahn:
stat1001: setup to rsync /srv/ and /var/www

https://gerrit.wikimedia.org/r/282974

I'd say we should wait for @ezachte to figure out what we should do with those 2T. Pinging.

I can't login right now to check.
The vast majority of that 2TB will be backups, which I thin out every half year or so.
All html files in htdocs should be copies from generated files on stat1002.

There may be a few hand crafted/updated files though.
pls backup /srv/stats.wikimedia.org/htdocs/cgi-bin

Dzahn added a comment.Apr 13 2016, 8:46 PM

If that is mostly HTML and text, let me try to compress it. We should achieve a high compression ratio, and maybe it's not so bad then.
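
One way to check that expectation before committing to a full copy is to measure the ratio on a sample directory. The sketch below is only an illustration: compression_ratio is a hypothetical helper, gzip is an arbitrary choice of compressor, and the tree is streamed through tar so nothing is written to disk.

```shell
# compression_ratio DIR
# Prints "<raw_bytes> <gzip_bytes> <ratio>" for the tar stream of DIR.
compression_ratio() {
  dir=$1
  # Size of the uncompressed tar stream.
  raw=$(tar -cf - -C "$dir" . | wc -c)
  # Size of the same stream after gzip.
  packed=$(tar -cf - -C "$dir" . | gzip -c | wc -c)
  awk -v r="$raw" -v p="$packed" \
    'BEGIN { printf "%d %d %.1f\n", r, p, r / p }'
}

# Example (run it on a small subdirectory first, not on the full 2T):
# compression_ratio /srv/stats.wikimedia.org/htdocs/cgi-bin
```

A sample only hints at the overall ratio, since archives that are already compressed will not shrink further.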

2.0T wikistats

Ja, this is why we don't backup! Too big!

stat1001 is in the analytics cluster...so technically we could use HDFS as a holding pen. Might take a looong time to copy though.

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.Apr 18 2016, 3:19 PM

Change 284225 had a related patch set uploaded (by Dzahn):
statistics: rsync on stat1004 for stat1001 migration

https://gerrit.wikimedia.org/r/284225

Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptApr 19 2016, 4:48 PM

Change 284225 merged by Dzahn:
statistics: rsync on stat1004 for stat1001 migration

https://gerrit.wikimedia.org/r/284225

Change 284235 had a related patch set uploaded (by Dzahn):
stats: adjust rsyncd pathes to use petabyte mount

https://gerrit.wikimedia.org/r/284235

Nuria moved this task from In Progress to Done on the Analytics-Kanban board.Apr 19 2016, 5:22 PM
Nuria moved this task from Done to In Progress on the Analytics-Kanban board.

Change 284235 abandoned by Dzahn:
stats: adjust rsyncd pathes to use petabyte mount

https://gerrit.wikimedia.org/r/284235

Dzahn added a comment.Apr 19 2016, 5:47 PM

We now have an rsyncd running on stat1004, ready to accept data from stat1001. The data will land in /srv/stat1001/, and there are 3 modules: one for home, one for srv and one for var/www.

I looked at the backups on stat1001. I need to tidy things up. Some backups occur too often, and have a lot of garbage in them. Apologies for the overhead this incurred.

I started to thin out the backups and will use --exclude on rsync more often.

started the rsync for /srv with (thanks @Dzahn!):

rsync -avp /srv rsync://stat1004.eqiad.wmnet:/srv

that is still ongoing. After that I'll also back up /home and /var/www. Once the data is backed up, we'll schedule downtime and upgrade the host.
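
The --exclude option mentioned earlier can be tried out locally before touching a long-running copy. This is a minimal sketch, not the command used for the migration: sync_tree and the '*.bak' pattern are hypothetical, and it copies between two local directories rather than to the rsyncd on stat1004.

```shell
# sync_tree SRC DEST PATTERN
# Copies the contents of SRC into DEST, skipping paths that match PATTERN.
# -a (archive) already preserves permissions, so a separate -p is redundant.
sync_tree() {
  src=$1
  dest=$2
  pattern=$3
  rsync -a --exclude="$pattern" "$src"/ "$dest"/
}

# Example with a made-up pattern:
# sync_tree /srv/wikistats /tmp/wikistats-copy '*.bak'
```

Adding -n (--dry-run) together with -v to an rsync invocation lists what would be transferred without copying anything, which is a cheap way to check an exclude pattern.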

Change 284451 had a related patch set uploaded (by Elukey):
Add Debian Jessie PXE boot option to stat1001 as preparation step for the reimage.

https://gerrit.wikimedia.org/r/284451

Change 284451 merged by Elukey:
Add Debian Jessie PXE boot option to stat1001 as preparation step for the reimage.

https://gerrit.wikimedia.org/r/284451

All data is backed up in /srv/stat1001 on stat1004, so we should be ready to proceed.

Change 284878 had a related patch set uploaded (by Elukey):
Add a simple class to the statistics module to allow basic apache maintenace.

https://gerrit.wikimedia.org/r/284878

elukey added a subscriber: mforns.Apr 22 2016, 12:25 PM

Summary:

  1. we decided to have a maintenance page hosted somewhere to avoid connection timeouts during the reimage. I created https://gerrit.wikimedia.org/r/284878 and am currently wondering if bohrium (running under cache::misc too) would be a good candidate. Also talked with @mforns about creating a basic webpage since me and HTML/CSS are not friends :)
  2. rsync worked fine to stat1004
  3. announcement of the downtime not yet sent, candidate date is next Tuesday

Change 284910 had a related patch set uploaded (by Mforns):
Add maintenance page for oncoming outage

https://gerrit.wikimedia.org/r/284910

Change 284910 merged by Elukey:
Add maintenance page for oncoming outage

https://gerrit.wikimedia.org/r/284910

So after reviewing the change with @Ottomata we realized that it would be much easier to add a rule in VCL's cache::misc to enable/disable a backend with a flag (like maintenance => true). I'll work on it on Tuesday and schedule downtime for Thursday.

Change 285363 had a related patch set uploaded (by Elukey):
Modify the default Varnish error page to increase visibility of error messages.

https://gerrit.wikimedia.org/r/285363

Change 285364 had a related patch set uploaded (by Elukey):
Add a maintenance flag to cache::misc directors.

https://gerrit.wikimedia.org/r/285364

Change 285364 merged by Elukey:
Add a maintenance flag to cache::misc directors.

https://gerrit.wikimedia.org/r/285364

Change 285363 merged by Elukey:
Modify the default Varnish error page to increase visibility of error messages.

https://gerrit.wikimedia.org/r/285363

Change 284878 abandoned by Elukey:
Allow basic apache maintenace webpages for the statistics::web role.

https://gerrit.wikimedia.org/r/284878

Change 285976 had a related patch set uploaded (by Elukey):
Return a custom HTTP 503 response for all the stat1001 websites due to maintenance.

https://gerrit.wikimedia.org/r/285976

JAllemandou set the point value for this task to 21.Apr 28 2016, 4:31 PM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.Apr 29 2016, 1:02 PM
elukey moved this task from Done to Ready to Deploy on the Analytics-Kanban board.

Change 285976 merged by Elukey:
Return a custom HTTP 503 response for all the stat1001 websites due to maintenance.

https://gerrit.wikimedia.org/r/285976

elukey added a comment.May 2 2016, 1:07 PM

The websites are still working since we had a problem with the VCL change. The following patch should fix the issue:

https://gerrit.wikimedia.org/r/#/c/286419/

Change 286461 had a related patch set uploaded (by Elukey):
Update httpd access directive to only use Require instead of old Order/Allow/Deny.

https://gerrit.wikimedia.org/r/286461

Change 286461 merged by Elukey:
Update stat1001's httpd access directive to only use Require instead of old Order/Allow/Deny.

https://gerrit.wikimedia.org/r/286461

Change 286466 had a related patch set uploaded (by Elukey):
Remove NameVirtualHost directive from datasets.wikimedia.org config (2.4 upgrade).

https://gerrit.wikimedia.org/r/286466

Change 286466 merged by Elukey:
Remove NameVirtualHost directive from datasets.wikimedia.org config (2.4 upgrade).

https://gerrit.wikimedia.org/r/286466
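
The two changes above deal with configuration syntax that differs between Apache 2.2 and 2.4: Order/Allow/Deny access control is superseded by Require, and NameVirtualHost no longer has any effect. A minimal sketch for spotting such leftovers ahead of a reimage follows. The function name and the example path are hypothetical and not taken from the puppet repo.

```shell
# find_legacy_directives DIR
# Lists lines under DIR that still use 2.2-style directives.
# Apache directive names are case-insensitive, hence -i.
# Requiring whitespace or end of line after the name keeps
# AllowOverride from being reported as "Allow".
find_legacy_directives() {
  grep -rnEi \
    '^[[:space:]]*(Order|Allow|Deny|NameVirtualHost)([[:space:]]|$)' "$1"
}

# Example (hypothetical path):
# find_legacy_directives /etc/apache2/sites-available
```

grep exits non-zero when nothing matches, so an empty result with exit status 1 means no legacy directives were found.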

elukey moved this task from Ready to Deploy to Done on the Analytics-Kanban board.May 3 2016, 8:09 AM

This task is set to done, but my home dir hasn't been restored yet, and https://stats.wikimedia.org/cgi-bin/search_portal.pl?search=views produces an access error.

elukey added a comment.May 3 2016, 2:55 PM

Hi @ezachte, my bad, working on it now.

elukey moved this task from Done to In Progress on the Analytics-Kanban board.May 3 2016, 3:45 PM

Change 286681 had a related patch set uploaded (by Elukey):
Add Apache mod-cgi to stat1001 configuration to allow Perl CGI scripts.

https://gerrit.wikimedia.org/r/286681

elukey added a comment.May 3 2016, 5:00 PM

Update: thanks to @Ottomata I was able to restore the home directories correctly, so now everything should be fine.

I am in the process of adding mod-cgi to stat1001's apache config via https://gerrit.wikimedia.org/r/286681. Its absence is probably why the Perl scripts are not working anymore.

Change 286681 merged by Elukey:
Add Apache mod-cgi to stat1001 configuration to allow Perl CGI scripts.

https://gerrit.wikimedia.org/r/286681

Change 286696 had a related patch set uploaded (by Elukey):
Fix the mod-cgi configuration for stats.wikimedia.org after Apache 2.4 upgrade.

https://gerrit.wikimedia.org/r/286696

Change 286696 merged by Elukey:
Fix the mod-cgi configuration for stats.wikimedia.org after Apache 2.4 upgrade.

https://gerrit.wikimedia.org/r/286696

elukey added a comment.May 3 2016, 6:00 PM

And https://stats.wikimedia.org/cgi-bin/search_portal.pl?search=views now works fine!

@ezachte: would you mind doing another sanity check to verify that everything is working correctly? Thanks again for the report!

elukey moved this task from In Progress to Done on the Analytics-Kanban board.May 4 2016, 1:03 PM
elukey moved this task from Done to In Progress on the Analytics-Kanban board.

Issues fixed. Thanks, elukey, and others.

elukey moved this task from In Progress to Done on the Analytics-Kanban board.May 5 2016, 5:43 AM
Dzahn awarded a token.May 5 2016, 5:55 AM
elukey closed this task as Resolved.May 6 2016, 4:13 PM