@elukey, this one could be an easy intro to using the mgmt console and doing a PXE boot install. There would have to be a little bit of communication with folks on the Analytics list, and a downtime announcement, but the actual reinstall should be fairly easy.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Dzahn | T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production
Resolved | | elukey | T76348 Upgrade stat1001 to Debian Jessie
Resolved | | elukey | T132476 connect an external harddisk with >2TB space to stat1001
Resolved | | Nuria | T133736 Out of service banner in dashiki
Event Timeline
Could we raise the prio slightly to normal? Are there other services here besides Apache?
@Dzahn: agreed! Moved the ticket to our "Next up" column and assigned it to me; I believe it's something I should have done at the beginning as "training" :)
I double checked, and the only thing that worries me a bit is all the /home directories with stuff in them, which will probably need to be backed up somewhere before proceeding.
I am going to talk with @Ottomata and then update this task. ETA should probably be the end of this week!
That is exactly what I wanted to say... many users, and it's just about the /home dirs and copying them back.
There are basically 2 options: either Bacula backup, which would mean adding "include role::backup::host" and "backup::set {'home': }" and then using bconsole on helium to restore from the backup, or setting up rsyncd on another host (not hard, since there are puppetized examples from when we did it before) and then using rsync (not over ssh) to copy files over. Just scp or rsync over ssh won't work because we disable agent forwarding.
I'll upload a change to Gerrit for the second option (rough sketch below), just have to figure out where we can put the files temporarily.
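To illustrate the second option, a minimal sketch of the copy from the source host, assuming an rsyncd module named "home" on a placeholder target host (the real module names and host come from the puppetized rsyncd config):
rsync -avp /home/ rsync://TARGET-HOST.eqiad.wmnet/home/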
@Dzahn another solution (more a hack) is to open a port / disable ferm for a bit (/me hides from @MoritzMuehlenhoff) and use netcat/openssl to copy files between hosts.
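Purely as a sketch of that hack (the port and host are made up, and it was discouraged below anyway):
on the receiving host: nc -l 9999 > home.tar.gz
on stat1001: tar czf - /home | nc RECEIVING-HOST 9999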
Change 282963 had a related patch set uploaded (by Dzahn):
rsync home directories for stat1001 upgrade
Ok, there are two places I know of that have important files on stat1001.
- The files served at datasets.wikimedia.org/. These are stored in three places: one is in /var/www and the other two are symlinked from /var/www. Those files should probably be backed up in case anything happens in the upgrade.
- The files that are served at stats.wikimedia.org (wikistats). These are in /a/stats.wikimedia.org. There may be other supporting setup around hosting and updating wikistats that I'm not aware of. We should definitely wait to hear from @ezachte before risking losing un-puppetized files.
Change 282967 had a related patch set uploaded (by Dzahn):
don't use fluorine for rsyncd, conflicts
Yeah, let's not get into the habit of disabling puppet and ferm though. It opens up all other services on the target host, live hacks tend to be forgotten, disabled puppet causes Icinga notifications, and so on. Let's set up rsyncd and ferm rules like this:
https://gerrit.wikimedia.org/r/#/c/282963/
I wanted to use fluorine, but that already has an rsyncd from another role (udplog), which causes conflicts.
The problem is disk space though. wikistats is huge..
- The files served at datasets.wikimedia.org/. These are stored in three places: one is in /var/www and the other two are symlinked from /var/www. Those files should probably be backed up in case anything happens in the upgrade.
Ok, so /var/www and the other 2 are /srv/aggregate-datasets and /srv/public-datasets. Would be no problem to add; it made me think we should just back up the entire /srv/ since there is more stuff in there.
- The files that are served as stats.wikimedia.org (wikistats). These are in /a/stats.wikimedia.org.
It was a bit confusing at first, but it turns out /a is a link to /srv, so /a/ and /srv/ are the same thing. So yeah, the entire /srv should do it for sure... BUT the sheer size of it... check this out:
19G   aggregate-datasets
0     datasets.wikimedia.org
2.4G  geowiki
3.5G  org.wikimedia.community-analytics
231G  public-datasets
450M  redis
49G   stats.wikimedia.org
2.0T  wikistats
Those 2T are kind of a problem to copy ...
There may be other supporting setup around hosting and updating wikistats that I'm not aware of. We should definitely wait to hear from @ezachte before risking losing un-puppetized files.
And if the 2T include un-puppetized things, yes, we definitely need to wait.
Maybe we have to get somebody local in the DC to connect an external harddisk?
Change 282974 had a related patch set uploaded (by Dzahn):
stat1001: setup to rsync /srv/ and /var/www
I'd say we should wait for @ezachte to figure out what we should do with those 2T. Pinging.
I can't log in right now to check.
The vast majority of that 2TB will be backups, which I thin out every half year or so.
All html files in htdocs should be copies of files generated on stat1002.
There may be a few hand crafted/updated files though.
Pls back up /srv/stats.wikimedia.org/htdocs/cgi-bin
If that is mostly HTML and text, let me try to compress it; we should achieve a high compression ratio and maybe it's not so bad then.
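Something along these lines, just as a sketch (the destination path is illustrative, and the exact directories to include would still need to be decided):
tar czf /srv/wikistats-backup.tar.gz -C /srv wikistats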
2.0T wikistats
Ja, this is why we don't backup! Too big!
stat1001 is in the analytics cluster...so technically we could use HDFS as a holding pen. Might take a looong time to copy though.
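If we went that route it would be something like the following (the HDFS path is made up, and it assumes a host with an HDFS client and enough quota):
hdfs dfs -mkdir -p /tmp/stat1001-backup
hdfs dfs -put /srv/wikistats /tmp/stat1001-backup/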
Change 284225 had a related patch set uploaded (by Dzahn):
statistics: rsync on stat1004 for stat1001 migration
Change 284235 had a related patch set uploaded (by Dzahn):
stats: adjust rsyncd pathes to use petabyte mount
We now have an rsyncd running on stat1004, ready to accept data from stat1001; it will be in /srv/stat1001/. There are 3 modules: one for home, one for srv, and one for var/www.
I looked at the backups on stat1001. I need to tidy things up. Some backups occur too often and have a lot of garbage in them. Apologies for the overhead this incurred.
I started to thin out the backups and will use --exclude on rsync more often.
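For example, applied to the stat1004 copy it would look something like this (the exclude patterns are illustrative, not the actual backup layout):
rsync -avp --exclude='*backup*' /srv rsync://stat1004.eqiad.wmnet:/srv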
Started the rsync for /srv with (thanks @Dzahn!):
rsync -avp /srv rsync://stat1004.eqiad.wmnet:/srv
That is still ongoing. After that I'll also back up /home and /var/www. When the data is backed up, we'll schedule downtime and upgrade the host.
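The /home and /var/www copies should then look roughly like this (the module names are an assumption based on the three modules mentioned above; the exact names are in the rsyncd config on stat1004):
rsync -avp /home rsync://stat1004.eqiad.wmnet:/home
rsync -avp /var/www rsync://stat1004.eqiad.wmnet:/var-www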
Change 284451 had a related patch set uploaded (by Elukey):
Add Debian Jessie PXE boot option to stat1001 as preparation step for the reimage.
Change 284451 merged by Elukey:
Add Debian Jessie PXE boot option to stat1001 as preparation step for the reimage.
Change 284878 had a related patch set uploaded (by Elukey):
Add a simple class to the statistics module to allow basic apache maintenace.
Summary:
- we decided to have a maintenance page hosted somewhere to avoid connection timeouts during the reimage. I created https://gerrit.wikimedia.org/r/284878 and am currently wondering if bohrium (running under cache::misc too) would be a good candidate. Also talked with @mforns about creating a basic webpage, since HTML/CSS and I are not friends :)
- rsync to stat1004 worked fine
- announcement of the downtime not yet sent; candidate date is next Tuesday
Change 284910 had a related patch set uploaded (by Mforns):
Add maintenance page for oncoming outage
So after reviewing the change with @Ottomata we realized that it would be much easier to add a rule to cache::misc's VCL to enable/disable a backend with a flag (like maintenance => true). I'll work on it on Tuesday and schedule downtime for Thursday.
Change 285363 had a related patch set uploaded (by Elukey):
Modify the default Varnish error page to increase visibility of error messages.
Change 285364 had a related patch set uploaded (by Elukey):
Add a maintenance flag to cache::misc directors.
Change 285363 merged by Elukey:
Modify the default Varnish error page to increase visibility of error messages.
Change 284878 abandoned by Elukey:
Allow basic apache maintenace webpages for the statistics::web role.
Change 285976 had a related patch set uploaded (by Elukey):
Return a custom HTTP 503 response for all the stat1001 websites due to maintenance.
Change 285976 merged by Elukey:
Return a custom HTTP 503 response for all the stat1001 websites due to maintenance.
The websites are still working since we had a problem with the VCL change; the following patch should fix the issue:
Change 286461 had a related patch set uploaded (by Elukey):
Update httpd access directive to only use Require instead of old Order/Allow/Deny.
Change 286461 merged by Elukey:
Update stat1001's httpd access directive to only use Require instead of old Order/Allow/Deny.
Change 286466 had a related patch set uploaded (by Elukey):
Remove NameVirtualHost directive from datasets.wikimedia.org config (2.4 upgrade).
Change 286466 merged by Elukey:
Remove NameVirtualHost directive from datasets.wikimedia.org config (2.4 upgrade).
This task is set to done, but my home dir hasn't been restored yet, and https://stats.wikimedia.org/cgi-bin/search_portal.pl?search=views produces an access error.
Change 286681 had a related patch set uploaded (by Elukey):
Add Apache mod-cgi to stat1001 configuration to allow Perl CGI scripts.
Update: thanks to @Ottomata I was able to correctly restore the home directories, so now everything should be fine.
I am in the process of adding mod-cgi to stat1001's Apache config via https://gerrit.wikimedia.org/r/286681; the missing module is probably why the Perl scripts are not working anymore.
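As a side note, on Debian the module can also be enabled by hand for a quick test (in production it is of course managed through puppet, and with a threaded MPM the module would be cgid rather than cgi):
sudo a2enmod cgi
sudo service apache2 restart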
Change 286681 merged by Elukey:
Add Apache mod-cgi to stat1001 configuration to allow Perl CGI scripts.
Change 286696 had a related patch set uploaded (by Elukey):
Fix the mod-cgi configuration for stats.wikimedia.org after Apache 2.4 upgrade.
Change 286696 merged by Elukey:
Fix the mod-cgi configuration for stats.wikimedia.org after Apache 2.4 upgrade.
And https://stats.wikimedia.org/cgi-bin/search_portal.pl?search=views now works fine!
@ezachte: would you mind doing another sanity check to verify that everything is working correctly? Thanks again for the report!
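For what it's worth, a quick way to spot check that URL from the CLI (just an example check, expecting a 200):
curl -sI 'https://stats.wikimedia.org/cgi-bin/search_portal.pl?search=views' | head -1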