
Copy labmon data to new SSDs
Closed, Resolved · Public

Description

This task tracks copying the graphite data from labmon's current old disks to the new SSDs once they arrive (tracked in T137724).

The process should be:

  • Notify the labs-l mailing list and set the topic on the Cloud-Services IRC channel
  • Stop statsite/graphite/carbon-relay on labmon so nothing writes to the graphite data. Verify that all of them are stopped.
  • Rsync the data from /srv/carbon/whisper to the external HDD connected to the machine
  • Install SSDs - T138415
  • Re-install the operating system (Jessie) and re-provision the machine (with the same role *and* hostname)
  • Stop the new statsite/graphite/carbon-relay on labmon so we don't start accepting new data yet
  • Clean out the new data accumulated in the short time between puppet starting the services and us stopping them
  • Rsync the data back
  • Make sure the data is owned appropriately by user _graphite and group carbon-c-relay
  • Start the services back up
  • Verify new data is coming in
  • Deal with the issue of graphite working at first but now not displaying content (no error page, just a file download prompt)
  • PARTY!
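
The copy-out, copy-back and ownership steps above can be sketched as a dry-run plan that only builds the commands and never executes them. This is a minimal illustration, not the procedure that was actually scripted: the backup path is the one that appears in the rsync commands later in this task, the rsync flags are the ones used there, and the helper names (`with_trailing_slash`, `rsync_cmd`, `chown_cmd`, `migration_plan`) are hypothetical.

```python
import shlex

WHISPER_DIR = "/srv/carbon/whisper"          # data directory named in the task description
BACKUP_DIR = "/media/backup/carbon/whisper"  # external disk path seen in the rsync commands in this task


def with_trailing_slash(path):
    """rsync copies the *contents* of a source directory only when it ends in '/'."""
    return path if path.endswith("/") else path + "/"


def rsync_cmd(src, dst):
    """Build (but do not run) an rsync command with the flags used in this task."""
    return ["rsync", "-ah", "--append-verify", "--info=progress2",
            with_trailing_slash(src), with_trailing_slash(dst)]


def chown_cmd(path, user="_graphite", group="carbon-c-relay"):
    """Build a recursive chown for the owner and group named in the description."""
    return ["chown", "-R", f"{user}:{group}", path]


def migration_plan():
    """Ordered (label, command) pairs for the data-moving steps only."""
    return [
        ("back up to external disk", rsync_cmd(WHISPER_DIR, BACKUP_DIR)),
        ("restore to new SSDs", rsync_cmd(BACKUP_DIR, WHISPER_DIR)),
        ("fix ownership", chown_cmd(WHISPER_DIR)),
    ]


if __name__ == "__main__":
    for label, cmd in migration_plan():
        print(f"{label}: {shlex.join(cmd)}")
```

Printing the plan first makes it easy to review the direction of each copy before anything touches the disks. Stopping and starting the services is deliberately left out, since the exact unit names are not given here.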

Event Timeline

Restricted Application removed a project: Patch-For-Review.Jun 15 2016, 8:40 PM
Restricted Application added a subscriber: Zppix.
RobH added a comment.Jun 15 2016, 9:13 PM

Since this is likely to happen when others are at wikimania, can you list the exact data directories that require copy?

RobH mentioned this in Unknown Object (Task).Jun 16 2016, 4:27 PM
RobH added a subtask: Unknown Object (Task).
RobH added a comment.Jun 22 2016, 3:29 PM

The SSDs for this arrived today. I've emailed the labs-l list and updated the topic in Cloud-Services to reflect the planned work:

Labs users,
As many of you may recall, (mostly) Yuvi and (slightly) myself worked on labmon1001 a couple of weeks back. Unfortunately, the work performed wasn't enough (as it still left the host with spinning disks) and the graphite service hammering them bogs down the server. As a solution, we will be reinstalling the system on 2016-06-23 from 15:00 GMT to 23:00 GMT. I don't expect to use the majority of this window, since much of the data was already backed up previously. However, we may need all of it if we have to re-sync all the data (without a differential copy).
Details of this work can be viewed on https://phabricator.wikimedia.org/T137924.
Once the SSDs are installed, their ability to handle the iops generated from graphite is expected to bring load levels on labmon1001 down to sane levels.
Please let me know if there are any questions or concerns. They can be raised via this email thread, or by comment on the linked phabricator task.
Thanks,

RobH claimed this task.Jun 22 2016, 3:30 PM

Stealing this task for implementation tomorrow.

RobH updated the task description. (Show Details)Jun 22 2016, 5:09 PM
RobH added a subscriber: Cmjohnson.EditedJun 23 2016, 6:08 PM

So I went ahead and got the files copied back to USB. There was a huge list of differences, but in the screen session I wasn't able to scroll up to copy/paste it.

The secondary run to pick up changes and output is as follows:

root@labmon1001:~# rsync -vah --append-verify --info=progress2 /srv/carbon/whisper/ /media/backup/carbon/whisper/
sending incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, ir-chk=1000/2746170)
ores/ores-web-05/datasources_extracted/ruwiki/damaging/0/1/1/11/
              0   0%    0.00kB/s    0:00:00 (xfr#0, ir-chk=1000/2783286)
ores/ores-web-05/scores_request/wikidatawiki/reverted/0/1/0/22/
file has vanished: "/srv/carbon/whisper/ores/ores-web-05/scores_request/wikidatawiki/reverted/0/1/0/22/.p99.wsp.77QwJ2"
        262.14K   0%  250.00MB/s    0:00:00 (xfr#1, to-chk=0/2971383)   

sent 79.14M bytes  received 474.45K bytes  776.75K bytes/sec
total size is 951.56G  speedup is 11,951.64
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1183) [sender=3.1.1]
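
The run above ends with exit code 24, which the rsync man page documents as a partial transfer due to vanished source files. A wrapper script might want to treat that case differently from a hard failure. The following is a minimal sketch of that idea; the function name and the three-way classification are assumptions for illustration, and whether a vanished file is acceptable is a judgment call for the operator.

```python
RSYNC_OK = 0
RSYNC_VANISHED = 24  # "Partial transfer due to vanished source files" per the rsync man page


def classify_rsync_exit(code):
    """Map an rsync exit code to 'ok', 'warning' or 'error' (hypothetical policy)."""
    if code == RSYNC_OK:
        return "ok"
    if code == RSYNC_VANISHED:
        # Files disappeared between listing and transfer, as in the run above.
        return "warning"
    return "error"


if __name__ == "__main__":
    for code in (0, 24, 23):
        print(code, classify_rsync_exit(code))
```

A caller could re-run the sync on "warning" to pick up whatever changed, and stop on "error".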

So now I'll shut down this machine and have @Cmjohnson put in the SSDs, while putting the old HDDs aside for now. Please keep the old disks numbered, in case we need to fall back to their data.

RobH updated the task description. (Show Details)Jun 23 2016, 6:58 PM
RobH added a comment.EditedJun 23 2016, 7:37 PM

System's been reinstalled and data restoration is in progress:

40.21G  99%   31.87MB/s    0:20:03 (xfr#94363, ir-chk=1011/111912)

298G total, and 40GB in 20 minutes

893 / 40 = 22.325. 22.325 * 20 (minutes per 40GB amount) = 446.5 minutes or 7.4 hours.

My math is all very rough estimates. rsync's copy can go faster or slower depending on file size and rsync speeds vary slightly over time.
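
The rough estimate above is a linear extrapolation from the observed rate. A small sketch of the same arithmetic, using the 893 GB, 40 GB and 20 minute figures from the calculation; the function names are hypothetical, and a constant transfer rate is assumed, which rsync does not guarantee.

```python
def estimate_total_minutes(total_gb, copied_gb, elapsed_min):
    """Total duration if the whole copy proceeds at the rate observed so far."""
    return total_gb * elapsed_min / copied_gb


def estimate_remaining_minutes(total_gb, copied_gb, elapsed_min):
    """Time still needed for the data that has not been copied yet."""
    return (total_gb - copied_gb) * elapsed_min / copied_gb


if __name__ == "__main__":
    total = estimate_total_minutes(893, 40, 20)
    remaining = estimate_remaining_minutes(893, 40, 20)
    print(f"total: {total} min (~{total / 60:.1f} h)")
    print(f"remaining: {remaining} min (~{remaining / 60:.1f} h)")
```

The total matches the 446.5 minutes (about 7.4 hours) worked out by hand. The remaining time is slightly lower because the first 40 GB is already done.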

RobH updated the task description. (Show Details)Jun 23 2016, 7:38 PM
RobH added a comment.Jun 23 2016, 9:51 PM

There is a root-owned screen session running the command:

rsync -ah --append-verify --info=progress2 /media/backup/carbon/whisper/ /srv/carbon/whisper/

This looks as if it will take roughly until the end of my day. I'm going to drop a note to the labs list to alert them of the downtime.

Unfortunately, I don't see a way to speed this up, since we are restoring the data to the live system. I've placed the rsync in the screen session so if anyone else wanted to expedite this by working on it in the Friday AM before I start working, they can feel free.

RobH added a comment.Jun 23 2016, 10:00 PM

Sent to labs list:

Please note that the expected downtime for labmon1001 has been extended to 2016-06-24 @ 17:00 GMT. Details are noted on https://phabricator.wikimedia.org/T137924.
Basically I'm planning to watch it migrate data the rest of this evening (it's been at it all day). If it completes before I go to bed, I'll bring its services back online so we collect more data, and it becomes available.
The labmon1001 system has been reimaged, and the 893GB of data is being restored to the new SSDs on the system. During the restoration process, puppet and all services are halted on labmon1001, so it won't start attempting to write new data during the restoration.
Apologies for the extended downtime!

RobH updated the task description. (Show Details)Jun 24 2016, 3:59 AM
RobH added a comment.EditedJun 24 2016, 4:02 AM

All data has been restored from USB disk and services resumed. However, there are errors.

Syslog is being spammed with "Failed value conversion" messages (but I also saw those before the migration).

Permissions have been set, but graphite is also failing (internal server error on Apache). The main thing is that the data is copied back over; however, I'm not certain whether new data is being received.

I'm going to keep poking at this, but no further updates on this task means I haven't gotten any traction and will have to return to it in the AM.

The downtime was extended through then so no further announcements need to be made at this time.

RobH updated the task description. (Show Details)Jun 24 2016, 4:02 AM

So I looked at uwsgi logs and it looked like a possible race between uwsgi starting and graphite being fully installed. It looks ok now - I just restarted it! \o/

I'm going to throw more data at it now and see how it goes. I can verify that old data is still there, and new data is coming in now...

RobH removed RobH as the assignee of this task.Jun 24 2016, 7:47 PM
RobH updated the task description. (Show Details)
RobH added a comment.Jun 24 2016, 7:51 PM

Updating from ongoing work and chat with @yuvipanda and @chasemp

I've updated the task description to reflect that since Yuvi fixed this server in his AM, it is now experiencing a new issue.

When loading http://graphite.wmflabs.org or https://graphite.wmflabs.org the browser prompts for a file download.

All file permissions appear correct and when Yuvi restarted the system services, graphite was working. Data appears to be streaming into the system and being written, but at this time the actual service portal for graphite on labmon1001 is still down.

The labs team is aware of the issue. I'm removing myself as primary assignee of this issue, since the hardware issues appear to have been resolved.

FYI: We are holding onto the old SATA disks with data intact until we're certain we have the new system fully operational and we can wipe those disks. Please advise when those conditions are met. Thanks!

So I can make things work by adding:

DocumentRoot "/usr/share/graphite-web"

which I think is appropriate. I'm confused why it previously worked without that. This may be the only Jessie graphite host, so possibly that's related. For now I'm going to leave puppet disabled with the change in place until I can sync up with Yuvi on Monday, but at least it's working.

So I can make things work by adding:
DocumentRoot "/usr/share/graphite-web"
which I think is appropriate. I'm confused why it previously worked without that.

Perhaps there was previously a symlink from wherever the default docroot is to the proper one?

can you poke about and drop some notes here?

current state https://phabricator.wikimedia.org/T137924#2406092

So I re-enabled puppet and that killed the line @chasemp added, and that exhibits the following behavior:

  1. https://graphite.wmflabs.org/ works (over HTTPS) on a browser
  2. http://graphite.wmflabs.org/ does not (over HTTP) on a browser - downloads a strange file?
  3. *both* work on curl?!

This is consistent with behavior *I* was seeing without the hack - can someone else confirm?

OK, so that was on Firefox. On Chrome, HTTPS doesn't work for me (there's an 'ERR_INVALID_RESPONSE' error), while the HTTP hit downloads the exact same strange file as on Firefox.

@yuvipanda neither 1 nor 2 works for me; Chrome tries to download a file called 'download.gz'. Both work with curl.

I'm on Iceweasel 38.5, what are you on, @tom29739? and can you upload this download.gz somewhere?

So the 'strange file' is gzip encoded, and I can repro this in curl with:

curl -H 'Accept-Encoding: gzip' https://graphite.wmflabs.org

No idea why setting DocumentRoot fixes this...

tom29739 added a comment.EditedJun 27 2016, 5:58 PM

@yuvipanda Chrome Version 51.0.2704.103 m (32 bit) on Windows 7. download.gz is 707 bytes from the https:// version. WinRAR can open it, there's a file named 'download' inside which is 677 bytes.

the download file inside the download.gz file contains this when I open it with Notepad++:
HTTP/1.1 200 OK
Vary: Cookie, Accept-Encoding
Content-Length: 542
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip

[binary gzip-compressed data, unreadable as text]

Sorry, I couldn't find anywhere to upload it. I'd presume that's the page content gzipped.
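
Going by the description above, the file inside download.gz is itself a complete HTTP response whose headers declare `Content-Encoding: gzip`, which suggests the page was wrapped twice. The sketch below shows how such a payload could be unwrapped for inspection. It runs on synthetic data built by `make_sample`, not on the real download.gz, and it assumes CRLF line endings and a UTF-8 body; all function names are hypothetical.

```python
import gzip


def split_http_response(raw):
    """Split raw response bytes into (status line, lowercase-keyed headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    status = lines[0].decode("latin-1")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
    return status, headers, body


def unwrap(download_gz):
    """Undo the outer gzip layer, then the inner one if the embedded headers declare it."""
    inner = gzip.decompress(download_gz)
    status, headers, body = split_http_response(inner)
    if headers.get("content-encoding") == "gzip":
        body = gzip.decompress(body)
    return status, body.decode("utf-8")


def make_sample(page=b"<html>graphite</html>"):
    """Build a synthetic doubly wrapped payload shaped like the one described above."""
    inner_body = gzip.compress(page)
    inner = (b"HTTP/1.1 200 OK\r\n"
             b"Content-Type: text/html; charset=utf-8\r\n"
             b"Content-Encoding: gzip\r\n"
             b"Content-Length: " + str(len(inner_body)).encode() + b"\r\n\r\n"
             + inner_body)
    return gzip.compress(inner)


if __name__ == "__main__":
    print(unwrap(make_sample()))
```

If the real file unwraps the same way, that would be consistent with the response being compressed once by the application and once more on the way out, though this task does not establish where the second layer comes from.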

I've re-enabled setting DocumentRoot, which does fix this. I'll disable puppet again and re-investigate later.

It looks like it's spewing out incorrect gzipped data, which the browser can't decode so it downloads it as a file instead.
Maybe the default DocumentRoot is in the wrong place so that the webserver is selecting the wrong file for the user's browser to interpret? That's the only plausible explanation that I can think of where setting the DocumentRoot would fix it.

Change 296823 had a related patch set uploaded (by Yuvipanda):
graphite: Add DocumentRoot explicitly

https://gerrit.wikimedia.org/r/296823

Change 296823 merged by Yuvipanda:
graphite: Add DocumentRoot explicitly

https://gerrit.wikimedia.org/r/296823

yuvipanda closed this task as Resolved.Jun 30 2016, 8:37 PM

I just puppetized the DocumentRoot and am going to call this 'done', since the other option is a rabbit hole I do not have the time to dive into...

yuvipanda closed subtask Unknown Object (Task) as Resolved.Jun 30 2016, 8:38 PM
yuvipanda reopened subtask Unknown Object (Task) as Open.
MoritzMuehlenhoff reopened this task as Open.Aug 5 2016, 7:03 AM

The temp host, which was added in https://phabricator.wikimedia.org/rOPUP93cef36f9ed6e2ec1e8ec2f6d5d345a4343e1610 is still up and running:

setting WMF4724 install params
this is a temp host for labmon1001 data migration, it will be reclaimed to spares in less than 48 hours
RobH added a comment.Aug 8 2016, 5:16 PM

The temp host, which was added in https://phabricator.wikimedia.org/rOPUP93cef36f9ed6e2ec1e8ec2f6d5d345a4343e1610 is still up and running:

setting WMF4724 install params
this is a temp host for labmon1001 data migration, it will be reclaimed to spares in less than 48 hours

I can reclaim this host, as long as @yuvipanda confirms he doesn't need any data or this host anymore.

Can be reclaimed!

RobH added a comment.Aug 8 2016, 5:22 PM

Confirmed he doesn't need the data, so I'll create the task and take care of the remote steps to reclaim WMF4724.

yuvipanda removed yuvipanda as the assignee of this task.Aug 10 2016, 12:54 AM
RobH closed this task as Resolved.Aug 10 2016, 5:17 PM
RobH claimed this task.
RobH closed subtask Unknown Object (Task) as Resolved.Oct 12 2016, 5:46 PM