⚓ T137924 Copy labmon data to new SSDs

	Subject	Repo	Branch	Lines +/-
	graphite: Add DocumentRoot explicitly	operations/puppet	production	+1 -1

yuvipanda created this task. Jun 15 2016, 8:40 PM

Restricted Application removed a project: Patch-For-Review. Jun 15 2016, 8:40 PM

Restricted Application added a subscriber: Zppix.

Since this is likely to happen when others are at wikimania, can you list the exact data directories that require copy?

yuvipanda updated the task description. Jun 15 2016, 9:17 PM

yuvipanda updated the task description. Jun 15 2016, 10:08 PM

RobH mentioned this in Unknown Object (Task). Jun 16 2016, 4:27 PM

RobH added a subtask: Unknown Object (Task).

The SSDs for this arrived today. I've emailed the labs-l list and updated the topic in Cloud-Services to reflect the planned work:

Labs users,

As many of you may recall, (mostly) Yuvi and (slightly) myself worked on labmon1001 a couple of weeks back. Unfortunately, the work performed wasn't enough (as it still left the host with spinning disks) and the graphite service hammering them bogs down the server. As a solution, we will be reinstalling the system on 2016-06-23 from 15:00 GMT to 23:00 GMT. I don't expect to use the majority of this window, since much of the data was backed up previously. However, it is a possibility if we have to re-sync all the data (without differential).

Details of this work can be viewed on https://phabricator.wikimedia.org/T137924.

Once the SSDs are installed, their ability to handle the iops generated from graphite is expected to bring load levels on labmon1001 down to sane levels.

Please let me know if there are any questions or concerns. They can be raised via this email thread, or by comment on the linked phabricator task.

Thanks,

Stealing this task for implementation tomorrow.

RobH mentioned this in T138415: eqiad: Install ssds to labmon1001. Jun 22 2016, 3:44 PM

Krenair added a subtask: T138415: eqiad: Install ssds to labmon1001. Jun 22 2016, 3:51 PM

RobH updated the task description. Jun 22 2016, 5:09 PM

RobH mentioned this in T137724: eqiad: 1 hardware access request for labs graphite. Jun 22 2016, 5:32 PM

RobH mentioned this in rOPUP93cef36f9ed6: setting WMF4724 install params. Jun 23 2016, 3:50 PM

So I went ahead and got the files copied back to USB. There was a huge list of differences, but in the screen session I wasn't able to scroll up to copy/paste.

The second run, to pick up changes, and its output are as follows:

root@labmon1001:~# rsync -vah --append-verify --info=progress2 /srv/carbon/whisper/ /media/backup/carbon/whisper/
sending incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, ir-chk=1000/2746170)
ores/ores-web-05/datasources_extracted/ruwiki/damaging/0/1/1/11/
              0   0%    0.00kB/s    0:00:00 (xfr#0, ir-chk=1000/2783286)
ores/ores-web-05/scores_request/wikidatawiki/reverted/0/1/0/22/
file has vanished: "/srv/carbon/whisper/ores/ores-web-05/scores_request/wikidatawiki/reverted/0/1/0/22/.p99.wsp.77QwJ2"
        262.14K   0%  250.00MB/s    0:00:00 (xfr#1, to-chk=0/2971383)   

sent 79.14M bytes  received 474.45K bytes  776.75K bytes/sec
total size is 951.56G  speedup is 11,951.64
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1183) [sender=3.1.1]
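
For anyone scripting this kind of copy: rsync exits with code 24 when source files vanish mid-transfer, as in the warning above. On a whisper tree that is still being written to, that is usually tolerable rather than a failed backup. A minimal sketch of a wrapper that treats it that way; the function names and the "ok-with-vanished-files" label are made up for illustration, not anything that was run on labmon1001:

```python
import subprocess

# rsync exit codes relevant here (from the rsync man page).
RSYNC_OK = 0
RSYNC_VANISHED = 24  # "some files vanished before they could be transferred"


def classify_rsync_exit(code: int) -> str:
    """Map an rsync exit code to a coarse verdict (labels are hypothetical)."""
    if code == RSYNC_OK:
        return "ok"
    if code == RSYNC_VANISHED:
        # Source files disappeared between the file-list scan and the copy.
        return "ok-with-vanished-files"
    return "failed"


def run_backup(src: str, dst: str) -> str:
    """Run rsync with the same flags as the run above and classify the result."""
    result = subprocess.run(
        ["rsync", "-ah", "--append-verify", "--info=progress2", src, dst]
    )
    return classify_rsync_exit(result.returncode)
```

Calling run_backup("/srv/carbon/whisper/", "/media/backup/carbon/whisper/") would mirror the command above; anything other than 0 or 24 should still be looked at by hand.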

So now I'll shut down this machine and have @Cmjohnson put in the SSDs, while putting the old HDDs aside for now. Please keep the old disks numbered, in case we need to fall back to their data.

RobH updated the task description. Jun 23 2016, 6:58 PM

System's been reinstalled and data restoration is in progress:

40.21G  99%   31.87MB/s    0:20:03 (xfr#94363, ir-chk=1011/111912)

893G total, and 40GB in 20 minutes

893 / 40 = 22.325. 22.325 * 20 (minutes per 40GB amount) = 446.5 minutes or 7.4 hours.

My math is all very rough estimates. rsync's copy can go faster or slower depending on file size and rsync speeds vary slightly over time.
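
The same back-of-the-envelope estimate as a small sketch, assuming a constant copy rate (which rsync on millions of small whisper files will not really hold). The function names are made up for illustration:

```python
def estimate_total_minutes(total_gb: float, copied_gb: float, elapsed_minutes: float) -> float:
    """Extrapolate total copy time from the progress so far, assuming a constant rate."""
    if copied_gb <= 0:
        raise ValueError("need some progress before extrapolating")
    return total_gb / copied_gb * elapsed_minutes


def estimate_remaining_minutes(total_gb: float, copied_gb: float, elapsed_minutes: float) -> float:
    """Same extrapolation, minus the time already spent."""
    return estimate_total_minutes(total_gb, copied_gb, elapsed_minutes) - elapsed_minutes


# Figures from the comment above: 893 GB total, 40 GB copied in 20 minutes.
total = estimate_total_minutes(893, 40, 20)
print(f"{total:.1f} minutes, about {total / 60:.1f} hours")
```

With the numbers from this comment it reproduces the 446.5 minute (roughly 7.4 hour) figure for the whole copy; the remaining time is 20 minutes less than that.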

RobH updated the task description. Jun 23 2016, 7:38 PM

There is a root owned screen session running with the command of:

rsync -ah --append-verify --info=progress2 /media/backup/carbon/whisper/ /srv/carbon/whisper/

This looks as if it will take roughly until the end of my day. I'm going to drop a note to the labs list to alert them of the downtime.

Unfortunately, I don't see a way to speed this up, since we are restoring the data to the live system. I've placed the rsync in the screen session so if anyone else wanted to expedite this by working on it in the Friday AM before I start working, they can feel free.

Sent to labs list:

Please note that the expected downtime for labmon1001 has been extended to 2016-06-24 @ 17:00 GMT. Details are noted on https://phabricator.wikimedia.org/T137924.

Basically I'm planning to watch it migrate data the rest of this evening (it's been at it all day). If it completes before I go to bed, I'll bring its services back online so we collect more data, and it becomes available.

The labmon1001 system has been reimaged, and the 893GB of data is being restored to the new SSDs on the system. During the restoration process, puppet and all services are halted on labmon1001, so it won't start attempting to write new data during the restoration.

Apologies for the extended downtime!

RobH updated the task description. Jun 24 2016, 3:59 AM

All data has been restored from USB disk and services resumed. However, there are errors.

syslog is spamming with Failed value conversion (but I also saw those before the migration)

Permissions have been set, but graphite is also failing (internal server error on apache). The main thing is the data is copied back over; however, I'm not certain if new data is being received.

I'm going to keep poking at this, but no further updates on this task means I haven't gotten any traction and will have to return to it in the AM.

The downtime was extended through then so no further announcements need to be made at this time.

RobH updated the task description. Jun 24 2016, 4:02 AM

So I looked at uwsgi logs and it looked like a possible race between uwsgi starting and graphite being fully installed. It looks ok now - I just restarted it! \o/

I'm going to throw more data at it now and see how it goes. I can verify that old data is still there, and new data is coming in now...

RobH removed RobH as the assignee of this task. Jun 24 2016, 7:47 PM

RobH updated the task description.

Updating from ongoing work and chat with @yuvipanda and @chasemp

I've updated the task description to reflect that since Yuvi fixed this server in his AM, it is now experiencing a new issue.

When loading http://graphite.wmflabs.org or https://graphite.wmflabs.org the browser prompts for a file download.

All file permissions appear correct and when Yuvi restarted the system services, graphite was working. Data appears to be streaming into the system and being written, but at this time the actual service portal for graphite on labmon1001 is still down.

The labs team is aware of the issue. I'm removing myself as primary assignee of this issue, since the hardware issues appear to have been resolved.

FYI: We are holding onto the old SATA disks with data intact until we're certain we have the new system fully operational and we can wipe those disks. Please advise when those conditions are met. Thanks!

So I can make things work by adding:

DocumentRoot "/usr/share/graphite-web"

which I think is appropriate. I'm confused why it previously worked without that. This may be the only jessie graphite host so possibly that's related. For now I'm going to leave puppet disabled w/ the change for some sync up with yuvi on monday, but at least it's working.

In T137924#2406092, @chasemp wrote:

So I can make things work by adding:

DocumentRoot "/usr/share/graphite-web"

which I think is appropriate. I'm confused why it previously worked without that.

Perhaps there was previously a symlink from wherever the default docroot is to the proper one?

can you poke about and drop some notes here?

current state https://phabricator.wikimedia.org/T137924#2406092

• Cmjohnson closed subtask T138415: eqiad: Install ssds to labmon1001 as Resolved. Jun 27 2016, 2:33 PM

So I re-enabled puppet and that killed the line @chasemp added, and that exhibits the following behavior:

1. https://graphite.wmflabs.org/ works (over HTTPS) in a browser
2. http://graphite.wmflabs.org/ does not (over HTTP) in a browser - downloads a strange file?
*both* work on curl?!

This is consistent with behavior *I* was seeing without the hack - can someone else confirm?

OK, so that was on Firefox, and on Chrome HTTPS doesn't work for me (there's an 'ERR_INVALID_RESPONSE' error) while the HTTP hit downloads the exact same strange file as on Firefox.

A render URL, like https://graphite.wmflabs.org/render/?width=588&height=310&_salt=1467049750.544&target=tools.tools-elastic-02.cpu.total.user works for me both on HTTPS and HTTP on both Chrome and Firefox.

@yuvipanda neither 1. nor 2. works for me, Chrome tries to download a file called 'download.gz'. Both work with curl.

I'm on Iceweasel 38.5, what are you on, @tom29739? and can you upload this download.gz somewhere?

So the 'strange file' is gzip encoded, and I can repro this in curl with:

curl -H 'Accept-Encoding: gzip' https://graphite.wmflabs.org

No idea why setting DocumentRoot fixes this...
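
For anyone who wants to script the same repro, here is a rough Python equivalent of that curl call, using only the standard library. urllib does not decompress responses on its own, so the body comes back exactly as the server sent it. The helper names are made up for illustration, and graphite.wmflabs.org may of course no longer answer:

```python
import urllib.request


def build_gzip_request(url: str) -> urllib.request.Request:
    """Build a GET request that advertises gzip support, like curl -H 'Accept-Encoding: gzip'."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def fetch_raw(url: str, timeout: float = 10.0):
    """Return (Content-Encoding header, raw undecoded body) for the URL."""
    with urllib.request.urlopen(build_gzip_request(url), timeout=timeout) as resp:
        return resp.headers.get("Content-Encoding"), resp.read()
```

fetch_raw("https://graphite.wmflabs.org") would then give the encoding header and the raw bytes to inspect by hand.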

@yuvipanda Chrome Version 51.0.2704.103 m (32 bit) on Windows 7. download.gz is 707 bytes from the https:// version. WinRAR can open it, there's a file named 'download' inside which is 677 bytes.

the download file inside the download.gz file contains this when I open it with Notepad++:
HTTP/1.1 200 OK
Vary: Cookie, Accept-Encoding
Content-Length: 542
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip

[gzip-compressed binary body follows; not printable as text]

I couldn't find a place to upload it anywhere sorry. I'd presume that's the page content gzipped.
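
Going by the description above, download.gz unpacks to something that itself starts with HTTP/1.1 200 OK and a Content-Encoding: gzip header, i.e. the gzip seems to wrap a whole HTTP response rather than just the page. A minimal sketch for checking a saved body for that pattern; the function name and the result labels are made up for illustration:

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def diagnose_body(body: bytes) -> str:
    """Classify a response body saved from the browser or curl."""
    if not body.startswith(GZIP_MAGIC):
        return "not-gzip"
    try:
        inner = gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        return "corrupt-gzip"
    if inner.startswith(b"HTTP/1."):
        # The decompressed data is a status line plus headers plus body,
        # matching what was found inside download.gz.
        return "gzip-wrapping-a-full-http-response"
    return "plain-gzip"
```

Running it over the bytes of download.gz (for example open("download.gz", "rb").read()) should report the wrapped-response case if the description above is accurate.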

I've re-enabled setting DocumentRoot, which does fix this. I'll disable puppet again and re-investigate later.

It looks like it's spewing out incorrect gzipped data, which the browser can't decode so it downloads it as a file instead.
Maybe the default DocumentRoot is in the wrong place so that the webserver is selecting the wrong file for the user's browser to interpret? That's the only plausible explanation that I can think of where setting the DocumentRoot would fix it.

Change 296823 had a related patch set uploaded (by Yuvipanda):
graphite: Add DocumentRoot explicitly

https://gerrit.wikimedia.org/r/296823

Change 296823 merged by Yuvipanda:
graphite: Add DocumentRoot explicitly

https://gerrit.wikimedia.org/r/296823

yuvipanda mentioned this in rOPUP1223f7e1c007: graphite: Add DocumentRoot explicitly. Jun 30 2016, 8:36 PM

I just puppetized the DocumentRoot and am going to call this 'done' since the other option is a rabbithole I do not have the time to dive into...

yuvipanda closed subtask Unknown Object (Task) as Resolved. Jun 30 2016, 8:38 PM

yuvipanda reopened subtask Unknown Object (Task) as Open.

The temp host, which was added in https://phabricator.wikimedia.org/rOPUP93cef36f9ed6e2ec1e8ec2f6d5d345a4343e1610 is still up and running:

setting WMF4724 install params

this is a temp host for labmon1001 data migration, it will be reclaimed to spares in less than 48 hours

In T137924#2525923, @MoritzMuehlenhoff wrote:

The temp host, which was added in https://phabricator.wikimedia.org/rOPUP93cef36f9ed6e2ec1e8ec2f6d5d345a4343e1610 is still up and running:

setting WMF4724 install params

this is a temp host for labmon1001 data migration, it will be reclaimed to spares in less than 48 hours

I can reclaim this host, as long as @yuvipanda confirms he doesn't need any data or this host anymore.

Can be reclaimed!

Confirmed he doesn't need the data, so I'll create the task and take care of remote steps to reclaim WMF4724

RobH mentioned this in T142412: reclaim WMF4724 to spares. Aug 8 2016, 5:23 PM

yuvipanda removed yuvipanda as the assignee of this task. Aug 10 2016, 12:54 AM

yuvipanda updated the task description. Aug 10 2016, 5:17 PM

RobH closed this task as Resolved. Aug 10 2016, 5:17 PM

RobH claimed this task.

RobH closed subtask Unknown Object (Task) as Resolved. Oct 12 2016, 5:46 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:42 PM

Copy labmon data to new SSDs
Closed, Resolved (Public)
Related Objects

Status	Assigned	Task
Resolved	yuvipanda	T127957 graphite.wmflabs.org is very slow / flaky
Resolved	RobH	T137924 Copy labmon data to new SSDs
		Unknown Object (Task)
Resolved	• Cmjohnson	T138415 eqiad: Install ssds to labmon1001
