⚓ T168486 Migrate customer-facing Dumps endpoints to Cloud Services

Subject	Repo	Branch	Lines +/-
dumps: Copy web server logs to stat host	operations/puppet	production	+6 -0
dumps: Adjust network saturation thresholds for 10G interface	operations/puppet	production	+4 -2
dumps: Add nfs, load, network monitoring for dist servers	operations/puppet	production	+15 -0
dumps: Refactor profiles and hierakeys in web/	operations/puppet	production	+83 -75

bd808 created this task.Jun 21 2017, 12:24 AM

bd808 moved this task from Backlog to Q1 on the cloud-services-team (FY2017-18) board.

Assigning to @madhuvishy as the tech lead for this initiative. She will be responsible for creating a plan for this work and helping me report on it as the quarter progresses.

ArielGlenn subscribed.Jul 12 2017, 9:01 AM

• madhuvishy updated the task description.Jul 12 2017, 10:02 PM

• madhuvishy updated the task description.

ArielGlenn added a project: Datasets-General-or-Unknown.Jul 20 2017, 3:39 PM

• madhuvishy mentioned this in T167984: rack/setup/install labstore100[67].wikimedia.org.Jul 20 2017, 4:01 PM

Notes from hangout chat with Ariel:

We should keep the relative file structure for dumps the same so we don't break links for web/rsync downloaders.
Maintenance windows are currently handled by announcing a downtime and just taking the server down. The dumps jobs email the dumps alias, which only Ariel is currently on; sometimes users alert us to missing data even before the errors show up. The website is monitored by icinga.
There is ongoing work to make the dumps rsyncs incremental so they can run safely at short intervals, every 10 minutes or so. When we get there, we can alternatively pull from the dumps servers to labstore1006 and 1007.
The monitor script will run on the dumpsdata boxes and generate the dumps index pages, which will be rsync-ed to the labstores along with all the other data.
We have a small number of rsync mirrors that pull data from us every day. We can do some work to stagger or schedule them so they don't all run at the same time, if needed.
We keep about 6-8 historical copies of dumps. latest/ is a symlink to the latest available complete dump.
QoS considerations for web: bandwidth is currently extremely limited, disk IO is often tight, and there is not enough memory. We might be able to lift the bandwidth caps with the new servers, but we should consider keeping the limit on the number of connections per IP/UA.
The dumps and dataset modules in puppet have all the relevant infrastructure code.
Dumps run twice a month, 8-9 days per run. That gives about 1-2 days in between for maintenance and migration.
Dumpsdata servers will push data via rsync periodically to labstores.
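
To make the retention and latest/ notes above concrete, here is a minimal sketch of how a symlink like latest/ could be repointed without downloaders ever seeing a missing link, and how older runs beyond a retention count could be picked for removal. This is not the actual dumps tooling: the function names, the YYYYMMDD run directory naming, and the default of 8 retained runs are all assumptions for illustration.

```python
import os


def repoint_latest(dump_root, complete_runs, link_name="latest"):
    """Point dump_root/<link_name> at the newest complete run.

    complete_runs is a list of run directory names. This sketch assumes
    they are named YYYYMMDD, so string order equals chronological order.
    """
    if not complete_runs:
        return None
    newest = max(complete_runs)
    link_path = os.path.join(dump_root, link_name)
    tmp_path = link_path + ".tmp"
    # Clear a leftover temporary link from an interrupted earlier attempt.
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    # Create the new link under a temporary name, then rename it over the
    # old one. The rename replaces the existing link in a single step.
    os.symlink(newest, tmp_path)
    os.replace(tmp_path, link_path)
    return newest


def runs_to_prune(all_runs, keep=8):
    """Return the oldest run names that exceed the retention count."""
    ordered = sorted(all_runs)
    return ordered[:-keep] if len(ordered) > keep else []
```

Only runs known to be complete should be passed to repoint_latest; how completeness is determined is outside this sketch.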

zhuyifei1999 subscribed.Jul 23 2017, 4:45 AM

• madhuvishy created subtask T171508: Investigate and implement alternative for showmount based check at instance boot time.Jul 24 2017, 6:41 PM

• madhuvishy created subtask T171539: Puppetize and setup initial lvms and directory structures for labstore1006|7.Jul 24 2017, 10:24 PM

• madhuvishy created subtask T171540: Figure out how NFS failovers will work for the dumps servers - labstore1006|7.Jul 24 2017, 10:27 PM

• madhuvishy created subtask T171541: Setup periodic rsync jobs from dumps generation hosts to labstore1006|7.Jul 24 2017, 10:32 PM

• madhuvishy updated the task description.

A small correction to the above: there are indeed two dump runs per month, but the first run, starting on the 1st of the month, produces all historical revision content and takes 17-18 days. The second run, starting on the 20th of the month, skips this step, producing only current revision content, and takes 8-9 days.
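
For planning maintenance around that corrected schedule, the idle days in a given month can be worked out with a small calculation. The sketch below is only arithmetic on the numbers quoted above; it assumes a run that starts on day S and takes D days occupies days S through S+D-1, and the function name is hypothetical.

```python
import calendar


def idle_days(year, month, first_run_days, second_run_days,
              first_start=1, second_start=20):
    """Return (days free before the second run, days free at month end)."""
    days_in_month = calendar.monthrange(year, month)[1]
    # Last calendar day occupied by each run.
    first_end = first_start + first_run_days - 1
    second_end = second_start + second_run_days - 1
    gap_mid = max(0, second_start - first_end - 1)
    gap_end = max(0, days_in_month - second_end)
    return gap_mid, gap_end


# Slowest case for April 2018: 18-day first run, 9-day second run.
print(idle_days(2018, 4, 18, 9))  # → (1, 2)
```

Under these assumptions the mid-month gap is 1-2 days, and the end-of-month gap depends on the month length (none at all in a 28-day February with a 9-day second run).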

bd808 updated the task description.Sep 11 2017, 9:50 PM

bd808 moved this task from Q1 to Q2 on the cloud-services-team (FY2017-18) board.Sep 29 2017, 10:27 PM

• madhuvishy closed subtask T171539: Puppetize and setup initial lvms and directory structures for labstore1006|7 as Resolved.Nov 20 2017, 5:20 AM

• madhuvishy created subtask T181431: Setup NFS on dumps servers.Nov 27 2017, 7:54 PM

• madhuvishy updated the task description.Nov 27 2017, 7:56 PM

Small note: new ports appeared accessible from instances. These are intended, but they were noticed :)

OPEN 208.80.154.7 111 tcp 0 6 labstore1006.wikimedia.org
OPEN 208.80.155.106 111 tcp 0 6 labstore1007.wikimedia.org
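
TCP port 111 is the rpcbind/portmapper port that NFS clients talk to, which fits the NFS setup on these hosts. For anyone wanting to re-check reachability from an instance, a minimal sketch follows. It is not the scanner that produced the lines above, and the function name and timeout are assumptions.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hostnames and port are taken from the report above.
    for host in ("labstore1006.wikimedia.org", "labstore1007.wikimedia.org"):
        print(host, 111, "OPEN" if tcp_port_open(host, 111) else "closed/filtered")
```

A successful connect only shows that the port is reachable from where the check runs; it says nothing about which exports are visible.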

ArielGlenn moved this task from Backlog to Active on the Datasets-General-or-Unknown board.Dec 4 2017, 9:59 AM

bd808 mentioned this in T182351: Make HTML dumps available.Dec 9 2017, 5:26 AM

ArielGlenn mentioned this in T182540: get dataset1001, ms1001 ready for decommission.Dec 10 2017, 8:00 PM

bd808 renamed this task from Begin migrating customer-facing Dumps endpoints to Cloud Services to Migrate customer-facing Dumps endpoints to Cloud Services.Dec 18 2017, 3:59 AM

bd808 updated the task description.

bd808 moved this task from Q2 to Q3 on the cloud-services-team (FY2017-18) board.

ArielGlenn added a parent task: T182540: get dataset1001, ms1001 ready for decommission.Feb 8 2018, 9:53 AM

elukey subscribed.Feb 12 2018, 6:18 PM

• madhuvishy updated the task description.Mar 1 2018, 6:12 PM

• madhuvishy added a subtask: T185101: Labstore1006/7 profile for meltdown kernel.

• madhuvishy closed subtask T171541: Setup periodic rsync jobs from dumps generation hosts to labstore1006|7 as Resolved.Mar 5 2018, 5:58 PM

Change 416502 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Refactor profiles and hierakeys in web/

https://gerrit.wikimedia.org/r/416502

gerritbot added a project: Patch-For-Review.Mar 5 2018, 8:30 PM

Change 416502 merged by ArielGlenn:
[operations/puppet@production] dumps: Refactor profiles and hierakeys in web/

https://gerrit.wikimedia.org/r/416502

Draft timeline for migration:

March 28 - Ensure all data fetches and syncs for labstore1006|7 are in place - T188726
April 2 - Starting 14:30 UTC - NFS clients migration to new servers - T188643 and T188644
April 4 - Starting 14:30 UTC - Web service & rsync mirrors migration to new servers - T188645 and T188646

Each window on April 2 & 4 will last at least 2 hours if things go exceedingly well, and possibly up to several hours.

• madhuvishy mentioned this in T188647: Announce/Communicate dumps migration to labstore1006|7 to stakeholders.Mar 7 2018, 10:02 PM

• madhuvishy closed subtask T181431: Setup NFS on dumps servers as Resolved.Mar 16 2018, 12:19 AM

• madhuvishy closed subtask Restricted Task as Resolved.Mar 16 2018, 4:34 PM

• madhuvishy closed subtask T171540: Figure out how NFS failovers will work for the dumps servers - labstore1006|7 as Resolved.Mar 16 2018, 5:44 PM

• madhuvishy updated the task description.Mar 16 2018, 7:39 PM

• madhuvishy closed subtask T188073: Maintain symlinks for WMCS NFS Dumps users with new directory structure as Resolved.Mar 29 2018, 8:26 PM

Change 423040 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Add nfs, load, network monitoring for dist servers

https://gerrit.wikimedia.org/r/423040

Change 423040 merged by Madhuvishy:
[operations/puppet@production] dumps: Add nfs, load, network monitoring for dist servers

https://gerrit.wikimedia.org/r/423040

• madhuvishy closed subtask T188642: Set up labstore1006|7 as the source for rsync mirror sites as Resolved.Apr 2 2018, 1:30 AM

Change 423359 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Adjust network saturation thresholds for 10G interface

https://gerrit.wikimedia.org/r/423359

Change 423359 merged by Madhuvishy:
[operations/puppet@production] dumps: Adjust network saturation thresholds for 10G interface

https://gerrit.wikimedia.org/r/423359

• madhuvishy closed subtask T188641: Set up the web service that serves dumps.wikimedia.org as Resolved.Apr 2 2018, 4:58 AM

• madhuvishy updated the task description.

• madhuvishy closed subtask T188647: Announce/Communicate dumps migration to labstore1006|7 to stakeholders as Resolved.Apr 2 2018, 5:12 AM

• madhuvishy mentioned this in T188643: Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7.Apr 2 2018, 6:37 PM

• madhuvishy closed subtask T185101: Labstore1006/7 profile for meltdown kernel as Resolved.Apr 4 2018, 6:12 PM

• madhuvishy closed subtask T171508: Investigate and implement alternative for showmount based check at instance boot time as Resolved.

• madhuvishy closed subtask T188646: Point dumps.wikimedia.org to labstore1006/7 as Resolved.Apr 11 2018, 5:20 PM

• madhuvishy closed subtask T188645: Get all the rsync mirror sites to switch over to labstore1006,7 as Resolved.

• madhuvishy closed subtask T188644: Migrate the stat* mount from dataset1001 to labstore1006/7 as Resolved.Apr 11 2018, 6:07 PM

• madhuvishy closed subtask T188643: Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7 as Resolved.Apr 11 2018, 6:29 PM

The nginx logs from the active web server (labstore1007) need to get shipped over to stat1005 somehow, as we do with the dataset1001 logs for now in profile::dumps::web::xmldumps_active. If that happens already someplace, I could not find it.

• madhuvishy updated the task description.Apr 13 2018, 3:44 PM

Change 428864 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Copy web server logs to stat host

https://gerrit.wikimedia.org/r/428864

Change 428864 merged by Madhuvishy:
[operations/puppet@production] dumps: Copy web server logs to stat host

https://gerrit.wikimedia.org/r/428864

In T168486#4126319, @ArielGlenn wrote:

The nginx logs from the active web server (labstore1007) need to get shipped over to stat1005 somehow, as we do with the dataset1001 logs for now in profile::dumps::web::xmldumps_active. If that happens already someplace, I could not find it.

Done :)

• madhuvishy closed this task as Resolved.Apr 24 2018, 11:57 PM

ArielGlenn moved this task from Active to Done on the Datasets-General-or-Unknown board.Jul 2 2018, 12:35 PM

Migrate customer-facing Dumps endpoints to Cloud Services
Closed, ResolvedPublic

Status	Assigned	Task
Resolved	bd808	T166402 Program 7 Outcome 3: data services
Resolved	ArielGlenn	T182540 get dataset1001, ms1001 ready for decommission
Resolved	• madhuvishy	T168486 Migrate customer-facing Dumps endpoints to Cloud Services
Resolved	• madhuvishy	T171508 Investigate and implement alternative for showmount based check at instance boot time
Resolved	• madhuvishy	T174590 Revert patch that adds a temporary exception to the block-for-export check for the testlabs project
Resolved	• madhuvishy	T171539 Puppetize and setup initial lvms and directory structures for labstore1006|7
Resolved	ArielGlenn	T171541 Setup periodic rsync jobs from dumps generation hosts to labstore1006|7
Resolved	ArielGlenn	T188726 make sure all datasets in xmldatadumps/public/other on dataset1001 are accounted for on new labs boxes
Resolved	• ezachte	T189283 Replace cron jobs from EZachte's home directory on stat1005 with rsync fetches
Resolved	ArielGlenn	T189284 Stop serving slowparse logs from dumps distribution servers
Resolved	• madhuvishy	T181431 Setup NFS on dumps servers
Resolved	• madhuvishy	T171540 Figure out how NFS failovers will work for the dumps servers - labstore1006|7
Resolved	• madhuvishy	T188073 Maintain symlinks for WMCS NFS Dumps users with new directory structure
Resolved	• madhuvishy	T188641 Set up the web service that serves dumps.wikimedia.org
Resolved	• madhuvishy	T188642 Set up labstore1006|7 as the source for rsync mirror sites
Resolved	• madhuvishy	T188643 Migrate Dumps WMCS NFS users from labstore1003 to labstore1006/7
Resolved	• madhuvishy	T188644 Migrate the stat* mount from dataset1001 to labstore1006/7
Resolved	• madhuvishy	T188645 Get all the rsync mirror sites to switch over to labstore1006,7
Resolved	• madhuvishy	T188646 Point dumps.wikimedia.org to labstore1006/7
Resolved	• madhuvishy	T185101 Labstore1006/7 profile for meltdown kernel
Resolved	• madhuvishy	T188647 Announce/Communicate dumps migration to labstore1006|7 to stakeholders
		Restricted Task
