Decom EOL stats servers stat100[4-7]
Closed, Resolved, Public

Description

These are all EOL and we should start encouraging users off, so that we can decom them.

Event Timeline

Gehel triaged this task as High priority. Dec 20 2023, 10:43 AM
Gehel moved this task from Incoming to Hardware refresh on the Data-Platform-SRE board.

I have sent the following message by email to analytics@lists.w.o and to data-platform-engineering@w.o and also started a thread on the #data-engineering-collab channel on Slack.

We need to plan to decommission several of the analytics clients, also referred to as the stats servers, since they have reached their end of service date. The servers in question are:

  • stat1004
  • stat1005
  • stat1006
  • stat1007

If you actively use these servers, please consider moving your work to alternative stat servers (namely, stat10[08-11]) as soon as reasonably possible.

Similarly, should you have personal files in your home directory on any of these servers that you would like to retain, now would be a good time to consider moving them to a different server, or moving them to your HDFS home directory.

There are some guides available on syncing files between stats servers and also using the hdfs CLI to manage files, which may help you to clean up the necessary files.
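As a rough illustration of the kind of clean-up those guides cover, the sketch below only prints the commands it would run, so nothing is copied until you review them. The function name `plan_home_migration`, the `from-<host>` directory name, the use of rsync over SSH, and the `/user/<name>` HDFS home path are all assumptions made for this example; the guides remain the reference for the transfer method that actually works between stat hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the commands for copying a home
# directory off an EOL stat host, then optionally archiving it to HDFS.
plan_home_migration() {
  local old_host="$1" user="$2"
  local dest="/home/${user}/from-${old_host}"

  # Preview first: --dry-run lists what rsync would transfer without copying.
  echo "rsync -av --dry-run ${old_host}.eqiad.wmnet:/home/${user}/ ${dest}/"
  # The real copy, run from the new stat host (rsync-over-SSH is an assumption).
  echo "rsync -av ${old_host}.eqiad.wmnet:/home/${user}/ ${dest}/"
  # Optional: upload the copied directory into the HDFS home directory
  # (assumed here to be /user/<name>).
  echo "hdfs dfs -put ${dest} /user/${user}/"
}

# Example: show the plan for a made-up user moving off stat1007.
plan_home_migration stat1007 exampleuser
```

Printing the plan rather than executing it keeps the example safe to paste on any host; run the printed commands one at a time once they look right.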

We would like to decommission these servers three weeks from today, on Tuesday May 28th. Please get back to us if you feel that this timescale does not allow sufficient time for you to migrate your work to alternative servers, or if you have any other concerns about this plan.

I'll start some patches to prepare for decommissioning and check for any other consequences from decommissioning, whilst I wait for any replies.

Is the analytics-announce list still in use? I would expect such announcements to be sent there. I don't subscribe to analytics@ and only happened to catch this task in my Phab feed.

Thanks @taavi for the reminder - I have now also sent a copy to that list.
It's very low traffic now: the last user message was in November 2023, and the one before that was in June 2023. But you're right. I should have included that list as well.

Change #1028866 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move stats misc_jobs from stat1007 to stat1011

https://gerrit.wikimedia.org/r/1028866

Change #1029176 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Drop the deprecated dumps fetcher that pulls from stat1007

https://gerrit.wikimedia.org/r/1029176

Change #1029176 merged by Btullis:

[operations/puppet@production] Drop the deprecated dumps fetcher that pulls from stat1007

https://gerrit.wikimedia.org/r/1029176

Change #1028866 merged by Btullis:

[operations/puppet@production] Move stats misc_jobs from stat1007 to stat1011

https://gerrit.wikimedia.org/r/1028866

Change #1030903 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Create /srv/analytics-wmde on stat1011

https://gerrit.wikimedia.org/r/1030903

Change #1030903 merged by Btullis:

[operations/puppet@production] Create /srv/analytics-wmde on stat1011

https://gerrit.wikimedia.org/r/1030903

After merging the patch to move the misc jobs from stat1007 to stat1011, plus a small fixup patch, I ran the following command on stat1007.

btullis@stat1007:~$ sudo systemctl disable performance-asoranking.timer product-analytics-movement-metrics.timer wmde-analytics-daily-early.timer wmde-analytics-daily-noon.timer wmde-analytics-minutely.timer wmde-analytics-weekly.timer wmde-toolkit-analyzer-build.timer
Removed /etc/systemd/system/multi-user.target.wants/product-analytics-movement-metrics.timer.
Removed /etc/systemd/system/multi-user.target.wants/performance-asoranking.timer.
Removed /etc/systemd/system/multi-user.target.wants/wmde-analytics-weekly.timer.
Removed /etc/systemd/system/multi-user.target.wants/wmde-analytics-daily-noon.timer.
Removed /etc/systemd/system/multi-user.target.wants/wmde-analytics-minutely.timer.
Removed /etc/systemd/system/multi-user.target.wants/wmde-analytics-daily-early.timer.

Otherwise the timers would have fired on both stat1007 and stat1011.
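For anyone repeating this kind of move, here is a minimal sketch of the same step as a reusable function. It defaults to a dry run that only prints the commands. Note one deliberate difference from the command in the log above: the sketch uses `systemctl disable --now`, which also stops a timer that is currently active, whereas the logged command used plain `disable`. The function name `disable_migrated_timers` and the `DRY_RUN` variable are made up for this example.

```shell
#!/usr/bin/env bash
# Timers that were moved from the old host to the new one (from the log above).
TIMERS=(
  performance-asoranking.timer
  product-analytics-movement-metrics.timer
  wmde-analytics-daily-early.timer
  wmde-analytics-daily-noon.timer
  wmde-analytics-minutely.timer
  wmde-analytics-weekly.timer
  wmde-toolkit-analyzer-build.timer
)

# Hypothetical helper: disable (and stop) each timer given as an argument.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
disable_migrated_timers() {
  local timer
  for timer in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "sudo systemctl disable --now ${timer}"
    else
      sudo systemctl disable --now "${timer}"
    fi
  done
}

# Dry run: print what would be executed on the old host.
disable_migrated_timers "${TIMERS[@]}"
```

Running with `DRY_RUN=0` on the old host executes the commands; `systemctl list-timers` afterwards should no longer show the migrated timers there.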

There was a slight problem with the wikidata dumps download stats, owing to the issues identified in T364820: The rsync_nginxlogs.service that sends web logs from clouddumps100[1-2] to a stats server isn't working.

I have fixed up that rsync job and manually backfilled data with:

analytics-wmde@stat1011:/srv/analytics-wmde/graphite/src/scripts$ ./src/wikidata/dumpDownloads.php 
2024-05-14 11:11:12 wikidata-dumpDownloads Script Started!
2024-05-14 11:11:12 wikidata-dumpDownloads Targeting date: 10/May/2024
2024-05-14 11:11:13 wikidata-dumpDownloads Script Finished!
analytics-wmde@stat1011:/srv/analytics-wmde/graphite/src/scripts$ ./src/wikidata/dumpDownloads.php '-3 days'
2024-05-14 11:15:18 wikidata-dumpDownloads Script Started!
2024-05-14 11:15:18 wikidata-dumpDownloads Targeting date: 11/May/2024
2024-05-14 11:15:20 wikidata-dumpDownloads Script Finished!
analytics-wmde@stat1011:/srv/analytics-wmde/graphite/src/scripts$ ./src/wikidata/dumpDownloads.php '-2 days'
2024-05-14 11:15:24 wikidata-dumpDownloads Script Started!
2024-05-14 11:15:24 wikidata-dumpDownloads Targeting date: 12/May/2024
2024-05-14 11:15:30 wikidata-dumpDownloads Script Finished!
analytics-wmde@stat1011:/srv/analytics-wmde/graphite/src/scripts$
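
The backfill above can be sketched as a small loop. This assumes GNU date (as found on Debian) and that the script accepts a relative offset such as '-3 days', as the log suggests; the helper names `target_date` and `backfill_commands` are invented for this example, and the loop prints the invocations rather than running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: which date does "N days ago" target for a given run
# date, formatted like the script's log lines (for example 11/May/2024)?
target_date() {
  local run_date="$1" days_ago="$2"
  LC_ALL=C date -u -d "${run_date} ${days_ago} days ago" '+%d/%b/%Y'
}

# Hypothetical helper: print one backfill invocation per day offset,
# oldest day first, from $1 days ago down to $2 days ago.
backfill_commands() {
  local from="$1" to="$2" n
  for (( n = from; n >= to; n-- )); do
    echo "./src/wikidata/dumpDownloads.php '-${n} days'"
  done
}

target_date 2024-05-14 3   # → 11/May/2024
backfill_commands 3 2
```

Checking the target date before each run makes it easier to confirm that every missing day is covered exactly once.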

The data on the wikidata dumps downloads graph now looks broadly correct.
Even if a particular day was off, it should correct itself from now on.

Change #1038263 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Absent the rsync configuration for deprecated misc jobs

https://gerrit.wikimedia.org/r/1038263

Change #1038266 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove absented rsync configs for deprecated dumps

https://gerrit.wikimedia.org/r/1038266

Change #1038288 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the discovery-analytics dsh config for stat1007

https://gerrit.wikimedia.org/r/1038288

Change #1038328 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove temporary firewall rule for WDQS graph_split

https://gerrit.wikimedia.org/r/1038328

Change #1038329 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Prepare stat100[4-7] for decommissioning

https://gerrit.wikimedia.org/r/1038329

I've got a stack of five patches that remove stat100[4-7] from service, placing them into the insetup::buster role.
I think that we can leave these servers in this role for a week or so, just to ensure that we haven't forgotten anything that we might need to retrieve from the disks before decommissioning.

Change #1038263 merged by Btullis:

[operations/puppet@production] Absent the rsync configuration for deprecated misc jobs

https://gerrit.wikimedia.org/r/1038263

Change #1038266 merged by Btullis:

[operations/puppet@production] Remove absented rsync configs for deprecated dumps

https://gerrit.wikimedia.org/r/1038266

Change #1038288 merged by Bking:

[operations/puppet@production] Remove the discovery-analytics dsh config for stat1007

https://gerrit.wikimedia.org/r/1038288

Change #1038328 merged by Bking:

[operations/puppet@production] Remove temporary firewall rule for WDQS graph_split

https://gerrit.wikimedia.org/r/1038328

Change #1038329 merged by Btullis:

[operations/puppet@production] Prepare stat100[4-7] for decommissioning

https://gerrit.wikimedia.org/r/1038329

Mentioned in SAL (#wikimedia-analytics) [2024-06-05T10:16:52Z] <btullis> switching stat100[4-7] into insetup::buster role for T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T16:53:35Z] <dzahn@cumin1002> START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1004.eqiad.wmnet with reason: decom T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T16:53:48Z] <dzahn@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1004.eqiad.wmnet with reason: decom T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T16:54:13Z] <mutante> downtimed stat1004 for 10 days to avoid alerting spam during decom process - T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T16:56:09Z] <dzahn@cumin1002> START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1005.eqiad.wmnet with reason: decom T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T16:56:22Z] <dzahn@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1005.eqiad.wmnet with reason: decom T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T17:05:45Z] <dzahn@cumin1002> START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1006.eqiad.wmnet with reason: decom T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T17:05:58Z] <dzahn@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1006.eqiad.wmnet with reason: decom T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T17:06:41Z] <dzahn@cumin1002> START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on stat1007.eqiad.wmnet with reason: decom T353785

Mentioned in SAL (#wikimedia-operations) [2024-06-05T17:06:54Z] <dzahn@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on stat1007.eqiad.wmnet with reason: decom T353785

Change #1047926 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove stat1004-1007 from site.pp

https://gerrit.wikimedia.org/r/1047926

Change #1047926 merged by Muehlenhoff:

[operations/puppet@production] Remove stat1004-1007 from site.pp

https://gerrit.wikimedia.org/r/1047926

Change #1049991 had a related patch set uploaded (by Btullis; author: Xcollazo):

[analytics/refinery/scap@master] Move canary to stat1008, remove stat targets that no longer exist.

https://gerrit.wikimedia.org/r/1049991

Change #1049991 merged by Xcollazo:

[analytics/refinery/scap@master] Move canary to stat1008, remove stat targets that no longer exist.

https://gerrit.wikimedia.org/r/1049991