
Bring stat1010 into service with GPU from stat1005
Closed, Resolved · Public

Description

stat1010 is a new stats box, which is currently in the insetup::data_engineering role.

We should aim to bring this server into production as soon as it is convenient to do so.

It is running bullseye, so it will depend on fixing any outstanding issues in T329363: Upgrade Hadoop test cluster to Bullseye that relate to an-test-client1002 first.

It should also use the GPU from stat1005, so this will need to be moved by the DC-Ops team.
See also T329360: Upgrade stat1008 to bullseye where the GPU is mentioned.

Once it is done, we should look to decommission stat1005.

  • update Wikitech documentation: Updated on stat1010 and as part of Data Engineering/Systems/Clients
  • announce availability to users
  • update refinery deployment targets
  • update any other references to stats servers

Event Timeline


Confirmed that the GPU is currently still in stat1005

BTullis triaged this task as Medium priority. (Nov 15 2023, 9:45 AM)
Gehel raised the priority of this task from Medium to High. (Nov 15 2023, 9:49 AM)

Added the kerberos principals and keytabs for stat1010.

analytics-privatedata/stat1010.eqiad.wmnet@WIKIMEDIA
analytics-product/stat1010.eqiad.wmnet@WIKIMEDIA
analytics-search/stat1010.eqiad.wmnet@WIKIMEDIA
analytics/stat1010.eqiad.wmnet@WIKIMEDIA
Entry for principal analytics-privatedata/stat1010.eqiad.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/srv/kerberos/keytabs/stat1010.eqiad.wmnet/analytics-privatedata/analytics-privatedata.keytab.
Entry for principal analytics-product/stat1010.eqiad.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/srv/kerberos/keytabs/stat1010.eqiad.wmnet/analytics-product/analytics-product.keytab.
Entry for principal analytics-search/stat1010.eqiad.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/srv/kerberos/keytabs/stat1010.eqiad.wmnet/analytics-search/analytics-search.keytab.
Entry for principal analytics/stat1010.eqiad.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:/srv/kerberos/keytabs/stat1010.eqiad.wmnet/analytics/analytics.keytab.
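
The principal names and keytab locations above follow a regular pattern, which can be sketched as below. This is only an illustration derived from the output in this comment: the function and class names are hypothetical, and the realm and keytab root are copied from the log lines rather than from any configuration.

```python
from dataclasses import dataclass

# Values taken from the log output above; treat them as assumptions.
REALM = "WIKIMEDIA"
KEYTAB_ROOT = "/srv/kerberos/keytabs"


@dataclass(frozen=True)
class KeytabEntry:
    principal: str
    keytab_path: str


def keytab_entries(host: str, services: list[str]) -> list[KeytabEntry]:
    """Build the expected principal and keytab path for each service on a host."""
    entries = []
    for service in services:
        principal = f"{service}/{host}@{REALM}"
        path = f"{KEYTAB_ROOT}/{host}/{service}/{service}.keytab"
        entries.append(KeytabEntry(principal, path))
    return entries


if __name__ == "__main__":
    services = ["analytics-privatedata", "analytics-product", "analytics-search", "analytics"]
    for entry in keytab_entries("stat1010.eqiad.wmnet", services):
        print(entry.principal, "->", entry.keytab_path)
```

A list like this could be compared against the principals actually created, for example when bringing up a further stat host.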

Change 997797 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring two new stat servers into service

https://gerrit.wikimedia.org/r/997797

Change 997797 merged by Btullis:

[operations/puppet@production] Bring two new stat servers into service

https://gerrit.wikimedia.org/r/997797

Change 998480 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow the new stats servers to NFS mount from clouddumps servers

https://gerrit.wikimedia.org/r/998480

Change 998480 merged by Btullis:

[operations/puppet@production] Allow the new stats servers to NFS mount from clouddumps servers

https://gerrit.wikimedia.org/r/998480

Change 1003042 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[analytics/refinery/scap@master] Add stat1010 and stat1011 to scap targets

https://gerrit.wikimedia.org/r/1003042

Adding a link to the rsync-published.service resolution and potential discussion on the stat1011 ticket: https://phabricator.wikimedia.org/T354526#9546019

Once the patch to add stat1010 and stat1011 to the scap targets is merged, we will add a note for the ops week deployment and keep an eye on the two hosts during the deployment in case of any issues.

Change 1003042 merged by Btullis:

[analytics/refinery/scap@master] Add stat1010 and stat1011 to scap targets

https://gerrit.wikimedia.org/r/1003042

Change 1005538 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[analytics/hdfs-tools/deploy@master] Add stat1010 ans stat1011 to hdfs_tools target

https://gerrit.wikimedia.org/r/1005538

BTullis added a subscriber: Stevemunene.

Moving back to in-progress because we haven't moved the GPU.

I have sent an email to analytics@lists.wikimedia.org and data-platform-engineering@wikimedia.org informing users that we plan to move the GPU imminently and to let us know if this is likely to affect their work.

I also posted the same message to the #talk-to-machine-learning Slack channel.

Change 1005538 merged by Btullis:

[analytics/hdfs-tools/deploy@master] Add stat1010 ans stat1011 to hdfs_tools target

https://gerrit.wikimedia.org/r/1005538

I've now created a child ticket for ops-eqiad to coordinate the move of the GPU, so I will mark this ticket as waiting.

Icinga downtime and Alertmanager silence (ID=7228d4d9-78d6-454a-b81c-2822a26b7415) set by btullis@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Moving GPU from stat1005 to stat1010

stat1005.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=4bf40d1c-6051-41a9-8fd7-bce9305206d9) set by btullis@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Moving GPU from stat1005 to stat1010

stat1010.eqiad.wmnet

We're still awaiting the required cable from Dell for enabling the GPU in stat1010. That's being tracked in T359089.

The cable has arrived and we're planning to shut down stat1010 to do the work today.

Icinga downtime and Alertmanager silence (ID=eeb43f86-4f54-4536-93ec-187d89fe5afd) set by btullis@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Connecting GPU power cable

stat1010.eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2024-04-09T13:20:44Z] <btullis> shut down stat1010 to have the GPU power connected for T336040

Unfortunately, the cable hadn't arrived so we will have to re-schedule this work again.

We now have the cable, so we are planning to carry out the work at 13:30 UTC tomorrow. I will send out the comms for that today.

Icinga downtime and Alertmanager silence (ID=13bfaf48-39f3-4d24-93c8-f26ad915f34d) set by btullis@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Connecting GPU power cable

stat1010.eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2024-04-24T12:27:42Z] <btullis> shutting down stat1010 to allow the GPU power cable to be fitted for: T336040

The GPU is now correctly detected.

btullis@stat1010:~$ sudo lshw -class display
  *-display                 
       description: VGA compatible controller
       product: Integrated Matrox G200eW3 Graphics Controller
       vendor: Matrox Electronics Systems Ltd.
       physical id: 0
       bus info: pci@0000:03:00.0
       version: 04
       width: 32 bits
       clock: 66MHz
       capabilities: pm vga_controller bus_master cap_list rom
       configuration: driver=mgag200 latency=0 maxlatency=32 mingnt=16
       resources: irq:16 memory:91000000-91ffffff memory:92808000-9280bfff memory:92000000-927fffff memory:c0000-dffff
  *-display UNCLAIMED
       description: VGA compatible controller
       product: Vega 10 XT [Radeon PRO WX 9100]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:3d:00.0
       version: 00
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller cap_list
       configuration: latency=0
       resources: iomemory:38bf0-38bef iomemory:38bf0-38bef memory:38bfe0000000-38bfefffffff memory:38bff0000000-38bff01fffff ioport:6000(size=256) memory:ab000000-ab07ffff memory:ab0a0000-ab0bffff

I will make a patch to add the necessary packages.
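
In the lshw output above, the Radeon card is marked UNCLAIMED and has no driver= entry in its configuration line, which is how lshw reports a device with no kernel driver bound. A minimal sketch for spotting this automatically follows. The parser and its function names are hypothetical and assume only the plain-text layout shown in this comment.

```python
import re

# Matches node headers such as "  *-display" or "  *-display UNCLAIMED".
HEADER = re.compile(r"\s*\*-(\S+)(?:\s+(UNCLAIMED))?\s*$")


def parse_lshw_display(text: str) -> list[dict]:
    """Parse plain-text `lshw -class display` output into one dict per device."""
    devices = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"class": header.group(1), "claimed": header.group(2) is None}
            devices.append(current)
        elif current is not None and ": " in line:
            # Split on the first ": " so values like "pci@0000:3d:00.0" stay intact.
            key, _, value = line.strip().partition(": ")
            current[key] = value
    return devices


def unclaimed_products(text: str) -> list[str]:
    """Return the product names of devices that have no driver bound."""
    return [d.get("product", "unknown") for d in parse_lshw_display(text) if not d["claimed"]]


if __name__ == "__main__":
    import sys

    # Example: sudo lshw -class display | python3 this_script.py
    for product in unclaimed_products(sys.stdin.read()):
        print("No driver bound:", product)
```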

Change #1023885 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Support the move of a GPU from stat1008 to stat1010

https://gerrit.wikimedia.org/r/1023885

Change #1023889 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy keytabs for new stats servers

https://gerrit.wikimedia.org/r/1023889

Change #1023889 merged by Btullis:

[labs/private@master] Add dummy keytabs for new stats servers

https://gerrit.wikimedia.org/r/1023889

Change #1023885 merged by Btullis:

[operations/puppet@production] Support the move of a GPU from stat1008 to stat1010

https://gerrit.wikimedia.org/r/1023885

Hmm. This isn't working correctly yet, I get the following results from radeontop.

btullis@stat1010:~$ sudo radeontop -l 10 -d -
Failed to open DRM node, no VRAM support.
Dumping to -, line limit 10.
1713976775.777684: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%, sx 100.00%, sh 100.00%, spi 100.00%, sc 100.00%, pa 100.00%, db 100.00%, cb 100.00%
1713976776.777859: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%, sx 100.00%, sh 100.00%, spi 100.00%, sc 100.00%, pa 100.00%, db 100.00%, cb 100.00%
1713976777.778032: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%, sx 100.00%, sh 100.00%, spi 100.00%, sc 100.00%, pa 100.00%, db 100.00%, cb 100.00%
1713976778.778199: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%, sx 100.00%, sh 100.00%, spi 100.00%, sc 100.00%, pa 100.00%, db 100.00%, cb 100.00%

There is nobody logged onto the server yet, so I can give it a reboot and try again.
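
For reference, the dump lines above can be checked mechanically. The sketch below parses the simple "name NN.NN%" fields from a radeontop dump line and flags the case where every counter reads 100%, which in this comment coincided with the "Failed to open DRM node" warning. The function names are hypothetical, and treating uniform 100% readings as suspicious is an assumption drawn from this ticket, not documented radeontop behaviour.

```python
def parse_radeontop_line(line: str) -> dict[str, float]:
    """Extract percentage readings from one radeontop dump line.

    Fields that are not a single percentage (such as "bus 3d") are skipped.
    """
    _, _, rest = line.partition(": ")
    readings = {}
    for field in rest.split(", "):
        name, _, value = field.strip().partition(" ")
        if value.endswith("%"):
            readings[name] = float(value.rstrip("%"))
    return readings


def looks_saturated(readings: dict[str, float]) -> bool:
    """True when there is at least one reading and all of them are at 100%."""
    return bool(readings) and all(value >= 100.0 for value in readings.values())


if __name__ == "__main__":
    line = "1713976775.777684: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%"
    readings = parse_radeontop_line(line)
    if looks_saturated(readings):
        print("All counters at 100% on an idle host; check that the driver is loaded.")
```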

Host rebooted by btullis@cumin1002 with reason: Troubleshooting GPU

Great! I believe that it is working. Here is the output from radeontop.

[Screenshot: image.png, radeontop output]