
Oct 29 2020

klausman added a comment to T266771: Eq: new Druid test VM for analytics.
root@cumin1001:~# cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100  --network analytics eqiad_D an-test-druid1001.eqiad.wmnet
START - Cookbook sre.ganeti.makevm
Ready to create Ganeti VM an-test-druid1001.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row D with 4 vCPUs, 8GB of RAM, 100GB of disk in the analytics network.
Is this correct?
Type "done" to proceed
> done
Allocated IPv4 10.64.53.6/24
Set DNS name of IP 10.64.53.6/24 to an-test-druid1001.eqiad.wmnet
Allocated IPv6 2620:0:861:108:10:64:53:6/64 with DNS name an-test-druid1001.eqiad.wmnet
Generating the DNS records from Netbox data. It will take a couple of minutes.
2020-10-29 11:11:19,218 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2020-10-29 11:13:08,271 [INFO] Gathered 2182 devices from Netbox
2020-10-29 11:13:08,271 [INFO] Generating DNS records
2020-10-29 11:13:14,959 [INFO] Generated 12032 direct and reverse records (6016 each) in 26 direct zones and 168 reverse zones
2020-10-29 11:13:14,960 [INFO] Cloning /srv/netbox-exports/dns.git/ to /tmp/dns-c25pcHBldHM-jdruvzjh ...
2020-10-29 11:13:15,557 [INFO] Generating zonefile snippets to directory /tmp/dns-c25pcHBldHM-jdruvzjh
2020-10-29 11:13:16,180 [INFO] Committed changes: 1d04e945917ee0470dff1c49e67146287211b411
2020-10-29 11:13:16,202 [INFO] Validating generated data
2020-10-29 11:13:16,202 [INFO] Commit details: {'insertions': 4, 'deletions': 0, 'lines': 4, 'files': 3}
commit 1d04e945917ee0470dff1c49e67146287211b411
Author: generate-dns-snippets <noc@wikimedia.org>
Date:   Thu Oct 29 11:13:16 2020 +0000
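
The two addresses allocated in the log above suggest a convention in which the IPv6 address reuses the IPv4 octets, written as decimal digits, as the last four groups (10.64.53.6 becomes ...:10:64:53:6). A minimal sketch of that mapping, assuming the convention holds in general and that the /64 prefix is 2620:0:861:108::/64 as seen in the log; the function name `mapped_ipv6` is hypothetical and not part of the cookbook:

```python
import ipaddress


def mapped_ipv6(ipv4, prefix):
    """Build an IPv6 interface whose last four groups spell out the IPv4 octets.

    Assumption: the prefix is a /64, so exactly four groups remain for the host part.
    """
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    # Validate the IPv4 address, then reuse its decimal octets verbatim as groups.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    prefix_groups = network.network_address.exploded.split(":")[:4]
    address = ipaddress.IPv6Address(":".join(prefix_groups + octets))
    return ipaddress.IPv6Interface("{}/{}".format(address, network.prefixlen))


print(mapped_ipv6("10.64.53.6", "2620:0:861:108::/64"))
```

Note that the octets are not converted to hexadecimal: the group "64" is the hex value 0x64, which keeps the address human-readable at the cost of not being a numeric encoding of the IPv4 address.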
Oct 29 2020, 12:03 PM · vm-requests, Operations
klausman renamed T266771: Eq: new Druid test VM for analytics from Site: (QUANTITY) VM %request for SERVICE[S] to Eq: new Druid test VM for analytics.
Oct 29 2020, 10:56 AM · vm-requests, Operations
klausman claimed T266771: Eq: new Druid test VM for analytics.
Oct 29 2020, 10:51 AM · vm-requests, Operations
klausman created T266771: Eq: new Druid test VM for analytics.
Oct 29 2020, 10:51 AM · vm-requests, Operations

Oct 28 2020

klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

I had a quick chat with Moritz about the kernel version/rocm situation, and we agreed to test 5.8.0 (a backport to Buster) on stat1005 and see whether it works better with either vanilla amdgpu or rock-dkms. I will update here as soon as I have results. The test will likely happen this Friday (2020-10-30).

Oct 28 2020, 11:46 AM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban

Oct 22 2020

klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

https://wiki.debian.org/AMDGPUDriverOnStretchAndBuster2 indicates that some people are experimenting with rocm on Debian. The page mentions two patches, but neither of them mentions the missing PCI symbols that cause the compile to fail with 4.19.0-12.

Oct 22 2020, 1:35 PM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban
klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

After some more experimenting, I have found that at least rocm33 compiles fine against 4.19.0-11, but fails with 4.19.0-12, with the above errors referring to the pci_platform_rom/pci_map_rom symbols.

Oct 22 2020, 11:46 AM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban
klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

And 3.7 has the same problem:

Oct 22 2020, 10:52 AM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban
klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

Unfortunately, the rocm38 kernel module does not compile against our current Buster kernel (4.19.0-12):

Oct 22 2020, 10:04 AM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban

Oct 20 2020

klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

Sounds good to me. I will announce that we'll have some disruption on stat1005 and stat1008 soonish and then do the upgrade via puppet host overrides. For two machines that seems simple enough. If and when we get more of them, we can make a separate role like the original change I made.

Oct 20 2020, 1:25 PM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban
klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

As the experiment today has shown, rocm38 ultimately depends on a few packages that are not available on Stretch. I think we have the following options:

Oct 20 2020, 12:58 PM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban

Oct 13 2020

klausman added a comment to T255028: Move the stat1004-6-7 hosts to Debian Buster.

This should be complete, finally!

Oct 13 2020, 9:16 AM · Analytics-Kanban, Analytics-Clusters
klausman moved T255028: Move the stat1004-6-7 hosts to Debian Buster from In Progress to Done on the Analytics-Kanban board.
Oct 13 2020, 9:15 AM · Analytics-Kanban, Analytics-Clusters

Oct 8 2020

klausman added a comment to T255028: Move the stat1004-6-7 hosts to Debian Buster.

The reimages of 1006 and 1007 were successful.

Oct 8 2020, 10:07 AM · Analytics-Kanban, Analytics-Clusters

Oct 2 2020

klausman added a comment to T264119: Write Heavy Wing product plan.

Did a pass as well. Mostly cosmetic suggestions.

Oct 2 2020, 9:48 AM · Machine Learning Platform (Current)

Sep 28 2020

klausman claimed T234629: Move the Analytics infrastructure to Debian Buster.
Sep 28 2020, 10:12 AM · Analytics-Kanban, Analytics
klausman moved T234629: Move the Analytics infrastructure to Debian Buster from Parent Tasks to Done on the Analytics-Kanban board.
Sep 28 2020, 10:12 AM · Analytics-Kanban, Analytics

Sep 24 2020

klausman closed T263731: Drop MemorySwapMax=0 from analytics puppet roles as Resolved.
Sep 24 2020, 10:22 AM · Analytics
klausman created T263731: Drop MemorySwapMax=0 from analytics puppet roles.
Sep 24 2020, 10:12 AM · Analytics
klausman added a comment to T255028: Move the stat1004-6-7 hosts to Debian Buster.

Reimaging complete. The failure above is the failed first run of puppet, caused by spark not being installed. I installed it manually, ran puppet, and rebooted for the kernel opts; the machine is now back in service.

Sep 24 2020, 9:10 AM · Analytics-Kanban, Analytics-Clusters

Sep 18 2020

klausman renamed T263262: kakfa-jumbo1008 psu redundacy fail from Check jumbo1008.eqiad.wmnet PSU setup to Check jumbo1008.eqiad.wmnet PSU redundancy reported as critical.
Sep 18 2020, 2:24 PM · ops-eqiad, Operations
klausman created T263262: kakfa-jumbo1008 psu redundacy fail.
Sep 18 2020, 2:24 PM · ops-eqiad, Operations

Sep 17 2020

klausman closed T248574: GPUs are not correctly handling multitasking, a subtask of T247082: Upgrade AMD ROCm to latest upstream, as Resolved.
Sep 17 2020, 1:31 PM · Analytics-Clusters, Analytics-Kanban
klausman closed T248574: GPUs are not correctly handling multitasking as Resolved.
Sep 17 2020, 1:31 PM · Analytics
klausman reopened T248574: GPUs are not correctly handling multitasking, a subtask of T247082: Upgrade AMD ROCm to latest upstream, as Open.
Sep 17 2020, 1:29 PM · Analytics-Clusters, Analytics-Kanban
klausman reopened T248574: GPUs are not correctly handling multitasking as "Open".
Sep 17 2020, 1:29 PM · Analytics
klausman closed T248574: GPUs are not correctly handling multitasking, a subtask of T247082: Upgrade AMD ROCm to latest upstream, as Resolved.
Sep 17 2020, 1:27 PM · Analytics-Clusters, Analytics-Kanban
klausman closed T248574: GPUs are not correctly handling multitasking as Resolved.

The recent update of the GPU kernel-side drivers to using the rock-dkms package from upstream seems to have resolved this issue (parallel jobs seem to work just fine now.)

Sep 17 2020, 1:27 PM · Analytics
klausman added a comment to T251938: Monitoring GPU Usage on stat Machines.

I did some testing just now, and it looks like the current version of rocm_smi.py does not try to re-execute itself through sudo when the --showpidgpus or --showpids flags are used. Luca tells me that this used to be the case, but it looks like it has changed since. @Aroraakhil can you test whether the tool now works for you?
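
For readers unfamiliar with the pattern mentioned above, re-executing a script through sudo usually looks something like the following. This is a generic sketch, not the code that rocm_smi.py used; the function names `sudo_command` and `reexec_as_root_if_needed` are hypothetical.

```python
import os
import sys


def sudo_command(argv, euid):
    """Return the command line that re-runs this script under sudo.

    Returns None when the effective UID is already 0, so no re-exec is needed.
    """
    if euid == 0:
        return None
    return ["sudo", sys.executable] + list(argv)


def reexec_as_root_if_needed():
    """Replace the current process with a sudo invocation of the same script."""
    command = sudo_command(sys.argv, os.geteuid())
    if command is not None:
        # execvp does not return on success; the process image is replaced.
        os.execvp(command[0], command)
```

A tool that drops this step, as the current version appears to have done, runs with the caller's own privileges, so unprivileged users can use it without a sudo rule.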

Sep 17 2020, 12:23 PM · Analytics-Radar, User-Elukey, Analytics-Clusters

Sep 16 2020

klausman set Final Story Points to 8 on T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features.
Sep 16 2020, 1:13 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
klausman moved T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features from In Progress to Done on the Analytics-Kanban board.
Sep 16 2020, 1:10 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
klausman reopened T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features as "Open".
Sep 16 2020, 1:09 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
klausman closed T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features as Resolved.
Sep 16 2020, 1:08 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
klausman added a comment to T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features.

Looking at the install procedure for the rocm upstream drivers, we considered turning the DKMS package (compiling the driver(s) ad-hoc during install) into a more static one (no compilation, just install binary files)

Sep 16 2020, 10:00 AM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters

Sep 14 2020

klausman claimed T255028: Move the stat1004-6-7 hosts to Debian Buster.
Sep 14 2020, 3:18 PM · Analytics-Kanban, Analytics-Clusters

Sep 10 2020

klausman added a comment to P12562 (An Untitled Masterwork).

~ $ cat /proc/self/fdinfo/1
pos: 3144
flags: 0102002
mnt_id: 20
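
The `flags` field above is an octal bitmask of the open(2) flags for the descriptor. A small sketch for decoding it, assuming the flag values used by the Linux kernel on x86-64 (other architectures may differ); the helper names `parse_fdinfo` and `decode_flags` are made up for this illustration:

```python
# Octal flag values as defined for Linux on x86-64 (assumption: asm-generic layout).
ACCESS_MODES = {0o0: "O_RDONLY", 0o1: "O_WRONLY", 0o2: "O_RDWR"}
KNOWN_FLAGS = {
    0o100: "O_CREAT",
    0o2000: "O_APPEND",
    0o4000: "O_NONBLOCK",
    0o100000: "O_LARGEFILE",
    0o2000000: "O_CLOEXEC",
}


def parse_fdinfo(text):
    """Split 'key: value' lines from an fdinfo file into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def decode_flags(octal_string):
    """Return the access mode followed by the names of all set flag bits."""
    flags = int(octal_string, 8)
    names = [ACCESS_MODES.get(flags & 0o3, "unknown access mode")]
    remaining = flags & ~0o3
    for bit, name in sorted(KNOWN_FLAGS.items()):
        if remaining & bit:
            names.append(name)
            remaining &= ~bit
    if remaining:
        # Report bits this sketch does not know about instead of dropping them.
        names.append("unknown:0o%o" % remaining)
    return names


info = parse_fdinfo("pos:\t3144\nflags:\t0102002\nmnt_id:\t20\n")
print(decode_flags(info["flags"]))
```

Under these assumptions, the value 0102002 from the paste decodes to O_RDWR, O_APPEND and O_LARGEFILE.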

Sep 10 2020, 3:42 PM
klausman added a comment to T262427: Add more metrics to prometheus-amd-rocm-stats Python script.

https://docs.google.com/document/d/1OX7JsearaFNO5b8cT4ScLrUppdvFJmz8gWE1nG6_smc/edit# has an overview of parameters and their outputs. Will discuss with elukey which options might be useful to add.

Sep 10 2020, 11:18 AM · Analytics-Kanban, Analytics-Clusters
klausman added a comment to T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features.

elukey and I discussed a bit how we will proceed from here. Open things:

Sep 10 2020, 11:18 AM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters

Sep 9 2020

klausman created T262427: Add more metrics to prometheus-amd-rocm-stats Python script.
Sep 9 2020, 3:11 PM · Analytics-Kanban, Analytics-Clusters
klausman closed T262404: Update prometheus-amd-rocm-stats Python script to work with new JSON output as Resolved.

Submitted and live on stat1005 and stat1008, confirmed working as intended.

Sep 9 2020, 1:37 PM · Analytics-Clusters
klausman added a comment to T262404: Update prometheus-amd-rocm-stats Python script to work with new JSON output.

Turns out, the Prometheus client libs do not allow for not specifying a label. Thus, I will use the "sensor1" location as described at the end of the task description.
We could special case the two different driver versions entirely, but that seems like more trouble than it's worth.
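
The workaround described above amounts to filling in a fixed default when a sample has no location, so that every sample carries the same label set. A minimal stdlib-only sketch of that idea; the label names `card` and `location` and the helper `with_location` are assumptions for illustration, not taken from the actual script:

```python
DEFAULT_LOCATION = "sensor1"  # fallback value named in the comment above


def with_location(labels, default=DEFAULT_LOCATION):
    """Return a copy of labels that always contains a 'location' key."""
    complete = dict(labels)
    complete.setdefault("location", default)
    return complete


# A sample that already has a location keeps it; one without gets the default.
print(with_location({"card": "0", "location": "sensor2"}))
print(with_location({"card": "0"}))
```

Compared with special-casing the two driver versions, this keeps a single metric definition at the cost of a label value that is partly a placeholder.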

Sep 9 2020, 1:36 PM · Analytics-Clusters
klausman updated the task description for T262404: Update prometheus-amd-rocm-stats Python script to work with new JSON output.
Sep 9 2020, 12:36 PM · Analytics-Clusters
klausman created T262404: Update prometheus-amd-rocm-stats Python script to work with new JSON output.
Sep 9 2020, 12:36 PM · Analytics-Clusters
klausman added a comment to T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features.

Notes from the install:

Sep 9 2020, 11:04 AM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters

Sep 1 2020

klausman added a comment to T261626: Requesting access to Production for klausman.

SSH confirmed working:

Sep 1 2020, 4:04 PM · Operations, SRE-Access-Requests

Aug 31 2020

klausman updated the task description for T261626: Requesting access to Production for klausman.
Aug 31 2020, 12:17 PM · Operations, SRE-Access-Requests
klausman updated the task description for T261626: Requesting access to Production for klausman.
Aug 31 2020, 12:16 PM · Operations, SRE-Access-Requests
klausman updated the task description for T261626: Requesting access to Production for klausman.
Aug 31 2020, 12:16 PM · Operations, SRE-Access-Requests
klausman added a comment to T261626: Requesting access to Production for klausman.

L3 is signed.

Aug 31 2020, 12:15 PM · Operations, SRE-Access-Requests
klausman renamed T261626: Requesting access to Production for klausman from Requesting access to RESOURCE for klausman to Requesting access to Production for klausman.
Aug 31 2020, 10:15 AM · Operations, SRE-Access-Requests
klausman created T261626: Requesting access to Production for klausman.
Aug 31 2020, 10:15 AM · Operations, SRE-Access-Requests