klausman (Tobias Klausmann)
User

User Details

User Since
Aug 31 2020, 9:52 AM (42 w, 2 d)
Availability
Available
LDAP User
Klausman
MediaWiki User
TKlausmann (WMF)

Recent Activity

Tue, Jun 22

klausman closed T285225: Update all Liftwing machines to latest packages as Resolved.

All updates complete, no errors since the reboots. Closing.

Tue, Jun 22, 2:44 PM · Machine-Learning-Team (Active Tasks)

Mon, Jun 21

klausman added a comment to T285225: Update all Liftwing machines to latest packages.

Now proceeding with eqiad. Updates should be the same. Will mention here otherwise.

Mon, Jun 21, 2:16 PM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T285225: Update all Liftwing machines to latest packages.

ml-serve updates in codfw:

The following NEW packages will be installed:
  linux-image-4.19.0-17-amd64
The following packages will be upgraded:
  base-files debmonitor-client isc-dhcp-client isc-dhcp-common klibc-utils
  libgcrypt20 libglib2.0-0 libgnutls-dane0 libgnutls30 libklibc libmariadb3
  libxml2 linux-image-amd64 linux-perf linux-perf-4.19 mariadb-common
  python3-dnspython
17 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Mon, Jun 21, 1:47 PM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T285225: Update all Liftwing machines to latest packages.

ml-serve-ctrl updates in codfw:

Mon, Jun 21, 1:25 PM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T285225: Update all Liftwing machines to latest packages.

ml-etcd in codfw all have these updates:

Mon, Jun 21, 1:13 PM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T285225: Update all Liftwing machines to latest packages.

Clarification: this is only Debian-delivered and generic OS packages

Mon, Jun 21, 1:12 PM · Machine-Learning-Team (Active Tasks)
klausman created T285225: Update all Liftwing machines to latest packages.
Mon, Jun 21, 1:12 PM · Machine-Learning-Team (Active Tasks)

May 14 2021

klausman added a comment to T282802: Implement model storage for enwiki-goodfaith inference service.

@ACraze

$ ./model_upload.sh 
CHECKING FOR MODEL_BUCKET
Bucket 's3://wmf-ml-models/' created
UPLOADING enwiki.goodfaith.gradient_boosting.model to s3://wmf-ml-models/goodfaith/enwiki/202105140814
upload: 'enwiki.goodfaith.gradient_boosting.model' -> 's3://wmf-ml-models/goodfaith/enwiki/202105140814/enwiki.goodfaith.gradient_boosting.model'  [1 of 1]
 110612 of 110612   100% in    0s   339.32 KB/s  done
$
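
The destination in this output follows a bucket / model type / wiki / timestamp layout. A minimal sketch of composing such a destination, assuming a hypothetical helper `model_destination`; the layout is inferred from the log lines above, and the actual logic of `model_upload.sh` is not shown here:

```python
from datetime import datetime


def model_destination(bucket, model_type, wiki, when, filename):
    """Compose an upload URI shaped like the one in the output above.

    Hypothetical helper: the layout is inferred from the log output,
    not taken from model_upload.sh itself.
    """
    version = when.strftime("%Y%m%d%H%M")  # e.g. 202105140814
    return f"s3://{bucket}/{model_type}/{wiki}/{version}/{filename}"


print(model_destination(
    "wmf-ml-models", "goodfaith", "enwiki",
    datetime(2021, 5, 14, 8, 14),
    "enwiki.goodfaith.gradient_boosting.model",
))
```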
May 14 2021, 8:15 AM · Machine-Learning-Team (Active Tasks), Patch-For-Review, artificial-intelligence, revscoring, Lift-Wing

Apr 26 2021

klausman added a comment to T280467: Naming convention for the model storage structure.

Oh yes, I'd say this is a best practice I recommend. It comes from tensorflow-model-server, which reads all subdirectories in the specified model directory and automatically picks up the largest value as the current model. So typically you'd have an incrementing number as the model version, although I find a timestamp more convenient (and if it's in the YYYYMMDDHHMM format it will be incremental). Here's an example:
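
The comment's own example is not reproduced in this feed. As a rough illustration of the rule it describes (the subdirectory with the largest numeric name is treated as current), here is a sketch using hypothetical helpers `pick_current_version` and `timestamp_version`; it models only the selection rule, not the model server itself:

```python
from datetime import datetime


def pick_current_version(version_dirs):
    """Return the directory name with the largest numeric value, or None."""
    numeric = [name for name in version_dirs if name.isdigit()]
    if not numeric:
        return None
    # Compare as integers so that "10" sorts after "9".
    return max(numeric, key=int)


def timestamp_version(when):
    """Format a datetime as a YYYYMMDDHHMM version string."""
    return when.strftime("%Y%m%d%H%M")


versions = ["202104260826", "202105140814", "README"]
print(pick_current_version(versions))
```

Because YYYYMMDDHHMM strings have a fixed width and run from the most significant field to the least, a later timestamp always compares as the larger number.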

Apr 26 2021, 8:26 AM · Lift-Wing, Machine-Learning-Team (Active Tasks)

Apr 6 2021

klausman added a comment to T278238: Recreate ml-etcd2002 in a different row.

For the record: I just checked, and no, it wasn't you who created the VM on the wrong Ganeti host, but me. Thanks for fixing it!

Apr 6 2021, 7:50 AM · Patch-For-Review, Machine-Learning-Team, Lift-Wing
klausman closed T267670: (Need By: TBD) rack/setup/install ml-serve200[1-4] as Resolved.

Yes, this is all done!

Apr 6 2021, 7:48 AM · SRE, ops-codfw, DC-Ops

Mar 19 2021

klausman added a comment to T272918: Create ml-serve k8s cluster.

All worker nodes are now up and visible in both DCs:

ml-serve-ctrl1001:~$ kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
ml-serve1001.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.0.41     <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1002.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.16.183   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1003.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.32.81    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1004.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.48.50    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve-ctrl2001 ~ $ kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE     VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
ml-serve2001.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.0.21    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2002.codfw.wmnet   Ready    <none>   9m57s   v1.16.15   10.192.16.43   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2003.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.32.29   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2004.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.48.11   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
Mar 19 2021, 12:54 PM · Machine-Learning-Team, Patch-For-Review, Lift-Wing

Mar 15 2021

klausman triaged T277492: Investigate separating k8s-level users between our k8s and thr ServiceOps k8s as Medium priority.
Mar 15 2021, 5:33 PM · Machine-Learning-Team
klausman committed rLPRI494f672411c9: hiera: move ML k8s worker secrets into the correct location (authored by klausman).
Mar 15 2021, 4:44 PM
klausman committed rLPRIb84e8f31dbe3: hiera: add dummy secrets for ML k8s workers (authored by klausman).
Mar 15 2021, 4:37 PM

Mar 9 2021

klausman committed rLPRI9461a40d69b8: ml-k8s: Add dummy controllermanager_tokens (authored by klausman).
Mar 9 2021, 2:02 PM

Mar 5 2021

klausman committed rLPRIb9c62c28e10d: ml-ctrl: Add dummy keys for ML k8s control plane (authored by klausman).
Mar 5 2021, 4:19 PM

Feb 26 2021

klausman added a comment to T275852: Investigate potential issues with the sudoeres env_keep values.

TMUX being visible is, as mentioned, not a security issue when sudo'ing to non-root. The var contains just a path, with its own permissions, and racing attacks with symlinks are unlikely to work since the enclosing dir is typically owned by the user, or it has tempdir semantics. It is theoretically possible that a bad user config makes this attackable, but it still seems very remote compared to other threats.
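
A minimal sketch of checking that property on a given host, assuming the usual `socket_path,server_pid,session_index` layout of the `TMUX` variable. The helper names are hypothetical, and "tempdir semantics" is approximated here by the sticky bit:

```python
import os
import stat


def tmux_socket_path(tmux_value):
    """Extract the socket path from a $TMUX value (path,pid,session)."""
    # Split from the right so a comma inside the path would be preserved.
    return tmux_value.rsplit(",", 2)[0]


def enclosing_dir_is_protected(socket_path, uid):
    """True if the socket's directory is owned by uid or has the sticky bit."""
    info = os.stat(os.path.dirname(socket_path))
    owned_by_user = info.st_uid == uid
    sticky = bool(info.st_mode & stat.S_ISVTX)
    return owned_by_user or sticky


value = os.environ.get("TMUX")
if value:
    path = tmux_socket_path(value)
    print(path, enclosing_dir_is_protected(path, os.getuid()))
```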

Feb 26 2021, 2:42 PM · User-jbond, Security, SRE

Feb 25 2021

klausman added a comment to T275630: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster.

Created ml-serve-ctrl1001 and ml-serve-ctrl1002 in eqiad, rows B and D.

Feb 25 2021, 2:36 PM · Patch-For-Review, vm-requests, SRE

Feb 24 2021

klausman created T275630: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster.
Feb 24 2021, 2:45 PM · Patch-For-Review, vm-requests, SRE

Feb 22 2021

klausman updated the language for P14444 (An Untitled Masterwork) from autodetect to python.
Feb 22 2021, 4:02 PM
klausman created P14444 (An Untitled Masterwork).
Feb 22 2021, 4:02 PM

Feb 16 2021

klausman added a comment to T273071: Create etcd VMs for use with ML platform.
root@ml-etcd2001:~# etcdctl -C https://ml-etcd2001.codfw.wmnet:2379 cluster-health
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
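
A hedged sketch of turning this kind of output into a per-member summary. Only the "healthy" line format appears above; treating any other state word as unhealthy is an assumption, and `summarize_cluster_health` is a hypothetical helper:

```python
import re

# Matches lines such as "member 367f7076aea55538 is healthy: got healthy result ..."
MEMBER_LINE = re.compile(r"^member (?P<id>[0-9a-f]+) is (?P<state>\w+):")


def summarize_cluster_health(output):
    """Return ({member_id: is_healthy}, cluster_is_healthy)."""
    members = {}
    cluster_ok = False
    for raw in output.splitlines():
        line = raw.strip()
        match = MEMBER_LINE.match(line)
        if match:
            members[match.group("id")] = match.group("state") == "healthy"
        elif line == "cluster is healthy":
            cluster_ok = True
    return members, cluster_ok


sample = """\
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
"""
print(summarize_cluster_health(sample))
```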
Feb 16 2021, 4:35 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman closed T273071: Create etcd VMs for use with ML platform, a subtask of T272918: Create ml-serve k8s cluster, as Resolved.
Feb 16 2021, 4:12 PM · Machine-Learning-Team, Patch-For-Review, Lift-Wing
klausman closed T273071: Create etcd VMs for use with ML platform as Resolved.
Feb 16 2021, 4:12 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman added a comment to T273071: Create etcd VMs for use with ML platform.
root@ml-etcd1001:~# etcdctl  -C https://ml-etcd1001.eqiad.wmnet:2379 cluster-health
member 27250fb9655951c0 is healthy: got healthy result from https://ml-etcd1003.eqiad.wmnet:2379
member bec8796f64226950 is healthy: got healthy result from https://ml-etcd1002.eqiad.wmnet:2379
member ec678e26e1c1f07a is healthy: got healthy result from https://ml-etcd1001.eqiad.wmnet:2379
cluster is healthy
root@ml-etcd1001:~#
Feb 16 2021, 4:12 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman committed rLPRI0fe04ab80042: secrets: Add dummy keys for ml_etcd clusters (authored by klausman).
Feb 16 2021, 2:07 PM

Feb 3 2021

klausman added a comment to T273071: Create etcd VMs for use with ML platform.

All machines are now base installed (puppet-runs done with insetup).

Feb 3 2021, 3:41 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman added a comment to T273074: eqiad: 3 VM request for ML team etcd.

All machines are now base installed (puppet-runs done with insetup).

Feb 3 2021, 3:41 PM · vm-requests, SRE
klausman added a comment to T273075: codfw: 3 VM request for ML team etcd.

All machines are now base installed (puppet-runs done with insetup).

Feb 3 2021, 3:41 PM · vm-requests, SRE

Feb 2 2021

klausman added a comment to T273075: codfw: 3 VM request for ML team etcd.

$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 3 --disk 20 codfw_B ml-etcd2001.codfw.wmnet
IPv4: 10.192.16.44/22
IPv6: 2620:0:860:102:10:192:16:44/64
MAC: aa:00:00:71:6a:f3
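
In the addresses above, the last four groups of the IPv6 address repeat the decimal octets of the IPv4 address. A small sketch of that mapping, with a hypothetical helper `embed_ipv4`; the pattern is inferred from the output, and the cookbook's own allocation logic is not shown here:

```python
import ipaddress


def embed_ipv4(prefix64, ipv4):
    """Build an IPv6 address whose last four groups spell the IPv4 octets."""
    network = ipaddress.IPv6Network(prefix64)
    # First four groups come from the /64 prefix.
    head = network.network_address.exploded.split(":")[:4]
    # The decimal octets are written verbatim into the remaining groups.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    return str(ipaddress.IPv6Address(":".join(head + octets)))


print(embed_ipv4("2620:0:860:102::/64", "10.192.16.44"))
```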

Feb 2 2021, 2:38 PM · vm-requests, SRE
klausman added a comment to T273074: eqiad: 3 VM request for ML team etcd.

$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 3 --disk 20 eqiad_B ml-etcd1001.eqiad.wmnet
IPv4: 10.64.16.200
IPv6: 2620:0:861:102:10:64:16:200
MAC: aa:00:00:ef:5f:2d

Feb 2 2021, 2:00 PM · vm-requests, SRE

Jan 28 2021

klausman added a comment to T273075: codfw: 3 VM request for ML team etcd.

@klausman Could you add the new cluster prefixes for ml (ml-etcd and others) to https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers ? that would be nice, thank you!

Jan 28 2021, 12:33 PM · vm-requests, SRE

Jan 27 2021

klausman added a comment to T273071: Create etcd VMs for use with ML platform.

VM requests in:

Jan 27 2021, 4:19 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman created T273075: codfw: 3 VM request for ML team etcd.
Jan 27 2021, 4:18 PM · vm-requests, SRE
klausman created T273074: eqiad: 3 VM request for ML team etcd.
Jan 27 2021, 4:17 PM · vm-requests, SRE
klausman added a comment to T273071: Create etcd VMs for use with ML platform.

Hostnames: ml-etcd100x.eqiad and ml-etcd200x.codfw
For networking, we want row diversity, which should be easy enough for VMs this tiny.

Jan 27 2021, 4:02 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman added a subtask for T272918: Create ml-serve k8s cluster: T273071: Create etcd VMs for use with ML platform.
Jan 27 2021, 3:54 PM · Machine-Learning-Team, Patch-For-Review, Lift-Wing
klausman added a parent task for T273071: Create etcd VMs for use with ML platform: T272918: Create ml-serve k8s cluster.
Jan 27 2021, 3:54 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman created T273071: Create etcd VMs for use with ML platform.
Jan 27 2021, 3:53 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)

Jan 25 2021

klausman added a comment to T272687: Give access to ml-serve* to the non-ops members of the ML team.

@klausman (adding a comment here in case it was missed from the meeting) when this access is revoked and the hacking is over we should rebuild all the machines using the re-image script so that we can ensure they can be automatically installed and no manual steps were missed.

Jan 25 2021, 5:58 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
klausman edited projects for T272687: Give access to ml-serve* to the non-ops members of the ML team, added: Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.
Jan 25 2021, 4:38 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests

Jan 22 2021

klausman closed T272687: Give access to ml-serve* to the non-ops members of the ML team as Resolved.
Jan 22 2021, 4:58 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
klausman triaged T272687: Give access to ml-serve* to the non-ops members of the ML team as High priority.
Jan 22 2021, 4:56 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
klausman added a project to T272687: Give access to ml-serve* to the non-ops members of the ML team: SRE-Access-Requests.
Jan 22 2021, 4:47 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
klausman created T272687: Give access to ml-serve* to the non-ops members of the ML team.
Jan 22 2021, 10:28 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests

Jan 21 2021

klausman closed T272513: upgrade kernel of liftwing machines as Resolved.
Jan 21 2021, 11:41 AM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T272513: upgrade kernel of liftwing machines.

All eight machines (ml-serve1001 - 4, ml-serve2001 - 4) are now on 5.9.0-0.bpo.5-amd64. I also installed the firmware packages for the NICs. The latter were not strictly necessary, but in general, having the latest firmware loaded on boot is preferable.

Jan 21 2021, 11:40 AM · Machine-Learning-Team (Active Tasks)

Jan 8 2021

klausman added a comment to T267670: (Need By: TBD) rack/setup/install ml-serve200[1-4].

The machines now have a base install (i.e. there is nothing special for them in puppet).

Jan 8 2021, 3:15 PM · SRE, ops-codfw, DC-Ops

Dec 17 2020

klausman added a comment to T268936: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia.

Package is backported and uploaded to reprepro/aptx00y and updated on all stats100x machines.

Dec 17 2020, 2:55 PM · Analytics-Clusters, SRE

Dec 9 2020

klausman added a comment to T267670: (Need By: TBD) rack/setup/install ml-serve200[1-4].

Networking will be 1G. No hw RAID.

Dec 9 2020, 2:49 PM · SRE, ops-codfw, DC-Ops
klausman added a comment to T268936: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia.

I've also poked Faidon on whether an official backport might be done.

Dec 9 2020, 2:07 PM · Analytics-Clusters, SRE
klausman added a comment to T255262: Kerberos credential cache location.

I presume the KRB5CCNAME in /etc/profile approach doesn't work due to there being no shell involved. Do you think using /etc/environment instead might work better? Or are there further problems beyond the env var not being set for some environments?

Dec 9 2020, 2:03 PM · Patch-For-Review, Analytics-Clusters
klausman added a comment to T268936: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia.

This is definitely doable, but needs at least one change: The Bullseye version of the package depends on librdkafka1 >= 1.4.2, which Buster doesn't have. I've done a quick munge of the source package and it looks like building against the librdkafka available on Buster (0.11.5) works fine:

Dec 9 2020, 1:44 PM · Analytics-Clusters, SRE

Oct 29 2020

klausman added a comment to T266771: Eq: new Druid test VM for analytics.
root@cumin1001:~# cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100  --network analytics eqiad_D an-test-druid1001.eqiad.wmnet
START - Cookbook sre.ganeti.makevm
Ready to create Ganeti VM an-test-druid1001.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row D with 4 vCPUs, 8GB of RAM, 100GB of disk in the analytics network.
Is this correct?
Type "done" to proceed
> done
Allocated IPv4 10.64.53.6/24
Set DNS name of IP 10.64.53.6/24 to an-test-druid1001.eqiad.wmnet
Allocated IPv6 2620:0:861:108:10:64:53:6/64 with DNS name an-test-druid1001.eqiad.wmnet
Generating the DNS records from Netbox data. It will take a couple of minutes.
2020-10-29 11:11:19,218 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2020-10-29 11:13:08,271 [INFO] Gathered 2182 devices from Netbox
2020-10-29 11:13:08,271 [INFO] Generating DNS records
2020-10-29 11:13:14,959 [INFO] Generated 12032 direct and reverse records (6016 each) in 26 direct zones and 168 reverse zones                                                                                    
2020-10-29 11:13:14,960 [INFO] Cloning /srv/netbox-exports/dns.git/ to /tmp/dns-c25pcHBldHM-jdruvzjh ...
2020-10-29 11:13:15,557 [INFO] Generating zonefile snippets to directory /tmp/dns-c25pcHBldHM-jdruvzjh
2020-10-29 11:13:16,180 [INFO] Committed changes: 1d04e945917ee0470dff1c49e67146287211b411
2020-10-29 11:13:16,202 [INFO] Validating generated data
2020-10-29 11:13:16,202 [INFO] Commit details: {'insertions': 4, 'deletions': 0, 'lines': 4, 'files': 3}
commit 1d04e945917ee0470dff1c49e67146287211b411
Author: generate-dns-snippets <noc@wikimedia.org>
Date:   Thu Oct 29 11:13:16 2020 +0000
Oct 29 2020, 12:03 PM · Analytics-Clusters, vm-requests, SRE
klausman renamed T266771: Eq: new Druid test VM for analytics from Site: (QUANTITY) VM %request for SERVICE[S] to Eq: new Druid test VM for analytics.
Oct 29 2020, 10:56 AM · Analytics-Clusters, vm-requests, SRE
klausman claimed T266771: Eq: new Druid test VM for analytics.
Oct 29 2020, 10:51 AM · Analytics-Clusters, vm-requests, SRE
klausman created T266771: Eq: new Druid test VM for analytics.
Oct 29 2020, 10:51 AM · Analytics-Clusters, vm-requests, SRE

Oct 28 2020

klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

I had a quick chat with Moritz about the kernel version/rocm situation, and we agreed that we'd test 5.8.0 (a backport to Buster) on stat1005 and see if it works better with either vanilla amdgpu or the rock-dkms. Will update here as soon as I have results. The test is likely going to happen this Friday (2020-10-30).

Oct 28 2020, 11:46 AM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban

Oct 22 2020

klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

https://wiki.debian.org/AMDGPUDriverOnStretchAndBuster2 indicates that some people are experimenting with rocm on Debian. The page mentions two patches, but neither of them mentions the missing PCI symbols that cause the compile to fail with 4.19.0-12.

Oct 22 2020, 1:35 PM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban
klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

After some more experimenting, I have found that at least rocm33 compiles fine against 4.19.0-11, but fails with 4.19.0-12, with the above errors referring to the pci_platform_rom/pci_map_rom symbols.

Oct 22 2020, 11:46 AM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban
klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

And 3.7 has the same problem:

Oct 22 2020, 10:52 AM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban
klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

Unfortunately, the rocm38 kernel module does not compile against our current Buster kernel (4.19.0-12):

Oct 22 2020, 10:04 AM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban

Oct 20 2020

klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

Sounds good to me. I will make an announcement that we'll have some disruption on stat1005 and 1008 soonish and then do the upgrade via puppet host overrides. For two machines that seems simple enough. If and when we get more of them, we can make a separate role like the original change I made.

Oct 20 2020, 1:25 PM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban
klausman added a comment to T264408: Upgrade AMD ROCm drivers/tools to latest upstream.

As the experiment today has shown, rocm38 ultimately depends on a few packages that are not available on Stretch. I think we have the following options:

Oct 20 2020, 12:58 PM · Patch-For-Review, Analytics-Clusters, Analytics-Kanban

Oct 13 2020

klausman added a comment to T255028: Move the stat1004-6-7 hosts to Debian Buster.

This should be complete, finally!

Oct 13 2020, 9:16 AM · Analytics-Kanban, Analytics-Clusters
klausman moved T255028: Move the stat1004-6-7 hosts to Debian Buster from In Progress to Done on the Analytics-Kanban board.
Oct 13 2020, 9:15 AM · Analytics-Kanban, Analytics-Clusters

Oct 8 2020

klausman added a comment to T255028: Move the stat1004-6-7 hosts to Debian Buster.

Reimages of 1006 and 1007 were successful.

Oct 8 2020, 10:07 AM · Analytics-Kanban, Analytics-Clusters

Oct 2 2020

klausman added a comment to T264119: Write Heavy Wing product plan.

Did a pass as well. Mostly cosmetic suggestions

Oct 2 2020, 9:48 AM · Machine-Learning-Team (Active Tasks)

Sep 28 2020

klausman claimed T234629: Move the Analytics infrastructure to Debian Buster.
Sep 28 2020, 10:12 AM · Analytics-Kanban
klausman moved T234629: Move the Analytics infrastructure to Debian Buster from Parent Tasks to Done on the Analytics-Kanban board.
Sep 28 2020, 10:12 AM · Analytics-Kanban

Sep 24 2020

klausman closed T263731: Drop MemorySwapMax=0 from analytics puppet roles as Resolved.
Sep 24 2020, 10:22 AM · Analytics
klausman created T263731: Drop MemorySwapMax=0 from analytics puppet roles.
Sep 24 2020, 10:12 AM · Analytics
klausman added a comment to T255028: Move the stat1004-6-7 hosts to Debian Buster.

Reimaging complete. The failure above is the failed first run of puppet due to no spark being installed. I did that manually, ran puppet, rebooted for the kernel opts, and the machine is now back in service.

Sep 24 2020, 9:10 AM · Analytics-Kanban, Analytics-Clusters

Sep 18 2020

klausman renamed T263262: kakfa-jumbo1008 psu redundacy fail from Check jumbo1008.eqiad.wmnet PSU setup to Check jumbo1008.eqiad.wmnet PSU redundancy reported as critical.
Sep 18 2020, 2:24 PM · ops-eqiad, SRE
klausman created T263262: kakfa-jumbo1008 psu redundacy fail.
Sep 18 2020, 2:24 PM · ops-eqiad, SRE

Sep 17 2020

klausman closed T248574: GPUs are not correctly handling multitasking , a subtask of T247082: Upgrade AMD ROCm to latest upstream, as Resolved.
Sep 17 2020, 1:31 PM · Analytics-Clusters, Analytics-Kanban
klausman closed T248574: GPUs are not correctly handling multitasking as Resolved.
Sep 17 2020, 1:31 PM · Analytics
klausman reopened T248574: GPUs are not correctly handling multitasking , a subtask of T247082: Upgrade AMD ROCm to latest upstream, as Open.
Sep 17 2020, 1:29 PM · Analytics-Clusters, Analytics-Kanban
klausman reopened T248574: GPUs are not correctly handling multitasking as "Open".
Sep 17 2020, 1:29 PM · Analytics
klausman closed T248574: GPUs are not correctly handling multitasking , a subtask of T247082: Upgrade AMD ROCm to latest upstream, as Resolved.
Sep 17 2020, 1:27 PM · Analytics-Clusters, Analytics-Kanban
klausman closed T248574: GPUs are not correctly handling multitasking as Resolved.

The recent update of the GPU kernel-side drivers to using the rock-dkms package from upstream seems to have resolved this issue (parallel jobs seem to work just fine now.)

Sep 17 2020, 1:27 PM · Analytics
klausman added a comment to T251938: Monitoring GPU Usage on stat Machines.

I did some testing just now, and it looks like the current version of rocm_smi.py does not try to re-execute itself through sudo when the --showpidgpus or --showpids flags are used. Luca tells me that this used to be the case, but it looks like it has changed since. @Aroraakhil can you test whether the tool now works for you?

Sep 17 2020, 12:23 PM · Analytics-Radar, User-Elukey, Analytics-Clusters

Sep 16 2020

klausman set Final Story Points to 8 on T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features.
Sep 16 2020, 1:13 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
klausman moved T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features from In Progress to Done on the Analytics-Kanban board.
Sep 16 2020, 1:10 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
klausman reopened T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features as "Open".
Sep 16 2020, 1:09 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
klausman closed T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features as Resolved.
Sep 16 2020, 1:08 PM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters
klausman added a comment to T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features.

Looking at the install procedure for the rocm upstream drivers, we considered turning the DKMS package (compiling the driver(s) ad-hoc during install) into a more static one (no compilation, just install binary files)

Sep 16 2020, 10:00 AM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters

Sep 14 2020

klausman claimed T255028: Move the stat1004-6-7 hosts to Debian Buster.
Sep 14 2020, 3:18 PM · Analytics-Kanban, Analytics-Clusters

Sep 10 2020

klausman added a comment to P12562 (An Untitled Masterwork).

~ $ cat /proc/self/fdinfo/1
pos: 3144
flags: 0102002
mnt_id: 20
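
The `flags` field is an octal bitmask of open(2) flags. A minimal sketch of decoding it, assuming the flag values Linux uses on x86-64 (other architectures may differ); `decode_fdinfo_flags` is a hypothetical helper:

```python
# Flag values as used by Linux on x86-64 (octal); assumed, not read from headers.
ACCESS_MODES = {0o0: "O_RDONLY", 0o1: "O_WRONLY", 0o2: "O_RDWR"}
STATUS_FLAGS = {
    0o100: "O_CREAT",
    0o200: "O_EXCL",
    0o400: "O_NOCTTY",
    0o1000: "O_TRUNC",
    0o2000: "O_APPEND",
    0o4000: "O_NONBLOCK",
    0o100000: "O_LARGEFILE",
    0o2000000: "O_CLOEXEC",
}


def decode_fdinfo_flags(octal_text):
    """Translate the octal 'flags' value from fdinfo into flag names."""
    value = int(octal_text, 8)
    # The low two bits hold the access mode; the rest are independent bits.
    names = [ACCESS_MODES.get(value & 0o3, "unknown access mode")]
    names += [name for bit, name in sorted(STATUS_FLAGS.items()) if value & bit]
    return names


print(decode_fdinfo_flags("0102002"))
# prints ['O_RDWR', 'O_APPEND', 'O_LARGEFILE']
```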

Sep 10 2020, 3:42 PM
klausman added a comment to T262427: Add more metrics to prometheus-amd-rocm-stats Python script.

https://docs.google.com/document/d/1OX7JsearaFNO5b8cT4ScLrUppdvFJmz8gWE1nG6_smc/edit# has an overview of parameters and their outputs. Will discuss with elukey which options might be useful to add.

Sep 10 2020, 11:18 AM · Analytics-Kanban, Analytics-Clusters
klausman added a comment to T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features.

elukey and I discussed a bit how we will proceed from here. Open things:

Sep 10 2020, 11:18 AM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters

Sep 9 2020

klausman created T262427: Add more metrics to prometheus-amd-rocm-stats Python script.
Sep 9 2020, 3:11 PM · Analytics-Kanban, Analytics-Clusters
klausman closed T262404: Update prometheus-amd-rocm-stats Python script to work with new JSON output as Resolved.

Submitted and live on stat1005 and stat1008, confirmed working as intended.

Sep 9 2020, 1:37 PM · Analytics-Clusters
klausman added a comment to T262404: Update prometheus-amd-rocm-stats Python script to work with new JSON output.

Turns out, the Prometheus client libs do not allow for not specifying a label. Thus, I will use the "sensor1" location as described at the end of the task description.
We could special case the two different driver versions entirely, but that seems like more trouble than it's worth.
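
A rough sketch of the fallback described here, in plain Python with no Prometheus dependency. The reading structure and the field names `card` and `location` are invented for illustration and are not the script's real data model:

```python
DEFAULT_LOCATION = "sensor1"  # fallback named in the comment above


def label_values(reading):
    """Return the labels for one reading, never leaving 'location' unset.

    Hypothetical shape: readings from one driver version carry a
    'location' key, readings from the other do not.
    """
    return {
        "card": reading["card"],
        "location": reading.get("location") or DEFAULT_LOCATION,
    }


old_style = {"card": "card0", "value": 41.0}
new_style = {"card": "card0", "location": "sensor2", "value": 43.0}
print(label_values(old_style), label_values(new_style))
```

Filling in one default keeps a single metric definition for both driver versions, which matches the choice not to special-case them.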

Sep 9 2020, 1:36 PM · Analytics-Clusters
klausman updated the task description for T262404: Update prometheus-amd-rocm-stats Python script to work with new JSON output.
Sep 9 2020, 12:36 PM · Analytics-Clusters
klausman created T262404: Update prometheus-amd-rocm-stats Python script to work with new JSON output.
Sep 9 2020, 12:36 PM · Analytics-Clusters
klausman added a comment to T260442: AMD ROCm kernel drivers on stat1005/stat1008 don't support some features.

Notes from the install:

Sep 9 2020, 11:04 AM · Analytics-Kanban, Patch-For-Review, Analytics-Clusters

Sep 1 2020

klausman added a comment to T261626: Requesting access to Production for klausman.

SSH confirmed working:

Sep 1 2020, 4:04 PM · SRE, SRE-Access-Requests

Aug 31 2020

klausman updated the task description for T261626: Requesting access to Production for klausman.
Aug 31 2020, 12:17 PM · SRE, SRE-Access-Requests