
klausman (Tobias Klausmann)
User

User Details

User Since
Aug 31 2020, 9:52 AM (101 w, 20 h)
Availability
Available
LDAP User
Klausman
MediaWiki User
TKlausmann (WMF) [ Global Accounts ]

Recent Activity

Wed, Jul 27

klausman committed rLPRI756a0ad1d5b4: ML k8s: fix articletopic-outlink names (authored by klausman).
ML k8s: fix articletopic-outlink names
Wed, Jul 27, 11:10 AM
klausman committed rLPRI8e9d4c83b1ee: ml-k8s: add dummy secrects for article-outlink (authored by klausman).
ml-k8s: add dummy secrects for article-outlink
Wed, Jul 27, 10:52 AM
klausman added a comment to T313822: codfw: ml-serve2001 memmory issue DIMM A2.

OK, the machine is booted and sitting in GRUB. @Papaul I can't seem to run memtest86+ via iDRAC (I just get a black screen). Can you check whether it works with direct access? Alternatively, do you know how to run it so that console redirection works? Thanks!

Wed, Jul 27, 10:01 AM · Machine-Learning-Team, SRE, ops-codfw

Tue, Jul 26

klausman closed T312550: uwsgi socket/UDP logger is broken if no other logger uses the same format as Resolved.

Change 817210 actually fixes this; we now see messages in logstash again. Apparently, an unset buffer size causes JSON generation to break. The upstream bug is still open, but I doubt it will be fixed soon, especially with a mitigation being available now.

Tue, Jul 26, 1:39 PM · SRE

Jul 7 2022

klausman added a comment to T312550: uwsgi socket/UDP logger is broken if no other logger uses the same format.

Upstream issue: https://github.com/unbit/uwsgi/issues/2456

Jul 7 2022, 3:39 PM · SRE
klausman claimed T312550: uwsgi socket/UDP logger is broken if no other logger uses the same format.
Jul 7 2022, 2:47 PM · SRE
klausman updated the task description for T312550: uwsgi socket/UDP logger is broken if no other logger uses the same format.
Jul 7 2022, 2:29 PM · SRE
klausman created T312550: uwsgi socket/UDP logger is broken if no other logger uses the same format.
Jul 7 2022, 2:28 PM · SRE

Jul 5 2022

klausman added a comment to T302195: Create the ml-serve-staging k8s cluster.

Now also running draftquality for enwiki:

Jul 5 2022, 1:38 PM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)

Jun 29 2022

klausman created T311628: Create Swift account for readonly access to ML models.
Jun 29 2022, 1:32 PM · Machine-Learning-Team (Active Tasks), SRE-swift-storage, Lift-Wing

Jun 23 2022

klausman added a comment to T302195: Create the ml-serve-staging k8s cluster.

Prometheus is now correctly set up with its own volumes (we hadn't done that yet), and I managed to save the old data.

Jun 23 2022, 2:56 PM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)

Jun 22 2022

klausman added a comment to T302195: Create the ml-serve-staging k8s cluster.

Add'l things done:

Jun 22 2022, 1:25 PM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman committed rLPRI60f26db2ba1a: pki: Fix wrong cluster name for ML staging k8s (authored by klausman).
pki: Fix wrong cluster name for ML staging k8s
Jun 22 2022, 12:28 PM
klausman committed rLPRI45bed6f9e285: Add dummy secrets for ML staging k8s CA (authored by klausman).
Add dummy secrets for ML staging k8s CA
Jun 22 2022, 12:28 PM

Jun 21 2022

klausman created P29933 (An Untitled Masterwork).
Jun 21 2022, 10:51 AM

Jun 13 2022

klausman added a comment to T302195: Create the ml-serve-staging k8s cluster.

Istio config and (most of) the cert-manager config have been applied. For cert-manager, I need to sync up with Luca regarding part of said config referring to the ml-serve endpoints.

Jun 13 2022, 11:21 AM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)

May 13 2022

klausman closed T303801: Upgrade ORES to Debian Buster as Resolved.
May 13 2022, 9:06 AM · Machine-Learning-Team (Active Tasks), Patch-For-Review

May 10 2022

klausman claimed T303801: Upgrade ORES to Debian Buster.
May 10 2022, 2:21 PM · Machine-Learning-Team (Active Tasks), Patch-For-Review

Mar 22 2022

klausman committed rLPRI899d25e97d8d: labs: Add dummy keyfile for ML staging k8s in codfw (authored by klausman).
labs: Add dummy keyfile for ML staging k8s in codfw
Mar 22 2022, 4:06 PM
klausman committed rLPRIbd2fb2724109: hiera: Add k8s dummy tokens for ML staging env (authored by klausman).
hiera: Add k8s dummy tokens for ML staging env
Mar 22 2022, 3:51 PM
klausman committed rLPRId554eac0951b: hiera: add dummy tokens for ML staging k8s setup (authored by klausman).
hiera: add dummy tokens for ML staging k8s setup
Mar 22 2022, 2:24 PM

Mar 18 2022

klausman added a comment to T302701: Re-evaluate ip pools for ml-serve-{eqiad,codfw}.

I put the smaller staging allocation at the end to avoid fragmentation (at least for now; in my experience, it can't be avoided forever). Similarly, the Train/DSE range is "flipped" (/21 first) to avoid fragmentation between it and the preceding prod ranges. If smaller ranges are needed in EQIAD for future projects, they should follow the same scheme as the staging ranges in CODFW (allocate from the end, and try to avoid fragmentation using the same alternating-sizes pattern as for prod/train).

Mar 18 2022, 3:32 PM · Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
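
The allocation scheme described in the comment above (large ranges from the start of a pool, small ranges from the end, so the free space in between stays contiguous) can be sketched with Python's standard `ipaddress` module. This is only an illustration: the pool `10.0.0.0/20`, the prefix lengths, and the helper names `carve_from_start` and `carve_from_end` are hypothetical and are not the ranges actually assigned in the task.

```python
import ipaddress


def carve_from_start(pool, new_prefix):
    """Return the first subnet of the given prefix length inside pool."""
    network = ipaddress.ip_network(pool)
    return next(network.subnets(new_prefix=new_prefix))


def carve_from_end(pool, new_prefix):
    """Return the last subnet of the given prefix length inside pool."""
    network = ipaddress.ip_network(pool)
    return list(network.subnets(new_prefix=new_prefix))[-1]


# Hypothetical pool, chosen only for illustration.
POOL = "10.0.0.0/20"

prod = carve_from_start(POOL, 21)    # large range, allocated from the start
staging = carve_from_end(POOL, 24)   # small range, allocated from the end

print("prod:   ", prod)
print("staging:", staging)
print("overlap:", prod.overlaps(staging))
```

Keeping the small allocation at the far end leaves everything between the two ranges as one unbroken block, which is the property the comment is after.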
klausman added a comment to T302701: Re-evaluate ip pools for ml-serve-{eqiad,codfw}.

I have set up IP ranges (and sliced them up for our use):

Mar 18 2022, 3:29 PM · Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)

Mar 15 2022

klausman closed T302197: Create etcd cluster for ml-serve-staging k8s, a subtask of T302195: Create the ml-serve-staging k8s cluster, as Resolved.
Mar 15 2022, 5:55 PM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman closed T302197: Create etcd cluster for ml-serve-staging k8s as Resolved.
# etcdctl -C https://ml-staging-etcd2001.codfw.wmnet:2379  cluster-health 
member 493aa03d462725d1 is healthy: got healthy result from https://ml-staging-etcd2002.codfw.wmnet:2379
member b12825ca936a35a6 is healthy: got healthy result from https://ml-staging-etcd2003.codfw.wmnet:2379
member fce0f93975c27096 is healthy: got healthy result from https://ml-staging-etcd2001.codfw.wmnet:2379
cluster is healthy
#
Mar 15 2022, 5:55 PM · Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
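
A check like the `cluster-health` output above could be automated along these lines. This is a rough sketch that assumes the output format is exactly as pasted in the comment; the function name `parse_cluster_health` is hypothetical, and lines in any other format (for example, for unhealthy members) are simply ignored rather than interpreted.

```python
import re

# Matches lines of the form shown in the pasted output above.
MEMBER_RE = re.compile(r"^member (\w+) is healthy: got healthy result from (\S+)$")


def parse_cluster_health(text):
    """Return ({member_id: url} for healthy members, overall-healthy flag)."""
    members = {}
    cluster_ok = False
    for raw in text.splitlines():
        line = raw.strip()
        match = MEMBER_RE.match(line)
        if match:
            members[match.group(1)] = match.group(2)
        elif line == "cluster is healthy":
            cluster_ok = True
    return members, cluster_ok


SAMPLE = """\
member 493aa03d462725d1 is healthy: got healthy result from https://ml-staging-etcd2002.codfw.wmnet:2379
member b12825ca936a35a6 is healthy: got healthy result from https://ml-staging-etcd2003.codfw.wmnet:2379
member fce0f93975c27096 is healthy: got healthy result from https://ml-staging-etcd2001.codfw.wmnet:2379
cluster is healthy
"""

members, cluster_ok = parse_cluster_health(SAMPLE)
print(len(members), "healthy members; cluster healthy:", cluster_ok)
```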

Mar 14 2022

klausman committed rLPRI7e8dac2b291d: Add dummy key for ML staging etcd in codfw (authored by klausman).
Add dummy key for ML staging etcd in codfw
Mar 14 2022, 4:50 PM

Mar 2 2022

klausman closed T302198: Create ml-serve-staging k8s's control plane VMs, a subtask of T302195: Create the ml-serve-staging k8s cluster, as Resolved.
Mar 2 2022, 11:21 AM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman closed T302198: Create ml-serve-staging k8s's control plane VMs as Resolved.
Mar 2 2022, 11:21 AM · Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman merged T302504: New control plane VMs for ML staging cluster in codfw into T302198: Create ml-serve-staging k8s's control plane VMs.
Mar 2 2022, 11:20 AM · Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman merged task T302504: New control plane VMs for ML staging cluster in codfw into T302198: Create ml-serve-staging k8s's control plane VMs.
Mar 2 2022, 11:20 AM · Infrastructure-Foundations, vm-requests
klausman closed T302504: New control plane VMs for ML staging cluster in codfw, a subtask of T302195: Create the ml-serve-staging k8s cluster, as Resolved.
Mar 2 2022, 11:19 AM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman closed T302504: New control plane VMs for ML staging cluster in codfw as Resolved.
Mar 2 2022, 11:19 AM · Infrastructure-Foundations, vm-requests
klausman added a comment to T302504: New control plane VMs for ML staging cluster in codfw.

Both machines imaged with Bullseye and through their initial puppet run.

Mar 2 2022, 11:19 AM · Infrastructure-Foundations, vm-requests
klausman claimed T302504: New control plane VMs for ML staging cluster in codfw.
Mar 2 2022, 10:19 AM · Infrastructure-Foundations, vm-requests
klausman added a parent task for T302504: New control plane VMs for ML staging cluster in codfw: T302195: Create the ml-serve-staging k8s cluster.
Mar 2 2022, 10:18 AM · Infrastructure-Foundations, vm-requests
klausman added a subtask for T302195: Create the ml-serve-staging k8s cluster: T302504: New control plane VMs for ML staging cluster in codfw.
Mar 2 2022, 10:18 AM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman added a comment to T302504: New control plane VMs for ML staging cluster in codfw.
$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 codfw_B ml-staging-ctrl2001
ml-staging-ctrl2001                      1H IN A 10.192.16.93                    
ml-staging-ctrl2001                      1H IN AAAA 2620:0:860:102:10:192:16:93
MAC: aa:00:00:bb:2b:bc
$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 codfw_D ml-staging-ctrl2002
ml-staging-ctrl2002                      1H IN A 10.192.48.64                                           
ml-staging-ctrl2002                      1H IN AAAA 2620:0:860:104:10:192:48:64
MAC: aa:00:00:56:0b:51
Mar 2 2022, 10:06 AM · Infrastructure-Foundations, vm-requests

Mar 1 2022

klausman closed T302503: New VMs for ML staging cluster in codfw as Resolved.
Mar 1 2022, 6:53 PM · Patch-For-Review, Infrastructure-Foundations, vm-requests
klausman added a comment to T302503: New VMs for ML staging cluster in codfw.

All three VMs have been created, installed and have had at least one puppet run.

Mar 1 2022, 6:53 PM · Patch-For-Review, Infrastructure-Foundations, vm-requests
klausman added a comment to T302503: New VMs for ML staging cluster in codfw.
$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 3 --disk 20 codfw_B ml-staging-etcd2002
[...]
+ml-staging-etcd2002                      1H IN A 10.192.16.92                                                                                                                                           
+ml-staging-etcd2002                      1H IN AAAA 2620:0:860:102:10:192:16:92                                                                                                                         
Mar 1 2022, 2:50 PM · Patch-For-Review, Infrastructure-Foundations, vm-requests
klausman added a comment to T302503: New VMs for ML staging cluster in codfw.
$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 3 --disk 20 codfw_A ml-staging-etcd2001
[...]
+ml-staging-etcd2001                      1H IN A 10.192.0.37
+ml-staging-etcd2001                      1H IN AAAA 2620:0:860:101:10:192:0:37
[...]
Mar 1 2022, 2:31 PM · Patch-For-Review, Infrastructure-Foundations, vm-requests

Feb 28 2022

klausman created P21604 (An Untitled Masterwork).
Feb 28 2022, 4:45 PM
klausman added a comment to T302503: New VMs for ML staging cluster in codfw.

Correction: we should use ml-staging-etcdXXX as names for consistency

Feb 28 2022, 3:47 PM · Patch-For-Review, Infrastructure-Foundations, vm-requests
klausman added a comment to T302701: Re-evaluate ip pools for ml-serve-{eqiad,codfw}.

This might be useful:

Feb 28 2022, 12:49 PM · Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman added a comment to T302701: Re-evaluate ip pools for ml-serve-{eqiad,codfw}.

Is there currently any kind of auto-expire/auto-clean of old revisions? If not, does kserve have such functionality built-in somewhere? That might tide us over until we have a good plan for migrating to a bigger pool. I figure that in the long term, model #s will only increase, so we need a bigger pool even with swift expiry of old versions.

Feb 28 2022, 12:36 PM · Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)

Feb 24 2022

klausman created T302516: Help Language team to make progress on open MT models to be used by Content Translation tool.
Feb 24 2022, 5:37 PM · Machine-Learning-Team (Active Tasks)
klausman created T302504: New control plane VMs for ML staging cluster in codfw.
Feb 24 2022, 4:08 PM · Infrastructure-Foundations, vm-requests
klausman renamed T302503: New VMs for ML staging cluster in codfw from New VMs for ML staging cluster in eqiad to New VMs for ML staging cluster in codfw.
Feb 24 2022, 3:55 PM · Patch-For-Review, Infrastructure-Foundations, vm-requests
klausman created T302503: New VMs for ML staging cluster in codfw.
Feb 24 2022, 3:54 PM · Patch-For-Review, Infrastructure-Foundations, vm-requests

Feb 4 2022

klausman closed T299427: ml-serve2001 logged a corrected memory error as Resolved.

Yes, I think so. Since the reboot, everything has been quiet:

Feb 4 2022, 4:45 PM · SRE, ops-codfw, Lift-Wing

Jan 27 2022

klausman created T300259: Explore ingress filtering for Lift Wing.
Jan 27 2022, 3:21 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)

Jan 18 2022

klausman added a comment to T299427: ml-serve2001 logged a corrected memory error.
root@ml-serve2001:/sys/devices/system/edac/mc# grep .  mc*/*count
mc0/ce_count:0
mc0/ce_noinfo_count:0
mc0/ue_count:0
mc0/ue_noinfo_count:0
mc1/ce_count:0
mc1/ce_noinfo_count:0
mc1/ue_count:0
mc1/ue_noinfo_count:0
mc2/ce_count:0
mc2/ce_noinfo_count:0
mc2/ue_count:0
mc2/ue_noinfo_count:0
mc3/ce_count:0
mc3/ce_noinfo_count:0
mc3/ue_count:0
mc3/ue_noinfo_count:0

EDAC errors are cleared

Jan 18 2022, 4:58 PM · SRE, ops-codfw, Lift-Wing
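
The `grep` output above lists one counter per line in the form `mcN/name_count:value`. A small script could flag any non-zero counters; the following is a sketch that parses text in that format only. The function name `parse_edac_counts` is hypothetical, and the non-zero value in the sample is invented to show the filtering (the real counters in the comment were all zero).

```python
import re

# One counter per line, e.g. "mc0/ce_count:0".
COUNT_RE = re.compile(r"^(mc\d+)/(\w+_count):(\d+)$")


def parse_edac_counts(text):
    """Map "mcN/counter_name" to its integer value; skip non-matching lines."""
    counts = {}
    for raw in text.splitlines():
        match = COUNT_RE.match(raw.strip())
        if match:
            counts[f"{match.group(1)}/{match.group(2)}"] = int(match.group(3))
    return counts


# Invented sample: mc1/ce_count is non-zero purely for demonstration.
SAMPLE = """\
mc0/ce_count:0
mc0/ue_count:0
mc1/ce_count:2
mc1/ue_count:0
"""

counts = parse_edac_counts(SAMPLE)
nonzero = {name: value for name, value in counts.items() if value}
print("non-zero counters:", nonzero)
```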
klausman added a project to T299427: ml-serve2001 logged a corrected memory error: ops-codfw.
Jan 18 2022, 4:42 PM · SRE, ops-codfw, Lift-Wing
klausman created T299427: ml-serve2001 logged a corrected memory error.
Jan 18 2022, 4:41 PM · SRE, ops-codfw, Lift-Wing

Nov 25 2021

klausman added a comment to T289835: Create a LB service for inference.discovery.wmnet.
$ sudo confctl select 'cluster=ml_serve,service=kubesvc' set/pooled=yes:weight=1
The selector you chose has selected the following objects:
{"/codfw/ml_serve/kubesvc": ["ml-serve2002.codfw.wmnet", "ml-serve2004.codfw.wmnet", "ml-serve2001.codfw.wmnet", "ml-serve2003.codfw.wmnet"], "/eqiad/ml_serve/kubesvc": ["ml-serve1003.eqiad.wmnet", "ml-serve1004.eqiad.wmnet", "ml-serve1002.eqiad.wmnet", "ml-serve1001.eqiad.wmnet"]}
Ok to continue? [y/N]
confctl>y
codfw/ml_serve/kubesvc/ml-serve2002.codfw.wmnet: weight changed 1 => 1
codfw/ml_serve/kubesvc/ml-serve2004.codfw.wmnet: weight changed 1 => 1
codfw/ml_serve/kubesvc/ml-serve2001.codfw.wmnet: weight changed 1 => 1
codfw/ml_serve/kubesvc/ml-serve2003.codfw.wmnet: weight changed 1 => 1
eqiad/ml_serve/kubesvc/ml-serve1003.eqiad.wmnet: weight changed 1 => 1
eqiad/ml_serve/kubesvc/ml-serve1004.eqiad.wmnet: weight changed 1 => 1
eqiad/ml_serve/kubesvc/ml-serve1002.eqiad.wmnet: weight changed 1 => 1
eqiad/ml_serve/kubesvc/ml-serve1001.eqiad.wmnet: weight changed 1 => 1
WARNING:conftool.announce:conftool action : set/pooled=yes:weight=1; selector: cluster=ml_serve,service=kubesvc
$
Nov 25 2021, 3:21 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing

Oct 19 2021

klausman added a comment to T288789: API Gateway Integration.

Brain dump of a discussion I had with elukey follows. It's meant as a summary of functionality needed from the API Gateway and how it may tie in with config on our side.

Oct 19 2021, 2:09 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)

Jun 22 2021

klausman closed T285225: Update all Liftwing machines to latest packages as Resolved.

All updates complete, no errors since the reboots. Closing.

Jun 22 2021, 2:44 PM · Machine-Learning-Team (Active Tasks)

Jun 21 2021

klausman added a comment to T285225: Update all Liftwing machines to latest packages.

Now proceeding with eqiad. Updates should be the same. Will mention here otherwise.

Jun 21 2021, 2:16 PM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T285225: Update all Liftwing machines to latest packages.

ml-serve updates in codfw:

The following NEW packages will be installed:
  linux-image-4.19.0-17-amd64
The following packages will be upgraded:
  base-files debmonitor-client isc-dhcp-client isc-dhcp-common klibc-utils
  libgcrypt20 libglib2.0-0 libgnutls-dane0 libgnutls30 libklibc libmariadb3
  libxml2 linux-image-amd64 linux-perf linux-perf-4.19 mariadb-common
  python3-dnspython
17 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Jun 21 2021, 1:47 PM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T285225: Update all Liftwing machines to latest packages.

ml-serve-ctrl updates in codfw:

Jun 21 2021, 1:25 PM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T285225: Update all Liftwing machines to latest packages.

ml-etcd in codfw all have these updates:

Jun 21 2021, 1:13 PM · Machine-Learning-Team (Active Tasks)
klausman added a comment to T285225: Update all Liftwing machines to latest packages.

Clarification: this is only Debian-delivered and generic OS packages

Jun 21 2021, 1:12 PM · Machine-Learning-Team (Active Tasks)
klausman created T285225: Update all Liftwing machines to latest packages.
Jun 21 2021, 1:12 PM · Machine-Learning-Team (Active Tasks)

May 14 2021

klausman added a comment to T282802: Implement model storage for enwiki-goodfaith inference service.

@ACraze

$ ./model_upload.sh 
CHECKING FOR MODEL_BUCKET
Bucket 's3://wmf-ml-models/' created
UPLOADING enwiki.goodfaith.gradient_boosting.model to s3://wmf-ml-models/goodfaith/enwiki/202105140814
upload: 'enwiki.goodfaith.gradient_boosting.model' -> 's3://wmf-ml-models/goodfaith/enwiki/202105140814/enwiki.goodfaith.gradient_boosting.model'  [1 of 1]
 110612 of 110612   100% in    0s   339.32 KB/s  done
$
May 14 2021, 8:15 AM · Machine-Learning-Team (Active Tasks), artificial-intelligence, revscoring, Lift-Wing

Apr 26 2021

klausman added a comment to T280467: Naming convention for the model storage structure.

Oh yes, I'd say this is a best practice I recommend. It comes from tensorflow-model-server, which reads all subdirectories in the specified model directory and automatically picks up the largest value as the current model. So typically you'd have an incrementing number as the model version, although I find the timestamp more convenient (and if it's in the YYYYMMDDHHMM format, it will be incremental). Here's an example:

Apr 26 2021, 8:26 AM · Patch-For-Review, Lift-Wing, Machine-Learning-Team (Active Tasks)
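
The versioning convention described in the comment above (timestamped subdirectories in YYYYMMDDHHMM format, with the largest value treated as current) can be illustrated as follows. This is a sketch of the selection rule only, not the tensorflow-model-server implementation; the helper `latest_version` and the directory names are hypothetical.

```python
import re

# Exactly twelve digits: YYYYMMDDHHMM.
VERSION_RE = re.compile(r"\d{12}")


def latest_version(names):
    """Return the numerically largest timestamp-style name, or None."""
    versions = [name for name in names if VERSION_RE.fullmatch(name)]
    if not versions:
        return None
    return max(versions, key=int)


# Hypothetical directory listing; non-version entries are ignored.
dirs = ["202012311159", "202105140814", "README", "202104260826"]
print(latest_version(dirs))
```

Because the timestamp has a fixed width and runs from the most significant field (year) to the least (minute), numeric order and chronological order coincide.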

Apr 6 2021

klausman added a comment to T278238: Recreate ml-etcd2002 in a different row.

For the record: I just checked, and no, it wasn't you who created the VM on the wrong Ganeti host, but me. Thanks for fixing it!

Apr 6 2021, 7:50 AM · Patch-For-Review, Machine-Learning-Team, Lift-Wing
klausman closed T267670: (Need By: TBD) rack/setup/install ml-serve200[1-4] as Resolved.

Yes, this is all done!

Apr 6 2021, 7:48 AM · SRE, ops-codfw, DC-Ops

Mar 19 2021

klausman added a comment to T272918: Create ml-serve k8s cluster.

All worker nodes are now up and visible in both DCs:

ml-serve-ctrl1001:~$ kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
ml-serve1001.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.0.41     <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1002.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.16.183   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1003.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.32.81    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve1004.eqiad.wmnet   Ready    <none>   83m   v1.16.15   10.64.48.50    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve-ctrl2001 ~ $ kubectl get nodes -o wide
NAME                       STATUS   ROLES    AGE     VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
ml-serve2001.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.0.21    <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2002.codfw.wmnet   Ready    <none>   9m57s   v1.16.15   10.192.16.43   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2003.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.32.29   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
ml-serve2004.codfw.wmnet   Ready    <none>   84m     v1.16.15   10.192.48.11   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://18.9.1
Mar 19 2021, 12:54 PM · Machine-Learning-Team, Patch-For-Review, Lift-Wing

Mar 15 2021

klausman triaged T277492: Investigate separating k8s-level users between our k8s and thr ServiceOps k8s as Medium priority.
Mar 15 2021, 5:33 PM · Machine-Learning-Team
klausman committed rLPRI494f672411c9: hiera: move ML k8s worker secrets into the correct location (authored by klausman).
hiera: move ML k8s worker secrets into the correct location
Mar 15 2021, 4:44 PM
klausman committed rLPRIb84e8f31dbe3: hiera: add dummy secrets for ML k8s workers (authored by klausman).
hiera: add dummy secrets for ML k8s workers
Mar 15 2021, 4:37 PM

Mar 9 2021

klausman committed rLPRI9461a40d69b8: ml-k8s: Add dummy controllermanager_tokens (authored by klausman).
ml-k8s: Add dummy controllermanager_tokens
Mar 9 2021, 2:02 PM

Mar 5 2021

klausman committed rLPRIb9c62c28e10d: ml-ctrl: Add dummy keys for ML k8s control plane (authored by klausman).
ml-ctrl: Add dummy keys for ML k8s control plane
Mar 5 2021, 4:19 PM

Feb 26 2021

klausman added a comment to T275852: Investigate potential issues with the sudoeres env_keep values.

TMUX being visible is, as mentioned, not a security issue when sudo'ing to non-root. The var contains just a path, with its own permissions, and racing attacks with symlinks are unlikely to work since the enclosing dir is typically owned by the user, or it has tempdir semantics. It is theoretically possible that a bad user config makes this attackable, but it still seems very remote compared to other threats.

Feb 26 2021, 2:42 PM · User-jbond, Security, SRE

Feb 25 2021

klausman added a comment to T275630: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster.

Created ml-serve-ctrl1001 and ml-serve-ctrl1002 in eqiad, rows B and D.

Feb 25 2021, 2:36 PM · Patch-For-Review, vm-requests, SRE

Feb 24 2021

klausman created T275630: eqiad/codfw: 2x2 VM request for ML-Serve Kubernetes cluster.
Feb 24 2021, 2:45 PM · Patch-For-Review, vm-requests, SRE

Feb 22 2021

klausman updated the language for P14444 (An Untitled Masterwork) from autodetect to python.
Feb 22 2021, 4:02 PM
klausman created P14444 (An Untitled Masterwork).
Feb 22 2021, 4:02 PM

Feb 16 2021

klausman added a comment to T273071: Create etcd VMs for use with ML platform.
root@ml-etcd2001:~# etcdctl -C https://ml-etcd2001.codfw.wmnet:2379 cluster-health
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
Feb 16 2021, 4:35 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman closed T273071: Create etcd VMs for use with ML platform, a subtask of T272918: Create ml-serve k8s cluster, as Resolved.
Feb 16 2021, 4:12 PM · Machine-Learning-Team, Patch-For-Review, Lift-Wing
klausman closed T273071: Create etcd VMs for use with ML platform as Resolved.
Feb 16 2021, 4:12 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman added a comment to T273071: Create etcd VMs for use with ML platform.
root@ml-etcd1001:~# etcdctl  -C https://ml-etcd1001.eqiad.wmnet:2379 cluster-health
member 27250fb9655951c0 is healthy: got healthy result from https://ml-etcd1003.eqiad.wmnet:2379
member bec8796f64226950 is healthy: got healthy result from https://ml-etcd1002.eqiad.wmnet:2379
member ec678e26e1c1f07a is healthy: got healthy result from https://ml-etcd1001.eqiad.wmnet:2379
cluster is healthy
root@ml-etcd1001:~#
Feb 16 2021, 4:12 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman committed rLPRI0fe04ab80042: secrets: Add dummy keys for ml_etcd clusters (authored by klausman).
secrets: Add dummy keys for ml_etcd clusters
Feb 16 2021, 2:07 PM

Feb 3 2021

klausman added a comment to T273071: Create etcd VMs for use with ML platform.

All machines are now base installed (puppet-runs done with insetup).

Feb 3 2021, 3:41 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman added a comment to T273074: eqiad: 3 VM request for ML team etcd.

All machines are now base installed (puppet-runs done with insetup).

Feb 3 2021, 3:41 PM · vm-requests, SRE
klausman added a comment to T273075: codfw: 3 VM request for ML team etcd.

All machines are now base installed (puppet-runs done with insetup).

Feb 3 2021, 3:41 PM · vm-requests, SRE

Feb 2 2021

klausman added a comment to T273075: codfw: 3 VM request for ML team etcd.

$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 3 --disk 20 codfw_B ml-etcd2001.codfw.wmnet
IPv4: 10.192.16.44/22
IPv6: 2620:0:860:102:10:192:16:44/64
MAC: aa:00:00:71:6a:f3

Feb 2 2021, 2:38 PM · vm-requests, SRE
klausman added a comment to T273074: eqiad: 3 VM request for ML team etcd.

$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 3 --disk 20 eqiad_B ml-etcd1001.eqiad.wmnet
IPv4: 10.64.16.200
IPv6: 2620:0:861:102:10:64:16:200
MAC: aa:00:00:ef:5f:2d

Feb 2 2021, 2:00 PM · vm-requests, SRE

Jan 28 2021

klausman added a comment to T273075: codfw: 3 VM request for ML team etcd.

@klausman Could you add the new cluster prefixes for ml (ml-etcd and others) to https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers ? that would be nice, thank you!

Jan 28 2021, 12:33 PM · vm-requests, SRE

Jan 27 2021

klausman added a comment to T273071: Create etcd VMs for use with ML platform.

VM requests in:

Jan 27 2021, 4:19 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman created T273075: codfw: 3 VM request for ML team etcd.
Jan 27 2021, 4:18 PM · vm-requests, SRE
klausman created T273074: eqiad: 3 VM request for ML team etcd.
Jan 27 2021, 4:17 PM · vm-requests, SRE
klausman added a comment to T273071: Create etcd VMs for use with ML platform.

Hostnames: ml-etcd100x.eqiad and ml-etcd200x.codfw
For networking, we want row diversity, which should be easy enough for VMs this tiny.

Jan 27 2021, 4:02 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman added a subtask for T272918: Create ml-serve k8s cluster: T273071: Create etcd VMs for use with ML platform.
Jan 27 2021, 3:54 PM · Machine-Learning-Team, Patch-For-Review, Lift-Wing
klausman added a parent task for T273071: Create etcd VMs for use with ML platform: T272918: Create ml-serve k8s cluster.
Jan 27 2021, 3:54 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
klausman created T273071: Create etcd VMs for use with ML platform.
Jan 27 2021, 3:53 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)

Jan 25 2021

klausman added a comment to T272687: Give access to ml-serve* to the non-ops members of the ML team.

@klausman (adding a comment here in case it was missed from the meeting) when this access is revoked and the hacking is over, we should rebuild all the machines using the re-image script so that we can ensure they can be automatically installed and no manual steps were missed.

Jan 25 2021, 5:58 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
klausman edited projects for T272687: Give access to ml-serve* to the non-ops members of the ML team, added: Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.
Jan 25 2021, 4:38 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests

Jan 22 2021

klausman closed T272687: Give access to ml-serve* to the non-ops members of the ML team as Resolved.
Jan 22 2021, 4:58 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests