Fri, May 27
Aiko deployed the change to deployment-prep, and it looks very good:
All pods running revscoring 2.11.4. Tested various endpoints and I can see scores correctly, no errors in the logs.
Kevin was able to deploy without issues, so I think we can close this for the moment!
Kevin's and Aiko's users are now in the deployment POSIX group, so they should be able to deploy now. Let's try to do it before closing the task :)
On contint1001 I see the following in /var/log/zuul/merger-debug.log:
Thu, May 26
Quick note - these repositories are the same ones used in the ores-deploy Gerrit repo. Usually we mirror code from GitHub to Gerrit, and then we deploy it to ORES. Since what we are doing is just a changelog/version bump, we can skip (in my opinion) the deployment to ORES. We are not touching any model binary etc., so it is safe to skip an update.
Created https://github.com/wikimedia/draftquality/pull/43 for draftquality!
Wed, May 25
For articlequality, we first need to solve T309205 to release the new version of articlequality on PyPI.
The enwiki-goodfaith pod is now running revscoring 2.11.4 and I can successfully get scores without any weird error logged. I'll follow up with a code change to apply it to the other pods.
@achou I am curious about the running processes inside the pod now that we use KServe - what do you see if you run ps aux | grep python and ps -eLf | grep python? My understanding is that every Ray worker should be a separate Python process, so it would be very interesting to confirm. We currently have memory/CPU restrictions for every pod in production, so we would probably have to tune settings for this use case. For example, with 2 Ray workers I'd expect to see two extra Python processes; a sketch of the checks is below:
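A minimal sketch of the checks, assuming the container image ships procps (the expected process count is my assumption of how Ray spawns workers):

# list all processes and filter for Python interpreters; the [p] trick
# keeps the grep command itself out of the output. With 2 Ray workers
# we'd expect the main server process plus two worker processes.
ps aux | grep '[p]ython'

# list every thread (LWP) per process, to tell extra threads inside one
# Python process apart from genuinely separate worker processes.
ps -eLf | grep '[p]ython'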
To keep archives happy - I am having a chat with Eric over email about this cluster and its future usage. The AQS Cassandra cluster should become a multi-tenant/multi-DC cluster able to support various use cases, so we need to decide whether ml-cache is a valid use case for a standalone cluster or not.
Yep, sorry, I forgot a few details, nice :) Before finishing, let's expand https://wikitech.wikimedia.org/wiki/ORES/Deployment#Deploy_to_the_test_server with the steps to follow!
Tue, May 24
In theory this has been solved with https://gerrit.wikimedia.org/r/c/operations/puppet/+/785111, so I am closing the task.
Time flies and both ROCm and tensorflow-io got several releases.
Coming back to this after T303801. We migrated ORES to Debian Buster and Python 3.7, updating wheels and dependencies. The revscoring library was fully compatible with the new setup, and it is now working like a charm.
@colewhite hi! There is no rush at the moment of course, but I am wondering which remaining clients need to be migrated before we can switch the brokers' TLS certs to PKI.
Change is rolled out everywhere, and now we have sane defaults in profile::kafka::broker.
Mon, May 23
The three Kafka clusters in deployment-prep are now using the new uid/gid. Before turning the profile::kafka::broker::use_fixed_uid_gid option on by default, I'll follow up with SRE to verify that no other cluster is left to move.
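A quick way to double check on each broker host (the kafka user name and the /srv/kafka path are assumptions):

# the uid/gid reported here should match the new fixed values
id kafka
# data directories should be owned by that same user/group
stat -c '%U %G (%u:%g)' /srv/kafka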
Fri, May 20
Things to do (in my opinion):
@thcipriani Hi! When you have a moment, could you please review this request and let me know if it is a good use case for deployment? Thanks :)
Thu, May 19
I was able to use wrk, a very interesting tool installed on deploy1002. We can drive it with Lua scripts like the following:
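(A minimal sketch - the endpoint and JSON payload are placeholders, not the exact script I used.)

# post.lua - wrk Lua hooks to turn the benchmark into JSON POSTs
cat > post.lua <<'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"rev_id": 12345}'
EOF

# 4 threads, 50 connections, 30s run, with latency percentiles
wrk -t4 -c50 -d30s -s post.lua --latency https://inference.example.wmnet/v1/models/enwiki-goodfaith:predict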
It is very weird: siege supports HTTP/1.1, but I see the following:
Interesting discovery - it seems that my previous tests with ab and siege used HTTP/1.0, not 1.1, and the responses from Istio were all 426 Upgrade Required (so not really representative of HTTP traffic handled by a single pod). The ab tool does not seem to support HTTP/1.1 yet; siege should support it, but I am using the version installed on deploy1002, which could be outdated.
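For reference, the protocol difference can be reproduced with curl (hypothetical URL; -w '%{http_code}' prints only the status code):

# HTTP/1.0 request: expect Istio's 426 Upgrade Required
curl -s -o /dev/null -w '%{http_code}\n' --http1.0 https://inference.example.wmnet/v1/models/enwiki-goodfaith
# HTTP/1.1 request: expect a normal 200
curl -s -o /dev/null -w '%{http_code}\n' --http1.1 https://inference.example.wmnet/v1/models/enwiki-goodfaith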
Published https://pypi.org/project/revscoring/2.11.4/ from a Python 3.7 environment (just to be extra sure). The size of the wheel seems to be the same as 2.11.2's, so probably some compression change happened for Python 3.7 (IIRC, 2.11.1 was the last version released with Python 3.5).
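A sketch of one plausible publish flow (the authoritative steps are on the wikitech page):

# build and upload from a clean Python 3.7 virtualenv
python3.7 -m venv venv && . venv/bin/activate
pip install --upgrade setuptools wheel twine
python setup.py sdist bdist_wheel
twine upload dist/revscoring-2.11.4*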
Wed, May 18
After a chat with the team, we decided to keep the tornado workers setting at 1 (the default), and try the auto-scaling features offered by Knative (min/max replicas etc.).
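For reference, a minimal sketch of the Knative knobs involved (the service name and image below are hypothetical):

kubectl apply -f - <<'EOF'
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: enwiki-goodfaith-example
spec:
  template:
    metadata:
      annotations:
        # scale between 1 and 4 pod replicas based on load
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "4"
    spec:
      containers:
        - image: example/revscoring-model:latest
EOF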
The tricky bit is making sure that clients support the Root PKI CA, but I agree that it would be a great improvement for Cassandra!
2.11.3 is live, I have also created https://wikitech.wikimedia.org/wiki/ORES/Deployment#Update_revscoring_in_PyPI.
root@deploy1002:~# kubectl label nodes ml-serve-ctrl2001.codfw.wmnet node-role.kubernetes.io/master=""
node/ml-serve-ctrl2001.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve-ctrl2002.codfw.wmnet node-role.kubernetes.io/master=""
node/ml-serve-ctrl2002.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2001.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2001.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2002.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2002.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2003.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2003.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2004.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2004.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2005.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2005.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2006.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2006.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2007.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2007.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2008.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2008.codfw.wmnet labeled
Resetting the task to open, since I think that Kevin and Aiko should end up in the deployment group. They will not need all the sudo capabilities for MediaWiki etc., but as far as I can see the group already includes people who don't need them either. We'll probably need to segment deployment further in the future, we'll see :)
Tue, May 17
I merged two changes for the ml-serve-eqiad cluster, and now the concerns expressed in T306649#7881940 should be gone:
Mon, May 16
Added the proposed node labels to ml-serve-eqiad via T308418#7930118. At this point I'll wait to see which strategy is best between GlobalNetworkSet and fake nodes, and then we'll be able to test on ml-serve.
Nice catch, TIL about wmflib::resource_hosts, thanks John!
I realized that the copy of the revscoring repository from which I published 2.11.2 may not have had the correct commit from Aiko, so I created https://github.com/wikimedia/revscoring/pull/520 to release 2.11.3 and be sure. Sorry for the trouble, I'll update the docs once done.
root@deploy1002:~# kubectl label nodes ml-serve1001.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1001.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1002.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1002.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1003.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1003.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1004.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1004.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1005.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1005.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1006.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1006.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1007.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1007.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1008.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1008.eqiad.wmnet labeled
This may be due to T303559 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/771441, @jbond do you have any idea?
Tried to recheck: afaics wmflib::resource_hosts is called by profile::scap::dsh, but I cannot reach puppetdb03 from deploy03. The firewall rules on puppetdb03 confirm what I am seeing: no rule allows traffic from deploy03 to port 443, afaics.
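For reference, a quick probe along these lines (short hostname as seen from deploy03; adjust the FQDN as needed):

# probe TCP port 443 on puppetdb03; -z just scans, -w sets a 5s timeout
nc -vz -w 5 puppetdb03 443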
Fri, May 13
Another use case mentioned in T307927#7921020 is that, IIUC, the /etc/helmfile/private config files changed group ownership as well, impacting ML deployers. For example, Aiko was able to set a home-local HELM_REPOSITORY_CACHE successfully, but then helmfile diff prompted the removal of a Secret because some private files were no longer readable.
I may have created this task too soon; some discussion on T305729 is still happening, so let's wait before proceeding.
Reopening since it seems that more discussion is needed :)
Thu, May 12
Quick question about how to proceed. Would it make sense to start by adding manual labels in the ml-serve-eqiad cluster (since we have new E/F nodes there) to see if everything works as expected? After this verification we could start thinking about how/where to get the node label info, and how to better share/maintain the Calico BGP configs.
Wed, May 11
https://netbox.wikimedia.org/api/extras/job-results/3032166/ worked, so it was probably a timing-related race condition.
I confirm from Aiko's tests that HELM_CACHE_HOME is the problem, so we can try to set it differently for the various groups. From what I can see in puppet, it could be as simple as the sketch below:
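(Group name and path below are assumptions; the real change would be a puppet-managed file.)

# /etc/profile.d/helm-ml.sh - give ML deployers their own helm cache
# location instead of the shared default
if id -nG "$USER" | grep -qw deploy-ml-service; then
    export HELM_CACHE_HOME="${HOME}/.cache/helm"
fi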
Ah snap, I thought it was deleted; yes please, it can all be cleaned up!