Luca has raised a few questions that may reveal relevant information:
Nov 14 2023
Oct 24 2023
Oct 18 2023
Oct 12 2023
Oct 10 2023
Oct 5 2023
Oct 4 2023
File T348144 for decomming.
After discussion on IRC, I have also shut down 1001 and 2001.
The machines ores[2001-2009].codfw.wmnet and ores[1001-1009].eqiad.wmnet have been shut down (except 1001 and 2001, which are still running in case we need files from them).
Oct 2 2023
Do you think it would be useful to also keep the checksums in a different place, with permissions independent of the backing store behind the published/ directory?
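As a rough illustration of that idea (the paths and the use of sha256 below are assumptions, not the actual setup), a checksum manifest kept outside the backing store could look like this:

```python
# Sketch: write sha256 checksums of everything under published/ to a manifest
# that lives on a separate filesystem with its own permissions, so the backing
# store alone cannot silently rewrite both the data and its checksums.
# Both paths are hypothetical.
import hashlib
from pathlib import Path

PUBLISHED = Path("/srv/published")                   # backing store behind published/
MANIFEST = Path("/srv/checksums/published.sha256")   # independently-permissioned location

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

lines = [
    f"{sha256_of(p)}  {p.relative_to(PUBLISHED)}"
    for p in sorted(PUBLISHED.rglob("*"))
    if p.is_file()
]
MANIFEST.write_text("\n".join(lines) + "\n")
```

Verification then only needs read access to both locations, e.g. `sha256sum -c` against the manifest from within the published/ directory.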
SLO dashboard now available at: https://grafana-rw.wikimedia.org/d/slo-Lift_Wing_Readability/lift-wing-readability-slo-s?orgId=1
Sep 29 2023
Sep 28 2023
Sep 25 2023
Sep 20 2023
I've done 1002-1008 today, and everything went smoothly. All done!
Sep 19 2023
I am very much in favor of this scheme.
Sep 13 2023
The service has been moved from the experimental namespace to the readability namespace in staging-codfw, and newly deployed to the same namespace in serve-codfw and serve-eqiad.
Sep 12 2023
One thing of note: after elevating the tier as Luca did yesterday, the token has to be re-issued using the web UI to have the new limit baked into it.
Aug 22 2023
2007 and 2008 are now also done, again without problems.
Aug 21 2023
Machines ml-serve2001-2006 are now done. Zero errors or irregularities. Will do 7 and 8 later this week.
Aug 15 2023
(copied from T343900, this ticket is more appropriate for this info)
The problem is only really relevant for LLMs (Large Language Models), since they need more local disk space (or at least the specific ones we tried did). We have plenty of disk space on our workers so far, so having a bigger kubelet partition/fs is quite feasible.
I have done ml2002 and ml2003 today (two machines, to force some pods back onto 2002 and see that it works properly). So far, everything seems fine.
Aug 14 2023
Should be all clean now:
Aug 11 2023
Upsides of caching local-ish to LW (e.g. Cassandra):
Aug 10 2023
While making/experimenting with the SLO dashboard, it became clear that the label cardinality of our input metrics is so high (>10k) that direct computation from the input metrics is not feasible: the result set is so large that we get at best partial results.
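For illustration, pushing the aggregation into the query (or into a recording rule) keeps the result set small regardless of the input cardinality. A minimal sketch, assuming standard Istio metric names; the Thanos URL and the service selector are made up:

```python
# Sketch: aggregate away the high-cardinality labels server-side before the
# result ever reaches the dashboard, instead of pulling the raw series.
import requests

THANOS = "https://thanos-query.example.org"  # hypothetical endpoint

# Collapse everything except the response code, so the result stays tiny no
# matter how many per-pod/per-revision series feed into it.
QUERY = (
    'sum by (response_code) ('
    'rate(istio_requests_total{destination_service_name="readability-predictor"}[5m])'
    ')'
)

resp = requests.get(f"{THANOS}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("response_code"), series["value"][1])
```

For the SLO dashboard itself, the same aggregation would live in a recording rule so Grafana only ever queries the pre-computed, low-cardinality series.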
Aug 8 2023
Aug 3 2023
Jul 26 2023
Jul 25 2023
I've also run the same test tool against RR-ML and got much worse latency overall (though no errors, which is great):
Jul 24 2023
Jul 20 2023
In the ingress gateway, these 503s look like this:
I am still seeing (rare) 503s, and in the queue proxy pod this is logged:
With 400(!) workers, for 5m:
With the fixed version deployed to all clusters, I ran a load test again. Note that the throughput would likely be higher with more workers (i.e. it's limited by the client, not the inference service).
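For reference, the general shape of such a client-driven load test; this is only a sketch, and the endpoint and payload below are placeholders rather than the actual test tool:

```python
# Sketch of a client-side load test: 400 workers hammering one inference
# endpoint for 5 minutes and tallying response codes (503s would show up here).
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://inference.example.org/v1/models/readability:predict"  # placeholder
PAYLOAD = {"rev_id": 12345}                                           # placeholder
WORKERS = 400
DURATION_S = 300  # 5 minutes

def worker(deadline: float) -> Counter:
    counts: Counter = Counter()
    while time.monotonic() < deadline:
        try:
            r = requests.post(URL, json=PAYLOAD, timeout=10)
            counts[r.status_code] += 1
        except requests.RequestException:
            counts["client_error"] += 1
    return counts

deadline = time.monotonic() + DURATION_S
totals: Counter = Counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for result in pool.map(worker, [deadline] * WORKERS):
        totals += result
print(totals)
```

With a single client machine this easily becomes CPU- or connection-limited on the client side, which is why the measured throughput is a lower bound for what the service can handle.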
I have done some testing today:
Jul 18 2023
In an effort to solve the practical problem (getting good RR inference without too many errors and timeouts), I'll do some testing on the other RR model (agnostic) to see if it's suitable as an alternative (maybe temporarily).
Jul 17 2023
I've now also managed to add some latency bucketing. I'm not 100% sure yet that it's what we want, but in any case, it's progress.
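For context, the bucketing works off the duration histograms Istio already exports. Below is a minimal sketch of a latency SLI built from them; the service selector, the 500 ms threshold and the endpoint are assumptions, not the dashboard's actual definition:

```python
# Sketch: share of requests answered within 500 ms over the last 5 minutes,
# computed from Istio's request-duration histogram buckets.
import requests

THANOS = "https://thanos-query.example.org"  # hypothetical endpoint

SLI_QUERY = """
sum(rate(istio_request_duration_milliseconds_bucket{
      destination_service_name="readability-predictor", le="500"}[5m]))
/
sum(rate(istio_request_duration_milliseconds_count{
      destination_service_name="readability-predictor"}[5m]))
"""

r = requests.get(f"{THANOS}/api/v1/query", params={"query": SLI_QUERY}, timeout=30)
r.raise_for_status()
print(r.json()["data"]["result"])  # a single series: the fraction under the threshold
```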
I made some progress on the experimental dashboard (https://grafana.wikimedia.org/goto/VSolQfj4k?orgId=1). I now have a better mental model of the request-count metrics (and 200 vs. non-200). The current setup is not ML-specific but Istio-specific, so it may not apply to all k8s setups, only to those where Istio is used.
I am working in Grafana/Thanos directly for now because it gives a shorter change-and-try loop for finding the right metrics than going through Grizzly. Even with templating, we still need specific metrics to put into slo_definitions.libsonnet. Arguably, that's the most difficult part, due to the sheer number of metrics surfaced even just by our own Prometheus instance.
Jul 14 2023
I've started an SLO dashboard at https://grafana.wikimedia.org/goto/x7S0HpjVk?orgId=1. It only has one metric (Latency) so far, but it's a start.
Jul 13 2023
Digging into the logging of the kserve and istio containers a bit, I have found a few things:
Jul 12 2023
Correction: the iDRAC is still down.
The iDRAC is reachable again, so it likely was a different issue.
Jul 3 2023
shows URX error state, aka:
Jun 21 2023
Luca and I had a longer discussion via mail and IRC about whether the backend-induced latency of an Inference Service should count towards the SLO budget or not.
Jun 20 2023
Change 930610 has been pushed to prod, so now we get the full feed from changeprop.
Jun 15 2023
I discussed the above questions with Luca today, and I think for now we can proceed with telling WME to start exploring the documentation we have (and tell us where the gaps are) and to start testing against Lift Wing/APIGW. This should surface any issues that might still be there, even if the actual implementation of access to LW and rate limiting changes in the future.
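As a starting point for that testing, the kind of black-box check that shows which limit actually applies end-to-end is simply counting requests until the gateway starts returning 429s. The model name and payload below are illustrative:

```python
# Sketch: probe the effective per-token rate limit of a Lift Wing call made
# through the API Gateway by counting successes before the first 429.
import requests

URL = ("https://api.wikimedia.org/service/lw/inference/v1/models/"
       "enwiki-goodfaith:predict")      # public API GW path; model is an example
TOKEN = "..."                           # personal API token from the API portal

ok = 0
with requests.Session() as s:
    s.headers["Authorization"] = f"Bearer {TOKEN}"
    for _ in range(1000):
        r = s.post(URL, json={"rev_id": 12345})
        if r.status_code == 429:        # rate-limited by the gateway
            print(f"{ok} requests succeeded before the first 429")
            break
        ok += 1
    else:
        print(f"no 429 within {ok} requests")
```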
After some experimenting, the current state of how rate limits are applied for API tokens, the API gateway, and Lift Wing seems to be this:
Jun 12 2023
A few thoughts:
Jun 9 2023
This was caused by me using the wrong host.
Jun 5 2023
May 31 2023
May 30 2023
May 26 2023
After examining the setup some more, I figure I can delete the images on S3 as well; they are easy enough to reproduce with the docs I have. So this has been completed.
May 25 2023
Changes have been merged and deployed. Both the eqiad and codfw (and staging) sections of the API GW work fine (tested from within the clusters), as does remote access (from my home machine).
May 24 2023
This should probably wait until we have updated to kserve 0.11 (T337213); I will tackle it after that.
Instances have been shut down and S3 has been cleared of all but the latest checkpoint etc.