User Details
- User Since: Jul 26 2022, 2:11 PM (60 w, 3 d)
- Availability: Available
- IRC Nick: claime
- LDAP User: Clément Goubert
- MediaWiki User: CGoubert-WMF
Mon, Sep 18
The prometheus-statsd-exporter container and its configuration are already deployed as a side-car for a number of Kubernetes services:
```
deployment-charts/charts on master ❯ git grep statsd-exporter | cut -d/ -f1 | sort -u
apertium api-gateway blubberoid changeprop developer-portal eventstreams linkrecommendation machinetranslation mediawiki-dev miscweb mobileapps recommendation-api shellbox termbox thumbor toolhub wikifeeds
```
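A quick way to confirm the side-car on a running service is to list the containers in its deployment; a minimal sketch, where the namespace is just an example picked from the list above:
```
# List container names in a deployment's pod template (namespace is an example from the list above);
# the prometheus-statsd-exporter side-car should show up alongside the main service container.
kubectl -n eventstreams get deployment -o jsonpath='{.items[*].spec.template.spec.containers[*].name}'
```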
p50 latency increased slightly; we may want to up the concurrency a little and see what shakes out.
Example mw-web eqiad
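For reference, a sketch of the kind of query behind such a latency panel; the Prometheus endpoint and metric name are assumptions, not the actual dashboard query:
```
# Sketch only: endpoint and metric name are assumptions, not the real dashboard query.
curl -sG 'http://prometheus.svc.eqiad.wmnet/k8s/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.5, sum by (le) (rate(envoy_request_duration_seconds_bucket[5m])))'
```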
Fri, Sep 15
Repooled, thank you!
Thu, Sep 14
We are now serving 5% of global traffic from mw-on-k8s. Resolving.
Wed, Sep 13
I'm putting mw2444 back into pooled=no (instead of pooled=inactive) so it gets scap updates and stops warning; however, I'll wait until we're sure it's stable before actually putting it back into production.
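For reference, a sketch of the pooling change with conftool; the exact confctl invocation is an assumption based on the usual syntax, not copied from the task:
```
# Assumed confctl syntax (not copied from the task); host name from the comment above.
sudo confctl select 'name=mw2444.codfw.wmnet' set/pooled=no
# Once we're confident it's stable:
sudo confctl select 'name=mw2444.codfw.wmnet' set/pooled=yes
```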
Recap from IRC discussion:
As we've given up on the first puppet run working completely, it doesn't make sense to put effort into fixing the first-order root cause of this particular issue, which is that /var/log/envoy permissions are wrong on first run.
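A minimal sketch of the manual workaround after a failed first run; the envoy:envoy ownership and mode are assumptions, since the comment only states that the permissions are wrong:
```
# Check the directory that breaks on the first run, then fix it by hand.
ls -ld /var/log/envoy
# Assumed ownership/mode; the comment only says the permissions are wrong on first run.
sudo chown envoy:envoy /var/log/envoy
sudo chmod 0755 /var/log/envoy
```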
Mon, Sep 11
Thanks @Jhancock.wm !
Fri, Sep 8
No problem with the general approach. I propose using a _shellbox_common_ directory, like the _aqs2-common_ and _mediawiki-common_ directories we have in helmfile.d/services, and symlinking from there.
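A rough sketch of the proposed layout, assuming a placeholder values file name and an example service directory (the real file and directory names in deployment-charts may differ):
```
cd helmfile.d/services
mkdir shellbox_common
# Shared values live once in shellbox_common/ (file name is a placeholder)...
touch shellbox_common/values-common.yaml
# ...and each shellbox deployment symlinks them (service directory name is an example).
ln -s ../shellbox_common/values-common.yaml shellbox-constraints/values-common.yaml
```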
Thu, Sep 7
A subsequent deployment didn't trigger that error again; I think we can file it as a transient issue with one pod on startup. We will look into it further if it happens again.
Since it only impacts one pod, has a reqId and an actual request (meaning it's a runtime error, not a startup/load-time error), and didn't log anything afterwards despite serving many requests, I'm downgrading to medium on the assumption that it's a transient error. I will keep an eye on subsequent deployments to see if it pops up again.
Wed, Sep 6
The server came back up as I pressed submit; however, there is still an issue with the management interface. It does power-cycle the server when asked, but states "Unable to perform requested operation".
Tue, Aug 29
CPU limits have now been removed on all mw-on-k8s deployments except mw-misc. We'll wait a few days to see how the reduced concurrency impacts latency, if at all, then resolve this task.
Just FYI, JS and CSS are currently broken on prometheus-{eqiad,codfw}.wikipedia.org due to 401 and 403 errors, with some CORS errors sprinkled in.
Mon, Aug 28
We are still experiencing issues: some log messages are being escaped into single-byte ISO-8859-1 values instead of multi-byte UTF-8 sequences.
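To illustrate the difference outside the logging pipeline itself, here is the same character encoded as multi-byte UTF-8 versus single-byte ISO-8859-1 (assumes a UTF-8 terminal):
```
# "é" is two bytes in UTF-8 but one byte in ISO-8859-1.
printf 'é' | xxd -p                                   # c3a9
printf 'é' | iconv -f UTF-8 -t ISO-8859-1 | xxd -p    # e9
```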
Fri, Aug 25
@tstarling I've verified that the above requests work with the correct new configurations applied for pcre.backtrack_limit and max_execution_time after deploying to mw-debug and forcing the requests there through XWD. My checks for the ini values in debug are good as well.
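For reference, a sketch of that kind of check; the URL and the X-Wikimedia-Debug backend value are examples, not the actual requests from this task:
```
# Force a request through a debug backend (header value and URL are examples).
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' \
  'https://en.wikipedia.org/wiki/Special:BlankPage'
# Confirm the effective ini values on the debug side.
php -r 'var_dump(ini_get("pcre.backtrack_limit"), ini_get("max_execution_time"));'
```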
Thu, Aug 24
Resolving, feel free to reopen if there are still any issues.
Aug 23 2023
Thank you, and sorry for the out-of-order operation.
Dumping the envoy configuration in one of our containers, combined with the fact that no CLI flag is set for it, shows that envoy sets its number of worker threads to the number of hardware cores.
In other words, we have a CPU limit of 500m on a 48-core machine, so each worker thread gets ~10m of CPU (this is just for illustration purposes; the allocation doesn't actually work like that).
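For context, envoy's worker-thread count can be pinned explicitly with its --concurrency flag; the value below is illustrative, not what we ended up using:
```
# Without --concurrency, envoy spawns one worker thread per hardware thread (48 here).
# Pinning it explicitly (illustrative value):
envoy -c /etc/envoy/envoy.yaml --concurrency 2
# Back-of-the-envelope from the comment: 500m CPU limit / 48 worker threads ≈ 10m per thread.
```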
Aug 22 2023
Everything is looking OK. We'll see how it copes with doubling the incoming traffic from T341780: Direct 5% of all traffic to mw-on-k8s (only going to 2% for now) and resolve afterwards if everything stays OK.
Pending more hardware, we will move on to 2% first.
Aug 21 2023
All deployments of mw-on-k8s are now using the following (a quick spot-check sketch follows the list):
- Autocomputed CPU requests, no limits
- Autocomputed Memory requests and limits
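A minimal spot-check sketch, assuming the mw-web namespace as an example; requests should be set, CPU limits absent, and memory limits present:
```
# Dump the pod template's resources for every deployment in the namespace (namespace is an example).
kubectl -n mw-web get deployment -o jsonpath='{.items[*].spec.template.spec.containers[*].resources}'
```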
Aug 18 2023
Considering there's no reservation for system resources at the moment, I feel like that would be a better solution than doing nothing, especially as we increase requests for T342748.
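For reference, a sketch of how to see what (if anything) is currently reserved on a node; the node name is an example and the values mentioned are made up, but systemReserved/kubeReserved are the standard kubelet settings involved:
```
# Inspect the running kubelet config on a node (node name is an example).
kubectl get --raw '/api/v1/nodes/kubernetes1001.eqiad.wmnet/proxy/configz' \
  | jq '.kubeletconfig | {systemReserved, kubeReserved}'
# A reservation would be set via the standard kubelet options, e.g.
#   systemReserved: {cpu: 500m, memory: 2Gi}   (values are illustrative)
```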