User Details
- User Since: Oct 3 2014, 8:40 AM (493 w, 3 d)
- Roles: Administrator
- Availability: Available
- IRC Nick: akosiaris
- LDAP User: Alexandros Kosiaris
- MediaWiki User: AKosiaris (WMF) [ Global Accounts ]
Fri, Mar 8
Zotero is using url-downloader to access the internet. Its logs end up in logstash, e.g.
@JMeybohm is there anything left to do here? I think we can resolve.
Wed, Mar 6
Almost all parsoid hosts have been reimaged as kubernetes nodes; scandium, testreduce1002, parse1001 and parse1002 are the exceptions. The first 2 because it was requested in T357392#9546852, the other 2 because we don't want to mess with the state of parsoid-php right before the SRE summit and DC switchover. I'll reword this task a bit, then resolve it and file a follow-up task for reimaging the last 2 nodes and the related cleanups.
Tue, Mar 5
So this is a difficult one to tackle. From what I gather, images (and layers) can end up being really large, close to 10GB. I have questions regarding how a pip install ends up consuming 10GB of disk space of course, but the main issue is that this is going to cause issues down the road anyway, so it is probably unsustainable long term.
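For anyone digging into where the space goes: a quick way to spot which build step produces the huge layers is to list the image's layers by size. A minimal sketch, assuming docker is available locally; the image name is just a placeholder, not one of our registry images.

```
#!/usr/bin/env python3
# Rough sketch: list an image's layers by size to spot which build step
# (e.g. a pip install) produces the multi-GB layers.
import subprocess

IMAGE = "registry.example.org/some-app:latest"  # placeholder image name

out = subprocess.run(
    ["docker", "history", "--no-trunc",
     "--format", "{{.Size}}\t{{.CreatedBy}}", IMAGE],
    check=True, capture_output=True, text=True,
).stdout

UNITS = {"B": 1, "kB": 1e3, "MB": 1e6, "GB": 1e9}

def to_bytes(size: str) -> float:
    # docker prints sizes like "123MB" or "10.2GB"
    for unit in ("GB", "MB", "kB", "B"):
        if size.endswith(unit):
            return float(size[: -len(unit)]) * UNITS[unit]
    return 0.0

layers = [line.split("\t", 1) for line in out.splitlines() if "\t" in line]
for size, created_by in sorted(layers, key=lambda l: to_bytes(l[0]), reverse=True)[:10]:
    print(f"{size:>8}  {created_by[:100]}")
```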
We are at ~50% mw-parsoid right now.
I've added another 220 CPUs for codfw and 300 for eqiad; we should be good on this front. I'll resolve in the interest of sparing someone else from doing so; feel free to reopen.
I've accounted for the cordoned nodes and indeed...
Looking at logstash in the Kubernetes events dashboard and fiddling a bit with the filtering, I finally see
Tue, Feb 27
Migration started; we are in batch 1 for the next few days.
The LVS traffic approach was doomed to fail, since scap utilizes the same data structure to figure out which hosts to deploy to. I've re-run the numbers on the services_proxy and parsoid cluster to make sure I'm not missing anything, and it appears that indeed the only direct clients are RESTBase and monitoring/healthchecks. So the services_proxy approach should work fine. I've updated the plan in the task and I'll start executing it.
Thu, Feb 22
I mistakenly didn't annotate https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/994193 with this bug and thus it isn't linked, but that patch was the last in the series of patches to enable tracing in mwdebug pods in wikikube. I also just tested it once more and it appears to send traces just fine, so resolving this.
Wed, Feb 21
I'll resolve; this is now done.
And again, I just bumped eqiad prometheus@k8s-aux.
Tue, Feb 20
Which teams are aware of this in the WMF? I see service-runner and Security tagged on the task, but I don't know if further communication efforts have happened. Does anyone know?
Feb 16 2024
Thanks for posting the question and I hope I managed to help.
Feb 15 2024
I'll resolve this; all patches have been merged.
Feb 14 2024
I've posted changes to revert the hardcoded localhost to 127.0.0.1. I've already deployed the eventstreams and recommendation-api changes since Luca isn't currently around. The cxserver and function-orchestrator devs have been added to the other 2 changes.
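As a purely illustrative aside on why the distinction matters at all (not tied to any of the above services): "localhost" may resolve to ::1, 127.0.0.1, or both depending on the resolver, so a client using the name may try an address family the local listener isn't bound to, while 127.0.0.1 is unambiguous.

```
#!/usr/bin/env python3
# Tiny illustration: "localhost" may resolve to either or both loopback
# addresses, whereas 127.0.0.1 is unambiguous.
import socket

for name in ("localhost", "127.0.0.1"):
    addrs = socket.getaddrinfo(name, 80, proto=socket.IPPROTO_TCP)
    print(name, "->", sorted({sockaddr[0] for *_rest, sockaddr in addrs}))
```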
Feb 13 2024
I'd argue that the policy already covers this, even if it isn't scoped (on purpose) outside of kubernetes production realms.
Patches have been deployed, and simple curl tests as well as service-checker-swagger checks have passed. I double-checked the diff; envoy is now listening on both IPv6 and IPv4.
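For anyone repeating this check, a dual-stack smoke test along these lines is enough; a minimal sketch, with curl's -4/-6 flags forcing the address family, and the URL, port and path being placeholders rather than the real service endpoint.

```
#!/usr/bin/env python3
# Sketch of a dual-stack smoke test: hit the same endpoint over IPv4 and
# IPv6 explicitly and print the HTTP status for each.
import subprocess

URL = "https://some-service.example.org:4443/healthz"  # placeholder URL/port/path

for flag, label in (("-4", "IPv4"), ("-6", "IPv6")):
    status = subprocess.run(
        ["curl", flag, "-sS", "-o", "/dev/null", "-w", "%{http_code}", URL],
        check=True, capture_output=True, text=True,
    ).stdout
    print(f"{label}: HTTP {status}")
```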
Speaking with an appservers/wikikube clusters hat on, we don't see any problems with lowering the dyna.wikimedia.org TTL from 10 minutes to 5 minutes.
Feb 12 2024
Adding @rzl, @Scott_French and @Volans per the recent discussion on spicerack/cumin training/onboarding. The 15-line candidate patch is at T356661#9516327. T277677 might also have some useful information.
Thanks for the writeup. I agree with almost everything, but I think some clarification would help me figure out a few things.
Feb 9 2024
Already bumped by 200Mi in a9f958e50e5f5f4a8 (T266216). I think we'll instead need to dig a bit into why this is happening.
Feb 8 2024
I am not sure this would actually solve the problem, to be honest. It doesn't hurt to try of course, which is why I +1'ed, but looking at the following 7d graph displaying max/min/stddev/avg memory among the 9 pods in codfw paints a picture where we probably have 1 or 2 pods at high memory usage, indicating some usage pattern is causing this.
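For context, the graph is essentially the per-pod memory spread. Something like the sketch below can reproduce the same aggregates from the Prometheus HTTP API; the Prometheus URL, namespace selector and metric labels are assumptions and need adjusting to the actual setup.

```
#!/usr/bin/env python3
# Sketch: pull max/min/avg/stddev of the pods' memory working set from the
# Prometheus HTTP API. URL and selector are placeholders.
import requests

PROM = "http://prometheus.example.org:9090"                           # placeholder
SEL = '{namespace="some-service", container!="", container!="POD"}'   # placeholder

for agg in ("max", "min", "avg", "stddev"):
    query = f"{agg}(container_memory_working_set_bytes{SEL})"
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"{agg:>6}: {value / 2**20:8.1f} MiB")
```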
Marking this as public now that we are done patching.
To make sure the apt package requires the correct nodejs version, I think we have to bump this to nodejs (>= 16) and also a compatible npm version. I'll create a new 1.9.7-2 version which should use the correct node packages.
Feb 7 2024
All of the wikikube production clusters are done as well. Adding @klausman as an FYI regarding the state of the rest of the clusters and some low-quality but ready-to-use code used to do the rolling restarts.
Feb 6 2024
Let's see how I can be of help.
Staging clusters are done; the production clusters are proceeding at a rate of ~10m per node (I have on purpose a big sleep 300 after each node to allow the situation to settle).
We have 6 hosts that are for some reason still on buster and need to be reimaged:
Rolling restart started in codfw (and a repeat of the staging clusters, just to make sure my code didn't forget something yesterday).
I've started the rolling restart of all pods in eqiad. This is right now done with this magnificent piece of code (</sarcasm>)
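The snippet itself isn't reproduced here; purely as an illustration of the pattern described in these comments, it was something in the spirit of the sketch below: drain each node so its pods get recreated under the already-updated runc, uncordon it, then pause before the next one. The node list and the drain flags are assumptions.

```
#!/usr/bin/env python3
# Illustrative only: cycle through the nodes one at a time -- drain so the
# pods get recreated under the updated runc, uncordon, then pause to let
# things settle before moving on.
import subprocess
import time

NODES = ["kubernetes1005.eqiad.wmnet"]  # placeholder; the real list came from cumin

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for node in NODES:
    run("kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data")
    run("kubectl", "uncordon", node)
    time.sleep(300)  # the deliberate per-node pause mentioned above
```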
runc was updated: 1.0.0~rc93+ds1-5+deb11u2 -> 1.0.0~rc93+ds1-5+deb11u3
kubernetes[2005-2060].codfw.wmnet,kubernetes[1005-1062].eqiad.wmnet,mw[2260,2267,2291-2297,2355,2357,2366,2368,2370,2381,2395,2420-2425,2427,2429-2431,2434-2437,2440,2442-2443,2445-2451].codfw.wmnet,mw[1360-1363,1374-1383,1419,1423-1425,1439-1440,1457,1459-1466,1469-1475,1482,1486,1488,1495-1496].eqiad.wmnet (195 hosts)
Feb 5 2024
https://security-tracker.debian.org/tracker/CVE-2024-21626 got updated; we have runc updates for bullseye and buster.
Feb 2 2024
Awesome! Thanks @colewhite
Evaluation criteria and replacement grading matrix published under https://wikitech.wikimedia.org/wiki/Kubernetes/CRE/criteria
Jan 31 2024
Patches reviewed and merged; I had some follow-up patches in T266216 for specific workloads that were indeed having resource problems. I've also noticed these alerts in alertmanager during a reimage and the addition of a node, where calico-node was indeed being killed due to the liveness probe, but it hasn't happened since. We might want to tune the query to avoid these at some point, although they are transient and expected to be rare enough past June 2024 that it might not be worth it.
The 2 linked patches worked just fine. The taskmanager one is left to be +1'ed by the team, but I am willing to say that we are in a better place than we used to be. I am going to resolve this.
Jan 30 2024
Thanks. There is no rush on my side either, so +1 on waiting for the upgrade.