User Details
- User Since
- Oct 3 2014, 5:57 AM (451 w, 5 d)
- Availability
- Available
- LDAP User
- Giuseppe Lavagetto
- MediaWiki User
- GLavagetto (WMF)
Mon, May 29
Just to make clear what I did yesterday:
Sun, May 28
I suspect the poolcounter limit that is being reached is the one for expensive files:
Thu, May 25
It would be great if envoy fixed TLS 1.3 so that it works well when two envoys talk to each other; we should check whether that's been solved in the latest versions.
I think the current solution works well. Basically:
Sat, May 20
Regarding the last point: I think the best way to do this is to actually leave those hostPaths empty until someone needs to do something with the code, then allow people to run a command (see the sketch after this list) that will:
- Copy the code out of the latest mediawiki container to a path on the mw-debug kubernetes node
- Re-deploy the individual developer session, mounting the code from the hostPath
- Tell the developer where they can find their code.
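A minimal sketch of what that command could do, assuming the node runs docker and the code lives under /srv/mediawiki in the image (image name, paths and the hostPath layout are all illustrative):

```
# Hypothetical sketch; image name, paths and hostPath layout are illustrative.
# 1. Extract the code from the latest mediawiki image onto the mw-debug node:
ctr=$(docker create docker-registry.wikimedia.org/restricted/mediawiki-multiversion:latest)
docker cp "${ctr}:/srv/mediawiki" "/srv/mw-debug/${USER}"
docker rm "${ctr}"
# 2. Re-deploy the developer session with a hostPath volume pointing at that path.
# 3. Tell the developer where to find it:
echo "Your code is on the node under /srv/mw-debug/${USER}"
```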
Fri, May 19
So the steps to do what Alex proposed would be:
AFAICT this is now resolved.
An alternative idea from @akosiaris which is also very interesting:
@jnuche is this task still valid? I think it's not really an issue given how scap works now, including the pre-download of images on the k8s nodes.
@jnuche I'm closing this task as resolved, as AIUI we have been doing this for months.
Tue, May 16
I can see the following two paths to solve this:
After discussion with @MoritzMuehlenhoff: it makes sense that snapshots might be broken right now, while things change hectically for a testing distro before the freeze. We should pick this task up after June 3rd, when the freeze will be enacted.
It's much simpler than that: debuerreotype would work correctly; the problem is that the snapshots are full of broken links for bookworm. Again:
While I've added bookworm to the build process, I think I'll revert that part of my change.
Fri, May 12
Videoscaling has slowed down, so the current situation is that we have:
My idea for implementing this is as follows (a sketch of the benthos configuration follows this list):
- Create a benthos container
- Add a release containing a Deployment with N replicas (is this the best solution?) to the namespace of the service, running benthos with an adequate configuration, one per datacenter, listening to the local queue
- Limit what we need to configure to just the prefix for the URL we get from resource-change
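A minimal sketch of what that benthos configuration could look like, assuming a Kafka input on the datacenter-local cluster; brokers, topic, consumer group and URL prefix are all illustrative:

```
# Hypothetical benthos config; brokers, topic and URL prefix are illustrative.
input:
  kafka:
    addresses: [ "localhost:9092" ]           # the datacenter-local Kafka cluster
    topics: [ "eqiad.resource-change" ]
    consumer_group: "resource-change-benthos"
output:
  http_client:
    # Only the URL prefix should need configuring per service; the rest of
    # the URL comes from the resource-change event itself.
    url: 'https://url-prefix.example${! this.meta.uri }'
    verb: GET
```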
The number of servers pooled in jobrunning was just 7, and some servers (mw1458, mw1461, mw1466, mw1495) were depooled from jobrunning but receiving no traffic. This is yet another case where we should probably reduce the number of connections reused by envoy: especially for videoscaling, it means basically no load-balancing from LVS, because the connections between changeprop and the backends live "forever".
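A hedged sketch of the kind of envoy cluster setting I mean (cluster name and values are illustrative):

```
# Hypothetical envoy cluster fragment; name and values are illustrative.
clusters:
  - name: videoscaler
    common_http_protocol_options:
      max_requests_per_connection: 1000   # recycle connections periodically
      max_connection_duration: 300s       # so LVS gets a chance to rebalance
```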
Thu, May 11
Our previous assumption that this was only happening in codfw (as far as our ability to reproduce the bug went) has just proved wrong: I randomly got a page rendered with V22 while having V10 set in my preferences, for one of the reported pages.
I would also add that purging pages shouldn't help, unless we broke something fundamental in how parsercache works.
Wed, May 10
This is solved thanks to @Legoktm's patch.
So, in order to get this working, we would need to install rsvg-convert in the base mediawiki image in production-images, and then use that in building our mw-on-k8s images.
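A minimal sketch of that change, assuming a Debian-based base image (librsvg2-bin is the Debian package that ships rsvg-convert):

```
# Hypothetical Dockerfile fragment for the base mediawiki image in production-images.
RUN apt-get update && \
    apt-get install -y --no-install-recommends librsvg2-bin && \
    rm -rf /var/lib/apt/lists/*
```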
For reference, the code in question is in ResourceLoader\Image::rasterize:
Indeed. The ResourceLoader code wants to use rsvg-convert, and do it without going through Shellbox; I guess for performance reasons.
Mon, May 8
In order to do this properly, IMHO we need to do as follows (a sketch of the taint setup follows this list):
- Pick a k8s node or, even better, reimage one appserver to act as an additional k8s node, with specific node taints so that no "normal" pod can be scheduled on it
- Add a deployment of mw-on-k8s tolerating those taints, with as many replicas as we can fit on that node; remember to also allow plain HTTP connections, as ab doesn't work well with TLS
- Use benchmw against this node (on the port you chose for HTTP) and against a depooled appserver of the same hardware generation
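A minimal sketch of the taint/toleration part (node name, taint key and values are hypothetical):

```
# Taint the dedicated node so nothing else gets scheduled there:
#   kubectl taint nodes kubernetes1023.eqiad.wmnet dedicated=benchmark:NoSchedule
# Then, in the benchmark Deployment's pod spec:
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "benchmark"
    effect: "NoSchedule"
nodeSelector:
  kubernetes.io/hostname: kubernetes1023.eqiad.wmnet
```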
Fri, May 5
The parsoid servers (reachable via localhost:6002 on any appserver, or on any server running the service mesh proxy) can reply to requests for any URL that mediawiki would respond to on an api or appserver. So your code could just call the api URL, connecting to port 6002 on localhost (inside mediawiki-config we might need to add a configuration entry for that).
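For example (a hedged sketch; the Host header and query are illustrative):

```
# Call the action API via the local mesh proxy; the Host header picks the wiki.
curl -s -H 'Host: en.wikipedia.org' \
  'http://localhost:6002/w/api.php?action=query&meta=siteinfo&format=json'
```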
Thu, May 4
I'm frankly not sure how checking appserver.svc.eqiad.wmnet:9090 from an appserver would work: that IP resolves locally to the loopback interface on any appserver. We'd need to pick another internal IP.
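A quick way to see this from any appserver (output is illustrative):

```
host appserver.svc.eqiad.wmnet   # resolves to the LVS service IP
ip -br addr show dev lo          # the same IP is configured on the loopback
```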
Tue, May 2
We also need to add wikifunctions to our internal certs.
Apr 18 2023
Another thing to consider: given there are known XSS vectors in vega 2, this might be the first time we get a report, but not the first time this has been found. We should probably get someone to check all past revisions of all pages containing graphs for suspicious-looking patterns.
Mar 30 2023
A starting point for this investigation can be which services currently call restbase:
Mar 29 2023
All stale nodes have been updated.
Adding @hnowlan as FYI so that we are careful about this in the future.
Very simply, those 3 servers are not in the targets file for deployment.
Mar 28 2023
I see a few ways to enable this job on all wikis, but fundamentally the procedure that I think would make sense is as follows:
Sorry, I'm getting confused; to my understanding, WDQS/search will use mediawiki.page_change, which AIUI is generated from mediawiki, not mediawiki.page_content_change.
Mar 24 2023
On further thought:
Mar 23 2023
I wouldn't consider this task done, but we took all the actions that are reasonable on the SRE side of the issue. Retagging as necessary.