User Details
- User Since
- Oct 3 2014, 4:18 AM (499 w, 2 d)
- Availability
- Available
- IRC Nick
- ori
- LDAP User
- Ori
- MediaWiki User
- ATDT [ Global Accounts ]
Sun, Apr 21
Mon, Apr 8
Hi Mark! Could you summarize the back-and-forth? What were the alternatives considered?
Feb 1 2024
In lieu of exporting a route map, MediaWiki could, as a first pass at the problem, emit a response header that signals to the CDN that a request contained garbage parameters. The CDN could use this information to throttle clients that issue too many such requests. This may be less desirable than filtering all such requests at the edge, but it is also simpler.
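A minimal sketch of the header idea, assuming a hypothetical allow-list of parameters per route and a hypothetical header name (`X-Unrecognized-Params` is made up for illustration; it is not an existing MediaWiki or CDN header):

```python
from urllib.parse import parse_qsl

# Hypothetical: the parameters this route actually understands.
KNOWN_PARAMS = {"title", "action", "oldid"}
# Hypothetical header name, chosen only for this example.
GARBAGE_HEADER = "X-Unrecognized-Params"


def garbage_params(query_string, known=KNOWN_PARAMS):
    """Return the sorted names of query parameters outside the known set."""
    names = {name for name, _ in parse_qsl(query_string, keep_blank_values=True)}
    return sorted(names - set(known))


def response_headers(query_string):
    """Emit the signal header only when the request carried unknown parameters."""
    extra = garbage_params(query_string)
    return {GARBAGE_HEADER: str(len(extra))} if extra else {}
```

The CDN side would then only need to count responses carrying the header per client and throttle above some threshold, without knowing anything about MediaWiki's routes.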
Sep 29 2023
Sep 5 2023
Aug 10 2023
It's a bug in webhint, AFAICT. It thinks stale-while-revalidate should not hold a value, but that is wrong. This is the problematic code:
Aug 6 2023
Jul 31 2023
Jul 24 2023
I also don't know how well Swift would handle 15k QPS of object metadata updates (cf T211661#8377883)
Right. Now I remember. The initial expiration is indeed supposed to be set by Thumbor. The necessary functionality had some trouble landing in the Wikimedia Thumbor plugin repo, but it has since landed.
@MatthewVernon: my understanding is that rewrite.py is currently setting expiry headers for thumbnails on retrieval from Swift -- is that correct, and does that mean some thumbnails are already getting expired?
May 8 2023
This is really confusing.
Apr 18 2023
Vega ships an optional interpreter that can evaluate graph expressions by traversing an AST and performing each operation, rather than relying on runtime code generation. Per https://github.com/vega/vega/pull/3019#issuecomment-749107902, the interpreter mode is not the default because it is 10% slower. Seems like a negligible price to me. This seems like the only sensible option for keeping support for graph expressions but rooting out XSS vectors systematically.
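To illustrate the general technique (this is not Vega's interpreter, just a sketch in Python using the standard `ast` module): an expression is parsed into a tree and evaluated node by node against an allow-list of operations, so no code is ever generated or passed to `eval`. The supported operator set and the `interpret` name are assumptions made for this example.

```python
import ast
import operator

# Allow-list of binary operators; anything else is rejected.
BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(node, scope):
    """Walk the AST and compute a value, refusing unknown node types."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body, scope)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return scope[node.id]  # only names supplied by the caller resolve
    if isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
        left = evaluate(node.left, scope)
        right = evaluate(node.right, scope)
        return BINARY_OPS[type(node.op)](left, right)
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def interpret(expression, scope):
    """Parse an expression string and evaluate it without code generation."""
    return evaluate(ast.parse(expression, mode="eval"), scope)
```

Because function calls and attribute access are not on the allow-list, an input such as `__import__('os')` is rejected at the tree-walking stage instead of being executed.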
Mar 5 2023
@phuedx I don't know, sorry.
Feb 14 2023
Does the edits graph in T327440#8542723 include bots? Bots may not be a large proportion of users but they do contribute a large proportion of edits.
Jan 13 2023
+1 to @Tgr's proposal
Jan 10 2023
It might be worth trying to contact the library's co-maintainer. His contact info is at https://eatingco.de/about/.
Jan 9 2023
Dec 23 2022
Nov 14 2022
Oct 18 2022
@Jdforrester-WMF: the Beta Cluster instance of the function-evaluator now runs under gVisor. Some additional work will be required to make the production instance of the function-evaluator run under gVisor. There is documentation here: https://gvisor.dev/docs/user_guide/quick_start/kubernetes/.
I created a new task for the alerts, T321099. Let's continue there.
Wikifunctions on the Beta Cluster uses the *.wikimedia.beta.wmflabs.org wildcard cert, and the CertAlmostExpired alert was caused by automatic certificate renewal being broken on the Beta Cluster in general. T293585 is the issue; it looks like Valentin and Giuseppe fixed it.
Oct 14 2022
@phuedx I'm not aware of anything actively using it, no, but I'm also out of the loop -- can you ask someone on the performance team to confirm?
Oct 12 2022
I've cherry-picked the two Puppet patches on the beta cluster. The mediawiki-function-evaluator service is now running under gVisor.
Oct 11 2022
Never mind, I see that it is available for Bullseye -- sorry.
@Joe the Wikifunctions Beta Cluster instance is running Bullseye -- could you also pull it in there?
Sep 28 2022
Sep 8 2022
There are no outstanding issues that are specific to the Beta Cluster environment, AFAIK.
Sep 7 2022
Sep 6 2022
@cmassaro We have some logging now, and instructions on Wikitech on how to access the logs. I think there are more places where we can add additional logging to make debugging easier, but that is better dealt with on an ongoing basis than a dedicated task.
I suspect this is fallout from the URL query sorting change (cc @ori) not invalidating the cache of history pages properly.
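For context, a rough sketch of what query sorting does, assuming a simple sort of the decoded key/value pairs (the actual CDN implementation may differ in its details). If the cache keys on the sorted form, a purge has to target the same normalized URL or it will miss the cached object.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url):
    """Return the URL with its query parameters sorted by name, then value."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))
```

With this normalization, `?title=Science&action=history` and `?action=history&title=Science` map to the same cache key.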
Sep 2 2022
That no longer looks like an error that would be specific to the Beta Cluster environment. @AAssaf-WMF, can you see if you get the same error locally?
The API Sandbox request in the task description is still failing, but the underlying error is now different:
I think you can create a patch to remove it from package.json, and we'll see if all the integration tests pass. If anything breaks after merge we can always revert easily.
Sep 1 2022
OK, it looks like the default User-Agent string sent by node-fetch is blocked by Varnish:
https://github.com/wikimedia/puppet/blob/9843300dba/modules/varnish/templates/wikimedia-frontend.vcl.erb#L716-L718
We need to set a custom user-agent string for the orchestrator.
Ok, I hacked in some debugging code to include the HTML body in the response, and it looks like the orchestrator is getting an error page with the message:
It seems that the orchestrator is getting an invalid response from the MediaWiki API:
When a page on a wiki is updated, MediaWiki sends purge requests to the CDN layer to invalidate objects in the cache. Currently, this is URL-based. So, for example, if I edit the article on 'Science' on enwiki, MediaWiki will send purge requests to Varnish for the following URLs:
We're seeing errors again.
Aug 30 2022
This is now rolled out for text frontends.
This is now complete. Many thanks to @Vgutierrez for partnering with me to get this rolled out.
Aug 28 2022
I tried setting EPP to 0 using x86_energy_perf_policy, thinking that bypassing the sysfs interface and writing directly to the MSR would make the setting sticky. Unfortunately this does not seem to be the case -- the EPP is gradually reset to 128, same as when you tried changing it via sysfs. At this point I also don't see value in further experimentation with the EPP knob and agree that performance is the way to go.
Aug 26 2022
Actually, let me not step on your toes. But if you can tolerate a short extension of this task, I would very much like to see this setting tested. I think there is a good chance it will give the same or very similar performance increase with less waste of power. Just to be fully explicit, the setting is:
Aug 22 2022
Aug 20 2022
Unfortunately, continuation-local-storage and its more modern counterpart, AsyncLocalStorage, come with a substantial performance cost, particularly for workloads with a lot of async/await calls. I don't think we can afford the performance penalty.
Aug 19 2022
@Vlad.shapik thank you, but what about the other points I raised?
Aug 18 2022
Congratulations, this is a huge win! I think we should dig deeper to see if we can get the same or similar performance benefit, but waste less power.
Thanks @roman-stolar. I think it's a mistake to combine (a) changes from (multiple?) upstream(s), (b) unmerged changes from Gerrit, and (c) your own work into a single commit, as in Icabc39dab. This destroys some useful history (for example, the authorship, review comments and discussion on Id6ec6d62c), and it makes future reconciliation with upstream code harder. It's also error-prone.
@Vlad.shapik, @roman-stolar, ping on the above :)
I'd also like to understand the deployment plan for this. Are you working with anyone on Wikimedia SRE to get this deployed? I strongly recommend deploying small, incremental updates rather than accumulating a lot of changes.
Aug 17 2022
Based on https://www.kernel.org/doc/html/v5.6/admin-guide/pm/intel_pstate.html#operation-modes the scaling behavior will be different for systems depending on whether or not hardware-managed P-states (HWP) support is available and enabled. It looks like it is not available on 56 out of 265 app servers: P32411.
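One rough way to survey this across hosts is to look for the `hwp` CPU flag in `/proc/cpuinfo`. The sketch below assumes the usual `flags : ...` line format; note that the flag only says the CPU advertises HWP, not that intel_pstate is actually running in HWP mode, so treat it as a first-pass filter. The function name is made up for this example.

```python
def has_hwp_flag(cpuinfo_text):
    """Return True if the first 'flags' line in cpuinfo output lists 'hwp'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "hwp" in value.split()
    return False
```

On a host, this could be called as `has_hwp_flag(open("/proc/cpuinfo").read())`.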