- User Since: Oct 3 2014, 4:18 AM
Sun, Mar 5
@phuedx I don't know, sorry.
Feb 14 2023
Does the edits graph in T327440#8542723 include bots? Bots may not be a large proportion of users but they do contribute a large proportion of edits.
Jan 13 2023
+1 to @Tgr's proposal
Jan 10 2023
It might be worth trying to contact the library's co-maintainer. His contact info is at https://eatingco.de/about/.
Jan 9 2023
Dec 23 2022
Nov 14 2022
Oct 18 2022
@Jdforrester-WMF: the Beta Cluster instance of the function-evaluator now runs under gVisor. Some additional work will be required to make the production instance of the function-evaluator run under gVisor. There is documentation here: https://gvisor.dev/docs/user_guide/quick_start/kubernetes/.
I created a new task for the alerts, T321099. Let's continue there.
Wikifunctions on the Beta Cluster uses the *.wikimedia.beta.wmflabs.org wildcard cert, and the CertAlmostExpired alert was caused by automatic certificate renewal being broken on the Beta Cluster in general. T293585 is the issue; it looks like Valentin and Giuseppe fixed it.
Oct 14 2022
@phuedx I'm not aware of anything actively using it, no, but I'm also out of the loop -- can you ask someone on the performance team to confirm?
Oct 12 2022
I've cherry-picked the two Puppet patches on the Beta Cluster. The mediawiki-function-evaluator service is now running under gVisor.
Oct 11 2022
Never mind, I see that it is available for Bullseye -- sorry.
@Joe the Wikifunctions Beta Cluster instance is running Bullseye -- could you also pull it in there?
Sep 28 2022
Sep 8 2022
There are no outstanding issues that are specific to the Beta Cluster environment, AFAIK.
Sep 7 2022
Sep 6 2022
@cmassaro We have some logging now, and instructions on Wikitech on how to access the logs. I think there are more places where we can add additional logging to make debugging easier, but that is better dealt with on an ongoing basis than in a dedicated task.
I suspect this is fallout from the URL query sorting change (cc @ori) not invalidating the cache of history pages properly.
Sep 2 2022
That no longer looks like an error that would be specific to the Beta Cluster environment. @AAssaf-WMF, can you see if you get the same error locally?
The API Sandbox request in the task description is still failing, but the underlying error is now different:
I think you can create a patch to remove it from package.json, and we'll see if all the integration tests pass. If anything breaks after merge we can always revert easily.
Sep 1 2022
OK, it looks like the default User-Agent string sent by node-fetch is blocked by Varnish:
We need to set a custom user-agent string for the orchestrator.
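A minimal sketch of what that could look like (the helper name, and the service name and version in the User-Agent string, are placeholders for illustration, not the orchestrator's real configuration; the Wikimedia User-Agent policy asks clients to identify themselves with contact information):

```typescript
// Hypothetical helper: merge a service-identifying User-Agent into the
// headers of an outgoing request, preserving any explicit override.
type HeaderMap = Record<string, string>;

const USER_AGENT =
  'function-orchestrator/0.1 (https://www.mediawiki.org/wiki/Wikifunctions)';

function withUserAgent(headers: HeaderMap = {}): HeaderMap {
  // Header names are case-insensitive, so check for any existing variant.
  const hasUa = Object.keys(headers).some(
    (k) => k.toLowerCase() === 'user-agent'
  );
  return hasUa ? { ...headers } : { ...headers, 'user-agent': USER_AGENT };
}
```

The orchestrator's fetch calls would then pass `{ headers: withUserAgent(existingHeaders) }` so Varnish sees an identifying string instead of the node-fetch default.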
Ok, I hacked in some debugging code to include the HTML body in the response, and it looks like the orchestrator is getting an error page with the message:
It seems that the orchestrator is getting an invalid response from the MediaWiki API:
When a page on a wiki is updated, MediaWiki sends purge requests to the CDN layer to invalidate objects in the cache. Currently, this is URL-based. So, for example, if I edit the article on 'Science' on enwiki, MediaWiki will send purge requests to Varnish for the following URLs:
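By way of illustration only (this is a sketch of the idea, not MediaWiki's actual purge code; the real set of purged URLs is configurable and larger):

```typescript
// Hypothetical sketch of URL-based purging: derive, from a page title, URLs
// whose cached copies must be invalidated after an edit. The two URLs below
// (desktop and mobile page views) are illustrative examples.
function purgeUrls(title: string): string[] {
  const encoded = encodeURIComponent(title.replace(/ /g, '_'));
  return [
    `https://en.wikipedia.org/wiki/${encoded}`,
    `https://en.m.wikipedia.org/wiki/${encoded}`,
  ];
}
```

The limitation this implies is the one in the comment above: anything cached under a URL that is not in the computed set (for example, a history page reachable via a differently ordered query string) will not be invalidated.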
We're seeing errors again.
Aug 30 2022
This is now rolled out for text frontends.
This is now complete. Many thanks to @Vgutierrez for partnering with me to get this rolled out.
Aug 28 2022
I tried setting EPP to 0 using x86_energy_perf_policy, thinking that bypassing the sysfs interface and writing directly to the MSR would make the setting sticky. Unfortunately this does not seem to be the case -- the EPP is gradually reset to 128, same as when you tried changing it via sysfs. At this point I also don't see value in further experimentation with the EPP knob and agree that performance is the way to go.
Aug 26 2022
Actually, let me not step on your toes. But if you can tolerate a short extension of this task, I would very much like to see this setting tested. I think there is a good chance it will give the same or very similar performance increase with less waste of power. Just to be fully explicit, the setting is:
Aug 22 2022
Aug 20 2022
Unfortunately, continuation-local-storage and its more modern counterpart, AsyncLocalStorage, come with a substantial performance cost, particularly for workloads with many async/await calls. I don't think we can afford the performance penalty.
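For context, a minimal sketch of the pattern being weighed here (the names `requestContext`, `handle`, and `deeplyNested` are illustrative, not from the codebase): Node's AsyncLocalStorage carries a per-request value across async/await boundaries without threading it through every call signature, and the cost noted above comes from the async-hooks bookkeeping this requires.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// One store per kind of context; here, a request ID.
const requestContext = new AsyncLocalStorage<{ requestId: string }>();

async function deeplyNested(): Promise<string> {
  await Promise.resolve(); // cross an async boundary
  // The store set by run() is still visible here.
  return requestContext.getStore()?.requestId ?? 'unknown';
}

function handle(requestId: string): Promise<string> {
  // Everything awaited inside run() sees this store.
  return requestContext.run({ requestId }, deeplyNested);
}
```

The convenience is real, but every async resource created inside `run()` has to capture and restore the context, which is where the overhead on async-heavy workloads comes from.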
Aug 19 2022
@Vlad.shapik thank you, but what about the other points I raised?
Aug 18 2022
Congratulations, this is a huge win! I think we should dig deeper to see if we can get the same or similar performance benefit, but waste less power.
Thanks @roman-stolar. I think it's a mistake to combine (a) changes from (multiple?) upstream(s), (b) unmerged changes from Gerrit, and (c) your own work into a single commit, as in Icabc39dab. This destroys some useful history (for example, the authorship, review comments and discussion on Id6ec6d62c), and it makes future reconciliation with upstream code harder. It's also error-prone.
@Vlad.shapik, @roman-stolar, ping on the above :)
I'd also like to understand the deployment plan for this. Are you working with anyone on Wikimedia SRE to get this deployed? I strongly recommend deploying small, incremental updates rather than accumulating a lot of changes.
Aug 17 2022
Based on https://www.kernel.org/doc/html/v5.6/admin-guide/pm/intel_pstate.html#operation-modes the scaling behavior will be different for systems depending on whether or not hardware-managed P-states (HWP) support is available and enabled. It looks like it is not available on 56 out of 265 app servers: P32411.
I propose making this change on all eqiad appservers in soft state, with cumin. Our latency metrics are noisy so changing it everywhere at once will give us the best chance of measuring a benefit.
Follow-up items to get the Puppet repo on deployment-puppetmaster04 in good shape:
- The two cherry-picked reverts should be removed (https://gerrit.wikimedia.org/r/c/operations/puppet/+/823638, https://gerrit.wikimedia.org/r/c/operations/puppet/+/823639) and the changes they revert should be updated to not be incompatible with the Beta Cluster.
- https://gerrit.wikimedia.org/r/c/operations/puppet/+/668701 needs to be rebased and merged or re-cherry-picked on deployment-puppetmaster04. To get it to apply I had to manually resolve a conflict, and I'm not sure I did it correctly. So the actual diff on deployment-puppetmaster04 is not consistent with what's on Gerrit.
Aug 16 2022
The Puppet repo on deployment-puppetmaster04:/var/lib/git/operations/puppet is in MERGING state. There's an unresolved conflict in modules/profile/manifests/etcd/v3.pp. The conflict is between the upstream change I04aa7729e and a local patch, Iecfc26a94, which has been cherry-picked locally for the past year but never merged upstream.
Hi, I'm trying to understand https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/800170/ a bit better.
- What parts of this change are coming from upstream and what parts are new?
- How did https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/489022/6 end up getting included, when it has not (AFAICT) been merged?
- Did the new code get thoroughly reviewed? I only looked at one file, tests/integration/test_swift.py, and it looks like the change made at least some of the test code unreachable -- adding assert False to mock_get_object() does not result in a test failure.
Aug 15 2022
@maryyang are you able to take this on as part of the work on logging?
EventLogging is home-grown, and was not designed for purposes other than low-volume analytics in MySQL databases.
Aug 14 2022
Aug 12 2022
The alerts are going to #wikimedia-operations; there were 21 alerts of this form on 2022-08-11:
We got alerts about the Beta Cluster cert being close to expiry (T311457#8147086) so I again ran:
Aug 11 2022
I see this error occur every few seconds in the function-orchestrator log on deployment-docker-wikifunctions01:
I implemented option 3, and created T314868 for tracking the roll-out.