User Details
- User Since
- Oct 3 2014, 8:40 AM (478 w, 4 d)
- Roles
- Administrator
- Availability
- Available
- IRC Nick
- akosiaris
- LDAP User
- Alexandros Kosiaris
- MediaWiki User
- AKosiaris (WMF)
Yesterday
All of these (which can be grouped into just 2 categories, mw and mc) have already been deemed dangerous and out of scope per my T271142#6955077. I'll check with the team but I doubt we have any intention of devoting work to those.
Fri, Dec 1
@Volans, since dumpsdata[1001-1003].eqiad.wmnet and snapshot[1005-1010].eqiad.wmnet are no longer with serviceops, I think we can resolve this one?
This is now done.
rdb*:
have the AAAA record: rdb[2009-2010]
lack the AAAA record: rdb[1009-1012,2007-2008]
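As an aside, the bracketed host ranges used in these comments can be expanded into individual host names with a few lines of Python. This is a minimal sketch only: `expand_hosts` is a hypothetical helper written for illustration, not the tooling actually used, and it handles just a single bracket group per expression.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'prefix[a-b,c-d]suffix' into individual host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket range: treat as a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a lone number is a range of one
        width = len(start)  # preserve zero padding, if any
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


have_aaaa = expand_hosts("rdb[2009-2010]")
lack_aaaa = expand_hosts("rdb[1009-1012,2007-2008]")
print(have_aaaa)
print(lack_aaaa)
```

The same helper also works for fully qualified names such as `snapshot[1005-1010].eqiad.wmnet`, since anything after the closing bracket is kept as a suffix.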
Sep 21 2023
I've just re-enabled the filter, rejecting traffic, as we are seeing high latencies and decreased availability in the parsoid cluster.
I've just disabled the rule. It's still present, but inactive. For other SREs having to re-enable it in an emergency:
Sep 20 2023
We'll schedule a scap deploy for RESTBase, thanks @jnuche
Sep 18 2023
with overrides being configured within WMF production wiring, as opposed to provided by the software. That imho violates the separation of concerns and wouldn't scale for other MW users to know about and keep in sync across core and hundreds of extension repos, and across major version upgrades.
The range and sizes of buckets in the histogram can be defined per metric (actually per group of metrics, e.g. via a regex). We already use this a lot, e.g. in various services in WikiKube, where each service configures statsd-exporter per metric they want. What is not possible is doing this on demand, i.e. letting the producer change the buckets on the fly without shipping a config change.
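To illustrate the idea of per-metric-group buckets selected by regex, here is a minimal Python sketch. It is not statsd-exporter's configuration format or implementation: the rule list, the metric names and the bucket values are all hypothetical. The point it shows is that the producer only sends a value, while the bucket layout comes from static rules that change only when the config is redeployed.

```python
import re
from bisect import bisect_left

# Hypothetical static rules: (metric name regex, bucket upper bounds in seconds).
# First match wins; the last rule acts as a catch-all default.
BUCKET_RULES = [
    (re.compile(r"^service_a\..*\.latency$"), [0.1, 0.5, 1.0, 5.0]),
    (re.compile(r"^service_b\."), [0.01, 0.05, 0.25]),
    (re.compile(r".*"), [0.005, 0.05, 0.5, 5.0]),
]


def buckets_for(metric: str) -> list[float]:
    """Return the bucket bounds configured for a metric name."""
    for pattern, bounds in BUCKET_RULES:
        if pattern.search(metric):
            return bounds
    raise LookupError(metric)  # unreachable while the catch-all rule exists


def bucket_index(metric: str, value: float) -> int:
    """Index of the first bucket whose upper bound is >= value.

    An index equal to len(bounds) means the implicit +Inf bucket.
    """
    return bisect_left(buckets_for(metric), value)


print(buckets_for("service_a.GET.latency"))
print(bucket_index("service_a.GET.latency", 0.5))
print(bucket_index("service_a.GET.latency", 7.0))
```

Nothing in `bucket_index` lets the caller pass its own bounds, which mirrors the constraint described above: changing the layout means editing `BUCKET_RULES` and shipping that change.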
I'll admit I am a bit stumped here. This is clearly not the CDN's fault as RESTBase exhibits the same behavior while also violating what it advertises as the documentation of the API.
Sep 15 2023
https://donate.wikipedia.org/.well-known/apple-developer-merchantid-domain-association now, in my checks, returns the contents of the file in this task. It might take a bit more (up to 30 minutes) to propagate everywhere. I am resolving this task, feel free to reopen to report issues.
I am gonna add one more data point. In all of these errors, the data.servedby stanza refers to an *eqiad* API server. I looked a bit at the distribution of those API servers to see if there is any pattern that would identify one or more specific ones; (thankfully?) that is not the case. Apparently any eqiad API server can appear in this dataset.
Sep 14 2023
Should we resolve this?
Sep 13 2023
I can sense the frustration pretty clearly and I appreciate the effort to illustrate it via this story, avoiding lashing out. As a data point, users aren't the only ones frustrated with the situation. Engineers (developers, software engineers, SREs) are frustrated (and have been for a long time) too.
Sep 8 2023
Sep 7 2023
Yes, it is safe, we haven't put those in production yet.
ICU67 images, built and pushed.
Editing the article will indeed issue cache purge events for both the CDN and RESTBase.
Given this isn't urgent and we have multiple ways of dealing with this, I've re-enabled puppet and cadvisor has been started again. Sure enough, the latency has increased again.
There are a few more actionables here:
This isn't present in conf1* hosts, despite also running cadvisor and the same exact version, presumably because of a different kernel version. conf1* hosts are bullseye and conf2* hosts are buster. 5.10.0-15-amd64 vs 4.19.0-20-amd64
cadvisor is to blame. Adding @fgiunchedi for his information and a thumbs up on disabling cadvisor on conf2* until we can bump their kernel version.
I can reproduce internally, so this has nothing to do with the CDN. It looks more like a RESTBase or a PCS service issue.
Sep 6 2023
So, ethtool -G eno1 rx 1000 apparently did the trick
Sep 5 2023
Adding a data point that just crossed my mind, just to rule it out.
Sep 4 2023
I've uploaded changes for icu67 php7.4 images for use with a shellbox deployment. I'll also create a temporary shellbox deployment based on those.
I've merged and deployed the change mentioned in T344747, alongside the dependent changes. The curl call listed in the Mathoid page successfully returns an SVG, and openapi checks are apparently also fine. I'll resolve this one but I'll note that for
I think that the subject is misleading. Per https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/33520a1a4409a9b0cef71a0b4baba148f64f2d40 (gerrit change is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/809194) the deployed mathoid version isn't 2023-02-21 but 2022-06-28-144716-production (which is even older).
Sep 1 2023
@Ladsgroup @Marostegui, dependent T340843 is now resolved, you can proceed with the decom process. Thanks and sorry for waiting so long.
ipoid and toolhub done today. Resolving this. We've chosen a path forward, implemented it and migrated the services that utilized hardcoded dbproxies networking rules to the new thing.
Aug 30 2023
linkrecommendation done today.
Aug 29 2023
Those numbers are for summary quantiles, not histograms buckets. Summaries aren't aggregatable and in almost all cases regarding multi-instance (e.g. multiple servers, multiple pods) metrics you don't want to deal with them. Almost all efforts to run aggregation queries over them will result in wrong results.
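A toy example makes the aggregation problem concrete. The sketch below uses made-up numbers (two hypothetical instances, hypothetical bucket bounds) and a simplified quantile lookup that returns only the bucket's upper bound; real query engines typically also interpolate within the bucket. It compares the true median across both instances with the average of the per-instance medians (what combining summaries amounts to) and with the median bucket found after summing histogram counts.

```python
from bisect import bisect_left
from statistics import median

BOUNDS = [0.25, 0.5, 1.0, 2.5]  # hypothetical bucket upper bounds, in seconds


def to_histogram(samples):
    """Count samples per bucket; the last slot is the implicit +Inf bucket."""
    counts = [0] * (len(BOUNDS) + 1)
    for sample in samples:
        counts[bisect_left(BOUNDS, sample)] += 1
    return counts


def merge(h1, h2):
    """Histograms with identical bounds aggregate by adding counts."""
    return [a + b for a, b in zip(h1, h2)]


def quantile_bucket(counts, q):
    """Upper bound of the bucket that contains the q-quantile."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(BOUNDS + [float("inf")], counts):
        running += count
        if running >= target:
            return bound
    return float("inf")


# Made-up traffic: a busy fast instance and a quiet slow one.
instance_a = [0.1] * 90
instance_b = [2.0] * 10

true_median = median(instance_a + instance_b)
averaged_summaries = (median(instance_a) + median(instance_b)) / 2
merged = merge(to_histogram(instance_a), to_histogram(instance_b))

print("true median:", true_median)
print("average of per-instance medians:", averaged_summaries)
print("median bucket from merged histogram:", quantile_bucket(merged, 0.5))
```

Averaging the two medians lands far from the true value because it ignores how many requests each instance served, whereas the merged bucket counts keep that weighting and place the median in the lowest bucket.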
Patch was merged and deployed for cxserver. Things went ok after a couple of road bumps and corresponding brown paper bag fixes. I'll be deploying the changes to ipoid, toolhub and linkrecommendation in the next few days and removing the hardcoded references to dbproxies from those deployments.
Fix merged and deployed. Some hiccups aside, it works fine across all 3 environments (staging, production eqiad, production codfw). I'll resolve this one.
Aug 28 2023
Aug 22 2023
Jul 27 2023
Jul 26 2023
Increase the threshold of the alert from 1s to 2s (or 1.5) as I'm not aware of any issues arising from this
I am gonna close this as declined. While we do have the ability to block requests based on user-agent, we don't do that on request.
I've gone ahead and populated the Saturation panels. Traffic, Errors and Latencies will need more work, but I will not be able to help with that anytime soon.
I've gone ahead and created https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions?orgId=1
php7.4-fpm-multiversion-base rebuilt as well, should make it out to mw-on-k8s in the next deployments. I think we can resolve this now. Feel free to reopen.
Jul 25 2023
The apparmor changes have been merged. I think the goal of this task is done. I'll resolve, but feel free to reopen.
akosiaris@kubernetes1007:~$ sudo apparmor_status
apparmor module is loaded.
10 profiles are loaded.
10 profiles are in enforce mode.
   /usr/bin/man
   docker-default
   lsb_release
   man_filter
   man_groff
   nvidia_modprobe
   nvidia_modprobe//kmod
   tcpdump
   wikifunctions-evaluator
   wikifunctions-orchestrator
<snip>
Jul 24 2023
This apparently stopped happening yesterday, the 23rd of July, at ~9:00am
OK, scheduling for tomorrow then, https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_July_25.
Jul 19 2023
@RobH mw hosts are 3 api servers and 3 appservers. You can do them anytime. Also, all it requires is a downtime and a poweroff per the description.
Jul 18 2023
Just found something more important than User-Agent unfortunately
This is apparently due to transclusions per https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1689418150653&to=1689449568835&viewPanel=27
Jul 17 2023
Jul 12 2023
Looking quickly at mw-canaries and mwdebug, they all have 1.13.0-1+wmf1+buster1
Jul 11 2023
Can I just say that this is pretty awesome? Especially the max latencies for kafka are pretty telling. Keep up the good work on this one!
Jul 10 2023
Wikiwand/0.1 (https://www.wikiwand.com; admin@wikiwand.com) added to the list of user-agents. Please advise if it doesn't work, otherwise please resolve.
@Dzahn, judging from the content of the task, this is for Infrastructure-Foundations, not serviceops. Retagging.
Jul 7 2023
Just to add my 2 cents as a generic observation.
PCC at https://puppet-compiler.wmflabs.org/output/936062/42341/ says 0 diff for alert hosts. lvs hosts see a comment change in configuration, and there the .discovery.wmnet approach is arguably more informative anyway. The alerting part of this task doesn't apply anymore, so I am gonna resolve this.